“Observability is like TDD for production. Never accept a pull request if you don’t know how to identify if it is working. Thanks @mipsytipsy” - Adriano Bastos
In this post: if developers can learn to love testing, then they can learn to love user happiness in production. Observability helps achieve this. Git provides a source of truth for the desired state of the system, and Observability provides a source of truth for the actual production state of the running system. In GitOps we use both to manage our applications.
In the old days, a developer would make a thing, a tester would work out how to build and test it, and an operator would work out how to run, manage and monitor it. These different roles are converging into one team with one delivery cycle. This is caused by business desire for velocity and agility. Almost every business has to make money from services that are powered by tech. Competition and profits depend on how fast the business can communicate user requirements to tech teams, and how fast tech teams can act on that. In this world there is simply no room for application development teams that cannot own end-to-end change.
And that means software engineers need to learn to own their apps and be aware of and ideally involved in making them operable. Observability makes this intrinsic to the developer workflow.
Developers <3 Operations?
Observability is hot right now - perhaps because developers are picking up on it as a way to think about ops. Credit goes to Charity Majors, the inimitable @mipsytipsy on Twitter and in the blogs, and to Cindy Sridharan, who as @copyconstruct on Medium is narrating the interplay between observability and cloud native.
Observability is a property of systems - like Availability and Scalability. Developers care about making their applications observable so they can be in charge of monitoring their app’s behavior and impact on their app’s users.
- A system is Observable if developers can understand its current state from the outside. Observability is a holistic and contextual approach to understanding system health.
- Monitoring, Tracing & Logging are techniques for baseline observations: measurements like error rate, request latency, queries/sec, i.e. symptoms of operational wellness.
Think of developers like doctors - using a common language and set of techniques to validate health, measure symptoms and diagnose and remedy problems of all kinds. Such developers will deliver the fastest response and fix when things do go wrong. And businesses need that.
Developers naturally gravitate towards Git based workflows. In previous posts I spoke about how GitOps ties infrastructure management to Git PRs if we use tools like Kubernetes. Let’s now bring GitOps and Observability together so that each helps the other. Can developers use Git as a source of truth for monitoring, logging, tracing and more? And vice versa?
Observability Is Another Source of Truth
In GitOps we use Git as a source of truth for the desired state of the system. For example, if we blow up, we can roll back to that state. And Observability is a source of truth for the actual running state of the system right now. We observe the running system in order to understand it and control it. Here is a picture showing the flow.
Observations Are Answers To Questions About Releases
Our Weave Cloud dev team have run Kubernetes stacks in anger for several years. Overall they found that while Kubernetes is basically this magic platform that manages infrastructure plumbing for you, there is an accompanying loss of visibility. Daily issues include:
- Did my deploy work? Is my system now in the desired state and can I go home now?
- How is my system different from before? Can I use Git or our system history to check?
- Did my change improve the overall UX? (as opposed to system correctness)
- I can’t find my new service in my metrics dashboard (eg RED metrics)
- Was this glitch associated with my last service update event, or something earlier?
The temptation is to record as many observations as possible in one uber-dashboard of all the things plus database & chatops gumf. This usually leads to cognitive overload, which is why people look for solutions that hide noise. That means creating new kinds of dashboard (eg Matt’s team at Lyft) and new context based incident investigation tools (eg Charity’s team at Honeycomb).
Sidebar: Not long after 2011, companies like Twitter set up “Observability” teams to deal with the increased complexity of shifting to distributed services. See this post by Cory Watson from 2013 which tells how Twitter expanded from pure monitoring to a holistic approach to service & system health and user happiness; also these sequels from the team part 1 & part 2 & slides.
Diffs Check Observed vs Desired State
The Weave team use Diff tools to help run our SaaS product. Diffs are useful in two ways:
- Validating that our system's currently observed production state corresponds to the desired state in Git, e.g. did my last release do what I'd hoped?
- And instantly alerting us if it does not - we mostly do this via Prometheus & Slack - and informing us of the nature of the divergence
Diffs are your baseline essential GitOps tools: they compare our sources of truth, i.e. the current observed state against the desired state in Git. Diffs do not require custom creation or coding. And diffs “just work” because our system is defined as a set of statements - the so-called “declarative infrastructure”. Use them as much as you can!
- Kubediff - For example the desired Kubernetes state might be “there are 4 redis servers”. Kubediff checks the cluster periodically and alerts if the number changes from 4. In general terms, Kubediff turns yaml files into queries on running state.
- Terradiff - Terraform has a “plan” mode, which shows us if our configuration matches reality – we periodically execute this and export the exit code to Prometheus, from where we monitor and alert on it. Our Terraform k8s set up is on GitHub.
- Ansiblediff is a Kubernetes deployment whose containers periodically run Ansible check mode and notify Prometheus if something has changed in our Ansible k8s installations.
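To make the idea concrete, here is a minimal Python sketch of what a diff tool does in principle. It is not how Kubediff, Terradiff or Ansiblediff are implemented: the function names, the flat key/value view of state and the metric name `config_drift` are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def diff_state(desired: Dict[str, Any], observed: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, observed_value)} for every key that differs.

    A key missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(desired.keys() | observed.keys()):
        want = desired.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def drift_metric(name: str, drift: Dict[str, Tuple[Any, Any]]) -> str:
    """Render one line of Prometheus text format: 1 if there is drift, else 0."""
    return f"{name} {1 if drift else 0}"


# Hypothetical example: Git says 4 redis servers, the cluster is running 3.
desired = {"redis/replicas": 4, "redis/image": "redis:4.0"}
observed = {"redis/replicas": 3, "redis/image": "redis:4.0"}

drift = diff_state(desired, observed)
print(drift)
print(drift_metric("config_drift", drift))
```

An alert on a gauge like this tells you that something diverged, and the diff itself tells you what, which is the "nature of the divergence" mentioned above.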
Sidebar: Kubernetes also has a controller-style auto-reconciliation loop built in already. The orchestrator is capable of “observe + sync” in some cases. What we are talking about here is that there can also be a drift between state in Git and actually applied Kubernetes apiserver objects. So we are only talking about the first of the observe+sync loops in this sequence: Git state ==[observe+sync]==> Kubernetes API object state ==[observe+sync]==> actually running pods etc.
Automation Is Convergence
For each “diff”, there are “sync” tools for enforcing convergence. To the extent you can put statements in Git and use diff and sync to identify and force convergence, you can expand beyond what your Kubernetes orchestrator can do to facilitate and/or automate operations. In summary, GitOps provides fine grained Configuration Synchronization. Hence this:
“I always found "observability is the dual of controllability" to be a very profound statement.” - Bryan Boreham - https://twitter.com/bboreham/status/91672251625261...
Diff and Sync are a pair - observe a problem and take remedial action. More generally, in operations, a person makes an observation, understands that something is wrong, and decides to take remedial action. These are all examples of “controllability”, the dual of observability.
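The diff and sync pairing can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the behaviour of any particular sync tool: state is a plain dictionary of service name to replica count, and `plan_sync` and `apply_sync` are hypothetical names.

```python
from typing import Any, Dict, List, Tuple

Action = Tuple[str, str, Any]  # (operation, key, new value or None)


def plan_sync(desired: Dict[str, Any], observed: Dict[str, Any]) -> List[Action]:
    """Observe: list the actions that would move `observed` to `desired`."""
    actions: List[Action] = []
    for key in sorted(desired.keys() | observed.keys()):
        if key not in observed:
            actions.append(("create", key, desired[key]))
        elif key not in desired:
            actions.append(("delete", key, None))
        elif desired[key] != observed[key]:
            actions.append(("update", key, desired[key]))
    return actions


def apply_sync(observed: Dict[str, Any], actions: List[Action]) -> Dict[str, Any]:
    """Control: apply the planned actions and return the new state."""
    result = dict(observed)
    for op, key, value in actions:
        if op == "delete":
            result.pop(key)
        else:
            result[key] = value
    return result


# Hypothetical example: desired state from Git vs. what is actually running.
desired = {"redis": 4, "web": 2}
observed = {"redis": 3, "legacy-job": 1}

actions = plan_sync(desired, observed)
print(actions)
print(apply_sync(observed, actions) == desired)
```

Keeping the plan separate from the apply step is a deliberate choice in this sketch: a person or a policy can look at the planned actions and decide whether to converge at all.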
Sidebar: Observability and controllability were introduced into control theory by an amazing person called Rudolf Kálmán. He defined Observability rigorously for a large class of systems, and showed how every observation has an associated dual “controllability” action to change that which is observed. You can find lots of examples, e.g. GPS, military, network security, and error correcting codes. In tech, “control” is called “operations” or if you prefer, “management”.
Check out this super presentation by Coda Hale from 2011, “metrics, metrics, everywhere” that ‘splains how we should measure systems and thereby accelerate optimisation in “OODA” style decision loops. And do take a look at Adrian Cockcroft on microservices.
Operations and the ROODA loop
GitOps is a release oriented model of operations. See the diagram below. Delivery velocity depends on how fast a team can go round the stages in this cycle.
Now we can draw the GitOps cycle below as an OODA loop, “Observe, Orient, Decide and Act”. But there is a twist - I split the ‘act’ part into updates to Git, and releases that are pushed from Git. So it is a ROODA loop :-)
The hardest part of all this is how we “Orient, Decide and Act”. That is the dark art of operations. How can we bring all this into the sunlight? That’s the next big challenge.
In our Weave Cloud team we are using GitOps to make management and monitoring intrinsic to developer ways of working. What kinds of “control” questions do we run into?
- Are there times when I should not force a sync?
- How do I manage conflict with the orchestrator?
- Can I commit my changes to Git from the system, i.e. create a new checkpoint?
- If I update a service, can my metrics and dashboards get an automatic update too?
- How about multi stage rollouts and multi version staging (eg canary)?
I won’t try to answer these here. More general “what can go wrong” examples are shown below, specifically for cases where Kubernetes can’t fix everything for you by using orchestration.
|System not responding||Roll back to Git desired state||Weave Cloud has exploded|
|Check auto-generated RED Metrics and diffs for a release||A pull request initiates a service deployment||Updating Weave Cloud|
|Diff alerts + Kubernetes shows as OK||Investigate & sync if possible||A VM was shut down, Kube not able to fix on its own|
|ChatOps alerts for error rates||Check Weave Cloud UI, event and metric dashboards||Users are loading the system|
|Cascading failures without obvious cause||Full incident management mode! Ask new questions of the system.||System SNAFU|
Please note that the last case is one where you may have no idea what the root cause is or how to fix it. The nasty stuff happens plenty in complex systems. Eg: “race conditions are causing non-deterministic overflows that will lead to a cascading failure”. This is where Observability can really help - it is all about developers properly making their own app code observable and discoverable using good tools. And reading the best books, eg the SRE book and Release It!.
A Way Forward
Our goal is to help businesses accelerate delivery. We are providing Git-centric tools that unify pipelines with Observability in ways that make developers love operations. Here are some ideas for this - very much a work in progress.
Dashboards As Code
At Weaveworks, we use several observability tools: Prometheus, Weave Scope, Grafana, etc. Observability is part of the system, so we store our dashboards as code. This blog post explains how we do this. So we get all the benefits of source control, plus the ability to deploy monitoring etc. from source along with our whole stack - consistently.
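As a rough illustration of the idea, and not the approach described in that blog post, a dashboard can be generated from a small Python function and committed as JSON. Everything here is an assumption: the field names are simplified rather than a full Grafana schema, and the metric names in the queries are hypothetical.

```python
import json


def service_dashboard(service: str) -> dict:
    """Build a dashboard definition for one service (illustrative fields only)."""
    selector = f'{{service="{service}"}}'
    return {
        "title": f"{service} overview",
        "panels": [
            {
                "title": "Request rate",
                # Hypothetical metric name; substitute whatever your app exports.
                "expr": f"sum(rate(http_requests_total{selector}[1m]))",
            },
            {
                "title": "Error rate",
                "expr": f"sum(rate(http_request_errors_total{selector}[1m]))",
            },
        ],
    }


def render(dashboard: dict) -> str:
    """Serialise with sorted keys so the same input always gives the same file."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


print(render(service_dashboard("auth")))
```

Stable output matters more than it looks: if the rendered file only changes when the dashboard really changes, a pull request diff shows exactly what was modified.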
Focused Actionable Dashboards
A huge problem with all monitoring, logging and tracing is the sheer deluge of data and noise. Cognitive overload is one outcome as are excessive storage and service costs. Focus is needed. In OODA loops the Orientation step is how we draw selected information and actionable alerts from this picture.
One way to provide focus is to pivot away from functions - “monitor and log all the things” and turn to “show me the key facts about my service and help me improve it in place”.
Weave Cloud integrates different pieces of the GitOps lifecycle into one service-centric tool. This means we can auto-generate service metrics, correlate them with deployment events and histories, and analyse current vs. past performance. This is a good approach for creating developer centric tooling for managing staged rollouts, canaries and so on, eg. when a customer is using Envoy, Linkerd, or Istio. Overall this helps make Observability dashboards more holistic and joined up, pulling together the main pillars of monitoring, logging and tracing.
Sidebar: People talk about AppOps - how app dev is taking on *some* operations at the application layer, while “platform teams” take on infrastructure ops. Here’s Bridget Kromhout tweeting Bryan Liles’ illustration of this, unifying CI, CD, Logs, Metrics and Error Handling.
Design For Observability, Tool For Controllability
Delivery of software should not be considered “done” until that software is Observable. So make monitoring and management part of your app dev process and not an afterthought. Developers should bake in application monitoring at the start of the design. In production you will need to correlate system observations with the platform and pipeline too.
When choosing tools be aware that:
- Using a cloud native monitoring tool like Prometheus can help, but will be more useful if you can cross reference results with other tools (eg. diffs)
- Logging services can be expensive - take care when you mix logging with monitoring
- Many metrics can be auto-generated by the platform eg RED Metrics
- Some new tools, eg. service mesh, provide systematic observability and control for certain kinds of distributed application. Matt Klein sets out the reasons in this fine post.
- None of this will help you go any faster if you cannot interact with the system to test and mitigate problems at the same time. For example: to troubleshoot, poke the system to ask questions, try out patches, and roll out updates.
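For readers new to RED metrics (Rate, Errors, Duration), here is a small Python sketch of how they can be derived from a batch of request records. It is only an illustration: the `Request` type, the choice of counting 5xx statuses as errors, and the use of mean and max rather than percentiles are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration_s: float  # time taken to serve the request, in seconds


def red_metrics(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors and Duration for requests seen in one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "mean_duration_s": sum(durations) / total if total else 0.0,
        "max_duration_s": max(durations) if durations else 0.0,
    }


# Hypothetical sample: four requests observed in a two second window.
sample = [
    Request(200, 0.25),
    Request(200, 0.5),
    Request(500, 1.0),
    Request(200, 0.25),
]
print(red_metrics(sample, window_s=2.0))
```

Because the inputs are just status codes and timings, a platform or service mesh that sees every request can produce these numbers without any changes to application code.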
Observability can be seen as part of the Continuous Delivery cycle for Kubernetes. In GitOps 1, I described how we can use Git as a “source of truth” for all the desired initial state of the system, and in GitOps 2, I showed an orchestrated release pipeline that enables eg. rollbacks and snapshots. In this post we added Observability as a source of truth for actual current state and operations. Observed state must be compared with the desired state in Git.
The role of a GitOps dashboard is to enable observation and speed up understanding and validation of the system, and suggest mitigating actions. This speeds up the operations cycle seen as a ROODA loop. Developers: make sure your system is Observable. Then observe it & build alerts off that. Monitoring alone does not answer all questions: metrics are symptoms but not the disease.
Want more like this? I presented a complete version of the GitOps story at Cloud Native London - see the video and slides. Also Sandeep and Jordan did a splendid video tutorial on Kubernetes Best Practices and Lessons in our Weave Online User Group.
Next time: control policies.