In a recent episode of Software Engineering Daily, Alexis Richardson spoke with Jeffrey Meyerson about GitOps. Here is a short excerpt from that interview.

When did convergence start to happen around the ideas that became GitOps?

GitOps is centered around the idea that you manage a cluster by comparing its desired state, expressed declaratively (for example, as a set of declarative config files in the case of Kubernetes), with its observed state, which is your best available picture of the actual system.

There are four different things that are true at any given time: the desired state, the real state, the observed state and what's in the head of the operator. 

We can never truly know the actual state. We only know what we can observe through observability tools, such as monitoring and alerts. With those tools, we have a way of building up a picture of what might be true about our actual system. We have the observed state and our own set of beliefs, and then we have the desired state, which is the source of truth written down in our software repository, usually git. When we see a deviation among those four things, we know there is a reason to check what's wrong and converge the desired and observed states.

Somebody can make a pull request that gets merged and, as a result, updates the desired state. This means that the cluster observers become aware that they need to do a deployment, since the observed state says, “Hey, you haven't deployed this thing yet. The desired state has been updated, but the cluster hasn't, so let's go and do that.”

You may also have system drift where the system drifts away from its desired state and then the observers in the Kubernetes cluster can pick that up and go, “Hey, wait a minute. We seem to be different from what we intended to be.” 
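The comparison described above, covering both a merged pull request and system drift, can be sketched as a diff between two snapshots. This is a minimal illustration, assuming each state is a simple mapping from a service name to an image tag; `Change`, `diff_states`, and the sample services are hypothetical names, not part of Kubernetes or Weave Cloud.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    name: str
    action: str  # "create", "update", or "delete"
    desired: Optional[str]
    observed: Optional[str]


def diff_states(desired: dict[str, str], observed: dict[str, str]) -> list[Change]:
    """Return the changes needed to make the observed state match the desired state."""
    changes = []
    for name in sorted(desired.keys() | observed.keys()):
        want = desired.get(name)
        have = observed.get(name)
        if want == have:
            continue  # already converged for this item
        if have is None:
            action = "create"
        elif want is None:
            action = "delete"
        else:
            action = "update"
        changes.append(Change(name, action, want, have))
    return changes


# Desired state: what is committed to git (hypothetical service -> image tag).
desired = {"frontend": "v2", "billing": "v7"}
# Observed state: what the observability tooling reports from the cluster.
observed = {"frontend": "v1", "billing": "v7", "debug-shell": "v1"}

for change in diff_states(desired, observed):
    print(f"{change.action}: {change.name} ({change.observed} -> {change.desired})")
```

An empty result means the two states agree. A non-empty result is either a pending deployment (the desired state moved) or drift (the observed state moved), and in both cases the remedy is to converge the cluster on what is in git.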

You can imagine that, as time goes by, you can describe a richer set of things and create rules such as “alert me if this threshold is passed more than three times in five minutes on this metric.” Policy statements like this could then be something that you observe and compare against.

You can, for example, trigger a diff alert when the real system that you're observing drifts away from that observable rule.
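The rule quoted above ("more than three times in five minutes") could be expressed as a sliding-window check. This is a rough sketch rather than the syntax of any particular monitoring tool; `ThresholdRule` and its parameters are hypothetical, and samples are assumed to arrive in time order with timestamps in seconds.

```python
from collections import deque


class ThresholdRule:
    """Fire when a metric exceeds `threshold` more than `max_breaches`
    times within the last `window_seconds`."""

    def __init__(self, threshold: float, max_breaches: int = 3,
                 window_seconds: float = 300.0):
        self.threshold = threshold
        self.max_breaches = max_breaches
        self.window_seconds = window_seconds
        self._breaches = deque()  # timestamps of recent breaches, oldest first

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample and return True if the rule fires."""
        # Drop breaches that have fallen out of the window.
        while self._breaches and timestamp - self._breaches[0] >= self.window_seconds:
            self._breaches.popleft()
        if value > self.threshold:
            self._breaches.append(timestamp)
        return len(self._breaches) > self.max_breaches


rule = ThresholdRule(threshold=0.9)
for t in (0, 60, 120, 180):
    if rule.observe(t, 0.95):
        print(f"alert at t={t}s: threshold passed more than 3 times in 5 minutes")
```

Because the rule is plain data plus a small evaluation step, it could be stored in the repository alongside the rest of the desired state and compared against what is observed.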

GitOps is a very simple iteration of four things coming together: DevOps, infrastructure as code, orchestration and observability. These are the ingredients that you need to do GitOps.

How does GitOps differ from the model of pushing to git and having a deployment tool spin up the infrastructure? 

Manually spinning up infrastructure is something you generally don't want to do because it takes time. If you're doing 1,000 deployments a day on 100 clusters, you don't want to be spinning up 100,000 clusters in order to do that. What you'd ideally like to do is make changes to an existing environment. The main challenge is updating a development cluster or a production cluster with the necessary pieces to run the tests, do a canary in production, do a production rollout, or some other multi-stage rollout.

When you allow Kubernetes to update itself, you're not giving a CI system such as Jenkins access to the cluster. You can see how that might be important if you had fears that whoever controlled Jenkins could also control your production systems.
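The separation described above can be illustrated with a small simulation. It is only a sketch of the pull model, not how Weave Cloud or Kubernetes implement it: `ConfigRepo`, `Cluster`, and `InClusterAgent` are hypothetical stand-ins, and a real agent would read from git and call the cluster's API. The point is that the CI side only ever writes to the repository, while the only component that changes the cluster runs inside it.

```python
class ConfigRepo:
    """Stands in for the git repository that holds the desired state."""

    def __init__(self):
        self.commits = []

    def commit(self, manifest: dict) -> int:
        """Record a new desired state and return its revision number."""
        self.commits.append(dict(manifest))
        return len(self.commits) - 1


class Cluster:
    """Stands in for the running cluster."""

    def __init__(self):
        self.running = {}
        self.applied_revision = None


class InClusterAgent:
    """Runs inside the cluster and pulls the desired state from the repo."""

    def __init__(self, repo: ConfigRepo, cluster: Cluster):
        self.repo = repo
        self.cluster = cluster

    def sync(self) -> bool:
        """Apply the latest commit if it has not been applied yet."""
        if not self.repo.commits:
            return False  # nothing has been committed yet
        revision = len(self.repo.commits) - 1
        if revision == self.cluster.applied_revision:
            return False  # already up to date
        self.cluster.running = dict(self.repo.commits[revision])
        self.cluster.applied_revision = revision
        return True


repo = ConfigRepo()
cluster = Cluster()
agent = InClusterAgent(repo, cluster)

# The CI system builds and tests, then only commits to the repo.
# It holds no reference to, or credentials for, the cluster.
repo.commit({"frontend": "v2"})

# The agent, running in the cluster, notices the new revision and applies it.
if agent.sync():
    print("deployed:", cluster.running)
```

In this arrangement, compromising the CI system gives an attacker the ability to propose changes through the repository, where they are visible and reviewable, but no direct handle on production.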

What does GitOps make easier in terms of deployments, and what frictions does it eliminate?

GitOps makes what happens in practice easier. Our customer Qordoba is a San Francisco-based, VC-backed startup. They have four different tech teams doing microservices. They were using GKE and Jenkins when they started working with us, before they began using Weave Cloud to adopt GitOps processes. They moved from a CI-driven deployment system to a Kubernetes-driven deployment system by moving the orchestration of deployments and releases into Kubernetes with Weave Cloud.

Previously, Qordoba had a CI-driven update system which would sometimes break and take about 20 or 30 minutes to complete a deployment, even on GKE. When they made changes, they would stop working and wait together as a team for the change to complete. Deployment would be a company-wide event, and it would only happen once or twice a week.

Weave Cloud didn't change how they use their CI system. What changed was the introduction of an element to manage controlled, automated and semi-automated updates, giving Weave Cloud the responsibility for doing the updates instead of Jenkins, which continues to do build and test as before.

Through that, Qordoba found that they could reduce the time of a deployment from minutes to seconds. They could get quick feedback on how deployments were working using some of the other tooling we provided for observing the system, the diffs and so on. They could always roll back to a previous point in time and roll forward again after that, and they found that this meant they no longer had to worry or think about deployment.

Final Thoughts

These concepts are described in a series of blog posts on GitOps. Alexis has also done online presentations on this topic, such as the Continuous Life Cycle London 2018 presentation.