When we speak to customers about deployment automation for Kubernetes, we often see an anti-pattern that is not fully understood by most of the folks who implement it. In fact, the majority of Kubernetes CI/CD tutorials out there prescribe exactly this anti-pattern.
What is this anti-pattern?
The anti-pattern is when the CI system runs the build and tests, and then deploys directly to Kubernetes. How this is done varies in sophistication, and of course, most of the time the deployment happens in the final stages of a fairly complex pipeline...
Also, the implementation details vary: some folks use kubectl set image, while others use ksonnet apply or an in-house templating solution. What people use exactly is irrelevant. Overall, the method is broken, and let me explain why.
We’ve been calling this “CIOps”, and it poses multiple challenges that I shall discuss in this post. I’ll start with an example.
CIOps - an example
Let’s say someone uses a hosted CI product (Travis CI and CircleCI are popular choices) and they want to deploy apps from CI to Kubernetes.
To do this, they have to give the CI access to the Kubernetes API. Firstly, this poses a security risk, as we wrote previously. Secondly, they have to ensure that each CI job is configured to deploy to the right cluster and that it has up-to-date credentials. Whenever they refresh those credentials or bring up a new cluster, re-configuring multiple CI jobs can be challenging, especially with many apps deployed independently. They may attempt to limit access to, e.g., a single namespace dedicated to the given app, but such credentials are even harder to manage, as there is no tool that can both manage Kubernetes credentials with limited scopes and configure the given CI product.
How does “CIOps” compare to GitOps?
Here is what a typical “CIOps” deployment pipeline looks like, where both the developer and the CI system have full access to the cluster as well as the container registry – there are no clear boundaries.
Typical "CIOps" pipeline
And this is what a common pipeline with the GitOps model looks like – there is a boundary defined by an operator that runs inside the cluster and has exclusive rights to maintain the status of the cluster, all based on the config Git repo being the source of truth.
In some cases, the config repo can be the same as the code repo, but it’s recommended that you separate these for more complex apps. Any changes should go through the config repo, and any ad-hoc changes made to the cluster will be undone unless they are replicated in the config repo.
Common GitOps pipeline
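To make the operator's job a little more concrete, here is a minimal sketch of the reconciliation idea in Python. It is not how any particular operator is implemented: the `State` shape, the `plan` and `reconcile` functions, and the `prune` flag are all hypothetical, and the cluster is modelled as a plain dictionary rather than the real Kubernetes API. Whether an operator deletes objects that are missing from Git is an assumption here, which is why it sits behind a flag.

```python
from typing import Dict, List, Tuple

# Hypothetical model: object name -> spec (reduced to an image reference).
State = Dict[str, str]


def plan(desired: State, actual: State, prune: bool = True) -> List[Tuple[str, str]]:
    """List the actions that would make `actual` match `desired`."""
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            # Drift: someone changed the cluster without going through Git.
            actions.append(("update", name))
    if prune:
        # Assumption: objects that are absent from Git get removed.
        for name in sorted(actual):
            if name not in desired:
                actions.append(("delete", name))
    return actions


def reconcile(desired: State, actual: State, prune: bool = True) -> State:
    """Apply the plan to a copy of `actual` and return the new state."""
    result = dict(actual)
    for action, name in plan(desired, actual, prune):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


# What the config repo says should be running.
desired = {"web": "example/web:1.4.0", "api": "example/api:2.1.0"}

# What is actually running after some ad-hoc changes.
actual = {
    "web": "example/web:1.4.0",
    "api": "example/api:2.2.0-hotfix",
    "debug-pod": "busybox",
}

for action, name in plan(desired, actual):
    print(action, name)
```

In this toy model the ad-hoc hotfix and the stray debug pod are both reverted, because neither of them exists in the config repo. Making the hotfix stick means committing it to Git.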
And this is what the Weave Cloud version of this pipeline looks like. Notice that there is an extra operator that ensures the latest container image tags are propagated to the workload definitions in the config repo, and that there is a policy associated with it. The rest of the pipeline is the same as above. The GitOps model is also very easy to extend to multiple clusters.
Weave Cloud GitOps pipeline
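The image-propagation step can be sketched roughly as follows. This is an illustration of the idea only, not the Weave Cloud implementation: the `Image` record, the glob-style policy, the "newest build wins" rule, and the workload dictionary are all assumptions made for the example.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, NamedTuple, Optional


class Image(NamedTuple):
    tag: str
    created: int  # hypothetical build timestamp, seconds since epoch


def select_tag(images: List[Image], pattern: str) -> Optional[str]:
    """Pick the most recently built tag that matches the policy pattern."""
    candidates = [img for img in images if fnmatchcase(img.tag, pattern)]
    if not candidates:
        return None
    return max(candidates, key=lambda img: img.created).tag


def bump(workload: Dict[str, str], images: List[Image], pattern: str) -> Dict[str, str]:
    """Return the workload definition to commit to the config repo."""
    tag = select_tag(images, pattern)
    if tag is None or tag == workload["tag"]:
        return workload  # nothing to commit
    return {**workload, "tag": tag}


# Tags found in the container registry (made-up data).
registry = [
    Image("1.4.0", 100),
    Image("1.4.1", 200),
    Image("2.0.0-rc1", 300),
    Image("dev-abc123", 400),
]

workload = {"name": "web", "repository": "example/web", "tag": "1.4.0"}

# Policy: only follow patch releases of 1.4.
updated = bump(workload, registry, "1.4.*")
print(updated["repository"] + ":" + updated["tag"])
```

The point of the design is that the result of `bump` is written to the config repo as a commit, not applied to the cluster directly, so the image update goes through the same path as every other change.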
It’s not uncommon to deploy a CI system directly into a Kubernetes cluster (sometimes for the whole cluster, and sometimes on a per-namespace basis).
This model avoids some of the aforementioned security issues. The configuration problem also goes away, but there are other tradeoffs. For instance, you have to have enough resources to run your builds, and to manage the build log storage. Also, you will need to run the data services on which your CI of choice depends.
With multiple clusters, this approach implies that images are built independently in each of the clusters, and that those images aren’t going to be 100% identical, which defeats the purpose of container images. You need to make sure your builds are truly reproducible, and this can be very hard.
Deployment doesn’t go well - another CIOps example
Let’s consider a scenario where one CI job updated a deployment and the update didn’t go as intended. How do you find out what version to roll back to? You’d probably need to trace through your build logs to find out.
Kubernetes assigns revisions to workloads (namely Deployments), but these aren’t universally applied to every kind of object. You have to take additional steps to map workload revisions to any other associated objects (e.g. ConfigMap or Service) and to the version of the app code that you are looking to deploy. You could consider making API-level snapshots of the entire cluster, but that doesn’t help when you have to relate exact changes to one another. Driving all cluster changes through Git is just much easier, and it also helps to relate changes directly.
But what if there were multiple jobs trying to deploy the same version with different config, or what if someone got the config completely wrong and their CI job happens to override someone else’s deployment?
It is hard to ensure the ordering of CI jobs. There are few guarantees as to which builds will finish before other builds, and it is easy to end up with a race between deployments. You can build your own guards around such things, but why solve a local issue? Some CI implementations may provide locking and/or ordering, but why should you depend on a given CI vendor?
All CI systems vary, and it’s very hard to tell with certainty what is going on by looking at a CI system alone, without organization-specific knowledge. To get someone new up to speed quickly, you have to show them how your CI is set up.
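Here is a toy model of that race, written in Python. It assumes a deliberately simplified world: each CI job deploys as soon as it finishes, the last deploy overwrites whatever was there, and the `Job` record with its `merged` and `finished` fields is made up for the example.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    version: str
    merged: int    # order in which the change landed in Git
    finished: int  # order in which the CI job reached its deploy step


def deployed_by_ci(jobs: List[Job]) -> str:
    """CIOps: whichever job finishes last overwrites the others."""
    return max(jobs, key=lambda job: job.finished).version


def deployed_by_git(jobs: List[Job]) -> str:
    """GitOps: the head of the config repo history decides."""
    return max(jobs, key=lambda job: job.merged).version


jobs = [
    Job("v1", merged=1, finished=3),  # slow build, older change
    Job("v2", merged=2, finished=2),  # fast build, newer change
]

print("CI deploys: ", deployed_by_ci(jobs))
print("Git deploys:", deployed_by_git(jobs))
```

With these made-up timings, the CI-driven model ends up running the older version simply because its build took longer, while the Git-driven model follows commit order regardless of how long each build ran.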
A Kubernetes cluster can be treated as the source of truth, but it doesn’t know where the source code for each of your apps lives or how the “world” outside of Kubernetes is structured. You could attempt to define as much of “your world” as possible inside of Kubernetes, but that’s a challenging task…
If you consider for a moment that you could define this “world” in a Git repo (or in a set of repos), all you need to do is ensure Kubernetes is pointed to its part of the “world” (e.g. staging or production subdirectory, or branch if you prefer).
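As a rough sketch of what "pointing a cluster at its part of the world" could look like, the Python below models the config repo as a dictionary of file paths and maps each cluster to a subdirectory. The repo contents, cluster names, and the `manifests_for` helper are all hypothetical; a real setup would read from an actual Git checkout.

```python
from typing import Dict

# Hypothetical config repo contents: file path -> pinned image.
REPO: Dict[str, str] = {
    "staging/web.yaml": "example/web:1.5.0-rc1",
    "staging/api.yaml": "example/api:2.2.0-rc3",
    "production/web.yaml": "example/web:1.4.1",
    "production/api.yaml": "example/api:2.1.0",
}

# Hypothetical mapping: cluster name -> its subdirectory in the repo.
CLUSTER_PATHS: Dict[str, str] = {
    "staging-eu": "staging/",
    "prod-eu": "production/",
    "prod-us": "production/",
}


def manifests_for(cluster: str) -> Dict[str, str]:
    """Return everything the given cluster is supposed to run."""
    prefix = CLUSTER_PATHS[cluster]
    return {path: image for path, image in REPO.items() if path.startswith(prefix)}


# A brand-new production cluster gets the same answer as the existing one.
for path, image in sorted(manifests_for("prod-us").items()):
    print(path, image)
```

Under these assumptions, adding a cluster is one new entry in the mapping, and the exact versions it should run come from the repo instead of from CI job history.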
Finally, if you lose your production cluster and bring up a new one, there are multiple challenges:
- How do you tell what version of each app you need to deploy?
- You have to re-run all of your CI jobs for that cluster…
- How do you ensure every CI job will deploy to the right cluster?
The Verdict on “CIOps”
In either case, whether you choose a hosted CI or have one that runs in your Kubernetes cluster, the biggest problem you will face at scale is that the CI and the Kubernetes cluster both compete to be the source of truth. CI systems are generally not designed to be the source of truth, although some are often treated that way. Kubernetes is somewhat better at it, and you could add snapshots, but using Git is just so much better.
It’s also hard to tell transitional and accidental states apart from intended states. With GitOps this is much clearer at any given time. CIOps can be mistaken for GitOps, and you shouldn’t make that mistake.