This blog post is aimed at Kubernetes users who have adopted Continuous Integration (CI) and who want to add Continuous Deployment (CD). I want to home in on security and compliance. I’ll show how your continuous delivery pipeline can be more secure, and how using GitOps best practice gives you a complete audit trail of system changes.
I shall assume you have already read about GitOps and high velocity CICD for Kubernetes. In that post I recommended you consider three ways to do GitOps:
- Weave Cloud is a quick path to CICD and Observability on GCP in the GitOps style
- Kelsey Hightower described a pragmatic DIY approach at Kubecon (video)
- Weave Flux is our open source CD and release automation tool, used in Weave Cloud
In this post I shall focus on Weave Flux to demonstrate the key points about security and compliance. Flux is the core of our Weave Cloud deployment service for Kubernetes and enables the best practices that I set out below. I think it’s easier than doing everything by hand or by adapting an older CD tool to Kubernetes, but it is up to you.
Three best practices
This is what we think you need to do:
- Keep a record in Git of important interactions with the system: who made changes, when and why
- Don’t rebuild images from scratch if you can update config instead. Build each container image just once and 'promote' it through each test sequence and environment, rather than rebuilding each time. You must still record your declarative config changes in Git.
- Use pull based deployment - do not let CI push updates into the Kubernetes cluster or use kubectl by hand
I shall now discuss these. I’ll start with the last one, push vs pull.
Push vs pull deployment
In the previous blog post and accompanying video and slides, we talked about how Weave Flux implements the Kubernetes operator pattern. In plain terms an operator is an actor that is managed by Kubernetes and can inherit the cluster’s configuration, security, availability, etc.
Doing this leads to better security “out of the box”. Why? Flux is an agent that lives inside your Kubernetes cluster. It listens for updates to all code and image repos that it is allowed to access, and it pulls images and config updates into the cluster.
The pull approach is more secure because Flux can:
- ONLY carry out operations permitted by Kubernetes role based access control (RBAC), policy and security. Trust is shared with the cluster and not managed separately.
- Bind natively to all Kubernetes objects and know whether operations have completed or need to be retried.
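To make the pull idea concrete, here is a minimal sketch of the comparison an in-cluster agent could make between the desired state in Git and what is running. It is not Flux's actual implementation: the `plan_reconcile` function, the flat "workload name to image" model and the `example/...` image names are all assumptions made for illustration. Whether a real agent deletes workloads that are missing from Git is a policy decision; this sketch simply reports them.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: workload name -> image reference.
State = Dict[str, str]


def plan_reconcile(desired: State, running: State) -> List[Tuple[str, str, str]]:
    """Return the (action, workload, image) steps that make `running` match `desired`."""
    steps: List[Tuple[str, str, str]] = []
    # Anything declared in Git but missing or different in the cluster.
    for name, image in sorted(desired.items()):
        if name not in running:
            steps.append(("create", name, image))
        elif running[name] != image:
            steps.append(("update", name, image))
    # Anything running that Git no longer declares.
    for name in sorted(running):
        if name not in desired:
            steps.append(("delete", name, running[name]))
    return steps


# Illustrative data only.
desired = {"frontend": "example/frontend:1.4.0", "api": "example/api:2.1.0"}
running = {"frontend": "example/frontend:1.3.9", "worker": "example/worker:0.9.0"}

for step in plan_reconcile(desired, running):
    print(step)
```

The point of the sketch is that the agent only needs read access to Git and whatever permissions the cluster grants it. Nothing outside the cluster needs credentials for production.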
This is in contrast to the “push” approach which is typical today:
- An actor process lives outside the cluster and is responsible for deployment orchestration, by executing commands that load images.
- e.g.: typing kubectl at the command line to execute a direct update, or encoding updates in scripts that run as CI jobs.
If implemented with care, this model can be secure, because it still uses RBAC in the cluster and in Git to constrain interactions with production. But it is easier to get wrong. You are working outside the trust domain of your cluster and integrating with it, so you have to set up the whole authentication dance by hand and take care of hardening yourself. All this is tricky, and it is not very amenable to change without a rewrite. This is why a carelessly configured CI system can become an attack vector for production.
How secure is your pipeline?
If you are interested in this area, do take a look at the deeper post “How secure is your CICD pipeline?” by our PM Stuart Williams, which maps a “Security by Design” model to CI system access permissions.
The role of CI in GitOps
In GitOps, the CI system does not have direct access to the cluster at all. You still use CI to run builds, regression tests and so on. Then CI writes updates and images into the relevant repos. This works just great! But don’t use CI to push updates directly, and avoid kubectl if you can.
Speed of updates and recovery from failure
If your CI pipeline pushes changes into the cluster using scripts, you may have noticed that it can take its time, and sometimes breaks. The main reason that updates break is that scripting is brittle. And if there is a failure, your state may be unknown.
Our recommended approach is faster and more robust. Flux accesses the cluster locally and natively, and is only limited by when (and how) it is notified. You can run multiple Flux agents easily. The Flux agent lifecycle, failure, recovery, availability, and scalability are all managed by Kubernetes. And Flux uses Git as a record of cluster changes, so if something goes wrong during a deployment, you can always recover or, if necessary, roll back.
Note: of course, pushing built images to container image repos can also be slow. But usually the failure cases are easier to understand.
Testing in production
Here is a rather fine image created by Cindy Sridharan, @copyconstruct on Twitter.
This image is from one of Cindy’s tweets about her Observability book. The relevance to our blog post is that we are seeing more and more “testing in production”. Deployment and release have to be fast, safe, and secure. That’s easier when you use the pull-based approach to deployment, e.g. Flux for Kubernetes. High velocity CICD is great because you can test fast and do what is being called continuous experimentation for customer happiness and profit :-)
Don’t rebuild images if you can change config instead
In GitOps, ideally, we have a complete description of the desired state of the system. Git is our source of truth for this desired state. As described in our earlier pipeline post, in GitOps we are building on the following best practices:
- DevOps & Git backed pipelines
- Infrastructure as code, aka config-as-code
- Immutable deployment artefacts
We are also combining these practices with Kubernetes and other cloud native technology, into what we hope will become a fully declarative application delivery model.
Combining DevOps with Kubernetes and cloud native practices has consequences:
- All config is declarative and everything can be described and observed
- Config can be mutable even if images are not
- We can unbundle configuration from build, and update it independently
- We move from “config as code” to “ops as config”.
So what do these points imply for best practice?
From immutable infrastructure to mutable config
Immutable infrastructure is best practice, but *what* counts as infrastructure may be evolving. With containerisation, we can build code and other source information into immutable container images. With declarative configuration, we can parametrise our system as a set of values and keep all that in Git. Those values may be altered at runtime, making our system partly mutable.
Kubernetes YAML files are examples of declarative configuration. If suitably authorised and authenticated, we can update these. GitOps encourages this, because (a) you don’t always need to or want to rebuild images from code, to make an important system change, but also (b) you want all system changes to be described in Git - the source of truth for your desired state.
Weave Flux handles this case by observing the config repo as well as the image repo. If the config is updated, and the images remain unchanged, Flux will orchestrate a deployment within the Kubernetes cluster to update the application. As with pull-based deployment this has several functional benefits:
- Faster & more robust: you don’t always need a rebuild to apply changes in the correct manner. Rebuilds add latency to your delivery cycle and may not succeed every time.
- Reduced attack surface: Provided you use a sensible access control model for “who gets to change things and when”, it is helpful to have a category of application update that is robust, recorded in Git correctly, but without requiring code changes.
- Supports mutation patterns like latching and canary while remaining within the GitOps paradigm. For example, an administrator can do incremental rollout while taking snapshots of the last good state.
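As a rough illustration of the "config changed, image unchanged" case, the sketch below compares two revisions of one workload's declared settings and says which kind of change it is. The `classify_change` function, the flat dictionary shape and the field names (`image`, `replicas`) are assumptions for the example, not how Flux represents state.

```python
from typing import Any, Dict


def classify_change(old: Dict[str, Any], new: Dict[str, Any]) -> str:
    """Classify the difference between two revisions of one workload's declared config."""
    image_changed = old.get("image") != new.get("image")
    # Everything other than the image reference counts as config here.
    old_rest = {k: v for k, v in old.items() if k != "image"}
    new_rest = {k: v for k, v in new.items() if k != "image"}
    config_changed = old_rest != new_rest

    if image_changed and config_changed:
        return "image+config"
    if image_changed:
        return "image-only"
    if config_changed:
        return "config-only"
    return "no-change"


# Illustrative revisions: the image stays the same, only the replica count moves.
previous = {"image": "example/api:2.1.0", "replicas": 2}
current = {"image": "example/api:2.1.0", "replicas": 4}
print(classify_change(previous, current))
```

A "config-only" result is the case described above: the change is recorded in Git and rolled out, but no rebuild is involved.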
But here is the best part:
- The approach is additive. You don’t get rid of your existing CI. You just make sure that it writes to image repos and then you add Flux. Finally, all this scales to multi-cluster pipelines where deployment may be to Dev, Test, UAT and Production clusters.
- A guide to multi-cluster pipelines in GitOps is provided here.
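Here is one way to picture promotion across environments without a rebuild. This is a sketch under stated assumptions: the `promote` helper, the per-environment dictionary and the placeholder digests are invented for illustration, and in practice the result would be committed to the config repo rather than held in memory.

```python
import copy
from typing import Any, Dict

# Hypothetical shape: environment name -> declared settings for one workload.
Environments = Dict[str, Dict[str, Any]]


def promote(envs: Environments, source: str, target: str) -> Environments:
    """Return a new config where `target` runs the exact image that `source` runs.

    Only the image reference is copied; environment-specific settings
    such as replica counts are left untouched.
    """
    updated = copy.deepcopy(envs)
    updated[target]["image"] = envs[source]["image"]
    return updated


# Placeholder digests, not real image references.
envs = {
    "dev": {"image": "example/api@sha256:aaaa", "replicas": 1},
    "production": {"image": "example/api@sha256:9999", "replicas": 6},
}

promoted = promote(envs, "dev", "production")
print(promoted["production"])
```

Because the image is built once and only its reference moves between environments, what you tested is what you run.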
Finally, let’s turn to the third best practice, in which Git helps us record changes.
Record everything in Git to have audit and compliance
Weaveworks customers are using GitOps practices today in order to pass SOC 2 compliance audits. The big deal here is that normally people tell you to buy an expensive compliance product to pass these tests. We think that you don’t have to buy that expensive product, at least not for many day to day cases.
If you do things right, the auditor can look at Git and see who made any changes, when and why, and how that impacted the running system deployments. Note that the same approach can help with HIPAA and some PCI too. All of these audits famously introduce process overhead. In GitOps that overhead is absorbed into normal developer practice. Let’s look at why that is.
Understanding SOC 2
The core idea is that a company needs to have divisions between roles, rules on what each role can do, and a record showing that those rules were followed. For example there must be a bright line between who can change production code and who can change the monitoring of production code. Theoretically this implies that the company would need two bad actors to sneak some evil code into production. Managing this is hard in a fast moving delivery environment with small agile teams, which is why the traditional enterprise approach is to quash velocity with process.
File integrity monitoring
I’d like to quote from the Threat Stack blog: “When’s the last time someone made an unauthorized change to your system files?”. It turns out that provided we can carve out suitable roles, we can track such changes in Git. You need at least two roles:
- A role that has write access to source
- A role that can look at production but can make no changes (or only “some”)
You can use GitHub RBAC to control who is in which teams and therefore who has access. This provides you with roles. GitHub also keeps track of every change and when it happened.
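The "bright line" between roles can be checked mechanically. The sketch below assumes you have already exported role membership into a dictionary. The role names (`source-writers`, `monitoring-admins`) and user names are hypothetical, and how you obtain the membership data from your Git host is outside the scope of this example.

```python
from typing import Dict, Set


def users_in_both_roles(
    role_members: Dict[str, Set[str]], role_a: str, role_b: str
) -> Set[str]:
    """Return users who hold both roles, i.e. who cross the separation line."""
    return role_members.get(role_a, set()) & role_members.get(role_b, set())


# Hypothetical membership data.
role_members = {
    "source-writers": {"alice", "bob"},
    "monitoring-admins": {"bob", "carol"},
}

violations = users_in_both_roles(role_members, "source-writers", "monitoring-admins")
print(sorted(violations))
```

An empty result means nobody can change both production code and its monitoring single-handedly, which is the property the auditor is looking for.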
Adding Weave Flux to complete the audit trail
Capturing source changes is not enough. We must also track releases, staging and rolling deployments. And we must understand how image and config changes map to live cluster objects. This is enabled by using Weave Flux which itself keeps notes for you, and writes them all into Git. This means that your desired state is up to date, correct and observable. And everything that you needed to record has been kept for that day when the auditors visit.
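To show what "who, when and why" looks like when read back out of Git, here is a small sketch that turns `git log` output into audit records. The `parse_git_log` and `read_audit_trail` helpers and the record field names are my own, and the sample history is made up. The format placeholders (`%H`, `%an`, `%aI`, `%s`, `%x1f`) are standard `git log` pretty-format codes. This reads plain commit history only; it does not model the notes that Flux writes.

```python
import subprocess
from typing import Dict, List

FIELD_SEP = "\x1f"  # ASCII unit separator, unlikely to appear in commit subjects
GIT_LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"  # hash, author, ISO 8601 date, subject


def parse_git_log(output: str) -> List[Dict[str, str]]:
    """Parse `git log --pretty=format:GIT_LOG_FORMAT` output into audit records."""
    records = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit, who, when, why = line.split(FIELD_SEP, 3)
        records.append({"commit": commit, "who": who, "when": when, "why": why})
    return records


def read_audit_trail(repo_path: str) -> List[Dict[str, str]]:
    """Read the commit history of the repo at `repo_path` (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{GIT_LOG_FORMAT}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_log(result.stdout)


# Made-up sample output, so the parser can be tried without a repository.
sample = (
    "abc123\x1fAlice\x1f2018-03-01T10:00:00+00:00\x1fScale api to 4 replicas\n"
    "def456\x1fBob\x1f2018-02-27T16:30:00+00:00\x1fRelease example/api:2.1.0\n"
)
for record in parse_git_log(sample):
    print(record["when"], record["who"], "-", record["why"])
```

Against a real config repo you would call `read_audit_trail("/path/to/repo")` instead of parsing the sample string.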
GDPR and security by design
A quick note for GDPR folks; please skip this section otherwise. “Security by design” is a requirement of GDPR. Our approach (GitOps) is aligned with OWASP, a major project which defines practices for this. In particular we believe our best practices adhere to these OWASP information security principles:
- Confidentiality – only allow access to data for which the user is permitted
- Integrity – ensure data is not tampered with or altered by unauthorized users
- Availability – ensure systems and data are available to authorized users when they need it
Summary & next steps
What are you waiting for? Please try Weave Cloud now and let us know what you think. We are excited about the security and compliance benefits. If you are using containers and/or Kubernetes and other cloud native tools - I want to hear from you.
Remember that we help you implement the three best practices, and that these can help you be secure and compliant without buying hefty and expensive “solutions”.
- Keep a record in Git
- Avoid unnecessary rebuilds if you can use config instead
- Use pull based deployment
Other blogs in this series include:
- GitOps - Operations by pull request (Part 1)
- The GitOps Pipeline (Part 2)
- GitOps - Observability (Part 3)