At Weaveworks we run Weave Cloud (our monitoring & visualization service for Cloud Native applications) on a Kubernetes cluster in AWS. In this post I’ll discuss the methods we use to provision and operate the cluster through its entire lifecycle – not just the setup phase, where most existing tools seem to focus.

To frame the discussion, I’ll start by describing the techniques we initially used to provision Kubernetes clusters, and the problems we encountered. I’ll then walk you through the evolution of our cluster management tools, how we do things currently, and where we’re going.

First attempt

Initially we used the kube-up script to provision a new Kubernetes cluster. This worked well, and gave us a functional Kubernetes cluster with minimal intervention – the stack we currently use actually requires significantly more effort to provision a new cluster! We started to hit the limitations of this approach when we wanted to:

  • change the type of the instances we used for the minions
  • upgrade the version of Kubernetes we used
  • recover from a failure of the master
  • tweak command line arguments to things like Docker (for log rotation) and Kubernetes (to enable experimental features for network policies)

The approach we initially took when we wanted to change anything in the cluster’s infrastructure or Kubernetes layer was to provision an entire new cluster with the changes, switch traffic over by pointing DNS at the new cluster, and then delete the old cluster. Anyone who has ever done this will know that DNS changes can take a very long time to propagate – we had some clients still accessing the old cluster after a week!

It wasn’t feasible for us to keep two production clusters around for a week, and generally this meant something had to break – either a user would fail to connect to the service or features in our software that relied on cluster-wide knowledge (like the distributed websocket router behind Scope’s terminals) would break.

A New Hope

So we started a project to come up with a better way of managing change in our Kubernetes clusters. Lots of readers will instantly recognise Configuration Management as a solution to the problems listed above. So we set about dividing up the problem space into layers and picking the right tools for the job.

The problem was divided into three layers: Infrastructure, Kubernetes and Application. This architecture and separation of responsibilities was heavily influenced by earlier work on kubernetes-anywhere. The rest of the blog post will describe the responsibilities of each layer to the layers above it, and the tools used to manage them.

We chose a set of common requirements across all these layers to help guide our choice of tools and minimise cognitive load. We wanted tools that could:

  • keep all the “state” for a given layer in version control (i.e. a git repo), such that we could compare the state of the system to the version-controlled state and tell if anything was missing or wrong, and if something broke, roll back to a previously known-good state.
  • share as much configuration between environments as possible. Ideally the tools would support some concept of parameterisable “modules” which could be “instantiated” for each cluster, and allow for the differences between clusters to be minimally expressed and easily auditable.
  • alert us if the state of a cluster diverged from the configuration.

The Infrastructure Layer

Starting at the bottom: the infrastructure layer is responsible for provisioning and maintaining VMs, networking, security groups, ELBs, RDS instances, DynamoDB tables, S3 buckets, IAM roles etc.

We chose Terraform, from HashiCorp, to provision our infrastructure resources. Terraform allows us to declaratively list the resources we want in a high-level configuration language, describe the interdependencies between them, and intelligently apply changes to this config in an order that honors the dependencies. Terraform supports parameterizable, instantiable modules such that we can share the resource declarations for VMs etc between clusters, parameterising each cluster for number of minions, instance type etc. Terraform even has a “plan” mode, which shows us whether our configuration matches reality. Using prom-run, we periodically execute this and export the exit code to Prometheus, where we monitor and alert on it.

For instance, when we want an ELB to reference all our minions, we do not need to manually copy and maintain the list of minions in the config – we can just refer to our minion VMs using a Terraform interpolation, in such a way that adding or removing a minion will automatically update the ELB:

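resource "aws_elb" "main" {
    name = "${var.elb_name}"
    subnets = ["${aws_subnet.main.*.id}"]
    # Splat over every minion instance, so adding or removing one updates the ELB.
    instances = ["${aws_instance.minion.*.id}"]
    # (remaining ELB settings omitted)
}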
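
Returning to the periodic “plan” check described earlier, the idea of exporting a command's exit code as a metric can be sketched roughly as follows. This is not prom-run's implementation: the function name `run_and_render` and the metric name `command_exit_code` are made up for illustration, and the sketch only renders the Prometheus text format rather than serving it over HTTP.

```python
import subprocess


def run_and_render(command, metric="command_exit_code"):
    """Run `command` and render its exit code in the Prometheus text format.

    `metric` is a hypothetical metric name chosen for this sketch.
    """
    # Output is discarded; only the exit code is of interest here.
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return (
        f"# HELP {metric} Exit code of the last run of the wrapped command.\n"
        f"# TYPE {metric} gauge\n"
        f"{metric} {result.returncode}\n"
    )
```

One possible command to wrap is `["terraform", "plan", "-detailed-exitcode"]`, which exits with 0 when nothing would change, 1 on error and 2 when the plan contains changes, so an alert could fire on any non-zero value. Whether our own job uses that exact flag is left as an assumption here.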

We made a significant departure from the way networking was configured by kube-up. kube-up will configure the Kubernetes Controller Manager to issue subnets to the Kubelets for them to use for Pod networking, and simultaneously program an AWS VPC Route Table with the mapping from subnet to Kubelet (EC2 instance). This worked well, but as we run the Controller Manager as a Pod on the masters, there is an ordering problem on startup – the Kubelet on each master needs to contact the Controller Manager for networking information before it can bring any Pods up, but the Controller Manager needs to be brought up before it can respond to any such request. The kube-up solution was not to have working Pod networking on the master. The solution we went with instead was to have Terraform own and configure the Route Table with an entry per EC2 Instance, with the chosen CIDR subsequently passed to the Kubelet on startup. No Controller Manager involved, and Pod networking works on the master.
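
To make the “one route per instance” idea concrete, here is a small sketch of the bookkeeping involved: carve one Pod subnet per instance out of a cluster-wide range, giving the instance-to-CIDR mapping that the Route Table and the Kubelets both need. In our stack Terraform does this job; the function name, the `10.244.0.0/16` range, the /24 size and the instance IDs below are all illustrative assumptions.

```python
import ipaddress


def pod_routes(cluster_cidr, instance_ids, prefix_len=24):
    """Pair each instance with its own Pod subnet carved out of cluster_cidr."""
    instance_ids = list(instance_ids)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=prefix_len)
    routes = {
        instance_id: str(subnet)
        for instance_id, subnet in zip(instance_ids, subnets)
    }
    if len(routes) < len(instance_ids):
        raise ValueError("cluster CIDR too small for the number of instances")
    return routes


# Hypothetical instance IDs: each one becomes a route table entry
# (destination CIDR -> instance), and the same CIDR is handed to that
# instance's Kubelet on startup.
routes = pod_routes("10.244.0.0/16", ["i-master0", "i-minion0", "i-minion1"])
```

Because the mapping is computed up front rather than requested at runtime, the masters get a working Pod CIDR without waiting for the Controller Manager.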

Similarly, we previously exposed the frontend service in Weave Cloud as a Kubernetes Service of type LoadBalancer, which meant under the hood that the Kubernetes Controller Manager was configuring an AWS ELB for us. This worked well as a way of getting started, but we discovered we needed more control over the ELB – to help us atomically switch traffic between clusters, for instance. So we now have Terraform manage the ELB (as shown above) and used this to migrate traffic from the old kube-up’d cluster to the new one.

We’ve made the Terraform module we use to manage the resources “underneath” our Kubernetes cluster open source, should anyone else find it useful! This Terraform module adds a few more features over kube-up – it will provision multiple master VMs and an ELB pointing at them, and will discover and round-robin master and minion VMs across multiple AZs. The same can be achieved with kube-up with a little effort. Using Terraform for this layer has not been without its problems – we’ve hit some bugs and had to make a series of contributions to get Terraform to where we needed it. We’ve found the HashiCorp team to be very helpful and responsive to bug reports, even if it’s taken a little longer than we’d like for them to merge our PRs!

The Infrastructure layer exports a limited interface to the next layer up – just the VMs’ public and private hostnames and IP addresses, the ssh key and the Pod CIDRs for each machine. This separation of concerns is not only good engineering practice, but has allowed us to make isolated changes to different layers in the stack whilst minimising the risk of breaking the whole thing. As more and more solutions to Kubernetes provisioning and lifecycle management become available, this might be an opportunity for some standardisation and interoperability.

The Kubernetes Layer

The Kubernetes layer in our stack is responsible for provisioning and maintaining the various components which make up Kubernetes – the Docker engine, the Kubelet, the Kube-proxy, the API Server, the Scheduler, the Controller Manager, Etcd.

We evaluated tools such as Chef, Puppet, Salt etc to manage this layer and eventually chose Ansible because:

  • it has a very similar modus operandi to Terraform, which helps to reduce cognitive load
  • it is master-less, saving us from having a separate system to manage an Ansible master, and agent-less, greatly reducing any bootstrapping needed
  • with the combination of the -C flag (check mode) and -D flag (diff mode) Ansible will show us where the live system differs from the checked-in config. We use this (and prom-run) to build an ansiblediff job, and get alerts when reality diverges from the expected configuration.
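
As a rough illustration of how check-mode output could be turned into a drift signal, the sketch below reads the PLAY RECAP section that `ansible-playbook` prints and reports hosts where tasks would have changed something. This is a guess at one workable approach, not a description of how ansiblediff is implemented; the function name and the host names in the sample are hypothetical.

```python
import re

# Matches recap lines such as "minion-0 : ok=10 changed=2 unreachable=0 failed=0".
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def hosts_with_drift(output):
    """Return {host: changed_count} for hosts whose recap reports changes."""
    drift = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue
        # Turn "ok=10 changed=2 ..." into {"ok": "10", "changed": "2", ...}.
        counters = dict(pair.split("=") for pair in match.group("counters").split())
        changed = int(counters.get("changed", 0))
        if changed > 0:
            drift[match.group("host")] = changed
    return drift


# Hand-written sample in the shape of a check-mode run's recap.
sample = """\
PLAY RECAP *********************************************************************
master-0                   : ok=12   changed=0    unreachable=0    failed=0
minion-0                   : ok=10   changed=2    unreachable=0    failed=0
"""
drifted = hosts_with_drift(sample)
```

In check mode a non-zero `changed` count means the playbook would have modified the host, which is the divergence we want to alert on.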

This layer is probably the least well defined in the stack, as it also has to deal with a bunch of other seemingly unrelated tasks like:

  • formatting the ephemeral disks that come with our VMs and mounting them in a place where Docker and Kubernetes can use them – something that was surprisingly easy with Ansible’s LVM modules.
  • managing our own certificate authority and issuing individual, role-based certificates for each of the Kubernetes components and clients.

We chose to install Kubernetes from the new upstream packages, something that was introduced in the Kubernetes 1.4 release. This saves us from building a tarball of the Kubernetes components and hosting it on S3, the way kube-up handles this. This also allows us to use systemd drop-ins to manage the command line flags for the Docker daemon and the Kubelet. These drop-ins are version controlled, and finally allow us to handle the lifecycle of these components as we add, remove and modify flags.

The rest of the Kubernetes components (API server, scheduler, controller manager, etcd and kube-proxy) are run as Static Pods – pods where the config lives on disk on the node where they run, and where the kubelet is responsible for ensuring the running Pod matches the on-disk config. Again, this config is version controlled, allowing us to modify the command line flags for these components and have ansiblediff alert us if we forget to deploy it to every node.

A common question I get is “Why not just use Terraform for the whole thing (config files, packages etc)?” We could, but Terraform doesn’t have resources to manage apt packages, LVM volumes, systemd services etc like Ansible does. We would end up using Terraform provisioners to execute commands to install packages and I didn’t want to blow away an entire VM just because I wanted to change a package version.

The Application Layer

We’ve discussed in previous blog posts how we version control the YAML files for all the Kubernetes Deployments, Services, DaemonSets etc and have written an open source tool called kubediff to ensure the checked-in configuration doesn’t differ from what is running. One nice knock-on effect of this is we have no need to run the Kubernetes Addon Manager. We treat the configurations for services like Kube-DNS or Kube-Dash the same way we treat our application microservices. We use our tools for Continuous Deployment and monitoring to manage them and their configuration.
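
The kind of comparison involved can be sketched as a recursive walk over the checked-in object, reporting every field that is missing or different in the running one. This is a simplified illustration and not kubediff's actual code; in particular, ignoring fields that exist only in the live object (defaults and status added by the cluster) is an assumption of this sketch, and the example objects are invented.

```python
def diff_config(expected, actual, path=""):
    """List paths where `actual` is missing or differs from `expected`.

    Fields present only in `actual` are ignored, since a live cluster adds
    defaults and status fields that are not in the checked-in file.
    """
    differences = []
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [f"{path or '.'}: expected a mapping, found {actual!r}"]
        for key, value in expected.items():
            child = f"{path}.{key}"
            if key not in actual:
                differences.append(f"{child}: missing")
            else:
                differences.extend(diff_config(value, actual[key], child))
    elif isinstance(expected, list):
        if not isinstance(actual, list) or len(actual) != len(expected):
            return [f"{path or '.'}: list differs"]
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            differences.extend(diff_config(exp_item, act_item, f"{path}[{index}]"))
    elif expected != actual:
        differences.append(f"{path or '.'}: expected {expected!r}, found {actual!r}")
    return differences


# Invented example: the checked-in file asks for 3 replicas, the cluster has 2.
checked_in = {"spec": {"replicas": 3, "containers": [{"image": "frontend:v2"}]}}
running = {
    "spec": {
        "replicas": 2,
        "containers": [{"image": "frontend:v2", "imagePullPolicy": "IfNotPresent"}],
    },
    "status": {"readyReplicas": 2},
}
drift = diff_config(checked_in, running)
```

An empty result means the running object satisfies everything in version control; any entry is something to alert on or redeploy.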

The Future

The system presented here allows for the entire state of the system to be stored in version control, and for that state to be checked and enforced by three jobs: terradiff, ansiblediff and kubediff. All the configuration files for the cluster are version controlled, and modifications to any component can be rolled out with zero downtime and rolled back easily if they break anything. We’re starting to live up to the Google SRE saying of “changing the tires of a race car as it’s going 100mph”.

Using Terraform and Ansible to provision and manage Kubernetes is not a new idea – there are lots of projects out there to help provision and maintain a Kubernetes cluster, and as such we don’t intend to make all of the scripts and configuration we use open source. That does beg the question, why did we do this ourselves, and not use one of these projects? This project was easier for us, as its scope was not as wide – we are not trying to support all the major cloud providers or on premises, for instance. We only work with a single Linux distro (Ubuntu). We do not want to support multiple different ways of configuring the networking, as we’re happy with AWS route tables. And I wanted to learn how Kubernetes clusters are put together!

In the future I’d like to investigate more automation around the Terraform and Ansible layers in the stack; we continuously deploy changes to the Application layer, why can’t we do it to the infrastructure? I also want to investigate ways to reduce the level of duplication in the configuration of the Kubernetes objects (Deployments, Services etc) between clusters – I’ve made some encouraging progress using jsonnet for this task.
