The problem: you have a semi-stateful service running on Kubernetes and you want to be able to do rolling upgrades on it, but when its Pods exit they have to flush their state to a second service. You don’t want them all to exit at once and overwhelm the second service, so how do you make Kubernetes upgrade them one at a time?

Cortex TCO

This is a problem we faced in Cortex (our open source project which powers our hosted Prometheus-as-a-Service in Weave Cloud).

The ingesters store the most recent samples in memory, batching them up until a chunk is full, and then writing these chunks to S3 (with an index in DynamoDB). DynamoDB is provisioned to have a given write capacity (which we pay for), enough for normal operation - if we had to provision DynamoDB to have enough for normal operation plus enough for every ingester to flush all their in-memory chunks at once, the system would cost too much to run.

To give you some numbers here, our Dev environment (which only has samples from the jobs in our Prod and Dev environments):

  • Has 3 ingesters.
  • Has 600k active timeseries.
  • Has a single open chunk per timeseries.
  • Chunks are 1KB in size.
  • Average chunk is open for 1hr.
  • There are on average 7 index entries per chunk.

Given those numbers, in normal operation we have to flush around 170 chunks a second just to keep up, which means on average 1.2k writes a second to DynamoDB. When we want to upgrade an ingester, it has to flush its share of the open chunks (about 200k), which means 1.4m extra writes to DynamoDB - if we target ingesters taking no longer than 10mins to flush, this turns into an extra 2.3k writes a second. So a “thickly” provisioned DynamoDB table that can handle all ingesters flushing simultaneously would need to handle 8.2k writes a second, or over double what we’d need if we could limit an upgrade to one ingester flushing at a time (3.5k/s).
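The arithmetic behind those figures can be reproduced with a short script. This is only a back-of-the-envelope sketch: it assumes the timeseries are spread evenly across the ingesters and that every index entry costs exactly one DynamoDB write, and the variable names are ours rather than anything from Cortex.

```python
# Inputs taken from the Dev environment numbers above.
TIMESERIES = 600_000            # active timeseries, one open chunk each
INGESTERS = 3
CHUNK_OPEN_SECONDS = 60 * 60    # average chunk is open for 1hr
INDEX_ENTRIES_PER_CHUNK = 7
FLUSH_TARGET_SECONDS = 10 * 60  # target: an ingester flushes within 10mins

# Normal operation: every chunk is flushed once per hour on average.
steady_chunks_per_s = TIMESERIES / CHUNK_OPEN_SECONDS
steady_writes_per_s = steady_chunks_per_s * INDEX_ENTRIES_PER_CHUNK

# Upgrade: one ingester flushes its share of the open chunks (assumed even).
flush_writes = (TIMESERIES / INGESTERS) * INDEX_ENTRIES_PER_CHUNK
flush_writes_per_s = flush_writes / FLUSH_TARGET_SECONDS

# Provisioning needed on top of normal operation.
all_at_once = steady_writes_per_s + INGESTERS * flush_writes_per_s
one_at_a_time = steady_writes_per_s + flush_writes_per_s

print(f"steady state:      {steady_chunks_per_s:.0f} chunks/s, "
      f"{steady_writes_per_s:.0f} writes/s")
print(f"one ingester:      {flush_writes:.0f} writes, "
      f"{flush_writes_per_s:.0f} writes/s over 10mins")
print(f"all flushing:      {all_at_once:.0f} writes/s")
print(f"one at a time:     {one_at_a_time:.0f} writes/s")
```

The unrounded results (roughly 167 chunks/s, 1,167 and 2,333 writes/s, then 8,167 versus 3,500 writes/s) match the rounded figures quoted above.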

Making deployments upgrade one by one

It took a couple of attempts to get this right, but in the end it was relatively straightforward. The trick was that Kubernetes will pause a rolling upgrade until the new Pods enter the “Ready” state.

We use the Cortex hash ring to store the state of the ingesters: when ingesters receive SIGTERM they set their state to “Leaving” and begin flushing chunks. When they finish flushing all their chunks they remove their entry from the ring. Newly started ingesters detect whether any ingesters are in the “Leaving” state, and reflect this in the Pod ready state - in other words, newly started ingesters do not enter the “Ready” state until all other ingesters have exited the “Leaving” state. We combine this with the following deployment flags:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ingester
spec:
  replicas: 3

  # Ingesters are not ready for at least 1 min
  # after creation.  This has to be in sync with
  # the ring timeout value, as this will stop a
  # stampede of new ingesters if we should lose
  # some.
  minReadySeconds: 60

  # Having maxSurge 0 and maxUnavailable 1 means
  # the deployment will update one ingester at a time
  # as it will have to stop one (making one unavailable)
  # before it can start one (surge of zero)
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1

  template:
    spec:
      # Give ingesters 40 minutes grace to flush chunks and exit cleanly.
      # Service is available during this time, as long as we don't stop
      # too many ingesters at once.
      terminationGracePeriodSeconds: 2400

      containers:
      - name: ingester
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 15
          timeoutSeconds: 1
        ...
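To make the interaction between the readiness check and the rollout settings concrete, here is a toy model of it. This is a sketch under stated assumptions, not Cortex's or Kubernetes' actual code: the ring is a plain dict, the "Active" state name and both function names are hypothetical, the readiness rule is only modelled for the newly started ingester, and minReadySeconds is ignored.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "Active"    # hypothetical name for a normally serving ingester
    LEAVING = "Leaving"  # set on SIGTERM, while chunks are being flushed


def is_ready(ring, self_id):
    """A new ingester is ready only if no other ingester is still Leaving."""
    return all(
        state is not State.LEAVING
        for ingester_id, state in ring.items()
        if ingester_id != self_id
    )


def may_stop_another(ready_pods, replicas, max_unavailable):
    """Simplified rollout gate: stopping one more Pod must still leave
    at least (replicas - max_unavailable) Pods ready."""
    return ready_pods - 1 >= replicas - max_unavailable


# Mid-upgrade: ingester-1 received SIGTERM and is flushing, and its
# replacement (ingester-4) has already started.
ring = {
    "ingester-1": State.LEAVING,
    "ingester-2": State.ACTIVE,
    "ingester-3": State.ACTIVE,
    "ingester-4": State.ACTIVE,
}

new_ready = is_ready(ring, "ingester-4")
ready_pods = 2 + int(new_ready)  # the two untouched old Pods, plus the new one if ready
blocked = not may_stop_another(ready_pods, replicas=3, max_unavailable=1)
print("while flushing: new ingester ready =", new_ready, "| rollout paused =", blocked)

# ingester-1 finishes flushing and removes its entry from the ring.
del ring["ingester-1"]
new_ready_after = is_ready(ring, "ingester-4")
ready_after = 2 + int(new_ready_after)
proceed = may_stop_another(ready_after, replicas=3, max_unavailable=1)
print("after flushing: new ingester ready =", new_ready_after, "| rollout proceeds =", proceed)
```

In this model the rollout cannot stop a second ingester while the first is still flushing, because the replacement Pod holds the ready count at two until the "Leaving" entry disappears from the ring.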

This achieves the desired behaviour of each ingester exiting one at a time during a rolling upgrade, as seen in this graph of ingester flush queue length and flush rate:

What I like about this approach is that it shows the power of the building blocks that Kubernetes gives you - you can achieve some reasonably advanced use cases, even with stateful applications, without having to resort to building your own “operator”.
