For the past three years we have been building out Weave Cloud, our operations and management platform for developers and DevOps teams. As a consequence of the iterative nature of this development, we now have:

  • Six databases
  • Five RDS instances
  • A mixture of 9.5.x and 9.6.x PostgreSQL versions
  • A mixture of legacy db.m3 and db.m4 instance types

In order to reduce costs and technical debt, we decided to consolidate and simplify the above to:

  • Six uniformly named databases
  • A single RDS instance
  • PostgreSQL 10.x
  • db.m5 instance type

This post explains how we used GitOps best practices and workflows to safely migrate our data with minimal downtime.

The Plan

It was important for us to perform the migration with:

  • Zero downtime for reads
  • Write downtime comparable to an RDS multi-AZ failover, i.e. a few minutes at most
  • Ability to abort quickly and cleanly if necessary

In order to achieve this, we came up with the following plan:

  1. Pre-create the new RDS instance, populated with six empty databases and corresponding users
  2. Then for each existing database in turn:
    1. Update the client services to reconnect to the existing database with read-only permissions. Once this is in effect, no further writes are accepted; during this time the Weave Cloud UI will decline write operations touching this database with a warning.
    2. Execute pg_dump | pg_restore to clone the database to the new instance, taking the opportunity to rename it using a common scheme.
    3. Update the client services to connect to the new database with read-write permissions. Once in effect, full service is restored.

There is no explicit mechanism in PostgreSQL to make an entire database read-only - you have to avoid writes or revoke permissions. We decided to create a separate role for this purpose - this blog post was very helpful in understanding how.
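
As a rough illustration of the separate-role approach, the sketch below prints the kind of SQL that sets up a read-only role. It is a minimal sketch, not the statements used in the migration: the helper name readonly_role_sql, the role name billing_ro, and the assumption that all tables live in the public schema are hypothetical.

```shell
# Print SQL that creates a role limited to reading one database.
# Both arguments are illustrative names, not taken from the real setup.
readonly_role_sql() (
  db=$1
  role=$2
  cat <<SQL
CREATE ROLE ${role} LOGIN;
GRANT CONNECT ON DATABASE ${db} TO ${role};
GRANT USAGE ON SCHEMA public TO ${role};
GRANT SELECT ON ALL TABLES IN SCHEMA public TO ${role};
SQL
)
```

The output could be piped into psql against the existing instance, for example readonly_role_sql billing billing_ro | psql "$SOURCEURL". The role would still need a password configured separately, and GRANT ... ON ALL TABLES only covers tables that exist at the time it runs.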

Execution, GitOps Style

In accordance with GitOps principles, the Kubernetes manifests expressing the intended state for Weave Cloud are kept in a git repository, with Weave Cloud Deploy ensuring the cluster is kept in sync and alerting on any variance. Because of this, we can execute the plan as a sequence of pull requests matching the steps outlined above:

PR #1: Make Microservices Read-only

Most of our databases have multiple clients in the cluster - microservice replicas, sync processes, stats exporters etc. Our intent here is to make them all read-only as a single logical operation, so we collect all the manifest updates in a single commit which can be reviewed for correctness.

When it is merged, Weave Cloud Deploy automatically applies all the manifest updates to the cluster, and the services gracefully roll over to read-only as the Kubernetes deployment controller works its magic. We need to wait until all the rolling deployments are complete before it is safe to merge the sync job PR introduced in the next section. For this we used the Weave Cloud Deploy UI, which can display rollout progress of all workloads across the entire cluster.
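
The same wait can also be scripted rather than watched in a UI. The sketch below assumes kubectl access to the cluster and relies on kubectl rollout status, which blocks until a deployment has finished rolling out. The helper name wait_for_rollouts, the deployment names in the usage note, and the five-minute timeout are hypothetical.

```shell
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every named deployment in a namespace has rolled out.
# Usage: wait_for_rollouts NAMESPACE DEPLOYMENT...
wait_for_rollouts() (
  namespace=$1
  shift
  for deployment in "$@"; do
    # Stop at the first deployment that fails or times out.
    "$KUBECTL" rollout status "deployment/${deployment}" \
      --namespace "$namespace" --timeout=5m || exit 1
  done
)
```

For example, wait_for_rollouts billing api worker would return success only once both (hypothetical) deployments report a completed rollout, which makes it usable as a gate before merging the next PR.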

At any point, reverting this PR will restore the cluster to a working state in seconds without needing to mess around with snapshot restoration; the existing database effectively functions as a hot-standby until we have verified the migration is complete.

PR #2: Create Sync Job

We want the PostgreSQL dump/restore operation to take place inside the cluster for two key reasons:

  • Speed. This operation is the limiting factor in meeting our write-downtime objective.
  • Security. We don’t want to export either the data or PostgreSQL credentials outside the cluster, and in any case the RDS instances can only be accessed from the VPC.

Instead of SSHing into the cluster and running a migration script, we can use the Kubernetes Job workload:

apiVersion: batch/v1
kind: Job
metadata:
  name: rds-migration-billing
  labels:
    type: rds-migration
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: postgres:10.6
        command: [ "/bin/bash" ]
        args:
        - -c
        - |
          set -o xtrace
          set -o errexit
          set -o pipefail
          pg_dump --format=tar "$SOURCEURL" | pg_restore --no-owner --no-privileges --schema=public --role=billing --dbname "$DESTURL"
        env:
        - name: SOURCEURL
          value: postgres://billing@dedicated-billing-instance.XXX.rds.amazonaws.com/billing
        - name: DESTURL
          value: postgres://billing@shared-instance.XXX.rds.amazonaws.com/billing
        - name: PGPASSFILE
          value: /credentials/pgpass
        volumeMounts:
        - name: pgpass-secret-volume
          mountPath: /credentials
      volumes:
      - name: pgpass-secret-volume
        secret:
          secretName: pgpass
          defaultMode: 0600

The backoffLimit and restartPolicy are configured to prevent retries in the event of failure - if something goes wrong we want the operator to inspect the situation and decide what to do next.

Once the PR is merged, the job can be monitored for successful completion:

$ kubectl get jobs --all-namespaces -l type=rds-migration
NAMESPACE      NAME                          COMPLETIONS   DURATION   AGE
billing        rds-migration-billing         1/1           7s         50s
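
Instead of polling by hand, completion can be awaited from a script. This is a sketch under the assumption that kubectl wait is available in the kubectl version in use; the helper name await_migration_job and the ten-minute timeout are hypothetical, while the job and namespace names in the usage note come from the listing above.

```shell
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Wait for a migration Job to complete, then print its log output.
# Usage: await_migration_job NAMESPACE JOB_NAME
await_migration_job() (
  namespace=$1
  job=$2
  # Returns non-zero if the Job has not completed within the timeout.
  "$KUBECTL" wait --for=condition=complete "job/${job}" \
    --namespace "$namespace" --timeout=10m || exit 1
  # The xtrace output from the migration script ends up in the pod log.
  "$KUBECTL" logs "job/${job}" --namespace "$namespace"
)
```

Calling await_migration_job billing rds-migration-billing would block until the job reports completion. Because retries are disabled, a failed job never reaches the complete condition, so in that case the wait runs until the timeout and the operator still has to inspect the job by hand.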

PR #3: Switch Microservices to New Database

Having verified that the migration job completed successfully, we can now merge the final PR which is similar in form to PR #1 but instead updates the client services to use the new database connection parameters. Once merged, Kubernetes will gracefully roll the services over to the new database, restoring full read-write service.
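
Beyond checking the job status, one possible extra sanity check before merging is to compare row counts between the old and new databases. The sketch below is an assumption about how such a check could look, not something described in this post: the helper names table_count and same_row_count and the table name in the usage note are hypothetical, and the table name is interpolated into the query unquoted, so it should only be used with trusted input.

```shell
# PSQL can be overridden, which is handy for testing with a stub.
PSQL="${PSQL:-psql}"

# Print the number of rows in one table of the database at URL.
table_count() (
  url=$1
  table=$2
  "$PSQL" --no-align --tuples-only \
    --command "SELECT count(*) FROM ${table};" "$url"
)

# Succeed only if a table has the same row count in both databases.
# Usage: same_row_count SOURCE_URL DEST_URL TABLE
same_row_count() (
  src=$(table_count "$1" "$3") || exit 1
  dst=$(table_count "$2" "$3") || exit 1
  [ "$src" = "$dst" ]
)
```

With the source database already read-only, something like same_row_count "$SOURCEURL" "$DESTURL" invoices should succeed for every table once the restore has finished. Matching counts do not prove the data is identical, but a mismatch is a cheap signal to stop and investigate.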

Once the migration was done we followed up with a PR to remove the completed sync jobs. The current version of Weave Cloud Deploy doesn’t delete resources from the API server when they disappear from the config repository, so we did it manually, but that feature is already in progress!