A GitOps RDS Migration Story
For the past three years we have been building out Weave Cloud, our operations and management platform for developers and DevOps teams. As a consequence of the iterative nature of this development, we now have:
- Six databases
- Five RDS instances
- A mixture of 9.5.x and 9.6.x PostgreSQL versions
- A mixture of legacy db.m3 and db.m4 instance types
In order to reduce costs and technical debt, we decided to consolidate and simplify the above to:
- Six uniformly named databases
- A single RDS instance
- PostgreSQL 10.x
- db.m5 instance type
This post explains how we used GitOps best practices and workflows to safely migrate our data with minimal downtime.
It was important for us to perform the migration with:
- Zero downtime for reads
- Write downtime comparable to an RDS multi-AZ failover, i.e. a few minutes at most
- Ability to abort quickly and cleanly if necessary
In order to achieve this, we came up with the following plan:
- Pre-create the new RDS instance, populated with six empty databases and corresponding users
- Then for each existing database in turn:
  - Update the client services to reconnect to the existing database with read-only permissions. Once in effect, no further writes are accepted; during this time the Weave Cloud UI will decline write operations touching this database with a warning.
  - Use pg_dump | pg_restore to clone the database to the new instance, taking the opportunity to rename it using a common scheme.
  - Update the client services to connect to the new database with read-write permissions. Once in effect, full service is restored.
There is no explicit mechanism in PostgreSQL to make an entire database read-only - you have to avoid writes or revoke permissions. We decided to create a separate role for this purpose - this blog post was very helpful in understanding how.
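As a rough illustration of the separate-role approach, the sketch below generates the kind of SQL statements that would set up a role limited to reads. It is not taken from the post or from the blog post it refers to: the role name billing_ro, the public schema default, and the helper names are all assumptions made for the example, and a password would still have to be set separately.

```python
from __future__ import annotations


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def read_only_role_sql(database: str, role: str, schema: str = "public") -> list[str]:
    """Return SQL statements that create a login role which can only read.

    The role name and schema are hypothetical; adapt them to your own setup.
    """
    db, r, s = quote_ident(database), quote_ident(role), quote_ident(schema)
    return [
        f"CREATE ROLE {r} LOGIN;",
        f"GRANT CONNECT ON DATABASE {db} TO {r};",
        f"GRANT USAGE ON SCHEMA {s} TO {r};",
        # SELECT only: no INSERT, UPDATE, DELETE or TRUNCATE is granted.
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {s} TO {r};",
        f"GRANT SELECT ON ALL SEQUENCES IN SCHEMA {s} TO {r};",
    ]


if __name__ == "__main__":
    for statement in read_only_role_sql("billing", "billing_ro"):
        print(statement)
```

The read-only guarantee only holds for clients that actually connect as this role, which is why the plan switches every client of a database over before the dump starts.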
Execution, GitOps Style
In accordance with GitOps principles, the Kubernetes manifests expressing the intended state for Weave Cloud are kept in a git repository with Weave Cloud Deploy ensuring the cluster is kept in sync and alerting on any variance. Because of this, we can execute the plan with a sequence of pull-requests matching the steps outlined above:
PR #1: Make Microservices Read-only
Most of our databases have multiple clients in the cluster - microservice replicas, sync processes, stats exporters etc. Our intent here is to make them all read-only as a single logical operation, so we collect all the manifest updates in a single commit which can be reviewed for correctness.
When it is merged, Weave Cloud Deploy automatically applies all the manifest updates to the cluster, and the services gracefully roll over to read-only as the Kubernetes deployment controller works its magic. We need to wait until all the rolling deployments are complete before it is safe to merge the sync job PR introduced in the next section. For this we used the Weave Cloud Deploy UI, which can display rollout progress of all workloads across the entire cluster.
At any point, reverting this PR will restore the cluster to a working state in seconds without needing to mess around with snapshot restoration; the existing database effectively functions as a hot-standby until we have verified the migration is complete.
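We used the Weave Cloud Deploy UI to confirm that every rollout had finished. For readers without it, the sketch below shows one way the same check could be made against the JSON that kubectl get deployments -o json returns. The function names are hypothetical, and the conditions follow the general logic of a rollout status check rather than anything specific to our setup.

```python
def rollout_complete(deployment: dict) -> bool:
    """Return True if a Deployment object (parsed JSON) has finished rolling out."""
    metadata = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})

    # The controller has not yet observed the latest spec.
    if status.get("observedGeneration", 0) < metadata.get("generation", 0):
        return False

    desired = spec.get("replicas", 1)
    updated = status.get("updatedReplicas", 0)
    total = status.get("replicas", 0)
    available = status.get("availableReplicas", 0)

    if updated < desired:    # new pods are still being created
        return False
    if total > updated:      # old pods are still being terminated
        return False
    if available < updated:  # new pods are not all available yet
        return False
    return True


def all_rollouts_complete(deployments: list) -> bool:
    """Return True only when every Deployment in the list has rolled out."""
    return all(rollout_complete(d) for d in deployments)
```

Passing the "items" list from the kubectl JSON output to all_rollouts_complete gives a single yes/no answer on whether it is safe to merge the next PR.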
PR #2: Create Sync Job
We want the PostgreSQL dump/restore operation to take place inside the cluster for two key reasons:
- Speed. This operation is the limiting factor in meeting our write-downtime objective.
- Security. We don’t want to export either the data or PostgreSQL credentials outside the cluster, and in any case the RDS instances can only be accessed from the VPC.
Instead of SSHing into the cluster and running a migration script, we can use a Kubernetes Job:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: rds-migration-billing
      labels:
        type: rds-migration
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: migrate
              image: postgres:10.6
              command: [ "/bin/bash" ]
              args:
                - -c
                - |
                  set -o xtrace
                  set -o errexit
                  set -o pipefail
                  pg_dump --format=tar "$SOURCEURL" | pg_restore --no-owner --no-privileges --schema=public --role=billing --dbname "$DESTURL"
              env:
                - name: SOURCEURL
                  value: postgres://billing@dedicated-billing-instance.XXX.rds.amazonaws.com/billing
                - name: DESTURL
                  value: postgres://billing@shared-instance.XXX.rds.amazonaws.com/billing
                - name: PGPASSFILE
                  value: /credentials/pgpass
              volumeMounts:
                - name: pgpass-secret-volume
                  mountPath: /credentials
          volumes:
            - name: pgpass-secret-volume
              secret:
                secretName: pgpass
                defaultMode: 0600
backoffLimit and restartPolicy are configured to prevent retries in the event of failure: if something goes wrong we want the operator to inspect the situation and decide what to do next.
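Note that the connection URLs in the manifest carry no passwords; pg_dump and pg_restore look them up in the file named by PGPASSFILE, which libpq ignores unless its permissions are 0600 or stricter (hence the defaultMode on the secret volume). Each line of that file has the form hostname:port:database:username:password. The sketch below shows how such lines could be generated; the host names and passwords are placeholders and the helper names are hypothetical, not part of our tooling.

```python
def pgpass_escape(field: str) -> str:
    """Escape backslashes and colons, as the pgpass format requires."""
    return field.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_line(host: str, port: int, database: str, user: str, password: str) -> str:
    """Build one pgpass entry: hostname:port:database:username:password."""
    fields = (host, str(port), database, user, password)
    return ":".join(pgpass_escape(f) for f in fields)


if __name__ == "__main__":
    # Placeholder hosts and passwords, one entry per instance the job talks to.
    print(pgpass_line("source.example.internal", 5432, "billing", "billing", "old-secret"))
    print(pgpass_line("dest.example.internal", 5432, "billing", "billing", "new:secret"))
```

With one entry for the source instance and one for the destination, a single secret can serve both ends of the pipe.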
Once the PR is merged, the job can be monitored for successful completion:
    $ kubectl get jobs -l type=rds-migration
    NAMESPACE   NAME                    COMPLETIONS   DURATION   AGE
    billing     rds-migration-billing   1/1           7s         50s
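If you would rather script this check than read the table by eye, the Job's status conditions tell the same story. The following is a minimal sketch, assuming the input is a Job object parsed from kubectl get job -o json; the function name and the three outcome strings are our own invention for illustration.

```python
def job_outcome(job: dict) -> str:
    """Classify a Job object (parsed JSON) as 'succeeded', 'failed' or 'running'."""
    conditions = job.get("status", {}).get("conditions") or []
    for condition in conditions:
        if condition.get("status") != "True":
            continue
        if condition.get("type") == "Complete":
            return "succeeded"
        if condition.get("type") == "Failed":
            return "failed"
    # No terminal condition yet: the job is pending or still running.
    return "running"
```

Because the job never retries, a "failed" outcome is final and is the cue for an operator to step in, while the old database keeps serving reads in the meantime.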
PR #3: Switch Microservices to New Database
Having verified that the migration job completed successfully, we can now merge the final PR which is similar in form to PR #1 but instead updates the client services to use the new database connection parameters. Once merged, Kubernetes will gracefully roll the services over to the new database, restoring full read-write service.
Once the migration was done we followed up with a PR to remove the completed sync jobs. The current version of Weave Cloud Deploy doesn’t delete resources from the API server when they disappear from the config repository, so we did it manually, but that feature is already in progress!