How to Correctly Handle DB Schemas During Kubernetes Rollouts
Have you decided to migrate to Kubernetes, but are unsure how to safely roll out your microservice’s replicas while also coordinating changes to the schema of the underlying database? This how-to article walks you through considerations and best practices.
No panic, we’ve got you covered!
In this article, we will show you how 1) Kubernetes features like rollout strategies, readiness probes and liveness probes, 2) your favourite database migrations library*, and 3) simple, good engineering practices, can enable you to embrace change while saving the day when something goes wrong and you need to roll things back.
*DB migrations libraries are available in most popular languages, e.g.: Flyway for Java and Scala, Flask-Migrate for Python, Migrate for Go. Many frameworks also provide one out-of-the-box, e.g.: Play Framework for Java and Scala, Django for Python, or Rails for Ruby.
Let’s consider the following “two tier” scenario:
- one “application tier” with multiple replicas of a stateless microservice,
- one “database tier” with one database (in production, this should have multiple replicas for redundancy too, but this is out of scope here),
- the “application tier” exposes an API to create and read users (CR in CRUD),
- the “database tier” is responsible for persisting users.
Implementation: high level
The two “tiers” are implemented using Services, which essentially provide:
- a DNS name in order to easily send requests to a “tier”,
- transparent routing of the requests to the underlying Pods.
Running processes for the microservice and the database are encapsulated inside containers, which provide isolation & resource limits. These containers are themselves encapsulated inside Pods, the smallest deployable unit in Kubernetes.
Configuring the number of Pod replicas and managing their lifecycle is done by Deployments. This is where the interesting and important bits to solve our problem are. Let’s get back to this in a minute.
As you change your microservice’s code, you will eventually need to change your database’s schema too, so that the two match at all times. Many resources cover this topic*, but one simple way to achieve this is via database migrations:
*Evolutionary Database Design; Refactoring Databases: Evolutionary Database Design, by Ambler & Sadalage; and many others.
- version your schema,
- write each change to the schema in a dedicated script (a.k.a. “migration”) which can be identified by a version number,
- package all these scripts with your code,
- on startup, check your schema version, and if it’s out of date, apply the necessary migrations so that the schema version matches the desired version.
The main benefits of this approach are:
- simplicity: no new moving pieces of infrastructure are introduced at runtime,
- ease of deployment: you will always have the right schema version, in development, during testing and in production.
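The migration steps listed earlier can be sketched in a few lines. This is only an illustration, not how any particular migrations library works: it uses Python's built-in sqlite3 module, and the `MIGRATIONS` dictionary, the `schema_version` table and both function names are assumptions made for the example.

```python
import sqlite3

# Hypothetical migrations, keyed by version number. In a real service these
# would be script files packaged with the code and run by a migrations library.
MIGRATIONS = {
    1: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    2: "ALTER TABLE users ADD COLUMN email TEXT",
}


def current_version(conn):
    """Return the schema version recorded in the database (0 if none yet)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, target=max(MIGRATIONS)):
    """Apply every migration between the current and the target version."""
    applied = []
    for version in range(current_version(conn) + 1, target + 1):
        conn.execute(MIGRATIONS[version])
        # Record the new version so the next startup skips this migration.
        conn.execute(
            "INSERT INTO schema_version (version) VALUES (?)", (version,)
        )
        conn.commit()
        applied.append(version)
    return applied
```

Calling `migrate(conn)` on startup, before serving any request, brings an empty or outdated database up to the version the code expects, and does nothing if the schema is already current.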
So far, so good
Kubernetes lets us implement this (rather classic) setup very easily, as it does most of the heavy lifting and wiring for us, which is pretty neat!
Indeed the entire “application tier” described above is essentially captured below in these two YAML manifests:
However, the above will not give you all the desirable properties you’d want for such a setup:
- What happens during the deployment of a new version? Do you get “zero downtime”? How does your capacity change?
- What happens if you made a mistake and the new version crashes as you deploy it?
- What happens if your microservice crashes after running for a while?
- What happens if you need to roll back?
Let’s answer these questions.
Implementation: the devil is in the details!
By default, Kubernetes deploys pods using a “rolling update” strategy. With maxUnavailable: 1 and maxSurge: 1 (the defaults in older API versions; the apps/v1 Deployment API defaults both to 25%), it removes 1 old pod at a time and adds 1 new pod in its place, which means that with 3 replicas, you would temporarily lose 33% of your ability to serve end-users’ requests as you roll a new version out.
Let’s fix this by changing maxUnavailable to be 0. This way, Kubernetes will first deploy one new pod, and will only remove an older one if the deployment was successful. Note that one downside is you need spare capacity in your cluster to temporarily run this extra replica, so if you are already close to capacity you may need to add an extra node.
The upside is that we theoretically now have zero downtime and zero impact on end-users.
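To make the capacity arithmetic concrete, here is a small sketch. It is not Kubernetes code: the function names are made up for this example, and it assumes maxUnavailable and maxSurge are given as absolute numbers rather than percentages.

```python
def rollout_bounds(replicas, max_unavailable, max_surge):
    """Return (fewest available pods, most total pods) during a rolling update."""
    if max_unavailable == 0 and max_surge == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


def capacity_loss_percent(replicas, max_unavailable):
    """Worst-case share of serving capacity lost during the rollout."""
    return round(100 * max_unavailable / replicas)
```

With 3 replicas, `rollout_bounds(3, 1, 1)` gives `(2, 4)`, i.e. a worst-case loss of 33%, while `rollout_bounds(3, 0, 1)` gives `(3, 4)`: capacity never drops, at the cost of temporarily running a fourth pod.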
Kubernetes adds a pod to its service’s load balancer when it thinks it is “ready”. By default, “ready” means only that all of the pod’s containers have started, and Kubernetes can “exec” into them. However, if we are establishing a connection to a database and running schema migrations on startup, this may take a while and we clearly need a better definition of “ready”.
From a business perspective, our microservice is ready when it can start answering end-users’ requests. Let’s tell Kubernetes exactly that by configuring an HTTP readinessProbe. For this to work, we also need to create our database connection and run migrations before we start our HTTP server.
Generally, it is also a good idea to wait a bit after each pod becomes ready before continuing the rollout (for example, via the Deployment’s minReadySeconds setting).
Now, if we somehow crash upon startup, or fail to connect to our database, this newly deployed, failing pod will not be added to the load balancer of the “application tier”, and the rollout will stop there. Great! This means that if something goes wrong at this stage, it will not impact our end-users.
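The startup ordering that makes this work (connect, migrate, and only then open the HTTP port) can be sketched as follows. This is a minimal illustration using Python's standard library, not the article's actual service: the `/ready` path, the function names and the in-memory SQLite database are all assumptions, and `probe` merely mimics the HTTP GET that a readinessProbe would send.

```python
import http.client
import sqlite3
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def connect_and_migrate(path=":memory:"):
    """Stand-in for 'connect to the database and run the migrations'."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.commit()
    return conn


class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/ready" is a hypothetical path; it must match the probe's path.
        status = 200 if self.path == "/ready" else 404
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok" if status == 200 else b"not found")

    def log_message(self, *args):
        pass  # keep the example quiet


def start_service(port=0):
    # 1. Database first: if this raises, the port is never opened and the
    #    probe can never succeed, so the pod receives no traffic.
    conn = connect_and_migrate()
    # 2. Only now start listening (port 0 picks a free port for the demo;
    #    a real pod would listen on all interfaces, not just 127.0.0.1).
    server = HTTPServer(("127.0.0.1", port), ReadyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, conn


def probe(port, path="/ready"):
    """Send the same kind of HTTP GET a readinessProbe would send."""
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    client.request("GET", path)
    status = client.getresponse().status
    client.close()
    return status
```

Because the server socket only exists once `connect_and_migrate` has returned, a successful probe implies that the schema is already at the expected version.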
Kubernetes also periodically checks if pods are “alive”, and by default does so the same way it checks for readiness. In our case, if the database client somehow enters a corrupted state, we may also want Kubernetes to remove the affected pod from the load balancer, to kill it, and to start a new one. This can be done by adding a check (ideally, as representative as possible of your system’s health), exposing it to Kubernetes, and configuring a livenessProbe.
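Such a check can be as small as a trivial query against the database. The sketch below uses SQLite as a stand-in for the real database; the function names and the choice of `SELECT 1` are assumptions, and a production check would ideally exercise more of the system. An HTTP probe counts any status code from 200 up to (but excluding) 400 as success, so returning 503 makes the probe fail.

```python
import sqlite3


def database_is_healthy(conn):
    """Return True if a trivial query still succeeds on this connection."""
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    except sqlite3.Error:
        # Closed or corrupted connection: report unhealthy instead of crashing.
        return False


def liveness_status(conn):
    """HTTP status code a hypothetical liveness endpoint would return."""
    return 200 if database_is_healthy(conn) else 503
```

Wiring `liveness_status` into the HTTP endpoint targeted by the livenessProbe lets Kubernetes detect a broken database client and replace the affected replica.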
When things go pear-shaped, you may want to roll things back to the latest working version. Good engineering practices can help greatly in enabling this. The main one in our scenario is the backward compatibility of the database schema for our microservice.
For example, adding a column and selecting columns explicitly would allow us to run a prior version of the microservice against the latest schema, therefore allowing a smooth rollback from v1.1.0 to v1.0.0 without any schema change.
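Here is a small sketch of why explicit column lists matter, using SQLite and a hypothetical `users` table with an `email` column added by v1.1.0 (table, column and function names are illustrative only).

```python
import sqlite3


def list_users_v1_0_0(conn):
    """Query as v1.0.0 would write it: columns are named explicitly."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def list_users_fragile(conn):
    """Same query with SELECT *: the row shape follows the schema."""
    return conn.execute("SELECT * FROM users ORDER BY id").fetchall()


def add_email_column(conn):
    """Hypothetical v1.1.0 migration: a new, nullable column."""
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.commit()
```

After `add_email_column` has run, `list_users_v1_0_0` returns exactly the same rows as before, whereas `list_users_fragile` starts returning an extra field per row. Because the new column is nullable, v1.0.0's inserts (which do not mention it) keep working too.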
Renaming a column wouldn’t be backward compatible. In this case, you may want to use “down migrations” to revert to the previous schema version. Beware, though, as rolling forward or back will break “zero downtime”. Indeed, end-users may experience transient errors, depending on which replica they hit at which stage of the deployment. If this isn’t acceptable, you may need to first roll out a version of the microservice which can support both the old and new schema (by having two clients, and selecting the right one or trying both), and only then roll out another version with the migrations for the desired schema change.
This can get pretty hairy, so you will want to test this carefully.
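As a rough idea of what “supporting both the old and new schema” can look like, the sketch below inspects the table and picks the matching column. It assumes SQLite and a hypothetical rename of `name` to `full_name`; a real service might instead keep two full clients and select one of them.

```python
import sqlite3


def name_column(conn):
    """Return the column this schema version uses for the user's name."""
    # Each row of PRAGMA table_info is (cid, name, type, notnull, default, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    # "full_name" (new schema) and "name" (old schema) are hypothetical.
    return "full_name" if "full_name" in columns else "name"


def list_user_names(conn):
    """Read user names regardless of which schema version is deployed."""
    # The column comes from the fixed pair above, never from user input,
    # so interpolating it into the query is safe here.
    column = name_column(conn)
    query = f"SELECT {column} FROM users ORDER BY id"
    return [row[0] for row in conn.execute(query)]
```

A version of the microservice built this way can be rolled out before the renaming migration, and keeps working both before and after the rename is applied.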
For larger systems, you may want to look into “blue-green deployments”. However, these are typically a lot more complex to implement, hence are out of scope for this article.
A little less conversation, a little more action, please!
To try all of this out, just follow these steps.
For example, you can visualize your setup with Weave Cloud:
You can ensure new replicas are handling the traffic as you roll out new versions of your microservice:
Or you can run arbitrary queries on gathered metrics and clearly see the impact of new versions (the blue vertical bars with a dot at the bottom), if any:
Thanks for making it this far! We hope you enjoyed it and learned a thing or two.
You should now be able to:
- embrace change for your “two tier” setup,
- take advantage of Kubernetes’ rollingUpdate, maxUnavailable, maxSurge, readinessProbe and livenessProbe features, and
- better refactor your databases and engineer your systems to cope with this.
If this helped, or if you have a different use-case which we haven’t covered here, we would love to hear about it! So feel free to tell us more!