Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots. It gets triggered by the package management system of the underlying OS.
In essence Kured:
- Watches for the presence of a reboot sentinel e.g.
- Utilises a lock in the API server to ensure only one node reboots at a time
- Optionally defers reboots in the presence of active Prometheus alerts
- Cordons & drains worker nodes before reboot, uncordoning them after
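As a rough illustration of how those four behaviours fit together, here is a minimal Python sketch of one pass through the checks. It is not kured's actual code: `ClusterLock` is a hypothetical in-memory stand-in for the lock held in the API server, `active_alerts` is a hypothetical callable standing in for a Prometheus query, and the cordon, drain and reboot steps are only recorded as strings rather than executed. The order of the alert and lock checks is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClusterLock:
    """Hypothetical in-memory stand-in for the lock kept in the API server."""
    holder: Optional[str] = None

    def acquire(self, node: str) -> bool:
        # Succeeds if the lock is free or this node already holds it.
        if self.holder in (None, node):
            self.holder = node
            return True
        return False

    def release(self, node: str) -> None:
        # Only the current holder can release the lock.
        if self.holder == node:
            self.holder = None


def maybe_reboot(node: str, sentinel_path: str, lock: ClusterLock,
                 active_alerts: Callable[[], List[str]],
                 actions: List[str]) -> str:
    """Run one pass of the checks and report what happened."""
    if not os.path.exists(sentinel_path):
        return "no-reboot-required"      # sentinel absent: nothing to do
    if active_alerts():
        return "deferred-alerts-active"  # optional deferral on active alerts
    if not lock.acquire(node):
        return "deferred-lock-held"      # another node is already rebooting
    # Recorded rather than executed: this sketch never touches a cluster.
    actions.extend([f"cordon {node}", f"drain {node}", f"reboot {node}"])
    return "rebooting"


def after_reboot(node: str, lock: ClusterLock, actions: List[str]) -> None:
    """Once the node is back, uncordon it and free the lock for others."""
    actions.append(f"uncordon {node}")
    lock.release(node)
```

Because the lock is only released in `after_reboot`, a second node that finds its own sentinel in the meantime is deferred until the first node has come back and been uncordoned.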
The Reboot Problem
The EC2 instances run Ubuntu 16.04 with unattended-upgrades enabled, so the machines need to be rebooted periodically (mainly in response to kernel upgrades). If they aren’t, the clusters are at risk from security vulnerabilities, and eventually run out of disk space as the OS is unable to remove older kernels and modules.
The first attempt
Our initial approach to this problem was to trigger a Prometheus alert whenever the /var/run/reboot-required file appeared on any of the nodes. We tried coupling it with a manual process that entailed waiting for a safe moment - defined as no active alerts - before draining the application pods and then rebooting each node in turn.
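To make the "safe moment" test concrete, here is a small Python sketch that inspects a response shaped like the one returned by the Prometheus HTTP API's `/api/v1/alerts` endpoint. The function names and the `RebootRequired` alert name are hypothetical. The sketch assumes the reboot alert itself has to be ignored, since it would otherwise always count as an active alert whenever a reboot is pending.

```python
from typing import Any, Dict, Iterable, List


def firing_alerts(payload: Dict[str, Any],
                  ignore: Iterable[str] = ()) -> List[str]:
    """Return the sorted names of firing alerts, minus any ignored names."""
    alerts = payload.get("data", {}).get("alerts", [])
    names = {
        alert.get("labels", {}).get("alertname", "unknown")
        for alert in alerts
        if alert.get("state") == "firing"  # pending alerts do not block
    }
    return sorted(names - set(ignore))


def is_safe_moment(payload: Dict[str, Any],
                   ignore: Iterable[str] = ("RebootRequired",)) -> bool:
    """A moment is 'safe' when nothing other than ignored alerts is firing."""
    return not firing_alerts(payload, ignore)


# Hand-written sample response; a real one would come from Prometheus.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "RebootRequired"}, "state": "firing"},
            {"labels": {"alertname": "HighLatency"}, "state": "pending"},
        ]
    },
}
```

With this sample, only the reboot alert is firing, so `is_safe_moment(sample)` treats the moment as safe. Adding any other firing alert would defer the reboot.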
Automation makes everything better
Whilst this worked in practice, the frequency of OS updates coupled with the number of nodes eventually drove us to an automated solution. And so for the past six months all reboots have been conducted safely and automatically by kured, our Kubernetes reboot daemon.
During this time kured has effected hundreds of node reboots in our dev and prod clusters without human intervention - in fact, until the relatively recent addition of Slack notifications, we were mostly unaware that it was happening at all.
Kured is here!
Now that we have gained confidence in the implementation through an extended period of operational use, we are pleased to share the 1.0.0 release of kured with the community under an Apache 2 license. Kured should work with most Kubernetes installations and distros - read more about how it works on GitHub:
Join our community Slack if you have any questions or suggestions!