- 13. Sep
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
- Watches for the presence of a reboot sentinel, e.g. /var/run/reboot-required
- Utilises a lock in the API server to ensure only one node reboots at a time
- Optionally defers reboots in the presence of active Prometheus alerts
- Cordons & drains worker nodes before a reboot, uncordoning them after
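The four behaviours listed above can be read as a single decision sequence. The following is a minimal sketch of that sequence, not kured's actual implementation: the function names, the return strings, and the idea of passing each step in as a callable are all assumptions made for illustration. Only the sentinel path comes from the text.

```python
import os
from typing import Callable

SENTINEL_PATH = "/var/run/reboot-required"  # sentinel path named in the text


def sentinel_present(path: str = SENTINEL_PATH) -> bool:
    """Return True when the OS has flagged that a reboot is needed."""
    return os.path.exists(path)


def reboot_cycle(
    node: str,
    reboot_needed: Callable[[], bool],
    alerts_active: Callable[[], bool],
    try_acquire_lock: Callable[[str], bool],
    drain: Callable[[str], None],
    reboot: Callable[[], None],
) -> str:
    """One pass of a hypothetical reboot loop; returns the action taken."""
    if not reboot_needed():
        return "not-required"      # no sentinel, nothing to do
    if alerts_active():
        return "deferred"          # wait for a quieter moment
    if not try_acquire_lock(node):
        return "lock-held"         # another node is already rebooting
    drain(node)                    # cordon and evict pods first
    reboot()
    return "rebooting"


def resume_after_reboot(
    node: str,
    holds_lock: Callable[[str], bool],
    uncordon: Callable[[str], None],
    release_lock: Callable[[str], None],
) -> bool:
    """After restart: if this node still holds the lock, uncordon and release."""
    if holds_lock(node):
        uncordon(node)
        release_lock(node)
        return True
    return False
```

Passing the steps in as callables keeps the ordering visible and lets the sequence be exercised with fakes, without a cluster.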
Kubernetes & OS Compatibility
The daemon image contains versions of k8s.io/client-go and the kubectl binary for the purposes of maintaining the lock and draining worker nodes. See the release notes for specific version compatibility information.
Additionally, the image contains a systemctl binary from Ubuntu 16.04 in order to command reboots. Although this has not been tested against other systemd distributions, there is a good chance that it will work.
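To make the role of the two bundled binaries concrete, here is a rough sketch of the command lines involved. The helper names are hypothetical, and the exact flags kured passes are not stated here; `--ignore-daemonsets` is included as an assumption, because a drain otherwise refuses to proceed on a node running daemonset-managed pods (such as kured itself).

```python
from typing import List


def drain_command(node: str) -> List[str]:
    """kubectl invocation that cordons the node and evicts its pods."""
    return ["kubectl", "drain", node, "--ignore-daemonsets"]


def reboot_command() -> List[str]:
    """systemctl invocation that asks systemd to reboot the host."""
    return ["systemctl", "reboot"]


def uncordon_command(node: str) -> List[str]:
    """kubectl invocation that makes the node schedulable again."""
    return ["kubectl", "uncordon", node]


def commands_before_restart(node: str) -> List[List[str]]:
    """Commands to run, in order, before the node goes down."""
    return [drain_command(node), reboot_command()]
```

Each list can be passed to `subprocess.run(cmd, check=True)`; the uncordon command has to run separately, once the node is back up.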
The Reboot Problem
At Weaveworks the development and production clusters underpinning Weave Cloud are orchestrated with Kubernetes running on EC2, maintained with Terraform and Ansible.
The EC2 instances run Ubuntu 16.04 with unattended-upgrades enabled, so the machines need to be rebooted periodically (mainly in response to kernel upgrades). If they aren’t, the clusters are at risk from security vulnerabilities and will eventually run out of disk space, because the OS is unable to remove older kernels and modules.
The first attempt
Our initial approach to this problem was to trigger a Prometheus alert whenever the /var/run/reboot-required file appeared on any of the nodes. We tried coupling it with a manual process that entailed waiting for a safe moment - defined as no active alerts - before draining the application pods and then rebooting each node in turn.
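The "safe moment" check can be sketched against the JSON that Prometheus returns from its `/api/v1/alerts` endpoint. This is an illustration under stated assumptions rather than the process we actually ran: the function names are hypothetical, only alerts in the `firing` state are treated as active, and the `ignore` parameter exists because the reboot-required alert itself would otherwise block every reboot. The alert name `RebootRequired` in the usage example is made up.

```python
from typing import Any, Dict, Iterable, List


def firing_alerts(payload: Dict[str, Any]) -> List[str]:
    """Return the sorted, de-duplicated names of alerts in the 'firing' state."""
    alerts = payload.get("data", {}).get("alerts", [])
    names = {
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    }
    return sorted(names)


def safe_moment(payload: Dict[str, Any], ignore: Iterable[str] = ()) -> bool:
    """True when no firing alert remains after dropping the ignored names."""
    ignored = set(ignore)
    return all(name in ignored for name in firing_alerts(payload))


# Example payload in the shape of a Prometheus /api/v1/alerts response.
example = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "RebootRequired"}, "state": "firing"},
            {"labels": {"alertname": "HighLatency"}, "state": "pending"},
        ]
    },
}
```

With this payload, `safe_moment(example)` is false because the reboot alert is firing, while `safe_moment(example, ignore={"RebootRequired"})` is true.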
Automation makes everything better
Whilst this worked in practice, the frequency of OS updates coupled with the number of nodes eventually drove us to an automated solution. And so for the past six months, all reboots have been conducted safely and automatically by kured, our Kubernetes reboot daemon.
During this time kured has effected hundreds of node reboots in our dev and prod clusters without human intervention - in fact, until the relatively recent addition of Slack notifications, we were mostly unaware that it was happening at all.