Container Linux Update Operator automates Kubernetes node OS upgrades

By Josh Wood

CoreOS develops modern container cluster infrastructure guided by a philosophy of automation in pursuit of security. Beginning with the automatically-updating Container Linux operating system and extending through the Tectonic Kubernetes platform for the enterprise, CoreOS aims to deliver “continuous availability” – automated deployment, lifecycle management, and security updates at each layer of the infrastructure stack.

At a high level, there are three of these layers in the cloud and on-premises infrastructure being deployed today: the cluster orchestration system that harnesses many machines into a unified resource, the business applications provisioned and scaled on the cluster, and the operating system running each machine, or node, in the cluster. Kubernetes provides (and Tectonic extends) the ability to continuously deploy and automatically scale the central and most important layer: the business applications. How are the outer layers managed, scaled, and automatically upgraded?

Kubernetes Operators

The Operators pattern developed by CoreOS encodes complex operational knowledge into software. The goal of Operators is to build the management of complex applications, including the orchestration and infrastructure layers themselves, atop the existing Kubernetes API and its objects.

Container Linux has always delivered automatic updates: any Container Linux machine regularly checks for updates and autonomously reboots to apply them. On a single machine, these reboots would be unpredictable, so updates can be controlled manually in that case. But in a cluster of machines, simple coordination allows automatic update reboots to be managed like a rolling update of any application. This is where the Container Linux automatic update philosophy shines.

Today we’re formally introducing the Container Linux Update Operator, extending the CoreOS automatic update philosophy to the individual nodes that comprise Tectonic Kubernetes clusters.

The Container Linux Update Operator (CLUO) is responsible for the coordination of upgrade reboots in Tectonic Kubernetes clusters. Each Container Linux node runs an update_engine that checks for and schedules system upgrades. The Update Operator, in turn, runs in two parts on the cluster.

An update-agent runs as a Kubernetes DaemonSet. This means one update-agent is scheduled on every cluster node. The update-agent monitors the native Container Linux update_engine over a D-Bus connection. When an update is available, update-agent marks its node with an annotation indicating the need for a reboot.

The coordinator component, update-operator, runs as a Kubernetes Deployment, which can be easily scaled and reliably replicated on an appropriate number of machines. It monitors the reboot annotations that the update-agents set across the cluster. The update-operator knows when machines need to reboot to apply a system update, and gives each node permission to reboot in a sequence that ensures cluster and application services remain available. In the simplest case, one machine is allowed to reboot at a time.
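The simplest policy, one reboot at a time, can be modeled in a few lines of shell. This is only an illustrative sketch, not the Update Operator's actual code: the three-column input format and the `next_reboot` function are invented for this example, and the real coordinator reads node annotations through the Kubernetes API instead.

```shell
#!/usr/bin/env bash
# Toy model of the "one reboot at a time" policy; NOT the real update-operator.
# Input (hypothetical format), one node per line:
#   NODE_NAME  NEEDS_REBOOT(true|false)  REBOOT_IN_PROGRESS(true|false)

next_reboot() {
  awk '
    $3 == "true"               { busy = 1 }     # some node is already rebooting
    $2 == "true" && pick == "" { pick = $1 }    # first node still waiting
    END { if (!busy && pick != "") print pick } # grant at most one permission
  '
}

# Example state: node-b and node-c both want to reboot, nobody is rebooting yet.
printf '%s\n' \
  'node-a false false' \
  'node-b true  false' \
  'node-c true  false' | next_reboot
```

With this input the model prints `node-b`. If any line reports a reboot in progress, it prints nothing, so the waiting nodes keep running until the rebooting node is back.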

The easiest way to get automatic operating system updates for Kubernetes cluster nodes is to install a CoreOS Tectonic cluster. Tectonic is available without charge for development and testing clusters, and Tectonic versions 1.6 and above include the Container Linux Update Operator to automatically provision and coordinate cluster node system upgrades. As open source software, the Container Linux Update Operator can also be installed manually on any Kubernetes v1.6+ cluster deployed on Container Linux nodes.

To deploy the Update Operator on a stock Kubernetes cluster, first prepare the Container Linux nodes for automatic updates. Container Linux ships with automatic updates enabled by default, but ambiguity can be avoided by unmasking and starting the update-engine.service systemd unit on each node. Connect to a node with ssh and issue systemctl unmask update-engine.service && systemctl start update-engine.service. Next, ensure the older reboot coordinator, locksmith, is stopped and masked on each node with systemctl stop locksmithd.service && systemctl mask locksmithd.service. After repeating this on each cluster node, the Update Operator can be deployed.
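Repeating these steps by hand on every node gets tedious, so a small loop can help. The following is a minimal sketch under stated assumptions: the host names are placeholders, login is as the default Container Linux user `core` with sudo, and the `run` and `prepare_nodes` helpers are made up for this example. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: run the node preparation steps from the text on every node over ssh.
# Assumptions (not from the article): the host names in NODES are placeholders,
# and login is as the default Container Linux user "core", which can use sudo.
NODES="${NODES:-node1.example.com node2.example.com}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command, 0 = actually run it

# Enable update_engine, then stop and mask the older locksmith coordinator.
# The systemd unit of the locksmith daemon is named locksmithd.service.
PREP='sudo systemctl unmask update-engine.service && sudo systemctl start update-engine.service && sudo systemctl stop locksmithd.service && sudo systemctl mask locksmithd.service'

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

prepare_nodes() {
  local node
  for node in $NODES; do
    run ssh "core@${node}" "$PREP"
  done
}

prepare_nodes
```

Review the printed commands first, then rerun with `DRY_RUN=0` and a real `NODES` list to apply them.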

Rolling out the Update Operator Kubernetes components is a matter of creating the update-operator Deployment, which will in turn instantiate and manage the DaemonSet that runs update-agent on each node.

$ kubectl create -f https://raw.githubusercontent.com/coreos/container-linux-update-operator/master/examples/update-operator.yaml

After the cluster fetches and runs the containers specified in the update-operator.yaml manifest, the Update Operator’s update-agent DaemonSet and update-operator Deployment are visible either with kubectl get commands or in the Tectonic Console browser GUI. Operating system updates will now be automatically coordinated by the Update Operator, and nodes will take turns rebooting to apply updates until all are running the latest Container Linux release on the site’s selected release channel. Work on each rebooting node is automatically re-scaled or re-scheduled onto other nodes by the cluster orchestrator.

To demonstrate the Container Linux Update Operator, simulate a need to reboot on any cluster node. After connecting with ssh, issue a locksmithctl send-need-reboot command to place a synthetic reboot notice on the system’s D-Bus. The Update Operator’s update-agent will annotate the node for a reboot, and the update-operator coordinator will take over from there to check the state of other cluster nodes and issue permission to reboot. It’s easy to observe the reboot process with kubectl get nodes --watch, or to observe the migration of the rebooting node’s work by watching pod status either with kubectl or in Tectonic Console.
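The demonstration can be strung together roughly as follows. This is a sketch, not an official procedure: `NODE_HOST` and `NODE_NAME` are placeholders, the `run` and `demo_reboot` helpers are made up for this example, and the use of sudo as the default `core` user is an assumption. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch of the demonstration above. NODE_HOST and NODE_NAME are placeholders
# (assumptions): the ssh address of one node and its name in `kubectl get nodes`.
NODE_HOST="${NODE_HOST:-node1.example.com}"
NODE_NAME="${NODE_NAME:-node1.example.com}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

demo_reboot() {
  # 1. Put a synthetic "reboot needed" notice on the node's D-Bus.
  run ssh "core@${NODE_HOST}" sudo locksmithctl send-need-reboot
  # 2. Show the node's annotations; the one set by update-agent may take a
  #    moment to appear.
  run kubectl get node "$NODE_NAME" -o 'jsonpath={.metadata.annotations}'
  # 3. Watch the node leave and rejoin the cluster (Ctrl-C to stop).
  run kubectl get nodes --watch
}

demo_reboot
```

Printing the full annotation map avoids guessing at the exact annotation keys the update-agent uses; they can be read directly from the output.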

[Figure: Triggering a node reboot]

Try it out and join the effort

The Update Operator is already part of the CoreOS Tectonic Kubernetes platform, and the project welcomes contributors who want to make infrastructure more secure through the automation of lifecycle chores. Try Tectonic for free with up to 10 nodes and start investigating the Update Operator as a user, or check out the CLUO source code on GitHub and help add features, documentation, and tests.