Kubernetes was designed to tolerate worker node failures. If a node goes missing because of a hardware problem or a cloud infrastructure problem, or if Kubernetes simply stops receiving heartbeat messages from a node for any reason, the Kubernetes control plane is clever enough to handle it. But that doesn’t mean it will be able to solve every conceivable problem.

A common misconception is as follows: “If there are enough free resources, Kubernetes will re-schedule all the pods from the lost node to another, so there’s absolutely no reason to worry about losing a node. Everything will be re-scheduled; the autoscaler will add a new node if necessary; life goes on.” To topple this misconception, let’s take a look at what disruptions really mean, and how the kubectl drain command works: what it does, and how it operates so gracefully. The cluster autoscaler uses similar logic to scale a cluster, and our Pipeline Platform also has a similar feature that automatically handles spot instance terminations gracefully via Hollowtrees.

Pod disruptions

Pods disappear from clusters for one of two reasons:

there was some kind of unavoidable hardware, software or user error

the pod was deleted voluntarily, because someone wanted to delete its deployment, or wanted to remove the VM that held the pod

The Kubernetes documentation calls these involuntary and voluntary disruptions, respectively. When a “node goes missing”, it’s considered an involuntary disruption. Involuntary disruptions are harder to deal with than voluntary disruptions (keep reading for an in-depth explanation as to why), but you can do a few things to mitigate their effects. The documentation lists a few preventative methods, from the trivial - like pod replication - to the complex - like multi-zone clusters. You should take a look at these and do your best to limit the impact of involuntary disruptions, since these will surely occur in any cluster of sufficient size. But even if you’re doing your best, problems will arise eventually, especially in multi-tenant clusters, in which not everyone who’s using the cluster has the same information or is comparably diligent.

So what can you do to guard yourself against involuntary disruptions, other than the preventative measures outlined in the official documentation? Well, there are some cases in which we can prevent involuntary disruptions and change these to voluntary disruptions, like AWS spot instance termination, or cases in which monitoring can predict failures in advance. Voluntary disruptions allow the cluster to gracefully accommodate its new situation, making the transition as seamless as possible. In the next section of this post, we’ll use the kubectl drain command as a means of exploring voluntary disruptions, and note the ways in which handling involuntary disruptions is less graceful.

The kubectl drain command

According to the Kubernetes documentation, the drain command can be used to “safely evict all of your pods from a node before you perform maintenance on the node,” and “safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified”. So if it’s supposedly not a problem that a node is being removed from the cluster, then why do we need this safe eviction and how does it work, exactly?

From a bird’s eye view drain does two things:

1. cordons the node

This part is quite simple: cordoning a node marks it unschedulable, so new pods can no longer be scheduled to it. If we know in advance that a node will be taken from the cluster (because of maintenance, like a kernel update, or because the node is about to be removed by a scale-in), cordoning is a good first step. We don’t want new pods scheduled on this node and then taken away after a few seconds. For example, if we know two minutes in advance that a spot instance on AWS will be terminated, new pods shouldn’t be scheduled on that node, and we can use the remaining time to gracefully reschedule the pods already running there. On the API level, cordoning means patching the node with node.Spec.Unschedulable=true.
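To make the API-level view concrete, here is a minimal sketch that builds (but does not send) the kind of PATCH request that cordons a node, using only the Go standard library. The server address and node name are placeholders, and the helper name cordonRequest is made up for this illustration; real code would normally go through client-go or kubectl rather than raw HTTP.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// nodePatch mirrors only the part of the Node object that cordoning touches.
type nodePatch struct {
	Spec struct {
		Unschedulable bool `json:"unschedulable"`
	} `json:"spec"`
}

// cordonRequest is a hypothetical helper: it builds the PATCH request that
// marks a node unschedulable and returns it together with the patch body.
// Authentication and TLS setup are deliberately left out.
func cordonRequest(apiServer, nodeName string) (*http.Request, []byte, error) {
	var p nodePatch
	p.Spec.Unschedulable = true

	body, err := json.Marshal(p)
	if err != nil {
		return nil, nil, err
	}

	url := apiServer + "/api/v1/nodes/" + nodeName
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	// A strategic merge patch only changes the fields present in the body.
	req.Header.Set("Content-Type", "application/strategic-merge-patch+json")
	return req, body, nil
}

func main() {
	// Placeholder address and node name, for illustration only.
	req, body, err := cordonRequest("https://kubernetes.example:6443", "worker-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	fmt.Println(string(body))
}
```

Uncordoning is the same request with unschedulable set to false.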

2. evicts or deletes the pods

After the node is made unschedulable, the drain command will try to evict the pods that are already running on that node. If eviction is supported on the cluster (from Kubernetes version 1.7), the drain command will use the Eviction API, which takes disruption budgets into account; if it’s not supported, it will simply delete the pods on the node. Let’s look into these options next.

Deleting pods on a node

Let’s start with the simple case, in which the Eviction API cannot be used. This is how it looks in Go code:

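// gracePeriodSeconds is an int64: how long the pod gets to shut down cleanly.
// Note: client-go v0.18+ takes a context as the first argument and DeleteOptions by value.
err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{
	GracePeriodSeconds: &gracePeriodSeconds,
})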

Other than trivialities like calling the Delete method of the K8S client, the first thing you’ll notice is GracePeriodSeconds. As always, Kubernetes’ excellent documentation will help explain a few things:

“Because pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up).”

Cleaning up can mean a lot of things, like completing any outstanding HTTP requests, making sure that data is flushed properly when writing a file, finishing a batch job, rolling back transactions, or saving state to external storage like S3. There is a timeout that facilitates clean up, called the grace period. Note that when you call delete on a pod it returns asynchronously, and you should always poll that pod and wait until the deletion finishes or the grace period ends. Check the Kubernetes documentation to learn more.
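The polling advice above can be sketched roughly as follows. This is a simplified illustration, not how kubectl implements it: the exists callback is a hypothetical stand-in for a GET on the pod (which would report false once the API server answers 404 Not Found), and the function and error names are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStillPresent is returned when the pod has not disappeared in time.
var errStillPresent = errors.New("pod still present after grace period")

// waitForDeletion polls the (hypothetical) exists callback until the pod is
// gone or the grace period has elapsed, whichever comes first.
func waitForDeletion(exists func() (bool, error), gracePeriod, interval time.Duration) error {
	deadline := time.Now().Add(gracePeriod)
	for {
		present, err := exists()
		if err != nil {
			return err
		}
		if !present {
			return nil // the pod is gone, deletion finished
		}
		// Give up if the next poll would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errStillPresent
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Fake lookup for the demo: the pod disappears on the third poll.
	fake := func() (bool, error) {
		calls++
		return calls < 3, nil
	}
	err := waitForDeletion(fake, time.Second, 10*time.Millisecond)
	fmt.Println("polls:", calls, "err:", err)
}
```

In real code the grace period passed here would typically match the GracePeriodSeconds sent with the delete call, plus a small margin for API latency.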

If the node is disrupted involuntarily, the processes in the pods will have no chance to exit gracefully. So let’s go back to our example of spot instance termination: if all we can do in the two minutes before the VM is terminated is cordon the node and call Delete on the pods with a grace period of about two minutes, we’re still better off than if we just let our instance die. But Kubernetes provides us with some better options.

Evicting pods from a node

From Kubernetes 1.7 onward, there’s been an option to use the Eviction API instead of directly deleting pods. First let’s look at the Go code again and note how it differs from the code above. It’s easy to see that this is a different API call, but we still have to provide pod.Namespace, pod.Name and DeleteOptions along with the grace period. Although it looks very similar at a glance, we also have to add some meta info (EvictionKind and APIVersion).

eviction := &policyv1beta1.Eviction{
	TypeMeta: metav1.TypeMeta{
		APIVersion: policyGroupVersion,
		Kind:       EvictionKind,
	},
	ObjectMeta: metav1.ObjectMeta{
		Name:      pod.Name,
		Namespace: pod.Namespace,
	},
	DeleteOptions: &metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriodSeconds,
	},
}
client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(eviction)

So what does it add to the delete API?

Kubernetes has a resource type - poddisruptionbudget, or pdb - that can be attached to a deployment’s pods via label selectors. According to the documentation:

A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.

The following simplified example of a PDB specifies that the number of available pods of the nginx app cannot fall below 70% at any time (see more examples here):

kubectl create pdb my-pdb --selector=app=nginx --min-available=70%

The Eviction API only allows a pod to be evicted if doing so doesn’t violate a PDB. If no PDB would be broken by the eviction, the pod is deleted gracefully, just as it would be with a simple Delete. If the eviction is not granted because a PDB does not allow it, the API returns 429 Too Many Requests. See more details here.

If you call the drain command and it cannot evict a pod because of a PDB, it will sleep five seconds and retry. You can try this by creating a basic nginx deployment with two replicas, adding the PDB above, finding a node on which one of the pods is scheduled, and trying to drain it with this command (--v=6 is all that’s necessary to see the Too Many Requests messages that are returned):

kubectl --v=6 drain <node-name> --force

This should work in most cases, because if you’re setting values in a PDB that make sense (e.g. min 2 available, replicas set to 3, and pod anti-affinity set for hostnames), then a blocked eviction should only be a temporary state for the cluster - the controller will try to restore the three replicas, and will succeed if there are free resources in the cluster. Once it restores the three replicas, drain will be able to proceed.
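The retry behaviour can be sketched as a small loop. This is an illustration under assumptions rather than the actual drain code: evict is a hypothetical callback that performs the Eviction API call and returns the HTTP status code, and the attempt limit is something this sketch adds so that it always terminates.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errBudgetBlocked means every attempt was rejected with 429.
var errBudgetBlocked = errors.New("eviction still blocked by a disruption budget")

// evictWithRetry calls the (hypothetical) evict callback until it succeeds,
// fails with something other than 429, or runs out of attempts.
// It returns the number of attempts made.
func evictWithRetry(evict func() (int, error), retryDelay time.Duration, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err := evict()
		if err != nil {
			return attempt, err
		}
		switch {
		case status >= 200 && status < 300:
			return attempt, nil // eviction granted
		case status == http.StatusTooManyRequests:
			// A PDB would be violated right now; wait and try again.
			if attempt < maxAttempts {
				time.Sleep(retryDelay)
			}
		default:
			return attempt, fmt.Errorf("unexpected status %d", status)
		}
	}
	return maxAttempts, errBudgetBlocked
}

func main() {
	calls := 0
	// Fake API for the demo: rejected twice, then granted.
	fake := func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusCreated, nil
	}
	// drain waits five seconds between tries; the demo uses a short delay.
	attempts, err := evictWithRetry(fake, 10*time.Millisecond, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Bounding the number of attempts is a design choice for this sketch; it gives the caller a chance to notice that a budget never frees up instead of waiting indefinitely.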

Also, note that eviction and drain can cause deadlocks, in which drain will wait forever. Usually these are misconfigurations, like in my very simple example, where neither of the two nginx replicas can be evicted because of the 70% threshold, but deadlocks may occur in real-world situations as well. The Eviction API won’t start new replicas on other nodes or do any other magic; it simply returns Too Many Requests. To handle these cases, you must intervene manually (e.g. by temporarily adding a new replica), or write your code in a way that detects them.
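The arithmetic behind that deadlock is easy to check by hand. The sketch below models it in a simplified way, assuming the documented behaviour that a percentage minAvailable is rounded up to a whole number of pods; the function names are made up for this example and are not the controller’s own.

```go
package main

import "fmt"

// desiredHealthy converts a percentage minAvailable into a pod count,
// rounding up (integer ceiling of percent * expected / 100).
func desiredHealthy(minAvailablePercent, expectedPods int) int {
	return (minAvailablePercent*expectedPods + 99) / 100
}

// disruptionsAllowed is how many healthy pods may be evicted right now
// without dropping below the budget.
func disruptionsAllowed(minAvailablePercent, expectedPods, healthyPods int) int {
	allowed := healthyPods - desiredHealthy(minAvailablePercent, expectedPods)
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	// With a 70% budget, compare a two-replica and a four-replica deployment.
	for _, replicas := range []int{2, 4} {
		fmt.Printf("replicas=%d required=%d allowed=%d\n",
			replicas,
			desiredHealthy(70, replicas),
			disruptionsAllowed(70, replicas, replicas))
	}
}
```

With two replicas, 70% rounds up to two required pods, so zero disruptions are allowed and every eviction is rejected. With four replicas, three are required and one pod can be evicted at a time.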

Special pods to delete

Let’s complicate things even further. There are some pods that can’t be simply deleted or evicted. The drain command uses four different filters when checking for pods to delete, and these filters can temporarily reject the drain or the drain can move on without touching certain pods:

DaemonSet filter

The DaemonSet controller ignores unschedulable markings, so a pod that belongs to a DaemonSet will be immediately replaced. If there are pods belonging to a DaemonSet on the node, the drain command proceeds only if the --ignore-daemonsets flag is set to true, but even then it won’t delete those pods, because the DaemonSet controller would simply recreate them. Usually it doesn’t cause problems if a DaemonSet pod is deleted along with its node (think of node exporters, log collectors, storage daemons, etc.), so in most cases this flag can be set.

Mirror pods filter

drain uses the Kubernetes API server to manage pods and other resources, and mirror pods are merely the corresponding read-only API resources of static pods - pods that are managed directly by the Kubelet, without the API server managing them. Mirror pods are visible from the API server but cannot be controlled, so drain won’t delete these either.

Unreplicated filter

If a pod has no controller, it cannot be safely deleted, because it won’t be rescheduled to a new node. It’s usually advised that you not have pods without controllers (not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet), but if you still have pods like this, and want to write code that handles voluntary node disruptions, it’s up to the implementation as to whether it will delete these pods or fail. The drain command lets the user decide: when --force is set, unreplicated pods will be deleted (or evicted); if it’s not set, drain will fail.

When using Go, the k8s apimachinery package has a util function that returns the controller for a pod, or nil if there’s no controller for it: metav1.GetControllerOf(&pod)

LocalStorage filter

This filter checks whether a pod uses an emptyDir volume. If the pod uses emptyDir to store local data, it may not be safe to delete, because when a pod is removed from a node the data in the emptyDir is deleted with it. Just like with the unreplicated filter, it is up to the implementation to decide what to do with these pods. drain provides a switch for this as well: if --delete-local-data is set (newer kubectl releases call this flag --delete-emptydir-data), drain will proceed even if there are pods using emptyDir, and will delete the pods and therefore the local data as well.
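Put together, the four filters amount to a small decision function. The sketch below is a simplified model and not the kubectl source: the pod and options types are hypothetical stand-ins that carry only the fields the filters care about, and shouldDelete is a name made up for the example. The one real identifier is the kubernetes.io/config.mirror annotation, which marks mirror pods.

```go
package main

import "fmt"

// pod is a simplified, hypothetical view of what the filters inspect.
type pod struct {
	Name           string
	ControllerKind string            // "" means no controller owns the pod
	Annotations    map[string]string // mirror pods carry mirrorAnnotation
	UsesEmptyDir   bool
}

// options loosely mirror the drain switches discussed above.
type options struct {
	IgnoreDaemonSets bool
	Force            bool
	DeleteLocalData  bool
}

// mirrorAnnotation is set on mirror pods created for static pods.
const mirrorAnnotation = "kubernetes.io/config.mirror"

// shouldDelete reports whether a pod should be deleted or evicted.
// A non-nil error means the whole drain should be rejected.
func shouldDelete(p pod, o options) (bool, error) {
	// Mirror pods cannot be controlled through the API server: skip them.
	if _, isMirror := p.Annotations[mirrorAnnotation]; isMirror {
		return false, nil
	}
	// DaemonSet pods would be recreated at once: skip them, or refuse to drain.
	if p.ControllerKind == "DaemonSet" {
		if !o.IgnoreDaemonSets {
			return false, fmt.Errorf("pod %s is managed by a DaemonSet", p.Name)
		}
		return false, nil
	}
	// Unreplicated pods are lost for good, so the caller must opt in.
	if p.ControllerKind == "" && !o.Force {
		return false, fmt.Errorf("pod %s has no controller", p.Name)
	}
	// emptyDir data disappears with the pod, so the caller must opt in.
	if p.UsesEmptyDir && !o.DeleteLocalData {
		return false, fmt.Errorf("pod %s uses emptyDir local storage", p.Name)
	}
	return true, nil
}

func main() {
	pods := []pod{
		{Name: "web-1", ControllerKind: "ReplicaSet"},
		{Name: "node-exporter", ControllerKind: "DaemonSet"},
		{Name: "etcd-master", Annotations: map[string]string{mirrorAnnotation: "abc"}},
		{Name: "debug-shell"},
		{Name: "cache-1", ControllerKind: "StatefulSet", UsesEmptyDir: true},
	}
	opts := options{IgnoreDaemonSets: true}
	for _, p := range pods {
		ok, err := shouldDelete(p, opts)
		fmt.Printf("%-14s delete=%-5t err=%v\n", p.Name, ok, err)
	}
}
```

A drain-like implementation would run every pod on the node through such a function first, and only start evicting once none of them returned an error.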

Spot instance termination

We use drain-like logic to handle AWS spot instance terminations. We monitor AWS spot instance terminations with Prometheus, and have Hollowtrees configured to call our Kubernetes action plugin to drain the node. AWS gives notice two minutes in advance, which is usually enough time to gracefully delete the pods while also respecting PodDisruptionBudgets. Our action plugin uses logic very similar to the drain command, but ignores DaemonSets and mirror pods, and force deletes unreplicated and emptyDir pods by default.

If you’d like to learn more about Banzai Cloud check out our other posts in the blog, the Pipeline and Hollowtrees projects.