In this post, we’ll help you understand the automatic pod eviction and rescheduling that occurs when a particular host resource is being depleted.

The “kubelet” agent daemon is installed on all Kubernetes hosts to manage container creation and termination. By default, this daemon has the following eviction rule: memory.available<100Mi. So, when a host is low on memory, kubelet will evict one of the running pods to free memory on the host machine. (This is not a random decision, and we’ll describe its logic a bit later in this post.)

You can pass special flags to kubelet at startup to specify how pods are evicted when memory or disk space runs low. Separate settings exist for the disk space that holds container images and for the disk space used by the running containers themselves.

Here are the eviction signals you can use in these flags:

memory.available: Free memory on the host server

nodefs.available: Free space on the containers filesystem (Docker volumes, logs, etc.)

nodefs.inodesFree: Available inodes on the containers filesystem

imagefs.available: Free space on the images filesystem (Docker images and container writable layers)

imagefs.inodesFree: Available inodes on the images filesystem
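
To make the relationship between signals and thresholds concrete, here is a minimal Python sketch that parses a threshold expression such as the default memory.available<100Mi and checks it against observed values. The helper names (parse_quantity, parse_thresholds, breached_signals) are hypothetical and are not part of kubelet. The sketch handles only plain integers and binary suffixes; other forms that kubelet accepts, such as percentages, are left out.

```python
# Binary suffixes used in Kubernetes-style quantities (a subset, for illustration).
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(text: str) -> int:
    """Convert a value such as '100Mi' or '500' to a plain integer."""
    text = text.strip()
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def parse_thresholds(spec: str) -> dict:
    """Parse 'signal<quantity,signal<quantity' into {signal: limit}."""
    thresholds = {}
    for item in spec.split(","):
        signal, _, quantity = item.partition("<")
        thresholds[signal.strip()] = parse_quantity(quantity)
    return thresholds


def breached_signals(thresholds: dict, observed: dict) -> list:
    """Return the signals whose observed value is below its threshold.

    A signal with no observed value is treated as not breached.
    """
    return sorted(
        signal
        for signal, limit in thresholds.items()
        if observed.get(signal, limit) < limit
    )


default_rule = parse_thresholds("memory.available<100Mi")
# The host reports 80Mi of free memory, which is below the 100Mi threshold.
print(breached_signals(default_rule, {"memory.available": 80 * 1024**2}))
```

With 80Mi observed, the default rule is breached, which is the situation in which kubelet would start looking for a pod to evict.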

And these are the flags to apply kubelet eviction policies:

1. The “--eviction-soft” and “--eviction-soft-grace-period” flags must be used together. If you don’t specify a grace period for a soft limit, the kubelet will fail to start, displaying an error like:

error: failed to run Kubelet: failed to create kubelet: grace period must be specified for the soft eviction threshold nodefs.available

With “--eviction-soft”, you specify a threshold per resource. For example:

--eviction-soft="nodefs.available<1Gi,nodefs.inodesFree<500"

Then set the grace period (in time units) that must pass before eviction starts. For example:

--eviction-soft-grace-period="nodefs.available=1h,nodefs.inodesFree=1h"

In this example, if free disk space on the nodefs filesystem (container volumes and logs) stays below 1Gi for more than an hour, or its free inodes stay below 500 for more than an hour, kubelet will select a pod to terminate.
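
The interplay between a soft threshold and its grace period can be sketched in a few lines of Python. This is a simplified model, not kubelet's implementation: the function name soft_eviction_time and the sample format (time in seconds, available bytes) are assumptions made for illustration. It reports the first moment at which the signal has stayed below the threshold for the whole grace period.

```python
from typing import Iterable, Optional, Tuple

GI = 1024**3  # one gibibyte in bytes


def soft_eviction_time(
    samples: Iterable[Tuple[int, int]],
    threshold: int,
    grace_seconds: int,
) -> Optional[int]:
    """Return the sample time at which a soft eviction would start, or None.

    samples: (time_in_seconds, available_amount) pairs in chronological order.
    The breach timer resets whenever the signal recovers above the threshold.
    """
    breached_since = None
    for time_s, available in samples:
        if available < threshold:
            if breached_since is None:
                breached_since = time_s
            if time_s - breached_since >= grace_seconds:
                return time_s
        else:
            breached_since = None
    return None


# Free space stays below 1Gi from t=1800s onward: the 1h grace period expires at t=5400s.
steady = [(0, 2 * GI), (1800, GI // 2), (3600, GI // 2), (5400, GI // 2)]
# Free space recovers at t=2700s, so the timer restarts and no eviction happens here.
recovering = [(0, GI // 2), (1800, GI // 2), (2700, 2 * GI), (3600, GI // 2), (5400, GI // 2)]

print(soft_eviction_time(steady, GI, 3600))
print(soft_eviction_time(recovering, GI, 3600))
```

The second series shows why a short recovery matters: the breach has to be continuous for the full grace period before a pod is selected.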

To allow your pods time for a clean process shutdown, set “--eviction-max-pod-grace-period”. This is the maximum time, in seconds, that kubelet gives your pod containers to shut down gracefully once a soft eviction threshold has been met.
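
As a sketch of how the three soft-eviction flags fit together, the following Python helper assembles them from dictionaries and refuses to continue when a soft threshold has no grace period, mirroring the startup error shown earlier. The function build_soft_eviction_flags is hypothetical; only the flag names come from kubelet.

```python
def build_soft_eviction_flags(thresholds, grace_periods, max_pod_grace_seconds):
    """Build kubelet soft-eviction flags as a list of strings.

    thresholds:    {signal: quantity}, e.g. {"nodefs.available": "1Gi"}
    grace_periods: {signal: duration}, e.g. {"nodefs.available": "1h"}
    max_pod_grace_seconds: integer number of seconds
    """
    missing = sorted(set(thresholds) - set(grace_periods))
    if missing:
        # kubelet itself fails to start in this situation.
        raise ValueError(
            "grace period must be specified for: " + ", ".join(missing)
        )
    soft = ",".join(f"{signal}<{value}" for signal, value in thresholds.items())
    grace = ",".join(f"{signal}={grace_periods[signal]}" for signal in thresholds)
    return [
        f"--eviction-soft={soft}",
        f"--eviction-soft-grace-period={grace}",
        f"--eviction-max-pod-grace-period={max_pod_grace_seconds}",
    ]


flags = build_soft_eviction_flags(
    {"nodefs.available": "1Gi", "nodefs.inodesFree": "500"},
    {"nodefs.available": "1h", "nodefs.inodesFree": "1h"},
    60,  # illustrative value, not a recommendation
)
print("\n".join(flags))
```

Validating the pairing before kubelet starts is a design choice of this sketch; it simply surfaces the configuration mistake earlier.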

2. “--eviction-hard” does not allow any time for graceful shutdown of a container. If the limit is exceeded, kubelet will take immediate action and terminate a chosen pod to free the resource. This flag is used exactly like “--eviction-soft”, and you can choose from the five signals listed earlier.
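
The difference in shutdown time between soft and hard evictions can be summarized in a small Python function. It assumes the rule described in the upstream Kubernetes documentation, where a soft eviction uses the lesser of the pod's own termination grace period and “--eviction-max-pod-grace-period”, while a hard eviction uses no grace period at all. The function name is hypothetical.

```python
def effective_grace_seconds(pod_grace_seconds: int,
                            max_pod_grace_seconds: int,
                            hard: bool) -> int:
    """Return how long an evicted pod gets to shut down, in seconds.

    Assumption of this sketch: soft evictions take the smaller of the pod's
    own grace period and the kubelet maximum; hard evictions get zero.
    """
    if hard:
        return 0
    return min(pod_grace_seconds, max_pod_grace_seconds)


print(effective_grace_seconds(30, 60, hard=False))   # pod asks for less than the maximum
print(effective_grace_seconds(120, 60, hard=False))  # capped by the kubelet maximum
print(effective_grace_seconds(120, 60, hard=True))   # hard eviction: no grace period
```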

3. “--eviction-minimum-reclaim” helps to avoid flapping. Flapping is when hosts frequently reach eviction thresholds and, after a minimal cleanup by kubelet, quickly become full again. With this flag, you set the amount of “extra” space, memory, or inodes that kubelet must reclaim once an eviction signal occurs, on top of the minimum needed to get back above the threshold. Let’s look at the following scenario:

Your host has 8Gi memory, with a hard limit set at 200Mi. You run tens of small pods with an average consumption of 60–80Mi. After some time, 180Mi free memory is left and kubelet kills one pod, freeing up 60Mi. The host now has 240Mi free. The next pod is scheduled, because it requests only 20Mi. (Without a limit set, or with a 500Mi limit set, it can still be scheduled to this node because the Kubernetes scheduler makes decisions based on requests, not limits.) This 20Mi pod quickly eats up 100Mi, leaving 140Mi free and triggering another eviction. But if you’d set a minimum reclaim of 500Mi, then upon dropping below 200Mi, kubelet would clean up much more space, and the host would operate with no evictions for a longer time period.

When kubelet observes an eviction signal (indicating a hard or soft limit is exceeded), it switches one of the “MemoryPressure” or “DiskPressure” node conditions to true. No new “BestEffort” “quality of service” (QoS) pods will be sent to a node that has “MemoryPressure”. No new pods will be scheduled on a node with “DiskPressure”.
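
The flapping scenario above can be replayed with a toy Python simulation. It assumes that memory use grows by a fixed amount per tick and that every evicted pod frees the same amount; neither assumption holds on a real host, and the function name simulate is hypothetical. The point is only to compare how often the eviction threshold is hit with and without a minimum reclaim.

```python
def simulate(free_mi, threshold_mi, min_reclaim_mi, pod_size_mi,
             growth_per_tick_mi, ticks):
    """Return (eviction_rounds, pods_evicted) for a toy memory model.

    Each tick the workloads consume growth_per_tick_mi. When free memory
    drops below the threshold, pods are evicted one by one until free
    memory reaches threshold + minimum reclaim.
    """
    rounds = 0
    evicted = 0
    for _ in range(ticks):
        free_mi -= growth_per_tick_mi
        if free_mi < threshold_mi:
            rounds += 1
            target = threshold_mi + min_reclaim_mi
            while free_mi < target:
                free_mi += pod_size_mi
                evicted += 1
    return rounds, evicted


# Hard limit 200Mi, pods of 60Mi, memory use growing by 20Mi per tick.
without_reclaim = simulate(260, 200, 0, 60, 20, 30)
with_reclaim = simulate(260, 200, 500, 60, 20, 30)
print("no minimum reclaim:   ", without_reclaim)
print("500Mi minimum reclaim:", with_reclaim)
```

Under this toy model both runs evict the same number of pods, but without a minimum reclaim the threshold is hit again every few ticks, while with a 500Mi minimum reclaim all evictions happen in a single round and the host then runs undisturbed for the rest of the simulation.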

After kubelet does its cleanup, the node’s transition back to a normal state is delayed for the duration of the “--eviction-pressure-transition-period” kubelet setting. Use this option to avoid oscillation of node conditions, which can happen when pods repeatedly cross a soft limit but drop back below it before the soft grace period expires (so no pod gets terminated).
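
The delaying effect of the transition period can be modeled as simple hysteresis. The Python sketch below is an illustration under stated assumptions, not kubelet code: the function pressure_condition and its sample format (time in seconds, whether a threshold is currently breached) are hypothetical.

```python
def pressure_condition(samples, transition_seconds):
    """Return [(time, under_pressure)] for chronological (time, breached) samples.

    The condition turns true as soon as a breach is seen and turns false
    only after no breach has been seen for transition_seconds.
    """
    last_breach = None
    result = []
    for time_s, breached in samples:
        if breached:
            last_breach = time_s
        under_pressure = last_breach is not None and (
            breached or time_s - last_breach < transition_seconds
        )
        result.append((time_s, under_pressure))
    return result


samples = [(0, False), (60, True), (120, False), (300, False), (360, False), (420, True)]
# With a 300-second transition period the condition stays true until t=360.
print(pressure_condition(samples, 300))
# With no transition period the condition flips back immediately at t=120.
print(pressure_condition(samples, 0))
```

Comparing the two printed series shows how a longer transition period smooths out short recoveries that would otherwise make the node condition oscillate.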

The logic behind the decision to terminate a particular pod is based on the QoS class of the pod and its current resource usage relative to what it requested at pod start. There are three QoS classes:

● BestEffort

● Guaranteed

● Burstable
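
The selection logic described before the list (QoS class first, then usage relative to requests) could be sketched as a sort key. This is a deliberate simplification that follows this post's description; kubelet's real ranking takes more into account, such as pod priority. The function eviction_order and the pod dictionary fields are hypothetical.

```python
# Lower rank means evicted earlier in this simplified model.
QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


def eviction_order(pods):
    """Return pod names in the order this sketch would evict them.

    pods: list of dicts with 'name', 'qos', 'usage_mi' and 'request_mi'.
    Within a QoS class, the pod furthest above its request goes first.
    """
    def sort_key(pod):
        over_request = pod["usage_mi"] - pod["request_mi"]
        return (QOS_RANK[pod["qos"]], -over_request)

    return [pod["name"] for pod in sorted(pods, key=sort_key)]


pods = [
    {"name": "db", "qos": "Guaranteed", "usage_mi": 100, "request_mi": 100},
    {"name": "api", "qos": "Burstable", "usage_mi": 150, "request_mi": 50},
    {"name": "batch", "qos": "BestEffort", "usage_mi": 30, "request_mi": 0},
    {"name": "cache", "qos": "Burstable", "usage_mi": 60, "request_mi": 50},
]
print(eviction_order(pods))
```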

QoS classes are determined automatically, based on requests and limits specified in the pod manifest.

If a pod has “requests” set (for CPU or memory) but no “limits” (or its limits are higher than its requests), it is assigned the “Burstable” class, which means it is scheduled based on its requests and can use more resources when other pods don’t need them. On the other hand, “Guaranteed” pods (those whose containers specify the same values for requests and limits, or specify only limits, for both CPU and memory) are considered top priority and are guaranteed not to be killed unless they exceed their limits, or the system is under memory pressure and there are no lower-priority containers that can be evicted. Pods that specify neither requests nor limits are “BestEffort” and are the first candidates for eviction.
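
The classification rules can be written down as a short Python function. It is a sketch of the rules as commonly documented for Kubernetes, not the actual implementation: the function qos_class and the plain-dictionary container format are assumptions, and quantities are compared as the literal strings given (so "500m" and "0.5" would not be treated as equal here).

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod from its containers' requests and limits.

    containers: list of dicts that may hold 'requests' and 'limits' dicts,
    e.g. {"requests": {"memory": "64Mi"}, "limits": {"memory": "128Mi"}}.
    """
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"memory": "64Mi"}}]))
print(qos_class([{"limits": {"cpu": "500m", "memory": "128Mi"}}]))
```

The three calls cover a pod with nothing specified, a pod with only a memory request, and a pod with only limits for both resources.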

Important: pods will not be killed if they exceed their CPU limit. Instead, their container processes are throttled, and the CPU time they actually receive depends on other pods’ consumption and limits, as well as on the system processes running on the host.
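
To give a feel for what throttling means in numbers, the sketch below converts a CPU limit into a run-time budget per scheduling period. It assumes CPU limits are enforced by the Linux CFS bandwidth controller with a 100 ms period, which is a common setup but an assumption about your hosts. The function names are hypothetical.

```python
def parse_millicores(cpu: str) -> int:
    """Convert a CPU quantity such as '500m' or '1.5' to millicores."""
    cpu = cpu.strip()
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(round(float(cpu) * 1000))


def cfs_quota_us(cpu_limit: str, period_us: int = 100_000) -> int:
    """Return the CPU time budget (microseconds) per period for a limit.

    Assumption: a 100 ms period; a container that uses up its budget is
    paused until the next period starts instead of being killed.
    """
    return parse_millicores(cpu_limit) * period_us // 1000


for limit in ("250m", "500m", "2"):
    print(limit, "->", cfs_quota_us(limit), "us per 100000 us period")
```

A container limited to 500m may therefore run for about half of each period in total, however busy it is, which is why exceeding a CPU limit slows a pod down rather than terminating it.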

When a pod is evicted by kubelet and it is managed by a controller (such as a Deployment or ReplicaSet), a replacement pod is created right away and scheduled on a host where enough resources are available. The replacement may land on the same machine again if you have an incorrect combination of kubelet settings and pod requests/limits.

Fine-tuning cluster rebalancing mechanisms is not an easy task and requires a good understanding of all the options. But finding the combination of settings that matches the needs of your workload is definitely worth the effort, because it lets you overcommit safely. You’ll also achieve good utilization of cluster resources, leading to fewer servers running and a lower monthly bill (assuming you’re in the cloud) for computing and storage.

To learn more about how Kubernetes manages out-of-resource situations, including a description of Linux OOM killer behavior, see the Kubernetes documentation on node-pressure eviction.

Need a user-friendly tool to set up and manage your K8S cluster? Check out Kublr-in-a-Box. To learn more, visit kublr.com.