
Kubernetes Scheduler

As you probably know, a Kubernetes cluster is made up of master and worker nodes. The scheduler is a master-node component responsible for deciding which worker node should run a given pod.

Scheduling is a complex task, and as with any optimisation problem, you will always find a scenario in which the result seems sub-optimal to the human eye.

Nonetheless, the default scheduler does a pretty decent job. It follows common-sense strategies, such as avoiding scheduling multiple replicas of a pod on the same node, or avoiding overloading a node while others remain under-allocated.

I am not familiar with the code behind the default scheduler, but my observations have taught me that balancing node workload takes priority over preventing duplicates from running on the same node.

This means that you may find yourself in a situation where all replicas of a pod are on the same node.

Therefore, if the node goes down — and you should assume that it will — availability will be impacted.

Interestingly, Kubernetes usually works by strictly enforcing some sort of desired state. For example, if you say you want three replicas of pod A, Kubernetes will immediately spawn new pod(s) should that number drop below three. This is not the case for scheduling: Kubernetes will try to spread your pods intelligently when it creates them but will not proactively enforce a good spread afterwards.
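For illustration, the replica enforcement described above could be declared with a Deployment like this (a sketch; the names and image are made up for the example):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 3                # Kubernetes keeps exactly three pods running
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: example/service-a:v1   # hypothetical image
```

If a node dies and the pod count drops to two, the controller immediately creates a replacement; where that replacement lands, however, is decided only once, at creation time.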

Scheduling happens at pod creation only.

High availability

Running multiple replicas of a pod will not guarantee high availability, but high availability cannot be achieved without it. For obvious reasons, you should never have all the replicas of a pod on the same node.
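One way to nudge the scheduler in that direction is a soft pod anti-affinity rule in the pod template. A sketch, assuming the pods are labelled app: service-a:

```yaml
# Fragment of a pod template spec; labels are illustrative.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: service-a
          topologyKey: "kubernetes.io/hostname"   # prefer one replica per node
```

Note the IgnoredDuringExecution suffix: like everything scheduling-related, this rule is only evaluated when a pod is created.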

High Availability is not an absolute concept; you can never be 100% highly available. How can your application be available if all the nodes go down simultaneously?

So you generally need to choose a level of availability you are comfortable with. For example, if you are running three nodes in three separate availability zones, you may choose to be resilient to a single node failure. Losing two nodes might bring your application down, but the odds of losing two data centres in separate availability zones are low.
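On recent Kubernetes versions, zone-level spread can also be expressed explicitly with topology spread constraints. A sketch, again assuming an app: service-a label:

```yaml
# Fragment of a pod template spec.
topologySpreadConstraints:
  - maxSkew: 1                                  # zones may differ by at most one replica
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway           # prefer spreading, but don't block scheduling
    labelSelector:
      matchLabels:
        app: service-a
```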

The bottom line is that there is no universal approach; only you can know what works for your business and the level of risk you deem acceptable.

When balance becomes an issue

Let’s have a look at a scenario where scheduling can jeopardise availability.

Ideal scheduling

Let’s say we start in an ideal state of scheduling. We have four services, A, B, C, and D, each scaled to run three replicas.

If Node 3 were to die, the master node(s) would notice that we only have two pods of each service and would immediately reschedule one of each on the remaining healthy nodes. This could lead to:

Node 3 is dead, the remaining nodes share the load

After some time (typically a few minutes), your cloud provider will notice that a node is down and will spin up another one. When that node is ready, we end up with a cluster like this:

An unbalanced cluster


The cluster is unbalanced but availability is not lost, yet… If we lose any of the nodes we still have at least one replica of each pod running.

However, let’s see what happens when we start soliciting the scheduler by releasing a new version of service A. This is typically done with a rolling deployment.

Kubernetes will start a new pod for service A , wait for the pod to be ready, and kill one of the old pods. It continues doing so until all the old pods have been replaced.
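That behaviour corresponds to a Deployment strategy along these lines (values are illustrative):

```yaml
# Fragment of a Deployment spec.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # create at most one new pod above the desired count
    maxUnavailable: 0    # only kill an old pod once its replacement is ready
```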

So, the scheduler is asked to place a new A v2 pod, and the only place it can do so is Node 3. So far so good. However, where should it put the other two? As mentioned above, the scheduler tries to avoid having duplicates on the same node, but in this case Node 3 is massively under-utilised compared to the other nodes. So there is a high chance that all the new A v2 pods will be scheduled on Node 3, which would lead to this situation:

And now we have a problem: all the A pods are on a single node. If Node 3 were to fail again, we would lose availability for service A.

The whole process is described by this magnificent GIF:

De-scheduler

One way to bring back order is by doing a bit of chaos engineering. The scheduler can only fix a balance problem when pods are created. So let’s get killing to force creation of new pods!

Enter the descheduler. This project runs as a Kubernetes Job that kills pods when it thinks the cluster is unbalanced. You can run it once, or periodically as a CronJob.

The installation is pretty straightforward and well explained on the GitHub page.

You can use a variety of strategies to delete pods, which are defined in a ConfigMap. Let’s review two of them.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true

Using the RemoveDuplicates strategy, the descheduler will find identical pods running on the same node. It will then kill some of them, hoping the scheduler will place the replacements on another node.

Note that this wouldn’t help in the scenario outlined above, because we still haven’t resolved the workload-balance problem.

However, the LowNodeUtilization strategy can help with that:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50

You must define what it means for a node to be under- and over-utilised. In this example, a node below 20% CPU utilisation, below 20% memory utilisation, and with fewer than 20 pods is considered under-utilised. A node above 50% CPU utilisation, or above 50% memory utilisation, or with more than 50 pods is considered over-utilised.

For this strategy to work, the descheduler will need to find at least one under-utilised node and one over-utilised node.

As usual, there are no universal utilisation parameters that will work well everywhere. You need to tweak them and find out what works best for you.

If you are really confident in your availability, you could run this job every hour. Otherwise, you could run it at night, when impact on traffic is low. In the latter case, it might be useful to run it a few times in a row. For example, you could run it every day at 3am, 3:15am, and 3:30am.
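That schedule can be expressed directly in a CronJob. This is a heavily trimmed sketch: the real manifest also needs the service account, RBAC, and the policy ConfigMap mount described on the project’s GitHub page, and the image tag should match your cluster version.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler-cronjob
spec:
  schedule: "0,15,30 3 * * *"   # every day at 3:00, 3:15 and 3:30
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: "Never"
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.29.0
              command:
                - "/bin/descheduler"
                - "--policy-config-file"
                - "/policy-dir/policy.yaml"   # mounted from the policy ConfigMap
```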