I was in charge of managing multiple clusters, each with 4 to 50 nodes and up to 200 different microservices and applications running per cluster. To utilize the available hardware resources more efficiently, the majority of these deployments were configured with burstable RAM and CPU resources. This way, pods can use spare resources when they actually need them, but they don’t block other applications from being scheduled on that node. Sounds great, doesn’t it?
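
To make this concrete, here is a minimal sketch of such a burstable configuration (pod name and image are hypothetical). The requests are what the scheduler reserves on a node; the limits are what the pod may burst up to:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service      # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0 # hypothetical image
      resources:
        requests:
          memory: "256Mi"    # reserved on the node at scheduling time
          cpu: "100m"
        limits:
          memory: "1Gi"      # the pod may burst up to this before it is OOM-killed
          cpu: "500m"
```

Because the requests are lower than the limits, Kubernetes assigns this pod the Burstable QoS class.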

Even though our overall cluster CPU usage (8%) and RAM usage (40%) were relatively low, we often faced problems with evicted pods. Those pods got evicted because they tried to allocate more RAM than was available on their node. Back then we only had a single dashboard for monitoring the Kubernetes resources, and it looked like this:

Grafana dashboard which only uses cAdvisor metrics

With such a dashboard one can easily spot nodes with high RAM and CPU usage, so I could quickly identify the overutilized nodes. The real struggle begins, however, when you try to fix the root cause. One option to avoid evictions would be to set guaranteed resources on all pods (resource requests and limits being equal, as sketched below). The downside is that this leads to much worse hardware utilization: cluster-wide we had hundreds of gigabytes of RAM available, yet some nodes were apparently running out of RAM while others still had 4–10 GB of free RAM.
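
For comparison, a guaranteed configuration replaces the resources block from the example above with requests and limits set to the same (here hypothetical) values, which gives the pod the Guaranteed QoS class:

```yaml
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "1Gi"      # equal to the request → Guaranteed QoS class
          cpu: "500m"
```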

Apparently the Kubernetes scheduler did not distribute our workloads evenly across the available resources. The scheduler has to respect various configurations, e.g. affinity rules, taints & tolerations, and node selectors, all of which can restrict the set of eligible nodes (examples follow below). In my use case none of these configurations were in place, though. When that is the case, pod scheduling is based on the resources requested on each node.
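
For illustration, such scheduling constraints look roughly like this in a pod spec (all labels, taint values, and zones here are hypothetical):

```yaml
spec:
  nodeSelector:
    disktype: ssd            # only consider nodes labeled disktype=ssd
  tolerations:
    - key: dedicated
      operator: Equal
      value: database
      effect: NoSchedule     # permit scheduling onto nodes tainted dedicated=database:NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["europe-west3-a"]  # hard requirement on the node’s zone
```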

The node which has the most unreserved resources, and which can satisfy the pod’s resource requests, is chosen to run the pod. Since requested and actually used resources did not align on our nodes, this selection regularly backfired. Suppose node A has 16 GB of allocatable RAM, of which 12 GB are requested but only 6 GB actually in use, while node B also has 16 GB allocatable, of which only 6 GB are requested but 14 GB are in use: the scheduler prefers B (10 GB unreserved versus 4 GB) and pushes it straight towards memory pressure, even though A has far more memory actually free. This is where Kube Eagle comes to the rescue, as it provides better resource monitoring for exactly this purpose.
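
Kube Eagle runs as an ordinary deployment inside the cluster and exposes node and pod resource metrics for Prometheus to scrape. Here is a minimal sketch of such a deployment — the image reference, tag, and metrics port are assumptions on my part, so check the project’s README for the current values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-eagle
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  template:
    metadata:
      labels:
        app: kube-eagle
      annotations:
        prometheus.io/scrape: "true"  # assumes annotation-based Prometheus service discovery
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: kube-eagle  # needs RBAC permissions to read node and pod metrics
      containers:
        - name: kube-eagle
          image: quay.io/google-cloud-tools/kube-eagle:1.1.4  # assumed image/tag — verify against the README
          ports:
            - containerPort: 8080
              name: metrics
```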