So you have your brand new Kubernetes (k8s) cluster up and running and you have deployed your first applications. You have enabled Horizontal Pod Autoscaling, and you have configured a minimum and a maximum number of nodes for your cluster, so that the Cluster Autoscaler (CA) scales out the cluster itself in case the scheduled pods do not fit into the existing nodes. You are ready for prime time! 🚀

You see that everything goes smoothly, but after some time of operation and a number of deployments, you notice the picture below:

Besides scaling out a k8s cluster, the Cluster Autoscaler is also the component that will scale it in.

In order to do this, though, you have to give it the required flexibility, which is not the default in many cases. But what do we mean by the term “flexibility”? Let’s go over some of the basics of scaling in first.

In the simplest case, if a node ends up without any running pods (DaemonSet pods excluded), it will be terminated.

Beyond the simple case above, there are many situations where your cluster may end up with nodes that are not fully utilised. In fact, there may be cases where you observe your cluster being vastly under-utilised.

According to the documentation, the Cluster Autoscaler won’t evict pods in order to scale down nodes if:

- Existing Pod Disruption Budgets: the pods have a Pod Disruption Budget (PDB) that restricts CA from evicting them. For example, you may be running 2 pods while at the same time having specified that you want at least 2 pods available at any time. Given that CA respects PDBs when evicting pods, this makes it incapable of moving them.
- kube-system PDBs: the pods reside in the kube-system namespace and you have not specified Pod Disruption Budgets for them (since CA 0.6). These PDBs are not set in GKE, for example, as the tolerance to kube-system pod disruptions is directly associated with the kind of workloads you are running in the cluster.
- Local storage: the pods have local storage attached. Warning: if you are using Istio, the istio-proxy sidecar mounts an emptyDir volume in the default setup. You need to add the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation for the Cluster Autoscaler to be able to evict them. Again, you have to make sure that you keep critical components highly available.
- Other nodes not matching pod constraints: even though a node is under-utilised, its pods may not fit into any of the other nodes.
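To make the first point concrete, a PDB that leaves CA room to evict could look like the minimal sketch below (the name and labels such as my-api are hypothetical; on clusters older than 1.21 the apiVersion would be policy/v1beta1 instead of policy/v1):

```yaml
# Hypothetical PDB for a 3-replica Deployment labelled app=my-api.
# maxUnavailable: 1 lets the CA evict one pod at a time and drain an
# under-utilised node; requiring all replicas available would block it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-api
```

For the local-storage case, the safe-to-evict annotation mentioned above goes under the pod template’s metadata (spec.template.metadata.annotations in a Deployment), not on the Deployment object itself.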

Pod resource requests are very important. Make sure that you don’t set CPU and memory requests much higher than what the pod actually needs. Otherwise you may end up with an under-utilised node whose pods’ resource requests are still high enough to prevent them from fitting onto another node, leaving CA unable to evict them.
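As a sketch of what this looks like in practice (all names and numbers below are hypothetical; the observed-usage figures would come from something like kubectl top pods):

```yaml
# Hypothetical container spec whose requests reflect observed usage rather
# than a generous guess. Over-requesting reserves node capacity the pod
# never uses and blocks the bin-packing that scale-down relies on.
containers:
- name: my-api
  image: my-api:1.0
  resources:
    requests:
      cpu: 100m       # observed usage ~80m, not a blanket 1 CPU
      memory: 128Mi   # observed usage ~100Mi
    limits:
      cpu: 500m
      memory: 256Mi
```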

Troubleshooting CA

One useful way to see whether the Cluster Autoscaler considers any of the nodes a candidate for termination (after evicting the pods out of it) is the following:

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

This will show how many nodes there are in each availability zone (if you have gone with more than one AZ) and how many of them are considered candidates for termination.

Another useful tip: given that CA uses metrics provided by the metrics-server to determine node utilisation, make sure the metrics-server is correctly gathering metrics from all nodes (kubectl top nodes should return metrics for every node).

Conclusion

Properly setting Pod Disruption Budgets and resource requests, and allowing pods with local storage to be evicted, will give the Cluster Autoscaler the flexibility to start moving your pods around. This results in the termination of under-utilised nodes, which can have a very positive impact on your infrastructure costs.