One of the great promises of Kubernetes is the ability to scale your infrastructure dynamically based on user demand. By default, however, it won’t add or remove machines. To do that, you’ll have to enable the Kubernetes Cluster Autoscaler.

In this article, we’ll explain how the Kubernetes Cluster Autoscaler scales hardware based on the needs of your Kubernetes workloads, and we’ll go over the autoscaling algorithm. We’re also sharing a tool we built to help you understand why the autoscaler is or isn’t scaling as you expected, and to see how your current infrastructure would react to enabling the autoscaler.

The Basics

Clusters are how Kubernetes groups machines. They consist of Nodes (individual machines, often virtual), which run Pods. Pods have containers that request resources such as CPU, memory, and GPU. The Cluster Autoscaler adds or removes Nodes in a Cluster based on resource requests from Pods.

[Figure: A high-level overview of a Kubernetes cluster]

When Does The Cluster Autoscaler Add Capacity?

The cluster autoscaler increases the size of the cluster when there are pods that are not able to be scheduled due to resource shortages. It can be configured to not scale up or down past a certain number of machines. Here is an overview of how it makes scaling decisions:

1. When enabled, the cluster autoscaler algorithm checks for pending pods.

2. The cluster autoscaler requests a newly provisioned node if (a) there are pods pending because the cluster does not have enough available resources to meet their requests and (b) the cluster or node pool has not reached the user-defined maximum node count.

3. Kubernetes detects the new node once it is provisioned by the underlying infrastructure, e.g. GCP.

4. The Kubernetes scheduler allocates the pending pods to the new node.

5. Go back to Step 1 if there are still pods in a pending state.
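The loop above can be sketched in a few lines of Python. This is only a toy model, not the autoscaler's actual code: the `Pod` and `Node` classes, the first-fit `schedule` function, and the `scale_up` helper are all hypothetical names, it only looks at CPU requests, and it assumes a new node appears instantly. It also stops early if no pending pod could fit on a fresh node, since adding one would not help.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # CPU request in millicores


@dataclass
class Node:
    name: str
    cpu_capacity_m: int
    pods: List[Pod] = field(default_factory=list)

    def free_cpu_m(self) -> int:
        # Capacity minus the sum of requests already placed on this node.
        return self.cpu_capacity_m - sum(p.cpu_request_m for p in self.pods)


def schedule(pods: List[Pod], nodes: List[Node]) -> List[Pod]:
    """Toy first-fit scheduler: place what fits, return what is still pending."""
    still_pending = []
    for pod in pods:
        target = next((n for n in nodes if n.free_cpu_m() >= pod.cpu_request_m), None)
        if target is None:
            still_pending.append(pod)
        else:
            target.pods.append(pod)
    return still_pending


def scale_up(pending: List[Pod], nodes: List[Node], max_nodes: int,
             new_node_cpu_m: int) -> List[Pod]:
    """Add nodes until nothing is pending or the maximum node count is hit."""
    pending = schedule(pending, nodes)              # Step 1: find pending pods
    while pending and len(nodes) < max_nodes:       # Step 2: both conditions hold
        if not any(p.cpu_request_m <= new_node_cpu_m for p in pending):
            break                                   # a new node would not help
        nodes.append(Node(f"node-{len(nodes)}", new_node_cpu_m))  # Step 3
        pending = schedule(pending, nodes)          # Step 4, then Step 5 (loop)
    return pending


# Five pods requesting 1 CPU each, starting from one 2-CPU node.
nodes = [Node("node-0", 2000)]
pods = [Pod(f"web-{i}", 1000) for i in range(5)]
unscheduled = scale_up(pods, nodes, max_nodes=3, new_node_cpu_m=2000)
print(len(nodes), [p.name for p in unscheduled])
```

With `max_nodes=2` instead, the loop stops at the node limit and one pod stays pending, which is the behavior described in Step 2.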

When Do Nodes Get Turned Down?

Scaling down is a little bit more complex. The process to check whether a node is safe to delete starts when pod requests on that node are lower than a user-defined threshold (default of 50% request utilization). It’s worth noting that this node scaledown check does not consider actual CPU/memory usage, and instead only looks at resource requests. Once a node passes that check, the actual Kubernetes scheduling algorithm is called to determine whether the pods running on that node can be moved somewhere else. The cluster autoscaler also runs the following checks on each pod in the node to minimize the chance of service disruption:

If the pod is part of a daemonset, the pod is safe to turn down, since daemonsets are supposed to run statelessly on all nodes. Removing the node does not require rescheduling a daemonset pod elsewhere.

If the pod is a mirror pod (only relevant if you’ve created static pods), it is considered safe to turn down.

Removing the pod must not bring the number of replicas below the specified minimum replica count, unless you have specified a pod disruption budget and have remaining disruptions to “spend” on moving the pod.

The pod must not use any local storage on the node; since the node is going away, that local storage would be lost.

Pods in the kube-system namespace won’t get moved unless they specify a pod disruption budget.

If any of these node or pod-level checks do not pass, then the node will not be turned down.
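To make the checks concrete, here is a rough Python sketch of the decision. It is a simplification under stated assumptions, not the autoscaler's implementation: the `Pod` fields, `blocking_reason`, and `node_removable` are hypothetical names, only CPU requests are counted, and the scheduler simulation that confirms the pods fit elsewhere is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pod:
    name: str
    cpu_request_m: int                  # CPU request in millicores
    namespace: str = "default"
    is_daemonset: bool = False
    is_mirror: bool = False
    uses_local_storage: bool = False
    below_min_replicas_if_removed: bool = False
    pdb_disruptions_left: int = 0       # 0 also covers "no budget defined"


def blocking_reason(pod: Pod) -> Optional[str]:
    """Return why this pod keeps its node alive, or None if it is safe."""
    if pod.is_daemonset or pod.is_mirror:
        return None
    has_budget = pod.pdb_disruptions_left > 0
    if pod.namespace == "kube-system" and not has_budget:
        return "kube-system pod without a pod disruption budget"
    if pod.uses_local_storage:
        return "uses local storage"
    if pod.below_min_replicas_if_removed and not has_budget:
        return "removal would drop below the minimum replica count"
    return None


def node_removable(capacity_m: int, pods: List[Pod],
                   threshold: float = 0.5) -> Tuple[bool, List[str]]:
    """Node-level check on requested utilization, then the pod-level checks."""
    utilization = sum(p.cpu_request_m for p in pods) / capacity_m
    if utilization >= threshold:
        return False, [f"request utilization {utilization:.0%} is not below {threshold:.0%}"]
    reasons = []
    for pod in pods:
        reason = blocking_reason(pod)
        if reason is not None:
            reasons.append(f"{pod.name}: {reason}")
    return not reasons, reasons


pods = [
    Pod("web", 1000),
    Pod("logger", 200, is_daemonset=True),
    Pod("cache", 500, uses_local_storage=True),
]
print(node_removable(4000, pods))      # blocked by the "cache" pod
print(node_removable(4000, pods[:2]))  # no blocking pods remain
```

A single failing pod is enough to keep the whole node, which is why the list of reasons is returned alongside the verdict.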

How to Enable Cluster Autoscaling

The procedure is slightly different for every cloud provider. Here’s a list of steps for common ones:

Limitations

The cluster autoscaler doesn’t take into account actual CPU/GPU/memory usage, just resource requests. Most teams overprovision at the pod level, so in practice we see aggressive upscaling and conservative downscaling.
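A quick worked example shows how far the two views can drift apart. The numbers below are made up for illustration, and `utilization` is a hypothetical helper: a 4-CPU node hosts three pods that together request 3 CPUs but actually use far less.

```python
def utilization(values_m, capacity_m):
    """Fraction of node capacity taken by the given millicore values."""
    return sum(values_m) / capacity_m


capacity_m = 4000                 # a 4-CPU node, in millicores
requests_m = [1500, 1000, 500]    # what the pods ask for (illustrative)
usage_m = [200, 150, 50]          # what the pods actually use (illustrative)
threshold = 0.5                   # default scale-down threshold

requested = utilization(requests_m, capacity_m)
actual = utilization(usage_m, capacity_m)

# The autoscaler only compares requested utilization with the threshold.
scale_down_candidate = requested < threshold
print(f"requested={requested:.0%} actual={actual:.0%} candidate={scale_down_candidate}")
```

Judged on actual usage the node looks nearly idle, but judged on requests it is well above the threshold, so it is never considered for removal.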

Scaling up isn’t immediate, so we’ve seen services experience downtime or latency while pending pods wait for more nodes to be added to the pool. While the cluster autoscaler should issue a scale-up request to the cloud provider within 30–60 seconds, the actual time the cloud provider takes to create a node can be on the order of minutes.

To avoid issues where you need to wait to scale up, most teams want to leave some resources idle. While you can overprovision by using “pause pods” with low priority to “reserve” space for pods of higher priority, this requires a fair amount of configuration. And again, the inputs to the scaling algorithm here are resource requests, not actual utilization.
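As a sketch of what the pause-pod setup involves, the Python below builds the two manifests as plain dictionaries and prints them as JSON, which kubectl also accepts. The function name `pause_pod_manifests`, the object name `overprovisioning`, the priority value of -10, the pause image tag, and the replica and request sizes are all assumptions chosen for illustration; check them against your cluster and autoscaler settings before using anything like this.

```python
import json


def pause_pod_manifests(replicas, cpu, memory, priority=-10):
    """Build a low-priority PriorityClass and a Deployment of pause pods."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        "value": priority,            # lower than regular pods, so these get preempted first
        "globalDefault": False,
        "description": "Placeholder pods that reserve headroom.",
    }
    labels = {"run": "overprovisioning"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "reserve-resources",
                        "image": "registry.k8s.io/pause:3.9",  # assumed tag
                        # The requests are what actually "reserve" the space.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [priority_class, deployment]}


manifests = pause_pod_manifests(replicas=2, cpu="500m", memory="512Mi")
print(json.dumps(manifests, indent=2))
```

Sizing is the part that takes tuning: the headroom reserved is roughly `replicas` times the per-pod request, and it is paid for whether or not a burst ever arrives.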

Scaling down when requested utilization dips below your threshold isn’t guaranteed with the autoscaler because, as mentioned above, nodes can have non-removable pods. We’ve seen unreplicated workloads without disruption budgets get scheduled on big nodes and waste money.

Best Practices If You Are Using the Cluster Autoscaler

Keep your pod requests close to actual utilization. Consider using the vertical pod autoscaler (more on this in an upcoming blog post) to set your requests close to actual utilization automatically.

Set pod disruption budgets where applicable for kube-system pods, to avoid nodes that can’t be turned down.

Avoid using local storage for pods.

Monitor your pods to make sure they don’t make expensive nodes ineligible for scale-down.

Our Tool Today

To help ensure that your nodes can scale down when needed, we built a tool that statically runs checks from the cluster autoscaler library and displays whether a node is safe to delete at various levels of utilization in your cluster. If the node isn’t safe, we show which pods are causing the node to stick, and which removal tests they are failing.

You don’t need to use the cluster autoscaler to use this tool; nodes that pass the checks can be safely removed manually as well.

We’ll be open-sourcing our tool in the coming weeks, but in the meantime, if you’re interested in giving it a trial run, reach out to team@kubecost.com to get access to our pilot program!

Update: this tool is now part of the core Kubecost product, available at http://kubecost.com. The product is based on our open source projects on GitHub.