Mistake that cost thousands (Kubernetes, GKE)

Lessons learned scaling a Kubernetes cluster

No exaggeration, unfortunately. As a disclaimer, I will add that this was a really stupid mistake that shows my lack of experience managing auto-scaling deployments. However, it all started with a question that no one could answer, and I feel obliged to share what I learned to help others avoid similar pitfalls.

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPU), or 6x n1-standard-16 (16 vCPU) VMs?

I asked this question multiple times in the Kubernetes community. No one suggested an answer. If you are unsure about the answer yourself, then there is something for you to learn from my experience (or skip to the Answer section if you are impatient). Here it goes:

Premise

I woke up in the middle of the night with a determination to reduce our infrastructure costs.

We are running a large Kubernetes cluster. “Large” is relative, of course. In our case, that is 600 vCPUs during normal business hours. This number doubles during peak hours and drops to near zero during some hours of the night.

The invoice for the last month was USD 3,500.

August 2019

This is already pretty darn good given the computing power that we get, and Google Kubernetes Engine (GKE) made cost management mostly easy:

We use the least expensive data center (europe-west2 (London) is ≈15% more expensive than europe-west4 (Netherlands))

We use different machine types for different deployments (memory-heavy vs CPU-heavy)

We use the Horizontal Pod Autoscaler (HPA) with Custom Metrics to scale deployments (a minimal example manifest is sketched right after this list)

We use cluster autoscaler (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler) to scale node pools

We use preemptible VMs
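To illustrate the HPA + Custom Metrics item above, here is a minimal sketch of such a manifest. This is not our actual configuration: the deployment name, metric name, and target values are placeholders, and in 2019 the API version would have been autoscaling/v2beta2 rather than autoscaling/v2.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 200
  metrics:
    # Hypothetical external metric exported to Stackdriver; ours is based on queue depth.
    - type: External
      external:
        metric:
          name: custom.googleapis.com|my_queue_depth
        target:
          type: AverageValue
          averageValue: "10"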

Using exclusively preemptible VMs is what allows us to keep the costs low. To illustrate the savings: for the n1-standard-1 machine type hosted in europe-west4, the difference between a dedicated and a preemptible VM is USD 26.73/month vs USD 8.03/month, i.e. roughly 3.3x lower cost. Of course, preemptible VMs have their limitations that you need to familiarise yourself with and counteract, but that is a whole different topic.
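For reference, making a node pool preemptible on GKE is a single flag at node-pool creation time. A sketch of what such a pool might look like (cluster name, pool name, and autoscaling limits are placeholders, not our actual setup):

gcloud container node-pools create preemptible-pool \
  --cluster=my-cluster \
  --machine-type=n1-standard-1 \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=100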

With all of the above in place, it felt like we were doing all the right things to keep the costs low. However, I always had a nagging feeling that something was off.

Major red flag 🚩

About that nagging feeling:

Average CPU usage per Node was low (10%-20%). This didn’t seem right.

My first thought was that I had misconfigured compute resources. What resources are required depends entirely on the program you are running. Therefore, the best thing to do is to deploy your program without resource limits, observe how it behaves during idle, regular, and peak loads, and set the requested/limit resources based on the observed values.
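In practice, the observation step can be as simple as watching live usage and comparing it against what is requested. A sketch (the label selector is a placeholder for however your deployment is labelled, and kubectl top requires metrics-server or the GKE metrics pipeline):

# live CPU/memory usage of the deployment's pods
kubectl top pods -l app=my-deployment
# what is currently requested/allocated on each node
kubectl describe nodes | grep -A 8 'Allocated resources'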

I will illustrate my mistake through the example of a single deployment, “admdesl”.

In our case, resource requirements are sporadic:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   3m           112Mi
admdesl-5fcfbb5544-mfsvf   3m           118Mi
admdesl-5fcfbb5544-nj49v   4m           107Mi
admdesl-5fcfbb5544-nkvk9   3m           103Mi
admdesl-5fcfbb5544-nxbrd   3m           117Mi
admdesl-5fcfbb5544-pb726   3m           98Mi
admdesl-5fcfbb5544-rhhgn   83m          119Mi
admdesl-5fcfbb5544-rhp76   2m           105Mi
admdesl-5fcfbb5544-scqgq   4m           117Mi
admdesl-5fcfbb5544-tn556   49m          101Mi
admdesl-5fcfbb5544-tngv4   2m           135Mi
admdesl-5fcfbb5544-vcmjm   22m          106Mi
admdesl-5fcfbb5544-w9dsv   180m         100Mi
admdesl-5fcfbb5544-whwtk   3m           103Mi
admdesl-5fcfbb5544-wjnnk   132m         110Mi
admdesl-5fcfbb5544-xrrvt   4m           124Mi
admdesl-5fcfbb5544-zhbqw   4m           112Mi
admdesl-5fcfbb5544-zs75s   144m         103Mi

Pods that average 5m are “idle”: there is a task in the queue for them to process, but we are waiting for some (external) condition to clear before proceeding. In the case of this particular deployment, these pods switch between the idle and active states multiple times every minute and spend 70%+ of their time idle.

A minute later the same set of pods will look different:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   152m         107Mi
admdesl-5fcfbb5544-mfsvf   49m          102Mi
admdesl-5fcfbb5544-nj49v   151m         116Mi
admdesl-5fcfbb5544-nkvk9   105m         100Mi
admdesl-5fcfbb5544-nxbrd   160m         119Mi
admdesl-5fcfbb5544-pb726   6m           103Mi
admdesl-5fcfbb5544-rhhgn   20m          109Mi
admdesl-5fcfbb5544-rhp76   110m         103Mi
admdesl-5fcfbb5544-scqgq   13m          120Mi
admdesl-5fcfbb5544-tn556   131m         115Mi
admdesl-5fcfbb5544-tngv4   52m          113Mi
admdesl-5fcfbb5544-vcmjm   102m         104Mi
admdesl-5fcfbb5544-w9dsv   18m          125Mi
admdesl-5fcfbb5544-whwtk   173m         122Mi
admdesl-5fcfbb5544-wjnnk   31m          110Mi
admdesl-5fcfbb5544-xrrvt   91m          126Mi
admdesl-5fcfbb5544-zhbqw   49m          107Mi
admdesl-5fcfbb5544-zs75s   87m          148Mi

Looking at the above, I thought it made sense to use a configuration such as:

resources:
  requests:
    memory: '150Mi'
    cpu: '20m'
  limits:
    memory: '250Mi'
    cpu: '200m'

This translates to:

idle pods don’t consume more than 20m

active (healthy) pods peak at 200m

However, when I used this configuration, the deployments became hectic:

NAME                       READY   STATUS                      RESTARTS   AGE
admdesl-78fc6f5fc9-xftgr   0/1     Terminating                 3          21m
admdesl-78fc6f5fc9-xgbcq   0/1     Init:CreateContainerError   0          10m
admdesl-78fc6f5fc9-xhfmh   0/1     Init:CreateContainerError   1          9m44s
admdesl-78fc6f5fc9-xjf4r   0/1     Init:CreateContainerError   0          10m
admdesl-78fc6f5fc9-xkcfw   0/1     Terminating                 0          20m
admdesl-78fc6f5fc9-xksc9   0/1     Init:0/1                    0          10m
admdesl-78fc6f5fc9-xktzq   1/1     Running                     0          10m
admdesl-78fc6f5fc9-xkwmw   0/1     Init:CreateContainerError   0          9m43s
admdesl-78fc6f5fc9-xm8pt   0/1     Init:0/1                    0          10m
admdesl-78fc6f5fc9-xmhpn   0/1     CreateContainerError        0          8m56s
admdesl-78fc6f5fc9-xn25n   0/1     Init:0/1                    0          9m6s
admdesl-78fc6f5fc9-xnv4c   0/1     Terminating                 0          20m
admdesl-78fc6f5fc9-xp8tf   0/1     Init:0/1                    0          10m
admdesl-78fc6f5fc9-xpc2h   0/1     Init:0/1                    0          10m
admdesl-78fc6f5fc9-xpdhr   0/1     Terminating                 0          131m
admdesl-78fc6f5fc9-xqflf   0/1     CreateContainerError        0          10m
admdesl-78fc6f5fc9-xrqjv   1/1     Running                     0          10m
admdesl-78fc6f5fc9-xrrwx   0/1     Terminating                 0          21m
admdesl-78fc6f5fc9-xs79k   0/1     Terminating                 0          21m

This would happen whenever a new node was brought into or taken out of the cluster (which happens often due to auto-scaling).

As such, I kept increasing the requested pod resources until I ended up with the following configuration for this deployment:

resources:
  requests:
    memory: '150Mi'
    cpu: '100m'
  limits:
    memory: '250Mi'
    cpu: '500m'

With this configuration the cluster was running smoothly, but it meant that even idle pods were pre-allocated more CPU time than they needed. This is why the average CPU usage per node was low. However, I didn't know what the solution was (reducing the requested resources resulted in a hectic cluster state and outages), and so I rolled out a variation of this generous resource allocation for all the deployments.
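To put rough, illustrative numbers on it (assumptions, not exact measurements): the scheduler packs nodes by the sum of CPU requests, so at a 100m request roughly nine of these pods fill the allocatable CPU of a 1 vCPU node (GKE reserves a slice of each node for system components). If at any given moment most of those nine pods are idle at 3-5m and only one or two are active at 100-150m, the node is "full" from the scheduler's point of view while actually using somewhere between 100m and 300m, which lines up with the 10%-20% average utilisation I was seeing.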

Answer

Back to my question:

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPU), or 6x n1-standard-16 (16 vCPU) VMs?

For starters, there is no price-per-vCPU difference between n1-standard-1 and n1-standard-96. Therefore, I reasoned that using machines with fewer vCPUs would give me more granular control over the price.
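As a quick sanity check: n1-standard pricing scales essentially linearly with vCPU count, so taking the on-demand n1-standard-1 price quoted earlier as the unit, 96x n1-standard-1 and 1x n1-standard-96 both come out to roughly 96 x USD 26.73 ≈ USD 2,566/month in europe-west4. The bill depends on the total number of vCPUs provisioned, not on how they are sliced into machines.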

The other consideration was how fast the cluster would auto-scale, i.e. if there is a sudden surge, how quickly the cluster autoscaler can provision new nodes for the unschedulable pods. This was not a concern, though: our resource requirements grow and shrink gradually.

And so I went with mostly 1 vCPU nodes, the consequences of which I described in the Premise.

Retrospectively, it was an obvious mistake: distributing pods across nodes with a single vCPU does not allow efficient resource utilisation as individual deployments change between idle and active states. Put another way: the more vCPUs you have on the same machine, the more tightly you can pack pods, because when a portion of the pods bursts above its requested quota, there are readily available resources for it to take.
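A concrete way to see it, using the final 100m/500m configuration above as a rough example: a 1 vCPU node is full (by requests) at roughly nine such pods and has almost no headroom, so a single pod bursting towards its 500m limit is fighting everything else on that core. A 16 vCPU node holds on the order of 150 such pods, and at any moment the many idle ones leave plenty of spare CPU for the handful that burst. The larger node also pays the fixed per-node overhead (kubelet, system pods, GKE reserved resources) once rather than on every small node.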

What worked: