What could go wrong?

Let’s look at some of the issues you could run into when putting less importance on DEV, and the impact they might have.

I did not come up with these. We’ve witnessed all of them happen over the last 2+ years.

Scenario 1: K8s API of the DEV cluster is down

Your nicely built CI/CD pipeline is now spitting out a mountain of errors. Almost all of your developers are blocked, as they can’t deploy or test anything they are building.

This is actually much more impactful in DEV than in production clusters, as in PROD your most important assets are your workloads, and those should still be running when the Kubernetes API is down (that is, if you did not build any strong dependencies on the API). You might not be able to deploy a new version, but your workloads are fine.
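To make “strong dependencies on the API” concrete, here is a hypothetical anti-pattern, with placeholder names and image: a Pod whose liveness probe calls the API server. The kubelet keeps executing probes even while the API server is down, so an API outage restart-loops an otherwise healthy workload.

```yaml
# Hypothetical anti-pattern, for illustration only.
apiVersion: v1
kind: Pod
metadata:
  name: api-dependent-app        # placeholder name
spec:
  containers:
    - name: app
      image: example.com/app:1.0 # placeholder; assumes the image ships wget
      livenessProbe:
        exec:
          command:               # fails whenever the API is unreachable
            - wget
            - -q
            - --no-check-certificate
            - -O-
            - https://kubernetes.default.svc/healthz
        periodSeconds: 10
        failureThreshold: 3
```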

Scenario 2: Cluster is full / Resource pressure

Some developers are now blocked from deploying their apps. And if they try (or the pipeline just pushes new versions), they might increase the resource pressure.

Pods start to get killed. Now your priority and QoS classes kick in — you did remember to set those, right? Or was that something that was not important in DEV? Hopefully, you have at least protected your Kubernetes components and critical add-ons. If not, you’ll see nodes going down, which again increases resource pressure. Thought DEV clusters could do with less buffer? Think again.

This sadly happens much more in DEV because of two things:

Heavy CI running in DEV

Less emphasis on clean definition of resources, priorities, and QoS classes.
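For reference, here is a minimal sketch of what setting those looks like; all names and numbers are made up. Requests equal to limits give a Pod the Guaranteed QoS class, which the kubelet evicts last under node pressure, and a PriorityClass controls which Pods get preempted or evicted first.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-important    # made-up name
value: 100000            # higher value = preempted/evicted later
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app    # made-up name
spec:
  priorityClassName: dev-important
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:          # identical to requests -> Guaranteed QoS
          cpu: 500m
          memory: 256Mi
```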

Scenario 3: Critical add-ons failing

In most clusters, CNI and DNS are critical to your workloads. If you use an Ingress Controller to access them, then that also counts as critical. You’re really cutting edge and already running a service mesh? Congratulations, you’ve added another critical component (or rather a whole bunch of them — looking at you, Istio).

Now if any of the above starts having issues (and they partly depend on each other), you’ll start seeing workloads breaking left and right, or, in the case of the Ingress Controller, becoming unreachable from outside the cluster. This might sound small on the impact scale, but looking at our past postmortems, I must say that the Ingress Controller (we run the community NGINX variant) accounts for the biggest share of them.
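One way to reduce that share is to treat add-ons with the same care as PROD workloads. Below is a sketch, assuming the labels and namespace of the community ingress-nginx deployment (adjust to yours): the PodDisruptionBudget keeps at least one replica alive through voluntary disruptions like node drains, and pairing the controller with a high PriorityClass (as sketched above) protects it under resource pressure.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-nginx            # assumed name
  namespace: ingress-nginx       # assumed namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx   # assumed chart label
```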

What happened?

A multitude of thinkable and unthinkable things can happen and lead to one of the scenarios above.

Most often, we’ve seen issues arise because of misconfigured workloads. Maybe you’ve seen one of the below (the list is not exhaustive).

CI running wild and filling up your cluster with Pods without any limits set (a namespace-level guardrail for this is sketched after this list)

CI “DoSing” your API

Faulty TLS certs messing up your Ingress Controller

Java containers taking over whole nodes and killing them

…
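The first two items are exactly what namespace-level guardrails exist for. A sketch, assuming CI jobs run in a dedicated ci namespace (hypothetical): the LimitRange injects default requests and limits into containers that set none, and the ResourceQuota caps what the namespace can consume in total, so a runaway pipeline hits the quota instead of starving the whole cluster.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ci-defaults
  namespace: ci            # hypothetical CI namespace
spec:
  limits:
    - type: Container
      defaultRequest:      # applied as requests when none are set
        cpu: 200m
        memory: 256Mi
      default:             # applied as limits when none are set
        cpu: "1"
        memory: 1Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ci-quota
  namespace: ci
spec:
  hard:
    pods: "50"             # made-up numbers; size these to your buffer
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
```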

Sharing DEV with a lot of teams? Gave each team cluster-admin rights? You’re in for some fun. We’ve seen pretty much everything, from “small” edits to the Ingress Controller template file to someone accidentally deleting the whole cluster.

Conclusion

If it wasn’t clear from the above: DEV clusters are important!

Just consider this: if you use a cluster to work productively, then it should be considered just as important in terms of reliability as PROD.

DEV clusters usually need to be reliable at all times. Having them reliable only during business hours is tricky. First, you might have distributed teams and external contractors working at odd hours. Second, an issue that occurs during off-hours might just grow bigger and then take longer to fix once business hours start. The latter is one of the reasons why we always do 24/7 support, even if we could offer business-hours-only support at a lower price.

Some things you should consider (not only for DEV):

Be aware of issues with resource pressure when sizing your clusters. Include buffers.

Separate teams with namespaces (with access controls!) or even different clusters to decrease the blast radius of misuse (a minimal RBAC sketch follows this list).

Configure your workloads with the right requests and limits (especially for CI jobs!).

Harden your Kubernetes and add-on components against resource pressure.

Restrict access to critical components and do not give out cluster-admin credentials.

Have your SREs on standby. That means people will get paged for DEV.

If possible, enable your developers to easily rebuild DEV or spin up clusters for development by themselves.
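To make the namespace and cluster-admin items concrete, here is the minimal RBAC sketch mentioned above; the team namespace and group name are made up, while edit is one of the default ClusterRoles shipped with Kubernetes. It gives a team read/write access to its own namespace without handing over the whole cluster.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                   # made-up team namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers      # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in role: read/write, but no RBAC changes
  apiGroup: rbac.authorization.k8s.io
```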

If you really need to save money, you can experiment with downscaling during off-hours. If you’re really good at spinning up or rebuilding DEV, i.e. have it all automated from cluster creation to app deployments, then you could experiment with “throw-away clusters”, i.e. clusters that get thrown away at the end of the day and started anew shortly before business hours.

Whatever you decide, please, please, please do not block your developers. They will be much happier, and you will get better software, believe me.

Written by Puja Abbassi — Developer Advocate @ Giant Swarm