I have started compiling a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from.

Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, etc.) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently. The biggest chunk of problems can be attributed to the nature of distributed systems and "cascading failures". For example, a Kubernetes API server outage should not affect running workloads, but in our case it did; our recent CoreDNS incident is another example.

We shared some of our incidents and Kubernetes failures in talks:

My main motivation for giving such talks about failures is that I want to hear more of them myself! Nordstrom's talk "101 Ways to Crash Your Cluster" at KubeCon 2017 was my inspiration (as you can even see from the similarity in talk titles ;-)). I hope to see more people share their postmortems and give failure talks. Monzo's transparency and public postmortem are a great service to the community and something we should all strive towards.