One of the chief benefits of the Kubernetes open source container orchestration engine is how it brings greater reliability and stability to distributed applications through dynamic scheduling of containers. But how do you make sure that Kubernetes itself stays up and running when a component, or even an entire data center, goes down?

This is where planning for Kubernetes high availability comes into play. K8s HA is not just about the stability of Kubernetes itself. It is about setting up Kubernetes, along with supporting components such as etcd, in such a way that there is no single point of failure, explained Kubernetes expert Lucas Käldström. Käldström is a volunteer ambassador for the Cloud Native Computing Foundation, and organizes the CNCF and Kubernetes Finland meetup group in Helsinki.

We were tipped off that Käldström was one of the world’s experts in Kubernetes high availability by Alexis Richardson, who is the chair of the Cloud Native Computing Foundation’s Technical Oversight Committee, as well as the CEO of Weaveworks, where Käldström is doing some Kubernetes contracting work before he checks in later this year to carry out his mandatory military service for Finland. This interview took place at KubeCon + CloudNativeCon 2017 in December.

People throw around “multi-master” and “high-availability” a lot. What’s your take?

There’s a difference between high availability and multi-master. If you, for example, have three masters and only one Nginx instance in front load balancing to those masters, you have a multi-master cluster but not a highly available one because your Nginx can go down at any time and well, there you go.

High-availability is eliminating the single point of failure?

Yeah, basically, it means eliminating the single point of failure in the cluster. When I have a highly available cluster, I should be able to lose some amount of masters.

That’s the basic definition of highly available Kubernetes. But Kubernetes is huge and there are, by definition, a lot of components that could fail. kube-dns is a good example. Let’s say we have a normal cluster with multiple masters and multiple etcd replicas, but still running just one kube-dns. If the master running kube-dns fails, your cluster will experience some kind of outage because suddenly all your service discovery queries may not resolve. So we really have to take it to a deeper level, analyze where the key components in a cluster are, and then try to eliminate their single points of failure.
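As a rough illustration of that kind of analysis, the Python sketch below flags every component that runs as a single instance. The component names and replica counts are hypothetical, chosen to mirror the examples in the conversation; this is a toy inventory check, not a Kubernetes tool.

```python
def single_points_of_failure(replicas):
    """Return the names of components that run fewer than two instances."""
    return sorted(name for name, count in replicas.items() if count < 2)


# Hypothetical inventory: several masters and etcd members, but only one
# load balancer in front and only one kube-dns Pod.
multi_master = {
    "nginx-load-balancer": 1,
    "master": 3,
    "etcd": 3,
    "kube-dns": 1,
}

# Multi-master, yet not highly available: two components are still alone.
print(single_points_of_failure(multi_master))
```

A cluster only counts as highly available in the sense described here once that list is empty.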

For example, for etcd, we need to have at least three to form a quorum and then we can lose one. If we lose one, we can still go on and a leader election will happen [to choose a new master] and one of them will take over. But then, it’s not resilient to failure anymore. If we had a cluster size of three and one of them is gone, we now have 66 percent of the cluster available. But if we lose one more, we will have only 33 percent. Once availability is less than 50 percent, the cluster goes into read-only mode and basically does not tolerate any changes.

With three members, the quorum is two, so the failure tolerance is one: we’re resilient to one failing. And if we scale it up to five, we’re resilient to two failing nodes.
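The arithmetic behind those numbers is the majority rule used by Raft-based stores such as etcd. The sketch below is a minimal illustration; the function names are made up for this example.

```python
def quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failure_tolerance(members):
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


for size in (1, 3, 4, 5):
    print(f"members={size} quorum={quorum(size)} tolerates={failure_tolerance(size)}")
```

Note that going from three members to four does not raise the failure tolerance, which is why odd cluster sizes are the usual choice.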

So using multiple active leader-elected peers is one way of making the component highly available. But then we also have components like the controller manager and the scheduler, which operate in an active/passive mode. If we have five controller managers running, for example, four of them will be passive and one of them will be active. So, if the active one goes down, the other four will race to the API server, saying, “Hey, I’m the one that should take over.” And whichever one gets the lock (using an Endpoints or ConfigMap Kubernetes object) will start operating.
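The active/passive hand-over can be pictured with a toy lease lock. The sketch below is an in-memory simulation under simplifying assumptions (a single shared object, explicit timestamps, a hypothetical 15-second lease); it is not the Kubernetes leader-election implementation, and the class and candidate names are invented for illustration.

```python
class LeaseLock:
    """Toy stand-in for the shared lock object that candidates race for."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.renewed_at = None

    def try_acquire_or_renew(self, candidate, now):
        """Return True if `candidate` holds the lock after this call."""
        expired = (
            self.holder is None
            or now - self.renewed_at >= self.lease_seconds
        )
        if expired or self.holder == candidate:
            # Either the lock is free or stale, or the active holder renews it.
            self.holder = candidate
            self.renewed_at = now
            return True
        return False  # someone else holds a fresh lease: stay passive


lock = LeaseLock(lease_seconds=15)
# Both candidates race; "controller-manager-a" asks first and becomes active.
print(lock.try_acquire_or_renew("controller-manager-a", now=0))
# "controller-manager-b" finds a fresh lease and stays passive.
print(lock.try_acquire_or_renew("controller-manager-b", now=5))
# "a" crashes and stops renewing; once the lease is stale, "b" takes over.
print(lock.try_acquire_or_renew("controller-manager-b", now=20))
```

The passive instances do nothing but retry the lock, so a hand-over takes at most about one lease period after the active instance disappears.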

Is handling etcd any different in terms of making sure that there’s at least one instance up and running?

Managing etcd in an HA environment can be really challenging. Just consider you have three nodes and you’re in a trusted environment. They all join and everything is good, we’re in a happy world, then one of them goes down. And in a cloud environment, for example, when a node comes back up it might have a different IP, and etcd gets a trigger like, “This is a new node,” instead of realizing it’s the same one that was only unavailable for a while. So that might be challenging. Well, at least you have to take that into account when managing etcd, and you should then add the new node and remove the old one because etcd sees them as different peers. Then upgrades, of course, are challenging. You ideally should be able to serve a lot of requests while you’re upgrading.
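A toy model of that bookkeeping is sketched below: membership is a mapping from peer name to address, so a node that returns with a new address is not recognized and has to be swapped in explicitly. The member names and IP addresses are hypothetical, and the function only models the membership list; it does not talk to etcd.

```python
def replace_member(members, old_name, new_name, new_address):
    """Return a new membership map with the stale peer swapped for the new one."""
    if old_name not in members:
        raise KeyError(f"unknown member: {old_name}")
    updated = {name: addr for name, addr in members.items() if name != old_name}
    updated[new_name] = new_address
    return updated


# Hypothetical three-member cluster.
members = {"etcd-0": "10.0.0.1", "etcd-1": "10.0.0.2", "etcd-2": "10.0.0.3"}

# The third node comes back with a different IP, so it looks like a stranger.
returning_address = "10.0.0.7"
print(returning_address in members.values())

# The operator has to drop the old peer entry and register the new one.
members = replace_member(members, "etcd-2", "etcd-2-replacement", returning_address)
print(sorted(members.items()))
```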

Managing an HA etcd cluster isn’t easy, and requires a lot of conscious decisions and experience to get right.

The API servers operate in an active-active pattern. If we had three API servers backed by three etcds, for example, all the API servers will be active at any given point in time.

When you say active-active, what do you mean exactly?

Let’s imagine we use DNS in front of the API servers or multiple resilient load balancers, not the single point of failure scenario. Active-active means that an API server doesn’t sleep, as it can receive requests at any given point in time.

It might always be serving workloads. Compared to the active-passive thing where — well we have five schedulers but only one of them is running.
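To make the active-active idea concrete, the sketch below rotates requests across all API servers and skips any that are down. It is a simplified client-side round-robin with hypothetical server names and a caller-supplied health check, not how any particular load balancer or DNS setup works.

```python
from itertools import cycle


def make_picker(servers, is_healthy):
    """Return a function that yields the next healthy server in rotation."""
    ring = cycle(servers)

    def pick():
        # Try each server at most once per call before giving up.
        for _ in range(len(servers)):
            server = next(ring)
            if is_healthy(server):
                return server
        raise RuntimeError("no healthy API server available")

    return pick


down = set()  # hypothetical health state, updated by some external probe
pick = make_picker(
    ["apiserver-1", "apiserver-2", "apiserver-3"],
    lambda server: server not in down,
)

print([pick() for _ in range(3)])  # every server takes a turn
down.add("apiserver-2")
print([pick() for _ in range(4)])  # the remaining two share the traffic
```

In the active-passive case, by contrast, only the current lock holder would ever be picked.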

This all sounds terribly complicated. Are there ways to ease the process of managing all of this?

Special Interest Group (SIG) Cluster Lifecycle discusses this on a weekly basis: How can we simplify the installation process? How can we make it easier to run Kubernetes in an HA mode? What are the key things to think about when doing this? So we’re going to provide HA documentation in the official Kubernetes documentation.

How does kubeadm help?

kubeadm is a tool for you to use to get from a place where you have a machine to a place where you have a Kubernetes cluster. The scope is small for kubeadm so that it can fit in a lot of places. kubeadm is just the tool that takes you from no Kubernetes to Kubernetes and stitches the cluster together.

So yeah, we have made kubeadm to ease the pain, so to say, and also to unify the different deployments that exist in this spectrum.

The CNCF has 23 approved distributions [at the time of the interview]. They’re all different in various ways. They’re all duplicating to some extent the same code, right? They’re all installing Kubernetes and there are many ways you can configure it, but at the end of the day, you’re probably running them pretty much in the same way. So kubeadm takes on the task of unifying this common piece of code in one place, and it’s also extensible. If you don’t want everything kubeadm init does, you can just use the small kubeadm phase toolbox, which provides an interface for executing atomic phases [logical work items when bootstrapping a Kubernetes cluster], as we call them, to...

Atomic?

Yeah, an atomic piece of work, like generating certificates or whatever. A piece of work that is common to any Kubernetes cluster will be included in the kubeadm phases toolbox. So that’s the plan with kubeadm. If you’re about to set up an HA cluster in your environment, you have to think of a lot of things if you’re hand-rolling it. And kubeadm will help a great deal here because then you don’t have to think about, “How do I set up my cluster? How do I get from no Kubernetes to Kubernetes?” You can now rely on kubeadm for that part. For HA with kubeadm, right now you still have to do higher-level stuff like, “Well, I have to run kubeadm init on all three masters,” but that is better than having to do everything by hand.

You can create an HA Kubernetes cluster pretty smoothly from scratch using kubeadm, by bootstrapping an HA etcd cluster, executing kubeadm init out of band on all masters, and setting up some kind of highly available proxy in front, e.g. using DNS. As part of the v1.9 SIG Cluster Lifecycle work, we’re documenting how to do that.

For the internal DNS service (kube-dns or CoreDNS) you have to consider the outage scenario I talked about earlier. The internal DNS service IP, which is often something ending in .10, load balances DNS traffic to the endpoint Pods. You should set up autoscaling for the backing Pods, and require a minimum replica count of something like 3. And you could set something like anti-affinity on the Pods so you’re sure that you don’t have all, say, five instances running on the same master, as it can go down and then there’s a (small) outage [Kubernetes will auto-heal the Pods later]. So the good thing is that we can pretty easily do all this with Kubernetes and we get lots of functionality for free.
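Why spreading replicas matters can be checked with a small placement model. The sketch below compares stacking all DNS replicas on one node with spreading them, and reports how many survive the worst single-node failure. The Pod and node names are hypothetical, and the round-robin `spread` helper only approximates what an anti-affinity rule asks the scheduler to do.

```python
from collections import Counter


def worst_case_survivors(placement):
    """Replicas left after losing the node that hosts the most replicas.

    `placement` maps replica name -> node name.
    """
    if not placement:
        return 0
    per_node = Counter(placement.values())
    return len(placement) - max(per_node.values())


def spread(replicas, nodes):
    """Assign replicas to nodes round-robin so they land on different nodes."""
    return {replica: nodes[i % len(nodes)] for i, replica in enumerate(replicas)}


replicas = ["dns-0", "dns-1", "dns-2"]
nodes = ["master-1", "master-2", "master-3"]

stacked = {replica: "master-1" for replica in replicas}
print(worst_case_survivors(stacked))                  # one node failure takes out DNS
print(worst_case_survivors(spread(replicas, nodes)))  # most replicas keep serving
```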

Have you talked to people who have been doing Kubernetes high availability?

Yeah. I’ve definitely interacted with people that are running HA Kubernetes. I don’t know if I can name anyone or even remember individual cases but yeah, people do it. However, be aware that there are also HA-related bugs in Kubernetes yet to be fixed. And the problem is that our end-to-end tests don’t cover highly available Kubernetes, which is a large hole in our testing infrastructure.

There are reportedly some problems with running Kubernetes in an HA mode. There was one gRPC bug in the API server/etcd communication that made HA Kubernetes fail in some scenarios. There’s a Kubernetes issue tracking this, but as I said, we have to increase our coverage in this area because right now in some scenarios you might get broken if you run Kubernetes v1.9 in an HA mode. [Editor’s note: The issue in question has been fixed in v1.10]

Are there any other things you consider when thinking about Kubernetes high availability that we haven’t discussed?

If you’re running on a cloud, you might also want to think about the different availability zones that your cloud provider has. You probably don’t want to put everything in the same availability zone, because if the VMs in that zone go down due to some cloud provider outage, apps running on top of Kubernetes won’t get rescheduled to a healthy zone.

Another thing that would be good to talk about is a very common misunderstanding about how Kubernetes works. In reality, if your control plane goes down, nothing really happens to your running apps. A lot of users think, “I really need HA because if I don’t have it, if one process fails, all my apps will be broken.” That’s not the case. Kubernetes was cleverly designed in a manner that you could basically lose your master(s) but all your workloads still go on running as usual. The only thing is that your cluster gets into a read-only state because you can’t write new state to the cluster when the API server is down. But if you just restart your master, everything will still work smoothly.

So, that’s a really clever design. And I think the inspiration was from airplanes, having the possibility to turn on and off their engines during a flight or something like that.

That also leaves us with this: you might need HA, but you also might not.

If you have three nodes, for example, you probably don’t need three masters, because for a small CI/CD system, for example, you are probably able to handle an hour of API server outage perfectly fine.

That is fascinating. Yeah, and that raises the question of whether you really need HA?

Exactly! Because there is so much complexity being added when going to a highly available mode, you should think twice about whether that complexity is worth the benefit, instead of blindly convincing yourself you need HA. Cost-efficiency is another parameter to consider along with complexity.

For the upcoming KubeCon + CloudNativeCon Europe in Copenhagen next month, Käldström will be speaking about how to make a “production-ready” Kubernetes cluster.

Feature image via Pixabay.