In any significant deployment of computer hardware and software, there’s always going to be something wrong somewhere. Even though the individual components may be very reliable, as the number of components rises, the chance of a failure somewhere in the system at any point in time increases. So, to make the overall system reliable, you usually have multiples of everything, from load balancers to web servers.

Back in the 1990s and early 2000s, best practice for high availability typically involved using pairs, whether pairs of servers for hot/cold failover or pairs of datacenters with data replication between the two. Nowadays, a lot of the latest software uses the idea of a quorum to provide availability and consistency, and that really favours odd numbers rather than pairs. Let’s see why that’s the case.

What’s a quorum and why does it matter?

A lot of modern cloud software is built upon the idea of a quorum. Anyone who has dealt with a small child trying to get permission to do something will understand the value of a quorum. If the child asks one parent for permission and fails, they immediately ask the other in the hope of a different outcome. The key idea with a quorum is that every participant gives the same answer, no matter which one you ask. If the parents operated on the principle of a quorum, they would always give the same answer. The parents would have reached a consensus, much to the annoyance of the small person.

In software terms, the aim of a quorum is to create a distributed data store that gives consistent results even in the event of failures. Kubernetes keeps its data in a distributed, reliable store called etcd. Kafka keeps its configuration data in ZooKeeper. At a high level, these are similar in several ways. Each is composed of a cluster of servers, one of which is the elected leader while the others are followers. The cluster is resilient to failures of some number of the servers in the cluster. The leader appends operations to the cluster’s log and replicates them to the followers. Once the leader knows that a majority of the servers in the cluster have stored an operation, the operation is considered committed. That majority is the quorum. If the elected leader fails, the remaining servers elect a new leader. The algorithms used to implement this kind of system are known as consensus algorithms.

A key word in the previous paragraph is majority. If a majority of the servers are in agreement, consensus has been reached; the remaining servers can be out of contact, failed or slow to respond, and it doesn’t matter because they cannot independently form a majority of their own. Quorum has been achieved. For a cluster of n servers, a majority means strictly more than n/2 of them, so the cluster keeps working as long as fewer than half of the servers have failed. The following table shows the quorum size for different sizes of cluster.

Cluster size | Quorum size | Failures that can be tolerated
------------ | ----------- | ------------------------------
1            | 1           | 0
2            | 2           | 0
3            | 2           | 1
4            | 3           | 1
5            | 3           | 2
6            | 4           | 2
7            | 4           | 3

So you can see that a cluster of 2 servers can only reach consensus when both are available. Notice also that moving from 3 servers to 4, or from 5 to 6, doesn’t increase the number of failures that can be tolerated, which is why quorum-based systems favour odd numbers. As clusters get bigger, the cost of communication increases and performance drops. For this reason, most people choose 3, 5 or occasionally 7 servers in practice to give a good balance of performance and availability.

This is why software such as Kubernetes and Kafka is much better deployed across 3 datacenters than just 2. Internally, they rely on etcd and ZooKeeper, which are based on the principle of quorum. If you spread the servers evenly across 2 datacenters, losing either one takes out half of the cluster and the survivors cannot form a majority. If you spread them across 3 datacenters, the two surviving datacenters between them still hold a majority.

Availability zones in Kubernetes

Kubernetes offers multizone support (https://kubernetes.io/docs/setup/best-practices/multiple-zones/). A single cluster runs in multiple zones, but only within a single geographic region. This means that the latency between the nodes is still low (ideally under 20ms), but the resources in the cluster can be spread across the zones to isolate failures. Each zone might correspond to a separate datacenter or a separate failure domain within a datacenter. Each Kubernetes node is assigned to a zone and labelled accordingly with failure-domain.beta.kubernetes.io/zone, so you can easily tell which zone each node is in.
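For example, the nodes of a three-zone cluster might carry labels roughly like this (the node and zone names are hypothetical, and newer Kubernetes versions use the topology.kubernetes.io/zone label instead of the older beta one shown here):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1                               # hypothetical node name
  labels:
    kubernetes.io/hostname: worker-node-1
    failure-domain.beta.kubernetes.io/zone: zone-a  # the zone this node belongs to
```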

The usual practice is to design the system to tolerate the loss of an entire zone. You want your resources spread across the zones so that the loss of a zone doesn’t cause an outage. In particular, the majority of any quorum-based resource needs to survive the loss of a zone, and this is achieved by spreading its members across the zones. For stateless resources that can simply be restarted anywhere, it’s not all that important where they are running at any particular point in time. For stateful resources, and in particular those that rely on a quorum, it’s important that they are spread across the zones and that they stay put. If they’re not, you can imagine a sequence of failures and restarts that leaves the resources bunched up in one zone, at which point the loss of that zone would cause an outage.
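As a sketch of what this looks like in practice, here’s roughly how you might ask Kubernetes to keep the pods of a ZooKeeper StatefulSet evenly spread across zones using a topology spread constraint. The names, labels and image tag are hypothetical, and topology spread constraints need a reasonably recent Kubernetes version; on older clusters, a pod anti-affinity rule on the zone label achieves a similar effect.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper                 # hypothetical name
spec:
  replicas: 3
  serviceName: zookeeper
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      # Keep the ZooKeeper pods evenly spread across zones, so that the
      # loss of one zone can never take out a majority of the ensemble.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: failure-domain.beta.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: zookeeper
      containers:
        - name: zookeeper
          image: zookeeper:3.6    # hypothetical image tag
```

Using DoNotSchedule rather than ScheduleAnyway is a deliberate choice here: it’s better for a pod to stay pending than for the ensemble to quietly bunch up in one zone.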

Kafka and Kubernetes in a single zone

Here’s a very simple diagram showing 3 ZooKeeper servers and 3 Kafka brokers in a single zone.

Ideally, you want at least 3 nodes (worker machines) in your Kubernetes cluster so that the ZooKeeper and Kafka clusters can be spread across them to minimise the impact of a node failure. Kafka replication will spread the topic replicas and leadership across the brokers.
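One hedged sketch of how to express that node-level spreading, assuming the broker pods carry a hypothetical app: kafka label (an operator such as Strimzi applies equivalent rules for you): a required pod anti-affinity rule on the node hostname stops two brokers from landing on the same worker node.

```yaml
# Fragment of a Kafka broker pod template:
# never schedule two broker pods onto the same worker node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: kafka
        topologyKey: kubernetes.io/hostname
```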

Kafka and Kubernetes across 3 zones

If you are running a multizone Kubernetes cluster, it automatically labels the nodes with their zone information. In a single-zone cluster, Kubernetes schedules workloads across the nodes; in a multizone cluster, it extends this to spread them across zones as well. Placement is best-effort, and it works best if the zones are homogeneous, which reduces the probability of an uneven distribution.

When deploying Kafka across 3 zones, you need to consider both ZooKeeper and Kafka. The most straightforward approach for ZooKeeper is to use 3 servers and deploy one into each zone. If you prefer to use 5 servers across 3 zones, you’ll have 2 zones with 2 servers each and 1 zone with only 1 server. Having the servers out of balance like this isn’t ideal, but it’s not critical: a 5-server ensemble can survive the loss of 2 servers, so losing whichever zone holds 2 of them still leaves a majority of 3.

For Kafka itself, there’s no quorum to worry about, so a good rule of thumb is to have an equal number of Kafka brokers in each zone. You should also ensure that the replication factor is at least one larger than the minimum number of in-sync replicas, so that you can tolerate the loss of one replica without impact. For example, in 3 zones you could use a replication factor of 3 and min.insync.replicas=2.
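As a sketch, if you’re using the Strimzi operator mentioned later in this article, those cluster-wide defaults might look roughly like this. The cluster name is hypothetical, the API version depends on your Strimzi release, and this is only a fragment of a full Kafka resource, not a complete deployment.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                   # hypothetical cluster name
spec:
  kafka:
    replicas: 3                      # one broker per zone
    config:
      default.replication.factor: 3  # a replica of each partition in every zone
      min.insync.replicas: 2         # producers using acks=all survive loss of one replica
  # ZooKeeper, storage and listener configuration omitted from this sketch
```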

You can rely on Kubernetes to spread resources across zones as it first schedules them, but you can also take more explicit control using some of the more advanced features of storage classes, such as volumeBindingMode, which controls when persistent volumes are created relative to pod scheduling.
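For instance, a storage class like the following (a sketch; the name is hypothetical and the provisioner is just an example for one cloud) delays volume creation until the pod has been scheduled, so each broker’s persistent volume is provisioned in the same zone as the node the broker lands on:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-storage                   # hypothetical name
provisioner: kubernetes.io/aws-ebs      # example provisioner; use your platform's own
volumeBindingMode: WaitForFirstConsumer # provision the volume where the pod is scheduled
```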

Kafka has a concept called rack awareness that is very useful in a multizone cluster. If you configure the broker.rack property of each of your brokers to match the zone label of the node it’s running on, Kafka will assign replicas so that they are spread across the zones too. So, not only are your Kafka brokers spread across the zones, the replicas of your topics are as well. This trick of making the link between Kubernetes zone labels and Kafka broker rack IDs is important. The open-source Strimzi project, which offers a Kubernetes operator for managing Kafka, does this automatically, and so does IBM Event Streams, both on public cloud and on Red Hat OpenShift.
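In Strimzi, for example, this link is a single stanza in the Kafka resource, which sets broker.rack on each broker from the zone label of the node it’s scheduled on. This is a sketch using the older beta zone label referenced earlier; check your Strimzi and Kubernetes versions for the exact key to use.

```yaml
spec:
  kafka:
    rack:
      # Strimzi reads this node label and sets broker.rack on each broker,
      # so Kafka's rack-aware replica assignment spreads replicas across zones.
      topologyKey: failure-domain.beta.kubernetes.io/zone
```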

Here’s an illustration of the same Kafka cluster deployed across 3 zones.

Now, imagine that one of the zones becomes unavailable. The result looks like this.

There is still a quorum of ZooKeeper servers, and the leadership of the Kafka partitions has moved to the surviving brokers, but the cluster still works properly.

Kafka and Kubernetes across 2 zones

This is where it gets exciting. It’s clear that 3 is the right number of zones, but many organisations only have pairs of datacenters. While it’s amusing to tell someone in this situation to build a third datacenter, it tends to be quite a short conversation. In very specific circumstances, 2 datacenters could in theory be made to work adequately, but it’s not exactly ideal.

The difficulty with 2 datacenters is that you can’t place a majority of the quorum servers in both of them, so losing the datacenter that holds the majority means losing quorum. The key to running in 2 datacenters is migrating failed servers to the surviving datacenter as quickly as possible. That relies on being able to access their data from both datacenters with no compromises, so a shared filesystem or synchronous disk replication is required. This is only necessary for the quorum-based servers, which means at least etcd and ZooKeeper when running Kafka on Kubernetes.

With Kafka itself, there’s no quorum to worry about and no brokers to migrate to keep running, but if you lose one zone, you’re going to lose half of the replicas. Rather than using a replication factor of 3 with min.insync.replicas=2, I’d use a replication factor of 4 with min.insync.replicas=2. That way, you get an even number of replicas spread across both zones, and you can still tolerate the loss of one of them while leaving enough in-sync replicas to keep accepting writes.
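A hedged sketch of that topic configuration using the Strimzi KafkaTopic resource (the topic and cluster names are hypothetical, and the same settings can equally be applied with the standard Kafka topic-creation tools):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic                    # hypothetical topic name
  labels:
    strimzi.io/cluster: my-cluster  # ties the topic to the Kafka cluster
spec:
  partitions: 6
  replicas: 4                       # two replicas in each of the two zones
  config:
    min.insync.replicas: 2          # still satisfiable after losing a whole zone
```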

So, here’s what a configuration across 2 zones looks like.

And here’s what happens if one zone is lost.

There’s only 1 ZooKeeper server left in the surviving zone, so quorum has been lost. If you could somehow migrate at least one of the failed ZooKeeper servers into the surviving zone with no data loss, you could recover the system.

Conceptually, you could make 2 zones work, provided that you are able to migrate failed servers without compromising their data, but I’ve never heard of a successful deployment in just 2 zones. So, my advice would be to do the following instead.

Kafka and Kubernetes across 2.5 zones

This is a good compromise for organisations with 2 datacenters that want to do zones properly. You create a very small third zone, ideally separated from both of the main datacenters, and in it you run just one server for each of the resources that require a quorum. In Kubernetes terms, it’s a proper multizone cluster with 3 zones, but the third zone only needs enough capacity to run etcd and ZooKeeper.

The Kafka brokers will be in the two full-sized zones, and replication should be set up as if there were 2 zones, not 3.

Now, if you lose one of the datacenters, you still have a working system.

Conclusion

Kafka and Kubernetes are powerful technologies that can be used to give a very high level of resilience to failures. I hope I’ve given you a good understanding of the basic concepts, enabling you to plan your own highly available, multizone clusters. While Kubernetes does a lot of the heavy lifting, it’s not magic, and you still need a solid grasp of the topology you’re trying to build.