UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Let's explore Apache ZooKeeper, a coordination service for distributed systems. Needless to say, there are plenty of use cases! At Found, for example, we use ZooKeeper extensively for discovery, resource allocation, leader election and high-priority notifications. In this article, we'll introduce you to this King of Coordination and look closely at how we use ZooKeeper at Found.

What Is ZooKeeper?

ZooKeeper is a coordination service for distributed systems. By providing a robust implementation of a few basic operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.

ZooKeeper as a Distributed File System

One way of getting to know ZooKeeper is to think of it as a distributed file system. In fact, the way information in ZooKeeper is organized is quite similar to a file system. At the top there is a root, simply referred to as /. Below the root there are nodes referred to as zNodes, short for ZooKeeper Node, a term used mostly to avoid confusion with computer nodes. A zNode may act as both a file containing binary data and a directory with more zNodes as sub-nodes. As in most file systems, each zNode has some metadata. This metadata includes read and write permissions and version information.

Unlike an ordinary distributed file system, ZooKeeper supports the concepts of ephemeral zNodes and sequential zNodes. An ephemeral zNode is a node that will disappear when the session of its owner ends. A typical use case for ephemeral nodes is discovery of hosts in your distributed system. Each server can publish its IP address in an ephemeral node, and should a server lose connectivity with ZooKeeper and fail to reconnect within the session timeout, its information is deleted.

Sequential nodes are nodes whose names are automatically assigned a sequence number suffix. This suffix is strictly increasing and is assigned by ZooKeeper when the zNode is created. An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.
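To make the election pattern concrete, here is a minimal in-memory sketch in Python. It is not the ZooKeeper client API: `ToyEnsemble`, its methods, the `candidate-` prefix and the session names are all hypothetical, and the model only mimics the two behaviours described above (sequence suffixes and ephemeral cleanup) inside a single process.

```python
import itertools


class ToyEnsemble:
    """Toy stand-in for one parent zNode that holds election candidates.

    Hypothetical model, not a real ZooKeeper client: it only imitates
    sequence-number suffixes and the removal of ephemeral zNodes.
    """

    def __init__(self):
        self._counter = itertools.count()  # source of strictly increasing suffixes
        self._nodes = {}                   # zNode name -> (session id, payload)

    def create_ephemeral_sequential(self, session_id, prefix, payload):
        # Append a zero-padded sequence number to the requested prefix.
        name = f"{prefix}{next(self._counter):010d}"
        self._nodes[name] = (session_id, payload)
        return name

    def expire_session(self, session_id):
        # Ephemeral zNodes disappear when the session of their owner ends.
        self._nodes = {
            name: value
            for name, value in self._nodes.items()
            if value[0] != session_id
        }

    def leader(self):
        # The candidate holding the lowest sequence suffix is the leader.
        if not self._nodes:
            return None
        lowest = min(self._nodes, key=lambda name: int(name[-10:]))
        return self._nodes[lowest][1]


ensemble = ToyEnsemble()
ensemble.create_ephemeral_sequential("session-a", "candidate-", "10.0.0.1")
ensemble.create_ephemeral_sequential("session-b", "candidate-", "10.0.0.2")
ensemble.create_ephemeral_sequential("session-c", "candidate-", "10.0.0.3")

first_leader = ensemble.leader()
ensemble.expire_session("session-a")   # the current leader goes offline
second_leader = ensemble.leader()
```

In this sketch the first server to register leads, and once its session expires the server with the next-lowest suffix takes over without any extra coordination step.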
The pattern of every node creating a sequential and ephemeral zNode effectively organizes all the nodes in a queue that is observable to all. This is not only useful for leader election; it may just as well be generalized to distributed locks for any purpose, with any number of nodes inside the lock.

ZooKeeper as a Message Queue

Speaking of observing changes, another key feature of ZooKeeper is the possibility of registering watchers on zNodes. This allows clients to be notified of the next update to that zNode. With watchers one can implement a message queue by letting all clients interested in a certain topic register a watcher on a zNode for that topic; messages regarding that topic can then be broadcast to all the clients by writing to that zNode.

An important thing to note about watchers, though, is that they’re always one-shot, so if you want further updates to that zNode you have to re-register them. This implies that you might lose an update between receiving one and re-registering, but you can detect this by using the version number of the zNode. If, however, every version is important, then sequential zNodes are the way to go.

ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. Not all clients will be at the exact same point in time, but they will all see every update in the same order. It is also possible to make a write conditional on a certain version of the zNode, so that if two clients try to update the same zNode based on the same version, only one of the updates will succeed. This makes it easy to implement distributed counters and perform partial updates to node data. ZooKeeper even provides a mechanism for submitting multiple update operations in a batch so that they are executed atomically, meaning that either all or none of the operations will be executed.
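The version-conditioned write and the all-or-nothing batch can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not ZooKeeper's API: `ToyStore`, `BadVersion`, `increment` and the paths used are hypothetical names, everything runs in one process, and each path is assumed to appear at most once per batch.

```python
class BadVersion(Exception):
    """Raised when the expected version does not match the stored one."""


class ToyStore:
    """Hypothetical in-memory model of versioned zNodes."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]

    def set(self, path, data, expected_version):
        # The write only succeeds if nobody has updated the zNode since
        # the caller read it.
        _, version = self._nodes[path]
        if version != expected_version:
            raise BadVersion(path)
        self._nodes[path] = (data, version + 1)

    def multi(self, ops):
        # ops: list of (path, data, expected_version), each path at most once.
        # Validate every precondition first, then apply: all or nothing.
        for path, _, expected_version in ops:
            if self._nodes[path][1] != expected_version:
                raise BadVersion(path)
        for path, data, expected_version in ops:
            self._nodes[path] = (data, expected_version + 1)


def increment(store, path):
    """Counter built on conditional writes: re-read and retry on conflict."""
    while True:
        value, version = store.get(path)
        try:
            store.set(path, value + 1, version)
            return value + 1
        except BadVersion:
            continue


store = ToyStore()
store.create("/counter", 0)

# Two clients read the same version and both try to write.
value, version = store.get("/counter")
store.set("/counter", value + 1, version)        # first writer wins
try:
    store.set("/counter", value + 1, version)    # second writer is rejected
    second_write_succeeded = True
except BadVersion:
    second_write_succeeded = False

counter_after_retry = increment(store, "/counter")

# A batch with one stale precondition changes nothing at all.
store.create("/a", "old-a")
store.create("/b", "old-b")
try:
    store.multi([("/a", "new-a", 0), ("/b", "new-b", 7)])
    batch_applied = True
except BadVersion:
    batch_applied = False
```

The rejected writer does not lose its update silently; it learns that its view was stale and can re-read before trying again, which is what the `increment` helper does.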
If you store data structures in ZooKeeper that need to be consistent over multiple zNodes, then the multi-update API is useful; however, it is still not as powerful as ACID transactions in traditional SQL databases. You can’t simply say “BEGIN TRANSACTION”, as you still have to specify the expected pre-state of each zNode you rely on.

Don’t Replace Your Distributed File System and Message Queue

Although it might be tempting to have one system for everything, you’re bound to run into some issues if you try to replace your file servers with ZooKeeper. The first issue is likely to be the zNode size limit imposed by the jute.maxbuffer setting. This is a limit on the size of each zNode, and the default value is one megabyte. In general, it is not recommended to change that setting, simply because ZooKeeper was not designed to be a large datastore. The exception to the rule, which we’ve experienced at Found, is when a client with many watchers has lost its connection to ZooKeeper, and the client library, in this case Curator, attempts to recreate all the watchers upon reconnection. Since the same setting also applies to all messages sent to and from ZooKeeper, we had to increase it to allow Curator to reconnect smoothly for these clients.

Similarly, you are likely to end up with throughput issues if you use ZooKeeper when what you really need is a message queue, as ZooKeeper is all about correctness and consistency first and speed second. That said, it is still pretty fast when operating normally.

ZooKeeper at Found

At Found we use ZooKeeper extensively for discovery, resource allocation, leader election and high-priority notifications. Our entire service is built up of multiple systems reading from and writing to ZooKeeper.

One example of such a system is our customer console, the web application that our customers use to create and manage Elasticsearch clusters hosted by Found. One can also think of the customer console as the customers’ window into ZooKeeper. When a customer creates a new cluster or makes a change to an existing one, this is stored in ZooKeeper as a pending plan change.

The next step is done by the Constructor, which has a watch in ZooKeeper for new plans. The Constructor implements the plan by deciding how many Elasticsearch instances are required and whether any of the existing instances may be reused. The Constructor then updates the instance list for each Elasticsearch server accordingly and waits for the new instances to start.

On each server running Elasticsearch instances, we have a small application that monitors the server’s instance list in ZooKeeper and starts or stops LXC containers with Elasticsearch instances as needed. When an Elasticsearch instance starts, we use a plugin inside Elasticsearch to report the IP address and port to ZooKeeper and to discover other Elasticsearch instances to form a cluster with.

The Constructor waits for the Elasticsearch instances to report back through ZooKeeper with their IP address and port, and uses this information to connect to each instance and to ensure they have formed a cluster successfully. And of course, if this does not happen within a certain timeout, the Constructor will begin rolling back the changes. A common issue that may lead to new nodes having trouble starting is a misconfigured Elasticsearch plugin or a plugin that requires more memory than anticipated.

To provide our customers with high availability and easy failover, we have a proxy in front of the Elasticsearch clusters. 
It is also crucial that this proxy forwards traffic to the correct server, whether changes are planned or not. By monitoring information reported to ZooKeeper by each Elasticsearch instance, our proxy is able to detect whether it should divert traffic to other instances or block traffic altogether to prevent deterioration of information in an unhealthy cluster.

We also use ZooKeeper for leader election among services where this is required. One such example is our backup service. The actual backups are made with the Snapshot and Restore API in Elasticsearch, while the scheduling of the backups is done externally. We decided to co-locate the scheduling of the backups with each Elasticsearch instance. Thus, for customers that pay for high availability, the backup service is also highly available. However, if there are no live nodes in a cluster there is no point in attempting a backup. Since we only want to trigger one backup per cluster and not one per instance, the backup schedulers need to be coordinated. This is done by letting them elect a leader for each of the clusters.

With this many systems relying on ZooKeeper, we need a reliable, low-latency connection to it. Hence, we run one ZooKeeper cluster per region. While having both client and server in the same region goes a long way in terms of network reliability, you should still anticipate intermittent glitches, especially when doing maintenance on the ZooKeeper cluster itself. At Found, we’ve learned first-hand that it’s very important to have a clear idea of what information a client should maintain a local cache of, and what actions a client system may perform while it does not have a live connection to ZooKeeper.

How Does It Work?

Three or more independent servers form a ZooKeeper cluster and elect a master. The master receives all writes and publishes changes to the other servers in an ordered fashion. The other servers provide redundancy in case the master fails, and relieve the master of read requests and client notifications.

The concept of ordering is important in order to understand the quality of service that ZooKeeper provides. All operations are ordered as they are received, and this ordering is maintained as information flows through the ZooKeeper cluster to other clients, even in the event of a master node failure. Two clients might not have the exact same view of the world at any given point in time, but they will observe all changes in the same order.

The CAP Theorem

Consistency, Availability and Partition tolerance are the three properties considered in the CAP theorem. The theorem states that a distributed system can only provide two of these three properties. ZooKeeper is a CP system with regard to the CAP theorem. This implies that it sacrifices availability in order to achieve consistency and partition tolerance. In other words, if it cannot guarantee correct behaviour it will not respond to queries.

Consistency Algorithm

Although ZooKeeper provides similar functionality to the Paxos algorithm, the core consensus algorithm of ZooKeeper is not Paxos. The algorithm used in ZooKeeper is called ZAB, short for ZooKeeper Atomic Broadcast. Like Paxos, it relies on a quorum for durability. The differences can be summed up as: only one proposer at a time, whereas Paxos may have many proposers concurrently; a much stronger focus on a total ordering of all changes; and every election of a new leader is followed by a synchronization phase before any new changes are accepted. If you want to read up on the specifics of the algorithm, I recommend the paper “Zab: High-performance broadcast for primary-backup systems”.
What everyone running ZooKeeper in production needs to know is that having a quorum means that more than half of the nodes are up and running. If your client is connected to a ZooKeeper server that does not participate in a quorum, then that server will not be able to answer any queries. This is the only way ZooKeeper is capable of protecting itself against split-brain scenarios in case of a network partition.
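The "more than half" rule is easy to check with a little arithmetic. The helper names below are hypothetical and exist only for illustration; the rule itself is the one stated above.

```python
def has_quorum(reachable, ensemble_size):
    """True if strictly more than half of the servers can reach each other."""
    return 2 * reachable > ensemble_size


def tolerated_failures(ensemble_size):
    """Largest number of servers that can be down while a quorum remains."""
    return (ensemble_size - 1) // 2


# A five-server cluster split 3/2 by a network partition:
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)

# Failures tolerated for cluster sizes 3 to 6.
tolerance_by_size = {size: tolerated_failures(size) for size in range(3, 7)}
```

Only one side of a partition can ever hold more than half of the servers, which is why the minority side has to stop answering. The same arithmetic shows that a cluster with an even number of servers tolerates no more failures than the next smaller odd-sized one.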

What We Don’t Use ZooKeeper For

As much as we love ZooKeeper, we have become so dependent on it that we’re also taking care to avoid pushing its limits. Just because we need to send a piece of information from A to B and they both use ZooKeeper does not mean that ZooKeeper is the solution. Before we send anything through ZooKeeper, the value and the urgency of the information have to be high enough in relation to the cost of sending it (size and update frequency).

Application logs - As annoying as it is to be missing logs when you try to debug something, logs are usually the first thing you’re willing to sacrifice when your system is pushed to the limit. Hence, ZooKeeper is not a good fit; you actually want something with looser consistency requirements.

Binaries - These fellas are just too big and would require tweaking ZooKeeper settings to the point where a lot of corner cases nobody has ever tested are likely to happen. Instead we store binaries on S3 and keep the URLs in ZooKeeper.

Metrics - It may work very well to start off with, but in the long run it would pose a scaling issue. If we had been sending metrics through ZooKeeper, it would simply be too expensive to have a comfortable buffer between required and available capacity. That goes for metrics in general, with the exception of two critical metrics that are also used for application logic: the current disk and memory usage of each node. Disk usage is used by the proxies to stop indexing when the customer exceeds their disk quota, and memory usage will at some stage in the future be used to upgrade customers’ plans when needed.