Containers and persistent data


Increasingly, users want to containerize their entire infrastructure, which means not just web servers, queues, and proxies, but database servers, file storage, and caches as well. In fact, one of the first questions from the audience at the recent CoreOS Fest during the application container (appc) specification panel was: "What about stateful containers?" These services require persistent data, or "state", that can't be casually discarded. As Tim Hockin of Google said: "Ideally everything is stateless. But there has to be a turtle at the bottom that holds the state."

Storing persistent data for containers has been a chronic issue since Docker was introduced, because the initial design for Docker simply didn't deal with the concept of data that needs to outlive the container runtime and that can't be easily moved from one machine to another. Docker's only concession to the need for state in version 1.0 was to allow "volumes": external filesystems that the container could access, but that were otherwise unmanaged by the Docker system. The conventional wisdom was not to put your data into containers.

According to the panel, the appc spec may take this a step further by specifying that the entire container image and all of its initial files be immutable, except for specific configured directories or mounts. The idea is that any data that administrators care about needs to be put in specific, network-mounted directories, anyway. Managing these directories will then be left to orchestration frameworks.

A reader would be forgiven for thinking that both Docker and CoreOS had decided to ignore the issue of persistent data for the time being. Certainly a lot of developers in the "world of containers" seem to think so and are working to fill that gap. They described some of their projects to deal with persistent data at ContainerCamp and CoreOS Fest. One solution is to make distributed data storage trustworthy, so that administrators don't have to worry about data being persistent on any specific node.

The Raft consensus algorithm

A reliable distributed data store needs some way of ensuring that the data in it will eventually, if not immediately, become consistent across all nodes. While this is easy on a single-node database, a multi-node database where any individual node is allowed to fail requires more complex logic to make writes consistent. This is called "consensus": agreement on shared state across multiple nodes. Diego Ongaro, developer of Raft, LogCabin, and RAMCloud, described to the audience how it works.

The Paxos consensus algorithm was introduced by Leslie Lamport in 1989, and for over two decades was the first and last word in consensus. But it had some problems, the chief of which was extreme complexity. "Maybe five people really, truly understood every part of Paxos," Ongaro said. Students and programmers alike found it difficult or impossible to write tests that validated whether the protocols described by Paxos were correctly implemented. This meant that, despite being based on the same algorithm, different Paxos implementations were radically different from each other, and could not be proven to have correctly implemented the algorithm, he said.

To solve this, Ongaro and Professor John Ousterhout at Stanford created a new consensus algorithm that they designed to be simple, testable, and easy to explain to developers. Ongaro's PhD thesis described the algorithm [PDF], called Raft; the name refers to the desire to "escape the island of Paxos." Ongaro described their reasoning: "At every design choice, we asked ourselves: what's easier to explain?"

Judging by the number of projects that implement Raft, chief among them CoreOS's etcd, they have been successful in their goal of comprehensibility. Ongaro explained the core workings of Raft in a half-hour session, and demonstrated it using a Raft model written entirely in JavaScript.

Each node in the Raft cluster has a consensus module and a state machine. The state machine stores the data you care about, including a serialized log of state changes to that data. The consensus module ensures that this log stays consistent with the logs of every other node in the cluster, which requires all state changes (writes) to be initiated by a "leader" node. For each write, the leader sends the entry to all nodes, and only makes the write permanent once a majority of all nodes (called a "quorum") has confirmed it. This is a form of "two-phase commit", which has long been a component of distributed databases.
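The quorum rule at the heart of this scheme can be sketched in a few lines. This is an illustration of the majority arithmetic only, not real Raft or etcd code; the function names are invented for the example.

```python
# Sketch of Raft-style quorum commit: a leader's write becomes
# permanent only once a majority of the cluster has acknowledged it.

def quorum(cluster_size):
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1

def can_commit(acks, cluster_size):
    """Acks include the leader's own; commit needs a quorum."""
    return acks >= quorum(cluster_size)

# In a five-node cluster, three acknowledgments (the leader plus two
# followers) are enough to commit; two are not.
```

This is also why Raft clusters are usually sized with an odd number of nodes: a four-node cluster needs three acknowledgments, just like a five-node one, but tolerates one fewer failure.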

Each node in the cluster also has a countdown clock that waits a random but significant time for a leader to send a message. If the current leader is unavailable, the node that counts down first sends out a message requesting a leader election. If a quorum confirms the leader election message, then the sender becomes the new leader. Log messages sent out by the new leader have a new value for the "term" field in the message that indicates which node was leader when the write happened, preventing log conflicts.

There is obviously more to Raft than that, such as how missing and extraneous entries are dealt with, but Ongaro was able to explain the core design in less than half an hour. This simplicity means that many developers have been able to produce software using Raft, including projects like etcd, CockroachDB, Consul, and libraries for Python, C, Go, Erlang, Java, and Scala.

[ Update: Josh Berkus updates some of the information about Raft in a comment below. ]

Etcd and Intel

Nic Weaver, who works on software-defined infrastructure (SDI) at Intel, spoke about what the company has been doing to improve etcd. Intel has a strong interest in both Docker and CoreOS because it is trying to help users scale to larger numbers of machines per administrator. Cloud hosting has allowed companies to scale to numbers of services where configuration management by itself isn't adequate, and Intel sees containers as a way to scale further.

As such, Intel has tasked Weaver's team with helping to improve container infrastructure software. In addition to releasing the Tectonic cluster server stack with Supermicro, as mentioned in the first article of this series, the company has also put some work into the software itself. The component Intel decided to start with was etcd — looking at what it would take to build a really large etcd cluster. As container infrastructures managed with etcd-based tools grow to thousands of containers, the number of required etcd nodes grows, and the number of writes to etcd grows even faster.

The problem the team observed with etcd was that the more nodes there were in a cluster, the slower it would get. This is because, per the Raft algorithm, a write requires more than 50% of the etcd nodes to sync to disk and return success. So even a few nodes with chronically slow storage can hold up the whole cluster. If disk sync were not required, that source of slowdown would disappear, but the entire cluster would then risk corruption in the event of a data-center power loss.
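The slowdown follows directly from the quorum rule: a write commits when the slowest node in the fastest majority has synced, so a couple of slow disks can set the pace for everyone. A rough illustration (not etcd code):

```python
# Rough illustration of quorum write latency: the commit time is the
# fsync time of the slowest node within the fastest majority.

def commit_latency(fsync_times_ms):
    majority = len(fsync_times_ms) // 2 + 1
    return sorted(fsync_times_ms)[majority - 1]

# Five nodes, two with chronically slow storage: the cluster still
# commits at the speed of its third-fastest disk...
mixed_cluster = [2, 3, 3, 40, 50]

# ...but one more slow node would stall every write.
degraded_cluster = [2, 3, 40, 40, 50]
```

In other words, an etcd cluster tolerates a minority of slow disks without slowing down, but the moment slow disks reach the quorum threshold, every write pays their price.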

Their solution to this was to make use of a facility added to Xeon processors called "asynchronous DRAM self-refresh" (ADR). This is a small designated area of RAM that is preserved when there is a crash and restored on system restart. It was created to support dedicated storage devices. There is a Linux API for ADR, though, so applications like etcd can use it.

Modifying etcd to use the ADR buffer as the write buffer to its logs was a success. Write time went from 25% to 2% of overall time in the cluster, and it was able to double throughput to 10,000 writes per second. This patch will soon be submitted to the etcd project.

CockroachDB

One of the natural steps to take with the Raft consensus algorithm is to go beyond the etcd key-value store and to build a full-service database around it. While etcd is adequate for configuration information, it lacks many of the features users want in application databases, such as transactions and support for complex requests. Spencer Kimball of Cockroach Labs explained how his team is doing just that. The new database is called CockroachDB, because it is intended to be, in his words, "impossible to stamp out. You kill it, and it pops up again somewhere else."

CockroachDB is designed to be similar to Google's Megastore, a project Kimball was quite familiar with from his time at Google. The idea is to support consistency and availability across the whole cluster, including support for transactions. The project is planning to add a SQL-compatible layer on top of the distributed key-value store, as Google's Spanner project did. By having transactions as well as both SQL and key-value modes, it can enable most of the common uses of databases. "We want users to build apps, not workarounds," said Kimball.

The database is deployed as a set of containers, distributed across servers. The key-value address space for the database is partitioned into "ranges", and each range is copied to a subset of the available nodes, usually three or five. Kimball calls this cluster of individual Raft consensus groups "MultiRaft". This allows the entire cluster to contain more data than is present on any individual node, helping to scale the database.
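The partitioning idea can be sketched simply: split the key space at chosen points, then give each resulting range its own small replica set. The split points, node names, and round-robin placement below are all invented for the illustration; CockroachDB's actual range and replica management is considerably more sophisticated.

```python
# Sketch of range partitioning with per-range replica sets, in the
# spirit of CockroachDB's "MultiRaft". All values are illustrative.

SPLIT_POINTS = ["g", "p"]        # three ranges: [min,"g"), ["g","p"), ["p",max)
NODES = ["n1", "n2", "n3", "n4", "n5"]
REPLICATION = 3                  # each range is copied to three nodes

def range_for(key):
    """Index of the range a key falls into."""
    for i, split in enumerate(SPLIT_POINTS):
        if key < split:
            return i
    return len(SPLIT_POINTS)

def replicas_for(key):
    """Each range maps to its own replica set (here, round-robin)."""
    start = range_for(key) * REPLICATION % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION)]
```

Because each range runs its own consensus group, no single node needs to hold — or vote on — the whole key space, which is what lets the cluster store more data than any one machine.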

Each node runs in "serializable" transaction mode by default, which means that all transactions must be replayable in log order. If a transaction has a serialization failure, it is rolled back on the originating node. This permits distributed transactions without unnecessary locking.
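A client of such a database is expected to retry transactions that are rolled back. The retry loop below is the generic client-side pattern, not CockroachDB's actual API; the exception class and transaction function are stand-ins for the example.

```python
# Generic client-side pattern for a database that rolls back on
# serialization failure: retry the transaction until it commits.

class SerializationFailure(Exception):
    """Stand-in for the error a serializable database raises on conflict."""

def with_retries(transaction, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise

# Example: a transaction that conflicts once, then succeeds on retry.
attempts = []
def transfer():
    attempts.append(1)
    if len(attempts) < 2:
        raise SerializationFailure()
    return "committed"
```

Pushing conflict handling to the client like this is what lets the server avoid distributed locking: conflicting transactions simply lose the race and run again.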

From this, it sounds like CockroachDB might be the answer to everyone's distributed infrastructure data persistence issues. It has one major fault though: the project isn't yet close to a stable release, and many of the planned features, such as SQL support, haven't been written yet. So while it may solve many persistence issues in the future, there are other solutions for right now.

High-availability PostgreSQL

Since highly distributed databases aren't yet ready for production use, developers are taking existing popular databases and making them fault-tolerant and container-friendly. Two such projects aim to provide high-availability PostgreSQL: Flocker from ClusterHQ and Governor from Compose.io.

Luke Marsden, CTO of ClusterHQ, presented Flocker at ContainerCamp. Flocker is a data volume management tool designed to help host databases in containers. It provides volume management that supports migrating database containers from one physical machine to another. This means that orchestration frameworks can redeploy database containers in almost the same way they would stateless services, which has been one of the challenges to containerizing databases.

Flocker is able to support migrating containers between physical machines by making use of the ZFS on Linux project. Flocker creates Docker volumes on specially managed ZFS directories, allowing the user to move and copy those volumes by using exportable ZFS snapshots. Operations are performed via a simple declarative command-line interface.
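The ZFS mechanics underneath such a migration are conceptually simple: snapshot the dataset, then stream the snapshot to the target host. The sketch below just assembles those commands; the dataset and host names are hypothetical, and this is not Flocker's actual interface — only the standard ZFS operations it conceptually wraps.

```python
# Sketch of the ZFS operations behind a volume migration: snapshot
# the dataset, then stream it to the target machine. Names are
# hypothetical; this builds the command strings without running them.

def migration_commands(dataset, snapshot, target_host):
    snap = f"{dataset}@{snapshot}"
    return [
        f"zfs snapshot {snap}",
        f"zfs send {snap} | ssh {target_host} zfs receive {dataset}",
    ]
```

Because ZFS snapshots are copy-on-write, taking one is nearly free, and incremental sends (`zfs send -i`) keep repeated migrations cheap.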

Flocker is designed as a plugin for Docker. One challenge for the Flocker team is that Docker doesn't currently support plugins. The team created a plugin infrastructure called Powerstrip, but that tool has yet to be accepted into mainstream Docker. Until it is, the Flocker project can't provide a unified management interface.

If Flocker solves the container migration problem, then the Governor project from Compose, presented by Chris Winslett at CoreOS Fest, aims to solve the availability problem. Governor is an orchestration prototype for a self-managing replicated PostgreSQL cluster, and is a simplified version of the Compose infrastructure.

Compose is a Software-as-a-Service (SaaS) hosting company, which means that the services it offers need to be entirely automated. In order for Compose to deploy PostgreSQL, it needed to support automatic database replica deployment and failover. Since users have full database access, Compose also needed a solution that didn't require making any changes to the PostgreSQL code or to users' databases.

One of the things Winslett figured out quickly was that PostgreSQL could not be the canonical store of its own availability and replication state, because the master and all replicas would have identical information. This led to implementing a solution based on Consul, a distributed high-availability information service. However, Consul requires 40GB of virtual memory for each data node, which wasn't practical for tiny cloud server nodes. Winslett abandoned Consul for the much simpler etcd and, in the process, substantially simplified the failover logic.

Governor works by having the governor daemon control PostgreSQL in each container. Governor queries etcd to find out who the current master is and replicates from it on startup. If there is no master, it attempts to seize the leader key on etcd, and etcd ensures that only one requester can win that contest. Whichever node gets the leader key becomes the new master and the other nodes start replicating from it. Since the leader key has a time-to-live (TTL), if the master fails, a new leader election will follow shortly, ensuring that there will quickly be a new master.
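The key primitive here is an atomic "create the leader key if it doesn't exist" with a time-to-live. The toy class below simulates that behavior in memory to show the election logic; it is not the etcd API, and the TTL value and node names are invented for the example.

```python
import time

# Toy in-memory stand-in for an etcd-style leader key with a TTL,
# illustrating Governor-style leader election. Not the etcd API.

class LeaderKey:
    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires = 0.0

    def try_acquire(self, node, now=None):
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires:
            # Key is absent or its TTL lapsed: first requester wins.
            self.holder = node
        if self.holder == node:
            self.expires = now + self.ttl   # the master renews its lease
            return True
        return False                        # replicate from the master
```

Each node calls `try_acquire` periodically: the winner keeps renewing its lease and acts as master, the losers replicate from it, and if the master stops renewing, the key expires and the next caller takes over.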

This means that Compose can treat PostgreSQL almost the same as it treats the multi-master, non-relational databases it supports, like MongoDB and Elasticsearch. In the new system, PostgreSQL nodes can be configured in etcd, and then deployed using container orchestration systems, without hands-on administration or handling those containers differently.

Conclusion

Of course, there are many more projects and presentations than the ones mentioned above. For example, Sean McCord spoke at CoreOS Fest about using the distributed filesystem Ceph as a block device inside Docker containers, as well as running Ceph with each node in a container. While this approach is fairly rudimentary right now, it offers another option for containers that need to run services that depend on large file storage. Cloudconfig, CoreOS's new boot-from-a-container tool, was also introduced by Alex Crawford.

As the Linux container ecosystem moves from test instances and web servers into databases and file storage, we can expect to continue to see new approaches to solving these kinds of problems, increasing scale, and integrating with other tools. If anything is clear from CoreOS Fest and ContainerCamp, it's that Linux container technology is still young and we can expect many more new projects and dramatic changes in approaches over the next year.