Announcing etcd 3.3, with improvements to stability, performance, and more

• By Gyu-Ho Lee

We're proud to announce that etcd v3.3.0 is now public! This release includes backend database improvements, data corruption checking, a new client that's more tolerant of network partitions, v2 API emulation, and many more changes.

For the full list, please see the CHANGELOG and the v3.3 upgrade guide.

An improved backend: bbolt

Previously, etcd used boltdb/bolt to store data locally on each node. Bolt is an embedded key-value store originally developed by Ben Johnson and now forked at coreos/bbolt.

Bolt maintains a separate freelist DB to record pages that are no longer needed and have been freed after transactions. Subsequent transactions can reuse those pages from the freelist tree. This minimizes garbage collection while keeping database size relatively static – or rather, it seemed that way until we found this "database space exceeded" issue.

When Bolt commits a transaction, the current freelist is flushed out to disk. The persisted freelist is then reloaded when Bolt comes back online. This can speed up the recovery process, since it initializes the freelist from previously synced pages. Otherwise, Bolt would need to rescan every page in the database file to find free pages and reconstruct the freelist from scratch.

However, the freelist sync demands more disk space and incurs extra latency. In one user's case, where there were many free pages due to frequent snapshots with read transactions, the database size quickly grew from 16 MiB to 4 GiB as a result of large freelist syncs.

The maximum database size for etcd is 10 GiB, which bounds freelist reconstruction at tens of seconds, and a slightly slower restart seemed like an acceptable trade-off. Thus, the decision was made to drop the freelist sync and rebuild the freelist on recovery, which resolved the "database space exceeded" issue. The overhead of rebuilding the freelist on a 10 GiB database file is about 2 seconds on a system with an Intel Core i7-7600U CPU, 16 GB RAM, and a 250 GB SSD running Linux 4.13.0-32. The benefits of a static, predictable database size outweighed this recovery cost, and with smaller data the overhead is small enough to go unnoticed.

Bolt provides safe migration between the two freelist modes, sync and no-sync, so database files remain compatible across upgrades. We also fixed other critical bugs and improved garbage collection performance in the database.

Smarter Client Balancer

Previously, the etcd client blindly retried on network disconnects, so it was easily stuck on partitioned or blackholed nodes, and this affected production users. To achieve higher client availability under network partitions, significant work was invested in improving the client balancer. Under the hood, the client balancer now maintains a list of unhealthy endpoints gathered via the gRPC Health Checking Protocol, and it retries more efficiently in the face of transient disconnects and network partitions. The 3.3 client's greatly improved failover mechanism makes it better able to survive unreliable networks.

Experimental Data Corruption Check

Although v3 provides the Hash API to get key-value storage digests, earlier releases did nothing to prevent corrupted replicas from serving clients (e.g. corruption from etcd bugs, bad file systems, etc.). Version 3.3 adds the --experimental-corrupt-check-time flag, which causes etcd to monitor storage states and raise an alarm when corruption is detected.

A corruption alarm still does not preclude inconsistent data being served between boot time and monitor detection, however, and the situation gets worse if a corrupted node becomes the leader, replicating snapshots of corrupted data to its peers. To address this, we also added the --experimental-initial-corrupt-check flag for boot-time CRC32 verification. With this flag enabled, nodes first fetch hashes from peers at a known revision and perform integrity checks before serving any peer/client traffic; if the data is mismatched, the server terminates.

Causal consistency in serializable reads

Serializable read requests are served by local nodes with weak consistency: when a request lands on a stale node, it may return revisions older than those returned by previous requests. Which node holds the latest revision should be transparent to the client, because we want etcd, not the application, to handle that complexity. Thus, the clientv3/ordering package was added to help etcd users ensure that responses are ordered regardless of which node serves them. It caches the previously seen revision so that automatic retries do not violate ordering.

v2 API Emulation with v3 Storage, HTTP endpoints

gRPC and Protocol Buffers dependencies have been significant barriers to v3 API adoption in other languages. For instance, since Perl has no official gRPC support yet, the community has had to rely on the old v2 APIs. Version 3.3 introduces v2 API emulation backed by the v3 storage layer, which is more efficient and scalable than v2 storage. To enable this feature, use the --experimental-enable-v2v3 flag:

```shell
$ etcd --experimental-enable-v2v3='/myprefix'
$ curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d value="Hello world"
{"action":"set","node":{"key":"/foo","value":"Hello world","modifiedIndex":1,"createdIndex":1}}
$ curl http://127.0.0.1:2379/v2/keys/foo
{"action":"get","node":{"key":"/foo","value":"Hello world","modifiedIndex":1,"createdIndex":1}}
```

Although v2 storage is being deprecated, HTTP endpoints will be maintained through v2 emulation and the gRPC gateway for those who prefer the HTTP/1.1 protocol to HTTP/2. Note that support for the gRPC gateway will be reevaluated after the official grpc-web release.

TLS Enhancements

What if a certificate private key is lost or compromised? What if the certificate authority (CA) itself is compromised? What if a certificate's affiliation changes? In addition to the CRL support in the Go standard library, version 3.3 supports a separate X.509 Certificate Revocation List (CRL) file to check whether certificates have been revoked by the issuing CA before their expiration dates, indicating they should no longer be trusted. Configure the --client-crl-file and --peer-crl-file flags to validate certificates against revocation lists.
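As a rough sketch of the configuration, assuming hypothetical file names (revocation.crl would be issued and maintained by your CA alongside the usual cert and key files):

```shell
# File names are placeholders; revocation.crl comes from your CA.
etcd --name infra0 \
  --client-cert-auth --trusted-ca-file=ca.crt \
  --cert-file=server.crt --key-file=server.key \
  --client-crl-file=revocation.crl \
  --peer-crl-file=revocation.crl
```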

In addition, TLS authentication now supports wildcard DNS names in the Subject Alternative Name (SAN) field. For instance, if a peer certificate contains the wildcard domain name *.example.default.svc, etcd looks up example.default.svc and authenticates the peer if one of the resolved addresses matches the peer's remote IP address. Please see the security doc for more details.

Kubernetes TLS bootstrapping involves generating dynamic certificates for etcd members and other system components (e.g. API server, kubelet, and so on). Maintaining a different CA for each component provides tighter access control to the etcd cluster but is often tedious. Version 3.3 now supports Common Name (CN)-based authentication for inter-peer connections. When the --peer-cert-allowed-cn flag is specified, nodes can only join if their certificates carry the matching common name, even when all components share a single CA.
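A sketch of the peer configuration, with placeholder names ("etcd-peer" is a hypothetical CN that every member certificate would share):

```shell
# "etcd-peer" is a placeholder CN; only peer certs bearing it may join.
etcd --peer-cert-allowed-cn etcd-peer \
  --peer-client-cert-auth --peer-trusted-ca-file=ca.crt \
  --peer-cert-file=peer.crt --peer-key-file=peer.key
```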

Client Leasing, Disconnected Linearized Reads

etcd uses the leader-based Raft consensus protocol. To serve strongly consistent reads, follower nodes forward requests to the leader, which must confirm its status with a quorum of nodes to guarantee it returns the most recent key update. This consistency requirement incurs communication latency and limits throughput in read-intensive workloads. Things get even worse if a node is partitioned: the client cannot issue any linearizable reads while disconnected from the cluster.

To tackle this problem, etcd 3.3 introduces an experimental leasing layer in the gRPC proxy. It coordinates client-side key writes by granting temporary lease ownership of a key: a client must acquire the lease to update the key. As long as other clients respect the lease, the owner can safely serve linearizable reads without invoking cluster consensus. If the owner crashes, the lease expires and ownership can be acquired by another client.

Performance Comparison

As always, performance varies depending on workloads. Here are some of our benchmark results.

The etcd functional-tester can inject failures under high-pressure loads, but it is still self-testing with artificial workloads. Since Kubernetes depends on etcd, we want to ensure new versions of etcd do not introduce subtle instability or performance regressions under real Kubernetes workloads.

We use Kubemark to simulate real cluster workloads on master components and etcd. Our setup was Kubemark 1.10 with a three-node etcd cluster, with the Kubernetes API server's --etcd-quorum-read flag enabled. All tests ran on Google Cloud Platform Compute Engine virtual machines running Ubuntu 16.04 and Linux kernel 4.13.0-1002-gcp. Each etcd node has two vCPUs, 8 GB of memory, and a 40 GB SSD.

The graph below shows various etcd metrics observed as Kubemark injects 500-node workloads for three hours. Both 3.2 and 3.3 stably serve intensive Kubernetes workloads with predictable resource usage. Version 3.3 shows better disk sync latency, mainly due to the reduced freelist sync: 3.2 latencies range between 6 ms and 9 ms, while 3.3 latencies range between 4 ms and 7 ms.

For the best etcd throughput, latency should remain stable and low as the total number of keys increases. Latency spikes can delay other operations, trigger leader elections, and fire monitoring alerts. Ideally, etcd should serve thousands of concurrent clients with low average latency and high throughput.

The next test suite creates 3 million unique 256-byte keys with 1024-byte values, for a total data size of 6.6 GiB, using 1,000 concurrent clients over 100 TCP connections. We ran the test on Google Cloud Platform Compute Engine virtual machines running Ubuntu 17.10 and Linux kernel 4.13.0-25-generic: a three-node etcd cluster on three virtual machines, with one separate virtual machine for the stressing clients. Each virtual machine has 16 vCPUs, 30 GB of memory, and a 300 GB SSD capable of 150 MB/s sustained writes.

The graph below shows the latency pattern as the number of total keys increased. etcd 3.2's average throughput was 32,976 req/sec, while 3.3 achieved 35,682 req/sec. Version 3.2 slowed down toward the end, rejecting requests with "etcdserver: too many requests" errors, while 3.3 maintained low latency at scale.

Below is a graph of the latency distribution in log scale. Both 3.2 and 3.3 show stable bounds, but 3.3 has lower latency distribution with tighter bounds.

The graph below shows the latency distribution for linearizable reads. There is little difference between the versions, since this release did not focus on read optimization.

More detailed test results can be found at dbtester/test-results.

Again, our highest priorities for etcd are correctness and reliability. We extensively tested v3.3 with fault injection for several months and we are confident in its robustness under chaotic scenarios.

Future Work

The focus areas for the 3.4 milestone are stability of new features, downgrade support, better handling of large responses, and improvements to the development process and testing infrastructure. We will also improve the client health balancer with the new gRPC balancer interface.

Furthermore, to better support the Kubernetes community, we now maintain the previous two minor versions. Please join our bi-weekly etcd meeting for the latest discussions on the roadmap.

With CoreOS now part of the Red Hat family, etcd’s future in the open source community remains strong as a core component of Kubernetes. To read more about Red Hat’s acquisition of CoreOS, see the FAQ.