From the CoreOS labs: Disconnected consistent reads with etcd

• By Vishesh Pamadi and Anthony Romano

Here at CoreOS, we're always working along with the open source community to improve and enhance etcd, a distributed key-value store created by CoreOS. For the upcoming etcd 3.3 series, one of the challenges we're tackling is improving the performance and resiliency of linearizable reads using a distributed systems technique known as leasing.

In a consistent distributed key-value store like etcd, a linearizable read fetches the most recent key state. To ensure the read is up to date, an etcd server waits on its Raft backend for cluster-wide consensus to complete the request. This consensus stage requires that an etcd client sends each request to an etcd server and that the server contact a quorum of cluster members; the client cannot issue linearizable reads when disconnected from the cluster. This increases request pressure, which limits total throughput, and greater sensitivity to network disruption, leading to delays and loss of availability.

The linearizable read consensus penalty stems from unpredictable key writes. By carefully coordinating writes, etcd’s new, experimental leasing layer locally processes linearizable reads without invoking cluster consensus. Even if the etcd cluster is unavailable or disconnected from the client, the leasing layer can still serve linearizable reads. Microbenchmarks show leasing reduces network overhead and improves network partition tolerance. The leasing layer controls key writes using a client-side leasing protocol similar to the kind designed for networked file systems. Essentially, a client acquires ownership of a temporary lease granting exclusive write access for a key, ensuring the key is only updated with the owner’s consent. If all other clients respect the lease, the owner may locally assert consistent linearizable guarantees. If the owner crashes, the lease eventually expires and yields the key to other clients.

Leasing offloads linearized reads away from the core etcd cluster. Unmodified etcd clients transparently work with leasing by connecting to a proxy that follows the leasing protocol. This leasing protocol efficiently negotiates key ownership and protects leased keys from unapproved overwrites.

Start a leasing proxy

The easiest way to try leasing is with etcd’s gRPC proxy. The proxy optionally serves a leasing cache that gives its clients exclusive write access. All that’s needed is a Go installation.

First, install the latest etcd and etcdctl :

$ go get github.com/coreos/etcd/cmd/etcd $ go get github.com/coreos/etcd/cmd/etcdctl

Launch an etcd server on localhost:2379 and a leasing proxy on localhost:12379 . The --experimental-leasing-prefix flag specifies where to write leasing metadata in etcd:

$ etcd & $ ETCDPID=$! $ etcd grpc-proxy start \ --endpoints=http://localhost:2379 \ --listen-addr=localhost:12379 \ --experimental-leasing-prefix="_/leases/" 2>/dev/null &

Acquire a lease on the key abc by getting it through a proxied linearizable read:

$ ETCDCTL_API=3 etcdctl put abc 123 OK $ ETCDCTL_API=3 etcdctl --endpoints=http://localhost:12379 get abc abc 123

Finally, stop the etcd server and read abc :

$ kill -9 $ETCDPID $ ETCDCTL_API=3 etcdctl --endpoints=http://localhost:12379 get abc abc 123

The proxy safely serves the linearized read despite being disconnected from the cluster. By default the key remains available for a minute before the proxy abandons the lease.

Lease negotiation

Physically, the leasing layer represents a lease on a key by storing a corresponding leasing key into etcd. If the leasing key exists, only the lease owner may update the key. To maintain liveness, the owner attaches a session wide time-to-live to its leasing keys; if the owner fails to refresh its session, all its leasing keys will expire.

The leasing layer rewrites every linearizable single-key range request into a leasing transaction to lease a key. If there’s no leasing key, the client acquires the lease for itself, gaining exclusive write permission over the key, and caches the key-value data. This local cache may safely serve linearizable key reads so long as the leasing key is in etcd. To acquire the lease, the leasing transaction wraps the initial range request to additionally create a leasing key if one does not exist, similar to the etcd3 transaction below:

// rangeRequest := clientv3.OpGet(key, ...) leasingKey := “_/leases/” + key resp, err := clientv3.Txn(context.TODO()).If( clientv3.Compare(clientv3.Version(leasingKey, “=”, 0)) ).Then( rangeRequest, clientv3.OpPut(leasingKey, “”, clientv3.WithLease(sessionLeaseID)), ).Else( rangeRequest, ).Commit()

Suppose some client wishes to modify a key already owned by another client. To respect ownership, the writer first posts a revocation message to the leasing key and watches for a delete event. The owner watches its leasing keys for revocation messages; it deletes the leasing key and evicts the key from cache. To avoid contention, the owner sets a backoff time to wait before acquiring the lease again. The writer receives the delete event and transactionally retries the Put, similar to the etcd3 transaction below:

// putRequest := clientv3.OpPut(key, …) leasingKey := “_/leases/” + key resp, err := clientv3.Txn(context.TODO()).If( clientv3.Compare(clientv3.CreateRevision(leasingKey, “<”, cache.leasingKeyRev(key)+1)), ).Then( putRequest, ).Else( clientv3.OpPut(leasingKey, “REVOKE”, clientv3.WithIgnoreLease()), ).Commit()

Throughput and latency

The leasing cache should offload requests from the etcd cluster. This offloading can be observed by comparing the throughput and latency performance between the leasing proxy and etcd servers. Since the cache can serve keys from memory and should rarely access the etcd cluster, the leasing proxy should have higher throughput and lower latencies compared to etcd alone.

To measure throughput, etcd and proxy servers were saturated so requests per second peaked by increasing clients with etcd’s benchmark range tool. The etcd and leasing proxy servers were hosted on standalone n1-standard-1 VM instances to emphasize system resource limits. The benchmark command repeatedly issues linearized reads on one unique key per server; for tests with several servers, key access was sharded by proxy to reduce lease contention. The graph below illustrates the leasing throughput improvement. Baseline etcd has low throughput with few clients since it must wait for consensus; as concurrent operations increase, the requests batch more efficiently and average throughput rises. The leasing proxy has better throughput, both with low concurrency and in general, because requests are immediately served from cache.

Linearized read throughput as client concurrency increases

To measure read latency, an additional network delay is injected on the etcd server connection for the benchmark. The graph below shows average linearized read request latency for a single key as network latency increases. As expected, the etcd server’s response time scales with the network latency due to the injected delay. Since leasing proxy serves requests from its cache, it only takes the etcd delay to initially read the key; the proxy’s request latency amortizes to a constant time (0.1ms) regardless of the delay.

Linearized read latency with injected network delays

Next steps

The current leasing layer’s design goal was to confirm the feasibility of a leasing protocol under etcd. Since it’s a proof of concept, there’s plenty of room to experiment with leasing policies. In the future, leases could be shared among multiple leasing sessions, similar to the shared state in MSI-style cache coherence protocols; if a key is updated infrequently, multiple leasing clients could serve linearizable reads on that key. The policy for acquiring lease ownership could be updated to acquire leases under more circumstances. To limit leasing metadata overhead, the cache could periodically evict cold keys and revoke their leasing keys.

The new etcd leasing layer is part of the upcoming etcd 3.3 series. The latest etcd developments can be found in the etcd GitHub repository along with the official stable releases. As always, the etcd team is committed to building the best distributed consistent key-value store; feel free to report any bugs, ask questions, or make suggestions on the etcd issue tracker.