This diagram shows the replication steps for a SET operation. An application calls set() on the EVCache client library, and from there the replication path is transparent to the caller.

The EVCache client library sends the SET to the local region’s instance of the cache The client library also writes metadata (including the key, but not the data) to the replication message queue (Kafka) The “Replication Relay” service in the local region reads messages from this queue The Relay fetches the data for the key from the local cache The Relay sends a SET request to the remote region’s “Replication Proxy” service In the remote region, the Replication Proxy receives the request and performs a SET to its local cache, completing the replication Local applications in the receiving region will now see the updated value in the local cache when they do a GET

This is a simplified picture, of course. For one thing, it refers only to SET — not other operations like DELETE, TOUCH, or batch mutations. The flows for DELETE and TOUCH are very similar, with some modifications: they don’t have to read the existing value from the local cache, for example.

It’s important to note that the only part of the system that reaches across region boundaries is the message sent from the Replication Relay to the Replication Proxy (step 5). Clients of EVCache are not aware of other regions or of cross-region replication; reads and writes use only the local, in-region cache instances.

Component Responsibilities

Replication Message Queue

The message queue is the cornerstone of the replication system. We use Kafka for this. The Kafka stream for a fully-replicated cache has two consumers: one Replication Relay cluster for each destination region. By having separate clusters for each target region, we de-couple the two replication paths and isolate them from each other’s latency or other issues.

If a target region goes wildly latent or completely blows up for an extended period, the buffer for the Kafka queue will eventually fill up and Kafka will start dropping older messages. In a disaster scenario like this, the dropped messages are never sent to the target region. Netflix services which use replicated caches are designed to tolerate such occasional disruptions.

Replication Relay

The Replication Relay cluster consumes messages from the Kafka cluster. Using a secure connection to the Replication Proxy cluster in the destination region, it writes the replication request (complete with data fetched from the local cache, if needed) and awaits a success response. It retries requests which encounter timeouts or failures.

Temporary periods of high cross-region latency are handled gracefully: Kafka continues to accept replication messages and buffers the backlog when there are delays in the replication processing chain.

Replication Proxy

The Replication Proxy cluster for a cache runs in the target region for replication. It receives replication requests from the Replication Relay clusters in other regions and synchronously writes the data to the cache in its local region. It then returns a response to the Relay clusters, so they know the replication was successful.

When the Replication Proxy writes to its local region’s cache, it uses the same open-source EVCache client that any other application would use. The common client library handles all the complexities of sharding and instance selection, retries, and in-region replication to multiple cache servers.

As with many Netflix services, the Replication Relay and Replication Proxy clusters have multiple instances spread across Availability Zones (AZs) in each region to handle high traffic rates while being resilient against localized failures.

Design Rationale and Implications

The Replication Relay and Replication Proxy services, and the Kafka queue they use, all run separately from the applications that use caches and from the cache instances themselves. All the replication components can be scaled up or down as needed to handle the replication load, and they are largely decoupled from local cache read and write activity. Our traffic varies on a daily basis because of member watching patterns, so these clusters scale up and down all the time. If there is a surge of activity, or if some kind of network slowdown occurs in the replication path, the queue might develop a backlog until the scaling occurs, but latency of local cache GET/SET operations for applications won’t be affected.

As noted above, the replication messages on the queue contain just the key and some metadata, not the actual data being written. We get various efficiency wins this way. The major win is a smaller, faster Kafka deployment which doesn’t have to be scaled to hold all the data that exists in the caches. Storing large data payloads in Kafka would make it a costly bottleneck, due to storage and network requirements. Instead, the Replication Relay fetches the data from the local cache, with no need for another copy in Kafka.

Another win we get from writing just the metadata is that sometimes, we don’t need the data for replication at all. For some caches, a SET on a given key only needs to invalidate that key in the other regions — we don’t send the new data, we just send a DELETE for the key. In such cases, a subsequent GET in the other region results in a cache miss (rather than seeing the old data), and the application will handle it like any other miss. This is a win when the rate of cross-region traffic isn’t high — that is, when there are few GETs in region A for data that was written from region B. Handling these occasional misses is cheaper than constantly replicating the data.

Optimizations

We have to balance latency and throughput based on the requirements of each cache. The 99th percentile of end-to-end replication latency for most of our caches is under one second. Some of that time comes from a delay to allow for buffering: we try to batch up messages at various points in the replication flow to improve throughput at the cost of a bit of latency. The 99th percentile of latency for our highest-volume replicated cache is only about 400ms because the buffers fill and flush quickly.

Another significant optimization is the use of persistent connections. We found that the latency improved greatly and was more stable after we started using persistent connections between the Relay and Proxy clusters. It eliminates the need to wait for the 3-way handshake to establish a new TCP connection and also saves the extra network time needed to establish the TLS/SSL session before sending the actual replication request.

We improved throughput and lowered the overall communication latency between the Relay cluster and Proxy cluster by batching multiple messages in single request to fill a TCP window. Ideally the batch size would vary to match the TCP window size, which can change over the life of the connection. In practice we tune the batch size empirically for good throughput. While this batching can add latency, it allows us to get more out of each TCP packet and reduces the number of connections we need to set up on each instance, thus letting us use fewer instances for a given replication demand profile.

With these optimizations we have been able to scale EVCache’s cross-region replication system to routinely handle over a million RPS at peak daily.

Challenges and Learnings

The current version of our Kafka-based replication system has been in production for over a year and replicates more than 1.5 million messages per second at peak. We’ve had some growing pains during that time. We’ve seen periods of increased end-to-end latencies, sometimes with obvious causes like a problem with the Proxy application’s autoscaling rules, and sometimes without — due to congestion on the cross-region link on the public Internet, for example.

Before using VPC at Amazon, one of our biggest problems was the implicit packets-per-second limits on our AWS instances. Cross that limit, and the AWS instance experiences a high rate of TCP timeouts and dropped packets, resulting in high replication latencies, TCP retries, and failed replication requests which need to be retried later. The solution is simple: scale out. Using more instances means there is more total packets-per-second capacity. Sometimes two “large” instances are a better choice than a single “extra large,” even when the costs are the same. Moving into VPC significantly raised some limits, like packets per second, while also giving us access to other enhanced networking capabilities which allow the Relay and Proxy clusters to do more work per instance.

In order to be able to diagnose which link in the chain is causing latency, we introduced a number of metrics to track and monitor the latencies at different points in the system: from the client application to Kafka, in the Relay cluster’s reading from Kafka, from the Relay cluster to the remote Proxy cluster, and from Proxy cluster to its local cache servers. There are also end-to-end timing metrics to track how well the system is doing overall.

At this point, we have a few main issues that we are still working through:

Kafka does not scale up and down conveniently. When a cache needs more replication-queue capacity, we have to manually add partitions and configure the consumers with matching thread counts and scale the Relay cluster to match. This can lead to duplicate/re-sent messages, which is inefficient and may cause more than the usual level of eventual consistency skew.

If we lose an EVCache instance in the remote region, this results in an increase in latency as the Proxy cluster tries and fails to write to the missing instance. This latency leads back to the Relay side, which is awaiting confirmation for each (batched) replication request. We’ve worked to reduce the time spent in this state: we detect the lost instance earlier, and we are investigating reconciliation mechanisms to minimize the impact of these situations. We have made changes in the EVCache client that allow the Proxy instances to cope more easily with the possibility that cache instances can disappear.

Kafka monitoring, particularly for missing messages, is not an exact science. Software bugs can cause messages not to appear in the Kafka partition, or not to be received by our Relay cluster. We monitor by comparing the total number of messages received by our Kafka brokers (on a per topic basis) and the number of messages replicated by the Relay cluster. If there is more than a small acceptable threshold of difference for any significant time, we investigate. We also monitor maximum latencies (not the average), because the processing of one partition may be significantly slower for some reason. That situation requires investigation even if the average is acceptable. We are still improving these and other alerts to better detect real issues with fewer false-positives.

Future

We still have a lot of work to do on the replication system. Future improvements might involve pipelining replication messages on a single connection for better and more efficient connection use, optimizations to take better advantage of the network TCP window size, or transitioning to the new Kafka 0.9 API. We hope to make our Relay clusters (the Kafka consumers) autoscale cleanly without significantly increasing latencies or increasing the number of duplicate/re-sent messages.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere. In this post we covered how we took on the challenge of providing reliable and fast replication for caching systems at a global scale. We look forward to improving more as our needs evolve and as our global member base expands. As a company, we strive to win more of our member’s moments of truth and our team helps in that mission by building highly-available distributed caching systems at scale. If this is something you’d enjoy too, reach out to us — we’re hiring!

— The EVCache Team (Shashi Madappa, Vu Nguyen, Scott Mansfield, Sridhar Enugula, Allan Pratt, Faisal Zakaria Siddiqi)

See Also: