Daniel Ellis (u/daniel)

Senior Software Engineer

Performance matters. One of the first tools we as developers reach for when looking to get more performance out of a system is caching. As Reddit has grown in users and response times have improved, the amount of caching has grown to be quite large as well.

In this post we’ll talk about some of the nuts-and-bolts numbers of Reddit’s caching infrastructure—the number of instances, size of instances, and overall throughput. We hope that sharing this information may help others gauge what type of performance and sizing they can expect when building similar clusters. At the very least, we hope you’ll find it interesting to see a bit more about how Reddit works under the hood.

We’ll also go over the Reddit-specific type of work our caches do, how we use mcrouter to manage our caches more effectively, and the custom monitoring (MemcachedSlabCollector and mcsauna) we’ve written to help us understand what’s going on behind the scenes. We’ll also talk about some of the more subtle issues that we’ve run into when deploying changes to our caches.

Memcached Basics

This post assumes a basic knowledge of caching. If you haven’t done much explicit work with caching, here’s a brief overview. Since we use memcached, I’ll primarily be talking in terms of it. (If you know all of this, feel free to skip this part.)

At its core, memcached is a key value store. You can issue a set command to set a key’s value and a get command to get the value back. You can also add a key, which will only create the key if it doesn’t already exist. This lets us use memcached for things like rate limits and locking.

Keys can also be set with a TTL, or expiration time. If no expiration time is set, the key will be kept until space is needed for new keys and something needs to be “evicted.” There are a lot of cache eviction algorithms out there, but the one memcached uses is Least Recently Used. Basically, memcached keeps track of which keys have had activity recently, and if it needs to make space, will prefer to delete those keys that haven’t been accessed lately. Of course, there is also a delete command to explicitly remove a key.
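
To make this concrete, here’s a quick tour of those commands in Python, using the pymemcache client for illustration (any memcached client exposes equivalents):

```python
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

client.set("greeting", "hello", expire=60)  # set with a 60-second TTL
print(client.get("greeting"))               # b'hello'

# add() only succeeds if the key doesn't already exist -- this is the
# building block for the rate limiting and locking described later.
created = client.add("greeting", "hi", noreply=False)
print(created)                              # False: the key already exists

client.delete("greeting")                   # explicit removal
```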

What It Looks Like

At Reddit, we cache a wide variety of things, including database objects, query results, and memoization of function calls. We also use it for non-cachey things, like rate limiting and locking. To accomplish this on a site of our size, we need hardware—a lot of it. At the time of this writing, our memcached infrastructure consists of 54 of AWS EC2’s r3.2xlarge instances, not including local caching on individual application servers. This comes out to nearly 3.3TB of dedicated caching.

As you may expect, managing a caching infrastructure of this size can get tricky, and a little optimization here and there can go a long way toward making everything run more efficiently. One of the optimizations we make is splitting up our caches by workload type, rather than running them as one big pool. This gives us a few distinct advantages.

Advantages to Workload-Based Pools

The first advantage is simple: separating pools based on workload allows us to scale independently based on utilization. If we find that our caches are starting to evict more and our database is slowing down as a result, we can scale the database-caching components independently of our other caches.

Scaling isn’t necessarily the only benefit, however. Having cache pools dedicated to one thing lets us make reasonable predictions about what will happen in case of a failure, test those predictions by forcing failover, then make decisions about what tradeoffs we might want to make based on our findings.

For instance, let’s say we were to test taking down a single cache responsible for database objects. We may find this failure results in a 300% increase in query rate to one of our postgres pools, a 300% increase in p99 response time, and a backlog of requests that makes the site unavailable for 15 seconds while our failover cache warms up.

We might then look at the mean time between failures of these instances and decide that our pool size is reasonable given the risk, and that the cost of increasing it simply isn’t worth it. If we were to have our rate limits and locks in the same pool as our cached database objects, testing a failure like this would not only have a lot of unpredictable side effects; we probably wouldn’t be able to run this sort of test at all!

Independent scaling and testing are not the only benefits of workload-based pools. For some of our cache pools, we may be doing more dangerous or critical operations where more thought needs to go into maintenance. For instance, in places where we are doing read-modify-write or handling locks, consistency needs to be perfect or near-perfect.

Adding new instances or warming up new pools must be thought through carefully and may even require downtime. Having all caches contain this type of data would make this more of a headache than it already is, or even impossible without downtime. In general, this is something we’re trying to move away from, but for now, it’s technical debt that we have to think about carefully.

There’s one final important reason for workload-based pools. Memcached works on a memory allocation system known as slab allocation. This means that as objects of a certain size are added to memcached, fixed-size “chunks” of memory are allocated to hold them. So objects of 1-96 bytes might fit into slab 1, objects of 97-120 bytes might fit into slab 2, etc. This prevents memory fragmentation and greatly reduces the number of allocations that need to be done in general, since memory is never deallocated and reallocated.

This is fine, except that up until recent versions, these allocations could not be changed once they had been made. So if we had initially filled our memcached instance with mostly 1KB objects and our usage pattern suddenly changed to trying to store 500KB objects, the underlying slab allocation wouldn’t change. This would lead to a much higher eviction rate than we would otherwise expect, since there would be little to no room for the larger objects. Workload-based pools ensure that the memcached instances can settle into a slab allocation pattern that reflects the size of objects it typically deals with.

Newer versions of memcached have added the slab_automove ability to enable reallocation, but we haven’t had the opportunity to test it yet.

The Pools We Use at Reddit

Now that we’ve gone over why we split pools by workload, let’s look at the different pools we use at Reddit, their sizes, and the workloads they handle.

thing-cache

Instances: 16 r3.2xlarge
Memcached Version: 1.4.30
Total RAM: 976 GB
Get Rate: ~800k/s
Set Rate: ~13k/s
Miss %: 1.2-2%
Typical Object Size: 384-1184 bytes

All get and set rates are cluster-wide averages at peak, rather than rates for individual instances.

Our biggest pool is used for caching database objects, known as Things. Our “Thing” object is a schemaless abstraction allowing developers to easily add new attributes to objects without the need to manage database migrations. Each Thing automatically gets attributes like karma. Some examples of Things include comments, links, and accounts. This is by far our busiest and most useful cache, with a hit rate of nearly 99%. The loss of a Thing cache is felt very seriously by the databases.
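
The real Thing code lives in the open source repo; as a rough illustration of the workload, here’s a minimal look-aside cache sketch with hypothetical names (the host, key format, and load_from_db callback are all invented):

```python
import pickle

from pymemcache.client.base import Client

client = Client(("thing-cache.example.internal", 11211))  # hypothetical host

def get_thing(fullname, load_from_db):
    """Look-aside caching: try the cache first, fall back to the database."""
    key = f"thing:{fullname}"
    cached = client.get(key)
    if cached is not None:
        return pickle.loads(cached)       # the ~99% hit-rate path
    thing = load_from_db(fullname)        # this is the load the databases feel
    client.set(key, pickle.dumps(thing))  # no TTL; rely on LRU eviction
    return thing
```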

cache-main

Instances: 11 r3.2xlarge
Memcached Version: 1.4.30
Total RAM: 671 GB
Get Rate: ~82k/s
Set Rate: ~10k/s
Miss %: ~75%
Typical Object Size: <96 bytes

Our second biggest cache is cache-main. This is our general dumping ground of a cache, where we store everything from the results used to display /r/all to GeoIP lookups. You can see the various things we use it for by looking in the open source Reddit repo.

One example of a random thing we use the cache for quite regularly is simply checking the things you’ve voted on recently when you load a page. When you vote, your vote isn’t instantly processed—instead, it’s placed into a queue. Depending on the backlog of the queue, this can mean if you were to vote and quickly refresh the page, your vote may not have been processed yet, and it would appear that your vote had been reverted. To get around this, we cache your recent votes for a short period of time to display them back to you until they’re processed. This sort of thing doesn’t really fit in with any kind of larger workload pattern, so we stuff it in with other things in cache-main.
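
Here’s a hedged sketch of that recent-votes overlay (the host, key names, and TTL are invented for illustration):

```python
import json

from pymemcache.client.base import Client

client = Client(("cache-main.example.internal", 11211))  # hypothetical host

VOTE_TTL = 300  # seconds; assumed to outlast the queue backlog

def record_recent_vote(user_id, thing_id, direction):
    # Remember the not-yet-processed vote so a quick refresh still shows it.
    key = f"recent_votes:{user_id}"
    votes = json.loads(client.get(key) or "{}")
    votes[thing_id] = direction
    client.set(key, json.dumps(votes), expire=VOTE_TTL)

def recent_votes(user_id):
    # Overlay these on top of the processed votes when rendering a page.
    return json.loads(client.get(f"recent_votes:{user_id}") or "{}")
```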

cache-render

Instances: 8 r3.2xlarge
Memcached Version: 1.4.30
Total RAM: 488 GB
Get Rate: ~224k/s
Set Rate: ~103k/s
Miss %: ~45-55%
Typical Object Size: 240-2320 bytes

Our third biggest cache is cache-render. This is probably what you’d expect: a bunch of rendered templates and sections of the page. This is one of our safer caches to fail, which is unsurprising for two reasons: 1) the miss rate is pretty high, since templates constantly need to be updated with new information, and 2) its failure doesn’t result in higher db load, since the context used to populate the templates is loaded independently of the render caching. The cache key is generated from the contents of the context provided to the template—if that changes, a cache miss happens and the resulting template is re-cached. A failure here just causes higher app CPU usage and slightly longer response times while the caches re-heat.
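
A sketch of that keying scheme (the hashing details and helper names are invented; the real template code is in the open source repo):

```python
import hashlib
import json

from pymemcache.client.base import Client

client = Client(("cache-render.example.internal", 11211))  # hypothetical host

def render_cached(template_name, context, render):
    # Key on a digest of the context: if any input changes, the key
    # changes, we miss, and the freshly rendered result is re-cached.
    digest = hashlib.sha1(
        json.dumps(context, sort_keys=True).encode()
    ).hexdigest()
    key = f"render:{template_name}:{digest}"
    html = client.get(key)
    if html is None:
        html = render(template_name, context).encode()  # CPU cost on a miss
        client.set(key, html)
    return html.decode()
```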

cache-perma

Instances: 6 r3.2xlarge
Memcached Version: 1.4.17
Total RAM: 366 GB
Get Rate: 24k/s
Set Rate: 4k/s
Miss %: <1%
Typical Object Size: 96-120 bytes

The last pool we’ll talk about in detail is cache-perma. This pool is used to cache database query results and the sorts for comments and links. This is really tricky stuff, especially for maintenance and migrations, mostly because we use the read-modify-write pattern.

For example, when new comments are added or votes are changed, we don’t simply invalidate the cache and move on—this happens too frequently and would make the caching near useless. Instead, we update the backend store (in Cassandra) as well as the cache. Fallback can always happen to the backend store if need be, but in practice this rarely happens. In fact, permacache is one of our best hit rates—over 99%.
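
As a sketch of that read-modify-write pattern (names invented; the real code also has to worry about concurrent writers, which is where the locking pool described below comes in):

```python
import pickle

from pymemcache.client.base import Client

client = Client(("cache-perma.example.internal", 11211))  # hypothetical host

def add_comment_to_sort(link_id, comment_id, score, cassandra_write):
    # Read-modify-write: update both the cache and the backing store,
    # rather than invalidating the (expensive to rebuild) cached sort.
    key = f"comment_sort:{link_id}"
    cached = client.get(key)
    sort = pickle.loads(cached) if cached is not None else []
    sort.append((score, comment_id))
    sort.sort(reverse=True)
    cassandra_write(link_id, sort)       # Cassandra remains the fallback
    client.set(key, pickle.dumps(sort))  # keep the cache in lockstep
```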

Non-cache Pools

As mentioned earlier, we also have a couple of pools dedicated to non-cache functionality. These use functionality provided by memcached in novel ways for rate limiting and locking.

Our rate-limit caches are important for preventing people from doing things like spamming votes. The code behind this trickery can be seen in the open source Reddit repo. Essentially, we bucket keys into timeslices, each with a TTL, and increment a counter every time an action occurs. The returned count is then checked to ensure it’s not over some limit; if it is, the user is prevented from performing the action. This setup also allows for some burstiness so long as the average rate is kept low.
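
A minimal sketch of the timeslice scheme (the host, window size, limit, and key names are invented):

```python
import time

from pymemcache.client.base import Client

client = Client(("rate-limit-cache.example.internal", 11211))  # hypothetical

WINDOW = 60  # seconds per timeslice
LIMIT = 30   # max actions per timeslice

def is_rate_limited(user_id, action):
    # Each timeslice gets its own counter key, expired by TTL.
    timeslice = int(time.time() // WINDOW)
    key = f"rl:{action}:{user_id}:{timeslice}"
    # add() is a no-op if the counter already exists.
    client.add(key, "0", expire=WINDOW * 2, noreply=False)
    count = client.incr(key, 1, noreply=False)
    return count is not None and count > LIMIT
```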

We also use memcached for locking. This takes advantage of the “add” command, which only creates a new key if it does not already exist. If the lock is acquired, the work is performed under the lock, and then the key is removed. The next call to the “add” command will then succeed, and the caller can continue along, knowing it is the only one working under the lock. This is one of our biggest pain points and a single point of failure: migration or maintenance of this box means shutting down the entire site. Our long-term goal is to reduce or remove this locking from the application altogether.
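
The pattern itself is simple; here’s a hedged sketch as a context manager (the host, timeout, and key names are invented):

```python
import contextlib

from pymemcache.client.base import Client

client = Client(("lock-cache.example.internal", 11211))  # hypothetical host

@contextlib.contextmanager
def memcached_lock(name, timeout=30):
    key = f"lock:{name}"
    # add() succeeds only if the key doesn't exist, so it acts as a mutex.
    # The TTL keeps a crashed holder from wedging the lock forever.
    if not client.add(key, "1", expire=timeout, noreply=False):
        raise RuntimeError(f"could not acquire lock {name!r}")
    try:
        yield
    finally:
        client.delete(key)

# Usage:
# with memcached_lock("comment_tree:abc123"):
#     ...  # the only worker mutating this comment tree
```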

Other Pools

The rest of our pools are small-ish, so we won’t go into the details for these. We have a “rel” cache, which holds lookups of relations of one Thing to another. Our memoize cache provides an easy way for developers to cache the results of function calls. Places that rely on these memoized results to be accurate can be problematic when doing maintenance, as we’ll see later.
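
A rough sketch of what a memoize-cache decorator can look like (illustrative only, not reddit’s actual implementation; the host and key format are invented):

```python
import functools
import hashlib
import pickle

from pymemcache.client.base import Client

client = Client(("memoize-cache.example.internal", 11211))  # hypothetical

def memoize(ttl=300):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # Key on the function name plus a digest of its arguments.
            digest = hashlib.sha1(repr(args).encode()).hexdigest()
            key = f"memo:{fn.__name__}:{digest}"
            cached = client.get(key)
            if cached is not None:
                return pickle.loads(cached)
            result = fn(*args)
            client.set(key, pickle.dumps(result), expire=ttl)
            return result
        return wrapper
    return decorator
```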

Mcrouter

Now that we’ve gone over what our caching looks like, let’s dive into one of the tools we’ve started using lately: mcrouter.

One of our major goals over the last couple of years has been to move our caching infrastructure behind mcrouter. At its most basic, mcrouter is a memcached connection pooler and proxy developed at Facebook. The connection pooling functionality alone gives us a good amount of benefit. We have a number of workers on each of our app servers, all of which make outbound connections to our caches. Multiply that by the sheer number of servers it takes to run Reddit, and the number of connections adds up. Mcrouter lets us take all of the connections from all of the workers on an application server and pool them together into a single outbound connection per cache.

Connection pooling isn’t the only benefit, however. Mcrouter can be configured to handle any number of complex scenarios. Want to route keys to specific pools based on their functionality? Add a prefix, and use PrefixSelectorRoute to choose what pool to route to. Want to add a new pool of servers but don’t want to take the database hit of swapping over to cold caches? Warm them up slowly with WarmUpRoute. Want to prevent a single cache failure from causing long-term problems? Use FailoverRoute to define a failover pool as a fallback for your regular pools.
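
For a flavor of what this looks like, here’s a hedged sketch of an mcrouter JSON config combining prefix routing with failover; the pool names and hosts are invented, and real configs will differ:

```json
{
  "pools": {
    "main": { "servers": ["cache-main-01:11211", "cache-main-02:11211"] },
    "render": { "servers": ["cache-render-01:11211"] },
    "render_failover": { "servers": ["cache-render-fo-01:11211"] }
  },
  "route": {
    "type": "PrefixSelectorRoute",
    "policies": {
      "render:": {
        "type": "FailoverRoute",
        "children": ["PoolRoute|render", "PoolRoute|render_failover"]
      }
    },
    "wildcard": "PoolRoute|main"
  }
}
```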

Those aren’t the only ones (you can find a full list in mcrouter’s documentation). One we used heavily when migrating operating system and memcached versions was shadow writing.

Unlike warming up a new pool, which slowly shifts the source of truth to your new pool and only goes back to the old cluster on missed gets, shadowing copies both read and write operations to your new cluster without changing the source of truth. This lets you safely watch the new instances and make sure they behave as expected under both read and write load.

For instance, maybe you want to check what the real-world miss and eviction rate will be with a bigger pool of smaller instances. Or maybe you just want to make sure a new version of memcached won’t suddenly run at a much higher load. Shadowing works great for this.
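
Here’s a sketch of how shadowing is expressed inside a PoolRoute (pool names invented): index_range selects which servers in the live pool are shadowed, and key_fraction_range selects the slice of the keyspace, here all of it:

```json
{
  "type": "PoolRoute",
  "pool": "old",
  "shadows": [
    {
      "target": "PoolRoute|new",
      "index_range": [0, 10],
      "key_fraction_range": [0.0, 1.0]
    }
  ]
}
```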

Finally, though we use this less often, mcrouter also has the ability to replicate data. For the cached /r/all results we talked about earlier, for instance, sets are replicated to all available caches, and reads happen from random caches with failover going to a different random cache. This lets us avoid having empty results on /r/all in case of a cache failure, while also preventing hotspots in the cache pool. /r/all isn’t that popular, but it is hot enough that we’d rather not have the load unnecessarily go to one single cache.
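
A sketch modeled on mcrouter’s replicated-pool example (pool name and hosts invented): sets and deletes fan out to every replica via AllSyncRoute, while gets favor one replica and fail over to the others:

```json
{
  "pools": {
    "all_cache": { "servers": ["replica-01:11211", "replica-02:11211"] }
  },
  "route": {
    "type": "OperationSelectorRoute",
    "operation_policies": {
      "get": "LatestRoute|Pool|all_cache",
      "set": "AllSyncRoute|Pool|all_cache",
      "delete": "AllSyncRoute|Pool|all_cache"
    }
  }
}
```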

Overall, mcrouter has made the management of our cache servers a lot easier and has given us a lot of flexibility. We’ve migrated most of our caches behind mcrouter, but for some of the more critical caches that require downtime, we still have work to do.

Subtle Problems

Deploy Race Conditions

Not all is rosy, though. Using some of these mcrouter features can cause subtle race conditions and consistency issues, mostly because of our reliance on the cache as a source of truth.

Let’s consider the case of wanting to warm up a new set of caches and a real-world bug that was introduced as a result. To warm up the new caches, we needed to deploy mcrouter updates to 800 or so servers. The bug in question centers around trophies.

If you’re unfamiliar with trophies, they’re the small award icons that appear on the sidebar of user pages.

The problematic ones in this example are the “N year” trophies that are awarded every year on a redditor’s signup anniversary. One of the ways we check to see if an N year trophy needs to be added to an account is by seeing whether the trophy exists already. The result of this function call rarely changes, so we cache it in our memoize cache.

In an ideal world, updates to the list of caches when warming would happen instantaneously to all servers responsible for handling the updating of trophies. Cache requests for the trophy list for a user would first go to the new caches, a miss would occur, and the data would be retrieved from the old caches and added to the new ones.

Unfortunately, we don’t live in an ideal world. This check is handled in-request, meaning it’s performed by one of those 800 app servers. So, as we’re deploying, not all 800 app servers have the same view of the caches.

Let’s consider the case where half of the app servers have the old config, pointing at the old cache pool, and half the servers have the new config, pointing at both pools. The servers that have the new warmup config would be reading from the old pool, updating the cache result, and storing it in the new pool. That’s fine and is what we want.

The problem is that after that cache update to the new pool is done, one of the old apps that hasn’t been deployed to yet would still see the old state and mistakenly think that a trophy was still needed, giving the user a duplicate. The good news is these servers would then update the old cache, in theory resulting in a maximum of a single duplicate trophy for that user.

Or would it?

After investigating duplicate trophies by user and finding a couple hundred of them, we noticed there was actually an account or two with 3 trophies! After thinking about it a while, we realized there is another race condition that can happen here.

Consider the case where a request comes in to an app where mcrouter hasn’t been updated yet. A “get” happens against the old cache, and it is determined based on the response that a trophy is needed. In the middle of this request, the mcrouter update is deployed to that server. The subsequent update of the new trophies goes to the new cache, but the old cache is left un-updated. A future request by an app that hadn’t been updated yet would then read the old data from the old cache.

In theory, this could actually happen any number of times before the old cache is updated. In practice, this is so unlikely that it managed to happen only once.

No TTLs on Warmup Routes

Another issue that bit us, which is written up here, centered around using warmup routes without explicitly specifying a TTL. Basically, when a miss occurs on the new caches and mcrouter retrieves the value from the old cache, it is unable to tell what TTL remains. With no expiration time explicitly indicated to mcrouter, this causes the item to be put into the new cache with no TTL at all! In this case, the application expected that if the item had not expired yet, it was still valid and would not bother recalculating, causing /r/all and /r/all/rising to get really stale.

Lesson learned: specify a TTL when warming up if your application relies on it!
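
Concretely, that means setting the exptime parameter on the WarmUpRoute so values copied from the warm pool get a bounded TTL. A hedged sketch (the pool names and the one-hour value are invented):

```json
{
  "type": "WarmUpRoute",
  "cold": "PoolRoute|new_pool",
  "warm": "PoolRoute|old_pool",
  "exptime": 3600
}
```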

Custom Monitoring

As alluded to earlier, memcached can sometimes be a bit of a black box. Though the memcached collector in diamond (the system stats collector we use) collects basic stats for things like evictions and miss rate, we realized from oddities in eviction behavior that we needed more insight into what was going on. After manually taking a look at slab allocation metrics, which are available via the “stats slabs” command, we realized it would be nice to have a visual representation of the data as well as see how it changed over time.
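
If you want to poke at the same data by hand, the raw numbers are available from any memcached client; here’s a quick look using pymemcache (hostname invented):

```python
from pymemcache.client.base import Client

client = Client(("cache-host.example.internal", 11211))  # hypothetical host

# "stats slabs" reports per-slab-class metrics such as chunk_size,
# total_pages, and used_chunks -- the raw data behind our collector.
for key, value in sorted(client.stats("slabs").items()):
    print(key, value)
```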

For this, we wrote a new diamond collector to keep track of each slab’s metrics. Visualization is a bit wonky, but we created dashboards for a point-in-time representation of the data so we can at least eyeball whether some caches have major differences in slab allocation and, in general, see what size keys each pool tends to hold.

The second lack of insight we had was related to why slab allocation would change in the first place. We reasoned that this probably was due to changing patterns in usage of memcached. Though Reddit still has a relatively small engineering team, we do manage to make a lot of changes, and cache patterns can certainly change over the course of months. What we needed was simple: a way to track which keys were most active.

For this, we wrote a tool called mcsauna. Mcsauna sits on each cache server watching network traffic, aggregating keys into buckets based on configured rules, and outputting the results to a file which can be picked up by diamond’s FilesCollector. Keys can then be graphed over time, and hotspots in the cluster where a particular type of key is being unevenly routed to one cache in the pool can be identified.

What’s Next

Though we’ve made a lot of progress, there are some key things we’d like to do moving forward. For one, we’d like to get the remainder of our cache pools behind mcrouter. We’d also like to get our mcrouter configs behind service discovery, so we can consider autoscaling based on health and avoid unnecessary human intervention when a cache inevitably does go down. Finally, since we’ve managed to get on a much more modern version of memcached, we’d like to begin testing the various automove settings for slabs to see how they perform and how quickly slabs are moved depending on the level of aggressiveness configured.

We’ve come a long way. With mcrouter, we can perform experiments and downtime-free maintenance with ease. With our new pieces of monitoring, we can have much better insight into what is going on with our caches for tuning and sizing. Overall, caching has been a great way to help improve the response time of the site and is a crucial part of letting us serve Reddit at scale.

Thanks to /u/goatfresh for his help with the design work and /u/spladug and /u/keysersosa for their editorial contributions. Want to discuss this blog post? Join /u/daniel on /r/programming and ask him anything! (And if you’re an engineer interested in joining Team Reddit, check out our Careers page for a list of open positions.)