One of the key strategies we use to keep the new Basecamp as fast as possible is extensive caching, primarily using the “Russian doll” approach that David wrote about last year. We stuffed a few servers full of RAM and started filling up memcached [1] instances.

A few times in the last two years we’ve invalidated large swaths of cache or restarted memcached processes, and each time our aggregate response time increased by 30-75%, depending on the amount of invalidation and the time of day. The caches then refill, and response times return to normal within a few hours. Caching matters just as much to page load times on a day-to-day basis. For example, compare the distribution of response times for the overview page of a project on a cache hit (blue) and on a miss (pink): the median request time on a hit is 31 milliseconds; on a miss it jumps to 287 milliseconds.

Until recently, we had never taken a really in-depth look at the performance impact of caching at a granular level. In particular, I’ve long had a hypothesis that we overcache parts of the application, and that in some places caching does more harm than good.

Hit rate: just the starting point

From the memcached instances themselves (using memcache-top), we know we achieve roughly a 67% hit rate: about two out of every three requests we make to the caching server have a valid fragment to return. By parsing [2] Rails logs, we can break this overall hit rate down into a hit rate for each piece of cached content.
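As a rough illustration of that per-fragment breakdown, here is a minimal sketch. It assumes the logs contain fragment-cache debug lines of the form `Read fragment <key>` and `Write fragment <key>`, and that a write for a key means the preceding read was a miss; both are assumptions about the log format, not a description of the parser we actually used. The `hit_rates` and `normalize_key` helpers are hypothetical names. Numeric ID segments are collapsed to `?` so that keys group by fragment type.

```ruby
# Assumed log format: "Read fragment <key> (...)" / "Write fragment <key> (...)".
# Adjust the pattern to match your own logs.
FRAGMENT_LINE = /\b(Read|Write) fragment (\S+)/

# Replace purely numeric path segments (IDs) with "?" so that keys
# for the same kind of fragment are counted together.
def normalize_key(key)
  key.split("/").map { |segment| segment.match?(/\A\d+\z/) ? "?" : segment }.join("/")
end

# Returns { normalized_key => hit_rate } where hit_rate is between 0.0 and 1.0.
# Assumption: every miss is followed by exactly one write of the same fragment.
def hit_rates(lines)
  reads  = Hash.new(0)
  writes = Hash.new(0)

  lines.each do |line|
    match = FRAGMENT_LINE.match(line)
    next unless match

    counts = match[1] == "Read" ? reads : writes
    counts[normalize_key(match[2])] += 1
  end

  reads.each_with_object({}) do |(key, read_count), rates|
    misses = [writes[key], read_count].min
    rates[key] = (read_count - misses).to_f / read_count
  end
end
```

Feeding it the lines of a log file (for example `hit_rates(File.foreach("production.log"))`, where the file name is a placeholder) yields one hit rate per fragment type, which can then be sorted to find the best and worst performers.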

Unsurprisingly, there’s a wide range of hit rates for different fragments. At the top of the heap, cache keys like views/projects/?/index_card/people/? [3] have a 98.5% hit rate. These fragments represent the portion of a project “card” that contains the faces of people on the project:

This fragment has a high hit rate in large part because it’s rarely updated: we only “bust” the cache when someone is added to or removed from a project or some other permissions change is made, and those are relatively infrequent events.

At the other end of cache performance, with a 0.5% hit rate, are views/projects/?/todolists/filter/? fragments, which represent the filters available on the side of a project’s full listing of todos:

Because these filters are based on who is on a project and what todos are due when, the cache here is busted every time project access or any todo is updated. As a result, we rarely have a cached fragment available here, and more than 99 times out of 100 we end up rendering the fragment from scratch.

Hit rate is a great starting point for figuring out what caching is likely effective and what isn’t, but it doesn’t tell the whole story. Caching isn’t free: memcached is blazingly fast, but you still incur some overhead with every cache request, whether you get a hit or a miss. That means a fragment with a low hit rate that is also quick to render on a miss might be better off not being cached at all, because the cost of all the misses (each one a fruitless memcached request) outweighs the benefit of the occasional hit. Conversely, a low hit rate isn’t always bad: a template that is extremely slow to render might still benefit on net even if only 10% of cache requests are successful.
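To make that tradeoff concrete, here is a deliberately simplified model, not necessarily the calculation we use. It assumes every cache request pays a fixed overhead (`overhead_ms`), that a miss additionally pays the full render time (`render_ms`), and it ignores the cost of writing the fragment back on a miss. Under those assumptions, the average time saved per request by caching is the hit rate times the render time, minus the overhead. The function names and all numbers in the usage note are hypothetical.

```ruby
# Average milliseconds saved per request by caching a fragment.
# Simplified model (an assumption, not a measured result):
#   uncached cost = render_ms
#   cached cost   = overhead_ms + (1 - hit_rate) * render_ms
# so the net benefit is hit_rate * render_ms - overhead_ms.
# A negative result means caching makes the fragment slower on average.
def net_benefit_ms(hit_rate:, render_ms:, overhead_ms:)
  hit_rate * render_ms - overhead_ms
end

# Hit rate at which caching breaks even under the same model.
def break_even_hit_rate(render_ms:, overhead_ms:)
  overhead_ms / render_ms.to_f
end
```

With made-up numbers, a fragment that takes 10 ms to render, has a 0.5% hit rate, and costs 1 ms per cache request comes out negative (`net_benefit_ms(hit_rate: 0.005, render_ms: 10, overhead_ms: 1)`), while a 200 ms template at a 10% hit rate with the same overhead comes out clearly positive.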

Calculating net cache impact