



Caching beyond RAM: the case for NVMe - Dormando (June 12th, 2018)

Caching architectures at every layer of the stack embody an implicit tradeoff between performance and cost. These tradeoffs, however, are constantly shifting: new inflection points can emerge alongside advances in storage technology, changes in workload patterns, or fluctuations in hardware supply and demand.



In this post we explore the design ramifications of the increasing cost of RAM on caching systems. While RAM has always been expensive, DRAM prices have risen by over 50% in 2017, and high densities of RAM involve multi-socket NUMA machines, bloating power and overall costs. Concurrently, alternative storage technologies such as Flash and Optane continue to improve. They have specialized hardware interfaces, consistent performance, high density, and relatively low costs. While there is increasing economic incentive to explore offloading caching from RAM onto NVMe or NVM devices, the implications for performance are still not widely understood.



We will explore these design implications in the context of Memcached, a distributed, simple, cache-focused key/value store. For a quick overview, see the about page or the story tutorial.

Memcached is a RAM backed key/value cache. It acts as a large distributed hash table, with data lifetime governed by an LRU. The oldest untouched data is evicted to make room for fresh data.
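As a simplified illustration of that eviction policy (memcached's real LRU is segmented and more sophisticated; `TinyLRU` here is a hypothetical toy, not memcached code), a minimal LRU cache might look like:

```python
from collections import OrderedDict

class TinyLRU:
    """Toy model of LRU eviction: when full, drop the least
    recently touched entry to make room for fresh data."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entries first

    def get(self, key):
        if key not in self.items:
            return None  # cache miss
        self.items.move_to_end(key)  # touching an item refreshes its recency
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the oldest untouched item
```

The `OrderedDict` keeps entries in recency order, so eviction is O(1): pop from the front, insert and touch at the back.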

Very low latency (sub-millisecond) and high throughput are important. A single page may request data from memcached several times, so response times stack up quickly.

Memcached delivers a huge cost reduction by cutting queries to your backend system, whether that is a database on flash/disk drives or CPU-bound code such as templating or rendering. Memcached has a storage system called extstore, which allows keeping a portion of data for "less recently used" keys on disk, freeing up RAM. See the link for a full breakdown of how it works, but in short: keys stay in RAM, while values can be split off to disk. Recent and frequently accessed keys will still have their values in RAM.



This test was done with the help of Accelerate with Optane, which provided the hardware and guidance. Also thanks to Netflix for their adoption of extstore, with all their great feedback.

Cache RAM Breakdown

For example, a 1TB database could have only 20% of its data “active” over a given time period (say 4 hours). If you want to cache all of the active data, you might need 200G of RAM. Out of that 200G of RAM, only 20% of it might be highly utilized.





Of the cache memory used, only 10% of the RAM could be responsible for 90% of all hits to the cache; the remaining 90% of RAM accounts for only 10% of hits.



However, if you cut that 90% of RAM, your miss rate would at least double, doubling the load on your DB. Depending on how your backend system performs, losing even some RAM could double backend costs.





Breaking down items in memory, we might find that a vast majority (97.2% in the above case) live within only 30% of RAM. A smaller number of large, but still important items take up the other 70%. Even larger items can eat RAM very quickly.



Keep in mind these larger items can be a significant percentage of network utilization. One 8k request takes the same bandwidth as 80+ small ones.





How does extstore help? By splitting the values of large, less recently used items from their keys and pointing the keys at disk, we can reclaim the majority of less-used RAM. Depending on your use case, you can:



Reduce overall RAM: cut 100G to 30G, or less, if you have a lot of larger items.

Increase hit ratio with same RAM: move large items to disk, cache longer tail of small items, increase hit ratio and reduce backend costs.

Reduce server count: if you have network to spare (and can handle the loss of broken servers!) you can cut the size of the cache fleet.

Increase overall cache: easily add hundreds of gigabytes of cache capacity per server.

Cache objects that used to be too expensive: create new pools with larger objects, caching precomputed data from big data stores, machine learning training data, and so on.

Are other workload variations okay? In the above example, cache hits are evenly distributed. This theoretical system has an IO limit of 400,000 operations per second, which should be similar to a high end SSD or Optane drive. In this case RAM cannot be relied on to saturate the network.



At 400,000 IOPS, average item sizes of just 3072 bytes are enough to saturate a 10G NIC; 8192 bytes will saturate 25G. In properly designed clusters, extra headroom is necessary for growth, usage spikes, or failures within the pool. This means average item sizes down to 1024 bytes might be possible; however, at 1024 bytes (assuming 100 bytes of overhead per key), extstore will only be able to store about 10x as much data on disk as would fit in RAM.
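A quick sanity check of those numbers. This is an illustrative back-of-envelope that ignores protocol and framing overhead, which is presumably why the article rounds to 3072/8192:

```python
def avg_item_bytes_to_saturate(nic_gbps, iops):
    """Average response size at which `iops` reads/sec fill the NIC."""
    bytes_per_second = nic_gbps * 1e9 / 8  # gigabits -> bytes
    return bytes_per_second / iops

# At the theoretical 400k IOPS limit:
ten_gig = avg_item_bytes_to_saturate(10, 400_000)     # ~3125 bytes
twentyfive_gig = avg_item_bytes_to_saturate(25, 400_000)  # ~7812 bytes
```

So anywhere near the device's IO ceiling, items of only a few kilobytes already make the NIC, not the drive, the bottleneck.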



Careful capacity planning is required: How many lost machines can be tolerated? Each dead machine takes a percentage of the cache with it.

How much network bandwidth is necessary? Reducing the server count makes the network more dense.

How many IOPS are necessary? Most accesses are against recent data, reducing the reliance on disk.

What latency guarantees are necessary? Disk-backed cache lookups are acceptable so long as they remain much faster than the backend.

How long do items live for? SSDs only tolerate a certain amount of writes before burning out, and not everyone's workload will be compatible with external storage. Carefully evaluate how your RAM is used in your cache pools before using disk for cache. If you have exclusively small items, short TTLs, or high write rates, RAM will still be cheaper. Calculate this by tracking an SSD's write tolerance in "Drive Writes Per Day" (DWPD): if a 1TB device can survive 5 years with 2TB of writes per 24 hours, it has a tolerance of 2 DWPD. Optane has a high tolerance at 30 DWPD, while a high end flash drive is 3-6 DWPD.
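The DWPD arithmetic can be sketched like so; the workload numbers in the example are made up purely for illustration:

```python
def daily_writes_tb(sets_per_sec, avg_item_bytes):
    """Terabytes written per day by a steady stream of sets."""
    return sets_per_sec * avg_item_bytes * 86_400 / 1e12

def dwpd_required(capacity_tb, sets_per_sec, avg_item_bytes):
    """Full-drive overwrites per day this workload demands."""
    return daily_writes_tb(sets_per_sec, avg_item_bytes) / capacity_tb

# Hypothetical pool: 10k sets/sec of 2KB items onto a 1TB drive.
need = dwpd_required(1.0, 10_000, 2048)  # ~1.77 DWPD
# That fits under a 2 DWPD flash budget with little headroom; a higher
# set rate or larger items would push the workload toward 30 DWPD Optane.
```

Note that write amplification from extstore's page compaction would add to this, so treat the result as a floor, not an exact figure.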

Test Setup

The tests were done on an Intel Xeon machine, sporting 32 cores, 192G of RAM, a 4TB SSD, and 3x Optane 750G drives. Only one Optane drive was used during the test. As of this writing extstore only works with one drive, and this configuration reflects most of its users.

mc-crusher was used to run the tests. Specifically, the third tag, containing the test-optane script.

mc-crusher is designed primarily to run as fast as possible: it does not parse responses, stacks as many queries as it can per syscall, and makes no attempt to time anything. In this test it was run against localhost, though it never used more than a single core of CPU.

The test-optane script specifically describes the configurations used in the test. Memcached was configured to use 32 worker threads (the machine has 64 cores with hyperthreads).

The “balloon” program from mc-crusher was used to take up 125G of RAM, and 100 million keys loaded into memcached, to avoid extstore simply using the buffer pool. During each test run, the number of mc-crusher clients was varied, as well as the number of extstore IO threads in the server. Too few IO threads won’t saturate the device, and too many will overload it and can cause queues.



Each test runs for a minute after warm-up.

Latency and Throughput Measurements

Since mc-crusher does not time results, two scripts were used to generate the result data:

bench-sample: Periodically runs the "stats" command against memcached, using its counters to determine average throughput. The data is sampled every few seconds, and was inspected for significant standard deviation.

latency-sample: A script which pretends to be a blocking memcached client and "samples" requests over time while bench-sample is running. This avoids traps like "95th percentile", which removes outliers or grouping and can cause misleading results. For every test, a full breakdown of the latency samples is provided. The sampling was done at a maximum rate of one per millisecond.

Note: an event loop is not used, to avoid having to count time spent waiting to be processed when a stack of events happens at once.

The Tests

Three general tests were done:

ASCII multiget: this mode allows extstore to use the fewest packets to generate a response, as well as heavily pipeline requests internally. Lower latency devices can reach higher throughputs more easily with this test.

Pipelined gets: many get requests are stacked into the same packet, but extstore has to service each request independently. In these tests, extstore was easily able to saturate the OS’s ability to serve buffered IO (kswapd kernel threads were maxed out), but the latency graphs show optane able to keep latency down 1/10th that of the flash drive.

At higher client loads, pipelined gets may look odd: that will need further research, but is likely caused by queuing internally. Since mc-crusher is very aggressive, the optane drive is able to saturate the system with much fewer IO threads and crusher clients. In production workloads optane will provide much more consistent low latency service.

Multiget + pipelined sets: the previous two workloads were read-only. In this test, sets are also sent to memcached at a rate of roughly 1/3rd to 1/5th of reads, so extstore is flushing to the drive at the same time as reads are happening. Again, Optane comes out strong.

The Results

Unfortunately there is some wobbling in the graphs; that is due to leaving too little RAM free during the tests. The Optane's performance was consistent, while the OS struggled to keep up.



For reference: A pure RAM multiget load test against this exact configuration of memcached (32 threads, etc) results in 18 million keys per second. More contrived benchmarks have gotten a server with many cores up past 50 million keys per second.



With few extstore IO threads the Optane drive is able to come much closer to saturating the IO limit: 4 threads, 4 clients: 230k Optane, 40k SSD. The latency breakdown shows the SSD typically being an order of magnitude higher in wait time, with the Optane staying in the 10us bucket, and SSD in 100us, slipping into 1ms.



With many extstore IO threads the underlying OS becomes saturated, causing wobbles and queueing in the Optane graph. Meanwhile, the SSD continues to benefit from extra thread resources, needed to overcome the extra latency from flash.



For many workloads, both SSD and Optane are completely viable. If a bulk of reads still come from RAM, with extstore used to service the long tail of only large objects, they both keep requests under 1ms in response time.



If you want to push the boundaries of extstore, a drive like Optane goes a long way:

High write tolerance goes well with cache workloads

Very low latency helps smooth over the tradeoff of requesting cache data from disk

Small size is currently advantageous: extstore requires RAM for every value on disk. 375G to 1TB of disk requires a lot less RAM in a particular machine, and 2TB+ is probably too dense to allow safe failover or avoid NIC saturation.







Learnings

The extstore flusher is a background thread combined with the code which manages the LRUs. During a benchmark, sets are consistently sent to the server, which can cause starvation of extstore's flushing. In production, insert rates to memcached tend to come in waves, even as brief as a millisecond across, so it can keep up. Flushing will have more consistent performance once it runs as its own thread.

Latency sampling is tough. The current script provides useful data, but a much better program would pace one request every millisecond onto a pool of blocking threads, allowing us to determine if the server is pausing or if some requests are simply slow. Full timing of every sample can also be saved and graphed. This would visualize clustering of responses which can come from the underlying OS or drive.
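A sketch of that improved sampler. This is hypothetical (a stand-in callable replaces the real blocking memcached client), but it shows the key idea: pacing is decoupled from request latency, so a slow outlier never delays when the next sample is taken:

```python
import threading
import time

def paced_samples(op, interval_s=0.001, duration_s=0.05):
    """Fire one request per interval, each on its own blocking thread.
    Returns the observed latency of every call to `op`, in seconds."""
    latencies = []
    lock = threading.Lock()

    def one_sample():
        start = time.monotonic()
        op()                          # stand-in for a blocking memcached get
        elapsed = time.monotonic() - start
        with lock:
            latencies.append(elapsed)

    workers = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        t = threading.Thread(target=one_sample)
        t.start()
        workers.append(t)
        time.sleep(interval_s)        # pacing, independent of op latency
    for t in workers:
        t.join()
    return latencies
```

Saving the full per-sample timings (rather than percentiles) is what allows graphing clusters of slow responses from the OS or the drive.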

Buffered I/O has limitations. This was known ahead of time: most workloads are on the order of hundreds of thousands of operations per second or much less, and most of those will be against RAM and not extstore. We've focused on stability for the time being, but eventually direct IO and async IO will be able to better utilize the devices under high load.

Extstore’s bucketing could allow for a very interesting mix of Optane along with traditional flash. Internally, extstore organizes data into “pages” of disk space (typically 64M). New items are clustered into particular pages. Items with short TTLs can be clustered together. Items which survive page compaction are clustered as well, which reduces the need for compaction over time. All new and/or short-TTL items could rest on a 375G Optane drive, while compacted items could sit on a 1TB flash drive, providing even greater cost savings.

Conclusions

Workloads which previously weren't possible due to cost now are. Most workloads containing mixed data sizes and large pools can see significant cost reduction.





Extstore requires RAM per item on disk. This chart, assuming 100 bytes of overhead per item (key + metadata), visualizes how the RAM overhead falls as item sizes get larger.
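The shape of that chart follows directly from the fixed per-item overhead; a quick sketch using the article's 100-byte figure:

```python
RAM_OVERHEAD_BYTES = 100  # key + metadata kept in RAM per item on disk

def ram_per_disk_byte(avg_item_bytes):
    """Fraction of each on-disk byte that must be matched by RAM."""
    return RAM_OVERHEAD_BYTES / avg_item_bytes

# 1KB items need ~10% of the disk footprint mirrored as RAM overhead,
# while 8KB items need only ~1.2% -- larger items amortize the index.
small = ram_per_disk_byte(1024)
large = ram_per_disk_byte(8192)
```

This is why the earlier NIC section notes that at 1024-byte averages, extstore can only hold roughly 10x on disk what would fit in RAM.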



DRAM costs are 3-4x Optane, and 4-8x SSD, depending on the drive.



As cache shifts from RAM to Optane (or flash), money spent purely on RAM can drop to 1/3rd.
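As a rough illustration of that claim. All dollar figures here are invented placeholders; only the roughly 4x DRAM:Optane price ratio comes from the text:

```python
DRAM_PER_GB = 8.0    # hypothetical $/GB
OPTANE_PER_GB = 2.0  # hypothetical, using the ~4x ratio above

def pool_cost(ram_gb, optane_gb):
    """Total storage spend for one cache node's RAM + Optane."""
    return ram_gb * DRAM_PER_GB + optane_gb * OPTANE_PER_GB

all_ram = pool_cost(100, 0)   # $800 for 100G of pure-RAM cache
split = pool_cost(30, 375)    # 30G RAM + 375G Optane via extstore

# RAM spend drops from $800 to $240 (to 30%, roughly "1/3rd"),
# while total cache capacity grows from 100G to ~405G.
```

The split configuration spends slightly more overall here, but its cost per gigabyte of cache is roughly a third of the pure-RAM baseline.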



Reducing RAM reduces reliance on multi-socket servers. To get very high RAM density, NUMA-capable machines are often necessary: these have multiple sockets and multiple CPUs, with half of the RAM attached to each CPU. Since memcached is highly efficient, once RAM is reduced you can cut not just RAM, but half of the motherboard, CPU, and even power costs. Cost reductions up to 80% for specific workloads are reasonable.



High speed, low latency SSD opens a new era for database and cache design. We demonstrate high performance numbers for a wide variety of use cases, for both reduction of costs and expansion of cache usage.











