dormando



Redis VS Memcached (slightly better bench)



I will now continue the back-and-forth obnoxiousness that benchmarking seems to be!



In my tests, I've taken the exact testing method antirez has used here, the same test software, the same versions of daemon software, and tweaked it a bit. Below are graphs of the results, and below that is the discussion of what I did.











Wow! That's pretty different from the first two benchmarks.



First, here's a tarball of the work I did. A small bash script, a small perl script to interpret the results (takes some hand fiddling to get it into gnuplot format), and the raw logs from my runs pre-rollup.



What I did

The "toilet" bench and antirez's benches both share a common issue: they busy-loop a single client process against a single daemon server. The antirez benchmark is written much better than the original one; it tries to be asynchronous and is much more efficient.



However, it's still one client. memcached is multi-threaded and has a very, very high performance ceiling. redis is single-threaded and is very performant on its own.



I made a trivial patch to the benchmarks so they run just the GET|SET tests. It is included in the tarball.



What I did was take the same tests, but I ran several of them in parallel. This required a slight change in pulling the performance figures and running the test. The tests were changed to run indefinitely, either doing sets, or sets then indefinite gets (I wanted to run some sets before the get tests so they weren't just hammering air).



The benchmarks were then fired off in parallel via the bash script, with the daemon freshly restarted before each run. After a rampup time (to allow the sets to happen, as well as let the daemons settle a bit), a script was used to talk to the daemons and sample the request rate. Since the benchmark is running several times in parallel, it's now most accurate to directly ask the daemon how many requests it's doing. I did some quick hand verification and the sampling code lines up with the output of a non-parallel benchmark. So far so good.
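The sampling idea can be sketched roughly like this. The `cmd_get`/`cmd_set` counter names are memcached's real stats fields; how you fetch the raw stats text (socket, `nc`, etc.) and the sampling interval are left as assumptions:

```python
def parse_stats(raw):
    """Parse memcached's 'stats' output ('STAT <name> <value>' lines) into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

def request_rate(before, after, seconds):
    """Requests/sec computed from two counter snapshots taken `seconds` apart."""
    delta = (int(after["cmd_get"]) + int(after["cmd_set"])) - \
            (int(before["cmd_get"]) + int(before["cmd_set"]))
    return delta / seconds
```

The same approach works for redis by diffing `total_commands_processed` from its INFO output instead.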



I checked in with antirez to ensure I was running the tests correctly, and re-ran them as close to the original as I could get. Same number of clients *per daemon*, but there were 4 daemons in this case, so the total number of clients is actually 4x what's listed in the graphs.



The tests ran on localhost using a dual-CPU quad-core Xeon machine, clocked at 2.27GHz (with turbo boost enabled, I'm pretty sure). The OS is CentOS 5 but with a custom 2.6.27 kernel. I verified the single-process benchmark results on my Ubuntu laptop running 2.6.35 on a 2.50GHz Core 2 Duo and got similarish-but-slightly-lower numbers. I also tried the tests on several slightly differing machines after getting some odd initial results. Memcached was using the default of 4 threads. Performance might suffer in this particular test with more threads, as you'd land with more lock contention.



So these numbers look correct, for what I was trying to do here.



Nothing else was changed. I used the same tools.



Why I did it

Both tests are busy loops. All three of these benchmarks are wrong, but this can be slightly closer to reality. In most setups, you have many independent processes contacting your cache servers. In some cases, tens of thousands of apache/perl/ruby/python processes across hundreds of machines, all poking and prodding your cluster in parallel.



I don't have the room here to explain the difference between running two processes versus one process against the same daemon, so I'll hand-wave with "context switches n' stuff". There are plenty of good unix textbooks on this topic :)



So in this case, four very high speed benchmark programs soaked up CPU and hammered a single instance of redis and a single instance of memcached, which highlights the scalability of a single instance in each case.



Why the bench is still wrong

These are contrived benchmarks. They don't test memcached incr/decr or append/prepend (mc /does/ have a few more features than pure get/set).



Real-world benchmarks require a mix of sets, gets, incrs, and decrs. They also require testing each in isolation; some users might use their key/value store as a counter and hammer incr/decr hard, others might hammer set hard, and others might be near-purely gets.
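A mixed workload like that could be generated along these lines. The operation weights here are made up for illustration, not measured from any real deployment:

```python
import random

# Illustrative op mix: weights are assumptions, not measured from a real workload.
OP_WEIGHTS = {"get": 70, "set": 20, "incr": 5, "decr": 5}

def workload(n, seed=42):
    """Generate a list of n operation names drawn from the weighted mix above."""
    rng = random.Random(seed)
    ops = list(OP_WEIGHTS)
    weights = [OP_WEIGHTS[o] for o in ops]
    return [rng.choices(ops, weights=weights)[0] for _ in range(n)]
```

Shifting the weights lets the same harness model the counter-heavy user, the set-heavy user, and the near-pure-get user separately.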



All of these need to be tested. All features should be benchmarked and load tested in isolation, and also when mixed. All features need to be tested under abuse as well.



The test also doesn't try very hard to ensure the 'get' requests actually match anything. A better benchmark would preload some data across 100,000 keys and then randomly fetch them. I might try this next, but for the sake of argument I'm matching the same testing situation as the original blog post.
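Sketching that better benchmark: preload a fixed keyspace, then fetch keys from it at random so gets actually hit data. The dict here is a stand-in for a real memcached or redis client (which is an assumption; swap in real set/get calls):

```python
import random

NUM_KEYS = 100_000

def preload(cache):
    """Seed the cache so later gets hit real data instead of hammering air."""
    for i in range(NUM_KEYS):
        cache[f"key:{i}"] = f"value:{i}"

def random_gets(cache, n, seed=1):
    """Fetch n randomly chosen preloaded keys; returns the hit count."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        k = f"key:{rng.randrange(NUM_KEYS)}"
        if cache.get(k) is not None:
            hits += 1
    return hits
```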



The interpretation for memcached

Memcached sticks to a constrained featureset and multithreads itself for a highly consistent rate of scale and performance. When pushed to the extreme, it needs to keep up. We also need to stay highly memory efficient. For the bulk of our users, the more keys they can stuff in, the more bang for the buck. Scalable performance is almost secondary to this. This is why we have features like -C, which disables the 8-byte CAS per object.
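The -C savings are easy to put numbers on. The 8 bytes per item is the figure above; the item count is an illustrative assumption:

```python
def cas_overhead_bytes(num_items, cas_bytes=8):
    """Memory freed by running memcached with -C (drops the per-item CAS field)."""
    return num_items * cas_bytes

# For an assumed 10 million cached items, -C frees 80 MB of pure accounting overhead.
saved = cas_overhead_bytes(10_000_000)
```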



In a single-threaded benchmark against a multi-threaded memcached instance, memcached will lose out a bit due to the extra accounting overhead it must perform. However, when used at realistic scale, it really shines.



There are some trivial ways we are able to greatly increase this ceiling. It's not hard to get memcached to run above 500,000 gets per second via some tweaks on some of its central locks. Sets have a lot of room for improvement due to this as well. We plan to accomplish this. Our timing has been bad for quite a while though :)



In almost all cases, the network hardware for a memcached server will give out before the daemon itself starts to limit your performance. This is a big part of why we haven't rushed to improve lock scalability.
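Back-of-the-envelope arithmetic shows why. Assuming a gigabit link and an illustrative 200-byte round trip per request (both numbers are assumptions, not measurements):

```python
def wire_ceiling_rps(link_gbits=1.0, bytes_per_request=200):
    """Rough requests/sec at which the NIC, not the daemon, becomes the bottleneck.

    bytes_per_request is an assumed round-trip size (request + response + framing).
    """
    bytes_per_sec = link_gbits * 1e9 / 8  # gigabits/sec to bytes/sec
    return bytes_per_sec / bytes_per_request

# ~625k requests/sec for gigabit at 200 bytes/request, in the same
# ballpark as the daemon's achievable get rate.
ceiling = wire_ceiling_rps()
```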



Computers are absolutely trending toward more cores and not toward higher clocks. Threading is how we will scale single instances.



I really hate drawing conclusions from these sorts of things. The entire point of this post is more or less me posturing about how shitty benchmarks tend to be. They are created in myopia and usually lauded with fear or pride.



You can't benchmark the fact that Redis has atomic list operations against memcached. They do different things and exist in different spaces, and the real differences are philosophical and perhaps barely technical. I'm merely illustrating the proper scalable performance of issuing bland SETs and GETs against a single instance of both pieces of software.



Understand what your app needs feature-wise, scale-wise, and performance-wise, then use the right tool for the damn job. Don't just read benchmarks and flip around ignorantly, please :)



Finally, here's one more graph... I noticed that redis seemed to do slightly better in the non-parallel benchmark, so I ran the numbers again with a single parallel benchmark in case anyone wants to look into it. Yes, the memcached numbers were lower for the single benchmark test, but I don't really care since it's higher when you actually give it multiple clients :)



Comments

From: ext_263764, September 22nd, 2010 07:27 am (UTC)
Subject: Use four instances for Redis to maximize

Hello,

I'm not sure why memcached can't saturate all the threads with the async benchmark, but if you want to maximize everything in your test involving multiple processes you should also run four Redis nodes at the same time, and run every redis-benchmark against a different instance.

We tried, and this way you'll get very high numbers for Redis, but this is still wrong as it starts to be very dependent on which core is running the benchmark, and whether it is the same as the server's. A better setup is to have two boxes linked with gigabit ethernet, and run the N clients on one box and the N threads of the server (be it a single memcached process with N threads, or N Redis processes) on the other box.

From: ext_263764, September 22nd, 2010 09:11 am (UTC)
Subject: Just verified

Hello again,



OK, just verified: using N servers and M instances of the benchmark, where you have M+N different cores (8 in your box, so you can use 4 servers and 4 benchmarks), you'll get 100k requests/s per instance, for a total of 400k/s.



This is for *SETs*; I did not try GETs. So I think this shows how important it is to have a design where there is no contention. This is why Redis is single threaded.

From: jayp39, September 27th, 2010 02:02 am (UTC)

Another thing I haven't seen taken into account or mentioned anywhere is that redis doesn't have an LRU-style algorithm like memcached does. Redis will only discard data if it is expired. That means that if you want to use redis like memcached, you have to give every key an expiration short enough to keep redis from exceeding memory, and things will be discarded as they expire without regard for popularity.



In practice redis would be a poor substitute for memcached for any site with a large amount of data that might be cached, because memcached will make more efficient use of available memory by allowing more SETs to happen (don't have to worry about doing too many and running out of memory) and by automatically keeping the hottest content in memory while discarding unpopular content.



As you said, ultimately we are comparing two products designed for different purposes.

From: DenisTRUFFAUT, January 3rd, 2012 09:10 pm (UTC)

The last graph is labeled '4 parallel' instead of 'no parallelization'.



That's a bit confusing given the text.



BTW, excellent benchmark :)