A coworker and I were designing a solution to a rather interesting problem yesterday, and we realized we needed a more in-depth understanding of the memory usage of the various datatypes in Redis to squeeze as much storage out of Redis as possible. Among other things, we also wanted to verify some of the statements in the Redis documentation, e.g. are hashes really preferable to creating pseudo-hashes by namespacing keys?

Understanding the memory costs of Redis datastructures is quite important, as it allows application designers to choose between datastructures with equivalent big O costs for their required operations but rather different storage costs in Redis. For us this additional information has had unexpected implications for the design, in addition to the obvious value for capacity planning. (The Redis documentation is already remarkable in explicitly noting the big O costs for each available operation, and would be even better if it similarly formalized the storage costs for each datastructure.)

Without further ado, here is a look (I am tempted to prepend "look" with "casual" to avoid being flamed for my inevitable failures in benchmarking, but in reality you just can't avoid getting flamed) at memory growth for storing Redis sets, sorted sets, hashes, lists and strings.

Testing Methodology

For each datastructure's testcase, the process for each combination of inputs was:

1. Remove all data from Redis using FLUSHALL.
2. Wait one second.
3. Load the data.
4. Determine the final INFO for Redis.
5. Remove all data from Redis using FLUSHALL.

I didn't find the initial memory to be a useful statistic to discuss, as Redis doesn't immediately return all of its unused memory, but it does consistently reuse it. Consider for example these runs where I did r.flushall() before and after each run:

(virt)will:redis_benchmark lethain$ python set_memory.py
Keys: 0 => 100
Memory: 1.03M => 14.09M
(virt)will:redis_benchmark lethain$ python set_memory.py
Keys: 0 => 100
Memory: 7.14M => 14.09M
(virt)will:redis_benchmark lethain$ python set_memory.py
Keys: 0 => 100
Memory: 7.14M => 14.09M
(virt)will:redis_benchmark lethain$ python set_memory.py
Keys: 0 => 100
Memory: 7.14M => 14.09M
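The five steps above can be sketched as a small harness. This is a hypothetical reconstruction rather than the actual benchmark code; it assumes a redis-py style client exposing flushall() and info():

```python
import time

def measure(client, load, settle=1.0):
    """Flush Redis, wait, load the test data, and read back used memory.

    `client` is assumed to be a redis-py style client exposing
    flushall() and info(); `load` is a callable that writes the test
    data through that client.
    """
    client.flushall()
    time.sleep(settle)   # give Redis a moment to settle after the flush
    load(client)
    used = client.info()["used_memory"]
    client.flushall()
    return used
```

The `settle` parameter and the exact INFO field read are my assumptions; the real scripts simply report the human-readable memory numbers shown in the tables below.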

Configuration Details

Here is an incomplete list of relevant information about these benchmarks:

These tests were run on an iMac with a quad-core 2.66 GHz Intel Core i5 and 4 GB of RAM, on OS X 10.6.4.

They used redis-2.0.0-rc4, with redis-server built using the default Makefile.

Where not explicitly mentioned, these tests were done using the 64-bit executable, which has a definite memory cost as opposed to the 32-bit version. I am rerunning the tests with the 32-bit build and updating the sections as the data becomes available (the 32-bit version was built via make ARCH="-m32" but otherwise used the default Makefile). As a rough rule of thumb, the 64-bit version uses 1.5 times the memory of the 32-bit version.

They used the Redis-Py client for all operations, specifically the code from commit 16c0eff87e7762d82fdd37d9949e33a7376d5357.

Now let's take a look at the numbers.

Set

Each value inserted into the sets was generated via

import uuid
val = str(uuid.uuid4())

which ensured a consistent size with inconsistent names, and also happened to mimic the usecase I am personally most interested in. The key for storing each set was "set.%s" % i, where i was a monotonically increasing integer from 0 to # Sets.

Full code for generating these numbers available in set_memory.py.
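The loading loop presumably looks something like this sketch (the function name and structure are my assumption, not the actual contents of set_memory.py):

```python
import uuid

def load_sets(r, num_sets, values_per_set):
    """Populate `num_sets` Redis sets with random UUID members.

    `r` is assumed to be a redis-py client, or anything exposing sadd().
    """
    for i in range(num_sets):
        key = "set.%s" % i
        for _ in range(values_per_set):
            r.sadd(key, str(uuid.uuid4()))
```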

Memory usage was:

# Sets    # Values per set    Memory Used
100       100                 2.47M
100       1,000               14.09M
100       10,000              142.43M
1,000     100                 70.23M
1,000     1,000               131.57M
1,000     10,000              1.37G
10,000    100                 145.02M
10,000    1,000               1.27G

Values for the 32-bit executable:

# Sets    # Values per set    Memory Used
100       100                 1.65M
100       1,000               10.21M
100       10,000              101.70M
1,000     100                 38.53M
1,000     1,000               96.67M

It is worth noting that sets are approximately 35% smaller than hashes or strings (those numbers appear below) while offering the same algorithmic complexity for lookup, so those who are only interested in membership or existence should strongly consider the set.

Sorted Set

As with the set tests, each value inserted into the sorted sets was generated via str(uuid.uuid4()), and each sorted set lived at a key of format "set.%s" % i. In addition, each member of the sorted set had a score corresponding to the output of time.time() at the time of its insertion.

Full code for generating these numbers available in sorted_set_memory.py.
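A hypothetical sketch of the loading loop, analogous to the set version (note this uses the modern redis-py zadd(name, mapping) signature, which differs from the client version used in these benchmarks):

```python
import time
import uuid

def load_sorted_sets(r, num_sets, members_per_set):
    """Populate sorted sets with UUID members scored by insertion time."""
    for i in range(num_sets):
        key = "set.%s" % i
        for _ in range(members_per_set):
            # Each member's score is the timestamp at insertion.
            r.zadd(key, {str(uuid.uuid4()): time.time()})
```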

# Sorted Sets    # Members per sorted set    Memory Used
100              100                         3.36M
100              1,000                       22.59M
100              10,000                      226.37M
1,000            100                         79.20M
1,000            1,000                       216.34M
1,000            10,000                      2.18G
10,000           100                         234.18M
10,000           1,000                       2.09G

Values with 32-bit executable:

# Sorted Sets    # Members per sorted set    Memory Used
100              100                         2.33M
100              1,000                       16.79M
100              10,000                      166.83M
1,000            100                         45.39M
1,000            1,000                       162.04M

One caveat worth mentioning: insertion into a sorted set is algorithmically more complex than insertion into a set or hash, and that algorithmic complexity translated into very real slowness when running the memory-consumption scripts against sorted sets with large numbers of values.

Hashes

Keys for the hashes were of format "hash.%s" % i, keys within the hashes were of format str(uuid.uuid4()), and values for each key were str(time.time()).

Full code for generating these numbers available in hash_memory.py.
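Sketched as code (again a hypothetical reconstruction, not the actual contents of hash_memory.py):

```python
import time
import uuid

def load_hashes(r, num_hashes, values_per_hash):
    """Populate hashes whose fields are UUIDs and whose values are timestamps."""
    for i in range(num_hashes):
        key = "hash.%s" % i
        for _ in range(values_per_hash):
            r.hset(key, str(uuid.uuid4()), str(time.time()))
```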

# Hashes    # Values per hash    Memory Used
100         100                  3.23M
100         1,000                21.71M
100         10,000               218.66M
1,000       100                  71.77M
1,000       1,000                207.87M
1,000       10,000               2.11G
10,000      100                  221.31M
10,000      1,000                2.01G

Values for 32-bit binary:

# Hashes    # Values per hash    Memory Used
100         100                  2.11M
100         1,000                14.78M
100         10,000               147.37M
1,000       100                  40.06M
1,000       1,000                142.44M
1,000       10,000               1.43G

Lists

Keys for the lists were of format "list.%s" % i, and values within the lists were of format str(uuid.uuid4()).

Full code for generating these numbers available in list_memory.py.
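As with the others, a hypothetical sketch of the loading loop (RPUSH is one reasonable choice here; the actual script may differ):

```python
import uuid

def load_lists(r, num_lists, members_per_list):
    """Populate lists with UUID members by appending to the tail."""
    for i in range(num_lists):
        key = "list.%s" % i
        for _ in range(members_per_list):
            r.rpush(key, str(uuid.uuid4()))
```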

# Lists    # Members per list    Memory Used
100        100                   2.27M
100        1,000                 13.25M
100        10,000                123.12M
1,000      100                   68.25M
1,000      1,000                 123.30M
1,000      10,000                1.19G
10,000     100                   125.21M
10,000     1,000                 1.20G

Values for 32bit executable:

# Lists    # Members per list    Memory Used
100        100                   1.52M
100        1,000                 9.76M
100        10,000                92.16M
1,000      100                   37.27M
1,000      1,000                 92.26M

Strings

Keys for the strings were of format str(uuid.uuid4()) and values were of format str(time.time()).

Full code for generating these numbers available in string_memory.py.
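One more hypothetical sketch, this time of the string-loading loop:

```python
import time
import uuid

def load_strings(r, num_strings):
    """SET UUID-named keys to timestamp values."""
    for _ in range(num_strings):
        r.set(str(uuid.uuid4()), str(time.time()))
```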

# Strings     Memory Used
100           1.05M
1,000         1.23M
10,000        3.14M
100,000       21.87M
1,000,000     207.39M
10,000,000    2.29G

Values for 32-bit executable:

# Strings     Memory Used
100           0.62M
1,000         0.75M
10,000        2.03M
100,000       14.83M
1,000,000     141.93M

Note that I got tired of waiting for the 10 million keys to load, so I extrapolated the final memory usage from the 2,150,096,816 bytes used at 9,363,880 keys. This is the only number here that I fudged.
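The extrapolation is simple linear scaling of bytes per key:

```python
# Extrapolate 10,000,000 keys from the observed sample.
observed_bytes = 2150096816
observed_keys = 9363880

bytes_per_key = observed_bytes / observed_keys  # roughly 230 bytes per key
estimated = bytes_per_key * 10_000_000          # roughly 2.3e9 bytes, i.e. the ~2.29G above
```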

Strings Versus Hashes

Let's start by comparing three ways of storing 1,000,000 key-value pairs:

1,000 hashes with 1,000 key-value pairs took 207M.

100 hashes with 10,000 key-value pairs took 218M.

1,000,000 strings took 207M.

Well, that's a bit unexpected when viewed next to previous statements that hashes should outperform strings in terms of memory consumption, but I still say that hashes come out on top for a couple of reasons.

The first is that hashes are storing more data since they store the exact same key-value pairs as the strings and also store the hash keys themselves, so for some usecases they will be more memory efficient (e.g. with extremely small keys for the key-value pairs within a hash versus having to generate many longer namespaced keys). The second is that hashes expose a much richer interface for accomplishing certain tasks (retrieving all values in a given hash) with low algorithmic complexity.

If the hash can offer more functionality and equal-in-the-worst-case better-in-the-best-case memory consumption, then they do seem preferable to strings for many usecases.
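To make the comparison concrete, here is a hypothetical sketch of the two layouts for the same record; the key formats and field names are my invention:

```python
def store_as_strings(r, user_id, fields):
    # One namespaced string key per field.
    for name, value in fields.items():
        r.set("user.%s.%s" % (user_id, name), value)

def store_as_hash(r, user_id, fields):
    # A single hash per record; HGETALL can later fetch every field at once,
    # which the namespaced-string layout cannot do without scanning keys.
    for name, value in fields.items():
        r.hset("user.%s" % user_id, name, value)
```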

Hashes Versus Strings Versus Sorted Sets

Taking our list from above, let's add the sorted set to our comparison of storing 1,000,000 key-value pairs (with the acknowledged caveat that the sorted set is storing a score-member pair, and the caveat squared that you can use this score-member pair as a key-value pair for some usecases):

1,000 hashes with 1,000 key-value pairs took 207M.

100 hashes with 10,000 key-value pairs took 218M.

1,000,000 strings took 207M.

100 sorted sets with 10,000 score-member pairs took 226.37M.

1,000 sorted sets with 1,000 score-member pairs took 216.34M.

10,000 sorted sets with 100 score-member pairs took 234.18M.

The memory-usage variation between strings, hashes and sorted sets ends up being quite a bit less than anticipated, probably because under the hood they are implemented using the same data structures. Unfortunately, there simply isn't much memory to be gained by switching between these datastructures.

Sorted Set Versus Set

Another interesting comparison is the relative costs of sorted and unsorted sets:

Datastructure    # Sets    # Members    Memory
Set              100       10,000       142.43M
Sorted Set       100       10,000       226.37M
Set              1,000     1,000        131.57M
Sorted Set       1,000     1,000        216.34M
Set              10,000    100          145.02M
Sorted Set       10,000    100          234.18M

We can see that sorted sets are about 60% larger than their unsorted counterparts which seems pretty reasonable considering the quantity of additional information being stored in the sorted set.
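As a quick sanity check on that figure, using the 100 sets x 10,000 members case:

```python
sorted_set_mb = 226.37
set_mb = 142.43

# About 1.59, i.e. sorted sets come out roughly 60% larger.
ratio = sorted_set_mb / set_mb
```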

The End

Hopefully these numbers provide some insight for you as you are architecting Redis' role in your next application. Certainly they have been quite valuable for me. As always, I'm glad to hear feedback, especially if anyone has suggestions for optimizations to fit more data into a given amount of memory.