Like many people already doing so, I’ve been digging into Redis and other NoSQL products.

Unlike all those anti-SQL fanboys, I have 15+ years of experience in RDBMS, and I know the relational algebra has a lot more mathematical implications in practice than the CAP theorem, not to mention the maturity of implementations, so I proceeded with caution.

I first looked at Cassandra and MongoDB, and got the impression that they were already over-featured enough to obscure what kind of problems they were actually trying to solve. I knew their boilerplate said something about scaling horizontally, which sounded nonsensical to me, for two reasons:

First, scaling horizontally has little to do with the database engine itself - creating a transparent, consistent hash function is the easiest part. The hard part is choosing a good namespace for keys - you still need to organize keys in some way, and from time to time you need to “migrate” keys when refactoring. And once you have settled on a solid naming structure, you can choose whatever database you like, including an RDBMS, by redirecting (or proxying) requests based on the same partitioning logic. There’s nothing really inherent to the NoSQL technologies.
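To back up the claim that the hash function is the easy part, here is a minimal sketch of a consistent-hash ring placed in front of any set of database servers. The node names, the key format and the replica count are made up for illustration, and a real proxy would need failure handling on top of this.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (the hash is used for spreading, not security)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; node names and replica count are illustrative."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` virtual points so keys spread evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First virtual point clockwise from the key's position, wrapping around.
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[index][1]


three = HashRing(["db1", "db2", "db3"])
four = HashRing(["db1", "db2", "db3", "db4"])
keys = [f"user:{i}:visits" for i in range(1000)]
moved = sum(three.node_for(k) != four.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding db4")
```

Only the keys that land on the new node move, and nothing in the routing cares whether db1 to db4 are Redis, MySQL or anything else. The naming of the keys is still your problem.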

Second, by scaling horizontally, you only get a performance gain in O(N), at the cost of decreased MTBF. If you want to double the memory, you need to double the machines in the cluster. In computer science terminology, an O(N) algorithm is considered “naive”, and in computer security terminology, it even has a name - “brute force”. If your web site were scaling linearly, that approach would be fine. But when we’re talking about Web Scale, it’s about exponential growth, right? I don’t think it’s reasonable to always expect an exponential increase in capital - I think adding 1000 servers every week should be considered a necessary evil, not a goal.

Simply put, in a system of significant scale, there is no silver bullet.

It’s not horizontal scalability at all that interested me in NoSQL. It was a basic problem that RDBMS couldn’t solve.

Why Do We Need Databases?

We use a database for various reasons - persistence, consistency, concurrency and query-ability.

But there is one important use case that is often overlooked - using a database as global variables.

There are times when you want to use global variables that survive beyond HTTP requests, but you can’t, unless you run a single process on a single machine, which is not scalable by any standard.

In that case, even if what you wanted was a simple access counter with a single integer value, you had to create a table in the database, create a class or a singleton then map the record to it.

Sure, you can solve the problem that way, but it feels like hitting a nail with a sledgehammer. In addition, if you’ve hardcoded the variable name as a table name, you’ll be screwed when you want to change the name during refactoring.

Some people use memcached for that. That’s great, but remember, memcached is an LRU cache and it is volatile. Global variables are volatile too, but their lifecycle is tied to the process that owns them, so you don’t worry about losing global variables in the middle of something. With memcached, most probably you end up with an orthodox approach: When you write, write to both memcached and database, and when you read, first read memcached, and if not found then read database and write it to memcached.
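The orthodox approach described above can be sketched in a few lines. Two plain dicts stand in for memcached and the database here, and the class and key names are hypothetical; the point is only to show how much bookkeeping ends up in application code.

```python
class CounterStore:
    """Write-through / read-through access using two dicts as stand-ins.

    `cache` plays the role of memcached (it may lose entries at any time) and
    `database` plays the role of the durable store. Both are hypothetical.
    """

    def __init__(self):
        self.cache = {}
        self.database = {}

    def write(self, key, value):
        self.database[key] = value   # durable copy
        self.cache[key] = value      # fast copy

    def read(self, key):
        if key in self.cache:        # cache hit
            return self.cache[key]
        value = self.database[key]   # cache miss: fall back to the database
        self.cache[key] = value      # repopulate the cache for the next reader
        return value


store = CounterStore()
store.write("access_counter", 41)
store.cache.clear()                  # simulate an LRU eviction or a restart
print(store.read("access_counter"))  # 41, served from the database
```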

But wait, if you write to the database anyway, why don’t you use the database directly, without memcached? Doesn’t it sound like your application code is taking over part of the core functionality of the database system?

Adding cache servers for performance improvement is a real thing, but it increases cyclomatic complexity of your code, which is extrinsic to your application. Considering that the true bottleneck in software development is the productivity of programmers, as hardware performance continues to grow, at some point programmers will start to avoid application-level caching as far as possible.

Here’s a quote from Phil Karlton: “There are only two hard things in Computer Science: cache invalidation and naming things.” - Let your database do one of the hardest things, and let human beings focus on naming things.

So, is there a simple nail hammer to hit the nail? Key-value stores (KVS) seem to fit the bill, since all you want is to give a name (key) to the variable and assign some data (value) to it.

Here comes Redis.

Redis as Global Variables on Steroids

Redis advertises itself as a memcached on steroids, and the statement is valid.

Redis runs really fast, in some benchmarks even faster than memcached, but that’s not even the best part to me.

When we create some kind of data structure using variables, we use arrays, hashes and sets. And Redis supports them natively (plus more). All operations are atomic in principle, which is a GREAT idea. Most KVS support atomic increment on integer values, but in Redis, push/pop operations on an array or even moving a key-value pair to another database are atomic, which is nearly impossible without server-side support. It’s kind of like thread-safe global variables - you don’t need to mess with mutexes when you access global variables. It’s pretty cool.

Redis is single-threaded with epoll/kqueue and scales indefinitely in terms of I/O concurrency. This is also a great decision for a disk-backed storage system like Redis, because the CPU can hardly be saturated (typically less than 5%), and it spares Redis the serious race-condition bugs that could otherwise have bothered us. In fact, there’s a benchmark that shows nearly 90,000 queries per second with 26,000 concurrent requests.

For the use of global variables on steroids, Redis seems like a clear winner.

Okay, Redis is fantastic, so what’s the catch, then?

Persistence Is Serious Business

The only, but rather significant problem I can see in Redis is persistence.

There are two kinds of persistence supported: the first one is called snapshotting. In this mode, Redis periodically writes all objects in memory to disk. “Periodically” means, by default, whenever there have been 10,000 changes in 60 seconds, 10 changes in 5 minutes, or 1 change in 15 minutes. “Write to disk” really means: fork a child process for background processing, serialize all data in memory, write it to a temporary file in the background and atomically rename the file to the real one when finished. Even though the overhead of fork is zero in theory when the OS supports copy-on-write, you still need to turn on the overcommit_memory setting if you want to deal with a dataset larger than ½ of RAM, which contradicts our habit, as database administrators, of battling the OOM killer.
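Here is a rough sketch of the two halves of snapshotting as described above: deciding when a snapshot is due, and writing it through a temporary file plus an atomic rename. The fork step is left out, JSON stands in for the real dump format, and the function names are mine, so treat it as an illustration of the idea rather than what Redis actually does internally.

```python
import json
import os
import tempfile

# Default save points as described above: (seconds, minimum changes).
SAVE_POINTS = [(900, 1), (300, 10), (60, 10000)]


def snapshot_due(seconds_since_save, changes_since_save, save_points=SAVE_POINTS):
    """Return True if any save point is satisfied."""
    return any(
        seconds_since_save >= seconds and changes_since_save >= changes
        for seconds, changes in save_points
    )


def write_snapshot(data, path):
    """Serialize `data` to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix="temp-", suffix=".snapshot")
    with os.fdopen(fd, "w") as temp_file:
        json.dump(data, temp_file)
        temp_file.flush()
        os.fsync(temp_file.fileno())  # make sure the bytes are on disk before the rename
    os.replace(temp_path, path)       # readers see the old file or the new one, never half of one


print(snapshot_due(60, 10000))  # True
print(snapshot_due(120, 5))     # False
```

Note that `write_snapshot` always serializes the whole dataset, no matter how little of it changed.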

That said, I think it is a very simple, safe, well-designed approach. Well, at least for small datasets. As you learn more about Redis, you might realize why this mode was chosen as the default behavior.

But it also means that you could lose up to 15 minutes of the most recent data. For a moderately busy server, you’re likely to get 10,000 changes in 60 seconds anyway, so let’s simply assume that the snapshotting runs every 60 seconds, but still it’s not really durable. Also you’ll be screwed when you have a large dataset, since Redis writes the entire dataset to disk every time. When you have 1GB of data, writing it every 60 seconds is a pretty bad idea.
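To put a number on that, here is the back-of-the-envelope arithmetic for a 1GB dataset snapshotted every 60 seconds. It only counts the bytes of the dataset itself and ignores any compaction in the dump format.

```python
dataset_bytes = 1 * 1024**3   # 1GB dataset, as in the text
interval_seconds = 60         # one full snapshot per minute

write_rate_mb_per_s = dataset_bytes / interval_seconds / 1024**2
written_per_day_gb = dataset_bytes * (24 * 3600 // interval_seconds) / 1024**3

print(f"{write_rate_mb_per_s:.1f} MB/s sustained")     # 17.1 MB/s sustained
print(f"{written_per_day_gb:.0f} GB written per day")  # 1440 GB written per day
```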

To solve this problem, Redis 1.1 introduced the append-only file (AOF), which is a sort of write-ahead log (WAL) in conventional database terms. With AOF, every write request to Redis is written to this file, and you can replay the log to reconstruct the entire database on reboot.
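The replay idea is easy to show with a toy. In this sketch a Python list stands in for the append-only file and a dict stands in for the dataset; the three commands are only a tiny, simplified subset chosen for illustration.

```python
def apply(state, command):
    """Apply one write command to an in-memory dict standing in for the dataset."""
    op, key, *args = command
    if op == "SET":
        state[key] = args[0]
    elif op == "INCR":
        state[key] = int(state.get(key, 0)) + 1
    elif op == "DEL":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown command: {op}")


def replay(log):
    """Rebuild the dataset from scratch by re-running every logged command in order."""
    state = {}
    for command in log:
        apply(state, command)
    return state


log = []    # stands in for the append-only file
live = {}   # stands in for the dataset in memory
for command in [("SET", "greeting", "hello"), ("INCR", "hits"), ("INCR", "hits"), ("DEL", "greeting")]:
    log.append(command)   # record the write
    apply(live, command)  # apply it in memory

print(replay(log) == live)  # True
```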

But AOF has some obvious downsides.

First, with AOF, Redis has introduced the same possible bottleneck that RDBMS have been suffering from. Writing to AOF is a disk-bound task. Even though a sequential write to disk is considerably faster than random access, it’s still thousands of times slower than writing to memory. If you use “appendfsync always” without a RAID controller with battery-backed write-back cache, you’re totally fucked. For that matter, I agree with MongoDB’s statement - “Battery backed raid controllers will work well, but you have to really have one. With the move towards the cloud and outsourced hosting, custom hardware is not always an option.” Redis 2.0 has changed the default to “appendfsync everysec”, which has about the same effect as innodb_flush_log_at_trx_commit=2 in the MySQL world. I think the new default in Redis 2.0 is much more sensible for a modern database system.

But that’s not the only problem with AOF. If you have an update-intensive application like access counters, you end up with a ridiculously fast-growing AOF and needlessly intensive disk writes, even though all you really get as a result is a few bytes of access counter. Furthermore, when to run the BGREWRITEAOF command to cut down the AOF is up to you, which makes it pretty hard to plan ahead.
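The access counter case looks like this in miniature. The list is again a stand-in for the AOF, and `rewrite` is my own simplified picture of what cutting down the log amounts to, not the actual BGREWRITEAOF implementation.

```python
def rewrite(state):
    """Build the shortest log that recreates `state`: one SET per key."""
    return [("SET", key, value) for key, value in sorted(state.items())]


log = []
counters = {}
for _ in range(100_000):                      # 100,000 page views
    counters["hits"] = counters.get("hits", 0) + 1
    log.append(("INCR", "hits"))              # one log entry per increment

compacted = rewrite(counters)
print(len(log), "entries before rewrite")     # 100000 entries before rewrite
print(len(compacted), "entry after rewrite")  # 1 entry after rewrite
print(compacted)                              # [('SET', 'hits', 100000)]
```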

If you’re familiar with MySQL (InnoDB), you probably know that the size of the WAL files can be fixed by configuration, using the combination of innodb_log_file_size and innodb_log_files_in_group. When the WAL files are fully used, InnoDB flushes dirty pages in its buffer pool and automatically rotates the log files. With MySQL, it’s all automatic. That’s not possible for Redis, because it doesn’t have a buffer pool mapped to the database file, which is what enables incremental writes for persistence. I’ll go more into that topic later.

Probably what we really want is something in between snapshotting and AOF. Snapshotting is very robust with update-intensive applications but is not good at handling large datasets. AOF is more robust with large datasets but is not a good fit for update-intensive applications.

What I would dream of, considering the nature of Redis, is something like this: “Write back only dirty pages, every one second.” That’d be perfect.
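To make the wish concrete, here is a toy page store that remembers which pages changed since the last flush and writes back only those. The page size, the dict-based “disk” and all the names are invented for illustration; Redis has nothing like this, which is exactly the point.

```python
class PagedStore:
    """Toy page cache that tracks which pages changed since the last flush.

    The page size, the dict-based "disk", and all names are illustrative.
    """

    def __init__(self, page_size=4):
        self.page_size = page_size
        self.pages = {}      # page number -> {slot: value}, the in-memory copy
        self.dirty = set()   # page numbers modified since the last flush
        self.disk = {}       # stand-in for the database file

    def put(self, slot, value):
        page_no = slot // self.page_size
        self.pages.setdefault(page_no, {})[slot] = value
        self.dirty.add(page_no)

    def flush(self):
        """Write back only dirty pages; return how many pages were written."""
        written = len(self.dirty)
        for page_no in self.dirty:
            self.disk[page_no] = dict(self.pages[page_no])
        self.dirty.clear()
        return written


store = PagedStore()
for slot in range(40):        # fill 10 pages
    store.put(slot, 0)
print(store.flush())          # 10
for _ in range(1000):         # hammer a single counter
    store.put(7, store.pages[1][7] + 1)
print(store.flush())          # 1
```

A thousand updates to one counter cost one page write at the next flush, regardless of how large the rest of the dataset is.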

The 1:10 Problem - Holy Grail for NoSQL?

Finally, I’m going to formulate the 1:10 problem - that is, the ability or inability to handle a 10GB dataset with 1GB of memory.

In fact, most NoSQL databases expect all data to be in memory in order to be fast. That of course is always preferable, even with RDBMS. The question is how bad it can get when your dataset becomes larger than memory.

In the real world, applications have locality of reference, more or less. And given that RAM has a much higher price per GB than disk, there should be a sweet spot in the mix of memory and disk that maximizes cost efficiency within the range of acceptable performance.
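The usual way to reason about that sweet spot is the average access time as a function of the hit ratio. With perfectly uniform access, 1GB of RAM in front of a 10GB dataset would give a hit ratio of about 10%; locality of reference is what pushes it higher. The latencies below are rough figures I assumed for illustration, not measurements.

```python
def effective_latency_us(hit_ratio, memory_us=0.1, disk_us=10_000.0):
    """Average access time when `hit_ratio` of requests are served from RAM.

    The default latencies are rough, assumed figures for illustration only.
    """
    return hit_ratio * memory_us + (1.0 - hit_ratio) * disk_us


for hit_ratio in (0.10, 0.90, 0.99, 0.999):
    print(f"hit ratio {hit_ratio:.3f}: {effective_latency_us(hit_ratio):9.1f} us")
```

Because a disk access is several orders of magnitude slower, even a small miss ratio dominates the average, which is why the working set, not the total dataset, is what has to fit in memory.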

It is possible in the future that the faster and cheaper SSD will fundamentally solve the problem, but that would rather give extended life to RDBMS than NoSQL. So let’s not talk about SSD here.

What does Redis do to solve the 1:10 problem? The answer is Virtual Memory. It uses a swap file to store values that don’t fit in memory, while all keys are kept in memory so that key lookup and metadata retrieval are still fast. By the way, in every use case, keeping the key space small (number of keys, length of keys) is a critical factor when you optimize Redis. Use a hash with N fields rather than N independent keys whenever possible, and also consider zipmap when you do so.

Redis doesn’t use OS swap. According to Salvatore Sanfilippo, the creator of Redis, it was because the page size of 4KB was too big. I personally don’t think that helps; it’d be better if Redis preallocated a specified amount of buffer pool and brought related objects onto the same page to increase locality of reference, instead of letting the heap manager blindly fragment objects. In my opinion, a page size of 32 bytes is too small, considering that hardware architectures and compilers are optimized for the conventional page size. At that scale, even the latency of reading something from RAM could be dominant (RAM is too slow for the CPU, which is why it has an L1/L2 cache), and RAM has a pipelined burst mode that prefetches memory contents within a few clock cycles, before they are actually requested. It is a concept similar to speculative execution. Not to mention that a hard drive performs by far the worst when it has to randomly access small, fragmented chunks. But maybe I’m missing something.

Anyway, the page size is not the problem I want to discuss here - the real problem is that you now get a third component that requires and contends for disk I/O, in addition to snapshotting and AOF. Above all, Virtual Memory is a swap and is not meant for durability. In other words, you can’t restore the data from it on restart, even though the data is in fact written to the file!

Personally I don’t like the current implementation of Virtual Memory. Disk I/O is the most limited resource in the hardware, and we can’t afford to waste it on redundant use, for any purpose other than persistence.

Again, we could learn something from MySQL. In InnoDB, swap and durability are integrated into one unified subsystem called the buffer pool, which is why innodb_buffer_pool_size is the single most important parameter when we optimize MySQL. Unlike in Redis, one disk write means both swap and durability at the same time. In fact, a setup with 1GB of memory handling a 10GB dataset is very common with MySQL. The degradation in performance is much like O(log N), thanks to the locality of reference and the B-tree clustered index. MySQL 5.1 comes with the InnoDB Plugin, where the new Barracuda format is available, and it has an option to compress buffer pool pages at the cost of a small CPU overhead. It turned out to make a real difference. Clearly, MySQL is ahead in the game when it comes to the 1:10 problem.

Redis, on the other hand, has a compact serialized format for the dump file. Usually, the in-memory footprint seems to be about 10 times larger than the snapshot file. That’s a different concept, but it’d be interesting to see what would happen if Redis compressed its data in memory, as there are a lot of repeated substrings that appear in a lot of places in the key space.

Usually it is suggested to avoid reinventing yet another virtual memory layer at the application level, but Redis has already crossed the line. Then why not go one step further to the buffer pool approach? That would enable dirty page tracking and partial sync to the database file without the need for AOF, and the aforementioned ideal durability option, “write back only dirty pages, every one second”, would become reality. Even with AOF for maximum durability (which I don’t think is necessary for Redis), the log file could be automatically rotated at some fixed size by flushing dirty pages, exactly as InnoDB does. There are valid reasons why almost all seasoned RDBMS like Oracle, PostgreSQL and MySQL combine a buffer pool with write-ahead logging for durability.

For that matter, I’m very interested in Kyoto Cabinet, created by Mikio Hirabayashi. The durability features he implemented sound really good. I’ll definitely try Kyoto Tycoon, the networking layer on top of Kyoto Cabinet, when it comes out.

Conclusion

Although I pointed out some concerns in this post, I love Redis, and I would recommend it to anyone who has a casual interest in NoSQL databases. Remember that Redis is a young product with a bright future, after all. I dare say that Redis is within reach of becoming the new de facto standard after MySQL and memcached.

For people who don’t really grok what’s been said in this post (maybe because it was just too long to read), my recommended setup is: “Use Redis for small datasets that don’t grow fast (stay far less than 1GB). Have at least twice as much memory as the dataset. Use default snapshotting and disable AOF.”

That way, you can enjoy all the delicious fruits from Redis and won’t see any problems.