[ 2019-December-08 16:34 ]

I think it is safe to state that hash tables (e.g. maps in Go, dicts in Python, HashMap in Java, etc.) are far more common than ordered data structures such as trees for in-memory data structures. One data point is that in a talk about C++ hash table optimization at Google (available in Abseil), the speaker noted that 1% of CPU and 4% of RAM globally at Google is used by hash tables. However, in databases, the default is almost always an ordered index, typically a B-Tree. Why is the "default" choice different between programs and databases, when at the end of the day they both do the same thing: accessing data for our code? More than a year ago I asked about this out loud on Twitter, and got many interesting answers. This is my very delayed attempt to distill what I've learned. [Note 2019-12-15: Added a paragraph about locality of reference.]

The traditional answer is that hash tables are designed to be efficient when storing data in memory, while B-Trees are designed for slower storage that is accessed in blocks. However, this is not a fundamental property of these data structures. There are hash tables designed to be used on disk (e.g. MySQL's hash index), many in-memory trees (e.g. Java's TreeMap, C++'s map), and even in-memory B-Trees.

I think the most important answer is that B-Trees are more "general purpose," which results in lower "total cost" for very large persistent data. In other words, even though they are slower for single value accesses that make up the majority of the workload, they are better when you consider rare operations and the cost of multiple indexes. In this article, I'll briefly explain the high level differences between hash tables and B-Trees, then discuss how persistent data has different needs than in-memory data. Finally, while I think these are probably the right defaults, I'll try to argue that we probably should use more ordered data structures in memory and more hash tables in databases.

Hash tables versus trees

First, let's review the fundamental difference between these data structures. Hash tables provide constant time O(1) access for single values, while trees provide logarithmic time O(log n) access. For single value lookups, this means hash tables are faster, no matter where your data is stored. However, in exchange for the extra overhead, trees keep the values in order. This allows us to efficiently access ranges of values, which means they efficiently support a wider variety of operations. For example, we can find all values that start with a given prefix, or the "top k" values, while with a hash table we need to scan the entire table. For an in-depth discussion at how databases use B-Trees, I highly recommend reading at least the first few sections Modern B-Tree Techniques, a wonderfully readable and comprehensive survey by Goetz Graefe.

Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves moving to a bigger table. This typically involves a slow O(n) scan, or gradually copying items to a new table (e.g. like Go does). In either case, in causes variable performance. Trees, on the other hand, have worst case performance of O(log n).

Persistent versus in-memory data

Databases are typically used to store persistent data that needs to exist approximately forever. Programs typically only store data temporarily, until it is restarted. This ultimately means databases typically store much more data, both in terms of number of records and total bytes. This makes ordered data structures better for three reasons:

1. When n is small, data structures don't matter (as much)

I would guess that most of the hash tables in programs are small, with thousands of elements or less. At that size, the differences between O(1), O(log n), and O(n) don't matter as much. Therefore, in programs, hash tables are faster for the common single value lookups, and only slightly slower for the rare full table scans. However, as the data set gets larger, occasionally paying O(n) to find related values becomes unbearably slow.

To make this more concrete, let's model an application where 99% of the accesses are for single records, but 1% of the accesses are for 10 sequential items, starting at some value. If we use a hash table, the single record accesses costs 1, while the range accesses cost n (scan everything), for a total cost of 0.99×1 + 0.01×n. If we use a tree, the single record accesses cost log(n), and the ranges accesses cost log(n) + 9, for a total cost of 0.99×log(n) + 0.01×(log(n) + 9). Here is a graph of this model (or see the spreadsheet):

Obviously the "cost" is a fake number, but the trend is relevant: for small n, the hash table is better, but for large n, the cost of that rare scan dominates, and the tree is better. This is a classic O(n) versus O(log n) comparison, which means the ordered data structure's more "balanced" performance is a larger advantage for large data sets.

2. Storage is not free

When the number of elements is small, it is relatively cheap to store multiple copies of the data, indexed in different ways. Both programs and databases use this trick. However, adding another index costs O(n) storage and O(n) time to build. Adding a new secondary index on a very large database can take hours or days. As a result, the advantage in reusing an index for different kinds of queries grows with the size of the data. This is an advantage for ordered indexes, since we can use the ordered property in a few different ways. For example, one common database trick to create an index with more than one column, e.g. (location, store name). Then we can use this index to access one specific (location, store name), but also records for a single (location), or even a prefix of the location key. With a hash table, I need separate indexes.

3. Persistent data structures are much more complex

Storing data on disk so it is never corrupt and never lost, even when machines crash, is hard. You need to carefully order your writes, use write-ahead logging, and might need fine-grained concurrency control. This adds up to much more code than in-memory data structures. As a result, many databases only support a single type of persistent index. If you are only going to pick a single index, it is best to have one that works for a wide range of workloads, even if it is slightly less efficient.

This extra complexity means there is a lot more work than just data structure access, which changes the constant factors substantially. For example, you may need to send a request across the network, parse it, copy some data from disk, parse that, acquire locks, etc. This reduces the difference between the O(1) and O(log N) accesses.

4. Locality of reference matters more

Locality of reference always matters. In memory, an L1 cache hit is around 1 ns, while a main memory reference takes about 100 ns, so it is a factor of 100X slower. If you need to go to disk, an SSD random read is about 15 us, which is another 150X, while a magnetic disk random read is around 2ms, so about 2000X slower than memory. This means you want to avoid accesses to main memory, but you really want to avoid disk reads. In a hash table, each key is randomly assigned to a different location. This spreads load evenly, but also means that it doesn't matter if the application has some "natural" locality, such as accessing data for a single user. Every access to a new key is a random access that will only be cached by luck. With an ordered data structure, an application can choose its keys so that "related" data is close in the key space. This means that for some applications, the first access pulls a block of data into memory, and subsequent accesses will be cached. This is effectively performing more computation in exchange for fewer disk IOs. For some applications, this optimization won't help at all, but for some, it can be an enormous difference.

Conclusions and lessons

I think the correct default is to use hash tables in programs and ordered data structures in databases. However, there are probably plenty of cases where we are using the wrong choice. I suspect it is more often programs that fail to take advantage of ordered data structures, because using them is harder than brute-force iteration over everything. For example, if I have two collections sorted on time, and I want to output the merged collection in time order, I can iterate through the two collections simultaneously, advancing the appropriate iterator. However, I could also put all the events in a single list, sort it, then output it. I know which one I would write first. This sort of complexity trade-off makes me wonder if there is some way to embed database query optimization while working with collections ...

Acknowledgements

Thank you to all the people who replied to the Twitter thread. The points in this article are from all of you. I apologize for not having the time to go back and credit you directly.