May 1, 2017

Over the last several years, we’ve seen the rise of fast, ordered, transactional key-value stores such as leveldb , rocksdb , and lmdb . These are all very cool projects in their own rights, but perhaps even more notable are the projects built on top of them, such as MyRocks, SQLite4, CockroachDB, and Google’s F1.

To better understand how these work, I’ve been hacking together a basic relational database on a key-value store named voodoo. Interestingly, many of the concepts are applicable to everyday usage of a normal SQL database, especially in understanding query plans and indexes.

Starting from the Beginning #

A relational database is centered around the concept of a table , which is comprised of rows and columns . Notably, all tables have a column labeled as a primary key , which is used to uniquely identify each row. In the example users table below, id is the primary key .

id username email 1 ian root@example.com 2 alice alice@security.com 3 eve eve@pwned.com

If we had a key value store that implemented the interface:

type KV interface { Get(key []byte) []byte Set(key []byte, val []byte) }

we could map each row of the above table onto it by setting keys for each primary key, making sure to include the table name:

'users/1' => 'username: ian, email: root@example.com' 'users/2' => 'username: alice, email: alice@security.com' 'users/3' => 'username: eve, email: eve@pwned.com'

And in doing so, could set and retrieve rows of our users table by primary key.

What would we do if we wanted to look up users where email = alice@security.com ? We have no way to see what keys exist if we just have the above interface. Luckily, all of the aforementioned key-value stores implement an interface that looks more like:

type KV interface { Get(key []byte) []byte Set(key []byte, val []byte Walk(prefix []byte) Iterator } type Iterator interface { // get the value at the iterator location Val() []byte // get the current key Key() []byte // move to the next key Next() bool }

which allows us to traverse the keyspace by prefix. So with some code like:

iterator := kv.Walk([]byte("users/")) for iterator.Next() { if hasEmail(iterator.Val(), "alice@security.com") { return iterator.Val() } }

We could find the key associated with email: alice@security.com by iterating through all of the keys until we find a matching row (or all matching rows if the column isn’t marked unique). This is known in RDBMS terminology as sequential scan, something you may have seen in the output of a SQL EXPLAIN before (probably marked Seq Scan ). Unfortunately, while this strategy will work, it’s not particularly fast. To make this query fast we’ll need what’s known as a secondary index , a mapping of all emails to the keys in which they belong, akin to the SQL CREATE UNIQUE INDEX users_email_idx ON users(email); . This is fairly straightforward, changing our keyspace to look like:

'users/1' => 'username: ian, email: root@example.com' 'users/2' => 'username: alice, email: alice@security.com' 'users/3' => 'username: eve, email: eve@pwned.com' 'users_email_idx/root@example.com' => 'users/1' 'users_email_idx/alice@security.com' => 'users/2' 'users_email_idx/eve@pwned.com' => 'users/3'

With this change implemented, our pseudo-go-code to look up alice@security.com becomes:

primaryKey := kv.Get([]byte("users_email_idx/alice@security.com")) row := kv.Get(primaryKey)

As one might imagine, this indexing does not come for free, as it both takes up more space on disk, and means we have to insert two keys for every one row we want to write in a table.

The same concept we used to create a single-column secondary index applies to the creation of composite indexes - mapping the SQL CREATE UNIQUE INDEX users_username_email_idx ON users(username, email) to keys like 'users_username_email_idx/eve/eve@pwned.com' => 'users/3' .

The above techniques allow a relatively powerful mapping of a simple relational database onto a key-value store, while not holding us back as we pave the way for more complex features ( JOIN operations, “real” SQL parsing, executing, and planning, etc.).

60 Kudos