Large-scale in-memory key-value stores are broadly useful (e.g. to load and serve tsv data created by hadoop/mapreduce jobs): they have low latency, and modern boxes have lots of memory (e.g. EC2 instances with 70GB RAM). If you look closely, many of the nosql stores depend heavily on large amounts of RAM to perform well, so going to pure in-memory storage is a natural evolution.

Scratching the itch

Python is currently undergoing a “new spring”, with many startups using it as a key language (Dropbox, Instagram, Path and Quora, to name a few prominent ones). They have probably also discovered that loading a lot of data into python dictionaries is no fun; this is also the finding of this large-scale hashtable benchmark. The winner of that benchmark with respect to memory efficiency was Google’s opensource sparsehash project, and atbr is basically a thin swig wrapper around sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key-value files (tab-separated files), since quickly loading mapreduce output data is one of our main use cases.
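To get a feel for why plain python dictionaries are "no fun" for large data sets, here is a minimal sketch that measures how much memory a dict of short string pairs allocates compared to the raw payload. It uses the standard-library tracemalloc module (Python 3); the function name dict_memory_bytes and the synthetic "key%d"/"value%d" data are made up for illustration and are not part of atbr or of the benchmark mentioned above.

```python
import tracemalloc


def dict_memory_bytes(num_pairs):
    """Return the bytes allocated while building a dict of num_pairs string pairs."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    # Synthetic data: short string keys and values (an assumption, not real job output)
    data = {"key%d" % i: "value%d" % i for i in range(num_pairs)}
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(data) == num_pairs
    return after - before


def payload_bytes(num_pairs):
    """Return the number of characters actually stored as keys and values."""
    return sum(len("key%d" % i) + len("value%d" % i) for i in range(num_pairs))


if __name__ == "__main__":
    n = 100000
    used = dict_memory_bytes(n)
    payload = payload_bytes(n)
    print("payload bytes:   %d" % payload)
    print("allocated bytes: %d" % used)
    print("overhead factor: %.1f" % (used / payload))
```

The exact overhead factor depends on the Python version and on key and value lengths, so treat the output as indicative only.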

prerequisites:



a) install google sparsehash (and densehash)

    wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
    tar -zxvf sparsehash-2.0.2.tar.gz
    cd sparsehash-2.0.2
    ./configure && make && make install

b) install swig

c) compile atbr

make # creates _atbr.so and atbr.py ready to be used from python

python-api example

    import atbr

    # Create storage
    mystore = atbr.Atbr()

    # Load data
    mystore.load("keyvaluedata.tsv")

    # Number of key value pairs
    print(mystore.size())

    # Get value corresponding to key
    print(mystore.get("key1"))

    # Return true if a key exists
    print(mystore.exists("key1"))
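To try out the calls above without compiling anything, the following sketch provides a plain-dict stand-in that mirrors the same method names (load, size, get, exists). DictStore and write_sample_tsv are hypothetical names invented here; this is not atbr's implementation and it has none of sparsehash's memory benefits. How atbr treats missing keys or malformed lines is not described in this post, so returning None and skipping lines without a tab are assumptions.

```python
import os
import tempfile


class DictStore(object):
    """Hypothetical dict-backed stand-in with the same method names as above."""

    def __init__(self):
        self._data = {}

    def load(self, path):
        """Read 'key<TAB>value' lines from path; return the number of new keys."""
        added = 0
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                key, sep, value = line.partition("\t")
                if not sep:
                    continue  # assumption: skip lines that contain no tab
                if key not in self._data:
                    added += 1
                self._data[key] = value
        return added

    def size(self):
        return len(self._data)

    def get(self, key):
        # Assumption: None for missing keys; atbr's behaviour may differ
        return self._data.get(key)

    def exists(self, key):
        return key in self._data


def write_sample_tsv(path):
    """Write a tiny tab-separated key-value file for experimenting."""
    with open(path, "w") as f:
        f.write("key1\tvalue1\n")
        f.write('key2\t{"a": 1}\n')


if __name__ == "__main__":
    sample = os.path.join(tempfile.mkdtemp(), "keyvaluedata.tsv")
    write_sample_tsv(sample)
    mystore = DictStore()
    mystore.load(sample)
    print(mystore.size())
    print(mystore.get("key1"))
    print(mystore.exists("key1"))
```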

benchmark (loading)

Input for the benchmark was the output of a small Hadoop (mapreduce) job that generated key-value pairs where both the key and the value were json. The benchmark was run on an Ubuntu-based Thinkpad x200 with an SSD drive.

    $ ls -al medium.tsv
    -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv

    $ wc medium.tsv
    212969 5835001 117362571 medium.tsv

    $ python
    >>> import atbr
    >>> a = atbr.Atbr()
    >>> a.load("medium.tsv")
    Inserting took - 1.178468 seconds
    Num new key-value pairs = 212969
    Speed: 180716.807959 key-value pairs per second
    Throughput: 94.803214 MB per second
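The reported figures can be cross-checked from the file size, line count and load time shown above. The short calculation below is only a sanity check. The speed figure follows directly from pairs divided by seconds. For the throughput, dividing the full file size by the time gives a slightly higher number than the one printed; one reading that reproduces the printed value is to count the bytes minus one separator byte per line and to use 1 MB = 1024 * 1024 bytes. That reading is a guess based on the arithmetic, not something taken from the atbr source.

```python
num_pairs = 212969        # lines reported by wc
num_bytes = 117362571     # file size reported by ls and wc
seconds = 1.178468        # load time reported by atbr

MB = 1024 * 1024          # assumption: "MB" means 2**20 bytes

# Key-value pairs loaded per second
pairs_per_second = num_pairs / seconds

# Throughput using every byte in the file
mb_per_second_all = num_bytes / seconds / MB

# Throughput when one byte per line is excluded (guessed interpretation)
mb_per_second_minus_one = (num_bytes - num_pairs) / seconds / MB

print("pairs per second:            %.1f" % pairs_per_second)
print("MB per second (all bytes):   %.3f" % mb_per_second_all)
print("MB per second (minus 1/line): %.3f" % mb_per_second_minus_one)
```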

Possible road ahead?

1) integrate with tornado, to get websocket and http API

2) after 1) – add support for sharding, e.g. using Apache Zookeeper to control the shards.
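As a rough illustration of what sharding could look like, the sketch below routes each key to one of several in-process dictionaries using a checksum of the key. Nothing here exists in atbr: shard_for_key and ShardedStore are hypothetical names, the dictionaries stand in for separate atbr instances, and coordination through Zookeeper is not shown at all.

```python
import zlib


def shard_for_key(key, num_shards):
    """Map a string key to a shard index in the range [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # crc32 is stable across runs and machines, unlike Python's built-in hash()
    checksum = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return checksum % num_shards


class ShardedStore(object):
    """Hypothetical router; each dict stands in for one store instance."""

    def __init__(self, num_shards):
        self._shards = [{} for _ in range(num_shards)]

    def _shard(self, key):
        return self._shards[shard_for_key(key, len(self._shards))]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]


if __name__ == "__main__":
    store = ShardedStore(4)
    for i in range(1000):
        store.put("key%d" % i, "value%d" % i)
    print(store.shard_sizes())
    print(store.get("key42"))
```

Simple modulo routing like this moves most keys to a different shard whenever the number of shards changes; consistent hashing is a common alternative when shards are expected to come and go.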

Where can I find the code?



https://github.com/atbrox/atbr

Best regards,

Amund Tveit

Atbrox