What is atbr?

A large-scale, low-latency in-memory key-value store for Python.

Why atbr?

1) Modern boxes have 10-100s of gigabytes of RAM
2) Gigabyte++-size Python dictionaries are slow to fill
3) Gigabyte++-size dictionaries are fun to use
4) atbr is fast (in particular at loading from file)

What is atbr built with?

c++ (heavy lifting), python (apis/websocket), swig (glue). Python libraries: tornado (http/websocket server), boto (Amazon Web Services API), zc.zk (zookeeper), websocket-client. C++ libraries: Google's sparsehash.

install

Run the following to install atbr (including its dependencies)

$ cat INSTALL.sh # to see what it does
$ chmod +x ./INSTALL.sh && sudo ./INSTALL.sh

(note: on Mac, run python setup-mac.py install afterwards)

INSTALL.sh basically does this:

$ sudo apt-get install libboost-dev python-setuptools swig* python-dev -y
$ sudo pip install -r requirements.txt # or under virtualenv
$ wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
$ tar -zxvf sparsehash-2.0.2.tar.gz
$ cd sparsehash-2.0.2
$ ./configure && make && sudo make install
$ sudo python setup.py install # or under virtualenv

python-api example

import atbr.atbr

# Create storage
mystore = atbr.atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")
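The load() call above expects a tab-separated file. As an illustration (the exact on-disk format is an assumption based on the .tsv extension; check it against your own data), a small input file can be generated and round-tripped with plain Python:

```python
# Generate a small tab-separated key-value file for mystore.load().
# Assumption: one "key<TAB>value" pair per line, as the .tsv
# extension suggests -- this sketch is not part of atbr itself.
pairs = {"key1": "value1", "key2": "value2", "key3": "value3"}

with open("keyvaluedata.tsv", "w") as f:
    for key, value in pairs.items():
        f.write("%s\t%s\n" % (key, value))

# Read it back to verify the format round-trips.
loaded = {}
with open("keyvaluedata.tsv") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t", 1)
        loaded[key] = value

print(loaded["key1"])  # -> value1
```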

benchmark (loading)

Input for the benchmark was output from a small Hadoop (mapreduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based Thinkpad X200 with an SSD drive.

$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
$ wc medium.tsv
  212969  5835001 117362571 medium.tsv
$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
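The reported figures are easy to sanity-check from the file size (ls) and line count (wc) above. A quick back-of-the-envelope check (the throughput figure comes out slightly different from the README's, presumably because the printed load time is rounded):

```python
# Sanity-check the benchmark output from the ls/wc figures above.
num_pairs = 212969      # lines in medium.tsv (from wc)
num_bytes = 117362571   # file size in bytes (from ls -al)
seconds = 1.178468      # reported load time

pairs_per_second = num_pairs / float(seconds)
mb_per_second = num_bytes / float(seconds) / (1024 * 1024)

print(pairs_per_second)  # ~180716.8, matching the reported Speed line
print(mb_per_second)     # ~95 MB/s, close to the reported Throughput
```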

atbr http and websocket server

atbr can also run as a server (default port is 8888), supporting both HTTP and websocket APIs.

Start server:

$ cd atbserver ; python atbr_server.py

HTTP API

Load tsv-file data with http

$ curl http://localhost:8888/load/keyvaluedata.tsv

Get value for key = 'key1'

$ curl http://localhost:8888/get/key/key1

Add key, value pair key='foo', value='bar'

$ curl http://localhost:8888/put/key/foo/value/bar
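The same endpoints can be called from any HTTP client. A minimal sketch that builds the URLs used in the curl examples above (the helper names load_url/get_url/put_url are our own, not part of atbr; fetch the URLs with urllib, requests, or curl against a running server):

```python
# Illustrative URL builders for the atbr HTTP API shown above.
# The paths mirror the curl examples; the helpers are hypothetical.
BASE = "http://localhost:8888"

def load_url(filename):
    # Load a tsv file server-side
    return "%s/load/%s" % (BASE, filename)

def get_url(key):
    # Get the value for a key
    return "%s/get/key/%s" % (BASE, key)

def put_url(key, value):
    # Add a key-value pair
    return "%s/put/key/%s/value/%s" % (BASE, key, value)

print(get_url("key1"))  # -> http://localhost:8888/get/key/key1
```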

Websocket API

Example that loads keyvaluedata.tsv using the websocket load API

$ python websocket_cmdline_client.py keyvaluedata.tsv

websocket client code

import sys
from websocket import create_connection

ws = create_connection("ws://localhost:8888/loadws/")
# e.g. sys.argv[1] could be 'keyvaluedata.tsv'
ws.send(sys.argv[1])
result = ws.recv()
ws.close()
print result

Sharded Websocket Mode

Example with 3 shards on localhost

$ python atbr_server.py 8585 shard_data_1.tsv
$ python atbr_server.py 8686 shard_data_2.tsv
$ python atbr_server.py 8787 shard_data_3.tsv
$ python atbr_shard_server.py localhost:8585 localhost:8686 localhost:8787
$ python atbr_websocket_cmdline_client.py key1
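One common way a shard server can route a key to one of the three shards above is to hash the key modulo the number of shards. This is an illustrative sketch of that scheme, not necessarily what atbr_shard_server.py actually does:

```python
import hashlib

# Illustrative key-to-shard routing: hash the key and take the digest
# modulo the number of shards. Hypothetical sketch, not atbr's code.
shards = ["localhost:8585", "localhost:8686", "localhost:8787"]

def shard_for(key, shards):
    # md5 gives a stable hash across processes (unlike Python's hash())
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# The same key always routes to the same shard.
print(shard_for("key1", shards))
```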

Cost of running atbr on EC2

Start several atbr servers and finally one (or several) atbr shard servers to talk to them.

atbr runs in-memory, and the cost of running e.g. an Amazon EC2 instance with 68.4GB of RAM is $1.80/hour. Assuming the node has roughly 65GB available (after OS components are loaded), this gives a gigabyte-hour cost of about $0.027 and a terabyte-hour cost of roughly 1000*0.027 = $27, i.e. on the order of $20000 per month. Since atbr is designed to hold only JSON keys and values with metadata, and metadata can contain pointers to larger objects in disk-based storage (e.g. AWS S3), a terabyte in-memory brings you very far.
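The estimate above can be reproduced directly; the 65GB usable-memory figure and the 30-day month are the assumptions stated in the text:

```python
# Reproduce the EC2 cost estimate above.
instance_cost_per_hour = 1.80  # USD, for a 68.4GB RAM instance
usable_gb = 65.0               # assumed available after the OS is loaded

gb_hour_cost = instance_cost_per_hour / usable_gb  # ~ $0.028 per GB-hour
tb_hour_cost = 1000 * gb_hour_cost                 # ~ $27.7 per TB-hour
monthly_cost = tb_hour_cost * 24 * 30              # ~ $20000 per TB-month

print(round(gb_hour_cost, 3))
print(round(tb_hour_cost, 2))
print(round(monthly_cost))
```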

What type of storage datastructure is used in atbr?

Currently:

default: Google's sparsehash library
Google's dense_hash_map (from the same sparsehash package)
C++/STL unordered_map

Roadmap

Increased concurrency and threadsafety support

Simplified sharded deployment with fabric

More benchmarks and comparison with other storage alternative (e.g. HBase, Redis, Cassandra)

More end-to-end examples (from mapreduce jobs to serving)

Objective C and Lua support

Lua-based map(reduce) on running atbr instances

Google's sparsehash is the default because it is highly memory-efficient; see this benchmark for more. Other efficient C++-based data structures will be supported in later versions.

Who develops and supports atbr?

atbr is developed and supported by Amund Tveit (amund (at) atbrox (dot) com), Atbrox.

Forking the project on GitHub?