Over the past few days I've been working on Haskell bindings to Tokyo Cabinet. Tokyo Cabinet is a high-performance (non-relational) database that is simply composed of records: just keys and values. There are several ways of organizing the records in the database file, including a hash table, a fixed-length array and a B+ tree. It's a pretty cool project, I think.

Part of the reason I'm writing this binding is to get more familiar with the Haskell FFI; the other reason is so I can play with Tokyo Cabinet while using Haskell. :)

Currently, I don't have a lot of the API covered. There is the Tokyo Cabinet utility API, and each internal representation used for a database (B+ tree/hash/array) has its own API (although they are designed to be extremely similar in functionality, naming convention, etc.). There is also the abstract API (an interface that allows you to interact with all the various types of databases using a uniform API) and two extensions, Tokyo Tyrant and Tokyo Dystopia (more on those in a bit).

Excluding Tyrant and Dystopia (that is, counting only the utility API and the core database APIs), I would say I have somewhere around 15% of it done. The utility API is by far the biggest and will take the longest; the hash API is the only one that even has functions for manipulating a database right now.

We can have super-fun contrived examples though!

Here's the C code:

    #include <tcutil.h>
    #include <tchdb.h>
    #include <stdio.h>   /* printf and fprintf */
    #include <stdlib.h>
    #include <stdbool.h>
    #include <stdint.h>

    int main(int argc, char **argv){
      TCHDB *hdb;
      int ecode;
      char *key, *value;

      /* create the object */
      hdb = tchdbnew();

      /* open the database */
      if(!tchdbopen(hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){
        ecode = tchdbecode(hdb);
        fprintf(stderr, "open error: %s\n", tchdberrmsg(ecode));
      }

      /* store records */
      if(!tchdbput2(hdb, "foo", "hop") ||
         !tchdbput2(hdb, "bar", "step") ||
         !tchdbput2(hdb, "baz", "jump")){
        ecode = tchdbecode(hdb);
        fprintf(stderr, "put error: %s\n", tchdberrmsg(ecode));
      }

      /* retrieve records */
      value = tchdbget2(hdb, "foo");
      if(value){
        printf("%s\n", value);
        free(value);
      } else {
        ecode = tchdbecode(hdb);
        fprintf(stderr, "get error: %s\n", tchdberrmsg(ecode));
      }

      /* traverse records */
      tchdbiterinit(hdb);
      while((key = tchdbiternext2(hdb)) != NULL){
        value = tchdbget2(hdb, key);
        if(value){
          printf("%s:%s\n", key, value);
          free(value);
        }
        free(key);
      }

      /* close the database */
      if(!tchdbclose(hdb)){
        ecode = tchdbecode(hdb);
        fprintf(stderr, "close error: %s\n", tchdberrmsg(ecode));
      }

      /* delete the object */
      tchdbdel(hdb);

      return 0;
    }

And the Haskell code:

    module Main where

    import Database.TokyoCabinet.Hash
    import Database.TokyoCabinet.Util
    import Control.Monad
    import System.Exit
    import Data.Bits
    import Foreign

    main = do
      -- open
      hdb <- tcHdbNew
      b <- tcHdbOpen hdb "casket.hdb" (tcHDBOWRITER .|. tcHDBOCREAT)
      when (not b) $ panic $ do
        e <- tcHdbEcode hdb
        putStrLn $ "open err: " ++ show e

      -- store records
      b1 <- tcHdbPut2 hdb "foo" "hop"
      b2 <- tcHdbPut2 hdb "bar" "step"
      b3 <- tcHdbPut2 hdb "baz" "jump"
      when (not b1 || not b2 || not b3) $ panic $ do
        e <- tcHdbEcode hdb
        putStrLn $ "put2 err: " ++ show e

      v <- tcHdbGet2 hdb "foo"
      b4 <- tcHdbEcode hdb
      if (b4 == 0)
        then print v
        else panic (putStrLn $ "get2 err: " ++ show b4)

      tcHdbClose hdb
      tcHdbDel hdb
      return ()

    panic f = f >> (exitWith $ ExitFailure (-1))

Running it:

    $ ghc --make -L$HOME/lib tchdb1.hs
    [1 of 1] Compiling Main             ( tchdb1.hs, tchdb1.o )
    Linking tchdb1 ...
    $ ./tchdb1
    "hop"
    $

The example is taken from the Tokyo Cabinet API specification (fwiw, we are using the hash API here).

A few things to note:

We don't traverse records (yet). This is in part because I have not filled out the API; the other part has to do with the fact that I am experimenting with Oleg-style enumerators as a potential interface and am hacking some things up. Oleg's enumerator stuff is some of the more down-to-earth things he's come up with, and if nothing else I can rip it out later. I think it would be an interesting experiment though.
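To make the enumerator idea concrete, here is a minimal sketch of a left-fold enumerator over records. Everything in it is hypothetical: `Cursor`, `iterInit`, `iterNext` and `enumRecords` are stand-ins invented for illustration (a real version would presumably sit on top of tchdbiterinit and tchdbiternext2), and the cursor just walks an in-memory list so the sketch runs without Tokyo Cabinet installed.

```haskell
import Data.IORef

-- Hypothetical stand-in for a database cursor; a real binding would wrap
-- tchdbiterinit / tchdbiternext2 instead of walking a list.
type Cursor = IORef [(String, String)]

iterInit :: [(String, String)] -> IO Cursor
iterInit = newIORef

-- Returns the next record, or Nothing once the cursor is exhausted.
iterNext :: Cursor -> IO (Maybe (String, String))
iterNext cur = do
  rs <- readIORef cur
  case rs of
    []         -> return Nothing
    (r : rest) -> writeIORef cur rest >> return (Just r)

-- Left-fold enumerator: the step function returns Left to stop early with a
-- final seed, or Right to continue with an updated seed.
enumRecords :: Cursor
            -> (seed -> (String, String) -> IO (Either seed seed))
            -> seed
            -> IO seed
enumRecords cur step = go
  where
    go seed = do
      next <- iterNext cur
      case next of
        Nothing -> return seed
        Just kv -> do
          r <- step seed kv
          case r of
            Left done  -> return done
            Right more -> go more

main :: IO ()
main = do
  cur <- iterInit [("foo", "hop"), ("bar", "step"), ("baz", "jump")]
  -- Collect keys, stopping as soon as "bar" has been seen.
  keys <- enumRecords cur collect []
  print (reverse keys)
  where
    collect acc (k, _)
      | k == "bar" = return (Left (k : acc))
      | otherwise  = return (Right (k : acc))
```

The appeal of this shape is that the enumerator owns the cursor, so the caller never sees the raw iteration state and early termination is just a `Left`.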

Currently we have to check error codes if we want to know whether operations such as tcHdbGet2 succeeded or failed. It is probably in the best interests of most Haskellers to abstract this away, likely with an Either or somesuch.
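One way the Either idea could look is sketched below. This is not the binding's API: `Hdb`, `rawGet`, `rawEcode`, `TCError` and `checked` are hypothetical names, the raw lookup is assumed to return a Maybe (the current tcHdbGet2 does not), and the error code 22 is an arbitrary nonzero value chosen for the sketch. The database is faked with an association list so the code is self-contained.

```haskell
import Data.IORef

-- Hypothetical stand-in for a database handle: some records plus the last
-- error code, mimicking the tcHdbGet2 / tcHdbEcode pair from the post.
data Hdb = Hdb { records :: [(String, String)], lastErr :: IORef Int }

newtype TCError = TCError Int deriving (Eq, Show)

-- Mimics a raw call: returns Nothing and records an error code on a miss.
-- 22 is an arbitrary nonzero code used only for this sketch.
rawGet :: Hdb -> String -> IO (Maybe String)
rawGet db k = case lookup k (records db) of
  Just v  -> writeIORef (lastErr db) 0 >> return (Just v)
  Nothing -> writeIORef (lastErr db) 22 >> return Nothing

rawEcode :: Hdb -> IO Int
rawEcode = readIORef . lastErr

-- Wrap a raw call so the caller gets an Either instead of having to
-- remember to check the error code afterwards.
checked :: Hdb -> IO (Maybe a) -> IO (Either TCError a)
checked db act = do
  r <- act
  case r of
    Just v  -> return (Right v)
    Nothing -> fmap (Left . TCError) (rawEcode db)

get :: Hdb -> String -> IO (Either TCError String)
get db k = checked db (rawGet db k)

main :: IO ()
main = do
  ref <- newIORef 0
  let db = Hdb [("foo", "hop")] ref
  get db "foo"  >>= print
  get db "nope" >>= print
```

Because `checked` takes the raw action as an argument, the same wrapper could be reused for every call that reports failure through the error code.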

We have to delete the database object manually after we close it; that is because right now I am using Ptrs to represent these databases. Very shortly I am probably going to move the hash API over to ForeignPtrs so we can have finalizers that free the DB object in memory (via tcHdbDel), although you will always be responsible for closing it before that happens. (The utility API deals with allocated memory all over the place in the various things it provides, such as extensible strings and array lists, so the util module already uses ForeignPtrs heavily.)
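The move from Ptr to ForeignPtr might look roughly like the sketch below. It is an illustration under assumptions, not the planned implementation: `fakeNew` and `fakeDel` are hypothetical stand-ins for tcHdbNew and tcHdbDel that just malloc and free a buffer, and the finalizer is attached with Foreign.Concurrent.newForeignPtr so it can be an ordinary IO action. A real binding would more likely import the address of tchdbdel and pass it to Foreign.ForeignPtr.newForeignPtr.

```haskell
import Data.IORef
import Data.Word (Word8)
import qualified Foreign.Concurrent as FC
import Foreign.ForeignPtr (ForeignPtr, finalizeForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

-- Hypothetical stand-ins for tcHdbNew / tcHdbDel: allocate and free a raw
-- buffer so the sketch runs without Tokyo Cabinet installed.
fakeNew :: IO (Ptr Word8)
fakeNew = mallocBytes 16

-- Frees the buffer and bumps a counter so we can observe the finalizer.
fakeDel :: IORef Int -> Ptr Word8 -> IO ()
fakeDel counter p = free p >> modifyIORef counter (+ 1)

-- Wrap the raw pointer so the "delete" call runs as a finalizer.
newManaged :: IORef Int -> IO (ForeignPtr Word8)
newManaged counter = do
  p <- fakeNew
  FC.newForeignPtr p (fakeDel counter p)

main :: IO ()
main = do
  freed <- newIORef 0
  db <- newManaged freed
  -- All access goes through withForeignPtr, which keeps the object alive.
  withForeignPtr db $ \p -> poke p 42 >> peek p >>= print
  -- Normally the GC decides when the finalizer runs; force it here so the
  -- effect is visible.
  finalizeForeignPtr db
  readIORef freed >>= print
```

Note that this only covers freeing the object; closing the database would still be an explicit call, as described above.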

It is not done, not by a longshot.

Other things to take into consideration:

Performance. As of right now there are a lot of conversions between CStrings and regular Strings which, as everybody should know by now, are quite slow. This is one of the bigger things on my mind; perhaps going from CStrings to ByteStrings instead would be a better choice? How could we accomplish this if so? Data.ByteString.Internal hackery? Benchmarks and stress testing will be necessary before drawing any major conclusions; the FFI could end up being the major bottleneck for all I know.
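On the ByteString question, the public bytestring API may already be enough, without reaching into Data.ByteString.Internal. The sketch below uses useAsCString to hand a ByteString to C and unsafePackMallocCString to take ownership of a malloc'd result. The assumptions: `fakeGet` is a hypothetical stand-in for a call like tchdbget2 that returns a malloc'd, NUL-terminated string, libc's strlen stands in for any C function taking a char*, and the bytestring package that ships with GHC is available. Whether this is actually faster for the binding is untested.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B
import Data.ByteString.Unsafe (unsafePackMallocCString)
import Foreign.C.String (CString, newCString)
import Foreign.C.Types (CSize (..))

-- libc's strlen stands in for any C function that takes a char*.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Hypothetical stand-in for a call like tchdbget2: returns a malloc'd,
-- NUL-terminated string that the caller owns (newCString uses malloc).
fakeGet :: String -> IO CString
fakeGet = newCString

-- Pass a ByteString key to C without going through String.
-- useAsCString makes a temporary NUL-terminated copy for the call.
keyLength :: BS.ByteString -> IO Int
keyLength key = BS.useAsCString key (fmap fromIntegral . c_strlen)

-- Take ownership of the malloc'd result: no copy is made, and the buffer
-- is released with free when the ByteString is garbage collected.
getBS :: String -> IO BS.ByteString
getBS v = fakeGet v >>= unsafePackMallocCString

main :: IO ()
main = do
  keyLength (B.pack "foo") >>= print
  getBS "hop" >>= B.putStrLn
```

The trade-off is that useAsCString still copies on the way in; only the result side avoids a copy.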

The abstract API: should it be included? Should it be the only API available at all, or would it be more appropriate to only fill out bindings to the various databases and abstract over them with a type class? This is a design issue more than anything, and I'd like to hear the opinions of other Haskellers out there.
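For the type class option, a minimal sketch of what the abstraction could look like follows. The class `Database`, its methods `put` and `get`, and the `MemHash` backend are all hypothetical names made up for illustration; `MemHash` is an in-memory association list standing in for a real hash database so the example runs on its own.

```haskell
import Data.IORef

-- Hypothetical class; the method names are illustrative and not part of
-- the binding.
class Database db where
  put :: db -> String -> String -> IO Bool
  get :: db -> String -> IO (Maybe String)

-- In-memory stand-in for a hash database backend.
newtype MemHash = MemHash (IORef [(String, String)])

instance Database MemHash where
  -- Overwrite any existing value for the key, as a plain "put" would.
  put (MemHash ref) k v = do
    modifyIORef ref (\kvs -> (k, v) : filter ((/= k) . fst) kvs)
    return True
  get (MemHash ref) k = fmap (lookup k) (readIORef ref)

-- Code written against the class works for any backend with an instance.
storeAndFetch :: Database db => db -> IO (Maybe String)
storeAndFetch db = do
  _ <- put db "foo" "hop"
  get db "foo"

main :: IO ()
main = do
  db <- fmap MemHash (newIORef [])
  storeAndFetch db >>= print
```

With this design each database type keeps its own specific functions, and the class only captures the operations they share, which is roughly the role the abstract API plays on the C side.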

Tokyo Tyrant. This will probably have to come in a later release, but essentially Tokyo Tyrant is an interface to Tokyo Cabinet databases over the network. Currently it only runs on Linux though, so if someone wanted this, they would have to add and test it themselves.

Tokyo Dystopia. Again, this might come in a later release, but Dystopia is a full-text search system built on top of Tokyo Cabinet.

UTF-8. I am very, very, very stupid when it comes to just how character encodings work, and comments here would be greatly appreciated so I can learn more about character encodings and the whole nine yards (the way I see it, most people are of the opinion that something either accepts Unicode or it is bullshit). From the looks of the Tokyo Cabinet site, UTF-8 largely comes into play when dealing with Dystopia. utf8-string should hopefully make this easy.
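As a small illustration of why the encoding matters at the FFI boundary: a Haskell String is a list of Unicode code points, while the C side only ever sees bytes. The sketch below does not use utf8-string; it swaps in GHC.Foreign from base (available in recent GHC versions), which marshals with an explicit encoding. `utf8Length` and `roundTrip` are hypothetical helper names.

```haskell
import qualified GHC.Foreign as GF
import System.IO (utf8)

-- Number of bytes a String occupies once encoded as UTF-8.
utf8Length :: String -> IO Int
utf8Length s = GF.withCStringLen utf8 s (return . snd)

-- Encode to a UTF-8 C string and decode it back again.
roundTrip :: String -> IO String
roundTrip s = GF.withCString utf8 s (GF.peekCString utf8)

main :: IO ()
main = do
  let s = "h\233llo"            -- "hello" with an e-acute: 5 characters
  print (length s)
  utf8Length s >>= print        -- the e-acute takes 2 bytes in UTF-8
  roundTrip s >>= print . (== s)
```

The practical consequence is that character counts and byte counts differ, so any length passed to C has to be the encoded byte length, not `length` of the String.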

If you would like to examine the code in its current, very primitive form, you can clone it:

$ git clone git://github.com/thoughtpolice/haskell-tokyocabinet.git

Or you can check it out online:

http://github.com/thoughtpolice/haskell-tokyocabinet/tree/master

I'll post more on this in the coming week or so if I can make headway on the Oleg-ian stuff or something else significant. Comments on the spots I'm wary of are super duper ultra great and welcome.