Building a Redis clone in Haskell

In this post, we will attempt to make a simplified clone of Redis in Haskell. Here is a set of requirements that we will aim to fullfill:

get and set operations

Multi-threaded

Atomic

Redis compatible (implement the Redis protocol)

We should be able to use the redis-cli tool to connect to our server and issue commands to it.

We are going to omit many features that Redis has. For example, there will be no disk persistence. We will accomplish this with about 100 lines of Haskell.

Getting started: stack

We are going to use stack to build our project. Stack is a new build tool for Haskell projects. We can also use it to create all the necessary files that make up a Haskell project. You can find the installation instructions here.

Let’s create our project. We will call our server Redish.

$ stack new Redish simple

This will create a directory Redish/ with a few files in it.

Redish/ LICENSE README.md Redish.cabal Setup.hs src/ Main.hs stack.yaml

You can use stack to build this project and run it:

$ stack build Redish-0.1.0.0: configure Configuring Redish-0.1.0.0... Redish-0.1.0.0: build Preprocessing executable 'Redish' for Redish-0.1.0.0... [1 of 1] Compiling Main ( src/Main.hs, .stack-work/dist/x86_64-osx/Cabal-1.22.4.0/build/Redish/Redish-tmp/Main.o ) Linking .stack-work/dist/x86_64-osx/Cabal-1.22.4.0/build/Redish/Redish ... Reidhs-0.1.0.0: install Installing executable(s) in /Users/<user>/<dirs>/Redish/.stack-work/install/x86_64-osx/lts-3.2/7.10.2/bin $ stack exec Redish hello world

Types

Let’s start by defining our types. Redish is an in-memory database so we will need a representation of our database. For a simple key-value store, all that we need is a simple map. Let’s create a few aliases.

type Value = ByteString type Key = ByteString type DB = Map Key Value

Next, we will need to represent the commands that our server knows how to handle. The command data structure can be a get , a set or unknown.

data Command = Get Key | Set Key Value | Unknown deriving ( Eq , Show )

Software transactional memory

Now that we have our types in places, we need to write a few functions to operate on them. We need a way to insert data and to query our database.

First things first though. Since by default everything in Haskell is immutable, how can we change the value of our in-memory database? We can’t simply overwrite the old value with the new one. The compiler won’t let us. The answer is software transactional memory, or STM for short.

STM allows us to atomically change a value in our program. The atomic part is important. Many parts of the code can update this value and we have no way of knowing when and how often they might do so. STM allows us to perform atomic updates. This way any updates to our database will be run sequence even when coming from different threads. The only cost is that we have to perform any updates within the context of IO .

Our DB type will be become TVar DB . The TVar type represents the mutable reference. Next, let’s create the initial value in the main function:

main :: IO () main = do database <- atomically $ newTVar $ fromList [( "__version__" , "0.1.0" )]

This will create a Map with a key of __version__ which has the value 0.1.0 . Then, it wraps that Map in a TVar and atomically assigns it to the database variable. Each time we want to write or read the database value, we have to use IO . Let’s create a helper for atomically reading this value:

atomRead :: TVar a -> IO a atomRead = atomically . readTVar

And let’s make a function to update a value in the database. This takes a function that does the updating and runs it through the STM machinery.

updateValue :: ( DB -> DB ) -> TVar DB -> IO () updateValue fn x = atomically $ modifyTVar x fn

Reply parsing

Next, let’s talk about the Redis protocol. It’s a simple TCP scheme that looks like this:

*2\r

$3\r

get\r

$4\r

name

It’s a bunch of keywords and arguments separated by newlines. If we clean it up and break each thing to its own line, we get:

*2 $3 get $4 name

Let’s look at each line. *2 says to expect a command that has two arguments. $3 says that the first argument is three characters long. get is the three-character argument from above. $4 is the length of the second argument, and name is the value of the second argument. If you’re in the REPL provided by redis-cli , and you type get name , Redis will translate those two words into the above representation. A set command would look like this:

*3 $3 set $4 name $5 honza

This is what will be sent when you run set name honza .

This multi-argument message is called multibulk in the Redis documentation. There are two other data types that Redis uses that will interest us: the OK and the error.

When Redis needs to tell you that it accepted request and everything went smoothly, it simply responds with +OK . When Redis needs to indicate an error, it replies with -ERR something went wrong (where “something went wrong” is the message).

This format is very simple and actually very effective. When we listen on a socket for incoming messages, we have a look at the very first character. + tells us that it’s OK, - signals and error, and * tells us to keep reading for commands. We incrementally read from the socket only as much data as the protocol tells us.

In this section, we will write a parser for multibulk messages. We will use the amazing attoparsec library for this.

The following code is heavily influenced by the Hedis library. Credit goes to Falko Peters. Thanks!

A multibulk message is called a reply in Redis lingo. Let’s make a type for it.

data Reply = Bulk ( Maybe ByteString ) | MultiBulk ( Maybe [ Reply ]) deriving ( Eq , Show )

A Bulk reply is a simple string like get or name above. MultiBulk is the whole message. Let’s also write a function that attempts to convert a Reply to a Command .

parseReply :: Reply -> Maybe Command parseReply ( MultiBulk ( Just xs)) = case xs of [ Bulk ( Just "get" ), Bulk ( Just a)] -> Just $ Get a [ Bulk ( Just "set" ), Bulk ( Just a), Bulk ( Just b)] -> Just $ Set a b _ -> Just Unknown parseReply _ = Nothing

Next, let’s use attoparsec to write a parser for the Reply data type.

replyParser :: Parser Reply replyParser = choice [bulk, multiBulk]

Our replyParser will try to match either a bulk or a multiBulk . Let’s implement those:

bulk :: Parser Reply bulk = Bulk <$> do len <- char '$' *> signed decimal <* endOfLine if len < 0 then return Nothing else Just <$> take len <* endOfLine multiBulk :: Parser Reply multiBulk = MultiBulk <$> do len <- char '*' *> signed decimal <* endOfLine if len < 0 then return Nothing else Just <$> count len replyParser

First, the parsers look at the first character to see what kind of message it is. If it starts with a $ , it’s a bulk. If it starts with a * , it’s multibulk. Then, it reads as many characters from the input as the length indicator said. In the case of multibulk, it recurses because it can contain bulk messages.

You can now run:

> parse replyParser "*2 \r

$3 \r

get \r

$4 \r

name \r

" > ( MultiBulk ( Just [( Bulk ( Just "get" )), ( Bulk ( Just "name" ))]))

Networking

At this point, we have our data structures ready and we know how to parse incoming requests into them. Now we need to work on the networking part. Let’s teach our program how to listen on a socket and parse incoming text into something useful.

Let’s change our main function to this:

main :: IO () main = withSocketsDo $ do database <- atomically $ newTVar $ fromList [( "__version__" , version)] sock <- listenOn $ PortNumber 7777 putStrLn "Listening on localhost 7777" sockHandler sock database

This is pretty straight-forward. Ask for a socket and then listen on it. When something happens on the socket, run the function socketHandler . Let’s implement that next:

sockHandler :: Socket -> TVar DB -> IO () sockHandler sock db = do (handle, _ , _ ) <- accept sock hSetBuffering handle NoBuffering hSetBinaryMode handle True _ <- forkIO $ commandProcessor handle db sockHandler sock db

Given a socket and a reference to a mutable database, we can get a handle and start processing requests. For each new connection, run forkIO which will do all this work of parsing and responding on a new lightweight thread. At the end, we simply recurse to accept new work. The commandProcessor function does the heavy lifting here, so let’s write that next.

commandProcessor :: Handle -> TVar DB -> IO () commandProcessor handle db = do reply <- hGetReplies handle replyParser let command = parseReply reply runCommand handle command db commandProcessor handle db

This function runs the replyParser we wrote earlier. It uses a very clever function called hGetReplies which we will look at in a minute. It will read as much data as necessary from the handle to get an instance of Reply . We then convert that reply to a command and run it.

hGetReplies :: Handle -> Parser a -> IO a hGetReplies h parser = go S . empty where go rest = do parseResult <- parseWith readMore parser rest case parseResult of Fail _ _ s -> error s Partial {} -> error "error: partial" Done _ r -> return r readMore = S . hGetSome h ( 4 * 1024 )

The parseWith function does partial matching. When it can’t parse anything, it will use the readMore function to get more data and try again. The readMore function uses the handle to read more data. Once the parser can match something, we are done.

Running commands

Once we have a command, we can run it!

runCommand :: Handle -> Maybe Command -> TVar DB -> IO () runCommand handle ( Just ( Get key)) db = do m <- atomRead db let value = getValue m key S . hPutStr handle $ S . concat [ "$" , valLength value, crlf, value, crlf] where valLength :: Value -> ByteString valLength = pack . show . S . length runCommand handle ( Just ( Set key value)) db = do updateValue (insert key value) db S . hPutStr handle ok runCommand handle ( Just Unknown ) _ = S . hPutStr handle $ S . concat [ "-ERR " , "unknown command" , crlf] runCommand _ Nothing _ = return ()

When the command is a get , we read the DB atom. Then we construct a bulk reply and write it to the handle. The bulk reply is in the same format as our messages above: $5\r

honza\r

. The getValue function is a lookup function that returns “null” if a value can’t be found.

getValue :: DB -> Key -> Value getValue db k = findWithDefault "null" k db

When the command is a set , we use our updateValue function from above and write the ok to the handle. The ok variable is just +OK\r

.

When the command is unknown, we write an error to the handle.

Compiling and running

You can now build your program with

$ stack build

And run it with

$ stack exec Redish Listening on localhost 7777

To test it out, you can connect to it with the redis-cli tool:

$ redis-cli -p 7777 127.0.0.1:7777> set name honza OK 127.0.0.1:7777> get name "honza"

You can test the performance with something silly, like:

$ time redis-cli -r 10000 get name

Conclusion

You can see the finished product on GitHub. Feedback is welcome, so are questions.