HAMTs from Scratch

Posted on 29 July 2018

This blog post is also an IHaskell notebook and the source is available separately. I also did a talk at NYHUG based on this material.

I wanted an explanation for HAMTs (Hash Array Mapped Tries) that was more detailed than Marek Majkowski’s introduction and more approachable than Ideal Hash Trees by Phil Bagwell, the paper that introduced them. If you haven’t heard of them before, HAMTs are a way of efficiently representing a hashtable as a trie, and although they were first envisioned as a mutable data structure they are easily adapted to work as a persistent data structure. They form the backbone of the unordered-containers library but the implementation has been lovingly optimised to the point where I found it impenetrable. Edward Z. Yang’s implementation is much easier to follow and after adapting it I think I’m in a good place to provide my own take on them.

Let’s start with a few imports! I’ll be using these packages:

import Data.Bits ( Bits (bit, complement, popCount, shiftR, (.&.), (.|.)), (bit, complement, popCount, shiftR, (.&.), (.|.)), FiniteBits (finiteBitSize)) (finiteBitSize)) import Data.ByteArray.Hash ( FnvHash32 (..), fnv1Hash) (..), fnv1Hash) import Data.ByteString.Char8 (pack) (pack) import Data.Char (intToDigit) (intToDigit) import Data.Semigroup ((<>)) ((<>)) import Data.Vector ( Vector , drop, singleton, take, (!), (//)) , drop, singleton, take, (!), (//)) import Data.Word ( Word16 , Word32 ) import Numeric (showIntAtBase) (showIntAtBase) import Prelude hiding (drop, lookup, take) (drop, lookup, take) import System.TimeIt (timeIt) (timeIt) import Text.Show.Pretty (pPrint) (pPrint)

We’re going to be doing some bit twiddling. To make this easier to follow I’m going to define a newtype whose Show instance displays the binary representation.

{-# LANGUAGE GeneralizedNewtypeDeriving #-} newtype Binary a = Binary a deriving ( Enum , Ord , Real , Integral , Eq , Num , Bits , FiniteBits ) instance ( FiniteBits a, Show a, Integral a) => Show ( Binary a) where a,a,a)a) show ( Binary n) = let n) = showIntAtBase 2 intToDigit n "" strshowIntAtBase = finiteBitSize n sizefiniteBitSize n in replicate (size - length str) '0' <> str (sizestr)str

Using this newtype we can turn this:

24732 :: Word16

24732

into this:

24732 :: Binary Word16

0110000010011100

I’m going to use 32-bit hashes (because they’re more convenient to display than 64-bit ones) and 16-bit bitmaps.

type Hash = Binary Word32 type Bitmap = Binary Word16

The width of bitmaps is 2n where n is the number of bits of the hash that we use at each level of the tree (more on this below). I’m setting n = 4 which is what unordered-containers uses (as of this writing), but we could e.g. set n = 5 and use 32-bit bitmaps if we wanted.

bitsPerSubkey :: Int = 4 bitsPerSubkey

Shift is a multiple of n that we will use to focus on the correct part of the hash.

type Shift = Int

I’m also going to define a Hashable class to decouple the choice of a hash function from the implementation of HAMT .

class Hashable a where hash :: a -> Hash

For convenience, we’ll use the FNV-1 hash function with strings.

{-# LANGUAGE TypeSynonymInstances #-} {-# LANGUAGE FlexibleInstances #-} instance Hashable String where = let hash s FnvHash32 h = fnv1Hash ( pack s) fnv1Hash (s) in Binary h

Here’s what it looks like in practice.

"1" :: Binary Word32 hash

00000101000011000101110100101110

A HAMT can be

empty ( None )

) a leaf node with the hash, the key, and the value ( Leaf )

) a node with a bitmap and a (non-empty) vector of child HAMTs ( Many )

I’ve chosen to ignore the possibility of collisions, but we could handle them by adding an extra constructor, e.g. Colliding with a hash and a vector of key-value pairs.

data HAMT key value key value = None | Leaf Hash key value key value | Many Bitmap ( Vector ( HAMT key value)) key value)) deriving ( Show ) empty :: HAMT k v k v = None empty

We’ll need some helper functions for vectors:

insertAt inserts an element at a specified index, shifting elements to the right forwards

inserts an element at a specified index, shifting elements to the right forwards updateAt replaces an element at a specified index with a new element

replaces an element at a specified index with a new element deleteAt removes an element at an index, shifting elements to the right backwards

insertAt :: Vector a -> Int -> a -> Vector a index a = take index vector <> singleton a <> drop index vector insertAt vectorvectorsingleton avector updateAt :: Vector a -> Int -> a -> Vector a index a = vector // [( index , a)] updateAt vectorvector[(, a)] deleteAt :: Vector a -> Int -> Vector a index = take index vector <> drop ( index + 1 ) vector deleteAt vectorvector) vector

Insert

I think the bit manipulation functions are crucial to understanding what’s going on, so I’m going to motivate them by trying to define insert without them and coming up with them as they are needed. This initial definition won’t be quite right so I’ll call it insert_ to differentiate it from the correct insert' function I present later. The type signature for insert_ is

insert_ :: Hash -> key -> value -> HAMT key value -> HAMT key value keyvaluekey valuekey value

Inserting a key-value pair into an empty HAMT gives us a single leaf node:

None = Leaf hash key value insert_ hash key valuehash key value

Inserting a key-value pair into a single leaf node where the hashes match gives us an updated leaf node (because we’re pretending collisions don’t exist):

Leaf leafHash leafKey leafValue) insert_ hash key value (leafHash leafKey leafValue) | hash == leafHash = Leaf hash key value hashleafHashhash key value

Inserting into a HAMT consisting of a single leaf node where the hashes don’t match upgrades that leaf node to a Many node and inserts the key-value pair into that Many node:

@ ( Leaf leafHash leafKey leafValue) insert_ hash key value leafleafHash leafKey leafValue) | hash /= leafHash = insert_ key value ( Many someBitmap (singleton leaf)) hashleafHashinsert_ key value (someBitmap (singleton leaf)) where someBitmap = undefined someBitmap

Bit Masking

Where does someBitmap come from? Time for an example! Let’s start with a Leaf (hash "1") "1" 1 :

h = hash "1" hash = Leaf h "1" 1 leaf leaf

Leaf 00000101000011000101110100101110 "1" 1

someBitMap is a 16-bit bitmap where the number of bits set (the popCount ) is the length of the vector, which in this case is 1. We want to set one bit, but which bit? We carve off the last n bits using a mask:

subkeyMask :: Bitmap = (bit bitsPerSubkey) - 1 subkeyMask(bit bitsPerSubkey) subkeyMask

0000000000001111

-- 0101110100101110 -- .&. 0000000000001111 ----------------------- -- 0000000000001110 = fromIntegral h .&. subkeyMask fragmentsubkeyMask fragment

0000000000001110

Then we interpret that fragment as a number:

Binary position = fragment positionfragment position

14

Finally, we set that bit and we have our bitmap:

someBitmap :: Bitmap = Binary $ bit $ fromIntegral position someBitmapbitposition someBitmap

0100000000000000

We’re going to be doing this a lot, so I’ll define this as bitMask_ . The extra _ is because it isn’t quite right for the same reason as insert_ :

bitMask_ :: Hash -> Bitmap = let bitMask_ hash = fromIntegral hash .&. subkeyMask fragmenthashsubkeyMask Binary position = fragment positionfragment in Binary (bit ( fromIntegral position)) (bit (position))

Let’s look at the Many case. If we try inserting into a node where the bit in the bitmap corresponding to the mask is 0 , this means that there is an empty slot in the vector. We can insert a leaf node into this slot and set the corresponding bit in the bitmap to 1 :

Many bitmap vector) insert_ hash key value (bitmap vector) | bitmap .&. mask == 0 = let bitmapmask = Leaf (hash key) key value leaf(hash key) key value = insertAt vector index leaf vector'insertAt vectorleaf = bitmap .|. mask bitmap'bitmapmask in Many bitmap' vector' bitmap' vector' where = bitMask_ hash maskbitMask_ hash index = undefined

Mask Indexing

What index do we use? This is where popCount makes an appearance. Let’s demonstrate by inserting ("10", 2) into our example. First we get the mask corresponding to hash "10" :

= bitMask_ (hash "10" ) maskbitMask_ (hash mask

0000010000000000

Next we want to find the number of lower bits that have been set. We use mask - 1 as a mask:

- 1 mask

0000001111111111

-- 0100000000000000 -- .&. 0000001111111111 ----------------------- -- 0000000000000000 = someBitmap .&. (mask - 1 ) maskedsomeBitmap(mask masked

0000000000000000

Then we count the number of bits set with popCount :

index = popCount masked popCount masked index

0

And this is the index we need to insert at! We’ll call this maskIndex :

maskIndex :: Bitmap -> Bitmap -> Int = popCount (bitmap .&. (mask - 1 )) maskIndex bitmap maskpopCount (bitmap(mask))

The final case is where the bit in the bitmap is already set. We need to recursively update the HAMT at the corresponding index:

Many bitmap vector) insert_ hash key value (bitmap vector) | bitmap .&. mask == 1 = let bitmapmask = insert_ hash key value (vector ! index ) -- WRONG! subtree'insert_ hash key value (vector = updateAt vector index subtree' vector'updateAt vectorsubtree' in Many bitmap vector' bitmap vector' where = bitMask_ hash maskbitMask_ hash index = maskIndex bitmap mask maskIndex bitmap mask

But this definition is wrong, because instead of carving off the last n bits of hash , we want to recursively carve off the next n bits!

Shifting

This is what’s missing from our definition, a shift parameter corresponding to how far up the hash we’re looking. This is why we defined Shift above. Taking this extra parameter into account, our bit manipulation functions now become:

subkeyMask :: Bitmap = (bit bitsPerSubkey) - 1 subkeyMask(bit bitsPerSubkey) maskIndex :: Bitmap -> Bitmap -> Int = popCount (bitmap .&. (mask - 1 )) maskIndex bitmap maskpopCount (bitmap(mask)) subkey :: Hash -> Shift -> Int = fromIntegral $ ( fromIntegral $ shiftR hash shift) .&. subkeyMask subkey hash shiftshiftR hash shift)subkeyMask bitMask :: Hash -> Shift -> Bitmap = bit (subkey hash shift) bitMask hash shiftbit (subkey hash shift)

We plumb through this shift parameter, only modifying it in the final case, to give us the correct definitions of insert' and insert :

insert :: Hashable key => key -> value -> HAMT key value -> HAMT key value keykeyvaluekey valuekey value = insert' 0 (hash key) key value hamt insert key value hamtinsert'(hash key) key value hamt insert' :: Shift -> Hash -> key -> value -> HAMT key value -> HAMT key value keyvaluekey valuekey value None = Leaf hash key value insert' shift hash key valuehash key value @ ( Leaf leafHash leafKey leafValue) insert' shift hash key value leafleafHash leafKey leafValue) | hash == leafHash = Leaf hash key value hashleafHashhash key value | otherwise = insert' shift hash key value ( Many (bitMask leafHash shift) (singleton leaf)) insert' shift hash key value ((bitMask leafHash shift) (singleton leaf)) Many bitmap vector) insert' shift hash key value (bitmap vector) | bitmap .&. mask == 0 = let bitmapmask = Leaf hash key value leafhash key value = insertAt vector index leaf vector'insertAt vectorleaf = bitmap .|. mask bitmap'bitmapmask in Many bitmap' vector' bitmap' vector' | otherwise = let = vector ! index subtreevector = insert' (shift + bitsPerSubkey) hash key value subtree subtree'insert' (shiftbitsPerSubkey) hash key value subtree = updateAt vector index subtree' vector'updateAt vectorsubtree' in Many bitmap vector' bitmap vector' where = bitMask hash shift maskbitMask hash shift index = maskIndex bitmap mask maskIndex bitmap mask

Now we can construct HAMTs and inspect them! I’ll define a fromList function and use pPrint from pretty-show to highlight the tree structure:

fromList :: Hashable key => [(key, value)] -> HAMT key value key[(key, value)]key value = foldr ( uncurry insert) empty fromListinsert) empty = fromList [( "1" , 1 ), ( "10" , 2 ), ( "100" , 3 ), ( "1000" , 4 )] examplefromList [(), (), (), ()] pPrint example

Many 0100010000000000 [ Many 0000000100100000 [ Leaf 00100000011101101010111101011010 "10" 2 , Leaf 10001010111100101011011010001010 "1000" 4 ] , Many 0000001000000100 [ Leaf 00000101000011000101110100101110 "1" 1 , Leaf 01110100110101100000101010011110 "100" 3 ] ]

Lookup

Compared to insert , lookup is a walk in the park. It’s implemented along the same lines as insert :

on None nodes, it fails

nodes, it fails on Leaf nodes, it succeeds if the hashes match

nodes, it succeeds if the hashes match on Many nodes, it fails if the bit isn’t set, and recurses into the child node otherwise

lookup :: Hashable key => key -> HAMT key value -> Maybe value keykeykey valuevalue lookup key hamt = lookup' 0 (hash key) hamt key hamtlookup'(hash key) hamt lookup' :: Shift -> Hash -> HAMT key value -> Maybe value key valuevalue None = Nothing lookup' shift hash Leaf leafHash leafKey leafValue) lookup' shift hash (leafHash leafKey leafValue) | hash == leafHash = Just leafValue hashleafHashleafValue | otherwise = Nothing Many bitmap vector) lookup' shift hash (bitmap vector) | bitmap .&. mask == 0 = Nothing bitmapmask | otherwise = lookup' (shift + bitsPerSubkey) hash (vector ! index ) lookup' (shiftbitsPerSubkey) hash (vector where = bitMask hash shift maskbitMask hash shift index = maskIndex bitmap mask maskIndex bitmap mask

Let’s quickly confirm that it works.

lookup "100" example example

Just 3

Memoising Fibonacci

We now have enough of an API to use this as a hashtable! Let’s use it to memoise the calculation of the Fibonacci sequence. The naive implementation does a lot of unnecessary recomputation:

fib :: Int -> Int 0 = 1 fib 1 = 1 fib = fib (n - 1 ) + fib (n - 2 ) fib nfib (nfib (n $ print $ fib 30 timeItfib

1346269 CPU time: 1.31s

We can memoise it by storing previously calculated results and using them if they are available:

instance Hashable Int where = Binary ( fromIntegral int) hash intint) fib' :: HAMT Int Int -> Int -> ( Int , HAMT Int Int ) 0 = ( 1 , insert 0 1 table) fib' table, inserttable) 1 = ( 1 , insert 1 1 table) fib' table, inserttable) = case lookup n table of fib' table nn table Just i -> (i, table) (i, table) Nothing -> let = fib' table (n - 1 ) (i1, table')fib' table (n = fib' table' (n - 2 ) (i2, table'')fib' table' (n in (i1 + i2, insert n (i1 + i2) table'') (i1i2, insert n (i1i2) table'') fib :: Int -> Int = fst $ fib' empty n fib nfib' empty n $ print $ fib 30 timeItfib

1346269 CPU time: 0.00s

Delete

Finally we come to delete , which is only a little more complex than lookup . It needs to make sure that no Many node has a child None node, so if a None node:

is an only child, it will replace the parent node

has any sibling nodes, it will be removed from the parent node’s bitmap and vector

Leaf nodes similarly replace their parents if they are the only child.

delete :: Hashable key => key -> HAMT key value -> HAMT key value keykeykey valuekey value = delete' 0 (hash key) hamt delete key hamtdelete'(hash key) hamt delete' :: Shift -> Hash -> HAMT key value -> HAMT key value key valuekey value None = None delete' shift hash @ ( Leaf leafHash leafKey leafValue) delete' shift hash leafleafHash leafKey leafValue) | hash == leafHash = None hashleafHash | otherwise = leaf leaf @ ( Many bitmap vector) delete' shift hash manybitmap vector) | bitmap .&. mask == 0 = many bitmapmaskmany | otherwise = let = vector ! index subtreevector = delete' (shift + bitsPerSubkey) hash subtree subtree'delete' (shiftbitsPerSubkey) hash subtree in case subtree' of subtree' None -> if length vector == 1 vector then None else Many (bitmap .&. complement mask) (deleteAt vector index ) (bitmapcomplement mask) (deleteAt vector Leaf {} -> if length vector == 1 {}vector then subtree' subtree' else Many bitmap (updateAt vector index subtree') bitmap (updateAt vectorsubtree') Many {} -> Many bitmap (updateAt vector index subtree') {}bitmap (updateAt vectorsubtree') where = bitMask hash shift maskbitMask hash shift index = maskIndex bitmap mask maskIndex bitmap mask

Let’s see this in action.

$ delete "1000" example pPrintdeleteexample

Many 0100010000000000 [ Many 0000000000100000 [ Leaf 00100000011101101010111101011010 "10" 2 ] , Many 0000001000000100 [ Leaf 00000101000011000101110100101110 "1" 1 , Leaf 01110100110101100000101010011110 "100" 3 ] ]

It’s possible to have a situation where we have a Many node with only one child, because our replacement behaviour checks the length of the vector before any elements are removed from it. However, removing the last leaf will correctly delete the parent node.

$ delete "10" $ delete "1000" example pPrintdeletedeleteexample

Many 0100000000000000 [ Many 0000001000000100 [ Leaf 00000101000011000101110100101110 "1" 1 , Leaf 01110100110101100000101010011110 "100" 3 ] ]

And we’re done! I hope you understand HAMTs better than when you started reading this.

If you want to use this for something other than educational purposes, I would recommend adding logic to deal with hash collisions, which I intentionally omitted. There’s also some low-hanging fruit in terms of performance optimisations. The first thing that comes to mind is an additional Full constructor for the case where all bits in the bitmap are set, and the next thing is the use of unsafe vector functions that omit bounds checking.

Thanks to Evan Borden, Javier Candeira, Jean Niklas L’orange, Mark Hopkins, and Tim Humphries for comments and feedback.