Continuing on this series of beginner/intermediate projects for newer Haskell users, let’s look back at our Huffman encoding project.

In our last post we went over two types of binary trees implemented as algebraic data structures in Haskell, and also a scheme for assembling a Huffman encoding tree using the State monad.

Now let’s look at serializing and unserializing our prefix trees for easy storage, and then at actually using them to encode and decode!

Binary

There are a couple of serialization libraries in Haskell; the dominant one is binary, but cereal is also not uncommon. The two diverge on several design points, and you can read up on them in the documentation for cereal. We’ll be using binary for the this tutorial; among many reasons, for its easy integration with the pipes library we will be working with later.

The Easy Way

So let’s make PreTree serialize/unserializable.

The easy way is to enable the DeriveGeneric language extension on GHC, use deriving (Generic) when we define our PreTree , and then:

instance Binary a => Binary ( PreTree a) a)

And…that’s it! We just auto-generated functions to serialize and deserialize our PreTree s (if what they contain is itself serializable).

In real life, we would do this. However, for the sake of learning, let’s dig a bit more into the Binary typeclass.

The Other Easy Way

So the big crux of binary is the Binary typeclass:

class Binary t where put :: t -> Put get :: Get t

where Put and Get are sort of “instruction objects for putting/getting binary”. Get is a monad, and Put is a wrapped PutM , which is a writer monad. (To be more specific, Put is PutM () , because the final action has no result and only “writes”)

So Binary things are things that you can serialize (with the instructions in put ) and deserialize (with the instructions in get ).

Luckily, because of Haskell’s great composition tools, assembling these instructions by hand are easy peasy!

Put

Let’s define our own custom Put for our PreTree s:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L69-L76 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding putPT :: Binary a => PreTree a -> Put PTLeaf x) = do putPT (x) True -- signify we have a leaf put put x PTNode pt1 pt2) = do putPT (pt1 pt2) False -- signify we have a node put put pt1 put pt2

This all should be fairly readable and self-explanatory.

“To put a PTLeaf x , first put a flag saying you have a leaf, then put the value of x .”

“To put a PTNode pt1 pt2 , first put a flag saying you have a node, then put both trees.”

Due to how monads and pattern matching work, the whole thing is pretty expressive, pleasant to read, and satisfying to write.

The only slightly annoying thing is that we subject ourselves to boolean blindness by using True or False ; we have to keep track of what means what. Alternatively, we can create our own binary data types, data PTType = IsNode | IsLeaf , and put that, instead…but in this case it might not be so bad to live with boolean blindness for now.

Get

Now let’s define our own custom Get :

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L79-L84 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding getPT :: Binary a => Get ( PreTree a) a) = do getPT <- get isLeafget if isLeaf isLeaf then PTLeaf <$> get get else PTNode <$> get <*> get getget

This also shouldn’t be too bad!

“Get” the boolean flag, to tell you if you have a leaf or a node.

If it’s a leaf, then get the data inside the leaf, and wrap it in a PTLeaf .

the data inside the leaf, and wrap it in a . If it’s not, get the next two PreTree a ’s, and put them both in a PTNode .

The neat thing here is that get is polymorphic in its return type. We know that the first get expects a Bool , so it knows to parse a Bool . We know that the second get expects an a , so it knows to parse an a . We know that the final two get s both expect PreTree a ’s, so it nows what to parse for that too.

Hooray for type inference!

If you’re not familiar with the f <$> x <*> y idiom, you can consider it to be the same thing as f x y , except that x and y are “inside” things:

> ( + ) 1 4 ghci 5 > ( + ) <$> Just 1 <*> Just 4 ghci Just 5

Where (<$>) and (<*>) come from Control.Applicative . We call this style “applicative style”, in the biz.

Wrapping it up

And finally, to tie it all together:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L36-L38 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding instance Binary a => Binary ( PreTree a) where a) = putPT putputPT = getPT getgetPT

Testing it out

However way we decide to write our Binary instance, let’s test it all out.

> let ( Just pt) = runBuildTree "hello world" ghcipt)runBuildTree > let encoded = encode pt ghciencodedencode pt > : t encoded ghcit encoded encoded :: ByteString -- a string of bytes > let decoded = decode encoded :: PreTree Char ghcidecodeddecode > decoded ghcidecoded PTNode ( PTNode ( PTNode ( PTLeaf 'h' ) ( PTLeaf 'e' ) ) ( PTNode ( PTLeaf 'w' ) ( PTLeaf 'r' ) ) ) ( PTNode ( PTLeaf 'l' ) ( PTNode ( PTNode ( PTLeaf 'd' ) ( PTLeaf ' ' ) ) ( PTLeaf 'o' ) ) ) > decoded == t ghcidecoded True

Neat! We can also write it to a file and re-read:

> encodeFile "test.dat" t ghciencodeFile > t' <- decodeFile "test.dat" :: IO ( PreTree Char ) ghcit'decodeFile > t' ghcit' PTNode ( PTNode ( PTNode ( PTLeaf 'h' ) ( PTLeaf 'e' ) ) ( PTNode ( PTLeaf 'w' ) ( PTLeaf 'r' ) ) ) ( PTNode ( PTLeaf 'l' ) ( PTNode ( PTNode ( PTLeaf 'd' ) ( PTLeaf ' ' ) ) ( PTLeaf 'o' ) ) ) > t' == t ghcit' True

And this looks like it works pretty well!

Encoding

Now that we’ve got that out of the way, let’s work on actually encoding and decoding.

So, basically, we encode a character in a huffman tree by path you take to reach the character.

Let’s represent this path as a list of Direction s:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L25-L30 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding data Direction = DLeft | DRight deriving ( Show , Eq , Generic ) type Encoding = [ Direction ]

Eventually, an Encoding will be turned into a ByteString , with DLeft representing the 0 bit and DRight representing the 1 bit. But we keep them as their own data types now because everyone hates boolean blindness. Instead of keeping a True or False , we keep data types that actually carry semantic meaning :) And we can’t do silly things like use a boolean as a direction…what the heck? Why would you even want to do that? How is “true” a direction?

Direct search

Here’s a naive recursive direct (depth-first) search.

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L90-L96 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding findPT :: Eq a => PreTree a -> a -> Maybe Encoding = go pt0 [] findPT pt0 xgo pt0 [] where PTLeaf y ) enc | x == y = Just ( reverse enc) go (y ) encenc) | otherwise = Nothing PTNode pt1 pt2) enc = go pt1 ( DLeft : enc) <|> go (pt1 pt2) encgo pt1 (enc) DRight : enc) go pt2 (enc)

The algorithm goes:

If you find a PTLeaf , if the data matches what you are looking for, return the current path in a Just . If not, this is a dead-end; return Nothing . If you find a PTNode , search the left branch adding a DLeft to the current path, and the right branch adding a DRight to the current path. Use (<|>) to perform the search lazily (ie, stop after the first success).

> let pt = runBuildTree "hello world" ghciptrunBuildTree > findPT pt 'e' ghcifindPT pt Just [ DLeft , DLeft , DRight ] > findPT pt 'q' ghcifindPT pt Nothing

While it is clearly horribly inefficient, it does serve as a nice clean example of a depth-first search (which exits as soon as it finds the goal), and probably a nice reference implementation for us to reference later.

Its inefficiency lies in many things — chiefly of those being the fact that Huffman trees don’t give you any real help as a search tree, and nothing short of a full depth-first traversal would work. Also, you probably don’t want to do this every time you want to encode something; you’d want to have some sort of memoizing and caching, ideally.

Pre-searching

We can sort of “solve” both of these problems this by traversing through our PreTree and adding an entry to a Map at every leaf. This fixes our repetition problem by memoizing all of our results into a map…and it fixes our search problem because Map s are an ordered binary search tree with efficient O(log n) lookups.

There are many ways to do this; my favorite right now is to do it by doing collapsing our tree into one giant map, using the Monoid instance of Map .

Basically, we turn each of our leaves into little Map s, and then “combine” them all, using (<>) , which “combines” or merges two Map k v ’s, using their Monoid instance:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L101-L106 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding ptTable :: Ord a => PreTree a -> Map a Encoding = go pt [] ptTable ptgo pt [] where PTLeaf x) enc = x `M.singleton` reverse enc go (x) encenc PTNode pt1 pt2) enc = go pt1 ( DLeft : enc) <> go (pt1 pt2) encgo pt1 (enc) DRight : enc) go pt2 (enc)

We do some sort of fancy depth-first “map” over all of the leaves, keeping track of how deep we are. Then we combine it all as we go along with <> .

Note how it is almost identical in structure to findPT :

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L90-L96 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding findPT :: Eq a => PreTree a -> a -> Maybe Encoding = go pt0 [] findPT pt0 xgo pt0 [] where PTLeaf y ) enc | x == y = Just ( reverse enc) go (y ) encenc) | otherwise = Nothing PTNode pt1 pt2) enc = go pt1 ( DLeft : enc) <|> go (pt1 pt2) encgo pt1 (enc) DRight : enc) go pt2 (enc)

Except instead of doing a “short-circuit combination” with (<|>) , we do a “full combination” with (<>) .

Lookup, Act 2

So now that we have our lookup table, our new lookup/find function is both simple and performant:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L110-L111 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding lookupPTTable :: Ord a => Map a Encoding -> a -> Maybe Encoding = flip M.lookup lookupPTTableM.lookup

given, of course, that we generate our table first.

> let pt = runBuildTree "hello world" ghciptrunBuildTree > let tb = fmap ptTable pt ghcitbptTable pt > tb >>= \tb' -> lookupPTTable tb' 'e' ghcitb\tb'lookupPTTable tb' Just [ DLeft , DLeft , DRight ] > tb >>= \tb' -> lookupPTTable tb' 'q' ghcitb\tb'lookupPTTable tb' Nothing

(Here we use the Monad instance for Maybe, to extract the tb' out of the Just tb . We “sequence” two Maybe’s together. For more information, check out my blog post on this exact topic)

Encoding many

Now, we’d like to be able to decode an entire stream of a ’s, returning a list of the encodings.

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L114-L117 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding encodeAll :: Ord a => PreTree a -> [a] -> Maybe Encoding [a] = concat <$> sequence ( map (lookupPTTable tb) xs) encodeAll pt xs(lookupPTTable tb) xs) where = ptTable pt tbptTable pt

This is a bit dense! But I’m sure that you are up for it.

First, we build the lookup table and call it tb . Then, we map lookupPTTable tb over our list xs , to get a list of type [Maybe Encoding] . Then, we use sequence , which in our case is [Maybe a] -> Maybe [a] . It turns a list of Maybe’s into a list inside a Maybe. Recall the semantics of the Maybe monad: If you ever encounter a Nothing , the whole thing is a Nothing . So in this case, if any of the inputs are not decodable, the entire thing is Nothing. > sequence [ Just 5 , Just 4 ] ghci Just [ 5 , 4 ] > sequence [ Just 6 , Nothing ] ghci Nothing Note that the standard libraries provide a synonym for sequence . map — mapM . So we could have written it as mapM (lookupPTTable t) xs …but that is significantly less clear/immediately understandable. Recall that our sequence left us with a Maybe [Encoding] …but we only want Maybe Encoding . So we can use (<$>) to concat all of the Encoding s inside the Maybe.

> let pt = runBuildTree "hello world" -- :: Maybe (PreTree Char) ghciptrunBuildTree > pt >>= \pt' -> encodeAll pt' "hello world" ghcipt\pt'encodeAll pt' Just [ DLeft , DLeft , DLeft , DLeft , DLeft , DRight , DRight , DLeft , DRight , DLeft , DRight , DRight , DRight , DRight , DRight , DLeft , DRight , DLeft , DRight , DLeft , DRight , DRight , DRight , DLeft , DRight , DRight , DRight , DLeft , DRight , DRight , DLeft , DLeft ] > pt >>= \pt' -> encodeAll pt' "hello worldq" ghcipt\pt'encodeAll pt' Nothing

Welp, that’s half the battle!

Decoding

For huffman trees, decoding is the much simpler process. Simply traverse down the tree using the given encoding and return a value whenever you reach a leaf.

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L123-L128 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding decodePT :: PreTree a -> Encoding -> Maybe (a, Encoding ) (a, PTLeaf x) ds = Just (x, ds) decodePT (x) ds(x, ds) PTNode pt1 pt2) (d : ds) = case d of decodePT (pt1 pt2) (dds) DLeft -> decodePT pt1 ds decodePT pt1 ds DRight -> decodePT pt2 ds decodePT pt2 ds PTNode _ _) [] = Nothing decodePT (_ _) []

The logic should seem pretty familiar. The main algorithm involves going down the tree, “following” the direction list. If you reach a leaf, then you have found something (and return the directions you haven’t followed yet). If you run out of directions while on a node…something has gone wrong.

> do pt <- runBuildTree "hello world" ghciptrunBuildTree | enc <- encodeAll pt "hello world" encencodeAll pt | decodePT pt enc decodePT pt enc Just ( 'h' , [ DLeft , DLeft ... ]) , [])

(Here we are using the Maybe monad, in order to “stitch together” three possibly-failing operations in a row. We call pt and enc the values “inside” the Just pt and Just enc returned by runBuildTree and encodeAll ; the whole thing fails if any of the steps fail at any time. If you are not familiar with this, I sort of literally wrote an entire blog post on this subject :) )

Decoding many

We’d like to repeatedly iterate this until we have consumed our entire encoding.

Basically, starting with a list of encodings, we want to continually chop it up and build a list from it.

This sounds a lot like the Data.List function unfoldr :

unfoldr :: (b -> Maybe (a, b)) -> b -> [a] (b(a, b))[a]

unfoldr makes a list by applying your function repeatedly to a “de-cumulator”, carrying the state of the decumulator, and stopping when your function returns Nothing . You can think of it as the “opposite” of foldr .

Using unfoldr , we can write a decodeAll :

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L132-L133 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding decodeAll :: PreTree a -> Encoding -> [a] [a] = unfoldr (decodePT pt) decodeAll ptunfoldr (decodePT pt)

> do pt <- runBuildTree "hello world" ghciptrunBuildTree | enc <- encodeAll pt "hello world" encencodeAll pt | return (decodeAll pt enc) (decodeAll pt enc)

Which works exactly as we’d like!

Testing

We can write a utility to test our encoding/decoding functions:

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/Huffman.hs#L106-L113 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding testTree :: Ord a => [a] -> [a] [a][a] = [] -- handle the empty list testTree [][] = decoded testTree xsdecoded where Just decoded = do decoded <- runBuildTree xs ptrunBuildTree xs <- encodeAll pt xs encencodeAll pt xs return (decodeAll pt enc) (decodeAll pt enc)

(Again, refer to my MonadPlus article from earlier, if you are unfamiliar with working with the Maybe monad)

testTree should be an identity; that is, testTree xs === xs .

> testTree "hello world" ghcitestTree "hello world" > testTree "the quick brown fox jumps over the lazy dog" ghcitestTree "the quick brown fox jumps over the lazy dog"

Note the very unsafe irrefutable pattern match on Just decoded . We’ll fix this later!

QuickCheck

Now that we have a neat proposition, we can use quickcheck on it, from the great QuickCheck library. quickcheck will basically test our proposition testTree xs == xs by generating several random xs ’s.

> import Test.QuickCheck ghci > : set - XScopedTypeVariables ghciset > quickCheck (\( xs :: String ) -> testTree xs == xs) ghciquickCheck (\(testTree xsxs) *** Failed ! Falsifiable (after 3 tests and 2 shrinks) : (aftertestsshrinks) "a"

Failure!

Oh! We failed? And on such a simple case? What happened?

If we look at how "a" is encoded, it’ll become apparent:

> let ( Just pt) = runBuildTree "aaa" ghcipt)runBuildTree > pt ghcipt PTLeaf 'a' > findPT pt 'a' ghcifindPT pt Just [] [] > encodeAll pt "aaaaaaaaaaa" ghciencodeAll pt Just [] []

Ah. Well, that’s a problem. Basically, our input string has “zero” entropy, according to typical measurements. So we cannot naively huffman encode it.

Success!

There are a few ways to deal with this. The most “immediate” way would be to realize that decodeAll is partial (that is, it does not terminate/is undefined on some of its inputs), and will actually never terminate if the given tree is a singleton tree. We can write a “safe” decodeAll :

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/PreTree.hs#L137-L139 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding decodeAll' :: PreTree a -> Encoding -> Maybe [a] [a] PTLeaf _) _ = Nothing decodeAll' (_) _ = Just $ unfoldr (decodePT pt) enc decodeAll' pt encunfoldr (decodePT pt) enc

In doing this, we don’t exactly “fix” the problem…we only defer responsibility. Now, whoever uses decodeAll' (like our eventual encoding interface) is forced to handle the error (by handing the Nothing case). In this way, the type system enforces safety. Had we always used the unsafe decodeAll , then whoever uses it eventually has to “manually remember” to handle the unterminating case, by carefully reading documentation or something. In this case, the type system is a big, explicit reminder saying “hey, deal with the unterminating case.”

We’ll also a “safe” testTree :

-- source: https://github.com/mstksg/inCode/tree/master/code-samples/huffman/Huffman.hs#L117-L121 -- interactive: https://www.fpcomplete.com/user/jle/huffman-encoding testTree' :: Ord a => [a] -> Maybe [a] [a][a] = do testTree' xs <- runBuildTree xs ptrunBuildTree xs <- encodeAll pt xs encencodeAll pt xs decodeAll' pt enc

So we can now quickcheck:

> quickCheck (\( xs :: String ) -> testTree' xs `elem` [ Nothing , Just xs]) ghciquickCheck (\(testTree' xsxs]) +++ OK , passed 100 tests . , passedtests

Hooray!

Re: Testing

All I’ll admit that I didn’t even anticipate the degenerate singleton tree case until I decided to add a quickcheck section to this post. It just goes to show that you should always test! And it also shows how easy it is to write tests in quickcheck. One line could mean five unit tests, and you might even test edge/corner cases that you might have never even thought about!

For example, we probably should have tested lookupPTTable against findPT , our reference implementation :) We should have also tested our binary encode/decode!

Next Time

We’re almost there!

For our last section, we are going to be focusing on pulling it all together to make a streaming compression/decompression interface that will be able to read a file and encode/decode into a new file as it goes, in constant memory, using pipes. After that, we will also be looking at how to profile code, applying some optimization tricks we can do to get things just right, and other things to wrap up.