Greetings and welcome to the first entry of Functional Kansans, a new blog brought to you by a few of the people interested in functional programming at the University of Kansas.

We open the blog with a series of posts by myself and Nick Frisby detailing our recent exploration into the world of Haskell program optimization. Before we begin, I should state that this work arose from what was supposed to be a class-wide project for Andy Gill’s advanced functional programming class that Nick and I co-opted as our own, perhaps greedily so. The following content would not have been possible without Andy’s excellent instruction and the support of our other classmates (who hopefully will have posts of their own on this blog at some point).

The Problem

The target of our optimization efforts was the Jaro-Winkler algorithm, a metric for judging the similarity of two strings. Specifically, we looked at the Jaro distance metric given that its time-complexity dominates that of the Winkler variant.

Pseudo-code

This section provides pseudo-code snippets in order to explain the algorithm.

The Jaro measure is computed from the two string lengths, the number of matches m, and the number of transpositions t, and yields a similarity metric between 0 and 1.

    jaro s1 s2 = (m / n1 + m / n2 + (m - t) / m) / 3
      where
        n1 = length s1
        n2 = length s2
        m  = matches s1 s2
        t  = transpositions s1 s2
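To make the measure concrete, here is a hand-computed example using the classic "MARTHA"/"MARHTA" pair (these strings and the resulting m = 6 and t = 1 are our own worked values, not from the text above; t is 1 because the matched sequences differ in two positions and, as the later implementation shows, that count is halved):

```haskell
-- Worked example of the Jaro measure for "MARTHA" vs "MARHTA":
-- m = 6 matches, t = 1 transposition, both strings have length 6.
main :: IO ()
main = do
  let m  = 6 :: Double
      t  = 1
      n1 = 6
      n2 = 6
  print ((m / n1 + m / n2 + (m - t) / m) / 3)  -- ≈ 0.944
```

A score of roughly 0.944 matches the value usually quoted for this pair in the literature.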

A pair of characters from each string match if they are equal and not too far apart. The maximal match distance is defined as a sliding window, framed by the list of indices [lo .. hi], whose size is determined relative to the length of the two input strings. Additionally and importantly, each character can only match once.
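Concretely, the window size used throughout this post is half the longer string's length, minus one, clamped at zero (this is the formula that appears in the full implementation below); a small sketch:

```haskell
-- Maximal match distance: half the longer length, minus one, never negative.
window :: Int -> Int -> Int
window l1 l2 = max 0 (max l1 l2 `div` 2 - 1)

main :: IO ()
main = do
  print (window 6 6)  -- 2: characters may match up to 2 positions away
  print (window 1 1)  -- 0: single characters must line up exactly
```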

    matches s1 s2 = snd (foldl snoc nil (zip s1 [0..]))
      where
        nil = ([], 0)
        snoc st@(used, m) (c1, idx1) = case find p [lo .. hi] of
            Nothing   -> st
            Just idx2 -> (idx2 : used, m + 1)
          where
            p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

To calculate the number of transpositions, count the number of differences in the in-order matched characters.

    transpositions s1 s2 =
      length $ filter (\(c1, c2) -> c1 /= c2)
             $ zip (filter wasMatched s1) (filter wasMatched s2)

Note that since each character can match at most once, and only with a character from the other string, both strings contain the same number of matched characters; the zip truncates neither argument. Also note that we avoid giving a full definition for wasMatched; suffice it to say that no implementation matching its apparent type of Char -> Bool is possible.

The next sections will present the starting implementation of this algorithm and a discussion of our profiling techniques used for our optimization efforts.

The Starting Point

As was just mentioned, wasMatched cannot have the type Char -> Bool, at least as the transpositions function is currently specified, because it is missing a dependency on the indices calculated in the matches function; the character value on its own is insufficient for determining whether a position in the string was matched. Presented below is a naive implementation of the algorithm that corrects that deficiency by having matches return the string consisting of all matched characters, in order, from the left string (s1) and all matched indices from the right string (s2). Please also note that the transpositions function has been “folded into” the where clause of jaro, mainly for brevity.

    jaro :: String -> String -> Double
    jaro s1 s2
      | 0 == m    = 0
      | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
      where
        n1 = fromIntegral $ length s1
        n2 = fromIntegral $ length s2
        (used, m1) = matches s1 s2
        m2 = map (s2 !!) $ sort used
        m  = fromIntegral $ length m1
        t  = (/ 2) $ fromIntegral $ length $ filter id $ zipWith (/=) m1 m2

    matches :: String -> String -> ([Int], String)
    matches s1 s2 = foldl snoc initst (zip s1 [0..])
      where
        window = max 0 (max (length s1) (length s2) `div` 2 - 1)
        initst = ([], "")
        snoc st@(used, m1) (c1, idx1) = case find p [lo .. hi] of
            Nothing   -> st
            Just idx2 -> (idx2 : used, m1 ++ [c1])
          where
            p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
            lo = max 0 (idx1 - window)
            hi = min (length s2 - 1) (idx1 + window)

From this point the path to an optimized implementation is simple: find what is slow and make it faster. Flippancy aside, we decided to identify time-cost hotspots using GHC’s profiling functionality. Many notable bloggers and Haskell hackers also advocate direct inspection of the Core code that GHC produces to augment profiling information; however, we chose not to go down that road.

The honest truth of the matter is that Core is not a fun language to read, except perhaps to gluttons for syntactic punishment. Furthermore, the practice quickly becomes infeasible for any significantly large code base.

Profiling

So we made up our mind. Profiling and profiling only.

Unfortunately, running a single iteration of jaro takes such a trivially small amount of time that no worthwhile profiling data would be collected. To sidestep this issue we devised the following test harness that calls jaro a large number of times.

    words_file :: FilePath
    words_file = "/usr/share/dict/propernames"

    main :: IO ()
    main = do
      ns  <- lines `fmap` readFile words_file
      now <- length ns `seq` getCurrentTime
      let xs :: [Double]
          xs = sortBy (flip compare) [ jaro n m | n <- ns, m <- ns ]
      now2 <- rnf xs `seq` getCurrentTime
      print (diffUTCTime now2 now)

In short, the harness reads a line-delimited file, splits it into words, and then calculates the Jaro similarity for the cross product of the word list. Also note the additional plumbing to measure execution time with the Data.Time library and to force evaluation of the result list to normal form using the Control.DeepSeq library. This combination gives us informal benchmarking that should confirm any results we find with profiling.

For all profiling experiments in this post we use the following compilation flags: -rtsopts -prof -auto-all -O2 . The first three are the standard flags required for profiling all top-level definitions, while the final flag is the simplest way to activate most of the compiler optimizations that matter for us. A future post may discuss other optimization flags of benefit; for now we accept -O2 as “good enough.”

Likewise we use the following runtime system options: -p -i0.001 . No surprises here: we activate time profiling and shrink the profiling tick interval to 1 ms. Note that as of GHC 7 most runtime system options are disabled by default, hence the need for the -rtsopts flag during compilation.

Profiling our naive implementation with the above settings gives us the following information:

    total time  =        0.88 secs
    total alloc = 3,255,736,936 bytes
    wall clock  =   14.746186 secs

    COST CENTRE    MODULE    %time  %alloc
    main           Main       61.4    36.8
    matches        Main       29.8    47.7
    jaro           Main        8.8    15.5

Fantastic. So now we know that matches is the dominant function, accounting for roughly 30% of our total run time and almost 48% of our total allocation. Looking back at the matches function, though, the code is fairly dense and, at least to less experienced eyes, does not appear to reflect any obvious optimization opportunities. To gain more information we will introduce our own cost centers to measure any internal expressions that we think may account for a large portion of our run time.

    matches :: String -> String -> ([Int], String)
    matches s1 s2 = {-# SCC "matchesFold" #-} foldl snoc initst (zip s1 [0..])
      where
        window = {-# SCC "matchesWindow" #-}
                 max 0 (max (length s1) (length s2) `div` 2 - 1)
        initst = ([], "")
        snoc st@(used, m1) (c1, idx1) = case find p [lo .. hi] of
            Nothing   -> st
            Just idx2 -> {-# SCC "matchesSnoc" #-} (idx2 : used, m1 ++ [c1])
          where
            p idx2 = {-# SCC "matchesPred" #-}
                     idx2 `notElem` used && c1 == s2 !! idx2
            lo = max 0 (idx1 - window)
            hi = {-# SCC "matchesHi" #-} min (length s2 - 1) (idx1 + window)

Running the profiling with the new cost centers does in fact give us quite a bit more information to work with:

    total time  =        1.06 secs
    total alloc = 2,883,573,832 bytes
    wall clock  =   15.011385 secs

    COST CENTRE      MODULE    %time  %alloc
    main             Main       30.5    41.5
    matchesPred      Main       25.0     0.0
    matches          Main       13.5    18.3
    jaro             Main       12.6    17.5
    matchesFold      Main        9.6    14.6
    matchesWindow    Main        4.2     1.5
    matchesHi        Main        3.3     2.5
    matchesSnoc      Main        1.3     4.1

Hold on, though. The introduction of our own cost centers affected our total time and allocation by fairly significant numbers. That can’t be right, can it? We’ll touch on this more in a later section, but for now, thankfully, we can rely on the independent benchmarking that we built into our test harness (recorded above in the wall clock field) to confirm any speed ups that we find.

Down the Rabbit Hole

Looking at the profiling information, we see that the dominant cost center we introduced is matchesPred which measures our find predicate:

    p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

There are a few things going on in this function, so let’s work from left to right. Recall that used is simply a list of integers, making notElem a linear traversal.

It should be possible to find a better container type for used that increases the speed of this check without penalizing other operations over used .

We decided to change used to be modeled as an IntSet (note: we imported qualified as IS ) rather than a list of integers. In the worst case, the complexity of the operations should be the same as a list; however, in practice we know that at the very least the allocation behavior should improve, likely leading to an improvement in time-cost as well. We also gain the benefit of using a container that better matches the semantics we’re looking for; after all, a set of matched indices needs no room for duplicates.
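A minimal sketch of the IntSet operations the new representation relies on (the IS qualification matches the import mentioned above; the example indices are our own):

```haskell
import qualified Data.IntSet as IS

main :: IO ()
main = do
  -- build the set of already-matched indices, one insert per match
  let used = 4 `IS.insert` (2 `IS.insert` IS.empty)
  print (3 `IS.notMember` used)   -- True: index 3 is still free to match
  print (4 `IS.notMember` used)   -- False: index 4 was already matched
  print (IS.toAscList used)       -- [2,4]: duplicates are impossible by construction
```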

    matches :: String -> String -> (IntSet, String)
    matches s1 s2 = {-# SCC "matchesFold" #-} foldl snoc initst (zip s1 [0..])
      where
        window = {-# SCC "matchesWindow" #-}
                 max 0 (max (length s1) (length s2) `div` 2 - 1)
        initst = (IS.empty, "")
        snoc st@(used, m1) (c1, idx1) = case find p [lo .. hi] of
            Nothing   -> st
            Just idx2 -> {-# SCC "matchesSnoc" #-}
                         (idx2 `IS.insert` used, m1 ++ [c1])
          where
            p idx2 = {-# SCC "matchesPred" #-}
                     idx2 `IS.notMember` used && c1 == s2 !! idx2
            lo = max 0 (idx1 - window)
            hi = {-# SCC "matchesHi" #-} min (length s2 - 1) (idx1 + window)

And its profiling information:

    total time  =        0.72 secs
    total alloc = 2,490,396,576 bytes
    wall clock  =    14.79711 secs

    COST CENTRE      MODULE    %time  %alloc
    main             Main       41.1    48.1
    matchesPred      Main       20.7     0.0
    matches          Main       11.7    11.6
    matchesFold      Main        9.4    17.0
    jaro             Main        9.4    12.7
    matchesWindow    Main        4.0     1.7
    matchesSnoc      Main        1.8     6.2
    matchesHi        Main        1.8     2.8

As expected, this led to a significant decrease in total allocation during the execution of the program. Unfortunately, while the profiled time did decrease by roughly 30%, the reduction in wall clock time was far smaller. That leaves matchesPred as the dominant cost center, so we revisit it looking for another optimization opportunity. The only other operation in the expression that is not a primitive comparison is the string indexing. Strings are not designed for efficient random access ( !! is a linear traversal), so again we go on the quest for an alternative container type.

ByteStrings

The obvious and immediate choice here would be to use a ByteString as our new container type. ByteStrings are so attractive because they provide constant time indexing and length checks in addition to extremely efficient stream-fused traversals, all of which should provide some big wins across the entire algorithm.
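The constant-time operations are easy to see in isolation (the example string is our own; BS matches the qualified import mentioned below):

```haskell
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  let s = BS.pack "MARHTA"
  print (BS.length s)      -- 6: O(1), unlike length on String
  print (s `BS.index` 3)   -- 'H': O(1), unlike (!!) on String
```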

When considering which types to change, note that s2 is the only string ever indexed into. However, because s1 and s2 are built from the same word list, it makes sense for them to share a type, avoiding unnecessary calls to pack or unpack, which can be a major source of inefficiency when using ByteStrings. We will discuss this in more detail in the conclusion, mainly as a lead-in to the next post in the series; for now we operate under the assumption that the preceding statement is correct.

The resultant types of our functions are shown below:

    jaro    :: ByteString -> ByteString -> Double
    matches :: ByteString -> ByteString -> (IntSet, String)

For the most part the available ByteString operations mirror the String operations provided by the prelude, so code changes are predominantly just qualifications to tell GHC which version to use (note: we imported qualified as BS ). Given this, we will avoid showing the updated code for brevity’s sake. The profiling results, though, are shown below in their entirety.

    total time  =        0.51 secs
    total alloc = 2,739,268,472 bytes
    wall clock  =   13.066119 secs

    COST CENTRE      MODULE    %time  %alloc
    main             Main       64.4    43.7
    matchesPred      Main       10.2     0.0
    matches          Main        9.8    10.5
    jaro             Main        5.9    13.6
    matchesFold      Main        4.7    22.5
    matchesSnoc      Main        2.9     5.6
    matchesHi        Main        1.4     2.6
    matchesWindow    Main        0.6     1.5

The first thing we notice is that the switch to ByteStrings has pushed our allocation back up near where it was before we switched to an IntSet for the match indices. The tradeoff, though, is a significant performance increase across the board, shaving almost two full seconds off the wall clock time.

Looking at our code again, we recognize that the expressions within the matchesWindow, matchesPred, and matchesHi cost centers have now all been reduced to either constant time or primitive operations. To clear up our profiling information from this point on we will be removing those cost centers since there is not much else we can do to optimize them directly. That does not mean that we are done, though. Even though our test harness is now the dominant cost center by a very large margin, we still have a few tricks up our sleeves that we have yet to exhaust to see if we can make the Jaro algorithm even faster.

Down the Rabbit Hole in Space Jam

All of the changes we have made up to this point have been, for the most part, algorithm impartial. We have identified a few hot spots and attempted to mitigate their cost by changing the data representations of the objects we are working with to make those operations more efficient; however, the structure of the code itself was left largely unchanged. What follows in this section is an exploration of more “invasive” optimization attempts. If the previous section was an allusion to simply following the white rabbit (profiling output) to guide our path, this section would best be described as again following the rabbit, but this time asking more in-depth questions like, “Why is Bill Murray here?”

The Bill Murray in this case is the large amount of allocation still happening in the matches function, specifically in our matchesFold cost center. Whenever you see allocation issues with a fold, you should consider a strict version of that fold. In fact, the ByteString library provides a very efficient foldl' . The catch is that we are not currently folding over the string directly, but over the zip of the string and an incrementing counter. To use the ByteString traversals we must fuse that zip away by tracking the counter at each iteration of the fold.
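The zip-fusion idea can be sketched in isolation (indexed is our own illustrative helper, not part of the final code): instead of zipping with [0..], the accumulator carries the current index alongside the real state.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Thread the index through the accumulator of a strict fold,
-- recovering the effect of `zip s [0..]` without building the zip.
indexed :: BS.ByteString -> [(Char, Int)]
indexed = reverse . snd . BS.foldl' step (0 :: Int, [])
  where step (i, acc) c = (i + 1, (c, i) : acc)

main :: IO ()
main = print (indexed (BS.pack "abc"))  -- [('a',0),('b',1),('c',2)]
```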

While we are fusing traversals we should also pay attention to the jaro function. Currently we traverse string s2 indexing into it, only to again traverse the result with a zip, and then a filter. In parallel we also calculate the value of m by taking the length of our matches string. All of these passes can, and should, be fused into a single loop. The new code for all of these changes is shown below:

    jaro :: ByteString -> ByteString -> Double
    jaro s1 s2
      | 0 == m    = 0
      | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
      where
        n1 = fromIntegral $ BS.length s1
        n2 = fromIntegral $ BS.length s2
        (used, m1) = matches s1 s2
        m = fromIntegral $ length m1
        t = loop (IS.toAscList used) m1 0
          where
            loop (i : is) (c : m1') = loop is m1' . f
              where f = if c /= s2 `BS.index` i then (+ 1) else id
            loop _ _ = (/ 2)

    matches :: ByteString -> ByteString -> (IntSet, String)
    matches s1 s2 = {-# SCC "matchesFold" #-} snd $ BS.foldl' snoc initst s1
      where
        window = max 0 (max (BS.length s1) (BS.length s2) `div` 2 - 1)
        initst = (0, (IS.empty, ""))
        snoc (idx1, st@(used, m1)) c1 = case find p [lo .. hi] of
            Nothing   -> (idx1 + 1, st)
            Just idx2 -> {-# SCC "matchesSnoc" #-}
                         (idx1 + 1, (idx2 `IS.insert` used, m1 ++ [c1]))
          where
            p idx2 = idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2
            lo = max 0 (idx1 - window)
            hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

    total time  =        0.77 secs
    total alloc = 2,698,599,372 bytes
    wall clock  =   11.650707 secs

    COST CENTRE    MODULE    %time  %alloc
    main           Main       62.3    44.3
    matches        Main       21.9    32.6
    jaro           Main        9.6    13.5
    matchesFold    Main        3.5     1.3
    matchesSnoc    Main        2.6     8.3

Looking at this profiling information it’s hard to instantly declare success. Recall that we removed some custom cost centers and, as discussed previously, that probably distorted the total time and allocation results. If we assume the relative cost percentages are accurate, though, we can claim that we shrank the %alloc of the matchesFold cost center significantly. If that weren’t indicative enough of a clear win, we also shaved almost another two full seconds off the wall clock time. The trade-off was that %alloc rose in a few places, notably in the matches function itself.

Let’s see if we can fix this by pushing strictness a bit further. Rather than have each iteration of our fold return nested tuples, let’s instead have it work over a record type with strict fields. Furthermore, let’s hit it with the sledgehammer that is the -funbox-strict-fields compilation flag to see if we can squeeze out any more performance. Note that the code below requires the RecordWildCards and NamedFieldPuns language extensions: these let us write the more concise patterns St{used, m1} and St{..} .
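In isolation, the extensions and the strict-field record pattern look like this (the fields here are hypothetical stand-ins; the real St below also carries the match string):

```haskell
{-# LANGUAGE NamedFieldPuns, RecordWildCards #-}

-- Strict fields: each iteration forces its state instead of building thunks.
data St = St { idx1 :: !Int, hits :: !Int }

-- RecordWildCards (St{..}) brings idx1 and hits into scope unqualified.
bump :: St -> St
bump st@St{..} = st { idx1 = idx1 + 1 }

main :: IO ()
main = do
  -- NamedFieldPuns (St{idx1, hits}) binds variables named after the fields.
  let St{idx1, hits} = bump (St { idx1 = 0, hits = 3 })
  print (idx1, hits)  -- (1,3)
```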

    data St = St { idx1 :: !Int, used :: !IntSet, m1 :: !String }

    matches :: ByteString -> ByteString -> (IntSet, String)
    matches s1 s2 = (used, m1)
      where
        window = max 0 (max (BS.length s1) (BS.length s2) `div` 2 - 1)
        St{used, m1} = {-# SCC "matchesFold" #-} BS.foldl' snoc initst s1
          where
            initst = St { idx1 = 0, used = IS.empty, m1 = "" }
            snoc st@St{..} c1 =
                (case find p [lo .. hi] of
                   Nothing   -> st
                   Just idx2 -> {-# SCC "matchesSnoc" #-}
                                st { used = idx2 `IS.insert` used
                                   , m1   = m1 ++ [c1] }
                ) { idx1 = idx1 + 1 }
              where
                p idx2 = idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2
                lo = max 0 (idx1 - window)
                hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

    total time  =        0.26 secs
    total alloc = 2,525,880,780 bytes
    wall clock  =   11.152646 secs

    COST CENTRE    MODULE    %time  %alloc
    main           Main       68.5    47.4
    matches        Main       16.3    31.8
    jaro           Main        8.9    14.5
    matchesFold    Main        3.5     1.7
    matchesSnoc    Main        2.7     4.7

While not as impressive a change as some of our previous tweaks, moving to the strict record type still bought us about half a second of wall clock time. The last thing we can attempt to decrease our allocation is to move away from the append in the matchesSnoc cost center and instead build our matches string with cons.

    data St = St { idx1 :: !Int, used :: !IntSet
                 , mk :: !(String -> String) }

    matches :: ByteString -> ByteString -> (IntSet, String)
    matches s1 s2 = (used, mk [])
      where
        window = max 0 (max (BS.length s1) (BS.length s2) `div` 2 - 1)
        St{used, mk} = {-# SCC "matchesFold" #-} BS.foldl' snoc initst s1
          where
            initst = St { idx1 = 0, used = IS.empty, mk = id }
            snoc st@St{..} c1 =
                (case find p [lo .. hi] of
                   Nothing   -> st
                   Just idx2 -> {-# SCC "matchesSnoc" #-}
                                st { used = idx2 `IS.insert` used
                                   , mk   = mk . (c1 :) }
                ) { idx1 = idx1 + 1 }
              where
                p idx2 = idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2
                lo = max 0 (idx1 - window)
                hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

    total time  =        0.67 secs
    total alloc = 2,523,483,604 bytes
    wall clock  =   11.279014 secs

    COST CENTRE    MODULE    %time  %alloc
    main           Main       82.2    47.4
    matches        Main       10.9    31.9
    jaro           Main        4.6    14.5
    matchesFold    Main        1.2     1.7
    matchesSnoc    Main        1.0     4.6

Looking at the total allocation and wall clock time, one would be tempted to say this change did not do much. Interestingly, though, if you look at the %time numbers, almost all cost centers pushed a significant amount of time into the main function. In theory, a more efficient test harness could leverage this change for bigger gains down the road.
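The cons-building mk field is the classic difference-list idiom: the match string is represented as a function String -> String, so each new character costs one composition instead of a full append. A minimal sketch (the characters are our own):

```haskell
main :: IO ()
main = do
  let mk0 = id                -- the empty difference list
      mk1 = mk0 . ('J' :)     -- "snoc" 'J': still just a composition
      mk2 = mk1 . ('a' :)     -- "snoc" 'a'
  -- Applying to [] materializes the string; characters come out in snoc order.
  print (mk2 "")              -- "Ja"
```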

Results

We finish by stripping out our SCC annotations, adding an INLINE pragma for matches , manually inlining our find predicate, and performing common subexpression elimination where the ByteString lengths were used. This exhausts our optimization efforts, at least for this post. The resultant profiling information is shown below.

    total time  =        0.87 secs
    total alloc = 2,593,431,612 bytes
    wall clock  =   10.096731 secs

    COST CENTRE    MODULE    %time  %alloc
    main           Main       63.8    46.1
    jaro           Main       36.1    53.9

Compared to the profiling information from our starting point, we have sped up our implementation by almost five full seconds of wall clock time, a whopping one third of the total execution time. Furthermore, total allocation shrank by over 650 million bytes. Not too shabby for a few hours’ worth of work.

The Pool of Tears or: How I Learned to Stop Crying and Love Benchmarking Libraries

This blog post is not a complete summary of our optimization efforts; to be honest, we got quite “stupid” with it. In fact, Nick will be following this post with a few more detailing our efforts to see if we could find a data type that could outperform ByteStrings and his own wild journeys off into the world of minute algorithmic variations. Regardless of whether you have enjoyed this post or not (Space Jam jokes are not for everyone), I encourage you to check the future ones out. The reason I mention this, in addition to providing a moderately decent introduction to the next posts in this series, is to set up the following anecdote.

Towards the end of our optimization fun-time, perhaps a week in at that point, we were playing with a variation of the Jaro algorithm that was barely recognizable. The code was rife with language extensions, unboxed types and primitives, and more calls to unsafePerformIO than I honestly feel comfortable admitting. Given how far we had travelled away from the original implementation, we built a few sanity checks into our test harness to confirm that each of our manual optimizations weren’t changing the input/output behavior of jaro . Another thing we got into the habit of doing was clocking the time with both a profiled and non-profiled version, if only to have a small, meaningless victory with every test we ran. We had just made another arbitrary change and ran the tests when something terrible happened: the sanity check had passed for the non-profiled version of the code and failed for the profiled version.

Profiling had just affected the evaluation of our code. That’s not good.

Those of us fortunate enough to have the newest generation of unibody Macs (read: not me) already knew that there were some issues with profiling for that architecture, given that most attempts to profile came back with zeros across the board. Looking for answers, Nick turned to the Haskell Cafe for help where he was pointed to this GHC bug report. In short, profiling does not play nicely with the optimization flags, including the default -O0 flag (no optimizations). Basically, profiling is currently broken and not to be trusted.

Obviously this was a problem, given that I had just written an entire blog post detailing an optimization methodology based almost entirely on profiling. Oh, that reminds me: all of the profiling numbers in this post are probably wrong, so take them with a grain of salt. This brings up a very good point that I neglected to mention earlier: there is no magic bullet of optimization. Profiling, while powerful when it works, will never be entirely sufficient on its own. Thankfully we included benchmarking code in our test harness, so we were never proceeding truly blind.

That statement leads to my next point: do whatever you can to avoid writing your own benchmarking code. For any well-established programming language, someone will invariably have already developed a robust benchmarking library. If you can, use it. For example, Bryan O’Sullivan has developed a fantastic benchmarking library for Haskell called criterion that I wish we had used from the beginning. Instead, we spent almost as much time changing and improving our test harness as we did optimizing the Jaro algorithm. Even with all of that effort, when it came time to execute a large batch of tests (roughly 100 tests with an estimated run time of 10+ hours), we rewrote everything to use criterion. The reason was simple: it was the better testing framework.

I think both of these points are probably things that I’ve heard before. Undoubtedly some professor tried to sagely pass on this knowledge during my undergraduate days, probably during the phase when I thought I would never use any of that material again (I’m still crossing my fingers that this holds true for C++). The fact of the matter, though, is that it didn’t really ring true until I ran into these issues myself.

So if you get only one thing out of reading this post it should be the following: Participate in your own optimization experiment and feel free to get “stupid” with it. Hopefully you will learn an enormous amount about the language you are working in. Even more importantly, hopefully you will have a lot of fun.

Thank you for reading and see you next time. Same functional time. Same functional place. Same functional channel.

-Evan