Update 2 June 2011: we’ve since discovered an error in our benchmarking suite. Data.Text was at a disadvantage and actually performs on par with String . So, the Data.Text results below are inaccurate — everything else should be fine AFAWK.

I don’t really know if this is technically the same “functional time”. But maybe by “functional” Evan meant whenever it’s ready — so here goes. I’m Nick. Thanks for reading. I have to apologize ahead of time: I cannot even feign the capability to compete with Evan’s cultural allusions.

A quick recap of the last post: when optimizing a straightforward implementation of the Jaro string distance, we found some success with datatype refinements to IntSet and ByteString (surprise, surprise). But then we learned an inconvenient truth about current GHC profiling, namely that it isn’t to be trusted.

This post includes some more of our experiments and design considerations. In particular, we focus on trying out four alternative types for representing the strings.

tl;w(on’t)r

The following chart summarizes the results; the scalar is an average time for an iteration of the outermost loop of our testing harness and the label is the type used to represent words.

Disclaimer: these are all the same algorithm, with different data structures just plugged in. With algorithms tailored to their interfaces, these datatypes might perform better!

ByteString wins again. In this case, we think the chart indicates that much of that performance benefit is due to its support for O(1) random access. This claim is supported by the comparable success of UArray Int Char . Read on for a more thorough discussion of the highlights:

we couldn’t squeeze any performance gains out of mutability and unsafePerformIO ,

don’t underestimate the benefits of a data declaration with strict fields for your inner loop’s state!, and

Data.Sequence severely underperforms, but we’re not sure why.

Prompted by a comment on the last post by Erik Hesselink (Thanks, Erik!), we included a variant for Bryan O’Sullivan’s Data.Text . We have not investigated why it only performs on par with String (which shares the same time-complexities for the interface we use); the only caveat we’ll mention is the minimality of this algorithm. It’s basically a strict left fold, a length, and random access otherwise. There’s really no opportunity for fusion to kick in that we’ve noticed.

We’d be happy to send our code your way if you’d like — just let us know in the comments. You can dig deeper if you like, but we’re just investigating a “plug-and-play” like use here.

Experimental Setup

As a first step, we identify the essential interface to the String -like datatype for calculating the Jaro distance. This type class is used in the code we actually benchmarked; each instance of Jaro is saturated with INLINE pragmas and the jaro and matches definitions were cut-and-pasted for each datatype in order to avoid dictionary passing at run-time. We couldn’t get all the overhead removed via SPECIALISE — we have yet to determine why not.

class Jaro c where
  fromBS :: ByteString -> c
  iLen   :: c -> Int
  index  :: c -> Int -> Char
  foldl' :: (b -> Char -> b) -> b -> c -> b

  {-# INLINE gLen #-}
  gLen :: Enum n => c -> n
  gLen = toEnum . iLen

The test harness uses a strict ByteString to read in the test data. The fromBS method is used to convert the input word to the appropriate type for the jaro variant under test. We use Control.DeepSeq.rnf in order to exclude the conversion from our measurement of execution time. Otherwise, the methods simply provide access to the length of the word, random-access indexing, and a strict left fold.

The Jaro type class provides all the necessary methods for implementing the Jaro algorithm.

jaro :: (Jaro c, Fractional n, Enum n) => c -> c -> n
jaro s1 s2
  | 0 == m    = 0
  | otherwise = (m / gLen s1 + m / gLen s2 + (m - t) / m) / 3
  where
    m = toEnum mInt

    (x1, x2) = if iLen s2 < iLen s1 then (s2, s1) else (s1, s2)

    (mInt, x1', hits2) = matches x1 x2

    t = (/ 2) $ toEnum $ loop hits2 x1' (0 :: Int)
      where
        loop (i : is) (c1 : cs1) = loop is cs1 . f
          where f = if c1 /= x2 `index` i then succ else id
        loop _ _ = id

The definition of jaro re-orders the input words so that the outer-loop of matches traverses the shorter word. That traversal yields the number of matches, mInt ; the string of matched characters from the short string, x1' ; and the ascending indices of matched characters in the long string, hits2 . The latter two are simultaneously traversed in order to compute the transpositions.
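To make that data flow concrete, here is a minimal, unoptimized reference sketch of the same computation on plain Strings. The names ( jaroRef and its internals) are ours for this post, not the benchmarked code, and the list-based match search is deliberately naive:

```haskell
import Data.List (foldl')
import qualified Data.IntSet as IntSet

-- A naive Jaro similarity on String: find windowed matches, then
-- count transpositions between the two matched-character strings.
jaroRef :: String -> String -> Double
jaroRef a b
  | m == 0    = 0
  | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = fromIntegral (length a)
    n2 = fromIntegral (length b)
    -- traverse the shorter word in the outer loop, as in jaro
    (s1, s2) = if length b < length a then (b, a) else (a, b)
    window = max 0 (max (length a) (length b) `div` 2 - 1)
    len2 = length s2
    -- x1' is the matched characters of the short word, in order;
    -- used holds the matched indices in the long word
    (used, x1') = foldl' step (IntSet.empty, []) (zip [0 ..] s1)
      where
        step (u, acc) (i, c) =
          case [ j | j <- [max 0 (i - window) .. min (len2 - 1) (i + window)]
                   , IntSet.notMember j u
                   , s2 !! j == c ] of
            (j : _) -> (IntSet.insert j u, acc ++ [c])
            []      -> (u, acc)
    m = fromIntegral (IntSet.size used)
    hits2 = map (s2 !!) (IntSet.toAscList used)
    -- each position where the two matched strings disagree is half a transposition
    t = fromIntegral (length (filter id (zipWith (/=) x1' hits2))) / 2
```

On the standard worked example, "MARTHA" versus "MARHTA", this gives six matches and one transposition, so (1 + 1 + 5/6) / 3 = 17/18.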

The matches function consists of a left fold over the short input word, which carries the following state type, St . The data declaration uses strictness annotations (“bangs”) on its fields in order to specify that they should never contain thunks. Along with the -funbox-strict-fields compilation flag, this technique pushes these values nearer to the CPU at run-time; they are the data manipulated by our program’s hot-spot, and any access-time overhead is costly.

data St = St { idx1, m :: !Int, mk :: !(String -> String), used :: !IntSet }

{-# INLINE matches #-}
matches :: Jaro c => c -> c -> (Int, String, [Int])
matches s1 s2 = (m, mk [], IntSet.toAscList used)
  where
    len2 = iLen s2
    window = max 0 (max (iLen s1) len2 `div` 2 - 1)

    St {m, mk, used} = foldl' snoc nil s1
      where
        nil = St { idx1 = 0, m = 0, mk = id, used = IntSet.empty }

        snoc st@St{..} c1 =
          (case inline List.find
                  (\idx -> IntSet.notMember idx used && c1 == s2 `index` idx)
                  [lo .. hi] of
             Nothing   -> st
             Just idx2 -> st { m = m + 1, mk = mk . (c1 :)
                             , used = IntSet.insert idx2 used }
          ) { idx1 = idx1 + 1 }
          where
            lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The INLINE pragma merely mitigates the function call overhead; we’re not aware of any optimizations it enables within jaro .

This particular definition of jaro and matches was the result of iterating on the various data and slight algorithmic choices when refining the original specification from the previous post. The primary differences are some manually eliminated common subexpressions, some explicit inlining via GHC.Exts.inline , and some trivial changes to the types.

As a reminder, profiling was unfortunately of little use because of its current detrimental interactions with optimization, as mentioned in this GHC ticket. On our available machines, this bug either prevented any timing information or caused invalid compilation! Thus, we only had our own coarse-grained wall-clock time measurements to work with, computed carefully with Data.Time.getCurrentTime and Control.DeepSeq.rnf .
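The measurement pattern we fell back on looks roughly like this (a sketch; timed is our name for illustration, and the real harness also repeats and averages runs):

```haskell
import Control.DeepSeq (NFData, rnf)
import Data.Time (diffUTCTime, getCurrentTime)

-- Force the value to normal form inside the timed window, so that
-- lazy evaluation cannot defer the work we are trying to measure.
timed :: NFData a => a -> IO Double
timed x = do
  t0 <- getCurrentTime
  rnf x `seq` do
    t1 <- getCurrentTime
    return (realToFrac (t1 `diffUTCTime` t0))
```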

Jaro instances

The instances for Jaro are crucial to the performance of the jaro variant that uses the datatype. Here, we have omitted the INLINE pragmas that saturate these instance declarations for the sake of presentation, but they are crucial to avoiding dictionary passing at run time. Also, recall that the execution of fromBS is not included in the measurements.

For ByteString , String , and Text , the instances are immediate.

instance Jaro ByteString where
  fromBS = id
  iLen   = BS.length
  index  = BS.index
  foldl' = BS.foldl'

instance Jaro [Char] where
  fromBS = BS.unpack
  iLen   = length
  index  = (!!)
  foldl' = List.foldl'

instance Jaro T.Text where
  -- NB gets special input treatment instead
  fromBS = error "Text from BS"
  iLen   = T.length
  index  = T.index
  foldl' = T.foldl'

The functions used in these instances are defined in heavily optimized libraries. Surprisingly, this is apparently not the case for Seq Char .

instance Jaro (Seq Char) where
  fromBS = Seq.fromList . BS.unpack
  iLen   = Seq.length
  index  = Seq.index
  foldl' = \snoc nil -> List.foldl' snoc nil . Fold.toList

The foldl' definition via the conversion to String consistently outperforms those via Data.Sequence.foldlWithIndex and Data.Foldable.foldl' by about 0.1 seconds, which is a degradation of about 15% to 50% compared to the execution times of all of our tests in this experiment. That would seem to merit some investigation; we haven’t any theories yet.
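For reference, the two library-based alternatives we measured against look roughly like this (a sketch; the helper names are ours, not from the benchmarked code):

```haskell
import qualified Data.Foldable as Fold
import qualified Data.Sequence as Seq

-- via Data.Sequence.foldlWithIndex, discarding the index argument
foldlViaIndex :: (b -> Char -> b) -> b -> Seq.Seq Char -> b
foldlViaIndex snoc = Seq.foldlWithIndex (\acc _ c -> snoc acc c)

-- via the generic strict left fold from Data.Foldable
foldlViaFoldable :: (b -> Char -> b) -> b -> Seq.Seq Char -> b
foldlViaFoldable = Fold.foldl'
```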

instance Jaro (UArray Int Char) where
  fromBS = \bs -> IA.listArray (0, BS.length bs - 1) (BS.unpack bs)
  iLen   = IA.numElements
  index  = (IA.!)
  foldl' = \snoc nil arr ->
    let (lo, hi) = IA.bounds arr
        w !idx !acc
          | hi < idx  = acc
          | otherwise = w (succ idx) (snoc acc (arr IA.! idx))
    in w lo nil

criterion Harness

Our main function reads in the input file as a ByteString and then immediately prepares some thunks for converting that data into the various pairs for the cross product of datatypes (one for the short word, one for the long). We also use Data.Text ’s lazy IO in order to thunk its input until it’s needed; we don’t convert it from a ByteString , since Text supports richer encodings.

type C0 = ByteString
type C1 = UArray Int Char
…

main = do
  ns <- BS.lines `fmap` BS.readFile wordsFile
  ns_text <- (map (id &&& id) . T.lines . LT.toStrict) `fmap` LT.readFile wordsFile

  let ns00 = map (fromBS &&& fromBS) ns :: [(C0, C0)]
      ns01 = map (fromBS &&& fromBS) ns
      ns02 = map (fromBS &&& fromBS) ns
      …
      ns33 = map (fromBS &&& fromBS) ns :: [(C3, C3)]

We basically hand everything off to criterion from here. We use rnf “in the right places” in order to force the conversion to take place outside of the execution window that criterion measures. Each bgroup corresponds to one of the sixteen pairs of representation types (e.g. ByteString and UArray Int Char ) and contains all of the benchmarks that use that input data. Combined with the cfgPerformGC option to criterion , this should ensure that only the ByteString for the entire file and one conversion of that data are resident in memory at any given time.

  let cfg = defaultConfig { cfgPerformGC = ljust True
                          , cfgSummaryFile = ljust "crit.csv" }

  defaultMainWith cfg (return ())
    [ bgroup "00"
        [ bench "jaro0"  $ rnf ns00 `seq` whnfIO (critrun jaro0  ns00)
        , bench "jaro0'" $ whnfIO (critrun jaro0' ns00)
        , … {- a bench for each variant that uses ns00 -} ]
    , bgroup "01" [ bench "jaro01" $ rnf ns01 `seq` whnfIO (critrun jaro01 ns01) ]
    , bgroup "02" [ bench "jaro02" $ rnf ns02 `seq` whnfIO (critrun jaro02 ns02) ]
    , … {- a bgroup for each input type except Text -}
    , bgroup "text"
        [ bench "jaro_text" $ rnf ns_text `seq` whnfIO (critrun jaro_text ns_text) ]
    ]

We compile with

ghc -O2 -funbox-strict-fields -fforce-recomp -fspec-constr-count=6 --make Main

We just bumped the SpecConstr count until the warnings went away; 6 didn’t seem offensively large, and there were two call patterns at most (recall that GHC reduces the count as part of its heuristic).

Results and Discussion

Measurements

We used the 8K /usr/share/dict/propernames file from our Mac OS X distributions to run the tests, since Jaro-Winkler is designed primarily to work on names. Our test harness traverses the diagonalization of the names in this file and sums the similarity metrics between all of the words (this diverges from the first post’s test harness, but it’s much faster since we’re running so many variants and iterations). The result is a nonsensical number, but it forces the necessary evaluation, has a small memory footprint, and serves as a hash of sorts of the file’s contents. All of the jaro variants computed the same sum (without the profiling, that is), so we’re confident they are equivalent. Since this work was an exercise in autodidacticism, it did not inspire us to compare for correctness against the golden C implementation of Jaro-Winkler (linked to by the Wikipedia page referenced above).
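The shape of that harness, roughly: diagonalize the word list into unordered pairs and fold the metric over them ( allPairs and checksum are our names for illustration; the real harness sums jaro similarities):

```haskell
import Data.List (foldl', tails)

-- every unordered pair of distinct words, as in the harness's diagonalization
allPairs :: [a] -> [(a, a)]
allPairs xs = [ (a, b) | (a : rest) <- tails xs, b <- rest ]

-- sum a metric over all pairs; a nonsensical number, but it forces evaluation
checksum :: (a -> a -> Double) -> [a] -> Double
checksum f = foldl' (\acc (a, b) -> acc + f a b) 0 . allPairs
```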

It is well understood that standard Haskell String s limit performance; this is why ByteString was written in the first place. For this experiment, we primarily focused on various data structures for representing the words during the two loops in matches . Since each loop accesses the respective argument differently (short word – outer loop, long word – inner loop), we tried all sixteen combinations of our 4 datatypes of interest. Direct benchmarking via Data.Time.getCurrentTime and Control.DeepSeq.rnf yielded the following measurements.

Outer loop       Inner loop       Measured time (s)
ByteString       ByteString       0.22892195665577
ByteString       UArray Int Char  0.3117424055312
ByteString       String           0.4186133953307
ByteString       Seq Char         0.62530112468937
UArray Int Char  ByteString       0.32210345232228
UArray Int Char  UArray Int Char  0.33752040827015
UArray Int Char  String           0.45714450323322
UArray Int Char  Seq Char         0.68065535986164
String           ByteString       0.37693604194859
String           UArray Int Char  0.40947800600269
String           String           0.49661373102406
String           Seq Char         0.74052640163639
Seq Char         ByteString       0.53333888256291
Seq Char         UArray Int Char  0.58034939014652
Seq Char         String           0.68305074178913
Seq Char         Seq Char         0.8523849221442

The Datatypes

The graph from the introduction tells the big story: ByteString is best. We chose the other types in order to explore our straightforward and simple hypotheses. We included String since it’s the most familiar and default type in Haskell for words. We were surprised it fared so well. UArray Int Char was intended as a less optimized version of ByteString . We presumed that its implementation is basically the same with fewer possible rewrites and slightly worse access-times, and so we expected a constant factor of overhead compared to ByteString . Our use of Data.Sequence is slightly vestigial: an earlier algorithmic variant used its queue-like interface (O(1) tail and snoc) to slide the matching window across the second string. Without profiling, we haven’t the means to determine what’s “wrong” with Data.Sequence ; the time complexities listed in its documentation suggest it should do better than String .

In order to better compare the types, we chart the times with a fixed type for the short word and average across the measured times for each of the possible types for the longer word. In other words (har), we fix the type of the outer loop and vary the type of the inner loop.

Next, we do the opposite: fix the inner loop type and vary the outer loop type, again averaging the results to a single scalar.

Without going into statistics, the best we can do is eyeball it. It does seem that the “slope” of the second graph is steeper, which indicates that the datatype used for the inner loop has a greater effect on performance. That sounds reasonable.

Algorithmic Variations

Strict Tuple Patterns

Unaccustomed to low-level optimizing, we didn’t immediately remember to employ a data declaration with strict, unboxed fields. Its performance benefit at the time was startling: a factor of 2 or 3. So we’ve retained variants of jaro that use simple tuples saturated with strictness annotations, just to see the difference.

Datatype         Measured time (s)  Degradation
ByteString       0.35290416443089   54.16%
UArray Int Char  0.45579727375248   35.04%
String           0.61194113933781   23.22%
Seq Char         0.99888140403965   17.19%

The data declaration with strict fields is clearly crucial.
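The contrast is easy to reproduce in miniature. A left fold that carries its state in a data type with strict fields (here P and sumLen , our toy names) keeps the accumulator thunk-free, where a plain lazy pair would accumulate chains of suspended applications:

```haskell
import Data.List (foldl')

-- strict fields: with -funbox-strict-fields, both Ints are stored unboxed
data P = P !Int !Int

-- sum and length in one pass; the accumulator never holds a thunk,
-- because constructing P forces both fields at every step
sumLen :: [Int] -> (Int, Int)
sumLen xs = case foldl' step (P 0 0) xs of P s n -> (s, n)
  where step (P s n) x = P (s + x) (n + 1)
```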

Mutable bit array

All of the measured variants discussed above use a Data.IntSet to track which of the characters in the second string have already been matched. Since this is a persistent data structure — even though it’s so tightly packed and efficient — we thought that changing it to a mutable bit array would be a big win. However, all of our attempts at doing so (including ST , unsafePerformIO , re-using a singly allocated array to allow Data.ByteString.Internal.inlinePerformIO within the loop, and cloning Data.ByteString.foldl' to work with a monadic snoc ) failed to perform as well as ByteString + IntSet . We peeked at the Core enough to suppose that there were bounds checks interfering with some optimizations; the IntSet code had 10 instances of the loops whereas the array code had only 4. After eliminating the bounds checks (with some really low-level coding to circumvent the Data.Array API), there were a matching 10 instances of the loops in the array-based code, but it remained less performant. Without profiling and a better understanding of Core, we’re unable to determine why.
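For concreteness, the mutable used-set in those variants had roughly this shape (a reconstruction for illustration, not the benchmarked code; countNewMarks is our name):

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- Mark each index in a mutable Bool array, counting how many were new.
-- This plays the role that IntSet.notMember/IntSet.insert play in the
-- pure version of matches.
countNewMarks :: Int -> [Int] -> Int
countNewMarks n idxs = runST $ do
  used <- newArray (0, n - 1) False :: ST s (STUArray s Int Bool)
  let go !acc []       = return acc
      go !acc (i : is) = do
        seen <- readArray used i
        if seen
          then go acc is
          else writeArray used i True >> go (acc + 1) is
  go 0 idxs
```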

The variant with the singly allocated array and using only Data.ByteString.Internal.inlinePerformIO within the inner loop had a measured time of 0.3209107s, which is a 40.18% performance degradation.

Inlining List.find

In the definition of matches , we use GHC.Exts.inline to force the inlining of Data.List.find . Since it’s a library function, this is the only way we are aware of to force it to be inlined. Without inlining it, the ByteString jaro averages 0.29225487196186, which is a 27.67% performance degradation. We surmise that its unfolding is enabling a fusion rule for filter and enumFromTo .

Having never worked this hard at optimizing a function, we were surprised that a use of a standard function like List.find wasn’t optimally handled by GHC automatically. In particular, we thought it was very likely to unfold functions with such short definitions; perhaps the library must mark it INLINABLE with a pragma? It’s odd that such a standard library wouldn’t. Looking now at the definition of Data.List.find , it seems it’d be more efficient to case directly on the result of the filter instead of converting to a Maybe . Is listToMaybe INLINE d even though it has no such pragma? (At the time of writing, those links hit base-4.3.1.0 .)
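The alternative we have in mind would case on the filter result directly, skipping the intermediate listToMaybe ( findDirect is our name; it computes the same result as Data.List.find , which is essentially listToMaybe composed with filter):

```haskell
-- case directly on the lazily filtered list; only elements up to the
-- first match are ever examined, same as Data.List.find
findDirect :: (a -> Bool) -> [a] -> Maybe a
findDirect p xs = case filter p xs of
  (x : _) -> Just x
  []      -> Nothing
```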

Eliminating the double traversal

The counting of transpositions in jaro traverses the strings of matched characters. With respect to the short string, the characters occur in the exact same order within the outer loop of matches . These characters are essentially traversed twice, and — presumably worse still — a list is constructed in between, bound to x1' in jaro .

We wrote a version using circular programming to avoid the construction of the string of matched characters from the short word. The resulting fold counts the transpositions “as” it determines the matches.

data St n = St { idx1, m :: !Int, t2 :: n, used :: !IntSet, hits2 :: String }

matches_t :: (Jaro c, Fractional n, Enum n) => c -> c -> (Int, n)
matches_t s1 s2 = (m, t2 / 2)
  where
    window = max 0 (max (iLen s1) len2 `div` 2 - 1)
    len2 = iLen s2

    St {m, used, t2} = foldl' snoc nil s1
      where
        nil = St { idx1 = 0, m = 0, t2 = 0, used = IntSet.empty
                 , hits2 = map (index s2) (IntSet.toAscList used) }

        snoc st@St{..} c1 =
          (case inline List.find
                  (\idx -> IntSet.notMember idx used && c1 == s2 `index` idx)
                  [lo .. hi] of
             Nothing   -> st
             Just idx2 -> st { m = m + 1
                             , hits2 = tail hits2
                             , used = IntSet.insert idx2 used
                             , t2 = (if c1 /= head hits2 then succ else id) t2 }
          ) { idx1 = idx1 + 1 }
          where
            lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The circularity is between the outermost binding of used and the initial value of the hits2 field in the nil state. Divergence is avoided because, during the traversal, none of the other fields depend on hits2 or t2 , and those two fields of St are not strict. As a result, those fields accumulate as thunks, which collapse only once t2 is forced. This ultimately causes the rest of the traversal to take place first.
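The same knot-tying trick in miniature (a toy of ours, not from the benchmarked code): the total s is consumed inside the very fold that produces it, and nothing diverges because the tuple is lazy and no consumed field depends on s during the traversal:

```haskell
-- replace each element by (total - element) in a single traversal;
-- s is defined by the fold and used lazily inside its step function
subtractAll :: [Int] -> (Int, [Int])
subtractAll xs = (s, ds)
  where
    (s, ds) = foldr step (0, []) xs
    step x (acc, rest) = (x + acc, (s - x) : rest)
```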

With ByteString s, this algorithm and the primary algorithm perform the same at optimization level -O1 . We surmise that the thunks that build up are operationally comparable to the construction of the list of matched characters, x1' . That would still eliminate the redundant traversal, but for the fact that the thunks merely accumulate during the fold; their collapse happens only once t2 is forced, and is effectively a second traversal. Furthermore, with -O2 the circular program is significantly less performant than the primary algorithm; we surmise that the circularity prevents some of the more aggressive optimizations.

Though the circular programming removes the redundant traversal from the code, it fails to eliminate it from the execution. We surmise the existence of a much more invasive alteration of the primary algorithm where the outer loop counts the transpositions for the prefix of the longer string that is beyond the lower limit of the matching window for the shorter string’s last character. This ought to reduce the redundancy of the second traversal. We have not yet implemented it.

“Surmise, surmise, surmise”, you say. We’re trying to not look at the generated Core, especially since -O2 does a lot of unfolding. Perhaps heap profiling would be a good indicator? … You know, we do have research to do.

criterion

We jacked our code into Bryan O’Sullivan’s criterion package late in our process. During development, we instead did simple timing with Data.Time.getCurrentTime over a handful of iterations, for quick turnaround.

For the numbers and graphs in this post, though, we ran everything in a single criterion harness. Fortunately, comparing the two data sets at the shared data points validates our simple method.

Phew! Everything just about lines up. The criterion benchmarking was run overnight, with no other processes; we assume that sort of OS-level inequity in the execution environment is the source of the seemingly constant offset.

Conclusions & Questions

We kind of already concluded above. ByteString was best. We’re surprised we couldn’t beat IntSet by using a mutable array — that’s probably our most interesting result. But all of the IO-escaping (even via inlinePerformIO ) seemed to inhibit some key optimizations. Maybe with the right refactoring and compilation flags, the mutable bitarray would out-perform IntSet . Suffice it to say it apparently wouldn’t be a small, obvious change.

We also didn’t try the other unboxed array packages. We just hadn’t heard of them yet. So maybe there’s something else out there for mutable bitarrays that would outperform IntSet and not block those aggressive optimizations as we’re supposing was happening with our attempts. Please let us know!

If you’re interested in getting access to the code we used:

Leave a comment saying so! We’ll need some motivation to package it up since the semester is over…

Brace yourself! There’s a lot of “duplicate” code. We manually circumvented SPECIALIZE pragmas, since our initial tests indicated some overhead. Also, our naming conventions got a little frantic at the end there.

Maybe we’ll clean it up before posting it if you ask nice.

If you’re just after the fastest jaro definition (in order to try to beat it, perhaps? Hmm?), just wait for the next post. We’d love it if someone could show us some further tricks! We’d also really like to know why we’ve failed to speed it up with the mutable bit array.

Also, we are concerned about the apparent SPECIALIZE overhead we noticed. Is this a known issue? Are we missing something?

This point about missing flags deserves repetition. In fact…

Disclaimer: we could have overlooked crucial INLINE s or compilation flags.

Those sorts of things could really affect the results we’re getting and so possibly invalidate some of our conclusions (e.g. those regarding inlining List.find or Seq Char ’s slow folds). Unfortunately, we can’t even find Acovea on the Internets and aren’t otherwise familiar with the multitude of GHC flags. … Help?

Thanks again for reading! Our next post will discuss some more invasive algorithmic changes; there’s more we can do to speed up jaro , albeit not by much. Fuse, fuse, fuse.