This is the last episode of a series on optimizing Haskell programs.

More careful measurements

The gold standard for benchmarking in Haskell is the criterion library. For this series I did not bother writing smaller test cases that would play well with it, and my measurements have not been rigorous. For this post, I ran each sample ten times and computed an average. This is still not ideal, but it should suffice to give a feeling for the performance.

However, I don’t have measurements for the last iteration of the program. I waited until I had time to collect them, but for various reasons that never happened. Gauging the performance gains of the last version is left as an exercise for the reader!

It turned out that the state of the program at the end of the previous episode was much better than I expected! It performed its task in 2.712 seconds, which is just as fast as the discarding C program. Now let’s beat it!

Playing with compiler options

Finding the precise set of compiler options that will produce a fast program is more art than science. Some people use genetic algorithms to find the best set of compiler flags, but I am only going to focus on a single flag here: -fllvm. It can do wonders for tight loops and numerical code, but it can also slow down both the compilation process and the resulting program.

I have found it best to just test it on performance critical code, and keep it if there was a significant gain.

In this case, the gain is about 3%, which would not normally be enough for me to keep the flag. But this series is about performance, so I will keep this nice “free” speedup.
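Concretely, enabling it is just a compile flag (file and program names here are hypothetical); note that a compatible LLVM toolchain must be installed and on the PATH:

```shell
# Hypothetical names; -fllvm routes code generation through LLVM.
ghc -O2 -fllvm Main.hs -o program
./program input.txt
```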

Alternative encodings of the monad

An optimization technique that is fun and sometimes very effective is to move the critical parts of the computation to a continuation-passing-style implementation. In our case, that is the Parser monad.

I am used to calling this process Church encoding, but judging from the responses on Reddit that might not be correct. I have been pointed to a paper describing how to systematically derive what is called the Boehm-Berarducci encoding. Despite my claims I have not read it (yet), so there might be inaccuracies in my terminology.

As an example, here is how Maybe and Either are usually encoded:

> newtype MaybeChurch a = MaybeChurch
>   { runMaybe :: forall x. x -> (a -> x) -> x }
>
> newtype EitherChurch l r = EitherChurch
>   { runEither :: forall x. (l -> x) -> (r -> x) -> x }

Those types look a lot like the following functions:

> maybe :: b -> (a -> b) -> Maybe a -> b
> either :: (a -> c) -> (b -> c) -> Either a b -> c

And that is not surprising: for simple ADTs like these, you “just” have to write their fold to get their Church encoding (by the way, it is a fun exercise to newtype them and try to write the usual instances).
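A sketch of that exercise for MaybeChurch, with helper names of my own choosing (nothingC, justC, toMaybe), not taken from the series:

```haskell
{-# LANGUAGE RankNTypes #-}

newtype MaybeChurch a = MaybeChurch
  { runMaybe :: forall x. x -> (a -> x) -> x }

-- The two "constructors": pick the first or second continuation.
nothingC :: MaybeChurch a
nothingC = MaybeChurch $ \n _ -> n

justC :: a -> MaybeChurch a
justC a = MaybeChurch $ \_ j -> j a

instance Functor MaybeChurch where
  fmap f (MaybeChurch m) = MaybeChurch $ \n j -> m n (j . f)

instance Applicative MaybeChurch where
  pure = justC
  MaybeChurch mf <*> mx = MaybeChurch $ \n j ->
    mf n (\f -> runMaybe (fmap f mx) n j)

instance Monad MaybeChurch where
  MaybeChurch m >>= f = MaybeChurch $ \n j ->
    m n (\a -> runMaybe (f a) n j)

-- Converting back and forth is just `maybe` in disguise.
toMaybe :: MaybeChurch a -> Maybe a
toMaybe m = runMaybe m Nothing Just

fromMaybeC :: Maybe a -> MaybeChurch a
fromMaybeC = maybe nothingC justC
```

Note how toMaybe simply feeds the ordinary Nothing and Just constructors as the two continuations.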

But our Parser type is a bit more complicated, as it really is a function:

> newtype Parser a = Parser
>   { runParser :: ByteString -> Maybe (a, ByteString) }

I don’t know how to proceed mechanically here, so I am going to try to explain my intuition. The original parser takes a ByteString and either stops the computation with an empty result (Nothing) or returns a result along with the remaining input.

Once transformed, it will not directly return anything, but will pass the result of the current computation to a pair of continuation functions, one for each case:

> newtype Parser a = Parser
>   { runParser :: forall r. ByteString -> nothing -> just -> r }

Now, both the nothing and just parts must be functions that return an r. The nothing part handles errors. The just part keeps on parsing, based on the result of the current function; it must be fed the remaining input and said result:

> newtype Parser a = Parser
>   { runParser :: forall r. ByteString -> (error -> r) -> (ByteString -> a -> r) -> r }

But as a final optimisation, note that we do not care about fancy error messages in the “failure” case, so the final form will be:

> newtype Parser a = Parser
>   { runParser :: forall r. ByteString -> r -> (ByteString -> a -> r) -> r }
>   deriving Functor

For a parser with verbose error messages, a suitable type could be:

> newtype Parser a = Parser
>   { runParser :: forall r. ByteString
>                -> (String -> r) -- ^ Failing case, with a String error message
>                -> (ByteString -> a -> r)
>                -> r }
>   deriving Functor

Now, most of the combinators are trivial to write:

> takeWhile1 :: (Char -> Bool) -> Parser ByteString
> takeWhile1 prd = Parser $ \s failure success -> case BS8.span prd s of
>   ("", _) -> failure
>   (a, b)  -> success b a
>
> char :: Char -> Parser ()
> char c = Parser $ \input failure success ->
>   if BS.null input
>     then failure
>     else if BS8.head input == c
>       then success (BS.tail input) ()
>       else failure

And running the parser is easy too:

> parseOnly :: Parser a -> ByteString -> Maybe a
> parseOnly (Parser p) s = p s Nothing $ \b a -> if BS.null b
>   then Just a
>   else Nothing

Now, one of the fun properties of Haskell is that deriving the Applicative and Monad instances can be done as an exercise in type Tetris, where you have no idea what the code you write actually means, but it works fine once it typechecks:

> instance Applicative Parser where
>   pure a = Parser $ \b _ s -> s b a
>   Parser pf <*> Parser px = Parser $ \input failure success ->
>     let succ' input' f = px input' failure (\i a -> success i (f a))
>     in pf input failure succ'
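The Monad instance falls out of the same game of type Tetris. A self-contained sketch (the Functor instance is spelled out by hand here, and char is slightly condensed via uncons, so this block compiles on its own):

```haskell
{-# LANGUAGE RankNTypes #-}
import           Data.ByteString       (ByteString)
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BS8

newtype Parser a = Parser
  { runParser :: forall r. ByteString -> r -> (ByteString -> a -> r) -> r }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \input failure success ->
    p input failure (\i a -> success i (f a))

instance Applicative Parser where
  pure a = Parser $ \b _ s -> s b a
  Parser pf <*> Parser px = Parser $ \input failure success ->
    let succ' input' f = px input' failure (\i a -> success i (f a))
    in pf input failure succ'

instance Monad Parser where
  -- Run px; on success, run the parser produced by f on the rest of the input.
  Parser px >>= f = Parser $ \input failure success ->
    px input failure (\input' a -> runParser (f a) input' failure success)

-- A condensed version of the char combinator from above.
char :: Char -> Parser ()
char c = Parser $ \input failure success ->
  case BS8.uncons input of
    Just (x, rest) | x == c -> success rest ()
    _                       -> failure

parseOnly :: Parser a -> ByteString -> Maybe a
parseOnly (Parser p) s = p s Nothing $ \b a ->
  if BS.null b then Just a else Nothing
```

Note that the failure continuation is simply threaded through unchanged everywhere: failing anywhere aborts the whole parse, exactly like Nothing short-circuits the original Maybe-based parser.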

While this gives no speed gain with the usual compiler options, the performance under llvm is slightly better (2.56s vs 2.63s). This technique can do wonders in many use cases, but is clearly not that great here.

Oh well, this was fun anyway 🙂

Parallel execution

In the previous episode I moved from a parsing function that would handle the whole input to a parsing function that would only handle a single line. This function being pure, it should be very easy to parallelize its execution.

In practice, I had to change this part of the program:

> mapM findline

into:

> traverse id . withStrategy (parListChunk 2000 rseq) . map findline

Note that traverse id is sequence, but for some reason I didn’t go with it… This strategies machinery is very well covered in Simon Marlow’s book (Parallel and Concurrent Programming in Haskell), which I recommend even if you don’t believe you really care about these topics. It is a really good book!

Basically this cuts the list into chunks of 2000 items and creates a spark for each chunk. Each spark can then be evaluated on a separate core.
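The same chunking idea can be sketched with just par and pseq from base (GHC.Conc), without the parallel package; parListChunk does this more carefully, and rseq controls how deeply each element is evaluated. The helper names below are mine, not from the series:

```haskell
import GHC.Conc (par, pseq)

-- Split a list into chunks of n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Force every element of a chunk to weak head normal form
-- (roughly what rseq per element gives us).
forceChunk :: [a] -> ()
forceChunk []       = ()
forceChunk (x : xs) = x `pseq` forceChunk xs

-- Map f over xs, sparking the evaluation of each chunk so the
-- runtime may evaluate chunks on different cores.
parMapChunked :: Int -> (a -> b) -> [a] -> [b]
parMapChunked n f xs = go (map (map f) (chunksOf n xs))
  where
    go []       = []
    go (c : cs) = let rest = go cs
                  in forceChunk c `par` (rest `pseq` (c ++ rest))
```

The result is deterministic whether or not the sparks actually run in parallel; only the evaluation order changes.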

In practice, the gain depends heavily on the sequential parts of the program and on the complexity of the parallel chunks of work. The latter means that sometimes the overhead of the parallel runtime outweighs the work itself, so algorithms that look like they should parallelize well sometimes don’t, and you should benchmark your programs with several values of -N (the runtime option that governs how many cores are used).

For this program, I found out that a value of -N6 gave the best results. The program runs in 1.95s with this value.
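A simple way to find such a value (program and input names here are hypothetical) is to compile with the threaded runtime and time a few core counts:

```shell
# Hypothetical names; -threaded and -rtsopts are needed for +RTS -N to work.
ghc -O2 -threaded -rtsopts Main.hs -o program
for n in 1 2 4 6 8; do
  echo "-N$n:"
  time ./program +RTS -N"$n" -RTS input.txt
done
```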

There isn’t much speedup, and it is probably related to the fact that the program does the following sequential operations:

File loading

Parsing

Map creation

Printing the map size

I tried going with lazy IO to increase the productivity of the program, but this worsened the performance. I did not try a streaming library because I suspect the increased complexity will not be worth it.

Performance summary

Here is a breakdown of everything that has been discussed in this series. It should be quite obvious that the alternative encoding of the parser could have been left out. It would also be interesting to measure how fast the final version would have been while keeping attoparsec. Introducing a custom parser did wonders for the performance of the program, but it could become technical debt if more complex features were ever required.

In the end, the Haskell program could be made faster than a truncated C program that did not do as much work. You will have to take my word for it, as I will not release the C program, but I did not pick a bad program just to get an easy win.

Program version        Time spent (s)   Memory usage (MB)   Gain over first version (CPU / RAM)
parsec                 95.741           14542               n/a
attoparsec             19.609            3154               4.88x / 4.61x
thyme                  18.302            1975               5.23x / 7.36x
bytestring             13.890            1959               6.89x / 7.42x
BangPatterns            8.705             406               11x / 35.82x
parseDTime              7.7               406               12.43x / 35.82x
Custom parser           5.628             382               17.01x / 38.07x
Break                   4.754             396               20.14x / 36.72x
Inlining                2.712             396               35.3x / 36.72x
With LLVM               2.63                                36.4x
Church                  2.71                                35.3x
Church LLVM             2.561                               37.4x
Parallel -N2            2.51                                38.1x
Parallel -N4            2.13                                44.9x
Parallel -N6            1.95                                49.1x
original C (mem)       12
original C (discard)    2.7

Conclusion

This series was meant as a way to showcase some simple and effective optimization techniques, such as

Using efficient libraries. For parsing I suggested attoparsec, but I would now point to something like the parsers package, as it has a great interface and is really a front-end to several other parsers (attoparsec, parsec and trifecta). That way you can have a “slow” version with good error messages and a “fast” one powered by attoparsec (or your own parser!) from the same code base.

Using BangPatterns , even if a bit randomly, while roughly benchmarking.

Profiling to find the hot spots. This one is obvious, but it is not the first thing I usually do when optimizing. Also be warned that the cost centres might not be accounted for as you would expect when using aggressive optimizations or continuation passing style.

Replacing a well-used library with custom, more specialized code. This should be done only as a last resort in production code, as it reduces maintainability (compared to a well-known library with many useful features).

Experimenting with compiler flags. This can give some “free” speedups, but can also negatively impact compilation times.

Embarrassingly parallel tasks are embarrassingly easy to parallelize in Haskell.

One technique that didn’t work was the alternative encoding of the custom parser. This is a technique I use a lot, but I suspect mainly because I like the soothing feeling of successful type-Tetris.

Happy optimizing!