While I was in New York for QCon last week, the temperatures started out quite mild, but soared back to their usual sweltering summertime levels by midweek. I thus found myself confined to my hotel room for a few hours one afternoon, feeling grateful for the air conditioning.

As I waited for the sun to go down so that I might venture out in the heat without being cooked by both air and glare, I decided to return my attention to the Haskell text library for the first time in a while.

This library is in a happy state of feature and bug stability, and it has grown in popularity to the point where it is now one of the ten most used libraries on Hackage (counting the number of other packages depending on it).

library        # deps
base             4757
containers       1490
bytestring       1368
mtl              1243
directory         742
filepath          669
array             552
transformers      502
parsec            491
text              454

Very pleasing, right? But there are always improvements to be made, and I came across a couple of nice candidates after a little inspection.

I’m going to talk about one of these in some detail, because it’s worth digging into a worked example of how to further improve the performance of code that is already fast, and this is nice and brief.

The easiest of these candidates to explain is the function for decoding a bytestring containing UTF-8 text. Generally when we’re decoding UTF-8, the strings we use are relatively large, and so the small memory savings below will not usually be significant.

However, the optimisation I discuss below also appears in other performance-sensitive contexts within the library where small strings are common, so it is definitely meaningful. For instance, the code for converting between the internal Stream type and a Text admits the same optimisation.

Somewhat simplified, the beginning of our UTF-8 decoding function reads as follows:

decodeUtf8With :: OnDecodeError -> ByteString -> Text
decodeUtf8With onErr bs = textP ary 0 alen
  where
    (ary, alen) = A.run2 (A.new (length bs) >>= go)
    go dest = {- the actual decoding loop -}

The A.run2 function referred to above is important to understand. It runs an action that returns a mutable array, and it “freezes” that array into an immutable array.

run2 :: (forall s. ST s (MArray s, a)) -> (Array, a)
run2 k = runST (do
  (marr, b) <- k
  arr <- unsafeFreeze marr
  return (arr, b))
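To see the pattern in isolation, here is a self-contained analogue of run2 built on the standard array package, with STUArray and UArray standing in for text's internal MArray and Array types (the demo action and its contents are made up for illustration):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, (!))
import Data.Array.Unsafe (unsafeFreeze)

-- Run an ST action that yields a mutable array plus an extra result,
-- then freeze the array in place, without copying, into an immutable one.
run2 :: (forall s. ST s (STUArray s Int Int, a)) -> (UArray Int Int, a)
run2 k = runST (do
  (marr, b) <- k
  arr <- unsafeFreeze marr  -- safe only because marr is never written again
  return (arr, b))

demo :: (UArray Int Int, Int)
demo = run2 (do
  marr <- newArray (0, 2) 0   -- three slots, initially zero
  writeArray marr 0 7
  writeArray marr 1 8
  return (marr, 2))           -- pretend we filled only two slots
```

The unsafeFreeze is what makes this cheap: it reinterprets the mutable array as immutable in place, which is sound only if the mutable array is never touched again afterwards.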

Our go function is given an initial mutable array allocated by A.new, fills it with text, and returns the final mutable array (which may have been reallocated) and its size.

I was curious about the performance of this code for very short strings, based on visual inspection alone.

The run2 function forces a tuple to be allocated by its callee, and both of the values in that tuple must be boxed.

It then allocates another tuple, containing one new boxed value.

That tuple is immediately deconstructed, and its components are passed to textP. This is a simple function that ensures that we always use the same value for zero-length strings, to save on allocation.
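The idea behind textP can be sketched as follows. Text' and textP' here are hypothetical stand-ins for illustration only, not the library's real definitions:

```haskell
-- A toy text type: payload, offset, length.
data Text' = Text' [Int] Int Int deriving (Eq, Show)

-- One shared empty value: every zero-length result reuses this single
-- constructor instead of allocating a fresh one each time.
empty' :: Text'
empty' = Text' [] 0 0

-- The smart constructor checks the length and substitutes the shared
-- empty value when the string is empty.
textP' :: [Int] -> Int -> Int -> Text'
textP' arr off len
  | len == 0  = empty'
  | otherwise = Text' arr off len
```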

Because the caller of decodeUtf8With may not demand a result from textP , it is possible that the result of this function could be an unevaluated thunk.

For very short inputs, the overhead of allocating these tuples and boxing their parameters (required because tuples are polymorphic, and polymorphic values must be boxed) worried me as potentially a significant fraction of the "real work". (For longer strings, the actual decoding will dominate, so this becomes less of an issue. But short strings are common, and should be fast.)

But first, can we be sure that those tuples really exist once the compiler has worked its magic? GHC has for a long time performed a clever optimisation called constructed product result analysis, or CPR. This tries to ensure that if a function returns multiple results (e.g. in a tuple or some other product type), it can (under some circumstances) use machine registers, avoiding boxing and memory allocation. Unfortunately, I checked the generated Core intermediate code, and CPR does not kick in for us here. It’s often fragile, and a case like this, where we carry a tuple from the ST monad into pure code, can easily defeat it. (It’s far from obvious when it will or will not work, so always best to check if it matters.)

Remaining optimistic, we can manually get a little bit of a CPR-like effect by judicious choice of product types. Recall the type of run2:

run2 :: (forall s. ST s (MArray s, a)) -> (Array, a)

There are two profitable observations we can make here.

To begin with, we know that the a parameter of the first tuple will be an Int in some particularly important cases. We use a specialised tuple to represent this idea.

-- Instead of (MArray s, a):
data Run s = Run (MArray s) {-# UNPACK #-} !Int

We have explicitly instructed GHC not to box the Int; instead it will be stored unboxed, directly in the Run structure. This eliminates one memory allocation and the performance-sapping pointer indirection it would induce. (You might wonder why we don’t direct GHC to unbox the MArray s parameter. That value is just a pointer to an array, and we certainly don’t want to copy arrays around.)
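The observable difference between a lazy boxed field and a strict UNPACK'd one can be demonstrated with a toy example (unrelated to text's actual types): constructing a value with a strict field forces that field immediately, whereas a lazy field is stored as a pointer to a possibly-unevaluated thunk.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- A lazy field holds a pointer to a thunk; an UNPACK'd strict Int field
-- is forced at construction time and (with optimisation on) stored
-- inline in the constructor, with no pointer at all.
data Lazy   = Lazy Int
data Strict = Strict {-# UNPACK #-} !Int

-- Constructing Strict with a bottom field blows up; Lazy does not.
fieldForced :: IO (Bool, Bool)
fieldForced = do
  s <- try (evaluate (Strict undefined)) :: IO (Either SomeException Strict)
  l <- try (evaluate (Lazy undefined))   :: IO (Either SomeException Lazy)
  return (either (const True) (const False) s,
          either (const True) (const False) l)
```

Note that UNPACK itself only takes effect when GHC optimises, but the strictness annotation it requires is honoured unconditionally, which is what this demo exercises.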

Our next observation is that any time we want to return a Run, we will always use the result to construct a Text. So why not construct the Text directly, and save our callers from doing it in many places?

runText :: (forall s. ST s (Run s)) -> Text

Here’s what the body of runText looks like. We are no longer allocating a second tuple, and the len variable will never be boxed and then immediately unboxed. This brings us from allocating two tuples and an Int down to allocating just one Run.

runText act = runST $ do
  Run marr len <- act
  arr <- A.unsafeFreeze marr
  return (textP arr 0 len)
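Transposed onto the standard array types, the Run-based version can be sketched like this (Text' and textP' are illustrative stand-ins, not the library's real definitions):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newListArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Array.Unsafe (unsafeFreeze)

-- Stand-ins for text's internal types; representations are made up.
data Run s = Run (STUArray s Int Int) {-# UNPACK #-} !Int
data Text' = Text' (UArray Int Int) Int Int

textP' :: UArray Int Int -> Int -> Int -> Text'
textP' = Text'

-- One Run is allocated, but no second tuple and no boxed length.
runText :: (forall s. ST s (Run s)) -> Text'
runText act = runST $ do
  Run marr len <- act
  arr <- unsafeFreeze marr
  return (textP' arr 0 len)

demo :: Text'
demo = runText $ do
  marr <- newListArray (0, 2) [104, 105, 33]  -- code points for "hi!"
  return (Run marr 3)
```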

Experienced eyes will spot one last catch: because we are not forcing the result of the textP expression to be evaluated before we return, we are unnecessarily allocating a thunk here.

Nevertheless, even with that small inefficiency, both time and memory usage did improve: we allocated a little less, so we did a little less work, and thus a microbenchmark that decoded millions of short words became a few percent faster.

As gratifying as this is, can we improve on it further? Apart from the silly oversight of not forcing textP, there’s still that bothersome allocation of a Run. Reading the Core code generated by GHC indicates that CPR is still not happening, and so a Run really is being allocated and immediately thrown away. (I think that the polymorphic first parameter to Run might be defeating CPR, but I am not sure of this.)

All along, we’ve been either allocating tuples or Run values and then immediately throwing them away. Allocation might be cheap, but ceteris paribus, not allocating anything will always be cheaper.

Instead of returning a Run value, what if we were to have our action call a continuation to construct the final Text value?

runText :: (forall s. (MArray s -> Int -> ST s Text) -> ST s Text) -> Text

The function that we call as our continuation captures no variables from its environment, so it has no hidden state for which we might need to allocate memory to save.

runText act = runST (act $ \ !marr !len -> do
  arr <- A.unsafeFreeze marr
  return $! textP arr 0 len)
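Again in terms of the standard array types, the continuation-passing version can be sketched as follows (Text' and textP' remain hypothetical stand-ins):

```haskell
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newListArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Array.Unsafe (unsafeFreeze)

data Text' = Text' (UArray Int Int) Int Int

textP' :: UArray Int Int -> Int -> Int -> Text'
textP' = Text'

-- The action no longer returns a product type at all: it finishes by
-- calling the continuation with the raw array and length, and the
-- continuation freezes the array and builds (and forces) the result.
runText :: (forall s. (STUArray s Int Int -> Int -> ST s Text') -> ST s Text')
        -> Text'
runText act = runST (act $ \ !marr !len -> do
  arr <- unsafeFreeze marr
  return $! textP' arr 0 len)

demo :: Text'
demo = runText $ \done -> do
  marr <- newListArray (0, 1) [104, 105]  -- code points for "hi"
  done marr 2
```

The bang patterns on marr and len, together with return $!, ensure that nothing is left behind as a thunk when the continuation finishes.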

We’ve now avoided the allocation of a Run, and just as importantly, we remembered to have the new runText force the result of the textP expression, so it will not allocate a thunk.

The only downside to this approach is that GHC does very little inlining of continuation-based code, so our use of a continuation leaves a jump in the code path that we’d have preferred to see the inliner eliminate.

This change of tack yields a significant reduction in memory allocation: on my small-decode microbenchmark, we allocate 17% less memory and run 10% faster. I see the same improvement with another microbenchmark that exercises the Stream to Text conversion code, where I made the same optimisation. Given that I changed just a few lines of code in each case, this result makes me happy. If you’re interested, it might help to take a look at the final continuation-using version of decodeUtf8With in context.

Although simple, I hope that working through this in some detail has been interesting. Please let me know if you’d like to see more hands-on posts like this.