A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text package.

I took his code, applied it to the text package, and this is the story of what happened.

The text package provides a type named Builder for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial Builder implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.

In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.

positive :: ( Integral a) => a -> Builder positive = go where go n | n < 10 = digit n | otherwise = go (n `quot` 10 ) <> digit (n `rem` 10 ) digit n = singleton . intTodigit . fromIntegral

Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.

Before I discuss how I improved it, a brief description of how a Builder works is in order. A Builder provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable Text value. External users of Builder never see the mutable buffers; they can only use safe access methods.

The code above uses the safe API. Each use of singleton is bounds-checked internally: if a write will not fit into the current buffer, Builder “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <> operator sequences the writes.

While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the show function, then converting the resulting [Char] value to a Builder !

My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.

countDigits :: ( Integral a) => a -> Int countDigits v0 = go 1 ( fromIntegral v0 :: Word64 ) where go ! k v | v < 10 = k | v < 100 = k + 1 | v < 1000 = k + 2 | v < 10000 = k + 3 | otherwise = go (k +4 ) (v `quot` 10000 )

This is almost identical to Andrei’s second digits10 function.

Once I was able to count digits, I could use the internal writeN function to destructively update the buffer. writeN ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to ( marr ) and the position to start writing at ( off ).

positive :: ( Integral a) => a -> Builder positive i -- we win when we special-case single-digit numbers | i < 10 = writeN 1 $ \marr off -> unsafeWrite marr off (i2w i) | otherwise = let ! n = countDigits i in writeN n $ \marr off -> posDecimal marr off n i posDecimal :: ( Integral a) => forall s . MArray s -> Int -> Int -> a -> ST s () posDecimal marr off0 ds v0 = go (off0 + ds - 1 ) v0 where go off v | v < 10 = unsafeWrite marr off (i2w v) | otherwise = do unsafeWrite marr off (i2w (v `rem` 10 )) go (off -1 ) (v `div` 10 ) i2w :: ( Integral a) => a -> Word16 i2w v = 48 + fromIntegral v

I then took Andrei’s very clever third digits10 function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).

countDigits :: ( Integral a) => a -> Int countDigits v0 = go 1 ( fromIntegral v0 :: Word64 ) where go ! k v | v < 10 = k | v < 100 = k + 1 | v < 1000 = k + 2 | v < 1000000000000 = k + if v < 100000000 then if v < 1000000 then if v < 10000 then 3 else 4 + fin v 100000 else 6 + fin v 10000000 else if v < 10000000000 then 8 + fin v 1000000000 else 10 + fin v 100000000000 | otherwise = go (k + 12 ) (v `quot` 1000000000000 ) fin v n = if v >= n then 1 else 0

(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)

While this countDigits function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.

posDecimal :: ( Integral a) => forall s . MArray s -> Int -> Int -> a -> ST s () posDecimal marr off0 ds v0 = go (off0 + ds - 1 ) v0 where go off v | v >= 100 = do let (q, r) = v `quotRem` 100 write2 off r go (off - 2 ) q | v < 10 = unsafeWrite marr off (i2w v) | otherwise = write2 off v write2 off i0 = do let i = fromIntegral i0; j = i + i unsafeWrite marr off $ get (j + 1 ) unsafeWrite marr (off - 1 ) $ get j get = fromIntegral . B.unsafeIndex digits

A final surprise came when I decided to try an experiment: what if I replaced the separate uses of quot and rem with a single use of quotRem ? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because quotRem can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.

Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of v+v instead of v*2 above, where I’m strength-reducing operations by hand instead of trusting the compiler.)

I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.