The other night, I had a random whim to spend a couple of minutes looking at the performance of UTF-8 decoding in the Haskell Unicode text package. Rather than measure the performance directly, I used Don Stewart's excellent ghc-core tool to inspect the high-level "Core" code generated by the compiler. Core is the last layer at which Haskell code is still somewhat intelligible, and although it takes quite a bit of practice to interpret, the effort is often worth it.

For instance, in this case, I could immediately tell by inspection that something bad was afoot in the inner loop of the UTF-8 decoder. A decoder more or less has to read a byte of input at a time, as this heavily edited bit of Core illustrates:

let x2 :: Word8
    x2 = case readWord8OffAddr# {- ... -} of
           (# s, x #) -> W8# x
in {- ... -}

What's important in the snippet above is that the value x2 is boxed, i.e. packaged up with a W8# constructor so that it must be accessed via a pointer indirection. Since a decoder must read up to 4 bytes to emit a single Unicode code point, the loop was potentially boxing up 4 bytes, then immediately unboxing them:

case x2 of
  W8# x2# ->
    case x3 of
      W8# x3# ->
        case x4 of
          W8# x4# -> {- ... -}

While both boxing and unboxing are cheap in Haskell, they're certainly not free, and we surely don't want to be doing either in the inner loop of an oft-used function.

We can see why this was happening at line 96 of the decodeUtf8With function. I'd hoped that the compiler would be smart enough to unbox the values x1 through x4, but it turned out not to be.

Fixing this excessive boxing and unboxing wasn't hard at all, but it made the code uglier. The rewritten code had identical performance on pure ASCII data, but was about 1.7 times faster on data that was partly or entirely non-ASCII. Nice! Right?
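As a rough illustration of the idea (not the text package's actual code), strict, word-sized bindings give GHC the chance to keep each byte in a register rather than allocating a box per byte:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hedged sketch: with bang patterns on plain Word32 arguments, GHC's
-- strictness analysis can unbox these values entirely, so no W8# boxes
-- are built in the loop.
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word32)

-- Combine the payload bits of a 4-byte UTF-8 sequence (assumed already
-- validated) into a single code point.
decode4 :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
decode4 !b1 !b2 !b3 !b4 =
      ((b1 .&. 0x07) `shiftL` 18)
  .|. ((b2 .&. 0x3f) `shiftL` 12)
  .|. ((b3 .&. 0x3f) `shiftL` 6)
  .|.  (b4 .&. 0x3f)
```

For example, decode4 0xF0 0x9F 0x98 0x80 yields 0x1F600, the code point for U+1F600.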

Not quite content with this improvement, I tried writing a decoder based on Björn Höhrmann's work. My initial attempt looked promising; it was up to 2.5 times faster than my first improved Haskell decoder, but it fell behind on decoding ASCII, due to the extra overhead of maintaining the DFA state.
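Höhrmann's decoder drives a table-based DFA, consuming one byte per transition. The real thing uses a 256-entry character-class table plus a packed transition table; the toy sketch below handles only ASCII and two-byte sequences, but it shows where the per-byte cost of threading the DFA state comes from:

```haskell
-- Toy sketch only, not Höhrmann's actual tables: a tiny hand-rolled
-- DFA covering ASCII and two-byte UTF-8 sequences.
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

data S = Accept | Need1 | Reject deriving (Eq, Show)

-- One transition per input byte: classify the byte, update the state,
-- and fold the byte's payload bits into the code point being built.
step :: (S, Int) -> Word8 -> (S, Int)
step (Accept, _) b
  | b < 0x80           = (Accept, fromIntegral b)            -- ASCII
  | b .&. 0xE0 == 0xC0 = (Need1, fromIntegral (b .&. 0x1F))  -- 2-byte lead
  | otherwise          = (Reject, 0)
step (Need1, cp) b
  | b .&. 0xC0 == 0x80 = (Accept, (cp `shiftL` 6) .|. fromIntegral (b .&. 0x3F))
  | otherwise          = (Reject, 0)
step (Reject, _) _     = (Reject, 0)
```

Folding step over the bytes 0xC3 and 0xA9 ends in (Accept, 0xE9), i.e. U+00E9. The point is that even the pure-ASCII branch still pays for matching on the state for every single byte, which is exactly the overhead that hurt on ASCII input.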

In English-speaking countries, ASCII is still the king of encodings. Even in non-English-speaking countries that use UTF-8, a whole lot of text is at least partly ASCII in nature. For instance, other European languages contain frequent extents of 7-bit-clean text. Even in languages where almost all code points need two or more bytes to be represented in UTF-8, data such as XML and HTML still contains numerous extents of ASCII text.

What would happen if we were to special-case ASCII? If we read a 32-bit chunk of data, mask it against 0x80808080, and get zero, we know that all four bytes must be ASCII, so we can just write them straight out without going through the DFA (see lines 82 through 110).
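The check itself is a single mask-and-compare. A minimal sketch of the predicate (the chunked read loop around it is omitted):

```haskell
-- Sketch of the fast-path test: if the masked word is zero, no byte in
-- the 32-bit chunk has its high bit set, so all four bytes are ASCII
-- and can be copied straight out, bypassing the DFA.
import Data.Bits ((.&.))
import Data.Word (Word32)

allAscii :: Word32 -> Bool
allAscii w = w .&. 0x80808080 == 0
```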

As the numbers suggest, this makes a big difference to performance! Decoding pure ASCII becomes much faster, while both HTML and XML see respectable improvements. Of course, even this approach comes with a tradeoff: we lose a little performance when decoding entirely non-ASCII text.

Even in the slowest case, we can now decode upwards of 250MB of UTF-8 text per second, while for ASCII, we exceed 1.7GB per second!

These changes have made a big difference to decoding performance across the board: it is now always between 2 and 4 times faster than before.

As a final note, I haven't released the new code quite yet, so keep an eye out!