A ByteString performance mystery

This post will recount an interesting performance mystery from work a long way back. I wasn't involved in this murder mystery at the time, but that's all the more reason to get involved writing it up!

The problem is pretty simple: we have a buffer that is assumed to be ASCII alphanumerics, and we want to convert it to lowercase for comparison. It's in an inner loop, so it should be as fast as possible, though we are not quite willing to implement the entire program in C. We still want a typed interface and Hedgehog coverage.

If we want this to be a fast inner loop, we probably want to avoid unnecessary branching. Luckily, our input is in the right shape. There's a fairly well-known trick for upper- and lower-casing ASCII strings: the upper- and lower-case letters are exactly 0x20 apart, so it is common to check the case, then add or subtract 0x20. However, checking the case first would introduce a branch, so we can instead operate unconditionally on the 0x20 bit: clear it with a bitwise AND of the complement of 0x20 (to uppercase), or set it with a bitwise OR of 0x20 (to lowercase). This will produce some unusual output for bytes that aren't alphanumeric, but we don't care in this particular case.

We can implement this function in fairly idiomatic Haskell using ByteString primitives:

import Data.Bits ((.|.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

toLowerNative :: ByteString -> ByteString
toLowerNative = B.map (.|. 0x20)

... and we can see that it works given ASCII:

λ> toLowerNative "ABCDE"
"abcde"
λ> toLowerNative "ABCDE123123123"
"abcde123123123"
λ> toLowerNative "ABCDE123123123abcde"
"abcde123123123abcde"

... and we can write a benchmark to demonstrate that it's linear in the size of the input:

Benchmark bench: RUNNING...
benchmarking toLowerNative/100
time                 699.9 ns   (689.4 ns .. 712.1 ns)
benchmarking toLowerNative/500
time                 3.133 μs   (3.092 μs .. 3.193 μs)
benchmarking toLowerNative/1000
time                 6.245 μs   (6.128 μs .. 6.384 μs)
benchmarking toLowerNative/2000
time                 12.66 μs   (12.39 μs .. 12.99 μs)

... but we don't have anything to benchmark against! We shouldn't be especially surprised that a map across ByteStrings runs in linear time.

The interesting bit came when it was benchmarked against the exact same function written in C and invoked via the FFI. So, let's do that:

void to_lower(char *str, int len) {
  for (int i = 0; i < len; i++) {
    str[i] |= 0x20;
  }
}

We can bind it in Haskell and marshal our ByteStrings in the usual fashion. This function has only benign side effects: it idempotently modifies the memory provided by ByteString's useAsCStringLen function, and touches nothing else. Since useAsCStringLen copies the ByteString into a temporary buffer, the original bytes are never mutated. So, it is safe to use unsafePerformIO, convincing GHC to treat it as an ordinary pure function.

{-# LANGUAGE ForeignFunctionInterface #-}

import           Data.ByteString (ByteString)
import qualified Data.ByteString as B
import           Foreign.C.Types (CChar, CInt)
import           Foreign.Ptr (Ptr)
import qualified System.IO.Unsafe as Unsafe

-- The C parameter is an int, so we marshal the length through CInt.
foreign import ccall to_lower :: Ptr CChar -> CInt -> IO ()

toLowerFFI :: ByteString -> ByteString
toLowerFFI bs =
  Unsafe.unsafePerformIO . B.useAsCStringLen bs $ \(ptr, len) -> do
    to_lower ptr (fromIntegral len)
    B.packCStringLen (ptr, len)

... and when we add it to the benchmark, it's an order of magnitude faster!

benchmarking toLowerFFI/100
time                 236.0 ns   (233.6 ns .. 239.1 ns)
benchmarking toLowerFFI/500
time                 349.4 ns   (334.2 ns .. 366.1 ns)
benchmarking toLowerFFI/1000
time                 477.6 ns   (432.3 ns .. 528.6 ns)
benchmarking toLowerFFI/2000
time                 637.8 ns   (586.2 ns .. 690.2 ns)

Assembly dump

We know from various wonderful Haskell performance guides that there's a bunch of stuff we can try next to improve this native version. I'm not going to go there. Let's just peek at the assembly for the speedy version.

$ objdump -S lower.o

lower.o:	file format Mach-O 64-bit x86-64

Disassembly of section __TEXT,__text:

_to_lower:
; void to_lower(char* str, int len) {
       0: 55                        pushq   %rbp
       1: 48 89 e5                  movq    %rsp, %rbp
; for (int i = 0; i < len; i++) {
       4: 85 f6                     testl   %esi, %esi
       6: 7e 63                     jle     99 <_to_lower+0x6B>
       8: 89 f0                     movl    %esi, %eax
; str[i] |= 0x20;
       a: 83 fe 1f                  cmpl    $31, %esi
       d: 76 45                     jbe     69 <_to_lower+0x54>
       f: 83 e6 1f                  andl    $31, %esi
      12: 49 89 c0                  movq    %rax, %r8
      15: 49 29 f0                  subq    %rsi, %r8
      18: 74 3a                     je      58 <_to_lower+0x54>
      1a: 48 8d 57 10               leaq    16(%rdi), %rdx
      1e: 0f 28 05 4b 00 00 00      movaps  75(%rip), %xmm0
; for (int i = 0; i < len; i++) {
      25: 4c 89 c1                  movq    %r8, %rcx
      28: 0f 1f 84 00 00 00 00 00   nopl    (%rax,%rax)
; str[i] |= 0x20;
      30: 0f 10 4a f0               movups  -16(%rdx), %xmm1
      34: 0f 10 12                  movups  (%rdx), %xmm2
      37: 0f 56 c8                  orps    %xmm0, %xmm1
      3a: 0f 56 d0                  orps    %xmm0, %xmm2
      3d: 0f 11 4a f0               movups  %xmm1, -16(%rdx)
      41: 0f 11 12                  movups  %xmm2, (%rdx)
      44: 48 83 c2 20               addq    $32, %rdx
      48: 48 83 c1 e0               addq    $-32, %rcx
      4c: 75 e2                     jne     -30 <_to_lower+0x30>
      4e: 85 f6                     testl   %esi, %esi
      50: 75 05                     jne     5 <_to_lower+0x57>
      52: eb 17                     jmp     23 <_to_lower+0x6B>
      54: 45 31 c0                  xorl    %r8d, %r8d
      57: 4c 01 c7                  addq    %r8, %rdi
      5a: 4c 29 c0                  subq    %r8, %rax
      5d: 0f 1f 00                  nopl    (%rax)
      60: 80 0f 20                  orb     $32, (%rdi)
; for (int i = 0; i < len; i++) {
      63: 48 ff c7                  incq    %rdi
      66: 48 ff c8                  decq    %rax
      69: 75 f5                     jne     -11 <_to_lower+0x60>
; };
      6b: 5d                        popq    %rbp
      6c: c3                        retq

Making sense of it

I'm not sure about you, but I'm not very good at reading assembly. I know what many of the instructions do and can follow simple control flow, but keeping track of complex changes in state and flags takes a lot of energy. I don't draw any great insight from this assembly dump; I just try to zoom in on the interesting bits.

Edit 24/08: Turns out I'm really bad at reading x86! I had originally confused a call for uninlined ByteString.Map for the inner loop. I've removed a little of the Twitch Reads ASM-style content for now.

Native

The native version is going one byte at a time in a loop. That's about it. I had some incorrect ASM tea-leaf reading here before, but no longer.

C

The C version has compiled to a much larger pile of instructions. We get nice source mapping because I compiled with debugging symbols enabled (-g), which lets us relate each line of C to the generated code. This helps somewhat if you're a rube like me.

The second occurrence of |= 0x20 is the meat of the loop, and it goes a long way towards explaining the dramatic performance gap. It's using orps on the xmm registers to do the heavy lifting. A quick Google reveals these to be part of Intel's Streaming SIMD Extensions, or SSE.

In short, the compiled C is making use of CPU vector operations. It packs the 128-bit xmm registers with characters and ORs 16 bytes per register; the unrolled loop above handles two registers, or 32 bytes, per iteration. When there aren't enough characters left to fill a register, it proceeds a byte at a time until the input is exhausted.

Glancing back at our benchmark, the C version is roughly an order of magnitude faster: around 9–20x at the larger sizes, though only about 3x at 100 bytes, where fixed per-call costs weigh more heavily. SIMD explains most of the gap away.

GHC and SIMD

So, why doesn't GHC generate vectorised code in this case?

You probably shouldn't ask me, I learned everything I know about SSE from Wikipedia just now.

I'd like to hear about the backstory here, and why it's difficult. Get in touch if you can be bothered! I will go read up a little.

Edit 24/08: Austin Seipp has provided a great wealth of information on automatic vectorisation on Reddit; read it!

Edit 24/08: Thanks to Ilya Yanok for pointing out some of my incorrect comments on the native ASM.

The code in this post is truncated. Get the full source here.