Years ago, I was profiling a simple loop similar to:

countForever ref = forever $ modifyIORef' ref (+1)

I counted four assembly instructions.

An addition takes ~300 picoseconds, but there is a memory access involved, so let’s say two nanoseconds per iteration.

To profile, I wrote:

iorefLoop = do
  ref <- newIORef (0 :: Int)
  replicateM_ 100000000 $ modifyIORef' ref (+1)

I expected this to take about 10⁸ iterations × 2 ns ≈ 0.2 seconds, but …

$ bench ./bin/ioref-loop
benchmarking ./bin/ioref-loop
time                 332.3 ms   (330.4 ms .. 333.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 334.4 ms   (333.2 ms .. 335.9 ms)
std dev              2.822 ms   (2.182 ms .. 3.942 ms)

Was this a good time? Was I expecting something unrealistic from Haskell? As a sanity check, I wrote an analogous program in C:

int main (int argc, const char** argv)
{
    int ref = 0;
    while (ref < 100000000)
    {
        ref++;
    }
}

I compiled and ran it like:

$ gcc -O2 ./src/CLoop.c -o ./bin/CLoop && bench ./bin/CLoop
benchmarking ./bin/CLoop
time                 3.999 ms   (3.794 ms .. 4.235 ms)
                     0.980 R²   (0.972 R² .. 0.992 R²)
mean                 3.940 ms   (3.838 ms .. 4.093 ms)
std dev              593.3 μs   (422.8 μs .. 816.0 μs)
variance introduced by outliers: 90% (severely inflated)

This was much faster than I had expected.

Maybe bench isn’t accurate when programs are fast?

I went on IRC (the times quoted there differ from the above because GHC generates faster code now):

jfischoff ~ Is there anyway to make modifyIORef' faster?

jfischoff ~ I am surprised that in a second I was only able to

↪ update this ref 100 million times:

↪ timeout 1000000 $ forever $ modifyIORef' x (1+)

jfischoff ~ where as c++ was able to do the same in 4 milliseconds

glguy ~ c++ was able to do 1 update every 0.04 nanoseconds?

glguy ~ an update rate of 25 gigahertz?

dv- ~ gcc probably just replaced it with a constant

jfischoff ~ dv-: perhaps

glguy ~ That or C++ unlocks the fast mode of an Intel processor

Burn.

$ gcc -O2 CLoop.c -S

Here’s the assembly:



; CLoop.s
; Notice, there are no jumps.
; There is no looping.
_main:                          ## @main
        .cfi_startproc
## BB#0:
        pushq   %rbp
Ltmp0:
        .cfi_def_cfa_offset 16
Ltmp1:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
Ltmp2:
        .cfi_def_cfa_register %rbp
        xorl    %eax, %eax
        popq    %rbp
        retq
        .cfi_endproc

dv- was right. The compiler got rid of the loop, I’m assuming because I wasn’t using the result. I added the volatile keyword to prevent the optimization:

// CLoopVolatile.c
int main (int argc, const char** argv)
{
    volatile int ref = 0;
    while (ref < 100000000)
    {
        ref++;
    }
}

Compiling:

$ gcc -O2 ./src/CLoopVolatile.c -o ./bin/CLoopVolatile
$ bench ./bin/CLoopVolatile
benchmarking ./bin/CLoopVolatile
time                 158.2 ms   (156.7 ms .. 159.5 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 159.1 ms   (158.4 ms .. 160.2 ms)
std dev              2.423 ms   (1.755 ms .. 3.772 ms)

The assembly of the loop is:

loop:
    incl -4(%rbp)           # Increment the value at the address stored
                            # in rbp, displaced by -4.
    movl -4(%rbp), %eax     # Move that value into eax.
    cmpl $100000000, %eax   # Compare eax to 100000000.
    jl loop                 # Jump to the start of the loop if eax
                            # was less.

So, incrementing C’s mutable refs is over twice as fast as IORef? Then again, maybe handwritten assembly could be faster, or maybe the volatile keyword threw off the optimizer too much?

; AsmLoop.asm
EXTERN _exit
GLOBAL start

SECTION .data
align 8
iterationCount DD 100000000
result         DD 0

SECTION .text
start:
    ; Copy the iteration count to the ecx register,
    ; which is used by the loopnz instruction
    mov ecx, [iterationCount]

loopBlock:
    inc dword [result]  ; Increment the value at the address of result
    loopnz loopBlock    ; Decrement the ecx counter and loop until ecx
                        ; is zero

exitBlock:
    push dword 0        ; Set the exit code to zero
    mov eax, 1          ; Place the system call number (exit) in the eax reg
    sub esp, 4          ; I have to add some dummy space to the stack for
                        ; some reason
    int 0x80            ; Exit

Compiling:

$ nasm -fmacho src/AsmLoop.asm

$ ld -o ./bin/AsmLoop -macosx_version_min 10.7.0 ./src/AsmLoop.o

$ bench ./bin/AsmLoop



benchmarking ./bin/AsmLoop

time 185.2 ms (183.7 ms .. 186.9 ms)

1.000 R² (0.999 R² .. 1.000 R²)

mean 187.0 ms (185.9 ms .. 188.5 ms)

std dev 3.132 ms (2.229 ms .. 4.269 ms)

Wow, my l33t skillz increased the time by around 15% (three years ago this version was faster … just saying).

So IORef is about two times slower than naive approaches in C and assembly. What gives? To the core!

$ stack exec ghc-core -- --no-syntax -- -O2 \
    -dsuppress-all src/IORefLoop.hs

Looking through the core I saw:

...
case readMutVar# ipv1_aVW w2_a35m
of _ { (# ipv2_aWt, ipv3_aWu #) ->
case ipv3_aWu of _ { I# x_aUW ->
case writeMutVar# ipv1_aVW (I# (+# x_aUW 1#)) ipv2_aWt
...

I find GHC core almost unreadable.

One trick is to ignore most of the case statements. The first and third case statements do not scrutinize alternatives; they exist to ensure proper sequencing of the IO actions. Renaming the generated variables also helps.

Here is a cleaned up version of what was above:

...
case readMutVar# ioRef token of
  { (# token', x #) ->
case x of
  { I# unbox ->
case writeMutVar# ioRef (I# (+# unbox 1#)) token'
...

The second case statement is unboxing an Int to a primitive unboxed Int#:

case x of
  { I# unbox ->

and the write then re-boxes it:

case writeMutVar# ioRef (I# (+# unbox 1#)) token'

I# is the Int constructor (the # means it is a compiler primitive). It wraps or boxes an unpacked, unboxed, "machine" int. Most of the time, GHC can unbox primitives automatically, but it can't in this case. Boxing/unboxing could be the root of the slowdown.
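To make the unbox/add/re-box cycle concrete, here is a small sketch of my own (not from the core output above) that performs the same steps explicitly with GHC’s primitives:

```haskell
{-# LANGUAGE MagicHash #-}
-- Sketch of what the core does on every iteration: pattern matching on
-- I# unboxes the machine int, +# adds on the unboxed Int#, and applying
-- I# allocates a fresh box for the result.
import GHC.Exts (Int (I#), (+#))

increment :: Int -> Int
increment (I# unboxed) = I# (unboxed +# 1#)

main :: IO ()
main = print (increment 41)  -- prints 42
```

In the IORef loop this boxing happens 100 million times, once per write.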

We need an unboxed mutable reference. I can think of two options: Ptr Int and MutableByteArray.

First the Ptr version:

{-# LANGUAGE BangPatterns #-}
import Control.Monad (replicateM_)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)

main = alloca $ \ptr -> do
  poke ptr (0 :: Int)
  replicateM_ 100000000 $ do
    i <- peek ptr
    let !i' = i + 1
    poke ptr i'

Running, I get:

$ bench ./bin/ptr-loop
benchmarking ./bin/ptr-loop
time                 175.2 ms   (174.1 ms .. 176.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 174.7 ms   (174.1 ms .. 175.2 ms)
std dev              1.626 ms   (1.372 ms .. 2.089 ms)

This is good! Almost twice as fast as IORef.

Now the MutableByteArray version:

{-# LANGUAGE BangPatterns #-}
import Control.Monad (replicateM_)
import Data.Primitive.ByteArray

main = do
  -- 8 bytes (room for one 64-bit Int), 8-byte aligned
  mba <- newAlignedPinnedByteArray 8 8
  replicateM_ 100000000 $ do
    i <- readByteArray mba 0 :: IO Int
    let !i' = i + 1
    writeByteArray mba 0 i'

Running:

$ bench ./bin/mutable-byte-buffer-loop
benchmarking ./bin/mutable-byte-buffer-loop
time                 175.2 ms   (173.5 ms .. 177.3 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 177.4 ms   (176.5 ms .. 178.4 ms)
std dev              2.423 ms   (1.987 ms .. 2.919 ms)

Basically identical.

I’ve made progress, but I don’t want to use either Ptr or MutableByteArray. They are not very safe (Ptr is arguably safe enough, but I prefer not to manage memory manually if I can help it).

If only there were a safe wrapper around one of these types. There is! The mutable unboxed Vector is a safe wrapper around MutableByteArray. Here is the Vector version:

{-# LANGUAGE BangPatterns #-}
import Control.Monad (replicateM_)
import qualified Data.Vector.Unboxed.Mutable as M

main = do
  mba <- M.new 1
  replicateM_ 100000000 $ do
    i <- M.read mba 0 :: IO Int
    let !i' = i + 1
    M.write mba 0 i'

Running:

$ bench ./bin/vector-loop
benchmarking ./bin/vector-loop
time                 178.6 ms   (177.3 ms .. 179.9 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 178.2 ms   (177.1 ms .. 179.3 ms)
std dev              3.021 ms   (2.082 ms .. 4.288 ms)

Nice! With only a small increase in time, we get a safe interface for fast, unboxed mutable references. It is a little odd to use a collection type for a single element, but it’s not terrible, either.
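If the collection flavor bothers you, the one-element vector is easy to hide behind a small reference-like API. Here is a sketch (the names URef, newURef, and so on are my own, chosen for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sketch: an unboxed mutable reference as a newtype over a
-- one-element unboxed vector. Names here are hypothetical.
import qualified Data.Vector.Unboxed.Mutable as M

newtype URef a = URef (M.IOVector a)

newURef :: M.Unbox a => a -> IO (URef a)
newURef a = do
  v <- M.new 1
  M.write v 0 a
  pure (URef v)

readURef :: M.Unbox a => URef a -> IO a
readURef (URef v) = M.read v 0

modifyURef' :: M.Unbox a => URef a -> (a -> a) -> IO ()
modifyURef' (URef v) f = do
  !x <- M.read v 0
  M.write v 0 (f x)

main :: IO ()
main = do
  r <- newURef (0 :: Int)
  modifyURef' r (+1)
  readURef r >>= print  -- prints 1
```

The newtype costs nothing at runtime, so this should benchmark the same as the raw vector version.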

What’s on the Table

CLoopVolatile and AsmLoop.asm are not the fastest ways one could write a loop that increments a number. That would involve keeping the reference in a register for the duration of the loop, plus other tricks like loop unrolling. Just to give a sense of how much faster things could be, here is a simple assembly example that keeps the counter in a register.

; FastAsmLoop.asm
EXTERN _exit
GLOBAL start

SECTION .text
start:
    mov rcx, 0              ; The counter lives in a register
    mov rax, 100000000

loopBlock:
    inc rcx                 ; Increment the register, no memory access
    cmp rcx, rax
    jl loopBlock            ; Loop until rcx reaches 100000000

exitBlock:
    mov rax, 0x2000001      ; exit
    mov rdi, 0
    syscall

Compiling:

$ nasm -fmacho64 src/FastAsmLoop.asm
$ ld -o ./bin/FastAsmLoop -macosx_version_min 10.7.0 \
    ./src/FastAsmLoop.o
$ bench ./bin/FastAsmLoop
benchmarking ./bin/FastAsmLoop
time                 38.38 ms   (37.96 ms .. 39.15 ms)
                     0.999 R²   (0.995 R² .. 1.000 R²)
mean                 38.22 ms   (37.98 ms .. 38.87 ms)
std dev              715.8 μs   (282.6 μs .. 1.261 ms)

I barely believe these numbers, but I think they show that GHC is still leaving some performance on the table.

Something else worth noting: this isn’t how a tight loop would be written in Haskell anyway; one would not need a mutable reference at all. In a future post I would like to do a more direct comparison between Haskell and assembly, similar to the above. The C code was really just a baseline that led me to look into what GHC was doing.
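For reference, the idiomatic Haskell version of this benchmark needs no mutable reference at all, just a strict tail-recursive accumulator, which GHC compiles down to a loop over registers (a sketch of my own, not from the original post):

```haskell
{-# LANGUAGE BangPatterns #-}
-- A strict tail-recursive counter: the bang keeps the accumulator
-- evaluated, so GHC can keep it unboxed in a register.
countTo :: Int -> Int
countTo n = go 0
  where
    go !i
      | i >= n    = i
      | otherwise = go (i + 1)

main :: IO ()
main = print (countTo 100000000)  -- prints 100000000
```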

Update: Michael Snoyman alerted me to a package of his that provides an unboxed mutable reference, among other nice things: https://www.stackage.org/haddock/lts-8.16/mutable-containers-0.3.3/Data-Mutable.html#t:URef

A project with this code, which can also reproduce these timings, can be found here: https://github.com/jfischoff/are-mutable-references-in-haskell-fast