In this post you’ll get an idea of how to:

make a Haskell program much faster by parallelising it

analyse and use the SMP runtime flags GHC provides

mess with the parallel garbage collector

Ultimately we’ll make a program more than 3x faster on 4 cores (from about 40s down to under 13s) by changing one line of code to use parallelism and tuning the garbage collector.

Update: since I began writing this, GHC HQ (aka Simon, Simon and Satnam) have released “Runtime Support for Multicore Haskell”, which finally puts on paper a lot of information that was previously just rumour. As a result, I’ve rewritten this article from scratch to use GHC 6.11 (today’s snapshot), since it is so much faster and easier to use than 6.10.x.

Ready?

The new GHC garbage collector

The GHC 6.10 release notes contain the following text on runtime system changes:

The garbage collector can now use multiple threads in parallel. The new -g n RTS flag controls it, e.g. run your program with +RTS -g2 -RTS to use 2 threads. The -g option is implied by the usual -N option, so normally there will be no need to specify it separately, although occasionally it is useful to turn it off with -g1. Do let us know if you experience strange effects, especially an increase in GC time when using the parallel GC (use +RTS -s -RTS to measure GC time). See Section 5.14.3, “RTS options to control the garbage collector” for more details.

Interesting. Maybe this will have some impact on the shootout benchmarks.

Binary trees: single threaded

There’s one program that’s been bugging me for a while, where the garbage collector is a bottleneck: parallel binary-trees on the quad core Computer Language Benchmarks Game. This is a pretty straightforward program for testing out memory management of non-flat data types in a language runtime – and FP languages should do very well with their bump-and-allocate heaps. All you really have to do is allocate and traverse a bunch of binary trees. This kind of data:

data Tree = Nil | Node !Int !Tree !Tree

Note that the rules state we can’t use laziness to avoid making O(n) allocations at a time, so the benchmark will use a strict tree type – that’s fine – it only helps with a single core anyway. GHC will unbox those Int fields into the constructor too, with -funbox-strict-fields (should be implied by -O in my opinion). The benchmark itself is really quite easy to implement. Pattern matching makes allocating and traversing the trees trivial:

-- traverse the tree, counting up the nodes
check :: Tree -> Int
check Nil = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2 - 1) d2) (make i2 d2)
  where
    i2 = 2 * i
    d2 = d - 1
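To see the two functions in action on something small, here is a minimal self-contained sketch. The size helper and the main wrapper are my own additions for illustration, not part of the benchmark entry; the rest follows the definitions above.

```haskell
-- Strict binary tree, as in the post.
data Tree = Nil | Node !Int !Tree !Tree

-- Build a complete tree of depth d with label i at the root.
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2 - 1) d2) (make i2 d2)
  where
    i2 = 2 * i
    d2 = d - 1

-- Walk the tree: add the label and the left total, subtract the right total.
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- Hypothetical helper (not in the post): count the allocated nodes.
size :: Tree -> Int
size Nil          = 0
size (Node _ l r) = 1 + size l + size r

main :: IO ()
main = do
  let t = make 0 4
  putStrLn ("nodes: " ++ show (size t))   -- 2^(4+1) - 1 = 31
  putStrLn ("check: " ++ show (check t))  -- -1
```

For any depth of 1 or more, check (make i d) works out to i - 1, which is why a tree rooted at 0 always reports a check value of -1 in the benchmark output.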

The full code is here. So quite naive code, and fast… if we just look at this code running on the single core benchmark machine:

Functional language implementations taking up 4 of the top 6 slots, and edging out C (it’s even faster with lazy trees). You can try this for yourself:

whirlpool$ ghc -O2 --make A.hs
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
whirlpool$ time ./A 16
stretch tree of depth 17 check: -1
131072 trees of depth 4 check: -131072
32768 trees of depth 6 check: -32768
8192 trees of depth 8 check: -8192
2048 trees of depth 10 check: -2048
512 trees of depth 12 check: -512
128 trees of depth 14 check: -128
32 trees of depth 16 check: -32
long lived tree of depth 16 check: -1
./A 16 1.26s user 0.03s system 100% cpu 1.291 total

I’m on a quad core Linux 2.6.26-1-amd64 x86_64 box, with:

whirlpool$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.11.20090302

If we take N up to 20, it takes a while longer to run:

whirlpool$ time ./A 20
stretch tree of depth 21 check: -1
2097152 trees of depth 4 check: -2097152
524288 trees of depth 6 check: -524288
131072 trees of depth 8 check: -131072
32768 trees of depth 10 check: -32768
8192 trees of depth 12 check: -8192
2048 trees of depth 14 check: -2048
512 trees of depth 16 check: -512
128 trees of depth 18 check: -128
32 trees of depth 20 check: -32
long lived tree of depth 20 check: -1
./A 20 40.21s user 0.16s system 99% cpu 40.382 total

And of course we get no speedup from the extra cores on the system yet. We’re only using 1/4 of the machine’s processing resources. The implementation contains no parallelisation strategy for GHC to use.
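If you want to confirm how many cores the runtime has actually been given, a small sketch like the following can help. It assumes a reasonably recent GHC, where GHC.Conc exports getNumCapabilities and getNumProcessors (the 6.11 snapshot used in this post predates them), and the describe helper is just a made-up name for the formatting.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors)

-- Hypothetical helper: format the two counts for display.
describe :: Int -> Int -> String
describe caps procs =
  "capabilities: " ++ show caps ++ ", processors: " ++ show procs

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- what the RTS is using (set by +RTS -N)
  procs <- getNumProcessors    -- what the operating system reports
  putStrLn (describe caps procs)
```

Built without -threaded, or run without +RTS -N, this should report a single capability however many processors the machine has, which is the situation the sequential program is in.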

Binary trees in parallel

Since Haskell (especially pure Haskell like this) is easy to parallelise, and in general GHC Haskell is pretty zippy on multicore :-) let’s see what we can do to make this faster by parallelisation. It turns out, teaching this program to use multicore is ridiculously easy. All we have to change is one line! Where previously we computed the depth of all the trees between minN and maxN sequentially,

let vs = depth minN maxN
...
depth :: Int -> Int -> [(Int, Int, Int)]
depth d m
    | d <= m    = (2 * n, d, sumT d n 0) : depth (d + 2) m
    | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

which yields a list of tree results sequentially, we instead step back and compute the separate trees in parallel using parMap:

let vs = parMap rnf id $ depth minN maxN

From Control.Parallel.Strategies, parMap forks sparks for each (expensive) computation in the list, evaluating them in parallel to normal form. This technique uses sparks – lazy futures – to hint to the runtime that it might be a good idea to evaluate each subcomputation in parallel. When the runtime spots that there are spare threads, it’ll pick up the sparks, and run them. With +RTS -N4, those sparks (in this case, 9 of them) will get scheduled over 4 cores. You can find out more about this style of parallel programming in ch24 of Real World Haskell, in Algorithm + Strategy = Parallelism and now in the new GHC HQ runtime paper.
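To make the spark idea concrete, here is a minimal sketch of a parMap-like function written directly with par and pseq from GHC.Conc in base. This is not how Control.Parallel.Strategies implements parMap; parMapWhnf and slowFib are hypothetical names, and the sketch only forces each result to weak head normal form, whereas the rnf strategy used above forces the whole (Int, Int, Int) tuple.

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for parMap: spark the head, evaluate the rest,
-- then return the list. Each result is only forced to WHNF, which is
-- full evaluation for flat results such as Integer.
parMapWhnf :: (a -> b) -> [a] -> [b]
parMapWhnf _ []       = []
parMapWhnf f (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x
    ys = parMapWhnf f xs

-- A deliberately slow function so each spark has real work to do.
slowFib :: Int -> Integer
slowFib n
  | n < 2     = toInteger n
  | otherwise = slowFib (n - 1) + slowFib (n - 2)

main :: IO ()
main = print (parMapWhnf slowFib [25, 26, 27, 28])
```

As with the real thing, the sparks are only hints: the program gives the same answer on one core, and needs the threaded runtime plus +RTS -N to actually run them in parallel.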

Running parallel binary trees

Now that we’ve modified the implementation to contain a parallel evaluation strategy, all we have to do is compile it against the threaded GHC runtime, and those sparks will be picked up by the scheduler and dropped into real threads distributed across the cores. We can try it using 2 of the 4 cores:

whirlpool$ ghc -O2 -threaded A.hs --make -fforce-recomp
whirlpool$ time ./A 16 +RTS -N2
stretch tree of depth 17 check: -1
131072 trees of depth 4 check: -131072
32768 trees of depth 6 check: -32768
8192 trees of depth 8 check: -8192
2048 trees of depth 10 check: -2048
512 trees of depth 12 check: -512
128 trees of depth 14 check: -128
32 trees of depth 16 check: -32
long lived tree of depth 16 check: -1
./A 16 +RTS -N2 1.34s user 0.02s system 124% cpu 1.094 total

Hmm, a little faster at N=16, and > 100% cpu. Trying again with 4 cores:

whirlpool$ time ./A 16 +RTS -N4
stretch tree of depth 17 check: -1
131072 trees of depth 4 check: -131072
32768 trees of depth 6 check: -32768
8192 trees of depth 8 check: -8192
2048 trees of depth 10 check: -2048
512 trees of depth 12 check: -512
128 trees of depth 14 check: -128
32 trees of depth 16 check: -32
long lived tree of depth 16 check: -1
./A 16 +RTS -N4 2.89s user 0.06s system 239% cpu 1.229 total

Hmm… so it got only a little faster with 2 cores at N=16, but about the same with 4 cores. At N=20 we see similar results:

whirlpool$ time ./A 20 +RTS -N4
stretch tree of depth 21 check: -1
2097152 trees of depth 4 check: -2097152
524288 trees of depth 6 check: -524288
131072 trees of depth 8 check: -131072
32768 trees of depth 10 check: -32768
8192 trees of depth 12 check: -8192
2048 trees of depth 14 check: -2048
512 trees of depth 16 check: -512
128 trees of depth 18 check: -128
32 trees of depth 20 check: -32
long lived tree of depth 20 check: -1
./A 20 +RTS -N4 96.61s user 0.93s system 239% cpu 40.778 total

So still 40s, at 239% cpu. So we made something hot. And you can see a similar result at N=20 on the current quad core shootout binary-trees entry. Jobs distributed across the cores, but not much better runtime. A little better than the single core entry, but only a little. And in the middle of the pack, and 2x slower than C!

Meanwhile, on the single core, it’s in 3rd place, ahead of C and C++. So what’s going on?

Listening to the garbage collector

We’ve parallelised this logically well, so I’m not prepared to abandon the top-level parMap strategy. Instead, let’s look deeper. One clue about what is going on is the cpu utilisation in the shootout program:

Those aren’t very good numbers – we’re using all the cores, but not very well. So the program’s doing something other than just number crunching. A good suspect is that there’s lots of GC traffic happening (after all, a lot of trees are being allocated!). We can confirm this hunch with +RTS -sstderr which prints lots of interesting statistics about what the program did:

whirlpool$ time ./A 16 +RTS -N4 -sstderr
./A 16 +RTS -N4 -sstderr
946,644,112 bytes allocated in the heap
484,565,352 bytes copied during GC
8,767,512 bytes maximum residency (23 sample(s))
95,720 bytes maximum slop
27 MB total memory in use (1 MB lost due to fragmentation)
Generation 0: 674 collections, 0 parallel, 0.54s, 0.55s elapsed
Generation 1: 23 collections, 22 parallel, 0.57s, 0.16s elapsed
Parallel GC work balance: 1.56 (17151829 / 10999322, ideal 4)
Task 0 (worker) : MUT time: 0.36s ( 0.39s elapsed) GC time: 0.28s ( 0.13s elapsed)
Task 1 (worker) : MUT time: 0.67s ( 0.43s elapsed) GC time: 0.14s ( 0.14s elapsed)
Task 2 (worker) : MUT time: 0.01s ( 0.43s elapsed) GC time: 0.09s ( 0.08s elapsed)
Task 3 (worker) : MUT time: 0.00s ( 0.43s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 0.31s ( 0.43s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 5 (worker) : MUT time: 0.22s ( 0.43s elapsed) GC time: 0.60s ( 0.37s elapsed)
SPARKS: 7 (7 converted, 0 pruned)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.02s ( 0.43s elapsed)
GC time 1.12s ( 0.71s elapsed)
EXIT time 0.32s ( 0.03s elapsed)
Total time 2.45s ( 1.17s elapsed)
%GC time 45.5% (60.7% elapsed)
Alloc rate 708,520,343 bytes per MUT second
Productivity 54.5% of total user, 114.1% of total elapsed
gc_alloc_block_sync: 35082
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 1123
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 6318
gen[1].steps[0].sync_large_objects: 0
./A 16 +RTS -N4 -sstderr 2.76s user 0.08s system 241% cpu 1.176 total

At N=16, the program is spending 45% of its time doing garbage collection. That’s a problem. We can also see some other things:

7 sparks are being created by our parMap, all of which are turned into real threads

The parallel GC does get a chance to run in parallel 22 times.
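The spark count falls straight out of the shape of the depth function: one list element, and so one spark, per tree depth from minN up to N in steps of 2. Here is a small sketch that rebuilds just that work list without allocating any trees. The workList name is made up, and minN = 4 is inferred from the benchmark output above rather than quoted from the source.

```haskell
import Data.Bits (shiftL)

-- Assumption: the smallest depth is 4, as the benchmark output suggests.
minN :: Int
minN = 4

-- One (number of trees, depth) pair per spark. This mirrors the post's
-- depth function but leaves out the tree building done by sumT.
workList :: Int -> [(Int, Int)]
workList maxN = go minN
  where
    go d
      | d <= maxN = (2 * n, d) : go (d + 2)
      | otherwise = []
      where n = 1 `shiftL` (maxN - d + minN)

main :: IO ()
main = do
  print (length (workList 16))  -- 7
  print (length (workList 20))  -- 9
  print (head (workList 16))    -- (131072,4)
```

So there are only 7 coarse jobs at N=16 and 9 at N=20 to share between 4 cores.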

And at N=20, the benchmark number, things aren’t any better:

19,439,350,240 bytes allocated in the heap
21,891,579,896 bytes copied during GC
134,688,800 bytes maximum residency (89 sample(s))
940,344 bytes maximum slop
376 MB total memory in use (6 MB lost due to fragmentation)
Generation 0: 14576 collections, 0 parallel, 20.67s, 20.62s elapsed
Generation 1: 89 collections, 88 parallel, 36.33s, 9.20s elapsed
SPARKS: 9 (9 converted, 0 pruned)
%GC time 64.0% (74.8% elapsed)

So yikes, we’re wasting a lot of time cleaning up after ourselves (though happily our par strategy isn’t wasting any fizzled sparks). Diving into the GC docs, we see:

Bigger heaps work better with parallel GC, so set your -H value high (3 or more times the maximum residency)

Ok. Let’s try to get that number down.

Helping out the GC

We can estimate how big to make the heap from the maximum residency statistic: about 135 MB at N=20, so three times that is roughly 400 MB. A good start might be 400M:

whirlpool$ time ./A 20 +RTS -N4 -H400M
stretch tree of depth 21 check: -1
2097152 trees of depth 4 check: -2097152
524288 trees of depth 6 check: -524288
131072 trees of depth 8 check: -131072
32768 trees of depth 10 check: -32768
8192 trees of depth 12 check: -8192
2048 trees of depth 14 check: -2048
512 trees of depth 16 check: -512
128 trees of depth 18 check: -128
32 trees of depth 20 check: -32
long lived tree of depth 20 check: -1
./A 20 +RTS -N4 -H400M 35.25s user 0.42s system 281% cpu 12.652 total

Ok, so that was pretty easy. Runtime has gone from 40s to under 13s. Why? Looking at +RTS -sstderr:

%GC time 6.8% (18.6% elapsed)
Generation 0: 86 collections, 0 parallel, 2.07s, 2.23s elapsed
Generation 1: 3 collections, 2 parallel, 0.34s, 0.10s elapsed

GC time is down under 10% too, which is a good rule of thumb. The original N=16 run, with its smaller number of trees, was taking 1.29s and is now down to:

whirlpool$ time ./A 16 +RTS -N4 -H400M
stretch tree of depth 17 check: -1
131072 trees of depth 4 check: -131072
32768 trees of depth 6 check: -32768
8192 trees of depth 8 check: -8192
2048 trees of depth 10 check: -2048
512 trees of depth 12 check: -512
128 trees of depth 14 check: -128
32 trees of depth 16 check: -32
long lived tree of depth 16 check: -1
./A 16 +RTS -N4 -H400M 1.26s user 0.38s system 285% cpu 0.575 total

So this is a reasonable stopping point.
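The 400M figure is really just the rule of thumb from the GC docs applied to the N=20 residency number. As a rough sketch of that arithmetic (the function names are made up, and I'm assuming the M suffix on -H means 2^20 bytes):

```haskell
-- Maximum residency reported by +RTS -sstderr at N=20, in bytes.
maxResidency :: Integer
maxResidency = 134688800

-- Rule of thumb from the GC docs: 3 or more times the maximum residency.
suggestedHeap :: Integer -> Integer
suggestedHeap residency = 3 * residency

-- Round up to whole megabytes, assuming 1M = 2^20 bytes.
toMegabytesCeil :: Integer -> Integer
toMegabytesCeil bytes = (bytes + mb - 1) `div` mb
  where mb = 1024 * 1024

main :: IO ()
main = putStrLn ("+RTS -H" ++ show (toMegabytesCeil (suggestedHeap maxResidency)) ++ "M")
-- prints: +RTS -H386M
```

That gives a floor of about 386M, so rounding up to 400M leaves a little headroom. The docs say "3 or more times", so larger values are worth trying if memory allows.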

The lessons

parMap can be quite effective and easy as a parallelisation strategy

if you have a reasonable parallelisation strategy but aren’t getting the performance, check what the GC is doing.
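Besides +RTS -sstderr, a program can also ask the runtime about the GC from the inside. The following is a minimal sketch assuming a modern GHC, where base provides GHC.Stats with getRTSStats (it was not available in the 6.11 snapshot used here). Statistics are only collected when the program runs with +RTS -T. The gcShare helper and the placeholder workload are made up for illustration.

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: percentage of time spent in GC, from nanosecond counters.
gcShare :: Int64 -> Int64 -> Double
gcShare gcNs totalNs
  | totalNs <= 0 = 0
  | otherwise    = 100 * fromIntegral gcNs / fromIntegral totalNs

main :: IO ()
main = do
  -- Placeholder workload that allocates; substitute the real program here.
  print (sum (map fromIntegral [1 .. 1000000 :: Int]) :: Integer)
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "run with +RTS -T to enable RTS statistics"
    else do
      s <- getRTSStats
      putStrLn ("collections:   " ++ show (gcs s))
      putStrLn ("max residency: " ++ show (max_live_bytes s) ++ " bytes")
      putStrLn ("GC share:      " ++ show (gcShare (gc_elapsed_ns s) (elapsed_ns s)) ++ "%")
```

The maximum residency and the GC share of elapsed time are the same two numbers used in this post to choose a -H value and to judge whether it helped.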

And as a final remark, we can look forward to what’s around the corner for GHC:

12.1 Independent GC … We fully intend to pursue CPU-independent GC in the future … moving to more explicitly-separate heap regions is a more honest reflection of the underlying memory architecture …

So hopefully soon each core will be collecting its own binary trees.

References

Complete details of the new GC are in the following papers, particularly the new RTS paper:

Parallel generational-copying garbage collection with a block-structured heap, Simon Marlow, Tim Harris, Roshan P. James, Simon Peyton Jones, International Symposium on Memory Management 2008.

Runtime Support for Multicore Haskell, Simon Marlow, Simon Peyton Jones and Satnam Singh. Submitted to ICFP 09.

And as a final teaser, more on the multicore Haskell story this week: