scaling ghc --make

Tl;DR: to speed up the build of a haskell package use:

$ ghc --make -j +RTS -A256M -qb0 -RTS <your-args>

Story mode:

In the previous post I showed a tiny hack to make sense of ghc using perf. My actual goal was to get a basic understanding of why ghc --make does not scale perfectly.

The scalability problem is known as GHC trac #9221.

Why ghc --make -j scalability matters

I care about ghc --make speed because I build a lot of haskell packages as part of gentoo-haskell overlay maintenance.

My workflow to add or update a package is simple:

quickly build the new version of a package (or packages) against my existing haskell world (~1500 globally installed haskell packages) and run its tests

install this new version in system and rebuild rest of haskell packages against it

Installation and rebuild usually happens overnight.

Quick build is a relatively interactive phase. It fails frequently as packages require some tweaks along the way, like upper bound fixes or minor API changes.

I use various hacks to speed up quick builds:

build with optimisations disabled: -O0

build in parallel ghc --make -j mode

The symptoms

My machine is an x86_64 desktop (4x2 HT cores on a single die). Nothing fancy. Should be an easy target to get perfect parallelism. Right?

At first I chose the highlighting-kate package as it has at least 100 independent modules for various languages.

The ghc --make -j<N> histogram looks like this:

The problem is that even on my machine ghc --make -j3 works better than ghc --make -j8 and I could not get more than 1.5x speedup. I’d like to see 8x speedup. Or at least 4x :)

Another problem is quick performance degradation once you exceed the number of cpus. Running ghc --make -j20 on my machine makes things seriously worse: compilation time is longer than in non-parallel mode! That did not look good.

The clues

So why are things so bad? And more importantly, how would you diagnose a similar issue in another haskell program? The problem looked like a nice way to get basic experience in this area.

In comment #9 Gintas found one of the problems with large <N>, but I was not able to understand his reasoning at the time as I knew too little about the GHC runtime.

Simon suggested larger heap sizes and allocation areas. I tried it and got better performance numbers! Best numbers were at -A256M: a 2.5x speedup! But what is the real bottleneck here? How do we estimate the number in the general case?

The bug contained a lot of speculation about lock contention of sorts but no simple cause of the slowness.

Down the rabbit hole

I decided to try to find out what happens there. Maybe someone else will have an AHA! moment and fix things for everyone if I manage to find that useless sleep(1) somewhere :)

perf, take 1

I tried to run perf record on my benchmark to figure out what happens with the process. perf suggested GHC did mostly sched_yield(), which does not sound like a useful call.

threadscope, take 1

Simon also suggested trying ThreadScope on GHC itself. In theory it's quite simple: you need to relink ghc-stage2 with -eventlog as:

# mk/build.mk
GhcStage2HcOpts += -rtsopts -eventlog

and run GHC with event logging enabled as ghc --make ... +RTS -l.
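The eventlog is not limited to RTS events, by the way: a program can emit its own timeline events via Debug.Trace. A minimal sketch (the event names below are made up); compile with -eventlog and run with +RTS -l to see the events next to the GC/mutator activity:

```haskell
-- Sketch: custom user events land in the same eventlog ThreadScope reads.
-- Without +RTS -l these calls are cheap no-ops.
import Debug.Trace (traceEventIO, traceMarkerIO)

main :: IO ()
main = do
  traceEventIO "work:start"          -- shows up as a user event on the timeline
  let s = sum [1 .. 1000000 :: Int]
  traceMarkerIO ("sum=" ++ show s)   -- markers are searchable in ThreadScope
  traceEventIO "work:end"
  print s
```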

Then running threadscope ghc-stage2.eventlog should show us nice graph of all sorts of interesting things. Or not.

Alas, threadscope crashed on the generated eventlog file. At first I thought it was a threadscope bug and filed one.

After threadscope got a fix it still could not read my eventlog profiles. I misattributed that to another ghc bug, #9003, where GHC accidentally broke the eventlog ABI by changing enum values for existing events in one of the GHC releases.

After bug #9003 got fixed my dumps still did not render! I filed a third ghc bug, #9384, which basically says that calling setNumCapabilities in your haskell program usually breaks your eventlog.

I didn't think of trying to fix it, but in the process of finding the trigger I found a workaround: if you pass the RTS option -N<N> at program startup with the same value you later pass to setNumCapabilities, everything works just fine. Running ghc --make -j<N> +RTS -N<N> -l will not change GHC's behaviour except that it will produce readable eventlogs!
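The workaround can be sketched as a tiny program (hypothetical, not GHC's actual code): read the capability count that -N<N> set at startup and pass the very same value to setNumCapabilities, so the "resize" is a no-op and the event stream stays consistent. Build with ghc -threaded -rtsopts -eventlog and run with +RTS -N4 -l:

```haskell
-- Sketch of the setNumCapabilities/-N<N> workaround from the text:
-- resizing to the value -N already set keeps the eventlog parseable.
import GHC.Conc (getNumCapabilities, setNumCapabilities)

main :: IO ()
main = do
  n <- getNumCapabilities   -- whatever +RTS -N<N> gave us at startup
  setNumCapabilities n      -- a no-op resize: no capability count change
  n' <- getNumCapabilities
  print (n == n')
```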

I looked at the eventlog output and did not know what to look at.

Later Alexander Vershilov fixed that bug with 2 lines of C code. Tl;DR: GHC did not handle dynamic resizing of the number of capabilities. Thank you!

perf, take 2

Half a year later many more things happened:

Peter's DWARF support for GHC landed and I found a few bugs in it: ghc bug #10655, where -g breaks rewrite rules, and ghc bug #10667, where -g produces broken assembly that GNU as rejects.

I've mastered the black magic of porting GHC to weird architectures: basically, I built an unregisterised GHC on amd64 and made it pass most tests. Some details are in this blog post. I've started Seeing Things!

I attempted to use perf, found out why perf can't disassemble haskell functions, and wrote another blog post on how to make it see the code (still needs more things to be upstreamed).

perf also suggested that one of the most popular userspace functions being called was stg_BLACKHOLE_info().

Ad-hoc rts patching

Blackholes are a special kind of thunk that usually denotes a thunk being evaluated by another haskell thread. When a normal (closure) thunk is evaluated by a thread, it can be updated at one of two points in time: either before (eager) or after (lazy) the actual thunk evaluation.

By default GHC uses the lazy strategy (unless -feager-blackholing is used). That means the same thunk can be evaluated by multiple execution threads, but that is fine for a typical pure program.

It's not fine for sensitive things like unsafePerformIO thunks, which are guaranteed to be evaluated once.

For those sensitive thunks GHC uses blackholes: on entry the closure type is atomically rewritten to a BLACKHOLE thunk which points to the thread (TSO) that locks this thunk. When any other thread tries to enter the blackhole it blocks on that thunk. When the first thread finishes the computation it replaces the TSO pointer in the blackhole with the actual result of the computation and resumes the second thread. The second thread picks up the result of the evaluation and continues execution.
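The once-only guarantee that this machinery protects can be seen in a toy example (the names below are made up for illustration): the unsafePerformIO thunk performs a visible side effect when forced, and no matter how many call sites force it, the effect runs exactly once:

```haskell
-- Toy illustration of the "evaluated once" guarantee for unsafePerformIO
-- thunks: forcing `shared` twice still bumps the counter only once.
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import System.IO.Unsafe (unsafePerformIO)

counter :: IORef Int
counter = unsafePerformIO (newIORef 0)
{-# NOINLINE counter #-}

shared :: Int
shared = unsafePerformIO $ do
  modifyIORef' counter (+ 1)   -- side effect: must happen exactly once
  readIORef counter
{-# NOINLINE shared #-}

main :: IO ()
main = do
  print (shared + shared)      -- both uses force the same thunk
  readIORef counter >>= print  -- the effect ran once
```

In a threaded program it is the BLACKHOLE that stops a second thread from re-running the effect while the first one is inside the thunk.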

Thus my assumption was that the main scalability problem was a large number of unsafePerformIO calls causing high haskell-thread rescheduling churn.

I tried to modify the cmm code to print the object pointed to by a blackhole, but most of the time those are not TSO objects but I# (integers), labels and other things. GHC indeed uses unsafePerformIO for its global unique string pool; those I# values are likely a sign of it (see mkSplitUniqSupply).

I tried to recompile GHC with the -feager-blackholing option enabled. GHC became slightly slower in the single-threaded case (expected), slightly slower in the multi-threaded case (unexpected) and did not degrade as fast as the base case on large -j<N> (totally unexpected!).

It means that GHC evaluates some expensive things more than once. How would we find which ones?

GHC needs something in the rts or in the eventlog to expose stats for:

blackhole synchronization (who stalls other threads?)

things needlessly evaluated more than once (what wastes cpu/heap?)

speeding up parallel GC: -qb0

The more I tried to play with various GHC changes the more I realised my machine is not good enough to expose bottlenecks. I decided to pick a 24-core virtual machine and gave threadscope another chance.

This time haskell threads were frequently stopped by garbage collector threads. And garbage collector CPU load was unevenly balanced across cores!

I started reading a parallel GC paper by Simon which describes in detail how things work.

Basically there are two phases of parallel GC:

non-work-stealing phase, when each GC thread scans its own Capability's TSO stacks and allocation area (aka generation-0, g0, eden, nursery)

work-stealing phase, when each GC thread either processes its own queue of from-space pointers and copies things to to-space, or steals work from other threads' queues

GHC as a compiler does not take much heap to compile a file. A huge allocation area (-A256M in my case) is enough to compile an average file. Older generations (g1+) don't have much to scan because there is nothing to put there!

It's funny that +RTS -sstderr always showed me that GC parallelism number and I did not even notice it:

Parallel GC work balance: 22.34% (serial 0%, perfect 100%) # -qb1, default
...
Parallel GC work balance: 74.48% (serial 0%, perfect 100%) # -qb0
...
Parallel GC work balance: 80.03% (serial 0%, perfect 100%) # -qb0

The bottleneck is the allocation area scan. GHC's RTS happens to have a knob to enable work stealing on g0: the -qb0 option.
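If in doubt whether the knobs took effect, the RTS flag values can be read back from inside a haskell program. A sketch using GHC.RTS.Flags (field names as shipped with base-4.9, so this is an assumption about your GHC version): -A<size> ends up in minAllocAreaSize (in words) and -qb<gen> in parGcLoadBalancingGen:

```haskell
-- Sketch: confirm -A and -qb settings via GHC.RTS.Flags.
-- Build with -rtsopts and run e.g. `+RTS -A256M -qb0 -RTS` to see them change.
import GHC.RTS.Flags (getGCFlags, minAllocAreaSize, parGcLoadBalancingGen)

main :: IO ()
main = do
  gc <- getGCFlags
  putStrLn ("alloc area (words): " ++ show (minAllocAreaSize gc))
  putStrLn ("load balancing from gen: " ++ show (parGcLoadBalancingGen gc))
```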

It was a magic knob that increased GHC parallelism from a 2.5x to a 7x speedup on the 24-core VM! I got a similar effect on my 8-core desktop, where the speedup went from 2.5x to 3.5x.

Two pictures of building highlighting-kate on the 24-core VM without and with -qb0:

The build shrank from 20 seconds to 9 seconds. The first picture shows GHC struggling to load even half the cpus at peak, while the second saturates all 24 at peak and then tails off on slow modules. GREEN is mutator cpu usage, ORANGE is GC cpu usage.

It also became obvious that having more GC threads than available cpus always hurts performance.

That’s what Gintas meant by setNumCapabilities discrepancy. The first fix was obvious.

It does not fix the normal case but at least does not make things worse when ghc --make -j<N> exceeds the number of cpus available.

To improve the normal case, GHC now autotunes the parallel scan based on allocation area size.

multithreading vs. multiprocessing

Threadscope also showed that the load imbalance on highlighting-kate is high due to uneven workload across haskell threads. Basically, not all highlighting modules are equal in size, which is obvious in hindsight. So much for a perfect benchmark :)

I've written a synthetic benchmark, synth.bash, which generates a lot of independent haskell sources of even size and measures their build time:

#!/bin/bash

MODULES=128
FIELDS=100

GHC=ghc
GHC=~/dev/git/ghc-perf/inplace/bin/ghc-stage2

rm -rf src/
mkdir -p src

for m in `seq 1 ${MODULES}`; do
    {
        echo "module M${m} where"
        echo "data D = D0"
        for f in `seq 1 ${FIELDS}`; do
            echo "       | D${f} { f${f} :: Int}"
        done
        echo "    deriving (Read, Show, Eq, Ord)"
    } > src/M${m}.hs
done

#perf record -- \
time \
    $GHC \
        -hide-all-packages \
        -package=base \
        --make src/*.hs -j +RTS -A256M -RTS "$@"

It's nice to be able to tweak the numbers to adapt to small and large machines. It's also interesting to compare with multiprocess GHC using the following Makefile:

OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
	~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
	$(RM) $(OBJECTS)

.PHONY: clean

Note the difference: here we run separate ghc -c processes on each .hs file and let GNU make do the scheduling.

This benchmark raises even more questions! It shows (see comment #65) that multiprocess compilation scales perfectly and is fastest at make -j<N> where N is the number of cpus. It's even faster than ghc --make -j (both on the 8-core and the 24-core machine).

It means there are still a few issues to understand here :)

This benchmark also shows the best performance I should expect from GHC on these machines. The multiprocess (Makefile-based) benchmark is expected to have ideal scalability; if it doesn't, the bottleneck is somewhere outside GHC (kernel, RAM bandwidth, etc.).

4x2 HT desktop:

$ make clean; time make -j1

real    1m2.561s
user    0m56.523s
sys     0m5.560s

$ make clean; time make -j4

real    0m18.936s
user    1m7.549s
sys     0m6.857s

$ make clean; time make -j8

real    0m15.964s
user    1m52.058s
sys     0m9.929s

That's 3.93x (almost exactly 4x): an almost ideal saturation of 4 physical cores.

12x2 HT VM:

$ make clean; time make -j1

real    1m33.147s
user    1m20.836s
sys     0m11.556s

$ make clean; time make -j12

real    0m10.537s
user    1m36.276s
sys     0m16.948s

$ make clean; time make -j24

real    0m7.336s
user    2m15.936s
sys     0m19.004s

This is a 12.7x speedup: better-than-ideal saturation of 12 physical cores.

Note how confusing the user time is across the -j1 / -j12 / -j24 cases. It claims the CPU did twice as much work, but it should not, as I throw the same amount of work at the CPUs in all runs.

It's one of the confusing parts of hyperthreading: the operating system lies to you about how much work was actually done.

perf, take 3

Slicing and dicing the metrics reported by perf, I tried to get an idea of where the overhead appears in the multithreaded case.

I’ve noticed GHC works way faster when compiled with -fno-worker-wrapper -fno-spec-constr flags.

That instantly sped GHC up by 10% on the synth benchmark and made the GHC bootstrap 5% faster.

GHC managed to inflate this tiny snippet:

cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
cmmExprNative referenceKind expr = do
     dflags <- getDynFlags
     let platform = targetPlatform dflags
         arch = platformArch platform
     case expr of
       CmmLit (CmmLabel lbl)
          -> do
             cmmMakeDynamicReference dflags referenceKind lbl

into the following code:

       │    cmmExprNative :: ReferenceKind -> CmmExpr -> CmmOptM CmmExpr
       │    cmmExprNative referenceKind expr = do
  0,11 │      cmp    $0x3,%rax
       │    ↑ jb     3ceb930 <cFO7_info+0x8b0>
       │    -- we must convert block Ids to CLabels here, because we
       │    -- might have to do the PIC transformation. Hence we must
       │    -- not modify BlockIds beyond this point.
       │
       │    CmmLit (CmmLabel lbl)
       │       -> do
  2,02 │      add    $0x890,%r12
       │      cmp    0x358(%r13),%r12
       │    ↑ ja     3cf456f <cFIc_info+0x7df>
  0,16 │      mov    0x7(%rbx),%rax
  0,59 │      lea    ghc_DynFlags_DynFlags_con_info,%rbx
  0,05 │      mov    %rbx,-0x888(%r12)
  3,41 │18e9: mov    0x50(%rsp),%rbx
  0,05 │      mov    %rbx,-0x880(%r12)
  0,32 │      mov    0x58(%rsp),%r14
       │      mov    %r14,-0x878(%r12)
       │      mov    0x60(%rsp),%rbx
       │      mov    %rbx,-0x870(%r12)
  0,05 │      mov    0x68(%rsp),%r14
       │      mov    %r14,-0x868(%r12)
       │      mov    0x70(%rsp),%rbx
       │      mov    %rbx,-0x860(%r12)
       │      mov    0x78(%rsp),%r14
  0,11 │      mov    %r14,-0x858(%r12)
  0,05 │      mov    0x80(%rsp),%rbx
       │      mov    %rbx,-0x850(%r12)
  0,05 │      mov    0x88(%rsp),%r14
       │      mov    %r14,-0x848(%r12)
       │      mov    0x90(%rsp),%rbx
       │      mov    %rbx,-0x840(%r12)
  0,05 │      mov    0x98(%rsp),%r14
  0,05 │      mov    %r14,-0x838(%r12)
  0,11 │      mov    0xa0(%rsp),%rbx
       │      mov    %rbx,-0x830(%r12)
       │      mov    0xa8(%rsp),%r14
       │      mov    %r14,-0x828(%r12)
  0,05 │      mov    0xb0(%rsp),%rbx
       │      mov    %rbx,-0x820(%r12)
       │      mov    0xb8(%rsp),%r14
  ... <a few more pages of it>

That appeared to be a bug: the specialiser was too eager to move strict fields out of structs into separate function arguments (aka worker arguments), even if it meant pulling out 180 arguments.
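For illustration, here is the worker/wrapper split written out by hand (addP and addPWorker are made-up names; GHC generates $w-prefixed workers internally). Now imagine the record having ~180 strict fields and the worker taking all of them as separate arguments:

```haskell
-- Hand-written analogue of GHC's worker/wrapper transformation.
data P = P !Int !Int

-- wrapper: small and inlinable, only unpacks the strict fields
addP :: P -> Int
addP (P a b) = addPWorker a b

-- worker: does the real work with the fields as separate arguments;
-- this argument list is what explodes on very wide records
addPWorker :: Int -> Int -> Int
addPWorker a b = a + b

main :: IO ()
main = print (addP (P 2 3))
```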

The optimisation should be controlled by the -fmax-worker-args=<N> flag, but the flag got lost when the demand analyser was rewritten a few years ago.

I noticed it by accident when digging through the DynFlags structure in GHC: many of the flags were unused (including -fmax-worker-args=<N>).

That gave rise to ghc bug #11565.

Final result

On my 8-core desktop the final improvement is 3.72x (22 seconds vs. 82 seconds):

$ ./synth.bash -j1 +RTS -sstderr

 100,430,153,096 bytes allocated in the heap
   3,970,134,600 bytes copied during GC
     145,432,032 bytes maximum residency (16 sample(s))
       1,639,792 bytes maximum slop
             643 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       363 colls,     0 par    7.065s   7.060s     0.0194s    0.0496s
  Gen  1        16 colls,     0 par    1.485s   1.484s     0.0928s    0.1705s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   64.242s  ( 73.611s elapsed)
  GC      time    8.550s  (  8.544s elapsed)
  EXIT    time    0.012s  (  0.015s elapsed)
  Total   time   72.821s  ( 82.172s elapsed)

  Alloc rate    1,563,303,909 bytes per MUT second

  Productivity  88.3% of total user, 89.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    1m22.213s
user    1m21.061s
sys     0m1.139s

$ ./synth.bash +RTS -sstderr
...
 100,608,799,232 bytes allocated in the heap
   3,959,638,248 bytes copied during GC
     173,982,704 bytes maximum residency (8 sample(s))
       2,947,048 bytes maximum slop
            2556 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        53 colls,    53 par   15.480s   2.055s     0.0388s    0.0767s
  Gen  1         8 colls,     7 par    3.943s   0.558s     0.0697s    0.0897s

  Parallel GC work balance: 75.64% (serial 0%, perfect 100%)

  TASKS: 19 (1 bound, 18 peak workers (18 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time  131.375s  ( 19.329s elapsed)
  GC      time   19.423s  (  2.613s elapsed)
  EXIT    time    0.016s  (  0.018s elapsed)
  Total   time  150.830s  ( 21.962s elapsed)

  Alloc rate    765,815,750 bytes per MUT second

  Productivity  87.1% of total user, 88.1% of total elapsed

gc_alloc_block_sync: 370037
whitehole_spin: 0
gen[0].sync: 564078
gen[1].sync: 520393

real    0m22.068s
user    2m45.042s
sys     0m4.318s

On the 24-core VM the speedup is around 8x (9 seconds versus 75 seconds).

The numbers suggest GC takes only about 12% of elapsed time in both cases. I'm not sure if the 200M allocation difference can be attributed to throw-away work done by haskell threads.

CPU-side MUT and GC times roughly doubled. That's not throw-away work but a hyperthreading artifact.

Conclusions

I have not solved the problem yet. And don’t fully understand it either! But I feel we are on the right track :)

Some achievements:

ghc --make is slightly faster now!

Now I (and you!) understand haskell runtime performance a bit better

we’ve fixed a few seemingly unrelated bugs :)

eventlog works in programs calling setNumCapabilities

threadscope can read GHC traces

large -A size autotunes parallel GC parameters

ghc --make does not create more Capabilities than cpus available

we've learned hyperthreading is tricky :)

How you can improve GHC:

GHC is full of low-hanging performance improvements and they are easy to fix :)

haskell profiling tools need improvement to see where BLACKHOLEs come from (I suspect most of them come from the mkSplitUniqSupply function)

perf is still too hard to use on GHC programs

Have fun!

Posted on October 1, 2016
