Using GHC 7.0.3, I can reproduce your heavy GC behavior:

```
$ time ./A +RTS -s
  %GC time      92.9%  (92.9% elapsed)
./A +RTS -s  7.24s user 0.04s system 99% cpu 7.301 total
```

I spent 10 minutes going through the program. Here's what I did, in order:

1. Set GHC's `-H` flag, increasing limits in the GC
2. Check unpacking
3. Improve inlining
4. Adjust the first generation allocation area

The result is a 10-fold speedup, with GC down to around 45% of runtime.
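For reference, here is a hypothetical skeleton of the kind of allocation-heavy tree benchmark being tuned; the `Tree` type and the function names are assumptions, not the original source:

```haskell
module Main where

-- A simple binary search tree; building it by repeated insertion
-- allocates a fresh spine of nodes on every insert, which is what
-- puts pressure on the GC.
data Tree = Leaf | Node !Int Tree Tree

-- Height of a tree.
height :: Tree -> Int
height Leaf         = 0
height (Node _ l r) = 1 + max (height l) (height r)

-- Insert a value, rebuilding the path from the root.
insert :: Int -> Tree -> Tree
insert x Leaf = Node x Leaf Leaf
insert x t@(Node y l r)
  | x < y     = Node y (insert x l) r
  | x > y     = Node y l (insert x r)
  | otherwise = t

main :: IO ()
main = print (height (foldr insert Leaf keys))
  where
    -- crude pseudo-random keys, to keep the tree reasonably shallow
    keys = [ (i * 2654435761) `mod` 1000003 | i <- [1 .. 100000 :: Int] ]
```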

In order, then: using GHC's magic `-H` flag, we can reduce that runtime quite a bit:

```
$ time ./A +RTS -s -H
  %GC time      74.3%  (75.3% elapsed)
./A +RTS -s -H  2.34s user 0.04s system 99% cpu 2.392 total
```

Not bad!

The UNPACK pragmas on the Tree nodes won't do anything, so remove those.
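To see why: GHC honours `UNPACK` only on a strict field whose type is a single-constructor datatype, which it can then flatten into the parent constructor. In a sketch of the tree type (an assumption; the real definition isn't shown here), only the `Int` field qualifies:

```haskell
data Tree
  = Leaf
  | Node {-# UNPACK #-} !Int  -- unpacks: Int has one constructor,
                              -- so the field becomes a raw Int#
         !Tree                -- an UNPACK pragma here would be ignored:
         !Tree                -- Tree has two constructors (Leaf | Node)

-- Count the Node constructors, just to exercise the type.
size :: Tree -> Int
size Leaf         = 0
size (Node _ l r) = 1 + size l + size r
```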

Inlining `update` shaves off more runtime:

```
./A +RTS -s -H  1.84s user 0.04s system 99% cpu 1.883 total
```

as does inlining `height`:

```
./A +RTS -s -H  1.74s user 0.03s system 99% cpu 1.777 total
```
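Both changes amount to adding `INLINE` pragmas; a sketch, assuming the program defines a `height` along these lines (the same applies to `update`):

```haskell
data Tree = Leaf | Node !Int Tree Tree

-- INLINE asks GHC to substitute the function body at each call
-- site, removing call overhead and letting the optimiser work on
-- the result in the context of the caller.
{-# INLINE height #-}
height :: Tree -> Int
height Leaf         = 0
height (Node _ l r) = 1 + max (height l) (height r)
```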

So while it is fast, GC is still dominating -- since we're testing allocation, after all. One thing we can do is increase the first gen size:

```
$ time ./A +RTS -s -A200M
  %GC time      45.1%  (40.5% elapsed)
./A +RTS -s -A200M  0.71s user 0.16s system 99% cpu 0.872 total
```

And increasing the unfolding threshold, as JohnL suggested, helps a little:

```
./A +RTS -s -A100M  0.74s user 0.09s system 99% cpu 0.826 total
```

which is what, 10x faster than we started? Not bad.

Using ghc-gc-tune, you can see runtime as a function of `-A` and `-H`.

Interestingly, the best running times use very large -A values, e.g.