I'm pleased to announce the availability of criterion, a new library for measuring the performance of Haskell code.

Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.

Here's a canonical benchmark-worthy function, which has the desirable properties of being both small and slow:





import Criterion.Main (defaultMain)



fib :: Int -> Int

fib 0 = 0

fib 1 = 1

fib n = fib (n -1 ) + fib (n -2 )



To compile a program that will benchmark our fib function, we write a trivial main entry point:





main = defaultMain [

bench "fib 10" $

-> fib ( 10 +n-n)

, bench "fib 30" $

-> fib ( 30 +n-n)

, bench "fib 35" $

-> fib ( 35 +n-n)

]



Criterion itself is trivially easy to install.

$ cabal install criterion

Once it's installed, we build the application we intend to benchmark.

$ ghc -O --make Fibber

If we run the Fibber program with the --help option, it prints a large amount of help output. The most interesting options early on are -t , which displays raw timing information, and -k , which turns the timing information into a chart of the probability distributions of timings. So for immediate gratification, we can run Fibber like this:

$ ./Fibber -t win -k win

The win argument to each option means "display these charts in a window". For this article, I used the png argument instead, to get nicely sized images that I could simply drop into WordPress.

$ ./Fibber -t png:450x175 -k png:450x175

Here's an example of raw timing information:

Notice that the times on the Y axis are labelled with their units, in this case microseconds. On the X axis is the run count.

Timing information is somewhat nice to have, but much more useful is an indication of which times occurred most frequently. The traditional way to display such information is with a histogram. Histograms are pretty awful, since a histogram with an incorrect bin size is either misleading or useless, and bin sizes have to be chosen by hand. Much more useful is a kernel density estimate:

This is a chart of the estimated probability densities of timings. We observe a big hump at about 5.23µs, with a few small outliers further out. This suggests that it usually takes about 5.23µs to evaluate fib 10 .

Meanwhile, in our console window, our Fibber program has printed a lot more information. Here's how it starts:

estimating clock resolution... mean is 9.956511 us (80001 iterations) found 3711 outliers among 79999 samples (4.6%) 3505 (4.4%) high severe

The first thing that Fibber does is measure how long it takes for the system clock to tick. It will then run our benchmarked code enough times that the resolution of the clock will not introduce a significant error.

Indeed, our first benchmark shows why this is important: if fib 10 takes 5.23µs to evaluate, but the clock ticks once every 9.96µs, we clearly can't just evaluate fib 10 once, and subtract the start from the end time. Sometimes it will appear to evaluate instantaneously, since the clock won't have ticked, and sometimes it will appear to have taken about 10µs, since the clock will have ticked exactly once. Both answers would be worse than useless.

Instead, we evaluate fib 10 many thousands of times in order to get the measurement error introduced by the resolution of the clock down to a few parts in ten thousand.

We also use something called the boxplot technique to develop a quick sense of the quality of our data. In this case, a few percent of our numbers are significantly off from the mean of the sample.

Once we've characterised the system clock's period, we figure out how expensive it is to actually use the clock.

estimating cost of a clock call... mean is 923.3224 ns (58 iterations) found 4 outliers among 58 samples (6.9%) 1 (1.7%) high mild 3 (5.2%) high severe

We adjust our timing measurements to take the cost of clock calls into account, even though in practice this makes almost no difference to the numbers we report. (If you're measuring an event that occurs on a millisecond-to-second time scale, you wouldn't expect a one-microsecond difference to perturb things much!)

The last part is the most interesting. We automatically figure out how many times we need to evaluate fib 10 :

benchmarking fib 10 collecting 100 samples, 1911 iterations each, in estimated 995.8188 ms

In this case, we evaluate fib 10 1911 times and measure the cost of this, then repeat this measurement 100 times. We print an estimate in advance for how long the measurements will take, so that if you're benchmarking something expensive, you can go for coffee or whatever. (If you can fetch coffee in 995.8 milliseconds, more power to you!)

Why do we measure so many times? Because we're going to do some statistical analysis of our measurements to see whether they are trustworthy.

bootstrapping with 100000 resamples

The bootstrap step takes a second or two, and this is where the interesting stuff takes place. The performance of real-world applications doesn't follow some kind of tidy statistical pattern such as the normal distribution, especially when people are measuring in realistic, noisy environments.

If you run a time-consuming benchmark on your laptop, you're likely to want to switch to your web browser, watch a video on youtube, check your mail, and so on. Your CPU might slow down because it's overheating, or your laptop's ACPI subsystem might change the CPU frequency to conserve power.

In other words, there are lots of things that might perturb our benchmarking results, some of which we can't see or control, but then again they might not occur or might not be important, and our numbers could be fine. How can we tell?

We use a statistical technique known as the bootstrap, with which we take some sample data and perform some number crunching to tell us interesting things. Specifically, we can use the results of the bootstrap to tell us whether the outliers in our measurements (timings that are far from the mean) are perturbing our numbers in a significant way.

The first thing that the bootstrap tells us is the mean and standard deviation of our measurements, along with the 95% confidence intervals for those values.

mean: 5.232800 us, lb 5.222862 us, ub 5.262482 us, ci 0.950 std dev: 79.86726 ns, lb 31.06332 ns, ub 173.0716 ns, ci 0.950

It then reports on the outliers in the measurements, and most importantly, tells us whether they are important.

found 7 outliers among 100 samples (7.0%) 3 (3.0%) high mild 4 (4.0%) high severe variance introduced by outliers: 0.993% variance is unaffected by outliers

I ran the above example on my laptop when it was idle, so the timing measurements are fine. Boring!

What about a more interesting example? I measured the performance of fib 35 :

collecting 100 samples, 1 iterations each, in estimated 88.14368 s

While this executed, I ran a stupid but computationally expensive task in another window:

while true; do for ((i=0;i<20;i++)); do md5sum cufp.odp >/dev/null & done for ((i=0;i<20;i++)); do wait done done

This shell script grinds over a largish file repeatedly, but does so using 20 processes at a time. It reliably drives the load average on my laptop up to 8. What did it do to my measurements?

As the timing numbers suggest, I ran my CPU-chomping script twice while taking measurements. This had a noticeable effect not just on the timings, but on their probability distributions too:

Those are two fairly large humps, which certainly seem like they should give cause for concern. What does Fibber report on the console?

mean: 1.566470 s, lb 1.333717 s, ub 1.855111 s, ci 0.950 std dev: 1.324608 s, lb 1.114684 s, ub 1.510580 s, ci 0.950 found 23 outliers among 100 samples (23.0%) 22 (22.0%) high severe variance introduced by outliers: 52.994% variance is severely inflated by outliers

Notice that the standard deviation of the sample is almost the same as the mean, a classic sign of sketchy performance measurements. However, we don't merely have to spot this suspicious pattern ourselves: by analysing the results of the bootstrap, our benchmark automatically tells us that our numbers are, in effect, junk.

So! Thanks to criterion, you can now easily measure the performance of Haskell code, and even better, you'll (usually) be told when you're lying to yourself. Oh, and you can produce those charts in PDF, SVG, and CSV formats, too, in addition to PNG and windowed output, so you can simply drop camera-ready copy straight into your next Haskell Implementor's Workshop submission!

There's plenty of work yet to be done. In order to improve the rigour of the bootstrap analysis, I want to perform an autocorrelation test on the timing measurements, since an autocorrelated sample can distort the result of a bootstrap analysis. I'd also like to be able to produce charts with many estimates overlaid, to allow things like scaling and comparative analyses. But I think that criterion is off to a solid start. If you'd like to work on any of these, let me know: I think that the code is quite clean and easy to work on, and of course the darcs repository for the source is easy to get:

darcs get http://darcs.serpentine.com/criterion

It should even be easy to use criterion to benchmark C code and command line programs.

Much of my work on criterion was inspired by Brent Boyer's excellent articles on Java performance analysis (part one and part two). Amusingly, Java and Haskell each pose substantial benchmarking challenges: in the case of Java, dynamic recompilation makes getting good numbers very challenging, while for Haskell, lazy evaluation brings its own difficulties.