I’ve been doing some benchmarking with Criterion recently, and wanted to write up the results. However, I didn’t really understand how it was arriving at some of the statistics that it spits out, particularly the confidence intervals. One book on the subject later, and I think I’ve got it. I thought I’d write it down in case anyone else needed to know. (I’m not a statistics expert though, so please do send corrections if necessary.) The mean result from Criterion is straightforward; the benchmark is run (typically 100 times) and the mean is calculated in the standard way (total time divided by count). But Criterion also gives 95% confidence intervals for the mean, and these require some more detailed explanation.

Confidence Intervals

Confidence intervals are a notoriously tricky part of statistics. A confidence interval of 95% means that if you run many separate experiments, and calculate a confidence interval for each of them, 95% of those intervals will contain the true mean. The kicker being that you have no way of knowing whether the interval you have for your one experiment is one of the 95% that contain the mean, or one of the 5% that don’t.

One way to calculate a confidence interval is to make an assumption about the distribution of the data. It is important to realise that this is an assumption. For example, you might be tempted to assume the times were normally distributed — but do you really believe that your times come from a perfectly symmetric bell-curve? Run times are often skew, distributed more exponentially, or simply too noisy to pick a distribution that could have generated them. (You can perform a statistical test for the validity of the normality assumption in your sample, if you’re interested.) If you did pick a distribution and calculate the parameters — e.g. the variance of a normal distribution — you could say where 95% of the values are likely to fall because you know what shape the distribution is. But if you don’t know the shape of the real distribution (from which your noisy sample was drawn), you need to use a different technique; Criterion uses the bootstrap.

The Bootstrap

Imagine for a moment that rather than just one mean, you instead had 1000 means. A straightforward way to form a 95% confidence interval would be to arrange your sample times in order, and pick the 2.5% mark (roughly the 25th highest mean) and the 97.5% mark (roughly the 975th highest mean). This method makes no new assumptions, and would derive the interval just from the data. Of course, you don’t have the time to wait around and record enough samples to form 1000 means. (Some of my benchmarks took in the order of 20 minutes per benchmark to collect the 100 values for a single mean; for 1000 means I would have to wait a long time!)

The bootstrap technique generates a lot of means, by “faking” a lot more samples than you actually recorded and taking the mean of each of them. It takes your single sample of 100, and generates many more samples (in Criterion, 100000 by default) by randomly picking 100 values out. Obviously, if you picked 100 values from a set of 100 without replacement you’d always end up with the starting set; the bootstrap picks with replacement, so it is very likely to feature repeated original values in the new fake sample, and also to omit some of the original values in each new fake. The confidence interval can be formed from these fake samples as described in the previous paragraph.

That is the intuition of the bootstrap method of calculating confidence intervals. In fact, Criterion uses bootstrap with bias correction and acceleration. These two extra measures affect how the confidence intervals are calculated by attempting to adjust for skew and sensitivity in the data. Unfortunately the reference books I have available are not very illuminating on the intuition behind bias and acceleration — so you and I will have to remain in the dark, pending a better reference.

The Code

Often, programmers learn a system best by tinkering and seeing what happens. In that spirit, here is a little code that takes a sample and prints out the bootstrapped mean confidence intervals for it (as Criterion would derive them):

import Control.Monad.ST (runST) import Data.Array.Vector (toU) import Statistics.Sample (mean) import Statistics.Resampling (resample) import Statistics.Resampling.Bootstrap (bootstrapBCA) import System.Random.MWC (create) sample :: [Double] sample = [1..10] ++ [20] main :: IO () main = print . runST $ do g <- create resamples <- resample g [mean] 100000 sampleU return $ bootstrapBCA 0.95 sampleU [mean] resamples where sampleU = toU sample

You’ll need the Hackage packages statistics, uvector and mwc-random. For the above, the output (to 2 d.p.) is:

[Estimate {estPoint = 6.82, estLowerBound = 4.64, estUpperBound = 11.00, ..}]

You can see that the outlier 20 pulls the upper bound of the confidence interval beyond all the other values. Fiddle with sample yourself and see what happens.