Ben Goldacre, The Guardian, Saturday 20 August 2011

What do all these numbers mean? “‘Worrying’ jobless rise needs urgent action – Labour” was the BBC headline. They explained the problem in their own words: “The number of people out of work rose by 38,000 to 2.49 million in the three months to June, official figures show.”

Now, there are dozens of different ways to quantify the jobs market, and I’m not going to summarise them all here. The claimant count and the labour force survey are commonly used, and number of hours worked is informative too: you can fight among yourselves over which is best, and get distracted by party politics to your heart’s content. But in claiming that this figure for the number of people out of work has risen, the BBC is simply wrong.

Here’s why. The “Labour Market” figures come through the Office for National Statistics, and they’ve published the latest numbers in a PDF document. On page 13, top table, 4th row, you will find these figures the BBC are citing. Unemployment aged 16 and above is at 2,494,000, and has risen by 38,000 in a quarter (32,000 in a year). But you will also see some other figures, after the symbol “±”, in a column marked “sampling variability of change”.

Those figures are called 95% confidence intervals, and these are one of the most useful inventions of modern life.

We can’t do a full census of everyone in the population every time we want some data, because that would be too expensive and time-consuming for monthly data collection. Instead, we take what we hope is a representative sample.

This can fail in two interesting ways. Firstly, you’ll be familiar with the idea that a sample can be systematically unrepresentative: if you want to know about the health of the population as a whole, but you survey people in a GP waiting room, then you’re an idiot.

But a sample can also be unrepresentative simply by chance, through something called sampling error. This is not caused by idiocy. Imagine a large bubblegum vending machine, containing thousands of blue and yellow bubblegum balls. You know that exactly 40% of those balls are yellow. When you take a sample of 100 balls, you might get 40 yellow ones, but in fact, as you intuitively know already, sometimes you get 32, sometimes 48, or 37, or 43, or whatever. This is sampling error.
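If you would rather see this than take it on trust, here is a minimal sketch in Python. The machine size of 10,000 balls, the ten repeated draws and the random seed are all assumptions made purely for illustration; only the 40% yellow share and the sample size of 100 come from the example above.

```python
import random

def draw_yellow_counts(n_draws=10, sample_size=100, machine_size=10_000,
                       yellow_share=0.4, seed=2011):
    """Count the yellow balls in each of several samples from one machine."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_yellow = round(machine_size * yellow_share)
    # Build the machine: exactly 40% yellow, the rest blue (sizes are assumed).
    machine = ["yellow"] * n_yellow + ["blue"] * (machine_size - n_yellow)
    # Each draw takes sample_size balls without replacement and counts yellows.
    return [rng.sample(machine, sample_size).count("yellow")
            for _ in range(n_draws)]

counts = draw_yellow_counts()
print(counts)  # counts scatter around 40 rather than hitting it every time
```

Nothing about the machine changes between draws. The scatter in the counts comes from chance alone.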

Now, normally, you’re at the other end of the telescope. You take your sample of 100 balls, but you don’t know the true proportion of yellow balls in the jar – you’re trying to estimate that – so you calculate a 95% confidence interval around whatever proportion of yellow you get in your sample of 100 balls, using a formula (in this case, if 40 of your 100 balls were yellow, the interval runs either side of 0.4 by 1.96 times the square root of ((0.6×0.4) ÷ 100), which is a little under 0.1).
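The arithmetic in that formula is easy to check. This is a sketch of the standard normal-approximation interval for a proportion, assuming a sample in which 40 of the 100 balls came out yellow; the function name `proportion_ci` is mine, not anything official.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a sample proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed sample: 40 yellow balls out of 100.
low, high = proportion_ci(0.4, 100)
print(f"{low:.3f} to {high:.3f}")  # 0.304 to 0.496
```

So a sample showing 40% yellow is compatible with a true proportion anywhere from about 30% to about 50%.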

What does this mean? Strictly (it still makes my head hurt) this means that if you repeatedly took samples of 100, then on 95% of those attempts, the true proportion in the bubblegum jar would lie somewhere between the upper and lower limits of the 95% confidence intervals of your samples. That’s all we can say.
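If that definition still makes your head hurt, you can watch it happen. The sketch below assumes a jar whose true proportion of yellow is 40%, takes 2,000 samples of 100 (drawing with replacement, for simplicity), builds a 95% confidence interval from each sample, and counts how often the interval contains the truth. The number of repeats and the seed are arbitrary choices of mine.

```python
import math
import random

def coverage(true_p=0.4, n=100, trials=2_000, z=1.96, seed=20):
    """Fraction of repeated samples whose 95% interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One sample: each ball is yellow with probability true_p.
        yellow = sum(rng.random() < true_p for _ in range(n))
        p_hat = yellow / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        # Did this sample's interval capture the true proportion?
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials

share = coverage()
print(f"{share:.1%} of intervals contained the true proportion")
```

The share should come out close to 95%, though not exactly, because the formula is an approximation and the simulation is itself subject to sampling error.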

So, if we look at these unemployment figures, you can see that the changes reported are not statistically significant: the estimated change over the past quarter is 38,000 but the 95% confidence interval is ±87,000, running from -49,000 to 125,000. That wide range clearly includes zero, no change at all. The annual change is 32,000, but again, that’s ±111,000, running from -79,000 to 143,000.
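You can do the sums yourself from the figures quoted above. This sketch simply adds and subtracts the published sampling variability from each estimated change and checks whether zero falls inside the resulting range; the function name `change_interval` is my own.

```python
def change_interval(estimate, margin):
    """Return (low, high, includes_zero) for an estimate plus or minus margin."""
    low, high = estimate - margin, estimate + margin
    return low, high, low <= 0 <= high

# Figures as quoted from the ONS table: change and its sampling variability.
quarterly = change_interval(38_000, 87_000)
annual = change_interval(32_000, 111_000)

print(quarterly)  # (-49000, 125000, True)
print(annual)     # (-79000, 143000, True)
```

Both ranges straddle zero, which is all "not statistically significant" means here.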

I don’t know what’s happening to the economy: it’s probably not great. But these specific numbers tell us nothing, and there is an equally important problem arising from that, one which is frankly more enduring, and which matters for meaningful political engagement.

We are barraged, every day, with a vast quantity of numerical data, presented with absolute certainty, and fetishistic precision. In reality, many of these numbers amount to nothing more than statistical noise, the gentle static fuzz of random variation and sampling error, making figures drift up and down, following no pattern at all, like the changing roll of a dice. This, I confidently predict, will never change.