Ben Goldacre, 18 July 2009, The Guardian.

We’d all like to help the police to do their job well. They, in turn, would like to have a massive database with DNA profiles from everyone who has been arrested, but not convicted of a crime.

We worry that this is intrusive, but some of us are willing to make concessions on our principles, and to accept some invasion of our privacy, in the name of preventing crimes. To do this, we’d like to know the evidence on whether this database is helpful, so that we can make an informed decision.

Luckily the Home Office have now published a consultation paper on the subject. They defend their database by arguing that innocent people who have been arrested are as likely to commit crimes in the future as guilty people. “This”, they say, “is obviously a controversial assertion”. That’s not true: it’s a simple matter of fact, and you could easily assemble some good quality evidence to see if it’s true or not.

The Home Office have assembled some evidence. It is not good quality. In fact, this study from the Jill Dando Institute, attached to their consultation paper as an appendix, is possibly the most unclear and badly presented piece of research I have ever seen in a professional environment. Or I am having a bad day. Join me in my struggle to understand their work.

They want to show that the level of criminal activity in a group of people who have been arrested, but on whom no further action has been taken, is the same as the level of criminal activity in people who have been arrested and convicted of a crime, or who accept a caution.

On page 30 they explain their methods, haphazardly, scattered about in the text. They describe some people “sampled on 1st June 2004, 1st June 2005 and 1st June 2006”. These dates are never mentioned again. I have no idea what their plan was there. They then leap to talking about Table 2. This contains data on people from a “sample” in each of 1996, 1995, and 1994, followed up for 30 months, 42 months, and 54 months respectively. Are these anything to do with the people from 2004, 2005, and 2006? I have no idea.

In fact I have no idea what “sample” means: perhaps that was the date they were first arrested. I don’t know why they were only followed up for 30, 42, and 54 months, instead of all the way to 2009. Crucially I also don’t know what the numbers in the table mean, because they don’t explain this properly. I think it is the number of people, from the original group, who have subsequently been arrested again.

Anyway. Then they start to discuss the results from this table. They say that these figures show that arrested non-convicted people are the same as convicted people. No statistical tests are reported on these figures, so there is absolutely no indication of how wide the error margins are, or whether these are chance findings. To give you a hint about the impact of noise on their data, more people are subsequently re-arrested over the 42 month period than over the 54 month period, which seems surprising, given that the people in the 54 month group had a much longer period of time over which to get arrested.
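To show what an error margin on this kind of comparison might look like, here is a minimal sketch in Python of a confidence interval for the difference between two re-arrest rates. The counts are invented purely for illustration and are not taken from the report; the function name is mine, and the simple normal-approximation (Wald) interval is an assumption, one of several reasonable choices.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(events_a, n_a, events_b, n_b, confidence=0.95):
    """Wald (normal approximation) interval for the difference p_a - p_b."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference between two independent proportions
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se


# Hypothetical counts, NOT from the Home Office report:
# 40 of 100 non-convicted and 45 of 100 convicted people re-arrested.
low, high = diff_in_proportions_ci(40, 100, 45, 100)
print(f"difference in re-arrest rate: {low:+.3f} to {high:+.3f}")
```

With groups of around a hundred people, an interval like this spans zero but is also wide enough to hide a sizeable real difference, which is why raw counts with no interval attached tell you so little.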

This is before we even get on to the other problems. At a few hundred people, this study seems pretty small for one that is supposed to give compelling evidence that there is no difference between two groups, because to prove a negative like this, you’d generally want a large sample, to minimise the chance of missing a true difference in the noise.
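As a rough illustration of why a few hundred people is small for this job, here is a sketch of the textbook sample size calculation for comparing two proportions. The re-arrest rates are hypothetical, not figures from the report, and the function name is mine. It uses the usual formula for detecting a difference; a study designed to demonstrate equivalence would set things up differently and would typically need at least as many people.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """People needed in EACH group to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical re-arrest rates, NOT from the report: 40% versus 45%.
n = sample_size_per_group(0.40, 0.45)
print(f"people needed per group: {n}")
```

On these assumed rates, reliably picking up a five percentage point difference takes well over a thousand people in each group, and the requirement climbs steeply as the difference you care about gets smaller.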

There is no evidence that they have done a “power calculation” to determine the sample size they’d need, and in any case, their comparison group feels a bit rigged to me. In their “convicted” sample they only count people who had a non-custodial sentence, and exclude people who got a custodial sentence, on the grounds that those people would be incapable of committing a crime during their incarceration. This also has the effect, however, of making the “criminal” group not quite so criminal, and so a bit more likely to be similar to innocent people.

I could go on. Table 1 is so thoroughly “not as described” as to be uninterpretable. In the text they talk about different cells on the table which are “solid red”, “stippled yellow”, and “blank”. In fact the whole thing is just blue.

This research was incomprehensible and unreadable. Anybody who claims to have been persuaded by the data quoted here is telling you, loudly and clearly in the subtitles, that they don’t need to understand a piece of research in order to find it compelling. Such people are not to be trusted, and if research of this calibre is what guides our policy on huge intrusions into the personal privacy of millions of innocent people, then they might as well be channelling spirits.