If you believe The Economist, science is in the midst of a crisis, with most of its conclusions failing to stand the test of time. Research fraud is rising, but even studies that were performed properly sometimes either can't be reproduced or appear to suffer from bias.

A new analysis suggests a very simple explanation for some of the problems: our statistics are weak. A statistician has figured out how to compare Bayesian statistics to those normally used in scientific tests of significance. By comparing the two, he finds that researchers are often accepting numbers that any good Bayesian would consider to be weak evidence.

What's in a p?

To understand the problem, we have to go into how scientists assess significance. Typically, an experiment has an experimental condition that produces one number and a control condition that produces a second. The two numbers will usually differ, but we need to know whether the difference is significant. That's where statistics comes in. The typical test used in science asks how likely it is that a difference at least that large would arise by random chance alone. In most fields, if that likelihood is less than five percent, then you can reject chance as the explanation, and the results are considered significant. In statistical terms, this is called having a p value of less than 0.05.
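The logic above can be made concrete with a small simulation. This is an illustrative sketch, not anything from the paper: the measurements are invented, and a permutation test is just one common way to get a p value. The idea is to shuffle the group labels many times and count how often chance alone produces a difference as large as the one observed.

```python
import random

random.seed(0)

# Hypothetical measurements (made up for illustration): an experimental
# group and a control group whose means differ.
control = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]
treated = [10.9, 11.2, 10.8, 11.0, 11.3, 10.7, 11.1, 10.9]

observed = abs(sum(treated) / len(treated) - sum(control) / len(control))

# Permutation test: reshuffle the labels and see how often random
# chance produces a difference at least as large as the observed one.
pooled = control + treated
n, trials, extreme = len(control), 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / n)
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(f"p = {p_value:.4f}")  # below 0.05, so "significant" by the usual rule
```

With groups this well separated, almost no shuffle matches the observed difference, so the p value lands far below the 0.05 cutoff.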

Is there something special about a 95 percent probability? Absolutely not; a recent paper referred to it as "seemingly arbitrary." It's simply been arrived at through the consensus of people working in the field. In most fields, it seems, people have been willing to accept a situation where, out of every 20 positive results, one is likely to be a fluke that will not be reproducible.

But the 95 percent rule doesn't apply to every field. In particle physics, hints of particles with greater than 95 percent certainty come and go all the time—you can get a different answer depending on how much data you have at the time of analysis. So that field has settled on a much higher standard: greater than 99.9999 percent confidence.

Even biology has made exceptions when needed. In genetics experiments, 95 percent confidence was long considered perfectly acceptable evidence. That changed in the 90s, when the development of gene chips meant that a single experiment could look at every gene in the human genome at once, over 20,000 of them. Suddenly, a five percent error rate meant that every experiment produced over 1,000 false positives. The problem should have been obvious but, amazingly, it wasn't. It took a number of papers and the ensuing discussions to swing the consensus of the field around to demanding more statistical rigor.
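The arithmetic behind that episode is worth spelling out. A minimal sketch, using the article's round figures; the Bonferroni correction shown at the end is one standard remedy, not something the article prescribes:

```python
# With ~20,000 genes tested at once, a five percent per-test error
# rate multiplies into a large number of flukes per experiment.
genes = 20_000   # roughly the number of genes on a human gene chip
alpha = 0.05     # the conventional per-test error rate

expected_false_positives = genes * alpha
print(expected_false_positives)  # 1000.0 spurious "hits" per experiment

# One common fix (a Bonferroni correction, offered here as an example):
# shrink the per-test threshold so the experiment as a whole keeps a
# five percent chance of any false positive.
bonferroni_alpha = alpha / genes
print(bonferroni_alpha)  # 2.5e-06
```

That tiny corrected threshold is what "demanding more statistical rigor" cashes out to in genome-wide work.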

It's worth noting that this value is no protection against research fraud. You can fake whatever statistical significance you like. It also doesn't protect against more subtle and potentially unconscious biases, like which experiments to include in a paper and when to stop collecting data for them. As noted at the Retraction Watch blog, results just at or below a p of 0.05 are over-represented compared to other values, and their frequency is increasing. This suggests the pressure for positive results is affecting what people publish.

Bigger problems for p

The new paper, however, argues that there are much bigger problems than biases or fields where 95 percent confidence doesn't work. Instead, it contends that the measure itself is fundamentally misguided.

The author, Valen Johnson, is a statistician at Texas A&M. In his introduction, he notes that the standard statistics used in science compare the experimental results to a single null hypothesis: random chance. Bayesian statistics isn't used as often, in part because it requires weighing the null hypothesis of random chance against a specific alternative hypothesis. Since it's usually hard to formulate a precise alternative, Bayesian statistics is often impractical.

Johnson's big contribution, published previously, was to develop a way to mathematically link Bayesian statistics to the standard probabilities used by scientists. The math then allows a direct comparison between the probability values. In his comparison, scientific standards seem pretty weak. The 95 percent certainty corresponds to a Bayesian evidence threshold of between three and five, which Johnson notes is typically considered "positive evidence"—but it falls well below the values considered to be "strong evidence." It takes 99 percent certainty to get there.

(Just as with the standard practice, the values that Bayesian fans have set for what constitutes positive and strong evidence are suggested by individual researchers and agreed to by consensus. Nobody's ever found a stone tablet etched with a value for scientific certainty.)

Johnson concludes that if we assume that only one-half of the tested hypotheses should give us a positive result, then "these results suggest that between 17 percent and 25 percent of marginally significant scientific findings are false." If the proportion of correct hypotheses is smaller (and even careful researchers test their share of long shots), then the problem gets even more pronounced.

Overall, Johnson's suggestion is simple: raise the statistical rigor all around. Demand that experiments produce a p value of 0.005 or smaller, and be even pickier about results that we consider highly significant. There is a cost to this, in that you need bigger samples to achieve the higher statistical rigor. In his example, you'd have to double the sample size. That's no problem if you're breeding bacteria and fruit flies, but it will add a lot of time and expense if your project involves mice. Science as a whole would move a lot more slowly.
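The headline figures fall out of simple Bayesian arithmetic. As a sketch (assuming, as Johnson does, prior odds of 1:1, i.e. half the tested hypotheses are true): a Bayes factor of B in favor of the alternative leaves a posterior probability of 1/(1+B) that the null is actually correct, and plugging in the factors of three and five that correspond to p = 0.05 recovers the 17 to 25 percent range.

```python
def false_finding_rate(bayes_factor, prior_odds=1.0):
    """Posterior probability that a 'significant' result is a fluke,
    given a Bayes factor in favor of the alternative and the prior
    odds that the hypothesis is true."""
    posterior_odds = bayes_factor * prior_odds  # odds for the alternative
    return 1 / (1 + posterior_odds)

# p = 0.05 corresponds to Bayes factors of roughly three to five:
print(round(false_finding_rate(5) * 100))  # 17 percent of findings false
print(round(false_finding_rate(3) * 100))  # 25 percent of findings false
```

Lowering the prior odds below 1:1 in this function shows why testing long shots makes the false-finding rate climb further.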

Will this ensure reproducibility?

It will probably make things better, but it's not going to solve the problem. That's because there are really three classes of reproducibility issues. The first is simply a matter of numbers. The more experiments you do, the more likely you are to fall afoul of the five percent error rate we tolerate. And scientists are doing a lot more experiments. Journals and papers are proliferating, and a presentation I attended recently indicated that, at least in biology, the number of individual experiments per paper has gone up dramatically. (The figures within papers used to contain about eight individual images; now they often have more than 20.)
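That numbers game is easy to quantify. A quick sketch, using the figure counts mentioned above as stand-ins for experiments per paper: if each independent experiment has a five percent chance of being a fluke, the chance that at least one result in a paper is spurious grows quickly.

```python
def chance_of_a_fluke(n_experiments, alpha=0.05):
    """Probability that at least one of n independent experiments,
    each with false-positive rate alpha, produces a spurious result."""
    return 1 - (1 - alpha) ** n_experiments

print(f"{chance_of_a_fluke(8):.0%}")   # ~8 experiments: about one in three
print(f"{chance_of_a_fluke(20):.0%}")  # ~20 experiments: nearly two in three
```

This treats the experiments as independent, which real papers' experiments aren't quite, but the direction of the effect is the point.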

Given Johnson's standard, a lot of these results will no longer be significant. So we'll either have to get comfortable with publishing a lot more suggestive-but-inconclusive results or comfortable with publishing a lot less. It's hard to see anyone (researchers, publishers, funding agencies) being enthusiastic about either option, so it will be a difficult sell.

Perhaps more significantly, there are those other types of reproducibility issues. One of them is what I'd consider the "big picture" issue. Individual experiments may be wrong five percent of the time, but the conclusions of most papers are built from a number of individual experiments that all point roughly in the same direction. A higher statistical rigor would probably help by eliminating some of the spurious little-picture information that leads us astray when we consider the big picture. But we get led astray for all sorts of additional reasons, including our biases, faulty reasoning, and simply not having all the information we need to reach the right conclusion.

The other reproducibility problem is really a simple yes/no issue that has nothing to do with statistics. If you knock out a specific gene in mice, does it have the phenotype that people have reported? If you use a specific antibody in a procedure, do you see the same signal that another lab did? These are the sorts of nuts-and-bolts reproducibility issues that drive researchers crazy, because they can be affected by things like the specific strain of mice you use, where you buy your chemicals, and even the pH of your lab's water supply. No amount of statistical thinking is going to change any of that.

Overall, Johnson has made what could be an important contribution at a time when a lot of people are worried about reproducibility. Unfortunately, it also comes with a message—do more science before you publish—that's going to be a tough sell in the publish-or-perish research culture that currently exists. And even if Johnson were to succeed, the problem of reproducibility won't go away entirely.

PNAS, 2013. DOI: 10.1073/pnas.1313476110