I was searching the early edition of PNAS for the abstract behind yet another sloppy “science by press release” that didn’t bother to give the title of the paper or the DOI, and came across this paper instead, so it wasn’t a wasted effort.

Steve McIntyre recently mentioned:

Mann rose to prominence by supposedly being able to detect “faint” signals using “advanced” statistical methods. Lewandowsky has taken this to a new level: using lew-statistics, lew-scientists can deduce properties of a population with no members.

Josh’s (N=0) humor aside, this new paper makes me wonder how many climate science findings would fail the evidence thresholds of the newly proposed standard.

Revised standards for statistical evidence

Valen E. Johnson

Significance

The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.

Abstract

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
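To make that correspondence concrete: for a simple one-sided z-test, the uniformly most powerful Bayesian test (UMPBT) rejects the null when z exceeds sqrt(2 ln γ), where γ is the evidence (Bayes factor) threshold, so a classical significance level α implies an evidence threshold of exp(z_α²/2). Here is a minimal Python sketch of that mapping, for the z-test case only (the function names are mine; other tests give somewhat different numbers, which is why the paper quotes ranges like 25–50:1):

```python
from math import exp, log, sqrt
from scipy.stats import norm

def bayes_factor_threshold(alpha):
    """Evidence threshold gamma implied by a one-sided z-test at level alpha.

    For the z-test, the UMPBT rejection boundary satisfies
    z_alpha = sqrt(2 * ln(gamma)), so gamma = exp(z_alpha**2 / 2).
    """
    z = norm.ppf(1 - alpha)
    return exp(z ** 2 / 2)

def alpha_for_evidence(gamma):
    """Significance level whose rejection region matches evidence threshold gamma."""
    return 1 - norm.cdf(sqrt(2 * log(gamma)))

for a in (0.05, 0.005, 0.001):
    print(f"alpha = {a:<6} -> evidence threshold ~ {bayes_factor_threshold(a):6.1f}:1")
for g in (25, 50, 100, 200):
    print(f"gamma = {g:>3}:1 -> alpha ~ {alpha_for_evidence(g):.4f}")
```

Running this puts α = 0.005 inside the 25–50:1 band and α = 0.001 inside the 100–200:1 band, just as the abstract states, while the conventional α = 0.05 corresponds to evidence of only about 4:1.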

From the discussion:

The correspondence between P values and Bayes factors based on UMPBTs suggests that commonly used thresholds for statistical significance represent only moderate evidence against null hypotheses. Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false. This range of false positives is consistent with nonreproducibility rates reported by others (e.g., ref. 5). If the proportion of true null hypotheses is greater than one-half, then the proportion of false positives reported in the scientific literature, and thus the proportion of scientific studies that would fail to replicate, is even higher.

In addition, this estimate of the nonreproducibility rate of scientific findings is based on the use of UMPBTs to establish the rejection regions of Bayesian tests. In general, the use of other default Bayesian methods to model effect sizes results in even higher assignments of posterior probability to rejected null hypotheses, and thus to even higher estimates of false-positive rates.

This phenomenon is discussed further in SI Text, where Bayes factors obtained using several other default Bayesian procedures are compared with UMPBTs (see Fig. S1). These analyses suggest that the range 17–25% underestimates the actual proportion of marginally significant scientific findings that are false.

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
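The 17–25% figure in the excerpt above is simple Bayes-rule arithmetic: with prior odds of 1:1 (half of tested nulls true) and a marginally significant result carrying a UMPBT Bayes factor of roughly 3–5 against the null, the posterior probability that the null is nevertheless true is 1/(1 + BF), i.e. between 1/6 ≈ 17% and 1/4 = 25%. A minimal sketch of that calculation (the helper name is mine, not the paper’s):

```python
def false_positive_prob(bayes_factor, prior_null=0.5):
    """Posterior probability the null is true given a 'significant' result.

    Uses posterior odds (alt:null) = Bayes factor * prior odds (alt:null).
    """
    prior_odds_alt = (1 - prior_null) / prior_null
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1 / (1 + posterior_odds_alt)

# Marginal significance (p ~ 0.05) corresponds to a UMPBT Bayes factor of ~3-5:
for bf in (3, 5):
    print(f"BF = {bf}: P(null true | 'significant') = {false_positive_prob(bf):.2f}")
# -> 0.25 and 0.17, the paper's 17-25% false-positive range.

# And if more than half of tested nulls are true, the rate climbs further:
print(f"BF = 5, P(null) = 0.8 a priori: {false_positive_prob(5, prior_null=0.8):.2f}")
```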

=================================================================

The full paper is here: http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

The supporting information (SI) is available from PNAS as a PDF.

For our layman readers who might be a bit behind on statistics, here is a primer on statistical significance and P-values as they relate to weight loss and nutrition, a subject that is easy to get your mind around.
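If you want to play with the idea yourself, here is a toy illustration, using invented numbers rather than data from any real study, of how a marginal weight-loss result might clear the conventional 0.05 bar yet fail Johnson’s proposed 0.005 standard:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Invented weight-change data (kg): a modest diet effect against noisy controls
diet = rng.normal(loc=-2.0, scale=4.0, size=30)
control = rng.normal(loc=0.0, scale=4.0, size=30)

t_stat, p_value = ttest_ind(diet, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("passes p < 0.05? ", p_value < 0.05)
print("passes p < 0.005?", p_value < 0.005)
```

With a modest effect and noisy data like this, the p-value tends to hover near the conventional cutoff, which is exactly the “marginally significant” territory the paper argues is unreliable.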

The gross failure of scientific nutrition studies is another topic McIntyre recently discussed: A Scathing Indictment of Federally-Funded Nutrition Research

So, while some dicey science findings might simply reflect evidence thresholds that are set too low, there are real human-conduct problems in science too.