Several people pointed me to this awesome story by John Bohannon:

“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. . . . My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

How did the study go?

5 men and 11 women showed up, aged 19 to 67. . . . After a round of questionnaires and blood tests to ensure that no one had eating disorders, diabetes, or other illnesses that might endanger them, Frank randomly assigned the subjects to one of three diet groups. One group followed a low-carbohydrate diet. Another followed the same low-carb diet plus a daily 1.5 oz. bar of dark chocolate. And the rest, a control group, were instructed to make no changes to their current diet. They weighed themselves each morning for 21 days, and the study finished with a final round of questionnaires and blood tests.

A sample size of 16 might seem pretty low to you, but remember this, from a couple of years ago in Psychological Science:

So, yeah, these small-N studies are a thing. Bohannon writes, “And almost no one takes studies with fewer than 30 subjects seriously anymore. Editors of reputable journals reject them out of hand before sending them to peer reviewers.” Tell that to Psychological Science!

Bohannon continues:

Onneken then turned to his friend Alex Droste-Haars, a financial analyst, to crunch the numbers. One beer-fueled weekend later and… jackpot! Both of the treatment groups lost about 5 pounds over the course of the study, while the control group’s average body weight fluctuated up and down around zero. But the people on the low-carb diet plus chocolate? They lost weight 10 percent faster. Not only was that difference statistically significant, but the chocolate group had better cholesterol readings and higher scores on the well-being survey.

To me, the conclusion is obvious: Beer has a positive effect on scientific progress! They just need to run an experiment with a no-beer control group, and . . .

Ok, you get the point. But a crappy study is not enough. All sorts of crappy work is done all the time but doesn’t make it into the news. So Bohannon did more:

I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. . . . The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text. Take a look at the press release I cooked up. It has everything. In reporter lingo: a sexy lede, a clear nut graf, some punchy quotes, and a kicker. And there’s no need to even read the scientific paper because the key details are already boiled down. I took special care to keep it accurate. Rather than tricking journalists, the goal was to lure them with a completely typical press release about a research paper.

It’s even worse than Bohannon says!

I think Bohannon’s stunt is just great and is a wonderful jab at the Ted-talkin, tabloid-runnin statistical significance culture that is associated so much with science today.

My only statistical comment is that Bohannon actually understates the way in which statistical significance can be found via the garden of forking paths.

Bohannon’s understatement comes in a few ways:

1. He writes:

If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. . . . P(winning) = 1 – (1-p)^n [or, as Ed Wegman would say, 1 – (1-p)*n — ed.] With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05.

That’s all fine, but actually it’s much worse than that, because researchers can, and do, also look at subgroups and interactions. 18 measurements corresponds to a lot more than 18 possible tests! I say this because I can already see a researcher saying, “No, we only looked at one outcome variable so this couldn’t happen to us.” But that would be mistaken. As Daryl Bem demonstrated oh-so-eloquently, many, many possible comparisons can come from a single outcome.
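The arithmetic in the quoted formula is easy to check. Here is a minimal sketch, assuming (as the formula itself does) that the tests are independent and that every null hypothesis is true. The function name and the larger test counts are hypothetical, chosen only to show how fast the chance of a spurious “significant” result grows once subgroups and interactions multiply the number of available comparisons.

```python
def prob_any_significant(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # 18 is the number of measurements in the quoted study; 54 and 100 are
    # hypothetical counts standing in for subgroup and interaction analyses.
    for n_tests in (1, 18, 54, 100):
        print(n_tests, round(prob_any_significant(n_tests), 3))
```

With 18 tests this reproduces the roughly 60% figure in the quote. Real outcome measures are correlated, so the independence assumption is only an approximation.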

2. Bohannon then writes:

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works”.

Sure, but it’s not just that. As Eric Loken and I discussed in our recent article, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Even if a researcher only performs a single comparison on his or her data and thus does not do any “fishing” or “fiddling” at all, the garden of forking paths is still a problem, because the particular data analysis that was chosen is typically informed by the data. That is, a researcher will, after looking at the data, choose data-exclusion rules and a data analysis. A unique analysis is done for these data, but the analysis depends on those data. Mathematically this of course is very similar to performing a lot of tests and selecting the ones with good p-values, but it can feel very different.

I always worry that when people write about p-hacking, they give the wrong impression that if a researcher performs only one analysis on his or her data, all is ok.
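A toy simulation can make the forking-paths point concrete. This is an illustrative sketch only, not taken from the chocolate study or from the article mentioned above: the two-arm design, the two subgroups "a" and "b", the known-variance z-test, and all function names are assumptions. Every simulated dataset is pure noise. One analyst always runs the prespecified pooled comparison. The other looks at the data, picks whichever comparison (pooled, subgroup a, or subgroup b) looks strongest, and then runs exactly one test.

```python
import math
import random


def two_sided_p(diff, n_per_arm, sd=1.0):
    """Two-sided z-test p-value for a difference in means (known sd, equal arms)."""
    se = sd * math.sqrt(2.0 / n_per_arm)
    return math.erfc(abs(diff) / se / math.sqrt(2.0))


def mean(xs):
    return sum(xs) / len(xs)


def simulate(n_sims=4000, n_per_cell=10, alpha=0.05, seed=1):
    """Return (fixed_rate, forked_rate): false-positive rates under a true null."""
    rng = random.Random(seed)
    fixed_hits = forked_hits = 0
    for _ in range(n_sims):
        # Pure noise: the treatment has no effect in either hypothetical subgroup.
        cell = {(arm, sub): [rng.gauss(0.0, 1.0) for _ in range(n_per_cell)]
                for arm in ("treat", "control") for sub in ("a", "b")}

        # Candidate analyses: pooled, subgroup a only, subgroup b only.
        pooled = (mean(cell["treat", "a"] + cell["treat", "b"])
                  - mean(cell["control", "a"] + cell["control", "b"]))
        candidates = [(pooled, 2 * n_per_cell)]
        for sub in ("a", "b"):
            diff = mean(cell["treat", sub]) - mean(cell["control", sub])
            candidates.append((diff, n_per_cell))

        # Prespecified path: always the pooled comparison.
        if two_sided_p(*candidates[0]) < alpha:
            fixed_hits += 1

        # Forked path: look at the data, pick the comparison with the largest
        # standardized difference, then run exactly one test on it.
        chosen = max(candidates, key=lambda c: abs(c[0]) * math.sqrt(c[1]))
        if two_sided_p(*chosen) < alpha:
            forked_hits += 1
    return fixed_hits / n_sims, forked_hits / n_sims


if __name__ == "__main__":
    fixed_rate, forked_rate = simulate()
    print("prespecified analysis:", fixed_rate)
    print("data-dependent analysis:", forked_rate)
```

The prespecified rate should sit near the nominal 5%. The data-dependent rate should be clearly higher, since in expectation it is at least 1 - 0.95^2, about 9.75%, from the two independent subgroup tests alone. The second analyst reports a single p-value and never feels like they went fishing.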

3. Bohannon notes in passing that he excluded one person from his study, and elsewhere he notes that researchers “drop ‘outlier’ data points” in their quest for scientific discovery. But I think he could’ve emphasized this a bit more: researcher degrees of freedom are not just about running lots of tests on your data, they’re also about the flexibility in rules for what data to exclude and how to code your responses. (Marc Hauser is an extreme case here, but even with simple survey responses there are coding issues in the very, very common setting where a numerical outcome is dichotomized.)

4. Finally, Bohannon is, I think, a bit too optimistic when he writes:

Luckily, scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.

I agree that p-values are generally a bad idea. But I think the real problem is with null hypothesis significance testing more generally, the idea that the goal of science is to find “true positives.”

In the real world, effects of interest are generally not true or false; it’s not so simple. Chocolate does have effects, and of course chocolate in our diet is paired with sugar and can also be a substitute for other desserts, etc etc etc. So, yes, I do think chocolate will have effects on weight. The effects will be positive for some people and negative for others, they’ll vary in their magnitude and they’ll vary situationally. If you try to nail this down as a “true” or “false” claim, you’re already going down the wrong road, and I don’t see it as a solution to replace p-values by confidence intervals or Bayes factors or whatever. I think we just have to get off this particular bus entirely. We need to embrace variation and accept uncertainty.

Again, just to be clear, I think Bohannon’s story is great, and I’m not trying to be picky here. Rather, I want to support what he did by putting it in a larger statistical perspective.