Several of you asked me to write about that chocolate article that went viral recently. From I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How:

“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. It was discussed on television news shows. It appeared in glossy print, most recently in the June issue of Shape magazine (“Why You Must Eat Chocolate Daily,” page 128). Not only does chocolate accelerate weight loss, the study found, but it leads to healthier cholesterol levels and overall increased well-being. The Bild story quotes the study’s lead author, Johannes Bohannon, Ph.D., research director of the Institute of Diet and Health: “The best part is you can buy chocolate everywhere.” I am Johannes Bohannon, Ph.D. Well, actually my name is John, and I’m a journalist. I do have a Ph.D., but it’s in the molecular biology of bacteria, not humans. The Institute of Diet and Health? That’s nothing more than a website. Other than those fibs, the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

Bohannon goes on to explain that as part of a documentary about “the junk-science diet industry”, he and some collaborators designed a fake study to see if they could convince journalists. They chose to make it about chocolate:

Gunter Frank, a general practitioner in on the prank, ran the clinical trial. Onneken had pulled him in after reading a popular book Frank wrote railing against dietary pseudoscience. Testing bitter chocolate as a dietary supplement was his idea. When I asked him why, Frank said it was a favorite of the “whole food” fanatics. “Bitter chocolate tastes bad, therefore it must be good for you,” he said. “It’s like a religion.”

They recruited 16 (!) participants and divided them into three groups. One group ate their normal diet. Another ate a low-carb diet. And a third ate a low-carb diet plus some chocolate. Both the low-carb group and the low-carb + chocolate group lost weight compared to the control group, but the low-carb + chocolate group lost weight “ten percent faster”, and the difference was “statistically significant”. They also had “better cholesterol readings” and “higher scores on the well-being survey”.

Bohannon admits exactly how he managed this seemingly impressive result – he measured eighteen different parameters (weight, cholesterol, sodium, protein, etc) which virtually guarantees that one will be statistically significant. That one turned out to be weight loss. If it had been sodium, he would have published the study as “Chocolate Lowers Sodium Levels”.
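To put a rough number on "virtually guarantees": suppose chocolate does nothing at all and the eighteen measurements are independent. Both are simplifying assumptions, since things like weight and cholesterol are correlated. Then the chance that at least one measurement crosses p < 0.05 is 1 - 0.95^18. A minimal Python sketch (the function name is made up for illustration):

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests of a
    true null hypothesis comes out with p < alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 18, 50):
    print(f"{k:>2} outcomes -> {chance_of_false_positive(k):.2f}")
# 18 outcomes -> about 0.60
```

Under those assumptions, eighteen outcomes give roughly a 60 percent chance of at least one spurious "significant" result. That is already better than a coin flip, and it keeps climbing as more outcomes are measured.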

Then he pitched it to various fake for-profit journals until one of them bit. Then he put out a press release to various media outlets, and they ate it up. The story ended up in a bunch of English- and German-language media including Bild, the Daily Star, Times of India, Cosmopolitan, Irish Examiner, and the Huffington Post.

The people I’ve seen discussing this seem to have drawn five conclusions, four of which are wrong:

Conclusion 1: Haha, I can’t believe people were so gullible that they actually thought chocolate caused weight loss!

Bohannon himself endorses this one, saying bitter chocolate was a favorite of “whole food fanatics” because “Bitter chocolate tastes bad, therefore it must be good for you” and “it’s like a religion.”

But actually, there’s lots of previous research supporting health benefits from bitter chocolate, none of which Bohannon seems to be aware of.

A meta-analysis of 42 randomized controlled trials totaling 1297 participants in the American Journal of Clinical Nutrition found that chocolate improved blood pressure, flow-mediated dilatation (a measure of vascular health), and insulin resistance (related to weight gain).

A different meta-analysis of 24 randomized controlled trials totalling 1106 people in the Journal of Nutrition also found that chocolate improved blood pressure, flow-mediated dilatation, and insulin resistance.

A Cochrane Review of 20 randomized controlled trials of 856 people found that chocolate improved blood pressure (it didn’t test for flow-mediated dilatation or insulin resistance).

A study on mice found that mice fed more chocolate flavanols were less likely to gain weight.

An epidemiological study of 1018 people in the United States found an association between frequent chocolate consumption and lower BMI, p < 0.01. A second epidemiological study of 1458 people in Europe found the same thing, again p < 0.01. A cohort study of 470 elderly men found chocolate intake was inversely associated with blood pressure and cardiovascular mortality, p < 0.001, not confounded by the usual suspects.

I wouldn’t find any of these studies alone very convincing. But together, they compensate for each other’s flaws and build a pretty robust structure. So the next flawed conclusion is:

Conclusion 2: This proves that nutrition isn’t a real science and we should all just be in a state of radical skepticism about these things

What we would like to do is a perfect study where we get thousands of people, randomize them to eat-lots-of-chocolate or eat-little-chocolate at birth, then follow their weights over their entire lives. That way we could have a large sample size, perfect randomization, life-long followup, and clear applicability to other people. But for practical and ethical reasons, we can’t do that. So we do a bunch of smaller studies that each capture a few of the features of the perfect study.

First we do animal studies, which can have large sample sizes, perfect randomization, and life-long followup, but it’s not clear whether it applies to humans.

Then we do short randomized controlled trials, which can have large sample sizes, perfect randomization, and human applicability, but which only last a couple of months.

Then we do epidemiological studies, which can have large sample sizes, human applicability, and last for many decades, but which aren’t randomized very well and might be subject to confounders.

This is what happened in the chocolate studies above. Mice fed a strict diet plus chocolate for a long time gain less weight than mice fed the strict diet alone. This is suggestive, but we don’t know if it applies to humans. So we find that in randomized controlled trials, chocolate helps with some proxies for weight gain like insulin resistance. This is even more suggestive, but we don’t know if it lasts. So we find that in epidemiological studies, lifetime chocolate consumption is associated with lifetime good health outcomes. This on its own is suggestive but potentially confounded, but when we combine it with all of the others, it becomes more convincing.

(Am I cheating by combining blood pressure and BMI data? Sort of, but the two measures are correlated.)

When all of these paint the same picture, then we start thinking that maybe it’s because our hypothesis is true. Yes, maybe the mouse studies could be related to a feature of mice that doesn’t generalize to humans, and the randomized controlled trial results wouldn’t hold up after a couple of years, and the epidemiological studies are confounded. But that would be extraordinarily bad luck. More likely they’re all getting the same result because they’re all tapping into the same underlying reality.

This is the way science usually works, it’s the way nutrition science usually works, and it’s the way the science of whether chocolate causes weight gain usually works. These are not horrible corrupt disciplines made up entirely of shrieking weight-loss-pill peddlers trying to hawk their wares. They only turn into that when the media takes a single terrible study totally out of context and misrepresents the field.

Conclusion 3: Studies Always Need To Have High Sample Sizes

Here’s another good chocolate-related study: Short-term administration of dark chocolate is followed by a significant increase in insulin sensitivity and a decrease in blood pressure in healthy persons.

Bohannon says:

Our study was doomed by the tiny number of subjects, which amplifies the effects of uncontrolled factors…Which is why you need to use a large number of people, and balance age and gender across treatment groups.

But I say “Short-term administration…” is a good study despite having an n = 15, one less than the Bohannon study. Why? Well, their procedure was pretty involved, and you wouldn’t be able to get a thousand people to go through the whole rigamarole. More importantly, their insulin resistance measure was nearly twice as high in the white chocolate group as in the dark chocolate group, and p < 0.001. (Another low sample size study that was nevertheless very good: psychiatrists knew that consuming dietary tyramine when taking a MAOI antidepressant can cause a life-threatening hypertensive crisis, but they didn't know how much tyramine it took. In order to find out, they took a dozen people, put them on MAOIs, and then gradually fed them more and more tyramine with doctors standing by to treat the crisis as soon as it started. They found about how much tyramine it took and declared the experiment a success. If the tyramine levels were about the same in all twelve patients, then adding a thousand more patients wouldn’t help much, and it would definitely increase the risk.)

Sample size is important when you’re trying to detect a small effect in the middle of a large amount of natural variation. When you’re looking for a large effect in the middle of no natural variation, sample size doesn’t matter as much. For example, if there was a medicine that would help amputees grow their hands back, I would accept success with a single patient (if it worked) as proof of effectiveness (I suppose I couldn’t be sure it would always work until more patients had been tried, but a single patient would certainly pique my interest). You’re not going after sample size so much as after p-value.
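One way to see the trade-off between sample size and effect size is the standard normal-approximation formula for comparing two group means. It puts the number of subjects needed per group at roughly 2(z_alpha + z_power)^2 / d^2, where d is the effect size in standard deviations. The sketch below assumes the conventional 5% two-sided significance level and 80% power. Those defaults and the function name are illustrative choices, not anything taken from the studies discussed here.

```python
import math
from statistics import NormalDist


def subjects_per_group(effect_size_d: float, alpha: float = 0.05,
                       power: float = 0.80) -> int:
    """Approximate subjects needed per group to detect a difference of
    `effect_size_d` standard deviations between two group means
    (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


for d in (0.2, 0.5, 2.0):
    print(f"d = {d}: about {subjects_per_group(d)} per group")
# d = 0.2: about 393 per group
# d = 0.5: about 63 per group
# d = 2.0: about 4 per group
```

The normal approximation is somewhat optimistic at very small sample sizes, where a t-distribution would be more appropriate. Even so, the shape of the result holds: a small effect buried in noise needs hundreds of subjects, while an enormous effect can show up in a handful.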

Conclusion 4: P-Values Are Stupid And We Need To Get Rid Of Them

Bohannon says that:

If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result…the letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data…scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.

Okay, take the “Short-term administration” study above. I would like to be able to say that since it has p < 0.001, we know it's significant. But suppose we're not allowed to do p-values. All I do is tell you "Yeah, there was a study with fifteen people that found chocolate helped with insulin resistance" and you laugh in my face. Effect size is supposed to help with that. But suppose I tell you "There was a study with fifteen people that found chocolate helped with insulin resistance. The effect size was 0.6." I don't have any intuition at all for whether or not that's consistent with random noise. Do you? Okay, then they say we’re supposed to report confidence intervals. The effect size was 0.6, with 95% confidence interval of [0.2, 1.0]. Okay. So I check the lower bound of the confidence interval, I see it’s different from zero. But now I’m not transcending the p-value. I’m just using the p-value by doing a sort of kludgy calculation of it myself – “95% confidence interval does not include zero” is the same as “p value is less than 0.05”.

(Imagine that, although I know the 95% confidence interval doesn’t include zero, I start wondering if the 99% confidence interval does. If only there were some statistic that would give me this information!)
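The equivalence between confidence intervals and p-values can be made concrete. Assuming the interval is a symmetric normal-theory interval (an assumption; with fifteen subjects a t-distribution would be more accurate), the standard error can be backed out of the interval's width, and from there the p-value or an interval at any other confidence level follows. A rough sketch, with made-up function names, using the hypothetical 0.6 [0.2, 1.0] example from above:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def std_error_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal-theory interval."""
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return (upper - lower) / (2 * z_crit)         # half-width = z_crit * SE


def p_value_from_ci(estimate, lower, upper, level=0.95):
    """Two-sided p-value (null hypothesis: true effect is zero)."""
    z = estimate / std_error_from_ci(lower, upper, level)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def rescale_ci(estimate, lower, upper, level, new_level):
    """Re-express the same estimate as an interval at a different level."""
    se = std_error_from_ci(lower, upper, level)
    z_new = STD_NORMAL.inv_cdf(0.5 + new_level / 2)
    return estimate - z_new * se, estimate + z_new * se


print(p_value_from_ci(0.6, 0.2, 1.0))         # roughly 0.003
print(rescale_ci(0.6, 0.2, 1.0, 0.95, 0.99))  # lower bound stays above zero
```

Under these assumptions the 99% interval also excludes zero, which is exactly what a p-value of about 0.003 already says in one number.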

But wouldn’t getting rid of p-values prevent “p-hacking”? Maybe, but it would just give way to “d-hacking”. You don’t think you could test for twenty different metabolic parameters and only report the one with the highest effect size? The only difference would be that p-hacking is completely transparent – if you do twenty tests and report a p of 0.05, I know you’re an idiot – but d-hacking would be inscrutable. If you do twenty tests and report that one of them got a d = 0.6, is that impressive? No better than chance? I have no idea. I bet there’s some calculation I could do to find out, but I also bet that it would be a lot harder than just multiplying the value by the number of tests and seeing what happens. [EDIT: On reflection not sure this is true; the possibility of p-hacking is inherent to p-values, but the possibility of d-hacking isn’t inherent to effect size. I don’t actually know how much this would matter in the real world.]
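One way to get a feel for whether "d = 0.6 out of twenty tests" is impressive is to simulate it. The sketch below generates pure noise for two groups, computes Cohen's d for each of twenty outcomes, and keeps only the largest, which is what a d-hacker would report. The group size of eight, the independence of the twenty outcomes, and all function names are assumptions made for illustration.

```python
import math
import random


def cohens_d(a, b):
    """Cohen's d for two equal-sized groups, using the pooled standard deviation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)


def largest_null_effect(rng, n_per_group=8, n_outcomes=20):
    """Largest |d| among `n_outcomes` outcomes when there is no real effect."""
    best = 0.0
    for _ in range(n_outcomes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        best = max(best, abs(cohens_d(a, b)))
    return best


def share_reaching(threshold, n_studies=1000, seed=0, **study_kwargs):
    """Fraction of simulated null studies whose best outcome has |d| >= threshold."""
    rng = random.Random(seed)
    hits = sum(largest_null_effect(rng, **study_kwargs) >= threshold
               for _ in range(n_studies))
    return hits / n_studies


print(share_reaching(0.6))                                   # tiny groups
print(share_reaching(0.6, n_studies=50, n_per_group=200))    # large groups
```

With eight people per group, the large majority of pure-noise studies can produce a d of 0.6 somewhere among twenty outcomes. With two hundred per group, essentially none can. So the same reported effect size means very different things depending on details the reader has to reconstruct, which is the inscrutability complained about above.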

But wouldn’t switching from p-values to effect sizes prevent people from making a big deal about tiny effects that are nevertheless statistically significant? Yes, but sometimes we want to make a big deal about tiny effects that are nevertheless statistically significant! Suppose that Coca-Cola is testing a new product additive, and finds in large epidemiological studies that it causes one extra death per hundred thousand people per year. That’s an effect size of approximately zero, but it might still be statistically significant. And since about a billion people worldwide drink Coke each year, that’s ten thousand deaths a year. If Coke said “Nope, effect size too small, not worth thinking about”, they would kill almost two milli-Hitlers worth of people.
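The arithmetic behind the Coke example is just the per-person risk multiplied by the number of people exposed. A tiny sketch, using only the hypothetical figures given above:

```python
# Hypothetical figures from the thought experiment, not real data.
drinkers_per_year = 1_000_000_000     # "about a billion people"
extra_deaths_per_100k = 1             # one extra death per hundred thousand per year

extra_deaths_per_year = drinkers_per_year * extra_deaths_per_100k // 100_000
print(extra_deaths_per_year)  # 10000
```

A risk that rounds to zero for any individual still adds up once it is multiplied across a billion people.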

Yeah, sure, you can never use p-values again, and run into all of these other problems. Or you can do a Bonferroni correction, which is a very simple adjustment to p-values which corrects for p-hacking. Or instead of taking one study at face value LIKE AN IDIOT you can wait to see if other studies replicate the findings. Remember, the whole point of p-hacking is choosing at random from a bunch of different outcomes, so if two trials both try to p-hack, they’ll end up with different outcomes and the game will be up. Seriously, STOP TRYING TO BASE CONCLUSIONS ON ONE STUDY.
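For reference, the Bonferroni correction really is about as simple as statistics gets: multiply each p-value by the number of tests performed (capping at 1) and compare the result to the usual threshold. A minimal sketch; the p-values below are invented for illustration and are not from Bohannon's data.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each raw p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical study: 18 outcomes, one of which "hit" with a raw p of 0.04.
raw = [0.04] + [0.5] * 17
adjusted = bonferroni(raw)
print(round(adjusted[0], 2))  # 0.72, nowhere near significant after correction

# A genuinely strong result survives: a raw p of 0.001 across 18 tests adjusts to 0.018.
print(round(bonferroni([0.001] + [0.5] * 17)[0], 3))  # 0.018
```

The replication point can be quantified the same way. Under the simplifying assumption that a p-hacked "winner" is equally likely to be any of eighteen outcomes, two independent trials would land on the same outcome only 1 time in 18, or about 6 percent of the time.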

Conclusion 5: Trust Science Journalism Less

This is the one that’s correct.

But it’s not totally correct. Bohannon boasts of getting his findings in a couple of daily newspapers and the Huffington Post. That’s not exactly the cream of the crop. The Economist usually has excellent science journalism. Magazines like Scientific American and Discover can be okay, although even they get hyped. Reddit’s r/science is good, assuming you make sure to always check the comments. And there are individual blogs like Mind the Brain run by researchers in the field that can usually be trusted near-absolutely. Cochrane Collaboration will always have among the best analyses on everything.

If you really want to know what’s going on and can’t be bothered to ferret out all of the brilliant specialists, my highest recommendation goes to Wikipedia. It isn’t perfect, but compared to anything you’d find on a major news site, it’s like night and day. Wikipedia’s Health Effects Of Chocolate page is pretty impressive and backs everything it says up with good meta-analyses and studies in the best journals. Its sentence on the cardiovascular effects links to this letter, which is very good.

Do you know why you can trust Wikipedia better than news sites? Because Wikipedia doesn’t obsess over the single most recent study. Are you starting to notice a theme?

For me, the takeaway from this affair is that there is no one-size-fits-all solution to make statistics impossible to hack. Getting rid of p-values is appropriate sometimes, but not other times. Demanding large sample sizes is appropriate sometimes, but not other times. Not trusting silly conclusions like “chocolate causes weight loss” works sometimes but not other times. At the end of the day, you have to actually know what you’re doing. Also, try to read more than one study.