Brian Wansink didn’t mean to spark an investigative fury that revisited his entire life’s work. He meant to write a well-intentioned blog post encouraging PhD students to jump at research opportunities. But his blog post accidentally highlighted some questionable research practices that caused a group of data detectives to jump on the case.

Wansink attracted the attention because he’s a rockstar researcher—when someone’s work has had such astronomical impact, problems in their research are a big deal. His post also came at a time when his field, social sciences, is under increased scrutiny due to problems reproducing some of its key findings.

Wansink is probably regretting he ever started typing. Tim van der Zee, one of the scientists participating in the ongoing examination into Wansink’s past, keeps a running account of what’s turned up so far. “To the best of my knowledge,” van der Zee writes in a blog post most recently updated on April 6, “there are currently 42 publications from Wansink which are alleged to contain minor to very serious issues, which have been cited over 3,700 times, are published in over 25 different journals, and in eight books, spanning over 20 years of research.”

That’s enough to cause an entire field to rethink what it thought it knew.

If you hide your junk food, you’ve probably heard of Wansink

You’ve probably come across Wansink’s ideas at some point. He researches how subtle changes in the environment can affect people’s eating behavior, and his findings have made a mark on popular diet wisdom. Perhaps you’ve adopted the tip to use smaller plates to trick yourself into eating less, moved your unhealthy snacks into a hard-to-reach place, or placed your fruit bowl prominently on your kitchen counter. Maybe you’ve scoffed at the “health halo” marketing of a decidedly unhealthy food, or chosen 100-calorie snack packs to control your intake.

And Wansink has influenced more than just popular culture. “All of us who are in this sort of field use and refer to Brian’s work all the time,” researcher Yoni Freedhoff told Ars. Freedhoff specializes in obesity and writes about evidence-based nutrition and weight management. Wansink blurbed Freedhoff’s book, while Dr. Freedhoff has interviewed Wansink and talked about his research on television.

The errors are “concerning,” Freedhoff says: “Were it to be found that these studies aren’t relevant and aren’t correct, that would be a huge blow not just to our understanding of what’s going on with consumer psych and marketing, but many public policy efforts.” And even if the most important pieces of Wansink’s work, such as his research on “health halos,” are found to be solid, Freedhoff worries that the whole body of Wansink’s work now runs the risk of losing its ability to influence decisions.

That could be a problem, because we should keep in mind that all of Wansink’s ideas might be correct, despite the scrutiny. Doing the things Wansink recommends could, perhaps, make you healthier and help you lose weight. There’s currently no evidence that they won’t, and much of the advice is in the neighborhood of common sense. But the evidence that his recommendations will help is under fire. And if it turns out that no strong evidence suggests that these ideas work, then any public efforts based on them may have been wasted. And the money and resources being funneled into Wansink’s research (rather than other, more robust work) become a cause for concern.

According to its website, Wansink’s Food and Brand Lab at Cornell receives funding from a variety of sources, including the US Department of Agriculture, the National Institutes of Health, non-profits, and private industry. The lab’s work has massive impact: “We aggressively disseminate our findings through outreach partners and the media,” Wansink says on his website, which notes that his “Smarter Lunchroom Movement” is now found in more than 30,000 schools across the US.

Accidentally raising the alarm

Things began to go bad late last year when Wansink posted some advice for grad students on his blog. The post, which has subsequently been removed (although a cached copy is available), described a grad student who, on Wansink’s instruction, had delved into a data set to look for interesting results. The data came from a study that had sold people coupons for an all-you-can-eat buffet. One group had paid $4 for the coupon, and the other group had paid $8.

The hypothesis had been that people would eat more if they had paid more, but the study had not found that result. That’s not necessarily a bad thing. In fact, publishing null results like these is important—failure to do so leads to publication bias, which can lead to a skewed public record that shows (for example) three successful tests of a hypothesis but not the 18 failed ones. But instead of publishing the null result, Wansink wanted to get something more out of the data.

“When [the grad student] arrived,” Wansink wrote, “I gave her a data set of a self-funded, failed study which had null results... I said, ‘This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.’ I had three ideas for potential Plan B, C, & D directions (since Plan A had failed).”

The responses to Wansink’s blog post from other researchers were incredulous, because this kind of data analysis is considered an incredibly bad idea. As a very famous xkcd strip explains, trawling through data, running lots of statistical tests, and looking only for significant results is bound to turn up some false positives. This practice of “p-hacking,” hunting for significant p-values in statistical analyses, is one of the many questionable research practices responsible for the replication crisis in the social sciences.
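To see why running many tests is risky, here is a minimal Python sketch. It rests on assumptions that are not from Wansink's study: a significance threshold of 0.05, tests that are fully independent of one another, and data that contain no real effect at all. The function names are hypothetical and chosen only for illustration.

```python
import random

ALPHA = 0.05  # assumed significance threshold, for illustration only


def chance_of_false_positive(num_tests: int, alpha: float = ALPHA) -> float:
    """Chance that at least one of num_tests independent tests on pure
    noise comes out 'significant' at the given threshold."""
    return 1 - (1 - alpha) ** num_tests


def simulate(num_tests: int, trials: int = 10_000,
             alpha: float = ALPHA, seed: int = 0) -> float:
    """Estimate the same chance by simulation.

    When there is no real effect, a p-value is uniformly distributed
    between 0 and 1, so each test is modeled as one uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

Under these assumptions, a single test gives a false alarm 5 percent of the time, but 20 tests on the same noise give at least one false alarm about 64 percent of the time. Real analyses of a single data set are usually correlated rather than independent, so the exact figure would differ, but the direction of the problem is the same.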

A long and heated comment thread opened with disbelief from Paul Kirschner, a professor of educational psychology at the Open University of the Netherlands: “Is this a tongue-in-cheek satire of the academic process or are you serious? I hope it’s the former.” Kirschner contacted the journals that had published the papers Wansink mentioned, but told Ars that only one replied to him. That journal told him that there was nothing they could, or would, do. As a journal editor himself, Kirschner was shocked by the lack of action.

The problems with p-hacking

Wansink engaged in the comment thread on his post, thanking the commenters who pointed out the statistical problems with the approach. Many of the commenters suggested that Wansink might simply not know why p-hacking was a problem, so they linked him to resources and papers explaining it.

Wansink later added a note onto the post, arguing that “P-hacking shouldn’t be confused with deep data dives—with figuring out why our results don’t look as perfect as we want... Cool data contains cool discoveries.”

Using existing data to answer new questions is fine; plenty of excellent research does this with census data, for example. But if you don’t define your question before you go leaping in, you could come out latching onto any significant p-value and calling it real. “It doesn’t matter how ‘cool’ your data are,” writes statistician Andrew Gelman on his blog. “If the noise is much higher than the signal, forget about it.”

It is entirely possible that the problem stems from a lack of understanding of statistics. “My impression remains that Wansink is a naïf when it comes to research methods,” noted Gelman on his blog. Naivety does seem to explain how Wansink could have posted his story without knowing that it would cause such outcry.

Problems with p-hacking are by no means exclusive to Wansink. Many scientists receive only cursory training in statistics, and even that training is sometimes dubious. This is disconcerting, because statistics provide the backbone of pretty much any research looking at humans, as well as a lot of research that doesn’t. If a researcher is trying to tell whether changing something (like the story someone reads in a psychology experiment, or the drug someone takes in a pharmaceutical trial) causes different outcomes, they need statistics. If they want to detect a difference between groups, they need statistics. And if they want to tease out whether one thing could cause another, they need statistics.

The replication crisis in psychology has been drawing attention to this and other problems in the field. But problems with statistics extend far beyond just psychology, and the conversation about open science hasn’t reached everyone yet. Nicholas Brown, one of the researchers scrutinizing Wansink’s research output, told Ars that “people who work in fields that are kind of on the periphery of social psychology, like sports psychology, business studies, consumer psychology... have told me that most of their colleagues aren’t even aware there’s a problem yet.”

The pizza papers

Enter the “data detectives,” a tongue-in-cheek name for researchers who, frustrated with poor standards in science, scrutinize the results in published papers. Brown and van der Zee, along with colleague Jordan Anaya, decided to take a look at the four “pizza papers” that came out of this data set. They contacted Wansink and asked him to share his data, but he responded that it wouldn’t be possible because the data contained identifying characteristics of the research participants.

But did a poor grasp of statistics lead to actual problems? van der Zee and his colleagues figured that the only way to find out was to scour the papers for inconsistencies themselves. To look for inconsistencies, errors, or weak data, they used statistical tools that check whether the numbers reported in a paper match up with one another.

To understand how this works, imagine you wanted to find out the average number of pieces of candy eaten by three people. You’d add up all the pieces eaten and divide by three. Your average should be a whole number, or else end in .33 or .67, because those are the only options when you divide a whole-number total by three. The number you divide by limits the decimals that can show up in the answer. If you reported an average ending in .25 and a sample size of three, you’re either doing something odd or you’re really bad at math.

In a similar way, van der Zee and his colleagues looked at reported statistics, like the mean and standard deviation, to see if they made sense together. They also looked at inconsistencies between the papers—which, remember, had all come from the same data. Altogether, they found more than 150 errors across the four papers.
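The candy example can be turned into a small consistency check. The Python sketch below is an illustration of the general idea only, not the tool the researchers used. The function name `mean_is_possible` is hypothetical, and the sketch assumes the underlying data are non-negative whole-number counts and that the mean is reported as text with a fixed number of decimals.

```python
def mean_is_possible(reported_mean: str, sample_size: int) -> bool:
    """Return True if some whole-number total, divided by sample_size,
    rounds to the reported mean at the precision it was reported with.

    reported_mean is passed as text (for example "2.33") so the number
    of reported decimal places is preserved.
    """
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    target = float(reported_mean)
    # Half a unit in the last reported decimal place, plus a tiny
    # allowance for floating-point error (ties are treated leniently).
    tolerance = 0.5 * 10 ** (-decimals) + 1e-9

    # The true total must be a whole number near target * sample_size,
    # so only the nearest candidates need checking.
    nearest_total = round(target * sample_size)
    for total in (nearest_total - 1, nearest_total, nearest_total + 1):
        if total >= 0 and abs(total / sample_size - target) <= tolerance:
            return True
    return False


if __name__ == "__main__":
    for mean in ("2.00", "2.33", "2.67", "2.25"):
        print(mean, mean_is_possible(mean, 3))
```

With a sample size of three, the means 2.00, 2.33, and 2.67 pass the check, while 2.25 fails, matching the reasoning above. A failed check does not say what went wrong, only that the reported mean and the reported sample size cannot both be right for whole-number data.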

Currently, their analysis, “Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab,” has not been peer-reviewed. But it has been submitted to a scientific journal. The researchers posted it on peerj.com, a site that allows researchers to get feedback from a wide range of people and disseminate their results quickly, before the lengthy peer review process begins. “Statistical heartburn” has been downloaded more than 4,000 times.

Wansink added another note to his blog post, acknowledging the problems and announcing that a statistician would be redoing the analyses. “Upon learning of these inconsistencies, we contacted the editors of all four journals to swiftly and squarely deal with these inconsistencies,” he writes, adding that his lab would be undergoing an overhaul of its data practices.