Something is rotten in the state of biomedical research. Everyone who works in the field knows this on some level. We applaud presentations by colleagues at conferences, hoping that they will extend the same courtesy to us, but we know in our hearts that the majority or even the vast majority of our research claims are false.

When it came to light that the biotechnology firm Amgen tried to reproduce 53 “landmark” cancer studies and managed to confirm only six, scientists were “shocked.” It was terrible news, but if we’re honest with ourselves, not entirely unexpected. The pernicious problem of irreproducible data has been discussed among scientists for decades. Bad science wastes a colossal amount of money, not only on the irreproducible studies themselves, but on misguided drug development and follow-up trials based on false information. And while unsound preclinical studies may not directly harm patients, there is an enormous opportunity cost when drug makers spend their time on wild goose chases. Discussions about irreproducibility usually end with shrugs, however. What can we do to combat such a deep-seated, systemic problem?

Lack of reproducibility of biomedical research is not the result of an unusual level of mendacity among scientists. There are a few bad apples, but for the most part, scientists are idealistic and fervent about the pursuit of truth. The fault lies mainly with perverse incentives and lack of good management. Statisticians Stanley Young and Alan Karr aptly compare biomedical research to manufacturing before the advent of process control. Academic medical research functions as a gargantuan cottage industry, where the government gives money to individual investigators and programs—$30 billion annually in the US alone—and then nobody checks in on the manufacturing process until the final product is delivered. The final product isn’t a widget that can be inspected, but rather a claim by investigators that they ran experiments or combed through data and made whatever observations are described in their paper. The quality inspectors, whose job it is to decide whether the claims are interesting and believable, are peers of the investigators, which means that they can be friends, strangers, competitors, or enemies.

Lack of process control leads to shoddy science in a number of ways. Many new investigators receive no standardized training. People who work in life sciences are generally not crackerjack mathematicians, and there’s no requirement to involve someone with a deep understanding of statistics. Principal investigators rarely supervise the experiments that their students and post-docs conduct alone in the lab in the dead of night, and so they have to rely on the integrity of people who are paid slave wages and whose only hope of future success is to produce the answers the boss hopes are true. The peer review process is corrupted by cronyism and petty squabbles. These are some of the challenges inherent in a loosely organized and largely unregulated industry, but they are not the biggest reasons why so much science is irreproducible. That has more to do with dumb luck.

Randall Munroe has a wonderful cartoon at xkcd that neatly summarizes the reason why most published research findings are false. In the cartoon, scientists ask whether jelly beans cause acne and determine that they don’t. They then proceed to do subgroup analyses on 20 different colors of jelly beans, and excitedly announce that green jelly beans are associated with acne “with 95% confidence!” This is a reference to the traditional gold standard for whether or not a research finding is considered to be statistically significant. Over the last century, scientists have somewhat arbitrarily agreed that if something has only a 1-in-20 chance of happening purely by chance, then when that thing happens, we will consider it to be meaningful. For instance, if the first time you asked someone out on a date that person declined in favor of attending a nephew’s birthday party, you might think of it as a coincidence. If the same excuse came up a second time, you might find it strange that the birthday parties always fell on Friday nights. By the third time, you would have to sadly conclude that there was a less than 1-in-20 chance that yet another nephew had a Friday night birthday party, and that the pattern of rejection was statistically significant.

One could quibble about whether or not 95% confidence is high enough to be truly confident. We wouldn’t fly on planes that had a 5% chance of crashing, but we would probably go on a picnic if there were a 5% chance of rain. Whether it’s the right number for scientific studies isn’t clear, but it is clear that this cutoff for statistical significance should not apply to multiple testing or multiple modeling. The jelly bean cartoon illustrates this point nicely. If the scientists had found an association between jelly beans and acne on the first try, they might reasonably think that it wasn’t just chance: maybe jelly beans cause acne, or maybe acne causes jelly bean cravings. After testing 20 colors of jelly beans, though, the 1-in-20 chance of finding an association by pure chance becomes meaningless. If you test enough jelly beans, you are bound to find an association by pure chance, and that association will be spurious and irreproducible, just like many scientific studies.
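
To put a rough number on the jelly bean problem, here is a minimal Python sketch. It rests on two assumptions that the cartoon does not state: the tests are independent of one another, and no color has any real effect on acne. The function name `family_wise_error_rate` is invented for illustration.

```python
# Chance of at least one spurious "significant" result when several
# independent tests are each run at the conventional 5% threshold.
# Assumption: tests are independent and no real effect exists.

def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests crosses the threshold by luck."""
    chance_all_tests_stay_quiet = (1.0 - alpha) ** num_tests
    return 1.0 - chance_all_tests_stay_quiet


for num_tests in (1, 20, 100):
    rate = family_wise_error_rate(num_tests)
    print(f"{num_tests:>3} tests: {rate:.3f} chance of at least one false alarm")
```

Under these assumptions, a single test gives a false alarm 5% of the time, 20 tests give at least one about 64% of the time, and 100 tests give at least one more than 99% of the time. The green jelly bean is closer to the expected result than to a surprise.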

When scientists run experiments in labs or model large datasets in multiple different ways, they generate heaps and heaps of negative data, but these don’t get reported. All that gets published is the 100th experiment or analysis that “worked.” Furthermore, scientists are rarely required to state upfront how they will measure primary outcomes. To understand why this is a problem, imagine that I claim to have a magic coin. I tell you that I’m going to flip it 10 times, and if it is magic, it will come up heads every single time. That’s a pretty good study. But what if instead I flip my coin 1,000 times and comb through the data for patterns? I find some pattern in a series of 10 flips, and I tell you that the probability of that exact sequence occurring by luck alone is less than one in 1,000. That’s correct, but are you impressed by the magic of my coin?
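
A short Python simulation makes the gap between the two coin studies concrete. It is only a sketch: it assumes a perfectly fair coin, and it searches for just one pattern (10 heads in a row) rather than any pattern at all, so it understates how easy the second study is to "pass." The function names and the number of simulated studies are invented for illustration.

```python
import random

FLIPS_PER_STUDY = 1000  # length of the data-dredging study
RUN_LENGTH = 10         # length of the pattern announced in advance


def flip_fair_coin(num_flips, rng):
    """Return a string of '1' (heads) and '0' (tails) from a fair coin."""
    return format(rng.getrandbits(num_flips), f"0{num_flips}b")


def has_run_of_heads(flips, run_length=RUN_LENGTH):
    """True if the flips contain run_length heads in a row anywhere."""
    return "1" * run_length in flips


def share_of_studies_with_run(num_studies, rng):
    """Fraction of simulated 1,000-flip studies containing such a run."""
    hits = sum(
        has_run_of_heads(flip_fair_coin(FLIPS_PER_STUDY, rng))
        for _ in range(num_studies)
    )
    return hits / num_studies


# Study 1: the outcome is declared upfront, then the coin is flipped 10 times.
prespecified_chance = 0.5 ** RUN_LENGTH  # exactly 1 in 1,024

# Study 2: flip 1,000 times, then go looking for 10 heads in a row.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
dredged_chance = share_of_studies_with_run(2000, rng)

print(f"Pre-specified 10 heads: {prespecified_chance:.5f}")
print(f"10 heads found somewhere in 1,000 flips: {dredged_chance:.3f}")
```

The sequence that turns up in the second study is individually just as improbable as the one promised in the first, but an ordinary coin produces it far more often than one time in a thousand once you are allowed to search for it after the fact.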

There are some potential solutions to the irreproducibility of medical science, but they would require an extensive overhaul of the system. For observational studies, Young and Karr have proposed sensible measures, like making data publicly available, recording data analysis plans upfront, and splitting the data to be analyzed into test and validation sets. For basic science, public money could be used to set up large testing facilities where experiments can be run by impartial technicians and all results, positive or negative, can be made available to the scientific community. If such changes were implemented, however, the number of published studies would plummet. Journals would go out of business and so would most scientists, unless new criteria were devised for doling out grant money and handing out promotions. Some areas of research would be invalidated if everyone had access to negative studies, and researchers would be discredited. The biomedical research community isn’t ready for these kinds of painful changes. One piece of evidence for this is that nobody knows which 47 studies Amgen was unable to reproduce. To gain the cooperation of the principal investigators of those studies, Amgen was forced to sign non-disclosure agreements about the results of its inquiries. It seems that the authors of the “landmark” cancer studies knew that they would be found out, and unsurprisingly, setting the record straight wasn’t high on their list of priorities.