Ironically enough, it seems that one of the most reliable findings in psychology is that only half of psychological studies can be successfully repeated.

That failure rate is especially galling, says Simine Vazire from the University of California at Davis, because the Many Labs 2 teams tried to replicate studies that had made a big splash and been highly cited. Psychologists “should admit we haven’t been producing results that are as robust as we’d hoped, or as we’d been advertising them to be in the media or to policy makers,” she says. “That might risk undermining our credibility in the short run, but denying this problem in the face of such strong evidence will do more damage in the long run.”

Many psychologists have blamed these replication failures on sloppy practices. Their peers, they say, are too willing to run small and statistically weak studies that throw up misleading fluke results, to futz around with the data until they get something interesting, or to only publish positive results while hiding negative ones in their file drawers.

But skeptics have argued that the misleadingly named “crisis” has more mundane explanations. First, the replication attempts themselves might be too small. Second, the researchers involved might be incompetent, or lack the know-how to properly pull off the original experiments. Third, people vary, and two groups of scientists might end up with very different results if they do the same experiment on two different groups of volunteers.

The Many Labs 2 project was specifically designed to address these criticisms. With 15,305 participants in total, the new experiments had, on average, 60 times as many volunteers as the studies they were attempting to replicate. The researchers involved worked with the scientists behind the original studies to vet and check every detail of the experiments beforehand. And they repeated those experiments many times over, with volunteers from 36 different countries, to see if the studies would replicate in some cultures and contexts but not others. “It’s been the biggest bear of a project,” says Brian Nosek from the Center for Open Science, who helped to coordinate it. “It’s 28 papers’ worth of stuff in one.”

Despite the large sample sizes and the blessings of the original authors, the Many Labs 2 team failed to replicate half of the studies it focused on. It couldn’t, for example, show that people subconsciously exposed to the concept of heat were more likely to believe in global warming, or that moral transgressions create a need for physical cleanliness in the style of Lady Macbeth, or that people who grow up with more siblings are more altruistic. And as in previous big projects, online bettors were surprisingly good at predicting which studies would ultimately replicate. Somehow, they could intuit which studies were reliable.