Parts of science are in crisis. Big, important findings of previous studies – things we thought we knew – have failed to stand up to scrutiny: When other scientists re-created the experiments, the same results didn't appear.

The problem is widespread. One major study, led by the psychologist Brian Nosek of the University of Virginia, tried to replicate the findings of 100 psychological experiments; fewer than half of the results held up. No field, from physics to economics, has been completely unaffected, although the crisis is most keenly felt in medical science and psychology.

Some of the best-known psychological studies ever have failed to replicate. For instance, a famous 1988 study found that facial expressions affect our mood – people who held a pencil between their teeth, forcing them to smile, apparently found cartoons funnier. It was hugely influential and has been cited more than 1,500 times by other researchers – but last year a major replication attempt failed to find any effect. In another instance, researchers found in 2010 that "power poses" – standing in ways associated with dominance – made people more assertive. That study also had a huge impact, and a TED talk by one of its authors has been viewed more than 42 million times – but again, it has failed to replicate, and another of the authors has said she doesn't believe the "power pose" effect is real. These are two of the most high-profile examples, but there are hundreds more.

In an upcoming paper in the journal Nature Human Behaviour, available in preprint form on the site PsyArXiv, 72 scientists – including Nosek – have suggested a partial solution. Marcus Munafo, a professor of experimental psychology at the University of Bristol and one of the authors of the paper, told BuzzFeed News: "We're past the point where we can just highlight the problem and say how terrible it is. We need to think about ways in which we can improve the quality of what we do." Their suggestion is to make it much harder to declare that you've found a "statistically significant" result in the first place.

The proposal has sparked significant controversy, but on balance, the scientists BuzzFeed News spoke to were in favour. "I'm still absorbing it," Sanjay Srivastava, a professor of psychology at the University of Oregon, who was not involved in the study, told BuzzFeed News. "But I think it has a lot going for it."

"It raises a very important issue," David Spiegelhalter, Winton professor of the public understanding of risk at the University of Cambridge, told BuzzFeed News. "It's crude, but I've got some sympathy for it."

To explain what that means, we're going to have to discuss how science works. At its most basic level, science is simple: A scientist puts forward a hypothesis, then tests it by collecting data. So she could put forward the hypothesis "this die is loaded so it always rolls a six" and test it by rolling the die.

If she rolls a six, that doesn't prove the die is loaded. It could have been a fluke. If she rolls it again and it's a six again, that doesn't prove it either. In fact, she could roll it forever, and she'd never actually prove the die is loaded.

What you can say is how likely it is that you'd see the results you're getting if the die isn't loaded. On a fair die, there's a 1 in 6 chance of rolling a six. So if your hypothesis is wrong – in technical language, if the "null hypothesis" is true – there's a 1/6 probability you'd see a six anyway. There's a 1/36 chance that you'd see two sixes in a row on a fair die, and a 1/216 chance that you'd see three in a row.
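The die arithmetic above is easy to check for yourself. Here's a minimal sketch in Python (the function name is ours, not from the researchers):

```python
from fractions import Fraction

def p_consecutive_sixes(k):
    """Chance of rolling k sixes in a row if the die is fair,
    i.e. the probability of seeing this result when the
    null hypothesis ("the die is not loaded") is true."""
    return Fraction(1, 6) ** k

# One six is unsurprising; three in a row is far less likely.
print(p_consecutive_sixes(1))  # 1/6
print(p_consecutive_sixes(2))  # 1/36
print(p_consecutive_sixes(3))  # 1/216
```

Using exact fractions rather than decimals keeps the arithmetic transparent: each extra six multiplies the probability by 1/6.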

This probability is known as the "p-value", and it's usually written as a decimal between 0 and 1: For instance, something that you'd see 1 time in 10 if the null hypothesis is true is written as p=0.1, i.e. 1 divided by 10.

By convention, in science a finding is considered "statistically significant" if you'd see a result that extreme less than 1 time in every 20 when the null hypothesis is true: if p=0.05 or less. In our die experiment, that would mean a single roll of the die wouldn't be enough. Rolling one six on a fair die has a p-value of 0.1667 (1 divided by 6), much higher than 0.05. But if you rolled two sixes in a row, you could comfortably declare the result statistically significant, at p=0.0278 (1 divided by 36).
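Translating that convention into code makes the cutoff explicit. Another hedged sketch in Python – the threshold and function names are ours, chosen for illustration:

```python
SIGNIFICANCE_THRESHOLD = 0.05  # the conventional cutoff, not a law of nature

def p_value_for_sixes(num_sixes):
    """Chance of rolling this many sixes in a row on a fair die."""
    return (1 / 6) ** num_sixes

def is_statistically_significant(num_sixes):
    """True if the result clears the conventional p <= 0.05 bar."""
    return p_value_for_sixes(num_sixes) <= SIGNIFICANCE_THRESHOLD

print(round(p_value_for_sixes(1), 4), is_statistically_significant(1))  # 0.1667 False
print(round(p_value_for_sixes(2), 4), is_statistically_significant(2))  # 0.0278 True
```

One six falls well short of the bar; two in a row clears it – which is exactly why a lone surprising result is weak evidence, and why the proposal discussed above would raise that bar.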