In 1998, psychologists found evidence of a tantalizing theory: We all have a finite mental store of energy for self-control and decision-making. Resisting temptations, or making tough decisions, saps this energy over time.

Willpower is like a muscle, the argument goes. When it's tired, we're less focused; we give in to temptation and make shoddy decisions that hurt us later. The original 1998 experiment used chocolate chip cookies, radishes, and an impossible quiz to elegantly illustrate this. Participants who were told to eat radishes and resist cookies gave up on the quiz faster than the people who were allowed to eat the cookies.

Over the years, the theory has been tested in hundreds of peer-reviewed studies, with countless stand-ins for the chocolate, radishes, and the quiz. Scientists have shown how diminished willpower can affect our ability to hold on to a handgrip, sap our motivation to help another in need, and even negatively impact athletic performance.

This huge body of research has helped ego depletion, as psychologists call it, and its offshoot decision fatigue, become the basis for best-selling books, TED talks, and countless life hacks. In an age where temptations and decisions pummel us at warp speed, it's become an empowering concept. If we know how the system works, we can game it: President Obama famously doesn't pick out his suits, for fear that it might deplete some of his decision-making capabilities.

But the whole theory of ego depletion may be on the brink of collapse.

Slate's Daniel Engber reports on an upcoming study in the journal Perspectives on Psychological Science that found, in a test with more than 2,000 participants across more than 20 labs, "a zero-effect for ego depletion: No sign that the human will works as it’s been described, or that these hundreds of studies amount to very much at all."

How could hundreds of peer-reviewed studies possibly be so wrong? There may be a way to explain it, and it's shaking researchers to their cores.

Every time scientists conduct an experiment, there's a chance they'll find a false positive. But here's the scary thing: Psychologists are now realizing their institutions are structured so it's more likely that false positives will make it through to publication than inconclusive results.

"We’re now learning that there’s so much bias in the published literature that the meta-analyses can’t be trusted," Simine Vazire, a professor of psychology and the editor in chief of the journal Social Psychological and Personality Science, tells me.

This has led to a painful period of introspection for psychology, leaving researchers bewildered, even scared. What if more fundamental research findings, the kind that have spurred books, self-help guides, and countless articles, don't hold up to scrutiny? Does psychology lose its validity as a science?

Michael Inzlicht, a psychology professor at the University of Toronto, is a co-author on the forthcoming ego depletion paper. While he's not ready to discuss it in depth ("I do not think it's wise to talk about this until people can actually read the paper for themselves," he tells me in an email), he did clarify that the result won't spell the absolute death of ego depletion theory. "There would need to be a few more of these massive replication failures to support a claim like that," he says.

But beyond the demise of the theory, for Inzlicht the results represent something greater, and sadder. He's worked on ego depletion for most of a decade. His studies have been published in top journals. "I’m in a dark place," he writes in a recent blog post. "Have I been chasing puffs of smoke for all these years?"

Depending on whom you ask, this moment is either a crisis for the science or a revolution to hold researchers and journals more accountable for flimsy conclusions.

For psychologists, the problem is not going to go away anytime soon. Nor are the solutions easy. But there's a chance that this fire will be cleansing — and that the science of psychology will emerge from this period stronger, more effective, and more trustworthy.

Psychology's crisis goes far beyond this one theory

It's not just ego depletion.

In recent years, psychologists have been forced to reexamine many of the discipline's most famous and influential findings.

Over the past decade, studies started to trickle into the literature suggesting that major discoveries in psychology may be the result of experimenter bias. In one paper, psychologists showed that standard statistical practices could be used to make just about any effect appear significant. For proof, look no further than the Journal of Personality and Social Psychology's publication of a result finding that people were capable of precognition, which most scientists would say is impossible.

Then it became apparent these problems weren't just on the fringes of the science but had infected some of the field's most celebrated findings.

In 2012, social priming — an influential theory that explains how subliminal cues influence our behavior — failed a replication test. The theory gained popularity after a 1996 experiment showed a surprising effect: Participants who completed a word puzzle filled with phrases related to the elderly actually started to behave differently. Researchers recorded them walking more slowly to an exit after the quiz.

Like ego depletion, the priming experiment inspired many offshoots. One popular test showed that when a person holds a cold drink during a conversation, he or she can perceive the other person as having a chillier personality. Another test found that if interviewers carry a heavy clipboard while talking to a job candidate, they think the candidate is more serious.

These conclusions are the type that make one marvel at the mystery and complexity of the human brain. They make us wonder about our free will. (It was theories like these that led me to major in psychology in college.)

Social priming theory isn't necessarily wrong. But when researchers failed to replicate the slow-walking result with more than double the number of participants, it cast doubt on both the conclusions and psychology's ability to reliably test for them. Especially concerning was that in the replication test, experimenters only found the result — participants walking more slowly — when they were told this was the probable outcome.

The crisis intensified this past August when a group of psychologists called the Open Science Collaboration published a report in the journal Science with evidence of an overarching problem: When 270 psychologists tried to replicate 100 experiments published in top journals, only around 40 percent of the studies held up. The remainder either failed or yielded inconclusive data. What's more, the replications that did work showed weaker effects than the original papers.

How experimenters skew their own experiments

A combination of factors pressures scientific journals to publish studies that may overstate their conclusions.

First there's publication bias. The basic idea here is that journals tend to accept papers that find a positive conclusion. A scientist can run two experiments: One works, one doesn't. The one that works is submitted to the journal; the one that doesn't stays in a drawer. (A study in the Journal of Experimental Political Science finds evidence to suggest that non-published studies replicate more reliably than published ones.)
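To see why the file drawer matters, here is a minimal simulation sketch in Python. It assumes a world where the true effect is exactly zero and a journal that only accepts significant results in the predicted direction. The function names, the sample sizes, and the normal-approximation cutoff of 1.96 are illustrative assumptions, not details taken from any study discussed here.

```python
import random
import statistics


def run_study(rng, n=20, true_effect=0.0):
    """Simulate one two-group study; return (effect estimate, significant?)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    # |z| > 1.96 approximates the usual two-sided 5 percent threshold.
    return diff, abs(diff / se) > 1.96


def file_drawer(n_studies=2000, seed=0):
    """Compare the average effect in all studies with the 'published' subset."""
    rng = random.Random(seed)
    results = [run_study(rng) for _ in range(n_studies)]
    all_effects = [diff for diff, _ in results]
    # Assumed publication rule: only significant, positive findings get printed.
    published = [diff for diff, sig in results if sig and diff > 0]
    return statistics.mean(all_effects), statistics.mean(published), len(published)


if __name__ == "__main__":
    mean_all, mean_published, n_published = file_drawer()
    print(f"average effect, all studies:       {mean_all:+.3f}")
    print(f"average effect, published studies: {mean_published:+.3f}")
    print(f"published: {n_published} of 2000")
```

Under these assumptions, the average effect across all simulated studies sits near zero, while the average among the "published" ones looks substantial, even though there is nothing real to find.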

Then there's p-hacking, an array of statistical techniques scientists can use to make their results appear more significant than they actually are. (A p-value is a measure of statistical significance: roughly, how likely a result at least this extreme would be if there were no real effect.) One example: Researchers can stop collecting data as soon as their results reach statistical significance. That would be like flipping a coin, getting three heads in a row, and then concluding that coin flips always end on heads.
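The stopping trick can be sketched in a few lines of Python. This is a simplified illustration, not a reconstruction of any real study: it assumes there is no true effect, that a researcher adds 10 participants at a time up to 100, and that significance is judged with a rough normal-approximation cutoff of 1.96. All names and numbers are hypothetical.

```python
import random
import statistics


def significant(sample):
    """Rough test of whether the sample mean differs from zero."""
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return abs(statistics.mean(sample) / se) > 1.96


def false_positive_rate(peek, n_sims=1000, max_n=100, step=10, seed=1):
    """Share of null experiments that end up 'significant'.

    With peek=True the researcher tests after every batch and stops at the
    first significant result; with peek=False the test runs once, at max_n.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = []
        while len(sample) < max_n:
            sample.extend(rng.gauss(0.0, 1.0) for _ in range(step))
            if peek and significant(sample):
                break  # stop collecting as soon as the result "works"
        hits += significant(sample)
    return hits / n_sims


if __name__ == "__main__":
    print("test once at the end:  ", false_positive_rate(peek=False))
    print("stop when significant: ", false_positive_rate(peek=True))
```

In this sketch, testing once at the end yields a false positive rate close to the nominal 5 percent, while checking after every batch pushes it well above that, with the same data-generating process and no real effect in either case.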

A 2012 survey of 2,000 psychologists found these tactics are commonplace. Fifty percent admitted to only reporting studies that panned out (ignoring data that was inconclusive). Around 20 percent admitted to stopping data collection after they got the result they were hoping for. Most of the respondents thought their actions were defensible.

These researchers are succumbing to what's known as confirmation bias: our human tendency to want to see the world as we predict it. They're not necessarily trying to deceive. They're just being human, and aren't immune to the theories they lecture on.

A lot of these "p-hacks" then lead to the problem of underpowered studies: studies with sample sizes too small to really be reliable. And simply put, the less powered a study, the more prone it is to find a result that isn't real.
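A back-of-the-envelope calculation shows why low power makes a significant result less trustworthy. The sketch below assumes, purely for illustration, that 10 percent of the hypotheses researchers test are actually true, that the significance threshold is 5 percent, and that no other bias is at work. The function name and these numbers are assumptions, not figures from the research described here.

```python
def positive_predictive_value(power, alpha=0.05, prior=0.1):
    """Share of significant results that reflect a real effect.

    power: chance a study detects an effect that truly exists
    alpha: chance a study flags an effect that does not exist
    prior: assumed share of tested hypotheses that are true
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for power in (0.8, 0.5, 0.2):
        ppv = positive_predictive_value(power)
        print(f"power {power:.0%}: {ppv:.0%} of significant results are real")
```

With these assumed inputs, a well-powered study (80 percent) produces significant results that are real about 64 percent of the time, while at 20 percent power that share falls to about 31 percent.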

A bit ironically, one victim of underpowered studies might be the mega-popular theory of power poses. In 2010 an experiment with 42 participants found evidence that people could be made to feel more powerful if they posed with an open, expansive posture. The theory inspired a TED talk that has been viewed more than 32 million times. It's an appealing, digestible idea: one weird trick to feel more powerful!

But like ego depletion and social priming, power posing effects have failed to replicate with larger subject pools. (Clarification: A power pose replication test did find participants felt a subjective sense of power. But the test failed to find the hormonal and behavioral changes that made the original paper a blockbuster.)

The point of replication isn't to shame researchers — it's to build better science

With the Open Science Collaboration project and other large-scale replication projects like it, psychologists aren't setting off to prove or disprove individual conclusions. Rather, they're asking the question: What is the difference between the experiments that can be replicated and the ones that cannot?

The answer to that question is the key to solving the discipline's core problem. If psychology finds it has to start from scratch evaluating its hypotheses, at least it will be able to do so in a manner that's more methodologically sound.

"We never want to be at a point where every single study, every one, has to have five direct replications run," says Sanjay Srivastava, a psychologist at the University of Oregon who blogs about issues in the field on his website the Hardest Science. "We want to know, ultimately, what are the signs of a healthy science?"

The August replication study already started to point toward the answers.

One thing scientists are learning is that studies with higher statistical power (a function of sample size) are more likely to reproduce. "If I go out and I see another study, and nobody’s run a replication, I can have more trust if the study has higher power," Srivastava says.
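The link between sample size and power can be estimated by simulation. The sketch below is a rough illustration under stated assumptions: a real effect of half a standard deviation, two equal groups, and a normal-approximation cutoff of 1.96. The function name and every parameter value are hypothetical.

```python
import random
import statistics


def estimated_power(n, true_effect=0.5, n_sims=1000, seed=2):
    """Fraction of simulated two-group studies (n per group) that detect
    an effect that really exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
        hits += abs(diff / se) > 1.96  # counts as a detection
    return hits / n_sims


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(f"{n} per group: estimated power {estimated_power(n):.2f}")
```

Under these assumptions, a study with 20 people per group misses the real effect more often than it finds it, while one with 100 per group detects it in the large majority of runs.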

Studies that yield highly significant results are more likely to reproduce than those that are just barely significant. And direct one-to-one effects are more likely to reproduce than complicated interactions.

The story of the replication crisis got a bit more complicated last week when a group of psychologists out of Harvard published a critique of the replication project, also in Science.

The paper was led by Harvard psychologist Daniel Gilbert (you'll recognize him from Prudential commercials), who has generally challenged the replication movement. His issue, in short, was that in the process of trying to replicate experiments, the Open Science Collaboration introduced enough error to render the replications meaningless. And on top of that, Gilbert and his co-authors asserted the "reproducibility of psychological science is quite high."

A handful of researchers I spoke with found this argument unconvincing: If the Harvard authors wanted to prove that more replications and stricter adherence to protocols increased the replication rate, they should have conducted the experiments.

But it goes to show: Even the science of assessing science is a hotly debated work in progress.

And to be clear, a failed replication doesn't mean the original study is wrong. It could be that the new experiment didn't precisely recreate the conditions of the first. But then that raises the question: Should experiments be so sensitive that they fail with small adjustments?

"Replication is often more complicated in psychology [than other sciences] because we tend to study things that are not always directly observable," Ingrid Haas, a professor of psychology and political science, writes me in an email. Behaviors like love, friendship, bravery, and trust often have to be coaxed out of psychology experiment participants through role playing. Those scenarios are extremely sensitive to changes in culture and context, and are difficult to recreate.

Psychology is becoming more humble

Behind the replication movement is an earnest desire to be more transparent and humble about scientific conclusions.

"I think psychology has a lot of potential," Vazire says, "and I think we’re improving it as a tool to answer really important questions, but I’m not sure we have a lot of answers yet."

Remember, she's a psychologist and a journal editor.

Vazire isn't the only one who thinks this way. I heard the same line from a few psychologists: The scientists don't want the public's trust just because they wear lab coats. They want to earn it. (An exception to this was Gilbert, who thinks the reproducibility crisis is overblown. "The average person cannot evaluate a scientific finding for themselves any more easily than they can represent themselves in court or perform surgery on their own appendix," he wrote me.)

Brian Nosek, who co-founded the Center for Open Science at the University of Virginia and coordinated the replication paper in Science, believes the answer to the science's problems is transparency.

"Transparency is making available the methodology, data, and process that one used to arrive at a scientific claim," Nosek, who was the lead author on the August replication report, writes me in an email. "It means that anyone can evaluate and critique the research. Researchers willing to be transparent signal that they are open to scrutiny, and research that survives scrutiny may be more robust."

The biggest change Nosek and his like-minded colleagues are calling for is the preregistration of study designs. Usually researchers don't have to tell anyone about their experimental designs until they publish results. This opens the door to all the p-hacks and biases mentioned above.

"Preregistration reduces my flexibility as a researcher to conduct many analyses and studies and report only a subset that happens to fit my preconceived views of what should occur," Nosek explains. Registration will make it harder for scientists to cherry-pick data that makes them look good. It also will make it easier for other labs to replicate the tests.

Changes are already being made. More and more journals are requiring the preregistration of experiments, and are reviewing study designs more intensely.

"Generally I'm a pessimist," Barbara Spellman, a former editor of Perspectives on Psychological Science, tells me. "But I do think ultimately, after the ugliness is over, and I don't expect that this is the end of it, the science will end up being better."

Vazire's journal Social Psychological and Personality Science is paying more attention to statistical power, asking researchers to publish more of the data that was excluded from the final paper. "As we learn more and more about what kinds of studies and results are more likely to replicate, we know better how to evaluate submissions," she says.

So how should we evaluate psychological claims?

"Any good science should always be looking at its methods, its statistics, but in a bigger sense, its institutions, the way it thinks about evidence," Srivastava says.

It's also important to remember: Replication issues aren't limited to psychology. Biology and medicine have gone through similar trials. Psychology just has the privilege of being a very popular science for a general audience: Its conclusions are easier to fit into our boring, everyday lives.

It's also a relatively young science. Freud's spearheading work is barely 100 years old, and even now that's considered to be more like literature than science. What will we know in a century that will make our current knowledge look quaint?

The truest thing written about psychology's crisis was in the conclusion of the August report in Science. "Humans desire certainty, and science infrequently provides it," it stated. "As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation."

This period of introspection isn't just for psychologists. It's also for the writers who report on its conclusions and the public who consumes psychology news. We need to become more skeptical, and we need to put individual pieces of research within a larger context. We should think before jumping to turn psychological conclusions into "news you can use."

"You want to look for converging lines of evidence," Srivastava says of evaluating psychological conclusions. A survey that finds people are happiest when they're around friends is interesting. That finding is more convincing when an experiment also finds the same effect. When those experiments are replicated — both exactly and in new contexts — that's even better.

"The thing you want to avoid is that you just found a bunch of patterns you can heap together into a story," he says. (That's true for all genres of journalism.) All the lines of evidence have to stand the test of replication.

These are troubling times for psychology, but there's also reason for optimism. During this time of reappraisal, some textbooks may need to be rewritten, and some egos will be badly bruised. But psychology will have a more solid foundation.

"To be clear: I am in love with social psychology," Inzlicht writes. And you have to be honest with those you love.