The Ongoing Controversy Over the Marshmallow Test

The legendary psychological experiment measured self-discipline, but its findings are squishy

It has achieved legendary status as a psychology experiment, but for those who need a refresher: Around 50 years ago, the psychologist Walter Mischel and his colleagues at Stanford University began testing the patience of preschool children. The researchers asked the kids, who ranged in age from around three to six, to choose between two tempting snacks: a pretzel or a marshmallow. After each child chose their favorite (for example, the marshmallow), the researcher would make the child an offer: The researcher was leaving the room. If the child could wait until the researcher returned, they could eat the marshmallow. If the child could not wait, they could ring a bell and the researcher would return early. But if that happened, the child would have to settle for a pretzel, rather than the preferred marshmallow.

The test is meant to measure self-discipline, and there have been several variations of the study performed over the years, including the most famous example of waiting for two marshmallows or settling for one. All of the variations assume the same basic premise: The longer children resist before ringing the bell, the stronger their ability to delay gratification in order to gain a larger reward in the future — arguably whether it’s minutes or years later.

Life is full of these kinds of choices in adulthood: decisions between saving or spending, exercising or relaxing, studying or partying. But could childhood performance in the marshmallow test predict success in later life? To answer this, Mischel teamed up with his colleagues Yuichi Shoda and Philip Peake in 1988 to survey the parents of children who had taken part in the test 10 years earlier. The questions asked about the children’s academic performance, social skills, coping ability, and more. According to their results, the children who had waited the longest for their marshmallow a decade earlier had grown into adolescents who did better at school and in their social lives. This was evidence of childhood self-discipline predicting a healthy future, and it became a seminal study suggesting that a person’s willpower influenced their likelihood of succeeding later in life.

But fast forward 30 years to the present day, and the implications of the study are up for debate. This comes in the wake of a widely publicized replication crisis, in which large numbers of psychology studies are being repeated and failing to reach the original conclusions. Influential theories cited thousands of times, including the idea that power poses boost confidence or that self-control is limited and can be exhausted, have been called into question. To some, the landscape of psychology research as a whole is looking shaky.

Squishy results

In 2018, news began to circulate suggesting that the marshmallow test may be the next domino to fall. The stories were based on a study by Tyler Watts, Assistant Professor of Developmental Psychology at Columbia University (he was at NYU at the time), and his colleagues at the University of California. Their study examined the link between childhood marshmallow test performance and adolescent academic performance for over 900 people, with the data coming from a previous child care study by the National Institute of Child Health and Human Development.

This study was a big step up from the original Mischel studies, most of which tested fewer than 100 children per experiment. Similar to those original studies by Mischel, Watts and his team found that a child’s ability to wait for their marshmallow predicted their future academic achievement. But they decided to control for other factors that could explain the connection. “Correlations are valuable,” says Watts, lead researcher on the project, “but they leave unanswered vital questions regarding whether both gratification delay and outcomes could be the product of other factors.” For example, are children who wait longer for a marshmallow also from wealthier families, and could that be the reason they perform better in school?

After the team accounted for the effects of variables such as socioeconomic status, home environment, and cognitive ability, they found that a child’s ability to delay gratification no longer predicted academic outcomes. They argued that life outcomes were not driven by self-control per se, but rather by other factors. For example, a wealthier background or simply being smarter might boost both self-control and academic outcomes, meaning that self-control predicts better academic performance but does not directly cause it. It’s the old “correlation not causation” problem: Ice cream sales may rise with the percentage of people wearing sunglasses on any given day, but that doesn’t mean sunglasses give people an urge for ice cream — instead, they are both caused by the separate variable of pleasant weather.
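The ice cream and sunglasses logic can be sketched in a few lines of Python. This is only an illustration on made-up data, not Watts’ analysis or dataset: the variable names, the sample size, and the noise levels are all assumptions chosen for the example. Weather drives both ice cream sales and sunglasses, so the two are correlated in the raw data, but the link should largely vanish once the effect of weather is subtracted from each.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


def simulate(n=5000, seed=0):
    """Synthetic days: weather is the hidden common cause (an assumption)."""
    rng = random.Random(seed)
    weather = [rng.gauss(0, 1) for _ in range(n)]
    # Neither outcome influences the other; both depend only on weather.
    ice_cream = [w + rng.gauss(0, 1) for w in weather]
    sunglasses = [w + rng.gauss(0, 1) for w in weather]

    raw = pearson(ice_cream, sunglasses)
    # "Controlling for" weather: correlate what weather cannot explain.
    adjusted = pearson(
        residuals(ice_cream, weather), residuals(sunglasses, weather)
    )
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate()
    print(f"raw correlation:          {raw:.2f}")
    print(f"after controlling for it: {adjusted:.2f}")
```

With these settings the raw correlation should land somewhere near 0.5 and the adjusted one near zero. Real studies control for many variables at once with regression models, but the underlying idea is the same.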

The results of Watts’ study, published in July 2018, were heralded by the media as a failed replication and a debunking of the marshmallow test altogether. But the truth is more nuanced. Even according to Watts’ own analysis, the results were at least a partial replication: When the new dataset was initially analyzed, it still showed that waiting longer for a marshmallow predicted better academic performance in later life. On the other hand, the fact that the effect disappeared after accounting for home environment and cognitive ability suggested that the way the original marshmallow test results were embraced was overzealous. Watts didn’t think his study should be lumped into the larger replication crisis, but said his work “does raise questions about the robustness of some conclusions that have been embraced by many psychologists.”

Yet just when it seemed, at least in the media, that the marshmallow study was bunk, another team of researchers from the University of Colorado and George Mason University published a couple of “not-so-fast” papers. They reanalyzed the data used by Watts and concluded that delaying gratification likely does matter for future achievement, and warned against “throwing the baby out with the bathwater.”

For the average person, this back and forth is highly confusing. Does self-discipline factor into future success, or is it all just explained by variables such as affluence? The answer, much like the entire field of social science, is complicated and requires some digging.

Even squishier

In the decades that followed Mischel’s original marshmallow tests, several prominent research groups ran experiments that produced similar findings. For example, in a study from 2005, Angela Duckworth, author of Grit and a pioneering researcher in the area of self-discipline, recruited eighth-grade students to complete questionnaires assessing their self-control (essentially, an analog of marshmallow test restraint). She then linked each student’s profile to their academic grades approximately seven months later. Students with greater levels of self-discipline had better school attendance and showed more diligence with homework. Ultimately, they also achieved better grades, and their level of self-discipline appeared to be more important than their IQ in predicting how well they performed overall in school.

This is one reason why the widespread news in 2018 about a failure to replicate the marshmallow test came as such a surprise. Sabine Doebel, an assistant professor at George Mason University, decided to ask some questions about Watts’ 2018 study. Together with her colleagues at the University of Colorado, she published a counterargument to Watts in April 2019, arguing that his analysis may have canceled out crucial factors that contribute to a person’s ability to resist immediate gratification. For example, in his study, Watts minimized the effect of a psychological trait called executive function — a mental process that’s involved in suppressing impulses. His reasoning was that this trait reflects a child’s general cognitive ability, and could be the real reason why patient kids do better socially and academically. But Doebel argues that children may need this skill in order to resist a tempting marshmallow, so ruling it out may also rule out any relevance of the marshmallow test altogether.

Laura Michaelson and Yuko Munakata, two of the authors who contributed to the paper with Doebel, went a step further and reanalyzed the data used by Watts, publishing their results in a second paper. They found that whether the marshmallow test predicts outcomes in life depends on exactly how you measure those outcomes. For example, by using a more comprehensive measure of unhealthy behavior than Watts used, they argued that children’s ability to wait for marshmallows did predict their life difficulties as adolescents, even after accounting for some other important childhood influences. Based on the results in their paper, the marshmallow test may predict outcomes primarily because of social support rather than self-control: Children growing up in more supportive and trusting environments may be more willing to wait for rewards. This converges with some of Watts’ arguments about the importance of a child’s environment in explaining their outcomes, but Michaelson has a more optimistic perspective on what it all means for the marshmallow test. “The interesting question is whether this simple test in preschool can help us make better predictions about which children will go on to develop problems later in life,” Michaelson says, “and based on our results, the answer to that question is yes.”

Watts is currently working on a formal response to the Michaelson study. He remains confident that the marshmallow test is unreliable after controlling for other crucial childhood factors beyond the ability to delay gratification. Michaelson and Doebel argue that the marshmallow test helpfully captures many important features of a child’s life, all of which together explain how people develop from children into adults.

What the interpretations from both teams of researchers highlight is an important point about Mischel’s original work: The story that people took from it was a dramatically oversimplified one. This is not unusual in science. As more data points come in, understanding and interpretation become more complete — and more nuanced. Mischel’s original work on the marshmallow test wasn’t wrong. It was simply incomplete.

The new studies raise interesting questions: What childhood characteristics is the marshmallow test measuring, and what is it not measuring? How does performance in the test interact with socioeconomic background and other childhood variables? And ultimately, what is the marshmallow test actually saying about human behavior?

But back to the test’s central question: whether life outcomes can be improved by targeting childhood self-control. Watts explains that he came to the marshmallow test after studying how other early life skills, such as math and reading ability, predicted success in life: “In that other research, we were finding that, although math and reading achievement were strong correlates of later school success, interventions that explicitly taught early math and reading skills by and large failed to produce lasting impacts.”

When researchers discover an interesting connection between a fun measure such as the marshmallow test and future outcomes for children, Watts says it becomes “very tempting to suggest that if we can just teach kids to do the relatively simple thing captured by the measure, we should produce long-lasting effects.” Plenty of books are written, and money is made, through premature suggestions that people only need to build up willpower for success. As Doebel puts it, “People are looking for quick fixes.”

But life is more complicated than that, and misjudgments in designing interventions can be costly. “We need to take very seriously where we place our bets,” Watts says, “because if we spend a lot of resources focusing on one type of intervention, that likely comes at the cost of not focusing on other interventions that could also be promising.”

It’s rarely black and white

Science remains the best way to find answers about how the world works, but it frequently involves a conflict between perspectives until there is so much evidence in favor of one explanation that it becomes orthodox. Darwin talked about evolution by natural selection in 1859, but it took almost 100 years for it to be widely accepted as the dominant scientific theory explaining the origin of species, and the details continue to be refined to this day.

Theories that crash and burn get more attention than theories that gradually fall apart, or evolve, over time, and people leap to the most sensational conclusions every step of the way. While it’s an understandable impulse, it is also an impulse that can throw useful theories into the fire, and even contribute to the smearing of academic reputations with personal attacks. Ironically, the same sensationalist impulse also contributes to the publication of hasty and provocative studies that exacerbate non-replicable science in the first place.

High-quality science is time-consuming, and progress is always incremental: Studies like the marshmallow test are just one step forward. But the pace of modern information-sharing, and the urge to grasp at narratives that fit personal ideologies, are making the process appear fickle and unreliable. Oversimplified conclusions by the media tend to emphasize the most extreme interpretations of a debate. Right now the connection between how kids perform in the marshmallow test and how they grow into functioning adults remains obscure. But whatever the level of skepticism about the link, it seems clear that the marshmallow test shouldn’t be lumped into the general replication crisis.

When explaining her motivation in writing her initial response to the Watts research, Sabine Doebel said, “It wasn’t just his paper, it was more the spin that the media put on it as a failed replication.” Watts also describes “being rather surprised by the media coverage.” He says that “despite media characterizations, we never viewed our paper as constituting a failed replication of Shoda and Mischel.”

The value in Watts’ work came from determining whether the marshmallow test would stand up against a more stringent analysis with more people, not from faithfully repeating the original study or analysis. As he points out, “Our study is best viewed as adding nuance and complexity to the original findings.”

So... what does this mean for the marshmallow test?

At the moment, some scientists have a favorable view of the marshmallow test, and others are searching for its successor. The evidence is fairly consistent in showing that kids who wait longer for their marshmallow do end up with at least some better outcomes, for example in their performance at school. But the evidence isn’t so clear on exactly why they end up with those better outcomes.

Is there more value in dumping the test and devising an alternate assessment for childhood dispositions (maybe one supported by experiments that use more stringent controls and more children)? Or is there more value in revising traditional ideas around how the simple and practical marshmallow test is working? This is where an old science cliche comes in particularly handy: more research is needed.

The modern world, with its 24-hour news cycles and instant communication, is not particularly patient. It would most certainly fail the marshmallow test if it could take it. People seeking the truth, rather than the clickbait, will have to live with ambiguity for a while longer as the mystery continues to unfold.