A comment at Thomas Lumley’s blog pointed me to this discussion by Terry Burnham with an interesting story of some flashy psychology research that failed to replicate.

Here’s Burnham:

[In his popular book, psychologist Daniel] Kahneman discussed an intriguing finding that people score higher on a test if the questions are hard to read. The particular test used in the study is the CRT or cognitive reflection task invented by Shane Frederick of Yale. The CRT itself is interesting, but what Professor Kahneman wrote was amazing to me [Burnham]: 90% of the students who saw the CRT in normal font made at least one mistake in the test, but the proportion dropped to 35% when the font was barely legible. You read this correctly: performance was better with the bad font. I [Burnham] thought this was so cool. The idea is simple, powerful, and easy to grasp. An oyster makes a pearl by reacting to the irritation of a grain of sand. Body builders become huge by lifting more weight. Can we kick our brains into a higher gear by making the problem harder?

This is a great start (except for the odd bit about referring to Kahneman as “Professor”).

As in many of these psychology studies, the direct subject of the research is somewhat important, the implications are huge, and the general idea is at first counterintuitive but then completely plausible.

In retrospect, the claimed effect size is ridiculously large, but (a) we don’t usually focus on effect size, and (b) a huge effect size often seems to be taken as a sort of indirect evidence: Sure, the true effect can’t be that large, but how could there be so much smoke if there were no fire at all?

Burnham continues with a quote from notorious social science hype machine Malcolm Gladwell, but I’ll skip that in order to spare the delicate sensibilities of our readership here.

Let’s now rejoin Burnham. Again, it’s a wonderful story:

As I [Burnham] read Professor Kahneman’s description, I looked at the clock and realized I was teaching a class in about an hour, and the class topic for the day was related to this study. I immediately created two versions of the CRT and had my students take the test – half with an easy-to-read presentation and half with a hard-to-read version. Within 3 hours of reading about the idea in Professor Kahneman’s book, I had my own data in the form of the scores from 20 students. Unlike the study described by Professor Kahneman, however, my students did not perform any better statistically with the hard-to-read version. I emailed Shane Frederick at Yale with my story and data, and he responded that he was doing further research on the topic.

This is pretty clean, and it’s a story we’ve heard before (hence the image at the top of this post). Non-preregistered study #1 reports a statistically significant difference in a statistically uncontrolled setting; attempted replication #2 finds no effect. In this case the replication was only N = 20, so even with the time-reversal heuristic, we still might tend to trust the original published claim.

The story continues:

Roughly 3 years later, Andrew Meyer, Shane Frederick, and 8 other authors (including me [Burnham]) have published a paper that argues the hard-to-read presentation does not lead to higher performance. The original paper reached its conclusions based on the test scores of 40 people. In our paper, we analyze a total of over 7,000 people by looking at the original study and 16 additional studies. Our summary:

Easy-to-read average score: 1.43/3 (17 studies, 3,657 people)

Hard-to-read average score: 1.42/3 (17 studies, 3,710 people)

Malcolm Gladwell wrote, “Do you know the easiest way to raise people’s scores on the test? Make it just a little bit harder.” The data suggest that Malcolm Gladwell’s statement is false. Here is the key figure from our paper with my [Burnham’s] annotations in red:

What happened?

After the plane crashes, we go to the black box to see the decision errors that led to the catastrophe.

So what happened with that original study? Here’s the description:

Main Study

Forty-one Princeton University undergraduates at the student campus center volunteered to complete a questionnaire that contained six syllogistic reasoning problems. The experimenter approached participants individually or in small groups but ensured that they completed the questionnaire without the help of other participants. The syllogisms were selected on the basis of accuracy base rates established in prior research (Johnson-Laird & Bara, 1984; Zielinski, Goodwin, & Halford, 2006). Two were easy (answered correctly by 85% of respondents), two were moderately difficult (50% correct response rate), and two were very difficult (20% correct response rate). The easy and very difficult items were omitted from further analyses because the ceiling and floor effects obscured the effects of fluency on processing depth. Shallow heuristic processing enabled participants to answer the easy items correctly, whereas systematic reasoning was insufficient to guarantee accuracy on the difficult questions.

Participants were randomly assigned to read the questionnaire printed in either an easy-to-read (fluent) or a difficult-to-read (disfluent) font, the same fonts that were used in Experiment 1. Finally, participants indicated how happy or sad they felt on a 7-point scale (1 = very sad; 4 = neither happy nor sad; 7 = very happy). This is a standard method for measuring transient mood states (e.g., Forgas, 1995).

Results and Discussion

As expected, participants in the disfluent condition answered a greater proportion of the questions correctly (M = 64%) than did participants in the fluent condition (M = 43%), t(39) = 2.01, p < .05, η² = .09. This fluency manipulation had no impact on participants’ reported mood state (M_fluent = 4.50 vs. M_disfluent = 4.29), t < 1, η² < .01; mood was not correlated with performance, r(39) = .18, p = .25; and including participants’ mood as a covariate did not diminish the impact of fluency on performance, t(39) = 2.15, p < .05, η² = .11. The performance boost associated with disfluent processing is therefore unlikely to be explained by differences in incidental mood states.
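Before ticking off the boxes, it’s worth getting a sense of the precision involved. Using only the numbers quoted above, we can back out the standard error implied by the reported comparison (Python here used purely as a calculator):

```python
# Numbers reported in the study:
diff = 0.64 - 0.43   # disfluent minus fluent, proportion of questions correct
t_stat = 2.01        # reported t(39)

# The t statistic is the estimated difference divided by its standard error,
# so the implied standard error of the treatment comparison is:
se = diff / t_stat
print(f"implied SE = {se:.3f}")  # roughly 0.104, i.e., about 10 percentage points
```

A standard error of roughly 10 percentage points means that only an enormous estimated effect could ever have cleared the significance bar in this study. Keep that number in mind for what follows.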

Let’s tick off the boxes:

– Small sample size and noisy measurements ensure that any statistically significant difference will be huge, thus providing Kahneman- and Gladwell-bait.

– Data processing choices were made after the data were seen. In this case, two-thirds of the data were discarded because they did not fit the story. Sure, the researchers have an explanation based on ceiling and floor effects, but if the discarded items had shown a difference, that result too could easily have been explained in the context of their theory.

– Another variable (the mood scale) was available. If the difference had shown up as statistically significant only after controlling for mood, or if there had been a statistically significant difference on the mood scale alone, either of these results could’ve been reported as a successful demonstration of the theory. (A simulation sketch just below illustrates how such forking paths inflate the chance of finding something “significant.”)
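Here is a minimal sketch of that forking-paths problem. All the specifics (the sample size, the per-item success rates, the three candidate analyses) are assumptions loosely patterned on the study described above, not a reconstruction of it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_null_study(n=41):
    """One simulated study in which the font affects nothing at all."""
    font = rng.integers(0, 2, size=n)            # 0 = fluent, 1 = disfluent
    # Hypothetical per-item success rates: easy, medium, hard pairs
    p_item = np.array([0.85, 0.85, 0.50, 0.50, 0.20, 0.20])
    items = rng.random((n, 6)) < p_item          # correct/incorrect on 6 items
    mood = rng.normal(4.4, 1.0, size=n)          # 7-point mood scale, no effect
    return font, items, mood

def pval(a, b):
    return stats.ttest_ind(a, b).pvalue

n_sims, hits = 10_000, 0
for _ in range(n_sims):
    font, items, mood = one_null_study()
    g0, g1 = font == 0, font == 1
    score_all = items.mean(axis=1)               # analysis 1: all six items
    score_mid = items[:, 2:4].mean(axis=1)       # analysis 2: drop easy/hard items
    ps = (pval(score_all[g1], score_all[g0]),
          pval(score_mid[g1], score_mid[g0]),
          pval(mood[g1], mood[g0]))              # analysis 3: effect on mood instead
    hits += min(ps) < 0.05

print(f"chance of at least one p < .05 under the null: {hits / n_sims:.3f}")
```

Even with no true effect anywhere, having a handful of data-dependent analyses available pushes the chance of finding something “significant” well above the nominal 5%.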

What is my message here? Is it that researchers should be required to preregister their hypotheses? No. I can’t in good conscience make that recommendation given that I almost never preregister my own analyses.

Rather, my message is that this noisy, N = 41, between-person study never had a chance. The researchers presumably thought they were doing solid science, but actually they were trying to use a bathroom scale to weigh a feather, and the feather was resting loosely in the pouch of a kangaroo that was vigorously jumping up and down.

To put it another way, those researchers might well have thought that at best they were doing solid science and at worst they were buying a lottery ticket, in that, even if their study was speculative and noisy, it was still giving them a shot at a discovery.

But, no, they weren’t even buying a lottery ticket. When you do this sort of noisy, uncontrolled study and you “win” (that is, find a statistically significant comparison), you actually are very likely to be losing: the type M error is high (the magnitude of the effect is greatly exaggerated) and the type S error rate is high (the estimate has a good chance of pointing in the wrong direction).
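A design analysis in the style of Gelman and Carlin (2014) makes “very likely to be losing” concrete. In this sketch, the true effect of 2 percentage points is a pure assumption for illustration; the standard error of about 0.104 is the one implied by the study’s own reported t statistic, as computed above:

```python
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Power, type S (wrong sign) and type M (exaggeration) rates for a study
    with the given true effect and standard error, by simulation."""
    rng = np.random.default_rng(seed)
    crit = stats.norm.ppf(1 - alpha / 2) * se    # significance threshold
    est = rng.normal(true_effect, se, n_sims)    # hypothetical replications
    signif = np.abs(est) > crit                  # those reaching p < alpha
    power = signif.mean()
    type_s = np.mean(np.sign(est[signif]) != np.sign(true_effect))
    type_m = np.mean(np.abs(est[signif])) / abs(true_effect)
    return power, type_s, type_m

# Assumed true effect: 2 percentage points; implied SE from the study: ~0.104
power, type_s, type_m = retrodesign(true_effect=0.02, se=0.104)
print(f"power = {power:.2f}, type S rate = {type_s:.2f}, exaggeration = {type_m:.1f}x")
```

Under these assumed numbers, the power is barely above the significance level, a statistically significant estimate exaggerates the true effect by roughly an order of magnitude on average, and it has a substantial chance of carrying the wrong sign. Winning is losing.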

That’s what’s so sad about all this. Not that the original researchers failed—all of us fail all the time—but that they never really had a chance.

On the plus side, our understanding of statistics has increased so much in the past several years (no joke) that we now recognize this problem, whereas until recently even a leading psychologist such as Kahneman and a leading journalist such as Gladwell were unaware of it.

P.S. It seems that I got some of the details wrong here.

Andrew Meyer supplies the correction: