In 2009 NPR’s Radiolab did a program on stereotype threat. This fall they revisited the topic in the context of the replication issue in social psychology. That episode was just released. No complaints, but I wanted to address an impression I might have left from something I said on the program — and to raise a point or two about the replication issue. I was asked if the replication issue changed my impression of the stereotype threat literature. I said “no, I don’t think anything could make me go… this whole thing is not true.” The context of this quote was a discussion about whether a single, or a few, failures to replicate some particular stereotype threat (ST) effect would change my view of this literature — a literature of hundreds and hundreds of studies. In that context, I will own this quote — and explain myself shortly.

What I don’t like about this quote

First, I want to acknowledge the bad impression this quote might give: that I have some kind of non-empirically based faith in the phenomenon that makes me impervious to evidence against the effect. I don’t. I wrote a book conveying how much the evolution of this idea and research took place against robust skepticism, inside and outside of our lab. Also, learning where the effect happens and where it doesn’t, and where it is strong and where it is weak, has helped us develop a strong understanding, or theory, of the phenomenon — a theory that has made it easier to develop interventions to counter its ill effects in, for example, real-life schooling. And, of course, if a smoking gun or guns were identified that explained these hundreds of studies in terms of some concert of artifacts, bias or poor research practices, I would be sad, but all ears. My faith is in science, evidence and reason. So, for the record, I want to disavow any reading of the above quote as suggesting I wouldn’t be open to evidence against ST. I would be, and always have been. In the end, it’s about evidence. And that goes all around. For example, evidence, and reason built on evidence, should be weighted more than generic, untested suspicions and doubts. To be taken seriously, suspicions and doubts should be made precise and tested.

How would I assess the replicability/reliability of the ST phenomenon?

Through operations that have fidelity not necessarily to the exact operations of some earlier experiment (the meaning and impact of operationalizations can change from setting to setting and from time to time in ways that are hard to know about — making “exact” replications a little like trying to step in the same river twice), but to the theoretically prescribed requirements for producing the psychology in question. This is where theory comes in. It spells out those requirements. If those requirements are demonstrably met in an experiment, and the experiment fails to produce the predicted effect, then we have an interpretable challenge to the theory (allowing for the fact that some small proportion of such experiments will fail strictly due to chance, reflecting the probabilistic nature of our research). And this is how theory can help with the “replication crisis”: our ideas can be better specified.
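The chance-failure point in the parenthesis can be made concrete with a back-of-the-envelope calculation. The numbers below are purely hypothetical assumptions for illustration, not figures from the ST literature: if each well-designed study has, say, 80% statistical power, then even for a perfectly real effect a noticeable fraction of faithful replications will come up empty.

```python
# Illustrative sketch (hypothetical numbers): even when an effect is real,
# some properly run studies will miss it purely by chance.
# The 80% power figure is an assumption, not an estimate from the literature.

power = 0.80     # assumed probability a single study detects the true effect
n_studies = 20   # a hypothetical batch of replication attempts

# Expected number of "failed" replications even though the effect exists:
expected_failures = n_studies * (1 - power)

# Probability that at least one study in the batch fails:
p_at_least_one_failure = 1 - power ** n_studies

print(f"Expected failures out of {n_studies}: {expected_failures:.1f}")
print(f"P(at least one failure): {p_at_least_one_failure:.3f}")
```

With these assumed numbers, about 4 of the 20 studies would be expected to fail, and at least one failure is nearly certain (probability ≈ 0.99) — which is why a handful of failures, by itself, is weak evidence against a real effect.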

This is a strength of the stereotype threat literature. It has a specified theory: in performance areas from academics to sports, negative stereotypes are hypothesized to affect performance among the groups they target, but to do so primarily for those in the group who identify enough with the relevant domain to be threatened by being negatively stereotyped in it, and when the performance involved is difficult and frustrating enough to make the threatening stereotype applicable as an account of their experience. And this pressure should vary with the strength of cues in the setting suggesting a likelihood of being judged or treated stereotypically. When these conditions are met, one’s conviction about the ST phenomenon has to be held accountable to the experiment’s results, regardless of whether it is an exact or a conceptual replication. That is, if a fair number of ST experiments that met these conditions did not find ST effects, I would surmise that ST wasn’t a significant pressure in the domains these experiments generalize to. It is the now hundreds of conceptual replications of ST effects when these conditions are met, across such varied conditions and domains of behavior, on which I rest my confidence in its replicability.

What I was trying to say in Radiolab

First, remember its context: would one or several replication failures change my impression of the ST literature. It might change my impression of the experimental protocol being replicated — that its social meaning has changed with time, that ST is no longer a pressure felt by the group in question in the situation in question, or maybe that the finding in the first experiment using that protocol was due to chance, bias or bad research practices. This I could buy. But in the face of hundreds of conceptual replications of the effect across different groups, different stereotypes, behavioral and physiological measures, by hundreds of investigators some of whom are skeptics of the effect, in different places all over the world, I don’t think it’s rational to doubt the existence of the effect on the basis of one or even several replication failures of a particular experiment.

Said another way, based on the evidence from my own and others’ labs and my experience doing this research, I have the prior belief that ST effects are robust and reliable. I have this belief even though I know there are undoubtedly unpublished studies missing from the literature. Given this prior belief, and my read of the large body of evidence on which it’s based, I don’t feel it’s quite right that one, or even several, replication failures should overturn that belief. I am in no way saying that my prior belief would not be updated by replication failures. It would be — to be sure. I am just saying that in light of the large body of ST evidence it would take more than a few such studies to do that.
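This prior-updating logic can be sketched as a toy Bayesian calculation. Every number here is a hypothetical assumption chosen for illustration — a prior representing confidence built on a large literature, and likelihoods reflecting the idea that, under the theory, a single replication can fail for moderator reasons even when the effect is real — none are estimates from the ST literature.

```python
# A minimal Bayesian sketch with purely hypothetical numbers, showing why a
# strong prior built on a large literature is nudged, not overturned, by a
# handful of failed replications.

def update(prior, k_failures, p_fail_if_real=0.5, p_fail_if_artifact=0.9):
    """Posterior P(effect is real) after observing k failed replications.

    p_fail_if_real:     assumed chance a single study fails even though the
                        effect exists (moderators, power, chance).
    p_fail_if_artifact: assumed chance a single study fails if the original
                        effect was an artifact.
    """
    like_real = p_fail_if_real ** k_failures
    like_artifact = p_fail_if_artifact ** k_failures
    numer = like_real * prior
    return numer / (numer + like_artifact * (1 - prior))

prior = 0.99  # assumed strong prior from hundreds of conceptual replications
for k in (1, 3, 5):
    print(f"after {k} failed replications: P(real) = {update(prior, k):.3f}")
```

With these assumed numbers, the posterior drifts from .99 to roughly .98, .94, and .84 after one, three, and five consecutive failures — updated each time, as it should be, but not overturned. The point of the sketch is the shape of the argument, not the particular values.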

That’s, in part, what I meant. If the literature were small, or perhaps dominated by a narrow set of methodologies, then it would seem more plausible that replication failures of central studies would bring the effect into question — that is, raise the possibility that the original effect was due to chance or artifact. But when the conceptual replications number in the hundreds and include dozens of real-world interventions, the idea that a failed replication makes the “whole thing” artifactual isn’t highly plausible to me.

Second, what I also meant is that failures to replicate particular experiments wouldn’t, by themselves, tell me the phenomenon is not true in real life — at least not in this case. Again, I could readily accept such failures as showing that ST isn’t important for the group and situation in question, or that the earlier effects found with that procedure were due to chance or artifact. But as to the truth of the phenomenon… such evidence wouldn’t push me that far. I’ve experienced ST in so many situations, so frequently and throughout so much of my life, that several failed replications of a given experiment wouldn’t likely dissuade me of its existence. And while I firmly share the methodological concerns of this replication-focused era — about small n’s, p-hacking, etc. — these concerns don’t rise to the level of negating my personal experience of this phenomenon, which is how far some of the people who doubt this effect seem to take them. We don’t all walk in the same shoes. That’s why a diversity of scientists is important. In my shoes, I’ve just experienced too much of this phenomenon to doubt its existence. I had this in mind too when I answered the question above. How important is the effect? Where and when does it affect a group’s outcomes? These are important research questions — as we have long described. But the existence of stereotype threat… I’m 99.9% confident of that, and somebody failing to get a replication effect wouldn’t likely change it.