Stereotype threat can have detrimental effects on the math performance of schoolgirls and female college students (e.g., Ambady, Shih, Kim & Pittinsky, 2001; Spencer, Steele & Quinn, 1999). Although the stereotype threat literature is impressive in its size, theoretical depth, and number of experiments, the robustness of the stereotype threat effect and the methodology used have been called into question (e.g., Flore & Wicherts, 2015; Ganley, Mingle, Ryan, Ryan, Vasilyeva & Perry, 2013; Stoet & Geary, 2012; Zigerell, 2017). In three studies, we address these methodological issues by asking the following three questions. Does stereotype threat influence the math performance of Dutch high school girls? How can psychometric analyses benefit stereotype threat research? And do stereotype threat findings replicate across cultures and sites?

In a large-scale experiment we studied the influence of stereotype threat on the math performance of 13- and 14-year-old Dutch high school students from the two highest education levels. In total, we visited 86 classrooms (N = 2,064). Using multilevel modeling, we studied stereotype threat performance decrements and hypothesized moderators (domain identification, test anxiety, and gender identification) in both a frequentist and a Bayesian framework. We failed to find a significant stereotype threat effect: girls in the stereotype threat condition underperformed only negligibly compared to girls in the control condition (Cohen’s d = -.05). Moreover, we found strong evidence for the “no stereotype threat hypothesis” over the “stereotype threat hypothesis” (Bayes Factor = 28.18). Because this study was set up in a registered report publishing format, it was fully pre-registered and published regardless of the outcomes.
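
To make the two reported statistics concrete, the sketch below shows one common way to compute Cohen's d (using a pooled standard deviation) and how a Bayes factor translates into a posterior probability. It is an illustration only: the function names and the toy scores are hypothetical, the pooled-SD formula and the equal prior odds are assumptions, and the study itself estimated effects with multilevel models rather than with this simple two-group formula.

```python
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference with a pooled SD (assumed formula)."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5


def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of the favored hypothesis.

    prior_odds=1.0 assumes both hypotheses are equally likely a priori.
    """
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Made-up scores, NOT data from the study.
threat_scores = [11, 12, 14, 15, 13]
control_scores = [12, 13, 14, 15, 13]

print(cohens_d(threat_scores, control_scores))
print(posterior_probability(28.18))
```

Under equal prior odds, a Bayes factor of 28.18 corresponds to a posterior probability of roughly .97 for the favored hypothesis; with different prior odds the value changes accordingly.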

In a study on the psychometric properties of the math tests used in stereotype threat research, we re-analyzed several existing datasets. We inspected several measures from Classical Test Theory (e.g., item difficulties, item-rest correlations, reliability estimates) and the number of missing responses. Reliability estimates of the math tests varied from extremely low to good. Some tests were extremely difficult or extremely easy, reflecting unrealistic testing situations. Tests were often speeded (i.e., there was a large number of missing responses), and patterns in the data suggested differential speededness across the experimental conditions. Additionally, we studied the effects of stereotype threat in our Dutch dataset using several item response theory (IRT) models.
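
The Classical Test Theory measures named above can be sketched in a few lines. The example below is a minimal illustration, not the analysis code used in the study: the function names and the toy response matrix are hypothetical, reliability is assumed to mean Cronbach's alpha, and missing responses are assumed to be scored as incorrect when computing item-rest correlations and alpha.

```python
from statistics import mean, variance


def item_difficulty(responses, item):
    """Proportion correct among the responses that were actually given."""
    answered = [row[item] for row in responses if row[item] is not None]
    return mean(answered)


def missing_rate(responses, item):
    """Proportion of test takers with no response (None) on this item."""
    return sum(row[item] is None for row in responses) / len(responses)


def score_missing_as_incorrect(responses):
    """Assumption: a missing response counts as 0 (incorrect)."""
    return [[0 if x is None else x for x in row] for row in responses]


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_rest_correlation(scored, item):
    """Correlation of one item with the sum of all other items."""
    item_scores = [row[item] for row in scored]
    rest_scores = [sum(row) - row[item] for row in scored]
    return pearson(item_scores, rest_scores)


def cronbach_alpha(scored):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(scored[0])
    item_vars = [variance([row[j] for row in scored]) for j in range(k)]
    total_var = variance([sum(row) for row in scored])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data, NOT from any study: rows are test takers, columns are items,
# 1 = correct, 0 = incorrect, None = missing (e.g., item not reached).
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, None],
    [1, 0, 0, None],
    [0, 0, 0, 0],
]
scored = score_missing_as_incorrect(responses)

print(item_difficulty(responses, 0))
print(missing_rate(responses, 3))
print(item_rest_correlation(scored, 0))
print(cronbach_alpha(scored))
```

In this toy matrix the missing responses cluster on the last item, which is the kind of pattern that points to a speeded test. Comparing `missing_rate` per item between experimental conditions would be one simple way to look for differential speededness.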

Finally, our team is working on a registered replication report. In this project we attempt to replicate the findings of a well-known study on stereotype threat and math performance in college students (Johns, Schmader & Martens, 2005). We are carrying out this large-scale replication among students at Tilburg University. Methods, materials, and instructions are shared online, and labs abroad are welcome to join our replication effort. The goals and progress of this replication effort will be discussed.