Literature search and study selection

We performed a systematic literature search following the PRISMA guidelines (Preferred Reporting Items for Systematic Reviews and Meta-Analyses46). The PRISMA flow diagram depicting our search and screening process is shown in Fig. 4. We broadly followed the search protocol used by Voyer and Voyer6. We searched three databases for articles published between August 2011 and May 2015: ERIC, SCOPUS and ISI Web of Science. We did not use the PsycINFO or PsycARTICLES databases used by Voyer and Voyer6, as they were malfunctioning at the time of our search. We searched for articles containing the term ‘school grade/s’, ‘school achievement/s’, ‘school mark/s’ or ‘grade point average/s’. The exact search strings used for each database and additional details of the literature search are provided in the Supplementary Methods. While there was no clear signal of publication bias in the school subset (Supplementary Tables 12, 25), a limitation of our literature search is that we did not actively search for unpublished studies or theses.

Fig. 4 PRISMA diagram showing the process of locating studies included in this meta-analysis. The full list of included studies, and the list of studies excluded at the stage of full-text screening, are available in Supplementary Data 1 and 2.

Eligibility criteria

To be included for data extraction at the full-text screening phase, studies needed to present teacher-assigned grades or global GPA (grade point average, i.e. grades averaged across many subjects) for a cohort containing both male and female students. Students could be in grade one or above. These criteria excluded kindergarten and single-sex studies, as well as self-reported grades and test data. Because of socio-cultural effects on gender differences, we required samples of students that took classes together; we therefore excluded online courses. We also excluded retrospective studies comparing adults that were not in the same study cohort. Where longitudinal data were reported, we included only the first year of data that met the inclusion criteria. In the case of studies that reported high school GPA for an undergraduate sample, we included only the university grades, if reported, and deemed the high school grades ineligible. This is because the high school grades of groups of undergraduates do not come from the same cohort—they represent a subsample of students from disparate high schools, and only those students who performed well enough to attend university. When we identified studies that reported data from the same large database, we included only the study with the largest sample size, and excluded the rest to avoid pseudo-replication. The list of excluded studies, with reasons for exclusion, is presented in Supplementary Data 2.

Data extraction and coding

From the original papers, we extracted the sample sizes, means, and standard deviations for male and female academic grades. For the studies used by Voyer and Voyer6, we attempted to contact authors if any of these data were missing. All contacted authors were also asked to provide any additional data (published or unpublished) they might have available. If we received no response after 1 month, we sent a follow-up email. Only unstandardised grade data were collected. When the presented data were standardised, we contacted the authors to request the corresponding unstandardised values. For studies published after August 2011, we only contacted authors if variance data were missing. In total, we acquired data from authors for 15 studies, including two unpublished studies.

Moderator variables

In addition to the descriptive statistics for grades of males and females, we extracted a number of moderator variables, all of which are presented in Supplementary Table 1. We generally followed the variables used by Voyer and Voyer6 (e.g. racial composition), as well as recording additional information (e.g. age of students). An analysis of the moderating effect of racial composition on the gender gap in school grades is presented in Supplementary Note 1 and Supplementary Tables 1, 3. Continuous moderators were scaled and centred (resulting in a mean of 0 and a standard deviation of 1) prior to the analyses. We used multiple imputation to fill in missing values of study year and students’ mean age (details in Supplementary Methods).
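The scaling-and-centring step can be sketched as follows (a minimal Python illustration; the function name and example values are hypothetical, and the original analysis was run in R):

```python
import numpy as np

def scale_and_centre(x):
    """Standardise a continuous moderator to mean 0 and SD 1,
    ignoring missing values (analogous to R's scale())."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=1)

# Hypothetical study years, with one missing value awaiting imputation
study_year = np.array([1998.0, 2005.0, 2011.0, np.nan, 2014.0])
z = scale_and_centre(study_year)  # non-missing values now have mean 0, SD 1
```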

Effect sizes

Using standardised effect sizes allowed us to combine original data collected on different scales (grades were recorded on different scales among included studies). To test for differences in mean grades between genders, we used the natural logarithm response ratio (hereafter referred to as lnRR), and its corresponding sampling error variance \({{s}}_{{\mathrm{lnRR}}}^2\)47.

$${\mathrm{lnRR}} = {\mathrm{ln}}\left( {\frac{{\bar x_{\mathrm{f}}}}{{\bar x_{\mathrm{m}}}}} \right),$$ (1)

$${{s}}_{{\mathrm{lnRR}}}^2 = \frac{{s_{\mathrm{m}}^2}}{{n_{\mathrm{m}}\bar x_{\mathrm{m}}^2}} + \frac{{s_{\mathrm{f}}^2}}{{n_{\mathrm{f}}\bar x_{\mathrm{f}}^2}},$$ (2)

where:

\(\bar x_{\mathrm{f}}\) and \(\bar x_{\mathrm{m}}\) = the mean grade of female and male students, respectively,

\(s_{\mathrm{m}}^2\) and \(s_{\mathrm{f}}^2\) = the variance in grades of male and female students, respectively,

\(n_{\mathrm{m}}\) and \(n_{\mathrm{f}}\) = the number of male and female students in each sample, respectively.

Positive values of lnRR imply greater mean grades for girls.
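Equations (1) and (2) can be computed directly from the extracted summary statistics. The following Python sketch (illustrative only; the cohort values are hypothetical) implements both:

```python
import math

def ln_rr(mean_f, mean_m):
    """Eq. (1): log response ratio; positive values mean
    girls' average grades are higher."""
    return math.log(mean_f / mean_m)

def var_ln_rr(mean_f, sd_f, n_f, mean_m, sd_m, n_m):
    """Eq. (2): sampling error variance of lnRR."""
    return sd_m**2 / (n_m * mean_m**2) + sd_f**2 / (n_f * mean_f**2)

# Hypothetical cohort on a 4-point GPA scale
es = ln_rr(3.2, 3.0)                          # > 0: girls' mean is higher
v = var_ln_rr(3.2, 0.6, 250, 3.0, 0.7, 240)   # its sampling error variance
```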

We extended the literature search of Voyer and Voyer6 by 5 years, and our analysis of mean grades differed from theirs in two ways: (1) we included only studies for which variances could be compared; and (2) we used lnRR instead of the standardised mean difference in performance (SMD, or Hedges’ g24; see Supplementary Equations 1–4). We chose lnRR because, unlike SMD, it is unaffected by differences in variance (standard deviation) between groups. However, for comparison with the results of Voyer and Voyer6, we repeated the lnRR analyses using SMD as the effect size. The results of the lnRR and SMD analyses, which are very similar to each other, are presented in Supplementary Figure 4 and Supplementary Tables 2–4, 6, 8, 12, 13, 16, 19, 22, 25.

To assess differences in variance of grades of boys and girls, we used the natural logarithm coefficient of variation ratio (lnCVR) and its associated sampling error variance \({{s}}_{{{\mathrm{lnCVR}}}}^2\)35.

$${{\mathrm{lnCVR}}} = {\mathrm{ln}}\left( {\frac{{{\mathrm{CV}}_{\mathrm{f}}}}{{{\mathrm{CV}}_{\mathrm{m}}}}} \right) + \frac{1}{{2\left( {n_{\mathrm{f}} - 1} \right)}} - \frac{1}{{2\left( {n_{\mathrm{m}} - 1} \right)}},$$ (3)

$$\begin{array}{l}{{s}}_{{{{\mathrm{ln}}{\mathrm{CVR}}}}}^2 = \frac{{s_{\mathrm{m}}^2}}{{n_{\mathrm{m}}\bar x_{\mathrm{m}}^2}} + \frac{1}{{2\left( {n_{\mathrm{m}} - 1} \right)}} - 2\rho _{\ln \bar x_{\mathrm{m}},\ln s_{\mathrm{m}}}\\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\times\sqrt {\frac{{s_{\mathrm{m}}^2}}{{n_{\mathrm{m}}\bar x_{\mathrm{m}}^2}}\frac{1}{{2\left( {n_{\mathrm{m}} - 1} \right)}}} + \frac{{s_{\mathrm{f}}^2}}{{n_{\mathrm{f}}\bar x_{\mathrm{f}}^2}} + \frac{1}{{2\left( {n_{\mathrm{f}} - 1} \right)}}\\ \;\;\;\;\;\;\;\;\;\;- 2\rho _{\ln \bar x_{\mathrm{f}},\ln s_{\mathrm{f}}}\sqrt {\frac{{s_{\mathrm{f}}^2}}{{n_{\mathrm{f}}\bar x_{\mathrm{f}}^2}}\frac{1}{{2\left( {n_{\mathrm{f}} - 1} \right)}}} \end{array},$$ (4)

where:

\({\mathrm{CV}}_{\mathrm{f}}\) and \({\mathrm{CV}}_{\mathrm{m}}\) = the coefficient of variation \(\left( {\frac{s}{{\bar x}}} \right)\) for female and male students, respectively.

\(\rho _{\ln \bar x_{\mathrm{m}},\ln s_{\mathrm{m}}}\) and \(\rho _{\ln \bar x_{\mathrm{f}},\ln s_{\mathrm{f}}}\) = the correlations between the logged means and standard deviations of the male and female students, respectively.

All other notation is described above. Positive values of lnCVR imply greater variance in girls’ grades relative to boys’ grades. By dividing the female and male standard deviations by their respective means, we controlled for the effect of a proportional relationship (the mean–variance relationship) between the standard deviation and the mean. To test how the variance in grades has changed over time, we also computed the natural logarithm of the coefficient of variation (lnCV) for boys and girls separately, and its associated sampling error variance35:

$${{{\mathrm{lnCV}}}} = {\mathrm{ln}}\left( {\frac{s}{{\bar x}}} \right) + \frac{1}{{2\left( {n - 1} \right)}},$$ (5)

$${{s}}_{{{{\mathrm{lnCV}}}}}^2 = \frac{{s^2}}{{n\bar x^2}} + \frac{1}{{2\left( {n - 1} \right)}} - 2\rho _{\ln \bar x,\ln s}\sqrt {\frac{{s^2}}{{n\bar x^2}}\frac{1}{{2\left( {n - 1} \right)}}}.$$ (6)

All notation is as described above. For the same mean, a more negative value of lnCV implies a smaller variance.
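Because lnCVR is the difference between the female and male bias-corrected lnCV values, both effect sizes can be sketched together. A minimal Python illustration (the cohort values are hypothetical, and the ρ correlations must be estimated from the data across studies):

```python
import math

def ln_cv(mean, sd, n):
    """Eq. (5): log coefficient of variation with the
    small-sample bias-correction term."""
    return math.log(sd / mean) + 1.0 / (2 * (n - 1))

def ln_cvr(mean_f, sd_f, n_f, mean_m, sd_m, n_m):
    """Eq. (3): lnCVR as female minus male bias-corrected lnCV."""
    return ln_cv(mean_f, sd_f, n_f) - ln_cv(mean_m, sd_m, n_m)

def var_ln_cvr(mean_f, sd_f, n_f, rho_f, mean_m, sd_m, n_m, rho_m):
    """Eq. (4): sampling error variance of lnCVR; rho_f and rho_m are
    the correlations between logged means and logged SDs."""
    def half(mean, sd, n, rho):
        a = sd**2 / (n * mean**2)   # term from the mean
        b = 1.0 / (2 * (n - 1))     # term from the SD
        return a + b - 2 * rho * math.sqrt(a * b)
    return half(mean_m, sd_m, n_m, rho_m) + half(mean_f, sd_f, n_f, rho_f)

# Hypothetical cohort: girls' grades less variable relative to their mean
es = ln_cvr(3.2, 0.6, 250, 3.0, 0.7, 240)   # < 0
```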

Statistical analyses

We performed our main analyses on lnCVR and lnRR, and their associated error terms, using the rma.mv function in the R (v.3.4.2) package metafor v.2.0-048. One-third of effect sizes were not independent, because they came from the same study and/or the same cohort of students. We therefore included cohort ID and comparison ID as random effects in each model (the levels of study ID overlapped too much with cohort ID to model both simultaneously; e.g. the school data contained 120 studies and 141 cohorts). We also modelled covariance between effect sizes, assuming a correlation of 0.5 between effect sizes from the same cohort (i.e. grades in different subjects; recommended in ref. 49), because sampling error variances of effect sizes based on the same cohort are likely to be correlated. We supplied this covariance matrix as the sampling error variance matrix (the V argument of the rma.mv function). In addition, to account for the two main types of non-independence in our data (hierarchical/nested and correlation/covariance structures), we used the robust function within the metafor package to generate fixed-effect estimates and confidence intervals, based on robust variance estimation, from each rma.mv model. To test for the overall effect of gender on the mean and variance of school grades, we constructed meta-analytical models with no fixed effects (i.e. intercept-only models). We tested whether the results differed significantly between school and university by including a ‘school or university’ categorical moderator in a meta-regression model on the whole dataset. We then ran separate meta-analytical models on the school and university data subsets to quantify their respective heterogeneities (Supplementary Methods). To test whether the gender gap in school grades varied between subjects, we included subject type (STEM, non-STEM, Global, Other/NR) as a fixed effect in meta-regression analyses.
To test whether the gender difference in school grades has changed over historical time, or with student age, we included either study year or average student age as a fixed effect. To test whether the variance of either males or females has changed over historical time, or with student age, we used lnCV as the response variable, and the fixed effects of sex and study year, or sex and age, and their interactions. Point estimates from all statistical models were considered statistically significant when their CI did not span zero.
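The assumed sampling covariance structure can be illustrated with a short sketch (Python/NumPy here for illustration only; the original models were fitted with rma.mv in R, and the function name make_V is hypothetical):

```python
import numpy as np

def make_V(variances, cohort_ids, r=0.5):
    """Build a block-structured sampling (co)variance matrix: effect
    sizes from the same cohort get covariance r * sqrt(v_i * v_j);
    all other off-diagonal entries are zero. This mirrors the matrix
    passed to the V argument of metafor's rma.mv."""
    v = np.asarray(variances, dtype=float)
    ids = np.asarray(cohort_ids)
    same_cohort = ids[:, None] == ids[None, :]
    V = np.where(same_cohort, r, 0.0) * np.sqrt(np.outer(v, v))
    np.fill_diagonal(V, v)  # diagonal holds the sampling error variances
    return V

# Three effect sizes; the first two come from the same cohort
V = make_V([0.01, 0.02, 0.03], ["c1", "c1", "c2"])
```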

Robustness of results

Our results could be biased by the over-reporting of positive findings in published studies, so we tested our data for publication bias using multilevel-model versions of funnel plots and Egger’s regression50,51. We also performed alternative analyses of key components of our study to test whether our conclusions are robust. Overlaps of grade distributions were inferred using simulation methods. Details and results of these analyses are presented in Supplementary Methods and Supplementary Tables 15–18, 20–23.
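As a conceptual outline of the Egger-style check (the published analysis used multilevel-model versions; this is a simplified single-level Python sketch with made-up, perfectly symmetric data):

```python
import numpy as np

def egger_regression(effect_sizes, variances):
    """Single-level Egger-style regression: regress effect size / SE
    on precision (1 / SE). An intercept far from zero suggests
    funnel-plot asymmetry, i.e. possible publication bias."""
    es = np.asarray(effect_sizes, dtype=float)
    se = np.sqrt(np.asarray(variances, dtype=float))
    X = np.column_stack([np.ones_like(se), 1.0 / se])
    (intercept, slope), *_ = np.linalg.lstsq(X, es / se, rcond=None)
    return intercept, slope

# Symmetric toy data: identical effects at varying precision,
# so the intercept should be (numerically) zero
b0, b1 = egger_regression([0.1, 0.1, 0.1, 0.1], [0.01, 0.04, 0.09, 0.16])
```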