Studies using data from the Early Childhood Longitudinal Study–Kindergarten Class of 1998–1999 (ECLS-K:1999) revealed gender gaps in mathematics achievement and teacher perceptions. However, recent evidence suggests that gender gaps have closed on state tests, raising the question of whether such gaps are absent in the ECLS-K:2011 cohort. Extending earlier analyses, this study compares the two ECLS-K cohorts, exploring gaps throughout the achievement distribution and examining whether learning behaviors explain gaps more at the bottom than at the top of the distribution. Overall, this study reveals remarkable consistency across both ECLS-K cohorts, with the gender gap developing early among high achievers and spreading quickly throughout the distribution. Teachers consistently rate girls’ mathematical proficiency lower than that of boys with similar achievement and learning behaviors. Gender differences in learning approaches appear to be fairly consistent across the achievement distribution, but girls’ more studious approaches appear to have more payoff at the bottom of the distribution than at the top. Questions remain regarding why boys outperform girls at the top of the distribution, and several hypotheses are discussed. In sum, the persistent ECLS-K patterns make clear that girls’ early mathematics learning experiences merit further attention.

Despite advances in gender equity in past decades, troubling patterns specific to math have persisted. Evidence from the nationally representative Early Childhood Longitudinal Study–Kindergarten Class of 1998–1999 (hereafter, ECLS-K:1999) indicated that U.S. boys and girls began kindergarten with similar math proficiency, but disparities in achievement and confidence developed by Grade 3 (Fryer & Levitt, 2010; Ganley & Lubienski, 2016; Husain & Millimet, 2009; Penner & Paret, 2008; Robinson & Lubienski, 2011). In contrast, the gender gap in reading was present in the fall of kindergarten (favoring girls) but narrowed somewhat during elementary school.

Unlike gaps based on race and socioeconomic status (SES), which stem, in part, from differences in schools attended (Fryer & Levitt, 2004), it is unlikely that gender gaps in elementary school are due to boys and girls attending different schools or to demographic differences between boys and girls. Hence, it is surprising that math gender gaps, as measured on ECLS-K:1999, grew at least as much as race- and SES-related gaps did in elementary grades (Fryer & Levitt, 2010; Reardon & Robinson, 2008).1 These findings suggest there are patterns unique to gender and mathematics that warrant our attention.

Interestingly, though, research suggests that the gender gap is not constant throughout the achievement distribution. For example, National Assessment of Educational Progress (NAEP) data suggest that gender gaps among students in Grades 4 and 8 favor males at the top of the distribution but are virtually nonexistent below the median (Lubienski, McGraw, & Strutchens, 2004). State tests suggest that males display greater achievement variability in general, outscoring girls at the top of the distribution but also underperforming at the bottom (Hyde, Lindberg, Linn, Ellis, & Williams, 2008). The ECLS-K:1999 provided a unique opportunity to examine how the gaps develop longitudinally and suggested that the math achievement gap developed first at the top of the distribution (in kindergarten) and then progressed further down the distribution through Grade 3 (Husain & Millimet, 2009; Robinson & Lubienski, 2011). Gender gaps at the top of the distribution were substantial; for example, Robinson and Lubienski (2011) found that, in the fall of kindergarten, girls made up only 20% of students above the 99th percentile in math. Together, the research on gender gaps highlights the importance of looking beyond simple mean differences to understand patterns related to achievement differences across the distribution.

Data

This study uses data from the ECLS-K:1999 (N = 21,399) and ECLS-K:2011 (N = 18,170). The ECLS-K:1999 has completed all waves of data collection, including kindergarten and first, third, fifth, and eighth grades. The ECLS-K:2011 has completed data collection for kindergarten and first and second grades, with third, fourth, and fifth grades forthcoming. Relevant to this study, the data sets include information on student achievement, teacher ratings of academic proficiency and learning behaviors, and student demographic information.

Direct Cognitive Assessment Scores

Children completed mathematics and reading direct cognitive assessments at each wave of data collection, included in the data set as theta scores. Assessments were developed by the Educational Testing Service and were based on input from early education and curriculum experts as well as widely accepted standards and frameworks for assessment. Assessments were adaptive, with each child receiving questions best suited to their ability based on their answers to previous items (Najarian, Pollack, Sorongon, & Hausken, 2009; National Center for Education Statistics [NCES], n.d.).

Teacher Ratings

Academic Rating Scale. Teachers used subject-specific Academic Rating Scales (ARS) to rate their students’ proficiency (on a 5-point scale from not yet = 1 to proficient = 5) in a variety of constructs, including specific mathematical topics and problem-solving skills (Najarian et al., 2009; Tourangeau et al., 2015).
For example, some items on the kindergarten ARS asked teachers to evaluate how well the child “orders a group of objects,” “solves problems involving numbers using concrete objects,” “shows an understanding of the relationship between quantities,” and “models, reads, writes, and compares fractions.”4 In the first-grade survey, some items rotate out, replaced by items regarding more difficult skills, such as “surveys, collects, and organizes data into simple graphs” and “makes reasonable estimates of quantities.”5 ARS scale scores were calculated using a one-parameter IRT (Rasch) model and included in the ECLS-K:1999 data set (Pollack et al., 2005). Only item-level data were included for the ARS in ECLS-K:2011; therefore, we calculated the scale scores using a generalized partial-credit IRT model, and our analyses are based on these scale scores.

Learning Behaviors: Externalizing Problem Behaviors and Approaches to Learning

The ECLS-K Externalizing Problem Behaviors scale is a combined score based on teacher responses to items about a student’s tendencies to have difficulty getting along with others, paying attention, or avoiding distractions. The ECLS-K Approaches to Learning scale score is based on a teacher’s ratings of student behaviors related to self-direction, organization, persistence, and eagerness to learn (see teacher questionnaires for both data sets; NCES, n.d.). NCES provides the composite scores for both of these scales in both data sets. We refer to externalizing problem behaviors and approaches to learning collectively as learning behaviors.

Student Demographics

Student gender, race, and age at assessment were collected from parent interviews and school documentation. Parents also provided their education levels, occupations, and incomes, which were used to create a composite SES variable (Najarian et al., 2009; Tourangeau et al., 2015).
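The generalized partial-credit model used to rescale the ECLS-K:2011 ARS items assigns each rating category a probability based on a student’s latent proficiency. A minimal sketch of those category probabilities, with entirely hypothetical item parameters (the actual scale scores come from a full calibration across all ARS items):

```python
import math

def gpcm_probs(theta, a, b):
    """Category probabilities for one item under the generalized
    partial-credit model (Muraki, 1992).

    theta -- latent proficiency
    a     -- item discrimination
    b     -- step (threshold) parameters b_1..b_m
    Returns [P(X = 0), ..., P(X = m)].
    """
    # Cumulative sums of a * (theta - b_v); the empty sum for
    # category 0 is defined as 0.
    logits = [0.0]
    for b_v in b:
        logits.append(logits[-1] + a * (theta - b_v))
    denom = sum(math.exp(z) for z in logits)
    return [math.exp(z) / denom for z in logits]

# Hypothetical 5-category ARS-style item (categories 0-4 mapping to
# ratings 1 = "not yet" through 5 = "proficient").
probs = gpcm_probs(theta=0.5, a=1.2, b=[-1.0, -0.3, 0.4, 1.1])
print([round(p, 3) for p in probs])  # five probabilities summing to 1
```

A scale score is then the proficiency estimate (e.g., maximum likelihood or EAP) that best accounts for a teacher’s observed ratings across all items.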
Analytic Data Sets

To ensure that we compare the same students across the various tests (including direct cognitive assessments and ARS scores) as they progressed through school, we retained only students with nonzero longitudinal sampling weights, valid test scores, and academic rating scores at each wave of analysis; this reduced the 1999 sample to 5,615 observations and the 2011 sample to 8,522 observations. These samples were further restricted to first-time kindergarteners at the beginning of the studies with complete demographic information (i.e., age, race, SES, gender) and valid teacher ratings on the Approaches to Learning and Externalizing Problem Behaviors scales. The final analytic samples for the 1999 and 2011 cohorts are 5,056 and 7,507, respectively. The final sample includes students in kindergarten and Grades 1 and 3 for ECLS-K:1999 and kindergarten and Grades 1 and 2 for ECLS-K:2011. Descriptive statistics for both samples are provided in Table 1.

Table 1. Means and Standard Deviations, by Cohort, Wave, and Gender

Method

Distributional Gender Gaps

Because prior research suggests that the size of math gender gaps differs for low- and high-performing boys and girls, we estimate gaps throughout the achievement spectrum.6 Here, rather than assuming the ECLS-K assessments are interval scaled, we use a metric-free distributional measure, λ_θ, developed by Robinson and Lubienski (2011). The method estimates the proportion of females scoring above/below a given percentile. In addition to replicating this work with the 2011 cohort, we extend it to examine adjusted gaps throughout the achievement distribution. As explained in the online appendix of Robinson and Lubienski (2011), one can use a series of logistic regressions to estimate the conditional proportions of males and females and, thus, estimate a conditional version of their measure. The cumulative density (Φ) of females (or males) observed at or below a given percentile of achievement (θ), conditional on a vector of characteristics (X; e.g., age, race, SES, prior achievement, learning behaviors), can be expressed as a logistic regression predicting the likelihood that a student scored at or below the θth percentile of achievement, as a function of an indicator for male (its coefficient being β_θ1) and X.
To ensure that differences in X across males and females are conditioned out of the final estimates of Φ_m(θ)|X̄_θ and Φ_f(θ)|X̄_θ, X is held constant at the mean values for the given θth percentile of achievement (represented by X̄_θ):

For males: Φ_m(θ)|X̄_θ = {1 + exp[−(β_θ0 + β_θ1 + X̄_θ Β_θ)]}⁻¹

For females: Φ_f(θ)|X̄_θ = {1 + exp[−(β_θ0 + X̄_θ Β_θ)]}⁻¹

Thus, using logistic regression as the basis for λ_θ, we can estimate the proportion of females (or males) at or below (or above) each percentile:

λ_θ = Φ_m(θ)|X̄_θ / [Φ_m(θ)|X̄_θ + Φ_f(θ)|X̄_θ], for θ < 50

λ_θ = [1 − Φ_f(θ)|X̄_θ] / {2 − [Φ_m(θ)|X̄_θ + Φ_f(θ)|X̄_θ]}, for θ ≥ 50

Here, we interpret the value of λ_50 to be the proportion of students at or above the median value of achievement (or, in some instances, teacher ratings) who are female, after conditioning on demographic, behavioral, and prior achievement differences between males and females in some model specifications. A value of λ_50 = .5 indicates that half of the students above the median are female and half are male. A value of λ_50 = 1 indicates that only females score above the median, and a value of λ_50 = 0 indicates that only males score above the median; hence, the metric is bounded by [0, 1], facilitating easy interpretation. For values of θ below the median, the value of λ_θ represents the proportion of students who are male; as Robinson and Lubienski (2011) explained, this is necessary so that, throughout the distribution, values of λ_θ below .5 consistently indicate an advantage for males and values above .5 consistently indicate an advantage for females.
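Under the simplifying assumption of no covariates (the unconditional case, where the fitted logistic probabilities reduce to observed within-gender proportions), the λ_θ computation can be sketched as follows; the data here are simulated, not ECLS-K scores:

```python
import random

def lambda_theta(scores_m, scores_f, theta):
    """Metric-free distributional gap (Robinson & Lubienski, 2011),
    unconditional version: for theta >= 50, the proportion of students
    at or above the theta-th percentile who are female; for theta < 50,
    the proportion at or below it who are male."""
    pooled = sorted(scores_m + scores_f)
    cut = pooled[int(len(pooled) * theta / 100) - 1]  # theta-th percentile
    # With no covariates, the logistic model's fitted probabilities are
    # simply the observed within-gender proportions at or below the cut.
    phi_m = sum(s <= cut for s in scores_m) / len(scores_m)
    phi_f = sum(s <= cut for s in scores_f) / len(scores_f)
    if theta < 50:
        return phi_m / (phi_m + phi_f)
    return (1 - phi_f) / (2 - (phi_m + phi_f))

random.seed(1)
# Simulated theta scores: equal means, males slightly more variable.
boys = [random.gauss(0.0, 1.1) for _ in range(5000)]
girls = [random.gauss(0.0, 1.0) for _ in range(5000)]
# With greater male variability, lambda_90 falls below .5 (fewer girls
# among the highest scorers).
print(round(lambda_theta(boys, girls, 90), 3))
```

The conditional versions replace these raw proportions with predictions from logistic regressions that include the covariates, evaluated at X̄_θ.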
For example, a value of λ_10 = .3 indicates that only 30% of students below the 10th percentile are males, whereas a value of λ_90 = .3 indicates that only 30% of students above the 90th percentile are females. We estimate three models for the metric-free distributional gaps; the first two models are similar regardless of outcome. Model 1 contains no covariates other than gender and thus is identical to the models estimated by Robinson and Lubienski (2011). Model 2 extends the base model by adding covariates for age, race, SES, and all prior and current ratings of learning behaviors. When the direct cognitive assessment is the outcome, Model 3 adds covariates for all prior direct cognitive assessment scores in the content area. When ARS scores (i.e., teacher ratings of student proficiency) are the outcome, Model 3 adds covariates for all prior and current direct cognitive assessment scores as well as all prior ARS scores in the content area. Hence, when the direct cognitive assessment is the outcome, Model 1 presents raw gaps, Model 2 presents conditional gaps, and Model 3 presents conditional gaps that can loosely be interpreted as conditional gaps in growth.7 For instance, if λ_90 = .4 in Model 3, we would conclude that, among students at or above the 90th percentile who have similar demographics, learning behaviors, and prior achievement, females represent only 40%. Model 3 helps us identify where in the distribution the gaps grow between waves of data collection. That is, although we can visually compare, say, Model 2 from the fall of kindergarten to the spring for intuition about growth, Model 3 provides a more formal test of growth.
When the teacher rating is the outcome, Models 1 and 2 present raw and conditional gaps, respectively, just as with the direct cognitive assessment outcomes; Model 3, however, represents how a teacher would rank a boy and a girl with the same demographics, learning behaviors, past academic trajectory, and current achievement score. To better understand the magnitude of the λ_θ estimates, we can translate them into an effect size metric. Estimates of λ_θ = .44 (or .56, if above .5) correspond approximately to a standardized effect size of d = 0.2; thus, the range λ_θ = (.44, .56) could be considered “small.” Differences considered “moderate” (d = [0.2, 0.5]) correspond to λ_θ = (.30, .44) and λ_θ = (.56, .70). Differences considered “large” (d = [0.5, 0.8]) correspond to λ_θ = (.21, .30) and λ_θ = (.70, .79). Differences considered “very large” (d = [0.8, 1.0]) correspond to λ_θ = (.15, .21) and λ_θ = (.79, .85).8
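As Note 8 explains, these guides come from dividing the log odds ratio on the male coefficient by 1.81 (Chinn, 2000) to approximate a standardized difference. A minimal sketch of that conversion, using hypothetical proportions rather than estimates from our models:

```python
import math

def log_odds_gap(phi_m, phi_f):
    """Log odds ratio for males vs. females scoring at or below a given
    percentile cutoff (the male coefficient in an unconditional model)."""
    return math.log(phi_m / (1 - phi_m)) - math.log(phi_f / (1 - phi_f))

def cohens_d_from_log_odds(beta_male):
    """Approximate standardized difference via Chinn (2000):
    d ~= log odds ratio / 1.81."""
    return beta_male / 1.81

# Hypothetical example: 52% of boys vs. 60% of girls score at or below
# a given percentile cutoff.
beta = log_odds_gap(0.52, 0.60)
print(round(cohens_d_from_log_odds(beta), 3))  # → -0.18
```

Negative values here indicate an advantage for boys at that cutoff (boys are less likely to fall at or below it); the standardized value is then compared against conventional effect size bands.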

Conclusion

The persistence of the gender gap across two ECLS-K cohorts over a decade apart, together with mounting evidence from many other types of math assessments demonstrating its early emergence, makes clear that this gap deserves more attention than it receives in our public awareness and education accountability policies. In both data sets, the gap emerges early, starting at the top of the achievement distribution and working its way completely down the distribution in the first 3 to 4 years of school. Girls’ stronger approaches to learning may help narrow the gender gap in math at lower ranges of the achievement distribution but may do less to close the persistent gap at the top of the distribution. We also found consistent evidence across both cohorts that teachers give lower ratings to girls when boys and girls perform and behave similarly; this underrating of girls relative to observationally similar boys was found throughout the achievement distribution and suggests that teachers must perceive girls as working harder than similarly achieving boys in order to rate them as similarly proficient in math. This work points to the importance of examining gaps throughout the achievement distribution, as well as further examining the causes of early gender gaps in math, including the role that teacher expectations and students’ learning behaviors and problem-solving approaches may play in their development.

Acknowledgements

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305B100017 to the University of Illinois. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. A portion of Joseph Robinson Cimpian’s time was supported by a National Academy of Education/Spencer Foundation Postdoctoral Fellowship. We thank Andrei Cimpian for helpful comments on an earlier draft.

Notes

1. Race- and socioeconomic status (SES)–based gaps tend to be much larger in magnitude than gender-based gaps. However, when it comes to growth in the math gap, data from the Early Childhood Longitudinal Study–Kindergarten Class of 1998–1999 (ECLS-K:1999) suggest that the Black–White gap grows during the first 4 years of formal schooling by about 0.2 standard deviations, about the same amount that the gender gap grows over the same period. Other race- and SES-based gaps decreased over this period (Reardon & Robinson, 2008).

2. In contrast, more parents displayed concern about the appearance of their daughters than of their sons, with more queries about whether their daughters are “beautiful” or “ugly.”

3. In the ECLS-K data sets, males also display greater variance in math test achievement. For example, in the fall of kindergarten, the male:female test-variance ratio is about 1.2 in both the 1999 and 2011 cohorts (calculated from Table 1).

4. For more information, see https://nces.ed.gov/ecls/pdf/kindergarten2011/Fall_K_Classroom_Teacher_Child_Level.pdf.

5. For more information, see https://nces.ed.gov/ecls/pdf/firstgrade/Spring_2012_Teacher_Ques_Child_Level_First.pdf.

6. Although this paper focuses on mathematics, we suspect that some readers might be interested in similar analyses for reading. We ran parallel analyses focused on reading, but a discussion of the results is beyond the scope of this paper. Interested readers may find the reading results in the supplemental materials.

7. Because the outcome is an indicator of whether a student is above/below a given percentile (rather than a continuous score), conditioning on prior achievement (and other covariates, including demographics and learning behaviors) means the interpretation is more accurately one of current percentile-range standing given prior achievement and other covariates.

8. These values are approximate standardized differences, derived by first taking the log odds ratio on the male coefficient (i.e., β_θ1, the defining difference between males and females in Equation 1) and dividing it by 1.81 (Chinn, 2000). These log-odds-ratio standardized differences were then matched with their corresponding values of λ_θ to arrive at the guides presented here. These guides are intended to help readers translate the magnitude of the gender differences above (below) specific percentiles of the achievement distribution into standardized units and commonly used effect size terms; however, terms such as small and large should not be perceived as rigidly fixed to the specific ranges of values presented here (e.g., see Valentine & Cooper, 2003, for a discussion of how effect sizes should be interpreted in context).

9. There were two other small differences (one at the bottom of the distribution and one closer to the top), but these differences reflect aberrant percentiles rather than a general pattern across a cluster of percentiles.

10. We included race, age, and SES in these analyses. We ran supplemental analyses that included only demographics and found that the models with demographics were very similar to the base models, suggesting that demographics are not driving the changes between Models 1 and 2. Instead, the learning behaviors are driving these differences.

11. Like our paper, Fahle’s (2016) work examines gaps across the achievement distribution. The other papers referenced either examine differences in a small portion of the distribution or focus on average differences. All of these papers use data sets different from ours, with different sampling procedures, methods, and test foci.

12. We replicated the instrumental variable analyses in Study 2A of Robinson-Cimpian, Lubienski, Ganley, and Copur-Gencturk (2014b) and found that teachers’ underrating of girls is likely contributing to the development of the gender gap between kindergarten and Grade 1 for the 2011 cohort, just as was found for the 1999 cohort. Results are available upon request.