For our first study, we investigated predictors of course performance in nine large introductory courses taken in fall 2016 by non-biology majors (Table 1). Courses ranged in size from 90 to 239 students, and varied in the proportion of the total course grade that was due to exams (midterms and finals). In one course, for example, 41% of the course grade was calculated from exam scores, but in another, 52% of the grade was due to exams. Specifically, we categorized grades as exam grade (combined performance on midterms and finals), non-exam grade (performance on any non-exam assessments), and final course grade (a combination of exam grade and non-exam grade). A third-party individual, not involved in the course or this research, matched student grades to student gender, age, and incoming academic preparation (American College Test, hereafter ACT).

Statistical analysis

We performed all statistical analyses using SPSS software version 24 (SPSS Inc., Chicago, IL, USA). We used multilevel modeling with hierarchically nested data (students in different classes) to account for the non-independence of data in nested-data structures [22,23]. For analyses, we used Akaike’s information criterion (AIC) to assess model fit [24]. AIC allows us to identify the best model for our data using AIC differences (Δi = AIC_i − minAIC, where AIC_i is the AIC of model i and minAIC is the smallest AIC value among the candidate models). We performed four separate sets of analyses. We were interested in the interaction between the percentage that exams contribute to the final course grade (PercExam, a continuous fixed effect) and student gender (SGender, a fixed effect with two levels). Therefore, our model initially included those three effects (SGender, PercExam, and SGender*PercExam) and ACT score. We included ACT score to account for variation in students’ incoming preparation for the courses [25]. In addition, we used AIC differences to test whether the following variables improved the fit of the model for the given set of data: (1) student underrepresented minority status (whether they are African American, Hispanic, Native American, or Pacific Islander; hereafter URM, a factor with two levels); (2) student age (Age); and (3) class size. Only students with a complete set of these variables were included in these analyses. We ultimately chose the most parsimonious model that best fit the data. The final model for exam performance, non-exam performance, and total course performance included the following predictor variables:

Class section (ClassSec) was included as a random effect, and was tested for significance by removing it and taking the difference between the -2 log likelihoods. This was tested against a chi-square distribution with one degree of freedom.
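The two model-comparison steps described above (AIC differences and the likelihood-ratio test of the random effect) can be sketched as follows. This is a minimal illustration, not the SPSS procedure we ran: the function names, model labels, AIC values, and -2 log likelihoods are all hypothetical. For a chi-square distribution with one degree of freedom, the upper-tail probability equals erfc(sqrt(x/2)), so no statistics library is needed.

```python
import math


def aic_differences(aic_by_model):
    """Return delta_i = AIC_i - minAIC for every candidate model."""
    min_aic = min(aic_by_model.values())
    return {name: aic - min_aic for name, aic in aic_by_model.items()}


def random_effect_lrt(neg2ll_without, neg2ll_with):
    """Likelihood-ratio test for a single random effect.

    The statistic is the difference between the -2 log likelihoods of the
    model without and with the random effect, referred to a chi-square
    distribution with one degree of freedom.
    """
    stat = max(neg2ll_without - neg2ll_with, 0.0)
    # Upper tail of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical AIC values for three candidate models (illustration only).
aics = {"base": 1520.4, "base+URM": 1518.1, "base+Age": 1521.9}
deltas = aic_differences(aics)

# Hypothetical -2 log likelihoods without and with ClassSec.
stat, p = random_effect_lrt(1530.0, 1512.0)
print(deltas)
print(stat, p)
```

The model with Δi = 0 has the lowest AIC; how large a difference must be before a more complex model is preferred is a judgment the sketch leaves to the analyst.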

Next, we conducted three ‘case study’ analyses of courses that shifted grading schemes to either emphasize or deemphasize the influence of exam performance on final course grades (Table 2). For the first two cases, we used univariate general linear models to compare metrics of student achievement across two semesters of BIOL 100 and BIOL 300. With average exam grade and total course score as the dependent variables, we included ACT score, SGender, semester, and the interaction between SGender and semester for each analysis. An ANOVA showed that incoming ACT scores did not differ significantly between semesters (for women in BIOL 100, F(1,17) = 0.932, P = 0.539; for men in BIOL 100, F(1,14) = 0.481, P = 0.935; for women in BIOL 300, F(1,15) = 0.532, P = 0.913; for men in BIOL 300, F(1,14) = 0.802, P = 0.663), indicating that incoming student populations were comparable in their preparation. We did not have ACT scores for seven students in BIOL 100 and nine students in BIOL 300, and so we assigned the average ACT score for their class to those students in order to include them in the analyses. Further sensitivity analyses, in which we increased and decreased the imputed ACT values for those students by one standard deviation (±1 SD), did not significantly change our results.
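The mean imputation and ±1 SD sensitivity check described above can be sketched as follows. This is an illustration under stated assumptions: the function name and the ACT values are hypothetical, and the sketch uses the sample standard deviation of the observed scores in the class, which the text does not specify.

```python
import statistics


def impute_act(scores, shift_sd=0.0):
    """Fill missing ACT scores (None) with the class mean plus shift_sd SDs.

    shift_sd = 0 gives plain mean imputation; +1 and -1 give the two
    sensitivity scenarios. The SD is the sample SD of observed scores
    (an assumption made for this sketch).
    """
    observed = [s for s in scores if s is not None]
    fill = statistics.mean(observed) + shift_sd * statistics.stdev(observed)
    return [fill if s is None else s for s in scores]


# Hypothetical ACT scores for one class; None marks a missing score.
act = [24, 28, None, 30, 22, None, 26]

baseline = impute_act(act)        # class mean
high = impute_act(act, 1.0)       # class mean + 1 SD
low = impute_act(act, -1.0)       # class mean - 1 SD
print(baseline, high, low)
```

Each of the three imputed data sets would then be passed to the same general linear model, and the conclusions compared across the three fits.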

For the third case study, we focused on a two-semester sequence of courses restricted to lower-division majors in biology. BIOL 202 and BIOL 203 are two courses taken consecutively by students, and so a high proportion of students who took BIOL 202 in the spring of 2016 also took BIOL 203 the following fall (97% of students in BIOL 203 took BIOL 202 the previous semester). In these courses, we were interested in individual students’ performance in the two classes, which are similar in nature (‘Part 1’ and ‘Part 2’ of a Foundations of Biology for Biological Science Majors sequence) but differ in the extent to which exams make up the final grade. To analyze these courses, we used a mixed model, wherein we included student ID as a repeated measure across semesters, and used a first-order autoregressive (AR1) covariance matrix. With this covariance matrix, we assume that residual errors within each subject are correlated, but are independent across subjects. With average exam grade and total course score as the dependent variables, we included ACT score, SGender, semester, and the interaction between SGender and semester for each analysis. We used Pearson correlations to examine whether baseline estimates (data collected prior to the course) were correlated with each other and with student outcomes. We deleted one outlier found in the residuals in our analysis of students’ exams in order to meet the assumptions of a mixed model. This individual had an average exam score of 20% across the semester (whereas the next lowest cumulative score for students was >60%). For all ‘case study’ analyses, we report post-hoc Bonferroni pairwise comparisons to clarify performance outcomes of students based on gender.
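To make the covariance assumption and the post-hoc correction concrete, the sketch below builds a first-order autoregressive (AR1) covariance matrix for one student's repeated measures and applies a Bonferroni adjustment to a set of p-values. The function names and all numeric values are hypothetical; SPSS estimates the variance and correlation parameters from the data, whereas here they are simply supplied.

```python
def ar1_covariance(n_occasions, sigma2, rho):
    """AR1 covariance for one subject: cov(i, j) = sigma2 * rho ** |i - j|.

    Residuals from the same subject are correlated, with the correlation
    decaying as occasions get further apart; residuals from different
    subjects are treated as independent (covariance zero).
    """
    return [
        [sigma2 * rho ** abs(i - j) for j in range(n_occasions)]
        for i in range(n_occasions)
    ]


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Two occasions (the two semesters), with hypothetical parameter values.
cov = ar1_covariance(2, sigma2=4.0, rho=0.5)

# Hypothetical unadjusted p-values from three pairwise comparisons.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
print(cov)
print(adjusted)
```

With only two occasions, the AR1 matrix reduces to a common variance on the diagonal and a single covariance off the diagonal.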