Abstract Gendered and racial inequalities persist in even the most progressive of workplaces. There is increasing evidence to suggest that all aspects of employment, from hiring to performance evaluation to promotion, are affected by gender and cultural background. In higher education, bias in performance evaluation has been posited as one of the reasons why few women make it to the upper echelons of the academic hierarchy. With unprecedented access to institution-wide student survey data from a large public university in Australia, we investigated the role of conscious or unconscious bias in terms of gender and cultural background. We found potential bias against women and teachers with non-English speaking backgrounds. Our findings suggest that bias may decrease with better representation of minority groups in the university workforce. Our findings have implications for society beyond the academy, as over 40% of the Australian population now go to university, and graduates may carry these biases with them into the workforce.

Citation: Fan Y, Shepherd LJ, Slavich E, Waters D, Stone M, Abel R, et al. (2019) Gender and cultural bias in student evaluations: Why representation matters. PLoS ONE 14(2): e0209749. https://doi.org/10.1371/journal.pone.0209749

Editor: Heidi H. Ewen, University of Indianapolis, UNITED STATES

Received: May 17, 2018; Accepted: December 11, 2018; Published: February 13, 2019

Copyright: © 2019 Fan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data underlying the study are within the paper and its Supporting Information files.

Funding: This study was funded by the Division of the Deputy Vice-Chancellor Academic, UNSW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Using student evaluations of teaching (SET) as a tool to assess teaching quality has become a contentious issue. Some scholars ([24]) argue that these surveys do not measure teaching effectiveness and should only be used to monitor the student experience. Yet many academic institutions require the reporting of SET results as a routine component of performance review and promotion. A number of recent influential studies ([11], [6], [19], [7]) have found evidence of gender bias in university teaching evaluations. Indeed, several studies have found that the gender, ethnicity and age of the instructor matter ([2], [9], [4], [26], [27]). While the literature on teaching evaluations is rich, most studies rely either on case studies or on small sample sizes. For example, a recent study of around 20,000 student evaluations over the period 2009-2013 from the School of Business and Economics at the University of Maastricht in the Netherlands ([11]) found that, on average, female instructors received a score 37 percentage points lower than male instructors. The bias was driven by male students, and was worst for junior female instructors. The authors also found the bias to be more pronounced in courses containing more mathematics. Another study, from a French university, analysed over 22,000 online evaluations by social science students over a five-year period ([6]). It found that male students express a bias in favour of male professors, and that men are perceived to be more knowledgeable and to have stronger leadership skills. Finally, a US study conducted an experiment in which the instructors of an online course operated under two differently gendered avatars ([19]). This research found that students rated the male avatar significantly higher than the female avatar, regardless of the instructor’s actual gender, but the study was based on a sample of 43 students assigned to 4 different instructors.

There is very little research on the effect of culture or race on SET scores. Some authors ([10], [13]) have compared course evaluation scores of Hispanic and Asian-American faculty with those of White faculty; however, the sample sizes used in these analyses were too small to draw firm conclusions. Other studies have been carried out using surveys or interviews ([23], [14]). In the Australian context, public conversations have focussed primarily on gender equality. One recent report found that Asian Australian academics perceive their heritage as a disadvantage in the workplace ([22]), whilst others have argued that there is resistance to opening such debates ([15]).

This study is based on SET and course satisfaction data collected at a leading Australian university that has consistently collected student evaluations of courses and teaching over a long period of time. We refer to these data throughout as “SET data”. The dataset comprises 523,703 individual student surveys, across 5 different faculties, over the seven-year period 2010-2016. There were 2,392 unique courses and 3,123 individual teachers in the dataset. The university has a large international student population (comprising 34% of the surveys), primarily from the Asia-Pacific region, and a culturally diverse teaching staff (38% of teachers classified as having a non-English speaking background). See Table 1 for a breakdown of the demographics.


Table 1. Breakdown of demographics from the SET dataset by faculty. Across the rows are: total number of individual student surveys; total number of unique courses; number of female teachers with non-English (NE) and English (E) speaking backgrounds; number of male teachers with non-English (NE) and English (E) speaking backgrounds; and the number of female and male international (I) and local (L) students. https://doi.org/10.1371/journal.pone.0209749.t001

This study differs from all previous studies in several ways. First, it is by far the largest study of SET data, and the only institution-wide study of SET; second, we look at evidence for potential cultural bias and the interplay between gender and cultural bias in a way that has not previously been considered. (We are, of course, mindful that ‘culture’ is a complex and contested concept. We use the term ‘cultural bias’ to capture the combination of biases related to language background, embodiment or presentation of (presumed) racial/ethnic identity, and beliefs or conventions particular to a given cultural context. In our dataset, ‘language spoken at home’ is the relevant variable.) Finally, we use a random effects model to appropriately account for “course” and “teacher” effects in a statistically rigorous analysis.

Methods

This research was approved by the UNSW Human Research Ethics Advisory Panel (HREAP), HC17088.

Data collection

The university has a mature data warehouse developed using the Kimball method of data warehousing ([18]). The method models individual business processes subject by subject to form an enterprise warehouse. Integration between subjects is achieved by adhering to a data warehouse bus matrix, which captures the relationships between the business processes and the core descriptive dimensions. This enables subject-oriented data marts to be built over time and assembled into an Enterprise Data Warehouse ([25]). The resulting integrated data warehouse is optimized for reporting and analytics ([18]). These data underpin many of the university’s decision support processes and have been cleansed, tested and used for decision making for seven years. Seven of the modelled business processes were used to prepare the data for this analysis: program creation, course creation, enrolment in programs, enrolment in courses, grades in courses, cumulative weighted average mark (WAM) in a semester, and course surveys.

As part of the ethics approval for this project, we separated the data preparation and engineering from the data analysis. The data were prepared and anonymized to protect the identity of the students completing the survey and the teachers who are its subject. The anonymized data were then handed over for analysis.

The dataset is itemised by students enrolled in courses. Each semester students enrol in courses, and at the end of each semester they are asked to participate in a survey about their experience in each course. The survey is voluntary and anonymous; students are reassured that they cannot be identified or penalised for their comments. In the dataset, attributes of the courses and programs were retained, such as faculty (the term ‘faculty’ is used here to refer to the administrative unit of the university (there are eight faculties at the university in question, including Arts and Social Sciences, Business, Science and so on), not to be confused with the teacher/professor), school, re-identified unique code, re-identified unique name and the field of education. The Field of Education follows the Australian Bureau of Statistics’ Australian Standard Classification of Education (ASCED) ([3]).

Teacher demographics were included to aid analysis: re-identified teacher identifier, gender, age at survey time, Australian residency information, citizenship information, language spoken at home, indigenous status and salary grade (Casual Tutorial, Casual Lecturer, Associate Lecturer, Lecturer, Senior Lecturer, Associate Professor or Professor). In the Australian system, casual tutors and casual lecturers may or may not hold PhDs; associate lecturers are often temporary lecturers with a PhD; a lecturer is equivalent to a tenure-track assistant professor in the North American system; and senior lecturer/associate professor and professor correspond roughly to associate professor and professor, respectively, in the North American system. Student demographics were also included to aid analysis: re-identified respondent identifier, WAM at survey time, gender, age at survey time, Australian residency information, citizenship information, language spoken at home, indigenous status, grade in the specific course being surveyed, and student load for the semester of the survey (whether the student is part time or full time).
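To make the resulting structure concrete, the sketch below shows the shape of a single anonymized survey record as an R data frame. All column names and values are hypothetical illustrations, not the university’s actual warehouse schema; the same names are reused in the modelling sketches later in this section.

```r
# Hypothetical shape of one anonymized survey record; column names and
# values are illustrative only, not the actual warehouse schema.
set_data <- data.frame(
  respondent_id      = "S000001",       # re-identified student identifier
  course_id          = "C0001",         # re-identified course code
  teacher_id         = "T0001",         # re-identified teacher identifier
  faculty            = "Science",       # administrative unit
  score              = 5L,              # Likert response, 1 (strongly disagree) to 6 (strongly agree)
  student_wam        = 72.5,            # student's WAM at survey time
  student_gender     = "female",
  student_intl       = "local",         # cultural background via residency status
  teacher_gender     = "male",
  teacher_background = "NE",            # English (E) or non-English (NE) speaking
  class_size         = 180L,            # total students in the course
  course_type        = "undergraduate"  # undergraduate or postgraduate
)
```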
The student is also asked for demographic information on the survey, and this is included as well: gender as stated in the survey response, mode of study as stated in the survey response, and residency as stated in the survey response. The university has been conducting Course and Teaching Evaluation and Improvement (CATEI) surveys in one form or another since the late 1990s, and moved them online in the late 2000s. The survey data used for this analysis are from 2010-2016, and included four questionnaire forms:

Form A (Course Evaluation) which was used to evaluate a course;

Form B (Large Group Teaching Evaluation) which was used to evaluate course lecturers;

Form C (Small Group Teaching Evaluation) which was used to evaluate tutors or lab demonstrators; and

Form D (Studio/Design Based Teaching) which was used to evaluate tutors or studios with smaller numbers of students.

Each form consisted of up to ten Likert questions (on a scale of 1 to 6: “strongly disagree, disagree, moderately disagree, moderately agree, agree, strongly agree”): Form A comprised eight standard questions plus two free-text questions, while Forms B, C and D comprised seven standard questions and up to two free-text questions. This analysis focuses on the last question on each form:

Form A (Course Evaluation): Overall, I was satisfied with the quality of this course.

Form B (Large Group Teaching Evaluation): Overall, I was satisfied with the quality of this lecturer’s teaching.

Form C (Small Group Teaching Evaluation): Overall, I was satisfied with the quality of this facilitator’s / tutor’s teaching.

Form D (Studio Teaching Evaluation): Overall, I was satisfied with the quality of this facilitator’s / tutor’s teaching.

Classes at this university were predominantly conducted in the traditional way during the survey period, i.e., face-to-face lectures and tutorials or labs that students are expected to attend. Large lecture groups can have up to two to three hundred students, while typical tutorial and lab groups have fewer than 30 students. Our focus on the final survey question reflects the fact that this is the question used by management as a performance indicator for promotion and other purposes.

Statistical analysis

Individual student evaluation scores (for a particular teacher in a particular course) are measured on a Likert scale (1, …, 6), indicating “strongly disagree, disagree, moderately disagree, moderately agree, agree, strongly agree”. Together with the score, we also have information on a variety of student-, teacher- and course-specific variables. An ordinal regression model is appropriate for this type of response, since scores are ordered categorical data ([1]). Since the data we analyse here are observational, the unequal number of times that a course or a teacher is surveyed can lead to biased results. To account for this, we use a mixed model with two random effects terms for individual course effects and individual teacher effects; these two terms also pick up individual-specific effects not otherwise accounted for in the model. Few students provide multiple surveys for the same teacher, so we treat the responses as conditionally independent.

A large number of studies have produced mixed conclusions about which student or teacher characteristics influence SET results, but most of these are based on small samples or case studies ([5]). We include in our fixed effects most of the frequently studied variables: student semester average mark (WAM); student cultural background, as indicated by the residency status of the student; gender of student; total number of students in the course; course type (postgraduate or undergraduate); gender of teacher; and cultural background of teacher (English or non-English speaking background). Around one third of teachers had missing information in the database containing the teacher’s language/cultural background; in these cases we flagged them as English speaking if they were born in a predominantly English speaking country (Australia, New Zealand, United Kingdom, United States, South Africa) and non-English speaking if they were born elsewhere. Where country of birth and language spoken at home were both missing, we flagged the cultural background as missing, unless the citizenship status was a non-Australian class, in which case we flagged the cultural background as non-English speaking. Overall, 24% of teachers were flagged with missing cultural background. Since the interplay between student attributes and teacher attributes is complicated, we include four further interaction terms: teacher gender by student cultural background; teacher gender by student gender; teacher cultural background by student cultural background; and teacher cultural background by student gender. All terms are treated as linear here, based on findings from the relevant literature ([5]).
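The flagging rule can be written down compactly. The following is a minimal sketch of the logic described above, assuming the hypothetical fields language_home, country_of_birth and citizenship; it is an illustration, not the actual data-preparation code.

```r
# Sketch of the cultural-background flagging rule; field names are hypothetical.
english_countries <- c("Australia", "New Zealand", "United Kingdom",
                       "United States", "South Africa")

flag_background <- function(language_home, country_of_birth, citizenship) {
  if (!is.na(language_home)) {
    # Language spoken at home is recorded: use it directly
    if (language_home == "English") "E" else "NE"
  } else if (!is.na(country_of_birth)) {
    # Otherwise fall back on country of birth
    if (country_of_birth %in% english_countries) "E" else "NE"
  } else if (!is.na(citizenship) && citizenship != "Australian") {
    # Both missing: a non-Australian citizenship class is flagged non-English
    "NE"
  } else {
    NA_character_  # cultural background flagged as missing
  }
}
```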
We fit a cumulative logit link model of the form

$$\operatorname{logit} P(y_{ict} \le j) = \theta_j - \mathbf{x}_i^\top \boldsymbol{\beta} - \alpha_c - \alpha_t, \tag{1}$$

where $j = 1, \ldots, 6$ refers to the response levels, $P(y_{ict} \le j)$ is the probability of student $i$ from course $c$ taught by teacher $t$ giving a score less than or equal to level $j$, given $\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})$, the vector of fixed effect measurements; $\theta_j$ is the threshold for level $j$; and $\alpha_c$ and $\alpha_t$ are the random effects coefficients for course and teacher. The vector $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)$ is the vector of fixed effects coefficients. The model was fitted separately to each faculty using the ordinal package in R, which uses a maximum likelihood approach ([21]).

Interpretation of fixed effects parameters

The effect of gender or culture can be studied through the fixed effect coefficient for the particular effect. For instance, if we are interested in the gender effect, the covariate for gender $x_k$ takes the value 0 or 1, indicating female or male. For women ($x_k = 0$), the $\beta_k$ term disappears from Eq 1; because $x_k$ takes the value 1 for men, $\beta_k$ stays in the equation for male teachers. Taking the difference between the equations for female and male teachers, we get

$$\log\!\left(\frac{\mathrm{odds}_{\mathrm{males}}}{\mathrm{odds}_{\mathrm{females}}}\right) = \beta_k, \tag{2}$$

where $\mathrm{odds}_{\mathrm{females}}$ is defined as $p_j/(1 - p_j)$ with $p_j = P(y_{ict} > j)$ for women, and $\mathrm{odds}_{\mathrm{males}}$ is defined as $q_j/(1 - q_j)$ with $q_j = P(y_{ict} > j)$ for men. As the model included interaction terms with student gender (and cultural background), we calculated the odds ratios separately for each stratum of students (male and female students, and local and international students). 95% confidence intervals were calculated for the odds ratio: the standard error of the log-odds ratio follows naturally from the inverse of the Hessian, a by-product of the model fit, and the interval limits are then $\exp\left(\log(\mathrm{OR}) \pm 1.96\,\operatorname{se}(\log(\mathrm{OR}))\right)$.

Subset analysis

To gain a sense of the relative contribution of gender and culture compared with factors that actually measure improvements in teaching effectiveness, we created a new variable indicating whether the course has been taught at least once before by the instructor. Typically, instructors’ scores improve by a large amount once they have taught the course once and have had feedback on it. To do this, we used data only from 2012 onwards, and only data on teachers appointed at the lecturer or senior lecturer level. These staff conduct the bulk of academic teaching, and there is less variability amongst this cohort than amongst the casual teaching staff. We created a flag to indicate whether the instructor has taught the course in the last 3 years; if they have not, we consider them to be teaching the course for the first time. We fit a model as above, with random effects to account for SET scores clustered on teacher and course, and fixed effects terms for student WAM, student cultural background, gender of student, total number of students in the course, course type, gender of teacher, cultural background of teacher, and whether the teacher has experience teaching the course (we did not fit interactions here as the dataset was reduced in size).

Model assessment

To assess the ability of the ordinal regression to classify scores, for each $j = 1, \ldots, 5$ we took the estimated probability that the SET score is less than or equal to $j$ (i.e. $\hat{P}(y_{ict} \le j)$) and compared it to a binary indicator of whether the observed SET score was less than or equal to $j$ (i.e. $\mathbb{1}(y_{ict} \le j)$). We calculated the Area Under the Receiver Operating Curve (AUC), which assesses how well $\hat{P}(y_{ict} \le j)$ is able to discriminate $\mathbb{1}(y_{ict} \le j)$.
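A model of the form (1) can be fitted with the clmm function from the ordinal package cited above. The sketch below is a minimal illustration using the hypothetical column names introduced earlier; the actual model specification, factor codings and stratified odds-ratio calculations in the study may differ.

```r
library(ordinal)

# The response must be an ordered factor for a cumulative link mixed model;
# set_data is assumed to hold the full survey table for one faculty.
set_data$score <- factor(set_data$score, levels = 1:6, ordered = TRUE)

# Fixed effects, the four interaction terms, and random intercepts for
# course and teacher, as in Eq 1
fit <- clmm(
  score ~ student_wam + student_intl + student_gender + class_size +
    course_type + teacher_gender + teacher_background +
    teacher_gender:student_intl + teacher_gender:student_gender +
    teacher_background:student_intl + teacher_background:student_gender +
    (1 | course_id) + (1 | teacher_id),
  data = set_data, link = "logit"
)

# Odds ratio for one fixed effect (Eq 2); the coefficient name depends on
# the factor coding and, with interactions present, the main effect applies
# within the reference student stratum
est <- summary(fit)$coefficients
b   <- est["teacher_gendermale", "Estimate"]
se  <- est["teacher_gendermale", "Std. Error"]  # from the inverse Hessian
exp(c(OR = b, lower = b - 1.96 * se, upper = b + 1.96 * se))
```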
Generally, AUCs between 0.7-0.8 are considered fair, 0.8-0.9 good, and 0.9-1 excellent ([17]). To assess uncertainty in the AUC from a mixed model, accounting for the design clustered on teachers and courses, we conducted a clustered bootstrap ([20]). That is, we sampled course-teacher units with replacement in each of $N_{\mathrm{boot}} = 500$ resamples. Letting $(c^*, t^*)$ denote the resampled indices, the standard error of the AUC was estimated as the standard deviation of the bootstrap AUCs, $\widehat{\operatorname{se}}(\mathrm{AUC}) = \operatorname{sd}\!\left(\mathrm{AUC}^{(1)}, \ldots, \mathrm{AUC}^{(N_{\mathrm{boot}})}\right)$, and 95% confidence interval limits for the AUC were then estimated as $\mathrm{AUC} \pm z\,\widehat{\operatorname{se}}(\mathrm{AUC})$, where $z = 1.96$, a common large-sample approximation for the AUC (e.g. [12]).
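The clustered bootstrap can be sketched as follows, again under the hypothetical names used above. Here predict_cum_prob stands in for whatever routine produces the fitted cumulative probabilities $\hat{P}(y_{ict} \le j)$ (the ordinal package does not provide these directly for mixed models), so it is an assumed helper, not a library function.

```r
library(pROC)

# AUC at threshold j: predicted P(score <= j) vs. the observed indicator
auc_at_j <- function(d, j) {
  p_hat  <- predict_cum_prob(fit, d, j)          # hypothetical helper
  y_le_j <- as.integer(as.integer(d$score) <= j) # indicator that score <= j
  as.numeric(auc(y_le_j, p_hat))
}

# Resample course-teacher units with replacement, keeping all their surveys
units  <- unique(set_data[, c("course_id", "teacher_id")])
n_boot <- 500
boot_auc <- replicate(n_boot, {
  idx <- sample(nrow(units), replace = TRUE)
  resampled <- merge(units[idx, , drop = FALSE], set_data,
                     by = c("course_id", "teacher_id"))
  auc_at_j(resampled, j = 3)
})

# Standard error from the bootstrap spread; normal-approximation 95% CI
se_auc <- sd(boot_auc)
auc_at_j(set_data, j = 3) + c(-1.96, 1.96) * se_auc
```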

Conclusion

This study analysed a large observational dataset of student evaluations of teaching to detect potential bias, in terms of both gender and culture, in student evaluation of teachers. Since the surveys are voluntary, with a typical response rate of around 30% across the university, care should be taken when generalising these results to the broader student population: the results reflect the scoring patterns of those who responded. Note that when these surveys are used by the university administration, the effects of low response rates are not considered or accounted for. In the future, it would be interesting to study the effects of increasing the survey response rate. In discarding teachers with missing culture information for the analyses involving culture, we have assumed that this information is missing at random.

Although controlled experiments ([19]) are ideal for studying a specific effect, they tend to suffer from small sample sizes and can rarely address the complexity of the interplay between the various factors that influence SET scores. When sample sizes are large, as in our study, the findings of an observational study become more representative of a larger population. With over 3,000 teachers in the sample, over 44% of them female and 38% with non-English speaking backgrounds, the findings are less sensitive to individual-specific traits.

Our findings suggest that SET scores are subject to different types of personal biases. To the best of our knowledge, this is the first study to reveal statistically significant bias effects attributable to both gender and culture, and their interactions. We detected statistically significant bias against women and staff with non-English language backgrounds, although these effects do not appear in every faculty. Our findings on the effect of cultural background are novel and significant because in Australia, where the population is culturally diverse, current policy and administrative actions have focussed on addressing gender bias, but less on cultural or racial bias. We found some evidence that the proportion of women or staff with non-English language backgrounds in a faculty may be negatively correlated with bias, i.e., having a diverse teaching staff may reduce bias. We also found that, given the magnitude of these potential biases, SET scores are likely to be flawed as a measure of teaching performance. Finally, we found no evidence that students’ unconscious bias changes with the level of their degree program.

Throughout this paper, and in the title, we have used the term “bias” when describing the statistically significant effects associated with female and non-English speaking teachers. It should be pointed out that one of the limitations of this study is that it can only show association: for example, being female is associated with a lower SET score, but we cannot say what actually caused the lower score. However, if SET really measures teaching quality, then the only plausible causes are either that women are generally worse teachers across a large population, or that there is bias; the same argument can be made for teachers with non-English speaking backgrounds. Since we find no credible support for the proposition that women, or people with accents, are generally worse teachers, we have chosen to use the term “bias”.
A comparison with SET results from course evaluations, in which gender and cultural background no longer show strong patterns, suggests that teaching evaluations may be evaluating the person rather than teaching effectiveness. Hence the effect we observe may be related to the student’s impression of the teacher in the context of the Australian university setting. Some evidence for this can be seen in the accompanying text responses, where students comment on different aspects of the teacher, sometimes with a clearly gendered perspective, though this is beyond the scope of the present study. Universities may be able to reduce bias in several ways: by ensuring staff diversity, by employing more staff from under-represented groups in specific faculties, or through bias training for students. Making university students less biased may have enormous flow-on benefits for society, as university students represent a large proportion of future leaders in industry and government (for example, all Fortune 500 CEOs have at least a bachelor’s degree). The administration of the university on which our study is based is proactively seeking change to minimise the effects of conscious and unconscious bias. Developing measures of teaching effectiveness that take into account the findings of this and similar studies would enhance teaching quality. A first step in this direction may be to consider a bias correction to recalibrate the scores.

Acknowledgments The authors would like to acknowledge financial support from the Division of the Deputy Vice-Chancellor Academic at UNSW Sydney, and the rest of the research team for their contribution to the project: Sophie Adams, Sheree Bekker and Tess Gordon. YF is grateful for useful discussions with Professor Ray Chambers on statistical modelling.