Bob Uttl’s first academic job seemed to be going well. As an assistant professor in psychology at Oregon State University, hired in 1999, he was publishing regularly and developing a good rapport with students, even those taking his challenging courses in research methods and psychometrics. “Some students were doing great and some of them not so great,” he recalls.

Then it all fell apart. Dr. Uttl was denied tenure. A student who had been removed from his course for academic dishonesty had written a letter saying she could not understand his accent – Dr. Uttl comes from the former Czechoslovakia. A colleague’s peer review claimed the font size used in a handout showed a “disdain” for students, and he found out later that this same colleague didn’t want to hire him in the first place. The tenure and promotion committee cited his student evaluations of teaching (SETs) scores as grounds for dismissal.

“Your career goes up in smoke,” says Dr. Uttl. He fought the dismissal and a student petitioned to have him reinstated. A federal court ruled in his favour in 2005 for several reasons, including the fact that the dean and others could not explain why his SETs scores, which hovered around the department average, were deemed not good enough.

Dr. Uttl was retroactively granted tenure and promotion, but the damage was done. No one would hire him and his U.S. work visa got cancelled. He landed an academic job in Japan and then sessional work in Red Deer, Alberta, a long commute from his new home in Calgary, where his Canadian wife had a job. But things worked out eventually: he is now a professor of psychology at Mount Royal University and, as an ongoing research interest, studies SETs.

He and others have been saying for some years that Canadian universities put too much weight on SETs (which go by a range of other names, including student surveys, student ratings of instruction or, simply, course evaluations). They’re being used to size up departments, rehire or fire part-time instructors, and inform tenure and promotion decisions.

“The literature has always said student evaluations should be used in the context of peer evaluation and self-evaluation, they should not have too much weight put on them,” says Brad Wuetherick, executive director of the Centre for Learning and Teaching at Dalhousie University. “But they’re being used on their own, unfortunately, because it’s a number. It’s an easy thing to understand.”

What’s more, research over the past decade or so has shown serious flaws with SETs: “They don’t measure teaching effectiveness and they are subject to all sorts of biases,” says Dr. Uttl. However, this narrow approach of equating questionnaire scores with teaching skill may be coming to an end at Canadian universities. An arbitration decision handed down last June in a dispute between Ryerson University and the Ryerson Faculty Association backs up Dr. Uttl’s assertions.

The arbitrator, William Kaplan, concluded that while SETs may be good at “capturing the student experience” of a course and its instructor, they are “imperfect” and “unreliable” as a tool for assessing teaching effectiveness. He declared that the university’s faculty course surveys could no longer be used to measure teaching effectiveness for promotion or tenure, the crux of the dispute.

As a result of the ruling, universities across the country are now looking at whether they need to revamp their student survey practices and connected human resources policies. It’s hard to fail a survey, but Canadian universities pretty much just did.

In the ivory tower past, university professors taught and students learned – or did not learn, as the case may be. Bad teachers sometimes got flagged and were nudged out, but many got to keep droning away at the lectern until retirement. As the idea of student-centred learning emerged, students increasingly saw themselves as active participants in their education. They wanted an official say and SETs offered a sanctioned means to share insights on the classroom experience.

In the 1980s, professors and university departments started experimenting with student surveys. By the early 1990s, university administrators began taking them over, standardizing questions and running them for all faculty. Those that resisted adopting this approach were often met with a student response. As an undergraduate, Mr. Wuetherick was part of the push for SETs at the University of Alberta in 1994. “We were advocating for the student voice to be taken seriously,” he recalls.

“The original intent was admirable,” says Gavan Watson, director of the Centre for Innovation in Teaching and Learning and the associate vice-president, academic, teaching and learning at Memorial University. “There was a sense that it was important to understand what the student experience of being in the course was like.”

As governments pushed universities to be more transparent as a condition for funding, student evaluations “satisfied the growing administrative emphasis on accountability,” says Jeff Tennant, associate professor in the department of French studies at Western University. Dr. Tennant is also the chair of the Ontario Confederation of University Faculty Associations’ collective bargaining committee and a member of its working group on student questionnaires.

The widespread integration of SETs was bolstered by a 1981 meta-analysis by Peter Cohen of Dartmouth College that found “strong support for the validity of student ratings [of instruction] as a measure of teaching effectiveness.” Subsequent research in the following years backed up Dr. Cohen’s findings, says Dr. Uttl. Then, as universities increased their computing power and interest in data, they began producing spreadsheets and charts with easy-to-digest numbers – and raised some early warning flags.

Ryerson computer science professor Sophie Quigley, who was the grievance officer for the faculty association when it launched its complaint against the university in 2009, says concerns about the university’s faculty course surveys were long-standing but worsened in 2007. “It got moved to an online format, and this is when it started getting used in a very different way,” she says. “The university introduced a bunch of averages that were not in place before.”

Ryerson uses a Likert scale, with questions such as, “The instructor is knowledgeable about the course material,” allowing a response of 1 for “agree” to 5 for “disagree.” The university would then use those numeric responses to create average scores. “The numbers from one to five, they’re just labels. They should not be made into averages,” says Dr. Quigley.

Ryerson also produced more complex datasets from their student surveys, but they were harder to interpret, says Dr. Quigley. “Averages are a single number so people naturally liked that better.” She and her colleagues started seeing these bad-math averages used in reports to rank university departments and to decide if people got tenure. “These averages got a life of their own.”
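Dr. Quigley’s objection to averaging Likert labels can be illustrated with a minimal sketch. The two sets of responses below are hypothetical, invented purely for illustration, and the `summarize` helper is not anything Ryerson used. It shows how a split class and a uniformly lukewarm class can end up with the same average even though the underlying responses look nothing alike.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses on a 1 ("agree") to 5 ("disagree") scale.
# Neither list comes from real survey data.
polarized = [1] * 10 + [5] * 10   # half the class agrees, half disagrees
lukewarm = [3] * 20               # everyone picks the middle option


def summarize(responses):
    """Return the average alongside the full count for each label."""
    counts = Counter(responses)
    return {
        "mean": mean(responses),
        # Count of responses per label 1..5, including labels nobody chose.
        "distribution": {label: counts.get(label, 0) for label in range(1, 6)},
    }


for name, responses in (("polarized", polarized), ("lukewarm", lukewarm)):
    print(name, summarize(responses))
```

Both classes report an average of 3, yet only the per-label counts reveal that one class is sharply divided. That is one reason a single number can mislead when it is read on its own.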

At many universities, a teacher’s skill set was being boiled down to a few numbers and compared to a standard that was often arbitrary and seldom standardized. Instructors falling below a certain average were declared bad teachers. “Your effectiveness as a teacher, your teacher score, became your SET score,” says Dr. Uttl. “People have been fired over this, their entire careers demolished.”

Beyond simplistic SETs scores, there’s more: newer research shows that these questionnaires don’t measure teaching effectiveness all that effectively. Many earlier studies were conducted by researchers affiliated with companies selling surveys to universities, says Dr. Uttl. As well, the sample sizes were often small for studies showing statistically significant correlations between SETs scores and student success. In a 2017 report, Dr. Uttl does the math again on a range of old studies, taking into account study size, plus students’ prior knowledge and ability, and finds “no significant correlations between the SET[s] ratings and learning.”

Also, in terms of SETs scores, large classes at inconvenient times rank poorly, as do lower-year undergraduate courses compared to upper-level and graduate ones. Dr. Uttl conducted a 2017 study that linked poor scores with quantitative courses. Other studies have shown that compulsory courses fare worse than optional ones. “Often, with difficult courses, it’s only with time that you get to reflect and understand what the value of a different learning experience was,” says Dr. Watson at Memorial.

An instructor’s gender, age (either too close to students in age, or much older), attractiveness, ethnicity and accent also influence scores. However, Dalhousie’s Mr. Wuetherick warns that the effect is not overly strong for these factors. “Class size and class level explain more of the variance than a whole bunch of the others combined,” he says.

Nevertheless, those who dig deep into SETs suspect there are even more factors influencing which boxes students tick. For instance, Dr. Uttl says how well a student learned the material in a prerequisite course will inevitably impact their perceptions of the follow-up. “If you nearly fail the first statistics course, you will do poorly in the next course,” he says.

A recent study out of Germany concluded that the availability of chocolate cookies during an academic course session affects the evaluation of teaching. The authors write, seemingly without irony, that the findings “question the validity of SETs and their use in making widespread decisions within a faculty.”

Meanwhile, many of these student evaluations don’t use best practices in survey design and include ill-phrased or improper questions. For instance, instructors at Ryerson requested that the university omit the question which asked students if the teacher was “effective,” a nebulous concept. Many surveys ask if the instructor is knowledgeable about the subject matter, a patently inappropriate question, particularly for undergraduates, says Dr. Uttl. “Students cannot possibly tell you whether you are knowledgeable. They don’t know the field,” he says.

Too few students are filling out the surveys as well. Research suggests that, while as many as 80 percent of students will fill out paper surveys, this drops to 60 percent or less online. The University of Toronto reports that the response rate for its online surveys is about 40 percent. Dr. Quigley says the response rates on Ryerson’s faculty course surveys used to be around 60 percent but gradually declined to 20 percent after going online. “They become more and more meaningless,” she says.

The Ryerson arbitration decision has already caused ripples. Late last fall, the Ontario Undergraduate Student Alliance issued a report that, among other things, calls for the Higher Education Quality Council of Ontario to set standards for SETs, and for universities to balance scores with peer review for hiring and promotion. Student ideas shared at the organization’s general assembly helped provide the content for the report, which was written by Kathryn Kettle, vice-president of policy and advocacy with the Students’ General Association at Laurentian University. “Students really care about this issue, and the fact that biases are inherent in surveys,” she says.

More recently, Dr. Tennant and OCUFA’s working group released a much-anticipated report in February, just prior to University Affairs going to press. It concludes that student questionnaires on courses and teaching – its preferred term – “fail to accurately reflect teaching quality” and should be used for formative purposes only, not for summative uses or for stand-alone performance evaluation.

Even predating the Ryerson decision, many universities began changing their student evaluation systems to better address their inherent flaws. “If you place value on the student voice about their experience, it behooves you to collect that voice in a way that’s as effective as possible,” says Susan McCahan, vice-provost, academic programs, and vice-provost, innovations in undergraduate education, at U of T. The university has been rolling out newly designed course evaluations to include six standard questions – all written using survey design best practices – plus two open-ended qualitative comment questions. Faculties and departments can add in other questions, too.

Similarly, Western University’s new system lets individual instructors add two supplementary questions for its student questionnaires of courses and teaching. They can choose from 45 questions in nine categories, allowing for specialized feedback on technology use, online classes, tutorials and labs. “Then they’re getting the information that, in theory, they want,” says Dr. Watson, who helped develop the new system in his previous role at Western before moving to Memorial in 2018.

Western also created a web portal called Your Feedback to help students, instructors and university staff to better understand the questionnaires. “We took it as an opportunity to make explicit some of the implicit assumptions around this data,” says Dr. Watson.

Many universities are also getting smarter about how they use their SETs data. U of T has been turning its survey results into a huge database since 2012 and now has hundreds of thousands of data points. It used some of these data to run a validation study, which it published in 2018, to track how things like class size, gender and year of study impacted results. The study found that class size had the most effect on scores. U of T expects to use this information as an educational tool for various groups, including tenure and promotion committees. “We want to help people interpret the [student survey] data in a more nuanced way,” says Dr. McCahan.

Mr. Wuetherick, as well, is analyzing SETs data at Dalhousie to understand what factors impact scores. He says hidden factors inevitably influence results and it may be impossible to entirely tease them all out. His team at Dalhousie is working to develop a framework by this summer to guide tenure and promotion committees around the use of SETs data, and how to integrate it with peer evaluation and other information in a teaching dossier. “We’ve gotten away with being collectively lazy as a system in how we evaluate teaching,” he says.

But guiding instructors on how to put together an effective and comprehensive dossier and then, in turn, supporting administrators in how to interpret this information is a big ask. The University of Calgary’s Taylor Institute for Teaching and Learning has created a Teaching Philosophies and Teaching Dossiers Guide that can help. It’s 50 pages long.

“It’s going to take more time to collect the evidence and it’s going to take more time to understand what effectiveness looks like,” says Dr. Watson. Meanwhile, as universities dig into more sophisticated means of measuring and showing teaching effectiveness, this raises additional questions. “The definition of effective teaching varies widely,” notes Dr. Uttl.

Setting a standard around good teaching, accepting what can and cannot be measured, and understanding the biases of students and faculties all give universities much to examine. And they need to do it now, before additional legal or even human rights challenges come down – Dr. Uttl says he would not be surprised to see a class-action lawsuit sometime in the near future. Adds Dr. Watson, “Given the complexity of what teaching is like and what learning is like, there ought to be a variety of data collected, a variety of evidence that’s presented that helps describe what the experience of being in a class is like.”