
New ScienceOpen study on the effectiveness of student evaluations of teaching highlights gender bias against female instructors

Student evaluations of teaching form a core part of our education system. However, there is little evidence that they are effective, or even that they work as intended, even though such rating systems have been used, studied and debated for almost a century.

A new analysis published in ScienceOpen Research offers evidence against the reliability of student evaluations of teaching, particularly as a measure of teaching effectiveness and as a basis for tenure or promotion decisions. In addition, the study identified a bias against female instructors.

The new study by Anne Boring, Kellie Ottoboni, and Philip Stark (ScienceOpen Board Member) has already been picked up by several major news outlets, including Inside Higher Ed and Pacific Standard. This gives it an Altmetric score of 54 (at the time of writing), which is the highest for any ScienceOpen Research paper to date!

The research paper highlights some key points regarding student evaluations of teaching (SET), including:

  • The existence of a statistically significant bias against female instructors, mostly driven by scores from male students
  • Such bias affects how students rate even objective aspects of teaching, and varies according to discipline and student gender
  • The bias is complex and cannot simply be adjusted for
  • Student evaluations of teaching are more sensitive to students' gender bias and grade expectations than to any aspect of teaching effectiveness
  • These biases can be large enough to give more effective instructors lower evaluation scores than less effective ones.

The finding that students generally give their female instructors lower scores than their male counterparts is unfortunate, and potentially damaging. Moreover, there appears to be no simple way to fix or compensate for such a bias.

According to the Pacific Standard report, Stark said that “you can’t just simply add half a point” to compensate for these gender biases. What we’re probably looking at here are much more deeply ingrained cultural differences in how male students perceive men and women in positions of educational authority.

The main harm is that such evaluations are used in hiring decisions, and therefore introduce a bias against female candidates. Stark said, “I really think colleges and universities need to move away from using student evaluations for hiring, firing, and promotion decisions.”

The study concludes with some strong recommendations: “The onus should be on universities that rely on SET for employment decisions to provide convincing affirmative evidence that such reliance does not have disparate impact on women, under-represented minorities, or other protected groups. Because the bias varies by course and institution, affirmative evidence needs to be specific to a given course in a given department in a given university. Absent such specific evidence, SET should not be used for personnel decisions.”

The article is open for peer review and comment, so please do join the conversation. Articles of this importance deserve to be openly evaluated by the community, and at ScienceOpen we provide that functionality through a dual commenting-review system.

In addition, this article highlights something that we at ScienceOpen strongly believe in: the appropriate use of metrics. This is why we provide altmetric statistics for all our archived articles, as part of an enriched understanding of how research articles are used and discussed online. We have also signed DORA, the Declaration on Research Assessment, which states that metrics should be used appropriately when it comes to assessment and evaluation within academia.

This case study draws on a natural experiment conducted over five years at a French university, comprising 23,001 evaluations of 379 instructors by 4,423 surveyed students, which makes it a data-intensive study. Jupyter notebooks containing the analyses from the study can be found here, and they rely on the permute Python library. The US data are available here, but French privacy law prohibits publishing the French data. Nonetheless, this is a great example of making the data and methods behind research article