Assessment & Evaluation in Higher Education
2022, VOL. 47, NO. 1, 1–14
https://doi.org/10.1080/02602938.2021.1888868

Quantifying halo effects in students’ evaluation of teaching

Edmund Cannon (School of Economics, University of Bristol, Bristol, UK) and Giam Pietro Cipriani (Department of Economics, University of Verona, Verona, Italy)

ABSTRACT
Student evaluations of teaching may be subject to halo effects, where answers to one question are contaminated by answers to the other questions. Quantifying halo effects is difficult since correlation between answers may be due to underlying correlation of the items being tested. We use a novel identification procedure to test for a halo effect by combining a question on lecture-room capacity with objective information on the size of the lecture room. We confirm the presence of halo effects but show that the responses to the contaminated question remain informative. This suggests that the distortion in the evaluation questionnaires caused by halo effects need not be a concern for higher education institutions.

KEYWORDS
Students’ evaluation of teaching; validity; halo effects; block rating; lecture-room capacity

Introduction
There is a vast literature on student evaluations of teaching (SETs); Al-Issa and Sulieman (2007)
report that there are thousands of papers on this subject. On the one hand, some papers claim
that SETs are a problematic measure of teaching quality. Among these, Spooren, Brockx, and
Mortelmans (2013) discuss a variety of issues with the validity of what is being measured and
the meta-analysis of multi-sectional studies by Uttl, White, and Wong-Gonzalez (2017) shows that there
is no relationship between SET score and student learning. On the other hand, Wright and Jenkins-
Guarnieri (2012) in a review of meta-analyses conclude that SETs appear to be valid measures of
instructor skill and teaching effectiveness and are related to student achievement. In any case,
SETs remain popular with students, university administrators and national governments: for exam-
ple, in the United Kingdom the National Student Survey is a nationally-run SET and the results
are publicly available to enable the comparison of teaching in university departments. In the USA,
Becker, Bosshardt, and Watts (2012) confirmed the result of an earlier survey and concluded that
evaluation of teaching across Departments of Economics relies heavily and almost exclusively on
SETs and these evaluations are given to promotion committees in 75 percent of cases.
In this paper we re-address a particular issue of SETs, namely the existence or importance of
halo effects, the phenomenon of questions attempting to elicit information on different aspects
of teaching (items) resulting in responses that are very highly correlated and potentially reveal
little more than an overall evaluation. Halo effects arise when students form global impressions
of a teacher and in so doing fail to distinguish among the different dimensions considered rele-
vant to teaching effectiveness. For example, there would be halo effects if valuations on a
questionnaire concerning different elements of a course, such as the quality of hand-out material
and the comfort of the lecture room, were to correlate positively. If halo effects are perceived to
be problematic, then possible solutions are either to have much shorter evaluations (since asking
more questions results in little additional information) or to design evaluations very differently
(which may be costly). Shorter evaluations could also help reduce survey non-response by reducing
survey fatigue; Adams and Umbach (2012) found that the number of SETs administered to students
is a statistically significant predictor of participation.
The problem with identifying halo effects is that we expect responses to different questions to
be correlated even without a halo effect. For example, a tutor who is well organised might be more
likely to produce good teaching material. Assuming that SETs reflect something about what the
tutor is actually doing, then this correlation would be reflected in responses to the items both on
organisation and teaching material. One way to identify a halo effect is to include an item which
should not be correlated with the other items. A long literature has used a question on good looks
as such a question and generally found that better looking tutors get better SET scores overall (e.g.
Hamermesh and Parker 2005). However, this is not an ideal method to identify a halo effect because
aspects of physical appearance may well be correlated with ability if there are no controls for other
tutor characteristics. For example, in a slightly different context Case and Paxson (2008) find that
height is a good predictor of wages when no other controls are used, but that this is because
height is correlated with IQ (after controlling for the effect of IQ, height has no predictive power).
Similarly, Cipriani and Zago (2011) find that good looks are associated with higher examination
marks. That paper also presents a brief discussion of the theories advanced by the psychological
and economic literature to explain why attractiveness may influence productivity.
In this paper we use a completely different item (question) to identify a halo effect, namely
a question about whether the lecture room was large enough to accommodate all students. Our
data are from a university where tutors have minimal ability to re-negotiate larger lecture spaces
if rooms are too full, so whether or not the room is large enough is effectively randomly deter-
mined. Furthermore, it is likely that better lecturers will have better student attendance and so
there might even be a reason to expect a negative correlation between responses to this question
and other questions. Importantly our identifying strategy is very different to research using
beauty and hence it provides external validation of existing studies on halo effects.
Our study is based on data from an Italian university: while there are many papers on SETs
administered in US universities this paper adds to a very small body of published evidence on
Italian universities.

Literature review
In what follows we concentrate on the sole issue of the halo effect. The phenomenon was
first noted a century ago by Thorndike (1920), who commented that “even a very capable
[evaluator] is unable to treat an individual as a compound of separate qualities and to assign
a magnitude to each of these in independence of the others” (pp. 28-29). The result was that
true underlying correlations between different attributes were biased by the halo effect. In his
review of the literature, Cooper (1981) suggested that halo effects were “ubiquitous” and sug-
gested the terms “illusory halo” and “true halo” to distinguish problematic correlations from
correlation in item responses which is justified by correlation in the underlying characteristic
being measured. Our reading of the literature is that use of the illusory/true halo distinction
is not widespread and in this article we shall continue to refer to the halo effect as any cor-
relation which is unjustified.
It was soon realised that there was some additional structure to the correlations between
item responses. Isaacson et al. (1964) found that 95 per cent of the variation in responses to
46 questions evaluating teachers could be explained by just six factors: the interesting point
here is that it is rare for respondents to give the same answer to all of the questions and so
there need not be correlation across all items. The correlation of responses varies between
studies. An extreme example of high correlation is found in SETs by Anderson (2013) where,
on a six-point Likert scale, 58 per cent of individual SETs had exactly the same answer to all
fourteen questions about individual tutors’ instruction, a phenomenon that Anderson refers to
as “block rating”. These results are interesting because, while they suggest problems with the validity of SETs, they also suggest that long (and potentially burdensome or costly) questionnaires may be of little value and that it may be better to ask a much smaller number of questions.
According to an important part of the literature, students base their evaluations essentially
on a single characteristic of the teacher and then generalize their opinion about this characteristic
to all other, possibly unrelated, characteristics. Shevlin et al. (2000) mention charisma as such a
characteristic, which explains more than two thirds of the variation in the “lecturer ability” ques-
tions. Others have tested for the effect of personality or physical attractiveness. The effect of
personality has been tested by Patrick (2011) using the Big Five personal traits. She found that
personality of the instructor, in all five traits assessed by the students, seemed to have an effect
on student ratings. One of the big five, agreeableness, was also found to be positively correlated
with SETs when students rated their own personalities. The effect of physical attractiveness has
been measured by Hamermesh and Parker (2005). They found that moving from one standard
deviation below the mean to one standard deviation above in their beauty measure leads to an
increase in the average class rating of about one standard deviation. Mendez and Mendez (2016)
also considered the effects of physical attractiveness but through its impact on students’ per-
ception of knowledge, approachability and faculty selection in a hypothetical course. They found
that attractiveness may skew student perception of other attributes attached to teaching.
In general, various other measures of likeability have been investigated in relation to SETs
and many papers have cast some doubt on whether SETs are, therefore, a valid indicator of
teaching quality. A recent example is the study by Feistauer and Richter (2018) that confirms
a halo effect of teacher’s likeability on SETs.
Halo effects have also been confirmed in experiments. Keeley et al. (2013) manipulated the quality of teaching so that some aspects of an otherwise identical short lecture were different, such as good versus poor eye contact, and found that this also had an effect on the evaluation of non-manipulated items.
Other than arguing for the ubiquity of halo effects, Cooper’s (1981) review largely concentrated
on how one could reduce them: since evaluators appear to be unaware of their tendency to
manifest halo effects (Nisbett and Wilson 1977), it appeared that the best way to do this was by
careful evaluation design. In this vein, some authors addressed the issue of the number of teaching
dimensions to be evaluated by students, i.e. whether to use a single score or multiple scores. Hativa
and Raviv (1993), using a factor analysis of the separate items, concluded that a single score
faithfully represents all dimensions of teacher ratings. If this is the case, as mentioned earlier, questionnaires could be much shorter, making students more willing to rate carefully and reducing block rating, as recommended by Feeley (2002). It has also been noted by Darby (2007)
that whilst Likert-type scales are not regarded independently by students, responses to open-ended
evaluation forms do not display a halo effect. In any case, efforts to improve the psychometric
quality of SETs to reduce the halo effect have been the concern of many researchers. For instance,
Huybers (2014) proposed a best–worst scaling approach, which by construction forces respondents
to consider the trade-offs between the items, thus enhancing discriminatory capability.
However, not all authors accept that the halo effect is ubiquitous or even that it is necessarily
problematic (Murphy, Jako, and Anhalt 1993). The latter point arises from halo effects being a
problem of the rater, whereas evaluations, or performance appraisals in general, potentially
remain informative, so long as they contain useable information on the average rating. Although
the halo effect reduces the reliability of within-teacher distinctions by flattening the overall
profile of ratings, on the other hand it can magnify differences in the mean ratings received
by different teachers. It follows that the bias from a halo effect is not problematic if the purpose
of SET is to distinguish a good teacher from a bad one, whilst it would be problematic if its
purpose was to distinguish between strengths and weakness within a single teacher. In general,
despite dissatisfaction with SETs, they remain widely used and are often the only instrument
available to evaluate teaching quality.

Context
We analyse data for student evaluations for an Italian university. While there is a substantial liter-
ature on SETs in universities in the USA, there are relatively few published articles for Italy. Braga,
Paccagnella, and Pellizzari (2014) report that there is a negative relationship between teacher
effectiveness and students’ evaluations using a multi-section analysis of SETs from Bocconi University;
Vanacore and Pellegrino (2019) discuss reliability of student evaluations by analysing repeated
evaluations of the same tutor; and Guerra, Bassi, and Dias (2020) discuss a method to identify
teaching activities with consistently poor SETs. None of these three papers discusses halo effects.
In our data set we have SETs from the three-year undergraduate degree/programme (laurea
triennale), the two-year masters degree/programme (laurea magistrale) and some combined
five- or six-year degrees (mainly in law and medicine). In a given year students are typically
expected to study six units/modules/courses, three units in each of the Autumn and Spring
semesters, each of which will be assessed by a separate SET. In the rest of the paper, we shall
refer to “degrees” and “units”.
The SETs were administered online in the students’ own time and the response rate was
high because students were not allowed to attempt the final (summative) assessment for a
unit until they had completed the evaluation. The SETs were administered anonymously, and
only summary statistics were made available to the lecturers. Administrative staff at the uni-
versity enabled matching of the anonymised SET data to other information. All of the evalu-
ations analysed here are of the unit leaders and lecturers alone, since many units did not have
accompanying small-group seminar teaching and there were too few data on such teaching
for a meaningful analysis.
An important feature of the Italian system is that about one-third of students never attempt
the examination and never graduate. A consequence of this is that the number of SETs that
we have is only about a half of what we should expect if every enrolled student filled out
every SET, with the exception of the Faculty of Medicine. While it may be surprising that stu-
dents would enrol and never attempt the examination, it should be remembered that Italian
university fees are very low and many students study part-time alongside working: they may
eventually decide that the value of graduating is low if they already have a job. The decision
to opt out means that the examinations are attempted by a positively selected group of stu-
dents (Cipriani and Zago 2011) and further selection may arise where some units are optional.
In principle this might create systematic biases (He and Freeman 2020) but, in a context where
a significant minority of students has minimal engagement with the education provided by
the university, our SETs correspond to the students in whom we are most interested. Nor is
it clear that this selection is really relevant for the analysis of halo effects.
The second interesting feature is that students chose to complete one of two different SETs;
one was for students who had attended the lectures and one for those who had not. Students
might be an attender for one unit and a non-attender for another unit (we have more than
one SET for most students).
The main contribution of this paper is to identify halo effects through the question on lec-
ture-room size, but the non-attenders were not asked about this (as it was irrelevant), so we
exclude the non-attenders from our analysis. It remains the case that the issue of attendance
versus non-attendance potentially raises a further issue of selection into the two types (Lalla
and Ferrari 2011), although, as with the issue of examination taking, it is not clear that this
should have any consequences for a study of halo effects. As a robustness check, we compared the attenders and non-attenders where possible and found that they are similar, suggesting that there is no selection effect.
Table 1.  Questions on the SET.


No. Text of question
1 Is the overall study load of the courses officially scheduled acceptable?
2 Is the overall organisation (timetable, midterm and end-of-term exam) acceptable?
3 Was your preliminary knowledge adequate to understand the topics covered for the exam?
4 Is there an adequate balance between the study load and the number of credits assigned to this course?
5 Are the teaching and learning materials provided adequate to study the subject?
6 Has information on the exam structure been clearly provided?
7 Is the teacher/lecturer available to provide explanations/clarifications?
8 Are you interested in the topic/s dealt with during the course?
9 Are class timetables (for lectures, tutorials, practice sessions and other teaching activities) adhered to?
10 Does the teacher/lecturer motivate students towards the subject?
11 Does the teacher/lecturer explain the topics clearly?
12 Was the teaching consistent with what stated on the website about the course?
13 Are the lecture theatres where this course is held adequate? Namely, can students see, hear, find a seat?
14 On the whole, are you satisfied with the organisation and teaching of this course?

Table 2. Descriptive statistics.

                                   SET        Student    Unit
Total                              50,806     12,534     856
Female student                     68%        66%
Undergraduate degree / unit        74%        71%        65%
Masters degree / unit              13%        14%        24%
Combined degree / shared unit      14%        14%        11%
Different campus                   5%         7%         6%
Female tutor(s)                    35%                   33%
Male tutor(s)                      48%                   47%
Mix of tutors                      16%                   19%
Business                           7,784      2,328      129
Law                                3,400      991        61
Languages                          9,379      2,305      120
Medicine                           10,666     2,115      168
Humanities                         5,544      1,456      135
Sciences                           3,128      931        94
Education                          7,105      1,701      97
Sport science                      3,800      707        52
Summary statistics for SETs used in the analysis. All of these SETs
are for students classifying themselves as attenders and SETs
are only considered if there are fourteen or more SETs for the
unit (to remove units with very small numbers of SETs).

We ignore questions which required students to choose between qualitative comments on the
unit, questions which it was optional for a student to answer (because they concerned aspects
of the unit which students had not utilised) and an open-response question. The remaining
questions which we analysed are all based on the Likert scale of 1 (bad) to 4 (good) which is
commonly used in Italy (Lalla, Facchinetti, and Mastroleo 2005) and are reported in Table 1.
We have data for three sorts of degree: three-year undergraduate degrees (Laurea Triennale),
two-year Masters degrees (Laurea Magistrale) and five- or six-year combined degrees (only in
Law or Medicine). Note that some units are shared across different degrees. Our data set includes
54,049 SETs from 12,787 students taking 1,374 units. However, the number of SETs is very low
for some units and we exclude the SETs from the smallest units. Using a rule of thumb that
the number of respondents should be greater than the number of items (i.e. fourteen or more
SETs for attenders) reduces the number of units to 856 and we confine our analysis to the SETs
for these units. Table 2 summarises the data which we use for our analysis. Female students
represent 66 per cent of the population and undergraduates 71 per cent. The largest faculty is
Medicine, which represents 21 per cent of the SETs, followed by Languages (18 per cent),
Business (15 per cent) and Education (14 per cent).
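As an illustration of the sample-selection rule described above, the following sketch (our own, not the authors’ code) drops units with fewer than fourteen attender SETs; the DataFrame and column names (sets, unit_id) are assumptions made for the example.

    import pandas as pd

    def filter_small_units(sets: pd.DataFrame, min_sets: int = 14) -> pd.DataFrame:
        """Keep only the SETs belonging to units with at least `min_sets` responses."""
        counts = sets.groupby("unit_id").size()          # number of attender SETs per unit
        keep = counts[counts >= min_sets].index          # units passing the rule of thumb
        return sets[sets["unit_id"].isin(keep)].copy()

    # Hypothetical usage: filtered = filter_small_units(sets)
    # In the paper's data this rule reduces 1,374 units to the 856 analysed here.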
Figure 1. Responses to individual questions by attenders. The histograms show the percentages of responses to all fourteen questions for 50,806 SETs completed by 12,534 attending students. Question 13 concerns the size of the lecture theatre; question 14 is overall satisfaction with the unit.

The responses for the SETs that we are analysing are illustrated in Figure 1. Question 14 is
distinguished in black as this refers to overall satisfaction with the teaching: we do not analyse
this question to search for a halo effect since it should be correlated with all of the other items
by definition. Questions 1-12 refer to various aspects of the teaching which, to varying extents,
can be influenced by the lecturer(s): questions 5-12 concern the actual course delivery (e.g.
adequacy of materials, ability to motivate and explain, availability of tutor).
Questions 1 to 4 are probably the questions which are least under the lecturers’ control and
it is notable that the distributions of the answers to questions 1–4 are different to those of
questions 5–12: the earlier four questions have many more students responding with a “3” than
a “4” and there also tend to be more “1”s and “2”s. These differences in the marginal distributions
place an upper bound of less than one on the correlation between responses to different
questions, which would attenuate the halo effect. However, lecturers do have some control over
the issues raised by these questions. For example, question three concerns the level of study
required compared to pre-existing knowledge, but lecturers have flexibility to adapt the teaching
material to the students’ prior ability, especially as there is only limited moderation of the
content of individual units.
Question 1 is perhaps the most interesting question as it asks about the overall study load
from all units studied within a semester. The strictest interpretation of this question suggests
that students should give the same answer to question 1 on all SETs answered within a given
semester. However, fewer than half of students do so, and a small minority gives widely different
answers to contemporaneous units (i.e. both a “1” and a “4”). This is potentially further evidence
for a halo effect since, as we shall see, responses to question 1 are highly correlated with
responses to the other questions. However, we are wary of drawing too strong a conclusion
from the responses to question 1, because another interpretation of the question is possible:
students might feel that the overall study load is too high because the study load is appropriate
for some units and too high for others, in which case their response to question 1 might plau-
sibly be answering a slightly different question (in other words, confusing question 1 with
question 4, which is designed to elicit whether the study load is appropriate for the given unit).
Without any means of trying to disentangle these alternative possibilities we cannot pursue
this issue further, but we do perform robustness checks to see if the responses to questions 1
to 4 affect our analysis.
We highlight question 13 in blue to draw attention to this being the item we use to identify
halo effects. The question is about the quality of the lecture room. The way that lecture rooms
are timetabled means that members of staff have effectively no control over which room they
get. This suggests that there should be no correlation between the answer to question 13 and
the first twelve questions. In other words, the adequacy of the lecture theatre should be com-
pletely independent from all aspects of teaching influenced by the lecturer, whose differences
are often quite subtle. Moreover, popular and better members of staff are likely to have more students attending lectures, so students may have less space or have to sit further away and be less able to see or hear the lecturer; this might even suggest a negative correlation.
Although not the focus of this paper, we conduct a preliminary check for intra-unit consis-
tency in how students respond to different questions. Clayson (2018) suggests that there is no
widespread agreement on the appropriate statistic to use for this; nor is there complete agree-
ment on the various statistics’ interpretation (Taber 2018). Recognising these caveats, we cal-
culated Cronbach’s alpha for each of the units we are analysing and found that the vast majority
of the units have relatively high levels of Cronbach’s alpha, with just a few of the smallest units
having an alpha less than 0.75. This means that different students tend to evaluate different
items on the same unit in a similar fashion. On the basis of these statistics there is no reason
to believe that the SETs contain fundamental problems that would invalidate our data set.
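A minimal sketch of the per-unit Cronbach’s alpha check, assuming the responses for one unit are held in a respondents-by-items DataFrame with columns q1 to q13 (hypothetical names); this is an illustration, not the authors’ code.

    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha for a respondents x items matrix of Likert responses."""
        items = items.dropna()
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed score
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    # Hypothetical usage: alpha per unit, flagging the few units below 0.75
    # q_cols = [f"q{i}" for i in range(1, 14)]
    # alphas = sets.groupby("unit_id").apply(lambda g: cronbach_alpha(g[q_cols]))
    # print((alphas < 0.75).sum())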

Halo effects: block rating and correlations


In this section we use conventional measures of correlation to confirm that halo effects do
appear to be a problem in our data set.
The extreme example of highly correlated responses is when students give the same response
to every question, which Anderson (2013) refers to as “block rating”. In the SETs which we
analyse only about 12 per cent of SETs are block graded, considerably less than the proportion
of 58 per cent found by Anderson (2013). A similar proportion of SETs utilise both extremes of
the Likert scale. Block graders’ average score is 3.31 (s.e. 0.008), compared to the average of
3.04 (s.e. 0.003) for the remaining respondents (the difference is significant: t = 34.88, p = 0.00).
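The block-rating tabulation and the comparison of means can be reproduced along the following lines; this is a sketch under assumed column names q1 to q14, not the authors’ code.

    import pandas as pd
    from scipy import stats

    Q_COLS = [f"q{i}" for i in range(1, 15)]             # assumed names for the fourteen questions

    def block_rating_summary(sets: pd.DataFrame) -> None:
        responses = sets[Q_COLS]
        block = responses.nunique(axis=1).eq(1)          # same answer to every question
        mean_score = responses.mean(axis=1)
        print("share of block-rated SETs:", block.mean())            # about 0.12 in the paper
        print(stats.ttest_ind(mean_score[block], mean_score[~block],
                              equal_var=False))          # the paper reports t = 34.88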
Since we observe more than one SET for many students, we investigate the extent to which
students consistently block rate. The final column of Table 3 shows that the proportion of SETs
which are block graded only rises slightly with the number of SETs completed by a student: stu-
dents who complete more SETs have a slightly higher tendency to block grade but, except for
students who complete a very large number of SETs, they do not do so consistently (since many
block grade sometimes but not always). There is some evidence here that the burden of completing
large number of SETs results in block grading, but the effect is weak, as can be seen in Table 3.
We conclude that block rating is much less common in our data set than in some other
studies. However, this does not mean that correlations between responses to different questions are low. Table 4 provides evidence on correlations between responses to the different questions: correlations for all question pairs are typically in the range 0.25 to 0.50 and are all statistically significant. Question 13 on the lecture room capacity usually has the lowest correlation: since the responses to question 13 should potentially have no correlation with the responses to the other questions, the fact that these correlations range from 0.24 to 0.34 is prima facie evidence for a halo effect. Note that several other correlations are within this range or just slightly above.
Table 3. Frequency of block grading by students.

Number of SETs           Block grading: number of students       Proportion of students
completed by student     Always      Sometimes      Never        who ever block grade
 1                       182         –              1664         –
 2                        71         219            1366         18%
 3                        63         356            1385         23%
 4                        38         467            1440         26%
 5                        34         498            1321         29%
 6                        18         509            1143         32%
 7                        20         416            709          38%
 8                         6         170            256          41%
 9                         3         68             70           50%
10                         –         21             12           64%
11                         –         3              4            43%
12                         –         1              1            50%

Table 4. Correlations between responses to different questions.


qn 1 qn 2 qn 3 qn 4 qn 5 qn 6 qn 7 qn 8 qn 9 qn 10 qn 11 qn 12
qn 1 1
qn 2 0.55 1
qn 3 0.37 0.33 1
qn 4 0.53 0.44 0.38 1
qn 5 0.41 0.43 0.36 0.47 1
qn 6 0.33 0.4 0.27 0.37 0.48 1
qn 7 0.33 0.39 0.29 0.37 0.45 0.48 1
qn 8 0.32 0.32 0.35 0.33 0.38 0.32 0.37 1
qn 9 0.29 0.4 0.26 0.34 0.42 0.46 0.49 0.31 1
qn 10 0.36 0.41 0.32 0.39 0.48 0.44 0.49 0.48 0.42 1
qn 11 0.37 0.41 0.34 0.39 0.51 0.47 0.5 0.45 0.43 0.68 1
qn 12 0.37 0.43 0.32 0.42 0.51 0.51 0.56 0.39 0.52 0.5 0.54 1
qn 13 0.26 0.29 0.24 0.28 0.29 0.3 0.33 0.25 0.33 0.26 0.27 0.34
The table reports the correlations (Kendall’s tau) between responses to different questions, for all 50,806 SETs. The asymp-
totic standard errors are not reported but are typically a little less than 0.005. Question 13 is the question on the lecture
room.
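The pairwise rank correlations in Table 4 can be computed directly from the individual SETs; the sketch below assumes the responses sit in columns q1 to q13 of a DataFrame (pandas computes the tau-b variant of Kendall’s tau).

    import pandas as pd

    def kendall_matrix(sets: pd.DataFrame) -> pd.DataFrame:
        """13 x 13 matrix of Kendall's tau between the question responses."""
        q_cols = [f"q{i}" for i in range(1, 14)]         # assumed column names
        return sets[q_cols].corr(method="kendall")

    # Hypothetical usage: kendall_matrix(sets)["q13"] should lie roughly between 0.24 and 0.34.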

To measure the multivariate correlation between different responses we conduct a factor
analysis of questions 1–13: Table 5 reports our results for the complete data set. Analyses of
sub-samples of the data set and analyses of individual units yielded quantitatively similar results;
as a further robustness check we also conducted the analogous principal components analysis,
which again provided similar results. The first (unrotated) factor has an eigenvalue of 5.57,
explaining 97 per cent of the variation, showing that the high bivariate correlations of Table 4
are driven by one underlying factor; the next four factors are relatively unimportant.
Table 5 also reports the loadings for the first five factors for the whole sample and a measure
of the uniqueness of each variable. We have already discussed the possibility that questions 1
to four might be different from the other questions, and there is a little evidence that question
three might be different as it has a relatively high uniqueness measure. Question 13 has a lower
loading on the first factor than the other twelve questions (a loading of 0.449 for question 13
compared to loadings between 0.524 and 0.764 for the other questions) and a much higher
uniqueness measure.
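A rough sketch of the dominant-factor structure reported in Table 5. The paper uses the principal factor method; the simpler eigendecomposition of the correlation matrix below is only an approximation (its “proportion explained” is scaled differently), but it is enough to show that one factor dominates and that question 13 loads most weakly on it. Column names are assumptions.

    import numpy as np
    import pandas as pd

    def first_component_loadings(sets: pd.DataFrame) -> pd.Series:
        q_cols = [f"q{i}" for i in range(1, 14)]         # assumed column names
        corr = sets[q_cols].corr().to_numpy()
        eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
        print("share of total variance in first component:", eigvals[0] / eigvals.sum())
        loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])   # loadings on the first component
        if loadings.sum() < 0:                           # eigenvector sign is arbitrary
            loadings = -loadings
        return pd.Series(loadings, index=q_cols)

    # Hypothetical usage: first_component_loadings(sets).idxmin() should be "q13".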
What do we conclude at this point? For the university as a whole there is little evidence of
extreme halo effects since the prevalence of block grading is low. However, responses to the
different questions are highly correlated. Responses to question 13, which is about room size
and quality, are correlated with responses to the other questions, even if the correlation is lower than the correlation between the other questions. This is all corroborative evidence for a halo effect, but we have not yet considered any independent information about the rooms.
Table 5. Factor analysis – estimated eigenvalues (components).

                                   First factor   Second factor   Third factor   Fourth factor   Fifth factor
Proportion explained               0.966          0.089           0.053          0.015           0.007

Loadings                           First factor   Second factor   Third factor   Fourth factor   Fifth factor   Uniqueness
Q.1  Overall study load            0.630          0.414           −0.020         −0.075          0.022          0.425
Q.2  Organisation                  0.664          0.243           0.074          −0.134          0.070          0.472
Q.3  Adequacy of prior knowledge   0.524          0.188           −0.118         0.153           −0.001         0.653
Q.4  Study load and credits        0.644          0.298           −0.008         0.029           −0.063         0.492
Q.5  Teaching materials            0.725          0.019           −0.006         0.016           −0.123         0.458
Q.6  Examination information       0.678          −0.122          0.148          −0.034          −0.068         0.497
Q.7  Teacher availability          0.710          −0.175          0.142          0.033           0.030          0.444
Q.8  Topic interesting             0.591          −0.032          −0.207         0.128           0.057          0.587
Q.9  Adherence to timetable        0.652          −0.142          0.226          −0.010          0.045          0.502
Q.10 Teacher motivates students    0.753          −0.200          −0.255         −0.076          0.030          0.322
Q.11 Teacher’s explanation         0.764          −0.209          −0.229         −0.070          −0.010         0.315
Q.12 Consistency with website      0.759          −0.149          0.142          0.022           −0.014         0.381
Q.13 Lecture room                  0.449          0.034           0.151          0.114           0.058          0.758
The table presents results from a factor analysis on all 50,806 students (analogous analyses for sub-samples provide similar
results and suggest that results are uniform across the whole data set). The factors are unrotated and are estimated
using the principal factor method. The first row reports the proportion of variation explained by each of the first five
factors: since we have calculated the proportion explained using all thirteen eigenvalues and the sixth and subsequent
eigenvalues are negative, the total proportion explained appears to be greater than one (the factors with negative
eigenvalues – which we have not reported – appear to explain a negative proportion). The subsequent rows report the
loadings of the first five factors on to the individual questions and the uniqueness, which is the fraction of the variation
in each question not explained by the common factors.


Identifying halo effects using room size


For one faculty we have been able to get the exact information on the room capacity for all
of the teaching and we use this as an independent measure to analyse the responses to ques-
tion 13. Since the question is about availability of a seat, we need to compare room capacity
with number of students. Some units were lectured in more than one room and so we calculate
the average room size and compare this with the number of SETs completed by attending
students, so for unit $u$ the number of seats per student is

$$R_u \equiv \frac{\text{Average room size}_u}{\text{Attending SETs}_u}$$

Robustness checks using the room size of the smallest room used for lectures (instead of
the average of the room sizes) or the total number of SETs (instead of just attenders) gave
similar results. In addition to this, some units were lectured at a different campus and we control
for this with an indicator dummy variable $C_u$.
To explore fully the effects of room size and halo effects on the response to question 13,
we estimate three different regression specifications so that we can compare their results.
Specification (i) uses only the variables $R_u$ and $C_u$ to model the explanatory power of these variables ignoring the halo effect; specification (ii) uses only the responses to questions 1 to
12 to quantify the apparent strength of halo effects when other information is unavailable. Our
final specification (iii) includes both sets of variables to obtain the partial explanatory power
of both the information on rooms and the halo effect.
We can estimate this relationship at either unit level or SET level, although the room and
campus variables only vary at unit level. The most general version of the unit-level model is of
the form:

$$\bar{y}_u = \beta_0 + \beta_1 R_u + \beta_2 C_u + \sum_{q=1}^{12} \gamma_q \bar{x}_{q.u} + \varepsilon_u$$

where $\bar{y}_u$ is the mean response to question 13 for unit $u$ and $\bar{x}_{q.u}$ is the corresponding mean unit response to question $q = 1, \dots, 12$. Since these variables are unit means they can be treated as being approximately continuous; furthermore, the variable $\bar{y}_u$ lies between 1.9 and 3.6, so there is no obvious issue of censoring. Therefore, we estimate this model using ordinary least squares. Note that the corresponding restricted regressions are:

$$\bar{y}_u = \beta_0 + \beta_1 R_u + \beta_2 C_u + \varepsilon_u$$

$$\bar{y}_u = \beta_0 + \sum_{q=1}^{12} \gamma_q \bar{x}_{q.u} + \varepsilon_u$$

As noted earlier, lecturers may have less influence over the issues asked about in questions
1 to 4. As a robustness check, we re-estimate regression (iii) with only the responses to ques-
tions 5 to 12.
To estimate the model at SET level we take explicit account of the fact that the individual responses, including $y_i$, are ordinal variables. For each question $q$, we create three indicator dummy variables:

$$D_{q.2.i} = 1 \text{ if the response to question } q \text{ is } 2\ (0 \text{ otherwise})$$

$$D_{q.3.i} = 1 \text{ if the response to question } q \text{ is } 3\ (0 \text{ otherwise})$$

$$D_{q.4.i} = 1 \text{ if the response to question } q \text{ is } 4\ (0 \text{ otherwise})$$

and then estimate the relationship by ordered probit:

$$\Pr(y_i = 1, 2, 3, 4) = f\!\left(\beta_0 + \beta_1 R_u + \beta_2 C_u + \sum_{q=1}^{12} \sum_{x=2}^{4} \gamma_{q.x} D_{q.x.i} + \varepsilon_i\right)$$
The parameter estimates for $\beta_1$ and $\beta_2$ are reported in Table 6, alongside tests for: (i) the joint significance of the room variables $R_u$ and $C_u$; and (ii) the joint significance of the other question variables. As with the unit-level regression, we re-estimate this specification omitting questions 1 to 4.
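The two estimation levels can be sketched as follows, assuming a unit-level DataFrame (units) with the seats-per-student ratio R, the campus dummy and the mean responses, and a SET-level DataFrame (sets); the column names are our own, and the cluster-robust standard errors used at SET level are omitted from the ordered probit call for brevity. This illustrates the specifications rather than reproducing the authors’ code.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    def unit_level_ols(units: pd.DataFrame):
        """OLS of the mean unit response to question 13 on R_u, C_u and questions 1-12."""
        exog = ["R", "campus"] + [f"q{i}_mean" for i in range(1, 13)]
        X = sm.add_constant(units[exog])
        return sm.OLS(units["q13_mean"], X).fit(cov_type="HC1")   # robust standard errors

    def set_level_ordered_probit(sets: pd.DataFrame):
        """Ordered probit of the SET-level response to question 13 on R_u, C_u and
        indicator dummies for responses {2,3,4} to each of questions 1-12."""
        q_cols = [f"q{i}" for i in range(1, 13)]
        dummies = pd.get_dummies(sets[q_cols].astype("category"), drop_first=True)
        X = pd.concat([sets[["R", "campus"]], dummies], axis=1).astype(float)
        model = OrderedModel(sets["q13"], X, distr="probit")
        return model.fit(method="bfgs", disp=False)

    # Hypothetical usage: unit_level_ols(units).params[["R", "campus"]]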
Table 6. Regression analysis for responses to question 13.

Specifications (1)-(4): OLS at unit level
                                   (1) OLS        (2) OLS        (3) OLS        (4) OLS
Room capacity                      0.125**        –              0.198***       0.212***
                                   (0.041)                       (0.041)        (0.040)
Second campus                      0.641***       –              0.581***       0.583***
                                   (0.096)                       (0.093)        (0.091)
Questions 1-4 (four variables)     –              yes            yes            –
Questions 5-12 (eight variables)   –              yes            yes            yes
Rooms joint test                   F(2,58)=30.8   –              F(2,46)=37.3   F(2,50)=49.3
                                   [p=0.000]                     [p=0.000]      [p=0.000]
Questions joint test               –              F(12,48)=2.0   F(12,46)=4.7   F(8,50)=5.2
                                                  [p=0.046]      [p=0.000]      [p=0.000]
N                                  61             61             61             61
R²                                 0.470          0.200          0.676          0.649

Specifications (5)-(8): ordered probit at SET level
                                   (5) OProbit    (6) OProbit    (7) OProbit    (8) OProbit
Room capacity                      0.245***       –              0.285***       0.284***
                                   (0.050)                       (0.051)        (0.050)
Second campus                      0.705***       –              0.730***       0.701***
                                   (0.121)                       (0.123)        (0.124)
Dummies for questions 1-4          –              yes            yes            –
Dummies for questions 5-12         –              yes            yes            yes
Rooms joint test                   χ²(2)=57       –              χ²(2)=62       χ²(2)=60
                                   [p=0.000]                     [p=0.000]      [p=0.000]
Questions joint test               –              χ²(36)=284     χ²(36)=659     χ²(24)=238
                                                  [p=0.000]      [p=0.000]      [p=0.000]
N                                  4318           4318           4318           4318
Pseudo R²                          0.058          0.043          0.109          0.098
The explained variable is the response to question 13: in specifications (1) to (4) this is the mean unit response to this question; in specifications (5) to (8) it is the
SET-level response taking the discrete values {1,2,3,4}. Some estimated parameters are not reported to save space: all specifications also include a constant; speci-
fications (2) and (3) include the mean unit responses to questions 1-12 as explanatory variables; specifications (6) and (7) include a complete set of indicator dummies
for the responses to questions 1-12 as explanatory variables. Robust standard errors in parentheses; in specifications (5) to (8) these are clustered at unit level.
* p < 0.05; ** p < 0.01; *** p < 0.001.

Regression results
At unit level, specification (1), the room and campus variables individually and jointly have explan-
atory power for the average response to question 13: the room size variable has the correct sign
and the variables are statistically significant at the 0.1% level. This suggests that there is some
external validity in the response to question 13. A regression of the average response to question
13 reported in specification (2) confirms that there is correlation between the response to question
13 and the other questions (although this is not a conventional halo effect as the relationship is
at unit rather than at student level). Comparing regression (4) with regression (3) in Table 6, we
see that it makes very little difference whether we include questions 1 to 4 or not.
Specification (3) provides information on the partial correlations. The parameter on the room
variable increases a little when we control for halo effects, suggesting that if we had omitted
the halo effects we should have slightly under-estimated the true magnitude of the effect of
room size on students’ evaluation of this particular issue. The coefficient of multiple correlation
(R-squared) is much larger for the general model than for either of the two restricted models, demonstrating that each set of variables has significant predictive power for the answer to question 13. This is consistent
with the suggestion by Murphy, Jako, and Anhalt (1993) that the presence of halo effects need
not invalidate SETs as a means to identify an item since, crudely speaking, the variation in
responses is the sum of both the underlying effect and the halo effect. The magnitudes of the
R-squareds in specification (1) and specification (2) show that the room information variables
have much more explanatory power than the halo effects.
We obtain even stronger results when we estimate the relationship at SET level. The parameter
estimates on the room size variable are larger than in the unit-level models since variation between
students has not been averaged away: these estimates suggest that an increase of seats per stu-
dent by one raises satisfaction with the room by an expected value of just under one half. Since
all rooms were in principle large enough for all students to have a seat, this effect is quite large.
At SET level it also appears that the information about rooms has more explanatory power
than the halo effect: there is no straightforward analogue to the goodness-of-fit measure for
ordered probit models but the pseudo R-squared of 0.109 from specification (7) is larger than
the corresponding value of 0.043 from specification (6).

Conclusion
Student evaluations of teaching (SET) may be affected by a halo effect, that is by the students’
failure to discriminate among the different and potentially independent items being tested.
Halo effects arise when students form global impressions of a teacher and subsequently fail to
distinguish among the different dimensions considered relevant to teaching effectiveness. Thus,
significant correlations between items of the questionnaire which should be unrelated indicate
the presence of a halo effect. If SETs have a diagnostic goal and not only a summative goal,
that is if they are meant to provide specific feedback in order to improve teaching performance,
halo effects represent a problem. Therefore, a considerable amount of research has tested for
this effect and suggested ways to reduce it. However, testing for halo effects is usually very
difficult, since the various items in a SET questionnaire may actually be related and therefore
any correlation could be the result of real similarities rather than a halo-effect error.
In this paper, we use an original dataset and a novel identification procedure to identify halo effects. This procedure relies on a question about the characteristics of the lecture room which can be considered genuinely independent of the other question variables. Our empirical results confirm that halo effects are present in SETs, but also show that evaluations remain informative
of the various aspects being judged, despite the distortion. In other words, we find no evidence
to corroborate the need for higher education institutions to design evaluations differently
because of halo effects.
Acknowledgements
We should like to thank the university administrators who provided data to us in an anonymised form. Any
remaining errors are the authors’ own.

ORCID
Edmund Cannon http://orcid.org/0000-0002-1947-8499
Giam Pietro Cipriani http://orcid.org/0000-0001-5436-0835

References
Adams, M. J. D., and P. D. Umbach. 2012. “Nonresponse and Online Student Evaluations of Teaching:
Understanding the Influence of Salience, Fatigue, and Academic Environments.” Research in Higher
Education 53 (5): 576–591. doi:10.1007/s11162-011-9240-5.
Al-Issa, A., and H. Sulieman. 2007. “Student Evaluations of Teaching: Perceptions and Biasing Factors.”
Quality Assurance in Education 15 (3): 302–317. doi:10.1108/09684880710773183.
Anderson, J. A. 2013. Student Feedback Measures: Meta-Analysis. A Report to the Academic Senate.
University of Utah. https://collections.lib.utah.edu/details?id=214165&facet_school_or_college_
t=%22College+of+Humanities%22&facet_setname_s=ir_eua
Becker, W. E., W. Bosshardt, and M. Watts. 2012. “How Departments of Economics Evaluate Teaching.”
The Journal of Economic Education 43 (3): 325–333. doi:10.1080/00220485.2012.686826.
Braga, M., M. Paccagnella, and M. Pellizzari. 2014. “Evaluating Students’ Evaluations of Professors.”
Economics of Education Review 41: 71–88. doi:10.1016/j.econedurev.2014.04.002.
Case, A., and C. Paxson. 2008. “Stature and Status: Height, Ability, and Labor Market Outcomes.”
The Journal of Political Economy 116 (3): 499–532. doi:10.1086/589524.
Cipriani, G.-P., and A. Zago. 2011. “Productivity or Discrimination? Beauty and the Exams.” Oxford
Bulletin of Economics and Statistics 73 (3): 428–447. doi:10.1111/j.1468-0084.2010.00619.x.
Clayson, D. E. 2018. “Student Evaluation of Teaching and Matters of Reliability.” Assessment & Evaluation
in Higher Education 43 (4): 666–681. doi:10.1080/02602938.2017.1393495.
Cooper, W. H. 1981. “Ubiquitous Halo.” Psychological Bulletin 90 (2): 218–244. doi:10.1037/0033-
2909.90.2.218.
Darby, J. A. 2007. “Are Course Evaluations Subject to a Halo Effect?” Research in Education 77 (1):
46–55. doi:10.7227/RIE.77.4.
Feeley, T. H. 2002. “Evidence of Halo Effects in Student Evaluations of Communication Instruction.”
Communication Education 51 (3): 225–236.
Feistauer, D., and T. Richter. 2018. “Validity of Students’ Evaluations of Teaching: Biasing Effects of
Likability and Prior Subject Interest.” Studies in Educational Evaluation 59: 168–178. doi:10.1016/j.
stueduc.2018.07.009.
Guerra, M., F. Bassi, and J. G. Dias. 2020. “A Multiple‑Indicator Latent Growth Mixture Model to Track
Courses with Low‑Quality Teaching.” Social Indicators Research 147 (2): 361–381. doi:10.1007/s11205-
019-02169-x.
Hamermesh, D. S., and A. Parker. 2005. “Beauty in the Classroom: Instructors’ Pulchritude and Putative
Pedagogical Productivity.” Economics of Education Review 24 (4): 369–376. doi:10.1016/j.econ-
edurev.2004.07.013.
Hativa, N., and A. Raviv. 1993. “Using a Single Score for Summative Teacher Evaluation by Students.”
Research in Higher Education 34 (5): 625–646. doi:10.1007/BF00991923.
He, J., and L. A. Freeman. 2020. “Can we Trust Teaching Evaluations When Response Rates Are Not
High? Implications from a Monte Carlo Simulation.” Studies in Higher Education. doi:
10.1080/03075079.2019.1711046.
Huybers, T. 2014. “Student Evaluation of Teaching: The Use of Best-Worst Scaling.” Assessment &
Evaluation in Higher Education 39 (4): 496–513. doi:10.1080/02602938.2013.851782.
Isaacson, R. L., W. J. McKeachie, J. E. Milholland, Y. G. Lin, M. Hofeller, J. W. Baerwaldt, and K. L. Zinn. 1964.
“Dimensions of Student Evaluations of Teaching.” Journal of Educational Psychology 55 (6): 344–351.
doi:10.1037/h0042551.
Keeley, J. W., T. English, J. Irons, and A. M. Henslee. 2013. “Investigating Halo and Ceiling Effects in
Student Evaluations of Instruction.” Educational and Psychological Measurement 73 (3): 440–457.
doi:10.1177/0013164412475300.
Lalla, M., and D. Ferrari. 2011. “Web‐Based versus Paper‐Based Data Collection for the Evaluation of
Teaching Activity: Empirical Evidence from a Case Study.” Assessment & Evaluation in Higher Education
36 (3): 347–365. doi:10.1080/02602930903428692.
Lalla, M., G. Facchinetti, and G. Mastroleo. 2005. “Ordinal Scales and Fuzzy Set Systems to Measure
Agreement: An Application to the Evaluation of Teaching Activity.” Quality & Quantity 38 (5): 577–
601. doi:10.1007/s11135-005-8103-6.
Mendez, J. M., and J. P. Mendez. 2016. “Student Inferences Based on Facial Appearance.” Higher
Education 71 (1): 1–19. doi:10.1007/s10734-015-9885-7.
Murphy, K. R., R. A. Jako, and R. L. Anhalt. 1993. “Nature and Consequences of Halo Error: A Critical
Analysis.” Journal of Applied Psychology 78 (2): 218–225. doi:10.1037/0021-9010.78.2.218.
Nisbett, R. E., and T. D. Wilson. 1977. “The Halo Effect: Evidence for Unconscious Alteration of
Judgements.” Journal of Personality and Social Psychology 35 (4): 250–256. doi:10.1037/0022-
3514.35.4.250.
Patrick, C. L. 2011. “Student Evaluations of Teaching: Effects of the Big Five Personality Traits, Grades
and the Validity Hypothesis.” Assessment & Evaluation in Higher Education 36 (2): 239–249.
doi:10.1080/02602930903308258.
Shevlin, M., P. Banyard, M. Davies, and M. Griffiths. 2000. “The Validity of Student Evaluation of
Teaching in Higher Education: Love Me, Love my Lectures?” Assessment & Evaluation in Higher
Education 25 (4): 397–405. doi:10.1080/713611436.
Spooren, P., B. Brockx, and D. Mortelmans. 2013. “On the Validity of Student Evaluation of Teaching:
The State of the Art.” Review of Educational Research 83 (4): 598–642. doi:10.3102/0034654313496870.
Taber, K. S. 2018. “The Use of Cronbach’s Alpha When Developing and Reporting Research Instruments
in Science Education.” Research in Science Education 48 (6): 1273–1296. doi:10.1007/s11165-016-
9602-2.
Thorndike, E. L. 1920. “A Constant Error in Psychological Ratings.” Journal of Applied Psychology 4 (1):
25–29. doi:10.1037/h0071663.
Uttl, B., C. A. White, and D. Wong-Gonzalez. 2017. “Meta-Analysis of Faculty’s Teaching Effectiveness:
Student Evaluation of Teaching Ratings and Student Learning Are Not Related.” Studies in Educational
Evaluation 54: 22–42. doi:10.1016/j.stueduc.2016.08.007.
Vanacore, A., and M. S. Pellegrino. 2019. “How Reliable Are Students’ Evaluations of Teaching (SETs)?
A Study to Test Student’s Reproducibility and Repeatability.” Social Indicators Research 146 (1–2):
77–89. doi:10.1007/s11205-018-02055-y.
Wright, S. L., and M. A. Jenkins-Guarnieri. 2012. “Student Evaluations of Teaching: Combining the
Meta-Analyses and Demonstrating Further Evidence for Effective Use.” Assessment & Evaluation in
Higher Education 37 (6): 683–699. doi:10.1080/02602938.2011.563279.
