
Assessment & Evaluation in Higher Education, Vol. 25, No. 4, 2000

The Validity of Student Evaluation of Teaching in Higher Education: love me, love my lectures?

MARK SHEVLIN, School of Behavioural and Communication Sciences, University of Ulster at Magee, Northern Ireland
PHILIP BANYARD, MARK DAVIES & MARK GRIFFITHS, Division of Psychology, The Nottingham Trent University, Nottingham, UK

ABSTRACT This paper examines the validity of student evaluation of teaching (SET) in universities. Recent research demonstrates that evaluations can be influenced by factors other than teaching ability, such as student characteristics and the physical environment. In this study, it was predicted that students’ perception of the lecturer would significantly predict teaching effectiveness ratings. Using an 11-item student rating scale (N = 199), a two-factor confirmatory factor model of teaching effectiveness was specified and estimated using LISREL 8; the factors were ‘lecturer ability’ and ‘module attributes’. This initial model was extended to include a factor relating to the students’ ratings of the lecturer’s charisma. The model was an acceptable description of the data. The charisma factor explained 69% and 37% of the variation in the ‘lecturer ability’ and ‘module attributes’ factors respectively. These findings suggest that student ratings do not wholly reflect actual teaching effectiveness. It is argued that a central trait exists which influences a student’s evaluation of the lecturer.

Introduction
What makes a good teacher, and how can we recognise him or her? First, we might value a teacher by their ability to effect personal change and development in their students. This is a long-term outcome and problematic if we attempt to quantify it. Second, we might value a teacher by their effectiveness in facilitating good academic work in their students. This is more measurable, and it is currently the subject of debate in the UK as the British Government considers incentives for teachers based on examination results.
A third way of evaluating teachers is to ask their students to rate them. This is the most
immediate and the most widely used of the three strategies and is commonly measured
by questionnaire at the end of courses. One of the issues to consider is whether we are
measuring the most important variables of teaching effectiveness or whether some
variables are becoming more important just because they are measurable. A further issue
to consider, and the one that is addressed in this paper, is the validity of measures of
teaching effectiveness gathered from student evaluations.
The practice of student evaluation of teaching (SET) in universities is ubiquitous in
the UK and the US. In the UK, information from SET is treated as important evaluative evidence, but also as a guide for potential changes in course material and method of delivery. The significance of SET is noted by the Quality Assurance
Agency for Higher Education (QAA) in the documentation regarding subject review
practices, in particular quality assessment and management (QAA, 1997). In the US,
information from SET can be used for faculty decisions about conditions of employment
such as salary and promotion. In short, SET is an integral part of higher education
practices.
Despite the perceived importance of SET there are theoretical and psychometric issues
related to the assessment of teaching effectiveness that are yet unresolved. First, there
appears to be little agreement on the nature and number of dimensions that represent
teaching effectiveness (Patrick & Smart, 1998). Studies predominantly use questionnaires and factor analysis to derive the dimensions of effective teaching. For example,
Swartz et al. (1990) identify the two factors of effective teaching as (1) clear instruc-
tional presentation, and (2) management of student behaviour, whereas Lowman and
Mathie (1993) identify them as (1) intellectual excitement, and (2) interpersonal rapport.
There is no obvious mapping between these two pairs of dimensions. Further studies
identify more and different factors of teaching effectiveness. For example Brown and
Atkins (1993) identify the three factors of effective teachers as (1) caring, (2) systematic,
and (3) stimulating, whereas Patrick and Smart (1998) identify the three factors of
teaching effectiveness as (1) respect for students, (2) organisation and presentation
skills, and (3) ability to challenge students. Other researchers have suggested as many
as seven factors (Ramsden, 1991) or nine factors of effective teaching (Marsh & Dunkin,
1992).
In terms of the psychometric properties of evaluation instruments the primary issue
of concern is validity. A number of extraneous variables have been examined that
may confound the measurement of teaching effectiveness. The relationships between
ratings of teaching effectiveness and variables related to student characteristics, lecturer
behaviour, and the course administration have been examined (d’Apollonia & Abrami,
1997). For example, in relation to student characteristics, Marsh (1987) and Feldman
(1976) reported a positive association between expected grades and ratings of teaching
effectiveness. Further to this, Marsh & Roche (1997) reported similar relationships
between ratings and the prior subject interest of the student and the reason for taking the
course. The variable related to the lecturer behaviour that has received the greatest
research interest is that of grading leniency. Using a large sample of American students
Greenwald and Gillmore (1997) demonstrated that grading leniency had a strong positive
relationship with ratings of teaching effectiveness. With regard to the effect of course
administration, there is, for example, a weak relationship between class size and student
ratings, with the largest and the smallest classes giving the most positive ratings (Fernández et al., 1998). A further problem concerns the validity of the conclusions that
are drawn from SET data due to the lack of statistical sophistication in the personnel
committees that may use the information (McKeachie, 1997). Overall, research on the
effects of extraneous variables on the validity of SET suggests the need for caution in
the interpretation of this data.
It would appear, then, that consensus on the characteristics of effective teaching is
low, and there are a number of factors that challenge the validity of the data. There
is also disagreement on whether the different dimensions are discrete or are representative of a single higher-order teaching effectiveness dimension (Abrami et al.,
1997; Marsh & Roche, 1997). It is argued here that if students have a positive personal
and/or social view of the lecturer this may lead to more positive ratings irrespective
of the actual level of teaching effectiveness. Support for this idea comes from the
classic work of Asch on implicit personality theories (Asch, 1946; Bruner & Tagiuri,
1954). Studies found that manipulation of bi-polar attributes such as warm-cold
(e.g. Kelley, 1950) produced a large effect in student judgements of lecturers. So-called
halo and horns effects (Vernon, 1964) can also be argued to have an impact. These
studies illustrate how single attributes are generalised to other judgements of the
individual.
Students may respond to a central quality of leadership that then influences their
evaluations of teachers. One approach to leadership that offers parallels to teaching is
charismatic leadership. For example, House’s (1977) theory of charismatic leadership
emphasises the relationship between the leader and the follower. According to this approach, the principal behavioural features of a charismatic leader are: (1) impression
management, by which the leader creates the impression of competence; (2) setting an
example, by which followers are encouraged to identify with the leader’s beliefs and
values; (3) setting high expectations about the followers’ performance; (4) providing an
attractive vision for the future; and (5) arousing motivation in the followers to be
productive. A development of this approach can be seen in Bass’s model of transformational leadership (Bass, 1990). This model has four components (the four I’s): (1) individual consideration, or leadership by developing people; (2) intellectual stimulation; (3) inspirational motivation; and (4) idealised influence. This last component is often seen as the charismatic element of transformational leadership. The distinction between transformational leadership and charismatic leadership is not clear (Shackleton, 1995), and even if we make the distinction, the feature of Bass’s model that has been found to have the greatest effect on satisfaction ratings is idealised influence (or charisma)
(Bryman, 1992). The features of charismatic leadership and transformational leadership
resemble the features of teaching effectiveness identified above (Patrick & Smart, 1998).
It is argued here that the quality of charisma affects judgements including that of
teaching effectiveness. Charisma has been shown to affect voter judgements of politi-
cians (Pillai et al., 1997), as well as leadership at work (Fuller et al., 1996). Distinctions
are drawn between expert power, referent power and charisma, though it has been shown, in a study of public sector workers, that the only characteristic which influenced workers’ ratings of satisfaction with their supervision was charisma (Kudisch et al., 1995).
The impact of charisma in student evaluations of teachers is further enhanced because
of the special features of the teacher’s role as a ‘critical other’, in which they challenge
students, assess students and attempt to motivate students (Woods, 1993). It is argued
that charisma is such a salient trait in students’ perceptions of teachers that it affects
assessment of teacher effectiveness. From these various literatures, a study was devised
to examine the relationship between charisma and teaching effectiveness. It was
predicted that the student’s perception of the lecturer would significantly predict teaching
effectiveness ratings.

Method
Sample
The sample consisted of 213 undergraduate students at a UK university in the Midlands. They were all enrolled full-time on courses within a department of social sciences. Due to the anonymous nature of the evaluation, no details of demographic variables are available, although there is no apparent reason why the profile of the students at the university would significantly differ from that of other institutions in the area. The sample size after listwise deletion of missing data was 199. The participants were required to rate their lecturer. In total, eight lecturers (four males and four females) were rated during this study.

Measurements
An 11-item teaching effectiveness self-report scale (Appendix 1) was administered to students by a member of lecturing staff. The scale was designed to measure two dimensions of teaching effectiveness. Six items relating to lecturer attributes (items 1, 2, 3, 4, 5, 11) measure the ‘lecturer ability’ factor, and five items relating to aspects of the particular module (items 6, 7, 8, 9, 10) measure the ‘module attributes’ factor. Responses to the items were made on a 5-point Likert scale anchored with ‘strongly agree’ and ‘strongly disagree’. An additional item, ‘The lecturer has charisma’, was included, using the same response format as the other items.

Analysis
The model presented in Figure 1 was specified and estimated using LISREL 8 (Jöreskog & Sörbom, 1993).
Figure 1 specifies a two-factor measurement model for the 11 items (y1–y11) measuring student evaluations. The two factors, lecturer ability (η1) and module attributes (η2), are measured by their respective items in the self-report teaching evaluation scale. The factor loadings are given the symbol λ, and the error variances for each item the symbol ε. The lecturer ability (η1) and module attributes (η2) factors are regressed on the charisma factor (η3). The regression coefficients are symbolised as β. As the charisma factor is measured by a single item (y12), its reliability was specified at 0.478, which was the average reliability of the other 11 items in the scale. The model estimates can be used to determine the percentage of variation in the lecturer ability and module attributes factors that is attributable to the charisma factor.
From the sample data, a covariance matrix was computed using PRELIS 2 (Jöreskog & Sörbom, 1993), and the model was estimated using maximum likelihood.
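The single-indicator constraint described above can be sketched in a few lines. This is a minimal illustration of the standard fixed-reliability correction for a single-indicator latent variable, not the authors’ actual LISREL syntax, and the sample variance used below is hypothetical.

```python
import math

# Fixed-reliability correction for a single-indicator factor, applied to
# the charisma item (y12): with reliability r fixed at 0.478, the error
# variance is (1 - r) * Var(y12) and the loading is sqrt(r * Var(y12)).
r = 0.478        # assumed reliability: average of the other 11 items
var_y12 = 1.30   # hypothetical sample variance of the charisma item

error_variance = (1.0 - r) * var_y12   # fixed error variance for y12
loading = math.sqrt(r * var_y12)       # fixed loading of y12 on charisma

print(round(error_variance, 3))
print(round(loading, 3))
```

Fixing these two parameters identifies the charisma factor despite its having only one indicator, at the cost of assuming the stated reliability is correct.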

Results
The fit indices show that the model is a reasonable description of the data (χ² = 114, df = 52, p < 0.05; RMSEA = 0.075; SRMR = 0.049; GFI = 0.92; CFI = 0.94; IFI = 0.94). The standardised parameter estimates are reported in Table 1.
The factor loadings indicate that the items used in the teaching effectiveness self-report scale are good indicators of the lecturer ability and module attributes factors. All the factor loadings are positive, high and statistically significant. The standardised regression coefficients from the charisma factor to the lecturer ability (β13) and module attributes (β23) factors are 0.83 and 0.61 respectively. These effects are statistically significant (p < 0.05). Therefore the charisma factor accounts for 69% of the variation of the lecturer ability factor and 37% of the module attributes factor.

FIG. 1. Model of teaching effectiveness and charisma factors. [Path diagram: items y1–y11, each with an error term ε1–ε11, load (λ) on the lecturer ability (η1) and module attributes (η2) factors; both factors are regressed (β13, β23) on the charisma factor (η3), which is measured by the single item y12 with error ε12.]

TABLE 1. Standardised parameter estimates for teaching effectiveness ratings model

Parameter    Estimate
λ11          0.60*
λ21          0.76*
λ31          0.82*
λ41          0.77*
λ51          0.77*
λ62          0.53*
λ72          0.54*
λ82          0.56*
λ92          0.74*
λ10,2        0.85*
λ11,1        0.67*
β13          0.83*
β23          0.61*

Note: *p < 0.05.
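Because the charisma paths are standardised coefficients from a single predictor, the percentage of variance explained in each factor is simply the squared path coefficient, which can be checked directly:

```python
# Variance explained by the charisma factor: with a single standardised
# predictor, R^2 for each outcome factor is the squared path coefficient.
beta_13 = 0.83  # charisma -> lecturer ability
beta_23 = 0.61  # charisma -> module attributes

r2_ability = beta_13 ** 2   # 0.83 squared
r2_module = beta_23 ** 2    # 0.61 squared

print(f"lecturer ability: {r2_ability:.0%}")    # 69%
print(f"module attributes: {r2_module:.0%}")    # 37%
```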

Discussion
The main aim of this study was to determine whether a halo effect occurs in the completion of SET ratings and to estimate the magnitude of this effect. The results indicate that a halo effect does indeed operate during the measurement of teaching effectiveness, as the relationships between the charisma factor and the lecturer ability and module attributes factors were statistically significant. Indeed, the effect is large, with the charisma factor accounting for 69% and 37% of the variation in the lecturer ability and module attributes factors respectively. This means that a significant proportion of the scale’s variation reflects a personal view of the lecturer in terms of their charisma rather than lecturing ability and module attributes. The authors acknowledge that alternative specifications of the model are possible, but on the basis of substantive psychological theory and previous research we have specified this particular model. For example, an alternative model could be specified in which the direction of influence is from the SET factors to the charisma factor. This specification would suggest that lecturers are attributed a level of charisma based on their level of ‘lecturer ability’ and ‘module attributes’; that is, the better the lecturer, the more charismatic they are rated. In addition, cross-lagged designs or models with reciprocal effects between the charisma and teaching effectiveness factors are interesting alternatives that may be examined in future research.
The results of this study raise issues regarding the interpretation and utility of SET ratings. The SET ratings were demonstrated to be significantly affected by the students’ perception of the lecturer on a variable that should be unrelated to assessments of teaching ability, thereby questioning the validity of this particular scale. Given the prevalence of the halo effect, these findings are likely to generalise to most teaching assessment instruments. In addition, the results raise questions about how the effect of confounding variables can be minimised, thereby increasing the validity of SET ratings.
The two-factor structure of the scale, with high factor loadings, would appear to suggest that meaningful and useful variables related to teaching quality were being measured. However, this is not the case: the two factors reflect a positive halo effect as well as variance attributable to teaching quality. This raises questions regarding the utility of information from such scales, since the attribute of charisma is having a central trait effect on student evaluations.
The wide discrepancy in the factors of effective teaching identified above (Brown & Atkins, 1993; Marsh & Roche, 1997; Patrick & Smart, 1998; Ramsden, 1991) can be partly attributed to the existence of an underlying variable. It is argued that this underlying attribute is the personal quality of leadership commonly described as charisma. An alternative explanation is that the effectiveness of a teacher affects the ratings of charisma, though this explanation still implies a single underlying trait that accounts for SET scores.

It is not argued that good and effective teaching is a one-dimensional skill. Teaching has been shown to be multi-dimensional, as are well-designed SET forms. The issue is how students approach the evaluation of teaching and how they use the SET forms. They are not trained in rating or psychometrics, and it is argued here and elsewhere (for example, d’Apollonia & Abrami, 1997) that they rate specific features of teaching on the basis of a global evaluation. That global factor is lecturer charisma.
This study presents a challenge to the use of SET in higher education and, in
particular, raises questions of fairness if such ratings are to be used in decisions relating
to employment issues.

Notes on Contributors
DR MARK SHEVLIN currently works for the University of Ulster at Magee College.
His research interests are in the areas of statistics and research methodology, in
particular the use of structural equation modelling in assessing reliability and validity.
He has published research papers in the areas of psychometric evaluation and
multivariate analysis. Correspondence: Tel: 01504 375619. Fax: 01504 375402.
E-mail: m.shevlin@ulst.ac.uk
PHIL BANYARD has been teaching psychology for 20 years and he is currently
associate Senior Lecturer at Nottingham Trent University. He is the author of four
student textbooks and has written several articles on teaching psychology. He is chief
examiner for an ‘A’ Level in Psychology and contributes to training events for
psychology teachers.
DR MARK DAVIES is a Senior Lecturer at Nottingham Trent University. He has
previously held posts at Nottingham University and University College London. He
has authored many refereed journal papers and has edited a text on animal behaviour.
His research interests include perception, comparative psychology, lateralisation of
function, evolutionary psychology, and cognition and emotion.
DR MARK GRIFFITHS is a Reader in Psychology at the Nottingham Trent University. He is internationally known for his work on gambling and gaming addictions and was the first recipient of the John Rosecrance Research Prize for ‘Outstanding scholarly contributions to the field of gambling research’ in 1994 and the winner of the 1998 CELEJ Prize. He has published over 65 refereed research papers, numerous book chapters and over 125 other articles. His current interests are non-drug addictions (e.g. gambling, computer games, Internet, exercise, sex, etc.), interactive technology (computer games, internet, virtual reality, virtual pets), the psychology of fame and student learning in higher education.

REFERENCES
ABRAMI, P. C., D’APOLLONIA, S. & ROSENFIELD, S. (1997) The dimensionality of student ratings of instruction: what we know and what we do not, in: R. P. PERRY & J. C. SMART (Eds) Effective Teaching in Higher Education: research and practice, pp. 321–367 (New York, Agathon Press).
ASCH, S. E. (1946) Forming impressions of personality, Journal of Abnormal and Social Psychology, 41(2), pp. 258–290.
BASS, B. M. (1990) Bass and Stogdill’s Handbook of Leadership, 3rd edn (New York, Free Press).
BROWN, G. & ATKINS, M. (1993) Effective Teaching in Higher Education (London, Routledge).
BRUNER, J. S. & TAGIURI, R. (1954) The perception of people, in: G. LINDZEY (Ed.) Handbook of Social Psychology, Vol. 2 (London, Addison Wesley).
BRYMAN, A. (1992) Charisma and Leadership in Organizations (London, Sage).
D’APOLLONIA, S. & ABRAMI, P. C. (1997) Navigating student ratings of instruction, American Psychologist, 52(11), pp. 1198–1208.
FELDMAN, K. A. (1976) Grades and college students’ evaluations of their courses and teachers, Research in Higher Education, 18(1), pp. 3–124.
FERNÁNDEZ, J., MATEO, M. A. & MUÑIZ, J. (1998) Is there a relationship between class size and student ratings of teaching quality? Educational and Psychological Measurement, 58(4), pp. 596–604.
FULLER, J. B., PATTERSON, C. E. P., HESTER, K. & STRINGER, D. Y. (1996) A quantitative review of research on charismatic leadership, Psychological Reports, 78(1), pp. 271–287.
GREENWALD, A. G. & GILLMORE, G. M. (1997) Grading leniency is a removable contaminant of student ratings, American Psychologist, 52(11), pp. 1209–1217.
HOUSE, R. J. (1977) A 1976 theory of charismatic leadership, in: J. G. HUNT & L. L. LARSON (Eds) Leadership: the cutting edge, pp. 189–207 (Carbondale, IL, Southern Illinois University Press).
JÖRESKOG, K. G. & SÖRBOM, D. (1993) LISREL 8: structural equation modeling with the SIMPLIS command language (Chicago, Scientific Software International).
KELLEY, H. H. (1950) The warm-cold variable in first impressions of persons, Journal of Personality, 18(3), pp. 431–439.
KUDISCH, J. D., POTEET, M. L., DOBBINS, G. H., RUSH, M. C. et al. (1995) Expert power, referent power, and charisma: toward the resolution of a theoretical debate, Journal of Business and Psychology, 10(2), pp. 177–195.
LOWMAN, J. & MATHIE, V. A. (1993) What should graduate teaching assistants know about teaching? Teaching of Psychology, 20(2), pp. 84–88.
MARSH, H. W. (1987) Students’ evaluations of university teaching: research findings, methodological issues, and directions for future research, International Journal of Educational Research, 11(3), pp. 253–388.
MARSH, H. W. & DUNKIN, M. (1992) Students’ evaluations of university teaching: a multi-dimensional perspective, in: J. C. SMART (Ed.) Higher Education: handbook on theory and research, Vol. 8, pp. 143–234 (New York, Agathon Press).
MARSH, H. W. & ROCHE, L. A. (1997) Making students’ evaluations of teaching effectiveness effective, American Psychologist, 52(11), pp. 1187–1197.
MCKEACHIE, W. J. (1997) Student ratings: the validity of use, American Psychologist, 52(11), pp. 1218–1225.
PATRICK, J. & SMART, R. M. (1998) An empirical evaluation of teacher effectiveness: the emergence of three critical factors, Assessment and Evaluation in Higher Education, 23(2), pp. 165–178.
PILLAI, R., STITES-DOE, S., GREWAL, D. & MEINDL, J. R. (1997) Winning charisma and losing the presidential election, Journal of Applied Social Psychology, 27(19), pp. 1716–1726.
QUALITY ASSURANCE AGENCY FOR HIGHER EDUCATION (1997) Subject Review Handbook: October 1998 to September 2000 (QAA 1/97) (London, Quality Assurance Agency for Higher Education).
RAMSDEN, P. (1991) A performance indicator of teaching quality in higher education: the course experience questionnaire, Studies in Higher Education, 16(2), pp. 129–150.
SHACKLETON, V. (1995) Business Leadership (London, Routledge).
SWARTZ, C. W., WHITE, K. P. & STUCK, G. B. (1990) The factorial structure of the North Carolina Teacher Performance Appraisal Instrument, Educational and Psychological Measurement, 50(1), pp. 175–185.
VERNON, P. E. (1964) Personality Assessment: a critical survey (London, Methuen).
WOODS, P. (1993) The charisma of the critical other: enhancing the role of the teacher, Teaching and Teacher Education, 9(8), pp. 545–557.

Appendix 1. Student evaluation questionnaire

1. The lecturer speaks clearly.
2. The lecturer presents material in a well organised and coherent way.
3. The lecturer is able to explain difficult concepts in a clear and straightforward way.
4. The lecturer makes effective use of examples and illustrations in his or her explanations.
5. The lecturer is successful in presenting the subject matter in an interesting way.
6. The lecturer is successful in encouraging students to think independently and do supplementary reading on the subject matter of the module.
7. The module was what I expected.
8. The references were very useful.
9. In this module I learned a lot.
10. In my opinion this module was enjoyable and worthwhile.
11. The lecturer was very approachable.
12. The lecturer has charisma.
