Studies in Higher Education Volume 26, No. 3, 2001

Peer and Teacher Assessments of Oral Presentation Skills: how reliable are they?
DOUGLAS MAGIN & PHIL HELMORE
The University of New South Wales, Sydney, Australia

ABSTRACT This article reports findings on the reliabilities of peer and teacher summative assessments
of engineering students' oral presentation skills in a fourth year communications subject. The
context of the study is unusual, in that each oral presentation was subject to multiple ratings by teams
of students and teams of academic staff. Analysis of variance procedures were used to provide separate
estimates of inter-rater reliability of assessments by peers and teachers for classes in four succeeding
years. Teacher ratings were found to have substantially higher levels of inter-rater agreement than
peer ratings. Generalising over the four years, it would require the averaging of between two and four
peer ratings to match the reliability of single teacher assessments. However, the estimates of individual
rater reliability for teachers, based on the intra-class correlation coefficient, were moderately low (0.40
to 0.53). It is concluded that the reliability of summative assessments of oral presentations can be
improved by combining teacher marks with the averaged marks obtained from multiple peer ratings.

Introduction
Student engagement in assessment has been advocated on the grounds of the learning
benefits which are likely to ensue from being involved in giving and receiving feedback. Many
of those who have argued for greater involvement by students in the assessment process have
stressed the crucial role that this can play in developing skills required for professional
responsibility, judgement and autonomy (e.g. Boud & Prosser, 1980; Heron, 1981;
Falchikov, 1988). This rationale continues to underpin the introduction of peer assessment
procedures in universities. Oldfield & MacAlpine (1995) initiated peer assessment on the
grounds that their students ‘throughout their working lives … will need to assess the quality
of the work of their subordinates, their peers, their superiors and realistically, themselves’ (p.
129). Similarly, Kwan & Leung (1996) have advanced the claim that ‘The ability to judge the
performance of peers critically and objectively is a skill that students should possess when
they enter employment in any profession’ (p. 211), and regard peer assessment as a necessary
ingredient in the undergraduate experience. Successful learning outcomes following the
introduction of peer assessment procedures have been documented, either on the basis of
students' perceptions of learning benefits (e.g. Falchikov, 1986, 1995; Magin & Churches,
1989; Mockford, 1994), or on measures of improved student performance (e.g. Hendrickson
et al., 1987; Stefani, 1994; Hughes & Large, 1993).

Impediments
Whilst there is general agreement on the value of feedback from peer assessment in
promoting learning, the issue of whether peer assessments should form a significant part of
a student's final grade remains vexed. Boud (1989), an early advocate of self and peer
assessment feedback methods, has nonetheless cautioned against extension of their use to
formal grading procedures, on the grounds that this could compromise its main pedagogical
intent, or that peer assessments are often ‘not accurate enough’ for this purpose. Concerns
about the validity and reliability of peer assessments have ranged over such issues as: students
being ‘poor judges’ of effective communication skills (e.g. Swanson et al., 1991; Van der
Vleuten et al., 1991); the potential for biases of different kinds to influence assessments
(Pond et al., 1995; Brindley & Scoffield, 1998); and the variability of marking standards
employed by peer assessors (Houston et al., 1991; Swanson et al., 1991). A further concern
raised in two studies (Beard & Hartley, 1984; Freeman, 1995) was that of the need for
universities and the communities they serve to have confidence in those assessment practices
which provide the basis for the certification of graduates for professional careers.

In Support of Summative Peer Assessment


In contradistinction to the issues raised against the use of peer assessments in final grading,
arguments can be adduced supporting their use in summative assessment. First, the knowl-
edge that peer assessments will 'count' towards final grades is likely to promote a greater
seriousness and commitment on the part of students (Lejk et al., 1999). In an earlier study,
Swanson et al. (1991) found that ‘When peer ratings were used formatively … students either
did not take them seriously, or refused to complete them’ (p. 262). Second is the principle
of developing student autonomy and empowering students to make judgements that count
(Fineman, 1981; Stefani, 1998; Brindley & Scoffield, 1998). If students are made aware that
their assessments cannot be counted towards final grading, either because they are considered
unable to make valid or reliable assessments, or because they cannot be trusted to do so, then
we should not be surprised if it is difficult to convince them of the learning value of engaging
in peer assessment. There are many contexts where the impediments are such that fair and
valid peer assessments are unlikely to be obtained. However, there are others in which these
impediments have minimal impact, or can be overcome, or indeed where assessments based
on peer ratings could be superior to teacher assessments.
One such context is that of the assessment of students’ oral presentation skills. The most
usual setting is one in which each student in a tutorial or small class delivers an oral
presentation to the class, with the teacher in attendance. In this situation the audience is,
essentially, one’s peers, and valid assessment of the presentation should take account of the
extent to which the presenter has succeeded in communicating to the target audience. From
this perspective, peer assessments have a special claim to face validity. A further consideration
is that including peer assessments of oral presentation skills with those of teacher assessments
should improve substantially the overall reliability of the marks. Given that several studies
have found students to be poor at assessing communication skills, such a claim might seem
curious at first. What needs to be taken into account is that, even if the ratings provided by
individual students have quite low reliabilities, the reliability of the scores obtained from the
averaging of ratings by a number of peer assessors can be quite high. Provided students
attempt to apply a common set of criteria, no matter how poor their ability to discriminate,
the reliability of the averaged scores will increase as the number of raters increases (Houston
et al., 1991).
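As an illustration (ours, using the standard Spearman-Brown relation; the article itself invokes this formula only later, via Ferguson, 1971): if each individual peer rating has reliability r11, the reliability of a mark formed by averaging k such ratings is

$$ r_{kk} = \frac{k\,r_{11}}{1 + (k - 1)\,r_{11}}, $$

so even a single-rater reliability as low as 0.2 rises to roughly 0.56 once five ratings are averaged.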

Past Studies of ‘Reliability’


In almost all instances reported in the literature, indications of the reliability of peer
assessments have been attempted through measures of agreement between peer and teacher
assessments (Topping, 1998). The majority of studies have approached this through correlat-
ing the averaged mark awarded by peers with that awarded by a teacher. Other studies have
measured agreement in terms of mark differences between peer and teacher assessments (e.g.
Falchikov 1994, 1995; Searby & Ewers, 1997; Heylings & Stefani, 1997). Topping’s
literature review identified 109 peer assessment studies, of which 25 provided 'reliability' data
in terms of peer–teacher agreement measures. The ubiquity of this practice led Topping to
comment:
many purported studies of ‘reliability’ appear actually to be studies of validity. That
is, they compare peer assessments with assessments made by professionals rather
than with those of other peers or the same peers over time (Topping, 1998, p. 257).
A similar conclusion about the interpretation of peer–tutor agreement indices has been put
forward by Melvin & Lord (1995) in the United States: ‘If the professor is considered to be
an observer, the correlations provide evidence for interobserver reliability. However, if one
considers the professor’s evaluation as the primary criterion (e.g. analogous to a supervisory
rating), these correlations strongly support the concurrent validity of the peer ratings’ (p.
260).
With respect to oral presentations, the subset of studies comparing peer and tutor
agreement is small. Of the six studies we have located, five used correlation analysis and one
study used mark differences as the primary criterion of agreement. High correlation
coefficients were obtained by Kaimann (1974) and by Hughes & Large (1993). Freeman
(1995) reported a moderate correlation (r = 0.60). Oldfield & MacAlpine (1995), in their
analysis of three case studies they conducted, reported coefficients varying from 0.16 to 0.72.
MacAlpine (1999) found that changes introduced to peer assessment protocols and marking
scales resulted in an improvement in the correlation coefficient—from 0.58 before the
changes to 0.80 in the subsequent year. Falchikov (1995) used mark differences as the
criterion for agreement. She found that (on a 20 point scale) 98% of the marks derived by
peers were within one-half of a mark of that awarded by the tutor. In all six studies cited the
marks derived from peers were averaged from multiple peer ratings.

Limitations
In the studies cited above, any inferences drawn concerning the accuracy of peer ratings rest
on an assumption that it is the teacher’s assessment that provides the standard or reference
point. Freeman (1995), for example, makes the claim that ‘Unless student markers can
reliably reproduce academic marks, then peer assessment, if used, should have a very low
weight, if any, in a student's final grade' (p. 291). Yet, as Falchikov & Boud (1989) have
observed, the practice of using teacher marks as the standard against which the ‘reliability’ or
accuracy of peer marks is measured is ‘clearly problematic’, since it cannot be assumed that
teachers’ ratings themselves have satisfactory reliabilities. The assumption is particularly open
to challenge with respect to a teacher observing and assessing oral presentation skills. In a
review of different assessment procedures used in medicine, Van der Vleuten et al. (1991)
concluded that assessments using observational methods were less reliable than other forms
of assessment.
The studies mentioned in the preceding section are, therefore, limited in that they
provide no information at all on the comparative reliabilities of the marks awarded by
teachers as against those awarded by peers. In the absence of such information, it is not
possible to determine the extent to which low levels of agreement could be attributable to
unreliability in peer ratings, to lack of reliability in teacher ratings, or to both.

In almost all settings where oral presentations are made, only one teacher is present
in the class, and hence it is not possible to determine independently the reliability of teacher
ratings. However, as was the case in all six studies cited above, peer marks have been most
commonly determined through the averaging of multiple peer ratings. In these circum-
stances, it is possible to determine the reliability of peer scores on the basis of inter-rater
agreement between peer assessors. One case study has been reported (Reizes & Magin,
1995) in which this method was used to estimate the reliability of multiple peer assessments
of oral presentation skills. This described a situation in which over 100 second year
engineering students were required to carry out communication skills exercises and peer
assessments in small groups without the presence of a tutor. One of these exercises required
each student in a group of about seven students to make an oral presentation. The oral
presentation skills were rated by the peers and feedback given. The ratings sheets from each
of the nineteen groups were subsequently collected and analysed using one-way analysis of
variance. Analysis of the peer assessment scores (based on averaging approximately six peer
ratings per student) yielded a mean inter-rater reliability coefficient of 0.69 over the nineteen
groups.

The Study
This paper reports findings on the comparative reliabilities of peer and teacher summative
assessments of students’ oral presentation skills. The context of the study is unusual in that
each oral presentation was subject to multiple ratings by teams of students and by teams of
academic staff. Analysis of variance procedures were used to provide separate inter-rater
reliability estimates for peer assessments and teacher assessments. An additional feature of
the study is that it provides analysis of these assessments for the years 1995–1998, enabling
us to examine the consistency of assessment outcomes for four succeeding years.
Final year students in the School of Mechanical and Manufacturing Engineering at The
University of New South Wales study subjects mainly in their specialisation. In addition to
specialisation subjects in their fourth year, all students are required to undertake two
subjects—a thesis project (usually based on the student’s specialisation), and Communica-
tions for Professional Engineers. The two major objectives of the latter are to develop
confidence and skills in oral communications using audio-visual aids, and to improve skills in
design and presentation of complex professional documents (papers, contracts, specifications
and, in particular, theses).
Oral presentations by individual students are each assessed by approximately five peers,
and by a similar number of teaching staff (usually three to seven). The presentations, based
on work from the thesis project they are undertaking in the same semester, consist of a fifteen
minute exposition on their own project followed by a five minute discussion period. This
constitutes the major assessment task in the subject (40% of final assessment) and is the last
assessment task undertaken. Prior to the oral presentations students are assessed on four
preliminary tasks, namely: (i) submission by e-mail of a (maximum) 50 word abstract of their
thesis, (ii) revision of abstract after consultation with other students, (iii) a brief (four
minutes) thesis progress talk using overhead transparencies, (iv) a longer seven minute talk
on a topic of their choice using overhead transparencies and/or other aids. These preliminary
tasks are assessed by the lecturer, and feedback is given to students in tutorials to assist them
in preparing for their final oral presentation in a formal setting. As part of this preparation,
discussions are held on what constitutes an effective technical presentation, on marking
standards, and on the use of the rating scales in assessing presentations. The skill of the
students in technical writing is judged by teaching staff elsewhere, i.e. in the presentation of
their thesis.
The study analyses data for four successive years, 1995–1998. The enrolments in this
period ranged from approximately 100 to 120 students. There are about forty academic staff
in the School, and marking is undertaken by virtually all of the staff, and by all of the students
enrolled in the subject. This requires staff to attend and mark at least fifteen talks during the
two-day conference presentation sessions. In addition, students are required to attend a
number of talks by their peers, and assess about five of these. During their presentations,
students use a range of audio-visual equipment including overhead transparencies, 35 mm
slides, computer screen projectors and video-cassette recorders (VCRs).

Formative Assessment

Assessment of oral presentations is used both for formative and for summative purposes. Staff
and peer assessments are made on a standard A4 mark sheet on which both formative and
summative assessments are entered. Feedback is provided in relation to eight aspects of their
presentation:
Did the speaker:

· speak loudly enough?
· have clear diction?
· use the English language properly?
· use visual aids effectively?
· have adequate eye contact?
· inform you adequately on the thesis topic?
· present the information logically?
· handle the questions well?

Performance is rated against these criteria using simple check marks on a 10-point scale
calibrated from 0 to 100, and a space is provided for general comments. The A4 mark sheets
are forwarded to the student presenters at the end of each conference presentation session.
This results in every presenter receiving an average of about ten feedback sheets from peer
and teacher raters. General comments made on each rating sheet tend to be two or three
lines in length, with many assessors commenting briefly on several different
aspects of a presentation. An example is:

You have a naturally soft voice, so you may need to project it more. Also, I suggest
more eye contact with audience. Too much text on same slide—use point form, not
paragraphs (or you lose audience reading it). Printing on several slides too small.

Summative Assessment
In addition to providing formative assessments, each rater records a global summative
assessment, expressed as a percentage. The global mark is specified as an overall assessment
of the 'level of confidence and skill attained by the student in making an oral presentation
using audio-visual aids’, and it is this mark alone which is used for summative assessment.
The use of a single global mark is supported in the literature on the grounds of reliability and
practicality. In assessing medical students' communication skills, Hodges et al. (1996) found
that the inter-rater reliability of global ratings was superior to that achieved with a detailed
checklist and concluded that, for summative purposes, it was 'probably unnecessary to
employ time-consuming detailed versions of a checklist' (p. 40). Similar findings have been
reported by Van der Vleuten et al. (1991).


The global assessment made by each rater is entered on a summary sheet for return to
the subject coordinator, and is also included on the individual feedback sheet given to the
student being assessed. For each student, the staff assessment mark is taken as the simple
average of the global marks awarded by staff, scaled from a percentage to a mark out of 20.
Similarly, the student assessment mark is taken as the average of the marks awarded by peers.

Results
The means and standard deviations of the global teacher and peer marks awarded for oral
presentations were calculated for each of the four years, and are displayed in Table I. The
table also includes product-moment correlations (‘Pearson r’) determined from the averaged
scores received by each presenter from teacher and peer assessments.

TABLE I. Averaged teacher and peer marks: means, standard deviations and correlations

        Teacher assessments        Peer assessments
Year    Mean       s.d.            Mean       s.d.         Correlation       Number of
                                                           (Pearson r)       students assessed

1995    76.9       8.0             84.1       6.3          0.59              114
1996    76.7       7.9             85.6       5.5          0.69              119
1997    76.5       7.2             82.3       5.0          0.48               96
1998    76.8       7.1             83.8       4.9          0.54              136

Substantial ‘overmarking’ in peer assessments is evident in all four years. Whilst the class
mean for teacher assessments remained stable at around 77%, the mean for peer assess-
ments over the four years ranged between 82% and 86%. The standard deviations for teacher
assessments fell from 8.0 in 1995 to 7.1 in 1998. The standard deviations for peer assess-
ments similarly declined over the four years, dropping from 6.3 to 4.9. They were also more
‘bunched’ than teacher assessments, the standard deviations being substantially lower in each
year. The correlation coefficients varied between (approximately) 0.5 and 0.7. These indicate
low to moderate levels of agreement between teacher and peer marks. In order to explore the
extent to which the lack of high agreement could be attributable to unreliable marking on
the part of peers and/or teachers, analysis of variance procedures were employed to determine
the level of inter-rater agreement for the pattern of peer ratings, and similarly for the pattern
of teacher ratings.

Reliability Estimates Based on One-Way Analysis of Variance


Although analysis of variance techniques for measuring reliabilities (in terms of agreement
between raters or observers) have been employed in the social sciences for over three decades,
they have not found general use in the area of peer assessment studies. Simple one-way
analysis is the appropriate model in circumstances where ‘different judges are used for
different sessions’ (Hartmann, 1977, p. 106), or in which ‘decisions are made by …
comparing averages which come from different groups of raters’ (Ebel, 1951, p. 412). Since
each set of presentations was assessed by different peer raters, and by different permutations of
teachers, the data were analysed using one-way analysis methods. The technique enables
determination of two reliability indices: an inter-rater reliability index (designated rnn) for
the marks based on the average of the scores awarded by N raters; and a second reliability
index, the 'intra-class correlation coefficient' (designated r11). This latter index provides a
means of estimating how reliable peer assessments might be if each presentation were to be
assessed by just one rater instead of N raters.
The matrix of all peer and teacher ratings was entered into spreadsheets each year. We
conducted one-way analyses of these data using the 'AOVONEWAY' command from the
MINITAB statistical package. This command provides analysis of variance output summaries,
including the F-ratio. The reliability of the averaged marks (rnn), and the estimate of the
intra-class correlation coefficient for individual raters (r11), can be calculated from this
F-ratio, viz:

rnn = (F - 1) / F

and

r11 = (F - 1) / (F + N - 1)          (Ebel, 1951)

where F is the F-ratio from the one-way analysis of variance and N is the number of raters per presentation.
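For illustration only (the original analyses used MINITAB's AOVONEWAY command; the function below and the ratings in it are ours, not the study data), the same two coefficients can be obtained from any one-way ANOVA routine:

```python
# A minimal sketch, not the authors' MINITAB workflow: estimating r_nn and r_11
# from multiple ratings per presentation via a one-way ANOVA F-ratio.
import numpy as np
from scipy import stats

def anova_reliability(ratings_per_presenter):
    """ratings_per_presenter: one list of rater marks for each presentation."""
    groups = [np.asarray(g, dtype=float) for g in ratings_per_presenter]
    f_ratio, _ = stats.f_oneway(*groups)              # one-way ANOVA across presentations
    n_bar = np.mean([len(g) for g in groups])         # mean number of raters per presentation
    r_nn = (f_ratio - 1.0) / f_ratio                  # reliability of the averaged mark (Ebel, 1951)
    r_11 = (f_ratio - 1.0) / (f_ratio + n_bar - 1.0)  # estimated single-rater reliability
    return r_nn, r_11, n_bar

# Invented global marks (percentages) from five raters for three presentations
ratings = [
    [72, 78, 75, 70, 74],
    [85, 88, 82, 86, 84],
    [65, 70, 68, 66, 71],
]
r_nn, r_11, n_bar = anova_reliability(ratings)
print(f"mean raters = {n_bar:.1f}, r_nn = {r_nn:.2f}, r_11 = {r_11:.2f}")
```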

Table II displays reliability coefficients for the teacher assessments. The reliability of the
marks awarded through averaging the scores from multiple teacher assessments per presentation
(rnn) was consistently high for each year, ranging from 0.79 in 1996 to 0.84 in 1998.
The mean number of teacher assessors per presentation varied between four and six. The
estimated single rater reliability of teachers (r11) ranged between 0.40 and 0.53 over these
four years. This latter statistic can be interpreted as an estimate of the overall reliability of
marks when only one teacher acts as an assessor for each presentation. In such circumstances,
however, the existence of stringent and lenient markers could give rise to considerable
discrepancies in the marks awarded to students, if these were based on only one set of teacher
ratings per presentation.

TABLE II. Teacher assessments: reliability coefficients 1995–1998

        Reliability of     Mean number of     Single rater     Number of
        averaged mark      teacher raters     reliability      students assessed
Year    rnn                k                  r11              N

1995    0.83               4.6                0.50             114
1996*   0.79               3.9                0.51             119
1997    0.81               6.2                0.40              96
1998    0.84               4.6                0.53             136

Note: * Reliability data for 1996 have been previously reported in conference proceedings (Magin & Helmore, 1998).
The analysis was repeated for the peer rating data, and summary results are presented
in Table III. For all four years the reliabilities of the averaged peer marks were considerably
lower than those obtained from teaching staff. Based on scores generated by an average of 4.6
peer assessors per presentation, the reliability of the averaged marks ranged from 0.53 to 0.70.
However, the estimates of the overall reliability of individual peer raters were very low,
ranging from 0.20 to 0.34.
TABLE III. Peer assessments: reliability coefficients 1995–1998

        Reliability of     Mean number of     Single rater     Number of
        averaged mark      peer raters        reliability      students assessed
Year    rnn                k                  r11              N

1995    0.70               4.6                0.34             114
1996*   0.58               4.6                0.23             119
1997    0.53               4.5                0.20              96
1998    0.59               4.6                0.24             136

Note: * Reliability data for 1996 have been previously reported in conference proceedings (Magin & Helmore, 1998).

Comparing Reliabilities: ‘Student Equivalents’



Table IV provides direct comparisons of the single rater reliability for teachers and students.
Also included in the table is a ‘student equivalent index’. This is an estimate of the average
number of peer ratings (for each presentation) which would be required to reach the same
reliability as that estimated for single teacher ratings. Determination of the index is based on
the Spearman-Brown prophecy formula (Ferguson, 1971, p. 369).
In 1995 the reliability estimate for single teacher ratings was 0.50, and for single peer
ratings the corresponding estimate was 0.34. By applying the Spearman-Brown formula it is
estimated that an average of 2.0 peer ratings per presentation would be required to achieve
the same reliability as that resulting from single teacher ratings. As shown in Table IV, the
‘student equivalent index’ ranged from 2.0 to 3.6. Generalising over the four years, single
teacher assessments are the equivalent of between two and four peer assessments per
presentation.
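As a worked check (our notation, matching the column symbols of Table IV; the article cites only Ferguson, 1971 for the formula), the index is the number of peer ratings whose average would match the single teacher reliability:

$$ N_{SE} = \frac{r_t\,(1 - r_p)}{r_p\,(1 - r_t)}, \qquad \text{e.g. } 1995:\; N_{SE} = \frac{0.50 \times (1 - 0.34)}{0.34 \times (1 - 0.50)} \approx 1.9 \approx 2.0. $$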

Discussion
In their review of the literature on self assessment, Boud & Falchikov reported that ‘There
appears to be lack of replication with different groups taking the same course in following
years or with the same group over time…. If replications are being undertaken, they are not
being reported’ (Boud & Falchikov, 1989, p. 535). A similar lack still remains in the literature
on peer assessment of communication skills. The investigation reported in this paper utilised
a data set consisting of multiple peer and teacher ratings of around 100 student presentations
in each year, and is substantially larger than the data sets used in many other studies. This has enabled
us to examine the consistency of outcomes for the four classes in the years from 1995 to
1998.

TABLE IV. Comparative reliabilities: teacher compared with peer, 1995–1998

        Teacher single       Peer single          'Student equivalent'
        rater reliability    rater reliability    index
Year    rt                   rp                   NSE

1995    0.50                 0.34                 2.0
1996    0.51                 0.23                 3.4
1997    0.40                 0.20                 2.7
1998    0.53                 0.24                 3.6

Consistencies
One remarkable feature is the uniformity of mark averages determined by teacher assessments
over the four year period. The averages for peer marks were less uniform, but nonetheless
were in a narrow band (see Table I). However, peer marks were substantially higher than
those given by teachers in each year. This finding is at variance with many other studies from
varying disciplines, which have found peer and teacher mark averages to be quite similar (e.g.
Falchikov, 1994, 1995; Stefani, 1994; Freeman, 1995; Kwan & Leung, 1996). Peer ‘over-
marking’ is correctable, where warranted, by simple adjustment, and since 1996 we have
scaled peer marks to conform with the overall teacher assessment mean and standard
deviation.
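The paper does not spell out how this scaling is done; the sketch below shows one standard possibility, a linear rescaling to the teacher mean and standard deviation, using invented marks:

```python
# Hypothetical sketch: the paper states that, since 1996, peer marks are scaled to the
# teacher mean and standard deviation, but does not give the method.  A linear
# (z-score) rescaling is one standard way to do it.  All marks below are invented.
import numpy as np

def rescale_to_teacher(peer_marks, teacher_marks):
    """Linearly transform peer marks so their mean and s.d. match the teacher marks."""
    peer = np.asarray(peer_marks, dtype=float)
    teacher = np.asarray(teacher_marks, dtype=float)
    z = (peer - peer.mean()) / peer.std(ddof=1)        # standardise the peer marks
    return teacher.mean() + z * teacher.std(ddof=1)    # re-express on the teacher scale

peer_avg = [84.0, 88.0, 80.0, 86.0, 83.0]       # averaged peer marks per presenter (invented)
teacher_avg = [77.0, 82.0, 70.0, 79.0, 75.0]    # averaged teacher marks per presenter (invented)
print(np.round(rescale_to_teacher(peer_avg, teacher_avg), 1))
```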
What did emerge as a concern was the consistent finding that peer marks were
considerably more bunched than were teacher marks. Numerous other studies have found
this same tendency (e.g. Hughes & Large 1993; Freeman, 1995; Kwan & Leung, 1996). It
is not known whether the bunching phenomenon arises from students’ reluctance to utilise
a mark range similar to that employed by their teachers, or whether it is due to students
having less ability than staff in being able to discriminate between differing standards of
presentation. Whatever the reason, the use of a more restricted marking range by student
assessors will tend to lower the reliability of the scores.
A further consistency was detected. There was a trend over the four years towards
greater homogeneity of marks. As mentioned in the preceding section, the distribution of
teacher marks showed a consistent drop in the standard deviation from 8.0 in 1995 to 7.1 in
1998. The same trend can be seen in the peer assessment distributions over the same years.
We have formed the opinion that this trend is real rather than an artefact of assessor
behaviour. There has been virtually no change in the teaching personnel over the four years,
and we can find no reason why their marking disposition would have changed with respect
to range but not in relation to the overall mean. One possible explanation is
that, over the last four years, more students appear to have become familiar with the use of
audio-visual aids, and this may have caused greater uniformity in the standard of presenta-
tions with respect to audio-visual skills.

Low Individual Rater Reliabilities


The findings as shown in Table III lend support to the commonly-held belief that students
are quite poor at judging oral presentation skills. The estimate of (average) single peer rater
reliability for each of the four years was very low, ranging from 0.20 to 0.34. However, this
needs to be considered in the light of the low estimates for the reliability of single teacher
assessments (ranging between 0.40 and 0.53). Although these reliabilities were considerably
higher than for student raters in each of the four years, it would be unsafe to rely on the marks
produced by a single teacher assessor alone as the basis for determining part of a final
grading.

Superiority of Teacher Assessments


Over the four years single teacher ratings were found to be the equivalent (in terms of
reliability) of between two and four peer ratings. The most obvious line of explanation for the
superiority of teacher assessments is that teachers are more experienced, more expert, and are
less likely to be biased in their judgements. The results from the study suggest that the
marking behaviour of students is also a factor. The tendency for student assessors to
overmark their peers, and the consequent bunching of scores towards the top of the range act
to lower reliability. If students could be persuaded to exercise greater discrimination,
improvement in the reliability of peer rating scores should ensue.
These Ž ndings can be considered from another perspective. Whilst ratings by a typical
teacher are indeed superior to those of a typical student assessor, scores produced by single
teacher assessments are unlikely to be superior to averaged peer scores, where these are based
on more than four ratings for each presentation. From this perspective, there is potential to
improve the reliability of summative assessments by combining teacher marks with the
averaged scores obtained from multiple peer ratings.

Routine Use of Analysis of Variance Methods


It has become almost universal practice for grades in a subject to be determined by
summation of marks from several distinct assessable tasks, rather than from a single unseen
examination. Many teachers now employ spreadsheets for entering assessment results and
use accompanying statistical packages. Teachers who have taken the initiative to include
multiple peer assessments, together with teacher assessments, in determining final marks
often monitor this practice by carrying out peer–teacher correlation analysis. It is recom-
mended in these situations that subject coordinators and researchers take the simple extra
step of using standard analysis of variance commands to obtain reliability (inter-observer
agreement) estimates for the averaged scores from peer assessments. This will overcome
several of the previously canvassed problems of interpretation based on correlation or mark
difference analyses. Analysis of variance methods have the additional advantage of providing
reliability estimates for peer assessment outcomes in the absence of teacher assessments.
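As a concrete sketch of that extra step (a hypothetical spreadsheet export; the file and column names below are invented and are not part of the study):

```python
# Hypothetical illustration of the 'simple extra step': compute the inter-rater
# reliability of averaged peer marks from an exported ratings file with invented
# columns 'presenter' and 'mark'.
import pandas as pd
from scipy import stats

df = pd.read_csv("peer_ratings.csv")
groups = [g["mark"].to_numpy() for _, g in df.groupby("presenter")]
f_ratio, _ = stats.f_oneway(*groups)
r_nn = (f_ratio - 1.0) / f_ratio        # Ebel (1951): reliability of the averaged peer marks
print(f"inter-rater reliability of averaged peer marks: {r_nn:.2f}")
```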

Concluding Remarks
On reviewing the literature relating to peer assessment of communication skills, and on reading
the review by Topping (1998), what struck us was the absence of any studies which had
compared the reliabilities of peer and teacher assessments. In the course of monitoring
assessment outcomes from the final year communication subject in the period from 1995 to
1998, we had available four years of assessment data. These have been analysed to provide,
for the first time, information on the comparative reliabilities of peer and teacher assessments
of oral presentation skills.
Whilst teacher assessments were found to be more reliable than peer assessments, the
use of single teacher ratings of oral presentation skills is clearly inadequate as a reliable
assessment measure. We should not be discouraged by this aspect of the study. Almost all
situations where students make oral presentations involve other students as the audience.
Combining teacher assessment scores with the averaged scores from multiple
peer assessments can produce highly satisfactory mark reliabilities. By involving the audience
of students in the task of assessment we can achieve two goals: that of fostering skills of
professional judgement; and that of investing the assessment of oral presentations with
greater reliability than can be attained otherwise.

Correspondence: Douglas Magin, Faculty of Engineering Office, The University of New South
Wales, Sydney 2052, Australia; e-mail: d.magin@unsw.edu.au

REFERENCES

BEARD, R. & HARTLEY, J. (1984) Teaching and learning in higher education (4th edition) (London, Paul
Chapman).
BOUD, D. (1989) The role of self-assessment in student grading, Assessment and Evaluation in Higher Education,
14, pp. 20–30.
BOUD, D. & FALCHIKOV, N. (1989) Quantitative studies in self-assessment in higher education, Higher
Education, 18, pp. 529–549.
BOUD, D. & PROSSER, M. (1980) Sharing responsibility: staff-student cooperation in learning, British Journal
of Educational Technology, 11, pp. 24–35.
BRINDLEY, C. & SCOFFIELD, S. (1998) Peer assessment in undergraduate programmes, Teaching in Higher
Education, 3, pp. 79–89.
EBEL, R. (1951) Estimation of the reliability of ratings, Psychometrika, 16, pp. 407–424.
FALCHIKOV, N. (1986) Product comparisons and process benefits of collaborative peer group and self
assessments, Assessment and Evaluation in Higher Education, 11, pp. 146–166.
FALCHIKOV, N. (1988) Self and peer assessment of a group project designed to promote the skills of capability,
Programmed Learning and Educational Technology, 25, pp. 327–339.
FALCHIKOV, N. (1994) Learning from peer feedback marking: student and teacher perspectives, in: H. FOOT,
C. HOWE , A. ANDERSON, A. TOLMIE & D. WARDEN (Eds) Group and Interactive Learning, pp. 411–416
(Southampton, Computational Mechanics Publications).
FALCHIKOV, N. (1995) Peer feedback marking: developing peer assessment, Innovations in Education and
Training International, 32, pp. 175–187.
FALCHIKOV, N. & BOUD , D. (1989) Student self-assessment in higher education: a meta-analysis, Review of
Educational Research, 59, pp. 395–430.
FERGUSON , G. (1971) Statistical Analysis in Psychology and Education (3rd edition) (Tokyo, McGraw-Hill
Kogakusha).
FINEMAN, S. (1981) Reflections on peer teaching and peer assessment—an undergraduate experience,
Assessment and Evaluation in Higher Education, 6, pp. 82–93.
FREEMAN, M. (1995) Peer assessment by groups of group work, Assessment and Evaluation in Higher Education,
20, pp. 289–299.
HARTMANN, D. (1977) Considerations in the choice of inter-observer reliability estimates, Journal of Applied
Behavior Analysis, 10, pp. 103–116.
HENDRICKSON, J., BRADY, M. & ALGOZINNE, B. (1987) Peer-mediated testing: the effects of an alternative
testing procedure in higher education, Educational and Psychological Research, 7, pp. 91–102.
HERON, J. (1981) Assessment revisited, in: D. BOUD (Ed.) Developing Student Autonomy in Learning (London,
Kogan Page).
HEYLINGS, D. & STEFANI, L. (1997) Peer assessment feedback marking in a large medical anatomy class,
Medical Education, 31, pp. 281–286.
HODGES, B., TURNBULL, J., COHEN, R., BIENENSTOCK, A. & NORMAN, G. (1996) Evaluating communication
skills in the objective structured clinical examination format: reliability and generalizability, Medical
Education, 30, pp. 38–43.
HOUSTON, W., RAYMOND, M. & SVEC, J. (1991) Adjustment for rater effects in performance assessment,
Applied Psychological Measurement, 15, pp. 409–421.
HUGHES, I. & LARGE, B. (1993) Staff and peer-group assessment of oral communication skills, Studies in
Higher Education, 18, pp. 379–385.
KAIMANN, R. (1974) The coincidence of student evaluation by professor and peer group using rank
correlation, The Journal of Educational Research, 68, pp. 152–153.
KWAN, K. & LEUNG, R. (1996) Tutor versus peer group assessment of student performance in a simulation
training exercise, Assessment and Evaluation in Higher Education, 21, 3, pp. 205–214.
LEJK, M., WYVILL, M. & FARROW, S. (1999) Group assessment in systems analysis and design: a comparison
of the performance of streamed and mixed-ability groups, Assessment and Evaluation in Higher Education,
24, pp. 5–14.
MACALPINE, J. (1999) Improving and encouraging peer assessment of student presentations, Assessment and
Evaluation in Higher Education, 24, pp. 15–26.
MAGIN, D. & CHURCHES, A. (1989) Using self and peer assessment in teaching design, Proceedings, World
Conference on Engineering Education for Advancing Technology, Institution of Engineers, Australia, 89/1, pp.
640–644.
MAGIN, D. & HELMORE, P. (1998) Peer judgement of oral presentation skills versus the ‘gold standard’ of
teacher assessments, in: M. BARROW & M. MELROSE (Eds) Research and Development in Higher Education,
21, pp. 193–201, Proceedings of the HERDSA Conference held in Auckland, New Zealand, July 1–4.
MELVIN, K. & LORD, A. (1995) The prof/peer method of evaluating class participation: interdisciplinary
generality, College Student Journal, 29, pp. 258–263.
MOCKFORD, C. (1994) The use of peer group review in the assessment of project work in higher education,
Mentoring and Tutoring, 2, pp. 45–52.
OLDFIELD, K. & MACALPINE, M. (1995) Peer and self-assessment at tertiary level—an experiential report,
Assessment and Evaluation in Higher Education, 20, pp. 125–132.
POND, K., UL-HAQ, R. & WADE, W. (1995) Peer review: a precursor to peer assessment, Innovations in
Education and Training International, 32, pp. 314–323.
REIZES, J. & MAGIN, D. (1995) Peer assessment of engineering students’ oral communication skills, in: M.
PETTIGROVE & M. PEARSON (Eds) Research and Development in Higher Education, 17, pp. 275–279
(Campbelltown, HERDSA).
SEARBY, M. & EWERS, T. (1997) An evaluation of the use of peer assessment in higher education: a case study
in the School of Music, Kingston University, Assessment and Evaluation in Higher Education, 22, pp.
371–384.
STEFANI, L. (1994) Peer, self and tutor assessment: relative reliabilities, Studies in Higher Education, 19, pp.
69–75.
STEFANI, L. (1998) Assessment in partnership with learners, Assessment and Evaluation in Higher Education, 23,
pp. 339–350.
SWANSON, D., CASE, S. & VAN DER VLEUTEN, C. (1991) Strategies for student assessment, in: D. BOUD & G.
FELETTI (Eds) The Challenge of Problem Based Learning, pp. 260–273 (London, Kogan Page).
TOPPING, K. (1998) Peer assessment between students in colleges and universities, Review of Educational
Research, 68, pp. 249–276.
VAN DER VLEUTEN, C., NORMAN, G. & DE GRAAFF, E. (1991) Pitfalls in the pursuit of objectivity: issues of
reliability, Medical Education, 25, pp. 110–118.
