Self, Peer, and Teacher Assessments
Multifaceted Rasch measurement was used in the present study with 91 stu-
dent and 4 teacher raters to investigate how self- and peer-assessments work
in comparison with teacher assessments in actual university writing classes.
The results indicated that many self-raters assessed their own writing
lower than predicted. This was particularly true for high-achieving students.
Peer-raters were the most lenient raters; however, they rated high-achieving
writers lower and low-achieving writers higher. This tendency was independ-
ent of their own writing abilities and therefore offered no support for the
hypothesis that high-achieving writers rated severely and low-achieving
writers rated leniently. On the other hand, most peer-raters were internally
consistent and produced fewer bias interactions than self- and teacher-raters.
Each of the four teachers was internally consistent; however, each displayed
a unique bias pattern. Self-, peer-, and teacher-raters assessed Grammar
severely and Spelling leniently. The analysis also revealed that teacher-raters
assessed Spelling, Format, and Punctuation differently from the other criteria.
It was concluded that self-assessment was somewhat idiosyncratic and
therefore of limited utility as a part of formal assessment. Peer-assessors on
the other hand were shown to be internally consistent and their rating
patterns were not dependent on their own writing performance. They also
produced relatively few bias interactions. These results suggest that in at
least some contexts, peer-assessments can play a useful role in writing
classes. By using multifaceted Rasch measurement, teachers can inform
peer-raters of their bias patterns and help them develop better quality assess-
ment criteria, two steps that might lead to better quality peer-assessment.
Address for correspondence: Sumie Matsuno, Part-time lecturer, The Graduate School of Languages
and Cultures, Nagoya University, Nagoya, 464-860 Japan; email: msk77@sage.ocn.ne.jp
I Literature review
Self-assessment can be defined as ‘procedures by which the learners
themselves evaluate their language skills and knowledge’ (Bailey,
1998, p. 227). In first language assessment, self-assessment has often
been reported as an effective tool because self-assessment helps stu-
dents to develop a better understanding of the purpose of the assign-
ment and the assessment criteria (Orsmond & Merry, 1997),
improves learning (Sullivan & Hall, 1997), and softens the blow of
a bad grade by helping students understand the reasons for their
grade (Taras, 2001). Moreover, most participants in previous studies
were satisfied with the perceived benefits of self-assessment (Taras,
2001). On the other hand, whether students assess themselves cor-
rectly is controversial; Sullivan and Hall (1997) found that 39% of
the students overestimated their performance and Oldfield and
Macalpine (1995) found a low correlation between self- and teacher
assessments (r = .30).
Peer-assessment can be defined as ‘an arrangement for peers to
consider the level, value, worth, quality or successfulness of the
products or outcomes of learning of others of similar status’
(Topping, Smith, Swanson & Elliot, 2000, p. 150). In the field of
first language pedagogy, peer-assessment has also been considered
as an effective tool in both group and individual projects. Peer-
assessment has been found to help teachers assess each person’s
effort in group projects (Conway & Kember, 1993; Goldfinch, 1994;
Goldfinch & Raeside, 1990) and to help students learn more and
work cooperatively in a group (Kwan & Leung, 1996). In individual
tasks, students can be more involved in assessment and instruction.
First, although many studies of peer response, in which peers respond to and edit each other's written work qualitatively, have been conducted, empirical studies of self- and peer-assessments are extremely limited.
Second, in both first and second language pedagogy, the question of
whether self- and peer-assessments can be used as part of formal class-
room assessment has been a point of contention. For example, a num-
ber of researchers have reported high correlations between student- and
teacher-assessments (e.g., Oldfield & Macalpine, 1995; Rudy, Fejfar,
Griffith & Wilson, 2001), while other studies have shown low correla-
tions between them (Freeman, 1995; Swanson et al., 1997).
Third, most previous researchers have been concerned with inter-
rater reliability and have measured this using simple correlations; the
degree to which raters are internally consistent (intrarater consis-
tency) has rarely been discussed. Only researchers using multifaceted Rasch measurement (MFRM; e.g., Brown, Hudson, Norris & Bonk, 2002; Kondo-Brown, 2002; Kozaki, 2004; Lumley & McNamara, 1995; McQueen & Congdon, 1997; Wigglesworth, 1993) have investigated intrarater consistency. In fact, whether raters are internally consistent is one of the most important aspects of rater performance.
Fourth, most researchers have used the traditional true-score
approach. Because each rater interprets rating scale categories in an
idiosyncratic way, traditional approaches to measurement are prob-
lematic because they do not adequately address issues such as rater
severity/leniency, and assessment criterion difficulty/easiness. For
example, even if raters correlate perfectly using the traditional true-
score approach, there is still some doubt about whether they are rat-
ing examinees in the same way (Lunz, 1992).
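As a constructed numerical illustration of this point (an invented example, not data from the study), suppose one rater's scores are simply a constant shift of another's:

\[
x_{Bn} = x_{An} + 2 \ \text{for every writer } n \;\Longrightarrow\; r(x_A, x_B) = 1.00,
\]

so the two raters correlate perfectly even though rater B is uniformly two score points more lenient; a Rasch severity parameter captures exactly the difference that the correlation hides.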
In the present study, multifaceted Rasch measurement (MFRM)
is used to investigate how self- and peer-assessments work in comparison with teacher assessments in actual university writing classes.
MFRM can show the degree to which raters are internally and externally consistent. Moreover, MFRM can measure rater severity/leniency and, through bias analyses, show to whom and to which assessment criteria each rater displays bias. Overall, MFRM can illuminate the inner workings of the assessment process. Even
though MFRM has been utilized in various rating studies, only one
study of self- and peer-assessment (Nakamura, 2002) was conducted
using this approach. Hence in the present study, MFRM was utilized
to investigate self-, peer-, and teacher-assessments, and more specif-
ically whether student raters are externally and internally consistent.
This was accomplished through an inspection of writers' abilities, raters' severities, and assessment criterion difficulties, and of the ways in which individual raters interacted with particular writers and assessment criteria.
II Method
1 Participants
Ninety-seven second-year Japanese students participated in the
study as essay writers; however, because six students did not rate
their own and peers’ essays, they were eliminated as raters from the
study, leaving 97 student writers and 91 student raters. Eighty-one of
the students were enrolled in a prefectural university and sixteen
were enrolled in a national university in central Japan. Both univer-
sities are considered prestigious and highly competitive. The
participants’ ages ranged from 19 to 21 years old and none of the par-
ticipants had received extensive composition instruction before this
course.
In addition to the student participants, four native Japanese teachers
(T1–T4) were selected for this study. They were chosen because they
had more than five years of experience teaching English as a foreign
language at the university level. In addition, they had experience teaching or researching EFL writing for at least three years.
2 Procedures
The study was conducted in two university writing classes, and
between the first and seventh weeks, the participants received instruc-
tion concerning essay writing such as essay format, mechanics,
organization, and content. Then, during the eighth week, they wrote
an essay on the following topic as a homework assignment: Please
discuss the advantages and disadvantages of college students having
a cellular phone. They were asked to write about 300 words (1 page)
and to bring the assignment to the next class. Four copies of each
essay were made. In the ninth week, based on the essay evaluation
sheet (Appendix A), the participants practiced evaluating three
essays together in class. The students were then instructed to evalu-
ate their own essay and the essays written by five peers at home, an
assignment that was worth 10% of their final course grade. They did
not know whose essays they were evaluating because the names were deleted from the essays.
T1 evaluated all 97 essays (W1 (Writer 1) to W97), T2 evaluated
30 essays from W1 to W30, T3 evaluated 30 essays from W31 to
W60, and T4 evaluated 37 essays from W61 to W97. Those essays
were randomly assigned to each teacher. Because the multifaceted
Rasch model does not require that all raters evaluate all essays, a suf-
ficient degree of connectedness was achieved with the above rating
plan. The teacher-raters rated the essays at home and returned them
to the researcher within one month.
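The teacher rating plan can be sketched as follows (illustrative code, not part of the original study); because T1 overlaps with every other teacher, all teacher-rated essays are linked through a common rater, which is what supplies the connectedness mentioned above.

    # Sketch of the teacher rating plan: T1 rated every essay, and each essay
    # was also rated by exactly one of T2, T3, or T4, so the design is sparse
    # but fully connected through T1. Writer IDs follow the W1-W97 numbering.
    teacher_plan = {
        "T1": range(1, 98),   # W1-W97
        "T2": range(1, 31),   # W1-W30
        "T3": range(31, 61),  # W31-W60
        "T4": range(61, 98),  # W61-W97
    }

    for writer in range(1, 98):
        raters = [t for t, essays in teacher_plan.items() if writer in essays]
        assert "T1" in raters and len(raters) == 2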
3 Instruments
Because analytical scaling grids have been found to be generally
more reliable and informative than holistic ones (Hamp-Lyons,
1991; Jacobs et al., 1981; Perkins, 1983), an analytic scoring grid was
tailored for this study based on the ESL composition profile devel-
oped by Jacobs et al. (1981). The ESL composition profile was not
used in its original form; that instrument was too difficult for the par-
ticipants in this study to use because of their limited experience pro-
ducing or assessing English compositions. Therefore, a simplified
analytic scoring grid was used in this study (Appendix A). Jacobs et al.'s ESL composition profile is a weighted scale with
mechanics being the least weighted (5%) and content the most
(30%). However, Kondo-Brown (2002) mentioned, ‘it is not clear
how the weightings were determined in the original version, and
some researchers have questioned the assignment of different
weights to evaluation criteria’ (p. 9). Hence, in this study, each cate-
gory was divided into two or more subcategories (assessment criteria), which were weighted equally and rated on a 6-point scale.
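Under this equal weighting, the arithmetic of the instrument is straightforward (the 16 criteria are those listed on the evaluation sheet in Appendix A):

\[
\text{maximum raw score} = 16 \times 6 = 96, \qquad \text{weight per criterion} = \tfrac{1}{16} \approx 6.25\%,
\]

in contrast to the 5% to 30% weights of the original Jacobs et al. (1981) profile.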
This instrument had been pre-tested with 26 participants, and no assessment criteria misfit the multifaceted Rasch model.
4 Analyses
Multifaceted Rasch measurement was conducted using the FACETS
computer program, version 3.22 (Linacre, 1999). In the analysis,
writers, raters, and assessment criteria were specified as facets. The
output of the FACETS analysis reported: (a) a FACETS map, (b)
ability measures and fit statistics for each writer, (c) a severity esti-
mate and fit statistics for each rater, (d) difficulty estimates and fit
statistics for each assessment criterion, and (e) a bias analysis for
rater × writer interactions. The FACETS map provides visual infor-
mation about differences that might exist among different elements
of a facet, such as differences in severity among raters and ability
among writers. Writer ability logit measures are estimated concur-
rently with the rater severity logit estimates and assessment criterion
difficulty logit estimates. By placing them on the same linear meas-
urement scale, the results are easily compared. The Rasch model also
provides fit statistics, which provide estimates of the consistency of
the rater rating patterns (Lunz & Linacre, 1998). Bias analyses were
also carried out in order to detect raters who were rating particular
persons too severely or too leniently as well as raters who were using
particular assessment criteria in an overly severe or lenient way.
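For reference, the many-facet Rasch model underlying this analysis is conventionally written as follows (the formulation follows Linacre, 1989/1994; the symbols are generic rather than taken from this study's output):

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_i - D_j - F_k,
\]

where \(P_{nijk}\) is the probability that writer \(n\) receives category \(k\) from rater \(i\) on criterion \(j\), \(B_n\) is the writer's ability, \(C_i\) the rater's severity, \(D_j\) the criterion's difficulty, and \(F_k\) the difficulty of the step from category \(k-1\) to \(k\). Because all of these parameters are estimated on the same logit scale, writers, raters, and criteria can be displayed together, as in the FACETS map.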
Table 1 Rating scale statistics after combining categories 2 and 3, and 4 and 5
On the other hand, when it has been determined that some student raters
do not assess the essays seriously or do not meet the expectations of
the Rasch model, their ratings can be justifiably eliminated in order
to improve the precision of the ability estimates. Therefore, all of the
student essays were included and ten student raters were eliminated from the analysis. Overall, the final analysis was made up of 97 student
writers, 81 student raters, and four teacher raters.
Finally, the unidimensionality of the assessment criteria was
checked. Misfitting assessment criteria were first identified using
infit mean squares from .50 to 1.50 as the acceptable range. On the
first FACETS run, two assessment criteria were found to misfit the
model (Spelling = 1.60 and Format = 1.60). These two criteria were
apparently viewed differently by different raters. One way to check
the impact of misfit suggested by Linacre and Williams (1998) is to
remove bad data in layers and compare the resulting measures with
scatterplots. When the plots show no meaningful change, the remain-
ing data are sufficiently good. Using this approach, the scatterplots in this study showed a straight line, indicating that no meaningful change occurred as a result of the deletions; hence all of the original
criteria were included in the analyses.
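The layered-removal check described above can be sketched as follows; the ability estimates are hypothetical placeholders, not the study's data, and the point is simply that measures hugging the identity line indicate no meaningful change.

    # Sketch of the Linacre and Williams (1998) check: compare writer ability
    # estimates from the full data with estimates obtained after removing the
    # flagged criteria (Spelling and Format). Values below are illustrative.
    import matplotlib.pyplot as plt
    import numpy as np

    measures_all = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])      # all 16 criteria
    measures_trimmed = np.array([-1.1, -0.5, 0.1, 0.9, 1.4])  # misfitting criteria removed

    plt.scatter(measures_all, measures_trimmed)
    plt.plot([-2, 2], [-2, 2], linestyle="--")  # identity line: no change
    plt.xlabel("Writer ability, all criteria (logits)")
    plt.ylabel("Writer ability, Spelling/Format removed (logits)")
    plt.show()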
Table 2 shows the descriptive statistics for the self-, peer-, and
teacher-assessments. The peer assessments include 40 missing data
points but this is not a problem because the FACETS analysis can
tolerate a relatively large number of missing observations (Linacre,
1989/1994, p. 5). The following analyses are based on data from 68
self-assessors and 81 peer-assessors. Each of the 81 peer-assessors rated five writers on 16 criteria; hence, the total should have been 6480 observations; however, because of the 40 missing data points, the total was 6440.
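These totals follow directly from the rating design:

\[
81 \text{ raters} \times 5 \text{ essays} \times 16 \text{ criteria} = 6480, \qquad 6480 - 40 \text{ missing} = 6440.
\]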
As shown in Table 2, the mean of the peer assessments is the high-
est, and the means of self- and teacher-assessments are the same.
Among the teachers, the mean of T3 is the lowest. The SD of peer
assessments is the smallest, which indicates that peers did not award
extreme scores frequently; on the other hand, the SD of teacher-
assessment is .81. Among the teachers, T1 had the largest SD, indi-
cating that she awarded a relatively wide variety of scores.
Regarding skewness, all of the rater groups except T4 were skewed; the self-assessments and T3's assessments were positively skewed and the others were negatively skewed. In addition, the teacher-assessments overall and T4's assessments demonstrated negative kurtosis. However, skewness and kurtosis
are not issues in this study because MFRM does not require normal
distributions.
Figure 1 FACETS map of writer ability, rater severity, assessment criterion diffi-
culty, and Likert scale functioning.
Note: Each asterisk (*) indicates two writers or one rater. M indicates one misfitting
writer.
[Figure 1 not reproduced: the vertical axis showed logits of severity/difficulty (approximately −3 to 1.5), with markers labelled Peer, Teacher, and Self; the horizontal axis listed the assessment criteria.]
Table 3 Numbers of writers receiving biased ratings, by rater and writer ability (high ability with negative z-score; high ability with positive z-score; low ability with negative z-score; low ability with positive z-score)
Note: Parenthesized values are the percentage of biased ratings out of the total number of ratings for each assessment.
The peer- and teacher-assessments were analyzed together in order to investigate bias in these two sets of ratings. This analysis was conducted without the self-assessments because the self-assessments were somewhat idiosyncratic.
The results are presented in Table 3. In each column, the numbers
of biased writers are presented. High-ability writers were those who obtained ability estimates above 1.00 logits, while low-ability writers were those who obtained ability estimates below 1.00 logits. The 1.00 logit cutoff was selected because it allowed for the creation of two groups of approximately equal size. Negative z-scores were those below −2.00, while positive z-scores were those above +2.00. The percentage of biased writers out of the total number of writers in each assessment is presented in parentheses.
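The cell classification underlying Table 3 can be expressed as a short sketch (illustrative values only; following the usage in the text, a negative z-score marks a higher-than-expected, i.e., lenient, rating and a positive z-score a lower-than-expected, i.e., severe, rating):

    # Classify a single rating into the four cells of Table 3, using the
    # 1.00-logit ability cutoff and the +/-2.00 z-score threshold described
    # in the text. The example values are invented for illustration.
    def classify(ability_logit, z_score):
        if abs(z_score) < 2.00:
            return None  # within expectation: not counted as a biased rating
        ability = "high ability" if ability_logit > 1.00 else "low ability"
        direction = "negative z-score" if z_score < 0 else "positive z-score"
        return f"{ability}, {direction}"

    print(classify(1.8, -2.4))  # high ability, negative z-score (lenient rating)
    print(classify(0.3, 2.7))   # low ability, positive z-score (severe rating)
    print(classify(1.2, 0.5))   # None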
Self-raters tended to assess their own writing more strictly than
expected; 15 self-raters were overly severe and seven were overly
lenient. More able writers (11.76%) assessed themselves more
severely than predicted. Among the participants with ability logits
greater than 1.00, only one rater (1.47%) rated him/herself too
leniently. On the other hand, 8.82% of writers with abilities less than
1.00 had negative z-scores, indicating that they gave themselves
higher scores than predicted by the Rasch model, but this finding
was balanced by the fact that some lower level writers (10.3%)
assessed themselves more harshly than expected.
The z-scores of writers with estimated abilities above 1.00, as
determined by the peer-raters, were divided fairly equally; 28 exam-
inees (6.91%) had positive z-scores and 22 examinees (5.43%) had
negative z-scores, indicating a slight tendency on the part of peer-raters
to be severe. Moreover, four writers with ability logits above 2.00
received lower scores than expected and one writer with an ability logit
above 2.00 obtained a higher score than expected from the peer-
raters. Six writers (1.48%) with an estimated ability under 1.00 had
a positive z-score and twenty writers (4.94%) had negative z-scores,
an indication that they obtained higher than expected scores from
their peers; thus, peer raters were sometimes lenient only toward
writers with relatively low writing abilities.
The results of the bias analysis are consistent with the SD of the
peer assessments, which was smaller than that of the other assess-
ments; student raters generally awarded middle scores to their peers.
Furthermore, the peer-assessors tended to award more lenient scores
to lower level writers and harsher scores to more proficient writers.
These results were consistent with previous studies conducted by
Hughes and Large (1993) and Kwan and Leung (1996).
Next, following Hughes and Large (1993), the peer-assessments
were analyzed in order to determine whether students who per-
formed well on the writing assignment marked their peers’ essays
more severely than students whose essays received low ratings. The
possible presence of this tendency was examined using bias analysis
by analyzing both writing ability and biased z-scores. The results
indicate that almost an equal number of severe and lenient biases
were detected for participants with writing abilities between .00 and .99 (14 z-scores above +2.00 vs. 16 below −2.00), writers with abilities from 1.00 to 1.99 (11 above +2.00 vs. 14 below −2.00), and writers with abilities above 2.00 (5 above +2.00 vs. 4 below −2.00). Overall, biased
ratings were not dependent on the students’ writing abilities. High-
achieving writers did not often rate their peers severely and low-
achieving writers did not often rate their peers leniently. Hence, the
participants in this study were able to make reasoned assessments
independent of their own performances.
T1 assessed 14 low-ability writers (14.4%), those with ability estimates below 1.00 logits, more strictly than expected, as shown by their positive z-scores, while no low-ability writers obtained negative z-scores (0.00%), indicating that none were assessed leniently. On the other
hand, 14 high ability writers (14.4%), whose estimated ability was
above 1.00 logits, obtained negative z-scores, indicating that they
obtained higher scores than expected, while only four writers
(4.12%) whose estimated ability was above 1.00 logits obtained a
positive z-score, indicating that they obtained lower scores than
expected. These results are also consistent with the SD of T1’s rat-
ings, which were the largest compared with the other teacher raters.
Three writers (10.00%) with ability logits less than 1.00 and one writer (3.33%) with an ability logit above 1.00 were given lenient
scores from T2, while five writers (16.65%) with ability estimates
greater than 1.00 obtained overly severe scores. No writer with an
ability estimate less than 1.00 obtained a severe score from this rater.
T3 displayed few bias patterns. Two higher ability writers (6.67%)
obtained negative z-scores and three higher ability writers (10.00%)
obtained positive z-scores. In addition, only one lower ability writer
(3.33%) obtained a negative z-score.
T4 awarded five higher ability writers (13.51%) overly lenient
scores as shown by the negative z-scores; however, three high ability
writers (8.11%) also received harsh scores, as indicated by their positive z-scores. Two lower ability writers (5.40%) obtained a harsh
score while no lower ability writer obtained a lenient score from T4.
T1 and T4’s bias patterns were similar; however, T4’s tendency toward
awarding lenient scores to able writers and harsh scores to less able
writers is less pronounced than that of T1.
Overall, the teacher-raters displayed relatively individual bias pat-
terns. As Wigglesworth (1993) mentioned, the FACETS bias analy-
sis can be used in rater training because showing raters their bias
patterns can lead to improved rating performance. This study also
confirmed the necessity of rater training using the bias analysis
results produced by FACETS. In addition, it is noteworthy that peer-
raters produced fewer bias interactions than the self- and teacher-raters.
This suggests that in at least some contexts, peer-assessments can
play a useful role in writing classes.
IV Limitations
There were four major limitations in this study. The first limitation
concerns the small amount of overlap in the ratings. MFRM does not
require that every examinee be rated by every judge on every assess-
ment criterion; that is, Rasch estimates can be obtained so long as a
certain degree of overlap is maintained; therefore, each essay was
rated by three peers, two teachers, and the writer of the essay.
However, as Linacre (2002c) noted, the measures that are arrived at
with this technique are less precise than with complete data. About
84% fewer observations were made in this study than was possible.
Even though asking each student to assess all of his/her peers is
unrealistic, more overlap of ratings is preferable as this would result
in more precise measurement.
The second limitation concerns rater leniency. The individual rater
report showed that the raters were separated into different levels of
severity ranging from −2.17 to 2.28; however, the FACETS map (see
Figure 1) showed that some able writers could not be assessed pre-
cisely because no raters had severities equivalent to them. One
implication is that the raters, including the teachers, were too lenient.
Although leniency is often the case for peer-raters, it would be preferable to find more severe raters, or to ask peer-raters to rate more severely, in order to assess high-proficiency writers more precisely.
Third, the addition of qualitative research methods would have
made the results more understandable. For example, in this study,
three underfitting and seven overfitting raters were detected.
Conducting interviews or think-aloud protocols could have shed
light on the reasons for the misfit.
Fourth, how general proficiency differences affect self- and peer-
assessments is an important issue. In the present study, no general
proficiency scores, such as TOEIC or TOEFL scores, were available.
The participants were in similar prestigious universities in Japan;
however, this did not mean that they were all at the same level of
English proficiency. The results reported in this study may not apply
to foreign language learners who are at lower or higher English pro-
ficiency levels.
V Conclusion
This paper has reported on a quantitative investigation of self-, peer-,
and teacher-assessments of English essays written by Japanese uni-
versity students using Multifaceted Rasch Measurement. There were
several important findings. First, self-raters, and especially high
achieving writers, were overly critical toward themselves. This result
was probably caused by the tendency of many Japanese to display a
degree of modesty. Second, peer raters, who did not show much vari-
ance, were lenient, but were at least internally consistent and their
rating patterns were not dependent on their own writing perform-
ance; higher achieving writers were not more severe raters, and
lower achieving writers were not more lenient raters. However, peer
rating patterns were dependent on the writers’ writing ability; peer
raters rated low-achieving writers leniently and high-achieving writ-
ers severely, meaning that they did not give high scores randomly.
Moreover, peer-raters produced fewer bias interactions than the self-
and teacher-raters. Third, teacher raters also varied and each had
his/her own unique bias pattern, a finding that showed that one teacher rater was not sufficient for assessing students' essays. Fourth,
Spelling and Format showed poor fit to the Rasch model; however,
a close inspection revealed that the misfit was caused by the teacher-
raters, not the peer-raters. Although these mechanical criteria are
often evaluated in writing studies, they may differ from other assess-
ment criteria such as organization and content, and may not discrim-
inate well between different levels of writing ability and/or indeed
may not be valid measures of writing ability.
Taken as a whole, at the present time, it is difficult to recommend
using self-assessment for formal grading; however, because even
internally consistent teacher raters demonstrated unique biases toward
specific writers, peer-assessment can possibly supplement teacher-
assessment and compensate for shortcomings in teacher-assessment.
The present study showed that most of the peer-assessors were inter-
nally consistent, their rating patterns were not dependent on their own
writing performance, and fewer biases were produced by peer-raters
than self- and teacher-raters. Those findings suggest that peer-raters
have the potential to make important contributions to the overall
assessment process. By using MFRM, teachers can inform peer-raters
of their bias patterns and help them develop better quality assessment
criteria, two steps that would lead to better quality peer-assessment.
Even though few researchers have utilized MFRM to investigate
self- and peer-assessments, this study shows that MFRM can be suc-
cessfully utilized to investigate self-, peer-, and teacher-assessments by
specifying writer ability, rater severity, interrater and intrarater consis-
tency, and the difficulty of assessment criteria. Moreover, the bias pat-
terns of self-raters, peer-raters, and teacher-raters were also effectively
illuminated by MFRM. Many of the results in this study cannot be
detected by traditional statistical approaches, such as correlation and
ANOVA. I believe that as more researchers use this research technique,
we can illuminate a multitude of facets of self- and peer-assessments.
Acknowledgements
This article is a part of my dissertation at Temple University Japan.
I would like to express my deep appreciation to Dr David Beglar who
acted as my advisor and provided me with valuable comments
throughout this study. This research also owes much to the helpful
advice of Dr James Dean Brown, Dr Kim Kondo-Brown, Dr Kenneth
Schaefer, Dr Marshall Childs, and my cohorts at Temple University
Japan. I would also like to express my gratitude to Dr Glenn Fulcher,
co-editor of Language Testing, and the anonymous Language Testing review-
ers for their invaluable comments and suggestions.
VI References
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self rat-
ings of communicative language ability. Language Testing, 6(1), 14–29.
Bailey, K. M. (1998). Learning about language assessment. Cambridge, MA:
Heinle & Heinle.
Ballantyne, R., Hughes, K., & Mylonas, A. (2002). Developing procedures
for implementing peer assessment in large classes using an action research
process. Assessment & Evaluation in Higher Education, 27(5), 427–441.
Brown, J. D., Hudson, T., Norris, J., & Bonk, W. (2002). An investigation of
second language task-based performance assessments. Honolulu:
University of Hawai’i.
Bonk, W., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second
language group oral discussion. Language Testing, 20(1), 89–110.
Caulk, N. (1994). Comparing teacher and student responses to written work.
TESOL Quarterly, 28(1), 181–188.
Cheng, W., & Warren, M. (1997). Having second thoughts: Student percep-
tions before and after a peer assessment exercise. Studies in Higher
Education, 22(2), 233–240.
Cheng, W., & Warren, M. (2005). Peer assessment of language proficiency.
Language Testing, 22(1), 93–121.
Conway, R., & Kember, D. (1993). Peer assessment of an individual’s contri-
bution to a group project. Assessment & Evaluation in Higher Education,
18(1), 45–54.
Evans, A. W., Aghabeigi, B., Leeson, R., O’Sullivan, C., & Eliahoo, J.
(2002). Are we really as good as we think we are? Annals of the Royal
College of Surgeons of England, 84(1), 54–56.
Evans, A. W., McKenna, C., & Oliver, M. (2002). Self-assessment in medical
practice. Journal of the Royal Society of Medicine, 95(10), 511–513.
Freeman, M. (1995). Peer assessment by groups of group work. Assessment &
Evaluation in Higher Education, 20(3), 289–301.
Goldfinch, J. M. (1994). Further developments in peer assessment of group
projects. Assessment and Evaluation in Higher Education, 19(1), 29–35.
Goldfinch, J. M., & Raeside, R. (1990). Development of a peer assessment
technique for obtaining individual marks on a group project. Assessment
& Evaluation in Higher Education, 15(3), 210–225.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-
Lyons (Ed.), Assessing second language writing in academic contexts
(pp. 241–276). Norwood, NJ: Ablex.
Hanrahan, S., & Isaacs, G. (2001). Assessing self- and peer assessment: The
students’ views. Higher Education Research and Development, 20(1),
53–70.
Hargreaves, A., Earl, L., & Schimidt, M. (2001). Perspectives on alternative
assessment reform. American Educational Research Journal, 39(1), 69–95.
Hughes, I. E., & Large, B. J. (1993). Staff and peer-group assessment of oral
communication skills. Studies in Higher Education, 18(3), 379–385.
Ikeno, O. (2002). The Japanese mind: Understanding contemporary culture.
North Clarendon: Tuttle.
Jacobs, H. L., Zingraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B.
(1981). Testing ESL composition: A practical approach. Rowley, MA:
Newbury House.
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring
Japanese second language writing performance. Language Testing,
19(1), 3–31.
Kozaki, Y. (2004). Using GENOVA and FACETS to set multiple standards on
performance assessment for certification in medical translation from
Japanese into English. Language Testing, 21(1), 1–27.
Kwan, K., & Leung, R. (1996). Tutor versus peer group assessment of student
performance in a simulation training exercise. Assessment & Evaluation
in Higher Education, 21(3), 205–215.
Linacre, J. M. (1989/1994). Many-facet Rasch measurement. Chicago, IL:
Institute for Objective Measurement.
Linacre, J. M. (1999). FACETS: Computer program for many-faceted Rasch
measurement (Version 3.22). Chicago, IL: MESA Press.
Linacre, J. M. (2002a). What do infit and outfit, mean-square and standard-
ized mean? Rasch Measurement Transactions, 16(2), 878.
Linacre, J. M. (2002b). Optimizing rating scale category effectiveness.
Journal of Applied Measurement, 3(1), 85–106.
Linacre, J. M. (2002c). Construction of measures from many-facet data.
Journal of Applied Measurement, 3(4), 486–512.
Linacre, J. M., & Williams, J. (1998). How much is enough? Rasch
Measurement Transactions, 12(3), 653.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias:
Implications for training. Language Testing, 12(1), 54–71.
Lunz, M. E. (1992). New ways of thinking about reliability. Professions
Education Researcher Quarterly, 13(4), 16–18.
Lunz, M. J., & Linacre, J. M. (1998). Measurement designs using multifacet
Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for busi-
ness research (pp. 44–47). Mahwah, NJ: Lawrence Erlbaum.
Lunz, M. E., Stahl, J. A., & Wright, B. D. (1994). Interjudge reliability and
decision reproducibility. Educational and Psychological Measurement,
54(4), 914–925.
Mangelsdorf, K. (1992). Peer reviews in the ESL composition classroom:
What do the students think? ELT Journal, 46(3), 274–283.
McNamara, T. F. (1996). Measuring second language performance. Harlow:
Addison Wesley Longman.
McQueen, J., & Congdon, P. (1997). Rater severity in large-scale assessment:
Is it invariant? (ERIC Document Reproduction Service No. ED411303)
Mendonça, C. O., & Johnson, K. E. (1994). Peer review negotiations: Revision
activities in ESL writing instruction. TESOL Quarterly, 28(4), 745–769.
Nakamura, Y. (2002). Teacher assessment and peer assessment. (ERIC
Document Reproduction Service No. ED464483)
Oldfield, K. A., & Macalpine, J. M. K. (1995). Peer and self-assessment at
tertiary level: An experiential report. Assessment & Evaluation in Higher
Education, 20(1), 125–132.
Orsmond, P., & Merry, S. (1997). A study in self-assessment: Tutor and stu-
dents’ perceptions of performance criteria. Assessment & Evaluation in
Higher Education, 22(4), 357–370.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived
marking criteria in peer and self-assessment. Assessment and Evaluation
in Higher Education, 25(1), 23–38.
Patri, M. (2002). The influence of peer feedback on self- and peer-assessment
of oral skills. Language Testing, 19(2), 109–131.
Perkins, K. (1983). On the use of composition scoring techniques, objective
measures, and objective tests to evaluate ESL writing ability. TESOL
Quarterly, 17(4), 651–671.
Pope, N. K. L. (2005). The impact of stress in self- and peer assessment.
Studies in Higher Education, 30(1), 51–63.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis
and analysis of experiential factors. Language Testing, 15(1), 1–20.
Rudy, D. W., Fejfar, M. C., Griffith, C. H. III., & Wilson, J. F. (2001). Self-
and peer-assessment in a first-year communication and interviewing
course. Evaluation and the Health Professions, 24(4), 436–445.
Saito, H., & Fujita, T. (2004). Characteristics and user acceptance of peer rating
in EFL writing classrooms. Language Teaching Research, 8(1), 31–54.
Sluijsmans, D. M. A., Brand-Gruwel, S., & van Merriënboer, J. J. G. (2002). Peer
assessment training in teacher education: Effects on performance and per-
ceptions. Assessment & Evaluation in Higher Education, 27(5), 443–454.
Sullivan, K., & Hall, C. (1997). Introducing students to self-assessment.
Assessment & Evaluation in Higher Education, 22(3), 289–306.
Swanson, D., Case, S., & Van der Vleuten, C. (1991). Strategies for student
assessment. In D. Boud & G. Feletti (Eds.), The challenge of problem
based learning (pp. 269–282). London: Kogan Page.
Swanson, D., Case, S., & Van der Vleuten, C. (1997). Strategies for student
assessment. In D. Boud & G. Feletti (Eds.), The challenge of problem
based learning (pp. 269–282). London: Kogan Page.
Takada, N., & Lampkin, R. (1996). The Japanese way: Aspects of behavior,
attitudes and customs of the Japanese. New York: McGraw-Hill.
Taras, M. (2001). The use of tutor feedback and student self-assessment in
summative assessment tasks: Towards transparency for students and for
tutors. Assessment & Evaluation in Higher Education, 26(6), 289–306.
Topping, K. J., Smith, E. F., Swanson., I., & Elliot, A. (2000). Formative
peer assessment of academic writing between postgraduate students.
Assessment & Evaluation in Higher Education, 25(2), 149–169.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater
consistency in assessing oral interaction. Language Testing, 10(3),
305–335.
Williams, E. (1992). Student attitudes towards approaches to learning and
assessment. Assessment & Evaluation in Higher Education, 17(1), 45–59.
Yang, W.-L. (2000). Analysis of item ratings for ensuring the procedural
validity of the 1998 NAEP achievement-levels setting. (ERIC Document
Reproduction Service No. ED440136)
Appendix A
Essay evaluation sheet
Essay Number
Evaluator’s Name
average
1. Overall Impression 1 2 3 4 5 6
Content
2. Amount 1 2 3 4 5 6
3. Thorough development of thesis 1 2 3 4 5 6
4. Relevance to an assigned topic 1 2 3 4 5 6
Organization
5. Introduction & Thesis statement 1 2 3 4 5 6
6. Body & Topic sentence 1 2 3 4 5 6
7. Conclusion 1 2 3 4 5 6
8. Logical Sequencing 1 2 3 4 5 6
Vocabulary
9. Range 1 2 3 4 5 6
10. Word/ idiom Choice 1 2 3 4 5 6
11. Word Form 1 2 3 4 5 6