
Language Testing 2009 26 (1) 075–100

Self-, peer-, and teacher-assessments in Japanese university EFL writing classrooms

Sumie Matsuno, Nagoya University, Japan

Multifaceted Rasch measurement was used in the present study with 91 stu-
dent and 4 teacher raters to investigate how self- and peer-assessments work
in comparison with teacher assessments in actual university writing classes.
The results indicated that many self-raters assessed their own writing
lower than predicted. This was particularly true for high-achieving students.
Peer-raters were the most lenient raters; however, they rated high-achieving
writers lower and low-achieving writers higher. This tendency was independ-
ent of their own writing abilities and therefore offered no support for the
hypothesis that high-achieving writers rated severely and low-achieving
writers rated leniently. On the other hand, most peer-raters were internally
consistent and produced fewer bias interactions than self- and teacher-raters.
Each of the four teachers was internally consistent; however, each displayed
a unique bias pattern. Self-, peer-, and teacher-raters assessed Grammar
severely and Spelling leniently. The analysis also revealed that teacher-raters
assessed Spelling, Format, and Punctuation differently from the other criteria.
It was concluded that self-assessment was somewhat idiosyncratic and
therefore of limited utility as a part of formal assessment. Peer-assessors on
the other hand were shown to be internally consistent and their rating
patterns were not dependent on their own writing performance. They also
produced relatively few bias interactions. These results suggest that in at
least some contexts, peer-assessments can play a useful role in writing
classes. By using multifaceted Rasch measurement, teachers can inform
peer-raters of their bias patterns and help them develop better quality assess-
ment criteria, two steps that might lead to better quality peer-assessment.

Keywords: bias analysis, essay writing, FACETS, Japanese university, language testing, self-assessment, peer-assessment, teacher-assessment

Address for correspondence: Sumie Matsuno, Part-time lecturer, The Graduate School of Languages and Cultures, Nagoya University, Nagoya, 464-860 Japan; email: msk77@sage.ocn.ne.jp

© 2009 SAGE Publications (Los Angeles, London, New Delhi and Singapore) DOI: 10.1177/0265532208097337

It is generally acknowledged that in order to assess students' learning, proficiency, and knowledge, educators need to use a variety of assessment methods (Orsmond, Merry & Reiling, 2000; Pope, 2005); however, in traditional classroom settings, which are often
found in present-day Japan, the teacher acts as the sole evaluator.
When students take a test made up of items that have only one cor-
rect answer, the traditional approach is generally appropriate; how-
ever, in performance tests, such as oral presentations, written
compositions, and role-plays, the use of a single assessor can result
in potentially biased evaluations. As a result of attempts to overcome
the limitations of teacher assessments, alternative assessments, such
as self-assessments and peer-assessments, have been the focus of
increasing interest in the field of education (Hargreaves, Earl &
Schmidt, 2001).

I Literature review
Self-assessment can be defined as ‘procedures by which the learners
themselves evaluate their language skills and knowledge’ (Bailey,
1998, p. 227). In first language assessment, self-assessment has often
been reported as an effective tool because self-assessment helps stu-
dents to develop a better understanding of the purpose of the assign-
ment and the assessment criteria (Orsmond & Merry, 1997),
improves learning (Sullivan & Hall, 1997), and softens the blow of
a bad grade by helping students understand the reasons for their
grade (Taras, 2001). Moreover, most participants in previous studies
were satisfied with the perceived benefits of self-assessment (Taras,
2001). On the other hand, whether students assess themselves cor-
rectly is controversial; Sullivan and Hall (1997) found that 39% of
the students overestimated their performance and Oldfield and
Macalpine (1995) found a low correlation between self- and teacher
assessments (r = .30).
Peer-assessment can be defined as ‘an arrangement for peers to
consider the level, value, worth, quality or successfulness of the
products or outcomes of learning of others of similar status’
(Topping, Smith, Swanson & Elliot, 2000, p. 150). In the field of
first language pedagogy, peer-assessment has also been considered
as an effective tool in both group and individual projects. Peer-
assessment has been found to help teachers assess each person’s
effort in group projects (Conway & Kember, 1993; Goldfinch, 1994;
Goldfinch & Raeside, 1990) and to help students learn more and
work cooperatively in a group (Kwan & Leung, 1996). In individual
tasks, students can be more involved in assessment and instruction,

which leads to greater satisfaction with the class (Sluijsmans, Brand-


Gruwel & Marriënboer, 2002). Many students have reported that
peer-assessment facilitates their learning (Ballantyne, Hughes &
Mylonas, 2002). Moreover, some research has found reasonably
high correlations between teacher- and peer-assessments (Freeman,
1995; Pope, 2005; Sullivan & Hall, 1997). On the other hand, some
researchers still doubt that peer-assessment can be used as formal
assessment. Goldfinch and Raeside (1990) found a correlation of .38,
and Kwan and Leung (1996) found low agreement between peer-
and teacher assessments (46.9%).
In the literature on second language assessment, the reliability and
validity of self-assessments were measured by comparing self-rating
questionnaire results with estimates of the participants’ language
proficiency (e.g., TOEFL scores), rather than by comparing students' per-
formance (e.g., presentation and essay writing) with their self-
assessments. For example, using this former method, Bachman and
Palmer (1989) found that the ESL learners in their study were gen-
erally able to perceive areas in which they had difficulty, and Ross
(1998) found that ‘can-do’ questions derived from instructional
materials resulted in more accurate self-assessments than the more
abstract questions. Even when the latter method was utilized, the use of self-assessment as a formal assessment remained controversial because researchers often found low correlations between self- and teacher-assessments (Patri, 2002; Saito & Fujita, 2004).
Peer-assessment in ESL/EFL contexts has often been conducted
qualitatively under such names as peer-response and peer-review
(Caulk, 1994; Mangelsdorf, 1992; Mendonça & Johnson, 1994); on
the other hand, few researchers conducting studies of self- and peer-
assessment have used quantitative methods. When quantitative methods were used, they mostly involved calculating simple correlations (e.g., Patri, 2002) and paired t-tests (e.g., Cheng & Warren, 2005) within the true-score approach.
Overall, four limitations can be identified in the above literature.
First, although extensive research into self- and peer-assessment in
first language pedagogy suggests that self- and peer-assessments are
pedagogically beneficial, few ESL/EFL researchers (Cheng &
Warren, 1997; Nakamura, 2002; Saito & Fujita, 2004; Patri, 2002)
have investigated how effectively students can function as raters.
Most research into rater performance in ESL/EFL writing contexts
concerns professional raters and is focused on the differences between
native and non-native raters, novice and experienced raters, and trained
and untrained raters. Although peer-responses and peer-reviews,

where peers respond to and edit each other’s written work qualita-
tively, have been conducted, empirical studies of self- and peer-
assessments are extremely limited.
Second, in both first and second language pedagogy, the question of
whether self- and peer-assessments can be used as part of formal class-
room assessment has been a point of contention. For example, a num-
ber of researchers have reported high correlations between student- and
teacher-assessments (e.g., Oldfield & Macalpine, 1995; Rudy, Fejfar,
Griffith & Wilson, 2001), while other studies have shown low correla-
tions between them (Freeman, 1995; Swanson et al., 1997).
Third, most previous researchers have been concerned with inter-
rater reliability and have measured this using simple correlations; the
degree to which raters are internally consistent (intrarater consis-
tency) has rarely been discussed. Only researchers using MFRM
(e.g., Brown, Hudson, Norris & Bonk, 2002; Kondo-Brown, 2002;
Kozaki, 2004; Lumley & McNamara, 1995; McQueen & Congdon,
1997; Wigglesworth, 1993) have investigated intrarater consistency.
In fact, whether raters are internally consistent is one of the most
important aspects of rater performance.
Fourth, most researchers have used the traditional true-score
approach. Because each rater interprets rating scale categories in an
idiosyncratic way, traditional approaches to measurement are prob-
lematic because they do not adequately address issues such as rater
severity/leniency, and assessment criterion difficulty/easiness. For
example, even if raters correlate perfectly using the traditional true-
score approach, there is still some doubt about whether they are rat-
ing examinees in the same way (Lunz, 1992).
In the present study, multifaceted Rasch measurement (MFRM) is used to investigate how self- and peer-assessments work in comparison with teacher assessments in actual university writing classes. MFRM can show the degree to which raters are internally and externally consistent. Moreover, MFRM can measure rater severity/leniency and, through bias analyses, show to whom and to which assessment criteria each rater displays bias. Overall, MFRM can illuminate the inner workings of the assessment process. Even
though MFRM has been utilized in various rating studies, only one
study of self- and peer-assessment (Nakamura, 2002) was conducted
using this approach. Hence in the present study, MFRM was utilized
to investigate self-, peer-, and teacher-assessments, and more specif-
ically whether student raters are externally and internally consistent.
This was accomplished through an inspection of writers’ abilities,
raters’ severities, and assessment criteria difficulties, and the ways in

which writers’ abilities and assessment criterion difficulties differed


among self-, peer-, and teacher-assessments. Bias analyses were also
conducted in order to investigate how much and what type of bias was
evident in the students’ and teachers’ assessments of writer abilities.
In the present study, four research questions were investigated.
1. To what degree do writers’ abilities, raters’ severities, and
assessment criteria’s difficulties vary and fit the model?
2. How do self-assessors, peer-assessors, and teacher-assessors dif-
fer when assessing writers’ abilities?
3. How do self-, peer-, and teacher-assessments compare in terms
of assessment criterion difficulty and its infit mean square
value?
4. To what degree do self-assessors, peer-assessors, and teacher-
assessors exhibit bias towards writers’ abilities, and what types
of bias are they?

II Method
1 Participants
Ninety-seven second-year Japanese students participated in the
study as essay writers; however, because six students did not rate
their own and peers’ essays, they were eliminated as raters from the
study, leaving 97 student writers and 91 student raters. Eighty-one of
the students were enrolled in a prefectural university and sixteen
were enrolled in a national university in central Japan. Both univer-
sities are considered prestigious and highly competitive. The
participants’ ages ranged from 19 to 21 years old and none of the par-
ticipants had received extensive composition instruction before this
course.
In addition to the student participants, four native Japanese teachers
(T1–T4) were selected for this study. They were chosen because they
had more than five years of experience teaching English as a foreign
language at the university level. In addition, they had experience teaching or researching EFL writing for at least three years.

2 Procedures
The study was conducted in two university writing classes, and
between the first and seventh weeks, the participants received instruc-
tion concerning essay writing such as essay format, mechanics,

organization, and content. Then, during the eighth week, they wrote
an essay on the following topic as a homework assignment: Please
discuss the advantages and disadvantages of college students having
a cellular phone. They were asked to write about 300 words (1 page)
and to bring the assignment to the next class. Four copies of each
essay were made. In the ninth week, based on the essay evaluation
sheet (Appendix A), the participants practiced evaluating three
essays together in class. The students were then instructed to evalu-
ate their own essay and the essays written by five peers at home, an
assignment that was worth 10% of their final course grade. They did
not know whose essays they were evaluating because the names
were deleted from the essays.
T1 evaluated all 97 essays (W1 (Writer 1) to W97), T2 evaluated
30 essays from W1 to W30, T3 evaluated 30 essays from W31 to
W60, and T4 evaluated 37 essays from W61 to W97. Those essays
were randomly assigned to each teacher. Because the multifaceted
Rasch model does not require that all raters evaluate all essays, a suf-
ficient degree of connectedness was achieved with the above rating
plan. The teacher-raters rated the essays at home and returned them
to the researcher within one month.
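
As a rough illustration of the connectedness claim (this sketch is not part of the original study; the rater labels and essay numbers simply restate the plan described above), the teacher portion of the rating plan can be checked as follows: because T1 rated every essay, the disjoint subsets rated by T2, T3, and T4 are all linked through T1, and the student ratings add further links.

```python
# A minimal sketch, assuming the teacher rating plan described above.
# It checks that every teacher is connected to T1 through shared essays,
# which is what allows all ratings to be calibrated on one scale.

def teacher_plan():
    return {
        "T1": set(range(1, 98)),    # W1-W97
        "T2": set(range(1, 31)),    # W1-W30
        "T3": set(range(31, 61)),   # W31-W60
        "T4": set(range(61, 98)),   # W61-W97
    }

def is_connected(plan):
    # Raters are linked when they share at least one essay; starting from T1,
    # keep adding any rater who shares an essay with a rater already reached.
    reached = {"T1"}
    changed = True
    while changed:
        changed = False
        for rater, essays in plan.items():
            if rater not in reached and any(essays & plan[r] for r in reached):
                reached.add(rater)
                changed = True
    return reached == set(plan)

print(is_connected(teacher_plan()))  # True
```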

3 Instruments
Because analytical scaling grids have been found to be generally
more reliable and informative than holistic ones (Hamp-Lyons,
1991; Jacobs et al., 1981; Perkins, 1983), an analytic scoring grid was
tailored for this study based on the ESL composition profile devel-
oped by Jacobs et al. (1981). The ESL composition profile was not used in its original form; that instrument was too difficult for the participants in this study to use because of their limited experience pro-
ducing or assessing English compositions. Therefore, a simplified
analytic scoring grid was used in this study (Appendix A). Jacobs et al.'s ESL composition profile is a weighted scale, with mechanics weighted the least (5%) and content the most
(30%). However, Kondo-Brown (2002) mentioned, ‘it is not clear
how the weightings were determined in the original version, and
some researchers have questioned the assignment of different
weights to evaluation criteria' (p. 9). Hence, in this study, each category was divided into several subcategories (assessment criteria), which were weighted equally and rated on a 6-point scale.
This instrument had been pre-tested with 26 participants, and no assessment criteria misfit the multifaceted Rasch model. In the pre-test, a 4-point rating scale worked more effectively than the 6-point scale; however, as Lunz and Linacre (1998) stated, as more categories are included on a rating scale, the possibility of obtaining more specific judgments increases, so the 6-point Likert scale was used again in the main study.

4 Analyses
Multifaceted Rasch measurement was conducted using the FACETS
computer program, version 3.22 (Linacre, 1999). In the analysis,
writers, raters, and assessment criteria were specified as facets. The
output of the FACETS analysis reported: (a) a FACETS map, (b)
ability measures and fit statistics for each writer, (c) a severity esti-
mate and fit statistics for each rater, (d) difficulty estimates and fit
statistics for each assessment criterion, and (e) a bias analysis for
rater × writer interactions. The FACETS map provides visual infor-
mation about differences that might exist among different elements
of a facet, such as differences in severity among raters and ability
among writers. Writer ability logit measures are estimated concur-
rently with the rater severity logit estimates and assessment criterion
difficulty logit estimates. By placing them on the same linear meas-
urement scale, the results are easily compared. The Rasch model also
provides fit statistics, which provide estimates of the consistency of
the rater rating patterns (Lunz & Linacre, 1998). Bias analyses were
also carried out in order to detect raters who were rating particular
persons too severely or too leniently as well as raters who were using
particular assessment criteria in an overly severe or lenient way.
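
The model underlying these FACETS runs, in its standard many-facet rating scale form (Linacre, 1989/1994), can be written as

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_i - D_j - F_k ,
\]

where \(P_{nijk}\) is the probability that writer \(n\) receives category \(k\) from rater \(i\) on assessment criterion \(j\), \(B_n\) is the writer's ability, \(C_i\) is the rater's severity, \(D_j\) is the criterion's difficulty, and \(F_k\) is the difficulty of the step from category \(k-1\) to \(k\). All of the logit estimates reported below are parameters of this one equation, which is why they can be placed on a single measurement scale.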

III Results and discussions


1 Initial analyses
Linacre (2002b) has proposed the following guidelines for evaluat-
ing rating scales using the Rasch model: (a) each point on the scale
should have at least 10 observations, (b) outfit mean-squares should
be less than 2.0, (c) the step difficulty of each category should
advance by at least 1.4 logits, and (d) the step difficulty of each cat-
egory should advance by less than 5.0 logits. In the initial analyses, because the 6-point scale used in this study did not adequately meet Linacre's criteria, adjacent categories were combined to create a 4-point scale.
Table 1 shows the result.

Table 1 Rating scale statistics after combining categories 2 and 3, and 4 and 5

Original categories   New scale   Count   %    Logit   Outfit   Step difficulty
1                     1           224     2    –1.09   1.1      Low
2, 3                  2           2386    23   –.05    1.0      –3.00
4, 5                  3           6110    59   1.33    1.0      –.28
6                     4           1627    16   2.59    1.0      3.28
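
The recoding and the check against Linacre's (2002b) step-advance guidelines can be illustrated with a short sketch (not taken from the original analysis; the mapping and logit values simply restate Table 1):

```python
# Collapse the original 6-point scale into the 4-point scale of Table 1:
# original categories 2 and 3 become new category 2, and 4 and 5 become 3.
recode = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 4}

# Step difficulties of the recoded scale, as reported in Table 1 (logits).
steps = [-3.00, -0.28, 3.28]

# Linacre (2002b): each step should advance by at least 1.4 and by less than 5.0 logits.
advances = [later - earlier for earlier, later in zip(steps, steps[1:])]
print(advances)                                # roughly [2.72, 3.56]
print(all(1.4 <= a < 5.0 for a in advances))   # True
```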

Next, as an initial analysis, misfitting writers and raters were


examined by infit mean square values. Infit mean square values indi-
cate the degree to which each writer/rater fit the Rasch model. They
provide summaries of the size and direction of the residuals (dis-
crepancies between the predicted and observed data) for each
writer/rater. Infit mean square values have an expected value of 1.
Individual values will be above or below 1, indicating that the
observed values show greater variation (more than 1) or less varia-
tion (less than 1) than predicted by the Rasch model.
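
For reference, the infit statistic in the standard Rasch formulation is an information-weighted average of squared standardized residuals; for a rater \(i\), summing over all observations \(x_{nij}\) in which that rater took part,

\[
z_{nij} = \frac{x_{nij} - E_{nij}}{\sqrt{W_{nij}}},
\qquad
\text{Infit MS}_i = \frac{\sum_{n,j} W_{nij}\, z_{nij}^2}{\sum_{n,j} W_{nij}},
\]

where \(E_{nij}\) is the rating expected under the model and \(W_{nij}\) is its model variance. The same definition applies to writers and assessment criteria by summing over the relevant observations.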
The acceptable range of the infit mean square statistic varies
depending on the assessment context. McNamara (1996) suggested .70 to 1.20 as an acceptable range; however, his primary concern was ratings by professional raters such as ESL teachers. Bonk and Ockey
(2003) stated that ‘a common rule of thumb for acceptable values is
the range from .70 or .80 to 1.20 or 1.30, but this is based on well-
behaved data from multiple-choice items, and these values may not
represent realistic fit goals for rating data’ (p. 96). On the other hand,
Linacre (2002a), the developer of the FACETS software, has stated
that infit mean square values of .50 to 1.50 are fruitful for measure-
ment. Lunz, Stahl and Wright (1994) also suggested .50 to 1.50 as an acceptable range. Therefore, in this study, values between .50 and 1.50 are viewed as acceptable.
Using this range, 10 raters and eight writers misfit the Rasch model. The infit mean square values of these raters were 1.80 (Rater 42), 1.70 (R58 and R62), .40 (R17, R24, R53, and R86), .30 (R81 and
R88), and .20 (R96). Values more than 1.50 indicate that these raters
rate idiosyncratically compared with the other raters. Values less
than .50 indicate that these raters simply had too little variation in
their ratings. The infit mean square values of the misfitting writers were 1.60 (W37, W43, and W85), 1.70 (W30, W44, and W49), 2.00
(W95), and 2.40 (W32); however, from a pedagogical point of view,
all student essays must be evaluated because they are a graded class
assignment regardless of their degree of fit to the Rasch model; on

the other hand, when it has been determined that some student raters
do not assess the essays seriously or do not meet the expectations of
the Rasch model, their ratings can be justifiably eliminated in order
to improve the precision of the ability estimates. Therefore, all of the
student essays were included and ten student raters were eliminated from the analysis. Overall, the final analysis was made up of 97 student
writers, 81 student raters, and four teacher raters.
Finally, the unidimensionality of the assessment criteria was
checked. Misfitting assessment criteria were first identified using
infit mean squares from .50 to 1.50 as the acceptable range. On the
first FACETS run, two assessment criteria were found to misfit the
model (Spelling = 1.60 and Format = 1.60). These two criteria were
apparently viewed differently by different raters. One way to check
the impact of misfit suggested by Linacre and Williams (1998) is to
remove bad data in layers and compare the resulting measures with
scatterplots. When the plots show no meaningful change, the remain-
ing data are sufficiently good. Using this approach, the scatterplots
show a straight line in this study, which indicates that no meaningful
change occurs as a result of the deletions; hence all of the original
criteria were included in the analyses.
Table 2 shows the descriptive statistics for the self-, peer-, and
teacher-assessments. The peer assessments include 40 missing data
points but this is not a problem because the FACETS analysis can
tolerate a relatively large number of missing observations (Linacre,
1989/1994, p. 5). The following analyses are based on data from 68
self-assessors and 81 peer-assessors. Each of the 81 peer-assessors rated five writers on 16 criteria; hence, the total should have been 6480 ratings; however, because of the 40 missing data points, the total was 6440.
As shown in Table 2, the mean of the peer assessments is the high-
est, and the means of self- and teacher-assessments are the same.
Among the teachers, the mean of T3 is the lowest. The SD of peer
assessments is the smallest, which indicates that peers did not award
extreme scores frequently; on the other hand, the SD of teacher-
assessment is .81. Among the teachers, T1 had the largest SD, indi-
cating that she awarded a relatively wide variety of scores.
Regarding skewness, all of the raters except T4 were skewed; the self-assessments and T3's assessments were positively skewed and the others were negatively skewed. In addition, the teacher-assessments and T4 demonstrated negative kurtosis. However, skewness and kurtosis
are not issues in this study because MFRM does not require normal
distributions.

Table 2 Descriptive statistics for self-, peer-, and teacher assessments

Rater N M SEM SD Skewness SES Kurtosis SEK

Self 1088 2.74 .02 .61 .64 .07 –.24 .14


Peer 6440 2.94 .01 .58 –.07 .03 .05 .06
Teacher 3104 2.74 .02 .81 –.22 .04 –.43 .09
T1 1552 2.86 .02 .84 –.54 .06 –.14 .12
T2 480 2.70 .04 .78 –.25 .11 –.28 .22
T3 480 2.51 .03 .69 .52 .11 –.24 .23
T4 592 2.62 .03 .78 .05 .10 –.48 .20

Note: N = the number of assessments.

2 The FACETS map and the rater, writer, and assessment criterion reports
a Research question 1: Research question 1 asked to what degree
do writers’ abilities, raters’ severities, and assessment criteria’s difficul-
ties vary and fit the model? This question is answered by the output
in the FACETS map, the Separation Index, the reliability estimate, the
chi-square values, and the infit mean square values of writers' abilities, raters' severities, and assessment criterion difficulties. In Figure 1,
the first column shows the Rasch logit scale. Unlike raw test scores
in which the distances between points may not be equal, the logit
scale is a true interval scale. The last column shows each point on the
4-point rating scale used in the analysis.
The second column shows the raters’ severities. The most severe
raters appear at the upper part of the figure and the least severe raters
are toward the lower part. Each asterisk (*) indicates one student
rater, and T1, T2, T3, and T4 indicate the four teacher raters. The
Separation Index of 4.80 indicates that the variance among raters is
4.80 times the error of estimates. The reliability estimate of .96 indi-
cates that the raters are separated into different levels of severity.
Moreover, the chi-square of 2169.80 (df = 84) is statistically significant (p < .01), indicating that the raters were not all equally severe. The student raters ranged in severity from –2.17 to 2.28, and the teachers ranged from .90 to 1.30. Compared with the teachers' sever-
ity, the peer raters’ severity is quite diverse.
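
The Separation Index and the reliability estimate reported here are linked by the usual Rasch relationship: with \(G\) the ratio of the error-adjusted ('true') spread of the severity estimates to their average measurement error,

\[
G = \frac{SD_{\text{true}}}{RMSE}, \qquad R = \frac{G^2}{1+G^2},
\qquad \text{so that } R = \frac{4.80^2}{1+4.80^2} \approx .96 ,
\]

which matches the reported values.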
The third column displays the writers’ ability estimates. The most
able writers are at the upper part of the figure and least able at the
lower part; each asterisk (*) represents two writers. M indicates one
misfitting writer. Misfitting writers were often lower ability writers.
There is considerable variation in writer ability (ability estimates

range from –1.70 to 2.95). The Separation Index is 4.83, indicating


that the variance among writers is about 4.80 times the error of esti-
mates and that there are approximately five statistically distinct lev-
els of writers. Reliability is .96, indicating that the analysis is reliably
separating writers into different levels of ability. This is yet another
indication that the writers have widely differing levels of ability. In
addition, the chi-square of 2917.80 with 96 df is significant at
p < .01, a result that also indicates that writers' abilities differ sig-
nificantly from one another.
The fourth column shows the difficulty of the assessment criteria.
The most leniently scored assessment criterion was Spelling (difficulty measure = –1.72), at the bottom of the map, and the most harshly scored assessment criterion was Grammar (.78), at the top. The difficulty
span between these two assessment criteria was 2.50 logits. M indi-
cates misfitting. The Separation Index was 8.30, indicating that the
variance among assessment criteria was about 8.30 times the error of
the estimates. The reliability was very high (.99) indicating that the
assessment criteria were statistically different from one another in
term of their difficulty. This was confirmed by the chi-square of
1054.00 with 15 df, which was statistically significant at p < .01.
In Figure 1, writers and assessment criteria were not spread as
widely as the raters. Moreover, some extremely good writers were
not measured precisely because the raters were relatively lenient and
the assessment criteria were too easy in comparison with their abil-
ity. Some lower ability writers were also not well measured by the
assessment criteria; for these students, most of the assessment criteria
were too difficult. A writer whose ability estimate is .00 logits is
likely to get about 2.57 points on an average difficulty category
when assessed by an average-severity rater.
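
That expectation of about 2.57 follows directly from the rating scale structure in Table 1. Assuming the Andrich rating scale parameterization used by FACETS, with writer ability, rater severity, and criterion difficulty all at .00 logits and step difficulties of –3.00, –.28, and 3.28, the category probabilities are

\[
P_k \propto \exp\!\Big({\textstyle\sum_{j=2}^{k}} (0 - F_j)\Big)
\;\Longrightarrow\;
(P_1, P_2, P_3, P_4) \approx (.02, .41, .55, .02),
\]

so the expected rating is \(E[X] \approx 1(.02) + 2(.41) + 3(.55) + 4(.02) = 2.57\).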
The writers’ ability estimates were high in relation to the raters.
This can be explained in two ways. One is that the writers actually
exhibited high-level writing ability. The other is that the raters were
too lenient and awarded high scores to students who had not pro-
duced good essays. The second interpretation is more probable
because, as some researchers have indicated (e.g., Hanrahan &
Isaacs, 2001; Kwan & Leung, 1996; Williams, 1992), peers gener-
ally do not give severe scores to their classmates even when they do
not know whose essays they are rating.
Regarding the question ‘to what degree do writers’ abilities, raters’
severities, and assessment criteria’s difficulties fit the model?’ the fit
statistics show that eight writers underfit the model, which means

| Logit | Raters | Writers | Items | Scale|


+4 + + + + (4) +
+3 + + + + +
| | |* | | |
| | | | | |
| | | | | |
| | |* | | |
| |* |* | |3 |
| | | | | |
+2 + +** + + +
| | |**** | | |
| |**** |****** |Grammar | |
| |** |*** |Conclusion | |
| | |**M |Intro / logical Sequencing / Impression / Development | |
| |*** |*** |Body / W Form / W Choice | |
| |** |* |Sentence | |
+1 +****T3 +* +Range + +
| |***** |** | | |
| |***T4 | | | |
| |******T2 |** |Amount | |
| |** |* |Punctuation | |
| |** |*M |Format(M) / Relevance | |
| |* |*M | | |
+0 +***T1 +* + +2 +
| |** | | | |
| |***** |M | | |
| |******** |M | | |
| |***** | | | |
| |**** | | | |
| |*** |*M | | |
+ -1 +**** +* + + +
| |** | | | |
| |** | | | |
| |*** |M | | |
| | |M | | |
| | | |Spelling(M) | |
| | | | | |
+ -2 +** + + +(1) +
| |* | | | |
| | | | | |
| | | | | |
| | | | | |
| |* | | | |
| | | | | |
+ -3 + + + + +

Figure 1 FACETS map of writer ability, rater severity, assessment criterion diffi-
culty, and Likert scale functioning.
Note: Each asterisk (*) indicates two writers or one rater. M indicates one misfitting
writer.

that raters disagree on the quality of these writers’ performances.


A comparison of the peer- and teacher-assessments of these misfit-
ting writers revealed that the ability estimates produced by two types
of raters were quite diverse. For example, the ability logit of W30
was 1.69 when assessed by peer-raters, but was .74 when assessed
by the teacher-raters. With only one exception, all of the misfitting

writers were assessed leniently by the peer-raters and severely by the


teacher-raters. These differences accounted for the misfit shown by
these writers.
The majority of student raters fit the Rasch model. After eliminat-
ing the three underfitting and seven overfitting raters, none of the
remaining raters misfit the model. Intrarater consistency is often dis-
cussed as an important requirement for the establishment of an effec-
tive rating process. Low intrarater consistency often indicates poor
quality of ratings (Yang, 2000). McNamara (1996) also noted that a
lack of consistency as indicated by high rater fit values suggests that
those raters need to be retrained or possibly excluded from the rating
process. In this study, even though the severities of the assessors, both peers and teachers, varied, their infit mean square values fell within the acceptable range; thus, the raters were at least internally consistent. This is one of the important requirements for high-quality
rating, which was successfully fulfilled by both the student and
teacher raters in this study.
Spelling (infit mean square = 1.60) and Format (infit mean square = 1.69) misfit the Rasch model. These results suggested
that Spelling and Format probably elicited a pattern of responses that
did not fit the general pattern of responses. The raw scores show that
even some able writers did not obtain high scores for these criteria
while some less able writers obtained the highest score possible.
Moreover, these criteria may not reflect general writing ability. In
addition, while low ability writers can get mechanical aspects of an
essay correct, only good writers can produce high-quality essays in
terms of features such as lexical choice, grammatical accuracy and
complexity, and logical flow. On the whole, high quality writing may
be distinct from good control over mechanics. This may be one of
the reasons why mechanics received little weighting in Jacobs et al.’s
composition profile (1981). Another point is that some of the assess-
ment criteria such as Introduction, Logicality, Grammar, and
Development have similar difficulty values. However, because the
infit mean squares of these criteria were more than .5, they were not
viewed as redundant criteria and each criterion used in this study
provided reasonably unique information.

3 Comparing self-assessment, peer-assessment, and teacher-assessment
a. Writer ability (research question 2): Research question 2 asked,
‘How do self-assessors, peer-assessors, and teacher-assessors differ

when assessing writers’ abilities?’ Figure 2 shows the three groups


of raters that participated in this study. Self-assessors are shown in
the left column, peer-assessors in the middle column, and teacher-
assessors in the right column. Even though writer ability should be
the same regardless of who assesses the writer, it differed noticeably.
With self-assessment, many of the writers' ability estimates were below .00 logits; with peer-assessment, most writer ability estimates were above 1.00 logits; and with teacher-assessment, more writers were assessed as below 1.00 logits. Moreover, teacher-assessment
included a wider range of writers’ abilities than self- and peer-assess-
ments. Students generally evaluated their peers leniently and evalu-
ated themselves severely. The teachers were neither lenient nor
severe compared with the self- and peer-assessors.
This finding contradicts many previous researchers’ findings (e.g.,
Evans, Aghabeigi, Leeson, O’Sullivan & Eliahoo, 2002; Evans,
McKenna, & Oliver, 2002), in which students tended to use higher
rating scale categories and overrated rather than underrated their per-
formances. In the present study, some students also did not assess
their own writing objectively; few students awarded themselves a
high grade even though they may have thought that their essays were
good. This may be a cultural matter; some Japanese believe that they
should not assess themselves higher than others as modesty is tradi-
tionally considered a virtue for Japanese (e.g., Ikeno, 2002; Takada &
Lampkin, 1996). In this study, the teacher observed how the students
self-assessed their own writing. This might have encouraged many

Self Peer Teacher


| Logit | Writers | | Logit | Writers | | Logit | Writers |
+4 + + +4 +* + +4 + +
| |. | | |*** | | |. |
+3 + + +3 +***** + +3 +* +
| |. | | |******* | | |*. |
+2 +***. + +2 +********* + +2 +*****. +
| |****. | | |********* | | |*******. |
+1 +*** + +1 +***** + +1 +********* +
| |***** | | |***** | | |******. |
+0 +*****. + +0 +**. + +0 +***. +
| |****. | | |**. | | |***** |
+ -1 +. + + -1 + + + -1 +**. +
| |*****. | | | | | |** |
+ -2 +** + + -2 + + + -2 +. +
| |**. | | | | | |** |
+ -3 +* + + -3 + + +3 +* +
| |* | | | | | |. |

Figure 2 Writers’ abilities as estimated by self-, peer-, and teacher-assessors.


Note: Each asterisk (*) indicates two writers

participants to underestimate, rather than overestimate, the quality of


their writing.

b. Assessment criterion difficulty (research question 3): Research


question 3 asked, ‘How do self-, peer-, and teacher-assessments
compare in terms of assessment criterion difficulty and its infit mean
square value?’ Figure 3 showed that the most harshly scored crite-
rion for self-assessors was Sentence; for peer-assessors, it was
Grammar, and for teacher-assessors, it was Conclusion. In self-
assessment, Grammar, Logicality, and W Choice were the next most
harshly scored criteria. In the teacher-assessment, Grammar was the
second most harshly scored criterion, while Relevance was assessed
leniently by the teachers. Spelling, Format, and Punctuation were
leniently scored criteria by all raters. One similarity is that Grammar
was severely scored by all three groups. This result may have been
partly caused by the participants’ experiences in the Japanese educa-
tion system, in which English grammar is generally emphasized.
Through their educational experience, not only the teachers but also
the students appeared to be strongly concerned with grammatical
accuracy.
Most of the assessment criterion difficulties were around .00 log-
its for self- and peer-assessments except Spelling, Format, and
Punctuation for self-assessment and Spelling and Punctuation for
peer-assessment. On the other hand, for teacher-assessment, they
were often diverse. Peer-raters displayed the narrowest range of assessment criterion difficulties, from –.50 to .50. The largest differences among the three groups were found on Relevance, Conclusion, Spelling, and Format: Relevance, Spelling, and Format were more leniently assessed, and Conclusion more strictly assessed, by the teacher-raters than by the self- and peer-raters.
Next, the infit mean squares of the assessment criteria difficulty
were compared among the self-, peer-, and teacher-assessors. For
peer-assessors, no criteria misfit the Rasch model. On the other hand,
for the teacher-assessors, Spelling (2.10), Format (1.70), and
Punctuation (2.00) seriously underfit the model, a finding that indi-
cates that mechanics may differ significantly from the other writing
ability criteria; for the teachers, these assessment criteria might not
be related to writing ability. For self-assessors, Format (1.70), which
was the most leniently scored criterion after Spelling, underfit the
model. This should have been the easiest criterion to check because
explicit criteria concerning areas such as margins, title, and indenta-
tion were provided. Although self-raters often gave severe scores,
they were somewhat lenient on Format, a factor that may have caused the infit mean square of 1.70.

Figure 3 Assessment criteria difficulties of the self-, peer-, and teacher-raters
Note: The figure plots the logit of difficulty (y-axis) for each of the 16 assessment criteria (x-axis), with separate plots for the peer-, teacher-, and self-raters.

4 Bias analysis: Rater × writer (research question 4)


Research question 4 asked, ‘To what degree do self-assessors, peer-
assessors, and teacher-assessors exhibit bias towards writers’ abili-
ties, and what types of bias are they?’ This question was answered by
conducting a bias analysis of rater-writer interactions. Self-, peer-
and teacher-assessments were first analyzed together in order to con-
duct a bias analysis for the self-assessments. Because each self-
assessor assessed his/her own essay just once, the self-assessment
results were compared with the peer-assessment results. Thus, raters
who displayed considerable harshness or leniency towards only their
own essay would be flagged as displaying bias while raters who
rated their own essay in a way similar to how they rated their peers’
essays would not be identified as biased. Teacher-assessments were
also included in this analysis in order to obtain more precise ability
estimates. On the second run, only peer- and teacher-assessments

were analyzed together in order to investigate bias in the peer- and teacher-assessments. This analysis was conducted without the self-assessments because the self-assessments were somewhat idiosyncratic.

The results are presented in Table 3, which shows the number of biased ratings in each column. High-ability writers were those who obtained ability estimates above 1.00 logits, while low-ability writers were those who obtained ability estimates below 1.00 logits. The 1.00 logit cutoff was selected because it allowed for the creation of two groups of approximately equal size. Negative z-scores were those below –2.00, while positive z-scores were those above +2.00. The percentage of biased ratings out of the total number of ratings for each type of assessment is presented in parentheses.

Table 3 The number of biased ratings towards writers' abilities

Rater   High ability with    High ability with    Low ability with     Low ability with
        negative z-score     positive z-score     negative z-score     positive z-score
Self    1 (1.47%)            8 (11.76%)           6 (8.82%)            7 (10.29%)
Peer    22 (5.43%)           28 (6.91%)           20 (4.94%)           6 (1.48%)
T1      14 (14.43%)          4 (4.12%)            0 (0.00%)            14 (14.43%)
T2      1 (3.33%)            5 (16.67%)           3 (10.00%)           0 (0.00%)
T3      2 (6.67%)            3 (10.00%)           1 (3.33%)            0 (0.00%)
T4      5 (13.51%)           3 (8.11%)            0 (0.00%)            2 (5.40%)

Note: Values in parentheses are the percentage of biased ratings out of the total number of ratings for each type of assessment.
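
The z-scores summarized in Table 3 follow the usual many-facet bias formulation: for each rater × writer combination, the difference between the observed ratings and the ratings predicted from the main writer, rater, and criterion measures is expressed as a bias estimate in logits and divided by its standard error,

\[
z_{ni} = \frac{\text{Bias}_{ni}}{SE(\text{Bias}_{ni})},
\]

with \(|z_{ni}| \geq 2.00\) treated as a significant interaction. In this study, positive z-scores indicate lower-than-expected (overly severe) ratings and negative z-scores indicate higher-than-expected (overly lenient) ratings.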
Self-raters tended to assess their own writing more strictly than
expected; 15 self-raters were overly severe and seven were overly
lenient. More able writers (11.76%) assessed themselves more
severely than predicted. Among the participants with ability logits
greater than 1.00, only one rater (1.47%) rated him/herself too
leniently. On the other hand, 8.82% of writers with abilities less than
1.00 had negative z-scores, indicating that they gave themselves
higher scores than predicted by the Rasch model, but this finding
was balanced by the fact that some lower level writers (10.3%)
assessed themselves more harshly than expected.
The z-scores of writers with estimated abilities above 1.00, as
determined by the peer-raters, were divided fairly equally; 28 exam-
inees (6.91%) had positive z-scores and 22 examinees (5.43%) had
negative z-scores, indicating a slight tendency on the part of peer-raters
to be severe. Moreover, four writers with ability logits above 2.00
received lower scores than expected and one writer with an ability logit

above 2.00 obtained a higher score than expected from the peer-
raters. Six writers (1.48%) with an estimated ability under 1.00 had
a positive z-score and twenty writers (4.94%) had negative z-scores,
an indication that they obtained higher than expected scores from
their peers; thus, peer raters were sometimes lenient only toward
writers with relatively low writing abilities.
The results of the bias analysis are consistent with the SD of the
peer assessments, which was smaller than that of the other assess-
ments; student raters generally awarded middle scores to their peers.
Furthermore, the peer-assessors tended to award more lenient scores
to lower level writers and harsher scores to more proficient writers.
These results were consistent with previous studies conducted by
Hughes and Large (1993) and Kwan and Leung (1996).
Next, following Hughes and Large (1993), the peer-assessments
were analyzed in order to determine whether students who per-
formed well on the writing assignment marked their peers’ essays
more severely than students whose essays received low ratings. The
possible presence of this tendency was examined using bias analysis
by analyzing both writing ability and biased z-scores. The results
indicate that almost an equal number of severe and lenient biases
were detected for participants with writing abilities between .00 and .99 (14 z-scores above +2.00 vs. 16 below –2.00), writers with abilities from 1.00 to 1.99 (11 above +2.00 vs. 14 below –2.00), and writers with abilities above 2.00 (5 above +2.00 vs. 4 below –2.00). Overall, biased
ratings were not dependent on the students’ writing abilities. High-
achieving writers did not often rate their peers severely and low-
achieving writers did not often rate their peers leniently. Hence, the
participants in this study were able to make reasoned assessments
independent of their own performances.
T1 assessed 14 low ability writers with abilities less than 1.00
strictly (14.4%) as shown by the positive z-scores while no writers
with ability estimates less than 1.00 obtained negative z-scores
(0.00%), indicating they were not assessed leniently. On the other
hand, 14 high ability writers (14.4%), whose estimated ability was
above 1.00 logits, obtained negative z-scores, indicating that they
obtained higher scores than expected, while only four writers
(4.12%) whose estimated ability was above 1.00 logits obtained a
positive z-score, indicating that they obtained lower scores than
expected. These results are also consistent with the SD of T1’s rat-
ings, which were the largest compared with the other teacher raters.
Three writers (10.00%) with ability logits less than 1.00 and one writer (3.33%) with an ability logit above 1.00 were given lenient scores by T2, while five writers (16.67%) with ability estimates
greater than 1.00 obtained overly severe scores. No writer with an
ability estimate less than 1.00 obtained a severe score from this rater.
T3 displayed few bias patterns. Two higher ability writers (6.67%)
obtained negative z-scores and three higher ability writers (10.00%)
obtained positive z-scores. In addition, only one lower ability writer
(3.33%) obtained a negative z-score.
T4 awarded five higher ability writers (13.51%) overly lenient
scores as shown by the negative z-scores; however, three high ability
writers (8.11%) also received harsh scores as indicated by the positive
z-scores. Two lower ability writers (5.40%) obtained a harsh
score while no lower ability writer obtained a lenient score from T4.
T1 and T4’s bias patterns were similar; however, T4’s tendency toward
awarding lenient scores to able writers and harsh scores to less able
writers is less pronounced than that of T1.
Overall, the teacher-raters each displayed distinct bias patterns. As Wigglesworth (1993) mentioned, the FACETS bias analy-
sis can be used in rater training because showing raters their bias
patterns can lead to improved rating performance. This study also
confirmed the necessity of rater training using the bias analysis
results produced by FACETS. In addition, it is noteworthy that peer-
raters produced fewer bias interactions than the self- and teacher-raters.
This suggests that in at least some contexts, peer-assessments can
play a useful role in writing classes.

IV Limitations
There were four major limitations in this study. The first limitation
concerns the small amount of overlap in the ratings. MFRM does not
require that every examinee be rated by every judge on every assessment criterion; that is, Rasch estimates can be obtained so long as a
certain degree of overlap is maintained; therefore, each essay was
rated by three peers, two teachers, and the writer of the essay.
However, as Linacre (2002c) noted, the measures that are arrived at
with this technique are less precise than with complete data. About
84% fewer observations were made in this study than was possible.
Even though asking each student to assess all of his/her peers is
unrealistic, more overlap of ratings is preferable as this would result
in more precise measurement.
The second limitation concerns rater leniency. The individual rater
report showed that the raters were separated into different levels of

severity ranging from –2.17 to 2.28; however, the FACETS map (see
Figure 1) showed that some able writers could not be assessed pre-
cisely because no raters had severities equivalent to them. One
implication is that the raters, including the teachers, were too lenient.
Though leniency is often the case with peer-raters, it would be preferable to find more severe raters or to ask peer-raters to be more severe in order to assess high-proficiency writers more precisely.
Third, the addition of qualitative research methods would have
made the results more understandable. For example, in this study,
three underfitting and seven overfitting raters were detected.
Conducting interviews or think-aloud protocols could have shed
light on the reasons for the misfit.
Fourth, how general proficiency differences affect self- and peer-
assessments is an important issue. In the present study, no general
proficiency scores, such as TOEIC or TOEFL scores, were available.
The participants were enrolled in similarly prestigious universities in Japan;
however, this did not mean that they were all at the same level of
English proficiency. The results reported in this study may not apply
to foreign language learners who are at lower or higher English pro-
ficiency levels.

V Conclusion
This paper has reported on a quantitative investigation of self-, peer-,
and teacher-assessments of English essays written by Japanese uni-
versity students using Multifaceted Rasch Measurement. There were
several important findings. First, self-raters, and especially high
achieving writers, were overly critical toward themselves. This result
was probably caused by the tendency of many Japanese to display a
degree of modesty. Second, peer raters, who did not show much vari-
ance, were lenient, but were at least internally consistent and their
rating patterns were not dependent on their own writing perform-
ance; higher achieving writers were not more severe raters, and
lower achieving writers were not more lenient raters. However, peer
rating patterns were dependent on the writers’ writing ability; peer
raters rated low-achieving writers leniently and high-achieving writ-
ers severely, meaning that they did not give high scores randomly.
Moreover, peer-raters produced fewer bias interactions than the self-
and teacher-raters. Third, teacher raters also varied and each had
his/her own unique bias pattern, a finding that showed that one
teacher rater was not sufficient for assessing students' essays. Fourth,

Spelling and Format showed poor fit to the Rasch model; however,
a close inspection revealed that the misfit was caused by the teacher-
raters, not the peer-raters. Although these mechanical criteria are
often evaluated in writing studies, they may differ from other assess-
ment criteria such as organization and content, and may not discrim-
inate well between different levels of writing ability and/or indeed
may not be valid measures of writing ability.
Taken as a whole, at the present time, it is difficult to recommend
using self-assessment for formal grading; however, because even
internally consistent teacher raters demonstrated unique biases toward
specific writers, peer-assessment can possibly supplement teacher-
assessment and compensate for shortcomings in teacher-assessment.
The present study showed that most of the peer-assessors were inter-
nally consistent, their rating patterns were not dependent on their own
writing performance, and fewer biases were produced by peer-raters
than self- and teacher-raters. Those findings suggest that peer-raters
have the potential to make important contributions to the overall
assessment process. By using MFRM, teachers can inform peer-raters
of their bias patterns and help them develop better quality assessment
criteria, two steps that would lead to better quality peer-assessment.
Even though few researchers have utilized MFRM to investigate
self- and peer-assessments, this study shows that MFRM can be suc-
cessfully utilized to investigate self-, peer-, and teacher-assessments by
specifying writer ability, rater severity, interrater and intrarater consis-
tency, and the difficulty of assessment criteria. Moreover, the bias pat-
terns of self-raters, peer-raters, and teacher-raters were also effectively
illuminated by MFRM. Many of the results in this study cannot be
detected by traditional statistical approaches, such as correlation and
ANOVA. I believe that as more researchers use this research technique,
we can illuminate a multitude of facets of self- and peer-assessments.

Acknowledgements
This article is a part of my dissertation at Temple University Japan.
I would like to express my deep appreciation to Dr David Beglar who
acted as my advisor and provided me with valuable comments
throughout this study. This research also owes much to the helpful
advice of Dr James Dean Brown, Dr Kim Kondo-Brown, Dr Kenneth
Schaefer, Dr Marshall Childs, and my cohorts at Temple University
Japan. I would also like to express my gratitude to Dr Glenn Fulcher,
co-editor of Language Testing, and the anonymous Language Testing review-
ers for their invaluable comments and suggestions.

VI References
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self rat-
ings of communicative language ability. Language Testing, 6(1), 14–29.
Bailey, K. M. (1998). Learning about language assessment. Cambridge, MA:
Heinle & Heinle.
Ballantyne, R., Hughes, K., & Mylonas, A. (2002). Developing procedures
for implementing peer assessment in large classes using an action research
process. Assessment & Evaluation in Higher Education, 27(5), 427–441.
Bonk, W., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second
language group oral discussion. Language Testing, 20(1), 89–110.
Brown, J. D., Hudson, T., Norris, J., & Bonk, W. (2002). An investigation of
second language task-based performance assessments. Honolulu:
University of Hawai'i.
Caulk, N. (1994). Comparing teacher and student responses to written work.
TESOL Quarterly, 28(1), 181–188.
Cheng, W., & Warren, M. (1997). Having second thoughts: Student percep-
tions before and after a peer assessment exercise. Studies in Higher
Education, 22(2), 233–240.
Cheng, W., & Warren, M. (2005). Peer assessment of language proficiency.
Language Testing, 22(1), 93–121.
Conway, R., & Kember, D. (1993). Peer assessment of an individual’s contri-
bution to a group project. Assessment & Evaluation in Higher Education,
18(1), 45–54.
Evans, A. W., Aghabeigi, B., Leeson, R., O’Sullivan, C., & Eliahoo, J.
(2002). Are we really as good as we think we are? Annals of the Royal
College of Surgeons of England, 84(1), 54–56.
Evans, A. W., McKenna, C., & Oliver, M. (2002). Self-assessment in medical
practice. Journal of the Royal Society of Medicine, 95(10), 511–513.
Freeman, M. (1995). Peer assessment by groups of group work. Assessment &
Evaluation in Higher Education, 20(3), 289–301.
Goldfinch, J. M. (1994). Further developments in peer assessment of group
projects. Assessment and Evaluation in Higher Education, 19(1), 29–35.
Goldfinch, J. M., & Raeside, R. (1990). Development of a peer assessment
technique for obtaining individual marks on a group project. Assessment
& Evaluation in Higher Education, 15(3), 210–225.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-
Lyons (Ed.), Assessing second language writing in academic contexts
(pp. 241–276). Norwood, NJ: Ablex.
Hanrahan, S., & Isaacs, G. (2001). Assessing self- and peer assessment: The
students’ views. Higher Education Research and Development, 20(1),
53–70.
Hargreaves, A., Earl, L., & Schmidt, M. (2001). Perspectives on alternative
assessment reform. American Educational Research Journal, 39(1), 69–95.
Hughes, I. E., & Large, B. J. (1993). Staff and peer-group assessment of oral
communication skills. Studies in Higher Education, 18(3), 379–385.
Ikeno, O. (2002). The Japanese mind: Understanding contemporary culture.
North Clarendon: Tuttle.

Jacobs, H. L., Zingraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B.
(1981). Testing ESL composition: A practical approach. Rowley, MA:
Newbury House.
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring
Japanese second language writing performance. Language Testing,
19(1), 3–31.
Kozaki, Y. (2004). Using GENOVA and FACETS to set multiple standards on
performance assessment for certification in medical translation from
Japanese into English. Language Testing, 21(1), 1–27.
Kwan, K., & Leung, R. (1996). Tutor versus peer group assessment of student
performance in a simulation training exercise. Assessment & Evaluation
in Higher Education, 21(3), 205–215.
Linacre, J. M. (1989/1994). Many-facet Rasch measurement. Chicago, IL:
Institute for Objective Measurement.
Linacre, J. M. (1999). FACETS: Computer program for many-faceted Rasch
measurement (Version 3.22). Chicago, IL: MESA Press.
Linacre, J. M. (2002a). What do infit and outfit, mean-square and standard-
ized mean? Rasch Measurement Transactions, 16(2), 878.
Linacre, J. M. (2002b). Optimizing rating scale category effectiveness.
Journal of Applied Measurement, 3(1), 85–106.
Linacre, J. M. (2002c). Construction of measures from many-facet data.
Journal of Applied Measurement, 3(4), 486–512.
Linacre, J. M., & Williams, J. (1998). How much is enough? Rasch
Measurement Transactions, 12(3), 653.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias:
Implications for training. Language Testing, 12(1), 54–71.
Lunz, M. E. (1992). New ways of thinking about reliability. Professions
Education Researcher Quarterly, 13(4), 16–18.
Lunz, M. E., & Linacre, J. M. (1998). Measurement designs using multifacet
Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for busi-
ness research (pp. 44–47). Mahwah, NJ: Lawrence Erlbaum.
Lunz, M. E., Stahl, J. A., & Wright, B. D. (1994). Interjudge reliability and
decision reproducibility. Educational and Psychological Measurement,
54(4), 914–925.
Mangelsdorf, K. (1992). Peer reviews in the ESL composition classroom:
What do the students think? ELT Journal, 46(3), 274–283.
McNamara, T. F. (1996). Measuring second language performance. Harlow:
Addison Wesley Longman.
McQueen, J., & Congdon, P. (1997). Rater severity in large-scale assessment:
Is it invariant? (ERIC Document Reproduction Service No. ED411303)
Mendonça, C. O., & Johnson, K. E. (1994). Peer review negotiations: Revision
activities in ESL writing instruction. TESOL Quarterly, 28(4), 745–769.
Nakamura, Y. (2002). Teacher assessment and peer assessment. (ERIC
Document Reproduction Service No. ED464483)
Oldfield, K. A., & Macalpine, J. M. K. (1995). Peer and self-assessment at
tertiary level: An experiential report. Assessment & Evaluation in Higher
Education, 20(1), 125–132.

Orsmond, P., & Merry, S. (1997). A study in self-assessment: Tutor and stu-
dents’ perceptions of performance criteria. Assessment & Evaluation in
Higher Education, 22(4), 357–370.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived
marking criteria in peer and self-assessment. Assessment and Evaluation
in Higher Education, 25(1), 23–38.
Patri, M. (2002). The influence of peer feedback on self- and peer-assessment
of oral skills. Language Testing, 19(2), 109–131.
Perkins, K. (1983). On the use of composition scoring techniques, objective
measures, and objective tests to evaluate ESL writing ability. TESOL
Quarterly, 17(4), 651–671.
Pope, N. K. L. (2005). The impact of stress in self- and peer assessment.
Studies in Higher Education, 30(1), 51–63.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis
and analysis of experiential factors. Language Testing, 15(1), 1–20.
Rudy, D. W., Fejfar, M. C., Griffith, C. H. III., & Wilson, J. F. (2001). Self-
and peer-assessment in a first-year communication and interviewing
course. Evaluation and the Health Professions, 24(4), 436–445.
Saito, H., & Fujita, T. (2004). Characteristics and user acceptance of peer rating
in EFL writing classrooms. Language Teaching Research, 8(1), 31–54.
Sluijsmans, D. M. A., Brand-Gruwel, S., & Marriënboer, G. V. (2002). Peer
assessment training in teacher education: Effects on performance and per-
ceptions. Assessment & Evaluation in Higher Education, 27(5), 443–454.
Sullivan, K., & Hall, C. (1997). Introducing students to self-assessment.
Assessment & Evaluation in Higher Education, 22(3), 289–306.
Swanson, D., Case, S., & Van der Vleuten, C. (1997). Strategies for student
assessment. In D. Boud & G. Feletti (Eds.), The challenge of problem
based learning (pp. 269–282). London: Kogan Page.
Takada, N., & Lampkin, R. (1996). The Japanese way: Aspects of behavior,
attitudes and customs of the Japanese. New York: McGraw-Hill.
Taras, M. (2001). The use of tutor feedback and student self-assessment in
summative assessment tasks: Towards transparency for students and for
tutors. Assessment & Evaluation in Higher Education, 26(6), 289–306.
Topping, K. J., Smith, E. F., Swanson, I., & Elliot, A. (2000). Formative
peer assessment of academic writing between postgraduate students.
Assessment & Evaluation in Higher Education, 25(2), 149–169.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater
consistency in assessing oral interaction. Language Testing, 10(3),
305–335.
Williams, E. (1992). Student attitudes towards approaches to learning and
assessment. Assessment & Evaluation in Higher Education, 17(1), 45–59.
Yang, W.-L. (2000). Analysis of item ratings for ensuring the procedural
validity of the 1998 NAEP achievement-levels setting. (ERIC Document
Reproduction Service No. ED440136)
Appendix A
Essay evaluation sheet

Essay Number
Evaluator's Name

                              average
Too Many Mistakes (Q10–16)                          Very Few Mistakes
Ineffective                                         Effective
Very Poor                                           Very Good

1. Overall Impression 1 2 3 4 5 6

Content
2. Amount 1 2 3 4 5 6
3. Thorough development of thesis 1 2 3 4 5 6
4. Relevance to an assigned topic 1 2 3 4 5 6

Organization
5. Introduction & Thesis statement 1 2 3 4 5 6
6. Body & Topic sentence 1 2 3 4 5 6
7. Conclusion 1 2 3 4 5 6
8. Logical Sequencing 1 2 3 4 5 6

Vocabulary
9. Range 1 2 3 4 5 6
10. Word/ idiom Choice 1 2 3 4 5 6
11. Word Form 1 2 3 4 5 6

Sentence Structure / Grammar
12. Use of Variety of Sentence Structures 1 2 3 4 5 6
13. Overall Grammar 1 2 3 4 5 6

Mechanics
14. Spelling 1 2 3 4 5 6
15. Essay Format 1 2 3 4 5 6
16. Punctuation/ Capitalization 1 2 3 4 5 6

Comments
