
Assessing Writing 30 (2016) 21–31


A Many-Facet Rasch analysis comparing essay rater behavior on an academic English reading/writing test used for two purposes
Sarah Goodwin
Georgia State University, United States

Article history: Received 23 January 2016; Received in revised form 15 July 2016; Accepted 20 July 2016; Available online 4 August 2016

Keywords: Second language writing assessment; Many-Facet Rasch measurement; L2 writing raters; Factors affecting writing scores; Rater variability

Abstract

Second language (L2) writing researchers have noted that various rater and scoring variables may affect ratings assigned by human raters (Cumming, 1990; Vaughan, 1991; Weigle, 1994, 1998, 2002; Cumming, Kantor, & Powers, 2001; Lumley, 2002; Barkaoui, 2010). Contrast effects (Daly & Dickson-Markman, 1982; Hales & Tokar, 1975; Hughes, Keeling, & Tuck, 1983), or how previous scores impact later ratings, may also color raters' judgments of writing quality. However, little is known about how raters use the same rubric for different examinee groups. The present paper concerns an integrated reading and writing test of academic English used at a U.S. university for both admissions and placement purposes. Raters are trained to interpret the analytic scoring rubric similarly no matter which test type is scored. Using Many-Facet Rasch measurement (Linacre, 1989/1994), I analyzed scores over seven semesters, examining rater behavior on two test types (admissions or placement). Results indicated that, of 25 raters, five raters showed six instances of statistically significant bias on admissions or placement tests. The findings suggest that raters may be attributing scores to a wider range of writing ability levels on admissions than on placement tests. Implications for assessment, rater perceptions, and small-scale academic testing programs are discussed.

© 2016 Elsevier Inc. All rights reserved.

1. Introduction

This paper presents an investigation into rater behavior on an integrated reading and writing task used for both university
admissions and placement purposes. Using Many-Facet Rasch measurement (Linacre, 1989/1994), I describe how raters
interpreted scores for two different academic English testing populations. This analysis is intended to contribute to the
body of research on second language writing in assessment contexts and also to the ongoing research and validation for the
in-house English as a second language (ESL) testing program from which the data came. The testing program coordinator
must train human raters of our reading and writing test to ensure that the scores they assign are satisfactory measures of
writing performance. It is important to remember that examinee scores on performance assessments are mediated through
human raters. In other words, ratings themselves are not a direct representation of the quality of examinee writing, because
the rater’s experiences and judgments play a role (Wind & Engelhard, 2013). These considerations call for an investigation that examines raters’ scores of writing quality and the effects that may bear on how those scores are assigned.

E-mail address: saregoodwin@gmail.com

http://dx.doi.org/10.1016/j.asw.2016.07.004

2. Background to the study

Scores assigned by writing test raters may be impacted by various characteristics of raters’ background, such as their
experiences with grading writing exams, teaching writing, prior language learning, and so forth (Barkaoui, 2010; Cumming,
1990; Cumming, Kantor, & Powers, 2001; Lumley, 2002; Vaughan, 1991; Weigle, 1994, 1998, 2002). Weigle (1998) and
Lumley (2002), in investigations of raters of second language writing, both note the complex nature of the rating process.
The textual features of an essay, the wording of the rating scale, and all of the impressions readers bring with them – as well
as the potential interaction of these elements – can have an effect on raters’ perceptions of writing, and thus on the scores
they assign.
Raters’ judgments of what they have read may be impacted by the quality of the samples they have previously rated,
reflecting possible contrast effects on scores. On average, raters scored an average-quality composition lower when it was preceded by high-quality samples than when it followed lower-quality exemplars (Daly & Dickson-Markman, 1982; Hales & Tokar, 1975;
Hughes, Keeling, & Tuck, 1983). Additionally, Spear (1997) found that two preceding samples of contrasting quality created
stronger biasing effects than just one sample on scores assigned to later pieces of writing. Although raters may be instructed
in their rater training to, for example, not compare samples to one another during the scoring process, prior investigations
illustrate that quality distinctions among essays may have an impact on later scores assigned.
Murphy and Yancey (2008) even go so far as to say that “[rater] variability represents an underlying disagreement about
the nature of the construct underlying the assessment” (p. 369), although they are referring to writing assessment in general
rather than specifically to second language writing tests. McNamara (1996), however, states that it may be suitable “to
accept variability in stable rater characteristics as a fact of life, which must be compensated for in some way, either through
multiple marking and averaging of scores, or using the more sophisticated techniques of multi-faceted analysis” (p. 127).
Hence, variability does not necessarily indicate that raters are arriving at vastly different judgments of writing quality. Even
if raters have a keen understanding of the writing assessment construct, it may not be essential to train them to
be completely rigid in their interpretations of texts, as some variability is bound to occur.
Rasch (1960/1980) measurement theory has been employed as a method for examining rating quality in writing assess-
ments (e.g., Engelhard, 2002, 2013; Linacre, 1989/1994; Wind & Engelhard, 2013; Wolfe, 2004). The Many-Facet Rasch
measurement (MFRM) approach to monitoring rater performance considers facets that may have an impact on scores.
Facets can be components such as raters, rating scales, or examinees, and these are plotted onto a common interval scale
along with scores that raters assign (Bachman, 2004; Eckes, 2009; Lim, 2011). More detail about the interval scale is pre-
sented in the methodology section of this paper. In order for meaningful information to be drawn from MFRM measures, two
assumptions need to be made: the data should fit the model, and the test should measure a single, unidimensional construct
(Bond & Fox, 2007; Eckes, 2008; McNamara, 1996).
Other investigations into second language writing assessment have employed MFRM methods to investigate rater traits.
Weigle (1998) used MFRM to examine differences in rater severity and consistency before and after training, finding that
rater training helped boost scorers’ intra-rater reliability (internal consistency). Also examining rater harshness/leniency,
as well as accuracy/inaccuracy and centrality/extremism, Wolfe (2004) employed MFRM to “control for error contributed
by systematic variability between both items and raters” (p. 42). While Weigle (1998) used a pre- and post-test design and
Wolfe’s (2004) study was a one-time snapshot of raters, Lim (2011) examined rater consistency and severity longitudinally,
over 12–21 months, for novice and experienced raters. Investigating various writing rater effects may contribute to both
improved scoring and rater training. Moreover, it can provide information for the validation of writing assessment rating
scales (Harsch & Martin, 2012; Knoch, 2011; Shaw & Weir, 2007).
The body of existing research underscores that raters must contend with a number of variables while rating, performing tasks that require judgments drawing on various sources of information. Monitoring how raters use rating scales is thus important for supporting the validity and reliability of test scores. The current study concerns
raters, examinees, and a scale for the essay component of an academic English reading and writing test administered to two
test-taker populations.

3. Context for the study

The academic English proficiency test from which the data are drawn is an examination given to ESL examinees for college
admissions decisions and ESL course placement recommendations. Designed by a team of ESL instructors, content-area
instructors, and language assessment professionals, it is intended to reflect the types of tasks students will need to perform
in university contexts. It consists of an integrated reading and writing task, a multiple-choice listening comprehension
section, and a multiple-choice reading comprehension section. New graduate students also sit a face-to-face oral interview.
The integrated reading and writing section consists of a short-answer and an essay component, requiring examinees to
read two source texts and synthesize them in their writing responses (the short-answer component) as well as to read and respond to an argumentative writing prompt (the essay component). The short-answer component is scored for content
and language, and the essay component is rated for content, organization, accuracy of language (grammar and vocabulary),
and range and complexity of language. The essay portion, which is a timed argumentative writing sample, will be the focus
of the present investigation (its rubric can be found in Appendix A). Each analytic rubric category (content, organization,
accuracy of grammar/vocabulary, and range/complexity of grammar/vocabulary) is scored along a 10-point scale using
whole numbers, and each sample is seen by two raters. Content and Organization are summed for a maximum of 20 possible points, called the Rhetoric score, and Accuracy and Range/Complexity of language are likewise summed for a maximum of 20 points, called the Language score. The two raters' Rhetoric scores are then compared, as are their two Language scores. If either the Rhetoric or the Language ratings differ by 2 points or more, the sample is seen by an experienced third reader.
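To make the score combination and third-reader rule concrete, the following minimal sketch illustrates the procedure just described; the function and variable names are hypothetical and are not the testing program's actual scoring code.

```python
# Hypothetical sketch of the score combination and third-reader rule described above.

def needs_third_reader(rater1, rater2):
    """Each rater dict holds the four analytic scores (1-10 each)."""
    def rhetoric(r):
        return r["content"] + r["organization"]        # Rhetoric score, max 20
    def language(r):
        return r["accuracy"] + r["range_complexity"]   # Language score, max 20

    rhetoric_gap = abs(rhetoric(rater1) - rhetoric(rater2))
    language_gap = abs(language(rater1) - language(rater2))
    # An experienced third reader is called in if either summed score
    # differs by 2 points or more between the two raters.
    return rhetoric_gap >= 2 or language_gap >= 2

r1 = {"content": 7, "organization": 6, "accuracy": 6, "range_complexity": 6}
r2 = {"content": 8, "organization": 7, "accuracy": 6, "range_complexity": 5}
print(needs_third_reader(r1, r2))  # True: Rhetoric scores of 13 and 15 differ by 2
```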
The test is given to two main populations: newly-enrolled graduate students at the university at which the test is admin-
istered, and test takers who are generally applying to undergraduate degree programs in the local area. Hereafter, these
two groups will be called Placement and Admissions examinees. Placement test takers take the assessment to determine
whether they may benefit from graduate-level ESL academic writing support coursework. Admissions examinees test for
undergraduate admissions purposes. Our testing program sends results to the international student admissions offices at
different institutions. The Placement test takers, because they have already been accepted to a graduate-level academic
program of study, generally tend to be at a higher proficiency level than the Admissions test takers. Additionally, the two
populations are rarely scored during the same rating session. Placement tests are only given at the start of new academic
semesters, while Admissions tests take place throughout the year, three to five times per semester. This particular integrated
reading/writing task has only been administered to Admissions examinees since Fall 2011. The previous Admissions test
format was an independent writing test in response to a 50-word-maximum prompt with no extensive source text provided
to writers.
A special note needs to be made here about scores assigned to essays. In some cases, samples could only receive a
maximum score of 4 (out of 10 possible). This is due to a modification that was made to the analytic rubric beginning with
the Spring 2012 administrations of the test. It was decided that essay samples containing 100 or fewer words could only be assigned a maximum score of 4 in each rubric category, since such responses may reflect candidates running out of time or not planning their writing effectively. This was done to ensure that incomplete essays did not receive the same scores
as more complete essays. Hence, some of the samples scored by raters in this study had a maximum score of 4 in each
category. This occurred with 392 of the 11,392 scores assigned to essays. These were all Admissions samples; they often
consisted of a great deal of textual borrowing, likely scoring at the low end of the rubric, even if writers had produced more
than 100 words. (I note this here to clarify that the analysis may be impacted by raters’ scores of 4 or lower, although they
comprise less than 4% of the data.) Because raters were required to use a restricted part of the rubric for shorter compositions, some ratings of 1–4 were due not simply to natural rater variability but to this required set of score categories for some Admissions essays.
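A minimal sketch of this length rule is given below, with hypothetical names; it simply caps each analytic category at 4 when the response is 100 words or fewer.

```python
# Hypothetical sketch of the length rule: essays of 100 or fewer words
# can receive at most a 4 in each rubric category.

def apply_length_cap(category_scores, word_count):
    if word_count <= 100:
        return {category: min(score, 4) for category, score in category_scores.items()}
    return dict(category_scores)

print(apply_length_cap({"content": 6, "organization": 5, "accuracy": 4, "range": 3},
                       word_count=92))
# {'content': 4, 'organization': 4, 'accuracy': 4, 'range': 3}
```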
Raters are faculty and graduate students in the university department that offers the exam. All have experience teaching
or scoring writing in ESL contexts. They take part in a rater training session and score a take-home pack of tests before
they are approved to score writing samples. These training sessions are carried out one to three times per calendar year,
depending on the program’s need for raters. Raters review a pack of benchmark samples, which they read on their own for
approximately two hours. Later, they join together for an in-person training session for about two hours, and we discuss
the samples, the analytic rubric, and scoring procedures. After the session, they complete a scoring set on their own that
takes them anywhere from 45 to 75 minutes to complete. If raters have passed the training but have not rated test samples for
more than three months, they are asked to review the rater norming and calibration materials. Because most of the raters
are graduate students, there is turnover from semester to semester.
At their training, raters are instructed to use the rubric scale descriptors consistently no matter whether they are rating
for Placement or Admissions tests, since the essays are scored using the same rubric. Raters are also trained not to associate
certain score levels directly with ESL course recommendation or placement conclusions, although the test results are used
for different purposes. The rater training packs include both Placement and Admissions test samples, but, depending on the
time point of the academic semester in which raters are trained, they may score Placement tests first, or they may score
Admissions tests first. This may potentially lead to a contrast effect on the other test type scored, due to the differences in
proficiency levels generally represented. Thus, I hypothesized that there may be differences in the way raters score the two
different testing populations. My two research questions were:

1) Do raters interpret an essay scale similarly whether they are rating for one examinee population or the other?
2) Which test type did each rater rate first, and do raters show a contrast effect for, or bias toward, the test type they first rated?

4. Methods

4.1. Procedure

From Fall 2011 to Fall 2013, 43 unique raters scored tests. Eleven raters who had rated ten or fewer essays were removed
from the data set because they did not rate enough samples across more than one test administration. After running a
preliminary analysis with 32 raters, I removed from the set one severely misfitting rater (infit mean square value = 3.25;
infit standardized z-score = 9.0; outfit mean square value = 3.18; outfit standardized z-score = 9.0); infit mean square values
higher than 1.30 are of concern (Bond & Fox, 2007; McNamara, 1996; Wright & Linacre, 1994). This rater had scored only
two test administrations, and the samples she had rated had needed to be seen by a third rater each time. Next, to ensure
connectivity in the data, I checked whether remaining raters had scored enough of both test types, Placement and Admissions,

Table 1
Scorers' Ratings by First Test Rated in This Data Set.

Rater   First Test Date Rated   First Test Type Rated   Samples Rated
5       16 Aug 2011             Placement (a)           64
8       16 Aug 2011             Placement (a)           102
11      16 Aug 2011             Placement (a)           101
13      16 Aug 2011             Placement (a)           66
17      16 Aug 2011             Placement               85
23      16 Aug 2011             Placement               359
27      16 Aug 2011             Placement (a)           133
29      16 Aug 2011             Placement (a)           122
30      16 Aug 2011             Placement (a)           40
32      16 Aug 2011             Placement               107
39      16 Aug 2011             Placement (a)           364
20      17 Feb 2012             Admissions              278
26      17 Feb 2012             Admissions              47
9       2 Mar 2012              Admissions              30
28      16 Mar 2012             Admissions              118
19      22 Jun 2012             Admissions              133
4       14 Aug 2012             Placement               22
36      14 Sep 2012             Placement (a)           264
16      2 Nov 2012              Admissions              102
12      9 Jan 2013              Placement               15
10      14 Jun 2013             Admissions              168
3       20 Aug 2013             Placement               46
35      20 Aug 2013             Placement               52
38      20 Aug 2013             Placement               44
41      20 Aug 2013             Placement               82

Total ratings: 2944

(a) Rater scored essays for the program before the date listed in the table, though the analysis begins with data from August 2011.

so that comparisons could be made between the two scales. Therefore, six raters who had only scored Placement (N = 2) or
Admissions (N = 4) tests were removed from the set because they had not scored both test types. With the completion of
these steps, 25 raters remained. Connectivity was ensured for the examinee facet because more than one rater marks each
writing sample, but no one rater scored all of the samples. Each of the 25 raters scored at least 15 essays.
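The screening steps described above can be summarized as a short data-preparation sketch. The file name, column names, and placeholder rater ID are assumptions for illustration only; the actual data were prepared as a FACETS specification file.

```python
import pandas as pd

# Assumed flat file with one row per analytic rating (hypothetical names).
ratings = pd.read_csv("essay_ratings.csv")  # columns: rater, examinee, test_type, score

# 1) Drop raters who rated ten or fewer essays (essays counted via examinee IDs,
#    since each attempt carries its own ID in this data set).
essays_per_rater = ratings.groupby("rater")["examinee"].nunique()
ratings = ratings[ratings["rater"].isin(essays_per_rater[essays_per_rater > 10].index)]

# 2) Drop the severely misfitting rater flagged in the preliminary run
#    (infit mean square 3.25, well above the 1.30 guideline); placeholder ID.
MISFITTING_RATER_ID = -1
ratings = ratings[ratings["rater"] != MISFITTING_RATER_ID]

# 3) Keep only raters who scored both test types, so the Placement and Admissions
#    scale structures stay connected through common raters.
test_types_per_rater = ratings.groupby("rater")["test_type"].nunique()
ratings = ratings[ratings["rater"].isin(
    test_types_per_rater[test_types_per_rater == 2].index)]
```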
Additionally, 93 individuals retested once or more during this time frame; because they produced 200 different essays, each attempt was kept under a different examinee identification number. Lim (2011) kept this facet unchanged in his MFRM analysis of Michigan English Language Assessment Battery composition ratings, treating retesters who produced more than one essay in the data set as unique individuals. Although this does introduce some additional dependence into the data (107 of the 1525 essays, approximately 7% of the set, were written by some of the same people), the observations are treated as independent because each score observation represents one rater’s judgment about writing ability at one time point (Chien, 2008; Wright, 2003).
Table 1 presents raters’ identification number, the first test date they rated within this data set, whether the first opera-
tional test they scored was Placement or Admissions, and how many samples they rated. (Recall that not all rater numbers
appear in the table because 18 of the raters from 1 to 43 were removed.)
As can be seen in Table 1, of the 25 raters remaining in the data set, 18 had rated Placement as their first test; nine of these were trained before Fall 2011, so they had begun rating Placement tests before these data were collected. The remaining seven raters had rated Admissions as their first test. Raters' experience ranged from one semester (those who began rating in August 2013)
to over six calendar years (Rater 36) rating the test. Not all raters rated in consecutive test administrations. For example,
those who rated beginning in August 2013 did not rate all five test administrations that occurred in the Fall 2013 semester.

4.2. Data analysis

The data were set up as a specification file to be run in FACETS Many-Facet Rasch (MFR) modeling software (Linacre,
2014). In this study, the MFR model, an extension of Rasch (1960/1980) Measurement Theory, reflects the relationship
between assessment facets and the probability of observing certain outcomes within situations involving more than one
facet (Engelhard & Wind, 2013). Raw score observations are converted to scores along a logit, or log-odds, interval scale. This
interval scale means that an equal distance between any two data points represents an equal difference in person ability or
item difficulty (Bond & Fox, 2007). The FACETS software outputs the interval scale in the form of a variable map, or Wright
map, allowing for the direct comparison of test-taker writing proficiency, rater severity, scale difficulty, or other facets of
interest (Eckes, 2009). In the present study, there were four facets: examinees, raters, test type (Placement or Admissions),
and scales (the components of the analytic scoring: Content, Organization, Accuracy, and Range). The test type facet was set
to be anchored to zero logits so that neither the Placement nor Admissions element figured into the measure estimation,
as Placement and Admissions examinees are different people. (Although examinees actually wrote on three different essay
topics, topic was not included as a facet in this investigation.)
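For readers less familiar with the logit metric, the short sketch below (illustrative values only, not estimates from this study) shows the log-odds idea behind the Wright map: in the basic dichotomous Rasch form, the probability of success depends only on the difference between ability and difficulty, which is what makes equal distances on the map comparable.

```python
import math

def logit(p):
    """Convert a probability to the log-odds (logit) scale."""
    return math.log(p / (1 - p))

def p_success(ability, difficulty):
    """Dichotomous Rasch form: P = exp(B - D) / (1 + exp(B - D))."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# Equal logit gaps give equal probabilities, wherever they sit on the ruler.
print(round(p_success(1.0, 0.0), 2))  # 0.73
print(round(p_success(2.0, 1.0), 2))  # 0.73 -- same 1-logit gap, same probability
print(round(logit(0.5), 2))           # 0.0  -- a 50% chance sits at 0 logits
```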

An analysis was then run with the remaining 25 raters, who assigned 11,392 scores to 1525 essays. The mean score
assigned was 5.87 (SD = 1.72) where, as previously stated, the maximum possible rating for a category is 10. However, for
this analysis, ratings of 1 and 2 were collapsed into a single category, as were 9 and 10, since ratings of 1 or 10 were the least frequent (164 and 128 times, respectively, of the 11,392 ratings), and in operational rating, raters do not generally
have to distinguish between a rating of 9 versus 10, or 1 versus 2. A partial-credit version of the MFR model was applied.
This variation on the MFR model permits the structure of a rating scale to vary across items, or test types in this case. The
partial-credit model can illustrate raters’ discrepancies in the use of scoring categories (as in Engelhard & Wind, 2013). The
partial-credit model used in this analysis can be expressed as:

ln[P_nijk / P_nij(k−1)] = θ_n − β_i − α_j − γ_g − τ_gk

where
P_nijk = probability of examinee n receiving a rating of k on criterion i from rater j,
P_nij(k−1) = probability of examinee n receiving a rating of k − 1 on criterion i from rater j,
θ_n = proficiency of examinee n,
β_i = difficulty of criterion i,
α_j = severity of rater j,
γ_g = test type g (Placement or Admissions), and
τ_gk = difficulty of receiving a rating of k relative to a rating of k − 1 for test type g.
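The adjacent-category form above can be accumulated into category probabilities. The sketch below is an illustration of that logic, not the FACETS estimation routine; the facet values are arbitrary, and the thresholds are the Admissions values reported later in Table 2.

```python
import math

def category_probabilities(theta, beta, alpha, gamma, thresholds):
    """Partial-credit MFR sketch. thresholds maps each category k (above the
    lowest) to tau_gk for the test type; returns {category: probability}."""
    categories = [min(thresholds) - 1] + sorted(thresholds)  # e.g. 2, 3, ..., 9
    cumulative = [0.0]  # log-numerator for the lowest category
    for k in sorted(thresholds):
        cumulative.append(cumulative[-1] + (theta - beta - alpha - gamma - thresholds[k]))
    denom = sum(math.exp(c) for c in cumulative)
    return {k: math.exp(c) / denom for k, c in zip(categories, cumulative)}

# Admissions thresholds from Table 2; other facet values set to zero for illustration.
taus = {3: -5.04, 4: -3.58, 5: -1.65, 6: 0.25, 7: 1.68, 8: 3.41, 9: 4.93}
probs = category_probabilities(theta=3.0, beta=0.0, alpha=0.0, gamma=0.0, thresholds=taus)
print(max(probs, key=probs.get))  # 7 -- most probable Admissions rating at +3 logits
```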
To review, the research questions concerned whether raters interpret an essay scale similarly whether they are rating
for one examinee group or the other, and whether they displayed contrast effects for or against the test type they first rated.

5. Results and discussion

5.1. Many-Facet Rasch measurement results

The variable map is presented in Fig. 1.


In the second and third columns of the figure, examinee ability and rater severity estimates are plotted along the logit
scale. In the second column, the logit span of examinees can be seen. Each asterisk represents 10 examinees. The average
logit value is 1.16 for examinees. Test takers span more than 16 logits, from +8.23 to −8.52 logits, indicating a wide spread
of ability levels. Raters’ measures are presented in the third column, where each asterisk represents two raters. Raters cluster together in a range of approximately 1 logit, from +0.59 to −0.56; because examinees span about 16 logits, raters show roughly 1/16 of the logit spread observed for examinees.
The logit values of all raters, sorted highest to lowest, can be seen in more detail in Appendix B. Rater 5 is the harshest of
the 25 raters, at +0.59 logits, while Rater 11 is the most lenient at −0.56. Four raters are misfitting, with infit mean square
values above 1.30 (infit mean square values are provided for each rater in parentheses): Rater 8 (1.41), Rater 13 (1.37), Rater
29 (1.41), and Rater 38 (1.32). One rater, Rater 3 (0.63), is overfitting, with an infit mean square value below 0.75.
The rater separation index reported in the FACETS output was 2.80, with a reliability value of 0.89. The separation index
indicates the number of statistically different levels of performance. In an ideal model where raters are behaving consistently,
this figure should be close to 1, but in this model there appear to be nearly three distinct groups of raters. The reliability
figure is closer to 1 than to 0, suggesting that raters rated with highly different degrees of severity (Eckes, 2009). The fixed
chi-square value was 313.8 (df = 24, p < 0.01), indicating that raters are behaving significantly differently, echoing the finding
based on the separation index. Although a Many-Facet Rasch (MFR) model, if it fits the data, does adjust for differences in
rater leniency or severity, this particular MFR model is not used to calculate or adjust current examinees’ writing scores. For
a standardized ESL academic writing test program, raters should ideally be “interchangeable” and behaving similarly to one
another, but here not all raters are equally lenient or severe. A separation index of 1 might be more ideal, but because any
context effects do not seem to be systematic among all raters, there may be variability that is natural in second language
writing rating situations.
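As a quick sanity check on the figures just reported, the separation index and its reliability are linked by a standard relation, and the fit guideline used in this paper can be applied as a simple filter; the snippet below only re-uses values already given in this section and in Appendix B and is not part of the FACETS output.

```python
# Separation and reliability: reliability = G^2 / (1 + G^2) for separation index G.
separation = 2.80
reliability = separation**2 / (1 + separation**2)
print(round(reliability, 2))  # 0.89, matching the reported value

# Rule-of-thumb fit screen: infit mean square above 1.30 flags misfit,
# below 0.75 flags overfit (rater infit values as reported above).
infit = {3: 0.63, 8: 1.41, 13: 1.37, 29: 1.41, 38: 1.32}
flags = {rater: ("misfit" if v > 1.30 else "overfit" if v < 0.75 else "ok")
         for rater, v in infit.items()}
print(flags)  # {3: 'overfit', 8: 'misfit', 13: 'misfit', 29: 'misfit', 38: 'misfit'}
```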
At the far right of the variable map, the second-to-last column, S.1, corresponds to the probabilistic model estimates of
Placement test type scores, and the rightmost column, S.2, shows probable Admissions scores. Rasch-Andrich thresholds are
presented visually as the horizontal marks between each adjacent rating of 2 through 9. Each mark indicates “the transition
point at which the probability is 50% of an examinee being rated in one of two adjacent categories, given that the examinee is
in one of those two categories” (Eckes, 2009, p. 13). Ideally, if raters are interpreting the scoring rubric in the same way at each
examinee ability level no matter the test type, these thresholds should be the same estimated logit value for both columns,
i.e., the horizontal marks should align, but there are some discrepancies. As previously stated, there are four analytic scores
assigned by each rater, so this variable map is not differentiating the specific ratings of Content, Organization, Accuracy, or
Range from one another. Also, the variable map displays probable scores ranging from 2 to 9, since scores of 1 and 2 were
collapsed into one category, as were 9 and 10. Table 2 presents the logit values of the Rasch-Andrich thresholds, according
to the FACETS output.
Consulting the two rightmost vertical rulers and comparing them to the leftmost column, an examinee at the +3 logit ability level would generally receive ratings of 8s if she took the Placement test but 7s if she took Admissions.

Fig. 1. All facet vertical ruler.

Table 2
Rasch-Andrich Thresholds.

Score   Placement   Admissions
9       4.44        4.93
8       2.65        3.41
7       1.07        1.68
6       −0.38       0.25
5       −1.62       −1.65
4       −3.08       −3.58
3                   −5.04
2

Note: The thresholds between ratings of 4 and ratings of 5 (bolded in the original table) illustrate the similar probable score around the −1.6 logit value on both test types.

A test taker at the 0 logit mark would likely score 6s on Placement but 5s on Admissions, while an examinee at just below the −3 logit mark would score approximately 3s on Placement but 4s on Admissions. This is not ideal because the scale should be applied identically, no matter whether the test is Placement or Admissions, for examinees at the same writing ability level.
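These comparisons can be read directly off Table 2: for a given examinee measure, the most probable rating is the one whose threshold the measure has passed but whose next threshold it has not. A small sketch using the Table 2 values (the Placement threshold for a rating of 3 is not reported there):

```python
# Map an examinee logit measure to the most probable rating, using the
# Rasch-Andrich thresholds from Table 2.

def most_probable_rating(measure, thresholds, lowest_category=2):
    """thresholds: {rating: logit at which that rating overtakes the one below}."""
    rating = lowest_category
    for category in sorted(thresholds):
        if measure >= thresholds[category]:
            rating = category
    return rating

placement  = {4: -3.08, 5: -1.62, 6: -0.38, 7: 1.07, 8: 2.65, 9: 4.44}
admissions = {3: -5.04, 4: -3.58, 5: -1.65, 6: 0.25, 7: 1.68, 8: 3.41, 9: 4.93}

for ability in (3.0, 0.0):
    print(ability, most_probable_rating(ability, placement),
          most_probable_rating(ability, admissions))
# 3.0 -> 8 on Placement, 7 on Admissions; 0.0 -> 6 on Placement, 5 on Admissions
```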
Another observation about the variable map is that the logit spans for scores are wider for Admissions than for Placement.
In other words, in the rightmost column (S.2, Admissions), the total distance from ratings of 3–8 spans more logits than the
distance from 3 to 8 does in the second-to-last column (S.1, Placement). For Placement the scores range from around the
−4 logit value mark to 4.44 logits, and for Admissions the scores range from −5.04 to 4.93 logits.

Table 3
Statistically Significant Bias Shown by Raters.

Rater   Bias Size   Probability   Logit Value Difference   Test Type
32      0.52        0.0179        −0.12                    Admissions
39      0.30        0.0032        0.01                     Placement
8       0.18        0.0392        −0.15                    Placement
8       −0.28       0.0096        −0.15                    Admissions
26      −0.37       0.0292        0.44                     Placement
19      −0.67       0.0003        −0.40                    Placement

This suggests that raters may be attributing scores to a wider range of ability levels on Admissions than on Placement. Based on overall proficiency
results, test takers generally represent a wider variety of proficiency levels for Admissions than for Placement, so it seems
that the writing performance assessment rating scale has been somewhat expanded for Admissions to reflect this. Scores of
4 in particular are assigned to writers at more varied ability levels on Admissions tests than on Placement exams.
One consistency between the Placement and Admissions score ranges on the variable map has to do with the Rasch-
Andrich threshold around the −1.6 logit value, the boundary between scores of 4 and 5. At the lower end of the 5 score range
and the upper end of the 4 score range, examinees at the −1.62 to −1.65 logit mark scored approximately the same whether
they took a Placement or Admissions test administration. This may indicate that raters interpret the bottom of the 5 score
band as some type of minimum acceptability judgment of academic writing for both testing populations, since this score
cutoff is similar for both Placement and Admissions.

5.2. Contrast effect bias analysis results

In order to examine raters’ score assignments more specifically depending on the test type, a bias and interaction analysis
was run with the rater and test type facets. Because the initial partial-credit model did not allow investigation of the
interactions between facets, the model statement was modified to include the phi (φ) variable, the second-to-last element
in the equation. The model statement for the exploratory interaction analysis was:

ln[P_nijk / P_nij(k−1)] = θ_n − β_i − α_j − γ_g − φ_jg − τ_gk

where
P_nijk = probability of examinee n receiving a rating of k on criterion i from rater j,
P_nij(k−1) = probability of examinee n receiving a rating of k − 1 on criterion i from rater j,
θ_n = proficiency of examinee n,
β_i = difficulty of criterion i,
α_j = severity of rater j,
γ_g = test type g (Placement or Admissions),
φ_jg = rater-and-test type interaction parameter (the bias term), and
τ_gk = difficulty of receiving a rating of k relative to a rating of k − 1 for test type g.
This was done to determine rater leniency or severity on Placement, Admissions, or both test types. I considered which
test type each rater had first operationally rated, hypothesizing that the first test administration they rated after they had
been approved as trained raters might have an impact on their interpretation of the scoring rubric. In other words, if they
had rated Placement tests first, which are typically samples of ESL graduate student writing, they may have shown a contrast
effect, or bias, when they later scored the writing of prospective undergraduate Admissions examinees, or vice-versa. According to this analysis,
five unique raters showed six instances of statistically significant bias, as illustrated in Table 3.
Rater 32 showed bias when rating the Admissions test type, Raters 19, 26, and 39 were biased on the Placement test type,
and Rater 8 was biased on more than one scale. Raters 19 and 26 were unexpectedly strict on Placement, while Rater 32
was unexpectedly lenient for Admissions and Rater 39 lenient on Placement. Rater 8 seems to be more strict when scoring
Admissions tests but more lenient than most other raters when rating Placement samples. However, this rater is one of the
more misfitting graders in the data set, with an infit mean square value of 1.41, so this finding from the bias analysis should
be interpreted cautiously. The bias sizes range from 0.18 to 0.67 (absolute values) logits difference between observed and
expected scores.
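To make the interaction analysis concrete, the sketch below extends the earlier partial-credit function with the φ_jg term and then applies the p < .05 criterion to the interaction terms reported in Table 3; it illustrates the logic rather than reproducing the FACETS computation.

```python
import math

# Extension of the earlier partial-credit sketch: the rater-by-test-type
# interaction term phi_jg is subtracted inside each adjacent-category step.
def category_probabilities_with_bias(theta, beta, alpha, gamma, phi, thresholds):
    categories = [min(thresholds) - 1] + sorted(thresholds)
    cumulative = [0.0]
    for k in sorted(thresholds):
        cumulative.append(cumulative[-1]
                          + (theta - beta - alpha - gamma - phi - thresholds[k]))
    denom = sum(math.exp(c) for c in cumulative)
    return {k: math.exp(c) / denom for k, c in zip(categories, cumulative)}

# Flag the rater-by-test-type interactions reported in Table 3 at the .05 level.
reported = [(32, 0.52, 0.0179, "Admissions"), (39, 0.30, 0.0032, "Placement"),
            (8, 0.18, 0.0392, "Placement"), (8, -0.28, 0.0096, "Admissions"),
            (26, -0.37, 0.0292, "Placement"), (19, -0.67, 0.0003, "Placement")]
flagged = [(rater, test_type) for rater, bias, p, test_type in reported if p < 0.05]
print(flagged)  # all six interactions fall below the .05 level
```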
For the five raters who showed bias, I checked back to determine which test type they first rated after their training
(which was presented in Table 1). Raters showing biases are presented in Table 4.
Raters 19 and 26 first scored Admissions tests, while Raters 8, 32, and 39 first graded Placement exams after their rater
training. For Rater 32, the finding that she showed bias when rating Admissions samples may be due to the fact that she
first scored Placement essays. On the other hand, Raters 19 and 26, who first rated Admissions writing, are unexpectedly
strict when scoring Placement samples, so they may be showing an opposite trend from Rater 32: that they show bias when
rating Placement may be due to their prior experience rating Admissions writing. Raters 8 and 39, who also first rated Placement essays, show different biases: Rater 8 was more strict for Admissions but more lenient for Placement, and Rater 39 was also more lenient for Placement. There may still be some contrast effects present for these raters, as well as a tendency to score Placement examinees' writing unnecessarily higher than writing by Admissions examinees.

Table 4
Biased Raters' Test Type First Rated.

Rater   First Test Date Rated   First Test Type Rated   Bias Shown
8       16 Aug 2011             Placement               lenient toward Placement; harsh toward Admissions
19      22 Jun 2012             Admissions              harsh toward Placement
26      17 Feb 2012             Admissions              harsh toward Placement
32      16 Aug 2011             Placement               lenient toward Admissions
39      16 Aug 2011             Placement               lenient toward Placement

5.3. Summary of results and discussion

Overall, the results suggest that raters often do not interpret the existing essay scale similarly whether they are rating
Placement or Admissions test takers' writing. When they are assigning high 4 scores or low 5 scores, their uses of the scale tend to be more similar across Placement and Admissions, but at other score levels there are differences in logit values and ranges.
This is slightly problematic if raters are expected to apply the scale descriptors equally no matter whether they are scoring
Placement or Admissions ESL writing samples. However, the bias shown by five of the 25 raters is rather small and may be
a sign of usual variability.
The two test types can be divided into two typical profiles of examinees: enrolled graduate students and prospective
undergraduate program applicants. Differences in rater expectations of undergraduate versus graduate writing, although
not investigated in detail in this paper, surely warrant more attention. These Placement examinees have often already
completed English-medium coursework in which they have read information and had to summarize it in writing, or produced
argumentative essays based on course materials. Admissions examinees, on the other hand, have not necessarily been
enrolled in formal courses of ESL study or even produced argumentative essays before sitting this test. Since some writers
may have less control over the U.S. academic English writing genre conventions for the purposes of argumentation, this may
have an impact on rater perceptions of writing task fulfillment. The role of scorer expectations and what effect that has on
academic writing scores deserves further investigation.
What implications, then, do these findings have for writing test rating and rater training? For this group of raters, it is reassuring that they do not span more than about one logit on the variable map. It would be concerning if examinees' writing samples were being seen by raters who, based on an MFR model, were likely to assign drastically higher or lower ratings than other judges (for example, one logit or more above or below the average). However, because their separation index is 2.8, they are still rating somewhat differently from one another. They may benefit from additional rater recalibration, particularly those raters who misfit the model or who are relatively more harsh or lenient than other scorers; misfit may be an indicator of rater error or inaccuracy (Wind & Engelhard, 2012). Owing to rater turnover, many judges included in this analysis no longer rate for the exam: they left, graduated, or became too busy with other tasks to be able to rate. Current raters may benefit from a smaller-scale analysis in order to see how they compare to other raters actively scoring the writing test at the same time as they are.
Regarding the scale interpretation for the two groups of examinees, Placement and Admissions, scores are being assigned
differently to test takers who may be at a similar ability level. As has been indicated by previous work into contrast effects
(Daly & Dickson-Markman, 1982; Hales & Tokar, 1975; Hughes et al., 1983; Spear, 1997), raters’ judgments of writing quality
may be impacted by samples they have previously scored. Especially at important cut points, the score descriptors in the
rubric may need to be reexamined, and raters may find it helpful to have more detailed information about benchmark
training samples at those score bands. Not all “8”s are created equally, so giving raters sufficient examples that illustrate
what both Placement and Admissions test takers produce is critical. For the tests that can only be rated a maximum of "4" due to a low overall word count, it is also necessary to provide raters with varied examples of "4" and lower ratings so that they may familiarize themselves with the range of performances characterized by those rubric descriptors.
It is important to note that the writing section is just one portion of the exam, and all sections must be considered
together for admissions or ESL course placement decisions. Also, any borderline decisions for admissions or placement
undergo additional reviews by ESL instructors and the testing director. Further analyses need to be conducted in order to
boost the claims made in this section.

6. Conclusions and future directions

This study sought to uncover whether Placement and Admissions ESL writing test scores, despite a shared performance assessment task, are being used in the same way by raters who rate both populations of test takers' writing samples. However, several limitations must be acknowledged.
Because three different essay topics were included in the data set, topic may have had an effect on the difficulty of the task for examinees and also on raters' interpretations of the writers' fulfillment of the task. If raters are less familiar with a certain topic, their scoring of it may show some form of bias. This could be investigated in the future as a facet of an MFRM analysis.
The analysis in this paper examined ratings assigned over seven semesters, from Fall 2011 to Fall 2013. In order to inspect
differential rater functioning over time more closely (“DRIFT”, Myford & Wolfe, 2009), raters who score more regularly across
consecutive test dates could be examined from one test administration to another or from one semester to the next. This
information, which could be a focus in a future measurement model, would serve to benefit both raters and the testing
program with which they are involved.
To check more closely whether raters exhibit trends toward grading either Placement or Admissions tests more leniently
or severely, some sort of pseudo-scoring (which Wolfe, Jurgens, Sanders, Vickers, & Yue, 2014, advocate to improve rater
training) could be implemented. While raters are scoring an Admissions test administration, some Placement samples that
have been previously rated could also be included in the rating sessions, and vice-versa. This would allow the rater trainer
to be able to check scorers’ intra-rater reliability and explore whether raters show bias for or against writing samples from
a particular test type.
Lastly, inclusion of qualitative rater data, in tandem with MFRM analyses, would also be beneficial for future studies.
This analysis only examined scores that raters assigned. I did not ask raters whether they personally interpret ratings
for Placement or Admissions tests differently. To do so would no doubt uncover information to supplement the findings
presented here.
This analysis may benefit other standardized ESL testing programs with regard to writing rubric interpretations, rater
training, and the use of scores. If, for example, a group of administrators is considering using an existing performance
assessment for a new purpose, practical considerations regarding the scoring process must be kept in mind. The rater
training materials ought to reflect the array of samples that examinees are likely to produce on the test, and a rater trainer
should give advice that assists scorers in interpreting L2 writing assessment rubrics in the desired manner.
Raters are an invaluable component of the performance assessment process and thus merit attention from language
testing researchers and assessment program coordinators. The findings presented here raise future questions about rater
expectations and other factors that impact writing ratings, many of which will surely be investigated in further studies.

Appendix A. Integrated Reading/Writing Essay Rubric.

The Rhetoric score comprises the Content and Organization criteria; the Language score comprises the Language: Accuracy and Language: Range and Complexity criteria. Each criterion is scored in the bands below.

Score band 9–10

Content:
• The treatment of the assignment completely fulfills the task expectations and the topic is addressed thoroughly
• Fully developed evidence for generalizations and supporting ideas/arguments is provided in a relevant and credible way
• Uses ideas from source text well to support thesis

Organization:
• Clear and appropriate organizational plan
• Effective introduction and conclusion
• Connections between and within paragraphs are made through effective and varied use of transitions and other cohesive devices

Language: Accuracy:
• The essay is clearly written with few errors; errors do not interfere with comprehension
• Includes consistently accurate word forms and verb tenses
• Word choices are accurate and appropriate

Language: Range and Complexity:
• Uses a variety of sentence types accurately
• Uses a wide range of academic vocabulary
• Source text language is used sparingly and accurately incorporated into writer's own words

Score band 7–8

Content:
• The treatment of the assignment fulfills the task expectations competently and the topic is addressed clearly
• Evidence for generalizations and supporting ideas/arguments is provided in a relevant and credible way
• Ideas from source text used to support thesis

Organization:
• Clear organizational plan
• Satisfactory introduction and conclusion
• Satisfactory connections between and within paragraphs using transitions and other cohesive devices

Language: Accuracy:
• The essay is clearly written but contains some errors that do not interfere with comprehension
• The essay may contain some errors in word choice, word form, verb tenses, and complementation

Language: Range and Complexity:
• The essay uses a variety of sentence types
• Good range of academic vocabulary used with at most a few lapses in register
• Some language from the source text may be present but is generally well incorporated into writer's own words

Score band 5–6

Content:
• The treatment of the assignment minimally fulfills the task expectations; some aspects of the task may be slighted
• Some relevant and credible evidence for generalizations and supporting ideas/arguments is provided
• Ideas from source texts are included but may not be explicitly acknowledged as such

Organization:
• Adequate but simplistic organizational plan
• Introduction and conclusion present but may be brief
• Connections between and within paragraphs occasionally missing

Language: Accuracy:
• Is generally comprehensible but contains some errors that distract the reader or slow down comprehension
• The essay may contain several errors in word choice, word form, verb tenses, and complementation

Language: Range and Complexity:
• Somewhat limited range of sentence types; may avoid complex structures
• Somewhat limited range of academic vocabulary
• May include extensive language from source text(s) with an attempt to incorporate text language with own language

Maximum of 4 Points if Response is ≤100 Words, Even if Few or No Errors

Score band 3–4

Content:
• The response demonstrates an understanding of the task and an attempt to complete it
• Evidence for generalizations and supporting ideas/arguments is insufficient and/or irrelevant
• May not include ideas from source text, or may consist primarily of ideas from source text without integration with writer's ideas

Organization:
• There is evidence of an organizational plan that is incomplete OR organizational plan hard to follow
• Introduction and conclusion may be missing or inadequate
• Connections between and within paragraphs frequently missing

Language: Accuracy:
• Contains many errors; some errors may interfere with comprehension
• Includes many errors in word choices, word forms, verb tenses, and complementation

Language: Range and Complexity:
• Uses a limited number of sentence types
• Vocabulary limited
• Extensive use of source text language with little integration with writer's words

Score band 1–2

Content:
• The treatment of the assignment fails to fulfill the task expectations and the paper lacks focus and development
• Evidence for generalizations and supporting ideas/arguments is insufficient and/or irrelevant

Organization:
• No apparent organizational plan
• Introduction and conclusion missing or clearly inappropriate
• Few connections made between and within paragraphs

Language: Accuracy:
• Contains numerous errors that interfere with comprehension
• Includes many errors in word choices, word forms, verb tenses, and complementation

Language: Range and Complexity:
• Uses simple and repetitive vocabulary that may not be appropriate for academic writing
• Does not vary sentence types sufficiently
• May rely almost exclusively on source text language

Appendix B. Logit Values of All Raters.

References

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge, UK: Cambridge University Press.
Barkaoui, K. (2010). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1),
31–57.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Chien, T.-W. (2008). Repeated measure designs (time series) and Rasch. Rasch Measurement Transactions, 22(3), 1171. Retrieved from
http://www.rasch.org/rmt/rmt223b.htm
Cumming, A. H., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: an investigation into raters’ decision making
and development of a preliminary analytic framework. (TOEFL Monograph No. MS-22). Princeton NJ: Educational Testing Service.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51.
Daly, J. A., & Dickson-Markman, F. (1982). Contrast effects in evaluating essays. Journal of Educational Measurement, 19(4), 309–316.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185.
Eckes, T. (2009). Many-facet Rasch measurement. Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal, & T. Haladyna (Eds.), Large-Scale Assessment Programs for All Students:
Development, Implementation, and Analysis (pp. 261–287). Mahwah, NJ: Lawrence Erlbaum Associates.
Engelhard, G., Jr. (2013). Invariant Measurement: Using Rasch Models in the Social, Behavioral, and Health Sciences. New York, NY: Routledge.
Engelhard, G., & Wind, S. A. (2013). Rating quality studies using Rasch measurement theory. Research report 2013-3. New York: The College Board. Retrieved from
https://research.collegeboard.org/sites/default/files/publications/2013/8/researchreport-2013-3-rating-quality-studies-using-raschmeasurement-theory.pdf

Hales, L. W., & Tokar, E. (1975). The effect of the quality of preceding responses on the grades assigned to subsequent responses to an essay question.
Journal of Educational Measurement, 12(2), 115–117.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach.
Assessing Writing, 17, 228–250.
Hughes, D. C., Keeling, B., & Tuck, B. F. (1983). Effects of achievement expectations and handwriting quality on scoring essays. Journal of Educational
Measurement, 20(1), 65–70.
Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing
Writing, 16(2), 81–96.
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced
raters. Language Testing, 28(4), 543–560.
Linacre, J. M. (1989/1994). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (2014). Facets computer program for many-facet Rasch measurement (version 3.71.4). Beaverton, OR: Winsteps.com.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276.
McNamara, T. F. (1996). Measuring second language performance. New York: Longman.
Murphy, S., & Yancey, K. B. (2008). Construct and consequence: Validity in writing assessment. In C. Bazerman (Ed.), Handbook of research on writing:
history, society, school, individual, text (pp. 365–385). New York, NY: Routledge.
Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale
category use. Journal of Educational Measurement, 46(4), 371–389.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. (Expanded
edition, Chicago, University of Chicago Press, 1980).
Shaw, S., & Weir, C. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge, UK: Cambridge University Press.
Spear, M. (1997). The influence of contrast effects upon teachers’ marks. Educational Research, 39(2), 229–233.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts
(pp. 111–125). Norwood, NJ: Ablex.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
Wind, S. A., & Engelhard, G. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement,
13(4), 321–325.
Wind, S. A., & Engelhard, G. (2013). How invariant and accurate are domain ratings in writing assessment? Assessing Writing, 18(4), 278–299.
Wolfe, E. W., Jurgens, M., Sanders, B., Vickers, D., & Yue, J. (2014). Evaluation of pseudo-scoring as an extension of rater training. Research report. Retrieved from http://researchnetwork.pearson.com/wp-content/uploads/Wolfe_PseudoScoring_04142014.pdf
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46(1), 35–51.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B. D. (2003). Rack and stack: Time 1 vs. time 2 or pre-test vs. post-test. Rasch Measurement Transactions, 17(1), 905–906. Retrieved from
http://www.rasch.org/rmt/rmt171a.htm

Sarah Goodwin is a Ph.D. candidate in applied linguistics at Georgia State University, where she coordinates the ESL Testing Program and teaches ESL
and applied linguistics courses. Her research interests include language assessment and second language writing, as well as second language listening.
She has over twelve years’ experience working as a composition rater, item developer, and assessment specialist.
