
Writing Assessment

DUDLEY W. REYNOLDS

Framing the Issue

Interest in the assessment of second language learners’ writing initially emerged
in the early 1960s from the practical need for large-scale tests that could determine
students’ readiness for academic study in English at university level. With the
development of programs preparing students for such study, an interest also
arose in using measures of writing ability to place students in multilevel programs.
Judgments were often based, however, on indirect measures of writing, which tar-
geted grammatical knowledge and sometimes reading comprehension, since these
could be objectively scored. This was especially true in the United States, where
greater emphasis was placed on psychometric concerns for reliable scoring
procedures than on the authenticity of the task.
Beginning in the 1980s, large-scale assessments such as the British Council’s
English Language Testing Service (ELTS) exams and the ETS’s TOEFL® began
moving to direct measures: students would produce an impromptu writing
sample in response to a prompt, and the sample would be evaluated. In the
1990s and early twenty-first century, as the fields of language assessment
and composition studies more generally began to take an interest in formative
types of assessment—what Black and Wiliam (1998) refer to as “assessment for
learning”—researchers in the field of second language writing assessment also
began to move beyond a primary focus on scoring reliability and task validity
and to examine alternatives to the impromptu writing sample, the efficacy of
different types of instructor feedback, and ways to promote self-assessment and
peer assessment. This period also saw a broadening of the contexts for which
assessments were being developed. Such contexts now included K-12 learners,
citizenship requirements, and business communication. These developments
coincided with the emergence of the new field of second language writing stud-
ies, as marked by the first issue of the Journal of Second Language Writing in 1992
and the creation of the Second Language Writing Interest Section within the
TESOL International Association in 2005.


Today three trends signal future directions in the development of new forms of
second language writing assessment. First, influenced by dynamic theories of lan-
guage development, researchers have begun to investigate techniques for assessing
what students can produce on their own versus with the help of scaffolding. Second,
drawing on recent advances in computerized natural language processing (NLP),
researchers are examining how language patterns drawn from large-scale analyses
of corpus data can be used for automated writing assessment. Finally, also taking
advantage of the “big data” analysis capabilities offered by cloud computing,
researchers are developing assessment systems that will facilitate and track class-
room-based formative assessments and combine them with information from more
standardized, benchmark assessments in order to provide policy and decision
makers at all levels with real-time, comprehensive data.

Making the Case

At the most general level, writing ability can be understood as the ability to
generate a written product or products with specified characteristics in a specified
manner. Assessment instruments tend to focus therefore on the product(s) or on
the process of writing. When it comes to second language writers, instruments
may or may not emphasize the challenges faced by individuals when they write in
a code and cultural context with which they have limited familiarity.
An example of a product-focused assessment with less emphasis on the fact that
the writers are using a second language would be one that directs raters to make
judgments about the language, rhetoric, and content of a text. Language use may
be codified in terms of sophistication of vocabulary, syntactic development, and
conformity to grammatical conventions. Rhetoric may invoke judgments related
to audience appeal, organization, and clarity of purpose, while content judgments
might focus on the accuracy, development, or novelty of the propositions presented.
To the extent that these judgments represent characteristics of effective texts,
regardless of the linguistic repertoire of the writer, they prioritize writing ability
over second language writing ability. A product-focused assessment with
greater emphasis on the ability of second language users may forego judgments
about content and possibly even about rhetoric but will include measures of
language complexity, accuracy, and fluency, in an attempt to operationalize the
psycholinguistic dimensions of text production. Familiarity with the cultural
context, understood as adherence to genre features, might be added as well.
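To make these psycholinguistic dimensions concrete, the sketch below computes simple proxies for complexity, accuracy, and fluency from a single writing sample. The particular metrics (type-token ratio, mean sentence length, a rater-supplied error count, and words per minute) are illustrative assumptions rather than a standard instrument; operational measures are usually far more elaborate.

```python
import re

def caf_measures(text: str, error_count: int, minutes: float) -> dict:
    """Compute illustrative complexity, accuracy, and fluency proxies.

    `error_count` is assumed to come from a rater's error tally and `minutes`
    from the time allotted for the task; both are hypothetical inputs.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    lexical_complexity = len(set(words)) / len(words) if words else 0.0       # type-token ratio
    syntactic_complexity = len(words) / len(sentences) if sentences else 0.0  # mean sentence length
    accuracy = 1 - error_count / len(words) if words else 0.0                 # inverse error density
    fluency = len(words) / minutes if minutes else 0.0                        # words per minute

    return {
        "lexical_complexity": round(lexical_complexity, 3),
        "syntactic_complexity": round(syntactic_complexity, 1),
        "accuracy": round(accuracy, 3),
        "fluency": round(fluency, 1),
    }
```

An assessment scheme would typically map such raw indices onto band descriptors or score points rather than report them directly.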
Assessment schemes that incorporate attention to the production process
often measure process in terms of the stages of production of a single text
(invention, drafting, revision, and editing). Research into the processes of skilled
and unskilled writers suggests that these “stages” are in fact behaviors that
writers engage in repeatedly during production. In order to make judgments
about these behaviors, the assessor will usually examine artifacts such as notes,
drafts, and marked-up copies of the text as well as written reflections made by
the writer about his or her process. Production ability may also be defined as
the ability to generate a variety of texts within an academic or specialized
domain (e.g., summary, analysis, narrative; business email, report, marketing
pitch). In this case measurement will often focus on command of the distinctive
features of each variety, and also on the ability to adjust language to the rhetori-
cal situation.
The selection of an approach to characterizing second language writing ability
also involves consideration of the purpose of the assessment. The information
derived from writing assessments may be used to help determine a student’s
readiness to enter or exit an academic program, place students within an academic
program, or more directly support the learning process through diagnostic feed-
back. For each of these uses, the information generated by the assessment should
align with the curriculum of the program or, in the case of readiness decisions,
with a research-based understanding of the kinds of writing the individual being
assessed will likely be expected to produce. Thus, while the decision that a student
no longer needs to study writing, for example in a pre-university English program,
is a categorical “yes” or “no,” such a decision should be based on reliable indica-
tors that the student has adequately mastered what was taught in the program.
As with other forms of assessment, three options exist for coming up with reliable
indicators of a writer’s ability: criterion referencing, norm referencing,
and individual benchmarking. An example of criterion-referenced writing assess-
ment would be the use of a list of ability statements such as “the writer uses topic
sentences to establish the purpose of paragraphs.” The assessor may then agree
(or not) that the criterion has been met, or may be asked to make a more precise
judgment about the degree to which the criterion has been met using level
descriptors. With norm-referenced writing assessments, raters are asked to use
benchmark responses produced by previous test takers; these responses are
meant to guide the raters’ interpretations of whether a criterion statement
has been met and, if so, the degree to which it has. The distinction between
criterion- and norm-referenced writing assessment is somewhat blurry, because it
is highly likely that raters who use criteria as a reference for measurement
judgments will, at least subconsciously, invoke previous texts they have read;
similarly, raters who use benchmark samples to norm their judgments will likely
formulate an independent interpretation of the criteria supposedly exemplified
in the samples.
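For illustration only, a criterion statement of this kind and a set of hypothetical level descriptors can be represented as a simple lookup structure; the wording of the levels below is invented rather than drawn from any published rubric.

```python
# A hypothetical criterion-referenced rubric entry; the level descriptors are invented.
topic_sentence_criterion = {
    "criterion": "The writer uses topic sentences to establish the purpose of paragraphs.",
    "levels": {
        4: "Every paragraph opens with a topic sentence that signals its purpose.",
        3: "Most paragraphs have clear topic sentences; a few purposes must be inferred.",
        2: "Topic sentences appear inconsistently, so paragraph purposes are often unclear.",
        1: "Topic sentences are rarely used.",
    },
}

def descriptor(rating: int) -> str:
    """Return the level descriptor a rater would attach to a given rating."""
    return topic_sentence_criterion["levels"][rating]
```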
With individual benchmarking, the emphasis is less on precise measurement
and more on rich measurement. The most common example of individual bench-
marking is classroom teachers who assess several texts produced by students
during a single course. While teachers may use criteria in the form of rubrics,
and possibly even essays from students in previous classes, all of which will
inform their judgment, they are likely to prioritize individualized judgments
that contribute to their students’ future development as writers over the need to
locate these students’ performances in relation to one another. As they judge
successive essays produced by a student, they are likely to make comparisons
with that student’s earlier writing and to identify next steps in his or her
development.


Pedagogical Implications

The design and appropriate use of writing assessments in TESOL contexts require
an understanding of the available options for assessment tasks and scoring procedures, as
well as attention to potential issues related to fairness and equity.
Assessment tasks should be shaped by decisions about whether the goal is to
assess writing products or processes, the relative importance of task authenticity
and reliable scoring, and the need for summative judgments versus formative
feedback. The most common assessment task in second language writing con-
texts is probably a prompt-based essay. The prompt is a simulation of a “real-
world” context and purpose that might elicit the creation of a written text. It
serves both to inspire content and to prescribe boundaries for the form of the
response. It may consist of little more than a question or a provocative statement,
with instructions to produce a text of a certain length. Alternatively it may pro-
vide specifications regarding a context where the text would be read or used,
characteristics of the potential audience, suggestions for the writing process,
characteristics of a good or bad response, and information about how the
response will be evaluated. Finally, respondents may be expected to produce the
text as soon as they have read the prompt, in which case the result would be
considered an impromptu writing sample; or they may be given an extensive
period of time, in which case there may be greater expectations of content devel-
opment and revision.
One recent innovation in more formal writing assessments such as the one
included in ETS’s TOEFL® exam has been the use of integrated assessment of writ-
ing. Inspired by the desire to make such formal assessments mirror the way in
which writing is frequently assessed in classroom settings, integrated assessments
first provide respondents with one or more reading or listening passages on which
they are tested; then the respondents are asked to write an essay that responds to
or draws on the content of the passage(s). Integrated tasks are frequently used in
classroom settings because they mirror authentic, non-classroom writing tasks
such as summarizing a business meeting or writing an academic paper. But, if the
goal is to measure writing ability independently of other language abilities, it
should be kept in mind that it is not clear to what degree the quality of a text has
been influenced by the respondent’s ability to read or understand the source
passage(s).
As an alternative to tasks that base their judgments on a single product, many
classroom teachers and instructional programs have moved to requiring students
to assemble writing portfolios. Typical portfolio tasks require students to select
samples of their work that exemplify specified criteria, to evaluate the work in
relation to those criteria, and to reflect upon their overall learning. Often the works
students choose from will have been previously assessed as stand-alone products,
and so the portfolio task shifts the emphasis to the student’s general writing ability
and meta-awareness. Occasionally portfolios may still focus on a single essay that
has been previously graded. In this case, students must revise the essay and write
a commentary that explains why they made certain revisions, so that the emphasis
is again on broad learning. The biggest challenge with portfolio tasks is the
quantity of information provided through different formats. Judgments about
portfolios can be informed by which texts are selected, the characteristics of the
texts themselves, and the numerous reflective statements included.
While most writing assessment tasks today are based on having students actu-
ally write, it is also possible to consider more indirect measures of writing ability.
Historically, grammar and reading tests employing selected response questions
that could be reliably scored were used to make inferences about one’s writing
ability. Concurrent validity studies frequently showed high correlations between
such measures and teachers’ judgments. It is also possible to construct discrete
test items that query one’s knowledge of writing conventions, rhetorical termi-
nology, and publication styles, or even items where respondents have to choose
between revision options, organizational sequences, or effective introductions.
The writing task provides an opportunity for the students to demonstrate their
ability, but there must also be a system for representing judgments about that abil-
ity—for coming up with scores, grades, or ratings. Scores may be assigned directly
by human raters or by computerized algorithms guided by data from raters who
have made judgments about a set of reference texts. Computerized writing assess-
ments are becoming much more common because of the speed with which they
can be completed; human raters, however, are much better at handling outliers
within a set of responses and also do not require previously scored essays for nor-
ming. Whether human or computerized, scoring systems present two challenges:
What do the scores represent about one’s writing ability? And how do the scores
differentiate individual writers?
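As a rough sketch of how a computerized scoring algorithm can be guided by data from raters, the example below extracts a few surface features from reference essays that humans have already scored, fits a regression model to those scores, and applies the model to a new response. The feature set and the choice of ridge regression are illustrative assumptions, not a description of any particular operational system.

```python
import re
import numpy as np
from sklearn.linear_model import Ridge

def surface_features(text: str) -> list[float]:
    """A deliberately minimal feature set; real systems use much richer NLP features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        float(len(words)),                                  # length
        len(set(words)) / len(words) if words else 0.0,     # type-token ratio
        len(words) / len(sentences) if sentences else 0.0,  # mean sentence length
    ]

# Hypothetical reference essays with scores previously assigned by human raters.
reference_texts = ["First reference essay ...", "Second reference essay ...", "Third reference essay ..."]
reference_scores = [3.0, 4.5, 2.5]

model = Ridge(alpha=1.0)
model.fit(np.array([surface_features(t) for t in reference_texts]), np.array(reference_scores))

# Score a new, unrated response with the trained model.
new_response = "An unscored response submitted by a test taker ..."
predicted_score = float(model.predict(np.array([surface_features(new_response)]))[0])
```

A system of this kind also makes the trade-off noted above visible: without a set of previously scored reference essays, there is nothing for the algorithm to learn from.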
Ways of conceptualizing writing ability in terms of process and product
have been discussed above. When it comes to scoring, however, a decision must be
made about whether the different dimensions of writing ability should be rep-
resented individually (trait ratings) or as a single ability (holistic rating). The
choice between the two systems is often presented as being about whether the
ability to write is more than the sum of its parts. In practice, however, the decision
is often made on much more practical grounds. If there is a need to provide diag-
nostic feedback to teachers or students (or both), then scores that provide judg-
ments about components, as opposed to judgments about the general ability, are
more desirable. If the purpose of the assessment is to make decisions about level
placement or program entrance/exit, then a single holistic rating will suffice. One
other option is to devise a hybrid system where raters assign trait ratings that are
then combined according to some algorithm to produce a single overall score. By
weighting a trait such as “clarity of purpose” more than “accurate use of punctua-
tion,” for example, the algorithm can also reflect curricular priorities.
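The arithmetic of such a hybrid scheme is a weighted average; the traits, rating scale, and weights below are invented solely to show the calculation, and a real program would choose them to mirror its own curriculum.

```python
# Hypothetical trait ratings on a 1-6 scale, assigned by a rater for one essay.
trait_ratings = {"clarity_of_purpose": 5, "organization": 4, "vocabulary": 4, "punctuation": 3}

# Curricular weights: clarity of purpose counts four times as much as punctuation.
weights = {"clarity_of_purpose": 0.4, "organization": 0.3, "vocabulary": 0.2, "punctuation": 0.1}

overall = sum(trait_ratings[trait] * weights[trait] for trait in trait_ratings)
print(f"Overall score: {overall:.2f}")  # 0.4*5 + 0.3*4 + 0.2*4 + 0.1*3 = 4.30
```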
Rater judgments are inherently variable. One advantage of holistic systems in
terms of reliability is that they allow for multiple paths to the same end, which
means that there is likely to be greater agreement between scores assigned by dif-
ferent raters. Too much variation in how raters assign scores, however, means that
it is not clear what the score represents. It is important therefore to provide raters
with clear descriptions of what different scores represent. Often such descriptions
will be accompanied by example responses. For high-stakes assessments, procedures
are usually in place to have scores represent a consensus among raters. For
example, responses may be scored by two different human raters or by one human
and one computer; if there is a defined level of disagreement between the two,
then a third rater will be used. Raters may also calibrate on a select set of responses
previously scored by “expert” raters. In a calibration session, they will assign
ratings, discuss the basis for their ratings with each other, and then agree on how
the response should be scored. One criticism of such norming procedures, how-
ever, is that they suppress the known variability in readers’ interpretations of texts
in authentic situations.
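A sketch of the double-rating procedure described above is given below. The one-point disagreement threshold and the rule for combining the adjudicated scores are assumptions; operational programs define their own thresholds and resolution rules.

```python
def resolve_score(first: float, second: float, threshold: float = 1.0,
                  adjudicator=None) -> float:
    """Combine two independent ratings, consulting a third rater only when they diverge.

    `adjudicator` is assumed to be a callable returning a third rating; the
    1.0-point threshold and the resolution rule are illustrative, not standard.
    """
    if abs(first - second) <= threshold:
        return (first + second) / 2
    if adjudicator is None:
        raise ValueError("Disagreement exceeds the threshold; a third rating is required.")
    third = adjudicator()
    # One possible rule: average the adjudicated score with whichever original score it is closer to.
    closer = first if abs(third - first) <= abs(third - second) else second
    return (third + closer) / 2

# Example on a 1-6 scale: raters give 3 and 5, so an adjudicator who awards 4 is consulted.
final_score = resolve_score(3, 5, adjudicator=lambda: 4)  # yields 3.5
```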
The problem of what the scores represent becomes even more important given
that most writing assessments are designed to differentiate between individuals.
They may differentiate between individuals who
took a particular exam; they may differentiate between individuals who took dif-
ferent versions of an exam at different points in time. From an assessment perspec-
tive, differences in the abilities of individuals are of interest insofar as they help
place or exit those individuals or offer them appropriate instruction; but measure-
ment theory points out that scores may be influenced by differences between raters
and tasks in addition to differences in the actual ability of the respondents. For
high-stakes contexts, mathematical approaches based on generalizability theory
and Rasch modeling are therefore frequently used to detect and possibly correct
influences caused by differences between raters and tasks.
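One widely used formulation of this kind is the many-facet Rasch model, in which the log-odds of an essay receiving a rating in category k rather than k-1 are decomposed into separate facets for the writer, the task, and the rater:

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

Here B_n is the ability of writer n, D_i the difficulty of task i, C_j the severity of rater j, and F_k the threshold for moving from category k-1 to k. Fitting the model to a full matrix of ratings yields an estimate for each facet, so that unusually severe or lenient raters, or unusually difficult tasks, can be identified and adjusted for.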
Perhaps the most important consideration when designing or selecting a writing
assessment is whether it will be a fair assessment of the respondents’ ability—an
assessment that will contribute positively to their further learning. For English
language learners, this means ensuring that tasks will not presume cultural knowl-
edge that learners do not have. Also, if an assessment is being used for both native
and non-native speakers, the scoring system should at least distinguish linguistic
competency from other, more rhetorical or content-based competencies.
Attention also needs to be paid to whether a test is used for the purpose it was
developed for. Tests designed to measure readiness for university-level writing
classes should not be used by an employer to gauge business communication
skills. Similarly, if students are given several weeks to complete all the writing
assignments in a university writing class, it is probably not appropriate to deter-
mine their readiness for the class on the basis of what they can write in a 20-minute
impromptu essay. Finally, serious attention must be paid to the effects that an
assessment is likely to have on the curriculum of students who prepare for the
assessment. Exams that privilege a form such as the “five-paragraph essay” or a
scoring scheme based more on grammatical correctness than on clarity of expres-
sion will have effects on what students learn.

SEE ALSO: Analytic, Holistic, and Primary Trait Marking Scales; Automated
Writing Assessment; Ethics in Testing and Assessment; Integrated-Skills
Assessment; Large-Scale Writing Assessment; Norm-Referenced Testing and
Criterion-Referenced Testing; Placement Testing; Portfolios; Scoring Writing


Reference

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education:
Principles, Policy & Practice, 5(1), 7–75.

Suggested Readings

Behizadeh, N., & Engelhard, G., Jr. (2011). Historical view of the influences of measurement
and writing theories on the practice of writing assessment in the United States. Assessing
Writing, 16(3), 189–211. doi:10.1016/j.asw.2011.03.001
Crusan, D. (2010). Assessment in the second language writing classroom. Ann Arbor, MI: University
of Michigan Press.
Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises
and perils. Language Assessment Quarterly, 10(1), 1–8. doi:10.1080/15434303.2011.622016
Hamp-Lyons, L. (Ed.). (1991). Assessing second language writing in academic contexts. Norwood,
NJ: Ablex.
Reynolds, D. W. (2010). Assessing writing, assessing learning. Ann Arbor, MI: University of
Michigan Press.
Weigle, S. C. (2002). Assessing writing. Cambridge, England: Cambridge University Press.
