10 Assessing writing
1 Introduction
Writing is a multifaceted and complex skill, both in first language (L1) and in second or foreign language (L2) contexts. The ability to write is commonly regarded as essential to personal and social advancement, as well as to a nation's economic success. Against this background, it is hardly surprising that research into, and professional concern about, writing instruction and assessment have increased tremendously in recent years (e.g., Hinkel 2011; MacArthur, Graham and Fitzgerald 2008; Weigle 2013a).
This chapter first provides a brief historical review of the field of writing assessment. It then introduces the notion of language frameworks and discusses the construction and use of writing tasks and rating scales. Subsequently, the focus shifts to evaluating the quality of writing assessments, highlighting the critical role of raters. After a brief glance at current controversies, the chapter concludes with possible future directions.
The practice of writing assessment has always been closely tied to conceptions of writing ability on the one hand and advances in measurement methodology and statistical analysis on the other. No wonder, therefore, that writing assessment looks back on a varied and sometimes controversial history (Elliot 2005; Spolsky 1995).
and at the college-entry level (Hamp-Lyons 2001). Third, within more recent assessment validation frameworks (Kane 1992, 2011; Messick 1989), the empirical evidence needed to support the interpretation and use of writing assessment outcomes has been extended beyond simply computing correlations with some external criterion to include the study of washback effects and raters' judgmental and decision processes (Bejar 2012; Lumley 2005; Wall 2012).
In view of the essay-based tests administered in ancient China more than 2,000 years ago (Hamp-Lyons 2001), it is tempting to conclude that writing assessment has a long past but a short history, paraphrasing a famous quote by Ebbinghaus (1908), who thus characterized the discipline of psychology at the time. Indeed, as a formally defined field, writing assessment has existed for a relatively short period of time, approximately 100 years. Over the years, the field has made significant progress in theoretical analysis, empirical research, and methodology, finally realizing the "psychometric-communicative trend" that Bachman (1990: 299) once considered necessary for advancing language testing in general.
The CEFR proficiency scale has been criticized as being confounded with intellectual functioning, such that higher CEFR levels can be achieved only by learners with a higher intellectual or educational level (Hulstijn 2011).
In addition, the CEFR pays little attention to factors that are relevant for the development of practical writing tasks, such as linguistic demands or cognitive processing. For instance, Harsch and Rupp (2011) examined how, and to what extent, the theoretical model of writing discussed in the CEFR can be operationalized in terms of test specifications and rating criteria to produce level-specific writing assessments for use in a large-scale standards-based assessment project. Their findings suggested that the CEFR can serve only as a starting point for task design and assessment, since its writing model does not sufficiently take into account facets such as the task environment and the social context of the writing activity. Therefore, they additionally drew on key results from research on writing, as well as on a systematic classification of writing tasks according to relevant criteria from the CEFR Grid for Writing Tasks, including task demands, the time allotted to solve a task, and linguistic features of the expected response.
Frameworks can function as an instrument for validation, allowing writing test developers to justify the use of a particular assessment through a set of scientifically sound procedures. Validation frameworks like Bachman and Palmer's (2010) assessment use argument (AUA) or Weir's (2005) socio-cognitive approach include an analysis of assessment tasks that focuses on target language use (Xi and Davis this volume). Both Bachman and Palmer and Weir stressed the importance of cognitive and contextual parameters of assessment tasks for making inferences from test performance to language ability as manifested in contexts beyond the test itself. Shaw and Weir (2007) provided a comprehensive analysis of writing tasks based on Weir's socio-cognitive approach that is also relevant to the issue of task difficulty across proficiency levels.
The well-known weaknesses of timed writing tests have also fostered the development of portfolio-based writing assessment (Hamp-Lyons and Condon 2000; Yancey and Weiser 1997). In fact, portfolio assessment reflects a shift from a product-oriented approach to a more process-oriented writing pedagogy, focusing on the steps involved in creating a written product (Romova and Andrew 2011). The use of portfolios has been advocated for its high authenticity and validity, as well as for its relevance to writing instruction, curriculum design, and low-stakes assessment. However, problems may arise when portfolios are used in large-scale contexts, mainly because of practical restrictions and the lack of standardized scoring procedures (Weigle 2002; Williams 2000).
To evaluate the quality of ratings, some index of interrater reliability is usually computed. But there are many different indices (Stemler and Tsai 2008), some of which can lead to incongruent, sometimes even contradictory results and conclusions. For example, one rater may award scores to examinees that are consistently one or two scale points lower than the scores that another rater awards to the same examinees. The relative ordering of the examinees will be much the same for both raters, yielding high correlations between ratings; that is, interrater consistency will be high. Yet, the raters have not reached exact agreement in any one case; that is, interrater consensus will be low. Even if raters exhibit both high consistency and high consensus, the evidence for high accuracy can be on shaky ground: two raters may be in close agreement with each other simply because they are subject to the same systematic errors or biases to much the same degree (Eckes 2015; Wind and Engelhard 2013).
Therefore, any study that reports on interrater reliability should meet the following minimum requirements: a) specifying the design used for collecting the rating data (e.g., number of raters, examinees, and criteria, plan for assigning raters to examinees, kind of rating scale or scales, etc.); b) defining the index or indices that were computed; c) computing at least two different indices, one consensus index (e.g., percentage agreement) and one consistency index (e.g., Pearson correlation). More sophisticated quality control procedures rest on the use of measurement models.
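To make the contrast between the two index types concrete, consider the scenario described above, in which one rater scores every examinee exactly one scale point lower than another rater. The following minimal sketch (plain Python; the scores for ten examinees on a 1–6 scale are invented for illustration and not drawn from any operational data set) computes one index of each kind: exact agreement as a consensus index and the Pearson correlation as a consistency index.

    # Hypothetical scores for ten examinees on a 1-6 scale; rater B is
    # consistently one point more severe than rater A.
    rater_a = [5, 3, 6, 4, 2, 5, 4, 3, 6, 4]
    rater_b = [4, 2, 5, 3, 1, 4, 3, 2, 5, 3]

    def exact_agreement(x, y):
        """Consensus index: proportion of examinees receiving identical scores."""
        return sum(a == b for a, b in zip(x, y)) / len(x)

    def pearson(x, y):
        """Consistency index: Pearson product-moment correlation."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sd_x * sd_y)

    print(f"Consensus (exact agreement): {exact_agreement(rater_a, rater_b):.2f}")  # 0.00
    print(f"Consistency (Pearson r):     {pearson(rater_a, rater_b):.2f}")          # 1.00

Here consistency is perfect (r = 1.00) while consensus is zero, precisely the discrepancy that a single reliability coefficient would obscure and that reporting one index of each type, as required under c) above, makes visible.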
7 Future directions
Looking ahead, there are a number of recent developments that will further advance the discipline of writing assessment. Foremost among these are inquiries into the suitability of integrated writing tasks and the extension of models and procedures for automated essay scoring (AES). Human raters will, in our view, not be replaced completely by AES systems in most instruments for direct writing assessment. Therefore, a key issue that needs to be addressed concerns ubiquitous rater effects as a source of construct-irrelevant variance (CIV). Today, measurement models are available to closely study, or even control for, many of these effects and thus meet the challenge of the communicative movement that led to the re-emergence of essay tests. McNamara (2011: 436) euphorically welcomed one particular model, the many-facet Rasch model, as a "quantum leap in our capacity to investigate and to a considerable extent to deal with this psychometric challenge." However, this and related models will bear fruit only if they become a standard part of the training of those involved in overseeing the assessment of writing; that is, there is a clear need to develop writing assessment literacy (Taylor 2009; Weigle 2007).
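For orientation, the many-facet Rasch model referred to above is commonly written, in one standard three-facet formulation (following Linacre 1989; Eckes 2015), as a logistic model that decomposes the log-odds of an examinee receiving a score in category k rather than k−1 on a given criterion from a given rater:

    \ln\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \alpha_j - \beta_i - \tau_k

where \theta_n is the proficiency of examinee n, \alpha_j the severity of rater j, \beta_i the difficulty of criterion i, and \tau_k the threshold of scale category k relative to category k−1. Estimating a severity parameter for each rater is what allows differences like those in the rating example above to be measured and adjusted for; the exact parameterization varies across implementations, so this equation should be read as a representative rather than a definitive form.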
A related issue concerns the study of rater cognition, that is, the study of the attentional, perceptual, and judgmental processes involved when raters award scores to examinees. Research into these cognitive processes is directly relevant for validating the use and interpretation of essay scores. Thus, as Bejar (2012: 2) pointed out, "it is important to document that the readers' cognitive processes are consistent with the construct being measured." Though not labeled as such, early research on this issue was conducted by Diederich, French and Carlton (1961), who showed that readers (raters) could be classified into five schools of thought. These schools of thought were reliably distinguished by the weight they attached to different aspects of the written performances. More recently, the term rater types has been used to describe groups of raters manifesting systematic patterns of similarities and dissimilarities in perceived criterion importance that also relate to criterion-specific biases in operational scoring contexts (Eckes 2008, 2012; He et al. 2013).
Taken together, much progress has been made since Edgeworth's (1890) critique of the "element of chance" in essay examinations, reflecting the emergence of an integrative science of writing assessment. Yet, much more needs to be done to establish assessment procedures that come close to the goal of providing valid and fair measures of writing ability in highly varied writing contexts.
8 Acknowledgements
We would like to thank our colleagues at the TestDaF Institute for many stimulating discussions on issues of foreign-language writing assessment. Special thanks go to Claudia Harsch (University of Warwick, UK) for thoughtful comments on an earlier version of this chapter.
9 References
Asención Delaney, Yuly. 2008. Investigating the reading-to-write construct. Journal of English for
Academic Purposes 7(3): 140–150.
Attali, Yigal and Jill Burstein. 2006. Automated essay scoring with e-rater® V.2. Journal of
Technology, Learning, and Assessment 4(3): 1–30.
Bachman, Lyle F. 1990. Fundamental considerations in language testing. Oxford, UK: Oxford
University Press.
Bachman, Lyle F. 2000. Modern language testing at the turn of the century: Assuring that what we
count counts. Language Testing 17(1): 1–42.
Bachman, Lyle F. and Adrian S. Palmer. 1996. Language testing in practice: Designing and
developing useful language tests. Oxford, UK: Oxford University Press.
Bachman, Lyle F. and Adrian S. Palmer. 2010. Language assessment in practice: Developing
language assessments and justifying their use in the real world. Oxford, UK: Oxford
University Press.
Baldwin, Doug, Mary Fowles and Skip Livingston. 2005. Guidelines for constructed-response and
other performance assessments. Princeton, NJ: Educational Testing Service.
Behizadeh, Nadia and George Engelhard Jr. 2011. Historical view of the influences of
measurement and writing theories on the practice of writing assessment in the United
States. Assessing Writing 16(3): 189–211.
Bejar, Isaac I. 2012. Rater cognition: Implications for validity. Educational Measurement: Issues
and Practice 31(3): 2–9.
Bouwer, Renske, Anton Béguin, Ted Sanders and Huub van den Bergh. 2015. Effect of genre on
the generalizability of writing scores. Language Testing 32(1): 83–100.
Brennan, Robert L. 2001. Generalizability theory. New York: Springer.
Brown, Rexford. 1978. What we know now and how we could know more about writing ability in
America. Journal of Basic Writing 4: 1–6.
Carr, Nathan T. 2011. Designing and analyzing language tests. Oxford, UK: Oxford University
Press.
Carr, Nathan T. 2014. Computer-automated scoring of written responses. In: Anthony J. Kunnan
(ed.). The companion to language assessment. Vol. 2, 1063–1078. Chichester, UK: Wiley.
Chalhoub-Deville, Micheline. 2009. Content validity considerations in language testing contexts.
In: Robert W. Lissitz (ed.). The concept of validity: Revisions, new directions, and
applications, 241–263. Charlotte, NC: Information Age.
Council of Europe. 2001. Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge, UK: Cambridge University Press.
Cumming, Alister. 1997. The testing of writing in a second language. In: Caroline Clapham and
David Corson (eds.). Encyclopedia of language and education: Language testing and
assessment, 51–63. Dordrecht, The Netherlands: Kluwer.
Cumming, Alister. 2013. Assessing integrated writing tasks for academic purposes: Promises and
perils. Language Assessment Quarterly 10(1): 1–8.
Cumming, Alister. 2014. Assessing integrated skills. In: Anthony J. Kunnan (ed.). The companion
to language assessment. Vol. 1, 216–229. Chichester, UK: Wiley.
Deane, Paul. 2013a. Covering the construct: An approach to automated essay scoring motivated
by a socio-cognitive framework for defining literacy skills. In: Mark D. Shermis and Jill
Burstein (eds.). Handbook on automated essay evaluation: Current application and new
directions, 298–312. New York: Routledge.
Deane, Paul. 2013b. On the relation between automated essay scoring and modern views of the
writing construct. Assessing Writing 18(1): 7–24.
Diederich, Paul B. 1974. Measuring growth in English. Urbana, IL: National Council of Teachers of
English.
Diederich, Paul B., John W. French and Sydell T. Carlton. 1961. Factors in judgments of writing
ability. RB-61-15. Princeton, NJ: Educational Testing Service.
Ebbinghaus, Hermann. 1908. Psychology: An elementary text-book. Boston, MA: Heath.
Eckes, Thomas. 2005. Examining rater effects in TestDaF writing and speaking performance
assessments: A many-facet Rasch analysis. Language Assessment Quarterly 2(3): 197–221.
Eckes, Thomas. 2008. Rater types in writing performance assessments: A classification approach
to rater variability. Language Testing 25(2): 155–185.
Eckes, Thomas. 2012. Operational rater types in writing assessment: Linking rater cognition to
rater behavior. Language Assessment Quarterly 9(3): 270–292.
Eckes, Thomas. 2015. Introduction to many-facet Rasch measurement: Analyzing and evaluating
rater-mediated assessments, 2nd ed. Frankfurt am Main: Lang.
Edgeworth, Francis Y. 1890. The element of chance in competitive examinations. Journal of the
Royal Statistical Society 53(4): 644–663.
Elliot, Norbert. 2005. On a scale: A social history of writing assessment in America. New York:
Lang.
Elliot, Norbert and David M. Williamson. 2013. Assessing Writing special issue: Assessing writing
with automated scoring systems. Assessing Writing 18(1): 1–6.
Engelhard, George Jr. 2002. Monitoring raters in performance assessments. In: Gerald Tindal and
Thomas M. Haladyna (eds.). Large-scale assessment programs for all students: Validity,
technical adequacy, and implementation, 261–287. Mahwah, NJ: Erlbaum.
Engelhard, George Jr. 2013. Invariant measurement: Using Rasch models in the social, behavioral,
and health sciences. New York: Routledge.
Fulcher, Glenn. This volume. Standards and frameworks. In: Dina Tsagari and Jayanti Banerjee
(eds.). Handbook of second language assessment. Berlin: De Gruyter Mouton.
Gebril, Atta. 2009. Score generalizability of academic writing tasks: Does one test method fit it
all? Language Testing 26(4): 507–531.
Gebril, Atta and Lia Plakans. 2014. Assembling validity evidence for assessing academic writing:
Rater reactions to integrated tasks. Assessing Writing 21(1): 56–73.
Hamp-Lyons, Liz. 1991a. Basic concepts. In: Liz Hamp-Lyons (ed.). Assessing second language
writing in academic contexts, 5–15. Westport, CT: Ablex.
Hamp-Lyons, Liz. 1991b. Scoring procedures for ESL contexts. In: Liz Hamp-Lyons (ed.). Assessing
second language writing in academic contexts, 241–276. Westport, CT: Ablex.
Hamp-Lyons, Liz. 2001. Fourth generation writing assessment. In: Tony Silva and Paul K. Matsuda
(eds.). On second language writing, 117–127. Mahwah, NJ: Erlbaum.
Hamp-Lyons, Liz. 2002. The scope of writing assessment. Assessing Writing 8(1): 5–16.
Hamp-Lyons, Liz. 2003. Writing teachers as assessors of writing. In: Barbara Kroll (ed.). Exploring
the dynamics of second language writing, 162–189. Cambridge, UK: Cambridge University
Press.
Hamp-Lyons, Liz. 2013. What is the role of an international journal of writing assessment?
Assessing Writing 18(4): A1–A4.
Hamp-Lyons, Liz and William Condon. 2000. Assessing the portfolio: Principles for practice,
theory, and research. Cresskill: Hampton Press.
Harsch, Claudia and Guido Martin. 2012. Adapting CEF-descriptors for rating purposes: Validation
by a combined rater training and scale revision approach. Assessing Writing 17(4): 228–250.
Harsch, Claudia and André A. Rupp. 2011. Designing and scaling level-specific writing tasks in
alignment with the CEFR: A test-centered approach. Language Assessment Quarterly 8(1):
1–33.
He, Tung-Hsien, Wen J. Gou, Ya-Chen Chien, I-Shan J. Chen and Shan-Mao Chang. 2013. Multi-
faceted Rasch measurement and bias patterns in EFL writing performance assessment.
Psychological Reports: Measures and Statistics 112(2): 469–485.
Hinkel, Eli. 2011. What research on second language writing tells us and what it doesn’t. In: Eli
Hinkel (ed.). Handbook of research in second language teaching and learning. Vol. 2, 523–
538. New York: Routledge.
Hirvela, Alan. 2004. Connecting reading and writing in second language writing instruction. Ann
Arbor, MI: University of Michigan Press.
Huang, Jinyan. 2012. Using generalizability theory to examine the accuracy and validity of large-
scale ESL writing assessment. Assessing Writing 17(3): 123–139.
Huddleston, Edith M. 1954. Measurement of writing ability at the college-entrance level: Objective
vs. subjective testing techniques. Journal of Experimental Education 22(3): 165–213.
Hulstijn, Jan H. 2011. Language proficiency in native and nonnative speakers: An agenda for
research and suggestions for second-language assessment. Language Assessment Quarterly
8(3): 229–249.
Huot, Brian A. 1990. The literature of direct writing assessment: Major concerns and prevailing
trends. Review of Educational Research 60(2): 237–263.
Huot, Brian A. 1996. Toward a new theory of writing assessment. College Composition and
Communication 47(4): 549–566.
Huot, Brian A. 2002. (Re)articulating writing assessment for teaching and learning. Logan, UT:
Utah State University Press.
Johnson, Robert L., James A. Penny and Belita Gordon. 2009. Assessing performance: Designing,
scoring, and validating performance tasks. New York: Guilford Press.
Kane, Michael T. 1992. An argument-based approach to validation. Psychological Bulletin 112(3):
527–535.
Kane, Michael T. 2011. Validating score interpretations and uses: Messick Lecture, Language
Testing Research Colloquium, Cambridge, April 2010. Language Testing 29(1): 3–17.
Knoch, Ute. 2011. Investigating the effectiveness of individualized feedback to rating behavior:
A longitudinal study. Language Testing 28(2): 179–200.
Knoch, Ute and Woranon Sitajalabhorn. 2013. A closer look at integrated writing tasks: Towards
a more focussed definition for assessment purposes. Assessing Writing 18(4): 300–308.
Kroll, Barbara and Joy Reid. 1994. Guidelines for designing writing prompts: Clarifications,
caveats, and cautions. Journal of Second Language Writing 3(3): 231–255.
Lado, Robert. 1961. Language testing: The construction and use of foreign language tests. New
York: McGraw-Hill.
Lim, Gad S. 2010. Investigating prompt effects in writing performance assessment. Spaan Fellow
Working Papers in Second or Foreign Language Assessment 8: 95–116. Ann Arbor, MI:
University of Michigan.
Linacre, John M. 1989. Many-facet Rasch measurement. Chicago, IL: MESA Press.
Lumley, Tom. 2005. Assessing second language writing: The rater’s perspective. Frankfurt am
Main: Lang.
Lumley, Tom and Tim F. McNamara. 1995. Rater characteristics and rater bias: Implications for
training. Language Testing 12(1): 54–71.
MacArthur, Charles A., Steve Graham and Jill Fitzgerald (eds.). 2008. Handbook of writing
research. New York: Guilford Press.
McNamara, Tim F. 1996. Measuring second language performance. London: Longman.
McNamara, Tim F. 2000. Language testing. Oxford, UK: Oxford University Press.
McNamara, Tim F. 2011. Applied linguistics and measurement: A dialogue. Language Testing
28(4): 435–440.
McNamara, Tim and Ute Knoch. 2012. The Rasch wars: The emergence of Rasch measurement in
language testing. Language Testing 29(4): 555–576.
Messick, Samuel. 1989. Validity. In: Robert L. Linn (ed.). Educational measurement, 3rd ed.,
13–103. New York: Macmillan.
Messick, Samuel. 1994. The interplay of evidence and consequences in the validation of
performance assessments. Educational Researcher 23(2): 13–23.
Montee, Megan and Margaret Malone. 2014. Writing scoring criteria and score reports.
In: Anthony J. Kunnan (ed.). The companion to language assessment. Vol. 2, 847–859.
Chichester, UK: Wiley.
Morrow, Keith. 1979. Communicative language testing: Revolution or evolution? In: Christopher J.
Brumfit and Keith Johnson (eds.). The communicative approach to language teaching, 143–157.
Oxford, UK: Oxford University Press.
Moss, Pamela A. 1994. Can there be validity without reliability? Educational Researcher 23(2):
5–12.
Parkes, Jay. 2007. Reliability as an argument. Educational Measurement: Issues and Practice
26(4): 2–10.
Plakans, Lia. 2008. Comparing composing processes in writing-only and reading-to-write test
tasks. Assessing Writing 13(2): 111–129.
Plakans, Lia. 2012. Writing integrated items. In: Glenn Fulcher and Fred Davidson (eds.), The
Routledge handbook of language testing, 249–261. New York: Routledge.
Romova, Zina and Martin Andrew. 2011. Teaching and assessing academic writing via the
portfolio: Benefits for learners of English as an additional language. Assessing Writing 16(2):
111–122.
Saville, Nick. 2014. Using standards and guidelines. In: Anthony J. Kunnan (ed.). The companion
to language assessment. Vol. 2, 906–924. Chichester, UK: Wiley.
Shaw, Stuart D. and Cyril J. Weir. 2007. Examining writing: Research and practice in assessing
second language writing. Cambridge, UK: Cambridge University Press.
Shermis, Mark D. 2014. State-of-the-art automated essay scoring: Competition, results, and future
directions from a United States demonstration. Assessing Writing 20: 53–76.
Spolsky, Bernard. 1995. Measured words: The development of objective language testing. Oxford,
UK: Oxford University Press.
Stemler, Steven E. and Jessica Tsai. 2008. Best practices in interrater reliability: Three common
approaches. In: Jason W. Osborne (ed.). Best practices in quantitative methods, 29–49. Los
Angeles, CA: Sage.
Taylor, Lynda. 2009. Developing assessment literacy. Annual Review of Applied Linguistics 29:
21–36.
Valette, Rebecca M. 1967. Modern language testing. New York: Harcourt, Brace and World.
Van Moere, Alistair. 2014. Raters and ratings. In: Anthony J. Kunnan (ed.). The companion to
language assessment. Vol. 3, 1358–1374. Chichester, UK: Wiley.
Van Moere, Alistair and Ryan Downey. This volume. Technology and artificial intelligence in
language assessment. In: Dina Tsagari and Jayanti Banerjee (eds.). Handbook of second
language assessment. Berlin: De Gruyter Mouton.
Wall, Dianne. 2012. Washback. In: Glenn Fulcher and Fred Davidson (eds.). The Routledge
handbook of language testing, 79–92. New York: Routledge.
Weigle, Sara C. 1998. Using FACETS to model rater training effects. Language Testing 15(2): 263–
287.
Weigle, Sara C. 1999. Investigating rater/prompt interactions in writing assessment: Quantitative
and qualitative approaches. Assessing Writing 6(2): 145–178.
Weigle, Sara C. 2002. Assessing writing. Cambridge, UK: Cambridge University Press.
Weigle, Sara C. 2004. Integrating reading and writing in a competency test for non-native
speakers of English. Assessing Writing 9(1): 27–55.
Weigle, Sara C. 2007. Teaching writing teachers about assessment. Journal of Second Language
Writing 16(3): 194–209.
Weigle, Sara C. 2013a. Assessment of writing. In: Carol A. Chapelle (ed.). The encyclopedia of
applied linguistics, 257–264. New York: Blackwell.
Weigle, Sara C. 2013b. English language learners and automated scoring of essays: Critical
considerations. Assessing Writing 18(1): 85–99.
Weir, Cyril J. 1990. Communicative language testing, 2nd ed. New York: Prentice Hall.
Weir, Cyril J. 2005. Language testing and validation: An evidence-based approach. Basingstoke,
UK: Palgrave Macmillan.
Williams, James D. 2000. Identity and reliability in portfolio assessment. In: Bonnie S. Sunstein
and Jonathan H. Lovell (eds.). The portfolio standard: How students can show us what they
know and are able to do, 135–148. Portsmouth: Heinemann.
Wind, Stefanie A. and George Engelhard Jr. 2013. How invariant and accurate are domain ratings in
writing assessment? Assessing Writing 18(4): 278–299.
Xi, Xiaoming. 2010. Automated scoring and feedback systems: Where are we and where are we
heading? Language Testing 27(3): 291–300.
Xi, Xiaoming and Larry Davis. This volume. Quality factors in language assessment. In: Dina
Tsagari and Jayanti Banerjee (eds.). Handbook of second language assessment. Berlin:
De Gruyter Mouton.
Yancey, Kathleen B. 1999. Looking back as we look forward: Historicizing writing assessment.
College Composition and Communication 50(3): 483–503.
Yancey, Kathleen B. and Irwin Weiser (eds.). 1997. Situating portfolios: Four perspectives. Logan,
UT: Utah State University Press.