
Assessing Writing 27 (2016) A1–A2


Editorial

Farewell to Holistic Scoring?

Our recent Special Issue of Assessing Writing on Rubrics, guest edited by Deborah Crusan, has led me back to issues that
have troubled me since I first began researching into writing assessment more than 30 years ago. What makes writing good?
What makes good writing? What makes good writers?
Writing is a complex and multifaceted activity. When we assess writing, we engage in another complex and multifaceted
activity: judging another person’s text. Into that text has gone not only that person’s grammatical ability, their reach of word
knowledge and control, their sense of what a unified subject is, their factual knowledge about the subject, but also their
understanding of the world and their place in it, their exploration of ideas, and their feelings. How shall we judge all this?
For many years I have been writing about the importance of what I call multiple-trait scoring (Hamp-Lyons, 1986, 1987,
1991), or what in many composition contexts gets called ‘analytic’ scoring. I don’t like the term ‘analytic’ because
it takes us back to the time when, in the US particularly, the direct assessment of writing had fallen into disfavour and
educational measurement gurus argued that indirect measures of writing were as good as the direct scoring methods of the
time, and more reliable. The earliest scales were heavily influenced by the ideologies of the Industrial Revolution, the drive towards mathematical exactness, and the assumption that differences of view indicate ‘errors of judgment’. The first
attempts were made by Milo Hillegas (Hillegas, 1912), a student of Thorndike at Teachers College, Columbia, who nevertheless took
a rather different approach, although with the same aim of identifying exactly the quality of every piece of writing. Hillegas
created a method that we might nowadays think of as similar to the much more recent Bookmark method (e.g. Lewis, Mitzel,
Green, & Patz, 1999; Hambleton, 2001).1 The Hillegas scale had 1000 points, along which samples of writing by “young people” were placed.
Hillegas stressed that: “No attempt has been made in this study to define merit. The term as used here means just that quality
of writing which competent persons commonly consider as merit, and the scale measures just this quality” (p. 13). He asked
more than five hundred ‘experts’ in teaching composition (in various groups) to judge sub-sets of sample essays (some of
them ‘artificial’ samples, including the zero point; the highest-scored sample, at 937, was written by a first-year college
student). The total number of samples was eighty-three, but the most complete data, which formed the core of the scale, were
derived from the judgments of twenty-eight judges on twenty-seven samples that had been through previous iterations
of judgments. Looking into his data, we can see substantial variations between judges: his scale derives from averages of
the twenty-eight judgments on each sample (after the samples had already been filtered by other judges for some degree of
separation). Interestingly, Hillegas tried to address this issue by awarding each judge a ‘penalty’ based on their variation
from the whole group, the penalty being greater for the judges who were “guilty” of the greatest “error” (p. 57).
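To make the arithmetic of that construction concrete: in schematic terms (the notation $x_{js}$, $v_s$, $p_j$ is my own rendering, not Hillegas’s), each sample’s scale value is in effect the mean of the placements it received, and a judge’s penalty grows with that judge’s deviation from those means:

$$ v_s = \frac{1}{J}\sum_{j=1}^{J} x_{js}, \qquad p_j = \sum_{s=1}^{S} \left| x_{js} - v_s \right| $$

where $x_{js}$ is judge $j$’s placement of sample $s$ on the scale, $v_s$ becomes the scale value assigned to sample $s$, and a larger $p_j$ marks a judge “guilty” of greater “error”. Hillegas’s own penalty computation may well differ in detail; the point is that agreement among judges, rather than any external criterion, is what anchors the scale.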
From the early twentieth century through to the early 1970s, development in US education, and especially in
educational assessment and writing assessment, was largely controlled by the pressures of a rapid and huge increase in the
college-entering population, especially less well-prepared entrants. As is eloquently detailed in Elliot’s (2005) book, the struggle to
convince the College Board and other major players of the absolute necessity of including the direct assessment of writing
was carried out by a few ‘lone wolves’. White (1986) comments that “[in] the early 1970s the only systematic work in
large-scale essay scoring was being done in just two locations: the National Assessment of Educational Progress (NAEP), and
ETS” (p. 19). White, of course, makes no mention of work that had been and was being done in the scoring of writing in the
UK (as I have pointed out elsewhere). But from the late 1970s, in more and more (US) colleges and schools, teachers’ views
were being listened to, and students were being asked to really write. Thus began the heyday of holistic scoring, motivated,
as White has often said, by the fact that “The evolution of holistic scoring . . . made direct measurement of writing ability an
economically feasible alternative to multiple-choice testing even for the accounting office” (p. 21).

1 For an up-to-date and very detailed description and explanation of methods of judgement processes, see Papageorgiou, S. (2009). Setting performance standards in Europe: The judges’ contribution to relating language examinations to the Common European Framework of Reference. Frankfurt: Peter Lang.

So—why do I say ‘Farewell to holistic scoring’? Because I want to draw readers’ attention to the continuing debate within
writing assessment around holistic scoring and the nature and necessity of rubrics. Although in the US there was impressive
and influential work on holistic scoring during the 1980s, and holistic scoring continues to be used and used well in many
places around the world, its strength is often seen through a false comparison with what gets called ‘analytic’ scoring. Analytic
scoring is often seen as being a sterile measure of the skills inherent in writing: as Ed White said in the first edition of
Teaching and Assessing Writing: “[Analytic theory] assumes that writing can be seen and evaluated as a sum of its parts and
so stands in opposition to the assumptions behind holistic scoring” (p. 30). In his very valuable work in the 1970s and 80s,
White focused his efforts on developing ways of scoring large numbers of short essays quickly and reliably,
and these efforts were successful, as far as they went. But in describing analytic scoring as ‘seeing the sum of the parts’
of a text, he misses the mark. The term analytic is best reserved for the attempts to capture (usually merely hypothesized)
characteristics and skills of writing through the use of multiple-choice and other indirect or semi-direct test item types. The
surge of research into more nuanced approaches to judging the quality and qualities of writing has been led by Huot (2002),
who proposes re-articulating writing assessment as a form and embodiment of social action; and by Broad (2003), who does
not dismiss rubrics but questions, indeed challenges, the value of almost all of them because of their lack of contextual
relevance and failure to grow organically from contexts and purposes. Broad rightly comments in his Epilogue: “[society is]
undergoing a paradigm shift away from the technical requirements and justifications of positivist psychometrics and toward
considerations such as how well assessments support best practices in teaching and learning” (p. 137). This, and work by other
(primarily US) researchers and scholars, is admirable and takes us in positive directions; but in my view it makes the same
mistake that the large majority in the US composition community continue to make: it ignores the reality—ironically, much
more prevalent and dominant in the US than anywhere else in the world I know, with perhaps the exception of China—of
the existence of ‘big tests’ and powerful test agencies. Those agencies were the creators of and prime movers for analytic
testing in the 1950s and the main opponents of the early attempts to create change. But we cannot make them our enemies
because, as Foucault has shown us, money is power, size is power, power is power. If change has been fast and observable in
US composition classrooms (and I believe it has), that is not true worldwide, not nearly.
In the continuation of this Editorial, and in anticipation of a further volume focusing on rubrics later in 2016, I will return
to the concept I introduced in my opening: multiple-trait scoring. I will argue that this is a tool that can make sense to
teachers, and can make sense to test agencies. I will argue that it is a tool that can make sense to proponents of automated
evaluation of writing as well as to teachers in middle schools. Like all assessment, it can be done badly; but it can be
done very well.
In the meantime, I encourage readers to dive deeper into this area in their reading, and to do so with open minds (see
Nakamura, 2004; Zhang, Xiao, & Luo, 2015).

References

Broad, B. (2003). What we really value: Beyond rubrics in teaching and assessing writing. Logan, UT: Utah State University Press.
Elliot, N. (2005). On a Scale: A social history of writing assessment in America. New York: Peter Lang.
Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting
performance standards: Concepts, methods, and perspectives (pp. 89–116). Mahwah, NJ: Erlbaum.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241–246).
Norwood, NJ: Hampton Press.
Hamp-Lyons, L. (1987). Performance profiles for academic writing. In K. Bailey, R. Clifford, & T. Dale (Eds.), Language testing research: Papers from the ninth
Annual Language Testing Research Colloquium. Monterey, CA: Defense Language Institute (Available for download on ResearchGate).
Hamp-Lyons, L. (1986). Testing second language writing in academic settings. Unpublished PhD thesis, University of Edinburgh.
Hillegas, M. (1912). A scale for the measurement of quality in composition by young people. Teachers College Record, XIII(4), 1–56. (PhD dissertation, Teachers College, Columbia
University.)
Huot, B. (2002). Re-articulating writing assessment for teaching and learning. Logan, UT: Utah State University Press.
Lewis, D. M., Mitzel, H. C., Green, D. R., & Patz, R. J. (1999). The Bookmark standard setting procedure. Monterey, CA: McGraw-Hill.
Nakamura, Y. (2004). A comparison of holistic and analytic scoring in the assessment of writing. https://jalt.org/pansig/2004/HTML/Nakamura.htm
White, E. (1986). Teaching and assessing writing. San Francisco, CA: Jossey-Bass.
Zhang, B., Xiao, Y., & Luo, J. (2015). Rater reliability and score discrepancy under holistic and analytic scoring of second language writing. Language Testing
in Asia, 5, 5.

Liz Hamp-Lyons
