
Assessing Writing 22 (2014) A1–A4


Editorial

Three current, interconnected concerns for writing assessment

Recently I’ve spent quite a lot of my time listening to presentations on writing assessment-related topics in various places, and this has led me to think about how some of our shared issues play out in different ways in different contexts. Three aspects have particularly drawn my attention.
1. Automated essay scoring (AES) has been ‘emerging’ for some 40 years, and we can see that it has been of ‘concern’ to the kinds of people who read this journal for much of that time; but the concern has grown lately as AES has become more credible as a potential tool for replacing some of the work of human raters in ‘scoring’ essays (see Condon, 2013; Deane, 2013; Williamson, Xi, & Breyer, 2012). With the development of powerful corpus analytic tools and the pressure on state and national governments to find economical ways of measuring the writing (proficiency or development) of large numbers of school pupils, the number of systems available has exploded. In the US in particular, the provision of AES products has become very big business (Shermis, 2014; Perelman, 2014).
In a recent project, colleagues and I reviewed six AES systems and interviewed technical and sales staff from the vendors. Starting from very little knowledge, we were given a crash course with a clear purpose. We learned that three system ‘modules’ are needed to run the technical side of AES: a capture system, an AES scoring engine that can work with the locally-developed prompts/tasks, and – in our case – a diagnostic report-generating programme. Of the six vendors, only two could deliver the first two; one vendor could deliver only the second, but was the best on our other criteria. No vendor could deliver the third except by imprinting their existing commercial ‘feedback’ technology (e.g., MyAccess) onto our students’ texts, and none of the available commercial feedback technologies met our requirements.
The one vendor who satisfied us regarding delivery of the second was LightSIDE, a very small US start-up founded by a doctoral student at CMU and now apparently growing rapidly: LightSIDE was the only one able to score multiple domains on the same essay. However, the trials run with LightSIDE helped us understand that of the six domains the teachers and EAP programme developers in our context wanted, only three could be adequately scored and interpreted. Studying the resulting data patterns closely has helped us better understand the constructs underpinning our assessment and clarify them more fully.
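For readers curious about how these pieces fit together in practice, the sketch below shows the three-module division in schematic form. It is a minimal illustration only, not any vendor’s actual architecture: the module names, the domain labels and the placeholder scoring logic are all hypothetical, invented simply to show how a capture system, a locally-trained scoring engine and a diagnostic report generator hand data to one another.

# A minimal, hypothetical sketch of the three AES 'modules' described above.
# None of these class names correspond to a real vendor product; the domain
# labels and the scoring logic are placeholders for illustration only.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Submission:
    student_id: str
    prompt_id: str   # a locally developed prompt/task
    text: str


class CaptureSystem:
    """Module 1: collects student essays written to local prompts."""

    def __init__(self) -> None:
        self._submissions: List[Submission] = []

    def capture(self, submission: Submission) -> None:
        self._submissions.append(submission)

    def all_submissions(self) -> List[Submission]:
        return list(self._submissions)


class ScoringEngine:
    """Module 2: an AES engine trained on locally rated sample essays,
    returning a score per domain (only three domains, labelled generically
    here, since only three proved adequately scorable in our trials)."""

    DOMAINS = ("content", "organisation", "language")

    def score(self, submission: Submission) -> Dict[str, float]:
        # Placeholder: a real engine would apply a statistical model
        # trained on human-rated essays for this prompt.
        length_proxy = min(len(submission.text.split()) / 100.0, 6.0)
        return {domain: round(length_proxy, 1) for domain in self.DOMAINS}


class DiagnosticReporter:
    """Module 3: turns domain scores into a feedback report."""

    def report(self, submission: Submission, scores: Dict[str, float]) -> str:
        lines = [f"Report for {submission.student_id} (prompt {submission.prompt_id}):"]
        lines += [f"  {domain}: {score}" for domain, score in scores.items()]
        return "\n".join(lines)


# Hand-off between the modules: capture -> score -> report.
capture, engine, reporter = CaptureSystem(), ScoringEngine(), DiagnosticReporter()
capture.capture(Submission("s001", "EAP-task-1", "Sample essay text ..."))
for sub in capture.all_submissions():
    print(reporter.report(sub, engine.score(sub)))

The point of the sketch is simply that the three modules are separable, which is why, in our review, different vendors could supply some of them but none could supply all three to our requirements.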
When I superimpose onto this direct experience with AES my detailed reading and critique of the literature arguing for and against AES/AWE, I can see that although the technologies nearly all originated in the US (starting with Page in the 1960s and Writer’s Workbench in the 1970s), and perhaps with motives as much pedagogical as commercial, in the last 15 years the field has been driven
hard by profit motives, especially since the introduction in the US of No Child Left Behind and other pressures towards mass-market assessment of learner writing, culminating in the Smarter Balanced/Automated Student Assessment Prize (ASAP), which offers the potential for US Government contracts in the millions of dollars yearly (Shermis, 2014). The industry-sponsored literature
is plentiful but focuses on showing that AES systems can approximate human readers’ scores, and
has had little to say about the constructs of writing. However, it is encouraging to see signs of seri-
ous engagement from research leaders in the field (see Deane, 2013; Williamson, Williams, Weng,
& Trapani, 2014). The work of Williamson, Xi and Breyer (2012), with its inclusion of consideration of consequences in its framework for evaluating AES systems, is notable, as is the recent work by
other researchers at ETS, some of it reported at the recent International Association of Applied Lin-
guistics convention in Australia (Deane, 2014), looking at CBAL (Cognitively-Based Assessment of, for,
and as Learning: see www.ets.org/research/topics/cbal/initiative/). It seems to me that it would be
beneficial for writing assessment researchers in the academic community to learn more about such
initiatives.
2. Dynamic criteria mapping (DCM) is a much newer initiative, also begun in the US (Broad, 2003).
And yet in a sense a process which might well have deserved this name has been going on for gen-
erations. Kenneth Eble wrote, in his Foreword to the First Edition of White’s Teaching and Assessing
Writing, that “there is something new under the sun even in books about writing”. However, White
had mentioned in that book the work of the San Francisco Bay Area writing teachers that grew into
the National Writing Project; and also mentioned – what especially interested me – that this initia-
tive grew out of the scoring workshops held to train readers for the (US) Advanced Placement test.
White wasn’t creating something ‘new’ but doing what all scholars do – following in the footsteps of, and being inspired by, those innovators who have gone before. And long before that, as Elliot (2005) points out: “In 1874 teachers came together in oak-lined rooms to read essays resting on mahogany tables” (p. viii). In fact teachers have been coming together to read students’ writing and give it grades for hundreds of years, and when they do come together like that, they often arrive ‘dynamically’ at salient criteria to share for the purpose. Concepts of DCM are similar to what
second language assessment researchers and applied linguists would call ‘indigenous criteria’ (Jacoby
& McNamara, 1999), although the indigenous criteria movement relates particularly to assessing writ-
ing in professional and very specialised disciplinary contexts such as veterinary science (e.g., Douglas
& Myers, 2000). DCM and the move towards indigenous criteria both argue that all assessment needs
to be local if it is to be relevant. Historically, DCM is a logical development from primary trait scoring
and analytic/multiple trait scoring; however, it presupposes a group of invested and capable people to
carry out the processes that would lead to a valid instrument. It also presupposes a context in which
these people have the freedom to develop and implement their DCM instrument(s). While the collec-
tion by Broad et al. (2009) shows that this approach has taken root and is proving to be generative,
these two key presuppositions are by no means universally true even in the US. In less economically developed countries, and in countries where education is more closely constrained by authorities, it is less likely that the conditions that foster DCM or indigenous approaches will exist. Readers of this
journal will be more sensitive than many to the great variety of contexts in which writing assessments
are used, developed and researched.
3. The Common European Framework of Reference (CEFR). The third set of concerns surrounds the
spread of the Common European Framework of Reference beyond Europe and beyond its original
purposes. The stated aim of the creation of the CEFR was to provide “a common basis for the elab-
oration of language syllabuses, curriculum guidelines, examinations, textbooks, etc. across Europe”
(Council of Europe, 2001, p. 1), in order to facilitate international educational exchange of people
and ideas. But that little word “examinations” embedded in the text has proved to be of giant import
as all Europe’s exam bodies have found many advantages in aligning their tests to the CEFR. I have
recently been involved in a project which attempts to use the CEFR as the benchmark for a writing
assessment in Hong Kong universities, where it has proven problematic. Similar attempts are being
undertaken in, for example, Thailand and Taiwan. I have attended presentations about the ‘Common
Japanese Framework’ and there are plans for a Common Chinese Framework, both tied back to the
CEFR. In my view these developments take the well-intentioned and very carefully carried out work to
create a Common European Framework and turn it too far in other directions. As Fulcher (2004, p. 255) says:

The question remains why it is necessary to harmonize – which in the case of higher education means introducing a common structure, credit rating, and content comparability for degree programs across Europe – to have a system of qualification recognition. It could equally be argued that harmonization means less diversity, and less choice, with one degree program looking very much like another.

Indeed, if that were to happen across the world (as increasingly seems to be the case), the resulting homogeneous shape of education would be damaging at many levels. But that is only one, macro-level, problem. There are many smaller problems when a ‘tool’ intended to provide a framework is increasingly claimed to be usable as a metric.

1. Interconnected concerns

There is a sociopolitical interplay between these three 21st-century developments that has potentially serious consequences for language education and assessment. In the assessment of writing specifically, we can look at the potential growth of ‘cookie-cutter’ education systems and curricula, driven perhaps by the CEFR or, as in the US (which in many ways stands outside education movements worldwide), by the rapid adoption of Common Core (visit www.corestandards.org/about-the-standards/frequently-asked-questions for a short explanation) as a result of the Race to the Top initiative (US Department of Education, 2009), now adopted by 49 States of the USA.
These movements make it more difficult to gain approval for the development and use of dynamic, indigenously derived criteria for assessing writing; at the same time, they make it very tempting to apply large-scale automated solutions to assessing the writing of individuals. Much has been said in this journal both for and against automated scoring. Many of us
believe that its potential for use in providing feedback to learners is worth exploring; but as teachers
ourselves we distrust the potential it has to create greater distance between individual learners and
their teachers. Both AES and DCM, then, connect in technical, political and educational ways with this
growing movement towards common standards nationally and internationally. Many in the wider
‘community’ of writing assessment remain quite unaware and/or unappreciative of these concerns. I
hope we can all give more thought to how our world may be changing around us.

References

Broad, B. (2003). What we really value: Beyond rubrics in teaching and assessing writing. Logan, Utah: Utah State University Press.
Broad, B., Adler-Kassner, L., Alford, B., Detweiler, J., Estrem, H., Harrington, S., et al. (2009). Organic writing assessment: Dynamic criteria mapping in action. Logan, Utah: Utah State University Press.
Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red
herrings? Assessing Writing, 18(1), 100–108.
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching and Assessment. Stras-
bourg: Language Policy Unit.
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing,
18(1), 7–24.
Deane, P. (2014). Construct representation and the structure of an automated scoring engine: Issues and uses. Paper presented at the symposium Technology and language testing: Automated scoring and beyond, AILA, Brisbane, August 10–15.
Douglas, D., & Myers, R. R. (2000). Assessing the communication skills of veterinary students: Whose criteria? In A. Kunnan (Ed.), Fairness and validation in language assessment. Studies in language testing (Vol. 9) (pp. 60–81). Cambridge Local Examinations/Cambridge University Press.
Elliot, N. (2005). On a scale: A social history of writing assessment in America. New York: Peter Lang Publishing.
Fulcher, G. (2004). Deluded by artifices? The Common European Framework and harmonization. Language Assessment Quarterly, 1(4), 253–266.
Jacoby, S., & McNamara, T. (1999). Locating competence. English for Specific Purposes, 18(3), 213–241.
Perelman, L. (2014). When the “state of the art” is counting words. Assessing Writing, 21, 104–111.
Shermis, M. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States
demonstration. Assessing Writing, 20, 53–76.
US Department of Education. (2009). Race to the Top executive summary. http://www2.ed.gov/programs/racetothetop/executive-summary.pdf (downloaded 01.09.14)
Williamson, D., Williams, F., Weng, V., & Trapani, C. (2014). Automated essay scoring in innovative assessments of writing from
sources. Journal of Writing Assessment.
Williamson, D., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.

Liz Hamp-Lyons
Centre for Research in English Language Learning and Assessment (CRELLA), University of
Bedfordshire, Putteridge Bury, United Kingdom

E-mail address: liz.hamp-lyons@beds.ac.uk

Available online 26 September 2014
