What We Don't Know about the Evaluation of Writing

James C. Raymond

Source: College Composition and Communication, Vol. 33, No. 4 (December 1982), pp. 399-403
Published by: National Council of Teachers of English
Stable URL: http://www.jstor.org/stable/357952

James C. Raymond is Associate Professor of English and Assistant Dean of the Graduate School at the University of Alabama. He is the author of Writing (Is an Unnatural Act), editor of Literacy as a Human Problem, and co-author of a forthcoming book on legal writing.

Nineteen years ago, Braddock, Lloyd-Jones, and Schoer compared research
in written composition to "chemical research as it emerged from the period
of alchemy,"1 an image that continues to haunt us, leading us to expect re-
search in composition to evolve as a discipline, like each of the sciences, with
universally accepted methods and neat boundaries around its subject. Since
that time a great deal of important research has occurred, much of it sup-
ported by methods and insights imported from the social and behavioral sci-
ences. But the evolution suggested by the image has not occurred.
In particular, we certainly know more about evaluation than we did twenty
years ago; but what we know is not definitive, nor is it an orderly and sys-
tematic corpus. It may be described as a growing list of terms and techniques,
such as "the general impression scales" a system used by ETS;2 and "analytic
scales," the guided scoring procedure developed by Paul Diederich;3 and
"Primary Trait Scoring," the system developed by Richard Lloyd-Jones for
the National Assessment of Educational Progress (NAEP);4 and "T-unit anal-
ysis," the measure of syntactic fluency invented by Kellogg Hunt;5 and
"holistic scoring," a generic term that, as Charles Cooper defines it, includes
a variety of guided scoring methods;6 and "relative readability," the focus of
measurement proposed by E. D. Hirsch in The Philosophy of Composition.7
What is remarkable about this list is that it would make as much sense to
study it in alphabetical order as chronological. Each of the items is so
thoroughly independent of the others that not even the order of their inven-
tion is logical or necessary. To the items I have mentioned might be added
others so disparate in what they purport to measure as to suggest that we
have not even agreed on what it is we are trying to evaluate--whether it is
the mastery of editorial skills, or indices of cognitive development, or success
in communicating a semantic intention. In the evaluation of writing, old sys-
tems survive the invention of new ones; nothing supersedes or replaces any-
thing else. There are a few gains in precision, but always at the expense of
questionable assumptions about the nature of writing or about the relative
importance of various factors associated with quality.
Perhaps the status of our research can more accurately be compared to the
state of linguistics before it emerged from philology, when Ferdinand de
Saussure discovered that scholars had assembled a vast body of knowledge
about languages without attempting to determine "the nature of the object
they were studying."8 Because language has turned out to be not an object,
but an unstable set of fluid relationships among symbols and subjects, the
methods suitable for studying it and the kind of knowledge we might expect
these methods to yield are necessarily different from the methods suitable
for studying objects and the kind of knowledge we expect those methods to
yield. This is why the "science" of linguistics is radically different from the
physical sciences. In linguistics, to cite Jonathan Culler's paraphrase of Saus-
sure, "you cannot hope to attain an absolute or Godlike view of things";
instead you must simply "choose a perspective."9
Saussure's discovery and acknowledgment of the essential subjectivity of
language has had, of course, profound effects, not only on the development of
modern linguistics but on the development of critical theory as well. There are
now hopeful signs of similar developments in the evaluation of student writ-
ing. In an article entitled "Written Composition: Toward a Theory of Evalua-
tion," Anne Ruggles Gere asks similar questions about the nature of writing,
and like Saussure, she sees the object of our study as not an object in itself, but
as insubstantial relationships among symbols and human consciousness. She
rejects the traditional notion that "the text itself is autonomous" in favor of a
theory that "meaning exists in the reader as well as in the text."10 If this notion
were to gain widespread acceptance among us--if we were to agree that the
quality of writing resides not exclusively or even primarily in the text itself, but
in the several marriages each text makes with the minds of its individual
readers--then we could expect, oddly enough, not a radical break from the
way we have been evaluating composition in the past, but simply a better
understanding of the uses and limitations of those methods we already have.
The procedures we have been borrowing from social and behavioral sciences
will continue to be useful, just as they have been in the history of linguistics
since Saussure; but we will become more keenly aware of the limitations of
these procedures, less naive in our expectations.
As guidelines for exploiting what we know about evaluating writing with-
out stumbling into the abyss of what we do not know, I would offer eight
caveats and suggestions.
1. An evaluation method is good if it does what it sets out to do. If scores
on the CEEB exam do, as the test makers claim, have a high correlation with
grades in freshman composition, then the test is useful for the purpose of
predicting success in freshman composition, even though it may not even
indirectly measure a student's ability to write. Even a test of conventional
usage may serve a legitimate purpose, if it is intelligently constructed, and if
we want to assess editorial skills as distinct from other aspects of writing that
are generally acknowledged to be more important. Usage, incidentally, is one
aspect of writing that ought to be examined with empirical methods, though
as Joseph Williams has told us recently11 and Thomas Creswell six years
ago,12 we seem content to recycle old shibboleths instead of using methods
currently available to discover the facts of usage in edited English.
2. There is safety in numbers. Inferences made on the basis of large sam-
ples may be useful, as long as they are not applied injudiciously to the evalua-
tion of single papers. It is true that T-unit length is longer, on the average, in
professionally written prose than it is, on the average, in prose composed by
twelfth-graders. It is not true that any given sample of prose with long
T-units is necessarily better than another sample with short T-units, not even
if they both contain precisely the same information.
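
The arithmetic behind these averages is simple enough to sketch. The example below is only an illustration, not Hunt's procedure: it is written in Python, the sentences are invented, and the T-units are segmented by hand, since genuine T-unit segmentation requires clause-level parsing. It shows how mean T-unit length is computed and why two samples carrying the same information can score very differently.

```python
# A minimal sketch of the arithmetic behind mean T-unit length.
# T-units here are segmented by hand; real T-unit analysis requires
# clause-level parsing, which is not attempted.

def mean_t_unit_length(t_units):
    """Average number of words per T-unit for a list of T-unit strings."""
    if not t_units:
        return 0.0
    return sum(len(unit.split()) for unit in t_units) / len(t_units)

# The same information packaged in one long T-unit and in three short ones.
embedded = ["The committee, which met on Tuesday, rejected the proposal "
            "because it lacked a budget."]
choppy = ["The committee met on Tuesday.",
          "It rejected the proposal.",
          "The proposal lacked a budget."]

print(mean_t_unit_length(embedded))  # 14.0 words per T-unit
print(mean_t_unit_length(choppy))    # about 4.7 words per T-unit
# The first sample scores roughly three times higher on the measure,
# but nothing in the arithmetic shows that it is better writing.
```
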
3. There is safety in numbers. Because the performance of skilled writers
varies considerably from one day to the next and from one writing task to
another, it makes sense to construct assessment tests that require more than
one kind of writing on more than one day. If external constraints preclude
multiple testing, it makes sense to allow students who are dissatisfied with
the results of a single essay exam to take another without prejudice.
4. There is safety in numbers. Because variability in reader response is
both inevitable and desirable, it makes sense to have more than one reader
evaluate any exam that will have serious consequences for individual stu-
dents. Even in daily classroom practice, it makes sense for teachers to give
students the option of having a second reader for any paper on which the
instructor's grade is disputed.
5. Although training sessions for raters are normally motivated by the de-
sire to achieve inter-rater reliability, their chief value is that they require
evaluators to examine their assumptions critically and to arrive at an institu-
tional policy about what is important and unimportant in writing. The inter-
rater reliability achieved this way ought not to be confused with objectivity
or validity; the consensus reached at one institution will and ought to vary
from the consensus reached at other institutions, just as judgments about
what constitutes publishable prose vary among editors and publishers.
6. The degree to which inter-rater reliability is a desirable characteristic in
evaluation varies with the kind of assessment the procedure is intended to
yield. It would be possible to achieve near perfect inter-rater reliability by
simply counting the number of words produced during the test period; but
no one would seriously accept this as a measure of quality. Because the qual-
ity of writing resides not entirely in the text, but in the interactions among
the text, its author, and its individual readers, we should not only expect but
actually demand a reasonable amount of variation among raters when the goal
is to evaluate a piece of writing as a whole. Instead of apologizing for reliabil-
ity rates in the neighborhood of .80, we might well become suspicious of
rates that are much higher than that.
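
In its simplest form, a reliability figure such as .80 is a correlation between two raters' scores. The sketch below is a hypothetical illustration: the raters, the 1-6 holistic scale, and the scores are all invented. It computes a Pearson correlation for two raters and then shows that perfect agreement on a mechanical count of words is easy to obtain and says nothing about quality.

```python
# Hypothetical illustration of inter-rater reliability as a Pearson
# correlation between two raters' holistic scores (1-6 scale, invented data).
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Scores given to the same ten essays by two trained raters.
rater_a = [4, 3, 5, 2, 6, 4, 3, 5, 2, 4]
rater_b = [4, 2, 5, 3, 6, 3, 4, 4, 2, 5]
print(round(pearson(rater_a, rater_b), 2))  # 0.81: substantial, but not perfect, agreement

# "Scoring" by word count alone would agree perfectly between raters
# (r = 1.0), yet it measures length rather than quality.
word_counts = [512, 430, 610, 295, 700, 480, 415, 590, 310, 505]
print(round(pearson(word_counts, word_counts), 2))  # 1.0
```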

7. A sound rhetorical framework is as essential to a holistic evaluation
procedure as a sound statistical design, because rhetoric, at least in its more
comprehensive manifestations, is the discipline that treats discourse as an
intersubjective transaction rather than as an objective phenomenon. Primary
trait scoring is receiving much attention today not only because its research
design is sound but because its rhetorical assumptions are well articulated:
instead of evaluating student writing in a vacuum, it evaluates that writing as
a response to a specific task with a specific purpose communicated to the
writer through carefully written instructions.
8. The more complex an evaluation procedure is, the less likely it will be
used by anyone but its inventors. Primary trait scoring is widely discussed
because it is a relatively simple system, focusing on some features of writing
to the exclusion of others that would not be ignored in a genuinely holistic
procedure. A much more comprehensive procedure--like the one developed
by Aviva Freedman and Ian Pringle13 to measure growth in syntactic devel-
opment, rhetorical skills, and cognitive complexity among students at Carle-
ton University, Ottawa, over a period of years--is appropriate for the kind of
longitudinal research Freedman and Pringle are doing; but because it is more
comprehensive, it is even less likely than primary trait scoring to be trans-
planted to other campuses to solve the daily and seasonal problems of evalua-
tion that need to be solved.
The essence of these observations has been expressed more succinctly by Lee
Odell and Charles Cooper, whose names are generally associated with what we
know about evaluating writing. Cooper and Odell insist "that an advocate of any
one procedure identify: 1) the assumptions underlying a given procedure, 2) the
extent to which those assumptions are consistent with our current understand-
ing about the nature and purposes of written discourse, and 3) the limitations as
well as the uses of any given procedure."14
The only quibble I might have with Cooper and Odell is that they offer
these observations as interim advice, to be followed while we still lack "de-
finitive studies" and "a theoretically sound, empirically tested basis for choos-
ing a procedure for evaluating student writing." If they mean to suggest that
composition is about to emerge from alchemy, I suspect they are wrong.
Imagine the consequences if they were right: there would be, as there is in
freshman chemistry, a received body of knowledge and a standard set of
algorithms. There would be, as there is in chemistry, only a dozen or so
textbooks, each barely distinguishable from the others. Anyone remotely
familiar with the publishing industry realizes how far we are from this state of
affairs. There are, if anything, more rather than fewer textbooks to choose
from than there were twenty years ago, indicating that, despite undeniable
advances in composition, there is still no consensus about what good writing
is, much less about how it might be evoked or measured.
This disarray does not reveal us as too dull or too lazy to make a science of
our learning. Rather it suggests the essential difference between research into
purely physical or statistical phenomena, on the one hand, and composition
research, on the other hand--the difference between working with objects
and numbers on the one hand and the symbols of intersubjective transactions
on the other. Pure objectivity in the evaluation of writing is, by the nature of
writing, impossible. Even if a computer were to attack the task with programs
to measure every nuance of style and meaning known to researchers, the
results would still be subject to review by anyone willing to review them.
Good writing, ultimately, is writing that is perceived to be good. In evalua-
tion, perception may not be all, but it is a sine qua non. Perception may be
codified, guided, corroborated, or refined in any number of ways; but it can
never be entirely quantified or eliminated. This means that our research will
forever be inconclusive, especially when compared with the more precise
results of science. We should, of course, avail ourselves of every research
procedure that can make our work less indefinite, but we should recognize
that, because the object of our investigation is subjective, objective proce-
dures will always miss the mark. Perhaps the time has come, not to decon-
struct the old alchemy-to-chemistry metaphor, but to acknowledge its lim-
itations, lest it make us feel guilty for failing to achieve a degree of precision
that cannot and should not be achieved.

Notes

1. Richard Braddock, Richard Lloyd-Jones, and Lowell Schoer, Research in Written Composi-
tion (Urbana, IL: National Council of Teachers of English, 1963), p. 5.
2. See Lee Odell and Charles Cooper, "Procedures for Evaluating Writing: Assumptions and
Needed Research," College English, 42 (September, 1980), 36.
3. Paul B. Diederich, Measuring Growth in English (Urbana, IL: National Council of Teach-
ers of English, 1974).
4. Richard Lloyd-Jones, "Primary Trait Scoring," in Evaluating Writing: Describing, Measur-
ing, Judging, ed. Charles R. Cooper and Lee Odell (Urbana, IL: National Council of Teachers of
English, 1977), pp. 33-66.
5. Kellogg W. Hunt, Grammatical Structures Written at Three Grade Levels (Urbana, IL: Na-
tional Council of Teachers of English, 1965).
6. Charles R. Cooper, "Holistic Evaluation of Writing," in Evaluating Writing: Describing,
Measuring, Judging, p. 3.
7. E. D. Hirsch, Jr., The Philosophy of Composition (Chicago: University of Chicago Press,
1977), p. 189.
8. Quoted by Jonathan Culler, Ferdinand de Saussure (New York: Penguin, 1976), p. 8.
9. Culler, p. xv.
10. Anne Ruggles Gere, "Written Composition: Toward a Theory of Evaluation," College Eng-
lish, 42 (September, 1980), 58.
11. Joseph M. Williams, "The Phenomenology of Error," College Composition and Communica-
tion, 32 (May, 1981), 152-168.
13. Aviva Freedman and Ian Pringle, "Writing in the College Years," College Composition and
Communication, 31 (October, 1980), 314.
14. "Procedures for Evaluating Writing," p. 43.
