The rating process
Making judgements about people is a common feature of everyday
life. We are continually evaluating what others say and do, in
comments called for or not, offering criticism and feedback
informally to friends and colleagues about their behaviour. Formal,
institutional judgements figure prominently in our lives too.
People pass driving tests, survive the probationary period in a
new job, get promotions at work, succeed at interviews, win
Oscars for performances in a film, win medals in diving
competitions, and are released from prison for good behaviour. The
judgement will in most cases have direct consequences for the
person judged, and so issues of fairness arise, which most public
procedures try to take account of in some way. Regrettably, it is
easy to become aware of the way in which the idiosyncrasies of the
rater or the rating process can determine the outcome unfairly. In
international sporting contests such as the Olympic Games and
World Cup soccer, the nationality of judges, referees, or umpires,
and their presumed and sometimes real biases become an issue,
and attempts are made to mitigate their effects. All of us can
probably recount instances of the benign or damaging role of
particular raters in examination processes in which we have been
involved. Many people have anecdotes of bizarre procedures for
reaching rating decisions in various contests, for example in job
selection.
This chapter will discuss rating procedures used in language
assessment. (The terms 'ratings' and 'raters' will be used to refer
to the judgements and those who make them.) We will discuss the
necessity for, and pitfalls of, a rater-mediated approach to the
assessment of language. First, we will look at the procedures used
in judging, then at how judgements may be reported, and finally
at threats to the fairness of the procedures and how these may be
avoided or at least mitigated. We will consider in some detail
three aspects of the validation of rating procedures: the
establishment of rating protocols; exploring differences between
individual raters, and mitigating their effects; and understanding
interactions between raters and other features of the rating
process (for example, the reactions of individual raters to
particular topics or to speakers from a particular language
background).
Establishing a rating procedure
Rater-mediated assessment is becoming more and more central to
language teaching and learning. As communicative language
teaching has increasingly focused on communicative performance
in context, so rating the impact of that communication has
become the focus of language assessment. Rater-mediated
language assessment is also in line with institutional demands for
accountability in education, as outcomes of educational processes
are often described in terms of demonstrable practical
competence in the learner. This competence is then verified through
rater-mediated assessment.
Where assessments meet institutional requirements, for example
for certification, as with any bureaucratic procedure there are
set methods for yielding the judgement in question. These
methods typically have three main aspects.
First, there is agreement about the conditions (including the
length of time) under which the person's performance or
behaviour is elicited, and/or is attended to by the rater. This may
take the form of a formal examination, with set tasks and fixed
amounts of time for the performances. Alternatively, it may
involve a period of observation during instruction, or while
candidates carry out relevant tasks and roles in the actual target
setting.
Second, certain features of the performance are agreed to be
critical; the criteria for judging these will be determined and
agreed. Usually this will involve considering various components
of competence: fluency, accuracy, organization, sociocultural
appropriateness, and so on. The weighting of each of the
components of assessment becomes an issue. So does their relevance:
an increasingly important question in the validation of performance
assessments is how the relevant criteria for assessing the
performance are to be decided. The heart of the test construct lies
here.
Finally, raters who have been trained to an agreed understanding
of the criteria characterize a performance by allocating a
grade or rating. This assumes the prior development of
descriptive rating categories of some kind: 'competent', 'not
competent', 'ready to cope with a university course', and so on.
The problem with raters
Introducing the rater into the assessment process is both
necessary and problematic. It is problematic because ratings are
necessarily subjective. Another way of saying this is that the rating
given to a candidate is a reflection, not only of the quality of the
performance, but of the qualities as a rater of the person who has
judged it. The assumption in most rating schemes is that if the
rating category labels are clear and explicit, and the rater is trained
carefully to interpret them in accordance with the intentions of
the test designers, and concentrates while doing the rating, then
the rating process can be made objective. In other words, rating is
essentially reduced to a process of the recognition of objective
signs, with classification following automatically. In this view,
rating would resemble the process of chicken sexing, in which
young chicks are inspected for the external visible signs of their
sex (apparent only to the trained eye when chicks are very young),
and allocated to male and female categories accordingly.
But the reality is that rating remains intractably subjective. The
allocation of individuals to categories is not a deterministic
process, driven by the objective, recognizable characteristics of
performances, external to the rater. Rather, rating always
contains a significant degree of chance, associated with the rater and
other factors. The influence of these factors can be explored by
thinking of rating as a probabilistic phenomenon, that is, exploring
the probabilities of certain rating outcomes with particular
raters, particular tasks, and so on. We can easily show this by
looking at the way in which even trained raters differ in their
handling of the allocation of individual performances in borderline
cases. Close comparison of the ratings given by different raters in
such cases will typically show that one rater will be consistently
inclined to assign a lower category to candidates whom another
rater puts into a higher one. The obvious result of this is that
whether a candidate is judged as meeting a particular standard or
not depends fortuitously on which rater assesses their work.
Worse (because this is less predictable), raters may not even be
self-consistent from one assessed performance to the next, or
from one rating occasion to another. Researchers have sometimes
been dismayed to learn that there is as much variation among
raters as there is variation between candidates.
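
The close comparison of ratings described above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions, not a procedure taken from this chapter: the two sets of
ratings, the six-level scale, and the pass mark of 4 are all invented
for the example.

```python
from statistics import mean

# Hypothetical ratings (1-6 scale) given by two trained raters to the
# same eight borderline performances. All values are invented.
rater_a = [4, 3, 4, 5, 3, 4, 4, 3]
rater_b = [3, 3, 3, 4, 3, 4, 3, 2]
PASS_MARK = 4  # assumed cut-point: a rating of 4 or above meets the standard


def severity_gap(first, second):
    """Mean of (first - second); a positive value means `second` is harsher."""
    return mean(a - b for a, b in zip(first, second))


def split_decisions(first, second, pass_mark):
    """Indices of candidates who pass with one rater but fail with the other."""
    return [
        i
        for i, (a, b) in enumerate(zip(first, second))
        if (a >= pass_mark) != (b >= pass_mark)
    ]


gap = severity_gap(rater_a, rater_b)
disputed = split_decisions(rater_a, rater_b, PASS_MARK)
print(f"Rater B is on average {gap:.2f} levels harsher than rater A")
print(f"Candidates whose result depends on the rater: {disputed}")
```

With these invented figures, rater B is consistently the harsher of
the two, and three of the eight candidates would meet the standard
with one rater but not with the other.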
In the 1950s and 1960s, when concerns for reliability
dominated language assessment, rater-mediated assessment was
discouraged because of the problem of subjectivity. This led to a
tendency to avoid direct testing. Thus, writing skills were assessed
indirectly through examination of control over the grammatical
system and knowledge of vocabulary. But increasingly it was felt
that so much was lost by this restriction on the scope of
assessment that the problem of subjectivity was something that had
to be faced and managed. Particularly with the advent of
communicative language teaching, with its emphasis on how linguistic
knowledge is actually put to use, understanding and managing
the rating process became an urgent necessity.
Establishing a framework for making judgements
In establishing a rating procedure, we need to consider the criteria
by which performances at a given level will be recognized, and
then to decide how many different levels of performance we wish
to distinguish. The answers to these questions will determine the
basic framework or orientation for the rating process. Deciding
which of these orientations best fits a particular assessment
setting will depend on the context and purpose of the assessment.
It is useful to view achievement as a continuum. The assessment
system may recognize a number of different levels of
achievement, in which case we then think of it as representing a ladder
or scale. In other contexts, only one point on the continuum is of
relevance, and a simple 'enough/not enough' distinction at that
point needs to be made. In this case the testing system can best be
thought of in terms of a hurdle or cut-point. These two possibilities
are not of course contradictory, but are a little like different
settings on a camera or microscope. We can stand back and look at
the whole continuum, or we can zoom in on one part of it. Each
level of the ladder may be thought of as requiring a 'yes/no'
decision ('enough/not enough') for that level.
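
The relationship between the two perspectives can be made concrete
with a small sketch. It assumes, purely for illustration, that
achievement can be expressed as a score from 0 to 100 and that the
cut-points are known; both the score range and the cut-point values
are invented and do not come from any real test.

```python
from bisect import bisect_right

# Hypothetical cut-points on an underlying 0-100 achievement continuum.
# A candidate at or above a cut-point has "enough" for that level.
CUT_POINTS = [20, 40, 55, 70, 85]  # invented boundaries for levels 2 to 6


def clears_hurdle(score, cut_point):
    """Hurdle view: a single enough/not-enough decision at one cut-point."""
    return score >= cut_point


def ladder_level(score, cut_points=CUT_POINTS):
    """Ladder view: level 1 plus the number of cut-points the score clears."""
    return 1 + bisect_right(cut_points, score)


print(ladder_level(54))       # position on the whole ladder
print(clears_hurdle(54, 55))  # zooming in on one cut-point
```

In this sketch the ladder is simply a stack of hurdles: a candidate's
level is one more than the number of yes/no decisions passed.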
We can illustrate the distinction between the hurdle and ladder
perspectives by reference to two very different kinds of
performance. Consider the driving test. Most people, given adequate
preparation, would assume they could pass it. Although not
everybody who passes the test has equal competence as a driver,
the function of the test is to make a simple distinction between
those who are safe on the roads and those who are not, rather
than to distinguish degrees of competence in driving skill. Often,
in hurdle assessment, as in the driving test, the assessment system
is not intended to permanently exclude. In other words, every
competent person should pass, and it is assumed that most people
with adequate preparation will be capable of a competent
performance, and derive the benefits of certification accordingly. The
aim of the certification is to protect other people from
incompetence. The assessment is essentially non-competitive.
Many systems of assessment try to combine the characteristics
of access and competition. For example, in the system of
certification for competence in piano playing, a number of grades of
performance are established, with relevant criteria defining each, and
over a number of years a learner of the piano may proceed through
the examinations for the grades. As the levels become more
demanding, fewer people have the necessary motivation or
opportunity to prepare for performance at such a level, or indeed even
the necessary skill. The final stages of certification involve fiercely
contested piano competitions where only the most brilliant will
succeed, so resembling the Olympic context. But at levels below
this, the 'grade' system of certification involves a principle of
access: at each step of competence, judged in a 'yes/no' manner
('competent at this level' vs. 'not competent'), those with adequate
preparation are likely to pass. The function of the assessment at a
given level is not to make distinctions between candidates, other
than a binary distinction between those who meet the
requirements of the level and those who do not.
Language testing has examples of each of these kinds of
framework for making judgements about individuals. In judgements of
competence to perform particular kinds of occupational roles,
for example to work as a medical practitioner through the
medium of a second language, where the communicative
demands of the work or study setting to which access is sought
are high, then the form of the judgement will be 'ready' or 'not
ready', as in the driving test. Even though the amount of
preparation is much greater, and what is demanded is much higher, we
nevertheless expect each of the medical professionals who present
for such a test to succeed in the end. Its function is not usually to
exclude permanently those who need to demonstrate competence
in the language in order to practise their profession, although
tests may of course be used as instruments of such exclusion, as
we shall see later, in Chapter 7. In contrast, in contexts where
only a small percentage of candidates can be selected, for example
in the awarding of competitive prizes or scholarships, then the
higher levels of achievement will become important as they
are used to distinguish the most able of candidates from the rest.
This is the case in contexts of achievement, for example, in
school-based language learning, or in vocational and workplace
training.
Rating scales
Most often, frameworks for rating are designed as scales, as this
allows the greatest flexibility to the users, who may want to use
the multiple distinctions available from a scale, or who may
choose to focus on only one cut-point or region of the scale. The
preparation of such a scale involves developing level descriptors,
that is, describing in words performances that illustrate each level
of competence defined on the scale. For example, in the driving
test, performance at a passing level might be described as 'Can
drive in normal traffic conditions for 20 minutes making a range
of normal movements and dealing with a range of typical events;
and can cope with a limited number of frequently encountered
suddenly emerging situations on the road.' This description
will necessarily be abstracted from the experience of those
familiar with the setting and its demands, in this case experienced
driving instructors, and will have to be vetted by a relevant authority
entrusted with, in this case, issuing a licence to drive based on the
test performance.
An ordered series of such descriptions is known as a rating
scale. A number of distinctions are usually made: rating scales
typically have between 3 and [...] levels. Figure 4.1 gives an example
of a summary rating scale developed by the author to describe
levels of performance on an advanced level test of English as a
second language for speaking skills in clinical settings.
Aspect of performance considered: overall communicative
effectiveness
1 elementary level of communicative effectiveness
2 clearly could not cope in a bridging programme in a clinical
  setting involving interactions with patients and colleagues
3 just below minimum competence needed to cope in a bridging
  programme in a clinical setting involving interactions with
  patients and colleagues
4 has minimum competence needed to cope in a bridging
  programme in a clinical setting involving interactions with
  patients and colleagues
5 could easily cope in a bridging programme in a clinical setting
  involving interactions with patients and colleagues
6 near native communicative effectiveness
FIGURE 4.1 Rating scale, Occupational English Test for health
professionals
This rating scale is used as part of a screening procedure (used
to determine if an overseas trained health professional has the
necessary minimum language skills to be admitted under
supervision to the clinical setting). In this particular case, as the focus
of the discriminations made in the scale is around a single point of
minimum competence, the other levels tend to be defined in terms
of their distance from this point. Most rating scales do not have
such a single point of reference, and ideally the definition of each
level should be independent of the ones above and below it on the
scale. In fact, however, given the continuous nature of the scale,
wordings frequently involve comparative statements, with one
level described relative to one or more others, for example, in
terms of greater or less control of features of the grammatical
system, or pronunciation, and so on.
An important aspect of a scale is the way in which performance
at the top end of the scale is defined. There is frequently an
unacknowledged problem here. Rating scales often make
reference to what are assumed to be the typical performances of native
speakers or expert users of the language at the top end of the
scale. That is, it is assumed that the performance of native
speakers will be fundamentally unlike the performances of non-native
speakers, who will tend gradually to approximate native speaker
performance as their own proficiency increases. However, claims
about the uniformly superior performance of these idealized
native speakers have rarely been supported empirically. In fact,
the studies that have been carried out typically show the
performance of native speakers as highly variable, related to
educational level, and covering a range of positions on the scale. In spite
of this, the idealized view of native speaker performance still
hovers inappropriately at the top of many rating scales.
The number of levels on a rating scale is also an important
matter to consider, although the questions raised here are more a
matter of practical utility than of theoretical validity. There is no
point in proliferating descriptions outside the range of ability of
interest. Having too few distinctions within the range of such
ability is also frustrating, and the revision of rating scales often
involves the creation of more distinctions.
The failure of rating scales to make distinctions sufficiently fine
to capture progress being made by students is a frequent problem.
It arises because the purposes of users of a single assessment
instrument may be at odds. Teachers have continuous exposure
to their students' achievements in the normal course of learning.
In the process, they receive ongoing informal confirmation of
learner progress which may not be adequately reflected in a
category difference as described by a scale. Imagine handing parents
who are seeking evidence of their child's growth a measuring stick
with marks on it only a foot (30 centimetres) apart, the measure
not allowing any other distinction to be made. The parents can
observe the growth of the child: they have independent evidence
in the comments of relatives, or the fact that the child has grown
out of a set of clothes. Yet in terms of the measuring stick no
growth can be recorded because the child has not passed the
cut-point into the next adjacent category of measurement.
Teachers restricted to reporting achievement only in terms of
broad rating scale categories are in a similar position. Most rating
scales used in public educational settings are imposed by
government authorities for purposes of administrative efficiency and
financial accountability, for which fine-grained distinctions are
unnecessary. The scales are used to report the achievements of the
educational system in terms of changes in the proficiency of large
numbers of learners over relatively extended periods of time. The
government needs the 'big picture' of learner (and teacher)
achievement in order to satisfy itself that its educational budget is
yielding results. Teachers working with these
government-imposed, scale-based reporting mechanisms experience
frustrations with the lack of fine distinctions on the scale. The
coarse-grained character of the categories may hardly do justice
to the teachers' sense of the growth and learning that has been
achieved in a course. The purposes of the two groups
(administrators, who are interested in financial accountability, and
teachers, who are interested in the learning process) may be at odds in
such a case.
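
The measuring-stick analogy can be worked through with a few
invented numbers. The sketch below assumes marks every 30
centimetres, as in the analogy; the child's two heights are
hypothetical.

```python
MARK_SPACING_CM = 30  # marks roughly a foot apart, as in the analogy


def band(height_cm, spacing=MARK_SPACING_CM):
    """Number of the highest mark on the stick that the child has reached."""
    return int(height_cm // spacing)


# Hypothetical heights in centimetres, measured a year apart.
before, after = 95, 110

print(after - before)             # growth the parents can see for themselves
print(band(before), band(after))  # category recorded by the coarse stick
```

Here the child has grown by 15 centimetres, yet both measurements
fall in the same band, so the stick records no change. A broad rating
scale treats a learner's progress within a level in the same way.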
The wording of rating scales may vary according to the
purposes for which they are to be used. On the one hand, scales are
used to guide and constrain the behaviour of raters, and on the
other, they are used to report the outcome of a rating process to
score users: teachers, employers, admission authorities, parents,
and so on. As a result, different versions of a rating scale are often
created for different users.
Holistic and analytic ratings
Performances are complex. Judgement of performances involves
balancing perceptions of a number of different features of the
performance. In speaking, a person may be fluent, but hard to
understand; another may be correct, but stilted. Thus rather than
getting raters to record a single impression of the impact of the
performance as a whole (holistic rating), an alternative approach
involves getting raters to provide separate assessments for each of
a number of aspects of performance. For example, in speaking,
raters may be asked to provide separate assessments of fluency,
appropriateness, pronunciation, control of formal resources of
grammar and vocabulary, and the like. This latter approach is
known as analytic rating, and requires the development of a
number of separate rating scales for each aspect assessed. Even where
analytic rating is carried out, it is usual to combine the scores for
the separate aspects into a single overall score for reporting
purposes. This single reporting scale may maintain its analytic
orientation in that the overall characterization of a level description
may consist of a weaving together of strands relating to separate
aspects of performance.
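
One simple way of combining analytic ratings into a single reporting
score is a weighted average. The sketch below is an assumption made
for illustration only: the chapter does not prescribe a combination
rule, and the aspect names, ratings, and weights are all invented.

```python
# Hypothetical analytic ratings (1-6 scales) for one speaking performance.
ratings = {
    "fluency": 5,
    "appropriateness": 4,
    "pronunciation": 3,
    "grammar": 4,
    "vocabulary": 4,
}

# Two assumed weighting schemes; choosing the weights is itself a
# validity question, not something the arithmetic can settle.
equal_weights = {aspect: 1 for aspect in ratings}
fluency_heavy = {**equal_weights, "fluency": 2}


def overall_score(ratings, weights):
    """Weighted average of the aspect ratings."""
    total_weight = sum(weights[aspect] for aspect in ratings)
    weighted_sum = sum(ratings[aspect] * weights[aspect] for aspect in ratings)
    return weighted_sum / total_weight


print(overall_score(ratings, equal_weights))
print(round(overall_score(ratings, fluency_heavy), 2))
```

The same set of analytic ratings yields a different overall score
under each weighting scheme, which is why the weighting of the
components of assessment becomes an issue.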
Rater training
An important way to improve the quality of rater-mediated
assessment schemes is to provide initial and ongoing training to
raters. This usually takes the form of a moderation meeting. At
such a meeting, individual raters are each initially asked to
provide independent ratings for a series of performances at different
levels. They are then confronted with the differences between the
ratings they have given and those given by the other raters in the
group. Discrepancies are noted and are discussed in detail, with
particular attention being paid to the way in which the level
descriptors are being interpreted by individual raters.
Moderation meetings have the function of bringing about broad
agreement on the relevant interpretation of level descriptors and
rating categories.
Even where agreement is reached on the meaning of terms,
there remain differences between raters. This may be in terms of
relative severity, or a consistent tendency to see a particular
performance as narrowly demonstrating or narrowly failing to
demonstrate achievement at a particular performance level. The
more extreme cases of rater harshness or leniency will emerge in
rater training. Usually, the psychological pressure of
embarrassment over having given ratings out of line with those of others is
sufficient to get raters to reduce their differences. After an initial
moderation meeting, raters are typically given a further set of
training performances to rate and are accredited as raters if these
ratings show adequate conformity with agreed ratings for the
performances in question. Ongoing monitoring of rater performance
is clearly necessary to ensure fairness in the testing process.
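
What counts as 'adequate conformity' has to be decided by each
testing programme. As a minimal sketch, assume that conformity is
checked by the proportion of exact matches with the agreed ratings
and the proportion of ratings within one level of them. The two
thresholds and all of the ratings below are invented for
illustration and are not drawn from any actual accreditation scheme.

```python
def agreement_rates(trainee, agreed):
    """Return (exact, adjacent) agreement as proportions of performances."""
    pairs = list(zip(trainee, agreed))
    exact = sum(t == a for t, a in pairs) / len(pairs)
    adjacent = sum(abs(t - a) <= 1 for t, a in pairs) / len(pairs)
    return exact, adjacent


def is_accredited(trainee, agreed, min_exact=0.6, min_adjacent=0.9):
    """Apply the assumed thresholds for exact and within-one-level agreement."""
    exact, adjacent = agreement_rates(trainee, agreed)
    return exact >= min_exact and adjacent >= min_adjacent


# Hypothetical agreed ratings for ten training performances (1-6 scale),
# with the ratings given by two hypothetical trainee raters.
agreed = [2, 3, 4, 4, 5, 3, 6, 4, 2, 5]
trainee_1 = [2, 3, 4, 3, 5, 3, 5, 4, 2, 5]
trainee_2 = [1, 2, 3, 3, 4, 2, 5, 3, 1, 4]  # one level harsher throughout

print(agreement_rates(trainee_1, agreed), is_accredited(trainee_1, agreed))
print(agreement_rates(trainee_2, agreed), is_accredited(trainee_2, agreed))
```

The second trainee is never more than one level away from the agreed
rating but is harsher every time, so a check on adjacent agreement
alone would not reveal the problem.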
Conclusion
In this chapter we have accepted the desirability of rater-mediated
assessment, and looked at issues in the design of rating
procedures. We have looked at the construction and use of rating scales
to guide rater behaviour, and noted the enormous potential for
variability and hence unfairness in the rating process, associated
for example with task and rater factors. Rater-mediated
assessment is complex and in a way ambitious in its goals, and requires
a sophisticated understanding of the ways in which it can be
unfair to candidates and ways unfairness can be avoided.