Chapter 4 From Language Testing, 2000

The rating process

Making judgements about people is a common feature of everyday life. We are continually evaluating what others say and do, in comments called for or not, offering criticism and feedback informally to friends and colleagues about their behaviour. Formal, institutional judgements figure prominently in our lives too. People pass driving tests, survive the probationary period in a new job, get promotions at work, succeed at interviews, win Oscars for performances in a film, win medals in diving competitions, and are released from prison for good behaviour. The judgement will in most cases have direct consequences for the person judged, and so issues of fairness arise, which most public procedures try to take account of in some way. Regrettably, it is easy to become aware of the way in which the idiosyncrasies of the rater or the rating process can determine the outcome unfairly. In international sporting contests such as the Olympic Games and World Cup soccer, the nationality of judges, referees or umpires, and their presumed and sometimes real biases become an issue, and attempts are made to mitigate their effects. All of us can probably recount instances of the benign or damaging role of particular raters in examination processes in which we have been involved. Many people have anecdotes of bizarre procedures for reaching rating decisions in various contexts, for example in job selection.

This chapter will discuss rating procedures used in language assessment. (The terms ratings and raters will be used to refer to the judgements and those who make them.) We will discuss the necessity for, and pitfalls of, a rater-mediated approach to the assessment of language. First, we will look at the procedures used in judging, then at how judgements may be reported, and finally at threats to the fairness of the procedures and how these may be avoided or at least mitigated.
We will consider in some detail three aspects of the validation of rating procedures: the establishment of rating protocols; exploring differences between individual raters, and mitigating their effects; and understanding interactions between raters and other features of the rating process (for example, the reactions of individual raters to particular topics or to speakers from a particular language background).

Establishing a rating procedure

Rater-mediated assessment is becoming more and more central to language teaching and learning. As communicative language teaching has increasingly focused on communicative performance in context, so rating the impact of that communication has become the focus of language assessment. Rater-mediated language assessment is also in line with institutional demands for accountability in education, as outcomes of educational processes are often described in terms of demonstrable practical competence in the learner. This competence is then verified through [...]

Where assessments meet institutional requirements, for example for certification, as with any bureaucratic procedure there are set methods for yielding the judgement in question. These methods typically have three main aspects.

First, there is agreement about the conditions (including the length of time) under which the person's performance or behaviour is elicited, and/or is attended to by the rater. This may take the form of a formal examination, with set tasks and fixed amounts of time for the performances. Alternatively, it may involve a period of observation during instruction, or while candidates carry out relevant tasks and roles in the actual target setting.

Second, certain features of the performance are agreed to be critical; the criteria for judging these will be determined and agreed. Usually this will involve considering various components of competence: fluency, accuracy, organization, sociocultural appropriateness and so on. The weighting of each of the components of assessment becomes an issue.
So does their relevance: an increasingly important question in the validation of performance assessments is how the relevant criteria for assessing the performance are to be decided. The heart of the test construct lies here.

Finally, raters who have been trained to an agreed understanding of the criteria characterize a performance by allocating a grade or rating. This assumes the prior development of descriptive rating categories of some kind: 'competent', 'not competent', 'ready to cope with a university course', and so on.

The problem with raters

Introducing the rater into the assessment process is both necessary and problematic. It is problematic because ratings are necessarily subjective. Another way of saying this is that the rating given to a candidate is a reflection, not only of the quality of the performance, but of the qualities as a rater of the person who has judged it. The assumption in most rating schemes is that if the rating category labels are clear and explicit, and the rater is trained carefully to interpret them in accordance with the intentions of the test designers, and concentrates while doing the rating, then the rating process can be made objective. In other words, rating is essentially reduced to a process of the recognition of objective signs, with classification following automatically. In this view, rating would resemble the process of chicken sexing, in which young chicks are inspected for the external visible signs of their sex (apparent only to the trained eye when chicks are very young), and allocated to male and female categories accordingly.

But the reality is that rating remains intractably subjective. The allocation of individuals to categories is not a deterministic process, driven by the objective, recognizable characteristics of performances, external to the rater. Rather, rating always contains a significant degree of chance, associated with the rater and other factors.
The influence of these factors can be explored by thinking of rating as a probabilistic phenomenon, that is, exploring the probabilities of certain rating outcomes with particular raters, particular tasks, and so on. We can easily show this by looking at the way in which even trained raters differ in their handling of the allocation of individual performances in borderline cases. Close comparison of the ratings given by different raters in such cases will typically show that one rater will be consistently inclined to assign a lower category to candidates whom another rater puts into a higher one. The obvious result of this is that whether a candidate is judged as meeting a particular standard or not depends fortuitously on which rater assesses their work. Worse (because this is less predictable), raters may not even be self-consistent from one assessed performance to the next, or from one rating occasion to another. Researchers have sometimes been dismayed to learn that there is as much variation among raters as there is variation between candidates.

In the 1950s and 1960s, when concerns for reliability dominated language assessment, rater-mediated assessment was discouraged because of the problem of subjectivity. This led to a tendency to avoid direct testing. Thus, writing skills were assessed indirectly through examination of control over the grammatical system and knowledge of vocabulary. But increasingly it was felt that so much was lost by this restriction on the scope of assessment that the problem of subjectivity was something that had to be faced and managed. Particularly with the advent of communicative language teaching, with its emphasis on how linguistic knowledge is actually put to use, understanding and managing the rating process became an urgent necessity.
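
The contrast between a consistently harsher rater and a more lenient one can be made concrete with a small sketch. Everything specific in it is an assumption made for illustration: the two sets of ratings are invented, the six-level scale and the cut-point of 4 are hypothetical, and the three measures are simple descriptive checks rather than the statistical models used in rating research.

```python
from statistics import mean

# Hypothetical ratings on a 1-6 scale: two raters judging the same
# eight performances. These numbers are invented for illustration.
rater_a = [3, 4, 4, 5, 3, 4, 2, 5]
rater_b = [4, 4, 5, 5, 4, 5, 3, 5]

CUT_POINT = 4  # assumed minimum rating that counts as meeting the standard


def severity_gap(first, second):
    """Mean of (first - second); a negative value means `first` is harsher."""
    return mean(a - b for a, b in zip(first, second))


def exact_agreement(first, second):
    """Proportion of performances placed in the same category by both raters."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def rater_dependent_outcomes(first, second, cut):
    """Indices of performances that pass under one rater and fail under the other."""
    return [
        i
        for i, (a, b) in enumerate(zip(first, second))
        if (a >= cut) != (b >= cut)
    ]


if __name__ == "__main__":
    print("severity gap (A - B):", severity_gap(rater_a, rater_b))
    print("exact agreement:", exact_agreement(rater_a, rater_b))
    print("outcome depends on rater:",
          rater_dependent_outcomes(rater_a, rater_b, CUT_POINT))
```

With these invented figures, the first rater is on average a little over half a level harsher, and two of the eight candidates would meet the standard under one rater but not under the other, which is the fortuitous dependence on the rater described above.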
Establishing a framework for making judgements

In establishing a rating procedure, we need to consider the criteria by which performances at a given level will be recognized, and then to decide how many different levels of performance we wish to distinguish. The answers to these questions will determine the basic framework or orientation for the rating process. Deciding which of these orientations best fits a particular assessment setting will depend on the context and purpose of the assessment.

It is useful to view achievement as a continuum. The assessment system may recognize a number of different levels of achievement, in which case we then think of it as representing a ladder or scale. In other contexts, only one point on the continuum is of relevance, and a simple 'enough/not enough' distinction is all that needs to be made. In this case the testing system can best be thought of in terms of a hurdle or cut-point. These two possibilities are not of course contradictory, but are a little like different settings on a camera or microscope. We can stand back and look at the whole continuum, or we can zoom in on one part of it. Each level of the ladder may be thought of as requiring a 'yes/no' decision ('enough/not enough') for that level.

We can illustrate the distinction between the hurdle and ladder perspectives by reference to two very different kinds of performance. Consider the driving test. Most people, given adequate preparation, would assume they could pass it. Although not everybody who passes the test has equal competence as a driver, the function of the test is to make a simple distinction between those who are safe on the roads and those who are not, rather than to distinguish degrees of competence in driving skill. Often, in hurdle assessment, as in the driving test, the assessment system is not intended to permanently exclude.
In other words, every competent person should pass, and it is assumed that most people with adequate preparation will be capable of a competent performance, and derive the benefits of certification accordingly. The aim of the certification is to protect other people from incompetence. The assessment is essentially non-competitive.

Many systems of assessment try to combine the characteristics of access and competition. For example, in the system of certification for competence in piano playing, a number of grades of performance are established, with relevant criteria defining each, and over a number of years a learner of the piano may proceed through the examinations for the grades. As the levels become more demanding, fewer people have the necessary motivation or opportunity for performance at such a level, or indeed even the necessary skill. The final stages of certification involve fiercely contested piano competitions where only the most brilliant will succeed, so resembling the Olympic context. But at levels below this, the 'grade' system of certification involves a principle of access: at each step of competence, judged in a 'yes/no' manner ('competent at this level' vs. 'not competent'), those with adequate preparation are likely to pass. The function of the assessment at a given level is not to make distinctions between candidates, other than a binary distinction between those who meet the requirements of the level and those who do not.

Language testing has examples of each of these kinds of framework for making judgements about individuals. In judgements of competence to perform particular kinds of occupational roles, for example to work as a
medical practitioner through the medium of a second language, where the communicative demands of the work or study setting to which access is sought are high, then the form of the judgement will be 'ready' or 'not ready', as in the driving test. Even though the amount of preparation is much greater, and what is demanded is much higher, we nevertheless expect each of the medical professionals who present for such a test to succeed in the end. Its function is not usually to exclude permanently those who need to demonstrate competence in the language in order to practise their profession, although tests may of course be used as instruments of such exclusion, as we shall see later, in Chapter 7.

In contrast, in contexts where only a small percentage of candidates can be selected, for example in the awarding of competitive prizes or scholarships, then the higher levels of achievement will become important as they are used to distinguish the most able of candidates from the rest. This is the case in contexts of achievement, for example, in school-based language learning, or in vocational and workplace training.

Rating scales

Most often, frameworks for rating are designed as scales, as this allows the greatest flexibility to the users, who may want to use the multiple distinctions available from a scale, or who may choose to focus on only one cut-point or region of the scale. The preparation of such a scale involves developing level descriptors, that is, describing in words performances that illustrate each level of competence defined on the scale. For example, in the driving test, performance at a passing level might be described as 'Can drive in normal traffic conditions for 20 minutes, making a range of normal movements and dealing with a range of typical events; and can cope with a limited number of frequently encountered suddenly emerging situations on the road.' This description will necessarily be abstracted from the experience of those familiar with the setting and its demands, in this case experienced driving instructors, and will have to be vetted
by the relevant authority entrusted with, in this case, issuing licences to drive based on the test performance.

An ordered series of such descriptions is known as a rating scale. A number of distinctions are usually made: rating scales typically have between 3 and [...] levels. Figure 4.1 gives an example of a summary rating scale developed by the author to describe levels of performance on an advanced level test of English as a second language for speaking skills in clinical settings.

Aspect of performance considered: overall communicative effectiveness
1 elementary level of communicative effectiveness
2 clearly could not cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
3 just below minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
4 has minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
5 could easily cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
6 near native communicative effectiveness

FIGURE 4.1 Rating scale, Occupational English Test for health professionals

This rating scale is used as part of a screening procedure (used to determine if an overseas-trained health professional has the necessary minimum language skills to be admitted under supervision to the clinical setting).
In this particular case, as the focus of the discriminations made in the scale is around a single point of minimum competence, the other levels tend to be defined in terms of their distance from this point. Most rating scales do not have such a single point of reference, and ideally the definition of each level should be independent of the ones above and below it on the scale. In fact, however, given the continuous nature of the scale, wordings frequently involve comparative statements, with one level described relative to one or more others, for example in terms of greater or less control of features of the grammatical system, or pronunciation, and so on.

An important aspect of a scale is the way in which performance at the top end of the scale is defined. There is frequently an unacknowledged problem here. Rating scales often make reference to what are assumed to be the typical performances of native speakers or expert users of the language at the top end of the scale. That is, it is assumed that the performance of native speakers will be fundamentally unlike the performances of non-native speakers, who will tend gradually to approximate native speaker performance as their own proficiency increases. However, claims about the uniformly superior performance of these idealized native speakers have rarely been supported empirically. In fact, the studies that have been carried out typically show the performance of native speakers as highly variable, related to educational level, and covering a range of positions on the scale. In spite of this, the idealized view of native speaker performance still hovers inappropriately at the top of many rating scales.

The number of levels on a rating scale is also an important matter to consider, although the questions raised here are more a matter of practical utility than of theoretical validity. There is no point in proliferating descriptions outside the range of ability of interest.
Having too few distinctions within the range of such ability is also frustrating, and the revision of rating scales often involves the creation of more distinctions.

The failure of rating scales to make distinctions sufficiently fine to capture progress being made by students is a frequent problem. It arises because the purposes of users of a single assessment instrument may be at odds. Teachers have continuous exposure to their students' achievements in the normal course of learning. In the process, they receive ongoing informal confirmation of learner progress which may not be adequately reflected in a category difference as described by a scale. Imagine handing parents who are seeking evidence of their child's growth a measuring stick with marks on it only a foot (30 centimetres) apart, the measure not allowing any other distinction to be made. The parents can observe the growth of the child; they have independent evidence in the comments of relatives, or the fact that the child has grown out of a set of clothes. Yet in terms of the measuring stick no growth can be recorded because the child has not passed the cut-point into the next adjacent category of measurement. Teachers restricted to reporting achievement only in terms of broad rating scale categories are in a similar position. Most rating scales used in public educational settings are imposed by government authorities for purposes of administrative efficiency and financial accountability, for which fine-grained distinctions are unnecessary. The scales are used to report the achievements of the educational system in terms of changes in the proficiency of large numbers of learners over relatively extended periods of time.
The government needs the 'big picture' of learner (and teacher) achievement in order to satisfy itself that its educational budget is yielding results. Teachers working with these government-imposed, scale-based reporting mechanisms experience frustrations with the lack of fine distinctions on the scale. The coarse-grained character of the categories may hardly do justice to the teachers' sense of the growth and learning that has been achieved in a course. The purposes of the two groups (administrators, who are interested in financial accountability, and teachers, who are interested in the learning process) may be at odds in such a case.

The wording of rating scales may vary according to the purposes for which they are to be used. On the one hand, scales are used to guide and constrain the behaviour of raters, and on the other, they are used to report the outcome of a rating process to score users (teachers, employers, admission authorities, parents, and so on). As a result, different versions of a rating scale are often created for different users.

Holistic and analytic ratings

Performances are complex. Judgement of performances involves balancing perceptions of a number of different features of the performance.
In speaking, a person may be fluent, but hard to understand; another may be correct, but stilted. Thus rather than getting raters to record a single impression of the impact of the performance as a whole (holistic rating), an alternative approach involves getting raters to provide separate assessments for each of a number of aspects of performance. For example, in speaking, raters may be asked to provide separate assessments of fluency, appropriateness, pronunciation, control of formal resources of grammar and vocabulary, and the like. This latter approach is known as analytic rating, and requires the development of a number of separate rating scales for each aspect assessed. Even where analytic rating is carried out, it is usual to combine the scores for the separate aspects into a single overall score for reporting purposes. This single reporting scale may maintain its analytic orientation in that the overall characterization of a level description may consist of a weaving together of strands relating to separate aspects of performance.

Rater training

An important way to improve the quality of rater-mediated assessment schemes is to provide initial and ongoing training to raters. This usually takes the form of a moderation meeting. At such a meeting, individual raters are each initially asked to provide independent ratings for a series of performances at different levels. They are then confronted with the differences between the ratings they have given and those given by the other raters in the group. Discrepancies are noted and are discussed in detail, with particular attention being paid to the way in which the level descriptors are being interpreted by individual raters. Moderation meetings have the function of bringing about broad agreement on the relevant interpretation of level descriptors and rating categories.

Even where agreement is reached on the meaning of terms, there remain differences between raters.
This may be in terms of relative severity, or a consistent tendency to see a particular performance as narrowly demonstrating or narrowly failing to demonstrate achievement at a particular performance level. The more extreme cases of rater harshness or leniency will emerge in rater training. Usually, the psychological pressure of embarrassment over having given ratings out of line with those of others is sufficient to get raters to reduce their differences. After an initial moderation meeting, raters are typically given a further set of training performances to rate, and are accredited as raters if these ratings show adequate conformity with agreed ratings for the performances in question. Ongoing monitoring of rater performance is clearly necessary to ensure fairness in the testing process.

Conclusion

In this chapter we have accepted the desirability of rater-mediated assessment, and looked at issues in the design of rating procedures. We have looked at the construction and use of rating scales to guide rater behaviour, and noted the enormous potential for variability and hence unfairness in the rating process, associated for example with task and rater factors. Rater-mediated assessment is complex and in a way ambitious in its goals, and requires a sophisticated understanding of the ways in which it can be unfair to candidates and ways unfairness can be avoided.
