Testing for Language Teachers

CAMBRIDGE HANDBOOKS FOR LANGUAGE TEACHERS
General Editor: Michael Swan

This is a series of practical guides for teachers of English and other languages. Illustrative examples are usually drawn from the field of English as a foreign or second language, but the ideas and techniques described can equally well be used in the teaching of any language.

In this series:

Drama Techniques in Language Learning – A resource book of communication activities for language teachers by Alan Maley and Alan Duff
Games for Language Learning by Andrew Wright, David Betteridge and Michael Buckby
Discussions that Work – Task-centred fluency practice by Penny Ur
Once Upon a Time – Using stories in the language classroom by John Morgan and Mario Rinvolucri
Teaching Listening Comprehension by Penny Ur
Keep Talking – Communicative fluency activities for language teaching by Friederike Klippel
Working with Words – A guide to teaching and learning vocabulary by Ruth Gairns and Stuart Redman
Learner English – A teacher's guide to interference and other problems edited by Michael Swan and Bernard Smith
Testing Spoken Language – A handbook of oral testing techniques by Nic Underhill
Literature in the Language Classroom – A resource book of ideas and activities by Joanne Collie and Stephen Slater
Dictation – New methods, new possibilities by Paul Davis and Mario Rinvolucri
Grammar Practice Activities – A practical guide for teachers by Penny Ur
Testing for Language Teachers by Arthur Hughes
The Inward Ear – Poetry in the language classroom by Alan Maley and Alan Duff
Pictures for Language Learning by Andrew Wright
Five-Minute Activities – A resource book of short activities by Penny Ur and Andrew Wright

Testing for Language Teachers
Arthur Hughes

CAMBRIDGE UNIVERSITY PRESS

Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

© Cambridge University Press 1989

First published 1989
Eighth printing 1996

Printed in Great Britain by Bell & Bain Ltd, Glasgow

Library of Congress cataloguing in publication data
Hughes, Arthur, 1941–
Testing for language teachers / Arthur Hughes.
p. cm. – (Cambridge handbooks for language teachers)
Bibliography: p.
Includes index.
ISBN 0 521 25264 6. ISBN 0 521 27260 2 (pbk.)
1. Language and languages – Ability testing. I. Title. II. Series.

British Library cataloguing in publication data
Hughes, Arthur, 1941–
Testing for language teachers. – (Cambridge handbooks for language teachers)
1. Great Britain. Educational institutions. Students. Foreign language skills. Assessment. Tests
I. Title
ISBN 0 521 25264 6 hard covers
ISBN 0 521 27260 2 paperback

Copyright
The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.

For Vicky, Meg, and Jake

Contents

Acknowledgements viii
Preface ix
1 Teaching and testing
2 Testing as problem solving: an overview of the book
3 Kinds of test and testing 9
4 Validity 22
5 Reliability 29
6 Achieving beneficial backwash 44
7 Stages of test construction 48
8 Test techniques and testing overall ability 59
9 Testing writing 75
10 Testing oral ability 102
11 Testing reading 116
12 Testing listening 134
13 Testing grammar and vocabulary 144
14 Test administration
Appendix 1 Statistical analysis of test results
Appendix 2 163
Bibliography 166
Index 170

Acknowledgements

The author and publishers would like to thank the following for permission to reproduce copyright material:

American Council on the Teaching of Foreign Languages Inc. for extracts from ACTFL Provisional Proficiency Guidelines and Generic Guidelines (1986); ARELS Examination Trust for extracts from examinations; A. Hughes for extracts from the New Boğaziçi University Language Proficiency Test (1984); The British Council for the draft of the scale for the ELTS test; Cambridge University Press for M. Swan and C. Walter: Cambridge English Course 3, p. 16 (1988); Educational Testing Service for the graph on p. 84; Filmscan Lingual House for M. Garman and A. Hughes: English Cloze Exercises, Chapter 4 (1983); The Foreign Service Institute for the Testing Kit, pp. 33–8 (1979); Harper & Row, Publishers, Inc. for the graph on p. 164, from Basic Statistical Methods by N. M. Downie and Robert W. Heath, copyright © 1985 by N. M. Downie and Robert W. Heath, reprinted with permission of Harper & Row, Publishers, Inc.; The Independent for N. Timmins: 'Passive smoking comes under fire', 14 March 1987; Joint Matriculation Board for extracts from March 1980 and June 1980 Tests in English (Overseas); Language Learning and J. W. Oller Jr. and C. A. Conrad for the extract from 'The cloze technique and ESL proficiency', Language Learning 21: 183–94; Longman UK Ltd for D. Byrne: Progressive Picture Compositions, p. 30 (1967); Macmillan London Ltd for Colin Dexter: The Secret of Annexe 3 (1986); The Observer for S. Limb: 'One-sided …', 9 October 1983; The Royal Society of Arts Examinations Board/University of Cambridge Local Examinations Syndicate for extracts from the specifications for the examination in The Communicative Use of English as a Foreign Language, and from the 1984 and 1985 summer examinations; The University of Cambridge Local Examinations Syndicate for the extract from Testpack 1, paper 3; The University of Oxford Delegacy of Local Examinations for extracts from the Oxford Examination in English as a Foreign Language.

Preface

The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order to arrive at the best solution for any particular situation – the most appropriate test or testing system – it is not enough to have at one's disposal a collection of test techniques from which to choose. It is also necessary to understand the principles of testing and how they can be applied in practice.

It is relatively straightforward to introduce and explain the desirable qualities of tests: validity, reliability, practicality, and beneficial backwash (this last, which refers to the favourable effects tests can have on teaching and learning, here receiving more attention than is usual in books of this kind). It is much less easy to give realistic advice on how to achieve them in teacher-made tests. One is tempted either to ignore the problem or to present as a model the not always appropriate methods of large-scale testing institutions.
In resisting these temptations I have been compelled to make explicit in my own mind much that had previously been vague and intuitive. I have certainly benefited from doing this; I hope that readers will too.

Exemplification throughout the book is from the testing of English as a foreign language. This reflects both my own experience in language testing and the fact that English will be the one language known by all readers. I trust that it will not prove too difficult for teachers of other languages to find or construct parallel examples of their own.

I must acknowledge the contributions of others: MA students at Reading University, too numerous to mention by name, who have taught me much, usually by asking questions that I could not answer; my friends and colleagues, Paul Fletcher, Michael Garman, Don Porter, and Tony Woods, who all read parts of the manuscript and made many helpful suggestions; Barbara Barnes, who typed a first version of the early chapters; Michael Swan, who gave good advice and much encouragement, and who remained remarkably patient as each deadline for completion passed; and finally my family, who accepted the writing of the book as an excuse more often than they should. To all of them I am very grateful.

1 Teaching and testing

The starting point for this book is the admission that many language teachers harbour a deep mistrust of tests, a mistrust that is frequently well-founded. It cannot be denied that a great deal of language testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they are intended to measure.

Backwash

The effect of testing on teaching and learning is known as backwash. Backwash can be harmful or beneficial. If a test is regarded as important, then preparation for it can come to dominate all teaching and learning activities. And if the test content and testing techniques are at variance with the objectives of the course, then there is likely to be harmful backwash. An instance of this would be where students are following an English course which is meant to train them in the language skills (including writing) necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple choice items, then there is great pressure to practise such items rather than practise the skill of writing itself. This is clearly undesirable.

We have just looked at a case of harmful backwash. However, backwash need not always be harmful; indeed it can be positively beneficial. I was once involved in the development of an English language test for an English-medium university in a non-English-speaking country. The test was to be administered at the end of an intensive year of English study there and would be used to determine which students would be allowed to go on to their undergraduate courses (taught in English) and which would have to leave the university.
A test was devised which was based directly on an analysis of the English language needs of first year undergraduate students, and which included tasks as similar as possible to those which they would have to perform as undergraduates (reading textbook materials, taking notes during lectures, and so on). The introduction of this test, in place of one which had been used for some years, had an immediate effect on teaching: the syllabus was redesigned, new books were chosen, classes were conducted differently. The result of these changes was that by the end of their year's training, in circumstances made particularly difficult by greatly increased numbers and limited resources, the students reached a much higher standard in English than had ever been achieved in the university's history. This was a case of beneficial backwash.

Davies (1968:5) has said that 'the good test is an obedient servant since it follows and apes the teaching'. I find it difficult to agree. The proper relationship between teaching and testing is surely that of partnership. It is true that there may be occasions when the teaching is good and appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.

Inaccurate tests

The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Teachers know this. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.

Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters, but a short answer is possible here. There are two main sources of inaccuracy. The first of these concerns test content and techniques. To return to an earlier example, if we want to know how well someone can write, there is absolutely no way we can get a really accurate measure of their ability by means of a multiple choice test. Professional testers have expended great effort, and not a little money, in attempts to do it, but they have always failed. We may be able to get an approximate measure, but that is all. When testing is carried out on a very large scale, when the scoring of tens of thousands of compositions might not seem to be a practicable proposition, it is understandable that potentially greater accuracy is sacrificed for reasons of economy and convenience. But it does not give testing a good name! And it does set a bad example.
While few teachers would wish to follow that particular example in order to test writing ability, the overwhelming practice in large-scale testing of using multiple choice items does lead to imitation in circumstances where such items are not at all appropriate. What is more, the imitation tends to be of a very poor standard. Good multiple choice items are notoriously difficult to write. A great deal of time and effort has to go into their construction. Too many multiple choice tests are written where such care and attention is not given (and indeed may not be possible). The result is a set of poor items that cannot possibly provide accurate measurements. One of the principal aims of this book is to discourage the use of inappropriate techniques and to show that teacher-made tests can be superior in certain respects to their professional counterparts.

The second source of inaccuracy is lack of reliability. Reliability is a technical term which is explained in Chapter 5. For the moment it is enough to say that a test is reliable if it measures consistently. On a reliable test you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next; whereas on an unreliable test the score is quite likely to be considerably different, depending on the day on which it is taken.

Unreliability has two origins: features of the test itself, and the way it is scored. In the first case, something about the test creates a tendency for individuals to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.

In the second case, equivalent test performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring. Most (but not all) large testing organisations, to their credit, take every precaution to make their tests, and the scoring of them, as reliable as possible, and are generally highly successful in this respect. Small-scale testing, on the other hand, tends to be less reliable than it should be. Another aim of this book, then, is to show how to achieve greater reliability in testing. Advice on this is to be found in Chapter 5.
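Scorer consistency of this kind can be checked quite simply. The sketch below is not from the book: it is a minimal illustration, with invented marks, of how two markers' scores for the same set of compositions might be compared by computing the correlation between them, a statistic of the kind that lies behind the reliability figures discussed in Chapter 5 and Appendix 1.

```python
# Minimal illustrative sketch (not from the book): comparing two markers'
# scores for the same eight compositions. The marks are invented.

def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

marker_a = [14, 11, 17, 9, 15, 12, 18, 10]   # marks out of 20
marker_b = [13, 12, 16, 8, 16, 10, 18, 11]   # same scripts, second marker

print(f"inter-marker correlation: {pearson(marker_a, marker_b):.2f}")
# a value close to 1.0 suggests consistent scoring; a low value signals
# exactly the kind of scorer unreliability described above
```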
The need for tests

So far this chapter has been concerned to understand why tests are so mistrusted by many language teachers. We have seen that this mistrust is often justified. One conclusion drawn from this might be that we would be better off without language tests. Teaching is, after all, the primary activity; if testing comes in conflict with it, then it is testing which should go, especially when it has been admitted that so much testing provides inaccurate information. This is a plausible argument, but there are other considerations, which might lead to a different conclusion.

Information about people's language ability is often very useful and sometimes necessary. It is difficult to imagine, for example, British and American universities accepting students from overseas without some knowledge of their proficiency in English. The same is true for organisations hiring interpreters or translators. They certainly need dependable measures of language ability.

Within teaching systems, too, as long as it is thought appropriate for individuals to be given a statement of what they have achieved in a second or foreign language, then tests of some kind or other will be needed.¹ They will also be needed in order to provide information about the achievement of groups of learners, without which it is difficult to see how rational educational decisions can be made. While for some purposes teachers' assessments of their own students are both appropriate and sufficient, this is not true for the cases just mentioned. Even without considering the possibility of bias, we have to recognise the need for a common yardstick, which tests provide, in order to make meaningful comparisons.

If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view, the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.

1. It will become clear that in this book the word 'test' is interpreted widely. It is used to refer to any structured attempt to measure language ability. No distinction is made between 'examination' and 'test'.

What is to be done?

Members of the language teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure on others, including professional testers, to improve their tests. This book represents an attempt to help them do both. For the reader who doubts that teachers can influence the large testing institutions, let this chapter end with an encouraging example concerning the testing of writing through multiple choice items. Those responsible for TOEFL, the test taken by most non-native speakers of English applying to North American universities, maintained over a period of many years that it was simply not possible to test the writing ability of hundreds of thousands of candidates by means of a composition: it was not practicable, and the results, anyhow, would be unreliable. Yet in 1986 a writing test (Test of Written English), in which candidates actually have to write for thirty minutes, was introduced as a supplement to TOEFL, and already many colleges in the United States are requiring applicants to take this test in addition to TOEFL. The principal reason given for this change was pressure from English language teachers who had finally convinced those responsible for the TOEFL of the overriding need for a writing task which would provide beneficial backwash.

READER ACTIVITIES

1. Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2. Consider these tests again. Do you think that they give accurate or inaccurate information?
What are your reasons for coming to these conclusions?

Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).

2 Testing as problem solving: an overview of the book

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them become successful solvers of testing problems.

Language testers are sometimes asked to say what is 'the best test' or 'the best testing technique'. Such questions reveal a misunderstanding of what is involved in the practice of language testing. In fact there is no best test or best technique. A test which proves ideal for one purpose may be quite useless for another; a technique which may work very well in one situation can be entirely inappropriate in another. As we saw in the previous chapter, what suits large testing corporations may be quite out of place in the tests of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and so sets a particular testing problem. It is the tester's job to provide the best solution to that problem. The aims of this book are to equip readers with the basic knowledge and techniques first to solve such problems, secondly to evaluate the solutions proposed or already implemented by others, and thirdly to argue persuasively for improvements in testing practice where these seem necessary.

In every situation the first step must be to state the testing problem as clearly as possible. Without a clear statement of the problem it is hard to arrive at the right solution. Every testing problem can be expressed in the same general terms: we want to create a test or testing system which will:

– consistently provide accurate measures of precisely the abilities¹ in which we are interested;
– have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching);
– be economical in terms of time and money.

1. 'Abilities' is not being used here in any technical sense. It refers simply to what people can do in, or with, a language. It could, for example, include the ability to converse fluently in a language, as well as the ability to recite grammatical rules (if that is something which we are interested in measuring!). It does not, however, refer to language aptitude, the talent which people have, in differing degrees, for learning languages. The measurement of this talent in order to predict how well or how quickly individuals will learn a foreign language is beyond the scope of this book. The interested reader is referred to Pimsleur (1968), Carroll (1981), and Skehan (1986).

Let us describe the general testing problem in a little more detail. The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of tests. This may seem obvious but it is something which seems not always to be recognised.
The purposes of testing discussed in this book are:

– to measure language proficiency regardless of any language courses that candidates may have followed;
– to discover how far students have achieved the objectives of a course of study;
– to diagnose students' strengths and weaknesses, to identify what they know and what they do not know;
– to assist placement of students by identifying the stage or part of a teaching programme most appropriate to their ability.

All of these purposes are discussed in the next chapter. That chapter also introduces different kinds of testing and test techniques: direct as opposed to indirect testing; discrete-point versus integrative testing; criterion-referenced testing as against norm-referenced testing; objective and subjective testing.

In stating the testing problem in general terms above, we spoke of providing consistent measures of precisely the abilities we are interested in. A test which does this is said to be 'valid'. Chapter 4 addresses itself to various kinds of validity. It provides advice on the achievement of validity in test construction and shows how validity is measured.

The word 'consistently' was used in the statement of the testing problem. The consistency with which accurate measurements are made is in fact an essential ingredient of validity. If a test measures consistently (if, for example, a person's score on the test is likely to be very similar regardless of whether they happen to take it on, say, Monday morning rather than on Tuesday afternoon, assuming that there has been no significant change in their ability), it is said to be reliable. Reliability, already referred to in the previous chapter, is an absolutely essential quality of tests – what use is a test if it will give widely differing estimates of an individual's (unchanged) ability? – yet it is something which is distinctly lacking in very many teacher-made tests. Chapter 5 gives advice on how to achieve reliability and explains the ways in which it is measured.

The concept of backwash effect was introduced in the previous chapter. Chapter 6 identifies a number of conditions for tests to meet in order to achieve beneficial backwash.

All tests cost time and money – to prepare, administer, score and interpret. Time and money are in limited supply, and so there is often likely to be a conflict between what appears to be a perfect testing solution in a particular situation and considerations of practicality. This issue is also discussed in Chapter 6.

To rephrase the general testing problem identified above: the basic problem is to develop tests which are valid and reliable, which have a beneficial backwash effect on teaching (where this is relevant), and which are practical. The next four chapters of the book are intended to look more closely at the relevant concepts and so help the reader to formulate such problems clearly in particular instances, and to provide advice on how to approach their solution. The second half of the book is devoted to more detailed advice on the construction and use of tests, the putting into practice of the principles outlined in earlier chapters.
Chapter 7 outlines and exemplifies the various stages of test construction. Chapter 8 discusses a number of testing techniques. Chapters 9–13 show how a variety of language abilities can best be tested, particularly within teaching institutions. Chapter 14 gives straightforward advice on the administration of tests.

Finally, we have to say something about statistics. Some understanding of statistics is useful, indeed necessary, for a proper appreciation of testing matters and for successful problem solving. At the same time, we have to recognise that there is a limit to what many readers will be prepared to do, especially if they are at all afraid of mathematics. For this reason, statistical matters are kept to a minimum and are presented in terms that everyone should be able to grasp. The emphasis will be on interpretation rather than on calculation. For the more adventurous reader, however, Appendix 1 explains how to carry out a number of statistical operations.

Further reading

The collection of critical reviews of nearly 50 English language tests (mostly British and American), edited by Alderson, Krahnke and Stansfield (1987), reveals how well professional test writers are thought to have solved their problems. A full understanding of the reviews will depend to some degree on an assimilation of the content of Chapters 3, 4, and 5 of this book.

3 Kinds of test and testing

This chapter begins by considering the purposes for which language testing is carried out. It goes on to make a number of distinctions: between direct and indirect testing, between discrete-point and integrative testing, between norm-referenced and criterion-referenced testing, and between objective and subjective testing. Finally, something is said about communicative language testing.

We use tests to obtain information. The information that we hope to obtain will of course vary from situation to situation. It is possible, nevertheless, to categorise tests according to a small number of kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

Proficiency tests are designed to measure people's ability in a language regardless of any training they may have had in that language. The content of a proficiency test, therefore, is not based on the content or objectives of language courses which people taking the test may have followed. Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient. This raises the question of what we mean by the word 'proficient'. In the case of some proficiency tests, 'proficient' means having sufficient command of the language for a particular purpose. An example of this would be a test designed to discover whether someone can function successfully as a United Nations translator. Another example would be a test used to determine whether a student's English is good enough to follow a course of study at a British university. Such a test may even attempt to take into account the level and kind of English needed to follow courses in particular subject areas. It might, for example, have one form of the test for arts subjects, another for sciences, and so on. Whatever the particular purpose to which the language is to be put, this will be reflected in the specification of test content at an early stage of a test's development.

There are other proficiency tests which, by contrast, do not have any occupation or course of study in mind. For them the concept of proficiency is more general.
British examples of these would be the Cambridge examinations (First Certificate Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary and Higher). The function of these tests is to show whether candidates have reached a certain standard with respect to certain specified abilities. Such examining bodies are independent of the teaching institutions and so can be relied on by potential employers etc. to make fair comparisons between candidates from different institutions and different countries. Though there is no particular purpose in mind for the language, these general proficiency tests should have detailed specifications saying just what it is that successful candidates will have demonstrated that they can do. Each test should be seen to be based directly on these specifications. All users of a test (teachers, students, employers, etc.) can then judge whether the test is suitable for them, and can interpret test results. It is not enough to have some vague notion of proficiency, however prestigious the testing body concerned.

Despite differences between them of content and level of difficulty, all proficiency tests have in common the fact that they are not based on courses that candidates may have previously taken. On the other hand, as we saw in Chapter 1, such tests may themselves exercise considerable influence over the method and content of language courses. Their backwash effect may often be more harmful than beneficial. However, the teachers of students who take such tests, and whose work suffers from a harmful backwash effect, may be able to exercise more influence than they realise. The recent addition to TOEFL, referred to in Chapter 1, is a case in point.

Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. They are of two kinds: final achievement tests and progress achievement tests.

Final achievement tests are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or by members of teaching institutions. Clearly the content of these tests must be related to the courses with which they are concerned, but the nature of this relationship is a matter of disagreement amongst language testers.

In the view of some testers, the content of a final achievement test should be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the 'syllabus-content approach'. It has an obvious appeal, since the test only contains what it is thought that the students have actually encountered, and thus can be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, then the results of a test can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives. For example, a course may have as an objective the development of conversational ability, but the course itself and the test may require students only to utter carefully prepared statements about their home town, the weather, or whatever.
Another course may aim to develop a reading ability in German, but the test may limit itself to the vocabulary the students are known to have met. Yet another course is intended to prepare students for university study in English, but the syllabus (and so the course and the test) may not include listening (with note taking) to English delivered in lecture style on topics of the kind that the students will have to deal with at university. In each of these examples – all of them based on actual cases – test results will fail to show what students have achieved in terms of course objectives.

The alternative approach is to base the test content directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives. Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is my belief that to base test content on course objectives is much to be preferred: it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.¹

1. Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.

Now it might be argued that to base test content on objectives rather than on course content is unfair to students. If the course content does not fit well with objectives, they will be expected to do things for which they have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.

The reader may wonder at this stage whether there is any real difference between final achievement tests and proficiency tests. If a test is based on the objectives of a course, and these are equivalent to the language needs on which a proficiency test is based, then there is no reason to expect a difference between the form and content of the two tests. Two things have to be remembered, however. First, objectives and needs will not typically coincide in this way. Secondly, many achievement tests are not in fact based on course objectives. These facts have implications both for the users of test results and for test writers. Test users have to know on what basis an achievement test has been constructed, and be aware of the possibly limited validity and applicability of test scores.
Test writers, on the other hand, must create achievement tests which reflect the objectives of a particular course, and not expect a general proficiency test (or some imitation of it) to provide a satisfactory alternative.

Progress achievement tests, as their name suggests, are intended to measure the progress that students are making. Since 'progress' is towards the achievement of course objectives, these tests too should relate to objectives. But how? One way of measuring progress would be repeatedly to administer final achievement tests, the (hopefully) increasing scores indicating the progress made. This is not really feasible, particularly in the early stages of a course. The low scores obtained would be discouraging to students and quite possibly to their teachers. The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives. Then if the syllabus and teaching are appropriate to these objectives, progress tests based on short-term objectives will fit well with what has been taught. If not, there will be pressure to create a better fit. If it is the syllabus that is at fault, it is the tester's responsibility to make clear that it is there that change is needed, not in the test.

In addition to more formal achievement tests which require careful preparation, teachers should feel free to set their own 'pep quizzes'. These serve both to make a rough check on students' progress and to keep students on their toes. Since such tests will not form part of formal assessment procedures, their construction need not be too rigorous. Nevertheless, they should be seen as measuring progress towards the intermediate objectives on which the more formal progress achievement tests are based. They can, however, reflect the particular 'route' that an individual teacher is taking towards the objectives.

It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, political manoeuvring may be called for; these are topics on which this book cannot pretend to give advice.

Diagnostic tests

Diagnostic tests are used to identify students' strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary. At the level of broad language skills this is reasonably straightforward. We can be fairly confident of our ability to create tests that will tell us that a student is particularly weak in, say, speaking as opposed to reading in a language. Indeed existing proficiency tests may often prove adequate for this purpose.

We may be able to go further, analysing samples of a student's performance in writing or speaking in order to create profiles of the student's ability with respect to such categories as 'grammatical accuracy' or 'linguistic appropriacy'. (See Chapter 9 for a scoring system that may provide such an analysis.)

But it is not so easy to obtain a detailed analysis of a student's command of grammar, one which would tell us, for example, whether she or he had mastered the present perfect/past tense distinction in English.
In order to be sure of this, we would need a number of examples of the choice the student made between the two structures in every different context which we thought was significantly different and important enough to warrant obtaining information on. A single example of each would not be enough, since a student might give the correct response by chance. As a result, a comprehensive diagnostic test of English grammar would be vast (think of what would be involved in testing the modal verbs, for instance). The size of such a test would make it impractical to administer in a routine fashion. For this reason, very few tests are constructed for purely diagnostic purposes, and those that there are do not provide very detailed information.

The lack of good diagnostic tests is unfortunate. They could be extremely useful for individualised instruction or self-instruction. Learners would be shown where gaps exist in their command of the language, and could be directed to sources of information, exemplification and practice. Happily, the ready availability of relatively inexpensive computers with very large memories may change the situation. Well-written computer programs would ensure that the learner spent no more time than was absolutely necessary to obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.
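The book describes no particular program, but the point about chance success suggests what the core of one might look like. In the hypothetical sketch below, a structure is only counted as mastered after several correct responses, and no more items are presented than are needed to reach a decision; ask_item is an assumed function (not from the book) that presents one item and reports whether the response was correct.

```python
# Hypothetical sketch only: a decision rule a diagnostic program might use
# so that a single lucky answer never counts as mastery of a structure.

def diagnose(structure, ask_item, max_items=5, needed=3):
    """Present items on one structure until mastery can be judged either way."""
    correct = wrong = 0
    while correct < needed and wrong <= max_items - needed:
        if ask_item(structure):   # assumed: presents an item, True if correct
            correct += 1
        else:
            wrong += 1
    return "mastered" if correct >= needed else "not mastered"

# Canned responses standing in for a learner:
responses = iter([True, False, True, True])
print(diagnose("present perfect vs. past simple", lambda s: next(responses)))
# -> 'mastered' after four items; no fifth item is presented
```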
Neverthe- less the effort is made to make them as realistic as possible. Direct testing is easier to carry out when it is intended ie measure the ptodactive skills of speaking and writing, The vvey acts of speaking and writing provide us wich information about the candidate's ability, With listening and reading, however, it is necessary to get candidates not only to listen or read but also to demonstrate that they have done this successfully, The tescer has to devise methods of cliciting such evidence accurately and without the method interfering with the performance of the skills in which he or she is interested. Appropriate methods for achieving this are discussed in Chapters 11 and 12, Interestingly enough, in many texts on language testing itis the testing of productive skills that is preseated as being most problematic, for reasons usually connected with reliabilizy. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10, Direct vesting has a number of attractions. Fust, provided that we are clear about just what abilities we want to assess, it is relatively straight forward to create the conditions which will elicit the behaviour en which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students’ performance is also quite straightforward. Thirdly, since practice for the test involves prac- tice of the skills thar we wish to foscer, there is likely to be a helpful backwash effect. Indirect vesting atrempts to measure the abilities which underife the skills in which weare interested, One section of the TOEPL, for example, was developed as an indirect measure of writing ability. fe contains items of the folowing kind: ALfitsr the old woenan seemed unveilling (0. accent anything that was offered her by any friend and L where the candidate has to identity which of the underlined elements is erroneous or inappropriate in formal standard English, While the ability to respond ro such teats has been shown to be related statistically to the ability ro write compositions (chough the strength of the relationship was Rot particularly great), it is clearly not the same thing. Another example of indirect testing is Lado’s (L961) proposed method of testing pronunci- ation ability by a paper and pencil tese in which the candidate has to identify pairs of words which chyme with each other. a5 Kinds of text and testing Perhags che main appeal of indirect cesting is chat it seers 10 offer th possibiliry of resting a representative sample of a Bnire number of abilities whied underlie a potenially indefinitely large munber of manifestations af tiem. If for example, we take a representauive sample of grammatical structores, then, it may be argued, we have taken a sample which is relevance for af the situations in which control of grammar is necessary. By contrast, direct testing is inevivably fimired to a rather small sample of tusks, which may call ona restciceed and possibly unrepresentative range oi grammatical sre es, On this argument, indirect testing ig superta! 
to direct testing in that its results are more generalisable.

The main problem with indirect tests is that the relationship between performance on them and performance of the skills in which we are usually more interested tends to be rather weak in strength and uncertain in nature. We do not yet know enough about the component parts of, say, composition writing to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what we will. But we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).

It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely (for example require at least two compositions, each calling for a different kind of writing and on a different topic), we can expect more accurate estimates of the abilities that really concern us than would be obtained through indirect testing. The fact that direct tests are generally easier to construct simply reinforces this view with respect to institutional tests, as does their greater potential for beneficial backwash. It is only fair to say, however, that many testers are reluctant to commit themselves entirely to direct testing and will always include an indirect element in their tests. Of course, to obtain diagnostic information on underlying abilities, such as control of particular grammatical structures, indirect testing is called for.

Discrete point versus integrative testing

Discrete point testing refers to the testing of one element at a time, item by item. This might involve, for example, a series of items each testing a particular grammatical structure. Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of a task. This might involve writing a composition, making notes while listening to a lecture, taking a dictation, or completing a cloze passage. Clearly this distinction is not unrelated to that between indirect and direct testing. Discrete point tests will almost always be indirect, while integrative tests will tend to be direct. However, some integrative testing methods, such as the cloze procedure, are indirect.
bus in general insuificies experienced with the language to deaw inferences direetly Fema the linguistic aspects of the text. Can tocate and understand the main idess sad devails in maretiads written for the general reader... The individual can read uncomplicated, bur authentic prove cin fumiliar subjects thar are socmally presented ina puedicrable sequenwe whieh aids the render in understanding. Texts may include descriptions aud nacrarons a contexts such ay news fern eqaenel simple biographical informa tty Generally the peose that can be cend by the individual is predominantly in straightforwarcihigh-frequancy se The iadividaat does not have a broad active vovabulary .., bur ts able to use concewrval and realworld clues to understand the thers using simple = to gresi, interact with and take feave of others; ~ to exchange information on personal background, home, schoo! life and interests; Kiads of tesi ana testing ~ to diseuss end make choices, decisions and plans, ~ fo express opinions, make requesis and suggestions; ~ ip asic for information and understand instructions. eases we leara nothing about how the individual's perform- ze5 with that of other candidates, Rather we learn someching abour what he of she can actually do in the language. Tests which are designed to provide this kind of information directly axe said to be ovitesinn-referenced.” The purpose of crizerion-referenced tests ig ro classify people according re whether oc not they are able to perform seme task or set of tasks satistactorily. The risks are set, and the performances are evaluated, It does not matter in principte whether all the candidates are successful, of none of the candidates is successful. The casks are set, and those who: perform thers satisfactorily ‘pass’; those who don’t, ‘fail’. This means that students are encotsraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined co fail. {n the case of the Berkshire German Certificate, for example, i¢ hoped chac all scadenss who are entered for it will be successful, Ceiterion-referenced tests therefore have we positive virtues: they set standards meaningful in teems of whar people can do, which do not change with different graups of candidates; anid they motivate students to aitain those standards. The need for dircee interpretation of performance means thar the construction of a criterion-referenced test may be quire different from that of a norm-referenced rest designed te-serve che same parpose. Let us imagine that the parpose is ro assiss the Engtish language ability of stadents in velation to the demands iaade by English medium univecsities. The eriterion-referenced test would almost certainly have to be based on an analysis of what students had to he able to do with or through English at mmiversity. Tasks would then be set similar to those to be met ar university, ff this were aoe done, direer interpretation of performance would be impossible, The norm-referenced rest, on the ocher hand, while is content might be based on a similar analysis, is not so restricted. The Michigan Test of English Language Proficiency, for instance, has avul- uple choice grammar, vocabulary, and tecding comprehension com- ponents. A candidate's score ov the test does not tell us direcily what his or har English ability is in celacion co the demands that would be made on itat an English-medinum university. 
To know this, we must consult a table which makes recommendations as to the academic load that a student in these tw ople siifer somewkar a cher use of the keen ‘cricerion-referenced”, Thes ig w the sense urrended ts mate clear, The sense in which itis 09 ch | feel will be most useful to the reader ir analysing testing impuetant proved s here is the one w peoblens 18 Kinds of test and te: ting with chat score should be allowed te carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university adnunistrators have leaned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions. Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to ctiterton-referenced testing. One reason for this may he thar procederes for use with norm-referenced tests {particularly with respect to such matters as the analysis of icerns and the estimation of reliability) are well established, while these for criterion-referenced tests are not. The view taken in this book, and argued for in Chapier 6, is that criterion-referenced tests are often to be preferred, not least for the heneficial backwash effect they are likely to. have, The lack of agreed procedures for such tests is nor sufficient reason for them to be excluded from consideration. Objective testing versus subjective testing The distinction hers is between methods of scoring, and nothing else Hino judgement is cequired on the part of the scorer, then the scoring is objec- tive. A multiple choice test, with che correct responses uaambignously identiGed, would be a cast in point Ef judgement is called for, the scoring is said to be subjective, There are different degrees of subjectivity in testing. The impressionistic scoring of a compasition may be con- Sidered more subjective than the scoring of short answers in response to. questions on a reading passage, Objectivity in scoring is sought after by many escers, nor for itscl!, but for the greater reliability it beings, In general, the less subjective the scoring, the greater agreement there will be between ewo different scorers {and between the scores of one person scoring the same test paper on different aceasions!. However, there are ways of obtaining reliabie subjective scoring, even of compasitions, These are discussed fest in Chapter 5, Communicative language testing Much has been written in recene years abour ‘communicative language testing’, Discussions have centred on the desirability of measuring che ability co take par in acts of communicadon Gactuding reading and listening) and on the best way to do this. Ie is assumed in chis book that it is usually communicative abiliry which we wantto test. Asa result, what t believe to be the most significant points made in discussions of communt- is xe Rinds of ¢ ative testing are to be found shrougheot. A recapiuladon under a separate heading would therefore be eedundant. PEACE ACT AE TIVITIES Consider a wumbee of Janguage tests with which you are familiar, For sach of chem, answer the following quesuons 1, Whar is the purpose of ths rest? Dons it represent direct or inctirscr renting for a mixture of bowl! Are the items discrete point of integ! itive jora mixture ot both)? Which irems ate objective, and which are subjective? Can you ocder she subjective irems according to degree of subjecti ©. ds the test borm-referenced or criterion-referenced? 
Further reading

For a discussion of the two approaches towards achievement test content specification, see Pilliner (1968). Alderson (1987) reports on research into the possible contributions of the computer to language testing. Direct testing calls for texts and tasks to be as authentic as possible: Vol. 2, No. 1 (1985) of the journal Language Testing is devoted to articles on authenticity in language testing. An account of the development of an indirect test of writing is given in Godshalk et al. (1966). Classic short papers on criterion-referencing and norm-referencing (not restricted to language testing) are by Popham (1978), favouring criterion-referenced testing, and Ebel (1978), arguing for the superiority of norm-referenced testing. The description of reading ability given in this chapter comes from the Interagency Language Roundtable Language Skill Level Descriptions. Comparable descriptions at a number of levels for the four skills, intended for assessing students in academic contexts, have been devised by the American Council on the Teaching of Foreign Languages (ACTFL). These ACTFL Guidelines are available from ACTFL at 579 Broadway, Hastings-on-Hudson, NY 10706, USA. It should be said, however, that the form that these take and the way in which they were constructed have been the subject of some controversy. Doubts about the applicability of criterion-referencing to language testing are expressed by Skehan (1984); for a different view, see Hughes (1986). Carroll (1961) is a seminal paper on discrete point and integrative testing, and Morrow (1979) on communicative language testing. Further discussion of the topic is to be found in Alderson and Hughes (1981, Part 1) and in Davies (1988). Weir's (1988) book has as its title Communicative Language Testing.

4 Validity

We already know from Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.

Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It isn't to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison of test specification and test content is the basis for judgements as to content validity. Ideally these judgements should be made by people who are familiar with language teaching and testing but who are not directly concerned with the production of the test in question.
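Such a comparison can be pictured, crudely, as a coverage check. The sketch below is illustrative only; the specification and the list of structures actually sampled by the test are invented:

    # Invented specification and test content for a grammar achievement test.
    specification = ["present simple", "past simple", "present perfect",
                     "passive", "conditionals", "reported speech"]
    tested = ["present simple", "past simple", "passive"]

    missing = [s for s in specification if s not in tested]
    coverage = 1 - len(missing) / len(specification)
    print(missing)             # areas at risk of being ignored in teaching
    print(f"{coverage:.0%}")   # proportion of the specification sampled: 50%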
What is the importance of content validity? First, the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure. A test in which major areas identified in the specification are under-represented, or not represented at all, is unlikely to be accurate. Secondly, such a test is likely to have a harmful backwash effect. Areas which are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. Advice on the writing of specifications and on the judgement of content validity is to be found in Chapter 7.

Criterion-related validity

Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of 'functions' which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?

From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further. We can attempt to establish the concurrent validity of the component.

To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged.
The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.

Box 1

To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square the coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students.

This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to Appendix 1.
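For readers who would like to see one such standard procedure in computational form, here is a minimal sketch. The score lists are invented; the coefficient computed is the ordinary product-moment correlation, and squaring it gives the 'level of agreement' discussed in Box 1:

    import math

    def correlation(x, y):
        """Product-moment correlation between two sets of scores."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Invented scores: ten students on the full 45-minute component
    # and on the ten-minute session.
    full_test  = [55, 62, 70, 48, 80, 66, 59, 73, 51, 68]
    short_test = [50, 60, 72, 45, 78, 60, 62, 70, 49, 65]

    r = correlation(full_test, short_test)
    print(round(r, 2))      # the validity coefficient
    print(round(r * r, 2))  # squared: the proportion of agreement (Box 1)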
Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of selecting someone with insufficient ability in the language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. It may be, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which was different from all existing tests, as when 'communicative' tests were first developed.

The second kind of criterion-related validity, predictive validity, concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 16 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognised. As a result, a validity coefficient of this order is generally regarded as satisfactory. The further reading section at the end of the chapter gives references to recent reports on the validation of the British Council's ELTS test, in which these issues are discussed at length.

Another example of predictive validity would be where an attempt was made to validate a placement test. Placement tests attempt to predict the most appropriate class for any particular student. Validation would involve an enquiry, once courses were under way, into the proportion of students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.
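Such an enquiry might be summarised along the following lines (a sketch; the class names and figures are invented):

    # Students placed in each class, and how many of them their teachers
    # later judged to be misplaced.
    placed    = {"elementary": 30, "intermediate": 45, "advanced": 25}
    misplaced = {"elementary": 2,  "intermediate": 9,  "advanced": 4}

    overall = sum(misplaced.values()) / sum(placed.values())
    print(f"overall misplacement rate: {overall:.0%}")   # 15%
    for level in placed:
        print(level, f"{misplaced[level] / placed[level]:.0%}")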
Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word 'construct' refers to any underlying ability (or trait) which is hypothesised in a theory of language ability. One might hypothesise, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.

Gross, commonsense constructs like 'reading ability' and 'writing ability' are, in my view, unproblematical. Similarly, the direct measurement of writing ability, for instance, should not cause us too much concern: even without research we can be fairly confident that we are measuring a distinct and meaningful ability. Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and its techniques.

Let us imagine that we are indeed planning to construct an indirect test of writing ability which must for reasons of practicality be multiple choice. Our theory of writing tells us that underlying writing ability are a number of sub-abilities, such as control of punctuation, sensitivity to demands on style, and so on. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that this test really is measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing ability of the group to whom the test is first administered, and have these reliably scored. We would then compare scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (and a coefficient of the kind described in the previous section can be calculated), then we have evidence that we are measuring writing ability with the test.

So far, however, though we may have developed a satisfactory indirect test of writing, we have not demonstrated the reality of the underlying constructs (control of punctuation etc.). To do this we might administer a series of specially constructed tests, measuring each of the constructs by a number of different methods. In addition, compositions written by the people who took the tests could be scored separately for performance in relation to the hypothesised constructs (control of punctuation, for example). In this way, for each person, we would obtain a set of scores for each of the constructs. Coefficients could then be calculated between the various measures. If the coefficients between scores on the same construct are consistently higher than those between scores on different constructs, then we have evidence that we are indeed measuring separate and identifiable constructs.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.
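The logic of the comparison between coefficients can itself be sketched. In the invented example below, each of two hypothesised constructs is measured by two different methods, and we check that same-construct coefficients are consistently higher than cross-construct ones (this uses statistics.correlation, which requires Python 3.10 or later):

    from statistics import correlation

    # Invented pilot scores: punctuation and style each measured by two methods.
    punct_m1 = [12, 15, 9, 18, 11, 14, 16, 10]
    punct_m2 = [13, 14, 10, 17, 12, 15, 15, 9]
    style_m1 = [7, 12, 14, 8, 15, 9, 11, 13]
    style_m2 = [8, 11, 15, 7, 14, 10, 12, 12]

    same_construct  = [correlation(punct_m1, punct_m2),
                       correlation(style_m1, style_m2)]
    cross_construct = [correlation(punct_m1, style_m1),
                       correlation(punct_m2, style_m2)]

    # Evidence for separate, identifiable constructs:
    print(min(same_construct) > max(cross_construct))   # True for these data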
Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.

The use of validity

What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment, another reason for favouring direct testing!). Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.

READER ACTIVITIES

Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?

Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1976). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1986). Bachman and Palmer (1981) is a good example of construct validation. For a collection of papers on language testing research, see Oller (1983).

5 Reliability

Imagine that a hundred students take a 100-item test at three o'clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o'clock the previous afternoon? Would we expect each student to have obtained the same score as he or she actually did on the Thursday? The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgement on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval, we would nevertheless not expect every individual to get precisely the same score on the Wednesday as they got on the Thursday. Human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical.

But if this is the case, it would seem to imply that we cannot place complete trust in any set of test scores. We know that the scores would have been different if the test had been administered on the previous or the following day. This is inevitable, and we must accept it. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be. Look at the hypothetical data in Table 1(a).
They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.

TABLE 1(a) SCORES ON TEST A (INVENTED DATA)

Student     Score obtained     Score which would have been
                               obtained on the following day
Bill             68                  ...
Mary             ...                 28
Ann              ...                 34
Harry            ...                 67
Cyril            43                  ...
Pauline          36                  ...
Don              43                  ...
Colin            ...                 ...
Irene            ...                 ...
Sue              82                  ...

Now look at Table 1(b), which displays the same kind of information for a second 100-item test (B). Again note the difference in scores for each student.

TABLE 1(b) SCORES ON TEST B (INVENTED DATA)

Student     Score obtained     Score which would have been
                               obtained on the following day
Bill             ...                 ...
Mary             ...                 ...
Ann              ...                 ...
Harry            ...                 ...
Cyril            ...                 ...
Pauline          56                  ...
Don              ...                 ...
Colin            16                  ...
Irene            62                  ...
Sue              57                  ...

Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.

Look now at Table 1(c), which represents the scores of the same students on an interview using a five-point scale.

TABLE 1(c) SCORES ON INTERVIEW (INVENTED DATA)

Student     Score obtained     Score which would have been
                               obtained on the following day
Bill             ...                 ...
Mary             ...                 ...
Ann              ...                 ...
Harry            ...                 ...
Cyril            ...                 ...
Pauline          ...                 ...
Don              ...                 ...
Colin            ...                 ...
Irene            ...                 ...
Sue              ...                 ...

In one sense the two sets of interview scores are very similar. The largest difference between a student's actual score and the one which would have been obtained on the following day is 3. But the largest possible difference is only 4! Really the two sets of scores are very different. This becomes apparent once we compare the size of the differences between students with the size of the differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their interview score, the highest first. The order based on their actual scores is markedly different from the one based on the scores they would have obtained if they had had the interview on the following day. This interview turns out in fact not to be very reliable at all.

The reliability coefficient

It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such test exists!) would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test the day after. It is between the two extremes of 1 and zero that genuine test reliability coefficients are to be found.
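The coefficient is computed from two sets of scores in the same way as a validity coefficient. In the sketch below the score pairs are invented (in the spirit of Tables 1(a) and 1(b), though not taken from them), and statistics.correlation (Python 3.10+) supplies the product-moment coefficient:

    from statistics import correlation

    # Invented scores for two administrations of each of two tests.
    test_a_day1 = [68, 46, 19, 89, 43, 56, 43, 27, 76, 62]
    test_a_day2 = [82, 28, 34, 67, 63, 59, 35, 23, 62, 49]
    test_b_day1 = [65, 48, 23, 85, 44, 56, 40, 29, 77, 60]
    test_b_day2 = [63, 50, 21, 87, 43, 55, 42, 27, 75, 61]

    print(round(correlation(test_a_day1, test_a_day2), 2))  # markedly below 1
    print(round(correlation(test_b_day1, test_b_day2), 2))  # close to 1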
Certain authors have suggested how high a reliability coefficient we should expect for different types of language test. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. He adds that a reliability coefficient of .85 might be considered high for an oral production test but low for a reading test. These suggestions reflect what Lado sees as the difficulty of achieving reliability in the testing of the different abilities.

In fact the reliability coefficient that is to be sought will depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test. The more important the decisions, the greater reliability we must demand: if we are to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something has to be said about the way in which reliability coefficients are arrived at.

The first requirement is to have two sets of scores for comparison. The most obvious way of obtaining these is to get a group of subjects to take the same test twice. This is known as the test-retest method. The drawbacks are not difficult to see. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their responses to them, making the same responses more likely and the reliability spuriously high. If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the same test (the alternate forms method). However, alternate forms are often simply not available.

It turns out, surprisingly, that the most common methods of obtaining the necessary two sets of scores involve only one administration of one test. Such methods provide us with a coefficient of 'internal consistency'. The most basic of these is the split half method. In this the subjects take the test in the usual way, but each subject is given two scores: one score for one half of the test, the second score for the other half. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice. In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent, through the careful matching of items (in fact where the items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate). It can be seen that this method is rather like the alternate forms method, except that the two 'forms' are only half the length.* It has been demonstrated empirically that this economical method will indeed give good estimates of alternate forms coefficients, provided that the alternate forms are closely equivalent to each other.

* Because of the reduced length, which will cause the coefficient to be less than it would be for the whole test, a statistical adjustment has to be made (see Appendix 1 for details).

Details of other methods of estimating reliability, and of carrying out the necessary statistical calculations, are to be found in Appendix 1.
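Here is a sketch of the split half procedure on invented item-level data. The adjustment applied at the end is the Spearman-Brown correction for the halved length, which I take to be the adjustment referred to in the footnote (the details are in Appendix 1):

    from statistics import correlation

    # Invented responses: 1 = correct, 0 = incorrect; one row per subject.
    responses = [
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 0, 0, 0, 1, 0, 0],
    ]

    # Score the odd-numbered and even-numbered items separately.
    odd_scores  = [sum(row[0::2]) for row in responses]
    even_scores = [sum(row[1::2]) for row in responses]

    r_half = correlation(odd_scores, even_scores)

    # Spearman-Brown adjustment for the fact that each 'form' is half-length.
    r_full = 2 * r_half / (1 + r_half)
    print(round(r_half, 2), round(r_full, 2))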
The standard error of measurement and the true score

While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what he or she might have scored on another occasion. With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their 'true score'.

Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. If we had all of these scores we would be able to calculate their average score, and it would seem not unreasonable to think of this average as the one that best represents the student's ability with respect to this particular test. It is this, which for obvious reasons we can never know for certain, which is referred to as the candidate's true score.

We are able to make statements about the probability that a candidate's true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we first have to calculate the standard error of measurement of the particular test. The calculation (described in Appendix 1) is very straightforward, and is based on the reliability coefficient and a measure of the spread of all the scores on the test (given a particular spread of scores, the greater the reliability coefficient, the smaller will be the standard error of measurement).

How such statements can be made using the standard error of measurement of the test is best illustrated by an example. Suppose that a test has a standard error of measurement of 5. An individual scores 56 on that test. We are then in a position to make the following statements:

We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).

We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained).

We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained).

These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement of their true score, and so on. If in fact they only take the test once, we cannot be sure how their score on that occasion relates to their true score, but we are still able to make probabilistic statements as above.*

* These statistical statements are based on what is known about the way a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (with no experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution (see Woods, Fletcher and Hughes, 1986, for discussion beyond the scope of the present book). It is the known structure of the normal distribution which allows us to say what percentage of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.
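The bands in the example can be computed directly. This sketch assumes the usual formula, in which the standard error of measurement is the spread (standard deviation) of scores multiplied by the square root of one minus the reliability coefficient; the spread and reliability figures are invented so as to reproduce the standard error of 5 used above:

    import math

    def standard_error(spread, reliability):
        """Standard error of measurement from the spread (standard deviation)
        of scores and the reliability coefficient."""
        return spread * math.sqrt(1 - reliability)

    e = standard_error(10, 0.75)   # invented figures; gives 5.0
    score = 56
    for k, certainty in [(1, 68), (2, 95), (3, 99.7)]:
        print(f"{certainty}% certain: true score between "
              f"{score - k * e:.0f} and {score + k * e:.0f}")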
In the end, the statistical rationale is not important. What is important is to recognise how we can use the standard error of measurement to inform decisions that we take on the basis of test scores. We should, for example, be very wary of taking important negative decisions about people's future if the standard error of measurement indicates that their true score is quite likely to be equal to or above the score that would lead to a positive decision, even though their actual score is below it. For this reason, all published tests should provide users with not only the reliability coefficient but also the standard error of measurement.*

* It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement; it is the particular use made of particular scores in relation to a particular standard error of measurement which may be considered acceptable or unacceptable.

We have seen the importance of reliability. If a test is not reliable then we know that the actual scores of many individuals are likely to be quite different from their true scores. This means that we can place little reliance on those scores. Even where reliability is quite high, the standard error of measurement serves to remind us that in the case of some individuals there is quite possibly a large discrepancy between actual score and true score. This should make us very cautious about making important decisions on the basis of the test scores of candidates whose actual scores place them close to the cut-off point (the point that divides 'passes' from 'fails'). We should at least consider the possibility of gathering further relevant information on the language ability of such candidates.

Having seen the importance of reliability, we shall consider, later in the chapter, how to make our tests more reliable. Before that, however, we shall look at another aspect of reliability.

Scorer reliability

In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that the scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions: any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in a similar way to the test reliability coefficient. In the case of the multiple choice test just described, the scorer reliability coefficient would be 1. As we noted in Chapter 3, when scoring requires no judgement, and could in principle or in practice be carried out by a computer, the test is said to be objective. Only carelessness should cause the reliability coefficients of objective tests to fall below 1.
iis possible to quantify the level of agreement given by differen: scorers on ditferent occasions by Means of a scorer reliability coefficient which can be interpreted in a similar way as the test reliability coefficient. in the case of the multiple choice test jus¢ described the scorer reliability coefficient would be L. As Wenoted in Chapter 3, when scaring requires ao judgement, and could itr Principle or in practice be carcied out by a computer, the testis said to be y carelessness shoutd cause the reliability cuefficients of sts to fall below 1. However, we did not assume perfectly consistent scoring inthe cave of the interview scores discussed earlier in che chapter, [t would probably ave seemed to che reader an unreasonable assumption. We cas accepe that scorers shoud be able to he consistent when here 1s only one easily 38 Reliabaliey ndgemenc is called for ance in an interview, recog ised correct re: esponse. Bac when a degree « onthe part'al che scorer, as in : pecfect consistency ie nar ro be expected, Suck subjective tests will aot reliability coeticients of 1! Indeed here was a time when many people thought thar scorer celiabiliry coefficients (and also the reliability of the test) would always be roo low re justify the use of subjective measures of language ability in rious language testing, This view is less widely held today. Whie the pectect reliabiliry of obyex Sis net obtainable js subjective tests, there are ways of making it sufficiency high for teit results to be valuable, {tis possible, for rstance, ro obtain scorer reliability cootReience of over G.9 tor the scaring of compositions, te is perhaps worth making explicic semehing about the relationship berween scorer reliability and test reliability. [ uke oGting ata testis not refiuble, chen the tese reales camnor be reliable vither. Indeed the test reliability coctticient will almost cerrainly be fower ean scores reliability, since other sources of unceliability will be additional to what enters through imperfect scuring. In a case T kaow of, the scorer reltability coefficient on a composition writing text was 92, while the reliabilicy coefficient for the testi was .84. Variability in the performance of individual candidaces accounted for the diffevence between the two covttivient have How to makes tests more reliable As we have seen, there arc two components of test reliability: the peclormance of candidates from occasion co occasion, and the rsliabilicy of the scoring, We will begin by suggesting ways of achieving consistent performances from candidates and chen tum our attention to scorer reliability Take enough samples of behaviour Ouher things being equal, the more items that row have on a zest, the more reliable that test will be. This seems intuatively tight. If we wanted to know how good au aches someone was, we wouldn't rely ou the evidence of a single shor at the targe:, That one shot could be quite ucwepresentative of theie abiliy, To be satisfied thar we had a really reliable measure of the abitiey we would want co see a large number of shots ac che target. The same is trae for language testing. ly has been demonstrated empirically thar the addition of further items will make a test more reliable. There is even a fasmuts (the Spearman-Browa formula, see the Appendix] that allows one ro estimate how many extra items similar ro. 36 che ones already in the cost yell he needed te: increase voetfcient £6 a required teve., Gne thigy to be che additional ines shout d be independent of exc! rents. 
One thing to note, however, is that the additional items should be independent of each other and of the existing items. Imagine a reading test that asks the