
Testing for
Language Teachers

Arthur Hughes

The right of the
University of Cambridge
to print and sell
all manner of books
was granted by
Henry VIII in 1534.
The University has printed
and published continuously
since 1584.

Cambridge University Press

Cambridge
New York  Port Chester
Melbourne  Sydney
© Cambridge University Press 1989

First published 1989

Printed in Great Britain by Bell & Bain Ltd, Glasgow

Library of Congress cataloguing in publication data

Hughes, Arthur, 1941-
Testing for language teachers / Arthur Hughes.
p. cm. - (Cambridge handbooks for language teachers)
Bibliography: p.
Includes index.
ISBN 0 521 25264 4. - ISBN 0 521 27260 2 (pbk.)
1. Language and languages - Ability testing. I. Title.
II. Series.
P53.4.H84 1989
407.5-dc19 88-3850 CIP

British Library cataloguing in publication data

Hughes, Arthur, 1941-
Testing for language teachers. - (Cambridge
handbooks for language teachers).
1. Great Britain. Educational institutions.
Students. Foreign language skills.
Assessment. Tests - For teaching.
I. Title.
478'.0076

ISBN 0 521 25264 4 hard covers
ISBN 0 521 27260 2 paperback

Copyright

The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.


CAMBRIDGE HANDBOOKS FOR LANGUAGE TEACHERS

General Editors: Michael Swan and Roger Bowers

This is a series of practical guides for teachers of English and other languages. Illustrative examples are usually drawn from the field of English as a foreign or second language, but the ideas and techniques described can equally well be used in the teaching of any language.
In this series:

Drama Techniques in Language Learning - A resource book of communication activities for language teachers
by Alan Maley and Alan Duff
Games for Language Learning
by Andrew Wright, David Betteridge and Michael Buckby
Discussions that Work - Task-centred fluency practice by Penny Ur
Once Upon a Time - Using stories in the language classroom
by John Morgan and Mario Rinvolucri
Teaching Listening Comprehension by Penny Ur
Keep Talking - Communicative fluency activities for language teaching
by Friederike Klippel
Working with Words - A guide to teaching and learning vocabulary
by Ruth Gairns and Stuart Redman
Learner English - A teacher's guide to interference and other problems
edited by Michael Swan and Bernard Smith
Testing Spoken Language - A handbook of oral testing techniques
by Nic Underhill
Literature in the Language Classroom - A resource book of ideas and activities
by Joanne Collie and Stephen Slater
Dictation - New methods, new possibilities
by Paul Davis and Mario Rinvolucri
Grammar Practice Activities - A practical guide for teachers by Penny Ur
Testing for Language Teachers by Arthur Hughes

For Vicky, Meg and Jake

Contents

Acknowledgements viii
Preface ix

1 Teaching and testing 1
2 Testing as problem solving: an overview of the book
3 Kinds of test and testing 9
4 Validity 22
5 Reliability 29
6 Achieving beneficial backwash
7 Stages of test construction
8 Test techniques and testing overall ability 59
9 Testing writing 75
10 Testing oral ability 101
11 Testing reading 116
12 Testing listening 134
13 Testing grammar and vocabulary 141
14 Test administration 152

Appendix 1 Statistical analysis of test results 155
Appendix 2 165
Bibliography 166
Index 170

Acknowledgements

The author and publishers would like to thank the following for permission to reproduce copyright material:

American Council on the Teaching of Foreign Languages Inc. for extracts from ACTFL Provisional Proficiency Guidelines and from Generic Guidelines; […] Boğaziçi University Language examinations; […] Council for the draft of the scale for […]; […] Press for M. Swan and C. Walter […] (1988); Educational Testing Services […]; […]lingual House for M. Garman and […] Chapter 4 (1983); The Foreign […] pp. 35-8 (1979); Harper & Row, Publishers, Inc. for the graph on p. 121 from Basic Statistical Methods by N. M. Downie and Robert W. Heath, copyright © 1985 in N. M. Downie and Robert W. Heath, reprinted with permission of Harper & Row, Publishers, Inc.; The Independent for N. Timmins: 'Passive smoking comes under fire', 14 March 1987; Joint Matriculation Board for extracts from the March 1980 and June 1980 Test in English (Overseas); Language Learning and J. W. Oller Jr. and C. […] for an extract from 'The cloze technique and ESL proficiency', Language Learning […]; Longman UK Ltd […]; The Observer for S. Limb: 'One-sided […]'; Royal Society of Arts Examinations […]; […] Local Examinations Syndicate for […] the examination in The Communicative Use of English as a Foreign Language […] 1985 summer examinations; […]; and the Oxford […] examinations in English as a Foreign Language.
Preface

The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order to arrive at the best solution for any particular situation - the most appropriate test or testing system - it is not enough to have at one's disposal a collection of test techniques from which to choose. It is also necessary to understand the principles of testing and how they can be applied in practice.

It is relatively straightforward to introduce and explain the desirable qualities of tests: validity, reliability, practicality, and beneficial backwash (this last, which refers to the favourable effects tests can have on teaching and learning, here receiving more attention than is usual in books of this kind). It is much less easy to give realistic advice on how to achieve them in teacher-made tests. One is tempted either to ignore the problem or to present as a model the not always appropriate methods of large-scale testing institutions. In resisting these temptations I have been compelled to make explicit in my own mind much that had previously been vague and intuitive. I have certainly benefited from doing this; I hope that readers will too.

Exemplification throughout the book is from the testing of English as a foreign language. This reflects both my own experience in language testing and the fact that English will be the one language known by all readers. I trust that it will not prove too difficult for teachers of other languages to find or construct parallel examples of their own.

I must acknowledge the contributions of others: MA students at Reading University, too numerous to mention by name, who have taught me much, usually by asking questions that I could not answer; my friends and colleagues, Paul Fletcher, Michael Garman, Don Porter, and Tony Woods, who all read parts of the manuscript and made many helpful suggestions; Barbara Barnes, who typed a first version of the early chapters; Michael Swan, who gave good advice and much encouragement, and who remained remarkably patient as each deadline for completion passed; and finally my family, who accepted the writing of the book as an excuse more often than they should. To all of them I am very grateful.
1 Teaching and testing

Many language teachers harbour a deep mistrust of tests and of testers. The starting point for this book is the admission that this mistrust is frequently well-founded. It cannot be denied that a great deal of language testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they are intended to measure.

Backwash

The effect of testing on teaching and learning is known as backwash. Backwash can be harmful or beneficial. If a test is regarded as important, then preparation for it can come to dominate all teaching and learning activities. And if the test content and testing techniques are at variance with the objectives of the course, then there is likely to be harmful backwash. An instance of this would be where students are following an English course which is meant to train them in the language skills (including writing) necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple choice items, then there is great pressure to practise such items rather than practise the skill of writing itself. This is clearly undesirable.
We have just looked at a case of harmful backwash. However, backwash need not always be harmful; it can be positively beneficial. I was once involved in the development of an English language test for an English medium university in a non-English-speaking country. The test was to be administered at the end of an intensive year of English study there and would be used to determine which students would be allowed to go on to their undergraduate courses (taught in English) and which would have to leave the university. A test was devised which was based directly on an analysis of the English language needs of first year undergraduate students, and which included tasks as similar as possible to those which they would have to perform as undergraduates (reading textbook materials, taking notes during lectures, and so on). The introduction of this test, in place of one which had been entirely multiple choice, had an immediate effect on teaching: the syllabus was redesigned, new books were chosen, classes were conducted differently. The result of these changes was that by the end of their year's training, in circumstances made particularly difficult by greatly increased numbers and limited resources, the students reached a much higher standard in English than had ever been achieved in the university's history. This was a case of beneficial backwash.
Davies (1968:5) has said that 'the good test is an obedient servant since it follows and apes the teaching'. I find it difficult to agree. The proper relationship between teaching and testing is surely that of partnership. It is true that there may be occasions when the teaching is good and appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.

Inaccurate tests

The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Teachers know this. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.

Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters, but it is possible to introduce two of these concerns here. The first concerns test content and technique: if we want to know how well people can write, there is no way we can get a really accurate measure of their ability by means of a multiple choice test. […] Where the scoring of the written work of tens of thousands of candidates might not be a
practical proposition, it is understandable that potentially greater accuracy is sacrificed for reasons of economy and convenience. But it does not give testing a good name! And it does set a bad example.

While few teachers would wish to follow that particular example in order to test writing ability, the overwhelming practice in large-scale testing of using multiple choice items does lead to imitation in circumstances where such items are not at all appropriate. What is more, the imitation tends to be of a very poor standard. Good multiple choice items are notoriously difficult to write. A great deal of time and effort has to go into their construction. Too many multiple choice tests are written where such care and attention is not given (and indeed may not be possible). The result is a set of poor items that cannot possibly provide accurate measurements. One of the principal aims of this book is to discourage the use of inappropriate techniques and to show that teacher-made tests can be superior in certain respects to their professional counterparts.
The second source of inaccuracy is lack of reliability. Reliability is a technical term which is explained in Chapter 5. For the moment it is enough to say that a test is reliable if it measures consistently. On a reliable test you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next; whereas on an unreliable test the score is quite likely to be considerably different, depending on the day on which it is taken.
Unreliability has two origins. In the first case, something about the test creates a tendency for individuals to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.
In the second case, equivalent test performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring.
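The scorer-reliability point can be made concrete with a small worked example. The sketch below is not from the book: the two sets of marks are invented, and the Pearson correlation coefficient is just one common way of quantifying how closely two markers agree when scoring the same set of compositions.

```python
# Hypothetical marks (out of 20) awarded by two markers to the same six
# compositions. A correlation near 1.0 means the markers rank and space
# the scripts very similarly; a low value signals unreliable scoring.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

marker_a = [12, 15, 9, 18, 11, 14]
marker_b = [11, 16, 8, 17, 12, 13]

print(round(pearson_r(marker_a, marker_b), 2))  # 0.95: close agreement
```

The well-understood ways of minimising scoring differences mentioned above (clear rating scales, marker training, double marking) all aim, in effect, to push agreement figures of this kind up.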
Most (but not all) large testing organisations, to their credit, take every step to make their tests, and the scoring of them, as reliable as possible, and they are generally highly successful in this respect. Small-scale testing, on the other hand, tends to be less reliable than it should be. Another aim of this book, then, is to show how to achieve greater reliability in testing. Advice on this is to be found in Chapter 5.
{
The need for tests

So far this chapter has been concerned to understand why tests are so mistrusted by many language teachers. We have seen that this mistrust is often justified. One conclusion drawn from this might be that we would be better off without language tests. Teaching is, after all, the primary activity; if testing comes into conflict with it, then it is testing which should go, especially when it has been admitted that so much testing provides inaccurate information. This is a plausible argument - but there are other considerations, which might lead to a different conclusion.
useiul and
Information about people's language ability is often very useful and sometimes necessary. It is difficult to imagine, for example, British and American universities accepting students from overseas without some knowledge of their proficiency in English. The same is true for organisations hiring interpreters or translators. They certainly need dependable measures of language ability.
for
Within teaching systems, too, as long as it is thought appropriate for individuals to be given a statement of what they have achieved in a second or foreign language, then tests of some kind or other will be needed. They will also be needed in order to provide information about the achievement of groups of learners, without which it is difficult to see how rational educational decisions can be made. While for some purposes teachers' assessments of their own students are both appropriate and sufficient, this is not true for the cases just mentioned. Even without considering the possibility of bias, we have to recognise the need for a common yardstick, which tests provide, in order to make meaningful comparisons.
.o, that
If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view, the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.

Teaching and testing

What is to be done?

The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure on others, including professional testers and examining boards, to improve their tests. This book represents an attempt to help them do both.

For the reader who doubts that teachers can influence the large testing institutions, let this chapter end with a further reference to the testing of writing through multiple choice items. This was the practice followed by those responsible for TOEFL (Test of English as a Foreign Language), the test taken by most non-native speakers of English applying to North American universities. Over a period of many years they maintained that it was simply not possible to test the writing ability of hundreds of thousands of candidates by means of a composition: it was impracticable and the results, anyhow, would be unreliable. Yet in 1985 a writing test in which candidates actually have to write was introduced as a supplement to TOEFL, and already universities are requiring applicants to take this test. The principal reason given for this change was pressure from English language teachers who had finally convinced those responsible for the TOEFL of the overriding need for a writing task which would provide beneficial backwash.

READER ACTIVITIES

1 Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2 Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?
Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in a narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).
2 Testing as problem solving: an overview of the book

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them to become successful solvers of testing problems.

Language testers are sometimes asked to say what is 'the best test' or 'the best testing technique'. Such questions reveal a misunderstanding of what is involved in the practice of language testing. In fact there is no best test or best technique. A test which proves ideal for one purpose may be quite useless for another; a technique which may work very well in one situation may be entirely inappropriate in another. As we saw in the previous chapter, what suits large testing corporations may be quite out of place in the tests of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and so sets a particular testing problem. It is the tester's job to provide the best solution to that problem. The aims of this book are to equip readers with the basic knowledge and techniques first to solve such problems, secondly to evaluate the solutions proposed or already implemented by others, and thirdly to argue persuasively for improvements in testing practice where these seem necessary.
In every situation the testing problem must be stated clearly before steps are taken towards its solution. Every testing problem can be expressed in the same general terms: we want to create a test or testing system which will:

- consistently provide accurate measures of precisely the abilities¹ in which we are interested;
- have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching);
- be economical in terms of time and money.

1. The word 'ability' is not intended here in any technical sense. It refers simply to what people can do in, or with, a language. This would, for example, include the ability to converse in the language, as well as the ability to recite grammatical rules (if that is what we are interested in measuring). It does not, however, refer to […]
Let us describe the general testing problem in a little more detail. The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of tests. This may seem obvious but it is something which seems not always to be recognised. The purposes of testing discussed in this book are:

- to measure language proficiency regardless of any language courses that candidates may have followed;
- to discover how far students have achieved the objectives of a course of study;
- to diagnose students' strengths and weaknesses, to identify what they know and what they do not know;
- to assist placement of students by identifying the stage or part of a teaching programme most appropriate to their ability.

All of these purposes are discussed in the next chapter. That chapter also introduces different kinds of testing and test techniques: direct as opposed to indirect testing; discrete-point versus integrative testing; criterion-referenced testing as against norm-referenced testing; objective and subjective testing.
In stating the testing problem in general terms above, we spoke of providing consistent measures of precisely the abilities we are interested in. A test which does this is said to be 'valid'. Chapter 4 addresses itself to various kinds of validity. It provides advice on the achievement of validity in test construction and shows how validity is measured.

The word 'consistently' was used in the statement of the testing problem. The consistency with which accurate measures are provided is in fact an essential ingredient of validity: where measurement is consistent, people's scores on a test are likely to be very similar regardless of whether they happen to take it on one occasion rather than on another. This consistency is referred to as 'reliability'. As indicated in the previous chapter, it is an absolutely essential quality of tests - what use is a test if it will give widely differing estimates of an individual's (unchanged) ability? - yet it is something which is distinctly lacking in very many teacher-made tests. Chapter 5 gives advice on how to achieve reliability and explains the ways in which it is measured.
The concept of backwash effect was introduced in the previous chapter. Chapter 6 identifies a number of conditions for tests to meet in order to achieve beneficial backwash.

The measurement of aptitude for learning languages - which people have, in differing degrees, and which can be measured in order to predict how well or how quickly individuals will learn a foreign language - is beyond the scope of this book. The interested reader is referred to […], Carroll (1981), and Skehan (1986).
All tests cost time and money - to prepare, administer, score and interpret. Time and money are in limited supply, and so there is often likely to be a conflict between what appears to be a perfect testing solution in a particular situation and considerations of practicality. This issue is also discussed in Chapter 6.

To rephrase the general testing problem identified above: the basic problem is to develop tests which are valid and reliable, which have a beneficial backwash effect on teaching (where this is relevant), and which are practical. The next four chapters of the book are intended to look more closely at the relevant concepts and so help the reader to formulate such problems clearly in particular instances, and to provide advice on how to approach their solution.

The second half of the book is devoted to more detailed advice on the construction and use of tests, the putting into practice of the principles outlined in earlier chapters. Chapter 7 outlines and exemplifies the various stages of test construction. Chapter 8 discusses a number of testing techniques. Chapters 9-13 show how a variety of language abilities can best be tested, particularly within teaching institutions. Chapter 14 gives straightforward advice on the administration of tests.

We have to say something about statistics. Some understanding of statistics is useful, indeed necessary, for a proper appreciation of testing matters and for successful problem solving. At the same time, we have to recognise that there is a limit to what many readers will be prepared to do, especially if they are at all afraid of mathematics. For this reason, statistical matters are kept to a minimum and are presented in terms that everyone should be able to grasp. The emphasis will be on interpretation rather than on calculation. For the more adventurous reader, however, Appendix 1 explains how to carry out a number of statistical operations.
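As a small illustration of the kind of statistical operation Appendix 1 is concerned with, the sketch below (not from the book; the scores are invented) computes the two quantities that almost any analysis of test results starts from: the mean and the standard deviation of a set of scores.

```python
# Invented scores for seven test takers on a 100-point test.
scores = [48, 52, 55, 60, 61, 64, 70]

def mean(values):
    """Arithmetic mean: the 'average' score."""
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation: how spread out the scores are."""
    m = mean(values)
    variance = sum((v - m) ** 2 for v in values) / len(values)
    return variance ** 0.5

print(round(mean(scores), 2))     # 58.57
print(round(std_dev(scores), 2))  # 6.93
```

Interpreting such figures, rather than calculating them, is where the book puts its emphasis.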

Further reading

The collection of critical reviews of nearly fifty English language tests (mostly British and American), edited by Alderson, Krahnke and Stansfield (1987), reveals how well professional test writers are thought to have solved their problems. A full appreciation of the reviews will depend to some degree on an assimilation of the content of Chapters 3, 4, and 5 of this book.
3 Kinds of test and testing

This chapter begins by considering the purposes for which language testing is carried out. It goes on to make a number of distinctions: between direct and indirect testing, between discrete point and integrative testing, between norm-referenced and criterion-referenced testing, and between objective and subjective testing. Finally there is a note on communicative language testing.

We use tests to obtain information. The information that we hope to obtain will of course vary from situation to situation. It is possible, nevertheless, to categorise tests according to a small number of kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

Proficiency tests are designed to measure people's ability in a language regardless of any training they may have had in that language. The content of a proficiency test, therefore, is not based on the content or objectives of language courses which people taking the test may have followed. Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient. This raises the question of what we mean by the word 'proficient'.

In the case of some proficiency tests, 'proficient' means having sufficient command of the language for a particular purpose. An example of this would be a test designed to discover whether someone can function successfully as a United Nations translator. Another example would be a test used to determine whether a student's English is good enough to follow a course of study at a British university. Such a test may even attempt to take into account the level and kind of English needed to follow courses in particular subject areas. It might, for example, have one form of the test for arts subjects, another for sciences, and so on. Whatever the particular purpose to which the language is to be put, this will be reflected in the specification of test content at an early stage of a test's development.
There are other proficiency tests which, by contrast, do not have any particular occupation or course of study in mind. For them the concept of proficiency is more general. British examples of these would be the Cambridge examinations (First Certificate Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary and Higher). The function of these tests is to show whether candidates have reached a certain standard with respect to certain specified abilities. Such examining bodies are independent of the teaching institutions and so can be relied on by potential employers etc. to make fair comparisons between candidates from different institutions and different countries.

Though there is no particular purpose in mind for the language, these general proficiency tests should have detailed specifications saying just what it is that successful candidates will have demonstrated that they can do. Each test should be seen to be based directly on these specifications. All users of a test (teachers, students, employers, etc.) can then judge whether the test is suitable for them, and can interpret test results. It is not enough to have some vague notion of proficiency, however prestigious the testing body concerned.
Despite differences between them of content and level of difficulty, all proficiency tests have in common the fact that they are not based on courses that candidates may have previously taken. On the other hand, as we saw in Chapter 1, such tests may themselves exercise considerable influence over the method and content of language courses. Their backwash effect - for this is what it is - may be beneficial or harmful. In my view, the effect of some widely used proficiency tests is more harmful than beneficial. However, the teachers of students who take such tests, and whose work suffers from a harmful backwash effect, may be able to exercise more influence over the testing organisations concerned than they realise. The recent addition to TOEFL, referred to in Chapter 1, is a case in point.
Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. They are of two kinds: final achievement tests and progress achievement tests.

Final achievement tests are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or by members of teaching institutions.
Clearly the content of these tests must be related to the courses with which they are concerned, but the nature of this relationship is a matter of disagreement amongst language testers. In the view of some testers, the content of a final achievement test should be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the 'syllabus-content approach'. It has an obvious appeal, since the test only contains what it is thought that the students have actually encountered, and thus can be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, then the results of a test can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives. For example, a course may have as an objective the development of conversational ability, but the course itself and the test may require students only to utter carefully prepared statements about their home town, the weather, or whatever. Another course may aim to develop a reading ability in German, but the test may limit itself to the vocabulary the students are known to have met. Yet another course is intended to prepare students for university study in English, but the syllabus (and so the course and the test) may not include listening (with note taking) to English delivered in lecture style on topics of the kind that the students will have to deal with at university. In each of these examples - all of them based on actual cases - test results will fail to show what students have achieved in terms of course objectives.
The alternative approach is to base the test content directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives. Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is my belief that to base test content on course objectives is much to be preferred: it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.1

1. Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.


Now it might be argued that to base test content on objectives rather than on course content is unfair to students. If the course content does not fit well with the objectives, they will be expected to do things for which they have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.
I
- The reader ma nder at this stage *hgbe--_--LSqL._!S _i_n1ttl
t
t ditierence betwe ment tests an retsts. [ar-esrrs
I 'rthe rne
_F^-
I
I
t r - -.- -^^;J^ ^- 'nO
I r'fl
I
I
e torm and conlenr of the tu'o

I
I
I an achievement test has been con-
iI
srructed, and be aware of the possibly limited validity and
applicability
I
! Li,.r, ,.or.r. Test writers' on the rther hand, must create achievement
I ;;r;; t;hi.h ,*fl..t the objectives of a particular course, andanot
?
l
exPect a
satisfactory
\ ;.;;r;if roficiency test (or some imitition of it) to provide
I
t

Progress achievement tests

Progress achievement tests, as their name suggests, are intended to measure the progress that students are making towards the achievement of course objectives. One way of measuring such progress would be repeatedly to administer final achievement tests, the (hopefully) increasing scores indicating the progress made. This is not really feasible, particularly in the early stages of a course. The low scores obtained would be discouraging to students and quite possibly to their teachers.

The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives. Then if the syllabus and teaching are appropriate to these objectives, progress tests based on short-term objectives will fit well with what has been taught. If not, there will be pressure to create a better fit. If it is the syllabus that is at fault, it is there that change is needed, not in the tests.

In addition to the more formal progress achievement tests, which require careful preparation, teachers should feel free to set their own less formal tests. Since such tests will not form part of formal assessment procedures, their construction and scoring need not be too rigorous. Nevertheless, they should be seen as measuring progress towards the intermediate objectives on which the more formal progress achievement tests are based. They can, however, reflect the particular 'route' that an individual teacher is taking towards the achievement of objectives.
It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, political manoeuvring may be called for - topics on which this book cannot pretend to give advice.

Diagnostic tests

Diagnostic tests are used to identify students' strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary. At the level of broad language skills this is reasonably straightforward. We can be fairly confident of our ability to create tests that will tell us that a student is particularly weak in, say, speaking as opposed to reading in a language. Indeed existing proficiency tests may often prove adequate for this purpose. We may also be able to go further, analysing samples of a student's performance in writing or speaking in order to create profiles of the student's ability with respect to such categories as 'grammatical accuracy' or 'linguistic appropriacy' (see Chapter 9 for a scoring system that provides such profiles).

But it is not so easy to obtain a detailed analysis of a student's command of grammatical structures, something which would tell us, for example, whether she or he had mastered the present perfect/past tense distinction. To be sure of this, we would need a number of examples of the choice the student made between the two structures in every context which we thought was significantly different and important enough to warrant obtaining information on. A single example of each would not be enough, since a student might give the correct response by chance. As a result, a comprehensive diagnostic test of grammar would be vast (think of what would be involved in testing the modal verbs, for instance). The size of such a test would make it impractical to administer in a routine fashion. For this reason, very few tests are constructed for purely diagnostic purposes, and those that there are do not provide very detailed information.

The lack of good diagnostic tests is unfortunate. They could be extremely useful for individualised instruction or self-instruction. Learners would be shown the gaps in their command of the language and be directed to sources of information, exemplification and practice. The ready availability of relatively inexpensive computers with large memories may change the situation. Well-written computer programs would ensure that the learner spent no more time than necessary to obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.

Placement tests

Placement tests are intended to provide information that will help to place students at the stage of the teaching programme most appropriate to their abilities. Typically they are used to assign students to classes at different levels.

A placement test can be bought, but this is to be recommended only if the institution concerned is sure that the test being considered suits its particular teaching programme. No one placement test will work for every institution, and the initial assumption about any test that is commercially available must be that it will not work well. The placement tests that are most successful are those constructed for particular situations. This usually means that they have been produced 'in house'. The work that goes into their construction is rewarded by the saving in time and effort through accurate placement. An example of how a placement test might be designed within an institution is given in a later chapter.

Direct versus indirect testing

So far in this chapter we have considered a number of uses to which test results are put. We now distinguish between two approaches to test construction.

Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well candidates can write compositions, we get them to write compositions. If we want to know how well they pronounce a language, we get them to speak. The tasks, and the texts which are used, should be as authentic as possible. The fact that candidates are aware that they are in a test situation means that the tasks cannot be really authentic. Nevertheless the effort is made to make them as realistic as possible.

Direct testing is easier to carry out when it is intended to measure the productive skills of speaking and writing. The very acts of speaking and writing provide us with information about the candidate's ability. With listening and reading, however, it is necessary to get candidates not only to listen or read but also to demonstrate that they have done this successfully. The tester has to devise methods of eliciting such evidence accurately and without the method interfering with the performance of the skills in which he or she is interested. Methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.
Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.
Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.

where the candidate has to identify which of the underlined elements is erroneous or inappropriate in formal standard English. While the ability to respond to such items has been shown to be related to the ability to write compositions, the strength of the relationship was not great, and the two are clearly not the same thing. Another example of indirect testing is Lado's proposed method of testing pronunciation ability by a paper and pencil test in which the candidate has to identify pairs of words which rhyme with each other.

Perhaps the main appeal of indirect testing is that it seems to offer the possibility of testing a representative sample of a finite number of abilities which underlie a potentially indefinitely large number of manifestations of them. If, for example, we take a representative sample of grammatical structures, then, it may be argued, we have taken a sample which is relevant for all the situations in which control of grammar is necessary. By contrast, direct testing is inevitably limited to a rather small sample of tasks, which may call on a restricted and possibly unrepresentative range of grammatical structures. On this argument, indirect testing is superior to direct testing in that its results are more generalisable.

The main problem with indirect tests is that the relationship between performance on them and performance of the skills in which we are usually more interested tends to be rather weak in strength and uncertain in nature. We do not yet know enough about the component parts of, say, composition writing to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what you will. But we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).
It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely (for example require at least two compositions, each calling for a different kind of writing and on a different topic), we can expect more accurate estimates of the abilities that really concern us than would be obtained through indirect testing. The fact that direct tests are generally easier to construct simply reinforces this view with respect to institutional tests, as does their greater potential for beneficial backwash. It is only fair to say, however, that many testers are reluctant to commit themselves entirely to direct testing and will always include an indirect element in their tests. Of course, to obtain diagnostic information on underlying abilities, such as control of particular grammatical structures, indirect testing is called for.

Discrete point versus integrative testing

Discrete point testing refers to the testing of one element at a time, item by item. This might, for example, take the form of a series of items each testing a particular grammatical structure. Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of a task, such as writing a composition, taking a dictation, or completing a cloze passage. Clearly this distinction is not unrelated to that between indirect and direct testing. Discrete point tests will almost always be indirect, while integrative tests will tend to be direct. However, some integrative testing methods, such as the cloze procedure, are indirect.

Norm-referenced versus criterion-referenced testing

Imagine that a reading test is administered to an individual student. When we ask how the student performed on the test, we may be given two kinds of answer. An answer of the first kind would be that the student obtained a score that placed her or him in the top ten per cent of candidates who have taken that test, or in the bottom five per cent; or that she or he did better than sixty per cent of those who took it. A test which is designed to give this kind of information is said to be norm-referenced. It relates one candidate's performance to that of other candidates. We are not told directly what the student is capable of doing in the language.
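Answers of this first kind are percentile statements, and the arithmetic behind them is simple to sketch. The scores below are invented for illustration; nothing here comes from the text itself.

```python
def percentile_rank(score, cohort):
    """Per cent of candidates in the cohort who scored below this score."""
    below = sum(1 for s in cohort if s < score)
    return 100.0 * below / len(cohort)

# An invented cohort of twenty reading-test scores.
cohort = [35, 41, 44, 47, 50, 52, 55, 57, 58, 60,
          62, 63, 65, 68, 70, 72, 75, 78, 82, 90]

# A candidate scoring 62 did better than half of this cohort.
print(percentile_rank(62, cohort))
```

A criterion-referenced report, by contrast, would ignore the cohort entirely and simply state which tasks the candidate performed satisfactorily.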
The other kind of answer we might be given is exemplified by the following, taken from the Interagency Language Roundtable (ILR) language skill level descriptions for reading:

Sufficient comprehension to read simple, authentic written materials in a form equivalent to usual printing or typescript on subjects within a familiar context. Able to read with some misunderstandings straightforward, familiar, factual material, but in general insufficiently experienced with the language to draw inferences directly from the linguistic aspects of the text. Can locate and understand the main ideas and details in materials written for the general reader . . . The individual can read uncomplicated, but authentic prose on familiar subjects that are normally presented in a predictable sequence which aids the reader in understanding. Texts may include descriptions and narrations in contexts such as news items describing frequently-occurring events, simple biographical information, social notices, formulaic business letters, and simple technical information written for the general reader. Generally the prose that can be read by the individual is predominantly in straightforward/high-frequency sentence patterns. The individual does not have a broad active vocabulary . . . but is able to use contextual and real-world clues to understand the text.

Similarly, a candidate who is awarded the Berkshire Certificate of Proficiency in German Level 1 can 'speak and react to others using simple language in the following contexts':

- to greet, interact with and take leave of others;
- to exchange information on personal background, home, school life and interests;

- to discuss and make choices, decisions and plans;
- to express opinions, make requests and suggestions;
- to ask for information and understand instructions.

In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something about what he or she can actually do in the language. Tests which are designed to provide this kind of information directly are said to be criterion-referenced.2
The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set standards meaningful in terms of what people can do, which do not change with different groups of candidates; and they motivate students to attain those standards.
The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible. The norm-referenced test, on the other hand, while its content might be based on a similar analysis, is not so restricted. The TOEFL, for example, has grammar, vocabulary and reading comprehension components. A candidate's score on the test does not tell us directly what his or her ability is in relation to the demands that would be made on it at an English-medium university. To know this, we must consult a table which makes recommendations as to the academic load that a student with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

2. People differ somewhat in their use of the term 'criterion-referenced'. This is unimportant provided that the sense in which it is used is made clear. The sense in which it is used here is the one which I feel will be most useful to the reader in analysing testing problems.
Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those for criterion-referenced tests are not. The view taken in this book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have. The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration.

Objective testing versus subjective testing

The distinction here is between methods of scoring, and nothing else. If no judgement is required on the part of the scorer, then the scoring is objective. A multiple choice test, with the correct responses unambiguously identified, would be a case in point. If judgement is called for, the scoring is said to be subjective. There are different degrees of subjectivity in testing. The impressionistic scoring of a composition may be considered more subjective than the scoring of short answers in response to questions on a reading passage.

Objectivity in scoring is sought after by many testers, not for itself, but for the greater reliability it brings. In general, the less subjective the scoring, the greater agreement there will be between two different scorers (and between the scores of one person scoring the same test paper on different occasions). However, there are ways of obtaining reliable subjective scoring, even of compositions. These are discussed first in Chapter 5.

Communicative language testing

Much has been written in recent years about 'communicative language testing'. Discussions have centred on the desirability of measuring the ability to take part in acts of communication (including reading and listening) and on the best way to do this. It is assumed in this book that it is usually communicative ability which we want to test. As a result, what I believe to be the most significant points made in discussions of communicative testing are to be found throughout. A recapitulation under a separate heading would therefore be redundant.

READER ACTIVITIES

Consider a number of language tests with which you are familiar. For each of them, answer the following questions:

1. What is the purpose of the test?
2. Does it involve direct or indirect testing (or a mixture of both)?
3. Are the items discrete point or integrative (or a mixture of both)?
4. Which items are objective, and which are subjective? Can you order the subjective items according to degree of subjectivity?
5. Is the test norm-referenced or criterion-referenced?
6. Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers.
7. What relationship is there between the answers to question 5 and the answers to the other questions?

Further reading

. . . towards achievement test content . . . Alderson (1987) reports on research . . . the computer to language testing . . . to be as authentic as possible: Vol. . . . of Language Testing is devoted to articles on . . . An account of the development of an indirect measure of writing ability is given in Godshalk et al. (1965). Classic short papers on . . . norm-referencing (not restricted to language testing) . . . Descriptions at a number of levels for the four skills, . . . in academic contexts, have been produced by the American Council on the Teaching of Foreign Languages and are available from ACTFL at 579 Broadway, Hastings-on-Hudson, NY 10706, USA. It should be said, however, that the scales . . . and the way in which they were constructed . . . have been the subject of some controversy. Doubts about the . . . approach to language testing are expressed by . . . ; see Hughes (1986). Carroll (1961) made the distinction between discrete point and integrative language testing. Oller (1979) discusses integrative testing techniques. Morrow (1979) is a seminal paper on communicative language testing. Further discussion of the topic is to be found in Canale and Swain (1980), Alderson and Hughes (1981, Part 1), Hughes and Porter (1983), and Davies (1988). Weir's (1988a) book has as its title Communicative Language Testing.

4 Validity

We saw in Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.

Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction.

A test that lacks content validity is likely to have a harmful backwash effect, since areas which are not tested are likely to become areas ignored in teaching. Too often the content of tests is determined by what is easy to test rather than by what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. More on the writing of specifications and on the judgement of content validity is to be found in Chapter 7.
Criterion-related validity

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of 'functions' which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?

From the point of view of content validity, this will depend on how many of the functions are tested in the component, and on how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further. We can attempt to establish the concurrent validity of the component.

To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45 minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of


the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.

Box 1

To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square that coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to Appendix 1.
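The arithmetic of Box 1 is easy to reproduce. The sketch below uses invented scores for ten students on the full 45-minute oral test and the ten-minute version; the Pearson formula used to obtain the coefficient is standard statistics, not a procedure prescribed by the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores: the full test and the ten-minute version, same ten students.
full_test  = [62, 70, 55, 80, 47, 66, 73, 58, 85, 51]
short_test = [60, 68, 50, 78, 52, 70, 71, 55, 88, 49]

r = pearson(full_test, short_test)
print(round(r, 2))              # the validity coefficient
print(round(r * r * 100))       # squared: per cent 'agreement', as in Box 1
```

With closely tracking scores like these the coefficient comes out high; whether that level is high enough depends, as the next paragraph argues, on the decisions the test supports.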

Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.
It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.
The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 20 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognised. As a result, a validity coefficient of this order is generally regarded as satisfactory. The further reading section at the end of the chapter gives references to the recent reports on the validation of the British Council's ELTS test, in which these issues are discussed at length.


. . . students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.

Construct validity

A test, part of a test, or a testing technique is said to have construct
validity if it can be demonstrated that it measures just the ability which it
is supposed to measure. The word 'construct' refers to any underlying
ability (or trait) which is hypothesised in a theory of language ability.
One might hypothesise, for example, that the ability to read involves a
number of sub-abilities, such as the ability to guess the meaning of
unknown words from the context in which they are met. It would be a
matter of empirical research to establish whether or not such a distinct
ability exists and can be measured. The same is true of larger constructs
like 'reading ability' and 'writing ability'. Similarly, the direct measure-
ment of writing ability raises few problems of this kind; indirect
measurement is another matter. Suppose that we hypothesise that
underlying writing ability are a number of sub-abilities, such as control
of punctuation and sensitivity to demands on style. We construct items
that are meant to measure these sub-abilities and administer them as a
pilot test. How do we know that the test is really measuring writing
ability? One step we might take is to obtain extensive samples of the
writing of the group to which the test is first administered, and have these
scored independently of the test. We would then compare the scores on
the pilot test with the scores given for the samples of writing. If there is a
high level of agreement (the correlational methods described in the
previous section can be used), we have evidence that we are measuring
writing ability with the test: we have developed a satisfactory indirect
test and demonstrated the reality of the underlying constructs.


To investigate the constructs further, for each hypothesised sub-ability
we would obtain a set of scores. Correlation coefficients could then be
calculated between them. If the coefficients between scores on the same
construct are consistently higher than those between scores on different
constructs, then we have evidence that we are indeed measuring separate
and identifiable constructs.
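The comparison just described can be sketched in Python. The scores below are invented purely for illustration; the point is only that scores on two tasks tapping the same hypothesised construct (here, guessing word meaning from context) should correlate more highly with each other than with scores on a task tapping a different construct (punctuation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented scores for ten students on two guessing-from-context tasks
# and one punctuation task (illustrative data only).
guess_a = [12, 15, 9, 18, 11, 14, 16, 8, 13, 17]
guess_b = [11, 16, 10, 17, 12, 13, 15, 9, 14, 18]
punct   = [14, 9, 16, 10, 15, 8, 11, 17, 12, 13]

same_construct = pearson(guess_a, guess_b)        # high
different_construct = pearson(guess_a, punct)     # low (here negative)
print(round(same_construct, 2), round(different_construct, 2))
```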
Construct validation is a research activity, the means by which theories
are put to the test and are confirmed, modified, or abandoned. It is
through construct validation that language testing can be put on a
sounder, more scientific footing. But it will not all happen overnight;
there is a long way to go. In the meantime, the practical language tester
should try to keep abreast of what is known. When in doubt, where it is
possible, direct testing of abilities is recommended.

Face validity

A test is said to have face validity if it looks as if it measures what it is
supposed to measure. For example, a test which pretended to measure
pronunciation ability but which did not require the candidate to speak
(and there have been some) might be thought to lack face validity. This
would be true even if the test's construct and criterion-related validity
could be demonstrated. Face validity is hardly a scientific concept, yet it
is very important. A test which does not have face validity may not be
accepted by candidates, teachers, education authorities or employers. It
may simply not be used; and if it is used, the candidates' reaction to it
may mean that they do not perform on it in a way that truly reflects their
ability. Novel techniques, particularly those which provide indirect
measures, have to be introduced slowly, with care, and with convincing
explanations.

The use of validity

What use is the reader to make of the notion of validity? First, every
effort should be made in constructing tests to ensure content validity.
Where possible, a test should be validated empirically against some
criterion. Particularly where it is intended to use indirect testing,
reference should be made to the research literature to confirm that
measurement of the relevant underlying constructs has been
demonstrated using the testing techniques that are to be used (this may
often result in disappointment; another reason for favouring direct
testing!).
Any published test should supply details of its validation, without
which its validity (and suitability) can hardly be judged by a potential
purchaser. Tests for which validity information is not available should be
treated with caution.

READER ACTIVITIES

Consider any tests with which you are familiar. Assess each of them in
terms of the various kinds of validity that have been presented in this
chapter. What empirical evidence is there that the test is valid? If evidence
is lacking, how would you set about gathering it?

Further reading
For general discussion of test validity and ways of measuring it, see
Anastasi (1975). For an interesting recent example of test validation (of
the British Council ELTS test) in which a number of important issues are
raised, see Criper and Davies (1988) and Hughes, Porter and Weir
(1988). For the argument (with which I do not agree) that there is no
criterion against which 'communicative' language tests can be validated
(in the sense of criterion-related validity), see Morrow (1985). Bachman
and Palmer (1981) is a good example of construct validation. For a
collection of papers on language testing research, see Oller (1983).

5 Reliability

Imagine that a hundred students take a 100-item test at three o'clock one
It
Thursday afternoon. The test is not impossibly difficult or ridiculously
easy for these students, so they do not all get zero or a perfect score of
100. Now what if in fact they had not taken the test on the Thursday but
had taken it at three o'clock the previous afternoon? Would we expect
each student to have got exactly the same score on the Wednesday as they
actually did on the Thursday? The answer to this question must be no.

Even if we assume that the test is excellent, that the conditions of


administration are almost identical, that the scoring calls for no judge-
ment on the part of the scorers and is carried out with perfect care, and
that no learning or forgetting has taken place during the one-day interval
- nevertheless we would not expect every individual to get precisely the
same score on the Wednesday as they got on the Thursday. Human
beings are not like that; they simply do not behave in exactly the same
way on every occasion, even when the circumstances seem identical.
But if this is the case, it would seem to imply that we can never have
complete trust in any set of test scores. We know that the scores would
have been different if the test had been administered on the previous or
the following day. This is inevitable, and we must accept it. What we have
to do is construct, administer and score tests in such a way that the scores
actually obtained on a test on a particular occasion are likely to be very
similar to those which would have been obtained if it had been admin-
istered to the same students with the same ability, but at a different time.
The more similar the scores would have been, the more reliable the test is
said to be.
Look at the hypothetical data in Table 1a). They represent the scores
obtained by ten students who took a 100-item test (A) on a particular
occasion, and those that they would have obtained if they had taken it a
day later. Compare the two sets of scores. (Do not worry for the moment
about the fact that we would never be able to obtain this information.
Ways of estimating what scores people would have got on another
occasion are discussed later. The most obvious of these is simply to have
people take the same test twice.) Note the size of the difference between
the two scores for each student.


TABLE 1a) SCORES ON TEST A (INVENTED DATA)

Student     Score obtained    Score which would have
                              been obtained on the
                              following day

Bill              58                 82
Mary              46                 28
Ann               19                 34
Harry             89                 67
Cyril             43                 63
Pauline           55                 59
Don               43                 35
Colin             27                 23
Irene             76                 62
Sue               62                 49

Now look at Table 1b), which displays the same kind of information for a
second 100-item test (B). Again note the difference in scores for each
student.

TABLE 1b) SCORES ON TEST B (INVENTED DATA)

Student     Score obtained    Score which would have
                              been obtained on the
                              following day

Bill              65                 59
Mary              48                 52
Ann               23                 21
Harry             85                 90
Cyril             44                 39
Pauline           56                 59
Don               38                 35
Colin             19                 16
Irene             67                 62
Sue               52                 57

Which test seems the more reliable? The differences between the two sets
of scores are much smaller for Test B than for Test A. On the evidence
that we have here (and in practice we would not wish to make claims
about reliability on the basis of such a small number of individuals), Test
B appears to be more reliable than Test A.
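The impression given by the two tables can be quantified by correlating each set of obtained scores with the scores for the following day (anticipating the reliability coefficient introduced below). A sketch using the invented data of Tables 1a) and 1b):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Test A: score obtained, and score for the following day (Table 1a)
test_a_day1 = [58, 46, 19, 89, 43, 55, 43, 27, 76, 62]
test_a_day2 = [82, 28, 34, 67, 63, 59, 35, 23, 62, 49]
# Test B (Table 1b)
test_b_day1 = [65, 48, 23, 85, 44, 56, 38, 19, 67, 52]
test_b_day2 = [59, 52, 21, 90, 39, 59, 35, 16, 62, 57]

print(round(pearson(test_a_day1, test_a_day2), 2))   # about 0.67
print(round(pearson(test_b_day1, test_b_day2), 2))   # about 0.98
```

The far higher figure for Test B confirms what inspection of the tables suggested.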
Look now at Table 1c), which represents the scores of the same students
on an interview using a five-point scale.
TABLE 1c) SCORES ON INTERVIEW (INVENTED DATA)

Student     Score obtained    Score which would have
                              been obtained on the
                              following day

Bill               5                  3
Mary               4                  5
Ann                2                  4
Harry              5                  2
Cyril              2                  4
Pauline            3                  5
Don                3                  1
Colin              1                  2
Irene              4                  5
Sue                3                  1

In one sense the two sets of interview scores are very similar. The largest
difference between a student's actual score and the one which would have
been obtained on the following day is 3. But the largest possible
difference is only 4! Really the two sets of scores are very different. This
becomes apparent once we compare the size of the differences between
students with the size of the differences between scores for individual
students. They are of about the same order of magnitude. The result of
this can be seen if we place the students in order according to their
interview scores: the order based on their actual scores is quite different
from the order based on the scores they would have obtained if they had
been interviewed on the following day. This interview turns out not to be
very reliable at all.

The reliability coefficient
It is possible to quantify the reliability of a test in the form of a reliability
coefficient. Reliability coefficients are like validity coefficients (Chapter
4): they allow us to compare the reliability of different tests. The ideal
reliability coefficient is 1. A test with a reliability coefficient of 1 is one
which would give precisely the same results for a particular set of
candidates regardless of when it happened to be administered. A test
which had a reliability coefficient of zero (and let us hope that no such


test exists!) would give sets of results quite unconnected with each other,
in the sense that the score that someone actually got on a Wednesday
would be no help at all in attempting to predict the score he or she would
get if they took the test the next day. It is between these two extremes of 1
and zero that genuine test reliability coefficients are to be found.
Certain authors have suggested how high a reliability coefficient we
should expect for different types of language tests. Lado (1961), for
example, says that good vocabulary, structure and reading tests are
usually in the .90 to .99 range, while auditory comprehension tests are
more often in the .80 to .89 range. Oral production tests may be in the .70
to .79 range. He adds that a reliability coefficient of .85 might be
regarded as high for an oral production test but low for a reading test.
These suggestions reflect what Lado sees as the difficulty in achieving
reliability in the testing of the different abilities. In fact the reliability
coefficient that is to be sought will depend also on other considerations,
most particularly the importance of the decisions that are to be taken on
the basis of the test. The more important the decisions, the greater
reliability we must demand: if we are to refuse someone the opportunity
to study overseas because of their score on a language test, then we have
to be pretty sure that their score would not have been much different if
they had taken the test a day or two earlier or later. The next section will
explain how the reliability coefficient can be used to arrive at another
figure (the standard error of measurement) to estimate likely differences
of this kind. Before this is done, however, something has to be said about
the way in which reliability coefficients are arrived at.


The most obvious way of obtaining the necessary two sets of scores is to
have a group of subjects take the same test twice. This is known as the
test-retest method. The drawbacks are not difficult to see. If the second
administration of the test
is too soon after the first, then subjects are likely to recall items and their
responses to them, making the same responses more likely and the
reliability spuriously high. If there is too long a gap between administra-
tions, then learning (or forgetting!) will have taken place, and the
coefficient will be lower than it should be. However long the gap, the
subjects are unlikely to be very motivated to take the same test twice, and
this too is likely to have a depressing effect on the coefficient. These
effects are reduced somewhat by the use of two different forms of the
same test (the alternate forms method). However, alternate forms are
often simply not available.
It turns out, surprisingly, that the most common methods of obtaining
the necessary two sets of scores involve only one administration of one
test. Such methods provide us with a coefficient of internal consistency.
The most basic of these is the split half method. In this the subjects take
the test in the usual way, but each subject is given two scores. One score is
for one half of the test, the other score for the other half. The two sets of
scores are then used to estimate the reliability coefficient as if the whole
test had been taken twice. For this method to work, it is necessary

for the test to be split into two halves which are really equivalent, through
the careful matching of items (in fact, where items in the test have been
ordered in terms of difficulty, a split into odd-numbered items and
even-numbered items may be adequate). It can be seen that this method is
rather like the alternate forms method, except that the two 'forms' are
only half the length.¹
It has been demonstrated empirically that this altogether more
economical method will indeed give good estimates of alternate forms
coefficients, provided that the alternate forms are closely equivalent to
each other. Details of other methods of estimating reliability and of
carrying out the necessary statistical calculations are to be found in
Appendix 1.
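The split half procedure can be sketched as follows. The 0/1 item scores are invented, and the adjustment in the final line is the statistical correction for the halved test length mentioned in footnote 1 (commonly known as the Spearman-Brown correction):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def split_half_reliability(item_scores):
    """Estimate reliability from one administration by splitting the test
    into odd- and even-numbered items and correlating the half totals."""
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_totals, even_totals)
    # Adjust for the halved length (Spearman-Brown correction)
    return 2 * r_half / (1 + r_half)

# Invented results: one row of 0/1 item scores per subject
scores = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
print(round(split_half_reliability(scores), 2))   # about 0.84
```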
The standard error of measurement and the true score

While the reliability coefficient allows us to compare the reliability of
tests, it does not tell us directly how close an individual's actual score is
to what he or she might have scored on another occasion. With a little
further calculation, however, it is possible to estimate how close a
person's actual score is to what is called their true score.
Imagine that it were possible for someone to take the same language
test over and over again, an indefinitely large number of times, without
their performance being affected by having already taken the test, and
without their ability in the language changing. Unless the test is perfectly
reliable, and provided that it is not so easy or difficult that the student
always gets full marks or zero, we would expect their scores on the
various administrations to vary. If we had all of these scores, we would
be able to calculate their average, and it would seem reasonable to regard
this average as the one that best represents the student's ability with
respect to this particular test. It is this score, which for obvious reasons
we can never know for certain, which is referred to as the candidate's
true score.²
We are able to make statements about the probability that a candi-
date's true score (the one which best represents their ability on the test) is
within a certain number of points of the score they actually obtained
on the test. In order to do this, we first have to calculate the standard
error of measurement of the particular test.³ The calculation (described in

1. Because of the reduced length, which will cause the coefficient to be less than it would be
for the whole test, a statistical adjustment has to be made (see Appendix 1 for details).

Appendix 1) makes use of the reliability coefficient and a measure of the
spread of scores on the test. Its use is best illustrated by an example.
Suppose that a test has a standard error of measurement of 5. An
individual scores 56 on that test. We are then in a position to make the
following statements:

We can be about 68 per cent certain that the person's true score lies in the
range of 51 to 61 (i.e. within one standard error of measurement of the
score actually obtained on this occasion).
We can be about 95 per cent certain that their true score is in the range 46
to 66 (i.e. within two standard errors of measurement of the score
actually obtained).
We can be 99.7 per cent certain that their true score is in the range 41 to
71 (i.e. within three standard errors of measurement of the score
actually obtained).
These statements are based on what is known about the pattern of
scores that would occur if it were in fact possible for someone to take the
test repeatedly in the way described above. About 68 per cent of their
scores would be within one standard error of measurement of their true
score, and so on. If in fact they take the test only once, we cannot be sure
how their score on that occasion relates to their true score, but we are
still able to make statements of the kind given above. Whenever decisions
are taken on the basis of test scores, the standard error of measurement
should be borne in mind.
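The calculation described in Appendix 1 is commonly expressed as SEM = s√(1 − r), where s is the standard deviation of the scores and r the reliability coefficient. A sketch reproducing the statements above (the standard deviation of 15 and reliability of .89 are invented so as to give a standard error of measurement of about 5):

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

sem = standard_error_of_measurement(15, 0.89)   # about 5
obtained = 56
for n_errors, certainty in [(1, 68), (2, 95), (3, 99.7)]:
    low = obtained - n_errors * sem
    high = obtained + n_errors * sem
    print(f"about {certainty}% certain: true score between {low:.0f} and {high:.0f}")
```

Run as written, this prints the 51-61, 46-66 and 41-71 ranges given in the example.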


2. These statistical statements are based on what is known about the way a person's scores
would tend to be distributed if they took the same test an indefinitely large number of
times (without the experience of any test-taking occasion affecting performance on any
other occasion). The scores would follow what is called a normal distribution (detailed
discussion of which is beyond the scope of the present book). It is the known properties
of the normal distribution which allow us to say what proportion of scores will fall
within a certain range (for example, about 68 per cent of scores will fall within one
standard error of measurement of the true score). Since about 68 per cent of actual
scores will be within one standard error of measurement of the true score, we can be
about 68 per cent certain that any particular actual score will be within one standard
error of measurement of the true score.
3. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of
measurement. It is the particular use made of particular scores in relation to a particular
standard error of measurement which may be considered acceptable or unacceptable.
Test users should be provided with not only the reliability coefficient but
also the standard error of measurement. If a test is not reliable, we know
that the actual scores of many individuals are likely to be quite different
from their true scores, and we can place little reliance on those scores.
Even where reliability is quite high, there may in the case of some
individuals be a considerable discrepancy between actual score and true
score. For this reason we should be cautious about making important
decisions on the basis of the test scores of candidates whose actual scores
place them close to the cut-off point (the point that divides 'passes' from
'fails'). We should at least consider the possibility of gathering further
relevant information on the language ability of such candidates.
Having seen the importance of reliability, we shall consider, later in the
chapter, how to make our tests more reliable. Before that, however, we
shall look at another aspect of reliability.

Scorer reliability
In the first example given in this chapter we spoke about scores on a
multiple choice test. It was most unlikely, we thought, that every
candidate would get precisely the same score on both of two possible
administrations of the test. We assumed, however, that the scoring of the
test would be 'perfect'. That is, if a particular candidate did perform in
exactly the same way on the two occasions, they would be given the same
score on both occasions: any one scorer would give the same score on the
two occasions, and this would be the same score as would be given by
any other scorer on either occasion. It is possible to quantify the level of
agreement given by different scorers on different occasions by means of a
scorer reliability coefficient, which can be interpreted in a similar way to
the test reliability coefficient. In the case of the multiple choice test just
described, the scorer reliability coefficient would be 1. As we noted in
Chapter 3, when scoring calls for no judgement on the part of the scorer,
perfectly consistent scoring can be expected. We could not have assumed
it for the reliability coefficients of the interview scores discussed earlier in
the chapter; it would probably have seemed to the reader an unreason-
able assumption. We can accept that scorers should be able to be
consistent when there is only one easily

recognised correct response. But when a degree of judgement is called for
on the part of the scorer, as in the scoring of performance in an interview,
perfect consistency is not to be expected. Such subjective tests will not
have scorer reliability coefficients of 1! Indeed, there was a time when
many people thought that scorer reliability coefficients (and also the
reliability of the test) would always be too low to justify the use of
subjective measures of language ability in serious language testing. This
view is less widely held today. While the perfect reliability of objective
tests is not obtainable in subjective tests, there are ways of making it
sufficiently high for test results to be valuable. If the scoring of a test is
not reliable, then the test results cannot be reliable either. Indeed the test
reliability coefficient will almost certainly be lower than scorer reliability,
since other sources of unreliability will be additional to what enters
through imperfect scoring. In a case I know of, the scorer reliability
coefficient on a composition writing test was .92, while the reliability
coefficient for the test was .84. Variability in the performance of
individual candidates accounted for the difference between the two
coefficients.
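A scorer reliability coefficient of the kind just cited can be estimated by correlating two scorers' marks for the same scripts. A sketch with invented marks (out of 20) for eight compositions:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Invented marks given to the same eight compositions by two scorers
scorer_1 = [14, 9, 17, 11, 6, 15, 12, 18]
scorer_2 = [13, 10, 16, 12, 7, 14, 11, 17]
print(round(pearson(scorer_1, scorer_2), 2))   # a high coefficient
```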

How to make tests more reliable

As we have seen, there are two components of test reliability: the
performance of candidates from occasion to occasion, and the reliability
of the scoring. We will begin by suggesting ways of achieving consistent
performances from candidates and then turn our attention to scorer
reliability.

Take enough samples of behaviour


Other things being equal, the more items that you have on a test, the more
reliable that test will be. This seems intuitively right. If we wanted to
know how good an archer someone was, we wouldn't rely on the
evidence of a single shot at the target. That one shot could be quite
unrepresentative of their ability. To be satisfied that we had a really
reliable measure of the ability we would want to see a large number of
shots at the target.
The same is true for language testing. It has been demonstrated
empirically that the addition of further items will make a test more
reliable. There is even a formula (the Spearman-Brown formula; see
Appendix 1) that allows one to estimate how many additional items,
similar to the ones already in the test, will be needed to increase the
reliability coefficient to a required level.
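The Spearman-Brown formula just mentioned can be sketched as follows, to estimate how much longer a test must be made to reach a target reliability (the reliabilities and the 50-item length are invented for illustration):

```python
def lengthening_factor(current_r, target_r):
    """Spearman-Brown estimate of how many times longer a test must be,
    using comparable items, to raise its reliability from current_r
    to target_r."""
    return (target_r * (1 - current_r)) / (current_r * (1 - target_r))

# e.g. raising a 50-item test from a reliability of .75 to .90
factor = lengthening_factor(0.75, 0.90)
print(round(50 * factor))   # 150 items in all, i.e. 100 extra items
```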
Where candidates are given a selection of tasks, the particular choices
they make are likely to have a depressing effect on the reliability of the
test, since their performance might have been different had the test been
taken, say, a day later and a different selection made. In general,
therefore, candidates should not be given a choice, and the range over
which possible answers might vary should be restricted. Compare the
following writing tasks:
a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in
   this country.
d) Discuss the following measures intended to increase the number of
   foreign tourists coming to this country:
   i) More/better advertising and/or information (where? what form
      should it take?).
   ii) Improve facilities (hotels, transportation, communication etc.).
   iii) Training of personnel (guides, hotel managers etc.).
The successive tasks impose more and more control over what is
written. The fourth task is likely to be a much more reliable indicator of
writing ability than the first.
The general principle of restricting the freedom of candidates will be
taken up again in chapters relating to particular skills. It should perhaps
be said here, however, that in restricting the students we must be careful
not to distort too much the task that we really want to see them perform.
The potential tension between reliability and validity is taken up at the
end of the chapter.

Write unambiguous items


It is essential that candidates should not be presented with items whose
meaning is not clear or to which there is an acceptable answer which the
test writer has not anticipated. In a reading test I once set the following
open-ended question, based on a lengthy reading passage about English
dialects and accents: 'Where does the author direct the reader who is
interested in non-standard dialects of English?' The expected answer was
the Further reading section of the book, which is where the reader was
directed to. A number of candidates answered 'page 3', which was the
place in the text where the author actually said that the interested reader
should look in the Further reading section. Only the alertness of those
scoring the test revealed that there was a completely unanticipated
correct answer to the question. If that had not happened, then a number
of correct responses would have been scored as incorrect.

One thing to bear in mind when adding items to increase reliability,
however, is their independence from each other and from existing items.
Imagine a reading test that asks the question: 'Where did the thief
hide the jewels?' If an additional item following that took the form 'What
was unusual about the hiding place?', it would not make a full contri-
bution to an increase in the reliability of the test. Why not? Because it is
hardly possible for someone who got the original question wrong to get
the supplementary question right. Such candidates are effectively pre-
vented from answering the additional question; for them, in reality, there
is no additional question. We do not get an additional sample of their
behaviour, so the reliability of our estimate of their ability is not
increased.
Each additional item should as far as possible represent a fresh start for
the candidate. By doing this we are able to gain additional information
on all of the candidates, information which will make test results more
reliable. The word 'item' here should not be taken to mean only
question-and-answer items. In a test of writing, for example, where
candidates have to produce a number of passages, each of those passages
is to be regarded as an item. The more independent passages there are, the
more reliable will be the test. In the same way, in an interview used to test
oral ability, the candidate should be given as many 'fresh starts' as
possible. More detailed implications of the need to obtain sufficiently
large samples of behaviour will be outlined later in the book, in chapters
devoted to the testing of particular abilities.
While it is important to make a test long enough to achieve satisfactory
reliability, it should not be made so long that the candidates become so
bored or tired that the behaviour they exhibit becomes unrepresentative
of their ability. At the same time, it may often be necessary to resist
pressure to make a test shorter than is appropriate. The usual argument
for shortening a test is that it is not practical. The answer to this is that
accurate information does not come cheaply: if such information is
needed, then the price has to be paid. In general, the more important the
decisions based on a test, the longer the test should be. Jephthah used the
pronunciation of the word 'shibboleth' as a test to distinguish his own
men from the Ephraimites, who could not pronounce sh. Those who
failed the test were killed; those killed in error might have wished for a
longer, more reliable test.
Do not allow candidates too much freedom
In some kinds of language test there is a tendency to offer candidates a
choice of questions and then to allow them a great deal of freedom in the
way that they answer the ones that they have chosen. An example would
be a composition paper offering a choice of topics with complete
freedom of treatment. The fact that a


candidate might interpret the question in different ways on different
occasions means that the item is not contributing fully to the reliability of
the test.
The best way to arrive at unambiguous items is, having drafted them,
to subject them to the critical scrutiny of colleagues, who should try as
hard as they can to find alternative interpretations to the ones intended. If
this task is entered into in the right spirit, one of good-natured perversity,
most of the problems can be identified before the test is administered.
Pretesting of the items on a group of people comparable to those for
whom the test is intended (see Chapter 7) should reveal the remainder.
Where pretesting is not practicable, scorers must be on the lookout for
patterns of response that indicate that there are problem items.

Provide clear and explicit instructions

This applies both to written and oral instructions. If it is possible for
candidates to misinterpret what they are asked to do, then on some
occasions some of them certainly will. It is by no means always the
weakest candidates who are misled by ambiguous instructions; indeed it
is often the better candidate who is able to provide the alternative
interpretation. A common fault of tests written for the students of a
particular teaching institution is the supposition that the students all
know what is intended by carelessly worded instructions. The frequency
of the complaint that students are unintelligent, have been stupid, have
wilfully misunderstood what they were asked to do, reveals that the
supposition is often unwarranted. Test writers should not rely on the
students' powers of telepathy to elicit the desired behaviour. Again, the
use of colleagues to criticise drafts of instructions (including those which
will be spoken) is the best means of avoiding problems. Spoken instruc-
tions should always be read from a prepared text in order to avoid
introducing confusion.

Ensure that tests are well laid out and perfectly legible
Too often, institutional tests are badly typed (or handwritten), have too
much text in too small a space, and are poorly reproduced. As a result,
students are faced with additional tasks which are not ones meant to
measure their language ability. Their variable performance on these
unwanted tasks will lower the reliability of a test.
Candidates should be familiar with format and testing techniques

If any aspect of a test is unfamiliar to candidates, they are likely to
perform less well than they would do otherwise (on subsequently taking a
parallel version, for example). For this reason, every effort must be made
to ensure that all candidates know just what will be required of them.
This may mean the distribution of sample tests (or of past test papers), or
at least the provision of practice materials, in the case of tests produced
within teaching institutions.

Provide uniform and non-distracting conditions of administration
---71----;-*

The greater the differences between one administration of a test and
another, the greater the differences one can expect between a candidate's
performance on the two occasions. Great care should be taken to ensure
uniformity: timing should be specified and strictly adhered to, and the
acoustic conditions should be similar across all administrations of a
listening test. Every effort should be made to maintain a quiet setting free
of distractions.
We turn now to ways of obtaining scorer reliability, which, as we saw
above, is essential to test reliability.
Use items that permit scoring which is as objective as possible
This may appear to be a recommendation to use multiple choice items,
which permit completely objective scoring. This is not intended. While it
would be going too far to say that multiple choice items are never appro-
priate, it is certainly true that there are many testing purposes for which
they are quite inappropriate. What is more, good multiple choice items
are notoriously difficult to write and always require extensive pretesting.
A substantial part of Chapter 8 is given over to a discussion of the
construction and use of multiple choice items.
An alternative to multiple choice is the open-ended item with a unique,
possibly one-word, correct response which the candidates produce
themselves. This too should ensure objective scoring, but in fact
problems (with spelling, for example, which makes a candidate's
meaning unclear in a listening test) often make demands on the
scorer's judgement. The longer the required response, the greater the
difficulties of this kind. One way of dealing with this is to structure the
candidate's response by providing part of it. For example, the open-
ended question What was different about the results? may be designed to
elicit the response Success was closely associated with high motivation.
This is likely to cause problems for scoring. Greater scorer reliability will
probably be achieved if the question is followed by:
__________ was more closely associated with __________.

Items of this kind are discussed in later chapters.

Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not
be given a choice of items and that they should be limited in the way that
they are allowed to respond. Scoring the compositions all on one topic
will be more reliable than scoring compositions on a variety of topics
chosen by candidates who have been given too much freedom.

Provide a detailed scoring key

This should specify acceptable answers and assign points for acceptable
partially correct responses. It should be the outcome of efforts to
anticipate all possible responses and should have been subjected to group
criticism. (This advice applies only where responses can be classed as
partially or totally 'correct', not in the case of compositions, for
instance.)
(*l
*'fflg,n
---"--.-....".."-."-----'-,
t:-gI9rs-"''
'T1
lve. I ne sconng
L-!!te:Y"-*r-- '
' ' sflr
of composrtions, tor eXdlrlple' ' ould not be assigned to anyone rvho has
r
from
_ r__:_:^-_-^: ^__
past- administratrons.
;;rl.ffia. ,.or. accuratery compositions
ifr., each administrarion, p"tt.int of scoring should be. analysed.
inji'iau"rs whose scoring i.ui"t.t markedrl' and inconsistently from the
norm sliould not be used again'

Agree acceptable responses and appropriate scores at the outset of
scoring
A sample of scripts should be taken immediately after the administration
of the test. Where there are compositions, archetypical representatives of
different levels of ability should be selected. Only when all scorers are
agreed on the scores to be given to these should real scoring begin. Much
more will be said in Chapter 9 about the scoring of compositions.
For short answer questions, the scorers should note any difficulties
they have in assigning points (the key is unlikely to have anticipated every
relevant response), and bring these to the attention of whoever is
supervising the scoring. Once a decision has been taken as to

the points to be assigned, the supervisor should convey it to all the scorers
concerned.

Identify candidates by number, not name

Scorers inevitably have expectations of candidates that they know.
Except in purely objective testing, this will affect the way that they score.
Studies have shown that even where the candidates are unknown to the
scorers, the name on a script (or a photograph) will make a significant
difference to the scores given. For example, a scorer may be influenced by
the gender or nationality of a name into making predictions which can
affect the score given. The identification of candidates only by number
will reduce such effects.

Employ multiple, independent scoring

As a general rule, and certainly where testing is subjective, all scripts
should be scored by at least two independent scorers. Neither scorer
should know how the other has scored a test paper. Scores should be
recorded on separate score sheets and passed to a third, senior colleague,
who compares the two sets of scores and investigates discrepancies.
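The senior colleague's comparison step can be sketched as follows. The candidate numbers and the tolerance of two points are invented for illustration:

```python
# Flag papers where two independent scorers differ by more than an
# agreed tolerance, so that the discrepancy can be investigated.
def discrepancies(scores_1, scores_2, tolerance=2):
    """Return candidate numbers whose two scores differ by more
    than the tolerance."""
    return [number for number in scores_1
            if abs(scores_1[number] - scores_2[number]) > tolerance]

sheet_1 = {"cand_041": 14, "cand_042": 9, "cand_043": 17}
sheet_2 = {"cand_041": 13, "cand_042": 15, "cand_043": 16}
print(discrepancies(sheet_1, sheet_2))   # only cand_042 needs investigation
```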

Reliability and validity


To be valid a test must provide consistently accurate measurements. It
must therefore be reliable. A reliable test, however, may not be valid at
all. For example, as a writing test we might require candidates to write
down the translation equivalents of 500 words in their own language.
This could well be a reliable test, but it is unlikely to be a valid test of
writing.
In our efforts to make tests reliable, we must be wary of reducing their
validity. Earlier in this chapter it was admitted that restricting the scope
of what candidates are permitted to write may diminish the validity of
the task. This depends in part on what exactly we are trying to measure
by setting the task. If we are interested in candidates' ability to structure
a composition, then it would be hard to justify providing them with a
structure in order to increase reliability. At the same time we would still
try to restrict candidates in ways which would not render their
performance on the task invalid.
There will always be some tension between reliability and validity. The
tester has to balance gains in one against losses in the other.