Testing for
Language Teachers
Arthur Hughes
The right of the
University of Cambridge
to print and sell
all manner of books
was granted by
Henry VIII in 1534.
The University has printed
and published continuously
since 1584.
ISBN 0 521 25264 4 hard covers
ISBN 0 521 27260 2 paperback
Copyright
The law allows a reader to make a single copy of part of a book
for purposes of private study. It does not allow the copying of
entire books or the making of multiple copies of extracts. Written
permission for any such copying must always be obtained from the
publisher in advance.
Games for Language Learning
by Andrew Wright, David Betteridge and Michael Buckby
Discussions that Work - Task-centred fluency practice
by Penny Ur
Once Upon a Time - Using stories in the language classroom
by John Morgan and Mario Rinvolucri
Teaching Listening Comprehension
by Penny Ur
Keep Talking - Communicative fluency activities for language teaching
by Friederike Klippel
Working with Words - A guide to teaching and learning vocabulary
by Ruth Gairns and Stuart Redman
Learner English - A teacher's guide to interference and other problems
edited by Michael Swan and Bernard Smith
Testing Spoken Language - A handbook of oral testing techniques
by Nic Underhill
Literature in the Language Classroom - A resource book of ideas and activities
by Joanne Collie and Stephen Slater
Dictation - New methods, new possibilities
by Paul Davis and Mario Rinvolucri
Grammar Practice Activities - A practical guide for teachers
by Penny Ur
Testing for Language Teachers
by Arthur Hughes
For Vicky, Meg and Jake
Contents
Acknowledgements
Preface
1 Teaching and testing
2 Testing as problem solving: an overview of the book
3 Kinds of test and testing
4 Validity
5 Reliability
Index
Acknowledgements
The author and publishers would like to thank the following for
permission to reproduce copyright material:
American Council on the Teaching of Foreign Languages Inc. for extracts
from ACTFL Provisional Proficiency Guidelines and from Generic Guidelines;
for extracts examinationsl
ew Bofiaziqi UniversitY Language
ouncil for the draft of the scale ior
ress for lvI. Swan and C. Walter:
98 8 ) ; Educational Testing Services
ingual House for M. Garman and
s Chapter 4 (1983); The Foreign
pp. 35-8 (1979); Harper & Row,
Publishers, Inc. for the graph on p. 1Z1 from Basic Statistical Methods
by N. M. Downie and Robert W. Heath, copyright © 1985 by N. M.
Downie and Robert W. Heath, reprinted with permission of Harper &
Row, Publishers, Inc.
Preface
The simple objective of this book is to help language teachers write better
choose. It is also necessary to understand the principles of testing and
how they can be applied in practice.
It is relatively straightforward to introduce and explain the desirable
qualities of tests: validity, reliability, practicality, and beneficial backwash
(this last, which refers to the favourable effects tests can have on teaching
and learning, here receiving more attention than is usual in books of this
kind). It is much less easy to give realistic advice on how to achieve them
in teacher-made tests. One is tempted either to ignore the problem or to
present as a model the not always appropriate methods of large-scale
testing institutions. In resisting these temptations I have been compelled
to make explicit in my own mind much that had previously been vague
and intuitive. I have certainly benefited from doing this; I hope that
readers will too.
Exemplification throughout the book is from the testing of English as a
foreign language. This reflects both my own experience in language
testing and the fact that English will be the one language known by all
readers. I trust that it will not prove too difficult for teachers of other
languages to find or construct parallel examples of their own.
I must acknowledge the contributions of others: MA students at
Reading University, too numerous to mention by name, who have taught
me much, usually by asking questions that I could not answer; my friends
and colleagues, Paul Fletcher, Michael Garman, Don Porter, and Tony
Woods, who all read parts of the manuscript and made many helpful
suggestions; Barbara Barnes, who typed a first version of the early
chapters; Michael Swan, who gave good advice and much encouragement,
and who remained remarkably patient as each deadline for
completion passed; and finally my family, who accepted the writing of
the book as an excuse more often than they should. To all of them I am
very grateful.
Many language teachers harbour a deep mistrust of tests and of testers.
The starting point for this book is the admission that this mistrust is
frequently well-founded. It cannot be denied that a great deal of language
testing is of very poor quality. Too often language tests have a harmful
effect on teaching and learning; and too often they fail to measure
accurately whatever it is they are intended to measure.
The effect of testing on teaching and learning is known as backwash.
Backwash can be harmful or beneficial. If a test is regarded as important,
then preparation for it can come to dominate all teaching and learning
activities. And if the test content and testing techniques are at variance
with the objectives of the course, then there is likely to be harmful
backwash. An instance of this would be where students are following an
English course which is meant to train them in the language skills
(including writing) necessary for university study in an English-speaking
country, but where the language test which they have to take in order to
be admitted to a university does not test those skills directly. If the skill of
writing, for example, is tested only by multiple choice items, then there is
great pressure to practise such items rather than practise the skill of
writing itself. This is clearly undesirable.
We have just looked at a case of harmful backwash. However,
backwash need not always be harmful; indeed it can be positively
beneficial. I was once involved in the development of an English language
test for an English medium university in a non-English-speaking country.
The test was to be administered at the end of an intensive year of English
study there and would be used to determine which students would be
allowed to go on to their undergraduate courses (taught in English) and
which would have to leave the university. A test was devised which was
based directly on an analysis of the English language needs of first year
undergraduate students, and which included tasks as similar as possible
to those which they would have to perform as undergraduates (reading
textbook materials, taking notes during lectures, and so on). The
introduction of this test, in place of one which had been entirely multiple
tests
The second reason for mistrusting tests is that very often they fail to
measure accurately whatever it is that they are intended to measure.
Teachers know this. Students' true abilities are not always reflected in the
test scores that they obtain. To a certain extent this is inevitable.
Language abilities are not easy to measure; we cannot expect a level of
accuracy comparable to those of measurements in the physical sciences.
But we can expect greater accuracy than is frequently achieved.
Why are tests inaccurate? The causes of inaccuracy (and ways of
minimising their effects) are identified and discussed in subsequent
is possible h
ese concerrxs
1"-,
: ifrwe'Warl'i-
(
re of
have
; br-rt
surer
n the
of tens of thousands of be a
scoring
Teaching and testing
What is to be done?
The teaching profession can make two contributions to the improvement
of testing: they can write better tests themselves, and they can put pressure
on others, including professional testers and examining boards, to
improve their tests. This book represents an attempt to help them do both.
For the reader who doubts that teachers can influence the large testing
institutions, let this chapter end with a further reference to the testing of
writing through multiple choice items. This was the practice followed by
those responsible for TOEFL (Test of English as a Foreign Language), the
test taken by most non-native speakers of English applying to North
American universities. Over a period of many years they maintained that
it was simply not possible to test the writing ability of hundreds of
thousands of candidates by means of a composition: it was impracticable
and the results, anyhow, would be unreliable. Yet in 1985 a writing test
candidates actually have to write for
supplement to TOEFL, and already
are requiring applicants to take this
principal reason given for this change was
pressure from English language teachers who had finally convinced those
responsible for the TOEFL of the overriding need for a writing task
which would provide beneficial backwash.
READER ACTIVITIES
1 Think of tests with which you are familiar (the tests may be international
or local, written by professionals or by teachers). What do
you think the backwash effect of each of them is? Harmful or
beneficial? What are your reasons for coming to these conclusions?
2 Consider these tests again. Do you think that they give accurate or
inaccurate information? What are your reasons for coming to these
conclusions?
Further reading
For an account of how the introduction of a new test can have a striking
beneficial effect on teaching and learning, see Hughes (1988a). For a
review of the new TOEFL writing test which acknowledges its potential
beneficial backwash effect but which also points out that the narrow
range of writing tasks set (they are of only two types) may result in
narrow training in writing, see Greenberg (1986). For a discussion of the
ethics of language testing, see Spolsky (1981).
2 Testing as problem solving: an overview of the book
The purpose of this chapter is to introduce readers to the idea of testing as
problem solving and to show how the content and structure of the book
are designed to help them to become successful solvers of testing
problems.
Language testers are sometimes asked to say what is 'the best test' or
'the best testing technique'. Such questions reveal a misunderstanding of
what is involved in the practice of language testing. In fact there is no best
test or best technique. A test which proves ideal for one purpose may be
quite useless for another; a technique which may work very well in one
situation can be entirely inappropriate in another. As we saw in the
previous chapter, what suits large testing corporations may be quite out
of place in the tests of teaching institutions. In the same way, two
teaching institutions may require very different tests, depending amongst
other things on the objectives of their courses, the purpose and importance
of the tests, and the resources that are available. The assumption
that has to be made therefore is that each testing situation is unique and
so sets a particular testing problem. It is the tester's job to provide the best
solution to that problem. The aims of this book are to equip readers with
the basic knowledge and techniques first to solve such problems, secondly
to evaluate the solutions proposed or already implemented by others, and
thirdly to argue persuasively for improvements in testing practice where
these seem necessary.
,trl! sLeJ rnust
.]n every situatiggjbg
rEarwl"*l::jT:li:11:
solution. Every testing problem can be expressed in the
same general terms: we want to create a test or testing system which will:
consistently provide accurate measures of precisely the abilities¹ in
which we are interested;
have a beneficial effect on teaching
(in those cases where the tests are
likely to influence teaching);
be economical in terms of time and money.
technical sense. It refers simply to what people
example, include the ability to converse
y to recite grammatical rules (if that is
measuring!). It does not, however, refer to
Let us describe the general testing problem in a little more detail. The first
thing that testers have to be clear about is the purpose of testing in any
particular situation. Different purposes will usually require different kinds
of tests. This may seem obvious but it is something which seems not always
to be recognised. The purposes of testing discussed in this book are:
to measure language proficiency regardless of any language courses
ed
achieved the objectives of a course of
All tests cost time and money - to prepare, administer, score and
interpret. Time and money are in limited supply, and so there is often
likely to be a conflict between what appears to be a perfect testing
solution in a particular situation and considerations of practicality. This
issue is also discussed in Chapter 6.
To rephrase the general testing problem identified above: the basic
problem is to develop tests which are valid and reliable, which have a
beneficial backwash effect on teaching (where this is relevant), and which
are practical. The next four chapters of the book are intended to look
more closely at the relevant concepts and so help the reader to formulate
such problems clearly in particular instances, and to provide advice on
how to approach their solution.
The second half of the book is devoted to more detailed advice on the
construction and use of tests, the putting into practice of the principles
outlined in earlier chapters. Chapter 7 outlines and exemplifies the
various stages of test construction. Chapter 8 discusses a number of
testing techniques. Chapters 9-13 show how a variety of language
abilities can best be tested, particularly within teaching institutions.
Chapter 14 gives straightforward advice on the administration of tests.
We have to say something about statistics. Some understanding of
statistics is useful, indeed necessary, for a proper appreciation of testing
matters and for successful problem solving. At the same time, we have to
recognise that there is a limit to what many readers will be prepared to
do, especially if they are at all afraid of mathematics. For this reason,
statistical matters are kept to a minimum and are presented in terms that
everyone should be able to grasp. The emphasis will be on interpretation
rather than on calculation. For the more adventurous reader, however,
Appendix 1 explains how to carry out a number of statistical operations.
Further reading
The collection of critical reviews of nearly 50 English language tests
(mostly British and American), e Krahnke and Stansfield
(1987), reveals how well p iters are thought to
have solved their problems. A full of the reviews will
depend to some degree on an assimilation of the content of Chapters 3, 4,
and 5 of this book.
This chapter begins by considering the purposes for which language
testing is carried out. It goes on to make a number of distinctions:
between direct and indirect testing, between discrete point and integrative
testing, between norm-referenced and criterion-referenced testing,
and between objective and subjective testing. Finally there is a note on
communicative language testing.
Proficiency tests
Proficiency tests are designed to measure people's ability in a language
regardless of any training they may have had in that language. The
content of a proficiency test, therefore, is not based on the content or
objectives of language courses which people taking the test may have
followed. Rather, it is based on a specification of what candidates have to
be able to do in the language in order to be considered proficient. This
raises the question of what we mean by the word 'proficient'.
In the case of some proficiency tests, 'proficient' means having
sufficient command of the language for a particular purpose. An example
of this would be a test designed to discover whether someone can
function successfully as a United Nations translator. Another example
would be a test used to determine whether a student's English is good
enough to follow a course of study at a British university. Such a test may
even attempt to take into account the level and kind of English needed to
follow courses in particular subject areas. It might, for example, have one
form of the test for arts subjects, another for sciences, and so on.
Whatever the particular purpose to which the language is to be put, this
Kinds of test and testing
Most teachers are unlikely to be responsible for proficiency tests. It is
much more probable that they will be involved in the preparation and use
of achievement tests. In contrast to proficiency tests, achievement tests
are directly related to language courses, their purpose being to establish
how successful individual students, groups
themselves have been in achieving objectives.
d(f6ffii"t
a ch i eve m e n t te s ts n ch i e v e m e nt te ts'
s
s ano oI
sprr; ls tnat I
very misleading. Successful
performance on the test may not truly indicate successful achievement of
course objectives. For example, a course may have as an
1 Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve
them. This too can only be regarded as salutary. There may be disagreement as to why
there has been a failure to achieve the objectives, but at least this provides a starting
point for necessary discussion which otherwise might never have taken place.
an achievement test has been constructed,
and be aware of the possibly limited validity and applicability
of test scores. Test writers, on the other hand, must create achievement
tests which reflect the objectives of a particular course, and not
expect a general proficiency test (or some imitation of it) to provide
a satisfactory alternative.
Diagnostic tests
strengths and weaknesses.
what further teaching is
necessary. At the level of broad language skills this is reasonably
straightforward. We can be fairly confident of our ability to create tests
that will tell us that a student is particularly weak in, say, speaking as
proficiency tests may
Indeed existing proficiency may
se.
analysing samples of a student's
g in order to create profiles of the
ch categories as 'grammatical accu-
e Chapter 9 for a scoring system that
erroneous or inappropriate informa
h the strength of the relationship was
ot the same thing. Another example
proposed method of testing pronunciation
ability by a paper and pencil test in which the candidate has to
identify pairs of words which rhyme with each other.
In these two cases we learn nothing about how the individual's perform-
satisfactorily. The tasks are set, and the performances are evaluated. It
does not matter in principle whether all the candidates are successful, or
none of the candidates is successful. The tasks are set, and those who
perform them satisfactorily 'pass'; those who don't, 'fail'. This means
that students are encouraged to measure their progress in relation to
meaningful criteria, without feeling that, because they are less able than
most of their fellows, they are destined to fail. In the case of the Berkshire
German Certificate, for example, it is hoped that all students who are
entered for it will be successful. Criterion-referenced tests therefore have
two positive virtues: they terms of what
people can do, which do no s of candidates
with that score should be allowed to carry, this being based on experience
over the years of students with similar scores, not on any meaning in the
score itself. In the same way, university administrators have learned from
experience how to interpret TOEFL scores and to set minimum scores for
their own institutions.
Books on language testing have tended to give advice which is more
appropriate to norm-referenced testing than to criterion-referenced
testing. One reason for this may be that procedures for use with
norm-referenced tests (particularly with respect to such matters as the
analysis of items and the estimation of reliability) are well established,
while those for criterion-referenced tests are not. The view taken in this
book, and argued for in Chapter 6, is that criterion-referenced tests are
often to be preferred, not least for the beneficial backwash effect they are
likely to have. The lack of agreed procedures for such tests is not
sufficient reason for them to be excluded from consideration.
READER ACTIVITIES
Consider a number of language tests with which you are familiar. For
each of them, answer the following questions:
Further reading
towards achievement test content
erson (1987) reports on research
e computer to language testing.
o be as authentic as possible: Vol.
Testing is devoted to articles on
ount of the development of an
dshalk et al. (1965). Classic short
orm-referencing (not restricted to
4 Validity
Content validity
backwash effect. Areas which are not tested are likely to become areas
ignored in teaching
d-cr-ermiqsd -byrry-he
(g-..
J'fu*._d=:=_#
tTt
ntent is a fair reflect Ee on the
writing of specifications and on the judgement of content validity is to be
found in Chapter 7.
Criterion-related validity
similarity. Perfect agreement between two sets of scores will result in a
validity coefficient of 1. Total lack of agreement will give a coefficient of
zero. To get a feel for the meaning of a coefficient between these two
extremes, read the contents of Box 1.
Box 1
ability. The saving in time would not be worth the risk of appointing
someone with insufficient ability in the relevant foreign language. On the
other hand, a coefficient of the same size might be perfectly acceptable for
a brief interview forming part of a placement test.
It should be said that the criterion for concurrent validation is not
necessarily a proven, longer test. A test may be validated against, for
example, teachers' assessments of their students, provided that the
assessments themselves can be relied on. This would be appropriate
where a test was developed which claimed to be measuring something
different from all existing tests, as was said of at least one quite recently
developed 'communicative' test.
The second kind of criterion-related validity is predictive validity. This
concerns the degree to which a test can predict candidates' future
performance. An example would be how well a proficiency test could
predict a student's ability to cope with a graduate course at a British
university. The criterion measure here might be an assessment of the
student's English as perceived by his or her supervisor at the university, or
it could be the outcome of the course (pass/fail etc.). The choice of
criterion measure raises interesting issues. Should we rely on the subjective
and untrained judgements of supervisors? How helpful is it to use
final outcome as the criterion measure when so many factors other than
ability in English (such as subject knowledge, intelligence, motivation,
health and happiness) will have contributed to every outcome? Where
outcome is used as the criterion measure, a validity coefficient of around
0.4 (only 20 per cent agreement) is about as high as one can expect. This
is partly because of the other factors, and partly because those students
whose English the test predicted would be inadequate are not normally
permitted to take the course, and so the test's (possible) accuracy in
predicting problems for those students goes unrecognised. As a result, a
validity coefficient of this order is generally regarded as satisfactory. The
further reading section at the end of the chapter gives references to the
recent reports on the validation of the British Council's ELTS test, in
Face validity
which its validity (and suitability) can hardly be judged by a potential
purchaser. Tests for which validity information is not available should be
treated with caution.
READER ACTIVITIES
Consider any tests with which you are familiar. Assess each of them in
terms of the various kinds of validity that have been presented in this
chapter. What empirical evidence is there that the test is valid? If evidence
is lacking, how would you set about gathering it?
Further reading
For general discussion of test validity and ways of measuring it, see
Anastasi (1975). For an interesting recent example of test validation (of
the British Council ELTS test) in which a number of important issues are
raised, see Criper and Davies (1988) and Hughes, Porter and Weir
(1988). For the argument (with which I do not agree) that there is no
criterion against which 'communicative' language tests can be validated
(in the sense of criterion-related validity), see Morrow (1985). Bachman
and Palmer (1981) is a good example of construct validation. For a
collection of papers on language testing research, see Oller (1983).
5 Reliability
Imagine that a hundred students take a 100-item test at three o'clock one
Thursday afternoon. The test is not impossibly difficult or ridiculously
easy for these students, so they do not all get zero or a perfect score of
100. Now what if in fact they had not taken the test on the Thursday but
had taken it at three o'clock the previous afternoon? Would we expect
each student to have got exactly the same score on the Wednesday as they
actually did on the Thursday? The answer to this question must be no.
Bill      58   82
Mary      46   28
Ann       19   34
Harry     89   67
Cyril     43   63
Pauline   55   59
Don       43   35
Colin     27   23
Irene     76   62
Sue       62   49
Now look at Table 1b), which displays the same kind of information for a
second 100-item test (B). Again note the difference in scores for each
student.
Bill      65   59
Mary      48   52
Ann       23   21
Harry     85   90
Cyril     44   39
Pauline   56   59
Don       38   35
Colin     19   16
Irene     67   52
Sue       52   57
Which test seems the more reliable? The differences between the two sets
of scores are much smaller for Test B than for Test A. On the evidence
that we have here (and in practice we would not wish to make claims
about reliability on the basis of such a small number of individuals), Test
B appears to be more reliable than Test A.
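The comparison between Tests A and B can be sketched in a few lines of Python. This is an illustration only, not a procedure taken from the book: it assumes that the two columns in each table are a student's score on one occasion and on the other, it leaves out Colin and Sue because one of their Test B scores is unclear in the table, and the function names are invented for the example. It reports the mean absolute difference between the two scores and a Pearson correlation, which is one common way of expressing how closely two sets of scores agree.

```python
from math import sqrt

# (first score, second score) for each student, read from Tables 1a) and 1b).
# Colin and Sue are omitted: one of their Test B scores is unclear in the table.
TEST_A = {
    "Bill": (58, 82), "Mary": (46, 28), "Ann": (19, 34), "Harry": (89, 67),
    "Cyril": (43, 63), "Pauline": (55, 59), "Don": (43, 35), "Irene": (76, 62),
}
TEST_B = {
    "Bill": (65, 59), "Mary": (48, 52), "Ann": (23, 21), "Harry": (85, 90),
    "Cyril": (44, 39), "Pauline": (56, 59), "Don": (38, 35), "Irene": (67, 52),
}


def mean_abs_difference(scores):
    """Average size of the gap between a student's two scores."""
    return sum(abs(a - b) for a, b in scores.values()) / len(scores)


def pearson(scores):
    """Pearson correlation between the first and second set of scores."""
    xs = [a for a, _ in scores.values()]
    ys = [b for _, b in scores.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


for name, scores in (("Test A", TEST_A), ("Test B", TEST_B)):
    print(f"{name}: mean difference {mean_abs_difference(scores):.2f}, "
          f"correlation {pearson(scores):.2f}")
```

On these figures the smaller average difference and the higher correlation both belong to Test B, which matches the impression gained from reading the tables.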
Bill      5   3
Mary      4   5
Ann       2   4
Harry     5   2
Cyril     2   4
Pauline   3   5
Don       3   1
Colin     1   2
Irene     4   5
Sue       3   1
In one sense the two sets of interview scores are very similar. The largest
difference between a student's actual score and the one which would have
been obtained on the following day is 3. But the largest possible
difference is only 4! Really the two sets of scores are very different. This
becomes apparent once we compare the size of the differences between
students with the size of differences between scores for individual
students. They are of about the same order of magnitude. The result of
this can be seen if we place the students in order according to their
le order based on their actual scores is
sed on the scores they would have
rview on the following day. This
very reliable at all.
The reliability coefficient
It is possible to quantify the reliability of a test in the form of a reliability
coefficient.
Reliability coefficients are like validity coefficients (Chapter 4). They allow
us to compare the reliability of different tests. The ideal reliability
coefficient is 1 - a test with a reliability coefficient of 1 is one which
would give precisely the same results for a particular set of candidates
regardless of when it happened to be administered. A test
which had a reliability coefficient of zero (and let us hope that no such
test exists!) would give sets of results quite unconnected with each other,
in the sense that the score that someone actually got on a Wednesday
coefficient we
should expect for different types of language tests. Lado (1961), for
example, says that good vocabulary, structure and reading tests are
usually in the .90 to .99 range, while auditory comprehension tests are
more often in the .80 to .89 range. Oral production tests may be in the .70
ability coefficient of .85 might be
tion test but low for a reading test.
o sees as the difficulty in achieving
rent abilities. In fact the reliability
epend also on other considerations,
most particularly the importance of the decisions that are to be taken on
the basis of the test. The more important the decisions, the greater
reliability
Ftg n ty,t-1Lt someone the opportunitr'
-b-,
=;,9T&!rst
most obviotis w e
wrth resPect
$ we
treasons ertain, which iE referred to as the
candidate's true score.
We are able to make statements about the probability that a candidate's
true score (the one which best represents their ability on the test) is
within a certain number of points of the score they actually obtained
on the test. In order to do this, we first have to calculate the standard
error of measurement of the particular test. The calculation (described in
1. Because of the reduced length, which will cause the coefficient to be less than it would be
for the whole test, a statistical adjustment has to be made (see Appendix 1 for details).
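An adjustment of this kind is commonly made with the Spearman-Brown formula. That Appendix 1 uses this particular formula is an assumption, and the symbols below are chosen for illustration: r_half is the coefficient obtained from the two halves of the test, and r_full is the estimated coefficient for the whole test.

```latex
r_{\text{full}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}}
```

For example, a coefficient of .70 between the two halves would give 1.40 / 1.70, or about .82, for the whole test.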
illustrated by an examPle.
Suppose that a test has a standard error of measurement of 5. An
individual scores 56 on that test. We are then in a position to make the
following statements:
We can be about 68 per cent certain that the person's true score lies in the
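Statements of this kind can be sketched in code. This is an illustrative sketch rather than the book's own calculation. It assumes the conventional rule of thumb that about 68, 95 and 99.7 per cent of true scores lie within one, two and three standard errors of the obtained score. It also takes the standard error of measurement in the example to be 5, and the function name is invented.

```python
# Approximate levels of certainty for bands of 1, 2 and 3 standard errors.
# This is the usual normal-distribution rule of thumb, assumed here.
BANDS = ((1, 68.0), (2, 95.0), (3, 99.7))


def true_score_ranges(obtained, sem):
    """Return (per cent certainty, low, high) for each band around the obtained score."""
    return [(pct, obtained - k * sem, obtained + k * sem) for k, pct in BANDS]


# Example values: an obtained score of 56 and an assumed standard error of 5.
for pct, low, high in true_score_ranges(56, 5):
    print(f"about {pct} per cent certain: true score between {low} and {high}")
```

The wider the band, the more certain the statement, but the less it says about where the true score actually lies.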
.the case..of .same 'i
i-n"
Scorer reliability
In the first example given in this chapter we spoke about scores on a
multiple choice test. It was most unlikely, we thought, that every
candidate would get precisely the same score on both of two possible
administrations of the test. We assumed, however, that scoring of the test
would be 'perfect'. That is, if a particular candidate did perform in
exactly the same way on the two occasions, they would be given the same
score on both occasions. That is, any one scorer would give the same
score on the two occasions, and this would be the same score as would be
given by any other scorer on either occasion. It is possible to quantify the
level of agreement given by different scorers on different occasions by
means of a scorer reliability coefficient which can be interpreted in a
similar way as the test reliability coefficient. In the case of the multiple
choice test just described the
we noted in Chapter 3, when
n a degree of judgement is called for
ring of performance in an interview.
Such subjective tests will not
time when many
coefficients (and also the reliability
low to justify the use of subjective
measures of language ability in serious language testing. This view is less
widely held today. While the perfect reliability of objective tests is not
obtainable in subjective tests, there are ways of making it sufficiently high
reliable. There is
Ap-pe*ndix) tbafel1ry:-911-!"o- o
&
... depressing effect on the reliability ... directed to. A number of
candidates answered 'page 3', which was the place in the text where the
author actually said that the interested reader should look in the Further
reading section. Only ... scoring the test revealed that there was a ...
... ones already in the test will be needed to increase the reliability ...
... prevented from answering the additional question; for them, in reality, there
is no additional question. We do not get an additional sample of their
behaviour, so the reliability of our estimate of their ability is not
increased.
Each additional item should as far as possible represent a fresh start for
the candidate. By doing this ... gain additional information on ... which will
make test results more ... should not be taken to mean only ... In a test of
writing, for example, where
candidates have to produce a number of passages, each of those passages
is to be regarded as an item. The more independent passages there are, the
more reliable will be the test. In the same way, in an interview used to test
oral ability, the candidate should be given as many 'fresh starts' as
possible. More detailed implications of the need to obtain sufficiently
large samples of behaviour will be outlined later in the book, in chapters
devoted to the testing of particular abilities.
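The relation between the number of items and reliability can be put into figures. The sketch below assumes the Spearman-Brown formula, which this section does not name; the function names and the figures are hypothetical. The estimate holds only if the added items are similar to those already in the test and each one gives an independent sample of the candidate's behaviour.

```python
import math


def lengthened_reliability(reliability, factor):
    """Spearman-Brown estimate (assumed) for a test made `factor` times as long."""
    return factor * reliability / (1.0 + (factor - 1.0) * reliability)


def items_needed(current_items, reliability, target):
    """Estimated total number of similar items needed to reach `target`."""
    if not (0.0 < reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("coefficients must lie strictly between 0 and 1")
    factor = target * (1.0 - reliability) / (reliability * (1.0 - target))
    return math.ceil(current_items * factor)


# Invented example: doubling a test whose reliability coefficient is 0.5.
doubled = lengthened_reliability(0.5, 2)

# Invented example: a 40-item test with a coefficient of 0.7, aiming at 0.9.
total_items = items_needed(40, 0.7, 0.9)

print(round(doubled, 3), total_items)
```

Figures of this kind show why each further gain in reliability costs more items than the last, which is one reason a test cannot simply be lengthened without limit.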
While it is important to make a test long enough to achieve satisfactory
reliability, it should not be made so long that the candidates become so
... unrepresentative ...
... of the complaint that students are unintelligent, have been stupid, have
wilfully misunderstood what they were asked to do, reveals that the
supposition is often unwarranted. Test writers should not rely on the
students' powers of telepathy to elicit the desired behaviour. Again, the
use of colleagues to criticise drafts of instructions ... problems. Spoken
instructions should always be read from a prepared text in order to avoid ...
Too often, institutional tests are badly typed (or handwritten), have too
much text in too small a space, and are poorly reproduced. As a result,
students are faced with additional tasks which are not ones meant to
measure their language ability. Their variable performance on the
unwanted tasks will lower the reliability of a test.
Candidates should be familiar with format and testing ...
... items that permit scoring which is ...
Items of this kind are discussed in later ...
... as possible in its assignment of points. It should be the outcome of efforts
to anticipate all possible responses and have been subjected to group
criticism. (This advice applies only where responses can be classed as
partially or totally 'correct', not in the case of compositions, for
instance.)
... The scoring of compositions, for example, should not be assigned to
anyone who has not learned to score accurately compositions from past
administrations.
After each administration, patterns of scoring should be analysed.
Individuals whose scoring deviates markedly and inconsistently from the
norm should not be used again.
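One way of analysing patterns of scoring can be sketched as follows. This is an illustration under stated assumptions, not a method prescribed in the text: each scorer's marks are correlated with the average marks given to the same scripts by all the other scorers, and a scorer whose correlation falls below an arbitrary cut-off is flagged. The function names, the invented marks and the cut-off of 0.5 are all hypothetical. A scorer who is consistently severe or lenient will still correlate well with the others, so the check picks out inconsistency rather than severity.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of marks."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def agreement_with_others(marks):
    """Correlate each scorer's marks with the mean marks of all other scorers.

    `marks` maps a scorer's name to the marks given to the same scripts,
    listed in the same order for every scorer.
    """
    result = {}
    for scorer, own in marks.items():
        others = [m for name, m in marks.items() if name != scorer]
        norm = [mean(column) for column in zip(*others)]
        result[scorer] = pearson(own, norm)
    return result


# Invented marks for six scripts, for illustration only.
marks = {
    "A": [12, 15, 9, 18, 14, 7],
    "B": [11, 16, 10, 17, 13, 8],
    "C": [13, 14, 9, 19, 15, 6],
    "D": [15, 9, 14, 10, 16, 12],
}

CUT_OFF = 0.5  # arbitrary threshold, an assumption of this sketch
agreement = agreement_with_others(marks)
flagged = [name for name, r in agreement.items() if r < CUT_OFF]

print(flagged)
```

With so few scripts a single coefficient means little; in practice the analysis would be run over a whole administration before any scorer was dropped.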
Agree acceptable responses and ... of scoring
A sample of scripts should be taken immediately after the administration
... Where there are compositions, ... typical representatives of different
levels of ability should be selected. Only when all scorers are agreed on
the scores to be given to these should real scoring begin. Much more will
be said in Chapter 9 about the scoring of compositions. For short answer
questions, the scorers should note any difficulties ... is unlikely to have
anticipated every ... these to the attention of whoever is ... Once a
decision has been taken as to the points to be assigned, the supervisor
should convey it to all the scorers concerned.
... ability ... composition, then it would be hard to justify providing ...
structure in order to increase reliability. At the same time we ... to
restrict candidates in ways which would not render their performance on
the task invalid.
There will always be some tension between reliability and validity. The
tester has to balance gains in one against losses in the other.