(Arthur Hughes) Testing For Language Teachers
CAMBRIDGE HANDBOOKS FOR LANGUAGE TEACHERS
General Editor: Michael Swan

Testing for Language Teachers

This is a series of practical guides for teachers of English and other languages. Illustrative examples are usually drawn from the field of English as a foreign or second language, but the ideas and techniques described can equally well be used in the teaching of any language.
Copyright
The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.
Contents
Acknowledgements viii
Preface ix
10 Testing oral ability 101
11 Testing reading 116
Acknowledgements

The author and publishers would like to thank the following for permission to reproduce copyright material:

American Council on the Teaching of Foreign Languages Inc. for extracts from ACTFL Provisional Proficiency Guidelines and Generic Guidelines (1986); ARELS Examination Trust for extracts from examinations; A. Hughes for extracts from the New Boğaziçi University Language Proficiency Test (1984); The British Council for the draft of the scale for the ELTS test; Cambridge University Press for M. Swan and C. Walter: Cambridge English Course 3, p. 16 (1988); Educational Testing Services for the graph on p. 84; Filmscan Lingual House for M. Garman and A. Hughes: English Cloze Exercises Chapter 4 (1983); The Foreign Service Institute for the Testing Kit pp. 35-8 (1979); Harper & Row, Publishers, Inc. for the graph on p. 161, from Basic Statistical Methods by N. M. Downie and Robert W. Heath, copyright © 1985 by N. M. Downie and Robert W. Heath, reprinted with permission of Harper & Row, Publishers, Inc.; The Independent for N. Timmins: 'Passive smoking comes under fire', 14 March 1987; Joint Matriculation Board for extracts from March 1980 and June 1980 Tests in English (Overseas); Language Learning, and J. W. Oller Jr. and C. A. Conrad for the extract from 'The cloze technique and ESL proficiency', Language Learning 21: 183-94; Longman UK Ltd for D. Byrne: Progressive Picture Compositions p. 30 (1967); Macmillan London Ltd for Colin Dexter: The Secret of Annexe 3 (1986); The Observer for S. Limb: 'One-sided headache', 9 October 1983; The Royal Society of Arts Examinations Board/University of Cambridge Local Examinations Syndicate for extracts from the specifications for the examination in The Communicative Use of English as a Foreign Language, and from the 1984 and 1985 summer examinations; The University of Cambridge Local Examinations Syndicate for the extract from Testpack 1 paper 3; The University of Oxford Delegacy of Local Examinations for extracts from the Oxford Examination in English as a Foreign Language.

Preface

The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order to arrive at the best solution for any particular situation - the most appropriate test or testing system - it is not enough to have at one's disposal a collection of test techniques from which to choose. It is also necessary to understand the principles of testing and how they can be applied in practice.

It is relatively straightforward to introduce and explain the desirable qualities of tests: validity, reliability, practicality, and beneficial backwash (this last, which refers to the favourable effects tests can have on teaching and learning, here receiving more attention than is usual in books of this kind). It is much less easy to give realistic advice on how to achieve them in teacher-made tests. One is tempted either to ignore the problem or to present as a model the not always appropriate methods of large-scale testing institutions. In resisting these temptations I have been compelled to make explicit in my own mind much that had previously been vague and intuitive. I have certainly benefited from doing this: I hope that readers will too.

Exemplification throughout the book is from the testing of English as a foreign language. This reflects both my own experience in language testing and the fact that English will be the one language known by all readers. I trust that it will not prove too difficult for teachers of other languages to find or construct parallel examples of their own.

I must acknowledge the contributions of others: MA students at Reading University, too numerous to mention by name, who have taught me much, usually by asking questions that I could not answer; my friends and colleagues, Paul Fletcher, Michael Garman, Don Porter, and Tony Woods, who all read parts of the manuscript and made many helpful suggestions; Barbara Barnes, who typed a first version of the early chapters; Michael Swan, who gave good advice and much encouragement, and who remained remarkably patient as each deadline for completion passed; and finally my family, who accepted the writing of the book as an excuse more often than they should. To all of them I am very grateful.
1 Teaching and testing

Many language teachers harbour a deep mistrust of tests and of testers. The starting point for this book is the admission that this mistrust is often justified.

Backwash

The effect of testing on teaching and learning is known as backwash. Backwash can be harmful or beneficial. If a test is regarded as important, then preparation for it can come to dominate all teaching and learning activities. And if the test content and testing techniques are at variance with the objectives of the course, then there is likely to be harmful backwash. An instance of this would be where students are following an English course which is meant to train them in the language skills (including writing) necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple choice items, then there is great pressure to practise such items rather than the skill of writing itself.
choice, had an immediate effect on teaching: the syllabus was redesigned, new books were chosen, classes were conducted differently. The result of these changes was that by the end of their year's training, in circumstances made particularly difficult by greatly increased numbers and limited resources, the students reached a much higher standard in English than had ever been achieved in the university's history. This was a case of beneficial backwash.

Davies (1968:5) has said that 'the good test is an obedient servant since it follows and apes the teaching'. I find it difficult to agree. The proper relationship between teaching and testing is surely that of partnership. It is true that there may be occasions when the teaching is good and appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.

Inaccurate tests

The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Teachers know this. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.

Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters, but a short answer is possible here. There are two main sources of inaccuracy. The first of these concerns test content and techniques. To return to an earlier example: if we want to know how well someone can write, there is absolutely no way we can get a really accurate measure of their ability by means of a multiple choice test. Professional testers have expended great effort, and not a little money, in attempts to do it; but they have always failed. We may be able to get an approximate measure, but that is all. When testing is carried out on a very large scale, when the scoring of tens of thousands of compositions might seem not to be a practical proposition, it is understandable that potentially greater accuracy is sacrificed for reasons of economy and convenience. But it does not give testing a good name! And it does set a bad example.

While few teachers would wish to follow that particular example in order to test writing ability, the overwhelming practice in large-scale testing of using multiple choice items does lead to imitation in circumstances where such items are not at all appropriate. What is more, the imitation tends to be of a very poor standard. Good multiple choice items are notoriously difficult to write. A great deal of time and effort has to go into their construction. Too many multiple choice tests are written where such care and attention is not given (and indeed may not be possible). The result is a set of poor items that cannot possibly provide accurate measurements. One of the principal aims of this book is to discourage the use of inappropriate techniques and to show that teacher-made tests can be superior in certain respects to their professional counterparts.

The second source of inaccuracy is lack of reliability. Reliability is a technical term which is explained in Chapter 5. For the moment it is enough to say that a test is reliable if it measures consistently. On a reliable test you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next; whereas on an unreliable test the score is quite likely to be considerably different, depending on the day on which it is taken. Unreliability has two origins: features of the test itself, and the way it is scored. In the first case, something about the test creates a tendency for individuals to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.

In the second case, equivalent test performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring.

Most (but not all) large testing organisations, to their credit, take every precaution to make their tests, and the scoring of them, as reliable as possible, and are generally highly successful in this respect. Small-scale testing, on the other hand, tends to be less reliable than it should be. Another aim of this book, then, is to show how to achieve greater reliability in testing. Advice on this is to be found in Chapter 5.

The need for tests

So far this chapter has been concerned to understand why tests are so mistrusted by many language teachers. We have seen that this mistrust is often justified. One conclusion drawn from this might be that we would be better off without language tests. Teaching is, after all, the primary activity; if testing comes in conflict with it, then it is testing which should go, especially when it has been admitted that so much testing provides inaccurate information. This is a plausible argument - but there are other considerations, which might lead to a different conclusion.

Information about people's language ability is often very useful and sometimes necessary. It is difficult to imagine, for example, British and American universities accepting students from overseas without some knowledge of their proficiency in English. The same is true for organisations hiring interpreters or translators. They certainly need dependable measures of language ability.

Within teaching systems, too, as long as it is thought appropriate for individuals to be given a statement of what they have achieved in a second or foreign language, then tests of some kind or other will be needed.¹ They will also be needed in order to provide information about the achievement of groups of learners, without which it is difficult to see how rational educational decisions can be made. While for some purposes teachers' assessments of their own students are both appropriate and sufficient, this is not true for the cases just mentioned. Even without considering the possibility of bias, we have to recognise the need for a common yardstick, which tests provide, in order to make meaningful comparisons.

If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view, the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.

1. It will become clear that in this book the word 'test' is interpreted widely. It is used to refer to any structured attempt to measure language ability. No distinction is made between 'examination' and 'test'.

What is to be done?

The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure on others, including professional testers and examining boards, to improve their tests. This book represents an attempt to help them do both. For the reader who doubts that teachers can influence the large testing institutions, let this chapter end with a further reference to the testing of writing through multiple choice items. This was the practice followed by those responsible for TOEFL (Test of English as a Foreign Language), the test taken by most non-native speakers of English applying to North American universities. Over a period of many years they maintained that it was simply not possible to test the writing ability of hundreds of thousands of candidates by means of a composition: it was impracticable and the results, anyhow, would be unreliable. Yet in 1986 a writing test (Test of Written English), in which candidates actually have to write for thirty minutes, was introduced as a supplement to TOEFL, and already many colleges in the United States are requiring applicants to take this test in addition to TOEFL. The principal reason given for this change was pressure from English language teachers who had finally convinced those responsible for the TOEFL of the overriding need for a writing task which would provide beneficial backwash.

READER ACTIVITIES

1. Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2. Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?

Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).
2 Testing as problem solving: an overview of the book

Let us describe the general testing problem in a little more detail. The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of tests. This may seem obvious but it is something which seems not always to be recognised. The purposes of testing discussed in this book are:

fluently in a language, as well as the ability to recite grammatical rules (if that is something which we are interested in measuring). It does not, however, refer to [...] languages. The measurement of this talent in order to predict how well or how quickly individuals will learn a foreign language, is beyond the scope of this book. The interested reader is referred to Pimsleur (1968), Carroll (1981), and Skehan (1986).
All tests cost time and money - to prepare, administer, score and interpret. Time and money are in limited supply, and so there is often likely to be a conflict between what appears to be a perfect testing solution in a particular situation and considerations of practicality. This issue is also discussed in Chapter 6.

To rephrase the general testing problem identified above: the basic problem is to develop tests which are valid and reliable, which have a beneficial backwash effect on teaching (where this is relevant), and which are practical. The next four chapters of the book are intended to look more closely at the relevant concepts and so help the reader to formulate such problems clearly in particular instances, and to provide advice on how to approach their solution.

The second half of the book is devoted to more detailed advice on the construction and use of tests, the putting into practice of the principles outlined in earlier chapters. Chapter 7 outlines and exemplifies the various stages of test construction. Chapter 8 discusses a number of testing techniques. Chapters 9-13 show how a variety of language abilities can best be tested, particularly within teaching institutions. Chapter 14 gives straightforward advice on the administration of tests.

We have to say something about statistics. Some understanding of statistics is useful, indeed necessary, for a proper appreciation of testing matters and for successful problem solving. At the same time, we have to recognise that there is a limit to what many readers will be prepared to do, especially if they are at all afraid of mathematics. For this reason, statistical matters are kept to a minimum and are presented in terms that everyone should be able to grasp. The emphasis will be on interpretation rather than on calculation. For the more adventurous reader, however, Appendix 1 explains how to carry out a number of statistical operations.

Further reading

The collection of critical reviews of nearly 50 English language tests (mostly British and American), edited by Alderson, Krahnke and Stansfield (1987), reveals how well professional test writers are thought to have solved their problems. A full understanding of the reviews will depend to some degree on an assimilation of the content of Chapters 3, 4, and 5 of this book.

3 Kinds of test and testing

This chapter begins by considering the purposes for which language testing is carried out. It goes on to make a number of distinctions: between direct and indirect testing, between discrete point and integrative testing, between norm-referenced and criterion-referenced testing, and between objective and subjective testing. Finally there is a note on communicative language testing.

We use tests to obtain information. The information that we hope to obtain will of course vary from situation to situation. It is possible, nevertheless, to categorise tests according to a small number of kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

Proficiency tests are designed to measure people's ability in a language regardless of any training they may have had in that language. The content of a proficiency test, therefore, is not based on the content or objectives of language courses which people taking the test may have followed. Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient. This raises the question of what we mean by the word 'proficient'.

In the case of some proficiency tests, 'proficient' means having sufficient command of the language for a particular purpose. An example of this would be a test designed to discover whether someone can function successfully as a United Nations translator. Another example would be a test used to determine whether a student's English is good enough to follow a course of study at a British university. Such a test may even attempt to take into account the level and kind of English needed to follow courses in particular subject areas. It might, for example, have one form of the test for arts subjects, another for sciences, and so on. Whatever the particular purpose to which the language is to be put, this
will be reflected in the specification of test content at an early stage of a test's development.

There are other proficiency tests which, by contrast, do not have any occupation or course of study in mind. For them the concept of proficiency is more general. British examples of these would be the Cambridge examinations (First Certificate Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary and Higher). The function of these tests is to show whether candidates have reached a certain standard with respect to certain specified abilities. Such examining bodies are independent of the teaching institutions and so can be relied on by potential employers etc. to make fair comparisons between candidates from different institutions and different countries. Though there is no particular purpose in mind for the language, these general proficiency tests should have detailed specifications saying just what it is that successful candidates will have demonstrated that they can do. Each test should be seen to be based directly on these specifications. All users of a test (teachers, students, employers, etc.) can then judge whether the test is suitable for them, and can interpret test results. It is not enough to have some vague notion of proficiency, however prestigious the testing body concerned.

Despite differences between them of content and level of difficulty, all proficiency tests have in common the fact that they are not based on courses that candidates may have previously taken. On the other hand, as we saw in Chapter 1, such tests may themselves exercise considerable influence over the method and content of language courses. Their backwash effect - for this is what it is - may be beneficial or harmful. In my view, the effect of some widely used proficiency tests is more harmful than beneficial. However, the teachers of students who take such tests and whose work suffers from a harmful backwash effect, may be able to exercise more influence over the testing organisations concerned than they realise. The recent addition to TOEFL, referred to in Chapter 1, is a case in point.

Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. They are of two kinds: final achievement tests and progress achievement tests.

Final achievement tests are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or by members of teaching institutions. Clearly the content of these tests must be related to the courses with which they are concerned, but the nature of this relationship is a matter of disagreement amongst language testers.

In the view of some testers, the content of a final achievement test should be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the 'syllabus-content approach'. It has an obvious appeal, since the test only contains what it is thought that the students have actually encountered, and thus can be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, then the results of a test can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives. For example, a course may have as an objective the development of conversational ability, but the course itself and the test may require students only to utter carefully prepared statements about their home town, the weather, or whatever. Another course may aim to develop a reading ability in German, but the test may limit itself to the vocabulary the students are known to have met. Yet another course is intended to prepare students for university study in English, but the syllabus (and so the course and the test) may not include listening (with note taking) to English delivered in lecture style on topics of the kind that the students will have to deal with at university. In each of these examples - all of them based on actual cases - test results will fail to show what students have achieved in terms of course objectives.

The alternative approach is to base the test content directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives. Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is my belief that to base test content on course objectives is much to be preferred: it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.¹

1. Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.
1 Now it might be argued that to base test content on objectives rather In addition to more formal achievement tests which require careful
than on course content is unfair to students. If the course content does not preparation, teachers should feel free to set their own 'pop quizzes'.
fit well with objectives, they will be expected to do things for which thev These serve both to make a rough check on students' progress and to keep
1 have not been prepared. In a sense this is true. But in another sense it IS
not. If a test is based on the content of a poor or inappropriate course,
students on their toes. Since such tests will not form part of formal
assessment procedures, their construction and scoring need not be too
the students taking it will be misled as to the extent of their achieve- rigorous. Nevertheless, they should be seen as measuring progress
1 ment and the quality of the course. Whereas if the test is based on
objectives, not only will the information it gives be more useful, but
there is less chance of the course surviving in its present unsatisfactory
towards the intermediate objectives on which the more formal progress
form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.

The reader may wonder at this stage whether there is any real difference between final achievement tests and proficiency tests. If a test is based on the objectives of a course, and these are equivalent to the language needs on which a proficiency test is based, then there is no reason to expect a difference between the form and content of the two tests. Two things have to be remembered, however. First, objectives and needs will not typically coincide in this way. Secondly, many achievement tests are not in fact based on course objectives. These facts have implications both for the users of test results and for test writers. Test users have to know on what basis an achievement test has been constructed, and be aware of the possibly limited validity and applicability of test scores. Test writers, on the other hand, must create achievement tests which reflect the objectives of a particular course, and not expect a general proficiency test (or some imitation of it) to provide a satisfactory alternative.

Progress achievement tests, as their name suggests, are intended to measure the progress that students are making. Since 'progress' is towards the achievement of course objectives, these tests too should relate to objectives. But how? One way of measuring progress would be repeatedly to administer final achievement tests, the (hopefully) increasing scores indicating the progress made. This is not really feasible, particularly in the early stages of a course. The low scores obtained would be discouraging to students and quite possibly to their teachers. The alternative is to establish a series of well-defined short-term objectives. These should make clear progression towards the final achievement test based on course objectives. Less formal progress tests, which individual teachers set for their own classes, need not be so closely tied to the intermediate objectives on which the more formal progress achievement tests are based. They can, however, reflect the particular 'route' that an individual teacher is taking towards the achievement of objectives.

It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, political manoeuvring may be called for - topics on which this book cannot pretend to give advice.

Diagnostic tests

Diagnostic tests are used to identify students' strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary. At the level of broad language skills this is reasonably straightforward. We can be fairly confident of our ability to create tests that will tell us that a student is particularly weak in, say, speaking as opposed to reading in a language. Indeed, existing proficiency tests may often prove adequate for this purpose.

We may be able to go further, analysing samples of a student's performance in writing or speaking in order to create profiles of the student's ability with respect to such categories as 'grammatical accuracy' or 'linguistic appropriacy'. (See Chapter 9 for a scoring system that may provide such an analysis.)

But it is not so easy to obtain a detailed analysis of a student's command of grammatical structures - something which would tell us, for example, whether she or he had mastered the present perfect/past tense distinction in English. In order to be sure of this, we would need a number of examples of the student's choice between the two structures in a range of relevant contexts. A comprehensive diagnostic test of grammar would therefore be very large indeed (think of what would be involved in
testing the modal verbs, for instance). The size of such a test would make it impractical to administer in a routine fashion. For this reason, very few tests are constructed for purely diagnostic purposes, and those that there are do not provide very detailed information.

The lack of good diagnostic tests is unfortunate. They could be extremely useful for individualised instruction or self-instruction. Learners would be shown where gaps exist in their command of the language, and could be directed to sources of information, exemplification and practice. Happily, the ready availability of relatively inexpensive computers with very large memories may change the situation. Well-written computer programmes would ensure that the learner spent no more time than was absolutely necessary to obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.

Placement tests

Placement tests, as their name suggests, are intended to provide information which will help to place students at the stage (or in the part) of the teaching programme most appropriate to their abilities. Typically they are used to assign students to classes at different levels.

Placement tests can be bought, but this is not to be recommended unless the institution concerned is quite sure that the test being considered suits its particular teaching programme. No one placement test will work for every institution, and the initial assumption about any test that is commercially available must be that it will not work well.

The placement tests which are most successful are those constructed for particular situations. They depend on the identification of the key features at different levels of teaching in the institution. They are tailor-made rather than bought off the peg. This usually means that they have been produced 'in house'. The work that goes into their construction is rewarded by the saving in time and effort through accurate placement. An example of how a placement test might be designed within an institution is given in Chapter 7; the validation of placement tests is referred to in Chapter 4.

Direct versus indirect testing

So far in this chapter we have considered a number of uses to which test results are put. We now distinguish between two approaches to test construction.

Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well candidates can write compositions, we get them to write compositions. If we want to know how well they pronounce a language, we get them to speak. The tasks, and the texts which are used, should be as authentic as possible. The fact that candidates are aware that they are in a test situation means that the tasks cannot be really authentic. Nevertheless the effort is made to make them as realistic as possible.

Direct testing is easier to carry out when it is intended to measure the productive skills of speaking and writing. The very acts of speaking and writing provide us with information about the candidate's ability. With listening and reading, however, it is necessary to get candidates not only to listen or read but also to demonstrate that they have done this successfully. The tester has to devise methods of eliciting such evidence accurately and without the method interfering with the performance of the skills in which he or she is interested. Appropriate methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.

Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.

Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

    At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.

where the candidate has to identify which of the underlined elements is erroneous or inappropriate in formal standard English. While the ability to respond to such items has been shown to be related statistically to the ability to write compositions (though the strength of the relationship was not particularly great), it is clearly not the same thing. Another example of indirect testing is Lado's (1961) proposed method of testing pronunciation ability by a paper and pencil test in which the candidate has to identify pairs of words which rhyme with each other.
Perhaps the main appeal of indirect testing is that it seems to offer the possibility of testing a representative sample of a finite number of abilities which underlie a potentially indefinitely large number of manifestations of them. Where it is such underlying abilities that we need to test, such as control of particular grammatical structures, indirect testing is called for.

Discrete point versus integrative testing

Discrete point testing refers to the testing of one element at a time, item by item. This might involve, for example, a series of items each testing a particular grammatical structure. Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of a task. This might involve writing a composition, making notes while listening to a lecture, taking a dictation, or completing a cloze passage. Clearly this distinction is not unrelated to that between indirect and direct testing. Discrete point tests will almost always be indirect, while integrative tests will tend to be direct. However, some integrative testing methods, such as the cloze procedure, are fairly indirect.

Norm-referenced versus criterion-referenced testing

A norm-referenced test relates one candidate's performance to that of the other candidates: it may tell us, say, that a student scored in the top ten per cent of those taking the test, but not what the student can actually do in the language. A criterion-referenced test, by contrast, describes directly what a candidate can do. The following description of a level of reading ability, taken from the Interagency Language Roundtable Language Skill Level Descriptions, is of this second kind:

    Generally the prose that can be read by the individual is predominantly in straightforward/high-frequency sentence patterns. The individual does not have a broad active vocabulary ... but is able to use contextual and real-world clues to understand the text.

Similarly, a candidate who is awarded the Berkshire Certificate of Proficiency in German Level 1 can 'speak and react to others using simple language in the following contexts':

to greet, interact with and take leave of others;
to exchange information on personal background, home, school life and interests;
to discuss and make choices, decisions and plans;
to express opinions, make requests and suggestions;
to ask for information and understand instructions.

In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something about what he or she can actually do in the language. Tests which are designed to provide this kind of information directly are said to be criterion-referenced.2

The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set standards meaningful in terms of what people can do, which do not change with different groups of candidates, and they motivate students to attain those standards.

2 People differ somewhat in their use of the term 'criterion-referenced'. This is unimportant provided that the sense intended is made clear. The sense in which it is used here is the one which I feel will be most useful to the reader in analysing testing problems.

The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible. The norm-referenced test, on the other hand, while its content might be based on a similar analysis, is not so restricted. The Michigan Test of English Language Proficiency, for instance, has multiple choice grammar, vocabulary, and reading comprehension components. A candidate's score on the test does not tell us directly what his or her English ability is in relation to the demands that would be made on it at an English-medium university. To know this, we must consult a table which makes recommendations as to the academic load that a student with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those for criterion-referenced tests are not. The view taken in this book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have. The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration.

Objective testing versus subjective testing

The distinction here is between methods of scoring, and nothing else. If no judgement is required on the part of the scorer, then the scoring is objective. A multiple choice test, with the correct responses unambiguously identified, would be a case in point. If judgement is called for, the scoring is said to be subjective. There are different degrees of subjectivity in testing. The impressionistic scoring of a composition may be considered more subjective than the scoring of short answers in response to questions on a reading passage.

Objectivity in scoring is sought after by many testers, not for itself, but for the greater reliability it brings. In general, the less subjective the scoring, the greater agreement there will be between two different scorers (and between the scores of one person scoring the same test paper on different occasions). However, there are ways of obtaining reliable subjective scoring, even of compositions. These are discussed in Chapter 5.

Communicative language testing

Much has been written in recent years about 'communicative language testing'. Discussions have centred on the desirability of measuring the ability to take part in acts of communication (including reading and listening) and on the best way to do this. It is assumed in this book that it is usually communicative ability which we want to test. As a result, what I believe to be the most significant points made in discussions of communicative testing are to be found throughout. A recapitulation under a separate heading would therefore be redundant.

READER ACTIVITIES

Consider a number of language tests with which you are familiar. For each of them, answer the following questions:
1. What is the purpose of the test?
2. Does it represent direct or indirect testing (or a mixture of both)?
3. Are the items discrete point or integrative (or a mixture of both)?
4. Which items are objective, and which are subjective? Can you order the subjective items according to degree of subjectivity?
5. Is the test norm-referenced or criterion-referenced?
6. Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers.
7. What relationship is there between the answers to question 6 and the answers to the other questions?

Further reading

For a discussion of the two approaches towards achievement test content specification, see Pilliner (1968). Alderson (1987) reports on research into the possible contributions of the computer to language testing. Direct testing calls for texts and tasks to be as authentic as possible: Vol. 2, No. 1 (1985) of the journal Language Testing is devoted to articles on authenticity in language testing. An account of the development of an indirect test of writing is given in Godshalk et al. (1966). Classic short papers on criterion-referencing and norm-referencing (not restricted to language testing) are by Popham (1978), favouring criterion-referenced testing, and Ebel (1978), arguing for the superiority of norm-referenced testing. The description of reading ability given in this chapter comes from the Interagency Language Roundtable Language Skill Level Descriptions. Comparable descriptions at a number of levels for the four skills, intended for assessing students in academic contexts, have been devised by the American Council for the Teaching of Foreign Languages (ACTFL). These ACTFL Guidelines are available from ACTFL at 579 Broadway, Hastings-on-Hudson, NY 10706, USA. It should be said, however, that the form that these take and the way in which they were constructed have been the subject of some controversy. Doubts about the applicability of criterion-referencing to language testing are expressed by Skehan (1984); for a different view, see Hughes (1986). Carroll (1961) made the distinction between discrete point and integrative language testing. Oller (1979) discusses integrative testing techniques. Morrow (1979) is a seminal paper on communicative language testing. Further discussion of the topic is to be found in Canale and Swain (1980), Alderson and Hughes (1981, Part 1), Hughes and Porter (1983), and Davies (1988). Weir's (1988a) book has as its title Communicative language testing.
4 Validity

Content validity

A test that lacks content validity is also likely to have a harmful backwash effect. Areas which are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. Advice on the writing of specifications and on the judgement of content validity is to be found in Chapter 7.

Criterion-related validity

Another approach to validity is to see how far results on the test agree with those provided by some independent and dependable assessment of the candidate's ability - the criterion measure against which the test is validated. There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. Imagine, for example, that course objectives call for an oral component in a final achievement test, but that only ten minutes can be spared for each student: a short interview might be validated by also giving a sample of students a much fuller oral test and comparing the two sets of scores. If they show a high level of agreement, the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.

Box 1
To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square that coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to Appendix 1.

Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 16 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognised. As a result, a validity coefficient of this order is generally regarded as satisfactory. The further reading section at the end of the chapter gives references to the recent reports on the validation of the British Council's ELTS test, in which these issues are discussed at length.

Another example of predictive validity would be where an attempt was made to validate a placement test. Placement tests attempt to predict the most appropriate class for any particular student. Validation would involve an enquiry, once courses were under way, into the proportion of students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.
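The arithmetic behind Box 1 can be checked mechanically. The sketch below is illustrative only - the function name is mine, not anything from the book - but the rule it encodes is exactly the one Box 1 describes: square the coefficient and read the result as a percentage of agreement.

```python
# Box 1's rule of thumb: a validity coefficient r between two sets of
# scores accounts for r squared of the variation in scores, expressed
# here as a percentage. (Illustrative sketch; not code from the book.)
def percent_agreement(coefficient):
    return coefficient ** 2 * 100

for r in (0.7, 0.5, 0.8, 0.4):
    print(f"coefficient {r}: about {percent_agreement(r):.0f} per cent agreement")
```

Running this reproduces the figures used in the chapter: 0.7 gives about 49 per cent, 0.5 gives 25 per cent, 0.8 gives 64 per cent, and the 0.4 typical of outcome-based predictive validation gives only about 16 per cent agreement.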
Construct validity

A test is said to have construct validity if it can be demonstrated that it measures just the ability, or 'construct', that it is supposed to measure. Suppose, for example, that we wish to develop an indirect test of writing, on the hypothesis that writing ability is made up of a number of sub-abilities, such as control of punctuation, sensitivity to demands on style, and so on. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that this test really is measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing ability of the group to whom the test is first administered, and have these reliably scored. We would then compare scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (and a coefficient of the kind described in the previous section can be calculated), then we have evidence that we are measuring writing ability with the test.

So far, however, though we may have developed a satisfactory indirect test of writing, we have not demonstrated the reality of the underlying constructs (control of punctuation etc.). To do this we might administer a series of specially constructed tests, measuring each of the constructs by a number of different methods. In addition, compositions written by the people who took the tests could be scored separately for performance in relation to the hypothesised constructs (control of punctuation, for example).

Face validity

A test which does not look as if it measures what it is supposed to measure may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.

The use of validity

What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment - another reason for favouring direct testing!).

Any published test should supply details of its validation, without
which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.

READER ACTIVITIES

Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?

Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1976). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1986). Bachman and Palmer (1981) is a good example of construct validation. For a collection of papers on language testing research, see Oller (1983).

5 Reliability

Imagine that a hundred students take a 100-item test at three o'clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o'clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday? The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgement on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval - nevertheless we would not expect every individual to get precisely the same score on the Wednesday as they got on the Thursday. Human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical.

But if this is the case, it would seem to imply that we can never have complete trust in any set of test scores. We know that the scores would have been different if the test had been administered on the previous or the following day. This is inevitable, and we must accept it. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be.

Look at the hypothetical data in Table 1a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.
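The comparison the reader is asked to make - eyeballing the size of the day-to-day score differences - can also be done by computing the average absolute difference for each test. The sketch below uses the invented scores from Tables 1a) and 1b); the function and variable names are mine.

```python
# Score pairs (score obtained, score that would have been obtained the
# following day) for ten students, taken from Tables 1a) and 1b).
test_a = [(68, 82), (46, 28), (19, 34), (89, 67), (43, 63),
          (56, 59), (43, 35), (27, 23), (76, 62), (62, 49)]
test_b = [(65, 69), (48, 52), (23, 21), (85, 90), (44, 39),
          (56, 59), (38, 35), (19, 16), (67, 62), (52, 57)]

def mean_abs_difference(pairs):
    """Average size of the gap between a student's two scores."""
    return sum(abs(day1 - day2) for day1, day2 in pairs) / len(pairs)

print(mean_abs_difference(test_a))  # 13.1
print(mean_abs_difference(test_b))  # 3.9
```

The much smaller average for Test B is a numerical restatement of the conclusion drawn in the text: on this (deliberately tiny) sample, Test B looks the more reliable of the two.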
TABLE 1a) SCORES ON TEST A (INVENTED DATA)

Student    Score obtained    Score which would have been
                             obtained on the following day
Bill       68                82
Mary       46                28
Ann        19                34
Harry      89                67
Cyril      43                63
Pauline    56                59
Don        43                35
Colin      27                23
Irene      76                62
Sue        62                49

Now look at Table 1b), which displays the same kind of information for a second 100-item test (B). Again note the difference in scores for each student.

TABLE 1b) SCORES ON TEST B (INVENTED DATA)

Student    Score obtained    Score which would have been
                             obtained on the following day
Bill       65                69
Mary       48                52
Ann        23                21
Harry      85                90
Cyril      44                39
Pauline    56                59
Don        38                35
Colin      19                16
Irene      67                62
Sue        52                57

Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.

Look now at Table 1c), which represents scores of the same students on an oral interview, scored out of 5.

TABLE 1c) SCORES ON INTERVIEW (INVENTED DATA)

Student    Score obtained    Score which would have been
                             obtained on the following day
Bill       5                 4
Mary       4                 5
Ann        2                 4
Harry      5                 2
Cyril      2                 4
Pauline    3                 5
Don        3                 1
Colin      1                 2
Irene      4                 3
Sue        3                 1

In one sense the two sets of interview scores are very similar. The largest difference between a student's two scores is only a few points. But the scores can range only from 1 to 5: compare the range of scores available to students with the size of differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their interview score, the highest first. The order based on their actual scores is markedly different from the one based on the scores they would have obtained if they had had the interview on the following day. This interview turns out in fact not to be very reliable at all.

The reliability coefficient

It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such
I
Reliability Reliability
Certain authors have suggested how high a reliability coefficient we should expect for different types of language tests. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. He adds that a reliability coefficient of .85 might be considered high for an oral production test but low for a reading test. These suggestions reflect what Lado sees as the difficulty in achieving reliability in the testing of the different abilities. In fact the reliability coefficient that is to be sought will depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test. The more important the decisions, the greater reliability we must demand: if we are to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something has to be said about the way in which reliability coefficients are arrived at.

The first requirement is to have two sets of scores for comparison. The most obvious way of obtaining these is to get a group of subjects to take the same test twice. This is known as the test-retest method. The drawbacks are not difficult to see. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their responses to them, making the same responses more likely and the reliability spuriously high. If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the same test (the alternate forms method). However, alternate forms are often simply not available.

It turns out, surprisingly, that the most common methods of obtaining the necessary two sets of scores involve only one administration of one test. Such methods provide us with a coefficient of 'internal consistency'. The most basic of these is the split half method. In this the subjects take the test in the usual way, but each subject is given two scores. One score is for one half of the test; the second score is for the other half. The two sets of scores are then used to estimate the reliability coefficient as if the whole test had been taken twice. In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent, through the careful matching of items (in fact where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate).¹ It can be seen that this method is rather like the alternate forms method, except that the two 'forms' are only half the length.

It has been demonstrated empirically that this altogether more economical method will indeed give good estimates of alternate forms coefficients, provided that the alternate forms are closely equivalent to each other. Details of other methods of estimating reliability and of carrying out the necessary statistical calculations are to be found in Appendix 1.

1. Because of the reduced length, which will cause the coefficient to be less than it would be for the whole test, a statistical adjustment has to be made (see Appendix 1 for details).

The standard error of measurement and the true score

While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what he or she might have scored on another occasion. With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their 'true score'.

Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. If we had all of these scores we would be able to calculate their average score, and it would seem not unreasonable to think of this average as the one that best represents the student's ability with respect to this particular test. It is this score, which for obvious reasons we can never know for certain, which is referred to as the candidate's true score.

We are able to make statements about the probability that a candidate's true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we first have to calculate the standard error of measurement of the particular test. The calculation (described in
Appendix 1) is very straightforward, and is based on the reliability coefficient and a measure of the spread of all the scores on the test (for a given spread of scores, the greater the reliability coefficient, the smaller will be the standard error of measurement). How such statements can be made using the standard error of measurement of the test is best illustrated by an example.

Suppose that a test has a standard error of measurement of 5. An individual scores 56 on that test. We are then in a position to make the following statements:²

We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).
We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained).
We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained).

2. These statistical statements are based on what is known about the way a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution (see Woods, Fletcher and Hughes, 1986, for discussion beyond the scope of the present book). It is the known structure of the normal distribution which allows us to say what percentage of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.

Consider a candidate whose actual score falls just below a cut-off point.³ That candidate's true score is quite likely to be equal to or above the score that would lead to a positive decision, even though their actual score is below it. For this reason, all published tests should provide users with not only the reliability coefficient but also the standard error of measurement.

3. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement which may be considered acceptable or unacceptable.

We have seen the importance of reliability. If a test is not reliable then we know that the actual scores of many individuals are likely to be quite different from their true scores. This means that we can place little reliance on those scores. Even where reliability is quite high, the standard error of measurement serves to remind us that in the case of some individuals there is quite possibly a large discrepancy between actual score and true score. This should make us very cautious about making important decisions on the basis of the test scores of candidates whose actual scores place them close to the cut-off point (the point that divides 'passes' from 'fails'). We should at least consider the possibility of gathering further relevant information on the language ability of such candidates.

Having seen the importance of reliability, we shall consider, later in the chapter, how tests can be made more reliable. First, however, we turn to the scorer's contribution. Where scoring is perfectly objective, as with a multiple choice test, the score given to a candidate by one scorer on one occasion would not be expected to differ from the score given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient which can be interpreted in a similar way as the test reliability coefficient. In the case of the multiple choice test just described the scorer reliability coefficient would be 1. As we noted in Chapter 3, when scoring requires no judgement, and could in principle or in practice be carried out by a computer, the test is said to be objective. Only carelessness should cause the reliability coefficients of objective tests to fall below 1.

However, we did not assume perfectly consistent scoring in the case of the interview scores discussed earlier in the chapter. It would probably have seemed to the reader an unreasonable assumption. We can accept that scorers should be able to be consistent when there is only one easily recognised correct response.
But when a degree of judgement is called for on the part of the scorer, as in the scoring of performance in an interview, perfect consistency is not to be expected. Such subjective tests will not have reliability coefficients of 1! Indeed there was a time when many people thought that scorer reliability coefficients (and also the reliability of the test) would always be too low to justify the use of subjective measures of language ability in serious language testing. This view is less widely held today. While the perfect reliability of objective tests is not obtainable in subjective tests, there are ways of making it sufficiently high for test results to be valuable. It is possible, for instance, to obtain scorer reliability coefficients of over 0.9 for the scoring of compositions.

It is perhaps worth making explicit something about the relationship between scorer reliability and test reliability. If the scoring of a test is not reliable, then the test results cannot be reliable either. Indeed the test reliability coefficient will almost certainly be lower than scorer reliability, since other sources of unreliability will be additional to what enters through imperfect scoring. In a case I know of, the scorer reliability coefficient on a composition writing test was .92, while the reliability coefficient for the test was .84. Variability in the performance of individual candidates accounted for the difference between the two coefficients.

How to make tests more reliable

As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring. We will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.

Take enough samples of behaviour

Other things being equal, the more items a test contains, the more reliable it will be. It is possible to estimate how many additional items similar to the ones already in the test will be needed to increase the reliability coefficient to a required level. One thing to bear in mind, however, is that the additional items should be independent of each other and of existing items. Imagine a reading test that asks the question: 'Where did the thief hide the jewels?' If an additional item following that took the form 'What was unusual about the hiding place?', it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, so the reliability of our estimate of their ability is not increased.

Each additional item should as far as possible represent a fresh start for the candidate. By doing this we are able to gain additional information on all of the candidates, information which will make test results more reliable. The use of the word 'item' should not be taken to mean only brief questions and answers. In a test of writing, for example, where candidates have to produce a number of passages, each of those passages is to be regarded as an item. The more independent passages there are, the more reliable will be the test. In the same way, in an interview used to test oral ability, the candidate should be given as many 'fresh starts' as possible. More detailed implications of the need to obtain sufficiently large samples of behaviour will be outlined later in the book, in chapters devoted to the testing of particular abilities.

While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behaviour that they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate.
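The effect of lengthening a test with independent, similar items can be estimated with the Spearman-Brown formula. This is the standard formula for the purpose, and the same formula (with a doubling factor of 2) underlies the split-half adjustment mentioned earlier; the book itself leaves such calculations to its Appendix 1.

```python
# Spearman-Brown 'prophecy' formula: predicted reliability when a test
# is lengthened by a factor k with items similar to, and independent
# of, the existing ones. A standard result, not one spelled out in this
# book (which defers such calculations to Appendix 1).

def lengthened_reliability(r, k):
    """Reliability of a test k times as long as one with reliability r."""
    return (k * r) / (1 + (k - 1) * r)

# A test with reliability .70, doubled in length with independent items:
print(round(lengthened_reliability(0.70, 2), 2))  # 0.82

# The same formula with k = 2 is the adjustment applied to a split-half
# coefficient, since each 'half' is only half the length of the test:
half_test_r = 0.75
print(round(lengthened_reliability(half_test_r, 2), 2))  # 0.86
```

Note that the formula assumes the added items really are fresh, independent samples of behaviour; chained items of the jewel-hiding kind described above would not deliver the predicted gain.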
Do not allow candidates too much freedom

In some kinds of language test there is a tendency to offer candidates a choice of questions and considerable freedom in how to answer them. An example would be a test of writing where the candidates are simply given a selection of titles from which to choose. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater is likely to be the difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:

Write unambiguous items

The fact that a candidate might interpret a question in different ways on different occasions means that the item is not contributing fully to the reliability of the test. The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended. If this task is entered into in the right spirit, one of good-natured perversity, most of the problems can be identified before the test is administered. Pretesting of the items on a group of people comparable to those for whom the test is intended (see Chapter 7) should reveal the remainder. Where pretesting is not practicable, scorers must be on the lookout for patterns of response that indicate that there are problem items.
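One common way of screening for such patterns, not a procedure taken from this book, is a simple item discrimination check: an item on which high-scoring candidates do no better than low-scoring ones may be ambiguous or otherwise faulty. A minimal sketch:

```python
# A rough screen for problem items, of the kind scorers might run when
# pretesting is impossible. For each item we compare its success rate
# in the top and bottom thirds of candidates (ranked by total score).
# This discrimination index is a common item-analysis tool; it is an
# illustration, not a procedure prescribed by the book.

def discrimination_index(responses):
    """responses: one list of 0/1 item scores per candidate."""
    ranked = sorted(responses, key=sum)       # weakest candidates first
    third = max(1, len(responses) // 3)
    low, high = ranked[:third], ranked[-third:]
    n_items = len(responses[0])
    index = []
    for i in range(n_items):
        p_high = sum(r[i] for r in high) / len(high)
        p_low = sum(r[i] for r in low) / len(low)
        index.append(round(p_high - p_low, 2))
    return index

# Six invented candidates, three items; the middle item is answered
# almost at random and shows a near-zero index.
data = [[1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 0]]
print(discrimination_index(data))  # [1.0, 0.0, 1.0]
```

An index near zero (or negative) does not prove an item is ambiguous, but it marks the item out for exactly the kind of critical scrutiny described above.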
Ensure that candidates are familiar with format and testing techniques

If candidates are unfamiliar with the format of a test or the techniques it employs, they may perform less well than they would do otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.

Provide uniform and non-distracting conditions of administration

The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity. For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test. Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.

We turn now to ways of obtaining scorer reliability, which, as we saw above, is essential to test reliability.

Use items that permit scoring which is as objective as possible

Multiple choice items are the obvious example, but there are other item types which restrict the range of possible responses and so allow relatively objective scoring. Items of this kind are discussed in later chapters.

Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring the compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests. The scoring should be all the more reliable if the compositions are guided as in the example above, in the section 'Do not allow candidates too much freedom'.

Provide a detailed scoring key

This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally 'correct', not in the case of compositions, for instance.)
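A detailed scoring key of the kind just described might be recorded as structured data so that points are assigned identically by every scorer. In the sketch below the item, the acceptable answers and the weights are all invented for illustration; they come from no actual test.

```python
# A sketch of a detailed scoring key: acceptable answers are listed in
# advance, with points for partially correct responses. The item,
# answers and point values are invented for illustration only.

SCORING_KEY = {
    "item_1": {
        2: {"because the bridge had collapsed"},            # full credit
        1: {"the bridge collapsed", "bridge was broken"},   # partial credit
    },
}

def score_response(item, response, key=SCORING_KEY):
    """Return the points assigned to a response, 0 if it is not in the key."""
    normalised = response.strip().lower()
    for points, answers in key[item].items():
        if normalised in answers:
            return points
    return 0

print(score_response("item_1", "Because the bridge had collapsed"))  # 2
print(score_response("item_1", "the bridge collapsed"))              # 1
print(score_response("item_1", "no idea"))                           # 0
```

Responses that fall outside the key would, as the text suggests, be referred for a group decision rather than scored ad hoc by individual scorers.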
Where a decision has to be taken during scoring about an unanticipated response and the points to be assigned, the supervisor should convey it to all the scorers concerned.

Identify candidates by number, not name

Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. The identification of candidates only by number will reduce such effects.

Employ multiple, independent scoring

As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior, colleague, who compares the two sets of scores and investigates discrepancies.

READER ACTIVITIES

1. What published tests are you familiar with? Try to find out their reliability coefficients. (Check the manual; check to see if there is a review in Alderson et al. 1987.) What method was used to arrive at these? What are the standard errors of measurement?
2. The TOEFL test has a standard error of measurement of 15. A particular American college states that it requires a score of 600 on the test for entry. What would you think of students applying to that college and making scores of 605, 600, 595, 590, 575?
3. Look at your own institutional tests. Using the list of points in the chapter, say in what ways you could improve their reliability.
4. What examples can you think of where there would be a tension between reliability and validity? In cases that you know, do you think the right balance has been struck?

Further reading

For more on reliability in general, see Anastasi (1976). For the more mathematically minded, there is Krzanowski and Woods' (1984) article.
6 Achieving beneficial backwash

In the first chapter of this book it was claimed that testing could, and should, have a beneficial backwash effect on teaching and learning. This chapter attempts to say how such an effect can be achieved.

Sample widely and unpredictably

A test should sample as far as possible the full scope of what is specified. If not, if the sample is taken from a restricted area of the specifications, then the backwash effect will tend to be felt only in that area. The new TOEFL writing test (referred to in Chapter 1) will set only two kinds of task: compare/contrast; describe/interpret chart or graph. The likely outcome is that much preparation for the test will be limited to those two types of task. The backwash effect may not be as beneficial as it might have been had a wider range of tasks been used.

Whenever the content of a test becomes highly predictable, teaching and learning are likely to concentrate on what can be predicted. An effort should therefore be made to test across the full range of the specifications (in the case of achievement tests, this should be equivalent to a fully elaborated set of objectives), even where this involves elements that lend themselves less readily to testing.

Where a test is offered at a series of levels, students can enter at the level at which they can perform the great majority of the test tasks successfully. Students take only the test (or tests) on which they are expected to be successful. As a result, they are spared the dispiriting, demotivating experience of taking a test on which they can, for example, respond correctly to fewer than half of the items (and yet be given a pass). This type of testing, I believe, should encourage positive attitudes to language learning. It is the basis of some of the new GCSE (General Certificate of Secondary Education) examinations in Britain.

Base achievement tests on objectives

If achievement tests are based on objectives, rather than on detailed teaching and textbook content, they will provide a truer picture of what has actually been achieved. Teaching and learning will tend to be evaluated against those objectives. As a result, there will be constant pressure to achieve them. This was argued more fully in Chapter 3.

Ensure test is known and understood by students and teachers

However good the potential backwash effect of a test may be, the effect will not be fully realised if students and those responsible for teaching do not know and understand what the test demands of them. The rationale for the test, its specifications, and sample items should be made available to everyone concerned with preparation for the test. This is particularly important when a new test is being introduced, especially if it incorporates novel testing methods. Another, equally important, reason for supplying information of this kind is to increase test reliability, as was noted in the previous chapter.

Where necessary, provide assistance to teachers

A new test may, for example, be intended to encourage communicative language teaching, but if the teachers need guidance and possibly training, and these are not given, the test will not achieve its intended effect. It may simply cause chaos and disaffection. Where new tests are meant to help change teaching, support has to be given to help effect the change.

Counting the cost

One of the desirable qualities of tests which trips quite readily off the tongue of many testers, after validity and reliability, is that of practicality. Other things being equal, it is good that a test should be easy and cheap to construct, administer, score and interpret. We should not forget that testing costs time and money that could be put to alternative uses.

It is unlikely to have escaped the reader's notice that at least some of the recommendations listed above for creating beneficial backwash involve more than minimal expense. The individual direct testing of some abilities will take a great deal of time, as will the reliable scoring of performance on any subjective test. The production and distribution of sample tests and the training of teachers will also be costly. It might be argued, therefore, that such procedures are impractical.

In my opinion, this would reveal an incomplete understanding of what is involved. Before we decide that we cannot afford to test in a way that will promote beneficial backwash, we have to ask ourselves a question: what will be the cost of not achieving beneficial backwash? When we compare the cost of the test with the waste of effort and time on the part of teachers and students in activities quite inappropriate to their true learning goals (and in some circumstances, with the potential loss to the national economy of not having more people competent in foreign languages), we are likely to decide that we cannot afford not to introduce a test with a powerful beneficial backwash effect.

READER ACTIVITIES

1. Obtain syllabuses and specimen papers for the new GCSE examinations in foreign languages. These illustrate some of the principles enumerated in the chapter.
7 Stages of test construction

This chapter begins by briefly laying down a set of general procedures for test construction. These are then illustrated by two examples: an achievement test and a placement test. Finally there is a short section on validation.

Statement of the problem

It cannot be said too many times that the essential first step in testing is to make oneself perfectly clear about what it is one wants to know and for what purpose. The following questions, the significance of which should be clear from previous chapters, have to be answered:

What kind of test is it to be? Achievement (final or progress), proficiency, diagnostic, or placement?
What is its precise purpose?
What abilities are to be tested?
How detailed must the results be?
How accurate must the results be?
How important is backwash?
What constraints are set by unavailability of expertise, facilities, time (for construction, administration and scoring)?

Providing a solution to the problem

Once the problem is clear, then steps can be taken to solve it. It is to be hoped that a handbook of the present kind will take readers a long way towards appropriate solutions. In addition, however, efforts should be made to gather information on tests that have been designed for similar situations. If possible, samples of such tests should be obtained. There is nothing shameful in doing this; it is what professional testing bodies do when they are planning a test of a kind for which they do not already have first-hand experience. Nor does it contradict the claim made earlier that each testing situation is unique. It is not intended that other tests should simply be copied; rather that their development can serve to suggest possibilities and to help avoid the need to 'reinvent the wheel'.

Writing specifications for the test

The first form that the solution takes is a set of specifications for the test. This will include information on: content, format and timing, criterial levels of performance, and scoring procedures.

CONTENT

This refers not to the content of a single, particular version of a test, but to the entire potential content of any number of versions. Samples of this content will appear in individual versions of the test.

The fuller the information on content, the less arbitrary should be the subsequent decisions as to what to include in the writing of any version of the test. There is a danger, however, that in the desire to be highly specific, we may go beyond our current understanding of what the components of language ability are and what their relationship is to each other. For instance, while we may believe that many subskills contribute to the ability to read lengthy prose passages with full understanding, it seems hardly possible in our present state of knowledge to name them all or to assess their individual contributions to the more general ability. We cannot be sure that the sum of the parts that we test will amount to the whole in which we are generally most directly interested. At the same time, however, teaching practice often assumes some such knowledge, with one subskill being taught at a time. It seems to me that the safest procedure is to include in the content specifications only those elements whose contribution is fairly well established.

The way in which content is described will vary with its nature. The content of a grammar test, for example, may simply list all the relevant structures. The content of a test of a language skill, on the other hand, may be specified along a number of dimensions. The following provides a possible framework for doing this. It is not meant to be prescriptive; readers may wish to describe test content differently. The important thing is that content should be as fully specified as possible.

Operations (the tasks that candidates have to be able to carry out). For a reading test these might include, for example: scan text to locate specific information; guess meaning of unknown words from context.

Types of text  For a writing test these might include: letters, forms, academic essays up to three pages in length.

Addressees  This refers to the kinds of people that the candidate is expected to be able to write or speak to (for example native speakers of the same age and status); or the people for whom reading and listening materials are primarily intended (for example native-speaker university students).

Topics  Topics are selected according to suitability for the candidate and the type of test.
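A framework of this kind (operations, types of text, addressees, topics) lends itself to being recorded as structured data, so that successive versions of the test can be checked against the same potential content. The sketch below is one possible notation with invented sample values; the book prescribes no particular format.

```python
# One way of recording content specifications along the dimensions the
# chapter describes. The structure and the sample values are invented
# for illustration; the book prescribes no particular notation.
from dataclasses import dataclass, field

@dataclass
class ContentSpecification:
    skill: str
    operations: list = field(default_factory=list)
    types_of_text: list = field(default_factory=list)
    addressees: list = field(default_factory=list)
    topics: list = field(default_factory=list)

reading_spec = ContentSpecification(
    skill="reading",
    operations=["scan text to locate specific information",
                "guess meaning of unknown words from context"],
    types_of_text=["academic textbook passages", "journal articles"],
    addressees=["native-speaker university students"],
    topics=["suitable for the candidates and the type of test"],
)

# Each version of the test samples from this entire potential content.
print(len(reading_spec.operations))  # 2
```

Keeping the specification in one explicit record also makes the later sampling stage easier to audit: one can see at a glance which operations or text types a given version has left untested.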
FORMAT AND TIMING

This should specify test structure (including time allocated to components) and item types/elicitation procedures, with examples. It should state what weighting is to be assigned to each component. It should also say how many passages will normally be presented (in the case of reading or listening) or required (in the case of writing), and how many items there will be in each component.

CRITERIAL LEVELS OF PERFORMANCE

The required level(s) of performance for (different levels of) success should be specified. This may involve a simple statement to the effect that, to demonstrate 'mastery', 80 per cent of the items must be responded to correctly. It may be more complex: the Basic level oral interaction specifications of the Royal Society of Arts (RSA) Test of the Communicative Use of English as a Foreign Language will serve as an example. These refer to accuracy, appropriacy, range, flexibility, and size. Thus:

BASIC

Accuracy     Pronunciation may be heavily influenced by L1 and accented though generally intelligible. Any confusion caused by grammatical/lexical errors can be clarified by the candidate.

Appropriacy  Use of language broadly appropriate to function, though no subtlety should be expected. The intention of the speaker can be perceived without excessive effort.

Range        Severely limited range of expression is acceptable. May often have to search for a way to convey the desired meaning.

Flexibility  Need not usually take the initiative in conversation. May take time to respond to a change of topic. Interlocutor may have to make considerable allowances and often adopt a supportive role.

Size         Contributions generally limited to one or two simple utterances are acceptable.

SCORING PROCEDURES

These are most relevant where scoring will be subjective. The test constructors should be clear as to how they will achieve high scorer reliability.

Writing the test

SAMPLING

It is most unlikely that everything found under the heading of 'Content' in the specifications can be included in any one version of the test. Choices have to be made. For content validity and for beneficial backwash, the important thing is to choose widely from the whole area of content. One should not concentrate on those elements known to be easy to test. Succeeding versions of the test should also sample widely and unpredictably.

The writing of successful items (in the broadest sense, including, for example, the setting of writing tasks) is extremely difficult. No one can expect to be able consistently to produce perfect items. Some items will have to be rejected, others reworked. The best way to identify items that have to be improved or abandoned is through teamwork. Colleagues must really try to find fault; and despite the seemingly inevitable emotional attachment that item writers develop to items that they have created, they must be open to, and ready to accept, the criticisms that are offered to them. Good personal relations are a desirable quality in any test writing team.

Critical questions that may be asked include:
Is the task perfectly clear?
Is there more than one possible correct response?
Stages of test construction
Can candidates show the desired behaviour (or arrive at the correct response) without having the skill supposedly being tested?
Do candidates have enough time to perform the task(s)?

In addition to moderation, as this process is called, an attempt should be made to administer the test to native speakers of a similar educational background to future test candidates. These native speakers should score 100 per cent, or close to it. Items that prove difficult for comparable native speakers almost certainly need revision or replacement.

WRITING AND MODERATION OF SCORING KEY
Once the items have been agreed, the next task is to write the scoring key where this is appropriate. Where there is intended to be only one correct response, this is a perfectly straightforward matter. Where there are alternative acceptable responses, which may be awarded different scores, or where partial credit may be given for incomplete responses, greater care is necessary. Once again, the criticism of colleagues should be sought as a matter of course.

Pretesting
Even after careful moderation, there are likely to be some problems with every test. It is obviously better if these problems can be identified before the test is administered to the group for which it is intended. The aim should be to administer it first to a group as similar as possible to the one for which it is really intended. Problems in administration and scoring are noted. The reliability coefficients of the whole test and of its components are calculated, and individual items are analysed (see the Appendix for procedures).
It has to be accepted that, for a number of reasons, pretesting is often not feasible. In some situations a group for pretesting may simply not be available. In other situations, though a suitable group exists, it may be thought that the security of the test might be put at risk. It is often the case, therefore, that faults in a test are discovered only after it has been administered to the target group. Unless it is intended that no part of the test should be used again, it is worthwhile noting problems that become apparent during administration and scoring, and afterwards carrying out statistical analysis of the kind described in Appendix 1.

Example 1 An achievement test

Statement of the problem
There is a need for an achievement test to be administered at the end of a presessional course of training in the reading of academic texts in the social sciences and business studies (the students are graduates who are about to follow postgraduate courses in English-medium universities). The teaching institution concerned (as well as the sponsors of the students) wants to know just what progress is being made during the three-month course. The test must therefore obviously be sufficiently sensitive to measure gain over that relatively short period. While there is no call for diagnostic information on individuals, it would be useful to know, for groups, where the greatest difficulties remain at the end of the course, so that future courses may give more attention to these areas. Backwash is considered important; the test should encourage the practice of the reading skills that the students will need in their university studies. This is in fact intended to be only one of a battery of tests, and a maximum of two hours can be allowed for it. It will not be possible at the outset to write separate tests for different subject areas.

Specifications

CONTENT
Types of text: The texts should be academic (taken from textbooks and journal papers).
Addressees: Academics at postgraduate level and beyond.
Topics: The subject areas will have to be as 'neutral' as possible, since the students are from a variety of social science and business disciplines (economics, sociology, management etc.).
Operations: These are based on the stated objectives of the course, and include broad and underlying skills:

Broad skills:
1. Scan extensive written texts for pieces of information.
2. Construe the meaning of complex, closely argued passages.

Underlying skills:
(Those which are regarded as of particular importance for the development of the broad skills, and which are given particular attention in the course.)
3. Guessing the meaning of unfamiliar words from context.
4. Identifying referents of pronouns etc., often some distance removed in the text.
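The item analysis mentioned under Pretesting above can be sketched in code. This is a minimal illustration only, not the procedure from this book's Appendix: it uses the standard facility value (the proportion of candidates answering an item correctly) and a simple top-third/bottom-third discrimination index, and all function names are illustrative.

```python
# Sketch of basic item analysis (assumed standard formulas, not the
# Appendix's own procedures). Item scores are 1 (correct) or 0 (incorrect).

def facility(item_scores):
    """Facility value: proportion of candidates answering correctly."""
    return sum(item_scores) / len(item_scores)

def discrimination(item_scores, total_scores):
    """Facility in the top third (ranked by total test score) minus
    facility in the bottom third. Values near zero, or negative,
    flag items for revision or rejection."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    third = len(order) // 3
    bottom = [item_scores[i] for i in order[:third]]
    top = [item_scores[i] for i in order[-third:]]
    return facility(top) - facility(bottom)
```

An item that weaker candidates answer correctly as often as stronger candidates do (discrimination near zero) is not helping the test separate the two groups, however appropriate its content.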
FORMAT AND TIMING
Scanning: 2 passages, each c. 3,000 words in length. 15 short-answer items on each, the items in the order in which relevant information appears in the texts. 'Controlled' responses where appropriate.
e.g. ... certain intelligence test tasks require ............ , something of which we are less capable as we grow older.

5 meaning-from-context items from one of the detailed reading passages
e.g. For each of the following, find a single word in the text with an equivalent meaning. Note: the word in the text may have an ending such as -ing, -s, etc.
highest point (lines 20-35)

5 referent-identification items from one of the detailed reading passages
e.g. What does each of the following refer to in the text? Be very precise.
the former (line 43)

SCORING PROCEDURES
There will be a detailed key, making scoring almost entirely objective. There will nevertheless be independent double scoring. Scorers will be trained to ignore irrelevant (for example grammatical) inaccuracy in responses.

SAMPLING
Texts will be chosen from as wide a range of topics and types of writing as is compatible with the specifications. Draft items will only be written after the suitability of the texts has been agreed. Pretesting of texts and items sufficient for at least two versions will be carried out with students currently taking the course.

Example 2 A placement test

Statement of the problem
A commercial English language teaching institution needs a placement test. Its purpose will be to assign new students to classes at five levels: false beginners; lower intermediate; middle intermediate; upper intermediate; advanced. Course objectives at all levels are expressed in rather general 'communicative' terms, with no one skill being given greater attention than any other. As well as information on overall ability in the language, some indication of oral ability would be useful.

Specifications

CONTENT
As testing will be indirect, operations are dealt with under 'format'. Styles and topics of texts will reflect the kinds of written and spoken texts that are to be found in teaching materials at the various levels.

FORMAT AND TIMING
Cloze-type passages: ... grammar items and vocabulary items, as well as to some extent testing overall comprehension.
Time: 30 minutes. (Note: this seems very little time, but the more advanced students will find the early passages extremely easy, and will take very little time. It does not matter whether lower level students reach the later passages.)
(For examples of cloze-type passages, see next chapter.)

Partial dictation: A 20-item partial dictation test (part of what is heard by the candidates already being printed on the answer sheet). Stretches to be transcribed being ordered according to difficulty. Spelling to be ignored in scoring. To be administered in the language laboratory. Since no oral test is possible, this is intended to give some indication of oral ability.
Time: 10 minutes
e.g. The candidate hears:
'Particular consideration was given to the timing of the report'
and has to complete:
Particular consideration ............... timing ...............

Unanticipated errors will probably occur during pretesting; the final key will have to take account of all of these, and be highly explicit about their treatment.

In addition, information should be collected on previous training in English, as well as comments of the person who carries out the registration, using English as far as possible (the registration procedure can be used as an informal, and not very reliable, oral test).

Pretesting
More passages and dictation items will be constructed than will finally be used. All of them will be pretested on current students at the various levels in the institution. Problems in administration and scoring will be noted.
Students' scores are noted on each of the cloze-type passages and the partial dictation. Total scores for the cloze-type passages and for the test as a whole are calculated. Each of these scores for each student is then compared with his or her level in the institution. The next step is to identify the 'cutting' scores which will most accurately divide the students up into their actual levels in the school. 'Misfits', those students who are assigned to the 'wrong' level by the test using these cutting scores, can be investigated; it may be that they would be as well off in the level in which the test has placed them.
Improvements in items are likely to suggest themselves during pretesting. There are also statistical procedures for combining scores on the various parts of the test to maximise its predictive power. Unfortunately, these must fall outside the scope of a book of this size.
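The search for 'cutting' scores just described can be sketched as a brute-force comparison of candidate boundaries. This is an illustrative sketch under stated assumptions, not the statistical procedures the book alludes to: for one boundary between two adjacent levels, every possible cut-off is tried and the one producing fewest misfits is kept. Function and variable names are hypothetical.

```python
# Illustrative sketch: choose the cut-off between two adjacent levels
# that minimises 'misfits' (students whom the test would place in the
# wrong level). levels holds 0 (lower class) or 1 (upper class).

def best_cut(scores, levels):
    """Return (cut, misfits): students scoring >= cut go to the upper
    class; misfits is the number of wrong placements at that cut."""
    best_cut_score, fewest = None, len(scores) + 1
    # Candidate cuts: each observed score, plus one above the maximum
    # (i.e. "place everyone in the lower class").
    for cut in sorted(set(scores)) + [max(scores) + 1]:
        misfits = sum(1 for s, lvl in zip(scores, levels)
                      if (s >= cut) != (lvl == 1))
        if misfits < fewest:
            best_cut_score, fewest = cut, misfits
    return best_cut_score, fewest
```

With five levels, the same search would be run for each of the four boundaries in turn, and the remaining misfits investigated individually, as suggested above.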
SAMPLING
As noted above, the texts are meant to provide a representative sample of the kinds of spoken and written English found at the various levels. Choice of texts will be agreed before deletions are made.

SCORING PROCEDURES
Scoring will be objective and may be carried out by non-teachers.

CRITERIAL LEVELS OF PERFORMANCE
These will be determined during pretesting (see above).

ITEM AND KEY WRITING (AND MODERATION)
Cloze-type tests: Once texts have been chosen, deletions are made and keys constructed. The mutilated passages are then presented to the moderating group. It is particularly important that there is only one acceptable response for most items (preferably no more than two or three on any item). Revised versions are presented to native speakers, and revised again if necessary.

Validation of the test

When we speak of validating a particular test, we are usually referring to criterion-related validity. We are looking for empirical evidence that the test will perform well against some criterion. Chapter 4 mentioned typical criteria against which tests are judged. In the case of the placement test above, the proportion of students that it assigned to inappropriate classes would be the basis for assessing its validity. The achievement test might be validated against the ratings of students by their current language teachers and by their future subject teachers soon after the beginning of their academic courses.

READER ACTIVITIES
English, you might look at ELTS (British Council), TEEP (Test of English for Educational Purposes), and TOEFL. If specifications are not available, you will have to infer them from sample tests or past papers.

Further reading

It is useful to study existing specifications. Specifications for the UCLES/RSA Test in the Communicative Use of English as a Foreign Language can be obtained from UCLES, 1-3 Hills Road, Cambridge. (NB This examination is currently under review. The revised version is expected to be ready from September, 1990. The first administration will not be before November, 1990.) Those for TEEP are to be found in Weir (1988a, 1988b).

8 Test techniques and testing overall ability

What are test techniques? Quite simply, they are means of eliciting behaviour from candidates which will tell us about their language abilities. What we need are techniques which:
1. will elicit behaviour which is a reliable and valid indicator of the ability in which we are interested;
2. will elicit behaviour which can be reliably scored;
3. are as economical of time and effort as possible;
4. will have a beneficial backwash effect.

From Chapter 9 onwards, techniques are discussed in relation to particular abilities. This chapter discusses one technique, multiple choice, which has been recommended and used for the testing of many language abilities. It then goes on to examine the use of techniques which may be used to test 'overall ability'.
Multiple choice

Multiple choice items take many forms, but their basic structure is as follows.
There is a stem:
Enid has been here ............ half an hour.
and a number of options, one of which is correct, the others being distractors:
A. during
B. for
C. while
D. since
It is the candidate's task to identify the correct or most appropriate option (in this case B).
Perhaps the most obvious advantage of multiple choice, referred to earlier in the book, is that scoring can be perfectly reliable. Scoring should also be rapid and economical. A further considerable advantage is
that, since in order to respond the candidate has only to make a mark on the paper, it is possible to include more items than would otherwise be possible in a given period of time. As we know from Chapter 5, this is likely to make for greater test reliability.
The advantages of the multiple choice technique were so highly regarded at one time that it almost seemed that it was the only way to test. While many laymen have always been sceptical of what could be achieved through multiple choice testing, it is only fairly recently that the technique's limitations have been more generally recognised by professional testers. The difficulties with multiple choice are as follows.

The technique tests only recognition knowledge
If there is a lack of fit between at least some candidates' productive and receptive skills, then performance on a multiple choice test may give a quite inaccurate picture of those candidates' ability. A multiple choice grammar test score, for example, may be a poor indicator of someone's ability to use grammatical structures. The person who can identify the correct response in the item above may not be able to produce the correct form when speaking or writing. This is in part a question of construct validity: whether or not grammatical knowledge of the kind that can be demonstrated in a multiple choice test underlies the productive use of grammar. Even if it does, there is still a gap to be bridged between knowledge and use; if use is what we are interested in, that gap will mean that test scores are at best giving incomplete information.

Guessing may have a considerable but unknowable effect on test scores
The chance of guessing the correct answer in a three-option multiple choice item is one in three, or roughly thirty-three per cent. On average we would expect someone to score 33 on a 100-item test purely by guesswork. We would expect some people to score fewer than that by guessing, others to score more. The trouble is that we can never know what part of any particular individual's score has come about through guessing. Attempts are sometimes made to estimate the contribution of guessing by assuming that all incorrect responses are the result of guessing, and by further assuming that the individual has had average luck in guessing. Scores are then reduced by the number of points the individual is estimated to have obtained by guessing. However, neither assumption is necessarily correct, and we cannot know that the revised score is the same as (or very close to) the one an individual would have obtained without guessing. While other testing methods may also involve guessing, we would normally expect the effect to be much less, since candidates will usually not have a restricted number of responses presented to them (with the information that one of them is correct).

The technique severely restricts what can be tested
The basic problem here is that multiple choice items require distractors, and distractors are not always available. In a grammar test, it may not be possible to find three or four plausible alternatives to the correct structure. The result is that command of what may be an important structure is simply not tested. An example would be the distinction in English between the past tense and the present perfect. Certainly for learners at a certain level of ability, in a given linguistic context, there are no other alternatives that are likely to distract. The argument that this must be a difficulty for any item that attempts to test for this distinction is difficult to sustain, since other items that do not overtly present a choice may elicit the candidate's usual behaviour, without the candidate resorting to guessing.

It is very difficult to write successful items
A further problem with multiple choice is that, even where items are possible, good ones are extremely difficult to write. Professional test writers reckon to have to write many more items than they actually need for a test, and it is only after pretesting and statistical analysis of performance on the items that they can recognise the ones that are usable. It is my experience that multiple choice tests that are produced for use within institutions are often shot through with faults. Common amongst these are: more than one correct answer; no correct answer; clues in the options as to which is correct (for example the correct option may be different in length to the others); ineffective distractors. The amount of work and expertise needed to prepare good multiple choice tests is so great that, even if one ignored other problems associated with the technique, one would not wish to recommend it for regular achievement testing (where the same test is not used repeatedly) within institutions. Savings in time for administration and scoring will be outweighed by the time spent on successful test preparation. It is true that the development and use of item banks, from which a selection can be made for particular versions of a test, makes the effort more worthwhile, but great demands are still made on time and expertise.

Backwash may be harmful
It should hardly be necessary to point out that where a test which is important to students is multiple choice in nature, there is a danger that
practice for the test will have a harmful effect on learning and teaching. Practice at multiple choice items (especially when, as happens, as much attention is paid to improving one's educated guessing as to the content of the items) will not usually be the best way for students to improve their command of a language.

Cheating may be facilitated
The fact that the responses on a multiple choice test (a, b, c, d) are so simple makes them easy to communicate to other candidates nonverbally. Some defence against this is to have at least two versions of the test, the only difference between them being the order in which the options are presented.

All in all, the multiple choice technique is best suited to relatively infrequent testing of large numbers of candidates. This is not to say that there should be no multiple choice items in tests produced regularly within institutions. In setting a reading comprehension test, for example, there may be certain tasks that lend themselves very readily to the multiple choice format, with obvious distractors presenting themselves in the text. There are real-life tasks (say, a shop assistant identifying which one of four dresses a customer is describing) which are essentially multiple choice. The simulation in a test of such a situation would seem to be perfectly appropriate. What the reader is being urged to avoid is the excessive, indiscriminate, and potentially harmful use of the technique.

highly proficient speaker but a poor writer, or vice versa. It follows from this that if we want to know, for example, how well an individual speaks, then we will not always get accurate information if we base our assessment on his or her performance in writing.
One reason why various studies provided statistical support for the Unitary Competence Hypothesis was that they were concerned with the performance of groups rather than of individuals. Differences within individuals (say, between writing and speaking) tended to be hidden by the differences between individuals. What is more, for many individuals the differences between abilities were small. In general, where people have had training in both speaking and writing, the better someone writes, the better they are likely to speak. This may reflect at least three facts. First, people differ in their aptitude for language learning; someone with high aptitude will tend to perform above average in all aspects. Secondly, training in the language is often (but not always) such as to encourage comparable development in the different skills. Thirdly, there is likely to be at least some transfer of learning from one skill to another; words learned through listening would seem unlikely to be unavailable to the learner when reading.
It is because of the tendencies described in the previous paragraph that it makes sense to speak of someone being 'good (or quite good, or bad) at a language'. While the Unitary Competence Hypothesis has been abandoned, the notion of overall ability can, in some circumstances, be a useful concept. There are times when all we need is an approximate idea
'lead-in', it is usually about every seventh word which is deleted. The following example, which the reader might wish to attempt, was used in research into cloze in the United States (put only one word in each space). The answers are at the end of this chapter.

What is a college?
Confusion exists concerning the real purposes, aims, and goals of a college. What are these? What should a college be?
Some believe that the chief function 1._____ even a liberal arts college is 2._____ vocational one. I feel that the 3._____ function of a college, while important, 4._____ nonetheless secondary. Others profess that the 5._____ purpose of a college is to 6._____ paragons of moral, mental, and spiritual 7._____ Bernard McFaddens with halos. If they 8._____ that the college should include students 9._____ the highest moral, ethical, and religious 10._____ by precept and example, I 11._____ willing to accept the thesis.
I 12._____ in attention to both social amenities 13._____ religious matters, but I prefer to see 14._____ colleges get down to more basic 15._____ ethical considerations instead of standing in loco parentis 16._____ four years when 17._____ student is attempting in his youthful 18._____ awkward ways, to grow up. It 19._____ been said that it was not 20._____ duty to prolong adolescence. We are 21._____ adept at it.
There are those 22._____ maintain that the chief purpose of 23._____ college is to develop "responsible citizens." 24._____ is good if responsible citizenship is 25._____ by-product of all the factors which 26._____ to make up a college education 27._____ life itself. The difficulty arises from 28._____ confusion about the meaning of responsible 29._____. I know of one college which 30._____ mainly to produce, in a kind 31._____ academic assembly line, outstanding exponents of 32._____ system of free enterprise.
Likewise, I 33._____ to praise the kind of education 34._____ extols one kind of economic system 35._____ the exclusion of the good portions 36._____ other kinds of economic systems. It 37._____ to me therefore, that a college 38._____ represent a combination of all 39._____ above aims, and should be something 40._____ besides - first and foremost an educational 41._____, the center of which is the 42._____ exchange between teachers and students.
I 43._____ read entirely too many statements such 44._____ this one on admissions application papers: "45._____ want a college education because I 46._____ that this will help to support 47._____ and my family." I suspect that 48._____ job as a bricklayer would help this 49._____ to support himself and his family 50._____ better than a college education.
(Oller and Conrad 1971)

Some of the blanks you will have completed with confidence and ease. Others, even if you are a native speaker of English, you will have found difficult, perhaps impossible. In some cases you may have supplied a word which, though different from the original, you may think just as good or even better. All of these possible outcomes are discussed in the following pages.
There was a time when the cloze procedure seemed to be presented almost as a language testing panacea. An integrative method, it was thought by many to draw on the candidate's ability to process lengthy passages of language: in order to replace the missing word in a blank, it was necessary to go beyond the immediate context. In predicting the missing word, candidates made use of the abilities that underlay all their language performance. The cloze procedure therefore provided a measure of those underlying abilities, its content validity deriving from the fact that the deletion of every nth word meant that a representative sample of the linguistic features of the text was obtained. (It would not be useful to present the full details of the argument in a book of this kind. The interested reader is referred to the Further reading section at the end of the chapter.) Support for this view came in the form of relatively high correlations between scores on cloze passages and total scores on much longer, more complex tests, such as the University of California at Los Angeles (UCLA) English as a Second Language Placement Test (ESLPE), as well as with the individual components of such tests (such as reading and listening).
The cloze procedure seemed very attractive. Cloze tests were easy to construct, administer and score. Reports of early research seemed to suggest that it mattered little which passage was chosen or which words were deleted; the result would be a reliable and valid test of candidates' underlying language abilities (the cloze procedure was also advocated as a measure of reading ability; see Chapter 10).
Unfortunately, cloze could not deliver all that was promised on its behalf. For one thing, as we saw above, even if some underlying ability is being measured through the procedure, it is not possible to predict accurately from this what is people's ability with respect to the variety of separate skills (speaking, writing etc.) in which we are usually interested. Further, it turned out that different passages gave different results, as did the deletion of different sets of words in the same passage. Another matter for concern was the fact that intelligent and educated native speakers varied quite considerably in their ability to predict the missing words. What is more, some of them did less well than many non-native speakers. The validity of the procedure, even as a very general measure of overall ability, was thus brought into question.
There seems to be fairly general agreement now that the cloze procedure cannot be depended upon automatically to produce reliable
and useful tests. There is need for careful selection of texts and some
'p~ete.~tIng. The fact that deletion of every nth word almost alwavs
The deletions in the above passage were chosen to provide 'interesting'
items. Most of them we might be inclined to regard as testing 'grammar',
I
bur to respond to them successfully more than grammatical ability is
produces problematical items (for example impossible to predict the
rmssmg word), points to the advisability of a careful selection of words.to
delete, from the outset._The following is an in-house ci~i~ passage, for
needed; processing of various features of context is usually necessary.
Another feature is that native speakers of the same general academic
I
students at universiry entrance level, in which this has been done. I Again ability as the students for whom the test was intended could be expected
the reader IS invited to try to complete the gaps.
Ecology
I would recommend for measuring overall ability. General advice on the
construction of such tests is given below.
I
It may reasonably be thought that doze procedures, since they produce
Water, soil and the earth's green mantle of
plants make up the world that supports the
animal life of the earth. Although modern
purely pencil and paper tests, cannot tell us anything about the oral
component of overall proficiency."
However, some research has explored the possibility of using doze
I
man seldom remembers the fact, he could not
passages bused on tape-recordings of oral interaction to predict oral
I
exist without the plants that harness the
sun's energy and manufacture the basic food- ability. The passage used in some of that research is presented here. Tlus
stuffs he depends (1) for life. Our 1 _ type of material would not normally he used in a test for non-native
attitude (2) plants is a singularly 2 _
narrow (3) • If we see any immediate 3 _ speakers as it is very culturally hound, and probably only sorncbodv from
I
utility in (4) plant we foster it. 4 _ a British background could understand it fully. It is a good example of
(5) for any reason we find its presence 5 _ informal family conversation, where sentences are lett unfinished and
undesirable, (6) merely a matter uf 6 _
indifference, we may condemn (7) to 7 _ topics run into each other. (Again the reader is invited to attempt to
destruction. Besides the various plants predict the missing words. Note that things like John's, I'm etc. count as
(B)
(9)
are poisonous to man or to
livestock, or crowd out food plants,
many are marked (10) destruction merely
because, according to our narrow view, they
9
8
1"'0-----
_
one word. Only one word per spacc.)
Famil» reunion
I
happen to (11) in the wrong place at the 11 _ Mother: I love that dress, Mum.
(12)
merely
time. Many others are destroyed
(13) they happen to be associates
of the unwanted plants.
12
13----- Cr.mdmothcr:
:\!other:
Oh, it's M and S.
Is it?
I
CrJndnlOther:Yes, five pounds.
The earth's vegetation is (14) of a web 14 _
of life in which there are intimate and
essential relations between plants and the
earth, between plants and (15) plants, 15 _
Mother:
Crandmorhcr:
My goodness, It'S not, i\!UI1l.
But it's made of that Tvshirt sruff, so ! don't think It'll
wash very. (1),1'011 know, thrv go all ...
I
between plants and animals. Sometimes we
have no (16) but to disturb Mother: sort (2) I know the kind, yes .
I
16 _
(17) relationships, but we should 17 _ Grandmother: Yes.
(18) so thoughtfully, with full 18 _ Mother: I've got some Tvshirrs of that, and ... (3)
awareness that (19) we do may' 19 _
(20) consequences remote in time and 20 _ shrink upwards and go wide ...
I
place. Cr.mdmorhcr: I know, so ...
2. 111 t.ict , doze tests h.ive been admimsrered or.llly on .111 mdividu.il baSIS, p.i rncularlv to
1. Because they see the selection of deletions as tund.imeutally different from the original
doze procedure, some testers feel the need to gIve tim procedure a new name. Weir
(1 '-)8Sa), for instance, refers to this testing method as 'selective deletion PI' filling'.
younger children - but thIS IS 3 tune-consunnng operation an.! It IS 110t clear that the
information it provides is as useful as C.1n be obr.nncd through more usual testing
procedures.
Test techniques and testing overall ability
Mother: It's a super colour. It ...... (4) a terribly expensive one, doesn't it? ...... (5) you think so when you saw ...... (6)?
Grandmother: Well, I always know in Marks. ...... (7) just go in there and ... and ... ...... (8) it's not there I don't buy it. I know I won't like anything else. I got about three from there ... four from there. Only I wait about ...
Girl: Mummy, can I have a sweetie?
Mother: What, love?
Grandmother: Do you know what those are called? ... Oh, I used to love them ...... (9) I was a little girl. Liquorice comfits. Do you like liquorice? Does she?
Mother: ...... (10) think she quite likes it. Do ...... (11)? We've got some liquorice allsorts actually ...... (12) the journey.
Grandmother: Oh yes.
Mother: And I said she could have one after.
Grandmother: Oh, I'm going to have one. No, I'm ...... (13). No, it'd make me fat, dear.
Mother: Listen. Do you want some stew? It's hot now.
Grandmother: No, no, darling. I don't want anything.
Mother: Don't you want any? Because ...... (14) just put it on the table.
Grandmother: I've got my Limmits.
Mother: Are you going ...... (15) eat them now with us?
Grandmother: Yes. ...... (16) you going to have yours ... yours now?
Mother: Well, I've just put mine on the plate, but Arth says he doesn't ...... (17) any now.
Grandmother: Oh yes, go on.
Mother: So ... so he's going to come down later ...
Grandmother: What are ...... (18) going to eat? ... Oh, I like ...... (19). Is that a thing that ...
Mother: ... you gave me, but I altered it.
Grandmother: Did ...... (20) shorten it?
Mother: I took the frill ...... (21).
Grandmother: I thought it looked ...
Mother: I altered ...... (22) straps and I had to ...
Girl: That's ...... (23) you gave me, Granny ... Granny, I'm ...... (24) big for that ...
Mother: And so is Jake. It's for a doll ... Do you remember that?
Grandmother: No.
Mother: Oh, Mum, you're awful. ...... (25) made it.
Grandmother: Did I make this beautiful thing? I don't remember making that ... Gosh, I wouldn't make one of those now.
Mother: I know. You must have ...
Grandmother: It's, ...... (26) know, typical no I meant a typical Mrs Haversham thing.
Girl: Can I have a sweetie?
Grandmother: Harry would ... Carol would be disgusted ...... (27) this.
Mother: Would she?
Grandmother: Yes, she'd want navy blue or black.
Mother: Oh.
Grandmother: Hello, darling. ...... (28) can't I can't I must look ...... (29) him a lot, Gemma. Do you ...... (30)? I mean Megan. Do you mind?
Girl: Why?
Grandmother: Because I have to see who ...... (31) like.
Mother: Mum thinks he looks like Dad I mean, Granny thinks ...
Grandmother: Except of ...... (32) he hasn't got his nose, has ...... (33)?
Mother: No, but it doesn't look quite ...... (34) big now but, ooh, it really ...... (35) look enormous when ...
Grandmother: Oh you darling ...... (36) thing. He's got wide eyes, hasn't he? Has he got wide apart eyes?
Mother: ...... (37), I suppose he has, hasn't he?
Grandmother: I think he's going to be a little like Megan when he gets older ... Yes, I think he is ... Only he's ...... (38) dark like she is, is he?
Mother: Except that his eyes are quite dark, ...... (39) they, now ...
Grandmother: Oh, yes, I think ...... (40) bound to go dark, aren't they?
Mother: Probably, yes.

(Garman and Hughes 1983)

The thinking behind this was that familiarity with the various features of spoken English represented in the text would make a significant contribution to candidates' performance. As it turned out, this 'conversational cloze', administered to overseas students who had already been in Britain for some time, was a better predictor of their oral ability (as rated by their teachers) than was the first cloze passage presented in this chapter. This suggests the possibility of using such a passage as part of a
placement test battery of cloze tests, where a direct measure of oral ability is not feasible. More appropriate recordings than the one used in the research could obviously be found.
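Anchoring such a battery's cut-off points in the scores of students whose level is already known can be sketched as follows. All class names and scores here are invented, and midpoint cut-offs are just one simple choice, not a procedure the book prescribes.

```python
from statistics import mean

# Invented data: cloze scores of students already placed, per class level
scores_by_class = {
    "elementary":   [8, 11, 9, 12],
    "intermediate": [17, 15, 19, 16],
    "advanced":     [24, 27, 25, 26],
}

means = {level: mean(s) for level, s in scores_by_class.items()}
levels = list(means)                              # ordered low to high
cutoffs = [(means[a] + means[b]) / 2              # midpoints between
           for a, b in zip(levels, levels[1:])]   # adjacent levels

def place(score):
    """Place a new student at the lowest level whose upper cut-off
    their cloze score does not reach."""
    for level, cut in zip(levels, cutoffs):
        if score < cut:
            return level
    return levels[-1]
```

Teachers' judgements about misplaced students would then be used to adjust the cut-offs.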
... scoring. Scorers are given a card with the acceptable responses written in such a way as to lie opposite the candidates' responses.

Anyone who is to take a cloze test should have had several opportunities to become familiar with the technique. The more practice they have had, the more likely it is that their scores will represent their true ability in the language.

Cloze test scores are not directly interpretable. In order to be able to interpret them we need to have some other measure of ability. If a series of cloze passages is to be used as a placement test (see previous chapter), then the obvious thing would be to have all students currently in the institution complete the passages. Their cloze scores could then be compared with the level at which they are studying in the institution. Information from teachers as to which students could be in a higher (or lower) class would also be useful. Once a pattern was established between cloze test scores and class level, the cloze passages could be used as at least part of the placement procedure.

The C-Test

The C-Test is really a variety of cloze, which its originators claim is superior to the kind of cloze described above. Instead of whole words, it is the second half of every second word that is deleted. ... A candidate with a good puzzle-solving strategy may be at an advantage over a candidate of similar foreign language ability. However, research would seem to indicate that the C-Test functions well as a rough measure of overall ability in a foreign language. The advice given above about the development of cloze tests applies equally to the C-Test (except of course in details of deletion).

Dictation

In the 1960s it was usual, at least in some parts of the world, to decry dictation testing as hopelessly misguided. After all, since the order of words was given, it did not test word order; since the words themselves were given, it did not test vocabulary; since it was possible to identify words from the context, it did not test aural perception. While it might test punctuation and spelling, there were clearly more economical ways of doing this.
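The C-Test's deletion rule can also be sketched. The standard formulation is assumed here (the scan loses part of the book's own description): the second half of every second word is deleted, with the larger 'half' remaining for words of odd length. Real C-Tests also leave the first and last sentences intact and skip one-letter words, which this illustrative function ignores.

```python
def make_c_test(sentence):
    """Mutilate a sentence C-Test style: delete the second half of
    every second word, keeping the larger part of odd-length words."""
    out = []
    for i, word in enumerate(sentence.split(), start=1):
        if i % 2 == 0 and len(word) > 1:
            keep = (len(word) + 1) // 2            # larger half stays
            word = word[:keep] + "_" * (len(word) - keep)
        out.append(word)
    return " ".join(out)
```

Applied to "There are many ways of testing overall ability", this yields "There ar_ many wa__ of test___ overall abil___".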
Unitary Competence Hypothesis (including Oller's renouncement of it) see Hughes and Porter (1983). The research in which the first cloze passage in the chapter was used is described in Oller and Conrad (1971). Examples of the kind of cloze recommended here are to be found in Cambridge Proficiency Examination past papers. Hughes (1981) is an account of the research into conversational cloze. Streiff (1978) describes the use of oral cloze with Eskimo children. Klein-Braley and Raatz (1984) and Klein-Braley (1985) outline the development of the C-Test. Lado (1961) provides a critique of dictation as a testing technique, while Lado (1986) carried out further research using the passage employed by Oller and Conrad, to cast doubt on their claims. Garman and Hughes (1983) provide cloze passages for teaching, but they could form the basis for tests (native speaker responses given).

Answers (conversational cloze, items 18-40): 18. you; 19. that; 20. you; 21. off; 22. the; 23. what; 24. too; 25. You; 26. you; 27. with, at, by, about; 28. I; 29. at; 30. mind; 31. he's; 32. course; 33. he; 34. as, so, that; 35. did; 36. little, sweet, wee; 37. Yes; 38. not; 39. aren't; 40. they're.

9 Testing writing

We will make the assumption in this chapter that the best way to test people's writing ability is to get them to write.¹

This is not an unreasonable assumption. Even professional testing institutions are unable to construct indirect tests which measure writing ability accurately (see Chapter 3, Further reading: Godshalk et al.). And if in fact satisfactory accuracy were a real possibility, considerations of backwash would still argue for the direct testing of writing.

We need to establish at the outset just what these tasks are that candidates should be able to perform. These should be identified in the test specifications. The framework for the specification of content presented in Chapter 7 is relevant here: operations, types of text, addressees, topics.

1. We will also assume that the writing of elementary students is not to be tested. Whatever writing skills are required of them can be assessed informally. There seems little point in constructing, for example, a formal test of the ability to form characters or transcribe simple sentences.
Let us look at the specifications for the writing test at the Basic level of the RSA Examination in the Communicative Use of English as a Foreign Language. It should be said immediately that I hold no particular brief for the examining board in question. Indeed I think that there is some conceptual confusion in the content of its specifications for this test. Nevertheless, it represents an unusually thorough attempt to specify test content.

Operations
    Expression of: thanks, requirements, opinions, comment, attitude, confirmation, apology, want/need, information
    Eliciting: information, directions, service (and all areas above)

Text types
    Form: Letter, Postcard, Note, Forms
    Type: Announcement, Description, Narration, Comment

Addressees
    ...

There may be some points where perhaps more detail is called for; others where additional elements are needed. There is certainly no reason to feel limited to this particular framework or its content, but all in all these specifications should provide a good starting point for many testing purposes.

Once the dimensions of the test have been specified, it becomes relatively straightforward to make a selection in the way outlined in the previous chapter. It is interesting to see how the RSA examiners have done this. One of their examinations at the Basic level is reproduced on pages 78-80.²

From this it is clear that the examiners have made a serious attempt to create a representative sample of tasks (the reader might wish to check off the elements in the specification which are represented in the test). What also becomes clear is that with so many potential tasks and with so few items, the test's content validity is brought into question. Really good coverage of the range of potential tasks is not possible in a single version of the test. This is a problem to which there is no easy answer. Only research will tell us whether candidates' performance on one small set of selected tasks will result in scores very similar to those that their performance on another small non-overlapping set would have been awarded.

A second example, this time much more restricted, concerns the writing component of a test of English for academic purposes with which I was associated. The purpose of the test was to discover whether a ...

2. Note that the test is intended primarily for candidates living in the United Kingdom.
1. You are in the United Kingdom. You know you will need a cheap air-ticket to visit your home country/take a trip overseas later in the year. You decide to join a travel club so that your journey will be as cheap as possible. Complete the application form below.

'BETATRAVEL TRAVEL CLUB'
The cheapest flights from major UK airports to destinations worldwide.

To join our club you just complete the form. Please complete all sections of the form clearly in ink. Answer all the questions. Do not leave a blank. Put n/a if not applicable. For Section A use BLOCK CAPITALS.

Section A
1. STATE MR., MRS., MISS, MS., or OTHER TITLE
2. FAMILY NAME
3. OTHER NAMES
5. NATIONALITY
7. OCCUPATION

Section B
8. Which U.K. airport would be most convenient for you?
9. Which country or countries are you intending to travel to?
10. Proposed date of departure

Section C  Please supply the following further details.
11. Proposed duration of stay outside U.K.

2. An English friend of yours living outside the U.K. has given you £10.00 to buy a couple of books for her. She definitely wants "The Guinness Book of Records" and she'd also like a new pocket "Oxford English Dictionary" if the money she has given you is enough.

Write to the bookshop advertised below ordering "The Guinness Book of Records". Find out from them whether £10.00 will be enough also to pay for the dictionary your friend wants. Make sure they send you full details of prices including p & p. Check on what methods of payment they will accept and get them to confirm the latest delivery date (your friend wants the books by the beginning of August).

Write the letter on the next page. Write address(es) on the letter as appropriate.

B.E.B.C.
Britain's leading distributor of English Language Teaching books and aids to students, teachers and booksellers throughout the world.
3. You are leaving the U.K. for a few weeks. When you arrive at the airport you remember something you forgot to do before leaving home. You can't contact anybody on the phone.

Write a brief note to a friend. Apologize for any trouble you are causing. Explain to him what you want him to do. Thank him for helping you.

Write the message and your friend's address on the card below.

4. You've been away for nearly 2 weeks now and you have a lot of news. Write to some English friends who have offered to meet you at the airport on your return. Thank them for the offer. Tell them when and where they can meet you. Tell them the different things you have seen and done since you left. Say how you feel about your trip so far.

Your letter should not be more than 120 words. Write the letter below.

Text type      examination answers up to two paragraphs in length
Addressees     native speaker and non-native speaker university lecturers
Topics         any capable of academic treatment

It can be seen that in this case it is not at all difficult to select representative writing tasks. Content validity is less of a problem than with the much wider-ranging RSA examination. Since it is only under the heading of 'operations' that there is any significant variability, a test that required the student to write four answers could cover the whole range of tasks, assuming that differences of topic really did not matter. In fact, the writing component of each version of the test contained two writing tasks, and so fifty per cent of all tasks were to be found in each version of the test. The assumption about topics may seem hard to accept. However, topics were chosen with which it was expected all students would be familiar, and information or arguments were provided (see example, page 85).

We should begin by recognising that from the point of view of validity, the ideal test would be one which required candidates to perform all the relevant potential writing tasks. The total score obtained on that test would be our best estimate of a candidate's ability. That total score would be the sum of the scores on each of the different tasks. We would not expect all of those scores to be equal, even if they were perfectly scored on the same scale. First, as we saw in Chapter 5, people's performance on the same task is not consistent. For this reason we have to offer candidates as many 'fresh starts' as possible, and each task can represent a fresh start. Secondly, people will simply be better at some tasks than others. If we happen to choose just the task or tasks that a candidate is particularly good at, then the outcome is likely to be very different from that which would result from the ideal test that included all the tasks. Again, this is a reason for including as many different tasks as possible.

Of course, this has to be balanced against practicality; otherwise we would always include all of the potential tasks. It must be remembered, however, that if we need to know something accurate and meaningful about a person's writing ability, then we have to be prepared to pay for that information. What we decide to do will depend in large part on how accurate the information has to be. If it is a matter of placing students in classes from which they can easily be moved to another more appropriate ...
This assumes that we do not want to test other things. In language testing we are not normally interested in knowing whether students are creative, imaginative, or even intelligent, have wide general knowledge, or have good reasons for the opinions they happen to hold. For that reason we should not set tasks which measure these abilities. Look at the following tasks which, though invented, are based on others taken from well-known tests.

1. Write the conversation you have with a friend about the holiday you plan to have together.
2. You spend a year abroad. While you are there, you are asked to talk to a group of young people about life in your country. Write down what you would say to them.
3. 'Envy is the sin which most harms the sinner.' Discuss.
4. The advantages and disadvantages of being born into a wealthy family.

Either (a) In about 250 words: ... Britain. The aim of the scheme is to promote international understanding, and to foster an awareness of different working methods.

Candidates for the scheme are asked to write an initial letter of application, briefly outlining their general background and, more importantly, giving the reasons why they feel they would benefit from the scheme. In addition, they should indicate in which country they would like to work. On the basis of this letter they may be invited for interview and offered a post.

Write the letter of application.

One way of reducing dependence on the candidates' ability to read is to make use of illustrations. The following is from the Joint Matriculation Board's Test in English (Overseas).

The diagram below shows three types of bee. Compare and contrast the three bees.
A series of pictures can be used to elicit a narrative.

[Four pictures, A, B, C and D]

This echoes the general point made in Chapter 5. The question above about envy, for example, could result in very different answers from the same person on different occasions. There are so many significantly different ways of developing a response to the stimulus. Writing tasks should be well defined: candidates should know just what is required of them, and they should not be allowed to go too far astray. A useful device is to provide information in the form of notes (or pictures, as above). This may take the form of quite realistic transfer of information.

TIME - 30 MINUTES

CHANGES IN FARMING IN THE U.S.: 1940-1980
[Three graphs, 1940-1980]

Suppose you are writing a report in which you must interpret the three graphs shown above. Write the section of that report in which you discuss how the graphs are related to each other and explain the conclusions you have reached from the information in the graphs. Be sure the graphs support your conclusions.

The following example, slightly modified, was used in the test I was concerned with, mentioned earlier in the chapter.

Compare the benefits of a university education in English with its drawbacks. Use all of the points given below and come to a conclusion. You should write about one page.

a) Arabic
1. Easier for students
2. Easier for most teachers
3. Saves a year in most cases

b) English
1. Books and scientific sources mostly in English
2. English international language - more and better job opportunities
3. Learning second language part of education/culture

Care has to be taken when notes are given not to provide students with too much of what they need in order to carry out the task. Full sentences are generally to be avoided, particularly where they can be incorporated into the composition with little or no change.
Setting tasks which can be reliably scored

In fact a number of the suggestions made to obtain performance that is representative will also facilitate reliable scoring.

SET AS MANY TASKS AS POSSIBLE

The more scores that scorers provide for each candidate, the more reliable should be the total score.

RESTRICT CANDIDATES

The greater the restrictions imposed on the candidates, the more directly comparable will be the performances of different candidates.

The samples of writing which are elicited have to be long enough for judgements to be made reliably. This is particularly important where diagnostic information is sought. For example, in order to obtain reliable information on students' organisational ability in writing, the pieces have to be long enough for organisation to reveal itself. Given a fixed period of time for the test, there is an almost inevitable tension between the need for length and the need to have as many samples as possible.

HOLISTIC SCORING

Holistic scoring (often referred to as impressionistic scoring) involves the assignment of a single score to a piece of writing on the basis of an overall impression of it. This kind of scoring has the advantage of being very rapid. Experienced scorers can judge a one-page piece of writing in just a couple of minutes or even less (scorers of the new TOEFL Test of Written English will apparently have just one and a half minutes for each scoring of a composition). This means that it is possible for each piece of work to be scored more than once, which is fortunate, since it is also necessary! Harris (1968) refers to research in which, when each student wrote one 20-minute composition, scored only once, the reliability coefficient was only 0.25. If well conceived and well organised, holistic scoring in which each student's work is scored by four different trained scorers can result in high scorer reliability. There is nothing magical about the number 'four'; it is simply that research has quite consistently shown acceptably high scorer reliability when writing is scored four times.

A reservation was expressed above about the need for such scoring to be well conceived. Not every scoring system will give equally reliable results. The system has to be appropriate to the level of the candidates and the purpose of the test. Look at the following scoring system used in the English-medium university already referred to in this chapter.

A    Possibly more than adequate
D    Doubtful
NA   Clearly not adequate
FBA  Far below adequacy

This worked perfectly well in the situation for which it was designed. The purpose of the writing component of the test was to determine whether a student's writing ability was adequate for study in English in that institution. It was a test with a specific purpose and obviously it would be of little use in most other circumstances. Testers have to be prepared to modify existing scales to suit their own purposes. Look now at the following banding system. This concerns entry to university level education, this time in Britain. It is in fact a draft of a revised scale for the British Council's ELTS test.

9  The writing displays an ability to communicate in a way which gives the reader full satisfaction. It displays a completely logical organizational structure which enables the message to be followed effortlessly. Relevant arguments are presented in an interesting way, with main ideas prominently and clearly stated, with completely effective supporting material; arguments are effectively related to the writer's experience or views. There are no errors of ...
6  The writing displays an ability to communicate although there is occasional strain for the reader. It is organized well enough for the message to be followed throughout. Arguments are presented but it may be difficult for the reader to distinguish main ideas from supporting material; main ideas may not be supported; their relevance may be dubious; arguments may not be related to the writer's experience or views. The reader is aware of errors of vocabulary, spelling, punctuation or grammar, and/or limited ability to manipulate the linguistic systems appropriately, but these intrude only occasionally.

5  The writing displays an ability to communicate although there is often strain for the reader. It is organized well enough for the message to be followed most of the time. Arguments are presented but may lack relevance, clarity, consistency or support; they may not be related to the writer's experience or views. The reader is aware of errors of vocabulary, spelling, punctuation or grammar which intrude frequently, and of limited ability to manipulate the linguistic systems appropriately.

4  The writing displays a limited ability to communicate which puts strain on the reader throughout. It lacks a clear organizational structure and the message is difficult to follow. Arguments are inadequately presented and supported; they may be irrelevant; if the writer's experience or views are presented their relevance may be difficult to see. The control of vocabulary, spelling, punctuation and grammar is inadequate, and the writer displays inability to manipulate the linguistic systems appropriately, causing severe strain for the reader.

3  The writing does not display an ability to communicate although meaning comes through spasmodically. The reader cannot find any organizational structure and cannot follow a message. Some elements of information are present but the reader is not provided ...

As can be seen, in this case there are nine bands, each one having a description of the writer at each level. Band 7 is a not untypical standard demanded for entry into British universities. These descriptions have two advantages. First, they may help the scorer assign candidates to bands consistently. Secondly, the bands may be very meaningful to the people (including the students themselves) to whom they may be reported.

If the descriptions become too detailed, however, a problem arises. Look at the following, which is part of the ACTFL (American Council for the Teaching of Foreign Languages) provisional descriptors for writing, and represents an attempt to provide external criteria against which foreign language learning in schools and colleges can be assessed.
Novice-Low        No functional ability in writing the foreign language.

Novice-Mid        No practical communicative writing skills. Able to copy isolated words or short phrases. Able to transcribe previously studied words or phrases.

Novice-High       Able to write simple fixed expressions and limited memorized material. Can supply information when requested on forms such as hotel registrations and travel documents. Can write names, numbers, dates, one's own nationality, addresses, and other simple biographic information, as well as learned vocabulary, short phrases, and simple lists. Can write all the symbols in an alphabetic or syllabic system or 50 of the most common characters. Can write simple memorized material with frequent misspellings and inaccuracies.

Intermediate-Low  Has sufficient control of the writing system to meet limited practical needs. Can write short messages, such as simple questions or notes, postcards, phone messages, and the like within the scope of limited language experience. Can take simple notes on material dealing with very familiar topics although memory span is extremely limited. Can create statements or questions within the scope of limited language experience. Material produced consists of recombinations of learned vocabulary and structures into simple sentences. Vocabulary is inadequate to express anything but elementary needs. Writing tends to be a loosely organized collection of sentence fragments on a very familiar topic. Makes continual errors in spelling, grammar, and punctuation, but writing can be read and understood by a native speaker used to dealing with foreigners. Able to produce appropriately some fundamental sociolinguistic distinctions in formal and familiar style, such as appropriate subject pronouns, titles of address and basic social formulae.

Intermediate-Mid  Sufficient control of writing system to meet some survival needs and some limited social demands. Able to compose short paragraphs or take simple notes on very familiar topics grounded in personal experience. Can discuss likes and dislikes, daily routine, everyday events, and the like. Can express past time, using content words and time expressions, or with sporadically accurate verbs. Evidence of good control of basic constructions and inflections such as subject-verb agreement, noun-adjective agreement, and straightforward syntactic constructions; errors do not, however, interfere with comprehension. ... often is unable to identify appropriate vocabulary, or uses dictionary entry in uninflected form.

Intermediate-High ... Can take notes in some detail on familiar topics, and respond to personal questions using elementary vocabulary and common structures. Can write simple letters, brief synopses and paraphrases, summaries of biographical data and work experience, and short compositions on familiar topics. Can create sentences and short paragraphs relating to most survival needs (food, lodging, transportation, immediate surroundings and situations) and limited social demands. Can relate personal history, discuss topics such as daily life, preferences, and other familiar material. Can express fairly accurately present and future time. Can produce some past verb forms, but not always accurately or with correct usage. Shows good control of elementary vocabulary and some control of basic syntactic patterns but major errors still occur when expressing more complex thoughts. Dictionary usage may still yield incorrect vocabulary or forms, although can use a dictionary to advantage to express simple ideas. Generally cannot use basic cohesive elements of discourse to advantage such as relative constructions, subject pronouns, connectors, etc. Writing, though faulty, is comprehensible to native speakers used to dealing with foreigners.

The descriptions imply a pattern of development common to all language learners. They assume that a particular level of grammatical ability will always be associated with a particular level of lexical ability. This is, to say the least, highly questionable. The potential lack of fit in individuals between performance in the various subskills leads naturally to a consideration of analytic methods of scoring.

ANALYTIC METHODS OF SCORING

Grammar
4. Errors of grammar or word order fairly frequent; occasional re-reading necessary for full comprehension.
3. Errors of grammar or word order frequent; efforts of interpretation sometimes required on reader's part.
2. Errors of grammar or word order very frequent; reader often has to rely on own interpretation.
1. Errors of grammar or word order so severe as to make comprehension virtually impossible.

Vocabulary

6. Use of vocabulary and idiom rarely (if at all) distinguishable from that of educated native writer.
5. Occasionally uses inappropriate terms or relies on circumlocutions; expression of ideas hardly impaired.
4. Uses wrong or inappropriate words fairly frequently; expression of ideas may be limited because of inadequate vocabulary.
3. Limited vocabulary and frequent errors clearly hinder expression of ideas.
2. Vocabulary so limited and so frequently misused that reader must often rely on own interpretation.
1. Vocabulary limitations so extreme as to make comprehension virtually impossible.

Mechanics

6. Few (if any) noticeable lapses in punctuation or spelling.
5. Occasional lapses in punctuation or spelling which do not, however, interfere with comprehension.
4. Errors in punctuation or spelling fairly frequent; occasional re-reading necessary for full comprehension.
3. Frequent errors in spelling or punctuation; lead sometimes to obscurity.
2. Errors in spelling or punctuation so frequent that reader must often rely on own interpretation.
1. Errors in spelling or punctuation so severe as to make comprehension virtually impossible.

Fluency (style and ease of communication)

6. Choice of structures and vocabulary consistently appropriate; like that of educated native writer.
5. Occasional lack of consistency in choice of structures and vocabulary which does not, however, impair overall ease of communication.
4. 'Patchy', with some structures or vocabulary items noticeably inappropriate to general style.
3. Structures or vocabulary items sometimes not only inappropriate but also misused; little sense of ease of communication.
2. Communication often impaired by completely inappropriate or misused structures or vocabulary items.
1. A 'hotch-potch' of half-learned misused structures and vocabulary items rendering communication almost impossible.

Form (organisation)

6. Highly organised; clear progression of ideas well linked; like educated native writer.
5. Material well organised; links could occasionally be clearer but communication not impaired.
4. Some lack of organisation; re-reading required for clarification of ideas.
3. Little or no attempt at connectivity, though reader can deduce some organisation.
2. Individual ideas may be clear, but very difficult to deduce connection between them.
1. Lack of organisation so severe that communication is seriously impaired.

SCORE: Gramm: + Voc: + Mech: + Fluency: + Form:
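A composite of the kind the SCORE line above calls for can be computed mechanically once weights are chosen. A minimal sketch in Python; the component weights here are illustrative assumptions, not values prescribed by the scale:

```python
# Weighted composite for an analytic writing scale.
# Component names follow the scale above; the weights are
# illustrative assumptions, not taken from the book.
WEIGHTS = {"grammar": 3, "vocabulary": 2, "mechanics": 1, "fluency": 2, "form": 2}

def composite(ratings):
    """ratings: dict mapping component name -> band awarded (1-6)."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name, band in ratings.items():
        if not 1 <= band <= 6:
            raise ValueError(f"band out of range for {name}: {band}")
    # Total score is the sum of the weighted component scores.
    return sum(weight * ratings[name] for name, weight in WEIGHTS.items())

ratings = {"grammar": 4, "vocabulary": 5, "mechanics": 6, "fluency": 4, "form": 5}
print(composite(ratings))  # 3*4 + 2*5 + 1*6 + 2*4 + 2*5 = 46
```

Giving grammatical accuracy greater weight than spelling, as the text suggests, changes only the WEIGHTS dictionary; the procedure itself is untouched.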
There are a number of advantages to analytic scoring. First, it disposes of the problem of uneven development of subskills in individuals. Secondly, scorers are compelled to consider aspects of performance which they might otherwise ignore. And thirdly, the very fact that the scorer has to give a number of scores will tend to make the scoring more reliable. While it is doubtful that scorers can judge each of the aspects independently of the others (there is what is called a 'halo effect'), the mere fact of having (in this case) five 'shots' at assessing the student's performance should lead to greater reliability. The separate scores can also provide diagnostic information on each component. Of course such diagnosis will be effective only if the samples of writing are sufficiently long for each aspect to be judged reliably.

The relative importance of the different aspects, as perceived by the tester (with or without statistical support), is reflected in weightings attached to the various components. Grammatical accuracy, for example, might be given greater weight than accuracy of spelling. A candidate's total score is the sum of the weighted scores.

The main disadvantage of the analytic method is the time that it takes. Even with practice, scoring will take longer than with the holistic method. Particular circumstances will determine whether the analytic method or the holistic method will be the more economical way of obtaining the required level of scorer reliability.

A second disadvantage is that concentration on the different aspects may divert attention from the overall effect of the piece of writing. Inasmuch as the whole is often greater than the sum of its parts, a composite score may be very reliable but not valid. Indeed the aspects which are scored separately (the 'parts'), presumably based on the theory of linguistic performance that most appeals to the author of any particular analytic framework, may not in fact represent the complete, 'correct' set of such aspects. To guard against this, an additional, impressionistic score on each composition is sometimes required of scorers, with significant discrepancies between this and the analytic total being investigated.

It is worth noting a potential problem in Anderson's scale. This arises from the conjunction of frequency of error and the effect of errors on communication. It is not necessarily the case that the two are highly correlated. A small number of grammatical errors of one kind could have a more damaging effect on communication than a large number of another kind.

Appendix
New Profile Scale and Profile Method ('PM2')

Each script is rated on five aspects: Communicative Quality, Organization, Argumentation, Linguistic Accuracy, and Linguistic Appropriacy.

9
Communicative Quality: The writing displays an ability to communicate which gives the reader full satisfaction.
Organization: The writing displays a logical organizational structure which enables the message to be followed effortlessly.
Argumentation: Arguments are clearly stated, with completely effective supporting material; arguments are effectively related to the writer's experience or views.
Linguistic Accuracy: There are no errors of vocabulary, spelling, punctuation or grammar.
Linguistic Appropriacy: The linguistic systems are manipulated with complete appropriacy.

8
Communicative Quality: The writing displays an ability to communicate without causing the reader any difficulties.
Organization: The writing displays a logical organizational structure which enables the message to be followed easily.
Argumentation: Relevant arguments are presented in an interesting way, with main ideas highlighted, effective supporting material, and they are well related to the writer's own experience or views.
Linguistic Accuracy: The reader sees no significant errors of vocabulary, spelling, punctuation or grammar.
Linguistic Appropriacy: There is an ability to manipulate the linguistic systems appropriately.

7
Communicative Quality: The writing displays an ability to communicate with few difficulties for the reader.
Organization: The writing displays good organizational structure which enables the message to be followed.
Argumentation: Arguments are well presented, with relevant supporting material and an attempt to relate them to the writer's experience or views.
Linguistic Accuracy: The reader is aware of, but not troubled by, occasional minor errors of vocabulary, spelling, punctuation or grammar which do not intrude on the reader.
Linguistic Appropriacy: There are limitations to the ability to manipulate the linguistic systems appropriately.

6
Communicative Quality: The writing displays an ability to communicate although there is occasional strain for the reader.
Organization: The writing is organized well enough for the message to be followed throughout.
Argumentation: Arguments are presented but it may be difficult for the reader to distinguish main ideas from supporting material; main ideas may not be supported; their relevance may be dubious; arguments may not be related to the writer's experience or views.
Linguistic Accuracy: The reader is aware of errors of vocabulary, spelling, punctuation or grammar, but these intrude only occasionally.
Linguistic Appropriacy: There is limited ability to manipulate the linguistic systems appropriately, but this intrudes only occasionally.
5
Communicative Quality: The writing displays an ability to communicate although there is often strain for the reader.
Organization: The writing is organized well enough for the message to be followed most of the time.
Argumentation: Arguments are presented but may lack relevance, clarity, consistency or support; they may not be related to the writer's experience or views.
Linguistic Accuracy: The reader is aware of errors of vocabulary, spelling, punctuation or grammar which intrude frequently.
Linguistic Appropriacy: There is limited ability to manipulate the linguistic systems appropriately, which intrudes frequently.

4
Communicative Quality: The writing displays a limited ability to communicate which puts strain on the reader throughout.
Organization: The writing lacks a clear organizational structure and the message is difficult to follow.
Argumentation: Arguments are inadequately presented and supported; they may be irrelevant; if the writer's experience or views are presented their relevance may be difficult to see.
Linguistic Accuracy: The reader finds the control of vocabulary, spelling, punctuation and grammar inadequate.
Linguistic Appropriacy: There is inability to manipulate the linguistic systems appropriately, which causes severe strain for the reader.

3
Communicative Quality: The writing does not display an ability to communicate although meaning comes through spasmodically.
Organization: The writing has no discernible organizational structure and a message cannot be followed.
Argumentation: Some elements of information are present but the reader is not provided with an argument, or the argument is mainly irrelevant.
Linguistic Accuracy: The reader is primarily aware of gross inadequacies of vocabulary, spelling, punctuation and grammar.
Linguistic Appropriacy: There is little or no sense of linguistic appropriacy, although there is evidence of sentence structure.

2
Communicative Quality: The writing displays no ability to communicate.
Organization: No organizational structure or message is recognizable.
Argumentation: A meaning comes through occasionally but it is not relevant.
Linguistic Accuracy: The reader sees no evidence of control of vocabulary, spelling, punctuation or grammar.
Linguistic Appropriacy: There is no sense of linguistic appropriacy.

1
A true non-writer who has not produced any assessable strings of English writing. An answer which is wholly or almost wholly copied from the input text or task is in this category.

HOLISTIC OR ANALYTIC?

The choice between holistic and analytic scoring depends in part on the purpose of the testing. If diagnostic information is required, then analytic scoring is essential. The choice also depends on the circumstances of scoring. If it is being carried out by a small, well-knit group at a single site, then holistic scoring, which is likely to be more economical of time, may be the most appropriate. But if scoring is conducted by a heterogeneous, possibly less well trained group, or in a number of different places (the British Council ELTS test, for instance, is scored at a large number of test centres), analytic scoring is probably called for. Whichever is used, if high accuracy is sought, multiple scoring is desirable.

It should go without saying that the rating systems presented in this chapter are meant to serve only as examples. Testers will almost certainly need to adapt them for use in their own situation.

THE CONDUCT OF SCORING

It is assumed that scorers have already been trained with previously written scripts. Once the test is completed, a search should be made to identify 'benchmark' scripts which typify key levels of ability on each writing task (in the case of the English medium university referred to above, these were 'adequate' and 'not adequate'). Copies of these should then be presented to the scorers for an initial scoring. Only when there is agreement on these benchmark scripts should scoring begin.

Each task of each student should be scored independently by two or more scorers (as many scorers as possible should be involved in the assessment of each student's work), the scores being recorded on separate sheets. A third, senior member of the team should collate scores and identify discrepancies in scores awarded to the same piece of writing. Where these are small, the two scores can be averaged; where they are larger, senior members of the team will decide the score. It is also worth looking for large discrepancies between an individual's performance on different tasks. These may accurately reflect his or her performance, but they may also be the result of inaccurate scoring.

It is important that scoring should take place in a quiet, well-lit environment. Scorers should not be allowed to become too tired. While holistic scoring can be very rapid, it is nevertheless extremely demanding if concentration is maintained.

Multiple scoring should ensure scorer reliability, even if not all scorers are using quite the same standard. Nevertheless, once scoring is completed, it is useful to carry out simple statistical analyses to discover if anyone's scoring is unacceptably aberrant.
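The collation procedure just described (average small discrepancies, refer large ones to a senior scorer, then check for aberrant scorers) can be sketched as a short script. The two-scorer setup, the one-band tolerance, and all names here are illustrative assumptions, not conventions from the book:

```python
# Collate two independent scorings of the same scripts: average small
# discrepancies, flag large ones for a senior scorer to resolve, and
# run a crude check for aberrant scorers. Thresholds are illustrative.
TOLERANCE = 1  # maximum band difference that may simply be averaged

def collate(scores_a, scores_b, tolerance=TOLERANCE):
    """scores_a, scores_b: dicts mapping script id -> band awarded."""
    agreed, flagged = {}, []
    for sid in scores_a:
        a, b = scores_a[sid], scores_b[sid]
        if abs(a - b) <= tolerance:
            agreed[sid] = (a + b) / 2
        else:
            flagged.append(sid)  # a senior member of the team decides these
    return agreed, flagged

def mean(xs):
    return sum(xs) / len(xs)

a = {"s1": 4, "s2": 5, "s3": 2, "s4": 6}
b = {"s1": 5, "s2": 5, "s3": 5, "s4": 6}
agreed, flagged = collate(a, b)
print(agreed)   # {'s1': 4.5, 's2': 5.0, 's4': 6.0}
print(flagged)  # ['s3']

# A simple aberrance check: a scorer whose mean award sits far from a
# colleague's (or the group's) deserves a closer look.
print(round(mean(list(a.values())) - mean(list(b.values())), 2))  # -1.0
```

Comparing scorer means is only the crudest of the "simple statistical analyses" the text mentions; correlations between scorers' marks would serve the same purpose.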
READER ACTIVITIES

1. Following the advice given in this chapter, construct two writing tasks appropriate to a group of students with whom you are familiar. Carry out the tasks yourself. If possible, get the students to do them as well. Do any of the students produce writing different in any significant way from what you hoped to elicit? If so, can you see why? Would you wish to change the tasks in any way?
2. This activity is best carried out with colleagues. Score the following three short compositions on how to increase tourism, using each of the scales presented in the chapter. Which do you find easiest to use, and why? How closely do you and your colleagues agree on the scores you assign? Can you explain any large differences? Do the different scales place the compositions in the same order? If not, can you see why not? Which of the scales would you recommend in what circumstances?

1. Nowadays a lot of countries tend to develop their tourism's incomes, and therefore tourism called the factory without chemny. Turkey, which undoubtedly needs foreign money, trys to increase the number of foreign tourists coming to Turkey. What are likely to do in order to increase this number.
At first, much more and better advertising should do in foreign countries and the information offices should open to inform the people to decide to come Turkey. Secondly, improve facilities, which are hotels, transportation and communication. Increase the number of hotels, similarly the number of public transportation which, improve the lines of communication. Thirdly which is important as two others is training of personnel. This is also a basic need of tourism, because the tourist will want to see in front of him a skilled guides or a skilled hotel managers. The new school will open in order to train skilled personnel and as well as theoric knowledges, practice must be given them.
The countries which are made available these three basic need for tourists have already improved their tourism's incomes. Spain is a case in point; or Greec. Although Turkey needs this income; it didn't do any real attempts to achive it. In fact all of them should have already been done, till today. However it is late, it can be begin without loosing any time.

2. A nation can't make improvements, if it doesn't let the minds of their people breathe and expand to understand more about life than what is at the end of the street, this improvement can be made by means of tourism.
There are several ways to attract more people to our country. First of all, advertisements and information take an important place. These advertisements and information should be based on the qualities of that place without exaggeration. The more time passes and the more information tourists gather about one country, the more assured they can be that it will be a good experience. People travel one place to another in order to spend their holiday, to see different cultures or to attend conferences. All of these necessitate facilities. It is important to make some points clear. Hotel, transportation and communication facilities are a case in point. To some extent, we can minimize the difficulties by means of money. Furthermore, this situation does not only depend on the financial situation, but also behaviors towards the tourists. Especially, a developing country should kept in mind the challenge of the future rather than the mistakes of the past. In order to achive this, the ways of training of personnel may be ... The most important problem faced by many of countries is whether the decisions that must be made are within the capabilities of their education system. Educating guides and hotel managers are becoming more and more important.
As a result, it should once more be said that we may increase the number of foreign tourists coming to Turkey by taking some measures. Advertisement, information, improving facilities and training personnel may be effective, but also all people should be encouraged to contribute this event.

3. Tourism is now becoming a major industry troughout the world. For many countries their tourist trade is an essential source of their revenue.
All countries have their own particular atractions for tourists and this must be kept in mind when advertising Turkey abroad. For example Turkey, which wants to increase the number of foreign tourists coming must advertise its culture and sunshine.
Improving facilities like hotels, transportation and communication play important role on this matter. More hotels can be built and avaliable ones can be kept clean and tidy. New and modern transportation systems must be given to foreign tourists and one more, the communication system must work regularly to please these people.
Tourists don't want to be led around like sheep. They want to explore for themselves and avoid the places which are packed out with many other tourist. Because of that there must be their trained guides on their towns through anywhere and on the other hand hotel managers must be well trained. They must keep being kind to foreign tourist and must know English as well.
If we make tourists feel comfortable in these facts, tourism will increase and we will benefit from it.

(Hughes et al. 1987: 145-7)

Further reading

A very good book on the testing of writing is Jacobs et al. (1981). The background to the new British Council scales is reported in Hamp-Lyons (1987).

10 Testing oral ability

The assumption is made in this chapter that the objective of teaching spoken language is the development of the ability to interact successfully in that language, and that this involves comprehension as well as production. It is also assumed that at the earliest stages of learning formal testing of this ability will not be called for, informal observation providing any diagnostic information that is needed.

The basic problem in testing oral ability is essentially the same as for testing writing. We want to set tasks that form a representative sample of the population of oral tasks that we expect candidates to be able to perform. The tasks should elicit behaviour which truly represents the candidates' ability and which can be scored validly and reliably.

CONTENT

We can again make use of the framework introduced in Chapter 7: Operations, Types of text, Addressees, and Topics. RSA test specifications, this time for 'Oral interaction' at the intermediate level, serve as an illustration.

Operations  The operation is to take part in oral interaction which may involve the following language functions:
Expressing: thanks, requirements, opinions, comment, attitude, confirmation, apology, want/need, information, complaints, reasons/justifications
Narrating: sequence of events
Eliciting: information, directions, service, clarification, help, permission, (and all areas above)
Directing: ordering, instructing (how to), persuading, advising, warning
Reporting: description, comment, decisions

Types of text  Dialogue and multi-participant interactions, normally of a face-to-face nature but telephone conversations not excluded.

Addressees and topics  Not specified except as under 'Topics for writing'.

Criterial levels of performance
... function. The overall intention of the speaker is always clear.

The reader may well feel that these descriptions of levels of ability are not sufficiently precise to use with any consistency. In fact, even where specification is more precise than this, the operationalisation of criterial levels has to be carried out in conjunction with samples of candidates' performances.

Another approach to the specification of skills is to combine the elements of content with an indication of criterial levels of ability. The rating scales against which candidates' performances are judged provide the specifications. Examples of these are the ACTFL Guidelines (see pages 89-91) and the Interagency Language Roundtable (ILR) proficiency descriptions.

Intermediate-Mid
Able to handle successfully a variety of uncomplicated, basic ... though neither of these may be consistently manifested.
The ILR description of 'limited working proficiency' (the ILR scales relate to the use of language for a specific purpose, being used for assessing the foreign language ability of US Government agency staff) is as follows:

Able to satisfy routine social demands and limited work requirements. Can handle with confidence but not with facility most social situations including introductions and casual conversations about current events, as well as work, family, and autobiographical information; can handle limited work requirements, needing help in handling any complications or difficulties; can get the gist of most conversations on non-technical subjects (i.e. topics which require no specialised knowledge) and has a speaking vocabulary sufficient to respond simply with some circumlocutions; accent, though often quite faulty, is intelligible; can usually handle elementary constructions quite accurately but does not have thorough or confident control of the grammar.

Interaction with peers  Two or more candidates may be asked to discuss a topic, make plans, and so on. The problem with this is that the performance of one candidate is likely to be affected by that of the others. For example, an assertive and insensitive candidate may dominate and not allow another candidate to show what he or she can do. If this format is used, candidates should be carefully matched whenever possible.

Response to tape-recordings  Uniformity of elicitation procedures can be achieved through presenting all candidates only with the same audio- (or video-) tape-recorded stimuli. This ought to promote reliability. There can also be economy where a language laboratory is available, since large numbers of candidates can be tested at the same time. The obvious disadvantage of this format is its inflexibility: there is no way of following up candidates' responses.

Obtaining a sample that properly represents each candidate's ability and which can be reliably judged

This section begins with general advice and then goes on to suggest techniques for different formats.

ADVICE ON PLANNING AND CONDUCTING ORAL TESTS
be spent on one particular function or topic. At the same time, candidates should not be discouraged from making a second attempt to express what they want to say, possibly in different words.
5. Select interviewers carefully and train them. Successful interviewing is by no means easy and not everyone has great aptitude for it. Interviewers need to be sympathetic and flexible characters, with a good command of the language themselves. But even the most apt need training. They can be given advice, shown video recordings of successful interviews, and can carry out interviews themselves which can be recorded to form the basis for critical discussion.
6. Use a second tester for interviews. Because of the difficulty of conducting an interview and of keeping track of the candidate's performance, it is very helpful to have a second tester present. This person can not only give more attention to how the candidate is performing but also elicit performance which they think is necessary.
... then an interview should be brought gently to a close, since nothing will be learned from subjecting her or him to a longer ordeal. Where, on the other hand, the purpose of the test is to see what level the candidate is at, the tester has to begin by guessing what this level is on the basis of early responses. The interview is then conducted at that level, either providing confirmatory evidence or revealing that the initial guess is inaccurate. In the latter case the level is shifted up or down until it becomes clear what the candidate's level is. A second tester, whose main role is to assess the candidate's performance, can elicit responses at a different level if it is suspected that the principal interviewer may be mistaken.
11. Do not talk too much. There is an unfortunate tendency for interviewers to talk too much, not giving enough talking time to candidates. Avoid the temptation to make lengthy or repeated explanations of something that the candidate has misunderstood.

ELICITATION TECHNIQUES

Role play can be carried out by two candidates with the tester as an observer. For some roles this may be more natural but, as pointed out above, the performance of one candidate may be affected by that of the other. Similarly, the exchange may not follow the pattern predicted by the tester (and so the intended functions may not be elicited).

Interpreting  It is not intended that candidates should be able to act as interpreters (unless that is specified). However, simple interpreting tasks can test both production and comprehension in a controlled way. One of the testers acts as a monolingual speaker of the candidate's native language, the other as a monolingual speaker of the language being tested. Situations of the following kind can be set up:

The native language speaker wants to invite a foreign visitor to his or her home for a meal. The candidate has to convey the invitation and act as an interpreter for the subsequent exchange.

Comprehension can be assessed when the candidate attempts to convey what the visitor is saying, and indeed unless some such device is used, it is difficult to obtain sufficient information on candidates' powers of comprehension. Production is tested when the candidate tries to convey the meaning of what the native speaker says.

Discussion  Discussions between candidates can be a valuable source of information. These may be discussions of a topic or in order to come to a decision. Thus an example from the RSA (1984):

There is too much sport on television.

or

Your school has a substantial budget to spend on improving facilities. The following have been suggested as possible purchases for the school.
1. Video equipment
2. A swimming pool (indoor or outdoor? But how much would this cost?)
3. A mini-bus
4. Computer equipment
5. A sauna
6. Any other suggestion.
Discuss the advantages and disadvantages of each suggestion with your partner and try to reach agreement on the most suitable. Make other suggestions if you wish.

This technique suffers from the disadvantages attached to any method that depends on interaction between candidates. While any reasonable number of candidates could be involved in such discussions, restricting the number to two allows the tester the best opportunity of assessing the performance of more diffident candidates. However, at least one national matriculation examination in English uses larger groups.

Tape-recorded stimuli  As was noted earlier in the chapter, circumstances may dictate that oral ability be tested in the language laboratory. A good source of techniques is the ARELS (Association of Recognised English Language Schools) examination in Spoken English and Comprehension. These include:

Described situations, for example:
You are walking through town one day and you meet two friends who you were sure had gone to live in the USA. What do you say?

Remarks in isolation to respond to, for example:
The candidate hears: 'I'm afraid I haven't managed to fix that cassette player of yours yet. Sorry.' or 'There's a good film on TV tonight.'

Simulated conversation, for example:
The candidate is given information about a play which he or she is supposed to want to see, but not alone. The candidate is told to talk to a friend, Ann, on the telephone, and ask her to go to the theatre and answer her questions. The candidate hears:
Ann: Hello. What can I do for you?
Ann: Hold on a moment. What's the name of the play, and who's it by?
Ann: Never heard of it. When's it on exactly?
Ann: Sorry to mention it, but I hope it isn't too expensive.
Ann: Well which night do you want to go, and how much would you like to pay?
Ann: OK. That's all right. It'll make a nice evening out. 'Bye.

Note that although what Ann says is scripted, the style of speech is appropriately informal.

For all of the above, an indication is given to candidates of the time available (for example ten seconds) in which to respond.

Imitation  Candidates hear a series of sentences, each of which they have to repeat in turn. This obviously does not in itself represent a cross section of the oral skills that we are usually looking for, but there is research evidence that, provided the sentences are not too short (see what was said about dictation in Chapter 8), learners will make the same kind of errors in performing this task as they will when speaking freely. The advantages of this format are the control which can be exercised in choice of structures etc., and its economy. While it might provide sufficient information as part of a placement test, problems in interpretation, as well as a potentially unfortunate backwash effect, militate against giving it a prominent role in achievement or proficiency tests.

TECHNIQUES NOT RECOMMENDED

Prepared monologue  Some examinations require candidates to present a monologue on a topic after being given a few minutes to prepare. If this task were to be carried out in the native language, there ...
Clearly recognisable and appropriate descriptions of criterial levels are written and scorers are trained to use them.¹
Irrelevant features of performance are ignored.
There is more than one scorer for each performance.

DESCRIBING THE CRITERIAL LEVELS

Descriptions may be holistic (as the ACTFL and ILR scales presented above) or analytic (as for the RSA test). Analytic descriptions used in the FSI interview (Adams and Frith 1979) include the following:

Accent
4. Marked 'foreign accent' and occasional mispronunciations which do not interfere with understanding.
5. No conspicuous mispronunciations, but would not be taken for a native speaker.
6. Native pronunciation, with no trace of 'foreign accent'.

Grammar
1. Grammar almost entirely inaccurate except in stock phrases.
...
4. Frequent errors showing some major patterns uncontrolled and causing occasional irritation and misunderstanding.

Vocabulary
...
6. ... that of an educated native speaker.

1. In some cases it may be sufficient for scorers to be provided with recorded exemplars of the various levels.
Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left uncompleted.
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing and groping for words.
5. Speech is effortless and smooth, but perceptibly non-native in speed and evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker's.

Comprehension
1. Understands too little for the simplest type of conversation.
...

WEIGHTING TABLE

                 1    2    3    4    5    6
Accent           0    1    2    2    3    4
Grammar          6   12   18   24   30   36
Vocabulary       4    8   12   16   20   24
Fluency          2    4    6    8   10   12
Comprehension    4    8   12   15   19   23

Total:

Note the relative weightings for the various components. The total of the weighted scores is then looked up in a table (not reproduced here) which converts it into a rating on a scale 0-5.

(Adams and Frith 1979: 35-8)
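The weighting table works as a straightforward lookup: each component's 1-6 rating is converted to weighted points and the points are summed. A sketch under that reading; the conversion of the total to the final 0-5 rating uses a table the book does not reproduce, so it is omitted here:

```python
# Weighted points for each component rating (1-6), following the
# weighting table above. The points-to-rating conversion table is not
# reproduced in the book, so only the weighted total is computed.
WEIGHTING = {
    "accent":        [0, 1, 2, 2, 3, 4],
    "grammar":       [6, 12, 18, 24, 30, 36],
    "vocabulary":    [4, 8, 12, 16, 20, 24],
    "fluency":       [2, 4, 6, 8, 10, 12],
    "comprehension": [4, 8, 12, 15, 19, 23],
}

def weighted_total(ratings):
    """ratings: dict mapping component -> rating on the 1-6 scale."""
    # Rating n indexes the n-th column of that component's row.
    return sum(WEIGHTING[c][ratings[c] - 1] for c in WEIGHTING)

ratings = {"accent": 3, "grammar": 4, "vocabulary": 4,
           "fluency": 5, "comprehension": 4}
print(weighted_total(ratings))  # 2 + 24 + 16 + 10 + 15 = 67
```

Summing the last column of the table shows that the maximum weighted total is 99; grammar, with a ceiling of 36 points, dominates the composite by design.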
pleasantness, prettiness, or the cut of their dress. The truth IS that these
are hard to exclude from one's judgement - but an effort h3S to be made!
Further reading
The scoring of oral ability is generally highly subjective. Even with beneficial backwash, A recent book devoted to oral testing is Underhill
careful training, a single scorer is unlikely to be as reliable as one would (1987). The American FSI publishes a 'Tesring Kit' which provides
wish. If two testers are involved in 3 (loosely defined) interview then rhev information on scales and techniques, and includes sample interviews at
C3n independently assess each candidate. If they agree, tI~ere is no
problem. If they disagree, even after discussion, then a third assessor mav
be referred to. In such cases it is necessary to have a recording of the
the various levels (but not in English). The kit is available from: The
Superintendent of Documents, US Government Printing Office, Wash-
ington DC 20402. Bachman and Savignon (19Hh) is a critique of the
I
ACTFL oral interview. Information on the ARELS examinations (and
session, if the candidate is to be spared the trauma of being called back
tor a further session. Even where elicitation is highly controlled ami there
is a detailed scoring key (as in the ARELS examination), independent
past papers with recordings) CJn be obtained from ARELS Examinations
Trust, 113 Banbury Road, Oxford, OX2 sjx.
I
double scoring is advisable.
Conclusion

The accurate measurement of oral ability is not easy. It takes considerable time and effort to obtain valid and reliable results. Nevertheless, where backwash is an important consideration, the investment of such time and effort may be considered necessary.

Readers are reminded that the appropriateness of content, descriptions of criterial levels, and elicitation techniques used in oral testing will depend upon the needs of individual institutions or organisations.

READER ACTIVITIES

These activities are best carried out with colleagues.
1. For a group of students that you are familiar with, prepare a holistic rating scale (five bands) appropriate to their range of ability. From your knowledge of the students, place each of them on this scale.
2. Choose three methods of elicitation (for example role play, group discussion, interview). Design a test in which each of these methods is used for five to ten minutes.
3. Administer the test to a sample of the students you first had in mind.
4. Note problems in administration and scoring. How would you avoid them?
5. For each student who takes the test, compare scores on the different tasks. Do different scores represent real differences of ability between tasks? How do the scores compare with your original ratings of the students?
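For activity 5, the comparison between test scores and your original ratings can be made concrete with a rank correlation. The statistic and the data below are not part of the original activity; this is a minimal sketch (Spearman's rho, assuming no tied ranks) with invented scores:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no ties) between two score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented data: five students' holistic ratings vs. their role-play scores.
ratings   = [1, 2, 3, 4, 5]
role_play = [2, 1, 4, 3, 5]
print(round(spearman_rho(ratings, role_play), 2))  # 0.8
```

A high rho suggests the task ranks students much as your original ratings did; a low one is worth investigating before trusting either.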
Testing reading

that it was more difficult to construct reliable and valid tests of writing and speaking than of reading and listening. In large part this was because the former seemed to depend on notoriously unreliable subjective scoring, while in the case of the latter, objective scoring was both possible and appropriate. This view of the relative difficulty could be quite mistaken. For one thing, we know from the previous chapter that the subjective scoring of writing can be very reliable (and the same is true for the scoring of speaking). For another, the exercising of receptive skills does not necessarily, or usually, manifest itself directly in overt behaviour. When people write and speak, we see and hear; when they read and listen, there will often be nothing to observe. The task of the language tester is to set reading and listening tasks which will result in behaviour that will demonstrate their successful completion. This is by no means simple, as we shall see below.

Specifying what the candidate should be able to do

In order that we should be clear about the population of abilities that we want to test, it is worthwhile specifying these as accurately and completely as possible. Once again the general framework introduced in Chapter 7 can be used: in particular, Content (operations, types of text, addressees, and topics) and Criterial levels of performance.

Content

OPERATIONS

These may be at different levels of analysis. Thus, the following may be thought of as macro-skills, directly related either to needs or to course objectives:

- Scanning text to locate specific information
- Skimming text to obtain the gist
- Identifying stages of an argument
- Identifying examples presented in support of an argument

At a finer level of analysis are what may be called micro-skills, for example:

- Identifying the referents of pronouns and other referring expressions
- Guessing the meaning of unfamiliar words from context

At a lower level still are operations drawing on grammatical and lexical knowledge, for example:

- Recognising the significance of the use of the present continuous with future time adverbials
- Knowing that the word 'brother' refers to a male sibling

And finally, low-level operations, like distinguishing between the letters m and n, or b and d.

A book of the present kind is obviously not the place to argue the merits of various models of the foreign-language reading process. Readers will have their own views as to what goes into successful reading, and it is not for me to try to change them. I think it is appropriate, however, to say something about the broad levels of reading processes at which testing is most usefully carried out.

It seems to me that information about students' abilities at the lowest level (for example distinguishing between letters) could only be useful for diagnostic purposes, and can be gathered through informal observation. There is no call for formal testing.

Even at a higher level, information on grammatical and lexical ability would only be diagnostic, and in any case could be acquired through tests of grammar and vocabulary, not necessarily as an integral part of a reading test.

Only at the level of what are referred to above as 'micro-skills' do we reach the point where we find serious candidates for inclusion in a reading test. These we can recognise as skills which we might well wish to teach as part of a reading course in the belief that they will promote the development of the particular macro-skills towards which the course is primarily addressed. It follows from this that items testing such micro-skills are appropriate in achievement tests. If there are skills whose development we wish to encourage, then it is right that they should be represented in tests. At the same time, however, an excess of micro-skill test items should not be allowed to obscure the fact that the micro-skills are being taught not as an end in themselves but as a means of improving macro-skills. At least in final (as opposed to progress) achievement tests, the balance of items should reflect the relationship between the two levels of skill. There is in fact a case for having only items which test macro-skills, since the successful completion of these would imply
command of the relevant micro-skills. The same would also be true for
proficiency tests. This issue is raised again in the section on the writing of
items, below.
TYPES OF TEXT

ADDRESSEES

These will obviously be related to text types, and it may not be necessary to specify further: for example, we know the intended audience for quality newspapers. But textbooks, for instance, could be for schoolchildren or for university students.

TOPICS

It may often be appropriate to indicate the range of topics in only very general terms.

Setting criterial levels of performance

In norm-referenced testing our interest is in seeing how candidates perform by comparison with each other. There is no need to specify criterial levels of performance before tests are constructed, or even before they are administered. This book, however, encourages a broadly criterion-referenced approach to language testing. In the case of the testing of writing, as we saw in the previous chapter, it is possible to describe levels of writing ability that candidates have to attain. While this ...

Selecting texts

Choosing suitable texts calls for judgement and experience of a kind that no handbook can provide; practice is necessary. It is nevertheless possible to offer useful advice. While the points may seem rather obvious, they are often overlooked.

1. Keep specifications constantly in mind and try to select as representative a sample as possible. Do not repeatedly select texts of a particular kind simply because they are readily available.
2. Choose texts of appropriate length. Scanning may call for passages of up to 2,000 words or more. Detailed reading can be tested using passages of just a few sentences.
3. In order to obtain acceptable reliability, include as many passages as possible in a test, thereby giving candidates a good number of fresh starts. Considerations of practicality will inevitably impose constraints on this, especially where scanning or skimming is to be tested.
4. In order to test scanning, look for passages which contain plenty of discrete pieces of information.
5. Choose texts which will interest candidates but which will not overexcite or disturb them.
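The advice about including many passages to give candidates fresh starts has a statistical basis: lengthening a test with comparable material raises score reliability. A minimal sketch using the Spearman-Brown formula, under the assumption that the added passages are parallel in difficulty and discrimination (the figures are illustrative, not from the source):

```python
def spearman_brown(r, n):
    """Predicted reliability of a test lengthened by factor n,
    given reliability r of the original test (Spearman-Brown)."""
    return n * r / (1 + (n - 1) * r)

# A hypothetical single-passage reading test with reliability 0.5,
# lengthened to 2 and then 4 comparable passages:
for n in (1, 2, 4):
    print(n, round(spearman_brown(0.5, n), 2))  # 1 0.5 / 2 0.67 / 4 0.8
```

The gain per added passage diminishes, which is why practicality (advice point 3 above) and not statistics usually sets the upper limit.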
Writing items

The aim must be to write items which will measure the ability in which we are interested, which will elicit reliable behaviour from candidates, and which will permit highly reliable scoring. We have to recognise that the act of reading does not demonstrate its successful performance. We need to set tasks which will involve candidates in providing evidence of successful reading.

POSSIBLE TECHNIQUES

It is important that the techniques used should interfere as little as possible with the reading itself, and that they should not add a significantly difficult task on top of reading. This is one reason for being wary of requiring candidates to write answers, particularly in the language of the text. They may read perfectly well, but difficulties in writing may prevent them demonstrating this. Possible solutions to this problem include:

Multiple choice The candidate provides evidence of successful reading by making a mark against one out of a number of alternatives. The superficial attraction of this technique is outweighed in institutional testing by the various problems enumerated in Chapter 8. This is true whether the alternative responses are written or take the form of illustrations. Thus, the following is multiple choice:

Choose the picture (A, B, C, or D) which the following sentence describes: The man with a dog was attacked in the street by a woman.

It should hardly be necessary to say that true/false questions, which are to be found in many tests, are simply a variety of multiple choice, with only one distractor and a 50 per cent probability of choosing the correct response by chance! Having a 'not applicable' or 'we don't know' category adds a distractor and reduces the likelihood of guessing correctly to 33⅓ per cent.

Unique answer Here there is only one possible correct response. This might be a single word or number, or something slightly longer (for example: China; China and Japan; The item on the table; 157). The item may take the form of a question, for example:

In which city do the people described in the 'Urban Villagers' live?

Or it may require completion of a sentence, for example:

....... was the man responsible for the first steam railway.

Though this technique meets the criteria laid down above, and is to be recommended, its use is necessarily limited.

Short answer When unique answer items are not possible, short answer items may be used. Thus:

According to the author, what does the increase in divorce rates show about people's expectations of marriage and marriage partners?

would expect an answer like:

They (expectations) are greater (than in the past).
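The chance-success figures quoted for true/false items (50 per cent, falling to 33⅓ per cent once a third category is added) are simply 1/k for k response options. A short sketch of the expected score a blind guesser obtains on an n-item test (the 30-item test is an invented example):

```python
from fractions import Fraction

def expected_guess_score(n_items, n_options):
    """Expected number of correct responses from blind guessing,
    assuming independent items with equally likely options."""
    return n_items * Fraction(1, n_options)

print(expected_guess_score(30, 2))  # true/false: 15
print(expected_guess_score(30, 3))  # with a 'not applicable' category: 10
print(expected_guess_score(30, 4))  # four-option multiple choice: 15/2
```

Seen this way, the 'not applicable' category buys back a third of the score a guesser would otherwise expect on true/false items.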
Guided short answers The danger with short-answer questions is of course the one referred to above: a student who has the answer in his or her head after reading the relevant part of the passage may not be able to express it well (equally, the scorer may not be able to tell from the response that the student has arrived at the correct answer). Consider the question:

What changes have taken place in the attitudes of many European universities to different varieties of English?

Even without knowing the intended answer, one suspects that it may create problems of production unrelated to reading ability. The relevant part of the text is as follows (note that the abbreviations 'EngEng' and 'NAmEng' are explained earlier in the text):

Until recently, many European universities and colleges not only taught EngEng but actually required it from their students; i.e. other varieties of standard English were not allowed. This was the result of a conscious decision, often, that some norm needed to be established and that confusion would arise if teachers offered conflicting models. Lately, however, many universities have come to relax this requirement, recognising that their students are as likely (if not more likely) to encounter NAmEng as EngEng, especially since some European students study for a time in North America. Many universities therefore now permit students to speak and write either EngEng or NAmEng, so long as they are consistent.
(Trudgill and Hannah 1982:2)

In such cases the desired response can be obtained by framing the item so that the candidates have only to complete sentences presented to them:

Complete the following, which is based on the fourth paragraph of the passage.
'Many universities in Europe used to insist that their students speak and write only ........ Now many of them accept ........ as an alternative, but not a ........ of the two.'

Original passage (bracketed numbers mark every fifth line of the passage as printed, to which later items refer):

THE GOVERNMENT'S first formal acknowledgement that inhaling other people's tobacco smoke can cause lung cancer led to calls yesterday for legislation to make non-smoking the norm in offices, factories and public places. [5]

The Government's Independent Scientific Committee on Smoking and Health concluded that passive smoking was consistent with an increase in lung cancer of between 10 and 30 per cent in non-smokers. [10] While the home was probably an important source of tobacco smoke exposure, particularly for the young, "work and indoor leisure environments with their greater time occupancy may be more important for adults", the committee said. [15]

The Department of Health said the findings were consistent with 200 to 300 lung cancer deaths a year in non-smokers [20] and some deaths from other smoking-related diseases such as bronchitis. The risk had been estimated at about 100 times greater than the risk of lung cancer from inhaling asbestos over 20 years in the [25] amounts in which it is usually found in buildings.

The findings are to be used by the Health Education Council to drive home the message that passive smoking is accepted as causing lung cancer. The Government [30] should encourage proprietors of all public places to provide for clean air in all enclosed spaces, the council said, and legislation should not be ruled out.

Action on Smoking and Health went further, saying legislation was essential to [35] make non-smoking the norm in public places. "In no other area of similar risk to the public does the Government rely on voluntary measures," David Simpson, the [40] director of ASH, said.

The Department of Health said concern ...

(Timmins 1987)
The Independent Scientific Committee on Smoking and Health has just issued an interim report. It says that ........ smoking (that is, breathing in other people's smoke) is consistent with an increase in ........ of between 10 and 30 per cent amongst people who do not ........ . The risk of getting the disease in this way is reckoned to be very much greater than that of getting it through breathing in typical amounts of ........ over long periods of time. Children might be subjected to significant amounts of tobacco smoke at ........ , but for ........ places of work and indoor leisure might be more important.

In response to the report, the Health Education Council (which is soon to be ........ by the Health Education Authority) said that the Government should encourage owners of ........ places to ensure that the ........ in all enclosed spaces is clean. Action on Smoking and Health said that ........ was necessary. However, it is known that government ministers would prefer to use ........ .

A CART

A cart is the simplest type of wheeled vehicle. The following terms are used in describing the parts of one type of cart.

Read the definitions and choose 11 suitable words from the list of 16 to label the parts numbered on the figure below. The first one is done for you. Write the labels on the table below the figure.

... : ... which the axle is fixed; protects the space between the end of the axle and the hub of the wheel.
exbed: a beam running the width of the cart to which the axle is fixed.
felloe: a section of the wooden rim on the wheel of the cart.
forehead: a plank forming the upper portion of the front end of a cart; usually with a curved top.
longboard: a plank of wood running parallel to the cart sides to form the floor of the cart.
rave: a horizontal beam forming part of the side wall of the cart.
shaft: one of a pair of wooden boards between which the horse is harnessed to pull the cart.
shutter: a piece of stout wood stretching across the bottom of a cart; joins the sides and supports the floorboards.
sole: a short extension of the lower timbers of the cart frame which projects behind the cart on each side.
standard: a vertical bar of wood which is part of the frame for the side of the cart.
stock: a hub made of elm wood into which wooden bars or spokes are fixed.
strake: an iron tyre made in sections and nailed to the rim of a wheel to protect it.
strouter: a curved wooden support to strengthen the cart sides; often elegantly carved.
topboard: a board with a curved top nailed to the top rave of the cart.

[Figure of a cart with numbered parts, with an answer table beneath: 1 axle; 2-11 left blank.]
Guessing is possible here, but the probabilities are lower than with straightforward multiple choice.

TECHNIQUES FOR PARTICULAR PURPOSES

Identifying order of events, topics, or arguments The candidate can be required to number the events etc., as in the following example taken from the RSA:

In what order does the writer do the following in her article? To answer this, put the number 1 in the answer column next to the one that appears first, and so on. If an idea does not appear in the article, write N/A (not applicable) in the answer column.

a) She gives some of the history of migraine.
b) She recommends specific drugs.
c) She recommends a herbal cure.
d) She describes migraine attacks.
e) She gives general advice to migraine sufferers.

[Newspaper article reproduced here: 'ONE-SIDED HEADACHE' - SUE LIMB begins an occasional series by sufferers from particular ailments. A personal account of migraine: its attacks and triggers, the history and myths of the ailment, the British Migraine Association, prescription drugs, the herb feverfew, acupuncture, and relaxation. (Limb 1987)]

Identifying referents One of the 'micro-skills' listed above was the ability to identify referents. An example of an item to test this (based on the original passage about smoking, above) is:

What does the word 'it' (line 25) refer to? ........

Care has to be taken that the precise referent is to be found in the text. It may be necessary on occasion to change the text slightly for this condition to be met.

Guessing the meaning of unfamiliar words from context This is another of the micro-skills mentioned above. Items may take the form:

Find a single word in the passage (between lines 1 and 25) which has the same meaning as 'making of laws'. (The word in the passage may have an ending like -s, -ing, -ed, etc.)

This item also relates to the smoking passage.

An examination of the use of techniques for testing what we have called micro-skills reveals that candidates always need more than the particular micro-skill in order to find the correct response. In order to 'guess the meaning of a word from its context', the context itself has to be understood. The same will normally be true when a referent has to be identified. It is for this reason that it was suggested above that items of this kind have a place in final achievement and proficiency tests.

Relatively few techniques have been presented in this section. This is because, in my view, few basic techniques are needed, and non-professional testers will benefit from concentrating on developing their skills within a limited range, always allowing for the possibility of modifying these techniques for particular purposes and in particular circumstances. Many professional testers appear to have got by with just one - multiple choice! The more usual varieties of cloze and the C-Test technique have been omitted because, while they obviously involve reading to quite a high degree, it is not clear that reading ability is all that they measure. This makes it all the harder to interpret scores on such tests in terms of criterial levels of performance.
The wording of reading test items is not meant to cause candidates any difficulties of comprehension. It should always be well within their capabilities, less demanding than the text itself. In the same way, responses should make minimal demands on writing ability. Where candidates share a single native language, this can be used both for items and for responses. There is a danger, however, that items may provide some candidates with more information about the content of the text than they would have obtained from items in the foreign language.
The starting point for writing items is a careful reading of the text, having the specified operations in mind. Where relevant, a note should be taken of main points, interesting pieces of information, stages of argument, examples, and so on. The next step is to determine what tasks it is reasonable to expect candidates to be able to perform in relation to these. It is then that draft items can be written. Paragraph numbers and line numbers should be added to the text if items need to make reference to these. The text and items should be presented to colleagues for moderation. Items and even the text may need modification. Only when the test-writing team is satisfied with all the items (from the point of view of reliable candidate performance and reliable scoring) should they be included as part of the test (for pretesting where this is possible).

A note on scoring

General advice on obtaining reliable scoring has already been given in Chapter 5. It is worth adding here, however, that in a reading test (or a listening test), errors of grammar, spelling or punctuation should not be penalised, provided that it is clear that the candidate has successfully performed the reading task which the item set. The function of a reading test is to test reading ability. To test productive skills at the same time (which is what happens when grammar etc. are taken into account) simply makes the measurement of reading ability less accurate.
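The principle above - ignore errors of grammar, spelling and punctuation once the reading task has clearly been performed - can be supported by a lenient answer-matcher for unique-answer items. The normalisation rules below are my own illustration, not the book's; note that spelling is deliberately not corrected, since near-misses still need a human judgement:

```python
import re

def lenient_match(response, acceptable):
    """True if the response matches any acceptable answer once case,
    punctuation and extra spacing are ignored."""
    def norm(s):
        # Lowercase, strip punctuation, collapse whitespace into tokens.
        return re.sub(r"[^a-z0-9 ]", "", s.lower()).split()
    return any(norm(response) == norm(a) for a in acceptable)

acceptable = ["China and Japan"]
print(lenient_match("china and japan.", acceptable))  # True
print(lenient_match("Japan and China", acceptable))   # False: word order differs
```

Listing several acceptable answers (including reorderings, where the key allows them) keeps the leniency under the test writer's control rather than the program's.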
PRACTICAL ADVICE ON ITEM WRITING

1. In a scanning test, present items in the order in which the answers can be found in the text. Not to do this introduces too much random variation and so lowers the test's reliability.

2. Do not write items for which the correct response can be found without understanding the text (unless that is an ability that you are testing!). Such items usually involve simply matching a string of words in the question with the same string in the text. Thus (around line 45 in the smoking passage, above):

   What body said that concern over passive smoking had arisen in part through better insulation and draught proofing?

   Better might be:

   What body has claimed that worries about passive smoking are partly due to improvements in buildings?

   Items that demand simple arithmetic can be useful here. We may learn in one sentence that before 1985 there had only been three hospital operations of a particular kind; in another sentence, that there have been 85 since. An item can ask how many such operations there have been to date, according to the article.

3. Do not include items that some candidates are likely to be able to answer without reading the text. For example:

   Inhaling smoke from other people's cigarettes can cause ........

   It is not necessary, however, to choose such esoteric topics as has characterised the Joint Matriculation Board's Test in English (Overseas). These have included coracles, the RuCl1, and the people of Willington. (Swan 1975)

READER ACTIVITIES

1. Following the procedures and advice given in the chapter, construct a 12-item reading subtest based on the passage about New Zealand Youth Hostels on page 132. (The passage was used in the Oxford Examination in English as a Foreign Language, Preliminary Level, in 1987.) For each item, make a note of the skill(s) (including micro-skills) you believe it is testing. If possible, have colleagues take the test and provide critical comment. Try to improve the test. Again if possible, administer the test to an appropriate group of students. Score the tests. Interview a few students as to how they arrived at correct responses. Did they use the particular micro-skills that you predicted they would?

2. Compare your questions with the ones to be found in Appendix 2. Can you explain the differences in content and technique? Are there any items which you might want to change? Why? How?

3. The following is part of an exercise designed to help students learn to cope with 'complicated sentences'. How successful would this form of exercise be as part of a reading test? What precisely would it test? Would you want to change the exercise in any way? If so, why and how? Could you make it non-multiple choice? If so, how?

   The intention of other people concerned, such as the Minister of Defence, to influence the government leaders to adapt their policy to fit in with the demands of the right wing, cannot be ignored.

   What is the subject of 'cannot be ignored'?
   a. the intention
   b. other people concerned
   c. the Minister of Defence
   d. the demands of the right wing
[Here the original reproduces an authentic magazine article, used as a test text, promoting New Zealand's sixty-plus youth hostels: a night's accommodation for only $10 NZ and dinner shared with travellers from around the world; the Queenstown hostel on the shore of Lake Wakatipu, with views across to the Remarkables and cruises on the historic steamer Earnslaw; walks into the alpine world of Mt Cook; the hostel near the Franz Josef and Fox glaciers in Westland National Park, whose manager is renowned for his waffles; boat trips from the Bay of Islands and Whangaroa Harbour hostels; the hot pools and geysers of Rotorua; city hostels in Auckland, Wellington, Christchurch and Dunedin; and a closing note that hostels are not luxury hotels, but provide simple, comfortable accommodation with good kitchen, laundry and bathroom facilities. Further information: Youth Hostel Association National Office, PO Box 436, Christchurch, New Zealand.]
[Here the original reproduces, in two versions, a short reading passage about Archie MacLaren, who first learnt parachute jumping when he was 75 and had made 18 parachute jumps over the last six years; he was recently given the pleasure of honorary membership of a parachutists' club. When he first started parachuting, he told a lie about his age. His wife and daughter are worried about his parachuting, mountain climbing and hang-gliding.]
12 Testing listening

It may seem rather odd to test listening separately from speaking, since the two skills are typically exercised together in oral interaction. However, there are occasions in real life when listening is not accompanied by speaking; and there are also situations where the testing of oral ability is considered, for one reason or another, impractical, but where a test of listening is included for its backwash effect on the development of oral skills.

Because it is a receptive skill, the testing of listening parallels in most ways the testing of reading. This chapter will therefore spend little time on issues common to the testing of the two skills, and will concentrate more on matters which are particular to listening. The reader who plans to construct a listening test is advised to read both this and the previous chapter.

As with reading, the operations to be tested have to be specified. Macro-skills would be those related to candidates' needs or to course objectives, and might include:
   interpretation of intonation patterns (recognition of sarcasm, etc.)
   recognition of the function of structures (such as interrogative as request, for example, Could you pass the salt?)

At the lowest level are abilities like being able to distinguish between phonemes (for example between /w/ and /r/).

It seems to me that abilities at the lowest level are only of interest for diagnostic purposes, and the relevant information can be readily obtained through informal means.
TYPES OF TEXT
These might be first specified as monologue, dialogue, or multi-participant; and further specified: announcement, talk or lecture, instructions, directions, etc.

ADDRESSEES
Texts may be intended for the general public, students (either specialists or generalists), young children, and so on.

TOPICS
These will often be indicated in quite general terms.
speakers, then ideally we should use samples of authentic speech. In fact these can usually be readily found, with a little effort. Possible sources are: radio broadcasts, teaching materials (see the Further reading section for examples), and our own recordings of native speakers. If, on the other hand, we want to know whether candidates can understand language which may be addressed to them as non-native speakers, these too can be obtained from teaching materials and from recordings of native speakers which we can make ourselves. In some cases the indifferent quality of the recording may necessitate re-recording. It seems to me, though not everyone would agree, that a poor recording introduces difficulties
additional to the ones that we want to set, and so reduces the validity of the test. It may also introduce unreliability, since the performance of individuals may be affected by the recording faults in different degrees from occasion to occasion. In some cases (see below), a recording may be used simply as the basis for a 'live' presentation.

If the quality of a recording is unsatisfactory, it is always possible to make a transcription and then re-record it. Similarly, if details of what is said on the recording interfere with the writing of good items, testers should feel able to edit the recording, or to make a fresh recording from the amended transcript.

If recordings are made especially for the test, then care must be taken to make them as natural as possible. There is typically a fair amount of redundancy in spoken language: people are likely to paraphrase what they have already said ('What I mean to say is ...'), and to remove this redundancy is to make the listening task unnatural. In particular, we should avoid passages originally intended for reading, like the following, which appeared as an example of a listening comprehension passage for a well-known test:

   She found herself in a corridor which was unfamiliar, but after trying one or two doors discovered her way back to the stone-flagged hall which opened onto the balcony. She listened for sounds of pursuit but heard none. The hall was spacious, devoid of decoration: no flowers, no pictures.

This is an extreme example, but test writers should be wary of trying to create spoken English out of their imagination: it is better to base the passage on a genuine recording, or a transcript of one.

Suitable passages may be of various lengths, depending on what is being tested. A passage lasting ten minutes or more might be needed to test the ability to follow an academic lecture, while twenty seconds could be sufficient to give a set of directions.

Writing items

For extended listening, such as a lecture, a useful first step is to listen to the passage and take notes, putting down what it is that candidates should be able to get from the passage. We can then attempt to write items that check whether or not they have got what they should be able to get. This note-making procedure will not normally be necessary for shorter passages, which will have been chosen (or constructed) to test particular abilities.

In testing extended listening, it is essential to keep items sufficiently far apart in the passage. If two items are close to each other, candidates may miss the second of them through no fault of their own, and the effect of this on subsequent items can be disastrous (candidates listening for answers that have already passed). Since a single faulty item can have such an effect, it is particularly important to trial extended listening tests, even if only on colleagues aware of the potential problems.

Candidates should be warned by key words that appear both in the item and in the passage that the information called for is about to be heard. For example, an item may ask about 'the second point that the speaker makes' and candidates will hear 'My second point is ...'. The wording does not have to be identical, but candidates should be given fair warning in the passage. It would be wrong, for instance, to ask about 'what the speaker regards as her most important point' when the speaker makes the point and only afterwards refers to it as the most important. This is an extreme example; less obvious cases should be revealed through trialling.

Other than in exceptional circumstances (such as when the candidates are required to take notes on a lecture without knowing what the items will be, see below), candidates should be given sufficient time at the outset to familiarise themselves with the items.

As was suggested for reading in the previous chapter, there seems no sound reason not to write items and accept responses in the native language of the candidates. This will in fact often be what would happen in the real world, when a fellow native speaker asks for information that we have to listen for in the foreign language.

POSSIBLE TECHNIQUES

Multiple choice. The advantages and disadvantages of using multiple choice in extended listening tests are similar to those identified for reading tests in the previous chapter. In addition, however, there is the problem of the candidates having to hold in their heads four or more alternatives while listening to the passage and, after responding to one item, of taking in and retaining the alternatives for the next item. If multiple choice is to be used, then the alternatives must be kept short and simple. The alternatives in the following, which appeared in a sample listening test of a well-known examination, are probably too complex.

   When stopped by the police, how is the motorist advised to behave?
   A He should say nothing until he has seen his lawyer.
   B He should give only what additional information the law requires.
   C He should say only what the law requires.
   D He should in no circumstances say anything.

Short answer. Provided that the items themselves are brief, and only really short responses are called for, short-answer items can work well in listening tests. The completion variety, requiring minimal writing from the candidate, is particularly useful.
Information transfer. This technique is as useful in testing listening as it is in testing reading, since it makes minimal demands on productive skills. It can involve such activities as the labelling of diagrams or pictures, completing forms, or showing routes on a map. The following example, which is taken from the ARELS examination, is one of a series of related tasks in which the candidate 'visits' a friend who has been involved in a motor accident. The friend has hurt his hand, and the candidate (listening to a tape-recording) has to help Tom write his report of the accident. Time allowed for each piece of writing is indicated.

   In this question you must write your answers. Tom also has to draw a sketch map of the accident. He has drawn the streets, but he can't write in the names. He asks you to fill in the details. Look at the sketch map in your book. Listen to Tom and write on the map what he tells you.

Note taking. Where the ability to take notes while listening to, say, a lecture is in question, this activity can be quite realistically replicated in the testing situation. Candidates take notes during the talk, and only after the talk is finished do they see the items to which they have to respond. When constructing such a test, it is essential to use a passage from which notes can be taken successfully. This will only become clear when the task is first attempted by test writers.

It should go without saying that, since this is a testing task which might otherwise be unfamiliar, potential candidates should be made aware of its existence and, if possible, be provided with practice materials. If this is not done, then the performance of many candidates will lead us to underestimate their ability.

Partial dictation. While partial dictation (see Chapter 11) may not be a particularly authentic listening activity (though in lectures at university, for instance, there is a certain amount of dictation), it can be useful. It may be possible to administer a partial dictation when no other test of listening is practical. It can also be used diagnostically to test students' ability to cope with particular difficulties (such as weak forms in English).
Testing grammar and vocabulary

[...] other. Wherever the teaching of grammar is thought necessary, then consideration should be given to the advisability of including a grammar component in achievement tests. If this is done, however, it would seem prudent, from the point of view of backwash, not to give such components too much prominence in relation to tests of skills, the development of which will normally constitute the primary objectives of language courses.

Whether or not grammar has an important place in an institution's teaching, it has to be accepted that grammatical ability, or rather the lack of it, sets limits to what can be achieved in the way of skills performance. The successful writing of academic assignments, for example, must depend to some extent on command of more than the most elementary grammatical structures. It would seem to follow from this that in order to place students in the most appropriate class for the development of such skills, knowledge of a student's grammatical ability would be very useful information. There appears to be room for a grammar component in at least some placement tests.

It would be particularly useful to know for individual students (and for groups of students) what gaps exist in their grammatical repertoire. Unfortunately, as explained in Chapter 3, for the moment at least, to obtain reliable and more than rudimentary diagnostic information of this kind would involve the development and administration of long and unwieldy tests. Computer-based testing may bring progress in this field in the near future.

WRITING SPECIFICATIONS

For achievement tests where teaching objectives or the syllabus list the structures to be taught [...] of the structures identified in this way, as well as, perhaps, those structures the command of which is taken for granted in even the lowest classes.

SAMPLING

This will reflect an attempt to give the test content validity by selecting widely from the structures specified. It should also take account of what are regarded for one reason or another as the most important structures. It should not deliberately concentrate on the structures which happen to be easiest to test.

WRITING ITEMS

Many, if not most, language testing handbooks encourage the testing of grammar by means of multiple choice items, often to the exclusion of just about any other method. Chapter 8 of this book gave a number of reasons why testing within institutions should avoid excessive use of multiple choice. In this section, there will be no examples of multiple choice items. Though all the techniques can be given a multiple choice format, the reader who attempts to write such items can often expect to have problems in finding suitable distractors.

The techniques to be presented here are just three in number: paraphrase, completion, and modified cloze. Used with imagination, these should prove sufficient for most grammar testing purposes. They have in common what should be appreciated as a highly desirable quality: they require the student to supply grammatical structures appropriately and not simply to recognise their correct use. Ways of scoring are suggested in the next section.

PARAPHRASE

These require the student to write a sentence equivalent in meaning to one that is given. It is helpful to give part of the paraphrase in order to restrict the students to the grammatical structure being tested. Thus:

1. Testing passive, past continuous form.
   When we arrived, a policeman was questioning the bank clerk.
   When we arrived, the bank clerk ....................................

2. [...] six years.

COMPLETION

This technique can be used to test a variety of structures. Note how the context in a passage like the following, from the Cambridge First Certificate in English Testpack 1, allows the tester to elicit specific structures, in this case interrogative forms.

   In the following conversation, the sentences numbered (1) to (6) have been left incomplete. Complete them suitably. Read the whole conversation before you begin to answer the question.

   (Mr Cole wants a job in Mr Gilbert's export business. He has come for an interview.)

   Mr Gilbert: Good morning, Mr Cole. Please come in and sit down. Now let me see. (1) Which school ..........?
   Mr Cole: Whitestone College.
   Mr Gilbert: (2) And when ..........?
   Mr Cole: In 1972, at the end of the summer term.
   Mr Gilbert: (3) And since then what ..........?
   Mr Cole: I worked in a bank for a year. Then I took my present job, selling cars. But I would like a change now.
   Mr Gilbert: (4) Well, what sort of a job ..........?
   Mr Cole: I'd really like to work in your Export Department.
   Mr Gilbert: That might be a little difficult. What are your qualifications? (5) I mean what languages .......... besides English?
   Mr Cole: Well, only a little French.
   Mr Gilbert: That would be a big disadvantage, Mr Cole. (6) Could you tell me why ..........?
   Mr Cole: Because I'd like to travel and to meet people from other countries.
   Mr Gilbert: I don't think I can help you at present, Mr Cole. Perhaps you ought to try a travel agency.

MODIFIED CLOZE

Testing prepositions of place

   John looked round the room. The book was still .......... the table. The cat was .......... the chair. He wondered what was .......... the box .......... the telephone.

And so on.

Testing articles
(Candidates are required to write the, a, or NIL = no article.)

   In England children go to .......... school from Monday to Friday. .......... school that Mary goes to is very small. She walks there each morning with .......... friend. One morning they saw .......... man throwing .......... stones and .......... pieces of wood at .......... dog. .......... dog was afraid of .......... man.

And so on.

Testing a variety of grammatical structures
(The text is taken from Colin Dexter, The Secret of Annexe 3.)

   When the old man died, .......... was probably no great joy .......... heaven; and quite certainly little if any real grief in Charlbury Drive, the pleasantly unpretentious cul-de-sac .......... semi-detached houses to which he .......... retired.

Testing sentence linking
(A one-word answer is required.)

   The council must do something to improve transport in the city. .........., they will lose the next election.

SCORING PRODUCTION GRAMMAR TESTS

The important thing is to be clear about what each item is testing, and to award points for that only. There may be just one element, such as the definite article, and all available points should be awarded for that; nothing should be deducted for non-grammatical errors, or for errors in grammar which is not being tested by the item. For instance, a candidate should not be penalised for a missing third person -s when the item is testing relative pronouns; opend should be accepted for opened, without penalty.

If two elements are being tested in an item, then points may be assigned to each of them (for example present perfect form and since with past time reference point). Alternatively, it can be stipulated that both elements have to be correct for any points to be awarded, which makes sense in those cases where getting one element wrong means that the student does not have full control of the structure.

For valid and reliable scoring of grammar items of the kind advocated here, careful preparation of the scoring key is necessary. A key for items in the previous section might be along the following lines.
Paraphrase
1. Must have was being [questioned] - 1 point
2. I haven't [or have not] - 1 point
   seen - 1 point
   for six years - 1 point

Alternatively, all three elements could be required for a point to be given, depending on the information that is being looked for.

[...] the development and demonstration of linguistic skills. But that does not necessarily mean that it should be tested separately. Similar reasons may be advanced for testing vocabulary in proficiency tests to those used to support the inclusion of a grammar section (though vocabulary has its special sampling problems). However, the arguments for a separate component in other kinds of test may not have the same strength. One suspects that much less time is devoted to the regular, conscious teaching of vocabulary than to the similar teaching of grammar. If there is little teaching of vocabulary, it may be argued that [...]

ITEM WRITING

Recognition. This is one testing problem for which multiple choice can be recommended without too many reservations. For one thing, distractors are usually readily available. For another, there seems unlikely to be any serious harmful backwash effect, since guessing the meaning of vocabulary items is something which we would probably wish to encourage. However, the writing of successful items is not without difficulties.

Items may involve a number of different operations:

1. Synonyms
Choose the alternative (A, B, C, D) which is closest in meaning to the word on the left of the page.

   gleam    A. gather    B. shine    C. welcome    D. clean

The writer of this item has probably chosen the first alternative because of the word glean. The fourth may have been chosen because of the similarity of its sound to that of gleam. Whether these distractors would work as intended would only be discovered through pretesting.

Note that all of the options are words which the candidates are expected to know. If, for example, welcome were replaced by groyne, most candidates, recognising that it is the meaning of the stem (gleam) on which they are being tested, would dismiss groyne immediately.

On the other hand, the item could have a common word as the stem with four less frequent words as options:

   shine    A. malm    B. gleam    C. loam    D. snarl

Note that in both items it is the word gleam which is being tested.

2. Definitions

   loathe means    A. dislike intensely
                   B. become seriously ill
                   C. search carefully
                   D. look very angry

Note that all of the options are of about the same length. It is said that test-takers who are uncertain of which option is correct will tend to choose the one which is noticeably different from the others. If dislike intensely is to be used as the definition, then the distractors should be made to resemble it. In this case the item writer has included some notion of intensity in all of the options.

Again the difficult word could be one of the options.

   One word that means to dislike intensely is    A. growl
                                                  B. screech
                                                  C. sneer
                                                  D. loathe

3. Gap filling (multiple choice)
Context, rather than a definition or a synonym, can be used to test knowledge of a lexical item.

   The strong wind .......... the man's efforts to put up the tent.
   A. disabled    C. deranged
   B. hampered    D. regaled

Note that the context should not itself contain words which the candidates are unlikely to know.

PRODUCTION

The testing of vocabulary productively is so difficult that it is practically never attempted in proficiency tests. Information on receptive ability is regarded as sufficient. The suggestions presented below are intended only for possible use in achievement tests.

1. Pictures
The main difficulty in testing productive lexical ability is the need to limit the candidate to the (usually one) lexical item that we have in mind, while using only simple vocabulary ourselves. One way round this is to use pictures.

   Each of the objects drawn below has a letter against it. Write down the names of the objects:

   [Drawings of six objects, labelled A to F.]

This method of testing vocabulary is obviously restricted to concrete nouns which can be unambiguously drawn.
2. Definitions
This may work for a range of lexical items:

   A ................. is a person who looks after our teeth.
   ................. is frozen water.
   ................. is the second month of the year.

But not all items can be identified uniquely from a definition: any definition of, say, feeble would be unlikely to exclude all of its synonyms. Nor can all words be defined entirely in words more common or simpler than themselves.

Postscript

This chapter should end with a reminder that while grammar and vocabulary contribute to communicative skills, they are rarely to be regarded as ends in themselves. It is essential that tests should not accord them too much importance, and so create a backwash effect that undermines the achievement of the objectives of teaching and learning where these are communicative in nature.

READER ACTIVITIES

2. Produce two vocabulary tests by writing two items for each of the following words, including, if possible, a number of production items. Give each test to a different (but comparable) group of students (you could divide one class into two). Compare performance on the pairs of items. Can differences of performance be attributed to a difference in technique?

   beard    sigh     bench    deaf     genial
   greedy   mellow   callow   tickle   weep
14 Test administration

The best test may give unreliable and invalid results if it is not well administered. This chapter is intended simply to provide readers with an ordered set of points to bear in mind when administering a test. While most of these points will be very obvious, it is surprising how often some of them can be forgotten without a list of this kind to refer to. Tedious as many of the suggested procedures are, they are important for successful testing. Once established, they become part of a routine which all concerned take for granted.

Preparation

[...] the Administration section, below.
7. Examiners should practise directions which they will have to read out to candidates.
8. Examiners who will have to use equipment (for example tape-recorders) should familiarise themselves with its operation.
9. Examiners who have to read aloud for a listening test should practise, preferably with a model tape-recording (see Chapter 12).
10. Oral examiners must be thoroughly familiar with the test procedures and rating system to be used (only properly trained oral examiners should be involved).

Invigilators (or proctors)

11. Detailed instructions should also be prepared for invigilators, and should be the subject of a meeting with them. See Administration, below, for possible content.

Candidates

[...] should certainly not be allowed to disturb the concentration of those already taking the test.
Test administration
20. The identity of candidates should be checked.
21. If possible, candidates should be seated in such a way as to prevent
    friends being in a position to pass information to each other.
22. The examiner should give clear instructions to candidates about
    what they are required to do. These should include information on
    how they should attract the attention of an invigilator if this proves
    necessary, and what candidates who finish before time are to do.
    They should also warn students of the consequences of any irregular
    behaviour, including cheating, and emphasise the necessity of main-
    taining silence throughout the duration of the test.
23. Test materials should be distributed to candidates individually by
    the invigilators in such a way that the position of each test paper and
    answer sheet is known by its number. A record should be made of
    these. Candidates should not be allowed to distribute test materials.
24. The examiner should instruct candidates to provide the required
    details (such as examination number, date) on the answer sheet or
    test booklet.
25. If spoken test instructions are to be given in addition to those written
    on the test paper, the examiner should read these, including what-
    ever examples have been agreed upon.
26. It is essential that the examiner time the test precisely, making sure
    that everyone starts on time and does not continue after time.
27. Once the test is in progress, invigilators should unobtrusively
    monitor the behaviour of candidates. They will deal with any
    irregularities in the way laid down in their instructions.
28. During the test, candidates should be allowed to leave the room only
    one at a time, ideally accompanied by an invigilator.
29. Invigilators should ensure that candidates stop work immediately
    they are told to do so. Candidates should remain in their places until
    all the materials have been collected and their numbers checked.

Appendix 1  Statistical analysis of test results

As was indicated in Chapter 7, the point of carrying out pretesting is to be
able to analyse performance on the test and thereby identify any weaknesses
or problems there may be before the test is administered to the target
candidates. Even when pretesting is not possible, it is nevertheless worthwhile
to analyse performance on the test after the main administration. Part of the
analysis will be simply to provide a summary of how the candidates have
done, and to check on the test's reliability and so have some idea of how
dependable are the test scores. This chapter will provide the reader with an
outline of how such analysis can be conducted. It will make use only of the
most elementary arithmetic concepts, which should be well within the
abilities of any reader. It will also restrict itself to operations which can be
carried out on an inexpensive calculator.¹

While these deliberate restrictions may make post-test analysis accessible to
more readers than would otherwise be the case, it also means that some of the
procedures described are rather less sophisticated than one would hope to see
applied in ideal circumstances. References to procedures commonly used by
professional testers, which will give more dependable results, can be found in
the Further reading section.

Summarising the scores

The mean
The first step is to calculate the mean score (or average score) for each
component of the test, as well as for the whole test. The mean is the sum of all
the scores divided by the number of scores. Thus the mean of 14, 34, 56, 68
is: (14 + 34 + 56 + 68) ÷ 4 = 43.

If the components of the test have equal weight, then the mean for the
whole test will be the mean of the means for each section. Thus if the mean
for reading is 40 and the mean for writing is 48, then the mean for the
complete test will be 44.
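For readers working at a computer rather than with a calculator, the arithmetic just described can be sketched in a few lines of Python. The scores below are illustrative only (the candidate's component scores in the weighted case are invented for the example):

```python
# The mean: the sum of all the scores divided by the number of scores.
scores = [14, 34, 56, 68]
print(sum(scores) / len(scores))  # 43.0

# Equally weighted components: the test mean is the mean of the
# component means.
reading_mean, writing_mean = 40, 48
print((reading_mean + writing_mean) / 2)  # 44.0

# Differently weighted components (say reading 45 per cent, listening
# 55 per cent): work with each candidate's weighted total score directly.
reading, listening = 60, 50  # one (hypothetical) candidate's scores
total = 0.45 * reading + 0.55 * listening
print(round(total, 2))  # 54.5
```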
If the components are weighted differently (say reading 45 per cent,
listening 55 per cent), then the easiest thing is to enter the test total scores
(which take account of the weighting) into the calculator. Whether or not
there is differential weighting, these scores will have to be entered to
calculate the standard deviation for the whole test (see below).

1. All that is needed is a calculator that gives square roots (look for the
symbol √), means (look for the symbol x̄), standard deviations (look for SD,
s, or σ) and correlations (look for r). The one apparent exception, the use of
a micro-computer to carry out Rasch analysis, is mentioned only as a matter
of interest to those readers who might be in a position to benefit from this.

The standard deviation

Very different sets of scores may have the same mean. For example, a mean
of 50 could be the result of either of the following two sets of scores:

1. 48, 49, 50, 51, 52
2. 10, 20, 40, 80, 100

If we wanted to compare two such sets (one could represent scores on
reading, while the other might be scores on listening), stating the mean alone
would be misleading. We need an indication of the way that the scores are
distributed around the mean. This is what the standard deviation gives us.
The standard deviations for the two sets of scores above are 1) 1.58 and
2) 38.73.

The two summary measures, the mean and the standard deviation, thus
allow us to compare the performance of different groups of students on a
test, or the same group of students on different tests or different parts of one
test. They will also prove useful in estimating the reliability of the test,
below.

[Figure 1. Histogram of the scores of 108 students on a 15-item test.]

The diagram should be self-explanatory: the vertical dimension indicates the
number of students scoring within a particular range of scores; the horizontal
dimension shows what these ranges are.
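The standard deviation figures quoted for the two illustrative score sets (using the n − 1 formula, as the calculator note in this section recommends) can be checked with Python's statistics module:

```python
from statistics import mean, stdev  # stdev divides by n - 1, not n

set1 = [48, 49, 50, 51, 52]
set2 = [10, 20, 40, 80, 100]

for s in (set1, set2):
    print(mean(s), round(stdev(s), 2))
# Both sets have a mean of 50, but standard deviations of 1.58 and
# 38.73: it is the spread, not the mean, that distinguishes them.
```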
Histograms

The mean and standard deviation are numerical values. It is sometimes
easier to 'take in' a set of scores pictorially, as in the histogram, Figure 1.

Because of the ways a set of scores may be distributed (perhaps quite
unevenly - for example a small number of very low scores, but no very high
ones), such a pictorial representation can correct any misleading impression
that the mean and standard deviation may give. It will also make clear what
will be the outcome (how many will pass, fail, or be on the borderline) of
setting particular pass marks. For these reasons, it is always advisable to
make such a diagram, at least for test totals.

Calculating means and standard deviations on a calculator
1. Press the key(s) necessary to engage the calculator's statistical
   functions.
2. Tap in the first score.
3. Press the Enter key (this is quite likely to have a plus (+) sign on
   it, though this varies from calculator to calculator).
4. Repeat 2 and 3 for all of the scores.
5. Press the Mean key, which probably has a x̄ sign on it.
   The figure which is then displayed is the mean of that set of scores.
6. Press the Standard Deviation key, which may have the Greek letter
   sigma (σ) on it. (If your calculator offers a choice of standard
   deviations - more than one key - use the one that the calculator
   manual says is based on a formula that contains the expression
   n-1, rather than just n.)
7. Record the mean and standard deviation, press the 'Clear' key,
   and then enter scores for the next component.

Estimating reliability

We already know from Chapter 5 that there are a number of ways of
estimating a test's reliability. We shall present here only the split half
method. It may be remembered that this involves dividing the test (or a
component of a test) very carefully into two equivalent halves. Then for each
student, two scores are calculated: the score on one part, and the score on
the other part. The more similar are the two sets of scores, the more reliable
is the test said to be (the rationale for this was presented in Chapter 5). Let
us take the following sets of scores (for illustration only; estimates of
reliability are only meaningful when the scores of much larger numbers of
students are involved):
Table 1

Student     Score on one part     Score on other part
A           55                    69
B           48                    52
C           23                    21
D           55                    60
E           44                    39
F           56                    59
G           38                    35
H           19                    16
I           67                    62
J           52                    57

Box 3

Calculating the correlation coefficient on a calculator for the
above split half scores
1. Press the key to engage the calculator's statistical functions.
2. Tap in 55.
3. Press one of the entry keys (the calculator manual will tell you
   which).
4. Tap in 69.
5. Press the other entry key.
6. Repeat 2 to 5 for the other pairs of scores.
7. Press the key for Correlation coefficient (probably marked 'r').

The strength of the relationship between the two sets of scores is given by the
correlation coefficient. When calculated as directed in Box 3 above, this turns
out to be 0.95. This coefficient relates to the two half tests. But the full test is
twice as long as either half, and so will be more reliable than either; its
reliability is estimated with the formula in Box 4.

Box 4

Estimating reliability of full test from coefficient based on
split halves
The formula to use is:

                            2 x coefficient for split halves
Reliability of whole test = --------------------------------
                            1 + coefficient for split halves

In the present case:

2 x 0.95   1.90
-------- = ---- = 0.97
1 + 0.95   1.95

Using the formula, we obtain a figure of 0.97, which indicates very high
reliability.

There is now just one thing to check. The possibility exists that the
correlation coefficient has overestimated the level of agreement between the
two sets of scores. For example, the following pairs of scores would give a
perfect correlation coefficient of 1: 33,66; 24,48; 18,36; 45,90. Yet the two
sets of scores are really very different. For this reason we should look at the
means and standard deviations for the two halves (which can be obtained at
the same time as the correlation coefficient, without putting in the scores all
over again). These are: 45.7 and 15.2 for one part, 47.0 and 18.2 for the
other. The results for the two parts are sufficiently similar for us to believe
that the correlation coefficient has not seriously overestimated the test's
reliability.²

We are now in a position to calculate the standard error of measurement.
(See Chapter 5 for a reminder of what this is and how it can be used.) Once
more there is a simple formula, and its use is detailed in Box 5 below.

Box 5

Calculating the standard error of measurement
The formula is:

Standard error of measurement
   = standard deviation x √(1 - reliability coefficient)

So the standard error of measurement is 32.99 x √(1 - 0.97)
                                      = 32.99 x √0.03
                                      = 32.99 x 0.173
                                      = 5.7

2. The more correct methods for estimating reliability are based on what is
called analysis of variance, which takes means and standard deviations into
account from the outset. Analysis of variance is beyond the scope of this
book, but see the Further reading section.
The standard error of measurement turns out to be 5.7. We know that we
can be 95 per cent certain that an individual's true score will be within two
standard errors of measurement from the score she or he actually obtains on
the test. In this case, two standard errors of measurement amount to 11.4
(2 x 5.7). This means that for a candidate who scores 64 on the test we can
be 95 per cent certain that the true score lies somewhere between 52.6 and
75.4. This is obviously valuable information if significant decisions are to be
taken on the basis of such scores.

So far this Appendix has largely been concerned with obtaining useful
information about the test as a whole. But a test is (normally) made up of
parts, and it is useful to know how the various parts are contributing to the
whole. Even individual items make their own contribution to the total test.
Some contribute more than others, and it is the purpose of item analysis to
identify those that need to be changed or replaced.

... correlations between scores on each item and scores on the whole test
will be extremely tedious. Happily there are short cuts, one of which is
represented in Figure 2 (an abac).

[Figure 2. An abac. The horizontal axis shows the proportion of the lower
subgroup answering an item correctly; the vertical axis runs from 0.10
to 1.00.]
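The true-score band just described is easy to compute directly. A sketch, using the values worked through in the text (whole-test standard deviation 32.99, reliability 0.97, observed score 64):

```python
import math

sd_total = 32.99     # standard deviation of whole-test scores
reliability = 0.97   # split-half estimate for the full test

# Standard error of measurement: SD x sqrt(1 - reliability).
sem = sd_total * math.sqrt(1 - reliability)

# Two standard errors either side of an observed score give the band
# within which we can be 95 per cent certain the true score lies.
score = 64
low, high = score - 2 * sem, score + 2 * sem
print(round(sem, 1), round(low, 1), round(high, 1))  # 5.7 52.6 75.4
```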
... that item's facility value is 0.75. There can be no strict rule about what
range of facility values are to be regarded as satisfactory. It depends on what
the purpose of the test is, and is related in a fairly complex fashion with
item-test total correlations. The best advice that can be offered to the reader
of this book is to consider the level of difficulty of the complete test. Should
it be made more difficult or more easy? If so, knowledge of how individual
items contribute to overall test difficulty can be used when making changes.

Analysis of distractors

Where multiple choice items are used, in addition to calculating item-test
correlations and facility values, it is necessary to analyse the performance of
distractors. Distractors which do not work, i.e. are chosen by very few
candidates, make no contribution to test reliability. Such distractors have to
be replaced by better ones, or the item has to be otherwise modified or
dropped.

Item analysis record cards

Item analysis information is usefully kept on cards (either physically or in
the computer). Here is an example for a multiple choice item.

Item No. ...

They'd said it would be a nice day, ....
a) wouldn't they?    b) didn't they?    c) hadn't they?
d) wouldn't it?      e) shouldn't it?

Option         Top 27%      Bottom 27%      Remainder
a, b, c, d     [tallies illegible in this copy]
e              0            2               0
Nil            0            0

Facility value 0.21     Item-test correlation 0.28

From this it can be seen that though the item seems to be working well, the
distractor e) is failing to distract candidates.

Item response theory

Everything that has been said so far has related to classical item analysis. In
recent years new methods of analysis have been developed which have many
attractions for the test writer. These all come under the general heading of
item response theory, and the form of it so far most used in language testing
is called 'Rasch analysis'. Unfortunately, for efficient use of this analysis
quite powerful microcomputers are necessary. Nevertheless, the growing
availability of these at reasonable cost means that the advantages of item
response theory analysis will be possible for even small institutions. It will
only be possible here to give the briefest outline of Rasch analysis.

Rasch analysis begins with the assumption that items on a test have a
particular difficulty attached to them, that they can be placed in order of
difficulty, and that the test taker has a fixed level of ability. Under these
conditions, the idealised result of a number of candidates taking a test will
be as in Table 2.

Table 2  Responses of imaginary subjects to imaginary items
         (Woods and Baker 1985)

[The table shows eight subjects (rows) and seven items (columns), the items
ordered by difficulty. In the idealised pattern each subject answers correctly
(1) every item up to a certain level of difficulty and incorrectly (0) every
item beyond it; a final row gives the total number of incorrect responses to
each item.]

Table 2 represents a model of what happens in test taking, but we know
that, even if the model is correct, people's performance will not be a perfect
reflection of their ability. In the real world we would expect an individual's
performance more like: [the illustrative response pattern is not legible in
this copy].

Rasch analysis in fact accepts variability of this kind as normal. But it does
draw attention to test performance which is significantly different from what
the model would predict. It will identify test takers whose behaviour does
not fit the model; and it will identify items that do not fit the model. We
know then that the total test scores of people so identified do not mean the
same thing as (and so cannot be used in the same way as) the same score
obtained by the other people taking the test. Similarly, the items which are
identified as aberrant do not belong in the test; they should be modified (if it
can be seen why they do not fit) or dropped.
Further reading

For analysis of variance and statistical matters in general, see Woods,
Fletcher, and Hughes (1986). For the calculation of reliability coefficients,
see Krzanowski and Woods (1984). For item response theory, see Woods
and Baker (1985) and other articles in Language Testing Vol. 2, No. 2. For
statistical techniques for the analysis of criterion-referenced tests, see
Hudson and Lynch (1984).

1. New Zealand has a) more than 60 hostels.
                    b) less than 60 hostels.
                    c) exactly 60 hostels.
2. You are unlikely to meet New Zealanders in the hostels. True or false?
3. Which hostel is said to be nearly always very full?
4. Where can you visit a working sheep-station?
5. Give one reason why Mount Cook is so popular with tourists.
6. What is the speciality of the hostel near the Franz Josef glacier?
7. Does the author recommend one particular hostel above any other which
   is particularly good for a lazy beach stay with sunbathing and scuba
   diving?
Bibliography

Adams, M. L. and J. R. Frith (Eds.) 1979. Testing Kit. Washington D.C.:
   Foreign Service Institute.
Alderson, J. C. 1987. Innovation in language testing: Can the micro-computer
   help? Language Testing Update Special Report No. 1. Institute for English
   Language Education, University of Lancaster.
Alderson, J. C. and A. Hughes (Eds.) 1981. Issues in language testing. ELT
   Documents 111. London: The British Council.
Alderson, J. C., K. J. Krahnke, and C. W. Stansfield. 1987. Reviews of English
   language proficiency tests. Washington D.C.: TESOL.
Anastasi, A. 1976. Psychological testing (4th edition). New York: Macmillan.
Bachman, L. F. and A. S. Palmer. 1981. The construct validation of the FSI
   oral interview. Language Learning 31:67-86.
Bachman, L. F. and A. S. Palmer. 1982. The construct validation of some
   components of communicative proficiency. TESOL Quarterly 16:449-65.
Bachman, L. F. and S. J. Savignon. 1986. The evaluation of communicative
   language proficiency: a critique of the ACTFL oral interview. Modern
   Language Journal 70:380-90.
Bozok, S. and A. Hughes. 1987. Proceedings of the seminar, Testing
   English beyond the high school. Istanbul: Boğaziçi University
   Publications.
Byrne, D. 1967. Progressive picture compositions. London: Longman.
Canale, M. and M. Swain. 1980. Theoretical bases of communicative
   approaches to second language teaching and testing. Applied Linguistics
   1:1-47.
Carroll, J. B. 1961. Fundamental considerations in testing for English language
   proficiency of foreign students. In H. B. Allen and R. N. Campbell (Eds.)
   1972. Teaching English as a second language: a book of readings. New
   York: McGraw Hill.
Carroll, J. B. 1981. Twenty-five years of research on foreign language
   aptitude. In K. C. Diller (Ed.) Individual differences and universals in
   language learning aptitude. Rowley, Mass: Newbury House.
Criper, C. and A. Davies. 1988. ELTS validation project report. Cambridge:
   The British Council and Cambridge Local Examinations Syndicate.
Crystal, D. and D. Davy. 1975. Advanced conversational English. London:
   Longman.
Davies, A. (Ed.) 1968. Language testing symposium: a psycholinguistic
   perspective. Oxford: Oxford University Press.
Davies, A. 1988. Communicative language testing. In Hughes 1988b.
Dexter, Colin. 1986. The Secret of Annexe 3. London: Macmillan.
Downie, N. M. and Robert W. Heath. 1985. Basic statistical methods. Harper
   and Row.
Ebel, R. L. 1978. The case for norm-referenced measurements. Educational
   Researcher 7 (11):3-5.
Garman, M. and A. Hughes. 1983. English cloze exercises. Oxford: Blackwell.
Godshalk, F. I., F. Swineford, and W. E. Coffman. 1966. The measurement of
   writing ability. New York: College Entrance Examination Board.
Greenberg, K. 1986. The development and validation of the TOEFL writing
   test: a discussion of TOEFL Research Reports 15 and 19. TESOL Quarterly
   20:531-44.
Hamp-Lyons, L. 1987. Performance profiles for academic writing. In K. Bailey
   et al. (Eds.) 1987. Language Testing Research: selected papers from the 1986
   colloquium. Monterey, California: Defense Language Institute.
Harris, D. P. 1968. Testing English as a second language. New York:
   McGraw Hill.
Hudson, T. and B. Lynch. 1984. A criterion-referenced approach to ESL
   achievement testing. Language Testing 1:171-201.
Hughes, A. 1981. Conversational cloze as a measure of oral ability. English
   Language Teaching Journal 35:161-8.
Hughes, A. 1986. A pragmatic approach to criterion-referenced foreign
   language testing. In Portal.
Hughes, A. 1988a. Introducing a needs-based test of English for study in an
   English medium university in Turkey. In Hughes 1988b.
Hughes, A. (Ed.) 1988b. Testing English for university study. ELT Documents
   127. Oxford: Modern English Press.
Hughes, A., L. Gülçur, P. Gürel, and T. McCombie. 1987. The new Boğaziçi
   University English Language Proficiency Test. In Bozok and Hughes.
Hughes, A. and D. Porter. (Eds.) 1983. Current developments in language
   testing. London: Academic Press.
Hughes, A., D. Porter, and C. Weir. (Eds.) 1988. Validating the ELTS test: a
   critical review. Cambridge: The British Council and University of
   Cambridge Local Examinations Syndicate.
Hughes, A. and P. Trudgill. 1987. English accents and dialects: an
   introduction to social and regional varieties of British English (2nd edition).
   London: Edward Arnold.
Jacobs, H. L., S. A. Zingraf, D. R. Wormuth, V. F. Hartfiel, and J. B.
   Hughey. 1981. Testing ESL composition: a practical approach. Rowley,
   Mass: Newbury House.
Klein-Braley, C. 1985. A cloze-up on the C-Test: a study in the construct
   validation of authentic tests. Language Testing 2:76-104.
Klein-Braley, C. and U. Raatz. 1984. A survey of research on the C-Test.
   Language Testing 1:134-46.
Krzanowski, W. J. and A. J. Woods. 1984. Statistical aspects of reliability in
   language testing. Language Testing 1:1-20.
Lado, R. 1961. Language testing. London: Longman.
Lado, R. 1986. Analysis of native speaker performance on a cloze test.
   Language Testing 3:130-46.
Limb, S. 1983. Living with illness. Migraine. The Observer, 9 October.
Morrow, K. 1979. Communicative language testing: revolution or evolution?
   In C. J. Brumfit and K. Johnson. The communicative approach to language
   teaching. Oxford: Oxford University Press. Reprinted in Alderson and
   Hughes.
Morrow, K. 1986. The evaluation of tests of communicative performance. In
   Portal.
Oller, J. W. 1979. Language tests at school: a pragmatic approach. London:
   Longman.
Oller, J. W. (Ed.) 1983. Issues in language testing research. Rowley, Mass:
   Newbury House.
Oller, J. W. and C. A. Conrad. 1971. The cloze technique and ESL proficiency.
   Language Learning 21:183-94.
Pilliner, A. 1968. Subjective and objective testing. In Davies 1968.
Pimsleur, P. 1968. Language aptitude testing. In Davies 1968.
Popham, W. J. 1978. The case for criterion-referenced measurements.
   Educational Researcher 7 (11):6-10.
Portal, M. (Ed.) 1986. Innovations in language testing. Windsor: NFER-Nelson.
Shohamy, E. 1984. Does the testing method make a difference? The case of
   reading comprehension. Language Testing 1:147-76.
Shohamy, E., T. Reves, and Y. Bejarano. 1986. Introducing a new
   comprehensive test of oral proficiency. English Language Teaching Journal
   40:212-20.
Skehan, P. 1984. Issues in the testing of English for specific purposes.
   Language Testing 1:202-20.
Skehan, P. 1986. The role of foreign language aptitude in a model of school
   learning. Language Testing 3:188-221.
Spolsky, B. 1981. Some ethical questions about language testing. In C.
   Klein-Braley and D. K. Stevenson (Eds.) Practice and problems in language
   testing 1. Frankfurt: Verlag Peter D. Lang.
Streiff, V. 1978. Relations among oral and written cloze scores and achievement
   test scores in a bilingual setting. In J. W. Oller and K. Perkins (Eds.)
   Language in education: testing the tests. Rowley, Mass: Newbury House.
Swan, M. 1975. Inside meaning. Cambridge: Cambridge University Press.
Swan, M. and C. Walter. 1988. The Cambridge English Course. Student's
   Book 3. Cambridge: Cambridge University Press.
Timmins, N. 1987. Passive smoking comes under official fire. The Independent,
   14 March.
Trudgill, P. and J. Hannah. 1982. International English: a guide to the
   varieties of standard English. London: Edward Arnold.
Underhill, N. 1987. Testing spoken language: a handbook of oral testing
   techniques. Cambridge: Cambridge University Press.
Weir, C. J. 1988a. Communicative language testing. Exeter Linguistic Studies
   Vol 11. Exeter: University of Exeter.
Weir, C. J. 1988b. The specification, realization and validation of an English
   language proficiency test. In Hughes 1988b.
West, M. (Ed.) 1953. A general service list of English words: with semantic
   frequencies and a supplementary word-list for the writing of popular science
   and technology. London: Longman.
Woods, A. and R. Baker. 1985. Item response theory. Language Testing
   2:119-40.
Woods, A., P. Fletcher, and A. Hughes. 1986. Statistics in language studies.
   Cambridge: Cambridge University Press.
Index

gap filling items, 149, 150
Garman, M., 69, 74
GCSE (General Certificate of Secondary Education), 46, 47
Godshalk, F. I., 20, 75
grammar, testing of, 141-6
Greenberg, K., 5
guessing, 60-1
Limb, S., 128
listening, testing of, 134-40
Lynch, B., 164
mean, 155-6
Michigan Test, 18-19
moderation, 51-2, 55, 56
monologue, 109-10
referents, identifying, 129
reliability, 3-4, 7, 29-43, 60, 157-60
   coefficient, 31-3, 157-9
   scorer, 3-4, 35-6, 51, 59, 86, 97
role play, 107-8
RSA (Royal Society of Arts), 50-1, 58, 76-80, 81, 102-3, 107, 108, 126-8
Savignon, S. J., 115
scoring, 51, 54, 59-60, 86-100, 110-14, 131, 139, 145-6
Shohamy, E., 115, 133
short answer items, 121-2, 137
Skehan, P., 7, 20
skills, 53
specifications, 22, 49-51, 53-5, 55-8, 75-81, 101-5, 116-19, 134-5, 142, 147
split-half estimate of reliability, 32-3, 158-9
Spolsky, B., 5
standard deviation, 156
standard error of measurement, 33-5, 159-60
Stansfield, C. W., 8
stem, 59
Streiff, V., 74
subjective testing, 19
Swain, M., 21
Swan, M., 131, 133
syllabus-content approach, 11-12
synonym items, 148
text types, 49, 53, 75, 81, 101, 118, 135
timing, 50, 54, 55-6
Timmins, N., 123
TOEFL (Test of English as a Foreign Language), 5, 10, 15, 19, 45, 58, 86
topics, 50, 53, 75, 81, 118, 135
Trudgill, P., 122, 140
true score, 33-5, 160
TWE (Test of Written English), 5, 45, 86
UCLA ESLP (University of California at Los Angeles English as a Second
   Language Placement) Test, 65, 72
Underhill, N., 115
unfamiliar words, guessing meaning of, 129
unique answer items, 121
unitary competence hypothesis, 62-3, 74
validity, 22-8
   coefficient, 24-5
   concurrent, 23-5
   construct, 26-7
   content, 22-3, 27, 51, 77, 81, 141
   criterion-related, 23-5, 45-6, 57
   predictive, 25
   and reliability, 42
vocabulary, testing of, 146-50