Assessment and Testing
Annual Review of Applied Linguistics (2000) 20, 147–161. Printed in the USA.
Caroline Clapham
INTRODUCTION
In this brief article, I discuss the relationship between language testing and
the other sub-disciplines of applied linguistics and also the relationship, as I see it,
between testing and assessment. The article starts with a brief exploration of the
term ‘applied linguistics’ and then goes on to discuss the role of language testing
within this discipline, the relationship between testing and teaching, and the
relationship between testing and assessment. The second part of the article
mentions some areas of current concern to testers and discusses in more detail
recent advances in the areas of performance testing, alternative assessment, and
computer assessment. One of my aims in this article is to argue that the skills
involved in language testing are necessary not only for those constructing all kinds
of language proficiency assessments, but also for those other applied linguists who
use tests or other elicitation techniques to help them gather language data for
research.
It is usually the case with new disciplines that they go through periods of
adjustment as the limits of the discipline are realigned. Applied linguistics is going
through such a stage at present as its scope widens and its subfields start to impinge
on those of other disciplines.
The term ‘applied linguistics’ appears first to have been used in the late
1940s when the discipline embraced the teaching and learning of second and
foreign languages (Johnson and Johnson 1998), but since then the discipline has
expanded to cover a wider range of sub-disciplines or ‘subfields’ as they are called
by Bachman and Cohen (1998). In 1980, Henry Widdowson said, “…applied
linguistics yields descriptions which are projections of actual language which
explore linguistic theory as illumination…” (p. 169), and in 1997, Chris Brumfit
described the field as “the theoretical and empirical investigation of real-world
problems in which language is a central issue” (Brumfit 1997).
A key point, which perhaps not all applied linguists appreciate, is that
language testing is by no means limited to assessing the linguistic proficiency of L2
students. Many areas of linguistic research use elicitation instruments to gather
data, and these instruments often take the form of tests or tasks (see, for example,
Robinson 1997, Skehan and Foster 1999). If the results of such data elicitation
techniques are to be credible, they need to be prepared with as much rigor as
proficiency tests, and they therefore have to be valid and reliable (Alderson and
Banerjee in press). Crudely, a valid elicitation technique is one that accurately
elicits what it is intended to elicit, and a reliable technique is one that produces
consistent results. (For discussions of validity, see Chapelle 1999, Messick 1989;
1996, Shepard 1993; and for comments on the relationship between validity and
reliability, see Moss 1994.) Although Messick (1989) subsumes reliability under
validity, since any valid measure must, by definition, be reliable, it is useful here to
distinguish between validity and reliability since the assessment of an instrument’s
reliability is often neglected in elicitation procedures. If research is to have
credibility, data gathering instruments must not only be carefully designed to ensure
that they will elicit the type of language required, they must also be pre-tested to
check that the measures do indeed elicit such language in practice and that any rating
or coding system is workable and capable of producing consistent results (see North
and Schneider 1998). Since it is generally expected that subjects in an investigation
produce similar kinds of language regardless of when the task is done, and that this
language can be analyzed in a similar manner regardless of when and by whom it is
assessed or coded, the reliability of the elicitation techniques must be given careful
consideration.
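To make the notion of consistent coding concrete, the sketch below computes
Cohen’s kappa, a standard chance-corrected agreement statistic, for two coders who
have categorized the same set of elicited responses. The coding categories and the
data are hypothetical, and this is only one of several reliability checks a study
might run.

```python
# A minimal sketch of one reliability check for an elicitation study:
# Cohen's kappa for the agreement between two coders who categorised the
# same elicited responses. Categories and codings below are hypothetical.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if the coders assigned labels independently,
    # each at their own marginal rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings of 10 elicited utterances ('T' = target-like,
# 'N' = non-target-like, 'A' = avoidance strategy).
coder_1 = ['T', 'T', 'N', 'A', 'T', 'N', 'N', 'T', 'A', 'T']
coder_2 = ['T', 'N', 'N', 'A', 'T', 'N', 'T', 'T', 'A', 'T']
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")  # ~0.68 for these data
```

A kappa well below conventional benchmarks (roughly 0.6 and above is usually
read as substantial agreement) would suggest that the coding scheme needs clearer
category definitions before the data can support credible claims.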
There has been much discussion about how language testing fits into
applied linguistics and how it relates to language teaching (see, for example,
Bachman and Palmer 1996). In general, it seems clear that “…language testing
benefits from insights from applied linguistics as a discipline…” (Alderson and
Clapham 1992:164) but that it is sometimes necessary for testing to lead the way.
It seems, indeed, that each affects the other: Methods of assessment may affect
teaching in the classroom (Cheng 1997, Wall 1996; 1997), while new theories of
language learning and teaching lead to changes in testing practices (Spolsky 1995).
With the advent of communicative teaching in the late 1970s, there was a
need for testers to devise new theories of language testing. Canale and Swain
(1980), whose model applied to both teaching and testing second and foreign
languages, included grammatical, sociolinguistic, and strategic competence in their
description of the domains of language use. In 1990, Bachman added
psychophysiological mechanisms and proposed four components in his model: grammatical,
textual, illocutionary, and sociolinguistic competence. Bachman and Palmer (1996)
elaborated on this model further to include both affective and metacognitive factors.
Bachman and Palmer’s model of communicative language ability is used as the
theoretical basis for tests such as the International English Language Testing
System (IELTS) test, and it provides the basis for many current research projects
(e.g., Hasselgren 1998). (See McNamara 1996 for a discussion of language testing
models.)
The term ‘assessment’ is used both as a general umbrella term to cover all
methods of testing and assessment, and as a term to distinguish ‘alternative
assessment’ from ‘testing.’ Some applied linguists use the term ‘testing’ to apply to
the construction and administration of formal or standardized tests such as the Test
of English as a Foreign Language (TOEFL) and ‘assessment’ to refer to more
informal methods such as those listed below under the heading ‘alternative
assessment.’ For example, Valette (1994) says that ‘tests’ are large-scale
proficiency tests and that ‘assessments’ are school-based tests. Intriguingly, some
testers are now using the term ‘assessment’ where they might in the past have used
the term ‘test’ (see, for example, Kunnan 1998). There seems, indeed, to have
been a shift in many language testers’ perceptions so that they, perhaps
subconsciously, may be starting to think of testing solely in relation to standardized, large-
scale tests. They therefore use the term ‘assessment’ as the wider, more acceptable
term.
Unfortunately, although ‘assessors’ and ‘testers’ have the same aims, there
is less dialogue than there should be between them, possibly because many of them
tend to think of ‘testing’ and ‘assessment’ as being categorically different (see Hill
and Parry 1994) instead of being on a continuum with at one end those ‘testers’
who deliver carefully validated multiple choice tests, and at the other end,
‘assessors’ who prepare real-life tasks for their students and candidates but who do
not concern themselves with how well these tasks actually work. ‘Assessors’
appear to distrust extreme ‘testers’ because they feel that these ‘testers’ are so
wedded to the numerical analysis of data that they are not sufficiently concerned
with the content and administration of their tests. Such ‘assessors’ tend to be
concerned that these ‘tests’ are not ‘communicative,’ and that they may lead to
negative washback (Brown and Hudson 1998). In contrast, many ‘testers’ are
concerned with the fact that, although the ‘assessors’’ methods of assessment may
be novel and interesting, the tasks are not pre-tested to see whether they work as
intended or whether the assessments can be delivered and marked in a consistent
manner. In short, they distrust ‘assessors’ because ‘assessors’ do not appreciate the
importance of investigating the validity and reliability of their instruments. Brown
and Hudson (1998) quote Huerta-Macias (1995) who says that it is unnecessary to
evaluate the validity and reliability of methods of alternative assessment because
they are already built into the assessment process. Brown and Hudson make the
point that it is not enough to build validity and reliability into the measures; the
measures must also be trialed to see whether or not they are valid and reliable in
practice. (See also Johnstone [in press] and Rea-Dickens [in press].)
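Brown and Hudson’s point about trialing can be illustrated with a classical item
analysis over pilot responses: the sketch below computes each item’s facility
(proportion correct) and its point-biserial discrimination against the total score.
The data and the flagging thresholds are hypothetical illustrations, not a
prescribed procedure.

```python
# A minimal sketch of classical item analysis on trial data: facility
# (proportion correct) and point-biserial discrimination (correlation
# between an item and the total score). Data and thresholds are hypothetical.

def point_biserial(item_scores, totals):
    """Correlation between a dichotomous item and candidates' total scores."""
    n = len(totals)
    p = sum(item_scores) / n            # facility: proportion correct
    if p in (0.0, 1.0):
        return 0.0                      # no variance, so no discrimination
    mean = sum(totals) / n
    sd = (sum((t - mean) ** 2 for t in totals) / n) ** 0.5
    mean_right = sum(t for s, t in zip(item_scores, totals) if s) / sum(item_scores)
    mean_wrong = sum(t for s, t in zip(item_scores, totals) if not s) / (n - sum(item_scores))
    return (mean_right - mean_wrong) / sd * (p * (1 - p)) ** 0.5

# Hypothetical pilot data: rows are candidates, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
totals = [sum(row) for row in responses]
for i in range(len(responses[0])):
    item = [row[i] for row in responses]
    facility = sum(item) / len(item)
    disc = point_biserial(item, totals)
    flag = "  <- review" if facility < 0.2 or facility > 0.9 or disc < 0.2 else ""
    print(f"item {i + 1}: facility {facility:.2f}, discrimination {disc:.2f}{flag}")
```

Even an analysis this crude can only be run if the measure has been trialed: it is
exactly the evidence that cannot be built into an assessment in advance.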
Areas that are attracting attention in the testing literature at present have, in
a number of cases, been the subject of recent ARAL reviews. Performance testing
was covered by Shohamy (1995), alternative assessment by Hamayan (1995),
advances in the use of the computer for testing by Chalhoub-Deville and Deville
(1999) and the interface between tasks and assessment by Skehan (1998) and
McNamara (1998). Other areas of current interest include test washback (Alderson
and Wall 1993; 1996, Wall 1997) and the ethics of language testing (Davies 1997,
Hamp-Lyons 1997, Kunnan in press, Norton 1997). (For more about these and
other areas of current concern, see Bachman 2000, Brindley in press; see also
Clapham and Corson 1997.) In the remainder of this article, I will update the
ARAL articles on performance testing, alternative assessment, and the use of
computers for testing. I will, at the same time, relate these three areas to real or
imaginary differences between ‘testers’ and ‘assessors.’
1. Performance testing
2. Alternative assessment
Governments (sometimes in a hurry) wish to use the results of alternative
assessments across institutions for their own aims (Brindley 1998). They do not
appreciate the need for careful
trialing, and therefore they do not always allow researchers the time to try out the
assessment instruments. Some types of assessment used in these circumstances
may have high face validity—they may look excellent to the uninformed—but they
may be marred by inappropriate marking schemes and rating inconsistencies.
Trialing such measures is essential if the final tools are to be valid and reliable, but
unfortunately such pre-testing and subsequent editing of materials takes time, and
time is often in short supply. Inevitably, the ensuing tasks and marking schemes
are not valid and reliable (Brindley 1998), and such schemes, therefore, launched
with much fanfare, may produce invalid results which are unfair to students and
teachers alike. If all assessors appreciated the importance of trialing in the
construction of all assessment procedures, they might be able to work together to
tell governments and other funding bodies that test development requires more time
if assessors are to devise satisfactory assessment instruments.
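One concrete illustration of the rating inconsistencies that trialing can expose
is a comparison of rater severity when several raters band the same pilot scripts.
The raters, scripts, and the 0–9 band scale below are hypothetical.

```python
# A minimal sketch of one trialing check on a marking scheme: comparing
# rater severity when three raters band the same six pilot scripts.
# Raters, scripts, and bands (0-9 scale) are hypothetical.
bands = {
    "rater_A": [6, 5, 7, 4, 6, 5],
    "rater_B": [5, 4, 6, 3, 5, 4],   # consistently one band harsher than A
    "rater_C": [6, 5, 7, 5, 6, 5],
}
overall = [b for scores in bands.values() for b in scores]
grand_mean = sum(overall) / len(overall)
for rater, scores in bands.items():
    mean = sum(scores) / len(scores)
    # A large gap from the grand mean flags a severity problem to resolve
    # by retraining or statistical adjustment before live use.
    print(f"{rater}: mean band {mean:.2f} (severity {mean - grand_mean:+.2f})")
```

A fuller analysis would model rater effects statistically, but even this crude
comparison shows the kind of problem that only surfaces when measures are tried
out before operational use.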
3. Computer assessment

Advances in areas such as automated scoring and speech recognition are widening
the scope of computer-administered tests (see Burstein et al. 1996, Drasgow and
Olson-Buchanan 1998, Ordinate Corporation 1998).
These advances will not only increase the efficiency of standardized tests, but will
increase the scope of other elicitation techniques.
One project that has the potential to produce interesting and yet easy-to-
deliver-and-mark tests is DIALANG, which aims to produce diagnostic tests in 14
different European foreign languages (DIALANG 1997). The tests will be
delivered on the world wide web; they will be computer adaptive (see Chalhoub-
Deville and Deville 1999); and students, after taking their chosen test, will receive
instant diagnostic information about the strengths and weaknesses of their
performance. At present, compositions written for DIALANG will be marked by
hand, but this may not always be the case as there are now research projects
looking into the computer marking of tests of language production (see Burstein and
Chodorow 1999). Although it is hard to imagine computers ever replacing human
markers, and we might not wish them to, it is possible that they will take some of
the drudgery away from subjective marking and will thus make the marking task
easier and more interesting for the raters.
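To illustrate what ‘computer adaptive’ means in practice, here is a minimal
sketch of an adaptive item-selection loop under the one-parameter (Rasch) model.
The item bank, the fixed-step ability update, and the stopping rule are
simplified assumptions for illustration, not DIALANG’s actual algorithm.

```python
# A minimal sketch of a computer-adaptive testing loop under the Rasch
# model. The item bank, update rule, and stopping rule are hypothetical
# simplifications, not any operational test's algorithm.
import math
import random

def p_correct(ability, difficulty):
    """Rasch model: probability that a candidate answers an item correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def adaptive_test(bank, answer, n_items=10):
    """bank: item id -> difficulty; answer: callable returning True if correct."""
    theta = 0.0                # current ability estimate, in logits
    administered = []
    for step in range(n_items):
        # Administer the unused item whose difficulty is closest to the
        # current estimate: under the Rasch model it is the most informative.
        item = min((i for i in bank if i not in administered),
                   key=lambda i: abs(bank[i] - theta))
        administered.append(item)
        # Crude fixed-step update that shrinks as evidence accumulates;
        # an operational test would use maximum-likelihood estimation.
        theta += (1.0 if answer(item) else -1.0) / (step + 1)
    return theta, administered

# A simulated candidate of true ability 0.8 logits answering probabilistically.
bank = {f"item{i:02d}": d for i, d in
        enumerate([-2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])}
candidate = lambda item: random.random() < p_correct(0.8, bank[item])
theta, used = adaptive_test(bank, candidate)
print(f"estimated ability: {theta:+.2f} logits after {len(used)} items")
```

Administering the item nearest the current ability estimate is what allows an
adaptive test to reach a stable estimate with far fewer items than a fixed-form
test, which is why adaptivity matters for web delivery and instant feedback.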
There will thus, in the near future, be great advances in computer testing.
However, it remains to be seen whether the use of computers will limit rather
than expand the kinds of tests that are used.
CONCLUSION
One of my aims in this article has been to discuss the need to bring
‘testers’ and ‘assessors’ closer together, and I expect that over the next decade any
differences between them will become less and less marked. ‘Testers,’ I hope, will
become more open to ideas of different kinds of assessment, and ‘assessors’ will be
more willing to accept that even the most carefully designed task or set of marking
criteria needs to be trialed. If ‘testers’ and ‘assessors’ can each see that they are
aiming for the same goal, they will perhaps start a dialogue which might transform
both tests and assessments. Similarly, on a larger scale, the basic issues of testing
and assessment are important in all areas of applied linguistics that call on the use
of elicitation techniques to collect data. All such elicitation techniques should be
reliable and valid. Understanding this connection, however, will only be possible if
all prospective applied linguists (and all prospective language teachers) are introduced
during their training to the vital tenets of testing so that they can be more critical of
the elicitation techniques they use.
ANNOTATED BIBLIOGRAPHY

Alderson, J. C. and D. Wall. 1993. Does washback exist? Applied Linguistics. 14.115–129.

This article opened language testers’ eyes to the fact that, although the
expression ‘washback’ was freely used in language teaching and testing
communities, and although there was plenty of anecdotal evidence as to its
existence, there had been few serious attempts to define it or to investigate
its existence and possible effects. The authors reported on the few existing
studies of language testing washback and listed fifteen washback
hypotheses which, they said, should be investigated. This article led to a flurry
of investigations into test washback, and the results of the first of these
studies are now being published.
Bachman, L. F. 2000. Modern language testing at the turn of the century: Assuring
that what we count counts. Language Testing. 17.1.
Clapham, C. and D. Corson (eds.) 1997. Language testing and assessment, Vol. 7.
The encyclopedia of language and education. Dordrecht, Holland: Kluwer
Academic.
Wall, D. 1996. Introducing new tests into traditional systems: Insights from general
education and from innovation theory. Language Testing. 13.334–354.