
Language Testing

Liu Jianda

By the end of this module, participants should be able to do the following:

Understand the general considerations that must be addressed in the development of new tests or the selection of existing language tests;
Make their own judgements and decisions about either selecting an
existing language test or developing a new language test;
Familiarise themselves with the fundamental issues, approaches, and
methods used in measurement and evaluation;
Design, develop, evaluate and use language tests in ways that are
appropriate for a given purpose, context, and group of test takers;
Understand the future development of language testing and the
application of IT to computerized language testing.

In order to achieve these objectives, the module gives participants the opportunity
to develop the following skills:

writing test items

collecting test data and conducting item analysis

evaluating language tests with regard to validity and reliability

This is done by considering a wide range of issues and topics related to language
testing. These include the following:

General concepts in language testing and evaluation

Evaluation of a language test: reliability and validity

Communicative approach to language testing

Design of a language test

Item writing and item analysis

Interpreting test results

Item response theory and its applications

Computerized language testing and its future development

Class Schedule

Basic concepts in language testing

Test validation: reliability and validity (1)
Test validation: reliability and validity (2)
Test construction (1)
Test construction (2)
Test construction (3)
Test construction (4)
Test construction (5)
Test construction (6)
Rasch analysis (1)
Rasch analysis (2)
Language testing and modern technology

One 5000-6000 word paper on language
Collaborative work:
You'll be divided into groups of four to complete the development of a test paper. Each of you will be responsible for one part of the test paper, but each part should contribute equally to the whole test paper. Therefore, besides developing your own part, you need to come together to discuss the whole test paper in terms of reliability and validity.

Course books

Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.
Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Educational Press.
McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.


Session 1
Basic concepts in language testing

A short history of language testing
Spolsky (1978) classified the development of language testing into three periods:
the prescientific period
the psychometric/structuralist period
the integrative/sociolinguistic period

The prescientific period

grammar-translation approaches to
language teaching
translation and free composition tests
difficult to score objectively
no statistical techniques applied to
validate the tests
simple, but unfair to students

The psychometric-structuralist period

audio-lingual and related teaching methods

objectivity, reliability, and validity of tests
measure discrete structure points
multiple-choice format (standardized tests)
follow scientific principles, have trained
linguists and language testers

The integrative-sociolinguistic period

communicative competence

Chomsky's (1965) distinction between competence and performance

Competence: an ideal speaker-listener's knowledge of the rules of the language

Performance: the actual use of language in concrete situations

Hymes's (1972) proposal of communicative competence

the ability of native speakers to use their language in ways that are not only
linguistically accurate but also socially appropriate.

Canale & Swain's (1980) framework of communicative competence:

Grammatical competence: mastery of the language code such as morphology, lexis, syntax, semantics, phonology;
Sociolinguistic competence: mastery of appropriate language use in different sociolinguistic contexts;
Discourse competence: mastery of how to achieve coherence and cohesion in spoken and written communication;
Strategic competence: mastery of communication strategies used to compensate for breakdowns in communication and to enhance the effectiveness of communication.

The integrative-sociolinguistic period

Bachman's (1990) framework of communicative language ability:

Language competence: grammatical, sociolinguistic, and discourse competence (Canale & Swain):

organizational competence

grammatical competence

textual competence
pragmatic competence

illocutionary competence

sociolinguistic competence

Strategic competence: performs assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal
Psychophysiological mechanisms: characterize the channel (auditory, visual) and mode (receptive, productive)

The integrative-sociolinguistic period

Oller's (1979) pragmatic proficiency test:

Temporally and sequentially consistent with the real-world occurrences of language forms
Linking to a meaningful extralinguistic context familiar to the testees

Clark's (1978) direct assessment:

making the testing context approximate the real world to the greatest extent possible
Cloze test and dictation (Yang, 2002b)
Communicative testing or to test

The integrative-sociolinguistic period

Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998)

Not discrete-point in nature

Integrating two or more of the language skills of listening, speaking, reading, writing, and other aspects like cohesion and coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and culture
Task-based: essays, interviews, extensive reading

Performance Tests
Three characteristics:

The task should:

be based on needs analysis (What criteria should be used? What content and context? How should experts be used?)
be as authentic as possible with the goal of measuring real-world activities
sometimes have collaborative elements that stimulate communicative interactions
be contextualized and complex
integrate skills with content
be appropriate in terms of number, timing, and frequency of assessment
be generally non-intrusive, that is, be aligned with the daily actions in the language classroom

Performance Tests

Raters should be appropriate in terms of:

number of raters
overall expertise
familiarity and training in use of the scale

The rating scale should be based on appropriate:

categories of language learning and development

appropriate breadth of information regarding learner performance abilities
standards that are both authentic and clear to students

To enhance the reliability and validity of decisions as well as accountability, performance assessments should be combined with other methods for gathering information (e.g. self-assessments, portfolios, conferences, classroom behaviors, and so forth)

Development graph (Li, 1997: 5)

2. Theoretical issues

Language testing is concerned with both content and methodology.

Development since 1990

Communicative language testing (Weir, 1990)
Reliability and validity
Social functions of language testing

Ethical language testing

Washback (impact) (Qi, 2002; Wall, 1997)

impact: effects of tests on individuals, policies or practices within the classroom, the school, the educational system or society as a whole
washback: effects of tests on language teaching and learning
Ways of investigating washback:

analyses of test results

teachers' and students' accounts of what takes place in the classroom (questionnaires and classroom observation)

Ethics of test use

use with care (Spolsky, 1981: 20)

codes of practice

Professionalization of the field

training of professionals
development of standards of practice and mechanism for their implementation and

Critical language testing

put language testing in the society

Factors affecting performance of language ability



Test method

Development since 1990

Testing interlanguage pragmatic knowledge

currently on research level

focus on method validation
web-based test by Roever

Computerized language testing

Item banking
Computer-assisted language testing
Computerized adaptive language testing

Test items adapted for individuals

Test ends when the examinee's ability is determined
Test time is much shorter

Web-based testing
Phonepass testing
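
The adaptive procedure listed under computerized adaptive language testing can be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm of any particular testing system: it assumes a Rasch-type item bank, a standard normal prior on ability, and a stopping rule based on the posterior standard deviation. The names `run_adaptive_test`, `eap_estimate`, `se_target` and `max_items`, and the simulated examinee, are all hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct answer (ability and difficulty in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4 to 4 logits

def eap_estimate(responses):
    """Posterior mean and SD of ability given [(difficulty, correct), ...]."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # standard normal prior (an assumption)
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_adaptive_test(bank, answer, se_target=0.5, max_items=20):
    """Administer items one by one until the ability estimate is precise enough."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, float("inf")
    while remaining and len(responses) < max_items and se > se_target:
        # Items adapted to the individual: pick the unused item whose
        # difficulty is closest to the current ability estimate.
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)

# Hypothetical bank of 25 item difficulties from -3 to 3 logits, and a
# simulated examinee who answers correctly whenever difficulty is below 1.0.
bank = [x / 4.0 for x in range(-12, 13)]
theta, se, n_items = run_adaptive_test(bank, lambda b: b < 1.0)
```

Because each item is chosen near the current estimate, the test can stop well before the whole bank is used, which is why testing time is shorter than with a fixed form.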

Development since 1990

Language testing and second language acquisition (Bachman & Cohen, 1998)

Help to define construct of language

Use findings of language testing to prove
hypotheses in SLA
Provide SLA researchers with testing and
standards of testing

Development of research methodology

Factor analysis
The main applications of factor analytic
techniques are:

(1) to reduce the number of variables and

(2) to detect structure in the relationships
between variables, that is to classify variables.

Therefore, factor analysis is applied as a data reduction or structure detection method.
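
To make the idea of data reduction concrete, the sketch below extracts a single dominant component from the correlations among three subtests. It is a simplification: it uses principal-component extraction by power iteration as a stand-in for a full factor analysis (no rotation, no communality estimation), and the subtest scores are invented for illustration.

```python
import math

def correlation_matrix(columns):
    """Pearson correlations between equally long score columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    devs = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    return [[sum(a * b for a, b in zip(di, dj)) / (ni * nj)
             for dj, nj in zip(devs, norms)]
            for di, ni in zip(devs, norms)]

def first_factor_loadings(corr, iterations=200):
    """Loadings on the dominant component, found by power iteration."""
    k = len(corr)
    vec = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        eigenvalue = math.sqrt(sum(x * x for x in nxt))
        vec = [x / eigenvalue for x in nxt]
    return [v * math.sqrt(eigenvalue) for v in vec], eigenvalue

# Hypothetical scores of six test takers on three subtests.
grammar = [12, 15, 9, 18, 14, 7]
vocabulary = [11, 16, 10, 17, 13, 8]
reading = [13, 14, 8, 19, 15, 6]

corr = correlation_matrix([grammar, vocabulary, reading])
loadings, eigenvalue = first_factor_loadings(corr)
proportion_explained = eigenvalue / len(corr)
```

When one component accounts for most of the variance, as with these invented scores, the three variables could be summarized by a single underlying factor; that is the sense in which the number of variables is reduced.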

Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)

Estimating the relative effects of different factors (facets) on test scores
The most generalizable indicator of an individual's language ability is the universe score. In the real world, however, we can only obtain scores from a limited sample of measures, so we need to estimate the dependability of a given observed score as an estimate of the universe score.

Two stages are involved in applying G-theory to test development

Stage 1 (G-study): the purpose is to estimate the effects of the various facets in the measurement procedure (usually conducted in pretesting).

e.g. persons (differences in individuals' speaking ability), raters (differences in severity among raters), tasks (differences in difficulty of tasks);

two-way interactions:

task x rater: different raters are rating the different tasks

person x task: some tasks are differentially difficult for different groups of test takers (source of bias)
person x rater: some raters score the performance of different groups of test takers differently (indication of rater bias)

Two stages are involved in applying G-theory to test development

Stage 2 (D-study): the purpose is to design an optimal measure for the interpretations or decisions that are to be made on the basis of the test scores (estimation of
The generalizability coefficient (G coefficient) provides an estimate of the proportion of an individual's observed score that can be attributed to his or her universe score, taking into consideration the effects of the different conditions of measurement specified in the universe of generalization. It is appropriate for norm-referenced tests.
For criterion-referenced tests, use the phi coefficient.
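
As an illustration of the two coefficients, the sketch below works through the simplest case: a fully crossed persons x raters design with one rating per cell. The variance components are estimated from the usual ANOVA mean squares; the ratings are invented, and the function names are hypothetical. Designs with a task facet need more components than are shown here.

```python
def variance_components(scores):
    """scores[p][r]: rating of person p by rater r (fully crossed design)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = (ss_total - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_res) / n_r, 0.0)  # persons (universe-score variance)
    var_r = max((ms_r - ms_res) / n_p, 0.0)  # raters (severity differences)
    return var_p, var_r, ms_res              # ms_res: person x rater plus error

def g_and_phi(var_p, var_r, var_res, n_raters):
    """G uses relative error only; phi also counts rater severity as error."""
    g = var_p / (var_p + var_res / n_raters)
    phi = var_p / (var_p + (var_r + var_res) / n_raters)
    return g, phi

# Hypothetical speaking ratings: four test takers, three raters.
ratings = [[4, 5, 4],
           [2, 3, 3],
           [5, 5, 4],
           [1, 2, 2]]

var_p, var_r, var_res = variance_components(ratings)
g, phi = g_and_phi(var_p, var_r, var_res, n_raters=3)
```

Phi can never exceed G, because differences in rater severity matter for absolute (criterion-referenced) decisions but cancel out when test takers are only ranked against one another.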

Item response theory (Rasch model)

It enables us to estimate the statistical properties of items and the abilities of test takers so that these are not dependent upon a particular group of test takers or a particular form of a test. It is widely used in large-scale standardized tests.
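
For reference, the Rasch model expresses the probability of a correct response as a function of the difference between a test taker's ability and an item's difficulty, both on the same logit scale. The short sketch below shows the formula only, not an estimation procedure; the item difficulties are invented.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct) under the Rasch model; both values are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item difficulties in logits (not taken from any real test).
item_difficulties = {"item_1": -1.0, "item_2": 0.0, "item_3": 1.5}

def expected_score(ability: float) -> float:
    """Expected number of correct answers on the items above."""
    return sum(rasch_probability(ability, b) for b in item_difficulties.values())
```

A test taker whose ability equals an item's difficulty has a 50% chance of answering it correctly, whichever group the item was calibrated on.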

Structural equation modeling (Kunnan, 1998)

A combination of multiple regression, path analysis and factor analysis
Attempts to explain a correlation or covariance data matrix derived from a set of observed variables; latent variables are responsible for the covariance among the measured variables.

Basic procedures in SEM (example from Purpura, 1998)

Examine the relationships between strategy use and second language test performance.
Design two questionnaires for cognitive strategies
and metacognitive strategies (40 items)
Ask respondents to answer the questionnaires
Respondents take a foreign language test
Cluster the 40 items to measure several variables
Compute the reliability of the variables
Conduct factor analysis to identify factors
Conduct SEM analysis (AMOS, EQS, LISREL)
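
The step "compute the reliability of the variables" does not name a coefficient. One common choice for questionnaire scales is Cronbach's alpha, and the sketch below assumes that choice; it is not necessarily what Purpura used. The responses are invented Likert-scale data.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores[i][p]: response of person p to item i of one scale."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Hypothetical responses of five people to three items meant to
# measure the same strategy variable (1-5 Likert scale).
scale = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 1, 3],
]
alpha = cronbach_alpha(scale)
```

A high alpha suggests that the clustered items behave consistently enough to be treated as one observed variable in the later factor and SEM analyses.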

Qualitative method

Verbal report (think-aloud,

Questionnaires and interviews
Discourse analysis

3. Classification of language tests

According to families

Norm-referenced tests
Criterion-referenced tests

Norm-referenced tests

Measure global language abilities (e.g. listening, reading, speaking, writing)
Score on a test is interpreted relative to
the scores of all other students who
took the test
Normal distribution

Normal Distribution

Norm-referenced tests

Students know the format of the test

but do not know what specific content
or skill will be tested
A few relatively long subtests with a
variety of question contents

Criterion-referenced tests

Measure well-defined and fairly specific objectives

Interpretation of scores is considered absolute without referring to other students' scores
Distribution of scores need not be normal
Students know in advance what types of questions,
tasks, and content to expect for the test
A series of short, well-defined subtests with similar
question contents
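
The difference between the two families shows up in how the same raw scores are interpreted. The sketch below is illustrative only: the scores are invented, and the 80% cut score is a hypothetical standard, not one given in the course materials.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Norm-referenced view: each score relative to the rest of the group."""
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

def mastery(scores, max_score, cut=0.8):
    """Criterion-referenced view: each score against a fixed standard."""
    return [s / max_score >= cut for s in scores]

# Hypothetical raw scores of five students on a 50-point test.
raw = [42, 35, 48, 29, 46]

relative = z_scores(raw)    # changes if the group changes
absolute = mastery(raw, 50)  # stays the same whoever else took the test
```

A student's z-score depends on who else sat the test, whereas the mastery decision depends only on the student's own score and the standard.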

According to decision purposes

Proficiency tests
Placement tests
Achievement tests
Diagnostic tests

Proficiency tests

Test students' general levels of language proficiency
The test must provide scores that form a wide distribution so that interpretations of the differences among students will be as fair as possible
Can dramatically affect students' lives, so slipshod decision making in this area would be particularly unprofessional

Placement tests

Group students of similar ability levels

(homogeneous ability levels)
Help decide what each student's appropriate level will be within a specific program
Right tests for right purposes

Achievement tests

About the amount of learning that students have

The decision may involve who will be advanced to the next level of study or which students should graduate
Must be designed with a specific reference to a
particular course
Criterion-referenced, conducted at the end of the
Used to make decisions about students' levels of learning; can also be used to effect curriculum changes and to test those changes continually against the program realities

Diagnostic tests

Aimed at fostering achievement by promoting strengths and eliminating the weaknesses of individual students
Require more detailed information about the very specific areas in which students have strengths and weaknesses
Criterion-referenced, conducted at the beginning or in the middle of a language course
Can be diagnostic at the beginning or in the middle but achievement test at the end
Perhaps the most effective use of a diagnostic test is to report the performance level on each objective (in a percentage) to each student so that he or she can decide how and where to invest time and energy most profitably
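
Reporting performance on each objective as a percentage can be sketched as below. The item identifiers, objective names and responses are invented for illustration, and `objective_profile` is a hypothetical helper name.

```python
def objective_profile(responses, objective_of):
    """Percentage correct per objective for one student.

    responses: item id -> True/False; objective_of: item id -> objective name.
    """
    correct, total = {}, {}
    for item, is_correct in responses.items():
        obj = objective_of[item]
        total[obj] = total.get(obj, 0) + 1
        correct[obj] = correct.get(obj, 0) + int(is_correct)
    return {obj: 100.0 * correct[obj] / total[obj] for obj in total}

# Hypothetical mapping of six items to two course objectives.
objective_of = {
    "q1": "past tense", "q2": "past tense",
    "q3": "articles", "q4": "articles", "q5": "articles", "q6": "articles",
}
responses = {"q1": True, "q2": True, "q3": True,
             "q4": False, "q5": False, "q6": False}

profile = objective_profile(responses, objective_of)
```

With these invented responses the student has mastered the past-tense items but only a quarter of the article items, which points to where study time is best invested.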

Formative assessment vs. summative assessment


Formative: a judgment of an ongoing program used to provide information for program review, identification of the effectiveness of the instructional process, and the assessment of the teaching process
Summative: a terminal evaluation employed in the general assessment of the degree to which the larger outcomes have been obtained over a substantial part of or all of a course. It is used in determining whether or not the learner has achieved the ultimate objectives for instruction which were set up in advance of the instruction.

Public examinations vs. classroom tests

Purpose: proficiency vs. achievement (placement, diagnostic)
Format: standardized vs. open (objective vs. subjective)
Scale: large-scale vs. small-scale (self-assessment)
Scores: normality, backwash