Establishing test form and individual task comparability: a case study of a semi-direct speaking test

Cyril J. Weir, University of Luton, and Jessica R.W. Wu, Language Training and Testing Center, Taiwan

Examination boards are often criticized for their failure to provide evidence
of comparability across forms, and few such studies are publicly available.
This study aims to investigate the extent to which three forms of the General
English Proficiency Test Intermediate Speaking Test (GEPTS-I) are parallel
in terms of two types of validity evidence: parallel-forms reliability and content validity. The three trial test forms, each containing three different task
types (read-aloud, answering questions and picture description), were administered to 120 intermediate-level EFL learners in Taiwan. The performance
data from the different test forms were analysed using classical procedures
and Multi-Faceted Rasch Measurement (MFRM). Various checklists were
also employed to compare the tasks in different forms qualitatively in terms
of content. The results showed that all three test forms were statistically parallel overall and Forms 2 and 3 could also be considered parallel at the individual task level. Moreover, sources of variation to account for the variable
difficulty of tasks in Form 1 were identified by the checklists. Results of the
study provide insights for further improvement in parallel-form reliability of
the GEPTS-I at the task level and offer a set of methodological procedures
for other exam boards to consider.

I Introduction
In 1999, to promote the concept of life-long learning and to further
encourage the study of English, the Ministry of Education in Taiwan
commissioned the Language Training and Testing Center (LTTC) to
develop the General English Proficiency Test (GEPT), with the aim
of offering Taiwanese learners of English a fair and reliable English
test at all levels of proficiency. The test is administered at five levels (Elementary, Intermediate, High-Intermediate, Advanced, and Superior), each level including listening, reading, writing, and speaking components.

Address for correspondence: Cyril Weir, Powdrill Chair in English Language Acquisition, Luton Business School, Putteridge Bury, Hitchin Road, Luton, LU2 8LE, UK; email: cyril.weir@luton.ac.uk
A major consideration in developing a speaking proficiency
component for use within the GEPT program was that it be amenable
to large-scale standardized administration at GEPT test centers
island-wide. For the GEPT Intermediate Speaking Test (GEPTS-I),
with normally over 20,000 candidates in each administration, it was
considered too costly and impractical to use face-to-face interviews,
involving direct interaction between the candidate and an interlocutor who would have had to be a trained native or near-native speaker
of English. A semi-direct tape-mediated test conducted in a language
laboratory environment was more feasible.
In tests such as the GEPTS-I, limited availability of language
laboratory facilities necessitates the use of different test forms in
multiple administrations to enhance test security. As a consequence,
demonstrating the comparability of these forms is essential to avoid
criticisms of potential test unfairness. The administration of multiple
forms of a test in independent sessions provides alternate-form coefficients, which can be seen as an estimate of the degree of overlap
between the multiple forms. Thus, in the quantitative aspect of this
study, candidates' scores achieved on one form were compared
statistically with scores achieved by them on an alternate form.
Parallel-form reliability may be influenced by errors of measurement that reside in testing conditions and other contextual factors.
The quantitative analysis of test score difficulty was thus complemented in this study by collection of data on rater perceptions of a
number of contextual parameters with regard to each individual task
type in the GEPTS-I tests.
Skehan (1996) attempted to identify factors that can affect the
difficulty of a given task and which can be manipulated so as to
change (increase or decrease) task difficulty. Skehan proposes that
difficulty is a function of code complexity (lexical and syntactic
difficulty), cognitive complexity (information processing and
familiarity), and communicative demand (time pressure).
A number of empirical findings have revealed that altering task
difficulty along a number of these dimensions can have an effect on
performance, as measured in the three areas of accuracy, fluency, and
complexity (Robinson, 1995; Foster and Skehan, 1996; 1999;
Skehan, 1996; 1998; Skehan & Foster, 1997; 1999; Mehnert, 1998;
Norris et al., 1998; Ortega, 1999; O'Sullivan et al., 2001;
Wigglesworth, 1997). Recent research such as that by Iwashita et al.
(2001) has raised some doubts over the findings on some of the
effects of these variables on performance and generated interest in possible reasons for such differences in findings.
However, the focus for our particular study is not the actual effects
on performance of intra-task variation in terms of these difficulty
parameters but whether these variables are in fact equivalent in the
three comparable tasks under review. Therefore, in evaluating
whether the three trial test forms of the GEPTS-I and the three tasks
within them are equally difficult, it was thought useful to determine
equivalence according to the parameters established earlier in this
intra-task variability research: code complexity, cognitive complexity,
and communicative demand.
II Comparability of forms and tasks
1 Establishing evidence of statistical equivalence
Exam boards have been the subject of criticism for not demonstrating the parallelness of forms (or tasks within these) used in and
across administrations. Spolsky (1995) makes this criticism of examinations, echoing Bachman et al. (1995), and it was repeated in
Chalhoub-Deville and Turner (2000). Failure to address this issue
must cast serious doubt on the comparability of tests/examinations
across forms and raise concern over their value for end users.
In addition to establishing parallelness of test forms we are also
concerned in this article with establishing parallelness at the task
level across test forms. The type of intra-task variation research
referred to above must be contingent on having two parallel tasks
and then manipulating a variable of interest in one of them; otherwise, all subsequent comparisons are flawed. In much of this
research (one of the few exceptions being Iwashita et al., 2001),
there appears to be little evidence that the parallelness of the tasks
employed had been established prior to any manipulation in respect
of a single variable, which must cast some doubt on findings. In
research by exam boards to explore the effects of any potential
changes to a test task (e.g. increasing planning time, providing structured prompts, or providing an addressee), it is obviously a sine
qua non that two parallel tasks first need to be established before the
effect of the change can be investigated.
The administration of parallel (alternate) forms of a test in
independent sessions provides us with alternate-form coefficients.
The two tests are normally given to an appropriate single group
of learners with a short period between the two administrations.
A short period would be not so short that the learners would be exhausted or bored by the process, and not so long that significant
learning might take place between the two administrations. This will
measure both the temporal stability of the test and the consistency of
response to the two samples of test tasks. Anastasi (1988: 119;
emphasis in original) notes:
If the two forms are administered in immediate succession, the resulting
correlation shows reliability across forms only, not across occasions. The
error variance in this case represents fluctuations in performance from one
set of items to another, but not fluctuations over time.

The results achieved by the learners on the first parallel form are
compared statistically with the results they achieve on the second
parallel form. The resulting correlation constitutes the parallel-form
reliability of the scores on each of the two forms, i.e. it is an estimate
of the extent to which each of the two forms was awarding the same
marks as the other. It indicates how much error variance had resulted
due to the content sampling of the two forms. By squaring the correlation, it is possible to provide an estimate of the degree of overlap
between the two.
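As a simple illustration of this computation (the score vectors below are invented and merely stand in for two sets of form totals from the same learners):

```python
# Illustrative only: invented totals for the same learners on two parallel forms.
from scipy import stats

form_1_totals = [12, 14, 9, 15, 11, 13, 10, 14, 12, 15]
form_2_totals = [11, 15, 10, 14, 12, 13, 9, 15, 13, 14]

r, p = stats.pearsonr(form_1_totals, form_2_totals)  # alternate-form coefficient
print(f"parallel-form coefficient r = {r:.3f}, estimated overlap r^2 = {r ** 2:.3f}")
```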
2 Catering for rater variability
For tests of speaking and writing the AERA/APA/NCME Standards
(1985/1999) make it clear that when the scoring of a test involves
judgements by examiners or raters, it is important to consider
reliability in terms of the accuracy and consistency of the ratings that
are made.
Traditional estimates of accuracy and consistency are calculated
using both inter- and intra-rater correlations. These may be obtained
operationally where double ratings are used or by experimental
methods. However, the interpretation of such correlations as estimates of reliability can be as problematic as with other reliability
coefficients. It is known, for example, that correlations of this kind
can be affected by the nature of the rating scale used and by the range
of ability of the candidates who are being assessed. When the ability
range is narrow, small differences between examiners can affect the
correlations that are obtained; such an occurrence may on first
inspection make the test seem unreliable, whereas in practice it may
not truly reflect the accuracy of the classification that is made (an
important consideration for criterion-referenced tests). In other
words, the accuracy and consistency of the classification may be
acceptable even though the inter-rater correlation is not high.
In the case of a taped speaking test, the examination conditions themselves provide no threat to the reliability of the assessment
made. However, interactions between other facets within the assessment procedure may impinge on the outcomes. In order to account
for these features in estimating parallel-form reliability of speaking
tests, it is sensible to use statistical models such as Multi-faceted
Rasch Measurement (MFRM) as a quality check, as well as more
traditional analyses.
MFRM takes into account all of the factors that might affect a
student's final score, for example, the ability of the student, the
severity of the rater and the difficulty of the task. As long as the raters
are consistent in themselves, it becomes possible to adjust scores for
the differences occasioned by the harshness/leniency of a large
cohort of raters (McNamara, 1996: 283–87). There has been a move
in recent years by examination boards to use MFRM as part of their
examination monitoring system (Myford and Wolfe, 2000a: 3; Weir
and Milanovic, 2003). It also enables a more accurate comparison to
be made at the form and the task level, where there are potential
effects arising from different markers rating different performances.
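The adjustment described here follows from the form of the model itself. A standard presentation of the many-facet Rasch model (the notation is the conventional one and is given here only for orientation, not quoted from this article) is

$$\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where \(P_{nijk}\) is the probability of candidate \(n\) being awarded category \(k\) rather than \(k-1\) by rater \(j\) on task \(i\), \(B_n\) is candidate ability, \(D_i\) task difficulty, \(C_j\) rater severity, and \(F_k\) the step difficulty of category \(k\). Because rater severity enters as a separate term, fair scores can be derived that compensate for harsh or lenient raters, provided each rater is internally consistent.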
3 Establishing evidence of comparable content
The tests under review must be as similar as possible in terms of the
operations tested and the performance conditions of code complexity, cognitive complexity and communicative demand, i.e. they
should meet the same test specifications in every respect. Thus, the
same language skills/subskills would be tested on the same breadth
of items and under the same performance conditions, such as length,
degree of topic familiarity and linguistic demands (Bachman, 1990;
Weir, 2005).
To establish equivalence between the same level examinations
across forms from year to year, or between examinations offered at a
particular level by different providers, comprehensive specification
of the content to be measured is as essential as demonstrating
statistical parallelness. Any serious test must be developed according
to a rigorous specification. Content validation is concerned with
establishing that what was intended by the test-developers in acting
on this blueprint is realized in practice.
In addition to investigating statistical equivalence (traditional
parallel-form reliability), we therefore examined content validity to
determine parallelness of task. Generating evidence in these two
areas is feasible for operational tests that are administered twice or
more a year in multiple forms. However, no claims can be made for the complete parallelness of the test tasks in this study because
no investigation was made of their cognitive validity or consequential validity (Weir, 2005).
To generate a more comprehensive body of validity evidence,
additional studies will need to be carried out. For cognitive
validity, the mental processing involved in the GEPT tasks can be
investigated through verbal protocol studies. As evidence of consequential validity, self-report methods might be used to study washback, or differential item functioning (DIF) procedures to detect bias
(for discussion of such studies, see Weir, 2005: Part 3). However,
by their very nature such investigations may not be feasible every
time a test is administered. The dearth of comprehensive validation
studies in the literature in relation to major EFL tests must raise
questions about their viability in an operational context or at least
attest to the amount of work involved (for one of the few rigorous
studies in the area that illustrates the complexity of such endeavors,
see O'Loughlin, 2001).
Caution is necessary, however, with regard to the dangers of an
overdependence on content validity alone (Messick, 1989; Bachman,
1990). Bachman rightly notes in relation to the various components
of validity that 'it is important to recognize that none of these by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (1990: 237). Thus, no claims are made for
the validity of GEPT tests in this study outside of the area of content
validity and parallel-forms reliability.
In speaking tests, content validation has largely depended on
transcriptions of the task performance. In a number of studies on
speaking tests (Ross and Berwick, 1992; Young and Milanovic,
1992; Lazaraton, 2000) transcribed performance has demonstrated
its usefulness in providing qualitative data for analysis. However,
despite its usefulness, such analysis of transcribed performances has its shortcomings, the chief of which is the complexity
involved in the work of transcription. As O'Sullivan et al. (2002: 39)
warned:
In practice, this means that a great deal of time and expertise is required in
order to gain the kind of data that will answer the basic question concerning
validity. Even where that is done, it is impractical to attempt to deal with
more than a small number of test events; therefore, the generalizability of the
results may be questioned.

Therefore, in an attempt to overcome these practical problems, in a project commissioned by Cambridge ESOL Examinations, O'Sullivan et al. (2002: 38) proposed the use of language function checklists as an effective and efficient procedure for establishing
the content validity of speaking tests. Their aim in using the
checklists
was to create an instrument, built on a framework that describes the language
of performance in a way that can be readily accessed by evaluators who are
familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of
validation evidence (2002: 39).

Based on their findings, they concluded positively that, 'although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible, and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests' (2002: 45).
Further, they added that the checklists would not only make it possible to compare predicted and actual test task performance, but would also provide a useful guide for item writers in taking a priori decisions about content coverage.
By taking up this qualitative approach to comparing the predicted
and actual performance on language tasks in the three GEPTS-I
forms, this study will address the a priori test-development aspect of
content validation, as well as the issue of a posteriori parallel-form
reliability.

III Methods and procedures


1 Participants
Participants in this study were 120 intermediate-level EFL learners
in Taiwan, who had all passed Stage 1 of the GEPT-Intermediate by
scoring 80% or more in the Listening and Reading subtests and were
thereby eligible to proceed to take the Speaking test (GEPTS-I). The
candidates were randomly divided into two groups, 60 to each group.
The two groups were equivalent in terms of general language proficiency based on their performance on the first stage of the GEPT-I.
Before taking the test proper, each group was invited to take two
additional GEPTS-I forms in succession. To help ensure the candidates' motivation to perform well in the tests, it was explained to
them that they were being given practice tests employing authentic
test items for the GEPTS-I which would help them prepare for the
forthcoming live test.
2 Instruments
The GEPTS-I comprises three separate task types, which are
requisites in the syllabus laid down by the Ministry of Education in
Taiwan.
Task A: Read-aloud: In the first task, the candidate sees two to
three printed passages of about 150 words and is given two minutes
to look over the texts and read them silently. The candidate then
reads the passages aloud with attention to pronunciation, intonation,
and flow of delivery.
Task B: Answering questions: The second task requires the candidate
to situate him/herself in the position of being in an imaginary interview
with the interlocutor who is heard on the test tape. The candidate is
required to respond to 10 questions and, as in Cambridge ESOL Main
Suite examinations, each question is heard twice (for a justification, see
Weir and Milanovic, 2003: 336).
Task C: Picture description: In the third task, the candidate studies
a colored picture accompanied by a series of guided questions which
are written in Chinese. The candidate is given 30 seconds to look over
the picture and questions, and then has 1.5 minutes to complete a
description of the picture, i.e. to talk about the relationship between the
persons: their behavior, thoughts, and attitudes.
The tests were conducted in a language laboratory setting, in
compliance with procedures followed in standard GEPTS-I
administrations. For logistical reasons each group of 60 had to sit
the test together. One group took Form 1 and the other group took
Form 3 first; then both groups had Form 2 as their second test, so
that all the participants would take Form 2 under the same
conditions.
3 Quantitative procedures
The first step in the analysis was to check whether an order effect
might influence the results, given the order of administration of the
three forms, as just described. For this purpose, an additional study
involving 38 learners was conducted, and no significant effect was
found. This confirmed us in our view that in the main study any order
effect was likely to be minimal, especially since all the participants
were high school students who had been given substantial practice
on the test during the preceding school year in preparation for the
real GEPT test they were to sit the following week.
Sixty candidates (Group 1) took Forms 1 and 2 (a total of 120
audio tapes) and the other 60 (Group 2) took Forms 2 and 3 (a total
of 120 audio tapes). Thus, a total of 240 tapes were collected, comprising 60 of Form 1, 120 of Form 2, and 60 of Form 3. A core batch
of tapes was randomly formed as Batch 1, containing 15 of Form 1,
30 of Form 2, and 15 of Form 3. Three additional batches, Batches
2, 3, and 4, were formed from the remaining tapes in the same way
as for Batch 1. Three experienced accredited raters, who had been
trained and recently standardized to the criteria, were invited to mark
the tapes. Each rater marked a total of 120 tapes, containing the core
batch of tapes (Batch 1) plus an additional batch.
The design above was chosen as it offered a practical solution to
one of the requirements of MFRM, namely that there should be a
degree of overlap for each of the facets. In this case it meant that
candidates had to be connected through an overlap in the tasks taken
and in the raters marking the tapes. While it is not difficult to ensure
that the candidates all perform an overlapping set of tasks (or indeed
the same tasks), it is less easy to ensure adequate overlap in the
rating of these performances. One solution is to ask all raters to rate
all performances, though clearly this is not practical in anything but
a very small-scale empirical study. Another solution, suggested by
Myford and Wolfe (2000b), is to create artificial links, using as few
as three well-chosen performances (at the middle to higher performance level) to establish the required connectivity. However, as this
solution carries with it the problem of high error scores, it was
decided to adopt the more conservative common-batch approach
outlined above.
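The allocation can be sketched in a few lines of code; the random draws below are purely illustrative, but they reproduce the batch composition given in the text (each batch containing 15 Form 1, 30 Form 2, and 15 Form 3 tapes, with the core batch marked by all three raters).

```python
# Sketch of the common-batch rating design; tape identifiers and the random
# draws are illustrative, not the study's actual allocation.
import random

random.seed(0)
pool = ([("Form 1", i) for i in range(60)]
        + [("Form 2", i) for i in range(120)]
        + [("Form 3", i) for i in range(60)])       # 240 tapes in total

def draw(form, n):
    """Randomly remove and return n tapes of the given form from the pool."""
    chosen = random.sample([t for t in pool if t[0] == form], n)
    for tape in chosen:
        pool.remove(tape)
    return chosen

def make_batch():
    return draw("Form 1", 15) + draw("Form 2", 30) + draw("Form 3", 15)

core = make_batch()                                  # Batch 1, marked by every rater
extra = [make_batch() for _ in range(3)]             # Batches 2-4, one per rater

# Each rater marks the core batch plus one further batch: 60 + 60 = 120 tapes,
# giving the overlap across raters that the MFRM analysis requires.
assignments = {f"Rater {i + 1}": core + extra[i] for i in range(3)}
for rater, tapes in assignments.items():
    print(rater, len(tapes), "tapes")
```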
The raters were requested to mark each of the three tasks separately, with a score for each task on a scale of 1 to 5. The GEPTS-I
operational rating scale was modified slightly to fit the read-aloud
task, focusing on the pronunciation, intonation and fluency elements
of the scale. The operational scale was used for the short answer and
picture description tasks (for the scale descriptors, see Appendix 1).
Inter-rater reliability was monitored during the study. Each tape in
the core batch (Batch 1) was marked by the three raters separately.
Raters were not aware of the marks awarded by the other raters.
Candidate score data was analysed by SPSS for correlation, factor
analysis and ANOVA, whereas MFRM was run on the computer
program FACETS (Linacre, 1999). By requiring candidates to take
one version of a task in common, it was possible to make comparisons between performances on three versions. Three different
versions of each of the tasks could then be plotted on the same scale
of difficulty. Further, by treating rater and task characteristics as
facets, MFRM can provide information on the effect of these facets
on candidate ability estimated in a performance assessment setting
(McNamara, 1996). In the data output, each candidate receives a
logit score, which represents his/her ability in terms of the probability of obtaining a particular score on any task. Taking account of
the differences in item difficulty and rater severity, the program
is able to compensate for differences across facets. In this study,
candidate ability, rater severity, and test form difficulty were
included in the facets to be measured.
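In practice this amounts to arranging the ratings in a 'long' layout in which every row carries the facets to be estimated. The sketch below shows such a layout with invented identifiers and scores; FACETS takes its own specification file rather than a data frame, so this is only an illustration of the information the model works from.

```python
# Illustrative long-format rating records; every identifier and score is invented.
import pandas as pd

ratings = pd.DataFrame(
    [
        # candidate, rater, form,     task, score (1-5 band scale)
        ("C001",     "R1",  "Form 1", "A",  4),
        ("C001",     "R1",  "Form 1", "B",  3),
        ("C001",     "R1",  "Form 1", "C",  4),
        ("C001",     "R2",  "Form 2", "A",  4),
        ("C061",     "R3",  "Form 3", "B",  5),
    ],
    columns=["candidate", "rater", "form", "task", "score"],
)

# Candidate ability, rater severity and form/task difficulty are estimated jointly
# from rows like these; the uncalibrated means below are only the starting point.
print(ratings.groupby(["form", "task"])["score"].mean())
```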
4 Qualitative procedures
For each task type, a checklist of task difficulty was specially created to elicit raters' judgements on the degree of parallelness of the tests in terms of code complexity (lexical and syntactical features), cognitive complexity (content familiarity and information processing), and communicative demand (time pressure) (for the checklists, see Appendix 2). For example, for code complexity in the read-aloud tasks, raters were asked to show whether they agreed or disagreed with the statement: 'The lexical items of the texts are equally familiar to candidates.' Raters' responses to the questions were given on a five-point Likert response scale (1 = strongly agree to 5 = strongly disagree). Given that there were very few responses indicating either 'strongly agree' or 'strongly disagree', raters' responses of 1 and 2 were treated as 'agree', 3 as 'agree but with reservation', and 4 and 5 as 'disagree'. Those raters who indicated 'disagree' on certain areas were requested to add written explanations to the checklist.
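The recode just described is straightforward; the sketch below applies it to an invented set of twelve responses for a single checklist statement and tallies the kind of agreement figure reported later (with 'agree with reservation' counted within the agreement total, as the discussion of Table 15 suggests).

```python
# Sketch of the Likert recode: 1-2 = agree, 3 = agree with reservation, 4-5 = disagree.
# The twelve responses for this single statement are invented.
from collections import Counter

def recode(response: int) -> str:
    if response <= 2:
        return "agree"
    if response == 3:
        return "agree with reservation"
    return "disagree"

responses = [1, 2, 2, 1, 3, 2, 1, 2, 4, 2, 3, 1]
tally = Counter(recode(r) for r in responses)
agreeing = tally["agree"] + tally["agree with reservation"]

print(dict(tally))
print(f"{agreeing}/{len(responses)} ({agreeing / len(responses):.0%}) agreement")
```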
In addition to these qualitative judgements, the Dale–Chall readability formula was employed to measure the linguistic difficulty of the texts used in Task A. The Dale–Chall formula makes use of a basic
US fourth grade list of 3000 English words as well as the average sentence length within each passage of 100 words to rate passages from
a reading level of 1 to a college graduate level of 16. It is reported that
word familiarity and sentence lengths together correlated highly with
reading comprehension at .92 (Chall and Dale, 1995).
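The 1995 procedure maps word familiarity and sentence length onto grade levels through published look-up tables and so is not reproduced here; the sketch below instead uses the widely cited coefficients of the original Dale–Chall formula simply to show how the two ingredients combine into a single score, with a small familiar-word set standing in for the 3000-word list.

```python
# Illustrative Dale-Chall-style calculation. The coefficients are those of the
# classic (1948) formula, shown only to illustrate how word familiarity and
# sentence length combine; FAMILIAR_WORDS stands in for the full 3000-word list.
import re

FAMILIAR_WORDS = {"the", "a", "and", "is", "was", "to", "of", "in", "it",
                  "he", "she", "they", "we", "had", "very", "happy", "dog"}

def dale_chall_raw(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    pct_difficult = 100 * sum(w not in FAMILIAR_WORDS for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    if pct_difficult > 5:
        score += 3.6365   # adjustment applied when difficult words exceed 5%
    return score

print(round(dale_chall_raw("They were very happy. He had a dog in the park."), 2))
```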
For the task of answering questions (Task B), both predicted and
actual performances were under investigation. As described above,
O'Sullivan et al. (2002) had developed checklists for the functional
content of direct speaking tests, such as those in the Cambridge
ESOL Main Suite examinations. However, GEPTS-I is a semi-direct
test delivered by tape, with an obvious reduction in reciprocity and
potential for interactivity, and so the checklists were modified by
taking out functions not applicable to a semi-direct test such as
negotiating meaning in the interactional functions and also all functions relating to managing interaction. For predicted performance,
the modified checklists were given to a group of 12 raters familiar
with the GEPTS-I format and tasks. They were requested to map
the functions listed in the checklists on to the items in Task B in the
three trial GEPTS-I test forms (1, 2, and 3). The test papers were
shown to the raters and they were asked to identify which functions
they expected each of the items in the tests to elicit. The raters were
asked to record each instance of a particular function independently
without consultation with each other. Raters' reports on the likely
presence of a particular function were tallied, and frequency counts
generated. The frequency counts provided details of the relative
frequency of occurrence of the language functions and also of the
degree of consensus among raters.
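The tallying can be sketched as follows; the rater reports are invented, and the seven-of-twelve (60%) criterion anticipates the one applied in the Results section.

```python
# Sketch of tallying raters' reports of expected functions; the reports are invented.
from collections import defaultdict

FORMS = ["Form 1", "Form 2", "Form 3"]
reports = [
    # (rater, form, function the rater expects the Task B items to elicit)
    ("R01", "Form 1", "Expressing opinions"),
    ("R02", "Form 1", "Expressing opinions"),
    ("R01", "Form 2", "Expressing opinions"),
    ("R03", "Form 3", "Suggesting"),
    # ... one row per rater / form / expected function
]

observers = defaultdict(set)                     # (form, function) -> raters reporting it
for rater, form, function in reports:
    observers[(form, function)].add(rater)

def good_agreement(form, function, threshold=7):
    """True when at least `threshold` of the 12 raters expect the function."""
    return len(observers[(form, function)]) >= threshold

def parallel(function):
    """Treated as parallel when it has good agreement on all three forms."""
    return all(good_agreement(form, function) for form in FORMS)

print(parallel("Expressing opinions"))
```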
Raters were further asked to consider data from actual test performances. In cases where a function is not observed by the raters, it
might be due to the proficiency of the candidate rather than to any
variation in the coverage of language functions between the task
versions. To guard against this only the best performances on each
test form, which had been awarded 5 by all three raters, were considered. Due to practical constraints only two of these best performances on Task B from each of the three test forms were randomly
selected and transcribed. The 12 raters, the same individuals who
identified the intended functions, were asked to map the functions
they observed in the transcripts.
In addition to their observation of language functions in the task,
the same group of raters was invited to familiarize themselves with
the tasks and then to provide their views on whether the three
GEPTS-I forms were equally difficult, within each task type, in
terms of factors that may affect task difficulty as described above
(see Appendix 2).
As in the previous two tasks, a checklist was also designed to elicit
raters' judgements on Task C (see Appendix 2). This time the focus
was on whether the pictures presenting the non-verbal propositional
input used in the different versions of Task C were equally difficult.
Following an information-processing approach, it is considered that
candidates' familiarity with input information has an impact on their performance. Therefore, the criteria for the checklist focus mainly on candidates' familiarity with the content in the pictures, such as
roles of people, locations, events, objects, visibility, and amount of
details. Also included in the checklist were criteria for the range of
lexical items, grammatical structures, and language functions that
candidates are expected to use in describing the pictures.

IV Results of statistical analyses


1 Descriptive statistics
Table 1 reports the mean and standard deviation for each test task. It can be seen that in each group the mean scores for all
six tasks are very close, ranging from 3.80 to 4.02 (Group 1) and
3.82 to 3.95 (Group 2). Further, all the differences in mean scores
between each pair of test tasks (i.e. Task A in Test Forms 1 and 2,
coded as F1TA and F2TA; Task C in Test Forms 2 and 3, coded as
F2TC and F3TC) are also small, being no greater than .2.
Table 1 Means and standard deviations on the different test tasks

Variable            Mean    s.d.
Group 1 (n = 60):
F1TA                3.90    .63
F2TA                4.00    .58
F1TB                3.80    .66
F2TB                3.98    .62
F1TC                4.02    .57
F2TC                3.83    .62
Group 2 (n = 60):
F2TA                3.92    .59
F3TA                3.95    .59
F2TB                3.90    .63
F3TB                3.87    .62
F2TC                3.87    .60
F3TC                3.82    .60

2 Correlation
The correlation coefficients between the scores on the two GEPTS-I forms (Form 1 vs. Form 2; Form 2 vs. Form 3) at task level provide a preliminary estimate of the parallel-form reliability of each form.


Since the task scores are at an ordinal level of measurement, the
Spearman correlation (rho) was used. As seen in Table 2a, for
equivalent tasks in Forms 1 and 2, the correlation coefficients fall in
a range of .550 to .596, which are all significant at the 0.01 level. The
same applies to the equivalent tasks in Forms 2 and 3, with rho
values ranging from .377 to .839 (see Table 2b). Given the fact that
some of the correlations are quite low, there is a possibility that, notwithstanding any differences attributable to content variation, these traditional estimates of parallel-form reliability might have
been influenced by errors of measurement resulting from variation in
rater harshness, as well as being affected by the restricted range of
the candidature. Therefore, it was necessary to use MFRM to take
these other variables into account. At the form level (i.e. adding together scores on each task within forms), the correlation between Forms 1 and 2 was .84, while that between Forms 2 and 3 was .69.

Table 2a Spearman correlation coefficients: Form 1 vs. Form 2

                  Form 1            Form 2
                  Task B   Task C   Task A   Task B   Task C
Form 1:  Task A   .528**   .468**   .596**   .599**   .646**
         Task B            .477**   .514**   .550**   .582**
         Task C                     .408**   .623**   .560**
Form 2:  Task A                              .483**   .418**
         Task B                                       .414**

Note: **all correlations significant at 0.01 level (2-tailed)

Table 2b Spearman correlation coefficients: Form 2 vs. Form 3

                  Form 2            Form 3
                  Task B   Task C   Task A   Task B   Task C
Form 2:  Task A   .439**   .460**   .377**   .433**   .593**
         Task B            .587**   .531**   .839**   .497**
         Task C                     .357**   .589**   .489**
Form 3:  Task A                              .410**   .501**
         Task B                                       .630**

Note: **all correlations significant at 0.01 level (2-tailed)
3 Factor analysis
Factor analyses enable us to investigate statistically whether there
are components that are shared in common by the tests. Since the test
tasks were created to measure speaking, one would expect ideally
one factor to appear in a factor analysis of each pair of test forms.
For a factor analysis of all the six tasks together, there should be one
major factor with an eigenvalue greater than 1, representing speaking ability. This is borne out by the results of the present study.
Tables 3 and 4 show that about 62% of the variance is explained by
this factor in the case of Forms 1 and 2, and about 60% for Forms 2
and 3. Further, we can look at the factor loadings for individual tasks
within the test forms, reflecting the portion of the total variance that
each task contributed to the factor (see Table 5). The results show
that each pair of test tasks generally has equivalent loadings on the
factor, which indicates that they measure the same trait.

Table 3 Principal components analysis of Forms 1 and 2

Factor   Eigenvalue   Percentage of variance   Cumulative percentage
1        3.700        61.666                   61.666
2        .705         11.758                   73.424
3        .588         9.800                    83.224
4        .394         6.568                    89.792
5        .380         6.327                    96.119
6        .233         3.881                    100.000

Table 4 Principal components analysis of Forms 2 and 3

Factor   Eigenvalue   Percentage of variance   Cumulative percentage
1        3.594        59.902                   59.902
2        .694         11.562                   71.464
3        .653         10.881                   82.345
4        .461         7.687                    90.032
5        .432         7.201                    97.233
6        .166         2.767                    100.000

Table 5 Factor matrix

Variable   Factor loading
Test Forms 1 and 2:
F1TA       .802
F1TB       .819
F1TC       .765
F2TA       .745
F2TB       .806
F2TC       .773
Test Forms 2 and 3:
F2TA       .638
F2TB       .875
F2TC       .758
F3TA       .690
F3TB       .872
F3TC       .780
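The principal components step reported in Tables 3-5 can be sketched on an invented score matrix with one column per task: the eigenvalues of the inter-task correlation matrix give the percentages of variance, and the first-component loadings correspond to the factor matrix. The scores are simulated around a single underlying trait purely for illustration.

```python
# Illustrative principal components analysis on simulated task scores (6 columns).
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=200)                           # one underlying speaking trait
scores = np.column_stack([ability + rng.normal(scale=0.6, size=200) for _ in range(6)])

corr = np.corrcoef(scores, rowvar=False)                 # 6 x 6 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)         # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = 100 * eigenvalues / eigenvalues.sum()        # percentage of variance
loadings = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])  # loadings on the first component

print("eigenvalues:", np.round(eigenvalues, 3))
print("% variance: ", np.round(explained, 1))
print("loadings:   ", np.round(np.abs(loadings), 3))     # sign of a component is arbitrary
```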

4 ANOVA
Analysis of variance was employed to test whether there was any
significant difference between the corresponding tasks in different
test forms. The results for Group 1 (Forms 1 and 2) tell us that
there were no significant differences (Table 6). A post hoc
Bonferroni test (a multiple comparison test of the Tukey HSD type)
confirmed there was no significant difference between any of
the pairs of equivalent tasks. The same results were obtained for
Group 2 (Forms 2 and 3) (see Table 7). A post hoc Bonferroni test
confirmed there was no significant difference between any of the pairs of equivalent tasks.

Table 6 Analysis of variance for Group 1 (Forms 1 and 2; ANOVA scores)

                 Sum of squares   df    Mean square   F       Significance
Between groups   3.358            5     .672          1.808   .110
Within groups    131.470          354   .371
Total            134.828          359

Table 7 Analysis of variance for Group 2 (Forms 2 and 3; ANOVA scores)

                 Sum of squares   df    Mean square   F       Significance
Between groups   .800             5     .160          .441    .819
Within groups    128.300          354   .362
Total            129.100          359
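A sketch of the kind of comparison reported in Tables 6 and 7 is given below: a one-way ANOVA across six invented columns of band scores, followed by tests on the equivalent-task pairs at a Bonferroni-adjusted alpha. scipy's f_oneway and paired t-tests merely stand in for the SPSS procedures used in the study.

```python
# Illustrative one-way ANOVA with Bonferroni-adjusted follow-up comparisons; data invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
names = ["F1TA", "F2TA", "F1TB", "F2TB", "F1TC", "F2TC"]
tasks = {n: np.clip(np.round(rng.normal(3.9, 0.6, 60)), 1, 5) for n in names}

f_stat, p_value = stats.f_oneway(*tasks.values())
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.3f}")

# Equivalent-task pairs, each tested against a Bonferroni-corrected alpha.
pairs = [("F1TA", "F2TA"), ("F1TB", "F2TB"), ("F1TC", "F2TC")]
alpha = 0.05 / len(pairs)
for a, b in pairs:
    t_stat, p_pair = stats.ttest_rel(tasks[a], tasks[b])  # same candidates took both
    print(f"{a} vs {b}: t = {t_stat:.2f}, p = {p_pair:.3f}, "
          f"different at adjusted alpha: {p_pair < alpha}")
```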
5 Inter-rater reliability
Sixty forms were marked as a core batch by all three markers and
these provide us with three sets of marks on nine tasks across three
forms. The Spearman correlation coefficients between the scores
on the three GEPTS-I forms (Form 1 vs. Form 2; Form 2 vs. Form 3)
at task level provide us with an estimate of the inter-rater reliability
of the three markers on each form (see Table 8).
The correlations at the individual task level between raters
(R1, R2 and R3) were obviously lower because of the restricted
range of marks (1–5) at the task level (A, B and C) (Table 9).
Table 8 Inter-rater reliability at the form level: correlations: Spearman's rho (n = 60)

                                     TOTAL 1   TOTAL 2   TOTAL 3
TOTAL 1   Correlation coefficient    1.000     .808**    .875**
          Significance (2-tailed)              .000      .000
TOTAL 2   Correlation coefficient    .808**    1.000     .809**
          Significance (2-tailed)    .000                .000
TOTAL 3   Correlation coefficient    .875**    .809**    1.000
          Significance (2-tailed)    .000      .000

Notes: **correlation is significant at the 0.01 level (2-tailed)

Table 9 Correlations of ratings of the individual test tasks

       R2A      R3A      R1B      R2B      R3B      R1C      R2C      R3C
R1A    .587**   .697**   .491**   .415**   .474**   .408**   .436**   .500**
R2A             .661**   .402**   .463**   .494**   .310*    .231     .343**
R3A                      .432**   .345**   .441**   .338*    .370*    .501**
R1B                               .648**   .756**   .451**   .611**   .636**
R2B                                        .760**   .457**   .403**   .465**
R3B                                                 .536**   .532**   .545**
R1C                                                          .600**   .693**
R2C                                                                   .657**

Notes: **correlation is significant at the 0.01 level (2-tailed); *correlation is significant at the 0.05 level (2-tailed)

6 MFRM at the form (test) level


In addition to the conventional parallel-form statistics already
presented, multi-faceted Rasch measurement (MFRM) was applied
in order to take into account major potential sources of variability
that might have affected the test data, such as rater harshness and
consistency, and form variability; and this allowed for a clearer view
of the psychometric qualities of the test. As the main focus of this
study is on the test forms (and not on the candidate performance), the
test forms were set as the non-centered facet, which means that all
other facets (raters and candidates) are fixed and the test form facet
is allowed to float so that the test forms can be positioned with
respect to the other facets.
In the first analysis (Table 10) the whole data set for Groups 1 and
2 was analysed using FACETS. The table provides information on
the characteristics of raters (harshness and consistency). All raters
displayed acceptable levels of consistency with themselves. This can
be seen from the Infit Mean Square column, by adding two standard
deviations to the mean. Raters falling within these parameters in
their reported Infit Mean Square indices are considered to have
behaved consistently. On the other hand, the separation and reliability figures indicate that there were significant differences between
raters in terms of harshness. However, the difference, based on fair
average scores, is only .06 of a band, suggesting that there would be
no impact on scores awarded in an operational setting.
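As a quick check, both figures follow directly from Table 10:

$$\bar{M}_{\text{infit}} + 2\,SD = 1.00 + 2(0.13) = 1.26, \qquad 3.97 - 3.91 = 0.06,$$

so the largest observed infit mean square (1.11) lies comfortably inside the acceptance band, and 0.06 of a band is the gap in fair averages between the most lenient and the harshest rater.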
The analysis of the three forms in Table 11 shows that no significant difference occurs between the tests at the level of form, i.e.
where all three tasks in a form are combined for analysis and compared to the other forms. The forms do not appear to separate the
candidates to a significant degree. Given that grades on the GEPT are
awarded as a composite score across the three tasks, this means that
in normal operations the three forms can be considered equivalent.
Table 10 FACETS analysis of rater characteristics

          Fair-M average   Severity (logits)   Error   Infit (mean square)
Rater 1   3.97             -.23                .13     1.06
Rater 2   3.95             -.09                .13     1.11
Rater 3   3.91             .32                 .13     .82
Mean      3.94             .00                 .13     1.00
SD        .02              .23                 .00     .13

Notes: Reliability of separation index = .70; fixed (all same) chi-square: 9.8, df: 2; significance: p = .01


Table 11 FACETS forms measurement report (arranged by n)

         Difficulty (logits)   Error   Infit (mean square)
Form 1   2.15                  .15     .93
Form 2   2.56                  .10     .97
Form 3   2.40                  .15     1.12
Mean     2.37                  .13     1.01
SD       .17                   .02     .08

Notes: reliability of separation index = .38; fixed (all same) chi-square: 5.4, df: 2; significance: p = .07

Batch was also included as a facet to check that the way the tapes
had been allocated had not led to any problems. The data in Table 12
shows this to have been the case.
Table 12 FACETS batches measurement report (arranged by n)

          Measure (logits)   Error   Infit (mean square)
Batch 1   .03                .10     1.05
Batch 2   .16                .19     1.05
Batch 3   .09                .19     .75
Batch 4   .03                .18     1.01
Mean      .00                .16     .97
SD        .10                .03     .12

Notes: reliability of separation index = .00; fixed (all same) chi-square: 1.1, df: 2; significance: p = .77

7 MFRM at the individual task level
In many of the earlier studies on intra-task variation, there seems to have been little attempt to establish parallelness of task before the effects on task performance of manipulating variables were investigated. This must raise serious doubts about the results obtained. For the purposes of future GEPT research on intra-task variability in speaking tasks, we wanted to establish if any variation occurred between the three forms at the level of tasks. The ANOVA results presented earlier suggested that there were no differences between the tasks themselves across pairs of forms. This was of course based on scores that had not been calibrated to take account of form or rater variability.
Therefore three separate MFRM analyses were run on the individual Tasks A, B and C across the three forms. This information on estimated test form difficulty at task level is presented in Table 13.


The table identifies the test forms in column 1, reports the Fair-M
average in column 2, and the estimate of test difficulty (measured in
logits) in column 3. The fair average score (Fair-M) is the observed
average adjusted for the measures of the facets encountered (the harshness of the raters, the challenge of the task, and the ability of the student) and expressed in the original raw score metric. The
logit measure in column 3 is the linear measure implied by the observations from which Fair-M is derived. It provides a better basis for
statistical inference, though it may not be as communicative as the
original scale categories. Standard errors and Infit-mean square
indices are also provided in columns 4 and 5. Since the intention was
to determine to what extent these three test forms are equivalent in
terms of difficulty at task level, the results of fixed chi-square tests
are given in column 6.
Table 13 reports the estimated difficulty of Task A in Test Forms
1, 2, and 3 calibrated on the logit scale as 3.29, 4.96, and 5.29,
respectively. There is only a difference of .33 between the logit
values for Forms 2 and 3, as compared to a difference of 1.67
between Forms 1 and 2, indicating that for Task A, Form 2 and Form 3 are more similar in terms of difficulty. In the case of Task B there is a difference of 1.08 between Forms 2 and 3, as against 2.53 between Forms 1 and 2. The same pattern is repeated for Task C, in which the calibrated scores in logits for Form 2 and Form 3 are very close, with a difference of just 0.65.

Table 13 Summarized MFRM results for three test forms and tasks

Test form   Fair-M average   Difficulty measure (logits)   Error   Infit mean-square   Fixed chi-square (p)
Task A:
1           3.85             3.29                          .35     .86                 .00
2           3.97             4.96                          .24     .93
3           3.98             5.29                          .31     1.13
Mean        3.93             4.51                          .30     .98
SD          .06              .87                           .05     .11
Task B:
1           3.82             1.78                          .35     .94                 .00
2           3.98             4.31                          .27     .99
3           3.95             3.23                          .37     1.06
Mean        3.92             3.11                          .33     .99
SD          .07              1.04                          .04     .05
Task C:
1           3.99             4.87                          .31     .95                 .00
2           3.98             3.74                          .22     .92
3           3.96             3.09                          .33     1.04
Mean        3.98             3.90                          .29     .97
SD          .02              .73                           .05     .05
In the fixed chi-square test, a significance level of .00 for all three
tasks means we have to reject the null hypothesis of the statistical
test. Thus, the three versions of each task are not equivalent, although it should be noted that the actual differences are very small.
In reviewing these results, Mike Linacre (personal communication)
commented that:
in these data the large logit differences for very small raw score differences
suggest that the data have a close to Guttman pattern, i.e. there is very little
of the stochastic randomness (rater disagreement) on which MFRM depends. Because the data are close to Guttman-patterned, these differences are probably like scratches on a glass window – noticeable, but not of any
substantive importance.

As the three tasks in Form 2 and Form 3 appeared to be more parallel in terms of difficulty, MFRM analyses on these two forms at task
level were repeated to investigate whether they might be considered
equivalent. The results summarized in Table 14 show that Tasks A,
B, and C in Forms 2 and 3 were equally difficult. For Task A, the
scores calibrated on the logit scale were very close, with a difference
of just 0.20, showing that these two tasks were similar in difficulty.
To confirm this finding, the fixed chi-square test failed to reject the
null hypothesis that the two versions of Task A could be thought of
as equally difficult (p  .62). The same results applied to Tasks B
and C: the differences in logits were small (0.93 and 0.74, respectively) and, in the chi-square test, the significance levels of .06 (Task
B) and .08 (Task C) were insufficient to reject the null hypothesis.
In sum, Form 2 and Form 3 can be considered parallel tests at the
task level as well as at the overall form level.

Table 14 Summarized MFRM results for test Forms 2 and 3

Test form   Fair-M average   Difficulty measure (logits)   Error   Infit mean-square   Fixed chi-square (p)
Task A:
2           3.99             .44                           .27     .76                 .62
3           3.99             .24                           .32     1.21
Mean        3.99             .34                           .30     .98
SD          .00              .10                           .02     .22
Task B:
2           3.98             3.01                          .33     .90                 .06
3           3.95             2.08                          .37     .90
Mean        3.96             2.54                          .35     .90
SD          .02              .46                           .02     .00
Task C:
2           3.98             3.90                          .26     .90                 .08
3           3.95             3.16                          .33     1.02
Mean        3.96             3.53                          .29     .96
SD          .01              .37                           .04     .06

V Results of qualitative analyses

1 Task A: Read-aloud
The 12 raters' views summarized in Table 15 show that the three test forms of the read-aloud task were considered in most respects equally difficult (12/12), with some reservations expressed in the areas of text types, lexical items, and grammatical structures.

Table 15 Raters' views on the difficulty of Task A across forms

Conditions                           Agreement
Topics                               12/12 (100%)
Concreteness of content              11/12 (92%)
Text types                           7/12 (58%)
Lexical items                        10/12 (83%)
Grammatical structures               9/12 (75%)
Language functions                   12/12 (100%)
Amount of time for performance       12/12 (100%)

Text types prompted the most disagreement, with five raters considering
that they varied across test forms. These raters noted that Form 1
contained three passages (one narrative, one descriptive, and one telephone talk) whereas Forms 2 and 3 contained two passages,
with two narratives in Form 2 and one narrative and one telephone
talk in Form 3. This discrepancy in itself needs addressing in future
versions of this task. For lexical items, reasonable agreement among
raters (10 out of 12) was found, indicating that the lexical items in
the tests were considered equally familiar to candidates, although
four raters agreed with some degree of reservation. For grammatical
structures, a 75% agreement among raters (9 out of 12) was found.
Three raters reported that the sentences in the passages in Form 1
were shorter and easier to read than those in Form 2 and Form 3.
However, according to the Dale–Chall readability formula, which
takes lexical items and sentence length into consideration, all passages were rated equally at Level 3. This shows the importance of
qualitative data as a complement to more quantitative procedures.
2 Task B: Answering questions
To examine the extent to which each test form was similar to or
different from the other forms in test content coverage in Task B
(answering questions), the 12 raters were invited to judge which
language functions would actually be elicited by the questions in
each test form. The results are presented in Table 16. Within each test
form, a function that was observed by seven raters or more (60%
or above) was considered to have good agreement. The intended
functions which received good agreement in all three test forms are
viewed as parallel functions.
Among the functions listed in Table 16, six out of nine (about 70%) were judged to be parallel. The other three functions were not considered consistent across test forms. Thus, based on the collated views of the raters, the content coverage in terms of language functions was found to be similar to a large degree across test forms, though some differences were also seen.

Table 16 The parallelness of the intended language functions in Task B

Functions                                      Parallel or not
Providing personal information (present)
Providing personal information (past)
Expressing opinions
Elaborating
Justifying opinions
Speculating/hypothesizing
Suggesting
Expressing preferences
Asking for information

Turning to the language functions observed in the actual performance of the candidates, the raters found that not only did the six parallel functions occur consistently across the three test forms, but the other three functions did as well. In other words, a much greater degree of similarity in test content coverage across forms was found in the actual performance than was predicted by a review of the test questions on paper. These data illustrate the need to investigate both
intended and actually occurring functions in test-development.
Besides analysing the language functions, the same 12 raters were
invited to provide their views on the difficulty of the tasks, using the
specifically created checklist. Overall, the raters achieved a high
level of agreement that the three versions of Task B were equally
difficult with respect to all of the items on the checklist. Their
responses are summarized in Table 17. For five out of the nine
difficulty factors, all the raters (12/12) agreed that the degree of
difficulty was consistent across forms. A similar level of agreement
(11/12) was found with three of the other factors. The one variable
where there was less consensus was 'grammatical structures heard in the questions'. Although nine raters agreed that the difficulty
level was similar across forms, the other three had reservations and
considered that the questions in Forms 2 and 3 contained more
difficult structures than those in Form 1.
Table 17 Raters' views on the difficulty of Task B across forms

Conditions                                      Agreement
Familiarity with topics                         11/12 (92%)
Concreteness of content                         12/12 (100%)
Lexical items in questions heard                12/12 (100%)
Grammatical structures in questions heard       9/12 (75%)
Lexical items required for use                  11/12 (92%)
Grammatical structures required for use         11/12 (92%)
Language functions                              12/12 (100%)
Speed of delivery                               12/12 (100%)
Amount of time for performance                  12/12 (100%)

3 Task C: Picture description
In the case of the picture description tasks, raters again agreed closely that tasks from the three test forms were equally difficult in all conditions. Their responses are summarized in Table 18. There was 100% consensus for five of the conditions. However, three out of the 12 raters reported that the locations where the pictures had been taken would not be equally familiar to candidates because the picture in Form 2 was located in a foreign country, which might affect their performance adversely; whereas candidates might benefit from the local context of the other two pictures as they would naturally be more familiar and comfortable with it. In addition, three raters also indicated that the picture in Form 2 was mainly about an event happening in a foreign location, which was less familiar to candidates. In short, according to the raters' views, candidates might be disadvantaged by Picture 2 because it represented a less familiar context.

Table 18 Raters' views on the difficulty of Task C across forms

Conditions                                      Agreement
Roles of people                                 12/12 (100%)
Locations                                       9/12 (75%)
Events                                          9/12 (75%)
Objects                                         12/12 (100%)
Visibility                                      11/12 (92%)
Amount of details                               10/12 (83%)
Lexical items required for use                  10/12 (83%)
Grammatical structures required for use         12/12 (100%)
Language functions required for use             12/12 (100%)
Amount of time for performance                  12/12 (100%)

VI Discussion
When various statistical procedures, including correlation, ANOVA,
factor analysis, and MFRM, were used to measure the parallel-form
reliability of three GEPTS-I trial test forms, certain discrepancies
were found in the results generated by different statistical procedures. The ANOVAs comparing the candidates' performance on the
three GEPTS-I test forms (Form 1 vs. Form 2; Form 2 vs. Form 3)
showed no significant difference between candidates' performance
on each individual task type in Form 1 and 2, and in Form 2 and
Form 3, which supports the parallelness of the GEPTS-I in respect of
these versions.
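To make the kind of comparison involved concrete, the sketch below runs a one-way ANOVA on a single task type across three forms using SciPy. It is illustrative only: the scores and the unpaired three-group design are invented for the example and do not reproduce the GEPTS-I trial data or the exact pairwise analyses reported above.

# Illustrative sketch only: one-way ANOVA comparing scores on one task type
# across three hypothetical test forms. All scores below are invented.
from scipy import stats

form1_scores = [3, 4, 3, 5, 2, 4, 3, 4, 3, 4]
form2_scores = [3, 3, 4, 4, 3, 5, 2, 4, 4, 3]
form3_scores = [4, 3, 3, 4, 3, 4, 3, 5, 3, 4]

f_stat, p_value = stats.f_oneway(form1_scores, form2_scores, form3_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# A non-significant result (e.g. p > .05) is consistent with the forms being of
# comparable difficulty on this task, but it cannot by itself establish
# equivalence, which is why the study triangulates with MFRM and rater judgements.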
In the MFRM analysis, at the task level across all three task types, Form 1 was identified as an outlier in comparison with the other two forms. The MFRM analysis was then repeated to check the ANOVA result that Forms 2 and 3 were parallel at the task level, and the results support the conclusion that Tasks A, B, and C in these two forms were equally difficult. To corroborate this finding, evidence from other sources, such as the rater judgements, was also drawn on in this study.
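For readers unfamiliar with MFRM, the many-facet Rasch model implemented in FACETS (Linacre, 1999) is conventionally written as below. This is the standard rating-scale formulation rather than a reproduction of the exact model specification used in the study.

% Standard many-facet Rasch model (rating-scale form), as implemented in FACETS.
% P_{nijk}     : probability that candidate n is awarded category k by rater j on task i
% P_{nij(k-1)} : probability of the adjacent lower category
% B_n : ability of candidate n        D_i : difficulty of task i
% C_j : severity of rater j           F_k : difficulty of the step up to category k
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k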
The raters' views shed light on what was found in the statistical analyses. Just as Form 1 was statistically identified as an outlier, raters also commented on it as being different from the other two forms.
In particular, some raters considered that Task A (read-aloud) in Form 1 was the easiest of that task type across the three tests, as the sentences in its read-aloud passages were shorter and easier to read than those in the other two forms. In addition, the greater variety of text types in the Form 1 passages was considered to increase the difficulty. Further evidence came from Task B (answering questions): some raters considered that the questions in Form 1 contained simpler structures than those in Form 2 or Form 3. Such evidence, obtained from both qualitative and quantitative sources, enables us to explain why performance on Form 1 differed slightly from performance on the other forms, and it also provides valuable insights for improving the parallelness of Form 1 at the task level.

VII Conclusions
This study has attempted to measure a number of aspects of the parallel-form reliability of three trial GEPTS-I test forms, quantitatively at both the form and the task level, and qualitatively at the task level. In addition to more conventional statistical procedures, such as correlation and ANOVA, MFRM was employed to process candidates' score data in order to account for the effect of variables associated with rater severity and form/task difficulty. The statistical results show that all three GEPTS-I forms can be considered parallel at the overall test level. Further, they demonstrate that for individual task types the same point holds true across two forms (2 and 3).
Apart from measuring parallel-form reliability in a conventional quantitative way, raters' judgements have enabled us to investigate such parallelness qualitatively. The study employed checklists to examine the content of the three trial GEPTS-I test forms from the raters' viewpoint: an individual checklist was developed for each of the three task types, detailing the potential variables affecting task difficulty on which raters were asked to judge. According to the raters' views on task difficulty, Form 1 was not equivalent to the other two tests. This interpretation matches the MFRM results at the task level (though not the ANOVA results), and the raters' judgements likewise indicate that two of the trial GEPTS-I forms, Form 2 and Form 3, are parallel in this respect. Moreover, the raters suggested reasons why the difficulty of Form 1 varies from that of the other two forms, providing the GEPTS-I test-development team with useful insights for future improvement of equivalence in terms of content validity.
In addition to eliciting raters' views on task difficulty, this study adopted observation checklists for the validation of the speaking test, as proposed by O'Sullivan et al. (2002). Through the raters' observations, the intended functions in Task B (answering questions) were compared across the three trial GEPTS-I test forms, so that the extent to which the tasks were similar in content coverage could also be gauged. The raters reported that the language functions covered in answering questions were similar across the three forms. Content coverage was further investigated through a posteriori studies of a limited number of transcripts, in which raters were asked to map the language functions they observed in actual candidate performances. These studies substantiated the earlier finding of equivalent coverage of language functions across the three trial GEPTS-I forms.
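As a rough illustration of what such a mapping involves, the sketch below compares a set of intended functions with the functions observed in transcripts for each form. The function labels and the observed sets are hypothetical and are not taken from the GEPT checklists or transcripts.

# Illustrative sketch only: comparing intended vs. observed language functions
# per form. The function labels and observations below are invented.
intended = {"describing", "comparing", "expressing opinion", "narrating"}

observed = {
    "Form 1": {"describing", "comparing", "expressing opinion"},
    "Form 2": {"describing", "comparing", "expressing opinion", "narrating"},
    "Form 3": {"describing", "expressing opinion", "narrating"},
}

for form, functions in observed.items():
    covered = functions & intended
    missing = sorted(intended - functions)
    print(f"{form}: {len(covered)}/{len(intended)} intended functions observed; "
          f"missing: {missing if missing else 'none'}")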
The effect on task difficulty of variables associated with performance conditions is evident in this study and sounds a warning for all those involved in intra-task variability research and in producing equivalent forms. The results show that, without taking the necessary steps to control context variables affecting test difficulty, test quality may fluctuate across tasks in different test forms. It is hoped that findings from research of this kind will provide insights into how to enhance fairness for candidates in the development and administration of semi-direct speaking tests such as the GEPTS-I, and will also send a clear message to those involved in intra-task variation studies about the necessity of first developing parallel tasks before any manipulation occurs.
Comprehensive evidence is required to support the argument for
the validity of even a single test (Messick, 1989; Kane, 1992; Kane
et al., 1999; Mislevy et al., 2002; 2003; Chapelle et al., 2004; Weir,
2005). Comparisons between tests are even more demanding where
equivalence is required in all aspects of validity. This article is
limited by its narrow focus on parallel-forms reliability, with some
consideration of equivalence in terms of content validity in the
qualitative part of the study.
High correlations between tests (traditionally viewed as establishing parallel-forms reliability) do not in themselves provide sufficient evidence that two tests are equivalent in terms of validity. Even when supplemented by full evidence of consistent content coverage across test forms, this still only constitutes a partial validity argument. If confidence in the equivalence of test forms is to be established, evidence is also required of a test's cognitive validity (i.e. the cognitive processing required) and consequential validity (i.e. its effects and impact). The latter may, of course, prove difficult in those cases where examination boards, for security reasons, have to provide multiple versions of a speaking examination for each administration as well as across annual administrations. It may be that for them a priori statistical and content checks in pilot studies are the only realistic operational procedures. Secure tests with a longer shelf life, such as IELTS or TOEFL, may be able to meet more comprehensive validity requirements (for details of what might be required, see O'Loughlin, 2001; Mislevy et al., 2002; 2003; Chapelle et al., 2004; Weir, 2005).
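The limitation of correlational evidence noted above can be seen in a small numerical illustration: two forms can rank candidates almost identically (a high Pearson r) while one is systematically harder (a lower mean). The scores below are invented purely for the illustration.

# Illustrative sketch only: a high correlation between two forms does not imply
# equal difficulty. "Form B" here is a uniformly harsher version of "Form A".
from statistics import mean
from scipy import stats

form_a = [72, 65, 80, 58, 90, 77, 61, 84]
form_b = [66, 60, 74, 51, 85, 70, 56, 79]  # same rank order, roughly 6 points lower

r, _ = stats.pearsonr(form_a, form_b)
print(f"Pearson r = {r:.2f}")                                        # close to 1.0
print(f"Mean Form A = {mean(form_a):.1f}, Mean Form B = {mean(form_b):.1f}")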
Establishing content validity and parallel-forms reliability may not in itself be sufficient for developing a comprehensive validity argument, but it is an essential first step for operational high-stakes tests. The consequences of not doing so, in terms of fairness to candidates, are clear from the evidence presented in this study: not all the tasks proved parallel, either statistically or in terms of task content. Further applied studies are required to guide the profession
in operationalizing validity in all its manifestations. It is hoped that
this study will encourage other examination boards to progress
further with the methodology and report on their efforts.

Acknowledgements
An earlier version of the article was presented at the 5th International
Conference on English Language Testing in Asia, Tokyo, Japan,
2002. We wish to acknowledge the assistance of the raters who participated in marking the tapes and in completing the checklists. Thanks are also due to Barry O'Sullivan of Roehampton University, UK, for
his invaluable advice throughout, and to colleagues and management
in the Language Training and Testing Center in Taiwan for their
contribution. The advice provided by the anonymous Language
Testing reviewers enabled us to radically improve the article from
its original conception. John Read provided his usual invaluable
assistance in editing the manuscript. Lastly, we would like to
acknowledge the very generous help of Mike Linacre, who gently
led two novices through the complexities of MFRM.

VIII References
AERA/APA/NCME 1985/1999: Standards for educational and psychological
testing. Washington, DC: AERA.
Anastasi, A. 1988: Psychological testing. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L., Davidson, F., Ryan, K. and Choi, I.-C. 1995: An investigation
into the comparability of two tests of EFL: the Cambridge-TOEFL
comparability study. Cambridge: Cambridge University Press.
Bygate, M. 1987: Speaking. Oxford: Oxford University Press.
Chall, J.S. and Dale, E. 1995: Manual for The New Dale-Chall Readability
Formula. Cambridge, MA: Brookline Books.
Chalhoub-Deville, M. and Turner, C.E. 2000: What to look for in ESL admission tests: Cambridge certificate exams, IELTS, and TOEFL. System 28, 523–39.
Chapelle, C.A., Enright, M.K. and Jamieson, J. 2004: Issues in developing a TOEFL validity argument. Paper presented at the Language Testing Research Colloquium, Temecula, CA.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
Foster, P. and Skehan, P. 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Iwashita, N., McNamara, T. and Elder, C. 2001: Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design. Language Learning 51, 401–36.
Kane, M.T. 1992: An argument-based approach to validity. Psychological Bulletin 122, 527–35.
Kane, M., Crooks, T. and Cohen, A. 1999: Validating measures of performance. Educational Measurement: Issues and Practice 18, 5–17.
Lazaraton, A. 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing, Volume 14. Cambridge: Cambridge University Press.
Linacre, J.M. 1999: Facets: Rasch measurement computer program, version
3.22. Chicago, IL: MESA Press.
McNamara, T. 1996: Measuring Second Language Performance. London:
Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1989: Validity. In Linn, R.L., editor, Educational measurement, 3rd edn. New York: Macmillan, 13–103.
Mislevy, R.J., Steinberg, L.S. and Almond, R.G. 2002: Design and analysis in task-based language assessment. Language Testing 19, 477–96.

Mislevy, R.J., Steinberg, L.S. and Almond, R.G. 2003: On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives 1, 3–67.
Myford, C.M. and Wolfe, E.W. 2000a: Monitoring sources of variability within the Test of Spoken English Assessment System. TOEFL Research Report No. 65. Princeton, NJ: Educational Testing Service.
Myford, C.M. and Wolfe, E.W. 2000b: Strengthening the ties that bind: improving the linking network in sparsely connected rating designs. TOEFL Technical Report No. 15. Princeton, NJ: Educational Testing Service.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing
second language performance assessments. Technical Report 18.
Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 2001: The equivalence of direct and semi-direct speaking tests. Cambridge: Cambridge University Press.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B., Weir, C. and Ffrench, A. 2001: Task difficulty in testing spoken language: a socio-cognitive perspective. Paper presented at the Language Testing Research Colloquium, St Louis, MO.
O'Sullivan, B., Weir, C. and Saville, N. 2002: Using observation checklists to validate speaking-test tasks. Language Testing 19, 33–56.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
Skehan, P. 1998: Tasks and language performance assessment. Paper presented at the Language Testing Forum, University of Wales, Swansea.
Skehan, P. and Foster, P. 1997: Task type and task processing conditions as influences on foreign language performance. Thames Valley University Working Papers in English Language Teaching 3, 139–88.
Skehan, P. and Foster, P. 1999: The influence of task structure and processing conditions on narrative retellings. Language Learning 49, 93–120.
Spolsky, B. 1995: Measured Words. Oxford: Oxford University Press.
Weir, C.J. 2005: Language testing and validation: an evidence based
approach. Basingstoke: Palgrave Macmillan.
Weir, C.J. and Milanovic, M., editors, 2003: Continuity and innovation: the history of the CPE 1913–2002. Cambridge: Cambridge University Press.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.


Appendix 1 Modified rating scales


Task A, read-aloud:

Rating          Interpretation
5 Excellent     Entirely intelligible pronunciation; very natural and correct
                intonation; the candidate speaks fluently with minimal hesitations.
4 Good          Generally intelligible pronunciation; generally natural and correct
                intonation; the candidate generally speaks fluently, though
                hesitations may sometimes occur.
3 Fair          Some errors in pronunciation and intonation influence
                comprehensibility; the candidate sometimes speaks fluently, though
                unnecessary hesitations still occur.
2 Poor          Many errors in pronunciation and intonation; the candidate
                sometimes gives up on reading words which he or she does not
                recognize; the candidate doesn't speak with ease; unnecessary
                hesitations occur frequently.
1 Very poor     The candidate has little ability to handle the task; the candidate
                doesn't speak with ease; unnecessary hesitations occur very
                frequently.

Task B, answering questions; Task C, picture description:

Rating          Interpretation
5 Excellent     Functions performed clearly and effectively; appropriate responses
                to questions; almost always accurate pronunciation, grammar,
                vocabulary, and fluency.
4 Good          Functions generally performed clearly and effectively; generally
                appropriate responses to questions; generally accurate
                pronunciation, grammar, vocabulary, and fluency.
3 Fair          Functions performed somewhat clearly and effectively; somewhat
                appropriate responses to questions; somewhat accurate
                pronunciation, grammar, vocabulary, and fluency.
2 Poor          Functions generally performed unclearly and ineffectively;
                generally inappropriate responses to questions; generally
                inaccurate pronunciation, grammar, vocabulary, and fluency.
1 Very poor     Functions always performed unclearly and ineffectively;
                inappropriate responses to questions; almost always inaccurate
                pronunciation, grammar, vocabulary, and fluency.

Appendix 2 Checklists of difficulty


Read-aloud
1. The topics of the texts are equally familiar to candidates.
2. The content (in terms of concreteness/abstractness) of the texts is the same.
3. The text type of the texts is the same.
4. The lexical items of the texts are equally familiar to candidates.
5. The grammatical structures of the texts are equally familiar to candidates.
6. The language functions of the texts are equally familiar to candidates.
7. The candidates should be able to complete the tasks within the time given for each.

Answering questions
1. The topics of the questions in the tests are equally familiar to candidates.
2. The content (in terms of concreteness/abstractness) of the questions in the
tests is the same.
3. The lexical items of the questions in the tests are equally familiar to
candidates.
4. The grammatical structures of the questions in the tests are equally
familiar to candidates.
5. The lexical items required to answer the questions in the tests are equally
familiar to candidates.
6. The grammatical structures required to answer the questions in the tests
are equally familiar to candidates.
7. The language functions elicited by the questions in the tests are equally
familiar to candidates.
8. The speed of delivery in the tests is the same.
9. The candidates should be able to complete the tasks within the time given
for each.

Picture description
1. The roles of people in the pictures are equally familiar to candidates.
2. The locations in the pictures are equally familiar to candidates.
3. The events in the pictures are equally familiar to candidates.
4. The objects in the pictures are equally familiar to candidates.
5. The objects and events to be described by candidates are equally visible in the pictures.
6. There are enough details (amount of information) in the pictures for the candidates to complete the task.
7. The lexical items required to describe the pictures are equally familiar to candidates.
8. The grammatical structures required to describe the pictures are equally familiar to candidates.
9. The language functions required to describe the pictures are equally familiar to candidates.
10. The candidates should be able to complete the tasks within the time given for each.