Bilingual dictionaries in tests
of L2 writing proficiency: do they
make a difference?
Martin East Unitec New Zealand

Language Testing 2007 24 (3) 331–353  DOI: 10.1177/0265532207077203  © 2007 SAGE Publications

Address for correspondence: Martin East, School of Language Studies, Unitec New Zealand,
Private Bag 92025, Auckland, New Zealand; email: meast@unitec.ac.nz

Whether test takers should be allowed access to dictionaries when taking L2
tests has been the subject of debate for a good number of years. Opinions dif-
fer according to how the test construct is understood and whether the under-
lying value system favours process-orientated assessment for learning, with
its concern to elicit the test takers’ best performance, or product-focused
assessment of learning, with its emphasis on the discriminatory power of the
test. One key study into bilingual dictionaries and writing tests (Hurman &
Tall, 1998) concluded that dictionary use improves test scores. This study
was influential in a recent decision to ban dictionaries in several high-stakes
L2 examinations in the UK. Research into dictionary use in reading tests has
suggested, however, that dictionary availability makes no statistically signif-
icant difference to test scores. This article presents the findings of a further
study into bilingual dictionary use in writing tests that also indicated no sig-
nificant difference to scores. Considering this finding alongside those of
studies other than Hurman and Tall’s casts some doubt on whether the UK
ban was fully justified.

I Introduction
One area of ongoing debate with regard to the testing of second or for-
eign languages (L2) has focused on whether test takers should be
allowed access to dictionaries when being assessed. On the one hand,
when use of a dictionary is perceived as an authentic and legitimate
activity in the context of L2 students’ learning and using the target lan-
guage, and when it is believed that language test performance should
correspond with language in actual use or represent ‘real world’ prac-
tice (Bachman & Palmer, 1996; Wiggins, 1989), it may be argued that
allowing dictionaries in tests is valid. On the other hand, when use of
a dictionary potentially obscures the information about the test takers’
linguistic ability available from the test (which is, after all, what the

test is for) because, for example, it provides information relevant to
what is being tested and thereby makes the test ‘too easy’, its use in
tests may be seen as contentious. A further dimension of complexity is
added by the concern to elicit test takers’ best performance through
testing procedures that, in the words of Swain (1985), ‘bias for best’
and ‘work for washback’. When the dictionary is perceived not only
as an authentic tool but also as a supportive tool, there is even more
reason to consider its use or non-use in tests carefully.
The debates have therefore been fuelled by different perspectives
on the purpose of the tests and different understandings of the con-
structs that underlie them. One significant example of these debates
has been a recent ban on the use of bilingual dictionaries in high-
stakes L2 examinations in the United Kingdom, which reversed an
earlier decision to allow dictionaries. The ban followed research into
the impact of bilingual dictionaries on beginner and pre-intermediate
writing in French (Hurman & Tall, 1998), which concluded that scores
improve in ‘with dictionary’ tests, raising questions about the valid-
ity of tests that allow their use.
With a view to contributing further to the ongoing debates, this
article reports a subsequent study into bilingual dictionary use. Using
a different target language (German) and a different level of test
taker (intermediate and above), the study sought to answer the fol-
lowing research question:
Does use of a dictionary make a difference to the quality of L2 test
takers’ writing, as measured by the test scores?
Two sub-questions were also investigated:
a. Does test takers’ level of language ability or level of prior ex-
perience with the dictionary make a difference to their perfor-
mance?
b. Does frequency of dictionary use make a difference to perfor-
mance in a ‘with dictionary’ writing test?
Implications for the validity of writing tests that allow the use of
bilingual dictionaries are then raised.

II Determining the validity of the test


Both the debates around assessment practices and the recent policy
reversal in the UK exemplify an apparent conflict between two oppos-
ing value systems. One is influenced by constructivist process-oriented
approaches to the curriculum that favour ‘dynamic’ assessment, a

means of assessment which is more learner-centred and flexible
(Luther, Cole, & Gamlin, 1996). This value system informs a theory of
measuring students’ attainments that has become known as ‘assess-
ment for learning’, an approach embedded within the teaching and
learning context that helps to move learners from a lower to a higher
level of proficiency in ways that enhance learner autonomy and learn-
er motivation (Assessment Reform Group, 2002a, 2002b; Cambridge
University, 1999). The other value system is rooted in traditional
behaviourist, product-oriented and knowledge-based approaches that
are concerned with the ‘static’ assessment of students’ learning –
summative tests that take place at a particular moment in time and pro-
vide a ‘snapshot’ of the test takers’ abilities. The focus of such tests is
on the tests’ discriminatory power: identifying different levels of test
taker ability and predicting future academic performance.
Dictionaries in writing tests may be viewed more positively by those
who support ‘dynamic’ assessment, and more negatively by those who
take a more traditional line. In the former, the communicative writ-
ing proficiency construct is defined more broadly as an authentic
reflection of writing as process, including the strategies normally
adopted to enhance the communicative effectiveness of the mes-
sages. As such, the use of a dictionary is arguably a valid part of the
construct. In the latter, the communicative writing proficiency con-
struct is understood more narrowly as ‘writing performance’ – a
snapshot of test takers’ writing proficiency at a particular point in
time that aims to measure test takers’ knowledge of key components
of the writing construct such as vocabulary and grammar. As such,
the use of a dictionary may lessen the validity of the test because it
may interfere with a reliable measure of, for example, vocabulary
knowledge. Views regarding the role of dictionaries in writing assess-
ments therefore differ depending on how the communicative writing
proficiency construct is defined. Tests that either factor in or factor
out their use may be equally valid, depending on the value system
and the understanding of the construct that underlies the test, but the
value systems are in tension.
When it comes to investigating whether dictionaries should be
included in L2 tests, test score evidence may provide essential in-
formation on which subsequent decisions for or against dictionary
availability in tests may be made, particularly in the case of high-
stakes tests with results which have significant consequences, posi-
tive or negative, for the stakeholders. Rivera and Stansfield (1998)
argue that one means of providing such test score information is
through ‘within-subjects’ studies that provide comparative evidence

from the same group of test takers on test performance in two test
conditions (for example, with and without a dictionary). Rivera and
Stansfield were wrestling with the question of how accommodations
for ‘limited English proficient’ (LEP) students might threaten the
validity of mainstream tests. They argue that comparative evidence
would help to establish if a given accommodation may be rightfully
endorsed. They suggest that if a comparative within-subjects study
leads to a finding of no significant difference in scores across two test
conditions for non-LEP test takers (those who do not require the
accommodation), this would mean that the accommodation does not
compromise the measurement of the construct.
Bearing in mind that Rivera and Stansfield’s argument relates to
constructs other than the L2 language ability of L2 language learners,
their suggested methodology may be helpful in investigating the use of
dictionaries in L2 tests. Messick (1989) suggests that test scores might
be affected by two major threats to a test’s validity. The first is con-
struct under-representation, whereby the tasks measured in the assess-
ment fail to include important dimensions or facets of the construct
and the test results are therefore unlikely to reveal the test takers’ true
abilities within the construct that was supposed to have been measured
by the test. The other threat to validity is construct irrelevant variance,
whereby the test measures variables that are not relevant to the inter-
preted construct, allowing certain test takers to score higher than they
would in normal circumstances (construct-irrelevant easiness), or po-
tentially leading to a notably lower score for some test takers (con-
struct-irrelevant difficulty). If no significant differences were found
in test scores derived from a comparative within-subjects study, this
would indicate that there was no significant threat to the construct va-
lidity of the test as a measure of test takers’ performance, regardless of
how the construct was understood. That is, the L2 test takers did not
‘require’ the dictionary to demonstrate their ability.
For those who believe that the dictionary is extraneous to the test
construct, such a finding would be reassuring because it could be inter-
preted as meaning that the dictionary was not a ‘confounding variable’
(Elder, 1997: 261) or source of construct irrelevant variance – the use
of the dictionary is not masking the measurement of the test takers’
abilities as reflected in the scores. No difference in scores might raise
questions for those who see the dictionary as a supportive part of
the construct being tested. The construct that underlies the test may
include the supportive use of the dictionary, but that support becomes
irrelevant if its availability does not lead to improved performance.
Nevertheless, where there is no evidence of notably lower scores in

the ‘without dictionary’ condition, the construct is not necessarily
being under-represented if the dictionary is not available (perfor-
mance, as measured by test scores, is not weaker).

III Hurman and Tall’s study into dictionaries in writing tests


Allowing dictionaries in L2 tests has recently been a contentious
issue in the United Kingdom. The UK’s Qualifications and Curriculum
Authority (QCA), the governing body that maintains and develops
the National Curriculum for UK schools and regulates its associated
examinations and assessments, had decided that from 1998 bilingual
dictionaries would be allowed in components of the UK’s General
Certificate of Secondary Education (GCSE) L2 examinations (the
first externally assessed school examination, taken by 15–16 year-
olds). This reflected an early expectation of the National Curriculum
that students’ independent learning should be developed through
the effective use of reference materials including dictionaries (DES,
1990). At the same time, the QCA commissioned a study to investi-
gate the difference that dictionaries might be making in the tests
(Hurman & Tall, 1998). This substantial study (n ⫽ 1330) used a com-
parative within-subjects design whereby participants were asked to
take two writing tests in French, designed to be as similar as possi-
ble, one with and one without a bilingual dictionary. The tests were
set at two levels (Foundation Tier and Higher Tier), commensurate
with the two-tier nature of the GCSE.
The research was concerned with two major effects:
1) The effect of using a dictionary on scores at both tiers.
2) The comparative effect of three different types of dictionary on
the scores awarded.
The study concluded that scores on the tests where dictionaries
were available were higher than where they were not, with an aver-
age increase of two marks, or 9%, at Foundation Tier, and two to three
marks, or 9%, at Higher Tier. The differences in scores were statisti-
cally significant (Tall & Hurman, 2002). The use of different diction-
aries appeared to have different effects on different types of question.
It could not be ruled out that some dictionary types were providing a
distinct advantage to some test takers. The findings
raised questions about the validity of the tests, and had considerable
consequences for testing practice (Birmingham University, 2000). By
2003 dictionaries were no longer allowed, either in the GCSE (the

focus of Hurman and Tall’s research) or at higher levels (including the
‘high stakes’ Advanced (A) level where bilingual dictionary access
had been introduced in one syllabus as early as 1978 (Bishop, 2000)).
Given the debates surrounding dictionaries in assessment, the pol-
icy volte-face apparently made on the basis of one piece of research
might be considered unwise or hasty. The research itself, and its con-
clusions, give rise to several questions. Several studies into diction-
aries in tests of the reading proficiency of learners of English as a
second or foreign language (Bensoussan et al., 1984; Nesi and Meara,
1991; Idstein, 2003) concluded that allowing dictionaries makes
no significant difference to test scores. Although it may be that the
nature of tests of reading proficiency is such that dictionary avail-
ability really makes no difference to performance, whereas in tests
of writing proficiency use of a dictionary has significant positive
impact, this discrepant finding alone indicates that further research is
needed to better establish the extent to which dictionary availability
does indeed have a differential effect across skills.
Of more concern, however, is that several aspects of Hurman and
Tall’s study may have led to the introduction of several confounding
factors for which it would have been better to control. Firstly, a within-
subjects investigation with the dictionary as an independent variable
would ideally use a design that was counterbalanced to account for
the potentially confounding effects of two other variables: order effect
(whether the ‘with dictionary’ test was taken first or second) and task
effect (whether the ‘with dictionary’ test was easier than the other for
reasons other than the availability of the dictionary). Hurman and Tall’s
experimental design may be critiqued with regard to these two effects:
• The majority of participants took the ‘without dictionary’ test
first. This arrangement may have led to a potential ‘practice ef-
fect’ that may have helped to improve performance on the second,
‘with dictionary’, test.
• Task effect was not taken into account in the testing process: each
test was only available in one test condition. It would therefore
not have been possible to isolate the impact of the dictionary
from the impact of any differences potentially inherent in the
tasks; test takers may simply have done better on the ‘with dic-
tionary’ tests because these tests were easier, and not counterbal-
ancing the tests would obscure an observation of this.1

1 Hurman and Tall appear to favour statistical adjustments over robustness of design. They assert
(1998: 30) that ‘order effect’ was a factor in a multivariate analysis of covariance. They suggest
from a pilot study (1998: 32) that the ‘with dictionary’ tests (all taken second) were harder, and
used this information to ‘adjust’ the score data from the main study. Nevertheless, Tall admits
(personal communication, 24 March, 2004) that the design was ‘criticised by some individuals on
the sponsor’s management committee on the grounds that students using the dictionary would gain
higher grades simply because they had gained experience taking the first examination’.
Secondly, aspects of the design may have had a confounding in-
fluence on the reliability of the scoring process. Raters were asked
to mark both ‘with dictionary’ and ‘without dictionary’ scripts, and
having only one set of ‘with dictionary’ tests meant the raters knew
which sets of scripts were written in which test condition. This may
have impacted on the way scripts were rated due to the influence of
rater expectations (Weigle, 1999) or an interaction between the
raters and the assessment (Lumley & McNamara, 1995): the raters
may have ‘expected’ the ‘with dictionary’ scripts to be better. In
addition, only one rater marked a given set of scripts, and only 10%
of each rater’s scripts were moderated by team leaders. Inter-rater
reliability checks were therefore limited, and there was no intra-
rater evidence to support consistency of scoring.
Although it is not possible to determine the extent to which these
circumstances may have contributed to differential scores across the
conditions, there is the possibility (although this is purely specula-
tive) that different scores were awarded on the ‘with dictionary’ tests
because ‘it is believed that dictionary use improves performance on
the examination’ (Spolsky, 2001: 4, my emphasis).
Furthermore, Hurman and Tall’s study had focused only on the
GCSE, and yet as a consequence dictionaries were removed from all
secondary school L2 examinations. It is difficult to assess the rele-
vance of Hurman and Tall’s findings for tests beyond the beginners
to pre-intermediate level targeted in the GCSE and, because the
research failed to address the impact of the bilingual dictionary on
higher levels of writing, it is debatable whether its findings should
therefore have prompted the change in policy that removed diction-
aries from higher level examinations such as the A level.
The study reported in this article therefore sought to further inves-
tigate the difference, if any, that bilingual dictionaries might make to
test scores, once more measuring the construct of writing proficien-
cy. The study used a within-subjects design and aimed to address
some of the weaknesses of the Hurman and Tall study. It was also
targeted at the higher level of writing proficiency not investigated
by Hurman and Tall.

IV Method
The study was conducted in September 2003 with students drawn
from 11 schools in New Zealand who were being prepared for the
high-stakes intermediate level Bursary German (A level equivalent)
examination in the forthcoming November. Although the sample size
was small (n = 47), the number of candidates taking Bursary German
in 2003 was also small (N = 366), and the sample represented just
under 13% of those being prepared for Bursary in just under 13% of
schools. Furthermore, although this was a convenience rather than a
random sample, the target population is very clearly defined (17–18
year old students in New Zealand secondary schools studying for the
Bursary German examination in 2003). The sample in this study can
therefore be regarded as largely representative of the population to
which any results might be generalized. Procedures used in the study
are reported below in some detail to facilitate subsequent replication.
The overall design incorporated four elements:
1) A 50-item multiple-choice placement test for learners of
German (Oxford University Language Centre, 2003). The on-
line test, which ranks language learners into six bands from
‘beginner’ to ‘advanced’, was used to provide an independent
measure of participants’ abilities for benchmarking purposes.
For ease of administration a hard copy pencil-and-paper version
of the test was created.
2) Two timed essays of 50 minutes, one taken with and one with-
out a bilingual dictionary.
3) Two short questionnaires, to be completed after each timed
essay. The questionnaires elicited information about test taker
perceptions and strategy use across the two conditions.
4) A longer questionnaire, to be completed after both essays and
the two shorter questionnaires. This contained both closed-ended
questions eliciting information about type of dictionary, prior
experience with the dictionary, dictionary use strategies and opi-
nions about dictionaries in writing tests, and open-ended questions
about perceived advantages and disadvantages of dictionary use
in writing tests.
The procedures, together with the use of a detailed scoring
rubric, were piloted through two small-scale studies. It was estab-
lished through the earlier studies that the 50-minute time-frame
was sufficient to elicit a rateable sample of language and that the
scoring rubric could be used successfully by raters despite its

length (see Appendix A). The study therefore drew on multiple
sources of information in its enquiry into the relative validity of
writing tests at the intermediate level that allow or disallow bilin-
gual dictionaries.
Following Rivera and Stansfield (1998), information regarding test
scores was considered as a primary source of evidence with which
to determine the impact on validity of the different testing condi-
tions, and this is the main focus of the present article. A summary
of other evidence is, however, provided at the end of the article.
Several measures were put in place both to ensure the reliability
of scores and to address some of the weaknesses noted from the
Hurman and Tall study.

1 Task type
Two essay tasks were devised to be as comparable as possible. This
was achieved by making both titles general, not requiring any spe-
cialist knowledge, but at the same time eliciting opinions from par-
ticipants about subjects about which they were all expected to have
something to say. The two titles were:

1) Sprachen in Neuseeland: „Alle Schüler sollen heutzutage min-
destens EINE Fremdsprache in der Schule lernen.“ Sind Sie
auch dieser Meinung?
(Languages in New Zealand: ‘These days all school students should
learn at least ONE foreign language in school’. Are you also of this
opinion?)

and
2) Der Massentourismus: „Viele Touristen aus vielen Ländern
besuchen heutzutage Neuseeland, und das ist gut für das Land.“
Was meinen Sie?
(Mass Tourism: ‘These days many tourists from many countries visit
New Zealand, and that is good for the country’. What do you think?)
To ensure that the two titles were as similar as possible in terms of
complexity of language, each was created to contain a similar num-
ber of words, and 18 of the words used in each title (90%) were taken
from a prescribed list of ‘basic’ vocabulary with which all the par-
ticipants should have been actively familiar by the time of taking the
Bursary examination.
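The comparability criterion described above is essentially a word-count check. As a purely illustrative sketch (in Python; the prescribed vocabulary list below is a hypothetical stand-in, not the list used in the study), the proportion of title words covered by a ‘basic’ list could be computed as follows:

    # Illustrative only: what proportion of an essay title's words appears in a
    # prescribed list of 'basic' vocabulary? The basic list below is invented.
    def basic_vocab_coverage(title_words, basic_list):
        """Proportion of title words found in the prescribed basic vocabulary list."""
        basic = {w.lower() for w in basic_list}
        hits = sum(1 for w in title_words if w.lower() in basic)
        return hits / len(title_words)

    title_2 = ["Viele", "Touristen", "aus", "vielen", "Ländern", "besuchen", "heutzutage",
               "Neuseeland", "und", "das", "ist", "gut", "für", "das", "Land"]
    hypothetical_basic_list = ["viele", "vielen", "aus", "besuchen", "heutzutage",
                               "und", "das", "ist", "gut", "für", "Land"]
    print(f"coverage: {basic_vocab_coverage(title_2, hypothetical_basic_list):.0%}")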


Table 1 Schedule for the study

Component                  Time allowed

Placement test             Up to 30 minutes
First essay                50 minutes
1st short questionnaire    5 minutes
Second essay               50 minutes
2nd short questionnaire    5 minutes
Longer questionnaire       10 minutes
Total                      150 minutes

2 Order and task effect


Table 1 presents the schedule for the study.
So that any changes in performance could be considered, as far as
possible, to be due to the intervention of the independent variable
(availability of a bilingual dictionary), it was important to account
for any order effect (participants doing better on the second task
because they had had practice with the first), or task effect (partici-
pants doing better on one task than another because they found it
easier). To control for any differences in performance that might be
due to these potentially confounding effects, participants were placed
into one of four groups, as illustrated in Table 2.
From an original sample of 61 students, 15 participants were allo-
cated to each group (Group 4 had 16) in such a way as to ensure that,
as far as possible, participants from each school were distributed
among the four groups. In practice, due to attrition, final allocations
to groups were somewhat uneven. The final allocations are presented
in Table 3.
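A minimal sketch of such an allocation, assuming a simple round-robin within each school (an illustration only, not the procedure actually used in the study), is given below:

    # Hypothetical allocation sketch: cycle each school's students through groups 1-4
    # so that task order and dictionary condition are counterbalanced within schools.
    from itertools import cycle

    def allocate(participants_by_school, n_groups=4):
        """Return {participant: group} with each school's students spread across groups."""
        allocation = {}
        for school, students in participants_by_school.items():
            for student, group in zip(students, cycle(range(1, n_groups + 1))):
                allocation[student] = group
        return allocation

    example = {"School A": ["A1", "A2", "A3", "A4", "A5"], "School B": ["B1", "B2", "B3"]}
    print(allocate(example))  # {'A1': 1, 'A2': 2, 'A3': 3, 'A4': 4, 'A5': 1, 'B1': 1, ...}

Attrition after this allocation, as noted above, accounts for the uneven final group sizes shown in Table 3.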
Participants were given freedom to choose the dictionary they
wanted to use, and were encouraged to use the one with which they
had previously developed a level of familiarity. In all, five different
types of bilingual dictionary were used, ranging in size from around
40,000 to around 85,000 entries.

Table 2 Writing task order

Group    First task                    Second task

1        Task 1 with dictionary        Task 2 without dictionary
2        Task 1 without dictionary     Task 2 with dictionary
3        Task 2 with dictionary        Task 1 without dictionary
4        Task 2 without dictionary     Task 1 with dictionary


Table 3 Number of participants per group

Group Number of participants

1 10
2 11
3 14
4 12
Total 47

3 Rating the tasks


In addition to the use of a detailed scoring rubric other measures
were put in place to provide accurate and reliable test scores. Two in-
dependent raters, both experienced with Bursary examining and both
with extensive experience of German teaching at secondary level,
were invited to rate the tasks. The rating was done in one sitting as a
controlled reading, with the aim of creating a positive social envi-
ronment to enforce and maintain the rating standards (Weigle, 2002;
White, 1985).
Essays had been transcribed as word processor files identifiable
only by a three-digit index number randomly assigned to each
essay. Raters were not informed which essays had been written in
which condition, and counterbalancing lessened any opportunity
for them to guess the test condition with any accuracy. Parti-
cipants had been asked to underline in their responses words
looked up in the dictionary,2 but this information was removed
from the essays presented to the raters. Raters had access to their
own copies of the essays, and were therefore free to annotate them
as they wanted.
Raters were given a brief ‘Rater Training Manual’ a few days
before taking part in the controlled reading. This contained the fol-
lowing:
1) A copy of the scoring rubric.
2) Instructions for how the rating process would be carried out.
3) Four scored essays, arising from the second prior study and ex-
emplifying different levels of performance.

2 Underlining was used to enable quantitative and qualitative analyses of how participants used the
dictionary. Although not directly relevant to the findings explored in this article, it was interesting
to observe a wide range of look-up activity, from no look-ups (in one case) to 32 (M = 9, SD = 5.7).
On average, lower ability participants made more use of the dictionary than higher ability partici-
pants; the advanced participants used the dictionary the least according to this measure.

At the beginning of the controlled reading, the scoring rubric was
compared to the four sample essays. Then the raters were presented
with four more sample essays, again taken from the second study, in
which the scores had been omitted. They were asked to rate each
essay, one at a time, according to the criteria. Raters were required
to provide a score (from 0 to 7) for each of the five categories of the
rubric. After scoring each essay, the raters revealed their scores, and
disagreements and discrepancies were discussed with reference to
the scoring criteria.
For the ‘live’ rating the following steps were followed:
1) Five responses to the first task were scored independently.
Scores awarded by the raters were then shared and discussed,
and, by negotiation, discrepancies were adjusted.
2) The raters then scored the remaining Task 1 essays in batches of
ten to twelve. It was found that the raters worked on these at dif-
ferent speeds, and it was not possible to stop to discuss each
batch. Batches were therefore handed to the researcher on com-
pletion and raters’ scores across the five categories were noted.
3) Whenever there was a discrepancy of more than two marks in
any given category, the essay was put aside to be revisited and
discussed at the end of the day.
4) A similar procedure was followed for the Task 2 essays.
To provide some measure of the intra-rater reliability of the scor-
ing, the two raters were invited back about a month after the original
rating to re-rate 16 essays (17% of the original sample). On this
occasion, raters were asked to rate two sample scripts before pro-
ceeding with the rating, in one sitting, of the 16 essays. As with the
first rating, discrepancies of more than two marks were discussed at
the end of the rating session.
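As an illustration of the discrepancy check described above, the following sketch (hypothetical category names and scores; not the tooling actually used in the study) flags an essay whenever the two raters' category scores differ by more than two marks:

    # Hypothetical rubric categories and scores; the real rubric is in Appendix A.
    CATEGORIES = ["cohesion", "lexis", "grammar", "mechanics", "register"]

    def flag_for_discussion(rater1, rater2, threshold=2):
        """Return the categories where the raters' 0-7 scores differ by more than
        `threshold` marks, signalling that the essay should be revisited."""
        return [c for c in CATEGORIES if abs(rater1[c] - rater2[c]) > threshold]

    r1 = {"cohesion": 5, "lexis": 4, "grammar": 6, "mechanics": 5, "register": 4}
    r2 = {"cohesion": 5, "lexis": 6, "grammar": 3, "mechanics": 5, "register": 4}
    print(flag_for_discussion(r1, r2))  # ['grammar'] -> set aside for end-of-day discussion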

V Results
Two sets of final scores were calculated for each essay (one for each
rater) out of 35 (7 × 5). In cases where component scores were ad-
justed after discussion, adjusted scores were used. Coefficients for
both the inter-rater and intra-rater reliability of the scores awarded
were also calculated. The inter-rater reliability coefficient for the
original two sets of independent total scores was r = .86, n = 94,
p ≤ .001 (two-tailed). This was considered to be an acceptably high
level of correlation, commensurate with the minimum requirements

for reliable scoring suggested by Hatch and Lazaraton (1991: 441)
and Hamp-Lyons (1990: 69). Intra-rater reliability coefficients were
as follows: rater 1: r = .84, n = 16, p = .001 (two-tailed); rater 2:
r = .71, n = 16, p = .002 (two-tailed). The consistency of scoring
for the second rater across both occasions was therefore somewhat
weak. Nevertheless the correlation between the final reported scores
on the 16 essays from the first and second ratings was r = .87,
n = 16, p = .001 (two-tailed). This indicated an acceptably high
level of reliability between the reported total scores from the first
and second ratings. The final reported scores were determined by
the average of the two independent sets of scores from the first rat-
ing session.
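The coefficients reported above are r values, presumably Pearson product-moment correlations between pairs of total scores. A hedged sketch of the calculation, using scipy with placeholder scores rather than the study's data, might look like this:

    # Placeholder totals (out of 35) for the same essays from the two raters.
    from scipy.stats import pearsonr

    rater1_totals = [22, 28, 17, 31, 25]
    rater2_totals = [20, 29, 15, 30, 24]

    r, p = pearsonr(rater1_totals, rater2_totals)  # Pearson r with a two-tailed p value
    print(f"inter-rater reliability: r = {r:.2f}, p = {p:.3f}")

    # Final reported score per essay = mean of the two raters' totals
    final_scores = [(a + b) / 2 for a, b in zip(rater1_totals, rater2_totals)]

An analogous correlation of each rater's first- and second-occasion totals on the 16 re-rated essays would give the intra-rater coefficients.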
To address the main research question, the final scores were com-
pared across the two test conditions. A comparison of mean scores
revealed that there appeared to be no meaningful difference between
scores in the ‘without dictionary’ and ‘with dictionary’ conditions
(Ms = 22.6 and 22.7 respectively; SDs = 6.1 and 6.1 respectively).
The results of a paired samples t-test indicated no statistically sig-
nificant difference between the two sets of scores (t (46) = .279,
p = .781, two-tailed).
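For readers wishing to reproduce this kind of comparison, a minimal sketch using scipy's paired-samples t-test follows; the two lists are placeholders holding one 'without dictionary' and one 'with dictionary' total (out of 35) per participant, in the same participant order:

    from scipy.stats import ttest_rel

    without_dict = [22.5, 19.0, 30.5, 15.0, 27.5]  # placeholder 'without dictionary' totals
    with_dict    = [23.0, 18.5, 31.0, 16.5, 26.5]  # placeholder 'with dictionary' totals

    t, p = ttest_rel(with_dict, without_dict)      # two-tailed paired-samples t-test
    print(f"t({len(with_dict) - 1}) = {t:.3f}, p = {p:.3f}")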
The standard deviations indicated, however, that (as might be
expected) a wide range of performance was observable. Indeed the
scores awarded ranged from 13/35 to 35/35. It was decided to ex-
plore whether level of ability (as measured by participants’ perfor-
mance on the placement test) and amount of prior experience with
using a dictionary (as judged by participant responses to a question
on the longer questionnaire – ‘how often have you used this diction-
ary before today?’3) made any significant difference to performance
across the conditions. Table 4 presents the numbers of participants
in each group.

Table 4 Number of participants by level of ability and level of experience

Level of ability       n     Level of experience     n

Lower intermediate     5     Very inexperienced      10
Intermediate           22    Quite inexperienced     15
Upper intermediate     16    Quite experienced       13
Advanced               4     Very experienced        9

3 This question provided four levels of response that were subsequently translated into the four levels
of experience given in Table 4.


Table 5 Final scores by level of ability

Ability                       Test condition
                              ‘Without’    ‘With’

Lower intermediate     M      14.9         16.2
                       SD     1.8          2.2
Intermediate           M      20           20.1
                       SD     3.1          3.1
Upper intermediate     M      25.8         25.9
                       SD     5.4          5.8
Advanced               M      33           32.3
                       SD     1.1          5.2

Level of ability and amount of prior experience with the dictionary
were subsequently used as factors when further investigating the test
scores to address the first research sub-question.
Table 5 records the final overall mean scores according to each
level of ability.
These descriptive statistics indicate the following:
The dictionary appeared to help the lower intermediate partici-
pants (n = 5) to improve their overall score by just over one mark
when they had access to the dictionary (14.9 and 16.2), whereas
advanced participants (n = 4) did slightly worse, by just under one
mark (33 and 32.3).4 The very small sizes of these sub-groups make
the differences essentially meaningless.
The dictionary apparently made no real difference to test scores
for the other levels of ability:
• Intermediate (n = 22) – 20 and 20.1
• Upper intermediate (n = 16) – 25.8 and 25.9.
Table 6 records the final mean scores according to prior experience
with the dictionary.
These descriptive statistics suggest that experience with the dic-
tionary, viewed independently of level of ability, did not appear to be
making any real difference to test takers’ performance.

4 The wide standard deviation for the advanced ‘with dictionary’ scores may be explained by the
fact that one advanced participant failed to complete the ‘with dictionary’ task adequately, thereby
receiving a considerably lower overall score than might otherwise have been expected. Indeed, an
analysis of the final scores revealed widely differential performance by two participants (one
advanced and one upper intermediate). The implications of this are raised in Appendix B.


Table 6 Final scores by level of experience

Experience                     Test condition
                               ‘Without’    ‘With’

Very inexperienced      M      22.9         23.8
                        SD     5.1          5.4
Quite inexperienced     M      23.8         22
                        SD     5.4          5.4
Quite experienced       M      21.7         23.5
                        SD     7.4          7.5
Very experienced        M      21.4         21.5
                        SD     6.7          6.1

To determine whether level of ability or level of prior experience
with the dictionary were factors contributing to significant differ-
ences in scores, a two-way repeated measures analysis of variance
(ANOVA) was carried out. It was noted that in two cases there were
five or fewer observations per cell, indicating that any marginal result
would need to be treated with caution (Hatch & Lazaraton, 1991).5
Results of the ANOVA are given in Table 7.

Table 7 Analysis of variance

Interaction                                            df    F        p        Partial η2

Between subjects
Level of ability                                       3     16.891   .000**   .598
Level of experience                                    3     .299     .826     .026
Level of ability * level of experience                 6     .492     .809     .080

Within subjects
Condition (a)                                          1     .050     .824     .001
Condition * level of ability                           3     .683     .569     .057
Condition * level of experience                        3     3.037    .042*    .211
Condition * level of ability * level of experience     6     .701     .650     .110

Notes: (a) Condition: the final awarded score in both test conditions; **p < .01; *p < .05.

5 The finding of no significant difference obtained from the t-test indicated, however, that it was
unlikely that the ANOVA would reveal contrary results.

A within-subjects interaction between condition and level of expe-
rience was found to be significant. A post hoc analysis of the ANOVA
using Fisher’s LSD test found, however, that no level of experience
was significantly different from any other. It may therefore be that this
statistically significant interaction was anomalous and a result of the
relatively small sample size. Level of ability was found to be a highly
significant between-subjects factor. The post hoc test indicated that
each level of ability was significantly different from the others.
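A simplified, hedged sketch of this kind of mixed-design analysis is given below. It assumes the pingouin package and, for brevity, crosses only test condition (within subjects) with level of ability (between subjects); the data frame is hypothetical, and the full analysis reported in Table 7 also included level of experience.

    # Hypothetical long-format data: one row per participant per test condition.
    import pandas as pd
    import pingouin as pg

    df = pd.DataFrame({
        "subject":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
        "condition": ["without", "with"] * 6,                 # within-subjects factor
        "ability":   ["intermediate"] * 6 + ["upper"] * 6,    # between-subjects factor
        "score":     [20, 21, 19, 20, 22, 21, 26, 25, 27, 28, 24, 26],  # totals out of 35
    })

    aov = pg.mixed_anova(data=df, dv="score", within="condition",
                         subject="subject", between="ability")
    print(aov)

Reproducing Table 7 in full would require a routine that accepts two between-subjects factors.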
To investigate the extent to which level of ability and level of
experience contributed to ‘with dictionary’ scores, a linear regression
was also calculated. This calculation also took into account the num-
ber of look-ups made by each participant in order to address the sec-
ond research sub-question. Look-ups were determined by items test
takers underlined in their responses. The purpose of the linear regres-
sion was to determine which of several factors were having an effect
on outcomes, and therefore to predict the factors that were most
likely to contribute to outcomes and predict performance on ‘with
dictionary’ tests. Table 8 records the results of the linear regression.

Table 8 Linear regression analysis of scores in the ‘with dictionary’ condition

Variable             B        SE B    β        t        p

Ability              5.589    .817    .735     6.843    .000*
Experience           .425     .629    .073     .675     .503
No. of look-ups      −.024    .107    −.023    −.221    .826

Notes: *p < .01; R = .724; R² = .524; Adjusted R² = .491.

These results confirm those of the ANOVA. The only significant
predictor of outcomes when writing with the dictionary was level of
ability, with the coefficient indicating a positive relationship between
level of ability and scores.
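A hedged sketch of a regression of this kind, using statsmodels with ability and experience coded as ordinal predictors and look-ups as a count (all values below are placeholders, not the study's data):

    import pandas as pd
    import statsmodels.api as sm

    data = pd.DataFrame({
        "ability":    [1, 2, 2, 3, 3, 4, 1, 2, 3, 4],     # 1 = lower intermediate ... 4 = advanced
        "experience": [1, 2, 3, 4, 2, 3, 1, 4, 2, 3],     # 1 = very inexperienced ... 4 = very experienced
        "look_ups":   [12, 9, 7, 5, 10, 2, 15, 8, 6, 3],  # dictionary look-ups in the 'with' condition
        "with_score": [16, 20, 21, 26, 25, 33, 15, 22, 27, 31],  # 'with dictionary' totals out of 35
    })

    X = sm.add_constant(data[["ability", "experience", "look_ups"]])
    model = sm.OLS(data["with_score"], X).fit()
    print(model.summary())  # unstandardized B, standard errors, t and p values, R-squared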

VI Discussion
The most important finding of this study was that the availability of
the bilingual dictionary made no significant difference to ‘with dic-
tionary’ writing test scores in comparison with ‘without dictionary’
scores, a finding commensurate with previous studies of reading tests.
This means that, first, the dictionary did not appear to make any dif-
ference to the tests as reliable measures of the construct of writing
proficiency as operationalised by the scoring rubric; there was con-
sistency of measurement across the two conditions. Also, the dic-
tionary was not found to contribute substantially to the two threats to
construct validity identified by Messick (1989). Participants were

able to perform equally well with or without the dictionary, and the
construct validity of the test was not threatened, either by construct
under-representation when the dictionary was not available, or by
construct irrelevant variance when it was.
The finding stands in contrast to that of Hurman and Tall (1998).
Bearing in mind the consequences of Hurman and Tall’s research for
policy decisions in the United Kingdom, the contrast gives cause for
some concern. The study reported here is small in scale, and infer-
ences drawn from the data may therefore be challenged on that basis.
Nevertheless, the finding of no significant difference to scores, when
considered alongside findings of research into reading tests, is suffi-
cient to cast some doubt on the validity of a blanket decision to
remove dictionaries from all levels of examination in the UK because,
in one type and at one level of examination, the availability of a dic-
tionary ‘clearly increased the mean scores’ (Hurman & Tall, 2002:
26). Certainly Spolsky (2001: 4), in critiquing the argument that
dictionaries will necessarily improve performance, asserts that ‘all
empirical studies so far contradict this argument, and suggest that the
belief is wrong’. The study reported here, in common with studies
other than Hurman and Tall’s, adds some weight to the argument that
the belief is indeed wrong.
Secondarily, the findings revealed that, with regard to ‘with dic-
tionary’ tests, neither level of prior experience with the dictionary
nor frequency of use of the dictionary in the test made any signifi-
cant difference to performance as measured by the scores. Although
prior experience with the dictionary is not necessarily the same as
prior training in the use of the dictionary, it certainly appears that
neither previous exposure to using a dictionary nor the number of
times the dictionary is used is a significant predictor of performance
in ‘with dictionary’ writing tests. This might lead to a conclusion,
particularly by test setters who are concerned that the test should
‘bias for best’ and ‘work for washback’, that there is no need to be
overly concerned either that substantial previous experience with the
dictionary or that using the dictionary more frequently will con-
tribute significantly to test takers’ performance. It should be noted,
however, that levels of experience with using the dictionary were
measured by one closed-ended question on the longer questionnaire
which provided four levels of response – never / once or twice /
fairly often / quite frequently. These were subsequently translated
into the four levels of experience used in the analysis. It is acknowl-
edged that this means of determining prior experience with the dic-
tionary is of limited validity, and a more careful exploration of the

interaction between dictionary experience, dictionary use training
and test performance might lead to different conclusions.

VII Conclusion and implications for further research


The finding of no significant difference in test scores across the two
testing conditions is an important one that has implications for the
validity of writing tests that admit or disallow the use of dictionaries.
This is because, as previously stated, the use of the dictionary is not
masking the measurement of the test takers’ abilities as reflected in the
scores – reassuring for those who would outlaw the dictionary on the
basis that its use may be contributing to construct-irrelevant differ-
ences in performance. The construct is also not necessarily being
under-represented if the dictionary is not available – reassuring for
those who would like to see it included on the basis that it is part of the
construct. Bachman (1990; 2000) and Messick (1989; 1996), howev-
er, move beyond a conceptualisation of construct validity that focuses
solely on the test and interpretations of scores. Construct validity must,
for example, also take account of the impact of the scores on the test
takers and therefore whether the scores were affected by facets of the
test procedure that biased the test against some test takers. As
Bachman (2000: 23) suggests, ‘investigating the construct validity of
interpretations without also considering values and consequences is a
barren exercise inside the psychometric test-tube, isolated from the
real-world decisions that need to be made and the societal, political
and educational mandates that impel them’. Messick (1989: 21) argues
that ‘construct validity binds social consequences of testing to the evi-
dential basis of test interpretation and use’.
There may well be good reasons, other than the impact on scores, to
make the decision either to outlaw or allow the use of the dictionary.
These other factors may need to be considered because their impact
on the testing procedure may affect the construct validity of the tests.
In the study reported here, questionnaire evidence revealed contrast-
ing perceptions. Participants expressed a wider range of disadvantages
than advantages to having the dictionary. The most commonly cited
advantage was the ability to check or find words participants did not
know or were not sure about (98%). It was also, according to 47%, an
advantage to be able to check related grammar information or spelling.
On the other hand, the most frequently identified disadvantage was
that using the dictionary takes too long in an examination (68%). For
51% of the participants there was a sense that its use distracted from a

student’s ‘real’ knowledge, based on a perception that the test exists to
measure what the test takers had previously learnt and retained. There
was a sense in which having the dictionary thereby created an ‘unfair’
dimension in that the test was no longer testing what a test taker ‘really
knew’. Although more participants preferred having the dictionary to
not having it (40% compared to 21%), around four out of ten (38%)
were in fact equally happy with or without. A sizeable number felt
more confident with the dictionary (62%), although by contrast many
thought it was fairer to be assessed without the dictionary (66%).
In studies such as the one reported here where it is found that there
is no significant difference in test scores across test conditions, such
test taker perceptions may be one useful source of evidence in inform-
ing a decision to include or exclude the dictionary from tests. When a
significant difference in scores is found, evidence from other sources
(whether supporting or challenging the use of dictionaries in tests)
becomes imperative. At the very least the findings of this research into
dictionaries and impact on test scores, when considered alongside
those of others, is sufficient to indicate that the issue of allowing or
outlawing dictionaries in tests is one that requires further investigation.

VIII References
Assessment Reform Group 2002a: Assessment for learning: 10 principles.
Retrieved 26 October 2004 from http://www.assessment-reform-group.org.uk
—— 2002b: Testing, motivation and learning. Cambridge: University of
Cambridge Faculty of Education.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
—— 2000: Modern language testing at the turn of the century: Assuring that
what we count counts. Language Testing 17, 1–42.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice:
Designing and developing useful language tests. Oxford: Oxford
University Press.
Bensoussan, M., Sim, D. and Weiss, R. 1984: The effect of dictionary usage
on EFL test performances compared with student and teacher attitudes
and expectations. Reading in a Foreign Language 2, 262–76.
Birmingham University 2000: Review 2000. Retrieved 1 April 2003 from
http://www.bham.ac.uk/publications/review2000/research.html.
Bishop, G. 2000: Dictionaries, examinations and stress. Language Learning
Journal 21, 52–65.
Cambridge University 1999: Assessment for learning: Beyond the black box.
Cambridge: University of Cambridge School of Education.

Canale, M. and Swain, M. 1980: Theoretical bases of communicative approaches
to second language teaching and testing. Applied Linguistics 1, 1–47.
DES 1990: MFL for ages 11–16. London: Department of Education and Science.
Elder, C. 1997: What does test bias have to do with fairness? Language Testing
14, 261–77.
Hamp-Lyons, L. 1990: Second language writing: Assessment issues. In
Kroll, B., editor, Second language writing: Research insights for the
classroom. Cambridge: Cambridge University Press, 69–87.
Hatch, E. and Lazaraton, A. 1991: The research manual: Design and statis-
tics for applied linguistics. Boston, MA: Heinle and Heinle.
Hurman, J. and Tall, G. 1998: The use of dictionaries in GCSE modern for-
eign languages written examinations (French). Birmingham: University
of Birmingham School of Education.
—— 2002: Quantitative and qualitative effects of dictionary use on written
examination scores. Language Learning Journal 25, 21–26.
Idstein, B. 2003: Dictionary use during reading comprehension tests: An aid
or a diversion? Unpublished doctoral dissertation, Indiana University of
Pennsylvania, Pennsylvania.
Jacobs, H.L., Zinkgraf, S.A., Wormuth, D.R., Hartfiel, V.F., and Hughey,
J.B. 1981: Testing ESL composition: A practical approach. Rowley, MA:
Newbury House.
Lumley, T. and McNamara, T. 1995: Rater characteristics and rater bias:
Implications for training. Language Testing 12, 54–71.
Luther, M., Cole, E. and Gamlin, P., editors, 1996: Dynamic assessment for
instruction: From theory to applications. North York, ON, Canada: Captus
Press.
Messick, S. 1989: Validity. In Linn, R.L., editor, Educational measurement,
third edition. New York: Macmillan, 13–103.
—— 1996: Validity and washback in language testing. Language Testing 13,
241–56.
Nesi, H. and Meara, P. 1991: How using dictionaries affects performance in
multiple-choice EFL tests. Reading in a Foreign Language 8, 631–43.
Oxford University Language Centre 2003: Online tests. Retrieved 3 February
2003 from http://www.lang.ox.ac.uk/placement.html
Rivera, C. and Stansfield, C.W. 1998: Leveling the playing field for English
language learners: Increasing participation in state and local assessments
through accommodations. Retrieved 4 August 2004 from http://ceee.gwu.edu/standards_assessments/researchLEP_accommodintro.htm
Spolsky, B. 2001: A note on the use of dictionaries in examinations. Unpublished
paper.
Swain, M. 1985: Large-scale communicative language testing: A case study.
In Savignon, S. and Burns, M., editors, Initiatives in communicative
language teaching. Reading, MA: Addison-Wesley, 185–201.
Tall, G. and Hurman, J. 2002: Using dictionaries in modern languages GCSE
examinations. Educational Review 54, 205–17.
Weigle, S.C. 1999: Investigating rater/prompt interactions in writing assessment:
Quantitative and qualitative approaches. Assessing Writing 6, 145–78.
–––– 2002: Assessing writing. Cambridge: Cambridge University Press.

White, E.M. 1985: Teaching and assessing writing. San Francisco, CA:
Jossey-Bass.
Wiggins, G. 1989: A true test: Towards more authentic and equitable assess-
ment. Phi Delta Kappan 70(9), 703–13.

Appendix A: The scoring rubric


Three principles guided the development of the scoring rubric used
in the study:
1. Canale and Swain’s (1980) conceptual framework of commu-
nicative competence underpins the categories of the rubric.
2. The rubric draws on that designed by Jacobs et al. (1981), whose
‘ESL Composition Profile’ contained several clearly articulated
scales for scoring different facets of writing.
3. The rubric also draws on that suggested by one of the major
examining boards in the United Kingdom for use in the Advanced
Subsidiary and Advanced level writing examinations.

Raters could award up to seven marks across five categories:

1) Cohesion, coherence and rhetorical organization
2) Knowledge of lexis, idiomatic expressions; functional knowledge
3) Grammatical competence
4) ‘Mechanics’: spelling and punctuation
5) Knowledge of register and varieties of language; knowledge of
   cultural references (where appropriate)

Each category was scored on the following band scale:

7 Excellent
6 Very Good
5 Good
4 Satisfactory
3 Fair
2 Poor
1 Weak
0 No rewardable material

The rubric itself provided detailed descriptors for each category and
each score band. Grammatical competence, for example, could be
measured according to the detailed criteria shown below. These descriptors
enabled the raters to differentiate between levels of performance across
the categories with a high level of sensitivity and provided the test tak-
ers with feedback that would help them with their subsequent work (a
copy of the full rubric is available from the author).


Grammatical competence

A score of 7
• errors are only of a very minor nature
• very few errors of agreement, tense, number, gender, word order,
  function, articles, pronouns, prepositions.

A score of 6
• errors are generally of a minor nature
• a few errors of agreement, tense, number, gender, word order,
  function, articles, pronouns, prepositions.

A score of 5
• simple constructions are used accurately,
• minor problems with complex constructions,
• several errors of agreement, tense, number, gender, word order,
  function, articles, pronouns, prepositions.
• these are mainly of a less serious nature.
• the meaning is very rarely obscured.

Appendix B: Accounting for outliers


A boxplot of score differences across the two conditions revealed that
in the vast majority of cases (n = 45), the range of score differences
was at most around ±5 marks, indicating a potential average differ-
ence of ±1 mark across each of the five criteria of the scoring rubric.
It was noted, however, that there were two outliers where perfor-
mances across the two test conditions were markedly different, and
it was acknowledged that these outliers were skewing the data and
might have arisen from variables other than the independent variable
under investigation. The researcher experimented with removing
these two outliers from the data set and performing a paired-samples
t-test on the remaining sets of scores. The result of this t-test was:
t (44) = 1.374, p = .176, two-tailed. (The result of the t-test for all
sets of scores was: t (46) = .279, p = .781, two-tailed.) No significant
difference was therefore indicated, whether the outliers were
included or excluded.
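A minimal sketch of this procedure, with placeholder scores and an illustrative ±5-mark cut-off standing in for the 'markedly different' performances identified from the boxplot:

    import numpy as np
    from scipy.stats import ttest_rel

    without_dict = np.array([22.5, 19.0, 30.5, 15.0, 27.5, 33.0, 24.0])  # placeholders
    with_dict    = np.array([23.0, 18.5, 31.0, 16.5, 26.5, 21.0, 24.5])

    diff = with_dict - without_dict
    keep = np.abs(diff) <= 5  # set aside markedly discrepant performances

    t_all, p_all = ttest_rel(with_dict, without_dict)
    t_trim, p_trim = ttest_rel(with_dict[keep], without_dict[keep])
    print(f"all cases:        t = {t_all:.3f}, p = {p_all:.3f}")
    print(f"outliers removed: t = {t_trim:.3f}, p = {p_trim:.3f}")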
