The Use of Dictionary in Writing Assessment
http://ltj.sagepub.com
Published by:
http://www.sagepublications.com
Additional services and information for Language Testing can be found at:
Citations http://ltj.sagepub.com/cgi/content/refs/24/3/331
Downloaded from http://ltj.sagepub.com at MICHIGAN STATE UNIV LIBRARIES on April 18, 2010
Bilingual dictionaries in tests
of L2 writing proficiency: do they
make a difference?
Martin East Unitec New Zealand
I Introduction
One area of ongoing debate with regard to the testing of second or foreign languages (L2) has focused on whether test takers should be allowed access to dictionaries when being assessed. On the one hand, when use of a dictionary is perceived as an authentic and legitimate activity in the context of L2 students’ learning and using the target language, and when it is believed that language test performance should correspond with language in actual use or represent ‘real world’ practice (Bachman & Palmer, 1996; Wiggins, 1989), it may be argued that allowing dictionaries in tests is valid. On the other hand, when use of a dictionary potentially obscures the information about the test takers’ linguistic ability available from the test (which is, after all, what the
Address for correspondence: Martin East, School of Language Studies, Unitec New Zealand,
Private Bag 92025, Auckland, New Zealand; email: meast@unitec.ac.nz
from the same group of test takers on test performance in two test
conditions (for example, with and without a dictionary). Rivera and
Stansfield were wrestling with the question of how accommodations
for ‘limited English proficient’ (LEP) students might threaten the
validity of mainstream tests. They argue that comparative evidence
would help to establish if a given accommodation may be rightfully
endorsed. They suggest that if a comparative within-subjects study
leads to a finding of no significant difference in scores across two test
conditions for non-LEP test takers (those who do not require the
accommodation), this would mean that the accommodation does not
compromise the measurement of the construct.
Bearing in mind that Rivera and Stansfield’s argument relates to
constructs other than the L2 language ability of L2 language learners,
their suggested methodology may be helpful in investigating the use of
dictionaries in L2 tests. Messick (1989) suggests that test scores might be affected by two major threats to a test’s validity. The first is construct under-representation, whereby the tasks measured in the assessment fail to include important dimensions or facets of the construct, and the test results are therefore unlikely to reveal the test takers’ true abilities within the construct that was supposed to have been measured by the test. The other threat to validity is construct-irrelevant variance, whereby the test measures variables that are not relevant to the interpreted construct, allowing certain test takers to score higher than they would in normal circumstances (construct-irrelevant easiness), or potentially leading to a notably lower score for some test takers (construct-irrelevant difficulty). If no significant differences were found in test scores derived from a comparative within-subjects study, this would indicate that there was no significant threat to the construct validity of the test as a measure of test takers’ performance, regardless of how the construct was understood. That is, the L2 test takers did not ‘require’ the dictionary to demonstrate their ability.
For those who believe that the dictionary is extraneous to the test construct, such a finding would be reassuring because it could be interpreted as meaning that the dictionary was not a ‘confounding variable’ (Elder, 1997: 261) or source of construct-irrelevant variance; the use of the dictionary is not masking the measurement of the test takers’ abilities as reflected in the scores. No difference in scores might, however, raise questions for those who see the dictionary as a supportive part of the construct being tested. The construct that underlies the test may include the supportive use of the dictionary, but it becomes irrelevant if its availability interferes with an improved performance. Nevertheless, where there is no evidence of notably lower scores in
1 Hurman and Tall appear to favour statistical adjustments over robustness of design. They assert (1998: 30) that ‘order effect’ was a factor in a multivariate analysis of covariance. They suggest from a pilot study (1998: 32) that the ‘with dictionary’ tests (all taken second) were harder, and used this information to ‘adjust’ the score data from the main study. Nevertheless, Tall admits (personal communication, 24 March 2004) that the design was ‘criticised by some individuals on the sponsor’s management committee on the grounds that students using the dictionary would gain higher grades simply because they had gained experience taking the first examination’.
IV Method
The study was conducted in September 2003 with students drawn
from 11 schools in New Zealand who were being prepared for the
high-stakes intermediate level Bursary German (A level equivalent)
examination in the forthcoming November. Although the sample size
was small (n = 47), the number of candidates taking Bursary German in 2003 was also small (N = 366), and the sample represented just
under 13% of those being prepared for Bursary in just under 13% of
schools. Furthermore, although this was a convenience rather than a
random sample, the target population is very clearly defined (17–18
year old students in New Zealand secondary schools studying for the
Bursary German examination in 2003). The sample in this study can
therefore be regarded as largely representative of the population to
which any results might be generalized. Procedures used in the study
are reported below in some detail to facilitate subsequent replication.
The overall design incorporated four elements:
1) A 50-item multiple-choice placement test for learners of
German (Oxford University Language Centre, 2003). The online test, which ranks language learners into six bands from
‘beginner’ to ‘advanced’, was used to provide an independent
measure of participants’ abilities for benchmarking purposes.
For ease of administration a hard copy pencil-and-paper version
of the test was created.
2) Two timed essays of 50 minutes, one taken with and one without a bilingual dictionary.
3) Two short questionnaires, to be completed after each timed
essay. The questionnaires elicited information about test taker
perceptions and strategy use across the two conditions.
4) A longer questionnaire, to be completed after both essays and
the two shorter questionnaires. This contained both closed-ended
questions eliciting information about type of dictionary, prior
experience with the dictionary, dictionary use strategies and opinions about dictionaries in writing tests, and open-ended questions
about perceived advantages and disadvantages of dictionary use
in writing tests.
The procedures, together with the use of a detailed scoring
rubric, were piloted through two small-scale studies. It was established through the earlier studies that the 50-minute time-frame
was sufficient to elicit a rateable sample of language and that the
scoring rubric could be used successfully by raters despite its
1 Task type
Two essay tasks were devised to be as comparable as possible. This
was achieved by making both titles general, not requiring any specialist knowledge, but at the same time eliciting opinions from participants about subjects about which they were all expected to have
something to say. The two titles were:
1) Sprachen in Neuseeland: „Alle Schüler sollen heutzutage mindestens EINE Fremdsprache in der Schule lernen“. Sind Sie auch dieser Meinung?
(Languages in New Zealand: ‘These days all school students should
learn at least ONE foreign language in school’. Are you also of this
opinion?)
and
2) Der Massentourismus: „Viele Touristen aus vielen Ländern besuchen heutzutage Neuseeland, und das ist gut für das Land“. Was meinen Sie?
(Mass Tourism: ‘These days many tourists from many countries visit
New Zealand, and that is good for the country’. What do you think?).
To ensure that the two titles were as similar as possible in terms of complexity of language, each was created to contain a similar number of words, and 18 of the words used in each title (90%) were taken from a prescribed list of ‘basic’ vocabulary with which all the participants should have been actively familiar by the time of taking the Bursary examination.
Level of ability   n
1                  10
2                  11
3                  14
4                  12
Total              47
2 Underlining was used to enable quantitative and qualitative analyses of how participants used the dictionary. Although not directly relevant to the findings explored in this article, it was interesting to observe a wide range of look-up activity, from no look-ups (in one case) to 32 (M = 9, SD = 5.7). On average, lower ability participants made more use of the dictionary than higher ability participants; the advanced participants used the dictionary the least according to this measure.
V Results
Two sets of final scores were calculated for each essay (one for each rater) out of 35 (7 × 5). In cases where component scores were adjusted after discussion, adjusted scores were used. Coefficients for both the inter-rater and intra-rater reliability of the scores awarded were also calculated. The inter-rater reliability coefficient for the original two sets of independent total scores was r = .86, n = 94, p ≤ .001 (two-tailed). This was considered to be an acceptably high level of correlation, commensurate with the minimum requirements
3 This question provided four levels of response that were subsequently translated into the four levels of experience given in Table 4.
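As an illustration of how the inter-rater reliability coefficient reported above (r = .86) can be computed, the sketch below correlates two raters' total scores with a Pearson coefficient. The score arrays are hypothetical stand-ins, not the study's data.

```python
import numpy as np

# Hypothetical totals out of 35 awarded by two raters to the same eight
# essays; NOT the study's actual score data.
rater1 = np.array([14., 18., 20., 23., 26., 29., 31., 33.])
rater2 = np.array([15., 17., 21., 22., 27., 28., 32., 34.])

# Pearson inter-rater reliability: the off-diagonal entry of the 2x2
# correlation matrix.
r = np.corrcoef(rater1, rater2)[0, 1]
print(f"inter-rater r = {r:.2f}")  # close to 1 for these near-identical ratings
```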
Level   Statistic   ‘Without’   ‘With’
LI      M           14.9        16.2
        SD          1.8         2.2
I       M           20.0        20.1
        SD          3.1         3.1
UI      M           25.8        25.9
        SD          5.4         5.8
A       M           33.0        32.3
        SD          1.1         5.2
4 The wide standard deviation for the advanced ‘with dictionary’ scores may be explained by the fact that one advanced participant failed to complete the ‘with dictionary’ task adequately, thereby receiving a considerably lower overall score than might otherwise have been expected. Indeed, an analysis of the final scores revealed widely differential performance by two participants (one advanced and one upper intermediate). The implications of this are raised in Appendix B.
        ‘Without’   ‘With’
SD      6.7         6.1
Interaction                                            df   F        p        Partial η²
Between subjects
  Level of ability                                      3   16.891   .000**   .598
  Level of experience                                   3   .299     .826     .026
  Level of ability × level of experience                6   .492     .809     .080
Within subjects
  Condition a                                           1   .050     .824     .001
  Condition × level of ability                          3   .683     .569     .057
  Condition × level of experience                       3   3.037    .042*    .211
  Condition × level of ability × level of experience    6   .701     .650     .110

Notes: a Condition: the final awarded score in both test conditions; **p < .01; *p < .05.
5 The finding of no significant difference obtained from the t-test indicated, however, that it was unlikely that the ANOVA would reveal contrary results.
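The within-subjects comparison at the heart of these analyses can be sketched as a paired-samples t-test on ‘with’ and ‘without’ scores. The values below are deliberately constructed toy data, not the study's scores: each test taker's two scores differ only by ±1, so the condition makes no systematic difference and the test returns t = 0.

```python
import numpy as np
from scipy import stats

# Toy paired scores out of 35 (NOT the study's data): the 'with dictionary'
# score differs from the 'without' score by +1 or -1, alternating, so the
# mean difference between conditions is exactly zero.
without_dict = np.array([14., 16., 20., 21., 25., 26., 31., 33.])
with_dict = np.array([15., 15., 21., 20., 26., 25., 32., 32.])

# Paired-samples t-test: same test takers measured in both conditions.
t, p = stats.ttest_rel(with_dict, without_dict)
print(f"t = {t:.2f}, p = {p:.2f}")  # t = 0.00, p = 1.00
```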
Variable B SE B β t p
was significantly different from any other. It may therefore be that this
statistically significant interaction was anomalous and a result of the
relatively small sample size. Level of ability was found to be a highly
significant between-subjects factor. The post hoc test indicated that
each level of ability was significantly different from the others.
To investigate the extent to which level of ability and level of
experience contributed to ‘with dictionary’ scores, a linear regression
was also calculated. This calculation also took into account the number of look-ups made by each participant in order to address the second research sub-question. Look-ups were determined by items test takers underlined in their responses. The purpose of the linear regression was to determine which of several factors were having an effect
on outcomes, and therefore to predict the factors that were most
likely to contribute to outcomes and predict performance on ‘with
dictionary’ tests. Table 8 records the results of the linear regression.
These results confirm those of the ANOVA. The only significant
predictor of outcomes when writing with the dictionary was level of
ability, with the coefficient indicating a positive relationship between
level of ability and scores.
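A minimal sketch of such a regression, with ability level, dictionary experience, and look-up count as predictors of the ‘with dictionary’ score, might look as follows. The data are fabricated so that only ability drives scores, mirroring the pattern reported above; none of the values are from the study.

```python
import numpy as np

# Fabricated data for eight test takers (NOT the study's data).
ability = np.array([1., 1., 2., 2., 3., 3., 4., 4.])      # LI=1 ... A=4
experience = np.array([2., 3., 1., 4., 2., 3., 1., 4.])   # 4-point experience scale
lookups = np.array([14., 12., 11., 9., 8., 7., 3., 2.])   # dictionary look-ups
score = 10. + 5. * ability                                # only ability matters here

# Ordinary least squares: design matrix with an intercept column first.
X = np.column_stack([np.ones_like(ability), ability, experience, lookups])
coef, _, _, _ = np.linalg.lstsq(X, score, rcond=None)

# Only the ability coefficient is non-zero; experience and look-ups
# contribute nothing to these constructed scores.
print(coef)
```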
VI Discussion
The most important finding of this study was that the availability of the bilingual dictionary made no significant difference to ‘with dictionary’ writing test scores in comparison with ‘without dictionary’ scores, a finding commensurate with previous studies of reading tests. This means that, first, the dictionary did not appear to make any difference to the tests as reliable measures of the construct of writing proficiency as operationalised by the scoring rubric; there was consistency of measurement across the two conditions. Also, the dictionary was not found to contribute substantially to the two threats to construct validity identified by Messick (1989). Participants were
able to perform equally well with or without the dictionary, and the
construct validity of the test was not threatened, either by construct under-representation when the dictionary was not available, or by construct-irrelevant variance when it was.
The finding stands in contrast to that of Hurman and Tall (1998). Bearing in mind the consequences of Hurman and Tall’s research for policy decisions in the United Kingdom, the contrast gives cause for some concern. The study reported here is small in scale, and inferences drawn from the data may therefore be challenged on that basis. Nevertheless, the finding of no significant difference to scores, when considered alongside findings of research into reading tests, is sufficient to cast some doubt on the validity of a blanket decision to remove dictionaries from all levels of examination in the UK because, in one type and at one level of examination, the availability of a dictionary ‘clearly increased the mean scores’ (Hurman & Tall, 2002: 26). Certainly Spolsky (2001: 4), in critiquing the argument that dictionaries will necessarily improve performance, asserts that ‘all empirical studies so far contradict this argument, and suggest that the belief is wrong’. The study reported here, in common with studies other than Hurman and Tall’s, adds some weight to the argument that the belief is indeed wrong.
Secondarily, the findings revealed that, with regard to ‘with dictionary’ tests, neither level of prior experience with the dictionary nor frequency of use of the dictionary in the test made any significant difference to performance as measured by the scores. Although prior experience with the dictionary is not necessarily the same as prior training in the use of the dictionary, it certainly appears that neither previous exposure to using a dictionary nor the number of times the dictionary is used is a significant predictor of performance in ‘with dictionary’ writing tests. This might lead to a conclusion, particularly by test setters who are concerned that the test should ‘bias for best’ and ‘work for washback’, that there is no need to be overly concerned either that substantial previous experience with the dictionary or that using the dictionary more frequently will contribute significantly to test takers’ performance. It should be noted, however, that levels of experience with using the dictionary were measured by one closed-ended question on the longer questionnaire which provided four levels of response (never / once or twice / fairly often / quite frequently). These were subsequently translated into the four levels of experience used in the analysis. It is acknowledged that this means of determining prior experience with the dictionary is of limited validity, and a more careful exploration of the
VIII References
Assessment Reform Group 2002a: Assessment for learning: 10 principles. Retrieved 26 October 2004 from http://www.assessment-reform-group.org.uk
—— 2002b: Testing, motivation and learning. Cambridge: University of
Cambridge Faculty of Education.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
—— 2000: Modern language testing at the turn of the century: Assuring that
what we count counts. Language Testing 17, 1–42.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice:
Designing and developing useful language tests. Oxford: Oxford
University Press.
Bensoussan, M., Sim, D. and Weiss, R. 1984: The effect of dictionary usage
on EFL test performances compared with student and teacher attitudes
and expectations. Reading in a Foreign Language 2, 262–76.
Birmingham University 2000: Review 2000. Retrieved 1 April 2003 from
http://www.bham.ac.uk/publications/review2000/research.html.
Bishop, G. 2000: Dictionaries, examinations and stress. Language Learning
Journal 21, 52–65.
Cambridge University 1999: Assessment for learning: Beyond the black box.
Cambridge: University of Cambridge School of Education.
White, E.M. 1985: Teaching and assessing writing. San Francisco, CA:
Jossey-Bass.
Wiggins, G. 1989: A true test: Towards more authentic and equitable assess-
ment. Phi Delta Kappan 70(9), 703–13.
7 Excellent
6 Very Good
5 Good
4 Satisfactory
3 Fair
2 Poor
1 Weak
0 No rewardable material
The rubric itself provided detailed descriptors for each category and each score band. Grammatical competence, for example, could be measured according to the detailed criteria above. These descriptors enabled the raters to differentiate between levels of performance across the categories with a high level of sensitivity and provided the test takers with feedback that would help them with their subsequent work (a copy of the full rubric is available from the author).
Grammatical competence