Gap-fill Tests for Language Learners: Corpus-Driven Item Generation

Simon Smith
Xi'an Jiaotong Liverpool University, China
Simon.Smith@xjtlu.edu.cn

P.V.S. Avinesh
IIIT Hyderabad, India
avinesh@research.iiit.ac.in

Adam Kilgarriff
Lexical Computing Ltd, UK
adam@lexmasterclass.com

Abstract

Gap-fill exercises have an important role in language teaching. They allow students to demonstrate that they understand vocabulary in context, discouraging memorization of translations. It is time-consuming and difficult for item writers to create good test items, and even then test items are open to Sinclair's critique of invented examples. We present a system, TEDDCLOG, which automatically generates draft test items from a corpus. TEDDCLOG takes the key (the word which will form the correct answer to the exercise) as input. It finds distractors (the alternative, wrong answers for the multiple-choice question) from a distributional thesaurus, and identifies a collocate of the key that does not occur with the distractors. Next it finds a simple corpus sentence containing the key and collocate. The system then presents the sentences and distractors to the user for approval, modification or rejection. The system is implemented using the API to the Sketch Engine, a leading corpus query system. We compare TEDDCLOG with other gap-fill-generation systems, and offer a partial evaluation of the results.

Key Words: gap-fill, Sketch Engine, corpus linguistics, ELT, GDEX, proficiency testing

1 Introduction

Gap-fill exercises are widely used throughout the language-teaching world. In a gap-fill (or cloze) test item, the student is presented with a text with one or more gaps in it, and, for each gap, is asked to select the term that goes into it from a small number of candidates.[1][2] The tests allow targeted testing of particular competences in a controlled manner. Being multiple-choice, they are well suited to automatic marking and are particularly useful in proficiency testing.

The standard method for producing test items is for an item writer to compose or locate a convincing carrier sentence, which incorporates the desired KEY (the correct answer, which has been deleted to make the gap). They then have to generate DISTRACTORS (wrong answers intended to 'distract' the student from selecting the correct answer). This is non-trivial, as each distractor must be incorrect, in that inserting it into the blank generates a 'bad' sentence, yet the distractors must in some way be viable alternatives, or else the test item will be too easy.

The simple fact that carrier sentences are usually invented is also problematic. Since Sinclair (1986) the objections, in language teaching, to invented examples have been well established: it all too often occurs that invented examples do not replicate the phraseology and collocational preferences of naturally-occurring text.

TEDDCLOG (Testing English with Data-Driven CLOze Generation) is a system that generates draft test items using a very large corpus of English, using functions for finding collocates, distractors and carrier sentences in the Sketch Engine, a leading corpus query tool.[3] We use the UKWaC corpus (Ferraresi et al 2008), a 1.5-billion-word web corpus. Ferraresi et al show that UKWaC is a good, broad sample of English. Size is important, as then there are plenty of examples for most key-collocate pairings, so the chances of finding a short, simple one suitable for language testing are high.

[1] The more widely-used name is cloze. However, in some language-teaching literature, cloze is reserved for multi-sentence texts with several gaps to fill. Gap-fill is a more generic name.
[2] We focus here on multiple-choice exercises, though open gap-fills have also been used (for example Pino et al, 2008).
[3] http://www.sketchengine.co.uk

Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2010
Figure 1: Word sketch for spectacular, showing collocates, grammatical relation, frequency and salience.

2 The System

The user inputs the key and its word class. Two Sketch Engine calls retrieve collocates and thesaurus items. We work through the two lists, checking each <collocate, thesaurus-entry> pair in turn to see if they co-occur in the corpus. We continue until we find a collocate and three thesaurus entries that do not occur with it.
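In outline, the search can be sketched as follows. This is a minimal illustration rather than the production code: the Sketch Engine API calls are abstracted into the input lists and a caller-supplied co-occurrence check, and the attested pairs are invented for the example.

```python
# Sketch of the TEDDCLOG candidate search (illustrative only: the
# Sketch Engine calls are abstracted into the input lists and the
# cooccurs() check, and the attested pairs below are invented).

def find_item_parts(collocates, thesaurus, cooccurs, n_distractors=3):
    """Walk the salience-ranked collocate list; for each collocate,
    collect thesaurus entries that do NOT co-occur with it in the
    corpus. Return the first collocate with enough such entries."""
    for colloc in collocates:
        distractors = [t for t in thesaurus if not cooccurs(t, colloc)]
        if len(distractors) >= n_distractors:
            return colloc, distractors[:n_distractors]
    return None  # no usable collocate found

# Running example: in this toy attested set the top three thesaurus
# items all modify 'scenery', so lower-ranked entries are selected,
# yielding the distractors of the item shown later in this section.
collocates = ["scenery", "panoramic", "scenically"]
thesaurus = ["stunning", "magnificent", "impressive",
             "historic", "huge", "exciting"]
attested = {("stunning", "scenery"), ("magnificent", "scenery"),
            ("impressive", "scenery")}
print(find_item_parts(collocates, thesaurus, lambda t, c: (t, c) in attested))
# -> ('scenery', ['historic', 'huge', 'exciting'])
```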
An example: with the key spectacular, the top collocates are scenery, panoramic (AND/OR relation) and scenically, as can be seen in the word sketch for spectacular (see Fig. 1).[4] The top thesaurus items are stunning, magnificent and impressive, as shown in the thesaurus screenshot (Fig. 2). We take the first collocate and see if there are three thesaurus items that do not co-occur with it. (The corpus has been parsed at compile time, so this can be done quickly.) If we find three thesaurus items not occurring with the collocate, we are done. If not, we move on to the next collocate and iterate. In this case we found three thesaurus items which were not found with the highest-scoring collocate, scenery.

[4] In the Sketch Engine and this work, a collocation is a triple involving two lemmas (of specific word class) and the grammatical relation holding between them; the grammatical relations applying here can be seen in Fig. 1.

Figure 2: Distributional thesaurus entry.

The carrier sentence needs to contain spectacular scenery. There are 1295 such sentences in UKWaC. The next task is to choose the most suitable for a language-teaching, gap-fill exercise context.

TEDDCLOG uses the Sketch Engine's GDEX function (Good Dictionary Example Extractor; Kilgarriff et al 2008) to find the best sentence containing the collocation, choosing a sentence which is short (but not too short, or there is not enough useful context); begins with a capital letter and ends with a full stop; has a maximum of two commas; and otherwise contains only the 26 lowercase letters. All others are rejected. These constraints may seem rigid, but in earlier tests we encountered many examples of sentences which were too technical or too informal to be comprehensible to learners. We found that excluding sentences that included symbols, numbers, quotation marks and proper names eliminated many of the problem items.
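A filter of this kind can be sketched as follows. The word-count thresholds here are our illustrative assumptions; GDEX itself scores sentences with a richer set of weighted heuristics (Kilgarriff et al 2008).

```python
import re

def acceptable_carrier(sentence, min_words=6, max_words=20):
    """Crude GDEX-style filter: keep short, typographically plain
    sentences. Thresholds are illustrative, not GDEX's own."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False                      # too short or too long
    if not (sentence[0].isupper() and sentence.endswith(".")):
        return False                      # must look like a sentence
    if sentence.count(",") > 2:
        return False                      # at most two commas
    # Otherwise only lowercase letters, spaces and commas are allowed
    # (excluding the initial capital and the final full stop):
    return re.fullmatch(r"[a-z ,]*", sentence[1:-1]) is not None

print(acceptable_carrier(
    "Some of the areas were high in the mountains where there "
    "was spectacular scenery."))                       # -> True
print(acceptable_carrier("He paid £300 for the 2-night stay!"))  # -> False
```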
In our example, spectacular occurs in its base form. Sometimes verbs and adjectives occur in inflected forms in the carrier sentence, and in those cases we provide the key and distractors in the inflected form that the context requires.[5]

[5] The indefinite article can take two forms, a and an. If the carrier sentence contains an before the blank, it is obvious that the key must begin with a vowel. However, we do not wish to exclude consonant-initial distractors, so if the indefinite article immediately precedes the key, we blank out both article and key, and offer article-noun pairs as fillers.

Next, we blank out the keyword and randomize the order of key and distractors to give a test item as here:

Some of the areas were high in the mountains where there was ____ scenery.
(a) historic
(b) spectacular
(c) huge
(d) exciting
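The assembly step itself is simple; a sketch follows (illustrative only; the article-blanking case of note 5 is flagged in a comment but not implemented).

```python
import random

def make_item(sentence, key, distractors):
    """Blank out the key and shuffle key + distractors into options.
    (If 'a'/'an' precedes the blank it should be blanked too, and
    article-noun pairs offered instead -- omitted in this sketch.)"""
    stem = sentence.replace(key, "____", 1)
    options = random.sample([key] + distractors, len(distractors) + 1)
    return stem, options

stem, options = make_item(
    "Some of the areas were high in the mountains where there was "
    "spectacular scenery.",
    "spectacular", ["historic", "huge", "exciting"])
print(stem)
for letter, option in zip("abcd", options):
    print(f"({letter}) {option}")
```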
3 System Evaluation

A random sample of 79 word-and-word-class pairs from the CEEC list[6] was entered into the system as gap-fill item keys. One carrier sentence and three distractors were generated for each key. The parameters used to identify candidate collocations were that the collocation's frequency needed to be greater than seven, and the salience[7] greater than zero. Test items were generated for 75 of the 79 input words.

The two authors of the paper who are native speakers of English (both linguists, and one an experienced language teacher) then assessed whether each item was acceptable or not, as a test item to be presented to intermediate English learners. They classified the cases where it was not.

For an item to be fully acceptable the carrier sentence must be:

• a well-formed sentence of English
• at a level that an intermediate learner of English can be expected to understand
• with sufficient context; not too short
• without superfluous material; not too long

Furthermore the distractors must be 'bad' but with some plausibility.

Results are given in Table 1. As we view TEDDCLOG as a drafting system, we have classified items according to whether they are acceptable as they are; acceptable after a minor edit (editing a maximum of one word); acceptable as a carrier sentence but with one or more unacceptable distractors; or unacceptable.

Table 1: System Evaluation

Item action              |  # |  # |   %
Accept                   |    | 40 |  53
- As is                  | 27 |    |
- Edit carrier sentence  |  4 |    |
- Change distractor(s)   |  9 |    |
Reject                   |    | 35 |  47
Total                    |    | 75 | 100

[6] A glossary of 6480 words used to help people studying for university entrance exams in Taiwan (see College Entrance Examination Center 2002).
[7] The salience statistic is based on the Dice coefficient; for details see 'Statistics used in the Sketch Engine' in the Sketch Engine help pages.
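For reference, our gloss of note 7, based on the Sketch Engine documentation rather than this paper: the salience score is the logDice variant of the Dice coefficient, where $f_{xy}$ is the frequency of the collocation and $f_x$, $f_y$ the frequencies of the two words.

```latex
\[
\mathrm{Dice}(x,y) = \frac{2 f_{xy}}{f_x + f_y},
\qquad
\mathrm{logDice}(x,y) = 14 + \log_2 \frac{2 f_{xy}}{f_x + f_y}
\]
```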
3.1 Carrier sentence quality

Six items were not sentences. They were two noun phrases, two verb phrases, one adjective phrase, one containing the string dh (uninterpretable out of context) and one where the grammar was bungled.

Items where the language was grammatical but beyond the reach of most intermediate-level students included "Two muffins and a piece of rocky road and ____ latte?" (key=skinny, also a non-sentence), "Ursodeoxycholic acid normalises ____ and helps itching" (key=biochemistry) and "Consequentialism, in so far as it diverges from commonsense ____ on these points, strikes many as an unacceptable moral theory" (key=morality). The first case relates to informal vocabulary, the second to specialist vocabulary, and the third to formality of genre.
Engine help pages. “These are followed by an alphabetical ”
(key=index), “Of course, these are optical ” fits in “Is it easy to around the site?”.
(key=illusions) and “Then he noticed the There was just one case where distractors were
lever” (key=gear. There were no clear cases where wrong enough to make the item notably easy: in
sentences were too long (with the morality exam- “All rooms in the courts are fitted with a wash
ple above being one of the longest). basin” (key=hand), distractors eye, body,
One of the authors of the paper ‘took the test’ head are all very common core vocabulary and
and was able to identify the correct answer in all clearly do not fit.
but six of the 76 cases. Lack of context was not
a problem for him because the compounds alpha- 4 Other gap-fill-generating systems
betical index, optical illusion and gear lever are
Several researchers have developed automatic
well-established. Here, the pedagogical issue be-
gap-fill generators. Mostow et al (2004) generated
comes: what do we want to test for? For gen-
gap-fill items of varying difficulty from children’s
eral knowledge of the meaning and use of the
stories. The items were presented to children via
key, or specific knowledge of the compounds or
a voice interface, and the response data was used
other multi-word units it participates in? In cases
to assess comprehension. Hoshino & Nakagawa
like “Our family room was absolutely .”
(2007) devised an NLP-based teacher’s assistant,
(key=superb) the critical knowledge is of the col-
which first asks the user to supply a text. The sys-
location absolutely superb. In “Residents rushed
tem then suggests deletions that could be made,
to help and the flames” (key=smother) the
and helps the teacher to select appropriate distrac-
knowledge is of a non-core sense of the verb. In
tors, chosen from among other words of the same
“Surely it is that injustice that will lead to
class occurring in the same article, as well as their
of discontent” (key=a winter) the knowledge is of
synonyms as recorded in WordNet. They also at-
Shakespeare.
tempt to find distractors of approximately the same
frequency as the key. In a teacher-user evaluation,
3.2 Distractor quality
79% of the items generated were deemed appro-
As we note above, for the test item to be satis- priate.
factory, the distractors must be bad, in the context Both of these systems use longer texts, while
of the carrier sentence. In each case, our algo- Sumita et al (2005) describe the automatic gen-
rithm guaranteed a collocation where the distrac- eration of single sentence gap-fill exercises from
tor did not occur in the corresponding construc- a large corpus. They use a published thesaurus
tion in the corpus. A first concern is: can we to find potential distractors. To establish whether
expect the test-taker to know the collocation? A potential distractors are permissible in the carrier
second is: does the absence of the distractor in sentence, they submit queries to Google compris-
the relevant construction indicate anything that we ing the carrier sentence (or parts of it) with the
might expect the test-taker to know? If we look at key replaced by the potential distractor. They only
“I had lain into the potato pie, mushy and retain distractors where Google does not find any
red cabbage” (key=pea(s), distractors: bean(s), hits.
spinach, carrot(s)), do we wish to test knowledge To evaluate their system they gave tests items
of mushy peas? If not the test item does not work to a set of students for whom TOEIC English pro-
because mushy beans/spinach/carrots are all plau- ficiency scores were known. They were able to
sible. This was the norm in our dataset: getting show a high level of correlation between perfor-
the right answer depended on knowing a colloca- mance on the test items they had generated, and
tion including the key, and in the absence of that the students’ proficiency scores. The correlation
knowledge one or more distractors became a plau- was similar to, but not as good as, the correla-
sible answer. tion between TOEIC scores and performance on
There were several cases where particular dis- expert-generated gap-fill items. A native speaker
tractors did not work for grammatical reasons. of English also did the test and scored 93.5%,
speaks has the wrong syntax to fill the gap in “The higher than the highest-scoring non-native speaker
trouble is she always bloody me” (key=tells). who scored 90.6%.
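Sumita et al's zero-hit filter can be sketched as follows. This is our illustration, not their implementation: web_hit_count stands in for whatever search API is available, and the canned hit counts are invented.

```python
def retain_distractors(carrier, key, candidates, web_hit_count):
    """Keep only candidate distractors that yield zero web hits when
    substituted for the key (Sumita et al's negative evidence).
    web_hit_count is a caller-supplied search function (placeholder)."""
    retained = []
    for candidate in candidates:
        query = '"%s"' % carrier.replace(key, candidate)  # exact phrase
        if web_hit_count(query) == 0:
            retained.append(candidate)    # unattested -> keep as distractor
    return retained

# Toy example with canned counts: 'did a serious mistake' is unattested
# (a good distractor), while 'committed a serious mistake' is attested
# and so would be accidentally acceptable -- it is dropped.
hits = {'"She did a serious mistake."': 0,
        '"She committed a serious mistake."': 12}
print(retain_distractors("She made a serious mistake.", "made",
                         ["did", "committed"], lambda q: hits.get(q, 0)))
# -> ['did']
```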
For Liu et al (2005) the user input is word plus word sense. Much of their effort is spent on disambiguating the key in potential carrier sentences in order to find a carrier sentence in which the key is used in the intended sense. They succeed in doing this 65.5% of the time. Distractors are found in WordNet. The authors report 91.5% success at generating sets of distractors which were not infelicitously correct.

Over several years the REAP project at Carnegie Mellon University has been developing web-based tools, including gap-fill tests, for learners of English. Recent work includes an investigation of distractors based on morphological (boring vs. bored), orthographic (bread vs. beard) and pronunciation-based (file vs. fly) confusability (Pino & Eskenazi, 2009). Their gap-fill generation system (Pino et al, 2008) explores using examples from WordNet, from a learners' dictionary, and from a large corpus of documents suitable for learners, as carrier sentences. They identify good sentences by assessing complexity, well-defined context, grammaticality and length. They look at expert-produced test items to find optimum structures and lengths. They use the Stanford parser to assess complexity, and whether a sentence is grammatical. They assess whether there is a well-defined context by computing the extent to which the words in the sentence associate with each other, based on the pointwise mutual information of the word pairs. If the words cohere, in this sense, that implies a well-defined context. They then also use this framework for identifying potential distractors: they are words which fit quite well, but not too well, in the context. In their evaluation, 66.5% of questions were found to be acceptable, with the most common flaw being that some of the distractors were acceptable.
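The coherence check can be sketched as follows. This is a minimal reading of their description, under our own assumptions: PMI is computed here from supplied probabilities rather than corpus counts, and the aggregation by mean over pairs, like the toy numbers, is invented for illustration.

```python
import itertools
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of a word pair from probabilities."""
    return math.log2(p_xy / (p_x * p_y))

def coherence(content_words, pair_prob, word_prob):
    """Mean PMI over all pairs of content words in the sentence:
    high values suggest a well-defined context (after Pino et al)."""
    pairs = list(itertools.combinations(content_words, 2))
    scores = [pmi(pair_prob(x, y), word_prob(x), word_prob(y))
              for x, y in pairs]
    return sum(scores) / len(scores)

# Toy probabilities, invented for illustration:
wp = {"spectacular": 1e-4, "scenery": 1e-4, "mountains": 1e-4}
pp = {frozenset(("spectacular", "scenery")): 5e-7,
      frozenset(("spectacular", "mountains")): 2e-8,
      frozenset(("scenery", "mountains")): 1e-7}
print(round(coherence(["spectacular", "scenery", "mountains"],
                      lambda x, y: pp[frozenset((x, y))],
                      lambda w: wp[w]), 2))
```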
5 Critical comparison with other systems, and future work

The systems reviewed offer a number of insights that we plan to integrate into TEDDCLOG in the future.

A first point is sentence length. Liu et al analyse a batch of expert-generated test items and find the mean length to be 16 words. In our dataset the average was eleven. As our internal evidence also indicates, GDEX parameters need adjusting to favour longer sentences giving more context.

Further structure is given to the theme by Pino et al's (2008) observation that high-quality carrier sentences often consist of two clauses, one containing the key and the other specifying the context (though this could be a feature of the fact that this is how these invented sentences are constructed, so could fall foul of the 'inauthenticity' critique).

Our method focuses on a single collocation of the key's. There is often nothing to favour key over distractors except knowledge of the compound or collocation. If the goal of the exercise is to test knowledge of the core meaning of the word (as opposed to knowledge of its collocation, a more advanced topic) then such test items fail. Pino et al and Liu et al look at the overall coherence of the candidate sentence by checking for relations between all the content words and each other. This is a technique we shall add, both for finding coherent sentences (where the key is in place) and for strengthening the evidence that a distractor is bad.

Sumita et al's approach to negative evidence is via Google. While this is appealing, it has a downside: Google places limits on the number of queries allowed per day, it is not clear what a suitable length of sentence fragment to submit to Google is, and replicability is lost (Kilgarriff, 2007). We prefer the model in which very large corpora (compiled from the web) are used; we shall soon start using a corpus of 5 billion words. Then, measuring coherence between all words and the distractor will also give us more evidence (including negative evidence) in relation to the question "is a distractor accidentally acceptable?"

As a source for carrier sentences, Pino et al use a database of pre-selected web texts. This is similar to our approach except that we first gather very large numbers of web texts, and include them all in the corpus. We then select suitable sentences at run time, using GDEX. We have also been exploring pre-classifying all documents for readability and including that information in the document header (which is accessible to the search tools at run time).

Mostow et al's and Hoshino and Nakagawa's systems start from texts or sentences, whereas we, like Liu et al, start from the key. This is significant for two reasons: first, because item writers generally wish to use a specific word as a point of departure for producing a gap-fill item. Second, our architecture is capable of generating large numbers of gap-fill items on a given topic (Business, perhaps, or Starting out at University).

Currently we use the Sketch Engine's shallow parser, which supplies grammatical relations between pairs of words but no bracketing. We intend to follow Sumita, Liu and Pino in using a fuller parser, and exploiting its output to find sentences of suitable grammatical structure.
We are planning to explore frequency factors further. Mostow et al are among the authors proposing distractors of similar frequency to the key. We do this indirectly, to some extent, as words tend to be classified as similar to other words of similar frequency in a distributional thesaurus such as the Sketch Engine's (Weeds and Weir, 2005). Currently we set the frequency threshold for the collocation involving the key at just seven. We get some obscure collocations like umeboshi plums. We plan to explore setting this threshold much higher.

We are impressed by Sumita et al's evaluation, and in particular the way it addresses the questions "what level of difficulty in the text is acceptable?" and "how subtle are the contrasts between key and distractors allowed to be?" They evaluate by correlating with students' TOEIC scores. This allows that some test items are hard (and only the best students will get them right), others are intermediate, and others are easy (so the less good students will often get them right). Particularly for proficiency testing, it is often convenient to have questions at a range of levels.

Perhaps the most important way forward is to look more closely at the different competences that gap-fill tests are used to test, if possible in consort with professional testers, and to tailor our algorithms to their particular tasks. This is what we intend to do. In that context, we shall offer item writers a number of carrier sentences and distractors, to choose from and edit for each key.

6 Summary

We have described a program which generates gap-fill exercises with distractors which will appear, to many students, to be plausible correct answers. Using a very large corpus, and methods from computational linguistics, TEDDCLOG offers the prospect of making the preparation of gap-fill items (currently labour-intensive and vulnerable to the 'invented example' objection) both faster and based on real language data.

References

College Entrance Examination Center. High School English Reference Wordlist. Retrieved December 29, 2008, from http://www.ceec.edu.tw/Research/paper_doc/ce37/ce37.htm

Ferraresi, A., Zanchetta, E., Bernardini, S. and Baroni, M. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. Proceedings, 4th WAC workshop, LREC, Marrakech, Morocco.

Hoshino, A. and Nakagawa, H. 2007. Assisting cloze test making with a web application. Proc. Society for Information Technology and Teacher Education International Conference 2007, pp. 2807-2814. Chesapeake, VA: AACE.

Kilgarriff, A. 2007. Googleology is bad science. Computational Linguistics 33(1): 147-151.

Kilgarriff, A., Husak, M., McAdam, K., Rundell, M. and Rychlý, P. 2008. GDEX: Automatically finding good dictionary examples in a corpus. Proc. EURALEX, Barcelona.

Liu, C-L., Wang, C-H. and Gao, Z-M. 2005. Using lexical constraints to enhance computer-generated multiple-choice cloze items. International Journal of Computational Linguistics and Chinese Language Processing 10(3): 303-328.

Mostow, J., Beck, J. E., Bey, J., Cuneo, A., Sison, J., Tobin, B. and Valeri, J. 2004. Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology, Instruction, Cognition and Learning 2: 97-134.

Pino, J. and Eskenazi, M. 2009. Semi-automatic generation of cloze question distractors: effect of students' L1. SLaTE Workshop on Speech and Language Technology in Education.

Pino, J., Heilman, M. and Eskenazi, M. 2008. A selection strategy to improve cloze question quality. Workshop on Intelligent Tutoring Systems for Ill-Defined Domains, 9th Int. Conf. on Intelligent Tutoring Systems.

Sinclair, J., ed. 1986. Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins COBUILD, London and Glasgow.

Sumita, E., Sugaya, F. and Yamamoto, S. 2005. Measuring non-native speakers' proficiency of English by using a test with automatically-generated fill-in-the-blank questions. 2nd Workshop on Building Educational Applications using NLP, Ann Arbor.

Taylor, W. L. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly 30: 415-433.

Weeds, J. and Weir, D. 2005. Co-occurrence retrieval: a flexible framework for lexical distributional similarity. Computational Linguistics 31(4).
