Oxford Dictionary of English - Current Development
2 authors, including:
Adam Kilgarriff
Lexical Computing Ltd
All content following this page was uploaded by Adam Kilgarriff on 21 May 2014.
ralized, or whether or not an adjective may take a comparative or superlative. In such cases, any available clues are collected from the entry but are then weighted by testing possible forms against a corpus.

3 Idioms and other phrases

Phrases and phrasal verbs are generally lemmatized in an 'idealized' form which may not represent actual occurrences. Variation and alternative wording are embedded parenthetically in the lemma:

    (as) nice (or sweet) as pie

Objects, pronouns, etc., which may form part of the phrase are indicated in the lemma by words such as 'someone', 'something', 'one':

    twist (or wind or wrap) someone around one's little finger

In order to be able to match such phrases to real-world occurrences, each dictionary lemma was extended as a series of strings which enumerate each possible variant and codify how pronouns, noun phrases, etc., may be interpolated. Each occurrence of a verb in these strings is linked to the morphological data in the verb's own entry, to ensure that inflected forms of a phrase (e.g. 'she had him wrapped around her little finger') can be identified.

4 Semantic classification

We are seeking to classify all noun senses in the dictionary according to a semantic taxonomy, loosely inspired by the Princeton WordNet project. Initially, a relatively small number of senses were classified manually. Statistical data was then generated by examining the definitions of these senses. This established a definitional 'profile' for each classification, which was then used to automatically classify further senses. Applied iteratively, this process succeeded in classifying all noun senses in a relatively coarse-grained way, and is now being used to further refine the granularity of the taxonomy and to resolve anomalies.

Definitional profiling here involves two elements:

The first element is the identification of the 'key term' in the definition. This is the most significant noun in the definition: not a rigorously defined concept, but one which has proved pragmatically effective. It is not always coterminous with the genus term; for example, in a definition beginning 'a morsel of food which...', the 'key term' is taken to be food rather than morsel.

The second element is a scoring of all the other meaningful vocabulary in the definition (i.e. ignoring articles, conjunctions, etc.). A simple weighting scheme is used to give slightly more importance to words at the beginning of a definition (e.g. a modifier of the key term) than to words at the end.

These two elements are then assigned mutual information (MI) scores in relation to each possible classification, and the two MI scores are combined in order to give an overall score. This overall score is taken to be a measure of how 'typical' a given definition would be for each possible classification. This enables one very readily to rank and group all the senses for a given classification, thus exposing misclassifications or points where a classification needs to be broken down into subcategories.

The semantic taxonomy currently has about 1250 'nodes' (each representing a classification category) on up to 10 levels. The dictionary contains 95,000 defined noun senses in total, so there are on average 76 senses per node. However, this average disguises the fact that there are a small number of nodes which classify significantly larger sets of senses. Further subcategorization of large sets is desirable in principle, but is not considered a priority in all cases. For example, there are several hundred senses classified simply as tree; the effort involved in subcategorizing these into various tree species is unlikely to pay dividends in terms of value for normal NLP applications. A pragmatic approach is therefore to deprioritize work on homogeneous sets (sets where the range of 'typicality' scores for each sense is relatively small), more or less irrespective of set size.

Hence the goal is not to achieve granularity on the order of WordNet's 'synset' (a set in which all terms are synonymous, and hence are rarely more than four or five in number) but rather a somewhat more coarse-grained 'similarset' in which every sense is similar enough to support general-purpose word-sense disambiguation, document retrieval, and other standard NLP tasks. At this level, automatic analysis and grading of definitions is proving highly productive in establishing classification schemes and in monitoring consistency, although extensive supervision and manual correction are still required.

It should be noted that a significant number of nouns and noun senses in ODE do not have definitions and are therefore opaque to such processes. Firstly, some senses cross-refer to other definitions; secondly, derivatives are treated in ODE as undefined subentries. Classification of these will be deferred until classification of all defined senses is complete. It should then be possible to classify most of the remainder semi-automatically, by combining an analysis of word formation with an analysis of target or 'parent' senses.

5 Domain indicators

Using a set of about 200 subject areas (biochemistry, soccer, architecture, astronomy, etc.), all relevant senses and lemmas in ODE are being populated with markers indicating the subject domain to which they relate. It is anticipated that this will support the extraction of specialist lexicons, and will allow the ODE database to function as a resource for document classification and similar applications.

As with semantic classification above, a number of domain indicators were assigned manually, and these were then used iteratively to seed assignment of further indicators to statistically similar definitions. Automatic assignment is a little more straightforward and robust here, since most of the time the occurrence of strongly-typed vocabulary will be a sufficient cue, and there is little reason to identify a key term or otherwise parse the definition.

Similarly, assignment to undefined items (e.g. derivatives) is simpler, since for most two- or three-sense entries a derivative can simply inherit any domain indicators of the senses of its 'parent' entry. For longer entries this process has to be checked manually, since the derivative may not relate to all the senses of the parent.

Currently, about 72,000 of a total 206,000 senses and lemmas have been assigned domain indicators. There is no clearly-defined cut-off point for iterations of the automatic assignment process; each iteration will continue to capture senses which are less and less strongly related to the domain. Beyond a certain point, the relationship will become too tenuous to be of much use in most contexts; but that point will differ for each subject field (and for each context). Hence a further objective is to implement a 'points' system which not only classifies a sense by domain but also scores its relevance to that domain.

6 Collocates for senses

We are currently exploring methods to automatically determine key collocates for each sense of multi-sense entries, to assist in applications involving word-sense disambiguation. Since collocates were not given explicitly in the original dictionary content of ODE, the task involves examining all available elements of a sense for clues which may point to collocational patterns.

The most fruitful areas in this respect are firstly definition patterns, and secondly example sentences.

Definition patterns are best illustrated by verbs, where likely subjects and/or objects are often indicated in parentheses:

    fly: (of a bird, bat, or insect) move through the air...
    impound: (of a dam) hold back (water)...

The terms in parentheses can be collected as possible collocates, and in some cases can be used as seeds for the generation of longer lists (by exploiting the semantic classifications described in section 4 above). Similar constructions are often found in adjective definitions. For other parts of speech (e.g. nouns), and for definitions which happen not to use the parenthetic style, inference of likely collocates from definition text is a less straightforward process; however, by identifying a set of characteristic constructions it is possible to define search patterns that will locate collocate-like elements in a large number of definitions. The defining style in ODE is regular enough to support this approach with some success.

Some notable 'blind spots' have emerged, often reflecting ODE's original editorial agenda; for example, the defining style used for verbs often makes it hard to determine automatically whether a sense is transitive or intransitive.

Example sentences can be useful sources since they were chosen principally for their typicality, and are therefore very likely to contain one or more high-scoring collocates for a given sense. The key problem is to identify automatically which words in the sentence represent collocates, as opposed to those words which are merely incidental. Syntactic patterns can help here; if looking for collocates for a noun, for example, it makes sense to collect any modifiers of the word in question, and any words participating in prepositional constructions. Thus if a sense of the entry for breach has the example sentence

    She was guilty of a breach of trust.

then some simple parsing and pattern-matching can collect guilty and trust as possible collocates.

However, it will be apparent from this that examination of the content of a sense can do no more than build up lists of candidate collocates, a number of which will be genuinely high-scoring collocates, but others of which may be more or less arbitrary consequences of an editorial decision. The second step will therefore be to build into the process a means of testing each candidate against a corpus-based list of collocates, in order to eliminate the arbitrary items and to extend the list that remains.

7 Conclusion

In order for a non-formalized, natural-language dictionary like ODE to become properly accessible to computational processing, the dictionary content must be positioned within a formalism which explicitly enumerates and classifies all the information that the dictionary content itself merely assumes, implies, or refers to. Such a system can then serve as a means of entry to the original dictionary content, enabling a software application to quickly and reliably locate relevant material, and guiding interpretation.

The process of automatically generating such a formalism by examining the original dictionary content requires a great deal of manual supervision and ad hoc correction at all stages. Nevertheless, the process demonstrates the richness of a large natural-language dictionary in providing cues and flagging exceptions. The stylistic regularity of a dictionary like ODE supports the enumeration of a finite (albeit large) list of structures and patterns which can be matched against a given entry or entry element in order to classify it, mine it for pertinent information, and note instances which may be anomalous.

The formal lexical data is being built up alongside the original dictionary content in a single integrated database. This arrangement supports a broad range of possible uses. Elements of the formal data can be used on their own, ignoring the original dictionary content. More interestingly, the formal data can be used in conjunction with the original dictionary content, enabling an application to exploit the rich detail of natural-language lexicography while using the formalism to orient itself reliably. The formal data can then be regarded not so much as a stripped-down counterpart to the main dictionary content, but more as a bridge across which applications can productively access that content.

Acknowledgements

I would like to thank Adam Kilgarriff of ITRI, Brighton, and Ken Litkowski of CL Research, who have been instrumental in both devising and implementing significant parts of the work outlined above.

References

Christiane Fellbaum and George Miller. 1998. WordNet: an electronic lexical database. MIT Press, Cambridge, Mass.

Judy Pearsall. 1998. The New Oxford Dictionary of English. Oxford University Press, Oxford, UK.
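As a rough illustration of the variant enumeration described in section 3, the sketch below expands a lemma such as '(as) nice (or sweet) as pie' into its possible surface strings. The reading of the notation is an assumption made for this example ('(X)' marks an optional word, '(or X or Y)' lists alternatives to the preceding word), and the names `expand_lemma`, `variant_to_regex` and `PLACEHOLDERS` are hypothetical rather than part of the ODE tooling. Linking verbs to their inflected forms, which the paper handles through the morphological data, is deliberately left out.

```python
import itertools
import re

# "(or X or Y)" -> alternatives to the previous word; "(X)" -> optional word;
# anything else -> a plain word. This reading of the notation is an assumption.
TOKEN = re.compile(r"\(or ([^)]*)\)|\(([^)]*)\)|(\S+)")

# Hypothetical wildcards for the placeholder words mentioned in section 3.
PLACEHOLDERS = {
    "someone": r"\w+(?: \w+){0,3}",
    "something": r"\w+(?: \w+){0,3}",
    "one's": r"(?:my|your|his|her|its|our|their)",
}


def expand_lemma(lemma):
    """Enumerate every surface variant encoded in a phrase lemma."""
    slots = []  # one list of options per word position
    for alternatives, optional, word in TOKEN.findall(lemma):
        if word:
            slots.append([word])
        elif optional:
            slots.append([optional, ""])  # "" means the word is omitted
        elif alternatives and slots:
            slots[-1].extend(a.strip() for a in alternatives.split(" or "))
    return [
        " ".join(part for part in combo if part)
        for combo in itertools.product(*slots)
    ]


def variant_to_regex(variant):
    """Turn one variant into a pattern that allows interpolated pronouns."""
    parts = [PLACEHOLDERS.get(w, re.escape(w)) for w in variant.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


variants = expand_lemma("twist (or wind or wrap) someone around one's little finger")
patterns = [variant_to_regex(v) for v in variants]
sentence = "He can wrap anyone around his little finger."
matched = any(p.search(sentence) for p in patterns)
```

Keeping one list of options per word position means the full set of variants is just the Cartesian product of the slots, which stays manageable for the short lemmas typical of idioms.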
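The parenthetical definition patterns discussed in section 6 (for fly and impound) lend themselves to simple pattern matching. The following is a minimal sketch under stated assumptions: a leading '(of ...)' group is taken to name likely subjects, and any later parenthetical is taken to name likely objects. Both rules, and the names `candidate_collocates` and `split_terms`, are illustrative guesses and not the search patterns actually used for ODE.

```python
import re

# Assumption: a definition-initial "(of ...)" group names likely subjects.
SUBJECT = re.compile(r"^\(of ([^)]*)\)")
# Assumption: any later parenthetical names likely objects.
PARENTHETICAL = re.compile(r"\(([^)]*)\)")
ARTICLES = {"a", "an", "the"}


def split_terms(text):
    """Split 'a bird, bat, or insect' into ['bird', 'bat', 'insect']."""
    terms = []
    for chunk in re.split(r",|\bor\b|\band\b", text):
        words = [w for w in chunk.split() if w.lower() not in ARTICLES]
        if words:
            terms.append(" ".join(words))
    return terms


def candidate_collocates(definition):
    """Collect candidate subject and object collocates from a verb definition."""
    definition = definition.strip()
    subjects, objects = [], []
    rest = definition
    match = SUBJECT.match(definition)
    if match:
        subjects = split_terms(match.group(1))
        rest = definition[match.end():]
    for group in PARENTHETICAL.findall(rest):
        objects.extend(split_terms(group))
    return {"subjects": subjects, "objects": objects}


fly = candidate_collocates("(of a bird, bat, or insect) move through the air...")
impound = candidate_collocates("(of a dam) hold back (water)...")
```

The output is only a list of candidates; as the paper notes, each one would still need to be tested against a corpus-based list of collocates before being accepted.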
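As a rough illustration of the definitional profiling described in section 4, the sketch below scores a definition against a classification by combining a mutual information score for the key term with position-weighted scores for the remaining vocabulary. Several details are assumptions made for this example and are not specified in the paper: pointwise MI estimated from raw counts, a score of zero for unseen pairs, a linear weight falling from 1.0 to 0.5 across the definition, and combination of the two scores by simple addition. The key term is supplied by hand here, and the names `Profiler` and `position_weights` are hypothetical.

```python
import math
from collections import Counter


def position_weights(n, first=1.0, last=0.5):
    """Linearly decreasing weights, normalized to sum to 1 (an assumed scheme)."""
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    raw = [first + (last - first) * i / (n - 1) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


class Profiler:
    """Builds per-classification profiles from manually classified senses."""

    def __init__(self):
        self.total = 0
        self.class_counts = Counter()
        self.key_counts = Counter()
        self.key_class = Counter()   # (key term, classification) -> count
        self.word_counts = Counter()
        self.word_class = Counter()  # (word, classification) -> count

    def add(self, key_term, words, label):
        self.total += 1
        self.class_counts[label] += 1
        self.key_counts[key_term] += 1
        self.key_class[key_term, label] += 1
        for word in set(words):
            self.word_counts[word] += 1
            self.word_class[word, label] += 1

    def _pmi(self, joint, marginal, label):
        if joint == 0:
            return 0.0  # assumption: unseen pairs contribute nothing
        p_joint = joint / self.total
        p_item = marginal / self.total
        p_class = self.class_counts[label] / self.total
        return math.log2(p_joint / (p_item * p_class))

    def score(self, key_term, words, label):
        """Overall 'typicality' of a definition for one classification."""
        key_mi = self._pmi(self.key_class[key_term, label],
                           self.key_counts[key_term], label)
        word_mi = sum(
            weight * self._pmi(self.word_class[word, label],
                               self.word_counts[word], label)
            for weight, word in zip(position_weights(len(words)), words)
        )
        return key_mi + word_mi  # assumption: the two scores are summed

    def rank(self, key_term, words):
        """Classifications ordered from most to least typical."""
        return sorted(self.class_counts,
                      key=lambda label: self.score(key_term, words, label),
                      reverse=True)


profiler = Profiler()
profiler.add("food", ["morsel", "eaten"], "food")
profiler.add("food", ["dish", "cooked"], "food")
profiler.add("tree", ["evergreen", "tall"], "tree")
profiler.add("tree", ["deciduous", "leaves"], "tree")
best = profiler.rank("food", ["cooked"])[0]
```

Sorting the senses assigned to one classification by this score gives the kind of ranking the paper uses to expose misclassifications, and the spread of scores within a node gives a simple measure of how homogeneous that node is.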