You are on page 1of 3

A Dictionary and Morphological Analyser for English

G.J. Russell G.D. Rttchle


S.G. P u h n a n A.W. Black
Computer Laboratory, Department of
University of Cambridge Artificial Intelligence,
University of Edinburgh
1. I n t r o d u c t i o n a n d O v e r v i e w
associated values, a morphosyntactlc category, and a rule
This paper describes the current state of a three-year Identifier.
project aimed at the development of software for use in The system is written in FRANZ LISP (opus 42.15)
handling large quantities of dictionary information within running under Berkeley 4.2 Unix. Future developments
natural language processing systems. 1 The project was will concentrate on improving its efficiency, in particular
accepted for funding by SERC/Alvey commencing ill by restructuring the code. We also hope to produce an
June 1984, and is being carried out by Graeme Ritchie implementation in C, which should offer a faster
and Alan Black at the University of Edinburgh and response time.
Steve P u h n a n and Graham Russell at the University of
Cambridge. It is one of three closely related projects 2. L i n g u i s t i c A s s u m p t i o n s
funded under the Alvey IKBS Programme (Natural
The grammatical f r a m e w o r k underlying the linguistic
Language Tlleme); a parser is under development at
aspects of the system is that of Generalized Phrase
Edinburgh by Henry Thompson and John Phillips, and a
Structure Grammar, as set o u t in Gazdar et al. (1985).
sentence grammar is being devised by Ted Briscoe and
Morphological categories employed here correspond to the
Clare Grover at Lancaster and Bran Boguraev and John
Carroll at Cambridge. It is intended t h a t the software syntactic categories in that work, and the type of syn-
tactic information present in dictionary entries is
and rules produced by all three projects will be directly
compatible and capable of functioning in an integrated intended to facilitate the use of the system as part of a
system. more general GPSG-based program. In developing our
Realistic and useful n a t u r a l language processing sys- prototype, we have adopted m a n y of the proposals made
tems such as database front-ends require large numbers in t h a t work. To t h a t extent, certain assumptions about
a correct analysis of English sentence syntax are built in
of words, together w i t h associated syntactic and semantic
Information, to be efficiently stored in machine-readable to the lexlcal entries, but this should not preclude adap-
tation by users to suit different analyses.
form. Our system is Intended to provide the necessary
facilities, being designed to store a large n u m b e r (at least Following w h a t has become a general assumption in
syntactic theory, we take the major lexlcal categories to
10,000) of words and to perform morphological analysis
be partitioned into four classes by the two binary-valued
on them, covering both Inflectional and derlvatlonal mor-
phology. In pursuit of these objectives, the dictionary features [ + N] and [:k V]. The major lexlcat categories
associates w i t h each word information concerning its have phrasal projections; these are distinguished from
morphosyntactlc properties. Users are free to modify the their lexlcal counterparts by their value for the feature
system In a n u m b e r of ways; they may add to the lexi- BAR. Lexlcal categories have the value 0, and phrasal
categories (including sentences) have the value 1 or 2.
cal entries Lisp functions that perform semantic manipu-
latlons, and tailor the dictionary to the particular subject Thus, a Noun Phrase is of the category:
matter they are interested in (different databases, for ((V -) (N +) (BAR 2))
example). It Is also hoped that the system is general
In our analysis, 'bound morphemes', t h a t is to say
enough to be of use to linguists wishing to Investigate
prefxes and suffixes, are distinguished from others by
the morphology of English and other languages. Con-
their BAR specification; tile suffix ing is the sole member
tents of the basle data files may be altered or replaced:
of the category:
1. A 'Word Grammar' file contains rules assigning inter-
nal structure to complex words, ((V 4-) (N -) (VFORM ING) (BAR -1))
2. A 'Lexicon' file holds the morpheme entries which
As in other GPSG-based work, our analysis encodes the
include syntactic and other Information associated
subcategorlzational prbpertles of lexlcal Items in the value
with stems and affixes.
of a feature SUBCAT. Transitive verbs such as devour
3. A 'Spelling Rules' file contains rules governing permis-
are specified as (SUBCAT NP), and Intransitives such as
sible correspondences between the form of morphemes
elapse as (SUBCAT NULL).
listed in the lextcon and complex words consisting of
As an example from the current analysis of how the
sequences of these morphemes.
system can operate to produce well-formed words, con-
Once these data flies have been prepared, they are com-
sider the familiar fact of English morphology that no
piled using a n u m b e r of pre-processtng functions that
word may contain more than one imqection. The word
operate to produce a set of o u t p u t files. These
grammar must permit both walked and walking, but not
constitute a f u l l y expanded and cross-Indexed dictionary
walkinged. This is achiev~xi by restricting the distribu-
which can then be accessed from within LISP.
tion of inflectional suffixes so t h a t they attach to non-
The process of morphological analysis consists of pars-
Inflected stems only. A general statement of this type
lng a sequence of Input morphemes w i t h respect to the
word grammar, It Is Implemented as an active chart of restriction is made in terms of a feature INFL: stems
specified as (INFL +) may take an lnflecUonal sulfix,
parser (Thompson & Rltchle (1984)), and builds a struc-
ture in the form of a tree in which each node has two while those specified as (INFL ~) m a y not. The STEM
feature described in section 4 provides one means of
1 This work is supported by SERC/AIvey grant number enforcing correct stem-affix combinations; if the suffixes
GR/C/79114. ed and ing are specified with (STEM ((INFL +))), they

277
will attach only to categories which Include the al. (1968)), and the other holding the expanded entries
specification (INFL +). Walk, as a regular verb, is so relating to them. The comptlatlon of a lexicon can take
specified; wallced and waltcing are therefore accepted. Ed, a considerable amount of time; our prototype incorporates
ing, other tnfectlonal suffixes, and irregular (i.e. a lexicon with approximately 3500 entries, which com-
unlnflectable) words, however, are specified as (INFL -). plies In approximately ninety minutes.
Our grammar assigns a binary structure to the words in
question. In order for this method to prevent e.g. walk- 4. The W o r d G r a m m a r
inged, the stem walking must also bear the (INFL -) The internal structure of words is handled by a
specification. This it does, since we regard sutfixes as unification feature grammar with rules of the form:
being the head of a word, and as contributing to the
categorial content of the w o r d as a whole. If the INFL mother -~ daughter 1 daughter 2 ...
specification of the suf~x is copied into the mother where 'mother', 'daughtcrl', etc. are categories. A rule
category, the STEM specification of a further suffix will which adds the plural morpheme to a noun might be
not be satisfied. See section 4 for more discussion of given as shown below:
these matters.
((BAR 0) (V -) (N +) (PLU +) (INFL -)) = >
3. T h e Lexicon
((BAR 0) (V -) (N +) (INFL +))
The lexicon itself consists of a sequence of entries, each ((BAR -1) (V -) (N 4-) (PLU 4-) (INFL -))
in the form of a Lisp s-expression. An entry has five
elements: (1) and (ii) the head word, in its written form The system provides two methods of writing rules in a
and in a phonological transcription, (ill) a 'syntactic more general form; variables and feature-passing conven-
tions.
field', (iv) a 'semantic field', and (v) a 'user field'. The
semantic field has been provided as a facility for users, In our grammar, the category and inflectabllity of a
and any Lisp s-expression can be inserted here. No suffixed word are determined by the category and
significant semantic information is present in our entries, lnflectablllty of the suffix; in the rule below, ALPHA,
beyond the fact that e.g. better and best are related in BETA, and GAMMA are variables ranging over the set
meaning to good. of values {+, -}:
Similarly, the user f e l d Is unexploited, being occupied ((V ALPHA)(N BETA)(INFL GAMMA)(BAR 0)) = >
in all cases by the atom 'nil'. It serves primarily as a ((BAR 0))
place-holder, in that, while it is desirable to maintain ((V ALPHA)(N BETA)(INFL GAMMA)(BAR -1))
the possibility for users to include in an entry whatever
additional information they desire, the form which that Since variables are interpreted consistently throughout a
Information might take in practice is clearly not predict- rule, the mother category and suffix will be identical In
able. their specifications for N, V and INFL.
As an alternative to variables, feature passing conven-
The syntax field consists of a syntactic category, as
tions are also available. These relate categories in what
defined by Gazdar et al. (1985), i.e. a set of feature-
Gazdar et al. (1.985) term 'local trees', i.e. sections of
value pairs. Some of these are relevant only to the
morphological structure consisting of a mother category
workings of the w o r d grammar, and may thus be
and all of Its immediate daughters. The conventions
Ignored by other components In an integrated natural
refer to 'pre-lnstantlatlon' features; these are features
language processing system. Their purpose is to control
present in the categories mentioned In the relevant rule.
the distribution of morphemes in complex words, as
'Extension' and 'unification' are meant In the sense of
described in the following section.
Gazdar et al. (1985), q.v.
The content of a syntax field is often at least par-
tlally predictable. This fact allows us to employ as an The Word-Head Convention:
aid to users wishing to write their own dictionary rules After lnstantlatlon, the set of WHead features in the
which add information to the lexicon during the compi- mother is the unification of the pre-lnstantlatlon
lation process. Recall that, in our analysis of English, WHead features of the Mother with the pre-
the lnflectablllty of a word is governed by the value in lnstantlatlon WHead features of the Rlghtdaughter.
that word's category for INFL. Completion Rules (CRs)
This convention is analogous to the simplest case of the
can be written that will add the specification ( I N F L - )
Head Feature Convention in Gazdar et at. (1985).
to any e n t r y already Including (PLU +) (for e.g. men),
Although there is no formal notion of 'head' in the sys-
(AFORM ER) (for e.g. worse), (VFORM ING), etc,, thus
tem, this convention embodies the Implicit claim that the
removing the need to state Individually that a given head in a local tree is always the right daughter. If the
word cannot be inflected. daughters are a prefix and a stem (as in e.g. re-apply),
A second means of reducing the amount of prepara- the WHead features of the stem are passed up to the
tory work is provided in the form of Multiplication mother. Features encoding morphosyntactic category can
Rules (MRs). Whereas CRs add f u r t h e r specifications to be declared as members of the WHead set, and re-apply
a single entry, MRs have the effect of Increasing the is then of the same category as, and shares various
number of entries In some principled way. One applica- sentence-level syntactic properties with, apply. If the
tion of MRs Is to express the fact that nouns and adjec- daughters are a stem and a suffix, the category of the
tlves do not subcategorize for obligatory complements. mother Is determined not by the stem, but rather by the
A MR can be written which, for each entry containing suffix. For example, possible and ity may be combined to
the specification (N +) and some non-NULL value for form possibility, whose 'nountness' is due to the category
SUBCAT, produces a copy of that entry where the SUB- of the suffix.
CAT specification is replaced by (SUBCAT NULL).
The lexicon complies Into two files, one holding mor-
phemes stored in a tree-shaped structure (cf. Thorne et

278
The Word-Daughter Convention: additional e is Inserted when some nouns are suffixed
(a) If any WDaughter features exist on the Right- with the plural morpheme +s:
daughter then the WDaughter features on the Epenthesls
Mother are the unification of the pre-lnstantlaUon +:e <=~> { < s:s h:h > s:s x:x z:z } - - - s:s
WDaughter features on the Mother with the pre- or < c:c h:h2> .... s:s
lnstantlatlon WDaughter featm-es on the Right-.
The epenthests rule states that e must be inserted at a
daughter.
morpheme boundary if an(:[ only if the boundary has to
(b) If no WDaughter features exist on the Right- its left sh, s, x, z or eh and to Its right s. The
daughter then the WDaughter features on the Interpretation of the rule Is simple; the character pair
Mother are the unification of the pre-lnstantiatlon ('lexical c h a r a c t e r : s u r f a c e character') to the left of the
WDaughter features on the Mother with the pre- arrow specifies the change that takes place between the
lnstantlation WDaughter features on the Left- contexts (again stated in character pairs) given to the
daughter. right of the arrow. Braces ('{','}') Indicate disjunction
The subcategorlzation class of a word remains constant and angled brackets Indicate a sequence, Alternative
under Inflection, but is likely to be changed by the contexts may be specified using the word 'or'. IJexlcal
attachment of a derlvatlonal suffix. Moreover, the sub- and surface strings of unequal length can be matched by
categorization of a prefixed word is the same as that of using the null character '0', and special characters may
its stem. The WDaughter convention is designed to be defined and used in rules, for example to cover the
reflect these facts by enforcing a feature correspondence set of alphabetic characters representing vowels.
between one of the daughters and the mother. When The spelling rules are able to match any pair of char-
the feature set WDaughter is defined as including the acter strings. It would for example be possible to
subcategorlzation feature SUBCAT, the convention results analyse the suppletlve went as a surface form
in configuratkms such as: corresponding to the lexlcal form go+ed. In this case,
((SUBCAT NP)) ((SUBCAT NP)) four rules would be needed to effect the change, and a
((V +)(N +]) ((SUBCAT NP)) better solution is to list went separately In the lexicon.
((SUBCAT NP)) ((VFORM ING)) in practice, the choice between treating this type of
alternation dynamically, with morphological and spelling
which show the relevant feature specifications in local
rules, and statically, by exploiting the lexicon directly,
trees arising from suffixatton of an adjective with +ize to
produce a transitive verb and suffixatlon of a transitive depends on the user's Idea of which is the more elegant
verb with +ing to produce a present participle. solution. While elegance may be in the eye of the
beholder, computational efficiency is mffortunately not.
The Word-Sister Convention: I[ will generally be more efficient to list a word In the
When one daughter is specified for STEM, the lexicon titan to add spelling or morphological rules
category of the other daughter must be an extension specific to small number of cases.
of the value of STEM.
The purpose of this third convention is to allow the Rel'el'ellCCS
subcategorization of affixes with respect to the type of Gazdar, G., E. Klein, G.K. Pullmn, and I.A. Sag (1985)
stem they may attach to. The behavlour of affixes that Generalized Phrase Structure Grammar. Oxford:
attach to more than one category can be handled natur- Blackwells.
ally by giving them a suitable specification for STEM. Karttunen, L. (1983) "KIM:MO - A General Morphologi-
If it is desired to have anti- attached to both nouns and cal Processor", in Texas Linguistic Forum 22, 165 -
adjectives, for example, the specification (STEM ((N +))) 186. Department of Linguistics, University of Texas,
will have that effect, since both adjectives and nouns are Austin, Texas.
extensions of the category ((N +)1. Koskennieml, K. (1983a) "Two-level model for morpho-
The user can define the sets WHead and WDaughter logical analysis", in Proceedings of the Eighth Interna-
as he wishes, or, by leaving them undefined, avoid their tiona2 Joint Conference on AzTificial Intelligence,
effects altogether. The feature STEM is built in, and Karlsruhe, 683 - 685.
need not be defined. The effects of the Word-Sister Koskennleml, K. (1983b) Two-level Morphology: a general
Convention can be modified by changing the STEM computational model for word-form recognition and pro-
specifications ill the lexlcal entries, and avoided by duction, Publication No. 11, University of tIelslnkl,
omitting them. Finland.
Koskennteml, K. (1985) "Compilation of Automata from
5. The S p e l l i n g Rules Two-level Rules", talk given at the Workshop on
Finite-State Morphoiogy, CSLI, Stanford, July, 1985.
The rules are based on the work of Koskennlemt (1983a, Thompson, IL and G.D. Rltchte (19841 "Implementing
1983b, Karttunen 1983), though their application here is Natural Language Parsers", in T. O'Shea and M. Elsen+
solely to the question of 'morphographemlcs'; the more stadt (eds.) Az~tificial Intelligence: Tools, Techniques
general morphological effects of Koskenniemi's rules are and Applications. New York: Harper and Row.
produced dlffenmtly. The current version of the system
contains a compiler allowing the rules to be written in a Thorne, J.P., P. Bratley, and, It. Dewar (1968) "The s y n -
high level notation based on KoskennIemi (1985). Any tactic analysis of English by machine", in D. Mlchie
number of spelling rules can be employed, though our (ed.) Machine Intelligence 3. Edinburgh: Edinburgh
system has fifleen. They are compiled during the gen- University Press.
eral dictionary pre-processlng stage into deterministic
finite state transducers, of which one tape represents the
lexlcal form and the other the surface form.
The following rule describes the process by which an

279

You might also like