
Lecture 10

Morphology and Finite State Transducers

Intro to NLP, CS585, Fall 2014


http://people.cs.umass.edu/~brenocon/inlp2014/
Brendan O’Connor (http://brenocon.com)

Thursday, October 2, 14
• PS2 out tonight: due next Friday at midnight
• David will post OH for early next week
• I might have to change mine (next Thurs)

Word-internal structure

• What’s inside a word?


• walk, walked, walks: Our models think these are
different. Should they be?
• Motivations: less sparsity, deeper understanding
• Orthography: spelling conventions
• “blacke as the night” vs “black as the night”
Not really linguistically meaningful
• yesss vs yesssssssss
(Linguistically meaningful?...)
• Morphology: linguistically productive
• Rule-based approaches are very common here

Morphology: Phenomena

• Inflection
• Derivation
• Compounding
• Cliticization

• Inflectional morphology: modify root to a word of the same class, due to grammatical constraints like agreement
• e.g. regular verbs. (Exceptions?)
• English is relatively simple

The possessive suffix is realized by apostrophe + -s for regular singular nouns (llama's) and plural nouns not ending in -s (children's) and often by a lone apostrophe after regular plural nouns (llamas') and some names ending in -s or -z (Euripides' comedies).

English verbal inflection is more complicated than nominal inflection. First, English has three kinds of verbs: main verbs (eat, sleep, impeach), modal verbs (can, will, should), and primary verbs (be, have, do) (using the terms of Quirk et al., 1985). In this chapter we will mostly be concerned with the main and primary verbs, because it is these that have inflectional endings. Of these verbs a large class are regular, that is to say all verbs of this class have the same endings marking the same functions. These regular verbs (e.g. walk, or inspect) have four morphological forms, as follows:

Morphological Form Classes      Regularly Inflected Verbs
stem                            walk     merge    try     map
-s form                         walks    merges   tries   maps
-ing participle                 walking  merging  trying  mapping
Past form or -ed participle     walked   merged   tried   mapped

These verbs are called regular because just by knowing the stem we can predict the other forms by adding one of three predictable endings and making some regular spelling changes (and as we will see in Ch. 7, regular pronunciation changes). These regular verbs and forms are significant in the morphology of English first because they cover a majority of the verbs, and second because the regular class is productive. As discussed earlier, a productive class is one that automatically includes any new words that enter the language. For example the recently-created verb fax (My mom faxed me the note from cousin Everett) takes the regular endings -ed, -ing, -es.
• Derivational morphology: modify root to a
word of a different class
• derivational
derive -ation -al
• Can be tricky
• universe --> uni- verse ?
• universal --> uni- verse -al ???

• Compounding: baseball, desktop
• Cliticization

The suffix -ation can be added to almost any verb ending in -ize, but cannot be added to every verb. Thus we can't say *eatation or *spellation (we use an asterisk to mark "non-examples" of English). Another is that there are subtle and complex meaning differences among nominalizing suffixes. For example sincerity has a subtle difference in meaning from sincereness.

3.1.3 Cliticization

Recall that a clitic is a unit whose status lies in between that of an affix and a word. The phonological behavior of clitics is like affixes; they tend to be short and unaccented (we will talk more about phonology in Ch. 8). Their syntactic behavior is more like words, often acting as pronouns, articles, conjunctions, or verbs. Clitics preceding a word are called proclitics, while those following are enclitics.

English clitics include these auxiliary verbal forms:

Full Form  Clitic     Full Form  Clitic
am         'm         have       've
are        're        has        's
is         's         had        'd
will       'll        would      'd
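The clitic table above can be read as a small lookup, with the ambiguous clitics ('s, 'd) mapping to multiple full forms. A minimal sketch (the function name and the decision to ignore the possessive 's are my own, not from the slides):

```python
# The auxiliary clitic table as a lookup; 's and 'd are ambiguous.
CLITICS = {
    "'m": ["am"],
    "'re": ["are"],
    "'s": ["is", "has"],   # also the possessive suffix, ignored here
    "'ll": ["will"],
    "'ve": ["have"],
    "'d": ["had", "would"],
}

def expand_clitic(token):
    """Return the possible full forms for a host+clitic token like "she's"."""
    for clitic, fulls in CLITICS.items():
        if token.endswith(clitic):
            host = token[: -len(clitic)]
            return [f"{host} {full}" for full in fulls]
    return [token]

print(expand_clitic("she's"))   # ['she is', 'she has']
print(expand_clitic("they'd"))  # ['they had', 'they would']
```

Note that a real system would need context (e.g. the following word) to choose between the expansions.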
Full morph. parsing

Let's now proceed to the problem of parsing morphology. Our goal will be to take input forms like those in the first and third columns of Fig. 3.2, and produce output forms like those in the second and fourth columns.

English                              Spanish
Input    Morph. Parsed Output        Input    Morph. Parsed Output          Gloss
cats     cat +N +Pl                  pavos    pavo +N +Masc +Pl             'ducks'
cat      cat +N +Sg                  pavo     pavo +N +Masc +Sg             'duck'
cities   city +N +Pl                 bebo     beber +V +PInd +1P +Sg        'I drink'
geese    goose +N +Pl                canto    cantar +V +PInd +1P +Sg       'I sing'
goose    goose +N +Sg                canto    canto +N +Masc +Sg            'song'
goose    goose +V                    puse     poner +V +Perf +1P +Sg        'I was able'
gooses   goose +V +1P +Sg            vino     venir +V +Perf +3P +Sg        'he/she came'
merging  merge +V +PresPart          vino     vino +N +Masc +Sg             'wine'
caught   catch +V +PastPart          lugar    lugar +N +Masc +Sg            'place'
caught   catch +V +Past

Figure 3.2 Output of a morphological parse for some English and Spanish words. Spanish output modified from the Xerox XRCE finite-state language tools.

The second column contains the stem of each word as well as assorted morphological features. These features specify additional information about the stem. For example the feature +N means that the word is a noun; +Sg means it is singular, +Pl that it is plural. Morphological features will be referred to again in Ch. 5 and in more detail in Ch. 16; for now, consider +Sg to be a primitive unit that means "singular". Spanish has some features that don't occur in English; for example the nouns lugar and pavo are marked +Masc (masculine). Because Spanish nouns agree in gender with adjectives, knowing the gender of a noun will be important for tagging and parsing.

Note that some of the input forms (like caught, goose, canto, or vino) will be ambiguous between different morphological parses.

• Need
  1. Lexicon of stems and affixes (maybe guess stems...)
  2. Morphotactics: morpheme ordering model
  3. Orthographic rules: spelling changes upon combination (city + -s --> citys -> cities)
Lexicons

3.3 Building a Finite-State Lexicon

• Noun inflection: need three word lists for nouns
  • regular nouns
  • irregular plural nouns
  • irregular singular nouns
• FSA represents all inflected nouns: don't have to store plural forms.
• Abbreviated form below (What does the full FSA look like?)

A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language (every word, i.e., including abbreviations ("AAA") and proper names ("Jane" or "Beijing")) as follows:

a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, . . .

Since it will often be inconvenient or impossible, for the various reasons we discussed above, to list every word in the language, computational lexicons are usually structured with a list of each of the stems and affixes of the language together with a representation of the morphotactics that tells us how they can fit together. There are many ways to model morphotactics; one of the most common is the finite-state automaton. A very simple finite-state model for English nominal inflection might look like Fig. 3.3.

Figure 3.3 A finite-state automaton for English nominal inflection, with states q0, q1, q2 and arcs labeled reg-noun, plural -s, irreg-pl-noun, and irreg-sg-noun.

The FSA in Fig. 3.3 assumes that the lexicon includes regular nouns (reg-noun) that take the regular -s plural (e.g., cat, dog, fox, aardvark). These are the vast majority of English nouns.
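The three word lists plus the FSA above can be sketched as a recognizer. This is a hypothetical sketch, not the textbook's implementation: the word lists are illustrative samples, and both singular and plural paths are treated as accepting.

```python
# Illustrative word lists standing in for the three lexicon classes.
REG_NOUN = {"cat", "dog", "fox", "aardvark"}
IRREG_SG_NOUN = {"goose", "sheep", "mouse"}
IRREG_PL_NOUN = {"geese", "sheep", "mice"}

def recognize(word):
    """Accept a word iff some path through the Fig. 3.3-style FSA matches it."""
    if word in IRREG_PL_NOUN:                      # q0 --irreg-pl-noun--> final
        return True
    if word in IRREG_SG_NOUN:                      # q0 --irreg-sg-noun--> final
        return True
    if word in REG_NOUN:                           # q0 --reg-noun--> final
        return True
    if word.endswith("s") and word[:-1] in REG_NOUN:
        return True                                # reg-noun followed by plural -s
    return False

print([recognize(w) for w in ["cats", "goose", "geese", "gooses"]])
# [True, True, True, False]
```

Note that *gooses* is correctly rejected: *goose* is listed only as an irregular singular, so the plural -s arc does not apply to it.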
Derivational morph. example

English derivational morphology is significantly more complex than English inflectional morphology, and so automata for modeling English derivation tend to be quite complex. Some models of English derivation, in fact, are based on the more complex context-free grammars of Ch. 12 (Sproat, 1993).

• Derivational data we want to model: adjectives become opposites, comparatives, adverbs

Consider a relatively simpler case of derivation: the morphotactics of English adjectives. Here are some examples from Antworth (1990):

big, bigger, biggest, cool, cooler, coolest, coolly
happy, happier, happiest, happily
red, redder, reddest
unhappy, unhappier, unhappiest, unhappily
real, unreal, really
clear, clearer, clearest, clearly, unclear, unclearly
Derivational morph. example

• Task: recognition. Proposed model: the FSA in Fig. 3.5.

An initial hypothesis might be that adjectives can have an optional prefix (un-), an obligatory root (big, cool, etc.) and an optional suffix (-er, -est, or -ly). This might suggest the FSA in Fig. 3.5.

Figure 3.5 An FSA for a fragment of English adjective morphology: Antworth's Proposal #1. States q0 through q3, with arcs un- (q0 to q1), adj-root (q1 to q2), and -er, -est, -ly (q2 to q3); the prefix and suffix are optional.

• Any false positives? (Compare: Child language learning)

Alas, while this FSA will recognize all the adjectives in the table above, it will also recognize ungrammatical forms like unbig, unfast, oranger, or smally. We need to set up classes of roots and specify their possible suffixes. Thus adj-root1 would include adjectives that can occur with un- and -ly (clear, happy, and real) while adj-root2 will include adjectives that can't (big, small), and so on.

This gives an idea of the complexity to be expected from English derivation. As a further example, we give in Figure 3.6 another fragment of an FSA for English nominal and verbal derivational morphology, based on Sproat (1993), Bauer (1983), and Porter (1980). This FSA models a number of derivational facts, such as the well known generalization that any verb ending in -ize can be followed by the nominalizing suffix -ation (Bauer, 1983; Sproat, 1993). Thus since there is a word fossilize, we can predict the word fossilization by following states q0, q1, and q2. Similarly, adjectives ending in -al or -able at q5 (equal, formal, realizable) can take the suffix -ity, or sometimes the suffix -ness to state q6 (naturalness, casualness). We leave it as an exercise for the reader (Exercise 3.1) to discover some of the individual exceptions to many of these constraints, and also to give examples of some of the various noun and verb classes.

Figure 3.6 An FSA for another fragment of English derivational morphology, with arc labels such as noun-i, -ize/V, -ation/N, adj-al, -able/A, -ity/N, -ness/N, -er/N, adj-ous, -ly/Adv, -ive/A, -ative/A, and -ful/A over states q0 through q11.

We can now use these FSAs to solve the problem of morphological recognition; that is, of determining whether an input string of letters makes up a legitimate English word or not. We do this by taking the morphotactic FSAs, and plugging in each "sub-lexicon" into the FSA.
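The root-class fix described above can be sketched directly. The class memberships below follow the text (clear, happy, real take un- and -ly; big, small do not); putting cool in the first class and ignoring spelling changes like y-to-i and consonant doubling are my own simplifications.

```python
ADJ_ROOT1 = {"clear", "happy", "real", "cool"}   # allow un- and -ly
ADJ_ROOT2 = {"big", "small", "red"}              # do not allow un- or -ly

def recognize_adj(word):
    """Recognize (un-) root (-er|-est|-ly) with per-class restrictions."""
    for prefix in ("un", ""):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("est", "er", "ly", ""):
            if not rest.endswith(suffix):
                continue
            root = rest[: len(rest) - len(suffix)] if suffix else rest
            restricted = prefix == "un" or suffix == "ly"
            if root in ADJ_ROOT1 or (root in ADJ_ROOT2 and not restricted):
                return True
    return False

print([recognize_adj(w) for w in ["unclearly", "coolest", "unbig", "smally"]])
# [True, True, False, False]
```

So the refined model blocks the false positives *unbig* and *smally* while still accepting *unclearly*; forms involving spelling changes (*happier*, *redder*) would additionally need the orthographic rules discussed later.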
Recognition vs Parsing

• Morphological recognition
• is_pasttense_verb(loved) --> TRUE
• Finite State Automata can do this
• Morphological parsing: what is its breakdown?
• parse(loved) --> love/VERB -ed/PAST-TENSE
• Finite State Transducers can do this

Finite State Transducers
• FSAutomata
• An FSA represents a set of strings. e.g.
{walk, walks, walked, love, loves, loved}
• Regular language.
• A recognizer function.
recognize(str) -> true or false
• FSTransducers
• An FST represents a set of pairs of strings (think of as
input,output pairs)
{ (walk, walk+V+PL), (walk, walk+N+SG), (walked, walk+V+PAST) ...}
• Regular relation. (Not a function!)
• A transducer function: maps input to zero or more outputs.
transduce(walk) --> {walk+V+PL, walk+N+SG}
Can return multiple answers if ambiguity: e.g. if you don’t have
POS-tagged input, “walk” could be the verb “They walk to the
store” versus the noun “I took a walk”.
• Generic inversion and composition operations.

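The regular-relation view above can be made concrete by treating a (finite fragment of a) transducer as a set of (input, output) pairs; the pairs below are the slide's own examples.

```python
# A transducer fragment as a finite regular relation: a set of string pairs.
# It is a relation, not a function, so one input can have several outputs.
PAIRS = {
    ("walk", "walk+V+PL"),
    ("walk", "walk+N+SG"),
    ("walked", "walk+V+PAST"),
}

def transduce(s):
    """Return every output paired with input s (zero, one, or many)."""
    return {out for inp, out in PAIRS if inp == s}

print(transduce("walk"))    # two analyses: verb and noun
print(transduce("walked"))  # {'walk+V+PAST'}
print(transduce("runs"))    # set(): not in the relation
```

A real FST represents an infinite relation compactly with states and arcs; the set-of-pairs view is the semantics, not the data structure.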
Finite State Transducers
• FSAutomata have input labels.
• One input tape
• FSTransducers have input:output pairs on labels.
• Two tapes: input and output.
Figure 3.8 A finite-state transducer, modified from Mohri (1997): two states q0 and q1 with arcs labeled by input:output pairs such as a:b, b:a, b:b, and a:ba.

• FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another string
• FST as set relater: a machine that computes relations between sets.
Finite State Transducers

• FSAutomata have input labels.
• FSTransducers have input:output pairs on labels.

All of these have applications in speech and language processing. For morphological parsing (and for many other NLP applications), we will apply the FST-as-translator metaphor, taking as input a string of letters and producing as output a string of morphemes.

Let's begin with a formal definition. An FST can be formally defined with 7 parameters:

Q              a finite set of N states q0, q1, ..., qN−1
Σ              a finite set corresponding to the input alphabet
Δ (new)        a finite set corresponding to the output alphabet
q0 ∈ Q         the start state
F ⊆ Q          the set of final states
δ(q, w)        the transition function or transition matrix between states; given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (because there are 2^|Q| possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous in which state it maps to.
σ(q, w) (new)  the output function giving the set of possible output strings for each state and input; given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*)

Where FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular languages, which are sets of strings.
Inversion

• inversion: The inversion of a transducer T (T⁻¹) simply switches the input and output labels. Thus if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
• composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.

Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator.

T   = { (a,a1), (a,a2), (b,b1), (c,c1), (c,a) }
T⁻¹ = { (a1,a), (a2,a), (b1,b), (c1,c), (a,c) }

Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer. Composition works as in algebra; applying T1 ∘ T2 to an input sequence S is identical to applying T1 to S and then T2 to the result; thus T1 ∘ T2(S) = T2(T1(S)).

Fig. 3.9, for example, shows the composition of [a:b]+ with [b:c]+ to produce [a:c]+.

Figure 3.9 The composition of [a:b]+ with [b:c]+ to produce [a:c]+ (each machine: q0 to a final q1 with a self-loop on the labeled pair).
Composition

• composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.
• As functions, (T1 ∘ T2)(x) = T1(T2(x))

Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer. Composition works as in algebra; applying T1 ∘ T2 to an input sequence S is identical to applying T1 to S and then T2 to the result; thus T1 ∘ T2(S) = T2(T1(S)).

Fig. 3.9, for example, shows the composition of [a:b]+ with [b:c]+ to produce [a:c]+.

Figure 3.9 The composition of [a:b]+ with [b:c]+ to produce [a:c]+.

The projection of an FST is the FSA that is produced by extracting only one side of the relation. We can refer to the projection to the left or upper side of the relation as the upper or first projection and the projection to the lower or right side of the relation as the lower or second projection.
Generic FST operations

• There exist generic algorithms for inversion and composition (and minimization...)
• Chain together many transducers
• e.g. OpenFST open-source library

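On finite relations, inversion and composition are one-liners, which is a useful way to check the definitions. A sketch over sets of pairs (real FST libraries such as OpenFST operate on the state machines themselves, not enumerated pairs); note the composition below pipes through T1 first and then T2, matching "applying T1 to S and then T2 to the result" above.

```python
def invert(T):
    """T^-1: swap the input and output of every pair."""
    return {(o, i) for (i, o) in T}

def compose(T1, T2):
    """T1 o T2: chain pairs through the shared middle tape (T1 then T2)."""
    return {(i, o2) for (i, o1) in T1 for (m, o2) in T2 if o1 == m}

# The inversion example from the earlier slide:
T = {("a", "a1"), ("a", "a2"), ("b", "b1"), ("c", "c1"), ("c", "a")}
print(invert(T) == {("a1", "a"), ("a2", "a"), ("b1", "b"), ("c1", "c"), ("a", "c")})
# True

# Finite fragments of [a:b]+ and [b:c]+ compose into a fragment of [a:c]+:
AB = {("a", "b"), ("aa", "bb")}
BC = {("b", "c"), ("bb", "cc")}
print(sorted(compose(AB, BC)))  # [('a', 'c'), ('aa', 'cc')]
```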
FSTs for morph parsing

Let's now proceed to the problem of parsing morphology. Our goal will be to take input forms like those in the first and third columns of Fig. 3.2 (shown earlier), and produce output forms like those in the second and fourth columns: e.g., English cats → cat +N +Pl, Spanish pavos → pavo +N +Masc +Pl 'ducks'.
FSTs for morph parsing

In the finite-state morphology paradigm that we will use, we represent a word as a correspondence between a lexical level, which represents a concatenation of morphemes making up a word, and the surface level, which represents the concatenation of letters which make up the actual spelling of the word. Fig. 3.12 shows these two levels for (English) cats.

Lexical:  c a t +N +Pl
Surface:  c a t s

Figure 3.12 Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well.

For finite-state morphology it's convenient to view an FST as having two tapes. The upper or lexical tape is composed from characters from one alphabet Σ. The lower or surface tape is composed of characters from another alphabet Δ. In the two-level morphology of Koskenniemi (1983), we allow each arc only to have a single symbol from each alphabet. We can then combine the two symbol alphabets Σ and Δ to create a new alphabet, Σ′, which makes the relationship to FSAs quite clear. Σ′ is a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i:o; one symbol i from the input alphabet Σ, and one symbol o from an output alphabet Δ, thus Σ′ ⊆ Σ × Δ. Σ and Δ may each also include the epsilon symbol ε. Thus where an FSA accepts a language stated over a finite alphabet of single symbols, such as the alphabet of our sheep language:

(3.2) Σ = {b, a, !}

an FST defined this way accepts a language stated over pairs of symbols, as in:

(3.3) Σ′ = {a:a, b:b, !:!, a:ε, ε:!}

In two-level morphology, the pairs of symbols in Σ′ are also called feasible pairs. Thus each feasible pair symbol a:b in the transducer alphabet Σ′ expresses how the symbol a from one tape is mapped to the symbol b on the other tape. For example a:ε means that an a on the upper tape corresponds to nothing on the lower tape.
We are now ready to build an FST morphological parser out of our earlier morphotactic FSAs and lexica by adding an extra "lexical" tape and the appropriate morphological features. Fig. 3.13 shows an augmentation of Fig. 3.3 with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme. The symbol ˆ indicates a morpheme boundary, while the symbol # indicates a word boundary. The morphological features map to the empty string ε or the boundary symbols since there is no segment corresponding to them on the output tape.

Figure 3.13 A schematic transducer for English nominal number inflection Tnum, with states q0–q6: q0 reaches q1, q2, q3 via reg-noun, irreg-sg-noun, and irreg-pl-noun; arcs labeled +N:ε lead to q4, q5, q6; and the word ends via +Pl:ˆs#, +Sg:#, or +Pl:#. The symbols above each arc represent elements of the morphological parse in the lexical tape; the symbols below each arc represent the surface tape (or the intermediate tape, to be described later), using the morpheme-boundary symbol ˆ and word-boundary marker #. The labels on the arcs leaving q0 are schematic, and need to be expanded by individual words in the lexicon.

In order to use Fig. 3.13 as a morphological noun parser, it needs to be expanded with all the individual regular and irregular noun stems, replacing the labels reg-noun etc. In order to do this we need to update the lexicon for this transducer, so that irregular plurals like geese will parse into the correct stem goose +N +Pl. We do this by allowing the lexicon to also have two levels. Since surface geese maps to lexical goose, the new lexical entry will be "g:g o:e o:e s:s e:e". Regular forms are simpler; the two-level entry for fox will now be "f:f o:o x:x", but by relying on the orthographic convention that f stands for f:f and so on, we can simply refer to it as fox and the form for geese as "g o:e o:e s e". Thus the lexicon will look only slightly more complex:

reg-noun     irreg-pl-noun     irreg-sg-noun
fox          g o:e o:e s e     goose
cat          sheep             sheep
aardvark     m o:i u:ε s:c e   mouse

Figure 3.14 A fleshed-out English nominal inflection FST Tlex, expanded from Tnum by replacing the three arcs with individual word stems (only a few sample word stems are shown).

The resulting transducer, shown in Fig. 3.14, will map plural nouns into the stem plus the morphological marker +Pl, and singular nouns into the stem plus the morphological marker +Sg. Thus a surface cats will map to cat +N +Pl. This can be viewed in feasible-pair format as follows:

c:c a:a t:t +N:ε +Pl:ˆs#
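The behavior of a Tlex-style parser can be mimicked with a toy lookup over the same word lists. This is a hypothetical stand-in using plain string operations (a real implementation composes transducers); note how it reproduces the cats → cat +N +Pl mapping and the singular/plural ambiguity of sheep.

```python
# Word lists from the small lexicon above.
REG_NOUN = {"fox", "cat", "aardvark"}
IRREG_PL = {"geese": "goose", "sheep": "sheep", "mice": "mouse"}
IRREG_SG = {"goose", "sheep", "mouse"}

def parse_noun(surface):
    """Map a surface noun to its lexical-level analyses (stem + features)."""
    analyses = set()
    if surface in IRREG_PL:                     # e.g. the g o:e o:e s e entry
        analyses.add(f"{IRREG_PL[surface]} +N +Pl")
    if surface in IRREG_SG:
        analyses.add(f"{surface} +N +Sg")
    if surface in REG_NOUN:
        analyses.add(f"{surface} +N +Sg")
    if surface.endswith("s") and surface[:-1] in REG_NOUN:
        analyses.add(f"{surface[:-1]} +N +Pl")  # the +Pl:^s# arc
    return analyses

print(parse_noun("cats"))   # {'cat +N +Pl'}
print(parse_noun("geese"))  # {'goose +N +Pl'}
print(parse_noun("sheep"))  # both the +Sg and +Pl analyses
```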
Orthographic Rules

Since the output symbols include the morpheme and word boundary markers ˆ and #, the lower labels in Fig. 3.14 do not correspond exactly to the surface level; we refer to this tape with the boundary markers as an intermediate tape.

Name                Description of Rule                               Example
Consonant doubling  1-letter consonant doubled before -ing/-ed        beg/begging
E deletion          silent e dropped before -ing and -ed              make/making
E insertion         e added after -s, -z, -x, -ch, -sh before -s      watch/watches
Y replacement       -y changes to -ie before -s, -i before -ed        try/tries
K insertion         verbs ending with vowel + -c add -k               panic/panicked

We can think of these spelling changes as taking as input a simple concatenation of morphemes (the "intermediate output" of the lexical transducer in Fig. 3.14) and producing as output a slightly-modified (correctly-spelled) concatenation of morphemes. Fig. 3.16 shows in schematic form the three levels we are talking about: lexical, intermediate, and surface. So for example we could write an E-insertion rule that performs the mapping from the intermediate to surface levels shown in Fig. 3.16: insert an e just when a morpheme ending in x (or s, z, ch, sh) is followed by the -s suffix.
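The E-insertion rule can be sketched as a rewrite from the intermediate tape to the surface tape. A minimal sketch, assuming the boundary conventions above (ˆ marks a morpheme boundary, # a word boundary, written here as ASCII ^ and #); a real system would express this as a transducer composed with Tlex rather than a regex.

```python
import re

def e_insertion(intermediate):
    """Apply E insertion, then erase boundary markers, yielding the surface form."""
    # Insert e between a morpheme ending in s, z, x, ch, sh and the -s suffix.
    surface = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es", intermediate)
    # Remaining boundary markers map to the empty string on the surface tape.
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))    # foxes
print(e_insertion("watch^s#"))  # watches
print(e_insertion("cat^s#"))    # cats  (rule does not apply)
```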
Other applications of FSTs
• Spelling correction: make an FST that can
capture common mistakes
• Transpositions: teh -> the
• Speech recognition: phonemes, pronunciation
dictionaries, words...
• OpenFST library: originally developed for speech
applications
• Machine translation: Model 1 can be thought of
as an FST!
• Usually use weighted FSTs: probabilities of
spelling errors, word/phrase translations, etc.

Stemming: simple morph. parse

• Porter stemmer: cascaded rewrite rules; output of one stage is the input for the next
• Can be thought of as a lexicon-free FST
• Is stemming always good? Depends on language, amount of morph. productiveness, and data size

In information retrieval, a query keyword such as marsupial might not match documents containing marsupials, so some IR systems first run a simple stemmer on the query and document words. Morphological information in IR is thus only used to determine that two words have the same stem; the suffixes are thrown away.

One of the most widely used such stemming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules. Since cascaded rewrite rules are just the sort of thing that could be easily implemented as an FST, we think of the Porter algorithm as a lexicon-free FST stemmer (this idea will be developed further in the exercises (Exercise 3.6)). The algorithm contains rules like these:

ATIONAL → ATE   (e.g., relational → relate)
ING → ε if stem contains vowel   (e.g., motoring → motor)

See Porter (1980) or Martin Porter's official homepage for the Porter stemmer for more details.

Krovetz (1993) showed that stemming tends to somewhat improve the performance of information retrieval, especially with smaller documents (the larger the document, the higher the chance the keyword will occur in the exact form used in the query). Nonetheless, not all IR engines use stemming, partly because of stemmer errors such as these noted by Krovetz:

Errors of Commission          Errors of Omission
organization    organ         European    Europe
doing           doe           analysis    analyzes
generalization  generic       matrices    matrix
numerical       numerous      noise       noisy
policy          police        sparse      sparsity
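The two rules quoted above already make a tiny working cascade, where the output of one stage is the input for the next. A minimal sketch with just those two rules (the real Porter algorithm has many more rules plus conditions on the "measure" of the stem):

```python
def step1(word):
    """ATIONAL -> ATE, e.g. relational -> relate."""
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"
    return word

def step2(word):
    """ING -> empty string if the remaining stem contains a vowel."""
    stem_candidate = word[: -len("ing")] if word.endswith("ing") else word
    if stem_candidate != word and any(v in stem_candidate for v in "aeiou"):
        return stem_candidate
    return word

def stem(word):
    # Cascade: each stage's output feeds the next stage.
    return step2(step1(word))

print([stem(w) for w in ["relational", "motoring", "sing"]])
# ['relate', 'motor', 'sing']
```

Note *sing* survives: the vowel condition blocks the ING rule, since stripping -ing would leave the vowelless stem *s*.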
