
Unit-3: Morphology & FST

IS 7118: Natural Language Processing


1st Year, 2nd Semester, M.Sc(IS)
(Slides adapted from the textbook by Jurafsky & Martin)
Instructor: Prof. Rama Krishna Rao Bandaru
Crazy English

• A writer is someone who writes, and a stinger is something that stings. But fingers don’t fing, grocers don’t groce, haberdashers don’t haberdash, hammers don’t ham, and humdingers don’t humding.
— Richard Lederer, Crazy English

IS 7118:NLP Unit-3: Morphology & FST, 2


Prof. R.K.Rao Bandaru
Contents of Unit -3
❖ Morphology
▪ Inflectional Morphology
▪ Derivational Morphology
▪ Cliticization
▪ Non-Concatenative Morphology
❖ Finite-State Morphological Parsing
❖ Finite-State Lexicon
❖ Finite-State Transducers
❖ FST’s for Morphological Parsing
❖ Transducers and Orthographic Rules
❖ Lexicon-Free FSTs: The Porter Stemmer
❖ Word and Sentence Tokenization
❖ Detection and Correction of Spelling Errors
❖ Minimum Edit Distance
Types of Words
• Variable Words: words that can change their structure
– E.g., woman, sweet, walk
• Invariable Words: words that cannot change at all.
– E.g., but, if, about, myself
• Closed-Set Items: small substitution sets whose principal function is grammatical. They change slowly through time.
– E.g., he, a, on, be
• Open-Set Items: large substitution sets that carry meaning and have rapid turn-over.
– E.g., play → replay

Morphology(1)
• Morpheme = “minimal meaning-bearing unit in a language”
– E.g., fold, grace, man
• Morphology handles the formation of words from morphemes.
• Free Morphemes: morphemes that can stand alone as words.
– E.g., man, book, dog, clever
• Bound Morphemes: morphemes that only function as parts of words.
– E.g., the affixes in manly, dogs, unmanly, quickly, slowly
• Morphological Parsing: the task of recognizing the morphemes inside a word.
– For instance, recognizing that ‘foxes’ breaks into the two morphemes ‘fox’ and ‘-es’.
• The finite-state transducer (FST) is an important morphological parsing algorithm.
Morphology(2)
• We can divide morphemes into two broad classes.
– Stems – the core meaningful units, the root of the word.
– Affixes – add additional meanings and grammatical
functions to words.
– base form (stem), e.g., believe
– affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly
• Affixes are further divided into:
– Prefixes – precede the stem: do / undo
– Suffixes – follow the stem: eat / eats
– Infixes – are inserted inside the stem, like sing, sang, sung
– Circumfixes – affixes that go around a stem.
• This is concatenative morphology, because a word is composed of a number of morphemes concatenated together.

Morphemes and Words

• Combine morphemes to create words


– Inflection
• combination of a word stem with a grammatical morpheme
• same word class, e.g. clean (verb), clean-ing (verb)
– Derivation
• combination of a word stem with a grammatical morpheme
• Yields a different word class, e.g., computerize (verb) → computerization (noun)
– Compounding
• combination of multiple word stems. E.g., doghouse, icecream
– Cliticization:
• A clitic is a unit whose status lies between that of an affix and a word.
• The phonological behavior of clitics is like that of affixes.
• Cliticization is the combination of a word stem with a clitic, often producing different words from different syntactic categories, e.g. I’ve = I + have
Inflectional Morphology
• Inflectional Morphology

• word stem + grammatical morpheme, e.g., cat + s


• Only nouns, verbs, and some adjectives can be inflected.
• English nouns have just two kinds of inflection:
– an affix that marks plural
– an affix that marks possessive
• Nouns
– plural:
– regular: +s, +es irregular: mouse - mice; ox - oxen
– rules for exceptions: e.g. -y -> -ies like: butterfly - butterflies
– possessive: +'s, +'
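The regular plural rules above (+s, +es, -y → -ies) can be sketched as a small Python function. This is a rough illustration of just those rules, not a complete treatment; irregular forms like mouse/mice would need a lexicon.

```python
import re

def pluralize(noun):
    """Apply the regular English plural rules from the slide.
    A sketch only: irregulars (mouse/mice, ox/oxen) need a lexicon."""
    if re.search(r"[^aeiou]y$", noun):       # butterfly -> butterflies
        return noun[:-1] + "ies"
    if re.search(r"(s|x|z|ch|sh)$", noun):   # fox -> foxes
        return noun + "es"
    return noun + "s"                        # cat -> cats

print(pluralize("butterfly"))  # butterflies
print(pluralize("fox"))        # foxes
print(pluralize("cat"))        # cats
```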
• Verbs
– main verbs (sleep, eat, walk)
– modal verbs (can, will, should)
– primary verbs (be, have, do)
Note: These verbs are ‘regular’ in the sense that we can predict the other forms by adding one of the three predictable endings and making regular spelling changes. They are also productive, i.e., the pattern automatically applies to any new words that enter the language (fax, faxed).
Inflectional Morphology (verbs)
• Verb Inflections are more complicated than nominal inflection.
• Morph. Form Regularly Inflected Form
a) stem walk merge try map
b) -s form walks merges tries maps
c) -ing participle walking merging trying mapping
d) past; -ed participle walked merged tried mapped
• Irregular verbs are those that have more or less idiosyncratic forms of inflection. They constitute a much smaller class of verbs but are very frequently used.

• Morph Form Irregularly Inflected Form


a) stem eat catch cut
b) -s form eats catches cuts
c) -ing participle eating catching cutting
d) -ed past ate caught cut
e) -ed participle eaten caught cut
Note: an irregular verb can inflect in the past form by changing its vowel (eat/ate). This is called the preterite.
Derivational Morphology
• Derivation:
– results in a word of a different class, often with a meaning hard to predict exactly; derivation in English is quite complex.
– Commonly, new nouns are formed from verbs or adjectives.
– This formation of new nouns from verbs or adjectives is known as nominalization.
• Some English derivational affixes

• Adjectives can also be derived from nouns or verbs

Clitic
• A clitic is a unit whose status lies between that of an affix and a
word. The syntactic behavior is more like words, often acting
as pronouns, articles, conjunctions, or verbs.
• Clitics preceding a word are called proclitics and those following are called enclitics.
• Here are examples of verb clitics:

Non-Concatenative Morphology
• In concatenative morphology, words are composed of a string of concatenated morphemes.
• In non-concatenative morphology, morphemes are combined in more complex ways.
– E.g., in the Tagalog language, the two morphemes ‘hingi’ and ‘um’ are intermingled.
• Another kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology.
– E.g., Arabic, Hebrew, and other Semitic languages
– In Hebrew, a tri-consonantal root:
– lmd = learn or study
– lamad = he studied
– limed = he taught
• Agreement: In English the subject noun and the main verb have to agree in number (singular or plural).
Morphology and FSAs
• We would like to use the machinery provided
by FSAs to capture facts about morphology
– Accept strings (words) that are legitimate words in
the language
– Reject those that are not
– And do so in a way that doesn’t require us to list all
the forms of all the words in the language
• Even in English this is inefficient
• And in other languages it is impossible

Parsing/Generation vs. Recognition

• We can now run strings through these machines to


recognize strings in the language

• But recognition is usually not quite what we need


– Often if we find some string in the language we might like to
assign a structure to it (parsing)
– Or we might start with some structure and want to produce a
surface form for it (production/generation)

• For that we’ll move to finite state transducers


– Add a second tape that can be written to
Finite-State Morphological Parsing
• Morphological parsing = the task of recognizing the morphemes inside a word
• Morphological parsing in the example below involves taking input forms like those in the first column and producing output forms like those in the second column.
• The second column contains the stem of each word as well as assorted morphological features that give additional information about the stem.

• English doesn’t tend to stack many affixes on a single word.


• But Turkish can have words with a lot of suffixes.
• Languages such as Turkish that tend to string affixes together are called agglutinative languages, e.g., “uygarlaştıramadıklarımızdanmışsınızcasına”.
Parts of A Morphological Processor
• In order to build a morphological processor, we need at least the following:
• Lexicon : The list of stems and affixes together with basic
information about them such as their main categories (noun, verb,
adjective, …) and their sub-categories (regular noun, irregular
noun, …).
• Morphotactics : The model of morpheme ordering that explains
which classes of morphemes can follow other classes of
morphemes inside a word.
• Orthographic Rules (Spelling Rules) : These spelling rules are
used to model changes that occur in a word (normally when two
morphemes combine).
Lexicon
• A lexicon is a repository for words (stems).
– Say, explicit listing of every word: a, AAA, AA, Aachen, aardvark, etc.
• They are grouped according to their main categories.
– noun, verb, adjective, adverb, …

• They may be also divided into sub-categories.


– regular-nouns, irregular-singular nouns, irregular-plural nouns, …
• The simplest way to create a morphological parser would be to put all possible words (together with their inflections) into the lexicon.
– We do not do this because the number of forms is huge, making them inconvenient or impossible to list (for Turkish it is theoretically infinite)
• One way to list every word in the language is:
Computational Lexicon = stems + affixes + morphotactics
Construction of FSA Lexicon
• Morphotactics (noun inflection): which morphemes can follow which morphemes. One way to model morphotactics is with an FSA.

Figure: A simple FSA model for English nominal inflection.

• The FSA assumes that the lexicon includes only nominal inflections given in
table below.

• Note: we ignore the fact that plural of words like fox have an inserted e: foxes.
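The morphotactic FSA can be sketched as a simple recognizer; the word lists and state logic below are illustrative assumptions, not the exact figure. As with the FSA itself, it (incorrectly) accepts fox + s without the inserted e — e-insertion is handled later by a spelling rule.

```python
# A toy recognizer for the nominal-inflection FSA: regular nouns take an
# optional plural -s; irregular singulars/plurals take no further affix.
# The word lists are illustrative, not the full lexicon.
LEXICON = {
    "reg_noun": {"fox", "cat", "dog"},
    "irreg_sg_noun": {"goose", "mouse"},
    "irreg_pl_noun": {"geese", "mice"},
}

def accepts(stem, suffix=""):
    """True if stem(+suffix) is accepted by the noun-inflection FSA."""
    if stem in LEXICON["irreg_sg_noun"] or stem in LEXICON["irreg_pl_noun"]:
        return suffix == ""              # irregulars take no plural -s here
    if stem in LEXICON["reg_noun"]:
        return suffix in ("", "s")       # regular noun + optional plural -s
    return False

print(accepts("cat", "s"))   # True
print(accepts("geese"))      # True
print(accepts("fox", "s"))   # True -- i.e. "foxs", as the note warns
```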
Morphotactics (Verbal Inflection)
• A similar model for English verbal inflection might look like
the figure below.

Figure: A simple FSA model for English verbal inflection.

FSAs for English Adjective Morphology
• Antworth offers the following data on English adjectives:
1) big,bigger, biggest
2) cool, cooler, coolest, coolly
3) red, redder, reddest
4) clear, clearer, clearest, clearly, unclear, unclearly
5) happy, happier, happiest, happily
6) unhappy, unhappier, unhappiest, unhappily
7) real, unreal, really
Antworth’s Proposal #2
Antworth’s Proposal #1

Derivational Rules

Figure: A FSA for another fragment of English derivational morphology

Fossilize-fossilization
Clearly, happily
Realizable, equal, formal
Naturalness, casualness

Compiled FSA for a few English Nouns
• The FSAs used can solve the problem of morphological
recognition, that is, determining whether an input string of
letters makes up a legitimate English word or not.

Figure: Expanded FSA for a few English nouns with their inflection.

• Note that the automaton will incorrectly accept the input foxs.


Finite-State Transducers(FSTs)
• A finite-state transducer is a type of FSA that maps between two sets of symbols.
• It can be viewed as a two-tape automaton that recognizes or generates pairs
of strings.
• A FST is a two-tape automaton:
– Reads from one tape, and writes to other one.
Figure: FST
• For morphological processing, one tape holds lexical representation, the
second one holds the surface form of a word.
Lexical Tape d o g +N +PL (upper tape)

Surface Tape d o g s (lower tape)

• While an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings.
Role of FST
• FST as a recognizer: a transducer that takes a pair of strings as input and outputs accept if the string pair is in the string-pair language, and reject if it is not.
• FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and
outputs another string.
• FST as set relater: a machine that computes relations
between sets.
Formal Definition of FST (Mealey Machine)
• An FST is defined by (Q, Σ, Δ, q0, F, δ, σ), where:
• Q : a finite set of N states q0, q1, …, qN−1
• Σ : a finite input alphabet of complex symbols
• Δ : a finite output alphabet
• q0 : the start state, q0 ∈ Q
• F : the set of final states, F ⊆ Q
• δ(q,w) : the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q,w) returns a set of new states Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (the power set of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
• σ(q,w) : the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q,w) gives a set of output strings, each string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).

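A minimal sketch of this definition in Python, treating δ and σ together as a map from (state, input symbol) to a set of (next state, output) pairs. The toy alphabet and transitions are invented for illustration; the set-valued result reflects the ambiguity the definition allows.

```python
# A tiny nondeterministic FST in the Mealy-machine style of the
# definition above: DELTA maps (state, symbol) to a set of
# (next_state, output) pairs. The example transitions are illustrative.
DELTA = {
    ("q0", "c"): {("q1", "c")},
    ("q1", "a"): {("q2", "a")},
    ("q2", "t"): {("q3", "t +N +SG"), ("q3", "t +N")},  # ambiguous output
}
FINAL = {"q3"}

def transduce(word, state="q0", out=""):
    """Return the set of output strings over all accepting paths."""
    if not word:
        return {out} if state in FINAL else set()
    results = set()
    for nxt, o in DELTA.get((state, word[0]), ()):
        results |= transduce(word[1:], nxt, out + o)
    return results

print(sorted(transduce("cat")))  # ['cat +N', 'cat +N +SG']
```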
FST Properties

• FSTs are closed under: union, inversion, and composition.


• union : The union of two regular relations is also a regular relation.
• inversion : The inversion of a FST simply switches the input and
output labels.
– This means that the same FST can be used for both directions of a
morphological processor.
• composition : If T1 is a FST from I1 to O1 and T2 is a FST from O1
to O2, then composition of T1 and T2 (T1oT2) maps from I1 to O2.
• We use these properties of FSTs in the creation of the FST for a
morphological processor.
Figure: The composition of [a:b]+ with [b:c]+ to produce [a:c]+

Sequential Transducers
• Sequential transducers are a subtype of
transducers that are deterministic on their
input.
• At any state of a sequential transducer, each symbol of the input alphabet Σ can label at most one transition out of that state.
• The transitions out of each state are deterministic, based on the state and the input symbol.
• Sequential transducers can have epsilon symbols in the output string, but not in the input.

Figure: A sequential FST, from Mohri
FSTs for Morphological Parsing
• The simple story
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”, on the other we write “cat +N +PL”
• +N and +PL are elements in the alphabet for one tape that represent
underlying linguistic features

Figure: Schematic examples of the lexical and surface tapes; the actual transducers involve
intermediate tapes as well
FSTs for Morphological Parsing
• Of course, it’s not as easy as
• “cat +N +PL” <-> “cats”
• As we saw earlier there are geese, mice and oxen
• But there are also a whole host of spelling/pronunciation
changes that go along with inflectional changes
• Cats vs Dogs (‘s’ sound vs. ‘z’ sound)
• Fox and Foxes (that ‘e’ got inserted)
• And doubling consonants (swim, swimming)
• adding k’s (picnic, picnicked)
• deleting e’s,...

FST …(continued)
• The upper or lexical tape has alphabet Σ; the lower or surface tape has alphabet Δ; and Σ′ is the alphabet of pairs drawn from Σ and Δ together.
• For example, for sheep talk:
– Σ = {a, b, !}, Δ = {a, b, !, ε}
– Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!}
• feasible pairs – In two-level morphology terminology, the pairs in Σ′ are called feasible pairs.
• default pair – Instead of a:a we can use a single character a for this default pair.
• FSAs are isomorphic to regular languages, and FSTs are isomorphic to regular relations (pairs of strings of regular languages).
FSTs for Morphological Parsing

• Augmenting the FSA with nominal morphological features

Figure: A schematic transducer for English nominal number inflection Tnum. The symbols above each
arc represent elements of the morphological parse in the lexical tape; the symbols below each arc
represent the surface tape( or the intermediate tape),using the morpheme –boundary symbol ^ and
word-boundary marker #. The labels on the arcs leaving q0 are schematic and must be expanded by
individual words in the lexicon.
FSTs for Morphological Parsing
Figure: A fleshed-out English nominal inflection FST Tlex, expanded from Tnum by replacing the three arcs with individual word stems (only a few sample word stems are shown).

FSTs for Morphological Parsing
• To deal with these complications, we will add
even more tapes and use the output of one tape
machine as the input to the next
• So, to handle irregular spelling changes we will
add intermediate tapes with intermediate
symbols

Intermediate Tapes
• Spelling rules (or orthographic rules):
• English often requires spelling changes at morpheme boundaries; we model these by introducing spelling rules (orthographic rules).

A schematic view of the lexical and intermediate tapes

Transducers and Orthographic Rules
• Orthographic Rules( or Spelling Rules)
notation :

Transducers and Orthographic Rules
M1

M2

Figure: An example of the lexical, intermediate and surface tapes. Between each pair of tapes is a two-level transducer,
the lexical transducer between the lexical and intermediate levels, and the E-insertion spelling rule between intermediate
and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a
morpheme boundary ^ followed by the morpheme –s.

• We use one machine to transduce between the lexical and the


intermediate level (M1), and another (M2) to handle the
spelling changes to the surface tape
– M1 knows about the particulars of the lexicon
– M2 knows about weird English spelling rules
Transducers and Orthographic Rules
• A rule of the form a → b / c__d means “rewrite a as b when it occurs between c and d.”
• Since ε means the empty string, replacing ε means inserting something.
• The symbol ^ indicates a morpheme boundary.
• The symbol # marks a word boundary.
• Thus the E-insertion rule reads: “insert an e after a morpheme-final x, s, or z and before the morpheme s.”

Transducers and Orthographic Rules
• The “add an e” English spelling rule, as in fox^s# ↔ foxes#

“other”: any feasible pair that is not in this transducer.

Figure: The transducer for the E-insertion rule, extended from a similar transducer in Antworth. We additionally need to delete the # symbol from the surface string. We can do this either by interpreting the symbol # as the pair #:ε or by post-processing the output.
Transducers and Orthographic Rules
• The idea of this transducer is to express the constraint only for the rule it is built for.
• It allows any other string to pass through unchanged.
• State q0 models having seen only default pairs unrelated to the rule, and is an accepting state.
• State q1 models having seen a ‘z’, ‘s’, or ‘x’.
• State q2 models having seen a morpheme boundary after z, s, or x, and again is an accepting state.
• State q3 models having just seen the E-insertion; it is not an accepting state, pending arrival of the s morpheme followed by #.

Transducers and Orthographic Rules
• The state-transition table for the rule; illegal transitions are marked explicitly with the “–” symbol.
Input    s:s   x:x   z:z   ^:ε   ε:e   #    other
↓State
q0:       1     1     1     0     -    0     0
q1:       1     1     1     2     -    0     0
q2:       5     1     1     0     3    0     0
q3        4     -     -     -     -    -     -
q4        -     -     -     -     -    0     -
q5:       1     1     1     2     -    -     0
Figure: The state-transition table for the E-insertion rule; a colon after a state name marks an accepting state.
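The table can be run directly in generation mode, from the intermediate tape to the surface tape. The encoding below (one dict per state; '^' and '#' emit nothing, and the ε:e column inserts an e without consuming input) is one possible rendering of the table, not a standard implementation.

```python
# The E-insertion transducer, driven by the state-transition table.
# Keys are input columns; "eps:e" is the epsilon:e column (insert e).
T = {
    0: {"s": 1, "x": 1, "z": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^": 0, "eps:e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^": 2, "other": 0},
}
FINAL = {0, 1, 2, 5}   # q3 and q4 are non-accepting, per the slide

def surface(intermediate):
    """All surface strings the transducer allows for an intermediate-tape
    string such as 'fox^s#' ('^' and '#' are deleted on the surface)."""
    results = set()
    def step(i, state, out):
        if i == len(intermediate):
            if state in FINAL:
                results.add(out)
            return
        c = intermediate[i]
        col = c if c in "sxz^#" else "other"
        if col in T[state]:
            emit = "" if c in "^#" else c
            step(i + 1, T[state][col], out + emit)
        if "eps:e" in T[state]:            # insert e without consuming input
            step(i, T[state]["eps:e"], out + "e")
    step(0, 0, "")
    return results

print(surface("fox^s#"))   # {'foxes'}
print(surface("cat^s#"))   # {'cats'}
```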
Two-level Cascade of Transducers
Overall Scheme

Separate FSTs for


each of the English
spelling rules we
want to capture.

Figure: Generating or parsing with FST lexicon and rules


Overall Scheme
• We have one FST that has explicit information
about the lexicon (actual words, their spelling,
facts about word classes and regularity).
• Lexical level to intermediate forms
• And we have a larger set of machines that
capture orthographic/spelling rules.
• Intermediate forms to surface forms
• One machine for each spelling rule

Cascades
• This is an architecture that recurs again and again
• Overall processing is divided up into distinct
rewrite steps
• The output of one layer serves as the input to the
next
• The intermediate tapes may or may not end up
being useful in their own right

Combination of FST Lexicon and
Rules : Foxes Example

Foxes Example

Foxes Example

Note
• A key feature of this lower machine is that it
has to do the right thing for inputs to which it
doesn’t apply. So...
– fox^s# → foxes
– but bird^s# → birds
– and cat# → cat

Intersecting FSTs
• Recall that for FSAs it was ok to take the
intersection of two regular languages and have
the result still be regular
– Regular languages are closed under intersection
• There’s no such guarantee for FSTs
– Regular relations are not closed under intersection
in general
– But interesting subsets are

Intersection
• We can intersect all rule FSTs to create a single
FST.
• Intersection algorithm just takes the Cartesian product
of states.
– For each state qi of the first machine and qj of the second
machine, we create a new state qij
– For input symbol a, if the first machine would transition to
state qn and the second machine would transition to qm the
new machine would transition to qnm.

Composition
• Cascades can turn out to be somewhat of a pain:
– it is hard to manage all the tapes
– they fail to take advantage of the restricting power of the machines
• So, it is better to compile the cascade into a single large machine.
• Create a new state (x,y) for every pair of states x ∈ Q1 and y ∈ Q2.
The transition function of the composition is defined as follows:
δ((x,y), i:o) = (v,z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
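A sketch of this product construction on the [a:b]+ ∘ [b:c]+ example from the FST-properties slide. The state names and dict encoding are invented for illustration, and the runner assumes a deterministic machine.

```python
# Composition by the product construction above: a state (x, y) per pair,
# and a transition i:o whenever T1 maps i:c at x and T2 maps c:o at y.
# Here T1 = [a:b]+ and T2 = [b:c]+, so T1 o T2 = [a:c]+.
T1 = {("p0", "a", "b"): "p1", ("p1", "a", "b"): "p1"}   # [a:b]+
T2 = {("r0", "b", "c"): "r1", ("r1", "b", "c"): "r1"}   # [b:c]+
F1, F2 = {"p1"}, {"r1"}

def compose(t1, t2):
    comp = {}
    for (x, i, c), v in t1.items():
        for (y, c2, o), z in t2.items():
            if c == c2:                      # middle symbol must match
                comp[((x, y), i, o)] = (v, z)
    return comp

COMP = compose(T1, T2)
FINALS = {(x, y) for x in F1 for y in F2}

def run(t, finals, start, word):
    """Deterministically transduce 'word'; None if rejected."""
    state, out = start, ""
    for ch in word:
        match = [(o, v) for (s, i, o), v in t.items() if s == state and i == ch]
        if not match:
            return None
        out += match[0][0]
        state = match[0][1]
    return out if state in finals else None

print(run(COMP, FINALS, ("p0", "r0"), "aaa"))  # ccc
```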

Final Scheme

Figure: Intersection and composition of transducers


Basic Text Processing

Stemming
Lexicon-Free FSTs: The Porter Stemmer
• A stem is the “main” morpheme of a word, supplying its main meaning.
• Some applications (information retrieval) do not require a full morphological processor; they only need the stem of each word.
• Stemming is a crude chopping of affixes, and is language dependent.
• A stemming algorithm such as the Porter stemmer is a lexicon-free FST.
• Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
• The algorithm contains a series of cascaded rewrite rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
The Porter Stemmer
• Though the Porter stemmer is simpler than a full lexicon-based morphological parser, it commits errors like those shown below:
Errors of Commission (wrongly conflated):
– organization → organ
– doing → doe
– generalization → generic
– numerical → numerous
– policy → police
Errors of Omission (not conflated):
– European / Europe
– analysis / analyzes
– matrices / matrix
– noise / noisy
– sparse / sparsity

Stemming approaches
• Many different approaches:
– Brute force look up
– Suffix, affix stripping
– Part-Of-Speech recognition
– Statistical algorithms (n-grams, HMM)

Porter Stemmer Overview
• Porter Stemmer utilizes suffix stripping.
– It does not address prefixes
– Algorithm dates from 1980 and is still a default “go-to” stemmer
– Excellent trade-off between speed, readability and
accuracy.
– Stems using a set of rules, or transformations, applied in a succession of steps
– About 60 rules in 6 steps
– No recursion.
Porter Stemmer Steps
• Step 1: Gets rid of plurals and –ed or –ing suffixes
• Step 2: Turns terminal ‘y’ to ‘i’ when there is another vowel in
the stem
• Step 3: Map double suffixes to single ones:
-ization, -ational, etc.
• Step 4: Deals with suffixes such as -ful, -ness, etc.
• Step 5: Takes off –ant, -ence, etc
• Step 6: Removes a final –e
• For further Information:
– http://the-smirnovs.org/info/stemming.pdf
– http://qaa.ath.ex/porter_js_demo.html

Porter’s Stemmer Algorithm
• Step 1a
sses → ss   (caresses → caress)
ies → i     (ponies → poni)
ss → ss     (caress → caress)
s → Ø       (cats → cat)
• Step 1b
(*v*)ing → Ø   (walking → walk; but sing → sing)
(*v*)ed → Ø    (plastered → plaster)
• Step 2 (for long stems)
ational → ate  (relational → relate)
izer → ize     (digitizer → digitize)
ator → ate     (operator → operate)
• Step 3 (for longer stems)
al → Ø     (revival → reviv)
able → Ø   (adjustable → adjust)
ate → Ø    (activate → activ)

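A toy fragment of this cascade as ordered regex rewrites in Python, covering only a few of the rules above. The real algorithm has about 60 rules and conditions several of them on the stem's "measure", which is omitted here.

```python
import re

# A toy fragment of the Porter cascade: rewrite rules applied in order,
# one pass, no recursion. This is a sketch of a few slide rules only.
RULES = [
    (r"sses$", "ss"),             # step 1a: caresses -> caress
    (r"ies$", "i"),               # step 1a: ponies   -> poni
    (r"([aeiou].*)ing$", r"\1"),  # step 1b: walking  -> walk (needs a vowel)
    (r"ational$", "ate"),         # step 2:  relational -> relate
]

def stem(word):
    """Apply the toy rule cascade in order, one pass over the rules."""
    for pattern, repl in RULES:
        word = re.sub(pattern, repl, word)
    return word

print(stem("caresses"))    # caress
print(stem("walking"))     # walk
print(stem("sing"))        # sing (no vowel before -ing, rule blocked)
print(stem("relational"))  # relate
```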
Basic Text Processing

Word Tokenization
Text Normalization
• Every NLP task needs to do text normalization:
1) Segmenting/tokenizing words in running text
2) Normalizing word formats
3) Segmenting sentences in running text

Word and Sentence Tokenization
• Tokenization or word segmentation
• separate out “words” (lexical entries) from
running text
• expand contracted or abbreviated terms
– E.g., I’m into I am, it’s into it is
• collect tokens forming single lexical entry
– E.g. New York marked as one single entry
• More of an issue in languages like Chinese

Simple Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
• More careful approach:
– Separate ? ! ; : “ ‘ [ ] ( ) < >
– Care with . - why? when?
– Care with … ??
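The "more careful approach" can be sketched as a regex tokenizer that keeps abbreviations, numbers, and hyphenated words intact while splitting off other punctuation. The pattern below is illustrative, not exhaustive.

```python
import re

# A sketch tokenizer: alternatives are tried in order, so abbreviations
# and numbers are captured whole before the generic word pattern fires.
TOKEN_RE = re.compile(r"""
      (?:[A-Za-z]\.)+          # abbreviations: U.S.A.
    | \d+(?:\.\d+)?            # numbers: 1999, 100.2
    | \w+(?:-\w+)*             # words, incl. hyphenated: state-of-the-art
    | [.,;:?!()\[\]<>"']       # selected punctuation as separate tokens
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("The U.S.A. poster costs 12.40, state-of-the-art!"))
```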

Normalization
• Need to normalize terms
– Information retrieval: indexed and text terms must have the
same form
• We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
– E.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window    Search: window, windows
– Enter: windows   Search: Windows, windows, window
– Enter: Windows   Search: Windows

Case folding
• Applications like IR reduce all letters to lower
case
– Since users tend to use lower case
– Possible exception: upper case in mid-sentences?
• E.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, MT, Information extraction
• Case is helpful (US versus us is important)

Numbers
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t index as text
– Creation dates for docs

Lemmatization
• A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
• Lemmatization reduces inflectional/derivational forms to the base form.
• It has a direct impact on vocabulary size.
• E.g., am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find the correct dictionary headword form
• How to do this?
– Need a list of grammatical rules + a list of irregular words
– Children → child, spoken → speak …
– Practical implementation: use WordNet’s morphstr function

How many words?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms

How many words?

they lay back on the San Francisco grass and looked at the
stars and their

• Type: an element of the vocabulary.


• Token: an instance of that type in running text.
• How many?
– 15 tokens (or 14)
– 13 types (or 12) (or 11?)
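The token/type counts for the example sentence can be checked with a few lines of Python (using simple whitespace splitting, which gives the 15/13 answer; the "or 14 / or 12" variants come from other tokenization choices):

```python
from collections import Counter

# Count tokens (N) and types (|V|) for the slide's example sentence.
text = ("they lay back on the San Francisco grass "
        "and looked at the stars and their")
tokens = text.split()
types = Counter(tokens)

print(len(tokens))  # 15 tokens
print(len(types))   # 13 types ("the" and "and" each repeat)
```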
How many words?

N = number of tokens Church and Gale (1990): |V| > O(N½)


V = vocabulary = set of types
|V| is the size of the vocabulary

Tokens = N Types = |V|


Switchboard 2.4 million 20 thousand
phone
conversations
Shakespeare 884,000 31 thousand
Google N-grams 1 trillion 13 million
Issues in Tokenization
• Finland’s capital → Finland? Finlands? Finland’s?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Tokenization: language issues

• French
– L'ensemble → one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble

• German noun compounds are not segmented


– Lebensversicherungsgesellschaftsangestellter
– ‘life insurance company employee’
– German information retrieval needs compound splitter

Tokenization: language issues

• Chinese and Japanese no spaces between words:


– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
– Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets
intermingled
– Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!


Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
– Characters are generally 1 syllable and 1
morpheme.
– Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
– Maximum Matching (also called Greedy)

Maximum Matching
Word Segmentation Algorithm

• Given a wordlist of Chinese, and a string.


1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches
the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2

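The four steps above translate into a short Python sketch (a minimal illustration; the single-character fallback handles positions where nothing in the wordlist matches):

```python
def max_match(sentence, wordlist):
    """Greedy longest-match-first (MaxMatch) word segmentation."""
    words = []
    i = 0
    while i < len(sentence):
        # Step 2: find the longest dictionary word starting at the pointer;
        # fall back to a single character if nothing matches.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in wordlist or j == i + 1:
                words.append(sentence[i:j])
                i = j  # Step 3: move the pointer past the matched word
                break
    return words

vocab = {"the", "cat", "in", "hat", "table", "down", "there",
         "theta", "bled", "own"}
print(max_match("thecatinthehat", vocab))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", vocab))  # ['theta', 'bled', 'own', 'there']
```

The second call reproduces the English failure case shown on the next slide: the greedy match prefers theta over the, yielding theta bled own there.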
Max-match segmentation illustration

• Thecatinthehat → the cat in the hat

• Thetabledownthere → the table down there (intended)
                      theta bled own there (greedy output)

• Doesn’t generally work in English!

• But works astonishingly well in Chinese


– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms are even better

Spelling Correction
• We can detect spelling errors (spell check) by
building an FST-based lexicon and noting any strings
that are rejected.
• But how do I fix “graffe”? That is, how do I come up
with suggested corrections?
– Search through all words in my lexicon
• Graft, craft, grail, giraffe, crafted, etc.
– Pick the one that’s closest to graffe
– But what does “closest” mean?
• We need a distance metric.
• The simplest one: minimum edit distance
– As in the Unix diff command

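Before formalizing a distance metric, note that Python’s standard library already ships a rough “closest word” facility. difflib ranks candidates by a similarity ratio rather than edit distance, but it illustrates the suggestion step, using the small candidate list from this slide:

```python
import difflib

lexicon = ["graft", "craft", "grail", "giraffe", "crafted"]

# get_close_matches ranks lexicon entries by SequenceMatcher similarity
# (a ratio in [0, 1], with a default cutoff of 0.6).
suggestions = difflib.get_close_matches("graffe", lexicon, n=3)
print(suggestions)  # "giraffe" ranks highest
```

The rest of this unit builds the metric properly: minimum edit distance.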
Basic Text Processing

Minimum Cost Edit Distance


Edit Distance
• Minimum edit distance is a way of solving the problem of
string similarity.
– Which is closest to graffe?
• graf,
• graft,
• grail,
• giraffe
• The minimum edit distance between two strings is the
minimum number of editing operations
o Insertion
o Deletion
o Substitution
Minimum Edit Distance
(Wagner and Fischer,1974)

– If each operation has a cost of 1
• the distance between intention and execution is 5
– If substitution costs 2 (Levenshtein)
• the distance between them is 8

Min Edit Example

Other uses of Edit Distance in NLP
• Evaluating Machine Translation and speech
recognition
– H (human): Spokesman confirms senior government adviser was shot
– M (machine): Spokesman said the senior advisor was shot dead
– Edit operations: S I D I (one substitution, two insertions, one deletion)
• Named Entity Extraction and Entity Coreference
– IBM Inc announced today
– IBM profits
– Stanford President John Hennessy announced yesterday
– Stanford University President John Hennessy

Min Edit As Search
• How to find the Min Edit Distance?
• We can view edit distance as a search for a path (a
sequence of edits) that gets us from the start string to
the final string
1) Initial state is the word we’re transforming
2) Operators are : insert, delete, substitute
3) Goal state is the word we’re trying to get to
4) Path cost is what we’re trying to minimize: the
number of edits

Min Edit as Search

•But that generates a huge search space


•Navigating that space in a naïve backtracking fashion would be
incredibly wasteful
•Why?
•Lots of distinct paths wind up at the same state. But there is no need
to keep track of them all. We only care about the shortest path to
each of those revisited states.
Defining Min Edit Distance
• For two strings:
• S1 of length n,
• S2 of length m
– distance(i,j) or D(i,j)
• is the min edit distance of S1[1..i] and S2[1..j]
– That is, the minimum number of edit operations needed to
transform the first i characters of S1 into the first j characters of S2
• The edit distance of S1, S2 is D(n,m)
• We compute D(n,m) by computing D(i,j) for all i
(0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Dynamic Programming
• Dynamic Programming is the name for a class of
algorithms
• A tabular computation of D(n,m)
• Intuition: a large problem can be solved by properly
combining the solutions to various subproblems.
• Bottom-up
– We compute D(i,j) for small i,j
– And compute larger D(i,j) based on previously computed
smaller values.
– i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

Why “Dynamic Programming”
“Where did the name, dynamic programming, come from? The 1950s were not good years for
mathematical research. We had a very interesting gentleman in Washington named
Wilson. He was Secretary of Defense, and he actually had a pathological fear and
hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face
would suffuse, he would turn red, and he would get violent if people used the term, research,
in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND
Corporation was employed by the Air Force, and the Air Force had Wilson as its boss,
essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the
fact that I was really doing mathematics inside the RAND Corporation. What title, what name,
could I choose? In the first place I was interested in planning, in decision making, in thinking.
But planning, is not a good word for various reasons. I decided therefore to use the word,
“programming”. I wanted to get across the idea that this was dynamic, this was
multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take
a word that has an absolutely precise meaning, namely dynamic, in the classical
physical sense. It also has a very interesting property as an adjective, and that is it’s
impossible to use the word, dynamic, in a pejorative sense. Try thinking of some
combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I
thought dynamic programming was a good name. It was something not even a
Congressman could object to. So I used it as an umbrella for my activities.”

Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

DP Search
• In the context of language processing (and
signal processing) this kind of algorithm is
often referred to as a DP search
– Min edit distance
– Viterbi and Forward algorithms
– CKY and Earley
– MT decoding

Min Edit Distance Algorithm

Defining Min Edit Distance (Levenshtein* )

• Base conditions:                          (* Vladimir Levenshtein, 1966)
      D(i,0) = i
      D(0,j) = j
• Recurrence Relation:
      for each i = 1…n
        for each j = 1…m
                          D(i-1,j) + 1
          D(i,j) = min    D(i,j-1) + 1
                          D(i-1,j-1) + 2   if S1(i) ≠ S2(j)
                          D(i-1,j-1) + 0   if S1(i) = S2(j)
• Termination: D(n,m) is the distance
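The recurrence above maps directly onto a bottom-up DP table. A sketch in Python, with substitution cost 2 by default (the Levenshtein variant used in the tables that follow; pass sub_cost=1 for the unit-cost variant):

```python
def min_edit_distance(s1, s2, sub_cost=2):
    """Bottom-up DP for minimum edit distance (Levenshtein when sub_cost=2)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: D(i,0) = i and D(0,j) = j
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    # Recurrence: fill larger D(i,j) from previously computed smaller values
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / copy
    return D[n][m]

print(min_edit_distance("intention", "execution"))     # 8
print(min_edit_distance("intention", "execution", 1))  # 5
```

The two printed values match the distances of 8 (Levenshtein) and 5 (unit cost) quoted earlier for this pair.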
The Edit Distance Table

N 9
O 8
I 7

T 6
N 5
S1 E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
S2
The Edit Distance Table
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10

T 6 5 6 7 8 9 8 9 10 11

S1 N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
S2
Computing alignments
• Edit distance isn’t sufficient
– We often need to align each character of the two
strings to each other.
• We do this by keeping a “backtrace”
• Every time we enter a cell, remember where we
came from
• When we reach the end
– Trace back the path from the upper right corner to
read off the alignment.
Paths
• Keep a back pointer
– Every time we fill a cell, add a pointer back to the
cell that was used to create it (the min cell that led
to it)
– To get the sequence of operations follow the
backpointer from the final cell

Backtrace

Figure: When entering a value in each cell, we mark which of the three neighboring cells we
came from with up to three arrows. After the table is full, we compute an alignment (minimum
edit path) by using a backtrace, starting at the 8 in the upper-right corner and following the
arrows. The sequence of dark grey cells represents one possible minimum-cost alignment
between the two strings.
Adding Backtrace to MinEdit
• Base conditions: D(i,0) = i ;  D(0,j) = j
• Termination: D(n,m) is the distance
• Recurrence Relation:
      for each i = 1…n
        for each j = 1…m
                          D(i-1,j) + 1                         deletion
          D(i,j) = min    D(i,j-1) + 1                         insertion
                          D(i-1,j-1) + 2   if S1(i) ≠ S2(j)    substitution
                          D(i-1,j-1) + 0   if S1(i) = S2(j)

                          DOWN    deletion
          ptr(i,j) =      LEFT    insertion
                          DIAG    substitution
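The backpointer scheme above can be sketched as follows (the op labels del/ins/sub/keep are my own naming; ties are broken arbitrarily, so the recovered path is one of possibly several minimum-cost alignments):

```python
def min_edit_alignment(s1, s2, sub_cost=2):
    """Edit distance plus one minimum-cost edit path via backpointers."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            # Minimize over (cost, op) pairs and record where we came from.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j] + 1, "del"),                           # DOWN
                (D[i][j - 1] + 1, "ins"),                           # LEFT
                (D[i - 1][j - 1] + sub, "sub" if sub else "keep"),  # DIAG
            )
    # Follow backpointers from (n, m) back to (0, 0).
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "del":
            i -= 1
        elif op == "ins":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return D[n][m], ops[::-1]

dist, ops = min_edit_alignment("intention", "execution")
print(dist)  # 8
print(ops)
```

Reading ops left to right aligns each character of the two strings, exactly the backtrace walk described on the preceding slides.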
Complexity
• Time:
O(nm)

• Space:
O(nm)

• Backtrace
O(n+m)

End of Unit -3

???
