Clitic
• A clitic is a unit whose status lies between that of an affix and a
word. Its syntactic behavior is more like that of a word: clitics often act
as pronouns, articles, conjunctions, or verbs.
• Clitics preceding a word are called proclitics and those
following it are enclitics.
• Here are examples of verb clitics:
• The FSA assumes that the lexicon includes only the nominal inflections given in
the table below.
• Note: we ignore the fact that the plural of words like fox has an inserted e: foxes.
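The nominal-inflection FSA described above can be sketched as a small recognizer. This is only an illustration: the word lists are a hypothetical sample lexicon (the slide's table is not reproduced here), and the function name `accepts` is an assumption. As in the note above, e-insertion is ignored, so the sketch accepts "foxs" and rejects "foxes".

```python
# Hypothetical sample lexicon; the classes follow the usual split into
# regular nouns and irregular singular/plural nouns.
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}


def accepts(word):
    """Return True if the toy nominal-inflection FSA accepts `word`."""
    # q0 --irreg-sg-noun or irreg-pl-noun--> final state
    if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
        return True
    # q0 --reg-noun--> q1 (final)
    if word in REG_NOUN:
        return True
    # q1 --plural -s--> q2 (final); no e-insertion is modeled here
    return word.endswith("s") and word[:-1] in REG_NOUN
```

Because spelling rules are left out, handling "foxes" correctly needs the orthographic machinery introduced later in these slides.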
IS 7118:NLP Unit-3: Morphology & FST, 18
Prof. R.K.Rao Bandaru
Morphotactics (Verbal Inflection)
• A similar model for English verbal inflection might look like
the figure below.
FSAs for English Adjective Morphology
• Antworth offers the following data on English adjectives:
1) big, bigger, biggest
2) cool, cooler, coolest, coolly
3) red, redder, reddest
4) clear, clearer, clearest, clearly, unclear, unclearly
5) happy, happier, happiest, happily
6) unhappy, unhappier, unhappiest, unhappily
7) real, unreal, really
Antworth’s Proposal #2
Antworth’s Proposal #1
Fossilize-fossilization
Clearly, happily
Realizable, equal, formal
Naturalness, casualness
FST Properties
Sequential Transducers
• Sequential transducers are a subtype of
transducers that are deterministic on their
input.
• At any state of a sequential transducer, each
symbol of the input alphabet Σ can label at
most one transition out of that state.
• The transitions out of each state are
deterministic, based on the state and the input
symbol.
• Sequential transducers can have epsilon
symbols in the output string, but not on the
input.
Figure: A sequential FST, from Mohri
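A minimal sketch of the determinism property described above: because each (state, input symbol) pair labels at most one transition, a plain dictionary lookup is enough to run the machine. The transition table, state numbers, and the function name `transduce` are made up for illustration and are not taken from Mohri's figure.

```python
# (state, input symbol) -> (next state, output string).
# An empty output string plays the role of epsilon on the output side;
# there are no epsilon transitions on the input side.
TRANSITIONS = {
    (0, "a"): (0, "b"),
    (0, "b"): (1, "a"),
    (1, "b"): (1, ""),
}
FINAL_STATES = {0, 1}


def transduce(text, start=0):
    """Run the toy sequential transducer; return None if the input is rejected."""
    state, out = start, []
    for symbol in text:
        if (state, symbol) not in TRANSITIONS:
            return None  # no transition for this symbol: reject
        state, piece = TRANSITIONS[(state, symbol)]
        out.append(piece)
    return "".join(out) if state in FINAL_STATES else None
```

Since no choice is ever made, the running time is linear in the length of the input.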
FSTs for Morphological Parsing
• The simple story
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”, on the other we write “cat +N +PL”
• +N and +PL are elements in the alphabet for one tape that represent
underlying linguistic features
Figure: Schematic examples of the lexical and surface tapes; the actual transducers involve
intermediate tapes as well
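The two-tape idea can be illustrated with a toy parser that reads the surface form and writes the lexical form. This is a sketch under strong assumptions: the stem list is hypothetical, only the regular plural -s is handled, and the `+SG` tag is an assumed counterpart to the `+PL` tag shown above.

```python
# Hypothetical stem lexicon for illustration only.
STEMS = {"cat", "dog", "fox"}


def parse(surface):
    """Map a surface noun to its lexical-tape analyses, e.g. cats -> cat +N +PL."""
    analyses = []
    if surface in STEMS:
        analyses.append(surface + " +N +SG")  # +SG is an assumed tag
    if surface.endswith("s") and surface[:-1] in STEMS:
        analyses.append(surface[:-1] + " +N +PL")
    return analyses
```

A real FST would encode the same mapping as paired symbols on its arcs, so it could also be run in the other direction to generate "cats" from "cat +N +PL".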
FSTs for Morphological Parsing
• Of course, it's not as easy as
• “cat +N +PL” <-> “cats”
• As we saw earlier there are geese, mice and oxen
• But there are also a whole host of spelling/pronunciation
changes that go along with inflectional changes
• Cats vs Dogs (‘s’ sound vs. ‘z’ sound)
• Fox and Foxes (that ‘e’ got inserted)
• And doubling consonants (swim, swimming)
• adding k’s (picnic, picnicked)
• deleting e’s,...
Figure: A schematic transducer for English nominal number inflection Tnum. The symbols above each
arc represent elements of the morphological parse on the lexical tape; the symbols below each arc
represent the surface tape (or the intermediate tape), using the morpheme-boundary symbol ^ and
the word-boundary marker #. The labels on the arcs leaving q0 are schematic and must be expanded by
individual words in the lexicon.
FSTs for Morphological Parsing
Figure: A fleshed-out English nominal inflection FST Tlex, expanded from Tnum by replacing
the three arcs with individual word stems (only a few sample word stems are shown).
FSTs for Morphological Parsing
• To deal with these complications, we will add
even more tapes and use the output of one tape
machine as the input to the next
• So, to handle irregular spelling changes we will
add intermediate tapes with intermediate
symbols
Figure: An example of the lexical, intermediate and surface tapes. Between each pair of tapes is a two-level transducer,
the lexical transducer between the lexical and intermediate levels, and the E-insertion spelling rule between intermediate
and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a
morpheme boundary ^ followed by the morpheme –s.
Figure: The transducer for the E-insertion rule, extended from a similar transducer in Antworth. We
additionally need to delete the # symbol from the surface string. We can do this either by interpreting the symbol
# as the pair #:ε or by postprocessing the output.
Transducers and Orthographic Rules
• The idea is that the transducer expresses only the constraints needed for the rule it implements.
• Any other string passes through unchanged.
• State q0 models having seen only default pairs unrelated to the rule; it is an
accepting state.
• State q1 models having seen a z, s, or x; it is also an accepting state.
• State q2 models having seen the morpheme boundary after a z, s, or x; it is again an
accepting state.
• State q3 models having just seen the E-insertion; it is not an accepting state,
pending the arrival of the s morpheme followed by #.
Transducers and Orthographic Rules
• The transition table for the rule marks illegal
transitions explicitly with the “-” symbol.

State \ Input   s:s   x:x   z:z   ^:ε   ε:e    #    other
q0:              1     1     1     0     -     0     0
q1:              1     1     1     2     -     0     0
q2:              5     1     1     0     3     0     0
q3:              4     -     -     -     -     -     -
q4:              -     -     -     -     -     0     -
q5:              1     1     1     2     -     -     0

Figure: The state-transition table for the E-insertion rule.
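The state-transition table above can be run directly as a small generator from intermediate strings to surface strings. The sketch below is illustrative, not the original implementation. It assumes the accepting states are q0, q1, and q2 (as the bullets describe), treats both ^ and # as mapping to the empty string, and reads "other" as any symbol that is not s, x, z, ^, or #. The names `TABLE` and `generate` are hypothetical.

```python
# Rows of the E-insertion table; "e_ins" is the epsilon:e column.
TABLE = {
    0: {"s": 1, "x": 1, "z": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^": 0, "e_ins": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}  # assumption based on the state descriptions


def generate(intermediate):
    """Return all surface strings the table licenses for an intermediate string."""
    results = []

    def walk(state, pos, out):
        if pos == len(intermediate):
            if state in ACCEPTING:
                results.append(out)
            return
        row = TABLE[state]
        # Option 1: insert an e without consuming input (only legal in q2).
        if "e_ins" in row:
            walk(row["e_ins"], pos, out + "e")
        # Option 2: consume the next intermediate symbol.
        ch = intermediate[pos]
        key = ch if ch in "sxz^#" else "other"
        if key in row:
            emitted = "" if ch in "^#" else ch  # ^ and # are deleted on the surface
            walk(row[key], pos + 1, out + emitted)

    walk(0, 0, "")
    return results
```

For example, `generate("fox^s#")` explores both the branch that inserts e and the branch that does not; the second branch dies in q5 because q5 has no transition on #, so only "foxes" survives. For `"cat^s#"` the machine never reaches q2, so no e can be inserted.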
Two-level Cascade of Transducers
Overall Scheme
Stemming
Lexicon-Free FSTs: The Porter Stemmer
• A stem is the “main” morpheme of the word, supplying the main
meaning.
• Some applications (e.g., information retrieval) do not require the whole
morphological parser. They only need the stem of the word.
• Stemming is crude chopping of affixes, and is language dependent.
• The Porter stemming algorithm is a stemmer that can be viewed as a lexicon-free
FST.
• Stemming algorithms are efficient but they may introduce errors
because they do not use a lexicon.
• The algorithm contains a series of cascaded rewrite rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
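The three rewrite rules listed above can be sketched as a toy function. This is not the Porter stemmer itself, which has many more rules and ordered steps; it only shows how lexicon-free suffix rewriting works. The function name `toy_stem` is an assumption.

```python
import re


def toy_stem(word):
    """Apply only the three sample rules from the slide, in a fixed order."""
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"   # ATIONAL -> ATE
    if word.endswith("sses"):
        return word[: -len("sses")] + "ss"       # SSES -> SS
    # ING -> (empty) only if the remaining stem contains a vowel
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]
    return word
```

The vowel condition is what keeps a word like "sing" intact: its candidate stem "s" contains no vowel, so the ING rule does not fire.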
The Porter Stemmer
• Though the Porter stemmer is simpler
than a full lexicon-based morphological parser,
the algorithm commits errors like the ones shown
below:
Errors of Commission Errors of Omission
Word Tokenization
Text Normalization
• Every NLP task needs to do text normalization:
1) Segmenting/tokenizing words in running text
2) Normalizing word formats
3) Segmenting sentences in running text
they lay back on the San Francisco grass and looked at the
stars and their
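As a small illustration of segmenting words in running text, the example sentence above can be split on whitespace and its tokens and types counted. Whitespace splitting is an assumption that happens to be adequate for this sentence; it would not handle punctuation or clitics.

```python
sentence = ("they lay back on the San Francisco grass and looked at the "
            "stars and their")

tokens = sentence.split()   # naive whitespace tokenization
types = set(tokens)         # distinct word forms

num_tokens = len(tokens)    # 15 running words
num_types = len(types)      # 13 distinct forms ("the" and "and" each occur twice)
```

Whether "San Francisco" should count as one token or two is exactly the kind of decision a real tokenizer has to make.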
• French
– L'ensemble → one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
• S I D I
• Named Entity Extraction and Entity Coreference
– IBM Inc announced today
– IBM profits
– Stanford President John Hennessy announced yesterday
– Stanford University President John Hennessy
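• Initialization:
D(i,0) = i
D(0,j) = j
• Recurrence Relation:
for each i = 1…m
  for each j = 1…n
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 \\
D(i,j-1) + 1 \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } S_1(i) \neq S_2(j) \\
  0 & \text{if } S_1(i) = S_2(j)
  \end{cases}
\end{cases}
\]
• Termination: D(m,n) is the distance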
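The recurrence above translates almost line for line into a dynamic-programming table. The sketch below assumes insertion and deletion cost 1 and substitution cost 2, as in the recurrence; the function name and the `sub_cost` parameter are illustrative choices.

```python
def min_edit_distance(s1, s2, sub_cost=2):
    """Fill the table D and return D(m, n), with m = len(s1) and n = len(s2)."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # initialization: D(i,0) = i
    for j in range(n + 1):
        D[0][j] = j                      # initialization: D(0,j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Python strings are 0-indexed, so S1(i) is s1[i-1].
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or match
    return D[m][n]
```

With substitution cost 2, `min_edit_distance("intention", "execution")` gives 8; with `sub_cost=1` the same pair gives 5.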
The Edit Distance Table
    N   9
    O   8
    I   7
    T   6
S1  N   5
    E   4
    T   3
    N   2
    I   1
    #   0   1   2   3   4   5   6   7   8   9
        #   E   X   E   C   U   T   I   O   N
                          S2
The Edit Distance Table
    N   9   8   9  10  11  12  11  10   9   8
    O   8   7   8   9  10  11  10   9   8   9
    I   7   6   7   8   9  10   9   8   9  10
    T   6   5   6   7   8   9   8   9  10  11
S1  N   5   4   5   6   7   8   9  10  11  10
    E   4   3   4   5   6   7   8   9  10   9
    T   3   4   5   6   7   8   7   8   9   8
    N   2   3   4   5   6   7   8   7   8   7
    I   1   2   3   4   5   6   7   6   7   8
    #   0   1   2   3   4   5   6   7   8   9
        #   E   X   E   C   U   T   I   O   N
                          S2
Computing alignments
• Edit distance isn’t sufficient
– We often need to align each character of the two
strings to each other.
• We do this by keeping a “backtrace”
• Every time we enter a cell, remember where we
came from
• When we reach the end
– Trace back the path from the upper right corner to
read off the alignment.
Paths
• Keep a backpointer
– Every time we fill a cell, add a pointer back to the
cell that was used to create it (the min cell that led
to it)
– To get the sequence of operations, follow the
backpointers from the final cell
Figure: When entering a value in each cell, we mark which of the three neighboring cells we
came from with up to three arrows. After the table is full we compute an alignment (minimum
edit path) by using a backtrace, starting at the 8 in the upper right corner and following the arrows.
The sequence of dark grey cells represents one possible minimum cost alignment between the two
strings.
Adding Backtrace to MinEdit
• Base conditions: D(i,0) = i, D(0,j) = j
• Termination: D(M,N) is the distance
• Recurrence Relation:
for each i = 1…M
  for each j = 1…N
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } S_1(i) \neq S_2(j) \\
  0 & \text{if } S_1(i) = S_2(j)
  \end{cases} & \text{substitution}
\end{cases}
\]
\[
ptr(i,j) =
\begin{cases}
\text{LEFT} & \text{insertion} \\
\text{DOWN} & \text{deletion} \\
\text{DIAG} & \text{substitution}
\end{cases}
\]
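A minimal sketch of MinEdit with a backtrace, assuming substitution cost 2 and keeping a single backpointer per cell (ties are broken arbitrarily, so only one of possibly several minimum-cost alignments is returned). The operation letters D, I, S and the extra letter M for a zero-cost match are labels chosen for this sketch, as is the function name `align`.

```python
def align(s1, s2, sub_cost=2):
    """Return (distance, operation string) for one minimum-cost alignment."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0], ptr[i][0] = i, "DOWN"   # only deletions along the first column
    for j in range(1, n + 1):
        D[0][j], ptr[0][j] = j, "LEFT"   # only insertions along the first row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = D[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else sub_cost)
            choices = [(diag, "DIAG"),
                       (D[i - 1][j] + 1, "DOWN"),
                       (D[i][j - 1] + 1, "LEFT")]
            D[i][j], ptr[i][j] = min(choices, key=lambda c: c[0])
    # Backtrace from the final cell (m, n) to (0, 0).
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ops.append("M" if s1[i - 1] == s2[j - 1] else "S")
            i, j = i - 1, j - 1
        elif move == "DOWN":
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return D[m][n], "".join(reversed(ops))
```

Filling the table visits every cell once, while the backtrace takes at most one step per row or column, which matches the O(nm) and O(n+m) costs quoted for the two phases.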
Complexity
• Time:
O(nm)
• Space:
O(nm)
• Backtrace
O(n+m)