Asif Ekbal
CSE Dept.,
IIT Patna
Part of Speech Tagging
NLP Trinity
[Figure: the NLP Trinity, a cube with three axes. Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics. Language axis: Hindi, Marathi, English, French. Algorithm axis: HMM, MEMM, CRF.]
Part of Speech Tagging
POS Tagging: attaching to each word in a sentence a part-of-speech tag drawn from a given set of tags, called the Tag-Set
Standard Tag-Set: Penn Treebank (for English), 36 tags
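To see Penn Treebank tags in practice, here is a minimal sketch using NLTK's off-the-shelf tagger; NLTK and its downloaded models are an assumption of this sketch, not part of these slides.

```python
# A minimal sketch, assuming NLTK is installed and its tokenizer and
# tagger models have been fetched via nltk.download().
import nltk

sentence = "The mechanisms that make traditional hardware are really being obsoleted."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ...]
```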
1/31/2022 4
Example
“_“ The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._.
Where does POS tagging fit in
[Figure: the NLP layer stack, with complexity of processing increasing from bottom to top:]
Discourse and Coreference
Semantics Extraction
Parsing
Chunking
POS tagging
Morphology
Example to illustrate complexity of POS tagging
POS tagging is disambiguation
Simplified tagset: N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e., function words).
POS disambiguation
That_F/N/J (‘that’ can be a complementizer (put under ‘F’), a demonstrative (put under ‘J’) or a pronoun (put under ‘N’))
former_J
Sri_N/J Lanka_N/J (‘Sri Lanka’ together qualifies the skipper)
skipper_N/V (‘skipper’ can be a verb too)
and_F
ace_J/N (‘ace’ can be both J and N; “Nadal served an ace”)
batsman_N/J (‘batsman’ can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N/V (‘man’ can be a verb too, as in ‘man the boat’)
of_F few_J
words_N/V (‘words’ can be a verb too, as in ‘he words his speeches beautifully’)
Behaviour of “That”
That man is known by the company he keeps. (Demonstrative)
Man that is known by the company he keeps, gets a good job. (Pronoun)
That man is known by the company he keeps, is a proverb. (Complementizer)
Chaotic systems: systems where a small perturbation in the input causes a large change in the output
POS disambiguation
was_F very_R much_R evident_J on_F Wednesday_N
when_F/N (‘when’ can be a relative pronoun (put under ‘N’) as in ‘I know the time when he comes’)
the_F legendary_J batsman_N
who_F/N
has_V always_R let_V his_N
bat_N/V
talk_V/N
struggle_V/N
answer_V/N
barrage_N/V
question_N/V
function_N/V
promote_V cricket_N league_N city_N
Mathematics of POS tagging
Argmax computation (1/2)
Best tag sequence
= T*
= argmax_T P(T|W)
= argmax_T P(T) . P(W|T)    (by Bayes' Theorem; P(W) is constant for the given sentence)
Argmax computation (2/2)
P(W|T) = P(w0|t0..tn+1) . P(w1|w0, t0..tn+1) . P(w2|w1 w0, t0..tn+1) ... P(wn+1|w0..wn, t0..tn+1)
       = P(w0|t0) . P(w1|t1) ... P(wn+1|tn+1)    (Lexical Probability Assumption: each word depends only on its own tag)
       = ∏ i=0..n+1 P(wi|ti)
Generative Model
[Figure: a tag chain ^ N V A . with alternative tags below each position, generating the observed words; tag-to-tag arcs carry Bigram Probabilities and tag-to-word arcs carry Lexical Probabilities.]
This model is called a Generative model: words are observed as emissions from the tags, which act as states. This is essentially an HMM.
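As a sketch of this generative scoring, the function below multiplies the bigram (tag-to-tag) and lexical (tag-to-word) probabilities along a candidate tag sequence. The helper name `sequence_score` and the table layout are illustrative assumptions, not from the slides.

```python
# A minimal sketch of the generative score P(T) * P(W|T) under the bigram and
# lexical probability assumptions; the probability tables are toy placeholders.
def sequence_score(words, tags, trans, lex):
    """trans[(t_prev, t)] = P(t | t_prev); lex[(w, t)] = P(w | t)."""
    score = 1.0
    prev = "^"                                  # sentence-start state
    for w, t in zip(words, tags):
        score *= trans.get((prev, t), 0.0)      # bigram probability
        score *= lex.get((w, t), 0.0)           # lexical probability
        prev = t
    return score * trans.get((prev, "."), 0.0)  # transition into the end state
```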
Typical POS tagging steps
Implementation of Viterbi (unigram, bigram)
Evaluation
Confusion matrix
A Motivating Example
Colored Ball choosing
Example (contd.)
Given:
Transition probabilities A:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3
and emission probabilities B:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3
Observation: RRGGBRGR
State Sequence: ??
Diagrammatic representation (1/2)
[Figure: the urns U1, U2, U3 as states of a fully connected machine, with the transition probabilities of table A on the arcs (e.g., 0.5 from U1 to U3, 0.4 from U2 to U3) and the emission probabilities of table B attached to each state (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]
Diagrammatic representation (2/2)
[Figure: the same machine with emissions folded onto the arcs; each arc carries transition x emission products, e.g., the U1 self-loop carries R: 0.03, G: 0.05, B: 0.02 (0.1 times U1's emission probabilities), and the U1 -> U3 arc carries R: 0.15, G: 0.25, B: 0.10.]
Classic problems with respect to HMM
Illustration of Viterbi
POS Tagset
● N (noun), V (verb), O (other) [simplified]
● ^ (start), . (end) [start & end states]
Lexicon
people: N, V
laugh: N, V
...
Corpora for Training
^ w11_t11 w12_t12 w13_t13 ... w1k1_t1k1 .
^ w21_t21 w22_t22 w23_t23 ... w2k2_t2k2 .
...
^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnkn_tnkn .
Inference
Candidate tag sequences for "^ people laugh ." (with ^ emitting Є): ^ N N . and ^ V N .
Transition Probability Table (rows: previous tag, columns: next tag)
     ^    N    V    O    .
^    0   0.6  0.2  0.2   0
N    0   0.1  0.4  0.3  0.2
V    0   0.3  0.1  0.3  0.3
O    0   0.3  0.2  0.3  0.2
.    1    0    0    0    0
This transition table will change from language to language due to language divergences.
Lexical Probability Table (rows: tag, columns: word)
     Є   people  laugh  ...
^    1     0       0    ...
O    0     0       0    ...
.    1     0       0     0
p(^ N N . | ^ people laugh .)
= (0.6 x 1.0) x (0.1 x 1 x 10^-3) x (0.2 x 1 x 10^-5)
Computational Complexity
● If we had to compute the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations
● If |s| = #states (tags + ^ + .) and |o| = length of the sentence (words + ^ + .), then #sequences = |s|^(|o|-2)
● But a large number of partial computations can be reused using Dynamic Programming
Dynamic Programming
[Figure: expanding the tree for "^ people laugh .". The N node for "people" (score 0.6 x 10^-3) expands to:
N: 0.6 x 0.1 x 10^-3 = 6 x 10^-5
V (1): 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
O (2): 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
. (3): 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4
No need to expand N4 and N5 (the N nodes under the other branches) because they will never be a part of the winning sequence.]
Viterbi Algorithm
Tree for the sentence: “^ People laugh .”
[Figure: the Viterbi tree. ^ emits Ԑ and expands to N, V, O for "People" with scores such as 0.06*10^-3, 0.24*10^-3 and 0.18*10^-3; the "laugh" level carries scores such as 0.06*10^-6 and 0.02*10^-6, and nodes with score 0 (N (0), V (0), O (0)) are not expanded.]
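The pruning in this tree is exactly what a Viterbi implementation does: per position and per state, keep only the best-scoring path. Below is a sketch in Python. The transition and lexical tables are taken from the "How Forward Probability Works" slide later in this deck; the code itself, and the simplification that the end state emits "." with probability 1, are assumptions of the sketch. Emissions are grouped with the destination state here, which gives the same per-path products as the deck's arc convention.

```python
states = ["^", "N", "V", "."]
# Tables from the "How Forward Probability Works" slide later in this deck.
trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2}}
lex = {"^": {"ε": 1.0},
       "N": {"people": 0.8, "laugh": 0.2},
       "V": {"people": 0.1, "laugh": 0.9},
       ".": {".": 1.0}}          # assumed: end state emits '.' with probability 1

def viterbi(obs):
    """obs: ['ε', 'people', 'laugh', '.'] -> most probable tag sequence."""
    trellis = [{"^": (1.0, None)}]        # state -> (best score, back-pointer)
    for o in obs[1:]:
        col = {}
        for q in states:
            best, arg = max((s * trans.get(p, {}).get(q, 0.0), p)
                            for p, (s, _) in trellis[-1].items())
            col[q] = (best * lex.get(q, {}).get(o, 0.0), arg)
        trellis.append(col)
    tag = max(trellis[-1], key=lambda q: trellis[-1][q][0])   # best final state
    path = [tag]
    for col in reversed(trellis[1:]):     # follow back-pointers
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi(["ε", "people", "laugh", "."]))   # -> ['^', 'N', 'V', '.']
```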
Viterbi phenomenon (Markov process)
[Figure: two N nodes at the same level for "LAUGH": N1 with score 6*10^-5 and N2 with score 6*10^-8, each expanding to children N, V, O.]
At the next step, all the probabilities will be multiplied by identical factors (lexical and transition). So the children of N2 will always have lower probability than the corresponding children of N1.
Effect of shifting probability mass
Will a word always be given the same tag?
No. Consider the examples:
^ people the city with soldiers .
^ quickly people the city . (in both, ‘people’ is a verb, i.e., ‘populate’)
Tail phenomenon and Language phenomenon
Long tail Phenomenon: the probability is very low, but not zero, over a large observed sequence.
Language Phenomenon:
“people”, which is predominantly tagged as “Noun”, displays a long tail phenomenon.
“laugh” is predominantly tagged as “Verb”.
Back to the Urn Example
Here:
S = {U1, U2, U3} (states)
V = {R, G, B} (observation symbols)
Observation sequence O = {o1 ... on}
State sequence Q = {q1 ... qn}
A (transition probabilities):
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3
B (emission probabilities):
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3
π is the initial state distribution: π_i = P(q1 = U_i)
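Written out in code, the urn model looks as follows (a small sketch; the slide does not specify π, so a uniform start distribution is an assumption here):

```python
import numpy as np

states  = ["U1", "U2", "U3"]          # S
symbols = ["R", "G", "B"]             # V
A = np.array([[0.1, 0.4, 0.5],        # A[i, j] = P(U_j | U_i)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],        # B[i, k] = P(symbol k | U_i)
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
pi = np.array([1/3, 1/3, 1/3])        # π_i = P(q1 = U_i); uniform is assumed
```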
Observations and states
       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8
False Start
       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8
P(S|O) = P(S1-8 | O1-8)
P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S1-2, O) ... P(S8|S1-7, O)
Bayes’ Theorem
P(A|B) = P(A) . P(B|A) / P(B)
P(A): Prior
P(B|A): Likelihood
State Transitions Probability
P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S1-2) . P(S4|S1-3) ... P(S8|S1-7)
Observation Sequence probability
P(O|S) = P(O1|S1) . P(O2|S2) ... P(O8|S8) (each observation depends only on the current state)
Grouping terms
       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9
Introducing useful notation
       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9
[Figure: the chain S0 -> S1 -> ... -> S9, with each arc emitting one observation: ε, R, R, G, G, B, R, G, R.]
Notation: P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)
Probabilistic FSM
[Figure: a two-state probabilistic FSM over S1 and S2, with arcs labeled by (symbol : probability) pairs such as (a1:0.3), (a1:0.2), (a2:0.2), (a2:0.2), (a2:0.3).]
Developing the tree
[Figure: developing the tree. At the Start, S1 has probability 1.0 and S2 has 0.0 (on the € input). On input a1, the S1 branch expands with arc probabilities 0.1, 0.3, 0.2, 0.3, e.g., S1: 1 x 0.1 = 0.1, S2: 0.3; the S2 branch contributes 0.0. On input a2, each surviving node expands to S1 and S2 again with arc probabilities such as 0.2, 0.4 and 0.3.]
Tree structure contd…
[Figure: the next level of the tree. From S1 (0.09) and S2 (0.06), input a1 yields children such as S1: 0.09 x 0.1 = 0.009, S2: 0.027, S1: 0.012, S2: 0.018.]
Path found: S1 S2 S1 S2 S1 (working backward over the inputs a1 a2 a1 a2)
Model or Machine: M = {S0, S, A, T}
Why probability of observation sequence?: Language modeling problem
Probabilities are computed in the context of corpora.
Uses of language model
1. Detect well-formedness
• Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
• Given a piece of text, what language does it belong to?
Good morning - English
Guten Morgen - German
Bonjour - French
3. Automatic speech recognition
4. Machine translation
How to compute P(o0 o1 o2 o3 … om)?
Observations: O0 O1 O2 ... Om
States:       S0 S1 S2 S3 ... Sm Sm+1
where the Si represent the state sequence.
Computing P(o0 o1 o2 o3 … om)
P(O, S) = P(S) . P(O|S)
        = P(S0 S1 S2 ... Sm+1) . P(O0 O1 O2 ... Om | S)
        = P(S0) . P(S1|S0) . P(S2|S1) ... P(Sm+1|Sm) . P(O0|S0) . P(O1|S1) ... P(Om|Sm)
        = P(S0) . [P(O0|S0) . P(S1|S0)] ... [P(Om|Sm) . P(Sm+1|Sm)]
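The grouped product transcribes directly into code. This is a sketch reusing the urn model's A, B and pi from above; the helper name `joint_prob` and the index-based encoding are assumptions for illustration.

```python
def joint_prob(obs, path, A, B, pi):
    """P(O, S) = P(S0) * [P(O0|S0) P(S1|S0)] * ... * [P(Om|Sm) P(Sm+1|Sm)].
    obs: symbol indices o0..om; path: state indices S0..Sm+1 (one longer)."""
    p = pi[path[0]]                                  # P(S0)
    for k, o in enumerate(obs):
        p *= B[path[k], o] * A[path[k], path[k + 1]] # one grouped arc factor
    return p
```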
Forward and Backward
Probability Calculation
Forward probability F(k,i)
Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k,i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 .. om)
                     = Σ p=0..N P(o0 o1 o2 .. om, Sp)
                     = Σ p=0..N F(m, p)
Forward probability (contd.)
F(k, q)
= P(o0 o1 o2 .. ok, Sq)
= P(o0 o1 o2 .. ok-1, ok, Sq)
= Σ p=0..N P(o0 o1 o2 .. ok-1, Sp, ok, Sq)
= Σ p=0..N P(o0 o1 o2 .. ok-1, Sp) . P(ok, Sq | o0 o1 o2 .. ok-1, Sp)
= Σ p=0..N F(k-1, p) . P(ok, Sq | Sp)      (Markov assumption)
= Σ p=0..N F(k-1, p) . P(Sp --ok--> Sq)

Obs:   O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
State: S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal
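The recurrence translates into a few lines of numpy. This sketch follows the deck's convention that ok is emitted from the source state of the arc, so one step multiplies F(k-1, ·) elementwise by an emission column and then by A; it reuses A, B and pi from the urn example above.

```python
import numpy as np

def forward(obs, A, B, pi):
    """obs: list of symbol indices o0..om.
    Implements F(k, q) = sum_p F(k-1, p) * P(Sp --ok--> Sq)
                       = sum_p F(k-1, p) * B[p, ok] * A[p, q]."""
    f = pi.copy()                 # distribution over states before any symbol
    for o in obs:
        f = (f * B[:, o]) @ A     # one step of the recurrence
    return f

# P(RRGGBRGR) for the urn model: sum over final states
obs = [0, 0, 1, 1, 2, 0, 1, 0]    # R R G G B R G R as indices into symbols
print(forward(obs, A, B, pi).sum())
```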
Backward probability B(k,i)
Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k,i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 .. om)
                     = P(o0 o1 o2 .. om | S0)
                     = B(0, 0)
Backward probability (contd.)
B(k, p)
= P(ok ok+1 ok+2 … om | Sp)
= P(ok+1 ok+2 … om, ok | Sp)
= Σ q=0..N P(ok+1 ok+2 … om, ok, Sq | Sp)
= Σ q=0..N P(ok, Sq | Sp) . P(ok+1 ok+2 … om | ok, Sq, Sp)
= Σ q=0..N P(ok+1 ok+2 … om | Sq) . P(ok, Sq | Sp)
= Σ q=0..N B(k+1, q) . P(Sp --ok--> Sq)

Obs:   O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
State: S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal
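The mirror-image sketch for the backward pass, continuing from the forward sketch above (same conventions and urn tables). As a consistency check, pi @ backward(obs, A, B) must equal forward(obs, A, B, pi).sum().

```python
def backward(obs, A, B):
    """Implements B(k, p) = sum_q P(Sp --ok--> Sq) * B(k+1, q)
                          = B[p, obs[k]] * sum_q A[p, q] * B(k+1, q)."""
    b = np.ones(A.shape[0])       # base case: B(m+1, ·) = 1
    for o in reversed(obs):
        b = B[:, o] * (A @ b)     # one backward step
    return b

# P(observed sequence) = sum_p pi[p] * B(0, p)
print(pi @ backward(obs, A, B))   # matches forward(obs, A, B, pi).sum()
```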
How Forward Probability Works
Transition probabilities:
     ^    N    V    .
^    0   0.7  0.3   0
N    0   0.2  0.6  0.2
V    0   0.6  0.2  0.2
.    1    0    0    0
Lexical probabilities:
     Є   people  laugh
^    1     0       0
N    0    0.8     0.2
V    0    0.1     0.9
.    1     0       0
Inefficient Computation:
P(O) = Σ over all state sequences S of P(O, S) = Σ_S ∏ P(Si --oj--> Sj)
Computation in various paths of the Tree
For the input "ε people laugh .":
Path 1: ^ N N .    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V .    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N .    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V .    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)
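A brute-force check of these four products, reusing the trans and lex dictionaries from the Viterbi sketch above (the helper name `path_prob` is an assumption of the sketch). Summing the four values gives P(O), the quantity the forward algorithm computes without enumerating paths.

```python
from itertools import product

def path_prob(tags, obs):
    """Product of the arc factors P(ok|Sk) * P(Sk+1|Sk) along one path."""
    p = 1.0
    for k, o in enumerate(obs):
        p *= lex[tags[k]].get(o, 0.0) * trans.get(tags[k], {}).get(tags[k + 1], 0.0)
    return p

obs = ["ε", "people", "laugh"]
total = 0.0
for mid in product(["N", "V"], repeat=2):
    tags = ["^", *mid, "."]
    prob = path_prob(tags, obs)
    total += prob
    print(tags, prob)       # e.g. ['^', 'N', 'N', '.'] 0.00448
# total = P(ε people laugh .), the sum over all four paths
```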
Computations on the Trellis
[Figure: the trellis for "ε people laugh .": a single ^ node, then N and V nodes for "people" and for "laugh", then a single . node; edges shared by several paths are computed only once.]
Number of Multiplications
Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree, so a total of 20 multiplications and 3 additions.
Trellis: partial products shared across paths are computed only once, so fewer multiplications are needed.
Complexity
Summary: Forward Algorithm
Reading List
TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf?ip=182.19.16.71&acc=OPEN&CFID=129797466&CFTOKEN=72601926&__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)