
Part of Speech Tagging

Asif Ekbal
CSE Dept.,
IIT Patna

Part of Speech Tagging

With Hidden Markov Model

NLP Trinity

[Figure: the NLP Trinity, showing three dimensions of the field: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French), and Algorithm (HMM, MEMM, CRF).]

Part of Speech Tagging
● POS Tagging: attaches to each word in a sentence a part-of-speech tag from a given set of tags called the Tag-Set
● Standard Tag-Set: Penn Treebank (for English) - 36 tags

Example
 “_“ The_DT mechanisms_NNS that_WDT
make_VBP traditional_JJ hardware_NN
are_VBP really_RB being_VBG
obsoleted_VBN by_IN microprocessor-
based_JJ machines_NNS ,_, ”_” said_VBD
Mr._NNP Benton_NNP ._.


Where does POS tagging fit in

[Figure: levels of NLP processing in order of increasing complexity: Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference.]

Example to illustrate the complexity of POS tagging

POS tagging is disambiguation
N (noun), V (verb), J (adjective), R
(adverb) and F (other, i.e., function words).

That_F former_J Sri_Lanka_N skipper_N and_F ace_J
batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F
few_J words_N was_F very_R much_R evident_J on_F
Wednesday_N when_F the_F legendary_J batsman_N ,_F
who_F has_V always_R let_V his_N bat_N talk_V ,_F
struggled_V to_F answer_V a_F barrage_N of_F
questions_N at_F a_F function_N to_F promote_V the_F
cricket_N league_N in_F the_F city_N ._F

POS disambiguation
● That_F/N/J ('that' can be a complementizer (put under 'F'), a demonstrative (put under 'J') or a pronoun (put under 'N'))
● former_J
● Sri_N/J Lanka_N/J ('Sri Lanka' together qualifies the skipper)
● skipper_N/V ('skipper' can be a verb too)
● and_F
● ace_J/N ('ace' can be both J and N; "Nadal served an ace")
● batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
● Aravinda_N De_N Silva_N is_F a_F
● man_N/V ('man' can be a verb too, as in 'man the boat')
● of_F few_J
● words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

Behaviour of "That"
● That
  ● That man is known by the company he keeps. (Demonstrative)
  ● Man that is known by the company he keeps, gets a good job. (Pronoun)
  ● That man is known by the company he keeps, is a proverb. (Complementizer)
● Chaotic systems: systems where a small perturbation in input causes a large change in output

POS disambiguation
● was_F very_R much_R evident_J on_F Wednesday_N
● when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
● the_F legendary_J batsman_N
● who_F/N
● has_V always_R let_V his_N
● bat_N/V
● talk_V/N
● struggled_V/N
● answer_V/N
● barrage_N/V
● question_N/V
● function_N/V
● promote_V cricket_N league_N city_N

Mathematics of POS tagging

Argmax computation (1/2)
Best tag sequence
= T*
= argmax P(T|W)
= argmax P(T)P(W|T)      (by Bayes' Theorem)

P(T) = P(t0 t1 t2 … tn+1 = .)
     = P(t0) P(t1|t0) P(t2|t1t0) P(t3|t2t1t0) … P(tn|tn-1tn-2…t0) P(tn+1|tntn-1…t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)

              n+1
     = P(t0)  ∏  P(ti|ti-1)      (Bigram Assumption)
              i=1

Argmax computation (2/2)

P(W|T) = P(w0|t0-tn+1) P(w1|w0, t0-tn+1) P(w2|w1w0, t0-tn+1) … P(wn|w0-wn-1, t0-tn+1) P(wn+1|w0-wn, t0-tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)

         n+1
       =  ∏  P(wi|ti)            (Lexical Probability Assumption)
         i=0
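
To make the decomposition concrete, here is a minimal brute-force sketch (not from the slides): it scores every candidate tag sequence by the product of bigram and lexical probabilities and picks the argmax. The tag set and the toy numbers in TAGS, TRANS and LEX, and the helper name score, are illustrative assumptions, not the lecture's values.

from itertools import product

TAGS = ["N", "V"]
# Assumed toy bigram probabilities P(t_i | t_{i-1}); "^" is the start tag, "." the end tag.
TRANS = {("^", "N"): 0.6, ("^", "V"): 0.4,
         ("N", "N"): 0.2, ("N", "V"): 0.5, ("N", "."): 0.3,
         ("V", "N"): 0.5, ("V", "V"): 0.2, ("V", "."): 0.3}
# Assumed toy lexical probabilities P(w_i | t_i).
LEX = {("N", "people"): 1e-3, ("V", "people"): 1e-6,
       ("N", "laugh"): 1e-5, ("V", "laugh"): 1e-3}

def score(tags, words):
    """P(T) * P(W|T) = prod_i P(t_i|t_{i-1}) * prod_i P(w_i|t_i)."""
    p, prev = 1.0, "^"
    for t, w in zip(tags, words):
        p *= TRANS.get((prev, t), 0.0) * LEX.get((t, w), 0.0)
        prev = t
    return p * TRANS.get((prev, "."), 0.0)   # close the sequence with the end tag

words = ["people", "laugh"]
best = max(product(TAGS, repeat=len(words)), key=lambda T: score(T, words))
print(best, score(best, words))   # brute force over |TAGS|^n sequences -> ('N', 'V')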

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: the tag sequence ^ N V R . as states, linked by bigram (transition) probabilities, with each tag emitting its word via a lexical probability.]

This model is called a generative model: the words are observed as emissions from the tags, which act as states. This is similar to an HMM.

Typical POS tagging steps
● Implementation of Viterbi - Unigram, Bigram
● Evaluation
● Per-POS Accuracy
● Confusion Matrix

A Motivating Example
Colored Ball Choosing

              Urn 1    Urn 2    Urn 3
# of Red        30       10       60
# of Green      50       40       10
# of Blue       20       50       30

Example (contd.)

Given:

Transition probability table
        U1     U2     U3
U1      0.1    0.4    0.5
U2      0.6    0.2    0.2
U3      0.3    0.4    0.3

Emission probability table
         R      G      B
U1      0.3    0.5    0.2
U2      0.1    0.4    0.5
U3      0.6    0.1    0.3

Observation: RRGGBRGR

State Sequence: ??

Not so easily computable.
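
The urn model above can be written down directly as matrices. A small sketch, assuming numpy is available; the initial distribution pi is not given on this slide, so a uniform start over the three urns is assumed purely for illustration, and the helper name sample is mine.

import numpy as np

states = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

A = np.array([[0.1, 0.4, 0.5],    # transition probabilities P(next urn | current urn)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],    # emission probabilities P(colour | urn)
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
pi = np.array([1/3, 1/3, 1/3])    # assumed uniform initial distribution (not on the slide)

rng = np.random.default_rng(0)

def sample(length):
    """Generate one (state sequence, observation sequence) pair from the model."""
    s = rng.choice(3, p=pi)
    seq, obs = [], []
    for _ in range(length):
        seq.append(states[s])
        obs.append(symbols[rng.choice(3, p=B[s])])
        s = rng.choice(3, p=A[s])
    return seq, obs

print(sample(8))   # one possible hidden urn sequence behind an observation like RRGGBRGR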

Diagrammatic representation (1/2)

[Figure: state-transition diagram of the three urns U1, U2, U3, with the transition probabilities on the arcs and the emission probabilities for R, G, B attached to each state, as in the tables above.]

Diagrammatic representation (2/2)

[Figure: the same diagram with each arc labelled by the product of the emission probability at the current urn and the transition probability, e.g., U1 → U1 carries R: 0.3 x 0.1 = 0.03, and U1 → U3 carries R: 0.3 x 0.5 = 0.15, G: 0.5 x 0.5 = 0.25, B: 0.2 x 0.5 = 0.10.]

Classic problems with respect to HMM

1. Given the observation sequence, find the best possible state sequence - Viterbi algorithm
2. Given the observation sequence, find its probability - Forward/Backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

Illustration of Viterbi

● The "start" and "end" are important in a sequence
● Subtrees get eliminated due to the Markov Assumption

POS Tagset
● N (noun), V (verb), O (other) [simplified]
● ^ (start), . (end) [start & end states]
Illustration of Viterbi
Lexicon
people: N, V
laugh: N, V
.
.
.
Corpora for Training
^ w11_t11 w12_t12 w13_t13 ……………….w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 ……………….w2k_2_t2k_2 .
.
.
^ wn1_tn1 wn2_tn2 wn3_tn3 ……………….wnk_n_tnk_n .
Inference

[Figure: partial sequence graph: from ^, paths pass through N or V, then N, and end at "."]

Transition table:
        ^      N      V      O      .
^       0     0.6    0.2    0.2     0
N       0     0.1    0.4    0.3    0.2
V       0     0.3    0.1    0.3    0.3
O       0     0.3    0.2    0.3    0.2
.       1      0      0      0      0

This transition table will change from language to language due to language divergences.
Lexical Probability Table

        ε       people       laugh      ...    ...
^       1         0            0        ...     0
N       0     1 x 10^-3    1 x 10^-5    ...    ...
V       0     1 x 10^-6    1 x 10^-3    ...    ...
O       0         0            0        ...    ...
.       1         0            0         0      0

Size of this table = (# POS tags in the tagset) x (vocabulary size)

vocabulary size = # unique words in the corpus


Inference

New Sentence: ^ people laugh .

[Figure: the two candidate paths: ^ -ε-> N -people-> N -laugh-> . and ^ -ε-> V -people-> N -laugh-> .]

P(^ N N . | ^ people laugh .)
= (0.6 x 1.0) x (0.1 x 1 x 10^-3) x (0.2 x 1 x 10^-5)
Computational Complexity
● If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations
● If |s| = #states (tags + ^ + .) and |o| = length of the sentence (words + ^ + .), then #sequences = |s|^(|o|-2)
● But a large number of partial computations can be reused using Dynamic Programming
Dynamic Programming

[Figure: the Viterbi tree for "^ people laugh .". From ^ (emitting ε) the children are N = 0.6 x 1.0 = 0.6, V = 0.2 and O = 0.2. On "people", the children of N are N = 0.6 x 0.1 x 10^-3 = 6 x 10^-5, V = 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4, O = 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4 and . = 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4. The N-children of V and O (N4 and N5 in the slide) need not be expanded because they will never be part of the winning sequence.]
Computational Complexity

● Retain only those N / V / O nodes which end in the highest sequence probability
● Now, complexity reduces from |s|^|o| to |s|.|o|
● Here, we followed the Markov assumption of order 1
Points to ponder wrt HMM and
Viterbi

Viterbi Algorithm

● Start with the start state
● Keep advancing sequences that are "maximum" amongst all those ending in the same state (a short code sketch follows)
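
Here is a compact sketch of this procedure over the toy transition and lexical tables from the earlier "Inference" slides (^, N, V, O, . and the words "people"/"laugh"). It illustrates keeping, per state, only the best sequence so far; the helper name viterbi and the code structure are mine, not the lecture's reference implementation.

TAGS = ["N", "V", "O"]
TRANS = {"^": {"N": 0.6, "V": 0.2, "O": 0.2, ".": 0.0},
         "N": {"N": 0.1, "V": 0.4, "O": 0.3, ".": 0.2},
         "V": {"N": 0.3, "V": 0.1, "O": 0.3, ".": 0.3},
         "O": {"N": 0.3, "V": 0.2, "O": 0.3, ".": 0.2}}
LEX = {"N": {"people": 1e-3, "laugh": 1e-5},
       "V": {"people": 1e-6, "laugh": 1e-3},
       "O": {}}

def viterbi(words):
    # delta[t] = best probability of any tag sequence ending in tag t; back[] records the choices
    delta = {t: TRANS["^"][t] * LEX[t].get(words[0], 0.0) for t in TAGS}
    back = [{t: "^" for t in TAGS}]
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: delta[p] * TRANS[p][t])
            new_delta[t] = delta[prev] * TRANS[prev][t] * LEX[t].get(w, 0.0)
            ptr[t] = prev
        delta, back = new_delta, back + [ptr]
    # close the sequence with the end tag "."
    last = max(TAGS, key=lambda t: delta[t] * TRANS[t]["."])
    best_prob = delta[last] * TRANS[last]["."]
    tags = [last]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags)), best_prob

print(viterbi(["people", "laugh"]))   # -> (['N', 'V'], ~7.2e-08)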

Viterbi Algorithm
Tree for the sentence: "^ People laugh ."

[Figure: from ^ (emitting ε) the children are N (0.6), V (0.2), O (0.2). On "People", the children of N get 0.06 x 10^-3 (N), 0.24 x 10^-3 (V), 0.18 x 10^-3 (O); the children of V get 0.06 x 10^-6 (N), 0.02 x 10^-6 (V), 0.06 x 10^-6 (O); the children of O are all 0.]

Claim: We do not need to draw all the subtrees in the algorithm

Viterbi phenomenon (Markov process)

[Figure: two N nodes, N1 with probability 6 x 10^-5 and N2 with probability 6 x 10^-8, each expanded into N, V, O on observing "laugh".]

In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probabilities less than the children of N1.

Effect of shifting probability mass
● Will a word always be given the same tag?
● No. Consider the example:
  ● ^ people the city with soldiers .
  ● ^ quickly people the city . (i.e., 'populate')
● In the first sentence "people" is most likely to be tagged as a noun, whereas in the second, the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

Tail phenomenon and Language phenomenon

● Long tail phenomenon: the probability is very low but not zero over a large observed sequence.

● Language phenomenon:
  ● "people", which is predominantly tagged as "Noun", displays a long tail phenomenon.
  ● "laugh" is predominantly tagged as "Verb".

Back to the Urn Example

Here:
● S = {U1, U2, U3}
● V = {R, G, B}
● For observation: O = {o1 … on}
● And state sequence: Q = {q1 … qn}
● πi = P(q1 = Ui)

A =        U1     U2     U3
    U1     0.1    0.4    0.5
    U2     0.6    0.2    0.2
    U3     0.3    0.4    0.3

B =         R      G      B
    U1     0.3    0.5    0.2
    U2     0.1    0.4    0.5
    U3     0.6    0.1    0.3

Observations and states
O1 O2 O3 O4 O5 O6 O7 O8
OBS: R R G G B R G R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1/U2/U3: a particular state
S: State sequence
O: Observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

Goal
● Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S (P(S|O))

False Start

        O1 O2 O3 O4 O5 O6 O7 O8
OBS:    R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

P(S|O) = P(S1-8 | O1-8)
P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S1-2, O) … P(S8|S1-7, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S2, O) … P(S8|S7, O)

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)

State Transitions Probability

P(S) = P(S1-8)
P(S) = P(S1) . P(S2|S1) . P(S3|S1-2) . P(S4|S1-3) … P(S8|S1-7)

By Markov Assumption (k=1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) … P(S8|S7)

Observation Sequence Probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O1-2, S1-8) … P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen.

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) . P(O|S)
       = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) … P(S8|S7) .
         P(O1|S1) . P(O2|S2) . P(O3|S3) … P(O8|S8)

Grouping terms

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(S).P(O|S)
= [P(O0|S0).P(S1|S0)] . [P(O1|S1).P(S2|S1)] . [P(O2|S2).P(S3|S2)] .
  [P(O3|S3).P(S4|S3)] . [P(O4|S4).P(S5|S4)] . [P(O5|S5).P(S6|S5)] .
  [P(O6|S6).P(S7|S6)] . [P(O7|S7).P(S8|S7)] . [P(O8|S8).P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is the ε-transition.
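
These grouped factors can be multiplied out directly. A small sketch on the urn tables, with the same assumed uniform start as before (the slide's explicit S0/S9 bracketing is folded into the initial distribution and the final transition of probability 1); the candidate state sequence and the helper name joint are mine, chosen only to show the computation.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}   # assumed, not given on the slides

def joint(states, obs):
    """P(S) * P(O|S) = pi(S1)*P(O1|S1) * prod_k P(Sk|Sk-1)*P(Ok|Sk)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for prev, s, o in zip(states, states[1:], obs[1:]):
        p *= A[prev][s] * B[s][o]
    return p

obs = list("RRGGBRGR")
candidate = ["U3", "U1", "U1", "U2", "U2", "U3", "U1", "U3"]   # an arbitrary candidate sequence
print(joint(candidate, obs))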

Introducing useful notation

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

[Figure: the chain S0 -ε-> S1 -R-> S2 -R-> S3 -G-> S4 -G-> S5 -B-> S6 -R-> S7 -G-> S8 -R-> S9, each arc labelled with the observation emitted.]

                            Ok
P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --> Sk+1)

Probabilistic FSM

[Figure: a probabilistic FSM with two states S1 and S2; each arc is labelled with (output symbol : probability):
  S1 -> S1: (a1:0.1), (a2:0.2)
  S1 -> S2: (a1:0.3), (a2:0.4)
  S2 -> S1: (a1:0.2), (a2:0.3)
  S2 -> S2: (a1:0.3), (a2:0.2)]

The question here is:
"what is the most likely state sequence given the output sequence seen?"

Developing the tree

[Figure: the Viterbi tree, starting from Start with P(S1) = 1.0 and P(S2) = 0.0 (ε step).
On a1: from S1 (1.0): S1 = 1.0 x 0.1 = 0.1, S2 = 1.0 x 0.3 = 0.3; from S2 (0.0): both 0.0.
On a2: from S1 (0.1): S1 = 0.1 x 0.2 = 0.02, S2 = 0.1 x 0.4 = 0.04; from S2 (0.3): S1 = 0.3 x 0.3 = 0.09, S2 = 0.3 x 0.2 = 0.06.]

Choose the winning sequence per state per iteration.

Tree structure contd…

[Figure: continuing with the winners S1 = 0.09 and S2 = 0.06.
On a1: from S1 (0.09): S1 = 0.09 x 0.1 = 0.009, S2 = 0.09 x 0.3 = 0.027; from S2 (0.06): S1 = 0.06 x 0.2 = 0.012, S2 = 0.06 x 0.3 = 0.018.
On a2, expanding only the per-state winners S1 = 0.012 and S2 = 0.027: from S2 (0.027): S1 = 0.0081, S2 = 0.0054; from S1 (0.012): S1 = 0.0024, S2 = 0.0048.]

The problem being addressed by this tree is S* = argmax_s P(S | a1-a2-a1-a2, μ)

a1-a2-a1-a2 is the output sequence and μ the model or the machine

Path found (working backward): S1 -a1-> S2 -a2-> S1 -a1-> S2 -a2-> S1

Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, μ)

where S = State Sequence, O = Output Sequence, μ = Model or Machine

Model or Machine = {S0, S, A, T}
  S0: start symbol, S: state collection, A: alphabet, T: transition set

                          ak
T is defined as P(Si ----------> Sj)   for all i, j, k

Probability of observation
sequence

Why probability of observation sequence?: Language modeling problem

Probabilities computed in the context of corpora:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake.
3. P("The svn rises in the east")
   • Less probable because of lexical mistake.
4. P("The sun rises in the west")
   • Less probable because of semantic mistake.

Uses of language model
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, which language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bonjour - French
3. Automatic speech recognition
4. Machine translation

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)        (Marginalization)

Consider the observation sequence

    O0 O1 O2 ...... Om
    S0 S1 S2 S3 ... Sm Sm+1

where the Si's represent the state sequence.

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) P(O|S)
        = P(S0 S1 S2 … Sm+1) P(O0 O1 O2 … Om | S)
        = P(S0) . P(S1|S0) . P(S2|S1) …. P(Sm+1|Sm) . P(O0|S0) . P(O1|S1) ….. P(Om|Sm)
        = P(S0) [P(O0|S0).P(S1|S0)] ….. [P(Om|Sm).P(Sm+1|Sm)]
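
This marginalization can be carried out literally by enumerating every state sequence and summing P(S).P(O|S), which is exactly what the forward algorithm later avoids. A brute-force sketch on the urn tables with the same assumed uniform start; the scorer joint is restated here so the snippet stands alone.

from itertools import product

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}   # assumed uniform start

def joint(states, obs):
    # P(S) * P(O|S) for one state sequence
    p = pi[states[0]] * B[states[0]][obs[0]]
    for prev, s, o in zip(states, states[1:], obs[1:]):
        p *= A[prev][s] * B[s][o]
    return p

obs = list("RRGGBRGR")
p_obs = sum(joint(seq, obs) for seq in product(A.keys(), repeat=len(obs)))
print(p_obs)   # sums over 3^8 = 6561 sequences; feasible only for short observations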

Forward and Backward
Probability Calculation

Forward probability F(k,i)
● Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok
● F(k,i) = P(o0 o1 o2 … ok, Si)
● With m as the length of the observed sequence and N states:
  P(observed sequence) = P(o0 o1 o2 .. om)
                       = Σp=0..N P(o0 o1 o2 .. om, Sp)
                       = Σp=0..N F(m, p)

Forward probability (contd.)

F(k, q)
= P(o0 o1 o2 .. ok, Sq)
= P(o0 o1 o2 .. ok-1, ok, Sq)
= Σp=0..N P(o0 o1 o2 .. ok-1, Sp, ok, Sq)
= Σp=0..N P(o0 o1 o2 .. ok-1, Sp) . P(ok, Sq | o0 o1 o2 .. ok-1, Sp)
= Σp=0..N F(k-1, p) . P(ok, Sq | Sp)
                               ok
= Σp=0..N F(k-1, p) . P(Sp ---------> Sq)

    O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
    S0 S1 S2 S3 … Sp Sq  … Sm   Sfinal
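
The recurrence translates into a short loop. A sketch on the urn model (same tables, assumed uniform start), written in the standard convention where the observation is emitted by the current state; the helper name forward is mine, and the result should match the brute-force sum shown earlier.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}   # assumed uniform start

def forward(obs):
    # F[q] = P(o_0 ... o_k, state q after the k-th observation)
    F = {q: pi[q] * B[q][obs[0]] for q in A}
    for o in obs[1:]:
        F = {q: sum(F[p] * A[p][q] for p in A) * B[q][o] for q in A}
    return sum(F.values())              # P(observed sequence) = sum_q F(m, q)

print(forward(list("RRGGBRGR")))        # same value as the exhaustive enumeration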

Backward probability B(k,i)
● Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
  B(k,i) = P(ok ok+1 ok+2 … om | Si)
● With m as the length of the whole observed sequence:
  P(observed sequence) = P(o0 o1 o2 .. om)
                       = P(o0 o1 o2 .. om | S0)
                       = B(0, 0)

Backward probability (contd.)

B(k, p)
= P(ok ok+1 ok+2 … om | Sp)
= P(ok+1 ok+2 … om, ok | Sp)
= Σq=0..N P(ok+1 ok+2 … om, ok, Sq | Sp)
= Σq=0..N P(ok, Sq | Sp) . P(ok+1 ok+2 … om | ok, Sq, Sp)
= Σq=0..N P(ok+1 ok+2 … om | Sq) . P(ok, Sq | Sp)
                               ok
= Σq=0..N B(k+1, q) . P(Sp ---------> Sq)

    O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
    S0 S1 S2 S3 … Sp Sq  … Sm   Sfinal
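
The backward recurrence is the mirror image. A sketch in the standard textbook convention (which excludes ok from B(k, .), so the indexing differs slightly from the slide's definition); with the same urn tables and assumed uniform start it should give the same P(O) as the forward pass. The helper name backward is mine.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}   # assumed uniform start

def backward(obs):
    # beta[p] = P(o_{k+1} ... o_m | state p at position k)
    beta = {p: 1.0 for p in A}           # nothing left to see after the last symbol
    for o in reversed(obs[1:]):
        beta = {p: sum(A[p][q] * B[q][o] * beta[q] for q in A) for p in A}
    # fold in the first observation and the initial distribution
    return sum(pi[p] * B[p][obs[0]] * beta[p] for p in A)

print(backward(list("RRGGBRGR")))        # should equal the forward result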

How Forward Probability Works

● Goal of Forward Probability: to find P(O), the probability of the observation sequence.
● E.g. ^ People laugh .

[Figure: the trellis ^ -> {N, V} ("People") -> {N, V} ("laugh") -> .]

Transition and Lexical Probability Tables

Transition probability table
        ^      N      V      .
^       0     0.7    0.3     0
N       0     0.2    0.6    0.2
V       0     0.6    0.2    0.2
.       1      0      0      0

Lexical probability table
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
.       1      0        0

Inefficient Computation:
                              oj
P(O) = Σ_S P(O, S) = Σ_S ∏_i P(Si -----> Sj)

Computation in various paths of the Tree

            ε        People       Laugh
Path 1: ^ ----> N ----------> N ----------> .
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

            ε        People       Laugh
Path 2: ^ ----> N ----------> V ----------> .
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

            ε        People       Laugh
Path 3: ^ ----> V ----------> N ----------> .
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

            ε        People       Laugh
Path 4: ^ ----> V ----------> V ----------> .
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)
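
The four path probabilities above can be checked numerically; their sum is P(^ People laugh .), and a tiny forward-style trellis over the same tables reuses the shared partial products to arrive at the same value. A sketch restating the tables from the previous slide; the helper name path_prob and the code layout are mine.

TRANS = {"^": {"N": 0.7, "V": 0.3, ".": 0.0},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2}}
LEX = {"^": {"ε": 1.0},
       "N": {"People": 0.8, "Laugh": 0.2},
       "V": {"People": 0.1, "Laugh": 0.9}}

def path_prob(t1, t2):
    # [P(ε|^)P(t1|^)] x [P(People|t1)P(t2|t1)] x [P(Laugh|t2)P(.|t2)]
    return ((LEX["^"]["ε"] * TRANS["^"][t1])
            * (LEX[t1]["People"] * TRANS[t1][t2])
            * (LEX[t2]["Laugh"] * TRANS[t2]["."]))

paths = [("N", "N"), ("N", "V"), ("V", "N"), ("V", "V")]
print([path_prob(*p) for p in paths], sum(path_prob(*p) for p in paths))

# Forward/trellis version: keep one number per tag per column, reusing partial products.
alpha = {t: TRANS["^"][t] * LEX[t]["People"] for t in ("N", "V")}                        # after "People"
alpha = {t: sum(alpha[p] * TRANS[p][t] for p in alpha) * LEX[t]["Laugh"] for t in ("N", "V")}  # after "Laugh"
print(sum(alpha[t] * TRANS[t]["."] for t in alpha))                                      # same total as above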

Computations on the Trellis

[Figure: the trellis for "^ People laugh .": ^ connects (via ε) to the N and V nodes for "People", these connect to the N and V nodes for "laugh", which connect to the final "." node.]

Number of Multiplications

Tree:
● Each path has 5 multiplications + 1 addition.
● There are 4 paths in the tree.
● Therefore, a total of 20 multiplications and 3 additions.

Trellis:
● Partial products shared across paths are computed only once, so fewer multiplications are needed.

Complexity

Summary: Forward Algorithm

Reading List
● TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

● Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf?ip=182.19.16.71&acc=OPEN&CFID=129797466&CFTOKEN=72601926&__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)
