
Introduction: Machine Learning & NLP
17B1NCI731 (Credits: 3, Contact Hours: 3)

Department of CSE & IT,
Jaypee Institute of Information Technology, Noida

1
Part Of Speech Tagging
• Annotate each word in a sentence with a
part-of-speech marker.
• Lowest level of syntactic analysis.
• Identifies the relations between words.
John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN.

• Useful for subsequent syntactic parsing and word sense disambiguation.

2
English POS Tagsets
• Original Brown corpus used a large set of
87 POS tags.
• Most common in NLP today is the Penn
Treebank set of 45 tags.
– Tagset used in these slides.
– Reduced from the Brown set for use in the
context of a parsed corpus (i.e. treebank).
• The C5 tagset used for the British National
Corpus (BNC) has 61 tags.
3
English Parts of Speech
• Noun (person, place or thing)
– Singular (NN): dog, fork
– Plural (NNS): dogs, forks
– Proper (NNP, NNPS): John, Springfields
– Personal pronoun (PRP): I, you, he, she, it
– Wh-pronoun (WP): who, what
• Verb (actions and processes)
– Base, infinitive (VB): eat
– Past tense (VBD): ate
– Gerund (VBG): eating
– Past participle (VBN): eaten
– Non-3rd person singular present tense (VBP): eat
– 3rd person singular present tense (VBZ): eats
– Modal (MD): should, can
– To (TO): to (to eat)
4
English Parts of Speech (cont.)
• Adjective (modify nouns)
– Basic (JJ): red, tall
– Comparative (JJR): redder, taller
– Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
– Basic (RB): quickly
– Comparative (RBR): quicker
– Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner:
– Basic (DT) a, an, the
– WH-determiner (WDT): which, that
• Coordinating Conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)

5
Closed vs. Open Class
• Closed class categories are composed of a small, fixed set
of grammatical function words for a given language. (it’s
hard to make up a new word in these classes)
– Pronouns, Prepositions, Modals, Determiners, Particles,
Conjunctions
• Open class categories have a large number of words, and new
ones are easily invented (anybody can make up a new
word, in any of these classes, at any time).
– Nouns (Googler, textlish), Verbs (Google), Adjectives
(geeky), Adverbs (automagically)

6
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or
adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.

7
POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human
judges for the Penn treebank was 3.5%
– Based on correcting the output of an initial automated tagger,
which was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific
word type gives about 90% accuracy (see the sketch below).
– 93.7% if a model for unknown words is used (Penn Treebank tagset).
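A minimal sketch of such a most-frequent-tag baseline, assuming a toy corpus of (word, tag) pairs; the corpus, the fallback rule, and all names here are invented for illustration:

from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs (illustrative only, not real data).
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("the", "DT"),
    ("saw", "NN"), ("a", "DT"), ("saw", "NN"), ("and", "CC"), ("ran", "VBD"),
]

# Count how often each tag occurs with each word type.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word.lower()][tag] += 1

# Crude handling of unknown words: fall back to the overall most frequent tag.
fallback_tag = Counter(tag for _, tag in tagged_corpus).most_common(1)[0][0]

def baseline_tag(sentence):
    """Tag each token with the tag it was seen with most often in training."""
    return [(w, tag_counts[w.lower()].most_common(1)[0][0]
             if w.lower() in tag_counts else fallback_tag)
            for w in sentence]

print(baseline_tag(["the", "saw", "ran"]))  # [('the', 'DT'), ('saw', 'NN'), ('ran', 'VBD')]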

8
POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical
and other linguistic knowledge.
• Learning-Based: Trained on human annotated
corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM),
Maximum Entropy Markov Model (MEMM),
Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
– Neural networks: recurrent networks such as Long Short-Term
Memory (LSTM) networks
• Generally, learning-based approaches have been
found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
9
Information Extraction
• Identify phrases in language that refer to specific types of
entities and relations in text.
• Named entity recognition is the task of identifying names of
people, places, organizations, etc. in text.
– Michael Dell [person] is the CEO of Dell Computer Corporation
[organization] and lives in Austin, Texas [place].
• Extract pieces of information relevant to a specific
application, e.g. used car ads:
– For sale, 2002 Toyota Prius [year, make, model], 20,000 mi [mileage],
$15K [price] or best offer. Available starting July 30, 2006.

10
Probabilistic Sequence Models
• A probabilistic sequence model computes the probability of each
possible sequence of labels for a given input sentence and
selects the sequence with maximal probability.
• Two standard models
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)

11
Markov Model / Markov Chain
• A finite state machine with probabilistic
state transitions.
• Makes the Markov assumption that the next state depends
only on the current state and is independent of the
previous history.

12
Sample Markov Model for POS

[Figure: state-transition diagram over the states start, Det, Noun, Verb,
PropNoun, and stop. The edges used in the next slide's example are
start→PropNoun (0.4), PropNoun→Verb (0.8), Verb→Det (0.25),
Det→Noun (0.95), and Noun→stop (0.1).]
13
Sample Markov Model for POS

[Figure: the same state-transition diagram as on the previous slide.]
P(PropNoun Verb Det Noun) = 0.4 × 0.8 × 0.25 × 0.95 × 0.1 = 0.0076
14
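A small sketch of the same computation in Python, with only the transition probabilities along this particular path filled in (read off the figure; the rest of the model is omitted):

# Transition probabilities for the path used above (from the diagram).
trans = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def path_probability(states):
    """Multiply the transition probabilities along a state path."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= trans[(prev, nxt)]
    return p

print(path_probability(["start", "PropNoun", "Verb", "Det", "Noun", "stop"]))  # ≈ 0.0076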
Hidden Markov Model
• Probabilistic generative model for sequences.
• POS tagging involves a sequence of observable events
(the words) and hidden events (their tags).
• Assume an underlying set of hidden (unobserved,
latent) states in which the model can be (e.g. parts of
speech).
• Assume probabilistic transitions between states over
time (e.g. transition from POS to another POS as
sequence is generated).
• Assume a probabilistic generation of tokens from
states (e.g. words generated for each POS).

15
Sample HMM for POS

[Figure: the Markov model above extended with emission distributions.
Det emits words such as "the", "a", "that"; Noun emits "cat", "dog",
"car", "bed", "pen", "apple", "bit"; Verb emits "bit", "ate", "saw",
"played", "hit", "gave"; PropNoun emits "Tom", "John", "Mary", "Alice",
"Jerry". Transition probabilities are as before.]
16
Sample HMM Generation
[Figures, slides 17–26: the HMM above is stepped through a generation
example. Starting from the start state, it transits start → PropNoun
(emitting "John") → Verb (emitting "bit") → Det (emitting "the") →
Noun (emitting "apple") → stop, producing the sentence "John bit the apple".]
17–26
Formal Definition of an HMM
• A set of N +2 states S={s0,s1,s2, … sN, sF}
– Distinguished start state: s0
– Distinguished final state: sF
• A set of M possible observations V={v1,v2…vM}
• A state transition probability distribution A = {aij}, where
aij = P(qt+1 = sj | qt = si)
• An observation (emission) probability distribution for each state j,
B = {bj(k)}, where bj(k) = P(ot = vk | qt = sj)
• Total parameter set λ = {A, B}


27
HMM Generation Procedure
• To generate a sequence of T observations:
O = o1 o2 … oT

Set initial state q0 = s0

For t = 1 to T
    Transit to the next state qt = sj based on the transition
    distribution aij of the previous state qt−1
    Pick an observation ot = vk based on being in state qt, using the
    emission distribution bqt(k)
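A minimal sketch of this generation procedure, assuming small illustrative transition and emission tables; the numbers loosely follow the sample HMM from the earlier slides and are not the full model:

import random

A = {  # transition distribution a_ij: P(next state | current state)
    "start":    {"PropNoun": 0.4, "Det": 0.5, "Noun": 0.1},
    "PropNoun": {"Verb": 0.8, "Noun": 0.1, "stop": 0.1},
    "Verb":     {"Det": 0.25, "PropNoun": 0.25, "Noun": 0.4, "stop": 0.1},
    "Det":      {"Noun": 0.95, "Det": 0.05},
    "Noun":     {"Verb": 0.8, "Noun": 0.1, "stop": 0.1},
}
B = {  # emission distribution b_j(k): P(word | state); made-up values
    "PropNoun": {"John": 0.4, "Mary": 0.3, "Tom": 0.3},
    "Verb":     {"bit": 0.4, "ate": 0.3, "saw": 0.3},
    "Det":      {"the": 0.6, "a": 0.4},
    "Noun":     {"apple": 0.5, "dog": 0.3, "cat": 0.2},
}

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate():
    """Walk the HMM from start to stop, emitting one word per visited state."""
    state, words = "start", []
    while True:
        state = sample(A[state])        # transit according to a_ij
        if state == "stop":
            return words
        words.append(sample(B[state]))  # emit according to b_j(k)

print(generate())  # e.g. ['John', 'bit', 'the', 'apple']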

28
SUPERVISED LEARNING FOR
HMMS

•29
Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This
corresponds to a Naïve Bayes model with a single feature (the word).

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)

The product represents the probability of observing a sequence of
words (tokens) along with their associated tags.

[Figure: tags n, v, p, d, n drawn independently above the words
"time flies like an arrow", with no edges between adjacent tags.]
•30
Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over the
sentence/tags with an assumption of dependence between adjacent
tags.

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

[Figure: a <START> state followed by the tag chain n → v → p → d → n,
each tag emitting the corresponding word of "time flies like an arrow".]

•31
From NB to HMM

"Naïve Bayes": [Figure: tags Y1 … Y5, each independently emitting a
word X1 … X5, with no edges between the Y's.]

HMM: [Figure: the same variables with an added start tag Y0 and a chain
of edges Y0 → Y1 → … → Y5; each Yt still emits Xt.]
•32
Hidden Markov Model
HMM Parameters: the transition distribution p(yt | yt−1) and the
emission distribution p(xt | yt).

[Figure: chain Y0 → Y1 → … → Y5 with emissions X1 … X5.]
•33
Hidden Markov Model
HMM Parameters: the transition distribution p(yt | yt−1) and the
emission distribution p(xt | yt).

Assumption: each tag yt depends only on the previous tag yt−1, and
each word xt depends only on its own tag yt.
Generative Story: for t = 1 … T, draw yt ~ p(yt | yt−1), then draw
xt ~ p(xt | yt).

[Figure: chain Y0 → Y1 → … → Y5 with emissions X1 … X5.]
•34
Hidden Markov Model
Joint Distribution:
    p(x1, …, xT, y1, …, yT) = ∏t=1..T p(yt | yt−1) · p(xt | yt),  with y0 = <START>

[Figure: chain Y0 → Y1 → … → Y5 with emissions X1 … X5.]
•35
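A short sketch of this joint probability in Python; the transition and emission values below are placeholders for illustration, not the actual model parameters:

# Placeholder probabilities for the running "time flies like an arrow" example.
p_trans = {("<START>", "n"): 0.3, ("n", "v"): 0.1, ("v", "p"): 0.4,
           ("p", "d"): 0.5, ("d", "n"): 0.8}
p_emit = {("n", "time"): 0.01, ("v", "flies"): 0.02, ("p", "like"): 0.10,
          ("d", "an"): 0.30, ("n", "arrow"): 0.005}

def joint_probability(tags, words):
    """p(x, y) = product over t of p(y_t | y_{t-1}) * p(x_t | y_t), with y_0 = <START>."""
    p, prev = 1.0, "<START>"
    for tag, word in zip(tags, words):
        p *= p_trans[(prev, tag)] * p_emit[(tag, word)]
        prev = tag
    return p

print(joint_probability(["n", "v", "p", "d", "n"],
                        ["time", "flies", "like", "an", "arrow"]))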
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM): each tag Yt depends only on the
previous tag Yt−1.
  [Figure: <START> → Y1 → Y2 → … → Y5, each Yt emitting Xt.]
• 2nd-order HMM (i.e. trigram HMM): each tag Yt depends on the two
previous tags Yt−1 and Yt−2.
  [Figure: the same chain with additional edges from Yt−2 to Yt.]
• 3rd-order HMM: each tag depends on the three previous tags.
  [Figure: the same chain with additional edges from Yt−3 to Yt.]
•36
Three Useful HMM Tasks
• Observation Likelihood: To classify and
order sequences.
• Most likely state sequence (Decoding): To
tag each token in a sequence with a label.
• Maximum likelihood training (Learning): To
train models to fit empirical training data.

37
HMM: Observation Likelihood
• Given a sequence of observations, O, and a model
with a set of parameters, λ, what is the probability
that this observation was generated by this model:
P(O| λ) ?
• Allows HMM to be used as a language model: A
formal probabilistic model of a language that
assigns a probability to each string saying how
likely that string was to have been generated by
the language.
• Useful for two tasks:
– Sequence Classification
– Most Likely Sequence

38
Sequence Classification
• Assume an HMM is available for each category
(i.e. language).
• What is the most likely category for a given
observation sequence, i.e. which category’s HMM
is most likely to have generated it?
• Used in speech recognition to find the most likely word
model to have generated a given sound or phoneme
sequence.
O = "ah s t e n"
Which model is more likely to have generated O: is P(O | Austin) > P(O | Boston)?
39
Most Likely Sequence
• Of two or more possible sequences, which
one was most likely generated by a given
model?
• Used to score alternative word sequence
interpretations in speech recognition.
O1 = "dice precedent core"
O2 = "vice president Gore"
Ordinary English model: is P(O2 | OrdEnglish) > P(O1 | OrdEnglish)?
40
Forward Probabilities

• Let αt(j) be the probability of being in state j after seeing the
first t observations (summing over all initial paths leading to j):

    αt(j) = P(o1, o2, …, ot, qt = sj | λ),  where λ is the given HMM

41
Forward Step

• Consider all possible ways of getting to sj at time t by coming
from each possible state si, and determine the probability of each.
• Sum these to get the total probability of being in state sj at
time t while accounting for the first t − 1 observations.
• Then multiply by the probability of actually observing ot in sj:

    αt(j) = [ Σi αt−1(i) · aij ] · bj(ot)

[Figure: states s1 … sN at time t − 1, with forward probabilities
αt−1(i), all feeding into state sj at time t via transitions aij.]
42
Forward Trellis

[Figure: forward trellis with states s0, s1, …, sN, sF on the vertical
axis and time steps t1 … tT on the horizontal axis; every state at one
time step connects to every state at the next.]
• Continue forward in time until reaching final time
point and sum probability of ending in final state.
43
Computing the Forward Probabilities
• Initialization (at time t = 1 in state j):

    α1(j) = a0j · bj(o1),   1 ≤ j ≤ N

• Recursion (the forward probability of being in state j at time t,
combining the forward probabilities of every state i at time t − 1
with the probability of observing ot in state j):

    αt(j) = [ Σi=1..N αt−1(i) · aij ] · bj(ot),   1 ≤ j ≤ N, 1 < t ≤ T

• Termination (the total probability of the observation sequence
given the HMM):

    P(O | λ) = αT(sF) = Σi=1..N αT(i) · aiF
44
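A compact sketch of these three steps with NumPy, using the same indexing convention (state 0 is the start state s0, state N+1 is the final state sF); the toy parameter values are made up:

import numpy as np

def forward(A, B, obs):
    """Return P(O | lambda) for an HMM with explicit start and final states.

    A: (N+2, N+2) transition matrix (row 0 = start state, column N+1 = final state).
    B: (N+2, M) emission matrix (the start/final rows are unused).
    obs: list of observation indices o_1 .. o_T.
    """
    N, T = A.shape[0] - 2, len(obs)
    alpha = np.zeros((T, N + 2))
    alpha[0, 1:N + 1] = A[0, 1:N + 1] * B[1:N + 1, obs[0]]            # initialization
    for t in range(1, T):                                             # recursion
        alpha[t, 1:N + 1] = (alpha[t - 1, 1:N + 1] @ A[1:N + 1, 1:N + 1]) \
                            * B[1:N + 1, obs[t]]
    return float(alpha[T - 1, 1:N + 1] @ A[1:N + 1, N + 1])           # termination

# Toy model: N = 2 real states, M = 2 observation symbols (values invented).
A = np.array([[0.0, 0.6, 0.4, 0.0],
              [0.0, 0.5, 0.3, 0.2],
              [0.0, 0.2, 0.6, 0.2],
              [0.0, 0.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0],
              [0.7, 0.3],
              [0.1, 0.9],
              [0.0, 0.0]])
print(forward(A, B, [0, 1, 1]))  # P(O | lambda) for the toy observation sequence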
Most Likely State Sequence (Decoding)
• Given an observation sequence, O, and a model, λ,
what is the most likely state sequence,Q=q1,q2,…qT,
that generated this sequence from this model?
• Used for sequence labeling: assuming each state corresponds to a
tag, it determines the globally best assignment of tags to all
tokens in a sequence using a principled approach grounded in
probability theory.

John gave the dog an apple.

45
Most Likely State Sequence
[Slides 46–51 repeat the slide above, stepping through the tag
assignment for the example sentence until every token is labeled:
John/PropNoun gave/Verb the/Det dog/Noun an/Det apple/Noun.]
46–51
THE FORWARD-BACKWARD
ALGORITHM

•52
Learning and Inference Summary
For discrete variables:

•53
Forward-Backward Algorithm

[Figure: variables Y1, Y2, Y3 above the words "find preferred tags".]
"find" could be a verb or noun; "preferred" could be an adjective or
verb; "tags" could be a noun or verb.

54
Forward-Backward Algorithm
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition/emission factors think of it …
[Figure: the "find preferred tags" chain with the candidate tag values
at each position, a transition factor between each adjacent pair of
tags, and an emission factor on each tag–word edge.]
55–58
Viterbi Algorithm: Most Probable Assignment
• So p(v a n) = (1/Z) × the product of 7 numbers, i.e. the product
weight of one path (Z is the partition function / normalization
constant)
• The numbers are associated with the edges and nodes of the path
• The most probable assignment = the path with the highest product
59–60
Forward-Backward Algorithm: Finds Marginals
• So p(v a n) = (1/Z) × the product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) × the total weight of all
paths through a
• Similarly, p(Y2 = n) = (1/Z) × the total weight of all paths through
n, and p(Y2 = v) = (1/Z) × the total weight of all paths through v
61–64
Forward-Backward Algorithm: Finds Marginals
• The path weights are summed by dynamic programming: matrix-vector
products
• For example, the product (a + b + c)(x + y + z) expands to
ax + ay + az + bx + by + bz + cx + cy + cz = the total weight of all
path combinations through those two positions
65–70
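A self-contained sketch of how these marginals can be computed by a forward pass and a backward pass (dynamic programming), under the same start/final-state HMM indexing used earlier; the toy parameters are invented:

import numpy as np

def forward_backward(A, B, obs):
    """Return the marginal p(Y_t = j | O) for every position t and state j."""
    N, T = A.shape[0] - 2, len(obs)
    S = slice(1, N + 1)                   # indices of the real (non start/final) states
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = A[0, S] * B[S, obs[0]]                                  # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A[S, S]) * B[S, obs[t]]
    beta[T - 1] = A[S, N + 1]                                          # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A[S, S] @ (B[S, obs[t + 1]] * beta[t + 1])
    Z = alpha[T - 1] @ A[S, N + 1]        # total weight of all paths, = P(O | lambda)
    return alpha * beta / Z               # each row sums to 1

A = np.array([[0, 0.6, 0.4, 0], [0, 0.5, 0.3, 0.2],
              [0, 0.2, 0.6, 0.2], [0, 0, 0, 0]], dtype=float)
B = np.array([[0, 0], [0.7, 0.3], [0.1, 0.9], [0, 0]], dtype=float)
print(forward_backward(A, B, [0, 1, 1]))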
Viterbi Scores
• Recursively compute the probability of the most likely
subsequence of states that accounts for the first t observations
and ends in state sj:

    vt(j) = maxi [ vt−1(i) · aij ] · bj(ot)

▪ The backpointer btt(j) stores the state at time t − 1 that maximizes
the probability that the system was in state sj at time t (given the
observed sequence).

71
Computing the Viterbi Scores
• Initialization:
    v1(j) = a0j · bj(o1),   1 ≤ j ≤ N
• Recursion:
    vt(j) = maxi=1..N vt−1(i) · aij · bj(ot);   btt(j) = argmaxi=1..N vt−1(i) · aij
• Termination:
    P* = vT(sF) = maxi=1..N vT(i) · aiF;   qT* = btT(sF) = argmaxi=1..N vT(i) · aiF

Analogous to the Forward algorithm, except taking a max instead of a sum.
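A matching sketch of the Viterbi computation with NumPy, under the same start/final-state indexing as the forward-algorithm sketch above (toy parameters invented):

import numpy as np

def viterbi(A, B, obs):
    """Return the most likely state sequence (1-based real-state indices)."""
    N, T = A.shape[0] - 2, len(obs)
    S = slice(1, N + 1)
    v = np.zeros((T, N))                  # Viterbi scores v_t(j)
    bt = np.zeros((T, N), dtype=int)      # backpointers bt_t(j)
    v[0] = A[0, S] * B[S, obs[0]]                                     # initialization
    for t in range(1, T):                                             # recursion: max, not sum
        scores = v[t - 1][:, None] * A[S, S]      # scores[i, j] = v_{t-1}(i) * a_ij
        bt[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[S, obs[t]]
    best = int(np.argmax(v[T - 1] * A[S, N + 1]))                     # termination
    path = [best]
    for t in range(T - 1, 0, -1):                                     # follow backpointers
        path.append(int(bt[t, path[-1]]))
    return [s + 1 for s in reversed(path)]

A = np.array([[0, 0.6, 0.4, 0], [0, 0.5, 0.3, 0.2],
              [0, 0.2, 0.6, 0.2], [0, 0, 0, 0]], dtype=float)
B = np.array([[0, 0], [0.7, 0.3], [0.1, 0.9], [0, 0]], dtype=float)
print(viterbi(A, B, [0, 1, 1]))  # most likely state sequence, e.g. [1, 2, 2]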


72
Summary

The Viterbi algorithm: a dynamic programming approach.

    vt(j) = maxi  vt−1(i)  ·  aij  ·  bj(ot)
            [previous path probability] · [transition probability from state i to j] · [emission probability]

73
Example

◦ Input Sentence – “John is reading a book.”


◦ Tagset - {NN, VB, JJ, DET, PP, RB}

74
Example
◦ Input Sentence – "John is reading a book."
◦ Tagset - {NN, VB, JJ, DET, PP, RB}
◦ [Slides 75–78 fill in the Viterbi table for this sentence column by
column; for example, one cell is computed as
max(0.0775 × 0.12 × 0.17, 0, 0, 0, 0, 0) ≈ 0.0016.]
75–78
Example

◦ Input Sentence – “John is reading a book.”


◦ Tagset - {NN, VB, JJ, DET, PP, RB}

◦ John/NN is/PP reading/VB a/DET book/NN.

79
Refer to the link below for a worked HMM vs. CRF example:
https://www.baeldung.com/cs/hidden-markov-vs-crf

Conditional Random Fields (CRFs) for time series data

LINEAR-CHAIN CRFS

•80
Shortcomings of
Hidden Markov Models

[Figure: HMM chain <START> → Y1 → Y2 → … → Yn with emissions
X1, X2, …, Xn.]

• HMM models capture dependencies between each state and only its
corresponding observation
– NLP example: in a sentence segmentation task, each segmental state may depend not just
on a single word (and the adjacent segmental states), but also on the (non-local) features
of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective
function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction
task, we need the conditional probability P(Y|X)

•81
Conditional Random Field (CRF)
Conditional distribution over the tags given the words wi.
The factors and Z are now specific to the sentence w.
p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

[Figure: <START> followed by the tags n, v, p, d, n, connected by
transition factors ψ0, ψ2, ψ4, ψ6, ψ8, with emission factors
ψ1, ψ3, ψ5, ψ7, ψ9 attaching each tag to the corresponding word of
"time flies like an arrow".]
•82
Conditional Random Field (CRF)
Shaded nodes in a graphical model are observed

[Figure: the same factor graph with generic tag variables Y1 … Y5 and
observed (shaded) word variables X1 … X5.]
•83
Conditional Random Field (CRF)
This linear-chain CRF is just like an HMM, except that its
factors are not necessarily probability distributions

[Figure: the same linear-chain factor graph over Y1 … Y5 and X1 … X5.]
•84
Conditional Random Field (CRF)
• That is the vector X
• Because it’s observed, we can condition on it for free
• Conditioning is how we converted from the MRF to the CRF
(i.e. when taking a slice of the emission factors)

[Figure: the linear-chain factor graph with the word variables
collapsed into a single observed vector X.]
•85
Conditional Random Field (CRF)

    p(y | x) = (1/Z(x)) · ∏t ψt(yt, yt−1, x),   where Z(x) = Σy′ ∏t ψt(y′t, y′t−1, x)

•86
Conditional Random Field (CRF)
• This is the standard linear-chain CRF definition
• It permits rich, overlapping features of the vector X
[Figure: the linear-chain factor graph over Y1 … Y5 with the observed
vector X.]
• Visual Notation: Usually we draw a CRF without showing the variable
corresponding to X
•87–88
General CRF
The topology of the graphical model for a CRF doesn't have to be a chain.
[Figure: a factor graph over variables Y1 … Y9 above the words
"time flies like an arrow", with unary factors ψ{1}, ψ{2}, ψ{3}, …,
pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, …, and higher-order factors
such as ψ{1,8,9}, ψ{2,7,8}, and ψ{3,6,7}.]
•89
Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature
functions:

    ψa(ya, x) = exp( Σk θk · fk(ya, x) )

where ya are the predicted variables (tags) and x are the observed
variables (words).
•90
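A small sketch of this parameterization for a linear-chain CRF: each potential is the exponential of a weighted sum of feature functions, and the conditional probability divides the product of potentials by Z(x). The feature functions, weights, and tag set below are invented for illustration, and Z(x) is computed by brute-force enumeration (fine only for toy sizes):

import itertools
import math

TAGS = ["n", "v", "d"]

def features(prev_tag, tag, words, t):
    """Illustrative feature functions f_k(y_{t-1}, y_t, x, t)."""
    return {
        "trans:%s->%s" % (prev_tag, tag): 1.0,
        "emit:%s:%s" % (tag, words[t]): 1.0,
        "cap:%s" % tag: 1.0 if words[t][0].isupper() else 0.0,  # overlapping feature of x
    }

def potential(theta, prev_tag, tag, words, t):
    """psi = exp(sum_k theta_k * f_k(...)); a nonnegative factor, not a probability."""
    feats = features(prev_tag, tag, words, t)
    return math.exp(sum(theta.get(name, 0.0) * value for name, value in feats.items()))

def score(theta, tags, words):
    """Unnormalized product of potentials along the chain (y_0 = <START>)."""
    s, prev = 1.0, "<START>"
    for t, tag in enumerate(tags):
        s *= potential(theta, prev, tag, words, t)
        prev = tag
    return s

def conditional_prob(theta, tags, words):
    """p(y | x) = score(y, x) / Z(x)."""
    Z = sum(score(theta, list(y), words)
            for y in itertools.product(TAGS, repeat=len(words)))
    return score(theta, tags, words) / Z

theta = {"trans:<START>->n": 0.5, "trans:n->v": 1.2,
         "emit:n:time": 0.8, "emit:v:flies": 0.9}
print(conditional_prob(theta, ["n", "v"], ["time", "flies"]))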
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:

[Figure: the linear-chain factor graph over the tags n, v, p, d, n and
the words "time flies like an arrow", with transition factors ψ2, ψ4,
ψ6, ψ8 and emission factors ψ1, ψ3, ψ5, ψ7, ψ9.]
•91
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:

[Figure: the same linear-chain factor graph over "time flies like an
arrow", with additional factors ψ10, ψ11, ψ12, ψ13 shown above the
chain.]
•92
Conclusions
• POS Tagging is the lowest level of syntactic
analysis.
• It is an instance of sequence labeling, a collective
classification task that also has applications in
information extraction, phrase chunking, semantic
role labeling, and bioinformatics.
• HMMs are a standard generative probabilistic
model for sequence labeling that allows for
efficiently computing the globally most probable
sequence of labels and supports supervised,
unsupervised and semi-supervised learning.
