Introduction Machine Learning & NLP: 17B1NCI731 (Credits:3, Contact Hours: 3)
Part Of Speech Tagging
• Annotate each word in a sentence with a
part-of-speech marker.
• Lowest level of syntactic analysis.
• Identifies the relations between words.
John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
English POS Tagsets
• Original Brown corpus used a large set of
87 POS tags.
• Most common in NLP today is the Penn
Treebank set of 45 tags.
– Tagset used in these slides.
– Reduced from the Brown set for use in the
context of a parsed corpus (i.e. treebank).
• The C5 tagset used for the British National
Corpus (BNC) has 61 tags.
English Parts of Speech
• Noun (person, place or thing)
– Singular (NN): dog, fork
– Plural (NNS): dogs, forks
– Proper (NNP, NNPS): John, Springfields
– Personal pronoun (PRP): I, you, he, she, it
– Wh-pronoun (WP): who, what
• Verb (actions and processes)
– Base, infinitive (VB): eat
– Past tense (VBD): ate
– Gerund (VBG): eating
– Past participle (VBN): eaten
– Non 3rd person singular present tense (VBP): eat
– 3rd person singular present tense (VBZ): eats
– Modal (MD): should, can
– To (TO): to (to eat)
English Parts of Speech (cont.)
• Adjective (modify nouns)
– Basic (JJ): red, tall
– Comparative (JJR): redder, taller
– Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
– Basic (RB): quickly
– Comparative (RBR): quicker
– Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner:
– Basic (DT) a, an, the
– WH-determiner (WDT): which, that
• Coordinating Conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)
Closed vs. Open Class
• Closed class categories are composed of a small, fixed set
of grammatical function words for a given language. (it’s
hard to make up a new word in these classes)
– Pronouns, Prepositions, Modals, Determiners, Particles,
Conjunctions
• Open class categories have large number of words and new
ones are easily invented (anybody can make up a new
word, in any of these classes, at any time).
– Nouns (Googler, textlish), Verbs (Google), Adjectives
(geeky), Adverbs (automagically)
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or
adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.
POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human
judges for the Penn Treebank was 3.5%.
– Based on correcting the output of an initial automated tagger,
which was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific
word type gives about 90% accuracy
– 93.7% if use model for unknown words for Penn Treebank tagset.
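The most-frequent-tag baseline above can be sketched in a few lines. The tiny tagged corpus here is made up for illustration; the ~90% figure on the slide comes from training on a real corpus such as the Penn Treebank.

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus (word, tag) pairs; a real baseline
# would be trained on Penn Treebank annotations.
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("saw", "NN"), ("old", "JJ"), ("saw", "NN"),
]

# Count tags per word type, then keep the most frequent tag for each.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def baseline_tag(words, unknown_tag="NN"):
    """Tag each token with its most frequent training tag,
    backing off to a default tag for unknown words."""
    return [most_frequent.get(w, unknown_tag) for w in words]

print(baseline_tag(["the", "saw", "cat"]))  # ['DT', 'NN', 'NN']
```

Note that "saw" gets NN here because NN outnumbers VBD for it in the toy corpus; the baseline ignores context entirely, which is exactly why ambiguous tokens cap its accuracy.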
POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical
and other linguistic knowledge.
• Learning-Based: Trained on human annotated
corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM),
Maximum Entropy Markov Model (MEMM),
Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
– Neural networks: recurrent networks such as Long
Short-Term Memory (LSTM) networks
• Generally, learning-based approaches have been
found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
Information Extraction
• Identify phrases in language that refer to specific types of
entities and relations in text.
• Named entity recognition is the task of identifying names of
people, places, organizations, etc. in text.
– Michael Dell [person] is the CEO of Dell Computer Corporation [organization] and lives in Austin, Texas [place].
• Extract pieces of information relevant to a specific
application, e.g. used car ads:
– For sale, 2002 [year] Toyota [make] Prius [model], 20,000 mi [mileage], $15K [price] or best offer. Available starting July 30, 2006.
Probabilistic Sequence Models
• A probabilistic sequence model computes the probability
of each possible sequence of labels for a given input
sentence and selects the sequence with maximal probability.
• Two standard models
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
Markov Model / Markov Chain
• A finite state machine with probabilistic
state transitions.
• Makes the Markov assumption that the next state depends
only on the current state and is independent of previous
history.
Sample Markov Model for POS
[Figure: a state-transition diagram over the states start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities including P(PropNoun | start) = 0.4, P(Verb | PropNoun) = 0.8, P(Det | Verb) = 0.25, P(Noun | Det) = 0.95, and P(stop | Noun) = 0.1.]
P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
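The worked product above can be reproduced with a minimal sketch. The transition table below contains only the arcs needed for this example, reconstructed from the factors in the product; transitions not listed are treated as probability 0 here, though the full diagram has more arcs.

```python
# Transition probabilities P(next | prev) for the arcs used in the
# slide's worked example.
transitions = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def sequence_probability(states):
    """Multiply transition probabilities along the path, including the
    implicit start and stop transitions."""
    path = ["start"] + states + ["stop"]
    prob = 1.0
    for prev, nxt in zip(path, path[1:]):
        prob *= transitions.get((prev, nxt), 0.0)
    return prob

print(round(sequence_probability(["PropNoun", "Verb", "Det", "Noun"]), 4))
# 0.0076
```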
Hidden Markov Model
• Probabilistic generative model for sequences.
• POS tagging involves a sequence of observable events
(words) and hidden events (tags).
• Assume an underlying set of hidden (unobserved,
latent) states in which the model can be (e.g. parts of
speech).
• Assume probabilistic transitions between states over
time (e.g. transition from POS to another POS as
sequence is generated).
• Assume a probabilistic generation of tokens from
states (e.g. words generated for each POS).
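The generative story above (probabilistic transitions between hidden POS states, probabilistic emission of a word from each state) can be sketched as a sampler. All transition and emission probabilities here are made up for illustration, not taken from the slides.

```python
import random

# Made-up transition distributions over next states...
transitions = {
    "start": [("PropNoun", 0.4), ("Det", 0.6)],
    "PropNoun": [("Verb", 1.0)],
    "Det": [("Noun", 1.0)],
    "Verb": [("Det", 0.5), ("stop", 0.5)],
    "Noun": [("stop", 1.0)],
}
# ...and made-up emission distributions over words for each state.
emissions = {
    "PropNoun": [("John", 1.0)],
    "Verb": [("saw", 0.5), ("likes", 0.5)],
    "Det": [("the", 0.7), ("a", 0.3)],
    "Noun": [("saw", 0.4), ("table", 0.6)],
}

def sample(dist):
    """Draw one item from a list of (value, probability) pairs."""
    values, weights = zip(*dist)
    return random.choices(values, weights=weights)[0]

def generate():
    """Walk the states from start to stop, emitting a word per state."""
    state, output = sample(transitions["start"]), []
    while state != "stop":
        output.append((sample(emissions[state]), state))
        state = sample(transitions[state])
    return output

print(generate())  # e.g. [('John', 'PropNoun'), ('saw', 'Verb')]
```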
Sample HMM for POS
[Figure: an HMM over POS states with transition probabilities and per-state word emissions; the diagram is not recoverable from the extracted text.]
Sample HMM Generation
[Figure: a walkthrough of the HMM emitting a word from each state as it transitions from start to stop.]
SUPERVISED LEARNING FOR HMMS
Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This
corresponds to a Naïve Bayes model with a single feature (the word).
[Figure: independent tags n, v, p, d, n above the words "time flies like an arrow".]
Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over
the sentence/tags with an assumption of dependence between
adjacent tags.
[Figure: <START> followed by the tag chain n → v → p → d → n above the words "time flies like an arrow".]
From NB to HMM
"Naïve Bayes": tags Y1 … Y5 each generate their own word X1 … X5, with no edges between tags.
HMM: the tags form a chain Y0 → Y1 → … → Y5, with each word Xt generated from its tag Yt.
Hidden Markov Model
HMM Parameters:
– Initial distribution: P(Y1 | Y0 = <START>)
– Transition probabilities: P(Yt | Yt-1)
– Emission probabilities: P(Xt | Yt)
[Figure: the chain Y0 → Y1 → … → Y5 with each Xt emitted from Yt.]
Hidden Markov Model
Assumption: each tag depends only on the previous tag, and
each word depends only on its own tag.
Generative Story: for t = 1, …, T, draw Yt ~ P(· | Yt-1), then
draw Xt ~ P(· | Yt).
[Figure: the chain Y0 → Y1 → … → Y5 over words X1 … X5.]
Hidden Markov Model
Joint Distribution:
P(X, Y) = ∏t=1..T P(Yt | Yt-1) P(Xt | Yt)
[Figure: the chain Y0 → Y1 → … → Y5 over words X1 … X5.]
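The factored joint distribution can be computed directly, one transition and one emission term per position. The probabilities below are made-up illustrations for the running sentence "time flies like an arrow" tagged n v p d n.

```python
# Hypothetical transition probabilities P(Y_t | Y_{t-1})...
trans = {("<START>", "n"): 0.5, ("n", "v"): 0.4, ("v", "p"): 0.3,
         ("p", "d"): 0.6, ("d", "n"): 0.8}
# ...and hypothetical emission probabilities P(X_t | Y_t).
emit = {("n", "time"): 0.1, ("v", "flies"): 0.2, ("p", "like"): 0.3,
        ("d", "an"): 0.4, ("n", "arrow"): 0.1}

def joint(words, tag_seq):
    """P(X, Y) = product over t of P(Y_t | Y_{t-1}) * P(X_t | Y_t)."""
    prob, prev = 1.0, "<START>"
    for word, tag in zip(words, tag_seq):
        prob *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return prob

words = ["time", "flies", "like", "an", "arrow"]
print(joint(words, ["n", "v", "p", "d", "n"]))
```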
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
[Figure: <START> → Y1 → Y2 → Y3 → Y4 → Y5, each Yt emitting Xt.]
• 3rd-order HMM: each tag depends on the previous three tags
[Figure: the same chain with edges from Yt-3, Yt-2, and Yt-1 to each Yt.]
Three Useful HMM Tasks
• Observation Likelihood: To classify and
order sequences.
• Most likely state sequence (Decoding): To
tag each token in a sequence with a label.
• Maximum likelihood training (Learning): To
train models to fit empirical training data.
HMM: Observation Likelihood
• Given a sequence of observations, O, and a model
with a set of parameters, λ, what is the probability
that this observation sequence was generated by this
model: P(O | λ)?
• Allows HMM to be used as a language model: A
formal probabilistic model of a language that
assigns a probability to each string saying how
likely that string was to have been generated by
the language.
• Useful for two tasks:
– Sequence Classification
– Most Likely Sequence
Sequence Classification
• Assume an HMM is available for each category
(i.e. language).
• What is the most likely category for a given
observation sequence, i.e. which category’s HMM
is most likely to have generated it?
• Used in speech recognition to find the most likely
word model to have generated a given sound or
phoneme sequence.
[Figure: an observation sequence O = "ah s t e n" scored by two word models; is P(O | Austin) > P(O | Boston)?]
Most Likely Sequence
• Of two or more possible sequences, which
one was most likely generated by a given
model?
• Used to score alternative word sequence
interpretations in speech recognition.
[Figure: an observation sequence O1 scored against alternative word-sequence interpretations such as "dice precedent core".]
Forward Step
[Figure: a trellis of states s1 … sN over time steps t1 … tT, with start state s0 and final state sF; forward probabilities flow left to right.]
• Continue forward in time until reaching the final time
point and sum the probability of ending in the final state.
Computing the Forward Probabilities
observation probability of observing observation
• Initialization at time t=1 in state j
• Termination
the total probability of the observation the forward probability of being in state i
sequence given the HMM at the final time step T. 44
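The three steps above can be sketched directly. The two-state parameters below are made up for illustration (states and probabilities are assumptions, not from the slides).

```python
# Hypothetical two-state HMM.
states = ["Hot", "Cold"]
initial = {"Hot": 0.6, "Cold": 0.4}                 # P(Y_1 = j)
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},          # a_ij
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4},      # b_j(x)
        "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

def forward(observations):
    """Return P(O | lambda) via the forward recursion."""
    # Initialization: alpha_1(j) = P(Y_1 = j) * b_j(x_1)
    alpha = {j: initial[j] * emit[j][observations[0]] for j in states}
    # Recursion over the remaining observations
    for x in observations[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states)
                    * emit[j][x] for j in states}
    # Termination: sum over final states
    return sum(alpha.values())

print(forward(["3", "1", "3"]))
```

A quick sanity check: summing P(O | λ) over all length-1 observation sequences gives 1, as it must for a proper distribution.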
Most Likely State Sequence (Decoding)
• Given an observation sequence, O, and a model, λ,
what is the most likely state sequence, Q = q1, q2, …, qT,
that generated this sequence from this model?
• Used for sequence labeling: assuming each state
corresponds to a tag, it determines the globally best
assignment of tags to all tokens in a sequence using a
principled approach grounded in probability theory.
Learning and Inference Summary
For discrete variables: [Table: marginal inference and the partition function are computed with the forward–backward algorithm; the most likely assignment (decoding) is computed with the Viterbi algorithm.]
Forward-Backward Algorithm: Finds Marginals
[Figure series: forward probabilities α and backward probabilities β computed over the chain Y1, Y2, Y3; combining them yields the marginal probability of each tag at each position.]
Viterbi Scores
• Recursively compute the probability of the most
likely subsequence of states that accounts for the
first t observations and ends in state sj.
Computing the Viterbi Scores
• Initialization: v1(j) = P(Y1 = sj) bj(x1)
• Recursion: vt(j) = maxi=1..N vt-1(i) aij bj(xt), keeping a backpointer to the maximizing predecessor i
• Termination: the score of the best path is maxi vT(i); the most likely state sequence is recovered by following backpointers
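Viterbi decoding follows the same shape as the forward algorithm, with max in place of sum plus backpointers. The two-state parameters below are the same made-up illustration used for the forward sketch.

```python
# Hypothetical two-state HMM.
states = ["Hot", "Cold"]
initial = {"Hot": 0.6, "Cold": 0.4}                 # P(Y_1 = j)
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},          # a_ij
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4},      # b_j(x)
        "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    # Initialization: v_1(j) = P(Y_1 = j) * b_j(x_1)
    v = {j: initial[j] * emit[j][observations[0]] for j in states}
    backpointers = []
    # Recursion: keep the best-scoring predecessor for each state
    for x in observations[1:]:
        bp = {j: max(states, key=lambda i: v[i] * trans[i][j])
              for j in states}
        v = {j: v[bp[j]] * trans[bp[j]][j] * emit[j][x] for j in states}
        backpointers.append(bp)
    # Termination: best final state, then follow backpointers
    best = max(states, key=lambda j: v[j])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["3", "1", "3"]))  # ['Hot', 'Hot', 'Hot']
```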
Example
[Worked example over several slides: the Viterbi trellis is filled in column by column; e.g. one cell is computed as max(0.0775 * 0.12 * 0.17, 0, 0, 0, 0, 0) ≈ 0.0016.]
Refer to the link below for a worked HMM vs. CRF example:
https://www.baeldung.com/cs/hidden-markov-vs-crf
LINEAR-CHAIN CRFS
Shortcomings of
Hidden Markov Models
[Figure: an HMM with a start state, hidden states Y1 … Yn, and observations X1 … Xn.]
• HMM models capture dependencies between each state and only its
corresponding observation
– NLP example: in a sentence segmentation task, each segmental state may depend not just
on a single word (and the adjacent segmental states), but also on the (non-local) features
of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective
function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction
task, we need the conditional probability P(Y|X)
Conditional Random Field (CRF)
Conditional distribution over the tags given the words w1 … wn.
The factors ψ and the normalizer Z are now specific to the sentence w.
p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * …)
[Figure: the factor-graph chain <START> ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n, with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 linking each tag to the words "time flies like an arrow".]
Conditional Random Field (CRF)
Shaded nodes in a graphical model are observed.
[Figure: the same chain with tags Y1 … Y5 and shaded (observed) words X1 … X5.]
Conditional Random Field (CRF)
This linear-chain CRF is just like an HMM, except that its
factors are not necessarily probability distributions.
[Figure: the chain <START> ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5 with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 over X1 … X5.]
Conditional Random Field (CRF)
• The whole sentence is treated as one observed vector X
• Because it's observed, we can condition on it for free
• Conditioning is how we converted from the MRF to the CRF
(i.e. when taking a slice of the emission factors)
[Figure: the tag chain Y1 … Y5 with a single observed node X for the entire sentence.]
Conditional Random Field (CRF)
p(y | x) = (1/Z(x)) ∏t=1..T ψt(yt-1, yt, x), where Z(x) = Σy′ ∏t=1..T ψt(y′t-1, y′t, x)
Conditional Random Field (CRF)
• This is the standard linear-chain CRF definition
• It permits rich, overlapping features of the vector X
[Figure: the linear-chain factor graph <START> ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5 over the observed X.]
[Figure: a more general CRF over "time flies like an arrow": unary factors on each tag, pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4} on adjacent tags, and a higher-order factor ψ{3,6,7} connecting non-adjacent variables, including an additional variable Y6.]
Standard CRF Parameterization
[Figure: predicted variables (tags) vs. observed variables (words) in the factor graph.]
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
ψt(yt-1, yt, x) = exp( Σk θk fk(yt-1, yt, x, t) )
[Figure: the chain n ψ2 v ψ4 p ψ6 d ψ8 n with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 over "time flies like an arrow".]
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
[Figure: the same chain with additional transition factors ψ10, ψ11, ψ12, ψ13 attached between adjacent tag pairs.]
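The CRF parameterization above can be sketched with a toy linear-chain model: each potential is the exponential of weighted feature functions of (yt-1, yt, x, t), and the conditional probability normalizes over all tag sequences. The tags, feature names, and weights below are made up for illustration; Z(x) is computed by brute-force enumeration, which is only feasible for tiny examples (a real implementation would use forward-backward).

```python
import itertools
import math

tags = ["n", "v", "d"]  # hypothetical tiny tagset

def features(prev_tag, tag, words, t):
    """Feature functions f_k(y_{t-1}, y_t, x, t); note they may look at
    the whole observed sentence, not just the current word."""
    return {
        f"word={words[t]},tag={tag}": 1.0,
        f"prev={prev_tag},tag={tag}": 1.0,
        f"first_word_capitalized,tag={tag}": float(words[0][0].isupper()),
    }

# Made-up weights theta_k; unlisted features have weight 0.
weights = {"word=dog,tag=n": 2.0, "prev=d,tag=n": 1.5,
           "word=the,tag=d": 2.0, "prev=<START>,tag=d": 0.5}

def score(tag_seq, words):
    """Log of the product of potentials psi_t = exp(theta . f)."""
    total, prev = 0.0, "<START>"
    for t, tag in enumerate(tag_seq):
        feats = features(prev, tag, words, t)
        total += sum(weights.get(name, 0.0) * v for name, v in feats.items())
        prev = tag
    return total

def conditional_prob(tag_seq, words):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z by brute force."""
    z = sum(math.exp(score(seq, words))
            for seq in itertools.product(tags, repeat=len(words)))
    return math.exp(score(tag_seq, words)) / z

sentence = ["the", "dog"]
print(round(conditional_prob(("d", "n"), sentence), 3))
```

Because the weights favor the/d and dog/n (and the d→n transition), the tagging ("d", "n") receives most of the probability mass for this toy sentence.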