Introduction Machine Learning & NLP: 17B1NCI731 (Credits:3, Contact Hours: 3)
Part Of Speech Tagging
• Annotate each word in a sentence with a
part-of-speech marker.
• Lowest level of syntactic analysis.
• Identifies the relations between words.
John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
English POS Tagsets
• Original Brown corpus used a large set of
87 POS tags.
• Most common in NLP today is the Penn
Treebank set of 45 tags.
– Tagset used in these slides.
– Reduced from the Brown set for use in the
context of a parsed corpus (i.e. treebank).
• The C5 tagset used for the British National
Corpus (BNC) has 61 tags.
English Parts of Speech
• Noun (person, place or thing)
– Singular (NN): dog, fork
– Plural (NNS): dogs, forks
– Proper (NNP, NNPS): John, Springfields
– Personal pronoun (PRP): I, you, he, she, it
– Wh-pronoun (WP): who, what
• Verb (actions and processes)
– Base, infinitive (VB): eat
– Past tense (VBD): ate
– Gerund (VBG): eating
– Past participle (VBN): eaten
– Non 3rd person singular present tense (VBP): eat
– 3rd person singular present tense (VBZ): eats
– Modal (MD): should, can
– To (TO): to (to eat)
English Parts of Speech (cont.)
• Adjective (modify nouns)
– Basic (JJ): red, tall
– Comparative (JJR): redder, taller
– Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
– Basic (RB): quickly
– Comparative (RBR): quicker
– Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner:
– Basic (DT) a, an, the
– WH-determiner (WDT): which, that
• Coordinating Conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)
Closed vs. Open Class
• Closed class categories are composed of a small, fixed set
of grammatical function words for a given language. (it’s
hard to make up a new word in these classes)
– Pronouns, Prepositions, Modals, Determiners, Particles,
Conjunctions
• Open class categories have large number of words and new
ones are easily invented (anybody can make up a new
word, in any of these classes, at any time).
– Nouns (Googler, textlish), Verbs (Google), Adjectives
(geeky), Adverbs (automagically)
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or
adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.
POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human
judges for the Penn Treebank was 3.5%.
– Based on correcting the output of an initial automated tagger,
which was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific
word type gives about 90% accuracy
– 93.7% if use model for unknown words for Penn Treebank tagset.
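The most-frequent-tag baseline above can be sketched in a few lines. The tiny tagged corpus here is made up for illustration; the ~90% figure on the slide comes from training on a real corpus such as the Penn Treebank.

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus (word, tag) pairs; a real baseline
# would be trained on Penn Treebank annotations.
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("saw", "NN"), ("old", "JJ"), ("saw", "NN"),
]

# Count tags per word type, then keep the most frequent tag for each.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def baseline_tag(words, unknown_tag="NN"):
    """Tag each token with its most frequent training tag,
    backing off to a default tag for unknown words."""
    return [most_frequent.get(w, unknown_tag) for w in words]

print(baseline_tag(["the", "saw", "cat"]))  # ['DT', 'NN', 'NN']
```

Note that "saw" gets NN here because NN outnumbers VBD for it in the toy corpus; the baseline ignores context entirely, which is exactly why ambiguous tokens cap its accuracy.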
POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical
and other linguistic knowledge.
• Learning-Based: Trained on human annotated
corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM),
Maximum Entropy Markov Model (MEMM),
Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
– Neural networks: recurrent networks such as Long
Short-Term Memory (LSTM) networks
• Generally, learning-based approaches have been
found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
Information Extraction
• Identify phrases in language that refer to specific types of
entities and relations in text.
• Named entity recognition is the task of identifying names of
people, places, organizations, etc. in text.
– Michael Dell [person] is the CEO of Dell Computer Corporation [organization] and lives in Austin, Texas [place].
• Extract pieces of information relevant to a specific
application, e.g. used car ads:
– For sale, 2002 [year] Toyota [make] Prius [model], 20,000 mi [mileage], $15K [price] or best offer. Available starting July 30, 2006.
Probabilistic Sequence Models
• A probabilistic sequence model computes the probability
of each possible sequence of labels for a given input
sentence and selects the sequence with maximal probability.
• Two standard models
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
Markov Model / Markov Chain
• A finite state machine with probabilistic
state transitions.
• Makes the Markov assumption that the next state depends
only on the current state and is independent of previous
history.
Sample Markov Model for POS
[Figure: a state-transition diagram over the states start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities including P(PropNoun | start) = 0.4, P(Verb | PropNoun) = 0.8, P(Det | Verb) = 0.25, P(Noun | Det) = 0.95, and P(stop | Noun) = 0.1.]
P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
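The worked product above can be reproduced with a minimal sketch. The transition table below contains only the arcs needed for this example, reconstructed from the factors in the product; transitions not listed are treated as probability 0 here, though the full diagram has more arcs.

```python
# Transition probabilities P(next | prev) for the arcs used in the
# slide's worked example.
transitions = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def sequence_probability(states):
    """Multiply transition probabilities along the path, including the
    implicit start and stop transitions."""
    path = ["start"] + states + ["stop"]
    prob = 1.0
    for prev, nxt in zip(path, path[1:]):
        prob *= transitions.get((prev, nxt), 0.0)
    return prob

print(round(sequence_probability(["PropNoun", "Verb", "Det", "Noun"]), 4))
# 0.0076
```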
Hidden Markov Model
• Probabilistic generative model for sequences.
• POS tagging involves a sequence of observable events
(words) and hidden events (tags).
• Assume an underlying set of hidden (unobserved,
latent) states in which the model can be (e.g. parts of
speech).
• Assume probabilistic transitions between states over
time (e.g. transition from POS to another POS as
sequence is generated).
• Assume a probabilistic generation of tokens from
states (e.g. words generated for each POS).
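The generative story above (probabilistic transitions between hidden POS states, probabilistic emission of a word from each state) can be sketched as a sampler. All transition and emission probabilities here are made up for illustration, not taken from the slides.

```python
import random

# Made-up transition distributions over next states...
transitions = {
    "start": [("PropNoun", 0.4), ("Det", 0.6)],
    "PropNoun": [("Verb", 1.0)],
    "Det": [("Noun", 1.0)],
    "Verb": [("Det", 0.5), ("stop", 0.5)],
    "Noun": [("stop", 1.0)],
}
# ...and made-up emission distributions over words for each state.
emissions = {
    "PropNoun": [("John", 1.0)],
    "Verb": [("saw", 0.5), ("likes", 0.5)],
    "Det": [("the", 0.7), ("a", 0.3)],
    "Noun": [("saw", 0.4), ("table", 0.6)],
}

def sample(dist):
    """Draw one item from a list of (value, probability) pairs."""
    values, weights = zip(*dist)
    return random.choices(values, weights=weights)[0]

def generate():
    """Walk the states from start to stop, emitting a word per state."""
    state, output = sample(transitions["start"]), []
    while state != "stop":
        output.append((sample(emissions[state]), state))
        state = sample(transitions[state])
    return output

print(generate())  # e.g. [('John', 'PropNoun'), ('saw', 'Verb')]
```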
Sample HMM for POS
[Figure: an HMM over POS states with transition probabilities and per-state word emissions; the diagram is not recoverable from the extracted text.]
Sample HMM Generation
[Figure: a walkthrough of the HMM emitting a word from each state as it transitions from start to stop.]
SUPERVISED LEARNING FOR HMMS
Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This
corresponds to a Naïve Bayes model with a single feature (the word).
[Figure: independent tags n, v, p, d, n above the words "time flies like an arrow".]
Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over
the sentence/tags with an assumption of dependence between
adjacent tags.
[Figure: <START> followed by the tag chain n → v → p → d → n above the words "time flies like an arrow".]
From NB to HMM
"Naïve Bayes": tags Y1 … Y5 each generate their own word X1 … X5, with no edges between tags.
HMM: the tags form a chain Y0 → Y1 → … → Y5, with each word Xt generated from its tag Yt.
Hidden Markov Model
HMM Parameters:
– Initial distribution: P(Y1 | Y0 = <START>)
– Transition probabilities: P(Yt | Yt-1)
– Emission probabilities: P(Xt | Yt)
[Figure: the chain Y0 → Y1 → … → Y5 with each Xt emitted from Yt.]
Hidden Markov Model
Assumption: each tag depends only on the previous tag, and
each word depends only on its own tag.
Generative Story: for t = 1, …, T, draw Yt ~ P(· | Yt-1), then
draw Xt ~ P(· | Yt).
[Figure: the chain Y0 → Y1 → … → Y5 over words X1 … X5.]
Hidden Markov Model
Joint Distribution:
P(X, Y) = ∏t=1..T P(Yt | Yt-1) P(Xt | Yt)
[Figure: the chain Y0 → Y1 → … → Y5 over words X1 … X5.]
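The factored joint distribution can be computed directly, one transition and one emission term per position. The probabilities below are made-up illustrations for the running sentence "time flies like an arrow" tagged n v p d n.

```python
# Hypothetical transition probabilities P(Y_t | Y_{t-1})...
trans = {("<START>", "n"): 0.5, ("n", "v"): 0.4, ("v", "p"): 0.3,
         ("p", "d"): 0.6, ("d", "n"): 0.8}
# ...and hypothetical emission probabilities P(X_t | Y_t).
emit = {("n", "time"): 0.1, ("v", "flies"): 0.2, ("p", "like"): 0.3,
        ("d", "an"): 0.4, ("n", "arrow"): 0.1}

def joint(words, tag_seq):
    """P(X, Y) = product over t of P(Y_t | Y_{t-1}) * P(X_t | Y_t)."""
    prob, prev = 1.0, "<START>"
    for word, tag in zip(words, tag_seq):
        prob *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return prob

words = ["time", "flies", "like", "an", "arrow"]
print(joint(words, ["n", "v", "p", "d", "n"]))
```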
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
[Figure: <START> → Y1 → Y2 → Y3 → Y4 → Y5, each Yt emitting Xt.]
• 3rd-order HMM: each tag depends on the previous three tags
[Figure: the same chain with edges from Yt-3, Yt-2, and Yt-1 to each Yt.]
Three Useful HMM Tasks
• Observation Likelihood: To classify and
order sequences.
• Most likely state sequence (Decoding): To
tag each token in a sequence with a label.
• Maximum likelihood training (Learning): To
train models to fit empirical training data.
HMM: Observation Likelihood
• Given a sequence of observations, O, and a model
with a set of parameters, λ, what is the probability
that this observation sequence was generated by this
model: P(O | λ)?
• Allows HMM to be used as a language model: A
formal probabilistic model of a language that
assigns a probability to each string saying how
likely that string was to have been generated by
the language.
• Useful for two tasks:
– Sequence Classification
– Most Likely Sequence
Sequence Classification
• Assume an HMM is available for each category
(i.e. language).
• What is the most likely category for a given
observation sequence, i.e. which category’s HMM
is most likely to have generated it?
• Used in speech recognition to find the most likely
word model to have generated a given sound or
phoneme sequence.
[Figure: an observation sequence O = "ah s t e n" scored by two word models; is P(O | Austin) > P(O | Boston)?]
Most Likely Sequence
• Of two or more possible sequences, which
one was most likely generated by a given
model?
• Used to score alternative word sequence
interpretations in speech recognition.
[Figure: an observation sequence O1 scored against alternative word-sequence interpretations such as "dice precedent core".]
Forward Step
[Figure: a trellis of states s1 … sN over time steps t1 … tT, with start state s0 and final state sF; forward probabilities flow left to right.]
• Continue forward in time until reaching the final time
point and sum the probability of ending in the final state.
Computing the Forward Probabilities
observation probability of observing observation
• Initialization at time t=1 in state j
• Termination
the total probability of the observation the forward probability of being in state i
sequence given the HMM at the final time step T. 44
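The three steps above can be sketched directly. The two-state parameters below are made up for illustration (states and probabilities are assumptions, not from the slides).

```python
# Hypothetical two-state HMM.
states = ["Hot", "Cold"]
initial = {"Hot": 0.6, "Cold": 0.4}                 # P(Y_1 = j)
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},          # a_ij
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4},      # b_j(x)
        "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

def forward(observations):
    """Return P(O | lambda) via the forward recursion."""
    # Initialization: alpha_1(j) = P(Y_1 = j) * b_j(x_1)
    alpha = {j: initial[j] * emit[j][observations[0]] for j in states}
    # Recursion over the remaining observations
    for x in observations[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states)
                    * emit[j][x] for j in states}
    # Termination: sum over final states
    return sum(alpha.values())

print(forward(["3", "1", "3"]))
```

A quick sanity check: summing P(O | λ) over all length-1 observation sequences gives 1, as it must for a proper distribution.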
Most Likely State Sequence (Decoding)
• Given an observation sequence, O, and a model, λ,
what is the most likely state sequence, Q = q1, q2, …, qT,
that generated this sequence from this model?
• Used for sequence labeling: assuming each state
corresponds to a tag, it determines the globally best
assignment of tags to all tokens in a sequence using a
principled approach grounded in probability theory.
Learning and Inference Summary
For discrete variables: [Table: marginal inference and the partition function are computed with the forward–backward algorithm; the most likely assignment (decoding) is computed with the Viterbi algorithm.]
Forward-Backward Algorithm: Finds Marginals
[Figure series: forward probabilities α and backward probabilities β computed over the chain Y1, Y2, Y3; combining them yields the marginal probability of each tag at each position.]
Viterbi Scores
• Recursively compute the probability of the most
likely subsequence of states that accounts for the
first t observations and ends in state sj.
Computing the Viterbi Scores
• Initialization: v1(j) = P(Y1 = sj) bj(x1)
• Recursion: vt(j) = maxi=1..N vt-1(i) aij bj(xt), keeping a backpointer to the maximizing predecessor i
• Termination: the score of the best path is maxi vT(i); the most likely state sequence is recovered by following backpointers
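Viterbi decoding follows the same shape as the forward algorithm, with max in place of sum plus backpointers. The two-state parameters below are the same made-up illustration used for the forward sketch.

```python
# Hypothetical two-state HMM.
states = ["Hot", "Cold"]
initial = {"Hot": 0.6, "Cold": 0.4}                 # P(Y_1 = j)
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},          # a_ij
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4},      # b_j(x)
        "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    # Initialization: v_1(j) = P(Y_1 = j) * b_j(x_1)
    v = {j: initial[j] * emit[j][observations[0]] for j in states}
    backpointers = []
    # Recursion: keep the best-scoring predecessor for each state
    for x in observations[1:]:
        bp = {j: max(states, key=lambda i: v[i] * trans[i][j])
              for j in states}
        v = {j: v[bp[j]] * trans[bp[j]][j] * emit[j][x] for j in states}
        backpointers.append(bp)
    # Termination: best final state, then follow backpointers
    best = max(states, key=lambda j: v[j])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["3", "1", "3"]))  # ['Hot', 'Hot', 'Hot']
```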
Example
[Worked example over several slides: the Viterbi trellis is filled in column by column; e.g. one cell is computed as max(0.0775 * 0.12 * 0.17, 0, 0, 0, 0, 0) ≈ 0.0016.]
Refer to the link below for a worked HMM vs. CRF example:
https://www.baeldung.com/cs/hidden-markov-vs-crf
LINEAR-CHAIN CRFS
Shortcomings of
Hidden Markov Models
[Figure: an HMM with a start state, hidden states Y1 … Yn, and observations X1 … Xn.]
• HMM models capture dependencies between each state and only its
corresponding observation
– NLP example: in a sentence segmentation task, each segmental state may depend not just
on a single word (and the adjacent segmental states), but also on the (non-local) features
of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective
function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction
task, we need the conditional probability P(Y|X)
Conditional Random Field (CRF)
Conditional distribution over the tags given the words w1 … wn.
The factors ψ and the normalizer Z are now specific to the sentence w.
p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * …)
[Figure: the factor-graph chain <START> ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n, with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 linking each tag to the words "time flies like an arrow".]
Conditional Random Field (CRF)
Shaded nodes in a graphical model are observed.
[Figure: the same chain with tags Y1 … Y5 and shaded (observed) words X1 … X5.]
Conditional Random Field (CRF)
This linear-chain CRF is just like an HMM, except that its
factors are not necessarily probability distributions.
[Figure: the chain <START> ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5 with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 over X1 … X5.]
Conditional Random Field (CRF)
• The whole sentence is treated as one observed vector X
• Because it's observed, we can condition on it for free
• Conditioning is how we converted from the MRF to the CRF
(i.e. when taking a slice of the emission factors)
[Figure: the tag chain Y1 … Y5 with a single observed node X for the entire sentence.]
Conditional Random Field (CRF)
p(y | x) = (1/Z(x)) ∏t=1..T ψt(yt-1, yt, x), where Z(x) = Σy′ ∏t=1..T ψt(y′t-1, y′t, x)
Conditional Random Field (CRF)
• This is the standard linear-chain CRF definition
• It permits rich, overlapping features of the vector X
[Figure: the linear-chain factor graph <START> ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5 over the observed X.]
[Figure: a more general CRF over "time flies like an arrow": unary factors on each tag, pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4} on adjacent tags, and a higher-order factor ψ{3,6,7} connecting non-adjacent variables, including an additional variable Y6.]
Standard CRF Parameterization
[Figure: predicted variables (tags) vs. observed variables (words) in the factor graph.]
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
ψt(yt-1, yt, x) = exp( Σk θk fk(yt-1, yt, x, t) )
[Figure: the chain n ψ2 v ψ4 p ψ6 d ψ8 n with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 over "time flies like an arrow".]
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
[Figure: the same chain with additional transition factors ψ10, ψ11, ψ12, ψ13 attached between adjacent tag pairs.]
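The CRF parameterization above can be sketched with a toy linear-chain model: each potential is the exponential of weighted feature functions of (yt-1, yt, x, t), and the conditional probability normalizes over all tag sequences. The tags, feature names, and weights below are made up for illustration; Z(x) is computed by brute-force enumeration, which is only feasible for tiny examples (a real implementation would use forward-backward).

```python
import itertools
import math

tags = ["n", "v", "d"]  # hypothetical tiny tagset

def features(prev_tag, tag, words, t):
    """Feature functions f_k(y_{t-1}, y_t, x, t); note they may look at
    the whole observed sentence, not just the current word."""
    return {
        f"word={words[t]},tag={tag}": 1.0,
        f"prev={prev_tag},tag={tag}": 1.0,
        f"first_word_capitalized,tag={tag}": float(words[0][0].isupper()),
    }

# Made-up weights theta_k; unlisted features have weight 0.
weights = {"word=dog,tag=n": 2.0, "prev=d,tag=n": 1.5,
           "word=the,tag=d": 2.0, "prev=<START>,tag=d": 0.5}

def score(tag_seq, words):
    """Log of the product of potentials psi_t = exp(theta . f)."""
    total, prev = 0.0, "<START>"
    for t, tag in enumerate(tag_seq):
        feats = features(prev, tag, words, t)
        total += sum(weights.get(name, 0.0) * v for name, v in feats.items())
        prev = tag
    return total

def conditional_prob(tag_seq, words):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z by brute force."""
    z = sum(math.exp(score(seq, words))
            for seq in itertools.product(tags, repeat=len(words)))
    return math.exp(score(tag_seq, words)) / z

sentence = ["the", "dog"]
print(round(conditional_prob(("d", "n"), sentence), 3))
```

Because the weights favor the/d and dog/n (and the d→n transition), the tagging ("d", "n") receives most of the probability mass for this toy sentence.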