Hidden Markov Model
Jia Li
Department of Statistics
The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/~jiali
Model Setup
▶ Suppose we have sequential data u = {u_1, u_2, ..., u_t, ..., u_T}, u_t ∈ R^d.
▶ As in the mixture model, every u_t, t = 1, ..., T, is generated by a hidden state s_t.
▶ Transition probabilities:
  a_{k,l} = P(s_{t+1} = l | s_t = k), k, l = 1, ..., M.
▶ Initial probabilities: π_k = P(s_1 = k), k = 1, ..., M.
▶ Emission densities: b_k(u), the density of u_t given s_t = k; here, the Gaussian density with mean µ_k and covariance Σ_k.
▶ In summary:
  P(u, s) = P(s) P(u | s)
          = π_{s_1} b_{s_1}(u_1) a_{s_1,s_2} b_{s_2}(u_2) ··· a_{s_{T−1},s_T} b_{s_T}(u_T).
  By the total probability formula,
  P(u) = Σ_s P(s) P(u | s)
       = Σ_s π_{s_1} b_{s_1}(u_1) a_{s_1,s_2} b_{s_2}(u_2) ··· a_{s_{T−1},s_T} b_{s_T}(u_T).
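To make the factorization concrete, here is a minimal Python sketch; all parameter values and the data are made up for illustration. It evaluates P(u, s) for one state path of a two-state HMM with 1-d Gaussian emissions, then checks the total probability formula by brute-force summation over all M^T paths.

```python
import numpy as np
from itertools import product

# Hypothetical two-state HMM with 1-d Gaussian emissions (illustrative values).
pi = np.array([0.6, 0.4])            # initial probabilities pi_k
A = np.array([[0.9, 0.1],            # transition probabilities a_{k,l}
              [0.2, 0.8]])
mu = np.array([0.0, 3.0])            # emission means mu_k
sigma = np.array([1.0, 1.0])         # emission standard deviations

def b(k, x):
    """Emission density b_k(x): Gaussian with mean mu_k, variance sigma_k^2."""
    return np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / (sigma[k] * np.sqrt(2 * np.pi))

def joint_prob(s, u):
    """P(u, s) = pi_{s_1} b_{s_1}(u_1) a_{s_1,s_2} b_{s_2}(u_2) ... a_{s_{T-1},s_T} b_{s_T}(u_T)."""
    p = pi[s[0]] * b(s[0], u[0])
    for t in range(1, len(u)):
        p *= A[s[t - 1], s[t]] * b(s[t], u[t])
    return p

u = np.array([0.1, 0.3, 2.8, 3.2])   # observed sequence (T = 4)
print(joint_prob((0, 0, 1, 1), u))   # P(u, s) for one candidate path

# P(u): sum P(u, s) over all 2^4 state paths (feasible only for tiny T).
print(sum(joint_prob(s, u) for s in product(range(2), repeat=len(u))))
```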
Example
▶ Suppose we have a video sequence and would like to automatically decide whether a speaker is in a frame.
▶ Two underlying states: with a speaker (state 1) vs. without a speaker (state 2).
▶ From frame 1 to T, let s_t, t = 1, ..., T, denote whether there is a speaker in the frame.
▶ It does not seem appropriate to assume that the s_t's are independent. We may assume the state sequence follows a Markov chain.
▶ If one frame contains a speaker, it is highly likely that the next frame also contains a speaker because of the strong frame-to-frame dependence. On the other hand, a frame without a speaker is much more likely to be followed by another frame without a speaker.
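A quick simulation makes the point; the transition matrix below is made up, with large diagonal entries encoding the strong frame-to-frame dependence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative transition matrix: state 0 = "speaker in frame",
# state 1 = "no speaker".  Values are made up for the sketch.
A = np.array([[0.95, 0.05],
              [0.02, 0.98]])

s = [0]
for _ in range(99):                        # simulate 100 frames
    s.append(rng.choice(2, p=A[s[-1]]))
print("".join(map(str, s)))                # long runs of 0s and 1s, as expected
```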
Model Estimation
▶ Parameters involved:
  ▶ Transition probabilities: a_{k,l}, k, l = 1, ..., M.
  ▶ Initial probabilities: π_k, k = 1, ..., M.
  ▶ For each state k, the emission parameters µ_k, Σ_k.
Definitions
▶ Under a given set of parameters, let L_k(t) be the conditional probability of being in state k at position t given the entire observed sequence u = {u_1, u_2, ..., u_T}:
  L_k(t) = P(s_t = k | u) = Σ_s P(s | u) I(s_t = k).
▶ Under a given set of parameters, let H_{k,l}(t) be the conditional probability of being in state k at position t and in state l at position t + 1, i.e., of seeing a transition from k to l at t, given the entire observed sequence u:
  H_{k,l}(t) = P(s_t = k, s_{t+1} = l | u)
             = Σ_s P(s | u) I(s_t = k) I(s_{t+1} = l).
▶ Note that L_k(t) = Σ_{l=1}^{M} H_{k,l}(t) and Σ_{k=1}^{M} L_k(t) = 1.
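These definitions can be checked directly by enumerating all state paths for a tiny example. The sketch below (made-up parameters, same two-state Gaussian HMM style as before) computes L_k(t) = Σ_s P(s | u) I(s_t = k) by brute force; the forward-backward algorithm introduced later computes the same quantities efficiently.

```python
import numpy as np
from itertools import product

# Made-up two-state Gaussian HMM and a tiny observed sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
mu, sigma = np.array([0.0, 3.0]), np.array([1.0, 1.0])
u = np.array([0.1, 2.8, 3.2])
T, M = len(u), 2

def bdens(k, x):
    """Gaussian emission density b_k(x)."""
    return np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / (sigma[k] * np.sqrt(2 * np.pi))

def joint(s):
    """P(u, s), factored as on the earlier slide."""
    p = pi[s[0]] * bdens(s[0], u[0])
    for t in range(1, T):
        p *= A[s[t - 1], s[t]] * bdens(s[t], u[t])
    return p

paths = list(product(range(M), repeat=T))
Pu = sum(joint(s) for s in paths)
L = np.array([[sum(joint(s) for s in paths if s[t] == k) / Pu
               for k in range(M)]
              for t in range(T)])
print(L)                 # L[t, k] = L_k(t)
print(L.sum(axis=1))     # each row sums to 1, as noted above
```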
▶ The re-estimation formulas:
  Σ_k = [Σ_{t=1}^{T} L_k(t) (u_t − µ_k)(u_t − µ_k)^t] / [Σ_{t=1}^{T} L_k(t)],
  a_{k,l} = [Σ_{t=1}^{T−1} H_{k,l}(t)] / [Σ_{t=1}^{T−1} L_k(t)].
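In code, once L_k(t) and H_{k,l}(t) are available from the E-step (e.g., via the forward-backward algorithm below), the updates are a few lines of numpy. This is a sketch assuming `L` is a T×M array and `H` a (T−1)×M×M array; the mean and initial-probability updates, not shown on this slide, follow the standard Baum-Welch pattern and are included for completeness.

```python
import numpy as np

def m_step(u, L, H):
    """Re-estimation formulas.
    u: (T, d) observations; L: (T, M) with L[t, k] = L_k(t);
    H: (T-1, M, M) with H[t, k, l] = H_{k,l}(t)."""
    T, M = L.shape
    w = L.sum(axis=0)                                # sum_t L_k(t)
    mu = (L.T @ u) / w[:, None]                      # standard mean update
    Sigma = np.empty((M, u.shape[1], u.shape[1]))
    for k in range(M):
        diff = u - mu[k]                             # (T, d) residuals
        Sigma[k] = (L[:, k, None] * diff).T @ diff / w[k]
    A = H.sum(axis=0) / L[:-1].sum(axis=0)[:, None]  # a_{k,l} update
    pi = L[0]                                        # pi_k = L_k(1)
    return pi, A, mu, Sigma
```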
▶ Compare with the mixture model, where the posterior p_{t,k} plays the role of L_k(t):
  Σ_k = [Σ_{t=1}^{T} p_{t,k} (u_t − µ_k)(u_t − µ_k)^t] / [Σ_{t=1}^{T} p_{t,k}].
Derivation from EM
▶ The complete-data likelihood f(x | θ′), with x = (u, s), is
  f(x | θ′) = P(s | θ′) P(u | s, θ′)
            = P(s | a′_{k,l} : k, l = 1, ..., M) P(u | s, µ′_k, Σ′_k : k = 1, ..., M)
            = π′_{s_1} Π_{t=2}^{T} a′_{s_{t−1},s_t} × Π_{t=1}^{T} P(u_t | µ′_{s_t}, Σ′_{s_t}).
  We then have
  log f(x | θ′) = log π′_{s_1} + Σ_{t=2}^{T} log a′_{s_{t−1},s_t} + Σ_{t=1}^{T} log P(u_t | µ′_{s_t}, Σ′_{s_t}).   (1)
▶ Taking the conditional expectation given u under the current parameters θ:
  E[log f(x | θ′) | u, θ]
  = Σ_s P(s | u, θ) [ log π′_{s_1} + Σ_{t=2}^{T} log a′_{s_{t−1},s_t} + Σ_{t=1}^{T} log P(u_t | µ′_{s_t}, Σ′_{s_t}) ]
  = Σ_{k=1}^{M} L_k(1) log π′_k + Σ_{t=1}^{T−1} Σ_{k=1}^{M} Σ_{l=1}^{M} H_{k,l}(t) log a′_{k,l}
    + Σ_{t=1}^{T} Σ_{k=1}^{M} L_k(t) log P(u_t | µ′_k, Σ′_k).
▶ For the middle term,
  Σ_s P(s | u, θ) Σ_{t=2}^{T} log a′_{s_{t−1},s_t}
  = Σ_s P(s | u, θ) Σ_{t=2}^{T} Σ_{k=1}^{M} Σ_{l=1}^{M} I(s_{t−1} = k) I(s_t = l) log a′_{k,l}
  = Σ_s Σ_{t=2}^{T} Σ_{k=1}^{M} Σ_{l=1}^{M} P(s | u, θ) I(s_{t−1} = k) I(s_t = l) log a′_{k,l}
  = Σ_{t=2}^{T} Σ_{k=1}^{M} Σ_{l=1}^{M} [ Σ_s P(s | u, θ) I(s_{t−1} = k) I(s_t = l) ] log a′_{k,l}
  = Σ_{t=2}^{T} Σ_{k=1}^{M} Σ_{l=1}^{M} H_{k,l}(t − 1) log a′_{k,l}
  = Σ_{t=1}^{T−1} Σ_{k=1}^{M} Σ_{l=1}^{M} H_{k,l}(t) log a′_{k,l}.
  The other two terms are handled similarly using L_k(t).
Forward-Backward Algorithm
▶ Define the forward probability α_k(t) = P(u_1, u_2, ..., u_t, s_t = k). It can be computed recursively:
  α_k(1) = π_k b_k(u_1),  1 ≤ k ≤ M,
  α_k(t) = b_k(u_t) Σ_{l=1}^{M} α_l(t − 1) a_{l,k},  1 < t ≤ T, 1 ≤ k ≤ M.
▶ Proof:
  α_k(t) = P(u_1, u_2, ..., u_t, s_t = k)
         = Σ_{l=1}^{M} P(u_1, u_2, ..., u_t, s_t = k, s_{t−1} = l)
         = Σ_{l=1}^{M} P(u_1, ..., u_{t−1}, s_{t−1} = l) · P(u_t, s_t = k | s_{t−1} = l, u_1, ..., u_{t−1})
         = Σ_{l=1}^{M} α_l(t − 1) P(u_t, s_t = k | s_{t−1} = l)
         = Σ_{l=1}^{M} α_l(t − 1) P(u_t | s_t = k, s_{t−1} = l) · P(s_t = k | s_{t−1} = l)
         = Σ_{l=1}^{M} α_l(t − 1) P(u_t | s_t = k) P(s_t = k | s_{t−1} = l)
         = Σ_{l=1}^{M} α_l(t − 1) b_k(u_t) a_{l,k}.
▶ Define the backward probability β_k(t) = P(u_{t+1}, ..., u_T | s_t = k). It satisfies
  β_k(T) = 1,
  β_k(t) = Σ_{l=1}^{M} a_{k,l} b_l(u_{t+1}) β_l(t + 1),  1 ≤ t < T.
▶ Hence
  L_k(t) = P(s_t = k | u) = P(u, s_t = k) / P(u) = (1 / P(u)) α_k(t) β_k(t).
▶ Similarly, for H_{k,l}(t):
  P(u, s_t = k, s_{t+1} = l)
  = P(u_1, ..., u_t, ..., u_T, s_t = k, s_{t+1} = l)
  = P(u_1, ..., u_t, s_t = k) · P(u_{t+1}, s_{t+1} = l | s_t = k, u_1, ..., u_t)
    · P(u_{t+2}, ..., u_T | s_{t+1} = l, s_t = k, u_1, ..., u_{t+1})
  = α_k(t) P(u_{t+1}, s_{t+1} = l | s_t = k) · P(u_{t+2}, ..., u_T | s_{t+1} = l)
  = α_k(t) P(s_{t+1} = l | s_t = k) · P(u_{t+1} | s_{t+1} = l, s_t = k) β_l(t + 1)
  = α_k(t) a_{k,l} P(u_{t+1} | s_{t+1} = l) β_l(t + 1)
  = α_k(t) a_{k,l} b_l(u_{t+1}) β_l(t + 1).
▶ Note that the amount of computation for L_k(t) and H_{k,l}(t), k, l = 1, ..., M, t = 1, ..., T, is of order M²T.
▶ Note:
  P(u) = Σ_{k=1}^{M} α_k(t) β_k(t), for any t.
  (Proof: P(u) = Σ_{k=1}^{M} P(u, s_t = k) = Σ_{k=1}^{M} α_k(t) β_k(t), using P(u, s_t = k) = α_k(t) β_k(t) from above.)
▶ In particular, letting t = T,
  P(u) = Σ_{k=1}^{M} α_k(T) β_k(T) = Σ_{k=1}^{M} α_k(T).
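The recursions and the identity P(u) = Σ_k α_k(t) β_k(t) are easy to verify numerically. A direct implementation follows (made-up parameters; no scaling, so it is only safe for short sequences, and practical implementations rescale α, β or work in the log domain):

```python
import numpy as np

# Made-up two-state Gaussian HMM and a short observed sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu, sigma = np.array([0.0, 3.0]), np.array([1.0, 1.0])
u = np.array([0.1, 0.3, 2.8, 3.2, 2.9])
T, M = len(u), len(pi)

# B[t, k] = b_k(u_t), the Gaussian emission densities.
B = np.exp(-0.5 * ((u[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

alpha = np.zeros((T, M))
beta = np.ones((T, M))                    # beta_k(T) = 1
alpha[0] = pi * B[0]                      # alpha_k(1) = pi_k b_k(u_1)
for t in range(1, T):                     # alpha_k(t) = b_k(u_t) sum_l alpha_l(t-1) a_{l,k}
    alpha[t] = B[t] * (alpha[t - 1] @ A)
for t in range(T - 2, -1, -1):            # beta_k(t) = sum_l a_{k,l} b_l(u_{t+1}) beta_l(t+1)
    beta[t] = A @ (B[t + 1] * beta[t + 1])

# P(u) = sum_k alpha_k(t) beta_k(t): the same number for every t.
print((alpha * beta).sum(axis=1))
```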
▶ To summarize, each iteration first computes the forward and backward probabilities under the current parameters:
  α_k(1) = π_k b_k(u_1),
  α_k(t) = b_k(u_t) Σ_{l=1}^{M} α_l(t − 1) a_{l,k},  1 < t ≤ T, 1 ≤ k ≤ M;
  β_k(T) = 1,
  β_k(t) = Σ_{l=1}^{M} a_{k,l} b_l(u_{t+1}) β_l(t + 1),  1 ≤ t < T.
▶ Then the posteriors:
  L_k(t) = (1 / P(u)) α_k(t) β_k(t),
  H_{k,l}(t) = (1 / P(u)) α_k(t) a_{k,l} b_l(u_{t+1}) β_l(t + 1).
▶ And finally the parameter updates:
  Σ_k = [Σ_{t=1}^{T} L_k(t) (u_t − µ_k)(u_t − µ_k)^t] / [Σ_{t=1}^{T} L_k(t)],
  a_{k,l} = [Σ_{t=1}^{T−1} H_{k,l}(t)] / [Σ_{t=1}^{T−1} L_k(t)].
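Putting the summary together, one full Baum-Welch iteration for a 1-d Gaussian HMM fits in a short script. As before, the parameters and data are made up and no scaling is used (for real data, run the recursions in rescaled or log form); the mean and initial-probability updates are the standard ones.

```python
import numpy as np

# Made-up starting parameters and data (1-d Gaussian HMM, two states).
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu, sigma = np.array([0.0, 3.0]), np.array([1.0, 1.0])
u = np.array([0.1, 0.3, 2.8, 3.2, 2.9, 0.2])
T, M = len(u), len(pi)

# E-step: forward-backward under the current parameters.
B = np.exp(-0.5 * ((u[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
alpha, beta = np.zeros((T, M)), np.ones((T, M))
alpha[0] = pi * B[0]
for t in range(1, T):
    alpha[t] = B[t] * (alpha[t - 1] @ A)
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[t + 1] * beta[t + 1])
Pu = alpha[-1].sum()                                  # P(u) = sum_k alpha_k(T)

L = alpha * beta / Pu                                 # L[t, k] = L_k(t)
H = alpha[:-1, :, None] * A * (B[1:] * beta[1:])[:, None, :] / Pu  # H_{k,l}(t)

# M-step: the update formulas above (plus the standard mean/pi updates).
pi = L[0]
A = H.sum(axis=0) / L[:-1].sum(axis=0)[:, None]
w = L.sum(axis=0)
mu = (L * u[:, None]).sum(axis=0) / w
sigma = np.sqrt((L * (u[:, None] - mu) ** 2).sum(axis=0) / w)
print(pi, A, mu, sigma, sep="\n")
```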
Multiple Sequences
▶ Suppose the training data consist of N sequences, with u_{i,t} denoting the observation at position t of sequence i.
▶ Compute the forward and backward probabilities α_k^{(i)}(t), β_k^{(i)}(t), k = 1, ..., M, t = 1, ..., T, i = 1, ..., N, under the current set of parameters:
  α_k^{(i)}(1) = π_k b_k(u_{i,1}),  1 ≤ k ≤ M, 1 ≤ i ≤ N,
  α_k^{(i)}(t) = b_k(u_{i,t}) Σ_{l=1}^{M} α_l^{(i)}(t − 1) a_{l,k},  1 < t ≤ T, 1 ≤ k ≤ M, 1 ≤ i ≤ N;
  β_k^{(i)}(T) = 1,  1 ≤ k ≤ M, 1 ≤ i ≤ N,
  β_k^{(i)}(t) = Σ_{l=1}^{M} a_{k,l} b_l(u_{i,t+1}) β_l^{(i)}(t + 1),  1 ≤ t < T, 1 ≤ k ≤ M, 1 ≤ i ≤ N.
▶ The updates pool over the sequences:
  Σ_k = [Σ_{i=1}^{N} Σ_{t=1}^{T} L_k^{(i)}(t) (u_{i,t} − µ_k)(u_{i,t} − µ_k)^t] / [Σ_{i=1}^{N} Σ_{t=1}^{T} L_k^{(i)}(t)],
  a_{k,l} = [Σ_{i=1}^{N} Σ_{t=1}^{T−1} H_{k,l}^{(i)}(t)] / [Σ_{i=1}^{N} Σ_{t=1}^{T−1} L_k^{(i)}(t)].
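With multiple sequences, the E-step runs independently per sequence and the M-step pools the sufficient statistics. A sketch, assuming the per-sequence posteriors have already been computed (the array layout in the docstring is a hypothetical convention):

```python
import numpy as np

def pooled_m_step(us, Ls, Hs):
    """Updates pooled over N sequences.
    us: list of (T, d) observation arrays u_i;
    Ls: list of (T, M) posterior arrays L^(i);
    Hs: list of (T-1, M, M) transition-posterior arrays H^(i)."""
    A = sum(H.sum(axis=0) for H in Hs) / sum(L[:-1].sum(axis=0) for L in Ls)[:, None]
    w = sum(L.sum(axis=0) for L in Ls)                       # pooled weights
    mu = sum(L.T @ u for L, u in zip(Ls, us)) / w[:, None]   # pooled mean update
    d = us[0].shape[1]
    Sigma = np.zeros((len(w), d, d))
    for L, u in zip(Ls, us):
        for k in range(len(w)):
            diff = u - mu[k]
            Sigma[k] += (L[:, k, None] * diff).T @ diff
    return A, mu, Sigma / w[:, None, None]
```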
▶ For discrete observations u_t ∈ {1, ..., J}, the emission probabilities q_{k,j} = P(u_t = j | s_t = k) are updated by
  q_{k,j} = [Σ_{t=1}^{T} L_k(t) I(u_t = j)] / [Σ_{t=1}^{T} L_k(t)],  k = 1, ..., M; j = 1, ..., J.
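For the discrete case the update is a weighted histogram. A sketch, assuming integer observations coded 0, ..., J−1 and posteriors `L` laid out as before:

```python
import numpy as np

def update_q(u, L, J):
    """q_{k,j} = sum_t L_k(t) I(u_t = j) / sum_t L_k(t).
    u: (T,) integer observations in {0, ..., J-1}; L: (T, M) posteriors."""
    M = L.shape[1]
    q = np.zeros((M, J))
    for j in range(J):
        q[:, j] = L[u == j].sum(axis=0)    # sum of L_k(t) over t with u_t = j
    return q / L.sum(axis=0)[:, None]
```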
Viterbi Algorithm
▶ In many applications of HMMs, we need to predict the state sequence s = {s_1, ..., s_T} based on the observed data u = {u_1, ..., u_T}.
▶ Optimization criterion: find s that maximizes P(s | u):
  s* = arg max_s P(s | u) = arg max_s P(s, u) / P(u) = arg max_s P(s, u).
▶ For fixed s_t = k, log P(s, u) splits into two functions, G_{t,k}(s_1, ..., s_{t−1}) + Ḡ_{t,k}(s_{t+1}, ..., s_T). The first function involves only states before t, and the second only states after t.
▶ Also note the recursion of G_{t,k}(s_1, ..., s_{t−1}):
  G_{t,k}(s_1, ..., s_{t−1}) = G_{t−1,s_{t−1}}(s_1, ..., s_{t−2}) + log a_{s_{t−1},k} + log b_k(u_t).
Pseudocode
▶ At t = 1, for each node (state) k = 1, ..., M, record G*_{1,k} = g_1(k), where g_1(k) = log[π_k b_k(u_1)].
▶ At t = 2, for each node k = 1, ..., M, we only need to record the best preceding node. Suppose node k is best linked to node l* at t = 1; record l* and
  G*_{2,k} = max_{l=1,...,M} [G*_{1,l} + g_2(k, l)] = G*_{1,l*} + g_2(k, l*),
  where g_t(k, l) = log[a_{l,k} b_k(u_t)].
▶ In general, at step t, G*_{t,k} is the sum of weights of the best path up to t with the end tied at state k, and l* is the best preceding state. Record l* and G*_{t,k}.
▶ At the end, only M paths are formed, each ending with a different state at t = T. The objective function for a path ending at node k is G*_{T,k}. Pick k* that maximizes G*_{T,k}, then trace the path backwards from the last state k*.
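A direct log-domain implementation of the pseudocode; the `G` array holds G*_{t,k} and `back` records the best preceding state l*:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most probable state path.  log_B[t, k] = log b_k(u_t)."""
    T, M = log_B.shape
    G = np.zeros((T, M))                   # G[t, k] = G*_{t,k}
    back = np.zeros((T, M), dtype=int)     # best preceding state l*
    G[0] = log_pi + log_B[0]               # g_1(k)
    for t in range(1, T):
        scores = G[t - 1][:, None] + log_A # scores[l, k] = G*_{t-1,l} + log a_{l,k}
        back[t] = scores.argmax(axis=0)
        G[t] = scores.max(axis=0) + log_B[t]
    path = [int(G[-1].argmax())]           # pick k* maximizing G*_{T,k}
    for t in range(T - 1, 0, -1):          # trace the path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The edge weight g_t(k, l) of the pseudocode appears here split as log_A[l, k] plus log_B[t, k].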
▶ Let G*_{t,k} = max_{s_1,...,s_{t−1}} G_{t,k}(s_1, ..., s_{t−1}).
▶ Let s̄*(t, k) be the sequence {s_{t+1}, ..., s_T} that maximizes Ḡ_{t,k}(s_{t+1}, ..., s_T).
▶ G_{t,k}(s*(t, k)) and s*(t, k) can be obtained recursively for t = 1, ..., T.
Viterbi Training
▶ In Viterbi training, the posteriors L_k(t) and H_{k,l}(t) of Baum-Welch are replaced by hard 0/1 assignments along the single best state sequence s* found by the Viterbi algorithm under the current parameters; the update formulas then reduce to averages over the positions decoded into each state.
Applications
Speech recognition:
▶ Goal: identify the words spoken from speech signals.
  ▶ Automatic voice recognition systems used by airline companies
  ▶ Automatic stock price reporting
▶ Raw data: voice amplitude sampled at discrete time spots (a time sequence).
▶ Input data: speech feature vectors computed at the sampling times.
▶ Methodology:
  ▶ Estimate a hidden Markov model (HMM) for each word, e.g., State College, San Francisco, Pittsburgh. The training provides a dictionary of models {W_1, W_2, ...}.
  ▶ For a new word, find the HMM that yields the maximum likelihood. Denote the sequence of feature vectors extracted from the voice signal by u = {u_1, ..., u_T}. Classify to word i* if W_{i*} maximizes P(u | W_i).
  ▶ Recall that P(u) = Σ_{k=1}^{M} α_k(T), where the α_k(T) are the forward probabilities at t = T, computed using the parameters specified by W_i.
▶ In the above example, the HMM is used for "profiling". Similar ideas have been applied to genomic sequence analysis, e.g., profiling families of protein sequences by HMMs.
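A sketch of the decision rule: a scaled forward pass gives log P(u | W) for each word model, and we pick the maximizer. The model container `W` (a dict of 1-d Gaussian HMM parameters) is a hypothetical layout chosen for illustration.

```python
import numpy as np

def log_likelihood(u, W):
    """log P(u | W) via a scaled forward pass (avoids underflow for long u).
    W: dict with keys 'pi', 'A', 'mu', 'sigma' for a 1-d Gaussian HMM."""
    pi, A, mu, sigma = W["pi"], W["A"], W["mu"], W["sigma"]
    u = np.asarray(u)
    B = np.exp(-0.5 * ((u[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    alpha, ll = pi * B[0], 0.0
    for t in range(len(u)):
        if t > 0:
            alpha = B[t] * (alpha @ A)
        c = alpha.sum()          # scaling constant; log P(u) = sum of log c's
        ll += np.log(c)
        alpha /= c
    return ll

def classify(u, models):
    """Classify to word i* whose HMM W_{i*} maximizes P(u | W_i)."""
    return max(models, key=lambda word: log_likelihood(u, models[word]))

# Hypothetical usage:
# models = {"State College": W1, "San Francisco": W2, "Pittsburgh": W3}
# print(classify(u, models))
```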
Supervised learning:
▶ Use image classification as an example.
▶ The image is segmented into man-made and natural regions.
▶ Details:
  ▶ Suppose we have M states, each belonging to a certain class. Use C(k) to denote the class state k belongs to. If a block is in a certain class, it can only exist in one of the states that belong to that class.
  ▶ Train the HMM using the feature vectors {u_1, u_2, ..., u_T} and their classes {z_1, z_2, ..., z_T}. There are some minor modifications to the training algorithm described before, since no class labels were involved there.
  ▶ For a test image, find the optimal sequence of states {s_1, s_2, ..., s_T} with maximum a posteriori probability (MAP) using the Viterbi algorithm.
  ▶ Map the state sequence into classes: ẑ_t = C(s*_t).
Unsupervised learning:
▶ Since a mixture model can be used for clustering, an HMM can be used for the same purpose. The difference lies in the fact that the HMM takes spatial dependence into consideration.
▶ For a given number of states, fit an HMM to the sequential data.
▶ Find the optimal state sequence s* by the Viterbi algorithm.
▶ Each state represents a cluster.
▶ Examples: image segmentation, etc.