
Hidden Markov Models

David Meir Blei


November 1, 1999
What is an HMM?

• Graphical Model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states
What is an HMM?

• Green circles are hidden states
• Dependent only on the previous state
• “The past is independent of the future given the present.”
What is an HMM?

• Purple nodes are observed states
• Dependent only on their corresponding hidden state
HMM Formalism
[Figure: a chain of hidden states S, each emitting an observation K]

• $\{S, K, \Pi, A, B\}$
• $S : \{s_1 \ldots s_N\}$ are the values for the hidden states
• $K : \{k_1 \ldots k_M\}$ are the values for the observations
HMM Formalism
[Figure: hidden states S linked by transition probabilities A, emitting observations K with probabilities B]

• $\{S, K, \Pi, A, B\}$
• $\Pi = \{\pi_i\}$ are the initial state probabilities
• $A = \{a_{ij}\}$ are the state transition probabilities
• $B = \{b_{ik}\}$ are the observation (emission) probabilities
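The slides leave the parameters abstract, so here is a minimal concrete instance of the $\{S, K, \Pi, A, B\}$ formalism, sketched in Python/NumPy. The names `pi`, `A`, `B`, `O` mirror the slide notation; the numbers and the observation sequence are made up purely for illustration, and the later sketches reuse them.

```python
import numpy as np

# A toy HMM with N = 2 hidden states and M = 3 observation symbols.
pi = np.array([0.6, 0.4])        # pi[i]   = P(x_1 = i)
A = np.array([[0.7, 0.3],        # A[i, j] = a_ij = P(x_{t+1} = j | x_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # B[i, k] = b_ik = P(o_t = k | x_t = i)
              [0.1, 0.3, 0.6]])

O = np.array([0, 2, 1, 2])       # a made-up observation sequence, T = 4
```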
Inference in an HMM

• Compute the probability of a given observation sequence
• Given an observation sequence, compute the most likely hidden state sequence
• Given an observation sequence and a set of possible models, which model most closely fits the data?
Decoding

Given an observation sequence and a model, compute the probability of the observation sequence:

$O = (o_1 \ldots o_T), \qquad \mu = (A, B, \Pi)$

Compute $P(O \mid \mu)$
Decoding
$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

$P(O, X \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$

$P(O \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu)$
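To make these three quantities concrete, a small sketch that scores one hypothetical hidden path `X` against the toy model defined earlier (`pi`, `A`, `B`, `O` as above):

```python
import numpy as np

X = np.array([0, 1, 1, 0])                  # one hypothetical state path

p_O_given_X = np.prod(B[X, O])              # P(O | X, mu) = prod_t b_{x_t o_t}
p_X = pi[X[0]] * np.prod(A[X[:-1], X[1:]])  # P(X | mu) = pi_{x_1} prod_t a_{x_t x_{t+1}}
p_joint = p_O_given_X * p_X                 # P(O, X | mu)
```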
Decoding
$P(O \mid \mu) = \sum_{\{x_1 \ldots x_T\}} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}$
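Taken literally, this sum runs over all $N^T$ hidden paths, which is exponential in $T$; the forward procedure on the next slides avoids exactly this. A brute-force sketch of the enumeration (fine for the toy example, hopeless for realistic $T$):

```python
from itertools import product

import numpy as np

def likelihood_brute_force(pi, A, B, O):
    """P(O | mu) by summing P(O, X | mu) over all N^T state sequences."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for X in product(range(N), repeat=T):
        p = pi[X[0]] * B[X[0], O[0]]        # pi_{x_1} b_{x_1 o_1}
        for t in range(T - 1):
            p *= A[X[t], X[t + 1]] * B[X[t + 1], O[t + 1]]
        total += p
    return total
```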
Forward Procedure
• Special structure gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible length t+1 state sequences.
• Define: $\alpha_i(t) = P(o_1 \ldots o_t, x_t = i \mid \mu)$
Forward Procedure
$\alpha_j(t+1)$
$= P(o_1 \ldots o_{t+1}, x_{t+1} = j)$
$= P(o_1 \ldots o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)$
$= P(o_1 \ldots o_t \mid x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)$
$= P(o_1 \ldots o_t, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)$
Forward Procedure
$\alpha_j(t+1)$
$= \sum_{i=1}^{N} P(o_1 \ldots o_t, x_t = i, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1}^{N} P(o_1 \ldots o_t, x_{t+1} = j \mid x_t = i)\, P(x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1}^{N} P(o_1 \ldots o_t, x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}$
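Putting the recursion together, a possible vectorized implementation of the forward procedure, reusing the toy `pi`, `A`, `B`, `O` from above. It runs in $O(TN^2)$ time instead of the $O(N^T)$ enumeration:

```python
import numpy as np

def forward(pi, A, B, O):
    """Row t of the result holds alpha_i(t+1) in the slides' 1-based notation."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]              # alpha_i(1) = pi_i b_{i o_1}
    for t in range(T - 1):
        # alpha_j(t+1) = (sum_i alpha_i(t) a_ij) * b_{j o_{t+1}}
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha
```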
Backward Procedure
$\beta_i(T) = 1$

$\beta_i(t) = P(o_{t+1} \ldots o_T \mid x_t = i)$ (the probability of the rest of the observations given the current state)

$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)$
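A matching sketch of the backward recursion under these conventions ($\beta_i(T) = 1$, with $\beta$ covering the observations after time $t$):

```python
import numpy as np

def backward(A, B, O):
    """Row t of the result holds beta_i(t+1) in the slides' 1-based notation."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                       # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_ij b_{j o_{t+1}} beta_j(t+1)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta
```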
Decoding Solution
$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$ (forward procedure)

$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, b_{i o_1}\, \beta_i(1)$ (backward procedure)

$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$ (combination, for any t)
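All three expressions compute the same number, which makes for a handy numerical check. A quick sketch using the toy model and the `forward`/`backward` functions from the earlier sketches:

```python
import numpy as np

alpha = forward(pi, A, B, O)                  # from the forward sketch
beta = backward(A, B, O)                      # from the backward sketch

p_fwd = alpha[-1].sum()                       # sum_i alpha_i(T)
p_bwd = (pi * B[:, O[0]] * beta[0]).sum()     # sum_i pi_i b_{i o_1} beta_i(1)
p_mix = (alpha[1] * beta[1]).sum()            # sum_i alpha_i(t) beta_i(t), t = 2
assert np.allclose([p_fwd, p_bwd], p_mix)     # all three agree
```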
Best State Sequence

• Find the state sequence that best explains the observations
• Viterbi algorithm
• $\arg\max_X P(X \mid O)$
Viterbi Algorithm
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$

The probability, maximized over state sequences, of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$

Recursive computation:

$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$

$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
Viterbi Algorithm
Compute the most likely state sequence by working backwards:

$\hat{X}_T = \arg\max_i \delta_i(T)$

$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$

$P(\hat{X}) = \max_i \delta_i(T)$
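A compact sketch of the whole Viterbi pass, combining the $\delta/\psi$ recursion with the backward trace; same toy model as before, and `viterbi` is just an illustrative helper name:

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Return the most likely state path and its probability."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                  # delta_j(1) = pi_j b_{j o_1}
    for t in range(T - 1):
        scores = delta[t][:, None] * A          # scores[i, j] = delta_i(t) a_ij
        psi[t + 1] = scores.argmax(axis=0)      # best predecessor of each j
        delta[t + 1] = scores.max(axis=0) * B[:, O[t + 1]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()               # X_T = argmax_i delta_i(T)
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]       # X_t = psi_{X_{t+1}}(t+1)
    return path, delta[-1].max()                # P(X-hat) = max_i delta_i(T)
```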
Parameter Estimation
• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method
• Given a model and observation sequence, update the model parameters to better fit the observations.
Parameter Estimation
$p_t(i, j) = \dfrac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$ (probability of traversing an arc)

$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$ (probability of being in state i)
Parameter Estimation
Now we can compute the new estimates of the model parameters:

$\hat{\pi}_i = \gamma_i(1)$

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} p_t(i, j)}{\sum_{t=1}^{T-1} \gamma_i(t)}$

$\hat{b}_{ik} = \dfrac{\sum_{\{t : o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$
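One full re-estimation step, transcribing the $p_t(i,j)$, $\gamma_i(t)$, and hatted formulas above; it reuses the toy model and the `forward`/`backward` sketches. In a real trainer this step would be iterated to convergence (Baum-Welch / EM):

```python
import numpy as np

alpha, beta = forward(pi, A, B, O), backward(A, B, O)
T, N, M = len(O), A.shape[0], B.shape[1]
p_O = alpha[-1].sum()                            # P(O | mu) = sum_m alpha_m beta_m

# p[t, i, j] = alpha_i(t) a_ij b_{j o_{t+1}} beta_j(t+1) / P(O | mu)
p = np.array([alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
              for t in range(T - 1)]) / p_O
gamma = alpha * beta / p_O                       # gamma[t, i] = gamma_i(t+1)

pi_hat = gamma[0]                                # pi-hat_i = gamma_i(1)
A_hat = p.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_hat = np.array([[gamma[O == k, i].sum() for k in range(M)]
                  for i in range(N)]) / gamma.sum(axis=0)[:, None]
```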
HMM Applications

• Generating parameters for n-gram models
• Tagging speech
• Speech recognition
The Most Important Thing
We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.
