Fall 2017
[Trellis diagram: hidden states s_0 → s_1 → ... → s_5 linked by transition probabilities A(s_{t-1}, s_t); each state s_t emits observation o_t with emission probability B(s_t, o_t).]
Here, we are being a little sloppy in terms of notation by assuming that there is a starting state
s_0, just so that the indexing is valid. In practice, such a state s_0 is sometimes called the starting
state, and one can assign a probability distribution to it as well.
An HMM goes further than Equation 1.1 by making the assumption that
Note that the Markov assumption generally does not hold. In our example, sequences of words
usually have dependencies that reach back further than a single step.
4. Each observation o_t only depends on the immediate hidden state s_t. Formally
P(o_t | o_{<t}, s_{≤t}) = P(o_t | s_t)   (1.4)
This is called the independence assumption.
Under the Markov assumption and the independence assumption, an HMM only needs two sets of
parameters to model P(s_t | s_{t-1}) and P(o_t | s_t). We call these, respectively, the transition probabilities
and the emission probabilities. These probabilities are denoted by matrices A and B, each row of
which is a valid probability distribution, i.e. non-negative numbers that sum to 1. Table 1 summarizes
the components of a general HMM.
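As a concrete illustration, the two parameter sets can be written down as a pair of row-stochastic matrices. The sketch below uses a toy two-state model (the particular states, observations, and numbers are assumptions for illustration, not taken from the text):

```python
import numpy as np

# Toy HMM parameters, illustrating the two parameter sets described above.
states = ["Rainy", "Sunny"]               # hidden states s
observations = ["walk", "shop", "clean"]  # possible observations o

# Transition probabilities: A[s', s] = P(s_t = s | s_{t-1} = s')
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission probabilities: B[s, o] = P(o_t = o | s_t = s)
B = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])

# Each row must be a valid probability distribution:
# non-negative numbers that sum to 1.
assert (A >= 0).all() and (B >= 0).all()
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Row s' of A is the distribution over next states given the current state s', and row s of B is the distribution over observations emitted from state s.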
(3) Estimating the Parameters. Given a sequence of observations O, estimate the transition and
emission probabilities that are most likely to have generated O. That is, find
A*, B* = argmax_{A,B} P(O | A, B)   (1.6)
If we know f[t, s], we can compute P(O | A, B) as follows
P(o_1, o_2, ..., o_t | A, B) = Σ_s P(o_1, ..., o_{t-1}, s_t = s | A, B) · P(o_t | s_t = s)   (2.2)
                             = Σ_s f[t, s] · B(s, o_t)   (2.3)
Putting together Equation 2.3 and Equation 2.6, we have the so-called Forward-Pass algorithm to
compute the likelihood in an HMM.
(1) Initialize:
• For each hidden state s:
(2) For t = 2 to T :
(3) Finally:
P(o_1, o_2, ..., o_T | A, B) ← Σ_s f[T, s] · B(s, o_T)   (2.9)
Its complexity is O(N 2 · T ) where N is the size of the hidden state space.
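The Forward-Pass algorithm can be sketched in a few lines. Note a small convention difference: the sketch below uses the common textbook variable f[t, s] = P(o_1, ..., o_t, s_t = s | A, B), which folds each emission into the recursion and sums the final column, whereas the notes factor the last emission out; the computed likelihood is the same either way. The starting distribution pi and the toy parameters are assumptions for illustration:

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Return P(O | A, B) for a sequence `obs` of observation indices.

    pi[s] is the starting distribution over hidden states, A[s', s] the
    transition probabilities, B[s, o] the emission probabilities.
    """
    N = A.shape[0]                      # size of the hidden state space
    T = len(obs)
    f = np.zeros((T, N))
    f[0] = pi * B[:, obs[0]]            # initialize with the first emission
    for t in range(1, T):               # each step is O(N^2), so O(N^2 * T) total
        # f[t, s] = sum_{s'} f[t-1, s'] * A(s', s), times the emission B(s, o_t)
        f[t] = (f[t - 1] @ A) * B[:, obs[t]]
    return f[-1].sum()

# Tiny usage example with assumed toy parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
likelihood = forward_likelihood([0, 1, 0], pi, A, B)
```

For short sequences one can check the result against brute-force enumeration of all N^T hidden state sequences, which is exactly the exponential sum the dynamic program avoids.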
We want to find the sequence S that maximizes the quantity above, that is
It turns out that a dynamic programming algorithm that is almost identical to the
Forward-Pass algorithm in Section 2 is applicable to the task of finding S in Equation 3.4. This is
the Viterbi algorithm [Viterbi, 1967].
We define
g[t, s] := max_{s_1, ..., s_{t-1}} P(o_1, o_2, ..., o_t, s_t = s | A, B)   (3.5)
The key observation is that if we know all values of g[t, s], then we can compute the maximum joint
probability P (O, S|A, B) as follows
max_{s_1, ..., s_t} P(o_1, ..., o_t, s_1, ..., s_t | A, B) = max_s [max_{s_1, ..., s_{t-1}} P(o_1, ..., o_{t-1}, s_t = s | A, B)] · P(o_t | s_t = s)   (3.6)
Please note the similarity between Equations 3.6-3.7 and Equations 2.2-2.3, where the only difference
is the change from summation to max. The reason that the Viterbi algorithm is almost identical to
the Forward-Pass algorithm is that max has the same distributive property as summation, i.e. one can
"factor out" a common term: for c ≥ 0, max(c·x, c·y) = c·max(x, y), just as c·x + c·y = c·(x + y).
We can compute g[t, s] similarly to the way we compute f[t, s] in Equation 2.6. Specifically
g[t, s] = max_{s'} P(o_1, o_2, ..., o_{t-1}, s_{t-1} = s' | A, B) · P(s_t = s | s_{t-1} = s') · P(o_t | s_t = s)   (3.9)
        = max_{s'} g[t-1, s'] · A(s', s) · B(s, o_t)   (3.10)
Finally, note that in Equation 3.4, we wanted argmax_S P(O, S | A, B), not max_S P(O, S | A, B).
The general idea to go from max(·) to argmax(·) in a dynamic programming algorithm is to create
a trace-back array. In Equation 3.10, whenever you know which s' leads to the largest value of
g[t-1, s'] · A(s', s) · B(s, o_t), you can store that s' in an array h[t, s]. After you finish with g[T, s] as
in Equation 3.7, you can simply follow h[t, s] to trace back the sequence of hidden states that leads
to the maximum value of P(O, S | A, B), which gives you the argmax. Putting everything together, we
have the Viterbi algorithm
(1) Initialize:
• For each hidden state s:
(2) For t = 2 to T :
g[t, s] ← max_{s'} g[t-1, s'] · A(s', s) · B(s, o_t)   (3.12)
h[t, s] ← argmax_{s'} g[t-1, s'] · A(s', s) · B(s, o_t)   (3.13)
(3) Follow h[t, s] to find s*_T, s*_{T-1}, ..., s*_1, starting at t = T
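The steps above, including the trace-back array h, can be sketched as follows. As before, the starting distribution pi and the toy parameters are assumptions for illustration:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Sketch of the Viterbi algorithm with a trace-back array h.

    Returns the most likely hidden state sequence for `obs`, a list of
    observation indices. pi is an assumed starting distribution.
    """
    N, T = A.shape[0], len(obs)
    g = np.zeros((T, N))                 # g[t, s]: best joint probability ending in s
    h = np.zeros((T, N), dtype=int)      # h[t, s]: the s' that achieved the max
    g[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[s', s] = g[t-1, s'] * A(s', s) * B(s, o_t)
        scores = g[t - 1][:, None] * A * B[:, obs[t]][None, :]
        g[t] = scores.max(axis=0)        # max over predecessors s' (Eq. 3.12)
        h[t] = scores.argmax(axis=0)     # remember the argmax s' (Eq. 3.13)
    # Trace back from the best final state, as in step (3).
    path = [int(g[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(h[t, path[-1]]))
    return path[::-1]

# Usage example with assumed toy parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
best_path = viterbi([0, 0, 1], pi, A, B)   # → [0, 0, 1]
```

Replacing `max`/`argmax` with a sum recovers the Forward-Pass recursion, which is exactly the "same distributive property" observation made above.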
Intuitively, γ[t, s] counts how many times the t-th state of the hidden sequence equals s, and
ξ[t, s', s] counts how many times the pair (s', s) occurs at the (t-1)-st and the t-th steps of the
hidden sequence, both up to normalization by a partition function. That is why γ[t, s] and ξ[t, s', s]
are sometimes referred to as pseudo counts.
We now show how to compute γ[t, s] and ξ[t, s', s]. Since γ[t, s] and ξ[t, s', s] are probabilities
conditioned on O, we compute them via partial joint probabilities.
If one knows all α[t, s] and β[t, s], then γ[t, s] and ξ[t, s', s] can be computed as follows
γ[t, s] = P(s_t = s | O, A, B) = P(s_t = s, O | A, B) / P(O | A, B)   (4.5)
        = α[t, s] · β[t, s] / Σ_{s'} α[t, s'] · β[t, s']   (4.6)
ξ[t, s', s] = P(s_{t-1} = s', s_t = s | O, A, B) = P(s_{t-1} = s', s_t = s, O | A, B) / P(O | A, B)   (4.7)
            = α[t-1, s'] · A(s', s) · B(s, o_t) · β[t, s] / Σ_{s'} α[t, s'] · β[t, s']   (4.8)
It only remains to show how to compute α[t, s] and β[t, s]. Not surprisingly, they can be computed with
dynamic programming
α[t, s] = Σ_{s'} α[t-1, s'] · A(s', s) · B(s, o_t) = B(s, o_t) · Σ_{s'} α[t-1, s'] · A(s', s)   (4.9)
β[t, s] = Σ_{s'} β[t+1, s'] · A(s, s') · B(s', o_{t+1})   (4.10)
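The two recursions can be sketched together, using the conventions α[t, s] = P(o_1, ..., o_t, s_t = s | A, B) and β[t, s] = P(o_{t+1}, ..., o_T | s_t = s, A, B). The starting distribution pi and the toy numbers below are assumptions for illustration; a useful sanity check is that Σ_s α[t, s] · β[t, s] equals P(O | A, B) for every t, which is exactly the partition function in Equations 4.6 and 4.8:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Compute the alpha (Eq. 4.9) and beta (Eq. 4.10) tables."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                   # Equation 4.9
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                          # base case: nothing left to emit
    for t in range(T - 2, -1, -1):          # Equation 4.10
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

# Usage example with assumed toy parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
alpha, beta = forward_backward(obs, pi, A, B)

# Pseudo counts of Equation 4.6: gamma[t, s] = P(s_t = s | O, A, B).
gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
```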
M-step. When these pseudo counts γ and ξ are known, then in the M-step one can use the maximum
likelihood estimator to derive the updated values for A and B
Â(s', s) = (#times s follows s') / (#times anything follows s') = Σ_{t=1}^{T-1} ξ[t, s', s] / Σ_{s''} Σ_{t=1}^{T-1} ξ[t, s', s'']   (4.11)
B̂(s, o) = (#times o is observed given s) / (#times anything is observed given s) = Σ_{t=1}^{T} 1[o_t = o] · γ[t, s] / Σ_{t=1}^{T} γ[t, s]   (4.12)
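One full EM iteration — the E-step pseudo counts followed by the M-step updates of Equations 4.11-4.12 — can be sketched as below. The starting distribution pi and the toy parameters are assumptions for illustration; note that the code indexes xi[t] from 0 and stores the pair for steps (t, t+1), whereas the notes write ξ[t, s', s] for the pair at steps (t-1, t):

```python
import numpy as np

# Assumed toy parameters and observation sequence.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 0]
N, T = A.shape[0], len(obs)

# E-step: alpha (Eq. 4.9) and beta (Eq. 4.10) recursions.
alpha = np.zeros((T, N))
beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
p_O = alpha[-1].sum()                        # P(O | A, B), the partition function

gamma = alpha * beta / p_O                   # gamma[t, s], Equation 4.6
xi = np.zeros((T - 1, N, N))                 # xi[t, s', s], Equation 4.8
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_O

# M-step: maximum likelihood re-estimates.
# Equation 4.11: expected transitions out of s', normalized per row.
A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
# Equation 4.12: expected emissions of o from s, normalized per state.
B_hat = np.zeros_like(B)
for o in range(B.shape[1]):
    B_hat[:, o] = gamma[np.array(obs) == o].sum(axis=0) / gamma.sum(axis=0)
```

Both updated matrices are again row-stochastic by construction, so they remain valid HMM parameters for the next EM iteration.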
References
L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In Proceedings of the 3rd Symposium on Inequalities, University of California, Los Angeles, 1972.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.
A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 1967.