
Learning HMM parameters

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu
Oct 21st, 2014
Recall the three questions in HMMs

How likely is an HMM to have generated a given sequence?
  Forward algorithm
What is the most likely path for generating a sequence of observations?
  Viterbi algorithm
How can we learn an HMM from a set of sequences?
  Forward-backward or Baum-Welch (an EM algorithm)
Learning HMMs from data

Parameter estimation
  If we knew the state sequence, it would be easy to estimate the parameters
  But we need to work with hidden state sequences
  Use expected counts of state transitions
Reviewing the notation

States with emissions will be numbered from 1 to K
  0 is the begin state, N is the end state
x_t: observed character at position t
x = x_1 ... x_T: observed sequence
pi = pi_1 ... pi_T: hidden state sequence or path
a_kl: transition probability from state k to state l
e_k(b): emission probability, the probability of emitting symbol b from state k
Learning without hidden information

Learning is simple if we know the correct path for each sequence in our training set

[Figure: example HMM with begin state 0, emitting states 1-4, and end state 5; the observed sequence C A G T is shown with its known path 0 -> 2 -> 2 -> 4 -> 4 -> 5]

Estimate parameters by counting the number of times each parameter is used across the training set
Learning without hidden information

Transition probabilities
  $a_{kl} = \frac{n_{kl}}{\sum_{l'} n_{kl'}}$ where n_kl is the number of transitions from state k to state l in the training set
Emission probabilities
  $e_k(c) = \frac{n_{k,c}}{\sum_{c'} n_{k,c'}}$ where n_k,c is the number of times c is emitted from state k
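As an illustration (my own sketch, not code from the lecture), here is how this counting scheme might look in Python; the data encodings (integer state labels, string sequences, paths that include the begin and end states) are assumptions:

```python
from collections import defaultdict

def estimate_from_labeled_paths(seqs, paths, n_states, alphabet):
    """MLE of transition (a_kl) and emission (e_k(c)) probabilities when the
    state path for each training sequence is known."""
    n_trans = defaultdict(float)  # n_trans[k, l]: number of k -> l transitions observed
    n_emit = defaultdict(float)   # n_emit[k, c]:  number of times state k emits symbol c

    for x, pi in zip(seqs, paths):
        for k, l in zip(pi, pi[1:]):        # count every transition along the known path
            n_trans[k, l] += 1
        for c, k in zip(x, pi[1:-1]):       # emitting states sit between begin and end
            n_emit[k, c] += 1

    # normalize the counts per state to turn them into probabilities
    a = {(k, l): n_trans[k, l] / max(sum(n_trans[k, m] for m in range(n_states)), 1)
         for k in range(n_states) for l in range(n_states)}
    e = {(k, c): n_emit[k, c] / max(sum(n_emit[k, d] for d in alphabet), 1)
         for k in range(n_states) for c in alphabet}
    return a, e

# e.g. the labeled example above: sequence CAGT with path 0 -> 2 -> 2 -> 4 -> 4 -> 5
a, e = estimate_from_labeled_paths(["CAGT"], [[0, 2, 2, 4, 4, 5]], 6, "ACGT")
```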


Learning with hidden information

If we don't know the correct path for each sequence in our training set, consider all possible paths for the sequence
[Figure: the same example HMM; the observed sequence C A G T is shown with an unknown path 0 -> ? -> ? -> ? -> ? -> 5]

Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set
The Baum-Welch algorithm

Also known as the forward-backward algorithm
An Expectation Maximization (EM) algorithm
  EM is a family of algorithms for learning probabilistic models in problems that involve hidden information
Expectation: estimate the expected number of times there are transitions and emissions (using current values of parameters)
Maximization: estimate parameters given expected counts
Hidden variables are the state transitions and emission counts
Learning parameters: the Baum-Welch algorithm

Algorithm sketch:
  initialize parameters of model
  iterate until convergence
    calculate the expected number of times each transition or emission is used
    adjust the parameters to maximize the likelihood of these expected values
The expectation step

We need to know the probability of the symbol at t being produced by state k, given the entire sequence x: $P(\pi_t = k \mid x)$
We also need to know the probability of the symbols at t and t+1 being produced by states k and l respectively, given the sequence x: $P(\pi_t = k, \pi_{t+1} = l \mid x)$
Given these we can compute our expected counts for state transitions and character emissions
Computing $P(\pi_t = k \mid x)$

First we compute the probability of the entire observed sequence with the t-th symbol being generated by state k: $P(\pi_t = k, x)$
Then our quantity of interest is computed as
  $P(\pi_t = k \mid x) = \frac{P(\pi_t = k, x)}{P(x)}$
where P(x) is obtained from the forward algorithm


Computing $P(\pi_t = k, x)$

To compute $P(\pi_t = k, x)$ we need the forward and backward algorithms:
  $P(\pi_t = k, x) = \underbrace{P(x_1 \ldots x_t, \pi_t = k)}_{\text{forward algorithm: } f_k(t)} \; \underbrace{P(x_{t+1} \ldots x_T \mid \pi_t = k)}_{\text{backward algorithm: } b_k(t)}$


Computing $P(\pi_t = k \mid x)$

Using the forward and backward variables, this is computed as
  $P(\pi_t = k \mid x) = \frac{f_k(t)\, b_k(t)}{P(x)}$
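As a small sketch (my own, with assumed data structures: forward and backward values stored in dicts keyed by (state, position)), this posterior is just a lookup and a division:

```python
def state_posterior(f, b, px, k, t):
    """P(pi_t = k | x) = f_k(t) * b_k(t) / P(x).

    f, b: forward / backward values keyed by (state, position)
    px:   P(x), e.g. from the termination step of the forward algorithm
    """
    return f[k, t] * b[k, t] / px
```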
The backward algorithm

The backward algorithm gives us b_k(t), the probability of observing the rest of x, given that we're in state k after t characters:
  $b_k(t) = P(x_{t+1} \ldots x_T \mid \pi_t = k)$
[Figure: example HMM over {A, C, G, T} with begin state 0, emitting states 1-4, and end state 5; observed sequence C A G T.
Transitions: a_01 = 0.5, a_02 = 0.5, a_11 = 0.2, a_13 = 0.8, a_33 = 0.4, a_35 = 0.6, a_22 = 0.8, a_24 = 0.2, a_44 = 0.1, a_45 = 0.9.
Emissions: state 1: A 0.4, C 0.1, G 0.2, T 0.3; state 3: A 0.2, C 0.3, G 0.3, T 0.2; state 2: A 0.4, C 0.1, G 0.1, T 0.4; state 4: A 0.1, C 0.4, G 0.4, T 0.1]
Example of computing $P(\pi_3 = 4 \mid x)$

[Figure: the same example HMM as on the previous slide, with observed sequence C A G T]

  $P(\pi_3 = 4 \mid x) = \frac{P(\pi_3 = 4, x)}{P(x)} = \frac{f_4(3)\, b_4(3)}{f_5(4)}$
Steps of the backward algorithm

Initialization (t = T)
  $b_k(T) = a_{kN}$ for all k
Recursion (t = T-1 down to 1)
  $b_k(t) = \sum_l a_{kl}\, e_l(x_{t+1})\, b_l(t+1)$
Termination
  $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$
Note: the same quantity can be obtained from the forward algorithm as well, $P(x) = f_N(T)$
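A sketch of these steps in Python (my own encoding, not code from the course: transition and emission probabilities as dicts a[(k, l)] and e[(k, c)], begin state 0, and an explicit end state):

```python
def backward(x, a, e, states, end):
    """Backward algorithm: returns b with b[k, t] = P(x_{t+1..T} | pi_t = k),
    plus P(x) from the termination step.

    x:      observed string, positions 1..T
    a, e:   transition probs a[(k, l)] and emission probs e[(k, c)]
    states: the emitting states; end: index of the end state
    """
    T = len(x)
    b = {(k, t): 0.0 for k in states for t in range(1, T + 1)}
    for k in states:                       # initialization: b_k(T) = a_{k,end}
        b[k, T] = a.get((k, end), 0.0)
    for t in range(T - 1, 0, -1):          # recursion, from t = T-1 down to 1
        for k in states:
            b[k, t] = sum(a.get((k, l), 0.0) * e[l, x[t]] * b[l, t + 1]
                          for l in states)      # x[t] is the symbol at position t+1
    # termination: P(x) = sum_l a_{0l} e_l(x_1) b_l(1)
    px = sum(a.get((0, l), 0.0) * e[l, x[0]] * b[l, 1] for l in states)
    return b, px
```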
Computing $P(\pi_t = k, \pi_{t+1} = l \mid x)$

This is the probability of the symbols at t and t+1 being emitted from states k and l, given the entire sequence x:
  $P(\pi_t = k, \pi_{t+1} = l \mid x) = \frac{f_k(t)\, a_{kl}\, e_l(x_{t+1})\, b_l(t+1)}{P(x)}$
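Continuing the same sketch (same assumed data structures as above), the pair posterior is a direct transcription of this formula:

```python
def pair_posterior(f, b, a, e, x, px, k, l, t):
    """P(pi_t = k, pi_{t+1} = l | x) = f_k(t) * a_kl * e_l(x_{t+1}) * b_l(t+1) / P(x)."""
    return f[k, t] * a.get((k, l), 0.0) * e[l, x[t]] * b[l, t + 1] / px
```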
Putting it all together

Assume we are given J training instances $x^1, \ldots, x^j, \ldots, x^J$
Expectation step
  Using current parameter values, compute for each x^j
    Apply the forward and backward algorithms
    Compute
      expected number of transitions between all pairs of states
      expected number of emissions for all states
Maximization step
  Using current expected counts
    Compute the transition and emission probabilities
The expectation step: emission count

We need the expected number of times c is emitted by state k:

  $n_{k,c} = \sum_j \frac{1}{f_N^{\,j}(T)} \sum_{\{t \,\mid\, x_t^j = c\}} f_k^{\,j}(t)\, b_k^{\,j}(t)$

The outer sum is over the training sequences x^j (with $f_N^{\,j}(T) = P(x^j)$); the inner sum is over the positions t of x^j where c occurs, using the forward and backward values computed for x^j.
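A sketch of this accumulation (my own, not course code; it assumes the forward/backward values and P(x^j) for every training sequence have already been computed and are passed in):

```python
def expected_emission_counts(seqs, fwd, bwd, px, states, alphabet, pseudo=0.0):
    """n_{k,c} = pseudo + sum_j (1 / P(x^j)) * sum over positions t of x^j
    with symbol c of f_k(t) * b_k(t)."""
    n = {(k, c): pseudo for k in states for c in alphabet}
    for j, x in enumerate(seqs):
        for t, c in enumerate(x, start=1):     # t is the 1-based position, c the symbol there
            for k in states:
                n[k, c] += fwd[j][k, t] * bwd[j][k, t] / px[j]
    return n
```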


The expectation step: transition count

We need the expected number of transitions from state k to state l:

  $n_{kl} = \sum_j \frac{1}{f_N^{\,j}(T)} \sum_t f_k^{\,j}(t)\, a_{kl}\, e_l(x_{t+1}^j)\, b_l^{\,j}(t+1)$
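And a matching sketch for the transition counts (again my own encoding; for brevity it only accumulates transitions between emitting states, as in the worked example later in the slides):

```python
def expected_transition_counts(seqs, fwd, bwd, px, a, e, states, pseudo=0.0):
    """n_{kl} = pseudo + sum_j (1 / P(x^j)) * sum_t f_k(t) * a_kl * e_l(x^j_{t+1}) * b_l(t+1)."""
    n = {(k, l): pseudo for k in states for l in states}
    for j, x in enumerate(seqs):
        for t in range(1, len(x)):             # positions 1 .. T-1
            for k in states:
                for l in states:
                    n[k, l] += (fwd[j][k, t] * a.get((k, l), 0.0) *
                                e[l, x[t]] * bwd[j][l, t + 1] / px[j])
    return n
```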
The maximization step

Estimate new emission parameters by
  $e_k(c) = \frac{n_{k,c}}{\sum_{c'} n_{k,c'}}$
Estimate new transition parameters by
  $a_{kl} = \frac{n_{kl}}{\sum_{l'} n_{kl'}}$
Just like in the simple case, but typically we'll do some smoothing (e.g. add pseudocounts)
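A corresponding M-step sketch (my own; it assumes the expected counts already include any pseudocounts):

```python
def m_step(n_trans, n_emit, states, alphabet):
    """Re-estimate parameters by normalizing the expected counts per state."""
    a, e = {}, {}
    for k in states:
        total = sum(v for (i, _), v in n_trans.items() if i == k)
        for (i, l), v in n_trans.items():
            if i == k and total > 0:
                a[k, l] = v / total            # a_kl = n_kl / sum_l' n_kl'
    for k in states:
        total = sum(n_emit[k, c] for c in alphabet)
        for c in alphabet:
            e[k, c] = n_emit[k, c] / total if total > 0 else 0.0   # e_k(c)
    return a, e
```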
The Baum-Welch algorithm

initialize the parameters of the HMM
iterate until convergence
  initialize n_k,c and n_kl with pseudocounts
  E-step: for each training set sequence j = 1 to n
    calculate f_k(t) values for sequence j
    calculate b_k(t) values for sequence j
    add the contribution of sequence j to n_k,c and n_kl
  M-step: update the HMM parameters using n_k,c and n_kl
Baum-Welch algorithm example

Given
  The HMM with the parameters initialized as shown
  Two training sequences: TAG, ACG

[Figure: HMM with begin state 0, emitting states 1 and 2, and end state 3.
Transitions: a_01 = 1.0, a_11 = 0.8, a_12 = 0.2, a_22 = 0.1, a_23 = 0.9.
Emissions: state 1: A 0.4, C 0.1, G 0.1, T 0.4; state 2: A 0.1, C 0.4, G 0.4, T 0.1]

We'll work through one iteration of Baum-Welch


Baum-Welch example (cont)

Determining the forward values for TAG:
  $f_0(0) = 1$
  $f_1(1) = e_1(T)\, a_{01}\, f_0(0) = 0.4 \times 1.0 \times 1 = 0.4$
  $f_1(2) = e_1(A)\, a_{11}\, f_1(1) = 0.4 \times 0.8 \times 0.4 = 0.128$
  $f_2(2) = e_2(A)\, a_{12}\, f_1(1) = 0.1 \times 0.2 \times 0.4 = 0.008$
  $f_2(3) = e_2(G)\, [a_{12}\, f_1(2) + a_{22}\, f_2(2)] = 0.4 \times (0.0256 + 0.0008) = 0.01056$
  $f_3(3) = a_{23}\, f_2(3) = 0.9 \times 0.01056 = 0.009504$
Here we compute just the values that are needed for computing successive values. For example, there is no point in calculating f_2(1).
In a similar way, we also compute forward values for ACG.
Baum-Welch example (cont)

Determining the backward values for TAG:
  $b_2(3) = a_{23} = 0.9$
  $b_2(2) = a_{22}\, e_2(G)\, b_2(3) = 0.1 \times 0.4 \times 0.9 = 0.036$
  $b_1(2) = a_{12}\, e_2(G)\, b_2(3) = 0.2 \times 0.4 \times 0.9 = 0.072$
  $b_1(1) = a_{11}\, e_1(A)\, b_1(2) + a_{12}\, e_2(A)\, b_2(2) = 0.8 \times 0.4 \times 0.072 + 0.2 \times 0.1 \times 0.036 = 0.02376$
  $b_0(0) = a_{01}\, e_1(T)\, b_1(1) = 1.0 \times 0.4 \times 0.02376 = 0.009504$
Again, here we compute just the values that are needed.
In a similar way, we also compute backward values for ACG.
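These numbers are easy to check; the short self-contained script below (my own encoding of the example HMM, not part of the original slides) recomputes the forward and backward values for TAG:

```python
# Parameters of the example HMM: begin state 0, emitting states 1 and 2, end state 3.
a = {(0, 1): 1.0, (1, 1): 0.8, (1, 2): 0.2, (2, 2): 0.1, (2, 3): 0.9}
e = {(1, 'A'): 0.4, (1, 'C'): 0.1, (1, 'G'): 0.1, (1, 'T'): 0.4,
     (2, 'A'): 0.1, (2, 'C'): 0.4, (2, 'G'): 0.4, (2, 'T'): 0.1}
x = "TAG"

f = {}                                                           # forward values f[k, t]
f[1, 1] = e[1, x[0]] * a[0, 1]                                   # 0.4
f[1, 2] = e[1, x[1]] * a[1, 1] * f[1, 1]                         # 0.128
f[2, 2] = e[2, x[1]] * a[1, 2] * f[1, 1]                         # 0.008
f[2, 3] = e[2, x[2]] * (a[1, 2] * f[1, 2] + a[2, 2] * f[2, 2])   # 0.01056
f[3, 3] = a[2, 3] * f[2, 3]                                      # 0.009504 = P(x)

b = {}                                                           # backward values b[k, t]
b[2, 3] = a[2, 3]                                                # 0.9
b[2, 2] = a[2, 2] * e[2, x[2]] * b[2, 3]                         # 0.036
b[1, 2] = a[1, 2] * e[2, x[2]] * b[2, 3]                         # 0.072
b[1, 1] = a[1, 1] * e[1, x[1]] * b[1, 2] + a[1, 2] * e[2, x[1]] * b[2, 2]  # 0.02376
b[0, 0] = a[0, 1] * e[1, x[0]] * b[1, 1]                         # 0.009504 = P(x) again

print(f[3, 3], b[0, 0])   # both print 0.009504 (up to floating-point rounding)
```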
Baum-Welch example (cont)

Determining the expected emission counts for state 1 (the leading 1 in each count is a pseudocount):

  $n_{1,A} = 1 + \underbrace{\frac{f_1(2)\, b_1(2)}{f_3(3)}}_{\text{contribution of TAG}} + \underbrace{\frac{f_1(1)\, b_1(1)}{f_3(3)}}_{\text{contribution of ACG}}$

  $n_{1,C} = 1 + \frac{f_1(2)\, b_1(2)}{f_3(3)}$  (contribution of ACG)

  $n_{1,G} = 1$  (G only occurs at the last position, and state 1 cannot transition to the end state, so those terms are zero)

  $n_{1,T} = 1 + \frac{f_1(1)\, b_1(1)}{f_3(3)}$  (contribution of TAG)

*Note that the forward/backward values in the TAG and ACG contributions differ; in each term they are computed for the sequence that term comes from.
Baum-Welch example (cont)

Determining the expected transition counts for state 1 (not using pseudocounts):

  $n_{11} = \underbrace{\frac{f_1(1)\, a_{11}\, e_1(A)\, b_1(2)}{f_3(3)}}_{\text{contribution of TAG}} + \underbrace{\frac{f_1(1)\, a_{11}\, e_1(C)\, b_1(2)}{f_3(3)}}_{\text{contribution of ACG}}$

  $n_{12} = \underbrace{\frac{f_1(1)\, a_{12}\, e_2(A)\, b_2(2) + f_1(2)\, a_{12}\, e_2(G)\, b_2(3)}{f_3(3)}}_{\text{contribution of TAG}} + \underbrace{\frac{f_1(1)\, a_{12}\, e_2(C)\, b_2(2) + f_1(2)\, a_{12}\, e_2(G)\, b_2(3)}{f_3(3)}}_{\text{contribution of ACG}}$

In a similar way, we also determine the expected counts for state 2.


Baum-Welch example (cont)

Maximization step: determining probabilities for state 1

  $e_1(A) = \frac{n_{1,A}}{n_{1,A} + n_{1,C} + n_{1,G} + n_{1,T}}$

  $e_1(C) = \frac{n_{1,C}}{n_{1,A} + n_{1,C} + n_{1,G} + n_{1,T}}$

  $a_{11} = \frac{n_{11}}{n_{11} + n_{12}}$

  $a_{12} = \frac{n_{12}}{n_{11} + n_{12}}$
Computational complexity of HMM algorithms

Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is $O(S^2 L)$
  This assumes that the states are densely interconnected
Given M sequences of length L, the complexity of Baum-Welch on each iteration is $O(M S^2 L)$
Baum-Welch convergence

Some convergence criteria
  The likelihood of the training sequences changes little
  A fixed number of iterations is reached
Usually converges in a small number of iterations
Will converge to a local maximum (in the likelihood of the data given the model)

  $\log P(\text{sequences} \mid \theta) = \sum_j \log P(x^j \mid \theta)$
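As a tiny sketch of the first criterion (my own helper names, not from the slides), the quantity tracked across iterations is just this sum of per-sequence log probabilities:

```python
import math

def total_log_likelihood(pxs):
    """log P(sequences | theta) = sum_j log P(x^j | theta), given each P(x^j)."""
    return sum(math.log(p) for p in pxs)

def has_converged(prev_ll, new_ll, tol=1e-6):
    """Stop when the training log-likelihood improves by less than tol."""
    return new_ll - prev_ll < tol
```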


Summary

Three problems in HMMs
  Probability of an observed sequence
    Forward algorithm
  Most likely path for an observed sequence
    Viterbi algorithm
    Can be used for segmentation of an observed sequence
  Parameter estimation
    Baum-Welch
    The backward algorithm is used to compute a quantity needed to estimate the posterior of a state given the entire observed sequence
