
Algorithms in Bioinformatics

Lecture 6
Hidden Markov Model for Sequence Alignment

Outline
- Hidden Markov Model (HMM)
  - From Finite State Machine to Finite Markov Model
  - From Finite Markov Model to Hidden Markov Model
- Find the Most Probable Path for HMM
- Parameter Estimation for HMM (by EM Algorithm)
- HMM for Sequence Alignment
- Appendix

Finite Markov Chain
(Figure: a finite Markov chain with four states.)

Sequence Modeled by a Finite Markov Chain

- For instance, the sequence CA...TC is modeled as the chain of states C → A → ... → T → C.

- Similarly, (X1, ..., Xi, ...) is a sequence of probability distributions over D, modeled as the chain X1 → X2 → ... → Xn−1 → Xn.

(Figure: a general Markov chain with states 1, ..., N and state probabilities α1, ..., αN.)

For the two-state chain over {e, f}, the state distribution evolves under the transition matrix:

$[p(e), p(f)]_{t+1} = [p(e), p(f)]_t \begin{pmatrix} p(e\mid e) & p(f\mid e) \\ p(e\mid f) & p(f\mid f) \end{pmatrix}$
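As a small illustration (not from the slides, and with invented transition probabilities), repeatedly applying this update drives the state distribution to the stationary distribution of the chain:

```python
import numpy as np

# Hypothetical transition probabilities for the two-state chain {e, f}.
# Row i holds p(next state | current state i), so each row sums to 1.
P = np.array([[0.9, 0.1],   # p(e|e), p(f|e)
              [0.2, 0.8]])  # p(e|f), p(f|f)

dist = np.array([1.0, 0.0])  # start with p(e)=1, p(f)=0
for t in range(100):
    dist = dist @ P          # [p(e), p(f)]_{t+1} = [p(e), p(f)]_t * P

print(dist)  # approaches the stationary distribution (here 2/3, 1/3)
```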

(Figure: the two-state chain with states e and f, transition probabilities p(e|e), p(f|e), p(e|f), p(f|f), and emission probabilities p(A|e), p(C|e), p(T|e), p(G|e) and p(A|f), p(C|f), p(T|f), p(G|f) over the symbols A, C, T, G.)

$\begin{pmatrix} p(A\mid e) & p(C\mid e) & p(T\mid e) & p(G\mid e) \\ p(A\mid f) & p(C\mid f) & p(T\mid f) & p(G\mid f) \end{pmatrix}, \qquad p(A) = p(e)\,p(A\mid e) + p(f)\,p(A\mid f)$

HMM: An example

From Finite Markov Model to Hidden Markov Model

If we add an output (emission) to each state of the finite Markov model, it becomes a Hidden Markov Model.

(Figure: the HMM as a chain of hidden states S1 → S2 → ... → SL−1 → SL, with each state Si emitting a symbol xi.)

Notation: the sequence of output symbols (x1, ..., xL) is modeled by the emission probabilities p(Xi = b | Si = s) = e_s(b).

Why is it called a Hidden Markov Model?
Because the sequence of states is hidden!

Hidden Markov Model
(Figure: hidden states S1, ..., SL emitting symbols x1, ..., xL.)

Given the "visible" sequence x = (x1, ..., xL): how do we find the most probable (hidden) path?
The Viterbi algorithm.

(Figure: the same HMM drawn with the two hidden states e and f, their transition probabilities, and their emission probabilities over A, C, T, G.)
Given the "visible" sequence x = (x1, ..., xL): how do we estimate the parameters of the HMM?
The Baum-Welch algorithm.

(Figure: a general Markov chain with states 1, ..., N, transition probabilities p_ij, and state probabilities α1, ..., αN.)

$\alpha_i^{(t)} = \sum_j \alpha_j^{(t-1)} p_{ij}, \qquad \alpha_k = \lim_{t \to \infty} \alpha_k^{(t)}$

How many states?

(Figure: the two-state chain over e and f with its transition probabilities and its emission probabilities over A, C, T, G.)

$p(A) = p(e)\,p(A \mid e) + p(f)\,p(A \mid f)$, and in general $p(X) = p(e)\,p(X \mid e) + p(f)\,p(X \mid f)$.


1. Most Probable state path
(Figure: hidden states S1, ..., SL emitting symbols x1, ..., xL.)

First question: given an output sequence x = (x1, ..., xL), a most probable path s* = (s*1, ..., s*L) is one which maximizes p(s | x):

$s^* = (s_1^*, \ldots, s_L^*) = \arg\max_{(s_1, \ldots, s_L)} p(s_1, \ldots, s_L \mid x_1, \ldots, x_L)$


Most Probable path (cont.)

(Figure: hidden states S1, ..., SL emitting symbols x1, ..., xL.)

Since $p(s \mid x) = \frac{p(s, x)}{p(x)} \propto p(s, x)$,

we need to find the s which maximizes p(s, x).


Model 3: Hidden Markov Chain

For a given sequence ATCGCCGGGA, assume that:
- The probability of each letter occurring in the sequence does not depend on known factors.
- The probability of each letter occurring in the sequence depends on some unknown status (hidden state).

For example, suppose there are two hidden states, e and f.
- When the hidden state is e, we assume that A, C, T and G follow a probability distribution, denoted p(A|e), p(C|e), p(T|e), p(G|e) (conditional probabilities).
- When the hidden state is f, we assume that A, C, T and G follow a probability distribution, denoted p(A|f), p(C|f), p(T|f), p(G|f) (conditional probabilities).


Viterbi’s algorithm for most probable path

(Figure: hidden states s1, s2, ..., si emitting x1, x2, ..., xi.)

The task: compute $\arg\max_{(s_1, \ldots, s_L)} p(s_1, \ldots, s_L;\, x_1, \ldots, x_L)$

Idea: for i = 1, ..., L and for each state l, compute
  v_l(i) = the probability p(s1, ..., s_{i-1}, s_i = l; x1, ..., xi) of a most probable path up to position i which ends in state l.

Exercise: for i = 1, ..., L and for each state l:

$v_l(i) = e_l(x_i) \cdot \max_k \{ v_k(i-1) \cdot a_{kl} \}$


Dependence relations

$p(y \mid x)\, p(x) = p(x, y) = p(x \mid y)\, p(y)$

$p(x) = \sum_y p(x, y)$

$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x, y')}$
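As a quick numeric check of the last identity (the numbers are invented): with $p(e)=0.6$, $p(f)=0.4$, $p(A\mid e)=0.2$, $p(A\mid f)=0.5$,

$p(A) = p(e)\,p(A\mid e) + p(f)\,p(A\mid f) = 0.6\cdot 0.2 + 0.4\cdot 0.5 = 0.32, \qquad p(e \mid A) = \frac{p(A\mid e)\,p(e)}{p(A)} = \frac{0.2\cdot 0.6}{0.32} = 0.375$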


v_l(i) is the probability of a most probable path up to position i which ends in state l. Factor the joint probability:

$p(s_1, \ldots, s_{i-1}, s_i = l, x_1, \ldots, x_i) = p(s_1, \ldots, s_{i-1}, s_i = l, x_1, \ldots, x_{i-1}) \cdot p(x_i \mid s_i = l)$

$p(s_1, \ldots, s_{i-1}, s_i = l, x_1, \ldots, x_{i-1}) = p(s_1, \ldots, s_{i-1}, x_1, \ldots, x_{i-1}) \cdot p(s_i = l \mid s_1, \ldots, s_{i-1}, x_1, \ldots, x_{i-1})$

By the Markov property, $p(s_i = l \mid s_1, \ldots, s_{i-1} = k, x_1, \ldots, x_{i-1}) = p(s_i = l \mid s_{i-1} = k) = a_{kl}$, and $p(x_i \mid s_i = l) = e_l(x_i)$.

Maximizing over the previous states therefore gives

$v_l(i) = e_l(x_i) \cdot \max_k \{ v_k(i-1) \cdot a_{kl} \}$

(Figures: the HMM chain s1, ..., s_{i-1}, s_i with emissions x1, ..., x_{i-1}, x_i, illustrating the two conditional-independence facts used above.)

$p(s_i \mid s_1, \ldots, s_{i-1}, x_1, \ldots, x_{i-1}) = p(s_i \mid s_{i-1})$

$p(s_1, \ldots, s_{i-1}, x_1, \ldots, x_{i-1} \mid s_i = l, x_i) = p(s_1, \ldots, s_{i-1}, x_1, \ldots, x_{i-1} \mid s_i = l)$

Viterbi’s algorithm
(Figure: a special initial state 0, followed by hidden states s1, ..., sL emitting x1, ..., xL.)

We add a special initial state 0.

Initialization: v_0(0) = 1; v_k(0) = 0 for k > 0.

For i = 1 to L, for each state l:
  v_l(i) = e_l(x_i) · max_k { v_k(i−1) a_kl }
  ptr_i(l) = argmax_k { v_k(i−1) a_kl }   [store the previous state, for reconstructing the path]

Termination:
  Result: p(s*_1, ..., s*_L; x_1, ..., x_L) = max_k { v_k(L) }
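For concreteness, a minimal Python sketch of this algorithm (not part of the lecture): the two-state parameters below are made up, and the computation is done in log space, which the slides do not discuss, to avoid numerical underflow.

```python
import math

def viterbi(x, states, a, e, a0):
    """Most probable state path for the emitted sequence x.
    a[k][l]: transition prob k->l; e[k][b]: emission prob of symbol b in state k;
    a0[k]: transition prob from the special initial state 0 to state k.
    Works in log space to avoid numerical underflow."""
    L = len(x)
    v = [{k: math.log(a0[k]) + math.log(e[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, L):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(a[k][l]))
            ptr[i][l] = best_k
            v[i][l] = math.log(e[l][x[i]]) + v[i - 1][best_k] + math.log(a[best_k][l])
    last = max(states, key=lambda k: v[L - 1][k])
    path = [last]
    for i in range(L - 1, 0, -1):          # trace back through the stored pointers
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), v[L - 1][last]

# Hypothetical two-state HMM over {e, f}, emitting A, C, G, T (parameters invented).
states = ["e", "f"]
a0 = {"e": 0.5, "f": 0.5}
a = {"e": {"e": 0.9, "f": 0.1}, "f": {"e": 0.2, "f": 0.8}}
e = {"e": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "f": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
print(viterbi("ATCGCCGGGA", states, a, e, a0))
```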

Parameter Estimation for HMM
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

An HMM model is defined by the parameters: akl and ek(b), for all
states k,l and all symbols b.
Let θ denote the collection of these parameters.
(Figure: the transition parameter a_kl from state k to state l, and the emission parameter e_k(b) for emitting symbol b in state k.)

(Figure: the training set of sequences {X1, ..., Xj, ..., Xn}.)

Case 1: Sequences are fully known
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

We know the complete structure (state path and emitted symbols) of each sequence in the training set {X1, ..., Xn}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and all symbols b.


Case 1 (Cont)
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

For each state k, the parameters {a_kl | l = 1, ..., m} and {e_k(b) | b ∈ Σ} are estimated by

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

where A_kl denotes the number of transitions from state k to state l, and E_k(b) denotes the number of times symbol b is emitted in state k.
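A small Python sketch of this counting estimator (assuming the labeled state paths are available; the toy data are invented). In practice, pseudocounts are usually added to avoid zero probabilities:

```python
from collections import defaultdict

def estimate_parameters(training_set):
    """training_set: list of (state_path, symbols) pairs of equal length.
    Returns counting-based ML estimates of a_kl and e_k(b)."""
    A = defaultdict(lambda: defaultdict(float))  # A[k][l]: number of transitions k -> l
    E = defaultdict(lambda: defaultdict(float))  # E[k][b]: number of emissions of b in state k
    for path, seq in training_set:
        for k, l in zip(path, path[1:]):
            A[k][l] += 1
        for k, b in zip(path, seq):
            E[k][b] += 1
    a = {k: {l: n / sum(row.values()) for l, n in row.items()} for k, row in A.items()}
    e = {k: {b: n / sum(row.values()) for b, n in row.items()} for k, row in E.items()}
    return a, e

# Toy labeled data: hidden state paths over {e, f} and emitted DNA symbols (made up).
data = [("eeff", "ATCG"), ("efff", "ACGG")]
print(estimate_parameters(data))
```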


ML for Parameter Estimation (CASE 1)
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

For each sequence Xj we have:

$p(X^j \mid \theta) = \prod_i a_{s_{i-1} s_i}\, e_{s_i}(x_i^j)$


Case 1 (cont)
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

Thus, if A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set, we have:

$p(X^1, \ldots, X^n \mid \theta) = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$


Case 1 (cont)
So we need to find the a_kl's and e_k(b)'s which maximize:

$\prod_{(k,l)} a_{kl}^{A_{kl}} \; \prod_{(k,b)} [e_k(b)]^{E_k(b)}$

subject to: for all states k,

$\sum_l a_{kl} = 1 \quad \text{and} \quad \sum_b e_k(b) = 1, \qquad a_{kl},\, e_k(b) \ge 0$

Generalization to a distribution with any number k of outcomes

Let X be a random variable with k values x1, ..., xk denoting the k outcomes of iid experiments, with parameters θ = {θ1, θ2, ..., θk} (θi is the probability of xi).
Again, the data is one sequence of length n:
  Data = (x_{i1}, x_{i2}, ..., x_{in})
Then we have to maximize

$P(Data \mid \theta) = \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \cdots + n_k = n)$

subject to θ1 + θ2 + ... + θk = 1, i.e.,

$P(Data \mid \theta) = \theta_1^{n_1} \cdots \theta_{k-1}^{n_{k-1}} \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{n_k}$

Generalization to k outcomes (cont.)

By a treatment identical to the die case, the maximum is obtained when, for all i:

$\frac{n_i}{\theta_i} = \frac{n_k}{\theta_k}$

Hence the MLE is given by:

$\theta_i = \frac{n_i}{n}, \qquad i = 1, \ldots, k$
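One standard way to obtain this result, sketched with a Lagrange multiplier (not spelled out on the slides):

$\mathcal{L}(\theta, \lambda) = \sum_{i=1}^{k} n_i \log \theta_i - \lambda \left( \sum_{i=1}^{k} \theta_i - 1 \right), \qquad \frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{n_i}{\theta_i} - \lambda = 0 \;\Rightarrow\; \theta_i = \frac{n_i}{\lambda}$

Enforcing the constraint $\sum_i \theta_i = 1$ gives $\lambda = \sum_i n_i = n$, hence $\theta_i = n_i / n$.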


Fractional Exponents

Some models allow n_i's which are not integers (e.g., when we are uncertain of a die outcome and consider it "6" with 20% confidence and "5" with 80% confidence).

We can still write

$P(Data \mid \theta) = \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \cdots + n_k = n)$

and the same analysis yields:

$\theta_i = \frac{n_i}{n}, \qquad i = 1, \ldots, k$

Side comment: Sufficient Statistics
- To compute the probability of the data in the die example, we only need to record the number of times N_i the die fell on side i (namely N1, N2, ..., N6).
- We do not need to recall the entire sequence of outcomes:

$P(Data \mid \Theta) = \theta_1^{N_1} \theta_2^{N_2} \theta_3^{N_3} \theta_4^{N_4} \theta_5^{N_5} \left(1 - \sum_{i=1}^{5} \theta_i\right)^{N_6}$

- {N_i | i = 1, ..., 6} is called a sufficient statistic for multinomial sampling.


Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(Data) is a sufficient statistic if, for any two datasets D and D':
  s(D) = s(D') ⇒ P(D | θ) = P(D' | θ)

Exercise: define sufficient statistics for the HMM model.

(Figure: many datasets mapping to the same statistic.)

Case 2: State paths are unknown:
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

In this case only the values of the xi's of the input sequences are known.
This is an ML problem with "missing data".
We wish to find θ* so that p(x | θ*) = max_θ p(x | θ).
For each sequence x,
  p(x | θ) = Σ_s p(x, s | θ),
where the sum is taken over all state paths s.


Case 2: ML Parameter Estimation for HMM


Informally, the general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(X1, ..., Xn | θ') > p(X1, ..., Xn | θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization (EM) algorithm.


Baum-Welch training (EM algorithm for HMM)
The process is iterated as follows:
- Estimate A_kl and E_k(b) by considering probable paths for the training sequences, using the current values of a_kl and e_k(b).
- Use the approach from Case 1 to derive new values of the a's and e's:

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

where A_kl denotes the (expected) number of transitions from state k to state l, and E_k(b) denotes the (expected) number of times symbol b is emitted in state k.

Baum-Welch: Step 1


(Figure: hidden states s1, ..., s_{i-1}, s_i, ..., sL emitting x1, ..., x_{i-1}, x_i, ..., xL.)

Count the expected number of state transitions: for each sequence Xj, for each position i and for each pair of states k, l, compute the posterior state-transition probabilities:

P(s_{i-1} = k, s_i = l | Xj, θ)

Step 1: Computing P(si-1=k, si=l | Xj,θ)

(Figure: hidden states s1, ..., s_{i-1}, s_i, ..., sL emitting x1, ..., xL.)

$P(x_1, \ldots, x_L, s_{i-1}=k, s_i=l) = \underbrace{P(x_1, \ldots, x_{i-1}, s_{i-1}=k)}_{f_k(i-1),\ \text{via the forward algorithm}} \; a_{kl}\, e_l(x_i) \; \underbrace{P(x_{i+1}, \ldots, x_L \mid s_i=l)}_{b_l(i),\ \text{via the backward algorithm}}$

$p(s_{i-1}=k, s_i=l \mid X^j) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{P(x_1, \ldots, x_L)}$

$P(x_1, \ldots, x_L, s_i) = P(x_1, \ldots, x_i, s_i)\, P(x_{i+1}, \ldots, x_L \mid s_i) \equiv f(s_i)\, b(s_i)$



Step 1: Computing P(si-1=k, si=l | Xj,θ)

(Figure: hidden states s1, ..., s_{i-1}, s_i, ..., sL emitting x1, ..., xL.)

$P(x_1, \ldots, x_L, s_{i-1}=k, s_i=l) = P(x_1, \ldots, x_{i-1}, s_{i-1}=k)\; a_{kl}\, e_l(x_i)\; P(x_{i+1}, \ldots, x_L \mid s_i=l) = f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)$

where f is computed via the forward algorithm and b via the backward algorithm. Since $P(x_1, \ldots, x_L) = \sum_{k'} \sum_{l'} f_{k'}(i-1)\, a_{k'l'}\, e_{l'}(x_i)\, b_{l'}(i)$,

$p(s_{i-1}=k, s_i=l \mid X^j) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{\sum_{k'} \sum_{l'} f_{k'}(i-1)\, a_{k'l'}\, e_{l'}(x_i)\, b_{l'}(i)}$

The forward algorithm
(Figure: hidden states s1, s2, ..., si emitting x1, x2, ..., xi.)

The task: compute f(s_i) = P(x_1, ..., x_i, s_i) for i = 1, ..., L (namely, considering the evidence up to time slot i).

{Basis step}
$P(x_1, s_1) = P(s_1)\, P(x_1 \mid s_1)$

{Second step}
$P(x_1, x_2, s_2) = \sum_{s_1} P(x_1, s_1, s_2, x_2) = \sum_{s_1} P(x_1, s_1)\, P(s_2 \mid x_1, s_1)\, P(x_2 \mid x_1, s_1, s_2) = \sum_{s_1} P(x_1, s_1)\, P(s_2 \mid s_1)\, P(x_2 \mid s_2)$
(the last equality is due to conditional independence)

{Step i}
$P(x_1, \ldots, x_i, s_i) = \sum_{s_{i-1}} P(x_1, \ldots, x_{i-1}, s_{i-1})\, P(s_i \mid s_{i-1})\, P(x_i \mid s_i)$
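A direct transcription of this recursion in Python (a sketch; the parameter layout follows the Viterbi example above, and for long sequences one would add scaling or work in log space):

```python
def forward(x, states, a, e, a0):
    """f[i][k] = P(x_1..x_{i+1}, s_{i+1} = k), computed left to right.
    Returns the forward table and the likelihood P(x_1, ..., x_L)."""
    f = [{k: a0[k] * e[k][x[0]] for k in states}]           # basis step
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
                  for l in states})                          # step i
    likelihood = sum(f[-1][k] for k in states)               # one more summation (next slide)
    return f, likelihood
```

For example, `f, px = forward("ATCG", states, a, e, a0)` with the made-up parameters from the Viterbi sketch returns the forward table and the likelihood of the evidence.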

Likelihood of evidence
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

To compute the likelihood of the evidence, P(x_1, ..., x_L), do one more step in the forward algorithm, namely:

$\sum_{s_L} f(s_L) = \sum_{s_L} P(x_1, \ldots, x_L, s_L) = P(x_1, \ldots, x_L)$


The backward algorithm
(Figure: hidden states s_i, s_{i+1}, ..., sL emitting x_{i+1}, ..., xL.)

The task: compute b(s_i) = P(x_{i+1}, ..., x_L | s_i) for i = L−1, ..., 1 (namely, considering the evidence after time slot i).

{First step}
$P(x_L \mid s_{L-1}) = \sum_{s_L} P(x_L, s_L \mid s_{L-1}) = \sum_{s_L} P(s_L \mid s_{L-1})\, P(x_L \mid s_{L-1}, s_L) = \sum_{s_L} P(s_L \mid s_{L-1})\, P(x_L \mid s_L)$
(the last equality is due to conditional independence)

{Step i}
$\underbrace{P(x_{i+1}, \ldots, x_L \mid s_i)}_{b(s_i)} = \sum_{s_{i+1}} P(s_{i+1} \mid s_i)\, P(x_{i+1} \mid s_{i+1})\, \underbrace{P(x_{i+2}, \ldots, x_L \mid s_{i+1})}_{b(s_{i+1})}$
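And the corresponding backward pass, under the same assumptions as the forward sketch above:

```python
def backward(x, states, a, e):
    """b[i][k] = P(x_{i+2}..x_L | s_{i+1} = k), computed right to left."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}                      # no evidence after position L
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}                             # step i
    return b
```

As on the next slide, the likelihood can also be recovered from the backward table as Σ_{s1} a0[s1] · e[s1][x[0]] · b[0][s1].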

Likelihood of evidence
(Figure: hidden states s1, ..., sL emitting x1, ..., xL.)

Do one more step in the backward algorithm, namely:

$\sum_{s_1} b(s_1)\, P(s_1)\, P(x_1 \mid s_1) = \sum_{s_1} P(x_2, \ldots, x_L \mid s_1)\, P(s_1)\, P(x_1 \mid s_1) = P(x_1, \ldots, x_L)$


Step 1
For each pair (k, l), compute the expected number of state transitions from k to l:

$A_{kl} = \sum_{j=1}^{n} \frac{1}{p(X^j)} \sum_{i=1}^{L} p(s_{i-1}=k, s_i=l, X^j \mid \theta) = \sum_{j=1}^{n} \frac{1}{p(X^j)} \sum_{i=1}^{L} f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)$

Baum-Welch: Step 2
(Figure: hidden states s1, ..., sL emitting x1, ..., xL; state k emits symbol b.)

For each state k and each symbol b, compute the expected number of emissions of b from k:

$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(X^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i)$

$P(x_1, \ldots, x_L, s_i) = P(x_1, \ldots, x_i, s_i)\, P(x_{i+1}, \ldots, x_L \mid s_i) \equiv f(s_i)\, b(s_i)$


Baum-Welch: step 3

Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b). These values define θ*:

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

It can be shown that:
  p(X1, ..., Xn | θ*) ≥ p(X1, ..., Xn | θ),
i.e., θ* increases (or at least does not decrease) the probability of the data.

This procedure is iterated until some convergence criterion is met.
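Putting the three steps together, a compact sketch of one Baum-Welch iteration in Python. It reuses the forward and backward sketches given earlier, keeps the initial-state probabilities fixed, and applies no scaling, so it is only an illustration for short sequences; none of these choices come from the slides.

```python
def baum_welch_step(seqs, states, symbols, a, e, a0):
    """One EM iteration: the E-step accumulates expected counts A_kl and E_k(b),
    the M-step renormalizes them into new parameters.
    (The initial-state probabilities a0 are kept fixed in this sketch.)"""
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {b: 0.0 for b in symbols} for k in states}
    for x in seqs:
        f, px = forward(x, states, a, e, a0)
        b = backward(x, states, a, e)
        for i in range(1, len(x)):                   # expected transition counts
            for k in states:
                for l in states:
                    A[k][l] += f[i - 1][k] * a[k][l] * e[l][x[i]] * b[i][l] / px
        for i in range(len(x)):                      # expected emission counts
            for k in states:
                E[k][x[i]] += f[i][k] * b[i][k] / px
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return new_a, new_e
```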

Sequence Comparison using HMM

"Hidden" states and the symbols they emit:
- Match (M): emits pairs {(a, b) | a, b ∈ Σ}
- Insertion in x (X): emits pairs {(a, −) | a ∈ Σ}
- Insertion in y (Y): emits pairs {(−, a) | a ∈ Σ}

- We call this type of model a pair HMM, to distinguish it from the standard HMMs that emit a single sequence.
- Each aligned pair of sequences is generated by this HMM with a certain probability.
- The most probable path is the best alignment. (Why?)


The Transition Probabilities
Transition probabilities (row = from state, column = to state):

        M      X      Y
  M   1−2δ     δ      δ
  X   1−ε      ε
  Y   1−ε             ε

δ: probability of a transition from M to an insert state
ε: probability of staying in an insert state

Emission probabilities:
- Match: (a, b) with probability p_ab – only from the M state
- Insertion in x: (a, −) with probability q_a – only from the X state
- Insertion in y: (−, a) with probability q_a – only from the Y state


Adding termination probabilities


We may want a model which defines a probability distribution over all possible sequences. For this, an END state is added, with transition probability τ from any other state to END. This assumes an expected sequence length of 1/τ.

          M       X      Y     END
  M    1−2δ−τ     δ      δ      τ
  X    1−ε−τ      ε             τ
  Y    1−ε−τ             ε      τ
  END                           1

HMM for Sequence Alignment: detailed algorithm

- The most probable path is the best alignment.

Let vM(i, j) be the probability of the most probable alignment of x(1..i) and y(1..j) which ends with a match. Then, using a recursive argument, we get:

$v^M(i, j) = p_{x_i y_j} \max \begin{cases} (1 - 2\delta)\, v^M(i-1, j-1) \\ (1 - \varepsilon)\, v^X(i-1, j-1) \\ (1 - \varepsilon)\, v^Y(i-1, j-1) \end{cases}$


Most probable path


Similarly, vX(i, j) and vY(i, j) are the probabilities of the most probable alignment of x(1..i) and y(1..j) which ends with an insertion into x or into y, respectively:

$v^X(i, j) = q_{x_i} \max \begin{cases} \delta\, v^M(i-1, j) \\ \varepsilon\, v^X(i-1, j) \end{cases}$

$v^Y(i, j) = q_{y_j} \max \begin{cases} \delta\, v^M(i, j-1) \\ \varepsilon\, v^Y(i, j-1) \end{cases}$

Termination:

$v^E = \tau \max\left( v^M(n, m),\; v^X(n, m),\; v^Y(n, m) \right)$
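A sketch of these recursions in Python (illustration only: it returns just the termination probability v^E without a traceback, and the match/gap parameters in the usage example are invented, not taken from the lecture):

```python
def pair_hmm_viterbi(x, y, p, q, delta, eps, tau):
    """Probability of the most probable pair-HMM alignment of x and y (no traceback).
    p[(a, b)]: match emission probability; q[a]: gap/insert emission probability."""
    n, m = len(x), len(y)
    vM = [[0.0] * (m + 1) for _ in range(n + 1)]
    vX = [[0.0] * (m + 1) for _ in range(n + 1)]
    vY = [[0.0] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 1.0                                   # the alignment starts in the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                vM[i][j] = p[(x[i - 1], y[j - 1])] * max(
                    (1 - 2 * delta) * vM[i - 1][j - 1],
                    (1 - eps) * vX[i - 1][j - 1],
                    (1 - eps) * vY[i - 1][j - 1])
            if i > 0:
                vX[i][j] = q[x[i - 1]] * max(delta * vM[i - 1][j],
                                             eps * vX[i - 1][j])
            if j > 0:
                vY[i][j] = q[y[j - 1]] * max(delta * vM[i][j - 1],
                                             eps * vY[i][j - 1])
    return tau * max(vM[n][m], vX[n][m], vY[n][m])   # termination step

# Toy parameters (made up): DNA alphabet, higher probability for identical pairs.
alphabet = "ACGT"
p = {(a, b): (0.15 if a == b else 0.4 / 12) for a in alphabet for b in alphabet}
q = {a: 0.25 for a in alphabet}
print(pair_hmm_viterbi("ACGT", "AGT", p, q, delta=0.2, eps=0.3, tau=0.1))
```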

Appendix

(Figure: a general Markov chain with states 1, ..., N, transition probabilities p_ij, and state probabilities α_i, where $\alpha_i^{(t)} = \sum_j \alpha_j^{(t-1)} p_{ij}$ and $\alpha_k = \lim_{t \to \infty} \alpha_k^{(t)}$.)

Either a single Gaussian $G(x \mid \mu, \Sigma)$ or a mixture of Gaussians:

$q(x) = \sum_{r=1}^{k} \alpha_r\, G(x \mid \mu_r, \Sigma_r)$

