Computational Genomics
Lecture 10

Hidden Markov Models (HMMs)
© Ydo Wexler & Dan Geiger (Technion) and by Nir Friedman (HU)
Modified by Benny Chor (TAU)
2
Outline
• Finite, or Discrete, Markov Models
• Hidden Markov Models
• Three major questions:
• Q1.: Computing the probability of a given observation.
  A1.: The Forward-Backward (Baum-Welch) dynamic programming algorithm.
• Q2.: Computing the most probable sequence of states, given an observation.
  A2.: Viterbi's dynamic programming algorithm.
• Q3.: Learning the best model, given an observation.
  A3.: Expectation Maximization (EM): a heuristic.
3
Markov Models
• A discrete (finite) system:
• N distinct states.
• Begins (at time t=1) in some initial state(s).
• At each time step (t=1,2,…) the system moves from the current state to the next state (possibly the same as the current state) according to the transition probabilities associated with the current state.
• This kind of system is called a finite, or discrete, Markov model.
• Named after Andrei Andreyevich Markov (1856-1922).
4
Outline
• Markov Chains (Markov Models)
• Hidden Markov Chains (HMMs)
• Algorithmic Questions
• Biological Relevance
5
Discrete Markov Model: Example
• A discrete Markov model with 5 states.
• Each a_ij represents the probability of moving from state i to state j.
• The a_ij are given in a matrix A = {a_ij}.
• The probability to start in a given state i is π_i; the vector π represents these start probabilities.
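To make this concrete, here is a minimal Python sketch (assuming NumPy; the 5-state numbers are illustrative placeholders, since the slide's actual matrix appears only in the figure) of storing π and A and sampling a state sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Start probabilities pi_i and transition matrix A = {a_ij};
# placeholder values, not the slide's actual model.
pi = np.full(5, 0.2)
A = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
])
assert np.allclose(A.sum(axis=1), 1.0)  # every row is a distribution

def sample_chain(pi, A, T):
    """Sample X_1..X_T: draw X_1 from pi, then each step from row A[state]."""
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(len(pi), p=A[states[-1]]))
    return states

print(sample_chain(pi, A, 10))
```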
6
Markov Property
• Markov Property: the state of the system at time t+1 depends only on the state of the system at time t:

[Chain diagram: X_{t=1} → X_{t=2} → X_{t=3} → X_{t=4} → X_{t=5}]

$$P[X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_1 = x_1] = P[X_{t+1} = x_{t+1} \mid X_t = x_t]$$
7
Markov Chains
Stationarity Assumption
Probabilities are independent of t when the process is "stationary".
So, for all t,
$$P[X_{t+1} = x_j \mid X_t = x_i] = p_{ij}.$$
This means that if the system is in state i, the probability that the system will next move to state j is p_ij, no matter what the value of t is.
8
Simple Minded Weather Example
• raining today → rain tomorrow: p_rr = 0.4
• raining today → no rain tomorrow: p_rn = 0.6
• not raining today → rain tomorrow: p_nr = 0.2
• not raining today → no rain tomorrow: p_nn = 0.8
9
Simple Minded Weather Example
Transition matrix for our example:

$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$

• Note that the rows sum to 1.
• Such a matrix is called a Stochastic Matrix.
• If the rows and the columns of a matrix all sum to 1, we have a Doubly Stochastic Matrix.
10
Coke vs. Pepsi (a central cultural dilemma)
Given that a person’s last cola purchase was Coke ™,
there is a 90% chance that her next cola purchase will
also be Coke ™.
If that person’s last cola purchase was Pepsi™, there
is an 80% chance that her next cola purchase will also
be Pepsi™.
coke pepsi
0.1
0.9
0.8
0.2
11
Coke vs. Pepsi
Given that a person is currently a Pepsi purchaser,
what is the probability that she will purchase Coke
two purchases from now?
The transition matrix (corresponding to one purchase ahead) is:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

with rows and columns ordered (Coke, Pepsi). Two purchases ahead:

$$P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} = \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}$$

The answer is the Pepsi→Coke entry of P^2, namely 0.34.
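This squaring is easy to check in a few lines (a sketch assuming NumPy):

```python
import numpy as np

# Rows/columns ordered (Coke, Pepsi); each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

P2 = P @ P        # two purchases ahead
print(P2)         # [[0.83 0.17] [0.34 0.66]]
print(P2[1, 0])   # Pepsi now -> Coke two purchases from now: 0.34
```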
12
Coke vs. Pepsi
Given that a person is currently a Coke
drinker, what is the probability that she will
purchase Pepsi three purchases from now?
$$P^3 = P \cdot P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix} = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$

The answer is the Coke→Pepsi entry of P^3, namely 0.219.
13
Coke vs. Pepsi
Assume each person makes one cola purchase per
week. Suppose 60% of all people now drink Coke, and
40% drink Pepsi.
What fraction of people will be drinking Coke three
weeks from now?
Let (Q_0, Q_1) = (0.6, 0.4) be the initial probabilities. We will regard Coke as 0 and Pepsi as 1, and we want to find P(X_3 = 0). With P as before,

$$P(X_3 = 0) = \sum_i Q_i\, p^{(3)}_{i0} = Q_0\, p^{(3)}_{00} + Q_1\, p^{(3)}_{10} = 0.6 \cdot 0.781 + 0.4 \cdot 0.438 = 0.6438$$
14
Equilibrium (Stationary) Distribution
[Diagram: Coke ↔ Pepsi two-state chain; Coke→Coke 0.9, Coke→Pepsi 0.1, Pepsi→Pepsi 0.8, Pepsi→Coke 0.2]

Suppose 60% of all people now drink Coke, and 40% drink Pepsi. What fraction will be drinking Coke 10, 100, 1000, 10000, … weeks from now?

For each week, the probability is well defined. But does it converge to some equilibrium distribution [p_0, p_1]?

If it does, then the equations 0.9p_0 + 0.2p_1 = p_0 and 0.1p_0 + 0.8p_1 = p_1 must hold, yielding p_0 = 2/3, p_1 = 1/3.
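The same [p_0, p_1] can be found numerically, either by iterating the chain or by solving for the eigenvalue-1 eigenvector (a sketch assuming NumPy; eig is applied to P.T because the stationary distribution is a left eigenvector):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Option 1: iterate the weekly distribution until it settles.
v = np.array([0.6, 0.4])
for _ in range(1000):
    v = v @ P
print(v)                          # -> [0.6667 0.3333]

# Option 2: left eigenvector of P for eigenvalue 1.
w, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(w))])
pi /= pi.sum()                    # normalize to a probability vector
print(pi)                         # -> p0 = 2/3, p1 = 1/3
```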
15
Equilibrium (Stationary) Distribution
Whether or not there is a stationary distribution, and
whether or not it is unique if it does exist, are determined
by certain properties of the process. Irreducible means that
every state is accessible from every other state. Aperiodic
means that there exists at least one state for which the
transition from that state to itself is possible. Positive
recurrent means that the expected return time is finite for
every state.
[Diagram: the Coke ↔ Pepsi chain from before, transitions 0.9/0.1 and 0.8/0.2]
http://en.wikipedia.org/wiki/Markov_chain
16
Equilibrium (Stationary) Distribution
If the Markov chain is positive recurrent, there
exists a stationary distribution. If it is positive
recurrent and irreducible, there exists a unique
stationary distribution, and furthermore the process
constructed by taking the stationary distribution as
the initial distribution is ergodic. Then the average
of a function f over samples of the Markov chain is
equal to the average with respect to the stationary
distribution.
http://en.wikipedia.org/wiki/Markov_chain
17
Equilibrium (Stationary) Distribution
Writing P for the transition matrix, a stationary
distribution is a row vector π which satisfies the equation
 πP = π .
In this case, the stationary distribution π is a left
eigenvector of the transition matrix, associated with
the eigenvalue 1.
http://en.wikipedia.org/wiki/Markov_chain
18
Discrete Markov Model - Example
• States – Rainy: 1, Cloudy: 2, Sunny: 3
• Matrix A –
• Problem – given that the weather on day 1 (t=1) is sunny (3), what is the probability of the observation O:


19
Discrete Markov Model – Example (cont.)
• The answer is –
20
Types of Models
• Ergodic model: strongly connected, i.e., there is a directed path with positive probabilities from each state i to each state j (but not necessarily a complete directed graph).
21
Third Example: A Friendly Gambler
Game starts with 10$ in gambler’s pocket
– At each round we have the following:
• Gambler wins 1$ with probability p
• Gambler loses 1$ with probability 1-p
– Game ends when gambler goes broke (no sister in bank),
or accumulates a capital of 100$ (including initial capital)
– Both 0$ and 100$ are absorbing states
[Chain diagram: states 0, 1, 2, …, N-1, N. From each interior state the gambler moves right (wins 1$) with probability p and left (loses 1$) with probability 1-p. The game starts at 10$; 0 and N (here N = 100) are absorbing.]
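A short simulation sketch of this chain (plain Python; start = 10$ and goal = 100$ as on the slide):

```python
import random

rng = random.Random(0)

def gamble(start=10, goal=100, p=0.5):
    """Play one game; return the final capital (0 = broke, goal = win)."""
    capital = start
    while 0 < capital < goal:
        capital += 1 if rng.random() < p else -1
    return capital

# Fraction of games in which the gambler reaches 100$ before going broke.
wins = sum(gamble() == 100 for _ in range(1000))
print(wins / 1000)   # for p = 0.5 the exact answer is start/goal = 0.1
```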
22
Fourth Example: A Friendly Gambler
[Same chain diagram: states 0, 1, 2, …, N-1, N; win probability p, loss probability 1-p, start at 10$.]
Irreducible means that every state is accessible from every other
state. Aperiodic means that there exists at least one state for
which the transition from that state to itself is possible. Positive
recurrent means that the expected return time is finite for every
state. If the Markov chain is positive recurrent, there exists a
stationary distribution.
• Is the gambler's chain positive recurrent? Does it have a stationary distribution (independent of the initial distribution)?
23
Let Us Change Gear
Enough with these simple Markov chains.

Our next destination: Hidden Markov chains.

[Diagram: a two-state coin model. Start: Fair or Loaded, 1/2 each. Transitions: Fair→Fair 0.9, Fair→Loaded 0.1, Loaded→Loaded 0.9, Loaded→Fair 0.1. Emissions: Fair emits head 1/2 and tail 1/2; Loaded emits head 3/4 and tail 1/4.]
24
Hidden Markov Models
(probabilistic finite state automata)
Often we face scenarios where states cannot be directly observed.
We need an extension: Hidden Markov Models.

[Diagram: states 1 → 2 → 3 → 4 with self-transitions a_11, a_22, a_33, a_44 and transitions a_12, a_23, a_34; each state emits the observed phenomenon with output probabilities b_11, b_12, b_13, b_14, …]

a_ij are state transition probabilities.
b_ik are observation (output) probabilities.
b_11 + b_12 + b_13 + b_14 = 1,
b_21 + b_22 + b_23 + b_24 = 1, etc.
25
Hidden Markov Models - HMM
[Diagram: hidden chain H_1 → H_2 → … → H_{L-1} → H_L (hidden variables), where each H_i emits an observation X_i (observed data).]
26
Example: Dishonest Casino
Actually, what is hidden in this model?
27
Coin-Tossing Example
[Diagram: the Fair/Loaded coin model (start 1/2 each; transitions Fair→Fair 0.9, Fair→Loaded 0.1, Loaded→Loaded 0.9, Loaded→Fair 0.1; Fair emits head/tail 1/2 each, Loaded emits head 3/4, tail 1/4) drawn as a hidden chain H_1 → H_2 → … → H_L over L tosses: each H_i is Fair/Loaded, each X_i is Head/Tail.]
28
Loaded Coin Example

[Same diagram: the Fair/Loaded coin HMM with hidden states H_1 … H_L (Fair/Loaded) and observed tosses X_1 … X_L (Head/Tail).]

Q1.: What is the probability of the observed outcome sequence (e.g. HHHTHTTHHT), given the model?
29
HMMs – Question I
• Given an observation sequence O = (O_1 O_2 O_3 … O_L), and a model M = {A, B, π}, how do we efficiently compute P(O|M), the probability that the given model M produces the observation O in a run of length L?
• This probability can be viewed as a measure of the quality of the model M. Viewed this way, it enables discrimination/selection among alternative models.
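For intuition, P(O|M) can be computed by brute force on small examples, summing P(O|Q,M)·P(Q|M) over every state sequence Q. A sketch for the fair/loaded coin model above (this is exactly the exponential-time sum that the forward-backward algorithm later avoids):

```python
from itertools import product

states = ['F', 'L']
start = {'F': 0.5, 'L': 0.5}                      # pi
A = {('F', 'F'): 0.9, ('F', 'L'): 0.1,            # transition probabilities
     ('L', 'L'): 0.9, ('L', 'F'): 0.1}
B = {('F', 'H'): 0.5,  ('F', 'T'): 0.5,           # output probabilities
     ('L', 'H'): 0.75, ('L', 'T'): 0.25}

def prob_obs(obs):
    """P(O|M): sum P(O,Q|M) over all 2^L hidden state sequences Q."""
    total = 0.0
    for Q in product(states, repeat=len(obs)):
        p = start[Q[0]] * B[(Q[0], obs[0])]
        for i in range(1, len(obs)):
            p *= A[(Q[i - 1], Q[i])] * B[(Q[i], obs[i])]
        total += p
    return total

print(prob_obs("HHHTHTTHHT"))
```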
31
C-G Islands Example
Regular DNA vs. C-G island
C-G islands: DNA stretches which are very rich in C and G.

[Diagram: two fully connected 4-state chains over A, C, G, T, one for regular DNA and one for C-G islands, with "change" transitions between the two; the edge labels that survive extraction include p/3, p/6, P, (1-P)/4, q, q/4, (1-q)/3, (1-q)/6.]
32
Example: CpG islands
In the human genome, CG dinucleotides are relatively rare
• CG pairs undergo a process called methylation that modifies the C nucleotide
• A methylated C mutates (with relatively high chance) to a T
Promoter regions are CG-rich
• These regions are not methylated, and thus mutate less often
• These are called CpG islands
33
CpG Islands
We construct Markov chains for CpG-rich and CpG-poor regions.
Using maximum likelihood estimates from 60K nucleotides, we get two models.
34
Ratio Test for CpG islands
Given a sequence X_1,…,X_n we compute the likelihood ratio

$$S(X_1,\ldots,X_n) = \log \frac{P(X_1,\ldots,X_n \mid +)}{P(X_1,\ldots,X_n \mid -)} = \sum_i \log \frac{A^{+}_{X_i X_{i+1}}}{A^{-}_{X_i X_{i+1}}}$$
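A sketch of this score in Python. The two transition tables are illustrative values in the style of published CpG estimates, not the course's actual 60K-nucleotide tables:

```python
import math

# Hypothetical transition probabilities A+ (island) and A- (background);
# rows are the current nucleotide, columns the next one.
A_plus = {
    'A': {'A': 0.18, 'C': 0.27, 'G': 0.43, 'T': 0.12},
    'C': {'A': 0.17, 'C': 0.37, 'G': 0.27, 'T': 0.19},
    'G': {'A': 0.16, 'C': 0.34, 'G': 0.38, 'T': 0.12},
    'T': {'A': 0.08, 'C': 0.36, 'G': 0.38, 'T': 0.18},
}
A_minus = {
    'A': {'A': 0.30, 'C': 0.21, 'G': 0.28, 'T': 0.21},
    'C': {'A': 0.32, 'C': 0.30, 'G': 0.08, 'T': 0.30},
    'G': {'A': 0.25, 'C': 0.24, 'G': 0.30, 'T': 0.21},
    'T': {'A': 0.18, 'C': 0.24, 'G': 0.29, 'T': 0.29},
}

def score(seq):
    """S(X_1..X_n) = sum_i log( A+[X_i][X_{i+1}] / A-[X_i][X_{i+1}] )."""
    return sum(math.log(A_plus[a][b] / A_minus[a][b])
               for a, b in zip(seq, seq[1:]))

print(score("CGCGCGCG"))   # clearly positive: looks like an island
print(score("ATATATAT"))   # negative: looks like background DNA
```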
35
Empirical Evaluation
36
Finding CpG islands
Simple Minded approach:
• Pick a window of size N (N = 100, for example)
• Compute the log-ratio for the sequence in the window, and classify based on that

Problems:
• How do we select N?
• What do we do when the window intersects the boundary of a CpG island?
37
Alternative Approach
Build a model that includes "+" states and "-" states

• A state "remembers" the last nucleotide and the type of region
• A transition from a "-" state to a "+" state describes the start of a CpG island
38
A Different C-G Islands Model
[Diagram: the two 4-state models over A, C, G, T (regular DNA and C-G island, linked by "change" transitions) used as the hidden chain H_1 → H_2 → … → H_L; each hidden state H_i encodes "C-G island?" and emits the observed nucleotide X_i ∈ {A, C, G, T}.]
39
HMM Recognition (question I)
• For a given model M = {A, B, π} and a given state sequence Q_1 Q_2 Q_3 … Q_L, the probability of an observation sequence O_1 O_2 O_3 … O_L is
$$P(O \mid Q, M) = b_{Q_1 O_1}\, b_{Q_2 O_2}\, b_{Q_3 O_3} \cdots b_{Q_L O_L}$$
• For a given hidden Markov model M = {A, B, π}, the probability of the state sequence Q_1 Q_2 Q_3 … Q_L is (the initial probability of Q_1 is taken to be π_{Q_1})
$$P(Q \mid M) = \pi_{Q_1}\, a_{Q_1 Q_2}\, a_{Q_2 Q_3}\, a_{Q_3 Q_4} \cdots a_{Q_{L-1} Q_L}$$
• So, for a given HMM M, the probability of an observation sequence O_1 O_2 O_3 … O_L is obtained by summing over all possible state sequences:
40
HMM – Recognition (cont.)
$$P(O \mid M) = \sum_Q P(O \mid Q, M)\, P(Q \mid M) = \sum_Q \pi_{Q_1} b_{Q_1 O_1}\, a_{Q_1 Q_2} b_{Q_2 O_2}\, a_{Q_2 Q_3} b_{Q_3 O_3} \cdots$$

• Requires summing over exponentially many paths
• Can this be made more efficient?
41
HMM – Recognition (cont.)
• Why isn't it efficient? It costs O(2L · Q^L) operations:
• For a given state sequence of length L we have about 2L calculations:
$$P(Q \mid M) = \pi_{Q_1}\, a_{Q_1 Q_2}\, a_{Q_2 Q_3}\, a_{Q_3 Q_4} \cdots a_{Q_{L-1} Q_L}$$
$$P(O \mid Q, M) = b_{Q_1 O_1}\, b_{Q_2 O_2}\, b_{Q_3 O_3} \cdots b_{Q_L O_L}$$
• There are Q^L possible state sequences
• So, if Q = 5 and L = 100, the algorithm requires 200 · 5^100 computations
• We can use the forward-backward (F-B) algorithm to do things efficiently
42
The Forward Backward Algorithm
 A white board presentation.
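Since the algorithm itself is presented on the white board, here is a minimal Python sketch of the forward pass for the fair/loaded coin model (the backward pass is symmetric, sweeping from the end of the sequence; no scaling is done, so this is for short sequences only):

```python
states = ['F', 'L']
start = {'F': 0.5, 'L': 0.5}
A = {('F', 'F'): 0.9, ('F', 'L'): 0.1, ('L', 'L'): 0.9, ('L', 'F'): 0.1}
B = {('F', 'H'): 0.5, ('F', 'T'): 0.5, ('L', 'H'): 0.75, ('L', 'T'): 0.25}

def forward(obs):
    """f[k] after step i holds P(O_1..O_i, H_i = k); returns P(O|M)."""
    f = {k: start[k] * B[(k, obs[0])] for k in states}
    for o in obs[1:]:
        f = {k: B[(k, o)] * sum(f[j] * A[(j, k)] for j in states)
             for k in states}
    return sum(f.values())

# Same value as the exponential brute-force sum, in O(L * Q^2) time.
print(forward("HHHTHTTHHT"))
```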
43
The F-B Algorithm (cont.)
Option 1) The likelihood is measured using any sequence of states of length L
• This is known as the "Any Path" Method

Option 2) We can choose an HMM by the probability generated using the best possible sequence of states
• We'll refer to this method as the "Best Path" Method
44
HMM – Question II (Harder)
• Given an observation sequence, O = (O_1 O_2 … O_L), and a model, M = {A, B, π}, how do we efficiently compute the most probable sequence(s) of states, Q?
• Namely the sequence of states Q = (Q_1 Q_2 … Q_L) which maximizes P(O,Q|M), the probability that the given model M goes through the specific sequence of states Q and produces the given observation O.
• Recall that given a model M, a sequence of observations O, and a sequence of states Q, we can efficiently compute P(O,Q|M) (though we should watch out for numeric underflows).
45
Most Probable States Sequence (Q. II)
Idea:
If we know the identity of Q_i, then the most probable sequence on i+1,…,n does not depend on observations before time i.

• A white board presentation of Viterbi's algorithm
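A compact Python sketch of Viterbi for the coin model (log-probabilities guard against the numeric underflow mentioned earlier; model tables as before):

```python
import math

states = ['F', 'L']
start = {'F': 0.5, 'L': 0.5}
A = {('F', 'F'): 0.9, ('F', 'L'): 0.1, ('L', 'L'): 0.9, ('L', 'F'): 0.1}
B = {('F', 'H'): 0.5, ('F', 'T'): 0.5, ('L', 'H'): 0.75, ('L', 'T'): 0.25}

def viterbi(obs):
    """Return the state sequence Q maximizing P(O, Q | M)."""
    # v[k]: best log-probability of any path ending in state k.
    v = {k: math.log(start[k]) + math.log(B[(k, obs[0])]) for k in states}
    back = []                              # back-pointers per position
    for o in obs[1:]:
        nv, ptr = {}, {}
        for k in states:
            j = max(states, key=lambda j: v[j] + math.log(A[(j, k)]))
            ptr[k] = j
            nv[k] = v[j] + math.log(A[(j, k)]) + math.log(B[(k, o)])
        v, back = nv, back + [ptr]
    # Trace back from the best final state.
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))

print(viterbi("HHHHHHHHHH"))   # a long run of heads decodes as loaded
```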
46
Dishonest Casino (again)
Computing posterior probabilities for “fair” at each
point in a long sequence:
47
HMM – Question III (Hardest)
• Given an observation sequence O = (O_1 O_2 … O_L), and a class of models, each of the form M = {A, B, π}, which specific model "best" explains the observations?
• A solution to question I enables the efficient computation of P(O|M) (the probability that a specific model M produces the observation O).
• Question III can be viewed as a learning problem: we want to use the sequence of observations in order to "train" an HMM and learn the optimal underlying model parameters (transition and output probabilities).
48
Learning
Given sequences x_1,…,x_n and h_1,…,h_n, how do we learn A_kl and B_ka?

We want to find parameters that maximize the likelihood P(x_1,…,x_n, h_1,…,h_n).
We simply count:
N_kl - number of times h_i = k & h_{i+1} = l
N_ka - number of times h_i = k & x_i = a

$$A_{kl} = \frac{N_{kl}}{\sum_{l'} N_{kl'}}, \qquad B_{ka} = \frac{N_{ka}}{\sum_{a'} N_{ka'}}$$
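A sketch of this supervised counting in Python (pseudocounts are omitted for clarity; in practice one adds them so that unseen transitions do not get probability zero):

```python
from collections import Counter

def learn_supervised(x, h, states, alphabet):
    """Maximum-likelihood estimates A_kl, B_ka from observed (x, h)."""
    N_kl = Counter(zip(h, h[1:]))      # times h_i = k and h_{i+1} = l
    N_ka = Counter(zip(h, x))          # times h_i = k and x_i = a
    A = {(k, l): N_kl[(k, l)] / sum(N_kl[(k, m)] for m in states)
         for k in states for l in states}
    B = {(k, a): N_ka[(k, a)] / sum(N_ka[(k, c)] for c in alphabet)
         for k in states for a in alphabet}
    return A, B

# Toy example: tosses x with annotated fair/loaded states h.
A, B = learn_supervised("HHTHTTTH", "FFFLLLLF", "FL", "HT")
print(A[('F', 'L')], B[('L', 'T')])
```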
49
Learning
Given only the sequence x_1,…,x_n, how do we learn A_kl and B_ka?

We want to find parameters that maximize the likelihood P(x_1,…,x_n).

Problem:
The counts are inaccessible, since we do not observe h_i.
50
If we have A_kl and B_ka we can compute

$$P(H_i = k, H_{i+1} = l \mid x_1,\ldots,x_n) = \frac{P(H_i = k, H_{i+1} = l, x_1,\ldots,x_n)}{P(x_1,\ldots,x_n)}$$
$$= \frac{P(x_1,\ldots,x_i, H_i = k)\, P(H_{i+1} = l \mid H_i = k)\, P(x_{i+1} \mid H_{i+1} = l)\, P(x_{i+2},\ldots,x_n \mid H_{i+1} = l)}{P(x_1,\ldots,x_n)}$$
$$= \frac{f_k(i)\, A_{kl}\, B_{l x_{i+1}}\, b_l(i+1)}{P(x_1,\ldots,x_n)}$$

where f_k(i) and b_l(i+1) are the forward and backward messages.
51
Expected Counts
We can compute the expected number of times h_i = k & h_{i+1} = l:

$$E[N_{kl}] = \sum_i P(H_i = k, H_{i+1} = l \mid x_1,\ldots,x_n)$$

Similarly,

$$E[N_{ka}] = \sum_{i\,:\,x_i = a} P(H_i = k \mid x_1,\ldots,x_n)$$
52
Expectation Maximization (EM)
Choose A_kl and B_ka
E-step:
Compute the expected counts E[N_kl], E[N_ka]
M-step:
Re-estimate:

$$A'_{kl} = \frac{E[N_{kl}]}{\sum_{l'} E[N_{kl'}]}, \qquad B'_{ka} = \frac{E[N_{ka}]}{\sum_{a'} E[N_{ka'}]}$$

Reiterate
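Putting the pieces together, a condensed Baum-Welch sketch for the coin model (unscaled forward/backward messages, so short sequences only; the start probabilities are kept fixed, as on the slide):

```python
states = ['F', 'L']
start = {'F': 0.5, 'L': 0.5}
A = {('F', 'F'): 0.6, ('F', 'L'): 0.4, ('L', 'L'): 0.6, ('L', 'F'): 0.4}
B = {('F', 'H'): 0.5, ('F', 'T'): 0.5, ('L', 'H'): 0.6, ('L', 'T'): 0.4}

def em_step(obs, A, B):
    n = len(obs)
    # E-step: forward messages f_k(i) and backward messages b_k(i).
    f = [{k: start[k] * B[(k, obs[0])] for k in states}]
    for o in obs[1:]:
        f.append({k: B[(k, o)] * sum(f[-1][j] * A[(j, k)] for j in states)
                  for k in states})
    b = [{k: 1.0 for k in states}]
    for o in reversed(obs[1:]):
        b.insert(0, {k: sum(A[(k, l)] * B[(l, o)] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)        # P(x_1..x_n | A, B)
    # Expected counts E[N_kl] and E[N_ka].
    EN_kl = {(k, l): sum(f[i][k] * A[(k, l)] * B[(l, obs[i + 1])] * b[i + 1][l]
                         for i in range(n - 1)) / px
             for k in states for l in states}
    EN_ka = {(k, a): sum(f[i][k] * b[i][k] for i in range(n) if obs[i] == a) / px
             for k in states for a in 'HT'}
    # M-step: re-estimate A and B from the expected counts.
    A2 = {(k, l): EN_kl[(k, l)] / sum(EN_kl[(k, m)] for m in states)
          for k in states for l in states}
    B2 = {(k, a): EN_ka[(k, a)] / sum(EN_ka[(k, c)] for c in 'HT')
          for k in states for a in 'HT'}
    return A2, B2, px

obs = "HHHTHHHTTHHHHT"
for _ in range(20):
    A, B, px = em_step(obs, A, B)   # likelihood px is non-decreasing
print(px)
```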
53
EM - basic properties
$$P(x_1,\ldots,x_n : A_{kl}, B_{ka}) \le P(x_1,\ldots,x_n : A'_{kl}, B'_{ka})$$

• Likelihood grows in each iteration
• If P(x_1,…,x_n : A_kl, B_ka) = P(x_1,…,x_n : A'_kl, B'_ka), then A_kl, B_ka is a stationary point of the likelihood
• either a local maximum, a minimum, or a saddle point
54
Complexity of E-step
Compute forward and backward messages
• Time & space complexity: O(nL)
Accumulate expected counts
• Time complexity: O(nL^2)
• Space complexity: O(L^2)
55
EM - problems
Local Maxima:
Learning can get stuck in local maxima
Sensitive to initialization
Requires some method for escaping such maxima

Choosing L:
We often do not know how many hidden values we should have or can learn
56
Communication Example