Statistics: Markov Chain & Processes

Markov Chains
Lecture #5
Background Readings: Durbin et al., Section 3.1.
Dependencies along the genome
In previous classes we assumed every letter in a sequence is sampled
randomly and independently from some distribution q(⋅) over the
alphabet {A,C,T,G}.
This model could suffice for alignment scoring, but it is not the
case in true genomes:
1. There are special subsequences in the genome, like TATA within
the regulatory area, upstream of a gene.
2. The pattern CG is less common than expected for random sampling.
We model such dependencies by Markov chains and hidden Markov
models, which we define next.
Finite Markov Chain
An integer-time stochastic process, consisting of a domain D of
m>1 states {s1,…,sm} and
• An m-dimensional initial distribution vector (p(s1),…,p(sm)).
• An m×m transition probability matrix M = (a_{s_i s_j}).
For example, D can be the letters {A, C, T, G}, p(A) the probability
that A is the 1st letter in a sequence, and a_AG the probability
that G follows A in a sequence.
Markov Chain (cont.)
[Figure: chain X1 → X2 → … → Xn-1 → Xn]
• For each integer n, a Markov Chain assigns probability to
sequences (x1…xn) over D (i.e., xi ∈ D) as follows:
$$p((x_1, x_2, \ldots, x_n)) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$
Similarly, (X1,…, Xi ,…) is a sequence of probability
distributions over D. There is a rich theory which studies the
properties of these sequences. A bit of it is presented next.
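The product formula for the probability of a sequence can be sketched in a few lines of Python. This is only an illustration: the function name `sequence_probability` and the two-state chain used to exercise it are hypothetical and do not come from the lecture.

```python
def sequence_probability(seq, initial, transition):
    """Return p(x1..xn) = p(x1) * product over i >= 2 of a[x_{i-1}][x_i]."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    prob = initial[seq[0]]
    # Multiply in one transition probability per consecutive pair.
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# Hypothetical two-state chain, chosen only for illustration.
initial = {"A": 0.5, "B": 0.5}
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.4, "B": 0.6},
}

# p(A) * a_AA * a_AB = 0.5 * 0.9 * 0.1
p = sequence_probability("AAB", initial, transition)
```

For long sequences the product underflows quickly, so in practice one would usually sum logarithms instead.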
Markov Chain (cont.)
[Figure: chain X1 → X2 → … → Xn-1 → Xn]
Similarly, each Xi has a probability distribution over D, which
is determined by the initial distribution (p1,…,pm) and the
transition matrix M.
There is a rich theory which studies the properties of such
“Markov sequences” (X1,…, Xi ,…). A bit of this theory is
presented next.
Matrix Representation
The transition probability matrix M = (a_st):

      A      B      C      D
A   0.95     0    0.05     0
B   0.2    0.5      0    0.3
C     0    0.2      0    0.8
D     0      0      1      0

M is a stochastic matrix: $\sum_t a_{st} = 1$ for every state s.
The initial distribution vector (u1…um) defines the distribution
of X1 (p(X1=si)=ui).
Then after one move, the distribution is changed to X2 = X1M.
Matrix Representation
      A      B      C      D
A   0.95     0    0.05     0
B   0.2    0.5      0    0.3
C     0    0.2      0    0.8
D     0      0      1      0

Example: if X1=(0, 1, 0, 0) then X2=(0.2, 0.5, 0, 0.3).
And if X1=(0, 0, 0.5, 0.5) then X2=(0, 0.1, 0.5, 0.4).
The i-th distribution is $X_i = X_1 M^{i-1}$.
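The two worked examples and the rule that the i-th distribution is X1 multiplied by the (i-1)-th power of M can be checked with a short sketch. The matrix is the one from the slide; the helper names `step` and `distribution_at` are hypothetical, and plain Python lists are used rather than a linear algebra library.

```python
def step(dist, matrix):
    """One move of the chain: the row vector dist multiplied by matrix."""
    m = len(matrix)
    return [sum(dist[s] * matrix[s][t] for s in range(m)) for t in range(m)]


def distribution_at(i, x1, matrix):
    """Distribution of X_i, obtained by applying i-1 moves to x1 (i >= 1)."""
    dist = list(x1)
    for _ in range(i - 1):
        dist = step(dist, matrix)
    return dist


# Transition matrix from the slide; rows and columns are ordered A, B, C, D.
M = [
    [0.95, 0.0, 0.05, 0.0],
    [0.2, 0.5, 0.0, 0.3],
    [0.0, 0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0, 0.0],
]

x2_from_b = distribution_at(2, [0, 1, 0, 0], M)       # start surely in B
x2_from_cd = distribution_at(2, [0, 0, 0.5, 0.5], M)  # start in C or D
```

Because every row of M sums to one, each vector produced this way again sums to one.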
Representation of a Markov Chain as a Digraph

      A      B      C      D
A   0.95     0    0.05     0
B   0.2    0.5      0    0.3
C     0    0.2      0    0.8
D     0      0      1      0

[Figure: digraph on states A, B, C, D with edges A→A (0.95),
A→C (0.05), B→A (0.2), B→B (0.5), B→D (0.3), C→B (0.2),
C→D (0.8), D→C (1)]
Each directed edge A→B is associated with the positive
transition probability from A to B.
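Reading the digraph off the matrix amounts to listing the positive entries. A minimal sketch, using the slide's matrix and a hypothetical helper named `matrix_to_edges`:

```python
def matrix_to_edges(states, matrix):
    """List (source, target, probability) for every positive matrix entry."""
    return [
        (states[s], states[t], matrix[s][t])
        for s in range(len(states))
        for t in range(len(states))
        if matrix[s][t] > 0
    ]


states = ["A", "B", "C", "D"]
# Transition matrix from the slide; rows and columns are ordered A, B, C, D.
M = [
    [0.95, 0.0, 0.05, 0.0],
    [0.2, 0.5, 0.0, 0.3],
    [0.0, 0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0, 0.0],
]

edges = matrix_to_edges(states, M)
```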
Properties of Markov Chain states
States of Markov chains are classified by the digraph
representation (omitting the actual probability values).
[Figure: digraph on states A, B, C, D]
A, C and D are recurrent states: they are in strongly connected
components which are sinks in the graph.
B is not recurrent: it is a transient state.
Alternative definition: a state s is recurrent if it can be
reached from any state reachable from s; otherwise it is transient.
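The alternative definition translates directly into a reachability test. The sketch below is an illustration under assumptions: the edges of the graph in the figure are not recoverable from the text, so the example graph here is hypothetical, as are the names `reachable` and `classify_states`.

```python
def reachable(graph, start):
    """Set of states reachable from start (start itself included)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def classify_states(graph):
    """s is recurrent if s can be reached from every state reachable from s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {
        s: "recurrent" if all(s in reach[t] for t in reach[s]) else "transient"
        for s in graph
    }


# Hypothetical graph: A and B can leak into {C, D}, which is closed.
graph = {"A": ["B"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
labels = classify_states(graph)
```

Only the edge structure matters here; the actual probability values play no role in the classification.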
Another example of Recurrent and Transient States
[Figure: digraph on states A, B, C, D]
A and B are transient states, C and D are recurrent states.
Once the process moves from B to D, it will never come back.
Irreducible Markov Chains
A Markov Chain is irreducible if the corresponding graph is
strongly connected (and thus all its states are recurrent).
[Figure: two example digraphs, one on states A, B, C, D, E and one
on states A, B, C, D]
Periodic States
[Figure: digraph on states A, B, C, D, E]
A state s has a period k if k is the GCD of the lengths of all the
cycles that pass via s (in the shown graph the period of A is 2).
Exercise: All the states in the same strongly connected
component have the same period.
A Markov Chain is periodic if all the states in it have a period k >1.
It is aperiodic otherwise.
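One way to compute the period without enumerating cycles is sketched below. It assumes the graph is strongly connected and uses a breadth-first-search shortcut that is not part of the lecture: with d(v) the BFS distance from the start state, the period equals the GCD of d(u) + 1 - d(v) over all edges u→v. The function name `period` and both example graphs are hypothetical.

```python
from collections import deque
from math import gcd


def period(graph, start):
    """Period of `start`, assuming the graph is strongly connected."""
    # Breadth-first search gives shortest distances from start.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Fold the quantity d(u) + 1 - d(v) of every edge into one GCD.
    k = 0
    for u in graph:
        for v in graph[u]:
            k = gcd(k, dist[u] + 1 - dist[v])
    return k


# Hypothetical graph with cycles A-B-A (length 2) and A-B-C-D-A (length 4).
periodic_graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}
# Same graph with a self-loop at A, which adds a cycle of length 1.
aperiodic_graph = {"A": ["A", "B"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}

k_periodic = period(periodic_graph, "A")
k_aperiodic = period(aperiodic_graph, "A")
```

In line with the exercise, calling `period` with any other start state of the same strongly connected graph gives the same value.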
Ergodic Markov Chains
A Markov chain is ergodic if:
1. The corresponding graph is strongly connected.
2. It is not periodic.
[Figure: digraph on states A, B, C, D]
Ergodic Markov Chains are important since they guarantee the
corresponding Markovian process converges to a unique
distribution, in which all states have strictly positive probability.
Stationary Distributions for Markov Chains
Let M be a Markov Chain of m states, and let V = (v1,…,vm) be a
probability distribution over the m states.
V = (v1,…,vm) is a stationary distribution for M if VM=V
(i.e., if one step of the process does not change the distribution).
V is a stationary distribution ⇔
V is a left (row) Eigenvector of M with Eigenvalue 1.
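As a numerical illustration, a stationary distribution of the four-state matrix from the Matrix Representation slides can be approximated by repeatedly applying M until the vector stops changing. This sketch assumes the chain is ergodic (true for that matrix: its graph is strongly connected and A has a self-loop); the function name `stationary_by_iteration` and its tolerance parameters are hypothetical.

```python
def stationary_by_iteration(matrix, tol=1e-13, max_iter=100_000):
    """Approximate V with VM = V by repeated multiplication (ergodic chains)."""
    m = len(matrix)
    v = [1.0 / m] * m  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(v[s] * matrix[s][t] for s in range(m)) for t in range(m)]
        if max(abs(a - b) for a, b in zip(v, nxt)) < tol:
            return nxt
        v = nxt
    raise RuntimeError("no convergence; the chain may be periodic")


# Transition matrix from the earlier slides; rows and columns are A, B, C, D.
M = [
    [0.95, 0.0, 0.05, 0.0],
    [0.2, 0.5, 0.0, 0.3],
    [0.0, 0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0, 0.0],
]

V = stationary_by_iteration(M)
# One more step should leave V (numerically) unchanged.
VM = [sum(V[s] * M[s][t] for s in range(4)) for t in range(4)]
```

Solving VM = V by hand for this matrix gives V proportional to (4, 1, 2.5, 2.3), that is, V = (4, 1, 2.5, 2.3) / 9.8, which the iteration reproduces.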
Stationary Distributions for a Markov Chain M
Exercise: A stochastic matrix always has a real left Eigenvector
with Eigenvalue 1 (hint: show that a stochastic matrix has a
right Eigenvector with Eigenvalue 1. Note that the left
Eigenvalues of a Matrix are the same as the right Eigenvalues).
[It can be shown that the above Eigenvector V can be non-negative.
Hence each Markov Chain has a stationary distribution.]
“Good” Markov chains
A Markov Chain is good if the distributions Xi, as i→∞:
(1) converge to a unique distribution, independent of the initial
distribution.
(2) In that unique distribution, each state has a positive
probability.
The Fundamental Theorem of Finite Markov Chains:
A Markov Chain is good ⇔ the corresponding graph is ergodic.
We will prove the ⇒ part, by showing that non-ergodic Markov
Chains are not good.
Examples of “Bad” Markov Chains
A Markov chain is not “good” if either:
1. It does not converge to a unique distribution.
2. It does converge to a unique distribution, but some states
in this distribution have zero probability.
Bad case 1: Mutual Unreachability
[Figure: digraph on states A, B, C, D]
Consider two initial distributions:
a) p(X1=A)=1 (p(X1 = x)=0 if x≠A).
b) p(X1=C)=1.
In case a), the sequence will stay at A forever.
In case b), it will stay in {C,D} forever.
Fact 1: If G has two states which are unreachable from each
other, then {Xi} cannot converge to a distribution which is
independent of the initial distribution.
Bad case 2: Transient States
[Figure: digraph on states A, B, C, D]
Once the process moves from B to D, it will never come back.
Bad case 2: Transient States
Fact 2: For each initial distribution, with probability 1 a
transient state will be visited only a finite number of times.
[Figure: digraph on states A, B, C, D, with a subset of states
marked X]
Proof: Let A be a transient state, and let X be the set of states
from which A is unreachable. It is enough to show that,
starting from any state, with probability 1 a state in X is
reached after a finite number of steps (Exercise: complete the
proof).
Corollary: A good Markov Chain is irreducible.
Bad case 3: Periodic Markov Chains
[Figure: digraph on states A, B, C, D, E]
Recall: A Markov Chain is periodic if all the states in it have a
period k >1. The above chain has period 2.
In the above chain, consider the initial distribution p(B)=1.
Then states {B, C} are visited (with positive probability) only in odd
steps, and states {A, D, E} are visited only in even steps.
Bad case 3: Periodic States
[Figure: digraph on states A, B, C, D, E]
Fact 3: In a periodic Markov Chain (of period k >1) there are initial
distributions under which the states are visited in a periodic manner.
Under such initial distributions Xi does not converge as i→∞.
Corollary: A good Markov Chain is not periodic.
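Fact 3 can be seen on the smallest possible periodic chain. The sketch below uses a hypothetical two-state chain that switches state at every move (it is not the five-state chain of the figure, whose edges are not recoverable from the text), together with a hypothetical helper `step`.

```python
def step(dist, matrix):
    """One move of the chain: the row vector dist multiplied by matrix."""
    m = len(matrix)
    return [sum(dist[s] * matrix[s][t] for s in range(m)) for t in range(m)]


# Hypothetical two-state chain that flips state at every move (period 2).
flip = [
    [0.0, 1.0],
    [1.0, 0.0],
]

# Starting surely in the first state, the distribution oscillates forever.
dist = [1.0, 0.0]
history = [dist]
for _ in range(4):
    dist = step(dist, flip)
    history.append(dist)

# The uniform distribution is stationary, yet it is not reached from [1, 0].
uniform_after_step = step([0.5, 0.5], flip)
```

The chain still has a stationary distribution (the uniform one), which shows that existence of a stationary distribution alone does not imply convergence to it.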
The Fundamental Theorem of Finite Markov Chains:
We have proved that non-ergodic Markov Chains are not
good. A proof of the other part (based on Perron-Frobenius
theory) is beyond the scope of this course:
If a Markov Chain is ergodic, then
• It has a unique stationary distribution vector V > 0, which is an
Eigenvector of the transition matrix.
• For any initial distribution, the distributions Xi, as i→∞,
converge to V.
Use of Markov Chains in Genome search: Modeling CpG Islands
In human genomes the pair CG often transforms to
(methyl-C) G which often transforms to TG.
Hence the pair CG appears less often than expected from
the independent frequencies of C and G alone.
Due to biological reasons, this process is sometimes
suppressed in short stretches of genomes such as in
the start regions of many genes.
These areas are called CpG islands (the “p” denotes the phosphate
linking C and G along one strand).
Example: CpG Island (Cont.)
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does
it come from a CpG island?
Question 2: Given a long piece of genomic data, does
it contain CpG islands, and if so where and of what length?
We “solve” the first question by modeling strings with
and without CpG islands as Markov Chains over the
same states {A,C,G,T} but different transition
probabilities:
The “+” model: Use transition matrix A+ = (a+st), where
a+st = (the probability that t follows s in a CpG island).
The “-” model: Use transition matrix A- = (a-st), where
a-st = (the probability that t follows s in a non-CpG island).
With this model, to solve Question 1 we need to decide
whether a given short sequence of letters is more likely
to come from the “+” model or from the “-” model.
This is done by using the definition of a Markov Chain,
in which the parameters are determined from known data,
together with the log odds-ratio test.
Question 1: Using two Markov chains
A+ (For CpG islands):
We need to specify p+(xi | xi-1) where + stands for CpG
Island. From Durbin et al. we have (rows: Xi-1, columns: Xi):

Xi-1 \ Xi   A      C          G          T
A           0.18   0.27       0.43       0.12
C           0.17   p+(C|C)    0.274      p+(T|C)
G           0.16   p+(C|G)    p+(G|G)    p+(T|G)
T           0.08   p+(C|T)    p+(G|T)    p+(T|T)

(Recall: rows must add up to one; columns need not.)
Question 1: Using two Markov chains
A- (For non-CpG islands):
…and for p-(xi | xi-1) (where “-” stands for non-CpG
island) we have (rows: Xi-1, columns: Xi):

Xi-1 \ Xi   A      C          G          T
A           0.3    0.2        0.29       0.21
C           0.32   p-(C|C)    0.078      p-(T|C)
G           0.25   p-(C|G)    p-(G|G)    p-(T|G)
T           0.18   p-(C|T)    p-(G|T)    p-(T|T)
Discriminating between the two models
[Figure: chain X1 → X2 → … → XL-1 → XL]
Given a string x=(x1…xL), now compute the ratio
$$\mathrm{RATIO} = \frac{p(x \mid +\ \text{model})}{p(x \mid -\ \text{model})} = \frac{\prod_{i=0}^{L-1} p_+(x_{i+1} \mid x_i)}{\prod_{i=0}^{L-1} p_-(x_{i+1} \mid x_i)}$$
If RATIO>1, CpG island is more likely.
In practice the log of this ratio is computed.
Note: p+(x1|x0) is defined for convenience as p+(x1), and
p-(x1|x0) is defined for convenience as p-(x1).
Log Odds-Ratio test
Taking logarithm yields
$$\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_{i=1}^{L} \log \frac{p_+(x_i \mid x_{i-1})}{p_-(x_i \mid x_{i-1})}$$
If logQ > 0, then + is more likely (CpG island).
If logQ < 0, then - is more likely (non-CpG island).
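A minimal sketch of the log odds-ratio test follows. It stores only the transition probabilities that are actually filled in on the two table slides, so it can score only strings whose consecutive pairs are among those entries. Two things are assumptions rather than part of the lecture: the function name `log_odds`, and the choice to omit the first-letter term (equivalent to assuming p+(x1) = p-(x1)).

```python
import math

# Known table entries only; keys are (previous letter, current letter).
PLUS = {
    ("A", "A"): 0.18, ("A", "C"): 0.27, ("A", "G"): 0.43, ("A", "T"): 0.12,
    ("C", "A"): 0.17, ("C", "G"): 0.274, ("G", "A"): 0.16, ("T", "A"): 0.08,
}
MINUS = {
    ("A", "A"): 0.3, ("A", "C"): 0.2, ("A", "G"): 0.29, ("A", "T"): 0.21,
    ("C", "A"): 0.32, ("C", "G"): 0.078, ("G", "A"): 0.25, ("T", "A"): 0.18,
}


def log_odds(seq, plus=PLUS, minus=MINUS):
    """Sum of log(p+(xi|xi-1) / p-(xi|xi-1)) over consecutive pairs.

    The first-letter term is left out (assumption: it is the same under
    both models). A pair missing from the tables raises KeyError.
    """
    score = 0.0
    for pair in zip(seq, seq[1:]):
        score += math.log(plus[pair] / minus[pair])
    return score


score_cg = log_odds("ACGA")  # uses A->C, C->G, G->A
score_aa = log_odds("AAAA")  # uses A->A three times
```

With these entries, `score_cg` is positive (the C→G step strongly favors the “+” model) and `score_aa` is negative.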
Where do the parameters (transition probabilities) come from?
Learning from complete data, namely, when the label is
given and every xi is measured:
Source: A collection of sequences from CpG islands, and a
collection of sequences from non-CpG islands.
Input: Tuples of the form (x1, …, xL, h), where h is + or -.
Output: Maximum Likelihood estimates (MLE) of the parameters.
Count all pairs (Xi=a, Xi-1=b) with label +, and
with label -; say the numbers are N_{ba,+} and N_{ba,-}.
Maximum Likelihood Estimate (MLE) of
the parameters (using labeled data)
[Figure: chain X1 → X2 → … → XL-1 → XL]
The needed parameters are:
p+(x1), p+(xi | xi-1), p-(x1), p-(xi | xi-1)
The ML estimates are given by:
$$p_+(X_1 = a) = \frac{N_{a,+}}{\sum_x N_{x,+}}$$
where N_{x,+} is the number of times letter x appears in CpG
islands in the dataset, and
$$p_+(X_i = a \mid X_{i-1} = b) = \frac{N_{ba,+}}{\sum_x N_{bx,+}}$$
where N_{bx,+} is the number of times letter x appears after
letter b in CpG islands in the dataset.
Using MLE is justified when we have a large sample. The numbers appearing in
the textbook are based on 60,000 nucleotides. When only small samples are
available, Bayesian learning is an attractive alternative.
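The counting estimate for the transition probabilities can be sketched as follows. The function name `mle_transitions` and the two short training strings are hypothetical; real training data would be the labeled CpG and non-CpG collections described above, with the function run once per label.

```python
from collections import defaultdict


def mle_transitions(sequences):
    """Estimate p(a | b) as N_ba divided by the sum over x of N_bx."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        # b is the previous letter, a the letter that follows it.
        for b, a in zip(seq, seq[1:]):
            counts[b][a] += 1
    probs = {}
    for b, row in counts.items():
        total = sum(row.values())
        probs[b] = {a: n / total for a, n in row.items()}
    return probs


# Hypothetical sequences labeled "+", used only to exercise the function.
plus_sequences = ["ACGA", "ACCA"]
p_plus = mle_transitions(plus_sequences)
```

Pairs that never occur in the training data get no entry at all, which is effectively a probability of zero; this is one reason the estimate is unreliable on small samples.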
CpG Island: Question 2
Now we solve the 2nd question:
Question 2: Given a long piece of genomic data, does
it contain CpG islands, and where?
For this, we need to decide which parts of a given long sequence
of letters are more likely to come from the “+” model, and which
parts are more likely to come from the “-” model. This is done by
using the Hidden Markov Model, to be defined.
Question 2: Finding CpG Islands
Given a long genomic string with possible CpG
Islands, we define a Markov Chain over 8 states, all
interconnected (hence it is ergodic):

A+ C+ G+ T+
A- C- G- T-

The problem is that we don’t know the sequence of states which
are traversed, but just the sequence of letters.
Therefore we use here a Hidden Markov Model.
Hidden Markov Model
[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transitions
labeled M; each state Si emits the symbol xi through an edge labeled T]
A Markov chain (s1,…,sL):
$$p(s_1, \ldots, s_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})$$
and for each state s and a symbol x we have p(Xi=x|Si=s).
Application in communication: the message sent is (s1,…,sm) but we
receive (x1,…,xm). Compute the most likely message sent.
Application in speech recognition: the word said is (s1,…,sm) but we
recorded (x1,…,xm). Compute the most likely word said.
Hidden Markov Model
[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transitions
labeled M; each state Si emits the symbol xi through an edge labeled T]
Notations:
Markov Chain transition probabilities: p(Si+1= t|Si = s) = ast
Emission probabilities: p(Xi = b| Si = s) = es(b)
For Markov Chains we know:
$$p(s) = p(s_1, \ldots, s_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})$$
What is p(s,x) = p(s1,…,sL;x1,…,xL)?
Hidden Markov Model
[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transitions
labeled M; each state Si emits the symbol xi through an edge labeled T]
p(Xi = b| Si = s) = es(b) means that the probability of xi
depends only on the value of si.
Formally, this is equivalent to the conditional
independence assumption:
p(Xi=xi|x1,..,xi-1,xi+1,..,xL,s1,..,si,..,sL) = esi(xi)
Thus
$$p(s, x) = p(s_1, \ldots, s_L; x_1, \ldots, x_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})\, e_{s_i}(x_i)$$
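The joint probability of a state path and an emitted string can be sketched directly from the product formula. Everything concrete here is hypothetical: the function name `joint_probability`, the two-state model, and its numbers. The sketch also assumes the convention that p(s1 | s0) stands for an initial probability p(s1), mirroring the convention used earlier for p+(x1|x0).

```python
def joint_probability(states, symbols, initial, transition, emission):
    """p(s, x) = product over i of p(s_i | s_{i-1}) * e_{s_i}(x_i).

    Assumption: p(s_1 | s_0) is read as the initial probability p(s_1).
    """
    if not states or len(states) != len(symbols):
        raise ValueError("states and symbols must be non-empty, equal length")
    prob = initial[states[0]] * emission[states[0]][symbols[0]]
    for i in range(1, len(states)):
        prob *= transition[states[i - 1]][states[i]]
        prob *= emission[states[i]][symbols[i]]
    return prob


# Hypothetical two-state HMM over the symbols C and T.
initial = {"+": 0.5, "-": 0.5}
transition = {
    "+": {"+": 0.8, "-": 0.2},
    "-": {"+": 0.1, "-": 0.9},
}
emission = {
    "+": {"C": 0.6, "T": 0.4},
    "-": {"C": 0.2, "T": 0.8},
}

# (0.5 * 0.6) * (0.8 * 0.6) * (0.2 * 0.8)
p = joint_probability(["+", "+", "-"], ["C", "C", "T"],
                      initial, transition, emission)
```

This evaluates the probability of one given path only; finding the most likely hidden path for an observed string is a separate problem that this sketch does not address.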
