
Stochastic processes and Markov chains (part I)

Wessel van Wieringen
w.n.van.wieringen@vu.nl

Department of Epidemiology and Biostatistics, VUmc
& Department of Mathematics, VU University
Amsterdam, The Netherlands
Stochastic processes

Example 1
• The intensity of the sun.
• Measured every day by the KNMI.
• Stochastic variable Xt represents the sun’s intensity at
  day t, 0 ≤ t ≤ T. Hence, Xt assumes values in R+
  (positive values only).
Stochastic processes

Example 2
• DNA sequence of 11 bases long.
• At each base position there is an A, C, G or T.
• Stochastic variable Xi is the base at position i, i = 1,…,11.
• In case the sequence has been observed, say:
  (x1, x2, …, x11) = ACCCGATAGCT,
  then A is the realization of X1, C that of X2, et cetera.
Stochastic processes
[Figure: at each position t, t+1, t+2, t+3 the base takes one of the values A, C, G, T; the realized sequence shown is A, A, T, C.]
Stochastic processes

Example 3
• A patient’s heart pulse during surgery.
• Measured continuously during interval [0, T].
• Stochastic variable Xt represents the occurrence of
  a heartbeat at time t, 0 ≤ t ≤ T. Hence, Xt assumes
  only the values 0 (no heartbeat) and 1 (heartbeat).
Stochastic processes

Example 4
• Brain activity of a human under experimental conditions.
• Measured continuously during interval [0, T].
• Stochastic variable Xt represents the magnetic field at
  time t, 0 ≤ t ≤ T. Hence, Xt assumes values on R.
Stochastic processes
Differences between examples

                  Xt discrete    Xt continuous
Time discrete     Example 2      Example 1
Time continuous   Example 3      Example 4
Stochastic processes

The state space S is the collection of all possible values
that the random variables of the stochastic process may
assume.

If S = {E1, E2, …, Es} is discrete, then Xt is a discrete
stochastic variable.
→ examples 2 and 3.

If S = [0, ∞) is continuous, then Xt is a continuous
stochastic variable.
→ examples 1 and 4.
Stochastic processes
Stochastic process:
• Discrete time: t = 0, 1, 2, 3, ….
• S = {-3, -2, -1, 0, 1, 2, …}
• P(one step up) = ½ = P(one step down)
• Absorbing state at -3

[Figure: a realization of this random walk over time steps 1 to 18, starting in state 0.]
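A minimal R sketch of this random walk (one realization over 18 time steps; the start in state 0 matches the figure):

> # simulate a random walk on {-3, -2, -1, 0, 1, ...} with absorbing state -3
> set.seed(1)
> x <- 0                                       # start in state 0 (as in the figure)
> for (t in 1:18) {
+   if (x[t] == -3) x[t+1] <- -3               # once absorbed, stay in -3
+   else x[t+1] <- x[t] + sample(c(-1, 1), 1)  # one step up or down, prob. 1/2 each
+ }
> x                                            # the realized path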
Stochastic processes
The first passage time of a certain state Ei in S is the
time t at which Xt = Ei for the first time since the start
of the process.

[Figure: the random walk realization. What is the first passage time of 2? And of -1?]
Stochastic processes
An absorbing state is a state Ei for which the
following holds: if Xt = Ei, then Xs = Ei for all s ≥ t.
The process will never leave state Ei.

[Figure: the random walk realization. The absorbing state is -3.]
Stochastic processes
The time of absorption of an absorbing state is the
first passage time of that state.

[Figure: the random walk realization. The time of absorption is 17.]
Stochastic processes
The recurrence time is the first time t at which the
process has returned to its initial state.

[Figure: the random walk realization. The recurrence time is 14.]
Stochastic processes

A stochastic process is described by a collection of
time points, the state space and the simultaneous
distribution of the variables Xt, i.e., the distributions of
all Xt and their dependency.

There are two important types of processes:
• Poisson process: all variables are identically and
  independently distributed. Examples: queues for
  counters, call centers, servers, et cetera.
• Markov process: the variables are dependent in a
  simple manner.
Markov processes
A 1st order Markov process in discrete time is a sto-
chastic process {Xt}t=1,2,… for which the following holds:
P(Xt+1=xt+1 | Xt=xt, …, X1=x1) = P(Xt+1=xt+1 | Xt=xt).
In other words, only the present determines the future;
the past is irrelevant.

[Figure: the random walk realization. From every state, with probability 0.5, one step up or down (except for state -3).]
Markov processes
The 1st order Markov property, i.e.
P(Xt+1=xt+1 | Xt=xt, …, X1=x1) = P(Xt+1=xt+1 | Xt=xt),
does not imply independence between Xt-1 and Xt+1.
In fact, often P(Xt+1=xt+1 | Xt-1=xt-1) is unequal to zero.

For instance,
P(Xt+1=2 | Xt-1=0) = P(Xt+1=2 | Xt=1) * P(Xt=1 | Xt-1=0) = ¼.

[Figure: the random walk realization.]
Markov processes
An m-order Markov process in discrete time is a stochastic
process {Xt}t=1,2,… for which the following holds:
P(Xt+1=xt+1 | Xt=xt, …, X1=x1)
= P(Xt+1=xt+1 | Xt=xt, …, Xt-m+1=xt-m+1).

Loosely, the future depends on the most recent past.

Suppose m=9: the distribution of the (t+1)-th base
depends on the 9 preceding ones.

… CTTCCTCAAGATGCGTCCAATCCCTCAAAC …
Markov processes
A Markov process is called a Markov chain if the state
space is discrete, i.e., is finite or countable.

In this lecture series we consider Markov chains in
discrete time.

Recall the DNA example.
Markov processes
The transition probabilities are the
P(Xt+1=xt+1 | Xt=xt),
but also the
P(Xt+1=xt+1 | Xs=xs) for s < t,
where xt+1, xt in {E1, …, Es} = S.

For example, in the random walk:
P(Xt+1=3 | Xt=2) = 0.5
P(Xt+1=0 | Xt=1) = 0.5
P(Xt+1=5 | Xt=7) = 0

[Figure: the random walk realization.]
Markov processes
A Markov process is called time homogeneous if the
transition probabilities are independent of t:
P(Xt+1=x1 | Xt=x2) = P(Xs+1=x1 | Xs=x2).

For example:
P(X2=-1 | X1=2) = P(X6=-1 | X5=2)
P(X584=5 | X583=4) = P(X213=5 | X212=4)

[Figure: the random walk realization.]
Markov processes
Example of a time inhomogeneous process

[Figure: bathtub-shaped hazard curve h(t), with phases: infant mortality failures, random failures, wearout failures.]
Markov processes
Consider a DNA sequence of 11 bases. Then S = {a, c,
g, t}, Xi is the base at position i, and {Xi}i=1,…,11 is a
Markov chain if the base at position i only depends on
the base at position i-1, and not on those before i-1.
If this is plausible, a Markov chain is an acceptable
model for base ordering in DNA sequences.

[Figure: state diagram with states A, C, G and T.]
Markov processes
In the remainder, only time homogeneous Markov processes.
We then denote the transition probabilities of a finite time
homogeneous Markov chain in discrete time {Xt}t=1,2,… with
S = {E1, …, Es} as:
P(Xt+1=Ej | Xt=Ei) = pij (does not depend on t).

Putting the pij in a matrix yields the transition matrix
P = (pij)i,j=1,…,s. The rows of this matrix sum to one.


Markov processes
In the DNA example:
pAA = P(Xt+1 = A | Xt = A),
and:
pAA + pAC + pAG + pAT = 1
pCA + pCC + pCG + pCT = 1
et cetera.

[Figure: state diagram over A, C, G, T with the transition probabilities pAA, pAC, pAG, …, pTT on the arrows.]
Markov processes
The initial distribution π = (π1, …, πs)T gives the
probabilities of the initial state:
πi = P(X1 = Ei) for i = 1, …, s,
and π1 + … + πs = 1.

Together the initial distribution π and the transition
matrix P determine the probability distribution of the
process, the Markov chain {Xt}t=1,2,…
Markov processes

We now show how the couple (π, P) determines the
probability distribution (transition probabilities) of time
steps larger than one.

Hereto define for
n=2 :       pij^(2) = P(Xt+2=Ej | Xt=Ei),
general n : pij^(n) = P(Xt+n=Ej | Xt=Ei).
Markov processes
For n = 2 :

pij^(2) = P(Xt+2=Ej | Xt=Ei)
    (just the definition)

  = Σk P(Xt+2=Ej, Xt+1=Ek | Xt=Ei)
    (use the fact that P(A, B) + P(A, Bᶜ) = P(A))

  = Σk P(Xt+2=Ej | Xt+1=Ek, Xt=Ei) P(Xt+1=Ek | Xt=Ei)
    (use the definition of conditional probability:
     P(A, B) = P(A | B) P(B))

  = Σk P(Xt+2=Ej | Xt+1=Ek) P(Xt+1=Ek | Xt=Ei)
    (use the Markov property)

  = Σk pik pkj = (P^2)ij.
Markov processes

In summary, we have shown that:
pij^(2) = (P^2)ij.

In a similar fashion we can show that:
pij^(n) = (P^n)ij.

In words: the transition matrix for n steps is the one-step
transition matrix raised to the power n.
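For instance, a minimal R sketch (the 2×2 matrix below is hypothetical; it is merely chosen so that its two-step I-to-I probability reproduces the 0.6490 of the numerical example that follows):

> # hypothetical one-step transition matrix over states I and II
> P <- matrix(c(0.70, 0.30,
+               0.53, 0.47), nrow=2, byrow=TRUE,
+             dimnames=list(c("I", "II"), c("I", "II")))
> P2 <- P %*% P     # two-step transition matrix: P raised to the power 2
> P2["I", "I"]      # 0.70*0.70 + 0.30*0.53 = 0.6490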
Markov processes

The general case:
pij^(n) = (P^n)ij
is proven by induction to n.

One needs the Kolmogorov-Chapman equations:
pij^(n+m) = Σk pik^(n) pkj^(m),
for all n, m ≥ 0 and i, j = 1, …, s.
Markov processes
Proof of the Kolmogorov-Chapman equations:
pij^(n+m) = P(Xt+n+m=Ej | Xt=Ei)
          = Σk P(Xt+n+m=Ej, Xt+n=Ek | Xt=Ei)
          = Σk P(Xt+n+m=Ej | Xt+n=Ek) P(Xt+n=Ek | Xt=Ei)
          = Σk pkj^(m) pik^(n),
using the Markov property and time homogeneity in the third equality.
Markov processes
Induction proof of pij^(n) = (P^n)ij.
Assume pij^(n-1) = (P^(n-1))ij.
Then, by the Kolmogorov-Chapman equations with m = 1:
pij^(n) = Σk pik^(n-1) pkj = Σk (P^(n-1))ik (P)kj = (P^n)ij.
Markov processes
A numerical example over two states, I and II:
the two-step transition matrix P² is obtained from P by
matrix multiplication (“rows times columns”).

[Matrices P and P² shown; e.g., (P²)I,I = 0.6490.]
Markov processes
Thus:
0.6490 = P(Xt+2 = I | Xt = I) is composed of two probabilities:
the probability of the route I → I → I plus the probability of
the route I → II → I.

[Figure: the two paths from state I at position t to state I at position t+2, via I or via II at position t+1.]
Markov processes
In similar fashion we may obtain higher-step transition
probabilities: sum over the probabilities of all possible
paths between the 2 states.

[Figure: all possible paths through states I and II from position t to position t+5.]
Markov process
How do we sample from a Markov process?
Reconsider the DNA example.

[Figure: state diagram over A, C, G, T; table of the transition matrix P with rows “from” (A, C, G, T) and columns “to” (A, C, G, T).]
Markov process
1 Sample the first base from the initial distribution π:
  P(X1=A) = 0.45, P(X1=C) = 0.10, et cetera.
  Suppose we have drawn an A. Our DNA sequence
  now consists of one base, namely A.
  After the first step, our sequence looks like:
  A

2 Sample the next base using P, given the previous base.
  Here X1=A, thus we sample from:
  P(X2=A | X1=A) = 0.1, P(X2=C | X1=A) = 0.1, et cetera.
  In case X1=C, we would have sampled from:
  P(X2=A | X1=C) = 0.2, P(X2=C | X1=C) = 0.3, et cetera.
Markov process
2 After the second step:
  AT

3 The last step is iterated until the last base.
  After the third step:
  ATC

9 After nine steps:
  ATCCGATGC
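A minimal R sketch of this sampling scheme (rows A and C of the matrix below use the probabilities quoted above; the remaining entries of π and P are made-up placeholders):

> bases <- c("A", "C", "G", "T")
> pi0 <- c(0.45, 0.10, 0.20, 0.25)         # 0.45 and 0.10 from the slides; rest assumed
> P <- matrix(c(0.10, 0.10, 0.40, 0.40,    # row A: 0.1, 0.1 as quoted; rest assumed
+               0.20, 0.30, 0.30, 0.20,    # row C: 0.2, 0.3 as quoted; rest assumed
+               0.30, 0.20, 0.20, 0.30,    # rows G and T: assumed
+               0.25, 0.25, 0.25, 0.25),
+             nrow=4, byrow=TRUE, dimnames=list(bases, bases))
> x <- sample(bases, 1, prob=pi0)          # step 1: first base from pi
> for (i in 2:9) {                         # steps 2, 3, ...: next base given previous one
+   x[i] <- sample(bases, 1, prob=P[x[i-1], ])
+ }
> paste(x, collapse="")                    # nine steps yield a sequence of nine bases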
Andrey Markov
1856–1922
Andrey Markov

In the famous first application of Markov chains, Andrey
Markov studied the sequence of 20,000 letters in Pushkin’s
poem “Eugeny Onegin”, discovering that
- the stationary vowel probability is 0.432,
- the probability of a vowel following a vowel is 0.128, and
- the probability of a vowel following a consonant is 0.663.

Markov also studied the sequence of 100,000 letters in
Aksakov’s novel “The Childhood of Bagrov, the Grandson”.

This was in 1913, long before the computer age!

Basharin et al. (2004)


Parameter estimation
[Background figure: a long DNA sequence.]

How do we estimate the parameters of a Markov chain from the data?

Maximum likelihood!
Parameter estimation

Using the definition of conditional probability, the
likelihood can be decomposed as follows:
L = P(X1=x1, …, XT=xT)
  = P(X1=x1) P(X2=x2 | X1=x1) ··· P(XT=xT | XT-1=xT-1, …, X1=x1),
where x1, …, xT in S = {E1, …, Es}.


Parameter estimation

Using the Markov property, we obtain:
L = P(X1=x1) ∏t=2,…,T P(Xt=xt | Xt-1=xt-1).


Parameter estimation

Furthermore, e.g.:
P(Xt=xt | Xt-1=xt-1) = ∏i,j pij^1{xt-1=Ei, xt=Ej},
where only one transition probability at a time enters the
likelihood, due to the indicator function.
Parameter estimation

The likelihood then becomes:
L = πx1 ∏i,j pij^nij,
where, e.g., nAC is the number of times an A is directly
followed by a C in the observed sequence.
Parameter estimation

Now note that, e.g.,
pAA + pAC + pAG + pAT = 1.
Or,
pAT = 1 - pAA - pAC - pAG.

Substitute this in the likelihood, and take the logarithm
to arrive at the log-likelihood.

Note: it is irrelevant which transition probability is
substituted.
Parameter estimation

The log-likelihood:
log L = log πx1 + Σi,j nij log pij,
with, e.g., pAT replaced by 1 - pAA - pAC - pAG.
Parameter estimation

Differentiating the log-likelihood yields, e.g.:
∂ log L / ∂ pAA = nAA / pAA - nAT / (1 - pAA - pAC - pAG).

Equate the derivatives to zero. This yields four systems of
equations to solve, e.g.:
nAA / pAA = nAT / pAT, nAC / pAC = nAT / pAT, nAG / pAG = nAT / pAT.
Parameter estimation
The transition probabilities are then estimated by:
p̂ij = nij / Σk nik.

Verify that the 2nd order partial derivatives of the
log-likelihood are negative!
Parameter estimation

If one may assume the observed data is a realization of a
stationary Markov chain, the initial distribution is estimated
by the stationary distribution.

If only one realization of a stationary Markov chain is
available and stationarity cannot be assumed, the initial
distribution is estimated by:
π̂i = 1{x1 = Ei},
i.e., all probability mass on the observed first state.
Parameter estimation

Thus, for the following sequence: ATCGATCGCA,
tabulation yields the transition counts nij (tabulated in the
R session below).

The maximum likelihood estimates thus become, e.g.:
p̂AT = 2/2 = 1, p̂CG = 2/3, p̂CA = 1/3, p̂GA = p̂GC = 1/2.


Parameter estimation

In R:
> DNAseq <- c("A", "T", "C", "G", "A",
+             "T", "C", "G", "C", "A")
> table(DNAseq)
DNAseq
A C G T
3 3 2 2
> table(DNAseq[1:9], DNAseq[2:10])
    A C G T
  A 0 0 0 2
  C 1 0 2 0
  G 1 1 0 0
  T 0 2 0 0
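The maximum likelihood estimates then follow by normalizing each row of this transition-count table (a small sketch continuing the session above):

> # MLE of P: transition counts normalized per row
> transCounts <- table(DNAseq[1:9], DNAseq[2:10])
> Phat <- prop.table(transCounts, margin=1)
> Phat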
Testing the order of
the Markov chain
Testing the order of a Markov chain
Often the order of the Markov chain is unknown and
needs to be determined from the data. This is done using
a Chi-square test for independence.

Consider a DNA-sequence. To assess the order, one first
tests whether the sequence consists of independent
letters. Hereto count the nucleotides, di-nucleotides and
tri-nucleotides in the sequence.
Testing the order of a Markov chain
The null hypothesis of independence between the letters
of the DNA sequence is evaluated by the χ2 statistic:
χ2 = Σi,j (nij - eij)² / eij, with expected counts eij = ni nj / (T-1),
where nij counts the di-nucleotide EiEj, ni the nucleotide Ei,
and T-1 is the number of di-nucleotides (cf. the R code below).
The statistic has (4-1) x (4-1) = 9 degrees of freedom.

If the null hypothesis cannot be rejected, one would fit a
Markov model of order m=0. However, if H0 can be
rejected, one would test for higher order dependence.
Testing the order of a Markov chain
To test for 1st order dependence, consider how often a
di-nucleotide, e.g., CT, is followed by, e.g., a T.

The 1st order dependence hypothesis is evaluated by:
χ2 = Σij,k (nijk - eijk)² / eijk,
where nijk counts the tri-nucleotide EiEjEk and eijk is its
expected count under the null hypothesis.

This is χ2 distributed with (16-1) x (4-1) = 45 d.o.f.
Testing the order of a Markov chain
A 1st order Markov chain provides a reasonable description of the
sequence of the hlyE gene of the E.coli bacterium.
Independence test:
Chi-sq stat: 22.45717, p-value: 0.00754
1st order dependence test:
Chi-sq stat: 55.27470, p-value: 0.14025

The sequence of the prrA gene of the E.coli bacterium requires a
higher order Markov chain model.
Independence test:
Chi-sq stat: 33.51356, p-value: 1.266532e-04
1st order dependence test:
Chi-sq stat: 114.56290, p-value: 5.506452e-08
Testing the order of a Markov chain
In R (the independence case, assuming the DNAseq-object is
a character-object containing the sequence):
> # calculate nucleotide and dimer frequencies
> nuclFreq <- matrix(table(DNAseq), ncol=1)
> dimerFreq <- table(DNAseq[1:(length(DNAseq)-1)],
+                    DNAseq[2:length(DNAseq)])

> # calculate expected dimer frequencies
> dimerExp <- nuclFreq %*% t(nuclFreq) / (length(DNAseq)-1)

> # calculate test statistic and p-value
> teststat <- sum((dimerFreq - dimerExp)^2 / dimerExp)
> pval <- exp(pchisq(teststat, 9, lower.tail=FALSE, log.p=TRUE))

Exercise: modify the code above for the 1st order test.
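One possible sketch of that modification (assuming all 16 dimers occur in the sequence, so that the contingency table is complete; rows are the preceding dimers, columns the following letter):

> # observed tri-nucleotide counts: preceding dimer (rows) vs next letter (columns)
> dimers <- paste0(DNAseq[1:(length(DNAseq)-2)],
+                  DNAseq[2:(length(DNAseq)-1)])
> trimerFreq <- table(dimers, DNAseq[3:length(DNAseq)])

> # expected counts when the next letter is independent of the preceding dimer
> trimerExp <- rowSums(trimerFreq) %*% t(colSums(trimerFreq)) / sum(trimerFreq)

> # test statistic and p-value with (16-1) x (4-1) = 45 d.o.f.
> teststat1 <- sum((trimerFreq - trimerExp)^2 / trimerExp)
> pval1 <- exp(pchisq(teststat1, 45, lower.tail=FALSE, log.p=TRUE))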
Application:
sequence discrimination
Application: sequence discrimination

We have fitted 1st order Markov chain models to a
representative coding and noncoding sequence of the
E.coli bacterium.

Pcoding:                        Pnoncoding:
     A     C     G     T             A     C     G     T
A  0.321 0.257 0.211 0.211     A  0.320 0.278 0.231 0.172
C  0.319 0.207 0.266 0.208     C  0.295 0.205 0.286 0.214
G  0.259 0.284 0.237 0.219     G  0.241 0.261 0.233 0.265
T  0.223 0.243 0.309 0.225     T  0.283 0.238 0.256 0.223

These models can be used to discriminate between
coding and noncoding sequences.
Application: sequence discrimination

For a new sequence X with unknown function, calculate the
likelihood under the coding and the noncoding model:
L(X; Pcoding) and L(X; Pnoncoding).

These likelihoods are compared by means of their log ratio:
LR(X) = log[ L(X; Pcoding) / L(X; Pnoncoding) ],
the log-likelihood ratio.
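A minimal R sketch of this comparison (the matrices are those fitted above; for simplicity the contribution of the initial base is ignored, an assumption):

> bases <- c("A", "C", "G", "T")
> Pcoding <- matrix(c(0.321, 0.257, 0.211, 0.211,
+                     0.319, 0.207, 0.266, 0.208,
+                     0.259, 0.284, 0.237, 0.219,
+                     0.223, 0.243, 0.309, 0.225),
+                   nrow=4, byrow=TRUE, dimnames=list(bases, bases))
> Pnoncoding <- matrix(c(0.320, 0.278, 0.231, 0.172,
+                        0.295, 0.205, 0.286, 0.214,
+                        0.241, 0.261, 0.233, 0.265,
+                        0.283, 0.238, 0.256, 0.223),
+                      nrow=4, byrow=TRUE, dimnames=list(bases, bases))
> # log-likelihood of a sequence x under transition matrix P (initial base ignored)
> seqLogLik <- function(x, P) sum(log(P[cbind(x[-length(x)], x[-1])]))
> LR <- seqLogLik(DNAseq, Pcoding) - seqLogLik(DNAseq, Pnoncoding)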


Application: sequence discrimination

If the log-likelihood ratio LR(X) of a new sequence exceeds a
certain threshold, the sequence is classified as coding,
and as noncoding otherwise.

Back to E.coli
To illustrate the potential of the discrimination approach,
we calculate the log-likelihood ratio for a large set of known
coding and noncoding sequences. The distributions of the
two sets of LR(X)’s are compared.
Application: sequence discrimination

[Figure: overlaid histograms of the log-likelihood ratios (x-axis) and their frequencies (y-axis) for noncoding and coding sequences.]


Application: sequence discrimination
[Figure: the same two distributions, flipped and mirrored into a violin plot, noncoding versus coding.]
Application: sequence discrimination

Conclusion E.coli example

Comparison of the distributions indicates that one could
discriminate reasonably between coding and noncoding
sequences on the basis of a simple 1st order Markov model.

Improvements:
- Higher order Markov model.
- Incorporate more structural information of the DNA.
References &
further reading
References and further reading

Basharin, G.P., Langville, A.N., Naumov, V.A. (2004),
“The life and work of A.A. Markov”, Linear Algebra
and its Applications, 386, 3-26.
Ewens, W.J., Grant, G. (2006), Statistical Methods for
Bioinformatics, Springer, New York.
Reinert, G., Schbath, S., Waterman, M.S. (2000),
“Probabilistic and statistical properties of words: an
overview”, Journal of Computational Biology, 7, 1-46.
This material is provided under the Creative Commons
Attribution/Share-Alike/Non-Commercial License.

See http://www.creativecommons.org for details.
