
Random Walks

Presented By Cindy Xiaotong Lin


Why Random Walks?
A random walk (RW) is a useful model for
understanding stochastic processes across
a variety of scientific disciplines.

Random walk theory supplies the basic
probability theory behind BLAST (the most
widely used sequence alignment method).

What is a Random Walk?
Intuitive understanding: a series of movements
whose direction and size are randomly decided
(e.g., the path a drunk person leaves behind).
Formal definition: Let $X_0$ be a fixed vector in the
d-dimensional Euclidean space $\mathbb{R}^d$ and
$\{X_n, n \ge 1\}$ a sequence of independent, identically
distributed (i.i.d.) random vectors in $\mathbb{R}^d$. The
discrete-time stochastic process $S = \{S_n : n \ge 1\}$ defined by

$$S_n = X_0 + X_1 + \cdots + X_n$$

is called a d-dimensional random walk.

Definitions (cont.)
If $X_0 \in I^d$ and the RVs $X_n$ take values in $I^d$
(the d-dimensional integer lattice), then
$\{S_n, n \ge 1\}$ is called a d-dimensional
lattice random walk.
In the lattice walk case, if we only allow
jumps from $X = (x_1, \ldots, x_d)$ to
$Y = (x_1 + \epsilon_1, \ldots, x_d + \epsilon_d)$,
where $\epsilon_k = -1$ or $\epsilon_k = 1$, then the
process is called a d-dimensional simple
random walk.
Definitions (cont.)
A random walk is called a restricted
walk if it is confined to the interval
[a, b].
The endpoints a and b are called
absorbing barriers if the walk stays
there forever once it reaches them,
or reflecting barriers if the walk reaches
the endpoint and bounces back.
Example: sequence alignment modeled as RW
| |  |  ||| ||        |||
ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc

Simple scoring scheme, applied at each position:
+1 if the two nucleotides are the same
-1 if the two nucleotides are different
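
The scoring scheme can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and the walk is assumed to start at 0 before the first position.

```python
from itertools import accumulate

seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"

# +1 for a match, -1 for a mismatch at each aligned position
steps = [1 if x == y else -1 for x, y in zip(seq1, seq2)]

# Accumulated score after each position; the walk is assumed to start at 0
walk = [0] + list(accumulate(steps))

print(steps.count(1), "matches,", steps.count(-1), "mismatches")
print(walk)
```

The list `walk` is the path of the simple random walk: each match moves it up one unit and each mismatch moves it down one unit.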

Example (cont.): simple RW
Ladder point (LP): a point in the walk lower
than any previously reached point (the starting
point also counts as a ladder point).
Excursion: the part of the walk from an LP until
the highest point attained before the next LP.
Excursion heights in the figure: 1, 1, 4, 0, 0, 0, 3.

BLAST theory focuses on the maximum heights
achieved by these excursions.
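
The excursion heights can be reproduced from the example alignment with a short sketch. It assumes the +1/-1 scoring scheme, treats the starting point as the first ladder point, and counts the part of the walk after the last ladder point as a final excursion; the function name is illustrative.

```python
def excursion_heights(steps):
    """Return the maximum height of each excursion above its ladder point."""
    heights = []
    level = 0    # current value of the walk
    ladder = 0   # value at the most recent ladder point (the start counts as one)
    peak = 0     # highest value reached since that ladder point
    for s in steps:
        level += s
        if level < ladder:
            # New ladder point: close the current excursion and start a new one
            heights.append(peak - ladder)
            ladder = peak = level
        else:
            peak = max(peak, level)
    heights.append(peak - ladder)  # excursion after the last ladder point
    return heights

seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"
steps = [1 if x == y else -1 for x, y in zip(seq1, seq2)]

print(excursion_heights(steps))  # [1, 1, 4, 0, 0, 0, 3]
```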

Example (cont.): General RW
Consider an arbitrary scoring scheme
(e.g. a substitution matrix)
Primary Study of RW: 1-d simple RW
RW: Consider a 1-d simple RW starting at h,
restricted to the interval [a, b], where a
and b are absorbing barriers,
$\Pr(X_n = 1) = p$ and $\Pr(X_n = -1) = q$.

Problems: I. (Absorption probabilities) What is the
probability that the walk eventually
finishes at b (or a) rather than a (or
b), i.e., $w_h$ (or $u_h$)?
II. What is the mean number of steps
taken until the walk stops ($m_h$)?

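
Both questions can be explored empirically before any theory is applied. The following Monte Carlo sketch assumes q = 1 - p; the parameter values, the seed, and the function name are arbitrary choices for illustration.

```python
import random

def simulate(h, a, b, p, rng):
    """Run one walk from h until it is absorbed at a or b.

    Returns (endpoint, number_of_steps). Assumes q = 1 - p.
    """
    pos, n = h, 0
    while a < pos < b:
        pos += 1 if rng.random() < p else -1
        n += 1
    return pos, n

rng = random.Random(0)          # fixed seed so the run is repeatable
h, a, b, p = 2, 0, 5, 0.4       # illustrative values only
trials = 20000

results = [simulate(h, a, b, p, rng) for _ in range(trials)]
w_hat = sum(end == b for end, _ in results) / trials
m_hat = sum(n for _, n in results) / trials

print(f"estimated Pr(absorbed at b) = {w_hat:.3f}")
print(f"estimated mean number of steps = {m_hat:.2f}")
```

The two estimates approximate the absorption probability at b and the mean number of steps for this particular choice of h, a, b, and p.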
Methods
The Difference Equation Approach
Classical

The Moment-Generating Function Approach
Readily generalizes to more complicated
walks
Difference Equation Approach (M1)
Let $w_h$ be the probability that the simple
random walk eventually finishes (is absorbed)
at b.
Difference equation, obtained by comparing
the situation just before and after the first
step of the walk:

$$w_h = p\,w_{h+1} + q\,w_{h-1} \qquad (7.4)$$

Boundary conditions:

$$w_a = 0, \quad w_b = 1 \qquad (7.5)$$

M1 (cont.): solutions
Solve Eq. (7.4) using the theory of homogeneous
difference equations.
When $p \ne q$:

$$w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}, \qquad \theta^* = \log\frac{q}{p}$$

The same procedure can be used to obtain the
probability that the walk ends at a:

$$u_h = \frac{e^{\theta^* b} - e^{\theta^* h}}{e^{\theta^* b} - e^{\theta^* a}}$$

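
A quick numerical check, offered as a sketch: the closed form $w_h = (e^{\theta^* h} - e^{\theta^* a}) / (e^{\theta^* b} - e^{\theta^* a})$ with $\theta^* = \log(q/p)$ should satisfy the difference equation $w_h = p\,w_{h+1} + q\,w_{h-1}$ and the conditions $w_a = 0$, $w_b = 1$. The parameter values are illustrative and q = 1 - p is assumed.

```python
import math

def w(h, a, b, p, q):
    """Closed-form probability of absorption at b, starting from h (p != q)."""
    theta = math.log(q / p)
    return (math.exp(theta * h) - math.exp(theta * a)) / (
        math.exp(theta * b) - math.exp(theta * a))

a, b, p = 0, 5, 0.4   # illustrative values only
q = 1 - p             # assumption: the walk always moves up or down

print("w_a =", w(a, a, b, p, q), " w_b =", w(b, a, b, p, q))
for h in range(a + 1, b):
    lhs = w(h, a, b, p, q)
    rhs = p * w(h + 1, a, b, p, q) + q * w(h - 1, a, b, p, q)
    # lhs and rhs should agree up to floating-point rounding
    print(h, round(lhs, 6), round(rhs, 6))
```

Because the walk is absorbed at a or b with probability 1, the probability of ending at a is `1 - w(h, a, b, p, q)`.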
M1 (cont.): mean number of steps
Difference equation:

$$m_h = p\,m_{h+1} + q\,m_{h-1} + 1$$

Boundary conditions:

$$m_a = m_b = 0$$

Solution:

$$m_h = \frac{h-a}{q-p} - \frac{b-a}{q-p} \cdot \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}$$

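
As a sanity check, the difference equation $m_h = p\,m_{h+1} + q\,m_{h-1} + 1$ with $m_a = m_b = 0$ can be solved by repeated substitution and compared with the closed form $m_h = \frac{h-a}{q-p} - \frac{b-a}{q-p} w_h$. This is a minimal sketch: the iterative solver is a generic fixed-point iteration chosen for brevity, not a method taken from the slides, and q = 1 - p is assumed.

```python
import math

def mean_steps_closed_form(h, a, b, p, q):
    """Closed-form mean number of steps until absorption (p != q)."""
    theta = math.log(q / p)
    w = (math.exp(theta * h) - math.exp(theta * a)) / (
        math.exp(theta * b) - math.exp(theta * a))
    return (h - a) / (q - p) - (b - a) / (q - p) * w

def mean_steps_iterative(a, b, p, q, sweeps=2000):
    """Solve m_h = p*m_{h+1} + q*m_{h-1} + 1 by repeated sweeps."""
    m = {h: 0.0 for h in range(a, b + 1)}   # m[a] and m[b] stay at 0
    for _ in range(sweeps):
        for h in range(a + 1, b):
            m[h] = p * m[h + 1] + q * m[h - 1] + 1
    return m

a, b, p = 0, 5, 0.4   # illustrative values only
q = 1 - p
m = mean_steps_iterative(a, b, p, q)
for h in range(a, b + 1):
    print(h, round(m[h], 4), round(mean_steps_closed_form(h, a, b, p, q), 4))
```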
Moment-Generating function Approach (M2)
Recall the definition of the mgf of a random variable Y:

$$m(\theta) = E(e^{\theta Y}) = \sum_y e^{\theta y} P_Y(y)$$

In our case, the mgf of the random variable $X_n$ is:

$$m(\theta) = q e^{-\theta} + p e^{\theta}$$

According to Theorem 1.1, there exists a unique
nonzero value $\theta^*$ of $\theta$ such that

$$m(\theta^*) = 1 \qquad (7.12)$$

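
For the simple walk this root can be written down explicitly. As a short check, assuming $p + q = 1$ and $p \ne q$ (the first assumption is not stated on the slides), the value $\theta^* = \log(q/p)$ satisfies $m(\theta^*) = 1$ for the step mgf $m(\theta) = q e^{-\theta} + p e^{\theta}$:

```latex
\begin{align*}
m(\theta^*) &= q\,e^{-\log(q/p)} + p\,e^{\log(q/p)} \\
            &= q \cdot \frac{p}{q} + p \cdot \frac{q}{p} \\
            &= p + q = 1 .
\end{align*}
```

Since $p \ne q$, $\theta^* \ne 0$, so this is the nonzero root.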
M2 (cont.)
The mgf of the total displacement after N steps is, from
(2.17), $m(\theta)^N$; at $\theta = \theta^*$ this gives

$$\left(q e^{-\theta^*} + p e^{\theta^*}\right)^N = 1, \quad (N > 0)$$

When the walk has just finished, the total displacement is
either $b - h$ or $a - h$, with probabilities
$w_h$ or $u_h = 1 - w_h$ respectively:

$$w_h e^{\theta^*(b-h)} + (1 - w_h) e^{\theta^*(a-h)} = 1$$

M2 (cont.)
Therefore, we have

$$w_h e^{\theta^*(b-h)} + (1 - w_h) e^{\theta^*(a-h)} = 1$$

Thus,

$$w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}$$

which is identical to (7.9), the solution from the
difference equation approach.
M2 (cont.): Mean number of steps until the walk stops

Let the total displacement after N steps be

$$T_N = \sum_{j=1}^{N} S_j$$

Theorem 7.1 (Wald's Identity) states:

$$E\left(m(\theta)^{-N} e^{\theta T_N}\right) = 1$$

Differentiating both sides with respect to $\theta$ and setting $\theta = 0$, we obtain

$$E(T_N) = E(S)\, m_h$$

M2 (cont.)
Equation (7.24),

$$E(T_N) = E(S)\, m_h, \qquad (7.24)$$

states that the mean value of the final total
displacement of the walk is the mean size of each
step multiplied by the mean number of steps taken
until the walk stops.

The mean displacement when the walk stops:

$$E(T_N) = w_h (b - h) + u_h (a - h)$$

The mean step size:

$$E(S) = p - q$$

M2 (cont.)
The mean number of steps until the walk
stops is

$$m_h = \frac{w_h (b - h) + u_h (a - h)}{p - q}$$

which agrees with the result from the
difference equation approach.
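
The agreement can be checked directly. A short derivation, assuming $u_h = 1 - w_h$ (the walk is absorbed at a or b with probability 1) and writing the difference-equation solution as $m_h = \frac{h-a}{q-p} - \frac{b-a}{q-p}\, w_h$:

```latex
\begin{align*}
m_h &= \frac{w_h (b-h) + (1 - w_h)(a-h)}{p-q} \\
    &= \frac{(b-a)\,w_h - (h-a)}{p-q} \\
    &= \frac{h-a}{q-p} - \frac{b-a}{q-p}\, w_h .
\end{align*}
```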
An asymptotic case: the walk BLAST is concerned with
The walks BLAST is concerned with have

$$h = 0; \quad a = -1; \quad b = \infty$$

i.e., a walk without an upper boundary that ends at -1.

Applying the previous results and letting $b \to \infty$ (with $q > p$),
we get the following asymptotic results:
The probability distribution of the maximum value
that the walk ever achieves before reaching -1 has
a geometric-like form.
The mean number of steps until the walk stops is

$$m_0 = \frac{1}{q - p}$$

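
The geometric-like behaviour can be seen numerically for the simple walk. The sketch below uses the absorption probability from the difference equation approach with h = 0, a = -1, b = y, which gives the probability that the walk reaches height y before -1, and compares it with $C e^{-\theta^* y}$ where $C = 1 - e^{-\theta^*}$. The values of p and q are illustrative and q = 1 - p is assumed.

```python
import math

p, q = 0.3, 0.7                 # illustrative values with q > p
theta = math.log(q / p)         # theta* for the simple walk
C = 1 - math.exp(-theta)        # limiting constant for this special case

def prob_reach(y):
    """Pr(walk from 0 reaches height y before -1): w_h with h=0, a=-1, b=y."""
    return (1 - math.exp(-theta)) / (math.exp(theta * y) - math.exp(-theta))

for y in (1, 2, 5, 10, 20):
    ratio = prob_reach(y) / (C * math.exp(-theta * y))
    print(y, prob_reach(y), ratio)   # ratio tends to 1 as y grows

print("mean number of steps m_0 =", 1 / (q - p))
```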
General Walk
Suppose generally the possible step sizes are
$-c, -c+1, \ldots, 0, \ldots, d-1, d$ and their respective
probabilities are $p_{-c}, p_{-c+1}, \ldots, p_d$.

The mean step size is negative, i.e.,

$$E(S) = \sum_{j=-c}^{d} j\,p_j < 0$$

The mgf of S (the step size) is

$$m(\theta) = \sum_{j=-c}^{d} p_j e^{j\theta}$$

General Walk (cont.)
According to Theorem 1.1, there exists a unique
positive $\theta^*$ such that

$$\sum_{j=-c}^{d} p_j e^{j\theta^*} = 1$$

To consider the walk that starts at 0, with stopping
boundary at -1 and without upper boundary, impose
an artificial barrier at $y > 0$.
The possible stopping points are
$-c, -c+1, \ldots, -1, y, y+1, \ldots, y+d-1$.

Wald's Identity states

$$E\left(e^{\theta^* T_N}\right) = 1$$

where $T_N$ is the total displacement
when the walk stops.
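
For a general step distribution $\theta^*$ usually has no closed form, but the equation $\sum_j p_j e^{j\theta^*} = 1$ can be solved numerically. A minimal sketch using bisection follows; the function names and the example step distribution are hypothetical, and the code assumes a negative mean step and at least one positive step size.

```python
import math

def mgf(theta, probs):
    """m(theta) = sum_j p_j * exp(j * theta); probs maps step size j -> p_j."""
    return sum(pj * math.exp(j * theta) for j, pj in probs.items())

def solve_theta_star(probs, tol=1e-12):
    """Find the positive root of m(theta) = 1 by bisection."""
    lo, hi = 0.0, 1.0
    # m is convex with m(0) = 1 and m'(0) = E(S) < 0, so m < 1 just to the
    # right of 0; grow hi until m(hi) > 1 to bracket the positive root.
    while mgf(hi, probs) <= 1.0:
        lo = hi
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mgf(mid, probs) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical step distribution with c = 2, d = 2 and mean -0.4
probs = {-2: 0.2, -1: 0.4, 0: 0.1, 1: 0.2, 2: 0.1}
theta_star = solve_theta_star(probs)
print(theta_star, mgf(theta_star, probs))

# For the simple walk the root should match log(q/p)
print(solve_theta_star({-1: 0.7, 1: 0.3}), math.log(0.7 / 0.3))
```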
General Walk
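Thus,

$$\sum_{k=-c}^{-1} P_k e^{k\theta^*} + \sum_{k=y}^{y+d-1} P_k e^{k\theta^*} = 1$$

where $P_k$ is the probability that the walk
finishes at the point k.
The mean number of steps until the walk
stops, $m_0$ or $A$, would be

$$A = \frac{E(T_N)}{E(S)} = \frac{-\sum_{j=1}^{c} j\,R_{-j}}{\sum_{j=-c}^{d} j\,p_j}$$

where $R_{-j}$ is the probability that the first negative
value reached by the walk is $-j$.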
General Walk: unrestricted
Objective: Find the probability distribution of the
maximum value that the walk ever achieves before
reaching -1 or lower.
Define:
$F_{Y_{unr}}(y)$: the probability that, in the unrestricted walk,
the maximum upward excursion is y or less;
$Q_k$: the probability that the walk visits the
positive value k before reaching any other positive
value.

General Walk: unrestricted
Therefore,

$$F_{Y_{unr}}(y) = \bar{Q} + \sum_{k=1}^{y} Q_k F_{Y_{unr}}(y - k); \qquad \bar{Q} = 1 - Q_1 - Q_2 - \cdots - Q_d$$

(with $Q_k = 0$ for $k > d$).

The event that in the unrestricted walk the maximum upward excursion is y or less is
the union of the event that the walk never reaches positive values and
the events that the first positive value achieved is k, k = 1, 2, …, y, and the
walk then never achieves a further height exceeding y - k.

Applying the Renewal Theorem, we have

$$\lim_{y \to +\infty} \left(1 - F_{Y_{unr}}(y)\right) e^{\theta^* y} = V, \qquad V = f(\bar{Q}, Q_k, \theta^*)$$

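
The recursion $F(y) = \bar{Q} + \sum_{k=1}^{y} Q_k F(y-k)$ is easy to iterate. The sketch below does so for the simple walk, where d = 1 and $Q_1 = p/q$ (the probability that a downward-drifting simple walk ever reaches +1), and shows that $(1 - F(y))\,e^{\theta^* y}$ settles to a constant. The function name and parameter values are illustrative.

```python
import math

def unrestricted_cdf(y_max, Q):
    """Iterate F(y) = Qbar + sum_{k=1..y} Q_k * F(y - k) for y = 0..y_max.

    Q[k - 1] holds Q_k for k = 1..d; Q_k is taken to be 0 for k > d.
    """
    Qbar = 1.0 - sum(Q)
    F = []
    for y in range(y_max + 1):
        upper = min(y, len(Q))
        F.append(Qbar + sum(Q[k - 1] * F[y - k] for k in range(1, upper + 1)))
    return F

p, q = 0.3, 0.7            # illustrative simple walk with q > p
r = p / q                  # Q_1 for the simple walk
theta = math.log(q / p)    # theta* for the simple walk

F = unrestricted_cdf(10, [r])
for y in (0, 1, 5, 10):
    print(y, F[y], (1 - F[y]) * math.exp(theta * y))
```

In this special case the recursion gives $F(y) = 1 - (p/q)^{y+1}$, so the product in the last column equals $p/q$ for every y.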
General Walk: restricted
Consider a general walk starting at 0, with a lower barrier at -1.
The size of an excursion of the unrestricted walk can
reach the value y either before or after the walk first
reaches a negative value, i.e.,

$$F^*_{Y_{unr}}(y) = F^*_Y(y) + \sum_{j=1}^{c} R_{-j} F^*_{Y_{unr}}(y + j)$$

where $F^*_Y(y)$ is the probability that the size of an
excursion in the restricted walk is at least y,
$F^*_{Y_{unr}}(y)$ is the corresponding probability for the
unrestricted walk, and $R_{-j}$ is the probability that the
first negative value reached by the walk is $-j$.
General Walk: restricted
Then,

$$F^*_Y(y) = \Pr(Y \ge y) \sim C e^{-\theta^* y},$$

$$C = \frac{\bar{Q}\left(1 - \sum_{j=1}^{c} R_{-j} e^{-j\theta^*}\right)}{\left(1 - e^{-\theta^*}\right) \sum_{k=1}^{d} k\,Q_k e^{k\theta^*}}$$

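
As a sanity check on the constant, the sketch below evaluates $C = \bar{Q}\,(1 - \sum_j R_{-j} e^{-j\theta^*}) \,/\, ((1 - e^{-\theta^*}) \sum_k k\,Q_k e^{k\theta^*})$ for the simple walk, where c = d = 1, $R_{-1} = 1$ and $Q_1 = p/q$. In that case C should reduce to $1 - e^{-\theta^*} = 1 - p/q$. The function name is illustrative, and the formula is used as read from the slide.

```python
import math

def geometric_like_C(theta, Q, R):
    """Evaluate the constant C of the geometric-like tail.

    Q[k - 1] holds Q_k for k = 1..d and R[j - 1] holds R_{-j} for j = 1..c.
    """
    Qbar = 1.0 - sum(Q)
    numerator = Qbar * (1.0 - sum(
        Rj * math.exp(-j * theta) for j, Rj in enumerate(R, start=1)))
    denominator = (1.0 - math.exp(-theta)) * sum(
        k * Qk * math.exp(k * theta) for k, Qk in enumerate(Q, start=1))
    return numerator / denominator

p, q = 0.3, 0.7              # illustrative simple walk with q > p
theta = math.log(q / p)      # theta* for the simple walk
C = geometric_like_C(theta, Q=[p / q], R=[1.0])
print(C, 1 - p / q)
```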
Application: BLAST
BLAST is the most frequently used method for
assessing which DNA or protein sequences in a large
database have significant similarity to a given query
sequence.
It is a procedure that searches for high-scoring local
alignments between sequences and then tests the
significance of the scores found via a P-value.

The null hypothesis to be tested is that for each
aligned pair of amino acids, the two amino acids
were generated by independent mechanisms.
BLAST (cont.) : modeling
The positions in the alignment are numbered from
left to right as 1, 2, …, N. A score S(j, k) is allocated
to each position where the aligned amino acid pair
(j, k) is observed, where S(j, k) is the (j, k) element of
the substitution matrix chosen.

An accumulated score at position i is calculated as
the sum of the scores for the various amino acid
comparisons at positions 1, 2, …, i. As i increases, the
accumulated score undergoes a random walk.
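
A minimal sketch of the accumulated score follows. The score table is a toy example made up for illustration (it is not BLOSUM or PAM), and the two short sequences are likewise hypothetical.

```python
from itertools import accumulate

# Hypothetical toy scores for three amino acids; illustration only
TOY_SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "L"): -1,
    ("G", "G"): 6, ("G", "L"): -4, ("L", "L"): 4,
}

def score(j, k):
    """Look up S(j, k), treating the table as symmetric."""
    if (j, k) in TOY_SCORES:
        return TOY_SCORES[(j, k)]
    return TOY_SCORES[(k, j)]

seq1 = "AGLLA"
seq2 = "AGALG"

scores = [score(j, k) for j, k in zip(seq1, seq2)]
walk = list(accumulate(scores))   # accumulated score at positions 1..N

print(scores)  # [4, 6, -1, 4, 0]
print(walk)    # [4, 10, 9, 13, 13]
```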
BLAST (cont.) : calculating parameters
Let $Y_1, Y_2, \ldots$ be the respective maximum heights of the
excursions of this walk after leaving one ladder point
and before arriving at the next, and let $Y_{max}$ be the
maximum of these maxima. $Y_{max}$ is in effect the test
statistic used in BLAST, so it is necessary to find its null
hypothesis distribution.

The asymptotic probability distribution of any $Y_i$ is
shown to be the geometric-like distribution. The values
of $C$ and $\theta^*$ in this distribution depend on the
substitution matrix used and the amino acid frequencies
$\{p_j\}$ and $\{p'_j\}$. The probability distribution of $Y_{max}$ also
depends on n, the mean number of ladder points in the
walk.
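
One rough way to see how n enters is sketched below. It rests on simplifying assumptions that are not taken from the slides: the excursion heights are treated as independent, the geometric-like tail $C e^{-\theta^* y}$ is used as if it were exact, and n is treated as a fixed count rather than a mean. The values of C and $\theta^*$ are hypothetical.

```python
import math

def tail_single(y, C, theta):
    """Approximate Pr(Y_i >= y) by the geometric-like tail C * exp(-theta * y)."""
    return C * math.exp(-theta * y)

def tail_max(y, C, theta, n):
    """Pr(Ymax >= y) for n independent excursions (n treated as fixed)."""
    return 1.0 - (1.0 - tail_single(y, C, theta)) ** n

C, theta = 0.5, 0.8    # hypothetical values, not from any real substitution matrix
for n in (1, 10, 100):
    print(n, tail_max(10, C, theta, n))
```

Under these assumptions the probability of seeing a maximum of at least y grows with the number of excursions, which is why a score that looks extreme for one excursion may be unremarkable over a long walk.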
Discussion
???
