Stochastic Processes
Jiahua Chen
Department of Statistics and Actuarial Science
University of Waterloo
© Jiahua Chen
Key Words: σ-field, Brownian motion, diffusion process, ergodic, finite dimensional distribution, Gaussian process, Kolmogorov equations, Markov property, martingale, probability generating function, recurrent, renewal theorem, sample path, simple random walk, stopping time, transient.
Stat 433/833 Course Outline
• The final grade is computed as
25% assignments + 25% midterm + 50% final.
• There will be 4 assignments.
• Oﬃce hours: to be announced.
• Oﬃce: MC6007 B, Ex 35506, jhchen@uwaterloo.ca
Textbook and References
The textbook for this course is Probability and Random Processes
by Grimmett and Stirzaker.
It is NOT essential to purchase the textbook. These course notes will be
available for free download to all registered students. We may not be able to cover all
the material included.
Stat 433 students will be required to work on fewer problems in assignments
and exams.
The lecture notes reflect the instructor's still immature understanding of
this general topic, which was formulated after reading pieces of the following
books. A request to reserve has been sent to the library for the books with call
numbers.
The references are in random order.
1. Introduction to Probability Models. (QA273 .R84) Sheldon M. Ross. Academic Press.
2. A Second Course in Stochastic Processes. S. Karlin and H. M. Taylor. Academic Press.
3. An Introduction to Probability Theory and Its Applications, Volumes I and II. W. Feller. Wiley.
4. Convergence of Probability Measures. P. Billingsley. Wiley.
5. Gaussian Random Processes. (QA274.4 .I2613) I. A. Ibragimov and Rozanov. Springer-Verlag, New York.
6. Introduction to Stochastic Calculus with Applications. Fima C. Klebaner. Imperial College Press.
7. Diffusions, Markov Processes and Martingales. (QA274.7 .W54) L. C. G. Rogers and D. Williams. Cambridge.
8. Probability Theory: Independence, Interchangeability, Martingales. Y. S. Chow and H. Teicher. Springer, New York.
Contents

1 Review and More
  1.1 Probability Space
  1.2 Random Variable
  1.3 More σ-fields and Related Concepts
  1.4 Multivariate Normal Distribution
  1.5 Summary
2 Simple Random Walk
  2.1 Counting sample paths
  2.2 Summary
3 Generating Functions
  3.1 Renewal Events
  3.2 Proof of the Renewal Theorem
  3.3 Properties
    3.3.1 Quick derivation of some generating functions
    3.3.2 Hitting time theorem
    3.3.3 Spitzer's Identity
    3.3.4 Leads for tied-down random walk
  3.4 Branching Process
  3.5 Summary
4 Discrete Time Markov Chain
  4.1 Classification of States and Chains
  4.2 Class Properties
    4.2.1 Properties of Markov Chains with a Finite Number of States
  4.3 Stationary Distribution
    4.3.1 Limiting Theorem
  4.4 Reversibility
5 Continuous Time Markov Chain
  5.1 Birth Processes and the Poisson Process
    5.1.1 Strong Markov Property
  5.2 Continuous time Markov chains
  5.3 Limiting probabilities
  5.4 Birth-death processes and imbedding
  5.5 Embedding
  5.6 Markov chain Monte Carlo
6 General Stochastic Process in Continuous Time
  6.1 Finite Dimensional Distribution
  6.2 Sample Path
  6.3 Gaussian Process
  6.4 Stationary Processes
  6.5 Stopping Times and Martingales
7 Brownian Motion or Wiener Process
  7.1 Existence of Brownian Motion
  7.2 Martingales of Brownian Motion
  7.3 Markov Property of Brownian Motion
  7.4 Exit Times and Hitting Times
  7.5 Maximum and Minimum of Brownian Motion
  7.6 Zeros of Brownian Motion and Arcsine Law
  7.7 Diffusion Processes
Chapter 1
Review and More
1.1 Probability Space
A probability space consists of three parts: sample space, a collection of
events, and a probability measure.
Assume an experiment is to be done. The set of all possible outcomes
is called the sample space Ω. Every element ω of Ω is called a sample point.
Mathematically, the sample space is merely an arbitrary set; no corresponding experiment is needed.
A probability measure is intended to be a function defined for all subsets of
Ω. This is not always possible when the probability measure is required to
have certain properties. Mathematically, we instead settle on a suitable collection of subsets
of Ω.
Definition 1.1
A collection of sets F is a σ-field (σ-algebra) if it satisfies
1. the empty set ∅ ∈ F;
2. if A_1, A_2, . . . ∈ F, then ∪_{i=1}^∞ A_i ∈ F (closed under countable union);
3. if A ∈ F, then its complement A^c ∈ F.
♦
Let us give a few examples.
Example 1.1
1. The simplest σ-field is {∅, Ω}.
2. The simplest nontrivial σ-field is {∅, A, A^c, Ω}.
3. The most exhaustive σ-field is the F consisting of all subsets of Ω.
♦
It can be shown that if A and B are two sets in a σ-field, then the resulting
sets of all commonly known operations between A and B are members of the
same σ-field.
Axioms of Probability Measure
Given a sample space Ω and a suitable σ-field F, a probability measure
P is a mapping from F to R (the set of real numbers) such that:
1. 0 ≤ P(A) ≤ 1 for all A ∈ F;
2. P(Ω) = 1;
3. P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) for any A_i ∈ F, i = 1, 2, . . ., which satisfy A_i ∩ A_j = ∅ whenever i ≠ j.
Mathematically, the above definition does not rely on any hypothetical
experiment. A probability space is given by (Ω, F, P). The above axioms
lead to restrictions on F. For example, suppose Ω = [0, 1] and F is the
σ-field that contains all possible subsets. In this case, if we require that the
probability of every closed interval equal its length, then it is impossible
to find a probability measure on F satisfying Axiom 3.
When the sample space Ω contains a finite number of elements, the collection
of all subsets forms a σ-field F. Defining P(A) as the ratio of the sizes of A
and of Ω gives a probability measure. This is the classical definition of
probability.
The axioms for a probability space imply that the probability measure
has many other properties not explicitly stated as axioms. For example, since
P(∅ ∪ ∅) = P(∅) + P(∅), we must have P(∅) = 0.
Axioms 2 and 3 imply that
1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c).
Hence, P(A^c) = 1 − P(A).
For any two events A_1 and A_2, we have
P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 A_2).
In general,
P(∪_{i=1}^n A_i) = Σ_i P(A_i) − Σ_{i_1 < i_2} P(A_{i_1} A_{i_2}) + · · · + (−1)^{n+1} P(∩_{i=1}^n A_i).
Lemma 1.1 Continuity of the probability measure.
Let (Ω, F, P) be a probability space. If A_1 ⊂ A_2 ⊂ A_3 ⊂ · · · is an increasing
sequence of events and
A = lim_{n→∞} ∪_{i=1}^n A_i = lim_{n→∞} A_n,
then
P(A) = lim_{n→∞} P(A_n).
Proof: Let B_n = A_n − A_{n−1}, n = 1, 2, 3, . . ., with A_0 = ∅. Then A_n = ∪_{i=1}^n B_i and A = ∪_{i=1}^∞ B_i. Notice that B_1, B_2, . . . are mutually exclusive, and P(B_n) = P(A_n) − P(A_{n−1}). Using countable additivity,
P(A) = Σ_{i=1}^∞ P(B_i) = lim_{n→∞} Σ_{i=1}^n P(B_i) = lim_{n→∞} P(A_n).
This completes the proof. ♦
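The continuity property can be illustrated numerically. Below is a minimal sketch (an illustration, not a proof): take Ω = [0, 1] with the uniform (Lebesgue) measure and the increasing events A_n = [0, 1 − 1/n], whose union is [0, 1) with probability 1.

```python
# Illustration of Lemma 1.1 on ([0,1], Lebesgue measure):
# A_n = [0, 1 - 1/n] is increasing and P(A_n) = 1 - 1/n -> 1 = P(lim A_n).

def prob_interval(a, b):
    """P([a, b]) under the uniform (Lebesgue) measure on [0, 1]."""
    return max(0.0, min(b, 1.0) - max(a, 0.0))

probs = [prob_interval(0.0, 1.0 - 1.0 / n) for n in range(1, 10001)]

# The sequence P(A_n) is non-decreasing and approaches 1.
assert all(p2 >= p1 for p1, p2 in zip(probs, probs[1:]))
assert abs(probs[-1] - 1.0) < 1e-3
```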
1.2 Random Variable
Definition 1.2
A random variable is a map X : Ω → R such that {X ≤ x} = {ω ∈ Ω :
X(ω) ≤ x} ∈ F for all x ∈ R. ♦
Notice that {X ≤ x} is shorthand for an event, a subset of Ω, rather than
for the outcome of an experiment.
The cumulative distribution function (c.d.f.) of a random variable X
is F_X(t) = P(X ≤ t). It is known that a c.d.f. is a non-decreasing, right-continuous function such that
F(−∞) = 0, F(∞) = 1.
The random behaviour of X is largely determined by its distribution
through the c.d.f. At the same time, there are many different ways to characterise
the distribution of a random variable. Here is an incomplete list:
1. If X is absolutely continuous, then the probability density function
(p.d.f.) is f(x) ≥ 0 such that
F(x) = ∫_{−∞}^x f(t) dt.
For almost all x, f(x) = dF(x)/dx.
2. If X is discrete, taking values in {x_1, x_2, . . .}, the probability mass
function of X is
f(x_i) = P(X = x_i) = F(x_i) − F(x_i−)
for i = 1, 2, . . ., where F(a−) = lim_{x→a, x<a} F(x).
3. The moment generating function is defined as
M(t) = E{exp(tX)}.
4. The characteristic function of X is defined as
φ(t) = E{exp(itX)},
where i is such that i² = −1. The advantage of the characteristic
function is that it exists for all X. On the other hand, the moment generating
function is simpler, as it is not built on the concept of complex numbers.
5. The probability generating function is useful when the random variable
is non-negative integer valued.
6. In survival analysis or reliability, we often use the survival function
S(x) = P(X > x) = 1 − F(x),
where we assume P(X ≥ 0) = 1.
7. The hazard function of X is
λ(x) = lim_{∆→0+} P{x ≤ X < x + ∆ | X ≥ x} / ∆.
It is also called the "instantaneous failure rate", as it is the rate at which
the individual fails in the next instant given that it has survived so far.
When X is absolutely continuous,
λ(x) = f(x)/S(x) = −(d/dx) log S(x).
Hence, for x ≥ 0,
S(x) = exp{−∫_0^x λ(t) dt}.
Let us examine a few examples to illustrate the hazard function and its
implication.
Example 1.2
(1) When X has an exponential distribution with density function f(x) =
λ exp(−λx), it has constant hazard rate λ.
(2) When X has a Weibull distribution with density function
f(x) = λα(λx)^{α−1} exp{−(λx)^α}
for x ≥ 0, the hazard function is given by
λ(x) = λα(λx)^{α−1}.
(a) When α = 1, it reduces to the exponential distribution. Constant hazard
implies the item does not age.
(b) When α > 1, the hazard increases with time. Hence the item ages.
(c) When α < 1, the hazard decreases with time. The longer the item
has survived, the more durable it becomes. It ages negatively.
(3) Human mortality can be described by a bathtub-shaped curve.
The death rate of newborns is high; then it stabilizes. After a certain age,
we become vulnerable to diseases and the death rate increases.
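The hazard formulas in Example 1.2 can be checked numerically. A small sketch (the function names are mine) verifying λ(x) = f(x)/S(x) for the Weibull family, including the exponential case α = 1:

```python
import math

def weibull_pdf(x, lam, alpha):
    """Weibull density f(x) = lam*alpha*(lam*x)^(alpha-1) exp{-(lam*x)^alpha}."""
    return lam * alpha * (lam * x) ** (alpha - 1) * math.exp(-(lam * x) ** alpha)

def weibull_surv(x, lam, alpha):
    """Weibull survival function S(x) = exp{-(lam*x)^alpha}."""
    return math.exp(-(lam * x) ** alpha)

def hazard(x, lam, alpha):
    """Hazard computed from the definition lambda(x) = f(x)/S(x)."""
    return weibull_pdf(x, lam, alpha) / weibull_surv(x, lam, alpha)

lam, x = 0.5, 2.0
# alpha = 1: exponential case with constant hazard lam.
assert abs(hazard(x, lam, 1.0) - lam) < 1e-12
# General alpha: hazard matches the closed form lam*alpha*(lam*x)^(alpha-1).
for alpha in (0.5, 1.0, 2.0, 3.0):
    closed_form = lam * alpha * (lam * x) ** (alpha - 1)
    assert abs(hazard(x, lam, alpha) - closed_form) < 1e-12
```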
1.3 More σ-fields and Related Concepts
Consider a random variable X defined on the probability space (Ω, F, P).
For any real number, such as −√2, the sample points satisfying X ≤ −√2
form a set which belongs to F. Similarly, {X ≥ 1.2} is also a set belonging to F.
This claim extends to the union of these two events, the intersection of these
two events, and so on.
Putting all events induced by X as above together, we obtain a collection
of events which will be denoted σ(X). It can easily be argued that σ(X)
is the smallest sub-σ-field with respect to which X is measurable; we call it
the σ-field generated by X.
Example 1.3
1. The σ-field generated by a constant random variable;
2. The σ-field generated by an indicator random variable;
3. The σ-field generated by a random variable taking only two possible values.
If X and Y are two random variables, we may define the conditional
cumulative distribution function of X given Y = y by
P(X ≤ x | Y = y) = P(X ≤ x, Y = y) / P(Y = y).
This definition works if P(Y = y) > 0. Otherwise, if X and Y are jointly
absolutely continuous, we make use of their joint density to compute the
conditional density function.
One purpose of introducing the conditional density function is to compute
conditional expectations. Because we will work with conditional
expectation more extensively later, we wish to define the conditional
expectation for any pair of random variables.
If Y is an indicator random variable I_A for some event A such that
0 < P(A) < 1, then E{X | Y = 0} and E{X | Y = 1} are both well defined
as above. Further, the corresponding values are the average sizes (expected
values) of X over the ranges A^c and A. Because of this, the conditional
expectation of X given Y gives the expected size of X over the various regions of Ω
partitioned according to the value of Y. When Y is as simple as an indicator
random variable, this partition is also simple and we can easily work out
these numbers conceptually. When Y is an ordinary random variable, the
partition of Ω is described by σ(Y). The problem is that σ(Y) contains many events in
general, and they are not mutually exclusive. Thus, there is no way to
present the conditional expectation of X given Y by enumerating them all.
The mathematical solution to this difficulty is to define Z = E{X | Y}
as a measurable function of Y such that
E{Z I_A} = E{X I_A}
for any A ∈ σ(Y).
We should realize that this definition does not provide any concrete means
of computing E{X | Y}. In mathematics, this definition has to be backed
up by an existence theorem guaranteeing that such a function exists. At
the same time, when another random variable differs from Z only on an event of zero
probability, that random variable is also a conditional expectation of
X given Y. Thus, this definition is unique only up to zero-probability
events.
Since Z is a function of Y, it is also a random variable. In particular, by
letting A = Ω, we get
E[E{X | Y}] = E{X}.
A remark here: the conditional expectation is well defined only if the
expectation of X exists. You may remember a famous formula:
Var(X) = Var[E{X | Y}] + E[Var{X | Y}].
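Both identities can be verified on a small discrete example. The joint pmf below is an arbitrary choice made for illustration:

```python
# Verify E[E{X|Y}] = E{X} and Var(X) = Var[E{X|Y}] + E[Var{X|Y}]
# on a small joint distribution (the probabilities are illustration values).
pmf = {(0, 0): 0.20, (0, 1): 0.10, (1, 0): 0.15,
       (1, 1): 0.25, (2, 0): 0.05, (2, 1): 0.25}

def marginal_y(y):
    return sum(p for (x, yy), p in pmf.items() if yy == y)

def cond_exp(g, y):
    """E{g(X) | Y = y} computed directly from the joint pmf."""
    return sum(g(x) * p for (x, yy), p in pmf.items() if yy == y) / marginal_y(y)

EX = sum(x * p for (x, _), p in pmf.items())
# Tower property: average the conditional expectations over the law of Y.
tower = sum(cond_exp(lambda x: x, y) * marginal_y(y) for y in (0, 1))
assert abs(tower - EX) < 1e-12

VarX = sum(x * x * p for (x, _), p in pmf.items()) - EX ** 2
var_of_ce = sum(cond_exp(lambda x: x, y) ** 2 * marginal_y(y)
                for y in (0, 1)) - EX ** 2
exp_of_cv = sum((cond_exp(lambda x: x * x, y) - cond_exp(lambda x: x, y) ** 2)
                * marginal_y(y) for y in (0, 1))
assert abs(VarX - (var_of_ce + exp_of_cv)) < 1e-12
```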
Note that the definition of the conditional expectation relies only on the σ-field
generated by Y. Thus, if G is a σ-field, we simply define Z = E{X | G}
as a G-measurable function such that
E{Z I_A} = E{X I_A}
for all A ∈ G. Thus, the concept of conditional expectation in measure
theory is in fact built on σ-fields.
Let us go over a few concepts from elementary probability theory and see
what they mean under measure theory.
Recall that if X and Y are independent, then we have
E{g_1(X) g_2(Y)} = E{g_1(X)} E{g_2(Y)}
for any functions g_1 and g_2. Under measure theory, we need to add the cosmetic
condition that these two functions are measurable, plus that these expectations
exist. More rigorously, the independence of two random variables is built on
the independence of their σ-fields. It is easily seen that the σ-field generated
by g_1(X) is a sub-σ-field of that of X. One may then see why g_1(X)
is independent of g_2(Y).
Example 1.4 Some properties.
1. If X is G-measurable, then
E{XY | G} = X E{Y | G}.
2. If G_1 ⊂ G_2, then
E[E{X | G_2} | G_1] = E{X | G_1}.
3. If G and σ(X) are independent of each other, then
E{X | G} = E{X}.
Most well-known elementary properties of the mathematical expectation remain valid.
1.4 Multivariate Normal Distribution
A group of random variables X has a joint normal distribution if their joint
density function is of the form
C exp(−xAx^τ − 2bx^τ),
where A is positive definite and C is a constant such that the density function
has total mass 1. Note that x and b are (row) vectors.
Being positive definite implies that there exists an orthogonal decomposition
A = BΛB^τ,
where Λ is a diagonal matrix with all elements positive and BB^τ = I, the
identity matrix.
With this knowledge, it is seen that
∫ exp(−(1/2) xAx^τ) dx = ∫ exp(−(1/2) yΛy^τ) dy = (2π)^{n/2} |A|^{−1/2}.
Hence when b = 0, the density function has the form
f(x) = {(2π)^{−n} |A|}^{1/2} exp(−(1/2) xAx^τ).
Let X be a random vector with the density function given above, and let
Y = X + µ. Then the density function of Y is given by
{(2π)^{−n} |A|}^{1/2} exp{−(1/2) (y − µ)A(y − µ)^τ}.
It is more convenient to use V = A^{−1} in most applications. Thus, we
have the following definition.
Definition 1.3 Multivariate normal distribution
X = (X_1, . . . , X_n) has a multivariate normal distribution (written as N(µ, V))
if its joint density function is
f(x) = {(2π)^n |V|}^{−1/2} exp{−(1/2) (x − µ)V^{−1}(x − µ)^τ},
where V is a positive definite matrix. ♦
If X is multivariate normal N(µ, V), then E(X) = µ and Var(X) = V.
If X (of length n) is N(µ, V) and D is an n × m matrix of rank m ≤ n, then
Y = XD is N(µD, D^τ V D).
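The covariance part of this transformation rule can be illustrated by a Monte Carlo sketch (hedged: the matrices B and D below are arbitrary illustration choices, and only the covariance of Y = XD is checked, not full normality):

```python
import random

# If X = ZB with Z iid standard normal, then V = Cov(X) = B^T B, and the
# rule above says Cov(XD) = D^T V D.  We check this empirically.
random.seed(2024)
B = [[1.0, 0.5], [0.0, 1.0]]   # X = Z B, so V = B^T B (illustration values)
D = [[1.0, 1.0], [0.0, 2.0]]   # linear transformation (illustration values)

def matmul(A, C):
    return [[sum(A[i][k] * C[k][j] for k in range(len(C)))
             for j in range(len(C[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

V = matmul(transpose(B), B)
target = matmul(transpose(D), matmul(V, D))   # D^T V D

n = 50000
ys = []
for _ in range(n):
    z = [random.gauss(0, 1), random.gauss(0, 1)]
    x = [sum(z[k] * B[k][j] for k in range(2)) for j in range(2)]   # x = zB
    ys.append([sum(x[k] * D[k][j] for k in range(2)) for j in range(2)])

# Empirical covariance of Y (mean is zero here) vs. D^T V D.
emp = [[sum(y[i] * y[j] for y in ys) / n for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert abs(emp[i][j] - target[i][j]) < 0.5
```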
More generally, we say a random vector X = (X_1, . . . , X_n) has a multivariate
normal distribution when Xa^τ is a normally distributed random variable for
all a ∈ R^n.
A more rigorous, and my favoured, definition is: if X can be written in
the form AY + b for some (non-random) matrix A and vector b, where Y is
a vector of iid standard normal random variables, then X is a
multivariate normally distributed random vector.
We now list a number of well-known facts about multivariate normal distributions.
If you find that you are not familiar with many of them,
it probably means some catch-up work should be done.
1. Suppose X has a normal distribution with mean µ and variance σ². Its
characteristic function is
φ(t) = exp{iµt − (1/2)σ²t²}.
Its moment generating function is
M(t) = exp{µt + (1/2)σ²t²}.
2. If X has a standard normal distribution, then all its odd moments are
zero, and its even moments are
E X^{2r} = (2r − 1)(2r − 3) · · · 3 · 1.
3. Let Φ(x) and φ(x) be the cumulative distribution function and the
density function of the standard normal distribution. It is known that
(1/x − 1/x³) φ(x) ≤ 1 − Φ(x) ≤ (1/x) φ(x)
for all positive x. In particular, when x is large, these bounds are very
accurate.
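These tail bounds are easy to check numerically. A hedged sketch using `math.erf` for the standard normal c.d.f.:

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# The bounds hold for all positive x; check a few values.
for x in (0.5, 1.0, 2.0, 3.0, 4.0):
    tail = 1.0 - Phi(x)
    lower = (1.0 / x - 1.0 / x ** 3) * phi(x)
    upper = phi(x) / x
    assert lower <= tail <= upper

# The two ends tighten quickly: at x = 4 they agree to within ~6%.
ratio = (phi(4.0) / 4.0) / (1.0 - Phi(4.0))
assert ratio < 1.07
```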
4. Suppose that X_1, X_2, . . . , X_n are independent and identically
distributed standard normal random variables. Let X_{(n)} = max{X_i, i =
1, . . . , n}. Then X_{(n)} = O_p(√(log n)).
5. Suppose Y_1 and Y_2 are two independent random variables such that Y_1 + Y_2
is normally distributed. Then Y_1 and Y_2 are jointly normally distributed;
that is, they are multivariate normal.
6. A set of random variables has a multivariate normal distribution if and
only if all linear combinations of them are normally distributed.
7. Let X_1, . . . , X_n be i.i.d. normal random variables with variance σ². The sample mean
X̄_n = n^{−1} Σ_{i=1}^n X_i and the sample variance S_n² = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄_n)²
are independent. Further, X̄_n has a normal distribution and (n − 1)S_n²/σ² has a
chi-square distribution with n − 1 degrees of freedom.
8. Let X be a vector having a multivariate normal distribution with mean µ
and covariance matrix Σ, and let A be a non-negative definite symmetric
matrix. Then XAX^τ has a chi-squared distribution when
AΣAΣA = AΣA.
1.5 Summary
We did not intend to give you a full account of the probability theory learned in
courses preceding this one. Nor did we try to lead you into the world of measure-theory
based, advanced, and widely-regarded-as-useless rigorous probability
theory. Yet we did introduce the concept of a probability space, and why a
dose of σ-field is needed in doing so.
In the latter half of this course, we need the σ-field based definitions of
independence and conditional expectation. It is also very important to get familiar
with multivariate normal random variables. These preparations are
not sufficient, but will serve as a starting point. Also, you may be happy to
know that the pressure is not here yet.
Chapter 2
Simple Random Walk
Simple random walk is an easy object in the family of stochastic processes.
At the same time, it shares many properties with more complex stochastic
processes. Understanding the simple random walk helps us to understand
abstract stochastic processes.
The building block of a simple random walk is a sequence of iid random
variables X_1, . . . , X_n, . . . such that
P(X_1 = 1) = p, P(X_1 = −1) = 1 − p = q.
We assume that 0 < p < 1; otherwise the simple random walk becomes
trivial.
We start with S_0 = a and define S_{n+1} = S_n + X_{n+1} for n = 0, 1, 2, . . .. It
mimics the situation of starting to gamble with an initial capital of $a, where
the stake is $1 on each trial. The probability of winning a trial is p and the
probability of losing is q = 1 − p. Trials are assumed independent.
When p = q = 1/2, the random walk is symmetric.
Properties:
A simple random walk is
(i) time homogeneous:
P{S_{n+m} = j | S_m = a} = P{S_n = j | S_0 = a};
(ii) spatially homogeneous:
P{S_{n+m} = j + b | S_m = a + b} = P{S_n = j | S_0 = a};
(iii) Markov:
P{S_{n+m} = j | S_0 = j_0, S_1 = j_1, · · · , S_m = j_m} = P{S_{n+m} = j | S_m = j_m}.
♦
Example 2.1 Gambler's ruin
Assume that, starting with initial capital S_0 = a, the game stops as soon as
S_n = 0 or S_n = N for some n, where N ≥ a ≥ 0. Under the conditions of the
simple random walk, what is the probability that the game stops at S_n = N?
Solution: A brute force solution of this problem is almost impossible.
The key is to view this probability as a function of the initial capital a. By
establishing a relationship between the probabilities, we obtain a difference
equation which is not hard to solve. ♦
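The difference-equation approach can be sketched numerically. Hedged assumptions: the closed form (1 − (q/p)^a)/(1 − (q/p)^N) for p ≠ q is the standard solution of h(a) = p·h(a+1) + q·h(a−1) with h(0) = 0 and h(N) = 1 (the notes leave the derivation to the reader); value iteration on the difference equation recovers it.

```python
# Gambler's ruin: probability of reaching N before 0, starting from a.
p, N = 0.6, 10
q = 1.0 - p

# Value iteration on h(a) = p*h(a+1) + q*h(a-1), h(0)=0, h(N)=1.
h = [0.0] * (N + 1)
h[N] = 1.0
for _ in range(20000):
    h = [0.0] + [p * h[a + 1] + q * h[a - 1] for a in range(1, N)] + [1.0]

# Compare with the closed-form solution for p != q.
r = q / p
for a in range(N + 1):
    closed = (1.0 - r ** a) / (1.0 - r ** N)
    assert abs(h[a] - closed) < 1e-6
```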
2.1 Counting sample paths
If we plot (S_i, i = 0, 1, . . . , n) against (0, 1, . . . , n), we obtain a path connecting
(0, S_0) and (n, S_n). We call it a sample path. Assume that S_0 = a and
S_n = b. There may be many possible sample paths leading from (0, a) to
(n, b).
Property 2.1
The number of paths from (0, a) to (n, b) is
N_n(a, b) = C(n, (n + b − a)/2),
where C(n, k) denotes the binomial coefficient, when (n + b − a)/2 is an integer
between 0 and n, and 0 otherwise. ♦
Property 2.2
For a simple random walk, when (n + b − a)/2 is an integer between 0 and n, all
sample paths from (0, a) to (n, b) have equal probability
p^{(n+b−a)/2} q^{(n+a−b)/2}
of occurring. In addition,
P(S_n = b | S_0 = a) = C(n, (n + b − a)/2) p^{(n+b−a)/2} q^{(n+a−b)/2}.
♦
The proof is straightforward.
Example 2.2
It is seen that
P(S_{2n} = 0 | S_0 = 0) = C(2n, n) p^n q^n
for n = 0, 1, 2, . . .. This is the probability that the random walk returns to
0 at trial 2n.
When p = q = 1/2, we have
u_{2n} = P(S_{2n} = 0 | S_0 = 0) = C(2n, n) (1/2)^{2n}.
Using Stirling's approximation
n! ≈ √(2πn) (n/e)^n
for large n (the approximation is good even for small n), we have
u_{2n} ≈ 1/√(nπ).
Thus Σ_n u_{2n} = ∞. By the renewal theorem to be discussed, S_n = 0 is a recurrent
event when p = q = 1/2. ♦
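The approximation in Example 2.2 can be checked directly. A hedged sketch: u_{2n} is computed by the numerically stable recursion u_{2n} = u_{2n−2}·(2n − 1)/(2n), which follows from the binomial formula, and compared with 1/√(πn).

```python
import math

# u_{2n} = C(2n, n)(1/2)^{2n}, computed via the ratio
# u_{2n}/u_{2n-2} = (2n-1)/(2n) to avoid huge binomials.
N = 4000
us = [1.0]                      # us[n] holds u_{2n}
for n in range(1, N + 1):
    us.append(us[-1] * (2 * n - 1) / (2 * n))

# Stirling gives u_{2n} ~ 1/sqrt(pi*n); relative error is about 1/(8n).
for n in (5, 50, 500):
    approx = 1.0 / math.sqrt(math.pi * n)
    assert abs(us[n] / approx - 1.0) < 1.0 / n

# Partial sums grow without bound (like 2*sqrt(N/pi)), consistent with
# recurrence of the event S_n = 0 when p = q = 1/2.
assert sum(us[1:]) > 10.0
```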
Property 2.3 Reflection principle
Let N_n^0(a, b) be the number of sample paths from (0, a) to (n, b) that touch
or cross the x-axis. Then, when a, b > 0,
N_n^0(a, b) = N_n(−a, b).
Proof: We simply count the number of paths. The key step is to establish
a one-to-one correspondence between the two sets of paths. ♦
The above property enables us to show the next interesting result.
Property 2.4 Ballot Theorem
The number of paths from (0, 0) to (n, b) that do not revisit the x-axis is
(b/n) N_n(0, b).
Proof: Notice that such a sample path has to start with a transition from
(0, 0) to (1, 1). The number of paths from (1, 1) to (n, b), with b > 0,
which do not touch the x-axis is
N_{n−1}(1, b) − N_{n−1}^0(1, b) = N_{n−1}(1, b) − N_{n−1}(−1, b)
= C(n − 1, (n − b)/2) − C(n − 1, (n − b − 2)/2)
= (b/n) N_n(0, b).
♦
What does this name suggest? Suppose that Michael and George are in a
competition for some title. In the end, Michael wins by b votes out of n cast. If
the votes are counted in a random order, and A = "Michael leads throughout
the count", then
P(A) = [(b/n) N_n(0, b)] / N_n(0, b) = b/n.
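The Ballot Theorem can be verified by brute-force enumeration for small n. A hedged sketch: among all N_n(0, b) paths from (0, 0) to (n, b), the code counts those that never revisit the axis and checks that the fraction is exactly b/n.

```python
from itertools import product

def ballot_check(n, b):
    """Count paths 0 -> b in n steps, and those never revisiting the axis."""
    total = good = 0
    for steps in product((1, -1), repeat=n):
        s, never_zero = 0, True
        for step in steps:
            s += step
            if s == 0:
                never_zero = False
        if s == b:
            total += 1
            if never_zero:
                good += 1
    return good, total

# good/total == b/n exactly, by the Ballot Theorem.
for n, b in ((5, 1), (7, 3), (9, 3)):
    good, total = ballot_check(n, b)
    assert good * n == total * b
```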
Theorem 2.1
Assume that S_0 = 0 and b ≠ 0. Then
(i) P(S_1 S_2 · · · S_n ≠ 0, S_n = b | S_0 = 0) = (|b|/n) P(S_n = b | S_0 = 0);
(ii) P(S_1 S_2 · · · S_n ≠ 0) = n^{−1} E|S_n|. ♦
Proof: The first part can be proved by counting the number of sample
paths which do not touch the x-axis.
The second part is a consequence of the first: just sum the
probability in (i) over the possible values of S_n. ♦
Theorem 2.2 First passage through b at trial n (Hitting time problem).
If b ≠ 0 and S_0 = 0, then
f_n(b) = P(S_1 ≠ b, · · · , S_{n−1} ≠ b, S_n = b | S_0 = 0) = (|b|/n) P(S_n = b).
Proof: If we reverse the time order, the sample paths become those which reach b
in n trials without touching the x-axis. Hence the result. ♦
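The hitting time theorem can also be confirmed by enumeration, including the asymmetric case p ≠ q. A hedged sketch over all 2^n paths:

```python
from itertools import product

def hitting_check(n, b, p):
    """Return (f_n(b), P(S_n = b)) by enumerating all paths of length n."""
    q = 1.0 - p
    f_n = p_n = 0.0
    for steps in product((1, -1), repeat=n):
        s, hit_early, prob = 0, False, 1.0
        for k, step in enumerate(steps, 1):
            s += step
            prob *= p if step == 1 else q
            if s == b and k < n:
                hit_early = True        # visited b before trial n
        if s == b:
            p_n += prob
            if not hit_early:
                f_n += prob             # first passage at trial n
    return f_n, p_n

# f_n(b) = (b/n) P(S_n = b) for b > 0, for any p in (0, 1).
for n, b, p in ((5, 1, 0.3), (7, 3, 0.6)):
    f_n, p_n = hitting_check(n, b, p)
    assert abs(f_n - (b / n) * p_n) < 1e-12
```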
The next problem of interest is how high S_i has climbed before it settles
at S_n = b. We assume b > 0. In general, we often encounter the problem of
finding the distribution of the extreme value of a stochastic process. This
is an extremely hard problem, but we have a kind of answer for the simple
random walk.
Define M_n = max{S_i, i = 1, . . . , n}.
Theorem 2.3
Suppose S_0 = 0. Then for r ≥ 1,
P(M_n ≥ r, S_n = b) = P(S_n = b) if b ≥ r,
P(M_n ≥ r, S_n = b) = (q/p)^{r−b} P(S_n = 2r − b) if b < r.
Hence,
P(M_n ≥ r) = P(S_n ≥ r) + Σ_{b=−∞}^{r−1} (q/p)^{r−b} P(S_n = 2r − b)
= P(S_n = r) + Σ_{c=r+1}^{∞} [1 + (q/p)^{c−r}] P(S_n = c).
When p = q = 1/2, this becomes
P(M_n ≥ r) = 2P(S_n ≥ r + 1) + P(S_n = r).
Proof: When b ≥ r, the event {S_n = b} is a subset of the event {M_n ≥ r}.
Hence the first conclusion is true.
When b < r, we may draw the horizontal line y = r. For each sample path
belonging to {M_n ≥ r, S_n = b}, we obtain a partial mirror image: we retain
the part until the sample path first touches the line y = r, and complete it
with the mirror image from there on. Thus, the number of such sample paths is the
same as the number of sample paths from 0 to 2r − b. The corresponding
probability, however, is obtained by exchanging the roles of p and q in r − b
of the steps.
The remaining parts of the theorem follow directly. ♦
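The partial mirror image argument can be checked by enumeration. A hedged sketch for one asymmetric case (n, p, r, b are illustration choices with matching parity):

```python
from itertools import product

def walk_paths(n, p):
    """Enumerate all paths: (probability, final position, running maximum)."""
    q = 1.0 - p
    out = []
    for steps in product((1, -1), repeat=n):
        prob, s, m = 1.0, 0, 0
        for step in steps:
            prob *= p if step == 1 else q
            s += step
            m = max(m, s)
        out.append((prob, s, m))
    return out

n, p, r, b = 9, 0.4, 3, 1      # b < r, parity of b and 2r - b matches n
q = 1.0 - p
paths = walk_paths(n, p)

# Theorem 2.3: P(M_n >= r, S_n = b) = (q/p)^(r-b) P(S_n = 2r - b).
lhs = sum(prob for prob, s, m in paths if m >= r and s == b)
rhs = (q / p) ** (r - b) * sum(prob for prob, s, m in paths if s == 2 * r - b)
assert abs(lhs - rhs) < 1e-12
```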
Let µ_b be the mean number of visits of the walk to the point b before
it returns to its starting point. Suppose S_0 = 0. Let Y_n = I(S_1 S_2 · · · S_n ≠
0, S_n = b). Then Σ_{n=1}^∞ Y_n is the number of times the walk visits b before
returning to 0. Notice that this reasoning remains fine even when the walk never
returns to 0.
Since E(Y_n) = f_b(n), we get
µ_b = Σ_{n=1}^∞ f_b(n).
Thus, in general, the mean number of visits is less than 1. When p = q = 1/2,
all states are recurrent; therefore µ_b = 1 for any b. ♦
Suppose a fair coin is tossed until the first equalisation of the accumulated
numbers of heads and tails. The gambler receives one dollar every
time that the number of heads exceeds the number of tails by b. The result above
leads to the comment that the "fair entrance fee" equals 1, independent of b.
My remark: how many of us think that this is against their intuition?
Theorem 2.4 Arc sine law for the last visit to the origin.
Suppose that p = q = 1/2 and S_0 = 0. The probability that the last visit to
0 up to time 2n occurred at time 2k is given by
P(S_{2k} = 0) P(S_{2n−2k} = 0).
♦
Proof: The probability in question is
α_{2n}(2k) = P(S_{2k} = 0) P(S_{2k+1} · · · S_{2n} ≠ 0 | S_{2k} = 0)
= P(S_{2k} = 0) P(S_1 · · · S_{2n−2k} ≠ 0 | S_0 = 0).
We are hence asked to show that
P(S_1 · · · S_{2n−2k} ≠ 0 | S_0 = 0) = u_{2n−2k}.
Since S_0 = 0 is part of our assumption, we may omit the conditioning.
We have
P(S_1 · · · S_{2m} ≠ 0 | S_0 = 0) = Σ_{b≠0} P(S_1 · · · S_{2m} ≠ 0, S_{2m} = b)
= Σ_{b≠0} (|b|/(2m)) P(S_{2m} = b)
= 2 Σ_{b=1}^{m} (2b/(2m)) P(S_{2m} = 2b)
= 2 (1/2)^{2m} Σ_{b=1}^{m} [C(2m − 1, m + b − 1) − C(2m − 1, m + b)]
= C(2m, m) (1/2)^{2m}
= u_{2m}.
♦
The proof shows that
P(S_1 S_2 · · · S_{2m} ≠ 0) = P(S_{2m} = 0).
Denote, as usual, u_{2n} = P(S_{2n} = 0) and α_{2n}(2k) = P(S_{2k} = 0, S_i ≠ 0, i =
2k + 1, . . . , 2n). Then the theorem says
α_{2n}(2k) = u_{2k} u_{2n−2k}.
As p = q = 1/2 is a simple case, we have an accurate approximation for u_{2k}.
It turns out that u_{2k} ≈ (πk)^{−1/2}, and so α_{2n}(2k) ≈ {π[k(n − k)]^{1/2}}^{−1}. If
we let T_{2n} be the time of the last visit to 0 up to time 2n, then it follows that
P(T_{2n} ≤ 2xn) ≈ Σ_{k≤xn} {π[k(n − k)]^{1/2}}^{−1}
≈ ∫_0^x 1/(π[u(1 − u)]^{1/2}) du
= (2/π) arcsin(√x).
That is, the limiting distribution function of T_{2n}/(2n) is (2/π) arcsin(√x);
hence the name arc sine law.
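The quality of this limit can be examined numerically. A hedged sketch: the exact probabilities α_{2n}(2k) = u_{2k} u_{2n−2k} form a distribution over k = 0, . . . , n, and their partial sums track (2/π) arcsin(√x) already at moderate n.

```python
import math

n = 500
u = [1.0]                                  # u[k] holds u_{2k}
for k in range(1, n + 1):
    u.append(u[-1] * (2 * k - 1) / (2 * k))

# alpha_{2n}(2k) = u_{2k} u_{2n-2k}; these sum to 1 over k = 0..n.
alpha = [u[k] * u[n - k] for k in range(n + 1)]
assert abs(sum(alpha) - 1.0) < 1e-9

# Partial sums approximate the arc sine distribution function.
for x in (0.1, 0.25, 0.5, 0.9):
    exact = sum(alpha[k] for k in range(int(x * n) + 1))
    limit = (2.0 / math.pi) * math.asin(math.sqrt(x))
    assert abs(exact - limit) < 0.05
```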
(1) Since p = q = 1/2, one might think that the balance of heads and tails
should occur very often. Is that true? After 2n tosses with n large, the chance
that the walk never touches 0 after trial n is 50%, which is surprisingly large.
(2) One might expect the last time before trial 2n at which the simple random walk
touches 0 to be close to the end. The result shows instead that its distribution is symmetric about the
midpoint n: the last visit is more likely to be near the beginning or near the end.
Why is it more likely at the beginning? Since the walk started at 0, it is likely
to touch 0 again soon. Once it has wandered away from 0, touching 0 becomes
less and less likely.
Why is it also likely to occur near the end? If for some reason the simple
random walk returned to 0 at some point, it becomes more likely to visit 0
again in the near future. Thus the most recent visit gets pushed closer and
closer to the end.
(3) If we count the number of k such that S_k > 0, what kind of distribution
does it have? Should it be more likely to be close to n? It turns out that it is either
very small or very large.
Thus, if you gamble, you may win all the time, or lose all the time, even
though the game is perfectly fair in the sense of probability theory.
Let us now see how (3) is established.
Property 2.5 Arc sine law for sojourn times.
Suppose that p = 1/2 and S_0 = 0. The probability that the walk spends
exactly 2k intervals of time, up to time 2n, to the right of the origin equals
u_{2k} u_{2n−2k}. ♦
Proof: Call this probability β_{2n}(2k). We are asked to show that β_{2n}(2k) =
α_{2n}(2k).
We use mathematical induction.
The first step is to consider the case k = n = m.
In our previous proofs, we have shown that for the symmetric random walk
u_{2m} = P(S_1 · · · S_{2m} ≠ 0).
These sample paths belong to one of two possible groups: always
above 0 or always below 0. Hence,
u_{2m} = 2P(S_1 > 0, . . . , S_{2m} > 0).
On the other hand, we have
P(S_1 > 0, S_2 > 0, · · · , S_{2m} > 0)
= P(S_1 = 1, S_2 ≥ 1, . . . , S_{2m} ≥ 1)
= P(S_1 − 1 = 0, S_2 − 1 ≥ 0, . . . , S_{2m} − 1 ≥ 0)
= P(S_1 = 1 | S_0 = 0) P(S_2 − 1 ≥ 0, . . . , S_{2m} − 1 ≥ 0 | S_1 − 1 = 0)
= (1/2) P(S_2 ≥ 0, S_3 ≥ 0, . . . , S_{2m} ≥ 0 | S_1 = 0)
= (1/2) P(S_1 ≥ 0, S_2 ≥ 0, . . . , S_{2m−1} ≥ 0 | S_0 = 0)
= (1/2) P(S_1 ≥ 0, S_2 ≥ 0, . . . , S_{2m−1} ≥ 0, S_{2m} ≥ 0 | S_0 = 0)
= (1/2) β_{2m}(2m),
where the last equality comes from the fact that S_{2m−1} ≥ 0 implies S_{2m} ≥ 0,
as it is impossible for S_{2m−1} to equal 0. Consequently, since u_0 = 1, we have shown
that
α_{2m}(2m) = u_{2m} = β_{2m}(2m).
Let us now make the induction assumption that
α_{2n}(2k) = β_{2n}(2k)
for all n, with k = 0, 1, . . . , m − 1 and k = n, n − 1, . . . , n − (m − 1). Note that
the validity of the second set of k values follows by symmetry. Our induction
task is to show that the same is true when k = m.
By the renewal equation,
u_{2m} = Σ_{r=1}^{m} f_{2r} u_{2m−2r}.
Using an idea similar to the renewal equation, we also have
β_{2n}(2m) = (1/2) Σ_{r=1}^{m} f_{2r} β_{2n−2r}(2m − 2r) + (1/2) Σ_{r=1}^{n−m} f_{2r} β_{2n−2r}(2m),
since the walk starts with either a positive move or a negative move, and it
touches 0 again at some time 2r, leaving another 2n − 2r trials (it could happen
that 2r = 2n).
So, with the renewal equation and the induction assumption,
β_{2n}(2m) = (1/2) Σ_{r=1}^{m} f_{2r} u_{2m−2r} u_{2n−2m} + (1/2) Σ_{r=1}^{n−m} f_{2r} u_{2m} u_{2n−2m−2r}
= (1/2) [u_{2n−2m} u_{2m} + u_{2m} u_{2n−2m}]
= u_{2m} u_{2n−2m}.
The induction is hence complete. ♦
2.2 Summary
What have we learned in this chapter? One observation is that all sample
paths starting and ending at the same locations have equal probability of
occurring. Making use of this fact, some interesting properties of the simple
random walk are revealed.
One such property is that the simple random walk is null recurrent when
p = q and transient otherwise. Let us recall some detail: by counting sample
paths, it is possible to provide an accurate enough approximation to the
probability of returning to 0.
The reflection principle allows us to count the number of paths going from
one point to another without touching 0. This result is used to establish the
Ballot Theorem. In the old days, there were enough nerds who believed that this
result is intriguing.
We may not believe that the hitting time problem is a big deal. Yet when
it is linked with stock prices, you may change your mind. If you find some
neat estimates of such probabilities for very general stochastic processes, you
will be famous. In this course, we provide such a result for the simple random
walk. There is a similar result for Brownian motion, to be discussed later.
Similarly, the arc sine law also has its twin in Brownian motion. Please do
not go away; stay tuned.
Chapter 3
Generating Functions and
Their Applications
The idea of generating functions is to transform the function under
investigation from one functional space to another. The new functional space
might be more convenient to work with.

If X is a nonnegative integer valued random variable, then we call

G_X(s) = E(s^X)

the probability generating function of X.

Advantages? It can be seen that the moments of X can be obtained from
the derivatives of G_X(s) at s = 1.

When X and Y are independent, the probability generating function of
X + Y is

G_{X+Y}(s) = G_X(s) G_Y(s).
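As a quick numerical illustration (the Poisson distribution here is an assumed example, not taken from the text), the mean can be recovered from the derivative of the pgf at s = 1:

```python
import math

# Assumed example: X ~ Poisson(2). Its pgf is G_X(s) = E(s^X) = sum_k P(X=k) s^k,
# and E(X) = G'_X(1), estimated here by a one-sided difference at s = 1.
lam = 2.0

def pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def pgf(s, terms=60):
    # Truncated series; the Poisson(2) tail beyond 60 terms is negligible.
    return sum(pmf(k) * s**k for k in range(terms))

h = 1e-6
mean = (pgf(1.0) - pgf(1.0 - h)) / h
print(round(mean, 3))  # close to E(X) = lam = 2.0
```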
Theorem 3.1
If X_1, X_2, . . . is a sequence of independent and identically distributed random
variables with common generating function G_X(s), and N (≥ 0) is a random
variable which is independent of the X_i's and has generating function G_N(s),
then

S = X_1 + X_2 + · · · + X_N

has generating function given by

G_S(s) = G_N(G_X(s)).

♦

The proof can be done through conditioning on N. When N = 0, we take
S = 0.
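A sketch of Theorem 3.1 in action, under the assumed choices N ~ Poisson(λ) and X_i ~ Bernoulli(p) (so that S is a thinned Poisson variable):

```python
import math

# Random sum S = X_1 + ... + X_N with N ~ Poisson(lam), X_i ~ Bernoulli(p).
# Theorem 3.1: G_S(s) = G_N(G_X(s)); here G_N(s) = exp(lam*(s-1)) and
# G_X(s) = (1-p) + p*s, so G_S(s) = exp(lam*p*(s-1)) (Poisson thinning).
lam, p = 3.0, 0.4

def G_N(s):
    return math.exp(lam * (s - 1.0))

def G_X(s):
    return (1.0 - p) + p * s

def pmf_S(k, nmax=80):
    # Direct computation: P(S = k) = sum_n P(N = n) P(Binomial(n, p) = k).
    total = 0.0
    for n in range(k, nmax):
        pois = math.exp(-lam) * lam**n / math.factorial(n)
        binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
        total += pois * binom
    return total

s = 0.7
direct = sum(pmf_S(k) * s**k for k in range(40))
print(abs(direct - G_N(G_X(s))) < 1e-9)  # True
```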
Definition 3.2
The joint probability generating function of two random variables X_1 and
X_2, taking nonnegative integer values, is given by

G_{X_1, X_2}(s_1, s_2) = E[s_1^{X_1} s_2^{X_2}].
♦
Theorem 3.2
X_1 and X_2 are independent if and only if

G_{X_1, X_2}(s_1, s_2) = G_{X_1}(s_1) G_{X_2}(s_2).
♦
The idea of a generating function also applies to a sequence of real numbers.
If {a_n}_{n=0}^∞ is a sequence of real numbers, then

A(s) = Σ_n a_n s^n

is called the generating function of {a_n}_{n=0}^∞ when it converges in a
neighborhood of s = 0.

Suppose that {a_n}_{n=0}^∞ and {b_n}_{n=0}^∞ are two sequences of real
numbers whose generating functions exist in a neighborhood of s = 0. Define

c_n = Σ_{i=0}^n a_i b_{n−i}

for n = 0, 1, . . .. Then

C(s) = A(s) B(s),

where A(s), B(s) and C(s) are the generating functions of {a_n}_{n=0}^∞,
{b_n}_{n=0}^∞ and {c_n}_{n=0}^∞ respectively.
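The convolution property can be checked directly with short finite sequences (the numbers below are an arbitrary assumed example):

```python
# Convolving coefficient sequences corresponds to multiplying their generating
# functions: if c_n = sum_i a_i b_{n-i}, then C(s) = A(s)B(s).
def convolve(a, b):
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def gen_fun(coeffs, s):
    return sum(cn * s**n for n, cn in enumerate(coeffs))

a = [1.0, 2.0, 3.0]
b = [0.5, 0.25]
c = convolve(a, b)
s = 0.3
print(abs(gen_fun(c, s) - gen_fun(a, s) * gen_fun(b, s)) < 1e-12)  # True
```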
3.1 Renewal Events
Consider a sequence of trials with outcomes X_1, X_2, . . .. We do not require
the X_i's to be independent of each other. Let λ represent some property which,
on the basis of the outcomes of the first n trials, can be said unequivocally to
occur or not to occur at trial n. By convention, we suppose that λ has just
occurred at trial 0, and E_n represents the “event” that λ occurs at trial n,
n = 1, 2, . . ..
Roughly speaking, a property is a renewal event if, given that it has just
occurred, the waiting time distribution for the next occurrence remains the
same and is independent of the past. Thus, the process renews itself each
time the renewal event occurs: it resets the clock back to time 0.
Let λ represent a renewal event and, as before, define the lifetime sequence
{f_n}, where f_0 = 0 and

f_n = P{λ occurs for the first time at trial n}, n = 1, 2, . . . .

In like manner, we define the renewal sequence {u_n}, where u_0 = 1 and

u_n = P{λ occurs at trial n}, n = 1, 2, . . . .
Let F(s) = Σ_n f_n s^n and U(s) = Σ_n u_n s^n be the generating functions
of {f_n} and {u_n}. Note that

f = Σ_n f_n = F(1) ≤ 1

because f has the interpretation that λ recurs at some time in the sequence.
Since the event may not recur at all, it is possible for f to be less than 1.
Clearly, 1 − f represents the probability that λ never recurs in the infinite
sequence of trials. When f < 1, the probability that λ occurs only a finite
number of times is 1. Hence, we say that λ is transient. Otherwise, it is
recurrent.
For a recurrent renewal event, F(s) is a probability generating function.
The mean interoccurrence time is

µ = F′(1) = Σ_{n=0}^∞ n f_n.

If µ < ∞, we say that λ is positive recurrent. If µ = ∞, we say that λ is
null recurrent.
Finally, if λ can occur only at n = t, 2t, 3t, . . . for some integer t > 1, we
say that λ is periodic with period t. More formally, let t = g.c.d.{n : f_n > 0}
(g.c.d. stands for the greatest common divisor). If t > 1, the recurrent event
λ is said to be periodic with period t. If t = 1, λ is said to be aperiodic.
For a renewal event λ to occur at trial n ≥ 1, either λ occurs for the first
time at n, with probability f_n = f_n u_0, or λ occurs for the first time at some
intermediate trial k < n and then occurs again at n; the probability of this
event is f_k u_{n−k}. Noticing that f_0 = 0, we therefore have

u_n = f_0 u_n + f_1 u_{n−1} + · · · + f_{n−1} u_1 + f_n u_0, n = 1, 2, . . . .
This equation is called the renewal equation. Using the typical generating
function methodology, we get

U(s) − 1 = F(s) U(s).

Hence

U(s) = 1/(1 − F(s))   or   F(s) = 1 − 1/U(s).
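A minimal sketch of the renewal equation as a recursion, assuming geometric lifetimes (the renewal event “success in a sequence of Bernoulli(p) trials”), for which u_n = p for every n ≥ 1:

```python
# Renewal sequence u_n from the lifetime sequence f_n via the renewal equation
#   u_n = f_1 u_{n-1} + f_2 u_{n-2} + ... + f_n u_0,  with u_0 = 1, f_0 = 0.
# Assumed example: geometric waiting times f_n = q^(n-1) p, so u_n = p, n >= 1.
p = 0.3
q = 1.0 - p
N = 50
f = [0.0] + [q**(n - 1) * p for n in range(1, N + 1)]
u = [1.0]
for n in range(1, N + 1):
    u.append(sum(f[k] * u[n - k] for k in range(1, n + 1)))
print(all(abs(un - p) < 1e-12 for un in u[1:]))  # True
```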
Theorem 3.3
The renewal event λ is
1. transient if and only if u = Σ_n u_n = U(1) < ∞;
2. recurrent if and only if u = ∞;
3. periodic with period t if t = g.c.d.{n : u_n > 0} is greater than 1, and
aperiodic if t = 1;
4. null recurrent if and only if Σ_n u_n = ∞ and u_n → 0 as n → ∞.
The following is the famous renewal theorem.

Theorem 3.4 (The renewal theorem)
Let λ be a recurrent and aperiodic renewal event and let

µ = Σ_n n f_n = F′(1)

be the mean interoccurrence time. Then

lim_{n→∞} u_n = µ^{−1}.

The proof of the Renewal Theorem is rather involved. We put it in a
section here. To the relief of many, that section is only for the sake of those
who are nerdy enough in mathematics.
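Before the formal proof, a numerical sanity check of the theorem with an assumed three-point lifetime distribution:

```python
# Renewal theorem check: with lifetimes f_1 = 0.2, f_2 = 0.5, f_3 = 0.3
# (aperiodic, since gcd{1, 2, 3} = 1), the mean interoccurrence time is
# mu = 1(0.2) + 2(0.5) + 3(0.3) = 2.1, so u_n should converge to 1/2.1.
f = {1: 0.2, 2: 0.5, 3: 0.3}
mu = sum(n * fn for n, fn in f.items())
u = [1.0]
for n in range(1, 101):
    u.append(sum(fn * u[n - k] for k, fn in f.items() if k <= n))
print(abs(u[-1] - 1.0 / mu) < 1e-12)  # True
```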
3.2 Proof of the Renewal Theorem
I am using my own language to restate the results of Feller (1968, page 335).
First, the result can be stated without the renewal event background.
Equivalent result: Let f_0 = 0, f_1, f_2, . . . be a sequence of nonnegative
numbers such that Σ_n f_n = 1, and such that 1 is the greatest common divisor
of those n for which f_n > 0. Let u_0 = 1 and

u_n = f_0 u_n + f_1 u_{n−1} + · · · + f_n u_0

for all n ≥ 1. Then u_n → µ^{−1} as n → ∞, where µ = Σ_n n f_n (and
µ^{−1} = 0 when µ = ∞). ♦
Note this definition of u_n does not apply to the case n = 0. Otherwise,
the u-sequence would be a convolution of the f-sequence and itself.
The implication of being aperiodic is as follows. Let A be the set of all
integers n for which f_n > 0, and denote by A^+ the set of all positive linear
combinations

p_1 a_1 + p_2 a_2 + · · · + p_r a_r

of numbers a_1, . . . , a_r in A.
Lemma 3.1
There exists an integer N such that A^+ contains all integers n > N. ♦
In other words, if the greatest common divisor of a set of positive integers
is 1, then the set of their positive linear combinations contains all but a finite
number of positive integers. To reduce the mathematical burden, this result
will not be proved here. Some explanation will be given in class.
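A small computational illustration of Lemma 3.1, with the assumed set A = {4, 9} (whose gcd is 1): every sufficiently large integer is a positive-integer combination of 4 and 9.

```python
# Is n = p*4 + q*9 for some integers p, q >= 1?  For coprime 4 and 9, the
# largest integer with no such representation is 4*9 - 4 - 9 + 4 + 9 = 36
# (the classical Frobenius bound shifted for strictly positive coefficients).
def representable(n, a, b):
    return any((n - p * a) > 0 and (n - p * a) % b == 0
               for p in range(1, n // a + 1))

print(all(representable(n, 4, 9) for n in range(37, 200)))  # True
print(representable(36, 4, 9))                              # False
```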
The next lemma has been restated in different words.

Lemma 3.2
Suppose we have a sequence of sequences {(a_{m,n})_{m=1}^∞}_{n=1}^∞ such
that 0 ≤ a_{m,n} ≤ 1 for all m, n. Then there exists a subsequence
{n_i : i = 1, 2, . . .} such that

lim_{i→∞} a_{m,n_i}

exists for each m. ♦
Again, the proof will be omitted. This result is also used when proving
a result about the relationship between convergence in distribution and the
convergence of characteristic functions.
Lemma 3.3
Let {w_n}_{n=−∞}^∞ be a doubly infinite sequence of numbers such that
0 ≤ w_n ≤ 1 and

w_n = Σ_{k=1}^∞ f_k w_{n−k}

for each n. If w_0 = 1, then w_n = 1 for all n. ♦
To be more explicit, the sequence f
n
is assumed to have the same property
as the sequence introduced earlier.
Proof: Since all numbers involved are nonnegative and w_n ≤ 1, we have, for
all n,

w_n = Σ_{k=1}^∞ f_k w_{n−k} ≤ Σ_k f_k = 1.

Applying this conclusion to n = 0 with the condition w_0 = 1, we have

w_{−k} = 1 whenever f_k > 0.

Recalling the definition of A, the above conclusion can be restated as

w_{−a} = 1 whenever a ∈ A.
For each a ∈ A, applying the same argument

w_n = Σ_{k=1}^∞ f_k w_{n−k} ≤ Σ_k f_k = 1

to n = −a, we find w_{−a−k} = 1 whenever both a, k ∈ A.

By induction, this can be strengthened to w_{−a} = 1 for any a ∈ A^+.
An implication is then: there exists an N such that

w_{−N} = 1, w_{−N−1} = 1, w_{−N−2} = 1, . . . .

Using

w_n = Σ_{k=1}^∞ f_k w_{n−k}

again for n = −N + 1, we get w_{−N+1} = 1. Repeating, we get w_{−N+2} = 1.
Repeating again and again, we find w_n = 1 for all n. ♦
Finally, we prove the renewal theorem.

Proof of the Renewal Theorem: Our goal is to prove that the limit of u_n
exists and equals µ^{−1}. Since the items in this sequence are bounded, a
subsequence can be found that converges to some number η. Assume this η
is the upper limit (limit superior).

Notationally, let us write u_{i_v} → η as v → ∞.
Let u_{n,v} be a double sequence such that, for each v, u_{n,v} = u_{n+i_v}
when n + i_v ≥ 0, and u_{n,v} = 0 otherwise. In other words, it shifts the
original sequence by i_v positions toward the left, padding with zeroes to
make a doubly infinite sequence.

For example, suppose i_5 = 10; then we have
. . . , u_{−2,5} = u_8, u_{−1,5} = u_9, u_{0,5} = u_{10}, u_{1,5} = u_{11}, . . . .

If we set n = 0 and let v increase, the resulting sequence is

u_{0,0} = u_{i_0}, u_{0,1} = u_{i_1}, u_{0,2} = u_{i_2}, . . . .

Hence, lim_{v→∞} u_{0,v} = η.
For the sequences with n = ±1, they look like

u_{1,0} = u_{i_0+1}, u_{1,1} = u_{i_1+1}, u_{1,2} = u_{i_2+1}, . . . ;
u_{−1,0} = u_{i_0−1}, u_{−1,1} = u_{i_1−1}, u_{−1,2} = u_{i_2−1}, . . . ;

with the additional clause that u_{n,v} = 0 if n < −i_v.
In summary, for each n = 0, ±1, ±2, . . ., we have a sequence u_{n,v} in v.
By the earlier lemma, it is possible to find a subsequence in v, say v_j,
j = 1, 2, . . ., such that

lim_{j→∞} u_{n,v_j}

exists for each n. Call this limit w_n.
Since, for all n > −i_v, we have

u_{n,v} = Σ_{k=1}^∞ f_k u_{n−k,v},

taking the limit along the path v = v_j gives

w_n = Σ_{k=1}^∞ f_k w_{n−k}.
By the other lemma, we then get w
n
= η for all n.
After such a long labour, we have only achieved that the limits of
u_{n,v_j} = u_{v_j+n}, j = 1, 2, . . ., are the same for every n. What we want,
however, is that the limit of u_n exists and equals µ^{−1}.

To aim a bit lower, let us first show that η = µ^{−1}. This is easy when
µ = ∞.
Let

ρ_k = f_{k+1} + f_{k+2} + · · ·

for k = 0, 1, . . .; then Σ_k ρ_k = µ. We should have seen this before, but if
you do not remember, it is a good opportunity for us to redo it with the
generating function concept. Another note is that

1 − ρ_k = f_1 + f_2 + · · · + f_k.
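The tail-sum identity Σ_k ρ_k = µ can be checked numerically with any assumed finite lifetime distribution:

```python
# rho_k = f_{k+1} + f_{k+2} + ... ; the tail sums add up to the mean mu.
f = {1: 0.2, 2: 0.5, 3: 0.3}          # an assumed lifetime distribution
mu = sum(n * fn for n, fn in f.items())
rho = [sum(fn for n, fn in f.items() if n > k) for k in range(max(f))]
print(abs(sum(rho) - mu) < 1e-12)  # True
```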
Let us now list the defining relations as follows:

u_1 = f_1 u_0;
u_2 = f_2 u_0 + f_1 u_1;
u_3 = f_3 u_0 + f_2 u_1 + f_1 u_2;
. . .
u_N = f_N u_0 + f_{N−1} u_1 + f_{N−2} u_2 + · · · + f_1 u_{N−1}.
Adding up both sides, we get

Σ_{k=1}^N u_k = (1 − ρ_N) u_0 + (1 − ρ_{N−1}) u_1 + (1 − ρ_{N−2}) u_2
              + · · · + (1 − ρ_1) u_{N−1}.

This is the same as

u_N = (1 − ρ_N) u_0 − ρ_{N−1} u_1 − ρ_{N−2} u_2 − · · · − ρ_1 u_{N−1}.

Noting that u_0 = 1, we have

ρ_N u_0 + ρ_{N−1} u_1 + · · · + ρ_1 u_{N−1} + ρ_0 u_N = 1.    (3.1)
Recall that

lim_{j→∞} u_{n,v_j} = w_n = η

for each n. At the same time, u_{n,v_j} = u_{v_j+n} is practically true for
all n (or, more precisely, for all large n). Letting N = v_j and j → ∞,
(3.1) implies

η × {Σ_k ρ_k} = 1.

That is, η = µ^{−1} when µ is finite, and η = 0 otherwise.
Even with this much trouble, our proof is not complete yet. By definition,
η is the upper limit of u_n. Is it also THE limit of u_n? If the limit exists,
then the answer is yes. If µ = ∞, the answer is also yes, because the lower
limit cannot be smaller than 0. Otherwise, we have more work to do.

If the limit of u_n does not exist, u_n must have a subsequence whose limit
exists and is smaller than η. Let this limit be η_0 < η. Since η is the upper
limit of u_n, given any small positive quantity ε, we have u_n ≤ η + ε for all
large enough n. Let us examine the relationship
ρ_N u_0 + ρ_{N−1} u_1 + · · · + ρ_1 u_{N−1} + ρ_0 u_N = 1
again. Replacing some of the u_k by the larger value 1, it becomes

Σ_{k=0}^{N−r−1} ρ_{N−k} + Σ_{k=N−r}^{N−1} ρ_{N−k} u_k + ρ_0 u_N ≥ 1.
By including more terms in the first summation, it becomes

Σ_{k=r+1}^∞ ρ_k + Σ_{k=N−r}^{N−1} ρ_{N−k} u_k + ρ_0 u_N ≥ 1.
Replacing u_k by the larger value η + ε, which is valid for large k, we have

Σ_{k=r+1}^∞ ρ_k + (η + ε) Σ_{k=1}^r ρ_k + ρ_0 u_N ≥ 1.

Further, since Σ_{k=0}^∞ ρ_k = µ, we get

Σ_{k=r+1}^∞ ρ_k + (η + ε)(µ − ρ_0) + ρ_0 u_N ≥ 1.

Reorganizing terms slightly, we get

Σ_{k=r+1}^∞ ρ_k + (η + ε)µ + ρ_0 (u_N − η − ε) ≥ 1.

At last, letting N go to infinity along the subsequence with limit η_0, we get

Σ_{k=r+1}^∞ ρ_k + (η + ε)µ + ρ_0 (η_0 − η − ε) ≥ 1.

Since this result applies to any r and ε, letting r → ∞ and ε → 0, we get

η µ + ρ_0 (η_0 − η) ≥ 1.

Because η µ = 1, we must have η_0 − η ≥ 0. Combined with the fact that η
is the upper limit, we must have η_0 = η, or there would be a contradiction.

Since the limit of any subsequence of u_n must be the same as η, the limit
of u_n exists and equals η.

Thus, we have finally proved the Renewal Theorem. ♦
3.3 Properties of Random Walks by Generating Functions
3.3.1 Quick derivation of some generating functions
The generating functions for certain sequences:

For “first return to 0”: F(s) = 1 − (1 − 4pqs^2)^{1/2}.
For “return to 0”: U(s) = (1 − 4pqs^2)^{−1/2}.
For “first passage of 1”: Λ(s) = (2qs)^{−1}[1 − (1 − 4pqs^2)^{1/2}].

Consequences:

P(ever returns to 0) = 1 − |p − q|.
P(ever reaches 1) = min(1, p/q).

(We will go through a lot of details in class.)
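These expansions can be cross-checked numerically: the coefficients of F(s) are f_{2m} = C(2m, m)(pq)^m/(2m − 1), and summing them should reproduce F(1) = 1 − |p − q| (p = 0.6 is an assumed example):

```python
import math

# First-return probabilities of the simple random walk, summed against the
# closed form F(1) = 1 - |p - q|.
p = 0.6
q = 1.0 - p

def f2m(m, p, q):
    # Coefficient of s^(2m) in F(s) = 1 - (1 - 4pq s^2)^(1/2).
    return math.comb(2 * m, m) * (p * q)**m / (2 * m - 1)

total = sum(f2m(m, p, q) for m in range(1, 400))
print(abs(total - (1 - abs(p - q))) < 1e-6)  # True
```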
3.3.2 Hitting time theorem
The hitting time theorem is more general than what has been shown so far.
Consider the random walk

S_n = X_1 + X_2 + · · · + X_n,

where the iid steps satisfy P(X ≤ 1) = 1 and P(X = 1) > 0. This is the
so-called right-continuous random walk: the random walk cannot skip over
(upward) any state without stepping on it.

Let T_b be the first time when S_n = b for some b > 0, and assume S_0 = 0.
It has been shown for the simple random walk that

f_b(n) = P(T_b = n) = (b/n) P(S_n = b).
Let us see why this is still true for the new random walk we have just
deﬁned.
For this purpose, define

F_b(z) = E(z^{T_b}),
which is the generating function of T_b. It may happen that P(T_b = ∞) > 0,
in which case it is more precise to define

F_b(z) = lim_{N→∞} Σ_{n=1}^N z^n P(T_b = n).

We also define

G(z) = E(z^{1−X_1}).
Since X_1 can be negative with a large absolute value, working with the
generating function of 1 − X_1 makes sense. Since 1 − X_1 is nonnegative,
G is well defined for all |z| ≤ 1. Please note that our G here is a bit
different from the G in the textbook.

The purpose of using the letter z, rather than s, is to allow z to be a
complex number. It seems that defining a function without a value at z = 0
calls for some attention. We will pretend z is just a real number for
illustration.
If we ignore the mathematical subtlety, our preparations boil down to
having defined the generating functions of T_b and X_1. Due to the way S_n is
defined, the “generating function” of S_n is determined by [G(z)]^n. Hence the
required hitting time theorem may be obtained by linking G(z) and F_b(z).
This step turns out to be rather simple. It is seen that

F_b(z) = [F_1(z)]^b

as the random walk cannot skip a state without landing on it. We take notice
that this relationship works even when b = 0.
Further,

F_1(z) = E[E(z^{T_1} | X_1)] = E[z^{1+T_{1−X_1}}] = z E{[F_1(z)]^{1−X_1}}
       = z G(F_1(z)),

where T_{1−X_1} denotes the additional time needed to climb from X_1 to 1.
Denoting w = F_1(z), this relation can be written as

z = w / G(w).

That is, on one hand w = F_1(z); on the other hand, the relationship between
w and z is also completely determined by the above equation. The two
versions must be consistent with each other, and the result to be proved is a
consequence of this consistency.
It turns out there is a complex analysis theorem for inverting z = w/G(w).
Theorem 3.5 (Lagrange’s inversion formula)
Let z = w/f(w), where f(w) is an analytic function (has a derivative in
the complex analysis sense) in a neighborhood of w = 0, with a one-to-one
relationship between z and w in this neighborhood. If g is infinitely
differentiable (being analytic already implies infinitely differentiable), then

g(w(z)) = g(0) + Σ_{n=1}^∞ (z^n / n!) [ (d^{n−1}/du^{n−1}) { g′(u) [f(u)]^n } ]_{u=0}.

(See Kyrala (1972), Applied Functions of a Complex Variable, page 51, for a
close resemblance of this result; it is related to Cauchy integration over a
closed path.) ♦
Applying this result with f(w) being our G(w) and g(w) = w^b, we have

[F_1(z)]^b = Σ_{n=1}^∞ (z^n / n!) [ (d^{n−1}/du^{n−1}) { b u^{b−1} [G(u)]^n } ]_{u=0}.
Let us not forget that [F_1(z)]^b is the “probability” generating function of
T_b, and [G(u)]^n = E(u^{n−S_n}).

Matching the coefficients of z^n, we have

n! P(T_b = n) = [ (d^{n−1}/du^{n−1}) { b u^{b−1} [G(u)]^n } ]_{u=0}.
Notice the latter is (n − 1)! times the coefficient of u^{n−1} in the power
expansion of b u^{b−1} [G(u)]^n, which in turn is the coefficient of u^{n−b} in
the power series expansion of b [G(u)]^n, and which is the coefficient of
u^{−b} in the expansion of b u^{−n} [G(u)]^n.

Now, we point out that u^{−n} [G(u)]^n = E(u^{−S_n}), so (n − 1)! times this
coefficient is b (n − 1)! P(S_n = b). Hence n! P(T_b = n) = b (n − 1)! P(S_n = b),
which gives the result. ♦
3.3.3 Spitzer’s Identity
Here we discuss the magical results of Spitzer’s identity. I kind of believe
that a result like this can be useful in statistics. Yet I ﬁnd that it is still too
complex to be practical at the moment.
Theorem 3.6
Assume that S_n is a right-continuous random walk, and let M_n = max{S_i :
0 ≤ i ≤ n} be the maximum of the walk up to time n. Then for |s|, |t| < 1,

log [ Σ_{n=0}^∞ t^n E(s^{M_n}) ] = Σ_{n=1}^∞ (1/n) t^n E(s^{S_n^+}),

where S_n^+ = max{0, S_n} (the nonnegative part of S_n). ♦
Proof: We keep the notation f_b(n) = P(T_b = n) as before. To have M_n = b,
the walk has to land on b at some time j between 0 and n and, at the same
time, not land on b + 1 within the remaining n − j steps. Thus,

P(M_n = b) = Σ_{j=0}^n P(T_b = j) P(T_1 > n − j).
Multiplying both sides by s^b t^n and summing over b, n ≥ 0, the left hand
side is

Σ_{n=0}^∞ t^n E(s^{M_n}).

Recall that Σ_n t^n P(T_1 > n) = [1 − F_1(t)]/(1 − t). The right hand side
is, by the convolution relationship,

Σ_{b=0}^∞ s^b E(t^{T_b}) · [1 − F_1(t)]/(1 − t)
  = [1 − F_1(t)]/(1 − t) Σ_{b=0}^∞ s^b [F_1(t)]^b
  = [1 − F_1(t)] / [(1 − t)(1 − s F_1(t))].

Denote this function as D(s, t).
The rest is mathematical manipulation. By the hitting time theorem,

n P(T_1 = n) = P(S_n = 1) = Σ_{j=0}^n P(T_1 = j) P(S_{n−j} = 0),

and we get the generating function relationship

t F_1′(t) = F_1(t) U(t).
(Recall that U(t) is the g.f. for returns to 0.)
With this relationship, we have

∂/∂t log[1 − s F_1(t)]
  = −s F_1′(t) / [1 − s F_1(t)]
  = −(s/t) F_1(t) U(t) Σ_k s^k [F_1(t)]^k
  = −Σ_{k=0}^∞ (s^{k+1}/t) [F_1(t)]^{k+1} U(t)
  = −Σ_{k=1}^∞ (s^k/t) [F_1(t)]^k U(t).
Notice that

P(S_n = k) = Σ_{j=0}^n P(T_k = j) P(S_{n−j} = 0).

Thus [F_1(t)]^k U(t) is in fact the generating function of the sequence
P(S_n = k) in n. That is,

[F_1(t)]^k U(t) = Σ_{n=0}^∞ t^n P(S_n = k).

Notice that the sum is over n.
It therefore implies

∂/∂t log[1 − s F_1(t)] = −Σ_{n=1}^∞ t^{n−1} Σ_{k=1}^∞ s^k P(S_n = k).
So, we get

∂/∂t log D(s, t)
  = −(∂/∂t) log(1 − t) + (∂/∂t) log[1 − F_1(t)] − (∂/∂t) log[1 − s F_1(t)]
  = Σ_{n=1}^∞ t^{n−1} [ 1 − Σ_{k=1}^∞ P(S_n = k) + Σ_{k=1}^∞ s^k P(S_n = k) ]
  = Σ_{n=1}^∞ t^{n−1} [ P(S_n ≤ 0) + Σ_{k=1}^∞ s^k P(S_n = k) ]
  = Σ_{n=1}^∞ t^{n−1} E(s^{S_n^+}).
Now integrate with respect to t to get the ﬁnal result.
♦
Remark: I do not see why this final result is more meaningful than the
intermediate result. I am wondering if any of you can offer some insight.
3.3.4 Leads for tied-down random walk
Let us now go back to the simple random walk with S_0 = 0 and
P(X_i = 1) = p, P(X_i = −1) = q = 1 − p.

Define L_{2n} as the number of steps during which the random walk is
above the line 0 (but is allowed to touch 0). We have an arc sine law
established for this random variable already. This time, we investigate its
property given that S_{2n} = 0.
Theorem 3.7 (Leads for tied-down random walk)
For the simple random walk S,

P(L_{2n} = 2k | S_{2n} = 0) = 1/(n + 1),

for k = 0, 1, . . . , n. ♦
Remark: we might have expected more probability at L_{2n} = n, which
divides the time evenly above and below zero. This theorem, however, claims
that the uniform distribution is the truth. Note also that the result does not
depend on the value of p.
Proof: Define T_0 to be the first time when S is zero again. Given T_0 = 2r
for some r, L_{2r} can be either 0 or 2r, as the simple random walk does not
cross zero before time 2r. Making use of the results on the sizes of λ_{2r}
and f_{2r}, it turns out that the conditional distribution places half and half
probabilities on 0 and 2r. Thus,

E(s^{L_{2r}} | S_{2n} = 0, T_0 = 2r) = 1/2 + (1/2) s^{2r}.
Now we define G_{2n}(s) = E[s^{L_{2n}} | S_{2n} = 0] and F_0(s) = E(s^{T_0}).
Further, let

H(s, t) = Σ_{n=0}^∞ t^{2n} P(S_{2n} = 0) G_{2n}(s).
Conditioning on T_0, we have

G_{2n}(s) = Σ_{r=1}^n E[s^{L_{2n}} | S_{2n} = 0, T_0 = 2r] P(T_0 = 2r | S_{2n} = 0)
         = Σ_{r=1}^n (1/2 + (1/2) s^{2r}) G_{2n−2r}(s) P(T_0 = 2r | S_{2n} = 0),

since, given T_0 = 2r, the walk after time 2r is again a tied-down walk of
length 2n − 2r. Also,

P(T_0 = 2r | S_{2n} = 0) = P(T_0 = 2r) P(S_{2n−2r} = 0) / P(S_{2n} = 0).

Hence,

H(s, t) − 1 = (1/2) H(s, t) [F_0(t) + F_0(st)].
Recall that

F_0(t) = 1 − (1 − 4pqt^2)^{1/2};

one obtains

H(s, t) = 2 / [ (1 − 4pqt^2)^{1/2} + (1 − 4pqs^2t^2)^{1/2} ]
        = 2 [ (1 − 4pqs^2t^2)^{1/2} − (1 − 4pqt^2)^{1/2} ] / [ 4pq t^2 (1 − s^2) ]
        = Σ_{n=0}^∞ t^{2n} P(S_{2n} = 0) [ (1 − s^{2n+2}) / ((n + 1)(1 − s^2)) ].
Comparing the coefficients of t^{2n} in the two expansions of H(s, t), we
deduce

G_{2n}(s) = Σ_{k=0}^n (n + 1)^{−1} s^{2k}.

♦
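Theorem 3.7 can be confirmed by brute-force enumeration for a small n (a step counts toward L_{2n} when both of its endpoints are ≥ 0):

```python
from itertools import product

# Among all simple random walk paths of length 2n with S_0 = S_{2n} = 0,
# count the lead time L_{2n}. Conditionally on S_{2n} = 0, each value
# 2k (k = 0, ..., n) should appear equally often: 70 paths / 5 values = 14.
n = 4
counts = {}
for steps in product((1, -1), repeat=2 * n):
    s, lead, prev = 0, 0, 0
    for step in steps:
        s += step
        if prev >= 0 and s >= 0:   # step lies on or above the axis
            lead += 1
        prev = s
    if s == 0:                     # tied-down paths only
        counts[lead] = counts.get(lead, 0) + 1
print(sorted(counts.items()))  # each of the 5 even values appears 14 times
```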
3.4 Branching Process
We only consider the simple case where Z_0 = 1, representing a species that
starts from a single individual. It is assumed that each individual gives birth
to a random number of offspring, independently of the others. We call this
number the family size. We assume the family sizes have the same
distribution, with probability generating function G(s). Let µ and σ^2 be the
mean and variance of the family size.

Let the population size of the nth generation be called Z_n.
It is known that

E[Z_n] = µ^n,
Var(Z_n) = σ^2 (µ^n − 1) µ^{n−1} / (µ − 1).
These relations can be derived from the identity

G_n(t) = G(G_{n−1}(t)),

where G_n(t) = E(t^{Z_n}). The same identity also implies that the probability
of ultimate extinction η = lim P(Z_n = 0) is the smallest nonnegative solution
of the equation

x = G(x).
Because of this, it is known that:
1. if µ < 1, then η = 1;
2. if µ > 1, then η < 1;
3. if µ = 1 and σ^2 > 0, then η = 1;
4. if µ = 1 and σ^2 = 0, then η = 0.
Discussion will be given in class.
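A sketch of the extinction computation, with an assumed geometric family-size distribution:

```python
# Extinction probability as the smallest nonnegative root of x = G(x), found
# by fixed-point iteration eta_{k+1} = G(eta_k) starting from 0.
# Assumed example: P(family size = k) = (1-a) a^k, k = 0, 1, ..., for which
# G(s) = (1-a)/(1 - a*s), mu = a/(1-a), and eta = min(1, (1-a)/a).
a = 0.6                      # mu = 1.5 > 1, so eta < 1
G = lambda s: (1 - a) / (1 - a * s)
eta = 0.0
for _ in range(5000):
    eta = G(eta)
print(round(eta, 6))  # 0.666667, i.e. (1 - a)/a = 2/3
```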
3.5 Summary
The biggest deal of this chapter is the proof of the Renewal Theorem. The
proof is not much related to generating functions at all. One may learn a lot
from this proof. At the same time, it is okay to choose to ignore the proof
completely.

One should be able to see the beauty of generating functions when handling
problems related to the simple random walk and the branching process. At
the same time, none of what was discussed should be new to you. You may
realize that the two very short sections on the simple random walk and the
branching process are rich in content. Please take the opportunity to plant
this knowledge firmly in your brain if it was not there before.
Chapter 4
Discrete Time Markov Chain
A discrete time Markov chain is a stochastic process consisting of a countable
number of random variables arranged in a sequence. Usually, we name these
random variables X_0, X_1, X_2, . . .. To qualify as a Markov chain, its state
space S, which is the set of all possible values of these random variables,
must be countable. In addition, it must have the Markov property:

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, . . . , X_1 = i_1, X_0 = i_0)
  = P(X_{n+1} = j | X_n = i).
When this transition probability does not depend on n, we say that the
Markov chain is time homogeneous.

The Markov chains discussed in this course will be assumed time
homogeneous unless otherwise specified. In this case, we use the notation

p_{ij} = P(X_1 = j | X_0 = i)

and P for the matrix with (i, j)th entry p_{ij}.

It is known that all entries of the transition matrix are nonnegative, and
its row sums are all 1.
Further, let P^{(m)} be the m-step transition matrix defined by

p^{(m)}_{ij} = P(X_m = j | X_0 = i).

Then these matrices satisfy the Chapman-Kolmogorov equations:

P^{(m+n)} = P^{(m)} P^{(n)}

for all nonnegative integers m and n. Here, we denote P^{(0)} = I.
Let µ^{(m)} be the row vector whose ith entry is

µ^{(m)}_i = P(X_m = i).

It can be seen that µ^{(m+n)} = µ^{(m)} P^{(n)}.
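The Chapman-Kolmogorov equations can be checked mechanically for an assumed two-state transition matrix:

```python
# P^(m+n) = P^(m) P^(n) for matrix powers, using plain Python lists.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(P, m):
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(m):
        R = matmul(R, P)
    return R

P = [[0.9, 0.1],
     [0.4, 0.6]]      # an assumed two-state chain
lhs = matpow(P, 5)
rhs = matmul(matpow(P, 2), matpow(P, 3))
print(all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
          for i in range(2) for j in range(2)))  # True
```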
Both the simple random walk and the branching process are special cases
of discrete time Markov chains.
Example 4.1 Markov's other chain.
Note that the Chapman-Kolmogorov equations are a direct consequence of
the Markov property. Suppose that a discrete time stochastic process has a
countable state space, and its transition probability matrices, defined in the
same way as for a Markov chain, satisfy the Chapman-Kolmogorov equations.
Does it have to be a Markov chain?

In mathematics, we can either offer a rigorous proof to show this is true,
or offer an example of a stochastic process with this property which is not a
Markov chain. It turns out the latter is the case.
Let Y_1, Y_3, Y_5, . . . be a sequence of iid random variables such that

P(Y_1 = 1) = P(Y_1 = −1) = 1/2.

Let Y_{2k} = Y_{2k−1} Y_{2k+1} for k = 1, 2, . . .. We hence have a stochastic
process Y_1, Y_2, Y_3, . . . with state space {−1, 1}.
Note that

E[Y_{2k} Y_{2k+1}] = E[Y_{2k−1} Y_{2k+1}^2] = E[Y_{2k−1}] = 0.

Since these random variables take only two possible values, having correlation
0 implies independence. All other non-neighbouring pairs of random variables
are also independent of each other by definition.
Thus,

P(Y_{m+n} = j | Y_n = i) = P(Y_{m+n} = j) = 1/2

for any i, j = ±1. Thus, the m-step transition matrix is

P^{(m)} = P = [ 1/2  1/2 ]
              [ 1/2  1/2 ],
which is idempotent. This implies P^{(m+n)} = P^{(m)} P^{(n)} for all m and n.
That is, the Chapman-Kolmogorov equations are satisfied. However,

P(Y_{2k+1} = 1 | Y_{2k} = −1) = 1/2

but

P(Y_{2k+1} = 1 | Y_{2k} = −1, Y_{2k−1} = 1) = 0.

Thus, this process does not have the Markov property and is not a Markov
chain. ♦
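The example can be verified exactly by enumerating the joint law of (Y_1, Y_2, Y_3):

```python
from itertools import product
from fractions import Fraction

# Y1, Y3 iid uniform on {-1, 1} and Y2 = Y1 * Y3. Conditioning on the extra
# past value Y1 changes the law of Y3: no Markov property.
half = Fraction(1, 2)
joint = {}  # (y1, y2, y3) -> probability
for y1, y3 in product((-1, 1), repeat=2):
    key = (y1, y1 * y3, y3)
    joint[key] = joint.get(key, Fraction(0)) + half * half

def cond(event, given):
    num = sum(pr for key, pr in joint.items() if event(key) and given(key))
    den = sum(pr for key, pr in joint.items() if given(key))
    return num / den

print(cond(lambda k: k[2] == 1, lambda k: k[1] == -1))                # 1/2
print(cond(lambda k: k[2] == 1, lambda k: k[1] == -1 and k[0] == 1))  # 0
```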
Although this example shows that a stochastic process with transition
probability matrices satisfying the Chapman-Kolmogorov equations is not
necessarily a Markov chain, when we are given a set of stochastic matrices
satisfying the Chapman-Kolmogorov equations, it is always possible to
construct a Markov chain with this set of stochastic matrices as its transition
probability matrices.
4.1 Classification of States and Chains
We use i, j, k and so on to denote states in the state space.

If there exists an integer m ≥ 0 such that p^{(m)}_{ij} > 0, we say that j is
reachable from i.

If i is reachable from j and j is reachable from i, then we say that i and j
communicate.

It is obvious that communication is an equivalence relation. Thus, the
state space is partitioned into classes so that any pair of states in the same
class communicate with each other, while any pair of states from two different
classes do not communicate.

It is usually very simple to classify the state space. In most examples, it
can be done by examining the transition probability matrix of the Markov
chain.
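A sketch of this classification as code, for an assumed four-state transition matrix: compute reachability as a transitive closure, then group mutually reachable states:

```python
# j is reachable from i if p_ij^(m) > 0 for some m >= 0; states reaching each
# other communicate. Reachability = transitive closure of the one-step
# relation, with every state reaching itself (the m = 0 case).
P = [[0.5, 0.5, 0.0, 0.0],
     [0.7, 0.3, 0.0, 0.0],
     [0.2, 0.0, 0.3, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
n = len(P)
reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
for k in range(n):          # Floyd-Warshall style closure
    for i in range(n):
        for j in range(n):
            reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

classes = {i: frozenset(j for j in range(n) if reach[i][j] and reach[j][i])
           for i in range(n)}
print(sorted(set(classes.values()), key=min))  # classes {0, 1}, {2}, {3}
```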
Now let us consider individual classes. Let G be a class. If G consists of
all states of the Markov chain, we say that the chain is irreducible.

If p_{ij} = 0 for every state i ∈ G and every state j ∉ G, then the Markov
chain can never leave class G once it enters. That is, once the value of X_n
is in G for some n, the values of the X_{n+m} are all in G for m ≥ 0. A class
G is closed when it has this property. Notice that, given X_n ∈ G for some
n, the Markov chain {X_n, X_{n+1}, . . .} effectively has G as its state space.
Thus, the state space is reduced when G is a proper subset of the state space.

In contrast to a closed class, if there exist states i ∈ G and j ∉ G such
that p_{ij} > 0, then the class is said to be open.
4.2 Class Properties
It turns out that states in the same class share some common properties.
We call them class properties.

Due to the Markov property, the event X_n = i, for any fixed i, is a renewal
event. Some properties that describe renewal events are positive and null
recurrence, transience, and periodicity.

Definition 4.1 Properties of renewal events
1. A renewal event is recurrent (persistent) if the probability of its future
occurrence is 1, given its occurrence at time 0.
2. If a renewal event is not recurrent, then it is called transient.
3. If the expected waiting time for the next occurrence of a recurrent
renewal event is finite, then the renewal event is positive recurrent.
Otherwise, it is null recurrent.
4. Let A be a renewal event and T = {n : P(A occurs at n | A occurs at 0) > 0}.
The period of A is d, the greatest common divisor of T. If d = 1, then A
is aperiodic.
♦
Due to the fact that entering a state is a renewal event, some properties
of renewal events translate easily here. Let

P_{ij}(s) = Σ_{n=0}^∞ s^n p_{ij}(n)
as the generating function of the nstep transition probability. Deﬁne
f
ij
(n) = P(X
1
= j, . . . , X
n−1
= jX
0
= i)
be the probability of entering j from i for the ﬁrst time at time n. Let its
corresponding generating function be F
ij
(s). We let
f
ij
=
n
f
ij
(n)
for the probability that the chain ever enters state j starting from i. Note
that f
ij
= F
ij
(1).
Lemma 4.1
(a): P_{ii}(s) = 1 + F_{ii}(s) P_{ii}(s).
(b): P_{ij}(s) = F_{ij}(s) P_{jj}(s) if i ≠ j. ♦

Proof: The first identity is the renewal equation; the second is the delayed
renewal equation. ♦
Based on these two equations, it is easy to show the following.

Theorem 4.1
(a) State j is recurrent if and only if Σ_n p_{jj}(n) = ∞. Consequently,
Σ_n p_{ij}(n) = ∞ when j is reachable from i.
(b) State j is transient if and only if Σ_n p_{jj}(n) < ∞. Consequently,
Σ_n p_{ij}(n) < ∞.
♦
Proof: Using the generating function equations.
Theorem 4.2
All properties in Definition 4.1 are class properties. ♦

Proof: Suppose i and j are two states in the same class and state i is
recurrent; we want to show that state j is also recurrent.
Let p_{ij}(m) be the m-step transition probability from i to j. Hence,
Σ_m p_{ii}(m) = ∞. Since i and j communicate, there exist m_1 and m_2 such
that p_{ij}(m_1) > 0 and p_{ji}(m_2) > 0. Hence,

p_{jj}(m + m_1 + m_2) ≥ p_{ji}(m_2) p_{ii}(m) p_{ij}(m_1).

Consequently,

Σ_m p_{jj}(m) ≥ Σ_m p_{jj}(m + m_1 + m_2)
            ≥ p_{ji}(m_2) {Σ_m p_{ii}(m)} p_{ij}(m_1)
            = ∞.

That is, j is also a recurrent state.
The above proof also implies that if i is transient, j cannot be recurrent.
Hence, j is also transient.
We delegate the positive recurrence proof to the future.
At last, we show that i and j must have the same period.
Define T_i = {n : p_{ii}(n) > 0} and similarly T_j. Let d_i and d_j be the
periods of states i and j. Note that if n_1, n_2 ∈ T_i, then a n_1 + b n_2 ∈ T_i
for any positive integers a and b. Since d_i is the greatest common divisor of
T_i, there exist integers a_1, . . . , a_m and n_1, . . . , n_m ∈ T_i such that

a_1 n_1 + a_2 n_2 + · · · + a_m n_m = d_i.
By grouping positive and negative coefficients, it is easy to see that the
number of terms m can be reduced to 2. Thus, we assume that it is possible
to find a_1, a_2 and n_1, n_2 such that

a_1 n_1 + a_2 n_2 = d_i.
We can then further pick nonnegative coefficients such that

a_{11} n_1 + a_{12} n_2 = k d_i,   a_{21} n_1 + a_{22} n_2 = (k + 1) d_i

for some positive integer k.
Let m_1 and m_2 be numbers of steps in which the chain can go from state i
to state j and return from j to i, respectively. Then both

m_1 + m_2 + k d_i  and  m_1 + m_2 + (k + 1) d_i ∈ T_j.
Thus, we must have d
j
divides d
i
. The reverse is also true by symmetry.
Hence d
i
= d
j
. ♦
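The period d_i = gcd{n : p_{ii}(n) > 0} can be computed directly for a small chain; a sketch using a hypothetical deterministic 3-cycle (every state has period 3), truncating the search at a finite horizon:

```python
import numpy as np
from math import gcd
from functools import reduce

# A deterministic 3-cycle: 0 -> 1 -> 2 -> 0; every state has period 3.
P = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

def period(P, i, n_max=30):
    """gcd of {n <= n_max : p_ii(n) > 0}; n_max truncates the search."""
    returns, Pn = [], np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 1e-12:
            returns.append(n)
    return reduce(gcd, returns)

print([period(P, i) for i in range(3)])  # [3, 3, 3]
```

Consistent with Theorem 4.2, all three communicating states report the same period.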
4.2.1 Properties of Markov Chains with a Finite Number of States
If a Markov chain has a finite state space, then at least one of the states will be visited an infinite number of times over an infinite time horizon. Thus, at least one of them will be recurrent (persistent). We can prove this result rigorously.
Lemma 4.2
If the state space is finite, then at least one state is recurrent and all recurrent states are positive recurrent. ♦
Proof: A necessary condition for a recurrent event to be transient is u_n → 0 as n increases. In the context of a Markov chain, it implies that if state j is transient, then p_{ij}(n) → 0. Because
1 = ∑_j p_{ij}(n),
we cannot have p_{ij}(n) → 0 for all j when the state space is finite. Hence, at least one state is recurrent. ♦
Intuitively, the last conclusion implies that there is always a closed class
of recurrent states for a Markov chain with ﬁnite state space. Since the
Markov chain cannot escape from the ﬁnite class, the average waiting time
for the next visit will be ﬁnite. Thus, at least one of them is recurrent and
hence positive recurrent.
We skip the rigorous proof for now.
4.3 Stationary Distribution and the Limit Theorem
A Markov chain consists of a sequence of discrete random variables. For
convenience, they are regarded as nonnegative integers.
For each n, X_n is a random variable whose distribution may be different from that of X_{n−1} but evolves from it. When n is large, the dependence of the distribution of X_n on that of X_0 gets weaker and weaker. It is therefore possible that the distribution stabilizes and hence has a limit. This limit turns out to exist in many cases, and it is related to the so-called stationary distribution.
Definition 4.1
The vector π is called a stationary distribution of a Markov chain if π has entries π_j, j ∈ S, such that
(a) π_j ≥ 0 for all j and ∑_j π_j = 1;
(b) π = πP, where P is the transition probability matrix. ♦
It is seen that if the probability function of X_0 is given by a vector α, then the probability function of X_n is given by
β(n) = α P^n.
Consequently, if the distribution of X_n is given by π, then the distributions of all X_{n+m} are given by π. Hence it gains the name of stationary.
It should be noted that if the distribution of X_n is given by π, the distribution of X_{n−m} does not have to be π.
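The relation β(n) = αP^n and the fixed-point equation π = πP can be illustrated numerically; the transition matrix below is a hypothetical example of my own, and the stationary vector is extracted as the left eigenvector for eigenvalue 1:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

def evolve(alpha, P, n):
    """Distribution of X_n when X_0 has distribution alpha: beta(n) = alpha P^n."""
    return alpha @ np.linalg.matrix_power(P, n)

def stationary(P):
    """Solve pi = pi P, sum(pi) = 1, via the left eigenvector for eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi = np.abs(pi)            # the Perron eigenvector has constant sign
    return pi / pi.sum()

alpha = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
pi = stationary(P)
# The distribution of X_n stabilizes at pi regardless of alpha.
print(np.abs(evolve(alpha, P, 100) - pi).max())  # tiny
```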
Lemma 4.3
If the Markov chain is irreducible and recurrent, there exists a positive root x of the equation x = xP, which is unique up to a multiplicative constant. The chain is non-null if ∑_i x_i < ∞, and null if ∑_i x_i = ∞. ♦
Proof: Assume the chain is recurrent and irreducible. For any states k, i ∈ S, define
N_{ik} = ∑_n I(X_n = i, T_k ≥ n)
with T_k being the time of the first return to state k. Note that T_k is a well-defined random variable because P(T_k < ∞) = 1.
Define ρ_i(k) = E[N_{ik} | X_0 = k], which is the mean number of visits of the chain to state i between two successive visits to state k.
By the way, if j is recurrent, then µ_j = E[T_j] is called the mean recurrence time. It is allowed to be infinite when the summation in the definition of the expectation does not converge. When j is transient, we simply define µ_j = ∞. Thus, being positive recurrent is equivalent to the claim that µ_j < ∞.
Note that ρ_k(k) = 1. At the same time,
ρ_i(k) = ∑_n P(X_n = i, T_k ≥ n | X_0 = k).
It will be seen that the vector ρ(k), with ρ_i(k) as its ith component, is a basis for finding the stationary distribution.
We first show that the ρ_i(k) are finite for all i. Let
l_{ki}(n) = P(X_n = i, T_k ≥ n | X_0 = k),
the probability that the chain reaches i in n steps with no intermediate return to the starting point k.
With this definition, we see that
f_{kk}(m + n) ≥ l_{ki}(m) f_{ik}(n),
in which the right-hand side covers part of the sample paths taking a round trip from k back to k in m + n steps, without visiting k at intermediate steps. Because the chain is irreducible, there exists n such that f_{ik}(n) > 0. Now sum over m on both sides and we get
∑_m f_{kk}(m + n) ≥ f_{ik}(n) ∑_m l_{ki}(m).
Since ∑_m f_{kk}(m + n) < ∞ and f_{ik}(n) > 0, we get ∑_m l_{ki}(m) < ∞. That is, ρ_i(k) < ∞.
Now we move to the next step. Note that l_{ki}(1) = p_{ki}, the one-step transition probability. Further,
l_{ki}(n) = ∑_{j : j ≠ k} P(X_n = i, X_{n−1} = j, T_k ≥ n | X_0 = k) = ∑_{j : j ≠ k} l_{kj}(n − 1) p_{ji}
for n ≥ 2. Now sum over n ≥ 2; we have
ρ_i(k) = p_{ki} + ∑_{j : j ≠ k} {∑_{n ≥ 2} l_{kj}(n − 1)} p_{ji} = ρ_k(k) p_{ki} + ∑_{j : j ≠ k} ρ_j(k) p_{ji}
because ρ_k(k) = 1. We have shown that ρ(k) is a solution to x = xP.
This proves the existence part of the lemma. We prove uniqueness next.
First, if a chain is irreducible, we cannot have a nonnegative solution to xP = x with zero components. If there were, write x = (0, x_2) and
P = ( P_{11} P_{12} ; P_{21} P_{22} ).
Then we must have x_2 P_{21} = 0. Since all components of x_2 are positive, we must have P_{21} = 0. This contradicts the irreducibility assumption.
If there were two solutions to xP = x that are not multiples of each other, then we could obtain another solution with nonnegative components and at least one zero entry. This would contradict the irreducibility assumption. (This proof works only when there are finitely many states. The more general case will be proved later.)
Due to the uniqueness, any solution x must satisfy, for any state k,
µ_k = c ∑_i x_i
for some constant c. Thus, if k is positive recurrent, we have ∑_i x_i < ∞. This completes the proof of the lemma. ♦
We have an immediate corollary.
Corollary 4.1
If states i and j communicate and state i is positive recurrent, then so is state j. ♦
One intermediate result in the proof of the theorem can be summarized as another lemma.
Lemma 4.4
For any state k of an irreducible recurrent Markov chain, the vector ρ(k) satisfies ρ_i(k) < ∞ for all i, and furthermore ρ(k) = ρ(k)P. ♦
The above lemma shows why we did not pay much attention to when a Markov chain is positive recurrent. Once we find a suitable solution to x = xP, we know immediately whether the chain is positive recurrent.
The next theorem summarizes these results to give a conclusion on the existence of the stationary distribution.
Theorem 4.3
An irreducible Markov chain has a stationary distribution π if and only if all the states are positive recurrent; in this case, π is the unique stationary distribution and is given by π_j = µ_j^{-1} for each j ∈ S, where µ_j is the mean recurrence time of j. ♦
Proof: We have already shown that if the chain is positive recurrent, then the stationary distribution exists and is unique. If the chain is recurrent but not positive recurrent, then due to the uniqueness, there cannot be another solution x such that ∑_i x_i < ∞. Thus, there cannot exist a stationary distribution without violating the uniqueness conclusion.
Now if the chain is transient, we have p_{ij}(n) → 0 for all i, j as n → ∞. If a stationary distribution π existed, we would have
π_j = ∑_i π_i p_{ij}(n) → 0
by the dominated convergence theorem. This contradicts the assumption that π is a stationary distribution.
Thus, all that is left to be shown is that π_j = µ_j^{-1} for each j ∈ S when the chain is positive recurrent.
Suppose X_0 has π as its distribution. Then
π_j µ_j = ∑_{n=1}^∞ P(T_j ≥ n | X_0 = j) P(X_0 = j) = ∑_{n=1}^∞ P(T_j ≥ n, X_0 = j).
However, P(T_j ≥ 1, X_0 = j) = P(X_0 = j) = π_j, and for n ≥ 2,
P(T_j ≥ n, X_0 = j) = P(X_0 = j, X_m ≠ j for all 1 ≤ m ≤ n − 1)
= P(X_m ≠ j for 1 ≤ m ≤ n − 1) − P(X_m ≠ j for 0 ≤ m ≤ n − 1)
= P(X_m ≠ j for 0 ≤ m ≤ n − 2) − P(X_m ≠ j for 0 ≤ m ≤ n − 1)
= a_{n−2} − a_{n−1}
with a_m defined accordingly (the index shift in the middle step uses the stationarity of the chain).
Thus, summing over n, we obtain
π_j µ_j = P(X_0 = j) + P(X_0 ≠ j) − lim a_n = 1 − lim a_n.
Since state j is recurrent,
a_n = P(X_m ≠ j for all 0 ≤ m ≤ n)
has limit 0. Hence we have shown π_j = µ_j^{-1}. ♦
Now we come back to the uniqueness again; we skipped its proof earlier. It turns out that the general proof is quite tricky.
Suppose {X_n} is an irreducible and recurrent Markov chain with transition probability matrix P. Let x be a solution to x = xP. We construct another Markov chain {Y_n} such that
q_{ij}(n) = P(Y_n = j | Y_0 = i) = (x_j / x_i) p_{ji}(n)
for all i, j ∈ S and n. It is easy to verify that the q_{ij} make a proper transition probability matrix. Further, it is obvious that {Y_n} is also irreducible. By checking the sum over n, it further shows that {Y_n} is recurrent.
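The claim that the q_{ij} form a proper transition matrix can be checked numerically: row i of Q sums to ∑_j x_j p_{ji} / x_i = (xP)_i / x_i = 1 exactly because x = xP. A sketch with a hypothetical chain of my own:

```python
import numpy as np

# Hypothetical irreducible chain.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])

# A positive solution of x = xP (any positive multiple works equally well).
w, v = np.linalg.eig(P.T)
x = np.abs(np.real(v[:, np.argmin(np.abs(w - 1))]))

# Reversed-chain transitions: Q[i, j] = (x_j / x_i) * p_ji.
Q = (x[None, :] / x[:, None]) * P.T

# Row i sums to sum_j x_j p_ji / x_i = (xP)_i / x_i = 1 since x = xP.
print(Q.sum(axis=1))
```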
Let, for i ≠ j,
l_{ji}(n) = P(X_n = i, X_m ≠ j for 1 ≤ m ≤ n − 1 | X_0 = j).
We show that
g_{ij}(n) = P(Y_n = j, Y_m ≠ j for 1 ≤ m ≤ n − 1 | Y_0 = i)
satisfies
g_{ij}(n) = (x_j / x_i) l_{ji}(n).
Let us examine the expressions of g_{ij} and l_{ji}. One is the probability of all the paths which start from j and end at state i before they ever visit j again. The other is the probability of all the paths which start from i and end at state j without being there in between. If we reverse the time of the second, then we are working with the same set of sample paths.
For each sample path corresponding to l_{ji}(n), denote it as
j, k_1, k_2, . . . , k_{n−1}, i;
its probability of occurrence is
p_{j,k_1} p_{k_1,k_2} ··· p_{k_{n−1},i}.
The reversed sample path for g_{ij}(n) is i, k_{n−1}, . . . , k_2, k_1, j, and its probability of occurrence is
q_{i,k_{n−1}} q_{k_{n−1},k_{n−2}} ··· q_{k_2,k_1} q_{k_1,j}
= p_{k_{n−1},i} ··· p_{k_1,k_2} p_{j,k_1} × (x_{k_{n−1}}/x_i)(x_{k_{n−2}}/x_{k_{n−1}}) ··· (x_{k_1}/x_{k_2})(x_j/x_{k_1})
= (x_j/x_i) p_{k_{n−1},i} ··· p_{k_1,k_2} p_{j,k_1}.
Since this is true for every sample path, the result is proved.
Note that ∑_n g_{ij}(n) corresponds to the probability that the chain {Y_n} will ever enter state j. Since the chain is recurrent and irreducible, ∑_n g_{ij}(n) = 1. Consequently,
x_j / x_i = [∑_n l_{ji}(n)]^{-1},
and hence the ratio is unique.
4.3.1 Limiting Theorem
We have seen that an irreducible Markov chain has a unique stationary distribution when all states are positive recurrent. It turns out that the limits of the transition probabilities p_{ij}(n) exist as n increases, provided the states are aperiodic.
When the chain is periodic, the limit does not always exist. For example, if S = {1, 2} and p_{12} = p_{21} = 1, then
p_{11}(n) = p_{22}(n) = I(n is even).
Obviously, the limit does not exist.
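This two-state flip-flop is easy to check directly: the powers of P alternate between I and P, so p_{11}(n) oscillates:

```python
import numpy as np

# p_12 = p_21 = 1: a period-2 chain.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

p11 = [np.linalg.matrix_power(P, n)[0, 0] for n in range(1, 7)]
print(p11)  # alternates 0, 1, 0, 1, ...: no limit
```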
When the limits exist, the conclusion is neat.
Theorem 4.4
For an irreducible aperiodic Markov chain, we have
p_{ij}(n) → µ_j^{-1}
as n → ∞ for all i and j. ♦
Remarks:
1. If the chain is transient or null recurrent, then it is known that p_{ij}(n) → 0 for all i and j. Since µ_j = ∞ in this case, the theorem is automatically true.
2. If the chain is positive recurrent, then according to this theorem, p_{ij}(n) → π_j = µ_j^{-1}.
3. This theorem implies that the limit of p_{ij}(n) does not depend on i. It further implies that
P(X_n = j) → µ_j^{-1}
irrespective of the distribution of X_0.
4. If {X_n} is an irreducible chain with period d, then Y_n = X_{nd} defines an aperiodic chain (not necessarily irreducible). It follows that
p_{jj}(nd) = P(Y_n = j | Y_0 = j) → d µ_j^{-1}
as n → ∞.
Proof:
Case I: the Markov chain is transient. In this case, the theorem is true by the renewal theorem.
Case II: the Markov chain is recurrent. We could use the renewal theorem; instead, let us see another line of approach.
Let {X_n} be the Markov chain with the transition probability matrix P under consideration. Let {Y_n} be a Markov chain independent of {X_n}, with the same state space S and the same transition matrix P.
Now consider the stochastic process {Z_n} with Z_n = (X_n, Y_n), n = 0, 1, 2, . . .. Its state space is S × S. Its transition probabilities are simply products of the original transition probabilities. That is,
p_{(ij)→(kl)} = p_{ik} p_{jl}.
The new chain is still irreducible: if p_{ik}(m) > 0 and p_{jl}(n) > 0, then p_{ik}(mn) p_{jl}(mn) > 0, and so (k, l) is reachable from (i, j).
The new chain is still aperiodic: aperiodicity states that if the period is one, then p_{ij}(n) > 0 for all sufficiently large n. Thus, p_{ij}(n) p_{kl}(n) > 0 for all large enough n too.
Case II.1: positive recurrent. In this case, {X_n} has stationary distribution π, and hence {Z_n} has stationary distribution {π_i π_j : i, j ∈ S}. Thus, by the lemma in the last section, {Z_n} is also positive recurrent.
Assume (X_0, Y_0) = (i, j) for some (i, j). For any state k, define
T = min{n : Z_n = (k, k)}.
Due to positive recurrence, P(T < ∞) = 1.
The implication? Sooner or later, X_n and Y_n will occupy the same state. From that moment on, X_{n+m} and Y_{n+m} have the same distribution. If Y_n has π as its distribution, so will X_n.
More precisely, starting from any pair of states (i, j), we have
p_{ik}(n) = P(X_n = k)
= P(X_n = k, T ≤ n) + P(X_n = k, T > n)
= P(Y_n = k, T ≤ n) + P(X_n = k, T > n)
≤ P(Y_n = k) + P(T > n)
= p_{jk}(n) + P(T > n).
Due to the symmetry, we also have
p_{jk}(n) ≤ p_{ik}(n) + P(T > n).
Hence
|p_{ik}(n) − p_{jk}(n)| ≤ P(T > n) → 0
as n → ∞. That is,
p_{ik}(n) − p_{jk}(n) → 0,
so if the limit of p_{jk}(n) exists, it does not depend on j. (Remark: P(T > n) → 0 is a consequence of recurrence.)
For the existence, we have
π_k − p_{jk}(n) = ∑_{i∈S} π_i (p_{ik}(n) − p_{jk}(n)) → 0.
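The coupling argument can be simulated: run two independent copies of the chain from different starting states and record the first time n with X_n = Y_n. A sketch with a hypothetical 3-state chain of my own (the finite average illustrates P(T < ∞) = 1):

```python
import random

# Hypothetical 3-state chain; rows of the transition matrix.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

def step(state, rng):
    """One transition of the chain."""
    return rng.choices(range(3), weights=P[state])[0]

def coupling_time(i, j, rng):
    """First n with X_n = Y_n for independent copies started at i and j."""
    x, y, n = i, j, 0
    while x != y:
        x, y, n = step(x, rng), step(y, rng), n + 1
    return n

rng = random.Random(1)
times = [coupling_time(0, 2, rng) for _ in range(2000)]
print(sum(times) / len(times))  # finite average meeting time
```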
Case II.2: null recurrent. We can no longer claim P(T > n) → 0 directly; in this case, {(X_n, Y_n)} may be transient or null recurrent.
If {(X_n, Y_n)} is transient, then
P(X_n = j, Y_n = j | X_0 = i, Y_0 = i) = [p_{ij}(n)]^2 → 0.
Hence, the conclusion p_{ij}(n) → 0 remains true.
If {(X_n, Y_n)} is null recurrent, the problem becomes a bit more delicate. However, P(T > n) → 0 still holds. Thus, we still have the conclusion
p_{ik}(n) − p_{jk}(n) → 0,
but we do not have a stationary distribution at hand.
What we hope to show is that p_{ij}(n) → 0 for all i, j ∈ S. If this is not true, then we can find a subsequence {n_r} such that
p_{ij}(n_r) → α_j
as r → ∞, and at least one of the α_j is not zero.
Let F be a finite subset of S. Then
∑_{j∈F} α_j = lim_{r→∞} ∑_{j∈F} p_{ij}(n_r) ≤ 1.
This is true for any finite subset, which implies
α = ∑_{j∈S} α_j ≤ 1.
(Recall that S is countable.)
Using the idea of one-step transitions, we have
∑_{k∈F} p_{ik}(n_r) p_{kj} ≤ p_{ij}(n_r + 1) = ∑_{k∈S} p_{ik} p_{kj}(n_r).
Letting r → ∞, we have
∑_{k∈F} α_k p_{kj} ≤ ∑_{k∈S} p_{ik} α_j = α_j.
Letting F grow to S, we have
∑_{k∈S} α_k p_{kj} ≤ α_j.
The equality has to hold; otherwise,
∑_{k∈S} α_k = ∑_{k∈S} α_k ∑_{j∈S} p_{kj} = ∑_{k,j∈S} α_k p_{kj} = ∑_{j∈S} [∑_{k∈S} α_k p_{kj}] < ∑_{j∈S} α_j,
which is a contradiction. However, when the equality holds, we would have
∑_{k∈S} α_k p_{kj} = α_j
for each j ∈ S. Thus, {α_j / α : j ∈ S} is a stationary distribution of the Markov chain. This contradicts the assumption of null recurrence. ♦
Theorem 4.5
For any aperiodic state j of a Markov chain, p_{jj}(n) → µ_j^{-1} as n → ∞. Furthermore, if i is another state, then p_{ij}(n) → f_{ij} µ_j^{-1} as n → ∞. ♦
Corollary 4.2
Let
τ_{ij}(n) = (1/n) ∑_{m=1}^n p_{ij}(m)
be the mean proportion of elapsed time up to the nth step during which the chain was in state j, starting from i. If j is aperiodic, then
τ_{ij}(n) → f_{ij}/µ_j
as n → ∞. ♦
4.4 Reversibility
Let {X_n : n = 0, 1, 2, . . . , N} be (part of) a Markov chain with transition probability matrix P; assume it is irreducible and positive recurrent, so that π is its stationary distribution.
Now define Y_n = X_{N−n} for n = 0, 1, . . . , N, and suppose each X_n has distribution given by the stationary distribution π. It can be shown that {Y_n} is also a Markov chain.
Theorem 4.6
The sequence Y = {Y_n : n = 0, 1, . . . , N} is a Markov chain with transition probabilities
q_{ij} = P(Y_{n+1} = j | Y_n = i) = (π_j / π_i) p_{ji}. ♦
Proof: We need only verify the Markov property; the other conditions for a Markov chain are obvious.
P(Y_{n+1} = i_{n+1} | Y_k = i_k, k = 0, . . . , n)
= P(Y_k = i_k, k = 0, . . . , n + 1) / P(Y_k = i_k, k = 0, . . . , n)
= P(X_{N−k} = i_k, k = 0, . . . , n + 1) / P(X_{N−k} = i_k, k = 0, . . . , n)
= P(X_{N−n−1} = i_{n+1}) P(X_{N−n} = i_n | X_{N−n−1} = i_{n+1}) / P(X_{N−n} = i_n)
= π_{i_{n+1}} p_{i_{n+1}, i_n} / π_{i_n}.
Since this transition probability does not depend on i_k, k = 0, . . . , n − 1, the Markov property is verified. ♦
Although Y in the above theorem is a Markov chain, it is not necessarily the same as the original Markov chain.
Definition 4.1
Let {X_n : 0 ≤ n ≤ N} be an irreducible Markov chain such that X_n has stationary distribution π for all n. The chain is called reversible if the transition matrices of X and its time-reversal Y are the same, which is to say that
π_i p_{ij} = π_j p_{ji}
for all i, j. ♦
Why do we define reversibility? One advantage is that when a chain is reversible, the transitions between i and j are balanced at equilibrium. This provides a convenient way to solve for the stationary distribution. We have also used the idea to prove the uniqueness of the solution to π = πP.
Theorem 4.7
Let P be the transition matrix of an irreducible chain X, and suppose that there exists a distribution π such that π_i p_{ij} = π_j p_{ji} for all i, j ∈ S. Then π is a stationary distribution of the chain. Furthermore, X is reversible in equilibrium. ♦
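For a birth–death chain, the balance equations π_i p_{i,i+1} = π_{i+1} p_{i+1,i} determine π recursively, which is often easier than solving π = πP in full. A sketch using a hypothetical random walk on {0, . . . , 4} with reflecting ends (the matrix is my own example):

```python
import numpy as np

# Random walk on {0,..,4}: up prob 0.4, down prob 0.6, reflecting ends.
n = 5
P = np.zeros((n, n))
for i in range(n):
    up, down = 0.4, 0.6
    if i == 0:
        P[0, 1], P[0, 0] = up, 1 - up
    elif i == n - 1:
        P[i, i - 1], P[i, i] = down, 1 - down
    else:
        P[i, i + 1], P[i, i - 1] = up, down

# Detailed balance: pi_{i+1} = pi_i * p(i, i+1) / p(i+1, i).
pi = np.ones(n)
for i in range(n - 1):
    pi[i + 1] = pi[i] * P[i, i + 1] / P[i + 1, i]
pi /= pi.sum()

print(np.allclose(pi @ P, pi))  # True: pi solves pi = pi P
```

The design choice here reflects Theorem 4.7: a solution of the balance equations is automatically stationary, so only a one-dimensional recursion is needed.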
Chapter 5
Continuous Time Markov
Chain
It is more realistic to consider processes which do not have to evolve at
speciﬁed epochs. However, setting up continuous time stochastic processes
properly involves a lot of eﬀort in mathematics. We now work on two special
continuous time stochastic processes ﬁrst to motivate the continuous time
Markov chain.
5.1 Birth Processes and the Poisson Process
These processes may be called counting processes in general. We have a
process {N(t) : t ≥ 0} such that
(a) N(0) = 0, and N(t) ∈ {0, 1, 2, . . .},
(b) if s < t, then N(s) ≤ N(t).
What other detailed properties should we place on it?
Definition 5.1
A Poisson process with intensity λ is a process N = {N(t) : t ≥ 0} taking values in S = {0, 1, 2, . . .} such that
(a) N(0) = 0; if s < t, then N(s) ≤ N(t);
(b) P{N(t + h) = n + m | N(t) = n} = λh + o(h) if m = 1; o(h) if m > 1; 1 − λh + o(h) if m = 0;
(c) if s < t, the random variable N(t) − N(s) is independent of N(s). ♦
In general, we call (b) the individuality property and (c) the independence property of the Poisson process. It is well known that these specifications imply that N(t) − N(s) has a Poisson distribution with mean λ(t − s) for s < t.
Theorem 5.1
N(t) has the Poisson distribution with parameter λt; that is to say,
P(N(t) = j) = ((λt)^j / j!) exp(−λt),   j = 0, 1, 2, . . . . ♦
Proof: Denote p_j(t) = P(N(t) = j). The properties of the Poisson process lead to the equation
p_j′(t) = λ p_{j−1}(t) − λ p_j(t)
for j ≠ 0; likewise
p_0′(t) = −λ p_0(t).
With the boundary condition p_j(0) = I(j = 0), the equations can be solved and the conclusion proved. ♦
We may view a counting process by recording the arrival time of the nth event. For that purpose, we define
T_0 = 0,   T_n = inf{t : N(t) = n}.
The interarrival times are the random variables X_1, X_2, . . . given by
X_n = T_n − T_{n−1}.
The counting process can easily be reconstructed from the X_n's too.
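The reconstruction N(t) = max{n : T_n ≤ t} can be sketched directly. Anticipating Theorem 5.2 below, the interarrival times are drawn as exponentials with rate λ, and the empirical mean of N(t) is compared with λt (all parameter values are my own choices):

```python
import random
import bisect

def poisson_path(rate, horizon, rng):
    """Arrival times T_1, T_2, ... up to `horizon`, built from
    exponential interarrival times X_n = T_n - T_{n-1}."""
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return arrivals
        arrivals.append(t)

def N(t, arrivals):
    """N(t) = number of arrival times <= t."""
    return bisect.bisect_right(arrivals, t)

rng = random.Random(0)
paths = [poisson_path(2.0, 10.0, rng) for _ in range(4000)]
mean_count = sum(N(10.0, p) for p in paths) / len(paths)
print(mean_count)  # close to lambda * t = 20
```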
The following theorem is a familiar story.
Theorem 5.2
The random variables X_1, X_2, . . . are independent, each having the exponential distribution with parameter λ. ♦
Proof: It can be shown by working on
P(X_{n+1} > t + ∑_{i=1}^n t_i | X_i = t_i, i = 1, 2, . . . , n)
and making use of the property of independent increments. ♦
There is a slight problem with this proof, as the event we condition on has zero probability. It can be made more rigorous with some measure theory results.
The Poisson process is a very satisfactory model for radioactive emissions from a sample of uranium-235, since this isotope has a half-life of 7 × 10^8 years and decays fairly slowly. That is, we have an effectively constant radioactive source over a short time. For some other kinds of radioactive substances, the rate of emission should depend on the number of emissions already detected.
It can be shown that {N(t) : t ≥ 0} is a Poisson process if and only if X_1, X_2, . . . are independent and identically distributed exponential random variables. If instead X_i has an exponential distribution with rate λ_{i−1}, then we have a general birth process.
Definition 5.2
A birth process with intensities λ_0, λ_1, . . . is a process {N(t) : t ≥ 0} taking values in S = {0, 1, 2, . . .} such that
(a) N(0) ≥ 0; if s < t, then N(s) ≤ N(t);
(b) P(N(t + h) = n + m | N(t) = n) = λ_n h + o(h) if m = 1; o(h) if m > 1; 1 − λ_n h + o(h) if m = 0;
(c) if s < t, the random variable N(t) − N(s) is independent of N(s). ♦
Here is a list of special cases:
(a) Poisson process: λ_n = λ for all n.
(b) Simple birth: λ_n = nλ. This is the case when each living individual gives birth independently of the others, at a constant rate.
(c) Simple birth with immigration: λ_n = nλ + ν. In addition to the simple birth, there is a steady source of immigration.
The differential equations we derived for the Poisson process can easily be generalized. We can find two basic sets of them. Define p_{ij}(t) = P(N(s + t) = j | N(s) = i). The boundary conditions are p_{ij}(0) = δ_{ij} = I(i = j).
Forward system of equations:
p_{ij}′(t) = λ_{j−1} p_{i,j−1}(t) − λ_j p_{ij}(t)
for j ≥ i.
Backward system of equations:
p_{ij}′(t) = λ_i p_{i+1,j}(t) − λ_i p_{ij}(t)
for j ≥ i.
The forward equations are obtained by computing the probability of N(t + h) = j conditioning on the state at time t; the backward equations, by conditioning on the state at time h.
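For the Poisson case λ_n = λ and i = 0, the forward system can be integrated numerically and compared against the Poisson probabilities of Theorem 5.1; a minimal Euler sketch (step size and truncation level are my own choices):

```python
import math

lam, T, dt = 1.5, 2.0, 1e-4
n_states = 40                        # truncate the state space
p = [1.0] + [0.0] * (n_states - 1)   # p_j(0) = I(j = 0): start at 0

steps = int(T / dt)
for _ in range(steps):
    # forward system: p_j'(t) = lam * p_{j-1}(t) - lam * p_j(t)
    new = [0.0] * n_states
    new[0] = p[0] - dt * lam * p[0]
    for j in range(1, n_states):
        new[j] = p[j] + dt * (lam * p[j - 1] - lam * p[j])
    p = new

exact = [math.exp(-lam * T) * (lam * T) ** j / math.factorial(j)
         for j in range(5)]
print([round(v, 4) for v in p[:5]])
print([round(v, 4) for v in exact])
```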
Theorem 5.3
The forward system has a unique solution, which satisfies the backward system. ♦
Proof: First, it is seen that p_{ij}(t) = 0 whenever j < i. When j = i, p_{ii}(t) = exp(−λ_i t) is the solution. Substituting into the forward equation, we obtain the solution for p_{i,i+1}(t). Repeating this procedure implies that the forward system has a unique solution.
Using the Laplace transform reveals the structure of the solution better. Define
p̂_{ij}(θ) = ∫_0^∞ exp(−θt) p_{ij}(t) dt.
Then the forward system becomes
(θ + λ_j) p̂_{ij}(θ) = δ_{ij} + λ_{j−1} p̂_{i,j−1}(θ).
The new system becomes easy to solve. We obtain
p̂_{ij}(θ) = (1/λ_j) · (λ_i/(θ + λ_i)) · (λ_{i+1}/(θ + λ_{i+1})) ··· (λ_j/(θ + λ_j))
for j > i. The uniqueness is determined by the inversion theorem for Laplace transforms.
If the π_{ij}(t) solve the backward system, their corresponding Laplace transforms satisfy
(θ + λ_i) π̂_{ij}(θ) = δ_{ij} + λ_i π̂_{i+1,j}(θ).
It turns out that transforms satisfying the forward system also satisfy the backward system here. Thus, the solution to the forward equation is also a solution to the backward equation. ♦
An implicit conclusion here is that the backward equation may have many solutions. It turns out that if there are many solutions to the backward equation, the solution given by the forward equation is the minimal solution.
Theorem 5.4
If {p_{ij}(t)} is the unique solution of the forward system, then any solution {π_{ij}(t)} of the backward system satisfies p_{ij}(t) ≤ π_{ij}(t) for all i, j, t. ♦
If {p_{ij}(t)} are the transition probabilities of the specified birth process, then they must solve both the forward and backward systems. Thus, the solution to the forward system must be the transition probabilities. This can be compared to the problem of the probability of ultimate extinction in the branching process. Conversely, the solution to the forward system can be shown to satisfy the Chapman–Kolmogorov equations. Thus, it is a relevant solution.
The textbook fails to demonstrate why the proof of uniqueness for the forward system cannot be applied to the backward system. The key is the assumption that p_{ij}(t) = 0 when j < i. When this restriction is removed, the backward system may have multiple solutions. This restriction reflects the existence of some continuous time Markov chains whose transition probabilities agree, to some degree, with those of the birth process. Consequently, the textbook should not have made use of this restriction in proving the uniqueness of the solution to the forward system.
Intuitively, we may expect that
∑_{j∈S} p_{ij}(t) = 1
for any solution. If so, no solution can be larger than another, and hence uniqueness is automatic. The non-uniqueness is built exactly on this observation: the constraint does not hold for some birth processes. When the birth rate increases fast enough with the population size, the population size may reach infinity in a finite amount of time.
When the solution to the forward equation satisfies
∑_{j∈S} p_{ij}(t) < 1
for some t > 0 and i, it is then possible to construct another solution which also satisfies the backward system. See Feller for detailed constructions. In that case, it is possible to design a new stochastic process whose transition probabilities are given by this solution.
What is the probabilistic interpretation when
∑_{j∈S} p_{ij}(t) < 1
for some finite t and i? It implies that within the period t, the population size has jumped, or exploded, all the way to infinity. Consequently, an infinite number of transitions must have occurred. Recall that the waiting time for the next transition when N(t) = n is exponential with rate λ_n. Let T_n = ∑_{i=1}^n X_i be the waiting time for the nth transition, and define T_∞ = lim_{n→∞} T_n.
Definition 5.3 (Honest/dishonest)
We call the process N honest if P(T_∞ < ∞) = 0, and dishonest otherwise. ♦
Lemma 5.1
Let X_1, X_2, . . . be independent random variables, X_n having the exponential distribution with parameter λ_{n−1}, and let T_∞ = ∑_{n=1}^∞ X_n. We have that
P(T_∞ < ∞) = 0 if ∑_{n=1}^∞ λ_n^{-1} = ∞, and P(T_∞ < ∞) = 1 if ∑_{n=1}^∞ λ_n^{-1} < ∞. ♦
Proof: By the definition of T_∞, we have
E[T_∞] = ∑_{n=1}^∞ λ_n^{-1}.
Hence, when ∑_{n=1}^∞ λ_n^{-1} < ∞, E[T_∞] < ∞ and P(T_∞ < ∞) = 1. This implies dishonesty.
If ∑_{n=1}^∞ λ_n^{-1} = ∞, this does not immediately imply that T_∞ = ∞ with probability one. If, however, P(T_∞ < t) > 0 for some t > 0, then E[exp(−T_∞)] ≥ exp(−t) P(T_∞ < t) > 0. We show that this is impossible under the current assumption. Note that
E[exp(−T_∞)] = lim_{N→∞} E ∏_{n=1}^N exp(−X_n) = lim_{N→∞} ∏_{n=1}^N E[exp(−X_n)] = lim_{N→∞} ∏_{n=1}^N (1 + λ_n^{-1})^{-1} = lim_{N→∞} [∏_{n=1}^N (1 + λ_n^{-1})]^{-1}.
By a result in mathematical analysis, the infinite product diverges to infinity when ∑_{n=1}^∞ λ_n^{-1} = ∞ (check its logarithm). Hence, E[exp(−T_∞)] = 0, which contradicts the assumption E[exp(−T_∞)] > 0 made earlier. ♦
The following theorem is an easy consequence.
Theorem 5.5
The process N is honest if and only if ∑_{n=1}^∞ λ_n^{-1} = ∞. ♦
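The lemma can be illustrated by summing simulated exponential waiting times: with rates λ_n = n the partial sums of λ_n^{-1} diverge (honest), while λ_n = n² gives ∑ λ_n^{-1} < ∞ and a finite T_∞ (explosion). A sketch truncating at a large n, with parameter values of my own:

```python
import random

def total_wait(rates, rng):
    """T_N = sum of independent Exp(rate) waiting times."""
    return sum(rng.expovariate(r) for r in rates)

rng = random.Random(42)
N = 100_000

# lambda_n = n: sum 1/n diverges, so T_infinity = infinity (honest);
# the truncated total already grows like log N.
linear = total_wait(range(1, N + 1), rng)

# lambda_n = n^2: sum 1/n^2 < infinity, so T_infinity < infinity
# (dishonest); the truncated total stays near pi^2/6 ~ 1.645.
quadratic = total_wait((n * n for n in range(1, N + 1)), rng)

print(round(linear, 2), round(quadratic, 2))
```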
5.1.1 Strong Markov Property
The Markov property for a discrete time stochastic process states: given the present (X_n = i), the future (the X_{n+m}) is independent of the past (the X_{n−k}). The time representing the present is a nonrandom constant. In the example of the simple random walk, we often claim that once the random walk returns to 0, it renews itself. That is, the future behavior of the random walk depends neither on how the random walk got to 0 nor on when it returned to 0. We may notice that this notion is a bit different from the Markov property: the "present time" is a random variable.
This example suggests that the Markov property is true at some randomly selected times. The question is what kind of random variable can be used for this purpose. Suppose {N(t) : t ≥ 0} is a stochastic process and T is a random variable. If the event {T ≤ t} is completely determined by the knowledge of {N(s) : s ≤ t}, then we call T a stopping time for the stochastic process {N(t) : t ≥ 0}. A more rigorous definition relates to the concept of σ-field we introduced before.
Theorem 5.6 (Strong Markov Property)
Let N be a birth process and let T be a stopping time for N. Let A be an event which depends on {N(s) : s > T} and B an event which depends on {N(s) : s ≤ T}. Then
P(A | N(T) = i, B) = P(A | N(T) = i)
for all i. ♦
Proof: In fact, this is a simple case, as N(T) is a discrete random variable. The kind of event B which causes the most trouble is one that contains all the information about the history of the process before and including time T. In that case, the value of T is completely determined by B, so we may write T = T(B). The claim is that
P(A | N(T) = i, B) = P(A | N(T) = i, T = T(B), B).
Among the three pieces of information on the right-hand side, T is defined based on {N(s) : s ≤ T(B)} and is a constant given "N(T) = i, T = T(B)". Hence the (weak) Markov property allows us to ignore B itself. At the same time, the process is time homogeneous, so the future depends only on the fact that the chain is in state i now, not on when it first reached state i. Hence, the part T = T(B) is also uninformative, and the conclusion follows.
The measure theory proof is as follows. Let H = σ{N(s) : s ≤ T}, the σ-field generated by these random variables. An event B containing historic information before T is simply an event in this σ-algebra. Recall the formula E[E(X | Y)] = E(X). Let H play the role of Y, and E(· | N(T) = i, B) play the role of the expectation; we have
P(A | N(T) = i, B) = E(I(A) | N(T) = i, B) = E[E(I(A) | N(T) = i, B, H) | N(T) = i, B].
We claim E(I(A) | N(T) = i, B, H) = E(I(A) | N(T) = i), since it is H-measurable, T is a constant on B, and the weak Markov property applies. Since this function is a constant in the eyes of {N(T) = i, B}, we have
E{E(I(A) | N(T) = i) | N(T) = i, B} = E(I(A) | N(T) = i).
This result is easier to present for a discrete time Markov chain. Hence, if you cannot understand the above proof, working on the discrete time example in the assignment will help. ♦
Example 5.1
Consider the birth process with N(0) = I > 0. Define p_n(t) = P(N(t) = n). Set up the forward system and solve it when λ_n = nλ. ♦
5.2 Continuous time Markov chains
Let X = {X(t) : t ≥ 0} be a family of random variables taking values in
some countable state space S and indexed by the halfline [0, ∞). As before,
we shall assume that S is a subset of nonnegative integers.
Definition 5.1
The process X satisfies the Markov property if
P(X(t_n) = j | X(t_1) = i_1, . . . , X(t_{n−1}) = i_{n−1}) = P(X(t_n) = j | X(t_{n−1}) = i_{n−1})
for all j, i_1, . . . , i_{n−1} ∈ S and any sequence 0 < t_1 < t_2 < ··· < t_n of times.
A continuous time stochastic process X satisfying the Markov property is called a continuous time Markov chain.
One obvious example of a continuous time Markov process is the Poisson process. The pure birth process is not a Markov process when it is dishonest. The reason is that to qualify as a Markov chain, we require {N(t) : t ≥ 0} to be a family of random variables. If the population size can explode to infinity at finite t, then N(t) is not always a random variable for a given t.
One difficulty in studying the evolution of a continuous time Markov chain is the lack of a one-step transition probability matrix. No matter how short a period of time is, we can always find a shorter period. This observation gives rise to the need for the infinitesimal generator.
Definition 5.2
The transition probability p_{ij}(s, t) is defined to be
p_{ij}(s, t) = P(X(t) = j | X(s) = i)
for s < t.
The chain is called time homogeneous if p_{ij}(s, t) = p_{ij}(0, t − s) for all i, j, s, t, and we then write p_{ij}(t − s) for p_{ij}(s, t). ♦
We will assume homogeneity for all continuous time Markov chains to be discussed, unless otherwise specified. We also use the notation P_t for the corresponding matrix. It turns out that the collection of all transition probability matrices forms a stochastic semigroup. Forming a group means we can define an operation on the set such that the operation is closed and invertible; a semigroup need not have an inverse for each member of the same set.
Theorem 5.7
5.2. CONTINUOUS TIME MARKOV CHAINS 73
The family {P_t : t ≥ 0} is a stochastic semigroup; that is, it satisfies the following conditions:
(a) P_0 = I, the identity matrix;
(b) P_t is stochastic, that is, P_t has nonnegative entries and row sums 1;
(c) the Chapman-Kolmogorov equations: P_{s+t} = P_s P_t if s, t ≥ 0. ♦
Proof: Obvious. ♦
Remark: a quick look at the mathematical definition reveals that only (c), and sometimes also (a), are properties of a semigroup.
For a continuous time Markov chain, there is no t_0 such that P_{t_0} can be used to compute P_t for all t. This is different from the discrete time Markov chain. Is it possible to represent all P_t from a single entity?
The answer is positive if the semigroup satisﬁes some properties.
Deﬁnition 5.3
The semigroup {P_t : t ≥ 0} is called standard if P_t → I as t ↓ 0, which is to say that p_ii(t) → 1 and p_ij(t) → 0 for i ≠ j. ♦
Being standard means that every entry of P_t is a continuous function of t, not only at t = 0 but also at any t > 0, by the Chapman-Kolmogorov equations. It does not imply, though, that the entries are differentiable at t = 0.
Suppose that X(t) = i at the moment t. What might happen in the next short period (t, t + h)?
Maybe nothing will happen; the corresponding chance is p_ii(h) + o(h). It may switch into state j, with corresponding chance p_ij(h) + o(h).
The chance of two or more transitions is assumed to be o(h) here. I take it as an assumption, which means not all Markov chains have this property. The book mentions that it can be proved; I am somewhat skeptical.
It turns out that when the semigroup is standard, the following claim is true:

p_ij(h) = g_ij h + o(h),   p_ii(h) = 1 + g_ii h + o(h),   (5.1)

where the g_ij are a set of constants such that g_ij ≥ 0 for i ≠ j and −∞ ≤ g_ii ≤ 0 for all i.
Unless additional restrictions are applied, some g_ii may take the value −∞, in which case I guess that the interpretation is

[p_ii(h) − 1]/h → −∞.
When all the needed conditions are in place, we set up a matrix G from the g_ij and call it the infinitesimal generator.
Linking this expansion to the transition probabilities, we must have

∑_j g_ij = 0

for all i. Due to the fact that some g_ii can be negative infinity, the above relationship does not apply to all Markov chains. We will discuss the conditions under which this result is guaranteed.
Once we admit the validity of (5.1), we can easily obtain the forward and backward equations:

Forward equations: P_t′ = P_t G;
Backward equations: P_t′ = G P_t.
Example 5.2
Birth process. With birth rates λ_n, the infinitesimal generator has entries g_{n,n} = −λ_n, g_{n,n+1} = λ_n, and g_{nm} = 0 otherwise. ♦
The validity of (5.1) does not guarantee the uniqueness of the solution to these two systems. It is known that the backward equations are satisfied when the semigroup is standard. The forward equations are proved when the semigroup is uniform, which will be discussed a bit further.
When all entries of G are bounded by a common constant in absolute value, the two systems share a unique solution for P_t. In fact, the solution can be written in the form

P_t = ∑_{n=0}^∞ (t^n/n!) G^n.

We may also write it as

P_t = exp(tG).
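As a quick numerical check, the series P_t = ∑ (t^n/n!) G^n can be truncated and compared with the known closed form for a two-state chain. This is only a sketch; the rates lam = 1.0 and mu = 2.0 and the truncation length are my own illustrative choices, not from the notes.

```python
import math

# Truncated series for P_t = sum_{n>=0} (t^n/n!) G^n, for the two-state
# generator G = [[-lam, lam], [mu, -mu]]; the rates are illustrative choices.
lam, mu = 1.0, 2.0
G = [[-lam, lam], [mu, -mu]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transition_matrix(t, terms=60):
    P = [[1.0, 0.0], [0.0, 1.0]]      # running sum, starts at the identity
    term = [[1.0, 0.0], [0.0, 1.0]]   # current term (tG)^n / n!
    for n in range(1, terms):
        term = mat_mul(term, G)
        term = [[t / n * x for x in row] for row in term]
        P = [[P[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return P
```

For this two-state generator, p_00(t) has the closed form mu/(lam+mu) + lam/(lam+mu) e^{-(lam+mu)t}, which the truncated series reproduces to high accuracy.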
Recall that for some Markov chains with standard semigroup transition probability matrices, the instantaneous rates g_ii could be negative infinity. In this case, state i is instantaneous: the moment the Markov chain enters state i is also the moment it leaves. Barring such possibilities by assuming the g_ij have a common bound in absolute value, the waiting time for the Markov chain to leave state i has the memoryless property. Therefore, we have the following result.
Theorem 5.8
Under some conditions, the random variable U_i is exponentially distributed with parameter −g_ii, where U_i is the waiting time for the Markov chain to leave state i.
Further, the probability that the chain jumps to state j from state i is −g_ij/g_ii. ♦
The result on the exponential distribution is a product of the Markov property: the Markov property implies the memoryless property, and the latter implies the exponential distribution. The only hassle is the parameter −g_ii. As long as it is finite for all i, the proof is good enough.
The proof of the second part is less rigorous. Taking a small positive value h, we consider the probability under the assumption that x < U_i < x + h. The conditional probability converges to −g_ij/g_ii when the boundedness conditions on the g_ii are satisfied.
Example 5.3
A continuous time Markov chain with a finite state space is always standard and uniform. Thus, all the conclusions are valid. ♦
Example 5.4
A continuous time Markov chain with only two possible states has forward and backward equations that can be solved easily. ♦
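Theorem 5.8 also gives a direct recipe for simulating a continuous time Markov chain: hold in state i for an Exp(−g_ii) time, then jump to j with probability −g_ij/g_ii. A minimal sketch, with a three-state generator of my own choosing (not from the notes):

```python
import random

random.seed(0)
# Illustrative 3-state generator; each row sums to zero.
G = [[-2.0, 1.5, 0.5],
     [1.0, -3.0, 2.0],
     [0.5, 0.5, -1.0]]

def run_chain(start, t_end):
    t, i = 0.0, start
    while True:
        rate = -G[i][i]
        t += random.expovariate(rate)      # memoryless holding time in state i
        if t >= t_end:
            return i
        u, acc = random.random(), 0.0
        for j in range(3):                 # jump to j with probability -g_ij/g_ii
            if j != i:
                acc += G[i][j] / rate
                if u <= acc:
                    i = j
                    break

states = [run_chain(0, 50.0) for _ in range(2000)]
```

For this generator, solving πG = 0 gives π ≈ (0.242, 0.212, 0.545), and the empirical state frequencies at a large fixed time should be close to π.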
5.3 Limiting probabilities
Our next issue is about limiting probabilities. Similar to the discrete time Markov chain, we need to examine the issue of irreducibility. It turns out that for any states i and j, if

p_ij(t) > 0

for some t > 0, then p_ij(t) > 0 for all t > 0.
We say that a continuous time Markov chain is irreducible when p_ij(t) > 0 for all i, j and t > 0.
Deﬁnition 5.1
The vector π is a stationary distribution of the continuous time Markov chain if

π_j ≥ 0,  ∑_j π_j = 1,  and  π = πP_t for all t ≥ 0. ♦

By the Chapman-Kolmogorov equations, if X(t) has distribution π, then X(t + s) also has distribution π for all s ≥ 0.
If π is such a distribution, it must be unique and nonnegative when the chain is irreducible. This is due to the fact that P_t is a transition probability matrix of some discrete time Markov chain which is irreducible; hence, its stationary distribution is nonnegative and unique.
Solving the equations π = πP_t may not be convenient, as the P_t are hard to specify. We hence seek help from the following theorem.
Theorem 5.9
If the Markov chain has a uniform semigroup of transition probability matrices, then π = πP_t for all t if and only if πG = 0. ♦
The proof is simple. When the Markov chain has a uniform semigroup, we have P_t = exp(tG), and using the expansion of the exponential function results in the conclusion. Note that when πG = 0, it is easy to show that every term in the power expansion beyond the first is zero, so πP_t = π. Conversely, if πP_t = π for all t, then π exp(tG) − π is a zero function of t. An analytic function is a zero function if and only if all its coefficients are zero. This implies that πG = 0.
The textbook does not specify the uniformity condition. The result is true for more general Markov chains, such as the birth and death process.
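For the two-state chain of Example 5.4, πG = 0 can be checked by hand. A sketch, with rates lam = 1.0 and mu = 2.0 chosen by me for illustration; for this G, the equation πG = 0 forces π to be proportional to (mu, lam).

```python
# Two-state generator with illustrative rates; pi G = 0 together with
# pi_0 + pi_1 = 1 gives pi = (mu, lam) / (lam + mu).
lam, mu = 1.0, 2.0
G = [[-lam, lam], [mu, -mu]]

pi = [mu / (lam + mu), lam / (lam + mu)]

# residual of the stationarity equation pi G = 0
residual = [pi[0] * G[0][j] + pi[1] * G[1][j] for j in range(2)]
```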
Finally, we state the limiting theorem.
Theorem 5.10
Let X be irreducible with a standard semigroup {P_t} of transition probabilities.
(a) If there exists a stationary distribution π, then it is unique and p_ij(t) → π_j as t → ∞ for all i and j.
(b) If there is no stationary distribution, then p_ij(t) → 0 as t → ∞ for all i and j. ♦
Proof: For any h > 0, we may define Y_n = X(nh). Then we have a discrete time Markov chain which is irreducible and aperiodic. If it is non-null recurrent, then it has a unique stationary distribution π^(h). In addition,

p_ij(nh) = P(Y_n = j | Y_0 = i) → π_j^(h).

If the chain is null recurrent or transient, we would have

p_ij(nh) = P(Y_n = j | Y_0 = i) → 0

for all i, j.
For any two rational numbers h_1 and h_2, applying the above result implies that π_j^(h_1) = π_j^(h_2). Next, use the continuity of p_ij(t) to fill up the gap for all real numbers.
Remark: Unlike other results, we are told that the conclusion is true whenever the class of transition matrices is a standard semigroup.
5.4 Birthdeath processes and imbedding
The birth and death process is a more realistic model for population evolution. Suppose the random variable X(t) represents the population size at time t.
(a) Its state space is S = {0, 1, . . .}.
(b) Its infinitesimal transition probabilities are given by

P(X(t + h) = n + m | X(t) = n) =
   λ_n h + o(h)   if m = 1,
   μ_n h + o(h)   if m = −1,
   o(h)           if |m| > 1;

(c) the 'birth rates' λ_0, λ_1, . . . and 'death rates' μ_0 = 0, μ_1, μ_2, . . . satisfy λ_n ≥ 0 and μ_n ≥ 0.
Due to the memoryless property, the waiting time for the next transition has an exponential distribution with rate λ_n + μ_n when X(t) = n. The probability that the next transition is a birth is given by λ_n/(λ_n + μ_n).
The infinitesimal generator G has a nice form:

G =
  [ −λ_0        λ_0             0              0            0    · · ·
     μ_1    −(λ_1 + μ_1)       λ_1             0            0    · · ·
      0         μ_2        −(λ_2 + μ_2)       λ_2           0    · · ·
      0          0              μ_3       −(λ_3 + μ_3)     λ_3   · · ·
     . . .      . . .          . . .          . . .        . . .       ].
The chain is uniform if and only if sup_n{λ_n + μ_n} < ∞. When λ_0 = 0, the chain is reducible: state 0 becomes absorbing. If λ_n = 0 for some n > 0 and X(0) = m < n, then the population size will be bounded by n. It seems the standard semigroup condition is satisfied as long as all rates are finite. Yet we have to assume that the process is honest to maintain that it is a continuous time Markov chain.
The transition probabilities p_ij(t) may be computed in principle, but may not be so useful. The chain has a stationary distribution under some simple conditions:

π_n = (λ_0 λ_1 · · · λ_{n−1})/(μ_1 μ_2 · · · μ_n) · π_0

with

π_0^{-1} = ∑_{n=0}^∞ (λ_0 λ_1 · · · λ_{n−1})/(μ_1 μ_2 · · · μ_n).

The limit exists when π_0 > 0. Note that in this case the condition of a standard semigroup is satisfied. Hence, the conclusion for the existence of the limiting probabilities is solid.
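When λ_n = λ and μ_n = μ are constant (an M/M/1-type queue with λ < μ — my own illustrative special case, not from the notes), the product formula gives π_n = (λ/μ)^n π_0, a geometric distribution. A sketch:

```python
# Birth-death stationary distribution via the product formula, truncated
# at n_max; the constant rates lam = 1.0 and mu = 2.0 are my own choice.
lam, mu = 1.0, 2.0

def stationary(n_max=200):
    # r_n = (lam_0 ... lam_{n-1}) / (mu_1 ... mu_n); r_0 is the empty product 1
    ratios, r = [1.0], 1.0
    for _ in range(n_max):
        r *= lam / mu
        ratios.append(r)
    pi0 = 1.0 / sum(ratios)          # normalizing constant pi_0
    return [pi0 * x for x in ratios]

pi = stationary()
```

With lam/mu = 1/2 the result is essentially the geometric distribution π_n = (1/2)^{n+1}.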
Example 5.5 Simple death with immigration.
Consider the situation where the birth rates are λ_n = λ for all n, and μ_n = nμ for n = 1, 2, . . .. That is, there is a constant source of immigration, and each individual in the population has the same death rate. Using the transition probability language, the book states that

p_ij(h) = P(X(t + h) = j | X(t) = i)
        = P(j − i arrivals, 0 deaths) + o(h)   if j ≥ i,
        = P(i − j deaths, 0 arrivals) + o(h)   if j < i.

Since the chance of two or more transitions in a short period of length h is o(h), this reduces to

p_{i,i+1}(h) = λh + o(h),
p_{i,i−1}(h) = (iμ)h + o(h),

and p_ij(h) = o(h) when |i − j| ≥ 2.
It is very simple to work out its limiting probabilities.
Theorem 5.11
In the limit as t → ∞, X(t) is asymptotically Poisson distributed with parameter ρ = λ/μ. That is,

P(X(t) = n) → (ρ^n/n!) exp(−ρ),  n = 0, 1, 2, . . . .
The proof is very simple. Let us work out the distribution of X(t) directly. Assume X(0) = I.
Let p_j(t) = P(X(t) = j | X(0) = I). By the forward equations, we have

p_j′(t) = λ p_{j−1}(t) − (λ + jμ) p_j(t) + μ(j + 1) p_{j+1}(t)

for j ≥ 1, and p_0′(t) = −λ p_0(t) + μ p_1(t).
Noting that the probability generating function of X(t) is G(s, t) = E[s^{X(t)}] = ∑_{j=0}^∞ s^j p_j(t), we have

∂G/∂s = ∑_{j=0}^∞ j s^{j−1} p_j(t)

and

∂G/∂t = ∑_{j=0}^∞ s^j p_j′(t).
Hence, multiplying both sides of the forward system by s^j and summing up, we have

∂G/∂t (s, t) = λ(s − 1)G(s, t) − (μs − μ) ∂G/∂s (s, t)
             = (s − 1)[λG(s, t) − μ ∂G/∂s (s, t)].

It is easy to verify that

G(s, t) = {1 + (s − 1) exp(−μt)}^I exp{ρ(s − 1)(1 − e^{−μt})}.
It is interesting to see that if I = 0, then the distribution of X(t) is Poisson. Otherwise, the distribution is a convolution of a binomial distribution and a Poisson distribution. Due to the uniqueness of the solution to the forward system, we do not have to look for other solutions anymore.
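The stationary distribution of this chain can also be read off the birth-death product formula: with λ_n = λ and μ_n = nμ, the formula gives π_n = (ρ^n/n!)π_0, which normalizes to the Poisson(ρ) distribution of Theorem 5.11. A sketch with rates λ = 3.0 and μ = 2.0 of my own choosing:

```python
import math

# Product-formula stationary distribution for the immigration-death chain;
# the rates are illustrative choices, rho = lam/mu.
lam, mu = 3.0, 2.0
rho = lam / mu

terms, r = [1.0], 1.0
for n in range(1, 60):
    r *= lam / (n * mu)      # multiply by lambda_{n-1}/mu_n = lam/(n mu)
    terms.append(r)
pi0 = 1.0 / sum(terms)       # normalizing constant, approximately exp(-rho)
pi = [pi0 * x for x in terms]
```

Comparing pi[n] with the Poisson(ρ) probabilities confirms the limiting distribution claimed by Theorem 5.11.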
Example 5.6 Simple birthdeath
When the birth and death rates are given by λ_n = nλ and μ_n = nμ, we can work out the problem in exactly the same way.
It turns out that given X(0) = I, the generating function of X(t) is given by

G(s, t) = [ (μ(1 − s) − (μ − λs) exp{−t(λ − μ)}) / (λ(1 − s) − (μ − λs) exp{−t(λ − μ)}) ]^I.

The corresponding differential equation is given by

∂G/∂t (s, t) = (λs − μ)(s − 1) ∂G/∂s (s, t).

Does it look like a convolution of one binomial and one negative binomial distribution?
One can find the mean and variance of X(t) from this expression:

E[X(t)] = I exp{(λ − μ)t},
Var(X(t)) = [(λ + μ)/(λ − μ)] exp{(λ − μ)t}[exp{(λ − μ)t} − 1] I.

The limit of the expectation is 0 when λ/μ < 1 and infinity when λ/μ > 1.
The probability of extinction is given by the limit of η(t) = P(X(t) = 0), which is min(ρ^{-1}, 1) when I = 1, and min(ρ^{-I}, 1) in general. ♦
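The extinction probability η(t) = G(0, t) can be evaluated directly from the generating function. A sketch with I = 1 and rates λ = 2.0 > μ = 1.0 (my own choice), for which the limit should be ρ^{-1} = μ/λ = 0.5:

```python
import math

lam, mu = 2.0, 1.0   # illustrative rates with lam > mu

def eta(t, I=1):
    # G(0, t): set s = 0 in the generating function above
    e = math.exp(-t * (lam - mu))
    return ((mu - mu * e) / (lam - mu * e)) ** I

probs = [eta(t) for t in (0.0, 1.0, 2.0, 50.0)]
```

η(t) increases from 0 at t = 0 toward the limiting extinction probability μ/λ.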
5.5 Embedding
If we ignore the lengths of the time intervals between two consecutive transitions in a continuous time Markov chain, we obtain a discrete time Markov chain. In the example of linear birth and death, the waiting time for the next birth or death has an exponential distribution with rate n(λ + μ) given X(t) = n. The probability that the transition is a birth is (nλ)/[n(λ + μ)] = λ/(λ + μ).
Think of the transition as the movement of a particle from the integer n to the new integer n + 1 or n − 1. The particle is performing a simple random walk with probability p = λ/(λ + μ) of moving up. The state 0 is an absorbing state. Starting from state 1, the probability that the random walk is absorbed at state 0 is min(1, q/p), where q = 1 − p.
5.6 Markov chain Monte Carlo
In statistical applications, we often need to compute the joint cumulative distribution functions of several random variables. For example, we might be interested in the distribution of

s_n^2 = (n − 1)^{-1} ∑_{i=1}^n (X_i − X̄_n)^2,

where X_1, . . . , X_n are independent and identically distributed random variables and X̄_n = n^{-1} ∑_i X_i.
When X_1 has a normal distribution with unit variance, the distribution of (n − 1)s_n^2 is known to be chi-square with n − 1 degrees of freedom. The density function is well known, and the value of P(s_n^2 ≤ x) can be easily obtained via numerical integration.
If X_1 has an exponential distribution, the density function of s_n^2 is no longer available in a simple analytical form. Straightforward numerical integration becomes very difficult. If one can generate 10,000 sets of independent random variables Y_1, Y_2, . . . , Y_n such that they all have the exponential distribution, then we may compute 10,000 values of s_n^2. If 12% of them are smaller than x = 5, then it can be shown that P(s_n^2 ≤ 5) can be approximated by 12%. The precision improves as we generate more and more sets of such random variables.
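The recipe just described is only a few lines of code. A sketch, where the sample size n = 10, the rate of the exponential, the replication count, and the evaluation points x are all my own illustrative choices:

```python
import random
import statistics

random.seed(1)

def mc_cdf(xs, n=10, reps=5000, rate=1.0):
    # Monte Carlo approximation of P(s_n^2 <= x) for each x in xs,
    # using one common set of simulated sample variances.
    variances = []
    for _ in range(reps):
        sample = [random.expovariate(rate) for _ in range(n)]
        variances.append(statistics.variance(sample))  # (n-1)-divisor version
    return [sum(v <= x for v in variances) / reps for x in xs]

p_low, p_high = mc_cdf([0.5, 3.0])
```

Using one common set of simulated variances for all x guarantees the estimated distribution function is monotone in x.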
A more general problem is to compute

∫ g(θ)π(θ) dθ  or  ∑_θ g(θ)π(θ)

over θ ∈ Θ, where π(θ) is a density function or probability mass function on Θ. If we can generate random variables X_1, X_2, . . . such that they all take values in Θ and have π(θ) as their density function or probability mass function, then the law of large numbers in probability theory implies

n^{-1} ∑_i g(X_i) → E_π{g(X_1)}.

Hence, the value of n^{-1} ∑_i g(X_i) provides an approximation to ∫ g(θ)π(θ) dθ or ∑_θ g(θ)π(θ).
Before this idea becomes applicable, we need to overcome a technical difficulty: how do we generate random variables with a specified distribution very quickly? It is found that some seemingly simple functions can produce very unpredictable outcomes. Thus, when such an operation is applied repeatedly, the outcomes appear to be completely random. A so-called pseudo random number generator for the uniform distribution is hence constructed. Starting from this point, one can then generate random numbers from most commonly used distributions, whether discrete or continuous.
In some applications, especially in Bayesian analysis, it is often necessary to generate random numbers from a density function that is not fully specified. For example, we may only know that the corresponding density function π(θ) is proportional to some known function. In theory, one needs only to rescale the function so that the total probability becomes 1. In reality, finding the scale itself is numerically impractical.
The beauty of Markov chain Monte Carlo (MCMC) is to construct a discrete time Markov chain whose stationary distribution is the same as the given density function π(θ), without the need to completely specify π(θ). How this is achieved is the topic of this section.
In the eyes of a computer, nothing is continuous. Hence we deal with a discrete π(θ). Assume the probability function π = (π_i : i ∈ Θ) is our target. We construct a discrete time Markov chain with state space Θ whose stationary distribution is given by π.
For a given π, we have many choices of such Markov chains. One which is reversible at equilibrium may have some advantage. Thus, we try to find a Markov chain with transition probabilities satisfying

π_k p_kj = π_j p_jk

for all k, j ∈ Θ. The following steps will create such a discrete time Markov chain.
Assume we have X_n = i already. We need to generate X_{n+1} according to some transition probability.
(1) First, we pick an arbitrary stochastic matrix H = (h_ij : i, j ∈ Θ). This matrix is called the 'proposal matrix'. We generate a random number Y according to the distribution given by (h_ik : k ∈ Θ); that is, P(Y = k) = h_ik. Since the h_ik are well defined, we consider this feasible.
(2) Select a matrix A = (a_ij : i, j ∈ Θ) with entries between 0 and 1. The a_ij are called 'acceptance probabilities'. Given Y = j, we generate a uniform [0, 1] random variable Z and define

X_{n+1} = X_n I(Z > a_ij) + Y I(Z < a_ij).
Repeating these two steps, we create a sequence of random numbers X_0, X_1, . . . with an arbitrary starting value X_0. Our next task is to choose H and A such that the stationary distribution is what we are looking for: π.
With the given A, the transition probabilities are given by

p_ij = h_ij a_ij                 if i ≠ j,
p_ii = 1 − ∑_{k: k≠i} h_ik a_ik.

This is because the transition to j occurs only if both Y = j and Z < a_ij when X_n = i.
It turns out that the balance equation is satisfied when we choose

a_ij = min{1, (π_j h_ji)/(π_i h_ij)}.

This choice results in the algorithm called the Hastings algorithm. Note also that in this algorithm we do not need knowledge of the π_i themselves, but only of the ratios π_i/π_j for all i and j.
Let us try to verify this result. Consider two states i ≠ j. We have

π_i p_ij = π_i h_ij a_ij = min{π_i h_ij, π_j h_ji}.

Similarly,

π_j p_ji = π_j h_ji a_ji = min{π_i h_ij, π_j h_ji}.

Hence, π_i p_ij = π_j p_ji.
When two states are the same, the balance equation is obviously satisfied.
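The verification above can be replayed numerically. A sketch of the Hastings transition matrix on a three-point state space, where the unnormalized target π = (1, 2, 3) and the uniform proposal H are my own illustrative choices (only the ratios π_j/π_i enter):

```python
pi = [1.0, 2.0, 3.0]                    # target, known only up to a constant
N = len(pi)
H = [[1.0 / N] * N for _ in range(N)]   # proposal matrix

def a(i, j):
    # Hastings acceptance probability a_ij = min{1, pi_j h_ji / (pi_i h_ij)}
    return min(1.0, (pi[j] * H[j][i]) / (pi[i] * H[i][j]))

# p_ij = h_ij a_ij for i != j; p_ii collects the leftover mass
P = [[H[i][j] * a(i, j) if i != j else 0.0 for j in range(N)] for i in range(N)]
for i in range(N):
    P[i][i] = 1.0 - sum(P[i])
```

Each row of P sums to 1, and detailed balance π_i p_ij = π_j p_ji holds for every pair, so the chain is reversible with stationary distribution proportional to π.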
When Θ is multidimensional, there is some advantage in generating the random vectors one component at a time. That is, if X_n is the current random vector, we make X_{n+1} equal to X_n except in one of its components. The ultimate goal can be achieved by randomly selecting a component or by rotating through the components.
Gibbs sampler, or heat bath algorithm. Assume Θ = S^V; that is, its dimension is V. The state space S is finite and so is the dimension V. Each state in Θ can be written as i = (i_w : w = 1, 2, . . . , V). Let

Θ_{i,v} = {j ∈ Θ : j_w = i_w for w ≠ v}.

That is, it is the set of all states which have the same components as state i other than the vth component. Assume X_n = i; we now choose a state from Θ_{i,v} for X_{n+1}. For this purpose, we select

h_ij = π_j / ∑_{k ∈ Θ_{i,v}} π_k,   j ∈ Θ_{i,v},

in the first step of the Hastings algorithm. That is, the choice in Θ_{i,v} is based on the conditional distribution of the vth component given the others.
The next step of the Hastings algorithm is the same as before. It turns out that a_ij = 1 for all j ∈ Θ_{i,v}; that is, there is no need for the second step.
We need to determine the choice of v. We may rotate the components or randomly pick a component each time we generate the next random vector.
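The claim that a_ij = 1 under the Gibbs proposal can be checked directly on a tiny state space. A sketch with S = {0, 1}, V = 2, and an arbitrary positive target of my own choosing; since j ∈ Θ_{i,v} implies Θ_{j,v} = Θ_{i,v}, the acceptance ratio collapses to 1:

```python
from itertools import product

S, V = (0, 1), 2
states = list(product(S, repeat=V))
pi = {s: 1.0 + 2.0 * s[0] + 3.0 * s[1] for s in states}   # arbitrary weights

def neighborhood(i, v):
    # Theta_{i,v}: states agreeing with i in every component except possibly v
    return [j for j in states if all(j[w] == i[w] for w in range(V) if w != v)]

def h(i, j, v):
    # Gibbs proposal: pi_j normalized over Theta_{i,v}
    return pi[j] / sum(pi[k] for k in neighborhood(i, v))

# Hastings acceptance for every Gibbs move; all should equal 1
accepts = [min(1.0, (pi[j] * h(j, i, v)) / (pi[i] * h(i, j, v)))
           for v in range(V) for i in states for j in neighborhood(i, v)]
```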
Metropolis algorithm. If the matrix H is symmetric, then a_ij = min{1, π_j/π_i} and hence p_ij = h_ij min{1, π_j/π_i}.
We can find such a matrix by placing a uniform distribution on Θ_{i,v} as defined in the Gibbs sampler algorithm. That is,

h_ij = [|Θ_{i,v}| − 1]^{-1}

for i ≠ j.
Finally, even though the limiting distribution of X_n is π, the distribution of X_n may be far from π until n is huge. It is also very hard to tell how large n has to be before the distribution of X_n well approximates the stationary distribution.
In applications, researchers often propose to make use of a burn-in period M. That is, we throw away X_1, X_2, . . . , X_M for a large M and start using Y_1 = X_{M+1}, Y_2 = X_{M+2}, and so on to compute various characteristics of π. For example, we estimate E_π g(θ) by n^{-1} ∑_{i=1}^n g(X_{M+i}). It is hence very important to have some idea of how large this M must be before the random vectors (numbers) can be used.
Let P be the transition probability matrix of a finite irreducible Markov chain with period d. Then,
(a) λ_1 = 1 is an eigenvalue of P;
(b) the d complex roots of unity

λ_1 = ω^0, λ_2 = ω^1, . . . , λ_d = ω^{d−1},  with ω = e^{2πi/d},

are eigenvalues of P;
(c) the remaining eigenvalues λ_{d+1}, . . . , λ_N satisfy |λ_j| < 1.
When all eigenvalues are distinct, P = B^{-1}ΛB for some matrix B and a diagonal matrix Λ with entries λ_1, . . . , λ_N. This decomposition allows us to compute the n-step transition probability matrix easily: P^n = B^{-1}Λ^n B.
When not all eigenvalues are distinct, we can still decompose P as B^{-1}MB. However, M cannot be made diagonal any more. The best we can do is block diagonal, M = diag(J_1, J_2, . . .), where each block has the form

J_i =
  [ λ_i   1    0    0   · · ·
     0   λ_i   1    0   · · ·
     0    0   λ_i   1   · · ·
     0    0    0   λ_i  · · ·
    . . . ].
Fortunately, M^n still has a very simple form. Hence, once such a decomposition is found, the n-step transition probability matrix is easily obtained.
Remember that the limiting probability is closely linked to the n-step transition probability matrix. Suppose that the Markov chain in the MCMC algorithm is aperiodic. Then it can be shown that

p_ij^(n) = π_j + O(n^{m−1} |λ_2|^n),

where λ_2 is the eigenvalue with the second largest modulus and m is the multiplicity of this eigenvalue. Note that the transition probability converges to π_j at an exponential rate when m = 1.
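For a two-state chain the rate |λ_2|^n can be seen exactly. A sketch with transition probabilities a = 0.3 and b = 0.2 of my own choosing; here λ_2 = 1 − a − b, m = 1, and the error |p^(n)_00 − π_0| shrinks by the factor |λ_2| at every step:

```python
a, b = 0.3, 0.2
P = [[1 - a, a], [b, 1 - b]]
lam2 = 1 - a - b          # second eigenvalue
pi0 = b / (a + b)         # stationary probability of state 0

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# track the error |p^(n)_00 - pi_0| for n = 1, ..., 10
Pn, errors = P, []
for _ in range(10):
    errors.append(abs(Pn[0][0] - pi0))
    Pn = mat_mul(Pn, P)

ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
```

Here p^(n)_00 = π_0 + π_1 λ_2^n exactly, so every ratio of successive errors equals λ_2.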
Theorem 5.12
Let X be an aperiodic irreducible reversible Markov chain on the finite state space Θ, with transition probability matrix P and stationary distribution π. Then

∑_k |p_ik^(n) − π_k| ≤ |Θ| · |λ_2|^n sup{|ν_r(i)| : r ∈ Θ},  i ∈ Θ, n ≥ 1,

where ν_r(i) is the ith term of the rth right eigenvector V_r of P. ♦
Chapter 6
General Stochastic Process in
Continuous Time
The term stochastic process generally refers to a collection of random variables denoted {X(t), t ∈ T}. Usually, these random variables are linked by some index generally referred to as time; if so, T is a set of time points. When T contains integer values only, we get a discrete time stochastic process, while when T is an interval of the real numbers, we have a continuous time stochastic process. An example where T is not time is the situation when T represents geographical location.
6.1 Finite Dimensional Distribution
Let us make the simplifying assumption that T = [0, T] for some positive number T. Recall that a random variable is a measurable function on a probability space (Ω, F, P). That is, for each t ∈ T, S(t) is a map of the form

S(t) : Ω → R,

where R is the set of all real numbers. That is, at any sample point ω ∈ Ω, S(t) takes a real value: S(ω, t).
It becomes apparent that S(ω, t) is a bivariate function. One of its variables is time t, and the other is the sample point ω. Given t, we
get a random variable. At the same time, once we fix ω = ω_0, S(ω_0, t) is a function of t. This function is called a sample path of S(t). We denote by s(t) a realized sample path of S(t). This is analogous to using x for a realized value of a random variable X.
We may also put it as follows: a random variable is a map from the sample
space to the space of real numbers; a stochastic process is a map from the
sample space to the space of real valued functions. Recall that a random
variable is required to be a measurable map. Thus, there must be some kind of measurability requirement on a stochastic process.
Compared to the space of real numbers, the space of real functions on [0, T] is much more complex. The problem of setting up a suitable σ-algebra is beyond this course. The space of all continuous functions, denoted C[0, T], has a simpler structure, and a proper σ-field is available. In applications, however, stochastic processes with non-continuous sample paths are useful. The well known Poisson process, for example, does not have continuous sample paths. A slight extension of C[0, T] is to add functions which are right-continuous with left limits. We denote this space by D[0, T]. These functions are referred to as regular right-continuous functions, or càdlàg functions. It turns out that D[0, T] is large enough for usual applications, yet still simple enough to allow a useful σ-field.
To properly investigate the properties of a random variable X, we want to be able to compute P(X ≤ x) for any real number x. We refer to F(x) = P(X ≤ x) as the cumulative distribution function of X, or simply the distribution of X. If S(t) is a stochastic process on D[0, T], then for any t_1, t_2 ∈ [0, T], we must have

P(S(t_1) ≤ s_1, S(t_2) ≤ s_2)

well defined. That means the set {S(t_1) ≤ s_1, S(t_2) ≤ s_2} must be a member of F. It is easy to see that this requirement extends from two time points to any finite number of time points, and to any countable number of time points. It can be shown that if we can give a proper probability to all such sets (called cylinder sets), then the probability measure can be extended uniquely to the smallest σ-algebra that contains all cylinder sets.
A consequence of this discussion is the following theorem.
Theorem 6.1
A stochastic process taking values in D[0, T] is uniquely determined by its
ﬁnite dimensional distributions. ♦
6.2 Sample Path
Let us have a look of the following example.
Example 6.1
Let X(t) = 0 for all t, 0 ≤ t ≤ 1, and let τ be a uniformly distributed random variable on [0, 1]. Let Y(t) = 0 for t ≠ τ and Y(t) = 1 if t = τ.
Now we have two stochastic processes: {X(t) : t ∈ [0, 1]} and {Y(t) : t ∈ [0, 1]}. It appears that they are very different. Every sample path of X is the identically 0 function, and every sample path of Y(t) has a jump point at t = τ.
Yet we notice that for any fixed t,

P(Y(t) ≠ 0) = P(τ = t) = 0.

That is, P(Y(t) = 0) = 1. Hence, for any t ∈ [0, 1], X(t) and Y(t) have the same distribution. Further, for any set of 0 ≤ t_1 < t_2 < · · · < t_n ≤ 1, the joint distribution of {X(t_1), X(t_2), . . . , X(t_n)} is identical to that of {Y(t_1), Y(t_2), . . . , Y(t_n)}. That is, X(t) and Y(t) have the same finite dimensional distributions. ♦
The moral of this example is: even if two stochastic processes have the same distribution, they can still have very different sample paths. This observation prompts the following definition.
Deﬁnition 6.1
Two stochastic processes are called a version (modiﬁcation) of one another
if
P{X(t) = Y (t)} = 1 for all t, 0 < t < T.
♦
In the case when a number of versions exist for a given distribution, the stochastic process with the smoothest sample paths will be the version for investigation.
Having P{X(t) = Y(t)} = 1 does not mean the event {X(t) ≠ Y(t)} is an empty set. Let this set be called N_t. While the probability of each N_t is zero, the probability of the union over t ∈ [0, T] can be nonzero. In the above example, it equals 1. When P(∪_{0≤t≤T} N_t) = 0, the two stochastic processes are practically the same: they are indistinguishable.
Given a distribution of a stochastic process, can we always ﬁnd a version
of it so that its sample paths are continuous or regular (with only jump
discontinuities)?
Theorem 6.2
Let S(t), 0 ≤ t ≤ T be a real valued stochastic process.
(a) A continuous version of S(t) exists if we can find positive constants α, ε, and C such that

E|S(t) − S(u)|^α ≤ C|t − u|^{1+ε}

for all 0 ≤ t, u ≤ T.
(b) A regular version of S(t) exists if we can find positive constants α_1, α_2, ε, and C such that

E{|S(u) − S(v)|^{α_1} |S(t) − S(v)|^{α_2}} ≤ C|t − u|^{1+ε}

for all 0 ≤ u ≤ v ≤ t ≤ T. ♦
There is no need to memorize this theorem. What should we make of it? Under some continuity conditions on expectations, a regular enough version of a stochastic process exists. We hence have a legitimate basis to investigate stochastic processes with this property.
Finally, let us summarize the results we discussed in this section. The distribution of a stochastic process is determined by its finite dimensional distributions, whatever that means. Even if two stochastic processes have the same distribution, their sample paths may have very different properties. However, for each given stochastic process, there often exists a continuous version or a regular version, whether the given one itself has continuous sample paths or not.
Ultimately, we focus on the most convenient version of the given stochastic process in most applications.
6.3 Gaussian Process
Let us make it super simple.
Deﬁnition 6.1
Let {X(t) : 0 ≤ t ≤ T} be a stochastic process. X(t) is a Gaussian process
if all its ﬁnite dimensional distributions are multivariate normal. ♦
We are at ease when working with normal and multivariate normal random variables. Thus, we also feel more comfortable with Gaussian processes when we have to deal with continuous time stochastic processes taking arbitrary real values. We will be obsessed with Gaussian processes.
Example 6.2 A simple Gaussian Process
Let X and Y be two independent standard normal random variables. Let
S(t) = sin(t)X + cos(t)Y.
It is seen that S(t) is a Gaussian process; its covariance structure is Cov(S(t), S(u)) = sin(t) sin(u) + cos(t) cos(u) = cos(t − u). ♦
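A quick Monte Carlo check of the covariance of S(t) = sin(t)X + cos(t)Y, which should be close to cos(t − u); the time points, seed, and replication count are my own choices:

```python
import math
import random

random.seed(7)
t, u, reps = 0.3, 1.1, 200_000

st, su = [], []
for _ in range(reps):
    x, y = random.gauss(0, 1), random.gauss(0, 1)  # the same X, Y drive both times
    st.append(math.sin(t) * x + math.cos(t) * y)
    su.append(math.sin(u) * x + math.cos(u) * y)

mean_st, mean_su = sum(st) / reps, sum(su) / reps
cov = sum((p - mean_st) * (q - mean_su) for p, q in zip(st, su)) / reps
```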
6.4 Stationary Processes
Since the random behavior of a general stochastic process is very complex, we often first settle down with very simple ones and then timidly move toward more complex ones. The simple random walk was the first example of a stochastic process; we then introduced the discrete time Markov chain, and so on. We still hope to stay in familiar territory by moving forward a little: introducing the concept of stationary processes here.
Deﬁnition 6.1 Strong Stationarity.
Let {X(t) : 0 ≤ t ≤ T = ∞} be a stochastic process. If for any choice of 0 ≤ t_1 < t_2 < · · · < t_n, with any n and positive constant s, the joint distribution of

X(t_1 + s), X(t_2 + s), . . . , X(t_n + s)

does not depend on s, then we say that {X(t) : 0 ≤ t ≤ T = ∞} is strongly stationary. ♦
Deﬁnition 6.2 Weak Stationarity.
Let {X(t) : 0 ≤ t ≤ T = ∞} be a stochastic process. If for any choice of 0 ≤ t_1 < t_2 < · · · < t_n, with any n and positive constant s, the mean and covariance matrix of

X(t_1 + s), X(t_2 + s), . . . , X(t_n + s)

do not depend on s, then we say that {X(t) : 0 ≤ t ≤ T = ∞} is weakly stationary, or simply stationary. ♦
The above definitions can be easily applied to the case when T < ∞, or to discrete time stochastic processes. The obstacles are merely notational.
Being strongly stationary does not necessarily imply being weakly stationary, because of the moment requirements of the latter.
Most results for stationary processes are rather involved. We only mention two theorems.
Theorem 6.3 Strong and Weak Laws of Large Numbers (Ergodic Theorems).
(a) Let X = {X_n : n = 1, 2, . . .} be a strongly stationary process such that E|X_1| < ∞. There exists a random variable Y with the same mean as the X_n such that

n^{-1} ∑_{i=1}^n X_i → Y

almost surely and in mean.
(b) Let X = {X_n : n = 1, 2, . . .} be a weakly stationary process. There exists a random variable Y with the same mean as the X_n such that

n^{-1} ∑_{i=1}^n X_i → Y

in mean square. ♦
If a continuous time stationary process can be discretised, the above two
results can also be very useful.
6.5 Stopping Times and Martingales
We are often interested in knowing how the property of a stochastic process
following the moment when some event has just occurred. This is a setting
point of a renewal process. These moments are random variables with special
properties.
Consider the example of the simple random walk {X_n : n = 0, 1, . . .}: the time of returning to zero is a renewal event. At any moment n, we examine the value of X_n to determine whether the process has renewed itself at n with respect to this particular renewal event.
In general, we need to examine the values of {X_i : i = 0, 1, . . . , n} to determine the occurrence of the renewal or other events. The time of occurrence of such an event is therefore a function of {X_i : i = 0, 1, . . . , n}. In measure theory language, if τ represents such a moment, then the event

{τ ≤ n}

is F_n = σ(X_0, . . . , X_n) measurable.
Definition 6.1 Stopping Time
Let {X(t) : 0 ≤ t ≤ T = ∞} be a stochastic process. A nonnegative random variable τ, which is allowed to take the value ∞, is called a stopping time if for each t, the event

{τ ≤ t} ∈ σ(X_s : 0 ≤ s ≤ t).

♦
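A concrete illustration of the definition (a hypothetical sketch, since the notes leave their example open): the first return to zero of a simple random walk is a stopping time, because whether {τ ≤ n} holds can be decided from X_0, . . . , X_n alone.

```python
import random

random.seed(2)

def first_return_to_zero(path):
    # tau = min{n > 0 : X_n = 0}. Deciding whether {tau <= n} holds
    # needs only X_0, ..., X_n -- exactly the stopping-time property.
    for n, x in enumerate(path):
        if n > 0 and x == 0:
            return n
    return None  # no return observed in this finite window

# Simple random walk started at 0.
path = [0]
for _ in range(10_000):
    path.append(path[-1] + random.choice((-1, 1)))

tau = first_return_to_zero(path)
print("first return to zero at n =", tau)
```

By contrast, a time such as "one step before the walk last visits zero" cannot be decided from the history up to that time, and so is not a stopping time.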
Example 6.3 To be invented.
Chapter 7
Brownian Motion or Wiener
Process
It was the botanist R. Brown who first described the irregular and random motion of a pollen particle suspended in fluid (1828). Hence, such motions are called Brownian motion. Interestingly, the molecular theory of matter was not widely accepted until Einstein, in 1905, used Brownian motion to argue that such motions are caused by bombardment by molecules. The theory was far from complete until a mathematical model was developed later by Wiener (1931). The stochastic model used for Brownian motion is hence also called the Wiener process. We will see that the mathematical model has many desirable properties, and a few that do not fit well with the physical laws governing Brownian motion. As early as 1931, Brownian motion was used in the mathematical theory of stock prices. It is nowadays fashionable to use stochastic processes to study financial markets.
Deﬁnition 7.2 Deﬁning properties of Brownian motion.
Brownian motion {B(t)} is a stochastic process with the following properties:
1. (Independent increments) The random variable B(t) − B(s) is independent of {B(u) : 0 ≤ u ≤ s} for any s < t.
2. (Stationary Gaussian increments) The distribution of B(t) − B(s) is normal with mean 0 and variance t − s.
3. (Continuity) The sample paths of B(t) are continuous.
♦
We should compare this definition with that of the Poisson process: the increment distribution is changed from Poisson to normal. The sample path requirement is needed so that we work with a definite version of the process.
Once the distribution of B(0) is given, all the finite dimensional distributions of B(t) are completely determined, which then determines the distribution of the process itself. This is again analogous to a counting process with independent, stationary Poisson increments.
In some cases, B(0) = 0 is part of the deﬁnition. We will more or less
adopt this convention, but will continue to spell it out.
Example 7.1
Assume that B(t) is a Brownian motion and B(0) = 0.
1. Find the marginal distributions of B(1), B(2) and B(2) −B(1).
2. Find the joint distribution of B(1) and B(2), and compute P(B(1) > 1, B(2) < 0).
♦
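A Monte Carlo sketch of Example 7.1 (assumed set-up: B(0) = 0, so B(1) and the increment B(2) − B(1) are independent N(0, 1) variables); the exact joint distribution is bivariate normal with Cov{B(1), B(2)} = 1.

```python
import random

random.seed(3)

N = 200_000
count = 0
sum_prod = 0.0
for _ in range(N):
    b1 = random.gauss(0.0, 1.0)        # B(1) ~ N(0, 1)
    b2 = b1 + random.gauss(0.0, 1.0)   # add the independent increment B(2) - B(1)
    sum_prod += b1 * b2
    if b1 > 1 and b2 < 0:
        count += 1

p_hat = count / N
cov_hat = sum_prod / N   # E{B(1)B(2)} equals the covariance since both means are 0
print(f"P(B(1) > 1, B(2) < 0) ~= {p_hat:.4f}")
print(f"Cov(B(1), B(2)) ~= {cov_hat:.3f} (exact value: min(1, 2) = 1)")
```

The event {B(1) > 1, B(2) < 0} requires a tail value of B(1) followed by an even larger negative increment, so its probability is small (of order 0.01).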
Recall that a multivariate normal distribution is completely determined by its mean vector and covariance matrix. Thus, the finite dimensional distributions of a Gaussian process are completely determined by its mean function and correlation function. Since Brownian motion is a special Gaussian process, it is useful to calculate its mean and correlation function.
Example 7.2 Compute the correlation function of the Brownian motion.
♦
Based on this calculation, it is clear that the above Brownian motion is not stationary.
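For reference, the calculation asked for in Example 7.2 can be sketched directly from the defining properties (assuming B(0) = 0):

```latex
% For s <= t, use independent increments and Var{B(u)} = u:
\begin{aligned}
\operatorname{Cov}\{B(s), B(t)\}
  &= \operatorname{Cov}\{B(s),\, B(s) + (B(t) - B(s))\} \\
  &= \operatorname{Var}\{B(s)\} + \operatorname{Cov}\{B(s),\, B(t) - B(s)\} \\
  &= s + 0 = s.
\end{aligned}
% In general Cov{B(s), B(t)} = min(s, t), which depends on s and t
% separately rather than on t - s alone, so the process is not
% (weakly) stationary.
```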
Example 7.3 Distribution of ∫_0^1 B(t) dt given B(1) = 0. ♦
Example 7.4
Let ξ_0, ξ_1, . . . be a sequence of independent standard normal random variables. Define

S(t) = (t/√π) ξ_0 + √(2/π) Σ_{i=1}^∞ {sin(it)/i} ξ_i.

Assume that we can work with the infinite summation as if nothing unusual happens. Then we can verify that the covariance structure of S(t) resembles that of the Brownian motion with B(0) = 0 over [0, π]. Thus, S(t) is a Brownian motion, and it provides a very concrete way of representing the Brownian motion. It can be seen that all sample paths of S(t) are continuous, yet they are nowhere differentiable. ♦
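A numerical check of the claim in Example 7.4 (assuming the sine-series coefficient is √(2/π)): since the ξ_i are independent with unit variance, the covariance of S(s) and S(t) is a deterministic series that can be summed directly and compared with min(s, t).

```python
import math

def series_cov(s, t, n_terms=20_000):
    # Covariance implied by S(t) = (t/sqrt(pi)) xi_0
    #   + sqrt(2/pi) * sum_i (sin(it)/i) xi_i with iid N(0,1) xi's:
    # independent coefficients, so the covariances simply add up.
    trig_sum = sum(
        math.sin(i * s) * math.sin(i * t) / i ** 2
        for i in range(1, n_terms + 1)
    )
    return s * t / math.pi + (2.0 / math.pi) * trig_sum

# Over [0, pi] this should agree with the Brownian covariance min(s, t).
pairs = [(0.3, 1.1), (1.0, 2.5), (2.0, 3.0)]
for s, t in pairs:
    print(f"Cov(S({s}), S({t})) ~= {series_cov(s, t):.4f}   min(s, t) = {min(s, t)}")
```

The truncation error is bounded by the tail of Σ 1/i², so 20000 terms give agreement to well within 10⁻³.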
Example 7.5 Sample paths of the Brownian motion.
Brownian motions have the following well known but very peculiar properties.
1. Almost all sample paths are continuous;
2. Almost every sample path is not monotone in any interval, no matter
how small the interval is;
3. Almost every sample path is not diﬀerentiable at any point;
4. Almost every sample path has infinite variation on any interval, no matter how short it is;
5. Almost all sample paths have quadratic variation on [0, t] equal to t,
for any t.
♦
Deﬁnition 7.3
Let s(x) be a function of a real variable.
(a) The (total) variation of s(x) over [0, t] is

lim Σ_{i=1}^n |s(x_i) − s(x_{i−1})|,

where the limit is taken over any sequence of partitions (x_1, . . . , x_n) such that

0 = x_0 < x_1 < x_2 < · · · < x_n = t

and max{x_i − x_{i−1} : i = 1, . . . , n} → 0.
(b) The quadratic variation over the interval [0, t] is

lim Σ_{i=1}^n {s(x_i) − s(x_{i−1})}²,

where the limit is taken over any sequence of partitions (x_1, . . . , x_n) such that

0 = x_0 < x_1 < x_2 < · · · < x_n = t

and max{x_i − x_{i−1} : i = 1, . . . , n} → 0. ♦
If s(x) is monotone over [0, t], then the total variation is simply |s(t) − s(0)|. Having a nonzero quadratic variation implies that the function fluctuates up and down very often. It is hard to think of an explicit function with such a property. One example of a heavily fluctuating function is s(x) = √x sin(1/x) over [0, 1].
Theorem 7.1
If s(x) is continuous and of ﬁnite total variation, then its quadratic variation
is zero. ♦
Property 1 is true by deﬁnition. Property 2 can be derived from Property
4. If a sample path is monotone over an interval, then its variation is ﬁnite,
which contradicts Property 4. Property 3 ﬁts well with Property 2; if a
sample path is nowhere monotone, it is very odd to be diﬀerentiable. This
point can be made rigorous.
Finally, Property 4 follows from Property 5 via Theorem 7.1: a continuous sample path with finite total variation would have zero quadratic variation. It is hence vital to show that Brownian motion has Property 5.
Let 0 = t_0 < t_1 < t_2 < · · · < t_n = t be a partition of the interval [0, t], where each t_i depends on n though this is not spelled out. Let

T_n = Σ_{i=1}^n {B(t_i) − B(t_{i−1})}²

be the quadratic variation based on this partition. We find

E{T_n} = Σ_{i=1}^n E{B(t_i) − B(t_{i−1})}² = Σ_{i=1}^n Var{B(t_i) − B(t_{i−1})} = Σ_{i=1}^n (t_i − t_{i−1}) = t.
That is, the expected sample path quadratic variation is t, regardless of how
the interval is partitioned.
Property 5, however, means that lim T_n = t almost surely along any sequence of partitions such that δ_n = max |t_i − t_{i−1}| → 0. We give a partial proof for the case when Σ_n δ_n < ∞. In this case, we have

Var{T_n} = Σ_{i=1}^n Var[{B(t_i) − B(t_{i−1})}²] ≤ Σ_{i=1}^n 3(t_i − t_{i−1})² ≤ 3t max |t_i − t_{i−1}| = 3δ_n t,

using the bound Var[{B(t_i) − B(t_{i−1})}²] ≤ E{B(t_i) − B(t_{i−1})}⁴ = 3(t_i − t_{i−1})². Therefore,

Σ_n Var{T_n} = Σ_n E{T_n − E(T_n)}² < ∞.

Using the monotone convergence theorem from measure theory, this implies E[Σ_n {T_n − E(T_n)}²] < ∞. (Note the change of order between summation and expectation.) The latter implies T_n − E(T_n) → 0 almost surely, which is the same as T_n → t almost surely.
Despite the beauty of this result, this property also shows that the mathematical Brownian motion does not model the physical Brownian motion at the microscopic level. If Newton's laws hold, acceleration cannot be instantaneous, and the sample paths of a diffusing particle should be smooth.
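A quick simulation sketch of Properties 4 and 5 (a discretised approximation, not a proof): along a uniform partition of [0, t], the sum of squared increments stabilises near t as the partition is refined, while the sum of absolute increments blows up.

```python
import math
import random

random.seed(6)

t = 2.0
results = {}
for n in (100, 10_000):
    dt = t / n
    # Brownian increments over a uniform partition of [0, t].
    incs = [random.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
    qv = sum(d * d for d in incs)    # T_n: stabilises near t
    tv = sum(abs(d) for d in incs)   # partition variation: grows like sqrt(n)
    results[n] = (qv, tv)
    print(f"n = {n:6d}: quadratic variation ~= {qv:.3f}, total variation ~= {tv:.1f}")
```

The expected partition variation is of order √(nt), consistent with Property 4 (infinite total variation in the limit), while E{T_n} = t for every n as computed above.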
7.1 Existence of Brownian Motion
One technical consideration is whether there exist stochastic processes with the properties prescribed by the definition of Brownian motion. This question is usually answered by defining a sequence of stochastic processes such that its limit exists and has the defining properties of Brownian motion. One rigorous proof of the existence is sketched as follows.
Let ξ_i, i = 1, 2, . . . be independent and normally distributed with mean 0 and variance 1. The existence of such a sequence and the corresponding probability space is well discussed in probability theory and is assumed here. Let

Y_n = Σ_{i=1}^n ξ_i

and define

B_n(t) = n^{−1/2} {Y_{[nt]} + (nt − [nt]) ξ_{[nt]+1}},

which is a smoothed partial sum of the ξ's. It can be verified that the covariance structure of B_n(t) approaches what is assumed for Brownian motion. All finite dimensional distributions of B_n(t) are normal. With a dose of asymptotic theory, which requires the verification of tightness, it is seen that B_n(t) has a limit, and the limiting process has all the defining properties of a Brownian motion. Thus, the existence is verified. See page 62 of Billingsley (1968).
One may replace ξ_i, i = 1, 2, . . . by independent and identically distributed random variables taking the values ±a_n. We may define the partial sum Y_n in the same way, and define a similar stochastic process. By properly scaling down the time, and letting a_n → 0, the limiting process will also have the defining properties of a Brownian motion. Thus, Brownian motion may be regarded as a limit of simple random walks. Many of the other properties of Brownian motion to be discussed have their simple random walk counterparts.
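A sketch of this construction with ±1 steps (the random walk version just described): the variance of the rescaled, interpolated partial-sum process B_n(t) should approach t.

```python
import math
import random

random.seed(7)

def bn(t, xi):
    # B_n(t) = n^{-1/2} { Y_[nt] + (nt - [nt]) * xi_{[nt]+1} },
    # the rescaled, linearly interpolated partial-sum process.
    n = len(xi)
    k = int(n * t)
    value = sum(xi[:k])
    if k < n:
        value += (n * t - k) * xi[k]
    return value / math.sqrt(n)

n, reps, t = 2_000, 5_000, 0.5
second_moment = 0.0
for _ in range(reps):
    xi = [random.choice((-1.0, 1.0)) for _ in range(n)]  # +-1 steps instead of normals
    second_moment += bn(t, xi) ** 2

var_hat = second_moment / reps
print(f"Var(B_n({t})) ~= {var_hat:.3f} (limiting value: {t})")
```

The full argument of course needs more than matching variances (tightness and normal limits of the finite dimensional distributions), but the second moment already behaves as the limit requires.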
7.2 Martingales of Brownian Motion
The reason for Brownian motion being at the centre of most textbooks and courses on stochastic processes can be attributed to the fact that it is a good example of many important concepts in continuous time stochastic processes. This section introduces its martingale aspects.
Recall that each random variable X generates a σ-field: σ(X). This extends to a set of random variables. If X(t) is a stochastic process on [0, T], then {X(u) : 0 ≤ u ≤ t} is a set of random variables and it generates a σ-field which is usually denoted as F_t. As collections of sets, these σ-fields satisfy

F_s ⊂ F_t for all 0 ≤ s < t ≤ T.
We may also put forward a family of σ-fields F_t not associated with any stochastic process. As long as {F_t, t ≥ 0} has this increasing property, it is called a filtration.
Definition 7.1 Martingale
A stochastic process {X(t), t ≥ 0} is a martingale with respect to F_t if for any t it is integrable, E|X(t)| < ∞, and for any s > 0,

E{X(t + s) | F_t} = X(t).

♦
With this definition, we have implied that X(t) is F_t measurable. In general, F_t represents the information about X(t) available to an observer up to time t. If any decision is to be made at time t by this observer, the decision is an F_t measurable function. Unless otherwise specified, we take F_t = σ{X(s) : s ≤ t} when a stochastic process X(t) and a filtration F_t are the subjects of discussion.
One of the defining properties of Brownian motion can be reworded as: Brownian motion itself is a martingale.
We would like to mention two other martingales.
Theorem 7.2
Let B(t) be a Brownian motion.
(a) B²(t) − t is a martingale;
(b) For any θ, exp{θB(t) − θ²t/2} is a martingale. ♦
Proofs will be provided in class.
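The full martingale property concerns conditional expectations, but it implies in particular the unconditional mean identities E{B²(t) − t} = 0 and E[exp{θB(t) − θ²t/2}] = 1 for every t. A simulation sketch of these two identities:

```python
import math
import random

random.seed(8)

theta, t, reps = 0.7, 1.5, 400_000
sum_sq, sum_exp = 0.0, 0.0
for _ in range(reps):
    b = random.gauss(0.0, math.sqrt(t))                     # B(t) ~ N(0, t)
    sum_sq += b * b - t                                     # B^2(t) - t: mean 0
    sum_exp += math.exp(theta * b - 0.5 * theta ** 2 * t)   # exponential martingale: mean 1

mean_sq = sum_sq / reps
mean_exp = sum_exp / reps
print(f"E[B^2(t) - t] ~= {mean_sq:.4f} (exact: 0)")
print(f"E[exp(theta B(t) - theta^2 t / 2)] ~= {mean_exp:.4f} (exact: 1)")
```

The second identity is just the moment generating function of N(0, t) evaluated at θ, rescaled; the martingale statements in Theorem 7.2 strengthen these to conditional versions given F_s.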
7.3 Markov Property of Brownian Motion
Let {X(t), t ≥ 0} be a stochastic process and F_t the corresponding filtration. The Markov property extends naturally as follows.
Deﬁnition 7.1 Markov Process.
If for any s, t > 0,

P{X(t + s) ≤ y | F_t} = P{X(t + s) ≤ y | X(t)}
almost surely for all y, then {X(t), t ≥ 0} is a Markov process. ♦
Compared to the deﬁnition of Markov chain, this deﬁnition removes the
requirement that the state space of X(t) is countable.
Theorem 7.3 A Brownian motion is a Markov Process.
Proof: Working out the conditional moment generating function will be
suﬃcient. ♦
In applications, we often want to know the properties of a process following the occurrence of some event. A typical example is the behaviour of a simple random walk after its return to 0. Since the return time is itself random, we must be careful when claiming that the simple random walk following that moment will have the same properties as a simple random walk starting from 0.
If we recall, this claim turns out to be true. Regardless, we must be prepared to prove that this is indeed the case. As mentioned before, Brownian motion in many respects generalizes the simple random walk, and they share many similar properties. One of them is the strong Markov property.
As a preparation, we reintroduce the concept of stopping times.
Deﬁnition 7.2
Let {X(t), t ≥ 0} be a stochastic process, and F_t = σ{X(s) : s ≤ t}. A random time T is called a stopping time for {X(t), t ≥ 0} if {T ≤ t} ∈ F_t for all t > 0. ♦
If we strip away the jargon of σ-fields, a stopping time T is a random quantity such that after we have observed the stochastic process up to and including time t, we can tell whether T ≤ t or not.
Let T be the time when a Brownian motion first exceeds the value 1. Then T is a stopping time: by examining the values of B(s) for all s less than or equal to t, deciding whether T ≤ t is a simple matter.
Let T be the time at which the stock price is about to drop by 50% over the next unit of time. If T were a stopping time, we could make a lot of money; here one really has to be able to see into the future. We do not need deep mathematics to see that T is not a stopping time.
Another preparation: for each nonrandom time t, F_t is the σ-field generated by the random variables X(s) for all s ≤ t. If T is a stopping time, we hope to be able to use it as if it were a nonrandom time. We define

F_T = {A : A ∈ F, A ∩ {T ≤ t} ∈ F_t for all t}.
Finally, let us return to the issue of strong Markov property.
Theorem 7.4 Strong Markov Property.
Let T be a stopping time associated with a Brownian motion B(t). Then for any t > 0,

P{B(T + t) ≤ y | F_T} = P{B(T + t) ≤ y | B(T)}

for all y. ♦
We cannot afford to spend more time proving this result. Let us tentatively accept it as a fact and see what the consequences are.
Define B̂(t) = B(T + t) − B(T). When T is a constant, it is easily verified that B̂(t) is also a Brownian motion. When T is a stopping time, this remains true thanks to the strong Markov property.
7.4 Exit Times and Hitting Times
Let T(x) = inf{t > 0 : B(t) = x} be a stopping time for Brownian motion
B(t). It is seen that T(x) is the ﬁrst moment when the Brownian motion
hits target x.
Assume B(0) = x and a < x < b. Define T = min{T(a), T(b)}. Hence T is the time when the Brownian motion escapes the strip formed by the two horizontal lines at heights a and b. The question is: will this occur in finite time? If so, what is the average time it takes?
The first question is answered by computing P(T < ∞), and the second is answered by computing E{T} under the assumption that the first answer is affirmative. One may link this problem with the simple random walk and predict the outcome. The probability of first passage of "1" for a symmetric simple random walk is 1. Since the normal distribution is symmetric, the simple random walk result suggests that P(T < ∞) = 1. The analogous random walk results also suggest that the exit time from a two-sided interval has finite expectation. The real answers are given as follows.
Theorem 7.5
The hitting time T defined above has the following properties, under the assumption that a < x < b:
(a) P(T < ∞ | B(0) = x) = 1;
(b) E{T | B(0) = x} < ∞.
♦
Proof: It is seen that the event {T > 1} implies a < B(t) < b for all t ∈ [0, 1]. Hence,

P(T > 1 | B(0) = x) ≤ P{a < B(1) < b | B(0) = x} = Φ(b − x) − Φ(a − x).

As long as a and b are finite,

α = sup_x {Φ(b − x) − Φ(a − x)} < 1.
Next, for any positive integer n,

p_n = P{T > n | B(0) = x}
    = P{T > n | T > n − 1, B(0) = x} p_{n−1}
    = P{a < B(s) < b for n − 1 ≤ s ≤ n | a < B(s) < b for 0 ≤ s ≤ n − 1, B(0) = x} p_{n−1}
    = P{a < B(s) < b for n − 1 ≤ s ≤ n | a < B(n − 1) < b} p_{n−1}
    = P{a < B(s) < b for 0 ≤ s ≤ 1 | a < B(0) < b} p_{n−1}
    ≤ α p_{n−1}.

Applying this relationship repeatedly, we get p_n ≤ α^n. Thus, we have

P(T < ∞ | B(0) = x) = 1 − lim_{n→∞} p_n = 1.

Further,

E(T | B(0) = x) ≤ Σ_{n=0}^∞ P{T > n | B(0) = x} < ∞.

♦
I worry slightly that I did not really make use of the strong Markov property, but only the Markov property itself.
The following results are obvious:

P{T(b) < ∞ | B(0) = a} = 1;  P{T(a) < ∞ | B(0) = b} = 1.
7.5 Maximum and Minimum of Brownian Motion
Deﬁne, for a given Brownian motion,
M(t) = max{B(s) : 0 ≤ s ≤ t} and m(t) = min{B(s) : 0 ≤ s ≤ t}.
The significance of these two quantities is obvious.
Again, the following result resembles a result for the simple random walk, and I believe that a proof using some limiting approach is possible.
Theorem 7.6
For any m > 0,

P{M(t) ≥ m | B(0) = 0} = 2P{B(t) ≥ m | B(0) = 0}.
Proof: We will omit the condition that B(0) = 0. Also, we use T(m) as the stopping time when B(t) first hits m.
First, for any m, we have

P{M(t) ≥ m} = P{M(t) ≥ m, B(t) ≥ m} + P{M(t) ≥ m, B(t) < m}.

Note that B(T(m)) = m, as the sample paths are continuous almost surely, and {M(t) ≥ m} is equivalent to {T(m) ≤ t}. Hence

P{M(t) ≥ m, B(t) < m}
  = P{T(m) ≤ t, B(t) < B(T(m))}
  = P{B(t) < B(T(m)) | T(m) ≤ t} P{T(m) ≤ t}
  = P{B(t) > B(T(m)) | T(m) ≤ t} P{T(m) ≤ t}   (strong Markov property and reflection principle)
  = P{M(t) ≥ m, B(t) ≥ m}.

Applying this back, we get

P{M(t) ≥ m} = 2P{M(t) ≥ m, B(t) ≥ m}.

Since B(t) ≥ m implies M(t) ≥ m, the above result becomes

P{M(t) ≥ m} = 2P{B(t) ≥ m}.

This completes the proof. ♦
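A simulation sketch of Theorem 7.6 (on a discretised path, so the running maximum, and hence the left side, is slightly underestimated):

```python
import math
import random

random.seed(9)

def normal_cdf(x):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

m, t = 1.0, 1.0
n_steps, reps = 500, 8_000
dt = t / n_steps
hits = 0
for _ in range(reps):
    b = 0.0
    running_max = 0.0
    for _ in range(n_steps):
        b += random.gauss(0.0, math.sqrt(dt))
        if b > running_max:
            running_max = b
    if running_max >= m:
        hits += 1

p_max = hits / reps
p_reflect = 2.0 * (1.0 - normal_cdf(m / math.sqrt(t)))  # 2 P{B(t) >= m}
print(f"P(M(t) >= m) ~= {p_max:.3f}  vs  2 P(B(t) >= m) = {p_reflect:.3f}")
```

The grid only samples the path at 500 points, so the estimate sits a little below the exact value 2{1 − Φ(1)} ≈ 0.317; refining the grid shrinks this bias.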
This result can be used to ﬁnd the distribution function of T(x) under
the assumption that B(0) = 0.
Theorem 7.7
The density function of T(x) is given by

f(t) = {x/√(2πt³)} exp{−x²/(2t)},  t ≥ 0.
Proof: It is done by using the equivalence between T(x) ≤ t and M(t) ≥ x.
♦
We call T(x) the ﬁrst passage time.
The reﬂection idea is formally stated as the next theorem.
Theorem 7.8
Let T be a stopping time. Define B̂(t) = B(t) for t ≤ T and B̂(t) = 2B(T) − B(t) for t > T. Then B̂(t) is also a Brownian motion. ♦
The proof is not hard, and we should draw a picture to understand its
meaning.
7.6 Zeros of Brownian Motion and Arcsine Law
Assume B(0) = 0. How long does it take for the motion to come back to 0?
This time, the conclusion is very diﬀerent from that of simple random walk.
Theorem 7.9
The probability that B(t) has at least one zero in the time interval (a, b) is given by

(2/π) arccos √(a/b).
Proof: Let E(a, b) be the event that B(t) = 0 for at least one t ∈ (a, b). Then

P{E(a, b)} = E[P{E(a, b) | B(a)}].

For any x > 0,

P{E(a, b) | B(a) = −x} = P{T(x) < b − a | B(0) = 0} = P{T(x) < b − a}

by the Markov property and the symmetry of Brownian motion.
Hence,

E[P{E(a, b) | B(a)}] = ∫_{−∞}^∞ P{E(a, b) | B(a) = x} f_a(x) dx
  = 2 ∫_0^∞ P{E(a, b) | B(a) = −x} f_a(x) dx
  = 2 ∫_0^∞ P{T(x) < b − a} f_a(x) dx,

where f_a(x) is the density function of B(a), known to be normal with mean 0 and variance a. The density function of T(x) was given earlier. Substituting the two expressions in and finishing the integration, we get the result. ♦
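A simulation sketch of Theorem 7.9 (zeros are detected through sign changes on a grid, which undercounts slightly):

```python
import math
import random

random.seed(10)

a, b = 0.25, 1.0
n_steps, reps = 1_000, 8_000
dt = b / n_steps
k_a = int(round(a / dt))   # grid index of time a
hits = 0
for _ in range(reps):
    x = 0.0
    prev = None
    found = False
    for k in range(1, n_steps + 1):
        x += random.gauss(0.0, math.sqrt(dt))
        # Record a sign change occurring strictly inside (a, b).
        if k > k_a and prev is not None and prev * x <= 0.0:
            found = True
            break
        prev = x
    if found:
        hits += 1

p_hat = hits / reps
p_theory = (2.0 / math.pi) * math.acos(math.sqrt(a / b))
print(f"P(zero in (a, b)) ~= {p_hat:.3f}  vs  (2/pi) arccos(sqrt(a/b)) = {p_theory:.3f}")
```

With a = 1/4 and b = 1 the theoretical value is (2/π) arccos(1/2) = 2/3; the grid estimate sits slightly below it because zeros without a sign change between adjacent grid points go undetected.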
When a → 0, so that a/b → 0, this probability approaches 1. Hence, given B(0) = 0, the probability that B(t) = 0 for some t in any small neighbourhood of t = 0 (not including 0 itself) is 1. This conclusion can be strengthened further: there are infinitely many zeros in any neighbourhood of t = 0, no matter how small the neighbourhood is. This result further illustrates that the sample paths of Brownian motion, though continuous, are nowhere smooth.
The famous Arcsine law now follows.
Theorem 7.10
The probability that Brownian motion has no zeros in the time interval (a, b) is given by (2/π) arcsin √(a/b). ♦
There is a similar result for simple random walk.
7.7 Diﬀusion Processes
To be continued.
Mathematically. Mathematically. The empty set φ ∈ F. Every element ω of Ω is called a sample point. and a probability measure. The set of all possible outcomes is called Sample Space. a collection of events. 1 . . ♦ Let us give a few examples. A2 . ∈ F. the sample space is merely an arbitrary set. then its complement Ac ∈ F. then ∪∞ ∈ F (closed under countable union). If A1 . Assume an experiment is to be done. . 2.1 Probability Space A probability space consists of three parts: sample space. i=1 3. we settle on a collection of subsets of Ω. This is not always possible when the probability measure is required to have certain properties. . A probability measure intends to be a function deﬁned for all subsets of Ω.1 A collection of sets F is a σﬁeld (algebra) if it satisﬁes 1. Deﬁnition 1.Chapter 1 Review and More 1. There is no need of a corresponding experiment. If A ∈ F.
P (Ω) = 1. When the sample space Ω contains ﬁnite number of elements. 1] and F is the σﬁeld that contains all possible subsets. 3. A. The simplest σﬁeld is {φ. REVIEW AND MORE 2. Deﬁne P (A) as the ratio of sizes of A and of Ω. P ). Ω}. 2. we must have P (φ) = 0. A simplest nontrivial σﬁeld is {φ. i = 1. CHAPTER 1. Ac . Axioms of Probability Measure Given a sample space Ω and a suitable σﬁeld F. the above deﬁnition does not rely on any hypothetical experiment. For example. 2. A most exhaustive σﬁeld is F consists of all subsets of Ω. if we require that the probability of all closed intervals equal their lengthes. The above axioms lead to restrictions on F. the collection of all subsets forms a σﬁeld F. then it is impossible to ﬁnd such a probability measure satisfying Axiom 3. A probability measure P is a mapping from F to R (set of real numbers) such that: 1. The axioms for a probability space imply that the probability measure has many other properties not explicitly stated as axioms. 0 ≤ P (A) ≤ 1 for all A ∈ F. F. Mathematically. . then the resulting sets of all commonly known operations between A and B are members of the same σﬁeld. In this case. since P (φ ∪ φ) = P (φ) + P (φ). 3. For example.1 1. . then it is a probability measure. which satisfy i=1 i=1 Ai Aj = φ whenever i = j. ♦ It can be shown that if A and B are two sets in a σﬁeld.2 Example 1. This is the classical deﬁnition of probability. Ω}. P (∪∞ Ai ) = ∞ P (Ai ) for any Ai ∈ F. . suppose Ω = [0. . A probability space is given by (Ω.
P (∪n Ai ) = i=1 P (Ai ) − i1 <i2 3 P (Ai1 Ai2 ) + · · · . F. +(−1)n+1 P (∩n Ai ). Notice that B1 . Let (Ω. 3.1 Continuity of the probability measure. . and A = ∪∞ Bi . For any two events A1 and A2 . i=1 Lemma 1. . B2 .1.1. n = 1. If A1 ⊂ A2 ⊂ A3 ⊂ · · · is an increasing sequence of events and A = lim ∪n Ai = lim An . n→∞ Proof: Let Bn = An − An−1 . ∞ P (A) = i=1 P (Bi ) n = = This completes the proof. . In general. i=1 i=1 and P (Bn ) = P (An ) − P (An−1 ). ♦ . Hence. . P (Ac ) = 1 − P (A). we have P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 A2 ). P ) be a probability space. i=1 n→∞ n→∞ then P (A) = lim P (An ). Using countable additivity. PROBABILITY SPACE Axioms 2 and 3 imply that 1 = P (Ω) = P (A ∪ Ac ) = P (A) + P (Ac ). 2. lim n→∞ n→∞ P (Bi ) i=1 lim P (An ). . are mutually exclusive. . Then An = ∪n Bi . with A0 = φ.
2. then the probability density function (p. there are many diﬀerent ways to characterise the distribution of a random variable. The moment generating function is deﬁned as M (t) = E{exp(tX)}. f (x) = dF (x)/dx. At the same time.f. It is known that a c.}. 3.f.d. the probability mass function of X is f (xi ) = P (X = xi ) = F (xj ) − F (xj −) for i = 1.1 A random variable is a map X : Ω → R such that {X ≤ x} = {ω ∈ Ω : X(ω) ≤ x} ∈ F for all x ∈ R. . . . If X is absolutely continuous. . x2 .f.) is f (x) ≥ 0 such that x F (x) = −∞ f (t)dt. taking values on {x1 .d. where F (a−) = limx→a. . Here is an incomplete list: 1.d.f. REVIEW AND MORE 1. For almost all x. The cumulative distribution function (c. F (∞) = 1.. ♦ Notice that {X ≤ x} is more often the notation of an event than the notation for the outcome of an experiment. right continuous function such that F (−∞) = 0. The random behaviour of X is largely determined by its distribution through c. is a nondecreasing.) of a random variable X is FX (t) = P (X ≤ t).2 Random Variable Deﬁnition 1.4 CHAPTER 1. 2.d. .x<a F (x). . If X is discrete.
S(x) dx S(x) = exp{− 0 λ(t)dt}.1. 5. The probability generating function is useful when the random variable is nonnegative integer valued. x f (x) d = − log S(x). The characteristic function of X is deﬁned as φ(t) = E{exp(itX)} 5 where i is such that i2 = −1. When X is absolutely continuous. The hazard function of X is λ(x) = lim P {x ≤ X < x + ∆X ≥ x} . ∆→0+ ∆ It is also called “instantaneous failure rate” as it is the rate at which the individual fails in the next instance given it has survived so far. for x ≥ 0. In survival analysis or reliability. 7.2 . Let us examine a few examples to illustrate the hazard function and its implication.2. Example 1. the moment generating function is simpler as it is not built on the concepts of complex numbers. RANDOM VARIABLE 4. The advantage of the characteristic function is that it exists for all X. 6. we often use survival function S(x) = P (X > x) = 1 − F (x) where we assume P (X ≥ 0) = 1. λ(x) = Hence. Otherwise.
Example 1. The longer the item has survived. (c) When α < 1. F. The σﬁeld of a constant random variable. the sample points satisfying X ≤ − 2 form a set which belongs to F. we becomes vulnerable to diseases and the death rate increases. and so on. we obtain a collection of events which will be denoted as σ(X). REVIEW AND MORE (1) When X has exponential distribution with density function f (x) = λ exp(−λx).6 CHAPTER 1. The σﬁeld of an indicator random variable. (3) Human mortality can be described by a curve with bathtub shape. 1. After certain age. The death rate of new borns are high. . the hazard increases with time. The hazard function is given by λ(x) = λα(λx)α−1 . Constant hazard implies the item does not age. then it stabilizes. It can be easily argued that σ(X) is the smallest sub σﬁeld generated by X. (b) When α > 1. Hence the item ages. (2) When X has Weibull distribution with density function f (x) = λα(λx)α−1 exp{−(λx)α } for x ≥ 0. P ). It ages negatively.3 More σﬁelds and Related Concepts Consider a random variable X deﬁned on the probability space (Ω. The σﬁeld of a random variable takes only two possible values. it reduces to exponential distribution. This claim extends to the union of these two events. 3. the more durable this item becomes. 2. the intersection of these two events.2 is also a set belongs to F.3 1. (a) When α = 1. Similarly X ≥ 1. √ √ For any real number such as − 2. Putting all events induced by X as above together. the hazard decreases with time. it has constant hazard rate λ.
Because we have to work with conditional expectation more extensively later, let us review it carefully. If X and Y are two random variables, we may define the conditional cumulative distribution function of X given Y = y by

P(X ≤ x | Y = y) = P(X ≤ x, Y = y) / P(Y = y).

This definition works if P(Y = y) > 0. Otherwise, if X and Y are jointly absolutely continuous, we make use of their joint density to compute the conditional density function. One purpose of introducing the conditional density function is for the sake of computing conditional expectation. At the same time, we hope to define conditional expectation for any pair of random variables.

If Y is an indicator random variable IA for some event A such that 0 < P(A) < 1, then E{X | Y = 0} and E{X | Y = 1} are both well defined as above; the corresponding values are the average sizes (expected values) of X over the ranges A^c and A. Thus, the conditional expectation of X given Y is the expected size of X over the various ranges of Ω partitioned according to the size of Y. When Y is as simple as an indicator random variable, this partition is also simple and we can easily work out these numbers conceptually. When Y is an ordinary random variable, the partition of Ω is σ(Y). The problem is: σ(Y) contains so many events in general that there is no way to present the conditional expectation of X given Y by enumerating them all; also, these events are not mutually exclusive.

The solution to this difficulty in mathematics is to define Z = E{X | Y} as a measurable function of Y such that

E{Z IA} = E{X IA} for any A ∈ σ(Y).

We may realize that this definition does not provide any concrete means for us to compute E{X | Y}. In mathematics, this definition has to be backed up by an existence theorem guaranteeing that such a function exists. Further, this definition is unique only up to some zero-probability event: when another random variable differs from Z only by a zero-probability event, that random variable is also a conditional expectation of X given Y.
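A toy discrete example may make the defining property E{Z IA} = E{X IA} concrete. This is an illustrative sketch, not part of the notes; the sample space and the two random variables are arbitrary choices.

```python
# Toy probability space: Omega = {0,...,5} with equal mass 1/6.
omega = list(range(6))
prob = {w: 1 / 6 for w in omega}
X = {w: w ** 2 for w in omega}   # some random variable X
Y = {w: w % 2 for w in omega}    # Y partitions Omega into evens and odds

# Z = E{X | Y}: average X over each cell of the partition induced by Y.
def cond_exp(w):
    cell = [v for v in omega if Y[v] == Y[w]]
    return sum(X[v] * prob[v] for v in cell) / sum(prob[v] for v in cell)

Z = {w: cond_exp(w) for w in omega}

# Defining property: E{Z I_A} = E{X I_A} for every A in sigma(Y).
# Here sigma(Y) = {empty set, evens, odds, Omega}.
sigma_Y = [[], [w for w in omega if Y[w] == 0], [w for w in omega if Y[w] == 1], omega]
checks = [(sum(Z[w] * prob[w] for w in A), sum(X[w] * prob[w] for w in A))
          for A in sigma_Y]
```

Note that Z is constant on each cell of the partition, which is exactly what "a measurable function of Y" amounts to here.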
Since Z is a function of Y, it is also a random variable. A remark here is: the conditional expectation is well defined only if the expectation of X exists. Note that the definition of the conditional expectation relies on the σ-field generated by Y only. More rigorously, the concept of conditional expectation under measure theory is in fact built on σ-fields: if G is a σ-field, we simply define Z = E{X | G} as a G-measurable function such that E{Z IA} = E{X IA} for all A ∈ G.

Let us go over a few concepts in elementary probability theory and see what they mean under measure theory. Recall that if X and Y are independent, then we have

E{g1(X) g2(Y)} = E{g1(X)} E{g2(Y)}

for any functions g1 and g2, provided these expectations exist; to be precise, we need to add a cosmetic condition that these two functions are measurable. Under measure theory, the independence of two random variables is built on the independence of their σ-fields. It can easily be seen that the σ-field generated by g1(X) is a sub-σ-field of that of X. One may then see why g1(X) is independent of g2(Y).

Example 1.4 (Some properties.) By letting A = Ω, we get E[E{X | Y}] = E{X}. In addition: 1. If X is G-measurable, then E{XY | G} = X E{Y | G}. 2. If G1 ⊂ G2, then E[E{X | G2} | G1] = E{X | G1}. In particular, you may remember a famous formula:

Var(X) = Var[E{X | Y}] + E[Var{X | Y}].
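The famous variance decomposition can be verified on any small joint distribution. The sketch below is not from the notes; the joint pmf is an arbitrary choice.

```python
# Joint pmf of (X, Y) on a small grid; values chosen arbitrarily, summing to 1.
pmf = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.1,
       (0, 1): 0.3, (1, 1): 0.1, (2, 1): 0.2}

def E(f):
    return sum(f(x, y) * p for (x, y), p in pmf.items())

EX = E(lambda x, y: x)
VarX = E(lambda x, y: (x - EX) ** 2)

# Marginal of Y, then E{X|Y} and Var{X|Y} cell by cell over the partition by Y.
py = {}
for (x, y), p in pmf.items():
    py[y] = py.get(y, 0.0) + p
cond_mean = {y0: sum(x * p for (x, y), p in pmf.items() if y == y0) / py[y0]
             for y0 in py}
cond_var = {y0: sum((x - cond_mean[y0]) ** 2 * p
                    for (x, y), p in pmf.items() if y == y0) / py[y0]
            for y0 in py}

var_of_cond = sum((cond_mean[y0] - EX) ** 2 * py[y0] for y0 in py)  # Var[E{X|Y}]
mean_of_cond = sum(cond_var[y0] * py[y0] for y0 in py)              # E[Var{X|Y}]
```

The two pieces add up to Var(X) exactly, for any joint pmf one cares to try.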
3. If G and σ(X) are independent of each other, then E{X | G} = E{X}. Most well-known elementary properties of the mathematical expectation remain valid.

1.4 Multivariate Normal Distribution

A group of random variables X has a joint normal distribution if their joint density function is of the form

C exp{−(1/2)(x A xτ + 2 b xτ)},

where A is positive definite, x and b are vectors, and C is a constant such that the density function has total mass 1. Being positive definite implies that there exists an orthogonal decomposition of A such that A = B Λ Bτ, where Λ is a diagonal matrix with all elements positive and B Bτ = I, the identity matrix. With this knowledge, it is seen that

∫ exp(−(1/2) x A xτ) dx = ∫ exp(−(1/2) y Λ yτ) dy = (2π)^(n/2) / |A|^(1/2).

Hence when b = 0, the density function has the form

f(x) = [(2π)^(−n) |A|]^(1/2) exp(−(1/2) x A xτ).

Let X be a random vector with the density function given above, and let Y = X + µ. Then the density function of Y is given by

{(2π)^(−n) |A|}^(1/2) exp{−(1/2)(y − µ) A (y − µ)τ}.

It is more convenient to use V = A^(−1) in most applications. Thus, we have the definition as follows.
Definition 1.1 (Multivariate normal distribution.) We say a random vector X = (X1, ..., Xn) has a multivariate normal distribution, written as N(µ, V), if its joint density function is

f(x) = {(2π)^n |V|}^(−1/2) exp{−(1/2)(x − µ) V^(−1) (x − µ)τ},

where V is a positive definite matrix. More generally, we say X = (X1, ..., Xn) has a multivariate normal distribution when X aτ is a normally distributed random variable for all a ∈ R^n. A more rigorous and my favoured definition is: if X can be written in the form A Y + b for some (non-random) matrix A and vector b, where Y is a vector of iid standard normally distributed random variables, then X is a multivariate normally distributed random vector. ♦

If X is multivariate normal N(µ, V), then E(X) = µ and Var(X) = V.

We now list a number of well-known facts about multivariate normal distributions. If you find that you are not familiar with a large number of them, it probably means some catch-up work should be done.

1. Suppose X has a normal distribution with mean µ and variance σ². Its characteristic function is

φ(t) = exp{iµt − σ²t²/2},

and its moment generating function is

M(t) = exp{µt + σ²t²/2}.

If X has a standard normal distribution, then all its odd moments are zero, and its even moments are EX^(2r) = (2r − 1)(2r − 3) ··· 3 · 1.

2. If X (length n) is N(µ, V) and D is an n × m matrix of rank m ≤ n, then Y = XD is N(µD, Dτ V D).
3. Suppose that X1, X2, ..., Xn are a set of independent and identically distributed standard normal random variables. The sample mean X̄n = n^(−1) Σ_{i=1}^n Xi and the sample variance S²n = (n − 1)^(−1) Σ_{i=1}^n (Xi − X̄)² are independent; X̄n has a normal distribution and (n − 1)S²n has a chi-square distribution with n − 1 degrees of freedom.

4. A set of random variables has a multivariate normal distribution if and only if all their linear combinations are normally distributed.

5. Suppose Y1, Y2 are two independent random variables such that Y1 + Y2 is normally distributed. Then Y1 and Y2 are jointly normally distributed. That is, they are multivariate normal.

6. Let X be a vector having a multivariate normal distribution with mean µ and covariance matrix Σ, and let A be a nonnegative definite symmetric matrix. Then X A Xτ has a chi-squared distribution when AΣAΣA = AΣA.

7. Let Φ(x) and φ(x) be the cumulative distribution function and the density function of the standard normal distribution. It is known that

(1/x − 1/x³) φ(x) ≤ 1 − Φ(x) ≤ (1/x) φ(x)

for all positive x. Thus, when x is large, these provide very accurate bounds.

8. Let X1, ..., Xn be i.i.d. standard normal random variables and let X(n) = max{Xi, i = 1, ..., n}. Then X(n) = Op(√(log n)).

1.5 Summary

We did not intend to give you a full account of the probability theory learned in courses preceding this one. Neither do we try to lead you into the world of measure-theory-based, advanced, and widely-regarded-as-useless rigorous probability theory. Yet we did introduce the concept of probability space, and why a dose of σ-field is needed in doing so.
In the later half of this course, we need the σ-field-based definitions of independence and conditional expectation. It is also very important to get familiar with multivariate normal random variables. If full rigor worries you, you may be happy to know that the pressure is not here yet. These preparations are not sufficient, but will serve as a starting point.
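Before leaving this chapter, facts 1 and 7 from the list above lend themselves to a quick numerical sanity check. This sketch is not from the notes; the integration grid and tolerances are ad hoc choices.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi_bar(x):
    """Tail probability 1 - Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Fact 1: even moments E X^{2r} = (2r-1)(2r-3)...3*1, checked by a midpoint-rule integral.
def normal_moment(power, grid=200000, lim=12.0):
    h = 2 * lim / grid
    return sum((-lim + (i + 0.5) * h) ** power * phi(-lim + (i + 0.5) * h) * h
               for i in range(grid))

def odd_double_factorial(r):
    out = 1
    for k in range(1, 2 * r, 2):
        out *= k
    return out

# Fact 7: (1/x - 1/x^3) phi(x) <= 1 - Phi(x) <= phi(x)/x for all x > 0.
bounds_hold = all(
    (1 / x - 1 / x ** 3) * phi(x) <= Phi_bar(x) <= phi(x) / x
    for x in [0.5, 1.0, 2.0, 4.0, 6.0]
)
```

The moment integrals reproduce E X⁴ = 3 and E X⁶ = 15, and the tail bounds hold at every grid point; at x = 4 the two bounds already agree to several digits.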
Chapter 2

Simple Random Walk

The simple random walk is an easy object in the family of stochastic processes, yet it shares many properties with more complex stochastic processes. Understanding the simple random walk helps us to understand abstract stochastic processes.

The building block of a simple random walk is a sequence of iid random variables X1, X2, ..., Xn, ... such that

P(X1 = 1) = p, P(X1 = −1) = 1 − p = q.

We assume that 0 < p < 1; otherwise the simple random walk becomes trivial. We start with S0 = a and define Sn+1 = Sn + Xn+1 for n = 0, 1, 2, .... This mimics the situation of starting gambling with an initial capital of $a, with a stake of $1 on each trial; the probability of winning a trial is p and the probability of losing is q = 1 − p, and trials are assumed independent. When p = q = 1/2, the random walk is symmetric.

Properties: A simple random walk is

(i) time homogeneous: P{Sn+m = j | Sm = a} = P{Sn = j | S0 = a};

(ii) spatially homogeneous: P{Sn+m = j + b | Sm = a + b} = P{Sn = j | S0 = a};
(iii) Markov: P{Sn+m = j | S0 = j0, S1 = j1, ..., Sm = jm} = P{Sn+m = j | Sm = jm}. ♦

Example 2.1 (Gambler's ruin.) Assume that we start with initial capital S0 = a, where N ≥ a ≥ 0, and that the game stops as soon as Sn = 0 or Sn = N for some n. Under the conditions of the simple random walk, what is the probability that the game stops at Sn = N?

Solution: A brute-force solution for this problem is almost impossible. The key is to view this probability as a function of the initial capital a. By establishing a relationship between the probabilities for neighbouring values of a, we will get a difference equation which is not hard to solve. ♦

2.1 Counting sample paths

If we plot (i, Si) for i = 0, 1, ..., n, we obtain a path connecting (0, S0) and (n, Sn). We call it a sample path. There might be many possible sample paths leading from (0, a) to (n, b).

Property 2.1 The number of paths from (0, a) to (n, b) is

Nn(a, b) = C(n, (n + b − a)/2)

when (n + b − a)/2 is an integer between 0 and n, and 0 otherwise. ♦
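The path-counting formula in Property 2.1 can be confirmed by brute force for small n. This is an illustrative sketch, not part of the notes.

```python
from itertools import product
from math import comb

def count_paths(n, a, b):
    """Brute-force count of n-step (+1/-1) paths from (0, a) to (n, b)."""
    return sum(1 for steps in product([1, -1], repeat=n) if a + sum(steps) == b)

def formula(n, a, b):
    """N_n(a, b) = C(n, (n+b-a)/2) when that index is an integer in [0, n]."""
    k, rem = divmod(n + b - a, 2)
    return comb(n, k) if rem == 0 and 0 <= k <= n else 0

pairs = [(count_paths(n, a, b), formula(n, a, b))
         for n in range(1, 9) for a in [0, 2] for b in range(-6, 7)]
```

The enumeration and the binomial formula agree on every (n, a, b), including the parity-impossible cases where both give 0.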
Property 2.2 For a simple random walk, all sample paths from (0, a) to (n, b) have equal probability p^((n+b−a)/2) q^((n+a−b)/2) to occur. Hence, when (n + b − a)/2 is an integer between 0 and n,

P(Sn = b | S0 = a) = C(n, (n + b − a)/2) p^((n+b−a)/2) q^((n+a−b)/2). ♦

The proof is straightforward.

Example 2.2 It is seen that

u2n = P(S2n = 0 | S0 = 0) = C(2n, n) p^n q^n for n = 0, 1, ....

This is the probability that the random walk returns to 0 at trial 2n. When p = q = 1/2, we have u2n = C(2n, n)(1/2)^(2n). Using Stirling's approximation n! ≈ √(2πn)(n/e)^n, we have

u2n ≈ 1/√(nπ)

for large n. The approximation is good even for small n. In addition, Σn u2n = ∞. By the renewal theorem to be discussed, Sn = 0 is a recurrent event when p = q = 1/2. ♦

Property 2.3 (Reflection principle.) Let N⁰n(a, b) be the number of sample paths from (0, a) to (n, b) that touch or cross the x-axis. Then, when a, b > 0,

N⁰n(a, b) = Nn(−a, b).

Proof: We simply count the number of paths. The key step is to establish a one-to-one relationship between the two sets of paths, by reflecting the segment of each path before its first visit to the x-axis. ♦

The above property enables us to show the next interesting result.
Property 2.4 The number of paths from (0, 0) to (n, b), with b > 0, that do not revisit the x-axis is (b/n) Nn(0, b). ♦

Proof: Notice that such a sample path has to start with a transition from (0, 0) to (1, 1). The number of paths from (1, 1) to (n, b) which do not touch the x-axis is

Nn−1(1, b) − N⁰n−1(1, b) = Nn−1(1, b) − Nn−1(−1, b)
= C(n − 1, (n − b)/2) − C(n − 1, (n − b − 2)/2)
= (b/n) Nn(0, b). ♦

Theorem 2.1 (Ballot Theorem.) Assume that S0 = 0 and b ≠ 0. Then

(i) P(S1 S2 ··· Sn ≠ 0, Sn = b | S0 = 0) = (|b|/n) P(Sn = b | S0 = 0);

(ii) P(S1 S2 ··· Sn ≠ 0) = n^(−1) E|Sn|.

Proof: The first part can be proved by counting the number of sample paths which do not touch the x-axis, as in Property 2.4. The second part is a consequence of the first one: just sum over the probabilities in (i) for all possible values of Sn. ♦

What does this name suggest? Suppose that Michael and George are in a competition for some title. In the end, Michael wins by b votes in n casts. If the votes are counted in a random order, and A = "Michael leads throughout the count", then

P(A) = [(b/n) Nn(0, b)] / Nn(0, b) = b/n.
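Part (i) of the Ballot Theorem can be checked exhaustively for small n, and it holds for any p, not just 1/2, since every path ending at b carries the same probability. This sketch is not from the notes; p = 0.6 is an arbitrary choice.

```python
from itertools import product

def ballot_check(n, b, p=0.6):
    """Compare P(S1...Sn != 0, Sn = b) with (b/n) P(Sn = b) by enumeration."""
    q = 1 - p
    lhs = rhs = 0.0
    for steps in product([1, -1], repeat=n):
        w = steps.count(1)
        prob = p ** w * q ** (n - w)
        partial, never_zero = 0, True
        for s in steps:
            partial += s
            if partial == 0:
                never_zero = False
                break
        if never_zero and partial == b:
            lhs += prob
        if sum(steps) == b:
            rhs += prob
    return lhs, (b / n) * rhs

results = [ballot_check(n, b) for n, b in [(5, 1), (5, 3), (8, 2), (9, 3)]]
```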
Theorem 2.2 (First passage through b at trial n; the hitting time problem.) Suppose S0 = 0 and b ≠ 0. Then

fb(n) = P(S1 ≠ b, ..., Sn−1 ≠ b, Sn = b) = (|b|/n) P(Sn = b).

Proof: We assume b > 0. If we reverse the time, the sample paths in question become those which reach b in n trials without touching the x-axis after time 0. Hence the result follows from the Ballot Theorem. ♦

In general, we often encounter problems of finding the distribution of the extreme value of a stochastic process. This is an extremely hard problem, but we have a kind of answer for the simple random walk. The next problem of interest is how high Si has attained before the walk settles at Sn = b. Define Mn = max{Si, i = 1, ..., n}.

Theorem 2.3 Suppose S0 = 0. Then for r ≥ 1,

P(Mn ≥ r, Sn = b) = P(Sn = b) if b ≥ r, and (q/p)^(r−b) P(Sn = 2r − b) if b < r.

Hence

P(Mn ≥ r) = P(Sn ≥ r) + Σ_{b=−∞}^{r−1} (q/p)^(r−b) P(Sn = 2r − b)
= P(Sn = r) + Σ_{c=r+1}^{∞} [1 + (q/p)^(c−r)] P(Sn = c).

When p = q = 1/2, it becomes P(Mn ≥ r) = 2 P(Sn ≥ r + 1) + P(Sn = r).

Proof: When b ≥ r, the event Sn = b is a subset of the event Mn ≥ r; hence the first conclusion is true. When b < r, we may draw a horizontal line y = r. For each sample path belonging to {Mn ≥ r, Sn = b}, we obtain a partial mirror image: it retains the part until the sample path first touches the line y = r, and completes it with the mirror image from there. Thus, the number of such sample paths is the same as the number of sample paths from 0 to 2r − b. The corresponding probability, however, is obtained by exchanging the roles of r − b pairs of p and q. ♦
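The symmetric case of Theorem 2.3 is easy to verify exactly by enumerating all 2^n equally likely paths with rational arithmetic. This sketch is not from the notes.

```python
from itertools import product
from fractions import Fraction

def exact_dist(n):
    """Joint distribution of (M_n, S_n) for the symmetric walk, by enumeration."""
    weight = Fraction(1, 2 ** n)
    pm = {}
    for steps in product([1, -1], repeat=n):
        s, m = 0, -n  # m tracks max of S_1, ..., S_n
        for x in steps:
            s += x
            m = max(m, s)
        pm[(m, s)] = pm.get((m, s), Fraction(0)) + weight
    return pm

def check(n, r):
    """Is P(M_n >= r) = 2 P(S_n >= r+1) + P(S_n = r)?"""
    pm = exact_dist(n)
    lhs = sum(p for (m, s), p in pm.items() if m >= r)
    p_ge = sum(p for (m, s), p in pm.items() if s >= r + 1)
    p_eq = sum(p for (m, s), p in pm.items() if s == r)
    return lhs == 2 * p_ge + p_eq

ok = all(check(n, r) for n in [4, 5, 6, 7] for r in [1, 2, 3])
```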
The remaining parts of the theorem are self-illustrative. ♦

Let µb be the mean number of visits of the walk to the point b before it returns to its starting point. Suppose S0 = 0, and let Yn = I(S1 S2 ··· Sn ≠ 0, Sn = b). Then Σ_{n=1}^∞ Yn is the number of times the walk visits b before returning to 0. Since E(Yn) = fb(n) by time reversal, we get

µb = Σ_{n=1}^∞ fb(n).

Notice that this reasoning is fine even when the walk will never return to 0. When p = q = 0.5, all states are recurrent, and therefore µb = 1 for any b; otherwise, the mean number of visits is less than 1 in general.

Suppose a perfect coin is tossed until the first equalisation of the accumulated numbers of heads and tails, and the gambler receives one dollar every time the number of heads exceeds the number of tails by b. This fact results in comments that the "fair entrance fee" equals 1, independent of b. My remark: how many of us think that this is against their intuition?

Theorem 2.4 (Arc sine law for last visit to the origin.) Suppose that p = q = 1/2 and S0 = 0. The probability that the last visit to 0 up to time 2n occurred at time 2k is given by

P(S2k = 0) P(S2n−2k = 0). ♦

Proof: The probability in question is

α2n(2k) = P(S2k = 0) P(S2k+1 ··· S2n ≠ 0 | S2k = 0) = P(S2k = 0) P(S1 ··· S2n−2k ≠ 0 | S0 = 0).

We are hence asked to show that P(S1 ··· S2n−2k ≠ 0 | S0 = 0) = u2n−2k.
Indeed, by the Ballot Theorem,

P(S1 ··· S2m ≠ 0 | S0 = 0) = Σ_b P(S1 ··· S2m ≠ 0, S2m = b)
= Σ_b (|b|/(2m)) P(S2m = b)
= 2 Σ_{b=1}^m (2b/(2m)) P(S2m = 2b)
= 2 (1/2)^(2m) Σ_{b=1}^m [C(2m − 1, m + b − 1) − C(2m − 1, m + b)]
= 2 (1/2)^(2m) C(2m − 1, m) = u2m,

where the sum telescopes. Hence P(S1 ··· S2m ≠ 0) = P(S2m = 0). ♦

Since S0 = 0 is part of our assumption, we may omit the conditional part. Denote, as usual, u2n = P(S2n = 0) and

α2n(2k) = P(S2k = 0, Si ≠ 0 for i = 2k + 1, ..., 2n).

Then the theorem says α2n(2k) = u2k u2n−2k. As p = q = 1/2 is a simple case, we have an accurate approximation for u2k. It turns out that u2k ≈ (πk)^(−1/2), and so

α2n(2k) ≈ {π[k(n − k)]}^(−1/2).

If we let T2n be the time of the last visit to 0 up to time 2n, then it follows that

P(T2n ≤ 2xn) ≈ Σ_{k ≤ xn} {π[k(n − k)]}^(−1/2) ≈ ∫₀ˣ du / (π[u(1 − u)]^(1/2)) = (2/π) arcsin(√x).

That is, the limiting distribution function of T2n/(2n) is (2/π) arcsin(√x).
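The exact identity α2n(2k) = u2k u2n−2k can be verified by enumerating all paths for a small n with exact rational arithmetic. This sketch is not from the notes.

```python
from itertools import product
from fractions import Fraction
from math import comb

def last_visit_dist(n):
    """Distribution of the last visit to 0 up to time 2n, symmetric walk."""
    weight = Fraction(1, 2 ** (2 * n))
    dist = {2 * k: Fraction(0) for k in range(n + 1)}
    for steps in product([1, -1], repeat=2 * n):
        s, last = 0, 0
        for i, x in enumerate(steps, start=1):
            s += x
            if s == 0:
                last = i
        dist[last] += weight
    return dist

def u(two_k):
    """u_{2k} = C(2k, k) / 2^{2k}, exactly."""
    return Fraction(comb(two_k, two_k // 2), 2 ** two_k)

n = 4
dist = last_visit_dist(n)
theory = {2 * k: u(2 * k) * u(2 * n - 2 * k) for k in range(n + 1)}
```

The two dictionaries agree exactly; in particular the k = 0 entry recovers P(S1 ··· S2n ≠ 0) = u2n.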
(1) Since p = q = 1/2, one may think the balance of heads and tails should occur very often. Is it true? After 2n tosses with n large, the chance that the walk never touches 0 after trial n is 50%, which is surprisingly large. Thus, if you gamble, you may win all the time, or lose all the time, even though the game is perfectly fair in the sense of probability theory.

(2) Consider the last time before trial 2n when the simple random walk touches 0. Its distribution is symmetric about the midpoint n, and it is either very small or very large: it is more likely to be at the beginning and near the end. Why is it more likely at the beginning? Since the walk started at 0, it is likely to touch 0 again soon; once it has wandered away from 0, touching 0 becomes less and less likely. Why is it also likely to occur near the end? If for some reason the simple random walk returned to 0 at some point, it becomes more likely to visit 0 again in the near future, and this pushes the most recent visit closer and closer to the end.

(3) If we count the number of k such that Sk > 0, what kind of distribution does it have? Should it be more likely to be close to n? It turns out that it is either very small or very large.

Theorem 2.5 (Arc sine law for sojourn times.) Suppose that p = 1/2 and S0 = 0. The probability that the walk spends exactly 2k intervals of time, up to time 2n, to the right of the origin equals u2k u2n−2k. ♦

Let us see how this result establishes (3).

Proof: Call this probability β2n(2k). We are asked to show that β2n(2k) = α2n(2k). We use mathematical induction. The first step is to consider the case when k = n = m. In our previous proofs, we have shown that for the symmetric random walk

u2m = P(S1 ··· S2m ≠ 0).

Since these sample paths can belong to one of two possible groups, always above 0 or always below 0, we have

u2m = 2 P(S1 > 0, S2 > 0, ..., S2m > 0).
On the other hand,

P(S1 > 0, S2 > 0, ..., S2m > 0) = P(S1 = 1, S2 ≥ 1, ..., S2m ≥ 1)
= P(S1 = 1 | S0 = 0) P(S2 − 1 ≥ 0, ..., S2m − 1 ≥ 0 | S1 − 1 = 0)
= (1/2) P(S1 ≥ 0, S2 ≥ 0, ..., S2m−1 ≥ 0 | S0 = 0)
= (1/2) P(S1 ≥ 0, S2 ≥ 0, ..., S2m ≥ 0 | S0 = 0)
= (1/2) β2m(2m),

where the second-to-last equality comes from the fact that S2m−1 ≥ 0 implies S2m ≥ 0, as it is impossible for S2m−1 to equal 0. Consequently, we have shown that α2m(2m) = u2m = β2m(2m).

Let us now make an induction assumption that α2n(2k) = β2n(2k) for all n with k = 0, 1, ..., m − 1 and k = n, n − 1, ..., n − (m − 1). Note the reason for the validity of the second set of k values is symmetry. Our induction task is to show that the same is true when k = m.

Using an idea similar to the renewal equation, since u0 = 1, we have

u2m = Σ_{r=1}^m u2m−2r f2r.

At the same time, we also have

β2n(2m) = (1/2) Σ_{r=1}^m f2r β2n−2r(2m − 2r) + (1/2) Σ_{r=1}^{n−m} f2r β2n−2r(2m),

since the walk will start with either a positive move or a negative move, and it will touch 0 again at some point 2r, with another 2n − 2r trials to go (it could happen that 2r = 2n).
So, with the renewal equation and the induction assumption,

β2n(2m) = (1/2) Σ_{r=1}^m f2r u2m−2r u2n−2m + (1/2) Σ_{r=1}^{n−m} f2r u2m u2n−2m−2r
= (1/2) [u2n−2m u2m + u2m u2n−2m]
= u2m u2n−2m.

The induction is hence completed. ♦

2.2 Summary

What have we learned in this chapter? One observation is that all sample paths starting and ending at the same locations have equal probability to occur. Making use of this fact, by counting sample paths, some interesting properties of the simple random walk are revealed. Let us recall some detail.

The reflection principle allows us to count the number of paths going from one point to another without touching 0. This result is used to establish the Ballot Theorem. We may not believe that the hitting time problem is a big deal; yet when it is linked with stock prices, you may change your mind. If you find some neat estimates of such probabilities for very general stochastic processes, you will be famous.

Similarly, by counting sample paths, it is possible to provide an accurate enough approximation to the probability of entering 0. With the renewal equation, we provide such a result for the simple random walk. One such property is that the simple random walk is null recurrent when p = q and transient otherwise. In old days, there were enough nerds who believed that the arc sine law is intriguing; it also has its twin in Brownian motion, to be discussed. Please do not go away and stay tuned.
Chapter 3

Generating Functions and Their Applications

The idea of generating functions is to transform the function under investigation from one functional space to another functional space. The new functional space might be more convenient to work with.

If X is a nonnegative integer valued random variable, then we call GX(s) = E(s^X) the probability generating function of X. Advantages? It can be seen that the moments of X equal derivatives of GX(s) at s = 1. When X and Y are independent, the probability generating function of X + Y is

GX+Y(s) = GX(s) GY(s).

Theorem 3.1 Suppose X1, X2, ... is a sequence of independent and identically distributed random variables with common generating function GX(s), and N (≥ 0) is a random variable which is independent of the Xi's and has generating function GN(s). Then, with the convention that S = 0 when N = 0,

S = X1 + X2 + ··· + XN
has generating function given by

GS(s) = GN(GX(s)). ♦

The proof can be done through conditioning on N.

Definition 3.2 The joint probability generating function of two random variables X1 and X2 taking nonnegative integer values is given by

GX1,X2(s1, s2) = E[s1^X1 s2^X2]. ♦

Theorem 3.2 X1 and X2 are independent if and only if

GX1,X2(s1, s2) = GX1(s1) GX2(s2). ♦

The idea of generating functions also applies to a sequence of real numbers. If {an}, n = 0, 1, ..., is a sequence of real numbers, then

A(s) = Σn an s^n

is called the generating function of {an} when it converges in a neighborhood of s = 0.

Suppose that {an} and {bn} are two sequences of real numbers whose generating functions exist in a neighborhood of s = 0. Define

cn = Σ_{i=0}^n ai bn−i for n = 0, 1, ....

Then C(s) = A(s) B(s), where A(s), B(s) and C(s) are the generating functions of {an}, {bn} and {cn} respectively.
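Theorem 3.1 can be sanity-checked numerically on a concrete random sum. This sketch is not from the notes; the Bernoulli and binomial choices are arbitrary (any nonnegative integer distributions would do).

```python
from math import comb

# X_i ~ Bernoulli(p): G_X(s) = q + p s.  N ~ Binomial(m, r): G_N(s) = (1 - r + r s)^m.
p, r, m = 0.3, 0.6, 4

def G_X(s):
    return (1 - p) + p * s

def G_N(s):
    return (1 - r + r * s) ** m

def direct_E_s_S(s):
    """E(s^S) computed directly: given N = k, S ~ Binomial(k, p)."""
    total = 0.0
    for k in range(m + 1):
        p_N = comb(m, k) * r ** k * (1 - r) ** (m - k)
        for j in range(k + 1):
            total += p_N * comb(k, j) * p ** j * (1 - p) ** (k - j) * s ** j
    return total

svals = [0.0, 0.25, 0.7, 1.0]
diffs = [abs(direct_E_s_S(s) - G_N(G_X(s))) for s in svals]
```

For these particular choices one can also see the answer analytically: GN(GX(s)) = (1 − rp + rps)^m, so S is Binomial(m, rp).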
3.1 Renewal Events

Consider a sequence of trials with outcomes X1, X2, .... We do not require the Xi's to be independent of each other. Let λ represent some property which, on the basis of the outcomes of the first n trials, can be said unequivocally to occur or not to occur at trial n. Roughly speaking, the property λ is a renewal event if the waiting time distribution for the next occurrence, given that it has just occurred, remains the same and is independent of the past. The process renews itself each time the renewal event occurs: it resets the clock back to time 0.

By convention, we suppose that λ has just occurred at trial 0, and En represents the "event" that λ occurs at trial n. Let λ represent a renewal event, and as before define the lifetime sequence {fn}, where f0 = 0 and

fn = P{λ occurs for the first time at trial n}, n = 1, 2, ....

In like manner, we define the renewal sequence {un}, where u0 = 1 and

un = P{λ occurs at trial n}, n = 1, 2, ....

Let F(s) = Σ fn s^n and U(s) = Σ un s^n be the generating functions of {fn} and {un}. Clearly,

f = Σ_{n=1}^∞ fn = F(1) ≤ 1,

because f has the interpretation that λ recurs at some time in the sequence; since the event may not occur at all, it is possible for f to be less than 1. When f < 1, 1 − f represents the probability that λ never recurs in the infinite sequence of trials, and the probability that λ occurs a finite number of times only is 1. If f = 1, we say that λ is recurrent, and F(s) is then a probability generating function; otherwise, we say that λ is transient. For a recurrent renewal event, the mean inter-occurrence time is

µ = F′(1) = Σ_{n=0}^∞ n fn.
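The sequences {fn} and {un} just defined are easy to play with numerically; the recursion used below, un = f1 un−1 + ··· + fn u0, is the renewal equation derived in the text. This sketch is not from the notes; the lifetime distribution is an arbitrary choice.

```python
# Lifetime distribution: f_1 = f_2 = 1/2, so mu = 1*(1/2) + 2*(1/2) = 1.5,
# recurrent and aperiodic; the renewal theorem predicts u_n -> 1/mu = 2/3.
f = {1: 0.5, 2: 0.5}
mu = sum(n * p for n, p in f.items())

u = [1.0]  # u_0 = 1
for n in range(1, 60):
    u.append(sum(f.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))
```

The computed u_n oscillate (u_1 = 0.5, u_2 = 0.75, ...) but settle down to 2/3 very quickly.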
If µ < ∞, we say that λ is positive recurrent. If µ = ∞, we say that λ is null recurrent. Finally, if λ can occur only at trials t, 2t, 3t, ... for some positive integer t > 1, we say that λ is periodic with period t. More formally, let t = g.c.d.{n : fn > 0} (g.c.d. stands for the greatest common divisor). If t > 1, the recurrent event λ is said to be periodic with period t; if t = 1, λ is said to be aperiodic.

For a renewal event λ to occur at trial n ≥ 1, either λ occurs for the first time at n, with probability fn = fn u0, or λ occurs for the first time at some intermediate trial k < n and then occurs again at n; the probability of this event is fk un−k. Noticing that f0 = 0, we therefore have

un = f0 un + f1 un−1 + ··· + fn−1 u1 + fn u0, n = 1, 2, ....

This equation is called the renewal equation. Using the typical generating function methodology, we get U(s) − 1 = F(s) U(s). Hence

U(s) = 1/(1 − F(s)), or F(s) = 1 − 1/U(s).

Theorem 3.3 The renewal event λ is

1. recurrent if and only if Σn un = ∞;

2. transient if and only if u = Σn un = U(1) < ∞;

3. null recurrent if and only if Σn un = ∞ and un → 0 as n → ∞;

4. periodic if t = g.c.d.{n : un > 0} is greater than 1, and aperiodic if t = 1.

The following is the famous renewal theorem.

Theorem 3.4 (The renewal theorem).
Let λ be a recurrent and aperiodic renewal event and let

µ = Σ n fn = F′(1)

be the mean inter-occurrence time. Then un → µ^(−1) as n → ∞ (where µ^(−1) = 0 when µ = ∞). ♦

3.2 Proof of the Renewal Theorem

I am using my own language to restate the results of Feller (1968, page 335). The proof of the Renewal Theorem is rather involved; we will put it as a section here, and this section is only for the sake of those who are nerdy enough in mathematics. To the relief of many, the result can be stated without the renewal event background.

Equivalent result: Let f0 = 0 and let f1, f2, ... be a sequence of nonnegative numbers such that Σ fn = 1, with 1 the greatest common divisor of those n for which fn > 0. Let u0 = 1 and

un = f0 un + f1 un−1 + ··· + fn u0 for all n ≥ 1.

Then un → µ^(−1) as n → ∞, where µ = Σ n fn (and µ^(−1) = 0 when µ = ∞). ♦

Note this definition of un does not apply to the case of n = 0; otherwise, the u-sequence would be a convolution of the f-sequence and itself.

The implication of being aperiodic is as follows. Let A be the set of all integers n for which fn > 0, and denote by A+ the set of all positive linear combinations p1 a1 + p2 a2 + ··· + pr ar of numbers a1, ..., ar in A.

Lemma 3.1
There exists an integer N such that A+ contains all integers n > N. ♦

In other words, if the greatest common divisor of a set of positive integers is 1, then the set of their positive linear combinations contains all but a finite number of positive integers. To reduce the burden in the math respect, this result will not be proved here; some explanation will be given in class.

Lemma 3.2 Suppose we have a sequence of sequences {(am,n), n = 1, 2, ...}, m = 1, 2, ..., such that 0 ≤ am,n ≤ 1 for all m, n. There exists a sequence {ni : i = 1, 2, ...} such that limi→∞ am,ni exists for each m. ♦

Again, the proof will be omitted. This result is also used when proving a result about the relationship between convergence in distribution and the convergence of characteristic functions.

The next lemma has been restated in different words; the sequence fn is assumed to have the same properties as the sequence introduced earlier.

Lemma 3.3 Let {wn}, n = ..., −1, 0, 1, ..., be a doubly infinite sequence of numbers such that 0 ≤ wn ≤ 1 and

wn = Σ_{k=1}^∞ fk wn−k

for each n. If w0 = 1, then wn = 1 for all n. ♦

Proof: Since all numbers involved are nonnegative and wn ≤ 1, we have, for all n,

wn = Σ_{k=1}^∞ fk wn−k ≤ Σ fk = 1.

Applying this conclusion to n = 0 with the condition w0 = 1, we must have w−k = 1 whenever fk > 0. Recalling the definition of A, this conclusion can be restated as w−a = 1 whenever a ∈ A.
For each a ∈ A, using the same argument

wn = Σ_{k=1}^∞ fk wn−k ≤ Σ fk = 1

applied to n = −a, we find w−a−k = 1 whenever both a, k ∈ A. By induction, this can be strengthened to w−a = 1 for any a ∈ A+. An implication, by Lemma 3.1, is then: there exists an N such that w−n = 1 for every n > N. Using wn = Σ fk wn−k again for n = −N, we get w−N = 1; repeating for n = −N + 1, we get w−N+1 = 1; repeating again and again, we find wn = 1 for all n. ♦

Finally, we prove the renewal theorem.

Proof of Renewal Theorem: Our goal is to prove that the limit of un exists and equals µ^(−1). Since the items in the sequence {un} are bounded, a subsequence can be found such that this subsequence converges to some number η. Assume this η is the upper limit, and denote the subsequence as u(iv) → η as v → ∞.

Let un,v be a double sequence such that, for each v, un,v = un+iv when n + iv ≥ 0, and un,v = 0 otherwise. In other words, it shifts the original sequence by iv positions toward the left, and adds zeroes to make a double sequence. For example, suppose i5 = 10; then u0,5 = u10, u1,5 = u11, u−1,5 = u9, u−2,5 = u8, and so on. If we set n = 0 and let v increase, limv→∞ u0,v = η.
By Lemma 3.2, it is possible to find a subsequence in v, say {vj, j = 1, 2, ...}, such that

lim_{j→∞} un,vj exists for each n = 0, ±1, ±2, ....

Call this limit wn. Since for all n > −v we have

un,v = Σ_{k=1}^∞ fk un−k,v,

taking the limit along the path v = vj gives

wn = Σ_{k=1}^∞ fk wn−k.

Since w0 = η and every wn ≤ η, Lemma 3.3 (applied to the sequence wn/η when η > 0) then gives wn = η for all n. In summary, for each n,

lim_{j→∞} un,vj = lim_{j→∞} uvj+n = η.   (3.1)

What we wanted, however, is that the limit of un exists and equals µ^(−1). To aim a bit low, let us first show that η = µ^(−1).

Let ρk = fk+1 + fk+2 + ··· for k = 0, 1, 2, .... Another note is that 1 − ρk = f1 + f2 + ··· + fk, and that Σk ρk = µ. We should have seen the latter before; if you do not remember, it is a good opportunity for us to redo it with the generating function concept.

Let us now list the defining relations as follows:

u1 = f1 u0,
u2 = f2 u0 + f1 u1,
u3 = f3 u0 + f2 u1 + f1 u2,
...
uN = fN u0 + fN−1 u1 + fN−2 u2 + ··· + f1 uN−1.
Adding up both sides, we get

Σ_{k=1}^N uk = (1 − ρN) u0 + (1 − ρN−1) u1 + (1 − ρN−2) u2 + ··· + (1 − ρ1) uN−1.

Noting that u0 = 1 and ρ0 = 1, this is the same as

ρN u0 + ρN−1 u1 + ··· + ρ1 uN−1 + ρ0 uN = 1.

Let N = vj; recall that by (3.1), uN−k → η for each fixed k as j → ∞. Letting j → ∞ in the identity above, it implies

η × {Σk ρk} = 1.

That is, either η = µ^(−1) when µ is finite, or η = 0 otherwise; the latter case is easy when µ = ∞.

Is η also THE limit of un? If the limit of un exists, then the answer is yes. If µ = ∞, then the answer is also yes, because the lower limit cannot be smaller than 0. Otherwise, we have more work to do.

If the limit of un does not exist, it must have a subsequence whose limit exists and is smaller than η. Let it be η0 < η. Being the upper limit of un, η implies that, given any small positive quantity ε, un ≤ η + ε for all large enough n.

Let us examine the relationship

ρN u0 + ρN−1 u1 + ··· + ρ1 uN−1 + ρ0 uN = 1

again. By letting some uk be replaced by the larger value 1, it becomes

Σ_{k=0}^{N−r−1} ρN−k + Σ_{k=N−r}^{N−1} ρN−k uk + ρ0 uN ≥ 1.
By including more terms in the first summation, it becomes

Σ_{k=r+1}^∞ ρk + Σ_{k=N−r}^{N−1} ρN−k uk + ρ0 uN ≥ 1.

Replacing these uk by the larger value η + ε, which is true for large k, we have

Σ_{k=r+1}^∞ ρk + (η + ε) Σ_{k=1}^r ρk + ρ0 uN ≥ 1.

Further, since Σ_{k=0}^∞ ρk = µ, we get

Σ_{k=r+1}^∞ ρk + (η + ε)(µ − ρ0) + ρ0 uN ≥ 1,

which, after reorganizing terms slightly, becomes

Σ_{k=r+1}^∞ ρk + (η + ε)µ + ρ0 (uN − η − ε) ≥ 1.

Now let N go to infinity along the subsequence with limit η0; we get

Σ_{k=r+1}^∞ ρk + (η + ε)µ + ρ0 (η0 − η − ε) ≥ 1.

At last, letting r → ∞ and ε → 0 (since this result applies to any r and ε), we get

ηµ + ρ0 (η0 − η) ≥ 1.

Because ηµ = 1, we must have η0 − η ≥ 0. Combined with the fact that η is the upper limit, we must have η0 = η, or there will be a contradiction. Since the limit of any subsequence of un must then be the same as η, the limit of un exists and equals η = µ^(−1). Even with this much trouble, we have finally proved the Renewal Theorem. ♦

3.3 Properties of Random Walks by Generating Functions
3.3.1 Quick derivation of some generating functions

The generating functions for certain sequences associated with the simple random walk are as follows (we will go through a lot of details in class).

For "first returning to 0": F(s) = 1 − (1 − 4pqs²)^(1/2).

For "returning to 0": U(s) = (1 − 4pqs²)^(−1/2).

For "first passage of 1": Λ(s) = (2qs)^(−1) [1 − (1 − 4pqs²)^(1/2)].

Consequences: P(ever returns to 0) = 1 − |p − q|, and P(ever reaches 1) = min(1, p/q).

3.3.2 Hitting time theorem

The hitting time theorem is more general than what has been shown. Consider the random walk Sn = X1 + X2 + ··· + Xn such that

P(X ≤ 1) = 1 and P(X = 1) > 0.

This is the so-called right-continuous random walk: the random walk cannot skip over (upward) any state without stepping on it. It has been shown for the simple random walk that

fb(n) = P(Tb = n) = (b/n) P(Sn = b).

Let Tb be the first time when Sn = b for some b > 0, and assume S0 = 0. Let us see why this is still true for the new random walk we have just defined. For this purpose, define

Fb(z) = E(z^Tb),
For this purpose, define

F_b(z) = E(z^{T_b}),

which is the generating function of T_b. It may happen that P(T_b = ∞) > 0, in which case it is more precise to define

F_b(z) = lim_{n→∞} Σ_{m=1}^n z^m P(T_b = m).

It seems that defining a function through a limit in this way calls for some attention, but if we ignore the mathematical subtlety, this step turns out to be rather simple. We also define

G(z) = E(z^{1−X_1}).

Since 1 − X_1 is nonnegative, while X_1 can be negative with large absolute values, working with the generating function of 1 − X_1 makes sense. Please note that our G here is a bit different from G in the textbook. The purpose of using the letter z, rather than s, is to allow z to be a complex number; we will pretend z is just a real number for illustration. G is well defined for all |z| ≤ 1. Due to the way S_n is defined, the "generating function" of S_n is determined by [G(z)]^n, namely E(z^{n−S_n}) = [G(z)]^n.

At this point, our preparations boil down to having defined the generating functions of T_b and of X_1. Since the random walk cannot skip a state without landing on it, it is seen that

F_b(z) = [F_1(z)]^b.

We take notice that this relationship works even when b = 0. Hence the required hitting time theorem may be obtained by linking G(z) and F_b(z). Conditioning on the first step X_1, after which the walk still needs to climb 1 − X_1 levels,

F_1(z) = E[E(z^{T_1} | X_1)] = E[z^{1 + T'_{1−X_1}}] = z E{[F_1(z)]^{1−X_1}} = z G(F_1(z)),

where T'_{1−X_1} denotes the additional time needed for the first passage of the remaining 1 − X_1 levels. Denote w = F_1(z); this relation can then be written as

z = w / G(w).

That is, once G is given, the relationship between z and w is completely determined by the above equation. It turns out there is a complex analysis theorem for inverting z = w/G(w).
Theorem 3.5 (Lagrange's inversion formula) Let z = w/f(w), where f(w) is an analytic function (has a derivative in the complex analysis sense) in a neighborhood of w = 0, with a one-to-one relationship between z and w in this neighborhood. If g is infinitely differentiable (but I had the impression that being analytic implies infinitely differentiable), then

g(w(z)) = g(0) + Σ_{n=1}^∞ (z^n / n!) [ d^{n−1}/du^{n−1} { g′(u) [f(u)]^n } ]_{u=0}.

(See Kyrala (1972), Applied Functions of a Complex Variable, page 51, for a close resemblance of this result; it is related to Cauchy integration over a closed path.) ♦

Apply this result with f(w) being our G(w) and g(w) = w^b. We have

[F_1(z)]^b = Σ_{n=1}^∞ (z^n / n!) [ d^{n−1}/du^{n−1} { b u^{b−1} G^n(u) } ]_{u=0}.

Let us not forget that [F_1(z)]^b is also the "probability" generating function of T_b; the two expansions must be consistent with each other, and the result to be proved is a consequence of this consistency. Matching the coefficients of z^n, we have

n! P(T_b = n) = [ d^{n−1}/du^{n−1} { b u^{b−1} G^n(u) } ]_{u=0}.

Notice that the latter is (n−1)! times the coefficient of u^{n−1} in the power expansion of b u^{b−1} G^n(u), which is the coefficient of u^{n−b} in the expansion of b G^n(u), which in turn is the coefficient of u^{−b} in the expansion of b u^{−n} G^n(u). Now we point out that u^{−n} G^n(u) = E(u^{−S_n}), so that (n−1)! times this coefficient is b (n−1)! P(S_n = b). That is,

n! P(T_b = n) = b (n−1)! P(S_n = b),

and hence we get the result P(T_b = n) = (b/n) P(S_n = b). ♦

3.3.3 Spitzer's Identity

Here we discuss the magical result of Spitzer's identity. I kind of believe that a result like this can be useful in statistics; yet I find that it is still too complex to be practical at the moment.
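Before moving on, the hitting time theorem just proved can be confirmed exactly on a small right-continuous walk. The step law below, with support {1, 0, −2}, is an arbitrary choice satisfying P(X ≤ 1) = 1 and P(X = 1) > 0; exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction
from collections import defaultdict

# Right-continuous step law with support {1, 0, -2}: an arbitrary choice
# satisfying P(X <= 1) = 1 and P(X = 1) > 0.
steps = {1: Fraction(1, 2), 0: Fraction(1, 4), -2: Fraction(1, 4)}
b, nmax = 2, 10

alive = {0: Fraction(1)}                # paths that have not yet reached b
first_hit = [Fraction(0)] * (nmax + 1)  # P(T_b = n)
free = {0: Fraction(1)}                 # unrestricted walk
free_hist = [dict(free)]

for n in range(1, nmax + 1):
    new_alive, new_free = defaultdict(Fraction), defaultdict(Fraction)
    for s, pr in alive.items():
        for d, pd in steps.items():
            if s + d == b:              # first passage: steps are <= +1,
                first_hit[n] += pr * pd  # so b cannot be jumped over
            else:
                new_alive[s + d] += pr * pd
    for s, pr in free.items():
        for d, pd in steps.items():
            new_free[s + d] += pr * pd
    alive, free = dict(new_alive), dict(new_free)
    free_hist.append(dict(free))

# hitting time theorem, exactly: P(T_b = n) = (b/n) P(S_n = b)
for n in range(1, nmax + 1):
    assert first_hit[n] == Fraction(b, n) * free_hist[n].get(b, Fraction(0))
print(first_hit[2], first_hit[3])  # 1/4 and 1/8 for this step law
```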
Theorem 3.6 (Spitzer's identity) Assume that S_n is a right-continuous random walk, and let M_n = max{S_i : 0 ≤ i ≤ n} be the maximum of the walk up to time n. Then for |s|, |t| < 1,

log { Σ_{n=0}^∞ t^n E(s^{M_n}) } = Σ_{n=1}^∞ (1/n) t^n E(s^{S_n^+}),

where S_n^+ = max{0, S_n} (which is the nonnegative part of S_n). ♦
Proof: We keep the notation f_b(n) = P(T_b = n) as before. To have M_n = b, the walk has to land on b at some time j between 1 and n, and at the same time it must not land on b+1 within the remaining n − j steps. Thus

P(M_n = b) = Σ_{j=0}^n P(T_b = j) P(T_1 > n − j).

Multiply both sides by s^b t^n and sum over b, n ≥ 0. The left hand side is

Σ_{n=0}^∞ t^n E(s^{M_n}).

Recall that Σ_n t^n P(T_1 > n) = (1 − F_1(t))/(1 − t). The right hand side is, by the convolution relationship,

[(1 − F_1(t))/(1 − t)] Σ_{b=0}^∞ s^b E(t^{T_b}) = [(1 − F_1(t))/(1 − t)] Σ_{b=0}^∞ s^b [F_1(t)]^b = (1 − F_1(t)) / [(1 − t)(1 − sF_1(t))].

Denote this function as D(s, t). The rest contains mathematical manipulation. By the hitting time theorem,

n P(T_1 = n) = P(S_n = 1) = Σ_{j=0}^n P(T_1 = j) P(S_{n−j} = 0),

and we get the generating function relationship

t F_1′(t) = F_1(t) U(t)

(recall that U(s) is the g.f. for returning to 0). With this relationship, we have

∂/∂t log[1 − sF_1(t)] = −sF_1′(t) / (1 − sF_1(t)) = −(s/t) F_1(t)U(t) / (1 − sF_1(t))
= −(s/t) F_1(t)U(t) Σ_{k=0}^∞ s^k [F_1(t)]^k
= −Σ_{k=0}^∞ (s^{k+1}/t) [F_1(t)]^{k+1} U(t)
= −Σ_{k=1}^∞ (s^k/t) [F_1(t)]^k U(t).

Notice that

P(S_n = k) = Σ_{j=0}^n P(T_k = j) P(S_{n−j} = 0).

Thus, the generating function [F_1(t)]^k U(t) is in fact a generating function for the sequence of P(S_n = k) in n. That is,

[F_1(t)]^k U(t) = Σ_{n=0}^∞ t^n P(S_n = k).

Notice that the sum is over n. It therefore implies

∂/∂t log[1 − sF_1(t)] = −Σ_{n=1}^∞ Σ_{k=1}^∞ t^{n−1} s^k P(S_n = k).
So, we get

∂/∂t log D(s, t) = −∂/∂t log(1 − t) + ∂/∂t log[1 − F_1(t)] − ∂/∂t log[1 − sF_1(t)]
= Σ_{n=1}^∞ t^{n−1} [ 1 − Σ_{k=1}^∞ P(S_n = k) + Σ_{k=1}^∞ s^k P(S_n = k) ]
= Σ_{n=1}^∞ t^{n−1} [ P(S_n ≤ 0) + Σ_{k=1}^∞ s^k P(S_n = k) ]
= Σ_{n=1}^∞ t^{n−1} E(s^{S_n^+}).

Now integrate with respect to t to get the final result. ♦

Remark: I do not see why this final result is more meaningful than the intermediate result. I am wondering if any of you can offer some insight.
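Spitzer's identity can at least be verified numerically: compute E(s^{M_n}) and E(s^{S_n^+}) for the simple random walk by dynamic programming and compare the two power series in t term by term. Here p = 0.3 and s = 0.5 are arbitrary illustrative values.

```python
# DP over the joint law of (S_n, M_n) for the simple random walk;
# p and s are arbitrary illustrative values.
p, s, N = 0.3, 0.5, 12
q = 1 - p

joint = {(0, 0): 1.0}
a = [1.0]          # a_n = E(s^{M_n}); a_0 = 1
bcoef = [0.0]      # b_n = E(s^{S_n^+}) / n; b_0 unused
for n in range(1, N + 1):
    new = {}
    for (x, m), pr in joint.items():
        for d, w in ((1, p), (-1, q)):
            key = (x + d, max(m, x + d))
            new[key] = new.get(key, 0.0) + pr * w
    joint = new
    a.append(sum(pr * s ** m for (x, m), pr in joint.items()))
    bcoef.append(sum(pr * s ** max(0, x) for (x, m), pr in joint.items()) / n)

# power series logarithm: if A(t) = sum a_n t^n with a_0 = 1, the
# coefficients l_n of log A(t) satisfy l_n = a_n - (1/n) sum k*l_k*a_{n-k}
l = [0.0] * (N + 1)
for n in range(1, N + 1):
    l[n] = a[n] - sum(k * l[k] * a[n - k] for k in range(1, n)) / n

print(max(abs(l[n] - bcoef[n]) for n in range(1, N + 1)))  # essentially 0
```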
3.3.4 Leads for tied-down random walk
Let us now go back to the simple random walk such that S_0 = 0 and P(X_i = 1) = p, P(X_i = −1) = q = 1 − p. Define L_{2n} as the number of steps during which the random walk was above the line of 0 (but it is allowed to touch 0). We have an arcsine law established for this random variable already. This time, we investigate its property when S_{2n} = 0 is given.

Theorem 3.7 (Leads for tied-down random walk) For the simple random walk S,

P(L_{2n} = 2k | S_{2n} = 0) = 1/(n+1)

for k = 0, 1, . . . , n. ♦

Remark: we might expect more probability to be placed near L_{2n} = n, which would divide the time evenly above and below zero. This theorem, however, claims that a uniform distribution is the truth. Note also that the result does not depend on the value of p.

Proof: Define T_0 to be the first time when S is zero again. Given T_0 = 2r for some r, L_{2r} can either be 0 or 2r, as the simple random walk does not cross zero before T_0. Making use of the results on the sizes of λ_{2r} and f_{2r}, it turns out the conditional distribution places half and half probabilities on 0 and 2r. Thus,

E(s^{L_{2r}} | S_{2n} = 0, T_0 = 2r) = 1/2 + (1/2) s^{2r}.

Now we define G_{2n}(s) = E[s^{L_{2n}} | S_{2n} = 0] and F_0(s) = E(s^{T_0}). Further, let

H(s, t) = Σ_{n=0}^∞ t^{2n} P(S_{2n} = 0) G_{2n}(s).
Conditioning on T_0, we have

G_{2n}(s) = Σ_{r=1}^n E[s^{L_{2n}} | S_{2n} = 0, T_0 = 2r] P(T_0 = 2r | S_{2n} = 0) = Σ_{r=1}^n (1/2)(1 + s^{2r}) P(T_0 = 2r | S_{2n} = 0),

where

P(T_0 = 2r | S_{2n} = 0) = P(T_0 = 2r) P(S_{2n−2r} = 0) / P(S_{2n} = 0).

Hence

H(s, t) − 1 = (1/2) H(s, t) [F_0(t) + F_0(st)].

Recall that F_0(t) = 1 − (1 − 4pqt²)^{1/2}. Solving the last equation for H, one obtains

H(s, t) = 2 / [(1 − 4pqt²)^{1/2} + (1 − 4pqs²t²)^{1/2}]
= 2 [(1 − 4pqs²t²)^{1/2} − (1 − 4pqt²)^{1/2}] / [4pqt²(1 − s²)]
= Σ_{n=0}^∞ t^{2n} P(S_{2n} = 0) (1 − s^{2n+2}) / [(n+1)(1 − s²)].

Comparing the coefficients of t^{2n} in the two expansions of H(s, t), we deduce

G_{2n}(s) = Σ_{k=0}^n (n+1)^{−1} s^{2k},

which is the generating function of the claimed uniform distribution. ♦

3.4 Branching Process

We only consider a simple case where Z_0 = 1, representing a species started from a single individual. It is assumed that each individual gives birth to a random number of offspring independently of each other; we call this number the family size. We assume family sizes have the same distribution, with probability generating function G(s). Let the population size of the nth generation be called Z_n, and let μ and σ² be the mean and variance of the family size. It is known that

E[Z_n] = μ^n, Var(Z_n) = σ²(μ^n − 1)μ^{n−1}/(μ − 1) when μ ≠ 1, and Var(Z_n) = nσ² when μ = 1.

These relations can be derived from the identity

G_n(t) = G(G_{n−1}(t)), where G_n(t) = E(t^{Z_n}).

The same identity also implies that the probability of ultimate extinction, η = lim P(Z_n = 0), is the smallest nonnegative solution of the equation x = G(x). At the same time, it is known that:

1. if μ < 1, then η = 1;
2. if μ = 1 and σ² > 0, then η = 1;
3. if μ = 1 and σ² = 0, then η = 0;
4. if μ > 1, then η < 1.

Discussion will be given in class.

3.5 Summary

The biggest deal of this chapter is the proof of the Renewal Theorem. One may learn a lot from this proof. At the same time, the proof is not much related to generating functions at all. Because of this, it is okay to choose to ignore the proof completely.

One should be able to see the beauty of generating functions when handling problems related to the simple random walk and the branching process. At the same time, none of the results discussed should be new to you. You may realize that the two very short sections on the simple random walk and the branching process are rich in content. Please take the opportunity to plant this knowledge firmly in your brain if it was not so before.
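As a tiny illustration of the extinction criterion from Section 3.4: take the offspring law P(0 children) = q, P(2 children) = p (an arbitrary choice), for which G(x) = q + px² and the roots of x = G(x) are 1 and q/p. Iterating η ← G(η) from 0 converges to the smallest nonnegative root.

```python
# Offspring law P(0 children) = q, P(2 children) = p (arbitrary choice):
# G(x) = q + p x^2, mean mu = 2p, and x = G(x) has roots 1 and q/p.
p = 0.7
q = 1 - p
G = lambda x: q + p * x * x

eta = 0.0                  # iterate eta <- G(eta) from 0: the limit is
for _ in range(200):       # the smallest nonnegative root of x = G(x)
    eta = G(eta)

print(eta)  # approx q/p = 3/7, which is < 1 since mu = 1.4 > 1
```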
Chapter 4

Discrete Time Markov Chain

A discrete time Markov chain is a stochastic process consisting of a countable number of random variables arranged in a sequence. Usually, we name these random variables X_0, X_1, X_2, . . .. Further, its state space S, which is the set of all possible values of these random variables, must be countable. To qualify as a Markov chain, the process must have the Markov property:

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, . . . , X_0 = i_0) = P(X_{n+1} = j | X_n = i).

When this transition probability does not depend on n, we say that the Markov chain is time homogeneous. The Markov chains discussed in this course will be assumed time homogeneous unless otherwise specified. In this case, we use the notation p_ij = P(X_1 = j | X_0 = i), and P for the matrix with the i,jth entry being p_ij. It is known that all entries of the transition matrix are nonnegative, and its row sums are all 1. In addition, let P(m) be the m-step transition matrix defined by p_ij^{(m)} = P(X_m = j | X_0 = i). Then they satisfy the Chapman-Kolmogorov equations:

P(m+n) = P(m) P(n)

for all nonnegative integers m and n. Usually, we denote P(0) = I. Let μ(m) be the row vector whose ith entry is μ_i(m) = P(X_m = i). It can be seen that μ(m+n) = μ(m) P^n. Both the simple random walk and the branching process are special cases of the discrete time Markov chain.

Note that the Chapman-Kolmogorov equation is a direct consequence of the Markov property. Suppose that a discrete time stochastic process has a countable state space, and its transition probability matrices, defined in the same way as for the Markov chain, satisfy the Chapman-Kolmogorov equations. Does it have to be a Markov chain? In mathematics, either we can offer a rigorous proof to show this is true, or we can offer an example of a stochastic process with this property which is not a Markov chain. It turns out the latter is the case.

Example 4.1 (Markov's other chain) Let Y_1, Y_3, Y_5, . . . be a sequence of iid random variables such that

P(Y_1 = 1) = P(Y_1 = −1) = 1/2.

Let Y_{2k} = Y_{2k−1} Y_{2k+1} for k = 1, 2, . . .. We hence have a stochastic process Y_1, Y_2, Y_3, . . ., which has state space {−1, 1}. Note that

E[Y_{2k} Y_{2k+1}] = E[Y_{2k−1} Y_{2k+1}²] = E[Y_{2k−1}] = 0.

Since these random variables take only two possible values, having correlation 0 implies independence. All other non-neighbouring pairs of random variables are also independent of each other by definition. Thus,

P(Y_{m+n} = j | Y_n = i) = P(Y_{m+n} = j) = 1/2

for any i, j = ±1, and the m-step transition matrix is

P(m) = P = [1/2 1/2; 1/2 1/2],

which is idempotent. It implies P(m+n) = P(m)P(n) for all m and n. Thus, the Chapman-Kolmogorov equations are satisfied. However,

P(Y_{2k+1} = 1 | Y_{2k} = −1) = 1/2, but P(Y_{2k+1} = 1 | Y_{2k} = −1, Y_{2k−1} = 1) = 0.

Thus, this process does not have the Markov property and is not a Markov chain. ♦

Although this example shows that a stochastic process with transition probability matrices satisfying the Chapman-Kolmogorov equations is not necessarily a Markov chain, when we are given a set of stochastic matrices satisfying the Chapman-Kolmogorov equations, it is always possible to construct a Markov chain with this set of stochastic matrices as its transition probability matrices.
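Both halves of the discussion above can be checked mechanically: a genuine stochastic matrix satisfies the Chapman-Kolmogorov equations, and Markov's other chain satisfies them too yet still fails the Markov property. The 3-state matrix below is an arbitrary illustration.

```python
import numpy as np
from itertools import product

# (i) A genuine transition matrix satisfies Chapman-Kolmogorov.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])          # arbitrary illustrative matrix
P2, P3, P5 = (np.linalg.matrix_power(P, k) for k in (2, 3, 5))
assert np.allclose(P5, P2 @ P3)

# (ii) Markov's other chain: exact law of (Y1, ..., Y5) with
# Y2 = Y1*Y3 and Y4 = Y3*Y5, the odd-indexed Y's being iid fair signs.
prob = {}
for y1, y3, y5 in product((-1, 1), repeat=3):
    path = (y1, y1 * y3, y3, y3 * y5, y5)
    prob[path] = prob.get(path, 0.0) + 1 / 8

def cond(event, given):
    num = sum(pr for y, pr in prob.items() if event(y) and given(y))
    den = sum(pr for y, pr in prob.items() if given(y))
    return num / den

p_pair = cond(lambda y: y[2] == 1, lambda y: y[1] == -1)
p_triple = cond(lambda y: y[2] == 1, lambda y: y[1] == -1 and y[0] == 1)
print(p_pair, p_triple)  # 0.5 and 0.0: the Markov property fails
```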
4.1 Classification of States and Chains

We use i, j, k and so on to denote states in the state space. If there exists an integer m ≥ 0 such that p_ij^{(m)} > 0, we say that j is reachable from i. If i is reachable from j and j is reachable from i, then we say that i and j communicate. It is obvious that communication is an equivalence relationship. That is, the state space is partitioned into classes so that any pair of states in the same class communicate with each other, while any pair of states from two different classes do not communicate. In most examples, it is usually very simple to classify the state space: it can be done by examining the transition probability matrix of the Markov chain.

Let G be a class. If G consists of all states of the Markov chain, we say that the chain is irreducible. If for any state i ∈ G and state j not in G we have p_ij = 0, then the Markov chain can never leave class G once it enters: once the value of X_n is in G for some n, the values of X_{n+m} are all in G for m ≥ 0. A class G is closed when it has this property. In contrast to a closed class, if there exist states i ∈ G and j not in G such that p_ij > 0, then the class is said to be open. Notice that, given X_n ∈ G for some n, the Markov chain {X_n, X_{n+1}, . . .} has effectively G as its state space. Thus, the state space is reduced when G is a true subset of the state space.

4.2 Class Properties

It turns out that the states in the same class share some common properties. We call them class properties. Due to the Markov property, the event that X_n = i (for any fixed i) is a renewal event. Thus, some properties of renewal events can be translated easily here. Some properties that describe renewal events are positive or null recurrence, periodicity, and transience.

Definition 4.1 (Properties of renewal events)
1. A renewal event is recurrent (persistent) if the probability of its future occurrence, given its occurrence at time 0, is 1. If a renewal event is not recurrent, then it is called transient.
2. If the expected waiting time for the next occurrence of a recurrent renewal event is finite, then the renewal event is positive recurrent. Otherwise, it is null recurrent.
3. Let A be a renewal event and T = {n : P(A occurs at n | A occurs at 0) > 0}. The period of A is d, the greatest common divisor of T. If d = 1, then A is aperiodic. ♦

Due to the fact that entering a state is a renewal event, let

P_ij(s) = Σ_{n=0}^∞ s^n p_ij(n)
be the generating function of the n-step transition probabilities. Define

f_ij(n) = P(X_1 ≠ j, . . . , X_{n−1} ≠ j, X_n = j | X_0 = i),

the probability of entering j from i for the first time at time n. Let its corresponding generating function be F_ij(s). We let f_ij = Σ_n f_ij(n) be the probability that the chain ever enters state j starting from i. Note that f_ij = F_ij(1).

Lemma 4.1
(a): P_ii(s) = 1 + F_ii(s) P_ii(s);
(b): P_ij(s) = F_ij(s) P_jj(s) if i ≠ j. ♦

Proof: The first property is the renewal equation; the second one is the delayed renewal equation. ♦

Based on these two equations, it is easy to show the following.

Theorem 4.1
(a) State j is recurrent if and only if Σ_n p_jj(n) = ∞. Consequently, Σ_n p_ij(n) = ∞ when j is recurrent and reachable from i.
(b) State j is transient if and only if Σ_n p_jj(n) < ∞. Consequently, Σ_n p_ij(n) < ∞ for any i. ♦

Proof: Use the generating function equations in Lemma 4.1. ♦

Theorem 4.2 All properties in Definition 4.1 are class properties. ♦

Proof: Suppose i and j are two states in the same class. Assume state i is recurrent; we want to show that state j is also recurrent.
Let p_ij(m) be the m-step transition probability from i to j. Since i and j communicate, there exist m_1 and m_2, the numbers of steps in which the chain can go from i to j and return from j to i, such that p_ij(m_1) > 0 and p_ji(m_2) > 0. Hence

p_jj(m + m_1 + m_2) ≥ p_ji(m_2) p_ii(m) p_ij(m_1).

Since i is recurrent, Σ_m p_ii(m) = ∞. Thus

Σ_m p_jj(m) ≥ Σ_m p_jj(m + m_1 + m_2) ≥ p_ji(m_2) {Σ_m p_ii(m)} p_ij(m_1) = ∞.

Hence, j is also a recurrent state. The above proof also implies that if i is transient, j cannot be recurrent; that is, j is also transient.

Next, we show that i and j must have the same period. Let d_i and d_j be the periods of states i and j. Define T_i = {n : p_ii(n) > 0} and similarly T_j. Note that if n_1, n_2 ∈ T_i, then a n_1 + b n_2 ∈ T_i for any positive integers a and b. Since d_i is the greatest common divisor of T_i, it is possible to find integer coefficients a_1, . . . , a_m and n_1, . . . , n_m ∈ T_i such that

a_1 n_1 + a_2 n_2 + · · · + a_m n_m = d_i.

By grouping positive and negative coefficients, it is easy to see that the number of terms m can be reduced to 2. That is, there exist integers a_1, a_2 and n_1, n_2 ∈ T_i such that a_1 n_1 + a_2 n_2 = d_i. We can then further pick nonnegative coefficients such that

a_{11} n_1 + a_{12} n_2 = k d_i, a_{21} n_1 + a_{22} n_2 = (k+1) d_i

for some positive integer k. Hence, both m_1 + m_2 + k d_i and m_1 + m_2 + (k+1) d_i are in T_j, so d_j divides both, and consequently d_j divides d_i. The reverse is also true by symmetry. Hence d_i = d_j.

At last, we delegate the positive recurrence proof to the future. ♦
4.2.1 Properties of Markov Chains with a Finite Number of States

Lemma 4.2 If a Markov chain has a finite state space, then at least one state is recurrent, and all recurrent states are positive recurrent. ♦

Proof: A necessary condition for a renewal event to be transient is u_n → 0 as n increases. In the context of Markov chains, this implies that if state j is transient, then p_ij(n) → 0. Because

1 = Σ_j p_ij(n),

we cannot have p_ij(n) → 0 for all j when the state space is finite. Thus, at least one of the states is recurrent.

Intuitively, if the state space is finite, then at least one of the states will be visited an infinite number of times over the infinite time horizon, and since the Markov chain cannot escape from a finite closed class, the average waiting time for the next visit will be finite. Hence, at least one state is recurrent, and recurrent states are positive recurrent. We can prove this result rigorously, but we skip the rigorous proof for now. ♦

By the way, the last conclusion implies that there is always a closed class of recurrent states for a Markov chain with a finite state space.

4.3 Stationary Distribution and the Limit Theorem

A Markov chain consists of a sequence of discrete random variables; for convenience, the states are regarded as nonnegative integers. For each n, X_n is a random variable whose distribution may be different from that of X_{n−1} but is evolved from it. When n is large, the dependence of the distribution of X_n on that of X_0 gets weaker and weaker. It is therefore
possible that the distribution stabilizes and hence has a limit. This limit turns out to exist in many cases, and it is related to the so-called stationary distribution.

Definition 4.2 The vector π is called a stationary distribution of a Markov chain if π has entries (π_j : j ∈ S) such that
(a) π_j ≥ 0 for all j, and Σ_j π_j = 1;
(b) π = πP, where P is the transition probability matrix. ♦

It is seen that if the probability function of X_0 is given by a vector α, then the probability function of X_n is given by β(n) = αP^n. Consequently, if the distribution of X_n is given by π, then the distributions of all X_{n+m}, m ≥ 0, are given by π. Hence it gains the name of stationary. It should be noted that if the distribution of X_n is given by π, the distribution of X_{n−m} does not have to be π.

Lemma 4.3 If the Markov chain is irreducible and recurrent, there exists a positive root x of the equation x = xP, which is unique up to a multiplicative constant. The chain is non-null if Σ_i x_i < ∞, and null if Σ_i x_i = ∞. ♦

Proof: Assume the chain is recurrent and irreducible. For any states k, i ∈ S, define

N_ik = Σ_n I(X_n = i, T_k ≥ n), ρ_i(k) = E[N_ik | X_0 = k],

which is the mean number of visits of the chain to state i between two successive visits of state k. Here T_k is the time of the first return to state k; note that T_k is a well-defined random variable because P(T_k < ∞) = 1.

Before continuing, recall that if j is recurrent, then μ_j = E[T_j] is called the mean recurrence time. It is allowed to be infinity when the summation in the definition of the
expectation does not converge. With this definition, being positive recurrent is equivalent to the claim that μ_j < ∞. When j is transient, we simply define μ_j = ∞.

We now continue the proof. Note that

ρ_i(k) = Σ_n P(X_n = i, T_k ≥ n | X_0 = k)

and that ρ_k(k) = 1. It will be seen that the vector ρ(k), with ρ_i(k) as its ith component, is a base for finding the stationary distribution.

We first show that ρ_i(k) is finite for all i. Let

l_ki(n) = P(X_n = i, T_k ≥ n | X_0 = k),

the probability that the chain reaches i in n steps but with no intermediate return to the starting point k. Because the chain is irreducible, there exists n such that f_ik(n) > 0. At the same time, we see that

f_kk(m + n) ≥ l_ki(m) f_ik(n),

in which the right hand side covers some of the sample paths taking a round trip from k back to k in m + n steps: those visiting i at step m without visiting k in the intermediate steps. Now sum over m on the two sides, and we get

Σ_m f_kk(m + n) ≥ f_ik(n) Σ_m l_ki(m).

Since Σ_m f_kk(m + n) < ∞ and f_ik(n) > 0, we get Σ_m l_ki(m) < ∞. Thus, ρ_i(k) = Σ_n l_ki(n) < ∞.

Now we move to the next step. Note that l_ki(1) = p_ki, the one-step transition probability. Further, for n ≥ 2,

l_ki(n) = Σ_{j: j≠k} P(X_n = i, X_{n−1} = j, T_k ≥ n | X_0 = k) = Σ_{j: j≠k} l_kj(n−1) p_ji.

Now sum over n ≥ 2; together with l_ki(1) = p_ki, we have

ρ_i(k) = p_ki + Σ_{j: j≠k} Σ_{n≥2} l_kj(n−1) p_ji = ρ_k(k) p_ki + Σ_{j: j≠k} ρ_j(k) p_ji
because ρ_k(k) = 1. That is, ρ(k) = ρ(k)P. This proves the existence part of the lemma. Due to the uniqueness (proved next), we must have, for any state k, μ_k = Σ_i ρ_i(k), and for any solution x, μ_k = c Σ_i x_i for a suitable constant c.

We prove the uniqueness next. (This proof works only if there is a finite number of states; the more general case will be proved later.) If there are two solutions to xP = x which are not multiples of each other, then we can obtain another solution with nonnegative components and at least one 0 entry. If so, write x = (0, x_2) and

P = [P_11 P_12; P_21 P_22].

Then, for this solution, we must have x_2 P_21 = 0. Since all components in x_2 are positive, we must have P_21 = 0. This contradicts the irreducibility assumption. Thus, we cannot have a nonnegative solution of xP = x with zero components. This completes the proof of the lemma. ♦

The above lemma shows why we did not pay much attention to when a Markov chain is positive recurrent: once we find a suitable solution to x = xP, we know immediately whether it is positive recurrent or not. We have an immediate corollary.

Corollary 4.1 If states i and j communicate and state i is positive recurrent, then so is state j. ♦

One intermediate result in the proof of the lemma can be summarised as another lemma.

Lemma 4.4 For any state k of an irreducible recurrent Markov chain, the vector ρ(k) satisfies ρ_i(k) < ∞ for all i, and furthermore ρ(k) = ρ(k)P. ♦

The next theorem summarises these results to give a conclusion on the existence of the stationary distribution.
3. Xm = j for all 1 ≤ m ≤ n − 1) = P (Xm = j for 1 ≤ m ≤ n − 1) − P (Xm = j for 0 ≤ m ≤ n − 1) = P (Xm = j for 0 ≤ m ≤ n − 2) − P (Xm = j for 0 ≤ m ≤ n − 1) = an−2 − an−1 with am deﬁned as is. there cannot exist a stationary distribution or it violates the uniqueness conclusion.3 53 An irreducible Markov chain has a stationary distribution π if and only if all the states are positive recurrent. ♦ Proof. we have pij (n) → 0 for all ij as n → ∞. Thus. However. We have already shown that if the chain is positive recurrent.. π is the unique stationary distribution and is given by πj = µ−1 for each i ∈ S.4. X0 = j) = P (X0 = j. we must have πj = i πi pij (n) → 0 by dominate converges theorem. there cannot be another solution x such that xi < ∞. . STATIONARY DISTRIBUTION Theorem 4. X0 = j). Then ∞ π j µj = n=1 P (Tj ≥ nX0 = j)P (X0 = j) = P (Tj ≥ n. Now if the chain is transient. If the chain is recurrent but not positive recurrent. all left to be shown is whether πj = µ−1 for each i ∈ S when the j chain is positive recurrent. If a stationary distribution π exists. P (Tj ≥ n. in this case. where µj is the mean j recurrence time of j. This contradicts the assumption that π is a stationary distribution. Thus. then due to the uniqueness. Thus. sum over n we obtain πj µj = P (X0 = j) + P (X0 = j) − lim an = 1 − lim an . X0 = j) = P (X0 = j) = πj and for n ≥ 2. Suppose X0 has π as its distribution. P (Tj ≥ 1. then the stationary distribution exists and is unique.
Since state j is recurrent, a_n = P(X_m ≠ j for all 0 ≤ m ≤ n) has limit 0. Hence we have shown π_j = μ_j^{−1}. ♦

Now we come back to the uniqueness again; we skipped its general proof earlier. It turns out that the general proof is very tricky.

Suppose {X_n} is an irreducible and recurrent Markov chain with transition probability matrix P, and let x be a solution of x = xP. We construct another Markov chain {Y_n} such that

q_ij(n) = P(Y_n = j | Y_0 = i) = (x_j / x_i) p_ji(n)

for all i, j ∈ S and n. It is easy to verify that the q_ij make a proper transition probability matrix. Since {X_n} is irreducible, it is obvious that {Y_n} is also irreducible, and by checking the sum Σ_n q_ii(n), it further follows that {Y_n} is recurrent.

Let, for i ≠ j,

l_ji(n) = P(X_n = i, X_m ≠ j for 1 ≤ m ≤ n−1 | X_0 = j).

We show that g_ij(n) = P(Y_n = j, Y_m ≠ j for 1 ≤ m ≤ n−1 | Y_0 = i) satisfies

g_ij(n) = (x_j / x_i) l_ji(n).

Let us examine the expressions of g_ij and l_ji. One involves the paths which start from j and end up at state i before they ever visit j again; the other involves the paths which start from i and end up at state j without being there in between. If we reverse the time of the second, then we are working with the same set of sample paths. For each sample path j, k_1, k_2, . . . , k_{n−1}, i corresponding to l_ji(n), its probability of occurrence is

p_{j,k_1} p_{k_1,k_2} · · · p_{k_{n−1},i}.
The reversed sample path corresponding to g_ij(n) is i, k_{n−1}, . . . , k_2, k_1, j, and its probability of occurrence is

q_{i,k_{n−1}} q_{k_{n−1},k_{n−2}} · · · q_{k_2,k_1} q_{k_1,j}
= (x_{k_{n−1}} / x_i) p_{k_{n−1},i} (x_{k_{n−2}} / x_{k_{n−1}}) p_{k_{n−2},k_{n−1}} · · · (x_j / x_{k_1}) p_{j,k_1}
= (x_j / x_i) p_{j,k_1} p_{k_1,k_2} · · · p_{k_{n−1},i}.

Since this is true for every sample path, the result is proved. Now, Σ_n g_ij(n) corresponds to the probability that the chain {Y_n} will ever enter state j; since that chain is recurrent and irreducible, we have Σ_n g_ij(n) = 1. Consequently,

x_j / x_i = [Σ_n l_ji(n)]^{−1},

and hence the ratio, and so x up to a multiplicative constant, is unique. ♦

4.3.1 Limiting Theorem

We have seen that an irreducible Markov chain has a unique stationary distribution when all states are positive recurrent. It turns out that the limits of the transition probabilities p_ij(n) exist as n increases, provided the states are aperiodic. When the chain is periodic, the limit does not always exist. For example, if S = {1, 2} and p_12 = p_21 = 1, then p_11(n) = p_22(n) = I(n is even), so the limit does not exist. When the limits exist, the conclusion is neat.

Theorem 4.4 For an irreducible aperiodic Markov chain, we have

p_ij(n) → μ_j^{−1} as n → ∞

for all i and j. ♦
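Theorem 4.4 can be watched numerically: solve π = πP for a small irreducible aperiodic chain and check that every row of P^n approaches π. The chain below is an arbitrary birth-death-type illustration whose stationary distribution works out to (0.25, 0.5, 0.25).

```python
import numpy as np

# Arbitrary irreducible aperiodic 3-state chain (birth-death type).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# stationary distribution: solve pi (P - I) = 0 together with sum(pi) = 1
A = np.vstack([P.T - np.eye(3), np.ones(3)])
rhs = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(pi)                                  # approx [0.25, 0.5, 0.25]

# every row of P^n converges to pi, as the theorem asserts
Pn = np.linalg.matrix_power(P, 50)
print(np.allclose(Pn, np.tile(pi, (3, 1)), atol=1e-8))  # True
```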
Remarks:
1. This theorem implies that the limit of p_ij(n) does not depend on i. It further implies that P(X_n = j) → μ_j^{−1} irrespective of the distribution of X_0.
2. If the chain is transient or null recurrent (not necessarily irreducible), then it is known that p_ij(n) → 0 for all i and j. Since μ_j = ∞ in this case, the theorem is automatically true.
3. If the chain is positive recurrent, then according to this theorem, p_ij(n) → π_j = μ_j^{−1}.
4. If {X_n} is an irreducible chain with period d, then {X_{nd}} is an aperiodic chain. It follows that p_jj(nd) = P(Y_n = j | Y_0 = j) → d μ_j^{−1} as n → ∞, where Y_n = X_{nd}.

Proof: Case I: the Markov chain is transient. In this case, p_jj(n) → 0 while μ_j = ∞, so the theorem is true.

Case II: the Markov chain is recurrent. We could use the renewal theorem, which states that if the period is one, then u_n → μ^{−1}. Yet let us see another line of approach.

Let {X_n} be the Markov chain with the transition probability matrix P under consideration, and let {Y_n} be an independent Markov chain with the same state space S and the same transition matrix P as {X_n}. Now we consider the stochastic process {Z_n} such that Z_n = (X_n, Y_n), n = 0, 1, 2, . . .. Its state space is S × S, and its transition probabilities are simply products of the original transition probabilities:

p_{(ij)→(kl)} = p_ik p_jl.

The new chain is still irreducible and aperiodic: for an irreducible aperiodic chain, p_ik(n) > 0 for all sufficiently large n, so p_ik(n) p_jl(n) > 0 for all large enough n, and (k, l) is reachable from (i, j).
Case II.1: positive recurrent. In this case, {X_n} has stationary distribution π, and hence {Z_n} has stationary distribution {π_i π_j : i, j ∈ S}. Thus, by the lemma in the last section, {Z_n} is also positive recurrent.

For any state k, define T = min{n : Z_n = (k, k)}, starting from any pair of states (i, j). Due to positive recurrence, P(T < ∞) = 1. The implication? Sooner or later, X_n and Y_n will occupy the same state, and from that moment on, X_{n+m} and Y_{n+m} will have the same distribution.

More precisely, assume (X_0, Y_0) = (i, j). We have

p_ik(n) = P(X_n = k) = P(X_n = k, T ≤ n) + P(X_n = k, T > n)
= P(Y_n = k, T ≤ n) + P(X_n = k, T > n)
≤ P(Y_n = k) + P(T > n) = p_jk(n) + P(T > n).

Due to the symmetry, we also have p_jk(n) ≤ p_ik(n) + P(T > n). Hence

|p_ik(n) − p_jk(n)| ≤ P(T > n) → 0

as n → ∞. (Remark: P(T > n) → 0 is a consequence of the recurrence of {Z_n}.) That is, the limit of p_ik(n), if it exists, does not depend on the initial state.

For the existence, let Y_0 have π as its distribution, so that Y_n has distribution π for every n. Then

π_k − p_jk(n) = Σ_{i∈S} π_i (p_ik(n) − p_jk(n)) → 0.

Case II.2: null recurrent. In this case, {Z_n} = {(X_n, Y_n)} may be transient or null recurrent, so we can no longer automatically claim P(T > n) → 0.
If {(X_n, Y_n)} is still recurrent, P(T > n) → 0 stays, and we still have the conclusion p_ik(n) − p_jk(n) → 0. If {(X_n, Y_n)} is transient, then

P(X_n = j, Y_n = j | X_0 = i, Y_0 = i) = [p_ij(n)]² → 0,

so the conclusion p_ij(n) → 0 remains true. However, the problem becomes a bit touchy because we do not have a stationary distribution handy. What we hope to show is that p_ij(n) → 0 for all i, j ∈ S.

If this were not true, then we could find a subsequence {n_r} such that p_ij(n_r) → α_j as r → ∞ for every j (recall that S is countable), and at least one of the α_j is not zero. Then

α_j = lim_{r→∞} p_ij(n_r) ≤ 1, which implies α = Σ_{j∈S} α_j ≤ 1.

Let F be a finite subset of S. Using the idea of one-step transitions, we have

Σ_{k∈F} p_ik(n_r) p_kj ≤ Σ_{k∈S} p_ik(n_r) p_kj ≤ p_ij(n_r + 1) = Σ_{k∈S} p_ik p_kj(n_r).

Let r → ∞; we have

Σ_{k∈F} α_k p_kj ≤ Σ_{k∈S} p_ik α_j = α_j.

This is true for any finite subset F. Letting F get large, we have

Σ_{k∈S} α_k p_kj ≤ α_j.
The equality has to be true; otherwise,

Σ_{k∈S} α_k = Σ_{k∈S} α_k Σ_{j∈S} p_kj = Σ_{j∈S} [Σ_{k∈S} α_k p_kj] < Σ_{j∈S} α_j,

which is a contradiction. Thus, we would have Σ_{k∈S} α_k p_kj = α_j for each j ∈ S, so that {α_j / α : j ∈ S} is a stationary distribution of the Markov chain. This contradicts the assumption of null recurrence. ♦

Theorem 4.5 For any aperiodic state j of a Markov chain, p_jj(n) → μ_j^{−1} as n → ∞. Furthermore, if i is another state, then p_ij(n) → f_ij μ_j^{−1} as n → ∞. ♦

Corollary 4.2 Let

τ_ij(n) = (1/n) Σ_{m=1}^n p_ij(m)

be the mean proportion of elapsed time up to the nth step during which the chain was in state j, starting from i. If j is aperiodic, then τ_ij(n) → f_ij / μ_j as n → ∞. ♦
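In the irreducible positive recurrent case, f_ij = 1 for all i, j, so Corollary 4.2 says the averages τ_ij(n) approach π_j regardless of the starting state. A short numerical sketch with an arbitrary 3-state chain:

```python
import numpy as np

# Arbitrary irreducible aperiodic chain; here f_ij = 1 for all i, j,
# so tau_ij(n) should approach pi_j regardless of the starting state i.
P = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])

n = 5000
acc = np.zeros_like(P)
Pm = np.eye(3)
for _ in range(n):
    Pm = Pm @ P                 # Pm = P^m
    acc += Pm
tau = acc / n                   # tau[i, j] = (1/n) sum_m p_ij(m)
print(tau)                      # all three rows are nearly identical
```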
4.4 Reversibility

Let {X_n : n = 0, 1, . . . , N} be (part of) a Markov chain whose transition probability matrix is P, and suppose it is irreducible and positive recurrent so that π is its stationary distribution. Suppose all X_n have distribution given by the stationary distribution π. Now let us define Y_n = X_{N−n} for n = 0, 1, . . . , N. It can be shown that {Y_n} is also a Markov chain.

Theorem 4.6 The sequence Y = {Y_n : n = 0, 1, . . . , N} is a Markov chain with transition probabilities

q_ij = P(Y_{n+1} = j | Y_n = i) = (π_j / π_i) p_ji. ♦

Proof: We need only verify the Markov property; the other conditions for a Markov chain are obvious. We have

P(Y_{n+1} = i_{n+1} | Y_k = i_k, k = 0, 1, . . . , n)
= P(Y_k = i_k, k = 0, . . . , n+1) / P(Y_k = i_k, k = 0, . . . , n)
= P(X_{N−k} = i_k, k = 0, . . . , n+1) / P(X_{N−k} = i_k, k = 0, . . . , n)
= [P(X_{N−n−1} = i_{n+1}) P(X_{N−n} = i_n | X_{N−n−1} = i_{n+1})] / P(X_{N−n} = i_n)
= (π_{i_{n+1}} / π_{i_n}) p_{i_{n+1}, i_n}.

Since this transition probability does not depend on i_k, k = 0, . . . , n−1, the Markov property is verified. ♦

Although Y in the above theorem is a Markov chain, it is not in general the same as the original Markov chain.

Definition 4.3 Let {X_n : 0 ≤ n ≤ N} be an irreducible Markov chain such that X_n has the stationary distribution π for all n. The chain is called reversible if the transition matrices of X and its time-reversal Y are the same, which is to say that

π_i p_ij = π_j p_ji
Theorem 4.7 Let P be the transition matrix of an irreducible chain X, and suppose that there exists a distribution π such that πi pij = πj pji for all i, j ∈ S. Then π is a stationary distribution of the chain. Furthermore, X is reversible in equilibrium. ♦

Why do we define reversibility? One advantage is that when a chain is reversible, the transitions between i and j are balanced at equilibrium. This provides us a convenient way to solve for the stationary distribution: verifying the detailed balance equations πi pij = πj pji is usually much easier than solving π = πP directly. We have also used this idea to prove the uniqueness of the solution to π = πP.
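The convenience promised by Theorem 4.7 can be seen on a small birth-death type chain, where detailed balance determines π by a one-line recursion. A minimal sketch, with hypothetical transition probabilities p and q chosen only for illustration:

```python
# Using detailed balance (Theorem 4.7) to find pi for a random walk on
# {0,...,4} with p_{i,i+1} = p, p_{i,i-1} = q, and holding otherwise.
# The values of p and q are illustrative.

p, q = 0.3, 0.2   # up / down probabilities; remaining mass stays put

# Detailed balance pi_i * p = pi_{i+1} * q gives pi_{i+1} = (p/q) * pi_i.
w = [1.0]
for _ in range(4):
    w.append(w[-1] * p / q)
total = sum(w)
pi = [x / total for x in w]

# Verify that pi is stationary for the full transition matrix.
N = 5
P = [[0.0] * N for _ in range(N)]
for i in range(N):
    if i + 1 < N:
        P[i][i + 1] = p
    if i - 1 >= 0:
        P[i][i - 1] = q
    P[i][i] = 1.0 - sum(P[i])

err = max(abs(sum(pi[i] * P[i][j] for i in range(N)) - pi[j]) for j in range(N))
```

Because transitions only occur between neighbouring states, satisfying detailed balance between each neighbouring pair is enough to make π stationary, which the final check confirms.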
Chapter 5

Continuous Time Markov Chain

It is more realistic to consider processes which do not have to evolve at specified epochs. However, setting up continuous time stochastic processes properly involves a lot of effort in mathematics. We now work on two special continuous time stochastic processes first to motivate the continuous time Markov chain.

5.1 Birth Processes and the Poisson Process

These processes may be called counting processes in general. We have a process {N(t) : t ≥ 0} such that (a) N(0) = 0; (b) if s < t, then N(s) ≤ N(t); and N(t) ∈ {0, 1, 2, . . .}. What other detailed properties should we place on it?

Definition 5.1 A Poisson process with intensity λ is a process N = {N(t) : t ≥ 0} taking values in S = {0, 1, 2, . . .} such that
(a) N(0) = 0, and if s < t then N(s) ≤ N(t);
(b) P{N(t + h) = n + m | N(t) = n} = λh + o(h) if m = 1; o(h) if m > 1; and 1 − λh + o(h) if m = 0;
(c) if s < t, the random variable N(t) − N(s) is independent of {N(u) : u ≤ s}. ♦

In general, we call (b) the individuality property and (c) the independence property of the Poisson process. It is well known that these specifications imply that N(t) − N(s) has Poisson distribution with mean λ(t − s) for s < t. The following theorem is a familiar story.

Theorem 5.1 N(t) has the Poisson distribution with parameter λt; that is,
    P(N(t) = j) = (λt)^j exp(−λt) / j!,   j = 0, 1, 2, . . . . ♦

Proof: Denote pj(t) = P(N(t) = j). The properties of the Poisson process lead to the equations
    p′j(t) = λ p_{j−1}(t) − λ pj(t),   j = 1, 2, . . . ,
and likewise p′0(t) = −λ p0(t). With the boundary condition pj(0) = I(j = 0), the equations can be solved and the conclusion proved. ♦

We may view a counting process by recording the arrival time of the nth event. For that purpose, we define T0 = 0 and
    Tn = inf{t : N(t) = n}.
The inter-arrival times are the random variables X1, X2, . . . given by Xn = Tn − Tn−1. The counting process can be easily reconstructed from the Xn's too.
Theorem 5.2 {N(t) : t ≥ 0} is a Poisson process if and only if the inter-arrival times X1, X2, . . . are independent and identically distributed exponential random variables, each with parameter λ. ♦

Proof: It can be shown by working on
    P(Xn+1 > t + Σ_{i=1}^n ti | Xi = ti, i = 1, . . . , n)
and making use of the property of independent increments. ♦

There is a bit of a problem with this proof, as the event we condition on has zero probability. It could be made rigorous with some measure theory results.

The Poisson process is a very satisfactory model for radioactive emissions from a sample of uranium-235, since this isotope has a half-life of 7 × 10^8 years and decays fairly slowly; that is, we have an essentially constant radioactive source over a short time period. For some other kinds of radioactive substances, the rate of emission should depend on the number of emissions already detected. If Xi has an exponential distribution with rate λi instead, then we have a general birth process.

Definition 5.2 A birth process with intensities λ0, λ1, . . . is a process {N(t) : t ≥ 0} taking values in S = {0, 1, 2, . . .} such that
(a) N(0) ≥ 0;
(b) P[N(t + h) = n + m | N(t) = n] = λn h + o(h) if m = 1; o(h) if m > 1; and 1 − λn h + o(h) if m = 0;
(c) if s < t, the random variable N(t) − N(s) is independent of {N(u) : u ≤ s}. ♦

Here is a list of special cases:
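Theorem 5.2 is also a simulation recipe: to generate a Poisson process, sum independent exponential gaps until time t is exceeded. A minimal seeded sketch (the parameter values are illustrative):

```python
# Simulating N(t) by summing exponential inter-arrival times (Theorem 5.2),
# then checking E[N(t)] = lambda * t by Monte Carlo.  Parameters illustrative.
import random

random.seed(0)
lam, t, reps = 2.0, 3.0, 20000

def poisson_count(lam, t):
    """Number of arrivals in [0, t]: count exponential gaps until time t."""
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)   # one inter-arrival time
        if total > t:
            return n
        n += 1

counts = [poisson_count(lam, t) for _ in range(reps)]
mean_count = sum(counts) / reps            # should be close to lam * t = 6
```

With λ = 2 and t = 3 the theoretical mean is λt = 6, and the Monte Carlo average lands within a few standard errors of it.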
(a) Poisson process: λn = λ for all n.
(b) Simple birth: λn = nλ. This is the case when each living individual gives birth independently of the others.
(c) Simple birth with immigration: λn = nλ + ν. In addition to the simple birth, there is a steady source of immigration at constant rate ν.

The differential equations we derived for the Poisson process can be easily generalized. Define pij(t) = P(N(s + t) = j | N(s) = i). We can find two basic sets of them.

Forward system of equations: p′ij(t) = λ_{j−1} p_{i,j−1}(t) − λj pij(t) for j ≥ i.
Backward system of equations: p′ij(t) = λi p_{i+1,j}(t) − λi pij(t) for j ≥ i.

The boundary conditions are pij(0) = δij = I(i = j). The forward equation is obtained by computing the probability of N(t + h) = j conditioning on N(t) = i; the backward equation is obtained by computing the probability of N(t + h) = j conditioning on N(h) = i.

Theorem 5.3 The forward system has a unique solution, which satisfies the backward system. ♦

Proof: First, it is seen that pij(t) = 0 whenever j < i. When j = i, pii(t) = exp(−λi t) is the solution. Using the Laplace transformation reveals the structure of the solution better. Define
    p̂ij(θ) = ∫_0^∞ exp(−θt) pij(t) dt.
Then the forward system becomes
    (θ + λj) p̂ij(θ) = δij + λ_{j−1} p̂_{i,j−1}(θ).
Starting from p̂ii(θ) = 1/(θ + λi) and substituting into the forward recursion, repeating this procedure implies that the forward system has a unique solution.
We obtain
    p̂ij(θ) = [1/(θ + λj)] ∏_{k=i}^{j−1} [λk/(θ + λk)]   for j > i.
The uniqueness is determined by the inversion theorem for Laplace transforms. If the πij(t)'s solve the backward system, their corresponding Laplace transforms will satisfy
    (θ + λi) π̂ij(θ) = δij + λi π̂_{i+1,j}(θ).
It turns out that the transforms satisfying the forward system also satisfy the backward system here. Thus, the solution to the forward equation is also a solution to the backward equation. ♦

An implicit conclusion here is: the backward equation may have many solutions. The key is the assumption that pij(t) = 0 when j < i. When this restriction is removed, it is possible to construct another solution which also satisfies the backward system. This restriction reflects the existence of some continuous time Markov chains which share, to some degree, the same transition probability structure as the birth process. The textbook fails to demonstrate why the proof of uniqueness for the forward system cannot be applied to the backward system.

Theorem 5.4 If {pij(t)} is the unique solution of the forward system, then any solution {πij(t)} of the backward system satisfies pij(t) ≤ πij(t) for all i, j, t. That is, the solution given by the forward equation is the minimal solution. ♦

If {pij(t)} are the transition probabilities of the specified birth process, then they must solve both the forward and backward systems; thus, the solution to the forward system must be the transition probabilities. Conversely, the solution to the forward system can be shown to satisfy the Chapman-Kolmogorov equations; thus, it is a relevant solution. It turns out that if there are many solutions to the backward equation, the forward solution is the minimum one. This could be compared to the problem of the probability of ultimate extinction in the branching process, where the extinction probability is also the smallest solution.
Consequently, no solution can be larger than another solution, and hence the uniqueness is automatic. In that case, the textbook need not have made use of this restriction in proving the uniqueness of the solution to the forward system.

Intuitively, we may expect that
    Σ_{j∈S} pij(t) = 1
for any solution. This constraint does not hold for some birth processes. What is the probabilistic interpretation when
    Σ_{j∈S} pij(t) < 1
for some finite t and i? It implies that within a period of length t, the population size has jumped, or exploded, all the way to infinity. If so, an infinite number of transitions must have occurred. When the birth rates increase with the population size fast enough, the population size may reach infinity in a finite amount of time.

When the solution to the forward equation satisfies Σ_{j∈S} pij(t) < 1 for some t > 0 and i, it is possible to design a new stochastic process whose transition probabilities are given by this solution. The non-uniqueness is exactly built on this observation. See Feller for detailed constructions.

Recall that the waiting time for the next transition when N(t) = n is exponential with rate λn. Let Tn = Σ_{i=1}^n Xi be the waiting time for the nth transition, and define T∞ = lim_{n→∞} Tn.

Definition 5.3 (Honest/dishonest) We call the process N honest if P(T∞ < ∞) = 0, and dishonest otherwise. ♦
1. Theorem 5. we have ∞ E[T∞ ] = n=1 λ−1 . If ∞ λ−1 = ∞. If. Hence. This implies n=1 n dishonesty. however. ♦ The following theorem is an easy consequence. X2 .5 The process N is honest if and only if ∞ n=1 λ−1 = ∞. n −1 λn < ∞ Proof: By deﬁnition of T∞ . when ∞ λ−1 < ∞. and let T∞ = ∞ Xn . it does not imply T∞ = ∞ with any positive probn=1 n ability.5. BIRTH PROCESSES AND THE POISSON PROCESS 69 Let X1 . E[exp(−T∞ )] = n=1 n 0 which contridicts the assumption E[exp(−T∞ )] > 0 made earlier. E[T∞ ] < ∞ and P (T∞ < ∞) = 1. be independent random variables. We show that this is impossible under the current assumption. . (Check its logrithm). Xn having the exponential distribution with parameter λn−1 . P (T∞ < t) > 0 for any t > 0. n According to mathematical analysis result. . then E[exp(−T∞ )] ≥ exp(−t)P (T∞ < t) > 0. We have that n=1 P (T∞ < ∞) = 0 if 1 if ∞ n=1 ∞ n=1 λ−1 = ∞. the product with inﬁnite terms equals inﬁnity when ∞ λ−1 = ∞. . n Hence. n . Note that N E[exp(−T∞ )] = = = = N →∞ lim E n=1 N exp(−Xn ) E exp(−Xn ) N →∞ lim n=1 N N →∞ lim (1 + λ−1 )−1 n n=1 N N →∞ lim [ n=1 (1 + λ−1 )]−1 .
5.1.1 Strong Markov Property

The Markov property for a discrete time stochastic process states: given the present (Xn = i), the future (the Xn+m's) is independent of the past (the Xn−k's). The "present time" here is a nonrandom constant. In the example of the simple random walk, we often claim that once the random walk returns to 0, it renews itself: the future behavior of the random walk does not depend on how the random walk got to 0, nor on when it returned to 0. This example implies that the Markov property is true for some randomly selected times. The question is: what kind of random variable can be used for this purpose?

Suppose {N(t) : t ≥ 0} is a stochastic process and T is a random variable. If the event {T ≤ t} is completely determined by the knowledge of {N(s) : s ≤ t}, then we call T a stopping time for the stochastic process {N(t) : t ≥ 0}.

We may notice that this notion is a bit different from the (weak) Markov property: the present time is now a random variable. The kind of event B which causes the most trouble is one that contains all information about the history of the process before and including time T. If that is the case, then the value of T is completely determined by B, and we may write T = T(B).

Theorem 5.6 (Strong Markov Property) Let N be a birth process and let T be a stopping time for N. Let A be an event which depends on {N(s) : s > T} and B be an event which depends on {N(s) : s ≤ T}. Then
    P(A | N(T) = i, B) = P(A | N(T) = i)
for all i. ♦

Proof: In fact, this is a simple case, as N(T) is a discrete random variable. Among the pieces of information being conditioned on, T is defined based on {N(s) : s ≤ T(B)} and is a constant on the event "N(T) = i, T = T(B)".
As before, the Markov (weak) property allows us to ignore B itself: given N(T) = i, the future depends only on the fact that the chain is in state i now, not on when it first reached state i. At the same time, the part T = T(B) is also uninformative. Hence, using the fact that T is a constant on B together with the weak Markov property, we have
    P(A | N(T) = i, B) = P(A | N(T) = i).

The measure theory proof is as follows. Let H = σ{N(s) : s ≤ T}, the σ-field generated by these random variables; an event B containing historic information about the process up to time T is simply an event in this σ-algebra. Recall the formula E[E(X | Y)] = E(X). Let H play the role of Y, and let E(· | N(T) = i, B) play the role of the expectation. We claim
    E(I(A) | N(T) = i, B) = E[ E(I(A) | N(T) = i, H) | N(T) = i, B ].
Since E(I(A) | N(T) = i, H) = E(I(A) | N(T) = i) — the inner conditional expectation is H-measurable, and B ∈ H — this function is a constant in the eyes of the outer conditioning, and we have
    E{ E(I(A) | N(T) = i) | N(T) = i, B } = E(I(A) | N(T) = i).
Hence the conclusion. If you cannot understand the above proof, working on the discrete time example in the assignment will help. ♦

Example 5.1 Consider the birth process with N(0) = I > 0. Set up the forward system and solve it when λn = nλ. ♦

5.2 Continuous time Markov chains

Let X = {X(t) : t ≥ 0} be a family of random variables taking values in some countable state space S and indexed by the half-line [0, ∞). As before, we shall assume that S is a subset of the nonnegative integers.

Definition 5.1 The process X satisfies the Markov property if
    P(X(tn) = j | X(t1) = i1, . . . , X(tn−1) = in−1) = P(X(tn) = j | X(tn−1) = in−1)
for all j, i1, . . . , in−1 ∈ S and any sequence 0 < t1 < t2 < · · · < tn of times. A continuous time stochastic process X satisfying the Markov property is called a continuous time Markov chain. ♦

One obvious example of a continuous time Markov process is the Poisson process. The pure birth process is not a Markov process when it is dishonest. The reason is that, to qualify as a Markov chain, we would have required {N(t) : t ≥ 0} to be a family of random variables; if the population size can explode to infinity at a finite t, then N(t) is not always a (finite-valued) random variable for a given t.

Definition 5.2 The transition probability pij(s, t) is defined to be
    pij(s, t) = P(X(t) = j | X(s) = i) for s < t.
The chain is called time homogeneous if pij(s, t) = pij(0, t − s) for all i, j, s, t, and we write pij(t − s) for pij(s, t). We also use the notation Pt for the corresponding matrix. ♦

We will assume homogeneity for all continuous time Markov chains to be discussed, unless otherwise specified; that is, the process is time homogeneous.

It turns out that the collection of all transition probability matrices forms a stochastic semigroup. Forming a group would mean we could define an operation on this set that is closed and invertible; a semigroup may not have an inverse for each member in the same set.

Theorem 5.7 The family {Pt : t ≥ 0} is a stochastic semigroup.
That is, it satisfies the following conditions:
(a) P0 = I, the identity matrix;
(b) Pt is stochastic, that is, Pt has nonnegative entries and row sums 1;
(c) the Chapman-Kolmogorov equations hold: Ps+t = Ps Pt if s, t ≥ 0. ♦

Proof: Obvious. ♦

Remark: a quick reference to the mathematical definition reveals that only (c), and sometimes also (a), are properties of a semigroup.

One difficulty in studying the evolution of a continuous time Markov chain is the lack of a one-step transition probability matrix. No matter how short a period of time is, we can always find a shorter period: this is different from the discrete time Markov chain, and there is no t0 such that Pt0 can be used to compute Pt for all t. Is it possible to represent all Pt from a single entity? The answer is positive if the semigroup satisfies some properties. This observation gives rise to the need for an infinitesimal generator.

Definition 5.3 The semigroup {Pt : t ≥ 0} is called standard if Pt → I as t ↓ 0, which is to say that pii(t) → 1 and pij(t) → 0 for i ≠ j. ♦

Being standard means that every entry of Pt is a continuous function of t, not only at t = 0 but also at any t > 0, by the Chapman-Kolmogorov equations. It does not imply, though, that they are differentiable at t = 0. Not every Markov chain has this property.

Suppose that X(t) = i at the moment t. What might happen in the next short period (t, t + h)? Maybe nothing will happen; the corresponding chance is pii(h) + o(h). It may switch into state j, with corresponding chance pij(h) + o(h). The chance of two or more transitions is assumed to be o(h) here. It turns out that when the semigroup is standard, the following claim is true:
    pij(h) = gij h + o(h),   pii(h) = 1 + gii h + o(h),   (5.1)
where the gij's are a set of constants such that gij ≥ 0 for i ≠ j and −∞ ≤ gii ≤ 0 for all i. The book mentions that this can be proved; I am somewhat skeptical, so I take it as an assumption.

For a continuous time Markov chain, we set up a matrix G of the gij's and call it the infinitesimal generator. Unless additional restrictions are applied, some gii may take the value −∞, in which case I guess the interpretation is [pii(h) − 1]/h → −∞. Due to the fact that some gii can be negative infinity, the relationship above does not apply to all Markov chains. Barring such possibilities, linking (5.1) to the row sums of Pt, we must have
    Σ_j gij = 0 for all i.

Once we admit the validity of (5.1), we can easily obtain the forward and backward equations.

Forward equations:  P′t = Pt G.
Backward equations: P′t = G Pt.

The validity of (5.1) does not guarantee the uniqueness of the solution to these two systems. It is known that the backward equations are satisfied when the semigroup is standard. The forward equations are proved when the semigroup is uniform, which will be discussed a bit further on. When all entries of G are bounded by a common constant in absolute value, the two systems share a unique solution for Pt, and the solution can be written in the form
    Pt = Σ_{n=0}^∞ (t^n / n!) G^n,
which we may also write as Pt = exp(tG). We will discuss the conditions under which this result is guaranteed.

Example 5.2 Birth process. Write down the infinitesimal generator here. ♦
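The series Pt = Σ (t^n/n!) G^n is directly computable for a small bounded generator. A sketch for the two-state chain (the rates a and b are illustrative), compared against the known closed form for p00(t):

```python
# Computing P_t = exp(tG) by the truncated series sum t^n G^n / n! for the
# two-state generator G = [[-a, a], [b, -b]], and comparing with the closed
# form p_00(t) = b/(a+b) + a/(a+b) * exp(-(a+b) t).  Values illustrative.
import math

a, b, t = 1.5, 0.5, 0.7
G = [[-a, a], [b, -b]]

def mat_mult(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pt = [[1.0, 0.0], [0.0, 1.0]]      # n = 0 term: the identity
term = [[1.0, 0.0], [0.0, 1.0]]    # running power G^n
for n in range(1, 31):
    term = mat_mult(term, G)
    coef = t ** n / math.factorial(n)
    Pt = [[Pt[i][j] + coef * term[i][j] for j in range(2)] for i in range(2)]

closed_p00 = b / (a + b) + a / (a + b) * math.exp(-(a + b) * t)
```

The truncated series reproduces the closed form to machine precision, and each row of Pt sums to 1, as a stochastic matrix must.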
Recall that for a Markov chain with a standard semigroup of transition probability matrices, the instantaneous rates gii could be negative infinity. In that case, the moment of the Markov chain entering state i is also the moment of leaving the state: state i is instantaneous. Barring such possibilities by assuming the gij have a common bound in absolute value, we have the following result.

Theorem 5.8 Under some conditions, the random variable Ui is exponentially distributed with parameter −gii, where Ui is the waiting time for the Markov chain to leave state i. Further, the probability that the chain jumps to state j from state i is −gij/gii. ♦

The result on the exponential distribution is the product of the Markov property: the Markov property implies the memoryless property, and the latter implies the exponential distribution. That is, the waiting time for the Markov chain to leave state i has the memoryless property, and the only hassle is identifying the parameter −gii. The proof of the second part is less rigorous: taking a small positive value h, we consider the jump probability under the assumption x < Ui < x + h, and the conditional probability converges to −gij/gii when the boundedness conditions on the gii are satisfied. As long as −gii is finite for all i, all the conclusions are valid; therefore, the proof is good enough.

Example 5.3 A continuous time Markov chain with a finite state space is always standard and uniform. ♦

Example 5.4 A continuous time Markov chain with only two possible states can have its forward and backward equations solved easily. ♦
5.3 Limiting probabilities

Our next issue is about limiting probabilities. Similar to the discrete time Markov chain, we first need to examine the issue of irreducibility. We say that a continuous time Markov chain is irreducible when pij(t) > 0 for all i, j and t > 0. It turns out that for any states i and j, if pij(t) > 0 for some t > 0, then pij(t) > 0 for all t > 0, by the Chapman-Kolmogorov equations. This is also due to the fact that Pt is a transition probability matrix of some discrete time Markov chain which is irreducible.

Definition 5.1 The vector π is a stationary distribution of the continuous time Markov chain if
    πj ≥ 0,  Σ_j πj = 1,  and π = πPt for all t ≥ 0. ♦

If X(t) has distribution π, then X(t + s) also has distribution π for all s ≥ 0. When the chain is irreducible, such a distribution, if it exists, must be unique and nonnegative.

Solving the equations π = πPt may not be convenient, as the Pt's are hard to specify. We hence seek help from the following theorem.

Theorem 5.9 If the Markov chain has a uniform semigroup of transition probability matrices, then π = πPt for all t if and only if πG = 0. ♦

The proof is simple. When the Markov chain has a uniform semigroup, we have Pt = exp(tG). Note that when πG = 0, it is easy to show that every term with n ≥ 1 in the power expansion of π exp(tG) is zero, so π = πPt. Conversely, if π(Pt − I) = 0 for all t, then, since an analytic function is a zero function if and only if all its coefficients are zero, every coefficient in the power expansion vanishes; in particular, this implies πG = 0.

The textbook does not specify the uniformity condition. The result is true for more general Markov chains, such as the birth and death process.
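Theorem 5.9 can be checked concretely on the two-state chain of Example 5.4, where πG = 0 has the explicit solution π = (b, a)/(a + b) and Pt has a known closed form. A sketch with illustrative rates:

```python
# For the two-state generator G = [[-a, a], [b, -b]], pi G = 0 gives
# pi = (b, a)/(a+b).  Theorem 5.9 says pi = pi P_t for every t; we check
# this against the closed form for P_t.  Rates a, b are illustrative.
import math

a, b = 1.5, 0.5
pi = (b / (a + b), a / (a + b))

# pi G = 0, checked componentwise.
g_residual = max(abs(pi[0] * (-a) + pi[1] * b),
                 abs(pi[0] * a + pi[1] * (-b)))

def P_t(t):
    e = math.exp(-(a + b) * t)
    f = a + b
    return [[b / f + a / f * e, a / f - a / f * e],
            [b / f - b / f * e, a / f + b / f * e]]

stationary_err = max(
    abs(pi[0] * P_t(t)[0][j] + pi[1] * P_t(t)[1][j] - pi[j])
    for t in (0.1, 1.0, 5.0) for j in (0, 1)
)
```

Both residuals vanish: the single linear condition πG = 0 replaces the infinite family of conditions π = πPt.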
Theorem 5.10 Let X be irreducible with a standard semigroup {Pt} of transition probabilities.
(a) If there exists a stationary distribution π, then it is unique and pij(t) → πj as t → ∞ for all i and j.
(b) If there is no stationary distribution, then pij(t) → 0 as t → ∞ for all i and j. ♦

Proof: For any h > 0, we may define Yn = X(nh). Then {Yn} is a discrete time Markov chain which is irreducible and aperiodic. If it is non-null recurrent, then it has a unique stationary distribution π(h), and
    pij(nh) = P(Yn = j | Y0 = i) → πj(h).
For any two rational numbers h1 and h2, applying the above result implies that πj(h1) = πj(h2). Next, use the continuity of pij(t) to fill up the gaps for all real numbers. If the chain is null recurrent or transient, we would have pij(nh) = P(Yn = j | Y0 = i) → 0 for all i and j. ♦

Remark: Unlike other results, we are told that the conclusion is true whenever the transition matrices form a standard semigroup.

5.4 Birth-death processes and imbedding

The birth and death process is a more realistic model for population evolution. Suppose the random variable X(t) represents the population size at time t.
(a) Its state space is S = {0, 1, 2, . . .}.
(b) Its infinitesimal transition probabilities are given by
    P(X(t + h) = n + m | X(t) = n) = λn h + o(h) if m = 1; µn h + o(h) if m = −1; o(h) if |m| > 1.
(c) The "birth rates" λ0, λ1, . . . and "death rates" µ0 = 0, µ1, µ2, . . . satisfy λn ≥ 0 and µn ≥ 0.

The infinitesimal generator G has a nice form:

    G = [ −λ0        λ0           0            0            0     · · ·
           µ1   −(λ1 + µ1)       λ1            0            0     · · ·
            0        µ2     −(λ2 + µ2)        λ2            0     · · ·
            0         0          µ3      −(λ3 + µ3)        λ3     · · ·
           ...       ...        ...          ...          ...     · · · ]

Note that in this case, we have to assume that the process is honest in order to maintain that it is a continuous time Markov chain. It seems the standard semigroup condition is satisfied as long as all rates are finite. The chain is uniform if and only if sup_n {λn + µn} < ∞. Hence, the conclusion about the existence of the limiting probabilities is solid.

When λ0 = 0, the chain is reducible and state 0 becomes absorbing. If λn = 0 for some n > 0 and X(0) = m < n, then the population size will be bounded by n. The transition probabilities pij(t) may be computed in principle, but may not be so useful.

Due to the memoryless property, the waiting time for the next transition has an exponential distribution with rate λn + µn when X(t) = n. The probability that the next transition is a birth is given by λn/(λn + µn).

The process has a stationary distribution under some simple conditions:
    πn = [λ0 λ1 · · · λ_{n−1} / (µ1 µ2 · · · µn)] π0,
with
    π0 = [ Σ_{n=0}^∞ λ0 λ1 · · · λ_{n−1} / (µ1 µ2 · · · µn) ]^{−1},
where the n = 0 term of the sum is interpreted as 1. The limit exists when π0 > 0, that is, when the series converges.
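The product formula for πn is easy to evaluate numerically. A sketch for the immigration-death rates λn = λ, µn = nµ (treated analytically in the next example), where the formula should recover the Poisson(ρ) weights; the parameter values are illustrative:

```python
# The birth-death stationary formula
#   pi_n = (lambda_0 ... lambda_{n-1}) / (mu_1 ... mu_n) * pi_0,
# specialised to lambda_n = lam and mu_n = n * mu, should give the
# Poisson(rho) distribution with rho = lam/mu.  Parameters illustrative.
import math

lam, mu = 2.0, 1.0
rho = lam / mu
N = 40   # truncation level; the neglected tail is negligible here

w = [1.0]
for n in range(1, N + 1):
    w.append(w[-1] * lam / (n * mu))   # multiply by lambda_{n-1} / mu_n
pi0 = 1.0 / sum(w)
pi = [pi0 * x for x in w]

poisson_err = max(abs(pi[n] - rho ** n * math.exp(-rho) / math.factorial(n))
                  for n in range(N + 1))
```

The recursion produces w_n = ρ^n/n!, so π0 ≈ e^{−ρ} and πn matches the Poisson probability mass function.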
Example 5.5 Simple death with immigration. Consider the situation where the birth rates are λn = λ for all n and the death rates are µn = nµ for n = 0, 1, 2, . . . . That is, there is a constant source of immigration, and each individual in the population has the same rate of dying. It is very simple to work out its limiting probabilities.

Theorem 5.11 In the limit as t → ∞, X(t) is asymptotically Poisson distributed with parameter ρ = λ/µ. That is,
    P(X(t) = n) → (ρ^n / n!) exp(−ρ),   n = 0, 1, 2, . . . . ♦

The proof is very simple. Let us work out the distribution of X(t) directly. Assume X(0) = I. Using the transition probability language, the book states that
    pij(h) = P(X(t + h) = j | X(t) = i)
           = P(j − i arrivals, 0 deaths) + o(h) if j ≥ i;  P(i − j deaths, 0 arrivals) + o(h) if j < i.
Since the chance of two or more transitions in a short period of length h is o(h), this reduces to
    p_{i,i+1}(h) = λh + o(h),  p_{i,i−1}(h) = (iµ)h + o(h),
and pij(h) = o(h) when |i − j| ≥ 2. Let pj(t) = P(X(t) = j | X(0) = I). By the forward equations, we have
    p′j(t) = λ p_{j−1}(t) − (λ + jµ) pj(t) + µ(j + 1) p_{j+1}(t) for j ≥ 1,
and
    p′0(t) = −λ p0(t) + µ p1(t).
Note that the probability generating function of X(t) is given by G(s, t) = E[s^{X(t)}]. We have
    ∂G/∂s = Σ_{j=0}^∞ j s^{j−1} pj(t)
and
    ∂G/∂t = Σ_{j=0}^∞ s^j p′j(t).
Hence, by multiplying s^j on both sides of the forward system and summing up, we have
    ∂G/∂t (s, t) = λ(s − 1) G(s, t) − µ(s − 1) ∂G/∂s (s, t).
It is easy to verify that
    G(s, t) = {1 + (s − 1) exp(−µt)}^I exp{ρ(s − 1)(1 − e^{−µt})}
satisfies this equation. Due to the uniqueness of the solution to the forward system, we do not have to look for other solutions anymore. It is interesting to see that if I = 0, then the distribution of X(t) is Poisson; otherwise, given X(0) = I, the distribution is a convolution of a binomial distribution and a Poisson distribution. ♦

Example 5.6 Simple birth-death. When the birth and death rates are given by λn = nλ and µn = nµ, we can work out the problem in exactly the same way. The corresponding differential equation is given by
    ∂G/∂t (s, t) = (λs − µ)(s − 1) ∂G/∂s (s, t).
It turns out that, given X(0) = I, the generating function of X(t) is
    G(s, t) = [ (µ(1 − s) − (µ − λs) exp{−t(λ − µ)}) / (λ(1 − s) − (µ − λs) exp{−t(λ − µ)}) ]^I.
Does it look like a convolution of one binomial and one negative binomial distribution? One can find the mean and variance of X(t) from this expression:
    E[X(t)] = I exp{(λ − µ)t},
    Var(X(t)) = [(λ + µ)/(λ − µ)] exp{(λ − µ)t} [exp{(λ − µ)t} − 1] I.
The limit of the expectation is either 0 or infinity depending on whether the ratio λ/µ is smaller or larger than 1. The probability of extinction is given by the limit of η(t) = P(X(t) = 0), which is min(ρ^{−1}, 1) with ρ = λ/µ. ♦

5.5 Embedding

If we ignore the lengths of the times between consecutive transitions in a continuous time Markov chain, we obtain a discrete time Markov chain. In the example of linear birth and death, the waiting time for the next birth or death has an exponential distribution with rate n(λ + µ) given X(t) = n. The probability that the transition is a birth is (nλ)/[n(λ + µ)] = λ/(λ + µ). Think of a transition as the movement of a particle from the integer n to the new integer n + 1 or n − 1: the particle is performing a simple random walk with probability p = λ/(λ + µ) of stepping up. The state 0 is an absorbing state. The probability that the random walk will be absorbed at state 0 is given by min(1, q/p).

5.6 Markov chain Monte Carlo

In statistical applications, we often need to compute the joint cumulative distribution functions of several random variables. For example, we might be interested in the distribution of s²_n = (n − 1)^{−1} Σ_{i=1}^n (Xi − X̄n)², where X1, . . . , Xn are independent and identically distributed random variables and X̄n = n^{−1} Σ_i Xi. When X1 has a normal distribution, the distribution of s²_n is known to be (a multiple of a) chi-square with n − 1 degrees of freedom; the density function is well known, and the value of P(s²_n ≤ x) can be easily obtained via numerical integration. If X1 has an exponential distribution, the density function of s²_n is no longer available in a simple analytical form, and straightforward numerical integration becomes very difficult. If one can generate 10,000 sets of independent random variables Y1, . . . , Yn such that they all have the exponential distribution, then
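The mean formula E[X(t)] = I e^{(λ−µ)t} for the simple birth-death process can be checked by simulating the process exactly as the embedding discussion describes: exponential holding times with rate n(λ + µ) and birth probability λ/(λ + µ). A seeded sketch with illustrative parameters:

```python
# Monte Carlo check of E[X(t)] = I exp((lam - mu) t) for the simple
# birth-death process (lambda_n = n*lam, mu_n = n*mu).  Each event time is
# exponential with rate x*(lam + mu); each event is a birth with probability
# lam/(lam + mu).  Parameter values are illustrative.
import math
import random

random.seed(2)
lam, mu, I, t_end, reps = 1.0, 0.5, 5, 1.0, 20000

def simulate(I, t_end):
    x, t = I, 0.0
    while x > 0:                       # state 0 is absorbing
        t += random.expovariate(x * (lam + mu))
        if t > t_end:
            break
        x += 1 if random.random() < lam / (lam + mu) else -1
    return x

mean_x = sum(simulate(I, t_end) for _ in range(reps)) / reps
expected = I * math.exp((lam - mu) * t_end)   # theoretical mean
```

With I = 5, λ = 1, µ = 0.5 and t = 1, the theoretical mean is 5e^{0.5} ≈ 8.24, and the Monte Carlo average should land within a few standard errors of it.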
we may compute 10,000 values of s²_n. If 12% of them are smaller than x = 5, then P(s²_n ≤ 5) can be approximated by 12%. The precision improves as we generate more and more sets of such random variables.

How do we generate random variables with a specified distribution very quickly? It is found that some seemingly simple functions can produce very unpredictable outcomes: when such an operation is applied repeatedly, the outcomes appear to be completely random. A so-called pseudo random number generator for the uniform distribution is hence constructed. Starting from this point, one can generate random numbers from most commonly used distributions, whether they are discrete or continuous.

In some applications, especially for Bayesian analysis, it is often necessary to generate random numbers from a density function that is not fully specified: we may only know that the corresponding density function π(θ) is proportional to some known function. In theory, one needs only rescale the function so that the total probability becomes 1; in reality, finding the scale itself is numerically impractical.

A more general problem is to compute
    ∫ g(θ) π(θ) dθ, or Σ_{θ∈Θ} g(θ) π(θ),
over θ ∈ Θ, where π(θ) is a density function or probability mass function on Θ. If we can generate random variables X1, X2, . . . such that they all take values in Θ and have π(θ) as their density function or probability mass function, then the law of large numbers in probability theory implies
    n^{−1} Σ_i g(Xi) → Eπ{g(X1)}.
Thus, the value of n^{−1} Σ_i g(Xi) provides an approximation to ∫ g(θ) π(θ) dθ.

The beauty of Markov chain Monte Carlo (MCMC) is to construct a discrete time Markov chain whose stationary distribution is the same as the given density function π(θ), without the need to completely specify π(θ). How this is achieved is the topic of this section. Before this idea becomes applicable, we need to overcome a technical difficulty.
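The Monte Carlo computation described above can be sketched directly. The sample size n = 10 and the unit exponential rate below are illustrative choices, since the text leaves them unspecified; the one quantity we can verify is the law-of-large-numbers limit E[s²_n] = Var(Y1) = 1 for rate-1 exponentials.

```python
# Estimating the distribution of s_n^2 for i.i.d. exponential data by
# simulation, as described above.  n = 10 and rate 1 are illustrative.
import random

random.seed(3)
n, reps = 10, 10000

def s2():
    y = [random.expovariate(1.0) for _ in range(n)]
    ybar = sum(y) / n
    return sum((v - ybar) ** 2 for v in y) / (n - 1)

values = [s2() for _ in range(reps)]
p_below_5 = sum(v <= 5 for v in values) / reps  # estimates P(s2_n <= 5)
mean_s2 = sum(values) / reps                    # LLN: converges to Var(Y1) = 1
```

The sample variance is unbiased for Var(Y1) = 1, so `mean_s2` should sit close to 1; `p_below_5` is the simulated analogue of the "12%" figure in the text, for these particular (assumed) parameters.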
We ﬁrst generate a uniform [0. We construct a discrete time Markov chain with state space given by Θ. j ∈ Θ. and its stationary distribution is given by π. with an arbitrary starting value X0 . we pick an arbitrary stochastic matrix H = (hij : i. . .6. It turns out that the balance equation is satisﬁed when we choose aij = min{1. we try to ﬁnd a Markov chain with its transition probabilities satisfying πk pkj = πj pjk for all k. we consider it feasible. This matrix is called the ‘proposal matrix’.5. Thus. For a given π. With the given A. (πj hji )/(πi hij )}. The one which is reversible at equilibrium may have some advantage. We generate a random number Y according to the distribution given by (hik : k ∈ Θ). Our next task is to choose H and {A} such that the stationary distribution is what we are look for: π. Hence we deal with discrete π(θ). We need to generate Xn+1 according to some transition probability. j ∈ Θ) be a matrix with entries between 0 and 1. we have many choices of such Markov chains. X1 . Assume we have Xn = i already. MARKOV CHAIN MONTE CARLO 83 In the eyes of a computer. The aij are called ‘acceptance probabilities’. P (Y = k) = hik . . we have created a sequence of random numbers X0 . . the transition probabilities are given by pij = hij aij 1− if i = j if i = j k:k=i hik aik This is because that the transition to j occurs only if both Y = j and Z < aij when Xn = j. j ∈ Θ). (2) Select a matrix A = (aij : i. 1] random variable Z and deﬁne Xn+1 = Xn I(Z > aij ) + Y I(Z < aij ). Repeat this two steps. The following steps will create such a discrete time Markov chain. Since hik are well deﬁned. That is. (1) First. Assume the probability function π = (πi : i ∈ Θ) is our target. nothing is continuous.
This choice results in the algorithm called the Hastings algorithm. Let us try to verify this result. Consider two states i ≠ j. We have

πi pij = πi hij aij = min{πi hij, πj hji};
πj pji = πj hji aji = min{πj hji, πi hij}.

That is, πi pij = πj pji. When two states are the same, the balance equation is obviously satisfied. Note also that in this algorithm, we do not need the knowledge of the πi themselves, but only the ratios πj/πi for all i and j.

Metropolis algorithm. If the matrix H is symmetric, then aij = min{1, πj/πi} and hence pij = hij min{1, πj/πi} for i ≠ j.

Gibbs sampler, or heat bath algorithm. Assume Θ = S^V. The state space S is finite and so is the dimension V. Each state in Θ can be written as i = (iw : w = 1, 2, . . . , V). When Θ is multidimensional, there is some advantage in generating the random vectors one component at a time. That is, if Xn is the current random vector, we make Xn+1 equal to Xn except for one of its components. We need to determine the choice of the component v; the ultimate goal can be achieved by randomly selecting a component, or by rotating through the components, each time we generate the next random vector.

Let Θi,v = {j ∈ Θ : jw = iw for w ≠ v}. That is, Θi,v is the set of all states which have the same components as state i other than possibly the vth component. Given Xn = i, we now choose a state from Θi,v for Xn+1, where the choice within Θi,v is based on the conditional distribution of the vth component given the others. That is, we select

hij = πj / Σ_{k ∈ Θi,v} πk, for j ∈ Θi,v.

It turns out that aij = 1 for all j ∈ Θi,v; hence there is no need for the second step.

Similarly, we may instead place a uniform distribution on Θi,v in the first step of the Hastings algorithm, with Θi,v as defined for the Gibbs sampler. That is, hij = [|Θi,v| − 1]^{-1} for j ∈ Θi,v with j ≠ i. The next step of the Hastings algorithm is then the same as before.
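The component-at-a-time idea can be sketched on the toy space Θ = {0, 1}², again with unnormalised weights of our own choosing (the table, sample size and seed are arbitrary, not from the notes):

```python
import random

random.seed(3)

# Unnormalised target on {0,1}^2: pi(i1, i2) proportional to w[i1][i2].
w = [[1.0, 2.0], [3.0, 4.0]]

def gibbs_step(state, v):
    """Resample component v from its conditional distribution given the rest."""
    s = list(state)
    s[v] = 0
    p0 = w[s[0]][s[1]]
    s[v] = 1
    p1 = w[s[0]][s[1]]
    s[v] = 0 if random.random() < p0 / (p0 + p1) else 1
    return tuple(s)

counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
x = (0, 0)
for n in range(200_000):
    x = gibbs_step(x, n % 2)   # rotate through the two components
    counts[x] += 1
freq = {k: c / 200_000 for k, c in counts.items()}
```

The visit frequencies should approach the normalised table w/10, i.e. (0.1, 0.2, 0.3, 0.4); no acceptance step is ever needed.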
Even though the limiting distribution of Xn is π, the distribution of Xn may be far from π until n is huge. It is also very hard to tell how large n has to be before the distribution of Xn well approximates the stationary distribution. In applications, researchers often propose to make use of a burn-in period M. That is, we throw away X1, X2, . . . , XM for a large M and start using Y1 = XM+1, Y2 = XM+2 and so on to compute various characteristics of π. For example, we estimate Eπ g(θ) by n^{-1} Σ_{i=1}^n g(XM+i). It is hence very important to have some idea of how large this M must be before the random vectors (numbers) can be used.

Fortunately, some guidance comes from the eigenvalues of the transition matrix. Let P be the transition probability matrix of a finite irreducible Markov chain with period d. Then:
(a) λ1 = 1 is an eigenvalue of P;
(b) the d complex roots of unity, λ1 = ω^0, λ2 = ω^1, . . . , λd = ω^{d−1}, where ω denotes a primitive dth root of unity, are eigenvalues of P;
(c) the remaining eigenvalues λ_{d+1}, . . . , λN satisfy |λj| < 1.

When all eigenvalues are distinct, P = B^{-1}ΛB for some matrix B and a diagonal matrix Λ with entries λ1, . . . , λN. Once such a decomposition is found, the n-step transition probability matrix is easily obtained: P^n = B^{-1}Λ^n B. When not all eigenvalues are distinct, the middle factor cannot be made diagonal any more. The best we can do is a block diagonal matrix M = diag(J1, J2, . . .) with Jordan blocks of the form

Ji = [ λi  1   0   0
       0   λi  1   0
       0   0   λi  1
       0   0   0   λi ],

and we can still decompose P as B^{-1}MB. The powers M^n still have a very simple form, so this decomposition again allows us to compute the n-step transition probability matrix easily.
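The geometric convergence of the rows of P^n toward π can be watched directly by raising a small transition matrix to a power; when the rows agree to many digits, the burn-in is certainly over. A stdlib-only sketch with a toy three-state chain of our own invention:

```python
def mat_mult(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A toy aperiodic irreducible chain on three states.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

Pn = P
for _ in range(49):          # compute the 50-step matrix P^50
    Pn = mat_mult(Pn, P)

# Every row of P^n approaches the stationary distribution pi, so the
# maximal disagreement between two rows measures distance from stationarity.
row_gap = max(abs(Pn[0][j] - Pn[2][j]) for j in range(3))
```

For this chain |λ2| = 0.3, so after 50 steps the rows agree to far more than ten decimal places.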
Remember that the limiting probability is closely linked to the n-step transition probability matrix. Using the decomposition above, it can be shown that

p_{ij}^{(n)} = πj + O(n^{m−1} |λ2|^n),

where λ2 is the eigenvalue with the second largest modulus and m is the multiplicity of this eigenvalue. Note that the transition probability converges to πj at an exponential rate when m = 1. Suppose in addition that the Markov chain in an MCMC algorithm is aperiodic and reversible; then the following bound holds.

Theorem 5.12 Let X be an aperiodic irreducible reversible Markov chain on the finite state space Θ, with transition probability matrix P and stationary distribution π. Then, for all i, k ∈ Θ and n ≥ 1,

|p_{ik}^{(n)} − πk| ≤ |Θ| · |λ2|^n sup{|νr(i)| : r ∈ Θ},

where νr(i) is the ith term of the rth right eigenvector Vr of P. ♦
Chapter 6

General Stochastic Process in Continuous Time

The term stochastic process generally refers to a collection of random variables denoted as {X(t), t ∈ T}. Usually, these random variables are linked by some index generally referred to as time, and T is a set of time points. When T contains integer values only, we get a discrete time stochastic process, while when T is an interval of the real numbers, we have a continuous time stochastic process. There are examples other than T being time, for instance when T represents geographic location.

Let us recall that a random variable is a measurable function on a probability space (Ω, F, P). That is, for each t ∈ T, S(t) is a map of the form S(t) : Ω → R, where R is the set of all real numbers. At any sample point ω ∈ Ω, S(t) takes a real value S(ω, t). It becomes apparent that S(ω, t) is a bivariate function: one of its variables is time t, and the other variable is the sample point ω.

6.1 Finite Dimensional Distribution

Let us make the simplifying assumption that T = [0, T] for some positive number T. Given t,
we get a random variable. At the same time, once we fix ω = ω0, S(ω0, t) is a function of t. This function is called a sample path of S(t). We denote by s(t) a realized sample path of S(t); this is analogous to using x for a realized value of a random variable X.

Recall that a random variable is required to be a measurable map. To properly investigate the properties of a random variable X, we want to be able to compute P(X ≤ x) for any real number x. We refer to F(x) = P(X ≤ x) as the cumulative distribution function of X, or simply the distribution of X. We may also put it as follows: a random variable is a map from the sample space to the space of real numbers, while a stochastic process is a map from the sample space to a space of real valued functions. Hence, there must be some kind of measurability requirement on a stochastic process. That means, for example, that for any t1, t2 ∈ [0, T], the set {S(t1) ≤ s1, S(t2) ≤ s2} must be a member of F, so that P(S(t1) ≤ s1, S(t2) ≤ s2) is well defined. It is easy to see that this requirement extends from two time points to any finite number of time points, and to any countable number of time points. It can be shown that if we can give a proper probability to all such sets (called cylinder sets), then the probability measure can be extended uniquely to the smallest σ-algebra that contains all cylinder sets.

Compared to the space of real numbers, the space of real functions on [0, T] is much more complex. The space of all continuous functions, denoted as C[0, T], has a simpler structure. The well known Poisson process, however, does not have continuous sample paths, and in applications stochastic processes with non-continuous sample paths are useful. A slight extension of C[0, T] is to add functions which are right-continuous with left limits. These functions are referred to as regular right-continuous, or càdlàg, functions. We denote this space as D[0, T]. It turns out that D[0, T] is large enough for usual applications, yet still simple enough to allow a useful σ-field. The problem of setting up a suitable σ-algebra is beyond this course.

A consequence of this discussion is the following theorem.
Theorem 6.1 If S(t) is a stochastic process taking values in D[0, T], then its distribution is uniquely determined by its finite dimensional distributions. ♦

6.2 Sample Path

Let us have a look at the following example.

Example 6.1 Let X(t) = 0 for all t ∈ [0, 1]. Let τ be a uniformly distributed random variable on [0, 1], and let Y(t) = 0 for t ≠ τ and Y(t) = 1 if t = τ, 0 ≤ t ≤ 1. Now we have two stochastic processes: {X(t) : t ∈ [0, 1]} and {Y(t) : t ∈ [0, 1]}. It appears that they are very different. Yet we notice that for any fixed t, P(Y(t) ≠ 0) = P(τ = t) = 0. That is, P(Y(t) = 0) = 1, so X(t) and Y(t) have the same distribution for any t ∈ [0, 1]. Further, for any set of 0 ≤ t1 < t2 < · · · < tn ≤ 1, the joint distribution of {X(t1), X(t2), . . . , X(tn)} is identical to that of {Y(t1), Y(t2), . . . , Y(tn)}. That is, X(t) and Y(t) have the same finite dimensional distributions. At the same time, every sample path of X is the identically 0 function, while every sample path of Y(t) has a jump point at t = τ. ♦

The moral of this example is: even if two stochastic processes have the same distribution, they can still have very different sample paths. This observation prompts the following definition.

Definition 6.1 Two stochastic processes are called a version (modification) of one another if P{X(t) = Y(t)} = 1 for all t. ♦
In the case when a number of versions exist for a given distribution, a stochastic process with the smoothest sample paths will be the version for investigation. Note that having P{X(t) = Y(t)} = 1 does not mean the event {X(t) ≠ Y(t)} is an empty set; it merely has probability zero. Let this set be called Nt. While the probability of each Nt is zero, the probability of their union over t ∈ [0, T] can be nonzero; in the above example, it equals 1. When P(∪_{0 ≤ t ≤ T} Nt) = 0, the two stochastic processes are practically the same: they are indistinguishable.

Given the distribution of a stochastic process, can we always find a version of it whose sample paths are continuous, or regular (with only jump discontinuities)?

Theorem 6.2 Let S(t), 0 ≤ t ≤ T, be a real valued stochastic process.
(a) A continuous version of S(t) exists if we can find positive constants α, ε and C such that

E|S(t) − S(u)|^α ≤ C|t − u|^{1+ε} for all 0 ≤ t, u ≤ T.

(b) A regular version of S(t) exists if we can find positive constants α1, α2, ε and C such that

E{|S(u) − S(v)|^{α1} |S(t) − S(v)|^{α2}} ≤ C|t − u|^{1+ε} for all 0 ≤ u ≤ v ≤ t ≤ T. ♦

There is no need to memorize this theorem. What should we make out of it? Under some continuity conditions on expectations, a regular enough version of a stochastic process exists. We hence have a legitimate basis for investigating stochastic processes with this property.

Finally, let us summarize the results we discussed in this section. The distribution of a stochastic process is determined by its finite dimensional distributions. Even if two stochastic processes have the same distribution, their sample paths may have very different properties. However, for each given stochastic process, there often exists a continuous
version, or a regular version, whether or not the given process itself has continuous sample paths. Ultimately, in most applications we focus on the most convenient version of the given stochastic process.
6.3 Gaussian Process
Let us make it super simple.

Definition 6.1 Let {X(t) : 0 ≤ t ≤ T} be a stochastic process. X(t) is a Gaussian process if all its finite dimensional distributions are multivariate normal. ♦

We are at ease when working with normal and multivariate normal random variables. Thus, we also feel more comfortable with Gaussian processes when we have to deal with continuous time stochastic processes taking arbitrary real values. We will be obsessed with Gaussian processes.

Example 6.2 A simple Gaussian process. Let X and Y be two independent standard normal random variables, and let

S(t) = sin(t) X + cos(t) Y.

It is seen that S(t) is a Gaussian process with a certain covariance structure. ♦
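For the process in Example 6.2, the covariance structure works out to Cov(S(s), S(t)) = sin s sin t + cos s cos t = cos(t − s), which a small simulation can confirm (the sample size, test points and seed are our own choices):

```python
import math
import random

random.seed(5)

def sample_pair(s, t):
    """Draw (S(s), S(t)) for S(t) = sin(t) X + cos(t) Y with X, Y iid N(0,1)."""
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    return (math.sin(s) * x + math.cos(s) * y,
            math.sin(t) * x + math.cos(t) * y)

s, t, n = 0.3, 1.1, 200_000
pairs = [sample_pair(s, t) for _ in range(n)]
cov = sum(a * b for a, b in pairs) / n   # both means are zero
# Analytically: Cov(S(s), S(t)) = sin s sin t + cos s cos t = cos(t - s).
```

Note the covariance depends on the time points only through t − s, a first hint at the stationary processes of the next section.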
6.4 Stationary Processes
Since the random behavior of a general stochastic process is very complex, we often first settle down with very simple ones and then timidly move toward more complex ones. The simple random walk was our first example of a stochastic process; we then introduced the discrete time Markov chain, and so on. We hope to stay in familiar territory while moving forward a little, by introducing the concept of stationary processes here.
Definition 6.1 Strong stationarity. Let {X(t) : 0 ≤ t < ∞} be a stochastic process. If for any choice of 0 ≤ t1 < t2 < · · · < tn, any n and any positive constant s, the joint distribution of X(t1 + s), X(t2 + s), . . . , X(tn + s) does not depend on s, then we say that {X(t) : 0 ≤ t < ∞} is strongly stationary. ♦

Definition 6.2 Weak stationarity. Let {X(t) : 0 ≤ t < ∞} be a stochastic process. If for any choice of 0 ≤ t1 < t2 < · · · < tn, any n and any positive constant s, the mean vector and covariance matrix of X(t1 + s), X(t2 + s), . . . , X(tn + s) do not depend on s, then we say that {X(t) : 0 ≤ t < ∞} is weakly stationary, or simply stationary. ♦

The above definitions are easily adapted to the case when T < ∞, or to discrete time stochastic processes; the obstacles are merely notational. Being strongly stationary does not necessarily imply weak stationarity, because of the moment requirements of the latter.

Most results for stationary processes are rather involved. We only mention two theorems.

Theorem 6.3 Strong and Weak Laws of Large Numbers (Ergodic Theorems).
(a) Let X = {Xn : n = 1, 2, . . .} be a strongly stationary process such that E|X1| < ∞. There exists a random variable Y with the same mean as the Xn such that

n^{-1} Σ_{i=1}^n Xi → Y

almost surely, and in mean.
(b) Let X = {Xn : n = 1, 2, . . .} be a weakly stationary process. There exists a random variable Y with the same mean as the Xn such that

n^{-1} Σ_{i=1}^n Xi → Y

in second moment. ♦

If a continuous time stationary process can be discretised, the above two results can also be very useful.
6.5 Stopping Times and Martingales
We are often interested in the behavior of a stochastic process following the moment when some event has just occurred; this is the starting point of a renewal process. These moments are random variables with special properties. Consider the example of the simple random walk {Xn : n = 0, 1, . . .}: the time of returning to zero is a renewal event. At any moment n, we examine the value of Xn to determine whether the process has renewed itself at n with respect to this particular renewal event. In general, we need to examine the values of {Xi : i = 0, 1, . . . , n} to determine the occurrence of the renewal or other events. The time of occurrence of such an event is therefore a function of {Xi : i = 0, 1, . . . , n}. In measure theory language, if τ represents such a moment, then the event {τ ≤ n} is Fn = σ(X0, . . . , Xn) measurable.

Definition 6.1 Stopping time. Let {X(t) : 0 ≤ t < ∞} be a stochastic process. A nonnegative random variable τ, which is allowed to take the value ∞, is called a stopping time if for each t, the event {τ ≤ t} ∈ σ(Xs : 0 ≤ s ≤ t). ♦

Example 6.3 To be invented.
Chapter 7

Brownian Motion or Wiener Process

It was the botanist R. Brown who first described the irregular and random motion of a pollen particle suspended in fluid (1828). Hence, such motions are called Brownian motion. Interestingly, the particle theory in physics was not widely accepted until Einstein used Brownian motion to argue, in 1905, that such motions are caused by bombardments of molecules. The theory was far from complete without a mathematical model, developed later by Wiener (1931); the stochastic model used for Brownian motion is hence also called the Wiener process. As early as 1931, Brownian motion has been used in mathematical theories of stock prices, and it is nowadays a fashion to use stochastic processes to study financial markets. We will see that the mathematical model has many desirable properties, and a few that do not fit well with the physical laws for Brownian motion.

Definition 7.2 Defining properties of Brownian motion.
Brownian motion {B(t)} is a stochastic process with the following properties:
1. (Independent increments) The random variable B(t) − B(s) is independent of {B(u) : 0 ≤ u ≤ s} for any s < t.
2. (Stationary Gaussian increments) The distribution of B(t) − B(s) is normal with mean 0 and variance t − s.
3. (Continuity) The sample paths of B(t) are continuous. ♦

We should compare this definition with that of the Poisson process: the increment distribution is changed from Poisson to normal. The notion is again analogous to that of a counting process with independent, stationary Poisson increments. The sample path requirement is needed so that we will work with a definitive version of the process. Note that the definition says nothing about B(0); once the distribution of B(0) is given, all the finite dimensional distributions of B(t) are completely determined, which then determines the distribution of B(t) itself. In some cases, B(0) = 0 is taken as part of the definition. We will more or less adopt this convention, but will continue to spell it out.

Example 7.1 Assume that B(t) is a Brownian motion and B(0) = 0. Find the marginal distributions of B(1), B(2) and B(2) − B(1). Find the joint distribution of B(1) and B(2), and compute P(B(1) > 1, B(2) < 0). ♦

Recall that the multivariate normal distribution is completely determined by its mean vector and covariance matrix. Thus, the finite dimensional distributions of a Gaussian process are completely determined by its mean function and its correlation function. As Brownian motion is a special Gaussian process, it is useful to calculate its mean and correlation function.

Example 7.2 Compute the correlation function of the Brownian motion. ♦

Based on this calculation, it is clear that the above Brownian motion is not stationary.

Example 7.3 Find the distribution of ∫_0^1 B(t) dt given B(1) = 0. ♦
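The probability asked for in Example 7.1 can be approximated by simulation: by the defining properties, B(1) and B(2) − B(1) are independent N(0, 1) variables. A minimal sketch (sample size and seed are our own choices):

```python
import random

random.seed(7)

# B(1) = Z1 and B(2) = Z1 + Z2 with Z1, Z2 iid N(0, 1), using the
# independent, stationary Gaussian increments of Brownian motion.
n = 400_000
hits = 0
for _ in range(n):
    b1 = random.gauss(0, 1)
    b2 = b1 + random.gauss(0, 1)
    if b1 > 1 and b2 < 0:
        hits += 1
prob = hits / n
```

The event requires B(1) to exceed 1 and the next increment to more than cancel it, so the probability is small, roughly 0.013.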
Example 7.4 Let ξ0, ξ1, . . . be a sequence of independent standard normally distributed random variables. Define

S(t) = (t/√π) ξ0 + √(2/π) Σ_{i=1}^∞ (sin(it)/i) ξi.

Assume that we can work with the infinite summation as if nothing is unusual. Then we can verify that the covariance structure of S(t) resembles that of the Brownian motion with B(0) = 0 over [0, π]. Thus, S(t) is a Brownian motion, and it provides a very concrete way of representing the Brownian motion. It can be seen that all sample paths of S(t) are continuous. ♦

Example 7.5 Sample paths of the Brownian motion. Brownian motions have the following well known but very peculiar properties.
1. Almost all sample paths are continuous.
2. Almost every sample path is not monotone in any interval, no matter how short it is.
3. Almost every sample path is not differentiable at any point.
4. Almost every sample path has infinite variation on any interval, no matter how small the interval is.
5. Almost all sample paths have quadratic variation on [0, t] equal to t, for any t. ♦
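The covariance claim in Example 7.4 can be checked without any randomness: the series has covariance st/π + (2/π) Σ_{i=1}^∞ sin(is) sin(it)/i², which should approach min(s, t) on [0, π]. A sketch (the truncation point and test points are arbitrary choices of ours):

```python
import math

def bm_series_cov(s, t, terms):
    """Covariance of the truncated series
       S(t) = (t/sqrt(pi)) xi_0 + sqrt(2/pi) sum_i (sin(i t)/i) xi_i,
       which should approach min(s, t) on [0, pi] as terms grows."""
    total = s * t / math.pi
    for i in range(1, terms + 1):
        total += (2.0 / math.pi) * math.sin(i * s) * math.sin(i * t) / (i * i)
    return total

cov = bm_series_cov(1.0, 2.0, 20_000)   # should be close to min(1, 2) = 1
```

The tail of the series is bounded by (2/π)/N after N terms, so 20,000 terms pin the value down to about 10⁻⁴.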
Definition 7.3 Let s(x) be a function of a real variable.
(a) The (total) variation of s(x) over [0, t] is

lim Σ_{i=1}^n |s(xi) − s(xi−1)|,

where the limit is taken over any sequence of partitions (x1, . . . , xn) such that 0 = x0 < x1 < x2 < · · · < xn = t and max{|xi − xi−1| : i = 1, . . . , n} → 0.
(b) The quadratic variation over the interval [0, t] is

lim Σ_{i=1}^n {s(xi) − s(xi−1)}²,

where the limit is taken over the same kind of sequences of partitions. ♦

If s(x) is monotone over [0, t], then the total variation is simply |s(t) − s(0)|. Hence, if a sample path is monotone over an interval, its variation over that interval is finite, which contradicts Property 4; so Property 2 can be derived from Property 4. Property 3 fits well with Property 2: if a sample path is nowhere monotone, it is very odd for it to be differentiable anywhere. This point can be made rigorous. Finally, Property 4 follows from Property 5 because of the following theorem.

Theorem 7.1 If s(x) is continuous and of finite total variation, then its quadratic variation is zero. ♦

Property 1 is true by definition. It is hence vital to show that Brownian motion has Property 5. Having a nonzero quadratic variation implies that the function fluctuates up and down very, very often. It is hard to think of a function with such a property; one example of a continuous function of infinite variation is s(x) = x sin(1/x) over [0, 1].
Let 0 = t0 < t1 < t2 < · · · < tn = t be a partition of the interval [0, t], where each ti depends on n though this is not spelled out. Let

Tn = Σ_{i=1}^n {B(ti) − B(ti−1)}²

be the quadratic variation sum based on this partition. Property 5 states that lim Tn = t almost surely along any sequence of partitions such that δn = max |ti − ti−1| → 0, regardless of how the interval is partitioned. We give a partial proof for sequences of partitions with Σn δn < ∞.

We find

E{Tn} = Σ_{i=1}^n E{B(ti) − B(ti−1)}² = Σ_{i=1}^n Var{B(ti) − B(ti−1)} = Σ_{i=1}^n (ti − ti−1) = t.

That is, the expected sample path quadratic variation is t. Next,

Var{Tn} = Σ_{i=1}^n Var[{B(ti) − B(ti−1)}²] ≤ Σ_{i=1}^n 3(ti − ti−1)² ≤ 3t max |ti − ti−1| = 3δn t.

(Note the change of order between summation and expectation.) When Σn δn < ∞, it follows that Σn E{Tn − E(Tn)}² < ∞. Using the monotone convergence theorem in measure theory, E Σn {Tn − E(Tn)}² < ∞, so Σn {Tn − E(Tn)}² < ∞ almost surely. The latter implies Tn − E(Tn) → 0 almost surely, which is the same as Tn → t almost surely.

Despite the beauty of this result, this property also shows that the mathematical Brownian motion does not model the physical Brownian motion at the microscopic level. If Newton's laws hold, the acceleration of a particle cannot be instantaneous, and the sample paths of a diffusing particle should be smooth.
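The contrast between Properties 4 and 5 is easy to see numerically: on a fine grid, the sum of squared increments of a simulated path settles near t, while the sum of absolute increments grows like √n as the grid is refined. A sketch (the grid size and seed are arbitrary choices of ours):

```python
import math
import random

random.seed(11)

# Simulate one Brownian path on [0, 1] on a fine grid and compute both
# variation sums; the increments are iid N(0, dt).
n = 100_000
dt = 1.0 / n
increments = [random.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
quad_var = sum(x * x for x in increments)       # should settle near t = 1
total_var = sum(abs(x) for x in increments)     # grows like sqrt(n) -> infinity
```

With n = 100,000 the quadratic variation sum is within a few hundredths of 1, while the total variation sum is already in the hundreds.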
7.1 Existence of Brownian Motion

One technical consideration is whether there exist stochastic processes with the properties prescribed by the definition of Brownian motion. This question is usually answered by defining a sequence of stochastic processes whose limit exists and has the defining properties of Brownian motion. One rigorous proof of the existence is sketched as follows.

Let ξi, i = 1, 2, . . ., be independent and normally distributed with mean 0 and variance 1. The existence of such a sequence and the corresponding probability space is well discussed in probability theory and is assumed here. Define the partial sums

Yn = Σ_{i=1}^n ξi

and the smoothed, rescaled partial sum process

Bn(t) = n^{-1/2} {Y_[nt] + (nt − [nt]) ξ_{[nt]+1}}.

All finite dimensional distributions of Bn(t) are normal, and it can be verified that the covariance structure of Bn(t) approaches what is assumed for Brownian motion. With a dose of asymptotic theory which requires the verification of tightness, it is seen that Bn(t) has a limit, and the limiting process has all the defining properties of a Brownian motion. Thus the existence is verified. See page 62 of Billingsley (1968).

One may instead replace the ξi by independent and identically distributed random variables taking the values ±an, define Yn in the same way, and define a similar stochastic process. By properly scaling down the time and letting an → 0, the limiting process will also have the defining properties of a Brownian motion. Thus, Brownian motion can be regarded as a limit of simple random walks, and many of the other properties of Brownian motion to be discussed have their simple random walk versions.

7.2 Martingales of Brownian Motion

The reason for Brownian motion being at the centre of most textbooks and courses on stochastic processes can be attributed to the fact that it is a good
example of many important concepts in continuous time stochastic processes. This section introduces its martingale aspects.

Recall that each random variable X generates a σ-field σ(X). This extends to a set of random variables: if X(t) is a stochastic process on [0, T], then {X(u) : 0 ≤ u ≤ t} is a set of random variables and it generates a σ-field, usually denoted as Ft. As sets of sets, these σ-fields satisfy Fs ⊂ Ft for all 0 ≤ s < t ≤ T.

In general, we may put forward an increasing family of σ-fields Ft not associated with any stochastic process; as long as {Ft, t ≥ 0} has the increasing property, it is called a filtration. Intuitively, Ft represents the information about X(t) available to an observer up to time t. If any decision is to be made at time t by this observer, the decision is an Ft measurable function. Unless otherwise specified, we take Ft = σ{X(s) : s ≤ t} when a stochastic process X(t) and a filtration Ft are subjects of some discussion; with this choice, X(t) is Ft measurable.

Definition 7.1 Martingale. A stochastic process {X(t), t ≥ 0} is a martingale with respect to Ft if for any t it is integrable, E|X(t)| < ∞, and for any s > 0,

E{X(t + s) | Ft} = X(t). ♦

With this definition, one of the defining properties of Brownian motion can be reworded as: the Brownian motion is a martingale. We would like to mention two other martingales.

Theorem 7.2 Let B(t) be a Brownian motion.
(a) B²(t) − t is a martingale.
(b) For any θ, exp{θB(t) − θ²t/2} is a martingale.
Proofs will be provided in class. ♦
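Part (b) of Theorem 7.2 implies in particular that E exp{θB(t) − θ²t/2} = 1 for every fixed t when B(0) = 0, since the process starts at 1. A quick Monte Carlo sanity check of this fixed-t consequence (sample size, θ, t and seed are our own choices):

```python
import math
import random

random.seed(13)

# Check E[exp(theta * B(t) - theta^2 t / 2)] = 1 at a fixed time t,
# using B(t) ~ N(0, t) for a Brownian motion with B(0) = 0.
theta, t, n = 1.0, 1.0, 500_000
total = 0.0
for _ in range(n):
    b_t = random.gauss(0.0, math.sqrt(t))
    total += math.exp(theta * b_t - 0.5 * theta * theta * t)
mean = total / n
```

The average should be very close to 1; this is just the normal moment generating function identity E e^{θZ√t} = e^{θ²t/2} in disguise.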
7.3 Markov Property of Brownian Motion

Let {X(t), t ≥ 0} be a stochastic process and let Ft = σ{X(s) : s ≤ t} be the corresponding filtration. The Markov property is naturally extended as follows.

Definition 7.1 Markov process. Let {X(t), t ≥ 0} be a stochastic process. If for any s, t > 0,

P{X(t + s) ≤ y | Ft} = P{X(t + s) ≤ y | X(t)}

almost surely for all y, then {X(t), t ≥ 0} is a Markov process. ♦

Compared to the definition of a Markov chain, this definition removes the requirement that the state space of X(t) be countable.

Theorem 7.3 A Brownian motion is a Markov process.
Proof: Working out the conditional moment generating function will be sufficient. ♦

As we mentioned before, Brownian motion in many respects generalizes the simple random walk, and they share many similar properties. In applications, we often want to know the behavior of a process following the occurrence of some event. A typical example is the behavior of a simple random walk after its return to 0. Since the returning time itself is random, we must be careful when claiming that the random walk following that moment has the same properties as a simple random walk starting from 0. If we recall, this claim turns out to be true; in general, however, we must be prepared to prove that this is indeed the case. The corresponding property of Brownian motion is the strong Markov property. As a preparation, we reintroduce the concept of stopping times.

Definition 7.2 Let {X(t), t ≥ 0} be a stochastic process. A random time T is called a stopping time for {X(t), t ≥ 0} if {T ≤ t} ∈ Ft for all t > 0. ♦
If we strip away the jargon of σ-fields, a stopping time T is a random quantity such that, after we have observed the stochastic process up to and including time t, we can tell whether T ≤ t or not. Recall that Ft is the σ-field generated by the random variables X(s) for all s ≤ t; by examining the values of X(s) for all s ≤ t, the truth of {T ≤ t} is decided. When T is a constant, the truthfulness of T ≤ t is a simple matter, so every nonrandom time is a stopping time.

For example, let T be the time when a Brownian motion first exceeds the value 1. Then T is a stopping time. In contrast, let T be the time at which the value of a stock price will drop by 50% over the next unit of time. We do not need strong mathematics to see that T is not a stopping time: to tell whether T ≤ t, you really have to be able to see into the future. If T were a stopping time, we would make a lot of money.

When T is a stopping time, we hope to be able to use it as if it were a nonrandom time. We define

FT = {A : A ∈ F, A ∩ {T ≤ t} ∈ Ft for any t}.

Finally, let us return to the issue of the strong Markov property.

Theorem 7.4 Strong Markov property. Let T be a stopping time associated with a Brownian motion B(t). Then for any t > 0,

P{B(T + t) ≤ y | FT} = P{B(T + t) ≤ y | B(T)}

for all y. ♦

We cannot afford to spend more time proving this result. Let us tentatively accept it as a fact and see what the consequences are. Define B̂(t) = B(T + t) − B(T). If T is a stopping time, it can be easily verified that B̂(t) is also a Brownian motion: when T is a constant this follows from the Markov property, and for a general stopping time it remains true due to the strong Markov property.

7.4 Exit Times and Hitting Times

Let T(x) = inf{t > 0 : B(t) = x}. It is seen that T(x) is the first moment when the Brownian motion hits the target x, and it is a stopping time for the Brownian motion B(t).
Assume B(0) = x and a < x < b. Define T = min{T(a), T(b)}. Hence T is the time when the Brownian motion escapes the region formed by the two horizontal lines at levels a and b. The question is: will this occur in finite time? If so, what is the average time it takes? The first question is answered by computing P(T < ∞); the second is answered by computing E{T} under the assumption that the first answer is affirmative.

One may link this problem with the simple random walk and predict the outcome. The probability of the first passage of "1" for a symmetric simple random walk is 1, so the simple random walk result seems to suggest that P(T < ∞) = 1. We do not have similar results for E{T}, but E{T(a)} and E{T(b)} for the simple random walk are finite, and the latter are likely finite here from the same consideration. The real answers are given as follows.

Theorem 7.5 Under the assumption that a < x < b, the exit time T defined above has the properties:
(a) P(T < ∞ | B(0) = x) = 1;
(b) E{T | B(0) = x} < ∞. ♦

Proof: The event {T > 1} implies a < B(t) < b for all t ∈ [0, 1]. Hence

P(T > 1 | B(0) = x) ≤ P{a < B(1) < b | B(0) = x} = Φ(b − x) − Φ(a − x).

As long as a and b are finite,

α = sup_x {Φ(b − x) − Φ(a − x)} < 1.

Next, for any positive integer n, let pn = P{T > n | B(0) = x}. Then

pn = P{T > n | T > n − 1, B(0) = x} p_{n−1}
   = P{a < B(s) < b for n − 1 ≤ s ≤ n | a < B(s) < b for 0 ≤ s ≤ n − 1, B(0) = x} p_{n−1}
   = P{a < B(s) < b for n − 1 ≤ s ≤ n | a < B(n − 1) < b} p_{n−1}
   = P{a < B(s) < b for 0 ≤ s ≤ 1 | a < B(0) < b} p_{n−1}
   ≤ α p_{n−1},

using the Markov property and the stationarity of the increments. Applying this relationship repeatedly, we get pn ≤ α^n. Thus

E(T | B(0) = x) ≤ Σ_{n=0}^∞ P{T > n | B(0) = x} < ∞

and P(T < ∞ | B(0) = x) = 1 − lim_{n→∞} pn = 1. ♦
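The finiteness of E{T} can be illustrated by simulating a discretised path until it leaves (a, b). For reference we compare against the classical closed form E{T | B(0) = x} = (x − a)(b − x), which is a standard fact not derived in these notes; the interval, step size, sample size and seed below are our own choices:

```python
import math
import random

random.seed(17)

def exit_time(x, a, b, dt=1e-3):
    """Time for a discretised Brownian path started at x to leave (a, b)."""
    t, pos, sd = 0.0, x, math.sqrt(dt)
    while a < pos < b:
        pos += random.gauss(0.0, sd)
        t += dt
    return t

times = [exit_time(0.0, -1.0, 1.0) for _ in range(2_000)]
mean_exit = sum(times) / len(times)
# The classical formula E{T | B(0)=x} = (x - a)(b - x) gives 1 here.
```

The sample mean should land near 1, up to Monte Carlo noise and a small discretisation bias from only observing the path on the grid.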
I worry slightly that the argument did not really make use of the strong Markov property, but only the Markov property itself; I believe that a proof along these lines can be made fully rigorous using some limiting approach.

Further, the following results are now obvious:

P{T(b) < ∞ | B(0) = a} = 1 and P{T(a) < ∞ | B(0) = a} = 1.

7.5 Maximum and Minimum of Brownian Motion

Define, for a given Brownian motion,

M(t) = max{B(s) : 0 ≤ s ≤ t} and m(t) = min{B(s) : 0 ≤ s ≤ t}.

The significance of these two quantities is obvious. Again, the following result resembles a result for the simple random walk.

Theorem 7.6 For any m > 0,

P{M(t) ≥ m | B(0) = 0} = 2P{B(t) ≥ m | B(0) = 0}.
Proof: We will omit the condition B(0) = 0 from the notation. We use T(m) as the stopping time when B(t) first hits m. Note that B(T(m)) = m, as the sample paths are continuous almost surely; also, {M(t) ≥ m} is equivalent to {T(m) ≤ t}. First, for any m,

P{M(t) ≥ m} = P{M(t) ≥ m, B(t) ≥ m} + P{M(t) ≥ m, B(t) < m}.

For the second term,

P{M(t) ≥ m, B(t) < m} = P{T(m) < t, B(t) < B(T(m))}
  = P{B(t) < B(T(m)) | T(m) < t} P{T(m) < t}
  = P{B(t) > B(T(m)) | T(m) < t} P{T(m) < t}   (strong Markov property and reflection principle)
  = P{M(t) ≥ m, B(t) ≥ m}.

Applying this back,

P{M(t) ≥ m} = 2P{M(t) ≥ m, B(t) ≥ m}.

Since B(t) ≥ m implies M(t) ≥ m, the above becomes

P{M(t) ≥ m} = 2P{B(t) ≥ m}.

This completes the proof. ♦

This result can be used to find the distribution function of T(x) under the assumption that B(0) = 0.

Theorem 7.7 The density function of T(x) is given by

f(t) = [x/√(2πt³)] exp{−x²/(2t)}, t ≥ 0.
Proof: It is done by using the equivalence between {T(x) ≤ t} and {M(t) ≥ x}, together with Theorem 7.6. ♦

We call T(x) the first passage time. The reflection idea used above is formally stated in the next theorem; we should draw a picture to understand its meaning.

Theorem 7.8 Let T be a stopping time. Define B̂(t) = B(t) for t ≤ T and B̂(t) = 2B(T) − B(t) for t > T. Then B̂(t) is also a Brownian motion. ♦

7.6 Zeros of Brownian Motion and Arcsine Law

Assume B(0) = 0. How long does it take for the motion to come back to 0? This time, the conclusion is very different from that of the simple random walk.

Theorem 7.9 The probability that B(t) has at least one zero in the time interval (a, b) is given by (2/π) arc cos √(a/b). ♦

The proof is not hard. Let E(a, b) be the event that B(t) = 0 for at least some t ∈ (a, b). Conditioning on B(a),

P{E(a, b)} = E[P{E(a, b) | B(a)}].

For any x > 0, by the Markov property and the symmetry of the normal distribution,

P{E(a, b) | B(a) = −x} = P{T(x) < b − a | B(0) = 0} = P{T(x) < b − a},

and similarly for B(a) = x. The density function of T(x) was given earlier.
Hence

E[P{E(a, b) | B(a)}] = ∫_{−∞}^{∞} P{E(a, b) | B(a) = x} fa(x) dx = 2 ∫_0^∞ P{T(x) < b − a} fa(x) dx,

where fa(x) is the density function of B(a), known to be normal with mean 0 and variance a. Substituting the two expressions in and finishing the job of integration, we get the result. ♦

The famous arcsine law now follows.

Theorem 7.10 The probability that Brownian motion has no zeros in the time interval (a, b) is given by (2/π) arc sin √(a/b). ♦

There is a similar result for the simple random walk. When a → 0, so that a/b → 0, the probability of at least one zero approaches 1. Hence, given B(0) = 0, the probability that B(t) = 0 somewhere in any small neighborhood of t = 0 (not including 0 itself) is 1. This conclusion can be further strengthened: there are an infinite number of zeros in any neighborhood of t = 0, no matter how small this neighborhood is. This result further illustrates that the sample paths of Brownian motion, though continuous, are nowhere smooth.

7.7 Diffusion Processes

To be continued.
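To close the chapter, the formulas of Theorems 7.6, 7.7 and 7.9 fit together, and this can be checked numerically without any simulation: integrating the density of T(x) should reproduce P{M(t) ≥ x} = 2P{B(t) ≥ x}, and the conditioning integral in the proof of Theorem 7.9 should reproduce the arccos formula. A sketch (the grid sizes and test values are arbitrary choices of ours):

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def passage_density(x, t):
    """Density of T(x) from Theorem 7.7: x / sqrt(2 pi t^3) exp(-x^2 / 2t)."""
    return x / math.sqrt(2.0 * math.pi * t ** 3) * math.exp(-x * x / (2.0 * t))

# Check 1: integrate the density over [0, t] (midpoint rule) and compare
# with P{T(x) <= t} = P{M(t) >= x} = 2 P{B(t) >= x} from Theorem 7.6.
x, t, n = 1.0, 1.0, 100_000
h = t / n
cdf_by_integration = sum(passage_density(x, (k + 0.5) * h) * h for k in range(n))
cdf_by_reflection = 2.0 * (1.0 - phi_cdf(x / math.sqrt(t)))

# Check 2: the conditioning argument of Theorem 7.9 gives
# P{zero in (a, b)} = 2 * int_0^inf P{T(y) < b - a} f_a(y) dy,
# which should match (2/pi) arccos(sqrt(a/b)); here a = 1, b = 4 gives 2/3.
a, b, m = 1.0, 4.0, 100_000
upper = 10.0 * math.sqrt(a)            # 10 standard deviations of B(a)
hy = upper / m
integral = 0.0
for k in range(m):
    y = (k + 0.5) * hy
    f_a = math.exp(-y * y / (2.0 * a)) / math.sqrt(2.0 * math.pi * a)
    integral += 2.0 * (1.0 - phi_cdf(y / math.sqrt(b - a))) * f_a * hy
p_zero = 2.0 * integral
p_arccos = (2.0 / math.pi) * math.acos(math.sqrt(a / b))
```

Both pairs of numbers should agree to several decimal places, tying the maximum, the first passage time, and the zero set together.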