# Probability Theory

Richard F. Bass
These notes are c (1998 by Richard F. Bass. They may be used for personal or classroom purposes,
but not for commercial purposes.
Revised 2001.
1. Basic notions.
A probability or probability measure is a measure whose total mass is one. Because the origins of
probability are in statistics rather than analysis, some of the terminology is diﬀerent. For example, instead
of denoting a measure space by (X, /, µ), probabilists use (Ω, T, P). So here Ω is a set, T is called a σ-ﬁeld
(which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of T are called events.
Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost
surely, written a.s.. Real-valued measurable functions from Ω to R are called random variables and are
usually denoted by X or Y or other capital letters. We often abbreviate ”random variable” by r.v.
We let A
c
= (ω ∈ Ω : ω / ∈ A) (called the complement of A) and B −A = B ∩ A
c
.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for

XdP. The notation E[X; A] is often used for

A
XdP.
The random variable 1
A
is the function that is one if ω ∈ A and zero otherwise. It is called the
indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such
as (ω : X(ω) > a) are almost always abbreviated by (X > a).
Given a random variable X, we can deﬁne a probability on R by
P
X
(A) = P(X ∈ A), A ⊂ R. (1.1)
The probability P
X
is called the law of X or the distribution of X. We deﬁne F
X
: R → [0, 1] by
F
X
(x) = P
X
((−∞, x]) = P(X ≤ x). (1.2)
The function F
X
is called the distribution function of X.
As an example, let Ω = ¦H, T¦, T all subsets of Ω (there are 4 of them), P(H) = P(T) =
1
2
. Let
X(H) = 1 and X(T) = 0. Then P
X
=
1
2
δ
0
+
1
2
δ
1
, where δ
x
is point mass at x, that is, δ
x
(A) = 1 if x ∈ A
and 0 otherwise. F
X
(a) = 0 if a < 0,
1
2
if 0 ≤ a < 1, and 1 if a ≥ 1.
Proposition 1.1. The distribution function F
X
of a random variable X satisﬁes:
(a) F
X
is nondecreasing;
(b) F
X
is right continuous with left limits;
(c) lim
x→∞
F
X
(x) = 1 and lim
x→−∞
F
X
(x) = 0.
Proof. We prove the ﬁrst part of (b) and leave the others to the reader. If x
n
↓ x, then (X ≤ x
n
) ↓ (X ≤ x),
and so P(X

x
n
) ↓ P(X ≤ x) since P is a measure.
Note that if x
n
↑ x, then (X ≤ x
n
) ↑ (X < x), and so F
X
(x
n
) ↑ P(X < x).
Any function F : R → [0, 1] satisfying (a)-(c) of Proposition 1.1 is called a distribution function,
whether or not it comes from a random variable.
1
Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that
F = F
X
.
Proof. Let Ω = [0, 1], T the Borel σ-ﬁeld, and P Lebesgue measure. Deﬁne X(ω) = sup¦x : F(x) < ω¦. It
is routine to check that F
X
= F.
In the above proof, essentially X = F
−1
. However F may have jumps or be constant over some
intervals, so some care is needed in deﬁning X.
Certain distributions or laws are very common. We list some of them.
(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 −p for some p ∈ [0, 1].
(b) Binomial. This is deﬁned by P(X = k) =

n
k

p
k
(1−p)
n−k
, where n is a positive integer, 0 ≤ k ≤ n,
and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 −p)p
k
. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e
−λ
λ
k
/k! Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.
If F is absolutely continuous, we call f = F

the density of F. Some examples of distributions
characterized by densities are the following.
(f) Uniform on [a, b]. Deﬁne f(x) = (b −a)
−1
1
[a,b]
(x). This means that if X has a uniform distribution,
then
P(X ∈ A) =

A
1
b −a
1
[a,b]
(x) dx.
(g) Exponential. For x > 0 let f(x) = λe
−λx
.
(h) Standard normal. Deﬁne f(x) =
1

e
−x
2
/2
. So
P(X ∈ A) =
1

A
e
−x
2
/2
dx.
(i) ^(µ, σ
2
). We shall see later that a standard normal has mean zero and variance one. If Z is a
standard normal, then a ^(µ, σ
2
) random variable has the same distribution as µ + σZ. It is an
exercise in calculus to check that such a random variable has density
1

2πσ
e
−(x−µ)
2
/2σ
2
. (1.3)
(j) Cauchy. Here
f(x) =
1
π
1
1 +x
2
.
We can use the law of a random variable to calculate expectations.
Proposition 1.3. If g is bounded or nonnegative, then
Eg(X) =

g(x) P
X
(dx).
Proof. If g is the indicator of an event A, this is just the deﬁnition of P
X
. By linearity, the result holds for
simple functions. By the monotone convergence theorem, the result holds for nonnegative functions, and by
linearity again, it holds for bounded g.
2
If F
X
has a density f, then P
X
(dx) = f(x) dx. So, for example, EX =

xf(x) dx and EX
2
=

x
2
f(x) dx. (We need E[X[ ﬁnite to justify this if X is not necessarily nonnegative.)
We deﬁne the mean of a random variable to be its expectation, and the variance of a random variable
is deﬁned by
Var X = E(X −EX)
2
.
For example, it is routine to see that the mean of a standard normal is zero and its variance is one.
Note
Var X = E(X
2
−2XEX + (EX)
2
) = EX
2
−(EX)
2
.
Another equality that is useful is the following.
Proposition 1.4. If X ≥ 0 a.s. and p > 0, then
EX
p
=

0

p−1
P(X > λ) dλ.
The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proof. Use Fubini’s theorem and write

0

p−1
P(X > λ) dλ = E

0

p−1
1
(λ,∞)
(X) dλ = E

X
0

p−1
dλ = EX
p
.
We need two elementary inequalities.
Proposition 1.5. Chebyshev’s inequality If X ≥ 0,
P(X ≥ a) ≤
EX
a
.
Proof. We write
P(X ≥ a) = E

1
[a,∞)
(X)

≤ E

X
a
1
[a,∞)
(X)

≤ EX/a,
since X/a is bigger than 1 when X ∈ [a, ∞).
If we apply this to X = (Y −EY )
2
, we obtain
P([Y −EY [ ≥ a) = P((Y −EY )
2
≥ a
2
) ≤ Var Y/a
2
. (1.4)
This special case of Chebyshev’s inequality is sometimes itself referred to as Chebyshev’s inequality, while
Proposition 1.5 is sometimes called the Markov inequality.
The second inequality we need is Jensen’s inequality, not to be confused with the Jensen’s formula of
complex analysis.
3
Proposition 1.6. Suppose g is convex and and X and g(X) are both integrable. Then
g(EX) ≤ Eg(X).
Proof. One property of convex functions is that they lie above their tangent lines, and more generally their
support lines. So if x
0
∈ R, we have
g(x) ≥ g(x
0
) +c(x −x
0
)
for some constant c. Take x = X(ω) and take expectations to obtain
Eg(X) ≥ g(x
0
) +c(EX −x
0
).
Now set x
0
equal to EX.
If A
n
is a sequence of sets, deﬁne (A
n
n
inﬁnitely often,” by
(A
n
i.o.) = ∩

n=1

i=n
A
i
.
This set consists of those ω that are in inﬁnitely many of the A
n
.
A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove
the ﬁrst part here, leaving the second part to the next section.
Proposition 1.7. (Borel-Cantelli lemma) If
¸
n
P(A
n
) < ∞, then P(A
n
i.o.) = 0.
Proof. We have
P(A
n
i.o.) = lim
n→∞
P(∪

i=n
A
i
).
However,
P(∪

i=n
A
i
) ≤

¸
i=n
P(A
i
),
which tends to zero as n → ∞.
2. Independence.
Let us say two events A and B are independent if P(A∩ B) = P(A)P(B). The events A
1
, . . . , A
n
are
independent if
P(A
i
1
∩ A
i
2
∩ ∩ A
i
j
) = P(A
i
1
)P(A
i
2
) P(A
i
j
)
for every subset ¦i
1
, . . . , i
j
¦ of ¦1, 2, . . . , n¦.
Proposition 2.1. If A and B are independent, then A
c
and B are independent.
Proof. We write
P(A
c
∩ B) = P(B) −P(A∩ B) = P(B) −P(A)P(B) = P(B)(1 −P(A)) = P(B)P(A).
We say two σ-ﬁelds T and ( are independent if A and B are independent whenever A ∈ T and
B ∈ (. Two random variables X and Y are independent if the σ-ﬁeld generated by X and the σ-ﬁeld
generated by Y are independent. (Recall that the σ-ﬁeld generated by a random variable X is given by
4
¦(X ∈ A) : A a Borel subset of R¦.) We deﬁne the independence of n σ-ﬁelds or n random variables in the
obvious way.
Proposition 2.1 tells us that A and B are independent if the random variables 1
A
and 1
B
are inde-
pendent, so the deﬁnitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f(X) and g(Y ) are independent.
This follows because the σ-ﬁeld generated by f(X) is a sub-σ-ﬁeld of the one generated by X, and similarly
for g(Y ).
Let F
X,Y
(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y . (The comma
inside the set means ”and.”)
Proposition 2.2. F
X,Y
(x, y) = F
X
(x)F
Y
(y) if and only if X and Y are independent.
Proof. If X and Y are independent, the 1
(−∞,x]
(X) and 1
(−∞,y]
(Y ) are independent by the above comments.
Using the above comments and the deﬁnition of independence, this shows F
X,Y
(x, y) = F
X
(x)F
Y
(y).
Conversely, if the inequality holds, ﬁx y and let ´
y
denote the collection of sets A for which P(X ∈
A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). ´
y
contains all sets of the form (−∞, x]. It follows by linearity that ´
y
contains all sets of the form (x, z], and then by linearity again, by all sets that are the ﬁnite union of such
half-open, half-closed intervals. Note that the collection of ﬁnite unions of such intervals, /, is an algebra
generating the Borel σ-ﬁeld. It is clear that ´
y
is a monotone class, so by the monotone class lemma, ´
y
contains the Borel σ-ﬁeld.
For a ﬁxed set A, let ´
A
denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈
A)P(Y ∈ B). Again, ´
A
is a monotone class and by the preceding paragraph contains the σ-ﬁeld generated
by the collection of ﬁnite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X
and Y are independent.
The following is known as the multiplication theorem.
Proposition 2.3. If X, Y , and XY are integrable and X and Y are independent, then EXY = EXEY .
Proof. Consider the random variables in σ(X) (the σ-ﬁeld generated by X) and σ(Y ) for which the
multiplication theorem is true. It holds for indicators by the deﬁnition of X and Y being independent.
It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it holds for integrable random
variables by linearity again.
Let us give an example of independent random variables. Let Ω = Ω
1

2
and let P = P
1
P
2
, where
(Ω
i
, T
i
, P
i
) are probability spaces for i = 1, 2. We use the product σ-ﬁeld. Then it is clear that T
1
and T
2
are independent by the deﬁnition of P. If X
1
is a random variable such that X
1

1
, ω
2
) depends only on ω
1
and X
2
depends only on ω
2
, then X
1
and X
2
are independent.
This example can be extended to n independent random variables, and in fact, if one has independent
random variables, one can always view them as coming from a product space. We will not use this fact.
Later on, we will talk about countable sequences of independent r.v.s and the reader may wonder whether
such things can exist. That it can is a consequence of the Kolmogorov extension theorem; see PTA, for
example.
If X
1
, . . . , X
n
are independent, then so are X
1
− EX
1
, . . . , X
n
− EX
n
. Assuming everything is
integrable,
E[(X
1
−EX
1
) + (X
n
−EX
n
)]
2
= E(X
1
−EX
1
)
2
+ +E(X
n
−EX
n
)
2
,
5
using the multiplication theorem to show that the expectations of the cross product terms are zero. We have
thus shown
Var (X
1
+ +X
n
) = Var X
1
+ + Var X
n
. (2.1)
We ﬁnish up this section by proving the second half of the Borel-Cantelli lemma.
Proposition 2.4. Suppose A
n
is a sequence of independent events. If
¸
n
P(A
n
) = ∞, then P(A
n
i.o.) = 1.
Note that here the A
n
are independent, while in the ﬁrst half of the Borel-Cantelli lemma no such
assumption was necessary.
Proof. Note
P(∪
N
i=n
A
i
) = 1 −P(∩
N
i=n
A
c
i
) = 1 −
N
¸
i=n
P(A
c
i
) = 1 −
N
¸
i=n
(1 −P(A
i
)).
By the mean value theorem, 1 − x ≤ e
−x
, so we have that the right hand side is greater than or equal to
1 − exp(−
¸
N
i=n
P(A
i
)). As N → ∞, this tends to 1, so P(∪

i=n
A
i
) = 1. This holds for all n, which proves
the result.
3. Convergence.
In this section we consider three ways a sequence of r.v.s X
n
can converge.
We say X
n
converges to X almost surely if (X
n
→ X) has probability zero. X
n
converges to X in
probability if for each ε, P([X
n
−X[ > ε) → 0 as n → ∞. X
n
converges to X in L
p
if E[X
n
−X[
p
→ 0 as
n → ∞.
The following proposition shows some relationships among the types of convergence.
Proposition 3.1. (a) If X
n
→ X a.s., then X
n
→ X in probability.
(b) If X
n
→ X in L
p
, then X
n
→ X in probability.
(c) If X
n
→ X in probability, there exists a subsequence n
j
such that X
n
j
converges to X almost surely.
Proof. To prove (a), note X
n
−X tends to 0 almost surely, so 1
(−ε,ε)
c (X
n
−X) also converges to 0 almost
surely. Now apply the dominated convergence theorem.
(b) comes from Chebyshev’s inequality:
P([X
n
−X[ > ε) = P([X
n
−X[
p
> ε
p
) ≤ E[X
n
−X[
p

p
→ 0
as n → ∞.
To prove (c), choose n
j
larger than n
j−1
such that P([X
n
− X[ > 2
−j
) < 2
−j
whenever n ≥ n
j
.
So if we let A
i
= ([X
n
j
− X[ > 2
−i
for some j ≥ i), then P(A
i
) ≤ 2
−i+1
. By the Borel-Cantelli lemma
P(A
i
i.o.) = 0. This implies X
n
j
→ X on the complement of (A
i
i.o.).
Let us give some examples to show there need not be any other implications among the three types
of convergence.
Let Ω = [0, 1], T the Borel σ-ﬁeld, and P Lebesgue measure. Let X
n
= e
n
1
(0,1/n)
. Then clearly X
n
converges to 0 almost surely and in probability, but EX
p
n
= e
np
/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass
1. Let t
n
=
¸
n
i=1
i
−1
, and let A
n
= ¦θ : t
n−1
≤ θ < t
n
¦. Let X
n
= 1
A
n
. Any point on the unit circle will
be in inﬁnitely many A
n
, so X
n
does not converge almost surely to 0. But P(A
n
) = 1/2πn → 0, so X
n
→ 0
in probability and in L
p
.
6
4. Weak law of large numbers.
Suppose X
n
is a sequence of independent random variables. Suppose also that they all have the same
distribution, that is, F
X
n
= F
X
1
for all n. This situation comes up so often it has a name, independent,
identically distributed, which is abbreviated i.i.d.
Deﬁne S
n
=
¸
n
i=1
X
i
. S
n
is called a partial sum process. S
n
/n is the average value of the ﬁrst n of
the X
i
’s.
Theorem 4.1. (Weak law of large numbers) Suppose the X
i
are i.i.d. and EX
2
1
< ∞. Then S
n
/n → EX
1
in probability.
Proof. Since the X
i
are i.i.d., they all have the same expectation, and so ES
n
= nEX
1
. Hence E(S
n
/n −
EX
1
)
2
is the variance of S
n
/n. If ε > 0, by Chebyshev’s inequality,
P([S
n
/n −EX
1
[ > ε) ≤
Var (S
n
/n)
ε
2
=
¸
n
i=1
Var X
i
n
2
ε
2
=
nVar X
1
n
2
ε
2
. (4.1)
Since EX
2
1
< ∞, then Var X
1
< ∞, and the result follows by letting n → ∞.
A nice application of the weak law of large numbers is a proof of the Weierstrass approximation
theorem.
Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such
that sup
x∈[0,1]
[f(x) −P(x)[ < ε.
Proof. Let
P(x) =
n
¸
k=0
f(k/n)

n
k

x
k
(1 −x)
n−k
.
Clearly P is a polynomial. Since f is continuous, there exists M such that [f(x)[ ≤ M for all x and there
exists δ such that [f(x) −f(y)[ < ε/2 whenever [x −y[ < δ.
Let X
i
be i.i.d. Bernoulli r.v.s with parameter x. Then S
n
, the partial sum, is a binomial, and hence
P(x) = Ef(S
n
/n). The mean of S
n
/n is x. We have
[P(x) −f(x)[ = [Ef(S
n
/n) −f(EX
1
)[ ≤ E[f(S
n
/n) −f(EX
1
)[
≤ MP([S
n
/n −x[ > δ) +ε/2.
By (4.1) the ﬁrst term will be less than
MVar X
1
/nδ
2
≤ Mx(1 −x)/nδ
2
≤ Mnδ
2
,
which will be less than ε/2 if n is large enough, uniformly in x.
5. Techniques related to almost sure convergence.
Our aim is the strong law of large numbers (SLLN), which says that S
n
/n converges to EX
1
almost
surely if E[X
1
[ < ∞.
We ﬁrst prove it under the assumption that EX
4
1
< ∞.
7
Proposition 5.1. Suppose X
i
is an i.i.d. sequence with EX
4
i
< ∞ and let S
n
=
¸
n
i=1
X
i
. Then S
n
/n →
EX
1
a.s.
Proof. By looking at X
i
−EX
i
we may assume that the X
i
have mean 0. By Chebyshev,
P([S
n
/n[ > ε) ≤
E(S
n
/n)
4
ε
4
=
ES
4
n
n
4
ε
4
.
If we expand S
4
n
, we will have terms involving X
4
i
, terms involving X
2
i
X
2
j
, terms involving X
3
i
X
j
, terms
involving X
2
i
X
j
X
k
, and terms involving X
i
X
j
X
k
X

, with i, j, k, all being diﬀerent. By the multiplication
theorem and the fact that the X
i
have mean 0, the expectations of all the terms will be 0 except for those
of the ﬁrst two types. So
ES
4
n
=
n
¸
i=1
EX
4
i
+
¸
i=j
EX
2
i
EX
2
j
.
By the ﬁniteness assumption, the ﬁrst term on the right is bounded by c
1
n. By Cauchy-Schwarz, EX
2
i

(EX
4
i
)
1/2
< ∞, and there are at most n
2
terms in the second term on the right, so this second term is
bounded by c
2
n
2
. Substituting, we have
P([S
n
/n[ > ε) ≤ c
3
/n
2
ε
4
.
Consequently P([S
n
/n[ > ε i.o.) = 0 by Borel-Cantelli. Since ε is arbitrary, this implies S
n
/n → 0 a.s.
Before we can prove the SLLN assuming only the ﬁniteness of ﬁrst moments, we need some prelimi-
naries.
Proposition 5.2. If Y ≥ 0, then EY < ∞ if and only if
¸
n
P(Y > n) < ∞.
Proof. By Proposition 1.4, EY =

0
P(Y > x)dx. P(Y > x) is nonincreasing in x, so the integral is
bounded above by
¸

n=0
P(Y > n) and bounded below by
¸

n=1
P(Y > n).
If X
i
is a sequence of r.v.s, the tail σ-ﬁeld is deﬁned by ∩

n=1
σ(X
n
, X
n+1
, . . .). An example of an
event in the tail σ-ﬁeld is (limsup
n→∞
X
n
> a). Another example is (limsup
n→∞
S
n
/n > a). The reason
for this is that if k < n is ﬁxed,
S
n
n
=
S
k
n
+
¸
n
i=k+1
X
i
n
.
The ﬁrst term on the right tends to 0 as n → ∞. So limsup S
n
/n = limsup(
¸
n
i=k+1
X
i
)/n, which is in
σ(X
k+1
, X
k+2
, . . .). This holds for each k. The set (limsup S
n
> a) is easily seen not to be in the tail σ-ﬁeld.
Theorem 5.3. (Kolmogorov 0-1 law) If the X
i
are independent, then the events in the tail σ-ﬁeld have
probability 0 or 1.
This implies that in the case of i.i.d. random variables, if S
n
/n has a limit with positive probability,
then it has a limit with probability one, and the limit must be a constant.
Proof. Let ´ be the collection of sets in σ(X
n+1
, . . .) that is independent of every set in σ(X
1
, . . . , X
n
).
´ is easily seen to be a monotone class and it contains σ(X
n+1
, . . . , X
N
) for every N > n. Therefore ´
must be equal to σ(X
n+1
, . . .).
If A is in the tail σ-ﬁeld, then for each n, A is independent of σ(X
1
, . . . , X
n
). The class ´
A
of sets
independent of A is a monotone class, hence is a σ-ﬁeld containing σ(X
1
, . . . , X
n
) for each n. Therefore ´
A
contains σ(X
1
, . . .).
8
We thus have that the event A is independent of itself, or
P(A) = P(A∩ A) = P(A)P(A) = P(A)
2
.
This implies P(A) is zero or one.
The next proposition shows that in considering a law of large numbers we can consider truncated
random variables.
Proposition 5.4. Suppose X
i
is an i.i.d. sequence of r.v.s with E[X
1
[ < ∞. Let X

n
= X
n
1
(|X
n
|≤n)
. Then
(a) X
n
converges almost surely if and only if X

n
does;
(b) If S

n
=
¸
n
i=1
X

i
, then S
n
/n converges a.s. if and only if S

n
/n does.
Proof. Let A
n
= (X
n
= X

n
) = ([X
n
[ > n). Then P(A
n
) = P([X
n
[ > n) = P([X
1
[ > n). Since E[X
1
[ < ∞,
then by Proposition 5.2 we have
¸
P(A
n
) < ∞. So by the Borel-Cantelli lemma, P(A
n
i.o.) = 0. Thus for
almost every ω, X
n
= X

n
for n suﬃciently large. This proves (a).
For (b), let k (depending on ω) be the largest integer such that X

k
(ω) = X
k
(ω). Then S
n
/n−S

n
/n =
(X
1
+ +X
k
)/n −(X

1
+ +X

k
)/n → 0 as n → ∞.
Next is Kolmogorov’s inequality, a special case of Doob’s inequality.
Proposition 5.5. Suppose the X
i
are independent and EX
i
= 0 for each i. Then
P( max
1≤i≤n
[S
i
[ ≥ λ) ≤
ES
2
n
λ
2
.
Proof. Let A
k
= ([S
k
[ ≥ λ, [S
1
[ < λ, . . . , [S
k−1
[ < λ). Note the A
k
are disjoint and that A
k
∈ σ(X
1
, . . . , X
k
).
Therefore A
k
is independent of S
n
−S
k
. Then
ES
2
n

n
¸
k=1
E[S
2
n
; A
k
]
=
n
¸
k=1
E[(S
2
k
+ 2S
k
(S
n
−S
k
) + (S
n
−S
k
)
2
); A
k
]

n
¸
k=1
E[S
2
k
; A
k
] + 2
n
¸
k=1
E[S
k
(S
n
−S
k
); A
k
].
Using the independence, E[S
k
(S
n
−S
k
)1
A
k
] = E[S
k
1
A
k
]E[S
n
−S
k
] = 0. Therefore
ES
2
n

n
¸
k=1
E[S
2
k
; A
k
] ≥
n
¸
k=1
λ
2
P(A
k
) = λ
2
P( max
1≤k≤n
[S
k
[ ≥ λ).
Our result is immediate from this.
The last result we need for now is a special case of what is known as Kronecker’s lemma.
Proposition 5.6. Suppose x
i
are real numbers and s
n
=
¸
n
i=1
x
i
. If
¸

j=1
(x
j
/j) converges, then s
n
/n → 0.
Proof. Let b
n
=
¸
n
j=1
(x
j
/j), b
0
= 0, and suppose b
n
→ b. As is well known, this implies (
¸
n
i=1
b
i
)/n → b.
We have n(b
n
−b
n−1
) = x
n
, so
s
n
n
=
¸
n
i=1
(ib
i
−ib
i−1
)
n
=
¸
n
i=1
ib
i

¸
n−1
i=1
(i + 1)b
i
n
= b
n

¸
n
i=1
b
i
n
→ b −b = 0.
9
6. Strong law of large numbers.
This section is devoted to a proof of Kolmogorov’s strong law of large numbers. We showed earlier
that if EX
2
i
< ∞, where the X
i
are i.i.d., then the weak law of large numbers (WLLN) holds: S
n
/n converges
to EX
1
in probability. The WLLN can be improved greatly; it is enough that xP([X
1
[ > x) → 0 as x → ∞.
Here we show the strong law (SLLN): if one has a ﬁnite ﬁrst moment, then there is almost sure convergence.
First we need a lemma.
Lemma 6.1. Suppose V
i
is a sequence of independent r.v.s, each with mean 0. Let W
n
=
¸
n
i=1
V
i
. If
¸

i=1
Var V
i
< ∞, then W
n
converges almost surely.
Proof. Choose n
j
> n
j−1
such that
¸

i=n
j
Var V
i
< 2
−3j
. If n > n
j
, then applying Kolmogorov’s inequality
shows that
P( max
n
j
≤i≤n
[W
i
−W
n
j
[ > 2
−j
) ≤ 2
−3j
/2
−2j
= 2
−j
.
Letting n → ∞, we have P(A
j
) ≤ 2
−j
, where
A
j
= (max
n
j
≤i
[W
i
−W
n
j
[ > 2
−j
).
By the Borel-Cantelli lemma, P(A
j
i.o.) = 0.
Suppose ω / ∈ (A
j
i.o.). Let ε > 0. Choose j large enough so that 2
−j+1
< ε and ω / ∈ A
j
. If n, m > n
j
,
then
[W
n
−W
m
[ ≤ [W
n
−W
n
j
[ +[W
m
−W
n
j
[ ≤ 2
−j+1
< ε.
Since ε is arbitrary, W
n
(ω) is a Cauchy sequence, and hence converges.
Theorem 6.2. (SLLN) Let X
i
be a sequence of i.i.d. random variables. Then S
n
/n converges almost surely
if and only if E[X
1
[ < ∞.
Proof. Let us ﬁrst suppose S
n
/n converges a.s. and show E[X
1
[ < ∞. If S
n
(ω)/n → a, then
S
n−1
n
=
S
n−1
n −1
n −1
n
→ a.
So
X
n
n
=
S
n
n

S
n−1
n
→ a −a = 0.
Hence X
n
/n → 0, a.s. Thus P([X
n
[ > n i.o.) = 0. By the second part of Borel-Cantelli,
¸
P([X
n
[ > n) < ∞.
Since the X
i
are i.i.d., this means
¸

n=1
P([X
1
[ > n) < ∞, and by Proposition 4.1, E[X
1
[ < ∞.
Now suppose E[X
1
[ < ∞. By looking at X
i
− EX
i
, we may suppose without loss of generality that
EX
i
= 0. We truncate, and let Y
i
= X
i
1
(|X
i
|≤i)
. It suﬃces to show
¸
n
i=1
Y
i
/n → 0 a.s., by Proposition 5.4.
Next we estimate. We have
EY
i
= E[X
i
1
(|X
i
|≤i)
] = E[X
1
1
(|X
1
|≤i)
] → EX
1
= 0.
10
The convergence follows by the dominated convergence theorem, since the integrands are bounded by [X
1
[.
To estimate the second moment of the Y
i
, we write
EY
2
i
=

0
2yP([Y
i
[ ≥ y) dy
=

i
0
2yP([Y
i
[ ≥ y) dy

i
0
2yP([X
1
[ ≥ y) dy,
and so

¸
i=1
E(Y
2
i
/i
2
) ≤

¸
i=1
1
i
2

i
0
2yP([X
1
[ ≥ y) dy
= 2

¸
i=1
1
i
2

0
1
(y≤i)
yP([X
1
[ ≥ y) dy
= 2

0

¸
i=1
1
i
2
1
(y≤i)
yP([X
1
[ ≥ y) dy
≤ 4

0
1
y
yP([X
1
[ ≥ y) dy
= 4

0
P([X
1
[ ≥ y) dy = 4E[X
1
[ < ∞.
Let U
i
= Y
i
−EY
i
. Then Var U
i
= Var Y
i
≤ EY
2
i
, and by the above,

¸
i=1
Var (U
i
/i) < ∞.
By Lemma 6.1 (with V
i
= U
i
/i),
¸
n
i=1
(U
i
/i) converges almost surely. By Kronecker’s lemma, (
¸
n
i=1
U
i
)/n
converges almost surely. Finally, since EY
i
→ 0, then
¸
n
i=1
EY
i
/n → 0, hence
¸
n
i=1
Y
i
/n → 0.
7. Uniform integrability. Before proceeding to some extensions of the SLLN, we discuss uniform inte-
grability. A sequence of r.v.s is uniformly integrable if
sup
i

(|X
i
|>M)
[X
i
[ dP → 0
as M → ∞.
Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as
x → ∞, and sup
i
Eϕ([X
i
[) < ∞. Then the X
i
are uniformly integrable.
Proof. Let ε > 0 and choose x
0
such that x/ϕ(x) < ε if x ≥ x
0
. If M ≥ x
0
,

(|X
i
|>M)
[X
i
[ =

[X
i
[
ϕ([X
i
[)
ϕ([X
i
[)1
(|X
i
|>M)
≤ ε

ϕ([X
i
[) ≤ ε sup
i
Eϕ([X
i
[).
11
Proposition 7.2. If X
n
and Y
n
are two uniformly integrable sequences, then X
n
+ Y
n
is also a uniformly
integrable sequence.
Proof. Since there exists M
0
such that sup
n
E[[X
n
[; [X
n
[ > M
0
] < 1 and sup
n
E[[Y
n
[; [Y
n
[ > M
0
] < 1,
then sup
n
E[X
n
[ ≤ M
0
+ 1, and similarly for the Y
n
. Let ε > 0 and choose M
1
> 4(M
0
+ 1)/ε such that
sup
n
E[[X
n
[; [X
n
[ > M
1
] < ε/4 and sup
n
E[[Y
n
[; [Y
n
[ > M
1
] < ε/4. Let M
2
= 4M
2
1
.
Note P([X
n
[ +[Y
n
[ > M
2
) ≤ (E[X
n
[ +E[Y
n
[)/M
2
≤ ε/(4M
1
) by Chebyshev’s inequality. Then
E[[X
n
+Y
n
[; [X
n
+Y
n
[ > M
2
] ≤ E[[X
n
[; [X
n
[ > M
1
]
+E[[X
n
[; [X
n
[ ≤ M
1
, [X
n
+Y
n
[ > M
2
]
+E[[Y
n
[; [Y
n
[ > M
1
]
+E[[Y
n
[; [Y
n
[ ≤ M
1
, [X
n
+Y
n
[ > M
2
].
The ﬁrst and third terms on the right are each less than ε/4 by our choice of M
1
. The second and fourth
terms are each less than M
1
P([X
n
+Y
n
[ > M
2
) ≤ ε/2.
The main result we need in this section is Vitali’s convergence theorem.
Theorem 7.3. If X
n
→ X almost surely and the X
n
are uniformly integrable, then E[X
n
−X[ → 0.
Proof. By the above proposition, X
n
− X is uniformly integrable and tends to 0 a.s., so without loss of
generality, we may assume X = 0. Let ε > 0 and choose M such that sup
n
E[[X
n
[; [X
n
[ > M] < ε. Then
E[X
n
[ ≤ E[[X
n
[; [X
n
[ > M] +E[[X
n
[; [X
n
[ ≤ M] ≤ ε +E[[X
n
[1
(|X
n
|≤M)
].
The second term on the right goes to 0 by dominated convergence.
8. Complements to the SLLN.
Proposition 8.1. Suppose X
i
is an i.i.d. sequence and E[X
1
[ < ∞. Then
E

S
n
n
−EX
1

→ 0.
Proof. Without loss of generality we may assume EX
1
= 0. By the SLLN, S
n
/n → 0 a.s. So we need to
show that the sequence S
n
/n is uniformly integrable.
Pick M
1
such that E[[X
1
[; [X
1
[ > M
1
] < ε/2. Pick M
2
= M
1
E[X
1
[/ε. So
P([S
n
/n[ > M
2
) ≤ E[S
n
[/nM
2
≤ E[X
1
[/M
2
= ε/M
1
.
We used here E[S
n
[ ≤
¸
n
i=1
E[X
i
[ = nE[X
1
[.
We then have
E[[X
i
[; [S
n
/n[ > M
2
] ≤ E[[X
i
[ : [X
i
[ > M
1
] +E[[X
i
[; [X
i
[ ≤ M
1
, [S
n
/n[ > M
2
]
≤ ε +M
1
P([S
n
/n[ > M
2
) ≤ 2ε.
Finally,
E[[S
n
/n[; [S
n
/n[ > M
2
] ≤
1
n
n
¸
i=1
E[[X
i
[; [S
n
/n[ > M
2
] ≤ 2ε.
We now consider the “three series criterion.” We prove the “if” portion here and defer the “only if”
to Section 20.
12
Theorem 8.2. Let X
i
be a sequence of independent random variables., A > 0, and Y
i
= X
i
1
(|X
i
|≤A)
. Then
¸
X
i
converges if and only if all of the following three series converge: (a)
¸
P([X
n
[ > A); (b)
¸
EY
i
; (c)
¸
Var Y
i
.
Proof of “if” part. Since (c) holds, then
¸
(Y
i
−EY
i
) converges by Lemma 6.1. Since (b) holds, taking the
diﬀerence shows
¸
Y
i
converges. Since (a) holds,
¸
P(X
i
= Y
i
) =
¸
P([X
i
[ > A) < ∞, so by Borel-Cantelli,
P(X
i
= Y
i
i.o.) = 0. It follows that
¸
X
i
converges.
9. Conditional expectation.
If T ⊆ ( are two σ-ﬁelds and X is an integrable ( measurable random variable, the conditional
expectation of X given T, written E[X [ T] and read as “the expectation (or expected value) of X given
T,” is any T measurable random variable Y such that E[Y ; A] = E[X; A] for every A ∈ T. The conditional
probability of A ∈ ( given T is deﬁned by P(A [ T) = E[1
A
[ T].
If Y
1
, Y
2
are two T measurable random variables with E[Y
1
; A] = E[Y
2
; A] for all A ∈ T, then Y
1
= Y
2
,
a.s., or conditional expectation is unique up to a.s. equivalence.
In the case X is already T measurable, E[X [ T] = X. If X is independent of T, E[X [ T] = EX.
Both of these facts follow immediately from the deﬁnition. For another example, which ties this deﬁnition
with the one used in elementary probability courses, if ¦A
i
¦ is a ﬁnite collection of disjoint sets whose union
is Ω, P(A
i
) > 0 for all i, and T is the σ-ﬁeld generated by the A
i
s, then
P(A [ T) =
¸
i
P(A∩ A
i
)
P(A
i
)
1
A
i
.
This follows since the right-hand side is T measurable and its expectation over any set A
i
is P(A∩ A
i
).
As an example, suppose we toss a fair coin independently 5 times and let X
i
be 1 or 0 depending
whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let T
i
=
σ(X
1
, . . . , X
i
). Then P(A) = 1/32 while P(A [ T
1
) is equal to 1/16 on the event (X
1
= 1) and 0 on the event
(X
1
= 0). P(A [ T
2
) is equal to 1/8 on the event (X
1
= 1, X
2
= 1) and 0 otherwise.
We have
E[E[X [ T]] = EX (9.1)
because E[E[X [ T]] = E[E[X [ T]; Ω] = E[X; Ω] = EX.
The following is easy to establish.
Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X [ T] ≥ E[Y [ T] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX +Y [ T] = aE[X [ T] +E[Y [ T].
It is easy to check that limit theorems such as monotone convergence and dominated convergence
have conditional expectation versions, as do inequalities like Jensen’s and Chebyshev’s inequalities. Thus,
for example, we have the following.
Proposition 9.2. (Jensen’s inequality for conditional expectations) If g is convex and X and g(X) are
integrable,
E[g(X) [ T] ≥ g(E[X [ T]), a.s.
A key fact is the following.
13
Proposition 9.3. If X and XY are integrable and Y is measurable with respect to T, then
E[XY [ T] = Y E[X [ T]. (9.2)
Proof. If A ∈ T, then for any B ∈ T,
E

1
A
E[X [ T]; B

= E

E[X [ T]; A∩ B

= E[X; A∩ B] = E[1
A
X; B].
Since 1
A
E[X [ T] is T measurable, this shows that (9.1) holds when Y = 1
A
and A ∈ T. Using linearity
and taking limits shows that (9.1) holds whenever Y is T measurable and Xand XY are integrable.
Two other equalities follow.
Proposition 9.4. If c ⊆ T ⊆ (, then
E

E[X [ T] [ c

= E[X [ c] = E

E[X [ c] [ T

.
Proof. The right equality holds because E[X [ c] is c measurable, hence T measurable. To show the left
equality, let A ∈ c. Then since A is also in T,
E

E

E[X [ T] [ c

; A

= E

E[X [ T]; A

= E[X; A] = E[E

X [ c]; A

.
Since both sides are c measurable, the equality follows.
To show the existence of E[X [ T], we proceed as follows.
Proposition 9.5. If X is integrable, then E[X [ T] exists.
Proof. Using linearity, we need only consider X ≥ 0. Deﬁne a measure Q on T by Q(A) = E[X; A] for
A ∈ T. This is trivially absolutely continuous with respect to P[
F
, the restriction of P to T. Let E[X [ T]
be the Radon-Nikodym derivative of Q with respect to P[
F
. The Radon-Nikodym derivative is T measurable
by construction and so provides the desired random variable.
When T = σ(Y ), one usually writes E[X [ Y ] for E[X [ T]. Notation that is commonly used (however,
we will use it only very occasionally and only for heuristic purposes) is E[X [ Y = y]. The deﬁnition is as
follows. If A ∈ σ(Y ), then A = (Y ∈ B) for some Borel set B by the deﬁnition of σ(Y ), or 1
A
= 1
B
(Y ). By
linearity and taking limits, if Z is σ(Y ) measurable, Z = f(Y ) for some Borel measurable function f. Set
Z = E[X [ Y ] and choose f Borel measurable so that Z = f(Y ). Then E[X [ Y = y] is deﬁned to be f(y).
If X ∈ L
2
and ´ = ¦Y ∈ L
2
: Y is T-measurable¦, one can show that E[X [ T] is equal to the
projection of X onto the subspace ´. We will not use this in these notes.
10. Stopping times.
We next want to talk about stopping times. Suppose we have a sequence of σ-ﬁelds T
i
such that
T
i
⊂ T
i+1
for each i. An example would be if T
i
= σ(X
1
, . . . , X
i
). A random mapping N from Ω to
¦0, 1, 2, . . .¦ is called a stopping time if for each n, (N ≤ n) ∈ T
n
. A stopping time is also called an optional
time in the Markov theory literature.
The intuition is that the sequence knows whether N has happened by time n by looking at T
n
.
Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the ﬁrst motorcycle
shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors,
and come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop
at the third stop light before the city limits (and they had not been there before), they would need to drive
to the city limits, then turn around and return past three stop lights. That is not a stopping time, because
they have to go ahead of where they wanted to stop to know to stop there.
We use the notation a ∧ b = min(a, b) and a ∨ b = max(a, b). The proof of the following is immediate
from the deﬁnitions.
14
Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N
1
and N
2
are stopping times, then so are N
1
∧ N
2
and N
1
∨ N
2
.
(c) If N
n
is a nondecreasing sequence of stopping times, then so is N = sup
n
N
n
.
(d) If N
n
is a nonincreasing sequence of stopping times, then so is N = inf
n
N
n
.
(e) If N is a stopping time, then so is N +n.
We deﬁne T
N
= ¦A : A∩ (N ≤ n) ∈ T
n
for all n¦.
11. Martingales.
In this section we consider martingales. Let T
n
be an increasing sequence of σ-ﬁelds. A sequence of
random variables M
n
n
if for each n, M
n
is T
n
measurable.
M
n
is a martingale if M
n
n
, M
n
is integrable for all n, and
E[M
n
[ T
n−1
] = M
n−1
, a.s., n = 2, 3, . . . . (11.1)
If we have E[M
n
[ T
n−1
] ≥ M
n−1
a.s. for every n, then M
n
is a submartingale. If we have E[M
n
[
T
n−1
] ≤ M
n−1
, we have a supermartingale. Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If X
i
is a sequence of mean zero i.i.d. random
variables and S
n
is the partial sum process, then M
n
= S
n
is a martingale, since E[M
n
[ T
n−1
] = M
n−1
+
E[M
n
− M
n−1
[ T
n−1
] = M
n−1
+ E[M
n
− M
n−1
] = M
n−1
, using independence. If the X
i
’s have variance
one and M
n
= S
2
n
−n, then
E[S
2
n
[ T
n−1
] = E[(S
n
−S
n−1
)
2
[ T
n−1
] + 2S
n−1
E[S
n
[ T
n−1
] −S
2
n−1
= 1 +S
2
n−1
,
using independence. It follows that M
n
is a martingale.
Another example is the following: if X ∈ L
1
and M
n
= E[X [ T
n
], then M
n
is a martingale.
If M
n
is a martingale and H
n
∈ T
n−1
for each n, it is easy to check that N
n
=
¸
n
i=1
H
i
(M
i
−M
i−1
)
is also a martingale.
If M
n
is a martingale and g(M
n
) is integrable for each n, then by Jensen’s inequality
E[g(M
n+1
) [ T
n
] ≥ g(E[M
n+1
[ T
n
]) = g(M
n
),
or g(M
n
) is a submartingale. Similarly if g is convex and nondecreasing on [0, ∞) and M
n
is a positive
submartingale, then g(M
n
) is a submartingale because
E[g(M
n+1
) [ T
n
] ≥ g(E[M
n+1
[ F
n
]) ≥ g(M
n
).
12. Optional stopping.
Note that if one takes expectations in (11.1), one has EM
n
= EM
n−1
, and by induction EM
n
= EM
0
.
The theorem about martingales that lies at the basis of all other results is Doob’s optional stopping theorem,
which says that the same is true if we replace n by a stopping time N. There are various versions, depending
on what conditions one puts on the stopping times.
Theorem 12.1. If N is a bounded stopping time with respect to T
n
and M
n
a martingale, then EM
N
=
EM
0
.
Proof. Since N is bounded, let K be the largest value N takes. We write
EM
N
=
K
¸
k=0
E[M
N
; N = k] =
K
¸
k=0
E[M
k
; N = k].
15
Note (N = k) is T
j
measurable if j ≥ k, so
E[M
k
; N = k] = E[M
k+1
; N = k]
= E[M
k+2
; N = k] = . . . = E[M
K
; N = k].
Hence
EM
N
=
K
¸
k=0
E[M
K
; N = k] = EM
K
= EM
0
.
This completes the proof.
The assumption that N be bounded cannot be entirely dispensed with. For example, let M
n
be the
partial sums of a sequence of i.i.d. random variable that take the values ±1, each with probability
1
2
. If
N = min¦i : M
i
= 1¦, we will see later on that N < ∞ a.s., but EM
N
= 1 = 0 = EM
0
.
The same proof as that in Theorem 12.1 gives the following corollary.
Corollary 12.2. If N is bounded by K and M
n
is a submartingale, then EM
N
≤ EM
K
.
Also the same proof gives
Corollary 12.3. If N is bounded by K, A ∈ T
N
, and M
n
is a submartingale, then E[M
N
; A] ≤ E[M
K
; A].
Proposition 12.4. If N
1
≤ N
2
are stopping times bounded by K and M is a martingale, then E[M
N
2
[
T
N
1
] = M
N
1
, a.s.
Proof. Suppose A ∈ T
N
1
. We need to show E[M
N
1
; A] = E[M
N
2
; A]. Deﬁne a new stopping time N
3
by
N
3
(ω) =

N
1
(ω) if ω ∈ A
N
2
(ω) if ω / ∈ A.
It is easy to check that N
3
is a stopping time, so EM
N
3
= EM
K
= EM
N
2
implies
E[M
N
1
; A] +E[M
N
2
; A
c
] = E[M
N
2
].
Subtracting E[M
N
2
; A
c
] from each side completes the proof.
The following is known as the Doob decomposition.
Proposition 12.5. Suppose X
k
is a submartingale with respect to an increasing sequence of σ-ﬁelds T
k
.
Then we can write X
k
= M
k
+A
k
such that M
k
is a martingale adapted to the T
k
and A
k
is a sequence of
random variables with A
k
being T
k−1
-measurable and A
0
≤ A
1
≤ .
Proof. Let a
k
= E[X
k
[ T
k−1
] −X
k−1
for k = 1, 2, . . . Since X
k
is a submartingale, then each a
k
≥ 0. Then
let A
k
=
¸
k
i=1
a
i
. The fact that the A
k
are increasing and measurable with respect to T
k−1
is clear. Set
M
k
= X
k
−A
k
. Then
E[M
k+1
−M
k
[ T
k
] = E[X
k+1
−X
k
[ F
k
] −a
k+1
= 0,
or M
k
is a martingale.
Combining Propositions 12.4 and 12.5 we have
16
Corollary 12.6. Suppose X
k
is a submartingale, and N
1
≤ N
2
are bounded stopping times. Then
E[X
N
2
[ T
N
1
] ≥ X
N
1
.
13. Doob’s inequalities.
The ﬁrst interesting consequences of the optional stopping theorems are Doob’s inequalities. If M
n
is a martingale, denote M

n
= max
i≤n
[M
i
[.
Theorem 13.1. If M
n
is a martingale or a positive submartingale,
P(M

n
≥ a) ≤ E[[M
n
[; M

n
≥ a]/a ≤ E[M
n
[/a.
Proof. Set M
n+1
= M
n
. Let N = min¦j : [M
j
[ ≥ a¦∧(n+1). Since [ [ is convex, [M
n
[ is a submartingale.
If A = (M

n
≥ a), then A ∈ T
N
and by Corollary 12.3
aP(M

n
≥ a) ≤ E[[M
N
[; A] ≤ E[[M
n
[; A] ≤ E[M
n
[.
For p > 1, we have the following inequality.
Theorem 13.2. If p > 1 and E[M
i
[
p
< ∞ for i ≤ n, then
E(M

n
)
p

p
p −1

p
E[M
n
[
p
.
Proof. Note M

n

¸
n
i=1
[M
n
[, hence M

n
∈ L
p
. We write
E(M

n
)
p
=

0
pa
p−1
P(M

n
> a) da ≤

0
pa
p−1
E[[M
n
[1
(M

n
≥a)
/a] da
= E

M

n
0
pa
p−2
[M
n
[ da =
p
p −1
E[(M

n
)
p−1
[M
n
[]

p
p −1
(E(M

n
)
p
)
(p−1)/p
(E[M
n
[
p
)
1/p
.
The last inequality follows by H¨older’s inequality. Now divide both sides by the quantity (E(M

n
)
p
)
(p−1)/p
.
14. Martingale convergence theorems.
The martingale convergence theorems are another set of important consequences of optional stopping.
The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of
times a process crosses from below a to above b.
To be more exact, let
S
1
= min¦k : X
k
≤ a¦, T
1
= min¦k > S
1
: X
k
≥ b¦,
and
S
i+1
= min¦k > T
i
: X
k
≤ a¦, T
i+1
= min¦k > S
i+1
: X
k
≥ b¦.
The number of upcrossings U
n
before time n is U
n
= max¦j : T
j
≤ n¦.
17
Theorem 14.1. (Upcrossing lemma) If X
k
is a submartingale,
EU
n
≤ (b −a)
−1
E[(X
n
−a)
+
].
Proof. The number of upcrossings of [a, b] by X
k
is the same as the number of upcrossings of [0, b − a]
by Y
k
= (X
k
−a)
+
. Moreover Y
k
is still a submartingale. If we obtain the inequality for the the number of
upcrossings of the interval [0, b −a] by the process Y
k
, we will have the desired inequality for upcrossings of
X.
So we may assume a = 0. Fix n and deﬁne Y
n+1
= Y
n
. This will still be a submartingale. Deﬁne the
S
i
, T
i
as above, and let S

i
= S
i
∧ (n + 1), T

i
= T
i
∧ (n + 1). Since T
i+1
> S
i+1
> T
i
, then T

n+1
= n + 1.
We write
EY
n+1
= EY
S

1
+
n+1
¸
i=0
E[Y
T

i
−Y
S

i
] +
n+1
¸
i=0
E[Y
S

i+1
−Y
T

i
].
All the summands in the third term on the right are nonnegative since Y
k
is a submartingale. For the jth
upcrossing, Y
T

j
−Y
S

j
≥ b −a, while Y
T

j
−Y
S

j
is always greater than or equal to 0. So

¸
i=0
(Y
T

i
−Y
S

i
) ≥ (b −a)U
n
.
So
(4.8) EU
n
≤ EY
n+1
/(b −a).
This leads to the martingale convergence theorem.
Theorem 14.2. If X
n
is a submartingale such that sup
n
EX
+
n
< ∞, then X
n
converges a.s. as n → ∞.
Proof. Let U(a, b) = lim
n→∞
U
n
. For each a, b rational, by monotone convergence,
EU(a, b) ≤ c(b −a)
−1
E(X
n
−a)
+
< ∞.
So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X
n
(ω)
cannot have limsup X
n
> liminf X
n
. Therefore X
n
converges a.s., although we still have to rule out the
possibility of the limit being inﬁnite. Since X
n
is a submartingale, EX
n
≥ EX
0
, and thus
E[X
n
[ = EX
+
n
+EX

n
= 2EX
+
n
−EX
n
≤ 2EX
+
n
−EX
0
.
By Fatou’s lemma, E lim
n
[X
n
[ ≤ sup
n
E[X
n
[ < ∞, or X
n
converges a.s. to a ﬁnite limit.
Corollary 14.3. If X
n
is a positive supermartingale or a martingale bounded above or below, X
n
converges
a.s.
Proof. If X
n
is a positive supermartingale, −X
n
is a submartingale bounded above by 0. Now apply
Theorem 4.12.
18
If X
n
is a martingale bounded above, by considering −X
n
, we may assume X
n
is bounded below.
Looking at X
n
+M for ﬁxed M will not aﬀect the convergence, so we may assume X
n
is bounded below by
0. Now apply the ﬁrst assertion of the corollary.
Proposition 14.4. If X
n
is a martingale with sup
n
E[X
n
[
p
< ∞ for some p > 1, then the convergence is in
L
p
as well as a.s. This is also true when X
n
is a submartingale. If X
n
is a uniformly integrable martingale,
then the convergence is in L
1
. If X
n
→ X

in L
1
, then X
n
= E[X

[ T
n
].
X
n
is a uniformly integrable martingale if the collection of random variables X
n
is uniformly inte-
grable.
Proof. The L
p
convergence assertion follows by using Doob’s inequality (Theorem 13.2) and dominated
convergence. The L
1
convergence assertion follows since a.s. convergence together with uniform integrability
implies L
1
convergence. Finally, if j < n, we have X
j
= E[X
n
[ T
j
]. If A ∈ T
j
,
E[X
j
; A] = E[X
n
; A] → E[X

; A]
by the L
1
convergence of X
n
to X

. Since this is true for all A ∈ T
j
, X
j
= E[X

[ T
j
].
15. Applications of martingales.
Let S
n
be your fortune at time n. In a fair casino, E[S
n+1
[ T
n
] = S
n
. If N is a stopping time, the
optional stopping theorem says that ES
N
= ES
0
; in other words, no matter what stopping time you use
and what method of betting, you will do not better on average than ending up with what you started with.
An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y
i
be i.i.d. Let
Z
n
= E[Y
1
[ S
n
, S
n+1
, . . . , S
N
]. We claim Z
n
= S
n
/n. Certainly S
n
/n is σ(S
n
, , S
N
) measurable. If
A ∈ σ(S
n
, . . . , S
N
) for some n, then A = ((S
n
, . . . , S
N
) ∈ B) for some Borel subset B of R
N−n+1
. Since the
Y
i
are i.i.d., for each k ≤ n,
E[Y
1
; (S
n
, . . . , S
N
) ∈ B] = E[Y
k
; (S
n
, . . . , S
N
) ∈ B].
Summing over k and dividing by n,
E[Y
1
; (S
n
, . . . , S
N
) ∈ B] = E[S
n
/n; (S
n
, . . . , S
N
) ∈ B].
Therefore E[Y
1
; A] = E[S
n
/n; A] for every A ∈ σ(S
n
, . . . , S
N
). Thus Z
n
= S
n
/n.
Let X
k
= Z
N−k
, and let T
k
= σ(S
N−k
, S
N−k+1
, . . . , S
N
). Note T
k
gets larger as k gets larger,
and by the above X
k
= E[Y
1
[ T
k
]. This shows that X
k
is a martingale (cf. the next to last example
in Section 11). By Doob’s upcrossing inequality, if U
X
n
is the number of upcrossings of [a, b] by X, then
EU
X
N
≤ EX
+
N
/(b − a) ≤ E[Z
0
[/(b − a) = E[Y
1
[/(b − a). This diﬀers by at most one from the number of
upcrossings of [a, b] by Z
1
, . . . , Z
N
. So the expected number of upcrossings of [a, b] by Z
k
for k ≤ N is
bounded by 1 + E[Y
1
[/(b − a). Now let N → ∞. By Fatou’s lemma, the expected number of upcrossings
of [a, b] by Z
1
, . . . is ﬁnite. Arguing as in the proof of the martingale convergence theorem, this says that
Z
n
= S
n
/n does not oscillate.
It is conceivable that [S
n
/n[ → ∞. But by Fatou’s lemma,
E[lim[S
n
/n[] ≤ liminf E[S
n
/n[ ≤ liminf nE[Y
1
[/n = E[Y
1
[ < ∞.
Another application of martingale techniques is Wald’s identities.
19
Proposition 15.1. Suppose the Y
i
are i.i.d. with E[Y
1
[ < ∞, N is a stopping time with EN < ∞, and N
is independent of the Y
i
. Then ES
N
= (EN)(EY
1
), where the S
n
are the partial sums of the Y
i
.
Proof. S
n
−n(EY
1
) is a martingale, so ES
n∧N
= E(n ∧N)EY
1
by optional stopping. The right hand side
tends to (EN)(EY
1
) by monotone convergence. S
n∧N
converges almost surely to S
N
, and we need to show
the expected values converge.
Note
[S
n∧N
[ =

¸
k=0
[S
n∧k
[1
(N=k)

¸
k=0
n∧k
¸
j=0
[Y
j
[1
(N=k)
=
n
¸
j=0

¸
k>j
[Y
j
[1
(N=k)
=
n
¸
j=0
[Y
j
[1
(N≥j)

¸
j=0
[Y
j
[1
(N≥j)
.
The last expression, using the independence, has expected value

¸
j=0
(E[Y
j
[)P(N ≥ j) ≤ (E[Y
1
[)(1 +EN) < ∞.
So by dominated convergence, we have ES
n∧N
→ ES
N
.
Wald’s second identity is a similar expression for the variance of S
N
.
We can use martingales to ﬁnd certain hitting probabilities.
Proposition 15.2. Suppose the Y
i
are i.i.d. with P(Y
1
= 1) = 1/2, P(Y
1
= −1) = 1/2, and S
n
the partial
sum process. Suppose a and b are positive integers. Then
P(S
n
hits −a before b) =
b
a +b
.
If N = min¦n : S
n
∈ ¦−a, b¦¦, then EN = ab.
Proof. S
2
n
− n is a martingale, so ES
2
n∧N
= En ∧ N. Let n → ∞. The right hand side converges to EN
by monotone convergence. Since S
n∧N
is bounded in absolute value by a + b, the left hand side converges
by dominated convergence to ES
2
N
, which is ﬁnite. So EN is ﬁnite, hence N is ﬁnite almost surely.
S
n
is a martingale, so ES
n∧N
= ES
0
= 0. By dominated convergence, and the fact that N < ∞ a.s.,
hence S
n∧N
→ S
N
, we have ES
N
= 0, or
−aP(S
N
= −a) +bP(S
N
= b) = 0.
We also have
P(S
N
= −a) +P(S
N
= b) = 1.
Solving these two equations for P(S
N
= −a) and P(S
N
= b) yields our ﬁrst result. Since EN = ES
2
N
=
a
2
P(S
N
= −a) +b
2
P(S
N
= b), substituting gives the second result.
Based on this proposition, if we let a → ∞, we see that P(N
b
< ∞) = 1 and EN
b
= ∞, where
N
b
= min¦n : S
n
= b¦.
Next we give a version of the Borel-Cantelli lemma.
20
Proposition 15.3. Suppose A
n
∈ T
n
. Then (A
n
i.o.) and (
¸

n=1
P(A
n
[ T
n−1
) = ∞) diﬀer by a null set.
Proof. Let X
n
=
¸
n
m=1
[1
A
m
− P(A
m
[ T
m−1
)]. Note [X
n
− X
n−1
[ ≤ 1. Also, it is easy to see that
E[X
n
−X
n−1
[ T
n−1
] = 0, so X
n
is a martingale.
We claim that for almost every ω either limX
n
exists and is ﬁnite, or else limsup X
n
= ∞ and
liminf X
n
= −∞. In fact, if N = min¦n : X
n
≤ −k¦, then X
n∧N
≥ −k − 1, so X
n∧N
converges by the
martingale convergence theorem. Therefore limX
n
exists and is ﬁnite on (N = ∞). So if limX
n
does not
exist or is not ﬁnite, then N < ∞. This is true for all k, hence liminf X
n
= −∞. A similar argument shows
limsup X
n
= ∞ in this case.
Now if limX
n
exists and is ﬁnite, then
¸

n=1
1
A
n
= ∞ if and only if
¸
P(A
n
[ T
n−1
) < ∞. On the
other hand, if the limit does not exist or is not ﬁnite, then
¸
1
A
n
= ∞ and
¸
P(A
n
[ T
n−1
) = ∞.
16. Weak convergence.
We will see later that if the X
i
are i.i.d. with mean zero and variance one, then S
n
/

n converges in
the sense
P(S
n
/

n ∈ [a, b]) → P(Z ∈ [a, b]),
where Z is a standard normal. If S
n
/

n converged in probability or almost surely, then by the zero-one
law it would converge to a constant, contradicting the above. We want to generalize the above type of
convergence.
We say F
n
converges weakly to F if F
n
(x) → F(x) for all x at which F is continuous. Here F
n
and F are distribution functions. We say X
n
converges weakly to X if F
X
n
converges weakly to F
X
. We
sometimes say X
n
converges in distribution or converges in law to X. Probabilities µ
n
converge weakly if
their corresponding distribution functions converges, that is, if F
µ
n
(x) = µ
n
(−∞, x] converges weakly.
An example that illustrates why we restrict the convergence to continuity points of F is the following.
Let X
n
= 1/n with probability one, and X = 0 with probability one. F
X
n
(x) is 0 if x < 1/n and 1 otherwise.
F
X
n
(x) converges to F
X
(x) for all x except x = 0.
Proposition 16.1. X
n
converges weakly to X if and only if Eg(X
n
) → Eg(X) for all g bounded and
continuous.
The idea that Eg(X
n
) converges to Eg(X) for all g bounded and continuous makes sense for any
metric space and is used as a deﬁnition of weak convergence for X
n
taking values in general metric spaces.
Proof. First suppose Eg(X
n
) converges to Eg(X). Let x be a continuity point of F, let ε > 0, and choose
δ such that [F(y) −F(x)[ < ε if [y −x[ < δ. Choose g continuous such that g is one on (−∞, x], takes values
between 0 and 1, and is 0 on [x +δ, ∞). Then F
X
n
(x) ≤ Eg(X
n
) → Eg(X) ≤ F
X
(x +δ) ≤ F(x) +ε.
Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x −δ] and 0
on [x, ∞), F
X
n
(x) ≥ Eh(X
n
) → Eh(X) ≥ F
X
(x −δ) ≥ F(x) −ε. Since ε is arbitrary, F
X
n
(x) → F
X
(x).
Now suppose X
n
converges weakly to X. If a and b are continuity points of F and of all the F
X
n
,
then E1
[a,b]
(X
n
) = F
X
n
(b) −F
X
n
(a) → F(b) −F(a) = E1
[a,b]
(X). By taking linear combinations, we have
Eg(X
n
) → Eg(X) for every g which is a step function where the end points of the intervals are continuity
points for all the F
X
n
and for F
X
. Since the set of points that are not a continuity point for some F
X
n
or
for F
X
is countable, and we can approximate any continuous function on an interval by such step functions
uniformly, we have Eg(X
n
) → Eg(X) for all g such that the support of g is a closed interval whose endpoints
are continuity points of F
X
and g is continuous on its support.
Let ε > 0 and choose M such that F
X
(M) > 1 − ε and F
X
(−M) < ε and so that M and −M are
continuity points of F
X
and of the F
X
n
. By the above argument, E(1
[−M,M]
g)(X
n
) → E(1
[−M,M]
g)(X),
21
where g is a bounded continuous function. The diﬀerence between E(1
[−M,M]
g)(X) and Eg(X) is bounded
by |g|

P(X / ∈ [−M, M]) ≤ 2ε|g|

. Similarly, when X is replaced by X
n
, the diﬀerence is bounded by
|g|

P(X
n
/ ∈ [−M, M]) → |g|

P(X / ∈ [−M, M]). So for n large, it is less than 3ε|g|

. Since ε is arbitrary,
Eg(X
n
) → Eg(X) whenever g is bounded and continuous.
Let us examine the relationship between weak convergence and convergence in probability. The
example of S
n
/

n shows that one can have weak convergence without convergence in probability.
Proposition 16.2. (a) If X
n
converges to X in probability, then it converges weakly.
(b) If X
n
converges weakly to a constant, it converges in probability.
(c) (Slutsky’s theorem) If X
n
converges weakly to X and Y
n
converges weakly to a constant c, then
X
n
+Y
n
converges weakly to X +c and X
n
Y
n
converges weakly to cX.
Proof. To prove (a), let g be a bounded and continuous function. If n
j
is any subsequence, then there
exists a further subsequence such that X(n
j
k
) converges almost surely to X. Then by dominated convergence,
Eg(X(n
j
k
)) → Eg(X). That suﬃces to show Eg(X
n
) converges to Eg(X).
For (b), if X
n
converges weakly to c,
P(X
n
−c > ε) = P(X
n
> c +ε) = 1 −P(X
n
≤ c +ε) → 1 −P(c ≤ c +ε) = 0.
We use the fact that if Y ≡ c, then c + ε is a point of continuity for F
Y
. A similar equation shows
P(X
n
−c ≤ −ε) → 0, so P([X
n
−c[ > ε) → 0.
We now prove the ﬁrst part of (c), leaving the second part for the reader. Let x be a point such that
x −c is a continuity point of F
X
. Choose ε so that x −c +ε is again a continuity point. Then
P(X
n
+Y
n
≤ x) ≤ P(X
n
+c ≤ x +ε) +P([Y
n
−c[ > ε) → P(X ≤ x −c +ε).
So limsup P(X
n
+Y
n
≤ x) ≤ P(X +c ≤ x +ε). Since ε can be as small as we like and x −c is a continuity
point of F
X
, then limsup P(X
n
+Y
n
≤ x) ≤ P(X +c ≤ x). The lim inf is done similarly.
We say a sequence of distribution functions ¦F
n
¦ is tight if for each ε > 0 there exists M such that
F
n
(M) ≥ 1 − ε and F
n
(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution
functions are tight; this is equivalent to P([X
n
[ ≥ M) ≤ ε.
Theorem 16.3. (Helly’s theorem) Let F
n
be a sequence of distribution functions that is tight. There exists
a subsequence n
j
and a distribution function F such that F
n
j
converges weakly to F.
What could happen is that X
n
= n, so that F
X
n
→ 0; the tightness precludes this.
Proof. Let q
k
be an enumeration of the rationals. Since F
n
(q
k
) ∈ [0, 1], any subsequence has a further
subsequence that converges. Use diagonalization so that F
n
j
(q
k
) converges for each q
k
and call the limit
F(q
k
). F is nondecreasing, and deﬁne F(x) = inf
q
k
≥x
F(q
k
). So F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and
F(s) −ε < F(x) < F(r) +ε. Then
F
n
j
(x) ≥ F
n
j
(r) → F(r) > F(x) −ε
and
F
n
j
(x) ≤ F
n
j
(s) → F(s) < F(x) +ε.
Since ε is arbitrary, F
n
j
(x) → F(x).
Since the F
n
are tight, there exists M such that F
n
(−M) < ε. Then F(−M) ≤ ε, which implies
lim
x→−∞
F(x) = 0. Showing lim
x→∞
F(x) = 1 is similar. Therefore F is in fact a distribution function.
We conclude by giving an easily checked criterion for tightness.
22
Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞.
If c = sup
n
Eϕ([X
n
[) < ∞, then the X
n
are tight.
Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then
P([X
n
[ > M) ≤

ϕ([X
n
[)
c/ε
1
(|X
n
|>M)
dP ≤
ε
c
Eϕ([X
n
[) ≤ ε.
17. Characteristic functions.
We deﬁne the characteristic function of a random variable X by ϕ
X
(t) = Ee
itx
for t ∈ R.
Note that ϕ
X
(t) =

e
itx
P
X
(dx). So if X and Y have the same law, they have the same characteristic
function. Also, if the law of X has a density, that is, P
X
(dx) = f
X
(x) dx, then ϕ
X
(t) =

e
itx
f
X
(x) dx, so
in this case the characteristic function is the same as (one deﬁnition of) the Fourier transform of f
X
.
Proposition 17.1. ϕ(0) = 1, [ϕ(t)[ ≤ 1, ϕ(−t) = ϕ(t), and ϕ is uniformly continuous.
Proof. Since [e
itx
[ ≤ 1, everything follows immediately from the deﬁnitions except the uniform continuity.
For that we write
[ϕ(t +h) −ϕ(t)[ = [Ee
i(t+h)X
−Ee
itX
[ ≤ E[e
itX
(e
ihX
−1)[ = E[e
ihX
−1[.
[e
ihX
− 1[ tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence.
Note that the right hand side is independent of t.
Proposition 17.2. ϕ
aX
(t) = ϕ
X
(at) and ϕ
X+b
(t) = e
itb
ϕ
X
(t),
Proof. The ﬁrst follows from Ee
it(aX)
= Ee
i(at)X
, and the second is similar.
Proposition 17.3. If X and Y are independent, then ϕ
X+Y
(t) = ϕ
X
(t)ϕ
Y
(t).
Proof. From the multiplication theorem,
Ee
it(X+Y )
= Ee
itX
e
itY
= Ee
itX
Ee
itY
.
Note that if X
1
and X
2
are independent and identically distributed, then
ϕ
X
1
−X
2
(t) = ϕ
X
1
(t)ϕ
−X
2
(t) = ϕ
X
1
(t)ϕ
X
2
(−t) = ϕ
X
1
(t)ϕ
X
2
(t) = [ϕ
X
1
(t)[
2
.
Let us look at some examples of characteristic functions.
(a) Bernoulli: By direct computation, this is pe
it
+ (1 −p) = 1 −p(1 −e
it
).
(b) Coin ﬂip: (i.e., P(X = +1) = P(X = −1) = 1/2) We have
1
2
e
it
+
1
2
e
−it
= cos t.
(c) Poisson:
Ee
itX
=

¸
k=0
e
itk
e
−λ
λ
k
k!
= e
−λ
¸
(λe
it
)
k
k!
= e
−λ
e
λe
it
= e
λ(e
it
−1)
.
(d) Point mass at a: Ee
itX
= e
ita
. Note that when a = 0, then ϕ ≡ 1.
23
(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s B
i
. So
ϕ
X
(t) =
n
¸
i=1
ϕ
B
i
(t) = [ϕ
B
i
(t)]
n
= [1 −p(1 −e
it
)]
n
.
(f) Geometric:
ϕ(t) =

¸
k=0
p(1 −p)
k
e
itk
= p
¸
((1 −p)e
it
)
k
=
p
1 −(1 −p)e
it
.
(g) Uniform on [a, b]:
ϕ(t) =
1
b −a

b
a
e
itx
dx =
e
itb
−e
ita
(b −a)it
.
Note that when a = −b this reduces to sin(bt)/bt.
(h) Exponential:

0
λe
itx
e
−λx
dx = λ

0
e
(it−λ)x
dx =
λ
λ −it
.
(i) Standard normal:
ϕ(t) =
1

−∞
e
itx
e
−x
2
/2
dx.
This can be done by completing the square and then doing a contour integration. Alternately, ϕ

(t) =
(1/

2π)

−∞
ixe
itx
e
−x
2
/2
dx. (do the real and imaginary parts separately, and use the dominated
convergence theorem to justify taking the derivative inside.) Integrating by parts (do the real and
imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ

(t) = −tϕ(t) with ϕ(0) = 1
is ϕ(t) = e
−t
2
/2
.
(j) Normal with mean µ and variance σ
2
: Writing X = σZ + µ, where Z is a standard normal, then
ϕ
X
(t) = e
iµt
ϕ
Z
(σt) = e
iµt−σ
2
t
2
/2
.
(k) Cauchy: We have
ϕ(t) =
1
π

e
itx
1 +x
2
dx.
This is a standard exercise in contour integration in complex analysis. The answer is e
−|t|
.
18. Inversion formula.
We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which
gives a formula for the distribution function in terms of the characteristic function.
Lemma 18.1. (a)

N
0
(sin(Ax)/x) dx → sgn (A)π/2 as N → ∞.
(b) sup
a
[

a
0
(sin(Ax)/x) dx[ < ∞.
Proof. If A = 0, this is clear. The case A < 0 reduces to the case A > 0 by the fact that sin is an odd
function. By a change of variables y = Ax, we reduce to the case A = 1. Part (a) is a standard result
in contour integration, and part (b) comes from the fact that the integral can be written as an alternating
series.
24
An alternate proof of (a) is the following. e
−xy
sin x is integrable on ¦(x, y); 0 < x < a, 0 < y < ∞¦.
So

a
0
sin x
x
dx =

a
0

0
e
−xy
sin xdy dx
=

0

a
0
e
−xy
sin xdxdy
=

0

e
−xy
y
2
+ 1
(−y sin x −cos x)

a
0
dy
=

0

¸
e
−ay
y
2
+ 1
(−y sin a −cos a)
¸

−1
y
2
+ 1

dy
=
π
2
−sin a

0
ye
−ay
y
2
+ 1
dy −cos a

0
e
−ay
y
2
+ 1
dy.
The last two integrals tend to 0 as a → ∞ since the integrand is bounded by (1 +y)e
−y
if a ≥ 1.
Theorem 18.2. (Inversion formula) Let µ be a probability measure and let ϕ(t) =

e
itx
µ(dx). If a < b,
then
lim
T→∞
1

T
−T
e
−ita
−e
−itb
it
ϕ(t) dt = µ(a, b) +
1
2
µ(¦a¦) +
1
2
µ(¦b¦).
The example where µ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the
integrand in this case is 2 sin t/t, which is not integrable.
Proof. By Fubini,

T
−T
e
−ita
−e
−itb
it
ϕ(t) dt =

T
−T

e
−ita
−e
−itb
it
e
itx
µ(dx) dt
=

T
−T
e
−ita
−e
−itb
it
e
itx
dt µ(dx).
To justify this, we bound the integrand by the mean value theorem
Expanding e
−itb
and e
−ita
using Euler’s formula, and using the fact that cos is an even function and
sin is odd, we are left with

2

T
0
sin(t(x −a))
t
dt −

T
0
sin(t(x −b))
t
dt

µ(dx).
Using Lemma 18.1 and dominated convergence, this tends to

[πsgn (x −a) −πsgn (x −b)]µ(dx).
Theorem 18.3. If

[ϕ(t)[ dt < ∞, then µ has a bounded density f and
f(y) =
1

e
−ity
ϕ(t) dt.
Proof.
µ(a, b) +
1
2
µ(¦a¦) +
1
2
µ(¦b¦)
= lim
T→∞
1

T
−T
e
−ita
−e
−itb
it
ϕ(t) dt
=
1

−∞
e
−ita
−e
−itb
it
ϕ(t) dt

b −a

[ϕ(t)[ dt.
25
Letting b → a shows that µ has no point masses.
We now write
µ(x, x +h) =
1

e
−itx
−e
−it(x+h)
it
ϕ(t)dt
=
1

x+h
x
e
−ity
dy

ϕ(t) dt
=

x+h
x

1

e
−ity
ϕ(y) dt

dy.
So µ has density (1/2π)

e
−ity
ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
A corollary to the inversion formula is the uniqueness theorem.
Theorem 18.3. If ϕ
X
= ϕ
Y
, then P
X
= P
Y
.
The following proposition can be proved directly, but the proof using characteristic functions is much
easier.
Proposition 18.4. (a) If X and Y are independent, X is a normal with mean a and variance b
2
, and Y is
a normal with mean c and variance d
2
, then X +Y is normal with mean a +c and variance b
2
+d
2
.
(b) If X and Y are independent, X is Poisson with parameter λ
1
, and Y is Poisson with parameter λ
2
,
then X +Y is Poisson with parameter λ
1

2
.
(c) If X
i
are i.i.d. Cauchy, then S
n
/n is Cauchy.
Proof. For (a),
ϕ
X+Y
(t) = ϕ
X
(t)ϕ
Y
(t) = e
iat−b
2
t
2
/2
e
ict−c
2
t
2
/2
= e
i(a+c)t−(b
2
+d
2
)t
2
/2
.
Now use the uniqueness theorem.
Parts (b) and (c) are proved similarly.
19. Continuity theorem.
Lemma 19.1. Suppose ϕ is the characteristic function of a probability µ. Then
µ([−2A, 2A]) ≥ A

1/A
−1/A
ϕ(t) dt

−1.
Proof. Note
1
2T

T
−T
ϕ(t) dt =
1
2T

T
−T

e
itx
µ(dx) dt
=

1
2T
1
[−T,T]
(t)e
itx
dtµ(dx)
=

sin Tx
Tx
µ(dx).
Since [(sin(Tx))/Tx[ ≤ 1 and [ sin(Tx)[ ≤ 1, then [(sin(Tx))/Tx[ ≤ 1/2TA if [x[ ≥ 2A. So

sin Tx
Tx
µ(dx)

≤ µ([−2A, 2A]) +

[−2A,2A]
c
1
2TA
µ(dx)
= µ([−2A, 2A]) +
1
2TA
(1 −µ([−2A, 2A])
=
1
2TA
+

1 −
1
2TA

µ([−2A, 2A]).
26
Setting T = 1/A,

A
2

1/A
−1/A
ϕ(t) dt

1
2
+
1
2
µ([−2A, 2A]).
Now multiply both sides by 2.
Proposition 19.2. If µ
n
converges weakly to µ, then ϕ
n
converges to ϕ uniformly on every ﬁnite interval.
Proof. Let ε > 0 and choose M large so that µ([−M, M]
c
) < ε. Deﬁne f to be 1 on [−M, M], 0 on
[−M −1, M + 1]
c
, and linear in between. Since

f dµ
n

f dµ, then if n is large enough,

(1 −f) dµ
n
≤ 2ε.
We have

n
(t +h) −ϕ
n
(t)[ ≤

[e
ihx
−1[µ
n
(dx)
≤ 2

(1 −f) dµ
n
+h

[x[f(x) µ
n
(dx)
≤ 2ε +h(M + 1).
So for n large enough and [h[ ≤ ε/(M + 1), we have

n
(t +h) −ϕ
n
(t)[ ≤ 3ε,
which says that the ϕ
n
are equicontinuous. Therefore the convergence is uniform on ﬁnite intervals.
The interesting result of this section is the converse, L´evy’s continuity theorem.
Theorem 19.3. Suppose µ
n
are probabilities, ϕ
n
(t) converges to a function ϕ(t) for each t, and ϕ is
continuous at 0. Then ϕ is the characteristic function of a probability µ and µ
n
converges weakly to µ.
Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

1

δ
−δ
ϕ(t) dt −1

< ε.
Using the dominated convergence theorem, choose N such that
1

δ
−δ

n
(t) −ϕ(t)[ dt < ε
if n ≥ N. So if n ≥ N,

1

δ
−δ
ϕ
n
(t) dt

1

δ
−δ
ϕ(t) dt

1

δ
−δ

n
(t) −ϕ(t)[ dt
≥ 1 −2ε.
By Lemma 19.1 with A = 1/δ, for such n,
µ
n
[−2/δ, 2/δ] ≥ 2(1 −2ε) −1 = 1 −4ε.
This shows the µ
n
are tight.
Let n
j
be a subsequence such that µ
n
j
converges weakly, say to µ. Then ϕ
n
j
(t) → ϕ
µ
(t), hence
ϕ(t) = ϕ
µ
(t), or ϕ is the characteristic function of a probability µ. If µ

is any subsequential weak limit
point of µ
n
, then ϕ
µ
(t) = ϕ(t) = ϕ
µ
(t); so µ

must equal µ. Hence µ
n
converges weakly to µ.
We need the following estimate on moments.
27
Proposition 19.4. If E[X[
k
< ∞ for an integer k, then ϕ
X
has a continuous derivative of order k and
ϕ
(k)
(t) =

(ix)
k
e
itx
P
X
(dx).
In particular, ϕ
(k)
(0) = i
k
EX
k
.
Proof. Write
ϕ(t +h) −ϕ(t)
h
=

e
i(t+h)x
−e
itx
h
P(dx).
The integrand is bounded by [x[. So if

[x[P
X
(dx) < ∞, we can use dominated convergence to obtain the
desired formula for ϕ

(t). As in the proof of Proposition 17.1, we see ϕ

(t) is continuous. We do the case of
general k by induction. Evaluating ϕ
(k)
at 0 gives the particular case.
Here is a converse.
Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ

(0) exists, then E[X[
2
<
∞.
Proof. Note
e
ihx
−2 +e
−ihx
h
2
= −2
1 −cos hx
h
2
≤ 0
and 2(1 −cos hx)/h
2
converges to x
2
as h → 0. So by Fatou’s lemma,

x
2
P
X
(dx) ≤ 2 liminf
h→0

1 −cos hx
h
2
P
X
(dx)
= −limsup
h→0
ϕ(h) −2ϕ(0) +ϕ(−h)
h
2
= ϕ

(0) < ∞.
One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof
is very similar to the proof of the central limit theorem, which we give in the next section.
Another nice use of characteristic functions and martingales is the following.
Proposition 19.6. Suppose X
i
is a sequence of independent r.v.s and S
n
converges weakly. Then S
n
converges almost surely.
Proof. Suppose S
n
converges weakly to W. Then ϕ
S
n
(t) → ϕ
W
(t) uniformly on compact sets by Proposition
19.2. Since ϕ
W
(0) = 1 and ϕ
W
is continuous, there exists δ such that [ϕ
W
(t) −1[ < 1/2 if [t[ < δ. So for n
large, ϕ
S
n
(t)[ ≥ 1/4 if [t[ < δ.
Note
E

e
itS
n
[ X
1
, . . . X
n−1

= e
itS
n−1
E[e
itX
n
[ X
1
, . . . , X
n−1
] = e
itS
n−1
ϕ
X
n
(t).
Since ϕ
S
n
(t) =
¸
ϕ
X
i
(t), it follows that e
itS
n

S
n
(t) is a martingale.
Therefore for [t[ < δ and n large, e
itS
n

S
n
(t) is a bounded martingale, and hence converges almost
surely. Since ϕ
S
n
(t) → ϕ
W
(t) = 0, then e
itS
n
converges almost surely if [t[ < δ.
Let A = ¦(ω, t) ∈ Ω (−δ, δ) : e
itS
n
(ω)
does not converge¦. For each t, we have almost sure con-
vergence, so

1
A
(ω, t) P(dω) = 0. Therefore

δ
−δ

1
A
dPdt = 0, and by Fubini,

δ
−δ
1
A
dt dP = 0. Hence
almost surely,

1
A
(ω, t) dt = 0. This means, there exists a set N with P(N) = 0, and if ω / ∈ N, then e
itS
n
(ω)
converges for almost every t ∈ (−δ, δ).
28
If ω / ∈ N, by dominated convergence,

a
0
e
itS
n
(ω)
dt converges, provided a < δ. Call the limit A
a
. Also

a
0
e
itS
n
(ω)
dt =
e
iaS
n
(ω)
−1
iS
n
(ω)
if S
n
(ω) = 0 and equals a otherwise.
Since S
n
converges weakly, it is not possible for [S
n
[ → ∞ with positive probability. If we let
N

= ¦ω : [S
n
(ω)[ → ∞¦ and choose ω / ∈ N ∪ N

, there exists a subsequence S
n
j
(ω) which converges
to a ﬁnite limit, say R. We can choose a < δ such that e
iaS
n
(ω)
converges and e
iaR
= 1. Therefore
A
a
= (e
iaR
−1)/R, a nonzero quantity. But then
S
n
(ω) =
e
iaS
n
(ω)
−1

a
0
e
itS
n
(ω)
dt

lim
n→∞
e
iaS
n
(ω)
−1
A
a
.
Therefore, except for ω ∈ N ∪ N

, we have that S
n
(ω) converges.
20. Central limit theorem.
The simplest case of the central limit theorem (CLT) is the case when the X
i
are i.i.d., with mean
zero and variance one, and then the CLT says that S
n
/

n converges weakly to a standard normal. We ﬁrst
prove this case.
We need the fact that if c
n
are complex numbers converging to c, then (1 +(c
n
/n))
n
→ e
c
. We leave
the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some
care, since log z is a multi-valued function when z is complex.
Theorem 20.1. Suppose the X
i
are i.i.d., mean zero, and variance one. Then S
n
/

n converges weakly to
a standard normal.
Proof. Since X
1
has ﬁnite second moment, then ϕ
X
1
has a continuous second derivative. By Taylor’s
theorem,
ϕ
X
1
(t) = ϕ
X
1
(0) +ϕ

X
1
(0)t +ϕ

X
1
(0)t
2
/2 +R(t),
where [R(t)[/t
2
→ 0 as [t[ → 0. So
ϕ
X
1
(t) = 1 −t
2
/2 +R(t).
Then
ϕ
S
n
/

n
(t) = ϕ
S
n
(t/

n) = (ϕ
X
1
(t/

n))
n
=

1 −
t
2
2n
+R(t/

n)

n
.
Since t/

n converges to zero as n → ∞, we have
ϕ
S
n
/

n
(t) → e
−t
2
/2
.
Now apply the continuity theorem.
Let us give another proof of this simple CLT that does not use characteristic functions. For simplicity
let X
i
be i.i.d. mean zero variance one random variables with E[X
i
[
3
< ∞.
29
Proposition 20.2. With X
i
as above, S
n
/

n converges weakly to a standard normal.
Proof. Let Y
1
, . . . , Y
n
be i.i.d. standard normal r.v.s that are independent of the X
i
. Let Z
1
= Y
2
+ +Y
n
,
Z
2
= X
1
+Y
3
+ +Y
n
, Z
3
= X
1
+X
2
+Y
4
+ +Y
n
, etc.
Let us suppose g ∈ C
3
with compact support and let W be a standard normal. Our ﬁrst goal is to
show
[Eg(S
n
/

n) −Eg(W)[ → 0. (20.1)
We have
Eg(S
n
/

n) −Eg(W) = Eg(S
n
/

n) −Eg(
n
¸
i=1
Y
i
/

n)
=
n
¸
i=1

Eg

X
i
+Z
i

n

−Eg

Y
i
+Z
i

n

.
By Taylor’s theorem,
g

X
i
+Z
i

n

= g(Z
i
/

n) +g

(Z
i
/

n)
X
i

n
+
1
2
g

(Z
i
/

n)X
2
i
+R
n
,
where [R
n
[ ≤ |g

|

[X
i
[
3
/n
3/2
. Taking expectations and using the independence,
Eg

X
i
+Z
i

n

= Eg(Z
i
/

n) + 0 +
1
2
Eg

(Z
i
/

n) +ER
n
.
We have a very similar expression for Eg((Y
i
+Z
i
)/

n). Taking the diﬀerence,

Eg

X
i
+Z
i

n

−Eg

Y
i
+Z
i

n

≤ |g

|

E[X
i
[
3
+E[Y
i
[
3
n
3/2
.
Summing over i from 1 to n, we have (20.1).
By approximating continuous functions with compact support by C
3
functions with compact support,
we have (20.1) for such g. Since E(S
n
/

n)
2
= 1, the sequence S
n
/

n is tight. So given ε there exists M
such that P([S
n
/

n[ > M) < ε for all n. By taking M larger if necessary, we also have P([W[ > M) < ε.
Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded
by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,
[E(gψ)(S
n
/

n) −E(gψ)(W)[ → 0.
However,
[Eg(S
n
/

n) −E(gψ)(S
n
/

n)[ ≤ |g|

P([S
n
/

n[ > M) < ε|g|

,
and similarly
[Eg(W) −E(gψ)(W)[ < ε|g|

.
Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our
proposition.
We give another example of the use of characteristic functions.
30
Proposition 20.3. Suppose for each n the r.v.s X
ni
, i = 1, . . . , n are i.i.d. Bernoullis with parameter p
n
.
If np
n
→ λ and S
n
=
¸
n
i=1
X
ni
, then S
n
converges weakly to a Poisson r.v. with parameter λ.
Proof. We write
ϕ
S
n
(t) = [ϕ
X
n1
(t)]
n
=

1 +p
n
(e
it
−1)

n
=

1 +
np
n
n
(e
it
−1)

n
→ e
λ(e
it
−1)
.
Now apply the continuity theorem.
A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.
Theorem 20.4. Suppose for each n, X
ni
, i = 1, . . . , n are mean zero independent random variables. Suppose
(a)
¸
n
i=1
EX
2
ni
→ σ
2
> 0 and
(b) for each ε,
¸
n
i=1
E[[X
2
ni
; [X
ni
[ > ε] → 0.
Let S
n
=
¸
n
i=1
X
ni
. Then S
n
converges weakly to a normal r.v. with mean zero and variance σ
2
.
Note nothing is said about independence of the X
ni
for diﬀerent n.
Let us look at Theorem 20.1 in light of this theorem. Suppose the Y
i
are i.i.d. and let X
ni
= Y
i
/

n.
Then
n
¸
i=1
E(Y
i
/

n)
2
= EY
2
1
and
n
¸
i=1
E[[X
ni
[
2
; [X
ni
[ > ε] = nE[[Y
1
[
2
/n; [Y
1
[ >

nε] = E[[Y
1
[
2
; [Y
1
[ >

nε],
which tends to 0 by the dominated convergence theorem.
If the Y
i
are independent with mean 0, and
¸
n
i=1
E[Y
i
[
3
(Var S
n
)
3/2
→ 0,
then S
n
/(Var S
n
)
1/2
converges weakly to a standard normal. This is known as Lyapounov’s theorem; we
leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.
Proof. Let ϕ
ni
be the characteristic function of X
ni
and let σ
2
ni
be the variance of X
ni
. We need to show
n
¸
i=1
ϕ
ni
(t) → e
−t
2
σ
2
/2
. (20.2)
Using Taylor series, [e
ib
−1 −ib +b
2
/2[ ≤ c[b[
3
for a constant c. Also,
[e
ib
−1 −ib +b
2
/2[ ≤ [e
ib
−1 −ib[ +[b
2
[/2 ≤ c[b[
2
.
If we apply this to a random variable tY and take expectations,

Y
(t) −(1 +itEY −t
2
EY
2
/2)[ ≤ c(t
2
EY
2
∧ t
3
EY
3
).
Applying this to Y = X
ni
,

ni
(t) −(1 −t
2
σ
2
ni
/2)[ ≤ cE[t
3
[X
ni
[
3
∧ t
2
[X
ni
[
2
].
31
The right hand side is less than or equal to
cE[t
3
[X
ni
[
3
;[X
ni
[ ≤ ε] +cE[t
2
[X
ni
[
2
; [X
ni
[ > ε]
≤ cεt
3
E[[X
ni
[
2
] +ct
2
E[[X
ni
[
2
; [X
ni
[ ≥ ε].
Summing over i we obtain
n
¸
i=1

ni
(t) −(1 −t
2
σ
2
ni
/2)[ ≤ cεt
3
¸
E[[X
ni
[
2
] +ct
2
¸
E[[X
ni
[
2
; [X
ni
[ ≥ ε].
We need the following inequality: if [a
i
[, [b
i
[ ≤ 1, then

n
¸
i=1
a
i

n
¸
i=1
b
i

n
¸
i=1
[a
i
−b
i
[.
To prove this, note
¸
a
i

¸
b
i
= (a
n
−b
n
)
¸
i<n
b
i
+a
n

¸
i<n
a
i

¸
i<n
b
i

and use induction.
Note [ϕ
ni
(t)[ ≤ 1 and [1 − t
2
σ
2
ni
/2[ ≤ 1 because σ
2
ni
≤ ε
2
+ E[[X
2
ni
[; [X
ni
[ > ε] < 1/t
2
if we take ε
small enough and n large enough. So

n
¸
i=1
ϕ
ni
(t) −
n
¸
i=1
(1 −t
2
σ
2
ni
/2)

≤ cεt
3
¸
E[[X
ni
[
2
] +ct
2
¸
E[[X
ni
[
2
; [X
ni
[ ≥ ε].
Since sup
i
σ
2
ni
→ 0, then log(1 −t
2
σ
2
ni
/2) is asymptotically equal to −t
2
σ
2
ni
/2, and so
¸
(1 −t
2
σ
2
ni
/2) = exp

¸
log(1 −t
2
σ
2
ni
/2)

is asymptotically equal to
exp

−t
2
¸
σ
2
ni
/2

= e
−t
2
σ
2
/2
.
Since ε is arbitrary, the proof is complete.
We now complete the proof of Theorem 8.2.
Proof of “only if” part of Theorem 8.2. Since
¸
X
n
converges, then X
n
must converge to zero a.s.,
and so P([X
n
[ > A i.o.) = 0. By the Borel-Cantelli lemma, this says
¸
P([X
n
[ > A) < ∞. We also conclude
by Proposition 5.4 that
¸
Y
n
converges.
Let c
n
=
¸
n
i=1
Var Y
i
and suppose c
n
→ ∞. Let Z
nm
= (Y
m
− EY
m
)/

c
n
. Then
¸
n
m=1
Var Z
nm
=
(1/c
n
)
¸
n
m=1
Var Y
m
= 1. If ε > 0, then for n large, we have 2A/

c
n
< ε. Since [Y
m
[ ≤ A and hence
[EY
m
[ ≤ A, then [Z
nm
[ ≤ 2A/

c
n
< ε. It follows that
¸
n
m=1
E([Z
nm
[
2
; [Z
nm
[ > ε) = 0 for large n.
By Theorem 20.4,
¸
n
m=1
(Y
m
− EY
m
)/

c
n
converges weakly to a standard normal. However,
¸
n
m=1
Y
m
converges, and c
n
→ ∞, so
¸
Y
m
/

c
n
must converge to 0. The quantities
¸
EY
m
/

c
n
are nonrandom, so
there is no way the diﬀerence can converge to a standard normal, a contradiction. We conclude c
n
does not
converge to inﬁnity.
Let V
i
= Y
i
− EY
i
. Since [V
i
[ < 2A, EV
i
= 0, and Var V
i
= Var Y
i
, which is summable, by the “if”
part of the three series criterion,
¸
V
i
converges. Since
¸
Y
i
converges, taking the diﬀerence shows
¸
EY
i
converges.
32
21. Framework for Markov chains.
Suppose o is a set with some topological structure that we will use as our state space. Think of o
as being R
d
or the positive integers, for example. A sequence of random variables X
0
, X
1
, . . ., is a Markov
chain if
P(X
n+1
∈ A [ X
0
, . . . , X
n
) = P(X
n+1
∈ A [ X
n
) (21.1)
for all n and all measurable sets A. The deﬁnition of Markov chain has this intuition: to predict the
probability that X
n+1
is in any set, we only need to know where we currently are; how we got there gives
Let’s make some additional comments. First of all, we previously considered random variables as
mappings from Ω to R. Now we want to extend our deﬁnition by allowing a random variable be a map X
from Ω to o, where (X ∈ A) is T measurable for all open sets A. This agrees with the deﬁnition of r.v. in
the case o = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will con-
ﬁne our attention to the case where either o is ﬁnite, in which case we will usually suppose o = ¦1, 2, . . . , n¦,
or countable and discrete, in which case we will usually suppose o is the set of positive integers.
We are going to further restrict our attention to Markov chains where
P(X
n+1
∈ A [ X
n
= x) = P(X
1
∈ A [ X
0
= x),
that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition
probabilities.
Deﬁne the initial distribution of a Markov chain with stationary transition probabilities by µ(i) =
P(X
0
= i). Deﬁne the transition probabilities by p(i, j) = P(X
n+1
= j [ X
n
= i). Since the transition
probabilities are stationary, p(i, j) does not depend on n.
In this case we can use the deﬁnition of conditional probability given in undergraduate classes. If
P(X
n
= i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.
Proposition 21.1. Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j).
then
P(X
n
= i
n
, X
n−1
= i
n−1
, . . . , X
1
= i
1
, X
0
= i
0
) = µ(i
0
)p(i
0
, i
1
) p(i
n−1
, i
n
). (21.2)
Proof. We use induction on n. It is clearly true for n = 0 by the deﬁnition of µ(i). Suppose it holds for n;
we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then
P(X
3
= i
3
,X
2
= i
2
, X
1
= i
1
, X
0
= i
0
)
= E[P(X
3
= i
3
[ X
0
= i
0
, X
1
= i
1
, X
2
= i
2
); X
2
= i
2
, X
1
= i
i
, X
0
= i
0
]
= E[P(X
3
= i
3
[ X
2
= i
2
); X
2
= i
2
, X
1
= i
i
, X
0
= i
0
]
= p(i
2
, i
3
)P(X
2
= i
2
, X
1
= i
1
, X
0
= i
0
).
Now by the induction hypothesis,
P(X
2
= i
2
, X
1
= i
1
, X
0
= i
0
) = p(i
1
, i
2
)p(i
0
, i
1
)µ(i
0
).
Substituting establishes the claim for n = 3.
The above proposition says that the law of the Markov chain is determined by the µ(i) and p(i, j).
The formula (21.2) also gives a prescription for constructing a Markov chain given the µ(i) and p(i, j).
33
Proposition 21.2. Suppose µ(i) is a sequence of nonnegative numbers with
¸
i
µ(i) = 1 and for each i
the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial
distribution and p(i, j) as the transition probabilities.
Proof. Deﬁne Ω = o

. Let T be the σ-ﬁelds generated by the collection of sets ¦(i
0
, i
1
, . . . , i
n
) : n >
0, i
j
∈ o¦. An element ω of Ω is a sequence (i
0
, i
1
, . . .). Deﬁne X
j
(ω) = i
j
if ω = (i
0
, i
1
, . . .). Deﬁne
P(X
0
= i
0
, . . . , X
n
= i
n
) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be
extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the sequence X
n
has initial
distribution µ(i); what we need to show is that X
n
is a Markov chain and that
P(X
n+1
= i
n+1
[ X
0
= i
0
, . . . X
n
= i
n
) = P(X
n+1
= i
n+1
[ X
n
= i
n
) = p(i
n
, i
n+1
). (21.3)
By the deﬁnition of conditional probability, the left hand side of (21.3) is
P(X
n+1
= i
n+1
[ X
0
= i
0
, . . . , X
n
= i
n
) =
P(X
n+1
= i
n+1
, X
n
= i
n
, . . . X
0
= i
0
)
P(X
n
= i
n
, . . . , X
0
= i
0
)
(21.4)
=
µ(i
0
) p(i
n−1
, i
n
)p(i
n
, i
n+1
)
µ(i
0
) p(i
n−1
, i
n
)
= p(i
n
, i
n+1
)
as desired.
To complete the proof we need to show
P(X
n+1
= i
n+1
, X
n
= i
n
)
P(X
n
= i
n
)
= p(i
n
, i
n+1
),
or
P(X
n+1
= i
n+1
, X
n
= i
n
) = p(i
n
, i
n+1
)P(X
n
= i
n
). (21.5)
Now
P(X
n
= i
n
) =
¸
i
0
,···,i
n−1
P(X
n
= i
n
, X
n−1
= i
n−1
, . . . , X
0
= i
0
)
=
¸
i
0
,···,i
n−1
µ(i
0
) p(i
n−1
, i
n
)
and similarly
P(X
n+1
= i
n+1
, X
n
= i
n
)
=
¸
i
0
,···,i
n−1
P(X
n+1
= i
n+1
, X
n
= i
n
, X
n−1
= i
n−1
, . . . , X
0
= i
0
)
= p(i
n
, i
n+1
)
¸
i
0
,···,i
n−1
µ(i
0
) p(i
n−1
, i
n
).
Equation (21.5) now follows.
Note in this construction that the X
n
sequence is ﬁxed and does not depend on µ or p. Let p(i, j) be
ﬁxed. The probability we constructed above is often denoted P
µ
. If µ is point mass at a point i or x, it is
denoted P
i
or P
x
. So we have one probability space, one sequence X
n
, but a whole family of probabilities
P
µ
.
34
Later on we will see that this framework allows one to express the Markov property and strong
Markov property in a convenient way. As part of the preparation for doing this, we deﬁne the shift operators
θ
k
: Ω → Ω by
θ
k
(i
0
, i
1
, . . .) = (i
k
, i
k+1
, . . .).
Then X
j
◦ θ
k
= X
j+k
. To see this, if ω = (i
0
, i
1
, . . .), then
X
j
◦ θ
k
(ω) = X
j
(i
k
, i
k+1
, . . .) = i
j+k
= X
j+k
(ω).
22. Examples.
Random walk on the integers
We let Y
i
be an i.i.d. sequence of r.v.’s, with p = P(Y
i
= 1) and 1 − p = P(Y
i
= −1). Let
X
n
= X
0
+
¸
n
i=1
Y
i
. Then the X
n
can be viewed as a Markov chain with p(i, i + 1) = p, p(i, i −1) = 1 −p,
and p(i, j) = 0 if [j − i[ = 1. More general random walks on the integers also ﬁt into this framework. To
check that the random walk is Markov,
P(X
n+1
= i
n+1
[ X
0
= i
0
, . . . , X
n
= i
n
)
= P(X
n+1
−X
n
= i
n+1
−i
n
[ X
0
= i
0
, . . . , X
n
= i
n
)
= P(X
n+1
−X
n
= i
n+1
−i
n
),
using the independence, while
P(X
n+1
= i
n+1
[ X
n
= i
n
) = P(X
n+1
−X
n
= i
n+1
−i
n
[ X
n
= i
n
)
= P(X
n+1
−X
n
= i
n+1
−i
n
).
Random walks on graphs
Suppose we have n points, and from each point there is some probability of going to another point.
For example, suppose there are 5 points and we have p(1, 2) =
1
2
, p(1, 3) =
1
2
, p(2, 1) =
1
4
, p(2, 3) =
1
2
,
p(2, 5) =
1
4
, p(3, 1) =
1
4
, p(3, 2) =
1
4
, p(3, 3) =
1
2
, p(4, 1) = 1, p(5, 1) =
1
2
, p(5, 5) =
1
2
. The p(i, j) are often
arranged into a matrix:
P =

¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
0
1
2
1
2
0 0
1
4
0
1
2
0
1
4
1
4
1
4
1
2
0 0
1 0 0 0 0
1
2
0 0 0
1
2
¸

.
Note the rows must sum to 1 since
5
¸
j=1
p(i, j) =
5
¸
j=1
P(X
1
= j [ X
0
= i) = P(X
1
∈ o [ X
0
= i) = 1.
Renewal processes
Let Y
i
be i.i.d. with P(Y
i
= k) = a
k
and the a
k
are nonnegative and sum to 1. Let T
0
= i
0
and
T
n
= T
0
+
¸
n
i=1
. We think of the Y
i
as the lifetime of the nth light bulb and T
n
the time when the nth light
bulb burns out. (We replace a light bulb as soon as it burns out.) Let
X
n
= min¦m−n : T
i
= m for some i¦.
35
So X
n
is the amount of time after time n until the current light bulb burns out.
If X
n
= j and j > 0, then T
i
= n +j for some i but T
i
does not equal n, n + 1, . . . , n +j −1 for any
i. So T
i
= (n + 1) + (j − 1) for some i and T
i
does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for
any i. Therefore X
n+1
= j −1. So p(i, i −1) = 1 if i ≥ 1.
If X
n
= 0, then a light bulb burned out at time n and X
n+1
is 0 if the next light bulb burned out
immediately and j − 1 if the light bulb has lifetime j. The probability of this is a
j
. So p(0, j) = a
j+1
. All
the other p(i, j)’s are 0.
Branching processes
Consider k particles. At the next time interval, some of them die, and some of them split into several
particles. The probability that a given particle will split into j particles is given by a
j
, j = 0, 1, . . ., where
the a
j
are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all
the other particles. If X
n
is the number of particles at time n, then X
n
is a Markov chain. Let Y
i
be i.i.d.
random variables with P(Y
i
= j) = a
j
. The p(i, j) for X
n
are somewhat complicated, and can be deﬁned by
p(i, j) = P(
¸
i
m=1
Y
m
= j).
Queues
We will discuss brieﬂy the M/G/1 queue. The M refers to the fact that the customers arrive according
to a Poisson process. So the probability that the number of customers arriving in a time interval of length t
is k is given by e
−λt
(λt)
k
/k! The G refers to the fact that the length of time it takes to serve a customer is
given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server.
Suppose the length of time to serve one customer has distribution function F with density f. The
probability that k customers arrive during the time it takes to serve one customer is
a
k
=

0
e
−λt
(λt)
k
k!
f(t) dt.
Let the Y
i
be i.i.d. with P(Y
i
= k − 1) = a
k
. So Y
i
is the number of customers arriving during the time it
takes to serve one customer. Let X
n+1
= (X
n
+Y
n+1
)
+
be the number of customers waiting. Then X
n
is a
Markov chain with p(0, 0) = a
0
+a
1
and p(i, j −1 +k) = a
k
if j ≥ 1, k > 1.
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the
r balls at random and move it to the other urn. Let X
n
be the number of balls in the ﬁrst urn. X
n
is a
Markov chain with p(k, k + 1) = (r −k)/r, p(k, k −1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tube connecting them. Suppose
a few molecules of a foreign substance are introduced. Then the number of molecules in the ﬁrst container
is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so inﬁnitely often all the
molecules of the foreign substance will be in the ﬁrst urn. Yet there is a tendency towards equilibrium, so
on average there will be about the same number of molecules in each container for all large times.
Birth and death processes
Suppose there are i particles, and the probability of a birth is a
i
, the probability of a death is b
i
,
where a
i
, b
i
≥ 0, a
i
+ b
i
≤ 1. Setting X
n
equal to the number of particles, then X
n
is a Markov chain with
p(i, i + 1) = a
i
, p(i, i −1) = b
i
, and p(i, i) = 1 −a
i
−b
i
.
23. Markov properties.
A special case of the Markov property says that
E
x
[f(X
n+1
) [ T
n
] = E
X
n
f(X
1
). (23.1)
36
The right hand side is to be interpreted as ϕ(X
n
), where ϕ(y) = E
y
f(X
1
). The randomness on the right
hand side all comes from the X
n
. If we write f(X
n+1
) = f(X
1
) ◦ θ
n
and we write Y for f(X
1
), then the
above can be rewritten
E
x
[Y ◦ θ
n
[ T
n
] = E
X
n
Y.
Let T

be the σ-ﬁeld generated by ∪

n=1
T
n
.
Theorem 23.1. (Markov property) If Y is bounded and measurable with respect to T

, then
E
x
[Y ◦ θ
n
[ T
n
] = E
X
n
[Y ], P −a.s.
for each n and x.
Proof. If we can prove this for Y = f
1
(X
1
) f
m
(X
m
), then taking f
j
(x) = 1
i
j
(x), we will have it for Y ’s
of the form 1
(X
1
=i
1
,...,X
m
=i
m
)
. By linearity (and the fact that o is countable), we will then have it for Y ’s
of the form 1
((X
1
,...,X
m
)∈B)
. A monotone class argument shows that such Y ’s generate T

.
We use induction on m, and ﬁrst we prove it for m = 1. We need to show
E[f
1
(X
1
) ◦ θ
n
[ T
n
] = E
X
n
f
1
(X
1
).
Using linearity and the fact that o is countable, it suﬃces to show this for f
1
(y) = 1
{j}
(y). Using the
deﬁnition of θ
n
, we need to show
P(X
n+1
= j [ T
n
) = P
X
n
(X
1
= j),
or equivalently,
P
x
(X
n+1
= j; A) = E
x
[P
X
n
(X
1
= j); A] (23.2)
when A ∈ T
n
. By linearity it suﬃces to consider A of the form A = (X
1
= i
1
, . . . , X
n
= i
n
). The left hand
side of (23.2) is then
P
x
(X
n+1
= j, X
1
= i
1
, . . . , X
n
= i
j
),
and by (21.4) this is equal to
p(i
n
, j)P
x
(X
1
= i
1
, . . . , X
n
−i
n
) = p(i
n
, j)P
x
(A).
Let g(y) = P
y
(X
1
= j). We have
P
x
(X
1
= j, X
0
= k) =

P
k
(X
1
= j) if x = k,
0 if x = k,
while
E
x
[g(X
0
); X
0
= k] = E
x
[g(k); X
0
= k] = P
k
(X
1
= j)P
x
(X
0
= k) =

P
(
X
1
= j) if x = k,
0 if x = k.
It follows that
p(i, j) = P
x
(X
1
= j [ X
0
= i) = P
i
(X
1
= j).
So the right hand side of (23.2) is
E
x
[p(X
n
, j); x
1
= i
1
, . . . , X
n
= i
n
] = p(i
n
, j)P
x
(A)
37
as required.
Suppose the result holds for m and we want to show it holds for m+ 1. We have
E
x
[f
1
(X
n+1
) f
m+1
(X
n+m+1
) [ T
n
]
= E
x
[E
x
[f
m+1
(X
n+m+1
) [ T
n+m
]f
1
(X
n+1
) f
m
(X
n+m
) [ T
n
]
E
x
[E
X
n+m
[f
m+1
(X
1
)]f
1
(X
n+1
) f
m
(X
n+m
) [ T
n
]
E
x
[f
1
(X
n+1
) f
m−1
(X
n+m−1
)h(X
n+m
) [ T
n
].
Here we used the result for m = 1 and we deﬁned h(y) = f
n+m
(y)g(y), where g(y) = E
y
[f
m+1
(X
1
)]. Using
the induction hypothesis, this is equal to
E
X
n
[f
1
(X
1
) f
m−1
(X
m−1
)g(X
m
)] = E
X
n
[f
1
(X
1
) f
m
(X
m
)E
X
m
f
m+1
(X
1
)]
= E
X
n
[f
1
(X
1
) f
m
(X
m
)E[f
m+1
(X
m+1
) [ T
m
]]
= E
X
n
[f
1
(X
1
) f
m+1
(X
m+1
)],
which is what we needed.
Deﬁne θ
N
(ω) = (θ
N(ω)
)(ω). The strong Markov property is the same as the Markov property, but
where the ﬁxed time n is replaced by a stopping time N.
Theorem 23.2. If Y is bounded and measurable and N is a ﬁnite stopping time, then
E
x
[Y ◦ θ
N
[ T
N
] = E
X
N
[Y ].
Proof. We will show
P
x
(X
N+1
= j [ T
N
) = P
X
N
(X
1
= j).
Once we have this, we can proceed as in the proof of the Theorem 23.1 to obtain our result. To show the
above equality, we need to show that if B ∈ T
N
, then
P
x
(X
N+1
= j, B) = E
x
[P
X
N
(X
1
= j); B]. (23.3)
Recall that since B ∈ T
N
, then B ∩ (N = k) ∈ T
k
. We have
P
x
(X
N+1
= j, B, N = k) = P
x
(X
k+1
= j, B, N = k)
= E
x
[P
x
(X
k+1
= j [ T
k
); B, N = k]
= E
x
[P
X
k
(X
1
= j); B, N = k]
= E
x
[P
X
N
(X
1
= j); B, N = k].
Now sum over k; since N is ﬁnite, we obtain our desired result.
Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let
p
n
(i, j) = P(X
n
= j [ X
0
= i).
Proposition 23.3. For all i, j, m, n we have
p
n+m
(i, j) =
¸
k∈S
p
n
(i, k)p
m
(k, j).
38
Proof. We write
P(X
n+m
= j, X
0
= i) =
¸
k
P(X
n+m
= j, X
n
= k, X
0
= i)
=
¸
k
P(X
n+m
= j [ X
n
= k, X
0
= i)P(X
n
= k [ X
0
= i)P(X
0
= i)
=
¸
k
P(X
n+m
= j [ X
n
= k)p
n
(i, k)P(X
0
= i)
=
¸
k
p
m
(k, j)p
n
(i, k)P(X
0
= i).
If we divide both sides by P(X
0
= i), we have our result.
Note the resemblance to matrix multiplication. It is clear if P is the matrix made up of the p(i, j),
then P
n
will be the matrix whose (i, j) entry is p
n
(i, j).
24. Recurrence and transience.
Let
T
y
= min¦i > 0 : X
i
= y¦.
This is the ﬁrst time that X
i
hits the point y. Even if X
0
= y we would have T
y
> 0. We let T
k
y
be the k-th
time that the Markov chain hits y and we set
r(x, y) = P
x
(T
y
< ∞),
the probability starting at x that the Markov chain ever hits y.
Proposition 24.1. P
x
(T
k
y
< ∞) = r(x, y)r(y, y)
k−1
.
Proof. The case k = 1 is just the deﬁnition, so suppose k > 1. Using the strong Markov property,
P
x
(T
k
y
< ∞) = P
x
(T
y
◦ θ
T
k−1
y
< ∞, T
k−1
y
< ∞)
= E
x
[P
x
(T
y
◦ θ
T
k−1
y
< ∞ [ T
T
k−1
y
); T
k−1
y
< ∞]
= E
x
[P
X(T
k−1
y
)
(T
y
< ∞); T
k−1
y
]
= E
x
[P
y
(T
y
< ∞); T
k−1
y
< ∞]
= r(y, y)P
x
(T
k−1
y
< ∞).
We used here the fact that at time T
k−1
y
the Markov chain must be at the point y. Repeating this argument
k −2 times yields the result.
We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient. Let
N(y) =

¸
n=1
1
(X
n
=y)
.
Proposition 24.2. y is recurrent if and only if E
y
N(y) = ∞.
Proof. Note
E
y
N(y) =

¸
k=1
P
y
(N(y) ≥ k) =

¸
k=1
P
y
(T
k
y
< ∞)
=

¸
k=1
r(y, y)
k
.
39
We used the fact that N(y) is the number of visits to y and the number of visits being larger than k is the
same as the time of the k-th visit being ﬁnite. Since r(y, y) ≤ 1, the left hand side will be ﬁnite if and only
if r(y, y) < 1.
Observe that
E
y
N(y) =
¸
n
P
y
(X
n
= y) =
¸
n
p
n
(y, y).
If we consider simple symmetric random walk on the integers, then p
n
(0, 0) is 0 if n is odd and equal
to

n
n/2

2
−n
if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2
positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using
Stirling’s approximation, we see that p
n
(0, 0) ∼ c/

n for n even, which diverges, and so simple random walk
in one dimension is recurrent.
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but
transient in 3 or more dimensions.
Proposition 24.3. If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.
Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y
1
, . . . , y
n−1
such that p(x, y
1
)p(y
1
, y
2
) p(y
n−1
, y) > 0. Since this is the smallest n, none of the y
i
can equal x. Then
P
x
(T
x
= ∞) ≥ p(x, y
1
) p(y
n−1
, y)(1 −r(y, x)) > 0,
a contradiction to x being recurrent.
Next we show that y is recurrent. Since r(y, x) > 0, there exists L such that p
L
(y, x) > 0. Then
p
L+n+K
(y, y) ≥ p
L
(y, x)p
n
(x, x)p
K
(x, y).
Summing over n,
¸
n
p
L+n+K
(y, y) ≥ p
L
(y, x)p
K
(x, y)
¸
n
p
n
(x, x) = ∞.
We say a subset C of o is closed if x ∈ C and r(x, y) > 0 implies y ∈ C. A subset D is irreducible if
x, y ∈ D implies r(x, y) > 0.
Proposition 24.4. Let C be ﬁnite and closed. Then C contains a recurrent state.
From the preceding proposition, if C is irreducible, then all states will be recurrent.
Proof. If not, for all y we have r(y, y) < 1 and
E
x
N(y) =

¸
k=1
r(x, y)r(y, y)
k−1
=
r(x, y)
1 −r(y, y)
< ∞.
Since C is ﬁnite, then
¸
y
E
x
N(y) < ∞. But that is a contradiction since
¸
y
E
x
N(y) =
¸
y
¸
n
p
n
(x, y) =
¸
n
¸
y
p
n
(x, y) =
¸
n
P
x
(X
n
∈ C) =
¸
n
1 = ∞.
40
Theorem 24.5. Let R = ¦x : r(x, x) = 1¦, the set of recurrent states. Then R = ∪

i=1
R
i
, where each R
i
is
closed and irreducible.
Proof. Say x ∼ y if r(x, y) > 0. Since every state is recurrent, x ∼ x and if x ∼ y, then y ∼ x. If x ∼ y and
y ∼ z, then p
n
(x, y) > 0 and p
m
(y, z) > 0 for some n and m. Then p
n+m
(x, z) > 0 or x ∼ z. Therefore we
have an equivalence relation and we let the R
i
be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For
the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are
transient. In the renewal chain, there are two cases. If ¦k : a
k
> 0¦ is unbounded, all states are recurrent.
If K = max¦k : a
k
> 0¦, then ¦0, 1, . . . , K −1¦ are recurrent states and the rest are transient.
For the queueing model, let µ =
¸
ka
k
, the expected number of people arriving during one customer’s
service time. We may view this as a branching process by letting all the customers arriving during one
person’s service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent
and all other states are also. If µ > 1 all states are transient.
25. Stationary measures.
A probability µ is a stationary distribution if
¸
x
µ(x)p(x, y) = µ(y). (25.1)
In matrix notation this is µP = µ, or µ is the left eigenvector corresponding to the eigenvalue 1. In the case of
a stationary distribution, P
µ
(X
1
= y) = µ(y), which implies that X
1
, X
2
, . . . all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary
measure.
If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the
case of an asymmetric random walk: p(i, i + 1) = p, p(i, i −1) = q = 1 −p and p = q, setting µ(x) = (p/q)
x
also works.
In the Ehrenfest urn model, µ(x) = 2
−r

r
x

works. One way to see this is that µ is the distribution
one gets if one ﬂips r coins and puts a coin in the ﬁrst urn when the coin is heads. A transition corresponds
to picking a coin at random and turning it over.
Proposition 25.1. Let a be recurrent and let T = T
a
. Set
µ(y) = E
a
T−1
¸
n=0
1
(X
n
=y)
.
Then µ is a stationary measure.
The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X
0
, . . . , X
T−1
while µP is the expected number of visits to y by X
1
, . . . , X
T
. These should be the same because X
T
=
X
0
= a.
Proof. First, let p
n
(a, y) = P
a
(X
n
= y, T > n). So
µ(y) =

¸
n=0
P
a
(X
n
= y, T > n) =

¸
n=0
p
n
(a, y)
41
and
¸
y
µ(y)p(y, z) =
¸
y

¸
n=0
p
n
(a, y)p(y, z).
Second, we consider the case z = a. Then
¸
y
p
n
(a, y)p(y, z)
=
¸
y
P
a
(hit y in n steps without ﬁrst hitting a and then go to z in one step)
= p
n+1
(a, z).
So
¸
n
µ(y)p(y, z) =
¸
n
¸
y
p
n
(a, y)p(y, z)
=

¸
n=0
p
n+1
(a, z) =

¸
n=0
p
n
(a, z)
= µ(z)
since p
0
(a, z) = 0.
Third, we consider the case a = z. Then
¸
y
p
n
(a, y)p(y, z)
=
¸
y
P
a
(hit y in n steps without ﬁrst hitting a and then go to z in one step)
= P
a
(T = n + 1).
Recall P
a
(T = 0) = 0, and since a is recurrent, T < ∞. So
¸
y
µ(y)p(y, z) =
¸
n
¸
y
p
n
(a, y)p(y, z)
=

¸
n=0
P
a
(T = n + 1) =

¸
n=0
P
a
(T = n) = 1.
On the other hand,
T−1
¸
n=0
1
(X
n
=a)
= 1
(X
0
=a)
= 1,
hence µ(a) = 1. Therefore, whether z = a or z = a, we have µP(z) = µ(z).
Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0, choose n so that p
n
(a, y) > 0,
and then
1 = µ(a) =
¸
y
µ(y)p
n
(a, y),
which implies µ(y) < ∞.
We next turn to uniqueness of the stationary distribution. We call the stationary measure constructed
in Proposition 25.1 µ
a
.
42
Proposition 25.2. If the Markov chain is irreducible and all states are recurrent, then the stationary
measure is unique up to a constant multiple.
Proof. Fix a ∈ o. Let µ
a
be the stationary measure constructed above and let ν be any other stationary
measure.
Since ν = νP, then
ν(z) = ν(a)p(a, z) +
¸
y=a
ν(y)p(y, z)
= ν(a)p(a, z) +
¸
y=a
ν(a)p(a, y)p(y, z) +
¸
x=a
¸
y=a
ν(x)p(x, y)p(y, z)
= ν(a)P
a
(X
1
= z) +ν(a)P
a
(X
1
= a, X
2
= z) +P
ν
(X
0
= a, X
1
= a, X
2
= z).
Continuing,
ν(z) = ν(a)
n
¸
m=1
P
a
(X
1
= a, X
2
= a, . . . , X
m−1
= a, X
m
= z)
+P
ν
(X
0
= a, X
1
= a, . . . , X
n−1
= a, X
n
= z)
≥ ν(a)
n
¸
m=1
P
a
(X
1
= a, X
2
= a, . . . , X
m−1
= a, X
m
= z).
Letting n → ∞, we obtain
ν(z) ≥ ν(a)µ
a
(z).
We have
ν(a) =
¸
x
ν(x)p
n
(x, a) ≥ ν(a)
¸
x
µ
a
(x)p
n
(x, a)
= ν(a)µ
a
(a) = ν(a),
since µ
a
(a) = 1 (see proof of Proposition 25.1). This means that we have equality and so
ν(x) = ν(a)µ
a
(x)
whenever p
n
(x, a) > 0. Since r(x, a) > 0, this happens for some n. Consequently
ν(x)
ν(a)
= µ
a
(x).
Proposition 25.3. If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.
Proof. If µ(y) > 0, then
∞ =

¸
n=1
µ(y) =

¸
n=1
¸
x
µ(x)p
n
(x, y) =
¸
x
µ(x)

¸
n=1
p
n
(x, y)
=
¸
x
µ(x)

¸
n=1
P
x
(X
n
= y) =
¸
x
µ(x)E
x
N(y)
=
¸
x
µ(x)r(x, y)[1 +r(y, y) +r(y, y)
2
+ ].
Since r(x, y) ≤ 1 and µ is a probability measure, this is less than
¸
x
µ(x)(1 +r(y, y) + ) ≤ 1 +r(y, y) + .
Hence r(y, y) must equal 1.
Recall that T
x
is the ﬁrst time to hit x.
43
Proposition 25.4. If the Markov chain is irreducible and has stationary distribution µ, then
µ(x) =
1
E
x
T
x
.
Proof. µ(x) > 0 for some x. If y ∈ o, then r(x, y) > 0 and so p
n
(x, y) > 0 for some n. Hence
µ(y) =
¸
x
µ(x)p
n
(x, y) > 0.
Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary distribution, µ
x
is
a constant multiple of µ, i.e., µ
x
= cµ. Recall
µ
x
(y) =

¸
n=0
P
x
(X
n
= y, T
x
> n),
and so
¸
y
µ
x
(y) =
¸
y

¸
n=0
P
x
(X
n
= y, T
x
> n) =
¸
n
¸
y
P
x
(X
n
= y, T
x
> n)
=
¸
n
P
x
(T
x
> n) = E
x
T
x
.
Thus c = E
x
T
x
. Recalling that µ
x
(x) = 1,
µ(x) =
µ
x
(x)
c
=
1
E
x
T
x
.
We make the following distinction for recurrent states. If E
x
T
x
< ∞, then x is said to be positive
recurrent. If x is recurrent but E
x
T
x
= ∞, x is null recurrent.
Proposition 25.5. Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution,.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), if x is positive recurrent, then there exists a stationary measure with µ(x) = 1. Then
µ(y) = µ(y)/E
x
T
x
will be a stationary distribution.
For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) =
1/E
y
T
y
, which implies E
y
T
y
< ∞.
We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as
well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d)
follows.
26. Convergence.
Our goal is to show that under certain conditions p
n
(x, y) → π(y), where π is the stationary distri-
bution. (In the null recurrent case p
n
(x, y) → 0.)
Consider a random walk on the set ¦0, 1¦, where with probability one on each step the chain moves
to the other state. Then p
n
(x, y) = 0 if x = y and n is even. A less trivial case is the simple random walk
on the integers. We need to eliminate this periodicity.
Suppose x is recurrent, let I
x
= ¦n ≥ 1 : p
n
(x, x) > 0¦, and let d
x
be the g.c.d. (greatest common
divisor) of I
x
. d
x
is called the period of x.
44
Proposition 26.1. If r(x, y) > 0, then d
y
= d
x
.
Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p
K
(x, y), p
L
(y, x) > 0.
p
K+L+n
(y, y) ≥ p
L
(y, x)p
n
(x, x)p
K
(x, y),
so taking n = 0, we have p
K+L
(y, y) > 0, or d
y
divides K + L. So d
y
divides n if p
n
(x, x) > 0, or d
y
is a
divisor of I
x
. Hence d
y
divides d
x
. By symmetry d
x
divides d
y
.
Proposition 26.2. If d
x
= 1, there exists m
0
such that p
m
(x, x) > 0 whenever m ≥ m
0
.
Proof. First of all, I
x
is closed under addition: if m, n ∈ I
x
,
p
m+n
(x, x) ≥ p
m
(x, x)p
n
(x, x) > 0.
Secondly, if there exists N such that N, N +1 ∈ I
x
, let m
0
= N
2
. If m ≥ m
0
, then m−N
2
= kN +r
for some r < N and
m = r +N
2
+kN = r(N + 1) + (N −r +k)N ∈ I
x
.
Third, pick n
0
∈ I
x
and k > 0 such that n
0
+k ∈ I
x
. If k = 1, we are done. Since d
x
= 1, there exists
n
1
∈ I
x
such that k does not divide n
1
. We have n
1
= mk +r for some 0 < r < k. Note (m+1)(n
0
+k) ∈ I
x
and (m+ 1)n
0
+n
1
∈ I
x
. The diﬀerence between these two numbers is (m+ 1)k −n
1
= k −r < k. So now
we have two numbers in I
k
diﬀering by less than or equal to k −1. Repeating at most k times, we get two
numbers in I
x
diﬀering by at most 1, and we are done.
We write d for d
x
. A chain is aperiodic if d = 1.
If d > 1, we say x ∼ y if p
kd
(x, y) > 0 for some k > 0 We divide o into equivalence classes o
1
, . . . o
d
.
Every d steps the chain started in o
i
is back in o
i
. So we look at p

= p
d
on o
i
.
Theorem 26.3. Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then
p
n
(x, y) → π(y) as n → ∞.
Proof. The idea is to take two copies of the chain with diﬀerent starting distributions, let them run
independently until they couple, i.e., hit each other, and then have them move together. So deﬁne
q((x
1
, y
1
), (x
2
, y
2
)) =

p(x
1
, x
2
)p(y
1
, y
2
) if x
1
= y
1
,
p(x
1
, x
2
) if x
1
= y
1
, x
2
= y
2
,
0 otherwise.
Let Z
n
= (X
n
, Y
n
) and T = min¦i : X
i
= Y
i
¦. We have
P(X
n
= y) = P(X
n
= y, T ≤ n) +P(X
n
= y, T > n)
= P(Y
n
= y, T ≤ n) +P(X
n
= y, T > n),
while
P(Y
n
= y) = P(Y
n
= y, T ≤ n) +P(Y
n
= y, T > n).
Subtracting,
P(X
n
= y) −P(Y
n
= y) ≤ P(X
n
= y, T > n) −P(Y
n
= y, T > n)
≤ P(X
n
= y, T > n) ≤ P(T > n).
45
Using symmetry,
[P(X
n
= y) −P(Y
n
= y)[ ≤ P(T > n).
Suppose we let Y
0
have distribution π and X
0
= x. Then
[p
n
(x, y) −π(y)[ ≤ P(T > n).
It remains to show P(T > n) → 0. To do this, consider another chain Z

n
= (X
n
, Y
n
), where now we
take X
n
, Y
n
independent. Deﬁne
r((x
1
, y
1
), (x
2
, y
2
)) = p(x
1
, x
2
)p(y
1
, y
2
).
The chain under the transition probabilities r is irreducible. To see this, there exist K and L such
that p
K
(x
1
, x
2
) > 0 and p
L
(y
1
, y
2
) > 0. If M is large, p
L+M
(x
2
, x
2
) > 0 and p
K+M
(y
2
, y
2
) > 0. So
p
K+L+M
(x
1
, x
2
) > 0 and p
K+L+M
(y
1
, y
2
) > 0, and hence we have r
K+L+M
((x
1
, x
2
), (y
1
, y
2
)) > 0.
It is easy to check that π

(a, b) = π(a)π(b) is a stationary distribution for Z

. Hence Z

n
is recurrent,
and hence it will hit (x, x), hence the time to hit the diagonal ¦(y, y) : y ∈ o¦ is ﬁnite. However the
distribution of the time to hit the diagonal is the same as T.
27. Gaussian sequences.
We ﬁrst prove a converse to Proposition 17.3.
Proposition 27.1. If Ee
i(uX+vY )
= Ee
iuX
Ee
ivY
for all u and v, then X and Y are independent random
variables.
Proof. Let X

be a random variable with the same law as X, Y

one with the same law as Y , and X

, Y

independent. (We let Ω = [0, 1]
2
, P Lebesgue measure, X

a function of the ﬁrst variable, and Y

a function
of the second variable deﬁned as in Proposition 1.2.) Then Ee
i(uX

+vY

)
= Ee
iuX

Ee
ivY

. Since X, X

have
the same law, they have the same characteristic function, and similarly for Y, Y

. Therefore (X

, Y

) has the
same joint characteristic function as (X, Y ). By the uniqueness of the Fourier transform, (X

, Y

) has the
same joint law as (X, Y ), which is easily seen to imply that X and Y are independent.
A sequence of random variables X
1
, . . . , X
n
is said to be jointly normal if there exists a sequence
of independent standard normal random variables Z
1
, . . . , Z
m
and constants b
ij
and a
i
such that X
i
=
¸
m
j=1
b
ij
Z
j
+ a
i
, i = 1, . . . , n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take
A = 0; the modiﬁcations for the general case are easy. The covariance of two random variables X and Y is
deﬁned to be E[(X −EX)(Y −EY )]. Since we are assuming our normal random variables are mean 0, we
can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about
the covariance matrix, which is Cov (X) = EXX
t
, where X
t
denotes the transpose of the vector X. In the
above case, we see Cov (X) = E[(BZ)(BZ)
t
] = E[BZZ
t
B
t
] = BB
t
, since EZZ
t
= I, the identity.
Let us compute the joint characteristic function Ee
iu
t
X
of the vector X, where u is an n-dimensional
vector. First, if v is an m-dimensional vector,
Ee
iv
t
Z
= E
m
¸
j=1
e
iv
j
Z
j
=
m
¸
j=1
Ee
iv
j
Z
j
=
m
¸
j=1
e
−v
2
j
/2
= e
−v
t
v/2
using the independence of the Z’s. So
Ee
iu
t
X
= Ee
iu
t
BZ
= e
−u
t
BB
t
u/2
.
By taking u = (0, . . . , 0, a, 0, . . . , 0) to be a constant times the unit vector in the jth coordinate direction,
we deduce that each of the X’s is indeed normal.
46
Proposition 27.2. If the X
i
are jointly normal and Cov (X
i
, X
j
) = 0 for i = j, then the X
i
are independent.
Proof. If Cov (X) = BB
t
is a diagonal matrix, then the joint characteristic function of the X’s factors, and
so by Proposition 27.1, the Xs would in this case be independent.
28. Stationary processes.
In this section we give some preliminaries which will be used in the next on the ergodic theorem. We
say a sequence X
i
is stationary if (X
k
, X
k+1
, . . .) has the same distribution as (X
0
, X
1
, . . .).
One example is if the X
i
are i.i.d. For readers who are familiar with Markov chains, another is if X
i
is a Markov chain, π is the stationary distribution, and X
0
has distribution π.
A third example is rotations of a circle. Let Ω be the unit circle, P normalized Lebesgue measure on
Ω, and θ ∈ [0, 2π). We let X
0
(ω) = ω and set X
n
(ω) = ω +nθ (mod 2π).
A fourth example is the Bernoulli shift: let Ω = [0, 1), P Lebesgue measure, X
0
(ω) = ω, and X
n
(ω)
be binary expansion of ω from the nth place on.
Proposition 28.1. If X
n
is stationary, then Y
k
= g(X
k
, X
k+1
, . . .) is stationary.
Proof. If B ⊂ R

, let
A = ¦x = (x
0
, x
1
, . . .) : (g(x
0
, . . .), g(x
1
, . . . , ), . . .) ∈ B¦.
Then
P((Y
0
, Y
1
, . . .) ∈ B) = P((X
0
, X
1
, . . .) ∈ A)
= P((X
k
, X
k+1
, . . .) ∈ A)
= P((Y
k
, Y
k+1
, . . .) ∈ B).
We say that T : Ω → Ω is measure preserving if P(T
−1
A) = P(A) for all A ∈ T.
There is a one-to-one correspondence between measure preserving transformations and stationary
sequences. Given T, let X
0
= ω and X
n
= T
n
ω. Then
P((X
k
, X
k+1
, . . .) ∈ A) = P(T
k
(X
0
, X
1
, . . .) ∈ A) = P((X
0
, X
1
, . . .) ∈ A).
On the other hand, if X
k
is stationary, deﬁne
´
Ω = R

, and deﬁne
´
X
k
(ω) = ω
k
, where ω = (ω
0
, ω
1
, . . .). Deﬁne
´
P on
´
Ω so that the law of
´
X under
´
P is the same as the law of X under P. Then deﬁne Tω = (ω
1
, ω
2
, . . .).
We see that
P(A) = P((ω
0
, ω
1
, . . .) ∈ A) =
´
P((
´
X
0
,
´
X
1
, . . .) ∈ A)
= P((X
0
, X
1
, . . .) ∈ A) = P((X
1
, X
2
, . . .) ∈ A)
=
´
P((
´
X
1
,
´
X
2
, . . .) ∈ A) = P((ω
1
, ω
2
, . . .) ∈ A)
= P(Tω ∈ A) = P(T
−1
A).
We say a set A is invariant if T
−1
A = A (up to a null set, that is, the symmetric diﬀerence has prob-
ability zero). The invariant σ-ﬁeld 1 is the collection of invariant sets. A measure preserving transformation
is ergodic if the invariant σ-ﬁeld is trivial.
In the case of an i.i.d. sequence, A invariant means A = T
−n
A ∈ σ(X
n
, X
n+1
, . . .) for each n. Hence
each invariant set is in the tail σ-ﬁeld, and by the Kolmogorov 0-1 law, T is ergodic.
In the case of rotations, if θ is a rational multiple of π, T need not be ergodic. For example, let θ = π
and A = (0, π/2) ∪ (π, 3π/2). However, if θ is an irrational multiple of π, the T is ergodic. To see that,
47
recall that if f is measurable and bounded, then f is the L
2
limit of
¸
K
k=−K
c
k
e
ikx
, where c
k
are the Fourier
coeﬃcients. So
f(T
n
x) =
¸
c
k
e
ikx+iknθ
=
¸
d
k
e
ikx
,
where d
k
= c
k
e
iknθ
. If f(T
n
x) = f(x) a.e., then c
k
= d
k
, or c
k
e
iknθ
= c
k
. But θ is not a rational multiple
of π, so e
iknθ
= 1, so c
k
= 0. Therefore f = 0 a.e. If we take f = 1
A
, this says that either A is empty or A
is the whole space, up to sets of measure zero.
Our last example was the Bernoulli shift. Let X
i
be i.i.d. with P(X = 1) = P(X
i
= 0) = 1/2. Let
Y
n
=
¸

m=0
2
−(m+1)
X
n+m
. So there exists g such that Y
n
= g(X
n
, X
n+1
, . . .). If A is invariant for the
Bernoulli shift,
A = ((Y
n
, Y
n+1
, . . .) ∈ B) = ((X
n
, X
n+1
, . . .) ∈ C),
where C = ¦x : (g(x
0
, x
1
, . . .), g(x
1
, x
2
, . . .), . . .) ∈ B¦. this is true for all n, so A is in the invariant σ-ﬁeld
for the X
i
’s, which is trivial. Therefore T is ergodic.
29. The ergodic theorem.
The key to the ergodic theorem is the following maximal lemma.
Lemma 29.1. Let X be integrable. Let T be a measure preserving transformation, let X
j
(ω) = X(T
j
ω),
let S
k
(ω) = X
0
(ω) + +X
k−1
(ω), and M
k
(ω) = max(0, S
1
(ω), . . . , S
k
(ω)). Then E[X; M
k
> 0] ≥ 0.
Proof. If j ≤ k, M
k
(Tω) ≥ S
j
(Tω), so X(ω) +M
k
(Tω) ≥ X(ω) +S
j
(Tω) = S
j+1
(ω), or
X(ω) ≥ S
j+1
(ω) −M
k
(Tω), j = 1, . . . , k.
Since S
1
(ω) = X(ω) and M
k
(Tω) ≥ 0, then
X(ω) ≥ S
1
(ω) −M
k
(Tω).
Therefore
E[X(ω); M
k
> 0] ≥

(M
k
>0)
[max(S
1
, . . . , S
k
)(ω) −M
k
(Tω)]
=

(M
k
>0)
[M
k
(ω) −M
k
(Tω)].
On the set (M
k
= 0) we have M
k
(ω) −M
k
(Tω) = −M
k
(Tω) ≤ 0. Hence
E[X(ω); M
k
> 0] ≥

[M
k
(ω) −M
k
(Tω)].
Since T is measure preserving, EM
k
(ω) −EM
k
(Tω) = 0, which completes the proof.
Recall 1 is the invariant σ-ﬁeld. The ergodic theorem says the following.
Theorem 29.2. Let T be measure preserving and X integrable. Then
1
n
n−1
¸
m=0
X(T
m
ω) → E[X [ 1],
where the convergence takes place almost surely and in L
1
.
Proof. We start with the a.s. result. By looking at X −E[X [ 1], we may suppose E[X [ 1] = 0. Let ε > 0
and D = ¦limsup S
n
/n > ε¦. We will show P(D) = 0.
48
Let δ > 0. Since X is integrable,
¸
P([X
n
(ω)[ > δn) =
¸
P([X[ > δn) < ∞ (cf. proof of Proposition
5.1). By Borel-Cantelli, [X
n
[/n will eventually be less than δ. Since δ is arbitrary, [X
n
[/n → 0 a.s. Since
(S
n
/n)(Tω) −(S
n
/n)(ω) = X
n
(ω)/n −X
0
(ω)/n → 0,
then limsup(S
n
/n)(Tω) = limsup(S
n
/n)(ω), and so D ∈ 1. Let X

(ω) = (X(ω) − ε)1
D
(ω), and deﬁne S

n
and M

n
analogously to the deﬁnitions of S
n
and M
n
. On D, limsup(S
n
/n) > ε, hence limsup(S

n
/n) > 0.
Let F = ∪
n
(M

n
> 0). Note ∪
n
i=0
(M

i
> 0) = (M

n
> 0). Also [X

[ ≤ [X[ + ε is integrable. By
Lemma 29.1, E[X

; M

n
> 0] ≥ 0. By dominated convergence, E[X

; F] ≥ 0.
We claim D = F, up to null sets. To see this, if limsup(S

n
/n) > 0, then ω ∈ ∪
n
(M

n
> 0). Hence
D ⊂ F. On the other hand, if ω ∈ F, then M

n
> 0 for some n, so X

n
= 0 for some n. By the deﬁnition of
X

, for some n, T
n
ω ∈ D, and since D is invariant, ω ∈ D a.s.
Recall D ∈ 1. Then
0 ≤ E[X

; D] = E[X −ε; D]
= E[E[X [ 1]; D] −εP(D) = −εP(D),
using the fact that E[X [ 1] = 0. We conclude P(D) = 0 as desired.
Since we have this for every ε, then limsup S
n
/n ≤ 0. By applying the same argument to −X, we
obtain liminf S
n
/n ≥ 0, and we have proved the almost sure result. Let us now turn to the L
1
convergence.
Let M > 0, X

M
= X1
(|X|≤M)
, and X

M
= X −X

M
. By the almost sure result,
1
n
¸
X

M
(T
m
ω) → E[X

M
[ 1]
almost surely. Both sides are bounded by M, so
E

1
n
¸
X

M
(T
m
ω) −E[X

M
[ 1]

→ 0. (29.1)
Let ε > 0 and choose M large so that E[X

M
[ < ε; this is possible by dominated convergence. We
have
E

1
n
n−1
¸
m=0
X

M
(T
m
ω)

1
n
¸
E[X

M
(T
m
ω)[ = E[X

M
[ ≤ ε
and
E[E[X

M
[ 1][ ≤ E[E[[X

M
[ [ 1]] = E[X

M
[ ≤ ε.
So combining with (29.1)
limsup E

1
n
¸
X(T
m
ω) −E[X [ 1]

≤ 2ε.
This shows the L
1
convergence.
What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we
see S
n
/n → EX almost surely and in L
1
, since E[X [ I] = EX. Thus this gives another proof of the SLLN.
For rotations of the circle with X(ω) = 1
A
(ω) and θ is an irrational multiple of π, E[X [ 1] = EX =
P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n)
¸
1
A
(ω + nθ), the
average number of times ω + nθ is in A, converges for almost every ω to the normalized Lebesgue measure
of A.
Finally, in the case of the Bernoulli shift, it is easy to see that the ergodic theorem says that the
average number of ones in the binary expansion of almost every point in [0, 1) is 1/2.
49