
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Kolmogorov’s consistency theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Conditional expectations and distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


2.1 Inequalities for sums of independent random variables . . . . . . . . . . . . . . . . . 31
2.2 Laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 0 – 1 laws, convergence of random series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Stopping times, Wald’s identity, and strong Markov property . . . . . . . . . . . . 54
2.5 Azuma inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


3.1 Convergence of laws. Selection theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Characteristic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3 Central limit theorem for i.i.d. random variables . . . . . . . . . . . . . . . . . . . . . . . 83
3.4 Central limit theorem for triangular arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4 Metrics on Probability Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


4.1 Bounded Lipschitz functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Convergence of laws on metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 Strassen’s theorem. Relationships between metrics . . . . . . . . . . . . . . . . . . . . . 112
4.4 Wasserstein distance and Kantorovich-Rubinstein theorem . . . . . . . . . . . . . . 121
4.5 Wasserstein distance and entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

5 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1 Martingales. Uniform integrability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.2 Stopping times, optional stopping theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.3 Doob’s inequalities and convergence of martingales . . . . . . . . . . . . . . . . . . . . 162
6 Brownian Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1 Stochastic processes. Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.2 Donsker’s invariance principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.3 Convergence of empirical process to Brownian bridge . . . . . . . . . . . . . . . . . . 182
6.4 Reflection principles for Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.5 Skorohod’s imbedding and laws of the iterated logarithm . . . . . . . . . . . . . . . 198

7 Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207


7.1 Overview of Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.2 Sums over Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.3 Infinitely divisible distributions: basic properties . . . . . . . . . . . . . . . . . . . . . . 217
7.4 Canonical representations of infinitely divisible distributions . . . . . . . . . . . . 223
7.5 α-stable distributions and Lévy processes . . . . . . . . . . . . . . . . . . . . . . . . . . 232

8 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.1 Moment problem and de Finetti’s theorem for coin flips . . . . . . . . . . . . . . . . 235
8.2 General de Finetti and Aldous-Hoover representations . . . . . . . . . . . . . . . . . . 240
8.3 The Dovbysh-Sudakov representation for Gram-de Finetti arrays . . . . . . . . . 251

9 Finite State Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255


9.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
9.2 Stationary distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
9.3 Convergence theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
9.4 Reversible Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

10 Gaussian Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281


10.1 Gaussian integration by parts, interpolation, and concentration . . . . . . . . . . . 281
10.2 Examples of concentration inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
10.3 Gaussian comparison inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
10.4 Comparison of bilinear and linear forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
10.5 Johnson–Lindenstrauss lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
10.6 Intersecting random half-spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307

Chapter 1
Introduction

1.1 Probability spaces

$(\Omega, \mathcal{A}, P)$ is a probability space if $(\Omega, \mathcal{A})$ is a measurable space and $P$ is a measure on $\mathcal{A}$ such that $P(\Omega) = 1$. Let us recall some definitions from measure theory. A pair $(\Omega, \mathcal{A})$ is a measurable space if $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$. A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra if:

(i) $\Omega \in \mathcal{A}$,
(ii) $C, B \in \mathcal{A} \implies C \cap B,\; C \cup B \in \mathcal{A}$,
(iii) $B \in \mathcal{A} \implies \Omega \setminus B \in \mathcal{A}$.

A collection $\mathcal{A}$ of subsets of $\Omega$ is called a $\sigma$-algebra if it is an algebra and

(iv) $C_i \in \mathcal{A}$ for all $i \geq 1 \implies \bigcup_{i \geq 1} C_i \in \mathcal{A}$.

Elements of the $\sigma$-algebra $\mathcal{A}$ are often called events. $P$ is a probability measure on the $\sigma$-algebra $\mathcal{A}$ if

(1) $P(\Omega) = 1$,
(2) $P(A) \geq 0$ for all events $A \in \mathcal{A}$,
(3) $P$ is countably additive: given any disjoint $A_i \in \mathcal{A}$ for $i \geq 1$, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$,
$$P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i).$$

It is sometimes convenient to use an equivalent formulation of property (3):

(3′) $P$ is finitely additive and continuous, i.e. for any decreasing sequence of events $B_n \in \mathcal{A}$, $B_n \supseteq B_{n+1}$,
$$B = \bigcap_{n \geq 1} B_n \implies P(B) = \lim_{n \to \infty} P(B_n).$$

Lemma 1.1. Properties (3) and (3′) are equivalent.



Proof. First, let us show that (3) implies (3′). If we denote $C_n = B_n \setminus B_{n+1}$ then $B_n$ is the disjoint union $\bigl(\bigcup_{k \geq n} C_k\bigr) \cup B$ and, by (3),
$$P(B_n) = P(B) + \sum_{k \geq n} P(C_k).$$
Since the last sum is the tail of a convergent series, $\lim_{n\to\infty} P(B_n) = P(B)$. Next, let us show that (3′) implies (3). If, given disjoint sets $(A_n)$, we define $B_n = \bigcup_{i \geq n+1} A_i$, then
$$\bigcup_{i \geq 1} A_i = A_1 \cup A_2 \cup \cdots \cup A_n \cup B_n,$$
and, by finite additivity,
$$P\Bigl(\bigcup_{i \geq 1} A_i\Bigr) = \sum_{i=1}^{n} P(A_i) + P(B_n).$$
Clearly, $B_n \supseteq B_{n+1}$ and, since $(A_n)$ are disjoint, $\bigcap_{n \geq 1} B_n = \emptyset$. Therefore, by (3′), $\lim_{n\to\infty} P(B_n) = 0$ and (3) follows. □
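The continuity in (3′) can be illustrated numerically. The following Python sketch (the decreasing sets $B_n = [0, 1/2 + 1/n]$ are an arbitrary illustrative choice, not part of the formal development) checks that the Lebesgue measure of $B_n$ decreases to the measure of $B = [0, 1/2]$.

```python
# Sketch: continuity of the Lebesgue measure on [0, 1].
# B_n = [0, 1/2 + 1/n] decreases to B = [0, 1/2], and P(B_n) -> P(B).

def lebesgue_of_interval(a, b):
    """Lebesgue measure of [a, b] intersected with [0, 1]."""
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0)

measures = [lebesgue_of_interval(0.0, 0.5 + 1.0 / n) for n in range(1, 10001)]
limit = lebesgue_of_interval(0.0, 0.5)

assert all(m1 >= m2 for m1, m2 in zip(measures, measures[1:]))  # decreasing
assert abs(measures[-1] - limit) < 1e-3                         # P(B_n) -> 1/2
```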

Let us give several examples of probability spaces. The most basic example of a probability space is $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the Lebesgue measure. Let us quickly recall how this measure is constructed. More generally, let us consider the construction of the Lebesgue–Stieltjes measure on $\mathbb{R}$ corresponding to a non-decreasing right-continuous function $F(x)$. One considers the algebra of finite unions of disjoint intervals
$$\mathcal{A} = \Bigl\{ \bigcup_{i \leq n} (a_i, b_i] : n \geq 1, \text{ all } (a_i, b_i] \text{ are disjoint} \Bigr\}$$
and defines the measure $F$ on the sets in this algebra by (we slightly abuse the notation here)
$$F\Bigl(\bigcup_{i \leq n} (a_i, b_i]\Bigr) = \sum_{i=1}^{n} \bigl(F(b_i) - F(a_i)\bigr).$$

It is not difficult to show that $F$ is countably additive on the algebra $\mathcal{A}$, i.e.,
$$F\Bigl(\bigcup_{i \geq 1} A_i\Bigr) = \sum_{i=1}^{\infty} F(A_i)$$
whenever all $A_i$ and $\bigcup_{i \geq 1} A_i$ are finite unions of disjoint intervals. The proof is exactly the same as in the case of $F(x) = x$ corresponding to the Lebesgue measure. Once countable additivity is proved on the algebra, it remains to appeal to the following key result. Recall that, given an algebra $\mathcal{A}$, the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ is the smallest $\sigma$-algebra that contains $\mathcal{A}$.

Theorem 1.1 (Carathéodory's extension theorem). If $\mathcal{A}$ is an algebra of sets and $\mu : \mathcal{A} \to \mathbb{R}$ is a non-negative countably additive function on $\mathcal{A}$, then $\mu$ can be extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$. If $\mu$ is $\sigma$-finite, then this extension is unique.

Therefore, $F$ above can be uniquely extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by the algebra of finite unions of disjoint intervals. This is the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ of Borel sets on $\mathbb{R}$. Clearly, $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ will be a probability space if

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

The reason we required $F$ to be right-continuous corresponds to our choice that the intervals $(a, b]$ in the algebra are closed on the right, so the two conventions agree and the measure $F$ is continuous, as it should be, e.g.
$$F\bigl((a, b]\bigr) = F\Bigl(\bigcap_{n \geq 1} (a, b + n^{-1}]\Bigr) = \lim_{n\to\infty} F\bigl((a, b + n^{-1}]\bigr).$$
In Probability Theory, functions satisfying properties (1) and (2) above are called cumulative distribution functions, or c.d.f. for short, and we will give an alternative construction of the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ in the next section.
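As a small numerical illustration of the construction above, the following Python sketch computes the Lebesgue–Stieltjes measure of a finite union of disjoint intervals $(a_i, b_i]$ directly from a c.d.f.; the exponential c.d.f. $F(x) = 1 - e^{-x}$ is an arbitrary choice made here, not fixed by the text.

```python
import math

# Sketch: the Lebesgue-Stieltjes measure of a finite union of disjoint
# half-open intervals (a_i, b_i], defined through a c.d.f. F by
# sum of F(b_i) - F(a_i). Here F(x) = 1 - exp(-x), an illustrative choice.

def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def stieltjes_measure(intervals):
    """Measure of a disjoint union of half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

m = stieltjes_measure([(0.0, 1.0), (2.0, 3.0)])
# F(1) - F(0) + F(3) - F(2)
assert abs(m - ((1 - math.exp(-1)) + (math.exp(-2) - math.exp(-3)))) < 1e-12
# Total mass of the whole line is lim F(x) = 1.
assert abs(stieltjes_measure([(0.0, float("inf"))]) - 1.0) < 1e-12
```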
Another basic way to define a probability is through a probability function or a density function. If the measurable space is such that all singletons are measurable, we can simply assign some weights $p_i = P(\{\omega_i\})$ to a sequence of distinct points $\omega_i \in \Omega$, such that $\sum_{i \geq 1} p_i = 1$, and let
$$P(A) = P\bigl(A \cap \{\omega_i\}_{i \geq 1}\bigr) = \sum_{i\, :\, \omega_i \in A} p_i.$$
The function $i \to p_i = P(\{\omega_i\})$ is called a probability function. Now, suppose that we already have a $\sigma$-finite measure $Q$ on $(\Omega, \mathcal{A})$, and consider any measurable function $f : \Omega \to \mathbb{R}_+$ such that
$$\int_{\Omega} f(\omega)\, dQ(\omega) = 1.$$
Then we can define a probability measure $P$ on $(\Omega, \mathcal{A})$ by
$$P(A) = \int_{A} f(\omega)\, dQ(\omega).$$
The function $f$ is called the density function of $P$ with respect to $Q$ and, in the typical setting when $\Omega = \mathbb{R}^k$ and $Q$ is the Lebesgue measure $\lambda$, $f$ is simply called the density function of $P$.

Example 1.1.1. A probability measure on $\mathbb{R}$ corresponding to the probability function
$$p_i = P(\{i\}) = e^{-\lambda} \frac{\lambda^i}{i!}$$
for integer $i \geq 0$ is called the Poisson distribution with the parameter $\lambda > 0$. (Notation: given a set $A$, we will denote by $I(x \in A)$ or $I_A(x)$ the indicator that $x$ belongs to $A$.) □
Example 1.1.2. A probability measure on $\mathbb{R}$ corresponding to the density function $f(x) = \lambda e^{-\lambda x} I(x \geq 0)$ is called the exponential distribution with the parameter $\lambda > 0$. □

Example 1.1.3. A probability measure on $\mathbb{R}$ corresponding to the density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
is called the standard normal, or standard Gaussian, distribution on $\mathbb{R}$. □
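Both densities in the two examples above integrate to $1$, which can be sketched numerically with a simple midpoint rule over a wide truncated range (the truncation points and $\lambda = 2$ are arbitrary choices made for the illustration).

```python
import math

# Sketch: numerically verify that the exponential density (lam = 2) and the
# standard Gaussian density integrate to 1, via a midpoint rule.

def midpoint_integral(f, a, b, n=200000):
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

lam = 2.0
expo = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0
gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

total_expo = midpoint_integral(expo, 0.0, 40.0)
total_gauss = midpoint_integral(gauss, -10.0, 10.0)
assert abs(total_expo - 1.0) < 1e-6
assert abs(total_gauss - 1.0) < 1e-6
```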
Recall that a measure $P$ is called absolutely continuous with respect to another measure $Q$, written $P \ll Q$, if for all $A \in \mathcal{A}$,
$$Q(A) = 0 \implies P(A) = 0,$$
in which case the existence of the density is guaranteed by the following classical result from measure theory.

Theorem 1.2 (Radon–Nikodym). On a measurable space $(\Omega, \mathcal{A})$ let $\mu$ be a $\sigma$-finite measure and $\nu$ be a finite measure absolutely continuous with respect to $\mu$, $\nu \ll \mu$. Then there exists the Radon–Nikodym derivative $h \in L^1(\Omega, \mathcal{A}, \mu)$ such that
$$\nu(A) = \int_A h(\omega)\, d\mu(\omega)$$
for all $A \in \mathcal{A}$. Such $h$ is unique modulo $\mu$-a.e. equivalence.

Of course, the Radon–Nikodym theorem also applies to finite signed measures $\nu$, which can be decomposed into $\nu = \nu^+ - \nu^-$ for some finite measures $\nu^+, \nu^-$ (the so-called Hahn–Jordan decomposition). Let us recall a proof of the Radon–Nikodym theorem for convenience.
Proof. Clearly, we can assume that $\mu$ is a finite measure. Consider the Hilbert space $H = L^2(\Omega, \mathcal{A}, \mu + \nu)$ and the linear functional $T : H \to \mathbb{R}$ given by $T(f) = \int f\, d\nu$. Since
$$\Bigl| \int f\, d\nu \Bigr| \leq \int |f|\, d(\mu + \nu) \leq C \|f\|_H,$$
$T$ is a continuous linear functional and, by the Riesz–Fréchet theorem, $\int f\, d\nu = \int f g\, d(\mu + \nu)$ for some $g \in H$. This implies $\int f\, d\mu = \int f(1 - g)\, d(\mu + \nu)$. Now, $g(\omega) \geq 0$ for $(\mu + \nu)$-almost all $\omega$, which can be seen by taking $f(\omega) = I(g(\omega) < 0)$, and similarly $g(\omega) \leq 1$ for $(\mu + \nu)$-almost all $\omega$. Therefore, we can take $0 \leq g \leq 1$. Let $E = \{\omega : g(\omega) = 1\}$. Then
$$\mu(E) = \int I(\omega \in E)\, d\mu(\omega) = \int I(\omega \in E)(1 - g(\omega))\, d(\mu + \nu)(\omega) = 0,$$
and since $\nu \ll \mu$, $\nu(E) = 0$. Since $\int f g\, d\mu = \int f(1 - g)\, d\nu$, we can restrict both integrals to $E^c$ and, replacing $f$ by $f/(1-g)$ and denoting $h := g/(1-g)$ (defined on $E^c$), we get $\int_{E^c} f\, d\nu = \int_{E^c} f h\, d\mu$ (more carefully, one can truncate $f/(1-g)$ first and then use the monotone convergence theorem). Therefore, $\nu(A \cap E^c) = \int_{A \cap E^c} h\, d\mu$, and this finishes the proof if we set $h = 0$ on $E$. To prove uniqueness, consider two such $h$ and $h'$ and let $A = \{\omega : h(\omega) > h'(\omega)\}$. Then $0 = \int_A (h - h')\, d\mu$ and, therefore, $\mu(A) = 0$. □

Let us now write down some important properties of $\sigma$-algebras and probability measures.

Lemma 1.2 (Approximation lemma). If $\mathcal{A}$ is an algebra of sets then for any $B \in \sigma(\mathcal{A})$ there exists a sequence $B_n \in \mathcal{A}$ such that $\lim_{n\to\infty} P(B \triangle B_n) = 0$.

Proof. Here $B \triangle B_n$ denotes the symmetric difference $(B \cup B_n) \setminus (B \cap B_n)$. Let
$$\mathcal{D} = \bigl\{ B \in \sigma(\mathcal{A}) : \lim_{n\to\infty} P(B \triangle B_n) = 0 \text{ for some } B_n \in \mathcal{A} \bigr\}.$$
We will prove that $\mathcal{D}$ is a $\sigma$-algebra and, since $\mathcal{A} \subseteq \mathcal{D}$, this will imply that $\sigma(\mathcal{A}) \subseteq \mathcal{D}$. One can easily check that
$$d(B, C) := P(B \triangle C) = \int_{\Omega} |I_B(\omega) - I_C(\omega)|\, dP(\omega)$$
is a semi-metric, which satisfies

(a) $d(B \cap C, D \cap E) \leq d(B, D) + d(C, E)$,
(b) $|P(B) - P(C)| \leq d(B, C)$,
(c) $d(B^c, C^c) = d(B, C)$.

Now, consider $D_1, \ldots, D_N \in \mathcal{D}$. If a sequence $C_i^n \in \mathcal{A}$ for $n \geq 1$ approximates $D_i$, i.e.
$$\lim_{n\to\infty} P(C_i^n \triangle D_i) = 0,$$
then, by the properties (a) – (c), $C_N^n = \bigcup_{i \leq N} C_i^n$ approximates $D^N = \bigcup_{i \leq N} D_i$, which means that $D^N \in \mathcal{D}$. Let $D = \bigcup_{i \geq 1} D_i$. Since $P(D \setminus D^N) \to 0$ as $N \to \infty$, it is clear that $D \in \mathcal{D}$, so $\mathcal{D}$ is a $\sigma$-algebra. □

Dynkin's theorem. We will now describe a tool, the so-called Dynkin's theorem, or π–λ theorem, which is often quite useful in checking various properties of probabilities.

π-systems: A collection of sets $\mathcal{P}$ is called a π-system if it is closed under taking intersections, i.e.

1. if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$.

λ-systems: A collection of sets $\mathcal{L}$ is called a λ-system if

1. $\Omega \in \mathcal{L}$,
2. if $A \in \mathcal{L}$ then $A^c \in \mathcal{L}$,
3. if $A_n \in \mathcal{L}$ are disjoint for $n \geq 1$ then $\bigcup_{n \geq 1} A_n \in \mathcal{L}$.

Given any collection of sets $\mathcal{C}$, by analogy with the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$, we will denote by $\mathcal{L}(\mathcal{C})$ the smallest λ-system that contains $\mathcal{C}$. It is easy to see that the intersection of all λ-systems that contain $\mathcal{C}$ is again a λ-system that contains $\mathcal{C}$, so this intersection is precisely $\mathcal{L}(\mathcal{C})$.

Theorem 1.3 (Dynkin's theorem). If $\mathcal{P}$ is a π-system, $\mathcal{L}$ is a λ-system and $\mathcal{P} \subseteq \mathcal{L}$, then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.
We will give typical examples of application of this result below.

Proof. First of all, it should be obvious that a collection of sets which is both a π-system and a λ-system is a $\sigma$-algebra. Therefore, if we can show that $\mathcal{L}(\mathcal{P})$ is a π-system then it is a $\sigma$-algebra and
$$\mathcal{P} \subseteq \sigma(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{P}) \subseteq \mathcal{L},$$
which proves the result. Let us prove that $\mathcal{L}(\mathcal{P})$ is a π-system. For a fixed set $A \subseteq \Omega$, let us define
$$\mathcal{G}_A = \bigl\{ B \subseteq \Omega : B \cap A \in \mathcal{L}(\mathcal{P}) \bigr\}.$$

Step 1. Let us show that if $A \in \mathcal{L}(\mathcal{P})$ then $\mathcal{G}_A$ is a λ-system. Obviously, $\Omega \in \mathcal{G}_A$. If $B \in \mathcal{G}_A$ then $B \cap A \in \mathcal{L}(\mathcal{P})$ and, since $A^c \in \mathcal{L}(\mathcal{P})$ is disjoint from $B \cap A$,
$$B^c \cap A = \bigl( (B \cap A) \cup A^c \bigr)^c \in \mathcal{L}(\mathcal{P}).$$
This means that $B^c \in \mathcal{G}_A$. Finally, if $B_n \in \mathcal{G}_A$ are disjoint then $B_n \cap A \in \mathcal{L}(\mathcal{P})$ are disjoint and
$$\Bigl( \bigcup_{n \geq 1} B_n \Bigr) \cap A = \bigcup_{n \geq 1} (B_n \cap A) \in \mathcal{L}(\mathcal{P}),$$
so $\bigcup_{n \geq 1} B_n \in \mathcal{G}_A$. We showed that $\mathcal{G}_A$ is a λ-system.

Step 2. Next, let us show that if $A \in \mathcal{P}$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. Since $\mathcal{P} \subseteq \mathcal{L}(\mathcal{P})$, by Step 1, $\mathcal{G}_A$ is a λ-system. Also, since $\mathcal{P}$ is a π-system, closed under taking intersections, $\mathcal{P} \subseteq \mathcal{G}_A$. This implies that $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. In other words, we showed that if $A \in \mathcal{P}$ and $B \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$.

Step 3. Finally, let us show that if $B \in \mathcal{L}(\mathcal{P})$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. By Step 2, $\mathcal{G}_B$ contains $\mathcal{P}$ and, by Step 1, $\mathcal{G}_B$ is a λ-system. Therefore, $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. We showed that if $B \in \mathcal{L}(\mathcal{P})$ and $A \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$, so $\mathcal{L}(\mathcal{P})$ is a π-system. □
Example 1.1.4. Suppose that $\Omega$ is a topological space with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Given two probability measures $P_1$ and $P_2$ on $(\Omega, \mathcal{B})$, the collection of sets
$$\mathcal{L} = \bigl\{ B \in \mathcal{B} : P_1(B) = P_2(B) \bigr\}$$
is trivially a λ-system, by the properties of probability measures. On the other hand, the collection $\mathcal{P}$ of all open sets is a π-system and, therefore, if we know that $P_1(B) = P_2(B)$ for all open sets then, by Dynkin's theorem, this holds for all Borel sets $B \in \mathcal{B}$. Similarly, one can see that a probability measure on the Borel $\sigma$-algebra on the real line is determined by the probabilities of the sets $(-\infty, t]$ for all $t \in \mathbb{R}$. □

Regularity of measures. Let us now consider the case of $\Omega = S$ where $(S, d)$ is a metric space, and let $\mathcal{A}$ be the Borel $\sigma$-algebra generated by open (or closed) sets. A probability measure $P$ on this space is called closed regular if
$$P(A) = \sup\bigl\{ P(F) : F \subseteq A,\ F \text{ closed} \bigr\} \tag{1.1}$$
for all $A \in \mathcal{A}$. Similarly, a probability measure $P$ is called regular if
$$P(A) = \sup\bigl\{ P(K) : K \subseteq A,\ K \text{ compact} \bigr\} \tag{1.2}$$
for all $A \in \mathcal{A}$. It is a standard result in measure theory that every finite measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is regular. In the setting of complete separable metric spaces, this is known as Ulam's theorem, which we will prove below.

Theorem 1.4. Every probability measure $P$ on a metric space $(S, d)$ is closed regular.

Proof. Let us consider the collection of sets
$$\mathcal{L} = \bigl\{ A \in \mathcal{A} : \text{both } A \text{ and } A^c \text{ satisfy (1.1)} \bigr\}. \tag{1.3}$$
First of all, let us show that each closed set $F \in \mathcal{L}$; we only need to show that the open set $U = F^c$ satisfies (1.1). Let us consider the sets
$$F_n = \bigl\{ s \in S : d(s, F) \geq 1/n \bigr\}.$$
It is obvious that all $F_n$ are closed, and $F_n \subseteq F_{n+1}$. One can also easily check that, since $F$ is closed, $\bigcup_{n \geq 1} F_n = U$ and, by the continuity of measure,
$$P(U) = \lim_{n\to\infty} P(F_n) = \sup_{n \geq 1} P(F_n).$$
This proves that $U = F^c$ satisfies (1.1) and $F \in \mathcal{L}$. Next, one can easily check that $\mathcal{L}$ is a λ-system, which we will leave as an exercise below. Since the collection of all closed sets is a π-system, and closed sets generate the Borel $\sigma$-algebra $\mathcal{A}$, by Dynkin's theorem, all measurable sets are in $\mathcal{L}$. This proves that $P$ is closed regular. □

Theorem 1.5 (Ulam's theorem). If $(S, d)$ is a complete separable metric space then every probability measure $P$ on it is regular.
Proof. First, let us show that, for every $\varepsilon > 0$, there exists a compact set $K \subseteq S$ such that $P(S \setminus K) \leq \varepsilon$. Consider a sequence $\{s_1, s_2, \ldots\}$ that is dense in $S$. For any $m \geq 1$, $S = \bigcup_{i=1}^{\infty} B(s_i, \frac{1}{m})$, where $B(s_i, \frac{1}{m})$ is the closed ball of radius $1/m$ centered at $s_i$. By the continuity of measure, for large enough $n(m)$,
$$P\Bigl( S \setminus \bigcup_{i=1}^{n(m)} B\bigl(s_i, \tfrac{1}{m}\bigr) \Bigr) \leq \frac{\varepsilon}{2^m}.$$
If we take
$$K = \bigcap_{m \geq 1} \bigcup_{i=1}^{n(m)} B\bigl(s_i, \tfrac{1}{m}\bigr)$$
then
$$P(S \setminus K) \leq \sum_{m \geq 1} \frac{\varepsilon}{2^m} = \varepsilon.$$
Obviously, by construction, $K$ is closed and totally bounded. Since $S$ is complete, $K$ is compact. By the previous theorem, given $A \in \mathcal{A}$, we can find a closed subset $F \subseteq A$ such that $P(A \setminus F) \leq \varepsilon$. Therefore, $P(A \setminus (F \cap K)) \leq 2\varepsilon$ and, since $F \cap K$ is compact, this finishes the proof. □

Exercise 1.1.1. Let $\mathcal{F} = \{F \subseteq \mathbb{N} : \mathbb{N} \setminus F \text{ is finite}\}$ be the collection of all sets in $\mathbb{N}$ with finite complements. $\mathcal{F}$ is a filter, which means that (a) $\emptyset \notin \mathcal{F}$, (b) if $F_1, F_2 \in \mathcal{F}$ then $F_1 \cap F_2 \in \mathcal{F}$, and (c) if $F \in \mathcal{F}$ and $F \subseteq G$ then $G \in \mathcal{F}$. It is well known (by Zorn's lemma) that $\mathcal{F} \subseteq \mathcal{U}$ for some ultrafilter $\mathcal{U}$, which in addition to (a) – (c) also satisfies: (d) for any set $A \subseteq \mathbb{N}$, either $A$ or $\mathbb{N} \setminus A$ is in $\mathcal{U}$. If we define $P(A) = 1$ for $A \in \mathcal{U}$ and $P(A) = 0$ for $A \notin \mathcal{U}$, show that $P$ is finitely additive, but not countably additive, on $2^{\mathbb{N}}$. If $P$ could be interpreted as a "probability", suppose that two people pick a number in $\mathbb{N}$ according to $P$, and whoever picks the bigger number wins. If one shows his number first, what is the probability that the other one wins?

Exercise 1.1.2. Suppose that $\mathcal{C}$ is a class of subsets of $\Omega$ and $B \in \sigma(\mathcal{C})$. Show that there exists a countable class $\mathcal{C}_B \subseteq \mathcal{C}$ such that $B \in \sigma(\mathcal{C}_B)$.

Exercise 1.1.3. Check that $\mathcal{L}$ in (1.3) is a λ-system. (In fact, one can check that this is a $\sigma$-algebra.)

1.2 Random variables

Let $(\Omega, \mathcal{A}, P)$ be a probability space and $(S, \mathcal{B})$ be a measurable space, where $\mathcal{B}$ is a $\sigma$-algebra of subsets of $S$. Recall that a function $X : \Omega \to S$ is called measurable if, for all $B \in \mathcal{B}$,
$$X^{-1}(B) = \bigl\{ \omega \in \Omega : X(\omega) \in B \bigr\} \in \mathcal{A}.$$
In Probability Theory, such functions are called random variables, especially when $(S, \mathcal{B}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Depending on the target space $S$, $X$ may be called a random vector, sequence or, more generally, a random element in $S$. Recall that measurability can be checked on the sets that generate the $\sigma$-algebra $\mathcal{B}$ and, in particular, the following holds.

Lemma 1.3. $X : \Omega \to \mathbb{R}$ is a random variable if and only if, for all $t \in \mathbb{R}$,
$$\{X \leq t\} := \bigl\{ \omega \in \Omega : X(\omega) \in (-\infty, t] \bigr\} \in \mathcal{A}.$$

Proof. Only the "if" direction requires proof. We will prove that
$$\mathcal{D} = \bigl\{ D \subseteq \mathbb{R} : X^{-1}(D) \in \mathcal{A} \bigr\}$$
is a $\sigma$-algebra. Since the sets $(-\infty, t] \in \mathcal{D}$, this will imply that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{D}$. The fact that $\mathcal{D}$ is a $\sigma$-algebra follows simply because taking the pre-image preserves set operations. For example, if we consider a sequence $D_i \in \mathcal{D}$ for $i \geq 1$ then
$$X^{-1}\Bigl( \bigcup_{i \geq 1} D_i \Bigr) = \bigcup_{i \geq 1} X^{-1}(D_i) \in \mathcal{A},$$
because $X^{-1}(D_i) \in \mathcal{A}$ and $\mathcal{A}$ is a $\sigma$-algebra. Therefore, $\bigcup_{i \geq 1} D_i \in \mathcal{D}$. Other properties can be checked similarly, so $\mathcal{D}$ is a $\sigma$-algebra. □
Given a random element $X$ on $(\Omega, \mathcal{A}, P)$ with values in $(S, \mathcal{B})$, let us denote the image measure on $\mathcal{B}$ by $P_X = P \circ X^{-1}$, which means that, for $B \in \mathcal{B}$,
$$P_X(B) = P(X \in B) = P(X^{-1}(B)) = P \circ X^{-1}(B).$$
$(S, \mathcal{B}, P_X)$ is called the sample space of the random element $X$, and $P_X$ is called the law of $X$, or the distribution of $X$. Clearly, on this space the random variable $x : S \to S$ defined by the identity $x(s) = s$ has the same law as $X$. When $S = \mathbb{R}$, the function $F(t) = P(X \leq t)$ is called the cumulative distribution function (c.d.f.) of $X$. Clearly, this function satisfies the following properties that already appeared in the previous section:

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

On the other hand, any such function is a c.d.f. of some random variable, for example, of the random variable $X(x) = x$ on the space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ constructed in the previous section, since
$$P\bigl( x : X(x) \leq t \bigr) = F\bigl( (-\infty, t] \bigr) = F(t).$$

Another construction can be given on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ with the Lebesgue measure $\lambda$, using the quantile transformation. Given a c.d.f. $F$, let us define a random variable $X : [0,1] \to \mathbb{R}$ by the quantile transformation
$$X(x) = \inf\bigl\{ s \in \mathbb{R} : F(s) \geq x \bigr\}.$$
What is the c.d.f. of $X$? Notice that, since $F$ is right-continuous,
$$X(x) \leq t \iff \inf\{s : F(s) \geq x\} \leq t \iff \lim_{s \downarrow t} F(s) \geq x \iff F(t) \geq x.$$
This implies that $F$ is the c.d.f. of $X$, since
$$\lambda\bigl( x : X(x) \leq t \bigr) = \lambda\bigl( x : F(t) \geq x \bigr) = F(t).$$
This means that, to define the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$, we can start with $([0,1], \mathcal{B}([0,1]), \lambda)$ and let $F = \lambda \circ X^{-1}$ be the image of the Lebesgue measure under the quantile transformation, or the law of $X$ on $\mathbb{R}$. A related "inverse" property is left as an exercise below.
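The quantile transformation can be sketched numerically. For the exponential c.d.f. $F(s) = 1 - e^{-s}$ (an illustrative choice) the infimum has the closed form $X(x) = -\log(1-x)$, and the empirical c.d.f. of $X$ applied to uniform samples should be close to $F$.

```python
import math
import random

# Sketch: the quantile transformation X(x) = inf{s : F(s) >= x} applied to
# uniform x in [0,1] produces samples with c.d.f. F. For F(s) = 1 - exp(-s)
# the infimum is -log(1 - x).
random.seed(0)
samples = [-math.log(1.0 - random.random()) for _ in range(200000)]

errors = []
for t in (0.5, 1.0, 2.0):
    emp = sum(s <= t for s in samples) / len(samples)   # empirical c.d.f.
    errors.append(abs(emp - (1.0 - math.exp(-t))))
assert max(errors) < 0.01
```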
Given a random element $X : (\Omega, \mathcal{A}) \to (S, \mathcal{B})$, the $\sigma$-algebra
$$\sigma(X) = \bigl\{ X^{-1}(B) : B \in \mathcal{B} \bigr\}$$
is called the $\sigma$-algebra generated by $X$. It is obvious that this collection of sets is, indeed, a $\sigma$-algebra.

Example 1.2.1. Consider a random variable $X$ on $([0,1], \mathcal{B}([0,1]), \lambda)$ defined by
$$X(x) = \begin{cases} 0, & 0 \leq x \leq 1/2, \\ 1, & 1/2 < x \leq 1. \end{cases}$$
Then the $\sigma$-algebra generated by $X$ consists of the sets
$$\sigma(X) = \Bigl\{ \emptyset,\ \bigl[0, \tfrac{1}{2}\bigr],\ \bigl(\tfrac{1}{2}, 1\bigr],\ [0,1] \Bigr\},$$
and $P(X = 0) = P(X = 1) = 1/2$. □

Lemma 1.4. Consider a probability space $(\Omega, \mathcal{A}, P)$, a measurable space $(S, \mathcal{B})$ and random elements $X : \Omega \to S$ and $Y : \Omega \to \mathbb{R}$. Then the following are equivalent:

1. $Y = g(X)$ for some (Borel) measurable function $g : S \to \mathbb{R}$,
2. $Y : \Omega \to \mathbb{R}$ is $\sigma(X)$-measurable.

It should be obvious from the proof that $\mathbb{R}$ can be replaced by any separable metric space.

Proof. The fact that 1 implies 2 is obvious, since for any Borel set $B \subseteq \mathbb{R}$ the set $B' = g^{-1}(B) \in \mathcal{B}$ and, therefore,
$$\bigl\{ Y = g(X) \in B \bigr\} = \bigl\{ X \in g^{-1}(B) = B' \bigr\} = X^{-1}(B') \in \sigma(X).$$
Let us show that 2 implies 1. For all integers $n$ and $k$, consider the sets
$$A_{n,k} = \Bigl\{ \omega : Y(\omega) \in \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \Bigr\} = Y^{-1}\Bigl( \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \Bigr).$$
By 2, $A_{n,k} \in \sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}\}$ and, therefore, $A_{n,k} = X^{-1}(B_{n,k})$ for some $B_{n,k} \in \mathcal{B}$. Let us consider the function
$$g_n(x) = \sum_{k \in \mathbb{Z}} \frac{k}{2^n}\, I(x \in B_{n,k}).$$
By construction, $|Y - g_n(X)| \leq 2^{-n}$, since
$$Y(\omega) \in \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \iff X(\omega) \in B_{n,k} \iff g_n(X(\omega)) = \frac{k}{2^n}.$$
It is easy to see that $g_n(x) \leq g_{n+1}(x)$ and, therefore, the limit $g(x) = \lim_{n\to\infty} g_n(x)$ exists and is measurable as a limit of measurable functions. Clearly, $Y = g(X)$. □

Independence. Consider a probability space $(\Omega, \mathcal{A}, P)$. Then $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$, $i \leq n$, are independent if
$$P(A_1 \cap \cdots \cap A_n) = \prod_{i \leq n} P(A_i)$$
for all $A_i \in \mathcal{A}_i$. Similarly, $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$ for $i \leq n$ are pairwise independent if
$$P(A_i \cap A_j) = P(A_i) P(A_j)$$
for all $A_i \in \mathcal{A}_i$, $A_j \in \mathcal{A}_j$, $i \neq j$. Random variables $X_i : (\Omega, \mathcal{A}) \to (S, \mathcal{B})$ for $i \leq n$ are independent if the $\sigma$-algebras $\sigma(X_i)$ are independent, which is just another convenient way to state that
$$P(X_1 \in B_1, \ldots, X_n \in B_n) = P(X_1 \in B_1) \times \cdots \times P(X_n \in B_n)$$
for any events $B_1, \ldots, B_n \in \mathcal{B}$. Pairwise independence is defined similarly.

Example 1.2.2. Consider a regular tetrahedron die with red, green and blue sides and a red-green-blue base. If we roll this die then the colors provide an example of pairwise independent random variables that are not independent, since
$$P(r) = P(b) = P(g) = \frac{1}{2} \quad \text{and} \quad P(rb) = P(rg) = P(bg) = \frac{1}{4},$$
while $P(rbg) = \frac{1}{4} \neq P(r)P(b)P(g) = \bigl(\frac{1}{2}\bigr)^3$. □
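The computation in this example can be verified by direct enumeration of the four equally likely faces (a small Python sketch, encoding each face by the set of colors appearing on it).

```python
# Sketch: enumerate the four faces of the tetrahedron die and verify
# pairwise independence without full independence.
faces = [{"r"}, {"g"}, {"b"}, {"r", "g", "b"}]
P = lambda event: sum(1 for f in faces if event(f)) / len(faces)

# Each color has probability 1/2, each pair 1/4 = (1/2)^2: pairwise independent.
for c in "rgb":
    assert P(lambda f, c=c: c in f) == 0.5
for c1, c2 in [("r", "g"), ("r", "b"), ("g", "b")]:
    assert P(lambda f, a=c1, b=c2: a in f and b in f) == 0.25

# But all three colors appear with probability 1/4, not (1/2)^3 = 1/8.
triple = P(lambda f: f == {"r", "g", "b"})
assert triple == 0.25
```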

First of all, independence can be checked on generating algebras.

Lemma 1.5. If algebras $\mathcal{A}_i$, $i \leq n$, are independent then the $\sigma$-algebras $\sigma(\mathcal{A}_i)$ are independent.

Proof. Obvious by the Approximation Lemma 1.2. □

A more flexible criterion follows from Dynkin's theorem.

Lemma 1.6. If collections of sets $\mathcal{C}_i$ for $i \leq n$ are π-systems (closed under finite intersections) then their independence implies the independence of the $\sigma$-algebras $\sigma(\mathcal{C}_i)$ they generate.

Proof. Let us consider the collection $\mathcal{C}$ of sets $C \in \mathcal{A}$ such that
$$P(C \cap C_2 \cap \cdots \cap C_n) = P(C) \prod_{i=2}^{n} P(C_i)$$
for all $C_i \in \mathcal{C}_i$ for $2 \leq i \leq n$. It is obvious that $\mathcal{C}$ is a λ-system, and it contains $\mathcal{C}_1$ by assumption. Since $\mathcal{C}_1$ is a π-system, by Dynkin's theorem, $\mathcal{C}$ contains $\sigma(\mathcal{C}_1)$. This means that we can replace $\mathcal{C}_1$ by $\sigma(\mathcal{C}_1)$ in the statement of the theorem and, similarly, we can continue to replace each $\mathcal{C}_i$ by $\sigma(\mathcal{C}_i)$. □

Lemma 1.7. Let us consider random variables $X_i : \Omega \to \mathbb{R}$ on a probability space $(\Omega, \mathcal{A}, P)$. The following holds.

(a) Random variables $(X_i)$ are independent if and only if, for all $t_i \in \mathbb{R}$,
$$P(X_1 \leq t_1, \ldots, X_n \leq t_n) = \prod_{i=1}^{n} P(X_i \leq t_i). \tag{1.4}$$

(b) If the laws of the $X_i$ have densities $f_i$ on $\mathbb{R}$ then these random variables are independent if and only if a joint density $f$ on $\mathbb{R}^n$ of the vector $(X_i)_{1 \leq i \leq n}$ exists and
$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).$$

Proof. (a) This is obvious by Lemma 1.6, because the collection of sets $(-\infty, t]$ for $t \in \mathbb{R}$ is a π-system that generates the Borel $\sigma$-algebra on $\mathbb{R}$.

(b) Let us start with the "if" part. If we denote $X = (X_1, \ldots, X_n)$ then, for any $A_i \in \mathcal{B}(\mathbb{R})$,
$$P\Bigl( \bigcap_{i=1}^{n} \{X_i \in A_i\} \Bigr) = P(X \in A_1 \times \cdots \times A_n) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n = \prod_{i=1}^{n} \int_{A_i} f_i(x_i)\, dx_i = \prod_{i=1}^{n} P(X_i \in A_i).$$
Next, we prove the "only if" part. First of all, by independence,
$$P(X \in A_1 \times \cdots \times A_n) = \prod_{i=1}^{n} P(X_i \in A_i) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n.$$
We would like to show that this implies that
$$P(X \in A) = \int_{A} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n$$
for all $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^n$, which would mean that the joint density exists and is equal to the product of the individual densities. One can prove the above equality for all $A \in \mathcal{B}(\mathbb{R}^n)$ by appealing to the Monotone Class Theorem from measure theory, or the Carathéodory Extension Theorem 1.1, since the above equality, obviously, can be extended from the semi-algebra of measurable rectangles $A_1 \times \cdots \times A_n$ to the algebra of disjoint unions of measurable rectangles, which generates the Borel $\sigma$-algebra. However, we can also appeal to Dynkin's theorem, since the family $\mathcal{L}$ of sets $A$ that satisfy the above equality is a λ-system by the properties of measures and integrals, and it contains the π-system $\mathcal{P}$ of measurable rectangles $A_1 \times \cdots \times A_n$ that generates the Borel $\sigma$-algebra, $\mathcal{B}(\mathbb{R}^n) = \sigma(\mathcal{P})$. □

More generally, a collection of $\sigma$-algebras $\mathcal{A}_t \subseteq \mathcal{A}$ indexed by $t \in T$ for some set $T$ is called independent if any finite subset of these $\sigma$-algebras is independent. Let $T = T_1 \cup \cdots \cup T_n$ be a partition of $T$ into disjoint sets. In this case, the following holds.

Lemma 1.8 (Grouping lemma). The $\sigma$-algebras
$$\mathcal{B}_i = \bigvee_{t \in T_i} \mathcal{A}_t = \sigma\Bigl( \bigcup_{t \in T_i} \mathcal{A}_t \Bigr)$$
generated by the subsets of $\sigma$-algebras $(\mathcal{A}_t)_{t \in T_i}$ are independent.

Proof. For each $i \leq n$, consider the collection of sets
$$\mathcal{C}_i = \Bigl\{ \bigcap_{t \in F} A_t : \text{for all finite } F \subseteq T_i \text{ and } A_t \in \mathcal{A}_t \Bigr\}.$$
It is obvious that $\mathcal{B}_i = \sigma(\mathcal{C}_i)$ since $\mathcal{A}_t \subseteq \mathcal{C}_i$ for all $t \in T_i$, each $\mathcal{C}_i$ is a π-system, and $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are independent by the definition of independence of the $\sigma$-algebras $\mathcal{A}_t$ for $t \in T$. Using Lemma 1.6 finishes the proof. (Of course, one should recognize from measure theory that $\mathcal{C}_i$ is a semi-algebra that generates $\mathcal{B}_i$.) □

If we would like to construct finitely many independent random variables $(X_i)_{i \leq n}$ with arbitrary distributions $(P_i)_{i \leq n}$ on $\mathbb{R}$, we can simply consider the space $\Omega = \mathbb{R}^n$ with the product measure
$$P_1 \times \cdots \times P_n$$
and define a random variable $X_i$ by $X_i(x_1, \ldots, x_n) = x_i$. The main result in the next section will imply that one can construct an infinite sequence of independent random variables with arbitrary distributions on the same probability space, and here we will give a sketch of another construction on the space $([0,1], \mathcal{B}([0,1]), \lambda)$. We will write $P = \lambda$ to emphasize that we think of the Lebesgue measure as a probability.

Step 1. If we write the dyadic decomposition of $x \in [0,1]$,
$$x = \sum_{n \geq 1} 2^{-n} \varepsilon_n(x),$$
then it is easy to see that $(\varepsilon_n)_{n \geq 1}$ are independent random variables with the distribution $P(\varepsilon_n = 0) = P(\varepsilon_n = 1) = 1/2$, since for any $n \geq 1$ and any $a_i \in \{0, 1\}$,
$$P\bigl( x : \varepsilon_1(x) = a_1, \ldots, \varepsilon_n(x) = a_n \bigr) = 2^{-n},$$
because fixing the first $n$ coefficients in the dyadic expansion places $x$ into an interval of length $2^{-n}$.

Step 2. Let us consider injections $k_m : \mathbb{N} \to \mathbb{N}$ for $m \geq 1$ such that their ranges $k_m(\mathbb{N})$ are all disjoint, and let us define
$$X_m = X_m(x) = \sum_{n \geq 1} 2^{-n} \varepsilon_{k_m(n)}(x).$$
It is an easy exercise to check that each $X_m$ is well defined and has the uniform distribution on $[0,1]$, which can be seen by looking at the dyadic intervals first. Moreover, by the Grouping Lemma above, the random variables $(X_m)_{m \geq 1}$ are all independent, since they are defined in terms of disjoint groups of independent random variables.

Step 3. Given a sequence of probability distributions $(P_m)_{m \geq 1}$ on $\mathbb{R}$, let $(F_m)_{m \geq 1}$ be the sequence of the corresponding c.d.f.s and let $(Q_m)_{m \geq 1}$ be their quantile transforms. We have seen above that each $Y_m = Q_m(X_m)$ has the distribution $P_m$ on $\mathbb{R}$, and they are obviously independent of each other. Therefore, we have constructed a sequence of independent random variables $Y_m$ on the space $([0,1], \mathcal{B}([0,1]), \lambda)$ with arbitrary distributions $P_m$.
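Steps 1 and 2 can be sketched numerically: several (approximately) independent uniform random variables are extracted from the dyadic digits of a single $x \in [0,1]$. The injections $k_m(n) = (n-1)M + m$ used below are one arbitrary choice with disjoint ranges, and the infinite sums are truncated at $N$ digits for the illustration.

```python
import random

# Sketch of Steps 1-2: extract M uniforms from the dyadic digits of one x,
# using the disjoint index sets k_m(n) = (n-1)*M + m, truncated at N digits.
N, M = 15, 3

def dyadic_digits(x, count):
    digits = []
    for _ in range(count):
        x *= 2
        d = int(x)
        digits.append(d)
        x -= d
    return digits

def extract_uniforms(x):
    eps = dyadic_digits(x, N * M)
    return [sum(eps[(n - 1) * M + m] * 2.0 ** (-n) for n in range(1, N + 1))
            for m in range(M)]

random.seed(1)
samples = [extract_uniforms(random.random()) for _ in range(20000)]
cols = list(zip(*samples))
for c in cols:                        # each X_m roughly uniform: mean near 1/2
    assert abs(sum(c) / len(c) - 0.5) < 0.02
e12 = sum(a * b for a, b in zip(cols[0], cols[1])) / len(samples)
assert abs(e12 - 0.25) < 0.02         # E[X_1 X_2] close to E[X_1] E[X_2]
```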
Expectation. If $X : \Omega \to \mathbb{R}$ is a random variable on $(\Omega, \mathcal{A}, P)$ then the expectation of $X$ is defined as
$$E X = \int_{\Omega} X(\omega)\, dP(\omega).$$
In other words, expectation is just another term for the integral with respect to a probability measure and, as a result, expectation has all the usual properties of integrals in measure theory: convergence theorems, change of variables formula, Fubini's theorem, etc. Let us write down some special cases of the change of variables formula.

Lemma 1.9. (1) If $F$ is the c.d.f. (and the law) of $X$ on $\mathbb{R}$ then, for any measurable function $g : \mathbb{R} \to \mathbb{R}$,
$$\mathbb{E}g(X) = \int_{\mathbb{R}} g(x)\, dF(x).$$
(2) If the distribution of $X$ is discrete, i.e. $P(X \in \{x_i\}_{i\ge1}) = 1$, then
$$\mathbb{E}g(X) = \sum_{i\ge1} g(x_i)P(X = x_i).$$
(3) If the distribution of $X : \Omega \to \mathbb{R}^n$ on $\mathbb{R}^n$ has the density function $f(x)$ then, for any measurable function $g : \mathbb{R}^n \to \mathbb{R}$,
$$\mathbb{E}g(X) = \int_{\mathbb{R}^n} g(x)f(x)\, dx.$$

Proof. All these properties follow by making the change of variables $x = X(\omega)$:
$$\mathbb{E}g(X) = \int_\Omega g(X(\omega))\, dP(\omega) = \int g(x)\, d(P \circ X^{-1})(x) = \int g(x)\, dP_X(x),$$
where $P_X = P \circ X^{-1}$ is the law of $X$ on $\mathbb{R}$ or $\mathbb{R}^n$. □

Another simple fact is the following.

Lemma 1.10. If $X, Y : \Omega \to \mathbb{R}$ are independent and $\mathbb{E}|X|, \mathbb{E}|Y| < \infty$ then $\mathbb{E}XY = \mathbb{E}X\,\mathbb{E}Y$.

Proof. Independence implies that the distribution of $(X,Y)$ on $\mathbb{R}^2$ is the product measure $P \times Q$, where $P$ and $Q$ are the distributions of $X$ and $Y$ on $\mathbb{R}$, and, therefore,
$$\mathbb{E}XY = \int_{\mathbb{R}^2} xy\, d(P \times Q)(x,y) = \int_{\mathbb{R}} x\, dP(x)\int_{\mathbb{R}} y\, dQ(y) = \mathbb{E}X\,\mathbb{E}Y,$$
by the change of variables and Fubini theorems. □

Exercise 1.2.1. If a random variable X has continuous c.d.f. F(t), show that F(X)
is uniform on [0, 1], i.e. the law of F(X) is the Lebesgue measure on [0, 1].
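A quick Monte Carlo sanity check of the statement in Exercise 1.2.1 (not a proof): sample from an exponential distribution, push the samples through its c.d.f. $F(t) = 1 - e^{-t}$, and measure the Kolmogorov-Smirnov-type distance between the empirical distribution of $F(X)$ and the uniform one. The sample size and tolerance below are illustrative choices.

```python
import math
import random

random.seed(1)
n = 20000
# Exponential(1) samples and their continuous c.d.f. F(t) = 1 - exp(-t)
xs = [random.expovariate(1.0) for _ in range(n)]
us = sorted(1.0 - math.exp(-x) for x in xs)

# Kolmogorov-Smirnov-type distance between the empirical c.d.f. of F(X)
# and the uniform c.d.f.; it should be small if F(X) is uniform on [0, 1]
ks = max(max(abs((i + 1) / n - u), abs(u - i / n)) for i, u in enumerate(us))
assert ks < 0.02   # well within typical fluctuations of order 1/sqrt(n)
```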
Exercise 1.2.2. If $F$ is a continuous distribution function, show that $\int F(x)\, dF(x) = 1/2$.
Exercise 1.2.3. $\cosh(\lambda)$ is the moment generating function of a random variable $X$ with distribution $P(X = \pm 1) = 1/2$, since
$$\mathbb{E}e^{\lambda X} = \frac{e^{\lambda} + e^{-\lambda}}{2} = \cosh(\lambda).$$
Does there exist a (bounded) random variable $X$ such that $\mathbb{E}e^{\lambda X} = \cosh^m(\lambda)$ for $0 < m < 1$? (Hint: compute several derivatives at zero.)
Exercise 1.2.4. Consider a measurable function $f : X \times Y \to \mathbb{R}$ and a product probability measure $P \times Q$ on $X \times Y$. For $0 < p \le q$, prove that
$$\big\| \|f\|_{L^p(P)} \big\|_{L^q(Q)} \le \big\| \|f\|_{L^q(Q)} \big\|_{L^p(P)},$$
i.e.
$$\Big( \int \Big( \int |f(x,y)|^p\, dP(x) \Big)^{q/p} dQ(y) \Big)^{1/q} \le \Big( \int \Big( \int |f(x,y)|^q\, dQ(y) \Big)^{p/q} dP(x) \Big)^{1/p}.$$
Assume that both sides are well defined, for example, that $f$ is bounded.
Exercise 1.2.5. If the event $A \in \sigma(\mathcal{P})$ is independent of the $\pi$-system $\mathcal{P}$ then $P(A) = 0$ or $1$.
Exercise 1.2.6. Suppose $X$ is a random variable and $g : \mathbb{R} \to \mathbb{R}$ is measurable. Prove that if $X$ and $g(X)$ are independent then $P(g(X) = c) = 1$ for some constant $c$.
Exercise 1.2.7. Suppose that $(e_m)_{1\le m\le n}$ are i.i.d. exponential random variables with parameter $\lambda > 0$, and let
$$e_{(1)} \le e_{(2)} \le \ldots \le e_{(n)}$$
be the order statistics (the random variables arranged in increasing order). Prove that the spacings
$$e_{(1)},\; e_{(2)} - e_{(1)},\; \ldots,\; e_{(n)} - e_{(n-1)}$$
are independent exponential random variables, and $e_{(k+1)} - e_{(k)}$ has parameter $(n-k)\lambda$.
Exercise 1.2.8. Suppose that $(e_n)_{n\ge1}$ are i.i.d. exponential random variables with parameter $\lambda = 1$. Let $S_n = e_1 + \ldots + e_n$ and $R_n = S_{n+1}/S_n$ for $n \ge 1$. Prove that $(R_n)_{n\ge1}$ are independent and $R_n$ has density $nx^{-(n+1)}I(x \ge 1)$. Hint: let $R_0 = e_1$ and compute the joint density of $(R_0, R_1, \ldots, R_n)$ first.
Exercise 1.2.9. Let $N$ be a Poisson random variable with mean $\lambda$, i.e. $P(N = j) = \lambda^j e^{-\lambda}/j!$ for integer $j \ge 0$. Then, consider $N$ i.i.d. random variables, independent of $N$, taking values $1, \ldots, k$ with probabilities $p_1, \ldots, p_k$. Let $N_j$ be the number of these random variables taking value $j$, so that $N_1 + \ldots + N_k = N$. Prove that $N_1, \ldots, N_k$ are independent Poisson random variables with means $\lambda p_1, \ldots, \lambda p_k$.
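The Poisson thinning claimed in Exercise 1.2.9 can be observed in simulation. The sketch below is illustrative only: it checks that $N_1$ has mean and variance close to $\lambda p_1$ (a Poisson variable has equal mean and variance); the sample sizes and tolerances are arbitrary choices.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplicative method; fine for moderate lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

lam, probs = 3.0, [0.5, 0.3, 0.2]
trials = 20000
rng = random.Random(2)
counts = []
for _ in range(trials):
    n = sample_poisson(lam, rng)
    c = [0, 0, 0]
    for _ in range(n):
        u, j, acc = rng.random(), 0, probs[0]
        while u > acc:          # pick a category with probabilities p_1, p_2, p_3
            j += 1
            acc += probs[j]
        c[j] += 1
    counts.append(c)

mean1 = sum(c[0] for c in counts) / trials
var1 = sum((c[0] - mean1) ** 2 for c in counts) / trials
# N_1 should look Poisson(lam * p_1): mean and variance both near 1.5
assert abs(mean1 - lam * probs[0]) < 0.05
assert abs(var1 - lam * probs[0]) < 0.1
```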

Exercise 1.2.10. Suppose that a measurable subset $P \subseteq [0,1]$ and the interval $I = [a,b] \subseteq [0,1]$ are such that $\lambda(P) = \lambda(I)$, where $\lambda$ is the Lebesgue measure on $[0,1]$. Show that there exists a measure-preserving transformation $T : [0,1] \to [0,1]$, i.e. $\lambda \circ T^{-1} = \lambda$, such that $T(I) \subseteq P$ and $T$ is one-to-one (injective) outside a set of measure zero.

1.3 Kolmogorov’s consistency theorem

In this section we will describe a typical way to construct an infinite family of random variables $(X_t)_{t\in T}$ on the same probability space, namely, when we are given all their finite-dimensional marginals. This means that for any finite subset $N \subseteq T$, we are given
$$P_N(B) = P\big( (X_t)_{t\in N} \in B \big)$$
for all $B$ in the Borel $\sigma$-algebra $\mathcal{B}_N = \mathcal{B}(\mathbb{R}^N)$. Clearly, these laws must satisfy a natural consistency condition,
$$P_N(B) = P_M\big( B \times \mathbb{R}^{M\setminus N} \big), \qquad (1.5)$$
for any finite subsets $N \subseteq M$ and any Borel set $B \in \mathcal{B}_N$. (Of course, to be careful, we should define these probabilities for ordered subsets and also make sure they are consistent under rearrangements, but the notation for unordered sets is clear and should not cause any confusion.)
Our goal is to construct a probability space $(\Omega, \mathcal{A}, P)$ and random variables $X_t : \Omega \to \mathbb{R}$ that have $(P_N)$ as their finite-dimensional distributions. We take
$$\Omega = \mathbb{R}^T = \{ \omega : T \to \mathbb{R} \}$$
to be the space of all real-valued functions on $T$, and let $X_t$ be the coordinate projection $X_t = X_t(\omega) = \omega(t)$. For the coordinate projections to be measurable, the following collection of events,
$$A = \big\{ B \times \mathbb{R}^{T\setminus N} : B \in \mathcal{B}_N \big\},$$
must be contained in the $\sigma$-algebra $\mathcal{A}$. It is easy to see that $A$ is, in fact, an algebra of sets, and it is called the cylindrical algebra on $\mathbb{R}^T$. We will then take $\mathcal{A} = \sigma(A)$ to be the smallest $\sigma$-algebra on which all coordinate projections are measurable. This is the so-called cylindrical $\sigma$-algebra on $\mathbb{R}^T$. A set $B \times \mathbb{R}^{T\setminus N}$ is called a cylinder. As we already agreed, the probability $P$ on the sets in the algebra $A$ is given by
$$P\big( B \times \mathbb{R}^{T\setminus N} \big) = P_N(B).$$
Given two finite subsets $N \subseteq M \subset T$ and $B \in \mathcal{B}_N$, the same set can be represented as two different cylinders,
$$B \times \mathbb{R}^{T\setminus N} = \big( B \times \mathbb{R}^{M\setminus N} \big) \times \mathbb{R}^{T\setminus M}.$$
However, by the consistency condition, the definition of $P$ does not depend on the choice of representation. To finish the construction, we need to show that $P$ can be extended from the algebra $A$ to a probability measure on the $\sigma$-algebra $\mathcal{A}$. By the Carathéodory Extension Theorem 1.1, we only need to show that the following holds.

Theorem 1.6. $P$ is countably additive on the cylindrical algebra $A$.

Proof. Equivalently, it is enough to show that $P$ satisfies the continuity-of-measure property on $A$, namely, given a sequence $B_n \in A$,
$$B_n \supseteq B_{n+1}, \quad \bigcap_{n\ge1} B_n = \emptyset \;\Longrightarrow\; \lim_{n\to\infty} P(B_n) = 0.$$
We will prove that if there exists $\varepsilon > 0$ such that $P(B_n) > \varepsilon$ for all $n$ then $\bigcap_{n\ge1} B_n \ne \emptyset$. We can represent the cylinders $B_n$ as $B_n = C_n \times \mathbb{R}^{T\setminus N_n}$ for some finite subset $N_n \subseteq T$ and $C_n \in \mathcal{B}_{N_n}$. Since $B_n \supseteq B_{n+1}$, we can assume that $N_n \subseteq N_{n+1}$. First of all, by the regularity of probability on a finite-dimensional space, there exists a compact set $K_n \subseteq C_n$ such that
$$P_{N_n}\big( C_n \setminus K_n \big) \le \frac{\varepsilon}{2^{n+1}}.$$
It is easy to see that
$$\bigcap_{i\le n} C_i \times \mathbb{R}^{T\setminus N_i} \;\Big\backslash\; \bigcap_{i\le n} K_i \times \mathbb{R}^{T\setminus N_i} \;\subseteq\; \bigcup_{i\le n} (C_i \setminus K_i) \times \mathbb{R}^{T\setminus N_i}$$
and, therefore,
$$P\Big( \bigcap_{i\le n} C_i \times \mathbb{R}^{T\setminus N_i} \Big\backslash \bigcap_{i\le n} K_i \times \mathbb{R}^{T\setminus N_i} \Big) \le P\Big( \bigcup_{i\le n} (C_i \setminus K_i) \times \mathbb{R}^{T\setminus N_i} \Big) \le \sum_{i\le n} P\big( (C_i \setminus K_i) \times \mathbb{R}^{T\setminus N_i} \big) \le \sum_{i\le n} \frac{\varepsilon}{2^{i+1}} \le \frac{\varepsilon}{2}.$$
Since we assumed that
$$P(B_n) = P\Big( \bigcap_{i\le n} C_i \times \mathbb{R}^{T\setminus N_i} \Big) > \varepsilon,$$
this implies that
$$P\Big( \bigcap_{i\le n} K_i \times \mathbb{R}^{T\setminus N_i} \Big) > \frac{\varepsilon}{2}.$$
Let us rewrite this intersection as
$$\bigcap_{i\le n} K_i \times \mathbb{R}^{T\setminus N_i} = \bigcap_{i\le n} \big( K_i \times \mathbb{R}^{N_n\setminus N_i} \big) \times \mathbb{R}^{T\setminus N_n} = \overline{K}_n \times \mathbb{R}^{T\setminus N_n},$$
where
$$\overline{K}_n = \bigcap_{i\le n} \big( K_i \times \mathbb{R}^{N_n\setminus N_i} \big)$$
is compact in $\mathbb{R}^{N_n}$, since each $K_i$ is compact. We proved that
$$P_{N_n}(\overline{K}_n) = P\big( \overline{K}_n \times \mathbb{R}^{T\setminus N_n} \big) = P\Big( \bigcap_{i\le n} K_i \times \mathbb{R}^{T\setminus N_i} \Big) > 0$$
and, therefore, there exists a point $\omega^n = (\omega^n(t))_{t\in N_n} \in \overline{K}_n$. By construction, we also have the following inclusion property: for $m > n$,
$$\omega^m \in \overline{K}_m \subseteq \overline{K}_n \times \mathbb{R}^{N_m\setminus N_n}$$
and, therefore, $(\omega^m(t))_{t\in N_n} \in \overline{K}_n$. Any sequence in a compact set has a convergent subsequence. Let $(n^1_k)_{k\ge1}$ be such that
$$\big(\omega^{n^1_k}(t)\big)_{t\in N_1} \to (\omega(t))_{t\in N_1} \in \overline{K}_1$$
as $k \to \infty$. Then we can take a subsequence $(n^2_k)_{k\ge1}$ of the sequence $(n^1_k)_{k\ge1}$ such that
$$\big(\omega^{n^2_k}(t)\big)_{t\in N_2} \to (\omega(t))_{t\in N_2} \in \overline{K}_2.$$
Notice that the values of $\omega(t)$ must agree on $N_1$. Iteratively, we can find a subsequence $(n^m_k)_{k\ge1}$ of $(n^{m-1}_k)_{k\ge1}$ such that
$$\big(\omega^{n^m_k}(t)\big)_{t\in N_m} \to (\omega(t))_{t\in N_m} \in \overline{K}_m.$$
This proves the existence of a point
$$\omega \in \bigcap_{n\ge1} \overline{K}_n \times \mathbb{R}^{T\setminus N_n} \subseteq \bigcap_{n\ge1} B_n,$$
so this last set is not empty. This finishes the proof. □

This result gives us another way to construct a sequence $(X_n)_{n\ge1}$ of independent random variables with given distributions $(P_n)_{n\ge1}$: as the coordinate projections on the infinite product space $\mathbb{R}^{\mathbb{N}}$ with the cylindrical (product) $\sigma$-algebra $\mathcal{A} = \mathcal{B}^{\mathbb{N}}$ and the infinite product measure $P = \prod_{n\ge1} P_n$.

Exercise 1.3.1. Does the set $C([0,1], \mathbb{R})$ of continuous functions on $[0,1]$ belong to the cylindrical $\sigma$-algebra $\mathcal{A}$ on $\mathbb{R}^{[0,1]}$? Hint: an exercise in Section 1 might be helpful.

Exercise 1.3.2. Let $\mathcal{B}$ be the cylindrical $\sigma$-algebra on $\mathbb{R}^{[0,1]}$ and let $\Omega$ be the set of all continuous functions on $[0,1]$, i.e. $\Omega = C[0,1] \subseteq \mathbb{R}^{[0,1]}$. Consider the $\sigma$-algebra $\mathcal{A} := \{ B \cap \Omega : B \in \mathcal{B} \}$ on $\Omega = C[0,1]$, which consists of intersections of sets in $\mathcal{B}$ with the set of continuous functions $\Omega = C[0,1]$. Show that the set
$$A := \Big\{ \omega \in C[0,1] : \int_0^1 \omega(t)\, dt < 1 \Big\}$$
belongs to the $\sigma$-algebra $\mathcal{A}$.



1.4 Conditional expectations and distributions

This section can be studied any time before Chapter 4. The topic of this section can be viewed as a vast abstract generalization of the Fubini theorem for product spaces. For example, given a pair of random variables $(X,Y)$, how can we average a function $f(X,Y)$ by fixing $X = x$ and averaging with respect to $Y$ first? Can we define some distribution of $Y$ given $X = x$ that can be used to compute this average? We will begin by defining these conditional averages, or conditional expectations, and then we will use them to define the corresponding conditional distributions.

Let $(\Omega, \mathcal{A}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ a random variable such that $\mathbb{E}|X| < \infty$. Let $\mathcal{B}$ be a $\sigma$-subalgebra of $\mathcal{A}$, $\mathcal{B} \subseteq \mathcal{A}$. A random variable $Y : \Omega \to \mathbb{R}$ is called the conditional expectation of $X$ given $\mathcal{B}$ if
1. $Y$ is measurable with respect to $\mathcal{B}$, i.e. for every Borel set $B$ on $\mathbb{R}$, $Y^{-1}(B) \in \mathcal{B}$.
2. For any set $B \in \mathcal{B}$, we have $\mathbb{E}XI_B = \mathbb{E}YI_B$.
This conditional expectation is denoted by $Y = \mathbb{E}(X|\mathcal{B})$. If $X, Z$ are random variables then the conditional expectation of $X$ given $Z$ is defined by
$$Y = \mathbb{E}(X|Z) = \mathbb{E}(X|\sigma(Z)).$$
Since $Y$ is measurable with respect to $\sigma(Z)$, $Y = f(Z)$ for some measurable function $f$. Let us go over a number of properties of conditional expectation, most of which follow easily from the definition.
1. (Existence) Let us define
$$\mu(B) = \int_B X\, dP \quad \text{for } B \in \mathcal{B}.$$
Since $\mathbb{E}|X| < \infty$, $\mu$ is a $\sigma$-additive signed measure on $\mathcal{B}$. Moreover, if $P(B) = 0$ then $\mu(B) = 0$, which means that $\mu$ is absolutely continuous with respect to $P$. By the Radon-Nikodym theorem, there exists a function $Y = d\mu/dP$ measurable with respect to $\mathcal{B}$ such that, for $B \in \mathcal{B}$,
$$\mu(B) = \int_B X\, dP = \int_B Y\, dP.$$
By definition, $Y = \mathbb{E}(X|\mathcal{B})$. Another way to arrive at conditional expectations is as follows. Notice that $L^2(\Omega, \mathcal{B}, P)$ is a linear subspace of $L^2(\Omega, \mathcal{A}, P)$ for $\mathcal{B} \subseteq \mathcal{A}$. If $f \in L^2(\Omega, \mathcal{A}, P)$ then the conditional expectation $g = \mathbb{E}(f|\mathcal{B})$ is simply the orthogonal projection of $f$ onto that subspace, since such a projection is $\mathcal{B}$-measurable and the orthogonal projection is characterized by $\mathbb{E}fh = \mathbb{E}gh$ for all $h \in L^2(\Omega, \mathcal{B}, P)$. In particular, one can take $h = I_B$ for any $B \in \mathcal{B}$. When $f \in L^1(\Omega, \mathcal{A}, P)$, the conditional expectation can be defined by approximating $f$ by functions in $L^2(\Omega, \mathcal{A}, P)$.
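On a finite probability space the Radon-Nikodym construction reduces to probability-weighted averaging over the atoms of $\mathcal{B}$, which also makes the orthogonal-projection description concrete. A minimal sketch (the weights, values, and partition below are made up for illustration):

```python
# Finite probability space: outcomes 0..5 with probabilities p, a random
# variable X, and a sub-sigma-algebra B generated by the partition
# {0,1}, {2,3,4}, {5}.  E(X|B) is constant on each atom, equal to the
# probability-weighted average of X over that atom.
p = [0.1, 0.2, 0.25, 0.15, 0.1, 0.2]
X = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]
atoms = [[0, 1], [2, 3, 4], [5]]

Y = [0.0] * len(p)
for atom in atoms:
    mass = sum(p[i] for i in atom)
    avg = sum(p[i] * X[i] for i in atom) / mass
    for i in atom:
        Y[i] = avg          # E(X|B) on this atom

# Defining property 2: E[X I_B] = E[Y I_B] for every atom, hence for every B in B
for atom in atoms:
    lhs = sum(p[i] * X[i] for i in atom)
    rhs = sum(p[i] * Y[i] for i in atom)
    assert abs(lhs - rhs) < 1e-12

# Consequence: EY = EX (take B = whole space)
assert abs(sum(pi * yi for pi, yi in zip(p, Y)) -
           sum(pi * xi for pi, xi in zip(p, X))) < 1e-12
```

Since $Y$ is constant on atoms, it lies in the subspace of $\mathcal{B}$-measurable functions, and the asserted identities are exactly the orthogonality relations $\mathbb{E}fh = \mathbb{E}gh$ for indicator test functions $h$.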

2. (Uniqueness) Suppose there exists another $Y' = \mathbb{E}(X|\mathcal{B})$ such that $P(Y \ne Y') > 0$, i.e. either $P(Y > Y') > 0$ or $P(Y < Y') > 0$. Since both $Y, Y'$ are measurable with respect to $\mathcal{B}$, the set $B = \{Y > Y'\} \in \mathcal{B}$. If $P(B) > 0$ then $\mathbb{E}(Y - Y')I_B > 0$. On the other hand,
$$\mathbb{E}(Y - Y')I_B = \mathbb{E}XI_B - \mathbb{E}XI_B = 0,$$
a contradiction, so $P(Y > Y') = 0$. Similarly, $P(Y < Y') = 0$. Of course, the conditional expectation can be redefined on a set of measure zero, so uniqueness is understood in this sense.
3. (Linearity) Conditional expectation is linear,
$$\mathbb{E}(cX + Y|\mathcal{B}) = c\,\mathbb{E}(X|\mathcal{B}) + \mathbb{E}(Y|\mathcal{B}),$$
just like the usual integral. It is obvious that the right-hand side satisfies the definition of the conditional expectation $\mathbb{E}(cX + Y|\mathcal{B})$, and the equality follows by uniqueness. Again, the equality holds almost surely.
4. (Smoothing) If $\sigma$-algebras $\mathcal{C} \subseteq \mathcal{B} \subseteq \mathcal{A}$ then
$$\mathbb{E}(X|\mathcal{C}) = \mathbb{E}\big( \mathbb{E}(X|\mathcal{B})\,\big|\,\mathcal{C} \big).$$
To see this, consider a set $C \in \mathcal{C}$. Since $C$ is also in $\mathcal{B}$,
$$\mathbb{E}I_C\,\mathbb{E}\big( \mathbb{E}(X|\mathcal{B})\,\big|\,\mathcal{C} \big) = \mathbb{E}I_C\,\mathbb{E}(X|\mathcal{B}) = \mathbb{E}I_C X,$$
so the right-hand side satisfies the definition of $\mathbb{E}(X|\mathcal{C})$.


5. (Extreme cases) When $\mathcal{B}$ is either the entire $\sigma$-algebra $\mathcal{A}$ or the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$,
$$\mathbb{E}(X|\mathcal{A}) = X, \qquad \mathbb{E}(X|\{\emptyset, \Omega\}) = \mathbb{E}X.$$
This is obvious by definition.
6. (Independent $\sigma$-algebra) If $X$ is independent of $\mathcal{B}$ then
$$\mathbb{E}(X|\mathcal{B}) = \mathbb{E}X.$$
This is also obvious by independence, since for $B \in \mathcal{B}$,
$$\mathbb{E}XI_B = \mathbb{E}X\,P(B) = \mathbb{E}(\mathbb{E}X)I_B$$
and $\mathbb{E}X$ is, of course, $\mathcal{B}$-measurable.


7. (Monotonicity) If $X \le Z$ then
$$\mathbb{E}(X|\mathcal{B}) \le \mathbb{E}(Z|\mathcal{B})$$
almost surely. The proof is by contradiction, similar to the proof of uniqueness above.

8. (Monotone convergence) If $\mathbb{E}|X_n| < \infty$, $\mathbb{E}|X| < \infty$ and $X_n \uparrow X$ then
$$\mathbb{E}(X_n|\mathcal{B}) \uparrow \mathbb{E}(X|\mathcal{B}).$$
To prove this, let us first note that, since $\mathbb{E}(X_n|\mathcal{B}) \le \mathbb{E}(X_{n+1}|\mathcal{B}) \le \mathbb{E}(X|\mathcal{B})$, there exists a limit
$$g = \lim_{n\to\infty} \mathbb{E}(X_n|\mathcal{B}) \le \mathbb{E}(X|\mathcal{B}).$$
Since the $\mathbb{E}(X_n|\mathcal{B})$ are measurable with respect to $\mathcal{B}$, so is the limit $g$. It remains to check that for any set $B \in \mathcal{B}$, $\mathbb{E}gI_B = \mathbb{E}XI_B$. Since $X_nI_B \uparrow XI_B$ and $\mathbb{E}(X_n|\mathcal{B})I_B \uparrow gI_B$, the usual monotone convergence theorem implies that
$$\mathbb{E}X_nI_B \uparrow \mathbb{E}XI_B \quad \text{and} \quad \mathbb{E}I_B\,\mathbb{E}(X_n|\mathcal{B}) \uparrow \mathbb{E}gI_B.$$
Since $\mathbb{E}I_B\,\mathbb{E}(X_n|\mathcal{B}) = \mathbb{E}X_nI_B$, this implies that $\mathbb{E}gI_B = \mathbb{E}XI_B$ and, therefore, $g = \mathbb{E}(X|\mathcal{B})$ almost surely.
9. (Dominated convergence) If $|X_n| \le Y$, $\mathbb{E}Y < \infty$, and $X_n \to X$ then
$$\lim_{n\to\infty} \mathbb{E}(X_n|\mathcal{B}) = \mathbb{E}(X|\mathcal{B}).$$
The proof is the same as for usual integrals. We write
$$-Y \le g_n = \inf_{m\ge n} X_m \le X_n \le h_n = \sup_{m\ge n} X_m \le Y$$
and observe that $g_n \uparrow X$, $h_n \downarrow X$, $|g_n| \le Y$ and $|h_n| \le Y$. Therefore, by the monotone convergence theorem,
$$\mathbb{E}(g_n|\mathcal{B}) \uparrow \mathbb{E}(X|\mathcal{B}), \qquad \mathbb{E}(h_n|\mathcal{B}) \downarrow \mathbb{E}(X|\mathcal{B}),$$
and the monotonicity of conditional expectation implies the claim.


10. (Measurable factor) If $\mathbb{E}|X| < \infty$, $\mathbb{E}|XY| < \infty$ and $Y$ is measurable with respect to $\mathcal{B}$ then
$$\mathbb{E}(XY|\mathcal{B}) = Y\,\mathbb{E}(X|\mathcal{B}).$$
First of all, it is enough to prove this for non-negative $X, Y \ge 0$ by decomposing $X = X^+ - X^-$ and $Y = Y^+ - Y^-$. Since $Y$ is $\mathcal{B}$-measurable, we can find a sequence of simple functions $Y_n$ of the form $\sum_k w_kI_{C_k}$, where $C_k \in \mathcal{B}$, such that $0 \le Y_n \uparrow Y$. By the monotone convergence theorem, it is enough to prove that
$$\mathbb{E}(XI_{C_k}|\mathcal{B}) = I_{C_k}\mathbb{E}(X|\mathcal{B}).$$
Take any $B \in \mathcal{B}$. Since $B \cap C_k \in \mathcal{B}$,
$$\mathbb{E}I_BI_{C_k}\mathbb{E}(X|\mathcal{B}) = \mathbb{E}I_{B\cap C_k}\mathbb{E}(X|\mathcal{B}) = \mathbb{E}XI_{B\cap C_k} = \mathbb{E}(XI_{C_k})I_B,$$
and this finishes the proof.


11. (Jensen's inequality) If $f : \mathbb{R} \to \mathbb{R}$ is convex and $\mathbb{E}|f(X)| < \infty$ then
$$f\big( \mathbb{E}(X|\mathcal{B}) \big) \le \mathbb{E}\big( f(X)\,\big|\,\mathcal{B} \big).$$
Let $h(x)$ be some monotone (thus measurable) function such that $h(x)$ is in the subgradient $\partial f(x)$ of the convex function $f$ for all $x$. Then, by convexity,
$$f(X) - f\big( \mathbb{E}(X|\mathcal{B}) \big) \ge h\big( \mathbb{E}(X|\mathcal{B}) \big)\big( X - \mathbb{E}(X|\mathcal{B}) \big).$$
If we ignore any integrability issues then, taking conditional expectations of both sides,
$$\mathbb{E}\big( f(X)|\mathcal{B} \big) - f\big( \mathbb{E}(X|\mathcal{B}) \big) \ge h\big( \mathbb{E}(X|\mathcal{B}) \big)\big( \mathbb{E}(X|\mathcal{B}) - \mathbb{E}(X|\mathcal{B}) \big) = 0,$$
where we used the previous property, since $h(\mathbb{E}(X|\mathcal{B}))$ is $\mathcal{B}$-measurable. Let us now make this simple idea rigorous. Let $B_n = \{\omega : |\mathbb{E}(X|\mathcal{B})| \le n\}$ and $X_n = XI_{B_n}$. As above,
$$f(X_n) - f\big( \mathbb{E}(X_n|\mathcal{B}) \big) \ge h\big( \mathbb{E}(X_n|\mathcal{B}) \big)\big( X_n - \mathbb{E}(X_n|\mathcal{B}) \big).$$
Notice that $B_n \in \mathcal{B}$ and, by the measurable-factor property 10, $\mathbb{E}(X_n|\mathcal{B}) = I_{B_n}\mathbb{E}(X|\mathcal{B}) \in [-n, n]$. Since both $f$ and $h$ are bounded on $[-n, n]$, now we do not have any integrability issues and, taking conditional expectations of both sides, we get
$$f\big( \mathbb{E}(X_n|\mathcal{B}) \big) \le \mathbb{E}\big( f(X_n)|\mathcal{B} \big).$$
Now, let $n \to \infty$. Since $\mathbb{E}(X_n|\mathcal{B}) = I_{B_n}\mathbb{E}(X|\mathcal{B}) \to \mathbb{E}(X|\mathcal{B})$ and $f$ is continuous, $f(\mathbb{E}(X_n|\mathcal{B}))$ converges to $f(\mathbb{E}(X|\mathcal{B}))$. On the other hand,
$$\mathbb{E}\big( f(X_n)|\mathcal{B} \big) = \mathbb{E}\big( f(X)I_{B_n} + f(0)I_{B_n^c}\,\big|\,\mathcal{B} \big) = \mathbb{E}\big( f(X)|\mathcal{B} \big)I_{B_n} + f(0)I_{B_n^c} \to \mathbb{E}\big( f(X)|\mathcal{B} \big),$$
and this finishes the proof.


Conditional distributions. Again, let $(\Omega, \mathcal{A}, P)$ be a probability space and let $\mathcal{B} \subseteq \mathcal{A}$ be a sub-$\sigma$-algebra of $\mathcal{A}$. Let $(Y, \mathcal{Y})$ be a measurable space and $f$ a measurable function from $(\Omega, \mathcal{A})$ into $(Y, \mathcal{Y})$. A conditional distribution of $f$ given $\mathcal{B}$ is a function $P_{f|\mathcal{B}} : \mathcal{Y} \times \Omega \to [0,1]$ such that
(i) for all $\omega$, $P_{f|\mathcal{B}}(\,\cdot\,, \omega)$ is a probability measure on $\mathcal{Y}$;
(ii) for each $C \in \mathcal{Y}$, $P_{f|\mathcal{B}}(C, \,\cdot\,) = \mathbb{E}\big( I(f \in C)\,\big|\,\mathcal{B} \big)(\,\cdot\,)$ almost surely, and $P_{f|\mathcal{B}}(C, \cdot)$ is $\mathcal{B}$-measurable.
The meaning of this definition will become more transparent in the examples below, but we will study the existence of conditional distributions in this abstract setting first. For any given $C$, the conditional expectation in (ii) is defined up to a $P|_{\mathcal{B}}$-null set, so there is some freedom in defining $P_{f|\mathcal{B}}(C, \cdot)$ for a given $C$, but it is not obvious that we can define $P_{f|\mathcal{B}}(C, \cdot)$ simultaneously for all $C$ in such a way that (i) is also satisfied. However, when the space $(Y, \mathcal{Y})$ is regular, i.e. $(Y, d)$ is a complete separable metric space and $\mathcal{Y}$ is its Borel $\sigma$-algebra, a conditional distribution always exists.

Theorem 1.7. If $(Y, \mathcal{Y})$ is regular then a conditional distribution $P_{f|\mathcal{B}}$ exists. Moreover, if $P'_{f|\mathcal{B}}$ is another conditional distribution then
$$P_{f|\mathcal{B}}(\cdot, \omega) = P'_{f|\mathcal{B}}(\cdot, \omega)$$
for $P|_{\mathcal{B}}$-almost all $\omega$, i.e. the conditional distribution is unique modulo $P|_{\mathcal{B}}$-almost everywhere equivalence.

Proof. The Borel $\sigma$-algebra $\mathcal{Y}$ is generated by a countable collection $\mathcal{Y}_0$ of sets in $\mathcal{Y}$, for example, by open balls with rational radii and centers in a countable dense subset of $(Y, d)$. The algebra $\mathcal{C}$ generated by $\mathcal{Y}_0$ is countable, since it can be written as a countable union of increasing finite algebras corresponding to finite subsets of $\mathcal{Y}_0$. Since $(Y, d)$ is complete and separable, by Ulam's theorem (Theorem 1.5), for any set $C \in \mathcal{Y}$ one can find a sequence of compact subsets $K_1 \subset K_2 \subset \ldots \subset C$ such that $P(f \in K_n) \uparrow P(f \in C)$. If $K = \cup_{n\ge1}K_n$ then, by the conditional monotone convergence theorem,
$$\mathbb{E}\big( I(f \in K_n)|\mathcal{B} \big) \uparrow \mathbb{E}\big( I(f \in K)|\mathcal{B} \big) = \mathbb{E}\big( I(f \in C)|\mathcal{B} \big) \qquad (1.6)$$
almost surely, where the last equality holds because $P(f \in C \setminus K) = 0$. Let $\mathcal{D}$ be a countable algebra generated by the union of $\mathcal{C}$ and the set of all $K_n$ for all $C \in \mathcal{C}$. For each $D \in \mathcal{D}$, let $P_{f|\mathcal{B}}(D, \cdot)$ be any fixed version of the conditional expectation $\mathbb{E}(I(f \in D)|\mathcal{B})(\cdot)$. We have:
(a) for each $D \in \mathcal{D}$, $P_{f|\mathcal{B}}(D, \cdot) \ge 0$ a.s.;
(b) $P_{f|\mathcal{B}}(Y, \cdot) = 1$ and $P_{f|\mathcal{B}}(\emptyset, \cdot) = 0$ a.s.;
(c) for any $k \ge 1$ and any disjoint $D_1, \ldots, D_k \in \mathcal{D}$,
$$P_{f|\mathcal{B}}\Big( \bigcup_{i\le k} D_i, \cdot \Big) = \sum_{i\le k} P_{f|\mathcal{B}}(D_i, \cdot) \quad \text{a.s.};$$
(d) for any $C \in \mathcal{C}$ and the specific sequence $K_n$ as chosen, (1.6) holds a.s.
Since there are countably many conditions in (a)-(d), there exists a set $\mathcal{N} \in \mathcal{B}$ such that $P(\mathcal{N}) = 0$ and conditions (a)-(d) hold for all $\omega \notin \mathcal{N}$. We will now show that for all such $\omega$, $P_{f|\mathcal{B}}(\cdot, \omega)$ can be extended to a probability measure on $\mathcal{Y}$, and this extension is such that for any $C \in \mathcal{Y}$, $P_{f|\mathcal{B}}(C, \cdot)$ is a version of the conditional expectation, i.e. (ii) holds.

First, we need to show that for all $\omega \notin \mathcal{N}$, $P_{f|\mathcal{B}}(\cdot, \omega)$ is countably additive on $\mathcal{C}$ or, equivalently, for any sequence $C_n \in \mathcal{C}$ such that $C_n \downarrow \emptyset$ we have $P_{f|\mathcal{B}}(C_n, \omega) \downarrow 0$. If not, then there exists a sequence $C_n \downarrow \emptyset$ such that $P_{f|\mathcal{B}}(C_n, \omega) \ge \delta > 0$. Using (d), take compact $K_n \in \mathcal{D}$ such that $P_{f|\mathcal{B}}(C_n \setminus K_n, \omega) \le \delta/3^n$. Then
$$P_{f|\mathcal{B}}(K_1 \cap \ldots \cap K_n, \omega) \ge P_{f|\mathcal{B}}(C_n, \omega) - \sum_{1\le i\le n} \frac{\delta}{3^i} > \frac{\delta}{2},$$
so each set $\overline{K}_n = K_1 \cap \ldots \cap K_n$ is not empty. Since $C_n \downarrow \emptyset$ and, therefore, $\overline{K}_n \downarrow \emptyset$, the complements of $\overline{K}_n$ form an open cover of the compact set $K_1$ (and the entire space) and there exists a finite open subcover of $K_1$. Since the $\overline{K}_n$ are decreasing, it is easy to see that this can only happen if $\overline{K}_n$ is empty for some $n$, which is a contradiction.

By Carathéodory's extension theorem, for all $\omega \notin \mathcal{N}$, $P_{f|\mathcal{B}}(\,\cdot\,, \omega)$ can be uniquely extended to a probability measure on $\mathcal{Y}$ so that (i) holds (for $\omega \in \mathcal{N}$, we can define $P_{f|\mathcal{B}}(\cdot, \omega)$ to be any fixed probability measure on $\mathcal{Y}$). To prove (ii), notice that the collection of sets $C \in \mathcal{Y}$ that satisfy (ii) is, obviously, a $\lambda$-system that contains the algebra $\mathcal{C}$ (which is a $\pi$-system) and, by Dynkin's theorem (Theorem 1.3), it contains the $\sigma$-algebra $\mathcal{Y}$.

To prove uniqueness, by property (ii), we have
$$P_{f|\mathcal{B}}(C, \omega) = P'_{f|\mathcal{B}}(C, \omega)$$
for all $C \in \mathcal{C}$ on a set of $\omega$ of measure one, since $\mathcal{C}$ is countable, and we know that a probability measure on the $\sigma$-algebra $\mathcal{Y}$ is uniquely determined by its values on the generating algebra $\mathcal{C}$. □

When the conditional distribution exists, it can be used to compute conditional expectations with respect to $\mathcal{B}$.

Theorem 1.8. Let $g : Y \to \mathbb{R}$ be a measurable function such that $\mathbb{E}|g(f)| < \infty$. Then for $P|_{\mathcal{B}}$-almost all $\omega$, $g$ is integrable with respect to $P_{f|\mathcal{B}}(\cdot, \omega)$ and
$$\mathbb{E}\big( g(f)|\mathcal{B} \big)(\omega) = \int g(y)\, P_{f|\mathcal{B}}(dy, \omega) \quad \text{almost surely.} \qquad (1.7)$$

Proof. If $C \in \mathcal{Y}$ and $g(y) = I(y \in C)$ then, by (ii),
$$\mathbb{E}\big( I(f \in C)|\mathcal{B} \big)(\omega) = P_{f|\mathcal{B}}(C, \omega) = \int I(y \in C)\, P_{f|\mathcal{B}}(dy, \omega)$$
for $P|_{\mathcal{B}}$-almost all $\omega$. Therefore, (1.7) holds for all simple functions. For $g \ge 0$, take a sequence of simple functions $0 \le g_n \uparrow g$. Since $\mathbb{E}g(f) < \infty$, by the monotone convergence theorem for conditional expectations,
$$\mathbb{E}\big( g_n(f)|\mathcal{B} \big)(\omega) \uparrow \mathbb{E}\big( g(f)|\mathcal{B} \big)(\omega) < \infty$$
for all $\omega \notin \mathcal{N}$ for some $\mathcal{N} \in \mathcal{B}$ with $P(\mathcal{N}) = 0$. Assume also that for $\omega \notin \mathcal{N}$, (1.7) holds for all functions in the sequence $(g_n)$. By the usual monotone convergence theorem,
$$\int g_n(y)\, P_{f|\mathcal{B}}(dy, \omega) \uparrow \int g(y)\, P_{f|\mathcal{B}}(dy, \omega)$$
for all $\omega$ and, therefore, (1.7) holds for all $\omega \notin \mathcal{N}$. This proves the claim for $g \ge 0$, and the general case follows. □
Product space case. Consider measurable spaces $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ and let $(\Omega, \mathcal{A})$ be the product space, $\Omega = X \times Y$ and $\mathcal{A} = \mathcal{X} \otimes \mathcal{Y}$, with some probability measure $P$ on it. Let
$$h(\omega) = h(x,y) = x : X \times Y \to X \quad \text{and} \quad f(\omega) = f(x,y) = y : X \times Y \to Y$$
be the coordinate projections. Let $\mathcal{B}$ be the $\sigma$-algebra generated by the first coordinate projection $h$, i.e.
$$\mathcal{B} = h^{-1}(\mathcal{X}) = \mathcal{X} \times Y.$$
Suppose that a conditional distribution $P_{f|\mathcal{B}}$ of $f$ given $\mathcal{B}$ exists, for example, if $(Y, \mathcal{Y})$ is regular. For a fixed $C \in \mathcal{Y}$, $P_{f|\mathcal{B}}(C, \cdot)$ is $\mathcal{B}$-measurable, by definition, and since $\mathcal{B} = h^{-1}(\mathcal{X})$,
$$P_{f|\mathcal{B}}\big( C, (x,y) \big) = P_x(C) \qquad (1.8)$$
for some $\mathcal{X}$-measurable function $P_x(C)$. In the product space setting, $P_x$ is called a conditional distribution of $y$ given $x$. Notice that for any set $D \in \mathcal{X}$, $\{\omega : h(\omega) \in D\} \in \mathcal{B}$ and
$$I(x \in D)P_x(C) = I(h \in D)\,\mathbb{E}\big( I(f \in C)|\mathcal{B} \big) = \mathbb{E}\big( I(h \in D)I(f \in C)\,\big|\,\mathcal{B} \big) = \mathbb{E}\big( I(\omega \in D \times C)\,\big|\,\mathcal{B} \big) \quad \text{a.s.}$$
Integrating both sides over $\Omega$ we get
$$P(D \times C) = \int I(x \in D)P_x(C)\, dP(x,y) = \int_D P_x(C)\, d\mu(x),$$
where $\mu = P \circ h^{-1}$ is the marginal of $P$ on $X$. Therefore, the conditional distribution of $y$ given $x$ satisfies:
(1) for all $x \in X$, $P_x(\cdot)$ is a probability measure on $\mathcal{Y}$;
(2) for each set $C \in \mathcal{Y}$, $P_x(C)$ is $\mathcal{X}$-measurable;
(3) for any $D \in \mathcal{X}$ and $C \in \mathcal{Y}$, $P(D \times C) = \int_D \int_C dP_x(y)\, d\mu(x)$.
Of course, (1) and (2) imply that $P_x(C)$ defines a version of the conditional expectation of $I(y \in C)$ given the first coordinate $x$. Moreover, (3) implies the following more general statement.
Theorem 1.9. If $g : X \times Y \to \mathbb{R}$ is $P$-integrable then
$$\int g(x,y)\, dP(x,y) = \iint g(x,y)\, dP_x(y)\, d\mu(x). \qquad (1.9)$$

Proof. This coincides with (3) for the indicator of the rectangle $D \times C$ and as a result holds for indicators of disjoint unions of rectangles. Then, using the monotone class theorem (or Dynkin's theorem) as in the proof of Fubini's theorem, (1.9) can be extended to indicators of all measurable sets in the product $\sigma$-algebra $\mathcal{A} = \mathcal{X} \otimes \mathcal{Y}$, and then to simple functions, positive measurable functions and all integrable functions. □

This result is a generalization of Fubini's theorem to the case when the measure is not a product measure. One can also view this as a way to construct more general measures on product spaces than product measures: one can start with any function $P_x(C)$ that satisfies properties (1) and (2) and define a measure $P$ on the product space as in property (3). Such functions satisfying properties (1) and (2) are called probability kernels or transition functions. In the case when $P_x(C) = \nu(C)$, we recover the product measure $\mu \times \nu$.
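As a small illustration of building a non-product measure from a kernel, take $\mu$ uniform on $X = \{0,1\}$ and a made-up kernel $P_0 = \mathrm{Bernoulli}(0.2)$, $P_1 = \mathrm{Bernoulli}(0.9)$ on $Y = \{0,1\}$; property (3) then gives the joint probabilities exactly. (The specific numbers are invented for the example.)

```python
from itertools import product

# Marginal mu on X = {0, 1} and a probability kernel x -> P_x on Y = {0, 1}
mu = {0: 0.5, 1: 0.5}
kernel = {0: {0: 0.8, 1: 0.2},   # P_0 = Bernoulli(0.2)
          1: {0: 0.1, 1: 0.9}}   # P_1 = Bernoulli(0.9)

# Property (3): P({x} x {y}) = mu({x}) * P_x({y})
P = {(x, y): mu[x] * kernel[x][y] for x, y in product((0, 1), (0, 1))}

assert abs(sum(P.values()) - 1.0) < 1e-12      # P is a probability measure
marg_y1 = sum(P[(x, 1)] for x in (0, 1))       # marginal of the second coordinate
# P is NOT a product measure: P(1,1) = 0.45, while mu x (y-marginal) gives 0.275
assert P[(1, 1)] != mu[1] * marg_y1
```

When the kernel does not depend on $x$, the dictionary comprehension above collapses to the usual product measure, matching the last sentence of the text.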
A special version of the product space case is the so-called disintegration theorem. Let $(Y, \mathcal{Y})$ be regular and let $\nu$ be a probability measure on $\mathcal{Y}$. Consider a measurable function $p : (Y, \mathcal{Y}) \to (X, \mathcal{X})$ and let $\mu = \nu \circ p^{-1}$ be the push-forward measure on $\mathcal{X}$. Let $P$ be the push-forward measure on the product $\sigma$-algebra $\mathcal{X} \otimes \mathcal{Y}$ by the map $y \to (p(y), y)$, so that $P$ has marginals $\mu$ and $\nu$.

Theorem 1.10 (Disintegration Theorem). For any bounded measurable functions $h : X \to \mathbb{R}$, $f : Y \to \mathbb{R}$,
$$\int f(y)h(p(y))\, d\nu(y) = \int \Big( \int f(y)\, dP_x(y) \Big) h(x)\, d\mu(x). \qquad (1.10)$$
Moreover, for $\mu$-almost all $x$, $h(p(y)) = h(x)$ for $P_x$-almost all $y$.

Proof. To prove (1.10) just take $g(x,y) = h(x)f(y)$ in (1.9). If we replace $h(x)$ with $h(x)g(x)$ and use (1.10) twice we get
$$\int \Big( \int f(y)h(p(y))\, dP_x(y) \Big) g(x)\, d\mu(x) = \int f(y)h(p(y))g(p(y))\, d\nu(y) = \int \Big( \int f(y)\, dP_x(y) \Big) h(x)g(x)\, d\mu(x) = \int \Big( \int f(y)h(x)\, dP_x(y) \Big) g(x)\, d\mu(x).$$
Since $g$ is arbitrary, this means that for $\mu$-almost all $x$,
$$\int f(y)h(p(y))\, dP_x(y) = \int f(y)h(x)\, dP_x(y).$$
This also holds for $\mu$-almost all $x$ simultaneously for countably many functions $f$. We can choose these $f$ to be the indicators of sets in the algebra $\mathcal{C}$ in the proof of Theorem 1.7, and this choice obviously implies that $h(p(y)) = h(x)$ for $P_x$-almost all $y$. □

Example 1.4.1 (Same space case). Suppose that now $X = Y$, $\mathcal{X} \subseteq \mathcal{Y}$ and $p : Y \to X$ is the identity map, $p(y) = y$. Then $\mu = \nu|_{\mathcal{X}}$ is just the restriction of $\nu$ to the smaller $\sigma$-algebra $\mathcal{X}$. In this case, (1.10) becomes
$$\int f(y)h(y)\, d\nu(y) = \int \Big( \int f(y)\, dP_x(y) \Big) h(x)\, d\nu|_{\mathcal{X}}(x),$$
for any bounded $\mathcal{X}$-measurable function $h$ and $\mathcal{Y}$-measurable function $f$. □

Exercise 1.4.1. Let $(X,Y)$ be a random variable on $\mathbb{R}^2$ with density $f(x,y)$. Show that if $\mathbb{E}|X| < \infty$ then
$$\mathbb{E}(X|Y) = \int_{\mathbb{R}} xf(x,Y)\, dx \Big/ \int_{\mathbb{R}} f(x,Y)\, dx,$$
where the right-hand side is defined almost surely.

Exercise 1.4.2. Suppose that $X, Y$ and $U$ are random variables on some probability space such that $U$ is independent of $(X,Y)$ and has the uniform distribution on $[0,1]$. Prove that there exists a measurable function $f : \mathbb{R} \times [0,1] \to \mathbb{R}$ such that $(X, f(X,U))$ has the same distribution as $(X,Y)$ on $\mathbb{R}^2$.

Exercise 1.4.3. Consider a measurable space $(\Omega, \mathcal{B})$ and suppose that random pairs $(X,Y)$ and $(X',Y')$, defined on different probability spaces and taking values in $\mathbb{R} \times \Omega$, have the same distribution. If $\mathbb{E}|X|, \mathbb{E}|X'| < \infty$, show that the conditional expectations $\mathbb{E}(X|Y)$ and $\mathbb{E}(X'|Y')$ have the same distribution.

Exercise 1.4.4. Consider a triple $(X,Y,Z)$ of random elements taking values in some regular spaces (i.e. complete separable metric spaces with their Borel $\sigma$-algebras). Show that the following are equivalent:
(a) the conditional distribution of $X$ given $(Y,Z)$ depends only on $Y$ (i.e. is $\sigma(Y)$-measurable);
(b) $X$ and $Z$ are independent conditionally on $Y$.

Exercise 1.4.5. Consider two probability measures $\nu$ and $\eta$ on $\mathbb{R}^2$.
(a) If possible, construct any probability measure $P$ on $\mathbb{R}^3$ such that, for any Borel set $A \subseteq \mathbb{R}^2$,
$$P\big( (x,y,z) : (x,y) \in A \big) = \nu(A) \quad \text{and} \quad P\big( (x,y,z) : (x,z) \in A \big) = \eta(A).$$
In other words, if possible, construct any probability measure $P$ on $\mathbb{R}^3$ with $(x,y)$-marginal $\nu$ and $(x,z)$-marginal $\eta$.
(b) For which $\nu$ and $\eta$ is it impossible to construct such $P$?

Exercise 1.4.6. In the setting of the Disintegration Theorem, if $X = Y = [0,1]$, $\nu$ is the Lebesgue measure on $Y$, and $p(y) = 2yI(y \le 1/2) + 2(1-y)I(y > 1/2)$, what are $\mu$ and $P_x$?

Exercise 1.4.7. In the setting of the Disintegration Theorem, show that if $X$ is a separable metric space then, for $\mu$-almost every $x$, $P_x(p^{-1}(x)) = 1$.

Exercise 1.4.8. Consider three random variables $X, Y, Z : \Omega \to \mathbb{R}$ such that
(a) $X$ and $Z$ are independent;
(b) $Y$ and $Z$ are independent conditionally on $X$;
(c) $Z \sim N(0,1)$.
Let $\mathcal{F} = \sigma(X+Y, X-Y)$ be the $\sigma$-algebra generated by $X+Y$ and $X-Y$. Compute the conditional expectation $\mathbb{E}(\sin(Z)|\mathcal{F})$.

Chapter 2
Laws of Large Numbers

2.1 Inequalities for sums of independent random variables

When we toss an unbiased coin many times, we expect the proportion of heads or tails to be close to one half, a phenomenon called the law of large numbers. More generally, if $(X_n)_{n\ge1}$ is a sequence of independent identically distributed (i.i.d.) random variables, we expect that their average
$$\overline{X}_n = \frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i$$
is, in some sense, close to the expectation $\mu = \mathbb{E}X_1$, assuming that it exists. In the next section, we will prove a general qualitative result of this type, but we begin with more quantitative statements for bounded random variables. We will need a couple of simple lemmas.
Lemma 2.1 (Chebyshev's inequality). If a random variable $X \ge 0$ then, for $t > 0$,
$$P(X \ge t) \le \frac{\mathbb{E}XI(X \ge t)}{t} \le \frac{\mathbb{E}X}{t}.$$

Proof. The first inequality follows from
$$\mathbb{E}XI(X \ge t) \ge t\,\mathbb{E}I(X \ge t) = tP(X \ge t).$$
In the second inequality we use that $X \ge 0$. □

This inequality is often used in the exponential form,
$$P(X \ge t) \le e^{-\lambda t}\,\mathbb{E}e^{\lambda X} \quad \text{for } \lambda > 0, \qquad (2.1)$$
which is obvious if we rewrite $\{X \ge t\} = \{e^{\lambda X} \ge e^{\lambda t}\}$.

Lemma 2.2. If a random variable $X$ satisfies $|X| \le 1$ and $\mathbb{E}X = 0$ then, for any $\lambda \ge 0$,
$$\mathbb{E}e^{\lambda X} \le e^{\lambda^2/2}.$$

Proof. Let us write $\lambda X$ as a convex combination
$$\lambda X = \frac{1+X}{2}\lambda + \frac{1-X}{2}(-\lambda),$$
where $(1+X)/2, (1-X)/2 \in [0,1]$ and their sum is equal to one. By the convexity of the exponential function,
$$e^{\lambda X} \le \frac{1+X}{2}e^{\lambda} + \frac{1-X}{2}e^{-\lambda},$$
and taking expectations we get $\mathbb{E}e^{\lambda X} \le \cosh(\lambda) \le e^{\lambda^2/2}$, where the last inequality is obvious by comparing the Taylor series on both sides. □

The first main result in this section is the following classical inequality.

Theorem 2.1 (Hoeffding's inequality). If $\varepsilon_1, \ldots, \varepsilon_n$ are independent random variables such that $|\varepsilon_i| \le 1$ and $\mathbb{E}\varepsilon_i = 0$ then, for any $a_1, \ldots, a_n \in \mathbb{R}$ and $t \ge 0$,
$$P\Big( \sum_{i=1}^n \varepsilon_i a_i \ge t \Big) \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n a_i^2} \Big).$$
The most classical case is that of i.i.d. Rademacher random variables $(\varepsilon_n)_{n\ge1}$ such that
$$P(\varepsilon_n = 1) = P(\varepsilon_n = -1) = \frac{1}{2},$$
which is equivalent to tossing a coin with heads and tails replaced by $\pm 1$.

Proof. We begin by writing, for $\lambda > 0$,
$$P\Big( \sum_{i=1}^n \varepsilon_i a_i \ge t \Big) \le e^{-\lambda t}\,\mathbb{E}\exp\Big( \lambda\sum_{i=1}^n \varepsilon_i a_i \Big) = e^{-\lambda t}\prod_{i=1}^n \mathbb{E}e^{\lambda\varepsilon_i a_i},$$
where in the last step we used Lemma 1.10. By Lemma 2.2, $\mathbb{E}e^{\lambda\varepsilon_i a_i} \le e^{\lambda^2 a_i^2/2}$ and, therefore,
$$P\Big( \sum_{i=1}^n \varepsilon_i a_i \ge t \Big) \le \exp\Big( -\lambda t + \frac{\lambda^2}{2}\sum_{i=1}^n a_i^2 \Big).$$
Optimizing over $\lambda \ge 0$, i.e. taking $\lambda = t/\sum_{i=1}^n a_i^2$, finishes the proof. □

We can also apply the same inequality to $(-\varepsilon_i)$ and, combining both cases,
$$P\Big( \Big| \sum_{i=1}^n \varepsilon_i a_i \Big| \ge t \Big) \le 2\exp\Big( -\frac{t^2}{2\sum_{i=1}^n a_i^2} \Big).$$
This implies, for example, that
$$P\Big( \Big| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Big| \ge t \Big) \le 2\exp\Big( -\frac{nt^2}{2} \Big).$$
This shows that, no matter how small $t > 0$ is, the probability that the average deviates from the expectation $\mathbb{E}\varepsilon_1 = 0$ by more than $t$ decreases exponentially fast with $n$.
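The exponential decay above is easy to observe in simulation. The following sketch compares the empirical two-sided tail of an average of Rademacher signs with the Hoeffding bound $2e^{-nt^2/2}$; the particular $n$, $t$, and trial count are arbitrary illustrative choices.

```python
import math
import random

random.seed(3)
n, trials, t = 200, 5000, 0.15

# Empirical tail P(|average of n Rademacher signs| >= t)
hits = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(s / n) >= t:
        hits += 1
empirical = hits / trials

# Two-sided Hoeffding bound with a_i = 1: 2 exp(-n t^2 / 2)
bound = 2.0 * math.exp(-n * t * t / 2.0)
assert empirical <= bound   # the bound holds (here with room to spare)
```

The bound is not tight: for these parameters it is roughly $0.21$ while the empirical tail is a few percent, which is consistent with Hoeffding's inequality being a worst-case subgaussian estimate.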
Let us now consider a more general case. For $p, q \in (0,1)$, we consider the function
$$D(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q},$$
which is called the Kullback-Leibler divergence. To see that $D(p,q) \ge 0$, with equality only if $p = q$, just use that $\log x \le x - 1$, with equality only if $x = 1$.

Theorem 2.2 (Hoeffding-Chernoff inequality). Suppose that random variables $(X_i)_{i\ge1}$ are independent, $0 \le X_i \le 1$, and $\mu = \mathbb{E}X$. Then
$$P\Big( \frac{1}{n}\sum_{i=1}^n X_i \ge \mu + t \Big) \le e^{-nD(\mu+t,\,\mu)}$$
for any $t \ge 0$ such that $\mu + t \le 1$.

Proof. Notice that the probability is zero when $\mu + t > 1$, since the average cannot exceed $1$, so $\mu + t \le 1$ is not really a constraint here. Using the convexity of the exponential function, we can write for $x \in [0,1]$ that
$$e^{\lambda x} = e^{\lambda(x\cdot1 + (1-x)\cdot0)} \le xe^{\lambda} + (1-x)e^{\lambda\cdot0} = 1 - x + xe^{\lambda},$$
which implies that
$$\mathbb{E}e^{\lambda X} \le 1 - \mathbb{E}X + \mathbb{E}Xe^{\lambda} = 1 - \mu + \mu e^{\lambda}.$$
Using this, we get the following bound for any $\lambda \ge 0$:
$$P\Big( \sum_{i=1}^n X_i \ge n(\mu+t) \Big) \le e^{-\lambda n(\mu+t)}\,\mathbb{E}e^{\lambda\sum X_i} = e^{-\lambda n(\mu+t)}\big( \mathbb{E}e^{\lambda X} \big)^n \le e^{-\lambda n(\mu+t)}\big( 1 - \mu + \mu e^{\lambda} \big)^n.$$
The derivative of the right-hand side in $\lambda$ is equal to zero at:
$$-n(\mu+t)e^{-\lambda n(\mu+t)}\big( 1 - \mu + \mu e^{\lambda} \big)^n + n\big( 1 - \mu + \mu e^{\lambda} \big)^{n-1}\mu e^{\lambda}e^{-\lambda n(\mu+t)} = 0,$$
$$-(\mu+t)\big( 1 - \mu + \mu e^{\lambda} \big) + \mu e^{\lambda} = 0,$$
$$e^{\lambda} = \frac{(1-\mu)(\mu+t)}{\mu(1-\mu-t)} \ge 1,$$
so the critical point $\lambda \ge 0$, as required. Substituting this back into the above bound,
$$\left( \left( \frac{\mu(1-\mu-t)}{(1-\mu)(\mu+t)} \right)^{\mu+t}\left( 1 - \mu + \frac{(1-\mu)(\mu+t)}{1-\mu-t} \right) \right)^n = \left( \left( \frac{\mu}{\mu+t} \right)^{\mu+t}\left( \frac{1-\mu}{1-\mu-t} \right)^{1-\mu-t} \right)^n$$
$$= \exp\left( -n\Big( (\mu+t)\log\frac{\mu+t}{\mu} + (1-\mu-t)\log\frac{1-\mu-t}{1-\mu} \Big) \right),$$
which finishes the proof. □

We can also apply this bound to $Z_i = 1 - X_i$, with the mean $\mu_Z = 1 - \mu$, to get
$$
P\Bigl( \frac{1}{n} \sum_{i=1}^n X_i \le \mu - t \Bigr) = P\Bigl( \frac{1}{n} \sum_{i=1}^n Z_i \ge \mu_Z + t \Bigr) \le e^{-n D(\mu_Z + t, \mu_Z)} = e^{-n D(1 - \mu + t, 1 - \mu)}.
$$
These inequalities show that, no matter how small $t > 0$ is, the probability that the average $\bar{X}_n$ deviates from the expectation $\mu$ by more than $t$ in either direction decreases exponentially fast with $n$. Of course, the same conclusion applies to any bounded random variables, $|X_i| \le M$, by shifting and rescaling the interval $[-M, M]$ into the interval $[0, 1]$.
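The bound is easy to compare with the exact binomial tail for Bernoulli($\mu$) variables (an illustration, not from the text; the parameters below are arbitrary):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p, q) for Bernoulli parameters."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, mu, t = 50, 0.3, 0.2

# Exact tail of Binomial(n, mu): P(sum X_i >= n * (mu + t))
k0 = math.ceil(n * (mu + t))
exact = sum(math.comb(n, k) * mu**k * (1 - mu)**(n - k) for k in range(k0, n + 1))

# Hoeffding-Chernoff bound e^{-n D(mu + t, mu)}
bound = math.exp(-n * kl(mu + t, mu))
print(exact, bound)
```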
Even though the Hoeffding-Chernoff inequality applies to all bounded random variables, in real-world applications in engineering, computer science, etc., one would like to improve the control of the probability by incorporating other measures of closeness of the random variable $X$ to the mean $\mu$, for example, the variance
$$
\sigma^2 = \mathrm{Var}(X) = E(X - \mu)^2 = EX^2 - (EX)^2.
$$
The following inequality is classical. Let us denote
$$
\varphi(x) = (1 + x) \log(1 + x) - x.
$$
We will center the random variables $(X_n)$ and instead work with $Z_n = X_n - \mu$.

Theorem 2.3 (Bennett's inequality). Consider i.i.d. $(Z_n)_{n \ge 1}$ such that $EZ = 0$, $EZ^2 = \sigma^2$ and $|Z| \le M$. Then, for all $t \ge 0$,
$$
P\Bigl( \frac{1}{n} \sum_{i=1}^n Z_i \ge t \Bigr) \le \exp\Bigl( -\frac{n \sigma^2}{M^2} \varphi\Bigl( \frac{t M}{\sigma^2} \Bigr) \Bigr).
$$

Proof. As above, using that $(Z_i)$ are i.i.d.,
$$
P\Bigl( \sum_{i=1}^n Z_i \ge n t \Bigr) \le e^{-\lambda n t} \bigl( E e^{\lambda Z} \bigr)^n.
$$
Using the Taylor series expansion and the fact that $EZ = 0$, $EZ^2 = \sigma^2$ and $|Z| \le M$, we can write
$$
E e^{\lambda Z} = E \sum_{k=0}^{\infty} \frac{(\lambda Z)^k}{k!} = \sum_{k=0}^{\infty} \lambda^k \frac{E Z^k}{k!}
= 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!} E Z^2 Z^{k-2}
\le 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!} M^{k-2} \sigma^2
$$
$$
= 1 + \frac{\sigma^2}{M^2} \sum_{k=2}^{\infty} \frac{\lambda^k M^k}{k!}
= 1 + \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr)
\le \exp\Bigl( \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr) \Bigr),
$$
where in the last inequality we used $1 + x \le e^x$. Therefore,
$$
P\Bigl( \sum_{i=1}^n Z_i \ge n t \Bigr) \le \exp\Bigl( n \Bigl( -\lambda t + \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr) \Bigr) \Bigr).
$$

Now, we optimize this bound over $\lambda \ge 0$. We find the critical point:
$$
-t + \frac{\sigma^2}{M^2} \bigl( M e^{\lambda M} - M \bigr) = 0,
\qquad
e^{\lambda M} = \frac{t M}{\sigma^2} + 1,
\qquad
\lambda = \frac{1}{M} \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr).
$$
Since this $\lambda \ge 0$, plugging it into the above bound,
$$
\exp\biggl( n \Bigl( -\frac{t}{M} \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) + \frac{\sigma^2}{M^2} \Bigl( \frac{t M}{\sigma^2} + 1 - 1 - \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) \Bigr) \Bigr) \biggr)
$$
$$
= \exp\biggl( -n \frac{\sigma^2}{M^2} \Bigl( \frac{t M}{\sigma^2} \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) - \frac{t M}{\sigma^2} + \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) \Bigr) \biggr)
$$
$$
= \exp\biggl( -n \frac{\sigma^2}{M^2} \Bigl( \Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) \log\Bigl( 1 + \frac{t M}{\sigma^2} \Bigr) - \frac{t M}{\sigma^2} \Bigr) \biggr)
= \exp\Bigl( -\frac{n \sigma^2}{M^2} \varphi\Bigl( \frac{t M}{\sigma^2} \Bigr) \Bigr)
$$
finishes the proof. □
To simplify the bound in Bennett's inequality, one can notice that the function
$$
\psi(x) := \varphi(x) - \frac{x^2}{2(1 + x/3)} \ge 0, \quad \text{because} \quad \psi(0) = \psi'(0) = 0 \ \text{ and } \ \psi''(x) = \frac{x^2 (x + 9)}{(x + 1)(x + 3)^3} \ge 0,
$$
which implies that
$$
P\Bigl( \frac{1}{n} \sum_{i=1}^n Z_i \ge t \Bigr) \le \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + t M / 3)} \Bigr).
$$

Combining with the same inequality for $(-Z_i)$, and recalling that $Z_i = X_i - \mu$, we obtain another classical inequality, Bernstein's inequality:
$$
P\Bigl( \Bigl| \frac{1}{n} \sum_{i=1}^n X_i - \mu \Bigr| \ge t \Bigr) \le 2 \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + t M / 3)} \Bigr).
$$
For small $t$, the denominator is of order $\sim 2\sigma^2$, and we get a better control of the probability when the variance is small.
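For a sense of the improvement (an aside, not from the text), one can compare Bernstein's bound with the variance-free bound $2e^{-2nt^2}$ that Hoeffding's inequality gives for $[0,1]$-valued variables (Exercise 2.1.4 derives the fair-coin case); when $\sigma^2$ is much smaller than the worst case $1/4$, Bernstein wins for small $t$. The parameters below are arbitrary:

```python
import math

def bernstein(n, t, sigma2, M):
    # Bernstein's inequality: 2 exp(-n t^2 / (2 (sigma^2 + t M / 3)))
    return 2 * math.exp(-n * t**2 / (2 * (sigma2 + t * M / 3)))

def hoeffding(n, t):
    # Variance-free Hoeffding bound for [0,1]-valued variables
    return 2 * math.exp(-2 * n * t**2)

n, t = 1000, 0.05
# Small variance (e.g. Bernoulli with mu = 0.01: sigma^2 = 0.0099, |X - mu| <= 1)
b_small = bernstein(n, t, sigma2=0.0099, M=1.0)
h = hoeffding(n, t)
print(b_small, h)
```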
Poisson approximation of the Binomial. As a motivation for one of the exercises below, let us prove a classical approximation of the Binomial distribution by the Poisson. Consider independent Bernoulli random variables $X_i \sim B(p_i)$ for $i \le n$ for some $p_i \in (0, 1)$. Recall that, for $\lambda > 0$, the Poisson distribution $P_\lambda$ is defined by
$$
P_\lambda(\{k\}) = \frac{\lambda^k}{k!} e^{-\lambda} \quad \text{for integer } k \ge 0.
$$
The following holds.

Theorem 2.4. If $S_n = X_1 + \ldots + X_n$ and $\lambda = p_1 + \ldots + p_n$ then, for any subset of integers $B \subseteq \mathbb{Z}$,
$$
\bigl| P(S_n \in B) - P_\lambda(B) \bigr| \le \sum_{i=1}^n p_i^2.
$$
In particular, if $\lambda > 0$ is fixed and all $p_i = \frac{\lambda}{n}$ then the right hand side is $\frac{\lambda^2}{n}$, which is small for large $n$. This shows that the Binomial distribution of the sum $S_n$ is well approximated by the Poisson distribution with the same mean.

Proof. The proof is based on a construction on "the same probability space". Let us construct a Bernoulli r.v. $X_i \sim B(p_i)$ and a Poisson r.v. $X_i^* \sim P_{p_i}$ on the same probability space as follows. Let us consider the standard probability space $([0, 1], \mathcal{B}, P)$ with the Lebesgue measure $P$. Let us fix $i$ for a moment and define a random variable
$$
X_i = X_i(x) = \begin{cases} 0, & 0 \le x \le 1 - p_i, \\ 1, & 1 - p_i < x \le 1. \end{cases}
$$
Clearly, $X_i \sim B(p_i)$. Let us construct $X_i^*$ as follows. For $k \ge 0$, let us denote
$$
c_k = \sum_{\ell = 0}^{k} e^{-p_i} \frac{(p_i)^\ell}{\ell!}
$$
and let
$$
X_i^* = X_i^*(x) = \begin{cases} 0, & 0 \le x \le c_0, \\ 1, & c_0 < x \le c_1, \\ 2, & c_1 < x \le c_2, \\ \ldots \end{cases}
$$
Clearly, $X_i^*$ has the Poisson distribution $P_{p_i}$. When do we have $X_i \ne X_i^*$? Since $1 - p_i \le e^{-p_i} = c_0$, this can only happen for $1 - p_i < x \le c_0$ and $c_1 < x \le 1$, which implies that
$$
P(X_i \ne X_i^*) = e^{-p_i} - (1 - p_i) + \bigl( 1 - e^{-p_i} - p_i e^{-p_i} \bigr) = p_i \bigl( 1 - e^{-p_i} \bigr) \le p_i^2.
$$
Then we construct the pairs $(X_i, X_i^*)$ for $i \le n$ on separate coordinates of the product space $[0, 1]^n$ with the product Lebesgue measure, thus making them independent for $i \le n$. It is well-known (and easy to check) that a sum of independent Poisson random variables is again Poisson with the parameter given by the sum of the individual parameters and, therefore, $S_n^* = \sum_{i \le n} X_i^*$ has the Poisson distribution $P_\lambda$, where $\lambda = p_1 + \ldots + p_n$. Finally, we use the union bound to conclude that
$$
\bigl| P(S_n \in B) - P_\lambda(B) \bigr| \le P(S_n \ne S_n^*) \le \sum_{i=1}^n P(X_i \ne X_i^*) \le \sum_{i=1}^n p_i^2,
$$
which finishes the proof. □
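Theorem 2.4 can be tested numerically (an aside, not part of the text): the supremum over sets $B$ of $|P(S_n \in B) - P_\lambda(B)|$ equals half the $\ell_1$ distance between the two probability mass functions, and in the case $p_i = \lambda/n$ it should not exceed $\lambda^2/n$. The parameters are arbitrary:

```python
import math

lam, n = 2.0, 50
p = lam / n

# Probability mass functions of Binomial(n, p) and Poisson(lam) on 0..n
binom = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
poisson = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(n + 1)]

# sup_B |P(S_n in B) - P_lam(B)| = half the l1 distance; the Poisson mass
# beyond n contributes |0 - q_k| there, so add it to the l1 sum.
tail = max(0.0, 1 - sum(poisson))
tv = 0.5 * (sum(abs(b - q) for b, q in zip(binom, poisson)) + tail)
bound = n * p**2  # = lam^2 / n
print(tv, bound)
```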

Exercise 2.1.1. Prove that, for $0 < \mu \le 1/2$ and $0 \le t < \mu$,
$$
D(1 - \mu + t, 1 - \mu) \ge \frac{t^2}{2 \mu (1 - \mu)}.
$$
Hint: compare second derivatives.

Exercise 2.1.2. In the setting of Theorem 2.2, show that, for $s > 0$,
$$
P\Bigl( \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) \ge s \Bigr) \le \exp\Bigl( -\frac{s^2}{2 \mu (1 - \mu)} + O\bigl( n^{-1/2} \bigr) \Bigr).
$$
Notice that, by the previous exercise, if $\mu \ge 1/2$ then the same inequality holds without the $O(n^{-1/2})$ term. (Once we learn the Central Limit Theorem, this will show that the Hoeffding-Chernoff inequality is almost sharp for i.i.d. Bernoulli $B(\mu)$ random variables in the regime $t = \frac{s}{\sqrt{n}}$ for $s \ge 1$.)

Exercise 2.1.3. Fix $\lambda > 0$ and $s > 0$. (a) In the setting of Theorem 2.2, if $\mu = \frac{\lambda}{n}$, show that
$$
P\Bigl( \sum_{i=1}^n X_i \ge \lambda + s \Bigr) \le \exp\Bigl( s - (\lambda + s) \log\frac{\lambda + s}{\lambda} + O\bigl( n^{-1} \bigr) \Bigr).
$$
(b) If $X$ has the Poisson distribution $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$, $k = 0, 1, 2, \ldots$, show that
$$
P\bigl( X \ge \lambda + s \bigr) \le \exp\Bigl( s - (\lambda + s) \log\frac{\lambda + s}{\lambda} \Bigr)
$$
and, when $\lambda + s$ is an integer,
$$
P\bigl( X \ge \lambda + s \bigr) \ge \frac{1}{e \sqrt{\lambda + s}} \exp\Bigl( s - (\lambda + s) \log\frac{\lambda + s}{\lambda} \Bigr).
$$
Hint: For the lower bound, use Stirling's formula in the form $k! \le e \, k^{k + 1/2} e^{-k}$. (Together with Theorem 2.4 above, this exercise shows that the Hoeffding-Chernoff inequality is almost sharp in the regime $\mu = \frac{\lambda}{n}$, $t = \frac{s}{n}$ for $s \ge 1$.)

Exercise 2.1.4. Let $X_1, \ldots, X_n$ be independent flips of a fair coin, i.e. $P(X_i = 0) = P(X_i = 1) = 1/2$. If $\bar{X}_n$ is their average, show that for $t \ge 0$,
$$
P\bigl( |\bar{X}_n - 1/2| > t \bigr) \le 2 e^{-2 n t^2}.
$$
Hint: use the first problem above twice.

Exercise 2.1.5. Suppose that the random variables $X_1, \ldots, X_n, X_1', \ldots, X_n'$ are independent and, for all $i \le n$, $X_i$ and $X_i'$ have the same distribution. Prove that
$$
P\Bigl( \sum_{i=1}^n (X_i - X_i') > 2 \Bigl( t \sum_{i=1}^n (X_i - X_i')^2 \Bigr)^{1/2} \Bigr) \le e^{-t}.
$$
Hint: think about a way to introduce Rademacher random variables $\varepsilon_i$ into the problem and then use Hoeffding's inequality.

Exercise 2.1.6 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be i.i.d. random variables such that $|X_i - \mu| \le M$, $EX = \mu$ and $\mathrm{Var}(X) = \sigma^2$. If $\bar{X}_n$ is their average, make a change of variables in Bernstein's inequality to show that, for $t > 0$,
$$
P\Bigl( \bar{X}_n \ge \mu + \sqrt{\frac{2 \sigma^2 t}{n}} + \frac{2 M t}{3 n} \Bigr) \le e^{-t}.
$$
Exercise 2.1.7. Suppose that the chances of winning the jackpot in a lottery are 1 : 139,000,000. Assuming that 100,000,000 people played independently of each other, estimate the probability that 3 of them will have to share the jackpot. Give a bound on the quality of your estimate.

2.2 Laws of large numbers

In this section, we will study two types of convergence of the average to the mean, in probability and almost surely. Consider a sequence of random variables $(Y_n)_{n \ge 1}$ on some probability space $(\Omega, \mathcal{A}, P)$. We say that $Y_n$ converges in probability to a random variable $Y$ (and write $Y_n \xrightarrow{p} Y$) if, for all $\varepsilon > 0$,
$$
\lim_{n \to \infty} P(|Y_n - Y| \ge \varepsilon) = 0.
$$
We say that $Y_n$ converges to $Y$ almost surely, or with probability 1, if
$$
P\bigl( \omega : \lim_{n \to \infty} Y_n(\omega) = Y(\omega) \bigr) = 1.
$$
Let us begin with an easier case.


Theorem 2.5 (Weak law of large numbers). Consider a sequence of random variables $(X_n)_{n \ge 1}$ that are centered, $EX_n = 0$, have uniformly bounded second moments, $EX_n^2 \le K < \infty$, and are uncorrelated, $EX_i X_j = 0$ for $i \ne j$. Then
$$
\bar{X}_n = \frac{1}{n} \sum_{i \le n} X_i \to 0
$$
in probability.

Proof. By Chebyshev's inequality, also using that $EX_i X_j = 0$,
$$
P\bigl( |\bar{X}_n - 0| \ge \varepsilon \bigr) = P\bigl( \bar{X}_n^2 \ge \varepsilon^2 \bigr) \le \frac{E \bar{X}_n^2}{\varepsilon^2}
= \frac{1}{n^2 \varepsilon^2} E (X_1 + \cdots + X_n)^2
= \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^n E X_i^2 \le \frac{n K}{n^2 \varepsilon^2} = \frac{K}{n \varepsilon^2} \to 0,
$$
as $n \to \infty$, which finishes the proof. □
Of course, if $(X_n)_{n \ge 1}$ are independent then they are automatically uncorrelated, since $EX_i X_j = EX_i \, EX_j = 0$. Before we move on to the almost sure convergence, let us give one more application of the above argument to the problem of approximation of continuous functions. Consider an i.i.d. sequence $(X_n)_{n \ge 1}$ with the distribution $P$ on $\mathbb{R}$ that depends on some parameter $\theta \in \Theta \subseteq \mathbb{R}$, and suppose that
$$
EX_i = \theta, \quad \mathrm{Var}(X_i) = \sigma^2(\theta).
$$
Then the following holds.

Theorem 2.6. If $u : \mathbb{R} \to \mathbb{R}$ is continuous and bounded and $\sigma^2(\theta)$ is bounded on compacts then $E u(\bar{X}_n) \to u(\theta)$ uniformly on compacts.
Proof. For any $\varepsilon > 0$, we can write
$$
|E u(\bar{X}_n) - u(\theta)| \le E |u(\bar{X}_n) - u(\theta)|
= E |u(\bar{X}_n) - u(\theta)| \bigl( I(|\bar{X}_n - \theta| \le \varepsilon) + I(|\bar{X}_n - \theta| > \varepsilon) \bigr)
\le \max_{|x - \theta| \le \varepsilon} |u(x) - u(\theta)| + 2 \|u\|_\infty P(|\bar{X}_n - \theta| > \varepsilon).
$$
The last probability can be bounded as in the previous theorem,
$$
P(|\bar{X}_n - \theta| > \varepsilon) \le \frac{\sigma^2(\theta)}{n \varepsilon^2},
$$
and, therefore,
$$
|E u(\bar{X}_n) - u(\theta)| \le \max_{|x - \theta| \le \varepsilon} |u(x) - u(\theta)| + \frac{2 \|u\|_\infty \sigma^2(\theta)}{n \varepsilon^2}.
$$
The statement of the theorem should now be obvious. □
Example 2.2.1. Let $(X_i)$ be i.i.d. with the Bernoulli distribution $B(\theta)$ with probability of success $\theta \in [0, 1]$,
$$
P(X_i = 1) = \theta, \quad P(X_i = 0) = 1 - \theta,
$$
and let $u : [0, 1] \to \mathbb{R}$ be continuous. Then, by the above theorem, the following linear combination of the Bernstein polynomials,
$$
B_n(\theta) := E u(\bar{X}_n) = \sum_{k=0}^n u\Bigl( \frac{k}{n} \Bigr) P\Bigl( \sum_{i=1}^n X_i = k \Bigr) = \sum_{k=0}^n u\Bigl( \frac{k}{n} \Bigr) \binom{n}{k} \theta^k (1 - \theta)^{n - k},
$$
approximates $u(\theta)$ uniformly on $[0, 1]$. This is an explicit example of polynomials in the Weierstrass theorem that approximate a continuous function on $[0, 1]$. □
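The Bernstein approximation is easy to try numerically (an aside, not part of the text); the test function $u(x) = |x - 1/2|$ and the evaluation grid are arbitrary choices, and the uniform error visibly shrinks as $n$ grows:

```python
import math

def bernstein_poly(u, n, x):
    """Evaluate B_n(x) = sum_k u(k/n) C(n,k) x^k (1-x)^(n-k)."""
    return sum(u(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

u = lambda x: abs(x - 0.5)  # continuous, but not smooth at 1/2

def max_err(n):
    # Estimate the uniform error on a grid of 101 points in [0, 1]
    grid = [i / 100 for i in range(101)]
    return max(abs(bernstein_poly(u, n, x) - u(x)) for x in grid)

print(max_err(20), max_err(200))
```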
Example 2.2.2. Suppose $(X_i)$ have the Poisson distribution $P(\theta)$ with the parameter $\theta > 0$,
$$
P_\theta(X_i = k) = \frac{\theta^k}{k!} e^{-\theta} \quad \text{for integer } k \ge 0.
$$
Then, it is well known (and easy to check) that $EX_i = \theta$, $\sigma^2(\theta) = \theta$, and the sum $X_1 + \ldots + X_n$ has the Poisson distribution $P(n\theta)$. Therefore, if $u$ is bounded and continuous on $[0, +\infty)$ then
$$
E u(\bar{X}_n) = \sum_{k=0}^{\infty} u\Bigl( \frac{k}{n} \Bigr) P\Bigl( \sum_{i=1}^n X_i = k \Bigr) = \sum_{k=0}^{\infty} u\Bigl( \frac{k}{n} \Bigr) \frac{(n\theta)^k}{k!} e^{-n\theta} \to u(\theta)
$$
uniformly on compact sets. □
Before we turn to the almost sure convergence results, let us note that convergence in probability is weaker than a.s. convergence. For example, consider a probability space which is a circle of circumference 1 with the uniform measure on it. Consider a sequence of r.v. on this probability space defined by
$$
X_k(x) = I\Bigl( x \in \Bigl[ 1 + \frac{1}{2} + \cdots + \frac{1}{k}, \ 1 + \cdots + \frac{1}{k+1} \Bigr) \bmod 1 \Bigr).
$$
Then, $X_k \to 0$ in probability, since for $0 < \varepsilon < 1$,
$$
P(|X_k - 0| \ge \varepsilon) = \frac{1}{k+1} \to 0.
$$
However, $X_k$ does not have an almost sure limit, because the series $\sum_{k \ge 1} 1/k$ diverges and, as a result, each point $x$ on the circle will fall into the above intervals infinitely many times, i.e. it will satisfy $X_k(x) = 1$ for infinitely many $k$. Let us begin with the following lemma.

Lemma 2.3. Consider a sequence $(p_i)_{i \ge 1}$ such that $p_i \in [0, 1)$. Then
$$
\prod_{i \ge 1} (1 - p_i) = 0 \iff \sum_{i \ge 1} p_i = +\infty.
$$

Proof. "$\Leftarrow$". Using that $1 - p \le e^{-p}$ we get
$$
\prod_{i \le n} (1 - p_i) \le \exp\Bigl( -\sum_{i \le n} p_i \Bigr) \to 0 \ \text{ as } n \to \infty.
$$
"$\Rightarrow$". We can assume that $p_i \le 1/2$ for $i \ge m$ for large enough $m$, because, otherwise, the series obviously diverges. Since $1 - p \ge e^{-2p}$ for $p \le 1/2$ we have
$$
\prod_{m \le i \le n} (1 - p_i) \ge \exp\Bigl( -2 \sum_{m \le i \le n} p_i \Bigr)
$$
and the result follows. □
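Two concrete sequences illustrate the lemma (an aside, not from the text): for $p_i = 1/i$ the partial products $\prod_{i=2}^n (1 - 1/i)$ telescope to $1/n \to 0$, while for $p_i = 1/i^2$ they telescope to $(n+1)/(2n) \to 1/2 > 0$:

```python
def partial_product(ps):
    out = 1.0
    for p in ps:
        out *= 1 - p
    return out

n = 10000
# sum 1/i diverges: the product goes to 0 (it telescopes to 1/n)
prod_div = partial_product(1 / i for i in range(2, n + 1))
# sum 1/i^2 converges: the product stays away from 0 (telescopes to (n+1)/(2n))
prod_conv = partial_product(1 / i**2 for i in range(2, n + 1))
print(prod_div, prod_conv)
```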

The following result plays a key role in Probability Theory. Consider a sequence $(A_n)_{n \ge 1}$ of events $A_n \in \mathcal{A}$ on the probability space $(\Omega, \mathcal{A}, P)$. We will denote by
$$
\{ A_n \ \text{i.o.} \} := \bigcap_{n \ge 1} \bigcup_{m \ge n} A_m
$$
the event that the $A_n$ occur infinitely often, which consists of all $\omega \in \Omega$ that belong to infinitely many events in the sequence $(A_n)_{n \ge 1}$. Then the following holds.

Lemma 2.4 (Borel-Cantelli lemma).
(1) If $\sum_{n \ge 1} P(A_n) < \infty$ then $P(A_n \ \text{i.o.}) = 0$.
(2) If the $A_n$ are independent and $\sum_{n \ge 1} P(A_n) = +\infty$ then $P(A_n \ \text{i.o.}) = 1$.

Proof. (1) If $B_n = \bigcup_{m \ge n} A_m$ then $B_n \supseteq B_{n+1}$ and, by the continuity of measure,
$$
P(A_n \ \text{i.o.}) = P\Bigl( \bigcap_{n \ge 1} B_n \Bigr) = \lim_{n \to \infty} P(B_n).
$$
On the other hand,
$$
P(B_n) = P\Bigl( \bigcup_{m \ge n} A_m \Bigr) \le \sum_{m \ge n} P(A_m),
$$
which goes to zero as $n \to +\infty$, because the series $\sum_{m \ge 1} P(A_m) < \infty$.
(2) We can write
$$
P(\Omega \setminus B_n) = P\Bigl( \Omega \setminus \bigcup_{m \ge n} A_m \Bigr) = P\Bigl( \bigcap_{m \ge n} (\Omega \setminus A_m) \Bigr)
= \prod_{m \ge n} P(\Omega \setminus A_m) = \prod_{m \ge n} \bigl( 1 - P(A_m) \bigr) = 0,
$$
where we used independence and Lemma 2.3, since $\sum_{m \ge n} P(A_m) = +\infty$. Therefore, $P(B_n) = 1$ and $P(A_n \ \text{i.o.}) = P\bigl( \bigcap_{n \ge 1} B_n \bigr) = 1$. □
u
Let us show how this implies the strong law of large numbers for bounded random variables. Recall that in the case when $\mu = EX_1$, $\sigma^2 = \mathrm{Var}(X_1)$ and $|X_1 - \mu| \le M$, the average satisfies Bernstein's inequality proved in the last section,
$$
P\bigl( |\bar{X}_n - \mu| \ge t \bigr) \le 2 \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + t M / 3)} \Bigr).
$$
If we take
$$
t = t_n = \Bigl( \frac{8 \sigma^2 \log n}{n} \Bigr)^{1/2}
$$
then, for $n$ large enough such that $t_n M / 3 \le \sigma^2$, we have
$$
P\bigl( |\bar{X}_n - \mu| \ge t_n \bigr) \le 2 \exp\Bigl( -\frac{n t_n^2}{4 \sigma^2} \Bigr) = \frac{2}{n^2}.
$$
Since the series $\sum_{n \ge 1} \frac{2}{n^2}$ converges, the Borel-Cantelli lemma implies that
$$
P\bigl( |\bar{X}_n - \mu| \ge t_n \ \text{i.o.} \bigr) = 0.
$$
This means that for $P$-almost all $\omega \in \Omega$, the difference $|\bar{X}_n(\omega) - \mu|$ will become smaller than $t_n$ for large enough $n \ge n_0(\omega)$. If we recall the definition of almost sure convergence, this means that $\bar{X}_n$ converges to $\mu$ almost surely, the so-called strong law of large numbers. Next, we will show that this holds even for unbounded random variables under the minimal assumption that the expectation $\mu = EX_1$ exists.
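A seeded simulation (an aside; the sample size and seed are arbitrary) checks the envelope $t_n = (8\sigma^2 \log n / n)^{1/2}$ for averages of fair coin flips, where $\mu = 1/2$ and $\sigma^2 = 1/4$:

```python
import math
import random

random.seed(1)
n_max = 10000
mu, sigma2 = 0.5, 0.25  # fair coin: mean 1/2, variance 1/4

s = 0
violations = 0
for n in range(1, n_max + 1):
    s += random.randint(0, 1)
    if n >= 100:  # check the envelope once n is large (t_n M/3 <= sigma^2 here)
        t_n = math.sqrt(8 * sigma2 * math.log(n) / n)
        if abs(s / n - mu) >= t_n:
            violations += 1
print(violations)
```

In a typical run the running average never leaves the envelope at all, matching the Borel-Cantelli conclusion that only finitely many violations occur.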
Strong law of large numbers. The following simple observation will be useful: if a random variable $X \ge 0$ then
$$
EX = \int_0^\infty P(X \ge x) \, dx.
$$
Indeed, if $F$ is the law of $X$ on $\mathbb{R}$ then
$$
EX = \int_0^\infty x \, dF(x) = \int_0^\infty \int_0^x 1 \, ds \, dF(x) = \int_0^\infty \int_s^\infty 1 \, dF(x) \, ds = \int_0^\infty P(X \ge s) \, ds.
$$
For $X \ge 0$ such that $EX < \infty$ this implies
$$
\sum_{i \ge 1} P(X \ge i) \le \int_0^\infty P(X \ge s) \, ds = EX < \infty. \tag{2.2}
$$
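For integer-valued $X \ge 0$ the comparison in (2.2) is in fact an equality, $EX = \sum_{i \ge 1} P(X \ge i)$, which is easy to check on a small example (an aside, not from the text; the distribution below is an arbitrary choice):

```python
from fractions import Fraction

# A small integer-valued distribution, P(X = k) given exactly as fractions
pmf = {0: Fraction(1, 8), 1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 8)}

mean = sum(k * p for k, p in pmf.items())
# Tail sum: sum over i >= 1 of P(X >= i)
tail_sum = sum(sum(p for k, p in pmf.items() if k >= i) for i in range(1, 4))
print(mean, tail_sum)
```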

As before, let $(X_n)_{n \ge 1}$ be i.i.d. random variables on the same probability space.

Theorem 2.7 (Strong law of large numbers). If $E|X_1| < \infty$ then
$$
\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mu = EX_1 \ \text{ almost surely}.
$$

Proof. The proof will proceed in several steps.

Step 1. First, without loss of generality we can assume that $X_i \ge 0$. Indeed, for signed random variables we can decompose $X_i = X_i^+ - X_i^-$, where
$$
X_i^+ = X_i \, I(X_i \ge 0) \quad \text{and} \quad X_i^- = |X_i| \, I(X_i < 0),
$$
and the general result would follow, since
$$
\frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} \sum_{i=1}^n X_i^+ - \frac{1}{n} \sum_{i=1}^n X_i^- \to EX_1^+ - EX_1^- = EX_1.
$$
Thus, from now on we assume that $X_i \ge 0$.


Step 2. (Truncation) Next, we can replace $X_i$ by $Y_i = X_i \, I(X_i \le i)$ using the Borel-Cantelli lemma. Since
$$
\sum_{i \ge 1} P(X_i \ne Y_i) = \sum_{i \ge 1} P(X_i > i) \le EX_1 < \infty,
$$
the Borel-Cantelli lemma implies that $P(\{ X_i \ne Y_i \} \ \text{i.o.}) = 0$. This means that for some (random) $i_0$ and for $i \ge i_0$ we have $X_i = Y_i$ and, therefore,
$$
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n Y_i \ \text{ a.s.}
$$
It remains to show that if $T_n = \sum_{i=1}^n Y_i$ then $T_n / n \to EX$ almost surely.


Step 3. (Limits over subsequences) We will first prove almost sure convergence along subsequences of the type $n(k) = \lfloor \alpha^k \rfloor$ for $\alpha > 1$. For any $\varepsilon > 0$,
$$
\sum_{k \ge 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon n(k) \bigr)
\le \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2} \mathrm{Var}\bigl( T_{n(k)} \bigr)
= \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2} \sum_{i \le n(k)} \mathrm{Var}(Y_i)
$$
$$
\le \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2} \sum_{i \le n(k)} E Y_i^2
= \frac{1}{\varepsilon^2} \sum_{i \ge 1} E Y_i^2 \sum_{k : n(k) \ge i} \frac{1}{n(k)^2}.
$$
To bound the last sum, let us note that
$$
\frac{\alpha^k}{2} \le n(k) = \lfloor \alpha^k \rfloor \le \alpha^k
$$
and if $k_0 = \min\{ k : \alpha^k \ge i \}$ then
$$
\sum_{k : n(k) \ge i} \frac{1}{n(k)^2} \le \sum_{k : \alpha^k \ge i} \frac{4}{\alpha^{2k}} = \frac{4}{\alpha^{2 k_0} (1 - \frac{1}{\alpha^2})} \le \frac{K}{i^2},
$$
where $K = K(\alpha)$ depends on $\alpha$ only. Therefore, we showed that
$$
\sum_{k \ge 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon n(k) \bigr) \le K \sum_{i \ge 1} \frac{E Y_i^2}{i^2} = K \sum_{i \ge 1} \frac{1}{i^2} \int_0^i x^2 \, dF(x),
$$
where $F$ is the law of $X$. We can bound the last sum as follows:
$$
\sum_{i \ge 1} \frac{1}{i^2} \int_0^i x^2 \, dF(x) = \sum_{i \ge 1} \frac{1}{i^2} \sum_{m < i} \int_m^{m+1} x^2 \, dF(x) = \sum_{m \ge 0} \sum_{i > m} \frac{1}{i^2} \int_m^{m+1} x^2 \, dF(x)
$$
$$
\le \sum_{m \ge 0} \frac{2}{m+1} \int_m^{m+1} x^2 \, dF(x) \le 2 \sum_{m \ge 0} \int_m^{m+1} x \, dF(x) = 2 EX < \infty.
$$
We proved that
$$
\sum_{k \ge 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon n(k) \bigr) < \infty
$$
and the Borel-Cantelli lemma implies that
$$
P\bigl( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon n(k) \ \text{i.o.} \bigr) = 0.
$$
If we take a sequence $\varepsilon_m = m^{-1}$, this implies that
$$
P\bigl( \exists m \ge 1, \ |T_{n(k)} - ET_{n(k)}| \ge m^{-1} n(k) \ \text{i.o.} \bigr) = 0,
$$
and this proves that
$$
\frac{T_{n(k)} - ET_{n(k)}}{n(k)} \to 0
$$
with probability one. On the other hand, it is obvious that
$$
\frac{ET_{n(k)}}{n(k)} = \frac{1}{n(k)} \sum_{i \le n(k)} E X_1 I(X_1 \le i) \to EX
$$
as $k \to \infty$, so we proved that
$$
\frac{T_{n(k)}}{n(k)} \to EX \ \text{ almost surely}.
$$

Step 4. Finally, for $j$ such that $n(k) \le j < n(k+1)$, since
$$
n(k+1) = n(k) \, \frac{n(k+1)}{n(k)} \le n(k) \, \alpha^2
$$
for all large enough $k$, and since the sums $T_n$ are non-decreasing (here $Y_i \ge 0$), we can write
$$
\frac{1}{\alpha^2} \frac{T_{n(k)}}{n(k)} \le \frac{T_j}{j} \le \alpha^2 \frac{T_{n(k+1)}}{n(k+1)}
$$
and, therefore, with probability one,
$$
\frac{1}{\alpha^2} EX \le \liminf_{j \to \infty} \frac{T_j}{j} \le \limsup_{j \to \infty} \frac{T_j}{j} \le \alpha^2 EX.
$$
Taking $\alpha = 1 + m^{-1}$ and letting $m \to \infty$ proves that $\lim_{j \to \infty} T_j / j = EX$ almost surely. □

Exercise 2.2.1. Suppose that the random variables $(X_n)_{n \ge 1}$ are i.i.d. and such that $E|X_1|^p < \infty$ for some $p > 0$. Show that $\max_{i \le n} |X_i| / n^{1/p}$ goes to zero in probability.

Exercise 2.2.2 (Weak LLN for U-statistics). If $(X_n)_{n \ge 1}$ are i.i.d., $EX_1 = \mu$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, show that
$$
\binom{n}{2}^{-1} \sum_{1 \le i < j \le n} X_i X_j \to \mu^2
$$
in probability as $n \to \infty$.

Exercise 2.2.3. If $u : [0, 1]^k \to \mathbb{R}$ is continuous then show that
$$
\sum_{0 \le j_1, \ldots, j_k \le n} u\Bigl( \frac{j_1}{n}, \ldots, \frac{j_k}{n} \Bigr) \prod_{i \le k} \binom{n}{j_i} x_i^{j_i} (1 - x_i)^{n - j_i} \to u(x_1, \ldots, x_k)
$$
as $n \to \infty$, uniformly on $[0, 1]^k$.

Exercise 2.2.4. If $E|X| < \infty$ and $\lim_{n \to \infty} P(A_n) = 0$, show that $\lim_{n \to \infty} E X I_{A_n} = 0$. (Hint: use the Borel-Cantelli lemma over some subsequence.)

Exercise 2.2.5. Suppose that $A_n$ is a sequence of events such that
$$
\lim_{n \to \infty} P(A_n) = 0 \quad \text{and} \quad \sum_{n \ge 1} P(A_n \setminus A_{n+1}) < \infty.
$$
Prove that $P(A_n \ \text{i.o.}) = 0$.



Exercise 2.2.6. Let $\{X_n\}_{n \ge 1}$ be i.i.d. and $S_n = X_1 + \ldots + X_n$. If $S_n / n \to 0$ a.s., show that $E|X_1| < \infty$. (Hint: use the idea in (2.2) and the Borel-Cantelli lemma.)

Exercise 2.2.7. Suppose that $(X_n)_{n \ge 1}$ are independent random variables. Show that $P(\sup_{n \ge 1} X_n < \infty) = 1$ if and only if $\sum_{n \ge 1} P(X_n > M) < \infty$ for some $M > 0$.

Exercise 2.2.8. Consider i.i.d. nonnegative random variables $X_i$ for $i \ge 1$ such that $EX_1 = +\infty$, and let $S_n = X_1 + \ldots + X_n$. Show that $\lim_{n \to \infty} \frac{S_n}{n} = +\infty$ almost surely.

Exercise 2.2.9. If $(X_n)_{n \ge 1}$ are i.i.d. and not constant with probability one, then $P(X_n \ \text{converges}) = 0$.

Exercise 2.2.10. Let $\{X_n\}_{n \ge 1}$ be independent and exponentially distributed, i.e. with c.d.f. $F(x) = 1 - e^{-x}$ for $x \ge 0$. Show that
$$
P\Bigl( \limsup_{n \to \infty} \frac{X_n}{\log n} = 1 \Bigr) = 1.
$$

Exercise 2.2.11. Suppose that $(X_n)_{n \ge 1}$ are i.i.d. with $EX_1^+ < \infty$ and $EX_1 = -\infty$. Show that $S_n / n \to -\infty$ almost surely.

Exercise 2.2.12. Suppose that $(X_n)_{n \ge 1}$ are i.i.d. such that $E|X_1| < \infty$ and $EX_1 = 0$. If $(c_n)$ is a bounded sequence of real numbers, prove that
$$
\frac{1}{n} \sum_{i=1}^n c_i X_i \to 0 \ \text{ almost surely}.
$$
Hint: either (a) group the close values of $c$ or (b) examine the proof of the strong law of large numbers.

Exercise 2.2.13. Suppose $(X_n)_{n \ge 1}$ are i.i.d. standard normal. Prove that
$$
P\Bigl( \limsup_{n \to \infty} \frac{|X_n|}{\sqrt{2 \log n}} = 1 \Bigr) = 1.
$$

2.3 0 – 1 laws, convergence of random series

Consider a sequence $(X_i)_{i \ge 1}$ of random variables on the same probability space $(\Omega, \mathcal{F}, P)$ and let $\sigma\bigl( (X_i)_{i \ge 1} \bigr)$ be the $\sigma$-algebra generated by this sequence. An event $A \in \sigma\bigl( (X_i)_{i \ge 1} \bigr)$ is called a tail event if $A \in \sigma\bigl( (X_i)_{i \ge n} \bigr)$ for all $n \ge 1$. In other words,
$$
A \in \mathcal{T} = \bigcap_{n \ge 1} \sigma\bigl( (X_i)_{i \ge n} \bigr),
$$
where $\mathcal{T}$ is the so-called tail $\sigma$-algebra. For example, if $A_i \in \sigma(X_i)$ then
$$
\{ A_i \ \text{i.o.} \} = \bigcap_{n \ge 1} \bigcup_{i \ge n} A_i
$$
is a tail event. It turns out that when $(X_i)_{i \ge 1}$ are independent then all tail events have probability 0 or 1.

Theorem 2.8 (Kolmogorov's 0–1 law). If $(X_i)_{i \ge 1}$ are independent then $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.

Proof. For a finite subset $F = \{ i_1, \ldots, i_n \} \subset \mathbb{N}$, let us denote $X_F = (X_{i_1}, \ldots, X_{i_n})$. The $\sigma$-algebra $\sigma\bigl( (X_i)_{i \ge 1} \bigr)$ is generated by the algebra of sets
$$
\mathcal{A} = \bigl\{ \{ X_F \in B \} : \text{finite } F \subset \mathbb{N}, \ B \in \mathcal{B}(\mathbb{R}^{|F|}) \bigr\}.
$$
This is an algebra, because any set operations on a finite number of such sets can again be expressed in terms of finitely many random variables $X_i$. By the Approximation Lemma, we can approximate any set $A \in \sigma\bigl( (X_i)_{i \ge 1} \bigr)$ by sets in $\mathcal{A}$. Therefore, for any $\varepsilon > 0$, there exists a set $A' \in \mathcal{A}$ such that $P(A \triangle A') \le \varepsilon$ and, therefore,
$$
|P(A) - P(A')| \le \varepsilon, \quad |P(A) - P(A \cap A')| \le \varepsilon.
$$
By definition, $A' \in \sigma(X_1, \ldots, X_n)$ for large enough $n$ and, since $A$ is a tail event, $A \in \sigma\bigl( (X_i)_{i \ge n+1} \bigr)$. The Grouping Lemma implies that $A$ and $A'$ are independent and $P(A \cap A') = P(A) P(A')$. Finally, we get
$$
P(A) \approx P(A \cap A') = P(A) P(A') \approx P(A) P(A),
$$
and letting $\varepsilon \downarrow 0$ proves that $P(A) = P(A)^2$. □

Example 2.3.1. The event $A = \{ \text{the series } \sum_{i \ge 1} X_i \text{ converges} \}$ is a tail event, so it has probability 0 or 1 when the $X_i$'s are independent. □

Example 2.3.2. Consider the series $\sum_{i \ge 1} X_i z^i$ on the complex plane, for $z \in \mathbb{C}$. Its radius of convergence is
$$
r = \liminf_{i \to \infty} |X_i|^{-1/i}.
$$
For any $x \ge 0$, the event $\{ r \le x \}$ is, obviously, a tail event. This implies that $r = \text{const}$ with probability 1 when the $X_i$'s are independent. □

Next we will prove a stronger result under a more restrictive assumption that the random variables $(X_i)_{i \ge 1}$ are not only independent, but also identically distributed. A set $B \subseteq \mathbb{R}^{\mathbb{N}}$ is called symmetric if, for all $n \ge 1$,
$$
(x_1, x_2, \ldots, x_n, x_{n+1}, \ldots) \in B \implies (x_n, x_2, \ldots, x_{n-1}, x_1, x_{n+1}, \ldots) \in B.
$$
In other words, the set $B$ is symmetric under the permutations of finitely many coordinates. We will call an event $A \in \sigma\bigl( (X_i)_{i \ge 1} \bigr)$ symmetric if it is of the form $\{ (X_i)_{i \ge 1} \in B \}$ for some symmetric set $B$ in the cylindrical $\sigma$-algebra $\mathcal{B}^{\infty}$ on $\mathbb{R}^{\mathbb{N}}$. For example, any event in the tail $\sigma$-algebra $\mathcal{T}$ is symmetric.

Theorem 2.9 (Hewitt-Savage 0–1 law). If $(X_i)_{i \ge 1}$ are i.i.d. and $A \in \sigma\bigl( (X_i)_{i \ge 1} \bigr)$ is symmetric then $P(A) = 0$ or $1$.

Proof. By the Approximation Lemma, for any $\varepsilon > 0$, there exist $n \ge 1$ and $A_n \in \sigma(X_1, \ldots, X_n)$ such that $P(A_n \triangle A) \le \varepsilon$. Of course, we can write this set as
$$
A_n = \bigl\{ (X_1, \ldots, X_n) \in B_n \bigr\} \in \sigma(X_1, \ldots, X_n)
$$
for some Borel set $B_n$ on $\mathbb{R}^n$. Let us denote
$$
A_n' = \bigl\{ (X_{n+1}, \ldots, X_{2n}) \in B_n \bigr\} \in \sigma(X_{n+1}, \ldots, X_{2n}).
$$
This set is independent of $A_n$, so $P(A_n \cap A_n') = P(A_n) P(A_n')$. Given a sequence $x = (x_1, x_2, \ldots) \in \mathbb{R}^{\mathbb{N}}$, let us define an operator
$$
\Gamma x = (x_{n+1}, \ldots, x_{2n}, x_1, \ldots, x_n, x_{2n+1}, \ldots)
$$
that switches the first $n$ coordinates with the second $n$ coordinates. Denote $X = (X_i)_{i \ge 1}$ and recall that $A = \{ X \in B \}$ for some symmetric set $B \subseteq \mathbb{R}^{\mathbb{N}}$ that, by definition, satisfies $\Gamma B = \{ \Gamma x : x \in B \} = B$. Now we can write
$$
P(A_n' \triangle A) = P\bigl( \{ (X_{n+1}, \ldots, X_{2n}) \in B_n \} \triangle \{ X \in B \} \bigr)
= P\bigl( \{ (X_{n+1}, \ldots, X_{2n}) \in B_n \} \triangle \{ \Gamma X \in \Gamma B \} \bigr)
$$
$$
= P\bigl( \{ (X_{n+1}, \ldots, X_{2n}) \in B_n \} \triangle \{ \Gamma X \in B \} \bigr)
= P\bigl( \{ (X_1, \ldots, X_n) \in B_n \} \triangle \{ X \in B \} \bigr) = P(A_n \triangle A) \le \varepsilon,
$$
where in the last step we used that the $X_i$'s are i.i.d. (so $\Gamma X$ has the same distribution as $X$). This implies that $P\bigl( (A_n \cap A_n') \triangle A \bigr) \le 2\varepsilon$ and we can conclude that
$$
P(A) \approx P(A_n), \quad P(A) \approx P(A_n \cap A_n') = P(A_n) P(A_n') \approx P(A)^2.
$$
Letting $\varepsilon \downarrow 0$ again implies that $P(A) = P(A)^2$. □

Example 2.3.3. Let $S_n = X_1 + \ldots + X_n$ and let
$$
r = \limsup_{n \to \infty} \frac{S_n - a_n}{b_n}.
$$
The event $\{ r \le x \}$ is symmetric, since changing the order of a finite set of coordinates does not affect $S_n$ for large enough $n$. As a result, $P(r \le x) = 0$ or $1$, which implies that $r = \text{const}$ with probability 1. □

Random series. We already saw above that, by Kolmogorov's 0–1 law, the series $\sum_{i \ge 1} X_i$ for independent $(X_i)_{i \ge 1}$ converges with probability 0 or 1. This means that either $S_n = X_1 + \ldots + X_n$ converges to its limit $S$ with probability one, or with probability one it does not converge. We know that almost sure convergence is stronger than convergence in probability, so in the case when with probability one $S_n$ does not converge, is it still possible that it converges to some random variable in probability? The answer is no, because we will now prove that for random series convergence in probability implies almost sure convergence. We will need the following.

Theorem 2.10 (Kolmogorov's inequality). Suppose that $(X_i)_{i \ge 1}$ are independent and $S_n = X_1 + \ldots + X_n$. If for all $j \le n$,
$$
P(|S_n - S_j| \ge a) \le p < 1, \tag{2.3}
$$
then, for $x > a$,
$$
P\Bigl( \max_{1 \le j \le n} |S_j| \ge x \Bigr) \le \frac{1}{1 - p} P(|S_n| > x - a).
$$

Proof. First of all, let us notice that this inequality is obvious without the maximum, because (2.3) is equivalent to $1 - p \le P(|S_n - S_j| < a)$ and we can write
$$
(1 - p) P\bigl( |S_j| \ge x \bigr) \le P\bigl( |S_n - S_j| < a \bigr) P\bigl( |S_j| \ge x \bigr)
= P\bigl( |S_n - S_j| < a, \ |S_j| \ge x \bigr) \le P(|S_n| > x - a).
$$
The equality in the middle holds because the events $\{ |S_j| \ge x \}$ and $\{ |S_n - S_j| < a \}$ are independent, since the first depends only on $X_1, \ldots, X_j$ and the second only on $X_{j+1}, \ldots, X_n$. The last inequality holds because if $|S_n - S_j| < a$ and $|S_j| \ge x$ then, by the triangle inequality, $|S_n| > x - a$.

To deal with the maximum, instead of looking at an arbitrary partial sum $S_j$, we will look at the first partial sum that crosses the level $x$. We define this first time by
$$
\tau = \min\bigl\{ j \le n : |S_j| \ge x \bigr\}
$$
and let $\tau = n + 1$ if all $|S_j| < x$. Notice that the event $\{ \tau = j \}$ also depends only on $X_1, \ldots, X_j$, so we can again write
$$
(1 - p) P(\tau = j) \le P\bigl( |S_n - S_j| < a \bigr) P(\tau = j)
= P\bigl( |S_n - S_j| < a, \ \tau = j \bigr) \le P(|S_n| > x - a, \ \tau = j).
$$
The last inequality is true because when $\tau = j$ we have $|S_j| \ge x$ and
$$
\bigl\{ |S_n - S_j| < a, \ \tau = j \bigr\} \subseteq \bigl\{ |S_n| > x - a, \ \tau = j \bigr\}.
$$
It remains to add up over $j \le n$ to get
$$
(1 - p) P(\tau \le n) \le P(|S_n| > x - a, \ \tau \le n) \le P(|S_n| > x - a)
$$
and notice that $\{ \tau \le n \} = \{ \max_{j \le n} |S_j| \ge x \}$. □

We will need one more simple lemma.

Lemma 2.5. A sequence $Y_n \to Y$ almost surely if and only if $M_n = \sup_{i \ge n} |Y_i - Y| \to 0$ in probability.

Proof. The "only if" direction is obvious, so we only need to prove the "if" part. Since the sequence $M_n$ is decreasing, it converges to some limit $M_n \downarrow M \ge 0$ everywhere. Since for all $\varepsilon > 0$,
$$
P(M \ge \varepsilon) \le P(M_n \ge \varepsilon) \to 0 \ \text{ as } n \to \infty,
$$
this means that $P(M = 0) = 1$ and $M_n \to 0$ almost surely. Of course, this implies that $Y_n \to Y$ almost surely. □

We are now ready to prove the result mentioned above.

Theorem 2.11. If the series $\sum_{i \ge 1} X_i$ converges in probability then it converges almost surely.

Proof. Suppose that the partial sums $S_n$ converge to some random variable $S$ in probability, i.e., for any $\varepsilon > 0$, for large enough $n \ge n_0(\varepsilon)$ we have $P(|S_n - S| \ge \varepsilon) \le \varepsilon$. If $k \ge j \ge n \ge n_0(\varepsilon)$ then
$$
P(|S_k - S_j| \ge 2\varepsilon) \le P(|S_k - S| \ge \varepsilon) + P(|S_j - S| \ge \varepsilon) \le 2\varepsilon.
$$
Next, we apply Kolmogorov's inequality with $x = 4\varepsilon$, $a = 2\varepsilon$ and $p = 2\varepsilon$ to the partial sums $X_{n+1} + \ldots + X_j$ to get
$$
P\Bigl( \max_{n \le j \le k} |S_j - S_n| \ge 4\varepsilon \Bigr) \le \frac{1}{1 - 2\varepsilon} P(|S_k - S_n| \ge 2\varepsilon) \le \frac{2\varepsilon}{1 - 2\varepsilon} \le 3\varepsilon,
$$
for small $\varepsilon$. The events $\{ \max_{n \le j \le k} |S_j - S_n| \ge 4\varepsilon \}$ are increasing as $k \uparrow \infty$ and, by the continuity of measure,
$$
P\Bigl( \sup_{n \le j} |S_j - S_n| \ge 4\varepsilon \Bigr) \le 3\varepsilon.
$$
Finally, since $P(|S_n - S| \ge \varepsilon) \le \varepsilon$, we get
$$
P\Bigl( \sup_{n \le j} |S_j - S| \ge 5\varepsilon \Bigr) \le 4\varepsilon.
$$
This means that $\sup_{n \le j} |S_j - S| \to 0$ in probability and, by the previous lemma, $S_n \to S$ almost surely. □

Let us give one easy-to-check criterion for convergence of random series. Again, we will need one auxiliary result.

Lemma 2.6. A random sequence $(Y_n)_{n \ge 1}$ converges in probability to some (proper) limit $Y$ if and only if it is Cauchy in probability, which means that
$$
\lim_{n, m \to \infty} P(|Y_n - Y_m| \ge \varepsilon) = 0
$$
for all $\varepsilon > 0$.

Proof. Again, the "only if" direction is obvious and we only need to prove the "if" part. Given $\varepsilon = \ell^{-2}$, we can find $m(\ell)$ large enough such that, for $n, m \ge m(\ell)$,
$$
P\Bigl( |Y_n - Y_m| \ge \frac{1}{\ell^2} \Bigr) \le \frac{1}{\ell^2}. \tag{2.4}
$$
Without loss of generality, we can assume that $m(\ell + 1) \ge m(\ell)$ so that
$$
P\Bigl( |Y_{m(\ell+1)} - Y_{m(\ell)}| \ge \frac{1}{\ell^2} \Bigr) \le \frac{1}{\ell^2}.
$$
Then,
$$
\sum_{\ell \ge 1} P\Bigl( |Y_{m(\ell+1)} - Y_{m(\ell)}| \ge \frac{1}{\ell^2} \Bigr) \le \sum_{\ell \ge 1} \frac{1}{\ell^2} < \infty
$$
and, by the Borel-Cantelli lemma,
$$
P\Bigl( |Y_{m(\ell+1)} - Y_{m(\ell)}| \ge \frac{1}{\ell^2} \ \text{i.o.} \Bigr) = 0.
$$
As a result, for large enough (random) $\ell$ and for $k > \ell$,
$$
|Y_{m(k)} - Y_{m(\ell)}| \le \sum_{i \ge \ell} \frac{1}{i^2} < \frac{1}{\ell - 1}.
$$
This means that, with probability one, $(Y_{m(\ell)})_{\ell \ge 1}$ is a Cauchy sequence and there exists an almost sure limit $Y = \lim_{\ell \to \infty} Y_{m(\ell)}$. Together with (2.4) this implies that $Y_n \to Y$ in probability. □

Theorem 2.12. If $(X_i)_{i \ge 1}$ is a sequence of independent random variables such that $EX_i = 0$ and $\sum_{i \ge 1} EX_i^2 < \infty$ then the series $\sum_{i \ge 1} X_i$ converges almost surely.

Proof. It is enough to prove convergence in probability. For $m < n$,
$$
P(|S_n - S_m| \ge \varepsilon) \le \frac{1}{\varepsilon^2} E (S_n - S_m)^2 = \frac{1}{\varepsilon^2} \sum_{m < i \le n} EX_i^2 \to 0
$$
as $n, m \to \infty$, since the series $\sum_{i \ge 1} EX_i^2$ converges. This means that $(S_n)_{n \ge 1}$ is Cauchy in probability and, by the previous lemma, $S_n$ converges to some limit $S$ in probability. □

Example 2.3.4. Consider the random series $\sum_{i \ge 1} \varepsilon_i / i^a$ where $P(\varepsilon_i = \pm 1) = 1/2$. We have
$$
\sum_{i \ge 1} E\Bigl( \frac{\varepsilon_i}{i^a} \Bigr)^2 = \sum_{i \ge 1} \frac{1}{i^{2a}} < \infty \ \text{ if } a > \frac{1}{2},
$$
so the series converges almost surely for $a > 1/2$. □
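A seeded simulation is consistent with this (an aside, not from the text; the exponent, sample sizes, and seed are arbitrary): once the tail variance $\sum_{i > N} i^{-2a}$ is small, successive partial sums of $\sum \varepsilon_i / i^a$ barely move.

```python
import random

random.seed(2)
a, N = 1.0, 1000

signs = [random.choice((-1, 1)) for _ in range(4 * N)]
partial = []
s = 0.0
for i, e in enumerate(signs, start=1):
    s += e / i**a
    partial.append(s)

# |S_{4N} - S_N| has standard deviation (sum_{N < i <= 4N} i^{-2})^{1/2},
# roughly (3/(4N))^{1/2} ~ 0.027 for N = 1000
fluct = abs(partial[4 * N - 1] - partial[N - 1])
print(fluct)
```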
One famous application of Theorem 2.12 is the following form of the strong law of large numbers.

Theorem 2.13 (Kolmogorov's strong law of large numbers). Let $(X_i)_{i \ge 1}$ be independent random variables such that $EX_i = 0$ and $EX_i^2 < \infty$. Suppose that $b_i \le b_{i+1}$ and $b_i \uparrow \infty$. If $\sum_{i \ge 1} EX_i^2 / b_i^2 < \infty$ then $\lim_{n \to \infty} b_n^{-1} S_n = 0$ almost surely.

This follows immediately from Theorem 2.12 and the following well-known calculus lemma.

Lemma 2.7 (Kronecker's lemma). Suppose that a sequence $(b_i)_{i \ge 1}$ is such that all $b_i > 0$ and $b_i \uparrow \infty$. Given another sequence $(x_i)_{i \ge 1}$, if the series $\sum_{i \ge 1} x_i / b_i$ converges then $\lim_{n \to \infty} b_n^{-1} \sum_{i=1}^n x_i = 0$.

Proof. Because the series converges, $r_n := \sum_{i \ge n+1} x_i / b_i \to 0$ as $n \to \infty$. Notice that we can write $x_n = b_n (r_{n-1} - r_n)$ and, therefore,
$$
\sum_{i=1}^n x_i = \sum_{i=1}^n b_i (r_{i-1} - r_i) = \sum_{i=1}^{n-1} (b_{i+1} - b_i) r_i + b_1 r_0 - b_n r_n.
$$
Since $r_i \to 0$, given $\varepsilon > 0$, we can find $n_0$ such that for $i \ge n_0$ we have $|r_i| \le \varepsilon$ and $\bigl| \sum_{i=n_0+1}^{n-1} (b_{i+1} - b_i) r_i \bigr| \le \varepsilon b_n$. Therefore,
$$
b_n^{-1} \Bigl| \sum_{i=1}^n x_i \Bigr| \le b_n^{-1} \Bigl| \sum_{i=1}^{n_0} (b_{i+1} - b_i) r_i \Bigr| + \varepsilon + b_n^{-1} b_1 |r_0| + |r_n|.
$$
Letting $n \to \infty$ and then $\varepsilon \downarrow 0$ finishes the proof. □
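Kronecker's lemma is easy to test on a concrete pair of sequences (an aside, not from the text): with $x_i = (-1)^i$ and $b_i = i$ the series $\sum (-1)^i / i$ converges (alternating series), so the lemma predicts $\frac{1}{n} \sum_{i=1}^n (-1)^i \to 0$; here the average is exactly $0$ or $-1/n$.

```python
n = 100000

# x_i = (-1)^i, b_i = i: sum x_i / b_i converges, so Kronecker's
# lemma gives (1/n) * sum_{i <= n} x_i -> 0.
partial = sum((-1) ** i for i in range(1, n + 1))
avg = partial / n
print(avg)
```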
A couple of examples of application of Kolmogorov’s strong law of large num-
bers will be given in the exercises below.
Exercise 2.3.1. Let $\{ S_n : n \ge 0 \}$ be a simple random walk which starts at zero, $S_0 = 0$, and at each step moves to the right with probability $p$ and to the left with probability $1 - p$. Show that the event $\{ S_n = 0 \ \text{i.o.} \}$ has probability 0 or 1. (Hint: Hewitt-Savage 0 – 1 law.)

Exercise 2.3.2. In the setting of the previous problem, show: (a) if $p \ne 1/2$ then $P(S_n = 0 \ \text{i.o.}) = 0$; (b) if $p = 1/2$ then $P(S_n = 0 \ \text{i.o.}) = 1$. Hint: use the fact that the events
$$
\Bigl\{ \liminf_{n \to \infty} S_n \le -1/2 \Bigr\}, \quad \Bigl\{ \limsup_{n \to \infty} S_n \ge 1/2 \Bigr\}
$$
are symmetric.

Exercise 2.3.3. Given any sequence of random variables $(Y_i)_{i \ge 1}$ and the event $A = \{ \lim_{i \to \infty} Y_i \ \text{exists} \}$, show that for any $\varepsilon > 0$ there exist integers $k, n, m \ge 1$ such that the event
$$
A_{k,n,m} = \bigcap_{n \le i < j \le m} \Bigl\{ |Y_j - Y_i| \le \frac{1}{k} \Bigr\}
$$
approximates $A$ in the sense that $P(A \triangle A_{k,n,m}) \le \varepsilon$. Hint: use continuity of measure and Cauchy's convergence test.

Exercise 2.3.4. Suppose that $(X_n)_{n \ge 1}$ are i.i.d. with $EX_1 = 0$ and $EX_1^2 = 1$. Prove that for $\delta > 0$,
$$
\frac{1}{n^{1/2} (\log n)^{1/2 + \delta}} \sum_{i=1}^n X_i \to 0
$$
almost surely. Hint: use Kolmogorov's strong law of large numbers.

Exercise 2.3.5. Let $(X_n)$ be i.i.d. random variables with continuous distribution $F$. We say that $X_n$ is a record value if $X_n > X_i$ for $i < n$. Let $I_n$ be the indicator of the event that $X_n$ is a record value.
(a) Show that the random variables $(I_n)_{n \ge 1}$ are independent and $P(I_n = 1) = 1/n$. Hint: if $R_n \in \{1, \ldots, n\}$ is the rank of $X_n$ among the first $n$ random variables $(X_i)_{i \le n}$, prove that the $(R_n)$ are independent.
(b) If $S_n = I_1 + \ldots + I_n$ is the number of records up to time $n$, prove that $S_n / \log n \to 1$ almost surely. Hint: use Kolmogorov's strong law of large numbers.

Exercise 2.3.6. Suppose that two sequences of random variables $X_n : \Omega_1 \to \mathbb{R}$ and $Y_n : \Omega_2 \to \mathbb{R}$ for $n \ge 1$ on two different probability spaces have the same distributions in the sense that all their finite dimensional distributions are the same, $\mathcal{L}\bigl( (X_i)_{i \le n} \bigr) = \mathcal{L}\bigl( (Y_i)_{i \le n} \bigr)$ for all $n \ge 1$. If $X_n$ converges almost surely to some random variable $X$ on $\Omega_1$ as $n \to \infty$, prove that $Y_n$ also converges almost surely on $\Omega_2$.

2.4 Stopping times, Wald’s identity, and strong Markov property

In this section, we will have our first encounter with two concepts, stopping times and the Markov property, in the setting of sums of independent random variables. Later, stopping times will play an important role in the study of martingales, and the Markov property will appear again in the setting of Brownian motion.
Consider a sequence $(X_i)_{i \ge 1}$ of independent random variables and an integer-valued random variable $\tau \in \{1, 2, \ldots\}$. We say that $\tau$ is independent of the future if $\{\tau \le n\}$ is independent of $\sigma((X_i)_{i \ge n+1})$. Suppose that $\tau$ is independent of the future and $E|X_i| < \infty$ for all $i \ge 1$. We can formally write

$$ES_\tau = \sum_{k \ge 1} E S_\tau I(\tau = k) = \sum_{k \ge 1} E S_k I(\tau = k) = \sum_{k \ge 1} \sum_{n \le k} E X_n I(\tau = k) \overset{(*)}{=} \sum_{n \ge 1} \sum_{k \ge n} E X_n I(\tau = k) = \sum_{n \ge 1} E X_n I(\tau \ge n).$$

In $(*)$ we can interchange the order of summation if, for example, the double sequence is absolutely summable, by the Fubini–Tonelli theorem. Since $\tau$ is independent of the future, the event $\{\tau \ge n\} = \{\tau \le n-1\}^c$ is independent of $\sigma(X_n)$ and we get
$$ES_\tau = \sum_{n \ge 1} E X_n \, P(\tau \ge n). \tag{2.5}$$
This implies the following.

Theorem 2.14 (Wald's identity). If $(X_i)_{i \ge 1}$ are i.i.d., $E|X_1| < \infty$ and $E\tau < \infty$, then $ES_\tau = EX_1 \, E\tau$.

Proof. By (2.5) we have
$$ES_\tau = \sum_{n \ge 1} E X_n \, P(\tau \ge n) = EX_1 \sum_{n \ge 1} P(\tau \ge n) = EX_1 \, E\tau.$$

The reason we can interchange the order of summation in $(*)$ is that under our assumptions the double sequence is absolutely summable, since
$$\sum_{n \ge 1} \sum_{k \ge n} E |X_n| I(\tau = k) = \sum_{n \ge 1} E |X_n| I(\tau \ge n) = E|X_1| \, E\tau < \infty,$$
so we can apply the Fubini–Tonelli theorem. □
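Wald's identity is easy to check by simulation. The sketch below is illustrative only: the step distribution Uniform(0,1) and the threshold 1 are our own choices, with stopping time $\tau = \min\{k : S_k > 1\}$, for which $E\tau = e$ is a classical fact.

```python
import random

random.seed(1)

def run_once():
    # Sum i.i.d. Uniform(0,1) steps until the walk first exceeds 1.
    s, tau = 0.0, 0
    while s <= 1.0:
        s += random.random()
        tau += 1
    return s, tau

trials = 100_000
results = [run_once() for _ in range(trials)]
ES_tau = sum(s for s, _ in results) / trials
E_tau = sum(t for _, t in results) / trials
# Wald: E S_tau = E X_1 * E tau, with E X_1 = 1/2 here
print(ES_tau, 0.5 * E_tau)
```

The two printed numbers agree up to Monte Carlo error, and $E\tau$ itself comes out close to $e \approx 2.718$.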

We say that $\tau$ is a stopping time if $\{\tau \le n\} \in \sigma(X_1, \ldots, X_n)$ for all $n$. Clearly, a stopping time is independent of the future. One example of a stopping time is $\tau = \min\{k \ge 1 : S_k \ge 1\}$, since
$$\{\tau \le n\} = \bigcup_{k \le n} \{S_k \ge 1\} \in \sigma(X_1, \ldots, X_n).$$

Given a stopping time $\tau$, we would like to describe all events that depend on $\tau$ and the sequence $X_1, \ldots, X_\tau$ up to this stopping time. A formal definition of the $\sigma$-algebra $\sigma_\tau$ generated by the sequence up to a stopping time $\tau$ is the following:
$$\sigma_\tau = \big\{ A : A \cap \{\tau \le n\} \in \sigma(X_1, \ldots, X_n) \text{ for all } n \ge 1 \big\}.$$
When $\tau$ is a stopping time, one can easily check that this is a $\sigma$-algebra, and the meaning of the definition is that, if we know that $\tau \le n$, then the corresponding part of the event $A$ is expressed only in terms of $X_1, \ldots, X_n$.

Theorem 2.15 (Markov property). Suppose that $(X_i)_{i \ge 1}$ are i.i.d. and $\tau$ is a stopping time. Then the sequence $T_\tau = (X_{\tau+1}, X_{\tau+2}, \ldots)$ is independent of the $\sigma$-algebra $\sigma_\tau$ and
$$T_\tau \overset{d}{=} (X_1, X_2, \ldots),$$
where $\overset{d}{=}$ means equality in distribution.

In words, this means that the sequence $T_\tau = (X_{\tau+1}, X_{\tau+2}, \ldots)$ after the stopping time is an independent copy of the entire sequence, also independent of everything that happens before the stopping time.

Proof. Consider an event $A \in \sigma_\tau$ and $B \in \mathcal{B}^\infty$, the cylindrical $\sigma$-algebra on $\mathbb{R}^{\mathbb{N}}$. We can write
$$P\big(A \cap \{T_\tau \in B\}\big) = \sum_{n \ge 1} P\big(A \cap \{\tau = n\} \cap \{T_\tau \in B\}\big) = \sum_{n \ge 1} P\big(A \cap \{\tau = n\} \cap \{T_n \in B\}\big),$$
where $T_n = (X_{n+1}, X_{n+2}, \ldots)$. By the definition of the $\sigma$-algebra $\sigma_\tau$,
$$A \cap \{\tau = n\} = \big(A \cap \{\tau \le n\}\big) \setminus \big(A \cap \{\tau \le n-1\}\big) \in \sigma(X_1, \ldots, X_n).$$
On the other hand, $\{T_n \in B\} \in \sigma(X_{n+1}, \ldots)$ and, therefore, is independent of $A \cap \{\tau = n\}$. Using this and the fact that $(X_i)_{i \ge 1}$ are i.i.d.,
$$P\big(A \cap \{T_\tau \in B\}\big) = \sum_{n \ge 1} P\big(A \cap \{\tau = n\}\big) P(T_n \in B) = \sum_{n \ge 1} P\big(A \cap \{\tau = n\}\big) P(T_1 \in B) = P(A) \, P(T_1 \in B),$$
and this finishes the proof. □

Let us give one interesting application of the Markov property and Wald's identity that will yield another proof of the Strong Law of Large Numbers.

Theorem 2.16. Suppose that $(X_i)_{i \ge 1}$ are i.i.d. such that $EX_1 > 0$. If $Z = \inf_{n \ge 1} S_n$ then $P(Z > -\infty) = 1$.

This means that the partial sums cannot drift down to $-\infty$ if the mean $EX_1 > 0$. Of course, this is obvious by the strong law of large numbers, but we want to prove it independently, since this will give another proof of the SLLN.
Proof. Let us define
$$\tau_1 = \min\big\{k \ge 1 : S_k \ge 1\big\}, \quad Z_1 = \min_{k \le \tau_1} S_k, \quad S_k^{(2)} = S_{\tau_1 + k} - S_{\tau_1},$$
$$\tau_2 = \min\big\{k \ge 1 : S_k^{(2)} \ge 1\big\}, \quad Z_2 = \min_{k \le \tau_2} S_k^{(2)}, \quad S_k^{(3)} = S_{\tau_2 + k}^{(2)} - S_{\tau_2}^{(2)},$$
and recursively,
$$\tau_n = \min\big\{k \ge 1 : S_k^{(n)} \ge 1\big\}, \quad Z_n = \min_{k \le \tau_n} S_k^{(n)}, \quad S_k^{(n+1)} = S_{\tau_n + k}^{(n)} - S_{\tau_n}^{(n)}.$$

We mentioned above that $\tau_1$ is a stopping time. It is easy to check that $Z_1$ is $\sigma_{\tau_1}$-measurable and, by the Markov property, it is independent of the sequence $T_{\tau_1} = (X_{\tau_1+1}, X_{\tau_1+2}, \ldots)$, which has the same distribution as the original sequence. Since $\tau_2$ and $Z_2$ are defined in exactly the same way as $\tau_1$ and $Z_1$, only in terms of this new sequence $T_{\tau_1}$, $Z_2$ is an independent copy of $Z_1$. Now, it should be obvious that $(Z_n)_{n \ge 1}$ are i.i.d. random variables. Clearly,
$$Z = \inf_{k \ge 1} S_k = \inf\big\{ Z_1, \; S_{\tau_1} + Z_2, \; S_{\tau_1 + \tau_2} + Z_3, \; \ldots \big\},$$
and, since, by construction, $S_{\tau_1 + \cdots + \tau_{k-1}} \ge k - 1$,
$$\{Z < -N\} \subseteq \bigcup_{k \ge 1} \big\{ S_{\tau_1 + \cdots + \tau_{k-1}} + Z_k \le -N \big\} \subseteq \bigcup_{k \ge 1} \big\{ k - 1 + Z_k \le -N \big\}.$$

Therefore,
$$P(Z < -N) \le \sum_{k \ge 1} P(k - 1 + Z_k \le -N) = \sum_{k \ge 1} P(Z_k \le -N - k + 1) = \sum_{k \ge 1} P(Z_1 \le -N - k + 1) = \sum_{j \ge N} P(Z_1 \le -j) \le \sum_{j \ge N} P(|Z_1| \ge j) \to 0$$
as $N \to \infty$, if we can show that $E|Z_1| < \infty$, since
$$\sum_{j \ge 1} P(|Z_1| \ge j) \le E|Z_1| < \infty.$$

By Wald's identity,
$$E|Z_1| \le E \sum_{i \le \tau_1} |X_i| = E|X_1| \, E\tau_1 < \infty$$
if we can show that $E\tau_1 < \infty$. This is left as an exercise below. We proved that $P(Z < -N) \to 0$ as $N \to \infty$, which, of course, implies that $P(Z > -\infty) = 1$. □

This result can be used to give another proof of the Strong Law of Large Numbers.

Theorem 2.17. If $(X_i)_{i \ge 1}$ are i.i.d. and $EX_1 = 0$ then $S_n / n \to 0$ almost surely.

Proof. Given $\varepsilon > 0$ we define $X_i^\varepsilon = X_i + \varepsilon$ so that $EX_1^\varepsilon = \varepsilon > 0$. By the above result, $\inf_{n \ge 1} (S_n + n\varepsilon) > -\infty$ with probability one. This means that for all $n \ge 1$, $S_n + n\varepsilon \ge M > -\infty$ for some random variable $M$. Dividing both sides by $n$ and letting $n \to \infty$ we get
$$\liminf_{n \to \infty} \frac{S_n}{n} \ge -\varepsilon$$
with probability one. We can then let $\varepsilon \downarrow 0$ over some sequence. Similarly, we prove that
$$\limsup_{n \to \infty} \frac{S_n}{n} \le 0$$
with probability one, which finishes the proof. □

Exercise 2.4.1. Let $(X_i)_{i \ge 1}$ be i.i.d. and $EX_1 > 0$. Given $a > 0$, show that $E\tau < \infty$ for $\tau = \inf\{k \ge 1 : S_k > a\}$. (Hint: truncate the $X_i$'s and $\tau$ and use Wald's identity.)

Exercise 2.4.2. Let $S_0 = 0$, $S_n = \sum_{i=1}^n X_i$ be a random walk with i.i.d. $(X_i)$ such that $P(X_i = 1) = p$ and $P(X_i = -1) = q = 1 - p$ for $p > 1/2$. Consider an integer $b \ge 1$ and let $\tau = \min\{n \ge 1 : S_n = b\}$. Show that for $0 < s \le 1$,
$$E s^\tau = \Big( \frac{1 - (1 - 4pqs^2)^{1/2}}{2qs} \Big)^b$$
and compute $E\tau$.

Exercise 2.4.3. Suppose that we play a game with i.i.d. outcomes $(X_n)_{n \ge 1}$ such that $E|X_1| < \infty$. If we play $n$ rounds, we gain the largest of the first $n$ outcomes. In addition, to play each round we have to pay an amount $c > 0$, so after $n$ rounds our total profit (or loss) is
$$Y_n = \max_{1 \le m \le n} X_m - cn.$$
In this problem we will find the best strategy to play the game, in some sense.
1. Given $a \in \mathbb{R}$, let $p = P(X_1 > a) > 0$ and consider the stopping time $T = \inf\{n : X_n > a\}$. Compute $EY_T$. (Hint: sum over the sets $\{T = n\}$.)
2. Consider $\alpha$ such that $E(X_1 - \alpha)^+ = c$. For $a = \alpha$ show that $EY_T = \alpha$.
3. Show that $Y_n \le \alpha + \sum_{m=1}^n \big( (X_m - \alpha)^+ - c \big)$ (for any $\alpha$, actually).
4. Use Wald's identity to conclude that for any stopping time $\tau$ such that $E\tau < \infty$ we have $EY_\tau \le \alpha$.
This means that stopping at time $T$ results in the best expected profit $\alpha$ among all stopping times $\tau$ with $E\tau < \infty$.

2.5 Azuma inequality

So far we have considered sums of independent random variables. As a digression, we will now give one example of a concentration inequality for general functions $f = f(X_1, \ldots, X_n)$ of $n$ independent random variables. We do not assume that these random variables are identically distributed, but we will assume the following stability condition on the function $f$:
$$\big| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \big| \le a_i \tag{2.6}$$
for all $i \le n$, for some constants $a_1, \ldots, a_n$. This means that changing the $i$th coordinate of $f$ while keeping all the other coordinates fixed can change its value by at most $a_i$. In particular, the function $f$ is bounded.
Theorem 2.18 (Azuma's inequality). If (2.6) holds then
$$P\big( f - E f \ge t \big) \le \exp\Big( -\frac{t^2}{2 \sum_{i=1}^n a_i^2} \Big) \tag{2.7}$$
for any $t \ge 0$.
Proof. For $i = 1, \ldots, n$, let $E_i$ denote the expectation in $X_{i+1}, \ldots, X_n$ with the random variables $X_1, \ldots, X_i$ fixed. One can think of $(X_1, \ldots, X_n)$ as defined on a product space with the product measure, and $E_i$ denotes the integration over the last $n - i$ coordinates. Let us denote $Y_i = E_i f - E_{i-1} f$ and note that $E_n f = f$ and $E_0 f = E f$. Then we can write
$$f - E f = \sum_{i=1}^n Y_i$$
(this is called the martingale-difference representation) and, as before, for $\lambda \ge 0$,
$$P\big( f - E f \ge t \big) = P\Big( \sum_{i=1}^n Y_i \ge t \Big) \le e^{-\lambda t} \, E e^{\lambda Y_1 + \cdots + \lambda Y_n}.$$
Notice that $E_{i-1} Y_i = E_{i-1} f - E_{i-1} f = 0$. Also, the stability condition implies that $|Y_i| \le a_i$. Since $Y_1, \ldots, Y_{n-1}$ do not depend on $X_n$ (only $Y_n$ does), if we average in $X_n$ first, we can write
$$E e^{\lambda Y_1 + \cdots + \lambda Y_n} = E\big( e^{\lambda Y_1 + \cdots + \lambda Y_{n-1}} \, E_{n-1} e^{\lambda Y_n} \big).$$
If we apply Lemma 2.2 to $Y_n / a_n$ viewed as a function of $X_n$, we get
$$E_{n-1} e^{\lambda Y_n} = E_{n-1} e^{\lambda a_n (Y_n / a_n)} \le e^{(\lambda a_n)^2 / 2}$$
and, therefore,
$$E e^{\lambda Y_1 + \cdots + \lambda Y_n} \le e^{(\lambda a_n)^2 / 2} \, E e^{\lambda Y_1 + \cdots + \lambda Y_{n-1}}.$$
By induction on $n$, we get
$$E e^{\lambda Y_1 + \cdots + \lambda Y_n} \le e^{\lambda^2 \sum_{i=1}^n a_i^2 / 2}$$
and
$$P\Big( \sum_{i=1}^n Y_i \ge t \Big) \le \exp\Big( -\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^n a_i^2 \Big).$$
Optimizing over $\lambda \ge 0$ finishes the proof. □
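Inequality (2.7) can be sanity-checked by simulation. In the sketch below (illustrative only; the choice $f = $ sum of $n$ Uniform(0,1) variables, as well as $n$, $t$, and the seed, are our own), the stability condition holds with $a_i = 1$, and the empirical tail frequency indeed stays below the Azuma bound:

```python
import math
import random

random.seed(5)
n, trials, t = 50, 20_000, 4.0

# f(x_1,...,x_n) = x_1 + ... + x_n with x_i in [0,1]:
# changing one coordinate moves f by at most a_i = 1.
Ef = n * 0.5
exceed = sum(
    sum(random.random() for _ in range(n)) - Ef >= t
    for _ in range(trials)
)
empirical = exceed / trials
bound = math.exp(-t**2 / (2 * n))   # right-hand side of (2.7) with a_i = 1
print(empirical, bound)
```

The bound is loose for this particular $f$ (the empirical tail is a few percent while the bound is $e^{-0.16} \approx 0.85$), which is typical: Azuma's inequality trades precision for generality.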

Notice that in the above proof we did not use the fact that the $X_i$'s are random variables. They could be random vectors or arbitrary random elements taking values in some measurable spaces. We only used the assumption that they are independent. Keeping this in mind, let us give one example of an application of Azuma's inequality.

Example 2.5.1 (Chromatic number of the Erdős–Rényi graph). Consider the Erdős–Rényi random graph $G(n, p)$ on $n$ vertices, where each edge is present with probability $p$ independently of the other edges. Let $\chi(G(n,p))$ be the chromatic number of this graph, which is the smallest number of colours needed to colour the vertices so that no two adjacent vertices share the same colour. Let us denote the set of vertices by $\{v_1, \ldots, v_n\}$ and let
$$e_{i,j} = I\big( \text{the edge between } v_i \text{ and } v_j \text{ is present} \big).$$
Then all $e_{i,j}$ are independent Bernoulli $B(p)$ random variables, by the definition of the Erdős–Rényi random graph. For $i = 2, \ldots, n$, let us denote by $X_i = (e_{1,i}, e_{2,i}, \ldots, e_{i-1,i})$ the vector of indicators of edges between the vertex $v_i$ and the vertices $v_1, \ldots, v_{i-1}$. The vectors $X_2, \ldots, X_n$ are independent because they consist of indicators of disjoint sets of edges, and the chromatic number is a function
$$\chi(G(n,p)) = f(X_2, \ldots, X_n),$$
since these vectors include indicators of all edges in the graph. To apply Azuma's inequality, we need to determine the stability constants $a_2, \ldots, a_n$. Notice that if we replace $X_i$ with another value $X_i' = (e_{1,i}', e_{2,i}', \ldots, e_{i-1,i}')$, this means that we modify some edges between $v_i$ and $v_1, \ldots, v_{i-1}$. The chromatic number cannot increase by more than 1, because we can always assign a new colour to the vertex $v_i$, so
$$f(X_2, \ldots, X_i', \ldots, X_n) - f(X_2, \ldots, X_i, \ldots, X_n) \le 1.$$
By the same logic,
$$f(X_2, \ldots, X_i, \ldots, X_n) - f(X_2, \ldots, X_i', \ldots, X_n) \le 1,$$
and this shows that the stability condition (2.6) holds with $a_i = 1$. Therefore, Azuma's inequality implies (when applied to both $f$ and $-f$)
$$P\Big( \big| \chi(G(n,p)) - E\chi(G(n,p)) \big| \ge t \Big) \le 2 e^{-\frac{t^2}{2(n-1)}}.$$
For example, if we take $t = \sqrt{2n \log n}$, we get
$$P\Big( \big| \chi(G(n,p)) - E\chi(G(n,p)) \big| \le \sqrt{2n \log n} \Big) \ge 1 - \frac{2}{n},$$
so, with high probability, the chromatic number will be within $\sqrt{2n \log n}$ of its expected value $E\chi(G(n,p))$. It is known (but non-trivial) that this expected value satisfies
$$E\chi(G(n,p)) \sim \frac{n}{2 \log n} \log \frac{1}{1-p},$$
so the deviation $\sqrt{2n \log n}$ is of a smaller order compared to the expectation. This shows that the chromatic number is typically of the same order as its expectation, which gives us an example of the law of large numbers for a very non-trivial functional of independent random variables. □

Example 2.5.2 (Balls in boxes). Suppose that we throw $n$ balls into $m$ boxes at random, so that the probability that a ball lands in any given box is $1/m$, independently of each other. Let $N$ be the number of non-empty boxes. If $X_i \in \{1, \ldots, m\}$ is the number of the box in which the $i$th ball lands then $X_1, \ldots, X_n$ are independent random variables and $N = \mathrm{card}\{X_1, \ldots, X_n\}$ is the number of distinct boxes hit.
It is easy to compute, writing $N = \sum_{i \le m} I(\text{box } \#i \text{ is not empty})$, that
$$EN = m \Big( 1 - \Big( 1 - \frac{1}{m} \Big)^n \Big). \tag{2.8}$$
When $n$ is large and $m = \alpha n$ for some fixed $\alpha > 0$ then
$$EN \sim n \alpha \big( 1 - e^{-1/\alpha} \big).$$
Changing the value of one box $X_i$ changes $N$ by at most 1, so the stability constants are all equal to $a_i = 1$. As in the previous example,
$$P\big( |N - EN| \ge t \big) \le 2 e^{-\frac{t^2}{2n}}, \tag{2.9}$$
by Azuma's inequality, and
$$P\big( |N - EN| \le \sqrt{2n \log n} \big) \ge 1 - \frac{2}{n}.$$
So the typical number of non-empty boxes in this case is close to its expectation, relatively speaking. □
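A quick simulation (an illustrative sketch; the parameters $n = 1000$, $m = 500$, the number of trials, and the seed are our own choices) confirms both formula (2.8) and the concentration of $N$:

```python
import random

random.seed(2)
n, m, trials = 1000, 500, 2000
exact_EN = m * (1 - (1 - 1/m) ** n)   # formula (2.8)

counts = []
for _ in range(trials):
    # N = number of distinct boxes hit by n uniformly thrown balls
    counts.append(len({random.randrange(m) for _ in range(n)}))

EN_hat = sum(counts) / trials
max_dev = max(abs(c - exact_EN) for c in counts)
print(EN_hat, exact_EN, max_dev)
```

The empirical mean lands within a fraction of a box of the exact value, and even the worst deviation over 2000 trials is far below the crude Azuma scale $\sqrt{2n \log n} \approx 130$.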

Example 2.5.3 (Max-Cut of a sparse random graph). This example is quite similar to the previous one, only balls and boxes have a different meaning and, instead of the number of non-empty boxes, we consider a much more complicated function.
Let $V = \{v_1, \ldots, v_n\}$ be the set of vertices of a graph and $E$ be its set of edges. The Max-Cut problem is to divide the vertices into two groups in a way that maximizes the number of edges between the two groups. This is a very important problem in computer science that has many applications, for example, to the layout of electronic circuitry, and as a reformulation of various combinatorial optimization problems. For example, imagine that we have a group of $n$ people and edges represent people that dislike each other. Then our goal is to separate them into two groups in such a way that as many 'enemies' as possible are in opposite groups.
Here we will consider a specific model of a random graph, and we will show that the maximal number of edges between the two groups concentrates around its expectation. We will select edges randomly, but using a different procedure than in the Erdős–Rényi graph. We will take
$$m = \delta n$$
possible edges for some fixed $\delta > 0$ and, as usual, we think of $n$ as being large. Then we place each of these $m$ edges at random among the set of $\binom{n}{2}$ possible edges, independently of each other. Each pair of vertices (a possible location to place an edge) represents a box, and edges represent balls. The random variables $X_1, \ldots, X_m$ that describe between which vertices each edge is placed take $\binom{n}{2}$ possible values and are all independent. This model produces what is called a sparse graph, because the number of edges $m = \delta n$ is small relative to $\binom{n}{2}$ and there is a fixed average number of edges per vertex, $\delta$. As in the above example, let
$$e_{i,j} = I\big( \text{the edge between } v_i \text{ and } v_j \text{ is present} \big).$$

It is possible that more than one edge is placed between two vertices, in which case we keep one of them.
The function that we will consider on the above sparse random graph is called Max-Cut. It is defined as follows. If we want to split all vertices into two groups, one way to encode this is to assign each vertex one of the two labels $\{-1, +1\}$. Then all vertices with the same label belong to the same group. In other words, each vector
$$\sigma = (\sigma_1, \ldots, \sigma_n) \in \{-1, +1\}^n$$
describes a possible cut of the graph into two groups. Let $E(\sigma)$ be the number of present edges connecting vertices in opposite groups,
$$E(\sigma) = \mathrm{card}\big\{ i < j : e_{i,j} = 1, \; \sigma_i \ne \sigma_j \big\}. \tag{2.10}$$
Another way to represent $E(\sigma)$ is as follows. Notice that $\sigma_i \ne \sigma_j$ only if $\sigma_i \sigma_j = -1$; otherwise, $\sigma_i \sigma_j = 1$. Therefore, $I(\sigma_i \ne \sigma_j) = (1 - \sigma_i \sigma_j)/2$ and
$$E(\sigma) = \frac{1}{2} \sum_{i < j} e_{i,j} \big( 1 - \sigma_i \sigma_j \big). \tag{2.11}$$

Then the value $M$ of the Max-Cut corresponds to a way to cut the graph so that the number of edges between the two groups is as large as possible,
$$M = \max_{\sigma} E(\sigma). \tag{2.12}$$
Recall that $M$ is a random variable that depends on the positions $X_1, \ldots, X_m$ of our $m = \delta n$ possible edges.
If we change the placement $X_i$ of one edge, this can change the maximum $M$ by at most 1, because, for each possible cut (configuration $\sigma$), moving one edge might increase or decrease the number of edges between the two groups by at most 1. This means that the stability condition in Azuma's inequality holds with $a_i = 1$ and, therefore,
$$P\big( |M - EM| \ge t \big) \le 2 e^{-\frac{t^2}{2m}} = 2 e^{-\frac{t^2}{2\delta n}}. \tag{2.13}$$
If we take $t = \sqrt{2 \delta n \log n}$, we get
$$P\big( |M - EM| \le \sqrt{2 \delta n \log n} \big) \ge 1 - \frac{2}{n}.$$
Is the deviation $\sqrt{2 \delta n \log n}$ of smaller order than $EM$? It turns out that there exists a cut such that at least half of all edges are between vertices that belong to opposite groups. To see this, choose the labels $\sigma_1, \ldots, \sigma_n$ to be i.i.d. Rademacher, which means that we assign each vertex to one of the two groups $\{-1, +1\}$ at random. Then
$$M = \max_{\sigma} E(\sigma) \ge E_\sigma E(\sigma) = \frac{1}{2} \sum_{i < j} e_{i,j} \big( 1 - E \sigma_i \sigma_j \big) = \frac{1}{2} \sum_{i < j} e_{i,j},$$
where $E_\sigma$ denotes the expectation over the random labels only, with the edges kept fixed, and we used that $E\sigma_i \sigma_j = 0$ for $i \ne j$. This is exactly one half of all the present edges. Taking expectations on both sides, we can write
$$EM \ge \frac{1}{2} \sum_{i < j} P\big( \text{there is an edge between } v_i \text{ and } v_j \big) = \frac{1}{2} \binom{n}{2} \Big( 1 - \Big( 1 - \frac{1}{\binom{n}{2}} \Big)^{\delta n} \Big),$$
where on the right-hand side we have the analogue of (2.8). Using the inequality $1 - x \le e^{-x}$, we can write
$$\Big( 1 - \frac{1}{\binom{n}{2}} \Big)^{\delta n} \le e^{-\delta n / \binom{n}{2}} \le e^{-2\delta/n} = 1 - \frac{2\delta}{n} + \frac{2\delta^2}{n^2} + O\Big( \frac{1}{n^3} \Big),$$
where in the last step we used Taylor's theorem for $e^x$ at zero. Plugging this into the above inequality, we get
$$EM \ge \frac{1}{2} \binom{n}{2} \Big( \frac{2\delta}{n} - \frac{2\delta^2}{n^2} + O\Big( \frac{1}{n^3} \Big) \Big) = \frac{\delta n}{2} - \frac{\delta + \delta^2}{2} + O\Big( \frac{1}{n} \Big).$$
This shows that (up to smaller order terms) $EM$ is at least $\delta n / 2$. This means that the deviation $\sqrt{2 \delta n \log n}$ in Azuma's inequality above is of a smaller order, so the typical Max-Cut value $M$ is relatively close to its expectation.
By the way, it is a major open problem to compute the limit $\lim_{n \to \infty} EM / n$ in terms of $\delta$. □
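The half-of-the-edges guarantee above can be verified directly on small instances. The brute-force sketch below (illustrative only; exhaustive search over all $2^n$ cuts is feasible only for tiny $n$, and the values $n = 12$, $\delta = 2$ are our own choices) builds the sparse graph with $m = \delta n$ randomly placed edges, where duplicates collapse to a single edge as in the text, and computes $M$:

```python
import itertools
import random

random.seed(3)
n, delta = 12, 2
pairs = list(itertools.combinations(range(n), 2))
# place m = delta * n edges uniformly at random; duplicate placements collapse
edges = {random.choice(pairs) for _ in range(delta * n)}

def cut_size(sigma):
    # E(sigma): present edges whose endpoints receive opposite labels
    return sum(1 for (i, j) in edges if sigma[i] != sigma[j])

M = max(cut_size(sigma) for sigma in itertools.product((-1, 1), repeat=n))
print(M, len(edges))
```

On every run, $M$ is at least half the number of present edges, matching the Rademacher averaging argument.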
Example 2.5.4 (Hamming cube). For any $x, y \in \{0,1\}^n$, the number of coordinates of $x$ and $y$ that differ,
$$\rho(x, y) = \sum_{i=1}^n I(x_i \ne y_i), \tag{2.14}$$
is called the Hamming distance between $x$ and $y$. The function $\rho$ is called the Hamming metric on $\{0,1\}^n$. We will now use Azuma's inequality to show that, given a subset $A \subseteq \{0,1\}^n$ that contains a positive proportion of all points,
$$\mathrm{card}(A) > \varepsilon 2^n \tag{2.15}$$
for some $\varepsilon > 0$, most points in $\{0,1\}^n$ are relatively close to the set $A$, namely, within Hamming distance of order $\sqrt{n}$. In other words, for most points in $\{0,1\}^n$, only of order $\sqrt{n}$ coordinates are different from one of the points in $A$. Given $t > 0$, let us denote by
$$B(A, t) = \big\{ x \in \{0,1\}^n : \rho(x, y) \le t \text{ for some } y \in A \big\} \tag{2.16}$$
the set of all points within Hamming distance $t$ from $A$. For $\varepsilon, \delta \in (0,1)$, let us denote
$$t_{\varepsilon, \delta} = \sqrt{2 \log \frac{1}{\varepsilon}} + \sqrt{2 \log \frac{1}{\delta}}.$$
Then the following lemma shows that the $t_{\varepsilon,\delta} \sqrt{n}$-neighbourhood of $A$ contains at least a $(1 - \delta)$ proportion of all points in $\{0,1\}^n$.

Lemma 2.8. For $\varepsilon, \delta \in (0,1)$, if $\mathrm{card}(A) > \varepsilon 2^n$ then
$$\mathrm{card}\big( B(A, t_{\varepsilon,\delta} \sqrt{n}) \big) \ge (1 - \delta) 2^n. \tag{2.17}$$

Proof. Let $X = (X_1, \ldots, X_n)$ be a vector consisting of i.i.d. Bernoulli $B(1/2)$ random variables, which means that, for any subset $S \subseteq \{0,1\}^n$,
$$P(X \in S) = \frac{\mathrm{card}(S)}{2^n}.$$
Let us consider the random variable $Z = \min_{y \in A} \rho(X, y)$ equal to the Hamming distance from $X$ to the set $A$. Changing one coordinate $X_i$ to $X_i'$ can change $Z$ by at most 1, and Azuma's inequality (applied to $Z$ and $-Z$) implies that
$$P\big( Z \le EZ - t \sqrt{n} \big) \le e^{-t^2/2}, \tag{2.18}$$
$$P\big( Z \ge EZ + t \sqrt{n} \big) \le e^{-t^2/2}. \tag{2.19}$$
If in the first inequality we take $t_\varepsilon = \sqrt{2 \log \frac{1}{\varepsilon}}$ then
$$P\big( Z \le EZ - t_\varepsilon \sqrt{n} \big) \le \varepsilon.$$
This direction of Azuma's inequality allows us to conclude that $EZ \le t_\varepsilon \sqrt{n}$ because of the following observation. If $EZ - t_\varepsilon \sqrt{n} > 0$ then the above probability would be at least $P(Z = 0)$. However, since $Z = 0$ if and only if $X \in A$,
$$P(Z = 0) = P(X \in A) = \frac{\mathrm{card}(A)}{2^n} > \varepsilon,$$
by our assumption, which is a contradiction. Therefore,
$$EZ \le t_\varepsilon \sqrt{n}.$$
The second inequality then implies
$$P\big( Z \ge t_\varepsilon \sqrt{n} + t \sqrt{n} \big) \le P\big( Z \ge EZ + t \sqrt{n} \big) \le e^{-t^2/2}.$$
If in this inequality we take $t_\delta = \sqrt{2 \log \frac{1}{\delta}}$ then
$$P\big( Z \ge t_\varepsilon \sqrt{n} + t_\delta \sqrt{n} \big) \le \delta,$$
which implies that
$$P\big( Z \le t_\varepsilon \sqrt{n} + t_\delta \sqrt{n} \big) \ge 1 - \delta.$$
Since this event is exactly the set $B(A, t_{\varepsilon,\delta} \sqrt{n})$ of all points within Hamming distance $t_{\varepsilon,\delta} \sqrt{n}$ from $A$, this finishes the proof. □

Exercise 2.5.1 (Empirical process). Given a family $\mathcal{F}$ of functions $f : \mathbb{R} \to [0,1]$ and independent random variables $X_1, \ldots, X_n$, consider
$$Z := \frac{1}{\sqrt{n}} \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^n \big( f(X_i) - E f(X_i) \big) \Big|.$$
Prove that $P(|Z - EZ| \ge t) \le 2 e^{-t^2/2}$ for all $t \ge 0$.


Exercise 2.5.2. Suppose we throw $n$ balls into $m$ boxes at random, and let $N$ be the number of boxes with at least two balls in them. Compute $EN$. What is the limit $\lim_{n \to \infty} EN / n$ when $m = \alpha n$ for a fixed $\alpha > 0$? What does Azuma's inequality say about $|N - EN|$?

Exercise 2.5.3 (3-SAT). Suppose that a debate class has a large number of stu-
dents, n. The professor creates m = 2n debate teams, each composed of 3 members,
and he does it by assigning each student to 6 different teams.
Each member of each team is assigned to defend a position from the platform of
one of two major political parties, Party A or Party B. The professor does not know
the party affiliations of the students, and we suppose that each student belongs to
either Party A or Party B with probability 1/2 each, independently of each other.
Given their turn, each team will select only one of its members to defend his or
her assigned position. However, if it turns out that all three members have been
assigned positions of the opposing party (not the one to which they belong), the
team will pass.
Let N be the number of speeches delivered in class. What is the expectation
EN? Use Azuma’s inequality to show that, with high probability, the number of
speeches N will be relatively close to EN.

Exercise 2.5.4 (Independent set). Let $G(n,p)$ be the Erdős–Rényi random graph on the set of $n$ vertices $V = \{1, \ldots, n\}$. A subset of vertices $V' \subseteq V$ in this graph is called an independent set if there are no edges between any two vertices in $V'$. Let $X$ be the largest cardinality of an independent set in $G(n,p)$. Show that
$$P\big( |X - EX| \ge t \big) \le 2 \exp\Big( -\frac{t^2}{2n} \Big).$$

Chapter 3
Central Limit Theorem

3.1 Convergence of laws. Selection theorem

In this section we begin the discussion of weak convergence of distributions on metric spaces. Let $(S, d)$ be a metric space with metric $d$. Consider a measurable space $(S, \mathcal{B})$ with the Borel $\sigma$-algebra $\mathcal{B}$ generated by the open sets and let $(P_n)_{n \ge 1}$ and $P$ be some probability distributions on $\mathcal{B}$. We define
$$C_b(S) = \big\{ f : S \to \mathbb{R} \text{ continuous and bounded} \big\}.$$
We say that $P_n \to P$ weakly if
$$\lim_{n \to \infty} \int_S f \, dP_n = \int_S f \, dP \quad \text{for all } f \in C_b(S). \tag{3.1}$$
This is often denoted by $P_n \overset{d}{\to} P$ or $P_n \Longrightarrow P$. Of course, the idea here is that the measure $P$ on the Borel $\sigma$-algebra is determined uniquely by the integrals of $f \in C_b(S)$, so we compare closeness of the measures by closeness of all these integrals. To see this, suppose that $\int f \, dP = \int f \, dQ$ for all $f \in C_b(S)$. Consider any open set $U$ in $S$ and let $F = U^c$. Using that $d(x, F) = 0$ if and only if $x \in F$ (because $F$ is closed), it is easy to see that
$$f_m(x) = \min\big( 1, \, m \, d(x, F) \big) \uparrow I(x \in U) \quad \text{as } m \uparrow \infty.$$
Since $f_m \in C_b(S)$, by the monotone convergence theorem we get that $P(U) = Q(U)$ and, by Dynkin's theorem, $P = Q$. We will also say that random variables $X_n \to X$ in distribution if their laws converge weakly. Notice that in this definition the random variables need not be defined on the same probability space, as long as they take values in the same metric space $S$. We will come back to the study of convergence on general metric spaces later in the course, and in this section we will prove only one general result, the Selection Theorem. Other results will be proved only on $\mathbb{R}$ or $\mathbb{R}^n$, to prepare us for the most famous example of convergence of laws, the Central Limit Theorem (CLT). First of all, let us notice that on the real line the convergence of probability measures can be expressed in terms of their c.d.f.s, as follows.

Theorem 3.1. If $S = \mathbb{R}$ then $P_n \to P$ weakly if and only if $F_n(t) = P_n\big( (-\infty, t] \big) \to F(t) = P\big( (-\infty, t] \big)$ at every point of continuity $t$ of the c.d.f. $F$.

Proof. "$\Longrightarrow$" Suppose that (3.1) holds. Let us approximate the indicator $I(x \le t)$ by continuous functions $\varphi_1, \varphi_2 \in C_b(\mathbb{R})$ so that
$$I(x \le t - \varepsilon) \le \varphi_1(x) \le I(x \le t) \le \varphi_2(x) \le I(x \le t + \varepsilon).$$
Then, using (3.1) for $\varphi_1$ and $\varphi_2$,
$$F(t - \varepsilon) \le \int \varphi_1 \, dF = \lim_{n \to \infty} \int \varphi_1 \, dF_n \le \lim_{n \to \infty} F_n(t) \le \lim_{n \to \infty} \int \varphi_2 \, dF_n = \int \varphi_2 \, dF \le F(t + \varepsilon).$$
Therefore, for any $\varepsilon > 0$,
$$F(t - \varepsilon) \le \lim_{n \to \infty} F_n(t) \le F(t + \varepsilon).$$
More carefully, we should write $\liminf$ and $\limsup$ but, since $t$ is a point of continuity of $F$, letting $\varepsilon \downarrow 0$ proves that the limit $\lim_{n \to \infty} F_n(t)$ exists and is equal to $F(t)$.
"$\Longleftarrow$" Let $PC(F)$ be the set of points of continuity of $F$. Since $F$ is monotone, the set $PC(F)$ is dense in $\mathbb{R}$. Take $M$ large enough such that both $-M, M \in PC(F)$ and $P\big( (-M, M]^c \big) \le \varepsilon$. Clearly, for large enough $n \ge 1$ we have $P_n\big( (-M, M]^c \big) \le 2\varepsilon$. For any $k > 1$, consider a sequence of points $-M = x_1^k \le x_2^k \le \ldots \le x_k^k = M$ such that all $x_i^k \in PC(F)$ and $\max_i |x_{i+1}^k - x_i^k| \to 0$ as $k \to \infty$. Given a function $f \in C_b(\mathbb{R})$, consider the approximating function
$$f_k(x) = \sum_{1 < i \le k} f(x_i^k) \, I\big( x \in (x_{i-1}^k, x_i^k] \big) + 0 \cdot I\big( x \notin (-M, M] \big).$$
Since $f$ is continuous,
$$\delta_k(M) = \sup_{|x| \le M} \big| f_k(x) - f(x) \big| \to 0, \quad k \to \infty.$$
Since all $x_i^k \in PC(F)$, by assumption, we can write
$$\int f_k \, dF_n = \sum_{1 < i \le k} f_k(x_i^k) \big( F_n(x_i^k) - F_n(x_{i-1}^k) \big) \overset{n \to \infty}{\longrightarrow} \sum_{1 < i \le k} f_k(x_i^k) \big( F(x_i^k) - F(x_{i-1}^k) \big) = \int f_k \, dF.$$

On the other hand,
$$\Big| \int f \, dF - \int f_k \, dF \Big| \le \| f \|_\infty \, P\big( (-M, M]^c \big) + \delta_k(M) \le \| f \|_\infty \, \varepsilon + \delta_k(M)$$
and, similarly, for large enough $n \ge 1$,
$$\Big| \int f \, dF_n - \int f_k \, dF_n \Big| \le \| f \|_\infty \, P_n\big( (-M, M]^c \big) + \delta_k(M) \le \| f \|_\infty \, 2\varepsilon + \delta_k(M).$$
Letting $n \to \infty$, then $k \to \infty$ and, finally, $\varepsilon \downarrow 0$ (or $M \uparrow \infty$), proves that $\int f \, dF_n \to \int f \, dF$. □

Example 3.1.1. If $P_n(\{n^{-1}\}) = 1$ and $P(\{0\}) = 1$ then $P_n \to P$ weakly, but their c.d.f.s obviously do not converge at the point of discontinuity $x = 0$.
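A short numeric check of this example, with $f = \cos$ as an arbitrary bounded continuous test function of our choosing:

```python
import math

# P_n = point mass at 1/n, P = point mass at 0; then
# integral of f against P_n is just f(1/n), and against P it is f(0) = 1.
f = math.cos
integrals = [f(1 / n) for n in (1, 10, 100, 1000)]
print(integrals)

# The c.d.f.s, however, fail to converge at the discontinuity point 0:
Fn_at_0 = 0.0   # F_n(0) = P_n((-inf, 0]) = 0, since the atom 1/n is positive
F_at_0 = 1.0    # while F(0) = 1
```

The list of integrals climbs toward $f(0) = 1$, exactly as (3.1) requires, even though $F_n(0)$ stays at 0.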

Our typical strategy for proving the convergence of probability measures will be based on the following elementary observation.

Lemma 3.1. If for any sequence $(n(k))_{k \ge 1}$ there exists a subsequence $(n(k(r)))_{r \ge 1}$ such that $P_{n(k(r))} \to P$ weakly then $P_n \to P$ weakly.

Proof. Suppose not. Then for some $f \in C_b(S)$ and some $\varepsilon > 0$ there exists a subsequence $(n(k))$ such that
$$\Big| \int f \, dP_{n(k)} - \int f \, dP \Big| > \varepsilon.$$
But this contradicts the fact that for some subsequence $P_{n(k(r))} \to P$ weakly. □

As a result, to prove that $P_n \to P$ weakly, it is enough to demonstrate two things:
1. First, that the sequence $(P_n)$ is such that one can always find converging subsequences of any sequence $(P_{n(k)})$. This describes a sort of "relative compactness" of the sequence $(P_n)$ and will be a consequence of uniform tightness and the Selection Theorem proved below.
2. Second, we need to show that any subsequential limit is always the same, $P$, and this will be done by some ad hoc methods, for example, using the method of characteristic functions in the case of the CLT.

We say that a sequence of distributions $(P_n)_{n \ge 1}$ on a metric space $(S, d)$ is uniformly tight if, for any $\varepsilon > 0$, there exists a compact $K \subseteq S$ such that $P_n(K) \ge 1 - \varepsilon$ for all $n$. Checking this property can be difficult and, of course, one needs to understand what the compact sets look like in a particular metric space. In the case of the CLT on $\mathbb{R}^n$, this will be a trivial task. The following fact is quite fundamental.

Theorem 3.2 (Selection Theorem). If $(P_n)_{n \ge 1}$ is a uniformly tight sequence of laws on the metric space $(S, d)$ then there exists a subsequence $(n(k))$ such that $P_{n(k)}$ converges weakly to some probability law $P$.

Let us recall the following well-known result.

Lemma 3.2 (Cantor's diagonalization). Let $A$ be a countable set and $f_n : A \to \mathbb{R}$ for $n \ge 1$. Then there exists a subsequence $(n(k))$ such that $f_{n(k)}(a)$ converges for all $a \in A$, possibly to $\pm\infty$.

Proof. Let $A = \{a_1, a_2, \ldots\}$. Take $(n_1(k))$ such that $f_{n_1(k)}(a_1)$ converges. Take $(n_2(k)) \subseteq (n_1(k))$ such that $f_{n_2(k)}(a_2)$ converges. Recursively, take $(n_\ell(k)) \subseteq (n_{\ell-1}(k))$ such that $f_{n_\ell(k)}(a_\ell)$ converges. Now consider the diagonal sequence $(n_k(k))$. Clearly, $f_{n_k(k)}(a_\ell)$ converges for any $\ell$ because, for $k \ge \ell$, $n_k(k) \in \{n_\ell(k)\}$ by construction. □

Proof (of Theorem 3.2). We will prove the Selection Theorem for arbitrary metric spaces, since this result will be useful to us later when we study the convergence of laws on general metric spaces. However, when $S = \mathbb{R}$ one can see this in a much more intuitive way, as follows.
(The case $S = \mathbb{R}$.) Let $A$ be a dense set of points in $\mathbb{R}$. Given a sequence of probability measures $P_n$ on $\mathbb{R}$ and their c.d.f.s $F_n$, by Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $F_{n(k)}(a) \to \overline{F}(a)$ for all $a \in A$. We then define $F(x) = \inf\big\{ \overline{F}(a) : x < a, \, a \in A \big\}$. Since, by construction, $\overline{F}$ is non-decreasing on $A$, the function $F$ is also non-decreasing. Since
$$F(x) = \inf\big\{ \overline{F}(a) : x < a, \, a \in A \big\} = \inf_{x < y} \inf\big\{ \overline{F}(a) : y < a, \, a \in A \big\} = \inf_{x < y} F(y),$$
it is also right-continuous. The fact that the $P_n$ are uniformly tight ensures that $F(x) \to 0$ or $1$ as $x \to -\infty$ or $+\infty$, so $F(x)$ is a cumulative distribution function. Although it is possible that $\overline{F}(a) \ne F(a)$ for some $a \in A$, it is clear from the definition of $F$ that if $x$ is a point of continuity of $F$ then
$$\lim_{A \ni a \uparrow x} \overline{F}(a) = F(x), \qquad \lim_{A \ni b \downarrow x} \overline{F}(b) = F(x). \tag{3.2}$$
In order to prove weak convergence of $P_{n(k)}$ to the measure $P$ with the c.d.f. $F$, let $x$ be a point of continuity of $F$ and let $a, b \in A$ be such that $a < x < b$. We have
$$\overline{F}(a) = \lim_{k \to \infty} F_{n(k)}(a) \le \liminf_{k \to \infty} F_{n(k)}(x) \le \limsup_{k \to \infty} F_{n(k)}(x) \le \lim_{k \to \infty} F_{n(k)}(b) = \overline{F}(b).$$
Therefore, (3.2) implies that $F_{n(k)}(x) \to F(x)$ for all such $x$ and, by Theorem 3.1, this means that the laws $P_{n(k)}$ converge to $P$. Let us now prove the general case on general metric spaces.
(The general case.) If $K$ is compact then, obviously, $C_b(K) = C(K)$. Later in these lectures, when we deal in more detail with convergence on general metric spaces, we will prove the following fact, which is well known and is a consequence of the Stone–Weierstrass theorem.

$C(K)$ is separable with respect to the $\ell^\infty$ norm $\| f \|_\infty = \sup_{x \in K} | f(x) |$.

Since the $P_n$ are uniformly tight, for any $r \ge 1$ we can find a compact $K_r$ such that $P_n(K_r) > 1 - 1/r$ for all $n \ge 1$. Let $C_r \subset C(K_r)$ be a countable dense subset of $C(K_r)$. By Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $P_{n(k)}(f)$ converges for all $f \in C_r$ for all $r \ge 1$. Since $C_r$ is dense in $C(K_r)$, this implies that $P_{n(k)}(f)$ converges for all $f \in C(K_r)$ for all $r \ge 1$. Next, for any $f \in C_b(S)$,
$$\Big| \int f \, dP_{n(k)} - \int_{K_r} f \, dP_{n(k)} \Big| \le \int_{K_r^c} | f | \, dP_{n(k)} \le \| f \|_\infty \, P_{n(k)}(K_r^c) \le \frac{\| f \|_\infty}{r}.$$
This implies that the limit
$$I(f) := \lim_{k \to \infty} \int f \, dP_{n(k)} \tag{3.3}$$
exists. The question is why this limit is an integral $I(f) = \int f \, dP$ for some probability measure $P$. Basically, this is a consequence of the Riesz representation theorem for positive linear functionals on locally compact Hausdorff spaces. One can define the functional $I_r(f)$ on each compact $K_r$, find a measure on it using the Riesz representation theorem, check that these measures agree on the intersections, and extend to a probability measure on their union. However, instead, we will use a relative of the Riesz representation theorem, the Stone–Daniell theorem from measure theory, which states the following.
Given a set $S$, a family of functions $\mathcal{A} = \{ f : S \to \mathbb{R} \}$ is called a vector lattice if
$$f, g \in \mathcal{A} \implies c f + g \in \mathcal{A} \;\; \forall c \in \mathbb{R}, \quad \text{and} \quad f \wedge g, \, f \vee g \in \mathcal{A}.$$
A vector lattice $\mathcal{A}$ is called a Stone vector lattice if $f \wedge 1 \in \mathcal{A}$ for any $f \in \mathcal{A}$. For example, any vector lattice that contains the constants is automatically a Stone vector lattice.
A functional $I : \mathcal{A} \to \mathbb{R}$ is called a pre-integral if
1. $I(c f + g) = c I(f) + I(g)$,
2. $f \ge 0 \implies I(f) \ge 0$,
3. $f_n \downarrow 0$, $\| f_n \|_\infty < \infty \implies I(f_n) \to 0$.
See R. M. Dudley, "Real Analysis and Probability," for a proof of the following:

(Stone–Daniell theorem) If $\mathcal{A}$ is a Stone vector lattice and $I$ is a pre-integral on $\mathcal{A}$ then $I(f) = \int f \, d\mu$ for some unique measure $\mu$ on the minimal $\sigma$-algebra on which all functions in $\mathcal{A}$ are measurable.

We will use this theorem with $\mathcal{A} = C_b(S)$ and $I$ defined in (3.3). The first two properties are obvious. To prove the third one, let us consider a sequence such that
$$f_n \downarrow 0, \qquad 0 \le f_n(x) \le f_1(x) \le \| f_1 \|_\infty.$$

On any compact $K_r$, $f_n \downarrow 0$ uniformly, i.e.
$$\| f_n \|_{\infty, K_r} \le \varepsilon_{n,r} \overset{n \to \infty}{\longrightarrow} 0.$$
Since
$$\int f_n \, dP_{n(k)} = \int_{K_r} f_n \, dP_{n(k)} + \int_{K_r^c} f_n \, dP_{n(k)} \le \varepsilon_{n,r} + \frac{1}{r} \| f_1 \|_\infty,$$
we get
$$I(f_n) = \lim_{k \to \infty} \int f_n \, dP_{n(k)} \le \varepsilon_{n,r} + \frac{1}{r} \| f_1 \|_\infty.$$
Letting $n \to \infty$ and then $r \to \infty$, we get that $I(f_n) \to 0$. By the Stone–Daniell theorem,
$$I(f) = \int f \, dP$$
for some unique measure $P$ on $\sigma(C_b(S))$. The choice of $f = 1$ shows that $I(f) = 1 = P(S)$, which means that $P$ is a probability measure. Finally, let us show that $\sigma(C_b(S))$ is the Borel $\sigma$-algebra $\mathcal{B}$ generated by the open sets. Since any $f \in C_b(S)$ is measurable on $\mathcal{B}$, we get $\sigma(C_b(S)) \subseteq \mathcal{B}$. On the other hand, let $F \subseteq S$ be any closed set and take the function $f(x) = \min(1, d(x, F))$. We have $| f(x) - f(y) | \le d(x, y)$, so $f \in C_b(S)$ and
$$f^{-1}(\{0\}) \in \sigma(C_b(S)).$$
However, since $F$ is closed, $f^{-1}(\{0\}) = \{x : d(x, F) = 0\} = F$, and this proves that $\mathcal{B} \subseteq \sigma(C_b(S))$. □

Conversely, the following holds (we will prove this result later for any complete
separable metric space).

Theorem 3.3. If $P_n$ converges weakly to $P$ on $\mathbb{R}^k$ then $(P_n)_{n \geq 1}$ is uniformly tight.

Proof. For any $\varepsilon > 0$, there exists large enough $M > 0$ such that $P(|x| > M) < \varepsilon$.
Consider the function
$$\varphi(s) = \begin{cases} 0, & s \leq M,\\ (s - M)/M, & M \leq s \leq 2M,\\ 1, & s \geq 2M,\end{cases}$$
and let $\alpha(x) := \varphi(|x|)$ for $x \in \mathbb{R}^k$. Since $P_n \to P$ weakly,
$$\limsup_{n \to \infty} P_n\big(|x| > 2M\big) \leq \limsup_{n \to \infty} \int \alpha(x)\, dP_n(x) = \int \alpha(x)\, dP(x) \leq P\big(|x| > M\big) \leq \varepsilon.$$
For $n$ large enough, $n \geq n_0$, we get $P_n(|x| > 2M) \leq 2\varepsilon$. For $n < n_0$ choose $M_n$ so
that $P_n(|x| > M_n) \leq 2\varepsilon$. Take $M' = \max\{M_1, \ldots, M_{n_0 - 1}, 2M\}$. As a result, $P_n(|x| >
M') \leq 2\varepsilon$ for all $n \geq 1$. $\Box$

Finally, let us relate convergence in distribution to other forms of convergence.
Consider random variables $X$ and $X_n$ on some probability space $(\Omega, \mathcal{A}, P)$ with
values in a metric space $(S, d)$. Let $P$ and $P_n$ be their corresponding laws on the Borel
sets $\mathcal{B}$ in $S$. Convergence of $X_n$ to $X$ in probability and almost surely is defined
exactly the same way as for $S = \mathbb{R}$, by replacing $|X_n - X|$ with $d(X_n, X)$.

Lemma 3.3. $X_n \to X$ in probability if and only if for any sequence $(n(k))$ there
exists a subsequence $(n(k(r)))$ such that $X_{n(k(r))} \to X$ almost surely.

Proof. "$\Longleftarrow$". Suppose $X_n$ does not converge to $X$ in probability. Then, for some small
$\varepsilon > 0$, there exists a subsequence $(n(k))_{k \geq 1}$ such that $P(d(X, X_{n(k)}) \geq \varepsilon) \geq \varepsilon$.
This contradicts the existence of a subsequence $X_{n(k(r))}$ that converges to $X$ almost
surely.

"$\Longrightarrow$". Given a subsequence $(n(k))$, let us choose $(k(r))$ so that
$$P\Big( d(X_{n(k(r))}, X) \geq \frac{1}{r} \Big) \leq \frac{1}{r^2}.$$
By the Borel-Cantelli lemma, these events occur i.o. with probability $0$, which
means that with probability one, for large enough $r$,
$$d(X_{n(k(r))}, X) \leq \frac{1}{r},$$
i.e. $X_{n(k(r))} \to X$ almost surely. $\Box$

Lemma 3.4. If $X_n \to X$ in probability then $X_n \to X$ in distribution.

Proof. By Lemma 3.3, for any subsequence $(n(k))$ there exists a subsequence
$(n(k(r)))$ such that $X_{n(k(r))} \to X$ almost surely. Given $f \in C_b(\mathbb{R})$, by the
dominated convergence theorem, $E f(X_{n(k(r))}) \to E f(X)$, i.e. $X_{n(k(r))} \to X$ weakly. By
Lemma 3.1, $X_n \to X$ in distribution. $\Box$

Exercise 3.1.1. Let $X_n$ be random variables on the same probability space with
values in a metric space $S$. If for some point $s \in S$, $X_n \to s$ in distribution, show that
$X_n \to s$ in probability.

Exercise 3.1.2. For the following sequences of laws $P_n$ on $\mathbb{R}$ having densities $f_n$,
which are uniformly tight? (a) $f_n = I(0 \leq x \leq n)/n$, (b) $f_n = n e^{-nx}\, I(x \geq 0)$, (c)
$f_n = e^{-x/n}/n \cdot I(x \geq 0)$.

Exercise 3.1.3. Suppose that $X_n \to X$ in distribution on $\mathbb{R}$ and $Y_n \to c \in \mathbb{R}$ in prob-
ability. Show that $X_n Y_n \to cX$ in distribution, assuming that $X_n, Y_n$ are defined on
the same probability space.

Exercise 3.1.4. Suppose that random variables $(X_n)$ are independent and $X_n \to X$
in probability. Show that $X$ is almost surely constant.
74 3 Central Limit Theorem

Exercise 3.1.5. Suppose that random variables $X$ and $Y$ taking values in $[0, 1]$ sat-
isfy $EX^n \geq EY^n$ for all integers $n \geq 1$. Does this imply that $X$ is stochastically
greater than $Y$, i.e. $P(X \geq t) \geq P(Y \geq t)$ for all $t$?

3.2 Characteristic functions

In the next section, we will prove one of the most classical results in Probability
Theory, the central limit theorem, and one of the main tools will be the so-
called characteristic functions. Let $X = (X_1, \ldots, X_k)$ be a random vector on $\mathbb{R}^k$ with
the distribution $P$ and let $t = (t_1, \ldots, t_k) \in \mathbb{R}^k$. The characteristic function of $X$ is
defined by
$$f(t) = E e^{i(t, X)} = \int e^{i(t, x)}\, dP(x).$$
In this section, we will collect various important and useful facts about charac-
teristic functions. The standard normal distribution $N(0, 1)$ on $\mathbb{R}$ with the density
$$p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
will play a central role, so let us start by computing its characteristic function. First
of all, notice that this is indeed a density, since
$$\Big( \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\, dx \Big)^2 = \frac{1}{2\pi} \iint_{\mathbb{R}^2} e^{-(x^2 + y^2)/2}\, dx\, dy = \frac{1}{2\pi} \int_0^{2\pi}\!\! \int_0^\infty e^{-r^2/2}\, r\, dr\, d\theta = 1.$$

If $X$ has the standard normal distribution $N(0, 1)$ then, obviously, $EX = 0$ and
$$\mathrm{Var}(X) = EX^2 = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x^2 e^{-x^2/2}\, dx = -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x\, d\big( e^{-x^2/2} \big) = 1,$$
by integration by parts. To motivate the computation of the characteristic function,
let us first notice that for $\lambda \in \mathbb{R}$,
$$E e^{\lambda X} = \frac{1}{\sqrt{2\pi}} \int e^{\lambda x - \frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{(x - \lambda)^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}.$$
For complex $\lambda = it$, we begin similarly by completing the square,
$$E e^{itX} = e^{-\frac{t^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{(x - it)^2}{2}}\, dx = e^{-\frac{t^2}{2}} \int_{-it + \mathbb{R}} \varphi(z)\, dz,$$
where we denoted
$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \quad \text{for } z \in \mathbb{C}.$$
Since $\varphi$ is analytic, by Cauchy's theorem, the integral over a closed path is equal to $0$.
Let us take the closed path $-it + x$ for $x$ from $-M$ to $+M$, then $M + iy$ for $y$ from $-t$ to $0$, then
$x$ from $M$ to $-M$ and, finally, $-M + iy$ for $y$ from $0$ to $-t$. For large $M$, the function
$\varphi(z)$ is very small on the intervals $\pm M + iy$, so letting $M \to \infty$ we get
$$\int_{-it + \mathbb{R}} \varphi(z)\, dz = \int_{\mathbb{R}} \varphi(z)\, dz = 1,$$

and we proved that
$$f(t) = E e^{itX} = e^{-\frac{t^2}{2}}. \tag{3.4}$$
It is also easy to check that $Y = \mu + \sigma X$ has mean $\mu = EY$, variance $\sigma^2 = \mathrm{Var}(Y)$
and the density
$$\frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big).$$
This is the so-called normal distribution $N(\mu, \sigma^2)$ and its characteristic function is
given by
$$E e^{itY} = E e^{it(\mu + \sigma X)} = e^{it\mu - t^2 \sigma^2/2}. \tag{3.5}$$
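Formula (3.5) is easy to sanity-check numerically (an illustrative sketch, not part of the text; the sample size and the values of $\mu, \sigma$ are our own choices): the empirical average of $e^{itY}$ over many samples of $Y \sim N(\mu, \sigma^2)$ should approach $e^{it\mu - t^2\sigma^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cf(samples, t):
    # Monte Carlo estimate of E exp(itY)
    return np.mean(np.exp(1j * t * samples))

mu, sigma = 1.5, 2.0
Y = mu + sigma * rng.standard_normal(200_000)  # Y ~ N(mu, sigma^2)

for t in [0.0, 0.5, 1.0]:
    theory = np.exp(1j * t * mu - t**2 * sigma**2 / 2)
    print(t, abs(empirical_cf(Y, t) - theory))  # errors of order 1/sqrt(200000)
```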
Our next very important observation is that integrability of X is related to smooth-
ness of its characteristic function.

Lemma 3.5. If $X$ is a real-valued random variable such that $E|X|^r < \infty$ for an integer
$r \geq 1$ then $f(t) \in C^r(\mathbb{R})$ and $f^{(j)}(t) = E (iX)^j e^{itX}$ for $j \leq r$.

For a partial converse, see an exercise below, where the existence of $f''(0)$ is shown
to imply that $EX^2 < \infty$. One can similarly show that, for even $n$, the existence of
$f^{(n)}(0)$ implies that $EX^n < \infty$.

Proof. If $r = 0$, we use the fact that $|e^{itX}| \leq 1$ to conclude that
$$f(t) = E e^{itX} \to E e^{isX} = f(s) \ \text{ if } t \to s,$$
by the dominated convergence theorem. This means that $f \in C(\mathbb{R})$, i.e. character-
istic functions are always continuous. If $r = 1$, $E|X| < \infty$, we can use
$$\Big| \frac{e^{itX} - e^{isX}}{t - s} \Big| \leq |X|$$
and, therefore, by the dominated convergence theorem,
$$f'(t) = \lim_{s \to t} E\, \frac{e^{itX} - e^{isX}}{t - s} = E\, iX e^{itX}.$$
Also, by the dominated convergence theorem, $E\, iX e^{itX} \in C(\mathbb{R})$, which means that $f \in
C^1(\mathbb{R})$. We proceed by induction. Suppose that we proved that $f^{(j)}(t) = E (iX)^j e^{itX}$
and that $r = j + 1$, $E|X|^{j+1} < \infty$. Then, we can use that
$$\Big| \frac{(iX)^j e^{itX} - (iX)^j e^{isX}}{t - s} \Big| \leq |X|^{j+1},$$
so that
$$f^{(j+1)}(t) = E (iX)^{j+1} e^{itX} \in C(\mathbb{R})$$
by the dominated convergence theorem. $\Box$

Next, we want to show that the characteristic function uniquely determines the
distribution. This is usually proved using convolutions. Let $X$ and $Y$ be two inde-
pendent random vectors on $\mathbb{R}^k$ with the distributions $P$ and $Q$. We denote by $P * Q$
the convolution of $P$ and $Q$, which is the distribution $\mathcal{L}(X + Y)$ of the sum $X + Y$.
We have
$$P * Q(A) = E I(X + Y \in A) = \iint I(x + y \in A)\, dP(x)\, dQ(y) = \iint I(x \in A - y)\, dP(x)\, dQ(y) = \int P(A - y)\, dQ(y).$$
If $P$ has density $p$ then
$$P * Q(A) = \iint I(x + y \in A)\, p(x)\, dx\, dQ(y) = \iint I(z \in A)\, p(z - y)\, dz\, dQ(y) = \int \Big( \int_A p(z - y)\, dz \Big)\, dQ(y) = \int_A \Big( \int p(z - y)\, dQ(y) \Big)\, dz,$$
which means that $P * Q$ has density
$$f(x) = \int p(x - y)\, dQ(y). \tag{3.6}$$
If, in addition, $Q$ has density $q$ then
$$f(x) = \int p(x - y)\, q(y)\, dy. \tag{3.7}$$
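Formula (3.7) can be seen in action numerically (an illustrative sketch, not from the text): discretizing the densities on a grid turns the convolution integral into `np.convolve`. For two Uniform(0,1) densities the result should match the triangular density of the sum.

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
p = np.ones_like(x)  # density of Uniform(0, 1)
q = np.ones_like(x)  # density of Uniform(0, 1)

# discretization of f(z) = \int p(z - y) q(y) dy from (3.7)
f = np.convolve(p, q) * dx
z = np.arange(len(f)) * dx  # grid on [0, 2)

# the sum of two independent Uniform(0,1) has the triangular density min(z, 2-z)
print(np.max(np.abs(f - np.minimum(z, 2 - z))))  # small discretization error
```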

Let us denote by $N(0, \sigma^2 I)$ the distribution of the random vector $X = (X_1, \ldots, X_k)$
of i.i.d. random variables with the normal distribution $N(0, \sigma^2)$. The density of $X$
is given by
$$\prod_{i=1}^k \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_i^2}{2\sigma^2}} = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \Big)^k e^{-\frac{|x|^2}{2\sigma^2}}.$$
Given a distribution $P$ on $\mathbb{R}^k$, let us denote
$$P^\sigma = P * N(0, \sigma^2 I).$$
It turns out that $P^\sigma$ is always absolutely continuous and its density can be written
in terms of the characteristic function of $P$.

Lemma 3.6. The convolution $P^\sigma = P * N(0, \sigma^2 I)$ has density
$$p^\sigma(x) = \Big( \frac{1}{2\pi} \Big)^k \int f(t)\, e^{-i(t,x) - \frac{\sigma^2 |t|^2}{2}}\, dt,$$
where $f(t) = \int e^{i(t,x)}\, dP(x)$.

Proof. By (3.6), $P * N(0, \sigma^2 I)$ has density
$$p^\sigma(x) = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \Big)^k \int e^{-\frac{|x - y|^2}{2\sigma^2}}\, dP(y).$$
Using (3.4), we can write
$$e^{-\frac{(x_i - y_i)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}} \int e^{-i \frac{1}{\sigma}(x_i - y_i) z_i}\, e^{-\frac{1}{2} z_i^2}\, dz_i$$
and, taking a product over $i \leq k$, we get
$$e^{-\frac{|x - y|^2}{2\sigma^2}} = \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \int e^{-i \frac{1}{\sigma}(x - y, z) - \frac{1}{2}|z|^2}\, dz.$$
Then we can continue
$$p^\sigma(x) = \Big( \frac{1}{2\pi\sigma} \Big)^k \iint e^{-i \frac{1}{\sigma}(x - y, z) - \frac{1}{2}|z|^2}\, dz\, dP(y) = \Big( \frac{1}{2\pi\sigma} \Big)^k \iint e^{-i \frac{1}{\sigma}(x - y, z) - \frac{1}{2}|z|^2}\, dP(y)\, dz = \Big( \frac{1}{2\pi\sigma} \Big)^k \int f\Big( \frac{z}{\sigma} \Big)\, e^{-i \frac{1}{\sigma}(x, z) - \frac{1}{2}|z|^2}\, dz.$$
Making the change of variables $z = t\sigma$ finishes the proof. $\Box$
This immediately implies that the characteristic function uniquely determines
the distribution.

Theorem 3.4 (Uniqueness). If $\int e^{i(t,x)}\, dP(x) = \int e^{i(t,x)}\, dQ(x)$ for all $t$ then $P = Q$.

Proof. By the above lemma, $P^\sigma = Q^\sigma$. If $X \sim P$ and $g \sim N(0, I)$ then $X + \sigma g \to X$
almost surely as $\sigma \downarrow 0$ and, therefore, $P^\sigma \to P$ weakly. Similarly, $Q^\sigma \to Q$. $\Box$
This immediately implies the following stability property of the normal distri-
bution.

Lemma 3.7. If $X_j$ for $j = 1, 2$ are independent and have normal distributions
$N(\mu_j, \sigma_j^2)$ then $X_1 + X_2$ has normal distribution $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Proof. By independence and (3.5), the characteristic function of the sum is equal
to
$$E e^{it(X_1 + X_2)} = E e^{itX_1}\, E e^{itX_2} = e^{it\mu_1 - t^2 \sigma_1^2/2}\, e^{it\mu_2 - t^2 \sigma_2^2/2} = e^{it(\mu_1 + \mu_2) - t^2(\sigma_1^2 + \sigma_2^2)/2}.$$
This is the characteristic function of the normal distribution $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$,
so the claim follows from the Uniqueness Theorem. One can also prove this by a
straightforward computation using the formula (3.7) for the density of the convo-
lution, which is left as an exercise below. $\Box$
Lemma 3.6 gives some additional information when the characteristic function
is integrable.
Lemma 3.8 (Fourier inversion formula). If $\int |f(t)|\, dt < \infty$ then $P$ has density
$$p(x) = \Big( \frac{1}{2\pi} \Big)^k \int f(t)\, e^{-i(t,x)}\, dt.$$
Proof. Since
$$f(t)\, e^{-i(t,x) - \frac{1}{2}\sigma^2 |t|^2} \to f(t)\, e^{-i(t,x)}$$
pointwise as $\sigma \downarrow 0$ and
$$\big| f(t)\, e^{-i(t,x) - \frac{1}{2}\sigma^2 |t|^2} \big| \leq |f(t)|,$$
the integrability of $f(t)$ implies, by the dominated convergence theorem, that
$p^\sigma(x) \to p(x)$. Since $P^\sigma \to P$ weakly, for any $g \in C_b(\mathbb{R}^k)$,
$$\int g(x)\, p^\sigma(x)\, dx \to \int g(x)\, dP(x).$$
On the other hand, since
$$|p^\sigma(x)| \leq \Big( \frac{1}{2\pi} \Big)^k \int |f(t)|\, dt < \infty,$$
by the dominated convergence theorem, for any compactly supported $g \in C_c(\mathbb{R}^k)$,
$$\int g(x)\, p^\sigma(x)\, dx \to \int g(x)\, p(x)\, dx.$$
Therefore, for any such $g \in C_c(\mathbb{R}^k)$,
$$\int g(x)\, dP(x) = \int g(x)\, p(x)\, dx.$$
Of course, this means that $p(x)$ is the density of $P$. $\Box$
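A numerical illustration of the inversion formula (a sketch with our own truncation and grid choices): plug the standard normal characteristic function $f(t) = e^{-t^2/2}$ from (3.4) into the formula, truncate the integral to $[-T, T]$, and compare against the density $e^{-x^2/2}/\sqrt{2\pi}$.

```python
import numpy as np

def inversion_density(phi, x, T=20.0, n=4000):
    # Riemann-sum approximation of (1/2pi) \int phi(t) e^{-itx} dt over [-T, T]
    t = np.linspace(-T, T, n)
    vals = phi(t) * np.exp(-1j * t * x)
    return float(np.real(np.sum(vals)) * (t[1] - t[0]) / (2 * np.pi))

phi_gauss = lambda t: np.exp(-t**2 / 2)  # cf of N(0,1), formula (3.4)

for x in [0.0, 1.0, 2.0]:
    exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(x, inversion_density(phi_gauss, x), exact)
```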

Because of the uniqueness theorem, convergence of characteristic functions
implies convergence of distributions under some mild additional assumptions.

Lemma 3.9. If $(P_n)$ is uniformly tight on $\mathbb{R}^k$ and the characteristic functions con-
verge to some function $f$,
$$f_n(t) = \int e^{i(t,x)}\, dP_n(x) \to f(t),$$
then $f(t) = \int e^{i(t,x)}\, dP(x)$ is the characteristic function of some distribution $P$ and
$P_n \to P$ weakly.

Proof. For any sequence $(n(k))$, by the Selection Theorem, there exists a subse-
quence $(n(k(r)))$ such that $P_{n(k(r))}$ converges weakly to some distribution $P$. Since
$e^{i(t,x)}$ is bounded and continuous,
$$\int e^{i(t,x)}\, dP_{n(k(r))} \to \int e^{i(t,x)}\, dP(x)$$
as $r \to \infty$ and, therefore, $f$ is the characteristic function of $P$. By the uniqueness
theorem, the distribution $P$ does not depend on the sequence $(n(k))$. By Lemma
3.1, $P_n \to P$ weakly. $\Box$
Even though in many cases uniform tightness of $(P_n)$ is not difficult to check di-
rectly, it also follows automatically from the continuity of the limit of characteristic
functions at zero. To show this, we will need the following bound.

Lemma 3.10. If $X$ is a real-valued random variable then
$$P\Big( |X| > \frac{1}{u} \Big) \leq \frac{7}{u} \int_0^u \big( 1 - \mathrm{Re}\, f(t) \big)\, dt.$$

Proof. Since $\mathrm{Re}\, f(t) = \int \cos(tx)\, dP(x)$, we can write
\begin{align*}
\frac{1}{u} \int_0^u \big( 1 - \mathrm{Re}\, f(t) \big)\, dt
&= \frac{1}{u} \int_0^u \!\! \int_{\mathbb{R}} (1 - \cos tx)\, dP(x)\, dt
= \int_{\mathbb{R}} \frac{1}{u} \int_0^u (1 - \cos tx)\, dt\, dP(x)\\
&= \int_{\mathbb{R}} \Big( 1 - \frac{\sin xu}{xu} \Big)\, dP(x)
\geq \int_{|xu| \geq 1} \Big( 1 - \frac{\sin xu}{xu} \Big)\, dP(x)\\
&\geq (1 - \sin 1) \int_{|xu| \geq 1} 1\, dP(x)
\geq \frac{1}{7}\, P\Big( |X| \geq \frac{1}{u} \Big),
\end{align*}
where in the last inequality we used that $\frac{\sin y}{y} \leq \sin 1$ for $y \geq 1$ and $1 - \sin 1 \geq \frac{1}{7}$. $\Box$
Theorem 3.5 (Lévy's continuity theorem). Let $(X_n)$ be a sequence of random
variables on $\mathbb{R}^k$. Suppose that
$$f_n(t) = E e^{i(t, X_n)} \to f(t)$$
and $f(t)$ is continuous at $0$ along each axis. Then there exists a probability distri-
bution $P$ such that
$$f(t) = \int e^{i(t,x)}\, dP(x)$$
and $P_n = \mathcal{L}(X_n) \to P$ weakly.


Proof. By Lemma 3.9, we only need to show that $\{\mathcal{L}(X_n)\}$ is uniformly tight. If
we denote $X_n = (X_{n,1}, \ldots, X_{n,k})$ then, for the characteristic functions of the $i$th coordi-
nate,
$$f_n^i(t_i) := f_n(0, \ldots, t_i, 0, \ldots, 0) = E e^{i t_i X_{n,i}} \to f(0, \ldots, t_i, \ldots, 0) =: f^i(t_i).$$
Since $f_n(0) = 1$, we have $f(0) = 1$. By the continuity of $f^i$ at $0$, for any $\varepsilon > 0$ we
can find $\delta > 0$ such that for all $i \leq k$, $|f^i(t_i) - 1| \leq \varepsilon$ if $|t_i| \leq \delta$. By the dominated
convergence theorem,
$$\lim_{n \to \infty} \frac{1}{\delta} \int_0^\delta \big( 1 - \mathrm{Re}\, f_n^i(t_i) \big)\, dt_i = \frac{1}{\delta} \int_0^\delta \big( 1 - \mathrm{Re}\, f^i(t_i) \big)\, dt_i \leq \frac{1}{\delta} \int_0^\delta \big| 1 - f^i(t_i) \big|\, dt_i \leq \varepsilon.$$
Together with the previous lemma this implies that
$$P\Big( |X_{n,i}| > \frac{1}{\delta} \Big) \leq \frac{7}{\delta} \int_0^\delta \big( 1 - \mathrm{Re}\, f_n^i(t_i) \big)\, dt_i \leq 7 \cdot 2\varepsilon$$
for large enough $n$. Since $|X_n|^2 = \sum_{i=1}^k |X_{n,i}|^2$, the union bound implies that
$$P\Big( |X_n| > \frac{\sqrt{k}}{\delta} \Big) \leq 14 k \varepsilon,$$
which means that the sequence $P_n = \mathcal{L}(X_n)$ is uniformly tight. $\Box$

Using characteristic functions, we can prove a couple more basic properties of
convergence in distribution.

Lemma 3.11 (Continuous Mapping). Suppose that $P_n \to P$ weakly on $X$ and
$G : X \to Y$ is a continuous map. Then $P_n \circ G^{-1} \to P \circ G^{-1}$ on $Y$. In other words, if
$Z_n \to Z$ in distribution then $G(Z_n) \to G(Z)$ in distribution.

Proof. This is obvious, because for any function $f \in C_b(Y)$ we have $f \circ G \in C_b(X)$
and, therefore, $E f(G(Z_n)) \to E f(G(Z))$, which is the definition of the convergence
$G(Z_n) \to G(Z)$ in distribution. $\Box$

Lemma 3.12. If $P_n \to P$ on $\mathbb{R}^k$ and $Q_n \to Q$ on $\mathbb{R}^m$ then $P_n \times Q_n \to P \times Q$ on
$\mathbb{R}^{k+m}$.

Proof. By Fubini's theorem, the characteristic function satisfies
$$\int e^{i(t,x)}\, d(P_n \times Q_n)(x) = \int e^{i(t_1, x_1)}\, dP_n \int e^{i(t_2, x_2)}\, dQ_n \to \int e^{i(t_1, x_1)}\, dP \int e^{i(t_2, x_2)}\, dQ = \int e^{i(t,x)}\, d(P \times Q).$$
Using Theorem 3.5 finishes the proof. $\Box$

Corollary 3.1. If both $P_n \to P$ and $Q_n \to Q$ on $\mathbb{R}^k$ then $P_n * Q_n \to P * Q$.

Proof. Since the function $G : \mathbb{R}^{k+k} \to \mathbb{R}^k$ given by $G(x, y) = x + y$ is continuous,
by the continuous mapping lemma,
$$P_n * Q_n = (P_n \times Q_n) \circ G^{-1} \to (P \times Q) \circ G^{-1} = P * Q,$$
which finishes the proof. $\Box$
Exercise 3.2.1. Prove Lemma 3.7 using the formula (3.7) for the density of the
convolution.

Exercise 3.2.2. If $X$ is a random variable such that $E e^{itX} = e^{-t^4}$ for all $t \in \mathbb{R}$, show
that $EX^2 < \infty$. Hint: use Lemma 3.10.

Exercise 3.2.3. Does there exist a random variable $X$ such that $E e^{itX} = e^{-t^4}$?

Exercise 3.2.4. Suppose that $f(t) = E e^{itX}$ is differentiable near $0$ and $f'(t)$ is dif-
ferentiable at zero, i.e. $f''(0)$ is well defined. Prove that $EX^2 < \infty$. Hint: consider
$\varphi(t) = |f(t)|^2 = f(t) f(-t)$, which is the characteristic function of $X_1 - X_2$ for two
independent copies $X_1, X_2$ of $X$; apply l'Hospital's rule to $(1 - \varphi(t))/t^2$ and then
use Fatou's lemma to show that $E(X_1 - X_2)^2 < \infty$; conclude using Fubini's theo-
rem.
Exercise 3.2.5. Prove that a probability distribution on $\mathbb{R}^k$ is uniquely determined
by the probabilities of half-spaces.

Exercise 3.2.6. Prove that if $P * Q = P$ on $\mathbb{R}^k$ then $Q(\{0\}) = 1$. (Hint: use that a
characteristic function is continuous, in particular, in the neighborhood of zero.)
Exercise 3.2.7. Find the characteristic function of the law on $\mathbb{R}$ with density
$\max(1 - |x|, 0)$. Then use the Fourier inversion formula to conclude that $\max(1 -
|t|, 0)$ is a characteristic function of some distribution.

Exercise 3.2.8. Let $f$ be a continuous function on $\mathbb{R}$ with $f(0) = 1$, $f(t) = f(-t)$
for all $t$, and such that $f$ restricted to $[0, \infty)$ is convex, with $f(t) \to 0$ as $t \to \infty$.
Show that $f$ is a characteristic function. (Hint: Approximate the general case by
piecewise linear functions. Then use that $\max(1 - |t|, 0)$ is a characteristic function and that
characteristic functions are closed under convex combinations and the scaling $f(t) \mapsto
f(at)$ for $a \in \mathbb{R}$.)

Exercise 3.2.9. Does there exist a random variable $X$ such that $E e^{itX} = e^{-|t|}$?

Exercise 3.2.10. Show that there are distributions $P_1 \neq P_2$ whose characteristic
functions coincide on some interval $(a, b)$. On the other hand, show that if $P$ is
compactly supported then its characteristic function on any open interval $(a, b)$
uniquely determines $P$. (Hint: Show that in this case $f(z) = \int e^{zx}\, dP(x)$ is analytic
on $\mathbb{C}$.)
Exercise 3.2.11. If $f$ and $f_n$ are characteristic functions of probability measures on
$\mathbb{R}^k$ and $f_n \to f$ pointwise, show that $f_n \to f$ uniformly on compacts.

3.3 Central limit theorem for i.i.d. random variables

We have seen in the Hoeffding-Chernoff inequality in Section 2.1 that if $X_1, \ldots, X_n$
are independent flips of a fair coin, $P(X_i = 0) = P(X_i = 1) = 1/2$, and $\overline{X}_n$ is their
average, then
$$P\big( |\overline{X}_n - 1/2| > t \big) \leq 2 e^{-2nt^2}.$$
If we denote
$$Z_n = 2\sqrt{n}\,\big( \overline{X}_n - 1/2 \big) = \frac{S_n - n\mu}{\sqrt{n\sigma^2}},$$
where $\mu = EX_1 = 1/2$ and $\sigma^2 = \mathrm{Var}(X_1) = 1/4$, then the Hoeffding-Chernoff in-
equality can be rewritten as
$$P(|Z_n| \geq t) \leq 2 e^{-t^2/2}.$$
We will see that this inequality is rather accurate for large sample size $n$, since we
will show that
$$\lim_{n \to \infty} P(|Z_n| \geq t) = \frac{2}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\, dx \overset{t \to \infty}{\sim} \frac{2}{t\sqrt{2\pi}}\, e^{-t^2/2}.$$
The asymptotic equivalence as $t \to \infty$ is a simple exercise, and the large $n$ limit is
a consequence of a more general result,
$$\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^z e^{-x^2/2}\, dx, \tag{3.8}$$
for all $z \in \mathbb{R}$. In other words, $Z_n$ converges in distribution to the standard normal
distribution with the density $e^{-x^2/2}/\sqrt{2\pi}$ and the c.d.f. $\Phi(z)$. Moreover, this holds
for any i.i.d. sequence $(X_n)$ with $\mu = EX_1$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, and
$$Z_n = \frac{S_n - n\mu}{\sqrt{n\sigma^2}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma}. \tag{3.9}$$

Before we use characteristic functions to prove this result as well as its multi-
variate version, let us describe a more intuitive approach via the so-called Linde-
berg method, which emphasizes a bit more explicitly the main reason behind
the central limit theorem, namely, the stability property of the normal distribution
proved in Lemma 3.7.

Theorem 3.6. Consider an i.i.d. sequence $(X_i)_{i \geq 1}$ such that $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2$
and $E|X_1|^3 < \infty$. Then the distribution of $Z_n$ defined in (3.9) converges weakly to
the standard normal distribution $N(0, 1)$.

Remark 3.1. One can easily modify the proof we give below to get rid of the unnec-
essary assumption $E|X_1|^3 < \infty$. This will appear as an exercise in the next section,
where we will state and prove a more general version of the Lindeberg CLT for
non-i.i.d. random variables.

Proof. First of all, notice that the random variables $(X_i - \mu)/\sigma$ have mean $0$ and
variance $1$. Therefore, it is enough to prove the result for
$$Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$$
under the assumption that $EX_1 = 0$, $EX_1^2 = 1$. Let $(g_i)_{i \geq 1}$ be independent standard
normal random variables. Then, by the stability property,
$$Z = \frac{1}{\sqrt{n}} \sum_{i=1}^n g_i$$
also has the standard normal distribution $N(0, 1)$. If, for $1 \leq m \leq n + 1$, we define
$$T_m = \frac{1}{\sqrt{n}} \big( g_1 + \ldots + g_{m-1} + X_m + \ldots + X_n \big)$$
then, for any bounded function $f : \mathbb{R} \to \mathbb{R}$, we can write
$$\big| E f(Z_n) - E f(Z) \big| = \Big| \sum_{m=1}^n \big( E f(T_m) - E f(T_{m+1}) \big) \Big| \leq \sum_{m=1}^n \big| E f(T_m) - E f(T_{m+1}) \big|.$$
If we denote
$$S_m = \frac{1}{\sqrt{n}} \big( g_1 + \ldots + g_{m-1} + X_{m+1} + \ldots + X_n \big)$$
then $T_m = S_m + X_m/\sqrt{n}$ and $T_{m+1} = S_m + g_m/\sqrt{n}$. Suppose now that $f$ has a uni-
formly bounded third derivative. Then, by Taylor's formula,
$$\Big| f(T_m) - f(S_m) - \frac{f'(S_m) X_m}{\sqrt{n}} - \frac{f''(S_m) X_m^2}{2n} \Big| \leq \frac{\|f'''\|_\infty |X_m|^3}{6 n^{3/2}}$$
and
$$\Big| f(T_{m+1}) - f(S_m) - \frac{f'(S_m) g_m}{\sqrt{n}} - \frac{f''(S_m) g_m^2}{2n} \Big| \leq \frac{\|f'''\|_\infty |g_m|^3}{6 n^{3/2}}.$$
Notice that $S_m$ is independent of $X_m$ and $g_m$ and, therefore,
$$E f'(S_m) X_m = E f'(S_m)\, EX_m = 0 = E f'(S_m)\, E g_m = E f'(S_m) g_m$$
and
$$E f''(S_m) X_m^2 = E f''(S_m)\, EX_m^2 = E f''(S_m) = E f''(S_m)\, E g_m^2 = E f''(S_m) g_m^2.$$
As a result, taking expectations and subtracting the above inequalities, we get
$$\big| E f(T_m) - E f(T_{m+1}) \big| \leq \frac{\|f'''\|_\infty \big( E|X_1|^3 + E|g_1|^3 \big)}{6 n^{3/2}}.$$
Adding up over $1 \leq m \leq n$, we proved that
$$\big| E f(Z_n) - E f(Z) \big| \leq \frac{\|f'''\|_\infty \big( E|X_1|^3 + E|g_1|^3 \big)}{6 \sqrt{n}}$$
and $\lim_{n \to \infty} E f(Z_n) = E f(Z)$. We proved in Section 3.1 that this convergence for
$f \in C_b(\mathbb{R})$ implies convergence of the c.d.f.s in (3.8), but the same proof also works
using functions with bounded third derivative. We will leave it as an exercise at the
end of this section. $\Box$
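The rate in the last bound can be watched numerically (an illustrative sketch; taking $f = \cos$ as the smooth test function is our own choice). For Rademacher variables $X_i \in \{-1, +1\}$, $E e^{itX_i} = \cos t$, so $E \cos(Z_n) = (\cos(1/\sqrt{n}))^n$ exactly, while $E \cos(Z) = e^{-1/2}$ for $Z \sim N(0,1)$, and the gap can be computed without any simulation.

```python
import numpy as np

# For Rademacher X_i (P(X_i = +-1) = 1/2), E cos(Z_n) = cos(1/sqrt(n))^n exactly,
# while E cos(Z) = exp(-1/2) for standard normal Z.
def lindeberg_gap(n):
    return abs(np.cos(1 / np.sqrt(n)) ** n - np.exp(-0.5))

for n in [10, 100, 1000, 10000]:
    print(n, lindeberg_gap(n))
```

The theorem only guarantees a gap of order $n^{-1/2}$; the observed decay here is faster (order $1/n$) because $E X_i^3 = 0$ for symmetric $X_i$, so the third-order term partially cancels.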

Now, let us prove the CLT using the method of characteristic functions.

Theorem 3.7 (Central Limit Theorem). Consider an i.i.d. sequence $(X_i)_{i \geq 1}$ such
that $EX_1 = 0$, $EX_1^2 = 1$. Then the distribution of $Z_n = S_n/\sqrt{n}$ converges weakly to
the standard normal distribution $N(0, 1)$.

Proof. We begin by showing that the characteristic function of $S_n/\sqrt{n}$ converges
to the characteristic function of the standard normal distribution $N(0, 1)$,
$$\lim_{n \to \infty} E e^{it \frac{S_n}{\sqrt{n}}} = e^{-\frac{1}{2} t^2},$$
for all $t \in \mathbb{R}$. By independence,
$$E e^{it \frac{S_n}{\sqrt{n}}} = \prod_{i=1}^n E e^{\frac{it X_i}{\sqrt{n}}} = \Big( E e^{\frac{it X_1}{\sqrt{n}}} \Big)^n.$$
Since $EX_1^2 < \infty$, Lemma 3.5 implies that the characteristic function $f(t) = E e^{itX_1} \in
C^2(\mathbb{R})$ and, therefore,
$$f(t) = f(0) + f'(0)\, t + \frac{1}{2} f''(0)\, t^2 + o(t^2) \ \text{ as } t \to 0.$$
Since
$$f(0) = 1, \quad f'(0) = E\, iX e^{i \cdot 0 \cdot X} = i EX = 0, \quad f''(0) = E (iX)^2 = -EX^2 = -1,$$
we get
$$f(t) = 1 - \frac{t^2}{2} + o(t^2).$$
Finally,
$$E e^{\frac{it S_n}{\sqrt{n}}} = \Big( f\Big( \frac{t}{\sqrt{n}} \Big) \Big)^n = \Big( 1 - \frac{t^2}{2n} + o\Big( \frac{t^2}{n} \Big) \Big)^n \to e^{-\frac{1}{2} t^2}$$
as $n \to \infty$. The result then follows from Lévy's continuity theorem in the previous
section. Alternatively, we could also use Lemma 3.9, since it is easy to check that
the sequence of laws $(\mathcal{L}(S_n/\sqrt{n}))_{n \geq 1}$ is uniformly tight. Indeed, by Chebyshev's
inequality,
$$P\Big( \Big| \frac{S_n}{\sqrt{n}} \Big| > M \Big) \leq \frac{1}{M^2}\, \mathrm{Var}\Big( \frac{S_n}{\sqrt{n}} \Big) = \frac{1}{M^2} < \varepsilon$$
for large enough $M$. $\Box$
Next, we will prove a multivariate version of the central limit theorem. For a ran-
dom vector $X = (X_1, \ldots, X_k) \in \mathbb{R}^k$, let $EX = (EX_1, \ldots, EX_k)$ denote its expectation
and
$$\mathrm{Cov}(X) = \big( E X_i X_j \big)_{1 \leq i, j \leq k}$$
its covariance matrix. Again, the main step is to prove convergence of characteris-
tic functions.

Theorem 3.8. Let $(X_i)_{i \geq 1}$ be a sequence of i.i.d. random vectors on $\mathbb{R}^k$ such that
$EX_1 = 0$ and $E|X_1|^2 < \infty$. Then the distribution of $S_n/\sqrt{n}$ converges weakly to the
distribution $P$ with the characteristic function
$$f_P(t) = e^{-\frac{1}{2}(Ct, t)}, \tag{3.10}$$
where $C = \mathrm{Cov}(X_1)$.

Proof. Consider any $t \in \mathbb{R}^k$. Then $Z_i = (t, X_i)$ are i.i.d. real-valued random variables
and, by the central limit theorem on the real line,
$$E \exp\Big( i\Big( t, \frac{S_n}{\sqrt{n}} \Big) \Big) = E \exp\Big( i\, \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i \Big) \to \exp\Big( -\frac{\mathrm{Var}(Z_1)}{2} \Big) = \exp\Big( -\frac{1}{2}(Ct, t) \Big)$$
as $n \to \infty$, since
$$\mathrm{Var}\big( t_1 X_{1,1} + \cdots + t_k X_{1,k} \big) = \sum_{i, j} t_i t_j\, E X_{1,i} X_{1,j} = (Ct, t).$$
The sequence $\big\{ \mathcal{L}\big( \frac{S_n}{\sqrt{n}} \big) \big\}$ is uniformly tight on $\mathbb{R}^k$ since
$$P\Big( \Big| \frac{S_n}{\sqrt{n}} \Big| \geq M \Big) \leq \frac{1}{M^2}\, E \Big| \frac{S_n}{\sqrt{n}} \Big|^2 = \frac{1}{n M^2}\, E \big| (S_{n,1}, \ldots, S_{n,k}) \big|^2 = \frac{1}{n M^2} \sum_{i \leq k} E S_{n,i}^2 = \frac{1}{M^2}\, E|X_1|^2 \xrightarrow{\ M \to \infty\ } 0,$$
so we can use Lemma 3.9 to finish the proof. Alternatively, we can just apply
Lévy's continuity theorem. $\Box$

Multivariate Gaussian distributions. The covariance matrix $C = \mathrm{Cov}(X)$ is sym-
metric and non-negative definite, since
$$(Ct, t) = E (t, X)^2 \geq 0.$$
The unique distribution with the characteristic function in (3.10) is called the multivariate
normal distribution with the covariance $C$ and is denoted by $N(0, C)$. It can also
be defined more constructively as follows. Consider an i.i.d. sequence $g_1, \ldots, g_n$ of
standard normal $N(0, 1)$ random variables and let $g = (g_1, \ldots, g_n)^T$. Given a $k \times n$
matrix $A$, the covariance matrix of $Ag \in \mathbb{R}^k$ is
$$C = \mathrm{Cov}(Ag) = E\, Ag (Ag)^T = A\, E g g^T A^T = A A^T.$$
Since we know the characteristic function of the standard normal random variables
$g_\ell$, the characteristic function of $Ag$ is
$$E \exp i(t, Ag) = E \exp i(A^T t, g) = \prod_{\ell \leq n} E \exp\big( i (A^T t)_\ell\, g_\ell \big) = \prod_{\ell \leq n} \exp\big( -((A^T t)_\ell)^2/2 \big) = \exp\Big( -\frac{1}{2} |A^T t|^2 \Big) = \exp\Big( -\frac{1}{2}\, t^T A A^T t \Big) = \exp\Big( -\frac{1}{2} (Ct, t) \Big),$$
which means that $Ag$ has the distribution $N(0, C)$. On the other hand, given a sym-
metric non-negative definite matrix $C$ one can always find $A$ such that $C = A A^T$.
For example, let $C = Q D Q^T$ be its eigenvalue decomposition, for an orthogonal ma-
trix $Q$ and a diagonal matrix $D$. Since $C$ is non-negative definite, the elements of $D$ are
non-negative. Then one can take $n = k$ and $A = C^{1/2} := Q D^{1/2} Q^T$ or $A = Q D^{1/2}$.
This means that any normal distribution can be generated as a linear transformation
of a vector of i.i.d. standard normal random variables.
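The construction $A = Q D^{1/2}$ above translates directly into code (an illustrative sketch; the particular covariance matrix is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# target covariance: symmetric and non-negative definite
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigenvalue decomposition C = Q D Q^T, then A = Q D^{1/2} satisfies A A^T = C
lam, Q = np.linalg.eigh(C)
A = Q @ np.diag(np.sqrt(lam))

# X = A g has distribution N(0, C); check against the sample covariance
g = rng.standard_normal((2, 500_000))
X = A @ g
print(np.allclose(A @ A.T, C))  # exact algebraic identity
print(np.cov(X))                # close to C
```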
Density in the invertible case. Suppose $\det(C) \neq 0$. Take any invertible $A$ such
that $C = A A^T$, so that $Ag \sim N(0, C)$. Since the density of $g$ is
$$\prod_{\ell \leq k} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} x_\ell^2 \Big) = \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \exp\Big( -\frac{1}{2} |x|^2 \Big),$$
for any Borel set $\Omega \subseteq \mathbb{R}^k$ we can write
$$P(Ag \in \Omega) = P(g \in A^{-1}\Omega) = \int_{A^{-1}\Omega} \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \exp\Big( -\frac{1}{2} |x|^2 \Big)\, dx.$$
Let us now make the change of variables $y = Ax$, or $x = A^{-1} y$. Then
$$P(Ag \in \Omega) = \int_\Omega \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \exp\Big( -\frac{1}{2} |A^{-1} y|^2 \Big) \frac{1}{|\det(A)|}\, dy.$$
But since
$$\det(C) = \det(A A^T) = \det(A) \det(A^T) = \det(A)^2,$$
we have $|\det(A)| = \sqrt{\det(C)}$. Also,
$$|A^{-1} y|^2 = (A^{-1} y)^T (A^{-1} y) = y^T (A^T)^{-1} A^{-1} y = y^T (A A^T)^{-1} y = y^T C^{-1} y.$$
Therefore, we get
$$P(Ag \in \Omega) = \int_\Omega \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \frac{1}{\sqrt{\det(C)}} \exp\Big( -\frac{1}{2}\, y^T C^{-1} y \Big)\, dy.$$
This means that
$$\Big( \frac{1}{\sqrt{2\pi}} \Big)^k \frac{1}{\sqrt{\det(C)}} \exp\Big( -\frac{1}{2}\, y^T C^{-1} y \Big)$$
is the density of the distribution $N(0, C)$ when $C$ is invertible. $\Box$
General case. If $C = Q D Q^T$ is the eigenvalue decomposition of $C$, let us take, for
example, $X = Q D^{1/2} g$ for an i.i.d. standard normal vector $g$, so that $X \sim N(0, C)$. If
$q_1, \ldots, q_k$ are the column vectors of $Q$ then
$$X = Q D^{1/2} g = (\lambda_1^{1/2} g_1)\, q_1 + \ldots + (\lambda_k^{1/2} g_k)\, q_k.$$
Therefore, in the orthonormal coordinate basis $q_1, \ldots, q_k$, the random vector $X$ has
coordinates $\lambda_1^{1/2} g_1, \ldots, \lambda_k^{1/2} g_k$. These coordinates are independent and have nor-
mal distributions with variances $\lambda_1, \ldots, \lambda_k$ correspondingly. When $\det(C) = 0$, i.e.
$C$ is not invertible, some of its eigenvalues will be zero, say, $\lambda_{n+1} = \ldots = \lambda_k = 0$.
Then the distribution will be concentrated on the subspace spanned by the vectors
$q_1, \ldots, q_n$, and it will not have a density on the entire space $\mathbb{R}^k$. It will have the den-
sity
$$f(x_1, \ldots, x_n) = \prod_{\ell \leq n} \frac{1}{\sqrt{2\pi \lambda_\ell}} \exp\Big( -\frac{x_\ell^2}{2\lambda_\ell} \Big)$$
on the subspace spanned by the vectors $q_1, \ldots, q_n$. $\Box$
Let us look at some properties of normal distributions.

Lemma 3.13. If $X \sim N(0, C)$ on $\mathbb{R}^k$ and $A$ is an $m \times k$ matrix then $AX \sim N(0, A C A^T)$
on $\mathbb{R}^m$.

Proof. The characteristic function of $AX$ is
$$E \exp i(t, AX) = E \exp i(A^T t, X) = \exp\Big( -\frac{1}{2} (C A^T t, A^T t) \Big) = \exp\Big( -\frac{1}{2} (A C A^T t, t) \Big),$$
and the statement follows by definition. $\Box$

Lemma 3.14. $X$ is normal on $\mathbb{R}^k$ if and only if $(t, X)$ is normal on $\mathbb{R}$ for all $t \in \mathbb{R}^k$.

Proof. "$\Longrightarrow$". The characteristic function of the real-valued random variable $(t, X)$ is
$$f(\lambda) = E \exp i\lambda(t, X) = E \exp i(\lambda t, X) = \exp\Big( -\frac{1}{2} (C\lambda t, \lambda t) \Big) = \exp\Big( -\frac{1}{2} \lambda^2 (Ct, t) \Big),$$
which implies that $(t, X)$ has the normal distribution $N(0, (Ct, t))$ with variance $(Ct, t)$.

"$\Longleftarrow$". If $(t, X)$ is normal then
$$E \exp i(t, X) = \exp\Big( -\frac{1}{2} (Ct, t) \Big),$$
because the variance of $(t, X)$ is $(Ct, t)$. This means that $X$ is normal $N(0, C)$. $\Box$
Lemma 3.15. Let $Z = (X, Y)$, where $X = (X_1, \ldots, X_i)$ and $Y = (Y_1, \ldots, Y_j)$, and
suppose that $Z$ is normal on $\mathbb{R}^{i+j}$. Then $X$ and $Y$ are independent if and only if
$\mathrm{Cov}(X_m, Y_n) = 0$ for all $m, n$.

Proof. One way is obvious. The other way around, suppose that
$$C = \mathrm{Cov}(Z) = \begin{pmatrix} D & 0 \\ 0 & F \end{pmatrix}.$$
If $t = (t_1, t_2)$ then the characteristic function of $Z$ is
$$E \exp i(t, Z) = \exp\Big( -\frac{1}{2} (Ct, t) \Big) = \exp\Big( -\frac{1}{2} (D t_1, t_1) - \frac{1}{2} (F t_2, t_2) \Big) = E \exp i(t_1, X)\; E \exp i(t_2, Y),$$
which is precisely the characteristic function of independent $X$ and $Y$. By the
uniqueness theorem, $X$ and $Y$ are independent. $\Box$
Exercise 3.3.1. Finish the proof of Theorem 3.6. Namely, show that (3.8) holds if
$\lim_{n \to \infty} E f(Z_n) = E f(Z)$ for all functions $f$ with bounded third derivative.
Exercise 3.3.2. 24% of the residents in a community are members of a minority
group but among the 96 people called for jury duty only 13 are. Does this data
indicate that minorities are less likely to be called for jury duty?
Exercise 3.3.3. Given $0 < t_1 < \ldots < t_k < \infty$, show that the Gaussian distribution
$N(0, (t_i \wedge t_j)_{1 \leq i, j \leq k})$ has a density on $\mathbb{R}^k$. (Hint: either check that the covariance is
non-degenerate, or look up the definition of Brownian motion.)

Exercise 3.3.4. Given a vector of i.i.d. standard normal random variables $g =
(g_1, \ldots, g_k)^T$ and an orthogonal $k \times k$ matrix $V$, show that $Y = V g$ also has i.i.d. stan-
dard normal coordinates.

Exercise 3.3.5. If $g$ is standard normal, prove that $E g F(g) = E F'(g)$ for nice
enough functions $F : \mathbb{R} \to \mathbb{R}$.

Exercise 3.3.6. If $g = (g_1, \ldots, g_k)$ has an arbitrary normal distribution $N(0, C)$ on $\mathbb{R}^k$,
prove that
$$E g_1 F(g) = \sum_{i=1}^k (E g_1 g_i)\, E\, \frac{\partial F}{\partial x_i}(g),$$
where $C_{1i} = E g_1 g_i$, for nice enough functions $F : \mathbb{R}^k \to \mathbb{R}$. Hint: what can you say
about $g_1$ and $g_i - (C_{1i}/C_{11})\, g_1$ for $1 \leq i \leq k$?

3.4 Central limit theorem for triangular arrays

Instead of considering i.i.d. sequences, for each $n \geq 1$ we will now consider a
vector $(X_{n1}, \ldots, X_{nn})$ of independent random variables, not necessarily identically
distributed. This setting is called triangular arrays. Let us denote
$$\mu_{ni} = E X_{ni}, \quad \sigma_{ni}^2 = \mathrm{Var}(X_{ni}) < \infty, \quad D_n^2 = \sum_{i=1}^n \sigma_{ni}^2. \tag{3.11}$$
Consider
$$Z_n = \frac{S_n - E S_n}{\sqrt{\mathrm{Var}(S_n)}} = \frac{1}{D_n} \sum_{i=1}^n \big( X_{ni} - \mu_{ni} \big). \tag{3.12}$$
We have $EZ_n = 0$ and $\mathrm{Var}(Z_n) = 1$, but the variance $\sigma_{ni}^2/D_n^2$ of each summand is
not necessarily $1/n$ as in the case of i.i.d. random variables. One condition that
ensures that $Z_n$ satisfies the CLT is
$$\lim_{n \to \infty} \frac{1}{D_n^3} \sum_{i=1}^n E\big| X_{ni} - \mu_{ni} \big|^3 = 0, \tag{3.13}$$

which is called the Lyapunov condition.

Theorem 3.9. Under the Lyapunov condition (3.13), the distribution of $Z_n$ in (3.12)
converges weakly to the standard Gaussian distribution $N(0, 1)$.

The proof is identical to Theorem 3.6 in the previous section. One small change
is to consider $a_{ni} = \sigma_{ni}/D_n$ and represent $g = \sum_{i=1}^n a_{ni} g_i$. At some point, when
controlling one of the error terms in the Taylor expansion, one also has to use that
$$\sigma_{ni}^3 = \Big( E\big| X_{ni} - \mu_{ni} \big|^2 \Big)^{3/2} \leq E\big| X_{ni} - \mu_{ni} \big|^3.$$
Theorem 3.9 also follows directly from the next more general result.

Theorem 3.10 (Lindeberg's CLT). For each $n \geq 1$, consider a vector $(X_{ni})_{1 \leq i \leq n}$
of independent random variables such that
$$E X_{ni} = 0, \quad \mathrm{Var}(S_n) = \sum_{i=1}^n E(X_{ni})^2 = 1.$$
Suppose that the following Lindeberg condition is satisfied:
$$\sum_{i=1}^n E (X_{ni})^2\, I(|X_{ni}| > \varepsilon) \to 0 \ \text{ as } n \to \infty \ \text{ for all } \varepsilon > 0. \tag{3.14}$$
Then the distribution $\mathcal{L}\big( \sum_{i \leq n} X_{ni} \big) \to N(0, 1)$ weakly.


Remark 3.2. We will prove this result again using characteristic functions, but one
can also use the same argument as in Theorem 3.6 in the previous section. We will
leave it as an exercise below. Also, notice that for i.i.d. random variables this im-
plies the CLT without the assumption $E|X_1|^3 < \infty$, since in that case $X_{ni} = X_i/\sqrt{n}$.

Proof. First of all, the sequence L Âin Xni is uniformly tight, because by
Chebyshev’s inequality
⇣ ⌘ 1
P Â Xni > M  2  e
in M

for large enough M. It remains to show that the characteristic function of Sn cov-
l2
erges to e 2 . For simplicity of notation, let us omit the upper index n and write Xi
instead of Xni . Since Eeil Sn = ’ni=1 Eeil Xi it is enough to show that
n
l2
log Eeil Sn = Â log 1 + Eeil Xi 1 ! . (3.15)
i=1 2

It is not difficult to check, by induction on m, that for any a 2 R,


m
(ia)k |a|m+1
eia   . (3.16)
k=0 k! (m + 1)!

We will leave it as an exercise at the end of the section. Using this for m = 1,

l2
Eeil Xi 1 = Eeil Xi EXi21 il EXi 
2
l2 2 l2 1
 e + EXi2 I(|Xi | > e)  l 2 e 2  (3.17)
2 2 2
for large n by (3.14) and for small enough e. Using Taylor’s expansion of log(1+z)
it is easy to check that
1
log(1 + z) z  |z|2 for |z| 
2
and, therefore, using this and (3.17),
n n
2
 log 1 + Eeil Xi 1 Eeil Xi 1   Eeil Xi 1
i=1 i=1
n
l4 l4 ⌘ n l4
2
Â EXi2  max EXi2 Â EXi2 = max EXi2 .
i=1 4 4 1in i=1 4 1in

This goes to zero, because

max EXi2  e 2 + max EXi2 I(|Xi | > e) ! e 2


1in 1in
92 3 Central Limit Theorem

as $n \to \infty$, by (3.14), and then we let $\varepsilon \downarrow 0$. As a result, to prove (3.15) it remains to show that
$$\sum_{i=1}^n \bigl(E e^{i\lambda X_i} - 1\bigr) \to -\frac{\lambda^2}{2}.$$

Using (3.16) for $m = 1$, on the event $\{|X_i| > \varepsilon\}$,
$$\bigl| e^{i\lambda X_i} - 1 - i\lambda X_i \bigr|\, I\bigl(|X_i| > \varepsilon\bigr) \le \frac{\lambda^2}{2} X_i^2\, I\bigl(|X_i| > \varepsilon\bigr)$$
and, therefore,
$$\Bigl| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2}{2} X_i^2 \Bigr|\, I\bigl(|X_i| > \varepsilon\bigr) \le \lambda^2 X_i^2\, I\bigl(|X_i| > \varepsilon\bigr). \tag{3.18}$$
Using (3.16) for $m = 2$, on the event $\{|X_i| \le \varepsilon\}$,
$$\Bigl| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2}{2} X_i^2 \Bigr|\, I\bigl(|X_i| \le \varepsilon\bigr) \le \frac{|\lambda|^3}{6} |X_i|^3\, I\bigl(|X_i| \le \varepsilon\bigr) \le \frac{|\lambda|^3 \varepsilon}{6} X_i^2. \tag{3.19}$$
Combining the last two equations and using that $E X_i = 0$,
$$\Bigl| E e^{i\lambda X_i} - 1 + \frac{\lambda^2}{2} E X_i^2 \Bigr| \le \lambda^2 E X_i^2\, I\bigl(|X_i| > \varepsilon\bigr) + \frac{|\lambda|^3 \varepsilon}{6} E X_i^2.$$
Finally, since $\sum_{i=1}^n E X_i^2 = 1$, this implies
$$\Bigl| \sum_{i=1}^n \bigl(E e^{i\lambda X_i} - 1\bigr) + \frac{\lambda^2}{2} \Bigr| = \Bigl| \sum_{i=1}^n \Bigl( E e^{i\lambda X_i} - 1 + \frac{\lambda^2}{2} E X_i^2 \Bigr) \Bigr|$$
$$\le \frac{|\lambda|^3 \varepsilon}{6} + \lambda^2 \sum_{i=1}^n E X_i^2\, I(|X_i| > \varepsilon) \to \frac{|\lambda|^3 \varepsilon}{6}$$
as $n \to \infty$, using Lindeberg's condition (3.14). It remains to let $\varepsilon \downarrow 0$. □
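For the simplest possible array $X_{ni} = \pm 1/\sqrt{n}$ with equal probabilities (a symmetric Bernoulli choice of ours, not one from the text), the characteristic function of $S_n$ is exactly $\cos(\lambda/\sqrt{n})^n$, and one can watch the convergence to $e^{-\lambda^2/2}$ numerically:

```python
import math

def chf_sn(lam, n):
    """E exp(i lam S_n) for S_n a sum of n i.i.d. signs +-1/sqrt(n): cos(lam/sqrt(n))^n."""
    return math.cos(lam / math.sqrt(n)) ** n

lam = 1.5
limit = math.exp(-lam ** 2 / 2)
errors = [abs(chf_sn(lam, n) - limit) for n in (10, 100, 1000, 10000)]
print(errors)  # should decrease toward 0
```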

Theorem 3.11 (Multivariate Lindeberg's CLT). For each $n \ge 1$, let $(X_{ni})_{1\le i\le n}$ be independent random vectors in $\mathbb{R}^k$ such that
$$E X_{ni} = 0, \quad \mathrm{Cov}(S_n) = \sum_{i=1}^n \mathrm{Cov}(X_{ni}) \to C \text{ as } n \to \infty.$$
Suppose that the following Lindeberg condition is satisfied:
$$\sum_{i=1}^n E |X_{ni}|^2 I(|X_{ni}| > \varepsilon) \to 0 \text{ as } n \to \infty \text{ for all } \varepsilon > 0. \tag{3.20}$$
Then the distribution $\mathcal{L}\bigl(\sum_{i\le n} X_{ni}\bigr) \to N(0, C)$ weakly.

Just like in the i.i.d. case, using characteristic functions, the proof is reduced to the one-dimensional case. One only needs to observe that, for any $t \in \mathbb{R}^k$, the random variables $(t, X_{ni})$ satisfy the condition (3.14), because $|(t, X_{ni})| \le |t|\,|X_{ni}|$.

Three series theorem. We will now use Lindeberg's CLT above to prove the "three series theorem" for random series. Let us begin with the following observation.

Lemma 3.16. If $P, Q$ are distributions on $\mathbb{R}$ such that $P * Q = P$ then $Q(\{0\}) = 1$.

Proof. Let us consider the characteristic functions
$$f_P(t) = \int e^{itx}\, dP(x), \qquad f_Q(t) = \int e^{itx}\, dQ(x).$$
The condition $P * Q = P$ implies that $f_P(t) f_Q(t) = f_P(t)$. Since $f_P(0) = 1$ and $f_P(t)$ is continuous, for small enough $|t| \le \varepsilon$ we have $|f_P(t)| > 0$ and, as a result, $f_Q(t) = 1$. Since
$$f_Q(t) = \int \cos(tx)\, dQ(x) + i \int \sin(tx)\, dQ(x),$$
for $|t| \le \varepsilon$ this implies that $\int \cos(tx)\, dQ(x) = 1$ and, since $\cos(s) \le 1$, this can happen only if
$$Q\bigl(\{ x : xt = 0 \bmod 2\pi \}\bigr) = 1 \text{ for all } |t| \le \varepsilon.$$
Take $s, t$ such that $|s|, |t| \le \varepsilon$ and $s/t$ is irrational. For $x$ to be in the support of $Q$ we must have $xs = 2\pi k$ and $xt = 2\pi m$ for some integers $k, m$. This can happen only if $x = 0$. □

We proved in Section 2.3 that convergence of random series in probability implies almost sure convergence. It turns out that this result can be strengthened: convergence in distribution also implies convergence in probability and, therefore, almost sure convergence.

Theorem 3.12 (Lévy's equivalence theorem). If $(X_i)$ is a sequence of independent random variables then the series $\sum_{i\ge 1} X_i$ converges almost surely, in probability, or in distribution, at the same time.

Proof. Of course, we only need to show that convergence in distribution implies convergence in probability. Suppose that $\mathcal{L}(S_n) \to P$. Convergence in law implies that $\{\mathcal{L}(S_n)\}$ is uniformly tight, which implies that $\{\mathcal{L}(S_n - S_k)\}_{n,k\ge 1}$ is uniformly tight. We will now show that this implies that, for any $\varepsilon > 0$,
$$P(|S_n - S_k| > \varepsilon) < \varepsilon \tag{3.21}$$
for $n \ge k \ge N$ for large enough $N$. Suppose not. Then there exist $\varepsilon > 0$ and sequences $(n(\ell))$ and $(n'(\ell))$ such that $n(\ell) \le n'(\ell)$ and
$$P\bigl(|S_{n'(\ell)} - S_{n(\ell)}| > \varepsilon\bigr) \ge \varepsilon.$$
Let us denote $Y_\ell = S_{n'(\ell)} - S_{n(\ell)}$. Since $\{\mathcal{L}(Y_\ell)\}$ is uniformly tight, by the selection theorem, there exists a subsequence $(\ell(r))$ such that $\mathcal{L}(Y_{\ell(r)}) \to Q$. Since
$$S_{n'(\ell(r))} = S_{n(\ell(r))} + Y_{\ell(r)} \quad\text{and}\quad \mathcal{L}(S_{n'(\ell(r))}) = \mathcal{L}(S_{n(\ell(r))}) * \mathcal{L}(Y_{\ell(r)}),$$
letting $r \to \infty$ we get that $P = P * Q$. By the above lemma, $Q(\{0\}) = 1$, which implies that $P(|Y_{\ell(r)}| > \varepsilon) \le \varepsilon$ for large $r$, a contradiction. By Lemma 2.6, (3.21) implies that $S_n$ converges in probability. □

Theorem 3.13 (Three series theorem). Let $(X_i)_{i\ge 1}$ be a sequence of independent random variables and let $Z_i = X_i I(|X_i| \le 1)$. Then the series $\sum_{i\ge 1} X_i$ converges almost surely if and only if the following three conditions hold:
(1) $\sum_{i\ge 1} P(|X_i| > 1) < \infty$,
(2) $\sum_{i\ge 1} E Z_i$ converges,
(3) $\sum_{i\ge 1} \mathrm{Var}(Z_i) < \infty$.
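For a series of independent symmetric terms $X_i = \varepsilon_i c_i$ with $|c_i| \le 1$ and $P(\varepsilon_i = \pm 1) = 1/2$ (an illustrative special case of ours, not the general theorem), the three conditions collapse: $P(|X_i| > 1) = 0$, $E Z_i = 0$, and $\mathrm{Var}(Z_i) = c_i^2$, so convergence holds if and only if $\sum c_i^2 < \infty$. A small numerical checker under this assumption, using tail behavior of partial sums as a stand-in for the infinite series:

```python
def three_series_symmetric(c, tol=1e-9):
    """Check the three-series conditions for X_i = eps_i * c_i with |c_i| <= 1 and
    P(eps_i = +-1) = 1/2. Here Z_i = X_i, E Z_i = 0 and Var(Z_i) = c_i**2, so only
    condition (3), sum of c_i**2 < infinity, is at stake. We use partial sums as
    a heuristic numerical stand-in: the variance series looks convergent if the
    second half of the partial sum contributes a negligible fraction."""
    assert all(abs(ci) <= 1 for ci in c)
    tail = sum(ci * ci for ci in c[len(c) // 2:])
    total = sum(ci * ci for ci in c)
    return tail < 0.05 * total + tol

n = 200_000
converging = [1.0 / (i ** 0.8) for i in range(1, n)]  # sum c_i^2 = sum i^{-1.6} < inf
diverging = [1.0 / (i ** 0.4) for i in range(1, n)]   # sum c_i^2 = sum i^{-0.8} = inf
print(three_series_symmetric(converging), three_series_symmetric(diverging))
```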

Proof. "$\Leftarrow$". Suppose (1)-(3) hold. Since
$$\sum_{i\ge 1} P(|X_i| > 1) = \sum_{i\ge 1} P(X_i \ne Z_i) < \infty,$$
by the Borel-Cantelli lemma, $P(X_i \ne Z_i \text{ i.o.}) = 0$, which means that $\sum_{i\ge 1} X_i$ converges if and only if $\sum_{i\ge 1} Z_i$ converges. By (2), it is enough to show that $\sum_{i\ge 1} (Z_i - E Z_i)$ converges, but this follows from Theorem 2.12 by (3).

"$\Rightarrow$". If $\sum_{i\ge 1} X_i$ converges almost surely then $P(|X_i| > 1 \text{ i.o.}) = 0$ and, since the $(X_i)$ are independent, by the Borel-Cantelli lemma, $\sum_{i\ge 1} P(|X_i| > 1) < \infty$. This proves (1).

We will prove (3) by contradiction. If $\sum_{i\ge 1} X_i$ converges then, obviously, $\sum_{i\ge 1} Z_i$ converges. Let us denote $S_{mn} = \sum_{m\le k\le n} Z_k$. Since $S_{mn} \to 0$ as $m, n \to \infty$, $P(|S_{mn}| > \delta) \le \delta$ for any $\delta > 0$ for $m, n$ large enough. Suppose that $\sum_{i\ge 1} \mathrm{Var}(Z_i) = \infty$. Then
$$\sigma_{mn}^2 = \mathrm{Var}(S_{mn}) = \sum_{m\le k\le n} \mathrm{Var}(Z_k) \to \infty$$
as $n \to \infty$ for any fixed $m$. Intuitively, this should not happen, since $S_{mn} \to 0$ in probability but their variances go to infinity. In principle, one can construct such a sequence of random variables, but in our case it will be ruled out by Lindeberg's CLT, as follows. Let us show that, by Lindeberg's theorem,
$$T_{mn} = \frac{S_{mn} - E S_{mn}}{\sigma_{mn}} = \sum_{m\le k\le n} \frac{Z_k - E Z_k}{\sigma_{mn}}$$
converges in distribution to $N(0, 1)$ if $m, n \to \infty$ in such a way that $\sigma_{mn}^2 \to \infty$. We only need to check that
$$\sum_{m\le k\le n} E \Bigl( \frac{Z_k - E Z_k}{\sigma_{mn}} \Bigr)^2 I\Bigl( \Bigl| \frac{Z_k - E Z_k}{\sigma_{mn}} \Bigr| > \varepsilon \Bigr) \to 0$$
if $m, n \to \infty$ in such a way that $\sigma_{mn}^2 \to \infty$. This is obvious, because $|Z_k - E Z_k| < 2$ and $\sigma_{mn} \to \infty$, so the event in the indicator does not occur and the sum is, in fact, equal to $0$ for large $m, n$. Next, since $P(|S_{mn}| > \delta) \le \delta$, we get
$$P\bigl( |S_{mn}| \le \delta \bigr) = P\Bigl( \Bigl| T_{mn} + \frac{E S_{mn}}{\sigma_{mn}} \Bigr| \le \frac{\delta}{\sigma_{mn}} \Bigr) \ge 1 - \delta.$$
But this is impossible, since $T_{mn}$ is approximately standard normal, which cannot concentrate near any constant. Therefore, we proved that $\sum_{i\ge 1} \mathrm{Var}(Z_i) < \infty$. By Kolmogorov's SLLN this implies that $\sum_{i\ge 1} (Z_i - E Z_i)$ converges almost surely and, since $\sum_{i\ge 1} Z_i$ converges almost surely, $\sum_{i\ge 1} E Z_i$ also converges. □

Exercise 3.4.1. Let $(X_n)$ be i.i.d. random variables with continuous distribution $F$. We say that $X_n$ is a record value if $X_n > X_i$ for all $i < n$. If $S_n$ is the number of records up to time $n$, show that $Z_n$ in (3.12) satisfies the CLT.

Exercise 3.4.2. Consider a triangular array of i.i.d. Bernoulli $X_{n1}, \ldots, X_{nn} \sim B(p_n)$, where $p_n$ varies with $n$. If $p_n \to 0$ but $n p_n \to +\infty$, show that $Z_n$ in (3.12) satisfies the CLT.

Exercise 3.4.3. Let $X_1, X_2, \ldots$ be independent random variables such that $X_i \sim B(p_i)$ for some $p_i \in [0, 1]$. If the series $\sum_{i=1}^\infty p_i(1 - p_i)$ diverges, show that $Z_n$ in (3.12) satisfies the CLT.

Exercise 3.4.4. Adapt the argument of Theorem 3.6 in the previous section to prove
Theorem 3.10. Hint: when using the Taylor expansion in that proof, mimic equa-
tions (3.18) and (3.19).

Exercise 3.4.5. Prove (3.16). (Hint: integrate the inequality to make the induction
step.)

Exercise 3.4.6. Consider the random series $\sum_{n\ge 1} \varepsilon_n n^{-a}$ where $(\varepsilon_n)$ are i.i.d. random signs, $P(\varepsilon_n = \pm 1) = 1/2$. Give the necessary and sufficient condition on $a$ for this series to converge.

Exercise 3.4.7. Let $(X_n)_{n\ge 1}$ be i.i.d. random variables such that $E X_1 = 0$ and $0 < E X_1^2 < \infty$. Show that $\sum_{n\ge 1} X_n/a_n$ converges almost surely if and only if $\sum_{n\ge 1} 1/a_n^2 < \infty$.

Exercise 3.4.8. The Gamma distribution with parameters $k, \theta > 0$ has density
$$\frac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}\, I(x > 0).$$
Suppose $X_n$ has $\mathrm{Gamma}(k_n, \theta_n)$ distribution with $k_n = 1 + \sin^2 n$ and $\theta_n = 1/k_n$, and the $X_n$ are independent for $n \ge 1$. Either show that
$$\frac{X_1 + \ldots + X_n - n}{\sqrt{n}}$$
converges in distribution to a Gaussian, or reduce this to a question about the existence of a certain limit. Hint: to compute the limit, look up the strong ergodic theorem.

Exercise 3.4.9. Let $(X_n)_{n\ge 1}$ be i.i.d. random variables with the Poisson distribution with mean $\lambda = 1$. Can you find $a_n$ and $b_n$ such that $a_n \sum_{\ell=1}^n X_\ell \log \ell - b_n$ converges in distribution to a standard Gaussian?

Exercise 3.4.10. Suppose that $X_k$ for $k \ge 1$ are independent random variables and
$$P(X_k = \pm\sqrt{k}) = \frac{\log k}{2k}, \qquad P(X_k = 0) = 1 - \frac{\log k}{k}.$$
If $S_n = \sum_{k=1}^n X_k$ and $D_n^2 = \sum_{k=1}^n \mathrm{Var}(X_k)$, does the distribution of $S_n/D_n$ converge to $N(0, 1)$?

Chapter 4
Metrics on Probability Measures

4.1 Bounded Lipschitz functions

We will now study the class of so-called bounded Lipschitz functions on a metric space $(S, d)$, which will play an important role in the next section, when we deal with convergence of laws on metric spaces. For a function $f : S \to \mathbb{R}$, let us define the Lipschitz semi-norm by
$$\|f\|_L = \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}.$$
(When the cardinality of $S$ is $1$, we set $\|f\|_L = 0$.) Clearly, $\|f\|_L = 0$ if and only if $f$ is constant, so $\|f\|_L$ is not a norm, even though it satisfies the triangle inequality. Let us define the bounded Lipschitz norm by
$$\|f\|_{BL} = \|f\|_L + \|f\|_\infty,$$
where $\|f\|_\infty = \sup_{s\in S} |f(s)|$. Let
$$BL(S, d) = \bigl\{ f : S \to \mathbb{R} : \|f\|_{BL} < \infty \bigr\}$$
be the set of all bounded Lipschitz functions on $(S, d)$. We will now prove several facts about these functions.
Lemma 4.1. If $f, g \in BL(S, d)$ then $fg \in BL(S, d)$ and $\|fg\|_{BL} \le \|f\|_{BL} \|g\|_{BL}$.

Proof. First of all, $\|fg\|_\infty \le \|f\|_\infty \|g\|_\infty$. We can write
$$|f(x)g(x) - f(y)g(y)| \le |f(x)(g(x) - g(y))| + |g(y)(f(x) - f(y))| \le \|f\|_\infty \|g\|_L\, d(x, y) + \|g\|_\infty \|f\|_L\, d(x, y)$$
and, therefore,
$$\|fg\|_{BL} \le \|f\|_\infty \|g\|_\infty + \|f\|_\infty \|g\|_L + \|g\|_\infty \|f\|_L \le \|f\|_{BL} \|g\|_{BL}.$$
This finishes the proof. □
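On a finite metric space both norms can be computed by brute force, which gives a quick sanity check of Lemma 4.1. The point set and the functions below are arbitrary illustrative choices:

```python
def lip_norm(f, points, d):
    """Lipschitz seminorm sup_{x != y} |f(x) - f(y)| / d(x, y) on a finite space."""
    return max((abs(f[x] - f[y]) / d(x, y)
                for x in points for y in points if x != y), default=0.0)

def bl_norm(f, points, d):
    """Bounded Lipschitz norm ||f||_BL = ||f||_L + ||f||_inf."""
    return lip_norm(f, points, d) + max(abs(f[x]) for x in points)

points = [0.0, 0.7, 1.3, 2.0]
d = lambda x, y: abs(x - y)              # the usual metric on the real line
f = {x: x * x - 1.0 for x in points}
g = {x: 1.0 / (1.0 + x) for x in points}
fg = {x: f[x] * g[x] for x in points}

# Lemma 4.1: ||fg||_BL <= ||f||_BL * ||g||_BL
assert bl_norm(fg, points, d) <= bl_norm(f, points, d) * bl_norm(g, points, d) + 1e-12
print(bl_norm(fg, points, d), bl_norm(f, points, d) * bl_norm(g, points, d))
```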

Let us recall the notation $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$, and let $\star$ stand for either $\wedge$ or $\vee$.

Lemma 4.2. The following inequalities hold:
(1) $\|f_1 \star \cdots \star f_k\|_L \le \max_{1\le i\le k} \|f_i\|_L$,
(2) $\|f_1 \star \cdots \star f_k\|_{BL} \le 2 \max_{1\le i\le k} \|f_i\|_{BL}$.

Proof. (1) It is enough to consider $k = 2$. For specificity, take $\star = \vee$. Given $x, y \in S$, suppose that
$$f_1 \vee f_2(x) \ge f_1 \vee f_2(y) = f_1(y).$$
Then
$$|f_1 \vee f_2(y) - f_1 \vee f_2(x)| = f_1 \vee f_2(x) - f_1 \vee f_2(y) \le
\begin{cases} f_1(x) - f_1(y), & \text{if } f_1(x) \ge f_2(x) \\ f_2(x) - f_2(y), & \text{otherwise} \end{cases}
\le \bigl( \|f_1\|_L \vee \|f_2\|_L \bigr)\, d(x, y).$$
The case of $\star = \wedge$ can be considered similarly.

(2) First of all, obviously, $\|f_1 \star \cdots \star f_k\|_\infty \le \max_{1\le i\le k} \|f_i\|_\infty$. Therefore, using (1),
$$\|f_1 \star \cdots \star f_k\|_{BL} \le \max_i \|f_i\|_\infty + \max_i \|f_i\|_L \le 2 \max_i \|f_i\|_{BL}$$
and this finishes the proof. □

Another important fact is the following.

Theorem 4.1 (Extension theorem). Given a set $A \subseteq S$ and a bounded Lipschitz function $f \in BL(A, d)$ on $A$, there exists an extension $h \in BL(S, d)$ such that $f = h$ on $A$ and $\|h\|_{BL} = \|f\|_{BL}$.

Proof. Let us first find an extension such that $\|h\|_L = \|f\|_L$. We will check that the following choice works:
$$h(x) := \sup_{s \in A} \bigl( f(s) - \|f\|_L\, d(x, s) \bigr).$$
First of all, if $x \in A$ then, by the Lipschitz property on $A$,
$$f(s) - \|f\|_L\, d(x, s) \le f(x) \text{ for all } s \in A,$$
which means that $h(x) = f(x)$ for $x \in A$. Next, by the triangle inequality,
$$f(s) - \|f\|_L\, d(x, s) \le f(s) - \|f\|_L\, d(y, s) + \|f\|_L\, d(x, y)$$
for any $x, y \in S$, and taking the supremum over $s \in A$ shows that
$$h(x) \le h(y) + \|f\|_L\, d(x, y),$$
which means that $\|h\|_L = \|f\|_L$. To extend preserving the bounded Lipschitz norm, take
$$h' = (h \wedge \|f\|_\infty) \vee (-\|f\|_\infty).$$
By part (1) of the previous lemma, we see that $\|h'\|_{BL} = \|f\|_{BL}$ and, clearly, $h'$ coincides with $f$ on $A$. □
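The explicit supremum formula from the proof is easy to implement on finite spaces. This sketch extends $f$ from a subset $A$ and checks that $h = f$ on $A$ with the same Lipschitz constant; the specific points and values are made up for illustration:

```python
def lip_norm(f, points, d):
    """Lipschitz seminorm of f (a dict point -> value) on a finite point set."""
    return max((abs(f[x] - f[y]) / d(x, y)
                for x in points for y in points if x != y), default=0.0)

def extend(f, A, S, d):
    """Extension from the proof of Theorem 4.1: h(x) = sup_{s in A} (f(s) - L d(x, s))."""
    L = lip_norm(f, A, d)
    return {x: max(f[s] - L * d(x, s) for s in A) for x in S}

d = lambda x, y: abs(x - y)
A = [0.0, 1.0, 3.0]
S = A + [0.5, 2.0, 4.0]
f = {0.0: 0.0, 1.0: 2.0, 3.0: 1.0}

h = extend(f, A, S, d)
assert all(abs(h[a] - f[a]) < 1e-12 for a in A)            # h = f on A
assert lip_norm(h, S, d) <= lip_norm(f, A, d) + 1e-12      # Lipschitz constant preserved
print(h)
```

Truncating `h` at $\pm\|f\|_\infty$, as in the last step of the proof, would preserve the full $BL$ norm as well.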

Let $(C(S), d_\infty)$ denote the space of continuous real-valued functions on $S$ with
$$d_\infty(f, g) = \sup_{x \in S} |f(x) - g(x)|.$$

To prove the next property of bounded Lipschitz functions, let us first recall the following famous generalization of the Weierstrass theorem. We will give the proof for convenience.

Theorem 4.2 (Stone-Weierstrass theorem). Let $(S, d)$ be a compact metric space and let $\mathcal{F} \subseteq C(S)$ be such that
1. $\mathcal{F}$ is an algebra, i.e. for all $f, g \in \mathcal{F}$, $c \in \mathbb{R}$, we have $cf + g \in \mathcal{F}$, $fg \in \mathcal{F}$.
2. $\mathcal{F}$ separates points, i.e. if $x \ne y \in S$ then there exists $f \in \mathcal{F}$ such that $f(x) \ne f(y)$.
3. $\mathcal{F}$ contains constants.
Then $\mathcal{F}$ is dense in $(C(S), d_\infty)$.

Proof. Consider a bounded $f \in \mathcal{F}$, i.e. $|f(x)| \le M$. The function $x \to |x|$ defined on the interval $[-M, M]$ can be uniformly approximated by polynomials of $x$, by the Weierstrass theorem on the real line or, for example, using Bernstein's polynomials. Therefore, $|f(x)|$ can be uniformly approximated by polynomials of $f(x)$ and, by properties 1 and 3, by functions in $\mathcal{F}$. Therefore, if $\bar{\mathcal{F}}$ is the closure of $\mathcal{F}$ in the $d_\infty$ norm then for any $f \in \bar{\mathcal{F}}$ its absolute value $|f| \in \bar{\mathcal{F}}$. Therefore, for any $f, g \in \bar{\mathcal{F}}$ we have
$$\min(f, g) = \frac{1}{2}(f + g) - \frac{1}{2}|f - g| \in \bar{\mathcal{F}}, \qquad \max(f, g) = \frac{1}{2}(f + g) + \frac{1}{2}|f - g| \in \bar{\mathcal{F}}. \tag{4.1}$$
Given any points $x \ne y$ and $c, d \in \mathbb{R}$, one can always find $f \in \mathcal{F}$ such that $f(x) = c$ and $f(y) = d$. Indeed, by property 2 we can find $g \in \mathcal{F}$ such that $g(x) \ne g(y)$ and, as a result, the system of equations
$$a g(x) + b = c, \qquad a g(y) + b = d$$
has a solution $a, b$. Then the function $f = ag + b$ satisfies the above and it is in $\mathcal{F}$ by 1. Take $h \in C(S)$ and fix $x$. For any $y$ let $f_y \in \bar{\mathcal{F}}$ be such that
$$f_y(x) = h(x), \qquad f_y(y) = h(y).$$
By continuity of $f_y$, for any $y \in S$ there exists an open neighborhood $U_y$ of $y$ such that
$$f_y(s) \ge h(s) - \varepsilon \text{ for } s \in U_y.$$
Since $(U_y)$ is an open cover of the compact $S$, there exists a finite subcover $U_{y_1}, \ldots, U_{y_N}$. Let us define a function
$$f^x(s) = \max\bigl( f_{y_1}(s), \ldots, f_{y_N}(s) \bigr) \in \bar{\mathcal{F}} \text{ by (4.1)}.$$
By construction, it has the following properties:
$$f^x(x) = h(x), \qquad f^x(s) \ge h(s) - \varepsilon \text{ for all } s \in S.$$
Again, by continuity of $f^x(s)$, there exists an open neighborhood $U_x$ of $x$ such that
$$f^x(s) \le h(s) + \varepsilon \text{ for } s \in U_x.$$
Take a finite subcover $U_{x_1}, \ldots, U_{x_M}$ and define
$$h'(s) = \min\bigl( f^{x_1}(s), \ldots, f^{x_M}(s) \bigr) \in \bar{\mathcal{F}} \text{ by (4.1)}.$$
By construction, $h'(s) \le h(s) + \varepsilon$ and $h'(s) \ge h(s) - \varepsilon$ for all $s \in S$, which means that $d_\infty(h', h) \le \varepsilon$. Since $h' \in \bar{\mathcal{F}}$, this proves that $\mathcal{F}$ is dense in $(C(S), d_\infty)$. □
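The first step of the proof, uniform polynomial approximation of the absolute value, can be watched numerically via Bernstein polynomials; this particular scheme is one standard choice, and the target function below is an arbitrary example:

```python
import math

def bernstein(f, n, x):
    """Bernstein polynomial B_n f(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k) on [0, 1]."""
    return sum(f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

# Approximate |2x - 1| on [0, 1], i.e. |t| on [-1, 1] after a change of variable.
f = lambda x: abs(2 * x - 1)
grid = [i / 200 for i in range(201)]
errors = [max(abs(bernstein(f, n, x) - f(x)) for x in grid) for n in (4, 16, 64, 256)]
print(errors)  # uniform error should shrink (roughly like 1/sqrt(n)) as the degree grows
```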

The above results can now be combined to prove the following property of bounded Lipschitz functions.

Corollary 4.1. If $(S, d)$ is a compact metric space then the set of bounded Lipschitz functions $BL(S, d)$ is dense in $(C(S), d_\infty)$.

Proof. We apply the Stone-Weierstrass theorem with $\mathcal{F} = BL(S, d)$. Property 3 is obvious, property 1 follows from Lemma 4.1, and property 2 follows from the extension Theorem 4.1, since a function defined on two points $x \ne y$ with $f(x) \ne f(y)$ can be extended to a bounded Lipschitz function on the entire $S$. □

We will also need another well-known result from analysis. A set $A \subseteq S$ is totally bounded if for any $\varepsilon > 0$ there exists a finite $\varepsilon$-cover of $A$, i.e. a set of points $a_1, \ldots, a_N$ such that $A \subseteq \bigcup_{i\le N} B(a_i, \varepsilon)$, where $B(a, \varepsilon) = \{ y \in S : d(a, y) \le \varepsilon \}$ is the ball of radius $\varepsilon$ centered at $a$.

Theorem 4.3 (Arzelà-Ascoli). If $(S, d)$ is a compact metric space then a subset $\mathcal{F} \subseteq C(S)$ is totally bounded in the $d_\infty$ metric if and only if $\mathcal{F}$ is equicontinuous and uniformly bounded.

Equicontinuous means that for any $\varepsilon > 0$ there exists $\delta > 0$ such that if $d(x, y) \le \delta$ then, for all $f \in \mathcal{F}$, $|f(x) - f(y)| \le \varepsilon$. The following fact was used in the proof of the Selection Theorem, which was proved for general metric spaces.

Corollary 4.2. If $(S, d)$ is a compact metric space then $C(S)$ is separable in $d_\infty$.

Proof. By Corollary 4.1, $BL(S, d)$ is dense in $C(S)$. For any integer $n \ge 1$, the set $\{ f : \|f\|_{BL} \le n \}$ is, obviously, uniformly bounded and equicontinuous. By the Arzelà-Ascoli theorem, it is totally bounded and, therefore, separable, which can be seen by taking the union of finite $1/m$-covers for all $m \ge 1$. The union
$$\bigcup_{n\ge 1} \{ f : \|f\|_{BL} \le n \} = BL(S, d)$$
is also separable and, since it is dense in $C(S)$, $C(S)$ is separable. □

Exercise 4.1.1. If $F$ is a finite set $\{x_1, \ldots, x_n\}$ and $P$ is a law with $P(F) = 1$, show that for any law $Q$ the supremum $\sup\bigl\{ \int f\, d(P - Q) : \|f\|_L \le 1 \bigr\}$ can be restricted to functions of the form $f(x) = \min_{1\le i\le n} (c_i + d(x, x_i))$.

4.2 Convergence of laws on metric spaces

Let $(S, d)$ be a metric space and $\mathcal{B}$ the Borel $\sigma$-algebra generated by open sets. Let us recall that $P_n \to P$ weakly if
$$\int f\, dP_n \to \int f\, dP$$
for all $f \in C_b(S)$, the real-valued bounded continuous functions on $S$. For a set $A \subseteq S$, we denote by $\bar{A}$ the closure of $A$, by $\operatorname{int} A$ the interior of $A$, and by $\partial A = \bar{A} \setminus \operatorname{int} A$ the boundary of $A$. A measurable set $A$ is called a continuity set of $P$ if $P(\partial A) = 0$.

Theorem 4.4 (Portmanteau theorem). The following are equivalent.
(1) $P_n \to P$ weakly.
(2) $\int f\, dP_n \to \int f\, dP$ for all bounded Lipschitz functions $f$.
(3) For any open set $U \subseteq S$, $\liminf_{n\to\infty} P_n(U) \ge P(U)$.
(4) For any closed set $F \subseteq S$, $\limsup_{n\to\infty} P_n(F) \le P(F)$.
(5) For any continuity set $A$ of $P$, $\lim_{n\to\infty} P_n(A) = P(A)$.

Proof. (1) $\Rightarrow$ (2). Obvious.

(2) $\Rightarrow$ (3). Let $U$ be an open set and $F = U^c$. Consider the sequence of bounded Lipschitz functions
$$f_m(s) = \min\bigl(1, m\, d(s, F)\bigr).$$
Since $F$ is closed, $d(s, F) = 0$ only if $s \in F$ and, therefore, $f_m(s) \uparrow I(s \in U)$ as $m \to \infty$. Since, for each $m \ge 1$,
$$P_n(U) \ge \int f_m\, dP_n,$$
taking the limit $n \to \infty$ and using (2), we get that
$$\liminf_{n\to\infty} P_n(U) \ge \int f_m\, dP.$$
Now, letting $m \to \infty$ and using the monotone convergence theorem implies that
$$\liminf_{n\to\infty} P_n(U) \ge \int I(s \in U)\, dP(s) = P(U),$$
which proves (3).

(3) $\Leftrightarrow$ (4). Obvious by taking complements.

(3) & (4) $\Rightarrow$ (5). Since $\operatorname{int} A$ is open and $\bar{A}$ is closed, and $\operatorname{int} A \subseteq A \subseteq \bar{A}$, by (3) and (4),
$$P(\operatorname{int} A) \le \liminf_{n\to\infty} P_n(\operatorname{int} A) \le \limsup_{n\to\infty} P_n(\bar{A}) \le P(\bar{A}).$$
If $P(\partial A) = 0$ then $P(\bar{A}) = P(\operatorname{int} A) = P(A)$ and, therefore, $\lim_{n\to\infty} P_n(A) = P(A)$.

(5) $\Rightarrow$ (1). Consider $f \in C_b(S)$ and let $F_y = \{ s \in S : f(s) = y \}$ be a level set of $f$. There exist at most countably many $y$ such that $P(F_y) > 0$. Therefore, for any $\varepsilon > 0$ we can find a sequence $a_1 \le \ldots \le a_N$ such that
$$\max_k (a_{k+1} - a_k) \le \varepsilon, \qquad P(F_{a_k}) = 0 \text{ for all } k,$$
and the range of $f$ is inside the interval $(a_1, a_N)$. Let
$$B_k = \bigl\{ s \in S : a_k \le f(s) < a_{k+1} \bigr\} \quad\text{and}\quad f_\varepsilon(s) = \sum_k a_k I(s \in B_k).$$
Since $f$ is continuous, $\partial B_k \subseteq F_{a_k} \cup F_{a_{k+1}}$ and $P(\partial B_k) = 0$. By (5),
$$\int f_\varepsilon\, dP_n = \sum_k a_k P_n(B_k) \xrightarrow{n\to\infty} \sum_k a_k P(B_k) = \int f_\varepsilon\, dP.$$
Since, by construction, $|f_\varepsilon(s) - f(s)| \le \varepsilon$, letting $\varepsilon \to 0$ proves that $\int f\, dP_n \to \int f\, dP$, so $P_n \to P$ weakly. □
Notice that weak convergence depends on the metric only through the topology, which means that if metrics $d$ and $e$ are equivalent then $C_b(S, d) = C_b(S, e)$ and, thus, a specific choice of metric does not affect weak convergence. On the other hand, the spaces of bounded Lipschitz functions $BL(S, d)$ and $BL(S, e)$ could be different, and we can sometimes use this to our advantage by choosing the metric with better properties, as we will see in the next result.
Convergence of empirical measures. Let $(\Omega, \mathcal{A}, P)$ be a probability space and $X_1, X_2, \ldots : \Omega \to S$ an i.i.d. sequence of random variables with values in a metric space $(S, d)$. Let $\mu$ be the law of $X_i$ on $S$. Let us define the random empirical measures $\mu_n$ on the Borel $\sigma$-algebra $\mathcal{B}$ on $S$ by
$$\mu_n(A)(\omega) = \frac{1}{n} \sum_{i=1}^n I\bigl(X_i(\omega) \in A\bigr), \qquad A \in \mathcal{B}.$$
By the strong law of large numbers, for any $f \in C_b(S)$,
$$\int f\, d\mu_n = \frac{1}{n} \sum_{i=1}^n f(X_i) \to E f(X_1) = \int f\, d\mu \quad\text{a.s.}$$
However, the set of measure zero where this convergence is violated depends on $f$, and it is not immediately obvious that the convergence holds for all $f \in C_b(S)$ with probability one. We will need the following lemma.
Lemma 4.3. If $(S, d)$ is separable then there exists a metric $e$ on $S$ such that $(S, e)$ is totally bounded, and $e$ and $d$ define the same topology, i.e. $e(s_n, s) \to 0$ if and only if $d(s_n, s) \to 0$.

Proof. First of all, it is easy to check that $c = d/(1 + d)$ is a metric (that defines the same topology) taking values in $[0, 1]$. If $\{s_n\}$ is a dense subset of $S$, consider the map
$$T(s) = \bigl( c(s, s_n) \bigr)_{n\ge 1} : S \to [0, 1]^{\mathbb{N}}.$$
The key point here is that $t_m \to t$ in $S$, or $\lim_{m\to\infty} c(t_m, t) = 0$, if and only if $\lim_{m\to\infty} c(t_m, s_n) = c(t, s_n)$ for each $n \ge 1$. In other words, $t_m \to t$ if and only if $T(t_m) \to T(t)$ in the Cartesian product $[0, 1]^{\mathbb{N}}$ equipped with the product topology. This topology is compact and metrizable by the metric
$$r\bigl( (u_n)_{n\ge 1}, (v_n)_{n\ge 1} \bigr) = \sum_{n\ge 1} 2^{-n} |u_n - v_n|,$$
so we can now define the metric $e$ on $S$ by $e(s, s') = r(T(s), T(s'))$. Because of what we said above, this metric defines the same topology as $d$. Moreover, it is totally bounded, because the image $T(S)$ of our metric space $S$ under the map $T$ is totally bounded in $([0, 1]^{\mathbb{N}}, r)$, since this space is compact. □

Theorem 4.5 (Varadarajan). Let $(S, d)$ be a separable metric space. Then $\mu_n$ converges to $\mu$ weakly almost surely,
$$P\bigl( \omega : \mu_n(\cdot)(\omega) \to \mu \text{ weakly} \bigr) = 1.$$

Proof. Let $e$ be the metric from the preceding lemma. Clearly, $C_b(S, d) = C_b(S, e)$ and, as we mentioned above, weak convergence of measures is not affected by this change of metric. If $(T, e)$ is the completion of $(S, e)$ then $(T, e)$ is compact. By the Arzelà-Ascoli theorem, $BL(T, e)$ is separable with respect to the $d_\infty$ norm and, therefore, $BL(S, e)$ is also separable. Let $(f_m)$ be a dense subset of $BL(S, e)$. Then, by the strong law of large numbers,
$$\int f_m\, d\mu_n = \frac{1}{n} \sum_{i=1}^n f_m(X_i) \to E f_m(X_1) = \int f_m\, d\mu \quad\text{a.s.}$$
Therefore, on a set of probability one, $\int f_m\, d\mu_n \to \int f_m\, d\mu$ for all $m \ge 1$ and, since $(f_m)$ is dense in $BL(S, e)$, on the same set of probability one this convergence holds for all $f \in BL(S, e)$. By the portmanteau theorem, $\mu_n \to \mu$ weakly on this set of probability one. □
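The convergence of empirical measures can be watched numerically: sample from a law on $\mathbb{R}$ and compare $\int f\, d\mu_n$ with $\int f\, d\mu$ for a Lipschitz test function. The uniform law and $f(x) = |x - 1/2|$ are arbitrary choices of ours; for them $\int f\, d\mu = 1/4$.

```python
import random

random.seed(1)
f = lambda x: abs(x - 0.5)           # a bounded Lipschitz test function on [0, 1]
true_integral = 0.25                 # integral of f against Uniform[0, 1]

xs = [random.random() for _ in range(100_000)]

def empirical_integral(n):
    """Integral of f with respect to the empirical measure mu_n of the first n samples."""
    return sum(f(x) for x in xs[:n]) / n

errs = [abs(empirical_integral(n) - true_integral) for n in (100, 100_000)]
print(errs)  # the error at n = 100_000 should be tiny
```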

Next, we will introduce two metrics on the set of all probability measures on $(S, d)$ with the Borel $\sigma$-algebra $\mathcal{B}$ and, under some mild conditions, prove that they metrize weak convergence. For a set $A \subseteq S$, let us denote by
$$A^\varepsilon = \bigl\{ y \in S : d(x, y) < \varepsilon \text{ for some } x \in A \bigr\}$$
its open $\varepsilon$-neighborhood. If $P$ and $Q$ are probability distributions on $S$ then
$$\rho(P, Q) = \inf\bigl\{ \varepsilon > 0 : P(A) \le Q(A^\varepsilon) + \varepsilon \text{ for all } A \in \mathcal{B} \bigr\}$$
is called the Lévy-Prohorov distance between $P$ and $Q$.

Lemma 4.4. $\rho$ is a metric on the set of probability laws on $\mathcal{B}$.
Proof. (1) First, let us show that $\rho(Q, P) = \rho(P, Q)$. Suppose that $\rho(P, Q) > \varepsilon$. Then there exists a set $A$ such that $P(A) > Q(A^\varepsilon) + \varepsilon$. Taking complements gives
$$Q(A^{\varepsilon c}) > P(A^c) + \varepsilon \ge P(A^{\varepsilon c \varepsilon}) + \varepsilon,$$
where the last inequality follows from the fact that $A^c \supseteq A^{\varepsilon c \varepsilon}$:
$$a \in A^{\varepsilon c \varepsilon} \implies d(a, A^{\varepsilon c}) < \varepsilon \implies d(a, b) < \varepsilon \text{ for some } b \in A^{\varepsilon c}$$
$$\implies d(a, A) > 0 \ \text{(since } b \notin A^\varepsilon \text{, } d(b, A) \ge \varepsilon \text{)} \implies a \notin A \implies a \in A^c.$$
Therefore, for the set $B = A^{\varepsilon c}$, $Q(B) > P(B^\varepsilon) + \varepsilon$. This means that $\rho(Q, P) > \varepsilon$ and, therefore, $\rho(Q, P) \ge \rho(P, Q)$. By symmetry, $\rho(Q, P) \le \rho(P, Q)$ and, therefore, $\rho(Q, P) = \rho(P, Q)$.

(2) Next, let us show that if $\rho(P, Q) = 0$ then $P = Q$. For any set $F$ and any $n \ge 1$,
$$P(F) \le Q\bigl(F^{1/n}\bigr) + \frac{1}{n}.$$
If $F$ is closed then $F^{1/n} \downarrow F$ as $n \to \infty$ and, by the continuity of measure,
$$P(F) \le \lim_{n\to\infty} Q\bigl(F^{1/n}\bigr) = Q(F).$$
Similarly, $P(F) \ge Q(F)$ and, therefore, $P(F) = Q(F)$ for all closed sets and, therefore, for all Borel sets.

(3) Finally, let us prove the triangle inequality
$$\rho(P, T) \le \rho(P, Q) + \rho(Q, T).$$
If $\rho(P, Q) < x$ and $\rho(Q, T) < y$ then for any set $A$,
$$P(A) \le Q(A^x) + x \le T\bigl((A^x)^y\bigr) + y + x \le T\bigl(A^{x+y}\bigr) + x + y,$$
which means that $\rho(P, T) \le x + y$, and minimizing over $x, y$ proves the triangle inequality. □
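For laws with small finite support, $\rho$ can be brute-forced straight from the definition: check $P(A) \le Q(A^\varepsilon) + \varepsilon$ over all subsets $A$ of the support of $P$ (which suffices for finitely supported $P$; by the symmetry just proved, one direction is enough) and binary-search on $\varepsilon$. A naive exponential sketch, fine for a handful of atoms:

```python
from itertools import combinations

def lp_distance(P, Q, d, tol=1e-6):
    """Naive Levy-Prohorov distance between finitely supported laws P, Q, given as
    dicts point -> mass: the inf of eps with P(A) <= Q(A^eps) + eps for all A,
    found by binary search on eps."""
    supp_p, supp_q = list(P), list(Q)

    def ok(eps):
        for r in range(1, len(supp_p) + 1):
            for A in combinations(supp_p, r):
                p_mass = sum(P[x] for x in A)
                # Mass Q gives to the open eps-neighborhood of A.
                q_mass = sum(Q[y] for y in supp_q if any(d(x, y) < eps for x in A))
                if p_mass > q_mass + eps:
                    return False
        return True

    lo, hi = 0.0, 1.0   # eps = 1 always works, since Q(A^eps) + 1 >= 1 >= P(A)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

d = lambda x, y: abs(x - y)
# For point masses, rho(delta_0, delta_t) = min(t, 1):
print(lp_distance({0.0: 1.0}, {0.3: 1.0}, d), lp_distance({0.0: 1.0}, {5.0: 1.0}, d))
```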
Given probability distributions $P, Q$ on the metric space $(S, d)$, we define the bounded Lipschitz distance between them by
$$\beta(P, Q) = \sup\Bigl\{ \Bigl| \int f\, dP - \int f\, dQ \Bigr| : \|f\|_{BL} \le 1 \Bigr\}.$$
In this case, it is much easier to check that this is a metric.

Lemma 4.5. $\beta$ is a metric on the set of probability laws on $\mathcal{B}$.

Proof. It is obvious that $\beta(P, Q) = \beta(Q, P)$ and that the triangle inequality holds. It remains to prove that if $\beta(P, Q) = 0$ then $P = Q$. Given a closed set $F$ and $U = F^c$, it is easy to see that
$$f_m(x) = m\, d(x, F) \wedge 1 \uparrow I_U$$
as $m \to \infty$. Obviously, $\|f_m\|_{BL} \le m + 1$ and, therefore, $\beta(P, Q) = 0$ implies that $\int f_m\, dP = \int f_m\, dQ$. By the monotone convergence theorem, letting $m \to \infty$ proves that $P(U) = Q(U)$, so $P = Q$. □

Let us now show that on a separable metric space, the metrics $\rho$ and $\beta$ metrize weak convergence. Before we prove this, let us recall the statement of Ulam's theorem, proved in Theorem 1.5 in Section 1.1. Namely, every probability law $P$ on a complete separable metric space $(S, d)$ is tight, which means that for any $\varepsilon > 0$ there exists a compact $K \subseteq S$ such that $P(S \setminus K) \le \varepsilon$.

Theorem 4.6. If $(S, d)$ is separable or $P$ is tight then the following are equivalent:
(1) $P_n \to P$ weakly.
(2) For all $f \in BL(S, d)$, $\int f\, dP_n \to \int f\, dP$.
(3) $\beta(P_n, P) \to 0$.
(4) $\rho(P_n, P) \to 0$.

Remark 4.1. We will prove the implications (1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (4) $\Rightarrow$ (1), and the assumption that $(S, d)$ is separable or $P$ is tight will be used only in one step, to prove (2) $\Rightarrow$ (3).

Proof. (1) $\Rightarrow$ (2). This follows by definition, since $BL(S, d) \subseteq C_b(S)$.

(3) $\Rightarrow$ (4). In fact, we will prove that
$$\rho(P_n, P) \le 2\sqrt{\beta(P_n, P)}. \tag{4.2}$$
Given a Borel set $A \subseteq S$, consider the function
$$f(x) = 0 \vee \Bigl(1 - \frac{1}{\varepsilon}\, d(x, A)\Bigr), \quad\text{so that}\quad I_A \le f \le I_{A^\varepsilon}.$$
Obviously, $\|f\|_{BL} \le 1 + \varepsilon^{-1}$ and we can write
$$P_n(A) \le \int f\, dP_n = \int f\, dP + \Bigl( \int f\, dP_n - \int f\, dP \Bigr)$$
$$\le P(A^\varepsilon) + (1 + \varepsilon^{-1}) \sup\Bigl\{ \Bigl| \int f\, dP_n - \int f\, dP \Bigr| : \|f\|_{BL} \le 1 \Bigr\}$$
$$= P(A^\varepsilon) + (1 + \varepsilon^{-1})\, \beta(P_n, P) \le P(A^\delta) + \delta,$$
where $\delta = \max\bigl(\varepsilon, (1 + \varepsilon^{-1})\beta(P_n, P)\bigr)$. This implies that $\rho(P_n, P) \le \delta$. Since $\varepsilon$ is arbitrary, we can minimize $\delta = \delta(\varepsilon)$ over $\varepsilon$. If we take $\varepsilon = \sqrt{\beta}$ then $\delta = \max(\sqrt{\beta}, \sqrt{\beta} + \beta) = \sqrt{\beta} + \beta$ and
$$\beta \le 1 \implies \rho \le 2\sqrt{\beta}; \qquad \beta \ge 1 \implies \rho \le 1 \le 2\sqrt{\beta}.$$

(4) $\Rightarrow$ (1). Suppose that $\rho(P_n, P) \to 0$, which means that there exists a sequence $\varepsilon_n \downarrow 0$ such that
$$P_n(A) \le P(A^{\varepsilon_n}) + \varepsilon_n \text{ for all measurable } A \subseteq S.$$
If $A$ is closed, then $\bigcap_{n\ge 1} A^{\varepsilon_n} = A$ and, by the continuity of measure,
$$\limsup_{n\to\infty} P_n(A) \le \limsup_{n\to\infty} \bigl( P(A^{\varepsilon_n}) + \varepsilon_n \bigr) = P(A).$$
By the Portmanteau theorem, $P_n \to P$.

(2) $\Rightarrow$ (3). First, let us consider the case when $(S, d)$ is complete. Then, by Ulam's theorem, we can find a compact $K$ such that $P(S \setminus K) \le \varepsilon$. If $(S, d)$ is not separable then we assume that $P$ is tight and, by definition, we can again find a compact $K$ such that $P(S \setminus K) \le \varepsilon$. If we consider the function
$$f(x) = 0 \vee \Bigl(1 - \frac{1}{\varepsilon}\, d(x, K)\Bigr), \quad\text{so that}\quad \|f\|_{BL} \le 1 + \frac{1}{\varepsilon},$$
then
$$P_n(K^\varepsilon) \ge \int f\, dP_n \xrightarrow{n\to\infty} \int f\, dP \ge P(K) \ge 1 - \varepsilon,$$
which implies that for $n$ large enough, $P_n(K^\varepsilon) \ge 1 - 2\varepsilon$. Let
$$B = \bigl\{ f : \|f\|_{BL(S,d)} \le 1 \bigr\}, \qquad B_K = \bigl\{ f\big|_K : f \in B \bigr\} \subseteq C(K),$$
where $f|_K$ denotes the restriction of $f$ to $K$. Since $K$ is compact, by the Arzelà-Ascoli theorem, $B_K$ is totally bounded with respect to $d_\infty$ and, given $\varepsilon > 0$, we can find $k \ge 1$ and $f_1, \ldots, f_k \in B$ such that, for all $f \in B$,
$$\sup_{x \in K} |f(x) - f_j(x)| \le \varepsilon \text{ for some } j \le k.$$
This uniform approximation can also be extended to $K^\varepsilon$. Namely, for any $x \in K^\varepsilon$ take $y \in K$ such that $d(x, y) \le \varepsilon$. Then
$$|f(x) - f_j(x)| \le |f(x) - f(y)| + |f(y) - f_j(y)| + |f_j(y) - f_j(x)| \le \|f\|_L\, d(x, y) + \varepsilon + \|f_j\|_L\, d(x, y) \le 3\varepsilon.$$
Therefore, for large enough $n$, for any $f \in B$,
$$\Bigl| \int f\, dP_n - \int f\, dP \Bigr| \le \Bigl| \int_{K^\varepsilon} f\, dP_n - \int_{K^\varepsilon} f\, dP \Bigr| + \|f\|_\infty \bigl( P_n(K^{\varepsilon c}) + P(K^{\varepsilon c}) \bigr)$$
$$\le \Bigl| \int_{K^\varepsilon} f\, dP_n - \int_{K^\varepsilon} f\, dP \Bigr| + 2\varepsilon + \varepsilon$$
$$\le \Bigl| \int_{K^\varepsilon} f_j\, dP_n - \int_{K^\varepsilon} f_j\, dP \Bigr| + 3\varepsilon + 3\varepsilon + 2\varepsilon + \varepsilon$$
$$\le \Bigl| \int f_j\, dP_n - \int f_j\, dP \Bigr| + 3\varepsilon + 3\varepsilon + 2(2\varepsilon + \varepsilon)$$
$$\le \max_{1\le j\le k} \Bigl| \int f_j\, dP_n - \int f_j\, dP \Bigr| + 12\varepsilon.$$
Finally,
$$\beta(P_n, P) = \sup_{f \in B} \Bigl| \int f\, dP_n - \int f\, dP \Bigr| \le \max_{1\le j\le k} \Bigl| \int f_j\, dP_n - \int f_j\, dP \Bigr| + 12\varepsilon$$
and, using the assumption (2), $\limsup_{n\to\infty} \beta(P_n, P) \le 12\varepsilon$. Letting $\varepsilon \to 0$ finishes the proof.

Now, suppose that $(S, d)$ is separable but not complete. Let $(T, d)$ be the completion of $(S, d)$. We showed in the previous section that every function $f \in BL(S, d)$ can be extended to $\hat{f}$ on $(T, d)$ preserving the norm $\|f\|_{BL}$. This extension is obviously unique, since $S$ is dense in $T$, so there is a one-to-one correspondence $f \leftrightarrow \hat{f}$ between the unit balls of $BL(S, d)$ and $BL(T, d)$. Next, we can also view the measure $P$ as a probability measure on $(T, d)$ as follows. If for any Borel set $A$ in $(T, d)$ we let $\hat{P}(A) = P(A \cap S)$ (we define $\hat{P}_n$ similarly) then $\hat{P}$ is a probability measure on the Borel $\sigma$-algebra on $(T, d)$. Note that $S$ might not be a measurable set in its completion (because of this, there is no natural correspondence between measures on $S$ and measures on its completion $T$) but one can easily check that the sets $A \cap S$ are Borel in $(S, d)$ (this is because open sets in $S$ are intersections of open sets in $T$ with $S$). One can also easily see that, with this definition,
$$\int_S f\, dP = \int_T \hat{f}\, d\hat{P},$$
which means that the statements in (2) and (3) hold simultaneously on $(S, d)$ and $(T, d)$. This proves that (2) $\Rightarrow$ (3) on all separable metric spaces. □

Convergence and uniform tightness. Next, we will make a connection between the above metrics and uniform tightness. Recall that a sequence of laws $(P_n)$ is uniformly tight if, for each $\varepsilon > 0$, there exists a compact set $K \subseteq S$ such that $P_n(S \setminus K) \le \varepsilon$ for all $n \ge 1$. First, we will show that, in some sense, uniform tightness is necessary for convergence of laws.

Theorem 4.7. If $P_n \to P_0$ weakly and each $P_n$ is tight for $n \ge 0$, then $(P_n)_{n\ge 0}$ is uniformly tight.

In particular, by Ulam's theorem, any convergent sequence of laws on a complete separable metric space is uniformly tight.

Proof. Since $P_n \to P_0$ and $P_0$ is tight, by Theorem 4.6 the Lévy-Prohorov metric $\rho(P_n, P_0) \to 0$. Given $\varepsilon > 0$, let us take a compact $K_0$ such that $P_0(K_0) > 1 - \varepsilon/2$. By the definition of $\rho$,
$$1 - \frac{\varepsilon}{2} < P_0(K_0) \le P_n\bigl( K_0^{\rho(P_n, P_0) + 1/n} \bigr) + \rho(P_n, P_0) + \frac{1}{n}$$
and, therefore, if we denote $\alpha(n) := \rho(P_n, P_0) + 1/n$ then $P_n\bigl(K_0^{\alpha(n)}\bigr) > 1 - \varepsilon$ for large enough $n \ge N$.

A probability measure is always closed regular, so any measurable set $A$ can be approximated from inside by a closed subset $F$. Since $P_n$ is tight, we can choose a compact of measure close to one and, intersecting it with the closed subset $F$, we can approximate any set $A$ by a compact subset. Therefore, for $n \ge N$, there exists a compact $K_n \subseteq K_0^{\alpha(n)}$ such that $P_n(K_n) > 1 - \varepsilon$. For $1 \le n < N$, since $P_n$ is tight, we can also find a compact $K_n$ such that $P_n(K_n) > 1 - \varepsilon$. If we now take $L = \overline{\bigcup_{n\ge 0} K_n}$ then
$$P_n(L) \ge P_n(K_n) > 1 - \varepsilon \text{ for all } n \ge 0.$$
It remains to show that $L$ is compact. Consider a sequence $(x_n)$ in $\bigcup_{n\ge 0} K_n$. There are two possibilities. First, if there exists an infinite subsequence $(x_{n(k)})$ that belongs to one of the compacts $K_j$ then it has a converging subsubsequence in $K_j$ and, as a result, in $L$. If not, then there exists a subsequence $(x_{n(k)})$ such that $x_{n(k)} \in K_{m(k)}$ and $m(k) \to \infty$ as $k \to \infty$. Since $K_{m(k)} \subseteq K_0^{\alpha(m(k))}$, there exists $y_k \in K_0$ such that $d(x_{n(k)}, y_k) \le \alpha(m(k))$. Since $K_0$ is compact, the sequence $y_k \in K_0$ has a converging subsequence $y_{k(r)} \to y \in K_0$, which implies that $d(x_{n(k(r))}, y) \to 0$, i.e. $x_{n(k(r))} \to y \in L$. Therefore, $L$ is compact. □

Next, on complete separable metric spaces, we will complement the Selection Theorem by showing how uniform tightness can be expressed in the above metrics.

Theorem 4.8. Let $(S, d)$ be a complete separable metric space and $\mathcal{P}$ a subset of probability laws on $S$. Then the following are equivalent.
(1) $\mathcal{P}$ is uniformly tight.
(2) For any sequence $P_n \in \mathcal{P}$ there exists a converging subsequence $P_{n(k)} \to P$, where $P$ is a law on $S$.
(3) $\mathcal{P}$ has compact closure in the space of probability laws equipped with the Lévy-Prohorov or bounded Lipschitz metric $\rho$ or $\beta$.
(4) $\mathcal{P}$ is totally bounded with respect to $\rho$ or $\beta$.

Remark 4.2. In other words, we are showing that, on complete separable metric spaces, total boundedness on the space of probability measures is equivalent to uniform tightness. The rest is just basic properties of metric spaces. Also, the implications (1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (4) hold without the completeness assumption, and the only implication where completeness will be used is (4) $\Rightarrow$ (1).
Proof. (1) $\Rightarrow$ (2). Any sequence $P_n \in \mathcal{P}$ is uniformly tight and, by the Selection Theorem, there exists a converging subsequence.
(2) $\Rightarrow$ (3). Since $(S, d)$ is separable, by Theorem 4.6, $P_n \to P$ if and only if $\rho(P_n, P) \to 0$ or $\beta(P_n, P) \to 0$. Every sequence in the closure $\bar{\mathcal{P}}$ can be approximated by a sequence in $\mathcal{P}$. That sequence has a converging subsequence that, obviously, converges to an element of $\bar{\mathcal{P}}$, which means that the closure of $\mathcal{P}$ is compact.
(3) $\Rightarrow$ (4). Compact sets are totally bounded and, therefore, if the closure $\bar{\mathcal{P}}$ is compact, the set $\mathcal{P}$ is totally bounded.
(4) $\Rightarrow$ (1). Since $\rho \le 2\sqrt{\beta}$, we will only deal with $\rho$. For any $\varepsilon > 0$, there exists a finite subset $\mathcal{P}_0 \subseteq \mathcal{P}$ such that $\mathcal{P} \subseteq \mathcal{P}_0^\varepsilon$. Since $(S, d)$ is complete and separable, by Ulam's theorem, for each $Q \in \mathcal{P}_0$ there exists a compact $K_Q$ such that $Q(K_Q) > 1 - \varepsilon$. Therefore,
\[ K = \bigcup_{Q \in \mathcal{P}_0} K_Q \text{ is compact and } Q(K) > 1 - \varepsilon \text{ for all } Q \in \mathcal{P}_0. \]
Let $F$ be a finite set such that $K \subseteq F^\varepsilon$ (here we will denote by $F^\varepsilon$ the closed $\varepsilon$-neighborhood of $F$). Since $\mathcal{P} \subseteq \mathcal{P}_0^\varepsilon$, for any $P \in \mathcal{P}$ there exists $Q \in \mathcal{P}_0$ such that $\rho(P, Q) < \varepsilon$ and, therefore,
\[ 1 - \varepsilon \le Q(K) \le Q(F^\varepsilon) \le P(F^{2\varepsilon}) + \varepsilon. \]

Thus, $1 - 2\varepsilon \le P(F^{2\varepsilon})$ for all $P \in \mathcal{P}$. Given $\delta > 0$, take $\varepsilon_m = \delta/2^{m+1}$ and find $F_m$ as above, i.e.
\[ 1 - \frac{\delta}{2^m} \le P\big(F_m^{\delta/2^m}\big). \]
Then
\[ P\Big(\bigcap_{m \ge 1} F_m^{\delta/2^m}\Big) \ge 1 - \sum_{m \ge 1} \frac{\delta}{2^m} = 1 - \delta. \]
Finally, $L = \bigcap_{m \ge 1} F_m^{\delta/2^m}$ is compact because it is closed and totally bounded by construction, and $S$ is complete. $\Box$

Theorem 4.9 (Prokhorov's theorem). The set of probability laws on a complete separable metric space is complete with respect to the metrics $\rho$ and $\beta$.

Proof. If a sequence of laws is Cauchy with respect to $\rho$ or $\beta$ then it is totally bounded and, by the previous theorem, it has a converging subsequence. Obviously, a Cauchy sequence will then converge to the same limit. $\Box$

Exercise 4.2.1. Prove that the space of probability laws $\Pr(S)$ on a compact metric space $(S, d)$ is compact with respect to the metrics $\rho$ and $\beta$.

Exercise 4.2.2. If $(S, d)$ is separable, prove that $\Pr(S)$ is also separable with respect to the metrics $\rho$ and $\beta$. Hint: think about probability measures concentrated on finite subsets of a dense countable set in $S$, and use the metric $\beta$.

Exercise 4.2.3. Let $(S, d)$ be a metric space and $\mathcal{B}$ its Borel $\sigma$-algebra. Let $\Pr(S)$ be the set of all probability measures on $(S, \mathcal{B})$ equipped with the topology of weak convergence. Let $\mathcal{F}$ be the Borel $\sigma$-algebra on $\Pr(S)$ generated by open sets. Consider the map $T : \Pr(S) \times \mathcal{B} \to [0, 1]$ defined by $T(P, A) = P(A)$. Prove that for any fixed $A \in \mathcal{B}$, the function $T(\,\cdot\,, A) : \Pr(S) \to [0, 1]$ is $\mathcal{F}$-measurable. Hint: first, prove this for open sets $A \subseteq S$.

Exercise 4.2.4. Let $P, Q$ be two laws on $\mathbb{R}$ with distribution functions $F, G$ respectively. Let
\[ \lambda(P, Q) := \inf\big\{\varepsilon > 0 : F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon \text{ for all } x\big\} \]
be Levy's metric. (a) Show that $\lambda$ is a metric metrizing convergence of laws on $\mathbb{R}$. (b) Show that $\lambda \le \rho$, but that there exist laws $P_n, Q_n$ with $\lambda(P_n, Q_n) \to 0$ while $\rho(P_n, Q_n) \not\to 0$.

Exercise 4.2.5. Define a specific finite set of laws $\mathcal{F}$ on $[0, 1]$ such that for every law $P$ on $[0, 1]$ there exists $Q \in \mathcal{F}$ with $\rho(P, Q) < 0.1$, where $\rho$ is Prokhorov's metric. Hint: reduce the problem to the metric $\beta$.

Exercise 4.2.6. Let $X_j$ be i.i.d. $N(0, 1)$ random variables. Let $H$ be a Hilbert space with orthonormal basis $\{e_j\}_{j \ge 1}$. Let $X = \sum_{j \ge 1} X_j e_j / j$. For any $\varepsilon > 0$, find a compact $K$ such that $P(X \in K) > 1 - \varepsilon$.

4.3 Strassen’s theorem. Relationships between metrics

Metric for convergence in probability. Let $(\Omega, \mathcal{B}, P)$ be a probability space, $(S, d)$ a separable metric space, and $X, Y : \Omega \to S$ random variables with values in $S$. The quantity
\[ \alpha(X, Y) = \inf\big\{\varepsilon \ge 0 : P(d(X, Y) > \varepsilon) \le \varepsilon\big\} \]
is called the Ky Fan metric on the set $\mathcal{L}^0(\Omega, S)$ of equivalence classes of such random variables, where two random variables are equivalent if they are equal almost surely. If we take a sequence
\[ \varepsilon_k \downarrow \alpha = \alpha(X, Y) \]
then $P(d(X, Y) > \varepsilon_k) \le \varepsilon_k$ and, since $I(d(X, Y) > \varepsilon_k) \uparrow I(d(X, Y) > \alpha)$, by the monotone convergence theorem, $P(d(X, Y) > \alpha) \le \alpha$. Thus, the infimum in the definition of $\alpha(X, Y)$ is attained.
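For an empirical distribution putting mass $1/n$ on each observed distance, the attained infimum can be computed exactly by scanning the order statistics. A minimal sketch, not from the text (the helper name `ky_fan` and equal weights are our assumptions):

```python
def ky_fan(dists):
    """Ky Fan metric alpha = inf{eps >= 0 : P(d > eps) <= eps} for the
    empirical distribution putting mass 1/n on each distance in `dists`.

    If eps >= d_(k) (the k-th smallest distance), then at most n - k of
    the distances exceed eps, so the infimum is attained at one of the
    candidates max(d_(k), (n - k)/n), k = 0, ..., n.
    """
    ds = sorted(dists)
    n = len(ds)
    best = 1.0  # eps = 1 is always feasible since P(d > eps) <= 1
    for k in range(n + 1):
        eps = max(ds[k - 1] if k > 0 else 0.0, (n - k) / n)
        best = min(best, eps)
    return best
```

For example, with half of the mass at distance $0$ and half at distance $1$, feasibility of $P(d > \varepsilon) \le \varepsilon$ forces $\alpha = 1/2$.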

Lemma 4.6. The Ky Fan metric $\alpha$ on $\mathcal{L}^0(\Omega, S)$ metrizes convergence in probability.

Proof. First of all, clearly, $\alpha(X, Y) = 0$ if and only if $X = Y$ almost surely. To prove the triangle inequality,
\[ P\big(d(X, Z) > \alpha(X, Y) + \alpha(Y, Z)\big) \le P\big(d(X, Y) > \alpha(X, Y)\big) + P\big(d(Y, Z) > \alpha(Y, Z)\big) \le \alpha(X, Y) + \alpha(Y, Z), \]
so that $\alpha(X, Z) \le \alpha(X, Y) + \alpha(Y, Z)$. This proves that $\alpha$ is a metric. Next, if $\alpha_n = \alpha(X_n, X) \to 0$ then, for any $\varepsilon > 0$ and large enough $n$ such that $\alpha_n < \varepsilon$,
\[ P(d(X_n, X) > \varepsilon) \le P(d(X_n, X) > \alpha_n) \le \alpha_n \to 0. \]
Conversely, if $X_n \to X$ in probability then, for any $m \ge 1$ and large enough $n \ge n(m)$,
\[ P\Big(d(X_n, X) > \frac{1}{m}\Big) \le \frac{1}{m}, \]
which means that $\alpha_n \le 1/m$ and $\alpha_n \to 0$. $\Box$

Lemma 4.7. For $X, Y \in \mathcal{L}^0(\Omega, S)$, the Levy-Prohorov metric $\rho$ satisfies
\[ \rho(\mathcal{L}(X), \mathcal{L}(Y)) \le \alpha(X, Y). \]
Proof. Take $\varepsilon > \alpha(X, Y)$ so that $P(d(X, Y) \ge \varepsilon) \le \varepsilon$. For any measurable $A \subseteq S$,
\[ P(X \in A) = P(X \in A,\ d(X, Y) < \varepsilon) + P(X \in A,\ d(X, Y) \ge \varepsilon) \le P(Y \in A^\varepsilon) + \varepsilon, \]
which means that $\rho(\mathcal{L}(X), \mathcal{L}(Y)) \le \varepsilon$. Letting $\varepsilon \downarrow \alpha(X, Y)$ proves the result. $\Box$

We will now prove that, in some sense, the opposite is also true. Let $(S, d)$ be a metric space and $P, Q$ be probability laws on $S$. Suppose that these laws are close in the Levy-Prohorov metric $\rho$. Can we construct random variables $s_1$ and $s_2$, with laws $P$ and $Q$, that are defined on the same probability space and are close to each other in the Ky Fan metric $\alpha$? We will construct a distribution on the product space $S \times S$ such that the coordinates $s_1$ and $s_2$ have marginal distributions $P$ and $Q$ and the distribution is concentrated on a neighbourhood of the diagonal $s_1 = s_2$, with the size of the neighbourhood controlled by $\rho(P, Q)$.
The following result will be a key tool in the proof of the main result of this section. Consider two sets $X$ and $Y$. Given subsets $K \subseteq X \times Y$ and $A \subseteq X$, we define the $K$-image of $A$ by
\[ A^K = \big\{y \in Y : \exists x \in A,\ (x, y) \in K\big\}. \]
A $K$-matching $f$ of $X$ into $Y$ is a one-to-one function (injection) $f : X \to Y$ such that $(x, f(x)) \in K$ for all $x \in X$. We will need the following well-known matching theorem.

Theorem 4.10 (Hall's marriage theorem). If $X, Y$ are finite and, for all $A \subseteq X$,
\[ \mathrm{card}(A^K) \ge \mathrm{card}(A) \tag{4.3} \]
then there exists a $K$-matching $f$ of $X$ into $Y$.

Proof. We will prove the result by induction on $m = \mathrm{card}(X)$. The case $m = 1$ is obvious. For each $x \in X$ there exists $y \in Y$ such that $(x, y) \in K$. If there is a matching $f$ of $X \setminus \{x\}$ into $Y \setminus \{y\}$ then defining $f(x) = y$ extends $f$ to $X$. If not, then since $\mathrm{card}(X \setminus \{x\}) < m$, by the induction assumption, condition (4.3) is violated, i.e. there exists a set $A \subseteq X \setminus \{x\}$ such that $\mathrm{card}(A^K \setminus \{y\}) < \mathrm{card}(A)$. But because we also know that $\mathrm{card}(A^K) \ge \mathrm{card}(A)$, this implies that $\mathrm{card}(A^K) = \mathrm{card}(A)$. Since $\mathrm{card}(A) < m$, by induction there exists a matching of $A$ onto $A^K$. If there is a matching of $X \setminus A$ into $Y \setminus A^K$, we can combine it with the matching of $A$ and $A^K$. If not then, again by the induction assumption, there exists $D \subseteq X \setminus A$ such that $\mathrm{card}(D^K \setminus A^K) < \mathrm{card}(D)$. But then
\[ \mathrm{card}\big((A \cup D)^K\big) = \mathrm{card}(D^K \setminus A^K) + \mathrm{card}(A^K) < \mathrm{card}(D) + \mathrm{card}(A) = \mathrm{card}(D \cup A), \]
which contradicts the assumption (4.3). $\Box$
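Hall's theorem is effective on small finite instances: the standard augmenting-path algorithm either finds a $K$-matching of $X$ into $Y$ or certifies that some set $A$ violates (4.3). A minimal sketch, not from the text (the adjacency-dict format and function name are our choices):

```python
def k_matching(adj):
    """adj[x] = iterable of y's with (x, y) in K; vertices are hashable.

    Returns a dict mapping each x to a distinct y with (x, y) in K,
    i.e. a K-matching of all of X into Y, if one exists; otherwise
    returns None, which by Hall's theorem means some A has |A^K| < |A|.
    """
    match = {}  # y -> x currently matched to y

    def augment(x, seen):
        # try to match x, re-matching previously matched vertices if needed
        for y in adj[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        if not augment(x, set()):
            return None
    return {x: y for y, x in match.items()}
```

For instance, `{1: ['a'], 2: ['a']}` has `card({1, 2}^K) = 1 < 2`, so no matching exists and the function returns `None`.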

Theorem 4.11 (Strassen). Suppose that $(S, d)$ is a separable metric space and $\alpha, \beta > 0$. Suppose the laws $P$ and $Q$ are such that, for all measurable sets $F \subseteq S$,
\[ P(F) \le Q(F^\alpha) + \beta. \tag{4.4} \]
Then for any $\varepsilon > 0$ there exist two non-negative measures $\eta, \gamma$ on $S \times S$ such that
1. $\mu = \eta + \gamma$ is a law on $S \times S$ with marginals $P$ and $Q$.
2. $\eta(d(x, y) > \alpha + \varepsilon) = 0$.
3. $\gamma(S \times S) \le \beta + \varepsilon$.
4. $\mu$ is a finite sum of product measures.

Remark 4.3. In the above statement, it is enough to assume that (4.4) holds only for closed sets or only for open sets $F$; moreover, one can replace the open $\alpha$-neighbourhood $F^\alpha = \{s \in S : d(s, F) < \alpha\}$ by the closed $\alpha$-neighbourhood $F^{\alpha]} = \{s \in S : d(s, F) \le \alpha\}$. This is because the set $F^\varepsilon$ is open and $F^{\varepsilon]}$ is closed, and, for example, the condition (4.4) for closed sets implies $P(F) \le P(F^{\varepsilon]}) \le Q((F^{\varepsilon]})^\alpha) + \beta \le Q(F^{\alpha + 2\varepsilon}) + \beta$ for all measurable sets $F$, which simply replaces $\alpha$ by $\alpha + 2\varepsilon$ in (4.4).

Remark 4.4. Condition (4.4) is a relaxation of the definition of the Levy-Prohorov metric, since one can take different $\alpha, \beta > \rho(P, Q)$. Conditions 1 - 3 mean that we can construct a measure $\mu$ on $S \times S$ such that the coordinates $x, y$ have marginal distributions $P, Q$ and are concentrated within distance $\alpha + \varepsilon$ of each other (condition 2), except for a set of measure at most $\beta + \varepsilon$ (condition 3).

Proof. Case A. The proof will proceed in several steps. We will start with the simplest case which, however, contains the main idea. Given small $\varepsilon > 0$, take $n \ge 1$ such that $n\varepsilon > 1$. Suppose that the laws $P, Q$ are uniform on finite subsets $M, N \subseteq S$ of equal cardinality,
\[ \mathrm{card}(M) = \mathrm{card}(N) = n, \quad P(x) = Q(y) = \frac{1}{n} < \varepsilon, \quad x \in M,\ y \in N. \]
Using the condition (4.4), we would like to match as many points from $M$ and $N$ as possible, but only points that are within distance $\alpha$ of each other. To use the matching theorem, we will introduce some auxiliary sets $U$ and $V$ that are not too big, with size controlled by the parameter $\beta$, such that the unions of these sets with $M$ and $N$ satisfy a certain matching condition.
Take an integer $k$ such that $\beta n \le k < (\beta + \varepsilon)n$. Let us take sets $U$ and $V$ such that $k = \mathrm{card}(U) = \mathrm{card}(V)$ and $U, V$ are disjoint from $M, N$. Define
\[ X = M \cup U, \quad Y = N \cup V. \]
Let us define a subset $K \subseteq X \times Y$ such that $(x, y) \in K$ if and only if one of the following holds:
(i) $x \in U$,
(ii) $y \in V$,
(iii) $d(x, y) \le \alpha + \varepsilon$ if $x \in M$, $y \in N$.
This means that the small auxiliary sets can be matched with any points, but only close points, $d(x, y) \le \alpha + \varepsilon$, can be matched in the main sets $M$ and $N$. Consider a set

$A \subseteq X$ with cardinality $\mathrm{card}(A) = r$. If $A \not\subseteq M$ then, by (i), $A^K = Y$ and $\mathrm{card}(A^K) \ge r$. Suppose now that $A \subseteq M$ and, again, we would like to show that $\mathrm{card}(A^K) \ge r$. Using (4.4) and the fact that, by (iii), $A^\alpha \cap N \subseteq A^K \cap N$, we can write
\[ \frac{r}{n} = P(A) \le Q(A^\alpha) + \beta = \frac{1}{n}\mathrm{card}(A^\alpha \cap N) + \beta \le \frac{1}{n}\mathrm{card}(A^K \cap N) + \beta. \]
Therefore,
\[ r = \mathrm{card}(A) \le n\beta + \mathrm{card}(A^K \cap N) \le k + \mathrm{card}(A^K \cap N) = \mathrm{card}(A^K), \]
since $k = \mathrm{card}(V)$ and $A^K = V \cup (A^K \cap N)$. By the matching theorem, there exists a $K$-matching $f$ of $X$ into $Y$. Let
\[ T = \{x \in M : f(x) \in N\}, \]
i.e. the points in $M$ that are matched with points in $N$ at distance $d(x, y) \le \alpha + \varepsilon$. Clearly, $\mathrm{card}(T) \ge n - k$, since at most $k$ points can be matched with a point in $V$, and for $x \in T$, by (iii), $d(x, f(x)) \le \alpha + \varepsilon$. For $x \in M \setminus T$, redefine $f(x)$ to match each such $x$ with a distinct point of $N$ not matched with points of $T$. This defines a matching of $M$ onto $N$. We define measures $\eta$ and $\gamma$ by
\[ \eta = \frac{1}{n}\sum_{x \in T} \delta_{(x, f(x))}, \quad \gamma = \frac{1}{n}\sum_{x \in M \setminus T} \delta_{(x, f(x))}, \]
and let $\mu = \eta + \gamma$. First of all, obviously, $\mu$ has marginals $P$ and $Q$ because each point of $M$ or $N$ appears in the sum $\eta + \gamma$ exactly once with the weight $1/n$. Also,
\[ \eta\big(d(x, f(x)) > \alpha + \varepsilon\big) = 0, \quad \gamma(S \times S) = \frac{\mathrm{card}(M \setminus T)}{n} \le \frac{k}{n} < \beta + \varepsilon. \tag{4.5} \]
Finally, both $\eta$ and $\gamma$ are finite sums of point masses, which are product measures of point masses.
Case B. Suppose now that $P$ and $Q$ are concentrated on finitely many points with rational probabilities. Then we can artificially split all points into "smaller" points of equal probabilities as follows. Let $n$ be such that $n\varepsilon > 1$ and
\[ nP(x),\ nQ(x) \in J = \{1, 2, \ldots, n\}. \]
Define a discrete metric on $J$ by $f(i, j) = \frac{\varepsilon}{2} I(i \ne j)$ and define a metric on $S \times J$ by
\[ e\big((x, i), (y, j)\big) = d(x, y) + f(i, j). \]
Define a measure $P'$ on $S \times J$ as follows. If $P(x) = \frac{j}{n}$ then
\[ P'(x, i) = \frac{1}{n} \quad \text{for } i = 1, \ldots, j. \]

Define $Q'$ similarly. Let us check that the laws $P', Q'$ satisfy the assumptions of Case A with $\alpha + \varepsilon$ instead of $\alpha$. Given a set $F \subseteq S \times J$, define
\[ F_1 = \big\{x \in S : (x, j) \in F \text{ for some } j\big\}. \]
Using (4.4),
\[ P'(F) \le P(F_1) \le Q(F_1^\alpha) + \beta \le Q'(F^{\alpha + \varepsilon}) + \beta, \]
because $f(i, j) < \varepsilon$. By Case A and (4.5), we can construct $\mu' = \eta' + \gamma'$ with marginals $P'$ and $Q'$ such that
\[ \eta'\big(e((x, i), (y, j)) > \alpha + 2\varepsilon\big) = 0, \quad \gamma'\big((S \times J) \times (S \times J)\big) < \beta + \varepsilon. \]
Let $\mu, \eta$, and $\gamma$ be the projections of $\mu', \eta'$, and $\gamma'$ back onto $S \times S$ by the map $((x, i), (y, j)) \to (x, y)$. Then, clearly, $\mu = \eta + \gamma$, $\mu$ has marginals $P$ and $Q$, and $\gamma(S \times S) < \beta + \varepsilon$. Since
\[ e\big((x, i), (y, j)\big) = d(x, y) + f(i, j) \ge d(x, y), \]
we get
\[ \eta\big(d(x, y) > \alpha + 2\varepsilon\big) \le \eta'\big(e((x, i), (y, j)) > \alpha + 2\varepsilon\big) = 0. \]
Finally, $\mu$ is obviously a finite sum of product measures. Of course, we can replace $\varepsilon$ by $\varepsilon/2$.
Case C (general case). Let $P, Q$ be laws on a separable metric space $(S, d)$. Let $A$ be a maximal set such that $d(x, y) \ge \varepsilon$ for all $x, y \in A$. Such a set $A$ is called an $\varepsilon$-packing and it is countable, because $S$ is separable, so $A = \{x_i\}_{i \ge 1}$. Since $A$ is maximal, for each $x \in S$ there exists $y \in A$ such that $d(x, y) < \varepsilon$; otherwise, we could add $x$ to the set $A$. Let us create a partition of $S$ using the $\varepsilon$-balls around the points $x_i$:
\[ B_1 = \{x \in S : d(x, x_1) < \varepsilon\}, \quad B_2 = \{d(x, x_2) < \varepsilon\} \setminus B_1, \]
and, iteratively for $k \ge 2$,
\[ B_k = \{d(x, x_k) < \varepsilon\} \setminus (B_1 \cup \cdots \cup B_{k-1}). \]
Then $\{B_k\}_{k \ge 1}$ is a partition of $S$. Let us discretize the measures $P$ and $Q$ by projecting them onto $\{x_i\}_{i \ge 1}$:
\[ P'(x_k) = P(B_k), \quad Q'(x_k) = Q(B_k). \]
Consider any set $F \subseteq S$. For any point $x \in F$, if $x \in B_k$ then $d(x, x_k) < \varepsilon$, i.e. $x_k \in F^\varepsilon$ and, therefore,
\[ P(F) \le P'(F^\varepsilon). \]
Also, if $x_k \in F$ then $B_k \subseteq F^\varepsilon$ and, therefore,
\[ P'(F) \le P(F^\varepsilon). \]
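For a finite set of points, the maximal $\varepsilon$-packing and the induced partition $\{B_k\}$ can be built greedily. A small sketch under our own conventions (points are real numbers, and the function name is ours, not from the text):

```python
def eps_packing_partition(points, eps, dist=lambda x, y: abs(x - y)):
    """Greedily build a maximal eps-packing (centers pairwise >= eps
    apart) and assign every point to the first center within
    distance < eps, mimicking the blocks B_1, B_2, ... above."""
    centers = []
    for p in points:
        # keep p as a center only if it is eps-separated from all others
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    # B_k = {x : dist(x, centers[k]) < eps} minus the earlier blocks
    blocks = {c: [] for c in centers}
    for p in points:
        for c in centers:
            if dist(p, c) < eps:
                blocks[c].append(p)
                break
    return centers, blocks
```

By maximality, every point lands in some block, and the blocks are disjoint by construction, just as in the partition used in the proof.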

To apply Case B, we need to approximate $P'$ by a measure on a finite number of points with rational probabilities. For large enough $n \ge 1$, let
\[ P''(x_k) = \frac{\lfloor nP'(x_k) \rfloor}{n}. \]
Clearly, as $n \to \infty$, $P''(x_k) \uparrow P'(x_k)$. Notice that only a finite number of points carry non-zero weights $P''(x_k) > 0$. Let $x_0$ be some auxiliary point outside of the sequence $\{x_k\}$ and let us assign to it the remaining probability
\[ P''(x_0) = 1 - \sum_{k \ge 1} P''(x_k). \]
If we take $n$ large enough so that $P''(x_0) < \varepsilon/2$ then
\[ \sum_{k \ge 0} |P''(x_k) - P'(x_k)| \le \varepsilon. \]

All the relations above also hold for $Q, Q'$ and $Q''$, which are defined similarly, and we can assume that the same point $x_0$ plays the role of the auxiliary point for both $P''$ and $Q''$. Given $F \subseteq S$, we can write
\[ P''(F) \le P'(F) + \varepsilon \le P(F^\varepsilon) + \varepsilon \le Q(F^{\varepsilon + \alpha}) + \beta + \varepsilon \le Q'(F^{\alpha + 2\varepsilon}) + \beta + \varepsilon \le Q''(F^{\alpha + 2\varepsilon}) + \beta + 2\varepsilon. \]
By Case B, there exists a decomposition $\mu'' = \eta'' + \gamma''$ on $S \times S$ with marginals $P''$ and $Q''$ such that
\[ \eta''\big(d(x, y) > \alpha + 3\varepsilon\big) = 0, \quad \gamma''(S \times S) \le \beta + 3\varepsilon. \]
Let us also "move" the points $(x_0, x_i)$ and $(x_i, x_0)$ for $i \ge 0$ into the support of $\gamma''$: since the total weight of these points is at most $\varepsilon$, the total weight of $\gamma''$ does not increase by much:
\[ \gamma''(S \times S) \le \beta + 5\varepsilon. \]
It remains to "redistribute" these measures from the sequence $\{x_i\}_{i \ge 0}$ to the entire space $S$ in a way that recovers the marginal distributions $P$ and $Q$ without losing much accuracy. Define a sequence of measures on $S$ by
\[ P_i(C) = \frac{P(C \cap B_i)}{P(B_i)} \text{ if } P(B_i) > 0, \quad \text{and } P_i(C) = 0 \text{ otherwise}, \]
and define $Q_i$ similarly. The measures $P_i$ and $Q_i$ are concentrated on $B_i$. Define
\[ \eta = \sum_{i, j \ge 1} \eta''(x_i, x_j)\,(P_i \times Q_j). \]

Since $\eta''(x_i, x_j) = 0$ unless $d(x_i, x_j) \le \alpha + 3\varepsilon$, the measure $\eta$ is concentrated on the set $\{d(x, y) \le \alpha + 5\varepsilon\}$, because for $x \in B_i$, $y \in B_j$,
\[ d(x, y) \le d(x, x_i) + d(x_i, x_j) + d(x_j, y) \le \varepsilon + (\alpha + 3\varepsilon) + \varepsilon = \alpha + 5\varepsilon. \]
The marginals $u$ and $v$ of $\eta$ satisfy
\[ u(C) := \eta(C \times S) = \sum_{i, j \ge 1} \eta''(x_i, x_j) P_i(C) = \sum_{i \ge 1} \eta''(\{x_i\} \times S) P_i(C) \le \sum_{i \ge 1} P''(x_i) P_i(C) \le \sum_{i \ge 1} P'(x_i) P_i(C) = \sum_{i \ge 1} P(B_i) P_i(C) = P(C) \]

and, similarly, $v(C) := \eta(S \times C) \le Q(C)$. If $u(S) = v(S) = 1$ then $\eta(S \times S) = 1$ and $\eta$ is a probability measure with marginals $P$ and $Q$, so we can take $\gamma = 0$. Otherwise, let $t = 1 - u(S) = 1 - v(S)$ and define
\[ \gamma = \frac{1}{t}(P - u) \times (Q - v). \]
It is easy to check that $\mu = \eta + \gamma$ has marginals $P$ and $Q$. For example,
\[ \mu(C \times S) = \eta(C \times S) + \gamma(C \times S) = u(C) + \frac{1}{t}\big(P(C) - u(C)\big)\big(Q(S) - v(S)\big) = u(C) + \frac{1}{t}\big(P(C) - u(C)\big)\big(1 - v(S)\big) = u(C) + P(C) - u(C) = P(C). \]
Also,
\[ \gamma(S \times S) = t = 1 - \eta(S \times S) = 1 - \eta''(S \times S) = \gamma''(S \times S) \le \beta + 5\varepsilon. \]
Finally, by construction, $\mu$ is a finite sum of product measures. $\Box$
The following relationship between the Ky Fan and Levy-Prohorov metrics is an immediate consequence of Strassen's theorem. We already saw the inequality $\rho(\mathcal{L}(X), \mathcal{L}(Y)) \le \alpha(X, Y)$.

Theorem 4.12. If $(S, d)$ is a separable metric space and $P, Q$ are laws on $S$ then, for any $\varepsilon > 0$, there exist random variables $X$ and $Y$ on the same probability space with the distributions $\mathcal{L}(X) = P$ and $\mathcal{L}(Y) = Q$ such that $\alpha(X, Y) \le \rho(P, Q) + \varepsilon$. If $P$ and $Q$ are tight, one can take $\varepsilon = 0$.

Proof. Let us take $\alpha = \beta = \rho := \rho(P, Q)$. Then, by the definition of the Levy-Prohorov metric, for any $\varepsilon > 0$ and any set $A$,
\[ P(A) \le Q(A^{\rho + \varepsilon}) + \rho + \varepsilon. \]
By Strassen's theorem, there exists a measure $\mu$ on $S \times S$ with the marginals $P, Q$ such that
\[ \mu\big(d(x, y) > \rho + 2\varepsilon\big) \le \rho + 2\varepsilon. \tag{4.6} \]

Therefore, if $X$ and $Y$ are the coordinate projections on $S \times S$, i.e.
\[ X, Y : S \times S \to S, \quad X(x, y) = x, \quad Y(x, y) = y, \]
then, by the definition of the Ky Fan metric, $\alpha(X, Y) \le \rho + 2\varepsilon$. If $P$ and $Q$ are tight then there exists a compact $K$ such that $P(K), Q(K) \ge 1 - \delta$. For $\varepsilon = 1/n$, find $\mu_n$ as in (4.6). Since $\mu_n$ has marginals $P$ and $Q$, $\mu_n(K \times K) \ge 1 - 2\delta$, which means that the sequence $(\mu_n)_{n \ge 1}$ is uniformly tight. By the Selection Theorem, there exists a converging subsequence $\mu_{n(k)} \to \mu$. Obviously, $\mu$ again has marginals $P$ and $Q$. Since, by construction,
\[ \mu_n\Big(d(x, y) > \rho + \frac{2}{n}\Big) \le \rho + \frac{2}{n} \]
and $\{d(x, y) > \rho + \varepsilon\}$ is an open set in $S \times S$, by the Portmanteau Theorem,
\[ \mu\big(d(x, y) > \rho + \varepsilon\big) \le \liminf_{k \to \infty} \mu_{n(k)}\big(d(x, y) > \rho + \varepsilon\big) \le \rho. \]
Letting $\varepsilon \downarrow 0$ we get $\mu(d(x, y) > \rho) \le \rho$. As above, $\alpha(X, Y) \le \rho$ for this measure $\mu$. $\Box$

This also implies the following relationship between the bounded Lipschitz metric $\beta$ and the Levy-Prohorov metric $\rho$.

Lemma 4.8. If $(S, d)$ is a separable metric space then
\[ \beta(P, Q) \le 2\rho(P, Q) \le 4\sqrt{\beta(P, Q)}. \]
Proof. We already proved the second inequality. To prove the first one, by the previous theorem, given $\varepsilon > 0$, we can find random variables $X$ and $Y$ with the distributions $P$ and $Q$ such that $\alpha(X, Y) \le \rho + \varepsilon$. Consider a bounded Lipschitz function $f$, $\|f\|_{BL} < \infty$. Then
\[ \Big|\int f\,dP - \int f\,dQ\Big| = |\mathbb{E} f(X) - \mathbb{E} f(Y)| \le \mathbb{E}|f(X) - f(Y)| \le \|f\|_L(\rho + \varepsilon) + 2\|f\|_\infty P\big(d(X, Y) > \rho + \varepsilon\big) \le \|f\|_L(\rho + \varepsilon) + 2\|f\|_\infty(\rho + \varepsilon) \le 2\|f\|_{BL}(\rho + \varepsilon). \]
Thus, $\beta(P, Q) \le 2(\rho(P, Q) + \varepsilon)$ and letting $\varepsilon \downarrow 0$ finishes the proof. $\Box$

Exercise 4.3.1. Show that the inequality $\beta(P, Q) \le 2\rho(P, Q)$ is sharp in the following sense: for any $t \in (0, 1)$ and $\varepsilon > 0$, one can find laws $P$ and $Q$ on $\mathbb{R}$ such that $\rho(P, Q) = t$ and $\beta(P, Q) \ge 2t - \varepsilon$. Hint: let $P, Q$ have atoms of size $t$ at points far apart, with $P, Q$ otherwise the same.

Exercise 4.3.2. Suppose that $X, Y$ and $U$ are random variables on some probability space such that $P(X \in A) \le P(Y \in A^\delta) + \varepsilon$ for all measurable sets $A \subseteq \mathbb{R}$, and $U$ is independent of $(X, Y)$ and has the uniform distribution on $[0, 1]$. Prove that there exists a measurable function $f : \mathbb{R} \times [0, 1] \to \mathbb{R}$ such that $P\big(|X - f(X, U)| > \delta\big) \le \varepsilon$ and $f(X, U)$ has the same distribution as $Y$.

Exercise 4.3.3. Let $(S, d)$ be a separable metric space, and let $(\Pr(S, d), \rho)$ be the space of probability measures on $S$ equipped with the Levy-Prohorov metric $\rho$. Let $\Pr^{(2)}(S, d) := \Pr(\Pr(S, d))$ be the space of probability measures on $(\Pr(S, d), \rho)$ and denote by $\rho^{(2)}$ the Levy-Prohorov metric on $\Pr^{(2)}(S, d)$. (Elements of $\Pr^{(2)}(S, d)$ can be viewed as laws of random measures on $S$.) Given $P^{(2)} \in \Pr^{(2)}(S, d)$, we can define a two-step process to generate a random variable on $S$: first, generate a probability measure $P \in \Pr(S, d)$ from the measure $P^{(2)}$ and, second, generate a random variable $X$ from $P$, so that the distribution $\mathcal{L}(X)$ of $X$ is given by
\[ \Pr(X \in A) = \int P(A)\,dP^{(2)}(P). \]
Let $Q^{(2)} \in \Pr^{(2)}(S, d)$ be another measure, and generate a random variable $Y$ in the same way. Prove that $\rho(\mathcal{L}(X), \mathcal{L}(Y)) \le 2\rho^{(2)}(P^{(2)}, Q^{(2)})$.

Exercise 4.3.4. Suppose that the probability measures $P, Q$ on a separable metric space $(S, d)$ are tight. Let $M(P, Q)$ be the set of measures on the product space $S \times S$ with marginals $P$ and $Q$. Given $\alpha > 0$, show that
\[ \inf_{\mu \in M(P, Q)} \mu\big(d(x, y) > \alpha\big) = \sup_A \big(P(A) - Q(A^{\alpha]})\big), \]
where the supremum is over all measurable (or closed, or open) sets $A$. Hint: let $\beta$ be the supremum on the right-hand side; see the remark below Theorem 4.11; and notice that $I(x \in A) - I(y \in A^{\alpha]}) \le I(d(x, y) > \alpha)$.

4.4 Wasserstein distance and Kantorovich-Rubinstein theorem

Let $(S, d)$ be a separable metric space. Denote by $\mathcal{P}_1(S)$ the set of all laws on $S$ such that for some $z \in S$ (equivalently, for all $z \in S$),
\[ \int_S d(x, z)\,dP(x) < \infty. \]
Let us consider the set
\[ M(P, Q) = \big\{\mu : \mu \text{ is a law on } S \times S \text{ with marginals } P \text{ and } Q\big\}. \]
For $P, Q \in \mathcal{P}_1(S)$, the quantity
\[ W(P, Q) = \inf\Big\{\int d(x, y)\,d\mu(x, y) : \mu \in M(P, Q)\Big\} \]
is called the Wasserstein distance between $P$ and $Q$. If $P$ and $Q$ are tight, this infimum is attained (see the exercise below).
A measure $\mu \in M(P, Q)$ represents a transportation between the measures $P$ and $Q$. We can think of the conditional distribution $\mu(y\,|\,x)$ as a way to redistribute the mass in the neighborhood of a point $x$ so that the distribution $P$ is redistributed into the distribution $Q$. If the distance $d(x, y)$ represents the cost of moving $x$ to $y$ then the Wasserstein distance gives the optimal total cost of transporting $P$ to $Q$. Given any two laws $P$ and $Q$ on $S$, let us define
\[ \gamma(P, Q) = \sup\Big\{\int f\,dP - \int f\,dQ : \|f\|_L \le 1\Big\} \]
(notice that $\beta(P, Q) \le \gamma(P, Q)$) and
\[ m_d(P, Q) = \sup\Big\{\int f\,dP + \int g\,dQ : f, g \in C(S),\ f(x) + g(y) < d(x, y)\Big\}. \]
Notice that, obviously, for $P, Q \in \mathcal{P}_1(S)$, both $\gamma(P, Q), m_d(P, Q) < \infty$. Let us show that these two quantities are equal.
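On the real line the Wasserstein distance has the closed form $W(P, Q) = \int |F(x) - G(x)|\,dx$ in terms of the distribution functions, which makes it easy to compute for finitely supported laws. A small illustration under our own conventions (laws are point-mass dictionaries; the helper name is ours, not from the text):

```python
def w1_line(p, q):
    """W_1 between finitely supported laws on R, given as {point: mass}
    dicts, computed via W_1(P, Q) = integral of |F(x) - G(x)| dx."""
    xs = sorted(set(p) | set(q))
    total, fp, fq = 0.0, 0.0, 0.0
    for left, right in zip(xs, xs[1:]):
        fp += p.get(left, 0.0)   # CDF of P just to the right of `left`
        fq += q.get(left, 0.0)   # CDF of Q just to the right of `left`
        total += abs(fp - fq) * (right - left)
    return total
```

For example, transporting a unit mass from $0$ to $1$ costs exactly $1$, and splitting mass $1/2$ at $0$ and $1$ onto a single atom at $1/2$ costs $1/2$.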

Lemma 4.9. We have $\gamma(P, Q) = m_d(P, Q)$.

Proof. Given a function $f$ such that $\|f\|_L \le 1$, let us take a small $\varepsilon > 0$ and set $g(y) = -f(y) - \varepsilon$. Then
\[ f(x) + g(y) = f(x) - f(y) - \varepsilon \le d(x, y) - \varepsilon < d(x, y) \]
and
\[ \int f\,dP + \int g\,dQ = \int f\,dP - \int f\,dQ - \varepsilon. \]
Combining, with this choice of $f(x)$ and $g(y) = -f(y) - \varepsilon$ we get
\[ \int f\,dP - \int f\,dQ \le \sup\Big\{\int f\,dP + \int g\,dQ : f, g \in C(S),\ f(x) + g(y) < d(x, y)\Big\} + \varepsilon, \]
which, of course, proves that $\gamma(P, Q) \le m_d(P, Q)$.
Let us now consider functions $f, g$ such that $f(x) + g(y) < d(x, y)$. Define
\[ e(x) = \inf_y\big(d(x, y) - g(y)\big) = -\sup_y\big(g(y) - d(x, y)\big). \]
Clearly, $f(x) \le e(x) \le d(x, x) - g(x) = -g(x)$ and, therefore,
\[ \int f\,dP + \int g\,dQ \le \int e\,dP - \int e\,dQ. \]
The function $e$ satisfies
\[ e(x) - e(x') = \sup_y\big(g(y) - d(x', y)\big) - \sup_y\big(g(y) - d(x, y)\big) \le \sup_y\big(d(x, y) - d(x', y)\big) \le d(x, x'), \]
which means that $\|e\|_L \le 1$ and, therefore, $m_d(P, Q) \le \gamma(P, Q)$. This finishes the proof. $\Box$

Below, we will need the following version of the Hahn-Banach theorem.

Theorem 4.13 (Hahn-Banach). Let $V$ be a normed vector space, $E$ a linear subspace of $V$, and $U$ an open convex set in $V$ such that $U \cap E \ne \emptyset$. If $\rho : E \to \mathbb{R}$ is a non-zero linear functional on $E$ then there exists a linear functional $r : V \to \mathbb{R}$ such that $r|_E = \rho$ and $\sup_U r(x) = \sup_{U \cap E} \rho(x)$.

Proof. Let $t = \sup\{\rho(x) : x \in U \cap E\}$ and let $B = \{x \in E : \rho(x) > t\}$. Since $B$ is convex and $U \cap B = \emptyset$, the Hahn-Banach separation theorem implies that there exists a linear functional $q : V \to \mathbb{R}$ that separates $U$ and $B$. For any $x_0 \in U \cap E$, let $F = \{x \in E : q(x) = q(x_0)\}$. Since $q(x_0) < \inf_B q(x)$, we have $F \cap B = \emptyset$. This means that the hyperplanes $\{x \in E : q(x) = q(x_0)\}$ and $\{x \in E : \rho(x) = t\}$ in the subspace $E$ are parallel, which implies that $q(x) = a\rho(x)$ on $E$ for some $a \ne 0$. Let $r = q/a$. Then $\rho = r|_E$ and
\[ \sup_U r(x) = \frac{1}{a}\sup_U q(x) \le \frac{1}{a}\inf_B q(x) = \inf_B \rho(x) = t = \sup_{U \cap E} \rho(x) = \sup_{U \cap E} r(x). \]
Since $U \supseteq U \cap E$, the inequality is actually an equality, which finishes the proof. $\Box$

Using this, we will prove the following Kantorovich-Rubinstein theorem for compact metric spaces.

Theorem 4.14. If $S$ is a compact metric space then $W(P, Q) = m_d(P, Q) = \gamma(P, Q)$ for $P, Q \in \mathcal{P}_1(S)$.

Proof. We only need to show the first equality. Consider the vector space $V = C(S \times S)$ equipped with the $\|\cdot\|_\infty$ norm and let
\[ U = \big\{h \in V : h(x, y) < d(x, y)\big\}. \]
Obviously, $U$ is convex and open, because $S \times S$ is compact and any continuous function on a compact set achieves its maximum. Consider the linear subspace $E$ of $V$ defined by
\[ E = \big\{\varphi \in V : \varphi(x, y) = f(x) + g(y),\ f, g \in C(S)\big\}, \]
so that
\[ U \cap E = \big\{\varphi \in V : \varphi(x, y) = f(x) + g(y) < d(x, y),\ f, g \in C(S)\big\}. \]
Define a linear functional $\rho$ on $E$ by
\[ \rho(\varphi) = \int f\,dP + \int g\,dQ \quad \text{if } \varphi = f(x) + g(y). \]
By the above Hahn-Banach theorem, $\rho$ can be extended to $r : V \to \mathbb{R}$ such that $r|_E = \rho$ and
\[ \sup_U r(\varphi) = \sup_{U \cap E} \rho(\varphi) = m_d(P, Q) < \infty. \]
Let us look at the properties of this functional. First of all, if $a(x, y) \ge 0$ then $r(a) \ge 0$. Indeed, for any $c \ge 0$,
\[ U \ni d(x, y) - c \cdot a(x, y) - \varepsilon < d(x, y) \]
and, therefore, for all $c \ge 0$,
\[ r(d - ca - \varepsilon) = r(d) - c\,r(a) - r(\varepsilon) \le \sup_U r < \infty. \]
This can hold only if $r(a) \ge 0$. This implies that if $\varphi_1 \le \varphi_2$ then $r(\varphi_1) \le r(\varphi_2)$. For any function $\varphi$, both $\varphi, -\varphi \le \|\varphi\|_\infty \cdot 1$ and, by the monotonicity of $r$,
\[ |r(\varphi)| \le \|\varphi\|_\infty r(1) = \|\varphi\|_\infty. \]
Since $S \times S$ is compact and $r$ is a continuous functional on $(C(S \times S), \|\cdot\|_\infty)$, by the Riesz representation theorem, there exists a unique measure $\mu$ on the Borel $\sigma$-algebra on $S \times S$ such that
\[ r(h) = \int h(x, y)\,d\mu(x, y). \]
Since $r|_E = \rho$,
\[ \int \big(f(x) + g(y)\big)\,d\mu(x, y) = \int f\,dP + \int g\,dQ, \]
which implies that $\mu$ has marginals $P$ and $Q$, i.e. $\mu \in M(P, Q)$. This proves that
\[ m_d(P, Q) = \sup_U r(\varphi) = \sup\Big\{\int h(x, y)\,d\mu(x, y) : h(x, y) < d(x, y)\Big\} = \int d(x, y)\,d\mu(x, y) \ge W(P, Q). \]
The opposite inequality is easy, because for any $f, g$ such that $f(x) + g(y) < d(x, y)$ and any $\nu \in M(P, Q)$,
\[ \int f\,dP + \int g\,dQ = \int \big(f(x) + g(y)\big)\,d\nu(x, y) \le \int d(x, y)\,d\nu(x, y). \tag{4.7} \]
This finishes the proof and, moreover, it shows that the infimum in the definition of $W$ is attained. $\Box$
Remark 4.5. Notice that in the proof of this theorem we never used the fact that $d$ is a metric. The theorem holds for any $d \in C(S \times S)$ under the corresponding integrability assumptions. For example, one can consider loss functions of the type $d(x, y)^p$ for $p > 1$, which are not necessarily metrics. However, in Lemma 4.9, the fact that $d$ is a metric was essential. $\Box$

Our next goal will be to show that $W = \gamma$ on separable, not necessarily compact, metric spaces. We start with the following.
Lemma 4.10. If $(S, d)$ is a separable metric space then $W$ and $\gamma$ are metrics on $\mathcal{P}_1(S)$.

Proof. Since for the bounded Lipschitz metric $\beta$ we have $\beta(P, Q) \le \gamma(P, Q)$, $\gamma$ is also a metric, because if $\gamma(P, Q) = 0$ then $\beta(P, Q) = 0$ and, therefore, $P = Q$. As in (4.7), it should be obvious that $\gamma(P, Q) = m_d(P, Q) \le W(P, Q)$, so if $W(P, Q) = 0$ then $\gamma(P, Q) = 0$ and $P = Q$. The symmetry of $W$ is obvious. It remains to show that $W$ satisfies the triangle inequality. The idea here is very simple, and let us first explain it in the case when $(S, d)$ is complete. Consider three laws $P, Q, T$ on $S$ and let $\mu \in M(P, Q)$ and $\nu \in M(Q, T)$ be such that
\[ \int d(x, y)\,d\mu(x, y) \le W(P, Q) + \varepsilon \quad \text{and} \quad \int d(y, z)\,d\nu(y, z) \le W(Q, T) + \varepsilon. \]
Let us construct a distribution $g$ on $S \times S \times S$ with marginals $P, Q$ and $T$, whose marginals on the pairs of coordinates $(x, y)$ and $(y, z)$ are given by $\mu$ and $\nu$, by "gluing" $\mu$ and $\nu$ in the following way. We know that when $(S, d)$ is complete and separable, there exist regular conditional distributions $\mu(dx\,|\,y)$ and $\nu(dz\,|\,y)$ of the coordinates $x$ and $z$ given $y$. Then, we define the distribution $g$ on $S \times S \times S$ by first generating $y$ from the distribution $Q$ and, given $y$, generating the pair $x$ and $z$ according to the conditional distributions $\mu(dx\,|\,y)$ and $\nu(dz\,|\,y)$ independently of

each other, i.e. according to the product measure on $S \times S$:
\[ g(dx\,dz\,|\,y) = \mu(dx\,|\,y) \times \nu(dz\,|\,y). \]
This is called the "conditionally independent coupling of $x$ and $z$ given $y$". Obviously, by construction, $(x, y)$ has the distribution $\mu$ and $(y, z)$ has the distribution $\nu$. Therefore, the marginals of $x$ and $z$ are $P$ and $T$, which means that the pair $(x, z)$ has a distribution $\eta \in M(P, T)$. Finally,
\[ W(P, T) \le \int d(x, z)\,d\eta(x, z) = \int d(x, z)\,dg(x, y, z) \le \int d(x, y)\,dg + \int d(y, z)\,dg = \int d(x, y)\,d\mu + \int d(y, z)\,d\nu \le W(P, Q) + W(Q, T) + 2\varepsilon. \]
Letting $\varepsilon \to 0$ proves the triangle inequality for $W$. In the case when $(S, d)$ is not complete, we will apply the same basic idea after we discretize the space without losing much in the transportation cost integral. This can be done as in the proof of Strassen's theorem, Case C. Given $\varepsilon > 0$, consider a partition $(S_n)_{n \ge 1}$ of $S$ such that $\mathrm{diam}(S_n) < \varepsilon$ for all $n$. On each box $S_n \times S_m$ let
\[ \mu^1_{nm}(C) = \frac{\mu((C \cap S_n) \times S_m)}{\mu(S_n \times S_m)}, \quad \mu^2_{nm}(C) = \frac{\mu(S_n \times (C \cap S_m))}{\mu(S_n \times S_m)} \]
be the marginal distributions of the conditional distribution of $\mu$ on $S_n \times S_m$. Define
\[ \mu' = \sum_{n, m} \mu(S_n \times S_m)\,\mu^1_{nm} \times \mu^2_{nm}. \]
In this construction, locally on each small box $S_n \times S_m$, the measure $\mu$ is replaced by the product measure with the same marginals. Let us compute the marginals of $\mu'$. Given a set $C \subseteq S$,
\[ \mu'(C \times S) = \sum_{n, m} \mu(S_n \times S_m)\,\mu^1_{nm}(C)\,\mu^2_{nm}(S) = \sum_{n, m} \mu((C \cap S_n) \times S_m) = \sum_n \mu((C \cap S_n) \times S) = \sum_n P(C \cap S_n) = P(C). \]
Similarly, $\mu'(S \times C) = Q(C)$, so $\mu'$ has the same marginals as $\mu$, i.e. $\mu' \in M(P, Q)$. It should be obvious that the transportation cost integral does not change much when we replace $\mu$ with $\mu'$. One can visualize this by looking at what happens locally on each small box $S_n \times S_m$. Let $(X_n, Y_m)$ be a random pair with the distribution $\mu$ restricted to $S_n \times S_m$, so that
\[ \mathbb{E}\,d(X_n, Y_m) = \frac{1}{\mu(S_n \times S_m)} \int_{S_n \times S_m} d(x, y)\,d\mu(x, y). \]
Let $Y'_m$ be an independent copy of $Y_m$, also independent of $X_n$. Then the joint distribution of $(X_n, Y'_m)$ is $\mu^1_{nm} \times \mu^2_{nm}$ and
\[ \mathbb{E}\,d(X_n, Y'_m) = \int_{S_n \times S_m} d(x, y)\,d(\mu^1_{nm} \times \mu^2_{nm})(x, y). \]
Then
\[ \int d(x, y)\,d\mu(x, y) = \sum_{n, m} \mu(S_n \times S_m)\,\mathbb{E}\,d(X_n, Y_m), \quad \int d(x, y)\,d\mu'(x, y) = \sum_{n, m} \mu(S_n \times S_m)\,\mathbb{E}\,d(X_n, Y'_m). \]
Finally, $d(Y_m, Y'_m) \le \mathrm{diam}(S_m) \le \varepsilon$, so these two integrals differ by at most $\varepsilon$. Therefore,
\[ \int d(x, y)\,d\mu'(x, y) \le W(P, Q) + 2\varepsilon. \]
Similarly, we can define
\[ \nu' = \sum_{n, m} \nu(S_n \times S_m)\,\nu^1_{nm} \times \nu^2_{nm} \]
such that
\[ \int d(x, y)\,d\nu'(x, y) \le W(Q, T) + 2\varepsilon. \]
We will now show that the special simple form of the distributions $\mu'(x, y)$, $\nu'(y, z)$ ensures that the conditional distributions of $x$ and $z$ given $y$ are well defined. Let $Q_m$ be the restriction of $Q$ to $S_m$,
\[ Q_m(C) = Q(C \cap S_m) = \sum_n \mu(S_n \times S_m)\,\mu^2_{nm}(C). \]
Obviously, if $Q_m(C) = 0$ then $\mu^2_{nm}(C) = 0$ for all $n$, which means that the $\mu^2_{nm}$ are absolutely continuous with respect to $Q_m$ and the Radon-Nikodym derivatives
\[ f_{nm}(y) = \frac{d\mu^2_{nm}}{dQ_m}(y) \text{ exist, with } \sum_n \mu(S_n \times S_m) f_{nm}(y) = 1 \text{ a.s. for } y \in S_m. \]
Of course, we can set $f_{nm}(y) = 0$ for $y$ outside of $S_m$. Let us define a conditional distribution of $x$ given $y$ by
\[ \mu'(A\,|\,y) = \sum_{n, m} \mu(S_n \times S_m) f_{nm}(y)\,\mu^1_{nm}(A). \]
Notice that for any $A \in \mathcal{B}$, $\mu'(A\,|\,y)$ is measurable in $y$, and $\mu'(\,\cdot\,|\,y)$ is a probability distribution on $\mathcal{B}$ for $Q$-almost all $y$, because
\[ \mu'(S\,|\,y) = \sum_{n, m} \mu(S_n \times S_m) f_{nm}(y) = 1 \quad \text{a.s.} \]
Let us check that for Borel sets $A, B \in \mathcal{B}$,

\[ \mu'(A \times B) = \int_B \mu'(A\,|\,y)\,dQ(y). \]
Indeed, since $f_{nm}(y) = 0$ for $y \notin S_m$,
\[ \int_B \mu'(A\,|\,y)\,dQ(y) = \sum_{n, m} \mu(S_n \times S_m)\,\mu^1_{nm}(A) \int_B f_{nm}(y)\,dQ(y) = \sum_{n, m} \mu(S_n \times S_m)\,\mu^1_{nm}(A) \int_B f_{nm}(y)\,dQ_m(y) = \sum_{n, m} \mu(S_n \times S_m)\,\mu^1_{nm}(A)\,\mu^2_{nm}(B) = \mu'(A \times B). \]
The conditional distribution $\nu'(\,\cdot\,|\,y)$ can be defined similarly, and we finish the proof as in the case of the complete space above. $\Box$
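For discrete couplings the gluing step used in this proof is a finite computation: given $\mu$ on pairs $(x, y)$ and $\nu$ on pairs $(y, z)$ with a common $y$-marginal $Q$, the conditionally independent coupling is $g(x, y, z) = \mu(x, y)\,\nu(y, z)/Q(y)$. A minimal sketch with dictionary-valued measures (our own representation, not from the text):

```python
def glue(mu, nu):
    """Conditionally independent coupling of two discrete couplings.

    mu: {(x, y): mass}; nu: {(y, z): mass}; both are assumed to have
    the same y-marginal Q. Returns g: {(x, y, z): mass} whose
    (x, y)-marginal is mu and whose (y, z)-marginal is nu.
    """
    q = {}  # the common y-marginal Q, recovered from mu
    for (x, y), m in mu.items():
        q[y] = q.get(y, 0.0) + m
    g = {}
    for (x, y), m in mu.items():
        for (y2, z), n in nu.items():
            if y2 == y:
                # g(x, y, z) = mu(x, y) * nu(y, z) / Q(y)
                g[(x, y, z)] = g.get((x, y, z), 0.0) + m * n / q[y]
    return g
```

The pair $(x, z)$ extracted from `g` is then a coupling of the outer marginals, which is exactly how the triangle inequality for $W$ is obtained above.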

The next lemma shows that, on a separable metric space, any law with a "first moment", i.e. $P \in \mathcal{P}_1(S)$, can be approximated in the metrics $W$ and $\gamma$ by laws concentrated on finite sets.

Lemma 4.11. If $(S, d)$ is separable and $P \in \mathcal{P}_1(S)$ then there exists a sequence of laws $P_n$ such that $P_n(F_n) = 1$ for some finite sets $F_n$ and $W(P_n, P), \gamma(P_n, P) \to 0$.

Proof. For each $n \ge 1$, let $(S_{nj})_{j \ge 1}$ be a partition of $S$ such that $\mathrm{diam}(S_{nj}) \le 1/n$. Take a point $x_{nj} \in S_{nj}$ in each set $S_{nj}$ and for $k \ge 1$ define a function
\[ f_{nk}(x) = \begin{cases} x_{nj}, & \text{if } x \in S_{nj} \text{ for } j \le k, \\ x_{n1}, & \text{if } x \in S_{nj} \text{ for } j > k. \end{cases} \]
We have
\[ \int d(x, f_{nk}(x))\,dP(x) = \sum_{j \ge 1} \int_{S_{nj}} d(x, f_{nk}(x))\,dP(x) \le \frac{1}{n}\sum_{j \le k} P(S_{nj}) + \int_{S \setminus (S_{n1} \cup \cdots \cup S_{nk})} d(x, x_{n1})\,dP(x) \le \frac{2}{n} \]
for $k$ large enough, because $P \in \mathcal{P}_1(S)$, i.e. $\int d(x, x_{n1})\,dP(x) < \infty$, and the sets $S \setminus (S_{n1} \cup \cdots \cup S_{nk}) \downarrow \emptyset$.
Let $\mu_n$ be the image on $S \times S$ of the measure $P$ under the map $x \to (f_{nk}(x), x)$, so that $\mu_n \in M(P_n, P)$ for some $P_n$ concentrated on the set of points $\{x_{n1}, \ldots, x_{nk}\}$. Finally,
\[ W(P_n, P) \le \int d(x, y)\,d\mu_n(x, y) = \int d(f_{nk}(x), x)\,dP(x) \le \frac{2}{n}. \]
Since $\gamma(P_n, P) \le W(P_n, P)$, this finishes the proof. $\Box$

We are finally ready to extend Theorem 4.14 to separable metric spaces.



Theorem 4.15 (Kantorovich-Rubinstein theorem). If $(S, d)$ is a separable metric space then $W(P, Q) = \gamma(P, Q)$ for any distributions $P, Q \in \mathcal{P}_1(S)$.

Proof. By the previous lemma, we can approximate $P$ and $Q$ by $P_n$ and $Q_n$ concentrated on finite (hence, compact) sets. By Theorem 4.14, $W(P_n, Q_n) = \gamma(P_n, Q_n)$. Finally, since both $W$ and $\gamma$ are metrics,
\[ W(P, Q) \le W(P, P_n) + W(P_n, Q_n) + W(Q_n, Q) = W(P, P_n) + \gamma(P_n, Q_n) + W(Q_n, Q) \le W(P, P_n) + W(Q_n, Q) + \gamma(P_n, P) + \gamma(Q_n, Q) + \gamma(P, Q). \]
Letting $n \to \infty$ proves that $W(P, Q) \le \gamma(P, Q)$. We saw above that the opposite inequality always holds. $\Box$

**Wasserstein distance $W_p(P, Q)$ on $\mathbb{R}^n$.** We will now prove a version of the Kantorovich-Rubinstein theorem on $\mathbb{R}^n$ in some cases when $d(x, y)$ is not a metric. Given $p \ge 1$, let us define the Wasserstein distance $W_p(P, Q)$ on
\[
\mathcal{P}_p(\mathbb{R}^n) = \Bigl\{ P \text{ -- law on } \mathbb{R}^n : \int |x|^p\, dP(x) < \infty \Bigr\}
\]
corresponding to the cost function $d(x, y) = |x - y|^p$ by
\[
W_p^p(P, Q) = \inf\Bigl\{ \int |x - y|^p\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\}. \tag{4.8}
\]
Even though $d(x, y)$ is not a metric for $p > 1$, $W_p$ is still a metric on $\mathcal{P}_p(\mathbb{R}^n)$, which can be shown the same way as in Lemma 4.10. Namely, given nearly optimal $\mu \in M(P, Q)$ and $\nu \in M(Q, T)$, we can construct a triple $(X, Y, Z)$ with joint law in $M(P, Q, T)$ such that $(X, Y) \sim \mu$ and $(Y, Z) \sim \nu$ and, therefore, by the triangle inequality in $L^p$,
\begin{align*}
W_p(P, T) \le \bigl( \mathbb{E}|X - Z|^p \bigr)^{1/p}
&\le \bigl( \mathbb{E}|X - Y|^p \bigr)^{1/p} + \bigl( \mathbb{E}|Y - Z|^p \bigr)^{1/p} \\
&\le \bigl( W_p^p(P, Q) + \varepsilon \bigr)^{1/p} + \bigl( W_p^p(Q, T) + \varepsilon \bigr)^{1/p}.
\end{align*}
Then, we let $\varepsilon \downarrow 0$. The following dual representation holds.


Theorem 4.16. For any $P, Q \in \mathcal{P}_p(\mathbb{R}^n)$,
\[
W_p(P, Q)^p = \sup\Bigl\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le |x - y|^p \Bigr\}. \tag{4.9}
\]

Proof. We will show below that, for any continuous uniformly bounded function $d(x, y)$ on $\mathbb{R}^n \times \mathbb{R}^n$,
\begin{align}
\inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} \hspace{3cm} \tag{4.10} \\
= \sup\Bigl\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le d(x, y) \Bigr\}. \notag
\end{align}

This will imply (4.9) as follows. Let us take $R \ge 1$ large enough so that, for $K = \{|x| \le R\}$,
\[
\int_{K^c} |x|^p\, dP \le 2^{-p-1} \varepsilon, \qquad \int_{K^c} |x|^p\, dQ \le 2^{-p-1} \varepsilon.
\]
We can find such $R$ because $P, Q \in \mathcal{P}_p(\mathbb{R}^n)$. Let $d(x, y) = |x - y|^p \wedge (2R)^p$. Then, for any $x, y \in K$, we have $d(x, y) = |x - y|^p$ and, for any $\mu \in M(P, Q)$,
\[
\int |x - y|^p\, d\mu(x, y) \le \int d(x, y)\, d\mu(x, y) + \int_{(K \times K)^c} |x - y|^p\, d\mu(x, y).
\]
Let us break the second integral into two integrals over the disjoint sets $\{|x| \ge |y|\}$ and $\{|x| < |y|\}$. On the first set, $|x - y|^p \le 2^p |x|^p$ and, moreover, $x$ cannot belong to $K$: on $(K \times K)^c$ that would force $y \in K^c$, contradicting $|y| \le |x| \le R$. This means that the part of the second integral over this set is bounded by
\[
\int_{K^c \times \mathbb{R}^n} 2^p |x|^p\, d\mu(x, y) = 2^p \int_{K^c} |x|^p\, dP(x) \le \varepsilon / 2.
\]
The part over the set $\{|x| < |y|\}$ is similarly bounded by $\varepsilon / 2$ and, therefore,
\[
\int |x - y|^p\, d\mu(x, y) \le \int d(x, y)\, d\mu(x, y) + \varepsilon.
\]
Together with (4.10), this implies that
\begin{align*}
W_p(P, Q)^p &= \inf\Bigl\{ \int |x - y|^p\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} \\
&\le \inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} + \varepsilon \\
&= \sup\Bigl\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le d(x, y) \Bigr\} + \varepsilon \\
&\le \sup\Bigl\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le |x - y|^p \Bigr\} + \varepsilon.
\end{align*}
Letting $\varepsilon \downarrow 0$ proves the inequality in one direction, and the opposite inequality is always true, as we have seen above, since for any $f, g \in C(\mathbb{R}^n)$ such that $f(x) + g(y) \le |x - y|^p$ and any $\mu \in M(P, Q)$,
\[
\int f\, dP + \int g\, dQ = \int \bigl( f(x) + g(y) \bigr)\, d\mu(x, y) \le \int |x - y|^p\, d\mu(x, y). \tag{4.11}
\]

Therefore, it remains to prove (4.10). Notice that, by adding a constant, we can assume that $d \ge 0$. Again, the inequality $\ge$ in (4.10) is obvious, so we only need to prove $\le$. We will reduce this to the compact case proved in Theorem 4.14. Let us take $R \ge 1$ large enough, let $K = \{|x| \le R\}$, and consider the measures
\[
P_K(C) = \frac{P(C \cap K)}{P(K)}, \qquad Q_K(C) = \frac{Q(C \cap K)}{Q(K)},
\]
and let $\mu_K \in M(P_K, Q_K)$ be the measure on $K \times K$ that achieves the infimum
\[
\int d(x, y)\, d\mu_K(x, y) = \inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P_K, Q_K) \Bigr\}.
\]

If we define the measure $\mu^*$ on $\mathbb{R}^n \times \mathbb{R}^n$ by
\[
\mu^* = P(K) Q(K)\, \mu_K + (P \times Q)\big|_{(K \times K)^c},
\]
it is easy to check that $\mu^* \in M(P, Q)$. Therefore,
\begin{align*}
\inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\}
&\le \int d(x, y)\, d\mu^*(x, y) \\
&\le \int d(x, y)\, d\mu_K(x, y) + \|d\|_\infty\, (P \times Q)\bigl((K \times K)^c\bigr).
\end{align*}

If we take $R$ large enough so that the last term is smaller than $\varepsilon$ then, using the Kantorovich-Rubinstein theorem on compacts, we get
\begin{align*}
\inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\}
&\le \int d(x, y)\, d\mu_K(x, y) + \varepsilon \\
&= \inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P_K, Q_K) \Bigr\} + \varepsilon \\
&= \sup\Bigl\{ \int f\, dP_K + \int g\, dQ_K : f, g \in C(K),\ f(x) + g(y) \le d(x, y) \Bigr\} + \varepsilon \\
&\le \int f\, dP_K + \int g\, dQ_K + 2\varepsilon \tag{4.12}
\end{align*}
for some $f, g \in C(K)$ such that $f(x) + g(y) \le d(x, y)$. To finish the proof, we would like to estimate the above sum of integrals by
\[
\int \varphi\, dP + \int \psi\, dQ
\]
for some functions $\varphi, \psi \in C(\mathbb{R}^n)$ such that $\varphi(x) + \psi(y) \le d(x, y)$. This can be achieved by "improving" our choice of the functions $f$ and $g$ using infimum-convolutions, as we will now explain.
However, before we do this, let us make the following useful observation. Since $d \ge 0$, we can assume that the supremum in (4.12) is strictly positive (otherwise, there is nothing to prove) and
\[
\int f\, dP_K + \int g\, dQ_K \ge 0.
\]
Since we can replace $f$ and $g$ by $f - c$ and $g + c$ without changing the sum of the integrals, we can assume that each integral is nonnegative. This implies that there exist points $x_0, y_0 \in K$ such that $f(x_0) \ge 0$ and $g(y_0) \ge 0$. Since $f(x) + g(y) \le d(x, y)$,
\[
f(x) \le d(x, y_0), \qquad g(y) \le d(x_0, y) \qquad \text{for } x, y \in K. \tag{4.13}
\]
Let us now define the function $\varphi(x)$ for $x \in \mathbb{R}^n$ by
\[
\varphi(x) = \inf_{y \in K} \bigl( d(x, y) - g(y) \bigr).
\]
This is called the infimum-convolution. Clearly, since $f(x) \le d(x, y) - g(y)$ for all $y \in K$, we have $f(x) \le \varphi(x)$ for $x \in K$, which implies that
\[
\int f\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int g\, dQ_K.
\]
Notice that, by definition, $\varphi(x) + g(y) \le d(x, y)$ for all $x \in \mathbb{R}^n$, $y \in K$. Also, using (4.13) and the fact that $g(y_0) \ge 0$,
\[
\inf_{y \in K} \bigl( d(x, y) - d(x_0, y) \bigr) \le \varphi(x) \le d(x, y_0) - g(y_0) \le d(x, y_0). \tag{4.14}
\]

Next, we define the function $\psi(y)$ for $y \in \mathbb{R}^n$ by
\[
\psi(y) = \inf_{x \in \mathbb{R}^n} \bigl( d(x, y) - \varphi(x) \bigr).
\]
Since $g(y) \le d(x, y) - \varphi(x)$ for all $x \in \mathbb{R}^n$ and $y \in K$, we have $g(y) \le \psi(y)$ for $y \in K$, which implies that
\[
\int f\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int \psi\, dQ_K.
\]
By definition, $\varphi(x) + \psi(y) \le d(x, y)$ for all $x, y \in \mathbb{R}^n$. Also, using (4.14) and the fact that $\varphi(x_0) \ge f(x_0) \ge 0$,
\[
\inf_{x \in \mathbb{R}^n} \bigl( d(x, y) - d(x, y_0) \bigr) \le \psi(y) \le d(x_0, y) - \varphi(x_0) \le d(x_0, y). \tag{4.15}
\]
The estimates in (4.14) and (4.15) imply that $\|\varphi\|_\infty, \|\psi\|_\infty \le \|d\|_\infty$. Therefore, if we write
\begin{align*}
\int \varphi\, dP &= \int_K \varphi\, dP + \int_{K^c} \varphi\, dP = P(K) \int_K \varphi\, dP_K + \int_{K^c} \varphi\, dP \\
&= \int_K \varphi\, dP_K - P(K^c) \int_K \varphi\, dP_K + \int_{K^c} \varphi\, dP,
\end{align*}
each of the last two terms can be bounded in absolute value by $\|d\|_\infty P(K^c)$, which shows that, for large enough $R \ge 1$,
\[
\int_K \varphi\, dP_K \le \int \varphi\, dP + \varepsilon.
\]

The same estimate holds for $\psi$, and we have shown that
\[
\int f\, dP_K + \int g\, dQ_K \le \int \varphi\, dP + \int \psi\, dQ + 2\varepsilon.
\]
Together with (4.12) this implies that
\[
\inf\Bigl\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} \le \int \varphi\, dP + \int \psi\, dQ + 4\varepsilon
\]
and, since $\varphi(x) + \psi(y) \le d(x, y)$, this finishes the proof. □

Exercise 4.4.1. If $P$ and $Q$ are tight, show that the infimum in the definition of $W(P, Q)$ is attained.

Exercise 4.4.2. Suppose $(S, d)$ is separable. Show that $W(P, Q) \ge \rho(P, Q)^2$ and $W(P, Q) \ge \beta(P, Q)$. If the diameter $D = \sup_{x, y \in S} d(x, y)$ of $(S, d)$ is finite, show that $W(P, Q) \le \rho(P, Q)(1 + D)$.

Exercise 4.4.3. Let $(S, d)$ be a separable metric space and, given a bounded measurable function $f : S^2 \to \mathbb{R}$, let
\[
W_f(P, Q) := \inf\Bigl\{ \int f(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\}
\]
be a Wasserstein-like 'transportation distance' between probability measures $P$ and $Q$ on $(S, d)$. Prove that $W_f(P, Q)$ is convex in the pair $(P, Q)$.

Exercise 4.4.4. (a) Consider a countable set $A$ and two probability measures $P$ and $Q$ on $A$. Prove that
\[
\inf\Bigl\{ \int I(x \ne y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} = \frac{1}{2} \sum_{a \in A} \bigl| P(\{a\}) - Q(\{a\}) \bigr|.
\]
Hint: use the Kantorovich-Rubinstein theorem with the metric $d(x, y) = I(x \ne y)$.
(b) Let $(S, d)$ be a separable metric space and $\mathcal{B}$ be the Borel $\sigma$-algebra. Given two probability measures $P$ and $Q$ on $\mathcal{B}$, construct a measure $\nu \in M(P, Q)$ that witnesses the equality
\[
\inf\Bigl\{ \int I(x \ne y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} = \sup_{A \in \mathcal{B}} |P(A) - Q(A)| =: \mathrm{TV}(P, Q),
\]
where $\mathrm{TV}(P, Q)$ is called the total variation distance between $P$ and $Q$. Hint: use the Hahn-Jordan decomposition.
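For part (a), the optimal coupling can be written down directly: it keeps mass $\min(P(\{a\}), Q(\{a\}))$ on the diagonal and moves only the excess mass, so its cost $1 - \sum_a \min(P(\{a\}), Q(\{a\}))$ equals half the $\ell^1$ distance. A numerical sanity check on a random finite example (illustrative code, not a solution of the exercise):

```python
import random

random.seed(2)
k = 6
p = [random.random() for _ in range(k)]
q = [random.random() for _ in range(k)]
sp, sq = sum(p), sum(q)
p = [x / sp for x in p]
q = [x / sq for x in q]

# The coupling keeps min(p_a, q_a) at (a, a); only the remaining mass
# contributes to the cost I(x != y).
coupling_cost = 1.0 - sum(min(pa, qa) for pa, qa in zip(p, q))
half_l1 = 0.5 * sum(abs(pa - qa) for pa, qa in zip(p, q))
# sup_A |P(A) - Q(A)| is attained at A = {a : p_a > q_a}
tv_sup = sum(pa - qa for pa, qa in zip(p, q) if pa > qa)
print(coupling_cost, half_l1, tv_sup)  # all three agree
```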
Exercise 4.4.5. Let $\mathcal{P}_1(\mathbb{R})$ be the set of laws on $\mathbb{R}$ such that $\int |x|\, dP(x) < \infty$. Let us define a map $F : \mathcal{P}_1(\mathbb{R}) \to \mathcal{P}_1(\mathbb{R})$ as follows. Consider a random variable $\tau$ with values in $\mathbb{N}$ such that $\mathbb{E}\tau < \infty$ and an independent random variable $\xi$ such that $\mathbb{E}|\xi| < \infty$. Given a sequence $(X_i)$ of i.i.d. random variables with the distribution $P \in \mathcal{P}_1(\mathbb{R})$, independent of $\tau$ and $\xi$, let $F(P)$ be the distribution of the sum $\xi + \sum_{i=1}^\tau X_i$, where the sum $\sum_{i=1}^\tau X_i$ is zero if $\tau = 0$. If $\mathbb{E}\tau \in [0, 1)$, prove that $F$ has a unique fixed point, $F(P) = P$. Hint: use the Banach fixed point theorem for the Wasserstein metric $W_1$. (You need to prove that the metric space $(\mathcal{P}_1(\mathbb{R}), W_1)$ is complete.)

Exercise 4.4.6. Let $\mathcal{P}$ be the set of probability laws on $[-1, 1]$ and define a map $F : \mathcal{P} \to \mathcal{P}$ as follows. Consider a random variable $\tau$ with values in $\mathbb{N}$ and an independent random variable $\xi$. Given a sequence $(X_i)$ of i.i.d. random variables on $[-1, 1]$ with the distribution $P \in \mathcal{P}$, independent of $\tau$ and $\xi$, let $F(P)$ be the distribution of $\cos(\xi + \sum_{i=1}^\tau X_i)$, where the sum $\sum_{i=1}^\tau X_i$ is zero if $\tau = 0$. Prove that $F$ has a fixed point, $F(P) = P$. Hint: use the Schauder fixed point theorem for $\mathcal{P}$ equipped with the topology of weak convergence. (Notice that now a fixed point is possibly not unique.)

Exercise 4.4.7. Consider a finite set $A$ and nonnegative real-valued functions $f_1, \ldots, f_n : A \to \mathbb{R}$ such that for any convex combination $g = \lambda_1 f_1 + \cdots + \lambda_n f_n$ there is a point $a \in A$ such that $g(a) < 1$. Prove that there is a probability measure $P$ on $A$ such that $\int f_i\, dP < 1$ for all $i \le n$. Hint: use the Hahn-Banach separation theorem.

4.5 Wasserstein distance and entropy

In this section we will make several connections between the Wasserstein distance and other classical objects, with applications to Gaussian concentration. Let us start with the following classical inequality.

Theorem 4.17 (Brunn-Minkowski inequality on $\mathbb{R}$). If $\gamma$ is the Lebesgue measure and $A, B$ are two non-empty Borel sets on $\mathbb{R}$ then $\gamma(A + B) \ge \gamma(A) + \gamma(B)$, where $A + B = \{a + b : a \in A, b \in B\}$.

Proof. First, suppose that $A$ and $B$ are open and bounded. Since the Lebesgue measure is invariant under translations, let us translate $A$ and $B$ so that $\sup A = \inf B = 0$. Let us check that, in this case, $A \cup B \subseteq A + B$. Since $A$ is open, for each $a \in A$ there exists $\varepsilon > 0$ such that $(a - \varepsilon, a + \varepsilon) \subseteq A$. Since $\inf B = 0$, there exists $b \in B$ such that $0 \le b < \varepsilon/2$. Then $a \in (a + b - \varepsilon, a + b + \varepsilon) \subseteq A + B$, which proves that $A \subseteq A + B$. One can prove similarly that $B$ is also a subset of $A + B$. Since $A$ and $B$ are disjoint, we have proved that $\gamma(A) + \gamma(B) = \gamma(A \cup B) \le \gamma(A + B)$.

Now, suppose that $A$ and $B$ are compact. Then, obviously, $A + B$ is also compact. For $\varepsilon > 0$, let us denote by $C_\varepsilon$ an open $\varepsilon$-neighborhood of the set $C$. Since $A_\varepsilon + B_\varepsilon \subseteq (A + B)_{2\varepsilon}$, using the previous case of open bounded sets, we get $\gamma(A_\varepsilon) + \gamma(B_\varepsilon) \le \gamma(A_\varepsilon + B_\varepsilon) \le \gamma((A + B)_{2\varepsilon})$. Since $A$ is closed, $A_\varepsilon \downarrow A$ as $\varepsilon \downarrow 0$ and, by the continuity of measure, $\gamma(A_\varepsilon) \downarrow \gamma(A)$. The same holds for $B$ and $A + B$, and we have proved the inequality for compact sets.

Finally, consider arbitrary measurable sets $A$ and $B$. We can assume that $\gamma(A) < \infty$ and $\gamma(B) < \infty$, since otherwise there is nothing to prove. By the regularity of the Lebesgue measure, we can find compacts $C \subseteq A$ and $D \subseteq B$ such that $\gamma(A \setminus C) \le \varepsilon$ and $\gamma(B \setminus D) \le \varepsilon$. Using the case of compact sets, we can write $\gamma(A) + \gamma(B) - 2\varepsilon \le \gamma(C) + \gamma(D) \le \gamma(C + D) \le \gamma(A + B)$, and letting $\varepsilon \downarrow 0$ finishes the proof. □

Using this, we will prove another classical inequality.

Theorem 4.18 (Prekopa-Leindler inequality). Consider nonnegative integrable functions $w, u, v : \mathbb{R}^n \to [0, \infty)$ such that, for some $\lambda \in [0, 1]$,
\[
w(\lambda x + (1 - \lambda) y) \ge u(x)^\lambda v(y)^{1 - \lambda} \quad \text{for all } x, y \in \mathbb{R}^n.
\]
Then,
\[
\int w\, dx \ge \Bigl( \int u\, dx \Bigr)^\lambda \Bigl( \int v\, dx \Bigr)^{1 - \lambda}.
\]

Using the Prekopa-Leindler inequality one can prove that the Lebesgue measure $\gamma$ on $\mathbb{R}^n$ satisfies the Brunn-Minkowski inequality
\[
\gamma(A)^{1/n} + \gamma(B)^{1/n} \le \gamma(A + B)^{1/n}. \tag{4.16}
\]


We will leave this as an exercise. From this one can easily deduce the famous isoperimetric property of Euclidean balls with respect to the Lebesgue measure, namely, that among all sets of the same measure, a ball has the smallest surface area. If $B$ is a unit open ball in $\mathbb{R}^n$ and $\gamma(A) = \gamma(B)$ then, by (4.16),
\begin{align*}
\gamma(A_\varepsilon) = \gamma(A + \varepsilon B)
&\ge \bigl( \gamma(A)^{1/n} + \gamma(\varepsilon B)^{1/n} \bigr)^n \\
&= \bigl( \gamma(B)^{1/n} + \gamma(\varepsilon B)^{1/n} \bigr)^n = (1 + \varepsilon)^n \gamma(B) = \gamma(B_\varepsilon).
\end{align*}
In other words, the volume of $A$ grows at least as fast as the volume of $B$ as we expand the sets, which means that the surface area of $B$ is the smallest.

Proof (of Theorem 4.18). The proof will proceed by induction on $n$. Let us first show the induction step. Suppose the statement holds in dimension $n$ and we would like to show it in dimension $n + 1$. By the assumption, for any $x, y \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$,
\[
w\bigl( \lambda x + (1 - \lambda) y,\ \lambda a + (1 - \lambda) b \bigr) \ge u(x, a)^\lambda v(y, b)^{1 - \lambda}.
\]
Let us fix $a$ and $b$ and consider the functions $w_1(x) = w(x, \lambda a + (1 - \lambda) b)$, $u_1(x) = u(x, a)$ and $v_1(x) = v(x, b)$ on $\mathbb{R}^n$, which satisfy $w_1(\lambda x + (1 - \lambda) y) \ge u_1(x)^\lambda v_1(y)^{1 - \lambda}$. By the induction assumption,
\[
\int_{\mathbb{R}^n} w_1\, dx \ge \Bigl( \int_{\mathbb{R}^n} u_1\, dx \Bigr)^\lambda \Bigl( \int_{\mathbb{R}^n} v_1\, dx \Bigr)^{1 - \lambda}.
\]
These integrals still depend on $a$ and $b$, and we can define
\[
w_2(\lambda a + (1 - \lambda) b) = \int_{\mathbb{R}^n} w_1\, dx = \int_{\mathbb{R}^n} w(x, \lambda a + (1 - \lambda) b)\, dx
\]
and, similarly,
\[
u_2(a) = \int_{\mathbb{R}^n} u(x, a)\, dx \quad \text{and} \quad v_2(b) = \int_{\mathbb{R}^n} v(x, b)\, dx.
\]
Then the above inequality can be rewritten as
\[
w_2(\lambda a + (1 - \lambda) b) \ge u_2(a)^\lambda v_2(b)^{1 - \lambda}.
\]
These functions are defined on $\mathbb{R}$ and, by the case $n = 1$,
\[
\int_{\mathbb{R}} w_2\, ds \ge \Bigl( \int_{\mathbb{R}} u_2\, ds \Bigr)^\lambda \Bigl( \int_{\mathbb{R}} v_2\, ds \Bigr)^{1 - \lambda}
\implies
\int_{\mathbb{R}^{n+1}} w\, dz \ge \Bigl( \int_{\mathbb{R}^{n+1}} u\, dz \Bigr)^\lambda \Bigl( \int_{\mathbb{R}^{n+1}} v\, dz \Bigr)^{1 - \lambda},
\]
which finishes the proof of the induction step.


It remains to prove the case $n = 1$. We can assume that $u, v, w : \mathbb{R} \to [0, 1]$, because both inequalities in the statement of the theorem behave well under truncation and scaling. Also, we can assume that $u$ and $v$ are not identically zero, since otherwise there is nothing to prove, and we can divide them by their $\|\cdot\|_\infty$ norms and assume that $\|u\|_\infty = \|v\|_\infty = 1$. We have the following set inclusion:
\[
\{w \ge a\} \supseteq \lambda \{u \ge a\} + (1 - \lambda) \{v \ge a\},
\]
because if $u(x) \ge a$ and $v(y) \ge a$ then, by assumption,
\[
w(\lambda x + (1 - \lambda) y) \ge u(x)^\lambda v(y)^{1 - \lambda} \ge a^\lambda a^{1 - \lambda} = a.
\]
When $a \in [0, 1)$, the sets $\{u \ge a\}$ and $\{v \ge a\}$ are not empty and the Brunn-Minkowski inequality implies that
\[
\gamma(w \ge a) \ge \lambda\, \gamma(u \ge a) + (1 - \lambda)\, \gamma(v \ge a).
\]
Finally,
\begin{align*}
\int_{\mathbb{R}} w(z)\, dz &= \int_{\mathbb{R}} \int_0^1 I(a \le w(z))\, da\, dz = \int_0^1 \gamma(w \ge a)\, da \\
&\ge \lambda \int_0^1 \gamma(u \ge a)\, da + (1 - \lambda) \int_0^1 \gamma(v \ge a)\, da \\
&= \lambda \int_{\mathbb{R}} u(z)\, dz + (1 - \lambda) \int_{\mathbb{R}} v(z)\, dz
\ge \Bigl( \int_{\mathbb{R}} u(z)\, dz \Bigr)^\lambda \Bigl( \int_{\mathbb{R}} v(z)\, dz \Bigr)^{1 - \lambda},
\end{align*}
where in the last step we used the arithmetic-geometric mean inequality. This finishes the proof. □

Remark 4.6. Another common proof of the last step $n = 1$ uses transportation of measure, as follows. We can assume that $\int u = \int v = 1$ by rescaling
\[
u \to \frac{u}{\int u}, \qquad v \to \frac{v}{\int v}, \qquad w \to \frac{w}{\bigl( \int u \bigr)^\lambda \bigl( \int v \bigr)^{1 - \lambda}}.
\]
Then we need to show that $\int w \ge 1$. Consider the cumulative distribution functions
\[
F(x) = \int_{-\infty}^x u(y)\, dy \quad \text{and} \quad G(x) = \int_{-\infty}^x v(y)\, dy,
\]
and let $x(t)$ and $y(t)$ be their quantile transforms. Then $F(x(t)) = t$ and $G(y(t)) = t$ for $0 \le t \le 1$. By the Lebesgue differentiation theorem, $F'(x) = u(x)$ for all $x \notin \mathcal{N}$ for some set $\mathcal{N}$ of Lebesgue measure zero. Since $x(t)$ is (strictly) monotone, the derivative $x'(t)$ exists for all $t \notin \mathcal{N}'$ for some set $\mathcal{N}'$ of Lebesgue measure zero. Also, notice that
\[
x(t) \in \mathcal{N} \implies t \in F(\mathcal{N}),
\]
and, since $F$ is absolutely continuous, $F(\mathcal{N})$ also has Lebesgue measure zero (Lusin N property). Therefore, for $t \notin F(\mathcal{N}) \cup \mathcal{N}'$, the derivative $x'(t)$ exists and $F'(x(t)) = u(x(t))$. Using the chain rule in the equation $F(x(t)) = t$, for all such $t$, $u(x(t))\, x'(t) = 1$. Similarly, outside of some set of measure zero, $v(y(t))\, y'(t) = 1$. Now, consider the function $z(t) = \lambda x(t) + (1 - \lambda) y(t)$. This function is strictly increasing and differentiable almost everywhere. Therefore,
\[
\int_{-\infty}^{+\infty} w(z)\, dz \ge \int_0^1 w(z(t))\, z'(t)\, dt = \int_0^1 w\bigl( \lambda x(t) + (1 - \lambda) y(t) \bigr) z'(t)\, dt.
\]
By the arithmetic-geometric mean inequality,
\[
z'(t) = \lambda x'(t) + (1 - \lambda) y'(t) \ge x'(t)^\lambda\, y'(t)^{1 - \lambda}
\]
and, by assumption,
\[
w\bigl( \lambda x(t) + (1 - \lambda) y(t) \bigr) \ge u(x(t))^\lambda\, v(y(t))^{1 - \lambda}.
\]
Therefore,
\[
\int_{-\infty}^{\infty} w(z)\, dz \ge \int_0^1 \bigl( u(x(t))\, x'(t) \bigr)^\lambda \bigl( v(y(t))\, y'(t) \bigr)^{1 - \lambda}\, dt = \int_0^1 1\, dt = 1.
\]
This finishes the proof. □

**Entropy and the Kullback-Leibler divergence.** Consider a probability measure $P$ on some measurable space $\Omega$ and a nonnegative measurable function $u : \Omega \to \mathbb{R}_+$. We define the entropy of $u$ with respect to $P$ by
\[
\operatorname{Ent}_P(u) = \int u \log u\, dP - \int u\, dP \cdot \log \int u\, dP,
\]
where $0 \cdot \log 0 = 0$. Notice that $\operatorname{Ent}_P(u) \ge 0$ by Jensen's inequality, since $x \log x$ is a convex function. Entropy has the following variational representation, known in physics as the Gibbs variational principle (see the exercise below).

Lemma 4.12. The entropy can be written as
\[
\operatorname{Ent}_P(u) = \sup\Bigl\{ \int u v\, dP : \int e^v\, dP \le 1 \Bigr\}. \tag{4.17}
\]
Proof. Take any measurable function $v$ such that $\int e^v\, dP \le 1$. Then, for any $\lambda \ge 0$,
\[
\int u v\, dP \le \int u v\, dP + \lambda \Bigl( 1 - \int e^v\, dP \Bigr) = \lambda + \int (u v - \lambda e^v)\, dP.
\]
For $\lambda > 0$, the integrand $(u v - \lambda e^v)$ is strictly concave in $v$ and can be maximized pointwise by taking $v$ such that $u = \lambda e^v$. Therefore, for $\int e^v\, dP \le 1$ and $\lambda > 0$,
\[
\int u v\, dP \le \lambda + \int u \log \frac{u}{\lambda}\, dP - \int u\, dP.
\]
The right-hand side is convex in $\lambda$ for $\lambda \ge 0$ (since $u$ is nonnegative), the minimum is achieved at $\lambda = \int u\, dP$, and with this choice of $\lambda$ the right-hand side is equal to $\operatorname{Ent}_P(u)$. On the other hand, it is easy to see that this upper bound is achieved on the function $v = \log\bigl( u \big/ \int u\, dP \bigr)$, which satisfies $\int e^v\, dP = 1$. □

Suppose now that a law $Q$ is absolutely continuous with respect to $P$ and denote its Radon-Nikodym derivative by
\[
u = \frac{dQ}{dP}.
\]
Then the Kullback-Leibler divergence of $Q$ with respect to $P$ is defined by
\[
D(Q \,\|\, P) := \int \log u\, dQ = \int \log \frac{dQ}{dP}\, dQ. \tag{4.18}
\]
Notice that this quantity is not symmetric in $P$ and $Q$, so the order is important. Clearly, $D(Q \,\|\, P) = \operatorname{Ent}_P(u)$, since
\[
\operatorname{Ent}_P(u) = \int \frac{dQ}{dP} \log \frac{dQ}{dP}\, dP - \int \frac{dQ}{dP}\, dP \cdot \log \int \frac{dQ}{dP}\, dP = \int \log \frac{dQ}{dP}\, dQ.
\]
The variational characterization (4.17) implies that
\[
\text{if } \int e^v\, dP \le 1 \text{ then } \int v\, dQ = \int u v\, dP \le D(Q \,\|\, P). \tag{4.19}
\]

**Talagrand's quadratic transportation cost inequality for Gaussian measures.** In this subsection we consider a non-degenerate normal distribution $N(0, C)$ with covariance matrix $C$ such that $\det(C) \ne 0$. We know that this distribution has density $e^{-V(x)}$, where
\[
V(x) = \frac{1}{2} (C^{-1} x, x) + \mathrm{const}.
\]
If we denote $A = C^{-1}/2$ then, for any $t \in [0, 1]$,
\begin{align}
t V(x) + (1 - t) V(y) &- V(t x + (1 - t) y) \notag \\
&= t (A x, x) + (1 - t) (A y, y) - \bigl( A (t x + (1 - t) y),\ t x + (1 - t) y \bigr) \notag \\
&= t (1 - t) \bigl( A (x - y),\ x - y \bigr) \tag{4.20} \\
&\ge \frac{1}{2 \lambda_{\max}(C)}\, t (1 - t) |x - y|^2 = K t (1 - t) |x - y|^2, \notag
\end{align}
where $\lambda_{\max}(C)$ is the largest eigenvalue of $C$ and $K = 1 / (2 \lambda_{\max}(C))$. We will use this to prove the following useful inequality for the Wasserstein distance $W_2$ defined at the end of the previous section.
Theorem 4.19. If $P = N(0, C)$ and $Q$ is absolutely continuous with respect to $P$ with $\int |x|^2\, dQ(x) < \infty$ then
\[
W_2(Q, P)^2 \le 2 \lambda_{\max}(C)\, D(Q \,\|\, P). \tag{4.21}
\]

Proof. Using the Kantorovich-Rubinstein theorem (see (4.9)), write
\begin{align*}
W_2(Q, P)^2 &= \frac{1}{K} \inf\Bigl\{ \int K |x - y|^2\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} \\
&= \frac{1}{K} \sup\Bigl\{ \int f\, dP + \int g\, dQ : f(x) + g(y) \le K |x - y|^2 \Bigr\}.
\end{align*}
Consider functions $f, g \in C(\mathbb{R}^n)$ such that $f(x) + g(y) \le K |x - y|^2$. Then, by (4.20), for any $t \in (0, 1)$,
\[
f(x) + g(y) \le \frac{1}{t (1 - t)} \bigl( t V(x) + (1 - t) V(y) - V(t x + (1 - t) y) \bigr)
\]
and
\[
t (1 - t) f(x) - t V(x) + t (1 - t) g(y) - (1 - t) V(y) \le - V(t x + (1 - t) y).
\]
If we introduce the functions
\[
u(x) = e^{(1 - t) f(x) - V(x)}, \qquad v(y) = e^{t g(y) - V(y)} \qquad \text{and} \qquad w(z) = e^{-V(z)},
\]
then $w(t x + (1 - t) y) \ge u(x)^t v(y)^{1 - t}$ and, by the Prekopa-Leindler inequality,
\[
\Bigl( \int e^{(1 - t) f(x) - V(x)}\, dx \Bigr)^t \Bigl( \int e^{t g(x) - V(x)}\, dx \Bigr)^{1 - t} \le \int e^{-V(x)}\, dx.
\]
Since $e^{-V}$ is the density of $P$, we have shown that
\[
\Bigl( \int e^{(1 - t) f}\, dP \Bigr)^t \Bigl( \int e^{t g}\, dP \Bigr)^{1 - t} \le 1,
\quad \text{so} \quad
\Bigl( \int e^{(1 - t) f}\, dP \Bigr)^{\frac{1}{1 - t}} \Bigl( \int e^{t g}\, dP \Bigr)^{\frac{1}{t}} \le 1.
\]
It is a simple calculus exercise to show that
\[
\lim_{s \downarrow 0} \Bigl( \int e^{s f}\, dP \Bigr)^{1/s} = e^{\int f\, dP}
\]
and, therefore, letting $t \uparrow 1$ proves that
\[
\int e^g\, dP \cdot e^{\int f\, dP} \le 1.
\]
This inequality can be written as $\int e^v\, dP \le 1$ for $v = g + \int f\, dP$, and (4.19) implies that
\[
\int v\, dQ = \int f\, dP + \int g\, dQ \le D(Q \,\|\, P).
\]
Together with the Kantorovich-Rubinstein representation above, this implies that
\[
W_2(Q, P)^2 \le \frac{1}{K}\, D(Q \,\|\, P) = 2 \lambda_{\max}(C)\, D(Q \,\|\, P),
\]
which finishes the proof. □

**Concentration of Gaussian measure.** Given a measurable set $A \subseteq \mathbb{R}^n$ with $P(A) > 0$, define the distribution $P_A$ by
\[
P_A(C) = \frac{P(C \cap A)}{P(A)}.
\]
Then, obviously, the Radon-Nikodym derivative is
\[
\frac{dP_A}{dP} = \frac{1}{P(A)}\, I_A
\]
and the Kullback-Leibler divergence is
\[
D(P_A \,\|\, P) = \int_A \log \frac{1}{P(A)}\, dP_A = \log \frac{1}{P(A)}.
\]
Since $W_2$ is a metric, for any two Borel sets $A$ and $B$,
\begin{align*}
W_2(P_A, P_B) &\le W_2(P_A, P) + W_2(P_B, P) \\
&\le \sqrt{2 \lambda_{\max}(C)} \Bigl( \log^{1/2} \frac{1}{P(A)} + \log^{1/2} \frac{1}{P(B)} \Bigr),
\end{align*}
using (4.21). Suppose that the sets $A$ and $B$ are apart from each other by a distance $t$, i.e. $d(A, B) \ge t > 0$. Then any two points in the supports of the measures $P_A$ and $P_B$ are at distance at least $t$ from each other, which implies that the transportation distance $W_2(P_A, P_B) \ge t$. Therefore,
\begin{align*}
t \le W_2(P_A, P_B) &\le \sqrt{2 \lambda_{\max}(C)} \Bigl( \log^{1/2} \frac{1}{P(A)} + \log^{1/2} \frac{1}{P(B)} \Bigr) \\
&\le \sqrt{4 \lambda_{\max}(C)}\, \log^{1/2} \frac{1}{P(A) P(B)}.
\end{align*}
Therefore,
\[
P(B) \le \frac{1}{P(A)} \exp\Bigl( -\frac{t^2}{4 \lambda_{\max}(C)} \Bigr).
\]
In particular, if $B = \{x : d(x, A) \ge t\}$ then
\[
P\bigl( d(x, A) \ge t \bigr) \le \frac{1}{P(A)} \exp\Bigl( -\frac{t^2}{4 \lambda_{\max}(C)} \Bigr).
\]
If the set $A$ is not too small, e.g. $P(A) \ge 1/2$, this implies that
\[
P\bigl( d(x, A) \ge t \bigr) \le 2 \exp\Bigl( -\frac{t^2}{4 \lambda_{\max}(C)} \Bigr).
\]
This shows that the Gaussian measure is exponentially concentrated near any "large enough" set. The constant $1/4$ in the exponent is not optimal and can be replaced by $1/2$; this is just an example of application of the above ideas. The optimal result is the famous Gaussian isoperimetry:
\[
\text{if } P(A) = P(B) \text{ for some half-space } B \text{ then } P(A_t) \ge P(B_t).
\]
**Gaussian concentration via infimum-convolution.** If we denote $c := 1/\lambda_{\max}(C)$ then, setting $t = 1/2$ in (4.20),
\[
V(x) + V(y) - 2 V\Bigl( \frac{x + y}{2} \Bigr) \ge \frac{c}{4}\, |x - y|^2.
\]
Given a function $f$ on $\mathbb{R}^n$, let us define its infimum-convolution by
\[
g(y) = \inf_x \Bigl( f(x) + \frac{c}{4}\, |x - y|^2 \Bigr).
\]
Then, for all $x$ and $y$, we have the inequality
\[
g(y) - f(x) \le \frac{c}{4}\, |x - y|^2 \le V(x) + V(y) - 2 V\Bigl( \frac{x + y}{2} \Bigr). \tag{4.22}
\]
If we consider the functions $u(x) = e^{-f(x) - V(x)}$, $v(y) = e^{g(y) - V(y)}$ and $w(z) = e^{-V(z)}$, then (4.22) implies that
\[
w\Bigl( \frac{x + y}{2} \Bigr) \ge u(x)^{1/2} v(y)^{1/2},
\]
and the Prekopa-Leindler inequality with $\lambda = 1/2$ implies that
\[
\int e^g\, dP \int e^{-f}\, dP \le 1. \tag{4.23}
\]
Given a measurable set $A$, let $f$ be equal to $0$ on $A$ and $+\infty$ on the complement of $A$. Then $g(y) = \frac{c}{4}\, d(y, A)^2$ and (4.23) implies
\[
\int \exp\Bigl( \frac{c}{4}\, d(x, A)^2 \Bigr)\, dP(x) \le \frac{1}{P(A)}.
\]
By Chebyshev's inequality,
\[
P\bigl( d(x, A) \ge t \bigr) \le \frac{1}{P(A)} \exp\Bigl( -\frac{c t^2}{4} \Bigr) = \frac{1}{P(A)} \exp\Bigl( -\frac{t^2}{4 \lambda_{\max}(C)} \Bigr),
\]
which is the same Gaussian concentration inequality we proved above. □


u

**Discrete metric and total variation.** The total variation distance between two probability measures $P$ and $Q$ on a measurable space $(S, \mathcal{B})$ is defined by
\[
\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|.
\]
Using the Hahn-Jordan decomposition, we can represent the signed measure $\mu = P - Q$ as $\mu = \mu^+ - \mu^-$ such that, for some set $D \in \mathcal{B}$ and any set $E \in \mathcal{B}$,
\[
\mu^+(E) = \mu(E \cap D) \ge 0 \quad \text{and} \quad \mu^-(E) = -\mu(E \cap D^c) \ge 0.
\]
Therefore, for any $A \in \mathcal{B}$,
\[
P(A) - Q(A) = \mu^+(A) - \mu^-(A) = \mu^+(A \cap D) - \mu^-(A \cap D^c),
\]
which makes it obvious that
\[
\sup_{A \in \mathcal{B}} |P(A) - Q(A)| = \mu^+(D).
\]

Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation.

Lemma 4.13. If $f$ is a measurable function on $S$ such that $|f| \le 1$ and $\int f\, dP = 0$ then, for any $\lambda \in \mathbb{R}$,
\[
\int e^{\lambda f}\, dP \le e^{\lambda^2 / 2}.
\]

Proof. Since $(1 + f)/2, (1 - f)/2 \in [0, 1]$ and
\[
\lambda f = \frac{1 + f}{2}\, \lambda + \frac{1 - f}{2}\, (-\lambda),
\]
by the convexity of $e^x$ we get
\[
e^{\lambda f} \le \frac{1 + f}{2}\, e^{\lambda} + \frac{1 - f}{2}\, e^{-\lambda} = \cosh(\lambda) + f \sinh(\lambda).
\]
Therefore,
\[
\int e^{\lambda f}\, dP \le \cosh(\lambda) \le e^{\lambda^2 / 2},
\]
where the last inequality is easy to see from the Taylor expansion. □
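The lemma is elementary to test: for any mean-zero distribution with values in $[-1, 1]$, the pointwise bound integrates to $\int e^{\lambda f}\, dP \le \cosh \lambda$, and $\cosh \lambda \le e^{\lambda^2/2}$ follows by comparing Taylor coefficients, $1/(2k)! \le 1/(2^k k!)$. A check on an arbitrary (made-up) discrete example:

```python
import math

# A distribution for f with |f| <= 1 and  int f dP = 0 (made-up values).
vals  = [-1.0, -0.2, 0.5, 1.0]
probs = [0.25, 0.25, 0.40, 0.10]
assert abs(sum(v * p for v, p in zip(vals, probs))) < 1e-12  # mean zero

def mgf(lam):
    # int e^{lam * f} dP
    return sum(math.exp(lam * v) * p for v, p in zip(vals, probs))

# e^{lam f} <= cosh(lam) + f sinh(lam) integrates to cosh(lam) (mean zero),
# and cosh(lam) <= e^{lam^2/2}.
ok = all(mgf(lam) <= math.cosh(lam) + 1e-12
         and math.cosh(lam) <= math.exp(lam * lam / 2.0)
         for lam in [-3.0 + 0.1 * k for k in range(61)])
print(ok)  # True
```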

Let us now consider the discrete metric on $S$ given by
\[
d(x, y) = I(x \ne y). \tag{4.24}
\]
Then a 1-Lipschitz function $f$ with respect to the metric $d$, $\|f\|_L \le 1$, is defined by the condition that, for all $x, y \in S$,
\[
|f(x) - f(y)| \le 1. \tag{4.25}
\]
Formally, the Kantorovich-Rubinstein theorem in this case would state that
\begin{align*}
W(P, Q) &= \inf\Bigl\{ \int I(x \ne y)\, d\mu(x, y) : \mu \in M(P, Q) \Bigr\} \\
&= \sup\Bigl\{ \int f\, dQ - \int f\, dP : \|f\|_L \le 1 \Bigr\} =: \gamma(P, Q).
\end{align*}
However, since any uncountable set $S$ is not separable with respect to the discrete metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case, one can use the Hahn-Jordan decomposition to show that $W$ coincides with the total variation distance, $W(P, Q) = \mathrm{TV}(P, Q)$, and it is easy to construct a measure $\mu \in M(P, Q)$ explicitly that witnesses this equality. This was one of the exercises at the end of the previous section. One can also easily check directly that $\gamma(P, Q) = \mathrm{TV}(P, Q)$. Thus, for the discrete metric $d$,
\[
W(P, Q) = \mathrm{TV}(P, Q) = \gamma(P, Q).
\]

We have the following analogue of the Kullback-Leibler divergence bound in Theorem 4.19, which now holds for any measure $P$, not only Gaussian.

Theorem 4.20 (Pinsker's inequality). If $Q$ is absolutely continuous with respect to $P$ then $\mathrm{TV}(P, Q) \le \sqrt{2 D(Q \,\|\, P)}$.

Proof. Take $f$ such that (4.25) holds. If we define $g(x) = f(x) - \int f\, dP$ then, clearly, $|g| \le 1$ and $\int g\, dP = 0$. The above lemma implies that, for any $\lambda \in \mathbb{R}$,
\[
\int e^{\lambda f - \lambda \int f\, dP - \lambda^2 / 2}\, dP \le 1.
\]
The variational characterization of entropy (4.19) implies that
\[
\lambda \int f\, dQ - \lambda \int f\, dP - \lambda^2 / 2 \le D(Q \,\|\, P)
\]
and, for $\lambda > 0$, we get
\[
\int f\, dQ - \int f\, dP \le \frac{\lambda}{2} + \frac{1}{\lambda}\, D(Q \,\|\, P).
\]
Minimizing the right-hand side over $\lambda > 0$, we get
\[
\int f\, dQ - \int f\, dP \le \sqrt{2 D(Q \,\|\, P)}.
\]
Applying this to $f$ and $-f$ yields the result. □
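On a finite set, $\mathrm{TV}$ is half the $\ell^1$ distance (by the Hahn-Jordan computation above) and $D(Q\|P)$ is a finite sum, so the inequality can be stress-tested on random pairs of distributions. (The constant in the theorem is not the sharp one: the classical sharp form has $\sqrt{D/2}$ on the right, so the stated bound holds with room to spare.)

```python
import math
import random

def tv(p, q):
    # TV(P, Q) = half the l1 distance on a finite set
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(q, p):
    # D(Q || P) for strictly positive discrete distributions
    return sum(b * math.log(b / a) for a, b in zip(p, q))

random.seed(4)
ok = True
for _ in range(1000):
    n = random.randint(2, 10)
    p = [random.random() + 0.01 for _ in range(n)]
    q = [random.random() + 0.01 for _ in range(n)]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    ok = ok and tv(p, q) <= math.sqrt(2.0 * kl(q, p)) + 1e-12
print(ok)  # True
```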

**Hamming metric on a product space.** Let us consider a finite set $A$ and, given an integer $n \ge 1$, consider the following Hamming metric on $A^n$:
\[
d_H(x, y) = \frac{1}{n} \sum_{i=1}^n I(x_i \ne y_i).
\]
For two measures $\mu$ and $\nu$ on $A^n$, consider the corresponding Wasserstein distance
\begin{align*}
W_H(\mu, \nu) &= \inf\Bigl\{ \int d_H(x, y)\, d\lambda(x, y) : \lambda \in M(\mu, \nu) \Bigr\} \\
&= \inf\Bigl\{ \frac{1}{n} \sum_{i=1}^n \lambda(x_i \ne y_i) : \lambda \in M(\mu, \nu) \Bigr\}.
\end{align*}
Since $d_H(x, y) \le I(x \ne y)$, by Pinsker's inequality,
\[
W_H(\mu, \nu) \le \mathrm{TV}(\mu, \nu) \le \sqrt{2 D(\mu \,\|\, \nu)}. \tag{4.26}
\]
On the other hand, on the product space $A^n$, one is often interested in understanding how different a measure $\mu$ is from some (or any) product measure $\nu = \nu_1 \times \cdots \times \nu_n$ with respect to the above Wasserstein metric. Consider the function
\[
\phi_A(x) = -x \log x - (1 - x) \log(1 - x) + x \log \operatorname{card}(A), \qquad x \in [0, 1].
\]
This function is concave, $\phi_A(x) \ge 0$, and it is equal to zero only at $x = 0$. The following reverse analogue of the above Pinsker inequality holds.

Lemma 4.14. If $\mu_1, \ldots, \mu_n$ are the marginals of $\mu$ then, for any product measure $\nu = \nu_1 \times \cdots \times \nu_n$ on $A^n$,
\[
\phi_A\bigl( W_H(\mu, \nu) \bigr) \ge \frac{1}{2n}\, D(\mu \,\|\, \mu_1 \times \cdots \times \mu_n). \tag{4.27}
\]
In particular,
\[
\phi_A\bigl( W_H(\mu, \mu_1 \times \cdots \times \mu_n) \bigr) \ge \frac{1}{2n}\, D(\mu \,\|\, \mu_1 \times \cdots \times \mu_n).
\]
For the proof, we refer to Lemma 4.4 in arXiv:1705.00302. One can also rewrite the KL divergence on the right-hand side as
\[
D(\mu \,\|\, \mu_1 \times \cdots \times \mu_n) = \sum_{i=1}^n H(\mu_i) - H(\mu),
\]
where $H(P) = -\sum_s P(s) \log P(s)$ is called the entropy of the discrete measure $P$. The inequality (4.27) then shows that the KL divergence between $\mu$ and the product measure with the same marginals, $\mu_1 \times \cdots \times \mu_n$ (an explicit quantity that in principle should be easier to compute), can be used to bound from below the Wasserstein distance $W_H(\mu, \nu)$ between $\mu$ and an arbitrary product measure $\nu$ with respect to the Hamming distance $d_H$. Let us also note that one can strengthen the inequality (4.27) by replacing $\phi_A(W_H(\mu, \nu))$ with
\[
\inf\Bigl\{ \frac{1}{n} \sum_{i=1}^n \phi_A\bigl( \lambda(x_i \ne y_i) \bigr) : \lambda \in M(\mu, \nu) \Bigr\}.
\]

Exercise 4.5.1. Prove the Brunn-Minkowski inequality (4.16) on $\mathbb{R}^n$ using the Prekopa-Leindler inequality. Hint: apply the Prekopa-Leindler inequality to the sets $A/\lambda$ and $B/(1 - \lambda)$ and optimize over $\lambda \in (0, 1)$.

Exercise 4.5.2. Using the Prekopa-Leindler inequality, prove that if $\gamma_N$ is the standard Gaussian measure on $\mathbb{R}^N$, $A$ and $B$ are Borel sets and $\lambda \in [0, 1]$ then
\[
\log \gamma_N(\lambda A + (1 - \lambda) B) \ge \lambda \log \gamma_N(A) + (1 - \lambda) \log \gamma_N(B).
\]

Exercise 4.5.3 (Anderson's inequality). If $C$ is convex and symmetric around $0$, i.e. $C = -C$, then for any $z \in \mathbb{R}^N$, $\gamma_N(C) \ge \gamma_N(C + z)$. More generally, for any centred Gaussian distribution $P = N(0, \Sigma)$ on $\mathbb{R}^N$, $P(C) \ge P(C + z)$. Hint: use the previous problem.

Exercise 4.5.4. If $C$ is convex and symmetric around $0$, $X$ and $Y$ are (centred) Gaussian on $\mathbb{R}^N$ and $\operatorname{Cov}(X) \ge \operatorname{Cov}(Y)$ (i.e. their difference is nonnegative definite) then $P(X \in C) \le P(Y \in C)$. Hint: use the previous problem.

Exercise 4.5.5. Suppose that $g$ is standard Gaussian on $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function with $\|f\|_L = a$. If $M$ is a median of $f(g)$, prove that
\[
P\bigl( f(g) > M + a t \bigr) \le 2 e^{-t^2/4}.
\]
Hint: use the Gaussian concentration inequality. A median means that $P(f(g) \ge M) \ge 1/2$ and $P(f(g) \le M) \ge 1/2$.

Exercise 4.5.6 (Gibbs' variational principle). Given a probability measure $P$ on $\mathbb{R}^n$ and a function $v$ such that $\int e^v\, dP < \infty$, prove that
\[
\log \int e^v\, dP = \max_Q \Bigl\{ \int v\, dQ - \int \log \frac{dQ}{dP}\, dQ \Bigr\},
\]
where the maximum is taken over all probability measures $Q$ absolutely continuous with respect to $P$, and is achieved on $dQ \sim e^v\, dP$.

Chapter 5
Martingales

5.1 Martingales. Uniform integrability

Let $(\Omega, \mathcal{B}, P)$ be a probability space and let $(T, \le)$ be a linearly ordered set. We will mostly work with countable sets $T$, such as $\mathbb{Z}$ or subsets of $\mathbb{Z}$. Consider a family of random variables $X_t : \Omega \to \mathbb{R}$ and a family of $\sigma$-algebras $(\mathcal{B}_t)_{t \in T}$ such that $\mathcal{B}_t \subseteq \mathcal{B}_u \subseteq \mathcal{B}$ for $t \le u$. A family $(X_t, \mathcal{B}_t)_{t \in T}$ is called a martingale if the following hold:
1. $X_t$ is $\mathcal{B}_t$-measurable for all $t \in T$ ($X_t$ is adapted to $\mathcal{B}_t$),
2. $\mathbb{E}|X_t| < \infty$ for all $t \in T$,
3. $\mathbb{E}(X_u \mid \mathcal{B}_t) = X_t$ for all $t \le u$.
If the last equality is replaced by $\mathbb{E}(X_u \mid \mathcal{B}_t) \le X_t$ then the process is called a supermartingale, and if $\mathbb{E}(X_u \mid \mathcal{B}_t) \ge X_t$ then it is called a submartingale. If $(X_t, \mathcal{B}_t)$ is a martingale and $X_t = \mathbb{E}(X \mid \mathcal{B}_t)$ for some random variable $X$ then the martingale is called right-closable. If some $t_0 \in T$ is an upper bound of $T$ (i.e. $t \le t_0$ for all $t \in T$) then the martingale is called right-closed. Of course, in this case $X_t = \mathbb{E}(X_{t_0} \mid \mathcal{B}_t)$. Clearly, a right-closable martingale can be made into a right-closed martingale by adding an additional point, say $t_0$, to the set $T$ so that $t \le t_0$ for all $t \in T$, and defining $X_{t_0} := X$. Let us give several examples of martingales.

Example 5.1.1. Consider a sequence $(X_n)_{n \ge 1}$ of independent random variables such that $\mathbb{E}X_i = 0$, and let $S_n = \sum_{i \le n} X_i$. If $\mathcal{B}_n = \sigma(X_1, \ldots, X_n)$ then $(S_n, \mathcal{B}_n)_{n \ge 1}$ is a martingale since, for $m \ge n$,
\[
\mathbb{E}(S_m \mid \mathcal{B}_n) = \mathbb{E}(S_n + X_{n+1} + \cdots + X_m \mid \mathcal{B}_n) = S_n + \mathbb{E}X_{n+1} + \cdots + \mathbb{E}X_m = S_n,
\]
because $X_{n+1}, \ldots, X_m$ are independent of $\mathcal{B}_n$ and $S_n$ is $\mathcal{B}_n$-measurable.

Example 5.1.2. Consider a sequence of $\sigma$-algebras $\cdots \subseteq \mathcal{B}_n \subseteq \mathcal{B}_{n+1} \subseteq \cdots \subseteq \mathcal{B}$ and a random variable $X$ such that $\mathbb{E}|X| < \infty$. If we let $X_n = \mathbb{E}(X \mid \mathcal{B}_n)$ then $(X_n, \mathcal{B}_n)$ is a right-closable martingale, since for $n < m$,
\[
\mathbb{E}(X_m \mid \mathcal{B}_n) = \mathbb{E}\bigl( \mathbb{E}(X \mid \mathcal{B}_m) \mid \mathcal{B}_n \bigr) = \mathbb{E}(X \mid \mathcal{B}_n) = X_n,
\]
by the smoothing property of conditional expectation.

Example 5.1.3. As a typical example, one can consider an integrable function $X = f(\xi_1, \ldots, \xi_n)$ of some random variables $\xi_1, \ldots, \xi_n$, let $\mathcal{B}_0$ be the trivial $\sigma$-algebra and $\mathcal{B}_m = \sigma(\xi_1, \ldots, \xi_m)$ for $m \le n$. If we let $X_m = \mathbb{E}(X \mid \mathcal{B}_m)$ then $X_0 = \mathbb{E}X$, $X_n = X$, and the representation
\[
X = \mathbb{E}X + \sum_{m=1}^n (X_m - X_{m-1})
\]
is called the martingale-difference representation of $X$. This representation is particularly useful when $\xi_1, \ldots, \xi_n$ are independent, because in this case $X_{m-1}$ is just the expectation of $X_m$ with respect to $\xi_m$.
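When the $\xi_i$ are independent fair coin flips, every conditional expectation $X_m = \mathbb{E}(X \mid \mathcal{B}_m)$ is a finite average over the remaining coordinates, so the representation can be verified exhaustively. A sketch (the function `f` is an arbitrary made-up example):

```python
from itertools import product

def f(bits):
    # an arbitrary function of n = 4 fair coin flips in {0, 1}
    return bits[0] * bits[1] + 3 * bits[2] - bits[1] * bits[3] ** 2

n = 4

def X(m, prefix):
    # X_m = E(f | xi_1, ..., xi_m) at the given prefix: average f over
    # all values of the remaining n - m coins
    rest = list(product([0, 1], repeat=n - m))
    return sum(f(prefix + r) for r in rest) / len(rest)

# telescoping: f = E f + sum of martingale differences, path by path
for bits in product([0, 1], repeat=n):
    total = X(0, ()) + sum(X(m, bits[:m]) - X(m - 1, bits[:m - 1])
                           for m in range(1, n + 1))
    assert abs(total - f(bits)) < 1e-12

# E(X_m - X_{m-1} | B_{m-1}) = 0: average the difference over xi_m
for m in range(1, n + 1):
    for prefix in product([0, 1], repeat=m - 1):
        diff = sum(X(m, prefix + (b,)) for b in [0, 1]) / 2 - X(m - 1, prefix)
        assert abs(diff) < 1e-12
print("ok")
```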

Example 5.1.4. Let $(X_i)_{i \ge 1}$ be i.i.d. and let $S_n = \sum_{i \le n} X_i$. Let $T = \{\ldots, -2, -1\}$ be the set of negative integers and, for $n \ge 1$, let us define
\[
\mathcal{B}_{-n} = \sigma(S_n, S_{n+1}, \ldots) = \sigma(S_n, X_{n+1}, X_{n+2}, \ldots).
\]
Clearly, $\mathcal{B}_{-(n+1)} \subseteq \mathcal{B}_{-n}$. For $1 \le k \le n$, by symmetry (we leave the details for the exercise at the end of the section),
\[
\mathbb{E}(X_1 \mid \mathcal{B}_{-n}) = \mathbb{E}(X_k \mid \mathcal{B}_{-n}).
\]
Therefore,
\[
S_n = \mathbb{E}(S_n \mid \mathcal{B}_{-n}) = \sum_{1 \le k \le n} \mathbb{E}(X_k \mid \mathcal{B}_{-n}) = n\, \mathbb{E}(X_1 \mid \mathcal{B}_{-n})
\]
and $Z_{-n} := \frac{S_n}{n} = \mathbb{E}(X_1 \mid \mathcal{B}_{-n})$. Thus, $(Z_{-n}, \mathcal{B}_{-n})_{n \ge 1}$ is a right-closed martingale.

Theorem 5.1 (Doob's decomposition). If $(X_n, \mathcal{B}_n)_{n \ge 0}$ is a submartingale then it can be uniquely decomposed as $X_n = Z_n + Y_n$, where $(Y_n, \mathcal{B}_n)$ is a martingale, $Z_0 = 0$, $Z_n \le Z_{n+1}$ almost surely, and $Z_n$ is $\mathcal{B}_{n-1}$-measurable.

The sequence $(Z_n)$ is called predictable, since it is a function of $X_1, \ldots, X_{n-1}$, so we know it at time $n - 1$. Since an increasing sequence always has a limit, the question of convergence for submartingales is reduced to the question of convergence for martingales.

Proof. Let $D_n = X_n - X_{n-1}$ and
$$G_n = E(D_n\mid\mathcal{B}_{n-1}) = E(X_n\mid\mathcal{B}_{n-1}) - X_{n-1} \ge 0$$
by the definition of a submartingale. Let
$$H_n = D_n - G_n, \qquad Y_n = H_1 + \dots + H_n, \qquad Z_n = G_1 + \dots + G_n.$$
Since $G_n \ge 0$ a.s., $Z_n \le Z_{n+1}$ and, by construction, $Z_n$ is $\mathcal{B}_{n-1}$-measurable. We have
$$E(H_n\mid\mathcal{B}_{n-1}) = E(D_n\mid\mathcal{B}_{n-1}) - G_n = 0$$
and, therefore, $E(Y_n\mid\mathcal{B}_{n-1}) = Y_{n-1}$. Uniqueness follows by construction. Suppose that $X_n = Z_n + Y_n$ with all the stated properties. First, since $Z_0 = 0$, $Y_0 = X_0$. By induction, given a unique decomposition up to $n-1$, we can write
$$Z_n = E(Z_n\mid\mathcal{B}_{n-1}) = E(X_n - Y_n\mid\mathcal{B}_{n-1}) = E(X_n\mid\mathcal{B}_{n-1}) - Y_{n-1}$$
and $Y_n = X_n - Z_n$. This finishes the proof. ⊓⊔
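The decomposition can be checked by hand in a small concrete case. Below is a minimal Python sketch (an illustration, not from the text) for the submartingale $X_n = S_n^2$, where $S_n$ is a simple symmetric random walk: there $G_n = E(X_n\mid\mathcal{B}_{n-1}) - X_{n-1} = 1$, so $Z_n = n$ and $Y_n = S_n^2 - n$, and the martingale property of $Y_n$ can be verified exactly by enumerating all equally likely paths.

```python
from itertools import product

# Doob decomposition of X_n = S_n^2 for the simple symmetric random walk:
# here G_n = E(X_n | B_{n-1}) - X_{n-1} = 1, so Z_n = n (predictable,
# increasing) and Y_n = S_n^2 - n should be a martingale.  We verify
# E(Y_n | B_{n-1}) = Y_{n-1} exactly by enumerating all +/-1 prefixes.

def doob_check(n):
    for prefix in product([-1, 1], repeat=n - 1):
        s = sum(prefix)
        y_prev = s * s - (n - 1)
        # average Y_n over the two equally likely next steps
        y_next = sum(((s + step) ** 2 - n) for step in (-1, 1)) / 2
        assert abs(y_next - y_prev) < 1e-12
    return True

print(doob_check(6))  # True: the martingale property holds on every path
```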

When we study convergence properties of martingales, an important role will be played by their uniform integrability properties. We say that a collection of random variables $(X_t)_{t\in T}$ is uniformly integrable if
$$\sup_{t} E|X_t|I(|X_t| > M) \to 0 \ \text{ as } M \to \infty. \tag{5.1}$$
For example, when $|X_t| \le Y$ for all $t \in T$ and $EY < \infty$ then, clearly, $(X_t)_{t\in T}$ is uniformly integrable. Other basic criteria of uniform integrability will be given in the exercises below. The following criterion of $L^1$-convergence, which can be viewed as a strengthening of the dominated convergence theorem, will be useful.

Lemma 5.1. Consider random variables $(X_n)$ and $X$ such that $E|X_n| < \infty$, $E|X| < \infty$. The following are equivalent:

1. $E|X_n - X| \to 0$ as $n \to \infty$.
2. $(X_n)_{n\ge 1}$ is uniformly integrable and $X_n \to X$ in probability.

Proof. 2 $\Longrightarrow$ 1. For any $\varepsilon > 0$ and $K > 0$, we can write
$$E|X_n - X| \le \varepsilon + E|X_n - X| I(|X_n - X| > \varepsilon)$$
$$\le \varepsilon + 2KP(|X_n - X| > \varepsilon) + 2E|X_n|I(|X_n| > K) + 2E|X|I(|X| > K)$$
$$\le \varepsilon + 2KP(|X_n - X| > \varepsilon) + 2\sup_n E|X_n|I(|X_n| > K) + 2E|X|I(|X| > K).$$
Since $X_n \to X$ in probability, letting $n \to \infty$,
$$\limsup_{n\to\infty} E|X_n - X| \le \varepsilon + 2\sup_n E|X_n|I(|X_n| > K) + 2E|X|I(|X| > K).$$
Now, first letting $\varepsilon \to 0$, and then letting $K \to \infty$ and using that $(X_n)$ is uniformly integrable, proves the result.

1 $\Longrightarrow$ 2. By Chebyshev's inequality,
$$P(|X_n - X| > \varepsilon) \le \frac{1}{\varepsilon}E|X_n - X| \to 0$$
as $n \to \infty$, so $X_n \to X$ in probability. To prove uniform integrability, let us recall that for any $\varepsilon > 0$ there exists $\delta > 0$ such that $P(A) \le \delta$ implies that $E|X|I_A \le \varepsilon$. This is because
$$E|X|I_A \le KP(A) + E|X|I(|X| \ge K),$$
and if we take $K$ such that $E|X|I(|X| \ge K) \le \varepsilon/2$ then it is enough to take $\delta = \varepsilon/(2K)$. Now, given $\varepsilon > 0$, take $\delta$ as above and take $M > 0$ large enough so that, for all $n \ge 1$,
$$P(|X_n| > M) \le \frac{E|X_n|}{M} \le \delta.$$
We showed that $E|X|I(|X_n| > M) \le \varepsilon$ for such $\delta$ and, therefore,
$$E|X_n|I(|X_n| > M) \le E|X_n - X| + E|X|I(|X_n| > M) \le E|X_n - X| + \varepsilon.$$
For large enough $n \ge n_0$, $E|X_n - X| \le \varepsilon$ and, therefore,
$$E|X_n|I(|X_n| > M) \le 2\varepsilon.$$
We can also choose $M$ large enough so that $E|X_n|I(|X_n| > M) \le 2\varepsilon$ for $n \le n_0$, and this finishes the proof. ⊓⊔
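To see why uniform integrability cannot be dropped in Lemma 5.1, a standard counterexample (not from the text) is $X_n = n\,I_{[0,1/n]}$ on $[0,1]$ with the Lebesgue measure: $X_n \to 0$ in probability, yet $E|X_n| = 1$ for all $n$, and the tail expectations do not vanish uniformly. The closed-form computations below sketch this in Python.

```python
# Counterexample sketch: X_n = n * 1_{[0,1/n]} on ([0,1], Lebesgue).
# X_n -> 0 in probability, but E|X_n - 0| = 1 for every n, so there is
# no L^1 convergence; the obstruction is a failure of uniform integrability.

def p_exceeds(n, eps):
    # P(|X_n| > eps) = 1/n for 0 < eps < n (the set [0, 1/n]), else 0
    return 1.0 / n if eps < n else 0.0

def tail_expectation(n, M):
    # E|X_n| I(|X_n| > M) = n * (1/n) = 1 whenever M < n, else 0
    return 1.0 if M < n else 0.0

def l1_norm(n):
    # E|X_n| = n * (1/n) = 1 for every n
    return 1.0

print(p_exceeds(1000, 0.5))        # 0.001: convergence in probability
print(tail_expectation(1000, 50))  # 1.0: sup_n of the tail does not vanish
print(l1_norm(1000))               # 1.0: no L^1 convergence to 0
```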
Next, we will prove some uniform integrability properties for martingales and submartingales, but first let us make the following simple observation.

Lemma 5.2. Let $f : \mathbb{R} \to \mathbb{R}$ be a convex function such that $E|f(X_t)| < \infty$ for all $t$. Suppose that either of the following two conditions holds:

1. $(X_t, \mathcal{B}_t)$ is a martingale;
2. $(X_t, \mathcal{B}_t)$ is a submartingale and $f$ is increasing.

Then $(f(X_t), \mathcal{B}_t)$ is a submartingale.

Proof. Under the first condition, for $t \le u$, by the conditional Jensen inequality,
$$f(X_t) = f\bigl(E(X_u\mid\mathcal{B}_t)\bigr) \le E\bigl(f(X_u)\mid\mathcal{B}_t\bigr).$$
Under the second condition, since $X_t \le E(X_u\mid\mathcal{B}_t)$ for $t \le u$ and $f$ is increasing,
$$f(X_t) \le f\bigl(E(X_u\mid\mathcal{B}_t)\bigr) \le E\bigl(f(X_u)\mid\mathcal{B}_t\bigr),$$
where we again used the conditional Jensen inequality. ⊓⊔
Lemma 5.3. The following holds.

1. If $(X_t, \mathcal{B}_t)_{t\in T}$ is a right-closable martingale then $(X_t)$ is uniformly integrable.
2. If $(X_t, \mathcal{B}_t)_{t\in T}$ is a right-closable submartingale then $(\max(X_t, a))$ is uniformly integrable for any $a \in \mathbb{R}$.

The fact that a submartingale $(X_t, \mathcal{B}_t)_{t\in T}$ is right-closable means that $X_t \le E(X\mid\mathcal{B}_t)$ for some $X$ such that $E|X| < \infty$.
Proof. 1. Since there exists an integrable random variable $X$ such that $X_t = E(X\mid\mathcal{B}_t)$, we have
$$|X_t| = |E(X\mid\mathcal{B}_t)| \le E\bigl(|X| \mid \mathcal{B}_t\bigr) \quad\text{and}\quad E|X_t| \le E|X| < \infty.$$
Since $\{|X_t| > M\} \in \mathcal{B}_t$,
$$X_t I(|X_t| > M) = I(|X_t| > M)E(X\mid\mathcal{B}_t) = E\bigl(XI(|X_t| > M)\mid\mathcal{B}_t\bigr)$$
and, therefore,
$$E|X_t|I(|X_t| > M) \le E|X|I(|X_t| > M) \le KP(|X_t| > M) + E|X|I(|X| > K)$$
$$\le K\frac{E|X_t|}{M} + E|X|I(|X| > K) \le K\frac{E|X|}{M} + E|X|I(|X| > K).$$
Letting $M \to \infty$ and then $K \to \infty$ proves that $\sup_t E|X_t|I(|X_t| > M) \to 0$ as $M \to \infty$.

2. Since $X_t \le E(X\mid\mathcal{B}_t)$ and $\{X_t > M\} \in \mathcal{B}_t$, as above this implies that
$$EX_t I(X_t > M) \le EXI(X_t > M).$$
Also, since the function $\max(a, x)$ is convex and increasing in $x$, by part 2 of the previous lemma,
$$E\max(a, X_t) \le E\max(a, X). \tag{5.2}$$
Finally, if we take $M > |a|$ then the inequality $|\max(X_t, a)| > M$ can hold only if $\max(X_t, a) = X_t > M$. Combining all these observations,
$$E|\max(X_t, a)|I(|\max(X_t, a)| > M) = EX_t I(X_t > M) \le EXI(X_t > M)$$
$$\le KP(X_t > M) + E|X|I(|X| > K) \le K\frac{E\max(X_t, 0)}{M} + E|X|I(|X| > K)$$
$$\le K\frac{E\max(X, 0)}{M} + E|X|I(|X| > K),$$
where the last step uses (5.2). Letting $M \to \infty$ and then $K \to \infty$ proves that $(\max(X_t, a))_{t\in T}$ is uniformly integrable. ⊓⊔

Exercise 5.1.1. In the setting of Example 5.1.4 above, prove carefully that, for $1 \le k \le n$, $E(X_1\mid\mathcal{B}_{-n}) = E(X_k\mid\mathcal{B}_{-n})$.

Exercise 5.1.2. Suppose that $(S_n, \mathcal{B}_n)$ is a martingale, where $S_n = X_1 + \dots + X_n$ and $\mathcal{B}_n = \sigma(X_1,\dots,X_n)$. For $i < j$, if we assume that $E|X_iX_j| < \infty$ then show that $EX_iX_j = 0$.

Exercise 5.1.3. Let $(X_n)_{n\ge 1}$ be i.i.d. random variables and, given a bounded measurable function $f : \mathbb{R}^2 \to \mathbb{R}$ such that $f(x, y) = f(y, x)$, consider
$$S_{-n} = \frac{2}{n(n-1)}\sum_{1\le \ell<\ell'\le n} f(X_\ell, X_{\ell'})$$
(called U-statistics) and the $\sigma$-algebras
$$\mathcal{F}_{-n} = \sigma\bigl(X_{(1)},\dots,X_{(n)}, (X_\ell)_{\ell>n}\bigr),$$
where $X_{(1)},\dots,X_{(n)}$ are the order statistics of $X_1,\dots,X_n$. Prove that $\mathcal{F}_{-(n+1)} \subseteq \mathcal{F}_{-n}$ and $S_{-n} = E(S_{-2}\mid\mathcal{F}_{-n})$, i.e. $(S_{-n}, \mathcal{F}_{-n})_{n\ge 2}$ is a right-closed martingale.

Exercise 5.1.4. Suppose that $(X_n)_{n\ge 1}$ are i.i.d. and $E|X_1| < \infty$. If $S_n = X_1 + \dots + X_n$, show that the sequence $(S_n/n)_{n\ge 1}$ is uniformly integrable.

Exercise 5.1.5. Show that if $\sup_{t\in T} E|X_t|^{1+\delta} < \infty$ for some $\delta > 0$ then $(X_t)_{t\in T}$ is uniformly integrable.

Exercise 5.1.6. Show that $(X_t)_{t\in T}$ is uniformly integrable if and only if
(a) $\sup_t E|X_t| < \infty$ and
(b) for any $\varepsilon > 0$, there exists $\delta > 0$ such that $\sup_{t\in T} E|X_t|I_A \le \varepsilon$ if $P(A) \le \delta$.

Exercise 5.1.7. Suppose that random variables $X_n \ge 0$ for $n \ge 0$, $X_n \to X_0$ in probability and $EX_n \to EX_0$. Prove that $\lim_{n\to\infty} E|X_n - X_0| = 0$. (Hint: consider $(X_0 - X_n)^+$.)

5.2 Stopping times, optional stopping theorem

Consider a probability space $(\Omega, \mathcal{B}, P)$ and a sequence of $\sigma$-algebras $\mathcal{B}_n \subseteq \mathcal{B}$ for $n \ge 0$ such that $\mathcal{B}_n \subseteq \mathcal{B}_{n+1}$. An integer-valued random variable $\tau \in \{0, 1, 2, \dots\}$ is called a stopping time if $\{\tau \le n\} \in \mathcal{B}_n$ for all $n$. Of course, in this case we also have
$$\{\tau = n\} = \{\tau \le n\} \setminus \{\tau \le n-1\} \in \mathcal{B}_n.$$
Given a stopping time $\tau$, let us consider the $\sigma$-algebra
$$\mathcal{B}_\tau = \bigl\{B \in \mathcal{B} : \{\tau \le n\} \cap B \in \mathcal{B}_n \text{ for all } n\bigr\}$$
consisting of all events that "depend on the data up to the stopping time $\tau$". If a sequence of random variables $(X_n)$ is adapted to $(\mathcal{B}_n)$, i.e. $X_n$ is $\mathcal{B}_n$-measurable, then random variables such as $X_\tau$ or $\sum_{k=1}^{\tau} X_k$ are $\mathcal{B}_\tau$-measurable. For example,
$$\{\tau \le n\} \cap \{X_\tau \in A\} = \bigcup_{k\le n} \{\tau = k\} \cap \{X_k \in A\}
= \bigcup_{k\le n} \Bigl(\{\tau \le k\} \setminus \{\tau \le k-1\}\Bigr) \cap \{X_k \in A\} \in \mathcal{B}_n.$$

Let us mention several basic properties of stopping times.

1. A constant $\tau = n$ is a stopping time and $\mathcal{B}_\tau = \mathcal{B}_n$.
2. A stopping time $\tau$ is $\mathcal{B}_\tau$-measurable. To see this, we need to show that $\{\tau \le k\} \in \mathcal{B}_\tau$ for all integers $k \ge 0$. This is true because
$$\{\tau \le k\} \cap \{\tau \le n\} = \{\tau \le k \wedge n\} \in \mathcal{B}_{k\wedge n} \subseteq \mathcal{B}_n,$$
by the definition of a stopping time.
3. $\tau$ is a stopping time if and only if $\{\tau > n\} \in \mathcal{B}_n$ for all $n$. This is obvious, because $\{\tau > n\} = \{\tau \le n\}^c$ and $\sigma$-algebras are closed under taking complements.
4. If $(\tau_k)_{k\ge 1}$ are all stopping times then their minimum $\wedge_{k\ge 1}\tau_k$ and maximum $\vee_{k\ge 1}\tau_k$ are also stopping times. This is true because
$$\bigl\{\wedge_{k\ge 1}\tau_k > n\bigr\} = \bigcap_{k\ge 1}\{\tau_k > n\} \in \mathcal{B}_n, \qquad \bigl\{\vee_{k\ge 1}\tau_k \le n\bigr\} = \bigcap_{k\ge 1}\{\tau_k \le n\} \in \mathcal{B}_n.$$
5. If $\tau_1$ and $\tau_2$ are stopping times then the sum $\tau_1 + \tau_2$ is a stopping time. This is true because
$$\{\tau_1 + \tau_2 \le n\} = \bigcup_{k=0}^{n} \{\tau_1 = k\} \cap \{\tau_2 \le n - k\} \in \mathcal{B}_n.$$
We will leave other properties, concerning two stopping times $\tau_1$ and $\tau_2$, as an exercise.

6. The events $\{\tau_1 < \tau_2\}$, $\{\tau_1 = \tau_2\}$ and $\{\tau_1 > \tau_2\}$ belong to $\mathcal{B}_{\tau_1}$ and $\mathcal{B}_{\tau_2}$.
7. If $B \in \mathcal{B}_{\tau_1}$ then $B \cap \{\tau_1 \le \tau_2\}$ and $B \cap \{\tau_1 < \tau_2\}$ are in $\mathcal{B}_{\tau_2}$.
8. If $\tau_1 \le \tau_2$ then $\mathcal{B}_{\tau_1} \subseteq \mathcal{B}_{\tau_2}$.
A martingale $(X_n, \mathcal{B}_n)$ often represents a fair game, so that the average value $E(X_m\mid\mathcal{B}_n)$ at some future time $m > n$, given the data $\mathcal{B}_n$ at time $n$, is equal to the value $X_n$ at time $n$. A stopping time $\tau$ represents a strategy to stop the game based only on the information available at any given time, so that the event $\{\tau \le n\}$ that we stop the game before time $n$ depends only on the data $\mathcal{B}_n$ up to that time. A classical result that we will now prove states that, under some mild integrability conditions, the average value under any strategy is the same, so there is no "winning strategy" on average.

Theorem 5.2 (Optional stopping). Let $(X_n, \mathcal{B}_n)$ be a martingale and let $\tau_1, \tau_2 < \infty$ be stopping times such that
$$E|X_{\tau_2}| < \infty, \qquad \lim_{n\to\infty} E|X_n| I(n \le \tau_2) = 0. \tag{5.3}$$
Then, for any set $A \in \mathcal{B}_{\tau_1}$,
$$EX_{\tau_2} I_A I(\tau_1 \le \tau_2) = EX_{\tau_1} I_A I(\tau_1 \le \tau_2).$$
For example, if $\tau_1 = 0$ and $A = \Omega$ then $EX_{\tau_2} = EX_0$ whenever condition (5.3) is satisfied; in particular, if the stopping time $\tau_2$ is bounded then (5.3) is obviously satisfied. Let us note that without some kind of integrability condition Theorem 5.2 cannot hold, as the following example shows.

Example 5.2.1. Consider independent $(X_n)_{n\ge 1}$ such that $P(X_n = \pm 2^n) = 1/2$. If $\mathcal{B}_n = \sigma(X_1,\dots,X_n)$ then $(S_n, \mathcal{B}_n)$ is a martingale. Let $\tau_1 = 1$ and $\tau_2 = \min\{k \ge 1 : S_k > 0\}$. Clearly, $S_{\tau_2} = 2$, because if $\tau_2 = k$ then
$$S_{\tau_2} = S_k = -2 - 2^2 - \dots - 2^{k-1} + 2^k = 2.$$
However, $2 = ES_{\tau_2} \ne ES_1 = 0$. Notice that the second condition in (5.3) is violated, since
$$P(\tau_2 = n) = 2^{-n}, \qquad P(\tau_2 \ge n+1) = \sum_{k\ge n+1} 2^{-k} = 2^{-n}$$
and, therefore,
$$E|S_n|I(n \le \tau_2) = 2P(\tau_2 = n) + (2^{n+1} - 2)P(n+1 \le \tau_2) = 2,$$
which does not go to zero. ⊓⊔
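The arithmetic in this example is easy to mechanize. The following sketch (an illustration, not part of the text) confirms that $S_{\tau_2} = 2$ on $\{\tau_2 = k\}$ for every $k$, and that $E|S_n|I(n \le \tau_2) = 2$ for all $n$, so condition (5.3) indeed fails.

```python
# Doubling-stakes counterexample: P(X_k = ±2^k) = 1/2, S_n = X_1+...+X_n,
# tau = min{k : S_k > 0}.  The only path with tau = k has steps 1..k-1
# negative and step k positive.

def s_at_stopping(k):
    # value of S_k on the unique path with tau = k
    return 2 ** k - sum(2 ** j for j in range(1, k))

def tail_term(n):
    # E|S_n| I(n <= tau): either tau = n (|S_n| = 2, prob 2^{-n}) or all
    # n steps are negative (|S_n| = 2^{n+1} - 2, prob 2^{-n})
    p = 2.0 ** (-n)
    all_neg = sum(2 ** j for j in range(1, n + 1))
    return 2 * p + all_neg * p

for k in range(1, 15):
    assert s_at_stopping(k) == 2
print(tail_term(10))  # 2.0 for every n, so the tail term never vanishes
```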
Proof (of Theorem 5.2). Consider a set $A \in \mathcal{B}_{\tau_1}$. The result is based on the following formal computation, with the middle step (*) proved below:
$$EX_{\tau_2} I_A I(\tau_1 \le \tau_2) = \sum_{n\ge 1} EX_{\tau_2} I_{A\cap\{\tau_1 = n\}} I(n \le \tau_2)
\overset{(*)}{=} \sum_{n\ge 1} EX_n I_{A\cap\{\tau_1 = n\}} I(n \le \tau_2) = EX_{\tau_1} I_A I(\tau_1 \le \tau_2).$$
To prove (*), it is enough to show that, for $A_n = A \cap \{\tau_1 = n\} \in \mathcal{B}_n$,
$$EX_{\tau_2} I_{A_n} I(n \le \tau_2) = EX_n I_{A_n} I(n \le \tau_2). \tag{5.4}$$
We begin by writing
$$EX_n I_{A_n} I(n \le \tau_2) = EX_n I_{A_n} I(\tau_2 = n) + EX_n I_{A_n} I(n < \tau_2)
= EX_{\tau_2} I_{A_n} I(\tau_2 = n) + EX_n I_{A_n} I(n < \tau_2).$$
Since $\{n < \tau_2\} = \{\tau_2 \le n\}^c \in \mathcal{B}_n$ and $(X_n, \mathcal{B}_n)$ is a martingale, the last term satisfies
$$EX_n I_{A_n} I(n < \tau_2) = EX_{n+1} I_{A_n} I(n < \tau_2) = EX_{n+1} I_{A_n} I(n+1 \le \tau_2)$$
and, therefore,
$$EX_n I_{A_n} I(n \le \tau_2) = EX_{\tau_2} I_{A_n} I(\tau_2 = n) + EX_{n+1} I_{A_n} I(n+1 \le \tau_2).$$
We can continue the same computation and, by induction on $m$, we get
$$EX_n I_{A_n} I(n \le \tau_2) = EX_{\tau_2} I_{A_n} I(n \le \tau_2 < m) + EX_m I_{A_n} I(m \le \tau_2). \tag{5.5}$$
It remains to let $m \to \infty$ and use the assumptions in (5.3). By the second assumption, the last term satisfies
$$EX_m I_{A_n} I(m \le \tau_2) \le E|X_m| I(m \le \tau_2) \to 0$$
as $m \to \infty$. Since
$$X_{\tau_2} I_{A_n} I(n \le \tau_2 < m) \to X_{\tau_2} I_{A_n} I(n \le \tau_2)$$
almost surely as $m \to \infty$ and $E|X_{\tau_2}| < \infty$, by the dominated convergence theorem,
$$EX_{\tau_2} I_{A_n} I(n \le \tau_2 < m) \to EX_{\tau_2} I_{A_n} I(n \le \tau_2).$$
This proves (5.4) and finishes the proof of the theorem. ⊓⊔

Example 5.2.2 (Computing $EX_\tau$). If we take $\tau_1 = 0$, $n = 0$ and $A = \Omega$ then $A_n = \Omega$ and equation (5.5) becomes
$$EX_0 = EX_{\tau_2} I(\tau_2 < m) + EX_m I(m \le \tau_2).$$
This implies that, for any stopping time $\tau$,
$$EX_0 = \lim_{n\to\infty} EX_\tau I(\tau \le n) \iff \lim_{n\to\infty} EX_n I(\tau \ge n) = 0, \tag{5.6}$$
which gives a clean condition to check if we would like to show that $EX_\tau = EX_0$.

Example 5.2.3 (Hitting times of simple random walk). Given $p \in (0, 1)$, consider i.i.d. random variables $(X_i)_{i\ge 1}$ such that
$$P(X_i = 1) = p, \qquad P(X_i = -1) = 1 - p,$$
and consider a random walk $S_{n+1} = S_n + X_{n+1}$ starting with $S_0 = 0$. Consider two integers $a \le -1$ and $b \ge 1$ and the stopping time
$$\tau = \min\bigl\{k \ge 1 : S_k = a \text{ or } b\bigr\}.$$
If we denote $q = 1 - p$ then $Y_n = (q/p)^{S_n}$ is a martingale, since
$$E(Y_{n+1}\mid\mathcal{B}_n) = p\Bigl(\frac{q}{p}\Bigr)^{S_n+1} + q\Bigl(\frac{q}{p}\Bigr)^{S_n-1} = \Bigl(\frac{q}{p}\Bigr)^{S_n}.$$
It is easy to show that $P(\tau \ge n) \to 0$, which we leave as an exercise below. Since $S_n \in [a, b]$ for $n \le \tau$ and $S_\tau = a$ or $b$, equation (5.6) implies that
$$1 = EY_0 = EY_\tau = \Bigl(\frac{q}{p}\Bigr)^{b} P(S_\tau = b) + \Bigl(\frac{q}{p}\Bigr)^{a} \bigl(1 - P(S_\tau = b)\bigr).$$
If $p \ne 1/2$, we can solve this:
$$P(S_\tau = b) = \frac{(q/p)^a - 1}{(q/p)^a - (q/p)^b}.$$
If $q = p = 1/2$ then $S_n$ itself is a martingale and equation (5.6) implies that
$$0 = bP(S_\tau = b) + a\bigl(1 - P(S_\tau = b)\bigr) \quad\text{and}\quad P(S_\tau = b) = \frac{a}{a - b}.$$
One can also compute $E\tau$, which we will also leave as an exercise.
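As a sanity check on the formula for $P(S_\tau = b)$, one can solve the absorbing recurrence $h(x) = p\,h(x+1) + q\,h(x-1)$, $h(a) = 0$, $h(b) = 1$, exactly with rational arithmetic and compare. The sketch below does this; the helper names are my own, not from the text.

```python
from fractions import Fraction

# Gambler's ruin check: starting at 0, steps +1 w.p. p and -1 w.p. q = 1-p,
# absorbed at a <= -1 or b >= 1.  The optional stopping argument gives
#   P(S_tau = b) = ((q/p)^a - 1) / ((q/p)^a - (q/p)^b)  for p != 1/2.

def hit_b_formula(a, b, p):
    q = 1 - p
    r = q / p
    return (r ** a - 1) / (r ** a - r ** b)

def hit_b_recurrence(a, b, p):
    # exact solve of h(x) = p h(x+1) + q h(x-1) with h(a)=0, h(b)=1:
    # write h(x) = alpha[x] + beta[x]*t with t = h(a+1) unknown and
    # propagate h(x+1) = (h(x) - q h(x-1)) / p
    q = 1 - p
    alpha = {a: Fraction(0), a + 1: Fraction(0)}
    beta = {a: Fraction(0), a + 1: Fraction(1)}
    for x in range(a + 1, b):
        alpha[x + 1] = (alpha[x] - q * alpha[x - 1]) / p
        beta[x + 1] = (beta[x] - q * beta[x - 1]) / p
    t = (1 - alpha[b]) / beta[b]
    return alpha[0] + beta[0] * t

p = Fraction(2, 3)
print(hit_b_formula(-3, 4, p) == hit_b_recurrence(-3, 4, p))  # True
```

With $p = 2/3$, $a = -3$, $b = 4$ both computations give exactly $112/127$.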

Example 5.2.4 (Fundamental Wald identity). Let $(X_n)_{n\ge 1}$ be a sequence of i.i.d. random variables, $S_0 = 0$ and $S_n = X_1 + \dots + X_n$. Suppose that the Laplace transform $\varphi(\lambda) = Ee^{\lambda X_1}$ is defined on some nontrivial interval $(\lambda_-, \lambda_+)$ containing $0$. It is obvious that
$$Y_n = \frac{e^{\lambda S_n}}{\varphi(\lambda)^n}$$
is a martingale for $\lambda \in (\lambda_-, \lambda_+)$. Let $\tau$ be a stopping time. Since $Y_\tau \ge 0$,
$$\lim_{n\to\infty} E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} I(\tau \le n) = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} I(\tau < \infty)$$
by the monotone convergence theorem. Since $Y_0 = 1$, (5.6) implies in this case that
$$1 = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} I(\tau < \infty) \iff \lim_{n\to\infty} E\frac{e^{\lambda S_n}}{\varphi(\lambda)^n} I(\tau \ge n) = 0. \tag{5.7}$$
The left-hand side is called the fundamental Wald identity. In some cases one can use this to compute the Laplace transform of a stopping time and, thus, its distribution, as we will see below.

Notice that in the case when $\tau < \infty$ almost surely and (either side of) equation (5.7) holds, we can write
$$1 = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}.$$
Computing formally the derivative in $\lambda$ at $\lambda = 0$, we get $ES_\tau = EX_1 E\tau$, which is the Wald identity we proved in (2.4), so (5.7) is a generalization of that identity. Of course, this is only a formal generalization because here we make a stronger assumption on the existence of the Laplace transform in a neighbourhood of zero, while the identity $ES_\tau = EX_1 E\tau$ was proved only under the assumption that the expectations $EX_1$ and $E\tau$ are well defined.

Example 5.2.5 (Symmetric random walk). Let $X_0 = 0$, $P(X_i = \pm 1) = 1/2$, $S_n = \sum_{k\le n} X_k$. Given an integer $z \ge 1$, let
$$\tau = \min\{k : S_k = z \text{ or } -z\}.$$
Since $\varphi(\lambda) = \cosh(\lambda) \ge 1$,
$$E\frac{e^{\lambda S_n}}{\varphi(\lambda)^n} I(\tau \ge n) \le \frac{e^{|\lambda| z}}{\cosh(\lambda)^n} P(\tau \ge n)$$
and the right-hand side of (5.7) holds. Therefore,
$$1 = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = e^{\lambda z}\, E\cosh(\lambda)^{-\tau} I(S_\tau = z) + e^{-\lambda z}\, E\cosh(\lambda)^{-\tau} I(S_\tau = -z)
= \frac{e^{\lambda z} + e^{-\lambda z}}{2}\, E\cosh(\lambda)^{-\tau}$$
by symmetry. Therefore,
$$E\cosh(\lambda)^{-\tau} = \frac{1}{\cosh(\lambda z)} \quad\text{and}\quad Ee^{\gamma\tau} = \frac{1}{\cosh\bigl(\cosh^{-1}(e^{-\gamma})\,z\bigr)}$$
by the change of variables $e^{\gamma} = 1/\cosh\lambda$. ⊓⊔
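The identity $E\cosh(\lambda)^{-\tau} = 1/\cosh(\lambda z)$ can be checked numerically by computing the exact distribution of $\tau$ from the transition dynamics and truncating the series at a large $n$; the truncation level and tolerance below are assumptions of this sketch, not from the text.

```python
import math

# Numerical check of E[cosh(lam)^{-tau}] = 1 / cosh(lam * z), where tau is
# the hitting time of {-z, z} by the simple symmetric walk started at 0.
# We evolve the exact distribution of S_n on the interior states and sum
# P(tau = n) * cosh(lam)^{-n}; the tail mass decays geometrically.

def laplace_hitting(z, lam, n_max=400):
    probs = {0: 1.0}  # distribution of S_n restricted to |S_n| < z
    total = 0.0
    for n in range(1, n_max + 1):
        new, absorbed = {}, 0.0
        for s, p in probs.items():
            for step in (-1, 1):
                t = s + step
                if abs(t) == z:
                    absorbed += p / 2
                else:
                    new[t] = new.get(t, 0.0) + p / 2
        total += absorbed * math.cosh(lam) ** (-n)
        probs = new
    return total

z, lam = 3, 0.4
print(abs(laplace_hitting(z, lam) - 1 / math.cosh(lam * z)) < 1e-8)  # True
```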

For more general stopping times the condition on the right-hand side of (5.7) might not be easy to check. We will now consider another point of view that may sometimes be helpful in verifying the fundamental Wald identity. If $P$ is the distribution of $X_1$, let $P_\lambda$ be the distribution with the Radon-Nikodym derivative with respect to $P$ given by
$$\frac{dP_\lambda}{dP} = \frac{e^{\lambda x}}{\varphi(\lambda)}.$$
This is, indeed, a density with respect to $P$, since
$$\int_{\mathbb{R}} \frac{e^{\lambda x}}{\varphi(\lambda)}\, dP = \frac{\varphi(\lambda)}{\varphi(\lambda)} = 1.$$
For convenience of notation, we will think of $(X_n)$ as the coordinates on the product space $(\mathbb{R}^\infty, \mathcal{B}^\infty, P^{\otimes\infty})$ with the cylindrical $\sigma$-algebra. Therefore, $\{\tau = n\} \in \sigma(X_1,\dots,X_n) \subseteq \mathcal{B}^n$ is a Borel set in $\mathbb{R}^n$. We can write
$$E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} I(\tau < \infty) = \sum_{n=1}^{\infty} E\frac{e^{\lambda S_n}}{\varphi(\lambda)^n} I(\tau = n)
= \sum_{n=1}^{\infty} \int_{\{\tau = n\}} \frac{e^{\lambda(x_1+\dots+x_n)}}{\varphi(\lambda)^n}\, dP(x_1)\dots dP(x_n)$$
$$= \sum_{n=1}^{\infty} \int_{\{\tau = n\}} dP_\lambda(x_1)\dots dP_\lambda(x_n) = P_\lambda^{\otimes\infty}(\tau < \infty).$$
This means that we can think of the random variables $X_n$ as having the distribution $P_\lambda$ and, to prove the Wald identity, we need to show that $\tau < \infty$ with probability one.
Example 5.2.6 (Crossing a growing boundary). Suppose that we have a boundary given by some sequence $f : \mathbb{N} \to \mathbb{R}$ and a stopping time (crossing time)
$$\tau = \min\bigl\{k : S_k \ge f(k)\bigr\}.$$
To verify the Wald identity in (5.7), we need to show that $\tau < \infty$ with probability one, assuming now that the random variables $X_n$ have the distribution $P_\lambda$. Under this distribution, the random variables $X_n$ have the expectation
$$E_\lambda X_i = \int x\,\frac{e^{\lambda x}}{\varphi(\lambda)}\, dP(x) = \frac{\varphi'(\lambda)}{\varphi(\lambda)}.$$
By the strong law of large numbers,
$$\lim_{n\to\infty} \frac{S_n}{n} = \frac{\varphi'(\lambda)}{\varphi(\lambda)}$$
and, therefore, if the growth of the crossing boundary satisfies
$$\limsup_{n\to\infty} \frac{f(n)}{n} < \frac{\varphi'(\lambda)}{\varphi(\lambda)} \tag{5.8}$$
then, obviously, the random walk $S_n$ will cross it with probability one, in which case $P_\lambda^{\otimes\infty}(\tau < \infty) = 1$. By Hölder's inequality, $\varphi(\lambda)$ is log-convex and, therefore, $\varphi'(\lambda)/\varphi(\lambda)$ is increasing, which means that if this condition holds for $\lambda_0$ then it holds for $\lambda > \lambda_0$. Also note that, for $\lambda \in (\lambda_-, \lambda_+)$, the derivative of $\varphi'(\lambda)/\varphi(\lambda)$ is the variance of $X \sim P_\lambda$, so this function is actually strictly increasing unless the random variable $X \sim P$ is constant. ⊓⊔

In the next two examples, we will consider a constant boundary $f(k) = z$ for some integer $z > 0$. In this case, condition (5.8) becomes $\varphi'(\lambda) > 0$. We will consider the case when $\varphi'(0) = EX_1 > 0$, which means that the random walk has a positive drift and, in particular, $\tau < \infty$ with probability one.

Example 5.2.7. Let us consider i.i.d. $X_i$ taking integer values in $\{\dots,-2,-1,0,1\}$ with $EX_1 > 0$, which of course implies that $P(X_1 = 1) > 0$. These assumptions yield that $\tau = \min\{k : S_k \ge z\} < \infty$ a.s. and $S_\tau = z$ (we cannot jump over the boundary). By continuity, $\varphi'(\lambda) > 0$ in a small neighbourhood of zero and, thus, by the previous example,
$$1 = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = e^{\lambda z}\, E\frac{1}{\varphi(\lambda)^\tau}.$$
As in the example of the symmetric walk, this determines the Laplace transform $Ee^{\gamma\tau}$ on some interval of $\gamma$ and hence determines the distribution of $\tau$. ⊓⊔

Example 5.2.8 (Geometric case). Let us now consider the case when the $X_i$ are i.i.d. geometric random variables and, for some $p \in (0, 1)$,
$$P(X_i \ge k) = (1-p)^{k-1} \quad\text{for } k = 1, 2, 3, \dots.$$
In this case, one can show that $\tau$ and $S_\tau$ are independent and $S_\tau \overset{d}{=} z - 1 + X_1$. Indeed, for any $y \ge z$ and $n \ge 1$, we can write
$$P(S_\tau \ge y, \tau = n) = \sum_{x<z} P(S_n \ge y, S_{n-1} = x, \tau = n)
= \sum_{x<z} P(X_n \ge y - x, S_{n-1} = x, \tau > n-1).$$
Since the event $\{S_{n-1} = x, \tau > n-1\} \in \mathcal{B}_{n-1}$ and, thus, is independent of the geometric random variable $X_n$, we can continue:
$$P(S_\tau \ge y, \tau = n) = \sum_{x<z} (1-p)^{y-x-1} P(S_{n-1} = x, \tau > n-1)$$
$$= (1-p)^{y-z} \sum_{x<z} (1-p)^{z-x-1} P(S_{n-1} = x, \tau > n-1)$$
$$= (1-p)^{y-z} \sum_{x<z} P(X_n \ge z - x, S_{n-1} = x, \tau > n-1)$$
$$= (1-p)^{y-z} P(S_\tau \ge z, \tau = n) = (1-p)^{y-z} P(\tau = n).$$
This proves the above claim. Together with the fundamental Wald identity this shows that
$$1 = E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = Ee^{\lambda S_\tau}\, E\frac{1}{\varphi(\lambda)^\tau} = e^{\lambda(z-1)}\varphi(\lambda)\, E\frac{1}{\varphi(\lambda)^\tau}.$$
The Laplace transform of the geometric distribution is
$$\varphi(\lambda) = \frac{pe^\lambda}{1 - (1-p)e^\lambda} \quad\text{for } \lambda < -\log(1-p),$$
and we have again determined the Laplace transform of $\tau$ on some interval. ⊓⊔

Exercise 5.2.1. Prove properties 6 - 8 of stopping times above.

Exercise 5.2.2. Let $(X_n, \mathcal{B}_n)$ be a martingale and $\tau_1 \le \tau_2 \le \dots$ a non-decreasing sequence of bounded stopping times. Show that $(X_{\tau_n}, \mathcal{B}_{\tau_n})$ is a martingale.

Exercise 5.2.3. Let $(X_n, \mathcal{B}_n)_{n\ge 0}$ be a martingale and $\tau$ a stopping time such that $P(\tau < \infty) = 1$. Suppose that $E|X_{\tau\wedge n}|^{1+\delta} \le c$ for some $c, \delta > 0$ and for all $n$. Prove that $EX_\tau = EX_0$.

Exercise 5.2.4. Given $0 < p < 1$, consider i.i.d. random variables $(X_i)_{i\ge 1}$ such that $P(X_i = 1) = p$, $P(X_i = -1) = 1 - p$, and consider a random walk $S_0 = 0$, $S_{n+1} = S_n + X_{n+1}$. Consider two integers $a \le -1$ and $b \ge 1$ and define a stopping time
$$\tau = \min\{k \ge 1 : S_k = a \text{ or } b\}.$$
(a) Show that $P(\tau \ge n) \to 0$.
(b) Compute $E\tau$. Hint: for $p \ne 1/2$, use that $S_n - nEX_1$ is a martingale; for $p = 1/2$, use that $S_n^2 - n$ is a martingale.

Exercise 5.2.5. Let $S_n = X_1 + \dots + X_n$ be a simple symmetric random walk with $S_0 = 0$, i.e. the steps $(X_i)_{i\ge 1}$ are i.i.d. and $P(X_i = \pm 1) = 1/2$. Let $\tau = \min\{n \ge 5 : S_n = S_{n-5} + 5\}$.
(a) Is $\tau$ a stopping time?
(b) Compute $E\tau$. (Hint: if $\tau = n$ for $n \ge 6$, what must happen in the last six steps? How can one rewrite $P(\tau = n)$?)

Exercise 5.2.6. Let $X_0 = 0$, $P(X_i = \pm 1) = 1/2$ and $S_n = \sum_{k\le n} X_k$. Let $\tau = \min\bigl\{k : S_k \ge 0.95k + 100\bigr\}$. Prove that
$$E\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} I(\tau < \infty) = 1$$
for $\lambda$ large enough.

5.3 Doob’s inequalities and convergence of martingales

In this section, we will study convergence of martingales, and we begin by proving


two classical inequalities.

Theorem 5.3 (Doob's inequality). If $(X_n, \mathcal{B}_n)_{n\ge 1}$ is a submartingale and $Y_n = \max_{1\le k\le n} X_k$ then, for any $M > 0$,
$$P(Y_n \ge M) \le \frac{1}{M} EX_n I(Y_n \ge M) \le \frac{1}{M} EX_n^+. \tag{5.9}$$
Proof. Define a stopping time
$$\tau_1 = \begin{cases} \min\{k \le n : X_k \ge M\}, & \text{if such } k \text{ exists}, \\ n, & \text{otherwise}. \end{cases}$$
Let $\tau_2 = n$, so that $\tau_1 \le \tau_2$. By the optional stopping theorem, Theorem 5.2,
$$E(X_n\mid\mathcal{B}_{\tau_1}) = E(X_{\tau_2}\mid\mathcal{B}_{\tau_1}) \ge X_{\tau_1}.$$
Let us average this inequality over the set $A = \{Y_n = \max_{1\le k\le n} X_k \ge M\}$, which belongs to $\mathcal{B}_{\tau_1}$, because
$$A \cap \{\tau_1 \le k\} = \Bigl\{\max_{1\le i\le k\wedge n} X_i \ge M\Bigr\} \in \mathcal{B}_{k\wedge n} \subseteq \mathcal{B}_k.$$
On the event $A$, $X_{\tau_1} \ge M$ and, therefore,
$$EX_n I_A \ge EX_{\tau_1} I_A \ge MEI_A = MP(A).$$
This is precisely the first inequality in (5.9). The second inequality is obvious. ⊓⊔

Example 5.3.1 (Second Kolmogorov inequality). If $(X_i)$ are independent and $EX_i = 0$ then $S_n = \sum_{1\le i\le n} X_i$ is a martingale and $S_n^2$ is a submartingale. Therefore, by Doob's inequality,
$$P\Bigl(\max_{1\le k\le n} |S_k| \ge M\Bigr) = P\Bigl(\max_{1\le k\le n} S_k^2 \ge M^2\Bigr) \le \frac{1}{M^2} ES_n^2 = \frac{1}{M^2}\sum_{1\le k\le n} \mathrm{Var}(X_k).$$
This is a big improvement over Chebyshev's inequality, since we control the maximum $\max_{1\le k\le n} |S_k|$ instead of the single sum $|S_n|$. ⊓⊔
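Since every path of the simple symmetric walk of length $n$ has probability $2^{-n}$, the maximal inequality can be verified exactly by exhaustive enumeration. This small check (an illustration, not from the text) does exactly that.

```python
from itertools import product

# Exhaustive check of P(max_{k<=n} |S_k| >= M) <= sum_k Var(X_k) / M^2
# for the simple symmetric walk: enumerate all 2^n equally likely paths
# and compare the exact probability with the Kolmogorov bound n / M^2.

def kolmogorov_check(n, M):
    hit = 0
    for path in product([-1, 1], repeat=n):
        s, mx = 0, 0
        for x in path:
            s += x
            mx = max(mx, abs(s))
        if mx >= M:
            hit += 1
    prob = hit / 2 ** n
    bound = n / M ** 2  # Var(X_k) = 1 for each +/-1 step
    return prob, bound

prob, bound = kolmogorov_check(12, 4)
print(prob <= bound)  # True
```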

Doob's upcrossing inequality. Let $(X_n, \mathcal{B}_n)_{n\ge 1}$ be a submartingale. Given two real numbers $a < b$, we will define an increasing sequence of stopping times $(\tau_n)_{n\ge 1}$ at which $X_n$ crosses $a$ downward and $b$ upward. More specifically, we define these stopping times by
$$\tau_1 = \min\bigl\{n \ge 1 : X_n \le a\bigr\}, \qquad \tau_2 = \min\bigl\{n > \tau_1 : X_n \ge b\bigr\}$$
and, by induction, for $k \ge 2$,
$$\tau_{2k-1} = \min\bigl\{n > \tau_{2k-2} : X_n \le a\bigr\}, \qquad \tau_{2k} = \min\bigl\{n > \tau_{2k-1} : X_n \ge b\bigr\}.$$
Then, we consider the random variable
$$\nu(a, b, n) = \max\bigl\{k : \tau_{2k} \le n\bigr\},$$
which is the number of upward crossings of the interval $[a, b]$ before time $n$.

Theorem 5.4 (Doob's upcrossing inequality). If $(X_n, \mathcal{B}_n)$ is a submartingale then
$$E\nu(a, b, n) \le \frac{E(X_n - a)^+}{b - a}. \tag{5.10}$$
Proof. Since $x \to (x-a)^+$ is an increasing convex function, $Z_n = (X_n - a)^+$ is also a submartingale. Clearly, the number of upcrossings of the interval $[a, b]$ by $(X_n)$ can be expressed in terms of the number of upcrossings of $[0, b-a]$ by $(Z_n)$,
$$\nu_X(a, b, n) = \nu_Z(0, b-a, n),$$
which means that it is enough to prove (5.10) for nonnegative submartingales. From now on we can assume that $0 \le X_n$ and we would like to show that
$$E\nu(0, b, n) \le \frac{EX_n}{b}.$$
Let us define a sequence of random variables $\eta_j$ for $j \ge 1$ by
$$\eta_j = \begin{cases} 1, & \tau_{2k-1} < j \le \tau_{2k} \text{ for some } k, \\ 0, & \text{otherwise}, \end{cases}$$
i.e. $\eta_j$ is the indicator of the event that at time $j$ the process is crossing $[0, b]$ upward. Define $X_0 = 0$. Then, clearly,
$$b\nu(0, b, n) \le \sum_{j=1}^{n} \eta_j(X_j - X_{j-1}) = \sum_{j=1}^{n} I(\eta_j = 1)(X_j - X_{j-1}).$$
Since the $(\tau_k)$ are stopping times, the event
$$\{\eta_j = 1\} = \bigcup_{k\ge 1} \bigl\{\tau_{2k-1} < j \le \tau_{2k}\bigr\}
= \bigcup_{k\ge 1} \bigl\{\tau_{2k-1} \le j-1\bigr\} \cap \bigl\{\tau_{2k} \le j-1\bigr\}^c \in \mathcal{B}_{j-1},$$
i.e. the fact that at time $j$ we are crossing upward is determined completely by the sequence up to time $j-1$. Then
$$bE\nu(0, b, n) \le \sum_{j=1}^{n} EI(\eta_j = 1)(X_j - X_{j-1})
= \sum_{j=1}^{n} E\Bigl[I(\eta_j = 1)E\bigl(X_j - X_{j-1}\mid\mathcal{B}_{j-1}\bigr)\Bigr]
= \sum_{j=1}^{n} E\Bigl[I(\eta_j = 1)\bigl(E(X_j\mid\mathcal{B}_{j-1}) - X_{j-1}\bigr)\Bigr].$$
Since $(X_j, \mathcal{B}_j)$ is a submartingale, $E(X_j\mid\mathcal{B}_{j-1}) \ge X_{j-1}$ and
$$I(\eta_j = 1)\bigl(E(X_j\mid\mathcal{B}_{j-1}) - X_{j-1}\bigr) \le E(X_j\mid\mathcal{B}_{j-1}) - X_{j-1}.$$
Therefore,
$$bE\nu(0, b, n) \le \sum_{j=1}^{n} E\bigl(E(X_j\mid\mathcal{B}_{j-1}) - X_{j-1}\bigr) = \sum_{j=1}^{n} E(X_j - X_{j-1}) = EX_n.$$
This finishes the proof for nonnegative submartingales and, hence, the general case. ⊓⊔
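The upcrossing count $\nu(a,b,n)$ is easy to compute path by path, and inequality (5.10) can again be checked by exhaustive enumeration for the simple symmetric walk (a martingale, hence a submartingale). The function names below are my own, not from the text.

```python
from itertools import product

# Count upcrossings of [a, b] by a sequence, then verify Doob's upcrossing
# inequality E nu(a,b,n) <= E(X_n - a)^+ / (b - a) exactly by enumerating
# all 2^n equally likely +/-1 paths of the simple symmetric walk.

def upcrossings(xs, a, b):
    count, below = 0, False
    for x in xs:
        if not below and x <= a:
            below = True          # completed a downward crossing of a
        elif below and x >= b:
            count += 1            # completed an upward crossing to b
            below = False
    return count

def doob_upcrossing_check(n, a, b):
    total_nu, total_plus = 0, 0
    for path in product([-1, 1], repeat=n):
        xs, s = [], 0
        for step in path:
            s += step
            xs.append(s)
        total_nu += upcrossings(xs, a, b)
        total_plus += max(xs[-1] - a, 0)
    m = 2 ** n
    return total_nu / m, (total_plus / m) / (b - a)

mean_nu, bound = doob_upcrossing_check(12, -1, 1)
print(mean_nu <= bound)  # True
```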

We will now use Doob's upcrossing inequality to prove our main result about the convergence of submartingales, from which the convergence of martingales will follow.

Theorem 5.5. Suppose that $(X_n, \mathcal{B}_n)_{-\infty<n<\infty}$ is a submartingale.

1. The a.s. limit $X_{-\infty} = \lim_{n\to-\infty} X_n$ exists and $EX_{-\infty}^+ < +\infty$. If $\mathcal{B}_{-\infty} = \cap_n \mathcal{B}_n$ then $(X_n, \mathcal{B}_n)_{-\infty\le n<\infty}$ is a submartingale (i.e. we can add $n = -\infty$ to our index set).
2. If $\sup_n EX_n^+ < \infty$, then the a.s. limit $X_{+\infty} = \lim_{n\to+\infty} X_n$ exists and $EX_{+\infty}^+ < \infty$.
3. Moreover, if $(X_n^+)_{-\infty<n<\infty}$ is uniformly integrable then $(X_n, \mathcal{B}_n)_{-\infty<n\le+\infty}$ is a right-closed submartingale, where $\mathcal{B}_{+\infty} = \vee_n \mathcal{B}_n = \sigma\bigl(\cup_n \mathcal{B}_n\bigr)$.

Proof. Let us note that $X_n$ converges as $n \to \infty$ (here, $\infty = +\infty$ or $-\infty$) if and only if $\limsup X_n = \liminf X_n$. Therefore, $X_n$ diverges if and only if the following event occurs:
$$\bigl\{\limsup X_n > \liminf X_n\bigr\} = \bigcup_{a<b} \bigl\{\limsup X_n \ge b > a \ge \liminf X_n\bigr\},$$
where the union is taken over all rational numbers $a < b$. This means that the probability $P(X_n \text{ diverges}) > 0$ if and only if there exist rationals $a < b$ such that
$$P\bigl(\limsup X_n \ge b > a \ge \liminf X_n\bigr) > 0.$$
If we recall that $\nu(a, b, n)$ denotes the number of upcrossings of $[a, b]$, this is equivalent to $P\bigl(\lim_{n\to\infty}\nu(a, b, n) = \infty\bigr) > 0$.

Proof of 1. For each $n \ge 1$, let us define the sequence
$$Y_1 = X_{-n},\ Y_2 = X_{-n+1},\ \dots,\ Y_n = X_{-1}.$$
By Doob's upcrossing inequality,
$$E\nu_Y(a, b, n) \le \frac{E(Y_n - a)^+}{b - a} = \frac{E(X_{-1} - a)^+}{b - a} < \infty.$$
Since $0 \le \nu_Y(a, b, n) \uparrow \nu(a, b)$ almost surely as $n \to +\infty$, by the monotone convergence theorem,
$$E\nu(a, b) = \lim_{n\to+\infty} E\nu_Y(a, b, n) < \infty.$$
Therefore, $P(\nu(a, b) = \infty) = 0$, which means that the sequence $(X_n)_{n\le 0}$ cannot have infinitely many upcrossings of $[a, b]$ and, therefore, the limit
$$X_{-\infty} = \lim_{n\to-\infty} X_n$$
exists with probability one. By Fatou's lemma and the submartingale property,
$$EX_{-\infty}^+ \le \liminf_{n\to-\infty} EX_n^+ \le EX_{-1}^+ < \infty.$$
It remains to show that $(X_n, \mathcal{B}_n)_{-\infty\le n<\infty}$ is a submartingale. First of all, the limit $X_{-\infty} = \lim_{n\to-\infty} X_n$ is measurable with respect to $\mathcal{B}_n$ for all $n$ and, therefore, measurable with respect to $\mathcal{B}_{-\infty} = \cap_n\mathcal{B}_n$. Let us take a set $A \in \cap_n\mathcal{B}_n$. We would like to show that, for any $m$, $EX_{-\infty} I_A \le EX_m I_A$. Since $(X_n, \mathcal{B}_n)_{n\le 0}$ is a right-closed submartingale, by part 2 of Lemma 5.3 in Section 5.1, $(X_n \vee a)_{-\infty<n\le 0}$ is uniformly integrable for any $a \in \mathbb{R}$. Since $X_{-\infty} \vee a = \lim_{n\to-\infty} X_n \vee a$ almost surely, by Lemma 5.1 in Section 5.1,
$$E(X_{-\infty} \vee a) I_A = \lim_{n\to-\infty} E(X_n \vee a) I_A.$$
Finally, since the function $x \to x \vee a$ is convex and non-decreasing, $(X_n \vee a, \mathcal{B}_n)_{n<\infty}$ is a submartingale by Lemma 5.2 in Section 5.1 and, therefore, for any $n \le m$, we have $E(X_n \vee a) I_A \le E(X_m \vee a) I_A$ since $A \in \mathcal{B}_n$. This proves that $E(X_{-\infty} \vee a) I_A \le E(X_m \vee a) I_A$. Letting $a \downarrow -\infty$, by the monotone convergence theorem, $EX_{-\infty} I_A \le EX_m I_A$.
Proof of 2. If $\nu(a, b, n)$ is the number of upcrossings of $[a, b]$ by the sequence $X_1,\dots,X_n$ then, by Doob's upcrossing inequality,
$$(b - a)E\nu(a, b, n) \le E(X_n - a)^+ \le |a| + EX_n^+ \le K < \infty.$$
Therefore, $E\nu(a, b) < \infty$ for $\nu(a, b) = \lim_{n\to\infty}\nu(a, b, n)$ and, as above, the limit $X_{+\infty} = \lim_{n\to+\infty} X_n$ exists. Since all $X_n$ are measurable with respect to $\mathcal{B}_{+\infty}$, so is $X_{+\infty}$. Finally, by Fatou's lemma,
$$EX_{+\infty}^+ \le \liminf_{n\to\infty} EX_n^+ < \infty.$$
Notice that a different assumption, $\sup_n E|X_n| < \infty$, would similarly imply that $E|X_{+\infty}| \le \liminf_{n\to\infty} E|X_n| < \infty$.
Proof of 3. By part 2, the limit $X_{+\infty}$ exists. We want to show that for any $m$ and for any set $A \in \mathcal{B}_m$,
$$EX_m I_A \le EX_{+\infty} I_A.$$
We already mentioned above that $(X_n \vee a)$ is a submartingale and, therefore, $E(X_m \vee a) I_A \le E(X_n \vee a) I_A$ for $m \le n$. Clearly, if $(X_n^+)_{-\infty<n<\infty}$ is uniformly integrable then $(X_n \vee a)_{-\infty<n<\infty}$ is uniformly integrable for all $a \in \mathbb{R}$. Since $X_n \vee a \to X_{+\infty} \vee a$ almost surely, by Lemma 5.1,
$$\lim_{n\to\infty} E(X_n \vee a) I_A = E(X_{+\infty} \vee a) I_A,$$
and this shows that $E(X_m \vee a) I_A \le E(X_{+\infty} \vee a) I_A$. Letting $a \to -\infty$ and using the monotone convergence theorem, we get $EX_m I_A \le EX_{+\infty} I_A$. ⊓⊔
Convergence of martingales. If $(X_n, \mathcal{B}_n)$ is a martingale then both $(X_n, \mathcal{B}_n)$ and $(-X_n, \mathcal{B}_n)$ are submartingales, and one can apply the above theorem to both of them. For example, this implies the following.

1. If $\sup_n E|X_n| < \infty$ then the a.s. limit $X_{+\infty} = \lim_{n\to+\infty} X_n$ exists and $E|X_{+\infty}| < \infty$.
2. If $(X_n)_{-\infty<n<\infty}$ is uniformly integrable then $(X_n, \mathcal{B}_n)_{-\infty<n\le+\infty}$ is a right-closed martingale.

In other words, a martingale is right-closable if and only if it is uniformly integrable. Of course, in this case we also conclude that $\lim_{n\to+\infty} E|X_n - X_{+\infty}| = 0$. For a martingale of the type $E(X\mid\mathcal{B}_n)$ we can identify the limit as follows.
Theorem 5.6 (Lévy's convergence theorem). Let $(\Omega, \mathcal{B}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ a random variable such that $E|X| < \infty$. Given a sequence of $\sigma$-algebras
$$\mathcal{B}_1 \subseteq \dots \subseteq \mathcal{B}_n \subseteq \dots \subseteq \mathcal{B}_{+\infty} \subseteq \mathcal{B},$$
where $\mathcal{B}_{+\infty} = \sigma\bigl(\bigcup_{1\le n<\infty} \mathcal{B}_n\bigr)$, we have $X_n = E(X\mid\mathcal{B}_n) \to E(X\mid\mathcal{B}_{+\infty})$ a.s.

Proof. $(X_n, \mathcal{B}_n)_{1\le n<\infty}$ is a right-closable martingale, because $X_n = E(X\mid\mathcal{B}_n)$. Therefore, it is uniformly integrable and the a.s. limit $X_{+\infty} := \lim_{n\to+\infty} E(X\mid\mathcal{B}_n)$ exists. It remains to show that $X_{+\infty} = E(X\mid\mathcal{B}_{+\infty})$. Consider $A \in \bigcup_n \mathcal{B}_n$. Since $A \in \mathcal{B}_m$ for some $m$ and $(X_n)$ is a martingale, $EX_n I_A = EX_m I_A$ for $n \ge m$. Therefore,
$$EX_{+\infty} I_A = \lim_{n\to\infty} EX_n I_A = EX_m I_A = E\bigl(E(X\mid\mathcal{B}_m) I_A\bigr) = EXI_A.$$
Since $\bigcup_n \mathcal{B}_n$ is an algebra, we get $EX_{+\infty} I_A = EXI_A$ for all $A \in \mathcal{B}_{+\infty} = \sigma(\cup_n \mathcal{B}_n)$ (by Dynkin's theorem or the monotone class theorem), which proves that $X_{+\infty} = E(X\mid\mathcal{B}_{+\infty})$. ⊓⊔
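A concrete instance of the theorem, with an assumed $f$ that is not from the text: for $X = f(U)$ with $U$ uniform on $[0,1]$ and $\mathcal{B}_n$ generated by the dyadic intervals of level $n$, $E(X\mid\mathcal{B}_n)$ is the average of $f$ over the dyadic interval containing $U$, and the $L^1$ error $E|E(X\mid\mathcal{B}_n) - X|$ can be computed numerically and seen to shrink as $n$ grows.

```python
# Levy convergence on dyadic sigma-algebras: E(X | B_n) averages f over
# dyadic intervals of level n.  We approximate E|E(X|B_n) - X| by a
# Riemann sum on a fine uniform grid of [0,1].

def l1_error(f, n, grid=2 ** 16):
    pts = [(i + 0.5) / grid for i in range(grid)]
    vals = [f(t) for t in pts]
    block = grid // 2 ** n  # grid points per dyadic interval of level n
    err = 0.0
    for k in range(2 ** n):
        chunk = vals[k * block:(k + 1) * block]
        avg = sum(chunk) / block      # conditional expectation on the interval
        err += sum(abs(v - avg) for v in chunk) / grid
    return err

f = lambda x: x * x  # an assumed test function
errors = [l1_error(f, n) for n in (1, 3, 5, 7)]
print(all(errors[i + 1] < errors[i] for i in range(3)))  # True
```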
Example 5.3.2 (Strong law of large numbers). Let $(X_n)_{n\ge 1}$ be i.i.d. such that $E|X_1| < \infty$, and let
$$\mathcal{B}_{-n} = \sigma(S_n, X_{n+1}, \dots), \qquad S_n = X_1 + \dots + X_n.$$
We showed before that
$$Z_{-n} = \frac{S_n}{n} = E(X_1\mid\mathcal{B}_{-n}),$$
so $(Z_{-n}, \mathcal{B}_{-n})$ is a reverse martingale and, by the above theorem, the almost sure limit $Z_{-\infty} = \lim_{n\to+\infty} Z_{-n}$ exists and is measurable with respect to the $\sigma$-algebra $\mathcal{B}_{-\infty} = \cap_{n\ge 1}\mathcal{B}_{-n}$. Since each event in $\mathcal{B}_{-\infty}$ is obviously symmetric, i.e. invariant under permutations of finitely many of the $(X_n)$, by the Hewitt-Savage 0 - 1 law the probability of each such event is 0 or 1, i.e. $\mathcal{B}_{-\infty}$ consists of $\emptyset$ and $\Omega$ up to sets of measure zero. Therefore, $Z_{-\infty}$ is a constant a.s. and, since $(Z_n)_{-\infty\le n\le -1}$ is also a martingale, $EZ_{-\infty} = EX_1$. Therefore, we proved that $S_n/n \to EX_1$ almost surely.

Example 5.3.3 (Improved Kolmogorov's law of large numbers). Consider a sequence $(X_n)_{n\ge 1}$ such that
$$E(X_{n+1}\mid X_1,\dots,X_n) = 0.$$
We do not assume independence here. This assumption obviously implies that $EX_kX_\ell = 0$ for $k \ne \ell$. Let us show that if a sequence $(b_n)$ is such that
$$b_n \le b_{n+1}, \qquad \lim_{n\to\infty} b_n = \infty \qquad\text{and}\qquad \sum_{n=1}^{\infty} \frac{EX_n^2}{b_n^2} < \infty$$
then $S_n/b_n \to 0$ almost surely. Indeed, $Y_n = \sum_{k\le n}(X_k/b_k)$ is a martingale and $(Y_n)$ is uniformly integrable, since
$$E|Y_n|I(|Y_n| > M) \le \frac{1}{M}E|Y_n|^2 = \frac{1}{M}\sum_{k=1}^{n}\frac{EX_k^2}{b_k^2}$$
(we used here that $EX_kX_\ell = 0$ for $k \ne \ell$) and, therefore,
$$\sup_n E|Y_n|I(|Y_n| > M) \le \frac{1}{M}\sum_{k=1}^{\infty}\frac{EX_k^2}{b_k^2} \to 0$$
as $M \to \infty$. By the martingale convergence theorem, the limit $Y = \lim_{n\to\infty} Y_n$ exists almost surely, and Kronecker's lemma implies that $S_n/b_n \to 0$ almost surely. ⊓⊔

Exercise 5.3.1. Let $(X_n)_{n\ge 1}$ be a martingale with $EX_n = 0$ and $EX_n^2 < \infty$ for all $n$. Show that
$$P\Bigl(\max_{1\le k\le n} X_k \ge M\Bigr) \le \frac{EX_n^2}{EX_n^2 + M^2}.$$
Hint: $(X_n + c)^2$ is a submartingale for any $c > 0$.

Exercise 5.3.2. Let $(g_i)$ be i.i.d. standard normal random variables and let $X_n = \sum_{i=1}^{n}\sigma_i^2(g_i^2 - 1)$ for some numbers $(\sigma_i)$. Prove that for $t > 0$,
$$P\Bigl(\max_{1\le i\le n}|X_i| \ge t\Bigr) \le 2t^{-2}\sum_{i=1}^{n}\sigma_i^4.$$

Exercise 5.3.3. Show that for any random variable $Y$,
$$E|Y|^p = \int_0^{\infty} pt^{p-1}P(|Y| \ge t)\, dt.$$
Hint: represent $|Y|^p = \int_0^{|Y|} pt^{p-1}\, dt$ and switch the order of integration.

Exercise 5.3.4. Let $X, Y$ be two non-negative random variables such that for every $t > 0$,
$$P(Y \ge t) \le t^{-1}\int XI(Y \ge t)\, dP.$$
For any $p > 1$, $\|f\|_p = (\int |f|^p\, dP)^{1/p}$ and $1/p + 1/q = 1$, show that $\|Y\|_p \le q\|X\|_p$. Hint: use the previous exercise, switch the order of integration to integrate in $t$ first, then use Hölder's inequality and solve for $\|Y\|_p$.

Exercise 5.3.5. Given a non-negative submartingale $(X_n, \mathcal{B}_n)$, let $X_n^* := \max_{j\le n} X_j$. Prove that for any $p > 1$ and $1/p + 1/q = 1$, $\|X_n^*\|_p \le q\|X_n\|_p$. Hint: use the previous exercise and Doob's maximal inequality.

Exercise 5.3.6 (Polya's urn model). Suppose we have $b$ blue and $r$ red balls in an urn. We pick a ball uniformly at random and return it together with $c$ balls of the same color. Let us consider the sequence
$$Y_n = \frac{\#(\text{blue balls after } n \text{ iterations})}{\#(\text{total balls after } n \text{ iterations})}.$$
Prove that the limit $Y = \lim_{n\to\infty} Y_n$ exists almost surely.

Exercise 5.3.7 (Improved Kolmogorov 0 - 1 law). Consider arbitrary random variables $(X_i)_{i\ge 1}$ and the $\sigma$-algebras $\mathcal{B}_n = \sigma(X_1,\dots,X_n)$ and $\mathcal{B}_{+\infty} = \sigma((X_n)_{n\ge 1})$. Consider any set $A \in \mathcal{B}_{+\infty}$ that is independent of $\mathcal{B}_n$ for all $n$ (also called a tail event). By considering the conditional expectations $E(I_A\mid\mathcal{B}_n)$, prove that $P(A) = 0$ or $1$.

Exercise 5.3.8. Suppose that a random variable $\xi$ takes values $0, 1, 2, 3, \ldots$. Suppose that $E\xi = \mu > 1$ and $\mathrm{Var}(\xi) = \sigma^2 < \infty$. Let $\xi_{nk}$ be independent copies of $\xi$. Let $X_1 = 1$ and define recursively $X_{n+1} = \sum_{k=1}^{X_n} \xi_{nk}$. Prove that $S_n = X_n/\mu^n$ converges almost surely.
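A quick Monte Carlo check of this exercise: with Poisson offspring of mean $\mu = 2$ (an illustrative choice with $E\xi = \mu > 1$ and finite variance; the exercise fixes no particular distribution), the normalized population $X_n/\mu^n$ keeps a constant mean, consistent with it being a convergent martingale:

```python
import math
import random

def poisson_sample(lam, rng):
    """Poisson(lam) via Knuth's multiplication method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def martingale_value(mu, steps, rng):
    """Start from one individual, apply `steps` rounds of the recursion
    X_{n+1} = sum of X_n offspring counts, and return X / mu^steps,
    whose expectation is exactly 1."""
    x = 1
    for _ in range(steps):
        x = sum(poisson_sample(mu, rng) for _ in range(x))
        if x == 0:  # extinction: the value stays 0 from now on
            break
    return x / mu ** steps

rng = random.Random(0)
vals = [martingale_value(2.0, 8, rng) for _ in range(1500)]
print(sum(vals) / len(vals))  # stays near 1, the constant martingale mean
```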

Exercise 5.3.9. Let $(X_n)_{n\ge 1}$ be i.i.d. random variables and, given a bounded measurable function $f: \mathbb{R}^2 \to \mathbb{R}$ such that $f(x,y) = f(y,x)$, consider the U-statistics
$$S_n = \frac{2}{n(n-1)}\sum_{1\le \ell < \ell' \le n} f(X_\ell, X_{\ell'})$$
and the $\sigma$-algebras
$$\mathcal{F}_n = \sigma\big(X_{(1)},\ldots,X_{(n)}, (X_\ell)_{\ell > n}\big),$$
where $X_{(1)},\ldots,X_{(n)}$ are the order statistics of $X_1,\ldots,X_n$. Prove that $\lim_{n\to\infty} S_n = Ef(X_1,X_2)$ almost surely. Hint: notice that $\bigcap_{n\ge 1}\mathcal{F}_n$ consists of symmetric events.

Exercise 5.3.10. Consider two bounded measurable functions $f(x,y), g(x,y)$ on $[0,1]^2$. Suppose that for almost all $(x,y)$ on $[0,1]^2$ we have $\int_0^1 f(x,u) g(y,u)\, du \le 1$. Prove that $\int_0^1 f(x,u) g(x,u)\, du \le 1$ for almost all $x$ on $[0,1]$. ("For almost all" in both cases means with respect to the Lebesgue measure.)

Chapter 6
Brownian Motion

6.1 Stochastic processes. Brownian motion

We have developed a general theory of convergence of laws on (separable) metric spaces, and in the following two sections we will look at some specific examples of convergence on the spaces of continuous functions $(C[0,1], \|\cdot\|_\infty)$ and $(C(\mathbb{R}_+), d)$, where $d$ is a metric metrizing uniform convergence on compacts. These examples will describe certain central limit theorem type results on these spaces, and in this section we will define the corresponding limiting Gaussian laws, namely the Brownian motion and the Brownian bridge. We will start with basic definitions and basic regularity results in the presence of continuity. Given a set $T$ and a probability space $(\Omega, \mathcal{F}, P)$, a stochastic process is a function
$$X_t(\omega) = X(t,\omega) : T \times \Omega \to \mathbb{R}$$
such that for each $t \in T$, $X_t : \Omega \to \mathbb{R}$ is a random variable, i.e. a measurable function. In other words, a stochastic process is a collection of random variables $X_t$ indexed by a set $T$. A stochastic process is often defined by specifying finite dimensional (f.d.) distributions $P_F = \mathcal{L}(\{X_t\}_{t\in F})$ for all finite subsets $F \subseteq T$. Kolmogorov's theorem then guarantees the existence of a probability space on which the process is defined, under the natural consistency condition
$$F_1 \subseteq F_2 \implies P_{F_2}\big|_{\mathbb{R}^{F_1}} = P_{F_1}.$$
One can also think of a process as a function on $\Omega$ with values in $\mathbb{R}^T = \{f : T \to \mathbb{R}\}$, because for a fixed $\omega \in \Omega$, $X_t(\omega) \in \mathbb{R}^T$ is a (random) function of $t$. In Kolmogorov's theorem, given a family of consistent f.d. distributions, a process was defined on the probability space $(\mathbb{R}^T, \mathcal{B}_T)$, where $\mathcal{B}_T$ is the cylindrical $\sigma$-algebra generated by the algebra of cylinders $B \times \mathbb{R}^{T\setminus F}$ for Borel sets $B$ in $\mathbb{R}^F$ and all finite $F$. When $T$ is uncountable, some very natural sets such as
$$\Big\{\omega : \sup_{t\in T} X_t > 1\Big\} = \bigcup_{t\in T}\{X_t > 1\}$$

might not be measurable with respect to $\mathcal{B}_T$. However, in our examples we will deal with continuous processes that possess additional regularity properties. If $(T,d)$ is a metric space, then a process $X_t$ is called sample continuous if for all $\omega \in \Omega$, $X_t(\omega) \in C(T,d)$, the space of continuous functions on $(T,d)$. The process $X_t$ is called continuous in probability if $X_t \to X_{t_0}$ in probability whenever $t \to t_0$. Let us note that sample continuity is not a property that follows automatically if we know all finite dimensional distributions of the process.

Example 6.1.1. Let $T = [0,1]$ and $(\Omega, P) = ([0,1], \lambda)$, where $\lambda$ is the Lebesgue measure. Let $X_t(\omega) = I(t = \omega)$ and $X_t'(\omega) = 0$. Finite dimensional distributions of these processes are the same, because for any fixed $t \in [0,1]$,
$$P(X_t = 0) = P(X_t' = 0) = 1.$$
However, $P(X_t \text{ is continuous}) = 0$, while for $X_t'$ this probability is $1$. ⊓⊔

Let $(T,d)$ be a metric space. The process $X_t$ is measurable if
$$X_t(\omega) : T \times \Omega \to \mathbb{R}$$
is jointly measurable on the product space $(T,\mathcal{B}) \times (\Omega,\mathcal{F})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra on $T$.

Lemma 6.1. If $(T,d)$ is a separable metric space and $X_t$ is sample continuous then $X_t$ is measurable.

Proof. For each $n \ge 1$, let $(S_j)_{j\ge 1}$ be a measurable partition of $T$ such that $\mathrm{diam}(S_j) \le \frac{1}{n}$. For each non-empty $S_j$, let us take a point $t_j \in S_j$ and define
$$X_t^n(\omega) = X_{t_j}(\omega) \quad \text{for } t \in S_j.$$
$X_t^n(\omega)$ is, obviously, measurable on $T \times \Omega$, because for any Borel set $A$ on $\mathbb{R}$,
$$\big\{(\omega,t) : X_t^n(\omega) \in A\big\} = \bigcup_{j\ge 1} S_j \times \big\{\omega : X_{t_j}(\omega) \in A\big\}.$$
Since $X_t(\omega)$ is sample continuous, $X_t^n(\omega) \to X_t(\omega)$ as $n \to \infty$ for all $(\omega,t)$. Hence, $X_t$ is also measurable. ⊓⊔

If $X_t$ is a sample continuous process indexed by $T$, then we can think of $X_t$ as an element of the metric space of continuous functions $(C(T,d), \|\cdot\|_\infty)$, rather than simply an element of $\mathbb{R}^T$. We can define measurable events on this space in two different ways. On the one hand, we have the natural Borel $\sigma$-algebra $\mathcal{B}$ on $C(T,d)$ generated by the open (or closed) balls
$$B_g(\varepsilon) = \big\{ f \in C(T,d) : \|f - g\|_\infty < \varepsilon \big\}.$$

On the other hand, if we think of $C(T,d)$ as a subspace of $\mathbb{R}^T$, we can consider the $\sigma$-algebra
$$S_T = \big\{ B \cap C(T,d) : B \in \mathcal{B}_T \big\},$$
which is the intersection of the cylindrical $\sigma$-algebra $\mathcal{B}_T$ with $C(T,d)$. (Recall that, by an exercise in Section 1.3, $C(T,d) \notin \mathcal{B}_T$ when $T$ is uncountable.) It turns out that these two definitions coincide if $(T,d)$ is separable. An important implication of this is that the distribution of any random element with values in $(C(T,d), \|\cdot\|_\infty)$ is completely determined by its finite dimensional distributions.

Lemma 6.2. If $(T,d)$ is a separable metric space then $\mathcal{B} = S_T$.
Proof. Let us first show that $S_T \subseteq \mathcal{B}$. Any element of the cylindrical algebra that generates the cylindrical $\sigma$-algebra $\mathcal{B}_T$ is of the form
$$B \times \mathbb{R}^{T\setminus F} \quad \text{for a finite } F \subseteq T \text{ and some Borel set } B \subseteq \mathbb{R}^F.$$
Then
$$\big(B \times \mathbb{R}^{T\setminus F}\big) \cap C(T,d) = \big\{ x \in C(T,d) : (x_t)_{t\in F} \in B \big\} = \big\{ \pi_F(x) \in B \big\},$$
where $\pi_F : C(T,d) \to \mathbb{R}^F$ is the finite dimensional projection defined by $\pi_F(x) = (x_t)_{t\in F}$. The projection $\pi_F$ is, obviously, continuous in the $\|\cdot\|_\infty$ norm and, therefore, measurable with respect to the Borel $\sigma$-algebra $\mathcal{B}$ generated by the open sets in the $\|\cdot\|_\infty$ norm. This implies that $\{\pi_F(x) \in B\} \in \mathcal{B}$ and, thus, $S_T \subseteq \mathcal{B}$. Let us now show that $\mathcal{B} \subseteq S_T$. Let $T'$ be a countable dense subset of $T$. Then, by continuity, any closed $\varepsilon$-ball in $C(T,d)$ can be written as
$$\big\{ f \in C(T,d) : \|f - g\|_\infty \le \varepsilon \big\} = \bigcap_{t\in T'}\big\{ f \in C(T,d) : |f(t) - g(t)| \le \varepsilon \big\} \in S_T.$$
This finishes the proof. ⊓⊔

In the remainder of the section we will define two specific sample continuous stochastic processes.

Brownian motion. Brownian motion is defined as a sample continuous process $X_t$ on $T = \mathbb{R}_+$ such that
(a) the distribution of $X_t$ is centered Gaussian for each $t \ge 0$;
(b) $X_0 = 0$ and $EX_1^2 = 1$;
(c) if $t < s$ then $\mathcal{L}(X_s - X_t) = \mathcal{L}(X_{s-t})$;
(d) for any $t_1 < \ldots < t_n$, the increments $X_{t_1} - X_0, \ldots, X_{t_n} - X_{t_{n-1}}$ are independent.
If we denote $\sigma^2(t) = \mathrm{Var}(X_t)$ then these properties imply
$$\sigma^2(nt) = n\sigma^2(t), \quad \sigma^2\Big(\frac{t}{m}\Big) = \frac{1}{m}\sigma^2(t), \quad \text{and} \quad \sigma^2(qt) = q\sigma^2(t)$$
for all rational $q$. Since $\sigma^2(1) = 1$, $\sigma^2(q) = q$ for all rational $q$ and, by sample continuity, $\sigma^2(t) = t$ for all $t \ge 0$. For any $t_1 < \ldots < t_n$, the increments $X_{t_{j+1}} - X_{t_j}$ are independent Gaussian $N(0, t_{j+1} - t_j)$ and, therefore, jointly Gaussian. This implies that the vector $(X_{t_1},\ldots,X_{t_n})$ also has a jointly Gaussian distribution, since it can be written as a linear map of the increments. The covariance matrix can be easily computed, since for $s < t$,
$$EX_s X_t = EX_s\big(X_s + (X_t - X_s)\big) = s = \min(t,s).$$
As a result, we can give an equivalent definition: Brownian motion is a sample continuous centered Gaussian process $X_t$, $t \in [0,\infty)$, with the covariance $\mathrm{Cov}(X_t, X_s) = \min(t,s)$. (A process is Gaussian if all its finite dimensional distributions are Gaussian.)
Without the requirement of sample continuity, the existence of such a process follows from Kolmogorov's theorem, since all finite dimensional distributions are consistent by construction. However, we still need to prove that there exists a sample continuous version with these prescribed finite dimensional distributions. We start with a simple estimate.

Lemma 6.3. The tail probability of the standard normal distribution satisfies
$$\Phi(c) = \frac{1}{\sqrt{2\pi}}\int_c^\infty e^{-x^2/2}\, dx \le e^{-c^2/2}$$
for all $c \ge 0$.

Proof. We can write
$$\Phi(c) = \frac{1}{\sqrt{2\pi}}\int_c^\infty e^{-x^2/2}\, dx \le \frac{1}{\sqrt{2\pi}}\int_c^\infty \frac{x}{c}\, e^{-x^2/2}\, dx = \frac{1}{c\sqrt{2\pi}}\, e^{-c^2/2}.$$
If $c > 1/\sqrt{2\pi}$ then $\Phi(c) \le \exp(-c^2/2)$. If $c \le 1/\sqrt{2\pi}$ then a simpler estimate gives the result,
$$\Phi(c) \le \Phi(0) = \frac{1}{2} \le \exp\Big(-\frac{1}{2}\Big(\frac{1}{\sqrt{2\pi}}\Big)^2\Big) \le e^{-c^2/2},$$
which finishes the proof. ⊓⊔
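The inequality of Lemma 6.3 can be sanity-checked numerically, since the exact Gaussian tail is expressible through the complementary error function (a quick check, not part of the proof):

```python
import math

def gaussian_tail(c):
    """P(g >= c) for a standard normal g, via Phi-bar(c) = erfc(c/sqrt(2))/2."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def lemma_bound(c):
    """The bound exp(-c^2/2) from Lemma 6.3."""
    return math.exp(-c * c / 2.0)

for c in [0.0, 0.2, 1.0, 3.0]:
    assert gaussian_tail(c) <= lemma_bound(c)
```

The sharper intermediate bound $e^{-c^2/2}/(c\sqrt{2\pi})$ from the first step of the proof can be checked the same way for $c > 1/\sqrt{2\pi}$.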

Theorem 6.1 (Existence of Brownian motion). There exists a sample continuous Gaussian process with the covariance $\mathrm{Cov}(X_t, X_s) = \min(t,s)$.

Proof. It is enough to construct $X_t$ on the interval $[0,1]$. Given a process $X_t$ that has the finite dimensional distributions of the Brownian process, but is not necessarily continuous, let us define for $n \ge 1$,
$$V_k = X_{\frac{k+1}{2^n}} - X_{\frac{k}{2^n}} \quad \text{for } k = 0,\ldots,2^n - 1.$$
The variance $\mathrm{Var}(V_k) = 1/2^n$ and, by the above lemma,
$$P\Big(\max_k |V_k| \ge \frac{1}{n^2}\Big) \le 2^n\, P\Big(|V_1| \ge \frac{1}{n^2}\Big) \le 2^{n+1}\exp\Big(-\frac{2^{n-1}}{n^4}\Big).$$
The right hand side is summable over $n \ge 1$ and, by the Borel–Cantelli lemma,
$$P\Big(\Big\{\max_k |V_k| \ge \frac{1}{n^2}\Big\} \text{ i.o.}\Big) = 0. \tag{6.1}$$
Given $t \in [0,1]$ and its dyadic expansion $t = \sum_{j=1}^\infty \frac{t_j}{2^j}$ for $t_j \in \{0,1\}$, let us define $t(n) = \sum_{j=1}^n \frac{t_j}{2^j}$ so that
$$X_{t(n)} - X_{t(n-1)} \in \{0\} \cup \{V_k : k = 0,\ldots,2^n - 1\}.$$
Then the sequence
$$X_{t(n)} = 0 + \sum_{1\le j\le n}\big(X_{t(j)} - X_{t(j-1)}\big)$$
converges almost surely to some limit $Z_t$ because, by (6.1), with probability one,
$$|X_{t(n)} - X_{t(n-1)}| \le \frac{1}{n^2}$$
for large enough (random) $n \ge n_0(\omega)$. By construction, $Z_t = X_t$ on the dense subset of all dyadic $t \in [0,1]$. If we can prove that $Z_t$ is sample continuous then all f.d. distributions of $Z_t$ and $X_t$ will coincide, which means that $Z_t$ is a continuous version of the Brownian motion. Take any $t,s \in [0,1]$ such that $|t-s| \le 2^{-n}$. If $t(n) = \frac{k}{2^n}$ and $s(n) = \frac{m}{2^n}$, then $|k-m| \in \{0,1\}$. As a result, $|X_{t(n)} - X_{s(n)}|$ is either equal to $0$ or to one of the increments $|V_k|$ and, by (6.1), $|X_{t(n)} - X_{s(n)}| \le n^{-2}$ for large enough $n$. Finally,
$$|Z_t - Z_s| \le |Z_t - X_{t(n)}| + |X_{t(n)} - X_{s(n)}| + |X_{s(n)} - Z_s| \le \sum_{\ell\ge n}\frac{1}{\ell^2} + \frac{1}{n^2} + \sum_{\ell\ge n}\frac{1}{\ell^2} \le \frac{c}{n},$$
which proves the continuity of $Z_t$. On the event in (6.1) of probability zero we set $Z_t = 0$. ⊓⊔

Definition 6.1. A sample continuous centered Gaussian process $B_t$, $t \in [0,1]$, is called a Brownian bridge if
$$EB_s B_t = s(1-t) \quad \text{for } s < t.$$
Such a process exists because if $X_t$ is a Brownian motion then $B_t = X_t - tX_1$ is a Brownian bridge, since for $s < t$,
$$EB_s B_t = E(X_t - tX_1)(X_s - sX_1) = s - st - ts + ts = s(1-t).$$
Notice that $B_0 = B_1 = 0$. ⊓⊔
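The covariance identity is easy to test by Monte Carlo: simulate a discretized Brownian path, set $B_t = W_t - tW_1$, and average $B_sB_t$ over paths. A rough sketch (grid and sample sizes are arbitrary choices):

```python
import math
import random

def brownian_path(n_steps, rng):
    """Discretized Brownian motion on [0,1]: W_{k/n} as a sum of k
    independent N(0, 1/n) increments."""
    w, path = 0.0, [0.0]
    scale = 1.0 / math.sqrt(n_steps)
    for _ in range(n_steps):
        w += rng.gauss(0.0, 1.0) * scale
        path.append(w)
    return path

def bridge_cov_estimate(s_idx, t_idx, n_steps=100, n_paths=4000, seed=0):
    """Monte Carlo estimate of E B_s B_t for B_t = W_t - t W_1,
    with s = s_idx/n_steps and t = t_idx/n_steps."""
    rng = random.Random(seed)
    s, t = s_idx / n_steps, t_idx / n_steps
    total = 0.0
    for _ in range(n_paths):
        path = brownian_path(n_steps, rng)
        b_s = path[s_idx] - s * path[-1]
        b_t = path[t_idx] - t * path[-1]
        total += b_s * b_t
    return total / n_paths

# s = 0.25, t = 0.75: the theoretical covariance is s(1 - t) = 0.0625.
print(bridge_cov_estimate(25, 75))
```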

Exercise 6.1.1 (Ciesielski's construction of Brownian motion). Let $(f_{k2^{-n}})$ be the Haar basis on $(L_2[0,1], dx)$, i.e. $f_0 = 1$ and, for $n \ge 1$ and $k \in I_n = \{k \text{ odd},\ 1 \le k \le 2^n - 1\}$,
$$f_{k2^{-n}}(x) = \begin{cases} 2^{(n-1)/2}, & (k-1)2^{-n} \le x < k2^{-n},\\ -2^{(n-1)/2}, & k2^{-n} \le x < (k+1)2^{-n},\end{cases}$$
and $0$ otherwise. If $(g_{k2^{-n}})$ are i.i.d. standard Gaussian random variables, prove that
$$W_t = g_0 t + \sum_{n\ge 1}\sum_{k\in I_n} g_{k2^{-n}}\int_0^t f_{k2^{-n}}(x)\, dx$$
is a Brownian motion on $[0,1]$. Hint: to show that $EW_t W_s = \min(t,s)$, use Parseval's identity. To prove continuity, show that
$$\varepsilon_n = \sup_t\Big|\sum_{k\in I_n} g_{k2^{-n}}\int_0^t f_{k2^{-n}}(x)\, dx\Big| = 2^{-(n+1)/2}\max_{k\in I_n}|g_{k2^{-n}}|$$
satisfies $P\big(\varepsilon_n \ge 2\sqrt{2^{-n}\ln 2^n}\big) \le 8^{-n}$ and use the Borel–Cantelli lemma.
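The Parseval computation in the hint can be verified mechanically: $\int_0^t f_{k2^{-n}}(x)\,dx$ is a triangular "tent" peaking at $k/2^n$, and by independence the truncated series has variance $t^2$ plus the sum of squared tent values at $t$, which equals $t$ exactly at dyadic $t$ once the expansion level of $t$ is included. A small sketch:

```python
def schauder(n, k, t):
    """Integral from 0 to t of the Haar function f_{k 2^-n}: a tent of
    height 2^{-(n+1)/2} peaking at k/2^n on [(k-1)/2^n, (k+1)/2^n]."""
    left, peak, right = (k - 1) / 2 ** n, k / 2 ** n, (k + 1) / 2 ** n
    height = 2.0 ** (-(n + 1) / 2)
    if t <= left or t >= right:
        return 0.0
    if t <= peak:
        return height * (t - left) / (peak - left)
    return height * (right - t) / (right - peak)

def truncated_variance(t, levels):
    """Var W_t for the series truncated at the given level: by independence
    of the Gaussians it equals t^2 plus the sum of schauder(n, k, t)^2 over
    odd k at each level.  Parseval's identity says the full series gives t."""
    total = t * t
    for n in range(1, levels + 1):
        for k in range(1, 2 ** n, 2):
            total += schauder(n, k, t) ** 2
    return total
```

At a dyadic point such as $t = 3/8$ the sum is exactly $t$ once three levels are included; at non-dyadic $t$ the truncation error decays geometrically.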

Exercise 6.1.2. Let $W_t$ be a Brownian motion. Prove that
1. (symmetry) $\{-W_t\}_{t\in[0,\infty)}$ is a Brownian motion;
2. (invariance under scaling) for every $\lambda > 0$, $\{\frac{1}{\sqrt{\lambda}}W_{\lambda t}\}_{t\in[0,\infty)}$ is a Brownian motion;
3. (simple Markov property) for every $s \ge 0$, $\{W_{s+t} - W_s\}_{t\in[0,\infty)}$ is a Brownian motion independent of $\sigma(\{W_r\}_{r\in[0,s]})$;
4. (time reversal) if $\{W_t\}_{t\in[0,T]}$ is a Brownian motion, then $\{W_T - W_{T-t}\}_{t\in[0,T]}$ is a Brownian motion.

Exercise 6.1.3. If $W_t$ is a Brownian motion, prove that $P(\max_{0\le t\le T} W_t \ge x) \le e^{-x^2/(2T)}$ for $x \ge 0$. Hint: apply Doob's inequality to $\exp(\lambda W_t - \lambda^2 t/2)$.

6.2 Donsker’s invariance principle

In this section we will show how Brownian motion $W_t$ arises in a classical central limit theorem on the space of continuous functions on $\mathbb{R}_+$. When working with continuous processes defined on $\mathbb{R}_+$, such as the Brownian motion, the metric $\|\cdot\|_\infty$ on $C(\mathbb{R}_+)$ is too strong. A more appropriate metric $d$ can be defined by
$$d(f,g) = \sum_{n\ge 1}\frac{1}{2^n}\frac{d_n(f,g)}{1 + d_n(f,g)}, \quad \text{where } d_n(f,g) = \sup_{0\le t\le n}|f(t) - g(t)|.$$
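The metric $d$ is straightforward to approximate by truncating the series and evaluating each $d_n$ on a grid; the example below checks that functions converging uniformly on compacts (but not on all of $\mathbb{R}_+$) get $d$-close to the limit. A rough sketch (truncation and grid sizes are arbitrary):

```python
def d_metric(f, g, levels=20, grid=200):
    """Approximate d(f, g) = sum_n 2^{-n} d_n/(1 + d_n), where
    d_n = sup_{[0,n]} |f - g| is estimated on a finite grid."""
    total = 0.0
    for n in range(1, levels + 1):
        dn = max(abs(f(n * i / grid) - g(n * i / grid)) for i in range(grid + 1))
        total += 2.0 ** (-n) * dn / (1.0 + dn)
    return total

zero = lambda t: 0.0
# f_j(t) = t/j converges to 0 uniformly on compacts but not on R+,
# and d detects this: the distances decrease toward 0, and d is always < 1
# even though the sup-norm distance over all of R+ is infinite.
print([round(d_metric(lambda t, j=j: t / j, zero), 4) for j in (1, 10, 100)])
```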

It is obvious that $d(f_j, f) \to 0$ if and only if $d_n(f_j, f) \to 0$ for all $n \ge 1$, i.e. $d$ metrizes uniform convergence on compacts. $(C(\mathbb{R}_+), d)$ is also a complete separable space, since any sequence is Cauchy in $d$ if and only if it is Cauchy for each $d_n$. When proving uniform tightness of laws on $(C(\mathbb{R}_+), d)$, we will need a characterization of compacts via the Arzelà–Ascoli theorem, which in this case can be formulated as follows. For a subset $K \subseteq C(\mathbb{R}_+)$, let us denote by
$$K_n = \big\{ f\big|_{[0,n]} : f \in K \big\}$$
the restriction of $K$ to the interval $[0,n]$. Then it is clear that $K$ is compact with respect to $d$ if and only if each $K_n$ is compact with respect to $d_n$. Therefore, using the Arzelà–Ascoli theorem to characterize compacts in $C[0,n]$, we get the following. For a function $x \in C[0,T]$, let us denote its modulus of continuity by
$$m_T(x,\delta) = \sup\big\{ |x_a - x_b| : |a-b| \le \delta,\ a,b \in [0,T] \big\}.$$

Theorem 6.2 (Arzelà–Ascoli). A set $K$ is compact in $(C(\mathbb{R}_+), d)$ if and only if $K$ is closed, uniformly bounded, and equicontinuous on each interval $[0,n]$. In other words,
$$\sup_{x\in K}|x_0| < \infty \quad \text{and} \quad \lim_{\delta\downarrow 0}\sup_{x\in K} m_T(x,\delta) = 0 \ \text{ for all } T > 0.$$

This implies the following criterion of uniform tightness of laws on the metric space $(C(\mathbb{R}_+), d)$, which is simply a translation of the Arzelà–Ascoli theorem into probabilistic language.

Theorem 6.3. A sequence of laws $(P_n)_{n\ge 1}$ on $(C(\mathbb{R}_+), d)$ is uniformly tight if and only if
$$\lim_{\lambda\to+\infty}\sup_{n\ge 1} P_n\big(|x_0| > \lambda\big) = 0 \tag{6.2}$$
and
$$\lim_{\delta\downarrow 0}\sup_{n\ge 1} P_n\big(m_T(x,\delta) > \varepsilon\big) = 0 \tag{6.3}$$
for any $T > 0$ and any $\varepsilon > 0$.


Proof. ($\Longrightarrow$) Suppose that for any $\gamma > 0$ there exists a compact $K$ such that $P_n(K) > 1 - \gamma$ for all $n \ge 1$. By the Arzelà–Ascoli theorem, $|x_0| \le \lambda$ for some $\lambda > 0$ and for all $x \in K$ and, therefore,
$$\sup_n P_n(|x_0| > \lambda) \le \sup_n P_n(K^c) \le \gamma.$$
Also, by equicontinuity, for any $\varepsilon > 0$ there exists $\delta_0 > 0$ such that for $\delta < \delta_0$ and for all $x \in K$ we have $m_T(x,\delta) < \varepsilon$. Therefore,
$$\sup_n P_n\big(m_T(x,\delta) > \varepsilon\big) \le \sup_n P_n(K^c) \le \gamma.$$
($\Longleftarrow$) Fix any $\gamma > 0$. For each integer $T \ge 1$, find $\lambda_T > 0$ such that
$$\sup_n P_n\big(|x_0| > \lambda_T\big) \le \frac{\gamma}{2^{T+1}}.$$
For each integer $T \ge 1$ and $k \ge 1$, find $\delta_{T,k} > 0$ such that
$$\sup_n P_n\Big(m_T(x,\delta_{T,k}) > \frac{1}{k}\Big) \le \frac{\gamma}{2^{T+k+1}}.$$
Consider the set
$$A_T = \Big\{ x \in C(\mathbb{R}_+) : |x_0| \le \lambda_T,\ m_T(x,\delta_{T,k}) \le \frac{1}{k} \text{ for all } k \ge 1 \Big\}.$$
Then
$$\sup_n P_n\big((A_T)^c\big) \le \frac{\gamma}{2^{T+1}} + \sum_{k\ge 1}\frac{\gamma}{2^{T+k+1}} = \frac{\gamma}{2^T},$$
and the intersection $A = \bigcap_{T\ge 1} A_T$ satisfies
$$\sup_n P_n(A^c) = \sup_n P_n\Big(\bigcup_{T\ge 1}(A_T)^c\Big) \le \sum_{T\ge 1}\frac{\gamma}{2^T} = \gamma.$$
By construction, each set $A_T$ is closed, uniformly bounded, and equicontinuous on $[0,T]$. Therefore, by the Arzelà–Ascoli theorem, their intersection $A$ is compact in $(C(\mathbb{R}_+), d)$. Since $\gamma > 0$ was arbitrary, this proves that the sequence $(P_n)$ is uniformly tight. ⊓⊔
u

Remark 6.1. Of course, for uniform tightness on $(C[0,1], \|\cdot\|_\infty)$ we only need the second condition (6.3) for $T = 1$. Also, it will be convenient to slightly relax (6.3) and replace it with the asymptotic equicontinuity condition
$$\lim_{\delta\downarrow 0}\limsup_{n\to\infty} P_n\big(m_T(x,\delta) > \varepsilon\big) = 0. \tag{6.4}$$
If this holds then, given $\gamma > 0$, we can find $\delta_0 > 0$ and $n_0 \ge 1$ such that for all $n > n_0$,
$$P_n\big(m_T(x,\delta_0) > \varepsilon\big) < \gamma.$$
Since $m_T(x,\delta) \downarrow 0$ as $\delta \downarrow 0$ for all $x \in C(\mathbb{R}_+)$, for each $n \le n_0$ we can find $\delta_n > 0$ such that
$$P_n\big(m_T(x,\delta_n) > \varepsilon\big) < \gamma.$$
Since $m_T(x,\delta)$ is nondecreasing in $\delta$, this implies that
$$\text{if } \delta < \min(\delta_0, \delta_1, \ldots, \delta_{n_0}) \text{ then } P_n\big(m_T(x,\delta) > \varepsilon\big) < \gamma$$
for all $n \ge 1$. We showed that (6.4) implies (6.3). ⊓⊔

Using this characterization of uniform tightness, we will now give our first example of convergence on $(C(\mathbb{R}_+), d)$ to the Brownian motion $W_t$. Consider a sequence $(X_i)_{i\ge 1}$ of i.i.d. random variables such that $EX_i = 0$ and $\sigma^2 = EX_i^2 < \infty$. Let us consider the continuous partial sum process on $[0,\infty)$ defined by
$$W_t^n = \frac{1}{\sigma\sqrt{n}}\sum_{i\le \lfloor nt\rfloor} X_i + (nt - \lfloor nt\rfloor)\frac{X_{\lfloor nt\rfloor+1}}{\sigma\sqrt{n}}, \tag{6.5}$$
where $\lfloor nt\rfloor$ is the integer part of $nt$, $\lfloor nt\rfloor \le nt < \lfloor nt\rfloor + 1$.


Theorem 6.4 (Donsker invariance principle). The processes $W_t^n$ converge in distribution to the Brownian motion $W_t$ on the space $(C(\mathbb{R}_+), d)$.

This implies that any continuous functional of these processes on $(C(\mathbb{R}_+), d)$ converges in distribution; for example,
$$\sup_{t\le 1} W_t^n \to \sup_{t\le 1} W_t$$
in distribution.
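The theorem invites a numerical check before the proof: for a Rademacher walk (so $\sigma = 1$), $P(\sup_{t\le 1}W_t^n \le x)$ should approach $P(\sup_{t\le 1}W_t \le x) = 2\Phi(x) - 1$, the value given by the reflection principle (used here only as a known benchmark). A sketch:

```python
import math
import random

def sup_of_walk(n, rng):
    """max_{k <= n} S_k / sqrt(n) for a Rademacher walk; for the piecewise
    linear process (6.5) the supremum over t <= 1 is attained at grid points."""
    s, best = 0, 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        if s > best:
            best = s
    return best / math.sqrt(n)

def sup_cdf_estimate(x, n=400, reps=2000, seed=0):
    """Monte Carlo estimate of P(sup_{t<=1} W_t^n <= x)."""
    rng = random.Random(seed)
    return sum(sup_of_walk(n, rng) <= x for _ in range(reps)) / reps

# Limit value at x = 1: P(sup W_t <= 1) = 2*Phi(1) - 1 = erf(1/sqrt(2)),
# roughly 0.68; the estimate should land in that vicinity.
print(sup_cdf_estimate(1.0))
```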

Proof. Since the last term in $W_t^n$ is of order $n^{-1/2}$, for simplicity of notation we will simply write
$$W_t^n = \frac{1}{\sigma\sqrt{n}}\sum_{i\le nt} X_i$$
and treat $nt$ as an integer. By the central limit theorem,
$$\frac{1}{\sigma\sqrt{n}}\sum_{i\le nt} X_i = \sqrt{t}\,\frac{1}{\sigma\sqrt{nt}}\sum_{i\le nt} X_i$$
converges in distribution to the normal distribution $N(0,t)$. Given $t < s$, we can represent
$$W_s^n = W_t^n + \frac{1}{\sigma\sqrt{n}}\sum_{nt<i\le ns} X_i$$
and, since $W_t^n$ and $W_s^n - W_t^n$ are independent, it should be obvious that the finite dimensional distributions of $W_t^n$ converge to the finite dimensional distributions of the Brownian motion $W_t$. By Lemma 6.2 in the previous section, this identifies $W_t$ as the unique possible limit of $W_t^n$ and, if we can show that the sequence of laws $(\mathcal{L}(W_t^n))_{n\ge 1}$ is uniformly tight on $(C([0,\infty)), d)$, the Selection Theorem will imply that $W_t^n \to W_t$ weakly. Since $W_0^n = 0$, we only need to prove the asymptotic equicontinuity (6.4). First,
$$m_T(W^n,\delta) = \sup_{t,s\in[0,T],\,|t-s|\le\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{ns<i\le nt} X_i\Big| \le \max_{0\le k\le nT,\ 0<j\le n\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{k<i\le k+j} X_i\Big|.$$

If instead of maximizing over all $0 \le k \le nT$ we maximize in increments of $n\delta$, i.e. over indices $k$ of the type $k = \ell n\delta$ for $0 \le \ell \le m-1$, where $m := T/\delta$, then it is easy to check that the maximum will decrease by at most a factor of $3$,
$$\max_{0\le k\le nT,\ 0<j\le n\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{k<i\le k+j} X_i\Big| \le 3\max_{0\le \ell\le m-1,\ 0<j\le n\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{\ell n\delta < i\le \ell n\delta + j} X_i\Big|,$$
because the second maximum over $0 < j \le n\delta$ is taken over intervals of the same size $n\delta$. As a consequence, if $m_T(W^n,\delta) > \varepsilon$ then one of the events
$$\Big\{\max_{0<j\le n\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{\ell n\delta < i\le \ell n\delta + j} X_i\Big| > \frac{\varepsilon}{3}\Big\}$$
must occur for some $0 \le \ell \le m-1$. Since the number of these events is $m = T/\delta$,
$$P\big(m_T(W^n,\delta) > \varepsilon\big) \le m\, P\Big(\max_{0<j\le n\delta}\Big|\frac{1}{\sigma\sqrt{n}}\sum_{0<i\le j} X_i\Big| > \frac{\varepsilon}{3}\Big). \tag{6.6}$$

Kolmogorov's inequality, Theorem 2.10, implies that if $S_n = X_1 + \cdots + X_n$ and $\max_{0<j\le n} P(|S_n - S_j| > a) \le p < 1$, then
$$P\Big(\max_{0<j\le n}|S_j| > 2a\Big) \le \frac{1}{1-p}\, P(|S_n| > a).$$
If we take $a = \varepsilon\sigma\sqrt{n}/6$ then, by Chebyshev's inequality,
$$P\Big(\Big|\sum_{j<i\le n\delta} X_i\Big| > \frac{\varepsilon\sigma\sqrt{n}}{6}\Big) \le \frac{6^2\,\delta n\sigma^2}{\varepsilon^2 n\sigma^2} = 36\delta\varepsilon^{-2}$$
and, therefore, if $36\delta\varepsilon^{-2} < 1$,
$$P\Big(\max_{0<j\le n\delta}\Big|\sum_{0<i\le j} X_i\Big| > \frac{\varepsilon\sigma\sqrt{n}}{3}\Big) \le \frac{1}{1 - 36\delta\varepsilon^{-2}}\, P\Big(\Big|\sum_{0<i\le n\delta} X_i\Big| > \frac{\varepsilon\sigma\sqrt{n}}{6}\Big).$$

Finally, using (6.6) and the central limit theorem,
$$\limsup_{n\to\infty} P\big(m_T(W^n,\delta) > \varepsilon\big) \le m\big(1 - 36\delta\varepsilon^{-2}\big)^{-1}\limsup_{n\to\infty} P\Big(\Big|\sum_{0<i\le n\delta} X_i\Big| > \frac{\varepsilon\sigma\sqrt{n}}{6}\Big)$$
$$= m\big(1 - 36\delta\varepsilon^{-2}\big)^{-1}\, 2\, N(0,1)\Big(\Big[\frac{\varepsilon}{6\sqrt{\delta}}, \infty\Big)\Big) \le 2T\delta^{-1}\big(1 - 36\delta\varepsilon^{-2}\big)^{-1}\exp\Big(-\frac{1}{2}\frac{\varepsilon^2}{6^2\delta}\Big).$$
This, obviously, goes to zero as $\delta\downarrow 0$, which proves that
$$\lim_{\delta\downarrow 0}\limsup_{n\to\infty} P\big(m_T(W^n,\delta) > \varepsilon\big) = 0$$
for all $T > 0$ and $\varepsilon > 0$. This proves that $W_t^n \to W_t$ weakly in $(C([0,\infty)), d)$. ⊓⊔

Exercise 6.2.1. For $n \ge 1$, consider independent random variables $X_{n,i}$, $1 \le i \le n$, such that
$$EX_{n,i} = 0, \quad \sigma_{n,i}^2 = EX_{n,i}^2, \quad \text{and} \quad \sum_{i=1}^n \sigma_{n,i}^2 = 1.$$
Suppose that Lindeberg's condition holds: for all $\varepsilon > 0$,
$$\lim_{n\to\infty}\sum_{i=1}^n EX_{n,i}^2 I(|X_{n,i}| > \varepsilon) = 0.$$
Define $t_{n,j} = \sum_{i\le j}\sigma_{n,i}^2$ for $1 \le j \le n$, and define a random process $X^n(t)$ on the interval $[0,1]$ by setting $X^n(0) = 0$, $X^n(t_{n,j}) = \sum_{i\le j} X_{n,i}$ for $1 \le j \le n$, and interpolating linearly in between. Prove that $X^n(t)$ converges in distribution to the Brownian motion $W_t$ on $(C[0,1], \|\cdot\|_\infty)$.

Exercise 6.2.2. Let $(\varepsilon_i)_{1\le i\le n}$ be i.i.d. Rademacher random variables, i.e. $P(\varepsilon_i = \pm 1) = 1/2$. For $\delta \in (0,1)$, let $F_\delta$ be the subset of vectors in $\{0,1\}^n$ with at most $\delta n$ coordinates equal to $1$, and such that these coordinates are consecutive. Show that, for a large enough absolute constant $c > 0$,
$$P\Big(\max_{f\in F_\delta}\frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n \varepsilon_i f_i\Big| \ge c(\sqrt{\delta} + t)\Big) \le \frac{c}{\delta}\, e^{-\frac{t^2}{2\delta}}$$
for all $t > 0$ and, as a consequence,
$$E\max_{f\in F_\delta}\frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n \varepsilon_i f_i\Big| \le c\Big(\sqrt{\delta\log\frac{1}{\delta}} + \sqrt{\delta}\Big)$$
for some large enough $c > 0$. Hint: use discretization and Kolmogorov's inequality as in the proof of Donsker's theorem, and then apply Hoeffding's inequality instead of the CLT.

6.3 Convergence of empirical process to Brownian bridge

Empirical process and the Kolmogorov–Smirnov test. In this section we show how the Brownian bridge $B_t$ arises in another central limit theorem on the space of continuous functions on $[0,1]$. Let us start with a motivating example from statistics. Suppose that $x_1,\ldots,x_n$ are i.i.d. uniform random variables on $[0,1]$. By the law of large numbers, for any $t \in [0,1]$, the empirical c.d.f. $n^{-1}\sum_{i=1}^n I(x_i\le t)$ converges to the true c.d.f. $P(x_1\le t) = t$ almost surely and, moreover, by the CLT,
$$X_t^n := \sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^n I(x_i\le t) - t\Big) \overset{d}{\to} N\big(0, t(1-t)\big).$$
The stochastic process $X_t^n$ is called the empirical process. The covariance of this process, for $s \le t$,
$$EX_t^n X_s^n = E\big(I(x_1\le t) - t\big)\big(I(x_1\le s) - s\big) = s - ts - ts + ts = s(1-t),$$
is the same as the covariance of the Brownian bridge and, by the multivariate CLT, all finite dimensional distributions of the empirical process converge to the f.d. distributions of the Brownian bridge,
$$\mathcal{L}\big((X_t^n)_{t\in F}\big) \to \mathcal{L}\big((B_t)_{t\in F}\big). \tag{6.7}$$

However, we would like to show the convergence of $X_t^n$ to $B_t$ in some stronger sense that would imply weak convergence of continuous functions of the process on the space $(C[0,1], \|\cdot\|_\infty)$. The Kolmogorov–Smirnov test in statistics provides some motivation. Suppose that i.i.d. $(X_i)_{i\ge 1}$ have a continuous distribution with c.d.f. $F(t) = P(X_1\le t)$. Let $F_n(t) = n^{-1}\sum_{i=1}^n I(X_i\le t)$ be the empirical c.d.f. Since $F$ is continuous and $F(\mathbb{R}) = [0,1]$,
$$\sup_{t\in\mathbb{R}}\sqrt{n}\,|F_n(t) - F(t)| = \sup_{t\in\mathbb{R}}\sqrt{n}\Big|\frac{1}{n}\sum_{i=1}^n I(X_i\le t) - F(t)\Big| = \sup_{t\in\mathbb{R}}\sqrt{n}\Big|\frac{1}{n}\sum_{i=1}^n I\big(F(X_i)\le F(t)\big) - F(t)\Big| = \sup_{t\in[0,1]}\sqrt{n}\Big|\frac{1}{n}\sum_{i=1}^n I\big(F(X_i)\le t\big) - t\Big| \overset{d}{=} \sup_{t\in[0,1]}|X_t^n|,$$
because $(F(X_i))$ are i.i.d. and have the uniform distribution on $[0,1]$. This means that the distribution of the left hand side does not depend on $F$ and, in order to infer whether the sample $(X_i)_{1\le i\le n}$ comes from the distribution with the c.d.f. $F$, statisticians need to know only the distribution of the supremum of the empirical process or, as an approximation, the distribution of its limit. Equation (6.7) suggests that
$$\mathcal{L}\Big(\sup_t |X_t^n|\Big) \to \mathcal{L}\Big(\sup_t |B_t|\Big), \tag{6.8}$$
and the right hand side is called the Kolmogorov–Smirnov distribution, which will be computed in the next section. Since $B_t$ is sample continuous, its distribution is a law on the metric space $(C[0,1], \|\cdot\|_\infty)$. Even though $X_t^n$ is not continuous, its jumps are equal to $n^{-1/2}$, so it can be approximated by a continuous process $Y_t^n$ uniformly within $n^{-1/2}$. Since $\|\cdot\|_\infty$ is a continuous functional on $(C[0,1], \|\cdot\|_\infty)$, (6.8) would hold if we can prove the weak convergence $\mathcal{L}(Y_t^n) \to \mathcal{L}(B_t)$. We only need to prove uniform tightness of $\mathcal{L}(Y_t^n)$ because, by Lemma 6.2, (6.7) already identifies the law of the Brownian bridge as the unique possible limit. Thus, we need to address the question of uniform tightness of $(\mathcal{L}(Y_t^n))$ on the complete separable space $(C[0,1], \|\cdot\|_\infty)$ or, equivalently, by the result of the previous section, the asymptotic equicontinuity of $Y_t^n$,
$$\lim_{\delta\downarrow 0}\limsup_{n\to\infty} P\big(m(Y^n,\delta) > \varepsilon\big) = 0.$$
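As an aside on computation: for the uniform null considered above, the supremum of the empirical process is computed exactly from the order statistics, since the sup of $|F_n(t) - t|$ over $t$ is attained at (or just before) the data points. A small sketch of the Kolmogorov–Smirnov statistic:

```python
import math

def ks_statistic(sample):
    """sqrt(n) * sup_t |F_n(t) - t| for a sample from (presumed) U[0,1]:
    at the i-th order statistic x_(i) the empirical c.d.f. jumps from
    (i-1)/n to i/n, so it suffices to compare both with x_(i)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, abs(i / n - x), abs(x - (i - 1) / n))
    return math.sqrt(n) * d

# Hand-checkable example: for the sample {0.1, 0.7}, sup |F_n(t) - t| = 0.4,
# attained at t = 0.1 where F_n jumps to 0.5.
print(ks_statistic([0.1, 0.7]))  # 0.4 * sqrt(2)
```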

Obviously, this is again equivalent to
$$\lim_{\delta\downarrow 0}\limsup_{n\to\infty} P\big(m(X^n,\delta) > \varepsilon\big) = 0,$$
so we can deal with the process $X_t^n$, even though it is not continuous. By Chebyshev's inequality,
$$P\big(m(X^n,\delta) > \varepsilon\big) \le \frac{1}{\varepsilon}\, E\, m(X^n,\delta),$$
and we need to learn how to control $E\, m(X^n,\delta)$. The modulus of continuity of $X^n$ can be written as
$$m(X^n,\delta) = \sup_{|t-s|\le\delta}|X_t^n - X_s^n| = \sqrt{n}\sup_{|t-s|\le\delta}\Big|\frac{1}{n}\sum_{i=1}^n I(s<x_i\le t) - (t-s)\Big| = \sqrt{n}\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n f(x_i) - Ef\Big|, \tag{6.9}$$
where we introduced the class of functions
$$\mathcal{F} = \big\{ f(x) = I(s<x\le t) : t,s\in[0,1],\ |t-s| < \delta \big\}. \tag{6.10}$$
We will now develop an approach to control the expectation of (6.9) for general classes of functions $\mathcal{F}$, and we will only use the specific definition (6.10) at the very end. This will be done in several steps.
Symmetrization. As the first step, we will replace the empirical process (6.9) by a symmetrized version, called the Rademacher process, that will be easier to control. Let $x_1',\ldots,x_n'$ be independent copies of $x_1,\ldots,x_n$ and let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. Rademacher random variables, such that $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$. Let us define
$$P_n f = \frac{1}{n}\sum_{i=1}^n f(x_i) \quad \text{and} \quad P_n' f = \frac{1}{n}\sum_{i=1}^n f(x_i').$$
Notice that $EP_n' f = Ef$. Consider the random variables
$$Z = \sup_{f\in\mathcal{F}}\big|P_n f - Ef\big| \quad \text{and} \quad R = \sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|.$$
Then, using Jensen's inequality, the symmetry, and then the triangle inequality, we can write
$$EZ = E\sup_{f\in\mathcal{F}}\big|P_n f - Ef\big| = E\sup_{f\in\mathcal{F}}\big|P_n f - EP_n' f\big| \le E\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n\big(f(x_i) - f(x_i')\big)\Big| = E\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\big(f(x_i) - f(x_i')\big)\Big|$$
$$\le E\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big| + E\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i')\Big| = 2ER.$$
The equality in the second line holds because switching $x_i \leftrightarrow x_i'$ arbitrarily does not change the expectation, so the equality holds for any fixed $(\varepsilon_i)$ and, therefore, for any random $(\varepsilon_i)$. ⊓⊔
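The symmetrization bound $EZ \le 2ER$ can be eyeballed by Monte Carlo for the interval class (6.10), restricting the suprema to a grid of intervals so that they become finite maxima (an illustrative sketch; sample sizes and grid are arbitrary choices):

```python
import random

def grid_intervals(delta, grid=25):
    """Endpoints (s, t] with t - s <= delta on a regular grid of [0, 1]."""
    pts = [i / grid for i in range(grid + 1)]
    return [(s, t) for s in pts for t in pts if s < t <= s + delta + 1e-12]

def z_and_r(n, intervals, rng):
    """One draw of Z = sup_f |P_n f - E f| and R = sup_f |n^-1 sum eps_i f(x_i)|
    over the gridded interval class, for n i.i.d. uniform points."""
    xs = [rng.random() for _ in range(n)]
    eps = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
    z = r = 0.0
    for s, t in intervals:
        inside = [s < x <= t for x in xs]
        z = max(z, abs(sum(inside) / n - (t - s)))
        r = max(r, abs(sum(e for e, b in zip(eps, inside) if b)) / n)
    return z, r

rng = random.Random(0)
ivs = grid_intervals(0.2)
draws = [z_and_r(100, ivs, rng) for _ in range(150)]
ez = sum(z for z, _ in draws) / len(draws)
er = sum(r for _, r in draws) / len(draws)
print(round(ez, 3), round(er, 3))  # consistent with EZ <= 2 ER
```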
Given this symmetrization bound, for the specific class of functions in (6.10),
one can finish the proof of asymptotic equicontinuity using the last exercise in
the previous section together with a simple observation (6.13) below. However,
instead, we will describe a much more general approach to this problem.
Covering numbers, Kolmogorov's chaining and Dudley's entropy integral. To control $ER$ for general classes of functions $\mathcal{F}$, we will need some measures of complexity of $\mathcal{F}$. First, we will show how to control the Rademacher process $R$ conditionally on $x_1,\ldots,x_n$. Suppose that $(F,d)$ is a totally bounded metric space. For any $u > 0$, the $u$-packing number of $F$ with respect to $d$ is defined by
$$D(F,u,d) = \max\,\mathrm{card}\big\{ F_u \subseteq F : d(f,g) > u \text{ for all } f \neq g \in F_u \big\}$$
and the $u$-covering number is defined by
$$N(F,u,d) = \min\,\mathrm{card}\big\{ F_u \subseteq F : \forall f\in F\ \exists g\in F_u \text{ such that } d(f,g)\le u \big\}.$$
Both packing and covering numbers measure how many points are needed to approximate any element of the set $F$ within distance $u$. It is a simple exercise (at the end of the section) to show that
$$N(F,u,d) \le D(F,u,d) \le N(F,u/2,d)$$
and, in this sense, packing and covering numbers are closely related. Let $F$ be a subset of the cube $[-1,1]^n$ equipped with the rescaled Euclidean metric
$$d(f,g) = \Big(\frac{1}{n}\sum_{i=1}^n (f_i - g_i)^2\Big)^{1/2}.$$
Consider the following Rademacher process on $F$,
$$R(f) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i f_i.$$
Then we have the following version of the classical Kolmogorov chaining lemma. We will assume that $0 \in F$, increasing $D(F,\varepsilon,d)$ by $1$ if necessary.

Theorem 6.5 (Kolmogorov's chaining). For any $u > 0$,
$$P\Big(\forall f\in F,\ R(f) \le 2^{9/2}\int_0^{2d(0,f)}\log^{1/2} D(F,\varepsilon,d)\, d\varepsilon + 2^{9/2} d(0,f)\sqrt{u}\Big) \ge 1 - e^{-u}.$$

Proof. Define a sequence of subsets
$$\{0\} = F_0 \subseteq F_1 \subseteq \ldots \subseteq F_j \subseteq \ldots \subseteq F$$
such that $F_j$ satisfies
1. $\forall f,g \in F_j$ with $f \neq g$, $d(f,g) > 2^{-j}$;
2. $\forall f \in F$ we can find $g \in F_j$ such that $d(f,g) \le 2^{-j}$.
$F_0$ obviously satisfies these properties for $j = 0$. To construct $F_{j+1}$ given $F_j$:
• Start with $F_{j+1} := F_j$.
• If possible, find $f \in F$ such that $d(f,g) > 2^{-(j+1)}$ for all $g \in F_{j+1}$.
• Let $F_{j+1} := F_{j+1} \cup \{f\}$ and repeat until no such $f$ can be found.
Define the projection $\pi_j : F \to F_j$ as follows:
for $f \in F$ find the closest $g \in F_j$ with $d(f,g) \le 2^{-j}$ and set $\pi_j(f) = g$,
where we break ties arbitrarily. Then any $f \in F$ can be decomposed into the telescopic series
$$f = \sum_{j=1}^\infty\big(\pi_j(f) - \pi_{j-1}(f)\big).$$
Note that, by the above definition, if $F_{j-1} = F_j$ then $\pi_{j-1}(f) = \pi_j(f)$, because in this case there is a unique $g \in F_j$ with $d(f,g) \le 2^{-j}$. In this case, for any $f$, the term $\pi_j(f) - \pi_{j-1}(f)$ belongs to the set $L_{j-1,j} := \{0\}$. Otherwise, if $F_{j-1} \neq F_j$,
$$d\big(\pi_{j-1}(f), \pi_j(f)\big) \le d\big(\pi_{j-1}(f), f\big) + d\big(f, \pi_j(f)\big) \le 2^{-(j-1)} + 2^{-j} = 3\cdot 2^{-j} \le 2^{-j+2}.$$
In this case, for any $f$, the term $\pi_j(f) - \pi_{j-1}(f)$ belongs to the set
$$L_{j-1,j} = \big\{ f - g : f \in F_j,\ g \in F_{j-1},\ d(f,g) \le 2^{-j+2} \big\}.$$
We will call the elements of these sets links (as in a chain). Since $R(f)$ is linear,
$$R(f) = \sum_{j=1}^\infty R\big(\pi_j(f) - \pi_{j-1}(f)\big).$$

We first show how to control $R$ on the set of all non-trivial links. Assume that $0 \neq \ell \in L_{j-1,j}$. By Hoeffding's inequality,
$$P\Big(R(\ell) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i \ell_i \ge t\Big) \le \exp\Big(-\frac{t^2}{2n^{-1}\sum_{i=1}^n \ell_i^2}\Big) \le \exp\Big(-\frac{t^2}{2\cdot 2^{-2j+4}}\Big).$$
If $|F|$ denotes the cardinality of a set $F$ then $|L_{j-1,j}| \le |F_{j-1}|\cdot|F_j| \le |F_j|^2$ and, therefore, by the union bound,
$$P\big(\forall \ell\in L_{j-1,j},\ R(\ell) \le t\big) \ge 1 - |F_j|^2\exp\Big(-\frac{t^2}{2^{-2j+5}}\Big) = 1 - \frac{1}{|F_j|^2}\, e^{-u},$$
after making the change of variables
$$t = \Big(2^{-2j+5}\big(4\log|F_j| + u\big)\Big)^{1/2} \le 2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u}.$$
Hence,
$$P\Big(\forall \ell\in L_{j-1,j},\ R(\ell) \le 2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u}\Big) \ge 1 - \frac{1}{|F_j|^2}\, e^{-u}.$$

If $F_{j-1} = F_j$ then $L_{j-1,j} = \{0\}$, so the probability on the left hand side is $1$ and, by the union bound,
$$P\Big(\forall j\ge 1\ \forall \ell\in L_{j-1,j},\ R(\ell) \le 2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u}\Big)$$
$$\ge 1 - \sum_{j=1}^\infty I\big(|F_{j-1}| < |F_j|\big)\frac{1}{|F_j|^2}\, e^{-u} \ge 1 - \sum_{j=1}^\infty \frac{1}{(j+1)^2}\, e^{-u} = 1 - (\pi^2/6 - 1)\, e^{-u} \ge 1 - e^{-u}.$$

Given $f \in F$, let the integer $k$ be such that $2^{-(k+1)} < d(0,f) \le 2^{-k}$. Then in the above construction $\pi_0(f) = \ldots = \pi_{k-1}(f) = 0$ and, with probability at least $1 - e^{-u}$,
$$\sum_{j=k}^\infty R\big(\pi_j(f) - \pi_{j-1}(f)\big) \le \sum_{j=k}^\infty\Big(2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u}\Big) \le 2^{7/2}\sum_{j=k}^\infty 2^{-j}\log^{1/2} D(F,2^{-j},d) + 2^{7/2}\, 2^{-k}\sqrt{u}.$$
Note that $2^{-k} < 2d(f,0)$ and, hence, $2^{7/2}\, 2^{-k}\sqrt{u} < 2^{9/2}\, d(f,0)\sqrt{u}$. Finally, since the packing numbers $D(F,\varepsilon,d)$ are decreasing in $\varepsilon$, we can write
$$2^{9/2}\sum_{j=k}^\infty 2^{-(j+1)}\log^{1/2} D(F,2^{-j},d) \le 2^{9/2}\int_0^{2^{-k}}\log^{1/2} D(F,\varepsilon,d)\, d\varepsilon \le 2^{9/2}\int_0^{2d(0,f)}\log^{1/2} D(F,\varepsilon,d)\, d\varepsilon. \tag{6.11}$$
This finishes the proof. ⊓⊔

The integral in (6.11) is called Dudley's entropy integral. We would like to apply the bound of the above theorem to
$$\sqrt{n}\, R = \sup_{f\in\mathcal{F}}\frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big|$$
for the class of functions $\mathcal{F}$ in (6.10). Suppose that $x_1,\ldots,x_n \in [0,1]$ are fixed and let
$$F = \Big\{\big(f(x_i)\big)_{1\le i\le n} = \big(I(s<x_i\le t)\big)_{1\le i\le n} : t,s\in[0,1],\ |t-s|\le\delta\Big\} \subseteq \{0,1\}^n.$$
Then we can control the covering numbers of this set as follows.

Lemma 6.4. $N(F,u,d) \le Ku^{-4}$ for some absolute $K > 0$ independent of the points $x_1,\ldots,x_n$.

This kind of estimate is called a uniform covering number bound, because it does not depend on the points $x_1,\ldots,x_n$ that generate the class $F$ from $\mathcal{F}$.

Proof. We can assume that $x_1 \le \ldots \le x_n$. Then the class $F$ consists of all vectors of the type
$$(0 \ldots 1 \ldots 1 \ldots 0),$$
i.e. the coordinates equal to $1$ come in blocks. Given $u$, let $F_u$ be the subset of such vectors with blocks of $1$'s starting and ending at coordinates of the form $k\lfloor nu\rfloor$. Given any vector $f \in F$, let us approximate it by a vector $f' \in F_u$ by choosing the closest starting and ending coordinates for the block of $1$'s. The number of different coordinates will be bounded by $2\lfloor nu\rfloor$ and, therefore, the distance between $f$ and $f'$ will be bounded by
$$d(f,f') \le \sqrt{2n^{-1}\lfloor nu\rfloor} \le \sqrt{2u}.$$
The cardinality of $F_u$ is, obviously, of order $u^{-2}$, and this proves that $N(F,\sqrt{2u},d) \le Ku^{-2}$. Making the change of variables $\sqrt{2u} \to u$ proves the result. ⊓⊔

To apply Kolmogorov's chaining bound to this class $F$, let us make a simple observation: if a random variable $X \ge 0$ satisfies $P(X \ge a + bt) \le Ke^{-t^2}$ for all $t \ge 0$, then
$$EX = \int_0^\infty P(X\ge t)\, dt \le a + \int_0^\infty P(X \ge a + t)\, dt \le a + K\int_0^\infty e^{-t^2/b^2}\, dt \le a + Kb \le K(a+b).$$
Theorem 6.5 then implies that
$$E_\varepsilon \sup_F \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n \varepsilon_i f_i\Big| \le K\Big(\int_0^{D_n}\sqrt{\log\frac{K}{u}}\, du + D_n\Big), \tag{6.12}$$
where $E_\varepsilon$ is the expectation with respect to $(\varepsilon_i)$ only and

$$D_n^2 = \sup_F d(0,f)^2 = \sup_F \frac{1}{n}\sum_{i=1}^n f(x_i)^2 = \sup_{|t-s|\le\delta}\frac{1}{n}\sum_{i=1}^n I(s<x_i\le t) = \sup_{|t-s|\le\delta}\Big(\frac{1}{n}\sum_{i=1}^n I(x_i\le t) - \frac{1}{n}\sum_{i=1}^n I(x_i\le s)\Big).$$
Since the integral on the right hand side of (6.12) is concave in $D_n$, by Jensen's inequality,
$$E\sup_F \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big| \le K\Big(\int_0^{ED_n}\sqrt{\log\frac{K}{u}}\, du + ED_n\Big).$$
By the symmetrization inequality, this finally proves that
$$E\, m(X^n,\delta) \le K\Big(\int_0^{ED_n}\sqrt{\log\frac{K}{u}}\, du + ED_n\Big).$$
The strong law of large numbers easily implies (exercise) that
\[
\sup_{t \in [0,1]} \Big| \frac{1}{n} \sum_{i=1}^n I(x_i \le t) - t \Big| \to 0
\]
almost surely and, therefore,
\[
D_n^2 \to \delta \ \text{ a.s.} \quad \text{and} \quad E D_n \to \sqrt{\delta}. \tag{6.13}
\]
This implies that
\[
\limsup_{n \to \infty} E m(X^n, \delta)
\le K \Big( \int_0^{\sqrt{\delta}} \sqrt{\log \frac{K}{u}}\, du + \sqrt{\delta} \Big).
\]
6.3 Convergence of empirical process to Brownian bridge 189

The right-hand side goes to zero as $\delta \to 0$ and this finishes the proof of asymptotic equicontinuity of $X^n$. As a result, for any continuous function $F$ on $(C[0,1], \|\cdot\|_\infty)$ the distribution of $F(X_t^n)$ converges to the distribution of $F(B_t)$. For example,
\[
\sqrt{n} \sup_{0 \le t \le 1} \Big| \frac{1}{n} \sum_{i=1}^n I(x_i \le t) - t \Big| \to \sup_{0 \le t \le 1} |B_t|
\]
in distribution. We will find the distribution of the right-hand side in the next section. Notice that the methods we used to prove equicontinuity were quite general, and the main step where we used the specific class of functions $F$ was to control the covering numbers.

Exercise 6.3.1. Show that $N(F, u, d) \le D(F, u, d) \le N(F, u/2, d)$.

Exercise 6.3.2. If $(x_i)_{i \ge 1}$ are i.i.d. uniform on $[0,1]$, prove that
\[
\sup_{t \in [0,1]} \Big| \frac{1}{n} \sum_{i=1}^n I(x_i \le t) - t \Big| \to 0 \ \text{ a.s.}
\]

Exercise 6.3.3. If $\varepsilon_i$ are i.i.d. Rademacher, $g_i$ are i.i.d. standard Gaussian, and $F$ is a bounded subset of $\mathbb{R}^n$, show that
\[
E \sup_{f \in F} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f_i
\le \frac{1}{E |g_1|}\, E \sup_{f \in F} \frac{1}{\sqrt{n}} \sum_{i=1}^n g_i f_i.
\]

Exercise 6.3.4. Consider the set $\Lambda = \{\lambda = (\lambda_1, \ldots, \lambda_d) : \sum_{i=1}^d |\lambda_i| \le 1\} \subset \mathbb{R}^d$ and the distance $d(\lambda^1, \lambda^2) = \sum_{i=1}^d |\lambda_i^1 - \lambda_i^2|$. Prove that $D(\Lambda, \varepsilon, d) \le (4/\varepsilon)^d$.

Exercise 6.3.5. Suppose that a family $\mathcal{F}$ of measurable functions $f : \Omega \to [0,1]$ is such that, for some $V \ge 1$ and for any $\omega_1, \ldots, \omega_n \in \Omega$,
\[
\log D(\mathcal{F}, \varepsilon, d) \le V \log \frac{2}{\varepsilon},
\quad \text{where} \quad
d(f, g) = \Big( \frac{1}{n} \sum_{i=1}^n \big( f(\omega_i) - g(\omega_i) \big)^2 \Big)^{1/2}.
\]
If $(X_n)_{n \ge 1}$ are i.i.d. random elements with values in $\Omega$, prove that
\[
E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n f(X_i) - E f(X_1) \Big| \le K \Big( \frac{V}{n} \Big)^{1/2}
\]
for some absolute constant $K$. (Assume, for example, that $\mathcal{F}$ is countable to avoid any measurability issues.)

6.4 Reflection principles for Brownian motion

We showed that the empirical process converges to the Brownian bridge on $(C([0,1]), \|\cdot\|_\infty)$. As a result, the distribution of a continuous function of the process will also converge; for example,
\[
\sup_{0 \le t \le 1} |X_t^n| \to \sup_{0 \le t \le 1} |B_t|
\]
in distribution. We will compute the distribution of this supremum in Theorem 6.9 below, but first we will prove the so-called strong Markov property of the Brownian motion.
Given a Brownian motion $W_t$ on some probability space $(\Omega, \mathcal{F}, P)$, let $\mathcal{B}_t = \sigma((W_s)_{s \le t}) \subseteq \mathcal{F}$ be the $\sigma$-algebra generated by the process up to time $t$. Let us consider a family of $\sigma$-algebras $\mathcal{F}_t$ for $t \ge 0$ such that
(i) $\mathcal{F}_t \subseteq \mathcal{F}_s$ for all $t \le s$;
(ii) $\mathcal{B}_t \subseteq \mathcal{F}_t$ for all $t \ge 0$;
(iii) the future increments $(W_{t+s} - W_t)_{s \ge 0}$ are independent of $\mathcal{F}_t$ for all $t \ge 0$.
For example, if we consider some $\sigma$-algebra $\mathcal{A}$ independent of the process $W_t$ then $\mathcal{F}_t = \sigma(\mathcal{B}_t \cup \mathcal{A})$ satisfies these properties. A random variable $\tau \ge 0$ is called a stopping time if
\[
\{\tau \le t\} \in \mathcal{F}_t \ \text{ for all } t \ge 0.
\]
For example, a hitting time $\tau_c = \inf\{t > 0 : W_t = c\}$ for $c > 0$ is a stopping time because, by the sample continuity, $\{\tau_c \le t\} = \bigcap_{q < c} \bigcup_{r < t} \{W_r > q\}$, where both the intersection and union are over rational numbers $q, r$.
Consider the $\sigma$-algebra
\[
\mathcal{F}_\tau = \big\{ B \in \mathcal{F} : B \cap \{\tau \le t\} \in \mathcal{F}_t \ \text{for all } t \ge 0 \big\} \tag{6.14}
\]
generated by the data up to the stopping time $\tau$. Let us denote by $\mathcal{B}(C(\mathbb{R}_+), d)$ the Borel $\sigma$-algebra on $(C(\mathbb{R}_+), d)$, where the metric $d$ metrizes uniform convergence on compacts. The following is very similar to the property of stopping times for sums of i.i.d. random variables in Section 2.4.
Theorem 6.6 (Strong Markov Property). On the event $\{\tau < \infty\}$, the increments $W_t' := W_{\tau + t} - W_\tau$ of the process after the stopping time are independent of $\mathcal{F}_\tau$ and, moreover, the distribution of $W'$ is that of a Brownian motion. More precisely, for any $B \in \mathcal{F}_\tau$ and $A \in \mathcal{B}(C(\mathbb{R}_+), d)$,
\[
E\, I(W' \in A)\, I_B\, I(\tau < \infty) = E\, I(W \in A)\, E\, I_B\, I(\tau < \infty). \tag{6.15}
\]

Proof. Since $\mathcal{B}(C(\mathbb{R}_+), d) = \mathcal{A} \cap C(\mathbb{R}_+)$, where $\mathcal{A}$ is the cylindrical $\sigma$-algebra on $\mathbb{R}^{[0,\infty)}$, it is enough to prove (6.15) for $A$ in the cylindrical algebra, i.e. for cylinders of

the form $C \times \mathbb{R}^{[0,\infty) \setminus F}$ for $C \in \mathcal{B}(\mathbb{R}^F)$, for a finite set $F = \{t_1, \ldots, t_d\}$ with some integer $d \ge 1$ and $0 \le t_1 < \ldots < t_d$. By Dynkin's theorem, it is enough to consider only open sets $C$ and, finally, approximating open sets by bounded Lipschitz functions from below, it is enough to prove that
\[
E\, e(\tau)\, I_B\, I(\tau < \infty) = E\, e(0)\, E\, I_B\, I(\tau < \infty),
\]
where $e(t) := f(W_{t + t_1} - W_t, \ldots, W_{t + t_d} - W_t)$ for some $f \in C_b(\mathbb{R}^d)$.

The main tool in the proof is the approximation of an arbitrary stopping time $\tau$ by the dyadic stopping time
\[
\tau_n = \frac{\lfloor 2^n \tau \rfloor + 1}{2^n}.
\]
First of all, let us check that $\tau_n$ is a stopping time. Indeed, if
\[
\frac{k}{2^n} \le \tau < \frac{k+1}{2^n} \ \text{ then } \ \tau_n = \frac{k+1}{2^n}
\]
and, therefore, for any $t \ge 0$, if $\ell/2^n \le t < (\ell+1)/2^n$ then
\[
\{\tau_n \le t\} = \Big\{ \tau_n \le \frac{\ell}{2^n} \Big\} = \Big\{ \tau < \frac{\ell}{2^n} \Big\} = \bigcup_{q < \ell/2^n} \{\tau \le q\} \in \mathcal{F}_t.
\]
By construction, $\tau_n \downarrow \tau$ and, by the continuity of the process, $W_{\tau_n} \to W_\tau$ almost surely. Take a set $B \in \mathcal{F}_\tau$. For any integer $k \ge 1$, the event
\[
B \cap \{\tau_n = k 2^{-n}\} = B \cap \big\{ (k-1) 2^{-n} \le \tau < k 2^{-n} \big\} \in \mathcal{F}_{k 2^{-n}}
\]
is independent of $e(k 2^{-n})$ and, therefore,
\[
E\, e(\tau_n)\, I_B\, I(\tau < \infty)
= \sum_{k \ge 0} E\, e(k 2^{-n})\, I\big( B \cap \{\tau_n = k 2^{-n}\} \big)
= \sum_{k \ge 0} E\, e(k 2^{-n})\, E\, I\big( B \cap \{\tau_n = k 2^{-n}\} \big)
\]
\[
= E\, e(0) \sum_{k \ge 0} E\, I\big( B \cap \{\tau_n = k 2^{-n}\} \big)
= E\, e(0)\, E\, I_B\, I(\tau < \infty).
\]
In the second equality we used the homogeneity of the Brownian motion to write $E\, e(k 2^{-n}) = E\, e(0)$. Since $e(\tau_n) \to e(\tau)$ almost surely, this finishes the proof. □

Remark 6.2. A random variable $\tau \ge 0$ is called a Markov time if
\[
\{\tau < t\} \in \mathcal{F}_t \ \text{ for all } t \ge 0.
\]
The $\sigma$-algebra generated by the process up to a Markov time $\tau$ is defined by
\[
\mathcal{F}_{\tau+} = \big\{ B \in \mathcal{F} : B \cap \{\tau < t\} \in \mathcal{F}_t \ \text{for all } t \ge 0 \big\}. \tag{6.16}
\]
One can show that a stopping time is always a Markov time and $\mathcal{F}_\tau \subseteq \mathcal{F}_{\tau+}$. One can check that the above strong Markov property also holds for Markov times $\tau$ and $\sigma$-algebras $\mathcal{F}_{\tau+}$ (exercise below). □
We will use the strong Markov property in the above form in the next section, but in the examples in this section the events $A$ will be allowed to depend on the stopping time $\tau$. For this purpose, it will be convenient to have the following version of the strong Markov property. Given a time $s \ge 0$, consider the following two modifications of the Brownian motion:
\[
W^i(t; s) := W(t)\, I(t \le s) + (W_s + \tilde{W}_{t-s})\, I(t > s),
\qquad
W^r(t; s) := W(t)\, I(t \le s) + (W_s - (W_t - W_s))\, I(t > s), \tag{6.17}
\]
where the increment $W_t - W_s$ of the original process $W_t$ after time $s$ was replaced in the first case by an independent Brownian motion $(\tilde{W}_h)_{h \ge 0}$ at time $t - s$, and in the second case by $-(W_t - W_s)$, i.e. the original increment reflected around zero. (To accommodate the Brownian motion $(\tilde{W}_h)_{h \ge 0}$ independent of all $\sigma$-algebras $\mathcal{F}_t$, we can extend the probability space $\Omega$, for example, to the product space $\Omega \times C(\mathbb{R}_+)$ if necessary.) Of course, both modifications satisfy the properties of a Brownian motion, so
\[
(W^i(t; s))_{t \ge 0} \stackrel{d}{=} (W^r(t; s))_{t \ge 0} \stackrel{d}{=} (W(t))_{t \ge 0}.
\]
Our next result shows that this still holds if we replace the fixed time $s$ by a stopping time $\tau$.
Theorem 6.7 (Reflection principle). If $\tau$ is a stopping time and $a \in \{i, r\}$ then
\[
(W^a(t; \tau))_{t \ge 0} \stackrel{d}{=} (W_t)_{t \ge 0}
\]
on the event $\{\tau < \infty\}$. More precisely, for any $A \in \mathcal{B}(C(\mathbb{R}_+), d)$,
\[
P(W^a \in A,\ \tau < \infty) = P(W \in A,\ \tau < \infty). \tag{6.18}
\]

Proof. By the same reduction as in the previous theorem, it is enough to show that, for any $d \ge 1$, any $0 < t_1 < \ldots < t_d$, and any $f \in C_b(\mathbb{R}^d)$,
\[
E f(W^a(t_1; \tau), \ldots, W^a(t_d; \tau))\, I(\tau < \infty) = E f(W_{t_1}, \ldots, W_{t_d})\, I(\tau < \infty).
\]
Let us denote $e(s) := f(W^a(t_1; s), \ldots, W^a(t_d; s))$ and note that it is continuous in $s$, because the processes $W^a(t; s)$ for $a \in \{i, r\}$ are continuous in $(t, s)$ and $f \in C_b(\mathbb{R}^d)$. Therefore, for the stopping time $\tau_n$ defined as in the proof of the strong Markov property above, $e(\tau_n) \to e(\tau)$ on the event $\{\tau < \infty\}$. Finally, write
\[
E\, e(\tau_n)\, I(\tau < \infty) = \sum_{k \ge 0} E\, e(k 2^{-n})\, I(\tau_n = k 2^{-n})
\]
and note that $\{\tau_n = k 2^{-n}\} \in \mathcal{F}_{k 2^{-n}}$, while for $s = k 2^{-n}$ the only difference between $W_t$ and $W^a(t; s)$ is in how the increments after time $k 2^{-n}$ were defined (all independent of $\mathcal{F}_{k 2^{-n}}$). As a result, the sum on the right-hand side above is equal to
\[
\sum_{k \ge 0} E f(W_{t_1}, \ldots, W_{t_d})\, I(\tau_n = k 2^{-n}) = E f(W_{t_1}, \ldots, W_{t_d})\, I(\tau < \infty),
\]
which finishes the proof. □

Example 6.4.1. As a first application of the reflection principle, let us compute the probability
\[
P\Big( \sup_{t \le b} W_t \ge c \Big) = P(\tau_c \le b),
\]
for any $c > 0$ and the hitting time $\tau_c$. Let us write
\[
P(\tau_c \le b) = P(\tau_c \le b,\ W_b - W_{\tau_c} > 0) + P(\tau_c \le b,\ W_b - W_{\tau_c} < 0),
\]
where we can write strict inequalities because
\[
P(\tau_c \le b,\ W_b - W_{\tau_c} = 0) = P(\tau_c \le b,\ W_b = c) \le P(W_b = c) = 0.
\]
By the reflection principle above applied to $W^r(t; \tau_c)$, the two probabilities above are equal, so
\[
P(\tau_c \le b) = 2 P(\tau_c \le b,\ W_b - W_{\tau_c} > 0) = 2 P(\tau_c \le b,\ W_b > c) = 2 P(W_b > c),
\]
where we could omit $\tau_c \le b$ because, by continuity of the process, the event $W_b > c$ automatically implies that $\tau_c \le b$. Hence,
\[
P\Big( \sup_{t \le b} W_t \ge c \Big) = P(\tau_c \le b) = 2 P(W_b > c) = 2 \int_{c/\sqrt{b}}^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx. \tag{6.19}
\]
This implies that the density of $\tau_c$ is given by
\[
f_{\tau_c}(b) = \frac{1}{\sqrt{2\pi}} e^{-c^2/(2b)} \cdot \frac{c}{b^{3/2}},
\]
and since it is of order $O(b^{-3/2})$ as $b \to +\infty$, $E \tau_c = \infty$. □
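The identity (6.19) is easy to probe numerically. The following sketch (a Monte Carlo check, not part of the text; grid size, seed, and sample count are arbitrary choices) discretizes $W$ on $[0, b]$ and compares the two sides of $P(\sup_{t \le b} W_t \ge c) = 2 P(W_b > c)$; the discrete-time maximum slightly underestimates the true supremum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths, b, c = 1000, 10000, 1.0, 1.0

# Discretize W on [0, b] by summing independent Gaussian increments.
dW = rng.normal(0.0, np.sqrt(b / n_steps), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)

lhs = (W.max(axis=1) >= c).mean()   # P(sup_{t<=b} W_t >= c), up to discretization
rhs = 2 * (W[:, -1] > c).mean()     # 2 P(W_b > c), the reflection identity
print(lhs, rhs)
```

The two estimates should agree up to Monte Carlo noise and the (downward) discretization bias of the running maximum.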

Next, we compute various probabilities for the suprema of the Brownian bridge. In addition to the reflection principle, we will need the following observation that the Brownian bridge can be viewed as a Brownian motion conditioned to be equal to zero at time $t = 1$ (a.k.a. pinned down Brownian motion).

Lemma 6.5 (BB as pinned down BM). The conditional distribution of $(W_t)_{t \in [0,1]}$ given $|W_1| < \varepsilon$ converges to the law of $(B_t)_{t \in [0,1]}$,
\[
\mathcal{L}\big( W \,\big|\, |W_1| < \varepsilon \big) \to \mathcal{L}(B) \ \text{ as } \varepsilon \downarrow 0.
\]

Proof. If $W_t$ is a Brownian motion then $B_t = W_t - t W_1$ is the Brownian bridge for $t \in [0,1]$. Notice that $B_t$ is independent of $W_1$, because their covariance is
\[
E B_t W_1 = E W_t W_1 - t\, E W_1^2 = t - t = 0.
\]
Therefore, the Brownian motion can be written as the sum $W_t = B_t + t W_1$ of the Brownian bridge and the independent process $t W_1$ and, if we define a random variable $\eta_\varepsilon$ with distribution $\mathcal{L}(\eta_\varepsilon) = \mathcal{L}\big( W_1 \,\big|\, |W_1| < \varepsilon \big)$ independent of $B_t$, then
\[
\mathcal{L}\big( W \,\big|\, |W_1| < \varepsilon \big) = \mathcal{L}(B + t \eta_\varepsilon) \to \mathcal{L}(B)
\]
as $\varepsilon \downarrow 0$. □
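The decomposition $W_t = B_t + t W_1$ used in the proof can be spot-checked by simulation. A minimal sketch (grid sizes, seed, and the times $s, t$ are arbitrary choices, not from the text): construct the bridge as $B_t = W_t - t W_1$ and verify its covariance $\min(s,t) - st$ and its decorrelation from $W_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_paths = 500, 20000
t_grid = np.arange(1, n_steps + 1) / n_steps

dW = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
W1 = W[:, -1]
B = W - np.outer(W1, t_grid)         # Brownian bridge B_t = W_t - t W_1

s, t = 0.3, 0.7
i, j = int(s * n_steps) - 1, int(t * n_steps) - 1
cov_st = np.mean(B[:, i] * B[:, j])  # should be close to min(s,t) - s t = 0.09
cov_w1 = np.mean(B[:, j] * W1)       # should be close to 0 (independence of W_1)
print(cov_st, cov_w1)
```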
Using this lemma we will first prove the analogue of the above Example 6.4.1 for
the Brownian bridge.
Theorem 6.8. If $B_t$ is a Brownian bridge then, for all $b > 0$,
\[
P\Big( \sup_{t \in [0,1]} B_t \ge b \Big) = e^{-2 b^2}.
\]

Proof. Since $B_t = W_t - t W_1$ and $W_1$ are independent, we can write
\[
P(\exists t : B_t = b) = \frac{P(\exists t : W_t - t W_1 = b,\ |W_1| < \varepsilon)}{P(|W_1| < \varepsilon)}
= \frac{P(\exists t : W_t = b + t W_1,\ |W_1| < \varepsilon)}{P(|W_1| < \varepsilon)}.
\]
We can estimate the numerator from below and above by
\[
P\big( \exists t : W_t > b + \varepsilon,\ |W_1| < \varepsilon \big)
\le P\big( \exists t : W_t = b + t W_1,\ |W_1| < \varepsilon \big)
\le P\big( \exists t : W_t \ge b - \varepsilon,\ |W_1| < \varepsilon \big).
\]
Let us first analyze the upper bound. If we define the hitting time $\tau = \inf\{t : W_t = b - \varepsilon\}$ then $W_\tau = b - \varepsilon$ and
\[
P\big( \exists t : W_t \ge b - \varepsilon,\ |W_1| < \varepsilon \big)
= P\big( \tau \le 1,\ |W_1| < \varepsilon \big)
= P\big( \tau \le 1,\ W_1 - W_\tau \in (-b, -b + 2\varepsilon) \big).
\]
By the reflection principle,
\[
P\big( \tau \le 1,\ W_1 - W_\tau \in (-b, -b + 2\varepsilon) \big)
= P\big( \tau \le 1,\ -(W_1 - W_\tau) \in (-b, -b + 2\varepsilon) \big)
\]
\[
= P\big( \tau \le 1,\ W_1 - W_\tau \in (b - 2\varepsilon, b) \big)
= P\big( W_1 \in (2b - 3\varepsilon, 2b - \varepsilon) \big),
\]
because the fact that $W_1 \in (2b - 3\varepsilon, 2b - \varepsilon)$ automatically implies that $\tau \le 1$ for $b > 0$ and $\varepsilon$ small enough. Therefore, we proved that
\[
P(\exists t : B_t = b) \le \frac{P(W_1 \in (2b - 3\varepsilon, 2b - \varepsilon))}{P(W_1 \in (-\varepsilon, \varepsilon))} \to e^{-2 b^2}
\]
as $\varepsilon \downarrow 0$. The lower bound can be analyzed similarly. □
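Theorem 6.8 is also easy to check by simulation (a hedged sketch; path count, grid, and threshold are arbitrary choices): generate bridges via $B_t = W_t - t W_1$ and compare the empirical probability of $\sup_t B_t \ge b$ with $e^{-2 b^2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_paths, b = 1000, 5000, 1.0

t_grid = np.arange(1, n_steps + 1) / n_steps
dW = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
B = W - np.outer(W[:, -1], t_grid)   # bridges via B_t = W_t - t W_1

p_emp = (B.max(axis=1) >= b).mean()
p_exact = np.exp(-2 * b**2)          # Theorem 6.8 gives e^{-2 b^2}
print(p_emp, p_exact)
```

The empirical value sits slightly below the exact one because the discrete grid misses part of the supremum.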
Theorem 6.9 (Kolmogorov–Smirnov). If $B$ is a Brownian bridge then, for $b > 0$,
\[
P\Big( \sup_{0 \le t \le 1} |B_t| \ge b \Big) = 2 \sum_{n \ge 1} (-1)^{n-1} e^{-2 n^2 b^2}.
\]

Proof. For $n \ge 1$, consider the event
\[
A_n = \big\{ \exists\, t_1 < \cdots < t_n \le 1 : B_{t_j} = (-1)^{j-1} b \big\}
\]
and let $\tau_b$ and $\tau_{-b}$ be the hitting times of $b$ and $-b$. By the symmetry of the distribution of the process $B_t$,
\[
P\Big( \sup_{0 \le t \le 1} |B_t| \ge b \Big) = P\big( \tau_b \text{ or } \tau_{-b} \le 1 \big) = 2 P(A_1,\ \tau_b < \tau_{-b}).
\]
Again, by symmetry,
\[
P(A_n,\ \tau_b < \tau_{-b}) = P(A_n) - P(A_n,\ \tau_{-b} < \tau_b) = P(A_n) - P(A_{n+1},\ \tau_b < \tau_{-b})
\]
and, by induction,
\[
P(A_1,\ \tau_b < \tau_{-b}) = P(A_1) - P(A_2) + \ldots + (-1)^{n-1} P(A_n,\ \tau_b < \tau_{-b}).
\]
As in Theorem 6.8, reflecting the Brownian motion each time we hit $b$ or $-b$, one can show that
\[
P(A_n) = \lim_{\varepsilon \downarrow 0} \frac{P\big( W_1 \in (2nb - \varepsilon, 2nb + \varepsilon) \big)}{P\big( W_1 \in (-\varepsilon, \varepsilon) \big)} = e^{-\frac{1}{2}(2nb)^2} = e^{-2 n^2 b^2},
\]
and this finishes the proof. □
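The alternating series of Theorem 6.9 converges extremely fast and is easy to evaluate. A small sketch (the truncation at 100 terms is an arbitrary choice): at $b \approx 1.358$ it reproduces the familiar 5% critical value used in the Kolmogorov–Smirnov goodness-of-fit test.

```python
import numpy as np

def ks_survival(b, n_terms=100):
    """P(sup_{0<=t<=1} |B_t| >= b), the series of Theorem 6.9."""
    n = np.arange(1, n_terms + 1)
    return 2.0 * np.sum((-1.0) ** (n - 1) * np.exp(-2.0 * n**2 * b**2))

print(ks_survival(1.3581))  # close to 0.05, the 5% critical value
```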
Given $a, b > 0$, let us compute the probability that a Brownian bridge crosses one of the levels $-a$ or $b$.

Theorem 6.10 (Two-sided boundary). If $a, b > 0$ then
\[
P\big( \exists t : B_t = -a \text{ or } b \big)
= \sum_{n \ge 0} \Big( e^{-2(na + (n+1)b)^2} + e^{-2((n+1)a + nb)^2} \Big)
- 2 \sum_{n \ge 1} e^{-2 n^2 (a+b)^2}. \tag{6.20}
\]

Proof. We have
\[
P(\exists t : B_t = -a \text{ or } b) = P(\exists t : B_t = -a,\ \tau_{-a} < \tau_b) + P(\exists t : B_t = b,\ \tau_b < \tau_{-a}).
\]
If we introduce the events
\[
C_n = \big\{ \exists\, t_1 < \ldots < t_n : B_{t_1} = b,\ B_{t_2} = -a, \ldots \big\}
\]
and
\[
A_n = \big\{ \exists\, t_1 < \ldots < t_n : B_{t_1} = -a,\ B_{t_2} = b, \ldots \big\}
\]
then, as in the previous theorem,
\[
P(C_n,\ \tau_b < \tau_{-a}) = P(C_n) - P(C_n,\ \tau_{-a} < \tau_b) = P(C_n) - P(A_{n+1},\ \tau_{-a} < \tau_b)
\]
and, similarly,
\[
P(A_n,\ \tau_{-a} < \tau_b) = P(A_n) - P(C_{n+1},\ \tau_b < \tau_{-a}).
\]
By induction,
\[
P(\exists t : B_t = -a \text{ or } b) = \sum_{n \ge 1} (-1)^{n-1} \big( P(A_n) + P(C_n) \big).
\]
The probabilities of the events $A_n$ and $C_n$ can be computed using the reflection principle as above: $P(A_{2n}) = P(C_{2n}) = e^{-2 n^2 (a+b)^2}$, $P(C_{2n+1}) = e^{-2(na + (n+1)b)^2}$, and $P(A_{2n+1}) = e^{-2((n+1)a + nb)^2}$, which finishes the proof. □

If $X = -\inf B_t$ and $Y = \sup B_t$ then the spread of the process $B_t$ is $\xi = X + Y$.

Theorem 6.11 (Distribution of the spread). For any $t > 0$,
\[
P(\xi \le t) = 1 - \sum_{n \ge 1} (8 n^2 t^2 - 2)\, e^{-2 n^2 t^2}.
\]

Proof. First of all, (6.20) gives the joint c.d.f. of $(X, Y)$ because
\[
F(a, b) = P(X < a,\ Y < b) = P\big( -a < \inf B_t,\ \sup B_t < b \big) = 1 - P(\exists t : B_t = -a \text{ or } b).
\]
If $f(a, b) = \frac{\partial^2 F}{\partial a\, \partial b}$ is the joint p.d.f. of $(X, Y)$ then the c.d.f. of the spread $X + Y$ is
\[
P(Y + X \le t) = \int_0^t \int_0^{t-a} f(a, b)\, db\, da.
\]
The inner integral is
\[
\int_0^{t-a} f(a, b)\, db = \frac{\partial F}{\partial a}(a, t - a) - \frac{\partial F}{\partial a}(a, 0).
\]
Since
\[
\frac{\partial F}{\partial a}(a, b) = \sum_{n \ge 0} 4n \big( na + (n+1)b \big) e^{-2(na + (n+1)b)^2}
+ \sum_{n \ge 0} 4(n+1) \big( (n+1)a + nb \big) e^{-2((n+1)a + nb)^2}
- \sum_{n \ge 1} 8 n^2 (a + b)\, e^{-2 n^2 (a+b)^2},
\]
plugging in the values $b = t - a$ and $b = 0$ gives
\[
\int_0^{t-a} f(a, b)\, db = \sum_{n \ge 0} 4n \big( (n+1)t - a \big) e^{-2((n+1)t - a)^2}
+ \sum_{n \ge 0} 4(n+1)(nt + a)\, e^{-2(nt + a)^2}
- \sum_{n \ge 1} 8 n^2 t\, e^{-2 n^2 t^2}.
\]
Integrating over $a \in [0, t]$,
\[
P(Y + X \le t) = \sum_{n \ge 0} (2n + 1) \Big( e^{-2 n^2 t^2} - e^{-2 (n+1)^2 t^2} \Big) - \sum_{n \ge 1} 8 n^2 t^2\, e^{-2 n^2 t^2}
= 1 + 2 \sum_{n \ge 1} e^{-2 n^2 t^2} - \sum_{n \ge 1} 8 n^2 t^2\, e^{-2 n^2 t^2},
\]
and this finishes the proof. □

Exercise 6.4.1. Prove the strong Markov property for a Markov time $\tau$ and the $\sigma$-algebra $\mathcal{F}_{\tau+}$ in (6.16).

In the next four exercises, $(\mathcal{F}_t)$ is a nested family of $\sigma$-algebras, $\mathcal{F}_t \subseteq \mathcal{F}_s$ for $t \le s$.

Exercise 6.4.2. Show that for $\tau = t$, $\mathcal{F}_{\tau+} = \mathcal{F}_{t+} := \cap_{s > t} \mathcal{F}_s$.

Exercise 6.4.3. A family $(\mathcal{F}_t)$ is called right-continuous if $\cap_{s > t} \mathcal{F}_s = \mathcal{F}_t$ for all $t \ge 0$. If $(\mathcal{F}_t)$ is right-continuous, show that a Markov time $\tau$ is a stopping time and $\mathcal{F}_{\tau+} = \mathcal{F}_\tau$.

Exercise 6.4.4. If we define $\mathcal{F}_t^+ := \cap_{s > t} \mathcal{F}_s$, show that $(\mathcal{F}_t^+)$ is right-continuous.

Exercise 6.4.5. If $(\mathcal{F}_t)$ satisfies conditions (i) – (iii) at the beginning of this section, prove that $(\mathcal{F}_t^+)$ satisfies them too.

6.5 Skorohod’s imbedding and laws of the iterated logarithm

In this section we will prove another classical limit theorem in probability, called the law of the iterated logarithm, using the method of Skorohod's imbedding, which we describe first. Throughout this section, $(W_t)_{t \ge 0}$ will be a Brownian motion and $\mathcal{F}_t = \sigma((W_s)_{s \le t})$. We will need the following preliminary technical result.

Theorem 6.12. If $\tau < \infty$ is a stopping time such that $E\tau < \infty$ then $E W_\tau = 0$ and $E W_\tau^2 = E\tau$.

Proof. Let us start with the case when the stopping time $\tau$ takes a finite number of values, $\tau \in \{t_1, \ldots, t_n\}$. Note that $(W_{t_j}, \mathcal{F}_{t_j})$ is a martingale, since
\[
E(W_{t_j} \,|\, \mathcal{F}_{t_{j-1}}) = E(W_{t_j} - W_{t_{j-1}} + W_{t_{j-1}} \,|\, \mathcal{F}_{t_{j-1}}) = W_{t_{j-1}}.
\]
By the optional stopping theorem for martingales, $E W_\tau = E W_{t_1} = 0$. Next, let us prove that $E W_\tau^2 = E\tau$ by induction on $n$. If $n = 1$ then $\tau = t_1$ and
\[
E W_\tau^2 = E W_{t_1}^2 = t_1 = E\tau.
\]

To make the induction step from $n - 1$ to $n$, define a stopping time $\alpha = \tau \wedge t_{n-1}$ and write
\[
E W_\tau^2 = E(W_\alpha + W_\tau - W_\alpha)^2 = E W_\alpha^2 + E(W_\tau - W_\alpha)^2 + 2\, E W_\alpha (W_\tau - W_\alpha).
\]
First of all, by the induction assumption, $E W_\alpha^2 = E\alpha$. Moreover, $\tau \ne \alpha$ only if $\tau = t_n$, in which case $\alpha = t_{n-1}$. The event $\{\tau = t_n\} = \{\tau \le t_{n-1}\}^c \in \mathcal{F}_{t_{n-1}}$, so
\[
E W_\alpha (W_\tau - W_\alpha) = E W_{t_{n-1}} (W_{t_n} - W_{t_{n-1}})\, I(\tau = t_n) = 0.
\]
Similarly,
\[
E(W_\tau - W_\alpha)^2 = E\, E\big( I(\tau = t_n) (W_{t_n} - W_{t_{n-1}})^2 \,\big|\, \mathcal{F}_{t_{n-1}} \big) = (t_n - t_{n-1}) P(\tau = t_n).
\]
Therefore,
\[
E W_\tau^2 = E\alpha + (t_n - t_{n-1}) P(\tau = t_n) = E\tau
\]
and this finishes the proof of the induction step.

Next, let us consider the case of a uniformly bounded stopping time $\tau \le M < \infty$. In the previous section we defined the dyadic approximation
\[
\tau_n = \frac{\lfloor 2^n \tau \rfloor + 1}{2^n},
\]
which is also a stopping time, $\tau_n \downarrow \tau$, and by the sample continuity $W_{\tau_n} \to W_\tau$ almost surely. Since $(\tau_n)$ are uniformly bounded, $E\tau_n \to E\tau$. To prove that $E W_{\tau_n}^2 \to E W_\tau^2$, we need to show that the sequence $(W_{\tau_n}^2)$ is uniformly integrable. Notice that $\tau_n < 2M$ and, therefore, $\tau_n$ takes possible values of the type $k/2^n$ for $k \le k_0 = \lfloor 2^n (2M) \rfloor$. Since the sequence $(W_{1/2^n}, \ldots, W_{k_0/2^n}, W_{2M})$ is a martingale adapted to the corresponding sequence of $\sigma$-algebras $\mathcal{F}_t$, and $\tau_n$ and $2M$ are two stopping times such that $\tau_n < 2M$, by the optional stopping theorem, Theorem 5.2, $W_{\tau_n} = E(W_{2M} \,|\, \mathcal{F}_{\tau_n})$. By Jensen's inequality,
\[
W_{\tau_n}^4 \le E(W_{2M}^4 \,|\, \mathcal{F}_{\tau_n}), \qquad E W_{\tau_n}^4 \le E W_{2M}^4 = 12 M^2,
\]
and the uniform integrability follows:
\[
E W_{\tau_n}^2\, I(|W_{\tau_n}| > N) \le \frac{E W_{\tau_n}^4}{N^2} \le \frac{12 M^2}{N^2}.
\]
This proves that $E W_{\tau_n} \to E W_\tau$ and $E W_{\tau_n}^2 \to E W_\tau^2$. Since $\tau_n$ takes a finite number of values, $E W_{\tau_n} = 0$ and $E W_{\tau_n}^2 = E\tau_n$, and letting $n \to \infty$ proves
\[
E W_\tau = 0, \qquad E W_\tau^2 = E\tau. \tag{6.21}
\]

Before we consider the general case, let us notice that for two bounded stopping times $\tau \le \rho \le M$ one can similarly show that
\[
E(W_\rho - W_\tau) W_\tau = 0, \tag{6.22}
\]
because, by optional stopping applied along the dyadic approximations $(W_{\tau_n}, \mathcal{F}_{\tau_n})$, $(W_{\rho_n}, \mathcal{F}_{\rho_n})$,
\[
E(W_{\rho_n} - W_{\tau_n}) W_{\tau_n} = E W_{\tau_n} \big( E(W_{\rho_n} \,|\, \mathcal{F}_{\tau_n}) - W_{\tau_n} \big) = 0.
\]
Finally, we consider the general case. Let us define $\tau(n) = \min(\tau, n)$. Since $E W_{\tau(n)}^2 = E\tau(n) \le E\tau < \infty$, the sequence $W_{\tau(n)}$ is uniformly integrable and, thus, $0 = E W_{\tau(n)} \to E W_\tau$, which proves that $E W_\tau = 0$. Next, for $m \le n$,
\[
E(W_{\tau(n)} - W_{\tau(m)})^2 = E W_{\tau(n)}^2 - E W_{\tau(m)}^2 - 2\, E W_{\tau(m)} (W_{\tau(n)} - W_{\tau(m)}) = E\tau(n) - E\tau(m),
\]
using (6.21), (6.22) and the fact that $\tau(n), \tau(m)$ are bounded stopping times. Since $\tau(n) \uparrow \tau$, Fatou's lemma and the monotone convergence theorem imply
\[
E(W_\tau - W_{\tau(m)})^2 \le \liminf_{n \to \infty} \big( E\tau(n) - E\tau(m) \big) = E\tau - E\tau(m).
\]
Letting $m \to \infty$ shows that
\[
\lim_{m \to \infty} E(W_\tau - W_{\tau(m)})^2 = 0,
\]
which means that $E W_{\tau(m)}^2 \to E W_\tau^2$. Since $E W_{\tau(m)}^2 = E\tau(m)$ and $E\tau(m) \to E\tau$ by the monotone convergence theorem, this implies that $E W_\tau^2 = E\tau$. □

We will now prove Skorohod's imbedding theorem.

Theorem 6.13 (Skorohod's imbedding). Let $Y$ be a random variable such that $EY = 0$, $EY^2 < \infty$. There exists a stopping time $\tau < \infty$ such that $E\tau < \infty$ and $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$.

Proof. Let us start with the simplest case when $Y$ takes only two values, $Y \in \{-a, b\}$ for $a, b > 0$. The condition $EY = 0$ determines the distribution of $Y$:
\[
p b + (1 - p)(-a) = 0 \quad \text{and} \quad p = \frac{a}{a + b}. \tag{6.23}
\]
Let $\tau = \inf\{t > 0 \,|\, W_t = -a \text{ or } b\}$ be the hitting time of the two-sided boundary $-a$, $b$. The tail probability of $\tau$ can be bounded by
\[
P(\tau > n) \le P\big( |W_{j+1} - W_j| < a + b,\ 0 \le j \le n - 1 \big) = P(|W_1| < a + b)^n = \gamma^n. \tag{6.24}
\]
Therefore, $E\tau < \infty$ and, by the previous theorem, $E W_\tau = 0$. Since $W_\tau \in \{-a, b\}$, we must have $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$. Let us now consider the general case. If $\mu$ is the
law of $Y$, let us define $Y$ by the identity $Y = Y(x) = x$ on its sample probability space $(\mathbb{R}, \mathcal{B}, \mu)$. Let us construct a sequence of $\sigma$-algebras
\[
\mathcal{B}_1 \subseteq \mathcal{B}_2 \subseteq \ldots \subseteq \mathcal{B}
\]
as follows. Let $\mathcal{B}_1$ be generated by the set $(-\infty, 0)$, i.e.
\[
\mathcal{B}_1 = \big\{ \varnothing,\ \mathbb{R},\ (-\infty, 0),\ [0, +\infty) \big\}.
\]
Given $\mathcal{B}_j$, let us define $\mathcal{B}_{j+1}$ by splitting each finite interval $[c, d) \in \mathcal{B}_j$ into the two intervals $[c, (c+d)/2)$ and $[(c+d)/2, d)$, splitting the infinite interval $(-\infty, -j)$ into $(-\infty, -(j+1))$ and $[-(j+1), -j)$ and, similarly, splitting $[j, +\infty)$ into $[j, j+1)$ and $[j+1, +\infty)$. Consider the right-closed martingale
\[
Y_j = E(Y \,|\, \mathcal{B}_j). \tag{6.25}
\]
It is almost obvious that the Borel $\sigma$-algebra $\mathcal{B}$ on $\mathbb{R}$ is generated by all these $\sigma$-algebras, $\mathcal{B} = \vee_{j \ge 1} \mathcal{B}_j$. Then, by Levy's martingale convergence, Lemma 5.6, $Y_j \to E(Y \,|\, \mathcal{B}) = Y$ almost surely. Since $Y_j$ is measurable on $\mathcal{B}_j$, it must be constant on each simple set $[c, d) \in \mathcal{B}_j$. If $Y_j(x) = y$ for $x \in [c, d)$ then, since $Y_j = E(Y \,|\, \mathcal{B}_j)$,
\[
y\, \mu([c, d)) = E Y_j I_{[c,d)} = E Y I_{[c,d)} = \int_{[c,d)} x\, d\mu(x)
\]
and
\[
y = \frac{1}{\mu([c, d))} \int_{[c,d)} x\, d\mu(x). \tag{6.26}
\]
If $\mu([c, d)) = 0$, we pick any $y \in (c, d)$ as the value of $Y_j$ on $[c, d)$. Since in the $\sigma$-algebra $\mathcal{B}_{j+1}$ the interval $[c, d)$ is split into two intervals, the random variable $Y_{j+1}$ can take only two values on the interval $[c, d)$, say $c \le y_1 < y_2 < d$, and
\[
P(Y_{j+1} = y_2) = P(Y_{j+1} - Y_j = y_2 - y,\ Y_j = y)
= P(Y_{j+1} - Y_j = y_2 - y \,|\, Y_j = y)\, P(Y_j = y)
= \frac{y - y_1}{y_2 - y_1} P(Y_j = y), \tag{6.27}
\]
where in the last equation we used (6.23) together with $E(Y_{j+1} - Y_j \,|\, \mathcal{B}_j) = 0$.

We will define stopping times $\tau_n$ such that $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n)$ as follows. Since $Y_1$ takes only two values $-a$ and $b$, if $\tau_1 = \inf\{t > 0 \,|\, W_t = -a \text{ or } b\}$ then we proved above that $\mathcal{L}(W_{\tau_1}) = \mathcal{L}(Y_1)$. Given $\tau_j$, define $\tau_{j+1}$ by:
\[
\text{if } W_{\tau_j} = y \text{ for } y \text{ in (6.26), let } \tau_{j+1} = \inf\{t > \tau_j \,|\, W_t = y_1 \text{ or } y_2\}.
\]
Since, by definition, the events $\{W_{\tau_{j+1}} = y_j\}$ for $j = 1, 2$ assume that $W_{\tau_j} = y$, we can rewrite, for example,
\[
\{W_{\tau_{j+1}} = y_2\} = \big\{ (W_{\tau_j + h} - W_{\tau_j})_{h \ge 0} \text{ hits } y_2 - y \text{ before } y_1 - y \big\} \cap \{W_{\tau_j} = y\}.
\]
Note that $\{W_{\tau_j} = y\} \in \mathcal{F}_{\tau_j}$ and, by the strong Markov property in Theorem 6.6,
\[
P(W_{\tau_{j+1}} = y_2) = P\big( (W_h)_{h \ge 0} \text{ hits } y_2 - y \text{ before } y_1 - y \big)\, P(W_{\tau_j} = y)
= \frac{y - y_1}{y_2 - y_1} P(W_{\tau_j} = y),
\]
where in the last equation we again used (6.23). Since the $Y_j$'s above satisfy the same recursive equations (6.27), this proves that $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n)$.

Also, $\tau_n \le \tau = \inf\{t > 0 \,|\, W_t = -a \text{ or } b\}$ if we take $-a$ and $b$ to be the smallest and largest among all values $y$ that $Y_n$ can take so, by (6.24), all $E\tau_n < \infty$. Moreover, by the previous theorem and (6.25),
\[
E\tau_n = E W_{\tau_n}^2 = E Y_n^2 \le E Y^2 < \infty.
\]
The sequence $\tau_n$ is monotone, so it converges to some stopping time, $\tau_n \uparrow \tau$, and, therefore, $E\tau = \lim E\tau_n \le E Y^2 < \infty$. By sample continuity, $W_{\tau_n} \to W_\tau$ a.s. and, hence, $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n) \to \mathcal{L}(W_\tau) = \mathcal{L}(Y)$. □
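The two-point case of the embedding is easy to test numerically. In this sketch (step size, seed, and path count are arbitrary choices; the discretization slightly inflates the exit level through overshoot) a discretized Brownian path is run until it exits $(-a, b)$, and the frequency of exiting at $b$ is compared with $p = a/(a+b)$ from (6.23).

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, dt, n_paths = 1.0, 2.0, 5e-4, 4000

# Run all paths until each exits (-a, b); record which boundary was hit.
w = np.zeros(n_paths)
alive = np.ones(n_paths, dtype=bool)
hit_b = np.zeros(n_paths, dtype=bool)
while alive.any():
    w[alive] += rng.normal(0.0, np.sqrt(dt), alive.sum())
    hit_b |= alive & (w >= b)
    alive &= (w > -a) & (w < b)

print(hit_b.mean(), a / (a + b))  # embedded law: P(W_tau = b) = a/(a+b) = 1/3
```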

Skorohod's imbedding will be used below to reduce the case of sums of i.i.d. random variables to the following law of the iterated logarithm for the Brownian motion, which we prove first.

Theorem 6.14. If $u(t) = \sqrt{2 t \ell(t)}$, where $\ell(t) = \log \log(t)$, then
\[
\limsup_{t \to +\infty} \frac{W_t}{u(t)} = 1 \ \text{ a.s.}
\]

Let us start with the following result, which shows that the Brownian motion does not fluctuate much between times $s$ and $(1 + \varepsilon)s$ for large $s$.

Lemma 6.6. For any $\varepsilon \in (0, 1)$,
\[
\limsup_{s \to \infty}\, \sup \Big\{ \frac{|W_t - W_s|}{u(s)} : s \le t \le (1 + \varepsilon)s \Big\} \le 8 \sqrt{\varepsilon} \ \text{ a.s.}
\]

Proof. Let $\varepsilon > 0$, $t_k = (1 + \varepsilon)^k$ and $M_k = \sqrt{2\varepsilon}\, u(t_k)$. By symmetry, the equation (6.19) and the Gaussian tail estimate in Lemma 6.3,
\[
P\Big( \sup_{t_k \le t \le t_{k+1}} |W_t - W_{t_k}| \ge M_k \Big)
\le 2 P\Big( \sup_{0 \le t \le t_{k+1} - t_k} W_t \ge M_k \Big)
= 4\, N(0, t_{k+1} - t_k)(M_k, \infty)
\]
\[
\le 4 \exp\Big( -\frac{1}{2} \frac{M_k^2}{t_{k+1} - t_k} \Big)
= 4 \exp\Big( -\frac{4 \varepsilon t_k \ell(t_k)}{2 \varepsilon t_k} \Big)
= 4 \Big( \frac{1}{k \log(1 + \varepsilon)} \Big)^2.
\]
The sum of these probabilities converges and, by the Borel–Cantelli lemma, for large enough $k$,
\[
\sup_{t_k \le t \le t_{k+1}} |W_t - W_{t_k}| \le \sqrt{2\varepsilon}\, u(t_k).
\]
If $k$ is such that $t_k \le s \le t_{k+1}$ and $s \le t \le (1 + \varepsilon)s$ then, clearly, $t_k \le s \le t \le t_{k+2}$ and, therefore, for large enough $k$,
\[
|W_t - W_s| \le 2 \sqrt{2\varepsilon}\, u(t_k) + \sqrt{2\varepsilon}\, u(t_{k+1}) \le 8 \sqrt{\varepsilon}\, u(s).
\]
This finishes the proof. □
Proof (of Theorem 6.14). For $\varepsilon > 0$, $a > 1$ and $t_k = a^k$ (recall $u(t) = \sqrt{2 t \ell(t)}$),
\[
P\big( W_{t_k} \ge (1 + \varepsilon) u(t_k) \big)
= P\Big( \frac{W_{t_k}}{\sqrt{t_k}} \ge (1 + \varepsilon) \sqrt{2 \ell(t_k)} \Big) \tag{6.28}
\]
\[
\sim \frac{1}{\sqrt{2\pi}} \frac{1}{(1 + \varepsilon) \sqrt{2 \ell(t_k)}}\, e^{-(1 + \varepsilon)^2 \ell(t_k)}
\sim \frac{1}{\sqrt{2\pi}} \frac{1}{(1 + \varepsilon) \sqrt{2 \ell(a^k)}} \Big( \frac{1}{\log a^k} \Big)^{(1 + \varepsilon)^2}.
\]
This series converges so, by the Borel–Cantelli lemma, $W_{t_k} \le (1 + \varepsilon) u(a^k)$ for large enough $k$. If we take $a = 1 + \varepsilon$, together with Lemma 6.6 this gives that, for large enough $k$, if $t_k \le t < t_{k+1}$ then
\[
\frac{W_t}{u(t)} = \frac{W_t - W_{t_k}}{u(t_k)} \frac{u(t_k)}{u(t)} + \frac{W_{t_k}}{u(t_k)} \frac{u(t_k)}{u(t)} \le (1 + \varepsilon) + 8 \sqrt{\varepsilon}.
\]
Letting $\varepsilon \downarrow 0$ over some sequence proves that, with probability one,
\[
\limsup_{t \to \infty} \frac{W_t}{u(t)} \le 1.
\]
To prove that the upper limit is equal to $1$, we will use the Borel–Cantelli lemma for the independent increments $W_{a^k} - W_{a^{k-1}}$ for large values of the parameter $a > 1$. If $0 < \varepsilon < 1$ then, similarly to (6.28),
\[
P\big( W_{a^k} - W_{a^{k-1}} \ge (1 - \varepsilon)\, u(a^k - a^{k-1}) \big)
\sim \frac{1}{\sqrt{2\pi}} \frac{1}{(1 - \varepsilon) \sqrt{2 \ell(a^k - a^{k-1})}} \Big( \frac{1}{\log(a^k - a^{k-1})} \Big)^{(1 - \varepsilon)^2}.
\]
The series diverges and, since these events are independent, by the Borel–Cantelli lemma they occur infinitely often with probability one. We already proved in (6.28) that, for $\varepsilon > 0$, $W_{a^k} \le (1 + \varepsilon) u(a^k)$ for large enough $k \ge 1$ or, by symmetry, $W_{a^k} \ge -(1 + \varepsilon) u(a^k)$ for large enough $k \ge 1$. Together with $W_{a^k} - W_{a^{k-1}} \ge (1 - \varepsilon)\, u(a^k - a^{k-1})$ this gives
\[
\frac{W_{a^k}}{u(a^k)} \ge (1 - \varepsilon) \frac{u(a^k - a^{k-1})}{u(a^k)} + \frac{W_{a^{k-1}}}{u(a^k)}
\ge (1 - \varepsilon) \frac{u(a^k - a^{k-1})}{u(a^k)} - (1 + \varepsilon) \frac{u(a^{k-1})}{u(a^k)}
\]
\[
= (1 - \varepsilon) \sqrt{\frac{(a^k - a^{k-1})\, \ell(a^k - a^{k-1})}{a^k\, \ell(a^k)}} - (1 + \varepsilon) \sqrt{\frac{a^{k-1}\, \ell(a^{k-1})}{a^k\, \ell(a^k)}}
\]
and
\[
\limsup_{t \to \infty} \frac{W_t}{u(t)} \ge \limsup_{k \to \infty} \frac{W_{a^k}}{u(a^k)} \ge (1 - \varepsilon) \sqrt{1 - \frac{1}{a}} - (1 + \varepsilon) \sqrt{\frac{1}{a}}.
\]
Letting $\varepsilon \downarrow 0$ and $a \to \infty$ over some sequences proves that the upper limit is equal to one. □

The LIL for Brownian motion implies the LIL for sums of independent random variables via Skorohod's imbedding.

Theorem 6.15. Suppose that $Y_1, \ldots, Y_n$ are i.i.d. and $E Y_i = 0$, $E Y_i^2 = 1$. If $S_n = Y_1 + \ldots + Y_n$ then
\[
\limsup_{n \to \infty} \frac{S_n}{\sqrt{2 n \log \log n}} = 1 \ \text{ a.s.}
\]

Proof. Let us define a stopping time $\tau(1)$ such that $W_{\tau(1)} \stackrel{d}{=} Y_1$. By the strong Markov property, the increment of the process after the stopping time is independent of the process before the stopping time and has the law of the Brownian motion. Therefore, we can define $\tau(2)$ such that $W_{\tau(1) + \tau(2)} - W_{\tau(1)} \stackrel{d}{=} Y_2$ and, by independence, $W_{\tau(1) + \tau(2)} \stackrel{d}{=} Y_1 + Y_2$ and $\tau(1), \tau(2)$ are i.i.d. By induction, we can define i.i.d. $\tau(1), \ldots, \tau(n)$ such that, for $T(n) = \tau(1) + \ldots + \tau(n)$,
\[
\Big( \frac{S_n}{u(n)} \Big)_{n \ge 1} \stackrel{d}{=} \Big( \frac{W_{T(n)}}{u(n)} \Big)_{n \ge 1}.
\]
By Exercise 2.3.6, it is enough to prove the almost sure convergence of the right-hand side. Let us write
\[
\frac{W_{T(n)}}{u(n)} = \frac{W_n}{u(n)} + \frac{W_{T(n)} - W_n}{u(n)}.
\]
By the LIL for Brownian motion,
\[
\limsup_{n \to \infty} \frac{W_n}{u(n)} = 1.
\]
By the strong law of large numbers, $T(n)/n \to E\tau(1) = E Y_1^2 = 1$ almost surely. This means that, for any $\varepsilon > 0$,
\[
n / \sqrt{1 + \varepsilon} \le T(n) \le n \sqrt{1 + \varepsilon} \ \text{ for large enough } n.
\]
Using Lemma 6.6 with $s = n / \sqrt{1 + \varepsilon}$ implies that
\[
\limsup_{n \to \infty} \frac{|W_{T(n)} - W_n|}{u(n)} \le 16 \sqrt{\varepsilon},
\]
and letting $\varepsilon \downarrow 0$ finishes the proof. □

The LIL for Brownian motion also implies the local LIL:
\[
\limsup_{t \to 0} \frac{W_t}{\sqrt{2 t \ell(1/t)}} = 1.
\]
It is easy to check that if $W_t$ is a Brownian motion then $t W_{1/t}$ is also a Brownian motion, and the result follows by the change of variable $t \to 1/t$. To check that $t W_{1/t}$ is a Brownian motion, notice that for $t < s$,
\[
E\big( s W_{1/s} - t W_{1/t} \big)\, t W_{1/t} = st\, \frac{1}{s} - t^2\, \frac{1}{t} = t - t = 0
\]
and
\[
E\big( t W_{1/t} - s W_{1/s} \big)^2 = t + s - 2t = s - t.
\]

Exercise 6.5.1. Show that the condition $E\tau < \infty$ in Theorem 6.12 above cannot be replaced with the condition $E W_\tau^2 < \infty$. Hint: see the previous section.

Exercise 6.5.2. Consider the sequence $T(n) = \tau(1) + \ldots + \tau(n)$ as in the proof of Theorem 6.15. Prove that, for any $\varepsilon > 0$,
\[
\lim_{n \to \infty} P\Big( \max_{j \le n} \frac{|W_{T(j)} - W_j|}{\sqrt{n}} \ge \varepsilon \Big) = 0.
\]
Hint: First prove that, by the SLLN, for any $\delta > 0$,
\[
\lim_{n \to \infty} P\Big( \max_{j \le n} \frac{|T(j) - j|}{n} \ge \delta \Big) = 0,
\]
and then use the fact that $(\sqrt{c}\, W_t)_{t \ge 0} \stackrel{d}{=} (W_{ct})_{t \ge 0}$.

Exercise 6.5.3. For $t \le 1$, define the process $X_t^n := W_{T(nt)} / \sqrt{n}$ when $nt$ is an integer, and by linear interpolation in between, as in Donsker's theorem. Let $K$ be a closed set in $C([0,1], \|\cdot\|_\infty)$. Show that, for any $\varepsilon > 0$,
\[
\limsup_{n \to \infty} P\big( (X_t^n)_{t \le 1} \in K \big) \le P\big( (W_t)_{t \le 1} \in K^\varepsilon \big),
\]
where $K^\varepsilon$ denotes the closed $\varepsilon$-neighborhood of $K$. Hint: Use the previous exercise and the fact that $(\sqrt{c}\, W_t)_{t \ge 0} \stackrel{d}{=} (W_{ct})_{t \ge 0}$.

Exercise 6.5.4 (Donsker's theorem). In the setting of the previous exercise, prove that $(X_t^n)_{t \le 1}$ converges in distribution to $(W_t)_{t \le 1}$ in $C([0,1], \|\cdot\|_\infty)$. Hint: Use the Portmanteau theorem.

Chapter 7
Poisson Processes

7.1 Overview of Poisson processes

In this section we will introduce and review several important properties of Poisson processes on a measurable space $(S, \mathcal{S})$. In applications, $(S, \mathcal{S})$ is usually some nice space, such as a Euclidean space $\mathbb{R}^n$ with the Borel $\sigma$-algebra. However, in general, it is enough to require that the diagonal $\{(s, s') : s = s'\}$ is measurable in the product space $S \times S$ which, in particular, implies that every singleton set $\{s\}$ in $S$ is measurable as a section of the diagonal. This condition is needed to be able to write $P(X = Y)$ for a pair $(X, Y)$ of random variables defined on the product space. From now on, every time we consider a measurable space we will assume that it satisfies this condition. Let us also notice that a product $S \times S'$ of two such spaces will also satisfy this condition, since $\{(s_1, s_1') = (s_2, s_2')\} = \{s_1 = s_2\} \cap \{s_1' = s_2'\}$.

Let $\mu$ and $\mu_n$ for $n \ge 1$ be some non-atomic (not having any atoms, i.e. points of positive measure) measures on $S$ such that
\[
\mu = \sum_{n \ge 1} \mu_n, \qquad \mu_n(S) < \infty. \tag{7.1}
\]
For each $n \ge 1$, let $N_n$ be a random variable with the Poisson distribution $P(\mu_n(S))$ with mean $\mu_n(S)$, and let $(X_{n\ell})_{\ell \ge 1}$ be i.i.d. random variables, also independent of $N_n$, with the distribution
\[
p_n(B) = \frac{\mu_n(B)}{\mu_n(S)}. \tag{7.2}
\]
We assume that all these random variables are independent for different $n \ge 1$. The condition that $\mu$ is non-atomic implies that $P(X_{n\ell} = X_{mj}) = 0$ if $n \ne m$ or $\ell \ne j$. Let us consider the random sets
\[
\Pi_n = \big\{ X_{n1}, \ldots, X_{n N_n} \big\} \quad \text{and} \quad \Pi = \bigcup_{n \ge 1} \Pi_n. \tag{7.3}
\]

The set $\Pi$ will be called a Poisson process on $S$ with the mean measure $\mu$. Let us point out a simple observation that will be used many times below: if, first, we are given the means $\mu_n(S)$ and the distributions (7.2) that were used to generate the set (7.3), then the measure $\mu$ in (7.1) can be written as
\[
\mu = \sum_{n \ge 1} \mu_n(S)\, p_n. \tag{7.4}
\]
We will show in Theorem 7.4 below that when the measure $\mu$ is $\sigma$-finite then, in some sense, this definition of a Poisson process $\Pi$ is independent of the particular representation (7.1). However, for several reasons it is convenient to think of the above construction as the definition of a Poisson process. First of all, it allows us to avoid any discussion about what "a random set" means and, moreover, many important properties of Poisson processes follow from it rather directly. In any case, we will show in Theorem 7.5 below that all such processes satisfy the traditional definition of a Poisson process.
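The construction (7.3) is direct to implement. A minimal sketch for $S = [0,1]$ with $\mu = 5 \cdot \mathrm{Leb}$, a single term in (7.1) (the intensity and the test set are arbitrary choices): counts in a fixed set $B$ should then be Poisson with mean $\mu(B)$, so their sample mean and variance should both be close to $\mu(B)$.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_S, n_reps = 5.0, 20000              # mean measure mu = 5 * Lebesgue on [0,1]
B_lo, B_hi = 0.2, 0.6                  # test set B with mu(B) = 5 * 0.4 = 2

counts = np.empty(n_reps)
for r in range(n_reps):
    N = rng.poisson(mu_S)              # N ~ Poisson(mu(S)), as in (7.3)
    pts = rng.uniform(0.0, 1.0, N)     # i.i.d. points with distribution p = mu/mu(S)
    counts[r] = np.count_nonzero((pts >= B_lo) & (pts < B_hi))

print(counts.mean(), counts.var())     # both close to mu(B) = 2 for a Poisson count
```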
One important immediate consequence of the definition in (7.3) is the Mapping Theorem. Given a Poisson process $\Pi$ on $S$ with the mean measure $\mu$ and a measurable map $f : S \to S'$ into another measurable space $(S', \mathcal{S}')$, let us consider the image set
\[
f(\Pi) = \bigcup_{n \ge 1} \big\{ f(X_{n1}), \ldots, f(X_{n N_n}) \big\}.
\]
Since the random variables $f(X_{n\ell})$ have the distribution $p_n \circ f^{-1}$ on $\mathcal{S}'$, the set $f(\Pi)$ resembles the definition (7.3) corresponding to the measure
\[
\sum_{n \ge 1} \mu_n(S)\, p_n \circ f^{-1} = \sum_{n \ge 1} \mu_n \circ f^{-1} = \mu \circ f^{-1}.
\]
Therefore, to conclude that $f(\Pi)$ is a Poisson process on $S'$ we only need to require that this image measure is non-atomic.

Theorem 7.1 (Mapping Theorem). If $\Pi$ is a Poisson process on $S$ with the mean measure $\mu$ and $f : S \to S'$ is such that $\mu \circ f^{-1}$ is non-atomic, then $f(\Pi)$ is a Poisson process on $S'$ with the mean measure $\mu \circ f^{-1}$.

Next, let us consider a sequence $(\lambda_m)$ of measures that satisfy (7.1), $\lambda_m = \sum_{n \ge 1} \lambda_{mn}$ with $\lambda_{mn}(S) < \infty$. Let $\Pi_m = \cup_{n \ge 1} \Pi_{mn}$ be a Poisson process with the mean measure $\lambda_m$ defined as in (7.3), and suppose that all these Poisson processes are generated independently over $m \ge 1$. Since
\[
\lambda := \sum_{m \ge 1} \lambda_m = \sum_{m \ge 1} \sum_{n \ge 1} \lambda_{mn} = \sum_{\ell \ge 1} \lambda_{m(\ell) n(\ell)}
\]
and $\Pi := \cup_{m \ge 1} \Pi_m = \cup_{\ell \ge 1} \Pi_{m(\ell) n(\ell)}$ for any enumeration of the pairs $(m, n)$ by the indices $\ell \in \mathbb{N}$, we immediately get the following.

Theorem 7.2 (Superposition Theorem). If $\Pi_m$ for $m \ge 1$ are independent Poisson processes with the mean measures $\lambda_m$ then their superposition $\Pi = \cup_{m \ge 1} \Pi_m$ is a Poisson process with the mean measure $\lambda = \sum_{m \ge 1} \lambda_m$.

Another important property of Poisson processes is the Marking Theorem. Consider another measurable space $(S', \mathcal{S}')$ and let $K : S \times \mathcal{S}' \to [0,1]$ be a transition function (a probability kernel), which means that for each $s \in S$, $K(s,\cdot)$ is a probability measure on $\mathcal{S}'$ and, for each $A \in \mathcal{S}'$, $K(\cdot,A)$ is a measurable function on $(S, \mathcal{S})$. For each point $s \in P$, let us generate a point $m(s) \in S'$ from the distribution $K(s,\cdot)$, independently for different points $s$. The point $m(s)$ is called a marking of $s$, and the random subset of $S \times S'$,
$$
P^* = \bigl\{ (s, m(s)) : s \in P \bigr\}, \qquad (7.5)
$$
is called a marked Poisson process. In other words, $P^* = \bigcup_{n\ge1} P^*_n$, where
$$
P^*_n = \bigl\{ (X_{n1}, m(X_{n1})), \ldots, (X_{nN_n}, m(X_{nN_n})) \bigr\},
$$
and each point $(X_{n\ell}, m(X_{n\ell})) \in S \times S'$ is generated according to the distribution
$$
p^*_n(C) = \iint_C K(s, ds')\, p_n(ds).
$$
Since $p^*_n$ is obviously non-atomic, by definition, this means that $P^*$ is a Poisson process on $S \times S'$ with the mean measure
$$
\sum_{n\ge1} \mu_n(S)\, p^*_n(C) = \sum_{n\ge1} \iint_C K(s, ds')\, \mu_n(ds) = \iint_C K(s, ds')\, \mu(ds).
$$
Therefore, the following holds.

Theorem 7.3 (Marking Theorem). The random subset $P^*$ in (7.5) is a Poisson process on $S \times S'$ with the mean measure
$$
\mu^*(C) = \iint_C K(s, ds')\, \mu(ds). \qquad (7.6)
$$
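A concrete special case of marking is location-dependent thinning: marks in $\{0,1\}$ with $K(s,\{1\}) = s$. The sketch below (our own illustration; the intensity $6$ is an arbitrary choice) checks that the points marked $1$ then form a Poisson count with mean $\mu^*\bigl((0,1)\times\{1\}\bigr) = \int_0^1 s\,\lambda\,ds = \lambda/2$.

```python
import numpy as np

rng = np.random.default_rng(2)

lam = 6.0  # mean measure lam * Lebesgue on S = (0, 1)

# Mark each point s with m(s) ~ Bernoulli(K(s, {1})), where K(s, {1}) = s.
# By the Marking Theorem, the pairs (s, m(s)) form a Poisson process on
# S x {0, 1}; in particular the number of points marked 1 is Poisson with
# mean equal to the integral of s * lam ds over (0, 1), i.e. lam / 2.
counts = []
for _ in range(20000):
    s = rng.uniform(0.0, 1.0, size=rng.poisson(lam))
    marks = rng.random(s.size) < s   # Bernoulli(s) mark for each point s
    counts.append(int(marks.sum()))
counts = np.array(counts)

print(counts.mean())  # close to lam / 2 = 3.0
print(counts.var())   # close to 3.0
```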

The above results give us several ways to generate a new Poisson process from the old one. However, in each case, the new process is generated in a way that depends on a particular representation of its mean measure. We will now show that, when the measure $\mu$ is $\sigma$-finite, the random set $P$ can be generated in a way that, in some sense, does not depend on the particular representation (7.1). This point is very important because, often, given a Poisson process with one representation of its mean measure, we study its properties using another, more convenient, representation. Suppose that $S$ is equal to a disjoint union $\bigcup_{m\ge1} S_m$ of sets such that $0 < \mu(S_m) < \infty$, in which case
$$
\mu = \sum_{m\ge1} \mu|_{S_m} \qquad (7.7)
$$
is another representation of the type (7.1), where $\mu|_{S_m}$ is the restriction of $\mu$ to the set $S_m$. The following holds.

Theorem 7.4 (Equivalence Theorem). The process $P$ in (7.3) generated according to the representation (7.1) is statistically indistinguishable from a process generated using the representation (7.7).

More precisely, we will show the following. Let us denote
$$
p_{S_m}(B) = \frac{\mu|_{S_m}(B)}{\mu|_{S_m}(S)} = \frac{\mu(B \cap S_m)}{\mu(S_m)}. \qquad (7.8)
$$
We will show that, given the Poisson process $P$ in (7.3) generated according to the representation (7.1), the cardinalities $N(S_m) = |P \cap S_m|$ are independent random variables with the distributions $\mathcal{P}(\mu(S_m))$ and, conditionally on $(N(S_m))_{m\ge1}$, each set $P \cap S_m$ "looks like" an i.i.d. sample of size $N(S_m)$ from the distribution $p_{S_m}$. This means that, if we arrange the points in $P \cap S_m$ in a random order, the resulting vector has the same distribution as an i.i.d. sample from the measure $p_{S_m}$. Moreover, conditionally on $(N(S_m))_{m\ge1}$, these vectors are independent over $m\ge1$. This is exactly how one would generate a Poisson process using the representation (7.7). The proof of Theorem 7.4 will be based on the following two properties, which usually appear as the definition of a Poisson process on $S$ with the mean measure $\mu$:

(i) for any $A \in \mathcal{S}$, the cardinality $N(A) = |P \cap A|$ has the Poisson distribution $\mathcal{P}(\mu(A))$ with the mean $\mu(A)$;
(ii) the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, for any $k\ge1$ and any disjoint sets $A_1, \ldots, A_k \in \mathcal{S}$.

When $\mu(A) = \infty$, it is understood that $P \cap A$ is countably infinite. Usually, one starts with (i) and (ii) as the definition of a Poisson process, and the set $P$ constructed in (7.3) is used to demonstrate the existence of such processes, which explains the name of the following theorem.

Theorem 7.5 (Existence Theorem). The process $P$ in (7.3) satisfies the properties (i) and (ii).

Proof. Given $x \ge 0$, let us denote the weights of the Poisson distribution $\mathcal{P}(x)$ with the mean $x$ by
$$
p_j(x) = \mathcal{P}(x)\{j\} = \frac{x^j}{j!}\, e^{-x}.
$$
Consider disjoint sets $A_1, \ldots, A_k$ and let $A_0 = (\bigcup_{i\le k} A_i)^c$ be the complement of their union. Fix $m_1, \ldots, m_k \ge 0$ and let $m = m_1 + \cdots + m_k$. Given any set $A$, denote $N_n(A) = |P_n \cap A|$. With this notation, let us compute the probability of the event
$$
\Omega = \bigl\{ N_n(A_1) = m_1, \ldots, N_n(A_k) = m_k \bigr\}.
$$
Recall that the random variables $(X_{n\ell})_{\ell\ge1}$ are i.i.d. with the distribution $p_n$ defined in (7.2) and, therefore, conditionally on $N_n$ in (7.3), the cardinalities $(N_n(A_\ell))_{0\le\ell\le k}$ have a multinomial distribution and we can write
$$
\begin{aligned}
\mathbb{P}(\Omega) &= \sum_{j\ge0} \mathbb{P}\bigl(\Omega \mid N_n = m+j\bigr)\, \mathbb{P}(N_n = m+j) \\
&= \sum_{j\ge0} \mathbb{P}\bigl(\{N_n(A_0) = j\} \cap \Omega \mid N_n = m+j\bigr)\, p_{m+j}\bigl(\mu_n(S)\bigr) \\
&= \sum_{j\ge0} \frac{(m+j)!}{j!\, m_1! \cdots m_k!}\, p_n(A_0)^j\, p_n(A_1)^{m_1} \cdots p_n(A_k)^{m_k}\, \frac{\mu_n(S)^{m+j}}{(m+j)!}\, e^{-\mu_n(S)} \\
&= \sum_{j\ge0} \frac{1}{j!\, m_1! \cdots m_k!}\, \mu_n(A_0)^j\, \mu_n(A_1)^{m_1} \cdots \mu_n(A_k)^{m_k}\, e^{-\mu_n(S)} \\
&= \frac{\mu_n(A_1)^{m_1}}{m_1!}\, e^{-\mu_n(A_1)} \cdots \frac{\mu_n(A_k)^{m_k}}{m_k!}\, e^{-\mu_n(A_k)}.
\end{aligned}
$$
This means that the cardinalities $N_n(A_\ell)$ for $1 \le \ell \le k$ are independent random variables with the distributions $\mathcal{P}(\mu_n(A_\ell))$. Since all measures $p_n$ are non-atomic, $\mathbb{P}(X_{nj} = X_{mk}) = 0$ for any $(n,j) \neq (m,k)$. Therefore, the sets $P_n$ are all disjoint with probability one and
$$
N(A) = |P \cap A| = \sum_{n\ge1} |P_n \cap A| = \sum_{n\ge1} N_n(A).
$$
First of all, the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, since we showed that $N_n(A_1), \ldots, N_n(A_k)$ are independent for each $n\ge1$, and it remains to show that $N(A)$ has the Poisson distribution with the mean $\mu(A) = \sum_{n\ge1} \mu_n(A)$. The partial sum $S_m = \sum_{n\le m} N_n(A)$ has the Poisson distribution with the mean $\sum_{n\le m} \mu_n(A)$ and, since $S_m \uparrow N(A)$, for any integer $r \ge 0$ the probability $\mathbb{P}(N(A) \le r)$ equals
$$
\lim_{m\to\infty} \mathbb{P}(S_m \le r) = \lim_{m\to\infty} \sum_{j\le r} p_j\Bigl(\sum_{n\le m} \mu_n(A)\Bigr) = \sum_{j\le r} p_j\bigl(\mu(A)\bigr).
$$
If $\mu(A) < \infty$, this shows that $N(A)$ has the distribution $\mathcal{P}(\mu(A))$. If $\mu(A) = \infty$ then $\mathbb{P}(N(A) \le r) = 0$ for all $r \ge 0$, which implies that $N(A)$ is countably infinite. □

Proof (of Theorem 7.4). First, suppose that $\mu(S) < \infty$ and that the Poisson process $P$ was generated as in (7.3) using some sequence of measures $(\mu_n)$. By Theorem 7.5, the cardinality $N = N(S)$ has the distribution $\mathcal{P}(\mu(S))$. Let us show that, conditionally on the event $\{N = n\}$, the set $P$ "looks like" an i.i.d. sample of size $n$ from the distribution $p(\cdot) = \mu(\cdot)/\mu(S)$ or, more precisely, if we randomly assign labels $\{1,\ldots,n\}$ to the points in $P$, the resulting vector $(X_1,\ldots,X_n)$ has the same distribution as an i.i.d. sample from $p$. Let us consider some measurable sets $B_1,\ldots,B_n \in \mathcal{S}$ and compute
$$
\mathbb{P}_n\bigl(X_1 \in B_1, \ldots, X_n \in B_n\bigr) = \mathbb{P}\bigl(X_1 \in B_1, \ldots, X_n \in B_n \mid N = n\bigr), \qquad (7.9)
$$
where we denoted by $\mathbb{P}_n$ the conditional probability given the event $\{N = n\}$. To use Theorem 7.5, we will reduce this case to the case of disjoint sets as follows. Let $A_1,\ldots,A_k$ be the partition of the space $S$ generated by the sets $B_1,\ldots,B_n$, obtained by taking intersections of the sets $B_\ell$ or their complements over $\ell \le n$. Then, for each $\ell \le n$, we can write
$$
I(x \in B_\ell) = \sum_{j\le k} \delta_{\ell j}\, I(x \in A_j),
$$
where $\delta_{\ell j} = 1$ if $B_\ell \supseteq A_j$ and $\delta_{\ell j} = 0$ otherwise. In terms of these sets, (7.9) can be rewritten as
$$
\begin{aligned}
\mathbb{P}_n\bigl(X_1 \in B_1, \ldots, X_n \in B_n\bigr)
&= \mathbb{E}_n \prod_{\ell\le n} \sum_{j\le k} \delta_{\ell j}\, I(X_\ell \in A_j) \qquad (7.10) \\
&= \sum_{j_1,\ldots,j_n \le k} \delta_{1 j_1} \cdots \delta_{n j_n}\, \mathbb{P}_n\bigl(X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n}\bigr).
\end{aligned}
$$
If we fix $j_1,\ldots,j_n$ and, for $j \le k$, let $I_j = \{\ell \le n : j_\ell = j\}$, then
$$
\mathbb{P}_n\bigl(X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n}\bigr) = \mathbb{P}_n\bigl(X_\ell \in A_j \text{ for all } j \le k,\ \ell \in I_j\bigr).
$$
If $n_j = |I_j|$ then the last event can be expressed in words by saying that, for each $j \le k$, we observe $n_j$ points of the random set $P$ in the set $A_j$ and then assign labels in $I_j$ to the points $P \cap A_j$. By Theorem 7.5, the probability to observe $n_j$ points in each set $A_j$, given that $N = n$, equals
$$
\mathbb{P}_n\bigl(N(A_j) = n_j,\ j \le k\bigr) = \frac{\mathbb{P}\bigl(N(A_j) = n_j,\ j \le k\bigr)}{\mathbb{P}(N = n)}
= \prod_{j\le k} \frac{\mu(A_j)^{n_j}}{n_j!}\, e^{-\mu(A_j)} \Bigm/ \frac{\mu(S)^n}{n!}\, e^{-\mu(S)},
$$
while the probability to randomly assign the labels in $I_j$ to the points in $A_j$ for all $j \le k$ is equal to $\prod_{j\le k} n_j! / n!$. Therefore,
$$
\mathbb{P}_n\bigl(X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n}\bigr) = \prod_{j\le k} \Bigl(\frac{\mu(A_j)}{\mu(S)}\Bigr)^{n_j} = \prod_{\ell\le n} p(A_{j_\ell}). \qquad (7.11)
$$
If we plug this into (7.10), we get
$$
\mathbb{P}_n\bigl(X_1 \in B_1, \ldots, X_n \in B_n\bigr)
= \sum_{j_1\le k} \delta_{1 j_1}\, p(A_{j_1}) \cdots \sum_{j_n\le k} \delta_{n j_n}\, p(A_{j_n})
= p(B_1) \cdots p(B_n).
$$
Therefore, conditionally on $\{N = n\}$, $X_1, \ldots, X_n$ are i.i.d. with the distribution $p$.

Now, suppose that $\mu$ is $\sigma$-finite and (7.7) holds. Consider the random sets in (7.3) and define $N(S_m) = |P \cap S_m|$. By Theorem 7.5, the random variables $(N(S_m))_{m\ge1}$ are independent and have the Poisson distributions $\mathcal{P}(\mu(S_m))$, and we would like to show that, conditionally on $(N(S_m))_{m\ge1}$, each set $P \cap S_m$ can be generated as a sample of size $N(S_m)$ from the distribution $p_{S_m} = \mu|_{S_m}(\cdot)/\mu(S_m)$, independently over $m$. This can be done by considering finitely many $m$ at a time and using exactly the same computation leading to (7.11), based on the properties proved in Theorem 7.5, with each subset $A \subseteq S_m$ producing a factor $p_{S_m}(A)$. □
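The generation recipe behind Theorem 7.4 is short enough to write down directly. The sketch below (our own toy example; the measure, the partition into three unit blocks, and the function name are choices for illustration) generates $P$ block by block as described, and checks property (i) on a set that straddles two blocks.

```python
import numpy as np

rng = np.random.default_rng(3)

# mu = Lebesgue measure on (0, 3), partitioned into S_m = (m-1, m], m = 1, 2, 3.
# Recipe of Theorem 7.4: draw N(S_m) ~ Poisson(mu(S_m)) independently, then
# N(S_m) i.i.d. points from p_{S_m} = mu|_{S_m} / mu(S_m), and take the union.
def generate_by_blocks(rng):
    points = []
    for m in (1, 2, 3):
        n_m = rng.poisson(1.0)                         # mu(S_m) = 1
        points.append(rng.uniform(m - 1.0, m, n_m))    # i.i.d. from p_{S_m}
    return np.concatenate(points)

# The count over A = (0, 1.5), which straddles two blocks, should still be
# Poisson with mean mu(A) = 1.5.
counts = np.array([np.sum(generate_by_blocks(rng) < 1.5) for _ in range(20000)])
print(counts.mean())  # close to 1.5
print(counts.var())   # close to 1.5
```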
Let us now consider one example of a Poisson process that will be useful below when we talk about $\alpha$-stable subordinators. Consider a parameter $\zeta \in (0,1)$ and let $P$ be a Poisson process on $(0,\infty)$ with the mean measure
$$
\mu(dx) = \zeta x^{-1-\zeta}\, dx. \qquad (7.12)
$$
Clearly, the measure $\mu$ is $\sigma$-finite, which can be seen, for example, by considering the partition $(0,\infty) = \bigcup_{m\ge1} S_m$ with $S_1 = [1,\infty)$ and $S_m = [1/m, 1/(m-1))$ for $m \ge 2$, since
$$
\int_1^\infty \zeta x^{-1-\zeta}\, dx = 1 < \infty.
$$
By Theorem 7.4, each cardinality $N(S_m) = |P \cap S_m|$ has the Poisson distribution with the mean $\mu(S_m)$ and, conditionally on $N(S_m)$, the points in $P \cap S_m$ are i.i.d. with the distribution $\mu|_{S_m}/\mu(S_m)$. Therefore, by Wald's identity,
$$
\mathbb{E} \sum_{x \in P \cap S_m} x = \mathbb{E} N(S_m) \int_{S_m} \frac{x\, \mu(dx)}{\mu(S_m)} = \int_{S_m} x\, \mu(dx)
$$
and, therefore,
$$
\mathbb{E} \sum_{x \in P} x\, I(x < 1) = \int_0^1 x\, \mu(dx) = \int_0^1 \zeta x^{-\zeta}\, dx = \frac{\zeta}{1-\zeta} < \infty. \qquad (7.13)
$$
This means that the sum $\sum_{x\in P} x\, I(x < 1)$ is finite with probability one and, since there are finitely many points in the set $P \cap [1,\infty)$, the sum $\sum_{x\in P} x$ is also finite with probability one.
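Both claims can be seen in simulation. The sketch below (our own illustration; the value $\zeta = 1/2$ and the truncation level are arbitrary choices) samples the points of $P$ lying in $(\varepsilon, \infty)$, where the mass $\mu((\varepsilon,\infty)) = \varepsilon^{-\zeta}$ is finite, and checks the expectation in (7.13); the truncation removes only $O(\varepsilon^{1-\zeta})$ of it.

```python
import numpy as np

rng = np.random.default_rng(4)

# Mean measure mu(dx) = zeta * x^(-1-zeta) dx on (0, infty), as in (7.12).
zeta = 0.5
eps = 1e-6   # truncate at eps: mu((eps, infty)) = eps^(-zeta) is finite

def sample_truncated(rng):
    """Points of P in (eps, infty): a Poisson number of i.i.d. draws.
    Inverse CDF of mu restricted to (eps, infty): x = eps * U^(-1/zeta)."""
    n = rng.poisson(eps ** (-zeta))
    return eps * rng.random(n) ** (-1.0 / zeta)

# E sum_{x in P} x * I(x < 1) should be close to zeta / (1 - zeta) = 1, cf. (7.13).
sums = []
for _ in range(2000):
    pts = sample_truncated(rng)
    sums.append(pts[pts < 1.0].sum())

print(np.mean(sums))  # close to 1.0
```

Each replication also has only a handful of points above $1$, illustrating why the full sum $\sum_{x\in P} x$ is finite almost surely.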
Exercise 7.1.1. Let $P$ be a Poisson process with the mean measure (7.12). Show that $\mathbb{E}\bigl(\sum_{x\in P} x\bigr)^a < \infty$ for all $0 < a < \zeta$.

Exercise 7.1.2. Let $P$ be a Poisson process on $(0,\infty)$ with the mean measure $\lambda\, dx$. If we enumerate the points in $P$ in increasing order, $X_1 < X_2 < \ldots$, show that the increments $X_1, X_2 - X_1, X_3 - X_2, \ldots$ are i.i.d. exponential random variables with the density $\lambda e^{-\lambda x}$ on $(0,\infty)$. Hint: compute the probability for $X_1, \ldots, X_n$ to be in the intervals $[x_1, x_1 + \Delta x_1], [x_2, x_2 + \Delta x_2], \ldots, [x_n, x_n + \Delta x_n]$ for $x_1 < x_2 < \ldots < x_n$.

Exercise 7.1.3. If $P$ is a Poisson process on $(0,\infty)$ with the mean measure (7.12) and $(U_x)_{x\in P}$ are i.i.d. markings uniform on $[0,1]$, compute the mean measure of the process $\{x U_x : x \in P\}$. Is it a Poisson process?

Exercise 7.1.4. If $P$ is a Poisson process on $(0,\infty)$ with the Lebesgue mean measure $dx$ and $\zeta > 0$, show that $\{x^{-1/\zeta} : x \in P\}$ is a Poisson process on $(0,\infty)$ with the mean measure (7.12).

7.2 Sums over Poisson processes

We will now consider random variables of the following type,
$$
S = \sum_{x \in P} f(x), \qquad (7.14)
$$
where $P$ is a Poisson process with the mean measure $\mu$ and $f$ is a real-valued measurable function. Since the points in $P$ are not necessarily ordered, we understand that the series converges if it converges absolutely. If we consider the disjoint subsets $S_+ = \{f > 0\}$ and $S_- = \{f < 0\}$ then $P_+ = P \cap S_+$ and $P_- = P \cap S_-$ are independent Poisson processes and we can decompose $S$ as the difference $S_+ - S_-$ of two independent random variables
$$
S_+ = \sum_{x \in P_+} f(x) = \sum_{x \in P} f^+(x), \qquad
S_- = -\sum_{x \in P_-} f(x) = \sum_{x \in P} f^-(x), \qquad (7.15)
$$
where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$ are the positive and negative parts of $f$.

The distribution of a positive random variable $X \ge 0$ is determined by its Laplace transform
$$
\varphi(t) = \mathbb{E} e^{-tX} \quad \text{for } t \ge 0, \qquad (7.16)
$$
which can be seen as follows. Because $X \ge 0$, the function $z \to \mathbb{E} e^{zX}$ on $\mathbb{C}$ is well defined for $\mathrm{Re}\, z \le 0$ and one can easily check that it is analytic on $\mathrm{Re}\, z < 0$. Therefore, it is uniquely determined by its values on the negative real axis (i.e. the Laplace transform $\varphi(t)$) and, in particular, the characteristic function $\mathbb{E} e^{itX}$ is also uniquely determined by $\varphi(t)$. This means that the distribution of $X \ge 0$ is determined by its Laplace transform.

Theorem 7.6 (Campbell's Theorem I). If $P$ is a Poisson process with the mean measure $\mu$ and $f \ge 0$ then the Laplace transform of $S = \sum_{x\in P} f(x)$ equals
$$
\mathbb{E} e^{-tS} = \exp\Bigl[ -\int \bigl(1 - e^{-t f(x)}\bigr)\, d\mu(x) \Bigr]. \qquad (7.17)
$$
In particular, $S < \infty$ almost surely if and only if
$$
\int \min(f(x), 1)\, d\mu(x) < \infty \qquad (7.18)
$$
and, otherwise, $S = +\infty$ almost surely. Moreover,
$$
\mathbb{E} S = \int f(x)\, d\mu(x), \qquad \mathrm{Var}(S) = \int f(x)^2\, d\mu(x), \qquad (7.19)
$$
finite or infinite. (Of course, the variance is defined only if $\mathbb{E} S < \infty$.)
Proof. First, suppose that $f$ is a simple (step) function taking non-zero values $(f_j)_{j\le k}$ (and possibly zero) and each level set $A_j = \{x : f(x) = f_j\}$ has finite measure, $m_j = \mu(A_j) < \infty$. Then the random variables $N_j = \mathrm{card}(P \cap A_j)$ are independent Poisson with the means $m_j$ for $j \le k$, the sum $S = \sum_{x\in P} f(x) = \sum_{j\le k} f_j N_j$ and, by independence,
$$
\mathbb{E} e^{-tS} = \mathbb{E} e^{-t \sum_{j\le k} f_j N_j} = \prod_{j\le k} \mathbb{E} e^{-t f_j N_j}.
$$
Each factor equals
$$
\mathbb{E} e^{-t f_j N_j} = \sum_{n\ge0} e^{-t f_j n}\, \frac{m_j^n}{n!}\, e^{-m_j}
= e^{-m_j} \sum_{n\ge0} \frac{(m_j e^{-t f_j})^n}{n!}
= \exp\bigl[ m_j (e^{-t f_j} - 1) \bigr],
$$
so
$$
\mathbb{E} e^{-tS} = \exp\Bigl[ \sum_{j\le k} m_j (e^{-t f_j} - 1) \Bigr]
= \exp\Bigl[ \int (e^{-t f(x)} - 1)\, d\mu(x) \Bigr]
= \exp\Bigl[ -\int (1 - e^{-t f(x)})\, d\mu(x) \Bigr].
$$
This proves (7.17) for simple $f \ge 0$. General $f \ge 0$ can be written as a limit of an increasing sequence of simple functions and, using the monotone convergence theorem on both sides, we obtain (7.17) for general $f \ge 0$. Similarly, one can check (7.19) for simple functions directly and obtain the general case by the monotone convergence theorem. Finally, it is obvious that, for $t > 0$, $\int (1 - e^{-t f(x)})\, d\mu(x) < \infty$ if and only if (7.18) holds. In particular, if this integral is infinite then $\mathbb{E} e^{-tS} = 0$, which means that $S = +\infty$ with probability one. If the integral is finite then, by the dominated convergence theorem, $\lim_{t\downarrow0} \int (1 - e^{-t f(x)})\, d\mu(x) = 0$ and, therefore, $\mathbb{P}(S < +\infty) = \lim_{t\downarrow0} \mathbb{E} e^{-tS} = 1$. □
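Campbell's formula lends itself to a direct Monte Carlo check. The sketch below (our own illustration; the intensity $10$ and the choice $f(x) = x^2$ are arbitrary) compares the empirical $\mathbb{E}e^{-tS}$ with the right-hand side of (7.17), and the empirical mean and variance with (7.19).

```python
import numpy as np

rng = np.random.default_rng(5)

lam, t, reps = 10.0, 1.0, 20000
f = lambda x: x ** 2

# S = sum_{x in P} f(x), where P has mean measure lam * Lebesgue on (0, 1).
S = np.array([f(rng.uniform(0, 1, rng.poisson(lam))).sum() for _ in range(reps)])

# Campbell I predicts E e^{-tS} = exp[- integral of (1 - e^{-t f(x)}) lam dx],
# and E S = lam / 3, Var S = lam / 5 for f(x) = x^2 on (0, 1).
grid = (np.arange(100000) + 0.5) / 100000          # midpoint rule on (0, 1)
laplace_exact = np.exp(-lam * np.mean(1 - np.exp(-t * f(grid))))

print(np.mean(np.exp(-t * S)), laplace_exact)  # the two should agree
print(S.mean(), lam / 3)
print(S.var(), lam / 5)
```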

The general case follows from the decomposition (7.15), although we have to replace the Laplace transform by the characteristic function to make sure it is defined for both positive and negative parts.

Theorem 7.7 (Campbell's Theorem II). If $P$ is a Poisson process with the mean measure $\mu$ then the series $S = \sum_{x\in P} f(x)$ is absolutely convergent almost surely if and only if
$$
\int \min(|f(x)|, 1)\, d\mu(x) < \infty, \qquad (7.20)
$$
and in this case the characteristic function of $S$ equals
$$
\mathbb{E} e^{itS} = \exp\Bigl[ \int (e^{it f(x)} - 1)\, d\mu(x) \Bigr]. \qquad (7.21)
$$
Moreover,
$$
\mathbb{E} S = \int f(x)\, d\mu(x), \qquad (7.22)
$$
where the expectation exists if and only if the integral converges. If $\mathbb{E} S$ converges then
$$
\mathrm{Var}(S) = \int f(x)^2\, d\mu(x), \qquad (7.23)
$$
finite or infinite.

Proof. All the statements follow immediately from the decomposition $S = S_+ - S_-$ in (7.15) and the independence of $S_+$ and $S_-$. For (7.21), we use the fact that the formula (7.17) for the Laplace transform, i.e.
$$
\mathbb{E} e^{z S_\pm} = \exp\Bigl[ \int \bigl(e^{z f^\pm(x)} - 1\bigr)\, d\mu(x) \Bigr]
$$
for $z = -t$, implies the same formula for imaginary $z = it$, since both sides are analytic on $\mathrm{Re}\, z < 0$. □

Exercise 7.2.1. If $f_1, f_2 \in L_1(S,\mu) \cap L_2(S,\mu)$, show that
$$
\mathrm{Cov}(S_1, S_2) = \int f_1(x) f_2(x)\, d\mu(x),
$$
where $S_j = \sum_{x\in P} f_j(x)$. Hint: consider the function $f = t_1 f_1 + t_2 f_2$.

Exercise 7.2.2. Let $P$ be a Poisson process on $(0,\infty)$ with the mean measure $\zeta x^{-1-\zeta}\, dx$ for $\zeta \in (0,1)$. Show that the Laplace transform of $\sum_{x\in P} x$ is of the form $\exp(-c t^{\zeta})$ and find the constant $c$.

Exercise 7.2.3. If $\zeta \in (0,1)$, $P$ is a Poisson process on $(0,1)$ with the mean measure $\mu(dx) = x^{-2-\zeta}\, dx$, and $S_m = [1/m, 1/(m-1))$ for $m \ge 2$, show that the series
$$
\sum_{m\ge2} \Bigl( \sum_{x\in P\cap S_m} x - \int_{S_m} x\, d\mu(x) \Bigr)
$$
converges almost surely.



7.3 Infinitely divisible distributions: basic properties

Definition 7.1. A distribution $P$ on $\mathbb{R}$ is called infinitely divisible if, for any $n \ge 1$, there exists a distribution $P_n$ such that $P$ is the $n$-fold convolution of $P_n$, $P = P_n^{*n}$.

In other words, if a random variable $X$ has distribution $P$ and $X_1, \ldots, X_n$ are independent random variables with the distribution $P_n$ then
$$
X \stackrel{d}{=} X_1 + \ldots + X_n.
$$
The motivation for this definition is to understand the so-called Lévy processes, which are stochastic processes $X(c)$ indexed by $c \ge 0$ with $X(0) = 0$, with independent and stationary increments, and continuous in probability. This means:

(a) (Independence of increments) for any $0 \le c_1 < c_2 < \cdots < c_n < \infty$, the increments $X(c_1), X(c_2) - X(c_1), X(c_3) - X(c_2), \ldots, X(c_n) - X(c_{n-1})$ are independent;
(b) (Stationary increments) for any $c' < c$, the increment $X(c) - X(c')$ is equal in distribution to $X(c - c')$;
(c) (Continuity in probability) for any $\varepsilon > 0$ and $c \ge 0$, $\lim_{h\to0} \mathbb{P}(|X(c+h) - X(c)| > \varepsilon) = 0$.

The first two properties imply that the distribution of $X(c)$ for any given $c$ must be infinitely divisible, because $X(c)$ is equal to the sum of i.i.d. increments $X((k+1)c/n) - X(kc/n)$ for $k = 0, \ldots, n-1$.

Example 7.3.1. The most basic examples of infinitely divisible distributions that we already know include the Gaussian and Poisson distributions. Another example is the Cauchy($a$) distribution for $a > 0$ with the density
$$
p(x) = \frac{a}{\pi (x^2 + a^2)} \quad \text{for } x \in \mathbb{R}.
$$
We will leave it as an exercise below to show that the characteristic function of this distribution is $f(t) = e^{-a|t|}$, which implies that the sum of independent Cauchy($a_i$) random variables is Cauchy($\sum a_i$). Another example is the Gamma($\alpha, \beta$) distribution with parameters $\alpha > 0$, $\beta > 0$, with the density
$$
p(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x} \quad \text{for } x \ge 0.
$$
Again, we will leave it as an exercise below to show that the characteristic function of this distribution is $f(t) = (1 - it/\beta)^{-\alpha}$, which implies that the sum of independent Gamma($\alpha_i, \beta$) random variables is Gamma($\sum \alpha_i, \beta$). □

Below we will give a general characterization of infinitely divisible distributions, but we begin with a few basic properties. First, we will show that characteristic functions of infinitely divisible distributions do not vanish, and for this we will need the following lemma.

Lemma 7.1. If $f(t) = \mathbb{E} e^{itX}$ then $1 - \mathrm{Re} f(2t) \le 4(1 - \mathrm{Re} f(t))$. Moreover, $|f(t)|^2$ is also a characteristic function and, therefore, $1 - |f(2t)|^2 \le 4(1 - |f(t)|^2)$.

Proof. The first statement follows from
$$
\begin{aligned}
1 - \mathrm{Re} f(2t) &= \int (1 - \cos 2tx)\, dP(x) = 2 \int \sin^2 tx\, dP(x) \\
&= 2 \int (1 + \cos tx)(1 - \cos tx)\, dP(x) \\
&\le 4 \int (1 - \cos tx)\, dP(x) = 4(1 - \mathrm{Re} f(t)).
\end{aligned}
$$
The second statement follows because $\overline{f(t)} = f(-t)$ is a characteristic function and, therefore, $|f(t)|^2 = f(t)\overline{f(t)}$ is also a characteristic function. □

Theorem 7.8. A characteristic function $f(t) = \int e^{itx}\, dP(x)$ of an infinitely divisible distribution $P$ does not vanish, i.e. $f(t) \neq 0$ for any $t \in \mathbb{R}$.

Proof. Fix any $t \in \mathbb{R}$. Given $n \ge 1$, if $P = P_n^{*n}$ and $f_n$ is the characteristic function of $P_n$ then $f(t) = f_n(t)^n$. Applying the above lemma $k$ times, starting with $t/2^k$ instead of $t$, we can write
$$
1 - |f_n(t)|^2 \le 4^k \bigl(1 - |f_n(t/2^k)|^2\bigr) = 4^k \bigl(1 - |f(t/2^k)|^{2/n}\bigr).
$$
Since $f(0) = 1$ and $f(t)$ is continuous, we can choose $k$ large enough such that $x = |f(t/2^k)| > 0$. Since $x^{2/n} \to 1$ as $n \to \infty$, choosing $n$ large enough we can make the right-hand side above smaller than $1/2$, which implies that $|f_n(t)|^2 > 1/2$ and, therefore, $|f(t)| = |f_n(t)|^n \neq 0$. □

If $r(t) = |f(t)| \neq 0$, since $f(0) = 1$, we can represent $f(t) = r(t) e^{i\theta(t)}$ for some unique continuous $\theta(t)$ such that $\theta(0) = 0$. If $f(t) = f_n(t)^n$ for some continuous $f_n(t)$ with $f_n(0) = 1$, because $r_n(t) := |f_n(t)| = r(t)^{1/n} \neq 0$, we can also represent $f_n(t) = r_n(t) e^{i\theta_n(t)}$ for some unique continuous $\theta_n(t)$ such that $\theta_n(0) = 0$. Clearly, this means that $\theta_n(t) = \theta(t)/n$. In other words, if $f(t)$ is a characteristic function of an infinitely divisible distribution, we can take $f_n(t) = r(t)^{1/n} e^{i\theta(t)/n}$.

Next, let us state without a proof the obvious result that follows immediately from the definition.

Theorem 7.9. A convolution of infinitely divisible distributions is infinitely divisible.

In other words, the sum of independent infinitely divisible random variables is infinitely divisible. In particular, if $f(t)$ is a characteristic function of an infinitely divisible distribution then so is $|f(t)|^2 = f(-t) f(t)$.
Theorem 7.10. If a distribution $P$ is a limit of infinitely divisible distributions $P_k$ then it is also infinitely divisible.

Proof. Let $f$ be the c.f. of $P$, let $f_k$ be the c.f. of $P_k$ and, by infinite divisibility of $P_k$, let $f_{k,n}$ be the c.f. such that $f_k(t) = f_{k,n}(t)^n$. The convergence $P_k \to P$ implies that $f_k(t) \to f(t)$ and, therefore,
$$
\lim_{k\to\infty} |f_{k,n}(t)|^2 = \lim_{k\to\infty} |f_k(t)|^{2/n} = |f(t)|^{2/n}.
$$
Since $|f(t)|^{2/n}$ is continuous and is a limit of characteristic functions $|f_{k,n}(t)|^2$, it is itself a characteristic function. This means that $|f(t)|^2$ is infinitely divisible and, therefore, it does not vanish and $f(t) \neq 0$. As we mentioned above, this means that we can represent $f(t) = r(t) e^{i\theta(t)}$ for some unique continuous $\theta(t)$ with $\theta(0) = 0$, and $f_k(t) = r_k(t) e^{i\theta_k(t)}$ for some unique continuous $\theta_k(t)$ with $\theta_k(0) = 0$. The fact that $r(t) \neq 0$ and $f_k(t) \to f(t)$ implies that
$$
r_k(t) \to r(t) \quad \text{and} \quad e^{i\theta_k(t)} \to e^{i\theta(t)}.
$$
Because the pointwise convergence of characteristic functions implies uniform convergence on compacts (see Exercise 3.2.11), we must have $\theta_k(t) \to \theta(t)$ for all $t$ (exercise below). This implies that
$$
\lim_{k\to\infty} \sqrt[n]{f_k(t)} = \lim_{k\to\infty} r_k(t)^{1/n} e^{i\theta_k(t)/n} = r(t)^{1/n} e^{i\theta(t)/n} = \sqrt[n]{f(t)}.
$$
Therefore, as a continuous limit of characteristic functions, $\sqrt[n]{f(t)}$ is itself a characteristic function of some distribution $Q_n$, which proves that $P = Q_n^{*n}$ and $P$ is infinitely divisible. □

Theorem 7.11. If $f(t)$ is a characteristic function of an infinitely divisible distribution then so is $f(t)^c$ for any $c > 0$.

Proof. If $c = m/n \in \mathbb{Q}$ then, for any $k \ge 1$, $f(t)^{m/n} = (\varphi_k(t))^k$, where $\varphi_k(t) = \bigl(\sqrt[nk]{f(t)}\bigr)^m$. Since $f(t)$ is infinitely divisible, $\sqrt[nk]{f(t)}$ is a characteristic function and its $m$th power, $\varphi_k(t)$, is also a characteristic function. This proves that $f(t)^{m/n}$ is infinitely divisible. Taking a limit $m/n \to c$ and using the previous theorem implies this for all $c > 0$. □

Theorem 7.12. (i) If $(X(c))_{c\ge0}$ is a Lévy process and $f(t)$ is the characteristic function of $X(1)$ then the characteristic function of $X(c)$ is $f(t)^c$.
(ii) If $f(t)$ is a characteristic function of an infinitely divisible distribution then there exists a Lévy process $(X(c))_{c\ge0}$ such that $f(t)$ is the characteristic function of $X(1)$.

Proof. (i) By the properties (a) and (b) of the increments of a Lévy process,
$$
\mathbb{E} e^{itX(m/n)} = \bigl(\mathbb{E} e^{itX(1/n)}\bigr)^m = f(t)^{m/n}.
$$



If $m/n \to c$ then $f(t)^{m/n} \to f(t)^c$ and, by continuity in probability (c) of the Lévy process, $X(m/n) \stackrel{d}{\to} X(c)$. This implies that the characteristic function of $X(c)$ is $f(t)^c$.

(ii) For any
$$
0 \le c_1 < c_2 < \cdots < c_n < \infty,
$$
let us define the distribution of $(X(c_1), \ldots, X(c_n))$ by the conditions that $X(0) = 0$ and the increments $X(c_\ell) - X(c_{\ell-1})$ for $\ell = 1, \ldots, n$ are independent and have characteristic functions $f(t)^{c_\ell - c_{\ell-1}}$. These distributions are consistent because, if we remove a point $c_\ell$, the increment $X(c_{\ell+1}) - X(c_{\ell-1})$ must have the characteristic function $f(t)^{c_{\ell+1} - c_{\ell-1}}$, which equals $f(t)^{c_{\ell+1} - c_\ell} f(t)^{c_\ell - c_{\ell-1}}$, the characteristic function of the sum of $X(c_{\ell+1}) - X(c_\ell)$ and $X(c_\ell) - X(c_{\ell-1})$. By Kolmogorov's consistency theorem, we can define a family of random variables $(X(c))_{c\ge0}$ with these finite-dimensional distributions. The properties (a) and (b) of the Lévy process hold by construction. Since $f(t)^h \to 1$ as $h \downarrow 0$, the increment $X(c+h) - X(c)$ converges in distribution to $0$, which implies the continuity in probability property (c). □

We proved that Lévy processes correspond to infinitely divisible distributions by specifying the distribution of $X(1)$, so we can now focus on studying infinitely divisible distributions.

Example 7.3.2. Let us give another example, which is not obviously infinitely divisible and which follows from the above properties. Given $p \in (0,1)$, consider a geometric distribution $P(\{n\}) = (1-p) p^n$ for integer $n = 0, 1, 2, \ldots$. If $p$ is the probability of failure, this is the distribution of the number of failures before the first success, which occurs with probability $1-p$. To see that this distribution is infinitely divisible, let us compute the characteristic function
$$
f(t) = \sum_{n\ge0} e^{itn} (1-p) p^n = \frac{1-p}{1 - p e^{it}}.
$$
Using the Taylor series for $\log(1-z)$ for $|z| < 1$,
$$
\log f(t) = \log(1-p) - \log(1 - p e^{it}) = \sum_{k\ge1} \frac{p^k}{k}\, (e^{ikt} - 1).
$$
Now, let us recall that if $X$ has the Poisson distribution with the mean $\lambda > 0$ then its characteristic function is
$$
\mathbb{E} e^{itX} = \sum_{k\ge0} e^{itk}\, \frac{\lambda^k}{k!}\, e^{-\lambda} = e^{-\lambda} \sum_{k\ge0} \frac{(\lambda e^{it})^k}{k!} = \exp[\lambda(e^{it} - 1)].
$$
This means that $\exp[\lambda(e^{ict} - 1)]$ for $c \in \mathbb{R}$ is the characteristic function of $cX$. If we denote by $\mathcal{P}(p^k/k)$ Poisson random variables with the means $p^k/k$, independent for $k \ge 1$, then it is easy to check that the series $\sum_{k\ge1} k\, \mathcal{P}(p^k/k)$ converges and its characteristic function is $f(t)$ defined above. In other words, this random series gives a different representation of a geometric random variable. With this representation it is clear that the distribution is infinitely divisible, because the function
$$
f_n(t) = \exp\Bigl[ \sum_{k\ge1} \frac{p^k}{nk}\, (e^{ikt} - 1) \Bigr]
$$
is the characteristic function of the series $\sum_{k\ge1} k\, \mathcal{P}(p^k/(nk))$. □
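The series representation of the geometric distribution is easy to verify by sampling. The sketch below is our own check (the value $p = 1/2$ and the truncation at $k \le 60$ are choices for the example; the omitted tail has total mean below $10^{-20}$): it compares the empirical pmf of $\sum_k k\,\mathcal{P}(p^k/k)$ with $(1-p)p^n$.

```python
import numpy as np

rng = np.random.default_rng(6)

p, K, reps = 0.5, 60, 50000
ks = np.arange(1, K + 1)
means = p ** ks / ks   # Poisson means p^k / k

# G = sum_k k * Poisson(p^k / k): by the computation above, G should have
# the geometric distribution P({n}) = (1 - p) p^n, n = 0, 1, 2, ...
G = (ks * rng.poisson(means, size=(reps, K))).sum(axis=1)

for n in range(4):
    print(np.mean(G == n), (1 - p) * p ** n)  # empirical vs exact pmf
```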
Definition 7.2 (Compound Poisson distributions). If $N$ has the Poisson distribution $\mathcal{P}(\lambda)$ and $(X_i)_{i\ge1}$ are i.i.d. real-valued random variables with the distribution $P$ then the distribution of $\sum_{i=1}^N X_i$ is called a compound Poisson distribution.

Let us compute the characteristic function of a compound Poisson distribution. If $f(t) = \int e^{itx}\, dP(x)$ is the characteristic function of the $X_i$'s then
$$
\mathbb{E} e^{it\sum_{i=1}^N X_i} = \mathbb{E}\bigl[\mathbb{E} e^{it\sum_{i=1}^N X_i} \mid N\bigr] = \mathbb{E} f(t)^N
= e^{-\lambda} \sum_{k\ge0} \frac{(\lambda f(t))^k}{k!}
= \exp[\lambda(f(t) - 1)] = \exp\Bigl[ \lambda \int (e^{itx} - 1)\, dP(x) \Bigr]. \qquad (7.24)
$$
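Formula (7.24) can be checked by direct simulation. In the sketch below (our own illustration; $\lambda = 2$ and standard normal jumps are arbitrary choices, for which $f(t) = e^{-t^2/2}$) the empirical characteristic function of a compound Poisson sample is compared against $\exp[\lambda(f(t)-1)]$.

```python
import numpy as np

rng = np.random.default_rng(7)

lam, t, reps = 2.0, 1.0, 50000

# Compound Poisson: S = X_1 + ... + X_N, N ~ Poisson(lam), X_i i.i.d. N(0, 1).
N = rng.poisson(lam, reps)
S = np.array([rng.standard_normal(n).sum() for n in N])

# (7.24) with f(t) = e^{-t^2/2}, the N(0, 1) characteristic function:
exact = np.exp(lam * (np.exp(-t * t / 2) - 1.0))
empirical = np.mean(np.exp(1j * t * S))
print(empirical.real, exact)  # real parts should agree; imaginary part near 0
```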

We will use this to show the following fundamental fact that will be used in the
next section to derive a representation for general infinitely divisible distributions.
Theorem 7.13. Infinitely divisible distributions are limits of compound Poisson distributions.

Proof. If $f(t)$ is a characteristic function of some infinitely divisible distribution and $f_n(t) = \sqrt[n]{f(t)}$ then
$$
n(f_n(t) - 1) = n\bigl(e^{\log f(t)/n} - 1\bigr)
= n\Bigl(1 + \frac{\log f(t)}{n} + o\Bigl(\frac{\log f(t)}{n}\Bigr) - 1\Bigr) \to \log f(t)
$$
as $n \to \infty$. In other words, if $P_n$ is the distribution with the characteristic function $f_n(t) = \int e^{itx}\, dP_n(x)$ then
$$
f(t) = \lim_{n\to\infty} \exp[n(f_n(t) - 1)] = \lim_{n\to\infty} \exp\Bigl[ n \int (e^{itx} - 1)\, dP_n(x) \Bigr] \qquad (7.25)
$$
is the limit of characteristic functions of compound Poisson distributions corresponding to $\lambda = n$ and the distribution $P_n$. This finishes the proof. □

In the above proof, if $f(t)$ is the characteristic function of the Lévy process $X(c)$ at time $c = 1$ then
$$
f(t)^c = \lim_{n\to\infty} \exp\Bigl[ cn \int (e^{itx} - 1)\, dP_n(x) \Bigr] \qquad (7.26)
$$
is the characteristic function of $X(c)$ for $c \ge 0$. For a fixed $n$, the right-hand side corresponds to the following Lévy process $X_n(c)$. If $P$ is a Poisson process on $(0,\infty)$ with the mean measure $n\, dx$ and $(Y_x)_{x\in P}$ are i.i.d. markings of this process from the distribution $P_n$ then
$$
X_n(c) = \sum_{x\in P} Y_x\, I(x \le c).
$$
It is obvious that its increments are stationary, independent, and $X_n(c) - X_n(c')$ has a compound Poisson distribution with $\lambda = n(c - c')$ and $P = P_n$. In other words, in the sense of finite-dimensional distributions, all Lévy processes are limits of such jump processes, with i.i.d. jumps $Y_x$ occurring at the locations $x$ of a Poisson process $P$.
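The jump process $X_n(c)$ is straightforward to simulate. The sketch below (our own illustration; the intensity $n = 5$ and uniform jumps $Y_x \sim U(0,1)$ are arbitrary choices) checks that $X_n(1)$ has the compound Poisson mean $\lambda\,\mathbb{E}Y$ and variance $\lambda\,\mathbb{E}Y^2$ with $\lambda = n \cdot 1$.

```python
import numpy as np

rng = np.random.default_rng(8)

n, c, reps = 5.0, 1.0, 20000

def X_n(c, rng):
    """Jump process X_n(c): sum of i.i.d. marks Y_x over the Poisson points
    x <= c, where P has intensity n on (0, infty) and Y_x ~ U(0, 1)."""
    k = rng.poisson(n * c)                 # number of Poisson points in (0, c]
    return rng.uniform(0.0, 1.0, k).sum()  # add up their i.i.d. jumps

vals = np.array([X_n(c, rng) for _ in range(reps)])
# Compound Poisson with lam = n * c and jump distribution U(0, 1):
print(vals.mean(), n * c * 0.5)          # lam * E Y = 2.5
print(vals.var(), n * c * (1.0 / 3.0))   # lam * E Y^2 = 5/3
```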

Exercise 7.3.1. Prove that the characteristic function of the Cauchy($a$) distribution is $f(t) = e^{-a|t|}$. Hint: compute the Fourier transform of $e^{-a|t|}$ and use the inverse Fourier transform.

Exercise 7.3.2. Prove that the characteristic function of the Gamma($\alpha, \beta$) distribution is $f(t) = (1 - it/\beta)^{-\alpha}$. Hint: make the change of variables and use a contour integral to reduce back to the integral of the density.

Exercise 7.3.3. In the proof of Theorem 7.10, show that $\theta_k(t) \to \theta(t)$ for all $t$.

Exercise 7.3.4. If $P$ is a Poisson process on $\mathbb{R}$ such that $\sum_{x\in P} |x| < \infty$ a.s., is the distribution of $\sum_{x\in P} x$ infinitely divisible?

7.4 Canonical representations of infinitely divisible distributions

In this section we will derive a general representation for characteristic functions of infinitely divisible distributions. We will begin with the case of nonnegative infinitely divisible random variables $X \ge 0$, which is slightly simpler but which illustrates most of the ideas of the general case. In this case, we will consider the Laplace transform
$$
f(t) = \mathbb{E} e^{-tX} \quad \text{for } t \ge 0, \qquad (7.27)
$$
which determines the distribution of $X \ge 0$.

Theorem 7.14. If $X \ge 0$ is infinitely divisible then
$$
f(t) = \exp\Bigl[ -bt - \int_{(0,\infty)} (1 - e^{-tx})\, d\mu(x) \Bigr], \qquad (7.28)
$$
where $b \ge 0$ and $\mu$ is a measure on $(0,\infty)$ such that $\int_{(0,\infty)} (x \wedge 1)\, d\mu(x) < \infty$. Such a representation is unique, and all such $f(t)$ correspond to some infinitely divisible distributions on $\mathbb{R}_+$.
Proof. First of all, given such $f(t)$, the existence of an infinitely divisible distribution follows from Campbell's theorem because, if $P$ is a Poisson process on $(0,\infty)$ with the mean measure $\mu$, then
$$
X = b + \sum_{x\in P} x
$$
has the distribution $P$ with the Laplace transform $f(t)$. Notice that the condition $\int_{(0,\infty)} (x \wedge 1)\, d\mu(x) < \infty$ coincides with the one in Campbell's theorem and ensures that $X < \infty$ almost surely. For any $n \ge 1$, if $P_n$ is a Poisson process on $(0,\infty)$ with the mean measure $\mu/n$ then the distribution $P_n$ of $b/n + \sum_{x\in P_n} x$ satisfies $P = P_n^{*n}$, by the superposition property of Poisson processes, or by looking at the Laplace transforms. In other words, all $f(t)$ as in (7.28) are Laplace transforms of a sum over a Poisson process on $(0,\infty)$ plus a constant.

Now, suppose that $X \ge 0$ is infinitely divisible, which means that, for all $n \ge 1$, there exists a distribution $P_n$ on $\mathbb{R}_+ = [0,\infty)$ such that $P = P_n^{*n}$, which implies that
$$
f(t) = \mathbb{E} e^{-tX} = f_n(t)^n, \quad \text{where } f_n(t) = \int_0^\infty e^{-tx}\, dP_n(x).
$$
As in (7.25) above, this implies that
$$
f(t) = \lim_{n\to\infty} \exp\Bigl[ -n \int_0^\infty (1 - e^{-tx})\, dP_n(x) \Bigr].
$$
First of all, if we take $t = 1$ then this implies that, for large enough $n$,
$$
n \int_0^\infty (1 - e^{-x})\, dP_n(x) \le 1 - \log f(1).
$$
One can check that $1 - e^{-x} \ge (1 - e^{-1})(x \wedge 1)$ and, therefore, for large enough $n$,
$$
n \int_0^\infty (x \wedge 1)\, dP_n(x) \le \frac{1 - \log f(1)}{1 - e^{-1}} < \infty.
$$
If we consider the measures $G_n$ on $[0,\infty)$ defined by
$$
dG_n(x) := n (x \wedge 1)\, dP_n(x) \qquad (7.29)
$$
then $G_n(\mathbb{R}_+) \le K < \infty$ for all $n$, i.e. these measures are uniformly bounded.

Next, we show that these measures are also uniformly tight. For $t \le 1$, we can write
$$
G_n([1/t, \infty)) = n P_n([1/t, \infty))
\le \frac{n}{1 - e^{-1}} \int_{1/t}^\infty (1 - e^{-tx})\, dP_n(x)
\le \frac{n}{1 - e^{-1}} \int_0^\infty (1 - e^{-tx})\, dP_n(x) \to \frac{-\log f(t)}{1 - e^{-1}}.
$$
Since $\lim_{t\downarrow0} f(t) = \lim_{t\downarrow0} \mathbb{E} e^{-tX} = 1$, by choosing $t$ small enough we can make the right-hand side above smaller than any $\varepsilon > 0$ so, for large enough $n$, we have $G_n([1/t,\infty)) \le \varepsilon$. This implies that $(G_n)$ is uniformly tight.

By the Selection Theorem, we can choose a subsequence $(n_k)_{k\ge1}$ such that $G_{n_k} \to G$ weakly for some finite measure $G$ on $\mathbb{R}_+$. On the one hand,
$$
\int_0^\infty \frac{1 - e^{-tx}}{x \wedge 1}\, dG_{n_k}(x) = n_k \int_0^\infty (1 - e^{-tx})\, dP_{n_k}(x) \to -\log f(t)
$$
as $k \to \infty$. On the other hand, if we consider the function $h(t,x) = (1 - e^{-tx})/(x \wedge 1)$ for $x > 0$ and extend it by continuity to zero, $h(t,0) = t$, then the weak convergence $G_{n_k} \to G$ implies that
$$
\int_0^\infty \frac{1 - e^{-tx}}{x \wedge 1}\, dG_{n_k}(x) \to \int_0^\infty h(t,x)\, dG(x)
$$
as $k \to \infty$, which shows that
$$
f(t) = \exp\Bigl[ -\int_0^\infty h(t,x)\, dG(x) \Bigr].
$$
Notice that, although $G_n(\{0\}) = 0$, in the limit we might have $G(\{0\}) = b \ge 0$, so
$$
f(t) = \exp\Bigl[ -bt - \int_{(0,\infty)} \frac{1 - e^{-tx}}{x \wedge 1}\, dG(x) \Bigr].
$$
Finally, if we consider the measure $d\mu(x) := dG(x)/(x \wedge 1)$ on $(0,\infty)$ then $\int_{(0,\infty)} (x \wedge 1)\, d\mu(x) = G((0,\infty)) < \infty$ and the representation (7.28) holds.

To prove the uniqueness of such a representation, let us write
$$
-\log f(t) = bt + \int_{(0,\infty)} (1 - e^{-tx})\, d\mu(x),
$$
from which one can easily check that, for $t \ge 1$,
$$
-\log f(t) + \frac{1}{2} \int_{t-1}^{t+1} \log f(s)\, ds
= \int_{(0,\infty)} e^{-tx} \Bigl( \frac{\sinh(x)}{x} - 1 \Bigr)\, d\mu(x)
= \int_{(0,\infty)} e^{-(t-1)x}\, p(x)\, d\mu(x),
$$
where $p(x) = e^{-x}(\sinh(x)/x - 1)$. Since $0 \le p(x) \le 1$ for $x > 0$ and $p(x) = O(x^2)$ near the origin, the assumption $\int_{(0,\infty)} (x \wedge 1)\, d\mu(x) < \infty$ implies that
$$
\int_{(0,\infty)} p(x)\, d\mu(x) < \infty.
$$
Therefore, the measure $\nu$ defined by $d\nu(x) := p(x)\, d\mu(x)$ is a positive bounded measure on $(0,\infty)$ and the above representation for $f(t)$ uniquely determines
$$
\int_{(0,\infty)} e^{-(t-1)x}\, d\nu(x) = \int_{(0,\infty)} e^{-(t-1)x}\, p(x)\, d\mu(x)
$$
for $t \ge 1$. In other words, we know the Laplace transform of a finite measure $\nu$ on $(0,\infty)$, which uniquely determines $\nu$ and, therefore, $\mu$. Once we know $\mu$, we know $b$, which proves that the above representation is unique. □

Next, we will consider the case of general infinitely divisible random variables
and derive the canonical Lévy-Khintchine representation of their characteristic
functions.

Theorem 7.15 (Lévy-Khintchine representation). If $X$ is infinitely divisible then
\[
\varphi(t) = \mathbb{E} e^{itX}
= \exp\Bigl[\, it\gamma - \frac{t^2\sigma^2}{2}
+ \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{itx} - 1 - itx\, I(|x|\le 1)\bigr)\, d\mu(x) \Bigr], \tag{7.30}
\]
where $\gamma\in\mathbb{R}$, $\sigma \ge 0$ and $\mu$ is a measure on $\mathbb{R}\setminus\{0\}$ such that
$\int_{\mathbb{R}\setminus\{0\}} (x^2\wedge 1)\, d\mu(x) < \infty$. The representation is unique.

The measure $\mu$ is called the Lévy measure of $X$. We could instead take $\mu$ to be a measure
on $\mathbb{R}$ and set $\mu(\{0\}) = 0$ in order to still guarantee uniqueness. Notice also that in
$I(|x|\le 1)$ the constant $1$ can be replaced by any other positive constant, which will
simply result in transferring part of the integral into the constant $\gamma$. We will prove
in the next theorem that every $\varphi(t)$ of the form (7.30) corresponds to some infinitely
divisible distribution, in order to emphasize the role played by the various terms in the
representation. For symmetric infinitely divisible random variables, we will give
another concrete representation for the random variables in the exercises below.
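Before the proof, it may help to see (7.30) in the simplest concrete case. For $X \sim \mathrm{Poisson}(\lambda)$ one has $\varphi(t) = \exp(\lambda(e^{it}-1))$, which matches (7.30) with $\sigma = 0$, $\mu = \lambda\delta_1$ and $\gamma = \lambda$ (the compensator $itx\,I(|x|\le 1)$ at $x = 1$ is offset by $it\gamma$). A small numerical sketch of this identity (the rate $\lambda = 1.5$ is arbitrary):

```python
import numpy as np
from math import factorial

lam = 1.5
t = np.linspace(-10.0, 10.0, 201)

# Characteristic function directly from the pmf: E e^{itX} = sum_k e^{itk} P(X = k).
ks = np.arange(60)
pmf = np.exp(-lam) * lam**ks / np.array([factorial(k) for k in ks], dtype=float)
cf_direct = (pmf[:, None] * np.exp(1j * np.outer(ks, t))).sum(axis=0)

# Levy-Khintchine form (7.30): mu = lam * delta_1, sigma = 0, gamma = lam, so the
# integral contributes lam * (e^{it} - 1 - it) and it*gamma contributes it*lam.
cf_lk = np.exp(1j * t * lam + lam * (np.exp(1j * t) - 1.0 - 1j * t))

assert np.max(np.abs(cf_direct - cf_lk)) < 1e-10
```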

Proof. We saw in (7.25) that, if $P_n$ is the distribution with the characteristic function
$\varphi_n(t) = \varphi(t)^{1/n}$ then
\[
\varphi(t) = \lim_{n\to\infty} \exp\Bigl[\, n \int (e^{itx}-1)\, dP_n(x) \Bigr]
\]
and, comparing the absolute values,
\[
\lim_{n\to\infty} n \int (1-\cos tx)\, dP_n(x) = -\log|\varphi(t)|.
\]
Using this with $t = 1$ we get that
\[
n \int_{|x|\le 1} (1-\cos x)\, dP_n(x) \le n \int (1-\cos x)\, dP_n(x) \le 1 - \log|\varphi(1)|
\]
for large enough $n$ and, since $1-\cos x \ge x^2/3$ for $|x|\le 1$, we get that
\[
n \int_{|x|\le 1} x^2\, dP_n(x) \le K < \infty. \tag{7.31}
\]

Next, let us consider the average of the above limit over $t\in[0,\delta]$. Since
\[
\frac{1}{\delta}\int_0^\delta \Bigl[\, n\int (1-\cos tx)\, dP_n(x) \Bigr]\, dt
= n \int \Bigl(1 - \frac{\sin \delta x}{\delta x}\Bigr)\, dP_n(x),
\]
we have
\[
\lim_{n\to\infty} n \int \Bigl(1 - \frac{\sin \delta x}{\delta x}\Bigr)\, dP_n(x)
= -\frac{1}{\delta}\int_0^\delta \log|\varphi(t)|\, dt,
\]
which is bounded because $|\varphi(t)| \ne 0$ for any infinitely divisible distribution. Since
$\sin x / x \le 1$ and $1 - \sin x / x \ge 1/2$ for $|x| \ge 1$,
\[
n P_n(|x| \ge 1/\delta)
\le 2n \int_{|x|\ge 1/\delta} \Bigl(1 - \frac{\sin\delta x}{\delta x}\Bigr)\, dP_n(x)
\le -\frac{2}{\delta}\int_0^\delta \log|\varphi(t)|\, dt + \varepsilon
\]
for $n$ large enough. First of all, for $\delta = 1$ this implies that $nP_n(|x|\ge 1)$ is bounded
by a constant for large enough $n$ and, combining this with (7.31), proves that
\[
n \int (x^2\wedge 1)\, dP_n(x) \le K' < \infty. \tag{7.32}
\]

If we consider the measure $G_n$ on $\mathbb{R}$ defined by
\[
dG_n(x) := n(x^2\wedge 1)\, dP_n(x) \tag{7.33}
\]
then $G_n(\mathbb{R}) \le K'' < \infty$ for all $n$, i.e. the total masses of these measures are uniformly
bounded. Next, since $|\varphi(t)| \approx 1$ for small $t$, we can choose $\delta$ above small enough
so that
\[
G_n(|x|\ge 1/\delta) = nP_n(|x|\ge 1/\delta) \le 2\varepsilon
\]

for large enough n. This means that the sequence of measures (Gn ) is uniformly
tight.
By the Selection Theorem, we can choose a subsequence $(n_k)_{k\ge 1}$ such that
$G_{n_k}\to G$ weakly for some finite measure $G$ on $\mathbb{R}$. On the one hand,
\[
\int (e^{itx}-1)\, \frac{1}{x^2\wedge 1}\, dG_{n_k}(x)
= n_k \int (e^{itx}-1)\, dP_{n_k}(x) \to \log\varphi(t)
\]
as $k\to\infty$. On the other hand, in contrast with the case of positive infinitely divisible
random variables above, the function $(e^{itx}-1)/(x^2\wedge 1)$ blows up at $x = 0$ and cannot
be extended by continuity to zero. To fix this, let us take any $c > 0$ and rewrite
\[
\int (e^{itx}-1)\,\frac{1}{x^2\wedge 1}\, dG_{n_k}(x)
= \int \bigl(e^{itx}-1-itx\, I(|x|\le c)\bigr)\,\frac{1}{x^2\wedge 1}\, dG_{n_k}(x)
+ it \int x\, I(|x|\le c)\,\frac{1}{x^2\wedge 1}\, dG_{n_k}(x).
\]
The function $h(x) = \bigl(e^{itx}-1-itx\,I(|x|\le c)\bigr)/(x^2\wedge 1)$ can be extended by continuity
to zero, $h(0) = -t^2/2$, in which case its only points of discontinuity are $x = c, -c$.
However, if we choose $c > 0$ such that $c$ and $-c$ are points of continuity of the
limiting measure $G$ then the weak convergence $G_{n_k}\to G$ implies that
\[
\int h(x)\, dG_{n_k}(x) \to \int h(x)\, dG(x)
\]
as $k\to\infty$. Since the sum of the two integrals above also converges (to $\log\varphi(t)$),
this forces the second integral to converge,
\[
it \int x\, I(|x|\le c)\,\frac{1}{x^2\wedge 1}\, dG_{n_k}(x) \to it\gamma_c
\]
for some $\gamma_c \in \mathbb{R}$. We proved that
\[
\varphi(t) = \exp\Bigl[\, it\gamma_c + \int \bigl(e^{itx}-1-itx\,I(|x|\le c)\bigr)\,\frac{1}{x^2\wedge 1}\, dG(x)\Bigr].
\]
Changing the indicator $I(|x|\le c)$ to $I(|x|\le 1)$ will only result in changing the term
$it\gamma_c$ to $it\gamma$ for some $\gamma\in\mathbb{R}$, so
\[
\varphi(t) = \exp\Bigl[\, it\gamma + \int \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\,\frac{1}{x^2\wedge 1}\, dG(x)\Bigr].
\]
As before, although $G_n(\{0\}) = 0$, in the limit we might have $G(\{0\}) = \sigma^2 \ge 0$, so
\[
\varphi(t) = \exp\Bigl[\, it\gamma - \frac{t^2\sigma^2}{2}
+ \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\,\frac{1}{x^2\wedge 1}\, dG(x)\Bigr].
\]
Finally, if we consider the measure

\[
d\mu(x) := \frac{dG(x)}{x^2\wedge 1} \tag{7.34}
\]
on $\mathbb{R}\setminus\{0\}$ then $\int_{\mathbb{R}\setminus\{0\}} (x^2\wedge 1)\, d\mu(x) = G(\mathbb{R}\setminus\{0\}) < \infty$ and the representation
(7.30) holds.
To prove the uniqueness of such a representation, one can easily check that
\[
\log\varphi(t) - \frac{1}{2}\int_{t-1}^{t+1} \log\varphi(s)\, ds
= \int e^{itx}\Bigl(1-\frac{\sin x}{x}\Bigr)\frac{1}{x^2\wedge 1}\, dG(x)
= \int e^{itx}\, dF(x),
\]
where we defined the measure $F$ on $\mathbb{R}$ by
\[
dF(x) = \Bigl(1-\frac{\sin x}{x}\Bigr)\frac{1}{x^2\wedge 1}\, dG(x).
\]
One can see that the prefactor in front of $dG(x)$ is strictly positive (when extended
by continuity to $1/6$ at $x = 0$) and bounded, so $F$ is a positive finite measure.
Since its characteristic function $\int e^{itx}\, dF(x)$ determines $F$, we showed that $\varphi(t)$
determines $F$, and therefore $G$, uniquely. Once we know $G$, we know $\sigma^2 = G(\{0\})$
and $\mu$, which proves that the representation is unique. $\square$

Theorem 7.16. All functions $\varphi(t)$ of the form (7.30) correspond to some infinitely
divisible distributions.

Proof. We will represent the random variable $X$ as a sum of independent components
whose characteristic functions correspond to the various terms in (7.30). The
term $it\gamma$, of course, corresponds to a constant $\gamma$, and the term $-t^2\sigma^2/2$ corresponds
to a Gaussian $N(0,\sigma^2)$ random variable. Next, we break the integral in (7.30) into
two integrals,
\[
\int_{\mathbb{R}\setminus[-1,1]} (e^{itx}-1)\, d\mu(x)
\quad\text{and}\quad
\int_I (e^{itx}-1-itx)\, d\mu(x), \tag{7.35}
\]
where $I = [-1,1]\setminus\{0\}$. By (7.24), the first integral corresponds to the compound
Poisson distribution with $\lambda = \mu(\mathbb{R}\setminus[-1,1])$ and
\[
dP(x) = I\bigl(x\in\mathbb{R}\setminus[-1,1]\bigr)\, \frac{d\mu(x)}{\mu(\mathbb{R}\setminus[-1,1])}.
\]
All of these terms so far correspond to infinitely divisible distributions, and it remains
to understand why the exponential of the second integral in (7.35) is a characteristic function
of some infinitely divisible distribution.

Let us decompose the measure $\mu$ on $I$ into the sum $\mu_1+\mu_2$ of a purely atomic
component $\mu_1$ and a continuous component $\mu_2$. Let $x_k$ for $k\ge 1$ be the atoms of $\mu$
with $\mu(\{x_k\}) = \Delta_k > 0$, enumerated in an arbitrary order. Then
\[
\exp\Bigl[\int_I (e^{itx}-1-itx)\, d\mu_1(x)\Bigr]
= \exp\Bigl[\sum_{k\ge 1} \bigl((e^{itx_k}-1)\Delta_k - itx_k\Delta_k\bigr)\Bigr] \tag{7.36}
\]

is the characteristic function of the random series
\[
S_1 = \sum_{k\ge 1} \bigl(x_k N_k - x_k\Delta_k\bigr),
\]
where $N_k$ are independent $\mathrm{Poisson}(\Delta_k)$. The series converges almost surely because
\[
\mathbb{E}(x_k N_k - x_k\Delta_k) = 0, \qquad \mathrm{Var}(x_k N_k) = x_k^2\Delta_k,
\]
and $\sum_{k\ge 1} x_k^2\Delta_k \le \int_I x^2\, d\mu < \infty$. It is clear that the series is infinitely divisible.
Next, let $I = \bigcup_{m\ge 1} S_m$ be a partition of $I = [-1,1]\setminus\{0\}$ into disjoint intervals
with endpoints not equal to $0$. Then we can write
\[
\exp\int_I (e^{itx}-1-itx)\, d\mu_2(x)
= \exp\sum_{m\ge 1} \int_{S_m} (e^{itx}-1-itx)\, d\mu_2(x)
= \exp\sum_{m\ge 1}\Bigl(\int_{S_m} (e^{itx}-1)\, d\mu_2(x) - it\int_{S_m} x\, d\mu_2(x)\Bigr), \tag{7.37}
\]
because on each interval $S_m$ (separated away from zero) the measure $\mu_2$ is finite
and, thus, both integrals are well defined. Let $\Pi$ be a Poisson process with the
mean measure $\mu_2$ on $I$, and consider the random series
\[
S_2 = \sum_{m\ge 1}\Bigl(\sum_{x\in\Pi\cap S_m} x - \int_{S_m} x\, d\mu_2(x)\Bigr).
\]
By Campbell's theorem,
\[
\mathbb{E}\sum_{x\in\Pi\cap S_m} x = \int_{S_m} x\, d\mu_2(x),
\qquad
\mathrm{Var}\sum_{x\in\Pi\cap S_m} x = \int_{S_m} x^2\, d\mu_2(x),
\]
so the sum of variances is again bounded by $\int_I x^2\, d\mu < \infty$ and, therefore, the above
series $S_2$ converges almost surely. Again, using Campbell's theorem, one can see
that its characteristic function equals (7.37). Since the series is obviously infinitely
divisible, this finishes the proof. $\square$
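The decomposition used in this proof (a drift, a Gaussian part, a compound Poisson part for the large jumps, and compensated small jumps) is easy to probe numerically. The sketch below checks only the compound Poisson building block: if $S = \sum_{j\le N} J_j$ with $N\sim\mathrm{Poisson}(\lambda)$ and i.i.d. jumps $J_j$, then $\mathbb{E}e^{itS} = \exp\bigl(\lambda(\mathbb{E}e^{itJ}-1)\bigr)$, as in (7.24). This is a Monte Carlo check with arbitrary choices $\lambda = 1$ and jumps uniform on $\{-1,+1\}$, so the tolerance is statistical:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.0, 200_000

# N[i] ~ Poisson(lam); given N[i], the sum of N[i] iid +-1 jumps is
# 2 * Binomial(N[i], 1/2) - N[i].
N = rng.poisson(lam, size=n)
plus = rng.binomial(N, 0.5)
S = 2 * plus - N

# Compound Poisson characteristic function: E e^{itJ} = cos t for J uniform on {-1, +1}.
t = 1.0
empirical = np.mean(np.exp(1j * t * S))
theory = np.exp(lam * (np.cos(t) - 1.0))
assert abs(empirical - theory) < 0.02
```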

Next, we will show a simple corollary of the uniqueness of the Lévy-Khintchine
representation that will be useful in one of the exercises below.

Lemma 7.2. If a characteristic function $\varphi(t)$ can be represented as
\[
\varphi(t) = \mathbb{E}e^{itX}
= \exp\Bigl[\, it\gamma - \frac{t^2\sigma^2}{2}
+ \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\, d\mu(x)\Bigr], \tag{7.38}
\]
where $\gamma\in\mathbb{R}$, $\sigma\ge 0$ and $\mu$ is a signed measure on $\mathbb{R}\setminus\{0\}$ such that $\mu = \mu_1 - \mu_2$
for some measures $\mu_1, \mu_2$ on $\mathbb{R}\setminus\{0\}$ satisfying $\int_{\mathbb{R}\setminus\{0\}} (x^2\wedge 1)\, d\mu_j(x) < \infty$, then such a
representation is unique.

This means that if $\varphi(t)$ can be represented in such a way and $\mu$ is not a positive
measure (i.e. the component $\mu_2$ is non-trivial) then $\varphi(t)$ is not infinitely divisible.

Proof. Suppose we have two such representations with parameters $(\gamma_1, \sigma_1^2, \mu_1^{(1)}, \mu_2^{(1)})$
and $(\gamma_2, \sigma_2^2, \mu_1^{(2)}, \mu_2^{(2)})$. Then, rearranging the terms,
\[
it\gamma_1 - \frac{t^2\sigma_1^2}{2}
+ \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\, d(\mu_1^{(1)}+\mu_2^{(2)})(x)
= it\gamma_2 - \frac{t^2\sigma_2^2}{2}
+ \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\, d(\mu_1^{(2)}+\mu_2^{(1)})(x).
\]
Both sides are of the form corresponding to a characteristic function of an infinitely
divisible distribution and we proved above that we must have $\gamma_1 = \gamma_2$, $\sigma_1^2 = \sigma_2^2$ and
$\mu_1^{(1)}+\mu_2^{(2)} = \mu_1^{(2)}+\mu_2^{(1)}$, i.e. $\mu^{(1)} = \mu_1^{(1)}-\mu_2^{(1)} = \mu_1^{(2)}-\mu_2^{(2)} = \mu^{(2)}$. $\square$
Exercise 7.4.1. Consider $a, b\in(0,1)$ and the function
\[
\varphi(t) = \frac{1-b}{1+a}\cdot\frac{1+ae^{-it}}{1-be^{it}}.
\]
Find the distribution for which $\varphi(t)$ is a characteristic function, and show that it is
not infinitely divisible. Hint: compute $\log\varphi(t)$ and use Lemma 7.2.
Exercise 7.4.2. In the setting of the previous exercise, suppose that $a\le b$ and show
that $|\varphi(t)|^2$ is an infinitely divisible characteristic function. (Together with the previous
exercise this implies that the sum or difference of two non-infinitely divisible
random variables can be infinitely divisible.)
Exercise 7.4.3. If in Theorem 7.16 the measure $\mu$ satisfies $\int_{\mathbb{R}} (|x|\wedge 1)\, d\mu(x) < \infty$,
what is the random variable corresponding to such a c.f. $\varphi(t)$?
Exercise 7.4.4. Show that if $X$ is an infinitely divisible random variable with c.f.
(7.30) then, for $p > 0$, $\mathbb{E}|X|^p < \infty$ if and only if $\int_{\{|x|>1\}} |x|^p\, d\mu(x) < \infty$. Hint: Consider
the decomposition in the proof of Theorem 7.16 and prove that the existence
of the $p$th moment is determined by the compound Poisson component corresponding
to the first integral in (7.35), because the last component (call it $Y$) corresponding
to the second integral in (7.35) has the moment generating function
\[
\mathbb{E}e^{zY} = \exp\int_{[-1,1]} (e^{zx}-1-zx)\, d\mu(x)
\]
for all $z\in\mathbb{C}$.
Exercise 7.4.5. Suppose that $X$ is an infinitely divisible random variable with the
c.f. (7.30) with $\sigma^2 = 0$ (i.e. without the Gaussian component) and suppose that the
distribution of $X$ is symmetric. Show that, in this case,
\[
\varphi(t) = \mathbb{E}e^{itX} = \exp\int_{\mathbb{R}} \bigl(\cos(tx)-1\bigr)\, d\mu(x). \tag{7.39}
\]

Exercise 7.4.6 (Rosinski's representation). Let $\lambda$ be the Lebesgue measure on
$(0,\infty)$ and $P$ be some probability measure on $\mathbb{R}$. Consider a measurable function
$G : (0,\infty)\times\mathbb{R}\to\mathbb{R}$ and suppose that the image measure $\mu = (\lambda\times P)\circ G^{-1}$ satisfies
$\int_{\mathbb{R}} (x^2\wedge 1)\, d\mu(x) < \infty$. Suppose that $(X_n)_{n\ge 1}$ is an increasing enumeration
of the Poisson process with the mean measure $\lambda$ on $(0,\infty)$, $(Y_n)_{n\ge 1}$ are i.i.d. random
variables with the distribution $P$, and $(\varepsilon_n)_{n\ge 1}$ are i.i.d. symmetric $\pm 1$-valued
random variables, and suppose all these sequences are independent of each other.
Show that the series
\[
S = \sum_{n\ge 1} \varepsilon_n G(X_n, Y_n)
\]
converges almost surely, and $S$ is a symmetric infinitely divisible random variable
with the c.f. (7.39).

Hint: Consider a Poisson process $(X_n, Y_n, \varepsilon_n)$ and use Campbell's theorem to
compute the c.f. of $S$. Be careful: we do not assume absolute convergence of
the above series, while in Campbell's theorem we do. In other words, the condition
$\int_{\mathbb{R}} (x^2\wedge 1)\, d\mu(x) < \infty$ is too weak to apply Campbell's theorem directly, so restrict
the Poisson process $X_n$ to a large interval $[0, T]$ first.

7.5 α-stable distributions and Lévy processes

Definition 7.3. An $\alpha$-stable distribution, for $\alpha\in(0,2)$, is an infinitely divisible
distribution that has a characteristic function of the form
\[
\varphi(t) = \exp\Bigl[\, it\gamma + \int_{\mathbb{R}} \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\, p(x)\, dx\Bigr], \tag{7.40}
\]
where $\gamma\in\mathbb{R}$ and where
\[
p(x) = \bigl(m_1 I(x<0) + m_2 I(x>0)\bigr)\frac{1}{|x|^{1+\alpha}}
\]
for arbitrary $m_1, m_2 \ge 0$. $\square$
Comparing with the general representation (7.30), $\sigma^2 = 0$ and $d\mu(x) = p(x)\,dx$.
Observe that the condition $\int (x^2\wedge 1)\, d\mu(x) < \infty$ is equivalent to $0 < \alpha < 2$ if at least
one of $m_1, m_2$ is strictly positive. If $\gamma = 0$ and $m_1 = m_2 = m > 0$, the distribution is
called symmetric $\alpha$-stable, in which case
\[
\int \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\, p(x)\, dx
= \int \bigl(e^{itx}-1-itx\,I(|x|\le 1)\bigr)\frac{m}{|x|^{1+\alpha}}\, dx
\]
\[
= \int (\cos tx - 1)\frac{m}{|x|^{1+\alpha}}\, dx
= |t|^\alpha \int (\cos y - 1)\frac{m}{|y|^{1+\alpha}}\, dy
= -c|t|^\alpha,
\]
where we made the change of variables $y = tx$ and where $c > 0$. Thus, the characteristic
function of a symmetric $\alpha$-stable distribution is of the form $\exp(-c|t|^\alpha)$. One
can similarly explicitly express the characteristic function in the non-symmetric
case as well.
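The scaling $|t|^\alpha$ in this exponent can be checked numerically. The sketch below (assuming SciPy is available; $\alpha = 1.5$ and $m = 1$ are arbitrary choices) evaluates $-\int (\cos tx - 1)\, m|x|^{-1-\alpha}\, dx$ by quadrature after the substitution $x = u^2$, which keeps the integrand bounded near the origin, and verifies that doubling $t$ multiplies the exponent by $2^\alpha$:

```python
import numpy as np
from scipy.integrate import quad

alpha, m = 1.5, 1.0

def neg_exponent(t):
    # -2 * integral over (0, 200] of (cos(tx) - 1) * m * x^{-1-alpha} dx, with the
    # substitution x = u^2 (so x^{-1-alpha} dx = 2 u^{-1-2*alpha} du); the truncated
    # tail beyond x = 200 is negligible at this tolerance.
    val, _ = quad(lambda u: (np.cos(t * u * u) - 1.0) * m * 2.0 * u**(-1.0 - 2.0 * alpha),
                  0.0, np.sqrt(200.0), limit=500)
    return -2.0 * val   # the extra factor 2 comes from the symmetry x -> -x

ratio = neg_exponent(2.0) / neg_exponent(1.0)
assert abs(ratio - 2.0**alpha) < 0.05
```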
The following lemma expresses the stability property of α-stable distributions.

Lemma 7.3. If $X, X_1, \ldots, X_n$ are i.i.d. $\alpha$-stable random variables then
\[
X_1 + \ldots + X_n \stackrel{d}{=} n^{1/\alpha} X + K_n \tag{7.41}
\]
for some constant $K_n$ that depends on $n$ and the parameters of the distribution. If
$\gamma = 0$ and $m_1 = m_2$ then $K_n = 0$.

Proof. If $\varphi(t)$ in (7.40) is the c.f. of $X$ then $\varphi(tc)$ is the c.f. of $cX$, which is equal
to the exponential of
\[
itc\gamma + \int \bigl(e^{itcx}-1-itcx\,I(|x|\le 1)\bigr)\, p(x)\, dx
\stackrel{\{y=cx\}}{=} itc\gamma + |c|^\alpha \int \bigl(e^{ity}-1-ity\,I(|y|\le c)\bigr)\, p(y)\, dy
\]
\[
= itL + |c|^\alpha \int \bigl(e^{ity}-1-ity\,I(|y|\le 1)\bigr)\, p(y)\, dy,
\]

where the constant


Z
L = cg + |c|a y I(|y|  1) I(|y|  c) p(y) dy.

If g = 0 and m1 = m2 then L = 0, since we integrate an odd function. On the other


hand, the c.f. of X1 + . . . + Xn is f (t)n , which is the exponential of
Z
itng + n eitx 1 itx I(|x|  1) p(x) dx.

Therefore, if we take c = n1/a and Kn = ng L, we see that the characteristic


functions coincide, which finishes the proof. t
u

Next, we will give a representation of symmetric α-stable random variables
and the corresponding Lévy processes via so-called α-stable subordinators.

Definition 7.4. If $\Pi$ is a Poisson process on $(0,\infty)^2$ with the mean measure $d\mu(x)\,dy$
for some measure $\mu$ such that $\int_0^\infty \min(x,1)\, d\mu(x) < \infty$, then the process
\[
\tau(t) = \sum_{(x,y)\in\Pi} x\, I(y\le t) \quad\text{for } t\ge 0 \tag{7.42}
\]
is called a subordinator. $\square$

By Campbell's theorem, this process is well-defined because
\[
\iint \min\bigl(x\, I(y\le t), 1\bigr)\, d\mu(x)\, dy = t\int_0^\infty \min(x,1)\, d\mu(x) < \infty.
\]

The process $\tau(t)$ is non-decreasing and right-continuous by our choice of the non-strict
inequality $y\le t$ in the definition.

Also, notice that the increments of this process are independent, because they
are defined in terms of the points of our Poisson process on disjoint subsets of
$(0,\infty)^2$, and also stationary, since the Lebesgue measure $dy$ is translation invariant,
so $\tau(t)$ is a non-negative Lévy process. In fact, any non-negative Lévy process is
of this form, because any non-negative infinitely divisible random variable is a sum
over a Poisson process on $(0,\infty)$.

To visualize what $\tau(t)$ looks like, let us restrict ourselves to $t\in[0,1]$. Since $dy$ is
a probability measure on $[0,1]$, we can first consider a Poisson process $\pi$ on $(0,\infty)$
with the mean measure $\mu$ and then for each $x\in\pi$ take a marking $y_x$ uniformly on
$[0,1]$, independently for all $x$. By the Marking theorem, the resulting collection of points
$(x, y_x)$ is equivalent to the Poisson process $\Pi$ restricted to $(0,\infty)\times[0,1]$. The set of
markings $y_x$ is either finite, if $\mu((0,\infty)) < \infty$, or countably infinite, if $\mu((0,\infty)) = \infty$, in
which case it is a countable dense subset of $[0,1]$ with probability one. The process
$\tau(t) = \sum_{x\in\pi} x\, I(y_x\le t)$ has jumps of size $x$ at the times $t = y_x$ in this finite or countable
dense collection.
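This jump picture translates directly into a simulation. The sketch below approximates one path on $[0,1]$ for the measure $d\mu(x) = c\,x^{-1-\zeta}\,dx$ (the ζ-stable case defined next), keeping only the jumps of size $x > \varepsilon$: since $\mu((\varepsilon,\infty)) = c\varepsilon^{-\zeta}/\zeta < \infty$, these form a Poisson number of points with i.i.d. sizes, each marked by a uniform jump time. The constants $c = 1$, $\zeta = 1/2$, $\varepsilon = 10^{-4}$ are arbitrary, and discarding the jumps below $\varepsilon$ makes this only a truncation-level approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
c, zeta, eps = 1.0, 0.5, 1e-4

# Number of jumps of size x > eps on the time window [0, 1] is Poisson with mean
# mu((eps, inf)) = c * eps^(-zeta) / zeta; given that number, the sizes are iid with
# normalized tail P(size > x) = (x / eps)^(-zeta), sampled by inverting the c.d.f.
mass = c * eps**(-zeta) / zeta
n_jumps = rng.poisson(mass)
sizes = eps * (1.0 - rng.uniform(size=n_jumps))**(-1.0 / zeta)
times = rng.uniform(size=n_jumps)          # uniform markings y_x on [0, 1]

ts = np.linspace(0.0, 1.0, 501)
tau = np.array([sizes[times <= t].sum() for t in ts])

# By construction the path is nondecreasing (up to float summation noise).
assert np.all(np.diff(tau) >= -1e-9)
assert tau[-1] == sizes.sum()
```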

Definition 7.5. If $\zeta\in(0,1)$ then a ζ-stable subordinator is a subordinator corresponding
to the measure $d\mu(x) = cx^{-1-\zeta}\, dx$ for a constant $c > 0$. $\square$

In this case, by Campbell's theorem, the Laplace transform of $\tau(t)$ is
\[
\mathbb{E}e^{-\lambda\tau(t)}
= \exp\Bigl[-\int_0^\infty\!\!\int_0^\infty \bigl(1-e^{-\lambda x I(y\le t)}\bigr)\, cx^{-1-\zeta}\, dy\, dx\Bigr]
= \exp\Bigl[-ct\int_0^\infty (1-e^{-\lambda x})\, x^{-1-\zeta}\, dx\Bigr]
\]
\[
= \exp\Bigl[-ct\lambda^\zeta \int_0^\infty (1-e^{-x})\, x^{-1-\zeta}\, dx\Bigr]
= \exp\bigl(-c\, a(\zeta)\, t\lambda^\zeta\bigr),
\]
where $a(\zeta) = \int_0^\infty (1-e^{-x})\, x^{-1-\zeta}\, dx$. The fact that this Laplace transform resembles
the characteristic function of a symmetric α-stable distribution allows us to represent
such random variables via a time change of Brownian motion by this subordinator.
If $(W_t)_{t\ge 0}$ is a Brownian motion independent of $\tau(t)$ then $(W_{\tau(t)})_{t\ge 0}$ is a
stationary process with independent symmetric α-stable increments with $\alpha = 2\zeta$.
Indeed,
\[
\mathbb{E}e^{ixW_{\tau(t)}}
= \mathbb{E}\,\mathbb{E}\bigl(e^{ixW_{\tau(t)}} \,\big|\, \tau(t)\bigr)
= \mathbb{E}e^{-x^2\tau(t)/2}
= \exp\Bigl(-\frac{c\, a(\zeta)\, t}{2^\zeta}\, |x|^{2\zeta}\Bigr),
\]
so for any fixed $t$, $W_{\tau(t)}$ has an α-stable distribution with $\alpha = 2\zeta$. We leave the rest
to the following exercise.
Exercise 7.5.1. Show that $W_{\tau(t)}$ has independent stationary α-stable increments
with $\alpha = 2\zeta$.

Exercise 7.5.2. If $\Pi$ is a Poisson process on $(0,\infty)^2$ with the mean measure
$cx^{-1-\zeta}\, dx\, dy$ and, for $(x,y)\in\Pi$, $g_{(x,y)} \sim N(0,x)$ are independent Gaussian markings
with variance $x$, show that the process
\[
X(t) = \sum_{(x,y)\in\Pi} g_{(x,y)}\, I(y\le t)
\]
is the same Lévy process as $(W_{\tau(t)})_{t\ge 0}$ above.


Exercise 7.5.3. Let $\tau(t)$ be a ζ-stable subordinator. Given a nonnegative Borel
function $f\ge 0$ on $(0,\infty)$, let
\[
Z(t) = \int_0^t f(s)\, d\tau(s) = \sum_{(x,y)\in\Pi} x f(y)\, I(y\le t).
\]
Prove that $Z(t) < \infty$ a.s. for all $t\ge 0$ if and only if $\int_0^t f(y)^\zeta\, dy < \infty$ for all $t\ge 0$.

Exercise 7.5.4. In the setting of the previous exercise, assuming that $\int_0^t f(y)^\zeta\, dy <
\infty$ for all $t\ge 0$, compute the Laplace transform of the increments of the process
$Z(t)$ and describe the relationship between $Z(t)$ and $\tau(t)$.

Chapter 8
Exchangeability

8.1 Moment problem and de Finetti’s theorem for coin flips

Let us start by recalling an example from Section 5. Let $(X_i)_{i\ge 1}$ be i.i.d. with the
Bernoulli distribution with probability of success $\theta\in[0,1]$, i.e. $\mathbb{P}_\theta(X_i=1) = \theta$
and $\mathbb{P}_\theta(X_i=0) = 1-\theta$, and let $u : [0,1]\to\mathbb{R}$ be some continuous function on
$[0,1]$. Then, by Theorem 2.6 in Section 5, the Bernstein polynomials
\[
B_n(\theta) := \sum_{k=0}^n u\Bigl(\frac{k}{n}\Bigr)\, \mathbb{P}_\theta\Bigl(\sum_{i=1}^n X_i = k\Bigr)
= \sum_{k=0}^n u\Bigl(\frac{k}{n}\Bigr)\binom{n}{k}\theta^k(1-\theta)^{n-k}
\]
approximate $u(\theta)$ uniformly on $[0,1]$.
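This uniform approximation is easy to observe numerically. A minimal sketch, where the test function $u(x) = |x - 1/2|$ is an arbitrary continuous (non-smooth) choice:

```python
import numpy as np
from math import comb

def bernstein(u, n, theta):
    # B_n(theta) = sum_k u(k/n) * C(n,k) * theta^k * (1-theta)^(n-k)
    k = np.arange(n + 1)
    coeff = np.array([comb(n, j) for j in k], dtype=float)
    th = np.asarray(theta, dtype=float)[..., None]
    return (u(k / n) * coeff * th**k * (1.0 - th)**(n - k)).sum(axis=-1)

u = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 401)
err = {n: np.max(np.abs(bernstein(u, n, grid) - u(grid))) for n in (10, 100, 1000)}

assert err[1000] < err[100] < err[10]   # the uniform error decreases with n
assert err[1000] < 0.02
```

For this non-smooth $u$ the maximal error sits at the kink $\theta = 1/2$ and decays only slowly, which is consistent with the uniform (but not fast) convergence the theorem provides.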


Moment problem. Consider a random variable $X$ taking values in $[0,1]$ and let
$\mu_k = \mathbb{E}X^k$ be its moments. Given a sequence $(c_0, c_1, c_2, \ldots)$, let us define the
sequence of increments by $\Delta c_k = c_{k+1} - c_k$. Then
\[
-\Delta\mu_k = \mu_k - \mu_{k+1} = \mathbb{E}(X^k - X^{k+1}) = \mathbb{E}X^k(1-X),
\]
\[
(-\Delta)(-\Delta\mu_k) = (-\Delta)^2\mu_k = \mathbb{E}X^k(1-X) - \mathbb{E}X^{k+1}(1-X) = \mathbb{E}X^k(1-X)^2
\]
and, by induction,
\[
(-\Delta)^r\mu_k = \mathbb{E}X^k(1-X)^r.
\]
Clearly, $(-\Delta)^r\mu_k \ge 0$ since $X\in[0,1]$. If $u$ is a continuous function on $[0,1]$ and $B_n$
is the corresponding Bernstein polynomial defined above then
\[
\mathbb{E}B_n(X)
= \sum_{k=0}^n u\Bigl(\frac{k}{n}\Bigr)\binom{n}{k}\mathbb{E}X^k(1-X)^{n-k}
= \sum_{k=0}^n u\Bigl(\frac{k}{n}\Bigr)\binom{n}{k}(-\Delta)^{n-k}\mu_k.
\]

Since $B_n$ converges uniformly to $u$, $\mathbb{E}B_n(X)$ converges to $\mathbb{E}u(X)$. Let us define
\[
p_k^{(n)} = \binom{n}{k}(-\Delta)^{n-k}\mu_k \ge 0. \tag{8.1}
\]

By taking $u\equiv 1$, it is easy to see that $\sum_{k=0}^n p_k^{(n)} = 1$, so we can think of $p_k^{(n)}$ as the
distribution
\[
\mathbb{P}\Bigl(X^{(n)} = \frac{k}{n}\Bigr) = p_k^{(n)} \tag{8.2}
\]
of some random variable $X^{(n)}$. We showed that $\mathbb{E}B_n(X) = \mathbb{E}u(X^{(n)}) \to \mathbb{E}u(X)$ for
any continuous function $u$, which means that $X^{(n)}$ converges to $X$ in distribution. In
other words, given finitely many moments of $X$, this construction gives an explicit
approximation of the distribution of $X$.
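This construction can be carried out exactly in rational arithmetic. A small sketch for $X$ uniform on $[0,1]$, where $\mu_k = 1/(k+1)$; for this particular $X$ all the weights turn out to equal $1/(n+1)$, so $X^{(n)}$ is uniform on the grid $\{0, 1/n, \ldots, 1\}$. Exact fractions are used because iterated differencing of floating-point moments loses precision:

```python
from fractions import Fraction
from math import comb

n = 20
mu = [Fraction(1, k + 1) for k in range(n + 1)]   # moments of X ~ Uniform[0, 1]

# p_k^{(n)} = C(n,k) * (-Delta)^{n-k} mu_k, computed by iterated differences:
# (-Delta) a_k = a_k - a_{k+1}; exact rational arithmetic avoids cancellation.
p = [Fraction(0)] * (n + 1)
diffs = mu[:]
for j in range(n + 1):
    p[n - j] = comb(n, n - j) * diffs[n - j]
    diffs = [a - b for a, b in zip(diffs[:-1], diffs[1:])]

assert all(q >= 0 for q in p)
assert sum(p) == 1
# For the uniform distribution every weight equals 1/(n+1) ...
assert all(q == Fraction(1, n + 1) for q in p)
# ... and the second moment of X^{(n)} approximates E X^2 = 1/3: here 1/3 + 1/(6n).
m2 = sum(Fraction(m, n) ** 2 * q for m, q in zip(range(n + 1), p))
assert m2 == Fraction(1, 3) + Fraction(1, 6 * n)
```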
If we want to express the moment $\mu_k$ for a fixed $k$ in terms of the distribution
(8.2), we can write, for $n\ge k$,
\[
\mu_k = \mathbb{E}X^k = \mathbb{E}X^k\bigl(X + (1-X)\bigr)^{n-k}
= \sum_{j=0}^{n-k}\binom{n-k}{j}\mathbb{E}X^{k+j}(1-X)^{n-(k+j)}
\]
\[
= \sum_{m=k}^{n}\binom{n-k}{m-k}\mathbb{E}X^m(1-X)^{n-m}
= \sum_{m=k}^{n}\binom{n-k}{m-k}(-\Delta)^{n-m}\mu_m
= \sum_{m=k}^{n}\binom{n-k}{m-k}\binom{n}{m}^{-1} p_m^{(n)}. \tag{8.3}
\]

Let us note that the part of this formula which is expressed only in terms of the
sequence $(\mu_k)_{k\ge 0}$,
\[
\mu_k = \sum_{m=k}^{n}\binom{n-k}{m-k}(-\Delta)^{n-m}\mu_m, \tag{8.4}
\]
holds for any sequence $(\mu_k)_{k\ge 0}$ and not necessarily a sequence of moments of
some random variable $X\in[0,1]$. For example, for $n = k+1$ this formula (called the
inversion formula) says that $\mu_k = \mu_{k+1} - \Delta\mu_k$, and it can be proved by induction
for all $n\ge k$ (exercise).
for all n k (exercise).
Next, given a sequence $(\mu_k)$, we consider the following question: when is $(\mu_k)$
the sequence of moments of some $[0,1]$-valued random variable $X$? By the above,
it is necessary that
\[
\mu_k \ge 0, \quad \mu_0 = 1 \quad\text{and}\quad (-1)^r\Delta^r\mu_k \ge 0 \ \text{for all } k, r. \tag{8.5}
\]
It turns out that this is also sufficient.

Theorem 8.1 (Hausdorff). There exists a r.v. $X\in[0,1]$ such that $\mu_k = \mathbb{E}X^k$ if and
only if (8.5) holds.
Proof. Given the sequence $(\mu_k)$, let us define $p_k^{(n)}$ by (8.1). Using the inversion
formula (8.4) for $k = 0$, we get
\[
\sum_{m=0}^n p_m^{(n)} = \mu_0 = 1
\]

and, by the assumption (8.5), $p_m^{(n)} \ge 0$. Notice that, by (8.3), for any fixed $k$,
\[
\mu_k = \sum_{m=k}^n \binom{n-k}{m-k}\binom{n}{m}^{-1} p_m^{(n)}
= \sum_{m=k}^n \frac{m(m-1)\cdots(m-k+1)}{n(n-1)\cdots(n-k+1)}\, p_m^{(n)}
\approx \sum_{m=0}^n \Bigl(\frac{m}{n}\Bigr)^k p_m^{(n)}
\]
as $n\to\infty$. If we think of $p^{(n)}$ as the probability distribution on $[0,1]$ that assigns
weight $p_m^{(n)}$ to $m/n$ then the right-hand side is the $k$th moment of $p^{(n)}$. By the
Selection Theorem, $p^{(n)}$ converges to some distribution on $[0,1]$ along some subsequence,
and $\mu_k$ will obviously be the $k$th moment of this limiting distribution,
which finishes the proof. Notice also that, because the moments determine the distribution
on $[0,1]$ uniquely, the entire sequence $p^{(n)}$ converges. $\square$

de Finetti's theorem. As a consequence of the Hausdorff theorem above, we will
now prove the classical de Finetti representation for coin flips. The general case
will be considered in the next section. A sequence $(X_n)_{n\ge 1}$ of $\{0,1\}$-valued random
variables is called exchangeable if, for any $n\ge 1$ and any $x_1,\ldots,x_n\in\{0,1\}$, the
probability
\[
\mathbb{P}(X_1 = x_1, \ldots, X_n = x_n)
\]
depends only on the 'number of successes' $x_1+\ldots+x_n$ and does not depend on
the order of the $1$'s and $0$'s. Another way to say this is that, for any $n\ge 1$ and any
permutation $\pi$ of $\{1,\ldots,n\}$, the distribution of $(X_{\pi(1)},\ldots,X_{\pi(n)})$ does not depend
on $\pi$. Then the following holds.

Theorem 8.2 (de Finetti). There exists a distribution $F$ on $[0,1]$ such that, for any
$n\ge 1$ and any $x_1,\ldots,x_n\in\{0,1\}$,
\[
\mathbb{P}(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 p^k(1-p)^{n-k}\, dF(p),
\]
where $k = x_1+\ldots+x_n$.

In other words, in order to generate such an exchangeable sequence of $0$'s and $1$'s,
we first pick $p\in[0,1]$ from some distribution $F$ and then generate a sequence of
i.i.d. Bernoulli random variables with probability of success $p$.

Proof. Let $\mu_0 = 1$ and, for $k\ge 1$, define
\[
\mu_k = \mathbb{P}(X_1 = 1, \ldots, X_k = 1). \tag{8.6}
\]
We have
\[
\mathbb{P}(X_1=1,\ldots,X_k=1, X_{k+1}=0)
= \mathbb{P}(X_1=1,\ldots,X_k=1) - \mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=1)
= \mu_k - \mu_{k+1} = -\Delta\mu_k.
\]

Next, using exchangeability,
\[
\mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=0,X_{k+2}=0)
= \mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=0) - \mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=0,X_{k+2}=1)
\]
\[
= -\Delta\mu_k - (-\Delta\mu_{k+1}) = (-\Delta)^2\mu_k.
\]
Similarly, by induction,
\[
\mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=0,\ldots,X_n=0) = (-\Delta)^{n-k}\mu_k \ge 0.
\]
By the Hausdorff theorem, $\mu_k = \mathbb{E}X^k$ for some r.v. $X\in[0,1]$ with distribution $F$ and,
therefore,
\[
\mathbb{P}(X_1=1,\ldots,X_k=1,X_{k+1}=0,\ldots,X_n=0)
= (-\Delta)^{n-k}\mu_k
= \mathbb{E}X^k(1-X)^{n-k}
= \int_0^1 p^k(1-p)^{n-k}\, dF(p).
\]
Since, by exchangeability, changing the order of the $1$'s and $0$'s does not affect the
probability, this finishes the proof. $\square$
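The mixture formula can be verified numerically for a concrete mixing distribution. A sketch (assuming SciPy; the choice $F = \mathrm{Beta}(2,3)$ with $n = 10$, $k = 4$ is arbitrary) compares the integral with its closed Beta-function form; note that the answer depends on $x_1,\ldots,x_n$ only through $k$, as exchangeability requires:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

alpha, bet = 2.0, 3.0            # hypothetical mixing distribution F = Beta(2, 3)
n, k = 10, 4

density = lambda p: p**(alpha - 1.0) * (1.0 - p)**(bet - 1.0) / beta_fn(alpha, bet)
prob, _ = quad(lambda p: p**k * (1.0 - p)**(n - k) * density(p), 0.0, 1.0)

# The Beta integral gives the closed form B(k + alpha, n - k + bet) / B(alpha, bet).
exact = beta_fn(k + alpha, n - k + bet) / beta_fn(alpha, bet)
assert abs(prob - exact) < 1e-9
```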

Example 8.1.1 (Polya's urn model). Suppose we have $b$ blue and $r$ red balls in the
urn. We pick a ball randomly and return it together with $c$ balls of the same color. Consider
the random variable
\[
X_i = \begin{cases} 1, & \text{if the $i$th ball picked is blue,} \\ 0, & \text{otherwise.} \end{cases}
\]
The $X_i$'s are not independent but it is easy to check that they are exchangeable. For
example,
\[
\mathbb{P}(bbr) = \frac{b}{b+r}\times\frac{b+c}{b+r+c}\times\frac{r}{b+r+2c}
= \frac{b}{b+r}\times\frac{r}{b+r+c}\times\frac{b+c}{b+r+2c} = \mathbb{P}(brb).
\]
To identify the distribution $F$ in de Finetti's theorem, let us look at its moments $\mu_k$
in (8.6),
\[
\mu_k = \mathbb{P}\bigl(\underbrace{b\ldots b}_{k\ \text{times}}\bigr)
= \frac{b}{b+r}\times\frac{b+c}{b+r+c}\times\cdots\times\frac{b+(k-1)c}{b+r+(k-1)c}.
\]

One can recognize or check that $(\mu_k)$ are the moments of the $\mathrm{Beta}(\alpha,\beta)$ distribution
with the density
\[
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\, I(0\le x\le 1)
\]
with the parameters $\alpha = b/c$, $\beta = r/c$. By de Finetti's theorem, we can generate the
$X_i$'s by first picking $p$ from the distribution $\mathrm{Beta}(b/c, r/c)$ and then generating
i.i.d. Bernoulli $(X_i)$'s with the probability of success $p$. By the strong law of large

numbers, applied conditionally on $p$, the proportion of blue balls picked in the first
$n$ trials,
\[
p_n = \frac{1}{n}(X_1+\ldots+X_n),
\]
will converge to this probability of success $p$, i.e. in the limit it will be random with
the Beta distribution. Recall that this example came up in the exercises in Section
15 on the convergence of martingales, where one showed that
\[
Y_n = \frac{b + (X_1+\ldots+X_n)c}{b+r+nc} = \frac{b + p_n nc}{b+r+nc}
\]
converges almost surely because it is a bounded martingale. Since this limit coincides
with the limit of $p_n$, this means that we have identified its distribution. $\square$

Exercise 8.1.1. Prove the formula (8.4).

Exercise 8.1.2. Compute the moments of the $\mathrm{Beta}(\alpha,\beta)$ distribution.

Exercise 8.1.3. In the setting of de Finetti's theorem above, prove that the limit
$\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i$ exists almost surely and, conditionally on this limit, the sequence
$(X_i)_{i\ge 1}$ is i.i.d.

8.2 General de Finetti and Aldous-Hoover representations

We will begin by proving the classical de Finetti representation for exchangeable
sequences. A sequence $(s_\ell)_{\ell\ge 1}$ is called exchangeable if for any permutation $\pi$ of
finitely many indices we have the equality in distribution
\[
\bigl(s_{\pi(\ell)}\bigr)_{\ell\ge 1} \stackrel{d}{=} \bigl(s_\ell\bigr)_{\ell\ge 1}. \tag{8.7}
\]
The following holds.

Theorem 8.3 (de Finetti). If the sequence $(s_\ell)_{\ell\ge 1}$ is exchangeable then there exists
a measurable function $g : [0,1]^2\to\mathbb{R}$ such that
\[
\bigl(s_\ell\bigr)_{\ell\ge 1} \stackrel{d}{=} \bigl(g(w, u_\ell)\bigr)_{\ell\ge 1}, \tag{8.8}
\]
where $w$ and $(u_\ell)$ are i.i.d. random variables with the uniform distribution on $[0,1]$.
Before we proceed to prove the above results, let us recall the following definition.

Definition 8.1. A measurable space $(\Omega,\mathcal{B})$ is called a Borel space if there exists a
one-to-one function $\varphi$ from $\Omega$ onto a Borel subset $A\subseteq[0,1]$ such that both $\varphi$ and
$\varphi^{-1}$ are measurable. $\square$

Perhaps the most important examples of Borel spaces are complete separable
metric spaces and their Borel subsets (see e.g. Section 13.1 in R.M. Dudley, "Real
Analysis and Probability"). The existence of the isomorphism $\varphi$ automatically implies
that if we can prove the above results in the case when the elements of the
sequence $(s_\ell)$ or array $(s_{\ell,\ell'})$ take values in $[0,1]$ then the same representation results
hold when the elements take values in a Borel space. Similarly, other standard
results for real-valued or $[0,1]$-valued random variables are often automatically extended
to Borel spaces. For example, one can generate any real-valued random
variable as a measurable function of a uniform random variable on $[0,1]$ using the
quantile transform and, therefore, any random element in a Borel space can also
be generated by a function of a uniform random variable on $[0,1]$.
Let us describe another typical measure-theoretic argument that will be used
many times below.

Lemma 8.1 (Coding Lemma). Suppose that a random pair $(X,Y)$ takes values in
the product of a measurable space $(\Omega_1,\mathcal{B}_1)$ and a Borel space $(\Omega_2,\mathcal{B}_2)$. Then
there exists a measurable function $f : \Omega_1\times[0,1]\to\Omega_2$ such that
\[
(X,Y) \stackrel{d}{=} \bigl(X, f(X,u)\bigr), \tag{8.9}
\]
where $u$ is a uniform random variable on $[0,1]$ independent of $X$.


Another way to state this is to say that, conditionally on X, Y can be generated
as a function Y = f (X, u) of X and an independent uniform random variable u on

[0, 1]. Rather than using the Coding Lemma itself we will often use the general
ideas of conditioning and generating random variables as functions of uniform
random variables on [0, 1] that are used in its proof, without repeating the same
argument.

Proof. By the definition of a Borel space, one can easily reduce the general case to
the case when $\Omega_2 = [0,1]$ equipped with the Borel σ-algebra. Since in this case the
regular conditional distribution $\mathrm{Pr}(x,B)$ of $Y$ given $X$ exists, for a fixed $x\in\Omega_1$, we
can define by $F(x,y) = \mathrm{Pr}(x,[0,y])$ the conditional distribution function of $Y$ given
$x$ and by
\[
f(x,u) = F^{-1}(x,u) = \inf\bigl\{ s\in[0,1] : u \le F(x,s) \bigr\}
\]
its quantile transformation. It is easy to see that $f$ is measurable on the product
space $\Omega_1\times[0,1]$ because, for any $t\in\mathbb{R}$,
\[
\bigl\{(x,u) : f(x,u)\le t\bigr\} = \bigl\{(x,u) : u\le F(x,t)\bigr\}
\]
and, by the definition of the regular conditional probability, $F(x,t) = \mathrm{Pr}(x,[0,t])$
is measurable in $x$ for a fixed $t$. If $u$ is uniform on $[0,1]$ then, for a fixed $x\in\Omega_1$,
$f(x,u)$ has the distribution $\mathrm{Pr}(x,\cdot)$, and this finishes the proof. $\square$
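The quantile transformation in this proof is completely concrete. As an illustration (on $\mathbb{R}$ rather than $[0,1]$, which changes nothing essential), suppose that conditionally on $X = x > 0$ the variable $Y$ is exponential with rate $x$, so $F(x,y) = 1 - e^{-xy}$ and $f(x,u) = -\log(1-u)/x$; feeding a fine deterministic grid of $u$-values through $f$ recovers the conditional mean $1/x$:

```python
import numpy as np

# Quantile transform f(x, u) = inf{s : u <= F(x, s)} for F(x, y) = 1 - exp(-x*y).
f = lambda x, u: -np.log(1.0 - u) / x

x = 2.0
u = (np.arange(100_000) + 0.5) / 100_000   # midpoint grid standing in for u ~ U[0, 1]
samples = f(x, u)

# Averaging f(x, u) over u in [0, 1] approximates E[Y | X = x] = 1 / x.
assert abs(samples.mean() - 1.0 / x) < 1e-3
```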

The most important way in which the exchangeability condition (8.7) will be
used is to say that, for any infinite subset $I\subseteq\mathbb{N}$,
\[
\bigl(s_\ell\bigr)_{\ell\in I} \stackrel{d}{=} \bigl(s_\ell\bigr)_{\ell\ge 1}. \tag{8.10}
\]
Let us describe one immediate consequence of this simple observation. If
\[
\mathcal{F}_I = \sigma\bigl(s_\ell : \ell\in I\bigr) \tag{8.11}
\]
is the σ-algebra generated by the random variables $s_\ell$ for $\ell\in I$ then the following
holds.

Lemma 8.2. For any infinite subset $I\subseteq\mathbb{N}$ and $j\notin I$, the conditional expectations
\[
\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr) = \mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\bigr)
\]
almost surely, for any bounded measurable function $f:\mathbb{R}\to\mathbb{R}$.

Proof. First, using the property (8.10) for $I\cup\{j\}$ instead of $I$ implies the equality
in distribution (see Exercise 1.4.3 in Section 1.4),
\[
\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr) \stackrel{d}{=} \mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\bigr),
\]
and, therefore, the equality of the $L^2$-norms,
\[
\bigl\|\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr)\bigr\|_2
= \bigl\|\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\bigr)\bigr\|_2. \tag{8.12}
\]

On the other hand, $I\subseteq\mathbb{N}\setminus\{j\}$ and $\mathcal{F}_I\subseteq\mathcal{F}_{\mathbb{N}\setminus\{j\}}$, which implies that
\[
\mathbb{E}\Bigl[\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr)\,\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\bigr)\Bigr]
= \mathbb{E}\Bigl[\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr)^2\Bigr]
= \bigl\|\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr)\bigr\|_2^2,
\]
since $\mathbb{E}(f(s_j)\,|\,\mathcal{F}_I)$ is $\mathcal{F}_{\mathbb{N}\setminus\{j\}}$-measurable. Expanding the square of the difference and combining
this with (8.12) yields that
\[
\bigl\|\mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_I\bigr) - \mathbb{E}\bigl(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\bigr)\bigr\|_2^2 = 0, \tag{8.13}
\]
which finishes the proof. $\square$

Proof (of Theorem 8.3). Let us take any infinite subset $I\subseteq\mathbb{N}$ such that its complement
$\mathbb{N}\setminus I$ is also infinite. By (8.10), we only need to prove the representation
(8.8) for $(s_\ell)_{\ell\in I}$. First, we will show that, conditionally on $(s_\ell)_{\ell\in\mathbb{N}\setminus I}$, the random
variables $(s_\ell)_{\ell\in I}$ are independent. This means that, given $n\ge 1$, any distinct
$\ell_1,\ldots,\ell_n\in I$, and any bounded measurable functions $f_1,\ldots,f_n:\mathbb{R}\to\mathbb{R}$,
\[
\mathbb{E}\Bigl(\prod_{j\le n} f_j(s_{\ell_j})\,\Big|\,\mathcal{F}_{\mathbb{N}\setminus I}\Bigr)
= \prod_{j\le n}\mathbb{E}\bigl(f_j(s_{\ell_j})\,\big|\,\mathcal{F}_{\mathbb{N}\setminus I}\bigr), \tag{8.14}
\]
or, in other words, for any $A\in\mathcal{F}_{\mathbb{N}\setminus I}$,
\[
\mathbb{E}\, I_A \prod_{j\le n} f_j(s_{\ell_j})
= \mathbb{E}\, I_A \prod_{j\le n}\mathbb{E}\bigl(f_j(s_{\ell_j})\,\big|\,\mathcal{F}_{\mathbb{N}\setminus I}\bigr).
\]
Since $I_A$ and $f_1(s_{\ell_1}),\ldots,f_{n-1}(s_{\ell_{n-1}})$ are $\mathcal{F}_{\mathbb{N}\setminus\{\ell_n\}}$-measurable, we can write
\[
\mathbb{E}\, I_A \prod_{j\le n} f_j(s_{\ell_j})
= \mathbb{E}\, I_A \prod_{j\le n-1} f_j(s_{\ell_j})\,\mathbb{E}\bigl(f_n(s_{\ell_n})\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{\ell_n\}}\bigr)
= \mathbb{E}\, I_A \prod_{j\le n-1} f_j(s_{\ell_j})\,\mathbb{E}\bigl(f_n(s_{\ell_n})\,\big|\,\mathcal{F}_{\mathbb{N}\setminus I}\bigr),
\]

where the second equality follows from Lemma 8.2. This implies that
\[
\mathbb{E}\Bigl(\prod_{j\le n} f_j(s_{\ell_j})\,\Big|\,\mathcal{F}_{\mathbb{N}\setminus I}\Bigr)
= \mathbb{E}\Bigl(\prod_{j\le n-1} f_j(s_{\ell_j})\,\Big|\,\mathcal{F}_{\mathbb{N}\setminus I}\Bigr)\,\mathbb{E}\bigl(f_n(s_{\ell_n})\,\big|\,\mathcal{F}_{\mathbb{N}\setminus I}\bigr)
\]
and (8.14) follows by induction on $n$. Let us also observe that, because of (8.10),
the distribution of the array $(s_\ell, (s_j)_{j\in\mathbb{N}\setminus I})$ does not depend on $\ell\in I$. This implies
that, conditionally on $(s_j)_{j\in\mathbb{N}\setminus I}$, the random variables $(s_\ell)_{\ell\in I}$ are identically distributed
in addition to being independent. The product space $[0,1]^\infty$ with the Borel
σ-algebra generated by the product topology is a Borel space (recall that, equipped
with the usual metric, it becomes a complete separable metric space). Therefore,
in order to conclude the proof, it remains to generate $X = (s_\ell)_{\ell\in\mathbb{N}\setminus I}$ as a function
of a uniform random variable $w$ on $[0,1]$, $X = X(w)$, and then use the argument in
the Coding Lemma 8.1 to generate the sequence $(s_\ell)_{\ell\in I}$ as $(f(X(w), u_\ell))_{\ell\in I}$, where
$(u_\ell)_{\ell\in I}$ are i.i.d. random variables uniform on $[0,1]$. This finishes the proof. $\square$

Comments about de Finetti's theorem. Consider an exchangeable sequence $(\sigma_\ell)_{\ell\ge 1}$ taking values in some complete separable metric space $\Omega$. It is equal in distribution to $(g(w,u_\ell))_{\ell\ge 1}$. For a fixed $w$, this is an i.i.d. sequence from the distribution
\[
\mu_w = \lambda\circ g(w,\,\cdot\,)^{-1},
\]
where $\lambda$ is the Lebesgue measure on $[0,1]$. By the strong law of large numbers for empirical measures (Varadarajan's Theorem 4.5 in Section 4.2),
\[
\mu_w = \lim_{n\to\infty} \frac{1}{n}\sum_{\ell=1}^{n} \delta_{g(w,u_\ell)}.
\]
The limit on the right-hand side is taken in the complete separable metric space $\mathcal{P}(\Omega,\mathcal{B})$ of probability measures on $\Omega$ equipped, for example, with the bounded Lipschitz metric $\beta$ or the Lévy-Prokhorov metric $\rho$, and this limit exists almost surely over $(u_\ell)_{\ell\ge 1}$. This implies that
(i) the limit $\mu=\lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n \delta_{\sigma_\ell}$ exists almost surely;
(ii) $\mu$ is a (random) probability measure on $\Omega$, i.e. a random element in $\mathcal{P}(\Omega,\mathcal{B})$;
(iii) given $\mu$, the sequence $(\sigma_\ell)_{\ell\ge 1}$ is i.i.d. with the distribution $\mu$.
The measure $\mu$ is called the empirical measure of the sequence $(\sigma_\ell)_{\ell\ge 1}$. One can now interpret de Finetti's representation as follows. First, we generate $\mu$ as a function of a uniform random variable $w$ on $[0,1]$, and then generate $\sigma_\ell$ as a function of $\mu$ and i.i.d. random variables $u_\ell$ uniform on $[0,1]$, using the Coding Lemma. Combining the two steps, we generate $\sigma_\ell$ as a function of $w$ and $u_\ell$.
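The two-step picture above is easy to realize numerically. In this hedged toy sketch (the function $g$, the Bernoulli form of $\mu_w$, and all names are illustrative assumptions, not taken from the text), drawing $w$ first and then i.i.d. uniforms $u_\ell$ produces an exchangeable $\{0,1\}$-valued sequence:

```python
import random

# A sketch of de Finetti's two-step generation: first draw w, which fixes
# the mixing measure; then draw i.i.d. uniforms u_l and set sigma_l = g(w, u_l).
def g(w, u):
    # For this toy example the conditional law mu_w is Bernoulli(w).
    return 1 if u < w else 0

def sample_sequence(n, rng):
    w = rng.random()                      # fixes the random measure mu_w
    return [g(w, rng.random()) for _ in range(n)]

rng = random.Random(0)
# Exchangeability check: the empirical frequency of the pattern (1, 0) in
# positions (1, 2) should match that of (0, 1), up to sampling error.
n_trials = 20000
f10 = f01 = 0
for _ in range(n_trials):
    s = sample_sequence(2, rng)
    f10 += (s[0], s[1]) == (1, 0)
    f01 += (s[0], s[1]) == (0, 1)
print(f10 / n_trials, f01 / n_trials)  # the two frequencies agree closely
```

For $w$ uniform on $[0,1]$ both frequencies should be near $E\,w(1-w)=1/6$; conditionally on $w$ the sequence is genuinely i.i.d., which is exactly the content of (iii) above.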

Remark 8.1. There are two equivalent definitions of a random probability measure on a complete separable metric space $\Omega$ with the Borel $\sigma$-algebra $\mathcal{B}$. On the one hand, as above, these are just random elements taking values in the space $\mathcal{P}(\Omega,\mathcal{B})$ of probability measures on $(\Omega,\mathcal{B})$ equipped with the topology of weak convergence or a metric that metrizes weak convergence. On the other hand, we can think of a random measure $\eta$ as a probability kernel, i.e. as a function $\eta=\eta(x,A)$ of a generic point $x\in X$ for some probability space $(X,\mathcal{F},\Pr)$ and a measurable set $A\in\mathcal{B}$, such that for a fixed $x$, $\eta(x,\,\cdot\,)$ is a probability measure and for a fixed $A$, $\eta(\,\cdot\,,A)$ is a measurable function on $(X,\mathcal{F})$. It is well known that these two definitions coincide (see, e.g., Lemma 1.37 and Theorem A2.3 in Kallenberg's "Foundations of Modern Probability").

Example 8.2.1 (de Finetti's representation for $\{0,1\}$-valued sequences). Coming back to the example considered in the previous section, consider an exchangeable sequence $(\sigma_\ell)_{\ell\ge 1}$ of random variables that take values in $\{0,1\}$. Then the empirical measure is a random probability measure on $\{0,1\}$. It is encoded by one (random) parameter $p=\mu(\{1\})\in[0,1]$. To fix $\mu$ means to fix $p$ and, given $p$, the sequence $(\sigma_\ell)_{\ell\ge 1}$ is i.i.d. Bernoulli with the probability of success equal to $p$. If $\eta$ is the distribution of $p$ then, for any $n\ge 1$ and any $e_1,\ldots,e_n\in\{0,1\}$,
\[
P(\sigma_1=e_1,\ldots,\sigma_n=e_n) = \int_{[0,1]} p^k (1-p)^{n-k}\, d\eta(p),
\]
where $k=e_1+\ldots+e_n$. ⊓⊔

Another piece of information that can be deduced from de Finetti's theorem, and is often useful, is the following. Suppose that in addition to the sequence $(\sigma_\ell)_{\ell\ge 1}$ we are given another random element $Z$ taking values in a complete separable metric space, and their joint distribution is not affected by permutations of the sequence,
\[
\bigl(Z,(\sigma_\ell)_{\ell\ge 1}\bigr) \stackrel{d}{=} \bigl(Z,(\sigma_{\pi(\ell)})_{\ell\ge 1}\bigr). \tag{8.15}
\]
Suppose that $\mu$ is the empirical measure of $(\sigma_\ell)_{\ell\ge 1}$.

Theorem 8.4. Conditionally on $\mu$, the sequence $(\sigma_\ell)_{\ell\ge 1}$ is i.i.d. with the distribution $\mu$ and independent of $Z$.

For example, one can deduce Theorem 8.7 in the exercise below from this statement rather directly.

Proof (of Theorem 8.4). If we define $\tau_\ell=(Z,\sigma_\ell)$ then (8.15) implies that $(\tau_\ell)_{\ell\ge 1}$ is exchangeable. Let
\[
\nu = \lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n \delta_{\tau_\ell}
= \lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n \delta_{(Z,\sigma_\ell)}
\]
be the empirical measure of this sequence. Obviously, $\nu=\delta_Z\times\mu$. Since, given $\nu$, the sequence $(Z,\sigma_\ell)$ is i.i.d. from $\nu$, this means that, given $\nu$, the sequence $(\sigma_\ell)$ is i.i.d. from $\mu$. In other words, given $Z$ and $\mu$, the sequence $(\sigma_\ell)$ is i.i.d. from $\mu$. The statement then follows from the following simple lemma, which appears as one of the exercises in Section 1.4. ⊓⊔

Lemma 8.3. Suppose that three random elements $X$, $Y$ and $Z$ take values in some complete separable metric spaces. The conditional distribution of $X$ given the pair $(Y,Z)$ depends only on $Y$ (i.e. is $\sigma(Y)$-measurable) if and only if $X$ and $Z$ are independent conditionally on $Y$.

The Aldous-Hoover representation. We will now prove two analogues of de Finetti's theorem for two-dimensional arrays. Let us consider an infinite random array $s=(s_{\ell,\ell'})_{\ell,\ell'\ge 1}$. The array $s$ is called an exchangeable array if, for any permutations $\pi$ and $\rho$ of finitely many indices, we have equality in distribution,
\[
\bigl(s_{\pi(\ell),\rho(\ell')}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(s_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}, \tag{8.16}
\]
in the sense that their finite-dimensional distributions are equal. Here is one natural example of an exchangeable array. Given a measurable function $\sigma\colon[0,1]^4\to\mathbb{R}$ and sequences of i.i.d. random variables $w$, $(u_\ell)$, $(v_{\ell'})$, $(x_{\ell,\ell'})$ that have the uniform distribution on $[0,1]$, the array
\[
\bigl(\sigma(w,u_\ell,v_{\ell'},x_{\ell,\ell'})\bigr)_{\ell,\ell'\ge 1} \tag{8.17}
\]
is, obviously, exchangeable. It turns out that all exchangeable arrays are of this form.

Theorem 8.5 (Aldous-Hoover). Any infinite exchangeable array $(s_{\ell,\ell'})_{\ell,\ell'\ge 1}$ is equal in distribution to (8.17) for some function $\sigma$.
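The natural example (8.17) is easy to realize concretely. The sketch below (with an arbitrary illustrative $\sigma$, not a function from the text) also makes the reason for exchangeability transparent: permuting the i.i.d. sources $u$, $v$, $x$ permutes the rows and columns of the array without changing the joint distribution of the sources.

```python
import random

# A sketch of the natural example (8.17); the concrete sigma below is an
# arbitrary illustration.
def sigma(w, u, v, x):
    return (w + u * v + x) % 1.0

rng = random.Random(1)
n = 4
w = rng.random()
u = [rng.random() for _ in range(n)]          # one uniform per row
v = [rng.random() for _ in range(n)]          # one uniform per column
x = [[rng.random() for _ in range(n)] for _ in range(n)]

s = [[sigma(w, u[l], v[m], x[l][m]) for m in range(n)] for l in range(n)]

# Why (8.16) holds: applying permutations pi, rho to the sources (u, v, x)
# produces exactly the row/column-permuted array, and the sources are
# i.i.d., hence invariant in distribution under such permutations.
pi, rho = [1, 0, 2, 3], [0, 2, 1, 3]
s_perm = [[sigma(w, u[pi[l]], v[rho[m]], x[pi[l]][rho[m]])
           for m in range(n)] for l in range(n)]
assert all(s_perm[l][m] == s[pi[l]][rho[m]]
           for l in range(n) for m in range(n))
```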
Another version of the Aldous-Hoover representation holds in the symmetric case. A symmetric array $s$ is called weakly exchangeable if, for any permutation $\pi$ of finitely many indices, we have equality in distribution
\[
\bigl(s_{\pi(\ell),\pi(\ell')}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(s_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}. \tag{8.18}
\]
Because of the symmetry, we can consider only half of the array indexed by $(\ell,\ell')$ such that $\ell\le\ell'$ or, equivalently, consider the array indexed by sets $\{\ell,\ell'\}$ and rewrite (8.18) as
\[
\bigl(s_{\{\pi(\ell),\pi(\ell')\}}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(s_{\{\ell,\ell'\}}\bigr)_{\ell,\ell'\ge 1}. \tag{8.19}
\]
Notice that, compared to (8.16), the diagonal elements now play somewhat different roles from the rest of the array. One natural example of a weakly exchangeable array is given by
\[
s_{\{\ell,\ell\}} = g(w,u_\ell) \quad\text{and}\quad s_{\{\ell,\ell'\}} = f\bigl(w,u_\ell,u_{\ell'},x_{\{\ell,\ell'\}}\bigr) \ \text{ for } \ell\ne\ell', \tag{8.20}
\]
for any measurable functions $g\colon[0,1]^2\to\mathbb{R}$ and $f\colon[0,1]^4\to\mathbb{R}$ such that $f$ is symmetric in its two middle coordinates $u_\ell,u_{\ell'}$, and i.i.d. random variables $w$, $(u_\ell)$, $(x_{\{\ell,\ell'\}})$ with the uniform distribution on $[0,1]$. Again, it turns out that such examples cover all possible weakly exchangeable arrays.

Theorem 8.6 (Aldous-Hoover). Any infinite weakly exchangeable array is equal in distribution to the array (8.20) for some functions $g$ and $f$, with $f$ symmetric in its two middle coordinates.
After we give a proof of Theorem 8.6, we will leave a similar proof of Theorem 8.5 as an exercise (with some hints). We will then give another, quite different, proof of Theorem 8.5. The proof of the Dovbysh-Sudakov representation in the next section will be based on Theorem 8.5.
The most important way in which the exchangeability condition (8.18) will be used is to say that, for any infinite subset $I\subseteq\mathbb{N}$,
\[
\bigl(s_{\{\ell,\ell'\}}\bigr)_{\ell,\ell'\in I} \stackrel{d}{=} \bigl(s_{\{\ell,\ell'\}}\bigr)_{\ell,\ell'\ge 1}. \tag{8.21}
\]
Again, one important consequence of this observation will be the following. Given $j,j'\in I$ such that $j\ne j'$, let us now define the $\sigma$-algebra
\[
\mathcal{F}_I(j,j') = \sigma\bigl(s_{\{\ell,\ell'\}} : \ell,\ell'\in I,\ \{\ell,\ell'\}\ne\{j,j'\}\bigr). \tag{8.22}
\]
In other words, this $\sigma$-algebra is generated by all elements $s_{\{\ell,\ell'\}}$ with both indices $\ell$ and $\ell'$ in $I$, excluding $s_{\{j,j'\}}$. The following analogue of Lemma 8.2 holds.
Lemma 8.4. For any infinite subset $I\subseteq\mathbb{N}$ and any $j,j'\in I$ such that $j\ne j'$,
\[
E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j')\bigr] = E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr]
\]
almost surely, for any bounded measurable function $f\colon\mathbb{R}\to\mathbb{R}$.

Proof. The proof is almost identical to the proof of Lemma 8.2. The property (8.21) implies the equality in distribution,
\[
E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j')\bigr] \stackrel{d}{=} E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr],
\]
and, therefore, the equality of the $L^2$-norms,
\[
\bigl\| E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j')\bigr] \bigr\|_2
= \bigl\| E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr] \bigr\|_2.
\]
On the other hand, since $\mathcal{F}_I(j,j')\subseteq\mathcal{F}_{\mathbb{N}}(j,j')$, as in (8.13), this implies that
\[
\bigl\| E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j')\bigr] - E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr] \bigr\|_2^2 = 0,
\]
which finishes the proof. ⊓⊔
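The $L^2$ step in this argument can be checked concretely on a finite probability space, where conditional expectation is an orthogonal projection. In the self-contained numerical sketch below (the sample space, the function and the $\sigma$-algebras are all illustrative assumptions), nested $\sigma$-algebras $\mathcal{G}\subseteq\mathcal{F}$ satisfy $\|E[f|\mathcal{F}]-E[f|\mathcal{G}]\|_2^2=\|E[f|\mathcal{F}]\|_2^2-\|E[f|\mathcal{G}]\|_2^2$, so equality of the norms forces the two conditional expectations to coincide almost surely.

```python
from itertools import product

# Finite sketch of the L2 identity: Omega = {0,1}^3 with the uniform
# measure, G = sigma(first coordinate) inside F = sigma(first two).
omega = list(product([0, 1], repeat=3))
p = 1 / len(omega)

def f(w):  # an arbitrary bounded function on Omega
    return w[0] + 2 * w[1] * w[2] + 3 * w[2]

def cond_exp(func, key):
    # E[func | sigma(key)]: average func over each atom {key = const}.
    atoms = {}
    for w in omega:
        atoms.setdefault(key(w), []).append(w)
    means = {k: sum(func(v) for v in ws) / len(ws) for k, ws in atoms.items()}
    return lambda w: means[key(w)]

eF = cond_exp(f, lambda w: (w[0], w[1]))   # E[f | F]
eG = cond_exp(f, lambda w: w[0])           # E[f | G], G a sub-sigma-algebra

def sq_norm(func):
    return sum(func(w) ** 2 * p for w in omega)

diff = sq_norm(lambda w: eF(w) - eG(w))
# Pythagoras for nested projections: ||eF - eG||^2 = ||eF||^2 - ||eG||^2.
assert abs(diff - (sq_norm(eF) - sq_norm(eG))) < 1e-12
```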

Proof (of Theorem 8.6). Let us take an infinite subset $I\subseteq\mathbb{N}$ such that its complement $\mathbb{N}\setminus I$ is also infinite. By (8.21), we only need to prove the representation (8.20) for $(s_{\{\ell,\ell'\}})_{\ell,\ell'\in I}$. For each $j\in I$, let us consider the array
\[
S_j = \bigl(s_{\{\ell,\ell'\}}\bigr)_{\ell,\ell'\in(\mathbb{N}\setminus I)\cup\{j\}}
= \Bigl( s_{\{j,j\}},\ \bigl(s_{\{j,\ell\}}\bigr)_{\ell\in\mathbb{N}\setminus I},\ \bigl(s_{\{\ell,\ell'\}}\bigr)_{\ell,\ell'\in\mathbb{N}\setminus I} \Bigr). \tag{8.23}
\]
It is obvious that the weak exchangeability (8.19) implies that the sequence $(S_j)_{j\in I}$ is exchangeable, since any permutation $\pi$ of finitely many indices from $I$ in the array (8.19) results in the corresponding permutation of the sequence $(S_j)_{j\in I}$. We can view each array $S_j$ as an element of the Borel space $[0,1]^\infty$ and de Finetti's Theorem 8.3 implies that
\[
\bigl(S_j\bigr)_{j\in I} \stackrel{d}{=} \bigl(X(w,u_j)\bigr)_{j\in I} \tag{8.24}
\]
for some measurable function $X$ on $[0,1]^2$ taking values in the space of such arrays.
Next, we will show that, conditionally on the sequence $(S_j)_{j\in I}$, the off-diagonal elements $s_{\{j,j'\}}$ for $j,j'\in I$, $j\ne j'$, are independent and, moreover, the conditional distribution of $s_{\{j,j'\}}$ depends only on $S_j$ and $S_{j'}$. This means that if we consider the $\sigma$-algebras
\[
\mathcal{F} = \sigma\bigl((S_j)_{j\in I}\bigr), \qquad \mathcal{F}_{j,j'} = \sigma\bigl(S_j,S_{j'}\bigr) \ \text{ for } j,j'\in I,\ j\ne j', \tag{8.25}
\]
then we would like to show that, for any finite set $\mathcal{C}$ of indices $\{\ell,\ell'\}$ for $\ell,\ell'\in I$ such that $\ell\ne\ell'$ and any bounded measurable functions $f_{\ell,\ell'}$ corresponding to the indices $\{\ell,\ell'\}\in\mathcal{C}$, we have
\[
E\Bigl[\prod_{\{\ell,\ell'\}\in\mathcal{C}} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F}\Bigr]
= \prod_{\{\ell,\ell'\}\in\mathcal{C}} E\bigl[f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\big|\, \mathcal{F}_{\ell,\ell'}\bigr]. \tag{8.26}
\]
Notice that the definitions (8.22) and (8.25) imply that $\mathcal{F}_{j,j'}=\mathcal{F}_{(\mathbb{N}\setminus I)\cup\{j,j'\}}(j,j')$, since all the elements $s_{\{\ell,\ell'\}}$ with both indices in $(\mathbb{N}\setminus I)\cup\{j,j'\}$, except for $s_{\{j,j'\}}$, appear as one of the coordinates in the arrays $S_j$ or $S_{j'}$. Therefore, by Lemma 8.4,
\[
E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'}\bigr] = E\bigl[f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr]. \tag{8.27}
\]

Let us fix any $\{j,j'\}\in\mathcal{C}$, let $\mathcal{C}'=\mathcal{C}\setminus\{\{j,j'\}\}$ and consider an arbitrary set $A\in\mathcal{F}$. Since $I_A$ and $f_{\ell,\ell'}(s_{\{\ell,\ell'\}})$ for $\{\ell,\ell'\}\in\mathcal{C}'$ are $\mathcal{F}_{\mathbb{N}}(j,j')$-measurable,
\[
E\, I_A \prod_{\{\ell,\ell'\}\in\mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}})\, f_{j,j'}(s_{\{j,j'\}})
= E\Bigl( I_A \prod_{\{\ell,\ell'\}\in\mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}})\, E\bigl[f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j')\bigr]\Bigr)
= E\Bigl( I_A \prod_{\{\ell,\ell'\}\in\mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}})\, E\bigl[f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'}\bigr]\Bigr),
\]
where the second equality follows from (8.27). Since $\mathcal{F}_{j,j'}\subseteq\mathcal{F}$, this implies that
\[
E\Bigl[\prod_{\{\ell,\ell'\}\in\mathcal{C}} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F}\Bigr]
= E\Bigl[\prod_{\{\ell,\ell'\}\in\mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F}\Bigr]\,
E\bigl[f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'}\bigr]
\]
and (8.26) follows by induction on the cardinality of $\mathcal{C}$. By the argument in the Coding Lemma 8.1, (8.26) implies that, conditionally on $\mathcal{F}$, we can generate
\[
\bigl(s_{\{j,j'\}}\bigr)_{j\ne j'\in I} \stackrel{d}{=} \bigl(h(S_j,S_{j'},x_{\{j,j'\}})\bigr)_{j\ne j'\in I}, \tag{8.28}
\]
for some measurable function $h$ and i.i.d. uniform random variables $x_{\{j,j'\}}$ on $[0,1]$. The reason why the function $h$ can be chosen to be the same for all $\{j,j'\}$ is that, by symmetry, the distribution of $(S_j,S_{j'},s_{\{j,j'\}})$ does not depend on $\{j,j'\}$ and the conditional distribution of $s_{\{j,j'\}}$ given $S_j$ and $S_{j'}$ does not depend on $\{j,j'\}$. Also, the arrays $(S_j,S_{j'},s_{\{j,j'\}})$ and $(S_{j'},S_j,s_{\{j,j'\}})$ are equal in distribution and, therefore, the function $h$ is symmetric in the coordinates $S_j$ and $S_{j'}$. Finally, let us recall (8.24) and define the function
\[
f\bigl(w,u_j,u_{j'},x_{\{j,j'\}}\bigr) = h\bigl(X(w,u_j),X(w,u_{j'}),x_{\{j,j'\}}\bigr),
\]
which is, obviously, symmetric in $u_j$ and $u_{j'}$. Then the equations (8.24) and (8.28) imply that
\[
\Bigl(\bigl(S_j\bigr)_{j\in I},\ \bigl(s_{\{j,j'\}}\bigr)_{j\ne j'\in I}\Bigr)
\stackrel{d}{=}
\Bigl(\bigl(X(w,u_j)\bigr)_{j\in I},\ \bigl(f(w,u_j,u_{j'},x_{\{j,j'\}})\bigr)_{j\ne j'\in I}\Bigr).
\]
In particular, if we denote by $g$ the first coordinate of the map $X$ corresponding to the element $s_{\{j,j\}}$ in the array $S_j$ in (8.23), this proves that
\[
\Bigl(\bigl(s_{\{j,j\}}\bigr)_{j\in I},\ \bigl(s_{\{j,j'\}}\bigr)_{j\ne j'\in I}\Bigr)
\stackrel{d}{=}
\Bigl(\bigl(g(w,u_j)\bigr)_{j\in I},\ \bigl(f(w,u_j,u_{j'},x_{\{j,j'\}})\bigr)_{j\ne j'\in I}\Bigr).
\]
Recalling (8.21) finishes the proof of the representation (8.20). ⊓⊔

Proof of Theorem 8.5. One proof of Theorem 8.5, similar to the proof of Theorem 8.6, is sketched in an exercise below. We will now give a different proof, whose main part will be based on the following observation. Suppose that we have an exchangeable sequence of pairs $(t_\ell,s_\ell)_{\ell\ge 1}$ with coordinates in complete separable metric spaces. Given the sequence $(t_\ell)_{\ell\ge 1}$, how can we generate the sequence of second coordinates $(s_\ell)_{\ell\ge 1}$? Consider the empirical measures
\[
\mu = \lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n \delta_{(t_\ell,s_\ell)}, \qquad
\mu_1 = \lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n \delta_{t_\ell}.
\]
Obviously, $\mu_1$ is the marginal of $\mu$ on the first coordinate and, moreover, $\mu_1$ is a measurable function of the sequence $(t_\ell)_{\ell\ge 1}$, so if we are given this sequence we automatically know $\mu_1$. The following holds.

Lemma 8.5. Given $(t_\ell)_{\ell\ge 1}$, we can generate the sequence $(s_\ell)_{\ell\ge 1}$ in distribution as
\[
\bigl(s_\ell\bigr)_{\ell\ge 1} \stackrel{d}{=} \bigl(f(\mu_1,t_\ell,v,x_\ell)\bigr)_{\ell\ge 1}
\]
for some measurable function $f$ and i.i.d. uniform random variables $v$ and $(x_\ell)_{\ell\ge 1}$ on $[0,1]$.

Proof. First, let us explain how to generate the empirical measure $\mu$ given the sequence $t=(t_\ell)_{\ell\ge 1}$. Given $\mu$, the sequence $(t_\ell,s_\ell)_{\ell\ge 1}$ is i.i.d. from $\mu$ and, since $\mu_1$ is the first marginal of $\mu$, $(t_\ell)_{\ell\ge 1}$ are i.i.d. from $\mu_1$. This means that, if we consider the triple $(\mu,\mu_1,t)$ then the conditional distribution of $t$ given $(\mu,\mu_1)$ depends only on $\mu_1$,
\[
P\bigl(t\in\,\cdot\,\mid \mu,\mu_1\bigr) = P\bigl(t\in\,\cdot\,\mid \mu_1\bigr).
\]
By Lemma 8.3, $t$ and $\mu$ are independent given $\mu_1$. Therefore, again by Lemma 8.3,
\[
P\bigl(\mu\in\,\cdot\,\mid t,\mu_1\bigr) = P\bigl(\mu\in\,\cdot\,\mid \mu_1\bigr).
\]
On the other hand, $\mu_1$ is a function of $t$, so $P(\mu\in\,\cdot\,\mid t,\mu_1)=P(\mu\in\,\cdot\,\mid t)$ and, thus,
\[
P\bigl(\mu\in\,\cdot\,\mid t\bigr) = P\bigl(\mu\in\,\cdot\,\mid \mu_1\bigr).
\]
In other words, to generate $\mu$ given $t$, we can simply compute $\mu_1$ and generate $\mu$ given $\mu_1$. By the Coding Lemma, we can generate $\mu=g(\mu_1,v)$ as a function of $\mu_1$ and an independent uniform random variable $v$ on $[0,1]$.
Now, recall that, given $\mu$, the sequence $(t_\ell,s_\ell)_{\ell\ge 1}$ is i.i.d. with the distribution $\mu$ so, given $t_\ell$ and $\mu$, we can simply generate $s_\ell$ from the conditional distribution $\mu(s_\ell\in\,\cdot\,\mid t_\ell)$, independently of each other. Again, using the Coding Lemma, we can generate $s_\ell=h(\mu,t_\ell,x_\ell)$ as a function of i.i.d. uniform random variables $x_\ell$ on $[0,1]$. Finally, recalling that $\mu=g(\mu_1,v)$, we can write
\[
s_\ell = h\bigl(g(\mu_1,v),t_\ell,x_\ell\bigr) = f\bigl(\mu_1,t_\ell,v,x_\ell\bigr).
\]
This finishes the proof. ⊓⊔

Proof (of Theorem 8.5). Let us, for convenience, index the array $s_{\ell,\ell'}$ by $\ell\ge 1$ and $\ell'\in\mathbb{Z}$ instead of $\ell'\ge 1$. Let us denote by
\[
X_{\ell'} = \bigl(s_{\ell,\ell'}\bigr)_{\ell\ge 1} \quad\text{and}\quad X = \bigl(X_{\ell'}\bigr)_{\ell'\le 0}
\]
the $\ell'$-th column and the 'left half' of this array. Since the sequence of columns $(X_{\ell'})_{\ell'\in\mathbb{Z}}$ is exchangeable, we showed in the proof of de Finetti's theorem that, conditionally on $X$, the columns $(X_{\ell'})_{\ell'\ge 1}$ in the 'right half' of the array are i.i.d. If we describe the distribution of one column $X_1$ given $X$ then we can generate all columns $(X_{\ell'})_{\ell'\ge 1}$ independently from this distribution. Therefore, our strategy will be to describe the distribution of $X_1$ given $X$, and then combine it with the structure of the distribution of $X$. Both steps will use exchangeability with respect to permutations of rows, because so far we have only used exchangeability with respect to permutations of columns. Let
\[
Y_\ell = \bigl(s_{\ell,\ell'}\bigr)_{\ell'\le 0}
\]
be the elements in the $\ell$-th row of the 'left half' $X$ of the array. We want to describe the distribution of $X_1=(s_{\ell,1})_{\ell\ge 1}$ given $X=(Y_\ell)_{\ell\ge 1}$ and we will use the fact that the sequence $(Y_\ell,s_{\ell,1})_{\ell\ge 1}$ is exchangeable. By Lemma 8.5, conditionally on $X=(Y_\ell)_{\ell\ge 1}$, the sequence $(s_{\ell,1})_{\ell\ge 1}$ can be generated as
\[
s_{\ell,1} = f\bigl(\mu_1,Y_\ell,v_1,x_{\ell,1}\bigr),
\]
where $\mu_1$ is the empirical measure of $(Y_\ell)$ and where, instead of $v$ and $(x_\ell)$, we wrote $v_1$ and $(x_{\ell,1})$ to emphasize the first column index 1. Since, conditionally on $X$, the columns $(X_{\ell'})_{\ell'\ge 1}$ in the 'right half' of the array are i.i.d., we can generate
\[
s_{\ell,\ell'} = f\bigl(\mu_1,Y_\ell,v_{\ell'},x_{\ell,\ell'}\bigr),
\]
where $v_{\ell'}$ and $x_{\ell,\ell'}$ are i.i.d. uniform random variables on $[0,1]$. Finally, since $(Y_\ell)_{\ell\ge 1}$ are i.i.d. given the empirical distribution $\mu_1$, we can generate $\mu_1=h(w)$ as a function of a uniform random variable $w$ on $[0,1]$ and then, using the Coding Lemma, generate $Y_\ell=Y(\mu_1,u_\ell)=Y(h(w),u_\ell)$ as a function of $\mu_1$ and i.i.d. uniform random variables $u_\ell$ on $[0,1]$. Plugging these into $f$ above gives $s_{\ell,\ell'}=\sigma(w,u_\ell,v_{\ell'},x_{\ell,\ell'})$ for some function $\sigma$, which finishes the proof. ⊓⊔

Exercise 8.2.1. Prove Lemma 8.3.

Exercise 8.2.2. Give a similar proof of Theorem 8.5, using only global symmetry considerations. Hints: Since the row and column indices play different roles in this case, the first step will be slightly different (the second step will be essentially the same). One has to consider two sequences indexed by $j\in I$,
\[
S_j^1 = \Bigl(\bigl(s_{\ell,j}\bigr)_{\ell\in\mathbb{N}\setminus I},\ \bigl(s_{\ell,\ell'}\bigr)_{\ell,\ell'\in\mathbb{N}\setminus I}\Bigr)
\quad\text{and}\quad
S_j^2 = \Bigl(\bigl(s_{j,\ell}\bigr)_{\ell\in\mathbb{N}\setminus I},\ \bigl(s_{\ell,\ell'}\bigr)_{\ell,\ell'\in\mathbb{N}\setminus I}\Bigr).
\]
These two sequences are separately exchangeable, which means that
\[
\Bigl(\bigl(S^1_{\pi(j)}\bigr)_{j\in I},\ \bigl(S^2_{\rho(j)}\bigr)_{j\in I}\Bigr)
\stackrel{d}{=}
\Bigl(\bigl(S^1_j\bigr)_{j\in I},\ \bigl(S^2_j\bigr)_{j\in I}\Bigr)
\]
for any permutations $\pi$ and $\rho$ of finitely many indices. Notice that these sequences are not independent. In this case one needs to prove (as a part of the exercise) the following modification of de Finetti's representation.

Theorem 8.7. If the sequences $(s^1_\ell)_{\ell\ge 1}$ and $(s^2_\ell)_{\ell\ge 1}$ are separately exchangeable then there exist measurable functions $g_1,g_2\colon[0,1]^2\to\mathbb{R}$ such that
\[
\Bigl(\bigl(s^1_\ell\bigr)_{\ell\ge 1},\ \bigl(s^2_\ell\bigr)_{\ell\ge 1}\Bigr)
\stackrel{d}{=}
\Bigl(\bigl(g_1(w,u_\ell)\bigr)_{\ell\ge 1},\ \bigl(g_2(w,v_\ell)\bigr)_{\ell\ge 1}\Bigr), \tag{8.29}
\]
where $w$, $(u_\ell)$ and $(v_\ell)$ are i.i.d. random variables uniform on $[0,1]$.

Exercise 8.2.3. Consider a pair of random arrays
\[
\bigl(s^1_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1} \quad\text{and}\quad \bigl(s^2_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}
\]
that are separately exchangeable in the first coordinate and jointly exchangeable in the second coordinate, that is,
\[
\Bigl(\bigl(s^1_{\pi_1(\ell),\rho(\ell')}\bigr)_{\ell,\ell'\ge 1},\ \bigl(s^2_{\pi_2(\ell),\rho(\ell')}\bigr)_{\ell,\ell'\ge 1}\Bigr)
\stackrel{d}{=}
\Bigl(\bigl(s^1_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1},\ \bigl(s^2_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}\Bigr)
\]
for any permutations $\pi_1,\pi_2,\rho$ of finitely many coordinates. Show that there exist two functions $\sigma_1,\sigma_2$ such that these arrays can be generated in distribution by
\[
s^1_{\ell,\ell'} = \sigma_1\bigl(w,u^1_\ell,v_{\ell'},x^1_{\ell,\ell'}\bigr), \qquad
s^2_{\ell,\ell'} = \sigma_2\bigl(w,u^2_\ell,v_{\ell'},x^2_{\ell,\ell'}\bigr),
\]
where all the arguments are i.i.d. uniform random variables on $[0,1]$.

8.3 The Dovbysh-Sudakov representation for Gram-de Finetti arrays

Let us consider an infinite symmetric random array $R=(R_{\ell,\ell'})_{\ell,\ell'\ge 1}$, which is weakly exchangeable in the sense defined in (8.18), i.e. for any permutation $\pi$ of finitely many indices we have equality in distribution
\[
\bigl(R_{\pi(\ell),\pi(\ell')}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(R_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}. \tag{8.30}
\]
In addition, suppose that $R$ is positive definite with probability one, where by positive definite we will always mean non-negative definite. Such weakly exchangeable positive definite arrays are called Gram-de Finetti arrays. It turns out that all such arrays are generated essentially as the covariance matrix of an i.i.d. sample from a random measure on a Hilbert space. Let $H$ be the Hilbert space $L^2([0,1],dv)$, where $dv$ denotes the Lebesgue measure on $[0,1]$.

Theorem 8.8 (Dovbysh-Sudakov). There exists a random probability measure $\eta$ on $H\times\mathbb{R}_+$ such that the array $R=(R_{\ell,\ell'})_{\ell,\ell'\ge 1}$ is equal in distribution to
\[
\bigl(h_\ell\cdot h_{\ell'} + a_\ell\,\delta_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}, \tag{8.31}
\]
where, conditionally on $\eta$, $(h_\ell,a_\ell)_{\ell\ge 1}$ is a sequence of i.i.d. random variables with the distribution $\eta$, and $h\cdot h'$ denotes the scalar product on $H$.

In particular, the marginal $G$ of $\eta$ on $H$ can be used to generate the sequence $(h_\ell)$ and the off-diagonal elements of the array $R$.
Proof (of Theorem 8.8). Since the array $R$ is positive definite, conditionally on $R$, we can generate a Gaussian vector $g$ in $\mathbb{R}^{\mathbb{N}}$ with the covariance equal to $R$. Now, also conditionally on $R$, let $(g^i)_{i\ge 1}$ be independent copies of $g$. If, for each $i\ge 1$, we denote the coordinates of $g^i$ by $g_{\ell,i}$ for $\ell\ge 1$ then, since the array $R=(R_{\ell,\ell'})_{\ell,\ell'\ge 1}$ is weakly exchangeable, it should be obvious that the array $(g_{\ell,i})_{\ell,i\ge 1}$ is exchangeable in the sense of (8.16). By Theorem 8.5, there exists a measurable function $\sigma\colon[0,1]^4\to\mathbb{R}$ such that
\[
\bigl(g_{\ell,i}\bigr)_{\ell,i\ge 1} \stackrel{d}{=} \bigl(\sigma(w,u_\ell,v_i,x_{\ell,i})\bigr)_{\ell,i\ge 1}, \tag{8.32}
\]
where $w$, $(u_\ell)$, $(v_i)$, $(x_{\ell,i})$ are i.i.d. random variables with the uniform distribution on $[0,1]$. By the strong law of large numbers (applied conditionally on $R$), for any $\ell,\ell'\ge 1$,
\[
\frac{1}{n}\sum_{i=1}^n g_{\ell,i}\, g_{\ell',i} \to R_{\ell,\ell'}
\]
almost surely as $n\to\infty$. Similarly, by the strong law of large numbers (now applied conditionally on $w$ and $(u_\ell)_{\ell\ge 1}$),
\[
\frac{1}{n}\sum_{i=1}^n \sigma(w,u_\ell,v_i,x_{\ell,i})\,\sigma(w,u_{\ell'},v_i,x_{\ell',i})
\to E'\,\sigma(w,u_\ell,v_1,x_{\ell,1})\,\sigma(w,u_{\ell'},v_1,x_{\ell',1})
\]
almost surely, where $E'$ denotes the expectation with respect to the random variables $v_1$ and $(x_{\ell,1})_{\ell\ge 1}$. Therefore, (8.32) implies that
\[
\bigl(R_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(E'\,\sigma(w,u_\ell,v_1,x_{\ell,1})\,\sigma(w,u_{\ell'},v_1,x_{\ell',1})\bigr)_{\ell,\ell'\ge 1}. \tag{8.33}
\]
If we denote
\[
\sigma^{(1)}(w,u,v) = \int \sigma(w,u,v,x)\,dx, \qquad
\sigma^{(2)}(w,u,v) = \int \sigma(w,u,v,x)^2\,dx,
\]
then the off-diagonal and diagonal elements on the right-hand side of (8.33) are given by
\[
\int \sigma^{(1)}(w,u_\ell,v)\,\sigma^{(1)}(w,u_{\ell'},v)\,dv
\quad\text{and}\quad
\int \sigma^{(2)}(w,u_\ell,v)\,dv,
\]
correspondingly. Notice that, for almost all $w$ and $u$, the function $v\mapsto\sigma^{(1)}(w,u,v)$ is in $H=L^2([0,1],dv)$, since
\[
\int \sigma^{(1)}(w,u,v)^2\,dv \le \int \sigma^{(2)}(w,u,v)\,dv
\]
and, by (8.33), the right-hand side is equal in distribution to $R_{1,1}$. Therefore, if we denote
\[
h_\ell = \sigma^{(1)}(w,u_\ell,\,\cdot\,), \qquad a_\ell = \int \sigma^{(2)}(w,u_\ell,v)\,dv - h_\ell\cdot h_\ell,
\]
then (8.33) becomes
\[
\bigl(R_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1} \stackrel{d}{=} \bigl(h_\ell\cdot h_{\ell'} + a_\ell\,\delta_{\ell,\ell'}\bigr)_{\ell,\ell'\ge 1}. \tag{8.34}
\]
It remains to observe that $(h_\ell,a_\ell)_{\ell\ge 1}$ is an i.i.d. sequence from the random measure $\eta$ on $H\times\mathbb{R}_+$ given by the image of the Lebesgue measure $du$ on $[0,1]$ under the map
\[
u \mapsto \Bigl( \sigma^{(1)}(w,u,\,\cdot\,),\ \int \sigma^{(2)}(w,u,v)\,dv - \int \sigma^{(1)}(w,u,v)^2\,dv \Bigr).
\]
This finishes the proof. ⊓⊔
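The form (8.31) is straightforward to simulate. In the hedged sketch below, $\eta$ is a toy discrete random measure on $\mathbb{R}^3\times\mathbb{R}_+$ (all names and values are illustrative assumptions); the resulting array is symmetric and non-negative definite, since $z^{T}Rz = \bigl|\sum_\ell z_\ell h_\ell\bigr|^2 + \sum_\ell a_\ell z_\ell^2$ with $a_\ell\ge 0$.

```python
import random

# Sketch of a Gram-de Finetti array in the form (8.31): draw i.i.d. pairs
# (h_l, a_l) from a toy discrete random measure eta and set
# R[l][m] = <h_l, h_m> + a_l * delta_{l,m}.
rng = random.Random(2)
dim, n = 3, 5

# A toy random measure: uniform on a few (vector, nonnegative scalar) atoms.
atoms = [([rng.gauss(0, 1) for _ in range(dim)], rng.random())
         for _ in range(4)]
sample = [atoms[rng.randrange(len(atoms))] for _ in range(n)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

R = [[dot(sample[l][0], sample[m][0]) + (sample[l][1] if l == m else 0.0)
      for m in range(n)] for l in range(n)]

# Non-negative definiteness: z^T R z >= 0 for every z, up to float error.
for _ in range(100):
    z = [rng.uniform(-1, 1) for _ in range(n)]
    quad = sum(z[l] * R[l][m] * z[m] for l in range(n) for m in range(n))
    assert quad >= -1e-9
```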

Let us now suppose that there is equality, instead of equality in distribution, in (8.31), i.e. the array $R$ is generated by an i.i.d. sample $(h_\ell,a_\ell)$ from the random measure $\eta$. To complete the picture, let us show that one can reconstruct $\eta$, up to an orthogonal transformation of its marginal on $H$, as a measurable function of the array $R$ itself, with values in the set $\mathcal{P}(H\times\mathbb{R}_+)$ of all probability measures on $H\times\mathbb{R}_+$ equipped with the topology of weak convergence.

Lemma 8.6. There exists a measurable function $\eta'=\eta'(R)$ of the array $(R_{\ell,\ell'})_{\ell,\ell'\ge 1}$ with values in $\mathcal{P}(H\times\mathbb{R}_+)$ such that $\eta'=\eta\circ(U,\mathrm{id})^{-1}$ almost surely for some orthogonal operator $U$ on $H$ that depends on the sequence $(h_\ell)_{\ell\ge 1}$.

Proof. Let us begin by showing that the norms $\|h_\ell\|$ can be reconstructed almost surely from the array $R$. Consider a sequence $(g_\ell)$ in $H$ such that $g_\ell\cdot g_{\ell'}=R_{\ell,\ell'}$ for all $\ell,\ell'\ge 1$. In other words, $\|g_\ell\|^2=\|h_\ell\|^2+a_\ell$ and $g_\ell\cdot g_{\ell'}=h_\ell\cdot h_{\ell'}$ for all $\ell\ne\ell'$. Without loss of generality, let us assume that $g_\ell=h_\ell+\sqrt{a_\ell}\,e_\ell$, where $(e_\ell)_{\ell\ge 1}$ is an orthonormal sequence orthogonal to the closed span of $(h_\ell)$. If necessary, we identify $H$ with $H\oplus H$ to choose such a sequence $(e_\ell)$. Since $(h_\ell)$ is an i.i.d. sequence from the marginal $G$ of the measure $\eta$ on $H$, with probability one, there are elements in the sequence $(h_\ell)_{\ell\ge 2}$ arbitrarily close to $h_1$ and, therefore, the length of the orthogonal projection of $h_1$ onto the closed span of $(h_\ell)_{\ell\ge 2}$ is equal to $\|h_1\|$. As a result, the length of the orthogonal projection of $g_1$ onto the closed span of $(g_\ell)_{\ell\ge 2}$ is also equal to $\|h_1\|$, and it is obvious that this length is a measurable function of the array $g_\ell\cdot g_{\ell'}=R_{\ell,\ell'}$. Similarly, we can reconstruct all the norms $\|h_\ell\|$ as measurable functions of the array $R$ and, thus, all $a_\ell=R_{\ell,\ell}-\|h_\ell\|^2$. Therefore,
\[
\bigl(h_\ell\cdot h_{\ell'}\bigr)_{\ell,\ell'\ge 1} \quad\text{and}\quad \bigl(a_\ell\bigr)_{\ell\ge 1} \tag{8.35}
\]
are both measurable functions of the array $R$. Given the matrix $(h_\ell\cdot h_{\ell'})$, we can find a sequence $(x_\ell)$ in $H$ isometric to $(h_\ell)$, for example, by choosing $x_\ell$ to be in the span of the first $\ell$ elements of some fixed orthonormal basis. This means that all $x_\ell$ are measurable functions of $R$ and that there exists an orthogonal operator $U=U((h_\ell)_{\ell\ge 1})$ on $H$ such that
\[
x_\ell = U h_\ell \ \text{ for } \ell\ge 1. \tag{8.36}
\]
Since $(h_\ell,a_\ell)_{\ell\ge 1}$ is an i.i.d. sequence from the distribution $\eta$, by the strong law of large numbers for empirical measures (Varadarajan's theorem in Section 4.2),
\[
\frac{1}{n}\sum_{1\le\ell\le n} \delta_{(h_\ell,a_\ell)} \to \eta \ \text{ weakly}
\]
almost surely and, therefore, (8.36) implies that
\[
\frac{1}{n}\sum_{1\le\ell\le n} \delta_{(x_\ell,a_\ell)} \to \eta\circ(U,\mathrm{id})^{-1} \ \text{ weakly} \tag{8.37}
\]
almost surely. The left-hand side is, obviously, a measurable function of the array $R$ in the space of all probability measures on $H\times\mathbb{R}_+$ equipped with the topology of weak convergence and, therefore, as a limit, $\eta'=\eta\circ(U,\mathrm{id})^{-1}$ is also a measurable function of $R$. This finishes the proof. ⊓⊔

Exercise 8.3.1. Consider a sequence of pairs $(G^1_N,G^2_N)_{N\ge 1}$ of random probability measures on $\{-1,+1\}^N$ that are not necessarily independent. Let $(\sigma^\ell)_{\ell\le 0}$ be an i.i.d. sample of replicas from $G^1_N$ and $(\sigma^\ell)_{\ell\ge 1}$ be an i.i.d. sample of replicas from $G^2_N$ (for convenience of notation, we denote both by $\sigma^\ell$ but use different sets of indices for $\ell$). Let $R^N=(R^N_{\ell,\ell'})_{\ell,\ell'\in\mathbb{Z}}$ be the array of the so-called overlaps
\[
R^N_{\ell,\ell'} = \frac{1}{N}\sum_{i=1}^N \sigma^\ell_i \sigma^{\ell'}_i.
\]
Suppose that $R^N$ converges in distribution to an array $R=(R_{\ell,\ell'})_{\ell,\ell'\in\mathbb{Z}}$. Notice that this array is weakly exchangeable,
\[
\bigl(R_{\pi(\ell),\pi(\ell')}\bigr)_{\ell,\ell'\in\mathbb{Z}} \stackrel{d}{=} \bigl(R_{\ell,\ell'}\bigr)_{\ell,\ell'\in\mathbb{Z}},
\]
not under all permutations of the integers $\mathbb{Z}$, but only under those permutations that map positive integers into positive and non-positive into non-positive. Prove that there exists a pair of random measures $G^1$ and $G^2$ on a separable Hilbert space $H$ (not necessarily independent) such that
\[
\bigl(R_{\ell,\ell'}\bigr)_{\ell\ne\ell'\in\mathbb{Z}} \stackrel{d}{=} \bigl(h_\ell\cdot h_{\ell'}\bigr)_{\ell\ne\ell'\in\mathbb{Z}},
\]
where $(h_\ell)_{\ell\le 0}$ is an i.i.d. sample from $G^1$ and $(h_\ell)_{\ell\ge 1}$ is an i.i.d. sample from $G^2$.

Chapter 9
Finite State Markov Chains

9.1 Definitions and basic properties

In this chapter, we will study finite state Markov chains, namely, sequences $(X_i)_{i\ge 1}$ of random variables taking values in a finite set
\[
S = \{s_1,\ldots,s_m\} \tag{9.1}
\]
(elements of this set will be called states), satisfying
\[
P\bigl(X_{k+1}=x_{k+1} \mid X_1=x_1,\ldots,X_k=x_k\bigr) = P\bigl(X_{k+1}=x_{k+1} \mid X_k=x_k\bigr). \tag{9.2}
\]
This means that the conditional distribution of the next outcome $X_{k+1}$ given the preceding outcomes $X_1=x_1,\ldots,X_k=x_k$ depends only on the most recent outcome $X_k=x_k$. This is called the Markov property of the sequence, also called the memoryless property, in the sense that we do not need to remember the entire past and only need to know the most recent outcome to know the chances of the next outcome. The conditional distribution in (9.2) is also called a transition probability. A Markov chain is called homogeneous if the transition probabilities $P(X_{k+1}=a \mid X_k=b)$ do not depend on the index $k$. For any $n\ge 1$, the joint distribution of $X_1,\ldots,X_n$ can be constructed sequentially,
\[
P\bigl(X_1=x_1,\ldots,X_n=x_n\bigr) = \prod_{k=0}^{n-1} P\bigl(X_{k+1}=x_{k+1} \mid X_k=x_k\bigr), \tag{9.3}
\]
where the first factor for $k=0$ is just $P(X_1=x_1)$.
The transition matrix $P$ is the matrix
\[
P := \bigl[p_{ij}\bigr]_{1\le i,j\le m} = \bigl[p(s_i,s_j)\bigr]_{1\le i,j\le m}, \tag{9.4}
\]
whose entries are the transition probabilities
\[
p_{ij} = p(s_i,s_j) = P\bigl(X_{k+1}=s_j \mid X_k=s_i\bigr). \tag{9.5}
\]

It is customary (and convenient in terms of notation) to start the Markov chain at 'time zero', i.e. $k=0$. The distribution of $X_0$, which we will denote by $\mu$,
\[
\mu_i = \mu(s_i) := P(X_0=s_i) \ \text{ for } s_i\in S, \tag{9.6}
\]
is called the initial distribution of the Markov chain. With this notation, the definition of the (homogeneous finite state) Markov chain can be rewritten as
\[
P\bigl(X_0=x_0,X_1=x_1,\ldots,X_n=x_n\bigr) = \mu(x_0)\,p(x_0,x_1)\,p(x_1,x_2)\cdots p(x_{n-1},x_n). \tag{9.7}
\]
The initial distribution $\mu$ will vary depending on how we want to start the chain, so we can also express this definition entirely in terms of the transition matrix,
\[
P\bigl(X_1=x_1,\ldots,X_n=x_n \mid X_0=x_0\bigr) = p(x_0,x_1)\,p(x_1,x_2)\cdots p(x_{n-1},x_n). \tag{9.8}
\]
If we start the chain in the state $x_0$, then the probability that the chain will visit states $x_1,\ldots,x_n$ in the next $n$ steps is the product of transition probabilities along the path. If we multiply by $\mu(x_0)$, we recover the previous equation.
It is also convenient to visualize transition probabilities by a transition graph. For example, the transition matrix
\[
P = \begin{bmatrix}
0.25 & 0.25 & 0 & 0 & 0.5 & 0 & 0 \\
0.4 & 0 & 0 & 0 & 0.6 & 0 & 0 \\
0 & 0.3 & 0 & 0.7 & 0 & 0 & 0 \\
0 & 0 & 0.25 & 0 & 0.75 & 0 & 0 \\
0 & 0.9 & 0 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0.5 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0.5
\end{bmatrix} \tag{9.9}
\]
can be depicted as in Figure 9.1, with dots representing states and arrows representing non-zero transition probabilities. Notice that, for example, $p_{12}\ne p_{21}$.
This graphical representation of transition probabilities suggests that we can think of Markov chains as random walks on the set of states, where at each step we pick one of the neighbours of the current state $s_i$ with probabilities $p_{ij}$ and move there. If $p_{ii}\ne 0$, we can end up staying at $s_i$.
A state $s_i$ is called inessential if we can reach from it another state $s_j$ with positive probability (not necessarily in one step), from which we cannot come back to $s_i$. For example, in Figure 9.1, the states $s_3$ and $s_4$ are inessential. Such states are also called transient, although this terminology is better suited for infinite state Markov chains. For finite state chains, once we reach $s_j$, we can never come back to $s_i$, so for the long-term behaviour of the Markov chain these states do not matter.
Fig. 9.1 (graph not reproduced) Transition graph of the Markov chain with the transition matrix (9.9). States $s_3$, $s_4$ are inessential, and the essential states are divided into two clusters, $\{s_1,s_2,s_5\}$ and $\{s_6,s_7\}$.

Two states $s_i$ and $s_j$ are called communicating, denoted $s_i\leftrightarrow s_j$, if we can go from each of them to the other with positive probability. If
\[
S_i = \{s\in S : s_i\leftrightarrow s\}
\]
is the set of states communicating with $s_i$, and if $s_i$ communicates with $s_j$ then, obviously, $S_i=S_j$, because any state that communicates with $s_i$ communicates with $s_j$ and vice versa. This means that the communicating states can be divided into disjoint clusters. For example, in Figure 9.1, there are two clusters of essential states, $\{s_1,s_2,s_5\}$ and $\{s_6,s_7\}$, and one cluster of inessential states, $\{s_3,s_4\}$. A state $s_i$ is called an absorbing state if $p_{ii}=1$ because, once we reach this state, we can never leave. Transition probabilities on a given cluster of essential states define a Markov chain on that cluster.
A Markov chain is called irreducible if all its states communicate with each other, which means that there are no inessential states and there is only one cluster of essential states. Below we will mostly study irreducible Markov chains.
The transition matrix $P$ is much more useful than just as a representation of the transition probabilities. If we denote by
\[
p_{ij}(n) = \bigl(P^n\bigr)_{ij} \tag{9.10}
\]
the $(i,j)$th element of $P^n$ then the following holds.

Lemma 9.1 ($n$-step transition probabilities). For all $n\ge 1$,
\[
P(X_n=s_j \mid X_0=s_i) = p_{ij}(n) \tag{9.11}
\]
and, in particular,
\[
P(X_n=s_j) = \sum_{i=1}^m \mu_i\, p_{ij}(n), \tag{9.12}
\]
where $\mu$ is the initial distribution defined in (9.6).


If we write the last equation in the vector form,

P(Xn = s1 ), . . . , P(Xn = sm ) = µPn , (9.13)

where µ = (µ1 , . . . , µm ), we can compute the distribution of the chain at time n


by multiplying the initial distribution µ by Pn . It is important that we multiply the
row vector µ by Pn on the right and, for example, Pn µ T does not have the same
meaning. The equation (9.11) also shows that
m
 pi j (n) = 1,
j=1

so Pn can be viewed as an n-step transition matrix.
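Lemma 9.1 can be verified directly for a small chain: the $(i,j)$ entry of $P^n$ must equal the brute-force sum over all length-$n$ paths from $s_i$ to $s_j$ of the products of one-step probabilities, as in (9.8). The 2-state matrix here is an illustrative assumption.

```python
from itertools import product

# The (i, j) entry of P^n equals the sum over all length-n paths i -> j
# of the products of one-step transition probabilities.
P = [[0.9, 0.1],
     [0.4, 0.6]]
m = len(P)

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def mat_pow(A, n):
    R = [[float(i == j) for j in range(m)] for i in range(m)]
    for _ in range(n):
        R = mat_mul(R, A)
    return R

def brute_force(i, j, n):
    # Sum over all choices of intermediate states x_1, ..., x_{n-1}.
    total = 0.0
    for mid in product(range(m), repeat=n - 1):
        path = (i,) + mid + (j,)
        prob = 1.0
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]
        total += prob
    return total

n = 4
Pn = mat_pow(P, n)
for i in range(m):
    assert abs(sum(Pn[i]) - 1) < 1e-12          # rows of P^n sum to 1
    for j in range(m):
        assert abs(Pn[i][j] - brute_force(i, j, n)) < 1e-12
```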


Proof. We start with n = 2. In order to compute

P(X2 = s j | X0 = si ),

we can sum over possible outcomes of X1 = sk ,


m
P(X2 = s j | X0 = si ) = Â P(X1 = sk , X2 = s j | X0 = si )
k=1
m m
= Â p(si , sk )p(sk , s j ) = Â pik pk j ,
k=1 k=1

which is the (i, j)th entry of P2 . The general case follows by induction, using a
similar calculation. Assuming that (9.11) holds,
m
P(Xn+1 = s j | X0 = si ) = Â P(Xn = sk , Xn+1 = s j | X0 = si )
k=1
m
= Â P(Xn+1 = s j | Xn = sk )P(Xn = sk | X0 = si )
k=1
m
= Â pik (n)pk j = pi j (n + 1).
k=1

If we multiply this by µi = P(X0 = si ) and sum over i  m, we get (9.12) for Xn+1 ,
so the proof of the induction step is complete. t
u
The period di of a state si is defined by

di = gcd n 1 : pii (n) > 0 , (9.14)

the greatest common divisor all the times n when a walk starting at si can return
to si . If di = 1 then the state is called aperiodic. We call a Markov chain aperiodic
9.1 Definitions and basic properties 259

if all periods d_i = 1. It turns out that for irreducible chains this is the same as
requiring only one period d_i to be equal to 1.
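The definition (9.14) can be checked numerically; the sketch below (illustrative, not part of the text) approximates the gcd by looking only at return times up to a finite horizon.

```python
# Illustrative sketch of (9.14): the period of state s_i is the gcd of the
# return times n with p_ii(n) > 0 (here truncated at a finite horizon).
from math import gcd

def period(P, i, horizon=50):
    m = len(P)
    Pn = [row[:] for row in P]    # Pn holds P^n at the start of each loop pass
    d = 0
    for n in range(1, horizon + 1):
        if Pn[i][i] > 0:
            d = gcd(d, n)
        Pn = [[sum(Pn[a][k] * P[k][b] for k in range(m)) for b in range(m)]
              for a in range(m)]  # Pn <- Pn P
    return d

flip = [[0.0, 1.0], [1.0, 0.0]]   # deterministic flip: returns only at even times
lazy = [[0.5, 0.5], [0.5, 0.5]]   # can stay put, so it returns at every time
```

The flip chain has period 2, while the lazy chain has period 1, i.e. it is aperiodic.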
Lemma 9.2. If a Markov chain is irreducible then all periods d_i are equal.
Proof. Let us suppose that s_j can be reached from s_i in N steps and s_i can be
reached from s_j in M steps with positive probabilities, p_{ij}(N) > 0 and p_{ji}(M) > 0.
If p_{jj}(n) > 0 then

p_{ii}(N + M + n) ≥ p_{ij}(N) p_{jj}(n) p_{ji}(M) > 0,

because we can reach s_j in N steps, then come back to s_j in n steps, and then
reach s_i in M steps, so the probability of returning to s_i in N + M + n steps is positive.
This is also true for n = 0, because we can go to s_j and come back to s_i in N + M
steps. Since d_i is the period of s_i, by definition, d_i divides all numbers N + M + n
as above. In particular, it divides N + M, which implies that it divides all n such
that p_{jj}(n) > 0. This means that d_i divides d_j, because d_j is the greatest common
divisor of all such n. Similarly, d_j divides d_i, so they must be equal. □
The following property of irreducible aperiodic Markov chains will be very im-
portant in Section 9.3.
Lemma 9.3. If a Markov chain is irreducible and aperiodic then there exists N ≥ 1
such that, for all n ≥ N,

p_{ij}(n) > 0 for all i, j.   (9.15)

In other words, for large n, all the entries of P^n are strictly positive.
Proof. Let T(s_1) = {n ≥ 1 : p_{11}(n) > 0} be the set of all times the chain starting
at s_1 can come back to s_1. Let d ≥ 1 be the smallest positive integer that can be
written as

d = a_1 n_1 + … + a_k n_k,

for k ≥ 1, n_i ∈ T(s_1) and a_i ∈ Z. Then d must divide all n ∈ T(s_1), because if it
does not divide some n ∈ T(s_1) then

n = ad + r,

where the remainder r ≥ 1 is strictly less than d. However,

r = n − ad = n − a(a_1 n_1 + … + a_k n_k)

is also a linear combination with integer coefficients of the times in T(s_1), which
contradicts that d was the smallest such number. This proves that d divides the
period of s_1 and, since the chain is aperiodic, d = 1. We showed that

1 = a_1 n_1 + … + a_k n_k,   (9.16)

for some k ≥ 1, n_i ∈ T(s_1) and a_i ∈ Z. Some a_i, of course, can be negative.


Let us take the largest |a_i| and, for concreteness, suppose that |a_1| is the largest. If

N_1 = |a_1| n_1 (n_1 + … + n_k),

we will now show that all n ≥ N_1 can be written as a linear combination

n = c_1 n_1 + … + c_k n_k,

with non-negative integer coefficients c_i. This will prove that p_{11}(n) > 0 for such
n, because the chain starting from the state s_1 can come back to s_1 in n_i steps,
so we just need to repeat these loops c_i times for each i ≤ k to see that the
chain can come back in n steps with positive probability.
Let us first consider any n between N_1 and N_1 + n_1, which means that n = N_1 + ℓ
for some 0 ≤ ℓ ≤ n_1. By (9.16), we can write

n = N_1 + ℓ = |a_1| n_1 (n_1 + … + n_k) + ℓ(a_1 n_1 + … + a_k n_k) = c_1 n_1 + … + c_k n_k,

where

c_i = |a_1| n_1 + ℓ a_i ≥ |a_1| ℓ + ℓ a_i = (|a_1| + a_i) ℓ ≥ 0,

as we wished, because we assumed that |a_1| is greater than or equal to |a_i|. Since

N_1 + n_1 = (|a_1| n_1 + 1) n_1 + |a_1| n_1 (n_2 + … + n_k),

we can repeat the same argument for n in between N1 + n1 and N1 + 2n1 , and so
on.
Similarly, for each state s_i, we can find N_i such that p_{ii}(n) > 0 for n ≥ N_i. If we
take N′ = max(N_1, …, N_m) then p_{ii}(n) > 0 for all i and all n ≥ N′. Since the chain
is irreducible, we can always go from one state s_i to another state s_j in at most m
steps with positive probability (see the exercise below), which implies that p_{ij}(n) > 0
for all n ≥ N′ + m. □

Exercise 9.1.1. Let U_1, U_2, … be i.i.d. random variables with the uniform distribu-
tion on {1, …, m} and let

X_n = max_{i ≤ n} U_i.

Show that X_1, X_2, … is a Markov chain and find its transition probabilities. What
are the essential states of this chain?

Exercise 9.1.2. Suppose that m white balls and m black balls are mixed together
and divided evenly between two baskets. At each step, two balls are chosen at ran-
dom, one from each basket, and switched. We say that the system is in the state s_i
if there are i white balls in the first basket. Find the transition probabilities of this
chain.

Exercise 9.1.3. If a Markov chain on m states is irreducible, show that, for any
i, j ≤ m, p_{ij}(k) > 0 for some k ≤ m.

Exercise 9.1.4. If a Markov chain is irreducible and p_{11} > 0, what is the period d_2
of the state s_2?

Exercise 9.1.5. If a Markov chain (X_n)_{n≥0} is irreducible and has period d, show
that Y_n = X_{dn} for n ≥ 0 is a Markov chain and that it is aperiodic. Does it have to
be irreducible?

Exercise 9.1.6. Let N ∼ Poisson(λ) be a Poisson random variable. Let us run N in-
dependent Markov chains starting from the state s_1, and let N_n(i) be the number of
these chains in the state s_i at time n. Show that N_n(i) ∼ Poisson(λ p_{1i}(n)).

9.2 Stationary distributions

In Lemma 9.1 we showed that if the vector µ = (µ_1, …, µ_m) describes the distri-
bution of the chain X_0 at time zero, then µP^n is the distribution of X_n at time n. The
distribution µ on the set of states is called stationary if

µ = µP, (9.17)

i.e. the distribution of X_1 is the same as that of X_0. Of course, this implies that
µ = µP^n, so the distribution of every X_n is the same. We will show that a stationary
distribution always exists, and for an irreducible Markov chain it is unique. For
irreducible chains, we will also derive a representation of the stationary distribution
in terms of expected return times. To find a stationary distribution in practice, one
can just solve a system of linear equations µ = µP.
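For a small chain, the system µ = µP can also be found numerically; the sketch below (illustrative, with a made-up chain) uses the fact, justified later in this section, that iterating µ ← µP converges for an irreducible aperiodic chain.

```python
# Illustrative sketch: find a stationary distribution by iterating mu <- mu P.
# The birth-and-death chain below is a made-up example.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

mu = [1 / 3, 1 / 3, 1 / 3]        # any starting distribution works here
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]

# How far mu is from solving mu = mu P exactly.
residual = max(abs(sum(mu[i] * P[i][j] for i in range(3)) - mu[j])
               for j in range(3))
```

For this chain the iteration settles on µ = (1/4, 1/2, 1/4), which indeed solves µ = µP.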

Lemma 9.4 (Existence). For any finite state Markov chain, there exists at least
one stationary distribution.

Proof. Let us consider the sequence of matrices

A_n = (1/n)(1 + P + … + P^n).

Since each matrix P^k is a transition matrix and its rows sum up to 1, the rows of
A_n sum up to (n + 1)/n, so its entries are all between 0 and 2. By the Bolzano–
Weierstrass theorem, there exists a convergent subsequence A_{n_k} → A. Notice that
the limit A is a transition matrix, because its entries must be nonnegative and its
rows add up to lim_{n→∞} (n + 1)/n = 1. If we consider
A_n P = (1/n)(P + … + P^n + P^{n+1}),
we see that the difference goes to the zero matrix,

A_n P − A_n = (1/n)(P^{n+1} − 1) → 0.
This implies that

A = lim_{k→∞} A_{n_k} = lim_{k→∞} A_{n_k} P = AP,
which shows that A = AP. This means that each row a of this matrix satisfies a = aP.
Since each row is a probability distribution on the states (its entries are nonnegative
and add up to one), each row is a stationary distribution. □

Notice that if the stationary distribution is unique, all the rows of A in the above
proof must be equal.
Next, we will show the following.

Lemma 9.5 (Uniqueness). If a Markov chain is irreducible then the stationary
distribution is unique.

Proof. The previous lemma shows that P − I is not invertible, since there exists a
non-zero solution of µ(P − I) = 0. Let us first show that, when P is the transition
matrix of an irreducible Markov chain, the rank of P − I is m − 1. Consider the
linear system (P − I)v^T = 0 for v = (v_1, …, v_m). Let us show that irreducibility
implies that

v_1 = v_2 = … = v_m.

Consider the largest coordinate of v, let us say v_1. The first equation in the system
reads

∑_{j=1}^m p_{1j} v_j = v_1.

On the other hand, since all v_j ≤ v_1 and p_{1j} v_j ≤ p_{1j} v_1,

v_1 = ∑_{j=1}^m p_{1j} v_j ≤ ∑_{j=1}^m p_{1j} v_1 = v_1.

So the inequality is actually an equality, and all p_{1j} v_j = p_{1j} v_1. This means that
v_j = v_1 for all j such that p_{1j} ≠ 0. In other words, all the states s_j that can be
reached from the state s_1 in one step must have v_j = v_1. Then, repeating the same
calculation for these immediate neighbours, we see that all their neighbours
must also have v_j = v_1. Since the chain is irreducible, we can reach all states in
finitely many steps, which proves that all v_j's are equal.
This proves that the kernel of P − I is one-dimensional,

ker(P − I) = {c(1, …, 1) : c ∈ R},   (9.18)

spanned by the vector (1, …, 1), which means that the rank of P − I equals m − 1.
On the other hand, if we had more than one stationary distribution, the system µ =
µP would have at least two linearly independent solutions, which would mean that
the rank of P − I would be at most m − 2. This proves that the stationary
distribution is unique. □

Next, we will derive a more descriptive formula for the stationary distribution of
irreducible Markov chains. First, let us give a heuristic, non-rigorous explanation
and then check carefully that our intuitive guess is correct. In the proof of Lemma
9.4 we considered the sequence of matrices

A_n = (1/n)(1 + P + … + P^n)
and showed that any limit A over some subsequence consists of rows which are
stationary distributions. For irreducible chains, we just showed that the stationary
distribution µ is unique, so we must have that
        ⎡ µ ⎤   ⎡ µ_1 ⋯ µ_m ⎤
    A = ⎢ ⋮ ⎥ = ⎢  ⋮  ⋱  ⋮  ⎥ .
        ⎣ µ ⎦   ⎣ µ_1 ⋯ µ_m ⎦

In particular, this means that the limits over all convergent subsequences are equal
to A and, therefore, the limit exists over the entire sequence,

    lim_{n→∞} (1/n)(1 + P + … + P^n) =
        ⎡ µ_1 ⋯ µ_m ⎤
        ⎢  ⋮  ⋱  ⋮  ⎥ .
        ⎣ µ_1 ⋯ µ_m ⎦

Since the (i, j)th entry of P^k is

(P^k)_{ij} = p_{ij}(k) = P(X_k = s_j | X_0 = s_i),

this can be written as

lim_{n→∞} (1/n) ∑_{k=1}^n P(X_k = s_j | X_0 = s_i) = µ_j.   (9.19)

If we multiply both sides by P(X_0 = s_i), sum over i ≤ m, and use that

∑_{i=1}^m P(X_k = s_j | X_0 = s_i) P(X_0 = s_i) = P(X_k = s_j),

we get

lim_{n→∞} (1/n) ∑_{k=1}^n P(X_k = s_j) = µ_j.   (9.20)
In other words, this equation holds no matter how we start the chain, i.e. no matter
what the distribution of X_0 is. Since

P(X_k = s_j) = E I(X_k = s_j),

by the linearity of expectation,

lim_{n→∞} E (1/n) ∑_{k=1}^n I(X_k = s_j) = µ_j.   (9.21)

Clearly, this equation can be interpreted by saying that µ_j is the expected propor-
tion of time the chain visits the state s_j, if we look at these proportions over long
periods of time.
There is another interpretation of the last equation. Since it does not matter how
we start the chain, let us start it in the state s_1. Then, let us divide the time between
1 and n into intervals between consecutive visits to the state s_1. Let us say that M
visits to s_1 occurred between 1 and n, with intervals T_1, T_2, …, T_M, so that

T_1 + … + T_M ≈ n

in the sense that their ratio is close to 1. Let N_1(j), …, N_M(j) be the numbers of
times we visited s_j in between these consecutive returns to s_1, so that

N_1(j) + … + N_M(j) ≈ ∑_{k=1}^n I(X_k = s_j).

Then

(1/n) ∑_{k=1}^n I(X_k = s_j) ≈ (N_1(j) + … + N_M(j)) / (T_1 + … + T_M).

If we know that the Markov chain is visiting s_1, the future does not depend on
the past because of the memoryless property. This suggests that what happens in
between consecutive visits is independent of each other and that the pairs

(N_1(j), T_1), …, (N_M(j), T_M)

are actually independent and identically distributed. To make this statement rigor-
ous, one needs to prove the so-called strong Markov property, which we are not
going to go into here. However, if we use it as a guiding intuition and divide both
the numerator and denominator above by M, the law of large numbers tells us that

(N_1(j) + … + N_M(j)) / (T_1 + … + T_M) ≈ E N_1(j) / E T_1.
This suggests another representation,

µ_j = E N_1(j) / E T_1,   (9.22)

where we assume that the chain starts in the state s_1,

T_1 = min{n ≥ 1 : X_n = s_1}   (9.23)

is the first return time to s_1, and

N_1(j) = ∑_{k=0}^{T_1 − 1} I(X_k = s_j)   (9.24)

is the number of times the chain visits s_j before returning to s_1.


The first return time T_1 is an example of a stopping time. In general, an integer-
valued random variable T ≥ 0 is called a stopping time if the event {T = n} de-
pends only on X_0, …, X_n. In other words, we decide whether to stop at time n based
on what was observed up to time n. The return time T_1 to s_1 is a stopping time in
this sense, because

{T_1 = n} = {X_1 ≠ s_1, …, X_{n−1} ≠ s_1, X_n = s_1}.

Notice also that N_1(1) = 1, because the chain is in the state s_1 only at time zero
before the return, so (9.22) implies that

µ_1 = 1 / E T_1.   (9.25)
Since it did not matter how we start the chain, if we start it at s_j and define

T_j = min{n ≥ 1 : X_n = s_j}   (9.26)

then the same logic suggests that

µ_j = 1 / E T_j.   (9.27)
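The representation (9.27) is easy to test by simulation. The following sketch is illustrative and not part of the text: the two-state chain is made up, and its stationary distribution µ = (5/6, 1/6) is found by solving µ = µP by hand.

```python
# Illustrative simulation of (9.25): the mean return time to s_1 should be
# close to 1/mu_1. The two-state chain is a made-up example.
import random

random.seed(0)
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu1 = 5 / 6                        # stationary probability of s_1 for this P

def step(i):
    """One transition from state i (states are 0 and 1)."""
    return 0 if random.random() < P[i][0] else 1

total, trials = 0, 20000
for _ in range(trials):
    i, t = step(0), 1              # start at s_1 and take the first step
    while i != 0:                  # walk until the chain returns to s_1
        i, t = step(i), t + 1
    total += t
avg_return = total / trials        # should be close to 1/mu1 = 1.2
```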

In order to prove the representation (9.22), we need to check two things. First,
we will need to check that E T_1 < ∞. Because the numbers of visits to different
states before time T_1 add up to T_1,

∑_{j=1}^m N_1(j) = T_1,

and therefore N_1(j) ≤ T_1, proving E T_1 < ∞ would also imply that E N_1(j) < ∞.
This would show that the numbers in (9.22) are well defined, and they define a
distribution because

∑_{j=1}^m µ_j = ∑_{j=1}^m E N_1(j) / E T_1 = E T_1 / E T_1 = 1.

After that we will need to check that µP = µ, so it is the stationary distribution.

Lemma 9.6. If a Markov chain is irreducible then E T_1 < ∞, where T_1, defined in
(9.23), is the first return time to the state s_1 for the chain that starts at s_1.

In the proof we will use the Markov property (9.2) in the following way. Let us
consider two vectors

X^1 = (X_1, …, X_{t_1}),   X^2 = (X_{t_1+1}, …, X_{t_2})

consisting of the Markov chain up to time t_1 and in between t_1 and t_2. Let us
consider some subsets

A ⊆ {s_1, …, s_m}^{t_1},   B ⊆ {s_1, …, s_m}^{t_2 − t_1}

of possible outcomes for these two blocks of the Markov chain, and consider the
conditional probability

P(X^2 ∈ B | X^1 ∈ A).
We cannot use the Markov property directly, because the condition is a set, but we
can rewrite this as

P(X^2 ∈ B | X^1 ∈ A) = P(X^2 ∈ B, X^1 ∈ A) / P(X^1 ∈ A)   (9.28)
  = ∑_{a∈A} P(X^2 ∈ B, X^1 = a) / P(X^1 ∈ A)
  = ∑_{a∈A} P(X^2 ∈ B | X^1 = a) P(X^1 = a) / P(X^1 ∈ A)
  = ∑_{a∈A} P(X^2 ∈ B | X_{t_1} = a_{t_1}) P(X^1 = a) / P(X^1 ∈ A),

where in the last step we used the Markov property to write the condition in terms
of the last outcome X_{t_1} = a_{t_1}. This is very useful because, for example, if we can
control the probability

P(X^2 ∈ B | X_{t_1} = a_{t_1}) ≤ p

for all possible outcomes X_{t_1} = a_{t_1} then the above equation implies that

P(X^2 ∈ B | X^1 ∈ A) ≤ p ∑_{a∈A} P(X^1 = a) / P(X^1 ∈ A) = p.   (9.29)

In other words, the Markov property can be used in this way even if the condition
is a set.
Proof (of Lemma 9.6). Since the chain is irreducible, if the chain is in the state s_i
at time k, X_k = s_i, we can reach the state s_1 in at most m steps, which means that

p_{i1}(ℓ) = P(X_{k+ℓ} = s_1 | X_k = s_i) > 0

for some ℓ ≤ m. This implies that the probability that we do not reach the state s_1
in the next m steps is strictly smaller than 1,

P(X_{k+1} ≠ s_1, …, X_{k+m} ≠ s_1 | X_k = s_i)
  ≤ P(X_{k+ℓ} ≠ s_1 | X_k = s_i) = 1 − P(X_{k+ℓ} = s_1 | X_k = s_i) < 1.

This is true for any s_i, so

P(X_{k+1} ≠ s_1, …, X_{k+m} ≠ s_1 | X_k = s_i) ≤ 1 − ε,   (9.30)

for some small enough ε > 0 and all i ≤ m.


Using the Markov property, this will imply that the probability that the chain does
not come back to s_1 in Nm steps (i.e. N cycles of m steps) is smaller than (1 − ε)^N,

P(T_1 > Nm) ≤ (1 − ε)^N.   (9.31)


To see this, let us write

P(T_1 > Nm) = P(X_1 ≠ s_1, …, X_{(N−1)m} ≠ s_1, …, X_{Nm} ≠ s_1)

and let us decompose this sequence into two blocks,

X^1 = (X_1, …, X_{(N−1)m}),   X^2 = (X_{(N−1)m+1}, …, X_{Nm}).

If we take

A = {s_2, …, s_m}^{(N−1)m},   B = {s_2, …, s_m}^m

to be the sets consisting of all the paths that avoid the state s_1 and have lengths
(N − 1)m and m, then

P(T_1 > Nm) = P(X^2 ∈ B, X^1 ∈ A) = P(X^2 ∈ B | X^1 ∈ A) P(X^1 ∈ A).

The equation (9.30) implies that, no matter what the value of X_{(N−1)m} is, the condi-
tional probability that all coordinates of X^2 avoid the state s_1 is smaller than 1 − ε.
By the discussion before the proof and (9.29), the Markov property implies

P(X^2 ∈ B | X^1 ∈ A) ≤ 1 − ε,

and, therefore,

P(T_1 > Nm) ≤ (1 − ε) P(X^1 ∈ A) = (1 − ε) P(T_1 > (N − 1)m).

By induction on N, the equation (9.31) follows.


Next, let us denote δ = (1 − ε)^{1/m} < 1 and consider any k ≥ 1. Take N ≥ 0 such
that Nm < k ≤ (N + 1)m. Then

P(T_1 ≥ k) ≤ P(T_1 > Nm) ≤ (1 − ε)^N = δ^{Nm} ≤ δ^{k − m}.

This implies that

E T_1 = ∑_{k=1}^∞ P(T_1 ≥ k) ≤ ∑_{k=1}^∞ δ^{k − m} = δ^{1 − m} / (1 − δ) < ∞,

which finishes the proof. □

Recall the equations (9.22), (9.23), (9.24):

µ_j = E N_1(j) / E T_1,   (9.32)

T_1 = min{n ≥ 1 : X_n = s_1},   (9.33)

N_1(j) = ∑_{k=0}^{T_1 − 1} I(X_k = s_j),   (9.34)

where we assumed that the chain starts in the state s1 . To check that our heuristic
discussion above gives the correct answer, it remains to prove the following.

Theorem 9.1 (Representation of stationary distribution). If a Markov chain is
irreducible then µ defined in (9.32) is the stationary distribution.

Proof. Showing that µ = µP is equivalent to showing

E N_1(j) = ∑_{i=1}^m p_{ij} E N_1(i),   (9.35)

because the two equations only differ by a factor of 1/E T_1. To prove (9.35), first
let us consider j ≠ 1. Because the chain starts at s_1, we have X_0 = s_1 ≠ s_j and, in
the definition of N_1(j), we can start the summation from k = 1,

N_1(j) = ∑_{k=0}^{T_1 − 1} I(X_k = s_j) = ∑_{k=1}^{T_1 − 1} I(X_k = s_j).

We can also write this as

N_1(j) = ∑_{k=1}^∞ I(X_k = s_j, k < T_1) = ∑_{k=1}^∞ I(X_k = s_j, k − 1 < T_1),   (9.36)

where we used that

{X_k = s_j, k < T_1} = {X_k = s_j, k − 1 < T_1},

because X_k = s_j ≠ s_1, so it is impossible that T_1 = k. When j = 1, the equation
(9.36) also holds, for a different reason. In this case, N_1(1) = 1 by definition, while
the right-hand side of (9.36) equals 1, because k − 1 < T_1 is the same as k ≤ T_1
and, between times 1 and T_1, the chain visits s_1 once, at k = T_1.
If we take expectations of both sides of (9.36),

EN1 ( j) = Â P(Xk = s j , k 1 < T1 ) (9.37)
k=1
• m
= Â Â P(Xk 1 = si , Xk = s j , k 1 < T1 ),
k=1 i=1

where we also partitioned the event into the different outcomes X_{k−1} = s_i. The key
observation we need to make is that the event

{k − 1 < T_1} = {X_1 ≠ s_1, …, X_{k−1} ≠ s_1}
depends only on the random variables X_1, …, X_{k−1} up to time k − 1. Let us write
one term in the above sum by conditioning on the first k − 1 random variables,

P(X_{k−1} = s_i, X_k = s_j, k − 1 < T_1)
  = P(X_k = s_j | X_{k−1} = s_i, k − 1 < T_1) P(X_{k−1} = s_i, k − 1 < T_1).

If we use the Markov property in the form (9.28), we can drop the condition
k − 1 < T_1 and rewrite this as

P(X_{k−1} = s_i, X_k = s_j, k − 1 < T_1)
  = P(X_k = s_j | X_{k−1} = s_i) P(X_{k−1} = s_i, k − 1 < T_1)
  = p_{ij} P(X_{k−1} = s_i, k − 1 < T_1).

Plugging this back into (9.37) and rearranging terms,

E N_1(j) = ∑_{k=1}^∞ ∑_{i=1}^m p_{ij} P(X_{k−1} = s_i, k − 1 < T_1)
        = ∑_{i=1}^m p_{ij} ∑_{k=1}^∞ P(X_{k−1} = s_i, k − 1 < T_1)
        = ∑_{i=1}^m p_{ij} ∑_{ℓ=0}^∞ P(X_ℓ = s_i, ℓ < T_1)       (substituting ℓ = k − 1)
        = ∑_{i=1}^m p_{ij} E ∑_{ℓ=0}^{T_1 − 1} I(X_ℓ = s_i) = ∑_{i=1}^m p_{ij} E N_1(i).

This proves (9.35) and finishes the proof. □

Exercise 9.2.1. Show that if a Markov chain has two different stationary distribu-
tions then there exist infinitely many stationary distributions.
Exercise 9.2.2. Find the stationary distribution of the Markov chain with the
transition matrix

        ⎡  0   0.5  0.5 ⎤
    P = ⎢  0    0    1  ⎥ .
        ⎣ 0.5  0.5   0  ⎦

If the chain starts at s_2, what is the expectation of the first return time to s_2?
Exercise 9.2.3. If µ = (µ_1, …, µ_m) is the stationary distribution of an irreducible
Markov chain, show that µ_i ≠ 0 for all 1 ≤ i ≤ m.
Exercise 9.2.4 (Another proof of uniqueness). Consider an irreducible Markov
chain and suppose that there exist two stationary distributions µ^1 and µ^2. Let j be
any minimizer

µ_j^1 / µ_j^2 = min_{i ≤ m} µ_i^1 / µ_i^2.

Show that

µ_j^1 / µ_j^2 = µ_i^1 / µ_i^2

for any i such that p_{ij} > 0. From this, derive that µ^1 = µ^2.
Exercise 9.2.5. If p_{1j} = p_{2j} for all j and

Y_n = { s_j, if X_n = s_j for j ≥ 3,
      { s_*, if X_n = s_j for j ≤ 2,

show that Y_1, Y_2, … is a Markov chain on the new state space S = {s_*, s_3, …, s_m}.
Exercise 9.2.6. If a Markov chain is irreducible, prove that all moments E T_1^p < ∞
for p ≥ 1, where T_1 is defined in (9.23).
Exercise 9.2.7. In Toronto, it rains (or snows) on about 38% of days per year. Let
us consider a Markov chain model of weather with two states s_1 = 'rain', s_2 = 'no
rain', and the transition matrix

    P = ⎡  p    1 − p ⎤
        ⎣ 1 − q    q  ⎦ ,

for some p, q ∈ (0, 1), so that if it rains today then it will rain tomorrow with
probability p, and if it does not rain today then it will not rain tomorrow with
probability q. Find all p and q for which the expected proportion of rainy days over
long periods of time is 0.38.
Exercise 9.2.8. Suppose that the state s_1 is inessential and let S_1 be the cluster of
states that communicate with s_1. Suppose that the chain starts at s_1, and let T be
the first time we exit the cluster S_1,

T = min{n ≥ 1 : X_n ∉ S_1}.

Prove that E T < ∞.


Exercise 9.2.9. Let us consider an irreducible Markov chain X_0, X_1, …, and let
A ⊆ S be a non-empty subset of its states. If the chain starts at s_i then let

T_A(s_i) = min{n ≥ 0 : X_n ∈ A}

be the first time the chain visits one of the states in A, and let t_i = E T_A(s_i) be its
expectation. Prove that
(a) If s_i ∈ A then t_i = 0.
(b) If s_i ∉ A then t_i = 1 + ∑_{j=1}^m p_{ij} t_j.
(c) The equations in parts (a) and (b) determine t = (t_1, …, t_m) uniquely. Hint:
consider equation (9.18).

9.3 Convergence theorem

In Lemma 9.4 in the last section we considered the sequence of matrices

A_n = (1/n)(1 + P + … + P^n)

and showed that any subsequential limit A consists of rows which are stationary
distributions. For irreducible chains, we also showed that the stationary distribution
µ is unique and, therefore,

    lim_{n→∞} (1/n)(1 + P + … + P^n) =
        ⎡ µ_1 ⋯ µ_m ⎤
        ⎢  ⋮  ⋱  ⋮  ⎥ .
        ⎣ µ_1 ⋯ µ_m ⎦

Can we expect the stronger statement that

    lim_{n→∞} P^n =
        ⎡ µ_1 ⋯ µ_m ⎤
        ⎢  ⋮  ⋱  ⋮  ⎥ ?
        ⎣ µ_1 ⋯ µ_m ⎦

First, we need to check that this is, indeed, a stronger statement.

Exercise 9.3.1. If the limit lim_{n→∞} P^n exists then the limit
lim_{n→∞} (1/n)(1 + P + … + P^n) exists and is the same.

It is easy to see that this cannot be true in general. For example, if a Markov
chain has two states s_1, s_2 and the transition probabilities p_{12} = p_{21} = 1 then

    P^{2n} = ⎡ 1  0 ⎤ ,    P^{2n+1} = ⎡ 0  1 ⎤ .
             ⎣ 0  1 ⎦                 ⎣ 1  0 ⎦

Another way to put it is that, if we start the chain in the state si , it can be back in
si only at even times. The problem with this example is that this chain has period
2, which results in this periodic behaviour. However, if the chain is aperiodic then
the convergence holds.

Theorem 9.2 (Convergence Theorem). If a Markov chain is irreducible and aperi-
odic then

    lim_{n→∞} P^n =
        ⎡ µ_1 ⋯ µ_m ⎤
        ⎢  ⋮  ⋱  ⋮  ⎥ ,    (9.38)
        ⎣ µ_1 ⋯ µ_m ⎦

where µ = (µ_1, …, µ_m) is its stationary distribution.

Notice that this is equivalent to the following statement: if ν = (ν_1, …, ν_m) is any
distribution on the set of states then

lim_{n→∞} νP^n = µ.   (9.39)

In one direction, multiplying the right-hand side of (9.38) by ν on the left obviously
gives µ. In the other direction, if we take ν = (0, …, 1, …, 0), where the ith
coordinate is 1, then

lim_{n→∞} νP^n = µ

is the ith row of the limit lim_{n→∞} P^n, so all the rows of this limit are equal to µ.
Recall that if ν is the initial distribution of the chain, i.e. the distribution of X_0,
then νP^n is the distribution of X_n. Hence, the convergence theorem in the form
(9.39) states that the distribution of X_n converges to the stationary distribution µ
no matter what the initial distribution is.
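Numerically, the convergence (9.38) is easy to observe; in the sketch below (illustrative, with a made-up chain) every row of P^100 agrees with µ to many digits.

```python
# Illustrative check of (9.38): P^n approaches the matrix with all rows mu.
# The two-state chain and its stationary distribution are a made-up example.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = [5 / 6, 1 / 6]               # stationary distribution of this chain

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pn = P
for _ in range(99):               # Pn = P^100
    Pn = mat_mul(Pn, P)

# Largest deviation of any entry of P^100 from the corresponding mu_j.
row_gap = max(abs(Pn[i][j] - mu[j]) for i in range(2) for j in range(2))
```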

Proof (Theorem 9.2). We will prove the formula (9.39). Let us consider the
L^1-distance between vectors in R^m,

‖x − y‖_1 = ∑_{i=1}^m |x_i − y_i|.

Let us notice that multiplying by any transition matrix P on the right does not
increase this distance, because

‖xP − yP‖_1 = ∑_{j=1}^m |∑_{i=1}^m p_{ij}(x_i − y_i)| ≤ ∑_{j=1}^m ∑_{i=1}^m p_{ij} |x_i − y_i|
           = ∑_{i=1}^m ∑_{j=1}^m p_{ij} |x_i − y_i|
           = ∑_{i=1}^m |x_i − y_i| = ‖x − y‖_1.

If we apply this to the distributions ν and µ and use that µ = µP, we get that

‖νP − µ‖_1 ≤ ‖ν − µ‖_1.

This suggests that the distributions get closer to each other after one step but, in
order to prove convergence, we would like to have a strict inequality and then
iterate it.
This is where the assumption that our chain is aperiodic comes into play. By
Lemma 9.3 in Section 9.1, for large enough N, all entries p_{ij}(N) of P^N are strictly
positive. Let ε > 0 be the smallest among its entries. If we apply the above calcula-
tion to the transition matrix P^N and the distributions ν and µ, the improvement
comes from the fact that

∑_{i=1}^m ε(ν_i − µ_i) = ε(1 − 1) = 0
and, therefore,

∑_{i=1}^m p_{ij}(N)(ν_i − µ_i) = ∑_{i=1}^m (p_{ij}(N) − ε)(ν_i − µ_i).

Notice that all p_{ij}(N) − ε ≥ 0, because ε was the smallest among all entries, so

‖νP^N − µP^N‖_1 = ∑_{j=1}^m |∑_{i=1}^m p_{ij}(N)(ν_i − µ_i)|
              = ∑_{j=1}^m |∑_{i=1}^m (p_{ij}(N) − ε)(ν_i − µ_i)|
              ≤ ∑_{j=1}^m ∑_{i=1}^m (p_{ij}(N) − ε)|ν_i − µ_i|
              = ∑_{i=1}^m ∑_{j=1}^m (p_{ij}(N) − ε)|ν_i − µ_i|
              = (1 − mε) ∑_{i=1}^m |ν_i − µ_i|.

Thus, ‖νP^N − µP^N‖_1 ≤ (1 − mε)‖ν − µ‖_1 and, by induction,

‖νP^{Nk} − µP^{Nk}‖_1 ≤ (1 − mε)^k ‖ν − µ‖_1.

For powers in between the multiples of N, of the form Nk + ℓ for ℓ ≤ N, we use

‖νP^{Nk+ℓ} − µP^{Nk+ℓ}‖_1 ≤ (1 − mε)^k ‖νP^ℓ − µP^ℓ‖_1 ≤ (1 − mε)^k ‖ν − µ‖_1.

This proves that ‖νP^n − µP^n‖_1 → 0 for any two distributions ν and µ. When µ is
the stationary distribution, µP^n = µ, so ‖νP^n − µ‖_1 → 0. This finishes the proof. □

Exercise 9.3.2. Given p ∈ (0, 1), consider a Markov chain with the transition prob-
abilities

p_{i,i−1} = 1 − p,   p_{i,i+1} = p   for i = 2, …, m − 1,

and p_{1,1} = 1 − p, p_{1,2} = p, p_{m,m−1} = 1 − p, p_{m,m} = p. What is the limit
lim_{n→∞} P^n? Hint: solve the system µ = µP explicitly.

Exercise 9.3.3. Show that if a Markov chain is irreducible and aperiodic then, for
any function f : S → R, the expected value E f(X_n) converges.

Exercise 9.3.4. Suppose that m people sit at a round table and all have plates
with rice in front of them. At the same time, each person divides his or her rice in
half and evens out one half with the half of the person on the right (i.e. they split
the total of their halves evenly), and the other half with the person on the left. If they
keep repeating this, what will happen in the long run?

Exercise 9.3.5. Let us consider the L^∞ norm on R^m,

‖x − y‖_∞ = max_{i ≤ m} |x_i − y_i|.

If P is a Markov transition matrix, show that

‖Px − Py‖_∞ ≤ ‖x − y‖_∞.



9.4 Reversible Markov chains

Let us consider a Markov chain with the transition matrix P. A probability distri-
bution µ = (µ_1, …, µ_m) on the set of states is called reversible for the chain if

µ_i p_{ij} = µ_j p_{ji}   (9.40)

for all i and j. A Markov chain is called reversible if it has a reversible distribution.
The equations (9.40) are called the detailed balance equations. If we start the chain
with the distribution µ then (9.40) states that the first two outcomes (s_i, s_j) and
(s_j, s_i) are equally likely. Moreover,

P(X_0 = s_i, X_1 = s_j, X_2 = s_k) = µ_i p_{ij} p_{jk}
  = p_{ji} µ_j p_{jk} = p_{ji} p_{kj} µ_k = P(X_0 = s_k, X_1 = s_j, X_2 = s_i),

which means that the first three outcomes (s_i, s_j, s_k) and (s_k, s_j, s_i) are equally
likely. The same calculation shows that any sequence of outcomes and the same
sequence in reverse order are equally likely, hence the name reversible. It is easy
to check that a reversible distribution is stationary.
Lemma 9.7. A reversible distribution µ is stationary.
Proof. Summing (9.40) over i, we get

∑_{i=1}^m µ_i p_{ij} = µ_j ∑_{i=1}^m p_{ji} = µ_j,

which finishes the proof. □
There are a number of examples where the equations (9.40) are easy to guess
or easy to check, so this is one good way to find a stationary distribution. Let us
consider some classical examples of reversible Markov chains.
Example 9.4.1 (Birth-and-death processes). Let us consider a Markov chain that
in one step moves from a state s_i only to s_{i+1} or s_{i−1}. Of course, since our state
space is finite, this means that from s_1 it only moves up to s_2 and from s_m it only
moves down to s_{m−1}. The chain can also stay in each state s_i. Another way to say
this is that

p_{ij} > 0 if |i − j| = 1, and p_{ij} = 0 if |i − j| > 1.
The probabilities p_{ii} may be positive or zero. Such Markov chains are called birth-
and-death chains, and they are all reversible. To find a reversible distribution, let
us set µ_1 = x as an unknown variable. If (9.40) holds then

µ_2 = µ_1 p_{12} / p_{21} = (p_{12} / p_{21}) x.

Given µ_2, we can find µ_3,

µ_3 = µ_2 p_{23} / p_{32} = (p_{12} p_{23}) / (p_{21} p_{32}) x,

and we can continue, by induction,

µ_i = (p_{12} / p_{21}) ⋯ (p_{i−1,i} / p_{i,i−1}) x.
Since all µ_i are of the form c_i x for some positive c_i > 0, to find x we can use that
the probabilities add up to 1, so

x = 1 / (c_1 + … + c_m).
Of course, the same calculation works for any chain whose transition graph
looks like a single path between two 'endpoint' vertices, and we can relabel the
states of such a chain to turn it into a formal birth-and-death chain. □

Example 9.4.2 (The Ehrenfest model of diffusion). One example of a birth-and-
death Markov chain is Ehrenfest's model of diffusion. Suppose there are m par-
ticles inside two connected containers. The particles can move between the two
containers, and suppose that the next particle that moves is chosen uniformly at
random. This process can be described by a Markov chain with m + 1 states s_i for
i = 0, …, m, where the system is in the state s_i if there are i particles in the first
container and m − i in the second. The transition probabilities are given by

p_{i,i−1} = i/m,   p_{i,i+1} = (m − i)/m.   (9.41)
Since this is a birth-and-death chain, it is reversible, and it turns out that its re-
versible distribution is the Binomial B(m, 1/2) distribution,

µ_i = \binom{m}{i} 2^{−m}.

We will leave it as an exercise to check that the detailed balance equations (9.40)
are satisfied. □
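The exercise can also be checked numerically; the sketch below (illustrative) verifies the detailed balance equations for the binomial weights with a small m.

```python
# Illustrative check that Binomial(m, 1/2) satisfies detailed balance (9.40)
# for the Ehrenfest transition probabilities (9.41).
from math import comb

m = 6
mu = [comb(m, i) / 2**m for i in range(m + 1)]   # binomial weights

def p(i, j):
    """Transition probabilities (9.41); zero off the two diagonals."""
    if j == i - 1:
        return i / m
    if j == i + 1:
        return (m - i) / m
    return 0.0

# mu_i p_{i,i+1} should equal mu_{i+1} p_{i+1,i} for every i.
gap = max(abs(mu[i] * p(i, i + 1) - mu[i + 1] * p(i + 1, i)) for i in range(m))
```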

Example 9.4.3 (Random walks on graphs). Let us consider a connected graph G
on the set of vertices {s_1, …, s_m}. This means that all vertices are connected by
a path on edges of the graph. Let N_i be the number of neighbours of s_i. Let us
consider a Markov chain with the transition probabilities

p_{ij} = { 1/N_i,  if s_j is a neighbour of s_i,   (9.42)
        { 0,      otherwise.

In other words, in each state s_i the chain picks a neighbour uniformly at random
and moves there. A distribution µ is reversible for this chain if

µ_i / N_i = µ_j / N_j

for all neighbours s_i and s_j.

Therefore, all the ratios µ_i / N_i must be equal to the same constant c, so µ_i = cN_i.
Since the probabilities must add up to 1, the constant c = 1/N, where
N = N_1 + … + N_m, and

µ_i = N_i / N.   (9.43)
Since the graph is connected, the Markov chain is irreducible and µ is the unique
stationary distribution. □
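The formula (9.43) can be verified directly on a small example; the graph below is made up for illustration.

```python
# Illustrative check of (9.43): mu_i = N_i / N is stationary for the random
# walk (9.42) on a small made-up graph (edges 0-1, 0-2, 1-2, 1-3).
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

deg = {v: len(nbrs) for v, nbrs in graph.items()}   # N_i for each vertex
N = sum(deg.values())
mu = [deg[v] / N for v in range(4)]                 # candidate from (9.43)

P = [[(1 / deg[i] if j in graph[i] else 0.0) for j in range(4)]
     for i in range(4)]                             # transition matrix (9.42)

# How far mu is from solving mu = mu P.
stationarity_gap = max(
    abs(sum(mu[i] * P[i][j] for i in range(4)) - mu[j]) for j in range(4))
```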

The most important examples of reversible Markov chains appear in the con-
text of Markov Chain Monte Carlo (MCMC) algorithms. Let us describe one such
algorithm.

Example 9.4.4 (Metropolis algorithm). As in the previous example, let us con-
sider a connected graph G on the set of vertices S = {s_1, …, s_m}. Suppose that we
are interested in some probability distribution µ on S, but µ is defined by a compli-
cated model that makes the computation of its coordinates µ_i impractical. On the
other hand, for any neighbours s_i and s_j on the graph, µ_i and µ_j are related in some
simple way, so that we know the ratio

r_{ij} = µ_i / µ_j
without knowing µ_i and µ_j. Of course, given r_{ij}, the distribution µ can in principle
be reconstructed by setting µ_1 = x as an unknown parameter, then finding µ_i = r_{i1} x
in terms of x for all its neighbours, and propagating this through the graph to express
all µ_i = c_i x in terms of x. Then we find

x = 1 / (c_1 + … + c_m),

because the probabilities must add up to one. However, in many applications the
graph is so large (say, m = 2^{20}) that it is not even feasible to add up m numbers.
If we only know the above ratios r_{ij}, how can we compute, for example, the
expectation of some function f : S → R with respect to the distribution µ,

E_µ f := ∑_{i=1}^m f(s_i) µ_i,

without computing µ? The idea of the Markov Chain Monte Carlo method is
to combine the law of large numbers with the convergence theorem for Markov
chains. Suppose that we can construct a Markov chain with the stationary distri-
bution µ whose transition probabilities depend only on the ratios r_{ij}. Then we
can pick the initial state at random and run the chain for a large number of steps,
n, to obtain a random variable X_n with distribution close to µ. If we repeat
9.4 Reversible Markov chains 279

this procedure N times, we will get N i.i.d. random variables Xn1 , . . . , XnN with the
distribution close to µ. By the law of large numbers,

\[
\frac{1}{N}\sum_{k=1}^N f(X_n^k) \approx E f(X_n^1) \approx \sum_{i=1}^m f(s_i)\mu_i.
\]

In other words, this procedure allows us to approximate the expectation of $f$ with respect to the distribution $\mu$ when it is not feasible to compute $\mu$ directly.
Here is a classical construction of a Markov chain, called the Metropolis chain, with the stationary distribution $\mu$ and transition probabilities that depend only on the ratios $r_{ij}$. Let us write $i \sim j$ to denote that $s_i$ and $s_j$ are neighbours. For each vertex $s_i$, we define
\[
p_{ij} = \frac{1}{N_i}\min\Big(\frac{\mu_j N_i}{\mu_i N_j},\, 1\Big) \tag{9.44}
\]
if $j \sim i$, and define
\[
p_{ii} = 1 - \sum_{j\sim i} p_{ij}. \tag{9.45}
\]

All other $p_{ij} = 0$. Notice that $p_{ii} \geq 0$, because
\[
\sum_{j\sim i} p_{ij} \leq \sum_{j\sim i} \frac{1}{N_i} = 1,
\]

so this definition makes sense. Notice also that the transition probabilities depend only on the ratios $r_{ij}$.
To show that $\mu$ is the stationary distribution of this Markov chain, we will check that $\mu$ satisfies the detailed balance equations (9.40). For $i = j$ there is nothing to check, and if $i \not\sim j$ then both sides are zero, so we only need to check this for $i \sim j$. If $\mu_j N_i \geq \mu_i N_j$ then, by (9.44),
\[
p_{ij} = \frac{1}{N_i} \quad\text{and}\quad p_{ji} = \frac{1}{N_j}\,\frac{\mu_i N_j}{\mu_j N_i} = \frac{\mu_i}{\mu_j N_i},
\]
which immediately implies (9.40). The case $\mu_j N_i \leq \mu_i N_j$ is exactly the same, so the Metropolis chain has the stationary distribution $\mu$.
If at least one $p_{ii} > 0$ then the chain is aperiodic. If the chain is periodic, a slight modification will make it aperiodic without affecting the stationary distribution $\mu$ (see the exercise below). Then the convergence theorem applies and we can use this chain to produce an i.i.d. sample with the distribution close to $\mu$. ⊓⊔
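The Metropolis construction can be sketched in a few lines. The graph and target distribution below are hypothetical, chosen only to test the construction (not from the text); note that the transition matrix is built from the ratios $\mu_j/\mu_i$ alone, and the target $\mu$ is used afterwards only to verify detailed balance:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical connected graph
m = 4
mu = np.array([0.1, 0.2, 0.3, 0.4])        # target distribution (for testing)

neighbors = {i: [] for i in range(m)}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

# Metropolis chain (9.44)-(9.45): only the ratios mu_j / mu_i are used.
P = np.zeros((m, m))
for i in range(m):
    Ni = len(neighbors[i])
    for j in neighbors[i]:
        Nj = len(neighbors[j])
        P[i, j] = min((mu[j] / mu[i]) * Ni / Nj, 1.0) / Ni
    P[i, i] = 1 - P[i].sum()

assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1)
assert np.allclose(mu[:, None] * P, (mu[:, None] * P).T)  # detailed balance
assert np.allclose(mu @ P, mu)                            # stationarity
```

Detailed balance holds here because $\mu_i p_{ij} = \min(\mu_i/N_i,\, \mu_j/N_j)$, which is symmetric in $i$ and $j$.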

Exercise 9.4.1. Show that if $\mu$ is a reversible distribution for $P$, it is also a reversible distribution for $\lambda I + (1-\lambda)P$ for any $\lambda \in [0, 1]$.

Exercise 9.4.2. Show that if $\mu$ is a reversible distribution for $P$, it is also a reversible distribution for $P^n$.

Exercise 9.4.3. Show that a birth-and-death Markov chain is aperiodic if and only if at least one $p_{ii} > 0$.

Exercise 9.4.4. Show that a Markov chain with the following transition matrix is not reversible:
\[
P = \begin{bmatrix} 0 & 0.75 & 0.25 \\ 0.25 & 0 & 0.75 \\ 0.75 & 0.25 & 0 \end{bmatrix}.
\]

Exercise 9.4.5. Check that the Binomial distribution $B(m, \frac{1}{2})$ satisfies the detailed balance equations (9.40) for the Ehrenfest chain (9.41).

Exercise 9.4.6. Suppose that two containers contain a total of $m$ white balls and $m$ black balls. At each step a ball is chosen at random from all $2m$ balls and put into the other container. We say that the system is in the state $s_i$ if there are $i$ white balls in the first container. Find the transition probabilities of this chain and compute $\lim_{n\to\infty} P^n$.

Exercise 9.4.7. Consider the Ehrenfest chain (9.41) and let $D_n$ be the difference of the numbers of particles in the two containers at time $n$. This means that if the chain is in the state $s_i$ at time $n$ then $D_n = 2i - m$. Prove that
\[
E D_{n+1} = \frac{m-2}{m}\,E D_n, \quad\text{so that}\quad E D_n = \Big(\frac{m-2}{m}\Big)^n E D_0.
\]
Exercise 9.4.8. Consider a random walk on the following graph:

[Figure: a graph on the vertices $s_1, \ldots, s_6$.]
If we start the chain at $s_1$, what is the expected time of the first return to $s_1$? What is the expected number of visits to $s_4$ before the first return to $s_1$? Hint: recall the results in Section 9.2.

Chapter 10
Gaussian Distributions

10.1 Gaussian integration by parts, interpolation, and concentration

In this section, we will describe several Gaussian techniques, such as Gaussian integration by parts, interpolation, and concentration. We will begin with Gaussian integration by parts. Let $g$ be a centered Gaussian random variable with variance $v^2$ and let us denote the density function of its distribution by
\[
\varphi_v(x) = \frac{1}{\sqrt{2\pi}\,v}\exp\Big(-\frac{x^2}{2v^2}\Big). \tag{10.1}
\]

Since $x\varphi_v(x) = -v^2\varphi_v'(x)$, given a differentiable function $F : \mathbb{R} \to \mathbb{R}$, we can formally integrate by parts,
\[
E gF(g) = \int_{-\infty}^{+\infty} xF(x)\varphi_v(x)\,dx
= -v^2 F(x)\varphi_v(x)\Big|_{-\infty}^{+\infty} + v^2\int F'(x)\varphi_v(x)\,dx
= v^2\int F'(x)\varphi_v(x)\,dx = v^2 E F'(g),
\]
if the limits $\lim_{x\to\pm\infty} F(x)\varphi_v(x) = 0$ and the expectations on both sides are finite. In fact, this formula holds under the assumption that $E|F'(g)| < \infty$, i.e. the right hand side is well defined.

Lemma 10.1 (Gaussian integration by parts). If $g \sim N(0, v^2)$ then
\[
E gF(g) = E g^2\, E F'(g) = v^2 E F'(g), \tag{10.2}
\]
whenever $E|F'(g)| < \infty$.

Proof. By making the change of variables $Z = g/v$ and considering the function $f(x) = F(vx)$, it is enough to prove that $Z \sim N(0, 1)$ satisfies
\[
E Zf(Z) = E f'(Z), \tag{10.3}
\]


whenever $E|f'(Z)| < \infty$. Let us start by writing
\[
E f'(Z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} f'(z)e^{-z^2/2}\,dz
= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} f'(z)e^{-z^2/2}\,dz + \frac{1}{\sqrt{2\pi}}\int_{0}^{+\infty} f'(z)e^{-z^2/2}\,dz.
\]

If we represent $e^{-z^2/2} = -\int_{-\infty}^{z} x e^{-x^2/2}\,dx$ for $z \leq 0$ then
\[
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} f'(z)e^{-z^2/2}\,dz
= -\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0}\!\int_{-\infty}^{z} f'(z)\,x e^{-x^2/2}\,dx\,dz
= -\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0}\!\int_{x}^{0} f'(z)\,x e^{-x^2/2}\,dz\,dx
= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0}\big(f(x) - f(0)\big)\,x e^{-x^2/2}\,dx,
\]
where we could switch the order of integration, by Fubini's theorem, because the integrand is absolutely integrable over the region $x \leq z \leq 0$ by the assumption that $E|f'(Z)| < \infty$. Similarly, representing
\[
e^{-z^2/2} = \int_{z}^{\infty} x e^{-x^2/2}\,dx \quad\text{for } z \geq 0,
\]

we can write the second integral as
\[
\frac{1}{\sqrt{2\pi}}\int_{0}^{+\infty} f'(z)e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}\big(f(x) - f(0)\big)\,x e^{-x^2/2}\,dx.
\]

Adding up the two integrals, we get $E f'(Z) = E Zf(Z) - f(0)\,E Z = E Zf(Z)$, which finishes the proof. ⊓⊔
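As a quick sanity check (not from the text), the identity (10.2) can be verified by Monte Carlo for, say, $F(x) = x^3$, where both sides equal $3v^4$:

```python
import numpy as np

# Monte Carlo check of Gaussian integration by parts (Lemma 10.1)
# for F(x) = x^3: E[g F(g)] = E[g^4] = 3 v^4 and v^2 E[F'(g)] = 3 v^4.
rng = np.random.default_rng(0)
v = 1.5
g = v * rng.standard_normal(2_000_000)

lhs = np.mean(g * g**3)          # E[g F(g)]
rhs = v**2 * np.mean(3 * g**2)   # v^2 E[F'(g)]
exact = 3 * v**4

assert abs(lhs - exact) < 0.2 and abs(rhs - exact) < 0.2
print(lhs, rhs, exact)
```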

This computation can be generalized to Gaussian vectors. Let $g = (g_\ell)_{1\leq\ell\leq n}$ be a Gaussian vector with the covariance $C$, $g \sim N(0, C)$, and consider a differentiable function $F = F((x_\ell)_{1\leq\ell\leq n}) : \mathbb{R}^n \to \mathbb{R}$. Let us denote
\[
F_\ell = \frac{\partial F}{\partial x_\ell}. \tag{10.4}
\]
Then the following holds.

Lemma 10.2 (Gaussian integration by parts). If $g \sim N(0, C)$ is a centered Gaussian vector on $\mathbb{R}^n$ then
\[
E g_1 F(g) = \sum_{\ell\leq n} E(g_1 g_\ell)\, E F_\ell(g), \tag{10.5}
\]
if, for all $\ell \leq n$, either $E|F_\ell(g)| < \infty$ or $E g_1 g_\ell = 0$.



Proof. If $v^2 = E g_1^2$ then the Gaussian vector $g' = (g'_\ell)_{1\leq\ell\leq n}$ defined by
\[
g'_\ell = g_\ell - \lambda_\ell g_1, \quad\text{where } \lambda_\ell = v^{-2}\, E g_1 g_\ell, \tag{10.6}
\]
is independent of $g_1$, since $E g_1 g'_\ell = E g_1 g_\ell - \lambda_\ell v^2 = 0$. If we denote $\lambda = (\lambda_\ell)_{1\leq\ell\leq n}$ then we can write $g = g' + g_1\lambda$. If $E_1$ denotes the expectation in $g_1$ only then using (10.2) conditionally on $g'$ implies that
\[
E_1 g_1 F(g) = E_1 g_1 F(g' + g_1\lambda) = v^2\, E_1 \frac{\partial F}{\partial x}(g' + x\lambda)\Big|_{x=g_1} \tag{10.7}
\]
if the right hand side is well defined. Since
\[
\frac{\partial F}{\partial x}(g' + x\lambda)\Big|_{x=g_1} = \sum_{\ell\leq n} \lambda_\ell \frac{\partial F}{\partial x_\ell}(g' + x\lambda)\Big|_{x=g_1} = \sum_{\ell\leq n} \lambda_\ell F_\ell(g), \tag{10.8}
\]
our assumption that, for all $\ell \leq n$, either $E|F_\ell(g)| < \infty$ or $\lambda_\ell = 0$ implies (10.7). Moreover, by this assumption, (10.7) is absolutely integrable and integrating in $g'$ finishes the proof. ⊓⊔
We will encounter two types of examples: when the function $F = F((x_\ell)_{1\leq\ell\leq n})$ and all its partial derivatives are bounded, or when they grow at most exponentially fast, that is, for some constants $c_1, c_2 > 0$ and all $\ell \leq n$,
\[
\Big|\frac{\partial F}{\partial x_\ell}(x)\Big| \leq c_1 e^{c_2|x|}. \tag{10.9}
\]

Most commonly, Gaussian integration by parts will be used in various calculations involving Gaussian interpolation. Consider two independent Gaussian random vectors $X = (X_i)_{i\leq n}$ and $Y = (Y_i)_{i\leq n}$ with the covariances
\[
a_{i,j} = E X_i X_j \quad\text{and}\quad b_{i,j} = E Y_i Y_j. \tag{10.10}
\]
Since $X$ and $Y$ are independent, for $0 \leq t \leq 1$,
\[
Z(t) = \sqrt{t}\,X + \sqrt{1-t}\,Y \tag{10.11}
\]
is a Gaussian random vector $Z(t) = (Z_i(t))_{i\leq n}$ with the covariance
\[
E Z_i(t)Z_j(t) = t a_{i,j} + (1-t) b_{i,j},
\]
which corresponds to a linear interpolation between the covariances of $X$ and $Y$.


Let us consider a function
\[
f(t) = E F(Z(t)) = E F\big(\sqrt{t}\,X + \sqrt{1-t}\,Y\big). \tag{10.12}
\]
The end points of this interpolation are $f(0) = E F(Y)$ and $f(1) = E F(X)$, and the derivative along the interpolation can be computed as follows.

Lemma 10.3 (Gaussian interpolation). If for some $c_1, c_2 > 0$ and all $1 \leq i, j \leq n$,
\[
\Big|\frac{\partial F}{\partial x_i}(x)\Big| \leq c_1 e^{c_2|x|} \quad\text{and, either}\quad \Big|\frac{\partial^2 F}{\partial x_i \partial x_j}(x)\Big| \leq c_1 e^{c_2|x|} \ \text{ or } \ a_{i,j} = b_{i,j}, \tag{10.13}
\]
then
\[
f'(t) = \frac{1}{2}\sum_{i,j\leq n} (a_{i,j} - b_{i,j})\, E\frac{\partial^2 F}{\partial x_i \partial x_j}\big(Z(t)\big). \tag{10.14}
\]

Proof. When the partial derivatives have at most exponential growth, it is easy to check that one can interchange the derivative and integral to write
\[
f'(t) = \frac{d}{dt} E F(Z(t)) = E\sum_{i\leq n}\frac{\partial F}{\partial x_i}(Z(t))Z_i'(t) = \sum_{i\leq n} E\frac{\partial F}{\partial x_i}(Z(t))Z_i'(t).
\]
Let us apply Gaussian integration by parts to each of the terms on the right hand side. Since the covariance
\[
E Z_i'(t)Z_j(t) = E\Big(\frac{1}{2\sqrt{t}}X_i - \frac{1}{2\sqrt{1-t}}Y_i\Big)\big(\sqrt{t}\,X_j + \sqrt{1-t}\,Y_j\big) = \frac{1}{2}(a_{i,j} - b_{i,j}),
\]
the Gaussian integration by parts formula (10.5) implies that
\[
E\frac{\partial F}{\partial x_i}(Z(t))Z_i'(t) = \frac{1}{2}\sum_{j\leq n}(a_{i,j} - b_{i,j})\, E\frac{\partial^2 F}{\partial x_i \partial x_j}(Z(t)),
\]
because of the assumption (10.13). Adding up over $i \leq n$ finishes the proof. ⊓⊔
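As a sanity check (not from the text), the covariance of $Z(t)$ can be verified numerically to interpolate linearly between the two covariances, as stated before (10.12); the matrices below are random PSD examples:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, B = M1 @ M1.T, M2 @ M2.T     # two example covariance matrices

t = 0.3
# For independent X ~ N(0, A), Y ~ N(0, B), bilinearity of covariance gives
# Cov(sqrt(t) X + sqrt(1-t) Y) = t A + (1-t) B.
cov_Zt = t * A + (1 - t) * B

# Empirical confirmation by sampling:
X = rng.multivariate_normal(np.zeros(n), A, size=200_000)
Y = rng.multivariate_normal(np.zeros(n), B, size=200_000)
Z = np.sqrt(t) * X + np.sqrt(1 - t) * Y
emp = Z.T @ Z / len(Z)
assert np.max(np.abs(emp - cov_Zt)) < 0.2
```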

Using the above interpolation, one can prove the following classical Gaussian concentration inequality for Lipschitz functions. Let us consider a function $F = F((x_\ell)_{1\leq\ell\leq n}) : \mathbb{R}^n \to \mathbb{R}$ such that, for some $L > 0$,
\[
|F(x) - F(y)| \leq L|x - y| \quad\text{for all } x, y \in \mathbb{R}^n. \tag{10.15}
\]
The smallest such $L$ is called the Lipschitz seminorm $\|F\|_{\mathrm{Lip}}$ of $F$.

Theorem 10.1 (Gaussian concentration). If $g = (g_i)_{i\leq n}$ is standard Gaussian on $\mathbb{R}^n$ then, for any $t \geq 0$,
\[
P\Big(\big|F(g) - E F(g)\big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\|F\|_{\mathrm{Lip}}^2}\Big). \tag{10.16}
\]

Proof. First, let us suppose that $F$ is differentiable and its gradient is bounded by $L$, $|\nabla F| \leq L$. Take any $\lambda \geq 0$. The Gaussian interpolation we would like to consider is of the form
\[
f(t) = E\exp\lambda\Big(F\big(\sqrt{t}\,g^1 + \sqrt{1-t}\,g\big) - F\big(\sqrt{t}\,g^2 + \sqrt{1-t}\,g\big)\Big), \tag{10.17}
\]

where g = (gi )in , g1 = (g1i )in and g2 = (g2i )in are three independent standard
Gaussian vectors on Rn . If we consider a function G : R2n ! R given by the for-
mula
⇣ ⌘
G(x1 , . . . , xn , xn+1 , . . . , x2n ) = exp l F(x1 , . . . , xn ) F(xn+1 , . . . , x2n ) ,

and define Gaussian random vectors X and Y on R2n by X = (g1 , g2 ) and Y = (g, g),
then the above interpolation can be rewritten in the form (10.12) as
p p
f (t) = EG( tX + 1 tY ).

Notice that the covariance matrices of $X$ and $Y$ are equal to
\[
A = \mathrm{Cov}(X) = \begin{bmatrix} I_n & 0 \\ 0 & I_n \end{bmatrix}, \qquad B = \mathrm{Cov}(Y) = \begin{bmatrix} I_n & I_n \\ I_n & I_n \end{bmatrix},
\]
where $I_n$ is the $n \times n$ identity matrix. Therefore,
\[
A - B = \begin{bmatrix} 0 & -I_n \\ -I_n & 0 \end{bmatrix},
\]
and the difference $a_{i,j} - b_{i,j}$ is non-zero only when $1 \leq i \leq n$ and $j = i + n$, or $1 \leq j \leq n$ and $i = j + n$. In both cases, $a_{i,j} - b_{i,j} = -1$. If we introduce the notation $F_i = \frac{\partial F}{\partial x_i}$ then it is easy to see that
\[
\frac{\partial^2 G}{\partial x_i\,\partial x_{i+n}} = -\lambda^2\, G\, F_i(x_1, \ldots, x_n)\, F_i(x_{n+1}, \ldots, x_{2n}).
\]
By our assumption, $|\nabla F| \leq L$, so the partial derivatives $F_i$ are all bounded, and $F$ grows at most linearly, $|F(x)| \leq c_1 + c_2|x|$. Hence, the growth conditions in (10.13) (for $G$) are satisfied, and the formula (10.14) gives
\[
f'(t) = \lambda^2\, E\, G\big(\sqrt{t}X + \sqrt{1-t}Y\big)\sum_{i=1}^n F_i\big(\sqrt{t}g^1 + \sqrt{1-t}g\big)\, F_i\big(\sqrt{t}g^2 + \sqrt{1-t}g\big).
\]

By the Cauchy-Schwarz inequality,
\[
\Big|\sum_{i=1}^n F_i\big(\sqrt{t}g^1 + \sqrt{1-t}g\big)\, F_i\big(\sqrt{t}g^2 + \sqrt{1-t}g\big)\Big|
\leq \big|\nabla F\big(\sqrt{t}g^1 + \sqrt{1-t}g\big)\big|\,\big|\nabla F\big(\sqrt{t}g^2 + \sqrt{1-t}g\big)\big| \leq L^2,
\]

which implies that (since $G$ is positive)
\[
f'(t) \leq \lambda^2 L^2\, E G\big(\sqrt{t}X + \sqrt{1-t}Y\big) = \lambda^2 L^2 f(t).
\]

Therefore, the derivative
\[
\Big(f(t)e^{-\lambda^2 L^2 t}\Big)' = e^{-\lambda^2 L^2 t}\big(f'(t) - \lambda^2 L^2 f(t)\big) \leq 0,
\]
so $f(t)e^{-\lambda^2 L^2 t}$ is decreasing and $f(1) \leq e^{\lambda^2 L^2} f(0)$. Recalling the definition (10.17), we see that $f(0) = 1$ and, therefore, we proved that
\[
f(1) = E\exp\lambda\big(F(g^1) - F(g^2)\big) \leq e^{\lambda^2 L^2}.
\]

The Gaussian interpolation and the assumption on the gradient, $|\nabla F| \leq L$, have played their roles. Since $g^1$ and $g^2$ are independent, integrating in $g^2$ first (let us denote this integral $E_2$) and using Jensen's inequality,
\[
\exp\lambda\big(F(g^1) - E_2 F(g^2)\big) \leq E_2 \exp\lambda\big(F(g^1) - F(g^2)\big).
\]
Integrating this in $g^1$, and using that $g^1, g^2$ have the same distribution as $g$, we get
\[
E\exp\lambda\big(F(g) - E F(g)\big) \leq E\exp\lambda\big(F(g^1) - F(g^2)\big) \leq e^{\lambda^2 L^2}.
\]

Using Markov’s inequality,


lt lt+l 2 L2
P F(g) EF(g) t  Ee exp l F(g) EF(g)  e .

This inequality holds for any l 0, and minimizing the right hand side over l (in
other words, setting l = t/2L2 ) proves that
⇣ ⌘ ⇣ t2 ⌘
P F(g) EF(g) t  exp . (10.18)
4kFk2Lip

Applying this to F and using the union bound proves (10.16).


Finally, in the general case when we do not assume differentiability and only assume (10.15), one can use the standard smoothing technique. Namely, for $\varepsilon > 0$, we define
\[
F_\varepsilon(x) = E F(x + \varepsilon g), \tag{10.19}
\]
where $g$ is a standard Gaussian vector on $\mathbb{R}^n$. This function is also Lipschitz,
\[
|F_\varepsilon(x) - F_\varepsilon(y)| \leq E|F(x + \varepsilon g) - F(y + \varepsilon g)| \leq L|x - y|,
\]
but it is also differentiable (even smooth), because
\[
F_\varepsilon(x) = \frac{1}{(\sqrt{2\pi})^n}\int_{\mathbb{R}^n} F(x + \varepsilon y)\,e^{-|y|^2/2}\,dy = \frac{1}{(\varepsilon\sqrt{2\pi})^n}\int_{\mathbb{R}^n} F(y)\,e^{-|y - x|^2/2\varepsilon^2}\,dy.
\]

The case proved above shows that
\[
P\Big(\big|F_\varepsilon(g) - E F_\varepsilon(g)\big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\|F\|_{\mathrm{Lip}}^2}\Big). \tag{10.20}
\]
Since $F_\varepsilon$ approximates $F$ uniformly,
\[
|F(x) - F_\varepsilon(x)| \leq E|F(x) - F(x + \varepsilon g)| \leq L\varepsilon\, E\|g\|,
\]
letting $\varepsilon \downarrow 0$ finishes the proof of the general case. ⊓⊔
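The bound (10.16) can be tested empirically (a sketch, not from the text) for the $1$-Lipschitz function $F(x) = \|x\|$, whose concentration is much tighter than the bound requires:

```python
import numpy as np

# Empirical check of Gaussian concentration (10.16) for F(x) = ||x||,
# which has Lipschitz seminorm 1.
rng = np.random.default_rng(2)
n, trials = 50, 100_000
g = rng.standard_normal((trials, n))
F = np.linalg.norm(g, axis=1)
EF = F.mean()

for t in [0.5, 1.0, 2.0]:
    empirical = np.mean(np.abs(F - EF) >= t)
    bound = 2 * np.exp(-t**2 / 4)    # right side of (10.16) with L = 1
    assert empirical <= bound
    print(f"t={t}: empirical {empirical:.4f} <= bound {bound:.4f}")
```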

10.2 Examples of concentration inequalities

Example 10.2.1. Our first example will be a supremum of linear functionals in $\mathbb{R}^n$. Let us consider a bounded set $A$ in $\mathbb{R}^n$, and consider the function
\[
F(x) = \sup_{a\in A}(a, x) = \sup_{a\in A}\big(a_1x_1 + \ldots + a_nx_n\big).
\]

Since
\[
|F(x) - F(y)| = \Big|\sup_{a\in A}\big(a_1x_1 + \ldots + a_nx_n\big) - \sup_{a\in A}\big(a_1y_1 + \ldots + a_ny_n\big)\Big|
\leq \sup_{a\in A}\big|a_1(x_1 - y_1) + \ldots + a_n(x_n - y_n)\big| \leq \sup_{a\in A}\|a\|\,\|x - y\|,
\]
the Lipschitz constant of $F$ is bounded by $\|F\|_{\mathrm{Lip}} \leq \sup_{a\in A}\|a\|$. Therefore, if $g$ is a standard Gaussian vector in $\mathbb{R}^n$ then
\[
P\Big(\Big|\sup_{a\in A}(a, g) - E\sup_{a\in A}(a, g)\Big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\sup_{a\in A}\|a\|^2}\Big). \tag{10.21}
\]

Let us give a couple of special cases of this inequality.

Example 10.2.2. Let $X$ be a Gaussian random vector in $\mathbb{R}^n$ with an arbitrary covariance matrix. We know that $X$ is equal in distribution to a linear transformation $Cg$ of the standard Gaussian random vector $g$ on $\mathbb{R}^n$. If $C_i = (c_{i,j})_{j\leq n}$ is the $i$th row of $C$ then $X_i = C_ig$ and
\[
E X_i^2 = E(C_ig)^2 = \sum_{j=1}^n c_{i,j}^2 = \|C_i\|^2.
\]
If we apply (10.21) with $A$ given by the collection of rows $C_1, \ldots, C_n$, we get
\[
P\Big(\Big|\max_{i\leq n} X_i - E\max_{i\leq n} X_i\Big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\max_{i\leq n} E X_i^2}\Big). \tag{10.22}
\]
Example 10.2.3 (Concentration of finite dimensional norms of Gaussian vectors). Let us consider an $n$-dimensional normed vector space $E$ with the norm $\|\cdot\|_E$ and let $b_1, \ldots, b_n$ be any basis of $E$. If $g$ is a standard Gaussian vector in $\mathbb{R}^n$, then
\[
X = g_1b_1 + \ldots + g_nb_n \tag{10.23}
\]
is one way to produce a random vector in $E$. The norm of a vector $x \in E$ can be written as the supremum over the unit ball of the dual space $E^*$ of linear functionals,
\[
\|x\|_E = \sup\big\{\zeta(x) : \zeta \in E^*, \|\zeta\|_{E^*} \leq 1\big\}, \tag{10.24}
\]

and, in particular, the norm of the random vector $X$ in (10.23) can be written as
\[
\|X\|_E = \sup\big\{g_1\zeta(b_1) + \ldots + g_n\zeta(b_n) : \zeta \in E^*, \|\zeta\|_{E^*} \leq 1\big\}. \tag{10.25}
\]
This functional is of the same form as in (10.21), with the set $A$ in $\mathbb{R}^n$ given by
\[
A = \big\{(\zeta(b_1), \ldots, \zeta(b_n)) : \zeta \in E^*, \|\zeta\|_{E^*} \leq 1\big\}.
\]
The Lipschitz norm of this function is bounded by $\sup_{a\in A}\|a\|$, denoted by
\[
\sigma(X) := \sup\big\{\big(\zeta(b_1)^2 + \ldots + \zeta(b_n)^2\big)^{1/2} : \zeta \in E^*, \|\zeta\|_{E^*} \leq 1\big\}. \tag{10.26}
\]
This shows that the norm of a random vector $X$ with Gaussian coordinates satisfies the following concentration inequality,
\[
P\Big(\big|\|X\|_E - E\|X\|_E\big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\sigma(X)^2}\Big). \tag{10.27}
\]
Another common way to write this, replacing $t$ by $t\,E\|X\|_E$, is
\[
P\Big(\big|\|X\|_E - E\|X\|_E\big| \geq t\,E\|X\|_E\Big) \leq 2\exp\Big(-\frac{t^2}{4}\Big(\frac{E\|X\|_E}{\sigma(X)}\Big)^2\Big). \tag{10.28}
\]
The quantity $d(X) := \big(\frac{E\|X\|_E}{\sigma(X)}\big)^2$ is called the concentration dimension of $X$.

Example 10.2.4 (Moment comparison for norms of Gaussian vectors). One useful consequence of the concentration inequality in the previous example is that the $L^p$-norms for $p \geq 1$ of the norm $\|X\|_E$ of the Gaussian vector $X$ in (10.23) are comparable to each other,
\[
\big(E\|X\|_E\big)^p \leq E\|X\|_E^p \leq c_p\big(E\|X\|_E\big)^p, \tag{10.29}
\]
where $c_p$ is some constant that depends only on $p$. The first inequality is just Jensen's inequality, so we only need to show the second one.
If we denote by $x_+ = \max(0, x)$ the positive part of $x$, then we can bound
\[
\|X\|_E \leq E\|X\|_E + \big(\|X\|_E - E\|X\|_E\big)_+.
\]
Let us denote $Z = (\|X\|_E - E\|X\|_E)_+$. Using the inequality $(a + b)^p \leq 2^{p-1}(a^p + b^p)$, we get
\[
E\|X\|_E^p \leq 2^{p-1}\big(E\|X\|_E\big)^p + 2^{p-1} E Z^p.
\]
For simplicity of notation, let us write $\sigma$ instead of $\sigma(X)$. Recall that, if $p > 0$ and a random variable $Z \geq 0$ is nonnegative, then we can write the $p$th moment of $Z$ as
\[
E Z^p = p\int_0^\infty x^{p-1}\, P(Z \geq x)\,dx. \tag{10.30}
\]

By (10.18), we know that $P(Z \geq x) \leq e^{-x^2/4\sigma^2}$ for $x \geq 0$ and, using (10.30),
\[
E Z^p \leq p\int_0^\infty x^{p-1}e^{-x^2/4\sigma^2}\,dx = \sigma^p\, p\int_0^\infty t^{p-1}e^{-t^2/4}\,dt = a_p\sigma^p,
\]
where we made the change of variables $x = \sigma t$ and then denoted by $a_p$ the constant $p\int_0^\infty t^{p-1}e^{-t^2/4}\,dt$. Using the concentration inequality in the previous example, we proved that
\[
E\|X\|_E^p \leq 2^{p-1}\big(E\|X\|_E\big)^p + 2^{p-1}a_p\,\sigma(X)^p. \tag{10.31}
\]
It remains to understand how to bound $\sigma(X)$ in (10.26).
For $\zeta \in E^*$, the random variable
\[
\zeta(X) = g_1\zeta(b_1) + \ldots + g_n\zeta(b_n)
\]
is Gaussian with the variance $E\zeta(X)^2 = \zeta(b_1)^2 + \ldots + \zeta(b_n)^2$. If $Y$ is an arbitrary centred Gaussian random variable $\sim N(0, v^2)$ then
\[
E|Y| = \frac{1}{\sqrt{2\pi}\,v}\int_{-\infty}^\infty |x|\,e^{-x^2/2v^2}\,dx = \frac{v}{\sqrt{2\pi}}\int_{-\infty}^\infty |y|\,e^{-y^2/2}\,dy = v\sqrt{2/\pi},
\]
which implies that $v^2 = E Y^2 = \frac{\pi}{2}\big(E|Y|\big)^2$. Using this for $Y = \zeta(X)$, we get
\[
\zeta(b_1)^2 + \ldots + \zeta(b_n)^2 = E\zeta(X)^2 = \frac{\pi}{2}\big(E|\zeta(X)|\big)^2.
\]
Taking the supremum over $\zeta \in E^*$ such that $\|\zeta\|_{E^*} \leq 1$, we get that
\[
\sigma(X)^2 = \frac{\pi}{2}\sup_{\zeta}\big(E|\zeta(X)|\big)^2 \leq \frac{\pi}{2}\Big(E\sup_{\zeta}|\zeta(X)|\Big)^2 = \frac{\pi}{2}\big(E\|X\|_E\big)^2.
\]
This means that $\sigma(X) \leq \sqrt{\pi/2}\,E\|X\|_E$ and, plugging this into (10.31), finally proves (10.29) with the constant $c_p = 2^{p-1}\big(1 + a_p(\pi/2)^{p/2}\big)$.
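The Gaussian identity $E|Y| = v\sqrt{2/\pi}$ used above is easy to confirm numerically (a sketch, not from the text):

```python
import numpy as np

# Monte Carlo check of E|Y| = v * sqrt(2/pi) for Y ~ N(0, v^2).
rng = np.random.default_rng(3)
v = 2.0
Y = v * rng.standard_normal(1_000_000)
exact = v * np.sqrt(2 / np.pi)
assert abs(np.mean(np.abs(Y)) - exact) < 0.01
print(np.mean(np.abs(Y)), exact)
```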

Exercise 10.2.1. If $X \sim N(0, C)$ is a Gaussian random vector in $\mathbb{R}^n$, show that
\[
P\Big(\Big|\log\sum_{i=1}^n e^{X_i} - E\log\sum_{i=1}^n e^{X_i}\Big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4\max_{i\leq n} E X_i^2}\Big). \tag{10.32}
\]

Exercise 10.2.2. If $g \sim N(0, I)$ is standard Gaussian on $\mathbb{R}^n$ and $\|g\|$ is its Euclidean norm, show that
\[
P\Big(\big|\|g\| - E\|g\|\big| \geq t\Big) \leq 2\exp\Big(-\frac{t^2}{4}\Big).
\]

10.3 Gaussian comparison inequalities

First, we will prove the so-called Slepian-Fernique inequalities.

Theorem 10.2 (Slepian-Fernique inequality I). Let $X = (X_i)_{i\leq n}$ and $Y = (Y_i)_{i\leq n}$ be two Gaussian vectors on $\mathbb{R}^n$ such that
1. $E X_i^2 = E Y_i^2$ for all $i \leq n$,
2. $E X_iX_j \leq E Y_iY_j$ for all $i, j \leq n$.
Then, for any choice of parameters $(\lambda_i)_{i\leq n} \in \mathbb{R}^n$,
\[
P\Big(\bigcap_{i=1}^n \{X_i \leq \lambda_i\}\Big) \leq P\Big(\bigcap_{i=1}^n \{Y_i \leq \lambda_i\}\Big) \tag{10.33}
\]
and
\[
E\max_i X_i \geq E\max_i Y_i. \tag{10.34}
\]

What this means is that, if the coordinates of $X$ have the same variances, but are less correlated, then they are less likely to stay below given thresholds $(\lambda_i)$ and, therefore, their maximum is bigger on average. After we prove this result, we will give another proof of the second statement (10.34) under less restrictive assumptions.

Proof. Let us rewrite the indicator
\[
I\Big(\bigcap_{i=1}^n \{x_i \leq \lambda_i\}\Big) = \prod_{i=1}^n I(x_i \leq \lambda_i).
\]
Let us approximate each indicator $I(x_i \leq \lambda_i)$ by a smooth nonnegative decreasing function $\varphi_i(x_i)$. Define
\[
\varphi(x) = \prod_{i=1}^n \varphi_i(x_i).
\]
If we consider the interpolation $f(t) = E\varphi(\sqrt{t}\,X + \sqrt{1-t}\,Y)$ and use the Gaussian interpolation formula (10.14), we will now check that the assumptions of the theorem about the covariances imply that $f'(t) \leq 0$. Indeed, for $j \neq i$,
\[
\frac{\partial^2}{\partial x_i\partial x_j}\prod_{\ell=1}^n \varphi_\ell(x_\ell) \geq 0,
\]
because the derivatives applied to the factors $i$ and $j$ in the product will both be negative, since all functions $\varphi_\ell$ are decreasing. On the other hand, by assumption, the difference of the covariances $E X_iX_j - E Y_iY_j \leq 0$ is nonpositive in this case, so the corresponding term in (10.14) will be nonpositive. The derivatives $\partial^2/\partial x_i^2$ are not important because the variances are equal and $E X_i^2 - E Y_i^2 = 0$. This proves that $f'(t) \leq 0$ and, therefore,

\[
f(1) = E\varphi(X) \leq f(0) = E\varphi(Y).
\]
Now, letting the $\varphi_i$'s converge to the corresponding indicators proves that
\[
E\, I\Big(\bigcap_{i=1}^n \{X_i \leq \lambda_i\}\Big) \leq E\, I\Big(\bigcap_{i=1}^n \{Y_i \leq \lambda_i\}\Big),
\]
which is the same as (10.33). Let us show how this implies (10.34).
Notice that, if we take all $\lambda_i = \lambda$ then (10.33) can be rewritten as
\[
P\Big(\max_{i\leq n} X_i \leq \lambda\Big) \leq P\Big(\max_{i\leq n} Y_i \leq \lambda\Big). \tag{10.35}
\]

Then the inequality (10.34) for the averages is an immediate consequence of the representation $E Z = \int_0^\infty P(Z \geq x)\,dx$ for $Z \geq 0$. Let us define $\bar{X} = \max_{i\leq n} X_i$ and $\bar{Y} = \max_{i\leq n} Y_i$ and let us decompose $\bar{X} = \bar{X}_+ - \bar{X}_-$ and $\bar{Y} = \bar{Y}_+ - \bar{Y}_-$ into positive and negative parts. If we take $\lambda \geq 0$, and apply the inequality (10.35) with $\lambda$ and $-\lambda$, we get
\[
P\big(\bar{X}_+ \geq \lambda\big) \geq P\big(\bar{Y}_+ \geq \lambda\big), \qquad P\big(\bar{X}_- \geq \lambda\big) \leq P\big(\bar{Y}_- \geq \lambda\big).
\]
This implies that $E\bar{X}_+ \geq E\bar{Y}_+$ and $E\bar{X}_- \leq E\bar{Y}_-$, and subtracting the two inequalities we get (10.34). ⊓⊔
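The simplest case of (10.34) can be checked explicitly (a sketch, not from the text): for a standard bivariate Gaussian pair with correlation $\rho$, a direct computation using $\max(a,b) = \frac{a+b}{2} + \frac{|a-b|}{2}$ gives $E\max(X_1, X_2) = \sqrt{(1-\rho)/\pi}$, which decreases in $\rho$, exactly as Slepian's inequality predicts:

```python
import numpy as np

# Monte Carlo vs. the exact formula E max(X1,X2) = sqrt((1-rho)/pi)
# for standard bivariate normal with correlation rho.
rng = np.random.default_rng(4)

def emax(rho, n=500_000):
    z1, z2 = rng.standard_normal((2, n))
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2
    return np.mean(np.maximum(x1, x2))

for rho in [-0.5, 0.0, 0.5]:
    exact = np.sqrt((1 - rho) / np.pi)
    assert abs(emax(rho) - exact) < 0.01
# Smaller correlation => larger expected maximum, as in (10.34):
assert np.sqrt(1.5/np.pi) > np.sqrt(1.0/np.pi) > np.sqrt(0.5/np.pi)
```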

Next, we will prove (10.34) under less restrictive assumptions. If we write
\[
E(X_i - X_j)^2 = E X_i^2 + E X_j^2 - 2 E X_iX_j, \qquad E(Y_i - Y_j)^2 = E Y_i^2 + E Y_j^2 - 2 E Y_iY_j,
\]
then, under the first assumption that all the variances are equal, $E X_i^2 = E Y_i^2$, the second assumption that $E X_iX_j \leq E Y_iY_j$ for all $i, j \leq n$ is equivalent to
\[
E(X_i - X_j)^2 \geq E(Y_i - Y_j)^2 \quad\text{for all } i, j \leq n.
\]
We will now show that with this reformulation of the second assumption, we can drop the first assumption, which is often convenient in applications.

Theorem 10.3 (Slepian-Fernique inequality II). Let $X = (X_i)_{i\leq n}$ and $Y = (Y_i)_{i\leq n}$ be two Gaussian vectors on $\mathbb{R}^n$ such that
\[
E(X_i - X_j)^2 \geq E(Y_i - Y_j)^2 \quad\text{for all } i, j \leq n.
\]
Then, the inequality (10.34) holds, i.e.
\[
E\max_i X_i \geq E\max_i Y_i. \tag{10.36}
\]

Proof. We will apply the same interpolation as in the above proof, but to a special smooth approximation of the maximum function $\max_{i\leq n} x_i$ given by
\[
F(x) = \frac{1}{\beta}\log\sum_{i\leq n} e^{\beta x_i},
\]
for a positive parameter $\beta > 0$. The reason we can view this as a smooth approximation of the maximum is because
\[
\max_{i\leq n} x_i \leq F(x) = \frac{1}{\beta}\log\sum_{i\leq n} e^{\beta x_i} \leq \max_{i\leq n} x_i + \frac{\log n}{\beta}.
\]

Indeed, if we keep only the largest term in the sum Âin eb xi we get the lower
bound, and if we replace each term in this sum by the largest one, we get the
upper bound. Now, we can make the second term log n/b becomes as small as we
like by taking b large. If we prove that EF(X) EF(Y ), we recover (10.36) by
letting b go to infinity. To use the Gaussian interpolation (10.14), let us compute
the derivatives of F. First of all,

\[
p_i(x) := \frac{\partial F}{\partial x_i} = \frac{e^{\beta x_i}}{\sum_{j\leq n} e^{\beta x_j}}.
\]
Differentiating this, we see that
\[
\frac{\partial^2 F}{\partial x_i^2} = \beta\big(p_i(x) - p_i(x)^2\big), \qquad \frac{\partial^2 F}{\partial x_i\partial x_j} = -\beta\, p_i(x)p_j(x) \ \text{ if } i \neq j.
\]

Therefore, the Gaussian interpolation formula (10.14) gives (to simplify the notation, we write $p_i$ instead of $p_i(Z(t))$)
\[
f'(t) = \frac{d}{dt} E F(Z(t)) = \frac{1}{2}\sum_{i,j\leq n}(E X_iX_j - E Y_iY_j)\, E\frac{\partial^2 F}{\partial x_i\partial x_j}(Z(t))
= \frac{\beta}{2}\sum_{i\leq n}(E X_i^2 - E Y_i^2)\, E(p_i - p_i^2) - \frac{\beta}{2}\sum_{i\neq j}(E X_iX_j - E Y_iY_j)\, E p_ip_j.
\]

However, we chose $F$ in such a way that
\[
\sum_{i\leq n} p_i(x) = \sum_{i\leq n}\frac{e^{\beta x_i}}{\sum_{j\leq n} e^{\beta x_j}} = 1.
\]
If we multiply both sides by $p_i(x)$ and subtract $p_i(x)^2$, we get
\[
p_i(x) - p_i(x)^2 = \sum_{j\neq i} p_i(x)p_j(x).
\]

This means that
\[
\sum_{i\leq n}(E X_i^2 - E Y_i^2)\, E(p_i - p_i^2)
= \sum_{i\leq n}(E X_i^2 - E Y_i^2)\, E\sum_{j\neq i} p_ip_j
= \sum_{i\neq j}(E X_i^2 - E Y_i^2)\, E p_ip_j.
\]
Switching the indices $i$ and $j$, this can also be written as $\sum_{i\neq j}(E X_j^2 - E Y_j^2)\,E p_ip_j$ and, taking the average of the two, we get
\[
\sum_{i\leq n}(E X_i^2 - E Y_i^2)\, E(p_i - p_i^2) = \frac{1}{2}\sum_{i\neq j}(E X_i^2 + E X_j^2 - E Y_i^2 - E Y_j^2)\, E p_ip_j.
\]

Plugging this into the above expression for the derivative $f'(t)$ and collecting the terms, we get
\[
f'(t) = \frac{\beta}{4}\sum_{i\neq j}\Big(E(X_i - X_j)^2 - E(Y_i - Y_j)^2\Big)\, E p_ip_j \geq 0,
\]
since, by the assumption of the theorem, each term is nonnegative. This implies that $f(1) \geq f(0)$ or, in other words, $E F(X) \geq E F(Y)$. This finishes the proof. ⊓⊔
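The smooth-max bounds used in the proof are easy to verify numerically (a sketch, not from the text), computing the log-sum-exp stably by factoring out the maximum:

```python
import numpy as np

# Check: max_i x_i <= (1/beta) log sum_i exp(beta x_i) <= max_i x_i + log(n)/beta.
rng = np.random.default_rng(5)
x = rng.standard_normal(10)
n = len(x)

for beta in [1.0, 10.0, 100.0]:
    m = x.max()
    F = m + np.log(np.sum(np.exp(beta * (x - m)))) / beta  # stable log-sum-exp
    assert m <= F + 1e-12
    assert F <= m + np.log(n) / beta + 1e-12
    print(beta, F - m)
```

The gap $F - \max_i x_i$ shrinks as $\beta$ grows, which is exactly why letting $\beta \to \infty$ recovers (10.36).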

Example 10.3.1 (Gaussian contraction inequality). Given a standard Gaussian vector $g = (g_1, \ldots, g_n)$ on $\mathbb{R}^n$, let us consider a Gaussian process
\[
X(t) = (t, g) = t_1g_1 + \ldots + t_ng_n, \tag{10.37}
\]
where $t = (t_1, \ldots, t_n) \in T$ for some bounded subset $T \subseteq \mathbb{R}^n$. Let us consider a $1$-Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}^n$, i.e.
\[
\|f(t) - f(t')\| \leq \|t - t'\|,
\]
and, if we write $f = (f_1, \ldots, f_n)$, consider a process
\[
Y(t) = (f(t), g) = f_1(t)g_1 + \ldots + f_n(t)g_n. \tag{10.38}
\]
Obviously,
\[
E\big(Y(t) - Y(t')\big)^2 = \|f(t) - f(t')\|^2 \leq \|t - t'\|^2 = E\big(X(t) - X(t')\big)^2,
\]
so Theorem 10.3 implies that
\[
E\sup_{t\in T} Y(t) = E\sup_{t\in T}(f(t), g) \leq E\sup_{t\in T} X(t) = E\sup_{t\in T}(t, g), \tag{10.39}
\]
which is known as the Gaussian contraction inequality. ⊓⊔

Next, we will prove a more general minimax analogue of the first theorem above, known as Gordon's comparison inequality.

Theorem 10.4 (Gordon’s inequality). Let (Xi, j ), (Yi, j ) be two Gaussian vectors
indexed by i  n, j  m such that
10.3 Gaussian comparison inequalities 295

1. EXi,2 j = EYi,2j for all i, j,


2. EXi, j Xi,`  EYi, jYi,` for all i, j, `,
3. EXi, j Xk,` EYi, jYk,` for all i, j, k, ` such that i 6= k.
Then, for any choice of parameters (li, j ),
⇣[
n \
m ⌘ ⇣[
n \
m ⌘
P Xi, j  li, j P Yi, j  li, j (10.40)
i=1 j=1 i=1 j=1

and
E min max Xi, j E min max Yi, j . (10.41)
i j i j
This reduces to Slepian’s inequality when the index i takes only one value, so the
condition 3 is empty.
Proof. Let us rewrite the indicator
\[
I\Big(\bigcup_{i=1}^n\bigcap_{j=1}^m \{x_{i,j} \leq \lambda_{i,j}\}\Big) = 1 - \prod_{i=1}^n\Big(1 - \prod_{j=1}^m I(x_{i,j} \leq \lambda_{i,j})\Big).
\]
Let us approximate each indicator $I(x_{i,j} \leq \lambda_{i,j})$ by a smooth nonnegative non-increasing function $\varphi_{i,j}(x_{i,j})$. Define
\[
\varphi(x) = 1 - \prod_{i=1}^n\Big(1 - \prod_{j=1}^m \varphi_{i,j}(x_{i,j})\Big).
\]
If we consider the interpolation $f(t) = E\varphi(\sqrt{t}\,X + \sqrt{1-t}\,Y)$ and use the Gaussian interpolation formula (10.14), it is easy to check that the conditions of the theorem on the covariances imply that $f'(t) \leq 0$. Indeed, for $j \neq \ell$,
\[
\frac{\partial^2\varphi}{\partial x_{i,j}\partial x_{i,\ell}} = \prod_{k\neq i}\Big(1 - \prod_{p=1}^m \varphi_{k,p}(x_{k,p})\Big)\,\frac{\partial^2}{\partial x_{i,j}\partial x_{i,\ell}}\prod_{p=1}^m \varphi_{i,p}(x_{i,p}) \geq 0,
\]
because the derivatives applied to two factors in the last product will both be non-positive (since all functions $\varphi_{i,j}$ are non-increasing). On the other hand, the difference of the covariances $E X_{i,j}X_{i,\ell} - E Y_{i,j}Y_{i,\ell} \leq 0$ is nonpositive in this case, so the corresponding term in (10.14) will be nonpositive. Similarly, for $i \neq k$,

\[
\frac{\partial^2\varphi}{\partial x_{i,j}\partial x_{k,\ell}} = -\prod_{k'\neq i,k}\Big(1 - \prod_{p=1}^m \varphi_{k',p}(x_{k',p})\Big)\,\frac{\partial}{\partial x_{i,j}}\prod_{p=1}^m \varphi_{i,p}(x_{i,p})\,\frac{\partial}{\partial x_{k,\ell}}\prod_{p=1}^m \varphi_{k,p}(x_{k,p})
\]
is $\leq 0$. By assumption, the difference of the covariances $E X_{i,j}X_{k,\ell} - E Y_{i,j}Y_{k,\ell} \geq 0$ is nonnegative in this case, so the corresponding term in (10.14) will again be nonpositive. This proves that $f'(t) \leq 0$ and, therefore,
\[
f(1) = E\varphi(X) \leq f(0) = E\varphi(Y).
\]



Now, letting ji, j ’s converge to the corresponding indicators, proves that


⇣[
n \
m ⌘ ⇣[
n \
m ⌘
EI Xi, j  li, j  EI Yi, j  li, j ,
i=1 j=1 i=1 j=1

which is the same as (10.40). If we take all li, j = l , this can be rewritten as
⇣ ⌘ ⇣ ⌘
P min max Xi, j  l  P min max Yi, j  l .
i j i j

and (10.41) follows by integration by parts as before. t


u

For the last statement of the previous theorem, one can also remove the assumption about the equality of variances.

Theorem 10.5 (Gordon’s inequality II). Let (Xi, j ) and (Yi, j ) be two Gaussian
vectors indexed by i  n, j  m such that
1. E(Xi, j Xi,` )2 E(Yi, j Yi,` )2 for all i, j, `,
2. E(Xi, j Xk,` )2  E(Yi, j Yk,` )3 for all i, j, k, ` such that i 6= k.
Then,
E min max Xi, j E min max Yi, j . (10.42)
i j i j

To prove this inequality, one can use that
\[
F(x) = -\frac{1}{\beta}\log\sum_{i}\frac{1}{\sum_{j} e^{\beta x_{i,j}}} \ \to\ \min_i\max_j x_{i,j} \quad\text{as } \beta \to \infty,
\]
and, as in the proof of Theorem 10.3, show that $E F(X) \geq E F(Y)$. The calculation is a bit more involved but straightforward. Letting $\beta \to \infty$ proves (10.42).

Exercise 10.3.1. If $g \sim N(0, I)$ is standard Gaussian on $\mathbb{R}^n$, a function $\sigma : \mathbb{R} \to \mathbb{R}$ is $1$-Lipschitz and a set $T \subseteq \mathbb{R}^n$ is bounded, show that
\[
E\sup_{t\in T}\big(\sigma(t_1)g_1 + \ldots + \sigma(t_n)g_n\big) \leq E\sup_{t\in T}\big(t_1g_1 + \ldots + t_ng_n\big).
\]

Exercise 10.3.2. Prove Theorem 10.5.



10.4 Comparison of bilinear and linear forms

Let us consider the following random bilinear form
\[
X(t, u) = \sum_{i=1}^n\sum_{j=1}^m g_{i,j}t_iu_j, \tag{10.43}
\]
where $(g_{i,j})_{i\leq n, j\leq m}$ are i.i.d. standard Gaussian random variables and the parameters $t = (t_1, \ldots, t_n) \in \mathbb{R}^n$ and $u = (u_1, \ldots, u_m) \in \mathbb{R}^m$.

In various applications, one is interested in quantities of the form
\[
\max_{t\in T}\max_{u\in U} X(t, u), \qquad \min_{t\in T}\max_{u\in U} X(t, u) \quad\text{or}\quad \min_{u\in U}\max_{t\in T} X(t, u)
\]
for some sets of parameters $T \subseteq \mathbb{R}^n$ and $U \subseteq \mathbb{R}^m$, which can be difficult to calculate or estimate directly. It turns out that one can use the comparison inequalities from the previous section to relate these quantities to similar quantities for another, linear or 'almost' linear form. We will give several examples below.

Example 10.4.1. Let us first consider the following random process
\[
Y(t, u) = \|u\|\sum_{i=1}^n h_it_i + \|t\|\sum_{j=1}^m g_ju_j, \tag{10.44}
\]
where $h_1, \ldots, h_n, g_1, \ldots, g_m$ are i.i.d. standard Gaussian random variables. This is not quite a linear form because of the factors $\|t\|$ and $\|u\|$, but the dependence on $t$ and $u$ is simpler and one can often analyze more easily the quantities

\[
\max_{t\in T}\max_{u\in U} Y(t, u), \qquad \min_{t\in T}\max_{u\in U} Y(t, u) \quad\text{or}\quad \min_{u\in U}\max_{t\in T} Y(t, u).
\]

Let us take any two pairs of parameters $(t, u)$ and $(t', u')$ and compute
\[
E\big(X(t, u) - X(t', u')\big)^2 = E\Big(\sum_{i=1}^n\sum_{j=1}^m g_{i,j}(t_iu_j - t_i'u_j')\Big)^2
= \sum_{i=1}^n\sum_{j=1}^m (t_iu_j - t_i'u_j')^2 = \|t\|^2\|u\|^2 + \|t'\|^2\|u'\|^2 - 2(t, t')(u, u'). \tag{10.45}
\]

Similarly, one can compute
\[
E\big(Y(t, u) - Y(t', u')\big)^2
= E\Big(\sum_{i=1}^n h_i(\|u\|t_i - \|u'\|t_i')\Big)^2 + E\Big(\sum_{j=1}^m g_j(\|t\|u_j - \|t'\|u_j')\Big)^2
= 2\|t\|^2\|u\|^2 + 2\|t'\|^2\|u'\|^2 - 2\|u\|\|u'\|(t, t') - 2\|t\|\|t'\|(u, u'). \tag{10.46}
\]

Subtracting and rearranging the terms, it is easy to see that
\[
E\big(Y(t, u) - Y(t', u')\big)^2 - E\big(X(t, u) - X(t', u')\big)^2
= \big(\|t\|\|u\| - \|t'\|\|u'\|\big)^2 + 2\big(\|t\|\|t'\| - (t, t')\big)\big(\|u\|\|u'\| - (u, u')\big). \tag{10.47}
\]
By the Cauchy-Schwarz inequality, $(t, t') \leq \|t\|\|t'\|$ and $(u, u') \leq \|u\|\|u'\|$, so
\[
E\big(Y(t, u) - Y(t', u')\big)^2 \geq E\big(X(t, u) - X(t', u')\big)^2. \tag{10.48}
\]

By Theorem 10.3 in the previous section,
\[
E\max_{t\in T}\max_{u\in U} X(t, u) \leq E\max_{t\in T}\max_{u\in U} Y(t, u). \tag{10.49}
\]
⊓⊔
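Since both sides of (10.47) are explicit functions of $(t, u)$ and $(t', u')$, the identity can be verified exactly for random parameter vectors (a sketch, not from the text):

```python
import numpy as np

# Exact check of the identity (10.47) and its consequence (10.48).
rng = np.random.default_rng(6)
n, m = 5, 7
t, t2 = rng.standard_normal((2, n))
u, u2 = rng.standard_normal((2, m))

def nrm(x):
    return np.linalg.norm(x)

EX2 = nrm(t)**2 * nrm(u)**2 + nrm(t2)**2 * nrm(u2)**2 - 2 * (t @ t2) * (u @ u2)
EY2 = (2 * nrm(t)**2 * nrm(u)**2 + 2 * nrm(t2)**2 * nrm(u2)**2
       - 2 * nrm(u) * nrm(u2) * (t @ t2) - 2 * nrm(t) * nrm(t2) * (u @ u2))
gap = ((nrm(t)*nrm(u) - nrm(t2)*nrm(u2))**2
       + 2 * (nrm(t)*nrm(t2) - t @ t2) * (nrm(u)*nrm(u2) - u @ u2))

assert abs((EY2 - EX2) - gap) < 1e-9   # identity (10.47)
assert EY2 >= EX2                      # Cauchy-Schwarz consequence (10.48)
```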

In order to apply the results for the minimax $\min_{t\in T}\max_{u\in U}$, we need the reverse inequality for $t = t'$. Because of (10.48), we can only hope to get the equality
\[
E\big(Y(t, u) - Y(t, u')\big)^2 = E\big(X(t, u) - X(t, u')\big)^2. \tag{10.50}
\]
There are a couple of ways this can be achieved, as we will see in the following examples.

Example 10.4.2. One way to do this is to modify the definition of the bilinear form slightly and consider
\[
X^+(t, u) = \sum_{i=1}^n\sum_{j=1}^m g_{i,j}t_iu_j + z\|t\|\|u\|, \tag{10.51}
\]
where $z$ is a standard Gaussian random variable independent of all $g_{i,j}$. Including this additional term will add $(\|t\|\|u\| - \|t'\|\|u'\|)^2$ to (10.45) and, as a result, (10.47) will become
\[
E\big(Y(t, u) - Y(t', u')\big)^2 - E\big(X^+(t, u) - X^+(t', u')\big)^2
= 2\big(\|t\|\|t'\| - (t, t')\big)\big(\|u\|\|u'\| - (u, u')\big). \tag{10.52}
\]

This is equal to zero when $t = t'$ so, by Theorem 10.5,
\[
E\min_{t\in T}\max_{u\in U} X^+(t, u) \geq E\min_{t\in T}\max_{u\in U} Y(t, u). \tag{10.53}
\]

Notice that the inequality is reversed in this case and, together with (10.49), we can write
\[
E\min_{t\in T}\max_{u\in U} Y(t, u) \leq E\min_{t\in T}\max_{u\in U} X^+(t, u)
\leq E\max_{t\in T}\max_{u\in U} X^+(t, u) \leq E\max_{t\in T}\max_{u\in U} Y(t, u). \tag{10.54}
\]

Moreover, if we take $t' = t$ and $u' = u$ in (10.52), we get that
\[
E X^+(t, u)^2 = E Y(t, u)^2 \quad\text{for all } (t, u),
\]
so we are in a position to apply Theorem 10.4 to compare the probabilities as in (10.40),
\[
P\Big(\bigcup_{t\in T}\bigcap_{u\in U} \{X^+(t, u) \leq \lambda(t, u)\}\Big) \leq P\Big(\bigcup_{t\in T}\bigcap_{u\in U} \{Y(t, u) \leq \lambda(t, u)\}\Big), \tag{10.55}
\]

for an arbitrary function $\lambda(t, u)$. Notice that (10.52) is also equal to zero if $u = u'$ so, by the same logic, we can switch the roles of the parameters $t$ and $u$,
\[
E\min_{u\in U}\max_{t\in T} Y(t, u) \leq E\min_{u\in U}\max_{t\in T} X^+(t, u)
\leq E\max_{u\in U}\max_{t\in T} X^+(t, u) \leq E\max_{u\in U}\max_{t\in T} Y(t, u) \tag{10.56}
\]
and
\[
P\Big(\bigcup_{u\in U}\bigcap_{t\in T} \{X^+(t, u) \leq \lambda(t, u)\}\Big) \leq P\Big(\bigcup_{u\in U}\bigcap_{t\in T} \{Y(t, u) \leq \lambda(t, u)\}\Big). \tag{10.57}
\]

This example will be useful to us in Section 10.6. ⊓⊔

Example 10.4.3. There is another special case that is very useful, when
\[
\|t\| = a \ \text{ for all } t \in T, \qquad \|u\| \leq b \ \text{ for all } u \in U, \tag{10.58}
\]
for some constants $a, b > 0$. In other words, $T$ is a subset of the sphere of radius $a$ in $\mathbb{R}^n$ and $U$ is a subset of the ball of radius $b$ in $\mathbb{R}^m$. In this case, we will modify the definition (10.44) slightly and consider a proper random linear form
\[
Y^+(t, u) = b\sum_{i=1}^n h_it_i + a\sum_{j=1}^m g_ju_j. \tag{10.59}
\]

As before, by a direct calculation, one can check that


2 2
E Y + (t, u) Y + (t 0 , u0 ) E X(t, u) X(t 0 , u0 ) (10.60)
2 0 2 0
=2 a (t,t ) b (u, u ) . (10.61)

By the Cauchy-Schwarz inequality, this difference is nonnegative. Moreover, it is


equal to zero when t = t 0 by the assumption that ktk = a. This implies, by Theorem
10.3 and Theorem 10.5,

E min max Y + (t, u)  E min max X(t, u)


t2T u2U t2T u2U
300 10 Gaussian Distributions

 E max max X(t, u)  E max max Y + (t, u). (10.62)


t2T u2U t2T u2U

We will use this example in Section 10.5 below. t


u

Example 10.4.4. In the setting of Example 10.4.1, let us suppose that
$$
\|t\| = a \ \text{ for all } t \in T, \tag{10.63}
$$
i.e. $T$ is a subset of the sphere of radius $a$ in $\mathbb{R}^n$. If we take $u = u'$ in (10.47), we get the equality
$$
E\bigl(Y(t,u) - Y(t',u)\bigr)^2 = E\bigl(X(t,u) - X(t',u)\bigr)^2. \tag{10.64}
$$
As in (10.56), this implies
$$
E \min_{u\in U}\max_{t\in T} Y(t,u) \le E \min_{u\in U}\max_{t\in T} X(t,u)
\le E \max_{u\in U}\max_{t\in T} X(t,u) \le E \max_{u\in U}\max_{t\in T} Y(t,u). \tag{10.65}
$$
⊓⊔

Example 10.4.5. Let us take $T = S^{n-1}$ and $U = S^{m-1}$ to be unit spheres in $\mathbb{R}^n$ and $\mathbb{R}^m$. We will use Example 10.4.3 with $a = b = 1$ and the roles of $t$ and $u$ reversed in (10.62). First of all,
$$
\min_{\|u\|=1}\max_{\|t\|=1}\Bigl(\,\sum_{i=1}^n h_i t_i + \sum_{j=1}^m g_j u_j\Bigr)
= \max_{\|t\|=1}(h,t) - \max_{\|u\|=1}(g,u) = \|h\| - \|g\|,
$$
so
$$
E \min_{u\in U}\max_{t\in T} Y^+(t,u) = E\|h\| - E\|g\|.
$$
This can be computed explicitly, and it is well known that
$$
\frac{n}{\sqrt{n+1}} \le E\|h\| = \frac{\sqrt{2}\,\Gamma\bigl(\frac{n+1}{2}\bigr)}{\Gamma\bigl(\frac{n}{2}\bigr)} \le \sqrt{n} \tag{10.66}
$$
(we leave it as an exercise below). The same holds for $E\|g\|$ with $n$ replaced by $m$. On the other hand,
$$
\min_{u\in U}\max_{t\in T} X(t,u) = \min_{\|u\|=1}\Bigl(\,\sum_{i=1}^n\Bigl(\,\sum_{j=1}^m g_{i,j} u_j\Bigr)^2\Bigr)^{1/2},
$$
which we can rewrite as follows. Let us introduce the notation
$$
g_i = (g_{i,1},\ldots,g_{i,m})^T \in \mathbb{R}^m \quad\text{and}\quad Z = \frac{1}{n}\sum_{i=1}^n g_i g_i^T. \tag{10.67}
$$
Vectors $g_i$ are i.i.d. standard Gaussian in $\mathbb{R}^m$ and the $m\times m$ matrix $Z$ is their sample covariance matrix. We can then rewrite
$$
\sum_{i=1}^n\Bigl(\,\sum_{j=1}^m g_{i,j} u_j\Bigr)^2 = \sum_{i=1}^n (g_i,u)^2
= \sum_{i=1}^n u^T g_i g_i^T u = \sum_{i=1}^n (g_i g_i^T u, u) = n(Zu,u).
$$
Since $Z$ is symmetric and positive semi-definite,
$$
\min_{\|u\|=1}(Zu,u) = \lambda_{\min}(Z) \ge 0,
$$
where $\lambda_{\min}(Z)$ is the smallest eigenvalue of $Z$. The first inequality in (10.62) gives
$$
E\|h\| - E\|g\| \le E\sqrt{n\,\lambda_{\min}(Z)}, \tag{10.68}
$$
and, using (10.66), we can write
$$
\sqrt{\frac{n}{n+1}} - \sqrt{\frac{m}{n}} \le E\sqrt{\lambda_{\min}(Z)}. \tag{10.69}
$$
When $n = (1+\varepsilon)m$, i.e. the sample size $n$ is relatively bigger than the dimension $m$ of our space, the left-hand side is separated away from zero, so the smallest eigenvalue $\lambda_{\min}(Z)$ is separated away from zero, at least on average. To obtain a similar statement in probability, we can use Example 10.4.2. The only difference will be that
$$
\min_{u\in U}\max_{t\in T} X^+(t,u) = \sqrt{n\,\lambda_{\min}(Z)} + z,
$$
and using (10.57) with $\lambda(t,u) \equiv \lambda$, we get
$$
P\bigl(\sqrt{n\,\lambda_{\min}(Z)} + z \le \lambda\bigr) \le P\bigl(\|h\| - \|g\| \le \lambda\bigr). \tag{10.70}
$$
If we take $\lambda = t\sqrt{n}$ and flip the inequality, we get
$$
P\Bigl(\sqrt{\lambda_{\min}(Z)} + \frac{z}{\sqrt{n}} \ge t\Bigr) \ge P\Bigl(\frac{\|h\| - \|g\|}{\sqrt{n}} \ge t\Bigr). \tag{10.71}
$$
By the law of large numbers, $(\|h\| - \|g\|)/\sqrt{n} \approx 1 - \sqrt{m/n}$, so the second probability will be close to one if $n$ is such that $1 - \sqrt{m/n} > t$, or $n > m/(1-t)^2$. Since $z/\sqrt{n}$ is negligible for large $n$, this shows that with probability close to one, $\lambda_{\min}(Z) \ge t^2$. This shows that when $n = (1+\varepsilon)m$, $\lambda_{\min}(Z)$ is separated away from zero with high probability, and not only on average.
We can reformulate these results for general Gaussian vectors on $\mathbb{R}^m$ with covariance $C$. If $X_i = A g_i$ for some matrix $A$ such that $C = AA^T$, then $(X_i)_{i\le n}$ are i.i.d. $N(0,C)$ and
$$
Z_C = \frac{1}{n}\sum_{i=1}^n X_i X_i^T = A Z A^T \tag{10.72}
$$
is their sample covariance matrix, where $Z$ was defined in (10.67). Since
$$
\lambda_{\min}(Z_C) = \inf_{\|u\|=1}(Z_C u, u) = \inf_{\|u\|=1}(A Z A^T u, u) = \inf_{\|u\|=1}(Z A^T u, A^T u)
\ge \lambda_{\min}(Z)\inf_{\|u\|=1}(A^T u, A^T u) = \lambda_{\min}(Z)\inf_{\|u\|=1}(C u, u) = \lambda_{\min}(Z)\,\lambda_{\min}(C),
$$
the statements we obtained above for the sample covariance matrix $Z$ can be transferred to similar statements for $Z_C$, only now scaled by $\lambda_{\min}(C)$. ⊓⊔
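The bound (10.69) is easy to probe numerically. In the sketch below (an illustration, not from the text), we take $m = 2$, so that the smallest eigenvalue of the $2\times 2$ sample covariance matrix has a closed form, and an arbitrary sample size $n = 50$; the Monte Carlo estimate of $E\sqrt{\lambda_{\min}(Z)}$ should dominate the left-hand side of (10.69).

```python
import math
import random

random.seed(1)
m, n = 2, 50  # dimension m and sample size n (illustrative values)
trials = 2000

def lambda_min_2x2(a, b, c):
    # smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
    return half_trace - radius

acc = 0.0
for _ in range(trials):
    # sample covariance Z = (1/n) sum_i g_i g_i^T of n standard Gaussians in R^2
    a = b = c = 0.0
    for _ in range(n):
        g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
        a += g1 * g1
        b += g1 * g2
        c += g2 * g2
    acc += math.sqrt(max(lambda_min_2x2(a / n, b / n, c / n), 0.0))
mc_mean = acc / trials  # Monte Carlo estimate of E sqrt(lambda_min(Z))

bound = math.sqrt(n / (n + 1)) - math.sqrt(m / n)  # left-hand side of (10.69)
print(mc_mean, bound)
```
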

Exercise 10.4.1. If $g \sim N(0, I)$ is standard Gaussian on $\mathbb{R}^n$ and $\|g\|$ is its Euclidean norm, show that
$$
\frac{n}{\sqrt{n+1}} \le E\|g\| = \frac{\sqrt{2}\,\Gamma\bigl(\frac{n+1}{2}\bigr)}{\Gamma\bigl(\frac{n}{2}\bigr)} \le \sqrt{n}.
$$
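Assuming the Gamma-function identity stated in the exercise, the two bounds can be verified numerically (a quick sketch with arbitrary test values of $n$; `math.lgamma` keeps the computation stable for large $n$):

```python
import math

def expected_norm(n):
    # E||g|| = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2) for g ~ N(0, I_n);
    # computed via lgamma to avoid overflow of Gamma for large n
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))

for n in (1, 2, 10, 100, 10000):
    e = expected_norm(n)
    lower, upper = n / math.sqrt(n + 1), math.sqrt(n)
    assert lower <= e <= upper  # n/sqrt(n+1) <= E||g|| <= sqrt(n)
    print(n, lower, e, upper)
```

Note that for $n = 1$ this recovers $E|g| = \sqrt{2/\pi}$.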

10.5 Johnson–Lindenstrauss lemma

In this section, we will give one classical application of the results in the previous section. Let us consider $N \ge 1$ and a set
$$
V = \{v_1, \ldots, v_m\} \subseteq \mathbb{R}^N \tag{10.73}
$$
of $m$ points in $\mathbb{R}^N$. The dimension $N$ here can be arbitrarily large, and we should think of the number of points $m$ as also being large. The Johnson–Lindenstrauss lemma states that there exists a linear map from $\mathbb{R}^N$ into a Euclidean space $\mathbb{R}^n$ of possibly much lower dimension $n$ that preserves the distances between all the points in the set $V$ up to a small relative error. This is called a low-distortion embedding. It was discovered in a work on functional analysis, but it has found many applications to computational algorithms in various fields, as a preprocessing step to reduce the dimensionality of high-dimensional data. Here is the precise statement.

Theorem 10.6 (Johnson–Lindenstrauss lemma). Given $m$ points $v_1, \ldots, v_m$ in $\mathbb{R}^N$, any $\varepsilon \in (0,1)$, and
$$
n > \frac{4}{\varepsilon^2}\log m, \tag{10.74}
$$
there exists a linear map $f : \mathbb{R}^N \to \mathbb{R}^n$ such that
$$
\sqrt{\frac{n}{n+1}} - \varepsilon \le \frac{\|f(v_k) - f(v_\ell)\|}{\|v_k - v_\ell\|} \le 1 + \varepsilon \tag{10.75}
$$
for all $1 \le k < \ell \le m$, where $\|\cdot\|$ denotes the Euclidean norm (in $\mathbb{R}^N$ and $\mathbb{R}^n$).

The dimension $n$ in (10.74) that we are allowed to choose depends on the distortion parameter $\varepsilon$ and the number of points $m$, but it does not depend on $N$. The dependence on $m$ is logarithmic, so the dimension can be relatively small even when the number of points is very large. We will give two proofs of this result, using Gaussian concentration and Gaussian comparison.

Proof (using Gaussian concentration). In our first proof we will assume (10.74) with constant 8 instead of 4 (however, using the Gaussian concentration inequality with the optimal constant would resolve this issue). Let us consider a random matrix
$$
G = \begin{bmatrix}
g_{11} & g_{12} & g_{13} & \cdots & g_{1N} \\
g_{21} & g_{22} & g_{23} & \cdots & g_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_{n1} & g_{n2} & g_{n3} & \cdots & g_{nN}
\end{bmatrix},
$$

where the entries $g_{ij}$ are i.i.d. standard Gaussian random variables, and show that the linear map
$$
f(x) = \frac{1}{\sqrt{n}}\, G x \tag{10.76}
$$
satisfies the low-distortion property (10.75) with positive probability. Let us denote, for $1 \le k < \ell \le m$,
$$
a_{k\ell} = \frac{v_k - v_\ell}{\|v_k - v_\ell\|}.
$$
Then the equation (10.75) can be rewritten, for $f$ in (10.76), as
$$
\sqrt{\frac{n}{n+1}} - \varepsilon \le \frac{1}{\sqrt{n}}\|G a_{k\ell}\| \le 1 + \varepsilon
$$
for all $1 \le k < \ell \le m$. For a fixed $a = (a_1, \ldots, a_N) \in S^{N-1}$, the coordinates of $Ga$,
$$
(Ga)_i = g_{i1} a_1 + \cdots + g_{iN} a_N,
$$
are independent standard Gaussian random variables, so $Ga$ is a standard Gaussian vector on $\mathbb{R}^n$. Combining Exercise 10.2.2 and Exercise 10.4.1, we get that
$$
P\Bigl(\sqrt{\tfrac{n}{n+1}} - \varepsilon \le \tfrac{1}{\sqrt{n}}\|G a\| \le 1 + \varepsilon\Bigr) \ge 1 - 2e^{-n\varepsilon^2/4}.
$$
Since $a_{k\ell} \in S^{N-1}$ for all $1 \le k < \ell \le m$,
$$
P\Bigl(\sqrt{\tfrac{n}{n+1}} - \varepsilon \le \tfrac{1}{\sqrt{n}}\|G a_{k\ell}\| \le 1 + \varepsilon\Bigr) \ge 1 - 2e^{-n\varepsilon^2/4},
$$
and, by the union bound,
$$
P\Bigl(\forall\, 1 \le k < \ell \le m,\ \sqrt{\tfrac{n}{n+1}} - \varepsilon \le \tfrac{1}{\sqrt{n}}\|G a_{k\ell}\| \le 1 + \varepsilon\Bigr) \ge 1 - m^2 e^{-n\varepsilon^2/4}.
$$
The right-hand side is strictly positive if $n > \frac{8}{\varepsilon^2}\log m$. The fact that the probability is positive means that there exists $G$ such that the low-distortion condition (10.75) is satisfied. ⊓⊔
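The construction in this proof is easy to run. The sketch below (with illustrative parameters $N$, $m$, $\varepsilon$ chosen arbitrarily, not from the text) draws a Gaussian matrix, applies $f(x) = Gx/\sqrt{n}$ from (10.76), and checks the distortion of all pairwise distances against the window in (10.75).

```python
import math
import random

random.seed(7)
N, m, eps = 1000, 20, 0.5                # illustrative ambient dimension, points, distortion
n = int(8 / eps ** 2 * math.log(m)) + 1  # target dimension n > (8/eps^2) log m

# random point set in R^N and Gaussian projection matrix
pts = [[random.gauss(0, 1) for _ in range(N)] for _ in range(m)]
G = [[random.gauss(0, 1) for _ in range(N)] for _ in range(n)]

def f(x):
    # f(x) = G x / sqrt(n), as in (10.76)
    return [sum(G[i][j] * x[j] for j in range(N)) / math.sqrt(n) for i in range(n)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

proj = [f(p) for p in pts]
ratios = [dist(proj[k], proj[l]) / dist(pts[k], pts[l])
          for k in range(m) for l in range(k + 1, m)]
lo, hi = min(ratios), max(ratios)
print(n, lo, hi)  # all ratios should fall in the window of (10.75)
```
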

Proof (using Gaussian comparison). We will now show a proof using Gordon's inequality in the form of Example 10.4.3. Let us again consider the set
$$
A = \Bigl\{\, a_{k\ell} = \frac{v_k - v_\ell}{\|v_k - v_\ell\|} \in S^{N-1} : 1 \le k < \ell \le m \,\Bigr\},
$$
and let $B = \{x \in \mathbb{R}^n : \|x\| \le 1\}$ be the unit ball in $\mathbb{R}^n$. Given the Gaussian $n\times N$ matrix $G$ in the previous proof, consider a bilinear form for $a \in A$ and $b \in B$,
$$
X(a,b) := \sum_{i=1}^n \sum_{j=1}^N g_{ij}\, b_i a_j.
$$
Given standard Gaussian vectors $h = (h_i)_{i\le n}$ on $\mathbb{R}^n$ and $g = (g_j)_{j\le N}$ on $\mathbb{R}^N$, we consider a linear form
$$
Y(a,b) := \sum_{i=1}^n h_i b_i + \sum_{j=1}^N g_j a_j.
$$
Then, the comparison inequality (10.62) in Example 10.4.3 gives
$$
E \min_{a\in A}\max_{b\in B} Y(a,b) \le E \min_{a\in A}\max_{b\in B} X(a,b)
\le E \max_{a\in A}\max_{b\in B} X(a,b) \le E \max_{a\in A}\max_{b\in B} Y(a,b). \tag{10.77}
$$

If we denote
$$
D := \max_{a\in A}(g,a) = \max_{a\in A}\sum_{j=1}^N g_j a_j, \tag{10.78}
$$
then (10.77) can be rewritten as (also using $g \stackrel{d}{=} -g$)
$$
E\|h\| - E D \le E \min_{a\in A}\|G a\| \le E \max_{a\in A}\|G a\| \le E\|h\| + E D.
$$
From Exercise 10.4.1 we know that
$$
\frac{n}{\sqrt{n+1}} \le E\|h\| = \frac{\sqrt{2}\,\Gamma\bigl(\frac{n+1}{2}\bigr)}{\Gamma\bigl(\frac{n}{2}\bigr)} \le \sqrt{n}.
$$

To estimate $ED$, take any $\beta > 0$ and write
$$
E \max_{a\in A}(g,a) \le E\,\frac{1}{\beta}\log \sum_{a\in A} e^{\beta (g,a)}
\le \frac{1}{\beta}\log E \sum_{a\in A} e^{\beta (g,a)}
= \frac{1}{\beta}\log \sum_{a\in A} e^{\beta^2/2} = \frac{\beta}{2} + \frac{1}{\beta}\log \mathrm{card}(A).
$$
Optimizing over $\beta$ gives
$$
E D \le \sqrt{2\log \mathrm{card}(A)} \le 2\sqrt{\log m} < \varepsilon\sqrt{n},
$$
since $\mathrm{card}(A) \le m^2$ and we assumed that $n > \frac{4}{\varepsilon^2}\log m$. Together, the inequalities above imply
$$
\sqrt{\frac{n}{n+1}} - \varepsilon < \frac{1}{\sqrt{n}}\, E \min_{a\in A}\|G a\|
\le \frac{1}{\sqrt{n}}\, E \max_{a\in A}\|G a\| < 1 + \varepsilon.
$$
This implies that
$$
\Bigl(\sqrt{\tfrac{n}{n+1}} - \varepsilon\Bigr)^{-1} \frac{1}{\sqrt{n}}\, E \min_{a\in A}\|G a\|
> 1 > (1+\varepsilon)^{-1} \frac{1}{\sqrt{n}}\, E \max_{a\in A}\|G a\|
$$
and, therefore,
$$
E\Bigl[\min_{a\in A}\|G a\| - (1+\varepsilon)^{-1}\Bigl(\sqrt{\tfrac{n}{n+1}} - \varepsilon\Bigr)\max_{a\in A}\|G a\|\Bigr] > 0.
$$
Therefore, there exists $G$ such that
$$
\min_{a\in A}\|G a\| - (1+\varepsilon)^{-1}\Bigl(\sqrt{\tfrac{n}{n+1}} - \varepsilon\Bigr)\max_{a\in A}\|G a\| > 0.
$$
If we rescale $G$ so that $\max_{a\in A}\|G a\| = 1 + \varepsilon$, we get that
$$
\min_{a\in A}\|G a\| > \sqrt{\frac{n}{n+1}} - \varepsilon.
$$
This shows the existence of a linear map $G$ with the distortion property (10.75). ⊓⊔

Suppose that we also want to control the distortion of the areas of triangles formed by any three points $v_i$, $v_j$ and $v_k$. This can be done as follows. First, the areas of all triangles in the same (two-dimensional) plane are distorted by the same factor under a linear transformation. For each three points, we can add another point $v_{ijk}$ in the same plane, which forms an isosceles right triangle with $v_i$, $v_j$. This means that, instead of $m$ points, we now have $m + \binom{m}{3} \le m^3$ points on the list. We can now find $f$ as in Theorem 10.6 for this new enlarged list of points. Since the dependence on $m$ is logarithmic, we lose a factor of 3 in the bound (10.74) for $n$. It remains to solve the following exercise to see that this procedure allows us to control the distortion of areas.

Exercise 10.5.1. Suppose that three points $v_i$, $v_j$ and $v_k$ in the setting of Theorem 10.6 form an isosceles right triangle $\Delta$. Show that
$$
1 - 5\varepsilon \le \sqrt{\frac{\mathrm{Area}(f(\Delta))}{\mathrm{Area}(\Delta)}} \le 1 + \varepsilon.
$$

Exercise 10.5.2. Show that, for $\delta \in (0,1)$, if the dimension
$$
n > \frac{8}{\varepsilon^2}\log\frac{m}{\sqrt{\delta}},
$$
then the map $f(x) = \frac{1}{\sqrt{n}} G x$ in the above proof satisfies (10.75) with probability at least $1 - \delta$.
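A small sanity check of this exercise (a sketch with arbitrary $(m, \varepsilon, \delta)$ triples, not from the text): the stated dimension makes the union bound from the first proof, $m^2 e^{-n\varepsilon^2/4}$, at most $\delta$, which is exactly how the exercise can be solved.

```python
import math

def jl_dim(m, eps, delta):
    # smallest integer n with n > (8/eps^2) * log(m / sqrt(delta))
    return math.floor(8 / eps ** 2 * math.log(m / math.sqrt(delta))) + 1

def failure_bound(n, m, eps):
    # union bound from the first proof: P(some pair is distorted) <= m^2 exp(-n eps^2 / 4)
    return m ** 2 * math.exp(-n * eps ** 2 / 4)

for m, eps, delta in [(10 ** 3, 0.1, 0.01), (10 ** 6, 0.2, 0.001), (10 ** 9, 0.5, 0.05)]:
    n = jl_dim(m, eps, delta)
    assert failure_bound(n, m, eps) <= delta
    print(m, eps, delta, n, failure_bound(n, m, eps))
```
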

10.6 Intersecting random half-spaces

In this section, we will show another application of Gaussian comparison. Consider the following problem. Let $g_1, \ldots, g_m$ be i.i.d. standard Gaussian random vectors on $\mathbb{R}^n$, and let $\kappa \in \mathbb{R}$ be a fixed number. Consider a set
$$
A = \bigcap_{\ell=1}^m \bigl\{ x \in S^{n-1} : g_\ell \cdot x \ge \kappa \bigr\}, \tag{10.79}
$$
which is an intersection of the unit sphere with $m$ random half-spaces defined by random directions $g_\ell$ and some threshold $\kappa$. We will think of the dimension $n$ as being large, and we would like to know how many random half-spaces we need to intersect to make this set empty with high probability. In physics this is known as a "perceptron capacity" problem. For negative $\kappa$ this is an open problem, but for $\kappa \ge 0$ the answer is known. It turns out that $m$ should be proportional to $n$, i.e. $m = \alpha n$, and the following holds. We write $x_+ = \max(x,0)$ and let $z \sim N(0,1)$.

Theorem 10.7 (Gardner–Stojnic theorem). For all $\kappa \in \mathbb{R}$,
$$
\alpha\, E(z+\kappa)_+^2 > 1 \ \Longrightarrow\ P(A = \emptyset) \ge 1 - e^{-cn} \tag{10.80}
$$
for some constant $c = c(\alpha,\kappa) > 0$ and, for all $\kappa \ge 0$,
$$
\alpha\, E(z+\kappa)_+^2 < 1 \ \Longrightarrow\ P(A \ne \emptyset) \ge 1 - e^{-cn} \tag{10.81}
$$
for some constant $c = c(\alpha,\kappa) > 0$.

The first statement holds for all $\kappa$, but for negative $\kappa$ this threshold is not the correct one. The second statement shows that the threshold is sharp for nonnegative $\kappa$, in the sense that for $\alpha$ below the threshold the set $A$ is not empty with high probability and for $\alpha$ above the threshold the set $A$ is empty with high probability. For $\kappa = 0$, which corresponds to half-spaces through the origin, the threshold is $\alpha = 2$, which means that one must intersect more than $2n$ half-spaces to get an empty intersection with the unit sphere with high probability.
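For $\kappa = 0$ the threshold $\alpha = 2$ can be cross-checked against a classical exact formula that is not part of this text: by Wendel's theorem, the probability that $m$ i.i.d. symmetric directions all lie in a common half-space through the origin, which for $\kappa = 0$ is exactly the event $\{A \ne \emptyset\}$, equals $2^{1-m}\sum_{k=0}^{n-1}\binom{m-1}{k}$. The sketch below evaluates this formula near $m = 2n$.

```python
import math

def p_nonempty_kappa0(m, n):
    # For kappa = 0, A is non-empty iff all m directions g_l lie in a common
    # half-space through the origin; by Wendel's theorem (classical, not proved
    # in the text) this probability is exactly 2^(1-m) * sum_{k<n} C(m-1, k).
    return 2.0 ** (1 - m) * sum(math.comb(m - 1, k) for k in range(n))

n = 200
for alpha in (1.8, 2.0, 2.2):
    m = int(alpha * n)
    print(alpha, p_nonempty_kappa0(m, n))  # near 1 below alpha = 2, near 0 above
```

At $m = 2n$ the formula gives exactly $1/2$, and the probability drops from near one to near zero in a window of width $O(\sqrt{n})$ around $m = 2n$, consistent with the sharp threshold in Theorem 10.7.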

Proof (of (10.80)). Assume $\alpha E(z+\kappa)_+^2 > 1$. Let $G$ be an $m\times n$ matrix whose entries are i.i.d. standard Gaussian random variables, and let $1 = (1,\ldots,1)^T \in \mathbb{R}^m$. Then $A$ is non-empty if and only if there exists $x \in S^{n-1}$ such that $Gx \ge \kappa 1$, where the inequality of two vectors should be interpreted coordinate-wise. We notice that this is equivalent to
$$
\min_x \max_\lambda\, \lambda^T(\kappa 1 - Gx) \le 0, \tag{10.82}
$$
where the minimax is taken over $(x,\lambda) \in S^{n-1}\times B_+^m$ and where
$$
B_+^m = \bigl\{ \lambda \in \mathbb{R}^m : \|\lambda\| \le 1;\ \lambda_i \ge 0,\ 1 \le i \le m \bigr\}.
$$
From now on we consider only such pairs $(x,\lambda) \in S^{n-1}\times B_+^m$. Therefore, to show (10.80) it suffices to show that there exists $\delta > 0$ such that
$$
P\Bigl(\min_x\max_\lambda\, \lambda^T(\kappa 1 - Gx) \ge \delta\Bigr) \ge 1 - e^{-cn}.
$$

We start with the following observation. Let $z \sim N(0,1)$ be a standard Gaussian random variable independent of the entries in $G$. Then, for any $\delta, \varepsilon > 0$,
$$
P\Bigl(\min_x\max_\lambda\, \lambda^T(\kappa 1 - Gx) + z \ge \varepsilon\sqrt{n}\Bigr)
\le P\Bigl(\min_x\max_\lambda\, \lambda^T(\kappa 1 - Gx) \ge \delta\Bigr) + P(z \ge \varepsilon\sqrt{n} - \delta).
$$
The second term on the right-hand side can be bounded by $e^{-\varepsilon^2 n/8}$, so we will try to bound the probability on the left-hand side from below. We will use Gordon's inequality in Theorem 10.4 in Section 10.3. Let
$$
X(x,\lambda) := -\lambda^T G x + z, \qquad Y(x,\lambda) := \lambda^T g + x^T h,
$$

where $g \in \mathbb{R}^m$, $h \in \mathbb{R}^n$ are independent standard Gaussian random vectors. It is easy to see that
$$
E X(x_1,\lambda_1)X(x_2,\lambda_2) = E(-\lambda_1^T G x_1 + z)(-\lambda_2^T G x_2 + z) = (\lambda_1\cdot\lambda_2)(x_1\cdot x_2) + 1,
$$
and
$$
E Y(x_1,\lambda_1)Y(x_2,\lambda_2) = E(\lambda_1^T g + x_1^T h)(\lambda_2^T g + x_2^T h) = \lambda_1\cdot\lambda_2 + x_1\cdot x_2.
$$
If $(x_1,\lambda_1) = (x_2,\lambda_2)$ then both variances are equal to 2. Also,
$$
E X(x_1,\lambda_1)X(x_2,\lambda_2) - E Y(x_1,\lambda_1)Y(x_2,\lambda_2) = (1 - \lambda_1\cdot\lambda_2)(1 - x_1\cdot x_2) \ge 0,
$$
by the Cauchy–Schwarz inequality. However, if $x_1 = x_2$,
$$
E X(x_1,\lambda_1)X(x_1,\lambda_2) - E Y(x_1,\lambda_1)Y(x_1,\lambda_2) = (1 - \lambda_1\cdot\lambda_2)(1 - \|x_1\|^2) = 0 \le 0.
$$
Therefore, the assumptions in Theorem 10.4 are satisfied and
$$
P\Bigl(\min_x\max_\lambda\, \bigl[X(x,\lambda) + \kappa\lambda^T 1\bigr] \ge \varepsilon\sqrt{n}\Bigr)
\ge P\Bigl(\min_x\max_\lambda\, \bigl[Y(x,\lambda) + \kappa\lambda^T 1\bigr] \ge \varepsilon\sqrt{n}\Bigr).
$$

The minimax on the right-hand side can be computed explicitly,
$$
\min_x\max_\lambda\, \bigl[Y(x,\lambda) + \kappa\lambda^T 1\bigr]
= \max_{\lambda\in B_+^m} \lambda^T(g + \kappa 1) + \min_{x\in S^{n-1}} x^T h
= \|(g+\kappa 1)_+\| - \|h\|,
$$
where $(g+\kappa 1)_+ = ((g_i+\kappa)_+)_{i\le m}$. By the assumption $\alpha E(z+\kappa)_+^2 > 1$, we can choose $\varepsilon > 0$ small enough such that $\alpha E(z+\kappa)_+^2 \ge (1+3\varepsilon)^2$, or
$$
\bigl(m E(z+\kappa)_+^2\bigr)^{1/2} - \sqrt{n} - 2\varepsilon\sqrt{n} \ge \varepsilon\sqrt{n}.
$$

Therefore,
$$
P\Bigl(\min_x\max_\lambda\, \bigl[X(x,\lambda) + \kappa\lambda^T 1\bigr] \ge \varepsilon\sqrt{n}\Bigr)
\ge P\bigl(\|(g+\kappa 1)_+\| - \|h\| \ge \varepsilon\sqrt{n}\bigr)
$$
$$
\ge P\Bigl(\|(g+\kappa 1)_+\| - \|h\| \ge \bigl(m E(z+\kappa)_+^2\bigr)^{1/2} - \sqrt{n} - 2\varepsilon\sqrt{n}\Bigr)
$$
$$
\ge 1 - P\bigl(\|h\| \ge \sqrt{n} + \varepsilon\sqrt{n}\bigr)
- P\Bigl(\|(g+\kappa 1)_+\| \le \bigl(m E(z+\kappa)_+^2\bigr)^{1/2} - \varepsilon\sqrt{n}\Bigr).
$$
The first probability is exponentially small by the Gaussian concentration inequality for the norm $\|h\|$. In the last term, we can rewrite the event as
$$
\sum_{i=1}^m (g_i+\kappa)_+^2 \le m E(z+\kappa)_+^2 - C_\varepsilon n,
$$
where $C_\varepsilon = 2\varepsilon\bigl(\alpha E(z+\kappa)_+^2\bigr)^{1/2} - \varepsilon^2 \ge 2\varepsilon - \varepsilon^2 \ge \varepsilon$ for $\varepsilon \in (0,1)$. One can then rewrite this as
$$
\sum_{i=1}^m \bigl[E(g_i+\kappa)_+^2 - (g_i+\kappa)_+^2\bigr] \ge C_\varepsilon n,
$$
and use Markov's inequality as in Section 2.1 to show that this probability is exponentially small. We will leave this as an exercise. ⊓⊔

Proof (of (10.81)). Now we assume that $\alpha E(z+\kappa)_+^2 < 1$ and $\kappa \ge 0$. Because $\kappa \ge 0$, the existence of a point $x \in S^{n-1}$ such that $Gx \ge \kappa 1$ is equivalent to the existence of a point $x \in B^n$ in the unit ball such that $Gx \ge \kappa 1$. Analogously to (10.82), this is equivalent to
$$
\min_x\max_\lambda\, \lambda^T(\kappa 1 - Gx) \le 0, \tag{10.83}
$$
where the minimax is now taken over $(x,\lambda) \in B^n\times B_+^m$. Because both $B^n$ and $B_+^m$ are convex and compact, by Sion's minimax theorem,
$$
\min_x\max_\lambda\, \lambda^T(\kappa 1 - Gx) = \max_\lambda\min_x\, \lambda^T(\kappa 1 - Gx) = -\min_\lambda\max_x\, \lambda^T(Gx - \kappa 1).
$$
Therefore, our goal is to show that
$$
P\Bigl(\min_\lambda\max_x\, \lambda^T(Gx - \kappa 1) \ge 0\Bigr) \ge 1 - e^{-cn}
$$

for some $c > 0$. In this direction, the calculation will be based on the comparison inequality in Example 10.4.2 in Section 10.4.

As above, let $z \sim N(0,1)$ be a standard Gaussian random variable independent of the entries in $G$. Then, for any $\varepsilon > 0$,
$$
P\Bigl(\min_\lambda\max_x\, \bigl[\lambda^T Gx - \kappa\lambda^T 1 + z\|\lambda\|\|x\| - \varepsilon\sqrt{n}\,\|\lambda\|\|x\|\bigr] \ge 0\Bigr)
\le P\Bigl(\min_\lambda\max_x\, \bigl[\lambda^T Gx - \kappa\lambda^T 1\bigr] \ge 0\Bigr) + P(z \ge \varepsilon\sqrt{n}).
$$
The second term on the right-hand side can be bounded by $e^{-\varepsilon^2 n/2}$, so we will try to bound the probability on the left-hand side from below. We will again use Gordon's inequality in Theorem 10.4, only now
$$
X(x,\lambda) := \lambda^T Gx + z\|\lambda\|\|x\|, \qquad Y(x,\lambda) := \|x\|\,\lambda^T g + \|\lambda\|\,x^T h,
$$

where $g \in \mathbb{R}^m$, $h \in \mathbb{R}^n$ are independent standard Gaussian random vectors. One can check that
$$
E X(x_1,\lambda_1)X(x_2,\lambda_2) = (\lambda_1\cdot\lambda_2)(x_1\cdot x_2) + \|x_1\|\|x_2\|\|\lambda_1\|\|\lambda_2\|,
$$
and
$$
E Y(x_1,\lambda_1)Y(x_2,\lambda_2) = \|x_1\|\|x_2\|(\lambda_1\cdot\lambda_2) + \|\lambda_1\|\|\lambda_2\|(x_1\cdot x_2).
$$
If $(x_1,\lambda_1) = (x_2,\lambda_2)$ then the variances are again equal. Also,
$$
E X(x_1,\lambda_1)X(x_2,\lambda_2) - E Y(x_1,\lambda_1)Y(x_2,\lambda_2)
= \bigl(\|\lambda_1\|\|\lambda_2\| - \lambda_1\cdot\lambda_2\bigr)\bigl(\|x_1\|\|x_2\| - x_1\cdot x_2\bigr) \ge 0,
$$
by the Cauchy–Schwarz inequality. However, if $\lambda_1 = \lambda_2$,
$$
E X(x_1,\lambda_1)X(x_2,\lambda_1) - E Y(x_1,\lambda_1)Y(x_2,\lambda_1) = 0 \le 0.
$$
Therefore, the assumptions in Theorem 10.4 are satisfied and
$$
P\Bigl(\min_\lambda\max_x\, \bigl[X(x,\lambda) - \kappa\lambda^T 1 - \varepsilon\sqrt{n}\,\|\lambda\|\|x\|\bigr] \ge 0\Bigr)
\ge P\Bigl(\min_\lambda\max_x\, \bigl[Y(x,\lambda) - \kappa\lambda^T 1 - \varepsilon\sqrt{n}\,\|\lambda\|\|x\|\bigr] \ge 0\Bigr).
$$

The minimax on the right-hand side can be computed explicitly. First, we maximize over $x \in B^n$,
$$
\max_x \bigl(\|x\|\,\lambda^T g + \|\lambda\|\,x^T h - \varepsilon\sqrt{n}\,\|x\|\|\lambda\|\bigr)
= \max_{r\in[0,1],\, v\in S^{n-1}} r\bigl(\lambda^T g + \|\lambda\|\,v^T h - \varepsilon\sqrt{n}\,\|\lambda\|\bigr)
$$
$$
= \max_{r\in[0,1]} r\bigl(\lambda^T g + \|\lambda\|\|h\| - \varepsilon\sqrt{n}\,\|\lambda\|\bigr)
= \max\bigl(0,\ \lambda^T g + \|\lambda\|\|h\| - \varepsilon\sqrt{n}\,\|\lambda\|\bigr)
\ge \lambda^T g - \varepsilon\sqrt{n}\,\|\lambda\| + \|\lambda\|\|h\|.
$$
Therefore, the above probability is bounded from below by
$$
P\Bigl(\min_\lambda\, \bigl[\lambda^T g - \varepsilon\sqrt{n}\,\|\lambda\| + \|\lambda\|\|h\| - \kappa\lambda^T 1\bigr] \ge 0\Bigr).
$$

If we represent $\lambda \in B_+^m$ as $\lambda = rv$ for $r \in [0,1]$ and
$$
v \in S_+^{m-1} = \bigl\{ \lambda \in S^{m-1} : \lambda_i \ge 0,\ 1 \le i \le m \bigr\},
$$
the minimum over $\lambda \in B_+^m$ above can be computed as follows,
$$
\min_{r\in[0,1],\, v\in S_+^{m-1}} r\bigl[v^T(g - \kappa 1) - \varepsilon\sqrt{n} + \|h\|\bigr]
= \min_{r\in[0,1]} r\bigl(-\|(\kappa 1 - g)_+\| - \varepsilon\sqrt{n} + \|h\|\bigr)
= \min\bigl(0,\ -\|(\kappa 1 - g)_+\| - \varepsilon\sqrt{n} + \|h\|\bigr).
$$
Therefore, the above probability equals
$$
P\bigl(-\|(\kappa 1 - g)_+\| - \varepsilon\sqrt{n} + \|h\| \ge 0\bigr)
= P\bigl(-\|(\kappa 1 + g)_+\| + \|h\| \ge \varepsilon\sqrt{n}\bigr),
$$

using the symmetry $g \stackrel{d}{=} -g$. Since $\alpha E(z+\kappa)_+^2 < 1$, we can find $\varepsilon > 0$ such that
$$
-\bigl(m E(z+\kappa)_+^2\bigr)^{1/2} + \sqrt{n} - 2\varepsilon\sqrt{n} > \varepsilon\sqrt{n}.
$$
As a result, the probability above can be bounded from below by
$$
1 - P\bigl(\|h\| \le \sqrt{n} - \varepsilon\sqrt{n}\bigr)
- P\Bigl(\|(g+\kappa 1)_+\| \ge \bigl(m E(z+\kappa)_+^2\bigr)^{1/2} + \varepsilon\sqrt{n}\Bigr).
$$
The rest of the argument is the same as in the first part above, based on Gaussian concentration and the exercise below. ⊓⊔

Exercise 10.6.1. Show that
$$
P\Bigl(\,\sum_{i=1}^m \bigl[E(g_i+\kappa)_+^2 - (g_i+\kappa)_+^2\bigr] \ge C_\varepsilon n\Bigr) \le e^{-cn}
$$
for some constant $c$ that depends on $\kappa$, $\alpha$ and $C_\varepsilon$.
