24, 2011
Amir Dembo
Email address: amir@math.stanford.edu
Department of Mathematics, Stanford University, Stanford, CA 94305.
Contents
Preface 5
Chapter 1. Probability, measure and integration 7
1.1. Probability spaces, measures and σ-algebras 7
1.2. Random variables and their distribution 18
1.3. Integration and the (mathematical) expectation 30
1.4. Independence and product measures 54
Chapter 2. Asymptotics: the law of large numbers 71
2.1. Weak laws of large numbers 71
2.2. The Borel-Cantelli lemmas 77
2.3. Strong law of large numbers 85
Chapter 3. Weak convergence, clt and Poisson approximation 95
3.1. The Central Limit Theorem 95
3.2. Weak convergence 103
3.3. Characteristic functions 117
3.4. Poisson approximation and the Poisson process 133
3.5. Random vectors and the multivariate clt 140
Chapter 4. Conditional expectations and probabilities 151
4.1. Conditional expectation: existence and uniqueness 151
4.2. Properties of the conditional expectation 156
4.3. The conditional expectation as an orthogonal projection 164
4.4. Regular conditional probability distributions 169
Chapter 5. Discrete time martingales and stopping times 175
5.1. Definitions and closure properties 175
5.2. Martingale representations and inequalities 184
5.3. The convergence of martingales 191
5.4. The optional stopping theorem 204
5.5. Reversed MGs, likelihood ratios and branching processes 210
Chapter 6. Markov chains 225
6.1. Canonical construction and the strong Markov property 225
6.2. Markov chains with countable state space 233
6.3. General state space: Doeblin and Harris chains 255
Chapter 7. Continuous, Gaussian and stationary processes 269
7.1. Definition, canonical construction and law 269
7.2. Continuous and separable modifications 274
7.3. Gaussian and stationary processes 284
Chapter 8. Continuous time martingales and Markov processes 289
8.1. Continuous time filtrations and stopping times 289
8.2. Continuous time martingales 294
8.3. Markov and Strong Markov processes 317
Chapter 9. The Brownian motion 341
9.1. Brownian transformations, hitting times and maxima 341
9.2. Weak convergence and invariance principles 348
9.3. Brownian path: regularity, local maxima and level sets 367
Bibliography 375
Index 377
Preface
These are the lecture notes for a year long, PhD level course in Probability Theory
that I taught at Stanford University in 2004, 2006 and 2009. The goal of this
course is to prepare incoming PhD students in Stanford's mathematics and statistics
departments to do research in probability theory. More broadly, the goal of the text
is to help the reader master the mathematical foundations of probability theory
and the techniques most commonly used in proving theorems in this area. This is
then applied to the rigorous study of the most fundamental classes of stochastic
processes.
Towards this goal, we introduce in Chapter 1 the relevant elements from measure and integration theory, namely, the probability space and the σ-algebras of events in it, random variables viewed as measurable functions, their expectation as the corresponding Lebesgue integral, and the important concept of independence.
Utilizing these elements, we study in Chapter 2 the various notions of convergence
of random variables and derive the weak and strong laws of large numbers.
Chapter 3 is devoted to the theory of weak convergence, the related concepts
of distribution and characteristic functions and two important special cases: the
Central Limit Theorem (in short clt) and the Poisson approximation.
Drawing upon the framework of Chapter 1, we devote Chapter 4 to the definition, existence and properties of the conditional expectation and the associated regular conditional probability distribution.
Chapter 5 deals with filtrations, the mathematical notion of information progression in time, and with the corresponding stopping times. Results about the latter are obtained as a by-product of the study of a collection of stochastic processes called martingales. Martingale representations are explored, as well as maximal inequalities, convergence theorems and various applications thereof. Aiming for a clearer and easier presentation, we focus here on the discrete time setting, deferring the continuous time counterpart to Chapter 8.
Chapter 6 provides a brief introduction to the theory of Markov chains, a vast subject at the core of probability theory, to which many textbooks are devoted. We illustrate some of the interesting mathematical properties of such processes by examining a few special cases of interest.
Chapter 7 sets the framework for studying right-continuous stochastic processes indexed by a continuous time parameter, introduces the family of Gaussian processes and rigorously constructs the Brownian motion as a Gaussian process of continuous sample path and zero-mean, stationary independent increments.
Chapter 8 expands our earlier treatment of martingales and strong Markov processes to the continuous time setting, emphasizing the role of right-continuous filtration. The mathematical structure of such processes is then illustrated both in the context of Brownian motion and that of Markov jump processes.
Building on this, in Chapter 9 we reconstruct the Brownian motion via the invariance principle as the limit of certain rescaled random walks. We further delve into the rich properties of its sample path and the many applications of Brownian motion to the clt and the Law of the Iterated Logarithm (in short, lil).
The intended audience for this course should have prior exposure to stochastic processes, at an informal level. While students are assumed to have taken a real analysis class dealing with Riemann integration, and to have mastered this material well, prior knowledge of measure theory is not assumed.
It is quite clear that these notes are much influenced by the textbooks [Bil95, Dur10, Wil91, KaS97] I have been using.
I thank my students out of whose work this text materialized and my teaching assistants Su Chen, Kshitij Khare, Guoqiang Hu, Julia Salzman, Kevin Sun and Hua Zhou for their help in the assembly of the notes of more than eighty students into a coherent document. I am also much indebted to Kevin Ross, Andrea Montanari and Oana Mocioalca for their feedback on earlier drafts of these notes, to Kevin Ross for providing all the figures in this text, and to Andrea Montanari, David Siegmund and Tze Lai for contributing some of the exercises in these notes.
Amir Dembo
Stanford, California
April 2010
CHAPTER 1
Probability, measure and integration
This chapter is devoted to the mathematical foundations of probability theory. Section 1.1 introduces the basic measure theory framework, namely, the probability space and the σ-algebras of events in it. The next building blocks are random variables, introduced in Section 1.2 as measurable functions ω ↦ X(ω), and their distribution.
This allows us to define in Section 1.3 the important concept of expectation as the corresponding Lebesgue integral, extending the horizon of our discussion beyond the special functions and variables with density, to which elementary probability theory is limited. Section 1.4 concludes the chapter by considering independence, the most fundamental aspect that differentiates probability from (general) measure theory, and the associated product measures.
1.1. Probability spaces, measures and σ-algebras
We shall define here the probability space (Ω, F, P) using the terminology of measure theory.
The sample space Ω is a set of all possible outcomes ω ∈ Ω of some random experiment. Probabilities are assigned by A ↦ P(A) to A in a subset F of all possible sets of outcomes. The event space F represents both the amount of information available as a result of the experiment conducted and the collection of all subsets of possible interest to us, where we denote the elements of F as events. A pleasant mathematical framework results by imposing on F the structural conditions of a σ-algebra, as done in Subsection 1.1.1. The most common and useful choices for this σ-algebra are then explored in Subsection 1.1.2. Subsection 1.1.3 provides fundamental supplements from measure theory, namely Dynkin's and Carathéodory's theorems and their application to the construction of Lebesgue measure.
1.1.1. The probability space (Ω, F, P). We use 2^Ω to denote the set of all subsets of Ω. The event space is then a subset F of 2^Ω, consisting of all allowed events, that is, those subsets of Ω to which we shall assign probabilities. We next define the structural conditions imposed on F.
Definition 1.1.1. We say that F ⊆ 2^Ω is a σ-algebra (also called a σ-field), if
(a) Ω ∈ F,
(b) if A ∈ F then also A^c ∈ F (where A^c = Ω \ A denotes the complement of A),
(c) if A_i ∈ F for i = 1, 2, 3, . . . then also ∪_i A_i ∈ F.
Remark. Using DeMorgan's law, we know that (∪_i A_i^c)^c = ∩_i A_i. Thus the following is equivalent to property (c) of Definition 1.1.1:
(c') If A_i ∈ F for i = 1, 2, 3, . . . then also ∩_i A_i ∈ F.
Definition 1.1.2. A pair (Ω, F) with F a σ-algebra of subsets of Ω is called a measurable space. Given a measurable space (Ω, F), a measure μ is any countably additive non-negative set function on this space. That is, μ : F → [0, ∞], having the properties:
(a) μ(A) ≥ μ(∅) = 0 for all A ∈ F.
(b) μ(∪_n A_n) = Σ_n μ(A_n) for any countable collection of disjoint sets A_n ∈ F.
When in addition μ(Ω) = 1, we call the measure μ a probability measure, and often label it by P (it is also easy to see that then P(A) ≤ 1 for all A ∈ F).
Remark. When (b) of Definition 1.1.2 is relaxed to involve only finite collections of disjoint sets A_n, we say that μ is a finitely additive non-negative set function. In measure theory we sometimes consider signed measures, whereby μ is no longer non-negative, hence its range is [−∞, ∞], and say that such a measure is finite when its range is R (i.e. no set in F is assigned an infinite measure).
Definition 1.1.3. A measure space is a triplet (Ω, F, μ), with μ a measure on the measurable space (Ω, F). A measure space (Ω, F, P) with P a probability measure is called a probability space.
The next exercise collects some of the fundamental properties shared by all probability measures.
Exercise 1.1.4. Let (Ω, F, P) be a probability space and A, B, A_i events in F. Prove the following properties of every probability measure.
(a) Monotonicity. If A ⊆ B then P(A) ≤ P(B).
(b) Sub-additivity. If A ⊆ ∪_i A_i then P(A) ≤ Σ_i P(A_i).
(c) Continuity from below: If A_i ↑ A, that is, A_1 ⊆ A_2 ⊆ . . . and ∪_i A_i = A, then P(A_i) ↑ P(A).
(d) Continuity from above: If A_i ↓ A, that is, A_1 ⊇ A_2 ⊇ . . . and ∩_i A_i = A, then P(A_i) ↓ P(A).
Remark. In the more general context of measure theory, note that properties (a)-(c) of Exercise 1.1.4 hold for any measure μ, whereas the continuity from above holds whenever μ(A_i) < ∞ for all i sufficiently large. Here is more on this:
Exercise 1.1.5. Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the "continuity" property
B_n ∈ F, B_n ↓ ∅, μ(B_n) < ∞ ⟹ μ(B_n) → 0,
must be countably additive if μ(Ω) < ∞. Give an example showing that this is not necessarily so when μ(Ω) = ∞.
The σ-algebra F always contains at least the set Ω and its complement, the empty set ∅. Necessarily, P(Ω) = 1 and P(∅) = 0. So, if we take F_0 = {∅, Ω} as our σ-algebra, then we are left with no degrees of freedom in the choice of P. For this reason we call F_0 the trivial σ-algebra. Fixing Ω, we may expect that the larger the σ-algebra we consider, the more freedom we have in choosing the probability measure. This indeed holds to some extent, that is, as long as we have no problem satisfying the requirements in the definition of a probability measure. A natural question is when should we expect the maximal possible σ-algebra F = 2^Ω to be useful?
Example 1.1.6. When the sample space Ω is countable we can and typically shall take F = 2^Ω. Indeed, in such a situation we assign a probability p_ω > 0 to each ω ∈ Ω, making sure that Σ_{ω∈Ω} p_ω = 1, so that P(A) = Σ_{ω∈A} p_ω for every A ⊆ Ω. A notable special case for finite Ω is p_ω = 1/|Ω|, the uniform measure on Ω, whereby computing probabilities is the same as counting. Concrete examples are a single coin toss, for which we have Ω_1 = {H, T} (ω = H if the coin lands on its head and ω = T if it lands on its tail), and F_1 = {∅, Ω_1, {H}, {T}}, or when we consider a finite number of coin tosses, say n, in which case Ω_n = {(ω_1, . . . , ω_n) : ω_i ∈ {H, T}, i = 1, . . . , n} is the set of all possible n-tuples of coin tosses, while F_n = 2^{Ω_n} is the collection of all possible sets of n-tuples of coin tosses. Another example pertains to the set of all non-negative integers Ω = {0, 1, 2, . . .} and F = 2^Ω, where we get the Poisson probability measure of parameter λ > 0 upon setting p_k = e^{−λ} λ^k / k! for k = 0, 1, 2, . . ..
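For a finite sample space carrying the uniform measure, computing probabilities really is counting. Here is a minimal Python sketch of the n = 3 coin-tossing space of Example 1.1.6 (the helper names omega and prob are ours, purely for illustration):

```python
from itertools import product

# Sample space of n = 3 coin tosses, with the uniform measure p_w = 1/|Omega|.
n = 3
omega = list(product("HT", repeat=n))       # all 2^n = 8 possible outcomes
p = {w: 1 / len(omega) for w in omega}      # uniform probability of each outcome

def prob(event):
    """P(A) = sum of p_w over the outcomes w in the event A."""
    return sum(p[w] for w in event)

# Event A: the first toss lands on its head.
A = [w for w in omega if w[0] == "H"]
print(prob(A))    # 0.5, i.e. |A| / |Omega| = 4/8
```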
When Ω is uncountable such a strategy as in Example 1.1.6 will no longer work. The problem is that in this case we cannot assign a positive probability p_ω to uncountably many outcomes ω while keeping the total mass equal to one.
Definition 1.1.7. We say that a probability space (Ω, F, P) is non-atomic, or alternatively call P non-atomic, if P(A) > 0 implies the existence of B ∈ F, B ⊂ A, with 0 < P(B) < P(A).
Indeed, in contrast to the case of countable Ω, the generic uncountable sample space results in a non-atomic probability space (c.f. Exercise 1.1.27). Here is an interesting property of such spaces (see also [Bil95, Problem 2.19]).
Exercise 1.1.8. Suppose P is non-atomic and A ∈ F with P(A) > 0.
(a) Show that for every ε > 0, we have B ⊆ A such that 0 < P(B) < ε.
(b) Prove that if 0 < a < P(A) then there exists B ⊆ A with P(B) = a.
Hint: Fix ε_n ↓ 0 and define inductively numbers x_n and sets G_n ∈ F, with H_0 = ∅, H_n = ∪_{k<n} G_k, x_n = sup{P(G) : G ⊆ A \ H_n, P(H_n ∪ G) ≤ a} and G_n ⊆ A \ H_n such that P(H_n ∪ G_n) ≤ a and P(G_n) ≥ (1 − ε_n) x_n. Consider B = ∪_k G_k.
As you show next, the collection of all measures on a given space is a convex cone.
Exercise 1.1.9. Given any measures μ_n, n ≥ 1, on (Ω, F), verify that μ = Σ_{n=1}^∞ c_n μ_n is also a measure on this space, for any finite constants c_n ≥ 0.
Here are a few properties of probability measures for which the conclusions of Exercise 1.1.4 are useful.
Exercise 1.1.10. A function d : A × A → [0, ∞) is called a semi-metric on the set A if d(x, x) = 0, d(x, y) = d(y, x) and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z) holds. With A∆B = (A ∩ B^c) ∪ (A^c ∩ B) denoting the symmetric difference of subsets A and B of Ω, show that for any probability space (Ω, F, P) the function d(A, B) = P(A∆B) is a semi-metric on F.
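On a finite probability space the semi-metric of Exercise 1.1.10 can be verified by direct enumeration. A small sketch under the uniform measure on two coin tosses (the event choices are arbitrary):

```python
from itertools import product

omega = list(product("HT", repeat=2))   # four equally likely outcomes

def prob(event):
    return len(set(event)) / len(omega)

def d(A, B):
    """d(A, B) = P(A symmetric-difference B); `^` on Python sets is exactly that."""
    return prob(set(A) ^ set(B))

A = {w for w in omega if w[0] == "H"}      # first toss is H
B = {w for w in omega if w[1] == "H"}      # second toss is H
C = {("T", "T")}

assert d(A, A) == 0                        # d(x, x) = 0
assert d(A, B) == d(B, A)                  # symmetry
assert d(A, C) <= d(A, B) + d(B, C)        # triangle inequality
print(d(A, B))                             # 0.5
```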
Exercise 1.1.11. Consider events A_n in a probability space (Ω, F, P) that are almost disjoint in the sense that P(A_n ∩ A_m) = 0 for all n ≠ m. Show that then
P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
Exercise 1.1.12. Suppose a random outcome N follows the Poisson probability measure of parameter λ > 0. Find a simple expression for the probability that N is an even integer.
1.1.2. Generated and Borel σ-algebras. Enumerating the sets in the σ-algebra F is not a realistic option for uncountable Ω. Instead, as we see next, the most common construction of σ-algebras is then by implicit means. That is, we demand that certain sets (called the generators) be in our σ-algebra, and take the smallest possible collection for which this holds.
Exercise 1.1.13.
(a) Check that the intersection of (possibly uncountably many) σ-algebras is also a σ-algebra.
(b) Verify that for any σ-algebras H ⊆ G and any H ∈ G, the collection H_H = {A ∈ G : A ∩ H ∈ H} is a σ-algebra.
(c) Show that H ↦ H_H is non-increasing with respect to set inclusions, with H_Ω = H and H_∅ = G. Deduce that (H_H)_{H′} = H_{H∩H′} for any H, H′ ∈ G.
In view of part (a) of this exercise we have the following definition.
Definition 1.1.14. Given a collection of subsets A_α ⊆ Ω, where α ∈ Γ is an arbitrary index set (possibly uncountable), we denote the smallest σ-algebra containing all the sets A_α by σ({A_α}) (or by σ(A_α, α ∈ Γ)), and call it the σ-algebra generated by the collection {A_α}. That is,
σ({A_α}) = ∩ {G : G ⊆ 2^Ω is a σ-algebra, A_α ∈ G for all α ∈ Γ}.
Example 1.1.15. Suppose Ω = S is a topological space (that is, S is equipped with a notion of open subsets, or topology). An example of a generated σ-algebra is the Borel σ-algebra on S, defined as σ({O ⊆ S open}) and denoted by B_S. Of special importance is B_R, which we also denote by B.
Different sets of generators may result in the same σ-algebra. For example, taking Ω = {1, 2, 3} it is easy to see that σ({1}) = σ({2, 3}) = {∅, {1}, {2, 3}, {1, 2, 3}}.
A σ-algebra F is countably generated if there exists a countable collection of sets that generates it. Exercise 1.1.17 shows that B_R is countably generated, but as you show next, there exist non-countably generated σ-algebras even on Ω = R.
Exercise 1.1.16. Let F consist of all A ⊆ Ω such that either A is a countable set or A^c is a countable set.
(a) Verify that F is a σ-algebra.
(b) Show that F is countably generated if and only if Ω is a countable set.
Recall that if a collection of sets A is a subset of a σ-algebra G, then also σ(A) ⊆ G. Consequently, to show that σ({A_α}) = σ({B_β}) for two collections of generators, it suffices to show that A_α ∈ σ({B_β}) for each α and B_β ∈ σ({A_α}) for each β.
Exercise 1.1.17. Verify the following alternative definitions of the Borel σ-algebra B:
B = σ({(a, b) : a < b ∈ R}) = σ({[a, b] : a < b ∈ R}) = σ({(−∞, b] : b ∈ R}) = σ({(−∞, b] : b ∈ Q}).
In particular, B is countably generated.
Example 1.1.19. One example of an uncountable sample space is Ω_∞ = Ω_1^N for Ω_1 = {H, T}, the set of all infinite sequences of coin tosses (indeed, setting H = 1 and T = 0, each infinite sequence ω ∈ Ω_∞ is in correspondence with a unique real number x ∈ [0, 1], with ω being the binary expansion of x; see Exercise 1.2.13). The σ-algebra should at least allow us to consider any possible outcome of a finite number of coin tosses. The natural σ-algebra in this case is the minimal σ-algebra having this property, or put more formally F_c = σ(A_{θ,k}, θ ∈ Ω_k, k = 1, 2, . . .), where A_{θ,k} = {ω ∈ Ω_∞ : ω_i = θ_i, i = 1, . . . , k} for θ = (θ_1, . . . , θ_k) ∈ Ω_k (with Ω_k the set of k-tuples of coin tosses, as in Example 1.1.6).
The preceding example is a special case of the construction of a product of measurable spaces, which we detail now.
Example 1.1.20. The product of the measurable spaces (Ω_i, F_i), i = 1, . . . , n, is the set Ω = Ω_1 × · · · × Ω_n with the σ-algebra generated by {A_1 × · · · × A_n : A_i ∈ F_i}, denoted by F_1 × · · · × F_n.
You are now to check that the Borel σ-algebra of R^d is the product of d copies of that of R. As we see later, this helps simplify the study of random vectors.
Exercise 1.1.21. Show that for any d < ∞,
B_{R^d} = B × · · · × B = σ({(a_1, b_1) × · · · × (a_d, b_d) : a_i < b_i ∈ R, i = 1, . . . , d})
(you need to prove both identities, with the middle term defined as in Example 1.1.20).
Exercise 1.1.22. Let F = σ({A_α, α ∈ Γ}), where the index set Γ is uncountable. Prove that for each B ∈ F there exists a countable sub-collection {A_{α_j}, j = 1, 2, . . .} of {A_α, α ∈ Γ} such that B ∈ σ({A_{α_j}, j = 1, 2, . . .}).
Often there is no explicit enumerative description of the σ-algebra generated by an infinite collection of subsets, but a notable exception is:
Exercise 1.1.23. Show that the sets in G = σ({[a, b] : a, b ∈ Z}) are all possible unions of elements from the countable collection {{b}, (b, b + 1) : b ∈ Z}, and deduce that B ≠ G.
Probability measures on the Borel σ-algebra of R are examples of regular measures, namely:
Exercise 1.1.24. Show that if P is a probability measure on (R, B) then for any A ∈ B and ε > 0, there exists an open set G containing A such that P(A) + ε > P(G).
Here is more information about B_{R^d}.
Exercise 1.1.25. Show that if μ is a finitely additive non-negative set function on (R^d, B_{R^d}) such that μ(R^d) = 1 and for any Borel set A,
μ(A) = sup{μ(K) : K ⊆ A, K compact},
then μ must be a probability measure.
Hint: Argue by contradiction using the conclusion of Exercise 1.1.5. To this end, recall the finite intersection property (if compact K_i ⊆ R^d are such that ∩_{i=1}^n K_i are non-empty for all finite n, then the countable intersection ∩_{i=1}^∞ K_i is also non-empty).
1.1.3. Lebesgue measure and Carathéodory's theorem. Perhaps the most important measure on (R, B) is the Lebesgue measure, λ. It is the unique measure that satisfies λ(F) = Σ_{k=1}^r (b_k − a_k) whenever F = ∪_{k=1}^r (a_k, b_k] for some r < ∞ and a_1 < b_1 < a_2 < b_2 < · · · < b_r. Since λ(R) = ∞, this is not a probability measure. However, when we restrict Ω to be the interval (0, 1] we get
Example 1.1.26. The uniform probability measure on (0, 1] is denoted U and defined as above, now with the added restrictions that 0 ≤ a_1 and b_r ≤ 1. Alternatively, U is the restriction of the measure λ to the sub-σ-algebra B_(0,1] of B.
Exercise 1.1.27. Show that ((0, 1], B_(0,1], U) is a non-atomic probability space and deduce that (R, B, λ) is a non-atomic measure space.
Note that any countable union of sets of probability zero has probability zero, but this is not the case for an uncountable union. For example, U({x}) = 0 for every x, but U((0, 1]) = 1.
As we have seen in Example 1.1.26 it is often impossible to explicitly specify the value of a measure on all sets of the σ-algebra F. Instead, we wish to specify its values on a much smaller and better behaved collection of generators A of F, and use Carathéodory's theorem to guarantee the existence of a unique measure on F that coincides with our specified values. To this end, we require that A be an algebra, that is,
Definition 1.1.28. A collection A of subsets of Ω is an algebra (or a field) if
(a) Ω ∈ A,
(b) if A ∈ A then A^c ∈ A as well,
(c) if A, B ∈ A then also A ∪ B ∈ A.
Remark. In view of the closure of an algebra with respect to complements, we could have replaced the requirement that Ω ∈ A with the (more standard) requirement that ∅ ∈ A. As part (c) of Definition 1.1.28 amounts to closure of an algebra under finite unions (and by DeMorgan's law also finite intersections), the difference between an algebra and a σ-algebra is that a σ-algebra is also closed under countable unions.
We sometimes make use of the fact that, unlike generated σ-algebras, the algebra generated by a collection of sets A can be explicitly presented.
Exercise 1.1.29. The algebra generated by a given collection of subsets A, denoted f(A), is the intersection of all algebras of subsets of Ω containing A.
(a) Verify that f(A) is indeed an algebra and that f(A) is minimal in the sense that if G is an algebra and A ⊆ G, then f(A) ⊆ G.
(b) Show that f(A) is the collection of all finite disjoint unions of sets of the form ∩_{j=1}^{n_i} A_{ij}, where for each i and j either A_{ij} or A_{ij}^c are in A.
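For a finite Ω the algebra generated by a collection of sets can be computed by brute force, closing the generators under complements and pairwise unions until nothing new appears. A Python sketch (the frozenset bookkeeping and the particular generators are our choices):

```python
from itertools import combinations

omega = frozenset(range(4))                  # a small finite sample space
gens = [frozenset({0}), frozenset({1, 2})]   # the generating collection

def generated_algebra(gens, omega):
    """Close {Omega} plus the generators under complements and pairwise unions."""
    algebra = {omega} | set(gens)
    changed = True
    while changed:
        changed = False
        for A in list(algebra):
            if omega - A not in algebra:     # closure under complement
                algebra.add(omega - A)
                changed = True
        for A, B in combinations(list(algebra), 2):
            if A | B not in algebra:         # closure under (finite) union
                algebra.add(A | B)
                changed = True
    return algebra

f_A = generated_algebra(gens, omega)
print(len(f_A))   # 8: all unions of the three "atoms" {0}, {1,2}, {3}
```

Since Ω is finite here, f(A) coincides with the generated σ-algebra of Definition 1.1.14.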
We next state Carathéodory's extension theorem, a key result from measure theory, and demonstrate how it applies in the context of Example 1.1.26.
Theorem 1.1.30 (Carathéodory's extension theorem). If μ_0 : A → [0, ∞] is a countably additive set function on an algebra A then there exists a measure μ on (Ω, σ(A)) such that μ = μ_0 on A. Furthermore, if μ_0(Ω) < ∞ then such a measure μ is unique.
To construct the measure U on B_(0,1], let Ω = (0, 1] and
A = {(a_1, b_1] ∪ · · · ∪ (a_r, b_r] : 0 ≤ a_1 < b_1 < · · · < a_r < b_r ≤ 1, r < ∞}
be a collection of subsets of (0, 1]. It is not hard to verify that A is an algebra, and further that σ(A) = B_(0,1] (c.f. Exercise 1.1.17, for a similar issue, just with (0, 1] replaced by R). With U_0 denoting the non-negative set function on A such that
(1.1.1) U_0(∪_{k=1}^r (a_k, b_k]) = Σ_{k=1}^r (b_k − a_k),
note that U_0((0, 1]) = 1, hence the existence of a unique probability measure U on ((0, 1], B_(0,1]) such that U(A) = U_0(A) for sets A ∈ A follows by Carathéodory's extension theorem, as soon as we verify:
Lemma 1.1.31. The set function U_0 is countably additive on A. That is, if {A_k} is a sequence of disjoint sets in A such that ∪_k A_k = A ∈ A, then U_0(A) = Σ_k U_0(A_k).
The proof of Lemma 1.1.31 is based on
Exercise 1.1.32. Show that U_0 is finitely additive on A. That is, U_0(∪_{k=1}^n A_k) = Σ_{k=1}^n U_0(A_k) for any finite collection of disjoint sets A_1, . . . , A_n ∈ A.
Proof. Let G_n = ∪_{k=1}^n A_k and H_n = A \ G_n. Then H_n ↓ ∅ and, since A_k, A ∈ A, which is an algebra, it follows that G_n, and hence H_n, are also in A. By Exercise 1.1.32, U_0 is finitely additive on A, so
U_0(A) = U_0(H_n) + U_0(G_n) = U_0(H_n) + Σ_{k=1}^n U_0(A_k).
To prove that U_0 is countably additive, it suffices to show that U_0(H_n) → 0, for then
U_0(A) = lim_n U_0(G_n) = lim_n Σ_{k=1}^n U_0(A_k) = Σ_{k=1}^∞ U_0(A_k).
To complete the proof, we argue by contradiction, assuming that U_0(H_n) ≥ 2ε for some ε > 0 and all n, where H_n ↓ ∅ are elements of A. By the definition of A and of U_0, we can find for each ℓ a set J_ℓ ∈ A whose closure J̄_ℓ is a subset of H_ℓ and U_0(H_ℓ \ J_ℓ) ≤ ε2^{−ℓ} (for example, shrink each of the r intervals making up H_ℓ by increasing its left end point a_k by the minimum of ε2^{−ℓ}/r and (b_k − a_k)/2). With U_0 finitely additive on the algebra A this implies that for each n,
U_0(∪_{ℓ=1}^n (H_ℓ \ J_ℓ)) ≤ Σ_{ℓ=1}^n U_0(H_ℓ \ J_ℓ) ≤ ε.
As H_n ⊆ H_ℓ for ℓ ≤ n,
H_n \ ∩_{ℓ=1}^n J_ℓ = ∪_{ℓ=1}^n (H_n \ J_ℓ) ⊆ ∪_{ℓ=1}^n (H_ℓ \ J_ℓ).
Hence, by finite additivity of U_0 and our assumption that U_0(H_n) ≥ 2ε, also
U_0(∩_{ℓ=1}^n J_ℓ) = U_0(H_n) − U_0(H_n \ ∩_{ℓ=1}^n J_ℓ) ≥ U_0(H_n) − U_0(∪_{ℓ=1}^n (H_ℓ \ J_ℓ)) ≥ ε.
In particular, for every n, the set ∩_{ℓ=1}^n J_ℓ is non-empty, and hence so are the sets K_n = ∩_{ℓ=1}^n J̄_ℓ. Since the K_n are compact (by the Heine-Borel theorem), the finite intersection property implies that the countable intersection ∩_{ℓ=1}^∞ J̄_ℓ is non-empty as well. Since J̄_ℓ is a subset of H_ℓ for all ℓ, we arrive at ∅ ≠ ∩_ℓ J̄_ℓ ⊆ ∩_ℓ H_ℓ, contradicting our assumption that H_n ↓ ∅.
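The set function U_0 of (1.1.1), and the finite additivity used throughout the proof, can be sketched by representing an element of the algebra A as a list of disjoint half-open intervals (a_k, b_k] (the representation is our own illustrative choice):

```python
def U0(intervals):
    """U_0 of a finite disjoint union of half-open intervals (a, b] in (0, 1]."""
    return sum(b - a for a, b in intervals)

A1 = [(0.0, 0.25)]                 # the interval (0, 1/4]
A2 = [(0.5, 0.6), (0.8, 1.0)]      # (1/2, 3/5] union (4/5, 1]
union = A1 + A2                    # their disjoint union, again in the algebra

# Finite additivity (Exercise 1.1.32): U_0(A1 + A2) = U_0(A1) + U_0(A2).
assert abs(U0(union) - (U0(A1) + U0(A2))) < 1e-12
print(U0(union))                   # ≈ 0.55
```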
It is often useful to assume that the probability space we have is complete, in the sense we make precise now.
Definition 1.1.34. We say that a measure space (Ω, F, μ) is complete if any subset N of any B ∈ F with μ(B) = 0 is also in F. If further μ = P is a probability measure, we say that the probability space (Ω, F, P) is a complete probability space.
Our next theorem states that any measure space can be completed by adding to its σ-algebra all subsets of sets of zero measure (a procedure that depends on the measure in use).
Theorem 1.1.35. Given a measure space (Ω, F, μ), let N = {N : N ⊆ A for some A ∈ F with μ(A) = 0} denote the collection of μ-null sets. Then, there exists a complete measure space (Ω, F̄, μ̄), called the completion of the measure space (Ω, F, μ), such that F̄ = {F ∪ N : F ∈ F, N ∈ N} and μ̄ = μ on F.
Proof. This is beyond our scope, but see the detailed proof in [Dur10, Theorem A.2.3]. In particular, F̄ = σ(F, N) and μ̄(A ∪ N) = μ(A) for any N ∈ N and A ∈ F (c.f. [Bil95, Problems 3.10 and 10.5]).
The following collections of sets play an important role in proving the easy part of Carathéodory's theorem, the uniqueness of the extension μ.
Definition 1.1.36. A π-system is a collection P of sets closed under finite intersections (i.e. if I ∈ P and J ∈ P then I ∩ J ∈ P).
A λ-system is a collection L of sets containing Ω and B \ A for any A ⊆ B, A, B ∈ L, which is also closed under monotone increasing limits (i.e. if A_i ∈ L and A_i ↑ A, then A ∈ L as well).
Obviously, an algebra is a π-system. Though an algebra may not be a λ-system, we have:
Proposition 1.1.37. A collection F of sets is a σ-algebra if and only if it is both a π-system and a λ-system.
Proof. The fact that a σ-algebra is both a π-system and a λ-system is a trivial consequence of Definition 1.1.1. To prove the converse direction, suppose that F is both a π-system and a λ-system. Then Ω is in the λ-system F and so is A^c = Ω \ A for any A ∈ F. Further, with F also a π-system we have that
A ∪ B = (A^c ∩ B^c)^c ∈ F,
for any A, B ∈ F. Consequently, if A_i ∈ F then so are also G_n = A_1 ∪ · · · ∪ A_n ∈ F. Since F is a λ-system and G_n ↑ ∪_i A_i, it follows that ∪_i A_i ∈ F as well, completing the verification that F is a σ-algebra.
The main tool in proving the uniqueness of the extension is Dynkin's π-λ theorem, stated next.
Theorem 1.1.38 (Dynkin's π-λ theorem). If P ⊆ L with P a π-system and L a λ-system, then σ(P) ⊆ L.
Proof. A short though dense exercise in set manipulations shows that the smallest λ-system containing P is a π-system (for details see [Wil91, Section A.1.3] or the proof of [Bil95, Theorem 3.2]). By Proposition 1.1.37 it is a σ-algebra, hence it contains σ(P). Further, it is contained in the λ-system L, as L also contains P.
Remark. Proposition 1.1.37 remains valid even if in the definition of a λ-system we relax the condition that B \ A ∈ L for any A ⊆ B, A, B ∈ L, to the condition that A^c ∈ L whenever A ∈ L. However, Dynkin's π-λ theorem does not hold under the latter definition.
As we show next, the uniqueness part of Carathéodory's theorem is an immediate consequence of Dynkin's π-λ theorem.
Proposition 1.1.39. If two measures μ_1 and μ_2 on (Ω, σ(P)) agree on the π-system P and are such that μ_1(Ω) = μ_2(Ω) < ∞, then μ_1 = μ_2.
Proof. Let L = {A ∈ σ(P) : μ_1(A) = μ_2(A)}. Our assumptions imply that P ⊆ L and that Ω ∈ L. Further, σ(P) is a λ-system (by Proposition 1.1.37), and if A ⊆ B with A, B ∈ L, then by additivity of the finite measures μ_1 and μ_2,
μ_1(B \ A) = μ_1(B) − μ_1(A) = μ_2(B) − μ_2(A) = μ_2(B \ A),
that is, B \ A ∈ L. Similarly, if A_n ↑ A and A_n ∈ L, then by the continuity from below of μ_1 and μ_2 (see the remark following Exercise 1.1.4),
μ_1(A) = lim_n μ_1(A_n) = lim_n μ_2(A_n) = μ_2(A),
so that A ∈ L. We conclude that L is a λ-system, hence by Dynkin's π-λ theorem, σ(P) ⊆ L, that is, μ_1 = μ_2.
Remark. With a somewhat more involved proof one can relax the condition μ_1(Ω) = μ_2(Ω) < ∞ to the existence of A_n ∈ P such that A_n ↑ Ω and μ_1(A_n) < ∞ (c.f. [Bil95, Theorem 10.3] for details). Accordingly, in Carathéodory's extension theorem we can relax the condition μ_0(Ω) < ∞ to the assumption that μ_0 is a σ-finite measure, that is, μ_0(A_n) < ∞ for some A_n ∈ A such that A_n ↑ Ω, as is the case with Lebesgue's measure λ on R.
We conclude this subsection with an outline of the proof of Carathéodory's extension theorem, noting that since an algebra A is a π-system and Ω ∈ A, the uniqueness of the extension to σ(A) follows from Proposition 1.1.39. Our outline of the existence of an extension follows [Wil91, Section A.1.8] (or see [Bil95, Theorem 11.3] for the proof of a somewhat stronger result). This outline centers on the construction of the appropriate outer measure, a relaxation of the concept of measure, which we now define.
Definition 1.1.40. An outer measure on Ω is an increasing, countably sub-additive, non-negative set function μ* : 2^Ω → [0, ∞], having the properties:
(a) μ*(∅) = 0 and μ*(A_1) ≤ μ*(A_2) for any A_1 ⊆ A_2 ⊆ Ω.
(b) μ*(∪_n A_n) ≤ Σ_n μ*(A_n) for any countable collection of sets A_n ⊆ Ω.
In the first step of the proof we define the increasing, non-negative set function
μ*(E) = inf{Σ_{n=1}^∞ μ_0(A_n) : E ⊆ ∪_n A_n, A_n ∈ A},
for E ⊆ Ω, and check that μ* is an outer measure with μ*(A) ≤ μ_0(A) for any A ∈ A. In the second step we prove that if in addition A ⊆ ∪_n A_n for A, A_n ∈ A, then the countable additivity of μ_0 on A results in μ_0(A) ≤ Σ_n μ_0(A_n). Consequently, μ* = μ_0 on the algebra A.
The third step uses the countable additivity of μ_0 on A to show that every A ∈ A is μ*-measurable, where a set Λ ⊆ Ω is called μ*-measurable if μ*(E) = μ*(E ∩ Λ) + μ*(E ∩ Λ^c) for all E ⊆ Ω. The key general fact is that for any outer measure μ* on Ω, the collection G of all μ*-measurable sets is a σ-algebra on which μ* is countably additive, so that (Ω, G, μ*) is a measure space. In the current setting, with A contained in the σ-algebra G, it follows that σ(A) ⊆ G, on which μ* is a measure coinciding with μ_0 on A, which completes the outline of the proof.
Exercise. Suppose A_1 ⊆ A_2 ⊆ . . . are algebras of subsets of Ω.
(a) Show that the union ∪_{n=1}^∞ A_n is also an algebra.
(b) Provide an example of algebras A_n for which ∪_{n=1}^∞ A_n is not a σ-algebra.
1.2. Random variables and their distribution
Random variables are numerical functions ω ↦ X(ω) of the outcome ω of our random experiment. However, in order to have a successful mathematical theory, we limit our interest to the subset of measurable functions (or more generally, measurable mappings), as defined in Subsection 1.2.1, and study the closure properties of this collection in Subsection 1.2.2. Subsection 1.2.3 is devoted to the characterization of the collection of distribution functions induced by random variables.
1.2.1. Indicators, simple functions and random variables. We start with the definition of random variables, first in the general case, and then restricted to R-valued variables.
Definition 1.2.1. A mapping X : Ω → S between two measurable spaces (Ω, F) and (S, 𝒮) is called an (S, 𝒮)-valued Random Variable (R.V.) if
X^{−1}(B) := {ω : X(ω) ∈ B} ∈ F for all B ∈ 𝒮.
Such a mapping is also called a measurable mapping.
Definition 1.2.2. When we say that X is a random variable, or a measurable function, we mean an (R, B)-valued random variable, which is the most common type of R.V. we shall encounter. We let mF denote the collection of all (R, B)-valued measurable mappings, so X is a R.V. if and only if X ∈ mF. If in addition Ω is a topological space and F = σ({O ⊆ Ω open}) is the corresponding Borel σ-algebra, we say that X : Ω → R is a Borel (measurable) function. More generally, a random vector is an (R^d, B_{R^d})-valued R.V. for some d < ∞.
The next exercise shows that a random vector is merely a finite collection of R.V.-s on the same probability space.
Exercise 1.2.3. Relying on Exercise 1.1.21 and Theorem 1.2.9, show that X : Ω → R^d is a random vector if and only if X(ω) = (X_1(ω), . . . , X_d(ω)) with each X_i : Ω → R a R.V.
Hint: Note that X^{−1}(B_1 × · · · × B_d) = ∩_{i=1}^d X_i^{−1}(B_i).
We now provide two important generic examples of random variables.
Example 1.2.4. For any A ∈ F the function
I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A,
is a R.V. Indeed, {ω : I_A(ω) ∈ B} is for any B ⊆ R one of the four sets ∅, A, A^c or Ω (depending on whether 0 ∈ B or not and whether 1 ∈ B or not), all of whom are in F. We call such a R.V. an indicator function.
Exercise 1.2.5. By the same reasoning, check that X(ω) = Σ_{n=1}^N c_n I_{A_n}(ω) is a R.V. for any finite N, non-random c_n ∈ R and sets A_n ∈ F. We call any such X a simple function, denoted by X ∈ SF.
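A simple function is just a finite linear combination of indicators, and so takes only finitely many values. A minimal numerical sketch on the two-coin sample space (the particular sets and constants are arbitrary choices of ours):

```python
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def indicator(A):
    """I_A(w) = 1 if w is in A, else 0."""
    return lambda w: 1 if w in A else 0

A1 = {("H", "H"), ("H", "T")}   # first toss is H
A2 = {("H", "H"), ("T", "H")}   # second toss is H
c = [1.0, 2.0]
I = [indicator(A1), indicator(A2)]

def X(w):
    """The simple function X = 1*I_{A1} + 2*I_{A2}."""
    return sum(cn * In(w) for cn, In in zip(c, I))

print([X(w) for w in omega])    # [3.0, 1.0, 2.0, 0.0]
```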
Our next proposition explains why simple functions are quite useful in probability theory.
Proposition 1.2.6. For every R.V. X(ω) there exists a sequence of simple functions X_n(ω) such that X_n(ω) → X(ω) as n → ∞, for each fixed ω ∈ Ω.
Proof. Let
f_n(x) = n 1_{x>n} + Σ_{k=0}^{n2^n − 1} k2^{−n} 1_{(k2^{−n}, (k+1)2^{−n}]}(x),
noting that for a R.V. X ≥ 0, we have that X_n = f_n(X) are simple functions. Since X ≥ X_{n+1} ≥ X_n and X(ω) − X_n(ω) ≤ 2^{−n} whenever X(ω) ≤ n, it follows that X_n(ω) ↑ X(ω) as n → ∞, for each ω ∈ Ω.
We write a general R.V. as X(ω) = X_+(ω) − X_−(ω), where X_+(ω) = max(X(ω), 0) and X_−(ω) = −min(X(ω), 0) are non-negative R.V.-s. Applying the preceding argument to X_+ and X_− separately yields simple functions X_n = f_n(X_+) − f_n(X_−) such that X_n(ω) → X(ω) as n → ∞, for each ω ∈ Ω, as claimed.
The measurability of a mapping X : Ω → S can be verified on a collection of generators of 𝒮, as we state next.
Theorem 1.2.9. If 𝒮 = σ(A) and X^{−1}(A) ∈ F for all A ∈ A, then X is an (S, 𝒮)-valued R.V.
Proof. Let 𝒮′ = {B ⊆ S : X^{−1}(B) ∈ F} and check that 𝒮′ is a σ-algebra:
(a) Since X^{−1}(S) = Ω ∈ F, we have that S ∈ 𝒮′.
(b) If A ∈ 𝒮′ then X^{−1}(A^c) = (X^{−1}(A))^c ∈ F, hence A^c ∈ 𝒮′.
(c) If A_n ∈ 𝒮′ for all n then X^{−1}(A_n) ∈ F for all n. With F a σ-algebra, then also X^{−1}(∪_n A_n) = ∪_n X^{−1}(A_n) ∈ F. Consequently, ∪_n A_n ∈ 𝒮′.
Our assumption that A ⊆ 𝒮′ then translates to 𝒮 = σ(A) ⊆ 𝒮′, as claimed.
The most important $\sigma$-algebras are those generated by ($(S, \mathcal{S})$-valued) random variables, as defined next.
Exercise 1.2.10. Adapting the proof of Theorem 1.2.9, show that for any mapping $X : \Omega \to S$ and any $\sigma$-algebra $\mathcal{S}$ of subsets of $S$, the collection $\{X^{-1}(B) : B \in \mathcal{S}\}$ is a $\sigma$-algebra. Verify that $X$ is an $(S, \mathcal{S})$-valued R.V. if and only if $\{X^{-1}(B) : B \in \mathcal{S}\} \subseteq \mathcal{F}$, in which case we denote $\{X^{-1}(B) : B \in \mathcal{S}\}$ either by $\sigma(X)$ or by $\mathcal{F}^X$ and call it the $\sigma$-algebra generated by $X$.
To practice your understanding of generated $\sigma$-algebras, solve the next exercise, providing a convenient collection of generators for $\sigma(X)$.
Exercise 1.2.11. If $X$ is an $(S, \mathcal{S})$-valued R.V. and $\mathcal{S} = \sigma(\mathcal{A})$ then $\sigma(X)$ is generated by the collection of sets $X^{-1}(\mathcal{A}) := \{X^{-1}(A) : A \in \mathcal{A}\}$.
An important example of use of Exercise 1.2.11 corresponds to $(\mathbb{R}, \mathcal{B})$-valued random variables and $\mathcal{A} = \{(-\infty, x] : x \in \mathbb{R}\}$ (or even $\mathcal{A} = \{(-\infty, x] : x \in \mathbb{Q}\}$), which generates $\mathcal{B}$ (see Exercise 1.1.17), leading to the following alternative definition of the $\sigma$-algebra generated by such a R.V. $X$.
Definition 1.2.12. Given a function $X : \Omega \to \mathbb{R}$ we denote by $\sigma(X)$ or by $\mathcal{F}^X$ the smallest $\sigma$-algebra $\mathcal{F}$ such that $X(\omega)$ is a measurable mapping from $(\Omega, \mathcal{F})$ to $(\mathbb{R}, \mathcal{B})$. Alternatively,
$$\sigma(X) = \sigma(\{\omega : X(\omega) \leq \alpha\},\ \alpha \in \mathbb{R}) = \sigma(\{\omega : X(\omega) \leq q\},\ q \in \mathbb{Q})\,.$$
More generally, given a random vector $X = (X_1, \ldots, X_n)$, that is, random variables $X_1, \ldots, X_n$ on the same probability space, let $\sigma(X_k, k \leq n)$ (or $\mathcal{F}^X_n$) denote the smallest $\sigma$-algebra $\mathcal{F}$ such that $X_k(\omega)$, $k = 1, \ldots, n$, are measurable on $(\Omega, \mathcal{F})$. Alternatively,
$$\sigma(X_k, k \leq n) = \sigma(\{\omega : X_k(\omega) \leq \alpha\},\ \alpha \in \mathbb{R},\ k \leq n)\,.$$
Finally, given a possibly uncountable collection of functions $X_\gamma : \Omega \to \mathbb{R}$, indexed by $\gamma \in \Gamma$, we denote by $\sigma(X_\gamma, \gamma \in \Gamma)$ (or simply by $\mathcal{F}^X_\Gamma$) the smallest $\sigma$-algebra $\mathcal{F}$ such that $X_\gamma(\omega)$, $\gamma \in \Gamma$, are measurable on $(\Omega, \mathcal{F})$.
Exercise 1.2.13. Consider the sample space $\Omega = \{0, 1\}^{\mathbb{N}}$ of infinite binary sequences $\omega = (\omega_1, \omega_2, \ldots)$, as in Example 1.1.19.
(a) Show that $X(\omega) = \sum_{n=1}^\infty \omega_n 2^{-n}$ is a measurable map from $(\Omega, \mathcal{F}_c)$ to $([0, 1], \mathcal{B}_{[0,1]})$.
(b) Conversely, let $Y(x) = (\omega_1, \ldots, \omega_n, \ldots)$ where for each $n \geq 1$, $\omega_n(1) = 1$ while $\omega_n(x) = I(\lfloor 2^n x \rfloor \text{ is an odd number})$ when $x \in [0, 1)$. Show that $Y = X^{-1}$ is a measurable map from $([0, 1], \mathcal{B}_{[0,1]})$ to $(\Omega, \mathcal{F}_c)$.
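A finite truncation of Exercise 1.2.13 can be checked numerically. The floor-based digit formula below is our reading of the (partly garbled) definition of $\omega_n(x)$, and the function names are ours:

```python
import math

def X(bits):
    """X(omega) = sum_n omega_n 2^{-n}, for a finite prefix of the sequence."""
    return sum(b / 2.0 ** (n + 1) for n, b in enumerate(bits))

def Y(x, N):
    """First N coordinates of the inverse map: omega_n = 1 iff floor(2^n x) is odd."""
    if x == 1.0:
        return [1] * N
    return [math.floor(2**n * x) % 2 for n in range(1, N + 1)]
```

For instance, `Y(0.75, 4)` gives the binary digits `[1, 1, 0, 0]` of $3/4$, and applying `X` to a length-$N$ prefix of `Y(x, N)` recovers $x$ up to $2^{-N}$.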
Here are some alternatives for Definition 1.2.12.
Exercise 1.2.14. Verify the following relations and show that each generating collection of sets on the right hand side is a $\pi$-system.
(a) $\sigma(X) = \sigma(\{\omega : X(\omega) \leq \alpha\},\ \alpha \in \mathbb{R})$
(b) $\sigma(X_k, k \leq n) = \sigma(\{\omega : X_k(\omega) \leq \alpha_k,\ 1 \leq k \leq n\},\ \alpha_1, \ldots, \alpha_n \in \mathbb{R})$
(c) $\sigma(X_1, X_2, \ldots) = \sigma(\{\omega : X_k(\omega) \leq \alpha_k,\ 1 \leq k \leq m\},\ \alpha_1, \ldots, \alpha_m \in \mathbb{R},\ m \in \mathbb{N})$
(d) $\sigma(X_1, X_2, \ldots) = \sigma\big(\bigcup_n \sigma(X_k, k \leq n)\big)$
As you next show, when approximating a random variable by a simple function, one may also specify the latter to be based on sets in any generating algebra.
Exercise 1.2.15. Suppose $(\Omega, \mathcal{F}, P)$ is a probability space, with $\mathcal{F} = \sigma(\mathcal{A})$ for an algebra $\mathcal{A}$.
(a) Show that $\inf\{P(A \Delta B) : A \in \mathcal{A}\} = 0$ for any $B \in \mathcal{F}$ (recall that $A \Delta B = (A \cap B^c) \cup (A^c \cap B)$).
22 1. PROBABILITY, MEASURE AND INTEGRATION
(b) Show that for any bounded random variable $X$ and $\varepsilon > 0$ there exists a simple function $Y = \sum_{n=1}^N c_n I_{A_n}$ with $A_n \in \mathcal{A}$ such that $P(|X - Y| > \varepsilon) < \varepsilon$.
Exercise 1.2.16. Let $\mathcal{F} = \sigma(A_\alpha, \alpha \in \Gamma)$ and suppose $\omega_1, \omega_2 \in \Omega$ are such that for each $\alpha \in \Gamma$, either $\omega_1, \omega_2 \in A_\alpha$ or $\omega_1, \omega_2 \in A_\alpha^c$.
(a) Show that if a mapping $X$ is measurable on $(\Omega, \mathcal{F})$ then $X(\omega_1) = X(\omega_2)$.
(b) Provide an explicit $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega = \{1, 2, 3\}$ and a mapping $X : \Omega \to \mathbb{R}$ which is not a random variable on $(\Omega, \mathcal{F})$.
We conclude with a glimpse of the canonical measurable space associated with a stochastic process $(X_t, t \in \mathbb{T})$ (for more on this, see Lemma 7.1.7).
Exercise 1.2.17. Fixing a possibly uncountable collection of random variables $X_t$, indexed by $t \in \mathbb{T}$, let $\mathcal{F}^X_C = \sigma(X_t, t \in C)$ for each $C \subseteq \mathbb{T}$. Show that
$$\mathcal{F}^X_{\mathbb{T}} = \bigcup_{C \subseteq \mathbb{T} \text{ countable}} \mathcal{F}^X_C$$
and that any R.V. $Z$ on $(\Omega, \mathcal{F}^X_{\mathbb{T}})$ is measurable on $\mathcal{F}^X_C$ for some countable $C \subseteq \mathbb{T}$.
1.2.2. Closure properties of random variables. For the typical measurable space with uncountable $\Omega$ it is impractical to list all possible R.V. Instead, we state a few useful closure properties that often help us in showing that a given mapping $X(\omega)$ is indeed a R.V.
We start with closure with respect to the composition of a R.V. and a measurable mapping.
Proposition 1.2.18. If $X : \Omega \to S$ is an $(S, \mathcal{S})$-valued R.V. and $f$ is a measurable mapping from $(S, \mathcal{S})$ to $(T, \mathcal{T})$, then the composition $f(X) : \Omega \to T$ is a $(T, \mathcal{T})$-valued R.V.
Proof. Considering an arbitrary $B \in \mathcal{T}$, we know that $f^{-1}(B) \in \mathcal{S}$ since $f$ is a measurable mapping. Thus, as $X$ is an $(S, \mathcal{S})$-valued R.V. it follows that
$$[f(X)]^{-1}(B) = X^{-1}(f^{-1}(B)) \in \mathcal{F}\,.$$
This holds for any $B \in \mathcal{T}$, thus concluding the proof.
In view of Exercise 1.2.3 we have the following special case of Proposition 1.2.18, corresponding to $S = \mathbb{R}^n$ and $T = \mathbb{R}$ equipped with the respective Borel $\sigma$-algebras.
Corollary 1.2.19. Let $X_i$, $i = 1, \ldots, n$, be R.V. on the same measurable space $(\Omega, \mathcal{F})$ and $f : \mathbb{R}^n \to \mathbb{R}$ a Borel function. Then, $f(X_1, \ldots, X_n)$ is also a R.V. on the same space.
To appreciate the power of Corollary 1.2.19, consider the following exercise, in which you show that every continuous function is also a Borel function.
Exercise 1.2.20. Suppose $(S, \rho)$ is a metric space (for example, $S = \mathbb{R}^n$). A function $g : S \to [-\infty, \infty]$ is called lower semi-continuous (l.s.c.) if $\liminf_{\rho(y,x) \to 0} g(y) \geq g(x)$, for all $x \in S$. A function $g$ is said to be upper semi-continuous (u.s.c.) if $-g$ is l.s.c.
(a) Show that if $g$ is l.s.c. then $\{x : g(x) \leq b\}$ is closed for each $b \in \mathbb{R}$.
(b) Conclude that semi-continuous functions are Borel measurable.
(c) Conclude that continuous functions are Borel measurable.
A concrete application of Corollary 1.2.19 shows that any linear combination of finitely many R.V.-s is a R.V.
Example 1.2.21. Suppose $X_i$ are R.V.-s on the same measurable space and $c_i \in \mathbb{R}$. Then, $W_n(\omega) = \sum_{i=1}^n c_i X_i(\omega)$ are also R.V.-s. To see this, apply Corollary 1.2.19 for $f(x_1, \ldots, x_n) = \sum_{i=1}^n c_i x_i$, a continuous, hence Borel (measurable), function (by Exercise 1.2.20).
We turn to explore the closure properties of $m\mathcal{F}$ with respect to operations of a limiting nature, starting with the following key theorem.
Theorem 1.2.22. Let $\overline{\mathbb{R}} = [-\infty, \infty]$ equipped with its Borel $\sigma$-algebra
$$\mathcal{B}_{\overline{\mathbb{R}}} = \sigma\big([-\infty, b) : b \in \mathbb{R}\big)\,.$$
If $X_i$ are $\overline{\mathbb{R}}$-valued R.V.-s on the same measurable space, then
$$\inf_n X_n\,, \quad \sup_n X_n\,, \quad \liminf_{n \to \infty} X_n\,, \quad \limsup_{n \to \infty} X_n\,,$$
are also $\overline{\mathbb{R}}$-valued random variables.
Proof. Pick an arbitrary $b \in \mathbb{R}$. Then,
$$\{\omega : \inf_n X_n(\omega) < b\} = \bigcup_{n=1}^\infty \{\omega : X_n(\omega) < b\} = \bigcup_{n=1}^\infty X_n^{-1}([-\infty, b)) \in \mathcal{F}\,.$$
Since $\mathcal{B}_{\overline{\mathbb{R}}}$ is generated by $\{[-\infty, b) : b \in \mathbb{R}\}$, it follows by Theorem 1.2.9 that $\inf_n X_n$ is an $\overline{\mathbb{R}}$-valued R.V.
Observing that $\sup_n X_n = -\inf_n(-X_n)$, we deduce from the above and Corollary 1.2.19 (for $f(x) = -x$), that $\sup_n X_n$ is also an $\overline{\mathbb{R}}$-valued R.V.
Next, recall that
$$W = \liminf_{n \to \infty} X_n = \sup_n \Big( \inf_{l \geq n} X_l \Big)\,.$$
By the preceding proof we have that $Y_n = \inf_{l \geq n} X_l$ are $\overline{\mathbb{R}}$-valued R.V.-s and hence so is $W = \sup_n Y_n$.
Similarly to the arguments already used, we conclude the proof either by observing that
$$Z = \limsup_{n \to \infty} X_n = \inf_n \Big( \sup_{l \geq n} X_l \Big)\,,$$
or by observing that $\limsup_n X_n = -\liminf_n(-X_n)$.
Remark. Since $\inf_n X_n$, $\sup_n X_n$, $\limsup_n X_n$ and $\liminf_n X_n$ may result in values $\pm\infty$ even when every $X_n$ is $\mathbb{R}$-valued, hereafter we let $m\mathcal{F}$ also denote the collection of $\overline{\mathbb{R}}$-valued R.V.
An important corollary of this theorem deals with the existence of limits of sequences of R.V.
Corollary 1.2.23. For any sequence $X_n \in m\mathcal{F}$, both
$$\Omega_0 = \{\omega : \liminf_{n \to \infty} X_n(\omega) = \limsup_{n \to \infty} X_n(\omega)\}$$
and
$$\Omega_1 = \{\omega : \liminf_{n \to \infty} X_n(\omega) = \limsup_{n \to \infty} X_n(\omega) \in \mathbb{R}\}$$
are measurable sets, that is, $\Omega_0 \in \mathcal{F}$ and $\Omega_1 \in \mathcal{F}$.
Proof. By Theorem 1.2.22 we have that $Z = \limsup_n X_n$ and $W = \liminf_n X_n$ are two $\overline{\mathbb{R}}$-valued variables on the same space, with $Z(\omega) \geq W(\omega)$ for all $\omega$. Hence, $\Omega_1 = \{\omega : Z(\omega) - W(\omega) = 0,\ Z(\omega) \in \mathbb{R},\ W(\omega) \in \mathbb{R}\}$ is measurable (apply Corollary 1.2.19 for $f(z, w) = z - w$), as is $\Omega_0 = W^{-1}(\{\infty\}) \cup Z^{-1}(\{-\infty\}) \cup \Omega_1$ (indeed, $W(\omega) = \infty$ forces $Z(\omega) = \infty$ and $Z(\omega) = -\infty$ forces $W(\omega) = -\infty$, since $Z \geq W$).
The following structural result is yet another consequence of Theorem 1.2.22.
Corollary 1.2.24. For any $d < \infty$ and R.V.-s $Y_1, \ldots, Y_d$ on the same measurable space $(\Omega, \mathcal{F})$, the collection $\mathcal{H} = \{h(Y_1, \ldots, Y_d);\ h : \mathbb{R}^d \to \mathbb{R} \text{ Borel function}\}$ is a vector space over $\mathbb{R}$ containing the constant functions, such that if $X_n \in \mathcal{H}$ are non-negative and $X_n \uparrow X$, an $\mathbb{R}$-valued function on $\Omega$, then $X \in \mathcal{H}$.
Proof. By Example 1.2.21 the collection of all Borel functions is a vector space over $\mathbb{R}$ which evidently contains the constant functions. Consequently, the same applies for $\mathcal{H}$. Next, suppose $X_n = h_n(Y_1, \ldots, Y_d)$ for Borel functions $h_n$ such that $0 \leq X_n(\omega) \uparrow X(\omega)$ for all $\omega$. Then, $h(y) = \sup_n h_n(y)$ is by Theorem 1.2.22 an $\overline{\mathbb{R}}$-valued Borel function on $\mathbb{R}^d$, such that $X = h(Y_1, \ldots, Y_d)$. Setting $\overline{h}(y) = h(y)$ when $h(y) \in \mathbb{R}$ and $\overline{h}(y) = 0$ otherwise, it is easy to check that $\overline{h}$ is a real-valued Borel function. Moreover, with $X : \Omega \to \mathbb{R}$ (finite valued), necessarily $X = \overline{h}(Y_1, \ldots, Y_d)$ as well, so $X \in \mathcal{H}$.
The pointwise convergence of R.V., that is $X_n(\omega) \to X(\omega)$ for every $\omega \in \Omega$, is often too strong of a requirement, as it may fail to hold as a result of the R.V. being ill-defined for a negligible set of values of $\omega$ (that is, a set of zero measure). We thus define the more useful, weaker notion of almost sure convergence of random variables.
Definition 1.2.25. We say that a sequence of random variables $X_n$ on the same probability space $(\Omega, \mathcal{F}, P)$ converges almost surely if $P(\Omega_0) = 1$. We then set $X_\infty = \limsup_{n \to \infty} X_n$, and say that $X_n$ converges almost surely to $X_\infty$, or use the notation $X_n \stackrel{a.s.}{\to} X_\infty$.
Remark. Note that in Definition 1.2.25 we allow the limit $X_\infty(\omega)$ to take the values $\pm\infty$ with positive probability. So, we say that $X_n$ converges almost surely to a finite limit if $P(\Omega_1) = 1$, or alternatively, if $X_\infty$ is $\mathbb{R}$-valued with probability one.
Proposition 1.2.26. Let $\mathcal{G} = \sigma(Y_1, \ldots, Y_n)$ for R.V.-s $Y_1, \ldots, Y_n$ on the same measurable space. Then $Z \in m\mathcal{G}$ if and only if $Z = g(Y_1, \ldots, Y_n)$ for some Borel function $g : \mathbb{R}^n \to \mathbb{R}$.
Proof. From Corollary 1.2.19 we know that $Z = g(Y_1, \ldots, Y_n)$ is in $m\mathcal{G}$ for each Borel function $g : \mathbb{R}^n \to \mathbb{R}$. Turning to prove the converse result, recall part (b) of Exercise 1.2.14, whereby the $\sigma$-algebra $\mathcal{G}$ is generated by the $\pi$-system $\mathcal{P} = \{A_\alpha : \alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n\}$, where $I_{A_\alpha} = h_\alpha(Y_1, \ldots, Y_n)$ for the Borel function $h_\alpha(y_1, \ldots, y_n) = \prod_{k=1}^n 1_{y_k \leq \alpha_k}$. Thus, in view of Corollary 1.2.24, we have by the monotone class theorem that $\mathcal{H} = \{g(Y_1, \ldots, Y_n) : g : \mathbb{R}^n \to \mathbb{R} \text{ is a Borel function}\}$ contains all elements of $m\mathcal{G}$.
We conclude this subsection with a few exercises, starting with Borel measurability of monotone functions (regardless of their continuity properties).
Exercise 1.2.27. Show that any monotone function $g : \mathbb{R} \to \mathbb{R}$ is Borel measurable.
Next, Exercise 1.2.20 implies that the set of points at which a given function $g$ is discontinuous is a Borel set.
Exercise 1.2.28. Fix an arbitrary function $g : S \to \mathbb{R}$.
(a) Show that for any $\delta > 0$ the functions $g^*(x, \delta) = \sup\{g(y) : \rho(x, y) < \delta\}$ and $g_*(x, \delta) = \inf\{g(y) : \rho(x, y) < \delta\}$ are l.s.c. and u.s.c. in $x$, respectively.
(b) Show that the set of $x \in S$ for which $\sup_k g_*(x, k^{-1}) < \inf_k g^*(x, k^{-1})$ is exactly the set of points at which $g$ is discontinuous.
(c) Deduce that the set $D_g$ of points of discontinuity of $g$ is a Borel set.
Here is an alternative characterization of $\mathcal{B}$ that complements Exercise 1.2.20.
Exercise 1.2.29. Show that if $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\mathbb{R}$ then $\mathcal{B} \subseteq \mathcal{F}$ if and only if every continuous function $f : \mathbb{R} \to \mathbb{R}$ is in $m\mathcal{F}$ (i.e. $\mathcal{B}$ is the smallest $\sigma$-algebra on $\mathbb{R}$ with respect to which all continuous functions are measurable).
Exercise 1.2.30. Suppose $X_n \stackrel{a.s.}{\to} X_\infty$ and $P(X_\infty \in \mathbb{R}) = 1$. Show that for any $\varepsilon > 0$, there exists an event $A$ with $P(A) < \varepsilon$ and a non-random $N = N(\varepsilon)$, sufficiently large, such that $|X_n(\omega) - X_\infty(\omega)| < \varepsilon$ for all $\omega \notin A$ and all $n \geq N$.
For any countable collection of disjoint sets $B_i \in \mathcal{B}$,
$$\mathcal{P}_X\Big(\bigcup_i B_i\Big) = P\Big(\bigcup_i X^{-1}(B_i)\Big) = \sum_i P(X^{-1}(B_i)) = \sum_i \mathcal{P}_X(B_i)\,.$$
This shows that $\mathcal{P}_X$ is also countably additive, hence a probability measure, as claimed in Definition 1.2.33.
Note that the law $\mathcal{P}_X$ of a R.V. $X : \Omega \to \mathbb{R}$ determines the values of the probability measure $P$ on $\sigma(X)$.
Definition 1.2.34. We write $X \stackrel{\mathcal{D}}{=} Y$ and say that $X$ equals $Y$ in law (or in distribution), if and only if $\mathcal{P}_X = \mathcal{P}_Y$.
A good way to practice your understanding of Definitions 1.2.33 and 1.2.34 is by verifying that if $X \stackrel{a.s.}{=} Y$, then also $X \stackrel{\mathcal{D}}{=} Y$ (that is, any two random variables we consider to be the same would indeed have the same law).
The next concept we define, the distribution function, is closely associated with the law $\mathcal{P}_X$ of the R.V.
Definition 1.2.35. The distribution function $F_X$ of a real-valued R.V. $X$ is
$$F_X(\alpha) = P(\{\omega : X(\omega) \leq \alpha\}) = \mathcal{P}_X((-\infty, \alpha])\,, \qquad \forall \alpha \in \mathbb{R}\,.$$
Our next result characterizes the set of all functions $F : \mathbb{R} \to [0, 1]$ that are distribution functions of some R.V.
Theorem 1.2.36. A function $F : \mathbb{R} \to [0, 1]$ is a distribution function of some R.V. if and only if
(a) $F$ is non-decreasing,
(b) $\lim_{x \to \infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$,
(c) $F$ is right-continuous, i.e. $\lim_{y \downarrow x} F(y) = F(x)$.
Proof. First, assuming that $F = F_X$ is a distribution function, we show that it must have the stated properties (a)-(c). Indeed, if $x \leq y$ then $(-\infty, x] \subseteq (-\infty, y]$, and by the monotonicity of the probability measure $\mathcal{P}_X$ (see part (a) of Exercise 1.1.4), we have that $F_X(x) \leq F_X(y)$, proving that $F_X$ is non-decreasing. Further, $(-\infty, x] \uparrow \mathbb{R}$ as $x \uparrow \infty$, while $(-\infty, x] \downarrow \emptyset$ as $x \downarrow -\infty$, resulting with property (b) of the theorem by the continuity from below and the continuity from above of the probability measure $\mathcal{P}_X$ on $\mathbb{R}$. Similarly, since $(-\infty, y] \downarrow (-\infty, x]$ as $y \downarrow x$, we get the right continuity of $F_X$ by yet another application of continuity from above of $\mathcal{P}_X$.
We proceed to prove the converse result. That is, assuming $F$ has the stated properties (a)-(c), we consider the probability space $((0, 1], \mathcal{B}_{(0,1]}, U)$ of Example 1.1.26 and on it the random variable $X_-(\omega) = \sup\{y : F(y) < \omega\}$. Since $F$ is non-decreasing and right-continuous, $\{\omega : X_-(\omega) \leq x\} = \{\omega : \omega \leq F(x)\}$, which implies that $F_{X_-}(x) = U((0, F(x)]) = F(x)$ for all $x \in \mathbb{R}$; further, the sets $(0, F(x)]$ are all in $\mathcal{B}_{(0,1]}$, implying that $X_-$ is indeed a R.V. on this space.
Remark. The construction of $X_-(\omega)$ in Theorem 1.2.36 is called Skorokhod's representation. You can, and should, verify that the random variable $X_+(\omega) = \sup\{y : F(y) \leq \omega\}$ would have worked equally well for that purpose, since $X_+(\omega) \neq X_-(\omega)$ only if $X_+(\omega) > q \geq X_-(\omega)$ for some rational $q$, and the union over $q \in \mathbb{Q}$ of these sets of $\omega$ has zero probability under $U$.
Alternatively, the non-decreasing, right-continuous $F$ assigns to each finite union of disjoint intervals $(a_k, b_k]$, $k = 1, \ldots, r$, the total mass $\sum_{k=1}^r (F(b_k) - F(a_k))$, yielding a probability measure $\mathcal{P}$ on $(\mathbb{R}, \mathcal{B})$ such that $\mathcal{P}((-\infty, \alpha]) = F(\alpha)$ for all $\alpha \in \mathbb{R}$ (c.f. [Bil95, Theorem 12.4]).
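When $F$ is strictly increasing and continuous, $X_-(\omega) = \sup\{y : F(y) < \omega\}$ reduces to $F^{-1}(\omega)$, which is the familiar inverse-transform sampling recipe. Here is a small sketch for the exponential distribution function (our choice of example; not part of the notes):

```python
import math

def F(x):
    """Exponential distribution function F(x) = 1 - e^{-x} for x >= 0."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def x_minus(omega):
    """X_-(omega) = sup{y : F(y) < omega} = -log(1 - omega), omega in (0, 1)."""
    return -math.log(1.0 - omega)

# P(X_- <= x) = U((0, F(x)]) = F(x): feeding uniform omega's through x_minus
# produces a random variable whose distribution function is F.
```

Since $F$ and `x_minus` are inverse to each other on their respective ranges, the quantile of level $\omega = F(x)$ is $x$ itself.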
Our next example highlights the possible shape of the distribution function.
Example 1.2.37. Consider Example 1.1.6 of $n$ coin tosses, with $\sigma$-algebra $\mathcal{F}_n = 2^{\Omega_n}$, sample space $\Omega_n = \{H, T\}^n$, and the probability measure $P_n(A) = \sum_{\omega \in A} p_\omega$, where $p_\omega = 2^{-n}$ for each $\omega \in \Omega_n$ (that is, $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$ for $\omega_i \in \{H, T\}$), corresponding to independent, fair, coin tosses. Let $Y(\omega) = I_{\{\omega_1 = H\}}$ measure the outcome of the first toss. The law of this random variable is
$$\mathcal{P}_Y(B) = \frac{1}{2} 1_{\{0 \in B\}} + \frac{1}{2} 1_{\{1 \in B\}}$$
and its distribution function is
$$F_Y(\alpha) = \mathcal{P}_Y((-\infty, \alpha]) = P_n(Y(\omega) \leq \alpha) = \begin{cases} 1, & \alpha \geq 1 \\ \frac{1}{2}, & 0 \leq \alpha < 1 \\ 0, & \alpha < 0 \end{cases}\,. \quad (1.2.2)$$
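The law and distribution function of Example 1.2.37 can be recovered by brute-force enumeration of the $2^n$ equally likely outcomes (the choice $n = 3$ below is ours):

```python
from itertools import product

n = 3
outcomes = list(product("HT", repeat=n))  # all 2^n equally likely outcomes

def F_Y(a):
    """F_Y(a) = P_n(Y <= a) for Y = 1{omega_1 = H}."""
    return sum(1 for w in outcomes if (w[0] == "H") <= a) / len(outcomes)
```

Evaluating `F_Y` on a grid reproduces the three-piece staircase of (1.2.2): zero below $0$, one half on $[0, 1)$, and one from $1$ on.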
Note that in general $\sigma(X)$ is a strict subset of the $\sigma$-algebra $\mathcal{F}$ (in Example 1.2.37 we have that $\sigma(Y)$ determines the probability measure for the first coin toss, but tells us nothing about the probability measure assigned to the remaining $n - 1$ tosses). Consequently, though the law $\mathcal{P}_X$ determines the probability measure $P$ on $\sigma(X)$, it usually does not completely determine $P$.
Example 1.2.37 is somewhat generic. That is, if the R.V. $X$ is a simple function (or more generally, when the set $\{X(\omega) : \omega \in \Omega\}$ is countable and has no accumulation points), then its distribution function $F_X$ is piecewise constant with jumps at the possible values that $X$ takes and jump sizes that are the corresponding probabilities. Indeed, note that $(-\infty, y] \uparrow (-\infty, x)$ as $y \uparrow x$, so by the continuity from below of $\mathcal{P}_X$ it follows that
$$F_X(x^-) := \lim_{y \uparrow x} F_X(y) = P(\{\omega : X(\omega) < x\}) = F_X(x) - P(\{\omega : X(\omega) = x\})\,,$$
for any R.V. $X$.
A direct corollary of Theorem 1.2.36 shows that any distribution function has a collection of continuity points that is dense in $\mathbb{R}$.
Exercise 1.2.38. Show that a distribution function $F$ has at most countably many points of discontinuity and consequently, that for any $x \in \mathbb{R}$ there exist $y_k$ and $z_k$ at which $F$ is continuous such that $z_k \uparrow x$ and $y_k \downarrow x$.
In contrast with Example 1.2.37, the distribution function of a R.V. with a density is continuous and almost everywhere differentiable. That is,
Definition 1.2.39. We say that a R.V. $X(\omega)$ has a probability density function $f_X$ if and only if its distribution function $F_X$ can be expressed as
$$F_X(\alpha) = \int_{-\infty}^{\alpha} f_X(x)\,dx\,, \qquad \forall \alpha \in \mathbb{R}\,. \quad (1.2.3)$$
By Theorem 1.2.36 a probability density function $f_X$ must be an integrable, Lebesgue almost everywhere non-negative function, with $\int_{\mathbb{R}} f_X(x)\,dx = 1$. Such $F_X$ is continuous with $\frac{dF_X}{dx}(x) = f_X(x)$ except possibly on a set of values of $x$ of zero Lebesgue measure.
Remark. To make Definition 1.2.39 precise we temporarily assume that probability density functions $f_X$ are Riemann integrable and interpret the integral in (1.2.3) in this sense. In Section 1.3 we construct Lebesgue's integral and extend the scope of Definition 1.2.39 to Lebesgue integrable density functions $f_X \geq 0$ (in particular, accommodating Borel functions $f_X$). This is the setting we assume thereafter, with the right-hand-side of (1.2.3) interpreted as the integral $\lambda(f_X; (-\infty, \alpha])$ of $f_X$ with respect to the restriction on $(-\infty, \alpha]$ of the completion of the Lebesgue measure on $\mathbb{R}$ (c.f. Definition 1.3.59 and Example 1.3.60). Further, the function $f_X$ is uniquely defined only as a representative of an equivalence class. That is, in this context we consider $f$ and $g$ to be the same function when $\lambda(\{x : f(x) \neq g(x)\}) = 0$.
Building on Example 1.1.26 we next detail a few classical examples of R.V. that have densities.
Example 1.2.40. The distribution function $F_U$ of the R.V. $U$ of Example 1.1.26 is
$$F_U(\alpha) = P(U \leq \alpha) = P(U \in [0, \alpha]) = \begin{cases} 1, & \alpha > 1 \\ \alpha, & 0 \leq \alpha \leq 1 \\ 0, & \alpha < 0 \end{cases} \quad (1.2.4)$$
and its density is $f_U(u) = \begin{cases} 1, & 0 \leq u \leq 1 \\ 0, & \text{otherwise} \end{cases}$.
The exponential distribution function is
$$F(x) = \begin{cases} 0, & x \leq 0 \\ 1 - e^{-x}, & x \geq 0 \end{cases}\,,$$
corresponding to the density $f(x) = \begin{cases} 0, & x \leq 0 \\ e^{-x}, & x > 0 \end{cases}$, whereas the standard normal distribution has the density
$$\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}\,,$$
with no closed form expression for the corresponding distribution function $\Phi(x) = \int_{-\infty}^x \phi(u)\,du$ in terms of elementary functions.
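As a quick numerical sanity check of (1.2.3) for the exponential law (the grid size and interval below are arbitrary choices of ours), a midpoint Riemann sum of the density $e^{-x}$ over $[0, a]$ reproduces $F(a) = 1 - e^{-a}$:

```python
import math

def riemann_F(a, n=10**5):
    """Midpoint-rule approximation of int_0^a e^{-x} dx."""
    h = a / n
    return h * sum(math.exp(-(i + 0.5) * h) for i in range(n))
```

For instance `riemann_F(2.0)` agrees with `1 - math.exp(-2.0)` to many decimal places, illustrating that the distribution function is the accumulated area under the density.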
Every real-valued R.V. $X$ has a distribution function but not necessarily a density. For example, $X = 0$ w.p.1 has distribution function $F_X(\alpha) = 1_{\alpha \geq 0}$. Since $F_X$ is discontinuous at $0$, the R.V. $X$ does not have a density.
Definition 1.2.41. We say that a function $F$ is a Lebesgue singular function if it has a zero derivative except on a set of zero Lebesgue measure.
Since the distribution function of any R.V. is non-decreasing, from real analysis we know that it is almost everywhere differentiable. However, perhaps somewhat surprisingly, there are continuous distribution functions that are Lebesgue singular functions. Consequently, there are non-discrete random variables that do not have a density. We next provide one such example.
Example 1.2.42. The Cantor set $\mathcal{C}$ is defined by removing $(1/3, 2/3)$ from $[0, 1]$ and then iteratively removing the middle third of each interval that remains. The uniform distribution on the (closed) set $\mathcal{C}$ corresponds to the distribution function obtained by setting $F(x) = 0$ for $x \leq 0$, $F(x) = 1$ for $x \geq 1$, $F(x) = 1/2$ for $x \in [1/3, 2/3]$, then $F(x) = 1/4$ for $x \in [1/9, 2/9]$, $F(x) = 3/4$ for $x \in [7/9, 8/9]$, and so on (which, as you should check, satisfies the properties (a)-(c) of Theorem 1.2.36). From the definition, we see that $dF/dx = 0$ for almost every $x \notin \mathcal{C}$ and that the corresponding probability measure has $P(\mathcal{C}^c) = 0$. As the Lebesgue measure of $\mathcal{C}$ is zero, we see that the derivative of $F$ is zero except on a set of zero Lebesgue measure, and consequently, there is no function $f$ for which $F(x) = \int_{-\infty}^x f(y)\,dy$ holds. Though it is somewhat more involved, you may want to check that $F$ is everywhere continuous (c.f. [Bil95, Problem 31.2]).
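The Cantor distribution function can be evaluated by scanning ternary digits of $x$: each digit 0 or 2 contributes a binary digit 0 or 1 to $F(x)$, and the scan stops once a digit 1 appears, i.e. once $x$ falls inside a removed middle third, where $F$ is flat. The sketch below is ours, not part of the notes:

```python
def cantor_F(x, depth=40):
    """Approximate the Cantor ("devil's staircase") distribution function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3.0
        digit = int(x)
        x -= digit
        if digit == 1:
            return value + scale  # x fell in a removed middle third
        value += scale * (digit // 2)  # ternary digit 2 -> binary digit 1
        scale /= 2.0
    return value
```

For instance the function equals $1/2$ on all of $[1/3, 2/3]$, and takes the classical values $F(1/4) = 1/3$ and $F(3/4) = 2/3$ at points of the Cantor set.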
Even discrete distribution functions can be quite complex. As the next example shows, the points of discontinuity of such a function might form a (countable) dense subset of $\mathbb{R}$ (which in a sense is extreme, per Exercise 1.2.38).
Example 1.2.43. Let $q_1, q_2, \ldots$ be an enumeration of the rational numbers and set
$$F(x) = \sum_{i=1}^\infty 2^{-i} 1_{[q_i, \infty)}(x)$$
(where $1_{[q_i, \infty)}(x) = 1$ if $x \geq q_i$ and zero otherwise). Clearly, such $F$ is non-decreasing, with limits $0$ and $1$ as $x \to -\infty$ and $x \to \infty$, respectively. It is not hard to check that $F$ is also right continuous, hence a distribution function, whereas by construction $F$ is discontinuous at each rational number.
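The partial sums of the series in Example 1.2.43 can be explored directly; the finite enumeration below is an arbitrary illustrative choice of ours (repetitions among the listed rationals are harmless, they only redistribute the jump sizes):

```python
from fractions import Fraction

# a finite start of an enumeration of the rationals in [-2, 2]
qs = [Fraction(p, q) for q in range(1, 6) for p in range(-2 * q, 2 * q + 1)]

def F_m(x):
    """Truncated version of F(x) = sum_i 2^{-i} 1_{[q_i, oo)}(x)."""
    return sum(0.5**i for i, qi in enumerate(qs, start=1) if qi <= x)
```

The truncated `F_m` is non-decreasing and jumps (by $2^{-i}$) exactly at the listed rationals, previewing the dense-discontinuity behaviour of the full series.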
As we have that $P(\{\omega : X(\omega) \leq \alpha\}) = F_X(\alpha)$ for the generators $\{\omega : X(\omega) \leq \alpha\}$ of $\sigma(X)$, we are not at all surprised by the following proposition.
Proposition 1.2.44. The distribution function $F_X$ uniquely determines the law $\mathcal{P}_X$ of $X$.
Proof. Consider the collection $\pi(\mathbb{R}) = \{(-\infty, b] : b \in \mathbb{R}\}$ of subsets of $\mathbb{R}$. It is easy to see that $\pi(\mathbb{R})$ is a $\pi$-system, which generates $\mathcal{B}$ (see Exercise 1.1.17). Hence, by Proposition 1.1.39, any two probability measures on $(\mathbb{R}, \mathcal{B})$ that coincide on $\pi(\mathbb{R})$ are the same. Since the distribution function $F_X$ specifies the restriction of such a probability measure $\mathcal{P}_X$ to $\pi(\mathbb{R})$, it thus uniquely determines the values of $\mathcal{P}_X(B)$ for all $B \in \mathcal{B}$.
Different probability measures $P$ on the measurable space $(\Omega, \mathcal{F})$ may trivialize different $\sigma$-algebras. That is,
Definition 1.2.45. If a $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$ and a probability measure $P$ on $(\Omega, \mathcal{F})$ are such that $P(H) \in \{0, 1\}$ for all $H \in \mathcal{H}$, we call $\mathcal{H}$ a $P$-trivial $\sigma$-algebra. Similarly, a random variable $X$ is called $P$-trivial or $P$-degenerate, if there exists a non-random constant $c$ such that $P(X \neq c) = 0$.
Using distribution functions we show next that all random variables on a $P$-trivial $\sigma$-algebra are $P$-trivial.
Proposition 1.2.46. If a random variable $X \in m\mathcal{H}$ for a $P$-trivial $\sigma$-algebra $\mathcal{H}$, then $X$ is $P$-trivial.
Proof. By definition, the sets $\{\omega : X(\omega) \leq \alpha\}$ are in $\mathcal{H}$ for all $\alpha \in \mathbb{R}$. Since $\mathcal{H}$ is $P$-trivial this implies that $F_X(\alpha) \in \{0, 1\}$ for all $\alpha \in \mathbb{R}$. In view of Theorem 1.2.36 this is possible only if $F_X(\alpha) = 1_{\alpha \geq c}$ for some non-random $c \in \mathbb{R}$ (for example, set $c = \inf\{\alpha : F_X(\alpha) = 1\}$). That is, $P(X \neq c) = 0$, as claimed.
We conclude with a few exercises about the support of measures on $(\mathbb{R}, \mathcal{B})$.
Exercise 1.2.47. Let $\mu$ be a measure on $(\mathbb{R}, \mathcal{B})$. A point $x$ is said to be in the support of $\mu$ if $\mu(O) > 0$ for every open neighborhood $O$ of $x$. Prove that the support of $\mu$ is a closed set whose complement is the maximal open set on which $\mu$ vanishes.
Exercise 1.2.48. Given an arbitrary closed set $C \subseteq \mathbb{R}$, construct a probability measure on $(\mathbb{R}, \mathcal{B})$ whose support is $C$.
Hint: Try a measure consisting of a countable collection of atoms (i.e. points of positive probability).
As you are to check next, the discontinuity points of a distribution function are closely related to the support of the corresponding law.
Exercise 1.2.49. The support of a distribution function $F$ is the set $S_F$ of $x \in \mathbb{R}$ such that $F(x + \delta) - F(x - \delta) > 0$ for all $\delta > 0$.
(a) Show that all points of discontinuity of $F(\cdot)$ belong to $S_F$, and that any isolated point of $S_F$ (that is, $x \in S_F$ such that $(x - \delta, x + \delta) \cap S_F = \{x\}$ for some $\delta > 0$) must be a point of discontinuity of $F(\cdot)$.
(b) Show that the support of the law $\mathcal{P}_X$ of a random variable $X$, as defined in Exercise 1.2.47, is the same as the support of its distribution function $F_X$.
1.3. Integration and the (mathematical) expectation
A key concept in probability theory is the mathematical expectation of random variables. In Subsection 1.3.1 we provide its definition via the framework of Lebesgue integration with respect to a measure, and study properties such as monotonicity and linearity. In Subsection 1.3.2 we consider fundamental inequalities associated with the expectation. Subsection 1.3.3 is about the exchange of integration and limit operations, complemented by uniform integrability and its consequences in Subsection 1.3.4. Subsection 1.3.5 considers densities relative to arbitrary measures and relates our treatment of integration and expectation to Riemann's integral and the classical definition of the expectation for a R.V. with probability density. We conclude with Subsection 1.3.6, about moments of random variables, including their values for a few well known distributions.
1.3.1. Lebesgue integral, linearity and monotonicity. Let $\mathrm{SF}_+$ denote the collection of non-negative simple functions with respect to the given measurable space $(S, \mathcal{F})$ and $m\mathcal{F}_+$ denote the collection of $[0, \infty]$-valued measurable functions on this space. We next define Lebesgue's integral with respect to any measure $\mu$ on $(S, \mathcal{F})$, first for $\varphi \in \mathrm{SF}_+$, then extending it to all $f \in m\mathcal{F}_+$. With the notation $\mu(f) := \int_S f(s)\,d\mu(s)$ for this integral, we also denote by $\mu_0(\cdot)$ the more restrictive integral, defined only on $\mathrm{SF}_+$, so as to clarify the role each of these plays in some of our proofs. We call an $\mathbb{R}$-valued measurable function $f \in m\mathcal{F}$ for which $\mu(|f|) < \infty$ a $\mu$-integrable function, and denote the collection of all $\mu$-integrable functions by $L^1(S, \mathcal{F}, \mu)$, extending the definition of the integral $\mu(f)$ to all $f \in L^1(S, \mathcal{F}, \mu)$.
Definition 1.3.1. Fix a measure space $(S, \mathcal{F}, \mu)$ and define $\mu(f)$ by the following four step procedure:
Step 1. Define $\mu_0(I_A) := \mu(A)$ for each $A \in \mathcal{F}$.
Step 2. Any $\varphi \in \mathrm{SF}_+$ has a representation $\varphi = \sum_{l=1}^n c_l I_{A_l}$ for some finite $n < \infty$, non-random $c_l \in [0, \infty]$ and sets $A_l \in \mathcal{F}$, yielding the definition of the integral via
$$\mu_0(\varphi) := \sum_{l=1}^n c_l \mu(A_l)\,,$$
where we adopt hereafter the convention that $\infty \cdot 0 = 0 \cdot \infty = 0$.
Step 3. For $f \in m\mathcal{F}_+$ we define
$$\mu(f) := \sup\{\mu_0(\varphi) : \varphi \in \mathrm{SF}_+,\ \varphi \leq f\}\,.$$
Step 4. For $f \in m\mathcal{F}$ let $f_+ = \max(f, 0) \in m\mathcal{F}_+$ and $f_- = -\min(f, 0) \in m\mathcal{F}_+$. We then set $\mu(f) = \mu(f_+) - \mu(f_-)$ provided either $\mu(f_+) < \infty$ or $\mu(f_-) < \infty$. In particular, this applies whenever $f \in L^1(S, \mathcal{F}, \mu)$, for then $\mu(f_+) + \mu(f_-) = \mu(|f|)$ is finite, hence $\mu(f)$ is well defined and finite valued.
We use the notation $\int_S f(s)\,d\mu(s)$ for $\mu(f)$, which we call the Lebesgue integral of $f$ with respect to the measure $\mu$.
The expectation $E[X]$ of a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is merely Lebesgue's integral $\int_\Omega X(\omega)\,dP(\omega)$ of $X$ with respect to $P$. That is,
Step 1. $E[I_A] = P(A)$ for any $A \in \mathcal{F}$.
Step 2. Any $\varphi \in \mathrm{SF}_+$ has a representation $\varphi = \sum_{l=1}^n c_l I_{A_l}$ for some non-random $n < \infty$, $c_l \geq 0$ and sets $A_l \in \mathcal{F}$, to which corresponds
$$E[\varphi] = \sum_{l=1}^n c_l E[I_{A_l}] = \sum_{l=1}^n c_l P(A_l)\,.$$
Step 3. For $X \in m\mathcal{F}_+$ define
$$EX = \sup\{EY : Y \in \mathrm{SF}_+,\ Y \leq X\}\,.$$
Step 4. Represent $X \in m\mathcal{F}$ as $X = X_+ - X_-$, where $X_+ = \max(X, 0) \in m\mathcal{F}_+$ and $X_- = -\min(X, 0) \in m\mathcal{F}_+$, with the corresponding definition
$$EX = EX_+ - EX_-\,,$$
provided either $EX_+ < \infty$ or $EX_- < \infty$.
Remark. Note that we may have $EX = \infty$ while $X(\omega) < \infty$ for all $\omega$. For instance, take the random variable $X(\omega) = \omega$ for $\Omega = \{1, 2, \ldots\}$ and $\mathcal{F} = 2^\Omega$. If $P(\omega = k) = ck^{-2}$ with $c = [\sum_{k=1}^\infty k^{-2}]^{-1}$ a positive, finite normalization constant, then $EX = c \sum_{k=1}^\infty k^{-1} = \infty$.
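The divergence in the remark is easy to see numerically. Truncating both the normalization and the harmonic sum (the truncation levels are our choice), the partial expectations grow like $c \log K$ and so exceed any fixed bound:

```python
def partial_expectation(K, M=10**6):
    """c * sum_{k < K} 1/k, with c approximated from the first M terms of sum k^{-2}."""
    c = 1.0 / sum(k**-2 for k in range(1, M))
    return c * sum(1.0 / k for k in range(1, K))
```

Each tenfold increase of `K` adds roughly $c \ln 10 \approx 1.4$ to the partial expectation, while every individual value $X(\omega) = \omega$ remains finite.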
Similar to the notation of $\mu$-integrable functions introduced in the last step of the definition of Lebesgue's integral, we have the following definition for random variables.
Definition 1.3.2. We say that a random variable $X$ is (absolutely) integrable, or $X$ has finite expectation, if $E|X| < \infty$, that is, both $EX_+ < \infty$ and $EX_- < \infty$. Fixing $1 \leq q < \infty$ we denote by $L^q(\Omega, \mathcal{F}, P)$ the collection of random variables $X$ on $(\Omega, \mathcal{F})$ for which $\|X\|_q = [E|X|^q]^{1/q} < \infty$. For example, $L^1(\Omega, \mathcal{F}, P)$ denotes the space of all (absolutely) integrable random variables. We use the short notation $L^q$ when the probability space $(\Omega, \mathcal{F}, P)$ is clear from the context.
We next verify that Lebesgue's integral assigns a unique value to each function $f$ in Definition 1.3.1. To this end, we focus on $\mu_0 : \mathrm{SF}_+ \to [0, \infty]$ of Step 2 of our definition and derive its structural properties, such as monotonicity, linearity and invariance to a change of argument on a $\mu$-negligible set.
Lemma 1.3.3. $\mu_0(\varphi)$ assigns a unique value to each $\varphi \in \mathrm{SF}_+$. Further,
a). $\mu_0(\varphi) = \mu_0(\psi)$ if $\varphi, \psi \in \mathrm{SF}_+$ are such that $\mu(\{s : \varphi(s) \neq \psi(s)\}) = 0$.
b). $\mu_0$ is linear, that is
$$\mu_0(\varphi + \psi) = \mu_0(\varphi) + \mu_0(\psi)\,, \qquad \mu_0(c\varphi) = c\,\mu_0(\varphi)\,,$$
for any $\varphi, \psi \in \mathrm{SF}_+$ and $c \geq 0$.
c). $\mu_0$ is monotone, that is $\mu_0(\varphi) \leq \mu_0(\psi)$ if $\varphi(s) \leq \psi(s)$ for all $s \in S$.
Proof. Note that a non-negative simple function $\varphi \in \mathrm{SF}_+$ has many different representations as weighted sums of indicator functions. Suppose for example that
$$\sum_{l=1}^n c_l I_{A_l}(s) = \sum_{k=1}^m d_k I_{B_k}(s)\,, \quad (1.3.1)$$
for some $c_l \geq 0$, $d_k \geq 0$, $A_l \in \mathcal{F}$, $B_k \in \mathcal{F}$ and all $s \in S$. There exists a finite partition of $S$ into at most $2^{n+m}$ disjoint sets $C_i$ such that each of the sets $A_l$ and $B_k$ is a union of some $C_i$, $i = 1, \ldots, 2^{n+m}$. Expressing both sides of (1.3.1) as finite weighted sums of $I_{C_i}$, we necessarily have for each $i$ the same weight on both sides. Due to the (finite) additivity of $\mu$ over unions of disjoint sets $C_i$, we thus get after some algebra that
$$\sum_{l=1}^n c_l \mu(A_l) = \sum_{k=1}^m d_k \mu(B_k)\,. \quad (1.3.2)$$
Consequently, $\mu_0(\varphi)$ is well-defined and independent of the chosen representation for $\varphi$. Further, the conclusion (1.3.2) applies also when the two sides of (1.3.1) differ for $s \in C$ as long as $\mu(C) = 0$, hence proving the first stated property of the lemma.
Choosing the representation of $\varphi + \psi$ based on the representations of $\varphi$ and $\psi$ immediately results with the stated linearity of $\mu_0$. Given this, if $\varphi(s) \leq \psi(s)$ for all $s$, then $\psi = \varphi + \xi$ for some $\xi \in \mathrm{SF}_+$, implying that $\mu_0(\psi) = \mu_0(\varphi) + \mu_0(\xi) \geq \mu_0(\varphi)$, as claimed.
Remark. The stated monotonicity of $\mu_0$ implies that $\mu(\cdot)$ coincides with $\mu_0(\cdot)$ on $\mathrm{SF}_+$. As $\mu_0$ is uniquely defined for each $f \in \mathrm{SF}_+$ and $f = f_+$ when $f \in m\mathcal{F}_+$, it follows that $\mu(f)$ is uniquely defined for each $f \in m\mathcal{F}_+ \cup L^1(S, \mathcal{F}, \mu)$.
All three properties of $\mu_0$ (hence $\mu$) stated in Lemma 1.3.3 for functions in $\mathrm{SF}_+$ extend to all of $m\mathcal{F}_+ \cup L^1$. Indeed, the facts that $\mu(cf) = c\mu(f)$, that $\mu(f) \leq \mu(g)$ whenever $0 \leq f \leq g$, and that $\mu(f) = \mu(g)$ whenever $\mu(\{s : f(s) \neq g(s)\}) = 0$ are immediate consequences of our definition (once we have these for $f, g \in \mathrm{SF}_+$). Since $f \leq g$ implies $f_+ \leq g_+$ and $f_- \geq g_-$, the monotonicity of $\mu(\cdot)$ extends to $f, g \in L^1$. As for additivity, given $h, g \in L^1$ note that $h_- + g_- + (h+g)_+ = h_+ + g_+ + (h+g)_-$, with all six functions in $m\mathcal{F}_+$. Since $\mu(h_\pm) < \infty$ and $\mu(g_\pm) < \infty$, by linearity and monotonicity of $\mu(\cdot)$ on $m\mathcal{F}_+$ necessarily also $\mu((h+g)_\pm) < \infty$, and the linearity $\mu(h + g) = \mu(h) + \mu(g)$ on $m\mathcal{F}_+ \cup L^1$ follows by elementary algebra. In conclusion, we have just proved that
Proposition 1.3.5. The integral $\mu(f)$ assigns a unique value to each $f \in m\mathcal{F}_+ \cup L^1(S, \mathcal{F}, \mu)$. Further,
a). $\mu(f) = \mu(g)$ whenever $\mu(\{s : f(s) \neq g(s)\}) = 0$.
b). $\mu$ is linear, that is for any $f, h, g \in m\mathcal{F}_+ \cup L^1$ and $c \geq 0$,
$$\mu(h + g) = \mu(h) + \mu(g)\,, \qquad \mu(cf) = c\,\mu(f)\,.$$
c). $\mu$ is monotone, that is $\mu(f) \leq \mu(g)$ if $f(s) \leq g(s)$ for all $s \in S$.
Our proof of the identity $\mu(h + g) = \mu(h) + \mu(g)$ is an example of the following general approach to proving that certain properties hold for all $h \in L^1$.
Definition 1.3.6 (Standard Machine). To prove the validity of a certain property for all $h \in L^1(S, \mathcal{F}, \mu)$, break your proof into four easier steps, following those of Definition 1.3.1.
Step 1. Prove the property for $h$ which is an indicator function.
Step 2. Using linearity, extend the property to all $\varphi \in \mathrm{SF}_+$.
Step 3. Using MON extend the property to all $h \in m\mathcal{F}_+$.
Step 4. Extend the property in question to $h \in L^1$ by writing $h = h_+ - h_-$ and using linearity.
Here is another application of the standard machine.
Exercise 1.3.7. Suppose that a probability measure $\mathcal{P}$ on $(\mathbb{R}, \mathcal{B})$ is such that $\mathcal{P}(B) = \lambda(f I_B)$ for the Lebesgue measure $\lambda$ on $\mathbb{R}$, some non-negative Borel function $f(\cdot)$ and all $B \in \mathcal{B}$. Using the standard machine, prove that then $\mathcal{P}(h) = \lambda(fh)$ for any Borel function $h$ such that either $h \geq 0$ or $\lambda(f|h|) < \infty$.
Hint: See the proof of Proposition 1.3.56.
We shall see more applications of the standard machine later (for example, when proving Proposition 1.3.56 and Theorem 1.3.61).
We next strengthen the non-negativity and monotonicity properties of Lebesgue's integral $\mu(\cdot)$ by showing that
Lemma 1.3.8. If $\mu(h) = 0$ for $h \in m\mathcal{F}_+$, then $\mu(\{s : h(s) > 0\}) = 0$. Consequently, if for $f, g \in L^1(S, \mathcal{F}, \mu)$ both $\mu(f) = \mu(g)$ and $\mu(\{s : f(s) > g(s)\}) = 0$, then $\mu(\{s : f(s) \neq g(s)\}) = 0$.
Proof. By continuity from below of the measure $\mu$ we have that
$$\mu(\{s : h(s) > 0\}) = \lim_{n \to \infty} \mu(\{s : h(s) > n^{-1}\})$$
(see Exercise 1.1.4). Hence, if $\mu(\{s : h(s) > 0\}) > 0$, then for some $n < \infty$,
$$0 < n^{-1} \mu(\{s : h(s) > n^{-1}\}) = \mu_0(n^{-1} I_{h > n^{-1}}) \leq \mu(h)\,,$$
where the right most inequality is a consequence of the definition of $\mu(h)$ and the fact that $h \geq n^{-1} I_{h > n^{-1}} \in \mathrm{SF}_+$. Thus, our assumption that $\mu(h) = 0$ must imply that $\mu(\{s : h(s) > 0\}) = 0$.
To prove the second part of the lemma, consider $h = g - f$, which is non-negative outside a set $N \in \mathcal{F}$ such that $\mu(N) = 0$. Hence, $\overline{h} = (g - f) I_{N^c} \in m\mathcal{F}_+$ and $0 = \mu(g) - \mu(f) = \mu(\overline{h})$, so by the first part of the lemma $\mu(\{s : \overline{h}(s) > 0\}) = 0$ and consequently $\mu(\{s : f(s) \neq g(s)\}) = 0$.
Exercise. Let $X_n = \sum_{i=1}^n Z_i$ count the number of fixed points of the random permutation (i.e. the number of self-matchings), where $Z_i$ denotes the indicator of the event that $i$ is a fixed point of a permutation chosen uniformly at random from all permutations of $\{1, \ldots, n\}$. Show that $E[X_n(X_n - 1) \cdots (X_n - k + 1)] = 1$ for $k = 1, 2, \ldots, n$.
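For small $n$ the matching identity can be verified exhaustively, reading the random permutation as uniform over all $n!$ permutations and $Z_i$ as the indicator that $i$ is a fixed point (our reading of the partly lost setup):

```python
from fractions import Fraction
from itertools import permutations

def avg_falling_factorial(n, k):
    """Average of X(X-1)...(X-k+1) over all permutations, X = # fixed points."""
    total, count = Fraction(0), 0
    for perm in permutations(range(n)):
        x = sum(perm[i] == i for i in range(n))
        ff = 1
        for j in range(k):
            ff *= x - j
        total += ff
        count += 1
    return total / count
```

Exact rational arithmetic via `Fraction` shows the average equals exactly $1$ for every $k = 1, \ldots, n$.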
Similarly, here is an elementary application of the monotonicity of the expectation (i.e. part (c) of the preceding theorem).
Exercise 1.3.12. Suppose an integrable random variable $X$ is such that $E(X I_A) = 0$ for each $A \in \sigma(X)$. Show that necessarily $X = 0$ almost surely.
1.3.2. Inequalities. The linearity of the expectation often allows us to com
pute EX even when we cannot compute the distribution function F
X
. In such cases
the expectation can be used to bound tail probabilities, based on the following clas
sical inequality.
Theorem 1.3.13 (Markov's inequality). Suppose $\varphi : \mathbb{R} \to [0, \infty]$ is a Borel function and let $\varphi(A) = \inf\{\varphi(y) : y \in A\}$ for any $A \in \mathcal{B}$. Then for any random variable $X$,
\[ \varphi(A)\, P(X \in A) \le E(\varphi(X) I_{X \in A}) \le E\varphi(X). \]
Proof. By the definition of $\varphi(A)$,
\[ \varphi(A) I_{x \in A} \le \varphi(x) I_{x \in A} \le \varphi(x), \]
for all $x \in \mathbb{R}$. Therefore,
\[ \varphi(A) I_{X(\omega) \in A} \le \varphi(X(\omega)) I_{X(\omega) \in A} \le \varphi(X(\omega)) \quad \text{for every } \omega \in \Omega. \]
We deduce the stated inequality by the monotonicity of the expectation and the identity $E(\varphi(A) I_{X \in A}) = \varphi(A)\, P(X \in A)$.
Example 1.3.14. (b). Taking $\varphi(x) = |x|^q$ for some $q > 0$ and $A = \{x : |x| \ge a\}$ for some $a > 0$, we have $\varphi(A) = a^q$. Markov's inequality is then
\[ a^q\, P(|X| \ge a) \le E|X|^q. \]
Considering $q = 2$ and $X = Y - EY$ for $Y \in L^2$, this amounts to
\[ P(|Y - EY| \ge a) \le \frac{{\rm Var}(Y)}{a^2}, \]
which we call Chebyshev's inequality (c.f. Definition 1.3.67 for the variance and moments of random variable $Y$).
(c). Taking $\varphi(x) = e^{\theta x}$ for some $\theta > 0$ and $A = [a, \infty)$ for some $a \in \mathbb{R}$, we have that $\varphi(A) = e^{\theta a}$. Markov's inequality is then
\[ P(X \ge a) \le e^{-\theta a}\, E e^{\theta X}. \]
36 1. PROBABILITY, MEASURE AND INTEGRATION
This bound provides an exponential decay in $a$, at the cost of requiring $X$ to have finite exponential moments.
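As a numerical aside (not part of the original notes), one can compare the bounds of parts (b) and (c) with an exact tail. The choice $X \sim {\rm Exp}(1)$ is ours: then $P(X \ge a) = e^{-a}$, $EX = 1$, ${\rm Var}(X) = 1$ and $E e^{\theta X} = 1/(1-\theta)$ for $\theta < 1$, so all three bounds can be evaluated in closed form.

```python
import math

a = 3.0
exact = math.exp(-a)                    # P(X >= a) for X ~ Exp(1)
markov = 1.0 / a                        # Markov: E X / a, with E X = 1
chebyshev = 1.0 / (a - 1.0) ** 2        # Chebyshev: Var(X) / (a - EX)^2
theta = 0.5                             # any theta in (0, 1) works here
chernoff = math.exp(-theta * a) / (1.0 - theta)   # e^{-theta a} E e^{theta X}

# each bound dominates the exact tail probability
assert exact <= chebyshev and exact <= markov and exact <= chernoff
```

Optimizing the last bound over $\theta$ tightens the exponential estimate further.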
In general, we cannot compute $EX$ explicitly from Definition 1.3.1 except for discrete R.V.s and for R.V.s having a probability density function. We thus appeal to the properties of the expectation listed in Theorem 1.3.9, or use various inequalities to bound one expectation by another. To this end, we start with Jensen's inequality, dealing with the effect that a convex function has on the expectation.
Proposition 1.3.15 (Jensen's inequality). Suppose $g(\cdot)$ is a convex function on an open interval $G$ of $\mathbb{R}$, that is,
\[ \lambda g(x) + (1-\lambda) g(y) \ge g(\lambda x + (1-\lambda) y) \quad \forall x, y \in G,\ 0 \le \lambda \le 1. \]
If $X$ is an integrable R.V. with $P(X \in G) = 1$ and $g(X)$ is also integrable, then $E(g(X)) \ge g(EX)$.
Proof. The convexity of $g(\cdot)$ on $G$ implies that $g(\cdot)$ is continuous on $G$ (hence $g(X)$ is a random variable) and the existence for each $c \in G$ of $b = b(c) \in \mathbb{R}$ such that
(1.3.3) $\quad g(x) \ge g(c) + b(x - c), \quad \forall x \in G.$
Since $G$ is an open interval of $\mathbb{R}$ with $P(X \in G) = 1$ and $X$ is integrable, it follows that $EX \in G$. Assuming (1.3.3) holds for $c = EX$, that $X \in G$ a.s., and that both $X$ and $g(X)$ are integrable, we have by Theorem 1.3.9 that
\[ E(g(X)) = E(g(X) I_{X \in G}) \ge E[(g(c) + b(X - c)) I_{X \in G}] = g(c) + b(EX - c) = g(EX), \]
as stated. To derive (1.3.3) note that if $(c - h_2, c + h_1) \subseteq G$ for positive $h_1$ and $h_2$, then by convexity of $g(\cdot)$,
\[ \frac{h_2}{h_1 + h_2}\, g(c + h_1) + \frac{h_1}{h_1 + h_2}\, g(c - h_2) \ge g(c), \]
which amounts to $[g(c + h_1) - g(c)]/h_1 \ge [g(c) - g(c - h_2)]/h_2$. Considering the infimum over $h_1 > 0$ and the supremum over $h_2 > 0$ we deduce that
\[ \inf_{h > 0,\ c + h \in G} \frac{g(c + h) - g(c)}{h} =: (D_+ g)(c) \ge (D_- g)(c) := \sup_{h > 0,\ c - h \in G} \frac{g(c) - g(c - h)}{h}. \]
With $G$ an open set, obviously $(D_- g)(c)$ and $(D_+ g)(c)$ are finite, and choosing any $b \in [(D_- g)(c), (D_+ g)(c)] \subseteq \mathbb{R}$ we get (1.3.3) out of the definition of $D_+ g$ and $D_- g$.
Remark. Since $g(\cdot)$ is convex if and only if $-g(\cdot)$ is concave, we may as well state Jensen's inequality for concave functions, just reversing the sign of the inequality in this case. A trivial instance of Jensen's inequality happens when $X(\omega) = x I_A(\omega) + y I_{A^c}(\omega)$ for some $x, y \in \mathbb{R}$ and $A \in \mathcal{F}$ such that $P(A) = \lambda$. Then,
\[ EX = x P(A) + y P(A^c) = \lambda x + y(1 - \lambda), \]
whereas $g(X(\omega)) = g(x) I_A(\omega) + g(y) I_{A^c}(\omega)$. So,
\[ E g(X) = \lambda g(x) + g(y)(1 - \lambda) \ge g(\lambda x + y(1 - \lambda)) = g(EX), \]
as $g$ is convex.
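As a quick illustration (ours, not from the notes), the two-point instance above can be checked numerically, taking for concreteness the convex function $g(x) = e^x$:

```python
import math

x, y, lam = 1.0, 4.0, 0.3          # X = x on A, y on A^c, with P(A) = lam
g = math.exp                        # a convex function on R

EX = lam * x + (1.0 - lam) * y
Eg = lam * g(x) + (1.0 - lam) * g(y)

assert Eg >= g(EX)                  # Jensen: E g(X) >= g(EX)
```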
Applying Jensen's inequality, we show that the spaces $L^q(\Omega, \mathcal{F}, P)$ of Definition 1.3.2 are nested in terms of the parameter $q \ge 1$.
Lemma 1.3.16. Fixing $Y \in m\mathcal{F}$, the mapping $q \mapsto \|Y\|_q = [E|Y|^q]^{1/q}$ is non-decreasing for $q > 0$. Hence, the space $L^q(\Omega, \mathcal{F}, P)$ is contained in $L^r(\Omega, \mathcal{F}, P)$ for any $r \le q$.
Proof. Fix $q > r > 0$ and consider the sequence of bounded R.V. $X_n(\omega) = \min(|Y(\omega)|, n)^r$. Obviously, $X_n$ and $X_n^{q/r}$ are both in $L^1$. Apply Jensen's inequality for the convex function $g(x) = |x|^{q/r}$ and the non-negative R.V. $X_n$, to get that
\[ (E X_n)^{q/r} \le E(X_n^{q/r}) = E[\min(|Y|, n)^q] \le E(|Y|^q). \]
For $n \to \infty$ we have that $X_n \uparrow |Y|^r$, so by monotone convergence $E(|Y|^r)^{q/r} \le E(|Y|^q)$. Taking the $1/q$-th power yields the stated result $\|Y\|_r \le \|Y\|_q$.
We next bound the expectation of the product of two R.V. while assuming nothing about the relation between them.
Proposition 1.3.17 (Hölder's inequality). Let $X, Y$ be two random variables on the same probability space. If $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$, then
(1.3.4) $\quad E|XY| \le \|X\|_p \|Y\|_q.$
Remark. Recall that if $XY$ is integrable then $E|XY|$ is by itself an upper bound on $|E(XY)|$. The special case of $p = q = 2$ in Hölder's inequality,
\[ E|XY| \le \sqrt{E X^2}\, \sqrt{E Y^2}, \]
is called the Cauchy-Schwarz inequality.
Proof. Fixing $p > 1$ and $q = p/(p-1)$ let $\alpha = \|X\|_p$ and $\beta = \|Y\|_q$. If $\alpha = 0$ then $|X|^p \overset{a.s.}{=} 0$ (see Theorem 1.3.9). Likewise, if $\beta = 0$ then $|Y|^q \overset{a.s.}{=} 0$. In either case, the inequality (1.3.4) trivially holds. As this inequality also trivially holds when either $\alpha = \infty$ or $\beta = \infty$, we may and shall assume hereafter that both $\alpha$ and $\beta$ are finite and strictly positive. Recall that
\[ \frac{x^p}{p} + \frac{y^q}{q} - xy \ge 0, \quad \forall x, y \ge 0 \]
(c.f. [Dur10, Page 21] where it is proved by considering the first two derivatives in $x$). Taking $x = |X|/\alpha$ and $y = |Y|/\beta$, we have by linearity and monotonicity of the expectation that
\[ 1 = \frac{1}{p} + \frac{1}{q} = \frac{E|X|^p}{p\,\alpha^p} + \frac{E|Y|^q}{q\,\beta^q} \ge \frac{E|XY|}{\alpha\beta}, \]
yielding the stated inequality (1.3.4).
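Like any integral inequality, (1.3.4) holds exactly for the empirical distribution of a sample, which gives a cheap numerical check (the Gaussian sample and the exponents $p = 3$, $q = 3/2$ are our choices):

```python
import random

random.seed(0)
n = 5000
X = [random.gauss(0.0, 1.0) for _ in range(n)]
Y = [random.gauss(0.0, 1.0) for _ in range(n)]

p, q = 3.0, 1.5                      # conjugate exponents: 1/p + 1/q = 1
lhs = sum(abs(x * y) for x, y in zip(X, Y)) / n
rhs = (sum(abs(x) ** p for x in X) / n) ** (1.0 / p) * \
      (sum(abs(y) ** q for y in Y) / n) ** (1.0 / q)
assert lhs <= rhs                    # Hölder for the empirical measure
```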
A direct consequence of Hölder's inequality is the triangle inequality for the norm $\|X\|_p$ in $L^p(\Omega, \mathcal{F}, P)$, that is,
Proposition 1.3.18 (Minkowski's inequality). If $X, Y \in L^p(\Omega, \mathcal{F}, P)$, $p \ge 1$, then $\|X + Y\|_p \le \|X\|_p + \|Y\|_p$.
Proof. With $|X + Y| \le |X| + |Y|$, by monotonicity of the expectation we have the stated inequality in case $p = 1$. Considering hereafter $p > 1$, it follows from Hölder's inequality (Proposition 1.3.17) that
\[ E|X + Y|^p = E(|X + Y| \cdot |X + Y|^{p-1}) \le E(|X| |X + Y|^{p-1}) + E(|Y| |X + Y|^{p-1}) \]
\[ \le (E|X|^p)^{\frac{1}{p}} (E|X + Y|^{(p-1)q})^{\frac{1}{q}} + (E|Y|^p)^{\frac{1}{p}} (E|X + Y|^{(p-1)q})^{\frac{1}{q}} = (\|X\|_p + \|Y\|_p)\,(E|X + Y|^p)^{\frac{1}{q}} \]
(recall that $(p-1)q = p$). Since $X, Y \in L^p$ and
\[ |x + y|^p \le (|x| + |y|)^p \le 2^{p-1}(|x|^p + |y|^p), \quad \forall x, y \in \mathbb{R},\ p > 1, \]
it follows that $a_p = E|X + Y|^p < \infty$. There is nothing to prove unless $a_p > 0$, in which case dividing by $(a_p)^{1/q}$ we get that
\[ (E|X + Y|^p)^{1 - \frac{1}{q}} \le \|X\|_p + \|Y\|_p, \]
giving the stated inequality (since $1 - \frac{1}{q} = \frac{1}{p}$).
Remark. Jensen's inequality applies only for probability measures, while both Hölder's inequality $\mu(|fg|) \le \mu(|f|^p)^{1/p} \mu(|g|^q)^{1/q}$ and Minkowski's inequality apply for any measure $\mu$, with exactly the same proof we provided for probability measures.
To practice your understanding of Markov's inequality, solve the following exercise.
Exercise 1.3.19. Let $X$ be a non-negative random variable with ${\rm Var}(X) \le 1/2$. Show that then $P(1 + EX \ge X \ge EX - 1) \ge 1/2$.
To practice your understanding of the proof of Jensen's inequality, try to prove its extension to convex functions on $\mathbb{R}^n$.
Exercise 1.3.20. Suppose $g : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $X_1, X_2, \ldots, X_n$ are integrable random variables, defined on the same probability space and such that $g(X_1, \ldots, X_n)$ is integrable. Show that then $E g(X_1, \ldots, X_n) \ge g(EX_1, \ldots, EX_n)$.
Hint: Use convex analysis to show that $g(\cdot)$ is continuous and further that for any $c \in \mathbb{R}^n$ there exists $b \in \mathbb{R}^n$ such that $g(x) \ge g(c) + \langle b, x - c \rangle$ for all $x \in \mathbb{R}^n$ (with $\langle \cdot, \cdot \rangle$ denoting the inner product of two vectors in $\mathbb{R}^n$).
Exercise 1.3.21. Let $Y \ge 0$ with $v = E(Y^2) < \infty$.
(a) Show that for any $0 \le a < EY$,
\[ P(Y > a) \ge \frac{(EY - a)^2}{E(Y^2)}. \]
Hint: Apply the Cauchy-Schwarz inequality to $Y I_{Y > a}$.
(b) Show that $(E|Y^2 - v|)^2 \le 4v(v - (EY)^2)$.
(c) Derive the second Bonferroni inequality,
\[ P\Big(\bigcup_{i=1}^n A_i\Big) \ge \sum_{i=1}^n P(A_i) - \sum_{1 \le j < i \le n} P(A_i \cap A_j). \]
How does it compare with the bound of part (a) for $Y = \sum_{i=1}^n I_{A_i}$?
1.3.3. Convergence, limits and expectation. Asymptotic behavior is a key issue in probability theory. We thus explore here various notions of convergence of random variables and the relations among them, focusing on the integrability conditions needed for exchanging the order of limit and expectation operations. Unless explicitly stated otherwise, throughout this section we assume that all R.V. are defined on the same probability space $(\Omega, \mathcal{F}, P)$.
In Definition 1.2.25 we have encountered the convergence almost surely of R.V. A weaker notion of convergence is convergence in probability as defined next.
Definition 1.3.22. We say that R.V. $X_n$ converge to a given R.V. $X_\infty$ in probability, denoted $X_n \overset{p}{\to} X_\infty$, if $P(\{\omega : |X_n(\omega) - X_\infty(\omega)| > \varepsilon\}) \to 0$ as $n \to \infty$, for each fixed $\varepsilon > 0$. This is a special case of the convergence in measure of $f_n \in m\mathcal{F}$ to $f_\infty \in m\mathcal{F}$, that is, $\mu(\{s : |f_n(s) - f_\infty(s)| > \varepsilon\}) \to 0$ as $n \to \infty$, for any fixed $\varepsilon > 0$.
Our next exercise and example clarify the relationship between convergence almost surely and convergence in probability.
Exercise 1.3.23. Verify that convergence almost surely to a finite limit implies convergence in probability, that is, if $X_n \overset{a.s.}{\to} X_\infty \in \mathbb{R}$ then $X_n \overset{p}{\to} X_\infty$.
Remark 1.3.24. Generalizing Definition 1.3.22, for a separable metric space $(S, \rho)$ we say that $(S, \mathcal{B}_S)$-valued random variables $X_n$ converge to $X_\infty$ in probability if and only if for every $\varepsilon > 0$, $P(\rho(X_n, X_\infty) > \varepsilon) \to 0$ as $n \to \infty$.
Example 1.3.25. Consider the probability space $((0,1], \mathcal{B}_{(0,1]}, U)$ and $X_n(\omega) = 1_{[t_n, t_n + s_n]}(\omega)$ with $s_n \to 0$ as $n \to \infty$ slowly enough and $t_n \in [0, 1 - s_n]$ such that any $\omega \in (0,1]$ is in infinitely many intervals $[t_n, t_n + s_n]$. The latter property applies if $t_n = (i-1)/k$ and $s_n = 1/k$ when $n = k(k-1)/2 + i$, $i = 1, 2, \ldots, k$ and $k = 1, 2, \ldots$ (plot the intervals $[t_n, t_n + s_n]$ to convince yourself). Then, $X_n \overset{p}{\to} 0$ (since $s_n = U(X_n \ne 0) \to 0$), whereas fixing each $\omega \in (0,1]$, we have that $X_n(\omega) = 1$ for infinitely many values of $n$, hence $X_n$ does not converge a.s. to zero.
Associated with each space $L^q(\Omega, \mathcal{F}, P)$ is the notion of $L^q$ convergence which we now define.
Definition 1.3.26. We say that $X_n$ converges in $L^q$ to $X_\infty$, denoted $X_n \overset{L^q}{\to} X_\infty$, if $X_n, X_\infty \in L^q$ and $\|X_n - X_\infty\|_q \to 0$ as $n \to \infty$ (i.e., $E(|X_n - X_\infty|^q) \to 0$ as $n \to \infty$).
Remark. For $q = 2$ we have the explicit formula
\[ \|X_n - X\|_2^2 = E(X_n^2) - 2E(X_n X) + E(X^2). \]
Thus, it is often easiest to check convergence in $L^2$.
The following immediate corollary of Lemma 1.3.16 provides an ordering of $L^q$ convergence in terms of the parameter $q$.
Corollary 1.3.27. If $X_n \overset{L^q}{\to} X_\infty$ and $q \ge r$, then $X_n \overset{L^r}{\to} X_\infty$.
Next note that the $L^q$ convergence implies the convergence of the expectation of $|X_n|^q$.
Exercise 1.3.28. Fixing $q \ge 1$, use Minkowski's inequality (Proposition 1.3.18) to show that if $X_n \overset{L^q}{\to} X_\infty$, then $E|X_n|^q \to E|X_\infty|^q$, and for $q = 1, 2, 3, \ldots$ also $E X_n^q \to E X_\infty^q$.
Further, it follows from Markov's inequality that the convergence in $L^q$ implies convergence in probability (for any value of $q$).
Proposition 1.3.29. If $X_n \overset{L^q}{\to} X_\infty$, then $X_n \overset{p}{\to} X_\infty$.
Proof. Fixing $q > 0$ recall that Markov's inequality results with
\[ P(|Y| > \varepsilon) \le \varepsilon^{-q}\, E[|Y|^q], \]
for any R.V. $Y$ and any $\varepsilon > 0$ (c.f. part (b) of Example 1.3.14). The assumed convergence in $L^q$ means that $E[|X_n - X_\infty|^q] \to 0$ as $n \to \infty$, so taking $Y = Y_n = X_n - X_\infty$ we deduce that $P(|X_n - X_\infty| > \varepsilon) \to 0$ as $n \to \infty$, as claimed.
The converse of Proposition 1.3.29 does not hold in general. As we next demonstrate, even the stronger almost surely convergence (see Exercise 1.3.23), and having a non-random constant limit, are not enough to guarantee the $L^q$ convergence, for any $q > 0$.
Example 1.3.30. Fixing $q > 0$, consider the probability space $((0,1], \mathcal{B}_{(0,1]}, U)$ and the R.V. $Y_n(\omega) = n^{1/q} I_{[0, n^{-1}]}(\omega)$. Since $Y_n(\omega) = 0$ for all $n \ge n_0$ and some finite $n_0 = n_0(\omega)$, it follows that $Y_n(\omega) \overset{a.s.}{\to} 0$ as $n \to \infty$. However, $E[|Y_n|^q] = n\, U([0, n^{-1}]) = 1$ for all $n$, so $Y_n$ does not converge to zero in $L^q$ (see Exercise 1.3.28).
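For Example 1.3.30 everything is explicit, so the computation can be replayed exactly (a sketch of ours, with $q = 2$ and a fixed $\omega$ as our sample values):

```python
q = 2.0
omega = 0.05                       # a fixed point of (0, 1]

# Y_n(omega) = n^{1/q} if omega <= 1/n, else 0: zero for all n > 1/omega = 20
vals = [n ** (1.0 / q) if omega <= 1.0 / n else 0.0 for n in range(1, 101)]
assert all(v == 0.0 for v in vals[20:])

# yet E|Y_n|^q = (n^{1/q})^q * U((0, 1/n]) = 1 for every n
moments = [(n ** (1.0 / q)) ** q * (1.0 / n) for n in range(1, 101)]
assert all(abs(m - 1.0) < 1e-9 for m in moments)
```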
Considering Example 1.3.25, where $X_n \overset{L^q}{\to} 0$ while $X_n$ does not converge a.s. to zero, and Example 1.3.30 which exhibits the converse phenomenon, we conclude that the convergence in $L^q$ and the a.s. convergence are in general non-comparable, and neither one is a consequence of convergence in probability.
Nevertheless, a sequence $X_n$ can have at most one limit, regardless of which convergence mode is considered.
Exercise 1.3.31. Check that if $X_n \overset{L^q}{\to} X$ and $X_n \overset{a.s.}{\to} Y$ then $X \overset{a.s.}{=} Y$.
Though we have just seen that in general the order of the limit and expectation operations is non-interchangeable, we examine for the remainder of this subsection various conditions which do allow for such an interchange. Note in passing that upon proving any such result under certain pointwise convergence conditions, we may with no extra effort relax these to the corresponding almost sure convergence (and the same applies for integrals with respect to measures, see part (a) of Theorem 1.3.9, or that of Proposition 1.3.5).
Turning to do just that, we first outline the results which apply in the more general measure theory setting, starting with the proof of the monotone convergence theorem.
Proof of Theorem 1.3.4. By part (c) of Proposition 1.3.5, the proof of which did not use Theorem 1.3.4, we know that $\mu(h_n)$ is a non-decreasing sequence that is bounded above by $\mu(h)$. It therefore suffices to show that
(1.3.5) $\quad \lim_{n\to\infty} \mu(h_n) = \sup_n \{\mu_0(\varphi) : \varphi \in {\rm SF}_+,\ \varphi \le h_n\} \ge \sup\{\mu_0(\varphi) : \varphi \in {\rm SF}_+,\ \varphi \le h\} = \mu(h)$
(see Step 3 of Definition 1.3.1). That is, it suffices to find for each non-negative simple function $\varphi \le h$ a sequence of non-negative simple functions $\varphi_n \le h_n$ such that $\mu_0(\varphi_n) \to \mu_0(\varphi)$ as $n \to \infty$. To this end, fixing $\varphi \le h$, we may and shall choose without loss of generality a representation $\varphi = \sum_{l=1}^m c_l I_{A_l}$ such that $A_l \in \mathcal{F}$ are disjoint and further $c_l \mu(A_l) > 0$ for $l = 1, \ldots, m$ (see proof of Lemma 1.3.3). Using hereafter the notation $h_\varphi(A_l) = \inf\{h(s) : s \in A_l\}$, necessarily $c_l \le h_\varphi(A_l)$ for all $l$, so
\[ \mu_0(\varphi) \le \sum_{l=1}^m h_\varphi(A_l)\,\mu(A_l) = V. \]
Suppose first that $V < \infty$, that is, $0 < h_\varphi(A_l)\mu(A_l) < \infty$ for all $l$. In this case, fixing $\epsilon < 1$, consider for each $n$ the disjoint sets $A_{l,\epsilon,n} = \{s \in A_l : h_n(s) \ge \epsilon h_\varphi(A_l)\} \in \mathcal{F}$ and the corresponding
\[ \varphi_{\epsilon,n}(s) = \epsilon \sum_{l=1}^m h_\varphi(A_l) I_{A_{l,\epsilon,n}}(s) \in {\rm SF}_+, \]
where $\varphi_{\epsilon,n}(s) \le h_n(s)$ for all $s \in S$. If $s \in A_l$ then $h(s) \ge h_\varphi(A_l) > \epsilon h_\varphi(A_l)$. Thus, $h_n \uparrow h$ implies that $A_{l,\epsilon,n} \uparrow A_l$ as $n \to \infty$, for each $l$. Consequently, by definition of $\mu(h_n)$ and the continuity from below of $\mu$,
\[ \lim_{n\to\infty} \mu(h_n) \ge \lim_{n\to\infty} \mu_0(\varphi_{\epsilon,n}) = \epsilon V. \]
Taking $\epsilon \uparrow 1$ we deduce that $\lim_n \mu(h_n) \ge V \ge \mu_0(\varphi)$. Next suppose that $V = \infty$, so without loss of generality we may and shall assume that $h_\varphi(A_1)\mu(A_1) = \infty$. Fixing $x \in (0, h_\varphi(A_1))$ let $A_{1,x,n} = \{s \in A_1 : h_n(s) \ge x\} \in \mathcal{F}$, noting that $A_{1,x,n} \uparrow A_1$ as $n \to \infty$ and that $\varphi_{x,n}(s) = x I_{A_{1,x,n}}(s) \le h_n(s)$ for all $n$ and $s \in S$, is a non-negative simple function. Thus, again by continuity from below of $\mu$ we have that
\[ \lim_{n\to\infty} \mu(h_n) \ge \lim_{n\to\infty} \mu_0(\varphi_{x,n}) = x\,\mu(A_1). \]
Taking $x \uparrow h_\varphi(A_1)$ we deduce that $\lim_n \mu(h_n) \ge h_\varphi(A_1)\mu(A_1) = \infty$, completing the proof of (1.3.5) and that of the theorem.
Considering probability spaces, Theorem 1.3.4 tells us that we can exchange the order of the limit and the expectation in case of monotone upward a.s. convergence of non-negative R.V. (with the limit possibly infinite). That is,
Theorem 1.3.32 (Monotone convergence theorem). If $X_n \ge 0$ and $X_n(\omega) \uparrow X_\infty(\omega)$ for almost every $\omega$, then $EX_n \uparrow EX_\infty$.
In Example 1.3.30 we have a pointwise convergent sequence of R.V. whose expectations exceed that of their limit. In a sense this is always the case, as stated next in Fatou's lemma (which is a direct consequence of the monotone convergence theorem).
Lemma 1.3.33 (Fatou's lemma). For any measure space $(S, \mathcal{F}, \mu)$ and any $f_n \in m\mathcal{F}$, if $f_n(s) \ge g(s)$ for some $\mu$-integrable function $g$, all $n$ and almost-every $s \in S$, then
(1.3.6) $\quad \liminf_{n\to\infty} \mu(f_n) \ge \mu(\liminf_{n\to\infty} f_n).$
Alternatively, if $f_n(s) \le g(s)$ for all $n$ and a.e. $s$, then
(1.3.7) $\quad \limsup_{n\to\infty} \mu(f_n) \le \mu(\limsup_{n\to\infty} f_n).$
Proof. Assume first that $f_n \in m\mathcal{F}_+$ and let $h_n(s) = \inf_{k \ge n} f_k(s)$, noting that $h_n \in m\mathcal{F}_+$ is a non-decreasing sequence, whose pointwise limit is $h(s) := \liminf_{n\to\infty} f_n(s)$. By the monotone convergence theorem, $\mu(h_n) \uparrow \mu(h)$. Since $f_n(s) \ge h_n(s)$ for all $s \in S$, the monotonicity of the integral (see Proposition 1.3.5) implies that $\mu(f_n) \ge \mu(h_n)$ for all $n$. Considering the $\liminf$ as $n \to \infty$ we arrive at (1.3.6).
Turning to extend this inequality to the more general setting of the lemma, note that our conditions imply that $f_n \overset{a.e.}{=} g + (f_n - g)_+$ for each $n$. Considering the countable union of the negligible sets in which one of these identities is violated, we thus have that
\[ h := \liminf_{n\to\infty} f_n \overset{a.e.}{=} g + \liminf_{n\to\infty} (f_n - g)_+. \]
Further, $\mu(f_n) = \mu(g) + \mu((f_n - g)_+)$ by the linearity of the integral (for the sum of a function in $L^1$ and one in $m\mathcal{F}_+$). Taking $n \to \infty$ and applying (1.3.6) for $(f_n - g)_+ \in m\mathcal{F}_+$ we deduce that
\[ \liminf_{n\to\infty} \mu(f_n) \ge \mu(g) + \mu(\liminf_{n\to\infty} (f_n - g)_+) = \mu(g) + \mu(h - g) = \mu(h) \]
(where for the right-most identity we used the linearity of the integral, as well as the fact that $g$ is integrable).
Finally, we get (1.3.7) for $f_n$ by considering (1.3.6) for $-f_n$.
Remark. In terms of the expectation, Fatou's lemma is the statement that if R.V. $X_n \ge X$, almost surely, for some $X \in L^1$ and all $n$, then
(1.3.8) $\quad \liminf_{n\to\infty} E(X_n) \ge E(\liminf_{n\to\infty} X_n),$
whereas if $X_n \le X$, almost surely, for some $X \in L^1$ and all $n$, then
(1.3.9) $\quad \limsup_{n\to\infty} E(X_n) \le E(\limsup_{n\to\infty} X_n).$
Some text books call (1.3.9) and (1.3.7) the Reverse Fatou Lemma (e.g. [Wil91, Section 5.4]).
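Inequality (1.3.8) can be strict. A standard example, in the spirit of Example 1.3.30, is $X_n = n\,I_{(0,1/n]}$ under the uniform law on $(0,1]$: $EX_n = 1$ for all $n$ while $\liminf_n X_n = 0$ pointwise. The grid evaluation below is our illustration (the midpoint grid is chosen so the averages come out exact for the listed $n$):

```python
def X(n, w):                         # X_n = n on (0, 1/n], else 0
    return float(n) if w <= 1.0 / n else 0.0

omegas = [(i + 0.5) / 50 for i in range(50)]      # midpoint grid in (0, 1)

# E X_n = 1 exactly; the grid average reproduces it for these n
EXn = [sum(X(n, w) for w in omegas) / len(omegas) for n in (1, 2, 5, 10)]
assert all(e == 1.0 for e in EXn)

# but for each fixed omega, X_n(omega) = 0 for all large n
tail_inf = [min(X(n, w) for n in range(200, 400)) for w in omegas]
assert all(v == 0.0 for v in tail_inf)
```

So $\liminf_n EX_n = 1 > 0 = E(\liminf_n X_n)$, consistent with (1.3.8).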
Using Fatou's lemma, we can easily prove Lebesgue's dominated convergence theorem (in short DOM).
Theorem 1.3.34 (Dominated convergence theorem). For any measure space $(S, \mathcal{F}, \mu)$ and any $f_n \in m\mathcal{F}$, if for some $\mu$-integrable function $g$ and almost-every $s \in S$ both $f_n(s) \to f_\infty(s)$ as $n \to \infty$, and $|f_n(s)| \le g(s)$ for all $n$, then $f_\infty$ is integrable and further $\mu(|f_n - f_\infty|) \to 0$ as $n \to \infty$.
Proof. Up to a negligible subset of $S$, our assumption that $|f_n| \le g$ and $f_n \to f_\infty$ implies that $|f_\infty| \le g$, hence $f_\infty$ is integrable, and that $|f_n - f_\infty| \le 2g$ for all $n$. Applying (1.3.7) for the functions $|f_n - f_\infty|$, whose limit superior is zero, we conclude that
\[ 0 \le \limsup_{n\to\infty} \mu(|f_n - f_\infty|) \le \mu(\limsup_{n\to\infty} |f_n - f_\infty|) = \mu(0) = 0, \]
as claimed.
By Minkowski's inequality, $\mu(|f_n - f_\infty|) \to 0$ implies that $\mu(|f_n|) \to \mu(|f_\infty|)$. The dominated convergence theorem provides us with a simple sufficient condition for the converse implication in case also $f_n \to f_\infty$ a.e.
Lemma 1.3.35 (Scheffé's lemma). If $f_n \in m\mathcal{F}$ converges a.e. to $f_\infty \in m\mathcal{F}$ and $\mu(|f_n|) \to \mu(|f_\infty|) < \infty$, then $\mu(|f_n - f_\infty|) \to 0$ as $n \to \infty$.
Remark. In terms of expectation, Scheffé's lemma states that if $X_n \overset{a.s.}{\to} X_\infty$ and $E|X_n| \to E|X_\infty| < \infty$, then $X_n \overset{L^1}{\to} X_\infty$ as well.
Proof. As already noted, we may assume without loss of generality that $f_n(s) \to f_\infty(s)$ as $n \to \infty$, for all $s \in S$. Further, since $\mu(|f_n|) \to \mu(|f_\infty|) < \infty$, we may and shall assume that $\mu(|f_n|) < \infty$ for all $n$. Suppose first that $f_n \in m\mathcal{F}_+$, in which case $g_n = (f_n - f_\infty)_-$ satisfies $0 \le g_n \le f_\infty$ for all $n$ and $s$. As $g_n(s) \to 0$ for all $s \in S$ and $\mu(f_\infty)$ is finite, it follows by dominated convergence that $\mu(g_n) \to 0$ as $n \to \infty$. Since $|x| = x + 2x_-$ for all $x \in \mathbb{R}$, by linearity of the integral we get that
\[ \mu(|f_n - f_\infty|) = \mu(f_n) - \mu(f_\infty) + 2\mu(g_n) \to 0 \quad \text{for } n \to \infty, \]
as claimed.
In the general case of $f_n \in m\mathcal{F}$, we know that both $0 \le (f_n)_+(s) \to (f_\infty)_+(s)$ and $0 \le (f_n)_-(s) \to (f_\infty)_-(s)$. By Fatou's lemma,
\[ \mu(|f_\infty|) = \mu((f_\infty)_+) + \mu((f_\infty)_-) \le \liminf_{n\to\infty} \mu((f_n)_-) + \liminf_{n\to\infty} \mu((f_n)_+) \]
\[ \le \liminf_{n\to\infty} \big[\mu((f_n)_-) + \mu((f_n)_+)\big] = \lim_{n\to\infty} \mu(|f_n|) = \mu(|f_\infty|). \]
Hence, necessarily both $\mu((f_n)_+) \to \mu((f_\infty)_+)$ and $\mu((f_n)_-) \to \mu((f_\infty)_-)$. Since $|x - y| \le |x_+ - y_+| + |x_- - y_-|$, applying the first part of the proof to $(f_n)_+$ and to $(f_n)_-$ we see that
\[ \lim_{n\to\infty} \mu(|f_n - f_\infty|) \le \lim_{n\to\infty} \mu(|(f_n)_+ - (f_\infty)_+|) + \lim_{n\to\infty} \mu(|(f_n)_- - (f_\infty)_-|) = 0, \]
concluding the proof of the lemma.
We conclude this subsection with quite a few exercises, starting with an alternative characterization of convergence almost surely.
Exercise 1.3.36. Show that $X_n \overset{a.s.}{\to} 0$ if and only if for each $\varepsilon > 0$ there is $n$ so that for each random integer $M$ with $M(\omega) \ge n$ for all $\omega$, we have that
\[ P(\{\omega : |X_{M(\omega)}(\omega)| > \varepsilon\}) < \varepsilon. \]
Exercise 1.3.37. Let $Y_n$ be (real-valued) random variables on $(\Omega, \mathcal{F}, P)$, and $N_k$ positive integer valued random variables on the same probability space.
(a) Show that $Y_{N_k}(\omega) = Y_{N_k(\omega)}(\omega)$ are random variables on $(\Omega, \mathcal{F})$.
(b) Show that if $Y_n \overset{a.s.}{\to} Y_\infty$ and $N_k \overset{a.s.}{\to} \infty$ then $Y_{N_k} \overset{a.s.}{\to} Y_\infty$.
(c) Provide an example of $Y_n \overset{p}{\to} 0$ and $N_k \overset{a.s.}{\to} \infty$ such that almost surely $Y_{N_k} = 1$ for all $k$.
(d) Show that if $Y_n \overset{a.s.}{\to} Y_\infty$ and $P(N_k > r) \to 1$ as $k \to \infty$, for every fixed $r < \infty$, then $Y_{N_k} \overset{p}{\to} Y_\infty$.
In the following four exercises you find some of the many applications of the monotone convergence theorem.
Exercise 1.3.38. You are now to relax the non-negativity assumption in the monotone convergence theorem.
(a) Show that if $E[(X_1)_-] < \infty$ and $X_n(\omega) \uparrow X(\omega)$ for almost every $\omega$, then $EX_n \uparrow EX$.
(b) Show that if in addition $\sup_n E[(X_n)_+] < \infty$, then $X \in L^1(\Omega, \mathcal{F}, P)$.
Exercise 1.3.39. In this exercise you are to show that for any R.V. $X \ge 0$,
(1.3.10) $\quad EX = \lim_{\epsilon \downarrow 0} E_\epsilon X \quad \text{for} \quad E_\epsilon X = \sum_{j=0}^\infty j\epsilon\, P(\{\omega : j\epsilon < X(\omega) \le (j+1)\epsilon\}).$
First use monotone convergence to show that $E_{\epsilon_k} X$ converges to $EX$ along the sequence $\epsilon_k = 2^{-k}$. Then, check that $E_\epsilon X \le EX \le E_\epsilon X + \epsilon$ and deduce from it (1.3.10). Further, deduce the formula $EX = \sum_i x_i P(\{\omega : X(\omega) = x_i\})$ (this applies to every R.V. $X \ge 0$ on a countable $\Omega$). More generally, verify that such formula applies whenever the series is absolutely convergent (which amounts to $X \in L^1$).
Exercise 1.3.40. Use monotone convergence to show that for any sequence of non-negative R.V. $Y_n$,
\[ E\Big(\sum_{n=1}^\infty Y_n\Big) = \sum_{n=1}^\infty E Y_n. \]
Exercise 1.3.41. Suppose $X_n, X \in L^1(\Omega, \mathcal{F}, P)$ are such that
(a) $X_n \ge 0$ almost surely, $E[X_n] = 1$, $E[X_n \log X_n] \le 1$, and
(b) $E[X_n Y] \to E[XY]$ as $n \to \infty$, for each bounded random variable $Y$ on $(\Omega, \mathcal{F})$.
Show that then $X \ge 0$ almost surely, $E[X] = 1$ and $E[X \log X] \le 1$.
Hint: Jensen's inequality is handy for showing that $E[X \log X] \le 1$.
Next come a few direct applications of the dominated convergence theorem.
Exercise 1.3.42.
(a) Show that for any random variable $X$, the function $t \mapsto E[e^{-|t - X|}]$ is continuous on $\mathbb{R}$ (this function is sometimes called the bilateral exponential transform).
(b) Suppose $X \ge 0$ is such that $EX^q < \infty$ for some $q > 0$. Show that then $q^{-1}(EX^q - 1) \to E\log X$ as $q \downarrow 0$, and deduce that also $q^{-1}\log EX^q \to E\log X$ as $q \downarrow 0$.
Hint: Fixing $x \ge 0$, deduce from convexity of $q \mapsto x^q$ that $q^{-1}(x^q - 1) \downarrow \log x$ as $q \downarrow 0$.
Exercise 1.3.43. Suppose $X$ is an integrable random variable.
(a) Show that $E(|X| I_{\{|X| > n\}}) \to 0$ as $n \to \infty$.
(b) Deduce that for any $\epsilon > 0$ there exists $\delta > 0$ such that
\[ \sup\{E[|X| I_A] : P(A) \le \delta\} \le \epsilon. \]
(c) Provide an example of $X \ge 0$ with $EX = \infty$ for which the preceding fails, that is, $P(A_k) \to 0$ as $k \to \infty$ while $E[X I_{A_k}]$ is bounded away from zero.
The following generalization of the dominated convergence theorem is also left as an exercise.
Exercise 1.3.44. Suppose $g_n(\cdot) \le f_n(\cdot) \le h_n(\cdot)$ are integrable functions in the same measure space $(S, \mathcal{F}, \mu)$ such that for almost-every $s \in S$ both $g_n(s) \to g_\infty(s)$, $f_n(s) \to f_\infty(s)$ and $h_n(s) \to h_\infty(s)$. Show that if $g_\infty$ and $h_\infty$ are integrable, $\mu(g_n) \to \mu(g_\infty)$ and $\mu(h_n) \to \mu(h_\infty)$, then $f_\infty$ is integrable and $\mu(f_n) \to \mu(f_\infty)$.
Finally, here is a demonstration of one of the many issues that are particularly easy to resolve with respect to the $L^2(\Omega, \mathcal{F}, P)$ norm.
Exercise 1.3.45. Let $X = (X(t))_{t \in \mathbb{R}}$ be a mapping from $\mathbb{R}$ into $L^2(\Omega, \mathcal{F}, P)$. Show that $t \mapsto X(t)$ is a continuous mapping (with respect to the norm $\|\cdot\|_2$ on $L^2(\Omega, \mathcal{F}, P)$) if and only if both
\[ \mu(t) = E[X(t)] \quad \text{and} \quad r(s, t) = E[X(s)X(t)] - \mu(s)\mu(t) \]
are continuous real-valued functions ($r(s,t)$ is continuous as a map from $\mathbb{R}^2$ to $\mathbb{R}$).
1.3.4. $L^1$ convergence and uniform integrability. For probability theory, the dominated convergence theorem states that if random variables $X_n \overset{a.s.}{\to} X_\infty$ are such that $|X_n| \le Y$ for all $n$ and some random variable $Y$ such that $EY < \infty$, then $X_\infty \in L^1$ and $X_n \overset{L^1}{\to} X_\infty$. In particular, taking $Y$ a non-random finite constant yields the bounded convergence theorem: if $X_n \overset{a.s.}{\to} X_\infty$ and $|X_n| \le M < \infty$ for all $n$, then $X_\infty \in L^1$ and $X_n \overset{L^1}{\to} X_\infty$.
We next state a uniform integrability condition that together with convergence in probability implies the convergence in $L^1$.
Definition 1.3.47. A possibly uncountable collection of R.V.s $\{X_\alpha, \alpha \in J\}$ is called uniformly integrable (U.I.) if
(1.3.11) $\quad \lim_{M \to \infty} \sup_\alpha E[|X_\alpha| I_{|X_\alpha| > M}] = 0.$
Our next lemma shows that U.I. is a relaxation of the condition of dominated convergence, and that U.I. still implies the boundedness in $L^1$ of $\{X_\alpha, \alpha \in J\}$.
Lemma 1.3.48. If $|X_\alpha| \le Y$ for all $\alpha \in J$ and some R.V. $Y$ such that $EY < \infty$, then the collection $\{X_\alpha\}$ is U.I. Further, if $\{X_\alpha\}$ is U.I. then $\sup_\alpha E|X_\alpha| < \infty$.
Proof. By monotone convergence, $E[Y I_{Y \le M}] \uparrow EY$ as $M \to \infty$, for any R.V. $Y \ge 0$. Hence, if in addition $EY < \infty$, then by linearity of the expectation, $E[Y I_{Y > M}] \to 0$ as $M \to \infty$. Now, if $|X_\alpha| \le Y$ then $|X_\alpha| I_{|X_\alpha| > M} \le Y I_{Y > M}$, hence $E[|X_\alpha| I_{|X_\alpha| > M}] \le E[Y I_{Y > M}]$, which does not depend on $\alpha$, and for $Y \in L^1$ converges to zero when $M \to \infty$. We thus proved that if $|X_\alpha| \le Y$ for all $\alpha$ and $EY < \infty$, then $\{X_\alpha\}$ is U.I. Further, since
\[ E|X_\alpha| = E[|X_\alpha| I_{|X_\alpha| \le M}] + E[|X_\alpha| I_{|X_\alpha| > M}] \le M + \sup_\alpha E[|X_\alpha| I_{|X_\alpha| > M}], \]
we see that if $\{X_\alpha\}$ is U.I. then $\sup_\alpha E|X_\alpha| < \infty$.
We next state and prove Vitali's convergence theorem for probability measures, deferring the general case to Exercise 1.3.53.
Theorem 1.3.49 (Vitali's convergence theorem). Suppose $X_n \overset{p}{\to} X_\infty$. Then, the collection $\{X_n\}$ is U.I. if and only if $X_n \overset{L^1}{\to} X_\infty$, which in turn is equivalent to having $E|X_n| \to E|X_\infty| < \infty$.
Remark. In view of Lemma 1.3.48, Vitali's theorem relaxes both the almost sure convergence and the domination assumptions of the dominated convergence theorem, when dealing with a sequence of R.V. on a probability space.
Proof. Suppose first that $|X_n| \le M$ for some finite non-random constant $M$ and all $n$. Fixing $\varepsilon > 0$, let $B_{n,\varepsilon} = \{\omega : |X_n(\omega) - X_\infty(\omega)| > \varepsilon\}$. Since $P(|X_\infty| \le M + \varepsilon) \ge P(B_{n,\varepsilon}^c)$, taking $n \to \infty$ and considering $\varepsilon = \varepsilon_k \downarrow 0$, we get by continuity from below of $P$ that almost surely $|X_\infty| \le M$. So,
\[ E|X_n - X_\infty| = E[|X_n - X_\infty| I_{B_{n,\varepsilon}^c}] + E[|X_n - X_\infty| I_{B_{n,\varepsilon}}] \le E[\varepsilon I_{B_{n,\varepsilon}^c}] + E[2M I_{B_{n,\varepsilon}}] \le \varepsilon + 2M P(B_{n,\varepsilon}). \]
Since $P(B_{n,\varepsilon}) \to 0$ as $n \to \infty$, it follows that $\limsup_n E|X_n - X_\infty| \le \varepsilon$. Taking $\varepsilon \downarrow 0$ we deduce that $E|X_n - X_\infty| \to 0$ in this case.
Moving to deal now with the general case of a collection $\{X_n\}$ that is U.I., let $\psi_M(x) = \max(\min(x, M), -M)$. As $|\psi_M(x) - \psi_M(y)| \le |x - y|$ for any $x, y \in \mathbb{R}$, our assumption $X_n \overset{p}{\to} X_\infty$ implies that $\psi_M(X_n) \overset{p}{\to} \psi_M(X_\infty)$. Note that $X_\infty \in L^1$, since by Fatou's lemma $E|X_\infty| \le \liminf_n E|X_n|$, which is finite by Lemma 1.3.48. Hence, fixing $\varepsilon > 0$, with $|\psi_M(X_\infty)| \le |X_\infty|$ we can choose $M$ large enough that $E[|X_\infty| I_{|X_\infty| > M}] < \varepsilon$, and further increasing $M$ if needed, by the U.I. condition also $E[|X_n| I_{|X_n| > M}] < \varepsilon$ for all $n$. Considering the expectation of the inequality $|x - \psi_M(x)| \le |x| I_{|x| > M}$ (which holds for all $x \in \mathbb{R}$), with $x = X_n$ and $x = X_\infty$, we obtain that
\[ E|X_n - X_\infty| \le E|X_n - \psi_M(X_n)| + E|\psi_M(X_n) - \psi_M(X_\infty)| + E|X_\infty - \psi_M(X_\infty)| \le 2\varepsilon + E|\psi_M(X_n) - \psi_M(X_\infty)|. \]
Recall that $\psi_M(X_n) \overset{L^1}{\to} \psi_M(X_\infty)$ (by the preceding bounded case), hence $\limsup_n E|X_n - X_\infty| \le 2\varepsilon$. Taking $\varepsilon \downarrow 0$ completes the proof of $L^1$ convergence of $X_n$ to $X_\infty$.
Suppose now that $X_n \overset{L^1}{\to} X_\infty$. Then, by the triangle inequality,
\[ \big| E|X_n| - E|X_\infty| \big| \le E\big[\,\big|\, |X_n| - |X_\infty|\, \big|\,\big] \le E|X_n - X_\infty| \to 0. \]
That is, $E|X_n| \to E|X_\infty|$ and $X_n$, $n \le \infty$, are integrable.
It thus remains only to show that if $X_n \overset{p}{\to} X_\infty$ and $E|X_n| \to E|X_\infty| < \infty$, then the collection $\{X_n\}$ is U.I. To this end, consider the functions
\[ \psi_M(x) = |x| I_{|x| \le M-1} + (M-1)(M - |x|)\, I_{(M-1, M]}(|x|), \]
each a piecewise-linear, continuous, bounded function, such that $\psi_M(x) = |x|$ for $|x| \le M-1$ and $\psi_M(x) = 0$ for $|x| \ge M$. Fixing $\varepsilon > 0$, with $X_\infty$ integrable, by dominated convergence $E|X_\infty| - E\psi_m(X_\infty) \le \varepsilon$ for some $m = m(\varepsilon)$ finite. Since $|\psi_m(x) - \psi_m(y)| \le (m-1)|x - y|$ for any $x, y \in \mathbb{R}$, our assumption $X_n \overset{p}{\to} X_\infty$ implies that $\psi_m(X_n) \overset{p}{\to} \psi_m(X_\infty)$ as $n \to \infty$, hence by the preceding part of the proof $E\psi_m(X_n) \to E\psi_m(X_\infty)$. Since $|x| I_{|x| > m} \le |x| - \psi_m(x)$ for all $x \in \mathbb{R}$, our assumption $E|X_n| \to E|X_\infty|$ implies that for some $n_0 = n_0(\varepsilon)$ finite and all $n \ge n_0$,
\[ E[|X_n| I_{|X_n| > m}] \le E|X_n| - E\psi_m(X_n) \le E|X_\infty| - E\psi_m(X_\infty) + 2\varepsilon \le 3\varepsilon. \]
As each $X_n$ is integrable, $E[|X_n| I_{|X_n| > M}] \le 3\varepsilon$ for some $M \ge m$ finite and all $n$ (including also $n < n_0(\varepsilon)$). The fact that such finite $M = M(\varepsilon)$ exists for any $\varepsilon > 0$ amounts to the collection $\{X_n\}$ being U.I.
The following exercise builds upon the bounded convergence theorem.
Exercise 1.3.50. Show that for any $X \ge 0$ (do not assume $E(1/X) < \infty$), both
(a) $\lim_{y \to \infty} y\, E[X^{-1} I_{X > y}] = 0$, and
(b) $\lim_{y \downarrow 0} y\, E[X^{-1} I_{X > y}] = 0$.
Next is an example of the advantage of Vitali's convergence theorem over the dominated convergence theorem.
Exercise 1.3.51. On $((0,1], \mathcal{B}_{(0,1]}, U)$, let $X_n(\omega) = (n/\log n) I_{(0, n^{-1})}(\omega)$ for $n \ge 2$. Show that the collection $\{X_n\}$ is U.I., such that $X_n \overset{a.s.}{\to} 0$ and $EX_n \to 0$, but there is no random variable $Y$ with finite expectation such that $Y \ge X_n$ for all $n \ge 2$ and almost all $\omega \in (0,1]$.
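The collection of Exercise 1.3.51 is simple enough that the supremum in (1.3.11) can be computed exactly (a sketch of ours): $X_n$ takes the value $n/\log n$ with probability $1/n$, so $E[X_n I_{X_n > M}] = 1/\log n$ when $n/\log n > M$ and $0$ otherwise; since $1/\log n$ decreases in $n$, the supremum sits at the first qualifying $n$.

```python
import math

def ui_tail(M):
    # sup_n E[X_n I_{X_n > M}] = 1/log(n_M), where n_M is the first n >= 2
    # with n/log(n) > M (valid for M above min_n n/log n, e.g. M >= 3)
    n = 2
    while n / math.log(n) <= M:
        n += 1
    return 1.0 / math.log(n)

tails = [ui_tail(M) for M in (10.0, 100.0, 1000.0, 10000.0)]
assert all(a > b for a, b in zip(tails, tails[1:]))   # decreasing in M
assert tails[-1] < 0.12                                # tends to 0: U.I. holds
```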
By a simple application of Vitali's convergence theorem you can derive a classical result of analysis, dealing with the convergence of Cesàro averages.
Exercise 1.3.52. Let $U_n$ denote a random variable whose law is the uniform probability measure on $(0, n]$, namely, Lebesgue measure restricted to the interval $(0, n]$ and normalized by $n^{-1}$ to a probability measure. Show that $g(U_n) \overset{p}{\to} 0$ as $n \to \infty$, for any Borel function $g(\cdot)$ such that $|g(y)| \to 0$ as $y \to \infty$. Further, assuming that also $\sup_y |g(y)| < \infty$, deduce that $E|g(U_n)| = n^{-1} \int_0^n |g(y)|\, dy \to 0$ as $n \to \infty$.
Here is Vitali's convergence theorem for a general measure space.
Exercise 1.3.53. Given a measure space $(S, \mathcal{F}, \mu)$, suppose $f_n, f_\infty \in m\mathcal{F}$ with $\mu(|f_n|)$ finite and $\mu(f_n I_A) \to \mu(f_\infty I_A)$ as $n \to \infty$, for any measurable set $A$.
1.3.5. Expectation, density and Riemann integral. Applying the standard machine we now show that, fixing a measure space $(S, \mathcal{F}, \mu)$, each non-negative measurable function $f$ induces a measure $f\mu$ on $(S, \mathcal{F})$, with $f$ being the natural generalization of the concept of probability density function.
Proposition 1.3.56. Fix a measure space $(S, \mathcal{F}, \mu)$. Every $f \in m\mathcal{F}_+$ induces a measure $f\mu$ on $(S, \mathcal{F})$ via $(f\mu)(A) = \mu(f I_A)$ for all $A \in \mathcal{F}$. These measures satisfy the composition relation $h(f\mu) = (hf)\mu$ for all $f, h \in m\mathcal{F}_+$. Further, $h \in L^1(S, \mathcal{F}, f\mu)$ if and only if $fh \in L^1(S, \mathcal{F}, \mu)$, and then $(f\mu)(h) = \mu(fh)$.
Proof. Fixing $f \in m\mathcal{F}_+$, obviously $f\mu$ is a non-negative set function on $(S, \mathcal{F})$ with $(f\mu)(\emptyset) = \mu(f I_\emptyset) = \mu(0) = 0$. Further, if $A$ is the union of countably many disjoint sets $A_k \in \mathcal{F}$, then $\sum_{k=1}^n f I_{A_k} \uparrow f I_A$, so it follows by monotone convergence and linearity of the integral that
\[ \mu(f I_A) = \lim_{n\to\infty} \mu\Big(\sum_{k=1}^n f I_{A_k}\Big) = \lim_{n\to\infty} \sum_{k=1}^n \mu(f I_{A_k}) = \sum_k \mu(f I_{A_k}). \]
Thus, $(f\mu)(A) = \sum_k (f\mu)(A_k)$, verifying that $f\mu$ is a measure.
Fixing $f \in m\mathcal{F}_+$, we turn to prove that the identity
(1.3.13) $\quad (f\mu)(h I_A) = \mu(f h I_A) \quad \forall A \in \mathcal{F},$
holds for any $h \in m\mathcal{F}_+$. Since the left side of (1.3.13) is the value assigned to $A$ by the measure $h(f\mu)$, and the right side of this identity is the value assigned to the same set by the measure $(hf)\mu$, this would verify the stated composition rule $h(f\mu) = (hf)\mu$. The proof of (1.3.13) proceeds by applying the standard machine:
Step 1. If $h = I_B$ for $B \in \mathcal{F}$ we have, by the definition of the integral of an indicator function, that
\[ (f\mu)(I_B I_A) = (f\mu)(I_{A \cap B}) = (f\mu)(A \cap B) = \mu(f I_{A \cap B}) = \mu(f I_B I_A), \]
which is (1.3.13).
Step 2. Take $h \in {\rm SF}_+$ represented as $h = \sum_{l=1}^n c_l I_{B_l}$ with $c_l \ge 0$ and $B_l \in \mathcal{F}$. Then, by Step 1 and the linearity of the integrals with respect to $f\mu$ and with respect to $\mu$, we see that
\[ (f\mu)(h I_A) = \sum_{l=1}^n c_l (f\mu)(I_{B_l} I_A) = \sum_{l=1}^n c_l\, \mu(f I_{B_l} I_A) = \mu\Big(f \sum_{l=1}^n c_l I_{B_l} I_A\Big) = \mu(f h I_A), \]
again yielding (1.3.13).
Step 3. For any $h \in m\mathcal{F}_+$ there exist $h_n \in {\rm SF}_+$ such that $h_n \uparrow h$. By Step 2 we know that $(f\mu)(h_n I_A) = \mu(f h_n I_A)$ for any $A \in \mathcal{F}$ and all $n$. Further, $h_n I_A \uparrow h I_A$ and $f h_n I_A \uparrow f h I_A$, so by monotone convergence (for both integrals, with respect to $f\mu$ and to $\mu$),
\[ (f\mu)(h I_A) = \lim_{n\to\infty} (f\mu)(h_n I_A) = \lim_{n\to\infty} \mu(f h_n I_A) = \mu(f h I_A), \]
completing the proof of (1.3.13) for all $h \in m\mathcal{F}_+$.
Writing $h \in m\mathcal{F}$ as $h = h_+ - h_-$ with $h_+ = \max(h, 0) \in m\mathcal{F}_+$ and $h_- = -\min(h, 0) \in m\mathcal{F}_+$, it follows from the composition rule that
\[ \int h_\pm\, d(f\mu) = (f\mu)(h_\pm I_S) = \big(h_\pm(f\mu)\big)(S) = \big((h_\pm f)\mu\big)(S) = \mu(f h_\pm I_S) = \int f h_\pm\, d\mu. \]
Observing that $f h_\pm = (fh)_\pm$ when $f \in m\mathcal{F}_+$, we thus deduce that $h$ is $f\mu$-integrable if and only if $fh$ is $\mu$-integrable, in which case $\int h\, d(f\mu) = \int f h\, d\mu$, as stated.
Fixing a measure space $(S, \mathcal{F}, \mu)$, every set $D \in \mathcal{F}$ induces a $\sigma$-algebra $\mathcal{F}_D = \{A \in \mathcal{F} : A \subseteq D\}$. Let $\mu_D$ denote the restriction of $\mu$ to $(D, \mathcal{F}_D)$. As a corollary of Proposition 1.3.56 we express the integral with respect to $\mu_D$ in terms of the original measure $\mu$.
Corollary 1.3.57. Fixing $D \in \mathcal{F}$ let $h_D$ denote the restriction of $h \in m\mathcal{F}$ to $(D, \mathcal{F}_D)$. Then, $\mu_D(h_D) = \mu(h I_D)$ for any $h \in m\mathcal{F}_+$. Further, $h_D \in L^1(D, \mathcal{F}_D, \mu_D)$ if and only if $h I_D \in L^1(S, \mathcal{F}, \mu)$, in which case also $\mu_D(h_D) = \mu(h I_D)$.
Proof. Note that the measure $I_D \mu$ of Proposition 1.3.56 coincides with $\mu_D$ on the $\sigma$-algebra $\mathcal{F}_D$, and assigns to any set $A \in \mathcal{F}$ the same value it assigns to $A \cap D \in \mathcal{F}_D$. By Definition 1.3.1 this implies that $(I_D \mu)(h) = \mu_D(h_D)$ for any $h \in m\mathcal{F}_+$. The corollary is thus a restatement of the composition and integrability relations of Proposition 1.3.56 for $f = I_D$.
Remark 1.3.58. Corollary 1.3.57 justifies using hereafter the notation $\int_A f\, d\mu$ or $\mu(f; A)$ for $\mu(f I_A)$, or writing $E(X; A) = \int_A X(\omega)\, dP(\omega)$ for $E(X I_A)$. With this notation in place, Proposition 1.3.56 states that each $Z \ge 0$ such that $EZ = 1$ induces a probability measure $Q = ZP$ such that $Q(A) = \int_A Z\, dP$ for all $A \in \mathcal{F}$, and then $E_Q(W) := \int W\, dQ = E(ZW)$ whenever $W \ge 0$ or $ZW \in L^1(\Omega, \mathcal{F}, P)$ (the assumption $EZ = 1$ translates to $Q(\Omega) = 1$).
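The identity $E_Q(W) = E(ZW)$ is also the basis of importance sampling. Here is a minimal sketch (our example, not from the notes) with $P$ the uniform law on $(0,1)$, $Z(\omega) = 2\omega$ (so $EZ = 1$ and $Q$ has density $2x$), and $W(\omega) = \omega$, for which $E_Q(W) = \int_0^1 x \cdot 2x\, dx = 2/3$:

```python
import random

random.seed(7)
N = 200000
samples = [random.random() for _ in range(N)]    # draws from P = U(0, 1)

EZ = sum(2.0 * w for w in samples) / N           # Monte Carlo estimate of E Z
EQ_W = sum(2.0 * w * w for w in samples) / N     # estimate of E(Z W) = E_Q(W)

assert abs(EZ - 1.0) < 0.01
assert abs(EQ_W - 2.0 / 3.0) < 0.01
```

Expectations under $Q$ are thus estimated using only samples drawn under $P$, reweighted by the density $Z$.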
Proposition 1.3.56 is closely related to the probability density function of Definition 1.2.39. En route to showing this, we first define the collection of Lebesgue integrable functions.
Definition 1.3.59. Consider Lebesgue's measure $\lambda$ on $(\mathbb{R}, \mathcal{B})$ as in Section 1.1.3, and its completion $\overline{\lambda}$ on $(\mathbb{R}, \overline{\mathcal{B}})$ (see Theorem 1.1.35). A set $B \in \overline{\mathcal{B}}$ is called Lebesgue measurable, and $f : \mathbb{R} \to \mathbb{R}$ is called a Lebesgue integrable function if $f \in m\overline{\mathcal{B}}$ and $\overline{\lambda}(|f|) < \infty$. As we show in Proposition 1.3.64, any non-negative Riemann integrable function is also Lebesgue integrable, and the integral values coincide, justifying the notation $\int_B f(x)\, dx$ for $\overline{\lambda}(f; B)$, where the function $f$ and the set $B$ are both Lebesgue measurable.
Example 1.3.60. Suppose $f$ is a non-negative Lebesgue integrable function such that $\int_{\mathbb{R}} f(x)\, dx = 1$. Then, $\mathcal{P} = f\overline{\lambda}$ of Proposition 1.3.56 is a probability measure on $(\mathbb{R}, \overline{\mathcal{B}})$ such that $\mathcal{P}(B) = \overline{\lambda}(f; B) = \int_B f(x)\, dx$ for any Lebesgue measurable set $B$. By Theorem 1.2.36 it is easy to verify that $F(\alpha) = \mathcal{P}((-\infty, \alpha])$ is a distribution function, such that $F(\alpha) = \int_{-\infty}^{\alpha} f(x)\, dx$; that is, $f$ is a density of the probability measure $\mathcal{P}$.
Theorem 1.3.61. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ with law $\mathcal{P}_X$. Then, for any Borel function $h$ such that either $h \ge 0$ or $E|h(X)| < \infty$,
(1.3.14) $\quad E\, h(X) = \int_\Omega h(X(\omega))\, dP(\omega) = \int_{\mathbb{R}} h(x)\, d\mathcal{P}_X(x).$
Proof. Apply the standard machine with respect to $h \in m\mathcal{B}$:
Step 1. Taking $h = I_B$ for $B \in \mathcal{B}$, note that by the definition of expectation of indicators,
\[ E\, h(X) = E[I_B(X(\omega))] = P(\{\omega : X(\omega) \in B\}) = \mathcal{P}_X(B) = \int h(x)\, d\mathcal{P}_X(x). \]
Step 2. Representing $h \in {\rm SF}_+$ as $h = \sum_{l=1}^m c_l I_{B_l}$ for $c_l \ge 0$, the identity (1.3.14) follows from Step 1 by the linearity of the expectation in both spaces.
Step 3. For $h \in m\mathcal{B}_+$, consider $h_n \in {\rm SF}_+$ such that $h_n \uparrow h$. Since $h_n(X(\omega)) \uparrow h(X(\omega))$ for all $\omega$, we get by monotone convergence on $(\Omega, \mathcal{F}, P)$, followed by applying Step 2 for $h_n$, and finally monotone convergence on $(\mathbb{R}, \mathcal{B}, \mathcal{P}_X)$, that
\[ \int_\Omega h(X(\omega))\, dP(\omega) = \lim_{n\to\infty} \int_\Omega h_n(X(\omega))\, dP(\omega) = \lim_{n\to\infty} \int_{\mathbb{R}} h_n(x)\, d\mathcal{P}_X(x) = \int_{\mathbb{R}} h(x)\, d\mathcal{P}_X(x), \]
as claimed.
Step 4. Write a Borel function h(x) as h_+(x) − h_−(x) with h_± ∈ mB_+, and apply Step 3 to each of h_±; when E|h(X)| < ∞, the identity (1.3.14) for h then follows by linearity of the expectation.

Recall that a bounded function f is Riemann integrable on (a, b], with Riemann integral R(f), if for each ε > 0 there exists δ > 0 such that |Σ_l f(x_l) λ(J_l) − R(f)| < ε, for any x_l ∈ J_l and {J_l} a finite partition of (a, b] into disjoint subintervals whose length λ(J_l) ≤ δ.
Lebesgue's integral of a function f is based on splitting its range into small intervals and approximating f by a constant on the subset of S for which f falls into each such interval. As such, it accommodates an arbitrary domain S of the function, in contrast to Riemann's integral, where the domain of integration is split into small rectangles and is hence limited to R^d. As we next show, even for S = (a, b], if f ≥ 0 (or more generally, f bounded) is Riemann integrable, then it is also Lebesgue integrable, with the integrals coinciding in value.
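The contrast between the two constructions can be seen numerically; the following sketch (not part of the text, with f(x) = x² on (0, 1] chosen for illustration) approximates the same integral once by partitioning the domain (Riemann) and once by partitioning the range into levels (Lebesgue style), both converging to 1/3.

```python
# Sketch: Riemann sum (partition the domain) vs. a Lebesgue-style sum
# (partition the range into levels) for f(x) = x^2 on (0, 1].
# Both approximations approach the common value 1/3.

def riemann_sum(f, a, b, n):
    # right-endpoint Riemann sum over n equal subintervals of (a, b]
    h = (b - a) / n
    return sum(f(a + (i + 1) * h) * h for i in range(n))

def lebesgue_sum(f, a, b, n_levels, n_grid=100000):
    # assumes f maps (a, b] into [0, 1]; approximate f(x) by the midpoint
    # of the range-level it falls into, weighted by Lebesgue measure h
    h = (b - a) / n_grid
    total = 0.0
    for i in range(n_grid):
        x = a + (i + 1) * h
        k = min(int(f(x) * n_levels), n_levels - 1)  # level index
        total += ((k + 0.5) / n_levels) * h
    return total

f = lambda x: x * x
r = riemann_sum(f, 0.0, 1.0, 10000)
l = lebesgue_sum(f, 0.0, 1.0, 1000)
print(r, l)  # both close to 1/3
```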
Proposition 1.3.64. If f(x) is a non-negative Riemann integrable function on an interval (a, b], then it is also Lebesgue integrable on (a, b] and λ̄(f) = R(f).
Proof. Let f^*(J) = sup{f(x) : x ∈ J} and f_*(J) = inf{f(x) : x ∈ J}. Varying x_l over J_l we see that

(1.3.15)    R(f) − ε ≤ Σ_l f_*(J_l) λ(J_l) ≤ Σ_l f^*(J_l) λ(J_l) ≤ R(f) + ε,

for any finite partition of (a, b] into disjoint subintervals J_l such that sup_l λ(J_l) ≤ δ. For any such partition Π, the non-negative simple functions ℓ(Π) = Σ_l f_*(J_l) I_{J_l} and u(Π) = Σ_l f^*(J_l) I_{J_l} are such that ℓ(Π) ≤ f ≤ u(Π), whereas R(f) − ε ≤ λ(ℓ(Π)) ≤ λ(u(Π)) ≤ R(f) + ε, by (1.3.15). Consider the dyadic partitions Π_n of (a, b] into 2^n intervals of length (b − a)2^{−n} each, such that Π_{n+1} is a refinement of Π_n for each n = 1, 2, .... Note that u(Π_n)(x) ≥ u(Π_{n+1})(x) for all x ∈ (a, b] and any n, hence u(Π_n)(x) ↓ u_∞(x) and, similarly, ℓ(Π_n)(x) ↑ ℓ_∞(x), where ℓ_∞ ≤ f ≤ u_∞ are also Borel measurable, and by the monotonicity of Lebesgue's integral,

R(f) ≤ lim_n λ(ℓ(Π_n)) ≤ λ(ℓ_∞) ≤ λ(u_∞) ≤ lim_n λ(u(Π_n)) ≤ R(f).

We deduce that λ(u_∞) = λ(ℓ_∞) = R(f) for ℓ_∞ ≤ f ≤ u_∞, hence the set {x : u_∞(x) > ℓ_∞(x)} has Lebesgue measure zero (see Lemma 1.3.8). Consequently, f is Lebesgue measurable on (a, b] with λ̄(f) = λ(ℓ_∞) = R(f), as stated.
Here is an alternative, direct proof of the fact that Q in Remark 1.3.58 is a
probability measure.
Exercise 1.3.65. Suppose E|X| < ∞ and A = ∪_n A_n for some disjoint sets A_n ∈ F.
(a) Show that then Σ_n E(X; A_n) = E(X; A), that is, the sum converges absolutely and has the value on the right.
(b) Deduce from this that for Z ≥ 0 with EZ positive and finite, Q(A) := E[Z I_A]/EZ is a probability measure.
(c) Suppose that X and Y are non-negative random variables on the same probability space (Ω, F, P) such that EX = EY < ∞. Deduce from the preceding that if E[X I_A] = E[Y I_A] for any A in a π-system A such that F = σ(A), then X a.s.= Y.
Exercise 1.3.66. Suppose P is a probability measure on (R, B) and f ≥ 0 is a Borel function such that P(B) = ∫_B f(x) dx for B = (−∞, b], b ∈ R. Using the π-λ theorem, show that this identity applies for all B ∈ B. Building on this result, use the standard machine to directly prove Corollary 1.3.62 (without Proposition 1.3.56).
1.3.6. Mean, variance and moments. We start with the definition of moments of a random variable.
Definition 1.3.67. If k is a positive integer then EX^k is called the k-th moment of X. When it is well defined, the first moment m_X = EX is called the mean. If EX² < ∞, then the variance of X is defined to be

(1.3.16)    Var(X) = E(X − m_X)² = EX² − m_X² ≤ EX².
Since E(aX + b) = aEX + b (linearity of the expectation), it follows from the definition that

(1.3.17)    Var(aX + b) = E(aX + b − E(aX + b))² = a² E(X − m_X)² = a² Var(X).
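The identity (1.3.16) and the scaling rule (1.3.17) are easy to sanity-check numerically; here is a small sketch (not part of the text), using an arbitrary three-atom law chosen for illustration:

```python
# Sketch: verify Var(X) = EX^2 - (EX)^2 and Var(aX + b) = a^2 Var(X)
# for a discrete random variable with a handful of atoms.

def mean(pmf):
    # pmf: list of (value, probability) pairs summing to 1
    return sum(x * p for x, p in pmf)

def var(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf)

pmf = [(0, 0.2), (1, 0.5), (3, 0.3)]
m, v = mean(pmf), var(pmf)
# (1.3.16): Var(X) = EX^2 - m_X^2
assert abs(v - (sum(x * x * p for x, p in pmf) - m * m)) < 1e-12
# (1.3.17): Var(aX + b) = a^2 Var(X)
a, b = -2.0, 7.0
pmf_ab = [(a * x + b, p) for x, p in pmf]
assert abs(var(pmf_ab) - a * a * v) < 1e-12
print("Var(X) =", v)
```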
We turn to some examples, starting with random variables having a density.
Example 1.3.68. If X has the exponential distribution then

EX^k = ∫_0^∞ x^k e^{−x} dx = k!

for any k (see Example 1.2.40 for its density). The mean of X is m_X = 1 and its variance is EX² − (EX)² = 1. For any λ > 0, it is easy to see that T = X/λ has density f_T(t) = λ e^{−λt} 1_{t>0}, called the exponential density of parameter λ. By (1.3.17) it follows that m_T = 1/λ and Var(T) = 1/λ².
Similarly, if X has a standard normal distribution, then by symmetry, for k odd,

EX^k = (1/√(2π)) ∫ x^k e^{−x²/2} dx = 0,

whereas by integration by parts, the even moments satisfy the relation

(1.3.18)    EX^{2ℓ} = (1/√(2π)) ∫ x^{2ℓ−1} · x e^{−x²/2} dx = (2ℓ − 1) EX^{2ℓ−2},

for ℓ = 1, 2, .... In particular,

Var(X) = EX² = 1.

Consider G = σX + μ, where σ > 0 and μ ∈ R, whose density is

f_G(y) = (1/√(2πσ²)) e^{−(y−μ)²/(2σ²)}.

We call the law of G the normal distribution of mean μ and variance σ² (as EG = μ and Var(G) = σ²).
Next are some examples of random variables with a finite or countable set of possible values.
Example 1.3.69. We say that B has a Bernoulli distribution of parameter p ∈ [0, 1] if P(B = 1) = 1 − P(B = 0) = p. Clearly,

EB = p · 1 + (1 − p) · 0 = p.

Further, B² = B, so EB² = EB = p and

Var(B) = EB² − (EB)² = p − p² = p(1 − p).

Recall that N has a Poisson distribution with parameter λ ≥ 0 if

P(N = k) = (λ^k / k!) e^{−λ}  for k = 0, 1, 2, ...

(where in case λ = 0, P(N = 0) = 1). Observe that for k = 1, 2, ...,

E(N(N − 1) ⋯ (N − k + 1)) = Σ_{n=k}^{∞} n(n − 1) ⋯ (n − k + 1) (λ^n / n!) e^{−λ} = λ^k Σ_{n=k}^{∞} (λ^{n−k} / (n − k)!) e^{−λ} = λ^k.

Using this formula, it follows that EN = λ while

Var(N) = EN² − (EN)² = λ.

The random variable Z is said to have a Geometric distribution of success probability p ∈ (0, 1) if

P(Z = k) = p(1 − p)^{k−1}  for k = 1, 2, ...

This is the distribution of the number of independent coin tosses needed till the first appearance of a Head, or more generally, the number of independent trials till the first occurrence in this sequence of a specific event whose probability is p. Then,

EZ = Σ_{k=1}^{∞} k p(1 − p)^{k−1} = 1/p,
EZ(Z − 1) = Σ_{k=2}^{∞} k(k − 1) p(1 − p)^{k−1} = 2(1 − p)/p²,
Var(Z) = EZ(Z − 1) + EZ − (EZ)² = (1 − p)/p².
Exercise 1.3.70. Consider a counting random variable N_n = Σ_{i=1}^{n} I_{A_i}.
(a) Provide a formula for Var(N_n) in terms of P(A_i) and P(A_i ∩ A_j) for i ≠ j.
(b) Using your formula, find the variance of the number N_n of empty boxes when distributing at random r distinct balls among n distinct boxes, where each of the possible n^r assignments of balls to boxes is equally likely.
Exercise 1.3.71. Show that if P(X ∈ [a, b]) = 1, then Var(X) ≤ (b − a)²/4.
1.4. Independence and product measures

In Subsection 1.4.1 we build up the notion of independence, from events to random variables via σ-algebras, relating it to the structure of the joint distribution function. Subsection 1.4.2 considers finite product measures associated with the joint law of independent R.V.-s. This is followed by Kolmogorov's extension theorem, which we use in order to construct infinitely many independent R.V.-s. Subsection 1.4.3 is about Fubini's theorem and its applications for computing the expectation of functions of independent R.V.-s.
1.4.1. Definition and conditions for independence. Recall the classical definition that two events A, B ∈ F are independent if P(A ∩ B) = P(A)P(B). For example, suppose two fair dice are thrown (i.e. Ω = {1, 2, 3, 4, 5, 6}² with F = 2^Ω and the uniform probability measure P). As we show next, if the events A and B are independent, then so are the σ-algebras σ(A) = {∅, Ω, A, A^c} and σ(B) = {∅, Ω, B, B^c}. Since P(∅) = 0 and ∅ is invariant under intersections, whereas P(Ω) = 1 and all events are invariant under intersection with Ω, it suffices to consider G ∈ {A, A^c} and H ∈ {B, B^c}. We check independence first for G = A and H = B^c. Noting that A is the union of the disjoint events A ∩ B and A ∩ B^c, we have that

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A)[1 − P(B)] = P(A)P(B^c),

where the middle equality is due to the assumed independence of A and B. The proof for all other choices of G and H is very similar.
More generally, we define the mutual independence of events as follows.

Definition 1.4.2. Events A_i ∈ F are P-mutually independent if for any L < ∞ and distinct indices i_1, i_2, ..., i_L,

P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_L}) = Π_{k=1}^{L} P(A_{i_k}).
We next generalize the definition of mutual independence to σ-algebras, random variables and beyond. This definition applies to the mutual independence of both a finite and an infinite number of such objects.

Definition 1.4.3. We say that the collections of events A_α ⊆ F with α ∈ Γ (possibly an infinite index set) are P-mutually independent if for any L < ∞ and distinct α_1, α_2, ..., α_L ∈ Γ,

P(A_{α_1} ∩ A_{α_2} ∩ ⋯ ∩ A_{α_L}) = Π_{k=1}^{L} P(A_{α_k}),  ∀ A_{α_k} ∈ A_{α_k}, k = 1, ..., L.

We say that random variables X_α, α ∈ Γ, are P-mutually independent if the σ-algebras σ(X_α), α ∈ Γ, are P-mutually independent.

As the next theorem shows, mutual independence of π-systems is inherited by the σ-algebras they generate.

Theorem 1.4.4. If π-systems of events A_1, ..., A_n are P-mutually independent, then the σ-algebras G_i = σ(A_i), i = 1, ..., n, are P-mutually independent.

Proof. Fixing distinct indices i_1, ..., i_L from {1, ..., n − 1} and events A_{i_k} ∈ A_{i_k}, let H = A_{i_1} ∩ ⋯ ∩ A_{i_L} and consider the finite measures μ_1(G) = P(G ∩ H) and μ_2(G) = P(G)P(H) on (Ω, F). Then,

μ_1(Ω) = P(Ω ∩ H) = P(H) = P(H)P(Ω) = μ_2(Ω).

If A ∈ A_n, then by the mutual independence of A_i, i = 1, ..., n, it follows that

μ_1(A) = P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_L} ∩ A) = (Π_{k=1}^{L} P(A_{i_k})) P(A) = P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_L}) P(A) = μ_2(A).

Since the finite measures μ_1(·) and μ_2(·) agree on the π-system A_n and on Ω, it follows that μ_1 = μ_2 on G_n = σ(A_n) (see Proposition 1.1.39). That is, P(G ∩ H) = P(G)P(H) for any G ∈ G_n.
Since this applies for arbitrary A_i ∈ A_i, i = 1, ..., n − 1, in view of Definition 1.4.3 we have just proved that if A_1, A_2, ..., A_n are mutually independent, then A_1, A_2, ..., A_{n−1}, G_n are mutually independent. Applying the latter relation for G_n, A_1, ..., A_{n−1} (which are mutually independent, since Definition 1.4.3 is invariant to a permutation of the order of the collections), we get that G_n, A_1, ..., A_{n−2}, G_{n−1} are mutually independent. After n such iterations we have the stated result.
Because the mutual independence of the collections of events A_α, α ∈ Γ, amounts to the mutual independence of any finite number of these collections, we have the immediate consequence:

Corollary 1.4.5. If π-systems of events A_α, α ∈ Γ, are P-mutually independent, then the σ-algebras G_α = σ(A_α), α ∈ Γ, are also P-mutually independent.

Corollary 1.4.6. If the σ-algebras H_{α,j}, j ∈ J_α, α ∈ Γ, are P-mutually independent, then so are the σ-algebras G_α = σ(∪_{j∈J_α} H_{α,j}), α ∈ Γ.

Proof. Let A_α denote the π-system consisting of all finite intersections of sets from the collections H_{α,j}, j ∈ J_α. Since each element of A_{α_k}, k = 1, ..., L, is merely a finite intersection of sets from distinct collections H_{α_k,j}, the assumed mutual independence of the H_{α,j} implies the mutual independence of the A_α. By Corollary 1.4.5, this in turn implies the mutual independence of the σ(A_α), implying that of the G_α, since G_α ⊆ σ(A_α).
Relying on the preceding corollary, you can now establish the following characterization of independence (which is key to proving Kolmogorov's 0-1 law).
Exercise 1.4.7. Show that if for each n ≥ 1 the σ-algebras F_n^X = σ(X_1, ..., X_n) and σ(X_{n+1}) are P-mutually independent, then the random variables X_1, X_2, X_3, ... are P-mutually independent. Conversely, show that if X_1, X_2, X_3, ... are independent, then for each n ≥ 1 the σ-algebras F_n^X and T_n^X = σ(X_r, r > n) are independent.
It is easy to check that a P-trivial σ-algebra H is P-independent of any other σ-algebra G ⊆ F. Conversely, as we show next, independence is a great tool for proving that a σ-algebra is P-trivial.

Lemma 1.4.8. If each of the σ-algebras G_k ⊆ G_{k+1} is P-independent of a σ-algebra H ⊆ σ(∪_{k≥1} G_k), then H is P-trivial.
Remark. In particular, if H is P-independent of itself, then H is P-trivial.
Proof. Since G_k ⊆ G_{k+1} for all k and the G_k are σ-algebras, it follows that A = ∪_{k≥1} G_k is a π-system. The assumed P-independence of H and G_k for each k yields the P-independence of H and A. Thus, by Theorem 1.4.4 we have that H and σ(A) are P-independent. Since H ⊆ σ(A), it follows that in particular P(H) = P(H ∩ H) = P(H) · P(H) for each H ∈ H. So, necessarily P(H) ∈ {0, 1} for all H ∈ H. That is, H is P-trivial.
We next define the tail σ-algebra of a stochastic process.

Definition 1.4.9. For a stochastic process {X_k} we set T_n^X = σ(X_r, r > n) and call T^X = ∩_n T_n^X the tail σ-algebra of the process {X_k}.
As we next see, the P-triviality of the tail σ-algebra of independent random variables is an immediate consequence of Lemma 1.4.8. This result, due to Kolmogorov, is just one of the many 0-1 laws that exist in probability theory.

Corollary 1.4.10 (Kolmogorov's 0-1 law). If {X_k} are P-mutually independent, then the corresponding tail σ-algebra T^X is P-trivial.
Proof. Note that F_k^X ⊆ F_{k+1}^X and T^X ⊆ T_0^X = σ(X_k, k ≥ 1) = σ(∪_{k≥1} F_k^X) (see Exercise 1.2.14 for the latter identity). Further, recall from Exercise 1.4.7 that for any n ≥ 1, the σ-algebras F_n^X and T_n^X are P-mutually independent. Hence, each of the σ-algebras F_k^X is also P-mutually independent of the tail σ-algebra T^X ⊆ T_k^X, which by Lemma 1.4.8 is thus P-trivial.
Out of Corollary 1.4.6 we deduce that functions of disjoint collections of mutually independent random variables are mutually independent.

Corollary 1.4.11. If the R.V.-s X_{k,j}, 1 ≤ k ≤ m, 1 ≤ j ≤ l(k), are mutually independent and f_k : R^{l(k)} → R are Borel functions, then Y_k = f_k(X_{k,1}, ..., X_{k,l(k)}) are mutually independent random variables for k = 1, ..., m.
Proof. We apply Corollary 1.4.6 for the index set Γ = {(k, j) : 1 ≤ k ≤ m, 1 ≤ j ≤ l(k)} and the mutually independent σ-algebras H_{k,j} = σ(X_{k,j}), to deduce the mutual independence of G_k = σ(∪_j H_{k,j}). Recall that G_k = σ(X_{k,j}, 1 ≤ j ≤ l(k)) and σ(Y_k) ⊆ G_k (see Definition 1.2.12 and Exercise 1.2.32). We complete the proof by noting that the Y_k are mutually independent if and only if the σ(Y_k) are mutually independent.
Our next result is an application of Theorem 1.4.4 to the independence of random
variables.
Corollary 1.4.12. Real-valued random variables X_1, X_2, ..., X_m on the same probability space (Ω, F, P) are mutually independent if and only if

(1.4.1)    P(X_1 ≤ x_1, ..., X_m ≤ x_m) = Π_{i=1}^{m} P(X_i ≤ x_i),  ∀ x_1, ..., x_m ∈ R.
Proof. Let A_i denote the collection of subsets of Ω of the form X_i^{−1}((−∞, b]) for b ∈ R. Recall that A_i generates σ(X_i) (see Exercise 1.2.11), whereas (1.4.1) states that the π-systems A_i are mutually independent (by continuity from below of P, taking x_i ↑ ∞ for i ∉ {i_1, i_2, ..., i_L} has the same effect as taking a subset of distinct indices i_1, ..., i_L from {1, ..., m}). So, just apply Theorem 1.4.4 to conclude the proof.
The condition (1.4.1) for mutual independence of R.V.-s is further simplified when these variables are either discrete valued or have a density.
Exercise 1.4.13. Suppose (X_1, ..., X_m) are random variables and (S_1, ..., S_m) are countable sets such that P(X_i ∈ S_i) = 1 for i = 1, ..., m. Show that if

P(X_1 = x_1, ..., X_m = x_m) = Π_{i=1}^{m} P(X_i = x_i)

whenever x_i ∈ S_i, i = 1, ..., m, then X_1, ..., X_m are mutually independent.
Exercise 1.4.14. Suppose the random vector X = (X_1, ..., X_m) has a joint probability density function f_X(x) = g_1(x_1) ⋯ g_m(x_m). That is,

P((X_1, ..., X_m) ∈ A) = ∫_A g_1(x_1) ⋯ g_m(x_m) dx_1 ⋯ dx_m,  ∀ A ∈ B_{R^m},

where the g_i are non-negative, Lebesgue integrable functions. Show that then X_1, ..., X_m are mutually independent.
Beware that pairwise independence (of each pair A_k, A_j for k ≠ j) does not imply mutual independence of all the events in question, and the same applies to three or more random variables. Here is an illustrating example.
Exercise 1.4.15. Consider the sample space Ω = {0, 1, 2}² with a probability measure on (Ω, 2^Ω) ...

Exercise. For s > 1, let ζ(s) = Σ_{k=1}^{∞} k^{−s}. Fixing such s, let X and Y be independent random variables with P(X = k) = P(Y = k) = k^{−s}/ζ(s) for k = 1, 2, ....
(a) Show that the events D_p = {X is divisible by p}, with p a prime number, are P-mutually independent.
(b) By considering the event {X = 1}, provide a probabilistic explanation of Euler's formula 1/ζ(s) = Π_p (1 − 1/p^s).
(c) Show that the probability that no perfect square other than 1 divides X is precisely 1/ζ(2s).
(d) Show that P(G = k) = k^{−2s}/ζ(2s), where G is the greatest common divisor of X and Y.
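The independence claim in part (a) rests on the identity Σ_{k : p | k} k^{−s} = p^{−s} ζ(s), so that P(D_p) = p^{−s} and P(D_p ∩ D_q) = (pq)^{−s} = P(D_p)P(D_q) for distinct primes p, q. The sketch below (not part of the text, s = 2 and the truncation level chosen arbitrarily) checks this numerically.

```python
# Sketch: with P(X = k) = k^{-s}/zeta(s), the event D_p = {p divides X}
# has probability p^{-s}, and D_2, D_3 are independent, since summing
# k^{-s} over multiples of p gives p^{-s} * zeta(s). Checked by truncation.

s = 2.0
K = 200000
zeta = sum(k ** -s for k in range(1, K))

def prob_divisible(*primes):
    # P(X divisible by every p in primes), via the truncated series
    m = 1
    for p in primes:
        m *= p
    return sum(k ** -s for k in range(m, K, m)) / zeta

p2, p3, p23 = prob_divisible(2), prob_divisible(3), prob_divisible(2, 3)
print(p2, p3, p23)  # near 1/4, 1/9, 1/36
```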
1.4.2. Product measures and Kolmogorov's theorem. Recall Example 1.1.20 that given two measurable spaces (Ω_1, F_1) and (Ω_2, F_2), the product (measurable) space (Ω, F) consists of Ω = Ω_1 × Ω_2 and F = F_1 × F_2, which is the same as F = σ(A) for

A = { ∪_{j=1}^{m} A_j × B_j : A_j ∈ F_1, B_j ∈ F_2, m < ∞ },

where throughout, the unions are over disjoint rectangles.

Theorem 1.4.19. Given two σ-finite measure spaces (Ω_1, F_1, μ_1) and (Ω_2, F_2, μ_2), there exists a unique σ-finite measure μ_1 × μ_2 on (Ω, F) such that

μ_1 × μ_2 ( ∪_{j=1}^{m} A_j × B_j ) = Σ_{j=1}^{m} μ_1(A_j) μ_2(B_j),  ∀ A_j ∈ F_1, B_j ∈ F_2, m < ∞.

We denote μ = μ_1 × μ_2 and call it the product of the measures μ_1 and μ_2.
Proof. By Carathéodory's extension theorem, it suffices to show that A is an algebra on which μ is countably additive (see Theorem 1.1.30 for the case of finite measures). To this end, note that Ω = Ω_1 × Ω_2 ∈ A. Further, A is closed under intersections, since

( ∪_{j=1}^{m} A_j × B_j ) ∩ ( ∪_{i=1}^{n} C_i × D_i ) = ∪_{i,j} [(A_j × B_j) ∩ (C_i × D_i)] = ∪_{i,j} (A_j ∩ C_i) × (B_j ∩ D_i).

It is also closed under complementation, for

( ∪_{j=1}^{m} A_j × B_j )^c = ∩_{j=1}^{m} [(A_j^c × B_j) ∪ (A_j × B_j^c) ∪ (A_j^c × B_j^c)].

By de Morgan's law, A is an algebra.
Note that countable unions of disjoint elements of A are also countable unions of disjoint elements of the collection R = {A × B : A ∈ F_1, B ∈ F_2} of measurable rectangles. Hence, if we show that

(1.4.2)    Σ_{j=1}^{m} μ_1(A_j) μ_2(B_j) = Σ_i μ_1(C_i) μ_2(D_i),

whenever ∪_{j=1}^{m} A_j × B_j = ∪_i (C_i × D_i) for some m < ∞, A_j, C_i ∈ F_1 and B_j, D_i ∈ F_2, then we deduce that the value of μ(E) is independent of the representation we choose for E ∈ A in terms of measurable rectangles, and further that μ is countably additive on A.
To this end, note that the preceding set identity amounts to

Σ_{j=1}^{m} I_{A_j}(x) I_{B_j}(y) = Σ_i I_{C_i}(x) I_{D_i}(y),  ∀ x ∈ Ω_1, y ∈ Ω_2.

Hence, fixing x ∈ Ω_1, we have that φ(y) = Σ_{j=1}^{m} I_{A_j}(x) I_{B_j}(y) ∈ SF_+ is the monotone increasing limit of φ_n(y) = Σ_{i=1}^{n} I_{C_i}(x) I_{D_i}(y) ∈ SF_+ as n → ∞. Thus, by linearity of the integral with respect to μ_2 and monotone convergence,

g(x) := Σ_{j=1}^{m} μ_2(B_j) I_{A_j}(x) = μ_2(φ) = lim_n μ_2(φ_n) = lim_n Σ_{i=1}^{n} I_{C_i}(x) μ_2(D_i).
We deduce that the non-negative g(x) ∈ mF_1 is the monotone increasing limit of the non-negative measurable functions h_n(x) = Σ_{i=1}^{n} μ_2(D_i) I_{C_i}(x). Hence, by the same reasoning,

Σ_{j=1}^{m} μ_2(B_j) μ_1(A_j) = μ_1(g) = lim_n μ_1(h_n) = Σ_i μ_2(D_i) μ_1(C_i),

proving (1.4.2) and the theorem.
It follows from Theorem 1.4.19 by induction on n that given any finite collection of σ-finite measure spaces (Ω_i, F_i, μ_i), i = 1, ..., n, there exists a unique product measure μ_1 × ⋯ × μ_n on the product space (Ω, F) (i.e., Ω = Ω_1 × ⋯ × Ω_n and F = σ({A_1 × ⋯ × A_n : A_i ∈ F_i, i = 1, ..., n})), such that

(1.4.3)    μ_1 × ⋯ × μ_n (A_1 × ⋯ × A_n) = Π_{i=1}^{n} μ_i(A_i),  ∀ A_i ∈ F_i, i = 1, ..., n.
Remark 1.4.20. A notable special case of this construction is when Ω_i = R with the Borel σ-algebra and Lebesgue measure of Section 1.1.3. The product space is then R^n with its Borel σ-algebra, and the product measure is λ^n, the Lebesgue measure on R^n.
The notion of the law P_X of a real-valued random variable X, as in Definition 1.2.33, naturally extends to the joint law P_X of a random vector X = (X_1, ..., X_n), which is the probability measure P_X = P ∘ X^{−1} on (R^n, B_{R^n}).
We next characterize the joint law of independent random variables X_1, ..., X_n as the product of the laws of X_i for i = 1, ..., n.

Proposition 1.4.21. Random variables X_1, ..., X_n on the same probability space, having laws ν_i = P_{X_i}, are mutually independent if and only if their joint law is ν_1 × ⋯ × ν_n.
Proof. By Definition 1.4.3 and the identity (1.4.3), if X_1, ..., X_n are mutually independent, then for B_i ∈ B,

P_X(B_1 × ⋯ × B_n) = P(X_1 ∈ B_1, ..., X_n ∈ B_n) = Π_{i=1}^{n} P(X_i ∈ B_i) = Π_{i=1}^{n} ν_i(B_i) = ν_1 × ⋯ × ν_n (B_1 × ⋯ × B_n).

This shows that the law of (X_1, ..., X_n) and the product measure ν_1 × ⋯ × ν_n agree on the collection of all measurable rectangles B_1 × ⋯ × B_n, a π-system that generates B_{R^n} (see Exercise 1.1.21). Consequently, these two probability measures agree on B_{R^n} (c.f. Proposition 1.1.39).
Conversely, if P_X = ν_1 × ⋯ × ν_n, then by the same reasoning, for Borel sets B_i,

P( ∩_{i=1}^{n} {ω : X_i(ω) ∈ B_i} ) = P_X(B_1 × ⋯ × B_n) = ν_1 × ⋯ × ν_n (B_1 × ⋯ × B_n) = Π_{i=1}^{n} ν_i(B_i) = Π_{i=1}^{n} P({ω : X_i(ω) ∈ B_i}),

which amounts to the mutual independence of X_1, ..., X_n.
We wish to extend the construction of product measures to that of an infinite collection of independent random variables. To this end, let N = {1, 2, ...} denote the set of natural numbers and R^N = {x = (x_1, x_2, ...) : x_i ∈ R} denote the collection of all infinite sequences of real numbers. We equip R^N with the product σ-algebra B_c = σ(R), generated by the collection R of all finite dimensional measurable rectangles (also called cylinder sets), that is, sets of the form {x : x_1 ∈ B_1, ..., x_n ∈ B_n}, where B_i ∈ B, i = 1, ..., n ∈ N (e.g. see Example 1.1.19).
Kolmogorov's extension theorem provides the existence of a unique probability measure P on (R^N, B_c) whose projections coincide with a given consistent sequence of probability measures ν_n on (R^n, B_{R^n}).
Theorem 1.4.22 (Kolmogorov's extension theorem). Suppose we are given probability measures ν_n on (R^n, B_{R^n}) that are consistent, that is,

ν_{n+1}(B_1 × ⋯ × B_n × R) = ν_n(B_1 × ⋯ × B_n),  ∀ B_i ∈ B, i = 1, ..., n < ∞.

Then, there is a unique probability measure P on (R^N, B_c) such that

(1.4.4)    P({ω : ω_i ∈ B_i, i = 1, ..., n}) = ν_n(B_1 × ⋯ × B_n),  ∀ B_i ∈ B, i ≤ n < ∞.
Proof. (sketch only) We take a similar approach as in the proof of Theorem 1.4.19. That is, we use (1.4.4) to define the non-negative set function P_0 on the collection R of all finite dimensional measurable rectangles, where by the consistency of {ν_n} the value of P_0 is independent of the specific representation chosen for a set in R. Then, we extend P_0 to a finitely additive set function on the algebra

A = { ∪_{j=1}^{m} E_j : E_j ∈ R, m < ∞ },

in the same linear manner we used when proving Theorem 1.4.19. Since A generates B_c and P_0(R^N) = ν_n(R^n) = 1, by Carathéodory's extension theorem it suffices to check that P_0 is countably additive on A. The countable additivity of P_0 is verified by the method we already employed when dealing with Lebesgue's measure. That is, by the remark after Lemma 1.1.31, it suffices to prove that P_0(H_n) → 0 whenever H_n ∈ A and H_n ↓ ∅. The proof by contradiction of the latter, adapting the argument of Lemma 1.1.31, is based on approximating each H ∈ A by a finite union J_k ⊆ H of compact rectangles, such that P_0(H \ J_k) → 0 as k → ∞. This is done for example in [Bil95, Page 490].
Example 1.4.23. To systematically construct an infinite sequence of independent random variables X_i of prescribed laws P_{X_i} = ν_i, we apply Kolmogorov's extension theorem for the product measures μ_n = ν_1 × ⋯ × ν_n constructed following Theorem 1.4.19 (where it is by definition that the sequence {μ_n} is consistent). Alternatively, for infinite product measures one can take arbitrary probability spaces (Ω_i, F_i, μ_i) and directly show by contradiction that P_0(H_n) → 0 whenever H_n ∈ A and H_n ↓ ∅ (for more details, see [Str93, Exercise 1.1.14]).
Remark. As we shall find in Sections 6.1 and 7.1, Kolmogorov's extension theorem is the key to the study of stochastic processes, where it relates the law of the process to its finite dimensional distributions. Certain properties of R are key to the proof of Kolmogorov's extension theorem, which indeed is false if (R, B) is replaced with an arbitrary measurable space (S, S) (see the discussions in [Dur10, Subsection 2.1.4] and [Dud89, notes for Section 12.1]). Nevertheless, as you show next, the conclusion of this theorem applies for any B-isomorphic measurable space (S, S).

Definition 1.4.24. Two measurable spaces (S, S) and (T, T) are isomorphic if there exists a one to one and onto measurable mapping between them whose inverse is also a measurable mapping. A measurable space (S, S) is B-isomorphic if it is isomorphic to a Borel subset T of R equipped with the induced Borel σ-algebra T = {B ∩ T : B ∈ B}.
Here is our generalized version of Kolmogorov's extension theorem.

Corollary 1.4.25. Given a measurable space (S, S), let S^N denote the collection of all infinite sequences of elements of S, equipped with the product σ-algebra S_c generated by the collection of all cylinder sets of the form {s : s_1 ∈ A_1, ..., s_n ∈ A_n}, where A_i ∈ S for i = 1, ..., n. If (S, S) is B-isomorphic, then for any consistent sequence of probability measures ν_n on (S^n, S^n) (that is, ν_{n+1}(A_1 × ⋯ × A_n × S) = ν_n(A_1 × ⋯ × A_n) for all n and A_i ∈ S), there exists a unique probability measure Q on (S^N, S_c) such that for all n and A_i ∈ S,

(1.4.5)    Q({s : s_i ∈ A_i, i = 1, ..., n}) = ν_n(A_1 × ⋯ × A_n).
Next comes a guided proof of Corollary 1.4.25 out of Theorem 1.4.22.

Exercise 1.4.26.
(a) Verify that our proof of Theorem 1.4.22 applies in case (R, B) is replaced by T ∈ B equipped with the induced Borel σ-algebra T (with R^N and B_c replaced by T^N and T_c, respectively).
(b) Fixing such (T, T) and (S, S) isomorphic to it, let g : S → T be one to one and onto such that both g and g^{−1} are measurable. Check that the one to one and onto mappings g_n(s) = (g(s_1), ..., g(s_n)) are measurable, and deduce that μ_n(B) = ν_n(g_n^{−1}(B)) are consistent probability measures on (T^n, T^n).
(c) Consider the one to one and onto mapping g_∞(s) = (g(s_1), ..., g(s_n), ...) from S^N to T^N and the unique probability measure P on (T^N, T_c) for which (1.4.4) holds. Verify that S_c is contained in the σ-algebra of subsets A of S^N for which g_∞(A) is in T_c, and deduce that Q(A) = P(g_∞(A)) is a probability measure on (S^N, S_c).
(d) Conclude your proof of Corollary 1.4.25 by showing that this Q is the unique probability measure for which (1.4.5) holds.
Remark. Recall that Carathéodory's extension theorem applies for any σ-finite measure. It follows that, by the same proof as in the preceding exercise, any consistent sequence of σ-finite measures ν_n uniquely determines a σ-finite measure Q on (S^N, S_c) for which (1.4.5) holds, a fact which we use in later parts of this text (for example, in the study of Markov chains in Section 6.1).
Our next proposition shows that in most applications one encounters B-isomorphic measurable spaces (for which Kolmogorov's theorem applies).

Proposition 1.4.27. If S ∈ B_M for a complete separable metric space M and S is the restriction of B_M to S, then (S, S) is B-isomorphic.

Remark. While we do not provide the proof of this proposition, we note in passing that it is an immediate consequence of [Dud89, Theorem 13.1.1].
1.4.3. Fubini's theorem and its application. Returning to (Ω, F, μ), which is the product of two σ-finite measure spaces, as in Theorem 1.4.19, we now prove that:

Theorem 1.4.28 (Fubini's theorem). Suppose μ = μ_1 × μ_2 is the product of the σ-finite measures μ_1 on (X, 𝒳) and μ_2 on (Y, 𝒴). If h ∈ mF for F = 𝒳 × 𝒴 is such that h ≥ 0 or ∫ |h| dμ < ∞, then,

(1.4.6)    ∫_{X×Y} h dμ = ∫_X [ ∫_Y h(x, y) dμ_2(y) ] dμ_1(x) = ∫_Y [ ∫_X h(x, y) dμ_1(x) ] dμ_2(y).
Remark. The iterated integrals on the right side of (1.4.6) are finite and well defined whenever ∫ |h| dμ < ∞. However, for h ∉ mF_+ the inner integrals might be well defined only in the almost everywhere sense.
Proof of Fubini's theorem. Clearly, it suffices to prove the first identity of (1.4.6), as the second immediately follows by exchanging the roles of the two measure spaces. We thus prove Fubini's theorem by showing that

(1.4.7)    y ↦ h(x, y) ∈ m𝒴,  ∀ x ∈ X,

(1.4.8)    x ↦ f_h(x) := ∫_Y h(x, y) dμ_2(y) ∈ m𝒳,

so the double integral on the right side of (1.4.6) is well defined, and

(1.4.9)    ∫_{X×Y} h dμ = ∫_X f_h(x) dμ_1(x).

We do so in three steps, first proving (1.4.7)-(1.4.9) for finite measures and bounded h, proceeding to extend these results to non-negative h and σ-finite measures, and then showing that (1.4.6) holds whenever h ∈ mF and ∫ |h| dμ is finite.
Step 1. Let H denote the collection of bounded functions on X × Y for which (1.4.7)-(1.4.9) hold. Assuming that both μ_1(X) and μ_2(Y) are finite, we deduce that H contains all bounded h ∈ mF by verifying the assumptions of the monotone class theorem (i.e. Theorem 1.2.7) for H and the π-system R = {A × B : A ∈ 𝒳, B ∈ 𝒴} of measurable rectangles (which generates F).

To this end, note that if h = I_E and E = A × B ∈ R, then either h(x, ·) = I_B(·) (in case x ∈ A), or h(x, ·) is identically zero (when x ∉ A). With I_B ∈ m𝒴 we thus have (1.4.7) for any such h. Further, in this case the simple function f_h(x) = μ_2(B) I_A(x) on (X, 𝒳) is in m𝒳 and

∫_{X×Y} I_E dμ = μ_1 × μ_2(E) = μ_2(B) μ_1(A) = ∫_X f_h(x) dμ_1(x).

Consequently, I_E ∈ H for all E ∈ R; in particular, the constant functions are in H. Next, with both m𝒴 and m𝒳 vector spaces over R, by the linearity of h ↦ f_h over the vector space of bounded functions satisfying (1.4.7), and the linearity of f_h ↦ μ_1(f_h) and h ↦ μ(h) over the vector spaces of bounded measurable f_h and h, respectively, we deduce that H is also a vector space over R.
Finally, if non-negative h_n ∈ H are such that h_n ↑ h, then for each x ∈ X the mapping y ↦ h(x, y) = sup_n h_n(x, y) is in m𝒴_+ (by Theorem 1.2.22). Further, f_{h_n} ∈ m𝒳_+ and by monotone convergence f_{h_n} ↑ f_h (for all x ∈ X), so by the same reasoning f_h ∈ m𝒳_+. Applying monotone convergence twice more, it thus follows that

μ(h) = sup_n μ(h_n) = sup_n μ_1(f_{h_n}) = μ_1(f_h),

so h satisfies (1.4.7)-(1.4.9). In particular, if h is bounded then also h ∈ H.
Step 2. Suppose now that h ∈ mF_+. If μ_1 and μ_2 are finite measures, then we have shown in Step 1 that (1.4.7)-(1.4.9) hold for the bounded non-negative functions h_n = h ∧ n. With h_n ↑ h we have further seen that (1.4.7)-(1.4.9) hold also for the possibly unbounded h. Further, the closure of (1.4.8) and (1.4.9) with respect to monotone increasing limits of non-negative functions has been shown by monotone convergence, and as such it extends to σ-finite measures μ_1 and μ_2. Turning now to σ-finite μ_1 and μ_2, recall that there exist E_n = A_n × B_n ∈ R such that A_n ↑ X, B_n ↑ Y, μ_1(A_n) < ∞ and μ_2(B_n) < ∞. As h is the monotone increasing limit of h_n = h I_{E_n} ∈ mF_+, it thus suffices to verify that for each n the non-negative f_n(x) = ∫_Y h_n(x, y) dμ_2(y) is measurable with μ(h_n) = μ_1(f_n).
Fixing n and simplifying our notations to E = E_n, A = A_n and B = B_n, recall Corollary 1.3.57 that μ(h_n) = μ_E(h_E) for the restrictions h_E and μ_E of h and μ to the measurable space (E, F_E). Also, as E = A × B we have that F_E = 𝒳_A × 𝒴_B and μ_E = (μ_1)_A × (μ_2)_B for the finite measures (μ_1)_A and (μ_2)_B. Finally, as f_n(x) = f_{h_E}(x) := ∫_B h_E(x, y) d(μ_2)_B(y) when x ∈ A and zero otherwise, it follows that μ_1(f_n) = (μ_1)_A(f_{h_E}). We have thus reduced our problem (for h_n) to the case of finite measures μ_E = (μ_1)_A × (μ_2)_B, which we have already successfully resolved.
Step 3. Write h ∈ mF as h = h_+ − h_−, with h_± ∈ mF_+. By Step 2 we know that y ↦ h_±(x, y) ∈ m𝒴 for each x ∈ X, hence the same applies for y ↦ h(x, y). Let X_0 denote the subset of X for which ∫_Y |h(x, y)| dμ_2(y) < ∞. By linearity of the integral with respect to μ_2 we have that for all x ∈ X_0,

(1.4.10)    f_h(x) = f_{h_+}(x) − f_{h_−}(x)

is finite. By Step 2 we know that f_{h_±} ∈ m𝒳, hence X_0 = {x : f_{h_+}(x) + f_{h_−}(x) < ∞} is in 𝒳. From Step 2 we further have that μ_1(f_{h_±}) = μ(h_±), whereby our assumption that ∫ |h| dμ = μ_1(f_{h_+} + f_{h_−}) < ∞ implies that μ_1(X_0^c) = 0. Let f̃_h(x) = f_{h_+}(x) − f_{h_−}(x) on X_0 and f̃_h(x) = 0 for all x ∉ X_0. Clearly, f̃_h ∈ m𝒳 is μ_1-almost-everywhere the same as the inner integral on the right side of (1.4.6). Moreover, in view of (1.4.10) and the linearity of the integrals with respect to μ_1 and μ, we deduce that

μ(h) = μ(h_+) − μ(h_−) = μ_1(f_{h_+}) − μ_1(f_{h_−}) = μ_1(f̃_h),

which is exactly the identity (1.4.6).
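The conclusion of (1.4.6) can be seen concretely; the following sketch (not part of the text, with h(x, y) = x y² e^{−x−y} and the truncation [0, 15]² chosen arbitrarily) computes both iterated integrals by nested midpoint sums and observes that they agree, with common value Γ(2)Γ(3) = 2.

```python
import math

# Sketch: check (1.4.6) numerically for h(x, y) = x * y^2 * exp(-x - y)
# with Lebesgue measure in both coordinates, truncating to [0, 15]^2
# (the tails beyond that square are negligible). The two iteration
# orders give the same value, close to 2.

def iterated(order, n=1000, hi=15.0):
    h = hi / n
    grid = [(i + 0.5) * h for i in range(n)]
    total = 0.0
    for u in grid:          # outer variable
        inner = 0.0
        for v in grid:      # inner variable
            x, y = (u, v) if order == "xy" else (v, u)
            inner += x * y * y * math.exp(-x - y) * h
        total += inner * h
    return total

a, b = iterated("xy"), iterated("yx")
print(a, b)  # both close to 2.0
```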
Equipped with Fubini's theorem, we have the following simpler formula for the expectation of a Borel function h of two independent R.V.-s.

Theorem 1.4.29. Suppose that X and Y are independent random variables of laws ν_1 = P_X and ν_2 = P_Y. If h : R² → R is a Borel measurable function such that h ≥ 0 or E|h(X, Y)| < ∞, then,

(1.4.11)    E h(X, Y) = ∫ [ ∫ h(x, y) dν_1(x) ] dν_2(y).

In particular, for Borel functions f, g : R → R such that f, g ≥ 0 or E|f(X)| < ∞ and E|g(Y)| < ∞,

(1.4.12)    E(f(X) g(Y)) = E f(X) · E g(Y).
Proof. Subject to minor changes of notations, the proof of Theorem 1.3.61 applies to any (S, S)-valued R.V. Considering this theorem for the random vector (X, Y), whose joint law is ν_1 × ν_2 (c.f. Proposition 1.4.21), together with Fubini's theorem, we see that

E h(X, Y) = ∫_{R²} h(x, y) d(ν_1 × ν_2)(x, y) = ∫ [ ∫ h(x, y) dν_1(x) ] dν_2(y),

which is (1.4.11). Take now h(x, y) = f(x) g(y) for non-negative Borel functions f(x) and g(y). In this case, the iterated integral on the right side of (1.4.11) can be further simplified to

E(f(X) g(Y)) = ∫ [ ∫ f(x) g(y) dν_1(x) ] dν_2(y) = ∫ g(y) [ ∫ f(x) dν_1(x) ] dν_2(y) = ∫ [E f(X)] g(y) dν_2(y) = E f(X) · E g(Y)

(with Theorem 1.3.61 applied twice here), which is the stated identity (1.4.12). To deal with Borel functions f and g that are not necessarily non-negative, first apply (1.4.12) for the non-negative functions |f| and |g| to get that E(|f(X) g(Y)|) = E|f(X)| E|g(Y)| < ∞. Thus, the assumed integrability of f(X) and of g(Y) allows us to apply again (1.4.11) for h(x, y) = f(x) g(y). Now repeat the argument we used for deriving (1.4.12) in the case of non-negative Borel functions.
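For discrete laws, the factorization (1.4.12) reduces to finite sums; the sketch below (not part of the text, laws and functions chosen arbitrarily) forms the joint pmf as the product of the marginals, per Proposition 1.4.21, and confirms that E[f(X)g(Y)] factors.

```python
from itertools import product

# Sketch: for independent discrete X, Y the joint pmf is the product of
# the marginals (Proposition 1.4.21), so E[f(X)g(Y)] factors as in (1.4.12).

px = {0: 0.3, 1: 0.7}            # law of X
py = {-1: 0.2, 2: 0.5, 5: 0.3}   # law of Y
f = lambda x: x * x + 1.0
g = lambda y: 2.0 * y - 3.0

lhs = sum(f(x) * g(y) * px[x] * py[y] for x, y in product(px, py))
rhs = sum(f(x) * p for x, p in px.items()) * sum(g(y) * p for y, p in py.items())
print(lhs, rhs)
```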
Another consequence of Fubini's theorem is the following integration by parts formula.

Lemma 1.4.30 (integration by parts). Suppose H(x) = ∫_{−∞}^{x} h(y) dy for a non-negative Borel function h and all x ∈ R. Then, for any random variable X,

(1.4.13)    E H(X) = ∫_R h(y) P(X > y) dy.
Proof. Combining the change of variables formula (Theorem 1.3.61), with our
assumption about H(), we have that
EH(X) =
_
R
H(x)dT
X
(x) =
_
R
_
_
R
h(y)I
x>y
d(y)
_
dT
X
(x) ,
where denotes Lebesgues measure on (R, B). For each y R, the expectation of
the simple function x h(x, y) = h(y)I
x>y
with respect to (R, B, T
X
) is merely
h(y)P(X > y). Thus, applying Fubinis theorem for the nonnegative measurable
66 1. PROBABILITY, MEASURE AND INTEGRATION
function h(x, y) on the product space RR equipped with its Borel algebra B
R
2,
and the nite measures
1
= T
X
and
2
= , we have that
EH(X) =
_
R
_
_
R
h(y)I
x>y
dT
X
(x)
_
d(y) =
_
R
h(y)P(X > y)dy ,
as claimed.
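As a quick numerical sanity check of (1.4.13) — an illustration, not part of the text — take $X$ exponential of rate one and $h(y) = 2y\, 1_{y>0}$, so $H(x) = x^2$ on $x \ge 0$: the left side is $E[X^2] = 2$, while the right side is $\int_0^\infty 2y\, e^{-y}\, dy$, which a Riemann sum recovers.

```python
import math

# Numerical check of (1.4.13) for X ~ Exponential(1), h(y) = 2y 1_{y>0},
# so H(x) = x^2 1_{x>=0}: E[H(X)] = E[X^2] = 2, and the right side is
# \int_0^infty 2y P(X > y) dy with P(X > y) = e^{-y}.
def right_side(upper=50.0, steps=200_000):
    # midpoint Riemann sum for \int_0^upper 2y e^{-y} dy (tail beyond 50 is negligible)
    dy = upper / steps
    return sum(2 * ((i + 0.5) * dy) * math.exp(-(i + 0.5) * dy) * dy
               for i in range(steps))

left_side = 2.0  # E[X^2] for the rate-one exponential, computed in closed form
print(abs(right_side() - left_side) < 1e-3)  # True
```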
Indeed, as we see next, by combining the integration by parts formula with Hölder's
inequality we can convert bounds on tail probabilities to bounds on the moments
of the corresponding random variables.

Lemma 1.4.31.
(a) For any $r > p > 0$ and any random variable $Y \ge 0$,
$$E Y^p = \int_0^\infty p y^{p-1} P(Y > y)\, dy = \int_0^\infty p y^{p-1} P(Y \ge y)\, dy = (1 - \tfrac{p}{r}) \int_0^\infty p y^{p-1} E[\min(Y/y, 1)^r]\, dy\,.$$
(b) If $X, Y \ge 0$ are such that $P(Y \ge y) \le y^{-1} E[X I_{Y \ge y}]$ for all $y > 0$, then
$\|Y\|_p \le q \|X\|_p$ for any $p > 1$ and $q = p/(p-1)$.
(c) Under the same hypothesis also $E Y \le 1 + E[X (\log Y)_+]$.
Proof. (a) The first identity is merely the integration by parts formula for
$h_p(y) = p y^{p-1} 1_{y > 0}$ and $H_p(x) = x^p 1_{x \ge 0}$, and the second identity follows by the
fact that $P(Y = y) = 0$ up to a (countable) set of zero Lebesgue measure. Finally,
it is easy to check that $H_p(x) = \int_{\mathbb{R}} h_{p,r}(x, y)\, dy$ for the non-negative Borel function
$h_{p,r}(x, y) = (1 - p/r)\, p y^{p-1} \min(x/y, 1)^r 1_{x \ge 0} 1_{y > 0}$ and any $r > p > 0$. Hence,
replacing $h(y) I_{x > y}$ throughout the proof of Lemma 1.4.30 by $h_{p,r}(x, y)$ we find that
$E[H_p(X)] = \int_0^\infty E[h_{p,r}(X, y)]\, dy$, which is exactly our third identity.
(b) In a similar manner it follows from Fubini's theorem that for $p > 1$ and any
non-negative random variables $X$ and $Y$,
$$E[X Y^{p-1}] = E[X H_{p-1}(Y)] = E\Big[\int_{\mathbb{R}} h_{p-1}(y)\, X I_{Y \ge y}\, dy\Big] = \int_{\mathbb{R}} h_{p-1}(y)\, E[X I_{Y \ge y}]\, dy\,.$$
Thus, with $y^{-1} h_p(y) = q\, h_{p-1}(y)$ our hypothesis implies that
$$E Y^p = \int_{\mathbb{R}} h_p(y)\, P(Y \ge y)\, dy \le \int_{\mathbb{R}} q\, h_{p-1}(y)\, E[X I_{Y \ge y}]\, dy = q\, E[X Y^{p-1}]\,.$$
Applying Hölder's inequality we deduce that
$$E Y^p \le q\, E[X Y^{p-1}] \le q \|X\|_p \|Y^{p-1}\|_q = q \|X\|_p\, [E Y^p]^{1/q}\,,$$
where the right-most equality is due to the fact that $(p-1)q = p$. In case $Y$
is bounded, dividing both sides of the preceding bound by $[E Y^p]^{1/q}$ implies that
$\|Y\|_p \le q \|X\|_p$. To deal with the general case, let $Y_n = Y \wedge n$, $n = 1, 2, \ldots$, and
note that either $\{Y_n \ge y\}$ is empty (for $n < y$) or $\{Y_n \ge y\} = \{Y \ge y\}$. Thus, our
assumption implies that $P(Y_n \ge y) \le y^{-1} E[X I_{Y_n \ge y}]$ for all $y > 0$ and $n \ge 1$. By
the preceding argument $\|Y_n\|_p \le q \|X\|_p$ for any $n$. Taking $n \to \infty$ it follows by
monotone convergence that $\|Y\|_p \le q \|X\|_p$.
(c) Considering part (a) with $p = 1$, we bound $P(Y \ge y)$ by one for $y \in [0, 1]$ and
by $y^{-1} E[X I_{Y \ge y}]$ for $y > 1$, to get by Fubini's theorem that
$$E Y = \int_0^\infty P(Y \ge y)\, dy \le 1 + \int_1^\infty y^{-1} E[X I_{Y \ge y}]\, dy = 1 + E\Big[X \int_1^\infty y^{-1} I_{Y \ge y}\, dy\Big] = 1 + E[X (\log Y)_+]\,. \qquad \square$$
We further have the following corollary of (1.4.12), dealing with the expectation
of a product of mutually independent R.V.

Corollary 1.4.32. Suppose that $X_1, \ldots, X_n$ are P-mutually independent random
variables such that either $X_i \ge 0$ for all $i$, or $E|X_i| < \infty$ for all $i$. Then,
$$\text{(1.4.14)}\qquad E\Big(\prod_{i=1}^n X_i\Big) = \prod_{i=1}^n E X_i\,,$$
that is, the expectation on the left exists and has the value given on the right.

Proof. By Corollary 1.4.11 we know that $X = X_1$ and $Y = X_2 \cdots X_n$ are
independent. Taking $f(x) = |x|$ and $g(y) = |y|$ in Theorem 1.4.29, we thus have
that $E|X_1 \cdots X_n| = E|X_1|\, E|X_2 \cdots X_n|$ for any $n \ge 2$. Applying this identity
iteratively for $X_l, \ldots, X_n$, starting with $l = m$, then $l = m+1, m+2, \ldots, n-1$,
leads to
$$\text{(1.4.15)}\qquad E|X_m \cdots X_n| = \prod_{k=m}^n E|X_k|\,,$$
holding for any $1 \le m \le n$. If $X_i \ge 0$ for all $i$, then $|X_i| = X_i$ and we have (1.4.14)
as the special case $m = 1$.
To deal with the proof in case $X_i \in L^1$ for all $i$, note that for $m = 2$ the identity
(1.4.15) tells us that $E|Y| = E|X_2 \cdots X_n| < \infty$, so using Theorem 1.4.29 with
$f(x) = x$ and $g(y) = y$ we have that $E(X_1 \cdots X_n) = (E X_1)\, E(X_2 \cdots X_n)$. Iterating
this identity for $X_l, \ldots, X_n$, starting with $l = 1$, then $l = 2, 3, \ldots, n-1$, leads to
the desired result (1.4.14). $\square$
Another application of Theorem 1.4.29 provides us with the familiar formula for
the probability density function of the sum $X + Y$ of independent random variables
$X$ and $Y$, having densities $f_X$ and $f_Y$ respectively.

Corollary 1.4.33. Suppose that a R.V. $X$ with a Borel measurable probability density
function $f_X$ and a R.V. $Y$ with a Borel measurable probability density function $f_Y$
are independent. Then, the random variable $Z = X + Y$ has the probability density
function
$$f_Z(z) = \int_{\mathbb{R}} f_X(z - y)\, f_Y(y)\, dy\,.$$

Proof. Fixing $z \in \mathbb{R}$, apply Theorem 1.4.29 for $h(x, y) = 1_{(x + y \le z)}$, to get
that
$$F_Z(z) = P(X + Y \le z) = E h(X, Y) = \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} h(x, y)\, d\mathcal{P}_X(x) \Big)\, d\mathcal{P}_Y(y)\,.$$
Considering the inner integral for a fixed value of $y$, we have that
$$\int_{\mathbb{R}} h(x, y)\, d\mathcal{P}_X(x) = \int_{\mathbb{R}} I_{(-\infty, z-y]}(x)\, d\mathcal{P}_X(x) = \mathcal{P}_X((-\infty, z-y]) = \int_{-\infty}^{z-y} f_X(x)\, dx\,,$$
where the right-most equality is by the existence of a density $f_X(x)$ for $X$ (c.f.
Definition 1.2.39). Clearly, $\int_{-\infty}^{z-y} f_X(x)\, dx = \int_{-\infty}^{z} f_X(x - y)\, dx$. Thus, applying
Fubini's theorem for the Borel measurable function $g(x, y) = f_X(x - y) \ge 0$ and
the product of the $\sigma$-finite Lebesgue's measure on $(-\infty, z]$ and the probability
measure $\mathcal{P}_Y$, we see that
$$F_Z(z) = \int_{\mathbb{R}} \Big( \int_{-\infty}^{z} f_X(x - y)\, dx \Big)\, d\mathcal{P}_Y(y) = \int_{-\infty}^{z} \Big( \int_{\mathbb{R}} f_X(x - y)\, d\mathcal{P}_Y(y) \Big)\, dx$$
(in this application of Fubini's theorem we replace one iterated integral by another,
exchanging the order of integrations). Since this applies for any $z \in \mathbb{R}$, it follows
by definition that $Z$ has the probability density
$$f_Z(z) = \int_{\mathbb{R}} f_X(z - y)\, d\mathcal{P}_Y(y) = E f_X(z - Y)\,.$$
With $Y$ having density $f_Y$, the stated formula for $f_Z$ is a consequence of Corollary
1.3.62. $\square$
Definition 1.4.34. The expression $\int f(z - y) g(y)\, dy$ is called the convolution of
the non-negative Borel functions $f$ and $g$, denoted by $f \star g(z)$. The convolution of
measures $\nu$ and $\mu$ on $(\mathbb{R}, \mathcal{B})$ is the measure $\nu \star \mu$ on $(\mathbb{R}, \mathcal{B})$ such that $\nu \star \mu(B) =
\int \nu(B - x)\, d\mu(x)$ for any $B \in \mathcal{B}$ (where $B - x = \{y : x + y \in B\}$).

Corollary 1.4.33 states that if two independent random variables $X$ and $Y$ have
densities, then so does $Z = X + Y$, whose density is the convolution of the densities
of $X$ and $Y$. Without assuming the existence of densities, one can show by a similar
argument that the law of $X + Y$ is the convolution of the law of $X$ and the law of
$Y$ (c.f. [Dur10, Theorem 2.1.10] or [Bil95, Page 266]).
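To see the convolution formula in action numerically — an illustration not taken from the text — consider two independent Uniform(0,1) variables, whose sum is known to have the triangular density $f_Z(z) = z$ on $[0, 1]$ and $2 - z$ on $[1, 2]$; a direct Riemann-sum evaluation of $f_X \star f_Y$ recovers it:

```python
def f_uniform(x):
    # density of the Uniform(0,1) law
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def conv_at(z, steps=100_000):
    # Riemann-sum approximation of f_Z(z) = \int f_X(z - y) f_Y(y) dy,
    # where the y-integral may be restricted to [0, 1]
    dy = 1.0 / steps
    return sum(f_uniform(z - (i + 0.5) * dy) * dy for i in range(steps))

# triangular density of the sum: f_Z(z) = z on [0,1] and 2 - z on [1,2]
for z in (0.25, 0.5, 1.5):
    print(z, round(conv_at(z), 3))  # matches min(z, 2 - z)
```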
Convolution is often used in analysis to provide a more regular approximation to
a given function. Here are a few of the reasons for doing so.

Exercise 1.4.35. Suppose Borel functions $f, g$ are such that $g$ is a probability
density and $\int |f(x)|\, dx$ is finite. Consider the scaled densities $g_n(\cdot) = n g(n\, \cdot)$, $n \ge 1$.
(a) Show that $f \star g(y)$ is a Borel function with $\int |f \star g(y)|\, dy \le \int |f(x)|\, dx$,
and if $g$ is uniformly continuous, then so is $f \star g$.
(b) Show that if $g(x) = 0$ whenever $|x| \ge 1$, then $f \star g_n(y) \to f(y)$ as $n \to \infty$,
for any continuous $f$ and each $y \in \mathbb{R}$.
Next you find two of the many applications of Fubini's theorem in real analysis.

Exercise 1.4.36. Show that the set $G_f = \{(x, y) \in \mathbb{R}^2 : 0 \le y \le f(x)\}$ of points
under the graph of a non-negative Borel function $f : \mathbb{R} \to [0, \infty)$ is in $\mathcal{B}_{\mathbb{R}^2}$ and
deduce the well-known formula $\lambda \times \lambda(G_f) = \int f(x)\, d\lambda(x)$ for its area.
Exercise 1.4.37. For $n \ge 2$, consider the unit sphere $S^{n-1} = \{x \in \mathbb{R}^n : \|x\| = 1\}$
equipped with the topology induced by $\mathbb{R}^n$. Let the surface measure of $A \in \mathcal{B}_{S^{n-1}}$ be
$\gamma(A) = n \lambda_n(C_{0,1}(A))$, for $C_{a,b}(A) = \{r x : r \in (a, b],\ x \in A\}$ and the $n$-fold product
Lebesgue measure $\lambda_n$ (as in Remark 1.4.20).
(a) Check that $C_{a,b}(A) \in \mathcal{B}_{\mathbb{R}^n}$ and deduce that $\gamma(\cdot)$ is a finite measure on
$S^{n-1}$ (which is further invariant under orthogonal transformations).
(b) Verify that $\lambda_n(C_{a,b}(A)) = \frac{b^n - a^n}{n}\, \gamma(A)$ and deduce that for any $B \in \mathcal{B}_{\mathbb{R}^n}$,
$$\lambda_n(B) = \int_0^\infty \Big( \int_{S^{n-1}} I_{r x \in B}\, d\gamma(x) \Big)\, r^{n-1}\, d\lambda(r)\,.$$
Hint: Recall that $\lambda_n(\alpha B) = \alpha^n \lambda_n(B)$ for any $\alpha \ge 0$ and $B \in \mathcal{B}_{\mathbb{R}^n}$.
Combining (1.4.12) with Theorem 1.2.26 leads to the following characterization of
the independence between two random vectors (compare with Definition 1.4.1).

Exercise 1.4.38. Show that the $\mathbb{R}^n$-valued random variable $(X_1, \ldots, X_n)$ and the
$\mathbb{R}^m$-valued random variable $(Y_1, \ldots, Y_m)$ are independent if and only if
$$E\big(h(X_1, \ldots, X_n)\, g(Y_1, \ldots, Y_m)\big) = E\big(h(X_1, \ldots, X_n)\big)\, E\big(g(Y_1, \ldots, Y_m)\big),$$
for all bounded, Borel measurable functions $g : \mathbb{R}^m \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$.
Then show that the assumption of $h(\cdot)$ and $g(\cdot)$ bounded can be relaxed to both
$h(X_1, \ldots, X_n)$ and $g(Y_1, \ldots, Y_m)$ being in $L^1(\Omega, \mathcal{F}, P)$.
Here is another application of (1.4.12):
Exercise 1.4.39. Show that $E(f(X)g(X)) \ge (E f(X))(E g(X))$ for every random
variable $X$ and any bounded non-decreasing functions $f, g : \mathbb{R} \to \mathbb{R}$.
In the following exercise you bound the exponential moments of certain random
variables.

Exercise 1.4.40. Suppose $Y$ is an integrable random variable such that $E[e^Y]$ is
finite and $E[Y] = 0$.
(a) Show that if $|Y| \le \kappa$ then
$$\log E[e^Y] \le \kappa^{-2}(e^{\kappa} - \kappa - 1)\, E[Y^2]\,.$$
Hint: Use the Taylor expansion of $e^Y - Y - 1$.
(b) Show that if $E[Y^2 e^{uY}] \le \kappa^2 E[e^{uY}]$ for $u \in [0, 1]$, then
$$\log E[e^Y] \le \log \cosh(\kappa)\,.$$
Hint: Note that $\phi(u) = \log E[e^{uY}]$ is convex, non-negative and finite
on $[0, 1]$ with $\phi(0) = 0$, that $\phi''(u) + \phi'(u)^2 =
E[Y^2 e^{uY}]/E[e^{uY}]$ is non-decreasing on $[0, 1]$, and that $\psi(u) = \log \cosh(\kappa u)$
satisfies the differential equation $\psi''(u) + \psi'(u)^2 = \kappa^2$.
As demonstrated next, Fubini's theorem is also handy in proving the impossibility
of certain constructions.

Exercise 1.4.41. Explain why it is impossible to have P-mutually independent
random variables $U_t(\omega)$, $t \in [0, 1]$, on the same probability space $(\Omega, \mathcal{F}, P)$, having
each the uniform probability measure on $[-1/2, 1/2]$, such that $t \mapsto U_t(\omega)$ is a Borel
function for almost every $\omega \in \Omega$.
Hint: Show that $E[(\int_0^r U_t(\omega)\, dt)^2] = 0$ for all $r \in [0, 1]$.
Random variables $X$ and $Y$ such that $E(X^2) < \infty$ and $E(Y^2) < \infty$ are called
uncorrelated if $E(XY) = E(X)E(Y)$. It follows from (1.4.12) that independent
random variables $X, Y$ with finite second moment are uncorrelated. While the
converse is not necessarily true, it does apply for pairs of random variables that
take only two different values each.

Exercise 1.4.42. Suppose $X$ and $Y$ are uncorrelated random variables.
(a) Show that if $X = I_A$ and $Y = I_B$ for some $A, B \in \mathcal{F}$ then $X$ and $Y$ are
also independent.
(b) Using this, show that if an $\{a, b\}$-valued R.V. $X$ and a $\{c, d\}$-valued R.V. $Y$
are uncorrelated, then they are also independent.
(c) Give an example of a pair of R.V. $X$ and $Y$ that are uncorrelated but not
independent.
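For part (c), one standard construction (a sketch, not necessarily the example the notes intend) takes $X$ uniform on $\{-1, 0, 1\}$ and $Y = I_{\{X = 0\}}$; exact arithmetic over this three-point sample space confirms both claims:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1} and Y = 1_{X=0}: then E[XY] = 0 = E[X]E[Y], so the
# pair is uncorrelated, yet P(X=1, Y=1) = 0 differs from P(X=1)P(Y=1) = 1/9.
p = Fraction(1, 3)
omega = [(-1, 0), (0, 1), (1, 0)]  # equally likely values of (X, Y)
EX = sum(p * x for x, _ in omega)
EY = sum(p * y for _, y in omega)
EXY = sum(p * x * y for x, y in omega)
print(EXY == EX * EY)  # True: uncorrelated
joint = sum(p for x, y in omega if (x, y) == (1, 1))
prod = sum(p for x, _ in omega if x == 1) * sum(p for _, y in omega if y == 1)
print(joint == prod)  # False: not independent
```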
Next come a pair of exercises utilizing Corollary 1.4.32.

Exercise 1.4.43. Suppose $X$ and $Y$ are random variables on the same probability
space, $X$ has a Poisson distribution with parameter $\mu > 0$, and $Y$ has a Poisson
distribution with parameter $\lambda > \mu$ (see Example 1.3.69).
(a) Show that if $X$ and $Y$ are independent then $P(X \ge Y) \le \exp(-(\sqrt{\lambda} - \sqrt{\mu})^2)$.
(b) Taking $\lambda = \gamma \mu$ for $\gamma > 1$, find $I(\gamma) > 0$ such that $P(X \ge Y) \le
2 \exp(-\mu I(\gamma))$ even when $X$ and $Y$ are not independent.
Exercise 1.4.44. Suppose $X$ and $Y$ are independent random variables of identical
distribution such that $X > 0$ and $E[X] < \infty$.
(a) Show that $E[X^{-1} Y] > 1$ unless $X(\omega) = c$ for some non-random $c$ and
almost every $\omega$.
(b) Provide an example in which $E[X^{-1} Y] = \infty$.
We conclude this section with a concrete application of Corollary 1.4.33, computing the density of the sum of mutually independent R.V., each having the same
exponential density. To this end, recall

Definition 1.4.45. The gamma density with parameters $\alpha > 0$ and $\lambda > 0$ is given
by
$$f_{\Gamma}(s) = \Gamma(\alpha)^{-1} \lambda^{\alpha} s^{\alpha - 1} e^{-\lambda s}\, 1_{s > 0}\,,$$
where $\Gamma(\alpha) = \int_0^\infty s^{\alpha - 1} e^{-s}\, ds$ is finite and positive. In particular, $\alpha = 1$ corresponds
to the exponential density $f_T$ of Example 1.3.68.
Exercise 1.4.46. Suppose $X$ has a gamma density of parameters $\alpha_1$ and $\lambda$ and $Y$
has a gamma density of parameters $\alpha_2$ and $\lambda$. Show that if $X$ and $Y$ are independent then $X + Y$ has a gamma density of parameters $\alpha_1 + \alpha_2$ and $\lambda$. Deduce that
if $T_1, \ldots, T_n$ are mutually independent R.V. each having the exponential density of
parameter $\lambda$, then $W_n = \sum_{i=1}^n T_i$ has the gamma density of parameters $\alpha = n$ and
$\lambda$.
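A Monte Carlo sanity check of the last deduction (an illustration, not part of the text): summing $n$ independent rate-$\lambda$ exponentials should produce the gamma law of parameters $\alpha = n$ and $\lambda$, whose mean is $n/\lambda$ and variance $n/\lambda^2$.

```python
import random

# Sum n = 5 independent Exponential(rate 2) draws; by Exercise 1.4.46 the sum
# has the gamma density of parameters alpha = 5, lambda = 2, hence mean
# alpha/lambda = 2.5 and variance alpha/lambda^2 = 1.25.
random.seed(0)
n, lam, reps = 5, 2.0, 100_000
samples = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
print(abs(mean - 2.5) < 0.05, abs(var - 1.25) < 0.05)  # True True
```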
CHAPTER 2
Asymptotics: the law of large numbers
Building upon the foundations of Chapter 1 we turn to deal with asymptotic
theory. To this end, this chapter is devoted to degenerate limit laws, that is,
situations in which a sequence of random variables converges to a non-random
(constant) limit. Though not exclusively dealing with it, our focus here is on the
sequence of empirical averages $n^{-1} \sum_{i=1}^n X_i$ as $n \to \infty$.
Section 2.1 deals with the weak law of large numbers, where convergence in probability (or in $L^q$ for some $q > 1$) is considered. This is strengthened in Section 2.3
to a strong law of large numbers, namely, to convergence almost surely. The key
tools for this improvement are the Borel–Cantelli lemmas, to which Section 2.2 is
devoted.
2.1. Weak laws of large numbers
A weak law of large numbers corresponds to the situation where the normalized
sums of a large number of random variables converge in probability to a non-random
constant. Usually, the derivation of a weak law involves the computation of variances, on which we focus in Subsection 2.1.1. However, the $L^2$ convergence we
obtain there is of a somewhat limited scope of applicability. To remedy this, we
introduce the method of truncation in Subsection 2.1.2 and illustrate its power in
a few representative examples.
2.1.1. $L^2$ limits for sums of uncorrelated variables. The key to our
derivation of weak laws of large numbers is the computation of variances. As a
preliminary step we define the covariance of two R.V. and extend the notion of a
pair of uncorrelated random variables to a (possibly infinite) family of R.V.

Definition 2.1.1. The covariance of two random variables $X, Y \in L^2(\Omega, \mathcal{F}, P)$ is
$$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)] = E XY - EX\, EY\,,$$
so in particular, $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
We say that random variables $X_\alpha \in L^2(\Omega, \mathcal{F}, P)$ are uncorrelated if
$$E(X_\alpha X_\beta) = E(X_\alpha)\, E(X_\beta) \qquad \forall \alpha \ne \beta\,,$$
or equivalently, if
$$\mathrm{Cov}(X_\alpha, X_\beta) = 0 \qquad \forall \alpha \ne \beta\,.$$
As we next show, the variance of the sum of finitely many uncorrelated random
variables is the sum of the variances of the variables.

Lemma 2.1.2. Suppose $X_1, \ldots, X_n$ are uncorrelated random variables (which necessarily are defined on the same probability space). Then,
$$\text{(2.1.1)}\qquad \mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)\,.$$

Proof. Let $S_n = \sum_{i=1}^n X_i$. By Definition 1.3.67 of the variance and linearity of
the expectation we have that
$$\mathrm{Var}(S_n) = E([S_n - E S_n]^2) = E\Big(\Big[\sum_{i=1}^n X_i - \sum_{i=1}^n E X_i\Big]^2\Big) = E\Big(\Big[\sum_{i=1}^n (X_i - E X_i)\Big]^2\Big)\,.$$
Writing the square of the sum as the sum of all possible cross-products, we get that
$$\mathrm{Var}(S_n) = \sum_{i,j=1}^n E[(X_i - E X_i)(X_j - E X_j)] = \sum_{i,j=1}^n \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^n \mathrm{Cov}(X_i, X_i) = \sum_{i=1}^n \mathrm{Var}(X_i)\,,$$
where we use the fact that $\mathrm{Cov}(X_i, X_j) = 0$ for each $i \ne j$ since $X_i$ and $X_j$ are
uncorrelated. $\square$
Equipped with this lemma we have our

Theorem 2.1.3 ($L^2$ weak law of large numbers). Consider $S_n = \sum_{i=1}^n X_i$
for uncorrelated random variables $X_1, \ldots, X_n, \ldots$. Suppose that $\mathrm{Var}(X_i) \le C$ and
$E X_i = \bar{x}$ for some finite constants $C, \bar{x}$, and all $i = 1, 2, \ldots$. Then, $n^{-1} S_n \stackrel{L^2}{\to} \bar{x}$ as
$n \to \infty$, and hence also $n^{-1} S_n \stackrel{p}{\to} \bar{x}$.

Proof. Our assumptions imply that $E(n^{-1} S_n) = \bar{x}$, and further by Lemma
2.1.2 we have the bound $\mathrm{Var}(S_n) \le nC$. Recall the scaling property (1.3.17) of the
variance, implying that
$$E\big[(n^{-1} S_n - \bar{x})^2\big] = \mathrm{Var}\big(n^{-1} S_n\big) = \frac{1}{n^2} \mathrm{Var}(S_n) \le \frac{C}{n} \to 0$$
as $n \to \infty$. Thus, $n^{-1} S_n \stackrel{L^2}{\to} \bar{x}$ (recall Definition 1.3.26). By Proposition 1.3.29 this
implies that also $n^{-1} S_n \stackrel{p}{\to} \bar{x}$. $\square$
The most important special case of Theorem 2.1.3 is,

Example 2.1.4. Suppose that $X_1, \ldots, X_n$ are independent and identically distributed (or in short, i.i.d.), with $E X_1^2 < \infty$. Then, $E X_i^2 = C$ and $E X_i = m_X$
are both finite and independent of $i$. So, the $L^2$ weak law of large numbers tells us
that $n^{-1} S_n \stackrel{L^2}{\to} m_X$, and hence also $n^{-1} S_n \stackrel{p}{\to} m_X$.
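The $C/n$ bound from the proof of Theorem 2.1.3 is easy to observe by simulation (an illustrative sketch, not from the text): for i.i.d. fair-coin indicators, $\bar{x} = 1/2$, $\mathrm{Var}(X_i) = 1/4$, and the mean squared error of $n^{-1} S_n$ is exactly $1/(4n)$.

```python
import random

# Estimate E[(S_n/n - 1/2)^2] for i.i.d. Bernoulli(1/2) summands; the L^2 weak
# law gives the exact value 1/(4n), so mse(n) * 4n should hover near 1.
random.seed(1)
def mse(n, reps=2_000):
    total = 0.0
    for _ in range(reps):
        s = sum(random.random() < 0.5 for _ in range(n))
        total += (s / n - 0.5) ** 2
    return total / reps

for n in (10, 100, 1000):
    print(n, round(mse(n) * 4 * n, 1))  # each close to 1.0
```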
Remark. As we shall see, the weaker condition $E|X_i| < \infty$ suffices for the convergence in probability of $n^{-1} S_n$ to $m_X$. In Section 2.3 we show that it even suffices
for the convergence almost surely of $n^{-1} S_n$ to $m_X$, a statement called the strong
law of large numbers.
Exercise 2.1.5. Show that the conclusion of the $L^2$ weak law of large numbers
holds even for correlated $X_i$, provided $E X_i = \bar{x}$ and $\mathrm{Cov}(X_i, X_j) \le r(|i - j|)$ for all
$i, j$, and some bounded sequence $r(k) \to 0$ as $k \to \infty$.
With an eye on generalizing the $L^2$ weak law of large numbers we observe that

Lemma 2.1.6. If the random variables $Z_n \in L^2(\Omega, \mathcal{F}, P)$ and the non-random $b_n$
are such that $b_n^{-2} \mathrm{Var}(Z_n) \to 0$ as $n \to \infty$, then $b_n^{-1}(Z_n - E Z_n) \stackrel{L^2}{\to} 0$.

Proof. We have $E[(b_n^{-1}(Z_n - E Z_n))^2] = b_n^{-2} \mathrm{Var}(Z_n) \to 0$. $\square$
Example 2.1.7. Let $Z_n = \sum_{k=1}^n X_k$ for uncorrelated random variables $X_k$. If
$\mathrm{Var}(X_k)/k \to 0$ as $k \to \infty$, then Lemma 2.1.6 applies for $Z_n$ and $b_n = n$, hence
$n^{-1}(Z_n - E Z_n) \to 0$ in $L^2$ (and in probability). Alternatively, if also $\mathrm{Var}(X_k) \to 0$,
then Lemma 2.1.6 applies even for $Z_n$ and $b_n = n^{1/2}$.
Many limit theorems involve random variables of the form $S_n = \sum_{k=1}^n X_{n,k}$, that is,
the row sums of triangular arrays of random variables $\{X_{n,k} : k = 1, \ldots, n\}$. Here
are two such examples, both relying on Lemma 2.1.6.
Example 2.1.8 (Coupon collector's problem). Consider i.i.d. random variables $U_1, U_2, \ldots$, each distributed uniformly on $\{1, 2, \ldots, n\}$. Let $|\{U_1, \ldots, U_l\}|$ denote the number of distinct elements among the first $l$ variables, and $\tau^n_k = \inf\{l :
|\{U_1, \ldots, U_l\}| = k\}$ be the first time one has $k$ distinct values. We are interested
in the asymptotic behavior as $n \to \infty$ of $T_n = \tau^n_n$, the time it takes to have at least
one representative of each of the $n$ possible values.
To motivate the name assigned to this example, think of collecting a set of $n$
different coupons, where independently of all previous choices, each item is chosen
at random in such a way that each of the possible $n$ outcomes is equally likely.
Then, $T_n$ is the number of items one has to collect till having the complete set.
Setting $\tau^n_0 = 0$, let $X_{n,k} = \tau^n_k - \tau^n_{k-1}$ denote the additional time it takes to get
an item different from the first $k - 1$ distinct items collected. Note that $X_{n,k}$ has
a geometric distribution of success probability $q_{n,k} = 1 - \frac{k-1}{n}$, hence $E X_{n,k} = q_{n,k}^{-1}$
and $\mathrm{Var}(X_{n,k}) \le q_{n,k}^{-2}$ (see Example 1.3.69). Since
$$T_n = \tau^n_n - \tau^n_0 = \sum_{k=1}^n (\tau^n_k - \tau^n_{k-1}) = \sum_{k=1}^n X_{n,k}\,,$$
we have by linearity of the expectation that
$$E T_n = \sum_{k=1}^n \Big(1 - \frac{k-1}{n}\Big)^{-1} = n \sum_{\ell=1}^n \ell^{-1} = n(\log n + \gamma_n)\,,$$
where $\gamma_n = \sum_{\ell=1}^n \ell^{-1} - \int_1^n x^{-1}\, dx$ is between zero and one (by monotonicity of $x \mapsto
x^{-1}$). Further, $X_{n,k}$ is independent of each earlier waiting time $X_{n,j}$, $j = 1, \ldots, k-1$,
hence we have by Lemma 2.1.2 that
$$\mathrm{Var}(T_n) = \sum_{k=1}^n \mathrm{Var}(X_{n,k}) \le \sum_{k=1}^n \Big(1 - \frac{k-1}{n}\Big)^{-2} \le n^2 \sum_{\ell=1}^\infty \ell^{-2} = C n^2\,,$$
for some $C < \infty$. Applying Lemma 2.1.6 with $b_n = n \log n$, we deduce that
$$\frac{T_n - n(\log n + \gamma_n)}{n \log n} \stackrel{L^2}{\to} 0\,.$$
Since $\gamma_n / \log n \to 0$, it follows that
$$\frac{T_n}{n \log n} \stackrel{L^2}{\to} 1\,,$$
and $T_n/(n \log n) \to 1$ in probability as well.
One possible extension of Example 2.1.8 concerns infinitely many possible coupons.
That is,

Exercise 2.1.9. Suppose $\xi_k$ are i.i.d. positive integer valued random variables,
with $P(\xi_1 = i) = p_i > 0$ for $i = 1, 2, \ldots$. Let $D_l = |\{\xi_1, \ldots, \xi_l\}|$ denote the number
of distinct elements among the first $l$ variables.
(a) Show that $D_n \stackrel{a.s.}{\to} \infty$ as $n \to \infty$.
(b) Show that $n^{-1} E D_n \to 0$ as $n \to \infty$ and deduce that $n^{-1} D_n \stackrel{p}{\to} 0$.
Hint: Recall that $(1 - p)^n \ge 1 - np$ for any $p \in [0, 1]$ and $n \ge 0$.
Example 2.1.10 (An occupancy problem). Suppose we distribute at random
$r$ distinct balls among $n$ distinct boxes, where each of the possible $n^r$ assignments
of balls to boxes is equally likely. We are interested in the asymptotic behavior of
the number $N_n$ of empty boxes when $r/n \to \alpha \in [0, \infty]$, while $n \to \infty$. To this
end, let $A_i$ denote the event that the $i$-th box is empty, so $N_n = \sum_{i=1}^n I_{A_i}$. Since
$P(A_i) = (1 - 1/n)^r$ for each $i$, it follows that $E(n^{-1} N_n) = (1 - 1/n)^r \to e^{-\alpha}$.
Further, $E N_n^2 = \sum_{i,j=1}^n P(A_i \cap A_j)$ and $P(A_i \cap A_j) = (1 - 2/n)^r$ for each $i \ne j$.
Hence, splitting the sum according to $i = j$ or $i \ne j$, we see that
$$\mathrm{Var}(n^{-1} N_n) = \frac{1}{n^2} E N_n^2 - \Big(1 - \frac{1}{n}\Big)^{2r} = \frac{1}{n}\Big(1 - \frac{1}{n}\Big)^r + \Big(1 - \frac{1}{n}\Big)\Big(1 - \frac{2}{n}\Big)^r - \Big(1 - \frac{1}{n}\Big)^{2r}\,.$$
As $n \to \infty$, the first term on the right side goes to zero, and with $r/n \to \alpha$, each of
the other two terms converges to $e^{-2\alpha}$. Consequently, $\mathrm{Var}(n^{-1} N_n) \to 0$, so applying
Lemma 2.1.6 for $b_n = n$ we deduce that
$$\frac{N_n}{n} \to e^{-\alpha}$$
in $L^2$ and in probability.
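A quick simulation of the occupancy problem (illustrative, not from the text) confirms the limit: with $r = \alpha n$ balls in $n$ boxes, the empirical fraction of empty boxes is already close to $e^{-\alpha}$ for moderate $n$.

```python
import math, random

# Throw r = alpha * n balls uniformly into n boxes and report the fraction of
# boxes left empty; Example 2.1.10 gives the in-probability limit e^{-alpha}.
random.seed(3)
def empty_fraction(n, alpha):
    r = int(alpha * n)
    occupied = {random.randrange(n) for _ in range(r)}
    return 1 - len(occupied) / n

n = 100_000
for alpha in (0.5, 1.0, 2.0):
    print(alpha, round(empty_fraction(n, alpha), 2), round(math.exp(-alpha), 2))
    # empirical fraction vs. exact limit, matching to about two decimals
```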
2.1.2. Weak laws and truncation. Our next order of business is to extend
the weak law of large numbers for row sums $S_n$ in triangular arrays of independent
$X_{n,k}$ which lack a finite second moment. Of course, with $S_n$ no longer in $L^2$, there
is no way to establish convergence in $L^2$. So, we aim to retain only the convergence
in probability, using truncation. That is, we consider the row sums $\overline{S}_n$ for the
truncated array $\overline{X}_{n,k} = X_{n,k} I_{|X_{n,k}| \le b_n}$, with $b_n \to \infty$ slowly enough to control the
variance of $\overline{S}_n$ and fast enough for $P(S_n \ne \overline{S}_n) \to 0$. As we next show, this gives
the convergence in probability for $\overline{S}_n$ which translates to the same convergence result
for $S_n$.

Theorem 2.1.11 (Weak law for triangular arrays). Suppose that for each
$n$, the random variables $X_{n,k}$, $k = 1, \ldots, n$, are pairwise independent. Let $\overline{X}_{n,k} =
X_{n,k} I_{|X_{n,k}| \le b_n}$ for non-random $b_n > 0$ such that as $n \to \infty$ both
(a) $\sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0$,
and
(b) $b_n^{-2} \sum_{k=1}^n \mathrm{Var}(\overline{X}_{n,k}) \to 0$.
Then, $b_n^{-1}(S_n - a_n) \stackrel{p}{\to} 0$ as $n \to \infty$, where $S_n = \sum_{k=1}^n X_{n,k}$ and $a_n = \sum_{k=1}^n E\overline{X}_{n,k}$.
Proof. Let $\overline{S}_n = \sum_{k=1}^n \overline{X}_{n,k}$. Clearly, for any $\epsilon > 0$,
$$\Big\{ \frac{|S_n - a_n|}{b_n} > \epsilon \Big\} \subseteq \big\{ S_n \ne \overline{S}_n \big\} \cup \Big\{ \frac{|\overline{S}_n - a_n|}{b_n} > \epsilon \Big\}\,.$$
Consequently,
$$\text{(2.1.2)}\qquad P\Big( \frac{|S_n - a_n|}{b_n} > \epsilon \Big) \le P(S_n \ne \overline{S}_n) + P\Big( \frac{|\overline{S}_n - a_n|}{b_n} > \epsilon \Big)\,.$$
To bound the first term, note that our condition (a) implies that as $n \to \infty$,
$$P(S_n \ne \overline{S}_n) \le P\Big( \bigcup_{k=1}^n \{\overline{X}_{n,k} \ne X_{n,k}\} \Big) \le \sum_{k=1}^n P(\overline{X}_{n,k} \ne X_{n,k}) = \sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0\,.$$
Turning to bound the second term in (2.1.2), recall that pairwise independence
is preserved under truncation, hence $\overline{X}_{n,k}$, $k = 1, \ldots, n$, are uncorrelated random
variables (to convince yourself, apply (1.4.12) for the appropriate functions). Thus,
an application of Lemma 2.1.2 yields that as $n \to \infty$,
$$\mathrm{Var}(b_n^{-1} \overline{S}_n) = b_n^{-2} \sum_{k=1}^n \mathrm{Var}(\overline{X}_{n,k}) \to 0\,,$$
by our condition (b). Since $a_n = E\overline{S}_n$, from Chebyshev's inequality we deduce that
for any fixed $\epsilon > 0$,
$$P\Big( \frac{|\overline{S}_n - a_n|}{b_n} > \epsilon \Big) \le \epsilon^{-2}\, \mathrm{Var}(b_n^{-1} \overline{S}_n) \to 0\,,$$
as $n \to \infty$. In view of (2.1.2), this completes the proof of the theorem. $\square$
Specializing the weak law of Theorem 2.1.11 to a single sequence yields the following.

Proposition 2.1.12 (Weak law of large numbers). Consider i.i.d. random
variables $X_i$, such that $x P(|X_1| > x) \to 0$ as $x \to \infty$. Then, $n^{-1} S_n - \mu_n \stackrel{p}{\to} 0$,
where $S_n = \sum_{i=1}^n X_i$ and $\mu_n = E[X_1 I_{\{|X_1| \le n\}}]$.
Proof. We get the result as an application of Theorem 2.1.11 for $X_{n,k} = X_k$
and $b_n = n$, in which case $a_n = n \mu_n$. Turning to verify condition (a) of this
theorem, note that
$$\sum_{k=1}^n P(|X_{n,k}| > n) = n P(|X_1| > n) \to 0$$
as $n \to \infty$, by our assumption. Thus, all that remains to do is to verify that
condition (b) of Theorem 2.1.11 holds here. This amounts to showing that as
$n \to \infty$,
$$\Delta_n = n^{-2} \sum_{k=1}^n \mathrm{Var}(\overline{X}_{n,k}) = n^{-1} \mathrm{Var}(\overline{X}_{n,1}) \to 0\,.$$
Recall that for any R.V. $Z$,
$$\mathrm{Var}(Z) = E Z^2 - (E Z)^2 \le E|Z|^2 = \int_0^\infty 2y\, P(|Z| > y)\, dy$$
(see part (a) of Lemma 1.4.31 for the right identity). Considering $Z = \overline{X}_{n,1} =
X_1 I_{\{|X_1| \le n\}}$ for which $P(|Z| > y) = P(|X_1| > y) - P(|X_1| > n) \le P(|X_1| > y)$
when $0 < y < n$ and $P(|Z| > y) = 0$ when $y \ge n$, we deduce that
$$\Delta_n = n^{-1} \mathrm{Var}(Z) \le n^{-1} \int_0^n g(y)\, dy\,,$$
where by our assumption, $g(y) = 2y\, P(|X_1| > y) \to 0$ for $y \to \infty$. Further, the
non-negative Borel function $g(y) \le 2y$ is then uniformly bounded on $[0, \infty)$, hence
$n^{-1} \int_0^n g(y)\, dy \to 0$ as $n \to \infty$ (c.f. Exercise 1.3.52). Verifying that $\Delta_n \to 0$, we
established condition (b) of Theorem 2.1.11 and thus completed the proof of the
proposition. $\square$
Remark. The condition $x P(|X_1| > x) \to 0$ for $x \to \infty$ is indeed necessary for the
existence of non-random $\mu_n$ such that $n^{-1} S_n - \mu_n \stackrel{p}{\to} 0$ (c.f. [Fel71, Pages 234–236]
for a proof).
Exercise 2.1.13. Let $X_i$ be i.i.d. with $P(X_1 = (-1)^k k) = 1/(c k^2 \log k)$ for
integers $k \ge 2$ and a normalization constant $c = \sum_k 1/(k^2 \log k)$. Show that
$E|X_1| = \infty$, but there is a non-random $\mu < \infty$ such that $n^{-1} S_n \stackrel{p}{\to} \mu$.
As a corollary to Proposition 2.1.12 we next show that $n^{-1} S_n \stackrel{p}{\to} m_X$ as soon as
the i.i.d. random variables $X_i$ are in $L^1$.

Corollary 2.1.14. Consider $S_n = \sum_{k=1}^n X_k$ for i.i.d. random variables $X_i$ such
that $E|X_1| < \infty$. Then, $n^{-1} S_n \stackrel{p}{\to} E X_1$ as $n \to \infty$.
Proof. In view of Proposition 2.1.12, it suffices to show that if $E|X_1| < \infty$,
then both $n P(|X_1| > n) \to 0$ and $E X_1 - \mu_n = E[X_1 I_{\{|X_1| > n\}}] \to 0$ as $n \to \infty$.
To this end, recall that $E|X_1| < \infty$ implies that $P(|X_1| < \infty) = 1$ and hence
the sequence $X_1 I_{\{|X_1| > n\}}$ converges to zero a.s. and is bounded by the integrable
$|X_1|$. Thus, by dominated convergence $E[X_1 I_{\{|X_1| > n\}}] \to 0$ as $n \to \infty$. Applying
dominated convergence for the sequence $n I_{\{|X_1| > n\}}$ (which also converges a.s. to
zero and is bounded by the integrable $|X_1|$), we deduce that $n P(|X_1| > n) =
E[n I_{\{|X_1| > n\}}] \to 0$ when $n \to \infty$, thus completing the proof of the corollary. $\square$
We conclude this section by considering an example for which $E|X_1| = \infty$ and
Proposition 2.1.12 does not apply, but nevertheless, Theorem 2.1.11 allows us to
deduce that $c_n^{-1} S_n \stackrel{p}{\to} 1$ for some $c_n$ such that $c_n / n \to \infty$.

Example 2.1.15. Let $X_i$ be i.i.d. random variables such that $P(X_1 = 2^j) =
2^{-j}$ for $j = 1, 2, \ldots$. This has the interpretation of a game, where in each of its
independent rounds you win $2^j$ dollars if it takes exactly $j$ tosses of a fair coin
to get the first Head. This example is called the St. Petersburg paradox, since
though $E X_1 = \infty$, you clearly would not pay an infinite amount just in order to
play this game. Applying Theorem 2.1.11 we find that one should be willing to pay
roughly $n \log_2 n$ dollars for playing $n$ rounds of this game, since $S_n/(n \log_2 n) \stackrel{p}{\to} 1$
as $n \to \infty$. Indeed, the conditions of Theorem 2.1.11 apply for $b_n = 2^{m_n}$ provided
the integers $m_n$ are such that $m_n - \log_2 n \to \infty$. Taking $m_n \approx \log_2 n + \log_2(\log_2 n)$
implies that $b_n \approx n \log_2 n$ and $a_n/(n \log_2 n) = m_n / \log_2 n \to 1$ as $n \to \infty$, with the
consequence of $S_n/(n \log_2 n) \stackrel{p}{\to} 1$ (for details see [Dur10, Example 2.2.7]).
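A simulation of the St. Petersburg game (an illustrative sketch, not from the text) shows the normalization at work, though the ratio fluctuates noticeably because of the heavy tail and the slow convergence:

```python
import math, random

# One round pays 2^j with probability 2^{-j} (j fair-coin tosses until the
# first Head); compare S_n with the n log2 n scaling of Example 2.1.15.
random.seed(4)
def one_round():
    j = 1
    while random.random() < 0.5:
        j += 1
    return 2 ** j

n = 100_000
s = sum(one_round() for _ in range(n))
print(round(s / (n * math.log2(n)), 2))  # of order 1 (heavy tail, slow convergence)
```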
2.2. The Borel–Cantelli lemmas
When dealing with asymptotic theory, we often wish to understand the relation
between countably many events $A_n$ in the same probability space. The two Borel–Cantelli lemmas of Subsection 2.2.1 provide information on the probability of the set
of outcomes that are in infinitely many of these events, based only on $P(A_n)$. There
are numerous applications of these lemmas, a few of which are given in Subsection
2.2.2 while many more appear in later sections of these notes.
2.2.1. Limit superior and the Borel–Cantelli lemmas. We are often interested in the limits superior and limits inferior of a sequence of events $A_n$ on
the same measurable space $(\Omega, \mathcal{F})$.

Definition 2.2.1. For a sequence of subsets $A_n \subseteq \Omega$, define
$$A_\infty := \limsup_n A_n = \bigcap_{m=1}^\infty \bigcup_{\ell=m}^\infty A_\ell = \{\omega : \omega \in A_n \text{ for infinitely many } n\text{'s}\} = \{\omega : \omega \in A_n \text{ infinitely often}\} = \{A_n\ \text{i.o.}\}\,.$$
Similarly,
$$\liminf_n A_n = \bigcup_{m=1}^\infty \bigcap_{\ell=m}^\infty A_\ell = \{\omega : \omega \in A_n \text{ for all but finitely many } n\text{'s}\} = \{\omega : \omega \in A_n \text{ eventually}\} = \{A_n\ \text{ev.}\}\,.$$
Remark. Note that if $A_n \in \mathcal{F}$ are measurable, then so are $\limsup A_n$ and $\liminf A_n$.
By DeMorgan's law, we have that $\{A_n\ \text{ev.}\} = \{A_n^c\ \text{i.o.}\}^c$, that is, $\omega \in A_n$ for all
$n$ large enough if and only if $\omega \in A_n^c$ for finitely many $n$'s.
Also, if $\omega \in A_n$ eventually, then certainly $\omega \in A_n$ infinitely often, that is,
$$\liminf A_n \subseteq \limsup A_n\,.$$
The notations $\limsup A_n$ and $\liminf A_n$ are due to the intimate connection of
these sets to the limsup and liminf of the indicator functions on the sets $A_n$. For
example,
$$\limsup_{n \to \infty} I_{A_n}(\omega) = I_{\limsup A_n}(\omega)\,,$$
since for a given $\omega \in \Omega$, the limsup on the left side equals 1 if and only if the
sequence $n \mapsto I_{A_n}(\omega)$ contains an infinite subsequence of ones. In other words, if
and only if the given $\omega$ is in infinitely many of the sets $A_n$. Similarly,
$$\liminf_{n \to \infty} I_{A_n}(\omega) = I_{\liminf A_n}(\omega)\,,$$
since for a given $\omega \in \Omega$, the liminf on the left side equals 1 if and only if there
are only finitely many zeros in the sequence $n \mapsto I_{A_n}(\omega)$ (for otherwise, their limit
inferior is zero). In other words, if and only if the given $\omega$ is in $A_n$ for all $n$ large
enough.
In view of the preceding remark, Fatou's lemma yields the following relations.

Exercise 2.2.2. Prove that for any sequence $A_n \in \mathcal{F}$,
$$P(\limsup A_n) \ge \limsup_n P(A_n) \ge \liminf_n P(A_n) \ge P(\liminf A_n)\,.$$
Show that the right-most inequality holds even when the probability measure is replaced by an arbitrary measure $\mu(\cdot)$, but the left-most inequality may then fail unless
$\mu(\bigcup_{k \ge n} A_k) < \infty$ for some $n$.
Practice your understanding of the concepts of limsup and liminf of sets by solving
the following exercise.

Exercise 2.2.3. Assume that $P(\limsup A_n) = 1$ and $P(\liminf B_n) = 1$. Prove
that $P(\limsup (A_n \cap B_n)) = 1$. What happens if the condition on $B_n$ is weakened
to $P(\limsup B_n) = 1$?
Our next result, called the first Borel–Cantelli lemma, states that if the probabilities $P(A_n)$ of the individual events $A_n$ converge to zero fast enough, then almost
surely, $A_n$ occurs for only finitely many values of $n$, that is, $P(A_n\ \text{i.o.}) = 0$. This
lemma is extremely useful, as the possibly complex relation between the different
events $A_n$ is irrelevant for its conclusion.

Lemma 2.2.4 (Borel–Cantelli I). Suppose $A_n \in \mathcal{F}$ and $\sum_{n=1}^\infty P(A_n) < \infty$. Then,
$P(A_n\ \text{i.o.}) = 0$.

Proof. Define $N(\omega) = \sum_{k=1}^\infty I_{A_k}(\omega)$. By the monotone convergence theorem
and our assumption,
$$E[N(\omega)] = E\Big[\sum_{k=1}^\infty I_{A_k}(\omega)\Big] = \sum_{k=1}^\infty P(A_k) < \infty\,.$$
Since the expectation of $N$ is finite, certainly $P(\{\omega : N(\omega) = \infty\}) = 0$. Noting
that the set $\{\omega : N(\omega) = \infty\}$ is merely $\{\omega : \omega \in A_n\ \text{i.o.}\}$, the conclusion
$P(A_n\ \text{i.o.}) = 0$ of the lemma follows. $\square$
Our next result, left for the reader to prove, relaxes somewhat the conditions of
Lemma 2.2.4.

Exercise 2.2.5. Suppose $A_n \in \mathcal{F}$ are such that $\sum_{n=1}^\infty P(A_n \cap A_{n+1}^c) < \infty$ and
$P(A_n) \to 0$. Show that then $P(A_n\ \text{i.o.}) = 0$.
The first Borel–Cantelli lemma states that if the series $\sum_n P(A_n)$ converges then
almost every $\omega$ is in only finitely many sets $A_n$. If $P(A_n) \to 0$, but the series $\sum_n P(A_n)$
diverges, then the event $\{A_n\ \text{i.o.}\}$ might or might not have positive probability. In
this sense, Borel–Cantelli I is not tight, as the following example demonstrates.
Example 2.2.6. Consider the uniform probability measure $U$ on $((0, 1], \mathcal{B}_{(0,1]})$, and
the events $A_n = (0, 1/n]$. Then $A_n \downarrow \emptyset$, so $\{A_n\ \text{i.o.}\} = \emptyset$, but $U(A_n) = 1/n$, so
$\sum_n U(A_n) = \infty$ and Borel–Cantelli I does not apply.
Recall also Example 1.3.25, showing the existence of $A_n = (t_n, t_n + 1/n]$ such that
$U(A_n) = 1/n$ while $\{A_n\ \text{i.o.}\} = (0, 1]$. Thus, in general the probability of $\{A_n\ \text{i.o.}\}$
depends on the relation between the different events $A_n$.
As seen in the preceding example, the divergence of the series $\sum_n P(A_n)$ is not
sufficient for the occurrence of a set of positive probability of $\omega$ values, each of
which is in infinitely many events $A_n$. However, upon adding the assumption that
the events $A_n$ are mutually independent (flagrantly not the case in Example 2.2.6),
we conclude that almost all $\omega$ must be in infinitely many of the events $A_n$:

Lemma 2.2.7 (Borel–Cantelli II). Suppose $A_n \in \mathcal{F}$ are mutually independent
and $\sum_{n=1}^\infty P(A_n) = \infty$. Then, necessarily $P(A_n\ \text{i.o.}) = 1$.
Proof. Fix $0 < m < n < \infty$. Using the mutual independence of the events $A_\ell$
we have that
$$P\Big( \bigcap_{\ell=m}^n A_\ell^c \Big) = \prod_{\ell=m}^n P(A_\ell^c) = \prod_{\ell=m}^n (1 - P(A_\ell)) \le \prod_{\ell=m}^n e^{-P(A_\ell)} = \exp\Big( -\sum_{\ell=m}^n P(A_\ell) \Big)\,.$$
As $n \to \infty$, the set $\bigcap_{\ell=m}^n A_\ell^c$ shrinks to $\bigcap_{\ell=m}^\infty A_\ell^c$, hence
$$P\Big( \bigcap_{\ell=m}^\infty A_\ell^c \Big) \le \exp\Big( -\sum_{\ell=m}^\infty P(A_\ell) \Big) = 0\,.$$
Take the complement to see that $P(B_m) = 1$ for $B_m = \bigcup_{\ell=m}^\infty A_\ell$ and any $m$.
Since $B_m \downarrow \{A_n\ \text{i.o.}\}$ as $m \to \infty$, it follows that $P(A_n\ \text{i.o.}) = \lim_{m \to \infty} P(B_m) = 1$. $\square$

Exercise. Suppose $A_k \in \mathcal{F}$ are such that $\sum_k P(A_k) = \infty$ and
$$\limsup_{n \to \infty} \Big( \sum_{k=1}^n P(A_k) \Big)^2 \Big/ \Big( \sum_{1 \le j, k \le n} P(A_j \cap A_k) \Big) = \alpha > 0\,.$$
Prove that then $P(A_n\ \text{i.o.}) \ge \alpha$.
Hint: Consider part (a) of Exercise 1.3.21 for $Y_n = \sum_{k \le n} I_{A_k}$ and $a_n = E Y_n$.
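The dichotomy between the two lemmas can be illustrated by simulation (a sketch with hypothetical choices of $P(A_n)$, not from the text): for independent events with $P(A_n) = n^{-2}$ only finitely many occur, while for $P(A_n) = n^{-1}$ the number of occurrences keeps growing, roughly like $\log N$.

```python
import random

# Count how many of the independent events A_1, ..., A_N occur when
# P(A_n) = p(n): a summable series (1/n^2) gives a small, essentially fixed
# count, while the divergent series (1/n) gives a count growing with N.
random.seed(5)
def occurrences(p, N):
    return sum(random.random() < p(n) for n in range(1, N + 1))

for N in (10**3, 10**5):
    print(N, occurrences(lambda n: 1 / n**2, N), occurrences(lambda n: 1 / n, N))
```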
2.2.2. Applications. In the sequel we explore various applications of the two
Borel–Cantelli lemmas. In doing so, unless explicitly stated otherwise, all events
and random variables are defined on the same probability space.
We know that the convergence a.s. of $X_n$ to $X_\infty$ implies the convergence in probability of $X_n$ to $X_\infty$, but not vice versa (see Exercise 1.3.23 and Example 1.3.25).
As our first application of Borel–Cantelli I, we refine the relation between these
two modes of convergence, showing that convergence in probability is equivalent to
convergence almost surely along subsequences.
Theorem 2.2.10. X
n
p
X
as k .
We start the proof of this theorem with a simple analysis lemma.

Lemma 2.2.11. Let $y_n$ be a sequence in a topological space. If every subsequence $y_{n(m)}$ has a further sub-subsequence $y_{n(m_k)}$ that converges to $y$, then $y_n \to y$.

Proof. If $y_n$ does not converge to $y$, then there exists an open set $G$ containing $y$ and a subsequence $y_{n(m)}$ such that $y_{n(m)} \notin G$ for all $m$. But clearly, then we cannot find a further subsequence of $y_{n(m)}$ that converges to $y$.
Remark. Applying Lemma 2.2.11 to $y_n = E|X_n - X_\infty|$ we deduce that $X_n \stackrel{L^1}{\to} X_\infty$ if and only if every subsequence $X_{n(m)}$ has a further sub-subsequence $X_{n(m_k)}$ that converges to $X_\infty$ in $L^1$ as $k \to \infty$.
Proof of Theorem 2.2.10. First, we show sufficiency, assuming $X_n \stackrel{p}{\to} X_\infty$. Fix a subsequence $n(m)$ and $\epsilon_k \downarrow 0$. By the definition of convergence in probability, there exists a sub-subsequence $n(m_k)$ such that $P\big(|X_{n(m_k)} - X_\infty| > \epsilon_k\big) \le 2^{-k}$. Call this sequence of events $A_k = \{\omega : |X_{n(m_k)}(\omega) - X_\infty(\omega)| > \epsilon_k\}$. Then the series $\sum_k P(A_k)$ converges. Therefore, by Borel–Cantelli I, $P(\limsup A_k) = 0$. For any $\omega \notin \limsup A_k$ there are only finitely many values of $k$ such that $|X_{n(m_k)} - X_\infty| > \epsilon_k$, or alternatively, $|X_{n(m_k)} - X_\infty| \le \epsilon_k$ for all $k$ large enough. Since $\epsilon_k \downarrow 0$, it follows that $X_{n(m_k)}(\omega) \to X_\infty(\omega)$ when $\omega \notin \limsup A_k$, that is, with probability one.

Conversely, fix $\epsilon > 0$ and let $y_n = P(|X_n - X_\infty| > \epsilon)$. Since convergence a.s. implies convergence in probability, our assumption implies that every subsequence $y_{n(m)}$ has a further sub-subsequence $y_{n(m_k)}$ that converges to zero. By Lemma 2.2.11 it follows that $y_n \to 0$, and as $\epsilon > 0$ was arbitrary, this amounts to $X_n \stackrel{p}{\to} X_\infty$.
It is not hard to check that convergence almost surely is invariant under application of an a.s. continuous mapping.

Exercise 2.2.12. Let $g : \mathbb{R} \to \mathbb{R}$ be a Borel function and denote by $D_g$ its set of discontinuities. Show that if $X_n \stackrel{a.s.}{\to} X_\infty$ and $P(X_\infty \in D_g) = 0$, then $g(X_n) \stackrel{a.s.}{\to} g(X_\infty)$.

Corollary 2.2.13. Suppose $X_n \stackrel{p}{\to} X_\infty$ and $P(X_\infty \in D_g) = 0$. Then, $g(X_n) \stackrel{p}{\to} g(X_\infty)$ (and if $g$ is also bounded, then $Eg(X_n) \to Eg(X_\infty)$).
Proof. Fix a subsequence $X_{n(m)}$. By Theorem 2.2.10 there exists a sub-subsequence $X_{n(m_k)}$ such that $P(A) = 1$ for $A = \{\omega : X_{n(m_k)}(\omega) \to X_\infty(\omega) \text{ as } k \to \infty\}$. Let $B = \{\omega : X_\infty(\omega) \notin D_g\}$, noting that by assumption $P(B) = 1$. For any $\omega \in A \cap B$ we have $g(X_{n(m_k)}(\omega)) \to g(X_\infty(\omega))$, so $g(X_{n(m_k)}) \stackrel{a.s.}{\to} g(X_\infty)$, and since the subsequence $n(m)$ was arbitrary, Theorem 2.2.10 yields that $g(X_n) \stackrel{p}{\to} g(X_\infty)$. Finally, if $g$ is bounded, then the collection $\{g(X_n)\}$ is U.I. yielding, by Vitali's convergence theorem, its convergence in $L^1$ (and hence that $Eg(X_n) \to Eg(X_\infty)$).
You are next to extend the scope of Theorem 2.2.10 and the continuous mapping of Corollary 2.2.13 to random variables taking values in a separable metric space.

Exercise 2.2.14. Recall the definition of convergence in probability in a separable metric space $(S, \rho)$ as in Remark 1.3.24.
(a) Extend the proof of Theorem 2.2.10 to apply for any $(S, \mathcal{B}_S)$-valued random variables $X_n$, $n \le \infty$ (and in particular for $\mathbb{R}$-valued variables).
(b) Denote by $D_g$ the set of discontinuities of a Borel measurable $g : S \to \mathbb{R}$ (defined similarly to Exercise 1.2.28, where real-valued functions are considered). Suppose $X_n \stackrel{p}{\to} X_\infty$ and $P(X_\infty \in D_g) = 0$. Show that then $g(X_n) \stackrel{p}{\to} g(X_\infty)$.
The following result in analysis is obtained by combining the continuous mapping of Corollary 2.2.13 with the weak law of large numbers.

Exercise 2.2.15 (Inverting Laplace transforms). The Laplace transform of a bounded, continuous function $h(x)$ on $[0, \infty)$ is the function $L_h(s) = \int_0^\infty e^{-sx} h(x)\,dx$ on $(0, \infty)$.
(a) Show that for any $s > 0$ and positive integer $k$,
\[
\frac{(-1)^{k-1} s^k L_h^{(k-1)}(s)}{(k-1)!} = \int_0^\infty e^{-sx}\, \frac{s^k x^{k-1}}{(k-1)!}\, h(x)\,dx = E[h(W_k)]\,,
\]
where $L_h^{(k-1)}(\cdot)$ denotes the $(k-1)$-th derivative of the function $L_h(\cdot)$ and $W_k$ has the gamma density with parameters $k$ and $s$.
(b) Recall Exercise 1.4.46 that for $s = n/y$ the law of $W_n$ coincides with the law of $n^{-1}\sum_{i=1}^n T_i$ where $T_i \ge 0$ are i.i.d. random variables, each having the exponential distribution of parameter $1/y$ (with $ET_1 = y$ and finite moments of all order, c.f. Example 1.3.68). Deduce that the inversion formula
\[
h(y) = \lim_{n\to\infty} \frac{(-1)^{n-1} (n/y)^n}{(n-1)!}\, L_h^{(n-1)}(n/y)\,,
\]
holds for any $y > 0$.
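The probabilistic reading of the inversion formula, $h(y) \approx E[h(W_n)]$ with $W_n$ a gamma variable of mean $y$ and variance $y^2/n$, lends itself to a quick numerical sanity check. A minimal sketch, where the test function $h(x) = \cos x$, the point $y$, the sample sizes and the seed are all arbitrary choices (not from the text):

```python
import math
import random

def gamma_mean_h(h, y, n, samples=20000, seed=0):
    """Monte Carlo estimate of E[h(W_n)] where W_n ~ Gamma(shape=n, scale=y/n),
    i.e. the law of an average of n i.i.d. exponentials of mean y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += h(rng.gammavariate(n, y / n))
    return total / samples

h = math.cos
y = 1.3
# As n grows, W_n concentrates at its mean y, so E[h(W_n)] -> h(y).
for n in (5, 50, 500):
    print(n, gamma_mean_h(h, y, n))
```

Since $W_n$ has variance $y^2/n$, the estimates approach $h(y) = \cos(1.3)$ as $n$ increases.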
The next application of Borel–Cantelli I provides our first strong law of large numbers.

Proposition 2.2.16. Suppose $E[Z_n^2] \le C$ for some $C < \infty$ and all $n$. Then, $n^{-1} Z_n \stackrel{a.s.}{\to} 0$ as $n \to \infty$.
Proof. Fixing $\epsilon > 0$ let $A_k = \{\omega : |k^{-1} Z_k(\omega)| > \epsilon\}$ for $k = 1, 2, \ldots$. Then, by Chebyshev's inequality and our assumption,
\[
P(A_k) = P(\{\omega : |Z_k(\omega)| \ge \epsilon k\}) \le \frac{E(Z_k^2)}{(\epsilon k)^2} \le \frac{C}{\epsilon^2 k^2}\,.
\]
Since $\sum_k k^{-2} < \infty$, it follows by Borel–Cantelli I that $P(A_\epsilon) = 0$, where $A_\epsilon = \{\omega : |k^{-1} Z_k(\omega)| > \epsilon \text{ for infinitely many values of } k\}$. Hence, for any fixed $\epsilon > 0$, with probability one $k^{-1}|Z_k(\omega)| \le \epsilon$ for all large enough $k$, that is, $\limsup_n n^{-1}|Z_n(\omega)| \le \epsilon$ a.s. Considering a sequence $\epsilon_m \downarrow 0$ we conclude that $n^{-1} Z_n \to 0$ for $n \to \infty$ and a.e. $\omega$.
Exercise 2.2.17. Let $S_n = \sum_{l=1}^n X_l$, where $X_i$ are i.i.d. random variables with $EX_1 = 0$ and $EX_1^4 < \infty$.
(a) Show that $n^{-1} S_n \stackrel{a.s.}{\to} 0$.
Hint: Verify that Proposition 2.2.16 applies for $Z_n = n^{-1} S_n^2$.
(b) Show that $n^{-1} D_n \stackrel{a.s.}{\to} 0$ where $D_n$ denotes the number of distinct integers among $\{\xi_k, k \le n\}$ and $\xi_k$ are i.i.d. integer valued random variables.
Hint: $D_n \le 2M + \sum_{k=1}^n I_{|\xi_k| \ge M}$.
In contrast, here is an example where the empirical averages of integrable, zero mean independent variables do not converge to zero. Of course, the trick is to have non-identical distributions, with the bulk of the probability drifting to negative one.

Exercise 2.2.18. Suppose $X_i$ are mutually independent random variables such that $P(X_n = n^2 - 1) = 1 - P(X_n = -1) = n^{-2}$ for $n = 1, 2, \ldots$. Show that $EX_n = 0$ for all $n$, while $n^{-1}\sum_{i=1}^n X_i \stackrel{a.s.}{\to} -1$ for $n \to \infty$.
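A short simulation makes the mechanism of this exercise visible: each $X_n$ has mean zero, yet by Borel–Cantelli I only finitely many of the rare values $n^2 - 1$ ever occur, so late samples are almost all $-1$. A sketch (the sample size and seed are arbitrary):

```python
from fractions import Fraction
import random

def sample_x(n, rng):
    """One draw of X_n: equals n^2 - 1 with probability 1/n^2, else -1."""
    return n * n - 1 if rng.random() < 1.0 / (n * n) else -1

# Each X_n has mean zero (exact arithmetic check):
for n in (1, 2, 10, 97):
    mean = Fraction(n * n - 1, n * n) - (1 - Fraction(1, n * n))
    assert mean == 0

# Yet the empirical average drifts to -1, since a.s. only finitely
# many of the rare spikes n^2 - 1 occur (sum of 1/n^2 is finite).
rng = random.Random(1)
xs = [sample_x(n, rng) for n in range(1, 100_001)]
print(sum(xs) / len(xs))  # typically close to -1
```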
Next we have few other applications of Borel–Cantelli I, starting with some additional properties of convergence a.s.

Exercise 2.2.19. Show that for any R.V.-s $X_n$:
(a) $X_n \stackrel{a.s.}{\to} 0$ if and only if $P(|X_n| > \epsilon \text{ i.o.}) = 0$ for each $\epsilon > 0$.
(b) There exist non-random constants $b_n$ such that $X_n/b_n \stackrel{a.s.}{\to} 0$.

Exercise 2.2.20. Show that if $W_n > 0$ and $EW_n \le 1$ for every $n$, then almost surely,
\[
\limsup_{n\to\infty} n^{-1}\log W_n \le 0\,.
\]
Our next example demonstrates how Borel–Cantelli I is typically applied in the study of the asymptotic growth of running maxima of random variables.

Example 2.2.21 (Head runs). Let $\{X_k, k \in \mathbb{Z}\}$ be a two-sided sequence of i.i.d. $\{0,1\}$-valued random variables, with $P(X_1 = 1) = P(X_1 = 0) = 1/2$. With $\ell_m = \max\{i : X_{m-i+1} = \cdots = X_m = 1\}$ denoting the length of the run of 1-s going backwards from time $m$, we are interested in the asymptotics of the longest such run during $1, 2, \ldots, n$, that is,
\[
L_n = \max\{\ell_m : m = 1, \ldots, n\} = \max\{m - k : X_{k+1} = \cdots = X_m = 1 \text{ for some } m = 1, \ldots, n\}\,.
\]
Noting that $\ell_m + 1$ has a geometric distribution of success probability $p = 1/2$, we deduce by an application of Borel–Cantelli I that for each $\epsilon > 0$, with probability one, $\ell_n \le (1+\epsilon)\log_2 n$ for all $n$ large enough. Hence, on the same set of probability one, we have $N = N(\omega)$ finite such that $L_n \le \max(L_N, (1+\epsilon)\log_2 n)$ for all $n \ge N$. Dividing by $\log_2 n$ and considering $n \to \infty$ followed by $\epsilon_k \downarrow 0$, this implies that
\[
\limsup_{n\to\infty} \frac{L_n}{\log_2 n} \le 1 \quad a.s.
\]
For each fixed $\epsilon > 0$ let $A_n = \{L_n < k_n\}$ for $k_n = [(1-\epsilon)\log_2 n]$. Noting that
\[
A_n \subseteq \bigcap_{i=1}^{m_n} B_i^c\,,
\]
for $m_n = [n/k_n]$ and the independent events $B_i = \{X_{(i-1)k_n+1} = \cdots = X_{i k_n} = 1\}$, yields a bound of the form $P(A_n) \le \exp(-n^\epsilon/(2\log_2 n))$ for all $n$ large enough (c.f. [Dur10, Example 2.3.3] for details). Since $\sum_n P(A_n) < \infty$, we have that
\[
\liminf_{n\to\infty} \frac{L_n}{\log_2 n} \ge 1 \quad a.s.
\]
by yet another application of Borel–Cantelli I, followed by $\epsilon_k \downarrow 0$. We thus conclude that
\[
\frac{L_n}{\log_2 n} \stackrel{a.s.}{\to} 1\,.
\]
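The $\log_2 n$ scale of the longest head run is easy to see in simulation. A minimal sketch (the sample size $n = 2^{17}$ and the seed are arbitrary choices):

```python
import math
import random

def longest_run(bits):
    """Length of the longest run of 1-s in a 0/1 sequence."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == 1 else 0
        best = max(best, cur)
    return best

rng = random.Random(2)
n = 2 ** 17
bits = [rng.randint(0, 1) for _ in range(n)]
L_n = longest_run(bits)
print(L_n, L_n / math.log2(n))  # the ratio should be near 1
```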
The next exercise combines both Borel–Cantelli lemmas to provide the 0–1 law for another problem about head runs.

Exercise 2.2.22. Let $\{X_k\}$ be a sequence of i.i.d. $\{0,1\}$-valued random variables, with $P(X_1 = 1) = p$ and $P(X_1 = 0) = 1 - p$. Let $A_k$ be the event that $X_m = \cdots = X_{m+k-1} = 1$ for some $2^k \le m \le 2^{k+1} - k$. Show that $P(A_k \text{ i.o.}) = 1$ if $p \ge 1/2$ and $P(A_k \text{ i.o.}) = 0$ if $p < 1/2$.
Hint: When $p \ge 1/2$ consider only $m = 2^k + (i-1)k$ for $i = 1, \ldots, [2^k/k]$.
Here are a few direct applications of the second Borel–Cantelli lemma.

Exercise 2.2.23. Suppose that $Z_k$ are i.i.d. random variables such that $P(Z_1 = z) < 1$ for any $z \in \mathbb{R}$.
(a) Show that $P(Z_k \text{ converges for } k \to \infty) = 0$.
(b) Determine the values of $\limsup_n (Z_n/\log n)$ and $\liminf_n (Z_n/\log n)$ in case $Z_k$ has the exponential distribution (of parameter one).
After deriving the classical bounds on the tail of the normal distribution, you use both Borel–Cantelli lemmas in bounding the fluctuations of the sums of i.i.d. standard normal variables.

Exercise 2.2.24. Let $\{G_i\}$ be i.i.d. standard normal random variables.
(a) Show that for any $x > 0$,
\[
(x^{-1} - x^{-3})e^{-x^2/2} \le \int_x^\infty e^{-y^2/2}\,dy \le x^{-1}e^{-x^2/2}\,.
\]
Many texts prove these estimates, for example see [Dur10, Theorem 1.2.3].
(b) Show that, with probability one,
\[
\limsup_{n\to\infty} \frac{G_n}{\sqrt{2\log n}} = 1\,.
\]
(c) Let $S_n = G_1 + \cdots + G_n$. Recall that $n^{-1/2}S_n$ has the standard normal distribution. Show that
\[
P\big(|S_n| < 2\sqrt{n\log n}, \text{ ev.}\big) = 1\,.
\]

Remark. Ignoring the dependence between the elements of the sequence $\{S_k\}$, the bound in part (c) of the preceding exercise is not tight. The definite result here is the law of the iterated logarithm (in short lil), which states that when the i.i.d. summands are of zero mean and variance one,
\[
(2.2.1) \qquad P\Big(\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} = 1\Big) = 1\,.
\]
We defer the derivation of (2.2.1) to Theorem 9.2.29, building on a similar lil for the Brownian motion (but, see [Bil95, Theorem 9.5] for a direct proof of (2.2.1), using both Borel–Cantelli lemmas).
The next exercise relates explicit integrability conditions for i.i.d. random variables to the asymptotics of their running maxima.

Exercise 2.2.25. Consider possibly $\overline{\mathbb{R}}$-valued, i.i.d. random variables $\{Y_i\}$ and their running maxima $M_n = \max_{k \le n} Y_k$.
(a) Using (2.3.4) if needed, show that $P(|Y_n| > n \text{ i.o.}) = 0$ if and only if $E[|Y_1|] < \infty$.
(b) Show that $n^{-1} Y_n \stackrel{a.s.}{\to} 0$ if and only if $E[|Y_1|] < \infty$.
(c) Show that $n^{-1} M_n \stackrel{a.s.}{\to} 0$ if and only if $E[(Y_1)_+] < \infty$ and $P(Y_1 > -\infty) > 0$.
(d) Show that $n^{-1} M_n \stackrel{p}{\to} 0$ if and only if $nP(Y_1 > n) \to 0$ and $P(Y_1 > -\infty) > 0$.
(e) Show that $n^{-1} Y_n \stackrel{p}{\to} 0$ if and only if $P(|Y_1| < \infty) = 1$.
In the following exercise, you combine Borel–Cantelli I and the variance computation of Lemma 2.1.2 to improve upon Borel–Cantelli II.

Exercise 2.2.26. Suppose $\sum_{n=1}^\infty P(A_n) = \infty$ for pairwise independent events $\{A_i\}$. Let $S_n = \sum_{i=1}^n I_{A_i}$ be the number of events occurring among the first $n$.
(a) Prove that $\mathrm{Var}(S_n) \le E(S_n)$ and deduce from it that $S_n/E(S_n) \stackrel{p}{\to} 1$.
(b) Applying Borel–Cantelli I show that $S_{n_k}/E(S_{n_k}) \stackrel{a.s.}{\to} 1$ as $k \to \infty$, where $n_k = \inf\{n : E(S_n) \ge k^2\}$.
(c) Show that $E(S_{n_{k+1}})/E(S_{n_k}) \to 1$ and since $n \mapsto S_n$ is non-decreasing, deduce that $S_n/E(S_n) \stackrel{a.s.}{\to} 1$.
2.3. STRONG LAW OF LARGE NUMBERS 85
Remark. Borel–Cantelli II is the a.s. convergence $S_n \to \infty$ for $n \to \infty$, which is a consequence of part (c) of the preceding exercise (since $ES_n \to \infty$).
We conclude this section with an example in which the asymptotic rate of growth of random variables of interest is obtained by an application of Exercise 2.2.26.

Example 2.2.27 (Record values). Let $\{X_i\}$ be a sequence of i.i.d. random variables with a continuous distribution function $F_X(x)$. The event $A_k = \{X_k > X_j, j = 1, \ldots, k-1\}$ represents the occurrence of a record at the $k$-th instance (for example, think of $X_k$ as an athlete's $k$-th distance jump). We are interested in the asymptotics of the count $R_n = \sum_{i=1}^n I_{A_i}$ of record events during the first $n$ instances. Because of the continuity of $F_X$ we know that a.s. the values of $X_i$, $i = 1, 2, \ldots$ are distinct. Further, rearranging the random variables $X_1, X_2, \ldots, X_n$ in a decreasing order induces a random permutation $\pi_n$ on $\{1, 2, \ldots, n\}$, where all $n!$ possible permutations are equally likely. From this it follows that $P(A_k) = P(\pi_k(k) = 1) = 1/k$, and though definitely not obvious at first sight, the events $A_k$ are mutually independent (see [Dur10, Example 2.3.2] for details). So, $ER_n = \log n + \gamma_n$ where $\gamma_n$ is between zero and one, and from Exercise 2.2.26 we deduce that $(\log n)^{-1} R_n \stackrel{a.s.}{\to} 1$ as $n \to \infty$. Note that this result is independent of the law of $X$, as long as the distribution function $F_X$ is continuous.
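Records are cheap to simulate, and the $\log n$ growth of the count shows up immediately. A sketch (the uniform law, the sample size and the seed are arbitrary; by the example, any continuous law gives the same behavior):

```python
import math
import random

def count_records(xs):
    """Number of strict running maxima (records) in the sequence xs."""
    records = 0
    best = float("-inf")
    for x in xs:
        if x > best:
            records += 1
            best = x
    return records

rng = random.Random(3)
n = 100_000
xs = [rng.random() for _ in range(n)]  # any continuous law works
print(count_records(xs), math.log(n))  # R_n is of order log n, about 11.5 here
```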
2.3. Strong law of large numbers

In Corollary 2.1.14 we got the classical weak law of large numbers, namely, the convergence in probability of the empirical averages $n^{-1}\sum_{i=1}^n X_i$ of i.i.d. integrable random variables $X_i$ to the mean $EX_1$. Assuming in addition that $EX_1^4 < \infty$, you used Borel–Cantelli I in Exercise 2.2.17 en-route to the corresponding strong law of large numbers, that is, replacing the convergence in probability with the stronger notion of convergence almost surely.

We provide here two approaches to the strong law of large numbers, both of which get rid of the unnecessary finite moment assumptions. Subsection 2.3.1 follows Etemadi's (1981) direct proof of this result via the subsequence method. Subsection 2.3.2 deals in a more systematic way with the convergence of random series, yielding the strong law of large numbers as one of its consequences.
2.3.1. The subsequence method. Etemadi's key observation is that it essentially suffices to consider non-negative $X_i$, for which upon proving the a.s. convergence along a not too sparse subsequence $n_l$, the interpolation to the whole sequence can be done by the monotonicity of $n \mapsto \sum_{i=1}^n X_i$. This is an example of a general approach to a.s. convergence, called the subsequence method, which you have already encountered in Exercise 2.2.26.

We thus start with the strong law for integrable, non-negative variables.

Proposition 2.3.1. Let $S_n = \sum_{i=1}^n X_i$ for non-negative, pairwise independent and identically distributed, integrable random variables $X_i$. Then, $n^{-1} S_n \stackrel{a.s.}{\to} EX_1$ as $n \to \infty$.
Proof. The proof progresses along the themes of Section 2.1, starting with the truncation $\overline{X}_k = X_k I_{X_k \le k}$ and its corresponding sums $\overline{S}_n = \sum_{i=1}^n \overline{X}_i$. Since $X_i$ are identically distributed and $x \mapsto P(|X_1| > x)$ is non-increasing, we have that
\[
\sum_{k=1}^\infty P(\overline{X}_k \ne X_k) = \sum_{k=1}^\infty P(|X_1| > k) \le \int_0^\infty P(|X_1| > x)\,dx = E|X_1| < \infty
\]
(see part (a) of Lemma 1.4.31 for the right-most identity and recall our assumption that $X_1$ is integrable). Thus, by Borel–Cantelli I, with probability one, $\overline{X}_k(\omega) = X_k(\omega)$ for all but finitely many $k$-s, in which case necessarily $\sup_n |\overline{S}_n(\omega) - S_n(\omega)|$ is finite. This shows that $n^{-1}(\overline{S}_n - S_n) \stackrel{a.s.}{\to} 0$, whereby it suffices to prove that $n^{-1}\overline{S}_n \stackrel{a.s.}{\to} EX_1$.

To this end, we next show that it suffices to prove the following lemma about almost sure convergence of $\overline{S}_n$ along suitably chosen subsequences.
Lemma 2.3.2. Fixing $\alpha > 1$ let $n_l = [\alpha^l]$. Under the conditions of the proposition, $n_l^{-1}(\overline{S}_{n_l} - E\overline{S}_{n_l}) \stackrel{a.s.}{\to} 0$ as $l \to \infty$.
By dominated convergence, $E[X_1 I_{X_1 \le k}] \to EX_1$ as $k \to \infty$, and consequently, as $n \to \infty$,
\[
\frac{1}{n}E\overline{S}_n = \frac{1}{n}\sum_{k=1}^n E\overline{X}_k = \frac{1}{n}\sum_{k=1}^n E[X_1 I_{X_1 \le k}] \to EX_1
\]
(we have used here the consistency of Cesàro averages, c.f. Exercise 1.3.52 for an integral version). Thus, assuming that Lemma 2.3.2 holds, we have that $n_l^{-1}\overline{S}_{n_l} \stackrel{a.s.}{\to} EX_1$ when $l \to \infty$, for each $\alpha > 1$.
We complete the proof of the proposition by interpolating from the subsequences $n_l = [\alpha^l]$ to the whole sequence. To this end, fix $\alpha > 1$. Since $n \mapsto \overline{S}_n$ is non-decreasing, we have for all $\omega$ and any $n \in [n_l, n_{l+1}]$,
\[
\frac{n_l}{n_{l+1}}\cdot\frac{\overline{S}_{n_l}(\omega)}{n_l} \le \frac{\overline{S}_n(\omega)}{n} \le \frac{n_{l+1}}{n_l}\cdot\frac{\overline{S}_{n_{l+1}}(\omega)}{n_{l+1}}\,.
\]
With $n_l/n_{l+1} \to 1/\alpha$ for $l \to \infty$, the a.s. convergence of $m^{-1}\overline{S}_m$ along the subsequence $m = n_l$ implies that the event
\[
A_\alpha := \Big\{\omega : \alpha^{-1}EX_1 \le \liminf_{n\to\infty}\frac{\overline{S}_n(\omega)}{n} \le \limsup_{n\to\infty}\frac{\overline{S}_n(\omega)}{n} \le \alpha EX_1\Big\}\,,
\]
has probability one. Consequently, taking $\alpha_m \downarrow 1$, we deduce that the event $B := \bigcap_m A_{\alpha_m}$ also has probability one, and further, $n^{-1}\overline{S}_n(\omega) \to EX_1$ for each $\omega \in B$. We thus deduce that $n^{-1}\overline{S}_n \stackrel{a.s.}{\to} EX_1$, as needed to complete the proof of the proposition.
Remark. The monotonicity of certain random variables (here $n \mapsto \overline{S}_n$) is crucial to the successful application of the subsequence method. The subsequence $n_l$ for which we need a direct proof of convergence is completely determined by the scaling function $b_n^{-1}$ applied to this monotone sequence (here $b_n = n$); we need the ratios $b_{n_{l+1}}/b_{n_l}$ to converge to a constant, which should be arbitrarily close to 1. For example, same subsequences $n_l = [\alpha^l]$ are to be used whenever $b_n$ is roughly of a polynomial growth in $n$, while even $n_l = (l!)^c$ would work in case $b_n = \log n$.

Likewise, the truncation level is determined by the highest moment of the basic variables which is assumed to be finite. For example, we can take $\overline{X}_k = X_k I_{X_k \le k^p}$ for any $p > 0$ such that $E|X_1|^{1/p} < \infty$.
Proof of Lemma 2.3.2. Note that $E[\overline{X}_k^2]$ is non-decreasing in $k$. Further, $\overline{X}_k$ are pairwise independent, hence uncorrelated, so by Lemma 2.1.2,
\[
\mathrm{Var}(\overline{S}_n) = \sum_{k=1}^n \mathrm{Var}(\overline{X}_k) \le \sum_{k=1}^n E[\overline{X}_k^2] \le nE[\overline{X}_n^2] = nE[X_1^2 I_{X_1 \le n}]\,.
\]
Combining this with Chebyshev's inequality yields the bound
\[
P(|\overline{S}_n - E\overline{S}_n| \ge \epsilon n) \le (\epsilon n)^{-2}\mathrm{Var}(\overline{S}_n) \le \epsilon^{-2}n^{-1}E[X_1^2 I_{X_1 \le n}]\,,
\]
for any $\epsilon > 0$. Applying Borel–Cantelli I for the events $A_l = \{|\overline{S}_{n_l} - E\overline{S}_{n_l}| \ge \epsilon n_l\}$, followed by $\epsilon_m \downarrow 0$, we get the a.s. convergence to zero of $n^{-1}|\overline{S}_n - E\overline{S}_n|$ along any subsequence $n_l$ for which
\[
\sum_{l=1}^\infty n_l^{-1}E[X_1^2 I_{X_1 \le n_l}] = E\Big[X_1^2\sum_{l=1}^\infty n_l^{-1}I_{X_1 \le n_l}\Big] < \infty
\]
(the latter identity is a special case of Exercise 1.3.40). Since $E|X_1| < \infty$, it thus suffices to show that for $n_l = [\alpha^l]$ and any $x > 0$,
\[
(2.3.1) \qquad u(x) := \sum_{l=1}^\infty n_l^{-1}I_{x \le n_l} \le c\,x^{-1}\,,
\]
where $c = 2\alpha/(\alpha - 1) < \infty$. To establish (2.3.1) fix $\alpha > 1$ and $x > 0$, setting $L = \min\{l \ge 1 : n_l \ge x\}$. Then, $\alpha^L \ge x$, and since $[y] \ge y/2$ for all $y \ge 1$,
\[
u(x) = \sum_{l=L}^\infty n_l^{-1} \le 2\sum_{l=L}^\infty \alpha^{-l} = c\,\alpha^{-L} \le c\,x^{-1}\,.
\]
So, we have established (2.3.1) and hence completed the proof of the lemma.
As already promised, it is not hard to extend the scope of the strong law of large numbers beyond integrable and non-negative random variables.

Theorem 2.3.3 (Strong law of large numbers). Let $S_n = \sum_{i=1}^n X_i$ for pairwise independent and identically distributed random variables $X_i$, such that either $E[(X_1)_+]$ is finite or $E[(X_1)_-]$ is finite. Then, $n^{-1}S_n \stackrel{a.s.}{\to} EX_1$ as $n \to \infty$.
Proof. First consider non-negative $X_i$. The case of $EX_1 < \infty$ has already been dealt with in Proposition 2.3.1. In case $EX_1 = \infty$, consider $S_n^{(m)} = \sum_{i=1}^n X_i^{(m)}$ for the bounded, non-negative, pairwise independent and identically distributed random variables $X_i^{(m)} = \min(X_i, m) \le X_i$. Since Proposition 2.3.1 applies for $X_i^{(m)}$, it follows that a.s. for any fixed $m < \infty$,
\[
(2.3.2) \qquad \liminf_{n\to\infty} n^{-1}S_n \ge \liminf_{n\to\infty} n^{-1}S_n^{(m)} = EX_1^{(m)} = E\min(X_1, m)\,.
\]
Taking $m \uparrow \infty$, by monotone convergence $E\min(X_1, m) \uparrow EX_1 = \infty$, so (2.3.2) results with $n^{-1}S_n \stackrel{a.s.}{\to} \infty = EX_1$.

Turning to the general case, we have the decomposition $X_i = (X_i)_+ - (X_i)_-$ of each random variable to its positive and negative parts, with
\[
(2.3.3) \qquad n^{-1}S_n = n^{-1}\sum_{i=1}^n (X_i)_+ - n^{-1}\sum_{i=1}^n (X_i)_-\,.
\]
Since $(X_i)_+$ are non-negative, pairwise independent and identically distributed, it follows that $n^{-1}\sum_{i=1}^n (X_i)_+ \stackrel{a.s.}{\to} E[(X_1)_+]$ as $n \to \infty$. For the same reason, also $n^{-1}\sum_{i=1}^n (X_i)_- \stackrel{a.s.}{\to} E[(X_1)_-]$, and as at least one of these two limits is finite, (2.3.3) yields that $n^{-1}S_n \stackrel{a.s.}{\to} E[(X_1)_+] - E[(X_1)_-] = EX_1$.

Exercise 2.3.4.
(a) Show that any random variable $Z \ge 0$ satisfies $\sum_{n=1}^\infty I_{\{Z \ge n\}} \le Z \le 1 + \sum_{n=1}^\infty I_{\{Z \ge n\}}$, and deduce that
\[
(2.3.4) \qquad \sum_{n=1}^\infty P(Z \ge n) \le EZ \le 1 + \sum_{n=1}^\infty P(Z \ge n)\,.
\]
(b) Suppose $X_i$ are i.i.d. R.V.-s with $E[|X_1|^\alpha] = \infty$ for some $\alpha > 0$. Show that for any $k > 0$, $\sum_{n=1}^\infty P(|X_n| > kn^{1/\alpha}) = \infty$, and deduce that a.s. $\limsup_{n\to\infty} n^{-1/\alpha}|X_n| = \infty$.
(c) Conclude that if $S_n = X_1 + X_2 + \cdots + X_n$, then
\[
\limsup_{n\to\infty} n^{-1/\alpha}|S_n| = \infty, \quad a.s.
\]
We provide next two classical applications of the strong law of large numbers, the first of which deals with the large sample asymptotics of the empirical distribution function.

Example 2.3.5 (Empirical distribution function). Let
\[
\widehat{F}_n(x) = n^{-1}\sum_{i=1}^n I_{(-\infty,x]}(X_i)\,,
\]
denote the observed fraction of values among the first $n$ variables of the sequence $\{X_i\}$ which do not exceed $x$. The functions $\widehat{F}_n(\cdot)$ are thus called the empirical distribution functions of this sequence.

For i.i.d. $X_i$ with distribution function $F_X$ our next result improves the strong law of large numbers by showing that $\widehat{F}_n$ converges uniformly to $F_X$ as $n \to \infty$.
Theorem 2.3.6 (Glivenko–Cantelli). For i.i.d. $X_i$ with arbitrary distribution function $F_X$, as $n \to \infty$,
\[
D_n = \sup_{x\in\mathbb{R}} |\widehat{F}_n(x) - F_X(x)| \stackrel{a.s.}{\to} 0\,.
\]

Remark. While outside our scope, we note in passing the Dvoretzky–Kiefer–Wolfowitz inequality that $P(D_n > \epsilon) \le 2\exp(-2n\epsilon^2)$ for any $n$ and all $\epsilon > 0$, quantifying the rate of convergence of $D_n$ to zero (see [DKW56], or [Mas90] for the optimal pre-exponential constant).
Proof. By the right continuity of both $x \mapsto \widehat{F}_n(x)$ and $x \mapsto F_X(x)$ (c.f. Theorem 1.2.36), the value of $D_n$ is unchanged when the supremum over $x \in \mathbb{R}$ is replaced by the one over $x \in \mathbb{Q}$ (the rational numbers). In particular, this shows that each $D_n$ is a random variable (c.f. Theorem 1.2.22).

Applying the strong law of large numbers for the i.i.d. non-negative $I_{(-\infty,x]}(X_i)$ whose expectation is $F_X(x)$, we deduce that $\widehat{F}_n(x) \stackrel{a.s.}{\to} F_X(x)$ for each fixed non-random $x \in \mathbb{R}$. Similarly, considering the strong law of large numbers for the i.i.d. non-negative $I_{(-\infty,x)}(X_i)$ whose expectation is $F_X(x^-)$, we have that $\widehat{F}_n(x^-) \stackrel{a.s.}{\to} F_X(x^-)$ for each fixed non-random $x \in \mathbb{R}$. Consequently, for any fixed $l < \infty$ and $x_{1,l}, \ldots, x_{l,l}$ we have that
\[
D_{n,l} = \max\Big(\max_{k=1}^{l}|\widehat{F}_n(x_{k,l}) - F_X(x_{k,l})|,\ \max_{k=1}^{l}|\widehat{F}_n(x_{k,l}^-) - F_X(x_{k,l}^-)|\Big) \stackrel{a.s.}{\to} 0\,,
\]
as $n \to \infty$. Choosing $x_{k,l} = \inf\{x : F_X(x) \ge k/(l+1)\}$, we get out of the monotonicity of $x \mapsto \widehat{F}_n(x)$ and $x \mapsto F_X(x)$ that $D_n \le D_{n,l} + l^{-1}$ (c.f. [Bil95, Proof of Theorem 20.6]). Therefore, taking $n \to \infty$ followed by $l \to \infty$ completes the proof of the theorem.
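For a sorted sample the supremum defining $D_n$ is attained at a sample point (between order statistics $\widehat{F}_n$ is flat while $F_X$ is non-decreasing), so $D_n$ is computable exactly. A sketch for a uniform sample (sample size and seed are arbitrary choices):

```python
import random

def ks_statistic(sample, cdf):
    """Exact D_n = sup_x |F_n(x) - F(x)| for a continuous cdf F.

    Between order statistics F_n is flat, so the sup is attained at some
    order statistic x_(i), comparing F(x_(i)) against i/n and (i-1)/n.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

rng = random.Random(4)
n = 10_000
sample = [rng.random() for _ in range(n)]
d_n = ks_statistic(sample, lambda x: min(max(x, 0.0), 1.0))
print(d_n)  # DKW: P(D_n > 0.05) <= 2*exp(-2*10000*0.05**2), astronomically small
```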
We turn to our second example, which is about counting processes.

Example 2.3.7 (Renewal theory). Let $\tau_i$ be i.i.d. positive, finite random variables and $T_n = \sum_{k=1}^n \tau_k$. Here $T_n$ is interpreted as the time of the $n$-th occurrence of a given event, with $\tau_k$ representing the length of the time interval between the $(k-1)$ occurrence and that of the $k$-th such occurrence. Associated with $\{T_n\}$ is the dual process $N_t = \sup\{n : T_n \le t\}$ counting the number of occurrences during the time interval $[0, t]$. In the next exercise you are to derive the strong law for the large $t$ asymptotics of $t^{-1}N_t$.
Exercise 2.3.8. Consider the setting of Example 2.3.7.
(a) By the strong law of large numbers argue that $n^{-1}T_n \stackrel{a.s.}{\to} E\tau_1$. Then, adopting the convention $1/\infty = 0$, deduce that $t^{-1}N_t \stackrel{a.s.}{\to} 1/E\tau_1$ for $t \to \infty$.
Hint: From the definition of $N_t$ it follows that $T_{N_t} \le t < T_{N_t+1}$ for all $t \ge 0$.
(b) Show that $t^{-1}N_t \stackrel{a.s.}{\to} 1/E\tau_2$ as $t \to \infty$, even if the law of $\tau_1$ is different from that of the i.i.d. $\{\tau_i, i \ge 2\}$.
Here is a strengthening of the preceding result to convergence in $L^1$.

Exercise 2.3.9. In the context of Example 2.3.7 fix $\epsilon > 0$ such that $P(\tau_1 > \epsilon) > 0$ and let $\overline{T}_n = \sum_{k=1}^n \overline{\tau}_k$ for the i.i.d. random variables $\overline{\tau}_i = \epsilon I_{\{\tau_i > \epsilon\}}$. Note that $\overline{T}_n \le T_n$ and consequently $N_t \le \overline{N}_t = \sup\{n : \overline{T}_n \le t\}$.
(a) Show that $\limsup_{t\to\infty} t^{-2}E\overline{N}_t^2 < \infty$.
(b) Deduce that $\{t^{-1}N_t : t \ge 1\}$ is uniformly integrable (see Exercise 1.3.54), and conclude that $t^{-1}EN_t \to 1/E\tau_1$ when $t \to \infty$.
The next exercise deals with an elaboration over Example 2.3.7.

Exercise 2.3.10. For $i = 1, 2, \ldots$ the $i$-th light bulb burns for an amount of time $\tau_i$ and then remains burned out for time $s_i$ before being replaced by the $(i+1)$-th bulb. Let $R_t$ denote the fraction of time during $[0, t]$ in which we have a working light. Assuming that the two sequences $\{\tau_i\}$ and $\{s_i\}$ are independent, each consisting of i.i.d. positive and integrable random variables, show that $R_t \stackrel{a.s.}{\to} E\tau_1/(E\tau_1 + Es_1)$.
Here is another exercise, dealing with sampling at times of heads in independent fair coin tosses, from a non-random bounded sequence of weights $v(l)$, the averages of which converge.

Exercise 2.3.11. For a sequence $\{B_i\}$ of i.i.d. Bernoulli random variables of parameter $p = 1/2$, let $T_n$ be the time that the corresponding partial sums reach level $n$. That is, $T_n = \inf\{k : \sum_{i=1}^k B_i \ge n\}$, for $n = 1, 2, \ldots$.
(a) Show that $n^{-1}T_n \stackrel{a.s.}{\to} 2$ as $n \to \infty$.
(b) Given non-negative, non-random $v(k)$ show that $k^{-1}\sum_{i=1}^k v(T_i) \stackrel{a.s.}{\to} s$ as $k \to \infty$, for some non-random $s$, if and only if $n^{-1}\sum_{l=1}^n v(l)B_l \stackrel{a.s.}{\to} s/2$ as $n \to \infty$.
(c) Deduce that if $n^{-1}\sum_{l=1}^n v(l)^2$ is bounded and $n^{-1}\sum_{l=1}^n v(l) \to s$ as $n \to \infty$, then $k^{-1}\sum_{i=1}^k v(T_i) \stackrel{a.s.}{\to} s$ as $k \to \infty$.
Hint: For part (c) consider first the limit of $n^{-1}\sum_{l=1}^n v(l)(B_l - 0.5)$ as $n \to \infty$.
We conclude this subsection with few additional applications of the strong law of large numbers, first to a problem of universal hypothesis testing, then an application involving stochastic geometry, and finally one motivated by investment science.

Exercise 2.3.12. Consider i.i.d. $[0,1]$-valued random variables $\{X_k\}$.
(a) Find Borel measurable functions $f_n : [0,1]^n \to \{0,1\}$, which are independent of the law of $\{X_k\}$, such that $f_n(X_1, X_2, \ldots, X_n) \stackrel{a.s.}{\to} 0$ whenever $EX_1 < 1/2$ and $f_n(X_1, X_2, \ldots, X_n) \stackrel{a.s.}{\to} 1$ whenever $EX_1 > 1/2$.
(b) Modify your answer to assure that $f_n(X_1, X_2, \ldots, X_n) \stackrel{a.s.}{\to} 1$ also in case $EX_1 = 1/2$.
Exercise 2.3.13. Let $\{U_n\}$ be i.i.d. random vectors, each uniformly distributed on the unit ball $\{u \in \mathbb{R}^2 : |u| \le 1\}$. Consider the $\mathbb{R}^2$-valued random vectors $X_n = |X_{n-1}|U_n$, $n = 1, 2, \ldots$ starting at a non-random, non-zero vector $X_0$ (that is, each point is uniformly chosen in a ball centered at the origin and whose radius is the distance from the origin to the previously chosen point). Show that $n^{-1}\log|X_n| \stackrel{a.s.}{\to} -1/2$ as $n \to \infty$.
Exercise 2.3.14. Let $\{V_n\}$ be i.i.d. non-negative random variables. Fixing $r > 0$ and $q \in (0,1]$, consider the sequence $W_0 = 1$ and $W_n = (qr + (1-q)V_n)W_{n-1}$, $n = 1, 2, \ldots$. A motivating example is of $W_n$ recording the relative growth of a portfolio where a constant fraction $q$ of one's wealth is reinvested each year in a risk-less asset that grows by $r$ per year, with the remainder reinvested in a risky asset whose annual growth factors are the random $\{V_n\}$.
(a) Show that $n^{-1}\log W_n \stackrel{a.s.}{\to} w(q)$, for $w(q) = E\log(qr + (1-q)V_1)$.
(b) Show that $q \mapsto w(q)$ is concave on $(0,1]$.
(c) Using Jensen's inequality show that $w(q) \le w(1)$ in case $EV_1 \le r$. Further, show that if $E[V_1^{-1}] \le r^{-1}$, then the almost sure convergence applies also for $q = 0$ and that $w(q) \le w(0)$.
(d) Assuming that $EV_1^2 < \infty$ and $E[V_1^{-2}] < \infty$ show that $\sup\{w(q) : q \in [0,1]\}$ is finite, and further that the maximum of $w(q)$ is obtained at some $q \in (0,1)$ when $EV_1 > r > 1/E[V_1^{-1}]$. Interpret your results in terms of the preceding investment example.
Hint: Consider small $q > 0$ and small $1-q > 0$ and recall that $\log(1+x) \ge x - x^2/2$ for any $x \ge 0$.
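For a concrete two-point law the function $w(q)$ can be evaluated in closed form and its interior maximum located exactly. A sketch, where the choices $r = 1$ and $V_1 \in \{1/2, 2\}$ with equal probabilities are illustrative assumptions, not from the text:

```python
import math

def w(q, r=1.0, outcomes=((0.5, 0.5), (2.0, 0.5))):
    """w(q) = E log(q*r + (1-q)*V_1) for a discrete law of V_1,
    given as (value, probability) pairs."""
    return sum(p * math.log(q * r + (1 - q) * v) for v, p in outcomes)

# Here E[V_1] = 1.25 > r = 1 > 1/E[1/V_1] = 0.8, so part (d) predicts an
# interior maximizer; a direct computation gives q = 1/2 for this law,
# while the boundary values are w(0) = w(1) = 0.
qs = [i / 1000 for i in range(1001)]
best_q = max(qs, key=w)
print(best_q, w(best_q), w(0.0), w(1.0))
```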
2.3.2. Convergence of random series. A second approach to the strong law of large numbers is based on studying the convergence of random series. The key tool in this approach is Kolmogorov's maximal inequality, which we prove next.

Proposition 2.3.15 (Kolmogorov's maximal inequality). Suppose the random variables $Y_1, \ldots, Y_n$ are mutually independent, with $EY_l^2 < \infty$ and $EY_l = 0$ for $l = 1, \ldots, n$. Then, for $Z_k = Y_1 + \cdots + Y_k$ and any $z > 0$,
\[
(2.3.5) \qquad z^2\,P\big(\max_{1 \le k \le n}|Z_k| \ge z\big) \le \mathrm{Var}(Z_n)\,.
\]

Remark. Chebyshev's inequality gives only $z^2 P(|Z_n| \ge z) \le \mathrm{Var}(Z_n)$ which is significantly weaker and insufficient for our current goals.
Proof. Fixing $z > 0$ we decompose the event $A = \{\max_{1 \le k \le n}|Z_k| \ge z\}$ according to the minimal index $k$ for which $|Z_k| \ge z$. That is, $A$ is the union of the disjoint events $A_k = \{|Z_k| \ge z > |Z_j|, j = 1, \ldots, k-1\}$ over $1 \le k \le n$. Obviously,
\[
(2.3.6) \qquad z^2 P(A) = \sum_{k=1}^n z^2 P(A_k) \le \sum_{k=1}^n E[Z_k^2; A_k]\,,
\]
since $Z_k^2 \ge z^2$ on $A_k$. Further, $EZ_n = 0$ and $A_k$ are disjoint, so
\[
(2.3.7) \qquad \mathrm{Var}(Z_n) = EZ_n^2 \ge \sum_{k=1}^n E[Z_n^2; A_k]\,.
\]
It suffices to show that $E[(Z_n - Z_k)Z_k; A_k] = 0$ for any $1 \le k \le n$, since then
\[
E[Z_n^2; A_k] - E[Z_k^2; A_k] = E[(Z_n - Z_k)^2; A_k] + 2E[(Z_n - Z_k)Z_k; A_k] = E[(Z_n - Z_k)^2; A_k] \ge 0\,,
\]
and (2.3.5) follows by comparing (2.3.6) and (2.3.7). Since $Z_k I_{A_k}$ can be represented as a non-random Borel function of $(Y_1, \ldots, Y_k)$, it follows that $Z_k I_{A_k}$ is measurable on $\sigma(Y_1, \ldots, Y_k)$. Consequently, for fixed $k$ and $l > k$ the variables $Y_l$ and $Z_k I_{A_k}$ are independent, hence uncorrelated. Further $EY_l = 0$, so
\[
E[(Z_n - Z_k)Z_k; A_k] = \sum_{l=k+1}^n E[Y_l Z_k I_{A_k}] = \sum_{l=k+1}^n E(Y_l)E(Z_k I_{A_k}) = 0\,,
\]
completing the proof of Kolmogorov's inequality.
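The inequality can be probed by Monte Carlo for a symmetric $\pm 1$ random walk, where $\mathrm{Var}(Z_n) = n$. A sketch (walk length, threshold, trial count and seed are arbitrary choices):

```python
import random

def max_exceeds(n_steps, z, rng):
    """One +/-1 random walk of n_steps; True iff max_k |Z_k| >= z."""
    walk = 0
    for _ in range(n_steps):
        walk += rng.choice((-1, 1))
        if abs(walk) >= z:
            return True
    return False

rng = random.Random(6)
n, z, trials = 100, 15, 20_000
p_hat = sum(max_exceeds(n, z, rng) for _ in range(trials)) / trials
# Kolmogorov's maximal inequality: z^2 * P(max_k |Z_k| >= z) <= Var(Z_n) = n.
print(z * z * p_hat, "<=", n)
```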
Equipped with Kolmogorov's inequality, we provide an easy to check sufficient condition for the convergence of random series of independent R.V.-s.

Theorem 2.3.16. Suppose $X_i$ are independent random variables with $\mathrm{Var}(X_i) < \infty$ and $EX_i = 0$. If $\sum_n \mathrm{Var}(X_n) < \infty$ then w.p.1. the random series $\sum_n X_n(\omega)$ converges (that is, the sequence $S_n(\omega) = \sum_{k=1}^n X_k(\omega)$ has a finite limit $S_\infty(\omega)$).
Proof. Applying Kolmogorov's maximal inequality for the independent variables $Y_l = X_{l+r}$, we have that for any $\epsilon > 0$ and positive integers $r$ and $n$,
\[
P\big(\max_{r \le k \le r+n}|S_k - S_r| \ge \epsilon\big) \le \epsilon^{-2}\mathrm{Var}(S_{r+n} - S_r) = \epsilon^{-2}\sum_{l=r+1}^{r+n}\mathrm{Var}(X_l)\,.
\]
Taking $n \to \infty$, we get by continuity from below of $P$ that
\[
P\big(\sup_{k \ge r}|S_k - S_r| \ge \epsilon\big) \le \epsilon^{-2}\sum_{l=r+1}^{\infty}\mathrm{Var}(X_l)\,.
\]
By our assumption that $\sum_n \mathrm{Var}(X_n)$ is finite, it follows that $\sum_{l>r}\mathrm{Var}(X_l) \to 0$ as $r \to \infty$. Hence, if we let $T_r = \sup_{n,m \ge r}|S_n - S_m|$, then for any $\epsilon > 0$,
\[
P(T_r \ge 2\epsilon) \le P\big(\sup_{k \ge r}|S_k - S_r| \ge \epsilon\big) \to 0
\]
as $r \to \infty$. Further, $r \mapsto T_r(\omega)$ is non-increasing, hence,
\[
P\big(\limsup_{M\to\infty} T_M \ge 2\epsilon\big) = P\big(\inf_M T_M \ge 2\epsilon\big) \le P(T_r \ge 2\epsilon) \to 0\,.
\]
That is, $T_M(\omega) \stackrel{a.s.}{\to} 0$ for $M \to \infty$. By definition, the convergence to zero of $T_M(\omega)$ is the statement that $S_n(\omega)$ is a Cauchy sequence. Since every Cauchy sequence in $\mathbb{R}$ converges to a finite limit, we have the stated a.s. convergence of $S_n(\omega)$.
We next provide some applications of Theorem 2.3.16.

Example 2.3.17. Considering non-random $a_n$ such that $\sum_n a_n^2 < \infty$ and independent Bernoulli variables $B_n$ of parameter $p = 1/2$, Theorem 2.3.16 tells us that $\sum_n (-1)^{B_n}a_n$ converges with probability one. That is, when the signs in $\sum_n \pm a_n$ are chosen on the toss of a fair coin, the series almost always converges (though quite possibly $\sum_n |a_n| = \infty$).
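For $a_n = 1/n$ this gives a randomly signed harmonic series: $\sum_n 1/n^2 < \infty$, so the partial sums settle down even though $\sum_n |a_n| = \infty$. A sketch tracking how little the partial sums move after a large index (the index cutoffs and the seed are arbitrary):

```python
import random

rng = random.Random(7)
partial, snapshots = 0.0, {}
for n in range(1, 100_001):
    sign = 1 if rng.random() < 0.5 else -1  # fair coin chooses the sign
    partial += sign / n
    if n in (10_000, 100_000):
        snapshots[n] = partial

# The tail sum has variance sum_{n > 10^4} 1/n^2 < 10^-4, so the two
# snapshots differ by very little although sum 1/n itself diverges.
print(snapshots[10_000], snapshots[100_000])
```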
Exercise 2.3.18. Consider the record events $\{A_k\}$ of Example 2.2.27.
(a) Verify that the events $A_k$ are mutually independent with $P(A_k) = 1/k$.
(b) Show that the random series $\sum_{n\ge 2}(I_{A_n} - 1/n)/\log n$ converges almost surely and deduce that $(\log n)^{-1}R_n \stackrel{a.s.}{\to} 1$ as $n \to \infty$.
(c) Provide a counter-example to the preceding in case the distribution function $F_X(x)$ is not continuous.
The link between convergence of random series and the strong law of large numbers is the following classical analysis lemma.

Lemma 2.3.19 (Kronecker's lemma). Consider two sequences of real numbers $x_n$ and $b_n$ where $b_n > 0$ and $b_n \uparrow \infty$. If $\sum_n x_n/b_n$ converges, then $s_n/b_n \to 0$ for $s_n = x_1 + \cdots + x_n$.

Proof. Let $u_n = \sum_{k=1}^n (x_k/b_k)$ which by assumption converges to a finite limit denoted $u_\infty$. Setting $u_0 = b_0 = 0$, summation by parts yields the identity,
\[
s_n = \sum_{k=1}^n b_k(u_k - u_{k-1}) = b_n u_n - \sum_{k=1}^n (b_k - b_{k-1})u_{k-1}\,.
\]
Since $u_n \to u_\infty$ and $b_n \uparrow \infty$, the Cesàro averages $b_n^{-1}\sum_{k=1}^n (b_k - b_{k-1})u_{k-1}$ also converge to $u_\infty$. Consequently, $s_n/b_n \to u_\infty - u_\infty = 0$.
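Kronecker's lemma is purely deterministic, so it can be checked directly. A sketch with the illustrative choices $x_n = (-1)^n$ and $b_n = n$: the series $\sum_n (-1)^n/n$ converges (alternating series), so the lemma forces $s_n/b_n \to 0$, which here is visible since $s_n$ only alternates between $-1$ and $0$:

```python
def kronecker_ratio(n_terms):
    """Return s_n / b_n for x_k = (-1)^k and b_k = k."""
    s = 0
    for k in range(1, n_terms + 1):
        s += (-1) ** k  # partial sums s_n alternate between -1 and 0
    return s / n_terms

print(kronecker_ratio(10), kronecker_ratio(1000), kronecker_ratio(100001))
```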
Theorem 2.3.16 provides an alternative proof for the strong law of large numbers of Theorem 2.3.3 in case $X_i$ are i.i.d. (that is, replacing pairwise independence by mutual independence). Indeed, applying the same truncation scheme as in the proof of Proposition 2.3.1, it suffices to prove the following alternative to Lemma 2.3.2.

Lemma 2.3.20. For integrable i.i.d. random variables $X_k$, let $\overline{S}_m = \sum_{k=1}^m \overline{X}_k$ and $\overline{X}_k = X_k I_{|X_k| \le k}$. Then, $n^{-1}(\overline{S}_n - E\overline{S}_n) \stackrel{a.s.}{\to} 0$ as $n \to \infty$.
Lemma 2.3.20, in contrast to Lemma 2.3.2, does not require the restriction to a subsequence $n_l$. Consequently, in this proof of the strong law there is no need for an interpolation argument, so it is carried out directly for $X_k$, with no need to split each variable to its positive and negative parts.
Proof of Lemma 2.3.20. We will shortly show that
\[
(2.3.8) \qquad \sum_{k=1}^\infty k^{-2}\mathrm{Var}(\overline{X}_k) \le 2E|X_1|\,.
\]
With $X_1$ integrable, applying Theorem 2.3.16 for the independent variables $Y_k = k^{-1}(\overline{X}_k - E\overline{X}_k)$ this implies that for some $A$ with $P(A) = 1$, the random series $\sum_n Y_n(\omega)$ converges for all $\omega \in A$. Using Kronecker's lemma for $b_n = n$ and $x_n = \overline{X}_n(\omega) - E\overline{X}_n$ we get that $n^{-1}\sum_{k=1}^n (\overline{X}_k - E\overline{X}_k) \to 0$ as $n \to \infty$, for every $\omega \in A$, as stated.

The proof of (2.3.8) is similar to the computation employed in the proof of Lemma 2.3.2. That is, $\mathrm{Var}(\overline{X}_k) \le E\overline{X}_k^2 = E[X_1^2 I_{|X_1| \le k}]$ and $k^{-2} \le 2/(k(k+1))$, yielding that
\[
\sum_{k=1}^\infty k^{-2}\mathrm{Var}(\overline{X}_k) \le \sum_{k=1}^\infty \frac{2}{k(k+1)}E[X_1^2 I_{|X_1| \le k}] = E[X_1^2\,v(|X_1|)]\,,
\]
where for any $x > 0$,
\[
v(x) = 2\sum_{k \ge x}\frac{1}{k(k+1)} = 2\sum_{k \ge x}\Big(\frac{1}{k} - \frac{1}{k+1}\Big) = \frac{2}{\lceil x\rceil} \le 2x^{-1}\,.
\]
Consequently, $EX_1^2 v(|X_1|) \le 2E|X_1|$, and (2.3.8) follows.
Many of the ingredients of this proof of the strong law of large numbers are also relevant for solving the following exercise.

Exercise 2.3.21. Let $c_n$ be a bounded sequence of non-random constants, and $X_i$ i.i.d. integrable random variables with $EX_1 = 0$. Show that $n^{-1}\sum_{k=1}^n c_k X_k \stackrel{a.s.}{\to} 0$ for $n \to \infty$.
Next you find few exercises that illustrate how useful Kronecker's lemma is when proving the strong law of large numbers in case of independent but not identically distributed summands.

Exercise 2.3.22. Let $S_n = \sum_{k=1}^n Y_k$ for independent random variables $Y_i$ such that $\mathrm{Var}(Y_k) < B < \infty$ and $EY_k = 0$ for all $k$. Show that $[n(\log n)^{1+\epsilon}]^{-1/2}S_n \stackrel{a.s.}{\to} 0$ as $n \to \infty$, where $\epsilon > 0$ is fixed (this falls short of the law of the iterated logarithm of (2.2.1), but each $Y_k$ is allowed here to have a different distribution).
Exercise 2.3.23. Suppose the independent random variables $X_i$ are such that $\mathrm{Var}(X_k) \le p_k < \infty$ and $EX_k = 0$ for $k = 1, 2, \ldots$.
(a) Show that if $\sum_k p_k < \infty$ then $n^{-1}\sum_{k=1}^n kX_k \xrightarrow{a.s.} 0$.
(b) Conversely, assuming $\sum_k p_k = \infty$, give an example of independent random variables $X_i$, such that $\mathrm{Var}(X_k) \le p_k$, $EX_k = 0$, for which almost surely $\limsup_{n\to\infty} X_n(\omega) = 1$.
(c) Show that the example you just gave is such that with probability one, the sequence $n^{-1}\sum_{k=1}^n kX_k(\omega)$ does not converge to a finite limit.
Exercise 2.3.24. Consider independent, nonnegative random variables $X_n$.
(a) Show that if

(2.3.9) $\qquad \sum_{n=1}^{\infty} \big[P(X_n \ge 1) + E(X_n I_{\{X_n < 1\}})\big] < \infty$

then the random series $\sum_n X_n(\omega)$ converges w.p.1.
(b) Prove the converse, namely, that if $\sum_n X_n(\omega)$ converges w.p.1. then (2.3.9) holds.
(c) Suppose $G_n$ are mutually independent random variables, with $G_n$ having the normal distribution $\mathcal N(\mu_n, v_n)$. Show that w.p.1. the random series $\sum_n G_n^2(\omega)$ converges if and only if $e = \sum_n (\mu_n^2 + v_n)$ is finite.
(d) Suppose $\tau_n$ are mutually independent random variables, with $\tau_n$ having the exponential distribution of parameter $\lambda_n > 0$. Show that w.p.1. the random series $\sum_n \tau_n(\omega)$ converges if and only if $\sum_n 1/\lambda_n$ is finite.
Hint: For part (b) recall that for any $a_n \in [0, 1)$, the series $\sum_n a_n$ is finite if and only if $\prod_n (1 - a_n) > 0$. For part (c) let $f(y) = \sum_n \min((\mu_n + \sqrt{v_n}\, y)^2, 1)$ and observe that if $e = \infty$ then $f(y) + f(-y) = \infty$ for all $y \ne 0$.
You can now also show that for such strong laws of large numbers (that is, with independent but not identically distributed summands), it suffices to strengthen the corresponding weak law (only) along the subsequence $n_r = 2^r$.
Exercise 2.3.25. Let $Z_k = \sum_{j=1}^k Y_j$ where $Y_j$ are mutually independent R.V.-s.
(a) Fixing $\epsilon > 0$ show that if $2^{-r} Z_{2^r} \xrightarrow{a.s.} 0$ then $\sum_r P(|Z_{2^{r+1}} - Z_{2^r}| > \epsilon 2^r)$ is finite, and that if $m^{-1} Z_m \xrightarrow{p} 0$ then $\max_{m < k \le 2m} P(|Z_{2m} - Z_k| \ge \epsilon m) \to 0$.
(b) Adapting the proof of Kolmogorov's maximal inequality show that for any n and $z > 0$,
$$P\big(\max_{1\le k\le n} |Z_k| \ge 2z\big)\, \min_{1\le k\le n} P(|Z_n - Z_k| < z) \le P(|Z_n| > z)\,.$$
(c) Deduce that if both $m^{-1} Z_m \xrightarrow{p} 0$ and $2^{-r} Z_{2^r} \xrightarrow{a.s.} 0$ then also $n^{-1} Z_n \xrightarrow{a.s.} 0$.
Hint: For part (c) combine parts (a) and (b) with $z = \epsilon n$, $n = 2^r$ and the mutually independent $Y_{j+n}$, $1 \le j \le n$, to show that $\sum_r P(2^{-r} D_r \ge 2\epsilon)$ is finite for $D_r = \max_{2^r < k \le 2^{r+1}} |Z_k - Z_{2^r}|$ and any fixed $\epsilon > 0$.
CHAPTER 3

Weak convergence, clt and Poisson approximation

After dealing in Chapter 2 with examples in which random variables converge to nonrandom constants, we focus here on the more general theory of weak convergence, that is, situations in which the laws of random variables converge to a limiting law, typically of a nonconstant random variable. To motivate this theory, we start with Section 3.1 where we derive the celebrated Central Limit Theorem (in short, clt), the most widely used example of weak convergence. This is followed by the exposition of the theory, to which Section 3.2 is devoted. Section 3.3 is about the key tool of characteristic functions and their role in establishing convergence results such as the clt. This tool is used in Section 3.4 to derive the Poisson approximation and provide an introduction to the Poisson process. In Section 3.5 we generalize the characteristic function to the setting of random vectors and study their properties while deriving the multivariate clt.
3.1. The Central Limit Theorem

We start this section with the property of the normal distribution that makes it the likely limit for properly scaled sums of independent random variables. This is followed by a bare-hands proof of the clt for triangular arrays in Subsection 3.1.1. We then present in Subsection 3.1.2 some of the many examples and applications of the clt.

Recall the normal distribution of mean $\mu \in \mathbb{R}$ and variance $v > 0$, denoted hereafter $\mathcal N(\mu, v)$, the density of which is

(3.1.1) $\qquad f(y) = \frac{1}{\sqrt{2\pi v}} \exp\Big(-\frac{(y-\mu)^2}{2v}\Big)\,.$

As we show next, the normal distribution is preserved when the sum of independent variables is considered (which is the main reason for its role as the limiting law for the clt).
Lemma 3.1.1. Let $Y_{n,k}$ be mutually independent random variables, each having the normal distribution $\mathcal N(\mu_{n,k}, v_{n,k})$. Then, $G_n = \sum_{k=1}^n Y_{n,k}$ has the normal distribution $\mathcal N(\mu_n, v_n)$, with $\mu_n = \sum_{k=1}^n \mu_{n,k}$ and $v_n = \sum_{k=1}^n v_{n,k}$.

Proof. Recall that Y has a $\mathcal N(\mu, v)$ distribution if and only if $Y - \mu$ has the $\mathcal N(0, v)$ distribution. Therefore, we may and shall assume without loss of generality that $\mu_{n,k} = 0$ for all k and n. Further, it suffices to prove the lemma for $n = 2$, as the general case immediately follows by an induction argument. With $n = 2$ fixed, we simplify our notations by omitting it everywhere. Next recall the formula of Corollary 1.4.33 for the probability density function of $G = Y_1 + Y_2$, which for $Y_i$ of $\mathcal N(0, v_i)$ distribution, $i = 1, 2$, is
$$f_G(z) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi v_1}} \exp\Big(-\frac{(z-y)^2}{2v_1}\Big)\, \frac{1}{\sqrt{2\pi v_2}} \exp\Big(-\frac{y^2}{2v_2}\Big)\, dy\,.$$
Comparing this with the formula of (3.1.1) for $v = v_1 + v_2$, it just remains to show that for any $z \in \mathbb{R}$,

(3.1.2) $\qquad 1 = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi u}} \exp\Big(\frac{z^2}{2v} - \frac{(z-y)^2}{2v_1} - \frac{y^2}{2v_2}\Big)\, dy\,,$

where $u = v_1 v_2/(v_1 + v_2)$. It is not hard to check that the argument of the exponential function in (3.1.2) is $-(y - cz)^2/(2u)$ for $c = v_2/(v_1 + v_2)$. Consequently, (3.1.2) is merely the obvious fact that the $\mathcal N(cz, u)$ density function integrates to one (as any density function should), no matter what the value of z is.
Considering Lemma 3.1.1 for $Y_{n,k} = (nv)^{-1/2}(Y_k - \mu)$ and i.i.d. random variables $Y_k$, each having a normal distribution of mean $\mu$ and variance v, we see that $\mu_{n,k} = 0$ and $v_{n,k} = 1/n$, so $G_n = (nv)^{-1/2}(\sum_{k=1}^n Y_k - n\mu)$ has the standard $\mathcal N(0,1)$ distribution, regardless of n.
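The completing-the-square step behind (3.1.2) can be verified mechanically. The snippet below (an illustrative check, not part of the text) evaluates both sides of the identity $\frac{z^2}{2v} - \frac{(z-y)^2}{2v_1} - \frac{y^2}{2v_2} = -\frac{(y-cz)^2}{2u}$ at random points:

```python
import random

random.seed(0)

v1, v2 = 1.7, 0.6        # arbitrary positive variances
v = v1 + v2
u = v1 * v2 / v          # variance of the N(cz, u) density in (3.1.2)
c = v2 / v               # its mean is c*z

max_err = 0.0
for _ in range(1000):
    z = random.uniform(-10.0, 10.0)
    y = random.uniform(-10.0, 10.0)
    lhs = z * z / (2 * v) - (z - y) ** 2 / (2 * v1) - y * y / (2 * v2)
    rhs = -((y - c * z) ** 2) / (2 * u)
    max_err = max(max_err, abs(lhs - rhs))
print(max_err)
```

The maximal discrepancy is at the level of floating-point rounding, confirming the algebra.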
3.1.1. Lindeberg's clt for triangular arrays. Our next proposition, the celebrated clt, states that the distribution of $\widehat S_n = (nv)^{-1/2}(\sum_{k=1}^n X_k - n\mu)$ approaches the standard normal distribution in the limit $n\to\infty$, even though $X_k$ may well be non-normal random variables.
Proposition 3.1.2 (Central Limit Theorem). Let
$$\widehat S_n = \frac{1}{\sqrt{nv}}\Big(\sum_{k=1}^n X_k - n\mu\Big)\,,$$
where $X_k$ are i.i.d. with $v = \mathrm{Var}(X_1) \in (0, \infty)$ and $\mu = E(X_1)$. Then,

(3.1.3) $\qquad \lim_{n\to\infty} P(\widehat S_n \le b) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{b} \exp\Big(-\frac{y^2}{2}\Big)\, dy \qquad$ for every $b \in \mathbb{R}$.
As we have seen in the context of the weak law of large numbers, it pays to extend the scope of consideration to triangular arrays in which the random variables $X_{n,k}$ are independent within each row, but not necessarily of identical distribution. This is the context of Lindeberg's clt, which we state next.

Theorem 3.1.3 (Lindeberg's clt). Let $\widehat S_n = \sum_{k=1}^n X_{n,k}$ for $P$-mutually independent random variables $X_{n,k}$, $k = 1, \ldots, n$, such that $EX_{n,k} = 0$ for all k and $v_n = \sum_{k=1}^n EX_{n,k}^2 \to 1$ as $n\to\infty$. Then, the conclusion (3.1.3) applies if for each $\epsilon > 0$,

(3.1.4) $\qquad g_n(\epsilon) = \sum_{k=1}^n E[X_{n,k}^2; |X_{n,k}| \ge \epsilon] \to 0$ as $n\to\infty$.
Note that the variables in different rows need not be independent of each other and could even be defined on different probability spaces.
Remark 3.1.4. Under the assumptions of Proposition 3.1.2, the variables $X_{n,k} = (nv)^{-1/2}(X_k - \mu)$ are mutually independent and such that
$$EX_{n,k} = (nv)^{-1/2}(EX_k - \mu) = 0\,, \qquad v_n = \sum_{k=1}^n EX_{n,k}^2 = \frac{1}{nv} \sum_{k=1}^n \mathrm{Var}(X_k) = 1\,.$$
Further, per fixed n these $X_{n,k}$ are identically distributed, so
$$g_n(\epsilon) = n E[X_{n,1}^2; |X_{n,1}| \ge \epsilon] = v^{-1} E\big[(X_1 - \mu)^2 I_{\{|X_1-\mu| \ge \epsilon\sqrt{nv}\}}\big]\,.$$
For each $\epsilon > 0$ the sequence $(X_1 - \mu)^2 I_{\{|X_1-\mu| \ge \epsilon\sqrt{nv}\}}$ converges a.s. to zero as $n\to\infty$ and is dominated by the integrable random variable $(X_1 - \mu)^2$. Thus, by dominated convergence, $g_n(\epsilon) \to 0$ as $n\to\infty$. We conclude that all assumptions of Theorem 3.1.3 are satisfied for this choice of $X_{n,k}$, hence Proposition 3.1.2 is a special instance of Lindeberg's clt, to which we turn our attention next.
Let $r_n = \max\{v_{n,k} : k = 1, \ldots, n\}$ for $v_{n,k} = EX_{n,k}^2$. Since for every n, k and $\epsilon > 0$,
$$v_{n,k} = EX_{n,k}^2 = E[X_{n,k}^2; |X_{n,k}| < \epsilon] + E[X_{n,k}^2; |X_{n,k}| \ge \epsilon] \le \epsilon^2 + g_n(\epsilon)\,,$$
it follows that
$$r_n \le \epsilon^2 + g_n(\epsilon) \qquad \forall n,\ \epsilon > 0\,,$$
hence Lindeberg's condition (3.1.4) implies that $r_n \to 0$ as $n\to\infty$.

Remark. Lindeberg proved Theorem 3.1.3, introducing the condition (3.1.4). Later, Feller proved that (3.1.3) plus $r_n \to 0$ implies that Lindeberg's condition holds. Together, these two results are known as the Feller–Lindeberg Theorem.
We see that the variables $X_{n,k}$ are of uniformly small variance for large n. So, considering independent random variables $Y_{n,k}$ that are also independent of the $X_{n,k}$ and such that each $Y_{n,k}$ has a $\mathcal N(0, v_{n,k})$ distribution, for a smooth function $h(\cdot)$ one may control $|Eh(\widehat S_n) - Eh(G_n)|$ by a Taylor expansion upon successively replacing the $X_{n,k}$ by $Y_{n,k}$. This indeed is the outline of Lindeberg's proof, whose core is the following lemma.
Lemma 3.1.5. For $h : \mathbb{R} \to \mathbb{R}$ of continuous and uniformly bounded second and third derivatives, $G_n$ having the $\mathcal N(0, v_n)$ law, every n and $\epsilon > 0$, we have that
$$|Eh(\widehat S_n) - Eh(G_n)| \le \Big(\frac{\epsilon}{6} + \frac{\sqrt{r_n}}{2}\Big)\, v_n \|h'''\|_\infty + g_n(\epsilon)\, \|h''\|_\infty\,,$$
with $\|f\|_\infty = \sup_{x\in\mathbb{R}} |f(x)|$ denoting the supremum norm.
Remark. Recall that $G_n \stackrel{D}{=} \sigma_n G$ for $\sigma_n = \sqrt{v_n}$. So, assuming $v_n \to 1$ and Lindeberg's condition, which implies that $r_n \to 0$ for $n\to\infty$, it follows from the lemma (upon taking $n\to\infty$ followed by $\epsilon \downarrow 0$) that $|Eh(\widehat S_n) - Eh(\sigma_n G)| \to 0$ as $n\to\infty$. Further, $|h(\sigma_n x) - h(x)| \le |\sigma_n - 1|\, |x|\, \|h'\|_\infty$, so that $|Eh(\sigma_n G) - Eh(G)| \le |\sigma_n - 1|\, \|h'\|_\infty E|G| \to 0$, and consequently,

(3.1.5) $\qquad \lim_{n\to\infty} Eh(\widehat S_n) = Eh(G)\,,$

for any continuous function $h(\cdot)$ of continuous and uniformly bounded first three derivatives. This is actually all we need from Lemma 3.1.5 in order to prove Lindeberg's clt. Further, as we show in Section 3.2, convergence in distribution as in (3.1.3) is equivalent to (3.1.5) holding for all continuous, bounded functions $h(\cdot)$.
Proof of Lemma 3.1.5. Let $G_n = \sum_{k=1}^n Y_{n,k}$ for mutually independent $Y_{n,k}$, distributed according to $\mathcal N(0, v_{n,k})$, that are independent of $\{X_{n,k}\}$. Fixing n and h, we simplify the notations by eliminating n, that is, we write $Y_k$ for $Y_{n,k}$, and $X_k$ for $X_{n,k}$. To facilitate the proof define the mixed sums
$$U_l = \sum_{k=1}^{l-1} X_k + \sum_{k=l+1}^{n} Y_k\,, \qquad l = 1, \ldots, n\,.$$
Note the following identities
$$G_n = U_1 + Y_1\,, \qquad U_l + X_l = U_{l+1} + Y_{l+1}\,,\ \ l = 1, \ldots, n-1\,, \qquad U_n + X_n = \widehat S_n\,,$$
which imply that

(3.1.6) $\qquad |Eh(G_n) - Eh(\widehat S_n)| = |Eh(U_1 + Y_1) - Eh(U_n + X_n)| \le \sum_{l=1}^n \Delta_l\,,$

where $\Delta_l = |E[h(U_l + Y_l) - h(U_l + X_l)]|$, for $l = 1, \ldots, n$. For any l and $\xi \in \mathbb{R}$, consider the remainder term
$$R_l(\xi) = h(U_l + \xi) - h(U_l) - \xi\, h'(U_l) - \frac{\xi^2}{2}\, h''(U_l)$$
in the second order Taylor expansion of $h(\cdot)$ at $U_l$. By Taylor's theorem, we have that
$$|R_l(\xi)| \le \|h'''\|_\infty\, \frac{|\xi|^3}{6} \quad \text{(from the third order expansion)}\,, \qquad |R_l(\xi)| \le \|h''\|_\infty\, |\xi|^2 \quad \text{(from the second order expansion)}\,,$$
whence,

(3.1.7) $\qquad |R_l(\xi)| \le \min\Big(\|h'''\|_\infty\, \frac{|\xi|^3}{6}\,,\ \|h''\|_\infty\, |\xi|^2\Big)\,.$

Considering the expectation of the difference between the two identities,
$$h(U_l + X_l) = h(U_l) + X_l\, h'(U_l) + \frac{X_l^2}{2}\, h''(U_l) + R_l(X_l)\,,$$
$$h(U_l + Y_l) = h(U_l) + Y_l\, h'(U_l) + \frac{Y_l^2}{2}\, h''(U_l) + R_l(Y_l)\,,$$
we get that
$$\Delta_l \le \big|E[(X_l - Y_l)\, h'(U_l)]\big| + \Big|E\Big[\Big(\frac{X_l^2}{2} - \frac{Y_l^2}{2}\Big)\, h''(U_l)\Big]\Big| + \big|E[R_l(X_l) - R_l(Y_l)]\big|\,.$$
Recall that $X_l$ and $Y_l$ are independent of $U_l$ and chosen such that $EX_l = EY_l$ and $EX_l^2 = EY_l^2$. As the first two terms in the bound on $\Delta_l$ vanish, we have that

(3.1.8) $\qquad \Delta_l \le E|R_l(X_l)| + E|R_l(Y_l)|\,.$
Further, utilizing (3.1.7),
$$E|R_l(X_l)| \le \|h'''\|_\infty\, E\Big[\frac{|X_l|^3}{6}; |X_l| < \epsilon\Big] + \|h''\|_\infty\, E[|X_l|^2; |X_l| \ge \epsilon] \le \frac{\epsilon}{6}\, \|h'''\|_\infty\, E[|X_l|^2] + \|h''\|_\infty\, E[X_l^2; |X_l| \ge \epsilon]\,.$$
Summing these bounds over $l = 1, \ldots, n$, by our assumption that $\sum_{l=1}^n EX_l^2 = v_n$ and the definition of $g_n(\epsilon)$, we get that

(3.1.9) $\qquad \sum_{l=1}^n E|R_l(X_l)| \le \frac{\epsilon}{6}\, v_n\, \|h'''\|_\infty + g_n(\epsilon)\, \|h''\|_\infty\,.$

Recall that $Y_l/\sqrt{v_{n,l}}$ is a standard normal random variable, whose fourth moment is 3 (see (1.3.18)). By monotonicity in q of the $L_q$ norms (c.f. Lemma 1.3.16), it follows that $E[|Y_l/\sqrt{v_{n,l}}|^3] \le 3$, hence $E|Y_l|^3 \le 3\, v_{n,l}^{3/2} \le 3\sqrt{r_n}\, v_{n,l}$. Utilizing once more (3.1.7) and the fact that $v_n = \sum_{l=1}^n v_{n,l}$, we arrive at

(3.1.10) $\qquad \sum_{l=1}^n E|R_l(Y_l)| \le \frac{\|h'''\|_\infty}{6} \sum_{l=1}^n E|Y_l|^3 \le \frac{\sqrt{r_n}}{2}\, v_n\, \|h'''\|_\infty\,.$

Plugging (3.1.8)–(3.1.10) into (3.1.6) completes the proof of the lemma.
In view of (3.1.5), Lindeberg's clt builds on the following elementary lemma, whereby we approximate the indicator function on $(-\infty, b]$ by continuous, bounded functions $h_k^{\pm} : \mathbb{R} \to \mathbb{R}$ for each of which Lemma 3.1.5 applies.

Lemma 3.1.6. There exist $h_k^{\pm}(x)$ of continuous and uniformly bounded first three derivatives, such that $0 \le h_k^{-}(x) \uparrow I_{(-\infty,b)}(x)$ and $1 \ge h_k^{+}(x) \downarrow I_{(-\infty,b]}(x)$ as $k\to\infty$.

Proof. There are many ways to prove this. Here is one which is from first principles, hence requires no analysis knowledge. The function $\psi : [0,1] \to [0,1]$ given by $\psi(x) = 140\int_x^1 u^3(1-u)^3\, du$ is monotone decreasing, with continuous derivatives of all order, such that $\psi(0) = 1$, $\psi(1) = 0$, and whose first three derivatives at 0 and at 1 are all zero. Its extension $\overline\psi(x) = \psi(\min(x,1)^+)$ to a function on $\mathbb{R}$ that is one for $x \le 0$ and zero for $x \ge 1$ is thus nonincreasing, with continuous and uniformly bounded first three derivatives. It is easy to check that the translated and scaled functions $h_k^{+}(x) = \overline\psi(k(x-b))$ and $h_k^{-}(x) = \overline\psi(k(x-b)+1)$ have all the claimed properties.
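The normalizing constant 140 in the proof of Lemma 3.1.6 can be confirmed by exact integration of $u^3(1-u)^3$; the following quick check (illustrative, using rational arithmetic) also verifies that $\psi$ decreases from 1 to 0:

```python
from fractions import Fraction

# psi(x) = 140 * int_x^1 u^3 (1-u)^3 du, where u^3(1-u)^3 = u^3 - 3u^4 + 3u^5 - u^6
def psi(x):
    x = Fraction(x)
    # antiderivative of the expanded integrand
    F = lambda t: t**4 / 4 - Fraction(3, 5) * t**5 + t**6 / 2 - t**7 / 7
    return 140 * (F(Fraction(1)) - F(x))

print(psi(0), psi(Fraction(1, 2)), psi(1))
```

By the symmetry of the integrand about $u = 1/2$, the midpoint value is exactly $1/2$.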
Proof of Theorem 3.1.3. Applying (3.1.5) for $h_k^{-}(\cdot)$, then taking $k\to\infty$, we have by monotone convergence that
$$\liminf_{n\to\infty} P(\widehat S_n < b) \ge \lim_{n\to\infty} E[h_k^{-}(\widehat S_n)] = E[h_k^{-}(G)] \uparrow F_G(b^-)\,.$$
Similarly, considering $h_k^{+}(\cdot)$, then taking $k\to\infty$, we have by bounded convergence that
$$\limsup_{n\to\infty} P(\widehat S_n \le b) \le \lim_{n\to\infty} E[h_k^{+}(\widehat S_n)] = E[h_k^{+}(G)] \downarrow F_G(b)\,.$$
Since $F_G(\cdot)$ is a continuous function, we conclude that $P(\widehat S_n \le b)$ converges to $F_G(b) = F_G(b^-)$, which is the statement of (3.1.3).
Example 3.1.8 (normal approximation of the Poisson). For $\lambda > 0$ let $N_\lambda$ denote a Poisson random variable of parameter $\lambda$ and, for $n = \lfloor\lambda\rfloor$, let $N_n$, $N_{\lambda-n}$ and $N_1$ be mutually independent Poisson variables of parameters $n$, $\lambda - n$ and 1, respectively. Since $N_\lambda \stackrel{D}{=} N_n + N_{\lambda-n}$ and $N_{n+1} \stackrel{D}{=} N_n + N_{\lambda-n} + N_1$, this provides a monotone coupling, that is, a construction of the random variables $N_n$, $N_\lambda$ and $N_{n+1}$ on the same probability space, such that $N_n \le N_\lambda \le N_{n+1}$. Because of this monotonicity, for any $\epsilon > 0$ and all $n \ge n_0(b, \epsilon)$ the event $\{(N_\lambda - \lambda)/\sqrt{\lambda} \le b\}$ is between $\{(N_{n+1} - (n+1))/\sqrt{n+1} \le b - \epsilon\}$ and $\{(N_n - n)/\sqrt{n} \le b + \epsilon\}$. Considering the limit as $n\to\infty$ followed by $\epsilon \downarrow 0$, it thus follows that the convergence $(N_n - n)/n^{1/2} \xrightarrow{D} G$ implies also that $(N_\lambda - \lambda)/\lambda^{1/2} \xrightarrow{D} G$ as $\lambda\to\infty$. In words, the normal distribution is a good approximation of a Poisson with large parameter.
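This normal approximation of the Poisson is easy to check numerically. The following sketch (illustrative only; $\lambda = 400$ is an arbitrary choice) sums the Poisson pmf up to $\lambda + b\sqrt{\lambda}$ and compares with the standard normal cdf at b:

```python
import math

lam, b = 400, 1.0
cutoff = int(lam + b * math.sqrt(lam))      # lam + b*sqrt(lam) = 420

# P(N_lam <= cutoff), accumulating e^(-lam) * lam^k / k! incrementally
p, term = 0.0, math.exp(-lam)
for k in range(cutoff + 1):
    p += term
    term *= lam / (k + 1)

phi_b = 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))   # standard normal cdf at b
print(round(p, 3), round(phi_b, 3))
```

The two probabilities agree to about two decimal places already at this parameter size; the residual gap is mostly the usual continuity correction.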
In Theorem 2.3.3 we established the strong law of large numbers when the summands $X_i$ are only pairwise independent. Unfortunately, as the next example shows, pairwise independence is not good enough for the clt.
Example 3.1.9. Consider i.i.d. $\xi_i$ such that $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$ for all i. Set $X_1 = \xi_1$ and successively let $X_{2^k+j} = X_j \xi_{k+2}$ for $j = 1, \ldots, 2^k$ and $k = 0, 1, \ldots$. Note that each $X_l$ is a $\{-1, 1\}$-valued variable, specifically, a product of a different finite subset of the $\xi_i$-s that corresponds to the positions of ones in the binary representation of $2l - 1$ (with $\xi_1$ for its least significant digit, $\xi_2$ for the next digit, etc.). Consequently, each $X_l$ is of zero mean and if $l \ne r$ then in $EX_l X_r$ at least one of the $\xi_i$-s will appear exactly once, resulting with $EX_l X_r = 0$, hence with $\{X_l\}$ being uncorrelated variables. Recall part (b) of Exercise 1.4.42, that such variables are pairwise independent. Further, $EX_l = 0$ and $X_l \in \{-1, 1\}$ mean that $P(X_l = 1) = P(X_l = -1) = 1/2$ are identically distributed. As for the zero mean variables $S_n = \sum_{j=1}^n X_j$, we have arranged things such that $S_1 = \xi_1$ and for any $k \ge 0$
$$S_{2^{k+1}} = \sum_{j=1}^{2^k} (X_j + X_{2^k+j}) = \sum_{j=1}^{2^k} X_j (1 + \xi_{k+2}) = S_{2^k}(1 + \xi_{k+2})\,,$$
hence $S_{2^k} = \xi_1 \prod_{i=2}^{k+1} (1 + \xi_i)$ for all $k \ge 1$. In particular, $S_{2^k} = 0$ unless $\xi_2 = \xi_3 = \cdots = \xi_{k+1} = 1$, an event of probability $2^{-k}$. Thus, $P(S_{2^k} \ne 0) = 2^{-k}$ and certainly the clt result (3.1.3) does not hold along the subsequence $n = 2^k$.
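The algebraic identity $S_{2^{k+1}} = S_{2^k}(1 + \xi_{k+2})$ driving Example 3.1.9 can be checked directly by building the array $X_l$ from a sampled sign sequence (an illustrative sketch; the code uses zero-based indices):

```python
import random

random.seed(7)

K = 6
xi = [random.choice([-1, 1]) for _ in range(K + 1)]   # xi[i-1] plays the role of xi_i

X = [xi[0]]                                           # X_1 = xi_1
for k in range(K):
    # X_{2^k + j} = X_j * xi_{k+2} for j = 1, ..., 2^k
    X += [X[j] * xi[k + 1] for j in range(2 ** k)]

S = sum(X)                                            # S_{2^K}
prod = xi[0]
for i in range(1, K + 1):
    prod *= 1 + xi[i]                                 # xi_1 * prod_{i=2}^{K+1} (1 + xi_i)
print(S == prod)
```

The equality holds for every sign sequence, since it is an exact consequence of the recursive construction rather than a statistical statement.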
We turn next to applications of Lindeberg's triangular array clt, starting with the asymptotics of the count of record events till time $n \gg 1$.

Exercise 3.1.10. Consider the count $R_n$ of record events during the first n instances of i.i.d. R.V.-s with a continuous distribution function, as in Example 2.2.27. Recall that $R_n = B_1 + \cdots + B_n$ for mutually independent Bernoulli random variables $B_k$ such that $P(B_k = 1) = 1 - P(B_k = 0) = k^{-1}$.
(a) Check that $b_n/\log n \to 1$ where $b_n = \mathrm{Var}(R_n)$.
(b) Show that Lindeberg's clt applies for $X_{n,k} = (\log n)^{-1/2}(B_k - k^{-1})$.
(c) Recall that $|ER_n - \log n| \le 1$, and conclude that $(R_n - \log n)/\sqrt{\log n} \xrightarrow{D} G$.

Remark. Let $\mathcal S_n$ denote the symmetric group of permutations on $\{1, \ldots, n\}$. For $s \in \mathcal S_n$ and $i \in \{1, \ldots, n\}$, denoting by $L_i(s)$ the smallest $j \ge 1$ such that $s^j(i) = i$, we call $\{s^j(i) : 1 \le j \le L_i(s)\}$ the cycle of s containing i. If each $s \in \mathcal S_n$ is equally likely, then the law of the number $T_n(s)$ of different cycles in s is the same as that of $R_n$ of Example 2.2.27 (for a proof see [Dur10, Example 2.2.4]). Consequently, Exercise 3.1.10 also shows that in this setting $(T_n - \log n)/\sqrt{\log n} \xrightarrow{D} G$.
Part (a) of the following exercise is a special case of Lindeberg's clt, known also as Lyapunov's theorem.

Exercise 3.1.11 (Lyapunov's theorem). Let $S_n = \sum_{k=1}^n X_k$ for $X_k$ mutually independent such that $v_n = \mathrm{Var}(S_n) < \infty$.
(a) Show that if there exists $q > 2$ such that
$$\lim_{n\to\infty} v_n^{-q/2} \sum_{k=1}^n E(|X_k - EX_k|^q) = 0\,,$$
then $v_n^{-1/2}(S_n - ES_n) \xrightarrow{D} G$.
(b) Use the preceding result to show that $n^{-1/2} S_n \xrightarrow{D} G$ when also $EX_k = 0$, $EX_k^2 = 1$ and $E|X_k|^q \le C$ for some $q > 2$, $C < \infty$ and $k = 1, 2, \ldots$.
(c) Show that $(\log n)^{-1/2} S_n \xrightarrow{D} G$ when the mutually independent $X_k$ are such that $P(X_k = 0) = 1 - k^{-1}$ and $P(X_k = 1) = P(X_k = -1) = 1/(2k)$.
The next application of Lindeberg's clt involves the use of truncation (which we have already introduced in the context of the weak law of large numbers), to derive the clt for normalized sums of certain i.i.d. random variables of infinite variance.

Proposition 3.1.12. Suppose $X_k$ are i.i.d. of symmetric distribution, that is $X_1 \stackrel{D}{=} -X_1$ (or $P(X_1 > x) = P(X_1 < -x)$ for all x), such that $P(|X_1| > x) = x^{-2}$ for $x \ge 1$. Then,
$$\frac{1}{\sqrt{n\log n}} \sum_{k=1}^n X_k \xrightarrow{D} G \qquad \text{as } n\to\infty\,.$$
Remark 3.1.13. Note that $\mathrm{Var}(X_1) = EX_1^2 = \int_0^\infty 2x\, P(|X_1| > x)\, dx = \infty$ (c.f. part (a) of Lemma 1.4.31), so the usual clt of Proposition 3.1.2 does not apply here. Indeed, the infinite variance of the summands results in a different normalization of the sums $S_n = \sum_{k=1}^n X_k$, tailored to the specific tail behavior of $x \mapsto P(|X_1| > x)$. Caution should be exercised here, since when $P(|X_1| > x) = x^{-\alpha}$ for some $\alpha \in (0, 2)$, the properly normalized sums converge to a non-normal ($\alpha$-stable) limit law.

Proof of Proposition 3.1.12. Applying Lindeberg's clt one first shows that if $b_n = \sqrt{n\log n}$ and $c_n \ge 1$ are such that both $c_n/b_n \to 0$ and $c_n/\sqrt{n} \to \infty$, then $b_n^{-1} \overline S_n \xrightarrow{D} G$ as $n\to\infty$, where $\overline S_n = \sum_{k=1}^n X_k I_{\{|X_k| \le c_n\}}$ is the sum of the truncated variables. We have chosen the truncation level $c_n$ large enough to assure that
$$P(\overline S_n \ne S_n) \le \sum_{k=1}^n P(|X_k| > c_n) = n\, P(|X_1| > c_n) = n\, c_n^{-2} \to 0$$
for $n\to\infty$, hence we may now conclude that $\frac{1}{\sqrt{n\log n}}\, S_n \xrightarrow{D} G$, as claimed.
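A simulation along the lines of Proposition 3.1.12 (illustrative only; the sample sizes are arbitrary and convergence in this heavy-tailed regime is slow) draws symmetric variables with $P(|X| > x) = x^{-2}$ and checks that $S_n/\sqrt{n\log n}$ lands in $[-1, 1]$ with frequency comparable to $P(|G| \le 1) \approx 0.68$:

```python
import math
import random

random.seed(3)

def draw():
    # symmetric, P(|X| > x) = x^(-2) for x >= 1: infinite variance
    mag = (1.0 - random.random()) ** -0.5
    return mag if random.random() < 0.5 else -mag

n, trials = 2000, 800
norm = math.sqrt(n * math.log(n))
inside = sum(abs(sum(draw() for _ in range(n)) / norm) <= 1.0
             for _ in range(trials))
frac = inside / trials
print(frac)
```

At these modest sizes the observed fraction sits a little below the Gaussian value, reflecting the slow (logarithmic) effect of the truncation level.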
We conclude this section with Kolmogorov's three series theorem, the most definitive result on the convergence of random series.

Theorem 3.1.14 (Kolmogorov's three series theorem). Suppose $X_k$ are independent random variables. For nonrandom $c > 0$ let $X_n^{(c)} = X_n I_{\{|X_n| \le c\}}$ be the corresponding truncated variables and consider the three series

(3.1.11) $\qquad \sum_n P(|X_n| > c)\,, \qquad \sum_n EX_n^{(c)}\,, \qquad \sum_n \mathrm{Var}(X_n^{(c)})\,.$

Then, the random series $\sum_n X_n$ converges a.s. if and only if for some $c > 0$ all three series of (3.1.11) converge.

Remark. By convergence of a series we mean the existence of a finite limit to the sum of its first m entries when $m\to\infty$. Note that the theorem implies that if all three series of (3.1.11) converge for some $c > 0$, then they necessarily converge for every $c > 0$.
Proof. We prove the sufficiency first, that is, assume that for some $c > 0$ all three series of (3.1.11) converge. By Theorem 2.3.16 and the finiteness of $\sum_n \mathrm{Var}(X_n^{(c)})$, it follows that the random series $\sum_n (X_n^{(c)} - EX_n^{(c)})$ converges a.s. Then, by our assumption that $\sum_n EX_n^{(c)}$ converges, also $\sum_n X_n^{(c)}$ converges a.s. Further, by assumption the sequence of probabilities $P(X_n \ne X_n^{(c)}) = P(|X_n| > c)$ is summable, hence by Borel–Cantelli I, we have that a.s. $X_n \ne X_n^{(c)}$ for at most finitely many n-s. The convergence a.s. of $\sum_n X_n^{(c)}$ thus results with the convergence a.s. of $\sum_n X_n$, as claimed.
We turn to prove the necessity of the convergence of the three series in (3.1.11) for the convergence of $\sum_n X_n$, which is where we use the clt. To this end, assume the random series $\sum_n X_n$ converges a.s. (to a finite limit) and fix an arbitrary constant $c > 0$. The convergence of $\sum_n X_n$ implies that $|X_n| \to 0$, hence a.s. $|X_n| > c$ for only finitely many n-s. In view of the independence of these events and Borel–Cantelli II, necessarily the sequence $P(|X_n| > c)$ is summable, that is, the series $\sum_n P(|X_n| > c)$ converges. Further, the convergence a.s. of $\sum_n X_n$ then results with the a.s. convergence of $\sum_n X_n^{(c)}$.

Suppose now that the nondecreasing sequence $v_n = \sum_{k=1}^n \mathrm{Var}(X_k^{(c)})$ is unbounded, in which case the latter convergence implies that a.s. $T_n = v_n^{-1/2} \sum_{k=1}^n X_k^{(c)} \to 0$ when $n\to\infty$. We further claim that in this case Lindeberg's clt applies for $\widehat S_n = \sum_{k=1}^n X_{n,k}$, where
$$X_{n,k} = v_n^{-1/2}(X_k^{(c)} - m_k^{(c)})\,, \qquad \text{and } m_k^{(c)} = EX_k^{(c)}\,.$$
Indeed, per fixed n the variables $X_{n,k}$ are mutually independent, of zero mean and such that $\sum_{k=1}^n EX_{n,k}^2 = 1$. Further, since $|X_k^{(c)}| \le c$ and we assumed that $v_n \uparrow \infty$, it follows that $|X_{n,k}| \le 2c/\sqrt{v_n} \to 0$ as $n\to\infty$, resulting with Lindeberg's condition holding (as $g_n(\epsilon) = 0$ when $\epsilon > 2c/\sqrt{v_n}$, i.e. for all n large enough). Combining Lindeberg's clt conclusion that $\widehat S_n \xrightarrow{D} G$ and $T_n \xrightarrow{a.s.} 0$, we deduce that $(\widehat S_n - T_n) \xrightarrow{D} G$ (c.f. Exercise 3.2.8). However, since $\widehat S_n - T_n = -v_n^{-1/2} \sum_{k=1}^n m_k^{(c)}$ are nonrandom, the sequence $P(\widehat S_n - T_n \le 0)$ is composed of zeros and ones, hence cannot converge to $P(G \le 0) = 1/2$. We arrive at a contradiction to our assumption that $v_n \uparrow \infty$, and so conclude that the sequence $\mathrm{Var}(X_n^{(c)})$ is summable, that is, the series $\sum_n \mathrm{Var}(X_n^{(c)})$ converges.

By Theorem 2.3.16, the summability of $\mathrm{Var}(X_n^{(c)})$ implies that the series $\sum_n (X_n^{(c)} - m_n^{(c)})$ converges a.s. We have already seen that $\sum_n X_n^{(c)}$ converges a.s., so it follows that their difference $\sum_n m_n^{(c)}$, which is the middle term of (3.1.11), converges as well.
3.2. Weak convergence

Focusing here on the theory of weak convergence, we first consider in Subsection 3.2.1 the convergence in distribution in a more general setting than that of the clt. This is followed by the study in Subsection 3.2.2 of weak convergence of probability measures and the theory associated with it, most notably its relation to other modes of convergence, such as convergence in total variation or pointwise convergence of probability density functions. We conclude by introducing in Subsection 3.2.3 the key concept of uniform tightness, which is instrumental to the derivation of weak convergence statements, as demonstrated in later sections of this chapter.
3.2.1. Convergence in distribution. Motivated by the clt, we explore here the convergence in distribution, its relation to convergence in probability, and some additional properties and examples in which the limiting law is not the normal law. To start off, here is the definition of convergence in distribution.

Definition 3.2.1. We say that R.V.-s $X_n$ converge in distribution to a R.V. $X_\infty$, denoted by $X_n \xrightarrow{D} X_\infty$, if $F_{X_n}(\alpha) \to F_{X_\infty}(\alpha)$ as $n\to\infty$ for each fixed $\alpha$ which is a continuity point of $F_{X_\infty}$. Similarly, we say that distribution functions $F_n$ converge weakly to $F_\infty$, denoted by $F_n \xrightarrow{w} F_\infty$, if $F_n(\alpha) \to F_\infty(\alpha)$ for each fixed $\alpha$ which is a continuity point of $F_\infty$.

Remark. If the limit R.V. $X_\infty$ has a continuous distribution function $F_{X_\infty}$, then the convergence in distribution is in fact uniform, that is, $\sup_x |F_{X_n}(x) - F_{X_\infty}(x)| \to 0$.
The clt is not the only example of convergence in distribution we have already met. Recall the Glivenko–Cantelli theorem (see Theorem 2.3.6), whereby a.s. the empirical distribution functions $\widehat F_n$ of an i.i.d. sequence of variables $X_i$ converge uniformly, hence pointwise, to the true distribution function $F_X$.

Here is an explicit necessary and sufficient condition for the convergence in distribution of integer valued random variables.

Exercise 3.2.3. Let $X_n$, $1 \le n \le \infty$, be integer valued R.V.-s. Show that $X_n \xrightarrow{D} X_\infty$ if and only if $P(X_n = k) \to P(X_\infty = k)$ for each $k \in \mathbb{Z}$.
In contrast with all of the preceding examples, we demonstrate next why the convergence $X_n \xrightarrow{D} X_\infty$ must be restricted to the continuity points of $F_{X_\infty}$; note also that $Eh(X_n)$ may converge to $Eh(X_\infty)$ or not, depending upon the choice of $h(\cdot)$, and even within the collection of continuous functions with image in $[-1, 1]$, the rate of this convergence is not uniform in h.

Example 3.2.4. The random variables $X_n = 1/n$ converge in distribution to $X_\infty = 0$. Indeed, it is easy to check that $F_{X_n}(\alpha) = I_{[1/n,\infty)}(\alpha)$ converge to $F_{X_\infty}(\alpha) = I_{[0,\infty)}(\alpha)$ at each $\alpha \ne 0$. However, there is no convergence at the discontinuity point $\alpha = 0$ of $F_{X_\infty}$, as $F_{X_\infty}(0) = 1$ while $F_{X_n}(0) = 0$ for all n. Further, $Eh(X_n) = h(\frac1n) \to h(0) = Eh(X_\infty)$ for any $h(\cdot)$ which is continuous at zero. So, in order for $X + 1/n$ to converge in distribution to X as $n\to\infty$, we have to restrict such convergence to the continuity points of the limiting distribution function $F_X$, as done in Definition 3.2.1.
We have seen in Examples 3.1.7 and 3.1.8 that the normal distribution is a good approximation for the Binomial and the Poisson distributions (when the corresponding parameter is large). Our next example is of the same type, now with the approximation of the Geometric distribution by the Exponential one.

Example 3.2.5 (Exponential approximation of the Geometric). Let $Z_p$ be a random variable with a Geometric distribution of parameter $p \in (0, 1)$, that is, $P(Z_p \ge k) = (1-p)^{k-1}$ for any positive integer k. As $p \downarrow 0$, we see that
$$P(p Z_p > t) = (1-p)^{\lfloor t/p \rfloor} \to e^{-t} \qquad \text{for all } t \ge 0\,.$$
That is, $p Z_p \xrightarrow{D} T$, with T having a standard exponential distribution. As $Z_p$ corresponds to the number of independent trials till the first occurrence of a specific event whose probability is p, this approximation corresponds to waiting for the occurrence of rare events.
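Example 3.2.5 amounts to the elementary limit $(1-p)^{\lfloor t/p\rfloor} \to e^{-t}$, which the following sketch checks for shrinking p (an illustrative check; all names are ours):

```python
import math

def geometric_tail(p, t):
    # P(p * Z_p > t) = (1 - p)^floor(t/p) for the Geometric(p) variable Z_p
    return (1.0 - p) ** math.floor(t / p)

t = 1.3
errs = [abs(geometric_tail(p, t) - math.exp(-t)) for p in (0.1, 0.01, 0.001)]
print(errs)
```

The error decreases roughly linearly in p, matching the first-order expansion $\log(1-p) = -p + O(p^2)$.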
At this point, you are to check that convergence in probability implies the convergence in distribution, which is hence weaker than all notions of convergence explored in Section 1.3.3 (and is perhaps a reason for naming it weak convergence). The converse cannot hold in general, for example because convergence in distribution does not require $X_n$ and $X_\infty$ to be defined on the same probability space.

Exercise 3.2.6. Show that if $X_n \xrightarrow{p} X_\infty$, then $X_n \xrightarrow{D} X_\infty$. Conversely, show that if $X_n \xrightarrow{D} X_\infty$ and $X_\infty$ is almost surely a nonrandom constant, then $X_n \xrightarrow{p} X_\infty$.
Further, as the next theorem shows, given $F_n \xrightarrow{w} F_\infty$, it is possible to construct random variables $Y_n$, $1 \le n \le \infty$, such that $F_{Y_n} = F_n$ and $Y_n \xrightarrow{a.s.} Y_\infty$. The catch of course is to construct the appropriate coupling, that is, to specify the relation between the different $Y_n$-s.

Theorem 3.2.7. Let $F_n$ be a sequence of distribution functions that converges weakly to $F_\infty$. Then there exist random variables $Y_n$, $1 \le n \le \infty$, on the same probability space, such that $F_{Y_n} = F_n$ for each $1 \le n \le \infty$ and $Y_n \xrightarrow{a.s.} Y_\infty$.
Proof. We use Skorokhod's representation as in the proof of Theorem 1.2.36. That is, for $\omega \in (0, 1]$ and $1 \le n \le \infty$ let $Y_n^+(\omega) \ge Y_n^-(\omega)$ be
$$Y_n^+(\omega) = \sup\{y : F_n(y) \le \omega\}\,, \qquad Y_n^-(\omega) = \sup\{y : F_n(y) < \omega\}\,.$$
While proving Theorem 1.2.36 we saw that $F_{Y_n^-} = F_n$ for any $n \le \infty$, and as remarked there, $Y_n^-(\omega) = Y_n^+(\omega)$ for all but at most countably many values of $\omega$, hence $P(Y_n^- = Y_n^+) = 1$. It thus suffices to show that for all $\omega \in (0, 1)$,

(3.2.1) $\qquad Y_\infty^+(\omega) \ge \limsup_{n\to\infty} Y_n^+(\omega) \ge \limsup_{n\to\infty} Y_n^-(\omega) \ge \liminf_{n\to\infty} Y_n^-(\omega) \ge Y_\infty^-(\omega)\,.$

Indeed, then $Y_n^+(\omega) \to Y_\infty^+(\omega)$ for any $\omega \in A = \{\omega : Y_\infty^+(\omega) = Y_\infty^-(\omega)\}$, where $P(A) = 1$. Hence, setting $Y_n = Y_n^+$ for $1 \le n \le \infty$ would complete the proof of the theorem.

Turning to prove (3.2.1), note that the two middle inequalities are trivial. Fixing $\omega \in (0, 1)$ we proceed to show that

(3.2.2) $\qquad Y_\infty^+(\omega) \ge \limsup_{n\to\infty} Y_n^+(\omega)\,.$

Since the continuity points of $F_\infty$ are dense, it suffices to show that if $z > Y_\infty^+(\omega)$ is a continuity point of $F_\infty$, then necessarily $z \ge Y_n^+(\omega)$ for all n large enough. To this end, note that $z > Y_\infty^+(\omega)$ implies by definition that $F_\infty(z) > \omega$. Since z is a continuity point of $F_\infty$ and $F_n \xrightarrow{w} F_\infty$, we know that $F_n(z) \to F_\infty(z)$. Hence, $F_n(z) > \omega$ for all sufficiently large n. By definition of $Y_n^+$ and monotonicity of $F_n$, this implies that $z \ge Y_n^+(\omega)$, as needed. The proof of
(3.2.3) $\qquad \liminf_{n\to\infty} Y_n^-(\omega) \ge Y_\infty^-(\omega)\,,$

is analogous. For $y < Y_\infty^-(\omega)$ we know by monotonicity of $F_\infty$ that $F_\infty(y) < \omega$. Assuming further that y is a continuity point of $F_\infty$, it follows that $F_n(y) < \omega$, hence $y \le Y_n^-(\omega)$, for all n large enough. Taking continuity points $y_k$ of $F_\infty$ such that $y_k \uparrow Y_\infty^-(\omega)$ completes the proof of (3.2.3), and of the theorem.

Exercise 3.2.8. Suppose $X_n \xrightarrow{D} X_\infty$ and $Y_n \xrightarrow{D} Y_\infty$, where $Y_\infty$ is nonrandom and for each n the variables $X_n$ and $Y_n$ are defined on the same probability space.
(a) Show that then $X_n + Y_n \xrightarrow{D} X_\infty + Y_\infty$.
Hint: Recall that the collection of continuity points of $F_{X_\infty}$ is dense.
(b) Deduce that if $Z_n - X_n \xrightarrow{D} 0$ then $X_n \xrightarrow{D} X$ if and only if $Z_n \xrightarrow{D} X$.
(c) Show that $Y_n X_n \xrightarrow{D} Y_\infty X_\infty$.
For example, here is an application of Exercise 3.2.8 en route to a clt connected to renewal theory.

Exercise 3.2.9.
(a) Suppose $N_m$ are nonnegative integer-valued random variables and $b_m$ are nonrandom integers such that $N_m/b_m \xrightarrow{p} 1$. Show that if $S_n = \sum_{k=1}^n X_k$ for i.i.d. random variables $X_k$ with $v = \mathrm{Var}(X_1) \in (0, \infty)$ and $E(X_1) = 0$, then $S_{N_m}/\sqrt{v b_m} \xrightarrow{D} G$ as $m\to\infty$.
Hint: Use Kolmogorov's inequality to show that $S_{N_m}/\sqrt{v b_m} - S_{b_m}/\sqrt{v b_m} \xrightarrow{p} 0$.
(b) Let $N_t = \sup\{n : S_n \le t\}$ for $S_n = \sum_{k=1}^n Y_k$ and i.i.d. random variables $Y_k > 0$ such that $v = \mathrm{Var}(Y_1) \in (0, \infty)$ and $E(Y_1) = 1$. Show that $(N_t - t)/\sqrt{vt} \xrightarrow{D} G$ as $t\to\infty$.
Theorem 3.2.7 is key to solving the following:

Exercise 3.2.10. Suppose that $Z_n \xrightarrow{D} Z_\infty$. Show that the truncated variables $Z_n^{(c)} = Z_n I_{\{|Z_n| \le c\}}$ satisfy $Z_n^{(c)} \xrightarrow{D} Z_\infty^{(c)}$ for any nonrandom $c > 0$ such that $P(|Z_\infty| = c) = 0$.
Consider the following exercise as a cautionary note about your interpretation of Theorem 3.2.7.

Exercise 3.2.11. Let $M_n = \sum_{k=1}^n \prod_{i=1}^k U_i$ and $W_n = \sum_{k=1}^n \prod_{i=k}^n U_i$, where $U_i$ are i.i.d., each uniformly distributed on $[0, c]$.
(a) Show that $W_n \stackrel{D}{=} M_n$ for each n.
(b) Show that if $c < e$ then $M_n \xrightarrow{a.s.} M_\infty$ as $n\to\infty$, with $M_\infty$ a.s. finite (whereas $EM_\infty$ is finite only for $c < 2$).
(c) In case $c < e$ prove that $W_n \xrightarrow{D} M_\infty$ as $n\to\infty$ while $W_n$ can not have an almost sure limit. Explain why this does not contradict Theorem 3.2.7.
The next exercise relates the decay (in n) of $\sup_s |F_{X_\infty}(s) - F_{X_n}(s)|$ to that of $\sup |Eh(X_n) - Eh(X_\infty)|$ over functions $h(\cdot)$ such that $\sup_x |h(x)| \le M$ and $\sup_x |h'(x)| \le L$.

Exercise 3.2.12. Let $\Delta_n = \sup_s |F_{X_\infty}(s) - F_{X_n}(s)|$.
(a) Show that if $\sup_x |h(x)| \le M$ and $\sup_x |h'(x)| \le L$, then for any $a < b$ and some finite $C = C(M, L, a, b)$,
$$|Eh(X_n) - Eh(X_\infty)| \le C\Delta_n + 4M\, P(X_\infty \notin [a, b])\,.$$
(b) Show that if $X_\infty \in [a, b]$ and $f_{X_\infty}(x) \ge \eta > 0$ for all $x \in [a, b]$, then $|Q_n(\omega) - Q_\infty(\omega)| \le \eta^{-1}\Delta_n$ for any $\omega \in (\Delta_n, 1-\Delta_n)$, where $Q_n(\omega) = \sup\{x : F_{X_n}(x) < \omega\}$ denotes the quantile for the law of $X_n$. Using this, construct $Y_n \stackrel{D}{=} X_n$ such that $P(|Y_n - Y_\infty| > \eta^{-1}\Delta_n) \le 2\Delta_n$ and deduce the bound of part (a), albeit with the larger value $4M + L/\eta$ of C.
Here is another example of convergence in distribution, this time in the context of extreme value theory.

Exercise 3.2.13. Let $M_n = \max_{1\le i\le n} T_i$, where $T_i$, $i = 1, 2, \ldots$ are i.i.d. random variables of distribution function $F_T(t)$. Noting that $F_{M_n}(x) = F_T(x)^n$, show that $b_n^{-1}(M_n - a_n) \xrightarrow{D} M_\infty$ when:
(a) $F_T(t) = 1 - e^{-t}$ for $t \ge 0$ (i.e. $T_i$ are Exponential of parameter one). Here, $a_n = \log n$, $b_n = 1$ and $F_{M_\infty}(y) = \exp(-e^{-y})$ for $y \in \mathbb{R}$.
(b) $F_T(t) = 1 - t^{-\alpha}$ for $t \ge 1$ and some $\alpha > 0$. Here, $a_n = 0$, $b_n = n^{1/\alpha}$ and $F_{M_\infty}(y) = \exp(-y^{-\alpha})$ for $y > 0$.
(c) $F_T(t) = 1 - |t|^{\alpha}$ for $-1 \le t \le 0$ and some $\alpha > 0$. Here, $a_n = 0$, $b_n = n^{-1/\alpha}$ and $F_{M_\infty}(y) = \exp(-|y|^{\alpha})$ for $y \le 0$.

Remark. Up to the linear transformation $y \mapsto (y - \mu)/\sigma$, the three distributions of $M_\infty$ provided in Exercise 3.2.13 are the only possible limits of maxima of i.i.d. random variables. They are thus called the extreme value distributions of Type 1 (or Gumbel-type), in case (a), Type 2 (or Frechet-type), in case (b), and Type 3 (or Weibull-type), in case (c). Indeed,
Exercise 3.2.14.
(a) Building upon part (a) of Exercise 2.2.24, show that if G has the standard normal distribution, then for any $y \in \mathbb{R}$
$$\lim_{t\to\infty} \frac{1 - F_G(t + y/t)}{1 - F_G(t)} = e^{-y}\,.$$
(b) Let $M_n = \max_{1\le i\le n} G_i$ for i.i.d. standard normal random variables $G_i$. Show that $b_n(M_n - b_n) \xrightarrow{D} M_\infty$ where $F_{M_\infty}(y) = \exp(-e^{-y})$ and $b_n$ is such that $1 - F_G(b_n) = n^{-1}$.
(c) Show that $b_n/\sqrt{2\log n} \to 1$.
(d) More generally, suppose $T_t = \inf\{x \ge 0 : M_x \ge t\}$, where $x \mapsto M_x$ is some monotone nondecreasing family of random variables such that $M_0 = 0$. Show that if $e^{-t} T_t \xrightarrow{D} T_\infty$ as $t\to\infty$ with $T_\infty$ having the standard exponential distribution, then $(M_x - \log x) \xrightarrow{D} M_\infty$ as $x\to\infty$, where $F_{M_\infty}(y) = \exp(-e^{-y})$.
Our next example is of a more combinatorial flavor.

Exercise 3.2.15 (The birthday problem). Suppose $X_i$ are i.i.d. with each $X_i$ uniformly distributed on $\{1, \ldots, n\}$. Let $T_n = \min\{k : X_k = X_l$, for some $l < k\}$ mark the first coincidence among the entries of the sequence $X_1, X_2, \ldots$, so
$$P(T_n > r) = \prod_{k=2}^{r} \Big(1 - \frac{k-1}{n}\Big)\,,$$
is the probability that among r items chosen uniformly and independently from a set of n different objects, no two are the same (the name "birthday problem" corresponds to $n = 365$ with the items interpreted as the birthdays for a group of size r). Show that $P(n^{-1/2} T_n > s) \to \exp(-s^2/2)$ as $n\to\infty$, for any fixed $s \ge 0$.
Hint: Recall that $-x - x^2 \le \log(1-x) \le -x$ for $x \in [0, 1/2]$.
The symmetric, simple random walk on the integers is the sequence of random variables S_n = ∑_{k=1}^{n} ξ_k, where ξ_k are i.i.d. such that P(ξ_k = 1) = P(ξ_k = −1) = 1/2. From the clt we already know that n^{−1/2}S_n →D G. The next exercise provides the asymptotics of the first and last visits to zero by this random sequence, namely R = inf{ℓ ≥ 1 : S_ℓ = 0} and L_n = sup{ℓ ≤ n : S_ℓ = 0}.

Exercise 3.2.16. Using Stirling's formula (that is, the fact that √(2πn)(n/e)^n/n! → 1 as n → ∞), show that √(πn) P(R > 2n) → 1 and that (2n)^{−1}L_{2n} →D X, where X has the arcsine probability density function f_X(x) = 1/(π√(x(1 − x))) on [0, 1]. Further, let H_{2n} count the number of 1 ≤ k ≤ 2n such that S_k ≥ 0 and S_{k−1} ≥ 0. Show that H_{2n} =D L_{2n}, hence also (2n)^{−1}H_{2n} →D X.
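A short, seeded simulation (an illustration, not part of the text) makes the arcsine limit visible: the empirical distribution of (2n)^{−1}L_{2n} is compared with the arcsine distribution function (2/π) arcsin(√x).

```python
import math, random

random.seed(7)

def last_zero(steps):
    # position of the last visit to zero in a simple random walk of given length
    s, last = 0, 0
    for k in range(1, steps + 1):
        s += random.choice((-1, 1))
        if s == 0:
            last = k
    return last

steps, reps = 500, 4000
samples = sorted(last_zero(steps) / steps for _ in range(reps))

def arcsine_cdf(x):
    return (2.0 / math.pi) * math.asin(math.sqrt(x))

# crude one-sided Kolmogorov-Smirnov style distance to the arcsine law
ks = max(abs((i + 1) / reps - arcsine_cdf(x)) for i, x in enumerate(samples))
```

The residual distance reflects both sampling noise and the lattice effect of the finite walk (e.g. the atom at 0 of size roughly 1/√(πn)).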
3.2. WEAK CONVERGENCE 109
3.2.2. Weak convergence of probability measures. We first extend the definition of weak convergence from distribution functions to measures on Borel σ-algebras.

Definition 3.2.17. For a topological space S, let C_b(S) denote the collection of all continuous bounded functions on S. We say that a sequence of probability measures ν_n on a topological space S, equipped with its Borel σ-algebra (see Example 1.1.15), converges weakly to a probability measure ν_∞, denoted ν_n →w ν_∞, if ν_n(h) → ν_∞(h) for each h ∈ C_b(S).
As we show next, Definition 3.2.17 is an alternative definition of convergence in distribution, which, in contrast to Definition 3.2.1, applies to more general R.V. (for example, to the R^d-valued random variables we consider in Section 3.5).

Proposition 3.2.18. The weak convergence of distribution functions is equivalent to the weak convergence of the corresponding laws as probability measures on (R, B). Consequently, X_n →D X_∞ if and only if E h(X_n) → E h(X_∞) as n → ∞, for each h ∈ C_b(R).
Proof. Suppose first that F_n →w F_∞ and let Y_n, 1 ≤ n ≤ ∞, be the random variables given by Theorem 3.2.7, such that Y_n →a.s. Y_∞. For h ∈ C_b(R) we have, by continuity of h, that h(Y_n) →a.s. h(Y_∞), so by bounded convergence, P_{X_n}(h) = E h(Y_n) → E h(Y_∞) = P_{X_∞}(h).
Conversely, suppose that P_{X_n}(h) → P_{X_∞}(h) for each h ∈ C_b(R). Fixing a continuity point α of F_∞, let h^−_k, h^+_k ∈ C_b(R) be such that h^−_k(x) ↑ I_{(−∞,α)}(x) and h^+_k(x) ↓ I_{(−∞,α]}(x) as k → ∞ (c.f. Lemma 3.1.6 for a construction of such functions). We have, by the weak convergence of the laws when n → ∞, followed by monotone convergence as k → ∞, that

liminf_{n→∞} P_{X_n}((−∞, α)) ≥ lim_{n→∞} P_{X_n}(h^−_k) = P_{X_∞}(h^−_k) ↑ P_{X_∞}((−∞, α)) = F_∞(α⁻).
Similarly, considering h^+_k(·) and then k → ∞, we have by bounded convergence that

limsup_{n→∞} P_{X_n}((−∞, α]) ≤ lim_{n→∞} P_{X_n}(h^+_k) = P_{X_∞}(h^+_k) ↓ P_{X_∞}((−∞, α]) = F_∞(α).
For any continuity point α of F_∞ we conclude that F_n(α) = P_{X_n}((−∞, α]) converges as n → ∞ to F_∞(α) = F_∞(α⁻), that is, F_n →w F_∞.

Proposition 3.2.19 (continuous mapping). If X_n →D X_∞ and the set D_g of points of discontinuity of a Borel function g : R → R is such that P(X_∞ ∈ D_g) = 0, then g(X_n) →D g(X_∞).
Proof. Given X_n →D X_∞, let Y_n, 1 ≤ n ≤ ∞, be random variables such that Y_n =D X_n for each n and Y_n →a.s. Y_∞, as provided by Theorem 3.2.7. Fixing h ∈ C_b(R), clearly D_{h∘g} ⊆ D_g, so

P(Y_∞ ∈ D_{h∘g}) ≤ P(Y_∞ ∈ D_g) = 0.
Therefore, by Exercise 2.2.12, it follows that h(g(Y_n)) →a.s. h(g(Y_∞)). Since h ∘ g is bounded and Y_n =D X_n for all n, it follows by bounded convergence that

E h(g(X_n)) = E h(g(Y_n)) → E h(g(Y_∞)) = E h(g(X_∞)).

This holds for any h ∈ C_b(R), so by Proposition 3.2.18 we conclude that g(X_n) →D g(X_∞).
Our next theorem collects several equivalent characterizations of weak convergence of probability measures on (R, B). To this end we need the following definition.

Definition 3.2.20. For a subset A of a topological space S, we denote by ∂A the boundary of A, that is, ∂A = Ā \ A° is the closed set of points in the closure Ā of A but not in the interior A° of A. For a measure μ on (S, B_S), we say that A ∈ B_S is a μ-continuity set if μ(∂A) = 0.
Theorem 3.2.21 (portmanteau theorem). The following four statements are equivalent for any probability measures ν_n, 1 ≤ n ≤ ∞, on (R, B).
(a) ν_n →w ν_∞
(b) For every closed set F, one has limsup_{n→∞} ν_n(F) ≤ ν_∞(F)
(c) For every open set G, one has liminf_{n→∞} ν_n(G) ≥ ν_∞(G)
(d) For every ν_∞-continuity set A, one has lim_{n→∞} ν_n(A) = ν_∞(A)

Remark. As shown in Subsection 3.5.1, this theorem holds with (R, B) replaced by any metric space S and its Borel σ-algebra B_S.
For ν_n = P_{X_n} we get the formulation of the portmanteau theorem for random variables X_n, 1 ≤ n ≤ ∞, where the following four statements are then equivalent to X_n →D X_∞:
(a) E h(X_n) → E h(X_∞) for each h ∈ C_b(R)
(b) For every closed set F, one has limsup_{n→∞} P(X_n ∈ F) ≤ P(X_∞ ∈ F)
(c) For every open set G, one has liminf_{n→∞} P(X_n ∈ G) ≥ P(X_∞ ∈ G)
(d) For every Borel set A such that P(X_∞ ∈ ∂A) = 0, one has lim_{n→∞} P(X_n ∈ A) = P(X_∞ ∈ A)
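The role of the boundary condition in part (d) can be seen already for the point masses δ_{1/n}, which converge weakly to δ_0 (Example 3.2.4). A tiny sketch (illustrative only; sets are encoded as membership predicates):

```python
def mu_n(A, n):
    # mu_n is the point mass at 1/n, so mu_n(A) = I{1/n in A}
    return 1.0 if A(1.0 / n) else 0.0

def mu_inf(A):
    # weak limit: point mass at 0
    return 1.0 if A(0.0) else 0.0

half_line = lambda x: x <= 0.0   # closed set whose boundary {0} carries all the limit mass
nice_set = lambda x: x <= 0.5    # boundary {0.5} is a null set of the limit: a continuity set

bad_gap = mu_inf(half_line) - mu_n(half_line, 10**9)   # (d) fails for this A
good_gap = mu_inf(nice_set) - mu_n(nice_set, 10**9)    # (d) holds for this A
```

Note that part (b) is still respected: limsup_n μ_n((−∞, 0]) = 0 ≤ 1 = μ_∞((−∞, 0]).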
Proof. It suffices to show that (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a), which we shall establish in that order. To this end, with F_n(x) = ν_n((−∞, x]) denoting the corresponding distribution functions, we replace ν_n →w ν_∞ by the equivalent F_n →w F_∞ (see Proposition 3.2.18), and let Y_n, 1 ≤ n ≤ ∞, be the random variables given by Theorem 3.2.7, such that Y_n has the distribution function F_n and Y_n →a.s. Y_∞. Since F is closed, limsup_n I_F(Y_n) ≤ I_F(Y_∞) whenever Y_n → Y_∞, and hence

limsup_{n→∞} ν_n(F) = limsup_{n→∞} E I_F(Y_n) ≤ E[limsup_{n→∞} I_F(Y_n)] ≤ E I_F(Y_∞) = ν_∞(F),

as stated in (b).
(b) ⇒ (c). The complement F = G^c of an open set G is a closed set, so by (b) we have that

1 − liminf_{n→∞} ν_n(G) = limsup_{n→∞} ν_n(G^c) ≤ ν_∞(G^c) = 1 − ν_∞(G),

implying that (c) holds. In an analogous manner we can show that (c) ⇒ (b), so (b) and (c) are equivalent.
(c) ⇒ (d). Since (b) and (c) are equivalent, we assume now that both (b) and (c) hold. Then, applying (c) for the open set G = A° and (b) for the closed set F = Ā, we have that

(3.2.4)  ν_∞(Ā) ≥ limsup_{n→∞} ν_n(Ā) ≥ limsup_{n→∞} ν_n(A) ≥ liminf_{n→∞} ν_n(A) ≥ liminf_{n→∞} ν_n(A°) ≥ ν_∞(A°).

Further, Ā = A° ∪ ∂A, so if ν_∞(∂A) = 0 then

ν_∞(Ā) = ν_∞(A°) = ν_∞(A)

(with the last equality due to the fact that A° ⊆ A ⊆ Ā). Consequently, for such a set A all the inequalities in (3.2.4) are equalities, yielding (d).
(d) ⇒ (a). Consider the set A = (−∞, α], where α is a continuity point of F_∞. Then, ∂A = {α} and ν_∞(∂A) = F_∞(α) − F_∞(α⁻) = 0, so by (d),

F_n(α) = ν_n((−∞, α]) → ν_∞((−∞, α]) = F_∞(α),

which is our version of (a).
We turn to relate weak convergence to the pointwise convergence of probability density functions. To this end, we first define a new concept of convergence of measures, the convergence in total-variation.

Definition 3.2.22. The total variation norm of a finite signed measure ν on the measurable space (S, F) is

‖ν‖_tv = sup{ν(h) : h ∈ mF, sup_{s∈S} |h(s)| ≤ 1}.

We say that a sequence of probability measures ν_n converges in total variation to a probability measure ν_∞, denoted ν_n →t.v. ν_∞, if ‖ν_n − ν_∞‖_tv → 0.
Remark. Note that ‖ν‖_tv = 1 for any probability measure ν (since ν(h) ≤ ν(|h|) ≤ ν(1) ≤ 1 for the functions h considered, with equality for h = 1). By a similar reasoning, ‖ν − μ‖_tv ≤ 2 for any two probability measures ν, μ on (S, F).
Convergence in total-variation obviously implies weak convergence of the same probability measures, but the converse fails, as demonstrated for example by ν_n = δ_{1/n}, the probability measure on (R, B) assigning probability one to the point 1/n, which converges weakly to ν_∞ = δ_0 (see Example 3.2.4), whereas ‖ν_n − ν_∞‖_tv = 2 for all n. The difference of course has to do with the non-uniformity of the weak convergence with respect to the continuous function h.
To gain a better understanding of the convergence in total-variation, we consider an important special case.

Proposition 3.2.23. Suppose P = fμ and Q = gμ for some measure μ on (S, F) and f, g ∈ mF_+ such that μ(f) = μ(g) = 1. Then,

(3.2.5)  ‖P − Q‖_tv = ∫_S |f(s) − g(s)| dμ(s).

Further, suppose ν_n = f_n μ with f_n ∈ mF_+ such that μ(f_n) = 1 for all n ≤ ∞. Then, ν_n →t.v. ν_∞ if f_n(s) → f_∞(s) for μ-almost-every s ∈ S.

Proof of the second claim. By (3.2.5), ‖ν_n − ν_∞‖_tv = μ(|f_n − f_∞|), which converges to zero precisely when f_n → f_∞ in L^1(S, F, μ). Since f_n ≥ 0 and μ(f_n) = 1 for any n ≤ ∞, it follows from Scheffé's lemma (see Lemma 1.3.35) that the latter convergence is a consequence of f_n(s) → f_∞(s) for μ-almost-every s ∈ S.

Example 3.2.24. An important instance of Proposition 3.2.23 is S = R with μ taken to be Lebesgue's measure, in which case the pointwise convergence of the probability density functions f_n of X_n to the density f_∞ of X_∞ implies the convergence in total-variation of the corresponding laws.
Example 3.2.25. Similarly, if X_n are integer valued for n = 1, 2, . . ., then ν_n = f_n μ for f_n(k) = P(X_n = k) and the counting measure μ on (Z, 2^Z), such that μ({k}) = 1 for each k ∈ Z. So, by the preceding proposition, the pointwise convergence of Exercise 3.2.3 is not only necessary and sufficient for weak convergence, but also for convergence in total-variation of the laws of X_n to that of X_∞.
In the next exercise, you are to rephrase Example 3.2.25 in terms of the topological space of all probability measures on Z.

Exercise 3.2.26. Show that d(ν, μ) = ‖ν − μ‖_tv is a metric on the collection of all probability measures on Z, and that in this space the convergence in total variation is equivalent to the weak convergence, which in turn is equivalent to the pointwise convergence at each x ∈ Z.

Hence, under the framework of Example 3.2.25, the Glivenko-Cantelli theorem tells us that the empirical measures of integer valued i.i.d. R.V.s X_i converge in total-variation to the true law of X_1.
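For integer valued variables, formula (3.2.5) with the counting measure makes total-variation distances directly computable. As a hypothetical illustration (not taken from the text), the sketch below evaluates ∑_k |f_n(k) − g(k)| for Binomial(n, λ/n) versus Poisson(λ) pmfs, a pair whose total-variation distance shrinks as n grows (Poisson approximation is the theme of Section 3.4).

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_binom_poisson(n, lam, kmax=60):
    # ||P - Q||_tv = sum_k |f(k) - g(k)|, as in (3.2.5) with the counting
    # measure; both pmfs are negligible beyond kmax for moderate lam
    return sum(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
               for k in range(kmax + 1))

dists = [tv_binom_poisson(n, 2.0) for n in (10, 100, 1000)]
```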
Here is an example from statistics that corresponds to the framework of Example 3.2.24.

Exercise 3.2.27. Let V_{n+1} denote the central value on a list of 2n + 1 values (that is, the (n + 1)-th largest value on the list). Suppose the list consists of mutually independent R.V., each chosen uniformly in [0, 1).
(a) Show that V_{n+1} has the probability density function (2n + 1)(2n choose n) v^n (1 − v)^n at each v ∈ [0, 1).
(b) Verify that the density f_n(v) of V̂_n = √(2n)(2V_{n+1} − 1) is of the form f_n(v) = c_n(1 − v²/(2n))^n for some normalization constant c_n that is independent of |v| ≤ √(2n).
(c) Deduce that for n → ∞ the densities f_n(v) converge pointwise to the standard normal density, and conclude that V̂_n →D G.
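Parts (b) and (c) can be checked numerically. The sketch below is an illustration under the assumption that the change of variables in part (b) gives c_n = (2n + 1)(2n choose n)/(4^n · 2√(2n)); it evaluates f_n at a few fixed points against the standard normal density.

```python
import math

def f_n(v, n):
    # density of sqrt(2n)(2 V_{n+1} - 1): c_n (1 - v^2/(2n))^n for |v| <= sqrt(2n)
    ratio = math.comb(2 * n, n) / (4**n)      # exact big-int quotient -> float
    c_n = (2 * n + 1) * ratio / (2.0 * math.sqrt(2.0 * n))
    return c_n * (1.0 - v * v / (2.0 * n)) ** n

def normal_pdf(v):
    return math.exp(-v * v / 2.0) / math.sqrt(2.0 * math.pi)

errs = [max(abs(f_n(v, n) - normal_pdf(v)) for v in (0.0, 0.5, 1.5))
        for n in (10, 100, 1000)]
```

Keeping the central binomial coefficient and 4^n as exact integers before dividing avoids floating-point overflow for large n.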
Here is an interesting interpretation of the clt in terms of weak convergence of probability measures.

Exercise 3.2.28. Let M denote the set of probability measures ν on (R, B) for which ∫ x² dν(x) = 1 and ∫ x dν(x) = 0, and let γ ∈ M denote the standard normal distribution. Consider the mapping T : M → M, where T(ν) is the law of (X_1 + X_2)/√2 for X_1 and X_2 i.i.d., each of law ν. Explain why the clt implies that T^m(ν) →w γ as m → ∞, for any ν ∈ M. Show that T(γ) = γ (see Lemma 3.1.1), and explain why γ is the unique, globally attracting fixed point of T in M.
Your next exercise is the basis behind the celebrated method of moments for weak convergence.

Exercise 3.2.29. Suppose that X and Y are [0, 1]-valued random variables such that E(X^n) = E(Y^n) for n = 0, 1, 2, . . ..
(a) Show that Ep(X) = Ep(Y) for any polynomial p(·).
(b) Show that Eh(X) = Eh(Y) for any continuous function h : [0, 1] → R, and deduce that X =D Y.
Hint: Recall Weierstrass approximation theorem, that if h is continuous on [0, 1] then there exist polynomials p_n such that sup_{x∈[0,1]} |h(x) − p_n(x)| → 0 as n → ∞.
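The Weierstrass theorem invoked in the hint has a constructive proof via Bernstein polynomials, B_n h(x) = ∑_k h(k/n)(n choose k)x^k(1 − x)^{n−k} = E h(Z_n/n) for Z_n binomial(n, x), which is itself a nice application of the weak law of large numbers. A small sketch (illustration only, with a hypothetical test function):

```python
import math

def bernstein(h, n, x):
    # Bernstein polynomial B_n h(x) = sum_k h(k/n) C(n,k) x^k (1-x)^(n-k)
    return sum(h(k / n) * math.comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k in range(n + 1))

h = lambda x: abs(x - 0.5)          # continuous on [0,1], not a polynomial
grid = [i / 200.0 for i in range(201)]
sup_err = [max(abs(bernstein(h, n, x) - h(x)) for x in grid)
           for n in (4, 16, 64, 256)]
```

The sup-norm error decays like 1/√n for this kinked h, slow but sufficient for the qualitative statement of the hint.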
We conclude with the following example about weak convergence of measures in the space of infinite binary sequences.

Exercise 3.2.30. Consider the topology of coordinate-wise convergence on S = {0, 1}^N and the Borel probability measures ν_n on S, where ν_n is the uniform measure over the (2n choose n) binary sequences of precisely n ones among the first 2n coordinates, followed by zeros from position 2n + 1 onwards. Show that ν_n →w ν_∞, where ν_∞ denotes the law of an infinite sequence of i.i.d. random variables, each taking the values 0 and 1 with probability 1/2.

Definition 3.2.31. A probability measure μ on a topological space S (equipped with its Borel σ-algebra) is called tight if for each ε > 0 there exists a compact set K_ε ⊆ S such that μ(K_ε^c) < ε. A collection of probability measures is called uniformly tight if for each ε > 0 one compact set K_ε serves all of them, that is, μ(K_ε^c) < ε for every measure μ in the collection (in contrast to the possibly different compact sets that are needed for the different measures).

Further, in the case of S = R which we study here, one can take without loss of generality the compact K_ε to be a symmetric bounded interval [−M_ε, M_ε] (or an open interval (−M_ε, M_ε) whose closure is compact) in order to simplify notations. Thus, expressing uniform tightness in terms of the corresponding distribution functions leads in this setting to the following alternative definition.
Definition 3.2.32. A sequence of distribution functions F_n is called uniformly tight, if for every ε > 0 there exists M = M_ε such that

limsup_{n→∞} [1 − F_n(M) + F_n(−M)] < ε.

Remark. As most texts use, in the context of Definition 3.2.32, tight (or tight sequence) instead of uniformly tight, we shall adopt the same convention here.

Uniform tightness of distribution functions has some structural resemblance to the U.I. condition (1.3.11). As such, we have the following simple sufficient condition for uniform tightness (which is the analog of Exercise 1.3.54).
Exercise 3.2.33. A sequence of probability measures ν_n on (R, B) is uniformly tight if sup_n ν_n(f(|x|)) is finite for some non-negative Borel function f such that f(r) → ∞ as r → ∞. Alternatively, if sup_n Ef(|X_n|) < ∞ for such a function f, then the distribution functions F_{X_n} form a tight sequence.
The importance of uniform tightness is that it guarantees the existence of limit points for weak convergence.

Theorem 3.2.34 (Prohorov theorem). A collection Γ of probability measures on a complete, separable metric space S, equipped with its Borel σ-algebra B_S, is uniformly tight if and only if for any sequence ν_m ∈ Γ there exists a subsequence ν_{m_k} that converges weakly to some probability measure ν_∞ on (S, B_S) (where ν_∞ is not necessarily in Γ and may depend on the subsequence m_k).

Remark. For a proof of Prohorov's theorem, which is beyond the scope of these notes, see [Dud89, Theorem 11.5.4].

Instead of Prohorov's theorem, we prove here a bare-hands substitute for the special case S = R. When doing so, it is convenient to have the following notion of convergence of distribution functions.
Definition 3.2.35. When a sequence F_n of distribution functions converges, at each continuity point of F_∞, to a right continuous, non-decreasing function F_∞, we say that F_n converges vaguely to F_∞, denoted F_n →v F_∞.

In contrast with weak convergence, the vague convergence allows for the limit F_∞(x) = ν_∞((−∞, x]) to correspond to a measure ν_∞ such that ν_∞(R) < 1.
Example 3.2.36. Suppose F_n = pI_{[n,∞)} + qI_{[−n,∞)} + (1 − p − q)F for some p, q ≥ 0 such that p + q ≤ 1 and a distribution function F that is independent of n. It is easy to check that F_n →v F_∞ as n → ∞, where F_∞ = q + (1 − p − q)F.

Theorem 3.2.37 (Helly's selection theorem). For every sequence F_n of distribution functions there exists a subsequence F_{n_k} and a right continuous, non-decreasing function F_∞ such that F_{n_k}(y) → F_∞(y) as k → ∞ at each continuity point y of F_∞, that is, F_{n_k} →v F_∞.
Deferring the proof of Helly's theorem to the end of this section, we show next that uniform tightness is exactly what prevents probability mass from escaping to ±∞, thus assuring the existence of limit points for weak convergence.

Lemma 3.2.38. The sequence of distribution functions F_n is uniformly tight if and only if each vague limit point of this sequence is a distribution function. That is, if and only if whenever F_{n_k} →v F, necessarily 1 − F(x) + F(−x) → 0 as x → ∞.
Proof. Suppose first that F_n is uniformly tight and F_{n_k} →v F. Fixing ε > 0, there exist continuity points r_1 < −M_ε and r_2 > M_ε of F, for which, by Definition 3.2.32 and the monotonicity of each F_n,

1 − F(r_2) + F(r_1) = lim_{k→∞} (1 − F_{n_k}(r_2) + F_{n_k}(r_1)) ≤ limsup_{n→∞} (1 − F_n(M_ε) + F_n(−M_ε)) < ε.

It follows that limsup_{x→∞} (1 − F(x) + F(−x)) ≤ ε, and since ε > 0 is arbitrarily small, F must be a distribution function of some probability measure on (R, B).
Conversely, suppose F_n is not uniformly tight, in which case by Definition 3.2.32, for some ε > 0 and n_k → ∞,

(3.2.6)  1 − F_{n_k}(k) + F_{n_k}(−k) ≥ ε for all k.
By Helly's theorem, there exists a vague limit point F of F_{n_k} as k → ∞. That is, for some k_l → ∞ as l → ∞, we have that F_{n_{k_l}} →v F. For any two continuity points r_1 < 0 < r_2 of F, we thus have, by the definition of vague convergence, the monotonicity of F_{n_{k_l}}, and (3.2.6), that

1 − F(r_2) + F(r_1) = lim_{l→∞} (1 − F_{n_{k_l}}(r_2) + F_{n_{k_l}}(r_1)) ≥ liminf_{l→∞} (1 − F_{n_{k_l}}(k_l) + F_{n_{k_l}}(−k_l)) ≥ ε.

Considering now r = min(−r_1, r_2) → ∞, this shows that inf_r (1 − F(r) + F(−r)) ≥ ε, hence the vague limit point F cannot be a distribution function of a probability measure on (R, B).
Remark. Comparing Definitions 3.2.31 and 3.2.32, we see that if a collection of probability measures on (R, B) is uniformly tight, then for any sequence ν_m in this collection, the corresponding sequence F_m of distribution functions is uniformly tight. In view of Lemma 3.2.38 and Helly's theorem, this implies the existence of a subsequence m_k and a distribution function F_∞ such that F_{m_k} →w F_∞. By Proposition 3.2.18 we deduce that ν_{m_k} →w ν_∞, where ν_∞ is the probability measure whose distribution function is F_∞.

Proof of Helly's theorem. By a diagonal selection, there exists a subsequence n_k such that F_{n_k}(q) → H(q) as k → ∞ for each rational q, and clearly H : Q → [0, 1] is non-decreasing. Let F_∞(x) = inf{H(q) : q ∈ Q, q > x}, which is non-decreasing as well. To check that F_∞ is right continuous, note that if x_n ↓ x then

inf_n F_∞(x_n) = inf{H(q) : q ∈ Q, q > x_n for some n} = inf{H(q) : q ∈ Q, q > x} = F_∞(x).
Suppose that x is a continuity point of the non-decreasing function F_∞. Then, for any ε > 0 there exists y < x such that F_∞(x) − ε < F_∞(y), and rational numbers r_1 ∈ (y, x) and r_2 > x such that

(3.2.7)  F_∞(x) − ε < F_∞(y) ≤ H(r_1) ≤ H(r_2) < F_∞(x) + ε.

Recall that F_{n_k}(x) ∈ [F_{n_k}(r_1), F_{n_k}(r_2)] and F_{n_k}(r_i) → H(r_i) as k → ∞, for i = 1, 2. Thus, by (3.2.7), for all k large enough,

F_∞(x) − ε < F_{n_k}(r_1) ≤ F_{n_k}(x) ≤ F_{n_k}(r_2) < F_∞(x) + ε,

which, since ε > 0 is arbitrary, implies F_{n_k}(x) → F_∞(x) as k → ∞.
Exercise 3.2.39. Suppose that the sequence of distribution functions F_{X_k} is uniformly tight and that EX_k² < ∞ are such that EX_n² → ∞ as n → ∞. Show that then also Var(X_n) → ∞ as n → ∞.
Hint: If (EX_{n_l})² → ∞, then sup_l Var(X_{n_l}) < ∞ yields X_{n_l}/EX_{n_l} →L² 1, whereas the uniform tightness of F_{X_{n_l}} implies that X_{n_l}/EX_{n_l} →p 0.
Using Lemma 3.2.38 and Helly's theorem, you next explore the possibility of establishing weak convergence for non-negative random variables out of the convergence of the corresponding Laplace transforms.

Exercise 3.2.40.
(a) Based on Exercise 3.2.29, show that if Z ≥ 0 and W ≥ 0 are such that E(e^{−sZ}) = E(e^{−sW}) for each s > 0, then Z =D W.
(b) Further, show that for any Z ≥ 0, the function L_Z(s) = E(e^{−sZ}) is infinitely differentiable at all s > 0 and that for any positive integer k,

E[Z^k] = (−1)^k lim_{s↓0} (d^k/ds^k) L_Z(s),

even when (both sides are) infinite.
(c) Suppose that X_n ≥ 0 are such that L(s) = lim_{n→∞} E(e^{−sX_n}) exists for all s > 0 and L(s) → 1 as s ↓ 0. Show that then the sequence of distribution functions F_{X_n} is uniformly tight and that there exists a random variable X_∞ ≥ 0 such that X_n →D X_∞ and L(s) = E(e^{−sX_∞}) for all s > 0.
(d) Let X_n = n^{−1} ∑_{k=1}^{n} k I_k for I_k ∈ {0, 1} independent random variables, with P(I_k = 1) = k^{−1}. Show that there exists X_∞ ≥ 0 such that X_n →D X_∞ and E(e^{−sX_∞}) = exp(∫_0^1 t^{−1}(e^{−st} − 1) dt) for all s > 0.
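Part (d) is amenable to a direct numerical check (an illustration only, and it assumes the normalization X_n = n^{−1}∑_k k I_k stated above): by independence, E e^{−sX_n} = ∏_{k=1}^n (1 + k^{−1}(e^{−sk/n} − 1)), and this product is compared below with the claimed limit transform, the integral being evaluated by Simpson's rule.

```python
import math

s = 1.5

def laplace_xn(n):
    # E exp(-s X_n) for X_n = n^{-1} sum_k k I_k with independent P(I_k = 1) = 1/k
    p = 1.0
    for k in range(1, n + 1):
        p *= 1.0 + (math.exp(-s * k / n) - 1.0) / k
    return p

def limit_transform():
    # exp( int_0^1 t^{-1}(e^{-st} - 1) dt ), integrand extended by -s at t = 0
    g = lambda t: -s if t == 0.0 else (math.exp(-s * t) - 1.0) / t
    m = 2000                       # even number of Simpson subintervals
    tot = g(0.0) + g(1.0)
    for i in range(1, m):
        tot += (4.0 if i % 2 else 2.0) * g(i / m)
    return math.exp(tot / (3.0 * m))

gaps = [abs(laplace_xn(n) - limit_transform()) for n in (10, 100, 1000)]
```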
Remark. The idea of using transforms to establish weak convergence shall be
further developed in Section 3.3, with the Fourier transform instead of the Laplace
transform.
3.3. Characteristic functions

This section is about the fundamental concept of characteristic function, its relevance for the theory of weak convergence, and in particular for the clt.
In Subsection 3.3.1 we define the characteristic function, providing illustrating examples and certain general properties, such as the relation between finite moments of a random variable and the degree of smoothness of its characteristic function. In Subsection 3.3.2 we recover the distribution of a random variable from its characteristic function and, building upon it, relate tightness and weak convergence with the pointwise convergence of the associated characteristic functions. We conclude with Subsection 3.3.3, in which we re-prove the clt of Section 3.1 as an application of the theory of characteristic functions we have thus developed. The same approach will serve us well in other settings which we consider in the sequel (c.f. Sections 3.4 and 3.5).
3.3.1. Definition, examples, moments and derivatives. We start off with the definition of the characteristic function of a random variable. To this end, recall that a C-valued random variable is a function Z : Ω → C such that the real and imaginary parts of Z are measurable, and for Z = X + iY with X, Y ∈ R integrable random variables (and i = √−1), the expectation of Z is EZ = EX + iEY.

Definition 3.3.1. The characteristic function of a random variable X is the C-valued function

Φ_X(θ) = E[e^{iθX}] = E[cos(θX)] + iE[sin(θX)],

where θ ∈ R and obviously both cos(θX) and sin(θX) are integrable R.V.s. We also denote by Φ_ν(θ) = ν(e^{iθx}) the characteristic function of a probability measure ν on (R, B), so Φ_ν is the characteristic function of a R.V. X whose law P_X is ν.
Here are some of the properties of characteristic functions, where the complex conjugate x − iy of z = x + iy ∈ C is denoted throughout by z̄, and the modulus of z = x + iy is |z| = √(x² + y²).

Proposition 3.3.2. Let X be a R.V. and Φ_X its characteristic function, then
(a) Φ_X(0) = 1
(b) Φ_X(−θ) is the complex conjugate of Φ_X(θ)
(c) |Φ_X(θ)| ≤ 1
(d) Φ_X(θ) is a uniformly continuous function on R
(e) Φ_{aX+b}(θ) = e^{iθb}Φ_X(aθ)
Proof. For (a), Φ_X(0) = E[e^{i0X}] = E[1] = 1. For (b), note that

Φ_X(−θ) = E cos(θX) − iE sin(θX),

which is the complex conjugate of Φ_X(θ) = E cos(θX) + iE sin(θX).
For (c), note that the function |z| = √(x² + y²) : R² → R is convex, hence by Jensen's inequality (c.f. Exercise 1.3.20),

|Φ_X(θ)| = |Ee^{iθX}| ≤ E|e^{iθX}| = 1

(since the modulus |e^{iθx}| = 1 for any real x and θ).
For (d), since Φ_X(θ + h) − Φ_X(θ) = E[e^{iθX}(e^{ihX} − 1)], it follows by Jensen's inequality for the modulus function that

|Φ_X(θ + h) − Φ_X(θ)| ≤ E[|e^{iθX}||e^{ihX} − 1|] = E|e^{ihX} − 1| =: δ(h)

(using the fact that |zv| = |z||v|). Since 2 ≥ |e^{ihX} − 1| → 0 as h → 0, by bounded convergence δ(h) → 0. As the bound δ(h) on the modulus of continuity of Φ_X(·) is independent of θ, we have the uniform continuity of Φ_X(·) on R.
For (e), simply note that Φ_{aX+b}(θ) = Ee^{iθ(aX+b)} = e^{iθb}Ee^{i(θa)X} = e^{iθb}Φ_X(aθ).
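Properties (a)-(c) and (e) are cheap to verify numerically for a concrete characteristic function. The sketch below (illustration only) uses a Bernoulli(p) variable, for which Φ(θ) = pe^{iθ} + 1 − p directly from the definition:

```python
import cmath

p = 0.3
phi = lambda th: p * cmath.exp(1j * th) + (1.0 - p)   # Bernoulli(p)

thetas = [0.1 * t for t in range(-60, 61)]
checks_a = abs(phi(0.0) - 1.0) < 1e-12                                       # (a)
checks_b = all(abs(phi(-t) - phi(t).conjugate()) < 1e-12 for t in thetas)    # (b)
checks_c = all(abs(phi(t)) <= 1.0 + 1e-12 for t in thetas)                   # (c)

# (e) with a = 2, b = 1: e^{i theta} phi(2 theta) against a direct computation
# of E exp(i theta (2B + 1)), which puts mass p at 3 and 1-p at 1
direct = lambda th: p * cmath.exp(3j * th) + (1.0 - p) * cmath.exp(1j * th)
checks_e = all(abs(cmath.exp(1j * t) * phi(2.0 * t) - direct(t)) < 1e-12
               for t in thetas)
```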
We also have the following relation between finite moments of the random variable and the derivatives of its characteristic function.

Lemma 3.3.3. If E|X|^n < ∞, then the characteristic function Φ_X(θ) of X has continuous derivatives up to the n-th order, given by

(3.3.1)  (d^k/dθ^k) Φ_X(θ) = E[(iX)^k e^{iθX}], for k = 1, . . . , n.
Proof. Note that for any x, h ∈ R,

e^{ihx} − 1 = ix ∫_0^h e^{iux} du.

Consequently, for any h ≠ 0, θ ∈ R and positive integer k we have the identity

(3.3.2)  Δ_{k,h}(x) = h^{−1}[(ix)^{k−1}e^{i(θ+h)x} − (ix)^{k−1}e^{iθx}] − (ix)^k e^{iθx} = (ix)^k e^{iθx} h^{−1}∫_0^h (e^{iux} − 1) du,

from which we deduce that |Δ_{k,h}(x)| ≤ 2|x|^k for all θ and h ≠ 0, and further that |Δ_{k,h}(x)| → 0 as h → 0. Thus, for k = 1, . . . , n, we have by dominated convergence (and Jensen's inequality for the modulus function) that

|EΔ_{k,h}(X)| ≤ E|Δ_{k,h}(X)| → 0 for h → 0.

Taking k = 1, we have from (3.3.2) that

EΔ_{1,h}(X) = h^{−1}(Φ_X(θ + h) − Φ_X(θ)) − E[iXe^{iθX}],

so its convergence to zero as h → 0 amounts to the identity (3.3.1) holding for k = 1. In view of this, considering now (3.3.2) for k = 2, we have that

EΔ_{2,h}(X) = h^{−1}(Φ′_X(θ + h) − Φ′_X(θ)) − E[(iX)² e^{iθX}],

and its convergence to zero as h → 0 amounts to (3.3.1) holding for k = 2. We continue in this manner for k = 3, . . . , n to complete the proof of (3.3.1). The continuity of the derivatives follows by dominated convergence from the convergence to zero of |(ix)^k e^{i(θ+h)x} − (ix)^k e^{iθx}| ≤ 2|x|^k as h → 0 (with k = 1, . . . , n).
The converse of Lemma 3.3.3 does not hold. That is, there exist random variables with E|X| = ∞ for which Φ_X(θ) is differentiable at θ = 0 (c.f. Exercise 3.3.23). However, as we see next, the existence of a finite second derivative of Φ_X(θ) at θ = 0 implies that EX² < ∞.

Lemma 3.3.4. If liminf_{θ↓0} θ^{−2}(2Φ_X(0) − Φ_X(θ) − Φ_X(−θ)) < ∞, then EX² < ∞.
Proof. Note that θ^{−2}(2Φ_X(0) − Φ_X(θ) − Φ_X(−θ)) = Eg_θ(X), where

g_θ(x) = θ^{−2}(2 − e^{iθx} − e^{−iθx}) = 2θ^{−2}[1 − cos(θx)] → x² as θ ↓ 0.

Since g_θ(x) ≥ 0 for all θ and x, it follows by Fatou's lemma that

liminf_{θ↓0} Eg_θ(X) ≥ E[liminf_{θ↓0} g_θ(X)] = EX²,

thus completing the proof of the lemma.
We continue with a few explicit computations of the characteristic function.

Example 3.3.5. Consider a Bernoulli random variable B of parameter p, that is, P(B = 1) = p and P(B = 0) = 1 − p. Its characteristic function is by definition

Φ_B(θ) = E[e^{iθB}] = pe^{iθ} + (1 − p)e^{iθ·0} = pe^{iθ} + 1 − p.

The same type of explicit formula applies to any discrete valued R.V. For example, if N has the Poisson distribution of parameter λ, then

(3.3.3)  Φ_N(θ) = E[e^{iθN}] = ∑_{k=0}^{∞} [(λe^{iθ})^k/k!] e^{−λ} = exp(λ(e^{iθ} − 1)).

The characteristic function has an explicit form also when the R.V. X has a probability density function f_X as in Definition 1.2.39. Indeed, then by Corollary 1.3.62 we have that

(3.3.4)  Φ_X(θ) = ∫_R e^{iθx} f_X(x) dx,

which is merely the Fourier transform of the density f_X (and is well defined since cos(θx)f_X(x) and sin(θx)f_X(x) are both integrable with respect to Lebesgue's measure).
Example 3.3.6. If G has the N(μ, v) distribution, namely, the probability density function f_G(y) is given by (3.1.1), then its characteristic function is

Φ_G(θ) = e^{iμθ − vθ²/2}.

Indeed, recall Example 1.3.68 that G = σX + μ for σ = √v and X of a standard normal distribution N(0, 1). Hence, considering part (e) of Proposition 3.3.2 for a = √v and b = μ, it suffices to show that Φ_X(θ) = e^{−θ²/2}. To this end, as X is integrable, we have from Lemma 3.3.3 that

Φ′_X(θ) = E(iXe^{iθX}) = −∫_R x sin(θx) f_X(x) dx

(since x cos(θx)f_X(x) is an integrable odd function, whose integral is thus zero). The standard normal density is such that f′_X(x) = −x f_X(x), hence integrating by parts we find that

Φ′_X(θ) = ∫_R sin(θx) f′_X(x) dx = −θ ∫_R cos(θx) f_X(x) dx = −θΦ_X(θ)

(since sin(θx)f_X(x) is an integrable odd function, Φ_X(θ) = ∫_R cos(θx)f_X(x)dx). We know that Φ_X(0) = 1, and since ψ(θ) = e^{−θ²/2} is the unique solution of the ordinary differential equation ψ′(θ) = −θψ(θ) with ψ(0) = 1, we conclude that Φ_X(θ) = e^{−θ²/2}, as claimed.

Example 3.3.7. If U has the uniform distribution on [a, b], then by (3.3.4) its characteristic function is

Φ_U(θ) = (b − a)^{−1} ∫_a^b e^{iθx} dx = (e^{iθb} − e^{iθa})/(iθ(b − a))

(recall that ∫_a^b e^{zx} dx = (e^{zb} − e^{za})/z for any z ∈ C). For a = −b the characteristic function simplifies to sin(θb)/(θb). Or, in case b = 1 and a = 0, we have Φ_U(θ) = (e^{iθ} − 1)/(iθ) for the random variable U of Example 1.1.26.
For a = 0 and z = −λ + iθ, λ > 0, the same integration identity applies also when b → ∞ (since the real part of z is negative). Consequently, by (3.3.4), the exponential distribution of parameter λ > 0, whose density is f_T(t) = λe^{−λt}1_{t>0} (see Example 1.3.68), has the characteristic function Φ_T(θ) = λ/(λ − iθ).
Finally, for the density f_S(s) = 0.5e^{−|s|} it is not hard to check that Φ_S(θ) = 0.5/(1 − iθ) + 0.5/(1 + iθ) = 1/(1 + θ²) (just break the integration over s ∈ R in (3.3.4) according to the sign of s).
We next express the characteristic function of the sum of independent random variables in terms of the characteristic functions of the summands. This relation makes the characteristic function a useful tool for proving weak convergence statements involving sums of independent variables.

Lemma 3.3.8. If X and Y are two independent random variables, then

Φ_{X+Y}(θ) = Φ_X(θ)Φ_Y(θ).

Proof. By the definition of the characteristic function,

Φ_{X+Y}(θ) = Ee^{iθ(X+Y)} = E[e^{iθX}e^{iθY}] = E[e^{iθX}]E[e^{iθY}],

where the right-most equality is obtained by the independence of X and Y (i.e. applying (1.4.12) for the integrable f(x) = g(x) = e^{iθx}). Observing that the right-most expression is Φ_X(θ)Φ_Y(θ) completes the proof.
Here are three simple applications of this lemma.

Example 3.3.9. If X and Y are independent and uniform on (−1/2, 1/2), then by Corollary 1.4.33 the random variable W = X + Y has the triangular density f_W(x) = (1 − |x|)1_{|x|≤1}. Thus, by Example 3.3.7, Lemma 3.3.8, and the trigonometric identity cos θ = 1 − 2 sin²(θ/2), we have that its characteristic function is

Φ_W(θ) = [Φ_X(θ)]² = (2 sin(θ/2)/θ)² = 2(1 − cos θ)/θ².
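The factorization of Lemma 3.3.8 can be confirmed numerically here: the Fourier transform (3.3.4) of the triangular density, computed by Simpson quadrature, matches (2 sin(θ/2)/θ)². (Illustration only; the quadrature step count is an arbitrary choice.)

```python
import math

def phi_triangular(theta, m=4000):
    # integrate e^{i theta x}(1 - |x|) over [-1, 1]; the imaginary part vanishes
    # by symmetry, so only the cosine part is computed (composite Simpson rule)
    g = lambda x: math.cos(theta * x) * (1.0 - abs(x))
    h = 2.0 / m
    tot = g(-1.0) + g(1.0)
    for i in range(1, m):
        tot += (4.0 if i % 2 else 2.0) * g(-1.0 + i * h)
    return tot * h / 3.0

square_form = lambda th: (2.0 * math.sin(th / 2.0) / th) ** 2

max_gap = max(abs(phi_triangular(t) - square_form(t)) for t in (0.5, 1.0, 2.0, 5.0))
```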
Exercise 3.3.10. Let X, X′ be i.i.d. random variables.
(a) Show that the characteristic function of Z = X − X′ is a non-negative, real-valued function.
(b) Prove that there do not exist a < b and i.i.d. random variables X, X′ such that X − X′ is the uniform random variable on (a, b).
In the next exercise you construct a random variable X whose law has no atoms, while its characteristic function does not converge to zero as θ → ∞.

Exercise 3.3.11. Let X = 2∑_{k=1}^{∞} 3^{−k}B_k for B_k i.i.d. Bernoulli random variables such that P(B_k = 1) = P(B_k = 0) = 1/2.
(a) Show that Φ_X(3^k π) = Φ_X(π) ≠ 0 for k = 1, 2, . . ..
(b) Recall that X has the uniform distribution on the Cantor set C, as specified in Example 1.2.42. Verify that x ↦ F_X(x) is everywhere continuous, hence the law P_X has no atoms (i.e. points of positive probability).
3.3.2. Inversion, continuity and convergence. Is it possible to recover the distribution function from the characteristic function? The answer is essentially yes.

Theorem 3.3.12 (Lévy's inversion theorem). Suppose Φ_X is the characteristic function of a random variable X whose distribution function is F_X. For any real numbers a < b and θ, let

(3.3.5)  Ψ_{a,b}(θ) = (1/2π) ∫_a^b e^{−iθu} du = (e^{−iθa} − e^{−iθb})/(2πiθ).

Then,

(3.3.6)  lim_{T→∞} ∫_{−T}^{T} Ψ_{a,b}(θ)Φ_X(θ) dθ = (1/2)[F_X(b) + F_X(b⁻)] − (1/2)[F_X(a) + F_X(a⁻)].

Furthermore, if ∫_R |Φ_X(θ)| dθ < ∞, then X has the bounded continuous probability density function

(3.3.7)  f_X(x) = (1/2π) ∫_R e^{−iθx} Φ_X(θ) dθ.

Remark. The identity (3.3.7) is a special case of the Fourier transform inversion formula, and as such is in duality with Φ_X(θ) = ∫_R e^{iθx} f_X(x) dx of (3.3.4). The formula (3.3.6) should be considered its integrated version, which thereby holds even in the absence of a density for X.
Here is a simple application of the duality between (3.3.7) and (3.3.4).

Example 3.3.13. The Cauchy density is f_X(x) = 1/[π(1 + x²)]. Recall Example 3.3.7, that the density f_S(s) = 0.5e^{−|s|} has the positive, integrable characteristic function 1/(1 + θ²). Thus, by (3.3.7),

0.5e^{−|s|} = (1/2π) ∫_R (1 + t²)^{−1} e^{−its} dt.

Multiplying both sides by two, then changing t to x and s to −θ, we get (3.3.4) for the Cauchy density, resulting with its characteristic function Φ_X(θ) = e^{−|θ|}.
When using characteristic functions for proving limit theorems we do not need the explicit formulas of Lévy's inversion theorem, but rather only the fact that the characteristic function determines the law. That is:

Corollary 3.3.14. If the characteristic functions of two random variables X and Y are the same, that is, Φ_X(θ) = Φ_Y(θ) for all θ, then X =D Y.
Remark. While the real-valued moment generating function M_X(s) = E[e^{sX}] is perhaps a simpler object than the characteristic function, it has a somewhat limited scope of applicability. For example, the law of a random variable X is uniquely determined by M_X(·) provided M_X(s) is finite for all s ∈ [−δ, δ], some δ > 0 (c.f. [Bil95, Theorem 30.1]). More generally, assuming all moments of X are finite, the Hamburger moment problem is about uniquely determining the law of X from a given sequence of moments EX^k. You saw in Exercise 3.2.29 that this is always possible when X has bounded support but, unfortunately, this is not always the case when X has unbounded support. For more on this issue, see [Dur10, Subsection 3.3.5].
Proof of Corollary 3.3.14. Since Φ_X = Φ_Y, comparing the right side of (3.3.6) for X and Y shows that

[F_X(b) + F_X(b⁻)] − [F_X(a) + F_X(a⁻)] = [F_Y(b) + F_Y(b⁻)] − [F_Y(a) + F_Y(a⁻)].

As F_X is a distribution function, both F_X(a) → 0 and F_X(a⁻) → 0 when a → −∞. For this reason, also F_Y(a) → 0 and F_Y(a⁻) → 0. Consequently,

F_X(b) + F_X(b⁻) = F_Y(b) + F_Y(b⁻) for all b ∈ R.
In particular, this implies that F_X = F_Y on the collection C of continuity points of both F_X and F_Y. Recall that F_X and F_Y have each at most a countable set of points of discontinuity (see Exercise 1.2.38), so the complement of C is countable, and consequently C is a dense subset of R. Thus, as distribution functions are non-decreasing and right-continuous, we know that F_X(b) = inf{F_X(x) : x > b, x ∈ C} and F_Y(b) = inf{F_Y(x) : x > b, x ∈ C}. Since F_X(x) = F_Y(x) for all x ∈ C, this identity extends to all b ∈ R, resulting with X =D Y.
Remark. In Lemma 3.1.1 it was shown directly that the sum of independent random variables of normal distributions N(μ_k, v_k) has the normal distribution N(μ, v), where μ = ∑_k μ_k and v = ∑_k v_k. The proof easily reduces to dealing with two independent random variables, X of distribution N(μ_1, v_1) and Y of distribution N(μ_2, v_2), and showing that X + Y has the normal distribution N(μ_1 + μ_2, v_1 + v_2). Here is an easy proof of this result via characteristic functions. First, by the independence of X and Y (see Lemma 3.3.8) and their normality (see Example 3.3.6),

Φ_{X+Y}(θ) = Φ_X(θ)Φ_Y(θ) = exp(iμ_1θ − v_1θ²/2) exp(iμ_2θ − v_2θ²/2) = exp(i(μ_1 + μ_2)θ − (1/2)(v_1 + v_2)θ²).

We recognize this expression as the characteristic function corresponding to the N(μ_1 + μ_2, v_1 + v_2) distribution, which by Corollary 3.3.14 must indeed be the distribution of X + Y.
Proof of Lévy's inversion theorem. Consider the product of the law P_X of X, which is a probability measure on R, and Lebesgue's measure of [−T, T], noting that this product is a finite measure on R × [−T, T] of total mass 2T.
Fixing a < b ∈ R, let h_{a,b}(x, θ) = Ψ_{a,b}(θ)e^{iθx}, where by (3.3.5) and Jensen's inequality for the modulus function (applied with respect to the uniform measure on [a, b]),

|h_{a,b}(x, θ)| = |Ψ_{a,b}(θ)| ≤ (1/2π) ∫_a^b |e^{−iθu}| du = (b − a)/2π.
Consequently, h_{a,b} is integrable with respect to this product measure, and applying Fubini's theorem, we conclude that

J_T(a, b) := ∫_{−T}^{T} Ψ_{a,b}(θ)Φ_X(θ) dθ = ∫_{−T}^{T} Ψ_{a,b}(θ) [∫_R e^{iθx} dP_X(x)] dθ
= ∫_{−T}^{T} [∫_R h_{a,b}(x, θ) dP_X(x)] dθ = ∫_R [∫_{−T}^{T} h_{a,b}(x, θ) dθ] dP_X(x).
Since h_{a,b}(x, θ) is the difference between the function e^{iθu}/(2πiθ) at u = x − a and the same function at u = x − b, it follows that

∫_{−T}^{T} h_{a,b}(x, θ) dθ = R(x − a, T) − R(x − b, T).

Further, as the cosine function is even and the sine function is odd,

R(u, T) = ∫_{−T}^{T} e^{iθu}/(2πiθ) dθ = (1/π) ∫_0^T sin(θu)/θ dθ = (sgn(u)/π) S(|u|T),

with S(r) = ∫_0^r x^{−1} sin x dx for r > 0.
Even though the Lebesgue integral ∫_0^∞ x^{−1} sin x dx does not exist, because both the integral of the positive part and the integral of the negative part are infinite, we still have that S(r) is uniformly bounded on (0, ∞) and

lim_{r→∞} S(r) = π/2

(c.f. Exercise 3.3.15). Consequently,

lim_{T→∞} [R(x − a, T) − R(x − b, T)] = g_{a,b}(x) =
  0 if x < a or x > b,
  1/2 if x = a or x = b,
  1 if a < x < b.
Since S(·) is uniformly bounded, so is |R(x − a, T) − R(x − b, T)|, and by bounded convergence,

lim_{T→∞} J_T(a, b) = lim_{T→∞} ∫_R [R(x − a, T) − R(x − b, T)] dP_X(x) = ∫_R g_{a,b}(x) dP_X(x)
= (1/2)P_X({a}) + P_X((a, b)) + (1/2)P_X({b}).

With P_X({a}) = F_X(a) − F_X(a⁻), P_X((a, b)) = F_X(b⁻) − F_X(a) and P_X({b}) = F_X(b) − F_X(b⁻), this sum equals the right side of (3.3.6), completing the proof of the first part of the theorem.
Suppose now that ∫_R |Φ_X(θ)| dθ < ∞. Then both e^{−iθx}Φ_X(θ) and Ψ_{a,b}(θ)Φ_X(θ) are integrable with respect to Lebesgue's measure on R, hence f_X(x) of (3.3.7) is well defined. Further, |f_X(x)| ≤ C is uniformly bounded, and by dominated convergence with respect to Lebesgue's measure on R,

lim_{h→0} |f_X(x + h) − f_X(x)| ≤ lim_{h→0} (1/2π) ∫_R |e^{−iθx}||Φ_X(θ)||e^{−iθh} − 1| dθ = 0,

implying that f_X(·) is also continuous. Turning to prove that f_X(·) is the density of X, note that

|Ψ_{a,b}(θ)Φ_X(θ)| ≤ ((b − a)/2π) |Φ_X(θ)|,
so by dominated convergence we have that

(3.3.8)  lim_{T→∞} J_T(a, b) = J_∞(a, b) = ∫_R Ψ_{a,b}(θ)Φ_X(θ) dθ.
Further, in view of (3.3.5), upon applying Fubini's theorem for the integrable function $e^{-i\theta u}\mathbf{I}_{[a,b]}(u)\,\Phi_X(\theta)$ with respect to Lebesgue's measure on $\mathbb{R}^2$, we see that
$$J_\infty(a,b) = \frac{1}{2\pi}\int_{\mathbb{R}}\Big[\int_a^b e^{-i\theta u}\,du\Big]\Phi_X(\theta)\,d\theta = \int_a^b f_X(u)\,du\,,$$
for the bounded continuous function $f_X(\cdot)$ of (3.3.7). In particular, $J_\infty(a,b)$ must be continuous in both $a$ and $b$. Comparing (3.3.8) with (3.3.6) we see that
$$J_\infty(a,b) = \frac{1}{2}\big[F_X(b) + F_X(b^-)\big] - \frac{1}{2}\big[F_X(a) + F_X(a^-)\big]\,,$$
so the continuity of $J_\infty(\cdot,\cdot)$ implies that $F_X(\cdot)$ must also be continuous everywhere, with
$$F_X(b) - F_X(a) = J_\infty(a,b) = \int_a^b f_X(u)\,du\,,$$
for all $a < b$. This shows that necessarily $f_X(x)$ is a non-negative real-valued function, which is the density of $X$.
Exercise 3.3.15. Integrating $z^{-1}e^{iz}$ around the contour formed by the upper semi-circles of radii $\epsilon$ and $r$ and the intervals $[-r,-\epsilon]$ and $[\epsilon,r]$, deduce that $S(r) = \int_0^r x^{-1}\sin x\,dx$ is uniformly bounded on $(0,\infty)$, with $S(r)\to\pi/2$ as $r\to\infty$.
Our strategy for handling the clt and similar limit results is to establish the convergence of characteristic functions and deduce from it the corresponding convergence in distribution. One ingredient for this is of course the fact that the characteristic function uniquely determines the corresponding law. Our next result provides an important second ingredient, that is, an explicit sufficient condition for uniform tightness in terms of the limit of the characteristic functions.
Lemma 3.3.16. Suppose $\nu_n$ are probability measures on $(\mathbb{R},\mathcal{B})$ and $\Phi_{\nu_n}(\theta) = \nu_n(e^{i\theta x})$ the corresponding characteristic functions. If $\Phi_{\nu_n}(\theta) \to \Phi(\theta)$ as $n\to\infty$, for each $\theta\in\mathbb{R}$, and further $\Phi(\theta)$ is continuous at $\theta = 0$, then the sequence $\{\nu_n\}$ is uniformly tight.
Remark. To see why continuity of the limit $\Phi(\theta)$ at $\theta = 0$ is required, consider the sequence $\nu_n$ of normal distributions $\mathcal{N}(0, n^2)$. From Example 3.3.6 we see that the pointwise limit $\Phi(\theta) = \mathbf{I}_{\{\theta = 0\}}$ of $\Phi_{\nu_n}(\theta) = \exp(-n^2\theta^2/2)$ exists but is discontinuous at $\theta = 0$. However, for any $M < \infty$ we know that $\nu_n([-M,M]) = \nu_1([-M/n, M/n]) \to 0$ as $n\to\infty$, so clearly the sequence $\nu_n$ is not uniformly tight. Indeed, the corresponding distribution functions $F_n(x) = F_1(x/n)$ converge vaguely to $F_\infty(x) = F_1(0) = 1/2$, which is not a distribution function (reflecting the escape of all the probability mass to $\pm\infty$).
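The escape of mass in this remark is easy to see numerically (an aside; the parameter values are arbitrary):

```python
import math

# Illustration of the remark (an aside): for nu_n = N(0, n^2) the
# characteristic functions exp(-n^2 theta^2 / 2) converge to the indicator
# of theta = 0, while the mass of any fixed interval [-M, M] vanishes.
def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass(n, M):                     # nu_n([-M, M]) for nu_n = N(0, n^2)
    return std_normal_cdf(M / n) - std_normal_cdf(-M / n)

def phi(n, theta):                  # characteristic function of N(0, n^2)
    return math.exp(-(n * theta) ** 2 / 2.0)

escaped = mass(10**4, 5.0)          # nearly zero: mass escapes to +-infinity
at_zero = phi(10**4, 0.0)           # pointwise limit is 1 at theta = 0
off_zero = phi(10**4, 0.1)          # and 0 at any theta != 0
```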
Proof. We start the proof by deriving the key inequality
(3.3.9) $\displaystyle\nu\big([-2/r, 2/r]^c\big) \le \frac{1}{r}\int_{-r}^{r}\big(1-\Phi_\nu(\theta)\big)\,d\theta\,,$
valid for any $r > 0$ and any probability measure $\nu$ with characteristic function $\Phi_\nu$. Indeed, applying Fubini's theorem (for the bounded function $1 - e^{i\theta x}$),
$$\frac{1}{r}\int_{-r}^{r}\big(1-\Phi_\nu(\theta)\big)\,d\theta = \frac{1}{r}\int_{-r}^{r}\Big[\int_{\mathbb{R}}\big(1 - e^{i\theta x}\big)\,d\nu(x)\Big]d\theta = \int_{\mathbb{R}} J(x)\,d\nu(x)\,,$$
where $J(x) := \frac{1}{r}\int_{-r}^{r}(1 - e^{i\theta x})\,d\theta = 2\big(1 - \frac{\sin(rx)}{rx}\big)$. Since $\sin(u)/u \le \min(1, 1/|u|)$, we have the lower bound
(3.3.10) $\displaystyle J(x) \ge 2\Big(1 - \frac{1}{r|x|}\Big) \ge \mathbf{I}_{\{|x| > 2/r\}}\,.$
Thus, the lower bound (3.3.10) and monotonicity of the integral imply that
$$\frac{1}{r}\int_{-r}^{r}\big(1-\Phi_\nu(\theta)\big)\,d\theta = \int_{\mathbb{R}} J(x)\,d\nu(x) \ge \int_{\mathbb{R}}\mathbf{I}_{\{|x|>2/r\}}\,d\nu(x) = \nu\big([-2/r, 2/r]^c\big)\,,$$
hence establishing (3.3.9).
We turn to the application of this inequality for proving the uniform tightness. Since $\Phi_{\nu_n}(0) = 1$ for all $n$ and $\Phi_{\nu_n}(0) \to \Phi(0)$, it follows that $\Phi(0) = 1$. Further, $\Phi(\theta)$ is continuous at $\theta = 0$, so for any $\epsilon > 0$, there exists $r = r(\epsilon) > 0$ such that
$$|1 - \Phi(\theta)| \le \frac{\epsilon}{4}\quad\text{for all }\theta\in[-r,r]\,,$$
and hence also
$$\frac{1}{r}\int_{-r}^{r}|1 - \Phi(\theta)|\,d\theta \le \frac{\epsilon}{2}\,.$$
The pointwise convergence of $\Phi_{\nu_n}$ to $\Phi$ implies that $|1 - \Phi_{\nu_n}(\theta)| \to |1 - \Phi(\theta)|$. By bounded convergence with respect to the uniform measure of $\theta$ on $[-r,r]$, it follows that for some finite $n_0 = n_0(\epsilon)$ and all $n \ge n_0$,
$$\frac{1}{r}\int_{-r}^{r}|1 - \Phi_{\nu_n}(\theta)|\,d\theta \le \epsilon\,,$$
which in view of (3.3.9) results with
$$\epsilon \ge \frac{1}{r}\int_{-r}^{r}\big[1 - \Phi_{\nu_n}(\theta)\big]\,d\theta \ge \nu_n\big([-2/r, 2/r]^c\big)\,.$$
Since $\epsilon > 0$ is arbitrary and $M = 2/r$ is independent of $n$, by Definition 3.2.32 this amounts to the uniform tightness of the sequence $\{\nu_n\}$.
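Inequality (3.3.9) can also be checked on a concrete law (an aside, with arbitrarily chosen parameters): for the symmetric two-point law at $\pm a$ one has $\Phi_\nu(\theta) = \cos(a\theta)$, so the left side of (3.3.9) equals $2 - 2\sin(ar)/(ar)$ while the tail mass is the indicator of $a > 2/r$.

```python
import math

# Concrete check of inequality (3.3.9) (an aside): nu = symmetric two-point
# law at +-a has Phi_nu(theta) = cos(a*theta), and
# (1/r) int_{-r}^{r} (1 - Phi_nu) dtheta = 2 - 2*sin(a*r)/(a*r)
# must dominate nu([-2/r, 2/r]^c) = 1_{a > 2/r}.
def lhs(a, r):
    return 2.0 - 2.0 * math.sin(a * r) / (a * r)

def tail_mass(a, r):
    return 1.0 if a > 2.0 / r else 0.0

inequality_holds = all(
    lhs(a, r) + 1e-12 >= tail_mass(a, r)
    for a in (0.1, 0.5, 1.0, 3.0, 10.0)
    for r in (0.5, 1.0, 2.0, 5.0, 20.0))
```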
Building upon Corollary 3.3.14 and Lemma 3.3.16 we can finally relate the pointwise convergence of characteristic functions to the weak convergence of the corresponding measures.
Theorem 3.3.17 (Lévy's continuity theorem). Let $\nu_n$, $1 \le n \le \infty$, be probability measures on $(\mathbb{R},\mathcal{B})$.
(a) If $\nu_n \xrightarrow{w} \nu_\infty$, then $\Phi_{\nu_n}(\theta) \to \Phi_{\nu_\infty}(\theta)$ for each $\theta\in\mathbb{R}$.
(b) Conversely, if $\Phi_{\nu_n}(\theta)$ converges pointwise to a limit $\Phi_\infty(\theta)$ that is continuous at $\theta = 0$, then $\{\nu_n\}$ is a uniformly tight sequence and $\nu_n \xrightarrow{w} \nu_\infty$ such that $\Phi_{\nu_\infty} = \Phi_\infty$.
Proof. For part (a), since both $x\mapsto\cos(\theta x)$ and $x\mapsto\sin(\theta x)$ are bounded continuous functions, the assumed weak convergence of $\nu_n$ to $\nu_\infty$ implies that $\Phi_{\nu_n}(\theta) = \nu_n(e^{i\theta x}) \to \nu_\infty(e^{i\theta x}) = \Phi_{\nu_\infty}(\theta)$.
For part (b), the uniform tightness of $\{\nu_n\}$ follows from Lemma 3.3.16. Moreover, by part (a), any subsequential weak limit of $\nu_n$ has characteristic function $\Phi_\infty$, hence by Corollary 3.3.14 all such limits coincide with one probability measure $\nu_\infty$ for which $\Phi_{\nu_\infty} = \Phi_\infty$, and in view of the uniform tightness, $\nu_n \xrightarrow{w} \nu_\infty$.
Here is a direct consequence of Lévy's continuity theorem.
Exercise 3.3.18. Show that if $X_n \xrightarrow{D} X_\infty$, $Y_n \xrightarrow{D} Y_\infty$ and $Y_n$ is independent of $X_n$ for $1 \le n \le \infty$, then $X_n + Y_n \xrightarrow{D} X_\infty + Y_\infty$.
Combining Exercise 3.3.18 with the Portmanteau theorem and the clt, you can now show that a finite second moment is necessary for the convergence in distribution of $n^{-1/2}\sum_{k=1}^n X_k$ for i.i.d. $X_k$.
Exercise 3.3.19. Suppose $X_k$, $\widetilde X_k$ are i.i.d. and $n^{-1/2}\sum_{k=1}^n X_k \xrightarrow{D} Z$ (with the limit $Z \in \mathbb{R}$).
(a) Set $Y_k = X_k - \widetilde X_k$ and show that $n^{-1/2}\sum_{k=1}^n Y_k \xrightarrow{D} Z - \widetilde Z$, with $Z$ and $\widetilde Z$ i.i.d.
(b) Let $U_k = Y_k \mathbf{I}_{\{|Y_k|\le b\}}$ and $V_k = Y_k \mathbf{I}_{\{|Y_k| > b\}}$. Show that for any $u < \infty$ and all $n$,
$$P\Big(\sum_{k=1}^n Y_k \ge u\sqrt{n}\Big) \ge P\Big(\sum_{k=1}^n U_k \ge u\sqrt{n},\ \sum_{k=1}^n V_k \ge 0\Big) \ge \frac{1}{2}\,P\Big(\sum_{k=1}^n U_k \ge u\sqrt{n}\Big)\,.$$
(c) Apply the Portmanteau theorem and the clt for the bounded i.i.d. $\{U_k\}$ to deduce that $E Y_1^2$ is finite, and conclude that if $n^{-1/2}\sum_{k=1}^n X_k \xrightarrow{D} Z$, then necessarily $EX_1^2 < \infty$.
Remark. The trick of replacing $X_k$ by the variables $Y_k = X_k - \widetilde X_k$, whose law is symmetric (i.e. $Y_k \stackrel{D}{=} -Y_k$), is very useful in many problems. It is often called the symmetrization trick.
Exercise 3.3.20. Provide an example of a random variable $X$ with a bounded probability density function but for which $\int_{\mathbb{R}}|\Phi_X(\theta)|\,d\theta = \infty$, and another example of a random variable $X$ whose characteristic function $\Phi_X(\theta)$ is not differentiable at $\theta = 0$.
As you find out next, Lévy's inversion theorem can help when computing densities.
Exercise 3.3.21. Suppose the random variables $U_k$ are i.i.d., where the law of each $U_k$ is the uniform probability measure on $(-1,1)$. Considering Example 3.3.7, show that for each $n \ge 2$ the probability density function of $S_n = \sum_{k=1}^n U_k$ is
$$f_{S_n}(s) = \frac{1}{\pi}\int_0^{\infty}\cos(\theta s)\Big(\frac{\sin\theta}{\theta}\Big)^n d\theta\,,$$
and deduce that $\int_0^{\infty}\cos(\theta s)(\sin\theta/\theta)^n\,d\theta = 0$ for all $s > n \ge 2$.
Exercise 3.3.22. Deduce from Example 3.3.13 that if $X_k$ are i.i.d., each having the Cauchy density, then $n^{-1}\sum_{k=1}^n X_k$ has the same distribution as $X_1$, for any value of $n$.
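This striking stability of the Cauchy law is easy to see by simulation (an aside; the sampling scheme and sample sizes are arbitrary choices):

```python
import math, random

# Simulation sketch of Exercise 3.3.22 (an aside): averages of i.i.d.
# standard Cauchy variables are again standard Cauchy, so P(|mean| > 1)
# stays near P(|X_1| > 1) = 1/2 no matter how large n is.
random.seed(0)

def cauchy():
    # standard Cauchy via the tangent of a uniform angle
    return math.tan(math.pi * (random.random() - 0.5))

def frac_big_means(n, trials=4000):
    big = 0
    for _ in range(trials):
        if abs(sum(cauchy() for _ in range(n)) / n) > 1.0:
            big += 1
    return big / trials

f50 = frac_big_means(50)
```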
We next relate differentiability of $\Phi_X(\theta)$ with the weak law of large numbers and show that it does not imply that $E|X|$ is finite.
Exercise 3.3.23. Let $S_n = \sum_{k=1}^n X_k$, where the i.i.d. random variables $X_k$ have each the characteristic function $\Phi_X(\theta)$.
(a) Show that if $\frac{d\Phi_X}{d\theta}(0) = z \in \mathbb{C}$, then $z = ia$ for some $a\in\mathbb{R}$ and $n^{-1}S_n \xrightarrow{p} a$ as $n\to\infty$.
(b) Show that if $n^{-1}S_n \xrightarrow{p} a$, then $\Phi_X(h_k)^{n_k} \to e^{ia}$ for any $h_k \to 0$ and $n_k = [1/h_k]$, and deduce that $\frac{d\Phi_X}{d\theta}(0) = ia$.
(c) Conclude that the weak law of large numbers holds (i.e. $n^{-1}S_n \xrightarrow{p} a$ for some non-random $a$), if and only if $\Phi_X(\theta)$ is differentiable at $\theta = 0$ (this result is due to E.J.G. Pitman, see [Pit56]).
(d) Use Exercise 2.1.13 to provide a random variable $X$ for which $\Phi_X(\theta)$ is differentiable at $\theta = 0$ but $E|X| = \infty$.
As you show next, $X_n \xrightarrow{D} X_\infty$ yields convergence of $\Phi_{X_n}(\theta)$ to $\Phi_{X_\infty}(\theta)$, uniformly over compact subsets of $\mathbb{R}$.
Exercise 3.3.24. Show that if $X_n \xrightarrow{D} X_\infty$, then $\sup_{|\theta|\le M}|\Phi_{X_n}(\theta) - \Phi_{X_\infty}(\theta)| \to 0$ as $n\to\infty$, for each finite $M$.
Characteristic functions of modulus one correspond to lattice or degenerate laws, as you show in the following refinement of part (c) of Proposition 3.3.2.
Exercise 3.3.25. Suppose $|\Phi_Y(\theta)| = 1$ for some $\theta \ne 0$.
(a) Show that $Y$ is a $(2\pi/\theta)$-lattice random variable, namely, that $Y \bmod (2\pi/\theta)$ is P-degenerate.
Hint: Check the conditions for equality when applying Jensen's inequality for $(\cos\theta Y, \sin\theta Y)$ and the convex function $g(x,y) = \sqrt{x^2 + y^2}$.
(b) Deduce that if in addition $|\Phi_Y(\gamma\theta)| = 1$ for some irrational $\gamma$, then $Y$ must be P-degenerate, in which case $\Phi_Y(\theta) = \exp(i\theta c)$ for some $c\in\mathbb{R}$.
Building on the preceding two exercises, you are to prove next the following convergence of types result.
Exercise 3.3.26. Suppose $Z_n \xrightarrow{D} Y$ and $\alpha_n Z_n + \beta_n \xrightarrow{D} \widehat Y$ for some $\widehat Y$, non-P-degenerate $Y$, and non-random $\alpha_n \ge 0$, $\beta_n$.
(a) Show that $\alpha_n \to \alpha \ge 0$ finite.
Hint: Start with the finiteness of limit points of $\alpha_n$.
(b) Deduce that $\beta_n \to \beta$ finite.
(c) Conclude that $\widehat Y \stackrel{D}{=} \alpha Y + \beta$.
Hint: Recall Slutsky's lemma.
Remark. This convergence of types fails for P-degenerate $Y$. For example, if $Z_n \stackrel{D}{=} \mathcal{N}(0, n^{-3})$, then both $Z_n \xrightarrow{D} 0$ and $nZ_n \xrightarrow{D} 0$. Similarly, if $Z_n \stackrel{D}{=} \mathcal{N}(0,1)$ then $\alpha_n Z_n \stackrel{D}{=} \mathcal{N}(0,1)$ for the non-converging sequence $\alpha_n = (-1)^n$ (of alternating signs).
Mimicking the proof of Lévy's inversion theorem, for random variables of bounded support you get the following alternative inversion formula, based on the theory of Fourier series.
Exercise 3.3.27. Suppose the R.V. $X$ supported on $(0,t)$ has the characteristic function $\Phi_X$ and the distribution function $F_X$. Let $\theta_0 = 2\pi/t$ and $\Psi_{a,b}(\theta)$ be as in (3.3.5), with $\Psi_{a,b}(0) = \frac{b-a}{2\pi}$.
(a) Show that for any $0 < a < b < t$,
$$\lim_{T\to\infty}\sum_{k=-T}^{T}\theta_0\Big(1 - \frac{|k|}{T}\Big)\Psi_{a,b}(k\theta_0)\,\Phi_X(k\theta_0) = \frac{1}{2}\big[F_X(b) + F_X(b^-)\big] - \frac{1}{2}\big[F_X(a) + F_X(a^-)\big]\,.$$
Hint: Recall that $S_T(r) = \sum_{k=1}^{T}(1 - k/T)\frac{\sin kr}{k}$ is uniformly bounded for $r\in(0,2\pi)$ and integer $T \ge 1$, and $S_T(r) \to \frac{\pi - r}{2}$ as $T\to\infty$.
(b) Show that if $\sum_k |\Phi_X(k\theta_0)| < \infty$, then $X$ has the bounded continuous probability density function, given for $x\in(0,t)$ by
$$f_X(x) = \frac{\theta_0}{2\pi}\sum_{k\in\mathbb{Z}} e^{-ik\theta_0 x}\,\Phi_X(k\theta_0)\,.$$
(c) Deduce that if R.V.s $X$ and $Y$ supported on $(0,t)$ are such that $\Phi_X(k\theta_0) = \Phi_Y(k\theta_0)$ for all $k\in\mathbb{Z}$, then $X \stackrel{D}{=} Y$.
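To see part (b) in action, here is a small numerical check (an aside; the density used is a hypothetical example, with $t = 2\pi$ so $\theta_0 = 1$): for $f(x) = (1+\cos x)/(2\pi)$ on $(0,2\pi)$, the only non-zero coefficients are $\Phi_X(0) = 1$ and $\Phi_X(\pm 1) = 1/2$, and the Fourier series recovers $f$.

```python
import cmath, math

# Check of the Fourier inversion of part (b) (an aside): take the
# hypothetical density f(x) = (1 + cos x)/(2*pi) on (0, 2*pi), compute
# Phi_X(k) = int_0^{2 pi} e^{i k x} f(x) dx by a midpoint rule, and verify
# that (theta_0/(2*pi)) * sum_k e^{-i k x} Phi_X(k) returns f (theta_0 = 1).
def f(x):
    return (1.0 + math.cos(x)) / (2.0 * math.pi)

def phi(k, steps=4096):
    h = 2.0 * math.pi / steps
    return h * sum(cmath.exp(1j * k * (j + 0.5) * h) * f((j + 0.5) * h)
                   for j in range(steps))

def f_reconstructed(x, K=4):
    s = sum(cmath.exp(-1j * k * x) * phi(k) for k in range(-K, K + 1))
    return (s / (2.0 * math.pi)).real

max_err = max(abs(f_reconstructed(x) - f(x)) for x in (0.3, 1.0, 2.5, 4.0, 6.0))
```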
Here is an application of the preceding exercise for the random walk on the circle $\mathbb{S}^1$ of radius one (c.f. Definition 5.1.6 for the random walk on $\mathbb{R}$).
Exercise 3.3.28. Let $t = 2\pi$ and denote by $\Omega$ the unit circle $\mathbb{S}^1$, parametrized by the angular coordinate to yield the identification $\Omega = [0,t]$, where both endpoints are considered the same point. We equip $\Omega$ with the topology induced by $[0,t]$ and the uniform probability measure $U$ induced by the surface (arc-length) measure on $\Omega$, and consider the random walk $S_n = (\sum_{k=1}^n \xi_k) \bmod t$, with i.i.d. $\xi, \{\xi_k\}$.
(a) Verify that Exercise 3.3.27 applies for $\theta_0 = 1$ and R.V.s on $\Omega$.
(b) Show that probability measures $\nu_n$ on $(\Omega, \mathcal{B}_\Omega)$ converge weakly to $\nu$ if and only if $\Phi_{\nu_n}(k) \to \Phi_\nu(k)$ for all $k\in\mathbb{Z}$, where $\Phi_\nu(k) = \nu(e^{ik\omega})$.
Hint: Since $\Omega$ is compact, the sequence $\{\nu_n\}$ is uniformly tight.
(c) Show that $\Phi_U(k) = \mathbf{1}_{\{k=0\}}$ and $\Phi_{S_n}(k) = \Phi_\xi(k)^n$. Deduce from these facts that if $\xi$ has a density with respect to $U$, then $S_n \xrightarrow{D} U$ as $n\to\infty$.
Hint: Recall part (a) of Exercise 3.3.25.
(d) Check that if $\xi = \alpha$ is non-random for some $\alpha/t \notin \mathbb{Q}$, then $S_n$ does not converge in distribution, but $S_{K_n} \xrightarrow{D} U$ for $K_n$ which are uniformly chosen in $\{1, 2, \ldots, n\}$, independently of the sequence $\{\xi_k\}$.
3.3.3. Revisiting the clt. Applying the theory of Subsection 3.3.2 we provide an alternative proof of the clt, based on characteristic functions. One can prove many other weak convergence results for sums of random variables by properly adapting this approach, which is exactly what we will do when demonstrating the convergence to stable laws (see Exercise 3.3.33), when proving the Poisson approximation theorem (in Subsection 3.4.1), and for the multivariate clt (in Section 3.5).
To this end, we start by deriving the analog of the bound (3.1.7) for the characteristic function.
Lemma 3.3.29. If a random variable $X$ has $E(X) = 0$ and $E(X^2) = v < \infty$, then for all $\theta\in\mathbb{R}$,
$$\Big|\Phi_X(\theta) - \Big(1 - \frac{1}{2}v\theta^2\Big)\Big| \le E\min\big(|\theta X|^2,\ |\theta X|^3/6\big)\,.$$
Proof. Let $R_2(x) = e^{ix} - 1 - ix - (ix)^2/2$. Then, rearranging terms, recalling $E(X) = 0$ and using Jensen's inequality for the modulus function, we see that
$$\Big|\Phi_X(\theta) - \Big(1 - \frac{1}{2}v\theta^2\Big)\Big| = \Big|E\Big[e^{i\theta X} - 1 - i\theta X - \frac{(i\theta X)^2}{2}\Big]\Big| = \big|E R_2(\theta X)\big| \le E|R_2(\theta X)|\,.$$
Since $|R_2(x)| \le \min(|x|^2, |x|^3/6)$ for any $x\in\mathbb{R}$ (see also Exercise 3.3.34), by monotonicity of the expectation we get that $E|R_2(\theta X)| \le E\min(|\theta X|^2, |\theta X|^3/6)$, completing the proof of the lemma.
The following simple complex analysis estimate is needed for relating the approximation of the characteristic function of summands to that of their sum.
Lemma 3.3.30. Suppose $z_{n,k} \in \mathbb{C}$ are such that $z_n = \sum_{k=1}^n z_{n,k} \to z_\infty$ and $\eta_n = \sum_{k=1}^n |z_{n,k}|^2 \to 0$ when $n\to\infty$. Then, $\Pi_n := \prod_{k=1}^n (1 + z_{n,k}) \to \exp(z_\infty)$ for $n\to\infty$.
Proof. Recall that the power series expansion
$$\log(1+z) = \sum_{k=1}^{\infty}(-1)^{k-1}\frac{z^k}{k}$$
converges for $|z| < 1$. In particular, for $|z| \le 1/2$ it follows that
$$|\log(1+z) - z| \le \sum_{k=2}^{\infty}\frac{|z|^k}{k} \le |z|^2\sum_{k=2}^{\infty}\frac{2^{-(k-2)}}{k} \le |z|^2\sum_{k=2}^{\infty}2^{-(k-1)} = |z|^2\,.$$
Let $\delta_n = \max\{|z_{n,k}| : k = 1,\ldots,n\}$. Note that $\delta_n^2 \le \eta_n$, so our assumption that $\eta_n \to 0$ implies that $\delta_n \le 1/2$ for all $n$ sufficiently large, in which case
$$|\log\Pi_n - z_n| = \Big|\log\prod_{k=1}^n(1 + z_{n,k}) - \sum_{k=1}^n z_{n,k}\Big| \le \sum_{k=1}^n|\log(1 + z_{n,k}) - z_{n,k}| \le \eta_n\,.$$
With $z_n \to z_\infty$ and $\eta_n \to 0$, it follows that $\log\Pi_n \to z_\infty$. Consequently, $\Pi_n \to \exp(z_\infty)$ as claimed.
We will now give an alternative proof of the clt of Theorem 3.1.2.
Proof of Theorem 3.1.2. From Example 3.3.6 we know that $\Phi_G(\theta) = e^{-\theta^2/2}$ is the characteristic function of the standard normal distribution. So, by Lévy's continuity theorem it suffices to show that $\Phi_{\widehat S_n}(\theta) \to \exp(-\theta^2/2)$ as $n\to\infty$, for each $\theta\in\mathbb{R}$. Recall that $\widehat S_n = \sum_{k=1}^n X_{n,k}$, with $X_{n,k} = (X_k - \mu)/\sqrt{vn}$ i.i.d. random variables, so by independence (see Lemma 3.3.8) and scaling (see part (e) of Proposition 3.3.2), we have that
$$\Pi_n := \Phi_{\widehat S_n}(\theta) = \prod_{k=1}^n \Phi_{X_{n,k}}(\theta) = \Phi_Y(n^{-1/2}\theta)^n = (1 + z_n/n)^n\,,$$
where $Y = (X_1 - \mu)/\sqrt{v}$ and $z_n = z_n(\theta) := n[\Phi_Y(n^{-1/2}\theta) - 1]$. Applying Lemma 3.3.30 for $z_{n,k} = z_n/n$, it remains only to show that $z_n \to -\theta^2/2$ (for then $\eta_n = |z_n|^2/n \to 0$). Indeed, since $E(Y) = 0$ and $E(Y^2) = 1$, we have from Lemma 3.3.29 that
$$|z_n + \theta^2/2| = \big|n[\Phi_Y(n^{-1/2}\theta) - 1] + \theta^2/2\big| \le E V_n\,,$$
for $V_n = \min(|\theta Y|^2, n^{-1/2}|\theta Y|^3/6)$. With $V_n \xrightarrow{a.s.} 0$ as $n\to\infty$ and $V_n \le |\theta|^2|Y|^2$ which is integrable, it follows by dominated convergence that $EV_n \to 0$ as $n\to\infty$, hence $z_n \to -\theta^2/2$, completing the proof of Theorem 3.1.2.
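The limit driving this proof is easy to check numerically in a concrete case (an aside; the choice of distribution and of $\theta$ is arbitrary): for $Y = \pm 1$ with probability $1/2$ each, $\Phi_Y(\theta) = \cos\theta$, so the normalized sum has characteristic function $\cos(\theta/\sqrt{n})^n$.

```python
import math

# Sanity check (an aside): cos(theta/sqrt(n))^n, the characteristic function
# of the normalized sum of +-1 signs, approaches exp(-theta^2/2).
def phi_hat_sn(theta, n):
    return math.cos(theta / math.sqrt(n)) ** n

theta = 1.3
gaps = [abs(phi_hat_sn(theta, n) - math.exp(-theta**2 / 2))
        for n in (10, 1000, 10**6)]
```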
We proceed with a brief introduction of stable laws, their domain of attraction and the corresponding limit theorems (which are a natural generalization of the clt).
Definition 3.3.31. A random variable $Y$ has a stable law if it is non-degenerate and for any $m \ge 1$ there exist constants $d_m > 0$ and $c_m$, such that $Y_1 + \cdots + Y_m \stackrel{D}{=} d_m Y + c_m$, where $Y_i$ are i.i.d. copies of $Y$. Such a variable has a symmetric stable law if in addition $Y \stackrel{D}{=} -Y$. We further say that a random variable $X$ is in the domain of attraction of non-degenerate $Y$ if there exist constants $b_n > 0$ and $a_n$ such that $Z_n(X) = (S_n - a_n)/b_n \xrightarrow{D} Y$ for $S_n = \sum_{k=1}^n X_k$ and i.i.d. copies $X_k$ of $X$.
By definition, the collection of stable laws is closed under the affine map $Y \mapsto vY + \mu$ for $\mu\in\mathbb{R}$ and $v > 0$ (which correspond to the centering and scale of the law, but not necessarily its mean and variance). Clearly, each stable law is in its own domain of attraction and, as we see next, only stable laws have a non-empty domain of attraction.
Proposition 3.3.32. If $X$ is in the domain of attraction of some non-degenerate variable $Y$, then $Y$ must have a stable law.
Proof. Fix $m \ge 1$ and, setting $n = km$, let $\alpha_n = b_n/b_k > 0$ and $\beta_n = (a_n - ma_k)/b_k$. We then have the representation
$$\alpha_n Z_n(X) + \beta_n = \sum_{i=1}^{m} Z_k^{(i)}\,,$$
where $Z_k^{(i)} = (X_{(i-1)k+1} + \cdots + X_{ik} - a_k)/b_k$ are i.i.d. copies of $Z_k(X)$. From our assumption that $Z_k(X) \xrightarrow{D} Y$ we thus deduce (by at most $m-1$ applications of Exercise 3.3.18) that $\alpha_n Z_n(X) + \beta_n \xrightarrow{D} \widehat Y$, where $\widehat Y = Y_1 + \cdots + Y_m$ for i.i.d. copies $Y_i$ of $Y$. Moreover, by assumption $Z_n(X) \xrightarrow{D} Y$, hence by the convergence of types $\widehat Y \stackrel{D}{=} d_m Y + c_m$ for some finite non-random $d_m \ge 0$ and $c_m$ (c.f. Exercise 3.3.26). Recall from Lemma 3.3.8 that $\Phi_{\widehat Y}(\theta) = [\Phi_Y(\theta)]^m$. So, with $Y$ assumed non-degenerate, the same applies to $\widehat Y$ (see Exercise 3.3.25), and in particular $d_m > 0$. Since this holds for any $m \ge 1$, by definition $Y$ has a stable law.
We have already seen two examples of symmetric stable laws, namely those associated with the zero-mean normal density and with the Cauchy density of Example 3.3.13. Indeed, as you show next, to each $\alpha\in(0,2)$ there corresponds the symmetric stable variable $Y_\alpha$ whose characteristic function is $\exp(-|\theta|^\alpha)$ (so the Cauchy distribution corresponds to the symmetric stable of index $\alpha = 1$ and the normal distribution corresponds to index $\alpha = 2$).
Exercise 3.3.33. Fixing $\alpha\in(0,2)$, suppose $X \stackrel{D}{=} -X$ and $P(|X| > x) = x^{-\alpha}$ for all $x \ge 1$.
(a) Check that $\Phi_X(\theta) = 1 - \gamma(|\theta|)\,|\theta|^\alpha$, where $\gamma(r) = \alpha\int_r^{\infty}(1-\cos u)\,u^{-(\alpha+1)}\,du$ converges as $r \downarrow 0$ to $\gamma(0)$ finite and positive.
(b) Setting $\Phi_{\alpha,0}(\theta) = \exp(-|\theta|^\alpha)$, $b_n = (\gamma(0)n)^{1/\alpha}$ and $\widehat S_n = b_n^{-1}\sum_{k=1}^n X_k$ for i.i.d. copies $X_k$ of $X$, deduce that $\Phi_{\widehat S_n}(\theta) \to \Phi_{\alpha,0}(\theta)$ as $n\to\infty$, for any fixed $\theta\in\mathbb{R}$.
(c) Conclude that $X$ is in the domain of attraction of the symmetric stable variable $Y_\alpha$ whose characteristic function is $\Phi_{\alpha,0}$. Show further that in case $\alpha = 1$, $(n\log n)^{-1}\sum_{k=1}^n |X_k| \to 1$ in probability but not almost surely (in contrast, $X$ is integrable when $\alpha > 1$, in which case the strong law of large numbers applies).
Remark. While outside the scope of these notes, one can show that (up to scaling) any symmetric stable variable must be of the form $Y_\alpha$, and more generally that any stable law has a characteristic function of the form
$$\Phi(\theta) = \exp\big(i c\theta - v|\theta|^\alpha(1 + i\kappa\,\mathrm{sgn}(\theta)\,g_\alpha(\theta))\big)\,,$$
where $g_1(r) = (2/\pi)\log|r|$ and $g_\alpha(r) = \tan(\pi\alpha/2)$ for $\alpha \ne 1$. Our next exercise extends the bound on $\Phi_X(\theta)$ beyond the case $n = 2$ of Lemma 3.3.29.
Exercise 3.3.34. For any $x\in\mathbb{R}$ and non-negative integer $n$, let
$$R_n(x) = e^{ix} - \sum_{k=0}^{n}\frac{(ix)^k}{k!}\,.$$
(a) Show that $R_n(x) = \int_0^x iR_{n-1}(y)\,dy$ for all $n \ge 1$ and deduce by induction on $n$ that
$$|R_n(x)| \le \min\Big(\frac{2|x|^n}{n!},\ \frac{|x|^{n+1}}{(n+1)!}\Big)\qquad\text{for all } x\in\mathbb{R},\ n = 0, 1, 2, \ldots\,.$$
(b) Conclude that if $E|X|^n < \infty$ then
$$\Big|\Phi_X(\theta) - \sum_{k=0}^{n}\frac{(i\theta)^k\,EX^k}{k!}\Big| \le |\theta|^n\,E\Big[\min\Big(\frac{2|X|^n}{n!},\ \frac{|\theta|\,|X|^{n+1}}{(n+1)!}\Big)\Big]\,.$$
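The bound in part (a) is also easy to verify numerically over a grid (an aside; the grid of $x$ values and range of $n$ are arbitrary choices):

```python
import cmath, math

# Check of the bound in part (a) (an aside): |R_n(x)|, the remainder of the
# degree-n Taylor expansion of e^{ix}, should not exceed
# min(2|x|^n/n!, |x|^{n+1}/(n+1)!).
def R(n, x):
    taylor = sum((1j * x) ** k / math.factorial(k) for k in range(n + 1))
    return cmath.exp(1j * x) - taylor

def bound(n, x):
    return min(2.0 * abs(x) ** n / math.factorial(n),
               abs(x) ** (n + 1) / math.factorial(n + 1))

bound_holds = all(
    abs(R(n, x)) <= bound(n, x) + 1e-12
    for n in range(6)
    for x in (-8.0, -2.0, -0.3, 0.0, 0.4, 1.0, 3.0, 7.0))
```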
By solving the next exercise you generalize the proof of Theorem 3.1.2 via characteristic functions to the setting of Lindeberg's clt.
Exercise 3.3.35. Consider $\widehat S_n = \sum_{k=1}^n X_{n,k}$ for mutually independent random variables $X_{n,k}$, $k = 1,\ldots,n$, of zero mean and variance $v_{n,k}$, such that $v_n = \sum_{k=1}^n v_{n,k} \to 1$ as $n\to\infty$.
(a) Fixing $\theta\in\mathbb{R}$, show that
$$\Pi_n = \Phi_{\widehat S_n}(\theta) = \prod_{k=1}^n(1 + z_{n,k})\,,$$
where $z_{n,k} = \Phi_{X_{n,k}}(\theta) - 1$.
(b) With $z_\infty = -\theta^2/2$, use Lemma 3.3.29 to verify that $|z_{n,k}| \le 2\theta^2 v_{n,k}$ and further, for any $\epsilon > 0$,
$$|z_n - v_n z_\infty| \le \sum_{k=1}^n|z_{n,k} - v_{n,k}z_\infty| \le \theta^2 g_n(\epsilon) + \frac{\epsilon|\theta|^3}{6}\,v_n\,,$$
where $z_n = \sum_{k=1}^n z_{n,k}$ and $g_n(\epsilon)$ is given by (3.1.4).
(c) Recall that Lindeberg's condition $g_n(\epsilon) \to 0$ implies that $r_n^2 = \max_k v_{n,k} \to 0$ as $n\to\infty$. Deduce that then $z_n \to z_\infty$ and $\eta_n = \sum_{k=1}^n|z_{n,k}|^2 \to 0$ when $n\to\infty$.
(d) Applying Lemma 3.3.30, conclude that $\widehat S_n \xrightarrow{D} G$.
We conclude this section with an exercise that reviews various techniques one may use for establishing convergence in distribution for sums of independent random variables.
Exercise 3.3.36. Throughout this problem, $S_n = \sum_{k=1}^n X_k$ for mutually independent random variables $X_k$.
(a) Suppose that $P(X_k = k^{\alpha}) = P(X_k = -k^{\alpha}) = \frac{1}{2}k^{-\beta}$ and $P(X_k = 0) = 1 - k^{-\beta}$. Determine for which exponents $\alpha$ and $\beta$ one has $S_n/\sqrt{\operatorname{Var}(S_n)} \xrightarrow{D} G$.
3.4. Poisson approximation and the Poisson process
Subsection 3.4.1 deals with the Poisson approximation theorem and a few of its applications. It leads naturally to the introduction of the Poisson process in Subsection 3.4.2, where we also explore its relation to sums of i.i.d. Exponential variables and to order statistics of i.i.d. uniform random variables.
3.4.1. Poisson approximation. The Poisson approximation theorem is about the law of the sum $S_n$ of a large number ($= n$) of independent random variables. In contrast to the clt that also deals with such objects, here all variables are non-negative integer valued and the variance of $S_n$ remains bounded, allowing for the approximation in law of $S_n$ by an integer valued variable. The Poisson distribution results when the number of terms in the sum grows while the probability that each of them is non-zero decays. As such, the Poisson approximation is about counting the number of occurrences among many independent rare events.
Theorem 3.4.1 (Poisson approximation). Let $S_n = \sum_{k=1}^n Z_{n,k}$, where for each $n$ the random variables $Z_{n,k}$ for $1 \le k \le n$ are mutually independent, each taking values in the set of non-negative integers. Suppose that $p_{n,k} = P(Z_{n,k} = 1)$ and $\epsilon_{n,k} = P(Z_{n,k} \ge 2)$ are such that as $n\to\infty$,
(a) $\sum_{k=1}^n p_{n,k} \to \lambda < \infty$,
(b) $\max_{k=1,\ldots,n} p_{n,k} \to 0$,
(c) $\sum_{k=1}^n \epsilon_{n,k} \to 0$.
Then, $S_n \xrightarrow{D} N_\lambda$, a Poisson random variable of parameter $\lambda$.
Proof. Consider the sums $\overline S_n = \sum_{k=1}^n \overline Z_{n,k}$ of the truncated variables $\overline Z_{n,k} = Z_{n,k}\mathbf{I}_{\{Z_{n,k}\le 1\}}$, $k = 1,\ldots,n$. Indeed, observe that
$$P(S_n \ne \overline S_n) \le \sum_{k=1}^n P(Z_{n,k} \ne \overline Z_{n,k}) = \sum_{k=1}^n P(Z_{n,k} \ge 2) = \sum_{k=1}^n \epsilon_{n,k} \to 0\quad\text{for } n\to\infty\,,$$
by assumption (c). Hence, $(S_n - \overline S_n) \xrightarrow{p} 0$. Consequently, the convergence $\overline S_n \xrightarrow{D} N_\lambda$ of the sums of truncated variables implies that also $S_n \xrightarrow{D} N_\lambda$, and it suffices to show that $\Phi_{\overline S_n}(\theta) \to \Phi_{N_\lambda}(\theta)$ for each $\theta\in\mathbb{R}$.
To this end, recall that $\overline Z_{n,k}$ are independent Bernoulli variables of parameters $p_{n,k}$, $k = 1,\ldots,n$. Hence, by Lemma 3.3.8 and Example 3.3.5 we have that for $z_{n,k} = p_{n,k}(e^{i\theta} - 1)$,
$$\Phi_{\overline S_n}(\theta) = \prod_{k=1}^n\Phi_{\overline Z_{n,k}}(\theta) = \prod_{k=1}^n\big(1 - p_{n,k} + p_{n,k}e^{i\theta}\big) = \prod_{k=1}^n(1 + z_{n,k})\,.$$
Our assumption (a) implies that for $n\to\infty$,
$$z_n := \sum_{k=1}^n z_{n,k} = \Big(\sum_{k=1}^n p_{n,k}\Big)(e^{i\theta} - 1) \to \lambda(e^{i\theta} - 1) := z_\infty\,.$$
Further, since $|z_{n,k}| \le 2p_{n,k}$, our assumptions (a) and (b) imply that for $n\to\infty$,
$$\eta_n = \sum_{k=1}^n|z_{n,k}|^2 \le 4\sum_{k=1}^n p_{n,k}^2 \le 4\big(\max_{k=1,\ldots,n} p_{n,k}\big)\Big(\sum_{k=1}^n p_{n,k}\Big) \to 0\,.$$
Applying Lemma 3.3.30 we conclude that when $n\to\infty$,
$$\Phi_{\overline S_n}(\theta) \to \exp(z_\infty) = \exp\big(\lambda(e^{i\theta} - 1)\big) = \Phi_{N_\lambda}(\theta)$$
(see (3.3.3) for the last identity), thus completing the proof.
Remark. Recall Example 3.2.25 that the weak convergence of the laws of the integer valued $S_n$ to that of $N_\lambda$ is equivalent to the corresponding convergence in total variation. In the setting of Theorem 3.4.1, with $\lambda_n = \sum_{k=1}^n p_{n,k}$, the more quantitative result
$$\|\mathcal{P}_{S_n} - \mathcal{P}_{N_{\lambda_n}}\|_{tv} = \sum_{k=0}^{\infty}\big|P(S_n = k) - P(N_{\lambda_n} = k)\big| \le 2\min\big(\lambda_n^{-1}, 1\big)\sum_{k=1}^n p_{n,k}^2$$
due to Stein (1987) also holds (see also [Dur10, (3.6.1)] for a simpler argument, due to Hodges and Le Cam (1960), which is just missing the factor $\min(\lambda_n^{-1}, 1)$).
For the remainder of this subsection we list applications of the Poisson approximation theorem, starting with
Example 3.4.2 (Poisson approximation for the Binomial). Take independent variables $Z_{n,k} \in \{0,1\}$, so $\epsilon_{n,k} = 0$, with $p_{n,k} = p_n$ that does not depend on $k$. Then, the variable $\overline S_n = S_n$ has the Binomial distribution of parameters $(n, p_n)$. By Stein's result, the Binomial distribution of parameters $(n, p_n)$ is approximated well by the Poisson distribution of parameter $\lambda_n = np_n$, provided $p_n \to 0$. In case $\lambda_n = np_n \to \lambda < \infty$, Theorem 3.4.1 yields that the Binomial $(n, p_n)$ laws converge weakly as $n\to\infty$ to the Poisson distribution of parameter $\lambda$. This is in agreement with Example 3.1.7 where we approximate the Binomial distribution of parameters $(n, p)$ by the normal distribution, for in Example 3.1.8 we saw that, upon the same scaling, $N_{\lambda_n}$ is also approximated well by the normal distribution when $\lambda_n \to \infty$.
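The quality of this approximation can be measured exactly (an aside; the values of $n$, $\lambda$ and the truncation point are arbitrary choices), comparing the total-variation distance with Stein's bound quoted in the remark above:

```python
import math

# Companion to Example 3.4.2 (an aside): exact total-variation distance
# between Binomial(n, lam/n) and Poisson(lam), versus Stein's bound
# 2 * min(1/lam, 1) * n * (lam/n)^2.
def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(lam, k):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def tv_binom_poisson(n, lam, kmax=60):
    p = lam / n
    return sum(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(kmax + 1))

n, lam = 1000, 2.0
dist = tv_binom_poisson(n, lam)
stein_bound = 2.0 * min(1.0 / lam, 1.0) * n * (lam / n) ** 2
```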
Recall the occupancy problem where we distribute at random $r$ distinct balls among $n$ distinct boxes and each of the possible $n^r$ assignments of balls to boxes is equally likely. In Example 2.1.10 we considered the asymptotic fraction of empty boxes when $r/n \to \alpha$ and $n\to\infty$. Noting that the number of balls $M_{n,k}$ in the $k$-th box follows the Binomial distribution of parameters $(r, n^{-1})$, we deduce from Example 3.4.2 that $M_{n,k} \xrightarrow{D} N_\alpha$. Thus, $P(M_{n,k} = 0) \to P(N_\alpha = 0) = e^{-\alpha}$. That is, for large $n$ each box is empty with probability about $e^{-\alpha}$, which may explain (though not prove) the result of Example 2.1.10. Here we use the Poisson approximation theorem to tackle a different regime, in which $r = r_n$ is of order $n\log n$, and consequently, there are fewer empty boxes.
Proposition 3.4.3. Let $S_n$ denote the number of empty boxes. Assuming $r = r_n$ is such that $ne^{-r/n} \to \lambda \in [0,\infty)$, we have that $S_n \xrightarrow{D} N_\lambda$ as $n\to\infty$.
Proof. Let $Z_{n,k} = \mathbf{I}_{\{M_{n,k} = 0\}}$ for $k = 1,\ldots,n$; that is, $Z_{n,k} = 1$ if the $k$-th box is empty and $Z_{n,k} = 0$ otherwise. Note that $S_n = \sum_{k=1}^n Z_{n,k}$, with each $Z_{n,k}$ having the Bernoulli distribution of parameter $p_n = (1 - n^{-1})^r$. Our assumption about $r_n$ guarantees that $np_n \to \lambda$. If the occupancies $Z_{n,k}$ of the various boxes were mutually independent, then the stated convergence of $S_n$ to $N_\lambda$ would follow from Theorem 3.4.1. They are not, and instead one computes $p_l(r,n) = P(S_n = l)$ by inclusion-exclusion, in terms of
(3.4.1) $\displaystyle b_l(r,n) = \binom{n}{l}\Big(1 - \frac{l}{n}\Big)^r\,,$
getting that the probability of having no empty box is
(3.4.2) $\displaystyle p_0(r,n) = \sum_{l=0}^{n}(-1)^l\,b_l(r,n)\,,$
and, upon conditioning on the set of empty boxes, that $p_l(r,n) = b_l(r,n)\,p_0(r, n-l)$.
According to part (b) of Exercise 3.4.4, $p_0(r,n) \to e^{-\lambda}$. By part (a) of Exercise 3.4.4 we know that $b_l(r,n) \to \lambda^l/l!$ for fixed $l$, hence $p_l(r,n) \to e^{-\lambda}\lambda^l/l!$. As $p_l = P(S_n = l)$, the proof of the proposition is thus complete.
The following exercise provides the estimates one needs during the proof of Proposition 3.4.3 (for more details, see [Dur10, Theorem 3.6.5]).
Exercise 3.4.4. Assuming $ne^{-r/n} \to \lambda$, show that
(a) $b_l(r,n)$ of (3.4.1) converges to $\lambda^l/l!$ for each fixed $l$.
(b) $p_0(r,n)$ of (3.4.2) converges to $e^{-\lambda}$.
Finally, here is an application of Proposition 3.4.3 to the coupon collector's problem of Example 2.1.8, where $T_n$ denotes the number of independent trials it takes to have at least one representative of each of the $n$ possible values (and each trial produces a value $U_i$ that is distributed uniformly on the set of $n$ possible values).
Example 3.4.5 (Revisiting the coupon collector's problem). For any $x\in\mathbb{R}$, we have that
(3.4.3) $\displaystyle\lim_{n\to\infty} P(T_n - n\log n \le nx) = \exp(-e^{-x})\,,$
which is an improvement over our weak law result that $T_n/(n\log n) \to 1$. Indeed, to derive (3.4.3) view the first $r$ trials of the coupon collector as the random placement of $r$ balls into $n$ distinct boxes that correspond to the $n$ possible values. From this point of view, the event $\{T_n \le r\}$ corresponds to filling all $n$ boxes with the $r$ balls, that is, having none empty. Taking $r = r_n = [n\log n + nx]$ we have that $ne^{-r/n} \to e^{-x} =: \lambda$, and so it follows from Proposition 3.4.3 that $P(T_n \le r_n) \to P(N_\lambda = 0) = e^{-\lambda} = \exp(-e^{-x})$, as stated in (3.4.3).
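The limit (3.4.3) shows up clearly even for moderate $n$ (an aside; $n$ and the trial count below are arbitrary choices, and the empirical frequency carries Monte Carlo noise):

```python
import math, random

# Simulation sketch of (3.4.3) (an aside): the fraction of runs with
# T_n <= n log n + n x should be near exp(-e^{-x}); here x = 0.
random.seed(1)

def coupon_time(n):
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n, x, trials = 400, 0.0, 1000
cutoff = n * math.log(n) + n * x
estimate = sum(coupon_time(n) <= cutoff for _ in range(trials)) / trials
```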
Note that though $T_n = \sum_{k=1}^n X_{n,k}$ with $X_{n,k}$ independent, the convergence in distribution of $T_n$, given by (3.4.3), is to a non-normal limit. This should not surprise you, for the terms $X_{n,k}$ with $k$ near $n$ are large and do not satisfy Lindeberg's condition.
Exercise 3.4.6. Recall the birthday problem, where one samples uniformly and independently among $n$ equally likely birthdays. Show that the number of birthday matches among the first $[\lambda\sqrt{n}]$ individuals converges in distribution, as $n\to\infty$, to $N$, a Poisson random variable of parameter $\lambda^2/2$.
3.4.2. Poisson Process. The Poisson process is a continuous time stochastic process $\{N_t(\omega), t \ge 0\}$ which belongs to the following class of counting processes.
Definition 3.4.7. A counting process is a mapping $\omega\mapsto N_t(\omega)$, where $N_t(\omega)$ is a piecewise constant, non-decreasing, right continuous function of $t \ge 0$, with $N_0(\omega) = 0$ and (countably) infinitely many jump discontinuities, each of which is of size one.
Associated with each sample path $N_t(\omega)$ of such a process are the jump times $0 = T_0 < T_1 < \cdots < T_n < \cdots$, such that $T_k = \inf\{t \ge 0 : N_t \ge k\}$ for each $k$, or equivalently
$$N_t = \sup\{k \ge 0 : T_k \le t\}\,.$$
In applications we find such $N_t$ as counting the number of discrete events occurring in the interval $[0,t]$ for each $t \ge 0$, with $T_k$ denoting the arrival or occurrence time of the $k$-th such event.
Remark. It is possible to extend the notion of counting processes to discrete events indexed on $\mathbb{R}^d$, $d \ge 2$. This is done by assigning random integer counts $N_A$ to Borel subsets $A$ of $\mathbb{R}^d$ in an additive manner, that is, $N_{A\cup B} = N_A + N_B$ whenever $A$ and $B$ are disjoint. Such processes are called point processes. See also Exercise 7.1.13 for more about the Poisson point process and inhomogeneous Poisson processes of non-constant rate.
Among all counting processes we characterize the Poisson process by the joint distribution of its jump (arrival) times $T_k$.
Definition 3.4.8. The Poisson process of rate $\lambda > 0$ is the unique counting process with the gaps between jump times $\tau_k = T_k - T_{k-1}$, $k = 1, 2, \ldots$, being i.i.d. random variables, each having the exponential distribution of parameter $\lambda$.
Thus, from Exercise 1.4.46 we deduce that the $k$-th arrival time $T_k$ of the Poisson process of rate $\lambda$ has the gamma density of parameters $\alpha = k$ and $s = \lambda$,
$$f_{T_k}(u) = \frac{\lambda^k u^{k-1}}{(k-1)!}\,e^{-\lambda u}\,\mathbf{1}_{\{u > 0\}}\,.$$
As we have seen in Example 2.3.7, counting processes appear in the context of renewal theory. In particular, as shown in Exercise 2.3.8, the Poisson process of rate $\lambda$ satisfies the strong law of large numbers $t^{-1}N_t \xrightarrow{a.s.} \lambda$.
Recall that a random variable $N$ has the Poisson($\lambda$) law if
$$P(N = n) = \frac{\lambda^n}{n!}\,e^{-\lambda}\,,\qquad n = 0, 1, 2, \ldots\,.$$
Our next proposition, which is often used as an alternative definition of the Poisson process, also explains its name.
Proposition 3.4.9. For any $\ell$ and any $0 = t_0 < t_1 < \cdots < t_\ell$, the increments $N_{t_1}$, $N_{t_2} - N_{t_1}, \ldots, N_{t_\ell} - N_{t_{\ell-1}}$ are independent random variables and, for some $\lambda > 0$ and all $t > s \ge 0$, the increment $N_t - N_s$ has the Poisson($(t-s)\lambda$) law.
Thus, the Poisson process has independent increments, each having a Poisson law, where the parameter of the count $N_t - N_s$ is proportional to the length of the corresponding interval $[s,t]$.
The proof of Proposition 3.4.9 relies on the lack of memory of the exponential distribution. That is, if the law of a random variable $T$ is exponential (of some parameter $\lambda > 0$), then for all $t, s \ge 0$,
(3.4.4) $\displaystyle P(T > t+s \mid T > t) = \frac{P(T > t+s)}{P(T > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(T > s)\,.$
Indeed, the key to the proof of Proposition 3.4.9 is the following lemma.
Lemma 3.4.10. Fixing $t > 0$, the variables $\tau'_j$ with $\tau'_1 = T_{N_t+1} - t$ and $\tau'_j = T_{N_t+j} - T_{N_t+j-1}$, $j \ge 2$, are i.i.d., each having the exponential distribution of parameter $\lambda$. Further, the collection $\{\tau'_j\}$ is independent of $N_t$, which has the Poisson distribution of parameter $\lambda t$.
Remark. Note that in particular $E_t = T_{N_t+1} - t$, which counts the time till the next arrival occurs, hence called the excess life time at $t$, follows the exponential distribution of parameter $\lambda$.
Proof. Fixing $t > 0$ and $n \ge 1$, let $H_n(x) = P(t \ge T_n > t - x)$. With $H_n(x) = \int_0^x f_{T_n}(t-y)\,dy$ and $T_n$ independent of $\tau_{n+1}$, we get by Fubini's theorem (for $\mathbf{I}_{\{t \ge T_n > t - \tau_{n+1}\}}$), and the integration by parts of Lemma 1.4.30, that
(3.4.5) $\displaystyle P(N_t = n) = P(t \ge T_n > t - \tau_{n+1}) = E[H_n(\tau_{n+1})] = \int_0^t f_{T_n}(t-y)\,P(\tau_{n+1} > y)\,dy = \int_0^t \frac{\lambda^n(t-y)^{n-1}}{(n-1)!}\,e^{-\lambda(t-y)}\,e^{-\lambda y}\,dy = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}\,.$
As this applies for any $n \ge 1$, it follows that $N_t$ has the Poisson distribution of parameter $\lambda t$. Similarly, observe that for any $s_1 \ge 0$ and $n \ge 1$,
$$P(N_t = n, \tau'_1 > s_1) = P(t \ge T_n > t - \tau_{n+1} + s_1) = \int_0^t f_{T_n}(t-y)\,P(\tau_{n+1} > s_1 + y)\,dy = e^{-\lambda s_1}\,P(N_t = n) = P(\tau_1 > s_1)\,P(N_t = n)\,.$$
Since $T_0 = 0$, $P(N_t = 0) = e^{-\lambda t}$ and $T_1 = \tau_1$, in view of (3.4.4) this conclusion extends to $n = 0$, proving that $\tau'_1$ is independent of $N_t$ and has the same exponential law as $\tau_1$.
Next, fix an arbitrary integer $k \ge 2$ and non-negative $s_j \ge 0$ for $j = 1,\ldots,k$. Then, for any $n \ge 0$, since $\tau_{n+j}$, $j \ge 2$, are i.i.d. and independent of $(T_n, \tau_{n+1})$,
$$P(N_t = n, \tau'_j > s_j, j = 1,\ldots,k) = P(t \ge T_n > t - \tau_{n+1} + s_1,\ T_{n+j} - T_{n+j-1} > s_j,\ j = 2,\ldots,k) = P(t \ge T_n > t - \tau_{n+1} + s_1)\prod_{j=2}^{k} P(\tau_{n+j} > s_j) = P(N_t = n)\prod_{j=1}^{k} P(\tau'_j > s_j)\,.$$
Since $s_j \ge 0$ and $n \ge 0$ are arbitrary, this shows that the random variables $N_t$ and $\tau'_j$, $j = 1,\ldots,k$, are mutually independent (c.f. Corollary 1.4.12), with each $\tau'_j$ having an exponential distribution of parameter $\lambda$. As $k$ is arbitrary, the independence of $N_t$ and the countable collection $\{\tau'_j\}$ follows by Definition 1.4.3.
Proof of Proposition 3.4.9. Fix $t, s_j \ge 0$, $j = 1,\ldots,k$, and non-negative integers $n$ and $m_j$, $1 \le j \le k$. The event $\{N_{s_j} = m_j, 1 \le j \le k\}$ is of the form $\{(\tau_1,\ldots,\tau_r)\in H\}$ for $r = m_k + 1$ and
$$H = \bigcap_{j=1}^{k}\big\{x\in[0,\infty)^r : x_1 + \cdots + x_{m_j} \le s_j < x_1 + \cdots + x_{m_j+1}\big\}\,.$$
Since the event $\{(\tau'_1,\ldots,\tau'_r)\in H\}$ is merely $\{N_{t+s_j} - N_t = m_j, 1 \le j \le k\}$, it follows from Lemma 3.4.10 that
$$P(N_t = n,\ N_{t+s_j} - N_t = m_j, 1 \le j \le k) = P(N_t = n, (\tau'_1,\ldots,\tau'_r)\in H) = P(N_t = n)\,P((\tau_1,\ldots,\tau_r)\in H) = P(N_t = n)\,P(N_{s_j} = m_j, 1 \le j \le k)\,.$$
By induction on $\ell$ this identity implies that if $0 = t_0 < t_1 < t_2 < \cdots < t_\ell$, then
(3.4.6) $\displaystyle P(N_{t_i} - N_{t_{i-1}} = n_i,\ 1 \le i \le \ell) = \prod_{i=1}^{\ell} P(N_{t_i - t_{i-1}} = n_i)$
(the case $\ell = 1$ is trivial, and to advance the induction to $\ell + 1$ set $k = \ell$, $t = t_1$, $n = n_1$ and $s_j = t_{j+1} - t_1$, $m_j = \sum_{i=2}^{j+1} n_i$).
Considering (3.4.6) for $\ell = 2$, $t_2 = t > s = t_1$, and summing over the values of $n_1$, we see that $P(N_t - N_s = n_2) = P(N_{t-s} = n_2)$, hence by (3.4.5) we conclude that $N_t - N_s$ has the Poisson distribution of parameter $\lambda(t-s)$, as claimed.
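Proposition 3.4.9 is also easy to probe by simulation (an aside; the rate, intervals and sample sizes are arbitrary choices, and the empirical estimates carry Monte Carlo noise):

```python
import random

# Simulation sketch of Proposition 3.4.9 (an aside): a rate-lam Poisson
# process is built from i.i.d. exponential gaps; the increments over [0,1]
# and [1,2] should each have mean lam and be (nearly) uncorrelated.
random.seed(2)
lam = 3.0

def increments():
    t, n1, n2 = 0.0, 0, 0
    while True:
        t += random.expovariate(lam)
        if t > 2.0:
            return n1, n2
        if t <= 1.0:
            n1 += 1
        else:
            n2 += 1

samples = [increments() for _ in range(20000)]
mean1 = sum(a for a, _ in samples) / len(samples)
mean2 = sum(b for _, b in samples) / len(samples)
cov = sum(a * b for a, b in samples) / len(samples) - mean1 * mean2
```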
The Poisson process is also related to the order statistics $V_{n,k}$ for the uniform measure, as stated in the next two exercises.
Exercise 3.4.11. Let $U_1, U_2, \ldots, U_n$ be i.i.d. with each $U_i$ having the uniform measure on $(0,1]$. Denote by $V_{n,k}$ the $k$-th smallest number in $\{U_1,\ldots,U_n\}$.
(a) Show that $(V_{n,1},\ldots,V_{n,n})$ has the same law as $(T_1/T_{n+1},\ldots,T_n/T_{n+1})$, where $T_k$ are the jump (arrival) times for a Poisson process of rate $\lambda$ (see Subsection 1.4.2 for the definition of the law $\mathcal{P}_X$ of a random vector $X$).
(b) Taking $\lambda = 1$, deduce that $nV_{n,k} \xrightarrow{D} T_k$ as $n\to\infty$ while $k$ is fixed, where $T_k$ has the gamma density of parameters $\alpha = k$ and $s = 1$.
Exercise 3.4.12. Fixing any positive integer $n$ and $0 \le t_1 \le t_2 \le \cdots \le t_n \le t$, show that
$$P(T_k \le t_k,\ k = 1,\ldots,n \mid N_t = n) = \frac{n!}{t^n}\int_0^{t_1}\int_{x_1}^{t_2}\cdots\Big(\int_{x_{n-1}}^{t_n} dx_n\Big)dx_{n-1}\cdots dx_1\,.$$
That is, conditional on the event $\{N_t = n\}$, the first $n$ jump times $\{T_k : k = 1,\ldots,n\}$ have the same law as the order statistics $\{V_{n,k} : k = 1,\ldots,n\}$ of a sample of $n$ i.i.d. random variables $U_1,\ldots,U_n$, each of which is uniformly distributed in $[0,t]$.
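This conditional-uniformity can be seen in simulation (an aside; the rate, horizon and sample sizes are arbitrary choices): conditioned on $N_t = n$, the first jump time behaves like the minimum of $n$ i.i.d. uniforms on $[0,t]$, whose mean is $t/(n+1)$.

```python
import random

# Simulation sketch of Exercise 3.4.12 (an aside): reject Poisson-process
# paths until N_t = n, then compare the mean of T_1 with t/(n + 1).
random.seed(3)
lam, t, n = 2.0, 5.0, 10

def conditioned_first():
    while True:                      # reject paths until exactly n arrivals
        times, s = [], 0.0
        while True:
            s += random.expovariate(lam)
            if s > t:
                break
            times.append(s)
        if len(times) == n:
            return times[0]

avg_first = sum(conditioned_first() for _ in range(2000)) / 2000
```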
Here is an application of Exercise 3.4.12.
Exercise 3.4.13. Consider a Poisson process $N_t$ of rate $\lambda$ and jump times $T_k$.
(a) Compute the values of $g(n) = E(\mathbf{I}_{\{N_t = n\}}\prod_{k=1}^n T_k)$.
(b) Compute the value of $v = E(\sum_{k=1}^{N_t}(t - T_k))$.
(c) Suppose that $T_k$ is the arrival time to the train station of the $k$-th passenger for a train that departs the station at time $t$. What is the meaning of $N_t$ and of $v$ in this case?
The representation of the order statistics V_{n,k} in terms of the jump times of a Poisson process is very useful when studying the large n asymptotics of their spacings R_{n,k}. For example,

Exercise 3.4.14. Let R_{n,k} = V_{n,k} - V_{n,k-1}, k = 1, . . . , n, denote the spacings between the V_{n,k} of Exercise 3.4.11 (with V_{n,0} = 0). Show that as n → ∞,

(3.4.7) (n/log n) max_{k=1,...,n} R_{n,k} →p 1,
140 3. WEAK CONVERGENCE, clt AND POISSON APPROXIMATION
and further for each fixed x ≥ 0,

(3.4.8) G_n(x) := n^{-1} ∑_{k=1}^n I_{R_{n,k} > x/n} →p e^{-x},

(3.4.9) B_n(x) := P(min_{k=1,...,n} R_{n,k} > x/n^2) → e^{-x}.
As we show next, the Poisson approximation theorem provides a characterization of the Poisson process that is very attractive for modeling real-world phenomena.

Corollary 3.4.15. If N_t is a Poisson process of rate λ > 0, then for any fixed k, 0 < t_1 < t_2 < . . . < t_k and non-negative integers n_1, n_2, . . . , n_k,

P(N_{t_k+h} - N_{t_k} = 1 | N_{t_j} = n_j, j ≤ k) = λh + o(h),
P(N_{t_k+h} - N_{t_k} ≥ 2 | N_{t_j} = n_j, j ≤ k) = o(h),

where o(h) denotes a function f(h) such that h^{-1} f(h) → 0 as h ↓ 0.

Proof. Fixing k, the t_j and the n_j, denote by A the event {N_{t_j} = n_j, j ≤ k}. For a Poisson process of rate λ the random variable N_{t_k+h} - N_{t_k} is independent of A, with P(N_{t_k+h} - N_{t_k} = 1) = e^{-λh} λh and P(N_{t_k+h} - N_{t_k} ≥ 2) = 1 - e^{-λh}(1 + λh). Since e^{-λh} = 1 - λh + o(h), we see that the Poisson process satisfies this corollary.
Our next exercise explores the phenomenon of thinning, that is, the partitioning of Poisson variables as sums of mutually independent Poisson variables of smaller parameter.

Exercise 3.4.16. Suppose X_i are i.i.d. with P(X_i = j) = p_j for j = 0, 1, . . . , k, and N is a Poisson random variable of parameter λ that is independent of {X_i}. Let

N_j = ∑_{i=1}^N I_{X_i = j},  j = 0, . . . , k .

(a) Show that the variables N_j, j = 0, 1, . . . , k, are mutually independent, with N_j having a Poisson distribution of parameter λp_j.
(b) Show that the subsequence of jump times {T̂_k} obtained by independently keeping with probability p each of the jump times {T_k} of a Poisson process N_t of rate λ, yields in turn a Poisson process N̂_t of rate λp.
We conclude this section by noting the superposition property, namely that the sum of two independent Poisson processes is yet another Poisson process.

Exercise 3.4.17. Suppose N_t = N_t^{(1)} + N_t^{(2)}, where N_t^{(1)} and N_t^{(2)} are two independent Poisson processes of rates λ_1 > 0 and λ_2 > 0, respectively. Show that N_t is a Poisson process of rate λ_1 + λ_2.
3.5. Random vectors and the multivariate clt
The goal of this section is to extend the clt to random vectors, that is, R^d-valued random variables. Towards this end, we revisit in Subsection 3.5.1 the theory of weak convergence, this time in the more general setting of R^d-valued random variables. Subsection 3.5.2 is devoted to the extension of characteristic functions and Lévy's theorems to the multivariate setting, culminating with the Cramér-Wold reduction of convergence in distribution of random vectors to that of their one-dimensional linear projections. Finally, in Subsection 3.5.3 we introduce the important concept of Gaussian random vectors and prove the multivariate clt.
3.5.1. Weak convergence revisited. Recall Definition 3.2.17 of weak convergence for a sequence of probability measures on a topological space S, which suggests the following definition for convergence in distribution of S-valued random variables.

Definition 3.5.1. We say that (S, B_S)-valued random variables X_n converge in distribution to an (S, B_S)-valued random variable X_∞, denoted by X_n →D X_∞, if P_{X_n} →w P_{X_∞}.

As already remarked, the Portmanteau theorem about equivalent characterizations of weak convergence holds also when the probability measures ν_n are on a Borel measurable space (S, B_S) with (S, ρ) any metric space (and in particular for S = R^d).
Theorem 3.5.2 (portmanteau theorem). The following five statements are equivalent for any probability measures ν_n, 1 ≤ n ≤ ∞, on (S, B_S), with (S, ρ) any metric space.
(a) ν_n →w ν_∞
(b) For every closed set F, one has limsup_n ν_n(F) ≤ ν_∞(F)
(c) For every open set G, one has liminf_n ν_n(G) ≥ ν_∞(G)
(d) For every Borel set A with ν_∞(∂A) = 0, one has lim_n ν_n(A) = ν_∞(A)
(e) If the Borel function g : S → R is such that ν_∞(D_g) = 0, then ν_n ∘ g^{-1} →w ν_∞ ∘ g^{-1}, and if in addition g is bounded then ν_n(g) → ν_∞(g).
Remark. For S = R, the equivalence of (a)-(d) is the content of Theorem 3.2.21, while Proposition 3.2.19 derives (e) out of (a) (in the context of convergence in distribution, that is, X_n →D X_∞ and P(X_∞ ∈ D_g) = 0 implying that g(X_n) →D g(X_∞)).

Proof. (a) ⇒ (b). Let ρ_A(x) = inf{ρ(x, y) : y ∈ A} denote the distance in (S, ρ) of the point x from the set A. Since |ρ_A(x) - ρ_A(x′)| ≤ ρ(x, x′) for any x, x′, it follows that x ↦ ρ_A(x) is a continuous function on (S, ρ). Consequently, h_r(x) = (1 - rρ_A(x))_+ ∈ C_b(S) for all r ≥ 0. Further, ρ_A(x) = 0 for all x ∈ A, implying that h_r ≥ I_A for all r. Thus, applying part (a) of the Portmanteau theorem for h_r, we have that

limsup_n ν_n(A) ≤ lim_n ν_n(h_r) = ν_∞(h_r) .

As ρ_A(x) = 0 if and only if x ∈ Ā, it follows that h_r ↓ I_Ā as r → ∞, resulting with

limsup_n ν_n(A) ≤ ν_∞(Ā) .
Taking A = Ā = F a closed set, we arrive at part (b) of the theorem.

(d) ⇒ (e). Fix a Borel function g : S → R with K = sup_x |g(x)| < ∞ such that ν_∞(D_g) = 0. Clearly, {α ∈ R : ν_∞(g^{-1}({α})) > 0} is a countable set. Thus, fixing ε > 0, we can pick ℓ < ∞ and α_0 < α_1 < . . . < α_ℓ such that ν_∞(g^{-1}({α_i})) = 0 for 0 ≤ i ≤ ℓ, α_0 < -K < K < α_ℓ and α_i - α_{i-1} < ε for 1 ≤ i ≤ ℓ. Let A_i = {x : α_{i-1} < g(x) ≤ α_i} for i = 1, . . . , ℓ, noting that ∂A_i ⊆ {x : g(x) = α_{i-1}} ∪ {x : g(x) = α_i} ∪ D_g. Consequently, by our assumptions about g(·) and the α_i, we have that ν_∞(∂A_i) = 0 for each i = 1, . . . , ℓ. It thus follows from part (d) of the Portmanteau theorem that

∑_{i=1}^ℓ α_i ν_n(A_i) → ∑_{i=1}^ℓ α_i ν_∞(A_i)

as n → ∞. Our choice of the α_i and A_i is such that g ≤ ∑_{i=1}^ℓ α_i I_{A_i} ≤ g + ε, resulting with

ν_n(g) ≤ ∑_{i=1}^ℓ α_i ν_n(A_i) ≤ ν_n(g) + ε

for n = 1, 2, . . . , ∞. Considering first n → ∞ followed by ε ↓ 0, we establish that ν_n(g) → ν_∞(g) as soon as g is bounded and ν_∞(D_g) = 0. Since D_{h∘g} ⊆ D_g, this conclusion applies also for h ∘ g, for every h ∈ C_b(R), so in this case ν_n ∘ g^{-1} →w ν_∞ ∘ g^{-1}.
We next show that the relation of Exercise 3.2.6 between convergence in probability and in distribution also extends to any metric space (S, ρ), a fact we will later use in Subsection 9.2, when considering the metric space of all continuous functions on [0, ∞).

Corollary 3.5.3. If random variables X_n, 1 ≤ n ≤ ∞, on the same probability space and taking values in a metric space (S, ρ) are such that ρ(X_n, X_∞) →p 0, then X_n →D X_∞.

Proof. Fixing h ∈ C_b(S) and ε > 0, we have by continuity of h(·) that G_r ↑ S, where

G_r = {y ∈ S : |h(x) - h(y)| ≤ ε whenever ρ(x, y) ≤ r^{-1}} .

By definition, if X_∞ ∈ G_r and ρ(X_n, X_∞) ≤ r^{-1}, then |h(X_n) - h(X_∞)| ≤ ε. Hence, for any n, r ≥ 1,

E[|h(X_n) - h(X_∞)|] ≤ ε + 2‖h‖_∞ (P(X_∞ ∉ G_r) + P(ρ(X_n, X_∞) > r^{-1})) ,

where ‖h‖_∞ = sup_{x∈S} |h(x)| is finite (by the boundedness of h). Considering n → ∞ followed by r → ∞, we deduce from the convergence in probability of ρ(X_n, X_∞) to zero that

limsup_n E[|h(X_n) - h(X_∞)|] ≤ ε + 2‖h‖_∞ lim_{r→∞} P(X_∞ ∉ G_r) = ε .

Since this applies for any ε > 0, it follows by the triangle inequality that Eh(X_n) → Eh(X_∞) for all h ∈ C_b(S), i.e. X_n →D X_∞.
Remark. The notion of distribution function for an R^d-valued random vector X = (X_1, . . . , X_d) is

F_X(x) = P(X_1 ≤ x_1, . . . , X_d ≤ x_d) .

Inducing a partial order on R^d by x ≤ y if and only if y - x has only non-negative coordinates, each distribution function F_X(x) has the three properties listed in Theorem 1.2.36. Unfortunately, these three properties are not sufficient for a given function F : R^d → [0, 1] to be a distribution function. For example, since the measure of each rectangle A = ∏_{i=1}^d (a_i, b_i] should be non-negative, the additional constraint of the form

Δ_A F = ∑_{j=1}^{2^d} ε_j F(x_j) ≥ 0

should hold if F(·) is to be a distribution function. Here x_j enumerates the 2^d corners of the rectangle A, and the corner x_j is taken with sign ε_j = +1 if and only if it has an even number of coordinates from the collection {a_1, . . . , a_d} (and with ε_j = -1 otherwise). Adding the fourth property that Δ_A F ≥ 0 for each rectangle A ⊆ R^d, we get necessary and sufficient conditions for F(·) to be the distribution function of some R^d-valued random variable (c.f. [Bil95, Theorem 12.5] for a detailed proof).
Recall Definition 3.2.31 of uniform tightness, where for S = R^d we can take K_ε = [-M_ε, M_ε]^d with no loss of generality. Though Prohorov's theorem about uniform tightness (i.e. Theorem 3.2.34) is beyond the scope of these notes, we shall only need in the sequel the fact that a uniformly tight sequence of probability measures has at least one limit point. This can be proved for S = R^d in a manner similar to what we have done in Theorem 3.2.37 and Lemma 3.2.38 for S = R^1, using the corresponding concept of distribution function F_X(·) (see [Dur10, Theorem 3.9.2] for more details).
3.5.2. Characteristic function. We start by extending the useful notion of characteristic function to the context of R^d-valued random variables (which we also call hereafter random vectors).

Definition 3.5.4. Adopting the notation (x, y) = ∑_{i=1}^d x_i y_i for x, y ∈ R^d, a random vector X = (X_1, X_2, . . . , X_d) with values in R^d has the characteristic function

Φ_X(θ) = E[e^{i(θ,X)}] ,

where θ = (θ_1, θ_2, . . . , θ_d) ∈ R^d and i = √-1.

Remark. The characteristic function Φ_X : R^d → C exists for any X since

(3.5.1) e^{i(θ,X)} = cos(θ, X) + i sin(θ, X) ,

with both real and imaginary parts being bounded (hence integrable) random variables. Actually, it is easy to check that all five properties of Proposition 3.3.2 hold, where part (e) is modified to Φ_{A^t X + b}(θ) = exp(i(b, θ)) Φ_X(Aθ), for any non-random d × d-dimensional matrix A and b ∈ R^d (with A^t denoting the transpose of the matrix A).
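The modified part (e) can be verified directly on a finite-support example, where the expectation is a finite sum (everything below, the support, the matrix A and the vector b, is an arbitrary illustration, not taken from the notes):

```python
import cmath

# a toy two-dimensional random vector taking four equally likely values
support = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, -1.0)]
prob = [0.25] * 4

def phi(points, theta):
    # characteristic function E[exp(i (theta, X))] of a finite-support vector
    return sum(p * cmath.exp(1j * (theta[0] * x + theta[1] * y))
               for p, (x, y) in zip(prob, points))

A = [[2.0, 1.0], [0.0, 3.0]]   # non-random 2x2 matrix
b = (0.5, -1.0)
theta = (0.7, -0.3)

# left side: characteristic function of A^t X + b, evaluated at theta
AtX = [(A[0][0] * x + A[1][0] * y + b[0], A[0][1] * x + A[1][1] * y + b[1])
       for (x, y) in support]
lhs = phi(AtX, theta)
# right side: exp(i (b, theta)) * Phi_X(A theta)
Atheta = (A[0][0] * theta[0] + A[0][1] * theta[1],
          A[1][0] * theta[0] + A[1][1] * theta[1])
rhs = cmath.exp(1j * (b[0] * theta[0] + b[1] * theta[1])) * phi(support, Atheta)
assert abs(lhs - rhs) < 1e-12
print("property (e) verified on this example")
```

The identity behind the check is (θ, A^t X) = (Aθ, X), which is exactly how part (e) follows from the definition.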
Here is the extension of the notion of probability density function (as in Definition 1.2.39) to a random vector.

Definition 3.5.5. Suppose f_X is a non-negative Borel measurable function with ∫_{R^d} f_X(x) dx = 1. We say that a random vector X = (X_1, . . . , X_d) has a probability density function f_X(·) if for every b = (b_1, . . . , b_d),

F_X(b) = ∫_{-∞}^{b_1} . . . ∫_{-∞}^{b_d} f_X(x_1, . . . , x_d) dx_d . . . dx_1

(such f_X is sometimes called the joint density of X_1, . . . , X_d). This is the same as saying that the law of X is of the form f_X λ^d, with λ^d the d-fold product Lebesgue measure on R^d (i.e. the d > 1 extension of Example 1.3.60).
Example 3.5.6. We have the following extension of the Fourier transform formula (3.3.4) to random vectors X with density,

Φ_X(θ) = ∫_{R^d} e^{i(θ,x)} f_X(x) dx

(this is merely a special case of the extension of Corollary 1.3.62 to h : R^d → R).
We next state and prove the corresponding extension of Lévy's inversion theorem.

Theorem 3.5.7 (Lévy's inversion theorem). Suppose Φ_X(θ) is the characteristic function of a random vector X = (X_1, . . . , X_d) whose law is P_X, a probability measure on (R^d, B_{R^d}). If A = [a_1, b_1] × . . . × [a_d, b_d] with P_X(∂A) = 0, then

(3.5.2) P_X(A) = lim_{T→∞} ∫_{[-T,T]^d} ∏_{j=1}^d ψ_{a_j,b_j}(θ_j) Φ_X(θ) dθ

for the ψ_{a,b}(·) of (3.3.5). Further, the characteristic function determines the law of a random vector. That is, if Φ_X(θ) = Φ_Y(θ) for all θ, then X has the same law as Y.
Proof. We derive (3.5.2) by adapting the proof of Theorem 3.3.12. First apply Fubini's theorem with respect to the product of Lebesgue's measure on [-T, T]^d and the law of X (both of which are finite measures on R^d) to get the identity

J_T(a, b) := ∫_{[-T,T]^d} ∏_{j=1}^d ψ_{a_j,b_j}(θ_j) Φ_X(θ) dθ = ∫_{R^d} [ ∏_{j=1}^d ∫_{-T}^T h_{a_j,b_j}(x_j, θ_j) dθ_j ] dP_X(x)

(where h_{a,b}(x, θ) = ψ_{a,b}(θ) e^{iθx}). In the course of proving Theorem 3.3.12 we have seen that for j = 1, . . . , d the integral over θ_j is uniformly bounded in T and that it converges to g_{a_j,b_j}(x_j) as T → ∞. Thus, by bounded convergence it follows that

lim_{T→∞} J_T(a, b) = ∫_{R^d} g_{a,b}(x) dP_X(x) ,

where

g_{a,b}(x) = ∏_{j=1}^d g_{a_j,b_j}(x_j)

is zero on the complement of the closure of A and one on the interior A^o of A (see the explicit formula for g_{a,b}(x) provided there). So, our assumption that P_X(∂A) = 0 implies that the limit of J_T(a, b) as T → ∞ is merely P_X(A), thus establishing (3.5.2).
Suppose now that Φ_X(θ) = Φ_Y(θ) for all θ. Adapting the proof of Corollary 3.3.14 to the current setting, let Γ = {α ∈ R : P(X_j = α) > 0 or P(Y_j = α) > 0 for some j = 1, . . . , d}, noting that if all the coordinates a_j, b_j, j = 1, . . . , d, of a rectangle A are from the complement of Γ, then both P_X(∂A) = 0 and P_Y(∂A) = 0. Thus, by (3.5.2) we have that P_X(A) = P_Y(A) for any A in the collection C of rectangles with coordinates in the complement of Γ. Recall that Γ is countable, so for any rectangle A there exist A_n ∈ C such that A_n ↓ A, and by continuity from above of both P_X and P_Y it follows that P_X(A) = P_Y(A) for every rectangle A. In view of Proposition 1.1.39 and Exercise 1.1.21, this implies that the probability measures P_X and P_Y agree on all Borel subsets of R^d.
We next provide the ingredients needed when using characteristic functions en route to the derivation of a convergence in distribution result for random vectors. To this end, we start with the following analog of Lemma 3.3.16.

Lemma 3.5.8. Suppose the random vectors X_n, 1 ≤ n ≤ ∞, on R^d are such that Φ_{X_n}(θ) → Φ_{X_∞}(θ) as n → ∞ for each θ ∈ R^d. Then, the corresponding sequence of laws {P_{X_n}} is uniformly tight.

Proof. Fixing θ ∈ R^d, consider the sequence of random variables Y_n = (θ, X_n). Since Φ_{Y_n}(α) = Φ_{X_n}(αθ) for 1 ≤ n ≤ ∞, we have that Φ_{Y_n}(α) → Φ_{Y_∞}(α) for all α ∈ R. The uniform tightness of the laws of {Y_n} then follows by Lemma 3.3.16. Considering θ_1, . . . , θ_d which are the unit vectors in the d different coordinates, we have the uniform tightness of the laws of {X_{n,j}} for the sequence of random vectors X_n = (X_{n,1}, X_{n,2}, . . . , X_{n,d}) and each fixed coordinate j = 1, . . . , d. For the compact sets K_ε = [-M_ε, M_ε]^d and all n,

P(X_n ∉ K_ε) ≤ ∑_{j=1}^d P(|X_{n,j}| > M_ε) .

As d is finite, this leads from the uniform tightness of the laws of X_{n,j} for each j = 1, . . . , d to the uniform tightness of the laws of X_n.
Equipped with Lemma 3.5.8 we are ready to state and prove Lévy's continuity theorem.

Theorem 3.5.9 (Lévy's continuity theorem). Let X_n, 1 ≤ n ≤ ∞, be random vectors with characteristic functions Φ_{X_n}(θ). Then, X_n →D X_∞ if and only if Φ_{X_n}(θ) → Φ_{X_∞}(θ) as n → ∞ for each fixed θ ∈ R^d.

Proof. Since the real and imaginary parts of x ↦ e^{i(θ,x)} are bounded continuous functions, if X_n →D X_∞, then clearly as n → ∞,

Φ_{X_n}(θ) = E[e^{i(θ,X_n)}] → E[e^{i(θ,X_∞)}] = Φ_{X_∞}(θ) .

For the converse direction, assuming that Φ_{X_n}(θ) → Φ_{X_∞}(θ) for each θ, we have by Lemma 3.5.8 that the sequence of laws {P_{X_n}} is uniformly tight, hence any subsequence has a further subsequence that converges in distribution to some random vector X. By the preceding part of the proof, the characteristic function of any such limit point X is Φ_{X_∞}, so by Theorem 3.5.7 each limit point X has the same law as X_∞. Fixing h ∈ C_b(R^d), it follows that every subsequence of y_n = Eh(X_n) has a further subsequence converging to y_∞ = Eh(X_∞). Consequently, y_n → y_∞, as stated.
Remark. As in the case of Theorem 3.3.17, it is not hard to show that if Φ_{X_n}(θ) → Φ(θ) as n → ∞ and Φ(·) is continuous at θ = 0, then Φ is necessarily the characteristic function of some random vector X_∞, and consequently X_n →D X_∞.
The proof of the multivariate clt is just one of the results that rely on the following immediate corollary of Lévy's continuity theorem.

Corollary 3.5.10 (Cramér-Wold device). A sufficient condition for X_n →D X_∞ is that (θ, X_n) →D (θ, X_∞) for each θ ∈ R^d.

Proof. Since (θ, X_n) →D (θ, X_∞), we have in particular that

Φ_{X_n}(θ) = E[e^{i(θ,X_n)}] → E[e^{i(θ,X_∞)}] = Φ_{X_∞}(θ) .

As this applies for any θ ∈ R^d, we get that X_n →D X_∞ by applying Lévy's continuity theorem in R^d (i.e., Theorem 3.5.9), now in the converse direction.
Remark. Beware that it is not enough to consider only finitely many values of θ in the Cramér-Wold device. For example, consider the random vectors X_n = (X_n, Y_n) with X_n, Y_{2n} i.i.d. and Y_{2n+1} = X_{2n+1}. Convince yourself that in this case X_n →D X_1 and Y_n →D Y_1, but the random vectors X_n do not converge in distribution (to any limit).
The computation of the characteristic function is much simplified in the presence of independence.

Exercise 3.5.11. Show that if Y = (Y_1, . . . , Y_d) with Y_k mutually independent R.V., then for all θ = (θ_1, . . . , θ_d) ∈ R^d,

(3.5.3) Φ_Y(θ) = ∏_{k=1}^d Φ_{Y_k}(θ_k) .

Conversely, show that if (3.5.3) holds for all θ ∈ R^d, then the random variables Y_k, k = 1, . . . , d, are mutually independent of each other.
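For coordinates with finite support, both sides of (3.5.3) are finite sums, so the identity can be checked directly (the two supports below are arbitrary illustrations):

```python
import cmath
from itertools import product as cart

# two independent coordinates with small finite supports: (value, probability)
Y1 = [(-1.0, 0.5), (1.0, 0.5)]
Y2 = [(0.0, 0.3), (2.0, 0.7)]
theta = (0.9, -0.4)

# joint characteristic function of Y = (Y1, Y2) built from the product law
joint = sum(p1 * p2 * cmath.exp(1j * (theta[0] * y1 + theta[1] * y2))
            for (y1, p1), (y2, p2) in cart(Y1, Y2))
# product of the marginal characteristic functions, as in (3.5.3)
marg = (sum(p * cmath.exp(1j * theta[0] * y) for y, p in Y1)
        * sum(p * cmath.exp(1j * theta[1] * y) for y, p in Y2))
assert abs(joint - marg) < 1e-12
print("(3.5.3) holds for this independent pair")
```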
3.5.3. Gaussian random vectors and the multivariate clt. Recall the following linear algebra concept.

Definition 3.5.12. A d × d matrix A with entries A_{jk} is called non-negative definite (or positive semi-definite) if A_{jk} = A_{kj} for all j, k, and for any θ ∈ R^d,

(θ, Aθ) = ∑_{j=1}^d ∑_{k=1}^d θ_j A_{jk} θ_k ≥ 0 .

We are ready to define the class of multivariate normal distributions via the corresponding characteristic functions.

Definition 3.5.13. We say that a random vector X = (X_1, X_2, . . . , X_d) is Gaussian, or alternatively that it has a multivariate normal distribution, if

(3.5.4) Φ_X(θ) = e^{-(1/2)(θ,Vθ)} e^{i(θ,μ)} ,

for some non-negative definite d × d matrix V, some μ = (μ_1, . . . , μ_d) ∈ R^d and all θ = (θ_1, . . . , θ_d) ∈ R^d. We denote such a law by N(μ, V).

Remark. For d = 1 this definition coincides with Example 3.3.6.
Our next proposition proves that the multivariate N(μ, V) distribution is well defined, and further links the vector μ and the matrix V to the first two moments of this distribution.

Proposition 3.5.14. The formula (3.5.4) corresponds to the characteristic function of a probability measure on R^d. Further, the parameters μ and V of the Gaussian random vector X are merely μ_j = EX_j and V_{jk} = Cov(X_j, X_k), j, k = 1, . . . , d.

Proof. Any non-negative definite matrix V can be written as V = U^t D^2 U for some orthogonal matrix U (i.e., such that U^t U = I, the d × d-dimensional identity matrix), and some diagonal matrix D. Consequently,

(θ, Vθ) = (Aθ, Aθ)

for A = DU and all θ ∈ R^d. We claim that (3.5.4) is the characteristic function of the random vector X = A^t Y + μ, where Y = (Y_1, . . . , Y_d) has i.i.d. coordinates Y_k, each of which has the standard normal distribution. Indeed, by Exercise 3.5.11, Φ_Y(θ) = exp(-(1/2)(θ, θ)) is the product of the characteristic functions exp(-θ_k^2/2) of the standard normal distribution (see Example 3.3.6), and by part (e) of Proposition 3.3.2, Φ_X(θ) = exp(i(θ, μ)) Φ_Y(Aθ), yielding the formula (3.5.4).

We have just shown that X has the N(μ, V) distribution if X = A^t Y + μ for a Gaussian random vector Y (whose distribution is N(0, I)), such that EY_j = 0 and Cov(Y_j, Y_k) = 1_{j=k} for j, k = 1, . . . , d. It thus follows by linearity of the expectation and the bilinearity of the covariance that EX_j = μ_j and Cov(X_j, X_k) = [E A^t Y (A^t Y)^t]_{jk} = (A^t I A)_{jk} = V_{jk}, as claimed.
Definition 3.5.13 allows for V that is non-invertible, so for example the constant random vector X = μ is considered a Gaussian random vector though it obviously does not have a density. The reason we make this choice is to have the collection of multivariate normal distributions closed with respect to L^2 convergence, as we prove below to be the case.

Proposition 3.5.15. Suppose Gaussian random vectors X_n converge in L^2 to a random vector X_∞, that is, E[‖X_n - X_∞‖^2] → 0 as n → ∞. Then, X_∞ is a Gaussian random vector, whose parameters are the limits of the corresponding parameters of X_n.
Proof. Recall that the convergence in L^2 of X_n to X_∞ implies that μ_n = EX_n converge to μ_∞ = EX_∞, and likewise that the covariance matrices V_n of X_n converge element-wise to the covariance matrix V_∞ of X_∞ (as second moments converge under L^2 convergence). Further, the L^2 convergence implies the corresponding convergence in probability and hence, by bounded convergence, Φ_{X_n}(θ) → Φ_{X_∞}(θ) for each θ ∈ R^d. Since

Φ_{X_n}(θ) = e^{-(1/2)(θ,V_n θ)} e^{i(θ,μ_n)}

for any n < ∞, it follows that the same applies for n = ∞. It is a well-known fact of linear algebra that the element-wise limit V_∞ of the non-negative definite matrices V_n is itself non-negative definite, so X_∞ is a Gaussian random vector with parameters μ_∞ and V_∞, as claimed.

We are now ready to state and prove the multivariate clt.

Theorem 3.5.16 (multivariate clt). Let

Ŝ_n = n^{-1/2} ∑_{k=1}^n (X_k - μ) ,

where X_k are i.i.d. random vectors with finite second moments and such that μ = EX_1. Then, Ŝ_n →D G, with G having the N(0, V) distribution and where V is the d × d-dimensional covariance matrix of X_1.
Proof. Consider the i.i.d. random vectors Y_k = X_k - μ, each having zero mean and the covariance matrix V. Fixing an arbitrary vector θ ∈ R^d, we proceed to show that (θ, Ŝ_n) →D (θ, G), which in view of the Cramér-Wold device completes the proof of the theorem. Indeed, note that (θ, Ŝ_n) = n^{-1/2} ∑_{k=1}^n Z_k, where Z_k = (θ, Y_k) are i.i.d. R-valued random variables, having zero mean and variance

v_θ = Var(Z_1) = E[(θ, Y_1)^2] = (θ, E[Y_1 Y_1^t] θ) = (θ, Vθ) .

Observing that the clt of Proposition 3.1.2 thus applies to (θ, Ŝ_n), it remains only to verify that the resulting limit distribution N(0, v_θ) is that of (θ, G). Indeed,

Φ_{(θ,G)}(s) = Φ_G(sθ) = e^{-(1/2)s^2(θ,Vθ)} = e^{-v_θ s^2/2} ,

which is the characteristic function of the N(0, v_θ) distribution.

Example 3.5.17. Consider the random walk S_n = ∑_{k=1}^n X_k, where X, X_k are i.i.d. random vectors such that

P(X = +e_i) = P(X = -e_i) = 1/(2d) ,  i = 1, . . . , d,

and e_i is the unit vector in the i-th direction, i = 1, . . . , d. In this case EX = 0, and if i ≠ j then E[X_i X_j] = 0, resulting with the covariance matrix V = (1/d)I for the multivariate normal limit in distribution of n^{-1/2} S_n.
Building on Lindeberg's clt for weighted sums of i.i.d. random variables, the following multivariate normal limit is the basis for the convergence of random walks to Brownian motion, to which Section 9.2 is devoted.

Exercise 3.5.18. Suppose ξ_k are i.i.d. with Eξ_1 = 0 and Eξ_1^2 = 1. Consider the random functions Ŝ_n(t) = n^{-1/2} S(nt), where S(t) = ∑_{k=1}^{[t]} ξ_k + (t - [t]) ξ_{[t]+1} and [t] denotes the integer part of t.

(a) Verify that Lindeberg's clt applies for Ŝ_n = ∑_{k=1}^n a_{n,k} ξ_k whenever the non-random a_{n,k} are such that r_n = max{|a_{n,k}| : k = 1, . . . , n} → 0 and v_n = ∑_{k=1}^n a_{n,k}^2 → 1.
(b) Let c(s, t) = min(s, t) and, fixing 0 = t_0 ≤ t_1 < . . . < t_d, denote by C the d × d matrix of entries C_{jk} = c(t_j, t_k). Show that for any θ ∈ R^d,

∑_{r=1}^d (t_r - t_{r-1}) ( ∑_{j=r}^d θ_j )^2 = (θ, Cθ) ,

(c) Using the Cramér-Wold device, deduce that (Ŝ_n(t_1), . . . , Ŝ_n(t_d)) →D G with G having the N(0, C) distribution.
As we see in the next exercise, there is more to a Gaussian random vector than each coordinate having a normal distribution.

Exercise 3.5.19. Suppose X_1 has a standard normal distribution and S is independent of X_1 and such that P(S = 1) = P(S = -1) = 1/2.
(a) Check that X_2 = SX_1 also has a standard normal distribution.
(b) Check that X_1 and X_2 are uncorrelated random variables, each having the standard normal distribution, while X = (X_1, X_2) is not a Gaussian random vector and X_1 and X_2 are not independent variables.
Motivated by the proof of Proposition 3.5.14, here is an important property of Gaussian random vectors, which may also be considered an alternative to Definition 3.5.13.

Exercise 3.5.20. A random vector X has the multivariate normal distribution if and only if (∑_{i=1}^d a_{ji} X_i, j = 1, . . . , m) is a Gaussian random vector for any non-random coefficients a_{11}, a_{12}, . . . , a_{md} ∈ R.

The classical definition of the multivariate normal density applies for a strict subset of the distributions we consider in Definition 3.5.13.

Definition 3.5.21. We say that X has a non-degenerate multivariate normal distribution if the matrix V is invertible, or alternatively, when V is a (strictly) positive definite matrix, that is, (θ, Vθ) > 0 whenever θ ≠ 0.
We next relate the density of a random vector with its characteristic function, and provide the density for the non-degenerate multivariate normal distribution.

Exercise 3.5.22.
(a) Show that if ∫_{R^d} |Φ_X(θ)| dθ < ∞, then X has the bounded continuous probability density function

(3.5.5) f_X(x) = (2π)^{-d} ∫_{R^d} e^{-i(θ,x)} Φ_X(θ) dθ .

(b) Show that a random vector X with a non-degenerate multivariate normal distribution N(μ, V) has the probability density function

f_X(x) = (2π)^{-d/2} (det V)^{-1/2} exp( -(1/2) (x - μ, V^{-1}(x - μ)) ) .
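For d = 1 and the standard normal characteristic function e^{-θ^2/2}, the inversion formula (3.5.5) at x = 0 can be evaluated by simple quadrature, recovering the density value 1/√(2π) (the truncation T and the grid size below are numerical conveniences, not part of the formula):

```python
import math

# Invert the characteristic function exp(-theta^2 / 2) of a standard normal
# (d = 1) at x = 0 via (3.5.5): since exp(-i*theta*0) = 1, the integrand is real.
T, steps = 12.0, 24000
dtheta = 2 * T / steps
integral = sum(math.exp(-(-T + (k + 0.5) * dtheta) ** 2 / 2)   # midpoint rule
               for k in range(steps)) * dtheta
density_at_0 = integral / (2 * math.pi)
print(density_at_0, 1 / math.sqrt(2 * math.pi))
```

The two printed numbers agree to many digits, matching the N(0, 1) density at the origin from part (b).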
Here is an application to the uniform distribution over the sphere in R^n, as n → ∞.

Exercise 3.5.23. Suppose Y_k are i.i.d. random variables with EY_1^2 = 1 and EY_1 = 0. Let W_n = n^{-1} ∑_{k=1}^n Y_k^2 and X_{n,k} = Y_k/√W_n for k = 1, . . . , n.
(a) Noting that W_n →a.s. 1, deduce that X_{n,1} →D Y_1.
(b) Show that n^{-1/2} ∑_{k=1}^n X_{n,k} →D G, whose distribution is N(0, 1).
(c) Show that if Y_k are standard normal random variables, then the random vector X_n = (X_{n,1}, . . . , X_{n,n}) has the uniform distribution over the surface of the sphere of radius √n in R^n (i.e., the unique measure supported on this sphere and invariant under orthogonal transformations), and interpret the preceding results for this special case.
We conclude the section with the following exercise, which is a multivariate, Lindeberg's type clt.

Exercise 3.5.24. Let y^t denote the transpose of the vector y ∈ R^d and ‖y‖ its Euclidean norm. The independent random vectors Y_k on R^d are such that Y_k has the same law as -Y_k,

lim_{n→∞} ∑_{k=1}^n P(‖Y_k‖ > √n) = 0 ,

and for some symmetric, (strictly) positive definite matrix V and any fixed ε ∈ (0, 1],

lim_{n→∞} n^{-1} ∑_{k=1}^n E(Y_k Y_k^t I_{‖Y_k‖ ≤ ε√n}) = V .

(a) Let T_n = ∑_{k=1}^n X_{n,k} for X_{n,k} = n^{-1/2} Y_k I_{‖Y_k‖ ≤ √n}. Show that T_n →D G, with G having the N(0, V) distribution.
(b) Let Ŝ_n = n^{-1/2} ∑_{k=1}^n Y_k and show that Ŝ_n →D G.
(c) Show that (Ŝ_n)^t V^{-1} Ŝ_n →D Z and identify the law of Z.
CHAPTER 4
Conditional expectations and probabilities
The most important concept in probability theory is the conditional expectation, to which this chapter is devoted. In contrast with the elementary definition often used for a finite or countable sample space, the conditional expectation, as defined in Section 4.1, is itself a random variable. Section 4.2 details the important properties of the conditional expectation. Section 4.3 provides a representation of the conditional expectation as an orthogonal projection in Hilbert space. Finally, in Section 4.4 we represent the conditional expectation also as the expectation with respect to the random regular conditional probability distribution.

4.1. Conditional expectation: existence and uniqueness

In Subsection 4.1.1 we review the elementary definition of the conditional expectation E(X|Y) in case of discrete valued R.V.s X and Y. This motivates our formal definition of the conditional expectation for any pair of R.V.s such that X is integrable. The existence and uniqueness of the conditional expectation is shown there based on the Radon-Nikodym theorem, the proof of which we provide in Subsection 4.1.2.
4.1.1. Conditional expectation: motivation and definition. Suppose the R.V.s X and Z on a probability space (Ω, F, P) are both simple functions. More precisely, let X take the distinct values x_1, . . . , x_m ∈ R and Z take the distinct values z_1, . . . , z_n ∈ R, where without loss of generality we assume that P(Z = z_i) > 0 for i = 1, . . . , n. Then, from elementary probability theory, we know that for any i = 1, . . . , n, j = 1, . . . , m,

P(X = x_j | Z = z_i) = P(X = x_j, Z = z_i) / P(Z = z_i) ,

and we can compute the corresponding conditional expectation

E[X | Z = z_i] = ∑_{j=1}^m x_j P(X = x_j | Z = z_i) .

Noting that this conditional expectation is a function of ω (via the value of Z(ω)), we define the R.V. Y = E[X|Z] on the same probability space such that Y(ω) = E[X | Z = z_i] whenever ω is such that Z(ω) = z_i.

Example 4.1.1. Suppose that X(ω) = ω_1 and Z(ω) = ω_2 on the probability space Ω = {1, 2}^2, F = 2^Ω, with

P((1, 1)) = 0.5,  P((1, 2)) = 0.1,  P((2, 1)) = 0.1,  P((2, 2)) = 0.3.
Then,

P(X = 1 | Z = 1) = P(X = 1, Z = 1) / P(Z = 1) = 5/6 ,

implying that P(X = 2 | Z = 1) = 1/6 and

E[X | Z = 1] = 1 · (5/6) + 2 · (1/6) = 7/6 .

Likewise, check that E[X | Z = 2] = 7/4, hence E[X|Z] = (7/6) I_{Z=1} + (7/4) I_{Z=2}.
Partitioning Ω into the discrete collection of Z-atoms, namely the sets G_i = {ω : Z(ω) = z_i} for i = 1, . . . , n, observe that Y(ω) is constant on each of these sets. The σ-algebra G = F_Z = σ(Z) = {Z^{-1}(B) : B ∈ B} is in this setting merely the collection of all 2^n possible unions of various Z-atoms. Hence, G is finitely generated, and since Y(ω) is constant on each generator G_i of G, we see that Y(ω) is measurable on (Ω, G). Further, since any G ∈ G is of the form G = ∪_{i∈I} G_i for the disjoint sets G_i and some I ⊆ {1, . . . , n}, we find that

E[Y I_G] = ∑_{i∈I} E[Y I_{G_i}] = ∑_{i∈I} E[X | Z = z_i] P(Z = z_i)
= ∑_{i∈I} ∑_{j=1}^m x_j P(X = x_j, Z = z_i) = E[X I_G] .

To summarize, in case X and Z are simple functions and G = σ(Z), we have Y = E[X|Z] as a R.V. on (Ω, G) such that E[Y I_G] = E[X I_G] for all G ∈ G.
Since both properties make sense for any σ-algebra G and any integrable R.V. X, this suggests the definition of the conditional expectation as given by the following theorem.

Theorem 4.1.2. Given X ∈ L^1(Ω, F, P) and G ⊆ F a σ-algebra, there exists a R.V. Y, called the conditional expectation (C.E.) of X given G, denoted by E[X|G], such that Y ∈ L^1(Ω, G, P) and for any G ∈ G,

(4.1.1) E[(X - Y) I_G] = 0 .

Moreover, if (4.1.1) holds for any G ∈ G and R.V.s Y and Ỹ, both of which are in L^1(Ω, G, P), then P(Y = Ỹ) = 1.

Proof. Starting with the uniqueness, fix ε > 0 and consider the set G_ε = {ω : Y(ω) - Ỹ(ω) > ε} ∈ G, to see that (by linearity of the expectation),

0 = E[X I_{G_ε}] - E[X I_{G_ε}] = E[(Y - Ỹ) I_{G_ε}] ≥ ε P(G_ε) .

Hence, P(G_ε) = 0, and considering G_ε ↑ G_0 as ε ↓ 0, we deduce that P(Y - Ỹ > 0) = 0. The same argument applies with the roles of Y and Ỹ reversed, so P(Y - Ỹ = 0) = 1 as claimed.
We turn to the existence of the C.E., assuming first that X ∈ L^1(Ω, F, P) is also non-negative. Let μ denote the probability measure obtained by restricting P to the measurable space (Ω, G), and ν the measure obtained by restricting XP of Proposition 1.3.56 to this measurable space, noting that ν is a finite measure (since ν(Ω) = (XP)(Ω) = E[X] < ∞). If G ∈ G is such that μ(G) = P(G) = 0, then by definition also ν(G) = (XP)(G) = 0. Therefore, ν is absolutely continuous with respect to μ, and by the Radon-Nikodym theorem there exists Y ∈ mG_+ such that ν = Yμ. This implies that for any G ∈ G,

E[X I_G] = P(X I_G) = ν(G) = (Yμ)(G) = μ(Y I_G) = E[Y I_G]

(and in particular, E[Y] = ν(Ω) < ∞), proving the existence of the C.E. for non-negative R.V.s.

Turning to deal with the case of a general integrable R.V. X, we use the representation X = X_+ - X_-, with X_+ ≥ 0 and X_- ≥ 0 such that both E[X_+] and E[X_-] are finite. Set Y = Y_+ - Y_-, where Y_+ = E[X_+|G] and Y_- = E[X_-|G] exist by the preceding argument. Then, Y ∈ mG is integrable, and by the definition of Y_±, for any G ∈ G,

E[Y I_G] = E[Y_+ I_G] - E[Y_- I_G] = E[X_+ I_G] - E[X_- I_G] = E[X I_G] .

This establishes (4.1.1) and completes the proof of the theorem.
Remark. Beware that for Y = E[X|G] often Y_+ ≠ E[X_+|G].

Exercise 4.1.7. Suppose either E(Y_k)_+ is finite or E(Y_k)_- is finite . . . ∑_{k=1}^n f_k.
(b) Suppose μ and ν are probability measures on S = {(s_1, . . . , s_n) : s_k ∈ S_k, k = 1, . . . , n}. Show that f_k(s_k), k = 1, . . . , n, are mutually independent both under μ and under ν.
4.1.2. Proof of the Radon-Nikodym theorem. This section is devoted to proving the Radon-Nikodym theorem, which we have already used for establishing the existence of the C.E. This is done by proving the more general Lebesgue decomposition, based on the following definition.

Definition 4.1.9. Two measures μ_1 and μ_2 on the same measurable space (S, F) are mutually singular if there is a set A ∈ F such that μ_1(A) = 0 and μ_2(A^c) = 0. This is denoted by μ_1 ⊥ μ_2, and we sometimes state that μ_1 is singular with respect to μ_2, instead of μ_1 and μ_2 being mutually singular.
Equipped with the concept of mutually singular measures, we next state the Lebesgue decomposition and show that the Radon-Nikodym theorem is a direct consequence of this decomposition.

Theorem 4.1.10 (Lebesgue decomposition). Suppose μ and ν are measures on the same measurable space (S, F) such that μ(S) and ν(S) are finite. Then, ν = ν_ac + ν_s, where the measure ν_s is singular with respect to μ and ν_ac = fμ for some f ∈ mF_+. Further, such a decomposition of ν is unique (per given μ).

Remark. To build your intuition, note that the Lebesgue decomposition is quite explicit for finite measures on a countable space S (with F = 2^S). Indeed, then ν_ac and ν_s are the restrictions of ν to the support S_μ = {s ∈ S : μ(s) > 0} of μ and to its complement, respectively, with f(s) = ν(s)/μ(s) for s ∈ S_μ the Radon-Nikodym derivative of ν_ac with respect to μ (see Exercise 1.2.47 for more on the support of a measure).
Proof of the Radon-Nikodym theorem. Assume first that μ(S) and ν(S) are finite. Let ν = ν_ac + ν_s be the unique Lebesgue decomposition induced by μ. Then, by definition there exists a set A ∈ F such that ν_s(A^c) = μ(A) = 0. Further, our assumption that ν ≪ μ implies that ν_s(A) ≤ ν(A) = 0 as well, hence ν_s(S) = 0, i.e. ν = ν_ac = fμ for some f ∈ mF_+.

Next, in case μ and ν are σ-finite measures, the sample space S is a countable union of disjoint sets A_n ∈ F such that both μ(A_n) and ν(A_n) are finite. Considering the measures μ_n = μI_{A_n} and ν_n = νI_{A_n}, such that μ_n(S) = μ(A_n) and ν_n(S) = ν(A_n) are finite, our assumption that ν ≪ μ implies that ν_n ≪ μ_n. Hence, by the preceding argument, for each n there exists f_n ∈ mF_+ such that ν_n = f_n μ_n. With ν = ∑_n ν_n and ν_n = (f_n I_{A_n}) μ (by the composition relation of Proposition 1.3.56), it follows that ν = fμ for f = ∑_n f_n I_{A_n} ∈ mF_+ finite valued.
As for the uniqueness of the Radon-Nikodym derivative f, suppose that fµ = gµ for some g ∈ mF_+ and a σ-finite measure µ. Consider E_n = D_n ∩ {s : g(s) − f(s) ≥ 1/n, g(s) ≤ n} and measurable D_n ↑ S such that µ(D_n) < ∞. Then, necessarily both µ(f I_{E_n}) and µ(g I_{E_n}) are finite, with
   n^{-1} µ(E_n) ≤ µ((g − f) I_{E_n}) = (gµ)(E_n) − (fµ)(E_n) = 0,
implying that µ(E_n) = 0. Considering the union over n = 1, 2, ... we deduce that µ({s : g(s) > f(s)}) = 0, and upon reversing the roles of f and g, also µ({s : g(s) < f(s)}) = 0.
Remark. Following the same argument as in the preceding proof of the Radon-Nikodym theorem, one easily concludes that Lebesgue decomposition applies also for any two σ-finite measures µ and ν.

Our next lemma is the key to the proof of Lebesgue decomposition.

Lemma 4.1.11. If the finite measures ν and µ on (S, F) are not mutually singular, then there exists B ∈ F and ε > 0 such that µ(B) > 0 and ν(A) ≥ ε µ(A) for all A ∈ F such that A ⊆ B.

The proof of this lemma is based on the Hahn-Jordan decomposition of a finite signed measure to its positive and negative parts (for a definition of a finite signed measure see the remark after Definition 1.1.2).
Theorem 4.1.12 (Hahn decomposition). For any finite signed measure ν : F → R there exists D ∈ F such that ν_+ = ν I_D and ν_− = −ν I_{D^c} are measures on (S, F).

See [Bil95, Theorem 32.1] for a proof of the Hahn decomposition as stated here, or [Dud89, Theorem 5.6.1] for the same conclusion in case of a general, that is [−∞, ∞]-valued signed measure, where uniqueness of the Hahn-Jordan decomposition of a signed measure as the difference between the mutually singular measures ν_+ and ν_− is also established.

Proof of Lemma 4.1.11. Let A = ∪_n D_n where D_n, n = 1, 2, ..., is a positive set for the Hahn decomposition of the finite signed measure ν − n^{-1}µ. Since A^c is contained in the negative set D_n^c of ν − n^{-1}µ, it follows that ν(A^c) ≤ n^{-1}µ(A^c).
156 4. CONDITIONAL EXPECTATIONS AND PROBABILITIES
Taking n → ∞ we deduce that ν(A^c) = 0. If µ(D_n) = 0 for all n then µ(A) = 0 and necessarily ν is singular with respect to µ, contradicting the assumptions of the lemma. Therefore, µ(D_n) > 0 for some finite n. Taking ε = n^{-1} and B = D_n results with the thesis of the lemma.
Proof of Lebesgue decomposition. Our goal is to construct f ∈ mF_+ such that the measure ν_s = ν − fµ is singular with respect to µ. Since necessarily ν_s(A) ≥ 0 for any A ∈ F, such a function f must belong to
   H = {h ∈ mF_+ : ν(A) ≥ (hµ)(A) for all A ∈ F}.
Indeed, we take f to be an element of H for which (fµ)(S) is maximal. To show that such f exists note first that H is closed under non-decreasing passages to the limit (by monotone convergence). Further, if h and h̃ are both in H then so is max{h, h̃}, since for any A ∈ F and Ã = A ∩ {h > h̃},
   ν(A) = ν(Ã) + ν(A ∩ Ã^c) ≥ (hµ)(Ã) + (h̃µ)(A ∩ Ã^c) = (max{h, h̃}µ)(A).
That is, H is also closed under the formation of finite maxima and in particular, the function lim_n max(h_1, ..., h_n) is in H for any h_n ∈ H. Now let α = sup{(hµ)(S) : h ∈ H}, noting that α ≤ ν(S) is finite. Choosing h_n ∈ H such that (h_n µ)(S) ≥ α − n^{-1} results with f = lim_n max(h_1, ..., h_n) in H such that (fµ)(S) ≥ lim_n (h_n µ)(S) = α. Since f is an element of H, both ν_ac = fµ and ν_s = ν − fµ are finite measures.

If ν_s fails to be singular with respect to µ then by Lemma 4.1.11 there exists B ∈ F and ε > 0 such that µ(B) > 0 and ν_s(A) ≥ ε(I_B µ)(A) for all A ∈ F. Since ν = ν_s + fµ, this implies that f + εI_B ∈ H. However, ((f + εI_B)µ)(S) ≥ α + εµ(B) > α, contradicting the fact that α is the finite maximal value of (hµ)(S) over h ∈ H. Consequently, this construction of f has ν = fµ + ν_s with a finite measure ν_s that is singular with respect to µ.
Finally, to prove the uniqueness of the Lebesgue decomposition, suppose there exist f_1, f_2 ∈ mF_+ such that both ν − f_1 µ and ν − f_2 µ are singular with respect to µ. That is, there exist A_1, A_2 ∈ F such that µ(A_i) = 0 and (ν − f_i µ)(A_i^c) = 0 for i = 1, 2. Considering A = A_1 ∪ A_2 it follows that µ(A) = 0 and (ν − f_i µ)(A^c) = 0 for i = 1, 2. Consequently, for any E ∈ F we have that (ν − f_1 µ)(E) = ν(E ∩ A) = (ν − f_2 µ)(E), proving the uniqueness of ν_s, and hence of the decomposition of ν as ν_ac + ν_s.

We conclude with a simple application of the Radon-Nikodym theorem in conjunction with Lemma 1.3.8.
Exercise 4.1.13. Suppose µ and ν are two σ-finite measures on the same measurable space (S, F) such that ν(A) ≤ µ(A) for all A ∈ F. Show that if µ(g) = ν(g) is finite for some g ∈ mF such that µ({s : g(s) ≤ 0}) = 0 then µ(·) = ν(·).
4.2. Properties of the conditional expectation
In some generic settings the C.E. is rather explicit. One such example is when X is measurable on the conditioning σ-algebra G.

Example 4.2.1. If X ∈ L^1(Ω, G, P) then Y = X ∈ mG satisfies (4.1.1), so E[X|G] = X. In particular, if X = c is a constant R.V. then E[X|G] = c for any σ-algebra G.
Here is an extension of this example.
Exercise 4.2.2. Suppose that the (Y, 𝒴)-valued random variable Y is measurable on G and the (X, 𝒳)-valued random variable Z is P-independent of G. Show that if φ is measurable on the product space (X × Y, 𝒳 × 𝒴) and φ(Z, Y) is integrable, then E[φ(Z, Y)|G] = g(Y) where g(y) = E[φ(Z, y)].
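For a finite toy case the conclusion of this exercise can be confirmed by direct enumeration; in the sketch below the laws of Y and Z and the function φ(z, y) = zy + z² are illustrative choices, not taken from the text:

```python
from fractions import Fraction

# Finite illustration: for Z independent of Y and phi(z, y) = z*y + z^2,
# E[phi(Z, Y) | sigma(Y)] equals g(Y) with g(y) = E[phi(Z, y)]
# (the laws pY, pZ and the function phi are illustrative choices).
pY = {0: Fraction(1, 4), 1: Fraction(3, 4)}
pZ = {1: Fraction(1, 2), 2: Fraction(1, 2)}

def phi(z, y):
    return z * y + z * z

g = {y: sum(pZ[z] * phi(z, y) for z in pZ) for y in pY}
for y in pY:
    # conditional expectation given {Y = y}, computed from the product law
    ce = sum(pZ[z] * pY[y] * phi(z, y) for z in pZ) / pY[y]
    assert ce == g[y]
```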
Since only constant random variables are measurable on F_0 = {∅, Ω}, by definition of the C.E. clearly E[X|F_0] = EX. We show next that E[X|H] = EX also whenever the conditioning σ-algebra H is independent of σ(X) (and in particular, when H is P-trivial).

Proposition 4.2.3. If X ∈ L^1(Ω, F, P) and the σ-algebra H is independent of σ(σ(X), G), then
   E[X|σ(H, G)] = E[X|G].
For G = {∅, Ω}, this implies that
   H independent of σ(X) ⟹ E[X|H] = EX.

Remark. Recall that a P-trivial σ-algebra H ⊆ F is independent of σ(X) for any X ∈ mF. Hence, by Proposition 4.2.3 in this case E[X|H] = EX for all X ∈ L^1(Ω, F, P).
Proof. Let Y = E[X|G] ∈ mG. Because H is independent of σ(G, σ(X)), it follows that for any G ∈ G and H ∈ H the random variable I_H is independent of both X I_G and Y I_G. Consequently,
   E[X I_{G∩H}] = E[X I_G I_H] = E[X I_G] E I_H
   E[Y I_{G∩H}] = E[Y I_G I_H] = E[Y I_G] E I_H
Further, E[X I_G] = E[Y I_G] by the definition of Y, hence E[X I_A] = E[Y I_A] for any A ∈ 𝒜 = {G ∩ H : G ∈ G, H ∈ H}. Applying Exercise 4.1.3 with Y ∈ L^1(Ω, G, P) ⊆ L^1(Ω, σ(H, G), P) and 𝒜 a π-system of subsets containing Ω and generating σ(G, H), we thus conclude that
   E[X|σ(G, H)] = Y = E[X|G]
as claimed.
We turn to derive various properties of the C.E. operation, starting with its positivity and linearity (per fixed conditioning σ-algebra).

Proposition 4.2.4. Let X ∈ L^1(Ω, F, P) and set Y = E[X|G] for some σ-algebra G ⊆ F. Then,
(a) EX = EY.
(b) (Positivity) X ≥ 0 a.s. ⟹ Y ≥ 0 a.s., and X > 0 a.s. ⟹ Y > 0 a.s.

Proof. Considering G = Ω ∈ G in the definition of the C.E. we find that
   EX = E[X I_G] = E[Y I_G] = EY.
Turning to the positivity of the C.E., note that if X ≥ 0 a.s. then 0 ≤ E[X I_G] = E[Y I_G] ≤ 0 for G = {ω : Y(ω) ≤ 0} ∈ G. Hence, in this case E[Y I_{Y ≤ 0}] = 0, that is, almost surely Y ≥ 0. Further, ε P(X > ε, Y ≤ 0) ≤ E[X I_{X > ε} I_{Y ≤ 0}] ≤ E[X I_{Y ≤ 0}] = 0 for any ε > 0, so P(X > 0, Y = 0) = 0 as well.
We next show that the C.E. operator is linear.
Proposition 4.2.5 (Linearity). Let X, Y ∈ L^1(Ω, F, P) and G ⊆ F a σ-algebra. Then, for any α, β ∈ R,
   E[αX + βY|G] = αE[X|G] + βE[Y|G].

Proof. Let Z = E[X|G] and V = E[Y|G]. Since Z, V ∈ L^1(Ω, F, P) the same applies for αZ + βV. Further, for any G ∈ G, by linearity of the expectation operator and the definition of the C.E.,
   E[(αZ + βV) I_G] = αE[Z I_G] + βE[V I_G] = αE[X I_G] + βE[Y I_G] = E[(αX + βY) I_G],
as claimed.
From its positivity and linearity we immediately get the monotonicity of the C.E.

Corollary 4.2.6 (Monotonicity). If X, Y ∈ L^1(Ω, F, P) are such that X ≤ Y a.s., then E[X|G] ≤ E[Y|G] a.s. for any σ-algebra G ⊆ F.

In the following exercise you are to combine the linearity and positivity of the C.E. with Fubini's theorem.

Exercise 4.2.7. Show that if X, Y ∈ L^1(Ω, F, P) are such that E[X|Y] = Y and E[Y|X] = X, then almost surely X = Y.
Hint: First show that E[(X − Y) I_{X > c ≥ Y}] = 0 for any non-random c.
We next deal with the relationship between the C.E.'s of the same R.V. for nested conditioning σ-algebras.

Proposition 4.2.8 (Tower property). Suppose X ∈ L^1(Ω, F, P) and the σ-algebras H and G are such that H ⊆ G ⊆ F. Then, E[X|H] = E[E(X|G)|H].

Proof. Recall that Y = E[X|G] is integrable, hence Z = E[Y|H] is integrable. Fixing A ∈ H we have that E[Y I_A] = E[Z I_A] by the definition of the C.E. Z. Since H ⊆ G, also A ∈ G, hence E[X I_A] = E[Y I_A] by the definition of the C.E. Y. We deduce that E[X I_A] = E[Z I_A] for all A ∈ H. It then follows from the definition of the C.E. that Z is a version of E[X|H].
Remark. The tower property is also called the law of iterated expectations.

Any σ-algebra G contains the trivial σ-algebra F_0 = {∅, Ω}. Applying the tower property with H = F_0 and using the fact that E[Y|F_0] = EY for any integrable random variable Y, it follows that for any σ-algebra G,
(4.2.1)   EX = E[X|F_0] = E[E(X|G)|F_0] = E[E(X|G)].
Here is an application of the tower property, leading to a stronger conclusion than what one has from Proposition 4.2.3.

Lemma 4.2.9. If an integrable R.V. X and σ-algebra G are such that E[X|G] is independent of X, then E[X|G] = E[X].

Proof. Let Z = E[X|G]. Applying the tower property for H = σ(Z) ⊆ G we have that E[X|H] = E[Z|H]. Clearly, E[Z|H] = Z (see Example 4.2.1), whereas our assumption that X is independent of Z implies that E[X|H] = E[X] (see Proposition 4.2.3). Consequently, Z = E[X], as claimed.
As shown next, we can "take out what is known" when computing the C.E.

Proposition 4.2.10. Suppose Y ∈ mG and X ∈ L^1(Ω, F, P) are such that XY ∈ L^1(Ω, F, P). Then, E[XY|G] = Y E[X|G].
Proof. Let Z = E[X|G], which is well defined due to our assumption that E|X| < ∞. With Y Z ∈ mG and E|XY| < ∞, it suffices to check that
(4.2.2)   E[XY I_A] = E[ZY I_A]
for all A ∈ G. Indeed, if Y = I_B for B ∈ G then Y I_A = I_G for G = B ∩ A ∈ G, so (4.2.2) follows from the definition of the C.E. Z. By linearity of the expectation, this extends to Y which is a simple function on (Ω, G). Recall that for X ≥ 0, by positivity of the C.E. also Z ≥ 0, so by monotone convergence (4.2.2) then applies for all Y ∈ mG_+. In general, let X = X_+ − X_− and Y = Y_+ − Y_− for Y_± ∈ mG_+ and the integrable X_± ≥ 0. Since |XY| = (X_+ + X_−)(Y_+ + Y_−) is integrable, so are the products X_± Y_±, and (4.2.2) holds for each of the four possible choices of the pair (X_±, Y_±), with Z_± = E[X_±|G] in place of Z. Since Z = Z_+ − Z_−, it readily follows that (4.2.2) applies also for X and Y.
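Proposition 4.2.10 admits the same kind of exact finite-space check as the tower property; here Y is constant on the blocks generating G, hence G-measurable (all concrete choices below are illustrative):

```python
from fractions import Fraction

# Exact check of "taking out what is known" on a finite uniform space:
# for Y measurable on G, E[XY|G] = Y E[X|G] (all choices are illustrative).
omega = list(range(6))
blocks_G = [[0, 1], [2, 3], [4, 5]]
X = {w: Fraction(3 * w - 4) for w in omega}
Y = {}
for v, B in zip([2, -1, 5], blocks_G):   # Y constant on G-blocks, so Y is in mG
    for w in B:
        Y[w] = Fraction(v)

def cond_exp(Z, blocks):
    out = {}
    for B in blocks:
        avg = sum(Z[w] for w in B) / len(B)
        for w in B:
            out[w] = avg
    return out

XY = {w: X[w] * Y[w] for w in omega}
left = cond_exp(XY, blocks_G)                               # E[XY|G]
CE = cond_exp(X, blocks_G)
right = {w: Y[w] * CE[w] for w in omega}                    # Y E[X|G]
assert left == right
```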
Adopting hereafter the notation P(A|G) for E[I_A|G], the following exercises illustrate some of the many applications of Propositions 4.2.8 and 4.2.10.
Exercise 4.2.11. For any σ-algebras G_i ⊆ F, i = 1, 2, 3, let G_{ij} = σ(G_i, G_j) and prove that the following conditions are equivalent:
(a) P[A_3|G_{12}] = P[A_3|G_2] for all A_3 ∈ G_3.
(b) P[A_1 ∩ A_3|G_2] = P[A_1|G_2] P[A_3|G_2] for all A_1 ∈ G_1 and A_3 ∈ G_3.
(c) P[A_1|G_{23}] = P[A_1|G_2] for all A_1 ∈ G_1.
Remark. Taking G_1 = σ(X_k, k < n), G_2 = σ(X_n) and G_3 = σ(X_k, k > n), condition (a) of the preceding exercise states that the sequence of random variables {X_k} has the Markov property. That is, the conditional probability of a future event A_3 given the past and present information G_{12} is the same as its conditional probability given the present G_2 alone. Condition (c) makes the same statement, but with time reversed, while condition (b) says that past and future events A_1 and A_3 are conditionally independent given the present information, that is, G_2.
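For a finite-state chain, the conditional independence in condition (b) can be checked exactly when conditioning on σ(X_2), which is generated by a finite partition; the initial law and transition kernel below are illustrative choices:

```python
from fractions import Fraction
from itertools import product

# Exact check of condition (b) for a 3-step {0,1}-valued Markov chain:
# given the present {X2 = x}, the past event {X1 = a} and future event
# {X3 = b} are independent (initial law and kernel are illustrative choices).
p0 = {0: Fraction(1, 3), 1: Fraction(2, 3)}                 # law of X1
P = {0: {0: Fraction(3, 4), 1: Fraction(1, 4)},
     1: {0: Fraction(1, 2), 1: Fraction(1, 2)}}             # transition kernel

joint = {(a, x, b): p0[a] * P[a][x] * P[x][b]
         for a, x, b in product([0, 1], repeat=3)}

for x in [0, 1]:
    px = sum(p for (a, y, b), p in joint.items() if y == x)  # P(X2 = x)
    for a, b in product([0, 1], repeat=2):
        p_ab = joint[(a, x, b)] / px
        p_a = sum(p for (a2, y, b2), p in joint.items() if a2 == a and y == x) / px
        p_b = sum(p for (a2, y, b2), p in joint.items() if y == x and b2 == b) / px
        assert p_ab == p_a * p_b
```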
Exercise 4.2.12. Let Z = (X, Y) be a uniformly chosen point in (0, 1)^2. That is, X and Y are independent random variables, each having the U(0, 1) measure of Example 1.1.26. Set T = 2 I_A(Z) + 10 I_B(Z) + 4 I_C(Z) where A = {(x, y) : 0 < x < 1/4, 3/4 < y < 1}, B = {(x, y) : 1/4 < x < 3/4, 0 < y < 1/2} and C = {(x, y) : 3/4 < x < 1, 1/4 < y < 1}.
(a) Find an explicit formula for the conditional expectation W = E(T|X) and use it to determine the conditional expectation U = E(TX|X).
(b) Find the value of E[(T − W) sin(e^X)].
Exercise 4.2.13. Fixing a positive integer k, compute E(X|Y) in case Y = kX − [kX] for X having the U(0, 1) measure of Example 1.1.26 (and where [x] denotes the integer part of x).
Exercise 4.2.14. Fixing t ∈ R and an integrable random variable X, let Y = max(X, t) and Z = min(X, t). Setting a_t = E[X|X ≤ t] and b_t = E[X|X ≥ t], show that
   E[X|Y] = Y I_{Y > t} + a_t I_{Y = t}  and  E[X|Z] = Z I_{Z < t} + b_t I_{Z = t}.
Exercise 4.2.15. Let X, Y be i.i.d. random variables. Suppose ξ is independent of (X, Y), with P(ξ = 1) = p, P(ξ = 0) = 1 − p. Let Z = (Z_1, Z_2) where Z_1 = ξX + (1 − ξ)Y and Z_2 = ξY + (1 − ξ)X.
(a) Prove that Z and ξ are independent.
(b) Obtain an explicit expression for E[g(X, Y)|Z], in terms of Z_1 and Z_2, where g : R^2 → R is a bounded Borel function.
Exercise 4.2.16. Suppose EX^2 < ∞ and define Var(X|G) = E[(X − E(X|G))^2 | G].
(a) Show that E[Var(X|G_2)] ≤ E[Var(X|G_1)] for any two σ-algebras G_1 ⊆ G_2 (that is, the dispersion of X about its conditional mean decreases as the σ-algebra grows).
(b) Show that for any σ-algebra G,
   Var[X] = E[Var(X|G)] + Var[E(X|G)].
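The variance decomposition of part (b) can be verified exactly on a finite uniform space (the partition and the values of X below are illustrative choices):

```python
from fractions import Fraction

# Exact check of Var[X] = E[Var(X|G)] + Var[E(X|G)] on a finite uniform
# space (space, partition and X are illustrative choices).
omega = list(range(6))
blocks_G = [[0, 1, 2], [3, 4, 5]]
X = {w: Fraction([1, 4, 1, 0, 2, 7][w]) for w in omega}

def E(Z):
    return sum(Z.values()) / len(omega)

def cond_exp(Z, blocks):
    out = {}
    for B in blocks:
        avg = sum(Z[w] for w in B) / len(B)
        for w in B:
            out[w] = avg
    return out

CE = cond_exp(X, blocks_G)                                        # E[X|G]
cond_var = cond_exp({w: (X[w] - CE[w]) ** 2 for w in omega}, blocks_G)
mX = E(X)
var_X = E({w: (X[w] - mX) ** 2 for w in omega})
mCE = E(CE)
var_CE = E({w: (CE[w] - mCE) ** 2 for w in omega})
assert var_X == E(cond_var) + var_CE
```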
Exercise 4.2.17. Suppose N is a nonnegative, integer valued R.V. which is independent of the independent, integrable R.V.'s ξ_i on the same probability space, and that Σ_i P(N ≥ i) E|ξ_i| is finite.
(a) Check that
   X(ω) = Σ_{i=1}^{N(ω)} ξ_i(ω)
is integrable and deduce that EX = Σ_i P(N ≥ i) Eξ_i.
(b) Suppose in addition that the ξ_i are identically distributed, in which case this is merely Wald's identity EX = EN Eξ_1. Show that if both ξ_1 and N are square-integrable, then so is X and
   Var(X) = Var(ξ_1) EN + Var(N) (Eξ_1)^2.
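Wald's identity EX = EN Eξ_1 from part (b) can be confirmed by exhaustive enumeration over a small discrete model (the laws of N and of the ξ_i below are illustrative choices):

```python
from fractions import Fraction
from itertools import product

# Exact verification of Wald's identity EX = EN * E(xi_1) by enumeration,
# with N independent of i.i.d. xi_i (all distributions are illustrative).
pN = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}   # law of N
pXi = {1: Fraction(1, 3), 5: Fraction(2, 3)}                     # law of each xi_i

EX = Fraction(0)
for n, pn in pN.items():
    for xs in product(pXi, repeat=n):        # all values of (xi_1, ..., xi_n)
        p = pn
        for x in xs:
            p *= pXi[x]
        EX += p * sum(xs)

EN = sum(n * p for n, p in pN.items())
Exi = sum(x * p for x, p in pXi.items())
assert EX == EN * Exi
```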
Suppose XY, X and Y are integrable. Combining Proposition 4.2.10 and (4.2.1), convince yourself that if E[X|Y] = EX then E[XY] = EX EY. Recall that if X and Y are independent and integrable then E[X|Y] = EX (c.f. Proposition 4.2.3). As you show next, the converse implications are false and further, one cannot dispense with the nesting relationship between the two σ-algebras in the tower property.
Exercise 4.2.18. Provide examples of X, Y ∈ {−1, 0, 1} such that
(a) E[XY] = EX EY but E[X|Y] ≠ EX.
(b) E[X|Y] = EX but X is not independent of Y.
(c) For Ω = {1, 2, 3} and F_i = σ({i}), i = 1, 2, 3,
   E[E(X|F_1)|F_2] ≠ E[E(X|F_2)|F_1].
As shown in the sequel, per fixed conditioning σ-algebra we can interpret the C.E. as an expectation in a different (conditional) probability space. Indeed, every property of the expectation has a corresponding extension to the C.E. For example, the extension of Jensen's inequality is

Proposition 4.2.19 (Jensen's inequality). Suppose g(·) is a convex function on an open interval G of R, that is,
   λg(x) + (1 − λ)g(y) ≥ g(λx + (1 − λ)y)   ∀x, y ∈ G, 0 ≤ λ ≤ 1.
If X is an integrable R.V. with P(X ∈ G) = 1 and g(X) is also integrable, then almost surely E[g(X)|H] ≥ g(E[X|H]) for any σ-algebra H.

Proof. Recall our derivation of (1.3.3) showing that
   g(x) ≥ g(c) + (D_−g)(c)(x − c)   ∀c, x ∈ G.
Taking c over a countable dense subset of G, this yields g(x) = sup_n {a_n x + b_n} on G for suitable constants a_n, b_n. Hence, by monotonicity and linearity of the C.E., almost surely E[g(X)|H] ≥ a_n E[X|H] + b_n for each n, and considering the supremum over the countably many values of n we conclude that almost surely E[g(X)|H] ≥ sup_n {a_n E[X|H] + b_n} = g(E[X|H]).

Theorem 4.2.24 (Monotone convergence for C.E.). If X_m ≥ 0 are such that X_m ↑ X_∞ and EX_∞ < ∞, then E[X_m|G] ↑ E[X_∞|G] almost surely.
Proof. Let Y_m = E[X_m|G] ∈ mG_+. By monotonicity of the C.E. we have that the sequence Y_m is a.s. non-decreasing, hence it has a limit Y_∞ ∈ mG_+ (possibly infinite). We complete the proof by showing that Y_∞ is a version of E[X_∞|G]. Indeed, for any G ∈ G,
   E[Y_∞ I_G] = lim_m E[Y_m I_G] = lim_m E[X_m I_G] = E[X_∞ I_G],
where the two outer equalities hold by monotone convergence (since Y_m ↑ Y_∞ and X_m ↑ X_∞) and the middle one by the definition of Y_m, while X_∞ is integrable. In conclusion, E[X_m|G] = Y_m ↑ Y_∞ = E[X_∞|G], as claimed.
Lemma 4.2.25 (Fatou's lemma for C.E.). If the nonnegative, integrable X_n on the same measurable space (Ω, F) are such that liminf_n X_n is integrable, then for any σ-algebra G ⊆ F,
   E[liminf_n X_n | G] ≤ liminf_n E[X_n|G]   a.s.

Proof. Applying the monotone convergence theorem for the C.E. of the non-decreasing sequence of nonnegative R.V.'s Z_n = inf{X_k : k ≥ n} (whose limit is the integrable liminf_n X_n), results with
(4.2.3)   E[liminf_n X_n | G] = E(lim_n Z_n | G) = lim_n E[Z_n|G]   a.s.
Since Z_n ≤ X_n it follows that E[Z_n|G] ≤ E[X_n|G] for all n and
(4.2.4)   lim_n E[Z_n|G] = liminf_n E[Z_n|G] ≤ liminf_n E[X_n|G]   a.s.
Upon combining (4.2.3) and (4.2.4) we obtain the thesis of the lemma.
Fatou's lemma leads to the C.E. version of the dominated convergence theorem.

Theorem 4.2.26 (Dominated convergence for C.E.). If sup_m |X_m| is integrable and X_m → X_∞ a.s., then E[X_m|G] → E[X_∞|G] a.s.

Proof. Let Y = sup_m |X_m| and Z_m = Y − X_m ≥ 0. Applying Fatou's lemma for the C.E. of the nonnegative, integrable R.V.'s Z_m ≤ 2Y, we see that
   E[liminf_m Z_m | G] ≤ liminf_m E[Z_m|G]   a.s.
Since X_m converges, by the linearity of the C.E. and integrability of Y this is equivalent to
   E[lim_m X_m | G] ≥ limsup_m E[X_m|G]   a.s.
Applying the same argument for the nonnegative, integrable R.V.'s W_m = Y + X_m results with
   E[lim_m X_m | G] ≤ liminf_m E[X_m|G]   a.s.
We thus conclude that a.s. the liminf and limsup of the sequence E[X_m|G] coincide and are equal to E[X_∞|G], as stated.
Exercise 4.2.27. Let X_1, X_2 be random variables defined on the same probability space (Ω, F, P) and G ⊆ F a σ-algebra. Prove that (a), (b) and (c) below are equivalent.
(a) For any Borel sets B_1 and B_2,
   P(X_1 ∈ B_1, X_2 ∈ B_2 | G) = P(X_1 ∈ B_1 | G) P(X_2 ∈ B_2 | G).
(b) For any bounded Borel functions h_1 and h_2,
   E[h_1(X_1) h_2(X_2) | G] = E[h_1(X_1) | G] E[h_2(X_2) | G].
(c) For any bounded Borel function h,
   E[h(X_1) | σ(G, σ(X_2))] = E[h(X_1) | G].

Definition 4.2.28. If one of the equivalent conditions of Exercise 4.2.27 holds, we say that X_1 and X_2 are conditionally independent given G.
Exercise 4.2.29. Suppose that X and Y are conditionally independent given σ(Z) and that X and Z are conditionally independent given H, where H ⊆ σ(Z). Prove that then X and Y are conditionally independent given H.
Our next result shows that the C.E. operation is continuous with respect to L^q convergence.

Theorem 4.2.30. Suppose X_n →L^q X_∞. That is, X_n, X_∞ ∈ L^q(Ω, F, P) are such that E(|X_n − X_∞|^q) → 0. Then, E[X_n|G] →L^q E[X_∞|G] for any σ-algebra G ⊆ F.

Proof. By linearity of the C.E., Jensen's inequality for the convex function g(x) = |x|^q, and the tower property,
   E[|E(X_n|G) − E(X_∞|G)|^q] = E[|E(X_n − X_∞|G)|^q] ≤ E[E(|X_n − X_∞|^q | G)] = E[|X_n − X_∞|^q] → 0,
by our hypothesis, yielding the thesis of the theorem.
As you will show, the C.E. operation is also continuous with respect to the following topology of weak L^q convergence.

Definition 4.2.31. We say that X_n converges to X_∞ weakly in L^q, denoted X_n →wL^q X_∞, if X_n, X_∞ ∈ L^q and E[(X_n − X_∞)Z] → 0 for each Z ∈ L^p with p^{-1} + q^{-1} = 1.

Exercise. Show that if X_n →wL^q X_∞, then E[X_n|G] →wL^q E[X_∞|G]. Suppose next that Y_n → Y_∞ in probability in (Ω, F, P) when n → ∞ and the Y_n are uniformly integrable.
(a) Show that E[Y_n|G] →L^1 E[Y_∞|G].
4.3. The conditional expectation as an orthogonal projection
It readily follows from our next proposition that for X ∈ L^2(Ω, F, P) and σ-algebras G ⊆ F the C.E. Y = E[X|G] is the unique Y ∈ L^2(Ω, G, P) such that
(4.3.1)   ‖X − Y‖_2 = inf{‖X − W‖_2 : W ∈ L^2(Ω, G, P)}.
Proposition 4.3.1. For any X ∈ L^2(Ω, F, P) and σ-algebras G ⊆ F, a R.V. Y ∈ L^2(Ω, G, P) is optimal in the sense of (4.3.1) if and only if it satisfies the orthogonality relations
(4.3.2)   E[(X − Y)Z] = 0 for all Z ∈ L^2(Ω, G, P).
Further, any such R.V. Y is a version of E[X|G].

Proof. If Y ∈ L^2(Ω, G, P) satisfies (4.3.1) then considering W = Y + λZ it follows that for any Z ∈ L^2(Ω, G, P) and λ ∈ R,
   0 ≤ ‖X − Y − λZ‖_2^2 − ‖X − Y‖_2^2 = λ^2 EZ^2 − 2λ E[(X − Y)Z].
By elementary calculus, this inequality holds for all λ ∈ R if and only if E[(X − Y)Z] = 0. Conversely, suppose Y ∈ L^2(Ω, G, P) satisfies (4.3.2) and fix W ∈ L^2(Ω, G, P). Then, considering (4.3.2) for Z = W − Y we see that
   ‖X − W‖_2^2 = ‖X − Y‖_2^2 − 2E[(X − Y)(W − Y)] + ‖W − Y‖_2^2 ≥ ‖X − Y‖_2^2,
so necessarily Y satisfies (4.3.1). Finally, since I_G ∈ L^2(Ω, G, P) for any G ∈ G, if Y satisfies (4.3.2) then it also satisfies the identity (4.1.1) which characterizes the C.E. E[X|G].
Example 4.3.2. If G = σ(A_1, ..., A_n) for finite n and disjoint sets A_i such that P(A_i) > 0 for i = 1, ..., n, then L^2(Ω, G, P) consists of all variables of the form W = Σ_{i=1}^n v_i I_{A_i}, v_i ∈ R. A R.V. Y of this form satisfies (4.3.1) if and only if the corresponding v_i minimize
   E[(X − Σ_{i=1}^n v_i I_{A_i})^2] − EX^2 = Σ_{i=1}^n P(A_i) v_i^2 − 2 Σ_{i=1}^n v_i E[X I_{A_i}],
which amounts to v_i = E[X I_{A_i}]/P(A_i). In particular, if Z = Σ_{i=1}^n z_i I_{A_i} for distinct z_i's, then σ(Z) = G and we thus recover our first definition of the C.E.
   E[X|Z] = Σ_{i=1}^n (E[X I_{Z=z_i}]/P(Z = z_i)) I_{Z=z_i}.
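The optimality of v_i = E[X I_{A_i}]/P(A_i) in this example can be probed numerically by comparing the squared error at these values against perturbed choices (the finite space and X below are illustrative):

```python
from fractions import Fraction
import random

# Check of Example 4.3.2: over W = sum_i v_i I_{A_i}, the squared distance
# E[(X - W)^2] is minimized at v_i = E[X I_{A_i}]/P(A_i) (illustrative setup).
omega = list(range(6))
blocks = [[0, 1], [2, 3, 4], [5]]          # disjoint A_i generating G
X = {w: Fraction([3, -1, 0, 2, 2, 8][w]) for w in omega}

def sq_err(v):  # E[(X - sum_i v_i I_{A_i})^2] under the uniform P
    tot = Fraction(0)
    for v_i, B in zip(v, blocks):
        tot += sum((X[w] - v_i) ** 2 for w in B)
    return tot / len(omega)

v_opt = [sum(X[w] for w in B) / len(B) for B in blocks]  # E[X I_{A_i}] / P(A_i)

random.seed(0)
for _ in range(100):   # any perturbation does at least as badly
    v = [v_i + Fraction(random.randint(-5, 5)) for v_i in v_opt]
    assert sq_err(v) >= sq_err(v_opt)
```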
As shown in the sequel, using (4.3.1) as an alternative characterization of the C.E. of X ∈ L^2(Ω, F, P), we can prove the existence of the C.E. without invoking the Radon-Nikodym theorem. We start by defining the relevant concepts from the theory of Hilbert spaces on which this approach is based.
Definition 4.3.3. A linear vector space is a set H that is closed under operations of addition and multiplication by (real-valued) scalars. That is, if h_1, h_2 ∈ H then h_1 + h_2 ∈ H and αh ∈ H for all α ∈ R, where α(h_1 + h_2) = αh_1 + αh_2, (α + β)h = αh + βh, α(βh) = (αβ)h and 1h = h. A normed vector space is a linear vector space H equipped with a norm ‖·‖, that is, a nonnegative function on H such that ‖αh‖ = |α|‖h‖ for all α ∈ R and d(h_1, h_2) = ‖h_1 − h_2‖ is a metric on H.

Definition 4.3.4. A sequence h_n in a normed vector space is called a Cauchy sequence if sup_{k,m ≥ n} ‖h_k − h_m‖ → 0 as n → ∞, and we say that h_n converges to h_∞ ∈ H if ‖h_n − h_∞‖ → 0 as n → ∞. A Banach space is a normed vector space in which every Cauchy sequence converges.
Building on the preceding, we define the concept of inner product and the corresponding Hilbert spaces and subspaces.

Definition 4.3.5. A Hilbert space is a Banach space H whose norm is of the form (h, h)^{1/2} for a bilinear, symmetric function (h_1, h_2) : H × H → R such that (h, h) ≥ 0, and we call such (h_1, h_2) an inner product for H. A subset K of a Hilbert space which is closed under addition and under multiplication by a scalar is called a Hilbert subspace if every Cauchy sequence h_n ∈ K has a limit in K.

Here are two elementary properties of inner products we use in the sequel.

Exercise 4.3.6. Let ‖h‖ = (h, h)^{1/2} with (h_1, h_2) an inner product for a linear vector space H. Show that Schwarz's inequality
   (u, v)^2 ≤ ‖u‖^2 ‖v‖^2,
and the parallelogram law ‖u + v‖^2 + ‖u − v‖^2 = 2‖u‖^2 + 2‖v‖^2 hold for any u, v ∈ H.
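Both identities of this exercise hold in particular for the Euclidean inner product on R^n, which gives a quick numerical sanity check (dimension and sampling below are arbitrary choices):

```python
import random

# Numerical sanity check of Schwarz's inequality and the parallelogram law
# for the Euclidean inner product on R^5 (an instance of Exercise 4.3.6).
random.seed(1)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

for _ in range(1000):
    u = [random.uniform(-1, 1) for _ in range(5)]
    v = [random.uniform(-1, 1) for _ in range(5)]
    # Schwarz: (u, v)^2 <= ||u||^2 ||v||^2 (tiny slack for floating point)
    assert inner(u, v) ** 2 <= inner(u, u) * inner(v, v) + 1e-12
    s = [a + b for a, b in zip(u, v)]
    d = [a - b for a, b in zip(u, v)]
    # Parallelogram law: ||u+v||^2 + ||u-v||^2 = 2||u||^2 + 2||v||^2
    assert abs(inner(s, s) + inner(d, d) - 2 * inner(u, u) - 2 * inner(v, v)) < 1e-9
```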
Our next proposition shows that for each finite q ≥ 1 the space L^q(Ω, F, P) is a Banach space for the norm ‖·‖_q, the usual addition of R.V.'s and the multiplication of a R.V. X(ω) by a non-random (scalar) constant. Further, L^2(Ω, G, P) is a Hilbert subspace of L^2(Ω, F, P) for any σ-algebras G ⊆ F.
Proposition 4.3.7. Upon identifying R-valued R.V.'s which are equal with probability one as being in the same equivalence class, for each q ≥ 1 and σ-algebra F, the space L^q(Ω, F, P) is a Banach space for the norm ‖·‖_q. Further, L^2(Ω, G, P) is then a Hilbert subspace of L^2(Ω, F, P) for the inner product (X, Y) = EXY and any σ-algebras G ⊆ F.
Proof. Fixing q ≥ 1, we identify X and Y such that P(X ≠ Y) = 0 as being the same element of L^q(Ω, F, P). The resulting set of equivalence classes is a normed vector space. Indeed, both ‖·‖_q, the addition of R.V.'s and the multiplication by a non-random scalar are compatible with this equivalence relation. Further, if X, Y ∈ L^q(Ω, F, P) then ‖αX‖_q = |α|‖X‖_q < ∞ for all α ∈ R, and by Minkowski's inequality ‖X + Y‖_q ≤ ‖X‖_q + ‖Y‖_q < ∞. Consequently, L^q(Ω, F, P) is closed under the operations of addition and multiplication by a non-random scalar, with ‖·‖_q a norm on this collection of equivalence classes.
Suppose next that X_n ∈ L^q is a Cauchy sequence for ‖·‖_q. Then, by definition, there exist k_n ↑ ∞ such that ‖X_r − X_s‖_q^q < 2^{-n(q+1)} for all r, s ≥ k_n. Observe that by Markov's inequality
   P(|X_{k_{n+1}} − X_{k_n}| ≥ 2^{-n}) ≤ 2^{nq} ‖X_{k_{n+1}} − X_{k_n}‖_q^q < 2^{-n},
and consequently the sequence P(|X_{k_{n+1}} − X_{k_n}| ≥ 2^{-n}) is summable. By Borel-Cantelli I it follows that Σ_n |X_{k_{n+1}}(ω) − X_{k_n}(ω)| is finite with probability one, in which case clearly
   X_{k_n} = X_{k_1} + Σ_{i=1}^{n−1} (X_{k_{i+1}} − X_{k_i})
converges to a finite limit X(ω). Next let X = limsup_n X_{k_n} (which per Theorem 1.2.22 is an R-valued R.V.). Then, fixing n and r ≥ k_n, for any t ≥ n,
   E[|X_r − X_{k_t}|^q] = ‖X_r − X_{k_t}‖_q^q ≤ 2^{-nq},
so that by the a.s. convergence of X_{k_t} to X and Fatou's lemma
   E|X_r − X|^q = E[lim_t |X_r − X_{k_t}|^q] ≤ liminf_t E|X_r − X_{k_t}|^q ≤ 2^{-nq}.
This inequality implies that X_r − X ∈ L^q and hence also X ∈ L^q. As r → ∞ so does n, and we can further deduce from the preceding inequality that X_r →L^q X.

Recall that |EXY| ≤ √(EX^2) √(EY^2) by the Cauchy-Schwarz inequality. Thus, the bilinear, symmetric function (X, Y) = EXY on L^2 × L^2 is real-valued and compatible with our equivalence relation. As ‖X‖_2^2 = (X, X), the Banach space L^2(Ω, F, P) is a Hilbert space with respect to this inner product.

Finally, observe that for any σ-algebra G ⊆ F the subset L^2(Ω, G, P) of the Hilbert space L^2(Ω, F, P) is closed under addition of R.V.'s and multiplication by a non-random constant. Further, as shown before, the L^2 limit of a Cauchy sequence X_n ∈ L^2(Ω, G, P) is limsup_n X_{k_n}, which also belongs to L^2(Ω, G, P). Hence, the latter is a Hilbert subspace of L^2(Ω, F, P).
Remark. With minor notational modifications, this proof shows that for any measure µ on (S, F) and q ≥ 1 finite, the set L^q(S, F, µ) of µ-a.e. equivalence classes of R-valued, measurable functions f such that µ(|f|^q) < ∞, is a Banach space. This is merely a special case of a general extension of this property, corresponding to Y = R in your next exercise.
Exercise 4.3.8. For q ≥ 1 finite and a given Banach space (Y, ‖·‖), consider the space L^q(S, F, µ; Y) of all µ-a.e. equivalence classes of functions f : S → Y, measurable with respect to the Borel σ-algebra induced on Y by ‖·‖ and such that µ(‖f(·)‖^q) < ∞.
(a) Show that ‖f‖_q = µ(‖f(·)‖^q)^{1/q} makes L^q(S, F, µ; Y) into a Banach space.
(b) For future applications of the preceding, verify that the space Y = C_b(T) of bounded, continuous real-valued functions on a topological space T is a Banach space for the supremum norm ‖f‖ = sup{|f(t)| : t ∈ T}.
Your next exercise extends Proposition 4.3.7 to the collection L^∞(Ω, F, P) of equivalence classes of bounded random variables.

Exercise 4.3.9. Fixing a probability space (Ω, F, P), prove the following facts:
(a) ‖X‖_∞ = inf{M : P(|X| ≤ M) = 1} is a norm on L^∞(Ω, F, P), for which it is a Banach space.
(b) ‖X‖_q ↑ ‖X‖_∞ as q → ∞, for any X ∈ L^∞(Ω, F, P).
(c) If E[|X|^q] < ∞ for some q > 0 then E|X|^q → P(|X| > 0) as q ↓ 0.
(d) The collection SF of simple functions is dense in L^q(Ω, F, P) for any 1 ≤ q ≤ ∞.
(e) The collection C_b(R) of bounded, continuous real-valued functions is dense in L^q(R, B, λ), for λ Lebesgue measure and any q ≥ 1 finite.
Hint: The (bounded) monotone class theorem might be handy.
In view of Proposition 4.3.7, the existence of the C.E. of X ∈ L^2 which satisfies (4.3.1), or the equivalent condition (4.3.2), is a special instance of the following fundamental geometric property of Hilbert spaces.
Theorem 4.3.10 (Orthogonal projection). Given h ∈ H and a Hilbert subspace G of H, let d = inf{‖h − g‖ : g ∈ G}. Then, there exists a unique ĥ ∈ G, called the orthogonal projection of h on G, such that d = ‖h − ĥ‖; it is also the unique element of G such that (h − ĥ, f) = 0 for all f ∈ G.
Proof. We start with the existence of ĥ ∈ G such that d = ‖h − ĥ‖. To this end, let g_n ∈ G be such that ‖h − g_n‖ → d. Applying the parallelogram law for u = h − (g_m + g_k)/2 and v = (g_m − g_k)/2 we find that
   ‖h − g_k‖^2 + ‖h − g_m‖^2 = 2‖h − (g_m + g_k)/2‖^2 + 2‖(g_m − g_k)/2‖^2 ≥ 2d^2 + ‖g_m − g_k‖^2/2,
since (g_m + g_k)/2 ∈ G. Taking k, m → ∞, both ‖h − g_k‖^2 and ‖h − g_m‖^2 approach d^2 and hence by the preceding inequality ‖g_m − g_k‖ → 0. In conclusion, g_n is a Cauchy sequence in the Hilbert subspace G, which thus converges to some ĥ ∈ G. Recall that ‖h − ĥ‖ ≤ ‖h − g_n‖ + ‖g_n − ĥ‖, whose right-hand side converges to d; since also ‖h − ĥ‖ ≥ d (as ĥ ∈ G), we conclude that d = ‖h − ĥ‖.

Next, suppose there exist g_1, g_2 ∈ G such that (h − g_i, f) = 0 for i = 1, 2 and all f ∈ G. Then, by linearity of the inner product, (g_1 − g_2, f) = 0 for all f ∈ G. Considering f = g_1 − g_2 ∈ G we see that (g_1 − g_2, g_1 − g_2) = ‖g_1 − g_2‖^2 = 0, so necessarily g_1 = g_2.

We complete the proof by showing that ĥ ∈ G is such that ‖h − ĥ‖^2 ≤ ‖h − g‖^2 for all g ∈ G if and only if (h − ĥ, f) = 0 for all f ∈ G. Indeed, for any f ∈ G and λ ∈ R,
   ‖h − ĥ − λf‖^2 − ‖h − ĥ‖^2 = λ^2 ‖f‖^2 − 2λ(h − ĥ, f).
We arrive at the stated conclusion upon noting that, fixing f, this function of λ is nonnegative for all λ if and only if (h − ĥ, f) = 0.
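Theorem 4.3.10 can be illustrated in R^3, where the projection onto a two-dimensional subspace is computed by Gram-Schmidt and both characterizations (orthogonality and minimal distance) are checked; the vectors h, f1, f2 below are illustrative choices:

```python
# Sketch of Theorem 4.3.10 in R^3: the orthogonal projection of h onto the
# subspace G spanned by f1, f2 minimizes the distance to h and satisfies the
# orthogonality relations (the vectors below are illustrative choices).
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

h = [1.0, 2.0, 3.0]
f1 = [1.0, 0.0, 0.0]
f2 = [0.0, 1.0, 1.0]

# Gram-Schmidt: orthogonal basis (e1, e2) of G, then expand h on it
e1 = f1
c = inner(f2, e1) / inner(e1, e1)
e2 = [b - c * a for a, b in zip(e1, f2)]
proj = [inner(h, e1) / inner(e1, e1) * a + inner(h, e2) / inner(e2, e2) * b
        for a, b in zip(e1, e2)]

# (i) h - proj is orthogonal to every spanning vector of G
resid = [x - p for x, p in zip(h, proj)]
assert abs(inner(resid, f1)) < 1e-9
assert abs(inner(resid, f2)) < 1e-9
# (ii) proj is at least as close to h as any other g = s*f1 + t*f2 on a grid
d2 = inner(resid, resid)
for s in range(-10, 11):
    for t in range(-10, 11):
        g = [s * a + t * b for a, b in zip(f1, f2)]
        diff = [x - y for x, y in zip(h, g)]
        assert inner(diff, diff) >= d2 - 1e-9
```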
Applying Theorem 4.3.10 for the Hilbert subspace G = L^2(Ω, G, P) of L^2(Ω, F, P) (see Proposition 4.3.7), you have the existence of a unique Y ∈ L^2(Ω, G, P) satisfying (4.3.2) for each nonnegative X ∈ L^2.
Exercise 4.3.11. Show that for any nonnegative integrable X, not necessarily in L^2, the sequence Y_n ∈ G corresponding to X_n = min(X, n) is non-decreasing and that its limit Y satisfies (4.1.1). Verify that this allows you to prove Theorem 4.1.2 without ever invoking the Radon-Nikodym theorem.
Exercise 4.3.12. Suppose G ⊆ F is a σ-algebra.
(a) Show that for any X ∈ L^1(Ω, F, P) there exists some G ∈ G such that
   E[X I_G] = sup_{A ∈ G} E[X I_A].
Any G with this property is called G-optimal for X.
(b) Show that Y = E[X|G] almost surely, if and only if for any r ∈ R, the event {ω : Y(ω) > r} is G-optimal for the random variable (X − r).
Here is an alternative proof of the existence of E[X|G] for nonnegative X ∈ L^2 which avoids the orthogonal projection, as well as the Radon-Nikodym theorem (the general case then follows as in Exercise 4.3.11).

Exercise 4.3.13. Suppose X ∈ L^2(Ω, F, P) is nonnegative. Assume first that the σ-algebra G ⊆ F is countably generated. That is, G = σ(B_1, B_2, ...) for some B_k ∈ F.
(a) Let Y_n = E[X|G_n] for the finitely generated G_n = σ(B_k, k ≤ n) (for its existence, see Example 4.3.2). Show that Y_n is a Cauchy sequence in L^2(Ω, G, P), hence it has a limit Y ∈ L^2(Ω, G, P).
(b) Show that Y = E[X|G].
Hint: Theorem 2.2.10 might be of some help.
Assume now that G is not countably generated.
(c) Let H_1 ⊆ H_2 be finite σ-algebras. Show that
   E[E(X|H_1)^2] ≤ E[E(X|H_2)^2].
(d) Let γ = sup E[E(X|H)^2], where the supremum is over all finite σ-algebras H ⊆ G. Show that γ is finite, and that there exists an increasing sequence of finite σ-algebras H_n such that E[E(X|H_n)^2] → γ as n → ∞.
(e) Let H_∞ = σ(∪_n H_n) and Y_n = E[X|H_n] for the H_n in part (d). Explain why your proof of part (b) implies the L^2 convergence of Y_n to a R.V. Y such that E[Y I_A] = E[X I_A] for any A ∈ H_∞.
(f) Fixing A ∈ G such that A ∉ H_∞, let H_{n,A} = σ(A, H_n) and Z_n = E[X|H_{n,A}]. Explain why some subsequence of Z_n has an a.s. and L^2 limit, denoted Z. Show that EZ^2 = EY^2 = γ and deduce that E[(Y − Z)^2] = 0, hence Z = Y a.s.
(g) Show that Y is a version of the C.E. E[X|G].
4.4. Regular conditional probability distributions
We first show that if the random vector (X, Z) ∈ R^2 has a probability density function f_{X,Z}(x, z) (per Definition 3.5.5), then the C.E. E[X|Z] can be computed out of the corresponding conditional probability density (as done in a typical elementary probability course). To this end, let f_Z(z) = ∫_R f_{X,Z}(x, z) dx and f_X(x) = ∫_R f_{X,Z}(x, z) dz denote the probability density functions of Z and X. That is, f_Z(z) = λ(f_{X,Z}(·, z)) and f_X(x) = λ(f_{X,Z}(x, ·)) for Lebesgue measure λ and the Borel function f_{X,Z} on R^2. Recall that f_Z(·) and f_X(·) are nonnegative Borel functions (for example, consider our proof of Fubini's theorem in case of Lebesgue measure on (R^2, B_{R^2}) and the nonnegative integrable Borel function h = f_{X,Z}). So, defining the conditional probability density function of X given Z as
   f_{X|Z}(x|z) = f_{X,Z}(x, z)/f_Z(z)   if f_Z(z) > 0,
   f_{X|Z}(x|z) = f_X(x)                 otherwise,
guarantees that f_{X|Z} : R^2 → R_+ is Borel measurable and ∫_R f_{X|Z}(x|z) dx = 1 for all z ∈ R.
Proposition 4.4.1. Suppose the random vector (X, Z) has a probability density function f_{X,Z}(x, z) and g(·) is a Borel function on R such that E|g(X)| < ∞. Then, ĝ(Z) is a version of E[g(X)|Z] for the Borel function
(4.4.1)   ĝ(z) = ∫_R g(x) f_{X|Z}(x|z) dx,
in case ∫_R |g(x)| f_{X,Z}(x, z) dx is finite (taking otherwise ĝ(z) = 0).
Proof. Since the Borel function h(x, z) = g(x) f_{X,Z}(x, z) is integrable with respect to Lebesgue measure on (R^2, B_{R^2}), it follows that ĝ(·) is also a Borel function (c.f. our proof of Fubini's theorem). Further, by Fubini's theorem the integrability of g(X) implies that λ(R \ A) = 0 for A = {z : ∫ |g(x)| f_{X,Z}(x, z) dx < ∞}, and since the law of Z has the density f_Z with respect to λ, this implies that P(Z ∈ A) = 1. By Jensen's inequality,
   |ĝ(z)| ≤ ∫ |g(x)| f_{X|Z}(x|z) dx,   ∀z ∈ A.
Thus, by Fubini's theorem and the definition of f_{X|Z} we have that
   ∞ > E|g(X)| = ∫ |g(x)| f_X(x) dx ≥ ∫ |g(x)| ( ∫_A f_{X|Z}(x|z) f_Z(z) dz ) dx
      = ∫_A ( ∫ |g(x)| f_{X|Z}(x|z) dx ) f_Z(z) dz ≥ ∫_A |ĝ(z)| f_Z(z) dz = E|ĝ(Z)|.
So, ĝ(Z) is integrable. With (4.4.1) holding for all z ∈ A and P(Z ∈ A) = 1, by Fubini's theorem and the definition of f_{X|Z} we have that for any Borel set B,
   E[ĝ(Z) I_B(Z)] = ∫_{B∩A} ĝ(z) f_Z(z) dz = ∫ ( ∫ g(x) f_{X|Z}(x|z) dx ) I_{B∩A}(z) f_Z(z) dz
      = ∫_{R^2} g(x) I_{B∩A}(z) f_{X,Z}(x, z) dx dz = E[g(X) I_B(Z)].
This amounts to E[ĝ(Z) I_G] = E[g(X) I_G] for any G ∈ σ(Z) = {Z^{-1}(B) : B ∈ B}, so indeed ĝ(Z) is a version of E[g(X)|Z].
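A consequence of Proposition 4.4.1 that is easy to test numerically is E[ĝ(Z)] = E[g(X)] (take G = Ω in (4.1.1)). The sketch below uses the illustrative joint density f_{X,Z}(x, z) = x + z on (0, 1)^2 and g(x) = x, with midpoint Riemann sums in place of the integrals; for this density E[g(X)] = 7/12:

```python
# Numerical sketch of Proposition 4.4.1 for the illustrative joint density
# f_{X,Z}(x,z) = x + z on (0,1)^2 and g(x) = x: the function ghat of (4.4.1)
# should satisfy E[ghat(Z)] = E[g(X)]; integrals are midpoint Riemann sums.
n = 400
pts = [(i + 0.5) / n for i in range(n)]

def f_joint(x, z):
    return x + z

def f_Z(z):                       # f_Z(z) = integral of f_joint over x
    return sum(f_joint(x, z) for x in pts) / n

def ghat(z):                      # (4.4.1) with g(x) = x
    return sum(x * f_joint(x, z) for x in pts) / n / f_Z(z)

E_gX = sum(x * f_joint(x, z) for x in pts for z in pts) / n ** 2
E_ghatZ = sum(ghat(z) * f_Z(z) for z in pts) / n
assert abs(E_gX - 7.0 / 12.0) < 1e-3    # exact value for this density
assert abs(E_ghatZ - E_gX) < 1e-3
```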
To each conditional probability density f
XZ
([) corresponds the collection of con
ditional probability measures
P
XZ
(B, ) =
_
B
f
XZ
(x[Z())dx. The remainder of
this section deals with the following generalization of the latter object.
Definition 4.4.2. Let $Y : \Omega \to S$ be an $(S, \mathcal{S})$-valued R.V. in the probability space $(\Omega, \mathcal{F}, P)$, per Definition 1.2.1, and $\mathcal{G} \subseteq \mathcal{F}$ a σ-algebra. The collection $\widehat{P}_{Y|\mathcal{G}}(\cdot, \omega) : \mathcal{S} \to [0, 1]$ is called the regular conditional probability distribution (R.C.P.D.) of $Y$ given $\mathcal{G}$ if:
(a) $\omega \mapsto \widehat{P}_{Y|\mathcal{G}}(A, \omega)$ is a version of the C.E. $E[I_{Y \in A}|\mathcal{G}]$ for each fixed $A \in \mathcal{S}$.
(b) For any fixed $\omega \in \Omega$, the set function $A \mapsto \widehat{P}_{Y|\mathcal{G}}(A, \omega)$ is a probability measure on $(S, \mathcal{S})$.
In case $S = \Omega$, $\mathcal{S} = \mathcal{F}$ and $Y(\omega) = \omega$, we call this collection the regular conditional probability (R.C.P.) on $\mathcal{F}$ given $\mathcal{G}$, denoted also by $\widehat{P}(A|\mathcal{G})(\omega)$.
If the R.C.P. exists, then we can define all conditional expectations through the R.C.P. Unfortunately, the R.C.P. might not exist (see [Bil95, Exercise 33.11] for an example in which there exists no R.C.P. on $\mathcal{F}$ given $\mathcal{G}$).
Recall that each C.E. is uniquely determined only a.e. Hence, for any countable collection of disjoint sets $A_n \in \mathcal{F}$ there is possibly a set of $\omega$ of probability zero for which a given collection of C.E. is such that
\[ P\Big(\bigcup_n A_n \,\Big|\, \mathcal{G}\Big)(\omega) \ne \sum_n P(A_n|\mathcal{G})(\omega) . \]
In case we need to examine an uncountable number of such collections in order to see whether $P(\cdot|\mathcal{G})$ is a measure on $(\Omega, \mathcal{F})$, the corresponding exceptional sets of $\omega$ can pile up to a non-negligible set, hence the reason why a R.C.P. might not exist.
Nevertheless, as our next proposition shows, the R.C.P.D. exists for any conditioning σ-algebra $\mathcal{G}$ and any real-valued random variable $X$. In this setting, the R.C.P.D. is the analog of the law of $X$ as in Definition 1.2.33, but now given the information contained in $\mathcal{G}$.
Proposition 4.4.3. For any real-valued random variable $X$ and any σ-algebra $\mathcal{G} \subseteq \mathcal{F}$, there exists a R.C.P.D. $\widehat{P}_{X|\mathcal{G}}(\cdot, \cdot)$.
Proof. Consider the random variables $H(q, \omega) = E[I_{\{X \le q\}}|\mathcal{G}](\omega)$, indexed by $q \in \mathbb{Q}$. By monotonicity of the C.E. we know that if $q \le r$ then $H(q, \omega) \le H(r, \omega)$ for all $\omega \notin A_{q,r}$, where $A_{q,r} \in \mathcal{G}$ is such that $P(A_{q,r}) = 0$. Further, by linearity and dominated convergence of C.E.s, $H(q + n^{-1}, \omega) \to H(q, \omega)$ as $n \to \infty$ for all $\omega \notin B_q$, where $B_q \in \mathcal{G}$ is such that $P(B_q) = 0$. For the same reason, $H(q, \omega) \to 0$ as $q \to -\infty$ and $H(q, \omega) \to 1$ as $q \to \infty$ for all $\omega \notin C$, where $C \in \mathcal{G}$ is such that $P(C) = 0$. Since $\mathbb{Q}$ is countable, the set $D = C \cup \bigcup_{r,q} A_{r,q} \cup \bigcup_q B_q$ is also in $\mathcal{G}$ with $P(D) = 0$. Next, for a fixed non-random distribution function $G(\cdot)$, let $F(x, \omega) = \inf\{G(r, \omega) : r \in \mathbb{Q}, r > x\}$, where $G(r, \omega) = H(r, \omega)$ if $\omega \notin D$ and $G(r, \omega) = G(r)$ otherwise. Clearly, for all $\omega \in \Omega$ the non-decreasing function $x \mapsto F(x, \omega)$ converges to zero when $x \to -\infty$ and to one when $x \to \infty$, as $C \subseteq D$. Furthermore, $x \mapsto F(x, \omega)$ is right continuous, hence a distribution function, since
\[ \lim_{x_n \downarrow x} F(x_n, \omega) = \inf\{G(r, \omega) : r \in \mathbb{Q},\, r > x_n \text{ for some } n\} = \inf\{G(r, \omega) : r \in \mathbb{Q},\, r > x\} = F(x, \omega) . \]
Thus, to each $\omega \in \Omega$ corresponds a unique probability measure $\widehat{P}(\cdot, \omega)$ on $(\mathbb{R}, \mathcal{B})$ such that $\widehat{P}((-\infty, x], \omega) = F(x, \omega)$ for all $x \in \mathbb{R}$ (recall Theorem 1.2.36 for its existence and Proposition 1.2.44 for its uniqueness).
Note that $G(q, \cdot) \in m\mathcal{G}$ for all $q \in \mathbb{Q}$, hence so is $F(x, \cdot)$ for each $x \in \mathbb{R}$ (see Theorem 1.2.22). It follows that $\{B \in \mathcal{B} : \widehat{P}(B, \cdot) \in m\mathcal{G}\}$ is a λ-system (see Corollary 1.2.19 and Theorem 1.2.22), containing the π-system $\mathcal{P} = \{\mathbb{R}, (-\infty, q] : q \in \mathbb{Q}\}$, hence by Dynkin's theorem $\widehat{P}(B, \cdot) \in m\mathcal{G}$ for all $B \in \mathcal{B}$. Further, for $\omega \notin D$ and $q \in \mathbb{Q}$,
\[ H(q, \omega) = G(q, \omega) \le F(q, \omega) \le G(q + n^{-1}, \omega) = H(q + n^{-1}, \omega) \to H(q, \omega) \]
as $n \to \infty$ (specifically, the left-most inequality holds for $\omega \notin \bigcup_r A_{r,q}$ and the right-most limit holds for $\omega \notin B_q$). Hence, $\widehat{P}(B, \omega) = E[I_{\{X \in B\}}|\mathcal{G}](\omega)$ for any $B \in \mathcal{P}$ and $\omega \notin D$. Since $P(D) = 0$ it follows from the definition of the C.E. that for any $G \in \mathcal{G}$ and $B \in \mathcal{P}$, $\int_G \cdots$
(a) $\widehat{P}_{X|Y}(Y(\omega), A)$ is a version of the C.E. $E[I_{X \in A}|\sigma(Y)](\omega)$.
(b) For any fixed $\omega \in \Omega$, the set function $A \mapsto \widehat{P}_{X|Y}(Y(\omega), A)$ is a probability measure on $(S, \mathcal{S})$.
Hint: With $g : S \to T$ as before, show that $\sigma(Y) = \sigma(g(Y))$ and deduce from Theorem 1.2.26 that $\widehat{P}_{X|\sigma(g(Y))}(A, \omega) = f(A, g(Y(\omega)))$ for each $A \in \mathcal{S}$, where $z \mapsto f(A, z)$ is a Borel function.
Here is the extension of the change of variables formula (1.3.14) to the setting of conditional distributions.
Exercise 4.4.6. Suppose $X \in m\mathcal{F}$ and $Y \in m\mathcal{G}$ for some σ-algebras $\mathcal{G} \subseteq \mathcal{F}$ are real-valued. Prove that, for any Borel function $h : \mathbb{R}^2 \to \mathbb{R}$ such that $E|h(X, Y)| < \infty$, almost surely,
\[ E[h(X, Y)|\mathcal{G}] = \int_{\mathbb{R}} h(x, Y(\omega))\,d\widehat{P}_{X|\mathcal{G}}(x, \omega) . \]
For an integrable R.V. $X$ (and a non-random constant $Y = c$), this exercise provides the representation
\[ E[X|\mathcal{G}] = \int_{\mathbb{R}} x\,d\widehat{P}_{X|\mathcal{G}}(x, \omega) , \]
of the C.E. in terms of the corresponding R.C.P.D. (with the right side denoting the Lebesgue integral of Definition 1.3.1 for the probability space $(\mathbb{R}, \mathcal{B}, \widehat{P}_{X|\mathcal{G}}(\cdot, \omega))$).
Solving the next exercise should improve your understanding of the relation between the R.C.P.D. and the conditional probability density function.
Exercise 4.4.7. Suppose that the random vector $(X, Y, Z)$ has a probability density function $f_{X,Y,Z}$ per Definition 3.5.5.
(a) Express the R.C.P.D. $\widehat{P}_{Y|\sigma(X,Z)}$ in terms of $f_{X,Y,Z}$.
(b) Using this expression show that if $X$ is independent of $(Y, Z)$, then
\[ E[Y|X, Z] = E[Y|Z] . \]
(c) Provide an example of random variables $X, Y, Z$, such that $X$ is independent of $Y$ and
\[ E[Y|X, Z] \ne E[Y|Z] . \]
Exercise 4.4.8. Let $S_n = \sum_{k=1}^n \xi_k$ for i.i.d. integrable random variables $\{\xi_k\}$.
(a) Show that $E[\xi_1|S_n] = n^{-1} S_n$.
Hint: Consider $E[\xi_{\pi(1)} I_{S_n \in B}]$ for $B \in \mathcal{B}$ and a uniformly chosen random permutation $\pi$ of $\{1, \ldots, n\}$ which is independent of $\{\xi_k\}$.
(b) Find $P(\xi_1 \le b|S_2)$ in case the i.i.d. $\xi_k$ are Exponential of parameter $\lambda$.
Hint: See the representation of Exercise 3.4.11.
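Part (a) of Exercise 4.4.8 lends itself to a quick Monte Carlo illustration (a sketch, not from the text; the Exponential(1) law, $n = 4$ and the bin $(2, 3]$ are arbitrary choices): conditionally on $S_n$ falling in a small set, the average of $\xi_1$ should match the average of $S_n/n$.

```python
import numpy as np

# Illustrative sketch: check E[xi_1 | S_n] = S_n / n by conditioning on a bin
# of values of S_n, for i.i.d. Exponential(1) xi_k and n = 4 (assumptions).
rng = np.random.default_rng(1)
n, m = 4, 10**6
xi = rng.exponential(1.0, size=(m, n))
S = xi.sum(axis=1)

bin_mask = (S > 2.0) & (S <= 3.0)        # condition on {S_n in (2, 3]}
cond_mean_xi1 = xi[bin_mask, 0].mean()   # ~ E[xi_1 | S_n in (2, 3]]
cond_mean_S_over_n = S[bin_mask].mean() / n
print(cond_mean_xi1, cond_mean_S_over_n)
```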
Exercise 4.4.9. Let $E[X|X < Y] = E[X I_{X<Y}]/P(X < Y)$ for integrable $X$ and $Y$ such that $P(X < Y) > 0$. For each of the following statements, either show that it implies $E[X|X < Y] \le EX$ or provide a counter example.
(a) $X$ and $Y$ are independent.
(b) The random vector $(X, Y)$ has the same joint law as the random vector $(Y, X)$ and $P(X = Y) = 0$.
(c) $EX^2 < \infty$, $EY^2 < \infty$ and $E[XY] \le EX\,EY$.
Exercise 4.4.10. Suppose $(X, Y)$ are distributed according to a multivariate normal distribution, with $EX = EY = 0$ and $EY^2 > 0$. Show that $E[X|Y] = \rho Y$ with $\rho = E[XY]/EY^2$.
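A simulation sketch of Exercise 4.4.10 (illustrative assumptions, not from the text: $X = aY + \varepsilon$ with $Y, \varepsilon$ independent standard normals and $a = 0.7$, so that $\rho = E[XY]/EY^2 = a$):

```python
import numpy as np

# Illustrative sketch: for the jointly normal pair X = a*Y + eps one has
# rho = E[XY]/E[Y^2] = a and E[X|Y] = a*Y (assumed setup, a = 0.7).
rng = np.random.default_rng(2)
a, m = 0.7, 10**6
Y = rng.standard_normal(m)
X = a * Y + rng.standard_normal(m)

rho = (X * Y).mean() / (Y * Y).mean()    # empirical E[XY]/E[Y^2]

slab = np.abs(Y - 1.0) < 0.05            # condition on Y ~ 1.0
print(rho, X[slab].mean())               # both should be near a = 0.7
```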
CHAPTER 5
Discrete time martingales and stopping times
In this chapter we study a collection of stochastic processes called martingales. To simplify our presentation we focus on discrete time martingales and filtrations, also called discrete parameter martingales and filtrations, with definitions and examples provided in Section 5.1 (indeed, a discrete time stochastic process is merely a sequence of random variables defined on the same probability space). As we shall see in Section 5.4, martingales play a key role in computations involving stopping times. Martingales share many other useful properties, chief among which are tail bounds and convergence theorems. Section 5.2 deals with martingale representations and tail inequalities, some of which are applied in Section 5.3 to prove various convergence theorems. Section 5.5 further demonstrates the usefulness of martingales in the study of branching processes, likelihood ratios, and exchangeable processes.
5.1. Definitions and closure properties
Subsection 5.1.1 introduces the concepts of filtration, martingale and stopping time and provides a few illustrating examples and interpretations. Subsection 5.1.2 introduces the related super-martingales and sub-martingales, as well as the powerful martingale transform and other closure properties of this collection of stochastic processes.
5.1.1. Martingales, filtrations and stopping times: definitions and examples. Intuitively, a filtration represents any procedure of collecting more and more information as time goes on. Our starting point is the following rigorous mathematical definition of a (discrete time) filtration.
Definition 5.1.1. A filtration is a non-decreasing family of sub-σ-algebras $\mathcal{F}_n$ …
… a filtration $\{\mathcal{F}_n\}$ and the associated σ-algebra $\mathcal{F}_\infty = \sigma(\bigcup_k \mathcal{F}_k)$ such that the relation $\mathcal{F}_k \subseteq \mathcal{F}_\infty$ …
… is also a martingale with respect to its canonical filtration. For this reason, hereafter the statement "$\{X_n\}$ is a MG" (without explicitly specifying the filtration) means that $\{X_n\}$ is a MG with respect to its canonical filtration $\mathcal{F}^X_n = \sigma(X_k, k \le n)$.
We next provide an alternative characterization of the martingale property.
Proposition 5.1.5. If $X_n = \sum_{k=0}^n D_k$ then the canonical filtration for $\{X_n\}$ is the same as the canonical filtration for $\{D_n\}$. Further, $(X_n, \mathcal{F}_n)$ is a martingale if and only if $\{D_n\}$ is an integrable S.P., adapted to $\{\mathcal{F}_n\}$, such that $E[D_{n+1}|\mathcal{F}_n] = 0$ a.s. for all $n$.
Remark. The martingale differences associated with $\{X_n\}$ are $D_n = X_n - X_{n-1}$, $n \ge 1$, and $D_0 = X_0$.
Proof. With both the transformation from $(X_0, \ldots, X_n)$ to $(D_0, \ldots, D_n)$ and its inverse being continuous (hence Borel), it follows that $\mathcal{F}^X_n = \mathcal{F}^D_n$ for each $n$ (c.f. Exercise 1.2.32). Therefore, $\{X_n\}$ is adapted to a given filtration $\{\mathcal{F}_n\}$ if and only if $\{D_n\}$ is adapted to this filtration (see Definition 5.1.3). It is easy to show by induction on $n$ that $E|X_k| < \infty$ for $k = 0, \ldots, n$ if and only if $E|D_k| < \infty$ for $k = 0, \ldots, n$. Hence, $\{X_n\}$ is an integrable S.P. if and only if $\{D_n\}$ is. Finally, with $X_n \in m\mathcal{F}_n$ it follows from the linearity of the C.E. that
\[ E[X_{n+1}|\mathcal{F}_n] - X_n = E[X_{n+1} - X_n|\mathcal{F}_n] = E[D_{n+1}|\mathcal{F}_n] , \]
and the alternative expression for the martingale property follows from (5.1.1). □
Our first example of a martingale is the random walk, perhaps the most fundamental stochastic process.
Definition 5.1.6. The random walk is the stochastic process $S_n = S_0 + \sum_{k=1}^n \xi_k$ with real-valued, independent, identically distributed $\{\xi_k\}$ which are also independent of $S_0$. Unless explicitly stated otherwise, we always set $S_0$ to be zero. We say that the random walk is symmetric if the law of $\xi_k$ is the same as that of $-\xi_k$. We call it a simple random walk (on $\mathbb{Z}$), in short srw, if $\xi_k \in \{-1, 1\}$. The srw is completely characterized by the parameter $p = P(\xi_k = 1)$, which is always assumed to be in $(0, 1)$ (or alternatively, by $q = 1 - p = P(\xi_k = -1)$). Thus, the symmetric srw corresponds to $p = 1/2 = q$ (and the asymmetric srw corresponds to $p \ne 1/2$).
The random walk is a MG (with respect to its canonical filtration), whenever $E|\xi_1| < \infty$ and $E\xi_1 = 0$.
Remark. More generally, such partial sums $S_n$ form a MG even when the independent and integrable R.V. $\xi_k$ of zero mean have non-identical distributions, and the canonical filtration of $\{S_n\}$ is merely $\{\mathcal{F}_n\}$, where $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$. Indeed, this is an application of Proposition 5.1.5 for independent, integrable $D_k = S_k - S_{k-1} = \xi_k$, $k \ge 1$ (with $D_0 = 0$), where $E[D_{n+1}|D_0, D_1, \ldots, D_n] = E D_{n+1} = 0$ for all $n \ge 0$ by our assumption that $E\xi_k = 0$ for all $k$.
Definition 5.1.7. We say that a stochastic process $\{X_n\}$ is square-integrable if $EX_n^2 < \infty$ for all $n$. Similarly, we call a martingale $(X_n, \mathcal{F}_n)$ such that $EX_n^2 < \infty$ for all $n$, an $L^2$ MG (or a square-integrable MG).
Square-integrable martingales have zero-mean, uncorrelated differences and admit an elegant decomposition of conditional second moments.
Exercise 5.1.8. Suppose $(X_n, \mathcal{F}_n)$ and $(Y_n, \mathcal{F}_n)$ are square-integrable martingales.
(a) Show that the corresponding martingale differences $D_n$ are uncorrelated and that each $D_n$, $n \ge 1$, has zero mean.
(b) Show that for any $\ell \ge n \ge 0$,
\[ E[X_\ell Y_\ell|\mathcal{F}_n] - X_n Y_n = E[(X_\ell - X_n)(Y_\ell - Y_n)|\mathcal{F}_n] = \sum_{k=n+1}^{\ell} E[(X_k - X_{k-1})(Y_k - Y_{k-1})|\mathcal{F}_n] . \]
(c) Deduce that if $\sup_k |X_k| \le C$ non-random then for any $\ell \ge 1$,
\[ E\Big[\Big(\sum_{k=1}^{\ell} D_k^2\Big)^2\Big] \le 6 C^4 . \]
Remark. A square-integrable stochastic process with zero-mean, mutually independent differences is necessarily a martingale (consider Proposition 5.1.5). So, in view of part (a) of Exercise 5.1.8, the MG property is between the more restrictive requirement of having zero-mean, independent differences, and the not as useful property of just having zero-mean, uncorrelated differences. While in general these three conditions are not the same, as you show next they do coincide in case of Gaussian stochastic processes.
Exercise 5.1.9. A stochastic process $\{X_n\}$ is Gaussian if for each $n$ the random vector $(X_1, \ldots, X_n)$ has the multivariate normal distribution (c.f. Definition 3.5.13). Show that having independent or uncorrelated differences are equivalent properties for such processes, which together with each of these differences having a zero mean is then also equivalent to the MG property.
Products of random variables are another classical source of martingales.
Example 5.1.10. Consider the stochastic process $M_n = \prod_{k=1}^n Y_k$ for independent, integrable random variables $Y_k \ge 0$. Its canonical filtration coincides with $\{\mathcal{F}^Y_n\}$ (see Exercise 1.2.32), and taking out what is known we get by independence that
\[ E[M_{n+1}|\mathcal{F}^Y_n] = E[Y_{n+1} M_n|\mathcal{F}^Y_n] = M_n E[Y_{n+1}|\mathcal{F}^Y_n] = M_n E[Y_{n+1}] , \]
so $\{M_n\}$ is a MG, which we then call the product martingale, if and only if $EY_k = 1$ for all $k \ge 1$ (for a general sequence $\{Y_n\}$ we need instead that a.s. $E[Y_{n+1}|Y_1, \ldots, Y_n] = 1$ for all $n$).
Remark. In investment applications, the MG condition $EY_k = 1$ corresponds to a neutral return rate, and is not the same as the condition $E[\log Y_k] = 0$ under which the associated partial sums $S_n = \log M_n$ form a MG.
We proceed to define the important concept of stopping time (in the simpler context of a discrete parameter filtration).
Definition 5.1.11. A random variable $\tau$ taking values in $\{0, 1, \ldots, n, \ldots, \infty\}$ is a stopping time for the filtration $\{\mathcal{F}_n\}$ (also denoted $\mathcal{F}_n$-stopping time), if the event $\{\omega : \tau(\omega) \le n\}$ is in $\mathcal{F}_n$ for each finite $n \ge 0$.
Remark. Intuitively, a stopping time corresponds to a situation where the decision whether to stop or not at any given (non-random) time step is based on the information available by that time step. As we shall amply see in the sequel, one of the advantages of MGs is in providing a handle on explicit computations associated with various stopping times.
The next two exercises provide examples of stopping times. Practice your understanding of this concept by solving them.
Exercise 5.1.12. Suppose that $\theta$ and $\tau$ are stopping times for the same filtration $\{\mathcal{F}_n\}$. Show that then $\theta \wedge \tau$, $\theta \vee \tau$ and $\theta + \tau$ are also stopping times for this filtration.
Exercise 5.1.13. Show that the first hitting time $\tau(\omega) = \min\{k \ge 0 : X_k(\omega) \in B\}$ of a Borel set $B \subseteq \mathbb{R}$ by a sequence $\{X_k\}$, is a stopping time for the canonical filtration $\{\mathcal{F}^X_n\}$. Provide an example where the last hitting time $\theta = \sup\{k \ge 0 : X_k \in B\}$ of a set $B$ by the sequence, is not a stopping time (not surprising, since we need to know the whole sequence $\{X_k\}$ in order to verify that there are no visits to $B$ after a given time $n$).
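The point of Exercise 5.1.13 can be made concrete in a few lines of code (an illustrative sketch; the helper `first_hitting_time` is hypothetical, not from the text): the event $\{\tau \le n\}$ is computable from the prefix $X_0, \ldots, X_n$ alone, which is exactly the stopping-time property.

```python
# Illustrative sketch: the first hitting time tau = min{k >= 0 : X_k in B}
# is decided by the path prefix alone; the helper below scans left to right.
def first_hitting_time(path, B):
    """Return min{k : path[k] in B}, or None if the path never visits B."""
    for k, x in enumerate(path):
        if x in B:
            return k
    return None

path = [0, 1, 2, 1, 2, 3, 2]
B = {2, 5}
tau = first_hitting_time(path, B)
print(tau)                        # 2: the first visit to B
# {tau <= n} depends only on the prefix path[:n+1]:
assert first_hitting_time(path[:tau + 1], B) == tau
```

By contrast, no such prefix-based computation exists for the last hitting time, which is why it fails to be a stopping time.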
Here is an elementary application of first hitting times.
Exercise 5.1.14 (Reflection principle). Suppose $\{S_n\}$ is a symmetric random walk starting at $S_0 = 0$ (see Definition 5.1.6).
(a) Show that $P(S_n - S_k \ge 0) \ge 1/2$ for $k = 1, 2, \ldots, n$.
(b) Fixing $x > 0$, let $\tau = \inf\{k \ge 0 : S_k > x\}$ and show that
\[ P(S_n > x) \ge \sum_{k=1}^n P(\tau = k, S_n - S_k \ge 0) \ge \frac{1}{2} \sum_{k=1}^n P(\tau = k) . \]
(c) Deduce that for any $n$ and $x > 0$,
\[ P\Big(\max_{k=1}^{n} S_k > x\Big) \le 2 P(S_n > x) . \]
(d) Considering now the symmetric srw, show that for any positive integers $n, x$,
\[ P\Big(\max_{k=1}^{n} S_k \ge x\Big) = 2 P(S_n \ge x) - P(S_n = x) \]
and that $Z_{2n+1} \overset{D}{=} (|S_{2n+1}| - 1)/2$, where $Z_n$ denotes the number of (strict) sign changes within $S_0 = 0, S_1, \ldots, S_n$.
We conclude this subsection with a useful sufficient condition for the integrability of a stopping time.
Exercise 5.1.15. Suppose the $\mathcal{F}_n$-stopping time $\tau$ is such that a.s. $P[\tau \le n + r|\mathcal{F}_n] \ge \varepsilon$ for some positive integer $r$, some $\varepsilon > 0$ and all $n$.
(a) Show that $P(\tau > kr) \le (1 - \varepsilon)^k$ for any positive integer $k$.
Hint: Use induction on $k$.
(b) Deduce that in this case $E\tau < \infty$.
5.1.2. Submartingales, supermartingales and stopped martingales. Often when operating on a MG, we naturally end up with a submartingale or a supermartingale, as defined next. Moreover, these processes share many of the properties of martingales, so it is useful to develop a unified theory for them.
Definition 5.1.16. A submartingale (denoted subMG) is an integrable S.P. $\{X_n\}$, adapted to the filtration $\{\mathcal{F}_n\}$, such that
\[ E[X_{n+1}|\mathcal{F}_n] \ge X_n \quad \forall n, \text{ a.s.} \]
A supermartingale (denoted supMG) is an integrable S.P. $\{X_n\}$, adapted to the filtration $\{\mathcal{F}_n\}$, such that
\[ E[X_{n+1}|\mathcal{F}_n] \le X_n \quad \forall n, \text{ a.s.} \]
(A typical S.P. $\{X_n\}$ is neither a subMG nor a supMG, as the sign of the R.V. $E[X_{n+1}|\mathcal{F}_n] - X_n$ may well be random, or possibly dependent upon $n$.)
Remark 5.1.17. Note that $\{X_n\}$ is a subMG if and only if $\{-X_n\}$ is a supMG. By this identity, all results about subMGs have dual statements for supMGs and vice versa. We often state only one out of each such pair of statements. Further, $\{X_n\}$ is a MG if and only if it is both a subMG and a supMG. As a result, every statement holding for either subMGs or supMGs also holds for MGs.
Example 5.1.18. Expanding on Example 5.1.10, if the non-negative, integrable random variables $Y_k$ are such that $E[Y_n|Y_1, \ldots, Y_{n-1}] \ge 1$ a.s. for all $n$ then $M_n = \prod_{k=1}^n Y_k$ is a subMG, and if $E[Y_n|Y_1, \ldots, Y_{n-1}] \le 1$ a.s. for all $n$ then $\{M_n\}$ is a supMG. Such processes appear for example in mathematical finance, where $Y_k$ denotes the random proportional change in the value of a risky asset at the $k$-th trading round. So, a positive conditional mean return rate yields a subMG while a negative conditional mean return rate gives a supMG.
The submartingale (and supermartingale) property is closed with respect to the addition of S.P.
Exercise 5.1.19. Show that if $\{X_n\}$ and $\{Y_n\}$ are subMGs with respect to a filtration $\{\mathcal{F}_n\}$, then so is $\{X_n + Y_n\}$. In contrast, show that for any subMG $\{Y_n\}$ …
… $E[X_\ell|\mathcal{F}_m] \ge X_m$ for any $\ell > m$. Consequently, for a subMG necessarily $n \mapsto EX_n$ is non-decreasing. Similarly, for a supMG a.s. $E[X_\ell|\mathcal{F}_m] \le X_m$ (with $n \mapsto EX_n$ non-increasing), and for a martingale a.s. $E[X_\ell|\mathcal{F}_m] = X_m$ for all $\ell > m$ (with $E[X_n]$ independent of $n$).
Proof. Suppose $\{X_n\}$ is a subMG and $\ell = m + k$ for $k \ge 1$. Then,
\[ E[X_{m+k}|\mathcal{F}_m] = E[E(X_{m+k}|\mathcal{F}_{m+k-1})|\mathcal{F}_m] \ge E[X_{m+k-1}|\mathcal{F}_m] \]
with the equality due to the tower property and the inequality by the definition of a subMG and monotonicity of the C.E. Iterating this inequality for decreasing values of $k$ we deduce that $E[X_{m+k}|\mathcal{F}_m] \ge E[X_m|\mathcal{F}_m] = X_m$ for all non-negative integers $k, m$, as claimed. Next taking the expectation of this inequality, we have by monotonicity of the expectation and (4.2.1) that $E[X_{m+k}] \ge E[X_m]$ for all $k, m \ge 0$, or equivalently, that $n \mapsto EX_n$ is non-decreasing.
To get the corresponding results for a supermartingale $\{X_n\}$ note that then $\{-X_n\}$ is a submartingale, see Remark 5.1.17. As already mentioned there, if $\{X_n\}$ is a MG then it is both a supermartingale and a submartingale, hence both $E[X_\ell|\mathcal{F}_m] \ge X_m$ and $E[X_\ell|\mathcal{F}_m] \le X_m$, resulting with $E[X_\ell|\mathcal{F}_m] = X_m$, as stated. □
Exercise 5.1.21. Show that a submartingale $(X_n, \mathcal{F}_n)$ is a martingale if and only if $EX_n = EX_0$ for all $n$.
We next detail a few examples in which subMGs or supMGs naturally appear, starting with an immediate consequence of Jensen's inequality.
Proposition 5.1.22. Suppose $\varphi : \mathbb{R} \to \mathbb{R}$ is convex and $E[|\varphi(X_n)|] < \infty$.
(a) If $(X_n, \mathcal{F}_n)$ is a martingale then $(\varphi(X_n), \mathcal{F}_n)$ is a submartingale.
(b) If $x \mapsto \varphi(x)$ is also non-decreasing, $(\varphi(X_n), \mathcal{F}_n)$ is a submartingale even when $(X_n, \mathcal{F}_n)$ is only a submartingale.
Proof. With $\varphi(X_n)$ integrable and adapted, it suffices to check that a.s. $E[\varphi(X_{n+1})|\mathcal{F}_n] \ge \varphi(X_n)$ for all $n$. To this end, since $\varphi(\cdot)$ is convex and $X_n$ is integrable, by the conditional Jensen's inequality,
\[ E[\varphi(X_{n+1})|\mathcal{F}_n] \ge \varphi(E[X_{n+1}|\mathcal{F}_n]) , \]
so it remains only to verify that $\varphi(E[X_{n+1}|\mathcal{F}_n]) \ge \varphi(X_n)$. This clearly applies when $(X_n, \mathcal{F}_n)$ is a MG, and even for a subMG $(X_n, \mathcal{F}_n)$, provided that $\varphi(\cdot)$ is monotone non-decreasing. □
Example 5.1.23. Typical convex functions for which the preceding proposition is often applied are $\varphi(x) = |x|^p$, $p \ge 1$, $\varphi(x) = (x - c)_+$, $\varphi(x) = \max(x, c)$ (for $c \in \mathbb{R}$), $\varphi(x) = e^{x}$ and $\varphi(x) = x \log x$ (the latter only for non-negative S.P.). Considering instead $\varphi(\cdot)$ concave leads to a supMG, as for example when $\varphi(x) = \min(x, c)$ or $\varphi(x) = x^p$ for some $p \in (0, 1)$ or $\varphi(x) = \log x$ (latter two cases restricted to non-negative S.P.). For example, if $\{X_n\}$ is a submartingale then $\{(X_n - c)_+\}$ is also a submartingale (since $(x - c)_+$ is a convex, non-decreasing function). Similarly, if $\{X_n\}$ is a supermartingale, then $\{\min(X_n, c)\}$ is also a supermartingale (since $\{-X_n\}$ is a submartingale and the function $-\min(x, c) = \max(-x, -c)$ is convex and non-decreasing).
Here is a concrete application of Proposition 5.1.22.
Exercise 5.1.24. Suppose $\{\xi_i\}$ are mutually independent, $E\xi_i = 0$ and $E\xi_i^2 = \sigma_i^2$.
(a) Let $S_n = \sum_{i=1}^n \xi_i$ and $s_n^2 = \sum_{i=1}^n \sigma_i^2$. Show that $S_n^2$ is a submartingale and $S_n^2 - s_n^2$ is a martingale.
(b) Show that if in addition $m_n = \prod_{i=1}^n E e^{\xi_i}$ are finite, then $e^{S_n}$ is a submartingale and $M_n = e^{S_n}/m_n$ is a martingale.
Remark. A special case of Exercise 5.1.24 is the random walk $\{S_n\}$ of Definition 5.1.6, with $S_n^2 - n E\xi_1^2$ being a MG when $\xi_1$ is square-integrable and of zero mean. Likewise, $e^{S_n}$ is a subMG whenever $E\xi_1 = 0$ and $E e^{\xi_1}$ is finite. Though $e^{S_n}$ is in general not a MG, the normalized $M_n = e^{S_n}/[E e^{\xi_1}]^n$ is merely the product MG of Example 5.1.10 for the i.i.d. variables $Y_i = e^{\xi_i}/E(e^{\xi_1})$.
Here is another family of supermartingales, this time related to superharmonic functions.
Definition 5.1.25. A lower semi-continuous function $f : \mathbb{R}^d \to \mathbb{R}$ is super-harmonic if for any $x$ and $r > 0$,
\[ f(x) \ge \frac{1}{|B(0, r)|} \int_{B(x, r)} f(y)\,dy \]
where $B(x, r) = \{y : |x - y| \le r\}$ is the ball of radius $r$ centered at $x$ and $|B(x, r)|$ denotes its volume.
Exercise 5.1.26. Suppose $S_n = x + \sum_{k=1}^n \xi_k$ for i.i.d. $\{\xi_k\}$ that are chosen uniformly on the ball $B(0, 1)$ in $\mathbb{R}^d$ (i.e. using Lebesgue's measure on this ball, scaled by its volume). Show that if $f(\cdot)$ is superharmonic on $\mathbb{R}^d$ then $\{f(S_n)\}$ is a supermartingale.
Hint: When checking the integrability of $f(S_n)$ recall that a lower semi-continuous function is bounded below on any compact set.
We next define the important concept of a martingale transform, and show that it is a powerful and flexible method for generating martingales.
Definition 5.1.27. We call a sequence $\{V_n\}$ predictable (or previsible) for the filtration $\{\mathcal{F}_n\}$, also denoted $\mathcal{F}_n$-predictable, if $V_n$ is measurable on $\mathcal{F}_{n-1}$ for all $n \ge 1$. The sequence of random variables
\[ Y_n = \sum_{k=1}^n V_k (X_k - X_{k-1}) , \quad n \ge 1, \qquad Y_0 = 0 \]
is called the martingale transform of the $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to a sub- or super-martingale $(X_n, \mathcal{F}_n)$.
Theorem 5.1.28. Suppose $\{Y_n\}$ is the martingale transform of $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to a sub- or super-martingale $(X_n, \mathcal{F}_n)$.
(a) If $Y_n$ is integrable and $(X_n, \mathcal{F}_n)$ is a martingale, then $(Y_n, \mathcal{F}_n)$ is also a martingale.
(b) If $Y_n$ is integrable, $V_n \ge 0$ and $(X_n, \mathcal{F}_n)$ is a submartingale (or supermartingale) then $(Y_n, \mathcal{F}_n)$ is also a submartingale (supermartingale, respectively).
(c) For the integrability of $Y_n$ it suffices in both cases to have $|V_n| \le c_n$ for some non-random finite constants $c_n$, or alternatively to have $V_n \in L^q$ and $X_n \in L^p$ for all $n$ and some $p, q > 1$ such that $\frac{1}{q} + \frac{1}{p} = 1$.
Proof. With $\{V_n\}$ and $\{X_n\}$ adapted to the filtration $\{\mathcal{F}_n\}$, it follows that $V_k X_l \in m\mathcal{F}_k \subseteq m\mathcal{F}_n$ for all $l \le k \le n$. By inspection $Y_n \in m\mathcal{F}_n$ as well (see Corollary 1.2.19), i.e. $\{Y_n\}$ is adapted to $\{\mathcal{F}_n\}$.
Turning to prove part (c) of the theorem, note that for each $n$ the variable $Y_n$ is a finite sum of terms of the form $V_k X_l$. If $V_k \in L^q$ and $X_l \in L^p$ for some $p, q > 1$ such that $\frac{1}{q} + \frac{1}{p} = 1$, then by Hölder's inequality $V_k X_l$ is integrable. Alternatively, since a supermartingale $X_l$ is in particular integrable, $V_k X_l$ is integrable as soon as $|V_k|$ is bounded by a non-random finite constant. In conclusion, if either of these conditions applies for all $k, l$ then obviously $\{Y_n\}$ is an integrable S.P.
Recall that $Y_{n+1} - Y_n = V_{n+1}(X_{n+1} - X_n)$ and $V_{n+1} \in m\mathcal{F}_n$ (since $\{V_n\}$ is $\mathcal{F}_n$-predictable). Therefore, taking out $V_{n+1}$ which is measurable on $\mathcal{F}_n$ we find that
\[ E[Y_{n+1} - Y_n|\mathcal{F}_n] = E[V_{n+1}(X_{n+1} - X_n)|\mathcal{F}_n] = V_{n+1} E[X_{n+1} - X_n|\mathcal{F}_n] . \]
This expression is zero when $(X_n, \mathcal{F}_n)$ is a MG and non-negative when $V_{n+1} \ge 0$ and $(X_n, \mathcal{F}_n)$ is a subMG. Since the preceding applies for all $n$, we consequently have that $(Y_n, \mathcal{F}_n)$ is a MG in the former case and a subMG in the latter. Finally, to complete the proof also in case of a supMG $(X_n, \mathcal{F}_n)$, note that then $\{-Y_n\}$ is the MG transform of $\{V_n\}$ with respect to the subMG $(-X_n, \mathcal{F}_n)$. □
Here are two concrete examples of a martingale transform.
Example 5.1.29. The S.P. $Y_n = \sum_{k=1}^n X_{k-1}(X_k - X_{k-1})$ is a MG whenever $X_n \in L^2(\Omega, \mathcal{F}, P)$ is a MG (indeed, $V_n = X_{n-1}$ is predictable for the canonical filtration of $\{X_n\}$ and consider $p = q = 2$ in part (c) of Theorem 5.1.28).
Example 5.1.30. Given an integrable process $\{V_n\}$ suppose that for each $k \ge 1$ the bounded $\xi_k$ has zero mean and is independent of $\mathcal{F}_{k-1} = \sigma(\xi_1, \ldots, \xi_{k-1}, V_1, \ldots, V_k)$. Then, $Y_n = \sum_{k=1}^n V_k \xi_k$ is a martingale for the filtration $\{\mathcal{F}_n\}$. Indeed, by assumption, the differences $\xi_n$ of $X_n = \sum_{k=1}^n \xi_k$ are such that $E[\xi_k|\mathcal{F}_{k-1}] = 0$ for all $k \ge 1$. Hence, $(X_n, \mathcal{F}_n)$ is a martingale (c.f. Proposition 5.1.5), and $\{Y_n\}$ is the martingale transform of the $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to the martingale $(X_n, \mathcal{F}_n)$ (where the integrability of $Y_n$ is a consequence of the boundedness of each $\xi_k$ and integrability of each $V_k$). In discrete mathematics applications one often uses a special case of this construction, with an auxiliary sequence of random i.i.d. signs $\xi_k \in \{-1, 1\}$ such that $P(\xi_1 = 1) = \frac{1}{2}$ and $\{\xi_n\}$ is independent of the given integrable S.P. $\{V_n\}$.
We next define the important concept of a stopped stochastic process and then use the martingale transform to show that stopped sub- and super-martingales are also subMGs (supMGs, respectively).
Definition 5.1.31. Given a stochastic process $\{X_n\}$ and a random variable $\tau$ taking values in $\{0, 1, \ldots, n, \ldots, \infty\}$, the stopped at $\tau$ stochastic process, denoted $\{X_{n \wedge \tau}\}$, is given by
\[ X_{n \wedge \tau}(\omega) = \begin{cases} X_n(\omega), & n \le \tau(\omega) \\ X_{\tau(\omega)}(\omega), & n > \tau(\omega) \end{cases} \]
Theorem 5.1.32. If $(X_n, \mathcal{F}_n)$ is a subMG (or a supMG or a MG) and $\theta \le \tau$ are stopping times for $\{\mathcal{F}_n\}$, then $(X_{n \wedge \tau} - X_{n \wedge \theta}, \mathcal{F}_n)$ is also a subMG (or supMG or MG, respectively). In particular, taking $\theta = 0$ we have that $(X_{n \wedge \tau}, \mathcal{F}_n)$ is then a subMG (or supMG or MG, respectively).
Proof. We may and shall assume that $(X_n, \mathcal{F}_n)$ is a subMG (just consider $\{-X_n\}$ in case $\{X_n\}$ is a supMG and both when $\{X_n\}$ is a MG). Let $V_k(\omega) = I_{\{\theta(\omega) < k \le \tau(\omega)\}}$. Since $\theta \le \tau$ are two $\mathcal{F}_n$-stopping times, it follows that
\[ V_k(\omega) = I_{\{\theta(\omega) \le k-1\}} - I_{\{\tau(\omega) \le k-1\}} \]
is measurable on $\mathcal{F}_{k-1}$ for all $k \ge 1$. Thus, $\{V_n\}$ is a bounded, non-negative $\mathcal{F}_n$-predictable sequence. Further, since
\[ X_{n \wedge \tau}(\omega) - X_{n \wedge \theta}(\omega) = \sum_{k=1}^n I_{\{\theta(\omega) < k \le \tau(\omega)\}} \big(X_k(\omega) - X_{k-1}(\omega)\big) \]
is the martingale transform of $\{V_n\}$ with respect to the subMG $(X_n, \mathcal{F}_n)$, we know from Theorem 5.1.28 that $(X_{n \wedge \tau} - X_{n \wedge \theta}, \mathcal{F}_n)$ is also a subMG. Finally, considering the latter subMG for $\theta = 0$ and adding to it the subMG $(X_0, \mathcal{F}_n)$, we conclude that $(X_{n \wedge \tau}, \mathcal{F}_n)$ is a subMG (c.f. Exercise 5.1.19 and note that $X_{n \wedge 0} = X_0$). □
Theorem 5.1.32 thus implies the following key ingredient in the proof of Doob's optional stopping theorem (to which we return in Section 5.4).
Corollary 5.1.33. If $(X_n, \mathcal{F}_n)$ is a subMG and $\theta \le \tau$ are $\mathcal{F}_n$-stopping times, then $EX_{n \wedge \theta} \le EX_{n \wedge \tau}$ for all $n$. The reverse inequality holds in case $(X_n, \mathcal{F}_n)$ is a supMG, with $EX_{n \wedge \theta} = EX_{n \wedge \tau}$ for all $n$ in case $(X_n, \mathcal{F}_n)$ is a MG.
Proof. It suffices to consider $\{X_n\}$ which is a subMG for the filtration $\{\mathcal{F}_n\}$. In this case we have from Theorem 5.1.32 that $Y_n = X_{n \wedge \tau} - X_{n \wedge \theta}$ is also a subMG for this filtration. Noting that $Y_0 = 0$ we thus get from Proposition 5.1.20 that $EY_n \ge 0$. Theorem 5.1.32 also implies the integrability of $X_{n \wedge \theta}$, so by linearity of the expectation we conclude that $EX_{n \wedge \theta} \le EX_{n \wedge \tau}$. □
An important concept associated with each stopping time is the stopped σ-algebra defined next.
Definition 5.1.34. The stopped σ-algebra $\mathcal{F}_\tau$ consists of the events $A$ such that $A \cap \{\omega : \tau(\omega) \le n\} \in \mathcal{F}_n$ for all $n$.
With $\mathcal{F}_n$ representing the information known at time $n$, think of $\mathcal{F}_\tau$ as quantifying the information known upon stopping at $\tau$. Some of the properties of these stopped σ-algebras are detailed in the next exercise.
Exercise 5.1.35. Let $\theta$ and $\tau$ be $\mathcal{F}_n$-stopping times.
(a) Verify that $\mathcal{F}_\tau$ … $\mathcal{F}_n$.
(b) Suppose $X_n \in m\mathcal{F}_n$ for all $n$ (including $n = \infty$ unless $\tau$ is finite for all $\omega$). Show that then $X_\tau \in m\mathcal{F}_\tau$. Deduce that $\tau \in m\mathcal{F}_\tau$ and $X_k I_{\{\tau = k\}} \in m\mathcal{F}_\tau$ …
… $I_{\{\tau = k\}}|\mathcal{F}_\tau] = E[Y_k|\mathcal{F}_k] I_{\{\tau = k\}}$.
(d) Show that if $\theta \le \tau$ then $\mathcal{F}_\theta \subseteq \mathcal{F}_\tau$.
Our next exercise shows that the martingale property is equivalent to the strong martingale property, whereby conditioning at stopped σ-algebras $\mathcal{F}_\tau$ replaces the one at $\mathcal{F}_n$ for non-random $n$.
Exercise 5.1.36. Given an integrable stochastic process $\{X_n\}$ adapted to a filtration $\{\mathcal{F}_n\}$, show that $(X_n, \mathcal{F}_n)$ is a martingale if and only if $E[X_n|\mathcal{F}_\tau] = X_\tau$ for any non-random, finite $n$ and all $\mathcal{F}_n$-stopping times $\tau \le n$.
For non-integrable stochastic processes we generalize the concept of a martingale into that of a local martingale.
Exercise 5.1.37. The pair $(X_n, \mathcal{F}_n)$ is called a local martingale if $\{X_n\}$ is adapted to the filtration $\{\mathcal{F}_n\}$ and there exist $\mathcal{F}_n$-stopping times $\tau_k$ such that $\tau_k \uparrow \infty$ with probability one and $(X_{n \wedge \tau_k}, \mathcal{F}_n)$ is a martingale for each $k$. Show that any martingale is a local martingale and any integrable local martingale is a martingale.
We conclude with the renewal property of stopping times with respect to the canonical filtration of an i.i.d. sequence.
Exercise 5.1.38. Suppose $\tau$ is an a.s. finite stopping time with respect to the canonical filtration $\{\mathcal{F}^Z_n\}$ of a sequence $\{Z_k\}$ of i.i.d. R.V.s.
(a) Show that $\sigma(Z_{\tau + k}, k \ge 1)$ is independent of the stopped σ-algebra $\mathcal{F}^Z_\tau$.
(b) Provide an example of a finite $\mathcal{F}^Z_n$-stopping time $\tau$ and independent $\{Z_k\}$ for which $\sigma(Z_{\tau + k}, k \ge 1)$ is not independent of $\mathcal{F}^Z_\tau$.
5.2. Martingale representations and inequalities
In Subsection 5.2.1 we show that martingales are at the core of all adapted processes. We further explore there the structure of certain submartingales, introducing the increasing process associated with square-integrable martingales. This is augmented in Subsection 5.2.2 by the study of maximal inequalities for submartingales (and martingales). Such inequalities are an important technical tool in many applications of probability theory. In particular, they are the key to the convergence results of Section 5.3.
5.2.1. Martingale decompositions. To demonstrate the relevance of martingales to the study of general stochastic processes, we start with a representation of any adapted, integrable, discrete-time S.P. as the sum of a martingale and a predictable process.
Theorem 5.2.1 (Doob's decomposition). Given an integrable stochastic process $\{X_n\}$, adapted to a discrete parameter filtration $\{\mathcal{F}_n\}$, $n \ge 0$, there exists a decomposition $X_n = Y_n + A_n$ such that $(Y_n, \mathcal{F}_n)$ is a MG and $\{A_n\}$ is an $\mathcal{F}_n$-predictable sequence. This decomposition is unique up to the value of $Y_0 \in m\mathcal{F}_0$.
Proof. Let $A_0 = 0$ and for $n \ge 1$ set
\[ A_n = A_{n-1} + E[X_n - X_{n-1}|\mathcal{F}_{n-1}] . \]
By definition of the conditional expectation we see that $A_k - A_{k-1}$ is measurable on $\mathcal{F}_{k-1}$ for any $k \ge 1$. Since $\mathcal{F}_{k-1} \subseteq \mathcal{F}_{n-1}$ for all $k \le n$ and $A_n = A_0 + \sum_{k=1}^n (A_k - A_{k-1})$, it follows that $\{A_n\}$ is $\mathcal{F}_n$-predictable. We next check that $Y_n = X_n - A_n$ is a MG. To this end, recall that since $X_n$ is integrable so is $X_n - X_{n-1}$, whereas the C.E. only reduces the $L^1$ norm (see Example 4.2.20). Therefore, $E|A_n - A_{n-1}| \le E|X_n - X_{n-1}| < \infty$. Hence, $A_n$ is integrable, as is $X_n$, implying by Minkowski's inequality that $Y_n$ is integrable as well. With $\{X_n\}$ adapted and $\{A_n\}$ predictable, hence adapted, to $\{\mathcal{F}_n\}$, we see that $\{Y_n\}$ is also $\mathcal{F}_n$-adapted. It remains to check the martingale condition, that almost surely $E[Y_n - Y_{n-1}|\mathcal{F}_{n-1}] = 0$ for all $n \ge 1$. Indeed, by linearity of the C.E. and the construction of the $\mathcal{F}_n$-predictable sequence $\{A_n\}$, for any $n \ge 1$,
\[ E[Y_n - Y_{n-1}|\mathcal{F}_{n-1}] = E[X_n - X_{n-1} - (A_n - A_{n-1})|\mathcal{F}_{n-1}] = E[X_n - X_{n-1}|\mathcal{F}_{n-1}] - (A_n - A_{n-1}) = 0 . \]
We finish the proof by checking that such a decomposition is unique up to the choice of $Y_0$. To this end, suppose that $X_n = Y_n + A_n = \widetilde{Y}_n + \widetilde{A}_n$ are two such decompositions of a given stochastic process $\{X_n\}$. Then, $\widetilde{Y}_n - Y_n = A_n - \widetilde{A}_n$. Since $\{A_n\}$ and $\{\widetilde{A}_n\}$ are both $\mathcal{F}_n$-predictable sequences while $(Y_n, \mathcal{F}_n)$ and $(\widetilde{Y}_n, \mathcal{F}_n)$ are martingales, we find that
\[ A_n - \widetilde{A}_n = E[A_n - \widetilde{A}_n|\mathcal{F}_{n-1}] = E[\widetilde{Y}_n - Y_n|\mathcal{F}_{n-1}] = \widetilde{Y}_{n-1} - Y_{n-1} = A_{n-1} - \widetilde{A}_{n-1} . \]
Thus, $A_n - \widetilde{A}_n$ is independent of $n$ and if in addition $Y_0 = \widetilde{Y}_0$ then $A_n - \widetilde{A}_n = A_0 - \widetilde{A}_0 = \widetilde{Y}_0 - Y_0 = 0$ for all $n$. In conclusion, both sequences $\{A_n\}$ and $\{Y_n\}$ are uniquely determined as soon as we determine $Y_0$, a R.V. measurable on $\mathcal{F}_0$. □
Doob's decomposition has more structure when $(X_n, \mathcal{F}_n)$ is a subMG.
Exercise 5.2.2. Check that the predictable part of Doob's decomposition of a submartingale $(X_n, \mathcal{F}_n)$ is a non-decreasing sequence, that is, $A_n \le A_{n+1}$ for all $n$.
Remark. As shown in Subsection 5.3.2, Doobs decomposition is particularly useful
in connection with squareintegrable martingales X
n
, where one can relate the
limit of X
n
as n with that of the nondecreasing sequence A
n
in the
decomposition of X
2
n
.
We next evaluate Doobs decomposition for two classical subMGs.
Example 5.2.3. Consider the sub-MG $S_n^2$ for the random walk $S_n = \sum_{k=1}^n \xi_k$, where $\xi_k$ are i.i.d. random variables with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Since $Y_n = S_n^2 - n$ is a martingale (see Exercise 5.1.24), and Doob's decomposition $S_n^2 = Y_n + A_n$ is unique, it follows that the non-decreasing predictable part in the decomposition of $S_n^2$ is $A_n = n$.
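As an illustration (not part of the original text), the identity $A_n = n$ can be confirmed by exact enumeration: assuming the symmetric $\pm 1$ walk, the sketch below computes the predictable increment $E[S_n^2 - S_{n-1}^2 \mid \mathcal{F}_{n-1}]$ over every possible history of signs.

```python
from itertools import product

# Exact check of Doob's decomposition for X_n = S_n^2, where S_n is a
# symmetric +/-1 random walk (an assumed concrete case of Example 5.2.3):
# the predictable increment A_n - A_{n-1} = E[X_n - X_{n-1} | F_{n-1}]
# should equal 1 for every history, so that A_n = n.
def predictable_increment(prefix):
    """E[S_n^2 - S_{n-1}^2 | xi_1, ..., xi_{n-1} = prefix], averaged over xi_n."""
    s = sum(prefix)
    return sum((s + xi) ** 2 - s ** 2 for xi in (-1, 1)) / 2.0

n = 6
for k in range(1, n + 1):
    for prefix in product((-1, 1), repeat=k - 1):
        assert predictable_increment(prefix) == 1.0
print("A_n - A_{n-1} = 1 for every history, hence A_n = n")
```

Since every conditional increment equals 1 regardless of the history, the non-decreasing predictable part is indeed $A_n = n$ and $Y_n = S_n^2 - n$ is the martingale part.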
In contrast with the preceding example, the non-decreasing predictable part in Doob's decomposition is for most sub-MGs a non-degenerate random sequence, as is the case in our next example.
186 5. DISCRETE TIME MARTINGALES AND STOPPING TIMES
Example 5.2.4. Consider the sub-MG $(M_n, \mathcal{F}^Z_n)$ where $M_n = \prod_{i=1}^n Z_i$ for i.i.d. integrable $Z_i \ge 0$ such that $EZ_1 > 1$ (see Example 5.1.10). The non-decreasing predictable part of its Doob's decomposition is such that for $n \ge 1$
$$A_{n+1} - A_n = E[M_{n+1} - M_n \mid \mathcal{F}^Z_n] = E[Z_{n+1}M_n - M_n \mid \mathcal{F}^Z_n] = M_n\,E[Z_{n+1} - 1 \mid \mathcal{F}^Z_n] = M_n(EZ_1 - 1)$$
(since $Z_{n+1}$ is independent of $\mathcal{F}^Z_n$). In this case $A_n = (EZ_1 - 1)\sum_{k=1}^{n-1} M_k + A_1$, where we are free to choose for $A_1$ any non-random constant. We see that $A_n$ is a non-degenerate random sequence (assuming the R.V. $Z_i$ are not a.s. constant).
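A similar exact check is possible here; the sketch below (an illustration, with the assumed concrete law $P(Z_1 = 0) = P(Z_1 = 3) = 1/2$, so $EZ_1 = 3/2 > 1$) verifies $A_{n+1} - A_n = M_n(EZ_1 - 1)$ path by path with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Exact check of Example 5.2.4 for the assumed law P(Z=0)=P(Z=3)=1/2:
# with M_n = Z_1 ... Z_n, the predictable increment of Doob's decomposition
# should be A_{n+1} - A_n = M_n (EZ_1 - 1) for every history.
values = (Fraction(0), Fraction(3))
EZ = sum(values, Fraction(0)) / len(values)      # EZ_1 = 3/2

def increment(history):
    """E[M_{n+1} - M_n | Z_1..Z_n = history], averaged over Z_{n+1}."""
    m = Fraction(1)
    for z in history:
        m *= z
    return sum(m * z - m for z in values) / len(values)

for n in range(4):
    for hist in product(values, repeat=n):
        m = Fraction(1)
        for z in hist:
            m *= z
        assert increment(hist) == m * (EZ - 1)
print("A_{n+1} - A_n = M_n (EZ_1 - 1) for every history")
```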
We conclude with the representation of any $L^1$-bounded martingale as the difference of two non-negative martingales (resembling the representation $X = X_+ - X_-$).

Exercise 5.2.5. Let $(X_n, \mathcal{F}_n)$ be a martingale with $\sup_n E|X_n| < \infty$. Show that there is a representation $X_n = Y_n - Z_n$ with $(Y_n, \mathcal{F}_n)$ and $(Z_n, \mathcal{F}_n)$ non-negative martingales such that $\sup_n E[Y_n] < \infty$ and $\sup_n E[Z_n] < \infty$.
5.2.2. Maximal and up-crossing inequalities. Martingales are rather tame stochastic processes. In particular, as we see next, the tail of $\max_{k \le n} X_k$ is bounded by moments of $X_n$. This is a major improvement over Markov's inequality, relating the typically much smaller tail of the R.V. $X_n$ to its moments (see part (b) of Example 1.3.14).

Theorem 5.2.6 (Doob's inequality). For any sub-martingale $X_n$ and $x > 0$ let $\tau_x = \min\{k \ge 0 : X_k \ge x\}$. Then, for any finite $n \ge 0$,
$$\text{(5.2.1)}\qquad P(\max_{0 \le k \le n} X_k \ge x) \le x^{-1} E[X_n I_{\{\tau_x \le n\}}] \le x^{-1} E[(X_n)_+].$$
Proof. Since $X_{\tau_x} \ge x$ whenever $\tau_x$ is finite, setting
$$A_n = \{\omega : \tau_x(\omega) \le n\} = \{\omega : \max_{0 \le k \le n} X_k(\omega) \ge x\},$$
it follows that
$$E[X_{n \wedge \tau_x}] = E[X_{\tau_x} I_{\{\tau_x \le n\}}] + E[X_n I_{\{\tau_x > n\}}] \ge x P(A_n) + E[X_n I_{A_n^c}].$$
With $X_n$ a sub-MG and $n \wedge \tau_x \le n$ a pair of $\mathcal{F}^X_n$-stopping times, it follows from Corollary 5.1.33 that $E[X_{n \wedge \tau_x}] \le E[X_n]$. Therefore, $E[X_n] - E[X_n I_{A_n^c}] \ge x P(A_n)$, which is exactly the left inequality in (5.2.1). The right inequality there holds by monotonicity of the expectation and the trivial fact $X I_A \le (X)_+$ for any R.V. $X$ and any measurable set $A$.
Remark. Doob's inequality generalizes Kolmogorov's maximal inequality. Indeed, consider $X_k = Z_k^2$ for the $L^2$ martingale $Z_k = Y_1 + \cdots + Y_k$, where $Y_l$ are mutually independent with $EY_l = 0$ and $EY_l^2 < \infty$. By Proposition 5.1.22 $X_k$ is a sub-MG, so by Doob's inequality we obtain that for any $z > 0$,
$$P(\max_{1 \le k \le n} |Z_k| \ge z) = P(\max_{1 \le k \le n} X_k \ge z^2) \le z^{-2} E[(X_n)_+] = z^{-2}\,\mathrm{Var}(Z_n),$$
which is exactly Kolmogorov's maximal inequality of Proposition 2.3.15.
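Both inequalities in (5.2.1) can be verified exactly for the sub-MG $X_k = S_k^2$ of the preceding remark; the sketch below (an illustration, not part of the text) enumerates all $2^n$ equally likely $\pm 1$ paths with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Exact verification of Doob's inequality (5.2.1) for the sub-MG
# X_k = S_k^2, with S_k a symmetric +/-1 walk: for every level x > 0,
#   P(max_{k<=n} X_k >= x) <= E[X_n I_{tau_x <= n}] / x <= E[(X_n)_+] / x.
# Probabilities are computed over the 2^n equally likely sign paths.
n = 8
weight = Fraction(1, 2 ** n)
for x in (1, 4, 9, 16, 25):
    p_max, e_stopped, e_plus = Fraction(0), Fraction(0), Fraction(0)
    for path in product((-1, 1), repeat=n):
        s, running_max = 0, 0
        for xi in path:
            s += xi
            running_max = max(running_max, s * s)
        x_n = s * s
        e_plus += weight * x_n            # X_n >= 0, so (X_n)_+ = X_n
        if running_max >= x:              # i.e. tau_x <= n
            p_max += weight
            e_stopped += weight * x_n
    assert p_max <= e_stopped / x <= e_plus / x
print("Doob's inequality holds exactly for all tested levels x")
```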
Combining Doob's inequality with Doob's decomposition of non-negative sub-martingales, we arrive at the following bounds, due to Lenglart.

Lemma 5.2.7. Let $V_n = \max_{0 \le k \le n} Z_k$ and let $A_n$ denote the $\mathcal{F}_n$-predictable sequence in Doob's decomposition of a non-negative sub-martingale $(Z_n, \mathcal{F}_n)$ with $Z_0 = 0$. Then, for any $\mathcal{F}_n$-stopping time $\theta$ and all $x, y > 0$,
$$\text{(5.2.2)}\qquad P(V_\theta \ge x,\ A_\theta \le y) \le x^{-1} E(A_\theta \wedge y).$$
Further, in this case $E[V_\theta^p] \le c_p E[A_\theta^p]$ for $c_p = 1 + 1/(1-p)$ and any $p \in (0,1)$.
Proof. Since $M_n = Z_n - A_n$ is a MG with respect to the filtration $\mathcal{F}_n$ (starting at $M_0 = 0$), by Theorem 5.1.32 the same applies for the stopped stochastic process $M_{n \wedge \theta}$, with $\theta$ any $\mathcal{F}_n$-stopping time. By the same reasoning $Z_{n \wedge \theta} = M_{n \wedge \theta} + A_{n \wedge \theta}$ is a sub-MG with respect to $\mathcal{F}_n$. Applying Doob's inequality (5.2.1) for this non-negative sub-MG we deduce that for any $n$ and $x > 0$,
$$P(V_{n \wedge \theta} \ge x) = P(\max_{0 \le k \le n} Z_{k \wedge \theta} \ge x) \le x^{-1} E[Z_{n \wedge \theta}] = x^{-1} E[A_{n \wedge \theta}].$$
Both $V_n$ and $A_n$ are non-negative and non-decreasing in $n$ (see Exercise 5.2.2), so by monotone convergence we have that $P(V_\theta \ge x) \le x^{-1} E A_\theta$. In particular, fixing $y > 0$, since $A_n$ is $\mathcal{F}_n$-predictable, $\gamma = \min\{n \ge 0 : A_{n+1} > y\}$ is an $\mathcal{F}_n$-stopping time. Further, with $A_n$ non-decreasing, $A_n \le y$ for all $n \le \gamma$, so $A_{\gamma \wedge \theta} \le A_\theta \wedge y$, and on the event $\{A_\theta \le y\}$ necessarily $\gamma \ge \theta$, whence $\{V_\theta \ge x,\ A_\theta \le y\} \subseteq \{V_{\gamma \wedge \theta} \ge x\}$. Applying the preceding bound with the stopping time $\gamma \wedge \theta$ yields
$$P(V_\theta \ge x,\ A_\theta \le y) \le P(V_{\gamma \wedge \theta} \ge x) \le x^{-1} E[A_{\gamma \wedge \theta}] \le x^{-1} E[A_\theta \wedge y],$$
which is (5.2.2). Finally, fixing $p \in (0,1)$ and setting $Y = A_\theta$, from (5.2.2) with $y = x$ we get the tail bound $P(V_\theta \ge x) \le P(Y > x) + x^{-1} E[Y \wedge x]$, and integrating it against $p x^{p-1}\,dx$ over $(0,\infty)$ results with
$$E[V_\theta^p] \le E Y^p + (1-p)^{-1} E Y^p,$$
as claimed.
To practice your understanding, adapt the proof of Doob's inequality en route to the following dual inequality (which is often called Doob's second sub-MG inequality).

Exercise 5.2.8. Show that for any sub-MG $X_n$, finite $n \ge 0$ and $x > 0$,
$$\text{(5.2.3)}\qquad P(\min_{0 \le k \le n} X_k \le -x) \le x^{-1}\big(E[(X_n)_+] - E[X_0]\big).$$
Here is a typical example of an application of Doob's inequality.

Exercise 5.2.9. Fixing $s > 0$, the independent variables $Z_n$ are such that $P(Z_n = 1) = P(Z_n = -1) = n^{-s}/2$ and $P(Z_n = 0) = 1 - n^{-s}$. Starting at $Y_0 = 0$, for $n \ge 1$ let
$$Y_n = n^s\,Y_{n-1}|Z_n| + Z_n I_{\{Y_{n-1} = 0\}}.$$
(a) Show that $Y_n$ is a martingale and that for any $x > 0$ and $n \ge 1$,
$$P(\max_{1 \le k \le n} Y_k \ge x) \le \frac{1}{2x}\Big[1 + \sum_{k=1}^{n-1}(k+1)^{-s}(1 - k^{-s})\Big].$$
(b) Show that $Y_n \overset{p}{\to} 0$ as $n \to \infty$ and further $Y_n \overset{a.s.}{\to} 0$ if and only if $s > 1$, but there is no value of $s$ for which $Y_n \overset{L^1}{\to} 0$.
Martingales also provide bounds on the probability that the sum of bounded independent variables is too close to its mean (in lieu of the CLT).

Exercise 5.2.10. Let $S_n = \sum_{k=1}^n \xi_k$ where $\xi_k$ are independent and $E\xi_k = 0$, $|\xi_k| \le K$ for all $k$. Let $s_n^2 = \sum_{k=1}^n E\xi_k^2$. Using Corollary 5.1.33 for the martingale $S_n^2 - s_n^2$ and a suitable stopping time show that
$$P(\max_{1 \le k \le n} |S_k| \le x) \le (x + K)^2 / s_n^2.$$
If the positive part of the sub-MG has finite $p$-th moment you can improve the rate of decay in $x$ in Doob's inequality by an application of Proposition 5.1.22 for the convex non-decreasing function $\psi(y) = \max(y,0)^p$, denoted hereafter by $(y)_+^p$. Further, in case of a MG the same argument yields comparable bounds on tail probabilities for the maximum of $|Y_k|$.

Exercise 5.2.11.
(a) Show that for any sub-MG $Y_n$, $p \ge 1$, finite $n \ge 0$ and $y > 0$,
$$P(\max_{0 \le k \le n} Y_k \ge y) \le y^{-p}\,E\big[\max(Y_n, 0)^p\big].$$
(b) Show that in case $Y_n$ is a martingale, also
$$P(\max_{1 \le k \le n} |Y_k| \ge y) \le y^{-p}\,E\big[|Y_n|^p\big].$$
(c) Suppose the martingale $Y_n$ is such that $Y_0 = 0$. Using the fact that $(Y_n + c)^2$ is a sub-martingale and optimizing over $c$, show that for $y > 0$,
$$P(\max_{0 \le k \le n} Y_k \ge y) \le \frac{EY_n^2}{EY_n^2 + y^2}.$$
Here is the version of Doob's inequality for non-negative sup-MGs and its application to the random walk.

Exercise 5.2.12.
(a) Show that if $\tau$ is a stopping time for the canonical filtration of a non-negative supermartingale $X_n$ then $EX_0 \ge EX_{n \wedge \tau} \ge E[X_\tau I_{\{\tau \le n\}}]$ for any finite $n$.
(b) Deduce that if $X_n$ is a non-negative supermartingale then for any $x > 0$,
$$P(\sup_k X_k \ge x) \le x^{-1} EX_0.$$
(c) Suppose $S_n$ is a random walk whose increments satisfy $E\xi_1 = -\mu < 0$ and $\mathrm{Var}(\xi_1) = \sigma^2 > 0$. Let $\epsilon = \mu/(\sigma^2 + \mu^2)$ and $f(x) = 1/(1 + \epsilon(z - x)_+)$. Show that $f(S_n)$ is a supermartingale and use this to conclude that for any $z > 0$,
$$P(\sup_k S_k \ge z) \le \frac{1}{1 + \epsilon z}.$$
Hint: Taking $v(x) = \epsilon f(x)^2 1_{\{x < z\}}$ show that $g_x(y) = f(x) + v(x)[(y - x) + \epsilon(y - x)^2] \ge f(y)$ for all $x$ and $y$. Then show that $f(S_n) = E[g_{S_n}(S_{n+1}) \mid S_k, k \le n]$.
Integrating Doob's inequality we next get bounds on the moments of the maximum of a sub-MG.

Corollary 5.2.13 ($L^p$ maximal inequalities). If $X_n$ is a sub-MG then for any $n$ and $p > 1$,
$$\text{(5.2.4)}\qquad E\big[(\max_{k \le n} X_k)_+^p\big] \le q^p\,E\big[(X_n)_+^p\big],$$
where $q = p/(p-1)$ is a finite universal constant. Consequently, if $Y_n$ is a MG then for any $n$ and $p > 1$,
$$\text{(5.2.5)}\qquad E\big[\big(\max_{k \le n} |Y_k|\big)^p\big] \le q^p\,E\big[|Y_n|^p\big].$$
Proof. The bound (5.2.4) is obtained by applying part (b) of Lemma 1.4.31 for the non-negative variables $X = (X_n)_+$ and $Y = (\max_{k \le n} X_k)_+$. Indeed, the hypothesis $P(Y \ge y) \le y^{-1} E[X I_{\{Y \ge y\}}]$ of this lemma is provided by the left inequality in (5.2.1), and its conclusion that $EY^p \le q^p EX^p$ is precisely (5.2.4). In case $Y_n$ is a martingale, we get (5.2.5) by applying (5.2.4) for the non-negative sub-MG $X_n = |Y_n|$.

Remark. A bound such as (5.2.5) can not hold for all sub-MGs. For example, the non-random sequence $Y_k = (k - n) \wedge 0$ is a sub-MG with $|Y_0| = n$ but $Y_n = 0$.
The following two exercises show that while $L^p$ maximal inequalities as in Corollary 5.2.13 can not hold for $p = 1$, such an inequality does hold provided we replace $E(X_n)_+$ in the bound by $E[(X_n)_+ \log \max(X_n, 1)]$.

Exercise 5.2.14. Consider the martingale $M_n = \prod_{k=1}^n Y_k$ for i.i.d. non-negative random variables $Y_k$ with $EY_1 = 1$ and $P(Y_1 = 1) < 1$.
(a) Explain why $E(\log Y_1)_+$ is finite and why the strong law of large numbers implies that $n^{-1} \log M_n$ converges a.s. to $E[\log Y_1] < 0$ when $n \to \infty$.
(b) Deduce that $M_n \overset{a.s.}{\to} 0$ as $n \to \infty$ and that consequently $M_n$ is not uniformly integrable.
(c) Show that if (5.2.4) applied for $p = 1$ and some $q < \infty$, then any non-negative martingale would have been uniformly integrable.
Exercise 5.2.15. Show that if $X_n$ is a non-negative sub-MG then
$$E\big[\max_{k \le n} X_k\big] \le (1 - e^{-1})^{-1}\big\{1 + E[X_n (\log X_n)_+]\big\}.$$
Hint: Apply part (c) of Lemma 1.4.31 and recall that $x(\log y)_+ \le e^{-1} y + x(\log x)_+$ for any $x, y \ge 0$.

We just saw that in general $L^1$-bounded martingales might not be U.I. Nevertheless, as you show next, for sums of independent zero-mean random variables these two properties are equivalent.
Exercise 5.2.16. Suppose $S_n = \sum_{k=1}^n \xi_k$ with $\xi_k$ independent.
(a) Prove Ottaviani's inequality. Namely, show that for any $n$ and $t, s \ge 0$,
$$P(\max_{1 \le k \le n} |S_k| \ge t + s) \le P(|S_n| \ge t) + P(\max_{1 \le k \le n} |S_k| \ge t + s)\,\max_{1 \le k \le n} P(|S_n - S_k| > s).$$
(b) Suppose further that $\xi_k$ is integrable, $E\xi_k = 0$ and $\sup_n E|S_n| < \infty$. Show that then $E[\sup_k |S_k|]$ is finite.
In the spirit of Doob's inequality bounding the tail probability of the maximum of a sub-MG $X_k$, $k = 0, 1, \ldots, n$, in terms of the value of $X_n$, we will bound the oscillations of $X_k$, $k = 0, 1, \ldots, n$, over an interval $[a, b]$ in terms of $X_0$ and $X_n$. To this end, we require the following definition of up-crossings.

Definition 5.2.17. The number of up-crossings of the interval $[a, b]$ by $X_k(\omega)$, $k = 0, 1, \ldots, n$, denoted $U_n[a,b](\omega)$, is the largest $\ell \in \mathbb{Z}_+$ such that $X_{s_i}(\omega) < a$ and $X_{t_i}(\omega) > b$ for $1 \le i \le \ell$ and some $0 \le s_1 < t_1 < \cdots < s_\ell < t_\ell \le n$.
For example, Fig. 1 depicts two up-crossings of $[a, b]$.

Figure 1. Illustration of up-crossings of $[a, b]$ by $X_k(\omega)$
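The waiting scheme of Definition 5.2.17 (which also underlies the sequence $V_k$ in the proof of Lemma 5.2.18) translates directly into a counter; a minimal sketch, using the strict inequalities $X_{s_i} < a$ and $X_{t_i} > b$ of the definition:

```python
# Count up-crossings of [a, b] per Definition 5.2.17: wait for the sequence
# to enter (-inf, a), then wait for it to enter (b, inf); each completed
# passage from below a to above b is one up-crossing.
def upcrossings(xs, a, b):
    """Number of up-crossings of [a, b] by the finite sequence xs."""
    count, waiting_to_exceed_b = 0, False
    for x in xs:
        if not waiting_to_exceed_b:
            if x < a:                 # entered (-inf, a): up-crossing starts
                waiting_to_exceed_b = True
        elif x > b:                   # entered (b, inf): up-crossing completes
            waiting_to_exceed_b = False
            count += 1
    return count

# A zig-zag path crossing [0, 2] upward exactly twice:
assert upcrossings([1, -1, 0, 3, 2, -2, 1, 4], a=0, b=2) == 2
assert upcrossings([5, 4, 3], a=0, b=2) == 0
print("upcrossing counts verified")
```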
Our next result, Doob's up-crossing inequality, is the key to the a.s. convergence of sup-MGs (and sub-MGs) on which Section 5.3 is based.

Lemma 5.2.18 (Doob's up-crossing inequality). If $X_n$ is a sup-MG then for all $a < b$,
$$\text{(5.2.6)}\qquad (b - a)\,E(U_n[a,b]) \le E[(X_n - a)_-] - E[(X_0 - a)_-].$$
Proof. Fixing $a < b$, let $V_1 = I_{\{X_0 < a\}}$ and for $n = 2, 3, \ldots$, define recursively $V_n = I_{\{V_{n-1} = 1, X_{n-1} \le b\}} + I_{\{V_{n-1} = 0, X_{n-1} < a\}}$. Informally, the sequence $V_k$ is zero while waiting for the process $X_n$ to enter $(-\infty, a)$, after which time it reverts to one and stays so while waiting for this process to enter $(b, \infty)$. See Figure 1 for an illustration, in which black circles depict indices $k$ such that $V_k = 1$ and open circles indicate those values of $k$ with $V_k = 0$. Clearly, the sequence $V_n$ is predictable for the canonical filtration of $X_n$. Let $Y_n$ denote the martingale transform of $V_n$ with respect to $X_n$ (per Definition 5.1.27). By the choice of $V_n$, each completed up-crossing of the interval $[a, b]$ contributes at least $(b - a)$ to $Y_n$, and if $X_0 < a$ then the first of these contributes at least $(b - a) + (X_0 - a)_-$. The only other contribution to $Y_n$ is by the up-crossing of the interval $[a, b]$ that is in progress at time $n$ (if there is such), and since it started at value at most $a$, its contribution to $Y_n$ is at least $-(X_n - a)_-$. Altogether,
$$Y_n \ge (b - a)\,U_n[a,b] + (X_0 - a)_- - (X_n - a)_-.$$
With $V_n$ non-negative, bounded and predictable and $X_n$ a sup-MG, the transform $Y_n$ is a supermartingale (see parts (b) and (c) of Theorem 5.1.28). Thus, considering the expectation of the preceding inequality yields the up-crossing inequality (5.2.6), since $0 = EY_0 \ge EY_n$ for the sup-MG $Y_n$.
Doob's up-crossing inequality implies that the total number of up-crossings of $[a, b]$ by a non-negative sup-MG has a finite expectation. In this context, Dubins' up-crossing inequality, which you are to derive next, provides universal (i.e. depending only on $a/b$), exponential bounds on tail probabilities of this random variable.
Exercise 5.2.19. Suppose $(X^1_n, \mathcal{F}_n)$ and $(X^2_n, \mathcal{F}_n)$ are both sup-MGs and $\tau$ is an $\mathcal{F}_n$-stopping time such that $X^1_\tau \ge X^2_\tau$.
(a) Show that $W_n = X^1_n I_{\{\tau > n\}} + X^2_n I_{\{\tau \le n\}}$ is a sup-MG with respect to $\mathcal{F}_n$ and deduce that so is $Y_n = X^1_n I_{\{\tau \ge n\}} + X^2_n I_{\{\tau < n\}}$ (this is sometimes called the switching principle).
(b) For a sup-MG $X_n \ge 0$ and constants $b > a > 0$ define the $\mathcal{F}^X_n$-stopping times $\theta_0 = -1$, $\tau_\ell = \inf\{k > \theta_\ell : X_k \le a\}$ and $\theta_{\ell+1} = \inf\{k > \tau_\ell : X_k \ge b\}$, $\ell = 0, 1, \ldots$. That is, the $\ell$-th up-crossing of $(a, b)$ by $X_n$ starts at $\tau_{\ell-1}$ and ends at $\theta_\ell$. For $\ell = 0, 1, \ldots$ let $Z_n = a^{-\ell} b^{\ell}$ when $n \in [\theta_\ell, \tau_\ell)$ and $Z_n = a^{-\ell-1} b^{\ell} X_n$ for $n \in [\tau_\ell, \theta_{\ell+1})$. Show that $(Z_n, \mathcal{F}^X_n)$ is a sup-MG.
(c) For $b > a > 0$ let $U_\infty[a, b]$ denote the total number of up-crossings of $[a, b]$ by $X_n$ and deduce from part (b) that for $\ell = 0, 1, \ldots$,
$$P(U_\infty[a, b] \ge \ell) \le \Big(\frac{a}{b}\Big)^{\ell} E[\min(X_0/a, 1)]$$
(this is Dubins' up-crossing inequality).
5.3. The convergence of Martingales

As we shall see in this section, a sub-MG (or a sup-MG) has an integrable limit under relatively mild integrability assumptions. For example, in this context $L^1$-boundedness (i.e. the finiteness of $\sup_n E|X_n|$) yields a.s. convergence (see Doob's convergence theorem), while the $L^1$ convergence of $X_n$ is equivalent to the stronger hypothesis of uniform integrability of this process (see Theorem 5.3.12). Finally, the even stronger $L^p$ convergence applies for the smaller sub-class of $L^p$-bounded martingales (see Doob's $L^p$ martingale convergence).
Indeed, these convergence results are closely related to the fact that the maximum and up-crossings counts of a sub-MG do not grow too rapidly (and the same applies for sup-MGs and martingales). To further explore this direction, we next link the finiteness of the total number of up-crossings $U_\infty[a,b] = \lim_n U_n[a,b]$ with the convergence of the sequence $X_n(\omega)$.

Lemma 5.3.1. If $U_\infty[a,b](\omega)$ is finite almost surely, for each pair $a < b \in \mathbb{Q}$, then as $n \to \infty$ the sequence $X_n(\omega)$ converges, almost surely, to a (possibly infinite) limit $X_\infty(\omega)$.

Proof. The sequence $X_n(\omega)$ fails to converge in $[-\infty, \infty]$ exactly on the event $\Gamma = \{\omega : \liminf_n X_n(\omega) < \limsup_n X_n(\omega)\}$, where $\Gamma = \bigcup_{a < b \in \mathbb{Q}} \Gamma_{a,b}$ and for each $b > a$,
$$\Gamma_{a,b} = \{\omega : \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega)\}.$$
Since $\Gamma$ is a countable union of these events, it thus suffices to show that $P(\Gamma_{a,b}) = 0$ for any $a, b \in \mathbb{Q}$, $a < b$. To this end note that if $\omega \in \Gamma_{a,b}$ then $\limsup_n X_n(\omega) > b$ and $\liminf_n X_n(\omega) < a$, so the sequence $X_n(\omega)$ takes values below $a$ and values above $b$ infinitely often; hence the total number of up-crossings of the interval $[a, b]$ by this sequence is infinite. That is,
$$\Gamma_{a,b} \subseteq \{\omega : U_\infty[a,b](\omega) = \infty\}.$$
With $U_\infty[a,b]$ finite almost surely it follows that $P(\Gamma_{a,b}) = 0$ for each $a < b$, resulting with the stated conclusion.
Combining Doob's up-crossing inequality of Lemma 5.2.18 with Lemma 5.3.1 we now prove Doob's a.s. convergence theorem for sup-MGs (and sub-MGs).

Theorem 5.3.2 (Doob's convergence theorem). Suppose the sup-MG $(X_n, \mathcal{F}_n)$ is such that $\sup_n E[(X_n)_-] < \infty$. Then, $X_n \overset{a.s.}{\to} X_\infty$ and $E|X_\infty| \le \liminf_n E|X_n|$ is finite.
Proof. Fixing $b > a$, recall that $0 \le U_n[a,b] \uparrow U_\infty[a,b]$ as $n \to \infty$, where by Doob's up-crossing inequality (5.2.6) and the elementary bound $(x - a)_- \le |a| + (x)_-$,
$$E(U_n[a,b]) \le (b - a)^{-1} E[(X_n - a)_-] \le (b - a)^{-1}\big(|a| + \sup_n E[(X_n)_-]\big).$$
Thus, our hypothesis that $\sup_n E[(X_n)_-] < \infty$ implies, by monotone convergence, that $E(U_\infty[a,b]) < \infty$, so $U_\infty[a,b]$ is a.s. finite for each pair $a < b$, and by Lemma 5.3.1, $X_n \overset{a.s.}{\to} X_\infty$.
Further, with $X_n$ a sup-MG, we have that $E|X_n| = EX_n + 2E(X_n)_- \le EX_0 + 2E(X_n)_-$ for all $n$. Using this observation in conjunction with Fatou's lemma for $0 \le |X_n| \overset{a.s.}{\to} |X_\infty|$, we deduce that
$$E|X_\infty| \le \liminf_n E|X_n| \le EX_0 + 2\sup_n E[(X_n)_-] < \infty,$$
as stated.
Remark. In particular, Doob's convergence theorem implies that if $(X_n, \mathcal{F}_n)$ is a non-negative sup-MG then $X_n \overset{a.s.}{\to} X_\infty$ with $X_\infty$ integrable (and in this case $EX_\infty \le EX_0$). The same convergence applies for a non-positive sub-MG and more generally, for any sub-MG with $\sup_n E(X_n)_+ < \infty$. Further, the following exercise provides alternative equivalent conditions for the applicability of Doob's convergence theorem.
Exercise 5.3.3. Show that the following five conditions are equivalent for any sub-MG $X_n$ (and if $X_n$ is a sup-MG, just replace $(X_n)_+$ by $(X_n)_-$).
(a) $\lim_n E|X_n|$ exists and is finite.
(b) $\sup_n E|X_n| < \infty$.
(c) $\liminf_n E|X_n| < \infty$.
(d) $\lim_n E(X_n)_+$ exists and is finite.
(e) $\sup_n E(X_n)_+ < \infty$.

Our first application of Doob's convergence theorem extends Doob's inequality (5.2.1) to the following bound on the maximal value of a U.I. sub-MG.
Corollary 5.3.4. For any U.I. sub-MG $X_n$ and $x > 0$,
$$\text{(5.3.1)}\qquad P(X_k \ge x \text{ for some } k < \infty) \le x^{-1} E[X_\infty I_{\{\tau_x < \infty\}}] \le x^{-1} E[(X_\infty)_+],$$
where $\tau_x = \min\{k \ge 0 : X_k \ge x\}$.
Proof. Let $A_n = \{\tau_x \le n\} = \{\max_{k \le n} X_k \ge x\}$ and $A_\infty = \{\tau_x < \infty\} = \{X_k \ge x$ for some $k < \infty\}$. Then, $A_n \uparrow A_\infty$ and, the U.I. sub-MG $X_n$ being $L^1$-bounded, by Doob's convergence theorem $X_n \overset{a.s.}{\to} X_\infty$. Consequently, $X_n I_{A_n}$ and $(X_n)_+$ converge almost surely to $X_\infty I_{A_\infty}$ and $(X_\infty)_+$, respectively. Since these two sequences are U.I. we further have that $E[X_n I_{A_n}] \to E[X_\infty I_{A_\infty}]$ and $E[(X_n)_+] \to E[(X_\infty)_+]$. Recall Doob's inequality (5.2.1), whereby
$$\text{(5.3.2)}\qquad P(A_n) \le x^{-1} E[X_n I_{A_n}] \le x^{-1} E[(X_n)_+]$$
for any $n$ finite. Taking $n \to \infty$ we conclude that
$$P(A_\infty) \le x^{-1} E[X_\infty I_{A_\infty}] \le x^{-1} E[(X_\infty)_+],$$
which is precisely our stated inequality (5.3.1).
Applying Doob's convergence theorem we also find that martingales of bounded differences either converge to a finite limit or oscillate between $-\infty$ and $+\infty$.

Proposition 5.3.5. Suppose $X_n$ is a martingale of uniformly bounded differences. That is, almost surely $\sup_n |X_n - X_{n-1}| \le c$ for some finite non-random constant $c$. Then, $P(A \cup B) = 1$ for the events
$$A = \{\omega : \lim_n X_n(\omega) \text{ exists and is finite}\},$$
$$B = \{\omega : \limsup_n X_n(\omega) = \infty \text{ and } \liminf_n X_n(\omega) = -\infty\}.$$
Proof. We may and shall assume without loss of generality that $X_0 = 0$ (otherwise, apply the proposition for the MG $Y_n = X_n - X_0$). Fixing a positive integer $k$, consider the stopping time $\tau_k(\omega) = \inf\{n \ge 0 : X_n(\omega) \ge k\}$ for the canonical filtration of $X_n$ and the associated stopped MG $Y_n = X_{n \wedge \tau_k}$ (per Theorem 5.1.32). By definition of $\tau_k$ and our hypothesis of $X_n$ having uniformly bounded differences, it follows that $Y_n(\omega) \le k + c$ for all $n$. Consequently, $\sup_n E[(Y_n)_-] = \sup_n E[(Y_n)_+] \le k + c$ is finite (recall that $EY_n = 0$), so by Doob's convergence theorem $Y_n(\omega) \to Y_\infty(\omega) \in \mathbb{R}$ for all $\omega \notin \Omega_k$ and some measurable $\Omega_k$ such that $P(\Omega_k) = 0$. In particular, if $\tau_k(\omega) = \infty$ and $\omega \notin \Omega_k$ then $X_n(\omega) = Y_n(\omega)$ has a finite limit, so $\omega \in A$. This shows that $A^c \cap \{\tau_k = \infty\} \subseteq \Omega_k$ for all $k$, and hence $A^c \subseteq B_+ \cup \bigcup_k \Omega_k$, where
$$B_+ = \bigcap_k \{\tau_k < \infty\} = \{\omega : \limsup_n X_n(\omega) = \infty\}.$$
With $P(\Omega_k) = 0$ for all $k$, we thus deduce that $P(A \cup B_+) = 1$. Applying the same argument for the MG $-X_n$ yields $P(A \cup B_-) = 1$, where $B_- = \{\omega : \liminf_n X_n(\omega) = -\infty\}$, and consequently $P(A \cup (B_- \cap B_+)) = 1$, as stated.
Remark. Consider a random walk $S_n = \sum_{k=1}^n \xi_k$ with zero-mean, bounded increments $\xi_k$ (i.e. $|\xi_k| \le c$ with $c$ a finite non-random constant). Then, $v = E\xi_k^2$ is finite (and positive, excluding the trivial case of $\xi_k = 0$ a.s.), and the event $A$ where $S_n(\omega) \to S_\infty(\omega)$ as $n \to \infty$ for some finite $S_\infty(\omega)$, implies that $\widehat{S}_n = (nv)^{-1/2} S_n \to 0$. Combining the CLT $\widehat{S}_n \overset{D}{\to} G$ with Fatou's lemma and part (d) of the Portmanteau theorem we find that for any $\epsilon > 0$,
$$P(A) \le E[\liminf_n I_{\{|\widehat{S}_n| \le \epsilon\}}] \le \liminf_n P(|\widehat{S}_n| \le \epsilon) = P(|G| \le \epsilon).$$
Taking $\epsilon \downarrow 0$ we deduce that $P(A) = 0$. Hence, by Proposition 5.3.5 such a random walk is an example of a non-converging MG for which a.s.
$$\limsup_n S_n = \infty \quad\text{and}\quad \liminf_n S_n = -\infty.$$
Here is another application of the preceding proposition.
Exercise 5.3.6. Consider the $\mathcal{F}_n$-adapted $W_n \ge 0$, such that $\sup_n |W_{n+1} - W_n| \le K$ for some finite non-random constant $K$ and $W_0 = 0$. Suppose there exist non-random, positive constants $a$ and $b$ such that for all $n \ge 0$,
$$E[W_{n+1} - W_n + a \mid \mathcal{F}_n]\,I_{\{W_n \ge b\}} \le 0.$$
With $N_n = \sum_{k=1}^n I_{\{W_k < b\}}$, show that $P(N_\infty \text{ is finite}) = 0$.
Hint: Check that $X_n = W_n + an - (K + a)N_{n-1}$ is a sup-MG of uniformly bounded differences.
As we show next, Doob's convergence theorem leads to the integrability of $X_\tau$ for any $L^1$-bounded sub-MG $X_n$ and any stopping time $\tau$.

Lemma 5.3.7. If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\sup_n E[(X_n)_+] < \infty$ then $E|X_\tau| < \infty$ for any $\mathcal{F}_n$-stopping time $\tau$.

Proof. Since $((X_n)_+, \mathcal{F}_n)$ is a sub-MG (see Proposition 5.1.22), it follows that $E[(X_{n \wedge \tau})_+] \le E[(X_n)_+]$ for all $n$ (consider Theorem 5.1.32 for the sub-MG $(X_n)_+$ and the stopping time $\tau$). Thus, our hypothesis that $\sup_n E[(X_n)_+]$ is finite results with $\sup_n E[(Y_n)_+]$ finite, where $Y_n = X_{n \wedge \tau}$. Applying Doob's convergence theorem for the sub-MG $(Y_n, \mathcal{F}_n)$ we have that $Y_n \overset{a.s.}{\to} Y_\infty$ with $Y_\infty = X_\tau$ integrable (where on $\{\tau = \infty\}$, $X_\tau$ stands for $\lim_n X_n$).
We further get the following relation, which is key to establishing Doob's optional stopping for certain sup-MGs (and sub-MGs).

Proposition 5.3.8. Suppose $(X_n, \mathcal{F}_n)$ is a non-negative sup-MG and $\theta \le \tau$ are stopping times for the filtration $\mathcal{F}_n$. Then, $EX_\theta \ge EX_\tau$.

Proof. By Corollary 5.1.33, for the pair of bounded stopping times $\theta \wedge n \le \tau \wedge n$ we have $E[X_{\theta \wedge n}] \ge E[X_{\tau \wedge n}]$, which (since $\{\theta \ge n\} \subseteq \{\tau \ge n\}$) amounts to
$$E[X_\theta I_{\{\theta < n\}}] \ge E[X_\tau I_{\{\tau < n\}}] + E[X_n I_{\{\tau \ge n\}} I_{\{\theta < n\}}].$$
The sup-MG $X_n$ is non-negative, so by Doob's convergence theorem $X_n \overset{a.s.}{\to} X_\infty$ with $X_\infty$ integrable (and we set $X_\theta = X_\infty$ on $\{\theta = \infty\}$, and similarly for $X_\tau$). Hence, by Fatou's lemma,
$$\liminf_n E[X_n I_{\{\tau \ge n\}} I_{\{\theta < n\}}] \ge E[X_\infty I_{\{\tau = \infty\}} I_{\{\theta < \infty\}}].$$
Further, by monotone convergence $E[X_\theta I_{\{\theta < n\}}] \uparrow E[X_\theta I_{\{\theta < \infty\}}]$ and $E[X_\tau I_{\{\tau < n\}}] \uparrow E[X_\tau I_{\{\tau < \infty\}}]$. Hence, taking $n \to \infty$ results with
$$E[X_\theta I_{\{\theta < \infty\}}] \ge E[X_\tau I_{\{\tau < \infty\}}] + E[X_\infty I_{\{\tau = \infty\}} I_{\{\theta < \infty\}}].$$
Adding the identity $E[X_\infty I_{\{\theta = \infty\}}] = E[X_\infty I_{\{\tau = \infty\}} I_{\{\theta = \infty\}}]$, which holds since $\{\theta = \infty\} \subseteq \{\tau = \infty\}$ for $\theta \le \tau$, yields the stated inequality $E[X_\theta] \ge E[X_\tau]$.

… $S_n = \sum_{i=1}^n \xi_i$, with zero-mean, independent but not identically distributed $\xi_i$.
We conclude this subsection with a few additional applications of Doob's convergence theorem.

Exercise 5.3.10. Suppose $X_n$ and $Y_n$ are non-negative, integrable processes adapted to the filtration $\mathcal{F}_n$ such that $\sum_{n \ge 1} Y_n < \infty$ a.s. and $E[X_{n+1} \mid \mathcal{F}_n] \le (1 + Y_n)X_n + Y_n$ for all $n$. Show that $X_n$ converges a.s. to a finite limit as $n \to \infty$.
Hint: Find a non-negative supermartingale $(W_n, \mathcal{F}_n)$ whose convergence implies that of $X_n$.
Exercise 5.3.11. Let $X_k$ be mutually independent but not necessarily integrable random variables, such that ….
(a) Fixing $c < \infty$ non-random, let $Y^{(c)}_n = \sum_{k=1}^n |S_{k-1}|\,I_{\{|S_{k-1}| \le c\}}\,X_k\,I_{\{|X_k| \le c\}}$. Show that $Y^{(c)}_n$ is a martingale with respect to the filtration $\mathcal{F}^X_n$ and that $\sup_n \|Y^{(c)}_n\|_2 < \infty$.
Hint: Kolmogorov's three series theorem may help in proving that $Y^{(c)}_n$ is $L^2$-bounded.
(b) Show that $Y_n = \sum_{k=1}^n |S_{k-1}|\,X_k$ converges a.s.
5.3.1. Uniformly integrable martingales. The main result of this subsection is the following $L^1$ convergence theorem for uniformly integrable (U.I.) sub-MGs (and sup-MGs).

Theorem 5.3.12. If $(X_n, \mathcal{F}_n)$ is a sub-MG, then $X_n$ is U.I. (c.f. Definition 1.3.47) if and only if $X_n \overset{L^1}{\to} X_\infty$ and $X_n \le E[X_\infty \mid \mathcal{F}_n]$ for all $n$.

Remark. If $X_n$ is uniformly integrable then $\sup_n E|X_n|$ is finite (see Lemma 1.3.48). Thus, the assumption of Theorem 5.3.12 is stronger than that of Theorem 5.3.2, as is its conclusion.
Proof. If $X_n$ is U.I. then $\sup_n E|X_n| < \infty$. For $X_n$ a sub-MG it thus follows by Doob's convergence theorem that $X_n \overset{a.s.}{\to} X_\infty$ with $X_\infty$ integrable. Obviously, this implies that $X_n \overset{p}{\to} X_\infty$, which together with the uniform integrability yields $X_n \overset{L^1}{\to} X_\infty$; conversely, an $L^1$-convergent sequence is necessarily U.I. To verify that in this case $X_n \le E[X_\infty \mid \mathcal{F}_n]$ for all $n$, recall that $X_m \le E[X_\ell \mid \mathcal{F}_m]$ for all $\ell > m$ and any sub-MG (see Proposition 5.1.20). Further, since $X_\ell \overset{L^1}{\to} X_\infty$ it follows that $E[X_\ell \mid \mathcal{F}_m] \overset{L^1}{\to} E[X_\infty \mid \mathcal{F}_m]$ as $\ell \to \infty$, per fixed $m$ (see Theorem 4.2.30). The latter implies the convergence a.s. of these conditional expectations along some sub-sequence $\ell_k$ (c.f. Theorem 2.2.10). Hence, we conclude that for any $m$, a.s.
$$X_m \le \liminf_k E[X_{\ell_k} \mid \mathcal{F}_m] = E[X_\infty \mid \mathcal{F}_m],$$
i.e., $X_n \le E[X_\infty \mid \mathcal{F}_n]$ for all $n$.
The preceding theorem identifies the collection of U.I. martingales as merely the set of all Doob's martingales, a concept we now define.

Definition 5.3.13. The sequence $X_n = E[X \mid \mathcal{F}_n]$, with $X$ an integrable R.V. and $\{\mathcal{F}_n\}$ a filtration, is called Doob's martingale of $X$ with respect to $\mathcal{F}_n$.

Corollary 5.3.14. A martingale $(X_n, \mathcal{F}_n)$ is U.I. if and only if $X_n = E[X_\infty \mid \mathcal{F}_n]$ is a Doob's martingale with respect to $\mathcal{F}_n$, or equivalently if and only if $X_n \overset{L^1}{\to} X_\infty$.

Proof. Theorem 5.3.12 states that a sub-MG (hence also a MG) is U.I. if and only if it converges in $L^1$, and in this case $X_n \le E[X_\infty \mid \mathcal{F}_n]$. Applying this theorem also for the sub-MG $-X_n$ we deduce that a U.I. martingale is necessarily a Doob's martingale of the form $X_n = E[X_\infty \mid \mathcal{F}_n]$. Conversely, the sequence $X_n = E[X \mid \mathcal{F}_n]$ for some integrable $X$ and a filtration $\{\mathcal{F}_n\}$ is U.I. (see Proposition 4.2.33).
We next generalize Theorem 4.2.26 about dominated convergence of C.E.

Theorem 5.3.15 (Lévy's upward theorem). Suppose $\sup_m |X_m|$ is integrable, $X_n \overset{a.s.}{\to} X_\infty$ and $\mathcal{F}_n \uparrow \mathcal{F}_\infty$. Then $E[X_n \mid \mathcal{F}_n] \to E[X_\infty \mid \mathcal{F}_\infty]$ almost surely and in $L^1$.

Remark. The integrability of $\sup_m |X_m|$ cannot be dispensed with here: the a.s. convergence of $X_n$ alone is in general not enough even for the a.s. convergence of $E[X_n \mid \mathcal{G}]$ to $E[X_\infty \mid \mathcal{G}]$ (per a fixed $\sigma$-algebra $\mathcal{G}$).
Proof. Consider first the special case where $X_n = X$ does not depend on $n$. Then, $Y_n = E[X \mid \mathcal{F}_n]$ is a U.I. martingale. Therefore, $E[Y_\infty \mid \mathcal{F}_n] = E[X \mid \mathcal{F}_n]$ for all $n$, where $Y_\infty$ denotes the a.s. and $L^1$ limit of $Y_n$ (see Corollary 5.3.14); clearly $Y_\infty = \lim_n Y_n \in m\mathcal{F}_\infty$. Further, by definition of the C.E., $E[X I_A] = E[Y_\infty I_A]$ for all $A$ in the $\pi$-system $\mathcal{P} = \bigcup_n \mathcal{F}_n$, hence with $\mathcal{F}_\infty = \sigma(\mathcal{P})$ we deduce that $Y_\infty = E[X \mid \mathcal{F}_\infty]$.
Turning to the general case, let $Z = \sup_m |X_m|$ and $W_k = \sup_{m \ge k} |X_m - X_\infty|$, noting that for all $n \ge k$,
$$\big|E[X_n \mid \mathcal{F}_n] - E[X_\infty \mid \mathcal{F}_n]\big| \le E[|X_n - X_\infty| \mid \mathcal{F}_n] \le E[W_k \mid \mathcal{F}_n].$$
Consequently, considering $n \to \infty$ we find by the special case of the theorem where $X_n$ is replaced by $W_k$ independent of $n$ (which we already proved), that
$$\limsup_n \big|E[X_n \mid \mathcal{F}_n] - E[X_\infty \mid \mathcal{F}_n]\big| \le \lim_n E[W_k \mid \mathcal{F}_n] = E[W_k \mid \mathcal{F}_\infty].$$
Similarly, we know that $E[X_\infty \mid \mathcal{F}_n] \overset{a.s.}{\to} E[X_\infty \mid \mathcal{F}_\infty]$. Further, by definition $W_k \downarrow 0$ a.s. when $k \to \infty$, so also $E[W_k \mid \mathcal{F}_\infty] \downarrow 0$ a.s. (as $0 \le W_k \le 2Z$ is integrable), and combining the preceding bounds results with $E[X_n \mid \mathcal{F}_n] \overset{a.s.}{\to} E[X_\infty \mid \mathcal{F}_\infty]$ as stated. Finally, since $|E[X_n \mid \mathcal{F}_n]| \le E[Z \mid \mathcal{F}_n]$ for all $n$, it follows that $E[X_n \mid \mathcal{F}_n]$ is U.I. and hence the a.s. convergence of this sequence to $E[X_\infty \mid \mathcal{F}_\infty]$ implies also its $L^1$ convergence.

Considering $X_n = I_A$ and $A \in \mathcal{F}_\infty$ yields the following corollary.
Corollary 5.3.16 (Lévy's 0-1 law). If $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ and $A \in \mathcal{F}_\infty$, then $E[I_A \mid \mathcal{F}_n] \overset{a.s.}{\to} I_A$.

As shown in the sequel, Kolmogorov's 0-1 law about P-triviality of the tail $\sigma$-algebra $\mathcal{T}^X = \bigcap_n \mathcal{T}^X_n$ of independent random variables is a special case of Lévy's 0-1 law.
Proof of Corollary 1.4.10. Let $\mathcal{F}^X_\infty = \sigma(\bigcup_n \mathcal{F}^X_n)$. Recall Definition 1.4.9, whereby $\mathcal{T}^X \subseteq \mathcal{T}^X_n \subseteq \mathcal{F}^X_\infty$ for all $n$. Thus, by Lévy's 0-1 law $E[I_A \mid \mathcal{F}^X_n] \overset{a.s.}{\to} I_A$ for any $A \in \mathcal{T}^X$. By assumption $X_k$ are P-mutually independent, hence for any $A \in \mathcal{T}^X$ the R.V. $I_A \in m\mathcal{T}^X_n$ is independent of the $\sigma$-algebra $\mathcal{F}^X_n$. Consequently, $E[I_A \mid \mathcal{F}^X_n] \overset{a.s.}{=} P(A)$ for all $n$. We deduce that $P(A) \overset{a.s.}{=} I_A$, implying that $P(A) \in \{0, 1\}$ for all $A \in \mathcal{T}^X$, as stated.
The generalization of Theorem 4.2.30 which you derive next also relaxes the assumptions of Lévy's upward theorem in case only $L^1$ convergence is of interest.

Exercise 5.3.17. Show that if $X_n \overset{L^1}{\to} X_\infty$ and $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ then $E[X_n \mid \mathcal{F}_n] \overset{L^1}{\to} E[X_\infty \mid \mathcal{F}_\infty]$.
Here is an example of the importance of uniform integrability when dealing with
convergence.
Exercise 5.3.18. Suppose $X_n \overset{a.s.}{\to} 0$ are $[0,1]$-valued random variables and $M_n$ is a non-negative MG.
(a) Provide an example where $E[X_n M_n] = 1$ for all $n$ finite.
(b) Show that if $M_n$ is U.I. then $E[X_n M_n] \to 0$.
Definition 5.3.19. A continuous function $x : [0,1) \to \mathbb{R}$ is absolutely continuous if for every $\epsilon > 0$ there exists $\delta > 0$ such that for all $k < \infty$ and $s_1 < t_1 \le s_2 < t_2 \le \cdots \le s_k < t_k \in [0,1)$,
$$\sum_{\ell=1}^k |t_\ell - s_\ell| \le \delta \quad\Longrightarrow\quad \sum_{\ell=1}^k |x(t_\ell) - x(s_\ell)| \le \epsilon.$$

The next exercise uses convergence properties of MGs to prove a classical result in real analysis, namely, that an absolutely continuous function is differentiable for Lebesgue a.e. $t \in [0,1)$.
Exercise 5.3.20. On the probability space $([0,1), \mathcal{B}, U)$ consider the events
$$A_{i,n} = [(i-1)2^{-n}, i2^{-n}) \quad\text{for } i = 1, \ldots, 2^n,\ n = 0, 1, \ldots,$$
and the associated $\sigma$-algebras $\mathcal{F}_n = \sigma(A_{i,n}, i = 1, \ldots, 2^n)$.
(a) Write an explicit formula for $E[h \mid \mathcal{F}_n]$, for $h \in L^1([0,1), \mathcal{B}, U)$.
(b) For $h_{i,n} = 2^n\big(x(i2^{-n}) - x((i-1)2^{-n})\big)$, show that $X_n(t) = \sum_{i=1}^{2^n} h_{i,n} I_{A_{i,n}}(t)$ is a martingale with respect to $\mathcal{F}_n$.
(c) Show that for absolutely continuous $x(\cdot)$ the martingale $X_n$ is U.I.
Hint: Show that $P(|X_n| > \lambda) \le c/\lambda$ for some constant $c < \infty$ and all $n, \lambda > 0$.
(d) Show that then there exists $h \in L^1([0,1), \mathcal{B}, U)$ such that
$$x(t) - x(s) = \int_s^t h(u)\,du \quad\text{for all } 1 > t \ge s \ge 0.$$
(e) Recall Lebesgue's theorem, that $\delta^{-1}\int_s^{s+\delta} |h(s) - h(u)|\,du \to 0$ as $\delta \downarrow 0$, for a.e. $s \in [0,1)$. Using it, conclude that $\frac{dx}{dt} = h$ for almost every $t \in [0,1)$.
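For a concrete instance of this construction (an illustration, not part of the exercise), take $x(t) = t^2$: then $h_{i,n} = (2i-1)2^{-n}$ is the average of $x'(u) = 2u$ over $A_{i,n}$, and $X_n(t) \to 2t$, as part (e) predicts.

```python
from fractions import Fraction

# Dyadic difference-quotient martingale of Exercise 5.3.20 for the
# absolutely continuous choice x(t) = t^2 (an assumed concrete case):
# X_n(t) is the slope of x over the dyadic interval A_{i,n} containing t,
# and it converges to x'(t) = 2t.
def x(t):
    return t * t

def X_n(t, n):
    """Value at t of the dyadic martingale X_n for x(t) = t^2."""
    i = int(t * 2 ** n) + 1                 # t lies in A_{i,n}
    left = Fraction(i - 1, 2 ** n)
    right = Fraction(i, 2 ** n)
    return (x(right) - x(left)) * 2 ** n    # = h_{i,n} = (2i - 1) 2^{-n}

t = Fraction(1, 3)
for n in range(1, 20):
    # the slope over an interval of length 2^-n is within 2^-n of 2t
    assert abs(X_n(t, n) - 2 * t) <= Fraction(1, 2 ** n)
print("X_n(1/3) -> 2/3 = x'(1/3)")
```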
Recall that uniformly bounded $p$-th moment for some $p > 1$ implies U.I. (see Exercise 1.3.54). Strengthening the $L^1$ convergence of Theorem 5.3.12, the next proposition shows that an $L^p$-bounded martingale converges to its a.s. limit also in $L^p$ (provided $p > 1$). In contrast to the preceding convergence results, this one does not hold for sub-MGs (or sup-MGs) which are not MGs. For example, let $\tau = \inf\{k \ge 1 : \xi_k = 0\}$ for independent $\xi_k$ such that $P(\xi_k \ne 0) = k^2/(k+1)^2$, so $P(\tau \ge n) = n^{-2}$, and verify that $X_n = n I_{\{\tau \ge n\}} \overset{a.s.}{\to} 0$ but $EX_n^2 = 1$, so this $L^2$-bounded sup-MG does not converge to zero in $L^2$.

Proposition 5.3.21 (Doob's $L^p$ martingale convergence). If the MG $X_n$ is such that $\sup_n E|X_n|^p < \infty$ for some $p > 1$, then $X_n \to X_\infty$ a.s. and in $L^p$, with $\|X_\infty\|_p \le \liminf_n \|X_n\|_p$.
Proof. Being $L^p$-bounded, the MG $X_n$ is $L^1$-bounded and Doob's martingale convergence theorem applies here, so $X_n \overset{a.s.}{\to} X_\infty$.
Further, applying Fatou's lemma for $|X_n|^p \ge 0$ we have that
$$\liminf_n E(|X_n|^p) \ge E(\liminf_n |X_n|^p) = E|X_\infty|^p,$$
as claimed, with $X_\infty \in L^p$. It thus suffices to verify that $E|X_n - X_\infty|^p \to 0$ as $n \to \infty$. To this end, with $c = \sup_n E|X_n|^p$ finite we have by the $L^p$ maximal inequality of (5.2.5) that $EZ_n \le q^p c$ for $Z_n = \max_{k \le n} |X_k|^p$ and any finite $n$. Since $0 \le Z_n \uparrow Z = \sup_{k < \infty} |X_k|^p$, we have by monotone convergence that $EZ \le q^p c$ is finite. As $X_n \overset{a.s.}{\to} X_\infty$, also $|X_\infty|^p \le Z$. Hence,
$$Y_n = |X_n - X_\infty|^p \le (|X_n| + |X_\infty|)^p \le 2^p Z.$$
With $Y_n \overset{a.s.}{\to} 0$ and $Y_n \le 2^p Z$ for integrable $Z$, we deduce by dominated convergence that $EY_n \to 0$ as $n \to \infty$, thus completing the proof of the proposition.
Remark. Proposition 5.3.21 does not have an $L^1$ analog. Indeed, as we have seen already in Exercise 5.2.14, there exists a non-negative MG $M_n$ such that $EM_n = 1$ for all $n$ while $M_n \overset{a.s.}{\to} 0$, so $M_n$ does not converge in $L^1$ to its a.s. limit.
Example 5.3.22. Consider the martingale $S_n = \sum_{k=1}^n \xi_k$ for independent, square-integrable, zero-mean random variables $\xi_k$ such that $\sum_k E\xi_k^2 < \infty$. Since $ES_n^2 = \sum_{k=1}^n E\xi_k^2$, it follows from Proposition 5.3.21 that the random series $S_n(\omega)$ converges a.s. and in $L^2$ to a finite limit $S_\infty(\omega)$.

Exercise 5.3.23. Let $Z_n = n^{-1/2}\sum_{k=1}^n \xi_k$ for i.i.d. $\xi_k \in L^2(\Omega, \mathcal{F}, P)$ of zero mean and unit variance. Let $\mathcal{F}_n = \sigma(\xi_k, k \le n)$ and $\mathcal{F}_\infty = \sigma(\xi_k, k < \infty)$.
(a) Prove that $EWZ_n \to 0$ for any fixed $W \in L^2(\Omega, \mathcal{F}_\infty, P)$.
(b) Deduce that the same applies for any $W \in L^2(\Omega, \mathcal{F}, P)$ and conclude that $Z_n$ does not converge in $L^2$.
(c) Show that though $Z_n \overset{D}{\to} G$, a standard normal variable, there exists no $Z_\infty \in m\mathcal{F}$ such that $Z_n \overset{p}{\to} Z_\infty$.
We conclude this subsection with the application of martingales to the study of Pólya's urn scheme.
Example 5.3.24 (Pólya's urn). Consider an urn that initially contains $r$ red and $b$ blue marbles. At the $k$-th step a marble is drawn at random from the urn, with all possible choices being equally likely, and it and $c_k$ more marbles of the same color are then returned to the urn. With $N_n = r + b + \sum_{k=1}^n c_k$ counting the number of marbles in the urn after $n$ iterations of this procedure, let $R_n$ denote the number of red marbles at that time and $M_n = R_n/N_n$ the corresponding fraction of red marbles. Since $R_{n+1} \in \{R_n, R_n + c_{n+1}\}$ with $P(R_{n+1} = R_n + c_{n+1} \mid \mathcal{F}^M_n) = R_n/N_n = M_n$, it follows that $E[R_{n+1} \mid \mathcal{F}^M_n] = R_n + c_{n+1} M_n = N_{n+1} M_n$. Consequently, $E[M_{n+1} \mid \mathcal{F}^M_n] = M_n$ for all $n$, with $M_n$ a uniformly bounded martingale.
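The martingale property implies $EM_n = M_0 = r/(r+b)$ for all $n$, which can be checked exactly by propagating the distribution of $R_n$; the sketch below (an illustration, assuming the concrete choice $r = 1$, $b = 2$ and $c_k = 2$ for all $k$) does so with exact rational arithmetic.

```python
from fractions import Fraction

# Exact check that the red fraction M_n = R_n / N_n in Polya's urn is a
# martingale: propagate the full distribution of R_n and verify that
# E[M_n] stays r/(r+b) (a martingale has constant mean). The parameters
# r=1, b=2, c_k=2 are an assumed concrete choice.
r, b, c = 1, 2, 2
dist = {Fraction(r): Fraction(1)}        # distribution of R_0
N = r + b                                # N_0 marbles in the urn
for n in range(6):
    new = {}
    for red, p in dist.items():
        p_red = red / N                  # chance of drawing red
        new[red + c] = new.get(red + c, Fraction(0)) + p * p_red
        new[red] = new.get(red, Fraction(0)) + p * (1 - p_red)
    dist, N = new, N + c
    mean_fraction = sum(red / N * p for red, p in dist.items())
    assert mean_fraction == Fraction(r, r + b)
print("E[M_n] = r/(r+b) for all tested n")
```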
For the study of Pólya's urn scheme we need the following definition.

Definition 5.3.25. The beta density with parameters $\alpha > 0$ and $\beta > 0$ is
$$f_{\alpha,\beta}(u) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,u^{\alpha-1}(1-u)^{\beta-1}\,1_{u \in [0,1]},$$
where $\Gamma(\alpha) = \int_0^\infty s^{\alpha-1}e^{-s}\,ds$ is finite and positive (compare with Definition 1.4.45). In particular, $\alpha = \beta = 1$ corresponds to the density $f_U(u)$ of the uniform measure on $(0,1]$, as in Example 1.2.40.
Exercise 5.3.26. Let $M_n$ be the martingale of Example 5.3.24.
(a) Show that $M_n \to M_\infty$ a.s. and in $L^p$ for any $p > 1$.
(b) Assuming further that $c_k = c$ for all $k \ge 1$, show that for $\ell = 0, \ldots, n$,
$$P(R_n = r + \ell c) = \binom{n}{\ell}\,\frac{\prod_{i=0}^{\ell-1}(r + ic)\,\prod_{j=0}^{n-\ell-1}(b + jc)}{\prod_{k=0}^{n-1}(r + b + kc)},$$
and deduce that $M_\infty$ has the beta density with parameters $\alpha = r/c$ and $\beta = b/c$.

Exercise 5.3.27. … $\sum_k u_k < \infty$ and $\sum_k a_k = \infty$.
Exercise 5.3.28. Fixing $b_n \in [\epsilon, 1]$ for some $\epsilon > 0$, suppose $X_n$ are $[0,1]$-valued, $\mathcal{F}_n$-adapted such that $X_{n+1} = (1 - b_n)X_n + b_n B_n$, $n \ge 0$, and $P(B_n = 1 \mid \mathcal{F}_n) = 1 - P(B_n = 0 \mid \mathcal{F}_n) = X_n$. Show that $X_n \overset{a.s.}{\to} X_\infty \in \{0, 1\}$ and $P(X_\infty = 1 \mid \mathcal{F}_0) = X_0$.
5.3.2. Square-integrable martingales. If $(X_n, \mathcal{F}_n)$ is a square-integrable martingale then $(X_n^2, \mathcal{F}_n)$ is a sub-MG, so by Doob's decomposition $X_n^2 = M_n + A_n$ for a non-decreasing $\mathcal{F}_n$-predictable sequence $A_n$ and a MG $(M_n, \mathcal{F}_n)$ with $M_0 = 0$. In the course of proving Doob's decomposition we saw that $A_n - A_{n-1} = E[X_n^2 - X_{n-1}^2 \mid \mathcal{F}_{n-1}]$, and part (b) of Exercise 5.1.8 provides the alternative expression $A_n - A_{n-1} = E[(X_n - X_{n-1})^2 \mid \mathcal{F}_{n-1}]$, motivating the following definition.

Definition 5.3.29. The sequence $A_n = X_0^2 + \sum_{k=1}^n E[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}]$ is called the predictable compensator of an $L^2$ martingale $(X_n, \mathcal{F}_n)$ and is denoted by angle brackets, that is, $A_n = \langle X \rangle_n$.
With $EM_n = EM_0 = 0$ it follows that $EX_n^2 = E\langle X \rangle_n$, so $X_n$ is $L^2$-bounded if and only if $\sup_n E\langle X \rangle_n$ is finite. Further, $\langle X \rangle_n$ is non-decreasing, so it converges to a limit, denoted hereafter $\langle X \rangle_\infty$. With $\langle X \rangle_n \ge \langle X \rangle_0 = X_0^2$ integrable it further follows by monotone convergence that $E\langle X \rangle_n \uparrow E\langle X \rangle_\infty$, so $X_n$ is $L^2$-bounded if and only if $E\langle X \rangle_\infty$ is finite.
To simplify the notations, assume hereafter that $X_0 = 0 = \langle X \rangle_0$, so $\langle X \rangle_n \ge 0$ (the transformation of our results to the general case is trivial).
We start with the following explicit bounds on $E[\sup_n |X_n|^p]$ for $p \le 2$, from which we deduce that $X_n$ is U.I. (hence converges a.s. and in $L^1$) whenever $\langle X \rangle_\infty^{1/2}$ is integrable.
Proposition 5.3.30. There exist finite constants $c_q$, $q \in (0, 1]$, such that if $(X_n, \mathcal{F}_n)$ is an $L^2$ MG with $X_0 = 0$, then
$$E[\sup_k |X_k|^{2q}] \le c_q\, E[\langle X \rangle_\infty^q]\,.$$
Remark. Our proof gives $c_q = (2-q)/(1-q)$ for $q < 1$ and $c_1 = 4$.
Proof. Let $V_n = \max_{k=0}^n |X_k|^2$, noting that $V_n \uparrow V_\infty = \sup_k |X_k|^2$ as $n \to \infty$. As already explained, $EX_n^2 = E\langle X \rangle_n \le E\langle X \rangle_\infty$, so by Doob's $L^2$ maximal inequality $EV_n \le 4EX_n^2 \le 4E\langle X \rangle_\infty$, and considering $n \to \infty$ we get our thesis for $q = 1$ (by monotone convergence).
Turning to the case of $0 < q < 1$, note that $(V_\infty)^q = \sup_k |X_k|^{2q}$. Further, the $\mathcal{F}_n$-predictable part in Doob's decomposition of the non-negative sub-martingale $Z_n = X_n^2$ is $A_n = \langle X \rangle_n$. Hence, applying Lemma 5.2.7 with $p = q$ for this sub-MG and its predictable part yields the stated bound.
Here is an application of Proposition 5.3.30 to the study of a certain class of random walks.

Exercise 5.3.31. Let $S_n = \sum_{k=1}^n \xi_k$ for i.i.d. $\xi_k$ of zero mean and finite second moment. Suppose $\tau$ is an $\mathcal{F}_n$-stopping time such that $E[\sqrt{\tau}]$ is finite.
(a) Compute the predictable compensator of the $L^2$ martingale $(S_n, \mathcal{F}_n)$.
(b) Deduce that $\{S_{n \wedge \tau}\}$ is U.I. and that $ES_\tau = 0$.
We deduce next from Proposition 5.3.30 that $X_n(\omega)$ converges to a finite limit for a.e. $\omega$ at which $\langle X \rangle_\infty(\omega)$ is finite.

Theorem 5.3.32. Suppose $(X_n, \mathcal{F}_n)$ is an $L^2$ martingale with $X_0 = 0$.
(a) $X_n(\omega)$ converges to a finite limit for a.e. $\omega$ for which $\langle X \rangle_\infty(\omega)$ is finite.
(b) $X_n(\omega)/\langle X \rangle_n(\omega) \to 0$ for a.e. $\omega$ for which $\langle X \rangle_\infty(\omega)$ is infinite.
(c) If the martingale differences $X_n - X_{n-1}$ are uniformly bounded, then the converse of part (a) holds. That is, $\langle X \rangle_\infty(\omega)$ is finite for a.e. $\omega$ for which $X_n(\omega)$ converges to a finite limit.

Proof. (a). Recall the proof of Proposition 5.3.30 (via Lemma 5.2.7), where we noted that $\theta_v = \min\{n \ge 0 : \langle X \rangle_{n+1} > v\}$ are $\mathcal{F}_n$-stopping times such that $\langle X \rangle_{n \wedge \theta_v} \le v$. Thus, setting $Y_n = X_{n \wedge \theta_k}$ for a positive integer $k$, the martingale $(Y_n, \mathcal{F}_n)$ is $L^2$-bounded, and as such it almost surely has a finite limit. Further, if $\langle X \rangle_\infty(\omega)$ is finite, then $\theta_k(\omega) = \infty$ for some random positive integer $k = k(\omega)$, in which case $X_{n \wedge \theta_k} = X_n$ for all $n$. Since we consider only countably many values of $k$, this yields the thesis of part (a) of the theorem.
(b). Since $V_n = (1 + \langle X \rangle_n)^{-1}$ is an $\mathcal{F}_n$-predictable sequence of bounded variables, its martingale transform $Y_n = \sum_{k=1}^n V_k (X_k - X_{k-1})$ with respect to the square-integrable martingale $X_n$ is also a square-integrable martingale for the filtration $\mathcal{F}_n$ (c.f. Theorem 5.1.28). Further, since $V_k \in m\mathcal{F}_{k-1}$, it follows that for all $k \ge 1$,
$$\langle Y \rangle_k - \langle Y \rangle_{k-1} = E[(Y_k - Y_{k-1})^2 | \mathcal{F}_{k-1}] = V_k^2\, E[(X_k - X_{k-1})^2 | \mathcal{F}_{k-1}] = \frac{\langle X \rangle_k - \langle X \rangle_{k-1}}{(1 + \langle X \rangle_k)^2} \le \frac{1}{1 + \langle X \rangle_{k-1}} - \frac{1}{1 + \langle X \rangle_k}$$
(as $(x-y)/(1+x)^2 \le (1+y)^{-1} - (1+x)^{-1}$ for all $x \ge y \ge 0$, and $\langle X \rangle_k \ge 0$ is non-decreasing in $k$). With $\langle Y \rangle_0 = \langle X \rangle_0 = 0$, adding the preceding inequalities over $k = 1, \ldots, n$, we deduce that $\langle Y \rangle_n \le 1 - 1/(1 + \langle X \rangle_n) \le 1$ for all $n$. Thus, by part (a) of the theorem, for almost every $\omega$, $Y_n(\omega)$ has a finite limit. That is, for a.e. $\omega$ the series $\sum_n x_n/b_n$ converges, where $b_n = 1 + \langle X \rangle_n(\omega)$ is a positive, non-decreasing sequence and $X_n(\omega) = \sum_{k=1}^n x_k$ for all $n$. If in addition to the convergence of this series also $\langle X \rangle_\infty(\omega) = \infty$, then $b_n \uparrow \infty$ and by Kronecker's lemma $X_n(\omega)/b_n \to 0$. In this case $b_n/(b_n - 1) \to 1$, so we conclude that then also $X_n/(b_n - 1) = X_n/\langle X \rangle_n \to 0$, which is exactly the thesis of part (b) of the theorem.
(c). Suppose that $P(\langle X \rangle_\infty = \infty,\ \sup_n |X_n| < \infty) > 0$. Then, there exists some finite $r$ such that $P(\langle X \rangle_\infty = \infty,\ \theta_r = \infty) > 0$ for the $\mathcal{F}_n$-stopping time $\theta_r = \inf\{m \ge 0 : |X_m| > r\}$. Since $\sup_m |X_m - X_{m-1}| \le c$ for some non-random finite constant $c$, we have that $|X_{n \wedge \theta_r}| \le r + c$, from which we deduce that $E\langle X \rangle_{n \wedge \theta_r} = EX_{n \wedge \theta_r}^2 \le (r+c)^2$ for all $n$. With $0 \le \langle X \rangle_{n \wedge \theta_r} \uparrow \langle X \rangle_{\theta_r}$, by monotone convergence also
$$E[\langle X \rangle_\infty I_{\theta_r = \infty}] \le E[\langle X \rangle_{\theta_r}] \le (r+c)^2\,.$$
This contradicts our assumption that $P(\langle X \rangle_\infty = \infty,\ \theta_r = \infty) > 0$. In conclusion, necessarily $P(\langle X \rangle_\infty = \infty,\ \sup_n |X_n| < \infty) = 0$. Consequently, with $\sup_n |X_n|$ finite on the set of $\omega$ values for which $X_n(\omega)$ converges to a finite limit, it follows that $\langle X \rangle_\infty$ is finite for a.e. such $\omega$, as claimed.

Proposition 5.3.33 (Lévy's extension of the Borel-Cantelli lemmas). Suppose $A_k \in \mathcal{F}_k$ for some filtration $\{\mathcal{F}_n\}$. Let $S_n = \sum_{k=1}^n I_{A_k}$ count the number of events occurring among the first $n$, with $S_\infty = \sum_k I_{A_k}$ the corresponding total number of occurrences. Similarly, let $Z_n = \sum_{k=1}^n \xi_k$ denote the sum of the first $n$ conditional probabilities $\xi_k = P(A_k | \mathcal{F}_{k-1})$, and $Z_\infty = \sum_k \xi_k$. Then, for almost every $\omega$,
(a) If $Z_\infty(\omega)$ is finite, then so is $S_\infty(\omega)$.
(b) If $Z_\infty(\omega)$ is infinite, then $S_n(\omega)/Z_n(\omega) \to 1$.
Remark. Given any sequence of events, by the tower property $E\xi_k = P(A_k)$ for all $k$, and setting $\mathcal{F}_n = \sigma(A_k, k \le n)$ guarantees that $A_k \in \mathcal{F}_k$ for all $k$. Hence:
(a) If $EZ_\infty = \sum_k P(A_k)$ is finite, then from part (a) of Proposition 5.3.33 we deduce that $\sum_k I_{A_k}$ is finite a.s., thus recovering the first Borel-Cantelli lemma.
(b) For $\mathcal{F}_n = \sigma(A_k, k \le n)$ and mutually independent events $A_k$ we have that $\xi_k = P(A_k)$ and $Z_n = ES_n$ for all $n$. Thus, in this case, part (b) of Proposition 5.3.33 is merely the statement that $S_n/ES_n \stackrel{a.s.}{\to} 1$ when $\sum_k P(A_k) = \infty$, which is your extension of the second Borel-Cantelli lemma via Exercise 2.2.26.
Proof. Clearly, $M_n = S_n - Z_n$ is square-integrable and $\mathcal{F}_n$-adapted. Further, as $M_n - M_{n-1} = I_{A_n} - E[I_{A_n} | \mathcal{F}_{n-1}]$ and $\mathrm{Var}(I_{A_n} | \mathcal{F}_{n-1}) = \xi_n(1 - \xi_n)$, it follows that the predictable compensator of the $L^2$ martingale $(M_n, \mathcal{F}_n)$ is $\langle M \rangle_n = \sum_{k=1}^n \xi_k (1 - \xi_k)$. Hence, $\langle M \rangle_n \le Z_n$ for all $n$, and if $Z_\infty(\omega)$ is finite, then so is $\langle M \rangle_\infty(\omega)$. By part (a) of Theorem 5.3.32, for a.e. such $\omega$ the finite limit $M_\infty(\omega)$ of $M_n(\omega)$ exists, implying that $S_\infty = M_\infty + Z_\infty$ is finite as well.
With $S_n = M_n + Z_n$, it suffices for part (b) of the proposition to show that $M_n/Z_n \to 0$ for a.e. $\omega$ for which $Z_\infty(\omega) = \infty$. Indeed, if $\langle M \rangle_\infty(\omega)$ is finite, then by part (a) of Theorem 5.3.32 the sequence $M_n(\omega)$ has a finite limit while $Z_n(\omega) \uparrow \infty$, so $M_n/Z_n \to 0$; whereas if $\langle M \rangle_\infty(\omega)$ is infinite, then by part (b) of Theorem 5.3.32 and the bound $\langle M \rangle_n \le Z_n$, again $M_n/Z_n \to 0$.
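A simulation sketch of part (b) of Proposition 5.3.33 (our own; the self-exciting transition probabilities $0.6$ and $0.3$ and the seed are arbitrary choices of adapted events): when $Z_n \to \infty$, the running count $S_n$ tracks the sum $Z_n$ of conditional probabilities.

```python
# Simulation of Levy's extension of Borel-Cantelli, part (b).
import random

def sample_ratio(n=20000, seed=7):
    """S_n / Z_n for adapted events A_k whose conditional probability
    xi_k depends on the previous outcome (a self-exciting choice)."""
    rng = random.Random(seed)
    s = z = 0.0
    prev = 0
    for _ in range(n):
        xi = 0.6 if prev else 0.3     # xi_k = P(A_k | F_{k-1})
        z += xi
        prev = 1 if rng.random() < xi else 0
        s += prev
    return s / z
```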
The following extension of Kolmogorov's three-series theorem uses both Theorem 5.3.32 and Lévy's extension of the Borel-Cantelli lemmas.

Exercise 5.3.34. Suppose $X_n$ is adapted to the filtration $\mathcal{F}_n$ and, for any $n$, the R.C.P.D. of $X_n$ given $\mathcal{F}_{n-1}$ equals the R.C.P.D. of $-X_n$ given $\mathcal{F}_{n-1}$. For non-random $c > 0$ let $X_n^{(c)} = X_n I_{|X_n| \le c}$ be the corresponding truncated variables.
(a) Verify that $(Z_n, \mathcal{F}_n)$ is a MG, where $Z_n = \sum_{k=1}^n X_k^{(c)}$.
(b) Considering the series
$$(5.3.3) \qquad \sum_n P(|X_n| > c \,|\, \mathcal{F}_{n-1})\,, \quad \text{and} \quad \sum_n \mathrm{Var}(X_n^{(c)} | \mathcal{F}_{n-1})\,,$$
show that for a.e. $\omega$ the series $\sum_n X_n(\omega)$ has a finite limit if and only if both series in (5.3.3) converge.
(c) Provide an example where the convergence in part (b) occurs with probability $0 < p < 1$.
We now consider sufficient conditions for almost sure convergence of the martingale transform.

Exercise 5.3.35. Suppose $Y_n = \sum_{k=1}^n V_k (Z_k - Z_{k-1})$ is the martingale transform of the $\mathcal{F}_n$-predictable $V_n$ with respect to the martingale $(Z_n, \mathcal{F}_n)$, per Definition 5.1.27.
(a) Show that if $Z_n$ is $L^2$-bounded and $V_n$ is uniformly bounded, then $Y_n \stackrel{a.s.}{\to} Y_\infty$ finite.
(b) Deduce that for an $L^2$-bounded MG $Z_n$, the sequence $Y_n(\omega)$ converges to a finite limit for a.e. $\omega$ for which $\sup_{k \ge 1} |V_k(\omega)|$ is finite.
(c) Suppose now that $V_k$ is predictable for the canonical filtration $\mathcal{F}_n$ of the i.i.d. $\xi_k$. Show that if $\xi_k \stackrel{D}{=} -\xi_k$ and $u \mapsto u P(|\xi_1| \ge u)$ is bounded above, then the series $\sum_n V_n \xi_n$ has a finite limit for a.e. $\omega$ for which $\sum_{k \ge 1} |V_k(\omega)|$ is finite.
Hint: Consider Exercise 5.3.34 for the adapted sequence $X_k = V_k \xi_k$.
Here is another application of Lévy's extension of the Borel-Cantelli lemmas.

Exercise 5.3.36. Suppose $X_n = 1 + \sum_{k=1}^n D_k$, $n \ge 0$, where the $\{-1, 1\}$-valued $D_k$ are $\mathcal{F}_k$-adapted and such that $E[D_k | \mathcal{F}_{k-1}] \ge \delta$ for some non-random $1 > \delta > 0$ and all $k \ge 1$.
(a) Show that $(X_n, \mathcal{F}_n)$ is a sub-martingale and provide its Doob decomposition.
(b) Using this decomposition and Lévy's extension of the Borel-Cantelli lemmas, show that $X_n \to \infty$ almost surely.
(c) Let $Z_n = \gamma^{X_n}$ for $\gamma = (1-\delta)/(1+\delta)$. Show that $(Z_n, \mathcal{F}_n)$ is a super-martingale, and deduce that $P(\inf_n X_n > 0) \ge 1 - \gamma$.
As we show next, the predictable compensator controls the exponential tails of martingales of bounded differences.

Exercise 5.3.37. Fix $\lambda > 0$ non-random and an $L^2$ martingale $(M_n, \mathcal{F}_n)$ with $M_0 = 0$ and bounded differences $\sup_k |M_k - M_{k-1}| \le 1$.
(a) Show that $N_n = \exp(\lambda M_n - (e^\lambda - 1 - \lambda)\langle M \rangle_n)$ is a sup-MG for $\mathcal{F}_n$.
Hint: Recall part (a) of Exercise 1.4.40.
(b) Show that for any a.s. finite $\mathcal{F}_n$-stopping time $\tau$ and constants $u, r > 0$,
$$P(M_\tau \ge u,\ \langle M \rangle_\tau \le r) \le \exp(-\lambda u + r(e^\lambda - 1 - \lambda))\,.$$
(c) Show that if the martingale $S_n$ of Example 5.3.22 has uniformly bounded differences $|\xi_k| \le 1$, then $E\exp(\theta S_\tau)$ is finite for any $\theta \in \mathbb{R}$ and any $\mathcal{F}_n$-stopping time $\tau$ such that $\langle S \rangle_\tau \le k$ for some non-random finite $k$.
Applying part (b) of the preceding exercise, you are next to derive the following tail estimate, due to Dvoretzky, in the context of Lévy's extension of the Borel-Cantelli lemmas.

Exercise 5.3.38. Suppose $A_k \in \mathcal{F}_k$ for some filtration $\{\mathcal{F}_k\}$. Let $S_n = \sum_{k=1}^n I_{A_k}$ and $Z_n = \sum_{k=1}^n P(A_k | \mathcal{F}_{k-1})$. Show that
$$P(S_n \ge r + u,\ Z_n \le r) \le e^u \Big(\frac{r}{r+u}\Big)^{r+u}$$
for all $n$ and $u, r > 0$, then deduce that for any $0 < r < 1$,
$$P\Big(\bigcup_{k=1}^n A_k\Big) \le er + P\Big(\sum_{k=1}^n P(A_k | \mathcal{F}_{k-1}) > r\Big)\,.$$
Hint: Recall from the proof of Proposition 5.3.33 that the $L^2$ martingale $M_n = S_n - Z_n$ has differences bounded by one and $\langle M \rangle_n \le Z_n$.
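In the special case of independent events $A_k$ with non-random $p_k = P(A_k)$, the sum $Z_n = \sum_k p_k$ is deterministic, so the claimed bound reduces to a Chernoff-type tail estimate for the number of occurrences. The sketch below is our own (the parameter choices are arbitrary) and verifies it against the exact Poisson-binomial tail.

```python
# Sanity check of the Dvoretzky-type bound in the independent case.
import math

def poisson_binomial_tail(ps, m):
    """Exact P(S >= m) for a sum S of independent indicators with
    success probabilities ps, via dynamic programming on the law of S."""
    law = [1.0]
    for p in ps:
        new = [0.0] * (len(law) + 1)
        for k, q in enumerate(law):
            new[k] += q * (1 - p)
            new[k + 1] += q * p
        law = new
    k0 = math.ceil(m)
    return sum(law[k0:]) if k0 < len(law) else 0.0

def dvoretzky_bound(r, u):
    # e^u * (r/(r+u))^(r+u), with r = sum of the p_k
    return math.exp(u) * (r / (r + u)) ** (r + u)
```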
We conclude this section with a refinement of the well-known Azuma-Hoeffding concentration inequality for martingales of bounded differences, from which we deduce the strong law of large numbers for such martingales.

Exercise 5.3.39. Suppose $(M_n, \mathcal{F}_n)$ is a martingale with $M_0 = 0$ and differences $D_k = M_k - M_{k-1}$, $k \ge 1$, such that for some finite $\sigma_k$ and all $\lambda \in [0, 1]$,
$$E[D_k^2 e^{\lambda D_k} | \mathcal{F}_{k-1}] \le \sigma_k^2\, E[e^{\lambda D_k} | \mathcal{F}_{k-1}] < \infty\,.$$
(a) Show that $N_n = \exp(\lambda M_n - \lambda^2 r_n/2)$ is a sup-MG for $\mathcal{F}_n$, provided $\lambda \in [0, 1]$ and $r_n = \sum_{k=1}^n \sigma_k^2$.
Hint: Recall part (b) of Exercise 1.4.40.
(b) Deduce that for $I(x) = (x \wedge 1)(2x - x \wedge 1)$ and any $u \ge 0$,
$$P(M_n \ge u) \le \exp(-r_n I(u/r_n)/2)\,.$$
(c) Conclude that $b_n^{-1} M_n \stackrel{a.s.}{\to} 0$ for any martingale $M_n$ of uniformly bounded differences and non-random $b_n$ such that $b_n/\sqrt{n \log n} \to \infty$.
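For the symmetric $\pm 1$ random walk one may take $\sigma_k = 1$ (as $D_k^2 = 1$), so $r_n = n$, and for $u \le n$ part (b) reduces to the familiar bound $P(S_n \ge u) \le \exp(-u^2/2n)$. The check below is our own sketch, comparing this with the exact binomial tail.

```python
# Exact check of the exercise's tail bound for the symmetric srw.
import math

def srw_tail(n, u):
    """Exact P(S_n >= u) for the symmetric simple random walk,
    using S_n = 2*Binomial(n, 1/2) - n."""
    return sum(math.comb(n, k) for k in range(n + 1) if 2 * k - n >= u) / 2 ** n

def azuma_bound(n, u):
    x = u / n
    m = min(x, 1.0)
    rate = m * (2 * x - m)        # I(x) = (min(x,1))(2x - min(x,1))
    return math.exp(-n * rate / 2)
```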
5.4. The optional stopping theorem
This section is about the use of martingales in computations involving stopping
times. The key tool for doing so is the following theorem.
Theorem 5.4.1 (Doob's optional stopping). Suppose $\theta \le \tau$ are $\mathcal{F}_n$-stopping times and $X_n = Y_n + V_n$ for sub-MGs $(V_n, \mathcal{F}_n)$, $(Y_n, \mathcal{F}_n)$, such that $V_n$ is non-positive and $\{Y_{n \wedge \tau}\}$ is uniformly integrable. Then, the R.V.-s $X_\theta$ and $X_\tau$ are integrable, with $EX_\tau \ge EX_\theta \ge EX_0$.

Remark 5.4.2. The assumptions of Theorem 5.4.1 hold whenever the sub-MG $X_n$ is such that $X_n \le E[X_\infty | \mathcal{F}_n]$ for some integrable $X_\infty$ (indeed, the MG $Y_n = E[X_\infty | \mathcal{F}_n]$ is U.I. by Corollary 5.3.14, hence $\{Y_{n \wedge \tau}\}$ is also U.I. by Proposition 5.4.4, and the sub-MG $V_n = X_n - Y_n$ is by assumption non-positive).
By far the most common application has $(X_n, \mathcal{F}_n)$ a martingale, in which case it yields that $EX_0 = EX_\tau$ for any $\mathcal{F}_n$-stopping time $\tau$ such that $\{X_{n \wedge \tau}\}$ is U.I. (for example, whenever $\tau$ is bounded, or under the more general conditions of Proposition 5.4.4).
Proof. By linearity of the expectation, it suffices to prove the claim separately for $Y_n = 0$ and for $V_n = 0$. Dealing first with $Y_n = 0$, i.e. with a non-positive sub-MG $(V_n, \mathcal{F}_n)$, note that $(-V_n, \mathcal{F}_n)$ is then a non-negative sup-MG. Thus, the inequality $E[V_\tau] \ge E[V_\theta] \ge E[V_0]$ and the integrability of $V_\theta$ and $V_\tau$ are immediate consequences of Proposition 5.3.8.
Considering hereafter the sub-MG $(Y_n, \mathcal{F}_n)$ such that $\{Y_{n \wedge \tau}\}$ is U.I., since $\theta \le \tau$ are $\mathcal{F}_n$-stopping times, it follows by Theorem 5.1.32 that $U_n = Y_{n \wedge \tau}$, $Z_n = Y_{n \wedge \theta}$ and $U_n - Z_n$ are all sub-MGs with respect to $\mathcal{F}_n$. In particular, $EU_n \ge EZ_n \ge EZ_0$ for all $n$. Our assumption that the sub-MG $(U_n, \mathcal{F}_n)$ is U.I. results with $U_n \to U_\infty$ a.s. and in $L^1$ (see Theorem 5.3.12). Further, as we show in part (c) of Proposition 5.4.4, in this case $Z_n = U_{n \wedge \theta}$ is U.I., so by the same reasoning, $Z_n \to Z_\infty$ a.s. and in $L^1$. We thus deduce that $EU_\infty \ge EZ_\infty \ge EZ_0$. By definition, $U_\infty = \lim_n Y_{n \wedge \tau} = Y_\tau$ and $Z_\infty = \lim_n Y_{n \wedge \theta} = Y_\theta$. Consequently, $EY_\tau \ge EY_\theta \ge EY_0$, as claimed.
We complement Theorem 5.4.1 by first strengthening its conclusion and then providing explicit sufficient conditions for the uniform integrability of $\{Y_{n \wedge \tau}\}$.

Lemma 5.4.3. Suppose $X_n$ is adapted to the filtration $\mathcal{F}_n$ and the $\mathcal{F}_n$-stopping time $\tau$ is such that for any $\mathcal{F}_n$-stopping time $\theta \le \tau$, the R.V. $X_\theta$ is integrable and $E[X_\theta] \le E[X_\tau]$. Then, for any such $\theta$, $E[X_\tau | \mathcal{F}_\theta] \ge X_\theta$ a.s.

Proof. Fixing $A \in \mathcal{F}_\theta$, set $\eta = \theta I_A + \tau I_{A^c}$. Note that $\eta$ is also an $\mathcal{F}_n$-stopping time, since for any $n$,
$$\{\eta \le n\} = (A \cap \{\theta \le n\}) \cup (A^c \cap \{\tau \le n\}) = (A \cap \{\theta \le n\}) \cup ((A^c \cap \{\theta \le n\}) \cap \{\tau \le n\}) \in \mathcal{F}_n$$
because both $A$ and $A^c$ are in $\mathcal{F}_\theta$ and $\{\tau \le n\} \in \mathcal{F}_n$ (c.f. Definition 5.1.34 of the $\sigma$-algebra $\mathcal{F}_\theta$). By assumption, $X_\theta$, $X_\tau$ and $X_\eta$ are integrable, with $E[X_\eta] \le E[X_\tau]$. Since $X_\eta = X_\theta I_A + X_\tau I_{A^c}$, subtracting the finite $E[X_\tau I_{A^c}]$ from both sides of this inequality results with $E[X_\theta I_A] \le E[X_\tau I_A]$. This holds for all $A \in \mathcal{F}_\theta$, and with $E[X_\tau I_A] = E[Z I_A]$ for $Z = E[X_\tau | \mathcal{F}_\theta]$, we get $E[(Z - X_\theta) I_A] \ge 0$ for all $A \in \mathcal{F}_\theta$. Since both $Z$ and $X_\theta$ are measurable on $\mathcal{F}_\theta$, it follows that $Z \ge X_\theta$ a.s., as claimed.
Proposition 5.4.4. Suppose $Y_n$ is integrable and $\tau$ is a stopping time for the filtration $\mathcal{F}_n$. Then, $\{Y_{n \wedge \tau}\}$ is uniformly integrable if any one of the following conditions hold:
(a) $E\tau < \infty$ and a.s. $E[|Y_n - Y_{n-1}| \,|\, \mathcal{F}_{n-1}] \le c$ for some finite, non-random $c$.
(b) $\{Y_n I_{\tau > n}\}$ is uniformly integrable and $Y_\tau I_{\tau < \infty}$ is integrable.
(c) $(Y_{n \wedge \tau}, \mathcal{F}_n)$ is a uniformly integrable sub-MG (or sup-MG).
Proof. (a) Clearly, $|Y_{n \wedge \tau}| \le Z_n$, where
$$Z_n = |Y_0| + \sum_{k=1}^{n \wedge \tau} |Y_k - Y_{k-1}| = |Y_0| + \sum_{k=1}^n |Y_k - Y_{k-1}|\, I_{\tau \ge k}$$
is non-decreasing in $n$. Hence, $\sup_n |Y_{n \wedge \tau}| \le Z_\infty$, implying that $\{Y_{n \wedge \tau}\}$ is U.I. whenever $Z_\infty$ is integrable. Since $\{\tau \ge k\} \in \mathcal{F}_{k-1}$, by the tower property and our assumption,
$$EZ_\infty \le E|Y_0| + c \sum_{k=1}^\infty P(\tau \ge k) = E|Y_0| + c\, E\tau < \infty\,,$$
with the integrability of $Z_\infty$ thus established.
(b) Note that $|Y_{n \wedge \tau}| \le |Y_\tau| I_{\tau < \infty} + |Y_n| I_{\tau > n}$ for every $n$, any sequence of random variables $Y_n$ and any $\tau \in \{0, 1, 2, \ldots, \infty\}$. Condition (b) states that the sequence $\{|Y_n| I_{\tau > n}\}$ is U.I. and that the variable $|Y_\tau| I_{\tau < \infty}$ is integrable. Thus, applying the preceding inequality on the event $\{|Y_{n \wedge \tau}| > M\}$, we find that when condition (b) holds,
$$\sup_n E[|Y_{n \wedge \tau}|\, I_{|Y_{n \wedge \tau}| > M}] \le E[|Y_\tau|\, I_{|Y_\tau| > M}\, I_{\tau < \infty}] + \sup_n E[|Y_n|\, I_{|Y_n| > M}\, I_{\tau > n}]\,,$$
which converges to zero as $M \to \infty$. That is, $\{|Y_{n \wedge \tau}|\}$ is then a U.I. sequence.
(c) The hypothesis of (c) that $\{Y_{n \wedge \tau}\}$ is U.I. implies that $\{Y_n I_{\tau > n}\}$ is also U.I. and that $\sup_n E[(Y_{n \wedge \tau})^+]$ is finite. With $\tau$ an $\mathcal{F}_n$-stopping time and $(Y_{n \wedge \tau}, \mathcal{F}_n)$ a sub-MG, it further follows by Lemma 5.3.7 that $Y_\tau I_{\tau < \infty}$ is integrable. Having arrived at the hypothesis of part (b), we are done.
Since $\{Y_{n \wedge \tau}\}$ is U.I. whenever $\tau$ is bounded, we have the following immediate consequence of Doob's optional stopping theorem, Remark 5.4.2 and Lemma 5.4.3.

Corollary 5.4.5. For any sub-MG $(X_n, \mathcal{F}_n)$ and any non-decreasing sequence $\{\tau_k\}$ of $\mathcal{F}_n$-stopping times, $(X_{\tau_k}, \mathcal{F}_{\tau_k}, k \ge 0)$ is a sub-MG when either each $\tau_k$ is bounded by a non-random finite integer, or a.s. $X_n \le E[X_\infty | \mathcal{F}_n]$ for an integrable $X_\infty$ and all $n \ge 0$.

Check that by part (b) of Exercise 5.2.16 and part (c) of Proposition 5.4.4, it follows from Doob's optional stopping theorem that $ES_\tau = 0$ for $S_n = \sum_{k=1}^n \xi_k$, provided the independent $\xi_k$ are integrable, with $E\xi_k = 0$ and $\sup_n E|S_{n \wedge \tau}| < \infty$.
Sometimes Doob's optional stopping theorem is applied en route to a useful contradiction. For example:

Exercise 5.4.6. Show that if $X_n$ is a sub-martingale such that $EX_0 \ge 0$ and $\inf_n X_n < 0$ a.s., then necessarily $E[\sup_n X_n] = \infty$.
Hint: Assuming first that $\sup_n |X_n|$ is integrable, apply Doob's optional stopping theorem to arrive at a contradiction. Then consider the same argument for the sub-MG $Z_n = \max\{X_n, -1\}$.
Exercise 5.4.7. Fixing $b > 0$, let $\tau_b = \min\{n \ge 0 : S_n \ge b\}$ for the random walk $S_n$ of Definition 5.1.6, and suppose $\xi_n = S_n - S_{n-1}$ are uniformly bounded, of zero mean and positive variance.
(a) Show that $\tau_b$ is almost surely finite.
Hint: See Proposition 5.3.5.
(b) Show that $E[\min\{S_n : n \le \tau_b\}] = -\infty$.
Martingales often provide much information about specific stopping times. We detail below one such example, pertaining to the srw of Definition 5.1.6.

Corollary 5.4.8 (Gambler's Ruin). Fixing positive integers $a$ and $b$, the probability that a srw $S_n$, starting at $S_0 = 0$, hits $-a$ before first hitting $+b$ is $r = (e^{\lambda b} - 1)/(e^{\lambda b} - e^{-\lambda a})$ for $\lambda = \log[(1-p)/p] \ne 0$. For the symmetric srw, i.e. when $p = 1/2$, this probability is $r = b/(a+b)$.

Remark. The probability $r$ is often called the gambler's ruin probability, for a gambler with initial capital of $+a$ who bets on the outcomes of independent rounds of the same game, a unit amount per round, gaining or losing an amount equal to his bet in each round, and stopping when either all his capital is lost (the ruin event) or his accumulated gains reach the amount $+b$.
Proof. Consider the stopping time $\tau_{a,b} = \inf\{n \ge 0 : S_n \ge b \text{ or } S_n \le -a\}$ for the canonical filtration of the srw. That is, $\tau_{a,b}$ is the first time the srw exits the interval $(-a, b)$. Since $(S_k + k)/2$ has the Binomial$(k, p)$ distribution, it is not hard to check that $\sup_\ell P(S_k = \ell) \to 0$, hence $P(\tau_{a,b} > k) \le P(-a < S_k < b) \to 0$ as $k \to \infty$. Consequently, $\tau_{a,b}$ is finite a.s. Further, starting at $S_0 = 0 \in (-a, b)$ and using only increments $\xi_k \in \{-1, 1\}$, necessarily $S_{\tau_{a,b}} \in \{-a, b\}$ with probability one. Our goal is thus to compute the ruin probability $r = P(S_{\tau_{a,b}} = -a)$. To this end, note that $Ee^{\lambda \xi_k} = pe^\lambda + (1-p)e^{-\lambda} = 1$ for $\lambda = \log[(1-p)/p]$, so that $M_n = \prod_{k=1}^n e^{\lambda \xi_k} = e^{\lambda S_n}$ is, for such $\lambda$, a non-negative MG with $M_0 = 1$ (c.f. Example 5.1.10). Clearly, $M_{n \wedge \tau_{a,b}} = \exp(\lambda S_{n \wedge \tau_{a,b}}) \le \exp(|\lambda| \max(a, b))$ is uniformly bounded (in $n$), hence uniformly integrable. So, applying Doob's optional stopping theorem for this MG and stopping time, we have that
$$1 = EM_0 = E[M_{\tau_{a,b}}] = E[e^{\lambda S_{\tau_{a,b}}}] = re^{-\lambda a} + (1-r)e^{\lambda b}\,,$$
which easily yields the stated explicit formula for $r$ in case $\lambda \ne 0$ (i.e. $p \ne 1/2$).
Finally, recall that $S_n$ is a martingale for the symmetric srw, with $S_{n \wedge \tau_{a,b}}$ uniformly bounded, hence uniformly integrable. So, applying Doob's optional stopping theorem for this MG, we find that in the symmetric case
$$0 = ES_0 = E[S_{\tau_{a,b}}] = -ar + b(1-r)\,,$$
that is, $r = b/(a+b)$ when $p = 1/2$.
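The formula just proved is easy to test numerically. The sketch below is our own (the value-iteration scheme and the parameter choices are illustrative): it solves the harmonic equations $h(x) = p\,h(x+1) + (1-p)\,h(x-1)$ on $(-a, b)$, with boundary values $h(-a) = 1$ and $h(b) = 0$, and compares $h(0)$ with the closed form.

```python
# Numerical check of the gambler's-ruin formula.
import math

def ruin_prob_iter(a, b, p, sweeps=20000):
    """P(hit -a before +b | S_0 = 0), by value iteration on the
    harmonic equations with absorbing boundaries."""
    q = 1 - p
    h = {x: 0.0 for x in range(-a, b + 1)}
    h[-a] = 1.0
    for _ in range(sweeps):
        for x in range(-a + 1, b):
            h[x] = p * h[x + 1] + q * h[x - 1]
    return h[0]

def ruin_prob_formula(a, b, p):
    if p == 0.5:
        return b / (a + b)
    lam = math.log((1 - p) / p)
    return (math.exp(lam * b) - 1) / (math.exp(lam * b) - math.exp(-lam * a))
```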
Here is an interesting consequence of the gambler's ruin formula.

Example 5.4.9. Initially, at step $k = 0$, zero is the only occupied site in $\mathbb{Z}$. Then, at each step a new particle starts at zero and follows a symmetric srw, independently of the previous particles, till it lands on an unoccupied site, whereby it stops and thereafter occupies this site. The set of occupied sites after $k$ steps is thus an interval of length $k+1$, and we let $R_k \in \{1, \ldots, k+1\}$ count the number of non-negative integers occupied after $k$ steps (starting at $R_0 = 1$).
Clearly, $R_{k+1} \in \{R_k, R_k + 1\}$ and $P(R_{k+1} = R_k | \mathcal{F}_k) = R_k/(k+2)$ by the preceding gambler's ruin formula. Thus, $R_k$ follows the evolution of Bernard Friedman's urn with parameters $d_k = r = b = 1$ and $c_k = 0$. Consequently, by Exercise 5.3.27 we have that $(n+1)^{-1} R_n \stackrel{a.s.}{\to} 1/2$.
You are now to derive Wald's identities about stopping times for the random walk, and to use them to gain further information on the stopping times $\tau_{a,b}$ of the preceding corollary.

Exercise 5.4.10. Let $\tau$ be an integrable stopping time for the canonical filtration of the random walk $S_n$.
(a) Show that if $\xi_1$ is integrable, then Wald's identity $ES_\tau = E\xi_1\, E\tau$ holds.
Hint: Use the representation $S_\tau = \sum_{k=1}^\infty \xi_k I_{\tau \ge k}$ and independence.
(b) Show that if in addition $\xi_1$ is square-integrable, then Wald's second identity $E[(S_\tau - \tau E\xi_1)^2] = \mathrm{Var}(\xi_1)\, E\tau$ holds as well.
Hint: Explain why you may assume that $E\xi_1 = 0$, prove the identity with $\tau \wedge n$ instead of $\tau$, and use Doob's $L^2$ martingale convergence theorem.
(c) Show that if $\xi_1 \ge 0$ then Wald's identity applies also when $E\tau = \infty$ (under the convention that $0 \cdot \infty = 0$).
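Wald's first identity can be confirmed by exact dynamic programming (our own sketch; the biased walk, the exit time of $(-3, 4)$ and the tolerances are arbitrary choices): propagating the law of the walk until absorption yields $ES_\tau$ and $E\tau$, which should satisfy $ES_\tau = (2p-1)E\tau$ for $\pm 1$ steps of mean $2p-1$.

```python
# Dynamic-programming check of Wald's first identity.
def wald_check(a, b, p, horizon=400):
    """E[S_tau] and E[tau] for the +-1 walk stopped on leaving (-a, b),
    by propagating the law of S_n over the surviving paths."""
    q = 1 - p
    law = {0: 1.0}
    e_stau = e_tau = 0.0
    for n in range(1, horizon + 1):
        nxt = {}
        for x, pr in law.items():
            for step, ps in ((1, p), (-1, q)):
                y, w = x + step, pr * ps
                if y <= -a or y >= b:        # absorbed at time n
                    e_stau += w * y
                    e_tau += w * n
                else:
                    nxt[y] = nxt.get(y, 0.0) + w
        law = nxt
    return e_stau, e_tau, sum(law.values())  # residual unstopped mass
```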
Exercise 5.4.11. For the srw $S_n$ and positive integers $a, b$, consider the stopping time $\tau_{a,b} = \min\{n \ge 0 : S_n \notin (-a, b)\}$, as in the proof of Corollary 5.4.8.
(a) Check that $E[\tau_{a,b}] < \infty$.
Hint: See Exercise 5.1.15.
(b) Combining Corollary 5.4.8 with Wald's identities, compute the value of $E[\tau_{a,b}]$.
(c) Show that $\tau_{a,b} \uparrow \tau_b = \min\{n \ge 0 : S_n = b\}$ for $a \uparrow \infty$ (where the minimum over the empty set is $+\infty$), and deduce that $E\tau_b = b/(2p-1)$ when $p > 1/2$.
(d) Show that $\tau_b$ is almost surely finite when $p \ge 1/2$.
(e) Find constants $c_1$ and $c_2$ such that $Y_n = S_n^4 - 6nS_n^2 + c_1 n^2 + c_2 n$ is a martingale for the symmetric srw, and use it to evaluate $E[(\tau_{b,b})^2]$ in this case.
We next provide a few applications of Doob's optional stopping theorem, starting with information on the law of $\tau_b$ for the srw (and certain other random walks).

Exercise 5.4.12. Consider the stopping time $\tau_b = \inf\{n \ge 0 : S_n = b\}$ and the martingale $M_n = \exp(\theta S_n) M(\theta)^{-n}$ for a srw $S_n$, with $b$ a positive integer and $M(\theta) = E[e^{\theta \xi_1}]$.
(a) Show that if $p = 1 - q \in [1/2, 1)$ then $e^{\theta b}\, E[M(\theta)^{-\tau_b}] = 1$ for every $\theta > 0$.
(b) Deduce that for $p \in [1/2, 1)$ and every $0 < s < 1$,
$$E[s^{\tau_1}] = \frac{1}{2qs}\Big(1 - \sqrt{1 - 4pqs^2}\Big)\,,$$
and $E[s^{\tau_b}] = (E[s^{\tau_1}])^b$.
(c) Show that if $0 < p < 1/2$ then $P(\tau_b < \infty) = \exp(-\lambda b)$ for $\lambda = \log[(1-p)/p] > 0$.
(d) Deduce that for $p \in (0, 1/2)$ the variable $Z = 1 + \max_{k \ge 0} S_k$ has a Geometric distribution of success probability $1 - e^{-\lambda}$.
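A numerical check of part (b) (our own sketch; the horizon and the parameter values are arbitrary): summing $P(\tau_1 = n)s^n$ by propagating the law of the walk over paths that have not yet reached $+1$ reproduces $(1 - \sqrt{1 - 4pqs^2})/(2qs)$.

```python
# Check of the generating function of tau_1 for the srw.
import math

def gen_func_tau1(p, s, horizon=300):
    """sum_n P(tau_1 = n) s^n, by propagating the law of S_n over
    paths that have not yet reached +1."""
    q = 1 - p
    law = {0: 1.0}
    total = 0.0
    for n in range(1, horizon + 1):
        nxt = {}
        for x, pr in law.items():
            if x + 1 == 1:                    # stepping up hits the target
                total += pr * p * s ** n
            else:
                nxt[x + 1] = nxt.get(x + 1, 0.0) + pr * p
            nxt[x - 1] = nxt.get(x - 1, 0.0) + pr * q
        law = nxt
    return total

def gen_func_formula(p, s):
    q = 1 - p
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)
```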
Exercise 5.4.13. Consider $\tau_b = \min\{n \ge 0 : S_n \ge b\}$ for $b > 0$, in case the i.i.d. increments $\xi_n = S_n - S_{n-1}$ of the random walk $S_n$ are such that $P(\xi_1 > 0) > 0$ and $\xi_1 | \xi_1 > 0$ has the Exponential law of parameter $\lambda$.
(a) Show that for any $n$ finite, conditional on $\tau_b = n$, the law of $S_{\tau_b} - b$ is also Exponential of parameter $\lambda$.
Hint: Recall the memoryless property of the exponential distribution.
(b) With $M(\theta) = E[e^{\theta \xi_1}]$ and $u(s) \in (0, \lambda)$ denoting the solution of $sM(u) = 1$, compute $E[s^{\tau_b} I_{\tau_b < \infty}]$ for $0 < s < 1$, and $P(\tau_b < \infty)$, in terms of $u(s)$ and $\lambda$.
Exercise 5.4.14. A monkey types a random sequence of capital letters $\xi_k$ that are chosen independently of each other, with each $\xi_k$ chosen uniformly from amongst the 26 possible values $\{A, B, \ldots, Z\}$.
(a) Suppose that just before each time step $n = 1, 2, \ldots$, a new gambler arrives on the scene and bets \$1 that $\xi_n = P$. If he loses, he leaves, whereas if he wins, he receives \$26, all of which he bets on the event $\xi_{n+1} = R$. If he now loses, he leaves, whereas if he wins, he bets his current fortune of \$$26^2$ on the event that $\xi_{n+2} = O$, and so on, through the word PROBABILITY. Show that the amount of money $M_n$ that the gamblers have collectively earned by time $n$ is a martingale with respect to $\mathcal{F}_n$.
(b) Let $L_n$ denote the number of occurrences of the word PROBABILITY in the first $n$ letters typed by the monkey, and $\tau = \inf\{n \ge 11 : L_n = 1\}$ the first time by which it produced this word. Using Doob's optional stopping theorem, show that $E\tau = a$ for $a = 26^{11}$. Does the same apply for the first time $\tau$ by which the monkey produces the word ABRACADABRA, and if not, what is $E\tau$?
(c) Show that $n^{-1} L_n \stackrel{a.s.}{\to} \frac{1}{a}$, and further that $(L_n - n/a)/\sqrt{vn} \stackrel{D}{\to} G$ for some finite, positive constant $v$.
Hint: Fixing $\gamma < 1/2$, partition $\{11, \ldots, n\}$ into $m = n^\gamma$ consecutive blocks $K_i$, each of length either $\ell = \lfloor n/m \rfloor - 10$ or $\ell + 1$, with gaps of length 10 between them. With $W_i$ denoting the number of occurrences of PROBABILITY within $K_i$, apply Lindeberg's clt to $(\widetilde{L}_n - \widetilde{n}/a)/\sqrt{\widetilde{v}_n}$, where $\widetilde{L}_n = \sum_{i=1}^m W_i$, $\widetilde{n} = n - 10m$ and $\widetilde{v}_n = \sum_{i=1}^m \mathrm{Var}(W_i)$, then show that $n^{-1} \widetilde{v}_n$ converges.
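The contrast probed in part (b) comes from self-overlaps of the word: in the gamblers' accounting, each border of the word (a prefix that is also a suffix, the whole word included) contributes a term equal to the alphabet size raised to the border length. The sketch below is our own, on a two-letter alphabet so that the chain stays tiny: it confirms the border-sum pattern ($2^2 + 2 = 6$ for AA, versus $2^2 = 4$ for the overlap-free AB) by exact first-step analysis on the prefix-matching (KMP) automaton.

```python
# Exact expected waiting time for a word in i.i.d. uniform letters.
from fractions import Fraction

def expected_wait(word, alphabet):
    """E[first time `word` appears], by solving the hitting-time
    equations on the prefix-automaton states 0..len(word)-1."""
    m = len(word)

    def advance(q, ch):
        # longest suffix of word[:q]+ch that is a prefix of word
        s = word[:q] + ch
        while s and not word.startswith(s):
            s = s[1:]
        return len(s)

    # Equations: E_q - (1/|A|) sum_ch E_{advance(q,ch)} = 1, with E_m = 0.
    p = Fraction(1, len(alphabet))
    A = [[Fraction(0)] * m for _ in range(m)]
    rhs = [Fraction(1) for _ in range(m)]
    for q in range(m):
        A[q][q] += 1
        for ch in alphabet:
            t = advance(q, ch)
            if t < m:
                A[q][t] -= p
    # Gauss-Jordan elimination over the rationals
    for i in range(m):
        piv = next(r for r in range(i, m) if A[r][i] != 0)
        A[i], A[piv] = A[piv], A[i]
        rhs[i], rhs[piv] = rhs[piv], rhs[i]
        inv = A[i][i]
        A[i] = [v / inv for v in A[i]]
        rhs[i] = rhs[i] / inv
        for r in range(m):
            if r != i and A[r][i] != 0:
                f = A[r][i]
                A[r] = [x - f * y for x, y in zip(A[r], A[i])]
                rhs[r] -= f * rhs[i]
    return rhs[0]
```

The same computation on the 26-letter alphabet reproduces $E\tau = 26^{11}$ for PROBABILITY, whose only border is the whole word.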
Exercise 5.4.15. Consider a fair game consisting of successive turns whose outcomes are the i.i.d. signs $\xi_k \in \{-1, 1\}$ such that $P(\xi_1 = 1) = \frac{1}{2}$, and where, upon betting the wagers $V_k$ in each turn, your gain (or loss) after $n$ turns is $Y_n = \sum_{k=1}^n \xi_k V_k$. Here is a betting system $\{V_k\}$, predictable with respect to the canonical filtration $\mathcal{F}_n$ as in Example 5.1.30, that surely makes a profit in this fair game!
(a) Choose a finite sequence $x_1, x_2, \ldots, x_N$ of positive numbers, of total sum $v = \sum_{i=1}^N x_i$. At each turn wager the sum of the first and last terms of your current sequence (or the single remaining term, if only one term is left); if you win, delete these terms from the sequence, while if you lose, append the amount just lost as a new last term. Show that the sum of terms in your sequence after $n$ turns is a martingale $S_n = v - Y_n$ with respect to $\mathcal{F}_n$. Deduce that with probability one you terminate playing, with a profit $v$, at the finite $\mathcal{F}_n$-stopping time $\tau = \inf\{n \ge 0 : S_n = 0\}$.
(b) Show that $E\tau$ is finite.
Hint: Consider the number of terms $N_n$ in your sequence after $n$ turns.
(c) Show that the expected value of your aggregate maximal loss till termination, namely $E[-L]$ for $L = \min_{k \le \tau} Y_k$, is infinite (which is why you are not to attempt this gambling scheme).
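A simulation of the scheme (our own sketch, assuming the classical Labouchère rules: wager the sum of the first and last terms, delete them on a win, append the lost amount on a loss) confirms that every run ends with profit exactly $v$, while saying nothing about the size of the intermediate losses.

```python
# Simulation of the Labouchere betting scheme on a fair coin.
import random

def play_labouchere(xs, rng):
    """Play until the sequence is exhausted; returns (gain, turns)."""
    seq = list(xs)
    gain = 0
    turns = 0
    while seq:
        bet = seq[0] + seq[-1] if len(seq) > 1 else seq[0]
        turns += 1
        if rng.random() < 0.5:                 # win the fair turn
            gain += bet
            seq = seq[1:-1] if len(seq) > 1 else []
        else:                                   # lose: append the lost amount
            gain -= bet
            seq.append(bet)
    return gain, turns
```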
In the next exercise you derive a time-reversed version of the $L^2$ maximal inequality (5.2.4), by an application of Corollary 5.4.5.

Exercise 5.4.16. Associate to any given martingale $(Y_n, \mathcal{G}_n)$ the record times $\tau_{k+1} = \min\{j \ge 0 : Y_j > Y_{\tau_k}\}$, $k = 0, 1, \ldots$, starting at $\tau_0 = 0$.
(a) Fixing $m$ finite, set $\theta_k = \tau_k \wedge m$ and explain why $(Y_{\theta_k}, \mathcal{G}_{\theta_k})$ is a MG.
(b) Deduce that if $EY_m^2$ is finite, then $\sum_{k=1}^m E[(Y_{\theta_k} - Y_{\theta_{k-1}})^2] = EY_m^2 - EY_0^2$.
Hint: Apply Exercise 5.1.8.
(c) Conclude that for any martingale $Y_n$ and all $m$,
$$E[(\max_{\ell \le m} Y_\ell - Y_m)^2] \le EY_m^2\,.$$
5.5. Reversed MGs, likelihood ratios and branching processes

With martingales applied throughout probability theory, we present here just a few selected applications. Our first example, Subsection 5.5.1, deals with the analysis of extinction probabilities for branching processes. We then study in Subsection 5.5.2 the likelihood ratios of independent experiments, with the help of Kakutani's theorem about product martingales. Finally, in Subsection 5.5.3 we develop the theory of reversed martingales and, applying it, provide zero-one law and representation results for exchangeable processes.
5.5.1. Branching processes: extinction probabilities. We use martingales to study the extinction probabilities of branching processes, the object we define next.

Definition 5.5.1 (Branching process). The branching process is a discrete-time stochastic process $\{Z_n\}$ taking non-negative integer values, such that $Z_0 = 1$ and for any $n \ge 1$,
$$Z_n = \sum_{j=1}^{Z_{n-1}} N_j^{(n)}\,,$$
where $N$ and $N_j^{(n)}$ for $j = 1, 2, \ldots$ are i.i.d. non-negative integer-valued R.V.-s with finite mean $m_N = EN < \infty$, and where we use the convention that if $Z_{n-1} = 0$ then also $Z_n = 0$. We call a branching process sub-critical when $m_N < 1$, critical when $m_N = 1$ and super-critical when $m_N > 1$.
Remark. The S.P. $Z_n$ is interpreted as counting the size of an evolving population, with $N_j^{(n)}$ being the number of offspring of the $j$-th individual of generation $(n-1)$, and $Z_n$ being the size of the $n$-th generation. Associated with the branching process is the family tree, with the root denoting the 0-th generation and with $N_j^{(n)}$ edges from vertex $j$ at distance $n$ from the root to vertices at distance $(n+1)$ from the root. Random trees generated in such a fashion are called Galton-Watson trees and are the subject of much research. We focus here on the simpler S.P. $Z_n$ and shall use throughout the filtration $\mathcal{F}_n = \sigma(N_j^{(k)}, k \le n, j = 1, 2, \ldots)$. We note in passing that in general $\mathcal{F}_n^Z$ is a strict subset of $\mathcal{F}_n$ (since in general one cannot recover the number of offspring of each individual knowing only the total population sizes at the different generations). Though not dealt with here, more sophisticated related models have also been successfully studied by probabilists. For example: the branching process with immigration, where one adds to $Z_n$ an external random variable $I_n$ that counts the number of individuals immigrating into the population at the $n$-th generation; the age-dependent branching process, where individuals have random life times during which they produce offspring according to an age-dependent probability generating function; the multi-type branching process, where each individual is assigned a label (type), possibly depending on the type of its parent, and with a different law for the number of offspring of each type; and the branching process in random environment, where the law of the number of offspring per individual is itself a random variable (part of the a-priori given random environment).
Our goal here is to find the probability $p_{\rm ex}$ of population extinction, formally defined as follows.

Definition 5.5.2. The extinction probability of a branching process is
$$p_{\rm ex} := P(\{\omega : Z_n(\omega) = 0 \text{ for all } n \text{ large enough}\})\,.$$
Obviously, $p_{\rm ex} = 0$ whenever $P(N = 0) = 0$, and $p_{\rm ex} = 1$ whenever $P(N = 0) = 1$. Hereafter we exclude these degenerate cases by assuming that $1 > P(N = 0) > 0$.
To this end, we first deduce that, with probability one, conditional upon non-extinction the branching process grows unboundedly.

Lemma 5.5.3. If $P(N = 0) > 0$ then with probability one either $Z_n \to \infty$ or $Z_n = 0$ for all $n$ large enough.
Proof. We start by proving that if, for a filtration $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ with $A \in \mathcal{F}_\infty$, some non-random $\epsilon_k > 0$ and all large positive integers $k, n$,
$$(5.5.1) \qquad P(A | \mathcal{F}_n)\, I_{[0,k]}(Z_n) \ge \epsilon_k\, I_{[0,k]}(Z_n)\,,$$
then $P(A \cup B) = 1$ for $B = \{\lim_n Z_n = \infty\}$. Indeed, the events $C_k = \{Z_n \le k, \text{ i.o. in } n\}$ are by (5.5.1) such that $C_k \subseteq \{P(A | \mathcal{F}_n) \ge \epsilon_k, \text{ i.o. in } n\}$. By Lévy's 0-1 law, $P(A | \mathcal{F}_n) \to I_A$ except on a set $D$ such that $P(D) = 0$, hence also $C_k \subseteq D \cup \{I_A \ge \epsilon_k\} = D \cup A$ for all $k$. With $B^c \subseteq \bigcup_k C_k$ it follows that $B^c \subseteq D \cup A$, yielding our claim that $P(A \cup B) = 1$.
Turning now to the branching process $Z_n$, let $A = \{\omega : Z_n(\omega) = 0 \text{ for all } n \text{ large enough}\}$, which is in $\mathcal{F}_\infty$, noting that if $Z_n \le k$ and $N_j^{(n+1)} = 0$ for $j = 1, \ldots, k$, then $Z_{n+1} = 0$, hence $\omega \in A$. Consequently, by the independence of $\{N_j^{(n+1)}, j = 1, \ldots\}$ and $\mathcal{F}_n$, it follows that
$$E[I_A | \mathcal{F}_n]\, I_{\{Z_n \le k\}} \ge E[I_{\{Z_{n+1} = 0\}} | \mathcal{F}_n]\, I_{\{Z_n \le k\}} \ge P(N = 0)^k\, I_{\{Z_n \le k\}}$$
for all $n$ and $k$. That is, (5.5.1) holds in this case for $\epsilon_k = P(N = 0)^k > 0$. As shown already, this implies that with probability one either $Z_n \to \infty$ or $Z_n = 0$ for all $n$ large enough.
The generating function
$$(5.5.2) \qquad L(s) = E[s^N] = P(N = 0) + \sum_{k=1}^\infty P(N = k)\, s^k$$
plays a key role in analyzing the branching process. In this task, we employ the following martingales associated with the branching process.

Lemma 5.5.4. Suppose $1 > P(N = 0) > 0$. Then, $(X_n, \mathcal{F}_n)$ is a martingale, where $X_n = m_N^{-n} Z_n$. In the super-critical case we also have the martingale $(M_n, \mathcal{F}_n)$ for $M_n = \eta^{Z_n}$ and $\eta \in (0, 1)$ the unique solution in $(0,1)$ of $s = L(s)$. The same applies in the sub-critical case if there exists a solution $\eta \in (1, \infty)$ of $s = L(s)$.
Proof. Since the value of $Z_n$ is a non-random function of $\{N_j^{(k)}, k \le n, j = 1, 2, \ldots\}$, it follows that both $X_n$ and $M_n$ are $\mathcal{F}_n$-adapted. We proceed to show by induction on $n$ that the non-negative processes $Z_n$ and $s^{Z_n}$, for each $s > 0$ such that $L(s) \le \max(s, 1)$, are integrable, with
$$(5.5.3) \qquad E[Z_{n+1} | \mathcal{F}_n] = m_N Z_n\,, \qquad E[s^{Z_{n+1}} | \mathcal{F}_n] = L(s)^{Z_n}\,.$$
Indeed, recall that the i.i.d. random variables $N_j^{(n+1)}$ of finite mean $m_N$ are independent of $\mathcal{F}_n$, on which $Z_n$ is measurable. Hence, by linearity of the expectation, it follows that for any $A \in \mathcal{F}_n$,
$$E[Z_{n+1} I_A] = \sum_{j=1}^\infty E[N_j^{(n+1)} I_{\{Z_n \ge j\}} I_A] = \sum_{j=1}^\infty E[N_j^{(n+1)}]\, E[I_{\{Z_n \ge j\}} I_A] = m_N\, E[Z_n I_A]\,.$$
This verifies the integrability of $Z_n \ge 0$, as well as the identity $E[Z_{n+1} | \mathcal{F}_n] = m_N Z_n$ of (5.5.3), which amounts to the martingale condition $E[X_{n+1} | \mathcal{F}_n] = X_n$ for $X_n = m_N^{-n} Z_n$. Similarly, fixing $s > 0$,
$$s^{Z_{n+1}} = \sum_{\ell=0}^\infty I_{\{Z_n = \ell\}} \prod_{j=1}^\ell s^{N_j^{(n+1)}}\,.$$
Hence, by linearity of the expectation and the independence of $\{s^{N_j^{(n+1)}}\}$ and $\mathcal{F}_n$,
$$E[s^{Z_{n+1}} I_A] = \sum_{\ell=0}^\infty E\Big[I_{\{Z_n = \ell\}} I_A \prod_{j=1}^\ell s^{N_j^{(n+1)}}\Big] = \sum_{\ell=0}^\infty E[I_{\{Z_n = \ell\}} I_A] \prod_{j=1}^\ell E[s^{N_j^{(n+1)}}] = \sum_{\ell=0}^\infty E[I_{\{Z_n = \ell\}} I_A]\, L(s)^\ell = E[L(s)^{Z_n} I_A]\,.$$
Since $Z_n \ge 0$ and $L(s) \le \max(s, 1)$, this implies that $Es^{Z_{n+1}} \le 1 + Es^{Z_n}$, and the integrability of $s^{Z_n}$ follows by induction on $n$. Given that $s^{Z_n}$ is integrable and the preceding identity holds for all $A \in \mathcal{F}_n$, we have thus verified the right identity in (5.5.3), which in case $s = L(s)$ is precisely the martingale condition for $M_n = s^{Z_n}$.
Finally, to prove that $s = L(s)$ has a unique solution $\eta \in (0, 1)$ when $m_N = EN > 1$, note that the function $s \mapsto L(s)$ of (5.5.2) is continuous and bounded on $[0, 1]$, with $L(0) = P(N = 0) > 0$, $L(1) = 1$ and $L'(1) = m_N > 1$. Further, $L''(s) = \sum_{k=2}^\infty k(k-1) P(N = k) s^{k-2}$ is positive and finite on $(0, 1)$, so $L(\cdot)$ is strictly convex there. Consequently, $L(s) - s$ is positive at $s = 0$ and negative for $s$ near one (by $L(1) = 1$ and $L'(1) > 1$), so a solution $\eta \in (0, 1)$ of $s = L(s)$ exists; and if $\eta \in (0, 1)$ is such that $\eta = L(\eta)$, then by convexity $L(s) < s$ for $s \in (\eta, 1)$, so such a solution $\eta \in (0, 1)$ is unique.
Remark. Since $X_n = m_N^{-n} Z_n$ is a martingale with $X_0 = 1$, it follows that $EZ_n = m_N^n$ for all $n \ge 0$. Thus, a sub-critical branching process, i.e. one with $m_N < 1$, has finite mean total population size:
$$E\Big[\sum_{n=0}^\infty Z_n\Big] = \sum_{n=0}^\infty m_N^n = \frac{1}{1 - m_N} < \infty\,.$$
We now determine the extinction probabilities for branching processes.

Proposition 5.5.5. Suppose $1 > P(N = 0) > 0$. If $m_N \le 1$ then $p_{\rm ex} = 1$. In contrast, if $m_N > 1$ then $p_{\rm ex} = \eta < 1$, with $m_N^{-n} Z_n \stackrel{a.s.}{\to} X_\infty$ and $Z_n \stackrel{a.s.}{\to} Z_\infty \in \{0, \infty\}$.

Remark. In words, we find that for sub-critical and non-degenerate critical branching processes the population eventually dies off, whereas non-degenerate super-critical branching processes survive forever with positive probability and, conditional upon such survival, their population size grows unboundedly in time.
Proof. Applying Doob's martingale convergence theorem to the non-negative MG $X_n$ of Lemma 5.5.4, we have that $X_n \stackrel{a.s.}{\to} X_\infty$ finite. In particular, if $m_N < 1$ then $Z_n = m_N^n X_n \to 0$, while if $m_N = 1$ then $Z_n = X_n$ converges to a finite limit; in both cases, by Lemma 5.5.3, with probability one $Z_n = 0$ for all $n$ large enough, i.e. $p_{\rm ex} = 1$.
Suppose now that $m_N > 1$, and apply Doob's martingale convergence theorem for the non-negative MG $M_n = \eta^{Z_n}$ of Lemma 5.5.4. Since $\eta \in (0, 1)$ and $Z_n \ge 0$, it follows that this MG is bounded by one, hence U.I., and with $Z_0 = 1$ it follows that $EM_\infty = EM_0 = \eta$ (see Theorem 5.3.12). Recall Lemma 5.5.3, that $Z_n \stackrel{a.s.}{\to} Z_\infty \in \{0, \infty\}$, so $M_\infty = \eta^{Z_\infty} \in \{0, 1\}$, with
$$p_{\rm ex} = P(Z_\infty = 0) = P(M_\infty = 1) = EM_\infty = \eta\,,$$
as stated.
Remark. For a non-degenerate critical branching process (i.e. when $m_N = 1$ and $P(N = 0) > 0$), we have seen that the martingale $Z_n$ converges to 0 with probability one, while $EZ_n = EZ_0 = 1$. Consequently, this MG is $L^1$-bounded but not U.I. (for another example, see Exercise 5.2.14). Further, as either $Z_n = 0$ or $Z_n \ge 1$, it follows that in this case $1 = E(Z_n | Z_n \ge 1)(1 - q_n)$ for $q_n = P(Z_n = 0)$. Since here $q_n \uparrow p_{\rm ex} = 1$, we deduce that, conditional upon non-extinction, the mean population size $E(Z_n | Z_n \ge 1) = 1/(1 - q_n)$ grows to infinity as $n \to \infty$.
As you show next, if a supercritical branching process has a square-integrable offspring distribution, then $m_N^{-n} Z_n$ converges in law to a nondegenerate random variable. The Kesten–Stigum $L \log L$ theorem (which we do not prove here) states that the latter property holds if and only if $E[N \log N]$ is finite.
Exercise 5.5.6. Consider a supercritical branching process $Z_n$ where the number of offspring has mean $m_N = E[N] > 1$ and variance $v_N = \mathrm{Var}(N) < \infty$.
(a) Compute $E[X_n^2]$ for $X_n = m_N^{-n} Z_n$.
(b) Show that the martingale $X_n$ is $L^2$-bounded, hence converges to $X_\infty$ both a.s. and in $L^2$, with $E[X_\infty] = 1$.
(c) Show that $P(X_\infty > 0) > 0$.
The generating function $L(s) = E[s^N]$ yields information about the laws of $Z_n$ and that of $X_\infty$ of Proposition 5.5.5.
Proposition 5.5.7. Consider the generating functions $L_n(s) = E[s^{Z_n}]$ for $s \in [0,1]$ and a branching process $Z_n$ starting with $Z_0 = 1$. Then, $L_0(s) = s$ and $L_n(s) = L[L_{n-1}(s)]$ for $n \ge 1$ and $L(\cdot)$ of (5.5.2). Consequently, the generating function $\widehat{L}_\infty(s) = E[s^{X_\infty}]$ of $X_\infty$ is a solution of $\widehat{L}_\infty(s) = L[\widehat{L}_\infty(s^{1/m_N})]$ which converges to one as $s \uparrow 1$.
Remark. In particular, the probability $q_n = P(Z_n = 0) = L_n(0)$ that the branching process is extinct after $n$ generations is given by the recursion $q_n = L(q_{n-1})$ for $n \ge 1$, starting at $q_0 = 0$. Since the continuous function $L(s)$ is above $s$ on the interval from zero to the smallest positive solution of $s = L(s)$, it follows that $q_n$ is a monotone nondecreasing sequence that converges to this solution, which is thus the value of $p_{\mathrm{ex}}$. This alternative evaluation of $p_{\mathrm{ex}}$ does not use martingales. Though implicit here, it instead relies on the Markov property of the branching process (c.f. Example 6.1.10).
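This recursion is easy to carry out numerically. A minimal sketch (the offspring law here is an illustrative choice, not one from the text) iterates $q_n = L(q_{n-1})$ until it stabilizes at the smallest fixed point of $L$:

```python
def extinction_probability(pmf, n_iter=200):
    """Iterate q_n = L(q_{n-1}) with q_0 = 0, where L(s) = sum_k P(N=k) s^k.
    The nondecreasing sequence q_n converges to the smallest fixed point
    of L on [0, 1], which is the extinction probability p_ex."""
    L = lambda s: sum(p * s ** k for k, p in pmf.items())
    q = 0.0
    for _ in range(n_iter):
        q = L(q)
    return q

# Offspring law P(N = 0) = 1/4, P(N = 2) = 3/4: here L(s) = 1/4 + (3/4) s^2,
# whose fixed points are s = 1 and s = 1/3, so p_ex = 1/3 (supercritical,
# since m_N = 3/2 > 1).
p_ex = extinction_probability({0: 0.25, 2: 0.75})
print(round(p_ex, 6))  # → 0.333333
```

For a subcritical or critical law the same iteration converges to one, in line with Proposition 5.5.5.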
Proof. Recall that $Z_1 = N^{(1)}_1$, and if $Z_1 = k$ then the branching process $Z_n$ for $n \ge 2$ has the same law as the sum of $k$ i.i.d. variables, each having the same law as $Z_{n-1}$ (with the $j$-th such variable counting the number of individuals in the $n$-th generation who are descendants of the $j$-th individual of the first generation). Consequently, $E[s^{Z_n} \mid Z_1 = k] = E[s^{Z_{n-1}}]^k$ for all $n \ge 2$ and $k \ge 0$. Summing over the disjoint events $\{Z_1 = k\}$, we have by the tower property that for $n \ge 2$,
$$L_n(s) = E[E(s^{Z_n} \mid Z_1)] = \sum_{k=0}^{\infty} P(N = k)\, L_{n-1}(s)^k = L[L_{n-1}(s)]$$
for $L(\cdot)$ of (5.5.2), as claimed. Obviously, $L_0(s) = s$ and $L_1(s) = E[s^N] = L(s)$.
From this identity we conclude that $\widehat{L}_n(s) = L[\widehat{L}_{n-1}(s^{1/m_N})]$ for $\widehat{L}_n(s) = E[s^{X_n}]$ and $X_n = m_N^{-n} Z_n$. With $X_n \xrightarrow{a.s.} X_\infty$, by bounded convergence $\widehat{L}_n(s) \to \widehat{L}_\infty(s) = E[s^{X_\infty}]$, which by the continuity of $r \mapsto L(r)$ on $[0,1]$ is thus a solution of the identity $\widehat{L}_\infty(s) = L[\widehat{L}_\infty(s^{1/m_N})]$. Further, by monotone convergence $\widehat{L}_\infty(s) \uparrow \widehat{L}_\infty(1) = 1$ as $s \uparrow 1$.
Remark. Of course, $q_n = P(T \le n)$ provides the distribution function of the time of extinction $T = \min\{k \ge 0 : Z_k = 0\}$. For example, if $N$ has the Bernoulli($p$) distribution for some $0 < p < 1$ then $T$ is merely a Geometric($1-p$) random variable, but in general the law of $T$ is more involved.
Exercise 5.5.8. Suppose the offspring law of the branching process $Z_n$ is Geometric, i.e. $P(N = k) = (1-q) q^k$ for $k \ge 0$ and some $q \in (0,1)$, so that $L(s)$ is linear fractional, of mean $m = m_N = q/(1-q)$, and the iterates $L_n(s)$ are explicitly computable.
(b) Show that in the supercritical case
$$\widehat{L}_\infty(e^{-\lambda}) = \rho + \frac{(1-\rho)^2}{\lambda + (1-\rho)}$$
for all $\lambda \ge 0$, and deduce that conditioned on nonextinction $X_\infty$ has the exponential distribution of parameter $(1-\rho)$.
(c) Show that in the subcritical case $E[s^{Z_n} \mid Z_n \ne 0] \to (1-m)s/[1-ms]$ and deduce that then the law of $Z_n$ conditioned upon nonextinction converges weakly to a Geometric($1-m$) distribution.
(d) Show that in the critical case $E[e^{-\lambda Z_n/n} \mid Z_n \ne 0] \to 1/(1+\lambda)$ for all $\lambda \ge 0$ and deduce that then the law of $n^{-1} Z_n$ conditioned upon nonextinction converges weakly to an exponential distribution (of parameter one).
The following exercise demonstrates that martingales are also useful in the study of Galton–Watson trees.

Exercise 5.5.9. Consider a supercritical branching process $Z_n$ such that $1 \le N \le \ell$ for some nonrandom finite $\ell$. A vertex of the corresponding Galton–Watson tree $T_\infty$ is called a branch point if it has more than one offspring. For each vertex $v \in T_\infty$ let $C(v)$ count the number of branch points one encounters when traversing along a path from the root of the tree to $v$ (possibly counting the root, but not counting $v$, among these branch points).
(a) Let $T_n$ denote the set of vertices of $T_\infty$ at distance $n$ from the root. Show that $M(\lambda)^{-n} \sum_{v \in T_n} e^{-\lambda C(v)}$ is a martingale when $M(\lambda) = m_N e^{-\lambda} + P(N = 1)(1 - e^{-\lambda})$.
(b) Let $B_n = \min\{C(v) : v \in T_n\}$. Show that a.s. $\liminf_{n \to \infty} n^{-1} B_n \ge \delta$, where $\delta > 0$ is nonrandom (and possibly depends on the offspring distribution).
5.5.2. Product martingales and Radon–Nikodym derivatives. We start with an explicit characterization of uniform integrability for the product martingale of Example 5.1.10.

Theorem 5.5.10 (Kakutani's Theorem). Let $M_n = \prod_{k=1}^{n} Y_k$, with $M_0 = 1$ and independent, integrable $Y_k \ge 0$ such that $EY_k = 1$ for all $k \ge 1$. By Jensen's inequality, $a_k = E[\sqrt{Y_k}]$ is in $(0,1]$ for all $k \ge 1$. The following five statements are then equivalent:
(a) $M_n$ is U.I.; (b) $M_n \xrightarrow{L^1} M_\infty$; (c) $EM_\infty = 1$; (d) $\prod_k a_k > 0$; (e) $\sum_k (1 - a_k) < \infty$,
and if any (every) one of them fails, then $M_\infty = 0$ a.s.
Proof. Statement (a) implies statement (b) because any U.I. martingale converges in $L^1$ (see Theorem 5.3.12). Further, the $L^1$ convergence per statement (b) implies that $EM_n \to EM_\infty$, and since $EM_n = EM_0 = 1$ for all $n$, this results with $EM_\infty = 1$.

Setting $N_n = \prod_{k=1}^{n} (\sqrt{Y_k}/a_k)$, a nonnegative MG (being a product of independent variables of mean one), we next show that (c) implies (d) by proving the contrapositive. Indeed, by Doob's convergence theorem $N_n \xrightarrow{a.s.} N_\infty$ with $N_\infty$ a.s. finite. Hence, if statement (d) fails to hold (i.e. $\prod_{k=1}^{n} a_k \to 0$), then $M_n = N_n^2 \big(\prod_{k=1}^{n} a_k\big)^2 \xrightarrow{a.s.} 0$. So in this case $M_\infty = 0$ a.s. and statement (c) also fails to hold.

In contrast, if statement (d) holds then $N_n$ is $L^2$-bounded, since for all $n$,
$$EN_n^2 = \Big(\prod_{k=1}^{n} a_k\Big)^{-2} EM_n \le \Big(\prod_k a_k\Big)^{-2} = c < \infty\,.$$
Thus, with $M_k \le N_k^2$, it follows by the $L^2$ maximal inequality that for all $n$,
$$E\Big[\max_{k=0}^{n} M_k\Big] \le E\Big[\max_{k=0}^{n} N_k^2\Big] \le 4 E[N_n^2] \le 4c\,.$$
Hence, $M_k \ge 0$ are such that $\sup_k M_k$ is integrable and in particular, $M_n$ is U.I. (that is, (a) holds).

Finally, to see why the statements (d) and (e) are equivalent, note that upon applying the Borel–Cantelli lemmas for independent events $A_n$ with $P(A_n) = 1 - a_n$, the divergence of the series $\sum_k (1 - a_k)$ is equivalent to $P(A_n^c \text{ eventually}) = 0$, which for strictly positive $a_k$ is equivalent to $\prod_k a_k = 0$.
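The dichotomy in Kakutani's theorem is easy to observe numerically. In the following sketch (the two-point laws for $Y_k$ are an illustrative choice, not from the text), the product martingale collapses to zero when $\sum_k (1 - a_k)$ diverges and stabilizes at a positive value when it converges:

```python
import random

def kakutani_path(eps, n, rng):
    """Product martingale M_n = Y_1 * ... * Y_n with independent Y_k uniform
    on {1 - eps(k), 1 + eps(k)}, so E[Y_k] = 1.  Here
    a_k = (sqrt(1 - eps(k)) + sqrt(1 + eps(k))) / 2, and
    sum_k (1 - a_k) < infinity iff sum_k eps(k)^2 < infinity."""
    M = 1.0
    for k in range(n):
        e = eps(k)
        M *= 1 + e if rng.random() < 0.5 else 1 - e
    return M

rng = random.Random(0)
# Constant eps: sum (1 - a_k) diverges, so M_n -> 0 a.s.
print(kakutani_path(lambda k: 0.5, 2000, rng))
# Square-summable eps(k) = 2^{-k-1}: M_n is U.I. and stays away from zero
# (deterministically it lies between prod(1 - 2^{-j}) and prod(1 + 2^{-j})).
print(kakutani_path(lambda k: 2.0 ** (-k - 1), 2000, rng))
```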
We next consider another martingale that is key to the study of likelihood ratios in sequential statistics. To this end, let $P$ and $Q$ be two probability measures on the same measurable space $(\Omega, \mathcal{F})$, with $P_n = P|_{\mathcal{F}_n}$ and $Q_n = Q|_{\mathcal{F}_n}$ denoting the restrictions of $P$ and $Q$ to a filtration $\mathcal{F}_n \subseteq \mathcal{F}$.

Theorem 5.5.11. Suppose $Q_n \ll P_n$ for all $n$, with $M_n = dQ_n/dP_n$ denoting the corresponding Radon–Nikodym derivatives on $(\Omega, \mathcal{F}_n)$. Then,
(a) $(M_n, \mathcal{F}_n)$ is a martingale on the probability space $(\Omega, \mathcal{F}, P)$, where $M_n \xrightarrow{a.s.} M_\infty$ as $n \to \infty$ and $M_\infty$ is $P$-a.s. finite.
(b) If $M_n$ is uniformly $P$-integrable then $Q \ll P$ and $dQ/dP = M_\infty$.
(c) More generally, the Lebesgue decomposition of $Q$ to its absolutely continuous and singular parts with respect to $P$ is
(5.5.4) $\qquad Q = Q_{ac} + Q_s = M_\infty P + I_{\{M_\infty = \infty\}} Q\,.$

Remark. From the decomposition (5.5.4) it follows that if $Q \perp P$ then both $Q(M_\infty = \infty) = 1$ and $P(M_\infty = 0) = 1$.
Example 5.5.12. Suppose $\mathcal{F}_n = \sigma(\Pi_n)$ and the countable partitions $\Pi_n = \{A_{i,n}\} \subseteq \mathcal{F}$ of $\Omega$ are nested (that is, for each $n$ the partition $\Pi_{n+1}$ is a refinement of $\Pi_n$). It is not hard to check directly that
$$M_n = \sum_{\{i \,:\, P(A_{i,n}) > 0\}} \frac{Q(A_{i,n})}{P(A_{i,n})}\, I_{A_{i,n}}$$
is an $\mathcal{F}_n$-supMG for $(\Omega, \mathcal{F}, P)$, and is further an $\mathcal{F}_n$-martingale if $Q(A_{i,n}) = 0$ whenever $P(A_{i,n}) = 0$ (which is precisely the assumption made in Theorem 5.5.11). We have seen this construction in Exercise 5.3.20, where $\Pi_n$ are the dyadic partitions of $\Omega = [0,1)$, $P$ is taken to be Lebesgue's measure on $[0,1)$ and $Q([s,t)) = x(t) - x(s)$ is the signed measure associated with the function $x(\cdot)$.
Proof. (a). By the Radon–Nikodym theorem, $M_n \in m\mathcal{F}_n$ is nonnegative and $P$-integrable (since $P_n(M_n) = Q_n(\Omega) = 1$). Further, $Q(A) = Q_n(A) = (M_n P_n)(A) = (M_n P)(A)$ for all $A \in \mathcal{F}_n$. In particular, if $k \le n$ and $A \in \mathcal{F}_k$ then (since $\mathcal{F}_k \subseteq \mathcal{F}_n$),
$$P(M_n I_A) = Q(A) = P(M_k I_A)\,,$$
so in $(\Omega, \mathcal{F}, P)$ we have $M_k = E[M_n \mid \mathcal{F}_k]$ by definition of the conditional expectation. Finally, by Doob's convergence theorem the nonnegative MG $M_n$ converges $P$-a.s. to a finite limit $M_\infty$.

(b). If $M_n$ is uniformly $P$-integrable, then it converges to $M_\infty$ also in $L^1(\Omega, \mathcal{F}, P)$, hence $P(M_n I_A) \to P(M_\infty I_A)$, so taking $n \to \infty$ we deduce that in this case $Q(A) = P(M_\infty I_A)$ for any $A \in \bigcup_k \mathcal{F}_k$ (and in particular for $A = \Omega$). Since the probability measures $Q$ and $M_\infty P$ then coincide on the $\pi$-system $\bigcup_k \mathcal{F}_k$, they agree also on the $\sigma$-algebra generated by this system (recall Proposition 1.1.39).
(c). To deal with the general case, where $M_n$ is not necessarily uniformly $P$-integrable, consider the probability measure $S = (P+Q)/2$ and its restrictions $S_n = (P_n + Q_n)/2$ to $\mathcal{F}_n$. As $P_n \ll S_n$ and $Q_n \ll S_n$, there exist $V_n = dP_n/dS_n \ge 0$ and $W_n = dQ_n/dS_n \ge 0$ such that $V_n + W_n = 2$. Per part (a), the bounded $(V_n, \mathcal{F}_n)$ and $(W_n, \mathcal{F}_n)$ are martingales on $(\Omega, \mathcal{F}, S)$, which being bounded are U.I., so by part (b) they converge ($S$-a.s. and in $L^1$) to $V_\infty = dP/dS$ and $W_\infty = dQ/dS$, respectively.

Recall that $W_n S_n = Q_n = M_n P_n = M_n V_n S_n$, so $S$-a.s. $M_n V_n = W_n = 2 - V_n$ for all $n$. Consequently, $S$-a.s. $V_n > 0$ and $M_n = (2 - V_n)/V_n$. Considering $n \to \infty$, we deduce that $S$-a.s. $M_n \to M_\infty = (2 - V_\infty)/V_\infty = W_\infty/V_\infty$. Hence,
$$Q = W_\infty S = I_{\{V_\infty > 0\}} M_\infty V_\infty S + I_{\{V_\infty = 0\}} W_\infty S = I_{\{M_\infty < \infty\}} M_\infty P + I_{\{M_\infty = \infty\}} Q\,,$$
and since $M_\infty$ is $P$-a.s. finite, this is precisely the Lebesgue decomposition (5.5.4) of $Q$.

Proposition 5.5.13. Suppose the coordinates $X_k(\omega) = \omega_k$ of $\Omega = \mathbb{R}^{\mathbb{N}}$ are independent under both probability measures $P$ and $Q$, with the law of $X_k$ under $Q$ absolutely continuous with respect to its law under $P$, and let $Y_k$ denote the corresponding Radon–Nikodym derivative, evaluated at $X_k$. Then, $M_\infty = \prod_k Y_k$ exists a.s. under both $P$ and $Q$. If $\gamma = \prod_k P(\sqrt{Y_k})$ is positive, then $Q$ is absolutely continuous with respect to $P$, with $dQ/dP = M_\infty$; otherwise $Q \perp P$, in which case $Q$-a.s. $M_\infty = \infty$ while $P$-a.s. $M_\infty = 0$.
Remark 5.5.14. Note that the preceding $Y_k$ are identically distributed when both $P$ and $Q$ are products of i.i.d. random variables. Hence, in this case $\gamma > 0$ if and only if $P(\sqrt{Y_1}) = 1$, which with $P(Y_1) = 1$ is equivalent to $P[(\sqrt{Y_1} - 1)^2] = 0$, i.e. to having $P$-a.s. $Y_1 = 1$. The latter condition implies that $P$-a.s. $M_\infty = 1$, so $Q = P$. We thus deduce that any $Q \ne P$ that are both products of i.i.d. random variables are mutually singular, and for $n$ large enough the likelihood test of comparing $M_n$ to a fixed threshold decides correctly between the two hypotheses regarding the law of $\{X_k\}$, since $P$-a.s. $M_n \to 0$ while $Q$-a.s. $M_n \to \infty$.
Proof. We are in the setting of Theorem 5.5.11, for $\Omega = \mathbb{R}^{\mathbb{N}}$ and the filtration $\mathcal{F}^X_n = \sigma(X_k : 1 \le k \le n) \uparrow \mathcal{F}^X_\infty = \sigma(X_k, k < \infty) = \mathcal{B}_c$ (c.f. Exercise 1.2.14 and the definition of $\mathcal{B}_c$ preceding Kolmogorov's extension theorem). Here $M_n = dQ_n/dP_n = \prod_{k=1}^{n} Y_k$, and the mutual independence of $\{X_k\}$ implies that $\{Y_k\}$ are both $P$-independent and $Q$-independent (c.f. part (b) of Exercise 4.1.8). In the course of proving part (c) of Theorem 5.5.11 we have shown that $M_n \to M_\infty$ a.s. under both $P$ and $Q$. If $\gamma > 0$, then by Kakutani's theorem the martingale $M_n$ is uniformly $P$-integrable, so by part (b) of Theorem 5.5.11, $Q = M_\infty P \ll P$. In contrast, when $\gamma = 0$, Kakutani's theorem yields that $P$-a.s. $M_\infty = 0$, in which case by (5.5.4), $Q$-a.s. $M_\infty = \infty$ and $Q \perp P$.

Here is a concrete application of the preceding proposition.
Exercise 5.5.15. Suppose $P$ and $Q$ are two product probability measures on the set $\Omega = \{0,1\}^{\mathbb{N}}$ of infinite binary sequences, equipped with the product $\sigma$-algebra generated by its cylinder sets, with $p_k = P(\omega : \omega_k = 1)$ strictly between zero and one and $q_k = Q(\omega : \omega_k = 1) \in [0,1]$.
(a) Deduce from Proposition 5.5.13 that $Q$ is absolutely continuous with respect to $P$ if and only if $\sum_k \big(1 - \sqrt{p_k q_k} - \sqrt{(1-p_k)(1-q_k)}\big)$ is finite.
(b) Show that if $\sum_k |p_k - q_k|$ is finite then $Q$ is absolutely continuous with respect to $P$.
(c) Show that if $p_k, q_k \in [\varepsilon, 1-\varepsilon]$ for some $\varepsilon > 0$ and all $k$, then $Q \ll P$ if and only if $\sum_k (p_k - q_k)^2 < \infty$.
(d) Show that if $\sum_k q_k < \infty$ and $\sum_k p_k = \infty$ then $Q \perp P$, so in general the condition $\sum_k (p_k - q_k)^2 < \infty$ is not sufficient for absolute continuity of $Q$ with respect to $P$.
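The summands $1 - a_k$ of part (a), with $a_k = \sqrt{p_k q_k} + \sqrt{(1-p_k)(1-q_k)}$, are easy to examine numerically. A small sketch (the two sequences $q_k$ below are illustrative choices) contrasts a square-summable perturbation of $p_k = 1/2$ with one that is not:

```python
import math

def kakutani_terms(p, q):
    """Summands 1 - a_k of part (a), where
    a_k = sqrt(p_k q_k) + sqrt((1 - p_k)(1 - q_k));
    Q << P iff the sum of these terms converges."""
    return [1 - (math.sqrt(pk * qk) + math.sqrt((1 - pk) * (1 - qk)))
            for pk, qk in zip(p, q)]

n = 10_000
p = [0.5] * n
q_sq = [0.5 + 1 / (k + 4) for k in range(n)]              # (p_k-q_k)^2 summable
q_div = [0.5 + 0.4 / math.sqrt(k + 4) for k in range(n)]  # not summable

# Since 1 - a_k behaves like (p_k - q_k)^2 / 2 here, the first sum stays
# bounded as n grows (absolute continuity, part (c)) while the second one
# diverges logarithmically (mutual singularity).
print(sum(kakutani_terms(p, q_sq)), sum(kakutani_terms(p, q_div)))
```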
In the spirit of Theorem 5.5.11, as you show next, a positive martingale $(Z_n, \mathcal{F}_n)$ induces a collection of probability measures $Q_n$ that are equivalent to $P_n = P|_{\mathcal{F}_n}$ (i.e. both $Q_n \ll P_n$ and $P_n \ll Q_n$), and satisfy a certain martingale Bayes rule. In particular, the following discrete time analog of Girsanov's theorem shows that such a construction can significantly simplify certain computations upon moving from $P_n$ to $Q_n$.
Exercise 5.5.16. Suppose $(Z_n, \mathcal{F}_n)$ is a (strictly) positive MG on $(\Omega, \mathcal{F}, P)$, normalized so that $EZ_0 = 1$. Let $P_n = P|_{\mathcal{F}_n}$ and consider the equivalent probability measure $Q_n$ on $(\Omega, \mathcal{F}_n)$ of Radon–Nikodym derivative $dQ_n/dP_n = Z_n$.
(a) Show that $Q_k = Q_n|_{\mathcal{F}_k}$ for any $0 \le k \le n$.
(b) Fixing $0 \le k \le m \le n$ and $Y \in L^1(\Omega, \mathcal{F}_m, P)$, show that $Q_n$-a.s. (hence also $P$-a.s.), $E_{Q_n}[Y \mid \mathcal{F}_k] = E[Y Z_m \mid \mathcal{F}_k]/Z_k$.
(c) For $\mathcal{F}_n = \mathcal{F}^\xi_n$, the canonical filtration of i.i.d. standard normal variables $\{\xi_k\}$, and any bounded, $\mathcal{F}^\xi_n$-predictable $\{V_n\}$, consider the measures $Q_n$ induced by the exponential martingale $Z_n = \exp(Y_n - \frac{1}{2} \sum_{k=1}^{n} V_k^2)$, where $Y_n = \sum_{k=1}^{n} \xi_k V_k$. Show that $X$ of coordinates $X_m = \sum_{k=1}^{m} (\xi_k - V_k)$, $1 \le m \le n$, is under $Q_n$ a Gaussian random vector whose law is the same as that of $\{\sum_{k=1}^{m} \xi_k : 1 \le m \le n\}$ under $P$.
Hint: Use characteristic functions.
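A Monte Carlo sketch of part (c)'s change of measure (the constant $V_k = 1/2$ and the sample size are illustrative choices, not from the text): sampling under $P$ and reweighting by $Z_n$ should reproduce the $P$-law moments of the unshifted walk, i.e. $E_{Q_n}[X_n] = 0$ and $E_{Q_n}[X_n^2] = n$.

```python
import math, random

def girsanov_moments(n, V, n_samples, rng):
    """Estimate E_{Q_n}[X_n] and E_{Q_n}[X_n^2] by sampling i.i.d. standard
    normals under P and weighting each path by the exponential martingale
    Z_n = exp(sum_k xi_k V - n V^2 / 2); here V_k = V is constant."""
    w_sum = wx = wx2 = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        Z = math.exp(V * sum(xi) - 0.5 * n * V * V)  # likelihood dQ_n/dP_n
        X = sum(x - V for x in xi)                   # shifted walk X_n
        w_sum += Z
        wx += Z * X
        wx2 += Z * X * X
    return wx / w_sum, wx2 / w_sum                   # self-normalized estimates

mean, second = girsanov_moments(n=5, V=0.5, n_samples=100_000, rng=random.Random(0))
# Under Q_n the increments xi_k - V are again i.i.d. standard normal, so the
# two estimates should be close to 0 and to n = 5, respectively.
print(mean, second)
```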
5.5.3. Reversed martingales and 0-1 laws. Reversed martingales, which we next define, though less common than martingales, are key tools in the proof of many asymptotics (e.g. 0-1 laws).

Definition 5.5.17. A reversed martingale (in short, RMG) is a martingale indexed by the nonpositive integers. That is, integrable $X_n$, $n \le 0$, adapted to a filtration $\mathcal{F}_n$, $n \le 0$ (so $\mathcal{F}_k \subseteq \mathcal{F}_n$ whenever $k \le n \le 0$), such that $E[X_{n+1} \mid \mathcal{F}_n] = X_n$ for all $n \le -1$. We write $\mathcal{F}_n \downarrow \mathcal{F}_{-\infty}$ for such a filtration $\{\mathcal{F}_n\}_{n \le 0}$ and the associated $\sigma$-algebra $\mathcal{F}_{-\infty} = \bigcap_{n \le 0} \mathcal{F}_n$.

Theorem 5.5.18. $(X_n, \mathcal{F}_n)$ indexed by $n \le 0$ is a RMG if and only if $X_n = E[X_0 \mid \mathcal{F}_n]$ for all $n \le 0$, in which case, as $n \to -\infty$, $X_n \to X_{-\infty} = E[X_0 \mid \mathcal{F}_{-\infty}]$ almost surely and in $L^1$.

Remark. Equivalently, for any integrable $Y$ and $\mathcal{F}_n \downarrow \mathcal{F}_{-\infty}$, as $n \to -\infty$ we have $E[Y \mid \mathcal{F}_n] \to E[Y \mid \mathcal{F}_{-\infty}]$ almost surely and in $L^1$. This is merely a restatement of Levy's downward theorem, since for $X_0 = E[Y \mid \mathcal{F}_0]$ we have by the tower property that $E[Y \mid \mathcal{F}_n] = E[X_0 \mid \mathcal{F}_n]$ for any $n \le 0$.
Proof. Suppose $(X_n, \mathcal{F}_n)$ is a RMG. Then, fixing $n < 0$ and applying Proposition 5.1.20 for the MG $(X_{n+k}, \mathcal{F}_{n+k})$, $k = 0, \ldots, -n$ (taking there $\ell = -n > m = 0$), we deduce that $E[X_0 \mid \mathcal{F}_n] = X_n$. Conversely, suppose $X_n = E[X_0 \mid \mathcal{F}_n]$ for $X_0$ integrable and all $n \le 0$. Then, $X_n \in L^1(\Omega, \mathcal{F}_n, P)$ by the definition of C.E. and further, with $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$, we have by the tower property that
$$X_n = E[X_0 \mid \mathcal{F}_n] = E[E(X_0 \mid \mathcal{F}_{n+1}) \mid \mathcal{F}_n] = E[X_{n+1} \mid \mathcal{F}_n]\,,$$
so any such $(X_n, \mathcal{F}_n)$ is a RMG.
Setting hereafter $X_n = E[X_0 \mid \mathcal{F}_n]$, note that for each $n \le 0$ and $a < b$, by Doob's upcrossing inequality for the MG $(X_{n+k}, \mathcal{F}_{n+k})$, $k = 0, \ldots, -n$, we have that $E(U_n[a,b]) \le (b-a)^{-1} E[(X_0 - a)^-]$ (where $U_n[a,b]$ denotes the number of upcrossings of the interval $[a,b]$ by $X_k(\omega)$, $k = n, \ldots, 0$). By monotone convergence this implies that the total upcrossing count $U[a,b] = \lim_{n \to -\infty} U_n[a,b]$ satisfies $E(U[a,b]) \le (b-a)^{-1} E[(X_0 - a)^-]$, which is finite (as $X_0$ is integrable). Hence, as in the proof of Doob's convergence theorem, a.s. $X_n$ converges to some $X_{-\infty}$ when $n \to -\infty$; since the collection $\{E[X_0 \mid \mathcal{F}_n]\}$ is U.I., this convergence holds in $L^1$ as well.
We now complete the proof by showing that $X_{-\infty} = E[X_0 \mid \mathcal{F}_{-\infty}]$. Indeed, fixing $k \le 0$, since $X_n \in m\mathcal{F}_k$ for all $n \le k$, it follows that $X_{-\infty} = \limsup_{n \to -\infty} X_n$ is also in $m\mathcal{F}_k$. This applies for all $k \le 0$, hence $X_{-\infty} \in m[\bigcap_{k \le 0} \mathcal{F}_k] = m\mathcal{F}_{-\infty}$. Further, $E[X_n I_A] \to E[X_{-\infty} I_A]$ for any $A \in \mathcal{F}_{-\infty}$ (by the $L^1$ convergence of $X_n$ to $X_{-\infty}$), and as $A \in \mathcal{F}_{-\infty} \subseteq \mathcal{F}_n$, also $E[X_0 I_A] = E[X_n I_A]$ for all $n \le 0$. Thus, $E[X_{-\infty} I_A] = E[X_0 I_A]$ for all $A \in \mathcal{F}_{-\infty}$, and consequently $X_{-\infty} = E[X_0 \mid \mathcal{F}_{-\infty}]$.
Similarly to Levy's upward theorem, as you show next, Levy's downward theorem can be extended to accommodate a dominated sequence of random variables, as well as $L^p$ convergence in case $X_0 \in L^p$ for some $p > 1$.

Exercise 5.5.19. Suppose $\mathcal{F}_n \downarrow \mathcal{F}_{-\infty}$ and $Y_n \xrightarrow{a.s.} Y_{-\infty}$ as $n \to -\infty$. Show that if $\sup_n |Y_n|$ is integrable, then $E[Y_n \mid \mathcal{F}_n] \xrightarrow{a.s.} E[Y_{-\infty} \mid \mathcal{F}_{-\infty}]$ when $n \to -\infty$.
Exercise 5.5.20. Suppose $(X_n, \mathcal{F}_n)$ is a RMG. Show that if $E|X_0|^p$ is finite for some $p > 1$, then $X_n \xrightarrow{L^p} X_{-\infty}$ when $n \to -\infty$.

Not all reversed subMGs are U.I., but here is an explicit characterization of those that are.

Exercise 5.5.21. Show that a reversed subMG $X_n$ is U.I. if and only if $\inf_n EX_n$ is finite.
Our first application of RMGs is to provide an alternative proof of the strong law of large numbers of Theorem 2.3.3, with the added bonus of $L^1$ convergence.

Theorem 5.5.22 (Strong law of large numbers). Suppose $S_n = \sum_{k=1}^{n} \xi_k$ for i.i.d. integrable $\{\xi_k\}$. Then, $n^{-1} S_n \to E\xi_1$ a.s. and in $L^1$ when $n \to \infty$.
Proof. Let $X_{-m} = (m+1)^{-1} S_{m+1}$ for $m \ge 0$, and define the corresponding filtration $\mathcal{F}_{-m} = \sigma(X_{-k}, k \ge m)$. Recall part (a) of Exercise 4.4.8, that $X_{-n} = E[\xi_1 \mid X_{-n}]$ for each $n \ge 0$. Further, clearly $\mathcal{F}_{-n} = \sigma(\mathcal{G}_n, \mathcal{T}_n)$ for $\mathcal{G}_n = \sigma(X_{-n})$ and $\mathcal{T}_n = \sigma(\xi_r, r > n)$. With $\mathcal{T}_n$ independent of $\sigma(\sigma(\xi_1), \mathcal{G}_n)$, we thus have that $X_{-n} = E[\xi_1 \mid \mathcal{F}_{-n}]$ for each $n \ge 0$ (see Proposition 4.2.3). Consequently, $(X_n, \mathcal{F}_n)$ is a RMG, which by Levy's downward theorem converges for $n \to -\infty$, both a.s. and in $L^1$, to the finite valued random variable $X_{-\infty} = E[\xi_1 \mid \mathcal{F}_{-\infty}]$. Combining this and the tower property leads to $EX_{-\infty} = E\xi_1$, so it only remains to show that $P(X_{-\infty} \ne c) = 0$ for some nonrandom constant $c$. To this end, note that for any $\ell$ finite,
$$X_{-\infty} = \limsup_{m \to \infty} \frac{1}{m} \sum_{k=1}^{m} \xi_k = \limsup_{m \to \infty} \frac{1}{m} \sum_{k=\ell+1}^{m} \xi_k\,.$$
Clearly, $X_{-\infty} \in m\mathcal{T}_\ell$ for any $\ell$, so $X_{-\infty}$ is measurable on the tail $\sigma$-algebra $\mathcal{T} = \bigcap_\ell \mathcal{T}_\ell$ of the sequence $\{\xi_k\}$. We complete the proof upon noting that the $\sigma$-algebra $\mathcal{T}$ is $P$-trivial (by Kolmogorov's 0-1 law and the independence of $\{\xi_k\}$), so in particular, a.s. $X_{-\infty} = c$ for some nonrandom constant $c$.
Definition 5.5.23. For random variables $\{\xi_k\}$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$, let $\mathcal{E}_n$ denote the $\sigma$-algebra of those events that are invariant under all permutations of the first $n$ coordinates of $\omega$. The exchangeable $\sigma$-algebra $\mathcal{E} = \bigcap_n \mathcal{E}_n$ then consists of all events that are invariant under all finite permutations of coordinates. Similarly, we call an infinite sequence of R.V.-s $\{\xi_k\}_{k \ge 1}$ on the same measurable space exchangeable if $(\xi_1, \ldots, \xi_m) \stackrel{\mathcal{D}}{=} (\xi_{\pi(1)}, \ldots, \xi_{\pi(m)})$ for any $m$ and any permutation $\pi$ of $\{1, \ldots, m\}$; that is, the joint law of an exchangeable sequence is invariant under finite permutations of coordinates.
Our next lemma summarizes the use of RMGs in this context.

Lemma 5.5.24. Suppose $\xi_k(\omega) = \omega_k$ is an exchangeable sequence of random variables on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$. For any bounded Borel function $\varphi : \mathbb{R}^\ell \to \mathbb{R}$ and $m \ge \ell$, let
$$\widehat{S}_m(\varphi) = \frac{1}{(m)_\ell} \sum_{\mathbf{i}} \varphi(\xi_{i_1}, \ldots, \xi_{i_\ell})\,,$$
where $\mathbf{i} = (i_1, \ldots, i_\ell)$ ranges over the $\ell$-tuples of distinct integers from $\{1, \ldots, m\}$ and $(m)_\ell = \frac{m!}{(m-\ell)!}$ is the number of such tuples. Then,
(5.5.5) $\qquad \widehat{S}_m(\varphi) \to E[\varphi(\xi_1, \ldots, \xi_\ell) \mid \mathcal{E}]$, a.s. and in $L^1$, when $m \to \infty$.
Proof. Fixing $m \ge \ell$, since the value of $\widehat{S}_m(\varphi)$ is invariant under any permutation of the first $m$ coordinates of $\omega$, we have that $\widehat{S}_m(\varphi)$ is measurable on $\mathcal{E}_m$. Further, this bounded R.V. is obviously integrable, so
(5.5.6) $\qquad \widehat{S}_m(\varphi) = E[\widehat{S}_m(\varphi) \mid \mathcal{E}_m] = \frac{1}{(m)_\ell} \sum_{\mathbf{i}} E[\varphi(\xi_{i_1}, \ldots, \xi_{i_\ell}) \mid \mathcal{E}_m]\,.$
Fixing any tuple of distinct integers $i_1, \ldots, i_\ell$ from $\{1, \ldots, m\}$, the exchangeability of $\{\xi_k\}$ implies that $E[\varphi(\xi_{i_1}, \ldots, \xi_{i_\ell}) I_A] = E[\varphi(\xi_1, \ldots, \xi_\ell) I_A]$ for any $A \in \mathcal{E}_m$, implying that $E[\varphi(\xi_{i_1}, \ldots, \xi_{i_\ell}) \mid \mathcal{E}_m] = E[\varphi(\xi_1, \ldots, \xi_\ell) \mid \mathcal{E}_m]$. Since this applies for any tuple of distinct integers from $\{1, \ldots, m\}$, it follows from (5.5.6) that $\widehat{S}_m(\varphi) = E[\varphi(\xi_1, \ldots, \xi_\ell) \mid \mathcal{E}_m]$ for all $m \ge \ell$. In conclusion, considering the filtration $\mathcal{F}_{-m} = \mathcal{E}_m$, for which $\mathcal{F}_{-m} \downarrow \mathcal{E}$, we see that $(\widehat{S}_m(\varphi), \mathcal{E}_m)$, $m \ge \ell$, is a RMG, and the convergence in (5.5.5) holds a.s. and in $L^1$ by Levy's downward theorem.
Remark. Noting that any sequence of i.i.d. random variables is also exchangeable, our first application of Lemma 5.5.24 is the following zero-one law.

Theorem 5.5.25 (Hewitt–Savage 0-1 law). The exchangeable $\sigma$-algebra $\mathcal{E}$ of a sequence of i.i.d. random variables $\xi_k(\omega) = \omega_k$ is $P$-trivial (that is, $P(A) \in \{0, 1\}$ for any $A \in \mathcal{E}$).
Remark. Given the Hewitt–Savage 0-1 law, we can simplify the proof of Theorem 5.5.22, upon noting that for each $m$ the $\sigma$-algebra $\mathcal{T}_m$ is contained in $\mathcal{E}_{m+1}$, hence the tail $\sigma$-algebra $\mathcal{T}$ is contained in $\mathcal{E}$.

Proof. Fixing $\ell$ and a bounded Borel function $\varphi : \mathbb{R}^\ell \to \mathbb{R}$, almost surely $\widehat{S}_m(\varphi) \to \widehat{S}_\infty(\varphi) = E[\varphi(\xi_1, \ldots, \xi_\ell) \mid \mathcal{E}]$ (see Lemma 5.5.24). We proceed to show that $\widehat{S}_\infty(\varphi) = E[\varphi(\xi_1, \ldots, \xi_\ell)]$. To this end, let
$$\widehat{S}_{m,r}(\varphi) = \frac{1}{(m)_\ell} \sum_{\{\mathbf{i} \,:\, i_1 > r, \ldots, i_\ell > r\}} \varphi(\xi_{i_1}, \ldots, \xi_{i_\ell})$$
denote the contribution of the tuples $\mathbf{i}$ that do not intersect $\{1, \ldots, r\}$. Since there are exactly $(m-r)_\ell$ such tuples,
$$|\widehat{S}_m(\varphi) - \widehat{S}_{m,r}(\varphi)| \le \Big[1 - \frac{(m-r)_\ell}{(m)_\ell}\Big] \sup_x |\varphi(x)| \le \frac{c}{m}$$
for some $c = c(r, \ell, \varphi)$ finite and all $m$. Consequently, for any $r$,
$$\widehat{S}_\infty(\varphi) = \lim_{m \to \infty} \widehat{S}_m(\varphi) = \lim_{m \to \infty} \widehat{S}_{m,r}(\varphi)\,.$$
Further, by the mutual independence of $\{\xi_k\}$ we have that $\widehat{S}_{m,r}(\varphi)$ are independent of $\mathcal{F}^\xi_r = \sigma(\xi_1, \ldots, \xi_r)$, hence the same applies for their limit $\widehat{S}_\infty(\varphi)$. As this holds for every finite $r$, the R.V. $\widehat{S}_\infty(\varphi)$, being measurable on $\sigma(\xi_k, k \ge 1)$ and independent of each $\mathcal{F}^\xi_r$, is a.s. constant, whence $E[\varphi(\xi_1, \ldots, \xi_\ell) \mid \mathcal{E}] = E[\varphi(\xi_1, \ldots, \xi_\ell)]$. Recall that $I_G$ is, for each $G \in \mathcal{F}^\xi_\ell$, of the form $\varphi(\xi_1, \ldots, \xi_\ell)$ for some bounded Borel $\varphi$. Hence, $E[I_G \mid \mathcal{E}] = E[I_G]$. Thus, by the tower property and taking out what is known, $P(A \cap G) = E[I_A E[I_G \mid \mathcal{E}]] = E[I_A] E[I_G]$ for any $A \in \mathcal{E}$. That is, $\mathcal{E}$ and $\mathcal{F}^\xi_\ell$ are independent for each finite $\ell$, so by Lemma 1.4.8 we conclude that $\mathcal{E}$ is a $P$-trivial $\sigma$-algebra, as claimed.
The proof of de Finetti's theorem requires the following algebraic identity, which we leave as an exercise for the reader.

Exercise 5.5.26. Fixing bounded Borel functions $f : \mathbb{R}^{\ell-1} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, let $h_j(x_1, \ldots, x_\ell) = f(x_1, \ldots, x_{\ell-1})\, g(x_j)$ for $j = 1, \ldots, \ell$. Show that for any sequence $\{\xi_k\}$ and any $m \ge \ell$,
$$\widehat{S}_m(h_\ell) = \frac{m}{m - \ell + 1}\, \widehat{S}_m(f)\, \widehat{S}_m(g) \;-\; \frac{1}{m - \ell + 1} \sum_{j=1}^{\ell - 1} \widehat{S}_m(h_j)\,.$$
Theorem 5.5.27 (de Finetti's theorem). If $\xi_k(\omega) = \omega_k$ is an exchangeable sequence, then conditional on $\mathcal{E}$ the random variables $\xi_k$, $k \ge 1$, are mutually independent and identically distributed.

Remark. For example, if exchangeable $\{\xi_k\}$ are $\{0,1\}$-valued, then by de Finetti's theorem these are i.i.d. Bernoulli variables of parameter $p$, conditional on $\mathcal{E}$. The joint (unconditional) law of $\{\xi_k\}$ is thus that of a mixture of i.i.d. Bernoulli($p$) sequences, with $p$ a $[0,1]$-valued random variable, measurable on $\mathcal{E}$.
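A quick numerical sketch of this mixture representation (the uniform prior on $p$ is an illustrative choice): drawing $p$ uniformly and then an i.i.d. Bernoulli($p$) pair produces an exchangeable but not independent pair, since $P(\xi_1 = \xi_2 = 1) = E[p^2] = 1/3$ exceeds $P(\xi_1 = 1) P(\xi_2 = 1) = 1/4$.

```python
import random

def sample_pairs(n_samples, rng):
    """Draw (xi_1, xi_2): first p ~ Uniform[0, 1], then two i.i.d.
    Bernoulli(p) coordinates -- an exchangeable pair by construction."""
    both = first = 0
    for _ in range(n_samples):
        p = rng.random()
        x1, x2 = rng.random() < p, rng.random() < p
        both += x1 and x2
        first += x1
    return first / n_samples, both / n_samples

marginal, joint = sample_pairs(200_000, random.Random(0))
# marginal should be near 1/2 and joint near E[p^2] = 1/3 > 1/4, exhibiting
# the positive correlation of the exchangeable (but not i.i.d.) coordinates.
print(marginal, joint)
```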
Proof. In view of Exercise 5.5.26, upon applying (5.5.5) of Lemma 5.5.24 for the exchangeable sequence $\{\xi_k\}$ and bounded Borel functions $f$, $g$ and $h_\ell$, we deduce that
$$E[f(\xi_1, \ldots, \xi_{\ell-1})\, g(\xi_\ell) \mid \mathcal{E}] = E[f(\xi_1, \ldots, \xi_{\ell-1}) \mid \mathcal{E}]\, E[g(\xi_\ell) \mid \mathcal{E}]\,.$$
By induction on $\ell$ this leads to the identity
$$E\Big[\prod_{k=1}^{\ell} g_k(\xi_k) \,\Big|\, \mathcal{E}\Big] = \prod_{k=1}^{\ell} E[g_k(\xi_k) \mid \mathcal{E}]$$
for all $\ell$ and bounded Borel $g_k : \mathbb{R} \to \mathbb{R}$. Taking $g_k = I_{B_k}$ for $B_k \in \mathcal{B}$, we have
$$P[(\xi_1, \ldots, \xi_\ell) \in B_1 \times \cdots \times B_\ell \mid \mathcal{E}] = \prod_{k=1}^{\ell} P(\xi_k \in B_k \mid \mathcal{E})\,,$$
which implies that conditional on $\mathcal{E}$ the R.V.-s $\xi_k$ are mutually independent (see Proposition 1.4.21). Further, $E[g(\xi_1) I_A] = E[g(\xi_r) I_A]$ for any $A \in \mathcal{E}$, bounded Borel $g(\cdot)$, positive integer $r$ and exchangeable variables $\xi_k(\omega) = \omega_k$, from which it follows that conditional on $\mathcal{E}$ these R.V.-s are also identically distributed.
We conclude this section with exercises detailing further applications of RMGs: for the study of certain U-statistics, for solving the ballot problem, and in the context of mixing conditions.

Exercise 5.5.28. Suppose $\{\xi_k\}$ are i.i.d. random variables and $h : \mathbb{R}^2 \to \mathbb{R}$ is a Borel function such that $E[|h(\xi_1, \xi_2)|] < \infty$. For each $m \ge 2$ let
$$W_{-m} = \frac{1}{m(m-1)} \sum_{1 \le i \ne j \le m} h(\xi_i, \xi_j)\,.$$
For example, note that $W_{-m} = \frac{1}{m-1} \sum_{k=1}^{m} \big(\xi_k - m^{-1} \sum_{i=1}^{m} \xi_i\big)^2$ is of this form, corresponding to $h(x,y) = (x-y)^2/2$.
(a) Show that $W_n = E[h(\xi_1, \xi_2) \mid \mathcal{F}^W_n]$ for $n \le -2$, hence $(W_n, \mathcal{F}^W_n)$ is a RMG, and determine its almost sure limit as $n \to -\infty$.
(b) Assuming in addition that $v = E[h(\xi_1, \xi_2)^2]$ is finite, find the limit of $E[W_n^2]$ as $n \to -\infty$.
Exercise 5.5.29 (The ballot problem). Let $S_k = \sum_{i=1}^{k} \xi_i$ for i.i.d. integer valued $\xi_j \ge 0$, and for $n \ge 2$ consider the event $\Gamma_n = \{S_j < j \text{ for } 1 \le j \le n\}$.
(a) Show that $X_{-k} = k^{-1} S_k$ is a RMG for the filtration $\mathcal{F}_{-k} = \sigma(S_j, j \ge k)$, and that $\tau = \min\{-n \le m \le -1 : X_m \ge 1\}$, with $\tau = -1$ when no such $m$ exists, is a stopping time for this filtration.
(b) Show that on $\{S_n < n\}$ one has $X_\tau = I_{\Gamma_n^c}$, whereas $X_\tau \ge 1$ whenever $S_n \ge n$; hence, by the optional stopping theorem, $P(\Gamma_n \mid S_n) = (1 - S_n/n)^+$.
The name ballot problem is attached to Exercise 5.5.29 since, for $\xi_j \in \{0, 2\}$, we interpret the 0-s and 2-s as $n$ votes for two candidates A and B in a ballot, with $\Gamma_n = \{$A leads B throughout the counting$\}$ and $P(\Gamma_n \mid$ B gets $r$ votes$) = (1 - 2r/n)^+$.

As you find next, the ballot problem yields explicit formulas for the probability distributions of the stopping times $\tau_b = \inf\{n \ge 0 : S_n = b\}$ associated with the srw $S_n$.
Exercise 5.5.30. Let $R = \inf\{\ell \ge 1 : S_\ell \ldots$

$\ldots = \int_A \nu_1(dx)\, \nu_2(x, B)$, for $A \in \mathcal{X}$, $B \in \mathcal{S}$.
We turn to show how relevant the preceding proposition is for Markov chains.
Proposition 6.1.5. To any finite measure $\mu$ on $(S, \mathcal{S})$ and any sequence of transition probabilities $p_n(\cdot, \cdot)$ there correspond unique finite measures $\mu_k = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}$ on $(S^{k+1}, \mathcal{S}^{k+1})$, $k = 1, 2, \ldots$, such that
$$\mu_k(A_0 \times \cdots \times A_k) = \int_{A_0} \mu(dx_0) \int_{A_1} p_0(x_0, dx_1) \cdots \int_{A_k} p_{k-1}(x_{k-1}, dx_k)$$
for any $A_i \in \mathcal{S}$, $i = 0, \ldots, k$. If $\mu$ is a probability measure, then $\{\mu_k\}$ is a consistent sequence of probability measures (that is, $\mu_{k+1}(A \times S) = \mu_k(A)$ for any $k$ finite and $A \in \mathcal{S}^{k+1}$).

Further, if $X_n$ is a Markov chain with state space $(S, \mathcal{S})$, transition probabilities $p_n(\cdot, \cdot)$ and initial distribution $\mu(A) = P(X_0 \in A)$ on $(S, \mathcal{S})$, then for any $h_\ell \in b\mathcal{S}$ and all $k \ge 0$,
(6.1.3) $\qquad E\Big[\prod_{\ell=0}^{k} h_\ell(X_\ell)\Big] = \int \mu(dx_0)\, h_0(x_0) \int p_0(x_0, dx_1)\, h_1(x_1) \cdots \int p_{k-1}(x_{k-1}, dx_k)\, h_k(x_k)\,,$
so in particular, $X_n$ has the finite dimensional distributions (f.d.d.)
(6.1.4) $\qquad P(X_0 \in A_0, \ldots, X_n \in A_n) = \mu \otimes p_0 \otimes \cdots \otimes p_{n-1}(A_0 \times \cdots \times A_n)\,.$
Proof. Starting at a finite measure $\nu_1 = \mu$ on $(S, \mathcal{S})$ and applying Proposition 6.1.4 for $\nu_2(x, B) = p_0(x, B)$ on $S \times \mathcal{S}$ yields the finite measure $\mu_1 = \mu \otimes p_0$ on $(S^2, \mathcal{S}^2)$. Applying this proposition once more, now with $\nu_1 = \mu_1$ and $\nu_2((x_0, x_1), B) = p_1(x_1, B)$ for $x = (x_0, x_1) \in S \times S$, yields the finite measure $\mu_2 = \mu \otimes p_0 \otimes p_1$ on $(S^3, \mathcal{S}^3)$, and upon repeating this procedure $k$ times we arrive at the finite measure $\mu_k = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}$ on $(S^{k+1}, \mathcal{S}^{k+1})$. Since $p_n(x, S) = 1$ for all $n$ and $x \in S$, it follows that if $\mu$ is a probability measure, so are $\{\mu_k\}$, which by construction are also consistent.
Suppose next that the Markov chain $X_n$ has transition probabilities $p_n(\cdot, \cdot)$ and initial distribution $\mu$. Fixing $k$ and $h_\ell \in b\mathcal{S}$, $\ell = 0, \ldots, k$, we have by the tower property that
$$E\Big[\prod_{\ell=0}^{k} h_\ell(X_\ell)\Big] = E\Big[\prod_{\ell=0}^{k-1} h_\ell(X_\ell)\, E(h_k(X_k) \mid \mathcal{F}^X_{k-1})\Big] = E\Big[\prod_{\ell=0}^{k-1} h_\ell(X_\ell)\, (p_{k-1} h_k)(X_{k-1})\Big]\,.$$
Further, with $p_{k-1} h_k \in b\mathcal{S}$ (see Lemma 6.1.3), also $h_{k-1}(p_{k-1} h_k) \in b\mathcal{S}$, and we get (6.1.3) by induction on $k$, starting at $Eh_0(X_0) = \int \mu(dx_0)\, h_0(x_0)$. The formula (6.1.4) for the f.d.d. is merely the special case of (6.1.3) corresponding to indicator functions $h_\ell = I_{A_\ell}$.
Remark 6.1.6. Using (6.1.1) we deduce from Exercise 4.4.5 that any $\mathcal{F}_n$-Markov chain with a $\mathcal{B}$-isomorphic state space has transition probabilities. We proceed to define the law of such a Markov chain and, building on Proposition 6.1.5, show that it is uniquely determined by the initial distribution and transition probabilities of the chain.

Definition 6.1.7. The law of a Markov chain $X_n$ with a $\mathcal{B}$-isomorphic state space $(S, \mathcal{S})$ and initial distribution $\mu$ is the unique probability measure $P_\mu$ on $(S^\infty, \mathcal{S}_c)$, with $S^\infty = S^{\mathbb{Z}_+}$, per Corollary 1.4.25, with the specified f.d.d.
$$P_\mu(\{s : s_i \in A_i,\ i = 0, \ldots, n\}) = P(X_0 \in A_0, \ldots, X_n \in A_n)\,,$$
for $A_i \in \mathcal{S}$. We denote by $P_x$ the law $P_\mu$ in case $\mu(A) = I_{x \in A}$ (i.e. when $X_0 = x$ is nonrandom).

Remark. Definition 6.1.7 provides the (joint) law for any stochastic process $X_n$ with a $\mathcal{B}$-isomorphic state space (that is, it applies for any sequence of $(S, \mathcal{S})$-valued R.V. on the same probability space).
Here is our canonical construction of Markov chains out of their transition probabilities and initial distributions.

Theorem 6.1.8. If $(S, \mathcal{S})$ is $\mathcal{B}$-isomorphic, then to any collection of transition probabilities $p_n : S \times \mathcal{S} \to [0,1]$ and any probability measure $\mu$ on $(S, \mathcal{S})$ there corresponds a Markov chain $Y_n(s) = s_n$ on the measurable space $(S^\infty, \mathcal{S}_c)$ with state space $(S, \mathcal{S})$, transition probabilities $p_n(\cdot, \cdot)$, initial distribution $\mu$ and f.d.d.
(6.1.5) $\qquad P_\mu(\{s : (s_0, \ldots, s_k) \in A\}) = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}(A) \qquad \forall A \in \mathcal{S}^{k+1},\ k < \infty\,.$

Remark. In particular, this construction implies that for any probability measure $\mu$ on $(S, \mathcal{S})$ and all $A \in \mathcal{S}_c$,
(6.1.6) $\qquad P_\mu(A) = \int \mu(dx)\, P_x(A)\,.$
We shall use the latter identity as an alternative definition for $P_\mu$, that is applicable even for a non-finite initial measure $\mu$ (namely, when $\mu(S) = \infty$), noting that if $\mu$ is finite then $P_\mu$ is the unique measure on $(S^\infty, \mathcal{S}_c)$ for which (6.1.5) holds (see the remark following Corollary 1.4.25).
Proof. The given transition probabilities $p_n(\cdot, \cdot)$ and probability measure $\mu$ on $(S, \mathcal{S})$ determine the consistent probability measures $\mu_k = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}$ per Proposition 6.1.5, and thereby via Corollary 1.4.25 yield the stochastic process $Y_n(s) = s_n$ on $(S^\infty, \mathcal{S}_c)$, of law $P_\mu$ satisfying (6.1.5). Further, for any $k \ge 0$, $A \in \mathcal{S}^{k+1}$ and $B \in \mathcal{S}$,
$$P_\mu\big((Y_0, \ldots, Y_k) \in A,\ Y_{k+1} \in B\big) = \mu_{k+1}(A \times B) = \int_A \mu_k(dy)\, p_k(y_k, B) = E[\,I_{\{(Y_0, \ldots, Y_k) \in A\}}\, p_k(Y_k, B)\,]$$
(where the first and last equalities are due to (6.1.5)). Consequently, for any $B \in \mathcal{S}$ and $k \ge 0$ finite, $p_k(Y_k, B)$ is a version of the C.E. $E[I_{\{Y_{k+1} \in B\}} \mid \mathcal{F}^Y_k]$ for $\mathcal{F}^Y_k = \sigma(Y_0, \ldots, Y_k)$, thus showing that $Y_n$ is a Markov chain of transition probabilities $p_n(\cdot, \cdot)$.
Remark. Conversely, given a Markov chain $X_n$ with a $\mathcal{B}$-isomorphic state space, applying this construction for its transition probabilities and initial distribution yields a Markov chain $Y_n$ that has the same law as $X_n$. To see this, recall (6.1.4) that the f.d.d. of a Markov chain are uniquely determined by its transition probabilities and initial distribution, and further, for a $\mathcal{B}$-isomorphic state space the f.d.d. uniquely determine the law $P_\mu$ on $(S^\infty, \mathcal{S}_c)$.

When the state space $S$ is countable (with $\mathcal{S} = 2^S$), the transition probabilities are of the form
$$p_n(x, A) = \sum_{y \in A} p_n(x, y)\,,$$
for any $A \subseteq S$, so the transition probabilities are determined by $p_n(x, y) \ge 0$ such that $\sum_{y \in S} p_n(x, y) = 1$ for all $n$ and $x \in S$ (and all Lebesgue integrals are in this case merely sums). In particular, if $S$ is a finite set and the chain is homogeneous, then identifying $S$ with $\{1, \ldots, m\}$ for some $m < \infty$, we view $p(x, y)$ as the $(x,y)$-th entry of an $m \times m$ dimensional transition probability matrix, and express probabilities of interest in terms of powers of the latter matrix.
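A minimal sketch of this matrix viewpoint (the two-state chain below is an illustrative choice): the $(x,y)$-th entry of the $n$-th matrix power is $P(X_n = y \mid X_0 = x)$.

```python
def mat_mult(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """n-th power of the transition matrix P, whose (x, y) entry is the
    n-step transition probability P(X_n = y | X_0 = x)."""
    Q = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        Q = mat_mult(Q, P)
    return Q

# Two-state homogeneous chain; each row of P sums to one.
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Both rows of the 50-step matrix are close to the stationary distribution
# (5/6, 1/6) of this chain.
print(n_step(P, 50))
```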
For homogeneous Markov chains whose state space is $S = \mathbb{R}^d$ (or a product of closed intervals thereof), equipped with the corresponding Borel $\sigma$-algebra, computations are relatively explicit when for each $x \in S$ the transition probability $p(x, \cdot)$ is absolutely continuous with respect to (the completion of) Lebesgue measure on $S$. Its nonnegative Radon–Nikodym derivative $p(x, y)$ is then called the transition probability kernel of the chain. In this case $(ph)(x) = \int h(y)\, p(x, y)\, dy$, and the right side of (6.1.4) amounts to iterated integrations of the kernel $p(x, y)$ with respect to Lebesgue measure on $S$.
Here are a few homogeneous Markov chains of considerable interest in probability theory and its applications.

Example 6.1.9 (Random walk). The random walk $S_n = S_0 + \sum_{k=1}^{n} \xi_k$, where $\{\xi_k\}$ are i.i.d. $\mathbb{R}^d$-valued random variables that are also independent of $S_0$, is an example of a homogeneous Markov chain. Indeed, $S_{n+1} = S_n + \xi_{n+1}$ with $\xi_{n+1}$ independent of $\mathcal{F}^S_n = \sigma(S_0, \ldots, S_n)$. Hence, $P[S_{n+1} \in A \mid \mathcal{F}^S_n] = P[S_n + \xi_{n+1} \in A \mid S_n]$. With $\xi_{n+1}$ having the same law as $\xi_1$, we thus get that $P[S_n + \xi_{n+1} \in A \mid S_n] = p(S_n, A)$ for the transition probabilities $p(x, A) = P(\xi_1 \in \{y - x : y \in A\})$ (c.f. Exercise 4.2.2) and the state space $S = \mathbb{R}^d$ (with its Borel $\sigma$-algebra).
Example 6.1.10 (Branching process). Another homogeneous Markov chain is the branching process $Z_n$ of Definition 5.5.1, having the countable state space $S = \{0, 1, 2, \ldots\}$ (and the $\sigma$-algebra $\mathcal{S} = 2^S$). The transition probabilities are in this case $p(x, A) = P(\sum_{j=1}^{x} N_j \in A)$ for integer $x \ge 1$, and $p(0, A) = I_{\{0 \in A\}}$.
Example 6.1.11 (Renewal Markov chain). Suppose $q_k \ge 0$ and $\sum_{k=1}^{\infty} q_k = 1$. Taking $S = \{0, 1, 2, \ldots\}$ (and $\mathcal{S} = 2^S$), a homogeneous Markov chain with transition probabilities $p(0, j) = q_{j+1}$ for $j \ge 0$ and $p(i, i-1) = 1$ for $i \ge 1$ is called a renewal chain.
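A minimal simulation sketch of these dynamics (the inter-renewal law `q` is an illustrative choice, not one from the text):

```python
import random

def renewal_chain(q, n_steps, rng, x0=0):
    """Renewal chain of Example 6.1.11: from state 0 jump to j >= 0 with
    probability q[j + 1], and from state i >= 1 move deterministically to
    i - 1; the state is the time left until the next renewal."""
    path, x = [x0], x0
    for _ in range(n_steps):
        if x == 0:
            u, j = rng.random(), 1   # draw tau with P(tau = j) = q[j] ...
            while u > q.get(j, 0.0):
                u -= q.get(j, 0.0)
                j += 1
            x = j - 1                # ... and land at tau - 1
        else:
            x -= 1
        path.append(x)
    return path

q = {1: 0.5, 2: 0.3, 3: 0.2}         # illustrative inter-renewal law
print(renewal_chain(q, 20, random.Random(0)))
```

Every downward run of the path ends at a renewal (state 0), after which a fresh inter-renewal time is drawn, matching the description in Exercise 6.1.12 below.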
As you are now to show, in a renewal (Markov) chain $X_n$, the value of $X_n$ is the amount of time from $n$ to the first of the (integer valued) renewal times $T_k$ in $[n, \infty)$, where $\tau_m = T_m - T_{m-1}$ are i.i.d. and $P(\tau_1 = j) = q_j$ (compare with Example 2.3.7).
Exercise 6.1.12. Suppose $\{\tau_k\}$ are i.i.d. positive integer valued random variables with $P(\tau_1 = j) = q_j$. Let $T_m = T_0 + \sum_{k=1}^{m} \tau_k$ for a nonnegative integer valued random variable $T_0$ which is independent of $\{\tau_k\}$.
(a) Show that $N_\ell = \inf\{k \ge 0 : T_k \ge \ell\}$, $\ell = 0, 1, \ldots$, are finite stopping times for the filtration $\mathcal{G}_n = \sigma(T_0, \tau_k, k \le n)$.
(b) Show that for each fixed nonrandom $\ell$, the random variable $\tau_{N_\ell + 1}$ is independent of the stopped $\sigma$-algebra $\mathcal{G}_{N_\ell}$.

Exercise 6.1.14. Suppose $\{U_k\}_{k \ge 0}$ are i.i.d. random variables, each distributed uniformly on $[0,1]$. Let $\theta = U_0$ and $S_n = \sum_{k=1}^{n} X_k$ for $X_k = \mathrm{sgn}(\theta - U_k)$. That is, first pick $\theta$ according to the uniform distribution and then generate a srw $S_n$ with each of its increments being $+1$ with probability $\theta$ and $-1$ otherwise.
(a) Compute $P(X_{n+1} = 1 \mid X_1, \ldots, X_n)$.
(b) Show that $S_n$ is a Markov chain. Is it a homogeneous chain?
Exercise 6.1.15 (First order autoregressive process). The first order autoregressive process $\{X_n\}$ is defined via $X_n = \alpha X_{n-1} + \xi_n$ for $n \ge 1$, where $\alpha$ is a nonrandom scalar constant and $\{\xi_k\}$ are i.i.d. $\mathbb{R}^d$-valued random variables that are independent of $X_0$.
(a) With $\mathcal{F}_n = \sigma(X_0, \xi_k, k \le n)$, verify that $X_n$ is a homogeneous $\mathcal{F}_n$-Markov chain of state space $S = \mathbb{R}^d$ (equipped with its Borel $\sigma$-algebra), and provide its transition probabilities.
(b) Suppose $|\alpha| < 1$ and $X_0 = \sigma \xi_0$ for a nonrandom scalar $\sigma$, with each $\xi_k$ having the multivariate normal distribution $\mathcal{N}(0, \mathbf{V})$ of zero mean and covariance matrix $\mathbf{V}$. Find the values of $\sigma$ for which the law of $X_n$ is independent of $n$.
As we see in the sequel, our next result, the strong Markov property, is extremely
useful. It applies to any homogeneous Markov chain with a Bisomorphic state
space and allows us to handle expectations of random variables shifted by any
stopping time with respect to the canonical ltration of the chain.
Proposition 6.1.16 (Strong Markov property). Consider a canonical probability space $(S^\infty, \mathcal{S}_c, P_\mu)$ and the shift operator $\theta : S^\infty \mapsto S^\infty$ such that $(\theta\omega)_k = \omega_{k+1}$ for all $k \ge 0$ (with the corresponding
iterates $(\theta_n \omega)_k = \omega_{k+n}$ for $k, n \ge 0$). Then, for any $\{h_n\} \subseteq b\mathcal{S}_c$ with $\sup_{n,\omega} |h_n(\omega)|$
finite, and any $\mathcal{F}^X_n$-stopping time $\tau$,
(6.1.7) $E_\mu[\, h_\tau(\theta_\tau \omega) \mid \mathcal{F}^X_\tau \,]\, I_{\{\tau < \infty\}} = E_{X_\tau}[h_\tau]\, I_{\{\tau < \infty\}}\,.$
Remark. Here $E_{X_\tau}$ (or $E_x$) indicates expectation taken with respect
to $P_{X_\tau}$ ($P_x$, respectively). Both sides of (6.1.7) are set to zero when $\tau(\omega) = \infty$
and otherwise its right hand side is $g(n, x) = E_x[h_n]$ evaluated at $n = \tau(\omega)$ and
$x = X_{\tau(\omega)}(\omega)$.
The strong Markov property is a significant extension of the Markov property:
(6.1.8) $E_\mu[\, h(\theta_n \omega) \mid \mathcal{F}^X_n \,] = E_{X_n}[h]\,,$
holding almost surely for any nonnegative integer $n$ and fixed $h \in b\mathcal{S}_c$ (that is,
the identity (6.1.7) with $\tau = n$ nonrandom). This in turn generalizes Lemma 6.1.3,
where (6.1.8) is proved in the special case of $h(\omega) = h(\omega_1)$ and $h \in b\mathcal{S}$.
Proof. We first prove (6.1.8) for $h(\omega) = \prod_{\ell=0}^k g_\ell(\omega_\ell)$ with $g_\ell \in b\mathcal{S}$, $\ell = 0, \ldots, k$.
To this end, fix $B \in \mathcal{S}^{n+1}$ and recall that $\mu_m = \mu\, p_0 \cdots p_{m-1}$ are the f.d.d. for
$P_\mu[\, h(\theta_n \omega)\, I_B(\omega_0, \ldots, \omega_n) \,] = \mu_{n+k}\Big[ I_B(x_0, \ldots, x_n) \prod_{\ell=0}^k g_\ell(x_{\ell+n}) \Big]$
$= \mu_n\Big[ I_B(x_0, \ldots, x_n)\, g_0(x_n) \int p(x_n, dy_1)\, g_1(y_1) \cdots \int p(y_{k-1}, dy_k)\, g_k(y_k) \Big]$
$= E_\mu[\, I_B(X_0, \ldots, X_n)\, E_{X_n}(h) \,]\,.$
This holds for all $B \in \mathcal{S}^{n+1}$, which by definition of the conditional expectation
amounts to (6.1.8).
The collection $\mathcal{H} \subseteq b\mathcal{S}_c$ of bounded, measurable $h : S^\infty \mapsto \mathbb{R}$ for which (6.1.8)
holds is a vector space over $\mathbb{R}$, closed under bounded, monotone limits. Taking
$g_\ell = I_{B_\ell}$ we see that $I_A \in \mathcal{H}$ for any $A$ in the $\pi$-system $\mathcal{P}$ of cylinder sets (i.e.
whenever $A = \{\omega : \omega_0 \in B_0, \ldots, \omega_k \in B_k\}$ for some $k$ finite and $B_\ell \in \mathcal{S}$). We thus
deduce by the (bounded version of the) monotone class theorem that $\mathcal{H} = b\mathcal{S}_c$, the
collection of all bounded functions on $S^\infty$ that are measurable with respect to $\mathcal{S}_c$.
Turning to prove (6.1.7), note that by (6.1.8), for each fixed nonrandom $n$ the random
variable $Y_n = h_n(\theta_n \omega)$ satisfies $E_\mu[Y_n \mid \mathcal{F}^X_n] = g(n, X_n)$. Hence, by part (c) of Exercise 5.1.35, for any finite integer
$k \ge 0$,
$E_\mu[\, h_\tau(\theta_\tau \omega)\, I_{\{\tau = k\}} \mid \mathcal{F}^X_\tau \,] = g(k, X_k)\, I_{\{\tau = k\}} = g(\tau, X_\tau)\, I_{\{\tau = k\}}\,.$
The identity (6.1.7) is then established by taking out the $\mathcal{F}^X_\tau$-measurable indicator
on $\{\tau = k\}$ and summing over $k = 0, 1, \ldots$ (where the finiteness of $\sup_{n,\omega} |h_n(\omega)|$
provides the required integrability).
Exercise 6.1.17. Modify the last step of the proof of Proposition 6.1.16 to show
that (6.1.7) holds as soon as $\sum_k E_{X_k}[\, |h_k| \,]\, I_{\{\tau = k\}}$ is $P_\mu$-integrable.
Here are a few applications of the Markov and strong Markov properties.
Exercise 6.1.18. Consider a homogeneous Markov chain $\{X_n\}$ with B-isomorphic
state space $(S, \mathcal{S})$. Fixing $B_\ell \in \mathcal{S}$, let $\Gamma_n = \bigcup_{l>n} \{X_l \in B_l\}$ and $\Gamma = \{X_l \in B_l \text{ i.o.}\}$.
(a) Using the Markov property and Lévy's upward theorem (Theorem 5.3.15),
show that $P(\Gamma_n \mid X_n) \stackrel{a.s.}{\longrightarrow} I_\Gamma$.
(b) Show that $P(\{X_n \in A_n \text{ i.o.}\} \setminus \Gamma) = 0$ for any $A_n \in \mathcal{S}$ such that for
some $\epsilon > 0$ and all $n$, with probability one,
$P(\Gamma_n \mid X_n) \ge \epsilon\, I_{\{X_n \in A_n\}}\,.$
(c) Suppose $A, B \in \mathcal{S}$ are such that $P_x(X_l \in B \text{ for some } l \ge 1) \ge \epsilon$ for some
$\epsilon > 0$ and all $x \in A$. Deduce that
$P(\{X_n \in A \text{ finitely often}\} \cup \{X_n \in B \text{ i.o.}\}) = 1\,.$
Exercise 6.1.19 (Reflection principle). Consider a symmetric random walk
$S_n = \sum_{k=1}^n \xi_k$, that is, $\{\xi_k\}$ are i.i.d. real-valued and such that $\xi_1 \stackrel{D}{=} -\xi_1$. With
$X_n = S_n$, use the strong Markov property for the stopping time $\tau = \inf\{k \le n : S_k > b\}$ and $h_k(\omega) = I_{\{\omega_{n-k} > b\}}$ to show that for any $b > 0$,
$P(\max_{k \le n} S_k > b) \le 2 P(S_n > b)\,.$
Derive also the following, more precise result for the symmetric srw, where for any
integer $b > 0$,
$P(\max_{k \le n} S_k \ge b) = 2 P(S_n > b) + P(S_n = b)\,.$
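For the simple symmetric random walk the second identity can be confirmed by brute force, since all $2^n$ increment sequences are equally likely. A small sketch (the path length $n$ and level $b$ below are arbitrary illustrative choices):

```python
from itertools import product


def reflection_check(n, b):
    """Compare both sides of P(max_{k<=n} S_k >= b) = 2 P(S_n > b) + P(S_n = b)
    by enumerating all 2^n equally likely +/-1 paths of the symmetric srw."""
    hits = above = at = 0
    for steps in product((1, -1), repeat=n):
        s = m = 0
        for x in steps:
            s += x
            m = max(m, s)
        hits += (m >= b)       # running maximum reaches level b
        above += (s > b)       # endpoint strictly above b
        at += (s == b)         # endpoint exactly at b
    total = 2 ** n
    return hits / total, (2 * above + at) / total


print(reflection_check(10, 3))   # the two probabilities coincide
```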
The concept of an invariant measure for a homogeneous Markov chain, which we now
introduce, plays an important role in our study of such chains throughout Sections
6.2 and 6.3.
Definition 6.1.20. A measure $\mu$ on $(S, \mathcal{S})$ such that $\mu(S) > 0$ is called a positive
or nonzero measure. An event $A \in \mathcal{S}_c$ is called shift invariant if $A = \theta^{-1} A$ (i.e.
$A = \{\omega : \theta(\omega) \in A\}$), and a positive measure $\nu$ on $(S^\infty, \mathcal{S}_c)$ is called shift invariant
if $\nu \circ \theta^{-1}(\cdot) = \nu(\cdot)$ (i.e. $\nu(A) = \nu(\{\omega : \theta(\omega) \in A\})$ for all $A \in \mathcal{S}_c$). We say
that a stochastic process $\{X_n\}$ with a B-isomorphic state space $(S, \mathcal{S})$ is (strictly)
stationary if its joint law is shift invariant. A positive $\sigma$-finite measure $\mu$ on
a B-isomorphic space $(S, \mathcal{S})$ is called an invariant measure for a transition probability $p(\cdot, \cdot)$ if it defines via (6.1.6) a shift invariant measure $P_\mu(\cdot)$. In particular,
starting at $X_0$ chosen according to an invariant probability measure results in
a stationary Markov chain $\{X_n\}$.
Lemma 6.1.21. Suppose a $\sigma$-finite measure $\mu$ and transition probability $p_0(\cdot, \cdot)$ on
$(S, \mathcal{S})$ are such that $\mu p_0(S \times A) = \mu(A)$ for any $A \in \mathcal{S}$. Then, for all $k \ge 1$ and
$A \in \mathcal{S}^{k+1}$,
$\mu\, p_0\, p_1 \cdots p_k (S \times A) = \mu\, p_1 \cdots p_k (A)\,.$
Proof. Our assumption that $\mu((p_0 f)) = \mu(f)$ for $f = I_A$ and any $A \in \mathcal{S}$
extends by the monotone class theorem to all $f \in b\mathcal{S}$. Fixing $A_i \in \mathcal{S}$ and $k \ge 1$
let $f_k(x) = I_{A_0}(x)\, p_1 \cdots p_k(x, A_1 \times \cdots \times A_k)$ (where $p_1 \cdots p_k(x, \cdot)$ are the
probability measures of Proposition 6.1.5 in case $\mu = \delta_x$ is the probability measure
supported on the singleton $\{x\}$ and $p_0(y, \{y\}) = 1$ for all $y \in S$). Since $(p_j h) \in b\mathcal{S}$
for any $h \in b\mathcal{S}$ and $j \ge 1$ (see Lemma 6.1.3), it follows that $f_k \in b\mathcal{S}$ as well.
Further, $\mu(f_k) = \mu\, p_1 \cdots p_k(A)$ for $A = A_0 \times A_1 \times \cdots \times A_k$. By the same
reasoning also
$\mu((p_0 f_k)) = \int_S \mu(dy) \int_{A_0} p_0(y, dx)\, p_1 \cdots p_k(x, A_1 \times \cdots \times A_k) = \mu\, p_0 \cdots p_k(S \times A)\,.$
Thus, the stated identity holds for the $\pi$-system of product sets $A = A_0 \times \cdots \times A_k$
which generates $\mathcal{S}^{k+1}$ and since $\mu\, p_1 \cdots p_k(B_n \times S^k) = \mu(B_n) < \infty$ for some
$B_n \uparrow S$, this identity extends to all of $\mathcal{S}^{k+1}$ (see the remark following Proposition
1.1.39).
Remark 6.1.22. Let $\mu_k = \mu\, p \cdots p$ ($k$ factors of $p$) denote the $\sigma$-finite measures of Proposition 6.1.5
in case $p_n(\cdot, \cdot) = p(\cdot, \cdot)$ for all $n$ (with $\mu_0 = \mu$). Specializing Lemma 6.1.21 to
this setting we see that if $\mu_1(S \times A) = \mu_0(A)$ for any $A \in \mathcal{S}$ then $\mu_{k+1}(S \times A) = \mu_k(A)$
for all $k \ge 0$ and $A \in \mathcal{S}^{k+1}$.
Building on the preceding remark we next characterize the invariant measures for
a given transition probability.
Proposition 6.1.23. A positive $\sigma$-finite measure $\mu(\cdot)$ on B-isomorphic $(S, \mathcal{S})$ is an
invariant measure for transition probability $p(\cdot, \cdot)$ if and only if $\mu p(S \times A) = \mu(A)$
for all $A \in \mathcal{S}$.
Proof. With $\mu$ a positive $\sigma$-finite measure, so are the measures $P_\mu$ and $P_\mu \circ \theta^{-1}$
on $(S^\infty, \mathcal{S}_c)$, which for a B-isomorphic space $(S, \mathcal{S})$ are uniquely determined by
their finite dimensional distributions (see the remark following Corollary 1.4.25).
By (6.1.5) the f.d.d. of $P_\mu \circ \theta^{-1}$ are $\mu_{k+1}(S \times A)$, while those of $P_\mu$ are $\mu_k(A)$. Therefore, a positive $\sigma$-finite measure $\mu$ is an invariant
measure for $p(\cdot, \cdot)$ if and only if $\mu_{k+1}(S \times A) = \mu_k(A)$ for any nonnegative integer
$k$ and $A \in \mathcal{S}^{k+1}$, which by Remark 6.1.22 is equivalent to $\mu p(S \times A) = \mu(A)$ for
all $A \in \mathcal{S}$.
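For a finite state space and a probability measure $\mu$, the condition $\mu p(S \times A) = \mu(A)$ reduces to the matrix equation $\mu P = \mu$. A sketch of this characterization (the 2×2 transition matrix below is an arbitrary illustrative choice):

```python
import numpy as np


def invariant_probability(p):
    """Solve mu @ p = mu together with sum(mu) = 1 for a finite transition matrix p."""
    n = p.shape[0]
    # Stack the stationarity equations (p^T - I) mu = 0 with the normalization.
    a = np.vstack([p.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(a, b, rcond=None)
    return mu


p = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = invariant_probability(p)
print(mu)              # invariant law of this two-state chain
print(mu @ p - mu)     # ~0: mu p(S x A) = mu(A) for every A
```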
6.2. Markov chains with countable state space
Throughout this section we restrict our attention to homogeneous Markov chains
$\{X_n\}$ on a countable (finite or infinite) state space $S$, setting as usual $\mathcal{S} = 2^S$ and
$p(x, y) = P_x(X_1 = y)$ for the corresponding transition probabilities. Noting that
such chains admit the canonical construction of Theorem 6.1.8 since their state
space is B-isomorphic (c.f. Proposition 1.4.27 for $M = S$ equipped with the metric
$d(x, y) = 1_{x \neq y}$), we start with a few useful consequences of the Markov and strong
Markov properties that apply for any homogeneous Markov chain on a countable
state space.
Proposition 6.2.1 (Chapman–Kolmogorov). For any $x, y \in S$ and nonnegative
integers $k \le n$,
(6.2.1) $P_x(X_n = y) = \sum_{z \in S} P_x(X_k = z)\, P_z(X_{n-k} = y)\,.$
Proof. Using the canonical construction of the chain whereby $X_n(\omega) = \omega_n$,
we combine the tower property with the Markov property for $h(\omega) = I_{\{\omega_{n-k} = y\}}$,
followed by a decomposition according to the value $z$ of $X_k$, to get that
$P_x(X_n = y) = E_x[h(\theta_k \omega)] = E_x\big[\, E_x[\, h(\theta_k \omega) \mid \mathcal{F}^X_k \,] \,\big] = E_x[E_{X_k}(h)] = \sum_{z \in S} P_x(X_k = z)\, E_z(h)\,.$
This concludes the proof as $E_z(h) = P_z(X_{n-k} = y)$.
Remark. The Chapman–Kolmogorov equations of (6.2.1) are a concrete special
case of the more general Chapman–Kolmogorov semigroup representation $p_n = p_k\, p_{n-k}$ of the $n$-step transition probabilities $p_n(x, y) = P_x(X_n = y)$. See [Dyn65]
for more on this representation, which is at the core of the analytic treatment of
general Markov chains and processes (and beyond our scope).
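For a finite state space the semigroup relation is just associativity of matrix multiplication, $P^n = P^k P^{n-k}$, which is easy to verify numerically (the 3×3 transition matrix below is an arbitrary illustration, not from the text):

```python
import numpy as np

p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])   # hypothetical 3-state transition matrix

n, k = 7, 3
lhs = np.linalg.matrix_power(p, n)   # n-step transition probabilities p_n(x, y)
rhs = np.linalg.matrix_power(p, k) @ np.linalg.matrix_power(p, n - k)
print(np.abs(lhs - rhs).max())       # ~0, confirming (6.2.1) entrywise
```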
We proceed to derive some results about first hitting times of subsets of the state
space by the Markov chain, where by convention we use $\tau_A = \inf\{n \ge 0 : X_n \in A\}$
in case the initial state matters and the strictly positive $T_A = \inf\{n \ge 1 : X_n \in A\}$
when it does not, with $\tau_y = \tau_{\{y\}}$ and $T_y = T_{\{y\}}$. To this end, we start with the first
entrance decomposition of $\{X_n = y\}$ according to the value of $T_y$ (which serves as
an alternative to the Chapman–Kolmogorov decomposition of the same event via
the value in $S$ of $X_k$).
Exercise 6.2.2 (First entrance decomposition).
For a homogeneous Markov chain $\{X_n\}$ on $(S, \mathcal{S})$, let $T_{y,r} = \inf\{n \ge r : X_n = y\}$
(so $T_y = T_{y,1}$ and $\tau_y = T_{y,0}$).
(a) Show that for any $x, y \in S$, $B \in \mathcal{S}$ and positive integers $r \le n$,
$P_x(X_n \in B,\, T_{y,r} \le n) = \sum_{k=0}^{n-r} P_x(T_{y,r} = n - k)\, P_y(X_k \in B)\,.$
(b) Deduce that in particular,
$P_x(X_n = y) = \sum_{k=r}^{n} P_x(T_{y,r} = k)\, P_y(X_{n-k} = y)\,.$
(c) Conclude that for any $y \in S$ and nonnegative integers $r, \ell$,
$\sum_{j=0}^{\ell} P_y(X_j = y) \ge \sum_{n=r}^{\ell+r} P_y(X_n = y)\,.$
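Part (b) with $r = 1$ can be checked numerically on a finite chain: compute $P_x(T_y = k)$ by a taboo recursion that kills the mass upon first arrival at $y$, then convolve with the return probabilities of $y$ (the 2-state matrix below is an arbitrary illustration):

```python
import numpy as np

p = np.array([[0.2, 0.8],
              [0.6, 0.4]])       # hypothetical 2-state transition matrix
x, y, n = 0, 1, 6

# Taboo recursion: q_k(z) = P_x(X_1 != y, ..., X_{k-1} != y, X_k = z),
# so the first-entrance probability is P_x(T_y = k) = q_k(y).
q = p[x].copy()
first = [q[y]]
for _ in range(n - 1):
    q[y] = 0.0                   # kill the mass that has just arrived at y
    q = q @ p
    first.append(q[y])

lhs = np.linalg.matrix_power(p, n)[x, y]
rhs = sum(first[k - 1] * np.linalg.matrix_power(p, n - k)[y, y]
          for k in range(1, n + 1))
print(lhs - rhs)                 # ~0, matching the decomposition in (b)
```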
In contrast, here is an application of the last entrance decomposition.
Exercise 6.2.3 (Last entrance decomposition). Show that for a homogeneous
Markov chain $\{X_n\}$ on state space $(S, \mathcal{S})$, all $x, y \in S$, $B \in \mathcal{S}$ and $n \ge 1$,
$P_x(X_n \in B,\, T_y \le n) = \sum_{k=0}^{n-1} P_x(X_{n-k} = y)\, P_y(X_k \in B,\, T_y > k)\,.$
Hint: With $L_n = \max\{1 \le \ell \le n : X_\ell = y\}$, decompose the event $\{X_n \in B, T_y \le n\}$
according to the value of $L_n$.

Show that if $f(x) = 1 + \sum_{y \in S} p(x, y) f(y)$ at every
$x \notin C$ then $f(X_{n \wedge \tau_C}) + n \wedge \tau_C$ is a martingale. Deduce that if in addition
$f(x) = 0$ for $x \in C$ then $f(x) = E_x[\tau_C]$.
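On a finite state space the relation $f(x) = E_x[\tau_C]$ amounts to a linear system over the states outside $C$. A sketch (the 3-state chain and the target set $C = \{2\}$ are arbitrary illustrative choices), cross-checked against the tail-sum formula $E_x[\tau_C] = \sum_{n \ge 0} P_x(\tau_C > n)$:

```python
import numpy as np

p = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.5,  0.3 ],
              [0.0, 0.0,  1.0 ]])   # hypothetical chain; C = {2} is absorbing
not_c = [0, 1]

# Solve f(x) = 1 + sum_y p(x, y) f(y) for x outside C with f = 0 on C,
# i.e. (I - P restricted to S \ C) f = 1.
sub = p[np.ix_(not_c, not_c)]
f = np.linalg.solve(np.eye(len(not_c)) - sub, np.ones(len(not_c)))


def mean_hitting_time(x, nmax=2000):
    """E_x[tau_C] = sum_n P_x(tau_C > n); with C absorbing, P_x(tau_C > n)
    is simply the mass outside C after n steps."""
    dist = np.zeros(3)
    dist[x] = 1.0
    total = 0.0
    for _ in range(nmax):
        total += dist[not_c].sum()
        dist = dist @ p
    return total


print(f)                                      # expected hitting times from 0 and 1
print([f[i] - mean_hitting_time(s) for i, s in enumerate(not_c)])   # ~0
```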
The next exercise demonstrates a few of the many interesting explicit formulas one
may find for finite state Markov chains.
Exercise 6.2.6. Throughout, $\{X_n\}$ is a Markov chain on $S = \{0, 1, \ldots, N\}$ of
transition probability $p(x, y)$.
(a) Use induction to show that in case $N = 1$, $p(0, 1) = \alpha$ and $p(1, 0) = \beta$
such that $\alpha + \beta > 0$,
$P_\mu(X_n = 0) = \frac{\beta}{\alpha+\beta} + (1 - \alpha - \beta)^n \Big[ \mu(0) - \frac{\beta}{\alpha+\beta} \Big]\,.$
(b) Fixing $\mu(0)$ and $\theta_1 \neq \theta_0$ nonrandom, suppose $\alpha = \beta$ and conditional
on $\{X_n\}$ the variables $B_k$ are independent Bernoulli($\theta_{X_k}$). Evaluate the
mean and variance of the additive functional $S_n = \sum_{k=1}^n B_k$.
(c) Verify that $E_x[(X_n - N/2)] = (1 - 2/N)^n (x - N/2)$ for the Ehrenfest chain
whose transition probabilities are $p(x, x - 1) = x/N = 1 - p(x, x + 1)$.
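The closed form in part (a) is easy to confirm by iterating the law $\mu_n = \mu_{n-1} P$ of the two-state chain (the values of $\alpha$, $\beta$ and $\mu(0)$ below are arbitrary illustrative choices):

```python
import numpy as np

alpha, beta = 0.3, 0.5                 # p(0,1) = alpha, p(1,0) = beta
p = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
mu0 = 0.8                              # initial probability of state 0


def closed_form(n):
    s = beta / (alpha + beta)          # stationary probability of state 0
    return s + (1 - alpha - beta) ** n * (mu0 - s)


law = np.array([mu0, 1 - mu0])
for n in range(1, 11):
    law = law @ p                      # one step of the law of X_n
    print(n, law[0] - closed_form(n))  # ~0 at every step
```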
6.2.1. Classification of states, recurrence and transience. We start
with the partition of a countable state space of a homogeneous Markov chain
into its intercommunicating (equivalence) classes, as d