This book grew from a need to teach a rigorous probability course to a mixed
audience (statisticians, mathematically inclined biostatisticians, mathematicians,
economists, and students of finance) at the advanced undergraduate/introductory
graduate level, without the luxury of a course in measure theory as a prerequisite.
The core of the book covers the basic topics of independence, conditioning, martingales, convergence in distribution, and Fourier transforms. In addition, there are
numerous sections treating topics traditionally thought of as more advanced, such as
coupling and the KMT strong approximation, option pricing via the equivalent martingale measure, and Fernique's inequality for Gaussian processes.
In a further break with tradition, the necessary measure theory is developed via
the identification of integrals with linear functionals on spaces of measurable functions, allowing quicker access to the full power of the measure theoretic methods.
The book is not just a presentation of mathematically rigorous theory; it is also
a discussion of why some of that theory takes its current form and how anyone could
have thought of those clever ideas in the first place. It is intended as a secure starting point for anyone who needs to invoke rigorous probabilistic arguments and to
understand what they mean.
David Pollard is Professor of Statistics and Mathematics at Yale University in New
Haven, Connecticut. His interests center on probability, measure theory, theoretical
and applied statistics, and econometrics. He believes strongly that research and
teaching (at all levels) should be intertwined. His book, Convergence of Stochastic
Processes (Springer-Verlag, 1984), successfully introduced many researchers and
graduate students to empirical process theory.
Editorial Board
R. Gill (Department of Mathematics, Utrecht University)
B. D. Ripley (Department of Statistics, University of Oxford)
S. Ross (Department of Industrial Engineering, University of California, Berkeley)
M. Stein (Department of Statistics, University of Chicago)
D. Williams (School of Mathematical Sciences, University of Bath)
This series of high quality upper-division textbooks and expository monographs covers
all aspects of stochastic applicable mathematics. The topics range from pure and applied
statistics to probability theory, operations research, optimization and mathematical programming. The books contain clear presentations of new developments in the field and
also of the state of the art in classical methods. While emphasizing rigorous treatment of
theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.
Already Published
1. Bootstrap Methods and Their Application, by A. C. Davison and D. V. Hinkley
2. Markov Chains, by J. Norris
3. Asymptotic Statistics, by A. W. van der Vaart
4. Wavelet Methods for Time Series Analysis, by Donald B. Percival and
Andrew T. Walden
5. Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers,
by Thomas Leonard and John S. J. Hsu
6. Empirical Processes in M-Estimation, by Sara van de Geer
7. Numerical Methods of Statistics, by John F. Monahan
A User's Guide to
Measure Theoretic
Probability
DAVID POLLARD
Yale University
CAMBRIDGE UNIVERSITY PRESS
Contents

PREFACE

CHAPTER 1: MOTIVATION

CHAPTER 4
Independence
Independence of sigma-fields
Construction of measures on a product space
Product measures
Beyond sigma-finiteness
SLLN via blocking
SLLN for identically distributed summands
Infinite product spaces
Problems
Notes

CHAPTER 5: CONDITIONING

CHAPTER 8
Multivariate Fourier transforms
Cramér-Wold without Fourier transforms
The Lévy-Cramér theorem
Problems
Notes

CHAPTER 9: BROWNIAN MOTION
1. Prerequisites
2. Brownian motion and Wiener measure
3. Existence of Brownian motion
*4. Finer properties of sample paths
5. Strong Markov property
*6. Martingale characterizations of Brownian motion
*7. Functionals of Brownian motion
*8. Option pricing
9. Problems
10. Notes

CHAPTER 10
1. What is coupling?
2. Almost sure representations
*3. Strassen's Theorem
*4. The Yurinskii coupling
5. Quantile coupling of Binomial with normal
6. Haar coupling: the Hungarian construction
7. The Komlós-Major-Tusnády coupling
8. Problems
9. Notes

CHAPTER 11: EXPONENTIAL TAILS AND THE LAW OF THE ITERATED LOGARITHM

CHAPTER 12
1. Introduction
*2. Fernique's inequality
*3. Proof of Fernique's inequality
4. Gaussian isoperimetric inequality
*5. Proof of the isoperimetric inequality
6. Problems
7. Notes

APPENDIX B: HILBERT SPACES
1. Definitions
2. Orthogonal projections
3. Orthonormal bases
4. Series expansions of random processes
5. Problems
6. Notes

APPENDIX C: CONVEXITY

INDEX
Preface
This book began life as a set of handwritten notes, distributed to students in my
one-semester graduate course on probability theory, a course that had humble aims:
to help the students understand results such as the strong law of large numbers, the
central limit theorem, conditioning, and some martingale theory. Along the way
they could expect to learn a little measure theory and maybe even a smattering of
functional analysis, but not as much as they would learn from a course on Measure
Theory or Functional Analysis.
In recent years the audience has consisted mainly of graduate students in
statistics and economics, most of whom have not studied measure theory. Most of
them have no intention of studying measure theory systematically, or of becoming
professional probabilists, but they do want to learn some rigorous probability
theory, in one semester.
Faced with the reality of an audience that might have neither the time nor
the inclination to devote itself completely to my favorite subject, I sought to
compress the essentials into a course as self-contained as I could make it. I tried
to pack into the first few weeks of the semester a crash course in measure theory,
with supplementary exercises and a whirlwind exposition (Appendix A) for the
enthusiasts. I tried to eliminate duplication of mathematical effort if it served no
useful role. After many years of chopping and compressing, the material that I most
wanted to cover all fit into a one-semester course, divided into 25 lectures, each
lasting from 60 to 75 minutes. My handwritten notes filled fewer than a hundred
pages.
I had every intention of making my little stack of notes into a little book. But I
couldn't resist expanding a bit here and a bit there, adding useful reference material,
spelling out ideas that I had struggled with on first acquaintance, slipping in extra
topics that my students have seemed to need when writing dissertations, and pulling
in material from other courses I have taught and neat tricks I have learned from my
friends. And soon it wasn't so little any more.
Many of the additions ended up in starred Sections, which contain harder
material or topics that can be skipped over without loss of continuity.
My treatment includes a few eccentricities that might upset some of my
professional colleagues. My most obvious departure from tradition is in the use
of linear functional notation for expectations, an approach I first encountered in
books by de Finetti. I attempt to explain the virtues of this notation in the first
two Chapters. Another slight novelty (at least for anyone already exposed to the
Kolmogorov interpretation of conditional expectations) appears in my treatment
of conditioning, in Chapter 5. For many years I have worried about the wide gap
between the freewheeling conditioning calculations of an elementary probability
course and the formal manipulations demanded by rigor. I claim that a treatment
starting from the idea of conditional distributions offers one way of bridging the gap,
at least for many of the statistical applications of conditioning that have troubled me
the most.
The twelve Chapters and six Appendixes contain general explanations, remarks,
opinions, and blocks of more formal material. Theorems and Lemmas contain the
most important mathematical details. Examples contain gentler, or less formal,
explanations and illustrations. Supporting theoretical material is presented either
in the form of Exercises, with terse solutions, or as Problems (at the ends of the
Chapters) that work step-by-step through material that missed the cutoff as Exercises,
Lemmas, or Theorems. Some Problems are routine, to give students an opportunity
to digest the ideas in the text without great mental effort; some Problems are hard.
A possible one-semester course
Here is a list of the material that I usually try to cover in the one-semester
graduate course.
Chapter 1: Spend one lecture on why measure theory is worth the effort, using
a few of the Examples as illustrations. Introduce de Finetti notation, identifying
sets with their indicator functions and writing P for both probabilities of sets and
expectations of random variables. Mention, very briefly, the fair price Section as an
alternative to the frequency interpretation.
Chapter 2: Cover the unstarred Sections carefully, but omitting many details
from the Examples. Postpone Section 7 until Chapter 3. Postpone Section 8
until Chapter 6. Describe briefly the generating class theorem for functions, from
Section 11, without proofs.
Chapter 3: Cover Section 1, explaining the connection with the elementary notion
of a density. Take a short excursion into Hilbert space (explaining the projection
theorem as an extension of the result for Euclidean spaces) before presenting the
simple version of RadonNikodym. Mention briefly the classical concept of absolute
continuity, but give no details. Maybe say something about total variation.
Chapter 4: Cover Sections 1 and 2, leaving details of some arguments to the
students. Give a reminder about generating classes of functions. Describe the
construction of μ ⊗ Λ, only for a finite kernel Λ, via the iterated integral. Cover
product measures, using some of the Examples from Section 4. Explain the need for
the blocking idea from Section 6, using the Maximal Inequality to preview the idea
of a stopping time. Mention the truncation idea behind the version of the SLLN for
independent, identically distributed random variables with finite first moments, but
skip most of the proof.
Chapter 5: Discuss Section 1 carefully. Cover the high points of Sections 2
through 4. (They could be skipped without too much loss of continuity, but I prefer
not to move straight into Kolmogorov conditioning.) Cover Section 6.
Chapter 1
Motivation
SECTION 1 offers some reasons for why anyone who uses probability should know about
the measure theoretic approach.
SECTION 2 describes some of the added complications, and some of the compensating
benefits that come with the rigorous treatment of probabilities as measures.
SECTION 3 argues that there are advantages in approaching the study of probability theory
via expectations, interpreted as linear functionals, as the basic concept.
SECTION 4 describes the de Finetti convention of identifying a set with its indicator
function, and of using the same symbol for a probability measure and its corresponding
expectation.
SECTION *5 presents a fair-price interpretation of probability, which emphasizes the
linearity properties of expectations. The interpretation is sometimes a useful guide to
intuition.
1.
The problem with this definition is that one needs to be able to calculate distribution
functions, which can make it impossible to establish rigorously some desirable
The unnecessary repetition does not stop with the discrete/continuous dichotomy.
After one masters formulae for functions of a single random variable, the whole
process starts over for several random variables. The univariate definitions acquire a
prefix joint, leading to a whole host of new exercises in multivariate calculus: joint
densities, Jacobians, multiple integrals, joint moment generating functions, and so
on.
Again the distinctions largely disappear in a measure theoretic treatment.
Distributions are just image measures; joint distributions are just image measures for
maps into product spaces; the same definitions and theorems apply in both cases.
One saves a huge amount of unnecessary repetition by recognizing the role of image measures.
What exactly does that mean? How can something with a discrete distribution,
such as a standardized Binomial, be approximated by a smooth normal distribution?
The traditional answer (which is sometimes presented explicitly in introductory
texts) involves pointwise convergence of distribution functions of random variables;
but the central limit theorem is seldom established (even in introductory texts) by
checking convergence of distribution functions. Instead, when proofs are given, they
typically involve checking of pointwise convergence for some sort of generating
function. The proof of the equivalence between convergence in distribution and
pointwise convergence of generating functions is usually omitted. The treatment of
convergence in distribution for random vectors is even murkier.
As you will see in Chapter 7, it is far cleaner to start from a definition involving
convergence of expectations of "smooth functions" of the random variables, an
approach that covers convergence in distribution for random variables, random
vectors, and even random elements of metric spaces, all within a single framework.
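The smooth-function viewpoint can be watched numerically. A Monte Carlo sketch (the test function f = tanh, the sample sizes, and the tolerance are all invented for the illustration):

```python
import math
import random

# Convergence in distribution through expectations of smooth functions:
# E f(S_n), with S_n a standardized Binomial(n, 1/2), should be close to
# E f(Z) for Z standard normal once n is large. Monte Carlo estimates.
random.seed(1)
f = math.tanh                       # a smooth bounded test function

def standardized_binomial(n):
    successes = sum(random.random() < 0.5 for _ in range(n))
    return (successes - n / 2) / math.sqrt(n / 4)   # mean n/2, sd sqrt(n)/2

reps, n = 5_000, 200
mean_f_Sn = sum(f(standardized_binomial(n)) for _ in range(reps)) / reps
mean_f_Z = sum(f(random.gauss(0.0, 1.0)) for _ in range(reps)) / reps
assert abs(mean_f_Sn - mean_f_Z) < 0.05   # both near E tanh(Z) = 0
```

The same comparison makes sense, unchanged, for random vectors or random elements of a metric space, which is the point of the definition.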
***
In the long run the measure theoretic approach will save you much work and
help you avoid wasted effort with unnecessary distinctions.
2.
The indirect argument might seem complicated, but, with the help of a few key
theorems, it actually becomes routine. In the literature, it is not unusual to see
applications abbreviated to a remark like "a simple generating class argument shows
. . . , " with the reader left to fill in the routine details.
Lebesgue applied his definition of length (now known as Lebesgue measure)
to the construction of an integral, extending and improving on the Riemann
integral. Subsequent generalizations of Lebesgue's concept of measure (as in
the 1913 paper of Radon and other developments described in the Epilogue to
Hawkins 1979) eventually opened the way for Kolmogorov to identify probabilities
with measures on sigmafields of events on general sample spaces. From the Preface
to Kolmogorov (1933), in the 1950 translation by Morrison:
The purpose of this monograph is to give an axiomatic foundation for the
theory of probability. The author set himself the task of putting in their natural
place, among the general notions of modern mathematics, the basic concepts of
probability theory, concepts which until recently were considered to be quite
peculiar.
This task would have been a rather hopeless one before the introduction
of Lebesgue's theories of measure and integration. However, after Lebesgue's
publication of his investigations, the analogies between measure of a set and
probability of an event, and between integral of a function and mathematical
expectation of a random variable, became apparent. These analogies allowed of
further extensions; thus, for example, various properties of independent random
variables were seen to be in complete analogy with the corresponding properties
of orthogonal functions. But if probability theory was to be based on the above
analogies, it still was necessary to make the theories of measure and integration
independent of the geometric elements which were in the foreground with
Lebesgue. This has been done by Fréchet.
While a conception of probability theory based on the above general
viewpoints has been current for some time among certain mathematicians, there
was lacking a complete exposition of the whole system, free of extraneous
complications. (Cf., however, the book by Fréchet . . . )
I will explain the essentials of measure theory in Chapter 2, starting from the
traditional setfunction approach but working as quickly as I can towards systematic
use of expectations.
4.
The indicator function of a set A is the function I_A that takes the value 1
if x ∈ A, and the value 0 otherwise. The correspondence between a set and its
indicator function transforms Boolean algebra into ordinary pointwise algebra
with functions. I claim that probability theory becomes easier if one works
systematically with expectations of indicator functions, EI_A, rather than with
the corresponding probabilities of events.
Let me start with the assertions about algebra and Boolean algebra. The
operations of union and intersection correspond to pointwise maxima (denoted by
max or the symbol ∨) and pointwise minima (denoted by min or the symbol ∧), or
pointwise products:

I_{A∪B}(x) = I_A(x) ∨ I_B(x)   and   I_{A∩B}(x) = I_A(x) ∧ I_B(x) = I_A(x) I_B(x),

and the symmetric difference AΔB corresponds to the pointwise absolute
difference, I_{AΔB}(x) = |I_A(x) − I_B(x)|.
To check these identities, just note that the functions take only the values 0 and 1,
then determine which combinations of indicator values give a 1. For example,
|I_A(x) − I_B(x)| takes the value 1 when exactly one of I_A(x) and I_B(x) equals 1.
The algebra looks a little cleaner if we omit the argument x. For example, the
horrendous set theoretic relationship

<1>   (∩_{i=1}^n A_i) Δ (∩_{i=1}^n B_i) ⊆ ∪_{i=1}^n (A_i Δ B_i)

corresponds to the pointwise inequality

|∏_{i≤n} I_{A_i} − ∏_{i≤n} I_{B_i}| ≤ max_{i≤n} |I_{A_i} − I_{B_i}|,

whose verification is easy: when the right-hand side takes the value 1 the inequality
is trivial, because the left-hand side can take only the values 0 or 1; and when the
right-hand side takes the value 0, we have I_{A_i} = I_{B_i} for all i, which makes the
left-hand side zero.

<2>   Example. One could check the set theoretic identity

(AΔB)Δ(CΔD) = AΔ(BΔ(CΔD))

by expanding both sides into a union of many terms. It is easier to note the pattern
for indicator functions. The set AΔB is the region where I_A + I_B takes an odd value
(that is, the value 1); and (AΔB)ΔC is the region where (I_A + I_B) + I_C takes an odd
value. And so on. In fact both sides of the set theoretic identity equal the region
where I_A + I_B + I_C + I_D takes an odd value. Associativity of set theoretic differences
is a consequence of associativity of pointwise addition.
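These pointwise correspondences are easy to check numerically. A small sketch, with invented sets and indicators represented as 0/1 lists:

```python
# Subsets of a small space represented by their indicator functions,
# so that Boolean set algebra becomes pointwise 0/1 arithmetic.
space = list(range(10))
A = [1 if x % 2 == 0 else 0 for x in space]   # indicator of the even numbers
B = [1 if x < 5 else 0 for x in space]        # indicator of {0,...,4}

union        = [max(a, b) for a, b in zip(A, B)]   # I_{A∪B} = I_A ∨ I_B
intersection = [a * b for a, b in zip(A, B)]       # I_{A∩B} = I_A · I_B
sym_diff     = [abs(a - b) for a, b in zip(A, B)]  # I_{AΔB} = |I_A − I_B|
assert sym_diff == [(a + b) % 2 for a, b in zip(A, B)]  # parity view of Δ

# Associativity of Δ: any grouping marks the points where the sum of
# the indicators takes an odd value.
C = [1 if x in (0, 1, 8) else 0 for x in space]
lhs = [abs(abs(a - b) - c) for a, b, c in zip(A, B, C)]  # (AΔB)ΔC
rhs = [(a + b + c) % 2 for a, b, c in zip(A, B, C)]      # parity of I_A+I_B+I_C
assert lhs == rhs
```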
Example. The lim sup of a sequence of sets {A_n : n ∈ N} is defined as

limsup_n A_n := ∩_{n=1}^∞ ∪_{i≥n} A_i.

That is, the lim sup consists of those x for which, to each n there exists an i ≥ n
such that x ∈ A_i. Equivalently, it consists of those x for which x ∈ A_i for infinitely
many i. In other words,

I_{limsup_n A_n} = limsup_n I_{A_n}.
Do you really need to learn the new concept of the lim sup of a sequence
of sets? Theorems that work for lim sups of sequences of functions automatically
carry over to theorems about sets. There is no need to prove everything twice. The
correspondence between sets and their indicators saves us from unnecessary work.
After some repetition, it becomes tiresome to have to keep writing the I for
the indicator function. It would be much easier to write something like Ã in place
of I_A. The indicator of the lim sup of a sequence of sets would then be written
limsup_n Ã_n, with only the tilde to remind us that we are referring to functions. But
why do we need reminding? As the Example showed, the concept for the lim sup
of sets is really just a special case of the concept for sequences of functions. Why
preserve a distinction that hardly matters?
There is a well established tradition in Mathematics for choosing notation that
eliminates inessential distinctions. For example, we use the same symbol 3 for the
natural number and the real number, writing 3 + 6 = 9 as an assertion both about
addition of natural numbers and about addition of real numbers.
random variables on Ω. The expectation maps random variables into real numbers,
in such a way that E(I_A) = P(A). This line of thinking leads us to de Finetti's
second suggestion: use the same symbol for expectation and probability measure,
writing PX instead of EX, and so on.
The de Finetti notation has an immediate advantage when we deal with several
probability measures, P, Q, ... simultaneously. Instead of having to invent new
symbols Ep, EQ, ..., we reuse P for the expectation corresponding to P, and so on.
REMARK.
You might have the concern that you will not be able to tell whether
PA refers to the probability of an event or the expected value of the corresponding
indicator function. The ambiguity should not matter. Both interpretations give the
same number; you will never be faced with a choice between two different values
when choosing an interpretation. If this ambivalence worries you, I would suggest
going systematically with the expectation/indicator function interpretation. It will
never lead you astray.
<3>   Example. For a finite collection of events A_1, ..., A_n, the so-called method of
inclusion and exclusion asserts that the probability of the union ∪_{i≤n} A_i equals

Σ_i P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) − ... ± P(A_1 ∩ A_2 ∩ ... ∩ A_n).

The equality comes by taking expectations on both sides of an identity for (indicator)
functions,

I_{∪_{i≤n} A_i} = Σ_i A_i − Σ_{i<j} A_i A_j + Σ_{i<j<k} A_i A_j A_k − ... ± A_1 A_2 ... A_n.

The right-hand side of this identity is just the expanded version of 1 − ∏_{i≤n} (1 − A_i).
The identity is equivalent to

<4>   1 − I_{∪_{i≤n} A_i} = ∏_{i≤n} (1 − A_i),

which presents two ways of expressing the indicator function of ∩_{i≤n} A_i^c. See
Problem [1] for a generalization.
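A brute-force check of the inclusion-exclusion formula on a small finite space, with probability taken to be the uniform counting measure (the three sets are invented):

```python
from itertools import combinations

# Inclusion-exclusion checked numerically: P of the union against the
# alternating sum of P of the intersections.
space = range(12)

def P(s):
    return len(s) / len(space)   # uniform probability of a subset

sets = [{x for x in space if x % 2 == 0},   # evens
        {x for x in space if x % 3 == 0},   # multiples of 3
        {x for x in space if x < 4}]        # {0,1,2,3}

lhs = P(set.union(*sets))                   # probability of the union

rhs = 0.0                                   # alternating sum over subcollections
for k in range(1, len(sets) + 1):
    for combo in combinations(sets, k):
        rhs += (-1) ** (k + 1) * P(set.intersection(*combo))

assert abs(lhs - rhs) < 1e-12
```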
Example. Consider Tchebychev's inequality, P{|X − μ| ≥ ε} ≤ var(X)/ε², for
each ε > 0, and each random variable X with expected value μ := PX and finite
variance, var(X) := P(X − μ)². On the left-hand side of the inequality we have
the probability of an event. Or is it the expectation of an indicator function?
Either interpretation is correct, but the second is more helpful. The inequality is
a consequence of the increasing property for expectations invoked for a pair of
functions, {|X − μ| ≥ ε} ≤ (X − μ)²/ε². The indicator function on the left-hand
side takes only the values 0 and 1. The quadratic function on the right-hand side is
nonnegative, and is ≥ 1 whenever the left-hand side equals 1.
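The argument can be watched numerically: averaging the two sides of the pointwise inequality over simulated draws preserves the ordering. A sketch with an invented distribution and an invented ε:

```python
import random

# Tchebychev through indicators: the pointwise bound
#   I{|X − μ| ≥ ε} ≤ (X − μ)² / ε²
# survives taking expectations, here approximated by empirical averages.
random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(50_000)]
mu = sum(draws) / len(draws)
var = sum((x - mu) ** 2 for x in draws) / len(draws)

eps = 1.5
# Expectation of the indicator = empirical probability of the event.
prob = sum(1 for x in draws if abs(x - mu) >= eps) / len(draws)
bound = var / eps ** 2
assert prob <= bound   # expectation preserves the pointwise inequality
```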
***
For the remainder of the book, I will be using the same symbol for a set and
its indicator function, and writing P instead of E for expectation.
REMARK.
For me, the most compelling reason to adopt the de Finetti notation,
and work with P as a linear functional defined for random variables, was not that
I would save on symbols, nor any of the other good reasons listed at the end of
Section 3. Instead, I favor the notation because, once the initial shock of seeing old
symbols used in new ways wore off, it made probability theory easier. I can truly
claim to have gained better insight into classical techniques through the mere fact of
translating them into the new notation. I even find it easier to invent new arguments
when working with a notation that encourages thinking in terms of linearity, and
which does not overemphasize the special role for expectations of functions that take
only the values 0 and 1 by according them a different symbol.
The hope that I might convince probability users of some of the advantages
of de Finetti notation was, in fact, one of my motivations for originally deciding to
write yet another book about an old subject.
Your net return will be the random quantity X′(ω) := X(ω) − p(X). Call
the random variable X′ a fair return, the net return from a fair trade. Unless you
start worrying about utilities (in which case you might consult Savage (1954) or
Ferguson (1967, Section 1.4)), you should find the following properties reasonable.
(i) fair + fair = fair. That is, if you consider p(X) fair for X and p(Y) fair
for Y then you should be prepared to make both bets, paying p(X) + p(Y) to
receive X + Y.
(ii) constant × fair = fair. That is, you shouldn't object if I suggest you pay
2p(X) to receive 2X (actually, that particular example is a special case of (i)).
<6>   Lemma. Properties (i), (ii), and (iii) imply that p(·) is an increasing linear
functional on random variables. The fair returns are those random variables for
which p(X) = 0.

Proof. For constants α and β, and random variables X and Y with fair prices p(X)
and p(Y), consider the combined effect of the following fair bets:

you pay me αp(X) to receive αX
you pay me βp(Y) to receive βY
I pay you p(αX + βY) to receive αX + βY.

Your net return is a constant,

c = p(αX + βY) − αp(X) − βp(Y).

If c > 0 you violate (iii); if c < 0 take the other side of the bet to violate (iii). That
proves linearity.

To prove that p(·) is increasing, suppose X(ω) ≥ Y(ω) for all ω. If you claim
that p(X) < p(Y) then I would be happy for you to accept the bet that delivers

(Y − p(Y)) − (X − p(X)) = −(X − Y) − (p(Y) − p(X)),
Sum over i, using linearity again, together with the fact that X = Σ_i X F_i, to deduce
that p(X) = Σ_i p(X F_i) = Σ_i p(F_i) p(X | F_i), as asserted.
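The finite-partition formula is just the law of total expectation; a quick check on an invented finite space with invented weights, partition, and X:

```python
# The partition formula p(X) = Σ_i p(F_i) p(X | F_i) checked on a finite
# sample space with uniform weights.
space = list(range(12))
prob = {w: 1 / len(space) for w in space}        # uniform weights
X = {w: (w % 5) ** 2 for w in space}             # an arbitrary random variable

def cond_exp(F):
    # p(X | F): expectation of X under the weights renormalized to F
    return sum(prob[w] * X[w] for w in F) / sum(prob[w] for w in F)

pX = sum(prob[w] * X[w] for w in space)          # the unconditional price
partition = [[w for w in space if w % 3 == r] for r in range(3)]
total = sum(sum(prob[w] for w in F) * cond_exp(F) for F in partition)
assert abs(pX - total) < 1e-12
```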
Why should we restrict the Lemma to finite partitions? If we allowed countable
partitions we would get the countable additivity property, the key requirement in
the theory of measures. I would be suspicious of such an extension of the simple
argument for finite partitions. It makes a tacit assumption that a combination of
countably many fair bets is again fair. If we accept that assumption, then why not
accept that arbitrary combinations of fair events are fair? For uncountably infinite
collections we would run into awkward contradictions. For example, suppose ω is
generated from a uniform distribution on [0,1]. Let X_t be the random variable that
returns 1 if ω = t and 0 otherwise. By symmetry one might expect p(X_t) = c for
some constant c that doesn't depend on t. But there can be no c for which

1 = p(1) = p(Σ_{0≤t≤1} X_t) = Σ_{0≤t≤1} p(X_t),

because the last sum equals ∞ if c > 0 and equals 0 if c = 0.
Perhaps our intuition about the infinite rests on shaky analogies with the finite.
REMARK.
I do not insist that probabilities must be interpreted as fair prices, just
as I do not accept that all probabilities must be interpreted as assertions about long
run frequencies. It is convenient that both interpretations lead to almost the same
mathematical formalism. You are free to join either camp, or both, and still play by
the same probability rules.
6.
Problems
[1]
[2]
Rederive the assertion of Lemma <6> by consideration of the net return from the
following system of bets:
(i) for each i, pay c_i p(F_i) in order to receive c_i if F_i occurs, where c_i := p(X | F_i);
(ii) pay p(X) in order to receive X;
(iii) for each i, make a bet contingent on F_i, paying c_i (if F_i occurs) to receive X.
[3]
For an increasing sequence of events {A_n : n ∈ N} with union A, show PA_n ↑ PA.
7.
Notes
See Dubins & Savage (1964) for an illustration of what is possible in a theory of
probability without countable additivity.
The ideas leading up to Lebesgue's creation of his integral are described in
fascinating detail in the excellent book of Hawkins (1979), which has been the
starting point for most of my forays into the history of measure theory. Lebesgue
first developed his new definition of the integral for his doctoral dissertation
(Lebesgue 1902), then presented parts of his theory in the 1902-1903 Peccot course
of lectures (Lebesgue 1904). The 1928 revision of the 1904 volume greatly expanded
the coverage, including a treatment of the more general (Lebesgue)Stieltjes integral.
See also Lebesgue (1926), for a clear description of some of the ideas involved in
the development of measure theory, and the Note Historique of Bourbaki (1969), for
a discussion of later developments.
Of course it is a vast oversimplification to imagine that probability theory
abruptly became a specialized branch of measure theory in 1933. As Kolmogorov
himself made clear, the crucial idea was the measure theory of Lebesgue. Kolmogorov's little book was significant not just for "putting in their natural place,
among the general notions of modern mathematics, the basic concepts of probability
theory", but also for adding new ideas, such as probability distributions in infinite
dimensional spaces (reinventing results of Daniell 1919) and a general theory of
conditional probabilities and conditional expectations.
Measure theoretic ideas were used in probability theory well before 1933.
For example, in the Note at the end of Lévy (1925) there was a clear statement
of the countable additivity requirement for probabilities, but Lévy did not adopt
the complete measure theoretic formalism; and Khinchin & Kolmogorov (1925)
explicitly constructed their random variables as functions on [0,1], in order to avail
themselves of the properties of Lebesgue measure.
It is also not true that acceptance of the measure theoretic foundation was total
and immediate. For example, eight years after Kolmogorov's book appeared, von
Mises (1941, page 198) asserted (emphasis in the original):
In recapitulating this paragraph I may say: First, the axioms of Kolmogorov
are concerned with the distribution function within one kollektiv and are
supplementary to my theory, not a substitute for it. Second, using the notion of
measure zero in an absolute way without reference to the arbitrarily assumed
measure system, leads to essential inconsistencies.
See also the argument for the measure theoretic framework in the accompanying
paper by Doob (1941), and the comments by both authors that follow (von Mises &
Doob 1941).
For more about Kolmogorov's pivotal role in the history of modern probability,
see: Shiryaev (2000), and the other articles in the same collection; the memorial
articles in the Annals of Probability, volume 17 (1989); and von Plato (1994), which
also contains discussions of the work of von Mises and de Finetti.
REFERENCES
Chapter 2
1.
<1>
Definition. Call a class A of subsets of a set X a sigma-field if:
(i) the empty set ∅ and the whole space X both belong to A;
(ii) if A belongs to A then so does its complement A^c;
(iii) if A_1, A_2, ... is a countable collection of sets in A then both the union ∪_i A_i
and the intersection ∩_i A_i are also in A.
Some of the requirements are redundant as stated. For example, once we
have ∅ ∈ A then (ii) implies X ∈ A. When we come to establish properties about
sigma-fields it will be convenient to have the list of defining properties pared down
to a minimum, to reduce the amount of mechanical checking. The theorems will
be as sparing as possible in the amount of work they require for establishing the
sigma-field properties, but for now redundancy does not hurt.
The collection A need not contain every subset of X, a fact forced upon us in
general if we want μ to have the properties of a countably additive measure.
<2>   Definition.

<3>   Example. F_2 := {c} = E_1 ∩ E_2, and F_4 := {a, b, d, e} = F_1 ∪ F_3.
F_1,  F_2,  F_3,  F_1 ∪ F_3,  F_1 ∪ F_2 = E_1,  F_2 ∪ F_3 = E_2,  X.

The sets F_1, F_2, F_3 are the atoms of the sigma-field; every member of σ(E) is a
union of some collection (possibly empty) of F_i. The only measurable subsets of F_i
are the empty set and F_i itself. There are no measurable protons or neutrons hiding
inside these atoms.
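On a finite space one can generate a sigma-field by brute force, closing under complements and unions until nothing new appears. A sketch, assuming generators E_1 = {a, b, c} and E_2 = {c, d, e} (chosen to match the atoms {a, b}, {c}, {d, e} suggested by the example; the helper name is invented):

```python
# Brute-force generation of a sigma-field on a finite space: start from
# the generators, then close under complements and pairwise unions.
def generated_sigma_field(space, generators):
    sets = {frozenset(), frozenset(space)} | {frozenset(g) for g in generators}
    while True:
        new = set(sets)
        new |= {frozenset(space) - s for s in sets}   # complements
        new |= {s | t for s in sets for t in sets}    # pairwise unions
        if new == sets:                               # nothing new: done
            return sets
        sets = new

space = {'a', 'b', 'c', 'd', 'e'}
sf = generated_sigma_field(space, [{'a', 'b', 'c'}, {'c', 'd', 'e'}])
# With three atoms {a,b}, {c}, {d,e}, the sigma-field has 2**3 = 8 members.
assert len(sf) == 8 and frozenset({'c'}) in sf
```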
An unsystematic construction might work for finite sets, but it cannot generate
all members of a sigma-field in general. Indeed, we cannot even hope to list all
the members of an infinite sigma-field. Instead we must find a less explicit way to
characterize its sets.
<4>   Example. By definition, the Borel sigma-field on the real line, denoted by B(R),
is the sigma-field generated by the open subsets. We could also denote it by σ(G),
where G stands for the class of all open subsets of R. There are several other
generating classes for B(R). For example, as you will soon see, the class E of all
intervals (−∞, t], with t ∈ R, is a generating class.
It might appear a hopeless task to prove that cr() = !B(R) if we cannot
explicitly list the members of both sigmafields, but actually the proof is quite
routine. You should try to understand the style of argument because it is often used
in probability theory.
The equality of sigma-fields is established by two inclusions, σ(E) ⊆ σ(G) and σ(G) ⊆ σ(E), both of which follow from more easily established results. First we must prove that E ⊆ σ(G), showing that σ(G) is one of the sigma-fields F that enter into the intersection defining σ(E), and hence σ(E) ⊆ σ(G). The other inclusion follows similarly if we show that G ⊆ σ(E).
Each interval (-∞, t] in E has a representation ∩_{n=1}^∞ (-∞, t + n^{-1}), a countable intersection of open sets. The sigma-field σ(G) contains all open sets, and it is stable under countable intersections. It therefore contains each (-∞, t]. That is, E ⊆ σ(G).
The argument for G ⊆ σ(E) is only slightly harder. It depends on the fact that an open subset of the real line can be written as a countable union of open intervals. Such an interval has a representation (a, b) = (-∞, b) ∩ (-∞, a]^c, and (-∞, b) = ∪_{n=1}^∞ (-∞, b - n^{-1}]. That is, every open set can be built up from sets in E using operations that are guaranteed not to take us outside the sigma-field σ(E).
My explanation has been moderately detailed. In a published paper the
reasoning would probably be abbreviated to something like "a generating class
argument shows that ...," with the routine details left to the reader.
REMARK.
The generating class argument often reduces to an assertion like: A is a sigma-field and A ⊇ E, therefore A = σ(A) ⊇ σ(E).
<5>
Example. A class E of subsets of a set X is called a field if it contains the empty set and is stable under complements, finite unions, and finite intersections. For a field E, write E_δ for the class of all possible intersections of countable subclasses of E, and E_σ for the class of all possible unions of countable subclasses of E.
REMARK.
Incidentally, I chose the letters G and F to remind myself of open and closed sets, which have similar approximation properties for Borel measures on metric spaces; see Problem [12].
We have a sandwich, but the bread might not be of the right type. It is certainly true that G ∈ E_σ (a countable union of countable unions is a countable union), but the set H need not belong to E_δ. However, the sets H_N := ∪_{n≤N} F_n do belong to E_δ, and countable additivity implies that μH_N ↑ μH.
2.
Measurable functions
Let X be a set equipped with a sigma-field A, and Y be a set equipped with a sigma-field B, and T be a function (also called a map) from X to Y. We say that T is A\B-measurable if the inverse image {x ∈ X : Tx ∈ B} belongs to A for each B in B. Sometimes the inverse image is denoted by {T ∈ B} or T^{-1}B. Don't be fooled by the T^{-1} notation into treating T^{-1} as a function from Y into X: it's not, unless T is one-to-one (and onto, if you want to have domain Y). Sometimes an A\B-measurable map is referred to in abbreviated form as just A-measurable, or just B-measurable, or just measurable, if there is no ambiguity about the unspecified sigma-fields.
For example, if Y = R and B equals the Borel sigma-field B(R), it is common to drop the B(R) specification and refer to the map as being A-measurable, or as being Borel measurable if A is understood and there is any doubt about which sigma-field to use for the real line. In this book, you may assume that any sigma-field on R is its Borel sigma-field, unless explicitly specified otherwise. It can get confusing if you misinterpret where the unspecified sigma-fields live. My advice would be that you imagine a picture showing the two spaces involved, with any missing sigma-field labels filled in.
Sometimes the functions come first, and the sigma-fields are chosen specifically to make those functions measurable.
<6>
<7>
<8>
Example. Let {fn : n ∈ N} be a sequence of A-measurable real-valued functions, and define f(x) := sup_n fn(x) and g(x) := inf_n fn(x). The function f is measurable because, for each real t,
{x : f(x) > t} = ∪_n {x : fn(x) > t} ∈ A:
for each fixed x, the supremum of the real numbers fn(x) is strictly greater than t if and only if fn(x) > t for at least one n. Example <7> shows why we have only to check inverse images for such intervals.
The same generating class is not as convenient for proving measurability of g. It is not true that an infimum of a sequence of real numbers is strictly greater than t if and only if all of the numbers are strictly greater than t: think of the sequence {n^{-1} : n = 1, 2, 3, ...}, whose infimum is zero. Instead you should argue via the identity {x : g(x) < t} = ∪_n {x : fn(x) < t} ∈ A for each real t.
From Example <8> and the representations limsup_n fn(x) = inf_n sup_{m≥n} fm(x) and liminf_n fn(x) = sup_n inf_{m≥n} fm(x), it follows that the limsup or liminf of a sequence of measurable (real- or extended-real-valued) functions is also measurable. In particular, if the limit exists it is measurable.
Measurability is also preserved by the usual algebraic operations (sums, differences, products, and so on) provided we take care to avoid illegal pointwise calculations such as ∞ - ∞ or 0/0. There are several ways to establish these stability properties. One of the more direct methods depends on the fact that R has a countable dense subset, as illustrated by the following argument for sums.
<9>
<10>
Definition. The class M(X, A), or M(X) or just M for short, consists of all A\B(R)-measurable functions from X into R. The class M^+(X, A), or M^+(X) or just M^+ for short, consists of the nonnegative functions in M(X, A).
If you desired exquisite precision you could write M(X, A, R, B(R)), to eliminate all ambiguity about domain, range, and sigma-fields.
The collection M^+ is a cone (stable under sums and multiplication of functions by positive constants). It is also stable under products, pointwise limits of sequences, and so on.
<11>
Lemma. For each f in M^+ the simple functions
fn := 2^{-n} Σ_{i=1}^{4^n} {f ≥ i/2^n},    n = 1, 2, ...,
have the property 0 ≤ f_1(x) ≤ f_2(x) ≤ ... ≤ fn(x) ↑ f(x) at every x.
REMARK.
The definition of fn involves algebra, so you must interpret {f ≥ i/2^n} as the indicator function of the set of all points x for which f(x) ≥ i/2^n.
Proof. At each x, count the number of nonzero indicator values. If f(x) ≥ 2^n, all 4^n summands contribute a 1, giving fn(x) = 2^n. If k2^{-n} ≤ f(x) < (k + 1)2^{-n}, for some integer k from {0, 1, 2, ..., 4^n - 1}, then exactly k of the summands contribute a 1, giving fn(x) = k2^{-n}. (Check that the last assertion makes sense when k equals 0.) That is, for 0 ≤ f(x) < 2^n, the function fn rounds f down to an integer multiple of 2^{-n}, from which the convergence and monotone increasing properties follow.
If you do not find the monotonicity assertion convincing, you could argue, more formally, that fn+1(x) ≥ fn(x) by comparing the two sums term by term, which reflects the effect of doubling the maximum value and halving the step size when going from the nth to the (n+1)st approximation.
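The rounding behavior is easy to check by machine. A small Python sketch (the function name is mine) evaluates the nth simple function at a single point by literally counting the nonzero indicators:

```python
def approx(f_value, n):
    """Value at one point of the n-th simple function from the Lemma:
    2^-n * sum_{i=1}^{4^n} 1{f >= i/2^n}."""
    count = sum(1 for i in range(1, 4 ** n + 1) if f_value >= i / 2 ** n)
    return count / 2 ** n

print([approx(0.7, n) for n in (1, 2, 3)])  # [0.5, 0.5, 0.625]: rounds down to multiples of 2^-n
print(approx(100.0, 2))                      # 4.0 = 2^2, the cap when f(x) >= 2^n
```

The printed values increase with n toward f(x), as the monotonicity assertion requires.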
3.
Integrals
Just as ∫ f(x) dx represents a sort of limiting sum of f(x) values weighted by small lengths of intervals (the ∫ sign is a long "S", for sum, and the dx is a sort of limiting increment), so can the general integral ∫ f(x) μ(dx) be defined as a limit of weighted sums, but with weights provided by the measure μ. The formal definition involves limiting operations that depend on the assumed measurability of the function f. You can skip the details of the construction (Section 4) by taking the following result as an axiomatic property of the integral.
<12>
Theorem. To each measure μ on A there corresponds a uniquely determined functional μ̃ on M^+, for which:
(i) μ̃(1_A) = μA for each A in A;
(ii) μ̃(0) = 0, where the first 0 denotes the zero function;
(iii) μ̃(cf) = c μ̃(f) for each f in M^+ and each constant c ≥ 0;
(iv) μ̃(f + g) = μ̃(f) + μ̃(g) for all f, g in M^+;
(v) μ̃(fn) ↑ μ̃(f) if 0 ≤ fn ↑ f pointwise (Monotone Convergence).
Functions in M^+ may take the value +∞, so you should be careful with any calculations involving infinity. Subtle errors are easy to miss when concealed within a convention.
<13>
Theorem. Let μ̃ be a map from M^+ to [0, ∞] that satisfies properties (ii) through (v) of Theorem <12>. Then the set function defined on the sigma-field A by (i) is a (countably additive, nonnegative) measure, with μ̃ the functional that it generates.
Lemma <11> provides the link between the measure μ and the functional μ̃. For a given f in M^+, let {fn} be the sequence defined by the Lemma. Then
μ̃f = lim_{n→∞} μ̃fn = lim_{n→∞} 2^{-n} Σ_{i=1}^{4^n} μ{f ≥ i/2^n},
the first equality by Monotone Convergence, the second by linearity. The value of μ̃f is uniquely determined by μ, as a set function on A. It is even possible to use the equality, or something very similar, as the basis for a direct construction of the integral, from which properties (i) through (v) are then derived, as you will see from Section 4.
In summary: There is a one-to-one correspondence between measures on the sigma-field A and increasing linear functionals on M^+(A) with the Monotone Convergence property. To each measure μ there is a uniquely determined functional μ̃ for which μ̃(1_A) = μ(A) for every A in A. The functional μ̃ is usually called an integral with respect to μ, and is variously denoted by ∫ f dμ or ∫ f(x) μ(dx) or ∫ f(x) dμ(x) or ∫_X f dμ. With the de Finetti notation, where we identify a set A with its indicator function, the functional μ̃ is just an extension of μ from a smaller domain (indicators of sets in A) to a larger domain (all of M^+).
Accordingly, we should have no qualms about denoting it by the same symbol. I will write μf for the integral. With this notation, assertion (i) of Theorem <12> becomes: μA = μA for all A in A. You probably can't tell that the A on the left-hand side is an indicator function and the μ is an integral, but you don't need to be able to tell; that is precisely what (i) asserts.
REMARK.
In elementary algebra we rely on parentheses, or precedence, to make our meaning clear. For example, both (ax) + b and ax + b have the same meaning, because multiplication has higher precedence than addition. With traditional notation, the ∫ and the dμ act like parentheses, enclosing the integrand and separating it from following terms. With linear functional notation, we sometimes need explicit parentheses to make the meaning unambiguous. As a way of eliminating some parentheses, I often work with the convention that integration has lower precedence than exponentiation, multiplication, and division, but higher precedence than addition or subtraction. Thus I intend you to read μfg + 6 as (μ(fg)) + 6. I would write μ(fg + 6) if the 6 were part of the integrand.
<14>
Example. Suppose μ is a finite measure (that is, μX < ∞) and f is a function in M^+. Then μf < ∞ if and only if Σ_{n=1}^∞ μ{f ≥ n} < ∞.
The assertion is just a pointwise inequality in disguise. By considering separately values for which k ≤ f(x) < k + 1, for k = 0, 1, 2, ..., you can verify the pointwise inequality between functions,
Σ_{n=1}^∞ {f ≥ n} ≤ f ≤ 1 + Σ_{n=1}^∞ {f ≥ n}.
In fact, the sum on the left-hand side defines ⌊f(x)⌋, the largest integer ≤ f(x), and the right-hand side denotes the smallest integer > f(x). From the leftmost inequality,
μf ≥ μ(Σ_{n=1}^∞ {f ≥ n})    (μ increasing)
= lim_N μ(Σ_{n=1}^N {f ≥ n})    (Monotone Convergence)
= Σ_{n=1}^∞ μ{f ≥ n}    (linearity).
From the rightmost inequality, μf ≤ μX + Σ_{n=1}^∞ μ{f ≥ n}; finiteness of μX then gives the reverse direction.
<15>
Definition. A function f in M(X, A) is called μ-integrable if both μf^+ < ∞ and μf^- < ∞, in which case its integral is defined as μf := μf^+ - μf^-. The collection of all real-valued, μ-integrable functions is denoted by 𝓛^1(μ).
The set 𝓛^1(μ) is a vector space (stable under pointwise addition and multiplication by real numbers). The integral μ defines an increasing linear functional on 𝓛^1(μ), in the sense that μf ≥ μg if f ≥ g pointwise. The Monotone Convergence property implies other powerful limit results for functions in 𝓛^1(μ), as described in Section 5. By restricting μ to 𝓛^1(μ), we eliminate problems with ∞ - ∞.
For each f in 𝓛^1(μ), its 𝓛^1 norm is defined as ‖f‖_1 := μ|f|. Strictly speaking, ‖·‖_1 is only a seminorm, because ‖f‖_1 = 0 need not imply that f is the zero function; as you will see in Section 6, it implies only that μ{f ≠ 0} = 0. It is common practice to ignore the small distinction and refer to ‖·‖_1 as a norm.
<16>
Example. Let m denote Lebesgue measure on B(R). The corresponding integrals are traditionally written with an x and a dx:
∫_{-∞}^{∞} f(x) dx := mf,    ∫_a^b f(x) dx := m(f{a ≤ x ≤ b}),    and    ∫_{-∞}^{b} f(x) dx := m(f{x ≤ b}).
Don't worry about confusing the Lebesgue integral with the Riemann integral over finite intervals. Whenever the Riemann is well defined, so is the Lebesgue, and the two sorts of integral have the same value. The Lebesgue is a more general concept. Indeed, facts about the Riemann are often established by an appeal to theorems about the Lebesgue. You do not have to abandon what you already know about integration over finite intervals.
The improper Riemann integral, ∫_{-∞}^∞ f(x) dx = lim_{n→∞} ∫_{-n}^n f(x) dx, also agrees with the Lebesgue integral provided m|f| < ∞. If m|f| = ∞, as in the case of the function f(x) := Σ_{n=1}^∞ {n ≤ x < n + 1}(-1)^n/n, the improper Riemann integral might exist as a finite limit, while the Lebesgue integral mf does not exist.
*4.
Construction of integrals from measures
For each f in M^+, the integral is defined as a supremum over approximating simple functions:
μ̃f := sup{μ̃s : s simple, 0 ≤ s ≤ f},    where μ̃s := Σ_i a_i μA_i for s := Σ_i a_i A_i,
with the a_i nonnegative constants and the A_i disjoint sets in A. One can check that the value μ̃s does not depend on the chosen representation of the simple function s, and that μ̃ is linear and increasing on simple functions, as desired.
Proof of the Monotone Convergence property. Suppose fn ∈ M^+ and fn ↑ f. Suppose f ≥ s := Σ_i a_i A_i, with the A_i disjoint sets in A and a_i > 0. For fixed ε > 0, define approximating simple functions s_n := Σ_i (1 - ε) a_i A_i {fn > (1 - ε) a_i}. Clearly s_n ≤ fn. The simple function s_n is one of those that enters into the supremum defining μ̃fn. It follows that
μ̃fn ≥ μ̃(s_n) = (1 - ε) Σ_i a_i μ(A_i {fn > (1 - ε) a_i}).
On the set A_i the functions fn increase monotonely to f, which is ≥ a_i. The sets A_i{fn > (1 - ε) a_i} expand up to the whole of A_i. Countable additivity implies that the μ measures of those sets increase to μA_i. It follows that
lim_n μ̃fn ≥ limsup_n μ̃s_n ≥ (1 - ε) μ̃s.
Take a supremum over simple s ≤ f, then let ε tend to zero to complete the proof. □
5. Limit theorems
Theorem <13> identified an integral on M^+ as an increasing linear functional with the Monotone Convergence property:
<19>
μfn ↑ μf    for sequences in M^+ with 0 ≤ fn ↑ f.
<20>
Fatou's Lemma: μ(liminf_n fn) ≤ liminf_n μfn    for every sequence {fn} in M^+.
<21>
Dominated Convergence: μfn → μf    if fn → f pointwise, with |fn| ≤ F for a function F with μF < ∞.
Two direct consequences of this limit property have important applications throughout probability theory. The first, Fatou's Lemma, asserts a weaker limit property for nonnegative functions when the convergence and monotonicity assumptions are dropped. The second, Dominated Convergence, drops the monotonicity and nonnegativity but imposes an extra domination condition on the convergent sequence {fn}.
I have slowly realized over the years that many simple probabilistic results can be
established by Dominated Convergence arguments. The Dominated Convergence
Theorem is the Swiss Army Knife of probability theory.
It is important that you understand why some conditions are needed before
we can interchange integration (which is a limiting operation) with an explicit
limit, as in <19>. Variations on the following example form the basis for many
counterexamples.
<22>
Example. Let μ be Lebesgue measure on B[0, 1] and let {α_n} be a sequence of positive numbers. The function fn(x) := α_n {0 < x < 1/n} converges to zero, pointwise, but its integral μ(fn) = α_n/n need not converge to zero. For example, α_n := n^2 gives μfn = n → ∞; the integrals diverge. And
α_n := 6n for n even, 3n for n odd    gives    μfn = 6 for n even, 3 for n odd;
the integrals oscillate.
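The arithmetic of the Example is easy to tabulate. A Python sketch (the function name is mine); since fn is constant on an interval of Lebesgue measure 1/n, the integral can be computed exactly:

```python
def mu_fn(alpha_n, n):
    """mu(f_n) for f_n(x) = alpha_n * 1{0 < x < 1/n} under Lebesgue
    measure on [0, 1]: the interval (0, 1/n) has measure 1/n."""
    return alpha_n / n

# alpha_n = n^2: f_n -> 0 pointwise, yet the integrals diverge
print([mu_fn(n ** 2, n) for n in (1, 10, 100)])          # [1.0, 10.0, 100.0]

# alpha_n = 6n (n even), 3n (n odd): the integrals oscillate
print([mu_fn(6 * n if n % 2 == 0 else 3 * n, n) for n in range(1, 7)])
# [3.0, 6.0, 3.0, 6.0, 3.0, 6.0]
```

In both cases the pointwise limit of fn is the zero function, whose integral is 0: the interchange of limit and integral fails without extra conditions.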
Dominated Convergence turns up in many situations that you might not at first
recognize as examples of an interchange in the order of two limit procedures.
<23>
Example. Consider an interchange of a derivative with an integral,
(d/dt) ∫_0^1 f(x, t) dx = ∫_0^1 (∂f(x, t)/∂t) dx.
Of course I just differentiated under the integral sign, but why is that allowed? The neatest justification uses a Dominated Convergence argument.
More generally, for each t in an interval (-δ, δ) about the origin let f(·, t) be a μ-integrable function on X, such that the function f(x, ·) is differentiable in (-δ, δ) with derivative Δ(x, t), and suppose
<24>
|Δ(x, t)| ≤ M(x)    for each x and each t in (-δ, δ), where μM < ∞.
We need to justify taking the derivative at t = 0 inside the integral, to conclude that
(d/dt) μf(·, t) |_{t=0} = μΔ(·, 0),
that is,
<25>
(μf(·, h_n) - μf(·, 0))/h_n → μΔ(·, 0)
for every sequence {h_n} of nonzero real numbers tending to zero. (Please make sure that you understand why continuous limits can be replaced by sequential limits in this way. It is a common simplification.) With no loss of generality, suppose δ > h_n > 0 for all n.
for all n. The ratio on the lefthand side of <25> equals the \x integral of the
By assumption, fn(x)  A(JC,O) for
function fn(x) := (f(x,hn)f(x,0))/hn.
every x. The sequence {/} is dominated by M: by the meanvalue theorem, for
each x there exists a tx in (hn, hn) c (<$, 8) for which / n (^) = A(JC, tx)\ < M(x).
An appeal to Dominated Convergence completes the argument.
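A numerical check of the interchange, for one concrete integrand of my own choosing (f(x, t) = exp(tx) over (0, 1), so the t-derivative at t = 0 is simply x):

```python
import math

def riemann(g, n=50_000):
    """Midpoint Riemann sum of g over (0, 1)."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n

def F(t):
    # F(t) = integral of exp(t*x) over (0, 1)
    return riemann(lambda x: math.exp(t * x))

h = 1e-5
outside = (F(h) - F(-h)) / (2 * h)     # differentiate after integrating
inside = riemann(lambda x: x)          # integrate the t-derivative at t = 0
print(outside, inside)                 # both approximately 0.5
```

The domination hypothesis is comfortably satisfied here: the difference quotients are bounded by M(x) = x e^x, which is integrable over (0, 1).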
6.
Negligible sets
A set N in A for which μN = 0 is said to be μ-negligible. (Some authors use the term μ-null, but I find it causes confusion with null as a name for the empty set.) As the name suggests, we can usually ignore bad things that happen only on a negligible set. A property that holds everywhere except possibly for those x in a μ-negligible set of points is said to hold μ-almost everywhere or μ-almost surely, abbreviated to a.e. [μ] or a.s. [μ], with the [μ] omitted when understood.
There are several useful facts about negligible sets that are easy to prove and exceedingly useful to have formally stated. They depend on countable additivity, via its Monotone Convergence generalization. I state them only for nonnegative functions, leaving the obvious extensions for 𝓛^1(μ) to you.
<26>
Lemma. Let g and h be functions in M^+, and let {N_i : i ∈ N} be a sequence of sets in A.
(i) If g ≤ h a.e. [μ] then μg ≤ μh.
(ii) If g = h a.e. [μ] then μg = μh.
(iii) If each N_i is negligible then so is ∪_i N_i.
(iv) If μg = 0 then g = 0 a.e. [μ], that is, the set {g > 0} is negligible.
Proof. For (ii): Apply Monotone Convergence to the sequence g + n{h ≠ g}, which at every point increases to a limit ≥ h, to deduce that μh ≤ lim_n (μg + nμ{h ≠ g}) = μg. Reverse the roles of g and h to get the reverse inequality. Assertion (i) follows by a similar argument.
For (iii): Invoke Monotone Convergence for the right-hand side of the pointwise inequality ∪_i N_i ≤ Σ_i N_i to get μ(∪_i N_i) ≤ μ(Σ_i N_i) = Σ_i μN_i = 0.
For (iv): Put N_n := {g > 1/n} for n = 1, 2, .... Then μN_n ≤ nμg = 0, from which it follows that {g > 0} = ∪_n N_n is negligible.
REMARK.
Notice the appeals to countable additivity, via the Monotone
Convergence property, in the proofs. Results such as (iv) fail without countable
additivity, which might trouble those brave souls who would want to develop a
probability theory using only finite additivity.
<28>
Example. Here is one of the standard methods for proving that some measurable set A has zero μ measure. Find a measurable function f for which f(x) > 0, for all x in A, and μ(fA) = 0. From part (iv) of Lemma <26> deduce that fA = 0 a.e. [μ]. That is, f(x) = 0 for almost all x in A. The set A = {x ∈ A : f(x) > 0} must be negligible.
Many limit theorems in probability theory assert facts about sequences that
hold only almost everywhere.
<29>
Example (Borel-Cantelli lemma). Let {A_n} be a sequence of sets in A for which Σ_n μA_n < ∞. Then Σ_n {x ∈ A_n} < ∞ for μ-almost all x; that is, almost every x belongs to at most finitely many of the A_n.
The Borel-Cantelli argument often takes the following form when invoked to establish almost sure convergence. You should make sure you understand the method, because the details are usually omitted in the literature.
Suppose {X_n} is a sequence of random variables (all defined on the same Ω) for which Σ_n P{|X_n| > ε} < ∞ for each ε > 0. By Borel-Cantelli, to each ε > 0 there is a P-negligible set N(ε) for which Σ_n {|X_n(ω)| > ε} < ∞ if ω ∈ N(ε)^c. A sum of integers converges if and only if the summands are eventually zero. Thus to each ω in N(ε)^c there exists a finite n(ε, ω) such that |X_n(ω)| ≤ ε when n ≥ n(ε, ω).
We have an uncountable family of negligible sets {N(ε) : ε > 0}. We are allowed to neglect only countable unions of negligible sets. Replace ε by a sequence of values such as 1, 1/2, 1/3, 1/4, ..., tending to zero. Define N := ∪_{k=1}^∞ N(1/k), which, by part (iii) of Lemma <26>, is negligible. For each ω in N^c we have |X_n(ω)| ≤ 1/k when n ≥ n(1/k, ω). Consequently, X_n(ω) → 0 as n → ∞ for each ω in N^c; the sequence {X_n} converges to zero almost surely.
For measure theoretic arguments with a fixed μ, it is natural to treat as identical those functions that are equal almost everywhere. Many theorems have trivial modifications, with equalities replaced by almost sure equalities, convergence replaced by almost sure convergence, and so on. For example, Dominated Convergence holds in a slightly strengthened form:
Let {fn} be a sequence of measurable functions for which fn(x) → f(x) at μ-almost all x. Suppose there exists a μ-integrable function F, which does not depend on n, such that |fn(x)| ≤ F(x) for μ-almost all x and all n. Then μfn → μf.
Most practitioners of probability learn to ignore negligible sets (and then suffer slightly when they come to some stochastic process arguments where the handling of uncountable families of negligible sets requires more delicacy). For example, if I could show that a sequence {fn} converges almost everywhere I would hardly hesitate to write: Define f := lim_n fn. What happens at those x where fn(x) does not converge? If hard pressed I might write:
Define f(x) := lim_n fn(x) on the set where the limit exists, and f(x) := 0 otherwise.
You might then wonder if the function so-defined were measurable (it is), or if the set where the limit exists is measurable (it is). A sneakier solution would be to write: Define f(x) := limsup_n fn(x). It doesn't much matter what happens on the negligible set where the limsup is not equal to the liminf, which happens only when the limit does not exist.
A more formal way to equate functions equal almost everywhere is to work with equivalence classes, [f] := {g ∈ M : f = g a.e. [μ]}. The almost sure equivalence also partitions 𝓛^1(μ) into equivalence classes, for which we can define μ[f] := μg for an arbitrary choice of g from [f]. The collection of all these equivalence classes is denoted by L^1(μ). The L^1 norm, ‖[f]‖_1 := ‖f‖_1, is a true norm on L^1, because [f] equals the equivalence class of the zero function when ‖[f]‖_1 = 0. Few authors are careful about maintaining the distinction between f and [f], or between 𝓛^1(μ) and L^1(μ).
*7.
Lp spaces
For each real number p with p ≥ 1 the 𝓛^p norm is defined on M(X, A, μ) by ‖f‖_p := (μ|f|^p)^{1/p}. Problem [17] shows that the map f ↦ ‖f‖_p satisfies the triangle inequality, ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p, at least when restricted to real-valued functions in M.
As with the 𝓛^1 norm, it is not quite correct to call ‖·‖_p a norm, for two reasons: there are measurable functions for which ‖f‖_p = ∞, and there are nonzero measurable functions for which ‖f‖_p = 0. We avoid the first complication by restricting attention to the vector space 𝓛^p := 𝓛^p(X, A, μ) of all real-valued, A-measurable functions for which ‖f‖_p < ∞. We could avoid the second complication by working with the vector space L^p := L^p(X, A, μ) of μ-equivalence classes of functions in 𝓛^p(X, A, μ). That is, the members of L^p are the μ-equivalence classes, [f] := {g ∈ 𝓛^p : g = f a.e. [μ]}, with f in 𝓛^p. (See Problem [20] for the limiting case, p = ∞.)
REMARK.
The correct term for ‖·‖_p on 𝓛^p is pseudonorm, meaning that it has all the properties of a norm (triangle inequality, and ‖cf‖ = |c| ‖f‖ for real constants c) except that it might be zero for nonzero functions. Again, few authors are careful about maintaining the distinction between 𝓛^p and L^p.
Some of the basic theory for Hilbert space is established in Appendix B. For the next several Chapters, the following two Hilbert space results, specialized to L^2 spaces, will suffice.
(1) Cauchy-Schwarz inequality: |μ(fg)| ≤ ‖f‖_2 ‖g‖_2 for all f, g in 𝓛^2(μ), which follows from the Hölder inequality (Problem [15]).
(2) Orthogonal projections: Let H_0 be a closed subspace of 𝓛^2(μ). For each f in 𝓛^2 there is an f_0 in H_0, the (orthogonal) projection of f onto H_0, for which f - f_0 is orthogonal to H_0, that is, ⟨f - f_0, g⟩ = 0 for all g in H_0. The point f_0 minimizes ‖f - h‖_2 over all h in H_0. The projection f_0 is unique up to a μ-almost sure equivalence.
REMARK.
A closed subspace H_0 of 𝓛^2 must contain all f in 𝓛^2 for which there exist f_n ∈ H_0 with ‖f_n - f‖_2 → 0. In particular, if f belongs to H_0 and g = f a.e. [μ] then g must also belong to H_0. If H_0 is closed, the set of equivalence classes {[f] : f ∈ H_0} must be a closed subspace of L^2(μ), and H_0 must equal the union of all equivalence classes in it.
For us the most important subspaces of 𝓛^2(X, A, μ) will be defined by the sub-sigma-fields A_0 of A. Let H_0 = 𝓛^2(X, A_0, μ). The corresponding L^2(X, A_0, μ) is a Hilbert space in its own right, and therefore it is a closed subspace of L^2(X, A, μ). Consequently H_0 is a complete subspace of 𝓛^2: if {f_n} is a Cauchy sequence in H_0 then there exists an f_0 ∈ H_0 such that ‖f_n - f_0‖_2 → 0. However, {f_n} also converges to every other A-measurable f for which f = f_0 a.e. [μ]. Unless A_0 contains all μ-negligible sets from A, the limit f need not be A_0-measurable; the subspace H_0 need not be closed. If we work instead with the corresponding L^2(X, A, μ) and L^2(X, A_0, μ) we do get a closed subspace, because the equivalence class of the limit function is uniquely determined.
<31>
Theorem. A family {Z_t : t ∈ T} of random variables is uniformly integrable if and only if both
(i) sup_{t∈T} P|Z_t| < ∞, and
(ii) to each ε > 0 there is a δ > 0 such that sup_{t∈T} P|Z_t|F < ε for every event F with PF < δ.
Proof. Given uniform integrability, (i) follows from P|Z_t| ≤ M + P|Z_t|{|Z_t| > M}, and (ii) follows from P|Z_t|F ≤ M PF + P|Z_t|{|Z_t| > M}.
Conversely, if (i) and (ii) hold then the event {|Z_t| > M} is a candidate for the F in (ii) when M is so large that P{|Z_t| > M} ≤ sup_{t∈T} P|Z_t|/M < δ. It follows that sup_{t∈T} P|Z_t|{|Z_t| > M} ≤ ε if M is large enough.
(Diagram: modes of convergence. Arrows connect X_n → X almost surely, X_n → X in probability, and X_n → X in 𝓛^1(P), with labels (a) domination, (b) subsequence, and uniform integrability.)
is also uniformly integrable, because Y{Y > M} ≥ |Z_t|{|Z_t| > M} for every t. Only the implications leading to and from the box for the 𝓛^1 convergence remain to be proved.
<32>
Theorem. Let {X_n} be a sequence of integrable random variables. Then P|X_n - X_∞| → 0 if and only if X_n → X_∞ in probability and {X_n} is uniformly integrable.
Proof. Suppose P|X_n - X_∞| → 0. The pointwise bound
|X_n|{|X_n| > M} ≤ 2|X_n - X_∞| + 2|X_∞|{|X_∞| > M/2}
is easy to check: when |X_∞| > M/2 it follows from the triangle inequality; and when |X_∞| ≤ M/2 and |X_n| > M, it follows from the inequality |X_n - X_∞| ≥ |X_n| - |X_∞| ≥ |X_n|/2. Take expectations, choose M large enough to make the contribution from X_∞ small, then let n tend to infinity to find an n_0 such that P|X_n|{|X_n| > M} < ε for n ≥ n_0. Increase M if necessary to handle the corresponding tail contributions for n < n_0.
9.
Image measures and distributions
<33>
Let T be an A\B-measurable map from (X, A, μ) into (Y, B). The image measure ν := Tμ is defined on B by νB := μ(T^{-1}B) for each B in B, and then
<34>
νg = μ(g ∘ T)    for all g in M^+(Y, B).
The small circle symbol ∘ denotes the composition of functions: (g ∘ T)(x) := g(Tx).
The proof of <34> could follow the traditional path: first argue by linearity from <33> to establish the result for simple functions; then take monotone limits of simple functions to extend to M^+(Y, B).
There is another method for constructing image measures that gets <34> all in one step. Define an increasing linear functional ν on M^+(Y, B) by νg := μ(g ∘ T). It inherits the Monotone Convergence property directly from μ, because, if 0 ≤ g_n ↑ g then 0 ≤ g_n ∘ T ↑ g ∘ T. By Theorem <13> it corresponds to a uniquely determined measure on B. When restricted to indicator functions of measurable sets the new measure coincides with the measure defined by <33>, because if g is the indicator function of B then g ∘ T is the indicator function of T^{-1}B. (Why?) We have gained a theorem with almost no extra work, by starting with the linear functional as the definition of the image measure.
Using the notation Tμ for the image measure, we could rewrite the defining equality as (Tμ)(g) := μ(g ∘ T), at least for all g ∈ M^+(Y, B), a relationship that I find easier to remember.
REMARK.
The equality could easily be extended to other cases. For example, by splitting into positive and negative parts then subtracting, we could extend the equality to functions in 𝓛^1(Y, B, ν). And so on.
(b) It is continuous from the right: to each ε > 0 and x ∈ R, there exists a δ > 0 such that F_P(x) ≤ F_P(y) ≤ F_P(x) + ε for x ≤ y ≤ x + δ.
Property (a) follows from the fact that the integral is an increasing functional, and from Dominated Convergence applied to the sequences (-∞, -n] ↓ ∅ and (-∞, n] ↑ R as n → ∞. Property (b) also follows from Dominated Convergence, applied to the sequence (-∞, x + 1/n] ↓ (-∞, x].
Except in introductory textbooks, and in works dealing with the order properties of the real line (such as the study of ranks and order statistics), distribution functions have a reduced role to play in modern probability theory, mostly in connection with the following method for building measures on B(R) as images of Lebesgue measure. In probability theory the construction often goes by the name of quantile transformation.
<35>
Example. There is a converse to the assertions (a) and (b) about distribution functions. Suppose F is a right-continuous, increasing function on R for which lim_{x→-∞} F(x) = 0 and lim_{x→∞} F(x) = 1. Then there exists a probability measure P such that P(-∞, x] = F(x) for all real x. To construct such a P, consider the quantile function q, defined by q(t) := inf{x : F(x) ≥ t} for 0 < t < 1.
By right continuity of the increasing function F, the set {x ∈ R : F(x) ≥ t} is a closed interval of the form [a, ∞), with a = q(t). That is, for all x ∈ R and all t ∈ (0, 1),
<36>
F(x) ≥ t    if and only if    x ≥ q(t).
In general there are many plausible, but false, equalities related to <36>. For example, it is not true in general that F(q(t)) = t. However, if F is continuous and strictly increasing, then q is just the inverse function of F, and the plausible equalities hold.
Let m denote Lebesgue measure restricted to the Borel sigma-field on (0, 1). The image measure P := q(m) has the desired property,
P(-∞, x] = m{t : q(t) ≤ x} = m{t : t ≤ F(x)} = F(x),
the first equality by definition of the image measure, and the second by equality <36>. The result is often restated as: if t has a Uniform(0, 1) distribution then q(t) has distribution function F. □
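The quantile transformation is also the standard recipe for simulating from a given F. A Python sketch for a distribution with three atoms (the points, masses, and function names are all mine): q(t) is computed by a binary search for the smallest point whose cumulative mass reaches t, and uniform draws are pushed through q.

```python
import bisect
import random

def make_sampler(xs, probs):
    """Sampling via the quantile transformation q(t) = inf{x : F(x) >= t},
    for the distribution putting mass probs[i] at the point xs[i]."""
    cum = []
    total = 0.0
    for p in probs:
        total += p
        cum.append(total)                      # cum[i] = F(xs[i])
    def q(t):
        return xs[bisect.bisect_left(cum, t)]  # smallest i with F(xs[i]) >= t
    return lambda: q(random.random())          # t ~ Uniform(0, 1)

random.seed(0)
draw = make_sampler(['lo', 'mid', 'hi'], [0.2, 0.5, 0.3])
n = 100_000
counts = {}
for _ in range(n):
    v = draw()
    counts[v] = counts.get(v, 0) + 1
print({k: c / n for k, c in sorted(counts.items())})  # frequencies near 0.2, 0.5, 0.3
```

The empirical frequencies match the prescribed masses, exactly as the restatement at the end of the Example predicts.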
10.
Generating classes of sets
To prove that all sets in a sigma-field A have some property one often resorts to a generating-class argument. The simplest form of such an argument has three steps:
(i) Show that all members of a subclass E have the property.
(ii) Show that A ⊆ σ(E).
(iii) Show that A_0 := {A ∈ A : A has the property} is a sigma-field.
Then one deduces that A_0 = σ(A_0) ⊇ σ(E) ⊇ A, whence A_0 = A. That is, the property holds for all sets in A.
<37>
Definition. A class D of subsets of X is called a λ-system if:
(i) X ∈ D;
(ii) if D_1, D_2 ∈ D and D_1 ⊇ D_2 then D_1\D_2 ∈ D;
(iii) if {D_n} is an increasing sequence of sets in D then ∪_n D_n ∈ D.
REMARK.
Some authors start from a slightly different definition, replacing requirement (iii) by
(iii)' if {D_n} is a sequence of disjoint sets in D then ∪_n D_n ∈ D.
The change in definition would have little effect on the role played by λ-systems. Many authors (including me, until recently) use the name Dynkin class instead of λ-system, but the name Sierpinski class would be more appropriate. See the Notes at the end of this Chapter.
<38>
Theorem. Let E be a class of sets that is stable under finite intersections. If D is a λ-system with D ⊇ E, then D ⊇ σ(E).
Proof. The proof works with a λ-system D_0 squeezed between E and D, upgrading the assertion that
<39>
E_1 E_2 ∈ D_0    for all E_1, E_2 in E
to an assertion that
<40>
D_1 D_2 ∈ D_0    for all D_1, D_2 in D_0.
It will then follow that D_0 is a sigma-field, which contains E, and hence D_0 ⊇ σ(E).
The choice of D_0 is easy. Let {D_α : α ∈ A} be the collection of all λ-systems with D_α ⊇ E, one of them being the D we started with. Let D_0 equal the intersection of all these D_α. That is, let D_0 consist of all sets D for which D ∈ D_α for each α. I leave it to you to check the easy details that prove D_0 to be a λ-system. In other words, D_0 is the smallest λ-system containing E; it is the λ-system generated by E.
To upgrade <39> to <40> we have to replace each E_i on the left-hand side by a D_i in D_0, without going outside the class D_0. The trick is to work one component at a time. Start with E_1. Define D_1 := {A : AE ∈ D_0 for each E ∈ E}. From <39>, we have D_1 ⊇ E. If we show that D_1 is a λ-system then it will follow that D_1 ⊇ D_0, because D_0 is the smallest λ-system containing E. Actually, the assertion that D_1 is a λ-system is trivial; it follows immediately from the λ-system properties for D_0 and identities like (A_1\A_2)E = (A_1E)\(A_2E) and (∪_i A_i)E = ∪_i (A_i E).
The inclusion D_1 ⊇ D_0 implies that D_1E_2 ∈ D_0 for all D_1 ∈ D_0 and all E_2 ∈ E. Put another way (this step is the only subtlety in the proof), we can assert that the class D_2 := {B : BD ∈ D_0 for each D ∈ D_0} contains E. Just write D_1 instead of D, and E_2 instead of B, in the definition to see that it is only a matter of switching the order of the sets.
Argue in the same way as for D_1 to show that D_2 is also a λ-system. It then follows that D_2 ⊇ D_0, which is another way of expressing assertion <40>.
The proof of the last Theorem is typical of many generating class arguments, in that it is trivial once one knows what one has to check. The Theorem, or its analog for classes of functions (see the next Section), will be my main method for establishing sigma-field properties. You will be getting plenty of practice at filling in the details behind frequent assertions of "a generating class argument shows that ...". Here is a typical example to get you started.
<41>
Exercise. Let μ and ν be finite measures on B(R) with the same distribution function. That is, μ(-∞, t] = ν(-∞, t] for all real t. Show that μB = νB for all B ∈ B(R), that is, μ = ν as Borel measures.
SOLUTION: Write E for the class of all intervals (-∞, t], with t ∈ R. Clearly E is stable under finite intersections. From Example <4>, we know that σ(E) = B(R). It is easy to check that the class D := {B ∈ B(R) : μB = νB} is a λ-system. For example, if B_n ∈ D and B_n ↑ B then μB = lim_n μB_n = lim_n νB_n = νB, by Monotone Convergence. It follows from Theorem <38> that D ⊇ σ(E) = B(R), and the equality of the two Borel measures is established.
When you employ a λ-system argument, be sure to verify the properties required of the generating class ℰ. The next Example shows what can happen if you forget about the stability under finite intersections.
<42>
Example. Consider a set X consisting of four points, labelled nw, ne, sw, and se. Let ℰ consist of X and the subsets N = {nw, ne}, S = {sw, se}, E = {ne, se}, and W = {nw, sw}. Notice that ℰ generates the sigma-field of all subsets of X, but it is not stable under finite intersections. Let μ and ν be probability measures for which

μ(nw) = 1/2,  μ(ne) = 0,  μ(sw) = 0,  μ(se) = 1/2,
ν(nw) = 0,  ν(ne) = 1/2,  ν(sw) = 1/2,  ν(se) = 0.

Both measures give the value 1/2 to each of N, S, E, and W, but they differ in the values they give to the four singletons.
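The Example is small enough to check mechanically. The following Python sketch hard-codes the four points and the class ℰ, and confirms that the two measures agree on every generating set while disagreeing on σ(ℰ):

```python
# The four-point Example: mu and nu agree on the generating class E,
# which is not stable under intersections, yet differ on sigma(E).
X = {"nw", "ne", "sw", "se"}
E_class = [X, {"nw", "ne"}, {"sw", "se"}, {"ne", "se"}, {"nw", "sw"}]

mu = {"nw": 0.5, "ne": 0.0, "sw": 0.0, "se": 0.5}
nu = {"nw": 0.0, "ne": 0.5, "sw": 0.5, "se": 0.0}

def measure(m, A):
    # total mass that m assigns to the subset A
    return sum(m[x] for x in A)

# agreement on every member of the generating class ...
assert all(measure(mu, A) == measure(nu, A) for A in E_class)
# ... but disagreement on a singleton, which belongs to sigma(E)
assert measure(mu, {"nw"}) != measure(nu, {"nw"})
```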
<43>
Chapter 2:
Definition. Call a class ℋ⁺ of bounded, nonnegative functions on a set X a λ-cone if:

(i) ℋ⁺ is a cone: α₁h₁ + α₂h₂ ∈ ℋ⁺ for all h₁, h₂ in ℋ⁺ and all nonnegative constants α₁, α₂;

(ii) each nonnegative constant function belongs to ℋ⁺;

(iii) ℋ⁺ is stable under proper differencing: if h₁, h₂ ∈ ℋ⁺ and h₁ ≤ h₂ then h₂ − h₁ ∈ ℋ⁺;

(iv) ℋ⁺ is stable under increasing limits: if hₙ ∈ ℋ⁺ and hₙ ↑ h, with h bounded, then h ∈ ℋ⁺.
The sigma-field properties of λ-cones are slightly harder to establish than their λ-system analogs, but the reward of more streamlined proofs will make the extra, one-time effort worthwhile. First we need an analog of the fact that a λ-system that is stable under finite intersections is also a sigma-field.
<44>
Lemma. Let ℋ⁺ be a λ-cone of bounded, nonnegative functions that is stable under the formation of pointwise products of pairs of functions. Then ℋ⁺ contains all bounded, nonnegative, σ(ℋ⁺)-measurable functions.

Proof. Fix an h in ℋ⁺, and let f be a continuous, nonnegative function on a bounded interval containing the range of h. Choose polynomials fₙ converging uniformly to f on that interval, and write each fₙ(h) as a difference pₙ(h) − qₙ(h) of polynomials in h with nonnegative coefficients, with pₙ(h) ≥ qₙ(h) ≥ 0. By virtue of properties (i) and (ii) of λ-cones, and the assumed stability under products, both terms on the right-hand side belong to ℋ⁺. The proper differencing property then gives fₙ(h) ∈ ℋ⁺. Pass uniformly to the limit to get f(h) ∈ ℋ⁺.
Write ℰ for the class of all sets of the form {h < C}, with h ∈ ℋ⁺ and C a positive constant. From Example <7>, every h in ℋ⁺ is σ(ℰ)-measurable, and hence σ(ℰ) = σ(ℋ⁺). For a fixed h and C, the continuous function (1 − (h/C)ⁿ)⁺ of h belongs to ℋ⁺, and it increases monotonely to the indicator of {h < C}. Thus the indicators of all sets in ℰ belong to ℋ⁺. The assumptions about ℋ⁺ ensure that the class 𝒟 of all sets whose indicator functions belong to ℋ⁺ is stable under finite intersections (products), complements (subtract from 1), and increasing countable unions (monotone increasing limits). That is, 𝒟 is a λ-system, stable under finite intersections, and containing ℰ. It is a sigma-field containing ℰ. Thus 𝒟 ⊇ σ(ℰ) = σ(ℋ⁺). That is, ℋ⁺ contains all indicators of sets in σ(ℋ⁺).

Finally, let k be a bounded, nonnegative, σ(ℋ⁺)-measurable function. From the fact that the indicator of each of the sets {k ≥ i/2ⁿ}, for i = 1, ..., 4ⁿ, belongs to the cone ℋ⁺, we have kₙ := 2⁻ⁿ Σᵢ₌₁^{4ⁿ} {k ≥ i/2ⁿ} ∈ ℋ⁺. The functions kₙ increase monotonely to k, which consequently also belongs to ℋ⁺.
<45>
Theorem. Let ℋ⁺ be a λ-cone of bounded, nonnegative functions, and 𝒢 be a subclass of ℋ⁺ that is stable under the formation of pointwise products of pairs of functions. Then ℋ⁺ contains all bounded, nonnegative, σ(𝒢)-measurable functions.
Proof. Let ℋ₀⁺ be the smallest λ-cone containing 𝒢. From the previous Lemma, it is enough to show that ℋ₀⁺ is stable under pairwise products.

Argue as in Theorem <38> for λ-systems of sets. A routine calculation shows that ℋ₁⁺ := {h ∈ ℋ₀⁺ : hg ∈ ℋ₀⁺ for all g in 𝒢} is a λ-cone containing 𝒢, and hence ℋ₁⁺ = ℋ₀⁺. That is, h₀g ∈ ℋ₀⁺ for all h₀ ∈ ℋ₀⁺ and g ∈ 𝒢. Similarly, the class ℋ₂⁺ := {h ∈ ℋ₀⁺ : h₀h ∈ ℋ₀⁺ for all h₀ in ℋ₀⁺} is a λ-cone. By the result for ℋ₁⁺ we have ℋ₂⁺ ⊇ 𝒢, and hence ℋ₂⁺ = ℋ₀⁺. That is, ℋ₀⁺ is stable under products.
<46>
Exercise. Let μ be a finite measure on ℬ(ℝᵏ). Write C₀ for the vector space of all continuous real functions on ℝᵏ with compact support. Suppose f belongs to ℒ¹(μ). Show that for each ε > 0 there exists a g in C₀ such that μ|f − g| < ε. That is, show that C₀ is dense in ℒ¹(μ) under its ℒ¹ norm.
12. Problems

[1]
Suppose events A₁, A₂, ..., in a probability space (Ω, ℱ, ℙ), are independent: meaning that ℙ(A_{i₁}A_{i₂}...A_{i_k}) = (ℙA_{i₁})(ℙA_{i₂})...(ℙA_{i_k}) for all choices of distinct subscripts i₁, i₂, ..., i_k, all k. Suppose Σᵢ₌₁^∞ ℙAᵢ = ∞.
(i) Show that

ℙ max_{n≤i≤m} Aᵢ = 1 − Π_{n≤i≤m}(1 − ℙAᵢ) ≥ 1 − exp(−Σ_{n≤i≤m} ℙAᵢ).
(ii) Let m then n tend to infinity, to deduce (via Dominated Convergence) that ℙ lim supᵢ Aᵢ = 1. That is, ℙ{Aᵢ i.o.} = 1.
REMARK. The result gives a converse for the Borel-Cantelli lemma from Example <29>. The next Problem establishes a similar result under weaker assumptions.
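The key inequality in part (i), 1 − Π(1 − pᵢ) ≥ 1 − exp(−Σᵢ pᵢ), is easy to stress-test numerically. The sketch below checks it on random choices of the pᵢ, and also illustrates how a divergent Σᵢ pᵢ pushes the probability of the union toward one (the particular sequences tested are arbitrary choices):

```python
import math, random

def union_prob(ps):
    """P(max_i A_i = 1) for independent events with P(A_i) = p_i."""
    prod = 1.0
    for p in ps:
        prod *= 1.0 - p
    return 1.0 - prod

random.seed(0)
for _ in range(1000):
    ps = [random.random() for _ in range(random.randint(1, 20))]
    # part (i): 1 - prod(1 - p_i) >= 1 - exp(-sum p_i)
    assert union_prob(ps) >= 1.0 - math.exp(-sum(ps)) - 1e-12

# With a divergent series sum p_i, the union probability tends to one:
ps = [1.0 / (i + 2) for i in range(10000)]   # tail of the harmonic series
assert union_prob(ps) > 0.999
```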
[2]
Let A₁, A₂, ... be events in a probability space (Ω, ℱ, ℙ). Define Xₙ = A₁ + ... + Aₙ and αₙ = ℙXₙ. Suppose αₙ → ∞ and ℙXₙ²/αₙ² → 1. (Compare with the inequality ℙXₙ²/αₙ² ≥ 1, which follows from Jensen's inequality.)
(i) Show that
[3]
Suppose T is a function from a set X into a set Y, and suppose that Y is equipped with a σ-field ℬ. Define 𝒜 as the sigma-field of sets of the form T⁻¹B, with B in ℬ. Suppose f ∈ M⁺(X, 𝒜). Show that there exists a ℬ\ℬ[0, ∞]-measurable function g from Y into [0, ∞] such that f(x) = g(T(x)), for all x in X, by following these steps.

(i) Show that 𝒜 is a σ-field on X. (It is called the σ-field generated by the map T. It is often denoted by σ(T).)

(ii) Show that {f ≥ i/2ⁿ} = T⁻¹B_{i,n} for some B_{i,n} in ℬ. Define

fₙ := 2⁻ⁿ Σᵢ₌₁^{4ⁿ} {f ≥ i/2ⁿ}   and   gₙ := 2⁻ⁿ Σᵢ₌₁^{4ⁿ} B_{i,n}.
[4]
Let g₁, g₂, ... be 𝒜\ℬ(ℝ)-measurable functions from X into ℝ. Show that

{lim supₙ gₙ > t} = ∪_{r∈ℚ: r>t} ∩ₙ₌₁^∞ ∪_{m≥n} {g_m > r}.

Deduce that lim supₙ gₙ is also an 𝒜-measurable function. Be careful about strict inequalities that turn into nonstrict inequalities in the limit: it is possible to have xₙ > x for all n and still have lim supₙ xₙ = x.
[5]
Suppose a class of sets ℰ cannot separate a particular pair of points x, y: for every E in ℰ, either {x, y} ⊆ E or {x, y} ⊆ Eᶜ. Show that σ(ℰ) also cannot separate the pair.
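On a finite set the claim can be verified by brute force: closing a class under complements and pairwise unions (which, for a finite X, yields the generated sigma-field) never produces a set separating a pair that ℰ itself cannot separate. A small sketch, with X and ℰ invented for the illustration:

```python
def generated_sigma_field(X, E):
    """On a finite X, closing E under complements and pairwise unions
    yields the sigma-field generated by E."""
    sets = {frozenset(A) for A in E} | {frozenset(), frozenset(X)}
    changed = True
    while changed:
        changed = False
        for A in list(sets):
            comp = frozenset(X) - A
            if comp not in sets:
                sets.add(comp)
                changed = True
        for A in list(sets):
            for B in list(sets):
                if A | B not in sets:
                    sets.add(A | B)
                    changed = True
    return sets

X = {1, 2, 3, 4}
E = [{1, 2}, {1, 2, 3}]   # no member of E separates the points 1 and 2
sigma = generated_sigma_field(X, E)
# Problem [5]: sigma(E) cannot separate the pair either.
assert all({1, 2} <= A or {1, 2} <= frozenset(X) - A for A in sigma)
```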
[6]
A collection of sets ℱ₀ that is stable under finite unions, finite intersections, and complements is called a field. A nonnegative set function μ defined on ℱ₀ is called a finitely additive measure if μ(∪ᵢ≤ₙ Fᵢ) = Σᵢ≤ₙ μFᵢ for every finite collection of disjoint sets in ℱ₀. The set function is said to be countably additive on ℱ₀ if μ(∪ᵢ∈ℕ Fᵢ) = Σᵢ∈ℕ μFᵢ for every countable collection of disjoint sets in ℱ₀ whose union belongs to ℱ₀. Suppose μX < ∞. Show that μ is countably additive on ℱ₀ if and only if μAₙ ↓ 0 for every decreasing sequence {Aₙ} in ℱ₀ with empty intersection. Hint: For the argument in one direction, consider the union of the differences Aᵢ\Aᵢ₊₁.
[7]
[8]
Let μ be a finite measure and f a measurable function. For each positive integer k, show that μ|f|ᵏ < ∞ if and only if Σₙ₌₁^∞ nᵏ⁻¹ μ{|f| > n} < ∞.
[9]
Suppose ν := Tμ, the image of the measure μ under the measurable map T. Show that f ∈ ℒ¹(ν) if and only if f ∘ T ∈ ℒ¹(μ), in which case νf = μ(f ∘ T).
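For a measure concentrated on finitely many points the identity νf = μ(f ∘ T) is a finite-sum bookkeeping exercise, which makes it easy to sanity-check. A sketch with an arbitrary weight function and map, both invented for the illustration:

```python
# Problem [9] for a discrete measure: nu := T(mu) satisfies nu(f) = mu(f o T).
mu = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.25, 2: 0.15}   # a measure on X
T = lambda x: x * x                                  # a measurable map into Y

# build the image measure nu on Y
nu = {}
for x, w in mu.items():
    nu[T(x)] = nu.get(T(x), 0.0) + w

f = lambda y: 3.0 * y + 1.0                          # an integrable f on Y
nu_f = sum(f(y) * w for y, w in nu.items())          # nu(f)
mu_fT = sum(f(T(x)) * w for x, w in mu.items())      # mu(f o T)
assert abs(nu_f - mu_fT) < 1e-12
```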
[10]
Let {hₙ}, {fₙ}, and {gₙ} be sequences of μ-integrable functions that converge μ-almost everywhere to limits h, f, and g. Suppose hₙ(x) ≤ fₙ(x) ≤ gₙ(x) for all x. Suppose also that μhₙ → μh and μgₙ → μg. Adapt the proof of Dominated Convergence to prove that μfₙ → μf.
[11]
[12]
Let μ be a finite measure on the Borel sigma-field ℬ(X) of a metric space X. Call a set B inner regular if μB = sup{μF : B ⊇ F closed} and outer regular if μB = inf{μG : B ⊆ G open}.

(i) Prove that the class ℬ₀ of all Borel sets that are both inner and outer regular is a sigma-field. Deduce that every Borel set is inner regular.

(ii) Suppose μ is tight: for each ε > 0 there exists a compact K such that μKᶜ < ε. Show that the F in the definition of inner regularity can then be assumed compact.

(iii) When μ is tight, show that there exists a sequence of disjoint compact subsets {Kᵢ : i ∈ ℕ} of X such that μ(∪ᵢKᵢ)ᶜ = 0.
[13]
Let μ be a finite measure on the Borel sigma-field of a complete, separable metric space X. Show that μ is tight, by following these steps. For each positive integer n, separability makes X a countable union of closed balls with radius 1/n. Find a finite family of such balls whose union Bₙ has μ measure greater than μX − ε/2ⁿ. Show that ∩ₙBₙ is compact, using the total-boundedness characterization of compact subsets of complete metric spaces.
[14]
[15]
Let f and g be measurable functions on (X, 𝒜, μ), and r and s be positive real numbers for which r⁻¹ + s⁻¹ = 1. Show that μ|fg| ≤ (μ|f|ʳ)^{1/r} (μ|g|ˢ)^{1/s} by arguing as follows. First dispose of the trivial case where one of the factors on the right-hand side is 0 or ∞. Then, without loss of generality (why?), assume that μ|f|ʳ = 1 = μ|g|ˢ. Use concavity of the logarithm function to show that |fg| ≤ |f|ʳ/r + |g|ˢ/s, and then integrate with respect to μ. This result is called the Hölder inequality.
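For measures with finitely many atoms the Hölder inequality is a statement about weighted sums, so it can be stress-tested on random inputs. A sketch (random weights and functions; the conjugate exponent s is computed from r):

```python
import random

def lp_factor(xs, ws, r):
    """(mu |x|^r)^(1/r) for the atomic measure with weights ws."""
    return sum(w * abs(x) ** r for x, w in zip(xs, ws)) ** (1.0 / r)

random.seed(1)
for _ in range(500):
    n = random.randint(1, 10)
    ws = [random.random() for _ in range(n)]     # weights of a finite measure
    fs = [random.uniform(-5, 5) for _ in range(n)]
    gs = [random.uniform(-5, 5) for _ in range(n)]
    r = random.uniform(1.01, 5.0)
    s = r / (r - 1.0)                            # conjugate exponent: 1/r + 1/s = 1
    lhs = sum(w * abs(f * g) for w, f, g in zip(ws, fs, gs))   # mu|fg|
    rhs = lp_factor(fs, ws, r) * lp_factor(gs, ws, s)
    assert lhs <= rhs + 1e-9
```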
[16]
Generalize the Hölder inequality (Problem [15]) to more than two measurable functions f₁, ..., fₖ, and positive real numbers r₁, ..., rₖ for which Σᵢ rᵢ⁻¹ = 1.
[17]
[18]
For f in ℒ¹(μ) define ‖f‖₁ := μ|f|. Let {fₙ} be a Cauchy sequence in ℒ¹(μ), that is, ‖fₙ − fₘ‖₁ → 0 as min(m, n) → ∞. Show that there exists an f in ℒ¹(μ) for which ‖fₙ − f‖₁ → 0, by following these steps.

(i) Find an increasing sequence {n(k)} such that Σₖ ‖f_{n(k)} − f_{n(k+1)}‖₁ < ∞. Deduce that the function H := Σₖ₌₁^∞ |f_{n(k)} − f_{n(k+1)}| is integrable.

(ii) Show that there exists a real-valued, measurable function f for which H ≥ |f_{n(k)}(x) − f(x)| → 0 for μ-almost all x.
[20]
[21]
[22]
Let Ψ be a convex, increasing function for which Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. (For example, Ψ(x) could equal xᵖ for some fixed p ≥ 1, or exp(x) − 1, or exp(x²) − 1.) Define ℒ^Ψ(X, 𝒜, μ) to be the set of all real-valued measurable functions on X for which μΨ(|f|/c₀) < ∞ for some positive real c₀. Define

‖f‖_Ψ := inf{c > 0 : μΨ(|f|/c) ≤ 1},

with the convention that the infimum of an empty set equals +∞. For each f, g in ℒ^Ψ(X, 𝒜, μ) and each real t prove the following assertions.

(i) ‖f‖_Ψ < ∞. Hint: Apply Dominated Convergence to μΨ(|f|/c) as c → ∞.

(ii) f + g ∈ ℒ^Ψ(X, 𝒜, μ) and the triangle inequality holds: ‖f + g‖_Ψ ≤ ‖f‖_Ψ + ‖g‖_Ψ. Hint: If c > ‖f‖_Ψ and d > ‖g‖_Ψ, deduce that

Ψ(|f + g|/(c + d)) ≤ (c/(c + d)) Ψ(|f|/c) + (d/(c + d)) Ψ(|g|/d)

by convexity of Ψ.

(iii) tf ∈ ℒ^Ψ(X, 𝒜, μ) and ‖tf‖_Ψ = |t| ‖f‖_Ψ.
REMARK. ‖·‖_Ψ is called an Orlicz "norm"; to make it a true norm one should work with equivalence classes of functions equal μ almost everywhere. The ℒᵖ norms correspond to the special case Ψ(x) = xᵖ, for some p ≥ 1.
[23]
(iv) Given ε > 0, choose L so that ‖H_L‖_Ψ < ε, where H_L := Σ_{k≥L} |f_{n(k)} − f_{n(k+1)}|. For i > L, show that

Ψ(H_L/ε) ≥ Ψ(|f_{n(L)} − f_{n(i)}|/ε) → Ψ(|f_{n(L)} − f|/ε).

Deduce that ‖f_{n(L)} − f‖_Ψ ≤ ε.

(v) Show that f belongs to ℒ^Ψ(μ) and ‖fₙ − f‖_Ψ → 0 as n → ∞.
[24]
[25]
For each 0 in [0,1] let Xnt$ be a random variable with a Binomial (w, 0) distribution.
That is, F{Xnj = k] = (2)0*(1  0)"* for * = 0 , 1 , . . . . n. You may assume these
elementary facts: VXnte = n0 and F(Xn# nO)2 = n0(l  0 ) . Let / be a continuous
function defined on [0,1].
(i) Show that pn(0) = Ff(Xnj/n)
is a polynomial in 0.
(ii) Suppose  /  < M, for a constant M. For a fixed 6, invoke (uniform) continuity
to find a 8 > 0 such that  / ( s )  f(t)\ < whenever \s  t\ < 8, for all s, t
in [0,1]. Show that
\f(x/n)  / ( 0 )  < + 2M{ (jc/n)  0
2Af
l(*/*O01
(iii) Deduce that supo<^<! \pn(9)  / ( 0 )  < 2e for n large enough. That is, deduce
that /() can be uniformly approximated by polynomials over the range [0,1],
a result known as the Weierstrass approximation theorem.
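The polynomial pₙ in part (i) is the Bernstein polynomial of f, and the uniform convergence of part (iii) can be watched directly. A sketch (the particular f is an arbitrary continuous choice; n = 400 keeps the binomial coefficients within floating-point range):

```python
import math

def bernstein(f, n, theta):
    """p_n(theta) = E f(X/n) for X distributed Binomial(n, theta)."""
    return sum(
        f(k / n) * math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)
        for k in range(n + 1)
    )

f = lambda t: abs(t - 0.3) + math.sin(3 * t)   # an arbitrary continuous f on [0, 1]
n = 400
grid = [i / 100 for i in range(101)]
err = max(abs(bernstein(f, n, th) - f(th)) for th in grid)
assert err < 0.05        # uniform error already small at n = 400
```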
[26]
Extend the approximation result from Example <46> to the case of an infinite measure μ on ℬ(ℝᵏ) that gives finite measure to each compact set. Hint: Let B be a closed ball of radius large enough to ensure μ|f|Bᶜ < ε. Write μ_B for the restriction of μ to B. Invoke the result from the Example to find a g in C₀ such that μ_B|f − g| < ε. Find C₀ functions 1 ≥ hᵢ ↓ B. Consider the approximations ghᵢ for i large enough.
13. Notes
I recommend Royden (1968) as a good source for measure theory. The books of
Ash (1972) and Dudley (1989) are also excellent references, for both measure theory
and probability. Dudley's book contains particularly interesting historical notes.
See Hawkins (1979, Chapter 4) to appreciate the subtlety of the idea of a
negligible set.
The result from Problem [10] is often attributed to Pratt (1960), but, as he noted in his 1966 Acknowledgment of Priority, it is actually much older.
Theorem <38> (the π-λ theorem for generating classes of sets) is often attributed to Dynkin (1960, Section 1.1), although Sierpiński (1928) had earlier proved a slightly stronger result (covering generation of sigma-rings, not just sigma-fields). I adapted the analogous result for classes of functions, Theorem <45>, from Protter (1990, page 7) and Dellacherie & Meyer (1978, page 14). Compare with the "Sierpiński Stability Lemma" for sets, and the "Functional Sierpiński Lemma", presented by Hoffmann-Jørgensen (1994, pages 8, 54, 60).
REFERENCES
Ash, R. B. (1972), Real Analysis and Probability, Academic Press, New York.
Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Chapter 3
1.
∫ f(x) dν(x) = ∫ f(x) (dν/dμ)(x) dμ(x).

The dμ symbols "cancel out," as in the change of variable formula for Lebesgue integrals.
REMARK. The density dν/dμ is often called the Radon-Nikodym derivative of ν with respect to μ, a reference to the result described in Theorem <4> below. The word derivative suggests a limit of a ratio of ν and μ measures of "small" sets. For μ equal to Lebesgue measure on a Euclidean space, dν/dμ can indeed be recovered as such a limit. Section 4 explains the one-dimensional case. Chapter 6 will give another interpretation via martingales.
<1>
Example. Let m denote Lebesgue measure on [0, 1). Each point x in [0, 1) has a binary expansion x = Σₙ₌₁^∞ xₙ2⁻ⁿ with each xₙ either 0 or 1. To ensure uniqueness of the {xₙ}, choose the expansion that ends in a string of zeros when x is a dyadic rational. The set {xₙ = 1} is then a finite union of intervals of the form [a, b), with both endpoints dyadic rationals. The map T(x) := Σₙ₌₁^∞ 2xₙ3⁻ⁿ from [0, 1) back into itself is measurable. The image measure ν := T(m) concentrates on the set of points whose ternary expansions contain no digit 1: it puts no mass in the middle third (1/3, 2/3) of [0, 1], none in the middle thirds (1/9, 2/9) and (7/9, 8/9) of the two remaining intervals, and so on. Write Cₙ for the set that remains after n rounds of deleting middle thirds, and C for ∩ₙCₙ. The set C, which is called the Cantor set, has Lebesgue measure less than mCₙ = (2/3)ⁿ for every n. That is, mC = 0.
The distribution function F(x) := ν[0, x], for 0 ≤ x ≤ 1, has the strange property that it is constant on each of the open intervals that make up each Cₙᶜ, because ν puts zero mass in those intervals. Thus F has a zero derivative at each point of Cᶜ = ∪ₙCₙᶜ, a set with Lebesgue measure one.
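The map T and the image measure ν can be explored numerically. The sketch below pushes a uniform grid on [0, 1) through T and checks that the resulting empirical distribution function is flat across the middle-third gaps, as claimed (the grid size and digit cutoff are arbitrary choices):

```python
def T(x, digits=40):
    """Send the binary digits of x to ternary digits 0 and 2."""
    y, scale = 0.0, 1.0
    for _ in range(digits):
        x *= 2
        bit = int(x)
        x -= bit
        scale /= 3.0
        y += 2 * bit * scale
    return y

N = 2 ** 12
points = [T((i + 0.5) / N) for i in range(N)]   # image of a uniform grid

def F(t):
    """Empirical version of the distribution function nu[0, t]."""
    return sum(1 for y in points if y <= t) / N

# nu puts zero mass in (1/3, 2/3): F is flat there, with value 1/2.
assert F(0.34) == F(0.66) == 0.5
# Similarly across the gap (1/9, 2/9):
assert F(0.12) == F(0.22) == 0.25
```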
The distinction between a continuous function and a function expressible
as an integral was recognized early in the history of measure theory, with the
name absolute continuity being used to denote the stronger property. The original
definition of absolute continuity (Section 4) is now a special case of a streamlined
characterization that applies not just to measures on the real line.
REMARK.
Probability measures dominated by Lebesgue measure correspond to
the continuous distributions of introductory courses, although the correct term is
distributions absolutely continuous with respect to Lebesgue measure. By extension,
random variables whose distributions are dominated by Lebesgue measure are
sometimes called "continuous random variables," which I regard as a harmful abuse
of terminology. There need be nothing continuous about a "continuous random variable" as a function from a set Ω into ℝ. Indeed, there need be no topology on Ω; the very concept of continuity for the function might be void. Many a student
of probability has been misled into assuming topological properties for "continuous
random variables." I try to avoid using the term.
<2>
<3>
For example, if μ denotes Lebesgue measure on ℬ(ℝ) and ν is the measure defined by the density x, then the interval (n, n + n⁻¹) has μ measure n⁻¹ but ν measure greater than 1.
REMARK. For finite measures it might seem that absolute continuity should also have an equivalent formulation in terms of functionals on M⁺. However, even if ν ≪ μ, it need not be true that to each ε > 0 there exists a δ > 0 such that: f ∈ M⁺ and μf < δ imply νf < ε. For example, let μ be Lebesgue measure on ℬ(0, 1) and ν be the finite measure with density Δ(x) := x^{−1/2} with respect to μ. The functions fₙ(x) := Δ(x){n⁻² < x < n⁻¹} have the property that μfₙ ≤ ∫₀^{1/n} x^{−1/2} dx → 0 as n → ∞, but νfₙ = ∫_{n⁻²}^{n⁻¹} x⁻¹ dx = log n → ∞, even though ν ≪ μ.
<5>
Lemma. Suppose ν and μ are finite measures on (X, 𝒜), with ν ≤ μ, that is, νf ≤ μf for all f in M⁺(X, 𝒜). Then ν has a density Δ with respect to μ for which 0 ≤ Δ ≤ 1 everywhere.
Proof. The linear subspace ℋ₀ := {f ∈ ℒ²(μ) : νf = 0} of ℒ²(μ) is closed for convergence in the ℒ²(μ) norm: if μ|fₙ − f|² → 0 then, by the Cauchy-Schwarz inequality,

|νfₙ − νf|² ≤ (νX) ν|fₙ − f|² ≤ (μX) μ|fₙ − f|² → 0.
Except when ν is the zero measure (in which case the result is trivial), the constant function 1 does not belong to ℋ₀. From Section 2.7, there exist functions g₀ ∈ ℋ₀ and g₁ orthogonal to ℋ₀ for which 1 = g₀ + g₁. Notice that νg₁ = ν1 ≠ 0. The desired density will be a constant multiple of g₁.

Consider any f in ℒ²(μ). With C := νf/νg₁, the function f − Cg₁ belongs to ℋ₀, because ν(f − Cg₁) = νf − Cνg₁ = 0. Orthogonality of f − Cg₁ and g₁ gives

0 = ⟨f − Cg₁, g₁⟩ = μ(fg₁) − (νf/νg₁) μ(g₁²),

which rearranges to νf = μ(fΔ) where Δ := (νg₁/μ(g₁²)) g₁.

The inequality 0 ≤ ν{Δ < 0} = μΔ{Δ < 0} ensures that Δ ≥ 0 a.e. [μ]; and the inequalities ν{Δ > 1} = μΔ{Δ > 1} ≥ μ{Δ > 1} ≥ ν{Δ > 1} force Δ ≤ 1 a.e. [μ], for otherwise the middle inequality would be strict. Replacement of Δ by Δ{0 ≤ Δ ≤ 1} therefore has no effect on the representation νf = μ(fΔ) for f ∈ ℒ²(μ). To extend the equality to f in M⁺, first invoke it for the ℒ²(μ) function n ∧ f, then invoke Monotone Convergence twice as n tends to infinity.
Not all measures on (X, 𝒜) need be dominated by a given μ. The extreme example is a measure ν that is singular with respect to μ, meaning that there exists a measurable subset S for which μSᶜ = 0 = νS. That is, the two measures concentrate on disjoint parts of X, a situation denoted by writing ν ⊥ μ. Perhaps it would be better to say that the two measures are mutually singular, to emphasize the symmetry of the relationship. For example, discrete measures (those that concentrate on countable subsets) are singular with respect to Lebesgue measure on the real line. There also exist singular measures (with respect to Lebesgue measure) that give zero mass to each countable set, such as the probability measure ν from Example <1>.
Avoidance of all probability measures except those dominated by a counting measure or a Lebesgue measure, as in introductory probability courses, imposes awkward constraints on what one can achieve with probability theory. The restriction becomes particularly tedious for functions of more than a single random variable, for then one is limited to smooth transformations for which image measures and densities (with respect to Lebesgue measure) can be calculated by means of Jacobians. The unfortunate effects of an artificially restricted theory permeate much of the statistical literature, where sometimes inappropriate and unnecessary requirements are imposed merely to accommodate the lack of an appropriate measure theoretic foundation.
REMARK.
Why should absolute continuity with respect to Lebesgue measure
or counting measure play such a central role in introductory probability theory? I
believe the answer is just a matter of definition, or rather, a lack of definition. For a
probability measure P concentrated on a countable set of points, expectations Pg(X)
become countable sums, which can be handled by elementary methods. For general
probability measures the definition of Wg(X) is typically not a matter of elementary
calculation. However, if X has a distribution P with density A with respect to
Lebesgue measure, then Pg(X) = Pg = f A(x)g(x)dx. The last integral has the
familiar look of a Riemann integral, which is the subject of elementary Calculus
courses. Seldom would A or g be complicated enough to require the interpretation
as a Lebesgue integralone stays away from such functions when teaching an
introductory course.
From the measure theoretic viewpoint, densities are not just a crutch for support
of an inadequate integration theory; they become a useful tool for exploiting absolute
continuity for pairs of measures.
In much statistical theory, the actual choice of dominating measure matters little. The following result, which is often called Scheffé's lemma, is typical.
<6>
On the right-hand side, the second term equals zero, because both densities integrate to 1, and the first term tends to zero by Dominated Convergence, because Δ ≥ (Δ − Δₙ)⁺ → 0 a.e. [μ].
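Scheffé's lemma asserts that if probability densities Δₙ converge almost everywhere to a probability density Δ, then μ|Δₙ − Δ| → 0. A quick numerical illustration, using the classical convergence of Binomial(n, λ/n) mass functions to the Poisson(λ) mass function (my choice of example, not one from the text):

```python
import math

def binom_pmf(n, p, k):
    """Binomial(n, p) mass at k (zero for k > n)."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.0
def l1_dist(n):
    """L1 distance between Binomial(n, lam/n) and Poisson(lam) densities,
    truncated where both masses are negligible."""
    p = lam / n
    return sum(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(120))

# The mass functions converge pointwise and each family sums to one,
# so the L1 distance goes to zero, as Scheffe's lemma predicts.
assert l1_dist(1000) < l1_dist(100) < l1_dist(10)
assert l1_dist(1000) < 0.01
```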
Theorem. Let ν and μ be sigma-finite measures on (X, 𝒜). Then there exist a set 𝒩 ∈ 𝒜 with μ𝒩 = 0 and a nonnegative, 𝒜-measurable function Δ for which

<8>  ν(f𝒩ᶜ) = μ(fΔ)  for each f in M⁺(X, 𝒜).

The set 𝒩 and the function Δ are unique up to a ν + μ almost sure equivalence.
REMARK. Of course the value of Δ on 𝒩 has no effect on μ(fΔ), because μ𝒩 = 0. Some authors adopt the convention that Δ = ∞ on the set 𝒩. The convention has no effect on the equality <8>, but it does make Δ unique ν + μ almost surely.
The restriction ν_⊥ of ν to 𝒩 is called the part of ν that is singular with respect to μ. The restriction ν_abs of ν to 𝒩ᶜ is called the part of ν that is absolutely continuous with respect to μ. Problem [11] shows that the decomposition ν = ν_⊥ + ν_abs, into singular and dominated components, is unique.
There is a corresponding decomposition μ = μ_⊥ + μ_abs into parts singular and absolutely continuous with respect to ν. Together the two decompositions partition the underlying space into four measurable sets: a set that is negligible for both measures; a set where they are mutually absolutely continuous; and two sets where the singular components μ_⊥ and ν_⊥ concentrate.
[Diagram: the space splits into four regions according to the value of the density Δ. Where Δ = 0 and ν = 0, the singular part μ_⊥ concentrates. Where 0 < Δ < ∞, the measures are mutually absolutely continuous, with dν_abs = Δ dμ_abs. Where 1/Δ = 0, μ = 0 and the singular part ν_⊥ concentrates. Neither measure puts mass in the remaining region.]
Proof. Consider first the question of existence. With no loss of generality we may assume that both ν and μ are finite measures. The general decomposition would follow by piecing together the results for countably many disjoint subsets of X.

Define λ := ν + μ. Note that ν ≤ λ. From Lemma <5>, there is an 𝒜-measurable function Δ₀, taking values in [0, 1], for which νf = λ(fΔ₀) for all f in M⁺. Define 𝒩 := {Δ₀ = 1}, a set that has zero μ measure because

ν{Δ₀ = 1} = νΔ₀{Δ₀ = 1} + μΔ₀{Δ₀ = 1} = ν{Δ₀ = 1} + μ{Δ₀ = 1}.
Define

Δ := (Δ₀/(1 − Δ₀)) {Δ₀ < 1}.
We have to show that ν(f𝒩ᶜ) = μ(fΔ) for all f in M⁺. For such an f, and each positive integer n, the defining property of Δ₀ gives

ν(f ∧ n){Δ₀ < 1} = ν(f ∧ n)Δ₀{Δ₀ < 1} + μ(f ∧ n)Δ₀{Δ₀ < 1},

which rearranges (no problems with ∞ − ∞) to

ν(f ∧ n)(1 − Δ₀){Δ₀ < 1} = μ(f ∧ n)Δ₀{Δ₀ < 1}.

Appeal twice to Monotone Convergence as n → ∞ to deduce

νf(1 − Δ₀){Δ₀ < 1} = μfΔ₀{Δ₀ < 1}.

Replace f by f{Δ₀ < 1}/(1 − Δ₀), which also belongs to M⁺, to complete the proof of existence.
The proof of uniqueness of the representation, up to various almost sure
equivalences, follows a similar style of argument. Problem [9] will step you through
the details.
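The existence argument can be traced on a toy example. The sketch below works on a five-point space with weights invented for the purpose: it builds Δ₀ as the density of ν with respect to λ = ν + μ, takes 𝒩 = {Δ₀ = 1}, forms Δ, and checks the equality ν(f𝒩ᶜ) = μ(fΔ):

```python
points = ["a", "b", "c", "d", "e"]
mu = {"a": 0.0, "b": 0.5, "c": 0.25, "d": 0.25, "e": 0.0}
nu = {"a": 0.4, "b": 0.0, "c": 0.3, "d": 0.2, "e": 0.1}

lam = {x: mu[x] + nu[x] for x in points}              # lambda := nu + mu
delta0 = {x: (nu[x] / lam[x] if lam[x] > 0 else 0.0) for x in points}
N = {x for x in points if delta0[x] == 1.0}           # singular set {Delta0 = 1}
assert sum(mu[x] for x in N) == 0.0                   # mu(N) = 0

delta = {x: (delta0[x] / (1 - delta0[x]) if x not in N else 0.0)
         for x in points}                             # Delta on {Delta0 < 1}

f = {x: 1.0 + i for i, x in enumerate(points)}        # an arbitrary f >= 0
lhs = sum(f[x] * nu[x] for x in points if x not in N) # nu(f 1_{N^c})
rhs = sum(f[x] * delta[x] * mu[x] for x in points)    # mu(f Delta)
assert abs(lhs - rhs) < 1e-12
```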
Equality is achieved for a partition with two sets: A₁ := {m ≥ 0} and A₂ := {m < 0}. In particular, the total variation distance between two measures equals the ℒ¹ distance between their densities; in fact, total variation distance is often referred to as ℒ¹ distance. The initial definition of ‖μ‖₁, in which the dominating measure does not appear, shows that the ℒ¹ norm does not depend on the particular choice of dominating measure. The ℒ¹ norm also equals sup_{|f|≤1} |μf|, because |μf| = |λ(mf)| ≤ λ(|m| |f|) ≤ λ|m| for |f| ≤ 1, with equality for f := {m ≥ 0} − {m < 0}.
REMARK. Some authors also refer to v(μ₁, μ₂) := sup_{A∈𝒜} |μ₁A − μ₂A| as the total variation distance, which can be confusing. In the special case of two measures with equal total masses, the signed measure μ := μ₁ − μ₂ satisfies μ{m₁ > m₂} = −μ{m₁ < m₂}, whence ‖μ₁ − μ₂‖₁ = 2 v(μ₁, μ₂).
For nonnegative measures μ₁ and μ₂ with densities m₁ and m₂, the affinity is defined as α₁(μ₁, μ₂) := inf{μ₁f₁ + μ₂f₂ : f₁, f₂ ∈ M⁺ and f₁ + f₂ ≥ 1}, the infimum being achieved by f₁ := {m₁ ≤ m₂} and f₂ := {m₁ > m₂}. That is, the affinity equals the minimum of μ₁A + μ₂Aᶜ over all A in 𝒜.
REMARK.
For probability measures, the minimizing set has the statistical
interpretation of the (nonrandomized) test between the two hypotheses \i\ and /i2
that minimizes the sum of the type one and type two errors.
The pointwise equality 2(m₁ ∧ m₂) = m₁ + m₂ − |m₁ − m₂| gives the connection between affinity and ℒ¹ distance,

2α₁(μ₁, μ₂) = λ(m₁ + m₂ − |m₁ − m₂|) = μ₁X + μ₂X − ‖μ₁ − μ₂‖₁.

Both the affinity and the ℒ¹ distance are related to a natural ordering on the space of all finite signed measures on 𝒜, defined by:

ν ≤ ν′ means νf ≤ ν′f for each f in M⁺_bdd.
<9>
Example. To each pair of finite, signed measures μ₁ and μ₂ there exists a largest measure ν for which ν ≤ μ₁ and ν ≤ μ₂. It is determined by its density m₁ ∧ m₂ with respect to the dominating λ. It is easy to see that the ν so defined is smaller than both μ₁ and μ₂. To prove that it is the largest such measure, let μ be any other signed measure with the same property. Without loss of generality, we may assume μ has a density m with respect to λ. For each bounded, nonnegative f, we then have

λ(mf) = μf = μf{m₁ > m₂} + μf{m₁ ≤ m₂} ≤ μ₂f{m₁ > m₂} + μ₁f{m₁ ≤ m₂} = λ((m₁ ∧ m₂)f).

The particular choice f := {m > m₁ ∧ m₂} then leads to the conclusion that m ≤ m₁ ∧ m₂ a.e. [λ], and hence μ ≤ ν as measures.
The measure ν is also denoted by μ₁ ∧ μ₂ and is called the measure theoretic minimum of μ₁ and μ₂. For nonnegative measures, the affinity α₁(μ₁, μ₂) equals the ℒ¹ norm of μ₁ ∧ μ₂.

By a similar argument, it is easy to show that the density |m₁ − m₂| defines the smallest measure ν (necessarily nonnegative) for which ν ≥ μ₁ − μ₂ and ν ≥ μ₂ − μ₁.
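For atomic measures these identities reduce to finite sums, so they can be checked directly. The sketch below draws random densities with respect to counting measure and verifies the affinity identity 2α₁ = μ₁X + μ₂X − ‖μ₁ − μ₂‖₁, together with the attainment of sup_{|f|≤1} |μf| at f = {m ≥ 0} − {m < 0}:

```python
import random

random.seed(2)
n = 6
m1 = [random.random() for _ in range(n)]   # densities w.r.t. counting measure
m2 = [random.random() for _ in range(n)]

l1 = sum(abs(a - b) for a, b in zip(m1, m2))        # ||mu1 - mu2||_1
affinity = sum(min(a, b) for a, b in zip(m1, m2))   # alpha_1 = lambda(m1 ^ m2)
mass1, mass2 = sum(m1), sum(m2)

# 2 alpha_1(mu1, mu2) = mu1 X + mu2 X - ||mu1 - mu2||_1
assert abs(2 * affinity - (mass1 + mass2 - l1)) < 1e-12

# sup over |f| <= 1 of |(mu1 - mu2)f| is attained at f = {m1 >= m2} - {m1 < m2}
sup_f = sum((a - b) * (1 if a >= b else -1) for a, b in zip(m1, m2))
assert abs(sup_f - l1) < 1e-12
```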
Hellinger distance between probability measures

Let P and Q be probability measures with densities p and q with respect to a dominating measure λ. The square roots of the densities, √p and √q, are both square integrable; they both belong to ℒ²(λ). The Hellinger distance between the two measures is defined as the ℒ² distance between the square roots of their densities,

H(P, Q)² := λ(√p − √q)².

Again the distance does not depend on the choice of dominating measure (Problem [13]). The quantity λ√(pq) is called the Hellinger affinity between the two probabilities, and is denoted by α₂(P, Q). Clearly √(pq) ≥ p ∧ q, from which it follows that α₂(P, Q) ≥ α₁(P, Q) and H(P, Q)² ≤ ‖P − Q‖₁. The Cauchy-Schwarz inequality gives a useful lower bound:

‖P − Q‖₁² = (λ|√p − √q|(√p + √q))² ≤ H(P, Q)² λ(√p + √q)².

Substituting for the Hellinger affinity, we get ‖P − Q‖₁ ≤ H(P, Q)(4 − H(P, Q)²)^{1/2}, which is smaller than 2H(P, Q). The Hellinger distance defines a bounded metric on the space of all probability measures on 𝒜. Convergence in that metric is equivalent to convergence in ℒ¹ norm, because

<10>  H(P, Q)² ≤ ‖P − Q‖₁ ≤ 2H(P, Q).
The Hellinger distance satisfies the inequality 0 ≤ H(P, Q) ≤ √2. The equality at 0 occurs when √p = √q almost surely [λ], that is, when P = Q as measures on 𝒜. Equality at √2 occurs when the Hellinger affinity is zero, that is, when pq = 0 almost surely [λ], which is the condition that P and Q be mutually singular. For example, discrete distributions (concentrated on a countable set) are always at the maximum Hellinger distance from nonatomic distributions (zero mass at each point).
REMARK. Some authors prefer to have an upper bound of 1 for the Hellinger distance; they include an extra factor of one half in the definition of H(P, Q)².
Relative entropy

Let P and Q be two probability measures with densities p and q with respect to some dominating measure λ. The relative entropy (also known as the Kullback-Leibler "distance," even though it is not a metric) between P and Q is defined as

D(P‖Q) := λ(p log(p/q)).
At first sight, it is not obvious that the definition cannot suffer from the ∞ − ∞ problem. A Taylor expansion comes to the rescue: for x ≥ −1,

<11>  (1 + x) log(1 + x) = x + R(x),

with R(x) = ½x²/(1 + x*) for some x* between 0 and x. When p > 0 and q > 0, put x = (p − q)/q, discard the nonnegative remainder term, then multiply through by q to get p log(p/q) ≥ p − q. The same inequality also holds at points where p > 0 and q = 0, with the left-hand side interpreted as ∞; and at points where p = 0 we get no contribution to the defining integral. It follows not only that λp(log(p/q))⁻ < ∞ but also that D(P‖Q) ≥ λ(p − q) = 0. The relative entropy is well defined and nonnegative. (Nonnegativity would also follow via Jensen's inequality.) Moreover, if λ{p > 0 = q} > 0 then p log(p/q) is infinite on a set of positive measure, which forces D(P‖Q) = ∞. That is, the relative entropy is infinite unless P is absolutely continuous with respect to Q. It can also be infinite even if P and Q are mutually absolutely continuous (Problem [15]).
As with the ℒ¹ and Hellinger distances, the relative entropy does not depend on the choice of the dominating measure λ (Problem [14]).

It is easy to deduce from the conditions for equality in Jensen's inequality that D(P‖Q) = 0 if and only if P = Q. An even stronger assertion follows from the inequality

<12>  D(P‖Q) ≥ H²(P, Q).
This inequality is trivially true unless P is absolutely continuous with respect to Q, in which case we can take λ equal to Q. For that case, define η := √p − 1. Note that Qη² = H²(P, Q) and

1 = Qp = Q(1 + η)² = 1 + 2Qη + Qη²,

which implies that 2Qη = −H²(P, Q). Hence, using log(1 + η) ≥ η/(1 + η),

D(P‖Q) = 2Q((1 + η)² log(1 + η)) ≥ 2Q(η + η²) = 2Qη + 2Qη² = H²(P, Q),

as asserted.
In a similar vein, there is a lower bound for the relative entropy involving the ℒ¹ distance. Inequalities <10> and <12> together imply D(P‖Q) ≥ ¼‖P − Q‖₁². A more direct argument will give a slightly better bound,

<13>  D(P‖Q) ≥ ½‖P − Q‖₁².

The improvement comes from a refinement (Problem [19]) of the error term in equality <11>, namely, R(x) ≥ ½x²/(1 + x/3) for x ≥ −1.
To establish <13> we may once more assume, with no loss of generality, that P is absolutely continuous with respect to Q. Write 1 + δ for the density dP/dQ. Notice that Qδ = 0. Then deduce that

D(P‖Q) = Q((1 + δ) log(1 + δ) − δ) ≥ ½Q(δ²/(1 + δ/3)).

Multiply the right-hand side by 1 = Q(1 + δ/3), then invoke the Cauchy-Schwarz inequality to bound the product from below by half the square of

Q((|δ|/(1 + δ/3)^{1/2}) (1 + δ/3)^{1/2}) = Q|δ| = ‖P − Q‖₁.

The asserted inequality <13> follows.
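All three inequalities, <10>, <12>, and <13>, can be stress-tested on randomly generated discrete distributions. The sketch keeps every density strictly positive so that the relative entropy stays finite:

```python
import math, random

random.seed(3)
for _ in range(200):
    n = random.randint(2, 8)
    p = [random.random() + 1e-9 for _ in range(n)]
    q = [random.random() + 1e-9 for _ in range(n)]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]                  # normalize to probabilities
    q = [x / sq for x in q]

    D = sum(a * math.log(a / b) for a, b in zip(p, q))            # relative entropy
    H2 = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    L1 = sum(abs(a - b) for a, b in zip(p, q))

    assert H2 <= L1 + 1e-12                  # left half of <10>
    assert L1 <= 2 * math.sqrt(H2) + 1e-9    # right half of <10>
    assert D >= H2 - 1e-12                   # inequality <12>
    assert D >= 0.5 * L1 ** 2 - 1e-12        # inequality <13>
```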
<14>
Example. Let P_θ denote the N(θ, 1) distribution on the real line, with density φ(x − θ) with respect to Lebesgue measure, where φ(x) = exp(−x²/2)/√(2π). Each of the three distances between P₀ and P_θ can be calculated in closed form:

D(P₀‖P_θ) = P₀(−½x² + ½(x − θ)²) = ½θ²,

and

H²(P₀, P_θ) = 2 − 2∫ √(φ(x)φ(x − θ)) dx = 2 − 2 exp(−θ²/8),

and

‖P₀ − P_θ‖₁ = ∫ |φ(x) − φ(x − θ)| dx = 2∫ (φ(x) − φ(x − θ))⁺ dx = 2(Φ(½θ) − Φ(−½θ)),

where Φ denotes the N(0, 1) distribution function.
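The three closed forms can be confirmed by numerical integration. The sketch below uses a midpoint rule on a truncated range (the shift θ = 0.7, the range, and the grid size are arbitrary choices):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # N(0,1) distribution function via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

theta = 0.7
M, lo, hi = 100000, -10.0, 10.0          # tails beyond +-10 are negligible
dx = (hi - lo) / M
xs = [lo + (i + 0.5) * dx for i in range(M)]

D = sum(phi(x) * (-x * x / 2 + (x - theta) ** 2 / 2) for x in xs) * dx
H2 = sum((math.sqrt(phi(x)) - math.sqrt(phi(x - theta))) ** 2 for x in xs) * dx
L1 = sum(abs(phi(x) - phi(x - theta)) for x in xs) * dx

assert abs(D - theta ** 2 / 2) < 1e-4
assert abs(H2 - 2 * (1 - math.exp(-theta ** 2 / 8))) < 1e-4
assert abs(L1 - 2 * (Phi(theta / 2) - Phi(-theta / 2))) < 1e-4
```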
Inequalities <10>, <12>, and <13> then become, for θ near zero,

¼θ² + O(θ⁴) ≤ √(2/π) θ + O(θ³) ≤ θ + O(θ³)   for <10>,
½θ² ≥ ¼θ² + O(θ⁴)   for <12>,
½θ² ≥ (1/π)θ² + O(θ⁴)   for <13>.

The comparisons show that there is little room for improvement, except possibly for the lower bound in <10>.

<15>
Example. The method from the previous Example can be extended to other families of densities, providing a better indication of how much improvement might be possible in the three inequalities. I will proceed heuristically, ignoring questions of rigor.
Suppose P_θ has a smooth density f(x − θ) := exp(g(x − θ)) with respect to Lebesgue measure. From the expansion

1 = ∫_{−∞}^{∞} exp(g(x − θ)) dx = P₀ exp(g(x − θ) − g(x)) = 1 − θP₀g′(x) + ½θ²P₀(g″(x) + g′(x)²) + terms of order θ³,

we can conclude first that P₀g′ = 0, and then P₀(g″(x) + g′(x)²) = 0.
REMARK.
The last two assertions can be made rigorous by domination
assumptions, allowing derivatives to be taken inside integral signs. Readers familiar
with the information inequality from theoretical statistics will recognize the two ways
of representing the information function for the family of densities, together with the
zero-mean property of the score function.
Continuing the heuristic, we get approximations for the three distances for values of θ near zero. First,

D(P₀‖P_θ) = P₀(g(x) − g(x − θ)) = ½θ²P₀(g′)² + ...

From the Taylor approximation

exp(½g(x − θ) − ½g(x)) = 1 − ½θg′ + ¼θ²(g″ + ½g′²) + ...

we get

H²(P₀, P_θ) = 2 − 2P₀ exp(½g(x − θ) − ½g(x)) = ¼θ²P₀(g′)² + ...
And finally,

‖P₀ − P_θ‖₁ = ∫ |f(x) − f(x − θ)| dx = θP₀|g′| + ...

If we could find a g for which (P₀|g′|)² = P₀(g′)², the two inequalities

2H(P₀, P_θ) ≥ ‖P₀ − P_θ‖₁   and   D(P₀‖P_θ) ≥ ½‖P₀ − P_θ‖₁²

would become sharp up to terms of order θ². The desired equality for g′ would force |g′| to be a constant, which would lead us to the density f(x) = ½e^{−|x|}. Of course log f is not differentiable at the origin, an oversight we could remedy by a slight smoothing of |x| near the origin. In that way we could construct arbitrarily smooth densities with P₀(g′)² as close as we please to (P₀|g′|)². There can be no improvement in the constants in the last two displayed inequalities.
4.
<16>
Note the strong similarity between Definition <16> and the reformulation of the measure theoretic definition of absolute continuity as an ε-δ property, in Example <3>.
The connection between absolute continuity of functions and integration of
derivatives was established by Lebesgue (1904). It is one of the most celebrated
results of classical analysis.
<17> Fundamental Theorem of Calculus. A real-valued function $H$ on an interval $[a,b]$ is absolutely continuous if and only if there exists a Lebesgue integrable function $h$ for which
(FT$_1$) $\qquad H(x) = H(a) + \int_a^x h(t)\,dt$ for all $x$ in $[a,b]$,
in which case
(FT$_2$) $\qquad H$ has derivative $H'(x) = h(x)$ at almost all $x$ in $[a,b]$.
As shown at the end of this Section, Assertion FT$_1$ can be recovered from the
Radon-Nikodym theorem for measures. The proof of Assertion FT$_2$ (Section 6),
which identifies the density $h$ with a derivative defined almost everywhere, requires
an auxiliary result known as the Vitali Covering Lemma (Section 5).
Notice what the Fundamental Theorem does not assert: that differentiability
almost everywhere should allow the function to be recovered as an integral of that
derivative. As a pointwise limit of continuous functions, $(H(x + \delta) - H(x))/\delta$, the
derivative $H'(x)$ is measurable, when it exists. However, it need not be Lebesgue
integrable, as noted by Lebesgue in his doctoral thesis (Lebesgue 1902): the function
$F(x) := x^2 \sin(1/x^2)$, with $F(0) = 0$, has a derivative at all points of the real line,
but $F'(x)$ behaves like $2x^{-1}\cos(1/x^2)$ for $x$ near $0$, which prevents integrability
on any interval that contains $0$ (Problem [20]). The function $F$ is not absolutely
continuous.

Neither does the Fundamental Theorem assert that absolute continuity follows
from almost sure existence of $H'$ with $\int_a^b |H'(x)|\,dx$ finite. For example, the
function $F$ constructed in Example <1>, whose derivative exists and is equal to zero
outside a set of zero Lebesgue measure, cannot be recovered by integration of the
derivative, even though that derivative is integrable. It is therefore quite surprising
that existence of even a one-sided, integrable derivative everywhere is enough to
ensure absolute continuity.
<19>

<20>
$$F^{\pm}(x) := \sup_{\pi(x)} \sum_i \bigl(H(x_i) - H(x_{i-1})\bigr)^{\pm},$$
where the suprema run over the collection $\pi(x)$ of all finite partitions $a = x_0 < x_1 <
\ldots < x_k = x$ of $[a, x]$, for each $x$ in $[a,b]$. At $x = a$ all partitions are degenerate and
$F^{\pm}(a) = 0$. By splitting $[a, x]$ into a finite union of intervals of length less than $\delta$,
you will see that both functions are real valued. More precisely, $F^{\pm}(x) \le \epsilon\lceil (b-a)/\delta\rceil$,
where $\epsilon$ and $\delta$ come from Definition <16>.
3.4
67
Both $F^{\pm}$ are increasing functions, for the following reason. First note that
insertion of an extra point into a partition $\{x_i\}$ increases the defining sums. In
particular, if $a \le y < x$ then we may arrange that $y$ is a point in the $\pi(x)$ partition,
so that the sums on the right-hand side of <20> are larger than the corresponding
sums for $\pi(y)$. More precisely, $F^{\pm}(x) = F^{\pm}(y) + \sup \sum_i \bigl(H(y_i) - H(y_{i-1})\bigr)^{\pm}$, where
the supremum runs over all finite partitions $y = y_0 < y_1 < \ldots < y_k = x$ of $[y, x]$.
When $|x - y| < \delta$, with $\delta > 0$ as in Definition <16>, each sum for the supremum
is less than $\epsilon$, and hence $|F^{\pm}(x) - F^{\pm}(y)| \le \epsilon$. That is, both $F^{\pm}$ are continuous
functions on $[a, b]$. A similar argument applied to finite nonoverlapping collections
of intervals $\{[y_i, x_i]\}$ leads to the stronger conclusion that both $F^{\pm}$ are absolutely
continuous functions.
REMARK.
We didn't really need absolute continuity to break $H$ into a difference
of two increasing functions. It would suffice to have $\sup \sum_i |H(y_i) - H(y_{i-1})| < \infty$,
where the supremum runs over all finite partitions $a = y_0 < y_1 < \ldots < y_k = b$
of $[a,b]$. Such a function is said to have bounded variation on the interval $[a,b]$.
Close inspection of the arguments in Section 6 would reveal that functions of bounded
variation have a derivative almost everywhere. Without absolute continuity, we cannot
recover the function by integrating that derivative, as shown by Example <1>.
For each partition in $\pi(x)$ we have
$$\sum_i \bigl(H(x_i) - H(x_{i-1})\bigr)^{+} - \sum_i \bigl(H(x_i) - H(x_{i-1})\bigr)^{-} = \sum_i \bigl(H(x_i) - H(x_{i-1})\bigr),$$
which reduces to $H(x) - H(a)$ after cancellations. We can choose the partition to
give simultaneously values as close as we please to both suprema in <20>. In the
limit we get
$$H(x) - H(a) = F^{+}(x) - F^{-}(x) = \int_a^x \bigl(h^{+}(t) - h^{-}(t)\bigr)\,dt.$$
That is, representation <18> holds with $h = h^{+} - h^{-}$. The uniqueness of the
representing $h$ can be established by another $\lambda$-class generating argument.
*5. Vitali covering lemma
<21>

<22> Vitali Covering Lemma. For some fixed $\gamma > 0$, let $\mathcal{V}$ be a Vitali covering for a
set $D$ in $\mathcal{B}(\mathbb{R}^d)$ with finite Lebesgue measure. Then there exists a countable family
$\{F_i\}$ of disjoint sets from $\mathcal{V}$ for which the set $D \backslash (\cup_i F_i)$ has zero Lebesgue measure.
REMARK.
More refined versions of the Theorem (such as the results presented
by Saks 1937, Section IV.3, or Wheeden & Zygmund 1977, Section 7.3) allow $\gamma$
to vary from point to point within $D$, and relax assumptions of measurability or
finiteness of $mD$. We will have no need for the more general versions.
Proof. The result is trivial if $mD$ is zero, so we may assume that $mD > 0$. The
proof works via repeated reduction of $mD$ by removal of unions of finite families
of disjoint sets from $\mathcal{V}$. The method for the first step sets the pattern.

Fix an $\epsilon > 0$. As you will see later, we need $\epsilon$ small enough that the constant
$\rho := 3^{-d}(1 - \epsilon)\gamma - \epsilon$ is strictly positive.
Choose a compact subset $K$ of $D$ with $mK > (1 - \epsilon)mD$. The family of open
balls $\{B_F : F \in \mathcal{V}\}$ covers the compact set $K$. It has a finite
subcover, corresponding to sets $F_1, \ldots, F_m$. The ball $B_{F_i}$ is of the form $B(x_i, r_i)$.
We may assume that the sets are numbered so that $r_1 \ge r_2 \ge \ldots \ge r_m$.

Define a subset $J$ of $\{1, 2, \ldots, m\}$ by successively
considering each $j$, in the order $1, 2, \ldots, m$, for inclusion
in $J$, rejecting $j$ if $F_j$ intersects an $F_i$ for an $i$ already
accepted into $J$. For example, with the five sets shown in the
picture, we would have $J = \{1, 2, 5\}$: we include 1, then 2
(because $F_2 \cap F_1 = \emptyset$), reject 3 (because $F_3 \cap F_1 \neq \emptyset$), reject 4
(because $F_4 \cap F_2 \neq \emptyset$), then accept 5 (because $F_5 \cap F_1 = \emptyset$
and $F_5 \cap F_2 = \emptyset$).

For each excluded $j$ there is an $i$ in $J$ for which
$i < j$ and $F_i \cap F_j \neq \emptyset$. The ordering of the radii
ensures that $B(x_i, 3r_i) \supseteq B(x_j, r_j)$: if $z \in F_i \cap F_j$
and $y \in B(x_j, r_j)$, we have
$$|x_i - y| \le |x_i - z| + |z - y| \le r_i + 2r_j \le 3r_i.$$
Consequently,
$$m\Bigl(\cup_{i \in J} B(x_i, 3r_i)\Bigr) \ge m\Bigl(\cup_{j \le m} B(x_j, r_j)\Bigr) \ge mK > (1 - \epsilon)mD.$$
It follows that $\sum_{i \in J} mF_i \ge \gamma \sum_{i \in J} mB(x_i, r_i) \ge 3^{-d}\gamma(1 - \epsilon)mD$. Writing $E_1$
for the union of the disjoint sets $\{F_i : i \in J\}$, we then have $m(D \backslash E_1) \le (1 - \rho)mD$.
A repetition of the argument, with $D$ replaced by $D \backslash E_1$, produces a finite
collection of disjoint sets from $\mathcal{V}$ whose union $E_2$ is disjoint from $E_1$ and removes at least a
fraction $\rho$ of the Lebesgue measure of $D \backslash E_1$. Then $m\bigl(D \backslash (E_1 \cup E_2)\bigr) \le (1 - \rho)^2 mD$,
and the sets that make up $E_1 \cup E_2$ are all disjoint. And so on. In the limit we have
the countable disjoint family whose existence is asserted by the Lemma.
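The greedy selection of $J$, and the tripled-radius covering property that drives the proof, can be illustrated in code. A sketch with randomly generated balls on the line (the parameters are my own choices, not the book's):

```python
import random

random.seed(1)
# Balls B(x_i, r_i) on the line, listed in decreasing order of radius.
balls = sorted([(random.uniform(0, 10), random.uniform(0.1, 1.0)) for _ in range(50)],
               key=lambda b: -b[1])

def intersects(a, b):
    # Two closed balls intersect iff the distance between their centers
    # is at most the sum of their radii.
    return abs(a[0] - b[0]) <= a[1] + b[1]

J = []   # greedily accepted, pairwise disjoint balls
for b in balls:
    if all(not intersects(b, a) for a in J):
        J.append(b)

# Every ball (accepted or rejected) lies inside B(x_i, 3 r_i) for some
# accepted ball of at least its radius that it intersects.
for (x, r) in balls:
    assert any(intersects((x, r), (xi, ri)) and ri >= r and abs(x - xi) + r <= 3 * ri
               for (xi, ri) in J)
```

The final assertion is exactly the inequality $|x_i - y| \le r_i + 2r_j \le 3r_i$ from the proof, checked ball by ball.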
*6. Densities as almost sure derivatives

This Section contains a proof of Assertion FT$_2$ of the Fundamental Theorem: if
$$H(x) = H(a) + \int_a^x h(t)\,dt \qquad\text{for } a \le x \le b,$$
for a Lebesgue integrable function $h$, then $H$ has a derivative at almost all points
of $[a, b]$ and that derivative coincides with $h$ almost everywhere.
It suffices if we consider the case where $h$ is nonnegative, from which the
general case would follow by breaking $h$ into its positive and negative parts. For
nonnegative $h$, write $\nu$ for the measure on $\mathcal{B}[a, b]$ with density $h$ with respect to
Lebesgue measure $m$. Define $\mathcal{E}_n(x)$ as the set of all nondegenerate (that is, nonzero
length) closed intervals $E$ for which $x \in E \subseteq (x - n^{-1}, x + n^{-1})$. Notice that if the
derivative $H'(x)$ exists then it can be recovered as a limit of ratios $\nu E_n / m E_n$, with
$E_n \in \mathcal{E}_n(x)$.
Define functions
$$f_n(x) := \sup\Bigl\{\frac{\nu E}{mE} : E \in \mathcal{E}_n(x)\Bigr\}
\qquad\text{and}\qquad
g_n(x) := \inf\Bigl\{\frac{\nu E}{mE} : E \in \mathcal{E}_n(x)\Bigr\}.$$
Both the sets $\{f_n > r\}$ and $\{g_n < r\}$ are open, and hence both $f_n$ and $g_n$ are
Borel measurable. For example, if $f_n(x) > r$ then there exists an interval $E$ in
$\mathcal{E}_n(x)$ with $\nu E > r\, mE$. Continuity of $H$ at the endpoints of $E$ lets us expand $E$
slightly, ensuring that $x$ is an interior point of $E$ while keeping the ratio above $r$
and keeping $E$ within $(x - n^{-1}, x + n^{-1})$. If $y$ is close enough to $x$, the interval $E$
also belongs to $\mathcal{E}_n(y)$, and hence $f_n(y) > r$. The monotone limits $f(x) := \inf_n f_n(x)$
and $g(x) := \sup_n g_n(x)$ are also Borel measurable.
Clearly $f(x) \ge g(x)$ everywhere. For $0 < \delta < 1/n$, both $[x, x+\delta]$ and $[x-\delta, x]$
belong to $\mathcal{E}_n(x)$, and hence both
$$\frac{H(x+\delta) - H(x)}{\delta}
\qquad\text{and}\qquad
\frac{H(x) - H(x-\delta)}{\delta}$$
lie between $g_n(x)$ and $f_n(x)$. In the limit as first $\delta$ tends to zero then $n$ tends to
infinity we have
$$g(x) \le \liminf_{\delta \downarrow 0} \frac{H(x+\delta) - H(x)}{\delta}
\le \limsup_{\delta \downarrow 0} \frac{H(x+\delta) - H(x)}{\delta}
\le f(x),$$
with an analogous pair of inequalities for $[x - \delta, x]$. At points where $g(x) = h(x) =
f(x)$ the derivative $H'(x)$ must exist and be equal to $h(x)$. The next Lemma will
help us prove the almost sure equality of $f$, $g$, and $h$.
<23> Lemma. Let $A$ be a Borel subset of $[a,b]$ and let $r$ be a positive constant.
(i) If $f(x) > r$ for all $x$ in $A$ then $\nu A \ge r\, mA$.
(ii) If $g(x) < r$ for all $x$ in $A$ then $\nu A \le r\, mA$.

For (i), the intervals $E$ with $\nu E > r\, mE$ contained in small neighborhoods of
points of $A$ form a Vitali covering $\mathcal{V}$ of $A$. The Covering Lemma provides a countable
family $\{E_j\}$ of disjoint intervals from $\mathcal{V}$ whose union $L$ covers $A$ up to an
$m$-negligible set. Then
$$\nu L = \sum\nolimits_j \nu E_j \qquad\text{disjoint intervals}$$
$$\ge \sum\nolimits_j r\, mE_j \qquad\text{definition of } \mathcal{V}$$
$$= r\, mL \qquad\text{disjoint intervals}$$
$$\ge r\, mA \qquad\text{covering property.}$$
7. Problems
[1]
Let $\mu$ be the measure defined for all subsets of $\mathfrak{X}$ by $\mu A = +\infty$ if $A \neq \emptyset$, and
$\mu\emptyset = 0$. Show that $\mu f = \mu(2f)$ for all nonnegative functions $f$.
[2]
Let $\lambda$ denote the measure on $\mathbb{R}$ with $\lambda A$ equal to the number of points in $A$ (possibly
infinite), and let $\mu A = \infty$ for all nonempty $A$. Show that $\lambda$ has no density with
respect to $\mu$, even though both measures have the same negligible sets.
[3]
Suppose $\Delta_1$ and $\Delta_2$ are functions in $M^+(\mathfrak{X}, \mathcal{A})$ for which $\mu(f\Delta_1) = \mu(f\Delta_2)$ for all
$f$ in $M^+(\mathfrak{X}, \mathcal{A})$, for some measure $\mu$.
(i) Show that $\Delta_1 = \Delta_2$ a.e. $[\mu]$ if $\mu\Delta_1 < \infty$. Hint: Consider the equality
$\mu\Delta_1\{\Delta_1 > \Delta_2\} = \mu\Delta_2\{\Delta_1 > \Delta_2\}$.
(ii) Show that the assumption of finiteness of $\mu\Delta_1$ is not required for the conclusion
$\Delta_1 = \Delta_2$ a.e. $[\mu]$ if $\mu$ is a sigma-finite measure. Hint: For each positive rational
number $r$ show that $\mu\bigl((\Delta_1 - \Delta_2)A\{\Delta_1 > r > \Delta_2\}\bigr) = 0$ for each set $A$ with $\mu A < \infty$.
[4]
$p$ in $\mathcal{L}^1(Q)$ such that $PE = Q(pE)$ for every $E$ in $\mathcal{E}$. Show that $P$ has density $p$
with respect to $Q$.
[5]
Does Example <3> have an analog for $\mathcal{L}^p$ convergence, for some $p \ge 1$: if $\nu$ is a
finite measure dominated by a measure $\lambda$, and if $\{f_n\}$ is a sequence of measurable
functions for which $\lambda|f_n|^p \to 0$, does it follow that $\nu|f_n|^p \to 0$?
[6]
[7]
[8]
Show that a measure $\mu$ is sigma-finite if and only if there exists a strictly positive,
measurable function $\Psi$ for which $\mu\Psi < \infty$. Hint: If $\mu$ is sigma-finite, consider
functions of the form $\Psi(x) = \sum_i \alpha_i\{x \in A_i\}$, for appropriate partitions.
[9]
Suppose $(N_1, \Delta_1)$ and $(N_2, \Delta_2)$ are both Lebesgue decompositions for a finite
measure $\nu$ with respect to a finite measure $\mu$. Prove $N_1 = N_2$ a.e. $[\nu + \mu]$ and
$\Delta_1 N_1^c = \Delta_2 N_2^c$ a.e. $[\nu + \mu]$ by the following steps.
(i) Show that $N_1 = N_2$ a.e. $[\nu]$ by means of the equality $\nu(N_2 N_1^c) = \nu(N_2 N_1^c N_1) +
\mu(N_2 N_1^c \Delta_1) = 0$, and a companion equality obtained by interchanging the roles
of the two decompositions. Using the trivial fact that $N_1 = N_2$ a.e. $[\mu]$ (because
both sets are $\mu$-negligible), deduce that $N_1 = N_2$ a.e. $[\nu + \mu]$.
(ii) Use the result from part (i) to show that
$$\mu(f\Delta_1 N_1^c) = \nu(f N_1^c) = \nu(f N_2^c) = \mu(f\Delta_2 N_2^c)$$
for all $f \in M^+$.
[10]
(iii) From the Lebesgue decomposition $\nu_0 f = \nu_0(fN) + \mu_0(f\Delta)$ for all $f \in M^+$,
where $\mu_0 N = 0$ and $\Delta \in M^+$, derive the corresponding Lebesgue decomposition
for $\lambda$ with respect to $\mu$.
(iv) From the uniqueness result of Problem [9], applied to $\lambda_0$ and $\mu_0$, deduce that
the Lebesgue decomposition for $\lambda$ is unique up to the appropriate almost sure
equivalences.
[11]
[12]
Let $\mu_1$ and $\mu_2$ be finite measures with densities $m_1$ and $m_2$ with respect to a
sigma-finite measure $\lambda$. Define $(\mu_1 - \mu_2)^+$ as the measure with density $(m_1 - m_2)^+$
with respect to $\lambda$.
(i) Show that $(\mu_1 - \mu_2)^+ B = \sup \sum_i \bigl(\mu_1(B A_i) - \mu_2(B A_i)\bigr)^+$, where the supremum runs
over all finite partitions $\{A_i\}$ of $\mathfrak{X}$ into measurable sets.
(ii) Show that $|\mu_1 - \mu_2| = (\mu_1 - \mu_2)^+ + (\mu_2 - \mu_1)^+$.

[13]
[14]
Adapt the argument from the previous Problem to show that the relative entropy
$D(P\|Q)$ does not depend on the choice of dominating measure.
[15]
Let $P$ be the standard Cauchy distribution on the real line, and let $Q$ be the standard
normal distribution. Show that $D(P\|Q) = \infty$, even though $P$ and $Q$ are mutually
absolutely continuous.
[16]
[17]
Let $P$ and $Q$ be finite measures defined on the same sigma-field $\mathcal{F}$, with densities $p$
and $q$ with respect to a measure $\mu$. Suppose $X_0$ is a measurable set with the property
that there exists a nonnegative constant $K$ such that $q \ge Kp$ on $X_0$ and $q \le Kp$
on $X_0^c$. For each $\mathcal{F}$-measurable function $f$ with $0 \le f \le 1$ and $Pf \le PX_0$, prove that
$Qf \le QX_0$. Hint: Prove that $(q - Kp)(2X_0 - 1) \ge (q - Kp)(2f - 1)$, then integrate.
To statisticians this result is known as the Neyman-Pearson Lemma.
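The inequality is easy to see at work numerically. A sketch for counting measure on ten points (the discrete densities, the constant $K$, and the randomized tests are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Densities p, q with respect to counting measure on {0,...,9}.
p = np.array([.05, .05, .10, .10, .15, .15, .15, .10, .10, .05])
q = np.array([.01, .02, .03, .04, .10, .10, .15, .15, .20, .20])
K = 1.0
X0 = (q >= K * p).astype(float)          # the likelihood-ratio region {q >= K p}

PX0, QX0 = (p * X0).sum(), (q * X0).sum()
for _ in range(1000):
    f = rng.uniform(0, 1, size=10)       # a randomized test, 0 <= f <= 1
    if (p * f).sum() > PX0:              # enforce the size constraint Pf <= P X0
        f *= PX0 / (p * f).sum()
    assert (q * f).sum() <= QX0 + 1e-12  # Neyman-Pearson conclusion Qf <= Q X0
```

No randomized test of the same $P$-size beats the likelihood-ratio region in $Q$-measure, which is the statistical content of the Lemma.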
[18]
Let $\mu = \mu_1 - \mu_2$ and $\mu' = \mu_1' - \mu_2'$ be signed measures, in the sense of Section 3.
Show that $\nu := (\mu_1 + \mu_2') \wedge (\mu_1' + \mu_2) - (\mu_2 + \mu_2')$ is the largest signed measure for
which $\nu \le \mu$ and $\nu \le \mu'$.
[19]
Let $f(x) = (1 + x)\log(1 + x) - x$, for $x \ge -1$. Use the representations
$$f(x) = \int_0^x \frac{x - t}{1 + t}\,dt
\qquad\text{and}\qquad
\tfrac12 x^2\bigl(1 + x/3\bigr)^{-1} = \int_0^x \frac{x - t}{(1 + t/3)^3}\,dt$$
to show that $f(x) \ge \tfrac12 x^2(1 + x/3)^{-1}$.
[20]
Show that the function $x^{-1}\cos(x^{-2})$ is not integrable on $[0,1]$. Hint: Consider
contributions from intervals where $x^{-2}$ lies in a range $n\pi \pm \pi/4$. The intervals have
lengths of order $n^{-3/2}$, but the $x^{-1}$ contributes a factor of order $n^{1/2}$.
[21]
(i) There exist open sets $G_n$ and compact sets $K_n$ with $K_n \subseteq A_n \subseteq G_n$ and
$m(G_n \backslash K_n) < \epsilon_n$, with $\epsilon_n$ so small that the functions
$$f(x) := \sum\nolimits_{i \ge 0} (i + 1)\{x \in G_i\}$$

[22]
$$(1 - \alpha)H(x_0) + \alpha H(x_1) - H(x_\alpha) = \int_0^1 \ldots \ge 0.$$
[23] Let $h$ be integrable with respect to Lebesgue measure $m$ on $\mathbb{R}^d$. Show that
$$\lim_{r \to 0} \frac{m^x\bigl(\{|x - z| \le r\}\,h(x)\bigr)}{m B(z, r)} = h(z)
\qquad\text{a.e. } [m].$$
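The averaging statement is easy to check numerically in one dimension. A sketch (the function $h$, the evaluation point, and the midpoint rule are my own choices):

```python
import numpy as np

# For h(x) = x^2 the average over the ball [z - r, z + r] equals z^2 + r^2/3,
# which converges to h(z) as r -> 0.
h = lambda x: x**2
z = 0.5

def ball_average(r, npts=10000):
    # midpoint-rule approximation to (1 / m B(z,r)) * integral of h over the ball
    dx = 2 * r / npts
    x = z - r + (np.arange(npts) + 0.5) * dx
    return (h(x) * dx).sum() / (2 * r)

errs = [abs(ball_average(r) - h(z)) for r in (0.1, 0.01, 0.001)]
assert errs[0] > errs[1] > errs[2]   # the error shrinks with r (here like r^2/3)
assert errs[2] < 1e-6
```

At a point where $h$ is continuous the convergence is elementary; the content of the Problem is that it holds $m$-almost everywhere for every integrable $h$.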
8. Notes
The Fundamental Theorem <17> is due to Lebesgue, proved in part in his doctoral
dissertation (Lebesgue 1902), and completed in his Peccot lectures (Lebesgue 1904).
Read Hawkins (1979) if you want to appreciate the subtlety and great significance
of Lebesgue's contributions. Note, in particular, Hawkins's comments on page 145
regarding introduction of the term absolute continuity. Compare the footnotes on
page 129 (as reprinted in his collected works) of the 1904 edition of the Peccot
lectures,
For a function to be an indefinite integral, it is necessary in addition that its total
variation over a countably infinite collection of intervals of total length $\ell$ tend to
zero with $\ell$.

If, in the statement on page 94, one does not require $f(x)$ to be bounded,
nor $F(x)$ to have bounded derived numbers, but only the preceding condition,
one obtains a definition of the integral equivalent to the one developed in
this Chapter and applicable to all summable functions, bounded or not.
Inequality <13> is due to Kemperman (1969), Csiszár (1967), and Kullback (1967). (In fact, Kullback established a slightly better inequality.) My proof
is based on Kemperman's argument. See Devroye (1987, Chapter 1) for other
inequalities involving distances between probability densities.
REFERENCES
Chapter 4
1. Independence
Much classical probability theory, such as the laws of large numbers and central
limit theorems, rests on assumptions of independence, which justify factorizations
for probabilities of intersections of events or expectations for products of random
variables.
An elementary treatment usually starts from the definition of independence for
events. Two events $A$ and $B$ are said to be independent if $P(AB) = (PA)(PB)$; three
events $A$, $B$, and $C$ are said to be independent if not only $P(ABC) = (PA)(PB)(PC)$
but also $P(AB) = (PA)(PB)$ and $P(AC) = (PA)(PC)$ and $P(BC) = (PB)(PC)$.
And so on. There are similar definitions for independence of random variables,
in terms of joint distribution functions or joint densities. The definitions have two
things in common: they all assert some type of factorization; and they do not
lend themselves to elementary derivation of desirable facts about independence.
The measure theoretic approach, by contrast, simplifies the study of independence
by eliminating unnecessary duplications of definitions, replacing them by a single
concept of independence for sigmafields, from which useful consequences are
easily deduced. For example, the following key assertion is impossible to derive by
elementary means, but requires only routine effort (see Section 2) to establish by
measure theoretic arguments.
<2> Theorem. Let $Z_1, \ldots, Z_n$ be independent random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $f \in M^+(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ and $g \in M^+(\mathbb{R}^{n-k}, \mathcal{B}(\mathbb{R}^{n-k}))$ then
$f(Z_1, \ldots, Z_k)$ and $g(Z_{k+1}, \ldots, Z_n)$ are independent random variables, and
$$P f(Z_1, \ldots, Z_k)\, g(Z_{k+1}, \ldots, Z_n) = \bigl(P f(Z_1, \ldots, Z_k)\bigr)\bigl(P g(Z_{k+1}, \ldots, Z_n)\bigr).$$

<3>
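The factorization asserted by the Theorem can be verified by exact enumeration in a tiny discrete case. A sketch (the coin-flip variables and the particular $f$ and $g$ are my own choices):

```python
import itertools

# Z1,...,Z4 independent fair coin flips; f acts on (Z1, Z2), g on (Z3, Z4).
f = lambda z1, z2: (z1 + z2)**2
g = lambda z3, z4: z3 * z4 + 1

E_fg = E_f = E_g = 0.0
for z in itertools.product([0, 1], repeat=4):
    prob = 0.5**4                  # each of the 16 outcomes is equally likely
    E_fg += prob * f(z[0], z[1]) * g(z[2], z[3])
    E_f += prob * f(z[0], z[1])
    E_g += prob * g(z[2], z[3])

assert abs(E_fg - E_f * E_g) < 1e-12   # P(fg) = (Pf)(Pg)
```

Here the sum over all outcomes plays the role of the expectation, so the identity holds exactly, not just in simulation.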
A complete proof of this form of the SLLN is quite a challenge. The classical
proof (a modified version of which appears in Sections 6 and 7) combines a number
of tricks that are more easily understood if introduced as separate ideas and not
just rolled into one monolithic argument. The basic idea is not too hard to grasp
when we have bounded fourth moments; it involves little more than an application
of Corollary <2> and an appeal to the Borel-Cantelli lemma from Section 2.6.
For theoretical purposes, for summands that need not all have the same
distribution, it is cleaner to work with the centered variables $X_i - PX_i$, which is
equivalent to an assumption that all variables have zero expected values.
<4> Theorem. Let $X_1, X_2, \ldots$ be independent random variables with $PX_i = 0$ for
every $i$ and $\sup_i PX_i^4 =: M < \infty$. Then $(X_1 + \ldots + X_n)/n \to 0$ almost surely.

Proof. Define $S_n := X_1 + \ldots + X_n$. It is good enough to show, for each $\epsilon > 0$, that
<5> $\qquad \sum_n P\{|S_n/n| > \epsilon\} < \infty.$
Do you remember why? If not, you should refer to Section 2.6 for a detailed
explanation of the Borel-Cantelli argument: the series $\sum_n \{|S_n/n| > \epsilon\}$ must
converge almost surely, which implies that $\limsup |S_n/n| \le \epsilon$ almost surely, from
which the conclusion $\limsup |S_n/n| = 0$ follows after a casting out of a sequence of
negligible sets.
Bound the $n$th term of the sum in <5> by $(n\epsilon)^{-4} P(X_1 + \ldots + X_n)^4$. Expand
the fourth power,
$$(X_1 + \ldots + X_n)^4 = \sum_i X_i^4 + (\text{terms like } X_1^3 X_2) + (\text{terms like } X_1^2 X_2^2)
+ (\text{terms like } X_1^2 X_2 X_3) + (\text{terms like } X_1 X_2 X_3 X_4).$$
The contributions to $P(X_1 + \ldots + X_n)^4$ from the five groups of terms are:
(1) $\sum_i PX_i^4 \le nM$;
(2) zero, because $P(X_i^3 X_j) = (PX_i^3)(PX_j) = 0$;
(3) less than $3n^2(2M)$, because $P(X_i^2 X_j^2) \le PX_i^4 + PX_j^4 \le 2M$;
(4) zero, because $P(X_i^2 X_j X_k) = (PX_i^2 X_j)(PX_k) = 0$;
(5) zero, because $P(X_i X_j X_k X_l) = (PX_i X_j X_k)(PX_l) = 0$.
Notice all the factorizations due to independence. Combining these bounds and
equalities we get $P\{|S_n/n| > \epsilon\} = O(n^{-2})$, from which <5> follows.
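For independent random signs ($P\{X_i = \pm 1\} = \tfrac12$, so $M = 1$) the fourth moment of $S_n$ can be computed exactly by enumeration and compared with the crude bound from the proof. A sketch (the Rademacher choice is mine, not the book's):

```python
import itertools

# For independent +/-1 signs, E S_n^4 = n + 3n(n-1) = 3n^2 - 2n: only the
# groups (1) and (3) contribute, exactly as in the proof. The crude bound
# from the five groups, with M = 1, is n + 6n^2.
for n in range(1, 9):
    moment = sum(sum(s)**4 for s in itertools.product([-1, 1], repeat=n)) / 2**n
    assert moment == 3 * n**2 - 2 * n
    assert moment <= n + 6 * n**2
```

Dividing by $(n\epsilon)^4$ turns the $O(n^2)$ moment into the summable $O(n^{-2})$ tail bound used above.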
If you feel that Theorem <4> is good enough for 'practical purposes,' and that
all the extra work to whittle a fourth moment assumption down to a first moment
assumption is hardly worth the gain in generality, you might like to contemplate the
following example. How natural, or restrictive, would it be if we were to assume
finite fourth moments?
Example. Let $\{P_\theta : \theta = 0, 1, \ldots, N\}$ be a finite family of distinct probability measures, defined by densities $\{p_\theta\}$ with respect to a measure $\mu$. Suppose observations
$X_1, X_2, \ldots$ are generated independently from $P_0$. The maximum likelihood estimator
$\hat\theta_n(\omega)$ is defined as the value that maximizes $L_n(\theta, \omega) := \prod_{i \le n} p_\theta(X_i(\omega))$. The SLLN
will show that $P\{\hat\theta_n = 0 \text{ eventually}\} = 1$. That is, the maximum likelihood estimator
eventually picks the true value of $\theta$.
For each $\theta \ge 1$, Jensen's inequality and the distinctness of the measures give
$$P_0 \log\bigl(p_\theta(x)/p_0(x)\bigr) < \log P_0\bigl(p_\theta(x)/p_0(x)\bigr) = \log \mu^x\, p_\theta(x)\{p_0(x) \neq 0\} \le 0.$$
By the SLLN (or its extension from Problem [20] if $P_0 \log(p_\theta/p_0) = -\infty$), for almost all $\omega$
there exists a finite $n_0(\omega, \theta)$ for which $0 > n^{-1}\sum_{i \le n} \log\bigl(p_\theta(X_i)/p_0(X_i)\bigr) = n^{-1}\log\bigl(L_n(\theta)/L_n(0)\bigr)$
when $n \ge n_0(\omega, \theta)$. When $n \ge \max_\theta n_0(\omega, \theta)$, we have $\max_{\theta \ge 1} L_n(\theta) < L_n(0)$, in
which case the maximizing $\hat\theta_n$ prefers 0 to each $\theta \ge 1$.
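A simulation makes the "eventually" concrete. A sketch (the Bernoulli family, the sample size, and the seed are my own choices; index 0 plays the role of the true value):

```python
import numpy as np

rng = np.random.default_rng(42)
# Finite family of Bernoulli distributions; index 0 is the truth.
thetas = [0.2, 0.5, 0.8]
x = (rng.uniform(size=5000) < thetas[0]).astype(int)   # draws from P_0

# Running log-likelihoods log L_n(theta) for each candidate theta.
loglik = np.array([np.cumsum(x * np.log(t) + (1 - x) * np.log(1 - t))
                   for t in thetas])
mle_path = loglik.argmax(axis=0)       # index of the maximizer at each n

# The averaged log-likelihood ratios separate (SLLN), so the MLE settles on 0.
assert mle_path[-1] == 0
assert (mle_path[3000:] == 0).all()
```

The per-observation drift of each log-likelihood ratio is exactly the negative quantity $P_0\log(p_\theta/p_0)$ from the display above.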
REMARK.
Notice that the argument would not work if the index set were infinite.
To handle such sets, one typically imposes compactness assumptions to reduce to the
finite case, by means of a much-imitated method originally due to Wald (1949).
2. Independence of sigma-fields

Technically speaking, the best treatment of independence starts with the concept
of independent sub-sigma-fields of $\mathcal{F}$, for a fixed probability space $(\Omega, \mathcal{F}, P)$. This
Section will develop the appropriate definitions and techniques for dealing with
independence of sigma-fields, using the ideas needed for the proof of Theorem <2>
as motivation.
<7>

<9>
$$P(E_1 E_2 \cdots E_n) = (PE_1)(PE_2)\cdots(PE_n)$$

<11>

<12>
<13>
Definition. Measurable maps Xif for i e I, from & into measurable spaces
(Xi,Ai) are said to be independent if the sigmafields that they generate are
independent, that is, if
\iS
for all finite subsets S of the index set I, and all choices of At e At for i S.
Results about independent random variables are usually easy to deduce from
the corresponding results about independent sigma-fields.
<15> Example. Two random variables $X_1$ and $X_2$ for which
$$P\{X_1 \le x_1, X_2 \le x_2\} = P\{X_1 \le x_1\}\,P\{X_2 \le x_2\}
\qquad\text{for all } x_1, x_2 \text{ in } \mathbb{R}$$
are independent, because the collections of sets $\mathcal{E}_i = \{\{X_i \le x\} : x \in \mathbb{R}\}$ are both
stable under finite intersections, and $\sigma(X_i) = \sigma(\mathcal{E}_i)$.
We now have the tools needed to establish Theorem <2>. Write $X_1$
for $f(Z_1, \ldots, Z_k)$ and $X_2$ for $g(Z_{k+1}, \ldots, Z_n)$. Write $\mathcal{G}_i$ for $\sigma(Z_i)$, the sigma-field generated by the random variable $Z_i$. From Corollary <11>, the sigma-fields
$\mathcal{F}_1 := \sigma(\mathcal{G}_1 \cup \ldots \cup \mathcal{G}_k)$ and $\mathcal{F}_2 := \sigma(\mathcal{G}_{k+1} \cup \ldots \cup \mathcal{G}_n)$ are independent. If we can
show that $X_1$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R})$-measurable and $X_2$ is $\mathcal{F}_2\backslash\mathcal{B}(\mathbb{R})$-measurable, then their
independence will follow: we will have the desired factorization for all sets of the
form $\{X_1 \in A_1\}$ and $\{X_2 \in A_2\}$, for Borel sets $A_1$ and $A_2$.

Consider first the measurability property for $X_1$. Temporarily write $Z$ for
$(Z_1, \ldots, Z_k)$, a map from $\Omega$ into $\mathbb{R}^k$. We need to show that the set
$$\{X_1 \in A\} = \{Z \in f^{-1}(A)\}$$
belongs to $\mathcal{F}_1$ for every $A$ in $\mathcal{B}(\mathbb{R})$. The $\mathcal{B}(\mathbb{R}^k)\backslash\mathcal{B}(\mathbb{R})$-measurability of $f$ ensures
that $f^{-1}(A) \in \mathcal{B}(\mathbb{R}^k)$. We therefore need only show that $\{Z \in B\} \in \mathcal{F}_1$ for every $B$
in $\mathcal{B}(\mathbb{R}^k)$, that is, that the map $Z$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R}^k)$-measurable.

As with many measure theoretic problems, it is better to turn the question
around and ask: For how extensive a class of sets $B$ does $\{Z \in B\}$ belong to $\mathcal{F}_1$?
It is very easy to show that the class $\mathcal{B}_0$ of all such $B$ is a sigma-field; so $Z$ is an
$\mathcal{F}_1\backslash\mathcal{B}_0$-measurable function. Moreover, for all choices of $A_i \in \mathcal{B}(\mathbb{R})$, the set
$$D := \{(z_1, \ldots, z_k) \in \mathbb{R}^k : z_i \in A_i \text{ for } i = 1, \ldots, k\}$$
belongs to $\mathcal{B}_0$ because $\{Z \in D\} = \cap_i\{Z_i \in A_i\} \in \mathcal{F}_1$. As shown in Problem [6],
the collection of all such $D$ sets generates the Borel sigma-field $\mathcal{B}(\mathbb{R}^k)$. Thus
$\mathcal{B}(\mathbb{R}^k) \subseteq \mathcal{B}_0$, and $\{Z \in B\} \in \mathcal{F}_1$ for all $B \in \mathcal{B}(\mathbb{R}^k)$. It follows that $X_1$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R})$-measurable. Similarly, $X_2$ is $\mathcal{F}_2\backslash\mathcal{B}(\mathbb{R})$-measurable. The random variables $X_1$ and
$X_2$ are independent, as asserted by Theorem <2>.
The whole argument can be carried over to random elements of more general
spaces if we work with the right sigma-fields.
<16> Definition. Let $\mathfrak{X}_1, \ldots, \mathfrak{X}_n$ be sets equipped with sigma-fields $\mathcal{A}_1, \ldots, \mathcal{A}_n$. The
set of all ordered $n$-tuples $(x_1, \ldots, x_n)$, with $x_i \in \mathfrak{X}_i$ for each $i$, is denoted by
$\mathfrak{X}_1 \times \ldots \times \mathfrak{X}_n$. It is called the product of the $\{\mathfrak{X}_i\}$. A set of the form
$$A_1 \times \ldots \times A_n = \{(x_1, \ldots, x_n) \in \mathfrak{X}_1 \times \ldots \times \mathfrak{X}_n : x_i \in A_i \text{ for each } i\},$$
with each $A_i \in \mathcal{A}_i$, is called a measurable rectangle. The product sigma-field
$\mathcal{A}_1 \otimes \ldots \otimes \mathcal{A}_n$ is defined as the sigma-field generated by all measurable rectangles.

The difference of two measurable rectangles need not be a rectangle, although it does break into finitely
many disjoint pieces. The symbol $\otimes$ in place of $\times$ is intended as a reminder that
$\mathcal{A}_1 \otimes \mathcal{A}_2$ consists of more than the set of all measurable rectangles $A_1 \times A_2$.
3. Construction of measures on a product space
<18>

In general, each marginal has the same total mass as $\Gamma$, which can lead to
bizarre behavior if $\Gamma(\mathfrak{X} \times \mathcal{Y}) = \infty$. For example, if $\mathfrak{X} = \mathcal{Y} = \mathbb{R}$ and $\Gamma$ is Lebesgue
measure on $\mathcal{B}(\mathbb{R}^2)$ then each marginal assigns infinite mass to all except the Lebesgue
negligible subsets of $\mathbb{R}$.
As you will see in Section 4, if $\Gamma$ is a probability measure under which
the coordinate projections define independent random variables then $\Gamma$ can be
reconstructed as a product of its marginal distributions. Without independence, $\Gamma$
is not completely determined by its marginals. Instead we need a whole family
$\Lambda = \{\lambda_x : x \in \mathfrak{X}\}$ of measures on $\mathcal{B}$, together with the marginal $\mu$. The construction
will make sense, and be useful, for more than just probability measures, provided
the members of $\Lambda$ are tied together by a measurability assumption.
Definition. Call a family of measures $\Lambda = \{\lambda_x : x \in \mathfrak{X}\}$ on $\mathcal{B}$ a kernel from $(\mathfrak{X}, \mathcal{A})$
to $(\mathcal{Y}, \mathcal{B})$ if the map $x \mapsto \lambda_x B$ is $\mathcal{A}$-measurable for each $B$ in $\mathcal{B}$. In addition, call $\Lambda$
a probability kernel if $\lambda_x(\mathcal{Y}) = 1$ for each $x$.
REMARK.
Probability kernels are also known by several other names: Markov
kernels, randomizations, conditional distributions, and (particularly when $\mathfrak{X}$ is
interpreted as a parameter space) statistical models.
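A tiny finite example shows the kernel definition and the iterated integral $\mu^x\bigl(\lambda_x^y f(x, y)\bigr)$ at work. A sketch (the uniform marginal and the binomial kernel are my own choices, not the book's):

```python
from math import comb

# Kernel: for x in {1,2,3}, lambda_x = Binomial(x, 1/2) on y in {0,...,x};
# marginal: mu = uniform probability measure on {1,2,3}.
X = [1, 2, 3]
mu = {x: 1.0 / 3 for x in X}
lam = {x: {y: comb(x, y) * 0.5**x for y in range(x + 1)} for x in X}

def iterated(f):
    # (mu ⊗ Λ)(f) = mu^x( lambda_x^y f(x, y) ): integrate in y first, then in x.
    return sum(mu[x] * sum(lam[x][y] * f(x, y) for y in lam[x]) for x in X)

total = iterated(lambda x, y: 1.0)   # a probability kernel integrates to 1
mean_y = iterated(lambda x, y: y)    # here E[Y | X = x] = x/2, so E[Y] = E[X]/2 = 1
assert abs(total - 1.0) < 1e-12
assert abs(mean_y - 1.0) < 1e-12
```

Note that the measure sits on pairs $(x, y)$ even though each $\lambda_x$ lives on a different subset of the $y$-axis, which is exactly the flexibility the kernel construction buys over a plain product.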
For the iterated integral to make sense, we need to establish two key measurability properties: for each product measurable function $f$ and each fixed $x$, the
map $y \mapsto f(x, y)$ should be $\mathcal{B}$-measurable; and the map $x \mapsto \lambda_x^y f(x, y)$ should be
$\mathcal{A}$-measurable. In order to establish conditions under which these two measurability
properties hold, I will make use of the generating class method for $\lambda$-cones, as
developed in Section 2.11.
REMARK.
It is unfortunate that the letter $\lambda$ should have two distinct meanings
in this Chapter: as a measure or member of a family of measures on $\mathcal{Y}$, and as a
prefix suggesting the idea of stability under bounded limits. Sometimes there are not
enough good symbols to go around.
Recall that a $\lambda$-cone on a set $\mathfrak{X}$ is a family $\mathcal{H}^+$ of bounded, nonnegative
functions on $\mathfrak{X}$ with the properties:
(i) $\mathcal{H}^+$ is a cone, that is, if $h_1, h_2 \in \mathcal{H}^+$ and $\alpha_1$ and $\alpha_2$ are nonnegative constants
then $\alpha_1 h_1 + \alpha_2 h_2 \in \mathcal{H}^+$;
(ii) each nonnegative constant function belongs to $\mathcal{H}^+$;
(iii) if $h_1, h_2 \in \mathcal{H}^+$ with $h_1 \ge h_2$ then $h_1 - h_2 \in \mathcal{H}^+$;
(iv) if $\{h_n\} \subseteq \mathcal{H}^+$ and $h_n \uparrow h$ with $h$ bounded, then $h \in \mathcal{H}^+$.
<19> Lemma. For each $f$ in $M(\mathfrak{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B})$, and each fixed $x$ in $\mathfrak{X}$, the function
$y \mapsto f(x, y)$ is $\mathcal{B}$-measurable.

Proof. It is enough to consider the case of bounded nonnegative $f$. The general case
would then follow by splitting $f$ into positive and negative parts, and representing
each part as a pointwise limit of bounded functions, $f^{\pm} = \lim_n (f^{\pm} \wedge n)$.

Write $\mathcal{H}^+$ for the collection of all bounded, nonnegative, $\mathcal{A} \otimes \mathcal{B}$-measurable
functions on $\mathfrak{X} \times \mathcal{Y}$ for which the stated measurability property holds. It is routine
to check the four properties identifying $\mathcal{H}^+$ as a $\lambda$-cone. It contains the class $\mathcal{G}$ of
all indicator functions $g(x, y) := \{x \in A, y \in B\}$ of measurable rectangles, because
$g(x, \cdot)$ is either the zero function or the indicator of the set $B$. The class $\mathcal{G}$ is stable
under pointwise products, and it generates $\mathcal{A} \otimes \mathcal{B}$.
The application of the generating argument to establish the second desired
measurability property can be surprisingly delicate for kernels assigning infinite
measure to $\mathcal{Y}$. A finiteness assumption eliminates all difficulties.
<20> Theorem. Let $\Lambda = \{\lambda_x : x \in \mathfrak{X}\}$ be a kernel from $(\mathfrak{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$ with $\lambda_x \mathcal{Y} < \infty$
for all $x$, and let $\mu$ be a measure on $\mathcal{A}$. Then, for each function $f$ in $M^+(\mathfrak{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B})$,
(i) $y \mapsto f(x, y)$ is $\mathcal{B}$-measurable for each fixed $x$;
(ii) $x \mapsto \lambda_x^y f(x, y)$ is $\mathcal{A}$-measurable;
(iii) the iterated integral $(\mu \otimes \Lambda)(f) := \mu^x\bigl(\lambda_x^y f(x, y)\bigr)$, for $f$ in $M^+(\mathfrak{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B})$,
defines a measure on $\mathcal{A} \otimes \mathcal{B}$.

Proof. Property (i) merely restates the assertion of Lemma <19>. For (ii), consider
the class $\mathcal{H}^+ := \{f \in M^+(\mathfrak{X} \times \mathcal{Y}) : f$ bounded and satisfying (ii)$\}$. Note that
$\lambda_x^y f(x, y) < \infty$ for each $x$ if $f \in \mathcal{H}^+$, so there is no problem with infinite values
when subtracting to show that $\lambda_x^y f_1(x, y) - \lambda_x^y f_2(x, y)$ is $\mathcal{A}$-measurable if $f_1 \ge f_2$ and
both functions belong to $\mathcal{H}^+$. It is just as easy to check the other three properties
needed to show that $\mathcal{H}^+$ is a $\lambda$-cone. All indicator functions of measurable rectangles
belong to $\mathcal{H}^+$, because $\lambda_x\{x \in A, y \in B\} = \{x \in A\}(\lambda_x B)$. Property (ii) therefore
holds for all bounded functions in $M^+(\mathfrak{X} \times \mathcal{Y})$. A Monotone Convergence argument,
$$\lambda_x^y f(x, y) = \lim_{n \to \infty} \lambda_x^y\bigl(n \wedge f(x, y)\bigr),$$
extends (ii) to all of $M^+(\mathfrak{X} \times \mathcal{Y})$. Countable additivity in (iii) follows from two more
appeals to Monotone Convergence: for functions $f_n$ in $M^+$ increasing to $f$,
$$\mu^x\bigl(\lambda_x^y f_n(x, y)\bigr) \uparrow \mu^x\Bigl(\lim_n \lambda_x^y f_n(x, y)\Bigr) = \mu^x\bigl(\lambda_x^y f(x, y)\bigr).$$
$\square$

Infinite kernels

<23>

<24> Exercise. Let $\mu$ be the $N(0,1)$ distribution on the real line, and for each $x$ let
$\lambda_x$ be the $N(\rho x, 1 - \rho^2)$ distribution, where $\rho$ is a constant with $|\rho| < 1$. The measure
$\mu \otimes \Lambda$ then has density
$$\frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\Bigl(-\frac{x^2 - 2\rho x y + y^2}{2(1 - \rho^2)}\Bigr)$$
with respect to Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$.
The probability measure $\mu \otimes \Lambda$ is called the standard bivariate normal distribution
with correlation $\rho$.
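The kernel construction doubles as a sampling recipe: draw $x$ from $\mu$, then $y$ from $\lambda_x$. A sketch (the sample size, $\rho = 0.6$, and the seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.6, 200_000
x = rng.standard_normal(n)                                  # x ~ mu = N(0, 1)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # y ~ lambda_x = N(rho x, 1 - rho^2)

# Both marginals are standard normal, and the correlation is rho.
assert abs(x.std() - 1) < 0.01 and abs(y.std() - 1) < 0.01
assert abs(np.corrcoef(x, y)[0, 1] - rho) < 0.01
```

The two lines of sampling code are a direct transcription of the iterated integral: first integrate out $y$ with $\lambda_x$, then $x$ with $\mu$.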
4. Product measures

A particularly important special case of the constructions from the previous Section
arises when $\Lambda = \{\lambda\}$ for a fixed measure $\lambda$ on $\mathcal{B}$, that is, when $\lambda_x = \lambda$ for all $x$.
Sigma-finiteness of the kernel $\Lambda$ is then equivalent to sigma-finiteness of $\lambda$ in the
usual sense: the space $\mathcal{Y}$ should be a countable union of sets each with finite $\lambda$
measure. I will abbreviate $\mu \otimes \{\lambda\}$ to $\mu \otimes \lambda$, the conventional notation for the product
of the measures $\mu$ and $\lambda$. That is, $\mu \otimes \lambda$ is the measure on $\mathcal{A} \otimes \mathcal{B}$ defined by the
linear functional
<25> $\qquad (\mu \otimes \lambda)(f) := \mu^x\bigl(\lambda^y f(x, y)\bigr)$ for $f \in M^+(\mathfrak{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B})$.
Example. Let $f \in M^+(\mathfrak{X}, \mathcal{A})$ and let $p$ be a positive constant. From the
representation $f(x)^p = m^y\bigl(p y^{p-1}\{f(x) > y > 0\}\bigr)$, with $m$ Lebesgue measure on
$\mathcal{B}(\mathbb{R})$, we get
<26> $\qquad \mu(f^p) = \mu^x m^y\bigl(p y^{p-1}\{f(x) > y > 0\}\bigr)$,
and Tonelli lets us interchange the
order of integration. Abbreviating $\mu^x\{f(x) > y > 0\}$ to $\mu\{f > y\}$ and writing
the Lebesgue integral in traditional notation, we then conclude that
$$\mu(f^p) = \int_0^\infty p y^{p-1}\mu\{f > y\}\,dy.$$
In particular, if $\mu(f^p) < \infty$ then $\mu\{f > y\}$ must decrease to zero faster than $y^{-p}$ as
$y \to \infty$.
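The tail-integral identity is easy to confirm for a simple discrete measure. A sketch (the three-point counting measure and $p = 2$ are my own choices):

```python
import numpy as np

# mu = counting measure on three points where f takes the values 1, 2, 3.
vals = np.array([1.0, 2.0, 3.0])
p = 2

lhs = (vals**p).sum()                             # mu(f^p) = 1 + 4 + 9 = 14

dy = 1e-5
y = np.arange(0, 3, dy) + dy / 2                  # midpoint grid on [0, 3]
tail = (vals[None, :] > y[:, None]).sum(axis=1)   # mu{f > y}: a step function
rhs = (p * y**(p - 1) * tail).sum() * dy          # integral of p y^{p-1} mu{f > y}

assert lhs == 14.0
assert abs(rhs - lhs) < 1e-3
```

The right-hand side integrand is a step function, so the midpoint rule recovers the exact value up to the jumps at $y = 1$ and $y = 2$.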
The definition of product measures, and the Tonelli Theorem, can be extended
to collections $\mu_1, \ldots, \mu_n$ of more than two sigma-finite measures, as for kernels.
<27> Example. Write $C$ for $\int_{-\infty}^{\infty} e^{-x^2}\,dx$. From the representation
$e^{-x^2 - y^2} = m^z\bigl(\{x^2 + y^2 < z\}e^{-z}\bigr)$ we get $C^2 = m^x m^y m^z\bigl(\{x^2 + y^2 < z\}e^{-z}\bigr)$.
The $m^2$ measure of the ball $\{x^2 + y^2 < z\}$, for fixed positive $z$, equals $\pi z$. A change
in the order of integration leaves $m^z\bigl(\pi\{0 < z\}z e^{-z}\bigr) = \pi$ as the value for $C^2$.
The Tonelli Theorem is often invoked to establish integrability of a product
measurable (extended) real-valued function $f$, by showing that at least one of
the iterated integrals $\mu^x\bigl(\lambda^y|f(x, y)|\bigr)$ or $\lambda^y\bigl(\mu^x|f(x, y)|\bigr)$ is finite. In that case, the
Theorem also asserts equality for pairs of iterated integrals for the positive and
negative parts of the function:
Corollary (Fubini Theorem). For sigma-finite measures $\mu$ and $\lambda$, and a product
measurable function $f$ with $(\mu \otimes \lambda)|f| < \infty$,
(i) $y \mapsto f(x, y)$ is $\mathcal{B}$-measurable for each fixed $x$; and $x \mapsto f(x, y)$ is
$\mathcal{A}$-measurable for each fixed $y$;
(ii) the integral $\lambda^y f(x, y)$ is well defined and finite $\mu$ almost everywhere, and
$x \mapsto \lambda^y f(x, y)$ is $\mu$-integrable; the integral $\mu^x f(x, y)$ is well defined and
finite $\lambda$ almost everywhere, and $y \mapsto \mu^x f(x, y)$ is $\lambda$-integrable;
(iii) $(\mu \otimes \lambda)f = \mu^x\bigl(\lambda^y f(x, y)\bigr) = \lambda^y\bigl(\mu^x f(x, y)\bigr)$.
REMARKS.
If we add similar almost sure qualifiers to assertion (i), then the
Fubini Theorem also works for functions that are measurable with respect to $\overline{\mathcal{F}}$, the
$\mu \otimes \lambda$ completion of the product sigma-field. The result is easy to deduce from
the Theorem as stated, because each $\overline{\mathcal{F}}$-measurable function $f$ can be sandwiched
between two product measurable functions, $f_0 \le f \le f_1$, with $f_0 = f_1$ a.e. $[\mu \otimes \lambda]$.
Many authors work with the slightly more general version, stated for the completion,
but then the Tonelli Theorem also needs almost sure qualifiers.

Without integrability of the function $f$, the Fubini Theorem can fail, as shown
by Problem [13]. Strictly speaking, the sigma-finiteness of the measures is not
essential, but little is gained by eliminating it from the assumptions of the Theorem.
As explained in the next Section, under the traditional definition of products for
general measures, integrable functions must almost concentrate on a countable union
of measurable rectangles each with finite product measure.
Suppose $X$ and $Y$ are random variables, defined on the same probability space,
with joint distribution $Q$ and marginal distributions $\mu$ and $\nu$. If $X$ and $Y$ are
independent then, for each measurable rectangle $A \times B$,
$$Q(A \times B) = P\{X \in A, Y \in B\} = P\{X \in A\}\,P\{Y \in B\} = (\mu A)(\nu B).$$
That is, $QD = (\mu \otimes \nu)D$ for each $D$ in the generating class of measurable rectangles.
It follows that $Q = \mu \otimes \nu$, as measures on the product sigma-field.
Conversely, if $Q$ is a product measure then $P\{X \in A, Y \in B\}$ factorizes,
implying that $X$ and $Y$ are independent.

In short: random variables (or random elements of general spaces) are independent if and only if their joint distribution is the product of their marginal
distributions.

Facts about independence can often be deduced from analogous facts about
product measures. In effect, the proofs of the Tonelli/Fubini Theorems are equivalent
to several of the standard generating class and Monotone Convergence tricks used
for independence arguments. Moreover, the results for product measures are stated
for functions, which eliminates a layer of argument needed to extend independence
factorizations from sets to functions.
For example, for $f$ and $g$ in $M^+(\mathbb{R})$,
<29>
$$P\bigl(f(X)g(Y)\bigr) = Q^{xy}\bigl(f(x)g(y)\bigr) \qquad\text{image measure}$$
$$= (\mu \otimes \nu)^{xy}\bigl(f(x)g(y)\bigr) \qquad\text{independence}$$
$$= \nu^y\bigl(\mu^x f(x)g(y)\bigr) \qquad\text{Tonelli}$$
$$= \bigl(Pf(X)\bigr)\bigl(Pg(Y)\bigr) \qquad\text{image measures.}$$
In the third line on the right-hand side the factor $g(y)$ behaves like a constant for
the $\mu$ integral.
<30> Example. The image of $\mu \otimes \nu$, a product of finite measures on $\mathcal{B}(\mathbb{R}^k)$, under
the map $T(x, y) = x + y$ is called the convolution of the two measures, and is
denoted by $\mu \star \nu$ (or $\nu \star \mu$, the order of the factors being irrelevant). If $\mu$ and
$\nu$ are probability measures, and $X$ and $Y$ are independent random vectors with
distributions $\mu$ and $\nu$, then the product measure $\mu \otimes \nu$ gives their joint distribution,
and the convolution $\mu \star \nu$ gives the distribution of the sum $X + Y$.

By the Tonelli Theorem and the definition of image measure,
$$(\mu \star \nu)(f) = \mu^x \nu^y f(x + y) \qquad\text{for } f \in M^+(\mathbb{R}^k).$$
When $\nu$ has a density $\delta(\cdot)$ with respect to $k$-dimensional Lebesgue measure $m$, the
innermost integral on the right-hand side can be written, in traditional notation, as
$\int \delta(y)f(x + y)\,dy$, which invites us to make a change of variable and rewrite it
as $\int \delta(y - x)f(y)\,dy$. More formally, we could invoke the invariance of Lebesgue
measure under translations (Problem [15]) to justify the reexpression. Whichever
way we justify the change, the convolution becomes
$$(\mu \star \nu)(f) = m^y \mu^x\bigl(\delta(y - x)f(y)\bigr) \qquad\text{by Tonelli.}$$
That is, $\mu \star \nu$ has density $\mu^x \delta(y - x)$ with respect to $m$.
REMARK.
The statistical techniques known as density estimation and nonparametric smoothing rely heavily on convolutions.
<3i>
Exercise.
The N(0, a2) distribution on the real line is defined by the density
with respect to Lebesgue measure. Show that the convolution of two normal
distributions is normal: N(0\,af) N($2, o%) = N{0\ + 02, a\ + a%).
SOLUTION: The convolution formula from Example <30> (with μ as the N(θ₁, σ₁²) distribution and δ as the N(θ₂, σ₂²) density) shows that the convolution has density

∫ (2πσ₁²)^{−1/2} exp(−(x − θ₁)²/(2σ₁²)) (2πσ₂²)^{−1/2} exp(−(y − x − θ₂)²/(2σ₂²)) dx.

Make the change of variable z = x − θ₁, and replace y by θ₁ + θ₂ + y (to get a neater expression). The density at θ₁ + θ₂ + y becomes

(2πσ₁σ₂)^{−1} ∫ exp(−z²/(2σ₁²) − (y − z)²/(2σ₂²)) dz.

Complete the square in the exponent, then integrate out z, to identify the last expression as the N(0, σ₁² + σ₂²) density at y, as required.
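The identity N(θ₁, σ₁²) * N(θ₂, σ₂²) = N(θ₁ + θ₂, σ₁² + σ₂²) is easy to check numerically. The sketch below (not from the text; the grid step and truncation limits are arbitrary choices of mine) approximates the convolution integral by a Riemann sum and compares it with the asserted normal density.

```python
import math

def normal_density(x, theta, var):
    # N(theta, var) density with respect to Lebesgue measure
    return math.exp(-(x - theta) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def convolution_density(y, theta1, var1, theta2, var2, step=0.001, half_width=12.0):
    # Riemann-sum approximation to the convolution density:
    # integral over x of mu-density(x) * delta(y - x) dx,
    # with mu = N(theta1, var1) and delta the N(theta2, var2) density.
    n = int(2 * half_width / step)
    total = 0.0
    for i in range(n):
        x = -half_width + i * step
        total += normal_density(x, theta1, var1) * normal_density(y - x, theta2, var2)
    return total * step

# The convolution should agree with the N(theta1 + theta2, var1 + var2) density.
for y in (-1.0, 0.0, 0.7, 2.5):
    approx = convolution_density(y, 0.5, 1.0, -0.2, 2.0)
    exact = normal_density(y, 0.3, 3.0)
    assert abs(approx - exact) < 1e-6
```

The tight tolerance is defensible because the integrand is smooth and decays rapidly, so the Riemann sum converges very quickly.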
<32> Example. Recall from Section 2.2 the definition of the distribution function F_X and its corresponding quantile function q_X for a random variable X:

F_X(x) := P{X ≤ x}   for x ∈ R,      q_X(u) := inf{x : F_X(x) ≥ u}   for 0 < u < 1.

The quantile function is almost an inverse to the distribution function, in the sense that F_X(x) ≥ u if and only if q_X(u) ≤ x. As a random variable on (0, 1) equipped with its Borel sigma-field and Lebesgue measure P, the function X̃ := q_X(u) has the same distribution as X. Similarly, if Y has distribution function F_Y and quantile function q_Y, the random variable Ỹ := q_Y(u) has the same distribution as Y.
Notice that X̃ and Ỹ are both defined on the same (0, 1), even though the original variables need not be defined on the same space. If X and Y do happen to be defined on the same Ω, their joint distribution need not be the same as the joint distribution for X̃ and Ỹ. In fact, the new variables are closer to each other, in various senses. For example, several applications of Tonelli will show that P|X − Y|^p ≥ P|X̃ − Ỹ|^p for each p ≥ 1.

As a first step, calculate an inequality for tail probabilities:

<33>      P{X > x, Y > y} ≤ min (P{X > x}, P{Y > y})
                          = 1 − F_X(x) ∨ F_Y(y)
                          = ∫₀¹ {u > F_X(x) ∨ F_Y(y)} du
                          = ∫₀¹ {x < q_X(u), y < q_Y(u)} du
                          = P{X̃ > x, Ỹ > y}.

Because X̃ has the same distribution as X, and Ỹ the same as Y, it follows that

<34>      P ({X > x} + {Y > y} − 2{X > x, Y > y}) ≥ P ({X̃ > x} + {Ỹ > y} − 2{X̃ > x, Ỹ > y}).
The integrand on the left-hand side of <34> equals |{X(ω) > x} − {Y(ω) > y}|, a nonnegative function just begging for an application of Tonelli. For each real constant s, put y = x + s then integrate over x with respect to Lebesgue measure m on B(R). Tonelli lets us interchange the order of integration, leaving

P^ω m^x |{X(ω) > x} − {Y(ω) > x + s}| = P ((X(ω) − Y(ω) + s)⁺ + (Y(ω) − s − X(ω))⁺) = P |X − Y + s|,

with the analogous expression for X̃ and Ỹ on the other side. For each nonnegative t, invoke the inequality for s = t then s = −t, then add:

<35>      P (|X − Y + t| + |X − Y − t|) ≥ P (|X̃ − Ỹ + t| + |X̃ − Ỹ − t|)      for all t ≥ 0.

The two sides equal 2P max(|X − Y|, t) and 2P max(|X̃ − Ỹ|, t); subtracting 2t from both gives P (|X − Y| − t)⁺ ≥ P (|X̃ − Ỹ| − t)⁺. The identity

D^p = p(p − 1) ∫₀^∞ t^{p−2} (D − t)⁺ dt      for D ≥ 0 and p > 1,

with D equal to |X − Y| or |X̃ − Ỹ|, followed by yet another appeal to Tonelli, then gives P|X − Y|^p ≥ P|X̃ − Ỹ|^p. (For p = 1, take t = 0 in <35>.)
See Problem [17] for the analogous inequality, PΨ(|X − Y|) ≥ PΨ(|X̃ − Ỹ|), for every convex, increasing function Ψ on R⁺.
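A Monte Carlo sketch of the inequality P|X − Y|^p ≥ P|X̃ − Ỹ|^p (not part of the text; the distributions, sample size, and seed are arbitrary choices of mine): build X̃ and Ỹ from the same uniform variable via their quantile functions, and compare with an independent coupling.

```python
import math
import random
from statistics import NormalDist

random.seed(7)

q_X = NormalDist(mu=0.0, sigma=1.0).inv_cdf   # quantile function of X ~ N(0, 1)
def q_Y(u):
    # quantile function of Y ~ Exp(1)
    return -math.log(1.0 - u)

n, p = 100_000, 2

# Independent coupling: X and Y driven by independent uniforms.
indep = sum(abs(q_X(random.random()) - q_Y(random.random())) ** p
            for _ in range(n)) / n

# Quantile coupling: X-tilde and Y-tilde driven by the SAME uniform.
coupled = sum(abs(q_X(u) - q_Y(u)) ** p
              for u in (random.random() for _ in range(n))) / n

assert coupled < indep   # P|X - Y|^p >= P|X-tilde - Y-tilde|^p
```

The gap is substantial here: both quantile functions are increasing, so driving them by a common uniform makes the two variables move together.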
*5.  Beyond sigma-finiteness

Even when the kernel Λ is not sigma-finite, it is possible to define μ ⊗ Λ as a measure on the whole of A ⊗ B. Unfortunately, the obvious method, by means of an iterated integral, can fail for measurability reasons. Even when λ_x does not depend on x, the map x ↦ λ^y f(x, y) need not be measurable for all f in M⁺(X × Y, A ⊗ B), as shown by the following counterexample.
<36>
Example. Let X equal [0, 1], equipped with its Borel sigma-field A, and Y also equal [0, 1], but equipped with the sigma-field B of all subsets of [0, 1]. Let E be a subset of [0, 1] that is not A-measurable. Let λ be the counting measure on E, interpreted as a measure on B. That is, λB equals the cardinality of E ∩ B, for every subset B of [0, 1]. The Borel measurable subset D := {x = y} of [0, 1]² also belongs to the larger sigma-field A ⊗ B. The function x ↦ λ^y{(x, y) ∈ D} equals 1 when x ∈ E, and is 0 otherwise; it is not A-measurable.
In general, the difficulty lies in the possible nonmeasurability of the function x ↦ λ^y f(x, y). When f equals the indicator function of a measurable rectangle A × B the difficulty disappears, because x ↦ λ_x B is assumed to be A-measurable for every B in B. If μ^x ({x ∈ A} λ_x B) < ∞ then A₀ := {x ∈ A : λ_x B = ∞} has zero μ measure. The method of Theorem <20> defines a measure on the product sigma-field of (A\A₀) × B, by means of an iterated integral. For the remainder of the Section, I will use the letter Γ, instead of the symbol μ ⊗ Λ, to avoid any inadvertent suggestion of an iterated integral.

The same definition also works if f is required to be zero only outside A × B, provided we ignore possible nonmeasurability of x ↦ λ^y f(x, y) on a μ-negligible set. More formally, we could assume μ to be complete, so that the behavior for x in A₀ has no effect on the measurability. The method of Corollary <22> can then be applied to extend Γ to a measure on a larger collection of subsets.
<37>
Definition. Write % for the collection of all A *Bmeasurable sets D for which
there exist measurable rectangles with D c U/eNAj x Bt and ixx ({JC e Ai)kxBi) < oo
for each i. Denote by M+(X x y, # ) , or just M + (#), the collection of functions f
in M+(X x y,.A B) for which {/ / 0} e X.
The collection R is stable under countable unions, countable intersections, and differences (that is, if R₁, R₂ ∈ R then R₁\R₂ ∈ R); that is, R is a sigma-ring. It need not be stable under complements, because the whole product space X × Y might not have a covering by a sequence of measurable rectangles with the desired property; that is, R need not be a sigma-field.

The definition of a countably additive measure on a sigma-ring is essentially the same as the definition for a sigma-field. There is a one-to-one correspondence between measures defined on R and increasing linear functionals on M⁺(R) with the Monotone Convergence property. In particular, the iterated integral μ^x (λ^y f(x, y)) defines an increasing linear functional on M⁺(R), corresponding to a measure on R.
For product measurable sets not in R there is no unique way to proceed. A minor extension of the classical approach for product measures (using approximation from above, as in Section 12.4 of Royden 1968) suggests we should define Γ(D) as the infimum of Σ_{i∈N} μ^x ({x ∈ A_i} λ_x B_i), taken over all countable coverings of D by measurable rectangles, ∪_{i∈N} A_i × B_i. This definition is equivalent to putting Γ(D) = ∞ when D ∈ (A ⊗ B)\R. Consequently, we would have Γ(f) = ∞ for each nonnegative, product measurable f not in M⁺(R). For the particular case where λ_x ≡ λ, if f is product measurable and Γ(|f|) < ∞, then {f ≠ 0} ⊆ ∪_{i∈N} A_i × B_i, for some countable collection of measurable rectangles with Σ_{i∈N} (μA_i)(λB_i) < ∞. For each i in the set J_μ := {i : μA_i = ∞} we must have λB_i = 0, which implies that the set N_λ := ∪_{i∈J_μ} B_i is λ-negligible. Similarly, the set N_μ := ∪_{i∈J_λ} A_i, where J_λ := {i : λB_i = ∞}, is μ-negligible. When restricted to X₀ := ∪_{i∉J_λ} A_i, the measure μ is sigma-finite; and when restricted to Y₀ := ∪_{i∉J_μ} B_i, the measure λ is sigma-finite. Corollary <28> gives the assertions of the Fubini Theorem for the restriction of f to X₀ × Y₀. For trivial reasons, the Fubini Theorem also holds for the restriction of f to N_μ × Y or X × N_λ.
REMARK. In effect, with the classical approach, the general Fubini Theorem is made to hold by forcing integrable functions to concentrate on regions where both measures are sigma-finite. The general form of the Fubini Theorem is really just a minor extension of the theorem for sigma-finite measures. I see some virtue in being content with the definition of the general μ ⊗ Λ as a measure on the sigma-ring R.
6.
<38> Maximal Inequality. Let Z₁, ..., Z_N be independent random variables, and S_i := Z₁ + ... + Z_i. For positive constants ε₁ and ε₂, put β := min_{i≤N} P{|S_N − S_i| ≤ ε₂}. Then

β P{max_{i≤N} |S_i| > ε₁ + ε₂} ≤ P{|S_N| > ε₁}.
The randomness of τ in the proof below means that S_N − S_τ need not behave like a typical increment S_N − S_i. We avoid that complication by arguing separately for each event {τ = i}.

There are also conditional probability versions of the inequality, relaxing the independence assumption.
Proof. Define S_i := Z₁ + ... + Z_i and T_i := |S_N − S_i|. Define a random variable τ by letting τ equal the smallest i for which |S_i| > ε₁ + ε₂, with τ := N if |S_i| ≤ ε₁ + ε₂ for every i. The left-hand side of the asserted inequality equals

β Σ_{i≤N} P{τ = i, |S_i| > ε₁ + ε₂}                            disjoint events
   ≤ Σ_{i≤N} P{T_i ≤ ε₂} P{τ = i, |S_i| > ε₁ + ε₂}             definition of β.

The event {τ = i, |S_i| > ε₁ + ε₂} is a function of Z₁, ..., Z_i; the event {T_i ≤ ε₂} is a function of Z_{i+1}, ..., Z_N. By independence, the product of the probabilities in the last displayed line is equal to the probability of the intersection. The sum is less than

Σ_{i≤N} P{τ = i, |S_i| > ε₁ + ε₂, T_i ≤ ε₂}.

If |S_i| > ε₁ + ε₂ and T_i ≤ ε₂ then certainly |S_N| > ε₁. The sum is less than

Σ_{i≤N} P{τ = i, |S_N| > ε₁} = P{|S_N| > ε₁}.   □
REMARK. Notice how the disjointness of the events {τ = i} for i = 1, ..., N was used twice: first to break the maximal event into pieces and then to reassemble the bounds on the pieces. Also, you should ponder the choice of τ value for the case where |S_i| ≤ ε₁ + ε₂ for all i. Where would the proof fail if I had chosen τ = 1 instead of τ = N for that case?
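A Monte Carlo sketch of the Maximal Inequality in its Ottaviani form, with β := min_{i≤N} P{|S_N − S_i| ≤ ε₂} and the bound β P{max_{i≤N} |S_i| > ε₁ + ε₂} ≤ P{|S_N| > ε₁} (the random-sign walk, sample size, and constants below are my own choices):

```python
import random

random.seed(2)
N, trials = 20, 50_000
eps1, eps2 = 1.0, 2.0

max_exceed = 0            # counts of {max_i |S_i| > eps1 + eps2}
final_exceed = 0          # counts of {|S_N| > eps1}
incr_ok = [0] * (N + 1)   # counts of {|S_N - S_i| <= eps2}, i = 0..N

for _ in range(trials):
    s, partial = 0.0, []
    for _ in range(N):
        s += random.choice((-1.0, 1.0))   # independent +-1 summands
        partial.append(s)
    if max(abs(v) for v in partial) > eps1 + eps2:
        max_exceed += 1
    if abs(partial[-1]) > eps1:
        final_exceed += 1
    for i in range(N + 1):
        s_i = partial[i - 1] if i >= 1 else 0.0
        if abs(partial[-1] - s_i) <= eps2:
            incr_ok[i] += 1

beta = min(incr_ok) / trials
lhs = beta * (max_exceed / trials)
rhs = final_exceed / trials
assert lhs <= rhs   # the Ottaviani-type bound, with room to spare here
```

With these constants the two sides are far apart, so Monte Carlo error does not threaten the comparison.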
The Maximal Inequality will control the behavior of the averages over blocks of successive partial sums, chosen short enough that the β constants stay bounded away from zero. The Borel–Cantelli argument can then be applied along the subsequence corresponding to the endpoints of the blocks. The longer the blocks, the sparser the subsequence of endpoints, and the easier it becomes to establish convergence for the sum of tail probabilities along the subsequence. With block lengths increasing geometrically fast, under mild second moment conditions we get both acceptable β values and tail probabilities decreasing rapidly enough for the Borel–Cantelli argument.
<39> Theorem. Let Z₁, Z₂, ... be independent random variables with PZ_i = 0 for each i and Σ_{i=1}^∞ var(Z_i)/i² < ∞. Then (Z₁ + ... + Z_n)/n → 0 almost surely.

Proof. Collect the partial sums S_n := Z₁ + ... + Z_n into blocks B_k := {n : 2^{k−1} < n ≤ 2^k}, for k = 1, 2, .... For fixed ε > 0, the Maximal Inequality <38>, with the β constants bounded away from zero via Tchebychev's inequality, bounds each probability P{max_{n∈B_k} |S_n|/n > ε} by a constant multiple of ε⁻² 4^{−k} Σ_{i≤2^k} var(Z_i). Sum over k, then interchange the order of summation:

Σ_{k=1}^∞ 4^{−k} Σ_{i=1}^{2^k} var(Z_i) = Σ_{i=1}^∞ var(Z_i) Σ_{k≥k(i)} 4^{−k}.

The innermost sum on the right-hand side is just the tail of a geometric series, started at k(i), the smallest k for which i ≤ 2^k. The sum equals (4/3) 4^{−k(i)} ≤ (4/3) i^{−2}, leaving a bound that is a constant multiple of Σ_i var(Z_i)/i², which is finite by assumption. Borel–Cantelli then gives

max_{n∈B_k} |S_n|/n → 0 almost surely as k → ∞,

which is just another way of writing the asserted convergence S_n/n → 0 almost surely. □
involving only a typical X₁. Actually slightly less would suffice. We only need a way to bound various integrals by quantities that do not change with n.

Proof of Theorem <3>. With no loss of generality, suppose μ = 0 (or replace X_i by the centered variable X_i − μ), so that the Theorem becomes: for independent, identically distributed, integrable random variables X₁, X₂, ... with PX_i = 0,

(X₁ + ... + X_n)/n → 0      almost surely.

Break each X_i into two contributions: a central part, Y_i := X_i{|X_i| ≤ i}, which, when recentered at its expectation μ_i := PX_i{|X_i| ≤ i}, becomes a candidate for Theorem <39>; and a tail part, X_i − Y_i = X_i{|X_i| > i}, which can be handled by relatively crude summability arguments. Notice that, after truncation, the summands no longer have identical distributions.
REMARK. The choice of i as the ith truncation level is determined by the available moment information. More generally, finiteness of a pth moment would allow us to truncate at i^{1/p} for almost sure convergence arguments.
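The tail parts are manageable because, for integrable X₁, Σ_{i≥1} P{|X₁| > i} ≤ P|X₁| < ∞ (compare the sum with the integral ∫₀^∞ P{|X₁| > t} dt). A quick numerical check for the Exp(1) distribution, where P{X > i} = e^{−i} and PX = 1 (the choice of distribution is mine):

```python
import math

# For X ~ Exp(1): P{X > i} = exp(-i) and PX = 1.
tail_sum = sum(math.exp(-i) for i in range(1, 200))

assert tail_sum <= 1.0                               # sum_i P{X > i} <= PX
assert abs(tail_sum - 1.0 / (math.e - 1.0)) < 1e-12  # geometric series value 1/(e - 1)
```

The truncation of the sum at i = 199 is harmless: the omitted terms total less than e^{−199}.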
First dispose of the tail parts. Integrability of X₁ gives Σ_{i=1}^∞ P{|X_i| > i} = Σ_{i=1}^∞ P{|X₁| > i} ≤ P|X₁| < ∞, so that, by Borel–Cantelli, for almost all ω only finitely many of the summands X_i(ω){|X_i(ω)| > i} are nonzero. For such ω,

<42>      (1/n) |Σ_{i≤n} X_i(ω) − Σ_{i≤n} Y_i(ω)| ≤ (1/n) Σ_{i≤n} |X_i(ω)| {|X_i(ω)| > i} → 0.

By Dominated Convergence, μ_i = PX₁{|X₁| ≤ i} → PX₁ = 0, and hence

<43>      (1/n) Σ_{i≤n} μ_i → 0.

For the central parts, inequality <44> (Problem [25]) asserts, for some constant C,

<44>      x² Σ_{i=1}^∞ {|x| ≤ i} i^{−2} ≤ C|x|      for all real x.

From this bound and the inequality P(Y_i − μ_i)² ≤ PY_i² = PX_i²{|X_i| ≤ i}, deduce that

Σ_{i=1}^∞ P(Y_i − μ_i)²/i² ≤ P (X₁² Σ_{i=1}^∞ {|X₁| ≤ i} i^{−2}) ≤ C P|X₁| < ∞.

Theorem <39>, applied to the recentered summands Y_i − μ_i, then gives

<45>      (1/n) Σ_{i≤n} (Y_i − μ_i) → 0      almost surely.

The asserted SLLN for the identically distributed {X_i} then follows from the results <42> and <43> and <45>.
*8.

<46>      P_{n+1} (F × X_{n+1}) = P_n F      for all F in F_n,

or, equivalently,

P_{n+1} g(x₁, ..., x_n) = P_n g(x₁, ..., x_n)      for all g ∈ M⁺(Ω_n, F_n).
<47>      P h := P_n g_n      if h(ω) = g_n(ω|n).

The functional is well defined because condition <46> ensures that P_n g_n = P_{n+1} g_{n+1} when h(ω) = g_{n+1}(ω|n + 1) = g_n(ω|n), that is, when h depends on only the first n coordinates of ω. Some authors would call P a finitely additive probability.

From Appendix A.6, the functional P has an extension to a (countably additive) probability measure on F if and only if it is σ-smooth at 0, meaning that Ph_i ↓ 0 for every sequence {h_i} of such functions for which 1 ≥ h_i ↓ 0 pointwise.
REMARK. The σ-smoothness property requires more than countable additivity of each individual P_n. Indeed, Andersen & Jessen (1948) gave an example of weird spaces {X_i} for which P is not countably additive (see Problem [29] for details). The failure occurs because a decreasing sequence {h_i} need not depend on only a fixed, finite set of coordinates. If there were such a finite set, that is, if h_i(ω) = g_i(ω|n) for a fixed n, then we would have Ph_i = P_n g_i → 0 if h_i ↓ 0. In general, h_i might depend on the first n_i coordinates, with n_i → ∞, precluding the reduction to a fixed n.
In the literature, sufficient conditions for σ-smoothness generally take one of two forms. The first, due to Daniell (1919), and later rediscovered by Kolmogorov (1933, Section III.4) in slightly greater generality, imposes topological assumptions on the X_i coordinate spaces and a tightness assumption on the P_n's. Probabilists who have a distaste for topological solutions to probability puzzles sometimes find the Daniell/Kolmogorov conditions unpalatable, in general, even though those conditions hold automatically for products of real lines.
REMARK. Remember that a probability measure Q on a topological space X is called tight if, to each ε > 0, there is a compact subset K_ε of X such that QK_ε^c < ε. If X = X₁ × ... × X_n then Q is tight if and only if each of its marginals is tight.
P_n := P₁ ⊗ Λ₂ ⊗ Λ₃ ⊗ ... ⊗ Λ_n.
<50>      P_n g_n(x₁, x₂, ..., x_n) ≥ ε      for all n.

The product Λ₂ ⊗ ... ⊗ Λ_n defines a probability kernel from X₁ to X₂ × ... × X_n. Define functions f_n on X₁ by

f_n(x₁) := (Λ₂,x₁ ⊗ Λ₃ ⊗ ... ⊗ Λ_n) g_n(x₁, x₂, ..., x_n)      for n ≥ 2,

with f₁(x₁) := g₁(x₁). The f_n decrease pointwise, and P₁ f_n = P_n g_n ≥ ε for every n, by <50>; Dominated Convergence then produces an x̄₁ for which

<51>      f_n(x̄₁) ≥ ε      for all n;      in particular, g₁(x̄₁) ≥ ε.

Hold x̄₁ fixed for the rest of the argument. The defining property of x̄₁ becomes

(Λ₂,x̄₁ ⊗ Λ₃ ⊗ ... ⊗ Λ_n) g_n(x̄₁, x₂, ..., x_n) ≥ ε      for n ≥ 2.

Notice the similarity to the final inequality in <50>, with the role of P₁ taken over by Λ₂,x̄₁. Repeating the argument, we find an x̄₂ for which

(Λ₃,(x̄₁,x̄₂) ⊗ Λ₄ ⊗ ... ⊗ Λ_n) g_n(x̄₁, x̄₂, x₃, ..., x_n) ≥ ε      for n ≥ 3,
and so on. The resulting point ω̄ := (x̄₁, x̄₂, ...) of Ω has g_n(ω̄|n) ≥ ε for every n, contradicting the assumption that the h_i decrease pointwise to zero. The functional P is therefore σ-smooth at 0. □

9.  Problems
[1] Let B₁, B₂, ... be independent events for which Σ_i PB_i = ∞. Show that P{B_i infinitely often} = 1 by following these steps.

(i) Show that P(B₁^c B₂^c ... B_n^c) ≤ exp(−Σ_{i≤n} PB_i) → 0.

(ii) Deduce that Π_{i=1}^∞ B_i^c = 0 almost surely.

(iii) Deduce that Σ_{i=1}^∞ B_i ≥ 1 almost surely.

(iv) Deduce that Σ_{i≥m} B_i ≥ 1 almost surely, for each finite m. Hint: The events B_m, B_{m+1}, ... are independent.

(v) Complete the proof.

Remark: This result is a converse to the Borel–Cantelli Lemma discussed in Section 2.6. A stronger converse was established in the Problems to Chapter 2.
[2]

[3] Show that every continuous real function on a topological space X is B(X)\B(R)-measurable. Hint: What do you know about inverse images of open sets?
[4] Let X be a topological space. Say that its topology is countably generated if there exists a countable class of open sets G₀ such that G = ∪{G₀ : G ⊇ G₀ ∈ G₀} for each open set G. Show that such a G₀ generates the Borel sigma-field on X.
[5] Let (X, d) be a separable metric space with a countable dense subset X₀. Show that its topology is countably generated. [Lindelöf's theorem.] Hint: Let G₀ be the countable class of all open balls of rational radius centered at a point of X₀. If x ∈ G, find a rational r such that G contains the ball of radius 2r centered at x, then find a point of X₀ lying within a distance r of x.
[6] Let X and Y be topological spaces equipped with their Borel sigma-fields B(X) and B(Y). Equip X × Y with the product topology and its Borel sigma-field B(X × Y). (The open sets in the product space are, by definition, all possible unions of sets G × H, with G open in X and H open in Y.)

(i) Show that B(X) ⊗ B(Y) ⊆ B(X × Y).

(ii) If both X and Y have countably generated topologies, prove equality of the two sigma-fields on the product space.

(iii) Deduce that B(Rⁿ) = B(R^k) ⊗ B(R^{n−k}) for 1 ≤ k < n.
[7] Let X be a set with cardinality greater than 2^ℵ₀, the cardinality of the set of all sequences of 0's and 1's. Equip X with the discrete topology, for which all subsets are open.

(i) Show that the Borel sigma-field B(X × X) consists of all subsets of X × X.

(ii) Show that B(X) ⊗ B(X) equals

∪{σ(E) : E a countable class of measurable rectangles}.

(iii) For a given countable class E of subsets of X, define an equivalence relation x ~ y if and only if {x ∈ C} = {y ∈ C} for all C in E. Show that there are at most 2^ℵ₀ equivalence classes. Deduce that there exists at least one pair of distinct points x₀ and y₀ such that x₀ ~ y₀. Deduce that σ(E) cannot separate the pair (x₀, y₀).

(iv) Show that the diagonal Δ := {(x, y) ∈ X × X : x = y} cannot belong to the product sigma-field B(X) ⊗ B(X), which is therefore a proper sub-sigma-field of B(X × X). Hint: Suppose Δ ∈ σ(E) for some countable class E of measurable rectangles. Find a pair of distinct points x₀, y₀ such that no member of E, or of σ(E), can extract a proper subset of F := {(x₀, x₀), (y₀, x₀), (x₀, y₀), (y₀, y₀)}, but FΔ = {(x₀, x₀), (y₀, y₀)}.
[8]

[9] Let Z̄_n := (Z₁ + ... + Z_n)/n, for a sequence {Z_i} of independent random variables.

(i) Use the Kolmogorov zero-one law (Example <12>) to show that the set {lim sup Z̄_n > r} is a tail event for each constant r, and hence it has probability either zero or one. Deduce that lim sup Z̄_n = c₀ almost surely, for some constant c₀ (possibly ±∞).

(ii) If Z̄_n converges to a finite limit (possibly random) at each ω in a set A with PA > 0, show that in fact there must exist a finite constant c₀ for which Z̄_n → c₀ almost surely.
[10]

[11] For sigma-finite measures μ and λ, on sigma-fields A and B, show that the only measure Γ on A ⊗ B for which Γ(A × B) = (μA)(λB) for all A ∈ A and B ∈ B is the product measure μ ⊗ λ. Hint: Use the π-λ theorem when both measures are finite. Extend to the sigma-finite case by breaking the underlying product space into a countable union of measurable rectangles, A_i × B_j, with μA_i < ∞ and λB_j < ∞, for all i, j.
[12] Let m denote Lebesgue measure and λ denote counting measure (that is, λA equals the number of points in A, possibly infinite), both on B(R). Let f(x, y) := {x = y}. Show that m^x λ^y f(x, y) = ∞ but λ^y m^x f(x, y) = 0. Why does the Tonelli Theorem not apply?
[13] Let λ and μ both denote counting measure on the sigma-field of all subsets of N. Let f(x, y) := {y = x} − {y = x + 1}. Show that μ^x λ^y f(x, y) = 0 but λ^y μ^x f(x, y) = 1. Why does the Fubini Theorem not apply?
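The asymmetry in Problem [13] is easy to see in code (a sketch of mine, not from the text). Both inner sums have only finitely many nonzero terms, so they can be evaluated exactly, and the two orders of summation stabilize at different values:

```python
def f(x, y):
    # f(x, y) = {y == x} - {y == x + 1} on N x N
    return (1 if y == x else 0) - (1 if y == x + 1 else 0)

def sum_over_y(x):
    # lambda^y f(x, y): only y = x and y = x + 1 contribute
    return sum(f(x, y) for y in (x, x + 1))

def sum_over_x(y):
    # mu^x f(x, y): only x = y and x = y - 1 can contribute
    return sum(f(x, y) for x in range(max(1, y - 1), y + 1))

M = 1000   # the outer sums are already stable at this cutoff
assert sum(sum_over_y(x) for x in range(1, M + 1)) == 0   # mu^x lambda^y f = 0
assert sum(sum_over_x(y) for y in range(1, M + 1)) == 1   # lambda^y mu^x f = 1
```

The whole discrepancy comes from the single boundary point y = 1, where the term {y = x + 1} has no partner in N.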
[14] For nonnegative random variables Z₁, Z₂, ..., Z_m, show that P(Z₁Z₂...Z_m) is no greater than ∫₀¹ q_{Z₁}(u) q_{Z₂}(u) ... q_{Z_m}(u) du, with q_{Z_i} the quantile function for Z_i.
[15] Let m denote Lebesgue measure on R^k. For each fixed a in R^k, define m_a f := m^x f(x + a). Check that m_a is a measure on B(R^k) for which m_a[b, c] = m[b, c], for each rectangle [b, c] = X_i [b_i, c_i]. Deduce that m_a = m. That is, m is invariant under translation: m_a f = ∫ f(x + a) dx = ∫ f(x) dx = mf.
[16] Let F and G be the distribution functions of finite measures on B(R). Show that

∫ G(t−) dF(t) = F(∞)G(∞) − ∫ F(t) dG(t).

Hint: Read Section 3.4.
[17] In the notation of Example <32>, show that PΨ(|X − Y|) ≥ PΨ(|X̃ − Ỹ|), for every convex, increasing function Ψ on R⁺. Hint: Use the fact that Ψ(x) = Ψ(0) + ∫₀^x H(t) dt, where H is an increasing right-continuous function on R⁺. Represent H(t) as μ(0, t] for some measure μ. Consider P^ω m^s μ^t {0 < t < s < Δ(ω)}, with Δ := |X − Y|, remembering inequality <35>.
[18] Let P = ⊗_{i≤n} P_i and Q = ⊗_{i≤n} Q_i, where P_i and Q_i are defined on the same sigma-field A_i. Show that

H²(P, Q) = 2 − 2 Π_{i≤n} (1 − ½H²(P_i, Q_i)) ≤ Σ_{i≤n} H²(P_i, Q_i).
[20]

[21]

[22] (Kronecker's lemma) Let {b_i} and {x_i} be sequences of real numbers for which 0 < b₁ ≤ b₂ ≤ ... → ∞ and Σ_{i=1}^∞ x_i is convergent (with a finite limit). Show that Σ_{i=1}^n b_i x_i / b_n → 0 as n → ∞, by following these steps.

(i) Sum by parts: Σ_{i=1}^n b_i x_i = b₁C₁ − b_n C_{n+1} + Σ_{i=2}^n (b_i − b_{i−1}) C_i, where C_m := Σ_{i=m}^∞ x_i.

(ii) Given ε > 0, find an m such that |Σ_{i=p}^n x_i| < ε whenever m ≤ p ≤ n.

(iii) With m as in (ii), and n > m, show that |Σ_{i=1}^n b_i x_i| ≤ |b₁C₁| + b_n|C_{n+1}| + Σ_{i=2}^m (b_i − b_{i−1})|C_i| + ε(b_n − b_m).

(iv) Deduce the asserted convergence.
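A numerical illustration of Kronecker's lemma (my own example, not from the text): take b_i = i and x_i = (−1)^i/√i, whose series converges by the alternating series test, and watch the weighted average shrink.

```python
import math

n = 200_000
weighted = 0.0
for i in range(1, n + 1):
    b_i = float(i)
    x_i = (-1.0) ** i / math.sqrt(i)
    weighted += b_i * x_i      # partial sum of b_i * x_i = sum of (-1)^i sqrt(i)

# Kronecker's lemma: (1/b_n) * sum_{i<=n} b_i x_i -> 0
assert abs(weighted / n) < 0.01
```

Here the weighted sum itself grows like √n, but dividing by b_n = n still drives the average to zero, exactly as the lemma predicts.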
[23] Suppose Y₁, Y₂, ... are independent, identically distributed random variables with P|Y₁|^α < ∞, for some constant 0 < α < 1. Define Z_i := Y_i{|Y_i|^α ≤ i}.

(i) Show that Σ_{i=1}^∞ PZ_i²/i^{2/α} < ∞ and Σ_{i=1}^∞ P|Z_i|/i^{1/α} < ∞.

(ii) Deduce that Σ_{i=1}^∞ Z_i/i^{1/α} is convergent almost surely.

(iii) Show that Σ_i P{|Y_i|^α > i} < ∞.

(iv) Deduce that Σ_{i=1}^∞ (Y_i − Z_i)/i^{1/α} is convergent almost surely.

(v) Deduce via Kronecker's Lemma (Problem [22]) that n^{−1/α}(Y₁ + ... + Y_n) converges to 0 almost surely.
[24]

[25] Establish inequality <44>. Hint: First show that Σ_{i≥k} i^{−2} decreases like k^{−1}, for k = 2, 3, ..., by comparing with ∫_{k−1}^∞ y^{−2} dy. Then convert to a continuous range. It helps to consider the case |x| ≤ 2 separately, to avoid inadvertent division by zero.
[26] (Etemadi 1981) Let X₁, X₂, ... be independent, identically distributed, integrable random variables, with common expected value μ. Give another proof of the SLLN by following these steps. As in Section 7, define Y_i := X_i{|X_i| ≤ i} and μ_i := PY_i. Let S_n := Σ_{i≤n} X_i and T_n := Σ_{i≤n} Y_i.

(i) Argue that there is no loss of generality in assuming X_i ≥ 0. Hint: Consider positive and negative parts.

(ii) Show that PT_n/n → μ as n → ∞ and (S_n − T_n)/n → 0 almost surely.

(iii) Show that var(T_n) ≤ Σ_{i≤n} PX₁²{X₁ ≤ i}.

(iv) For a fixed ρ > 1, let {k_n} be an increasing sequence of positive integers such that k_n/ρⁿ → 1. Show that Σ_n {i ≤ k_n}/k_n² ≤ C/i² for each positive integer i, for some constant C.

(v) Use parts (iii) and (iv), and Problem [25], to show that

Σ_n P{|T_{k_n} − PT_{k_n}| > εk_n} ≤ Cε⁻² Σ_i PX₁²{X₁ ≤ i}/i² < ∞.

Deduce via the Borel–Cantelli lemma that (T_{k_n} − PT_{k_n})/k_n → 0 almost surely.

(vi) Deduce that S_{k_n}/k_n → μ almost surely, as n → ∞.

(vii) For each ρ′ > ρ, show that S_{k_n}/(ρ′k_n) ≤ S_m/m ≤ ρ′ S_{k_{n+1}}/k_{n+1} for k_n ≤ m ≤ k_{n+1}, when n is large enough.

(viii) Deduce that lim sup S_m/m and lim inf S_m/m both lie between μ/ρ′ and μρ′, with probability one.

(ix) Cast out a sequence of negligible sets as ρ decreases to 1 to deduce that S_m/m → μ almost surely.
[27] Let X be a subset of Ω with outer measure 1 under a probability measure P on a sigma-field F, and let A := {FX : F ∈ F} denote the trace sigma-field on X.

(iv) Show that Q(FX) := PF is a well defined (by (i)) probability measure on A.

(v) Show that Q f̃ = Pf, for each f ∈ M⁺(Ω, F) whose restriction to X equals f̃.
[28]

[29] Let m denote Lebesgue measure on [0, 1), and let {X_n : n ∈ N} be a decreasing sequence of (nonmeasurable) subsets with ∩_n X_n = ∅ but with each X_n having outer measure 1, as in Problem [28]. Write A_n for the trace of B := B[0, 1) on X_n, as in Problem [27]. Write Ω_n for X₁ × ... × X_n and F_n for A₁ ⊗ ... ⊗ A_n, as in Section 8.

(i) Show that each function f in M⁺(Ω_n, F_n) is the restriction to Ω_n of some function f̄ in M⁺([0, 1)ⁿ, Bⁿ).

(ii) Show that P_n f := m^t f̄(t, t, ..., t), for f ∈ M⁺(Ω_n, F_n), defines a consistent family of finite dimensional distributions.

(iii) Let g_n denote the indicator function of {ω_n ∈ Ω_n : x₁ = x₂ = ... = x_n}, and let f_n(ω) := g_n(ω|n) be the corresponding functions in M⁺. Show that P_n g_n = 1 for all n, even though f_n ↓ 0 pointwise.

(iv) Deduce that the functional P, as defined by <47>, is not sigma-smooth at zero.
[30]

[31] Let T be an uncountable index set, and let Ω := X_{t∈T} X_t denote the set of all functions ω : T → ∪_{t∈T} X_t with ω(t) ∈ X_t for each t. Let A_t be a sigma-field on X_t. For each S ⊆ T, define F_S to be the smallest sigma-field for which each of the maps ω ↦ ω(s), for s ∈ S, is F_S\A_s-measurable. The product sigma-field ⊗_{t∈T} A_t is defined to equal F_T.

(i) Show that F_T = ∪_S F_S, the union running over all countable subsets S of T.

(ii) For each countable S ⊆ T, let P_S be a probability measure on F_S, with the property that P_S equals the restriction of P_{S′} to F_S if S ⊆ S′. Show that PF := P_S F for F ∈ F_S defines a countably additive probability measure on F_T.
10.  Notes
According to the account by Hawkins (1979, pages 154-162), the Fubini result (in a 1907 paper, discussed by Hawkins) was originally stated only for products of Lebesgue measure on bounded intervals. I do not personally know whether the appellation Tonelli's Theorem is historically justified, but Dudley (1989, page 113) cited a 1909 paper of Tonelli as correcting an error in the Fubini paper. As noted by Hawkins, Lebesgue also has a claim to being the inventor of the measure theoretic version of the theorem for iterated integrals: in his thesis (pages 44-51 of Lebesgue 1902) he established a form of the theorem for bounded measurable functions defined on a rectangle. He expressed the two-dimensional Lebesgue integral as an iterated inner or outer one-dimensional integral. With modern hindsight, his result essentially contains the Fubini Theorem for Lebesgue measure on intervals, but the reformulation involves some later refinements. Royden (1968, Section 12.4) gave an excellent discussion of the distinctions between the two theorems, Tonelli and Fubini, and the need for something like sigma-finiteness in the Tonelli theorem.
I do not know, in general, whether my definition of sigma-finiteness of a kernel, in the sense of Definition <21>, is equivalent to the apparently weaker property of sigma-finiteness for each measure λ_x, for x ∈ X.

The inequality between L² norms in Example <32> was noted by Fréchet (1957). He cited earlier works of Salvemini, Bass, and Dall'Aglio (none of which I have seen) containing more specialized forms of the result, based on Fréchet (1951). The L^p version of the result is also in the literature (see the comments by Dudley 1989, page 342). I do not know whether the general version of the inequality, as in Problem [17], has been stated before.
The Maximal Inequality <38> is usually attributed to a 1939 paper of Ottaviani,
which I have not seen. Theorem <39> is due to Kolmogorov (1928). Theorem <3>
was stated without proof by Kolmogorov (1933, p 57; English edition p 69), with the
remark that the proof had not been published. However, the necessary techniques
for the proof were already contained in his earlier papers (Kolmogorov 1928, 1930).
Problem [26] presents a slight repackaging of an alternative method of proof due
to Etemadi (1981). By splitting summands into positive and negative parts, he was
able to greatly simplify the method for handling blocks of partial sums.
Daniell (1919) constructed measures on countable products of bounded
subintervals of the real line. Kolmogorov (1933, Section III.4), apparently unaware of Daniell's work, proved the extension theorem for arbitrary products of real lines. As shown by Problem [31], the extension from countable to uncountable products is almost automatic. Theorem <49> is due to Ionescu Tulcea (1949). See Doob (1953, pages 613-615) or Neveu (1965, Section 5.1) for different arrangements of the proof.
Apparently (Doob 1953, p 639) there was quite a history of incorrect assertions
before the question of existence of measures on infinite product spaces was settled.
Andersen & Jessen (1948), as well as providing a counterexample (Problem [29])
to the general analog of the Kolmogorov extension theorem, also suggested that the
Ionescu Tulcea form was more widely known:
In the terminology of the theory of probability this means, that the case of
dependent variables cannot be treated for abstract variables in the same manner
as for unrestricted real variables. Professor Doob has kindly pointed out, what
was also known to us, that this case may be dealt with along similar lines as the
case of independent variables (product measures) when conditional probability
measures are supposed to exist. This question will be treated in a forthcoming
paper by Doob and Jessen.
REFERENCES
Chapter 5
Conditioning
SECTION 1 considers the elementary case of conditioning on a map that takes only finitely
many different values, as motivation for the general definition.
SECTION 2 defines conditional probability distributions for conditioning on the value of a
general measurable map.
SECTION 3 discusses existence of conditional distributions by means of a slightly more
general concept, disintegration, which is essential for the understanding of general
conditional densities.
SECTION 4 defines conditional densities. It develops the general analog of the elementary
formula for a conditional density: (joint density)/(marginal density).
SECTION *5 illustrates how conditional distributions can be identified by symmetry
considerations. The classical Borel paradox is presented as a warning against the
misuse of symmetry.
SECTION 6 discusses the abstract Kolmogorov conditional expectation, explaining why it is
natural to take the conditioning information to be a subsigmafield.
SECTION *7 discusses the statistical concept of sufficiency.
partition of the sample space Ω into finitely many disjoint events, such as the sets {T = t} where some random variable T takes each of its possible values 1, 2, ..., n. From the probabilities P{T = t} and the conditional distributions P(· | T = t), for each t, one calculates expected values as weighted averages,

<1>      PX = Σ_t P(X | T = t) P{T = t}.

Notice that the weights Q{t} := P{T = t} define a probability measure Q on the range space T = {1, 2, ..., n}, the distribution of T under P. (That is, Q = TP.) Also, if there is no ambiguity about the choice of T, it helps notationally to abbreviate the conditional distribution to P_t(·), writing P_t(X) instead of P(X | T = t). The probability measure P_t lives on Ω, with P_t{T ≠ t} = 0 for each t in T. With these simplifications in notation, formula <1> can be rewritten more concisely as PX = Q^t (P_t X).
Example. Suppose a deck of 26 red and 26 black cards is well shuffled, that is, all 52! permutations of the cards are equally likely. Let A denote the event {top and bottom cards red}, and let T be the map into T = {red, black} that gives the color of the top card. Then T has distribution Q given by

Q{red} = P{T = red} = 1/2      and      Q{black} = P{T = black} = 1/2.

By symmetry, the conditional distribution P_t gives equal probability to all permutations of the remaining 51 cards. In particular,

P_red A = P{top and bottom cards red | T = red} = 25/51,
P_black A = P{top and bottom cards red | T = black} = 0/51,

from which we deduce that

PA = Q{red} (25/51) + Q{black} (0/51) = 25/102.
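The weighted average in the Example can be checked directly, both exactly and by simulation (a sketch of mine; the deck encoding, seed, and sample size are arbitrary choices):

```python
import random
from fractions import Fraction

# Exact weighted average: PA = Q{red}(25/51) + Q{black}(0/51)
assert Fraction(1, 2) * Fraction(25, 51) == Fraction(25, 102)

# Monte Carlo check on shuffled decks of 26 red and 26 black cards.
random.seed(0)
deck = ['red'] * 26 + ['black'] * 26
trials, hits = 100_000, 0
for _ in range(trials):
    random.shuffle(deck)
    if deck[0] == 'red' and deck[-1] == 'red':
        hits += 1
assert abs(hits / trials - 25 / 102) < 0.01
```

The simulation conditions on nothing: it simply verifies that the unconditional probability produced by the conditioning decomposition is correct.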
P^ω (h(Tω) X(ω)) = Q^t (h(t) P_t X)

for the concentration property (i). As you will see in Section 3, the difference between the two concepts comes down to little more than a question of measurability of a particular subset of Ω × T.
Exercise. Let P denote Lebesgue measure on B([0, 1]²). Let T(x, y) := max(x, y). Show that the conditional probability distributions {P_t} for P given T are uniform on the sets {T = t}.

SOLUTION: Write m for Lebesgue measure on B[0, 1]. For 0 < t ≤ 1, formalize the idea of a uniform distribution on the set

{T = t} = {(x, t) : 0 ≤ x ≤ t} ∪ {(t, y) : 0 ≤ y ≤ t}

by defining

P_t f := (1/2t) m^x (f(x, t){0 ≤ x ≤ t}) + (1/2t) m^y (f(t, y){0 ≤ y ≤ t}).

You should check that P_t{T ≠ t} = 0 by direct substitution. The definition of P_t for t = 0 will not matter, because P{T = 0} = 0. Tonelli gives measurability of t ↦ P_t f for each f in M⁺([0, 1]²).

The image measure Q is determined by the values it gives to the generating class of all intervals [0, t],

Q[0, t] = P{(x, y) : max(x, y) ≤ t} = P([0, t]²) = t²,

so that Q has density 2t with respect to m on [0, 1]. Thus

Q^t (P_t f) = m^t (2t P_t f) = m^t m^x (f(x, t){0 ≤ x ≤ t}) + m^t m^y (f(t, y){0 ≤ y ≤ t}).

Replace the dummy variables t, y in the last iterated integral by new dummy variables x, t to see that the sum is just a decomposition of Pf = m ⊗ m f into contributions from the two triangular regions {x ≤ t} and {t ≤ x}. The overlap between the two regions, and the missing edges, all have zero m ⊗ m measure. □
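A Monte Carlo sketch of the Exercise (the value of t, the band width, and the tolerances are my own choices): conditioning a uniform point of the square on max(x, y) falling near t, the point should land on either edge of {T = t} with probability 1/2, with the free coordinate uniform on [0, t].

```python
import random

random.seed(1)
t, band = 0.8, 0.01
on_vertical = 0    # maximum attained by x: point near the edge {x = t}
free_coords = []   # the coordinate that is NOT the maximum

for _ in range(500_000):
    x, y = random.random(), random.random()
    if abs(max(x, y) - t) < band:
        if x >= y:
            on_vertical += 1
            free_coords.append(y)
        else:
            free_coords.append(x)

count = len(free_coords)
assert abs(on_vertical / count - 0.5) < 0.02          # equal mass on the two edges
assert abs(sum(free_coords) / count - t / 2) < 0.02   # free coordinate uniform on [0, t]
```

The small band around t stands in for exact conditioning on {T = t}, an event of probability zero.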
The decomposition of the uniform distribution on the square, into an average of uniform conditional distributions on the boundaries of subsquares, has an extension to more general sets, but the conditional distributions need no longer be uniform.
Example. Let K be a compact subset of R^d, star-shaped around the origin. That is, if y ∈ K and 0 ≤ t ≤ 1 then ty ∈ K.

(Figure: the star-shaped set K, the shrunken set K_t, and the boundary point x/ρ(x).)

For each t > 0, let K_t denote the compact subset {ty : y ∈ K}. The sets {K_t : t > 0} are nested, shrinking to {0} as t decreases to zero. The sets define a function ρ : R^d → R⁺, by ρ(x) := inf{t > 0 : x ∈ K_t} = inf{t : x/t ∈ K}. In fact, the infimum is achieved for each x in K\{0}, with 1 ≥ ρ(x) > 0. Only when x = 0 do we have ρ(x) = 0. The set {x : ρ(x) = 1} is a subset of the boundary, ∂K, a proper subset unless 0 lies in the interior of K. For each x in K\{0}, the point ψ(x) := x/ρ(x) lies in ∂K. Notice that ψ(x/t) = ψ(x) for each t > 0.
Let P denote the uniform distribution on K, defined by

P^x f(x) := m^x (f(x){x ∈ K}) / mK      for f ∈ M⁺(R^d),

with m equal to Lebesgue measure on R^d. The scaling property

m^x (f(x/t){x/t ∈ K}) = t^d m^y (f(y){y ∈ K})      for t > 0

will imply independence of ρ(x) and ψ(x) under P. For 0 < t ≤ 1 and g ∈ M⁺(∂K),

P (g(ψ(x)){ρ(x) ≤ t}) = m^x (g(ψ(x/t)){x/t ∈ K}) / mK = t^d m^y (g(ψ(y)){y ∈ K}) / mK.

In particular,

P{ρ(x) ≤ t} = t^d   for 0 ≤ t ≤ 1,      and      P (g(ψ(x)){ρ(x) ≤ t}) = (P g(ψ(x))) P{ρ(x) ≤ t}.

A generating class argument then leads to the factorization needed for independence. Write μ for the distribution of ψ(x), a probability measure concentrated on ∂K, and Q for the distribution with density d t^{d−1} with respect to Lebesgue measure on [0, 1]. The image of μ under the map y ↦ ry, for a fixed r in [0, 1], defines a probability measure P_r concentrated on the set {x : ρ(x) = r} ⊆ ∂K_r. The defining equality x = ρ(x)ψ(x) then has the interpretation: if R has distribution Q, independently of Y, which has distribution μ, then RY has distribution P; that is,

<7>      P^x f(x) = Q^r μ^y f(ry) = Q^r P_r^x f(x)      for f ∈ M⁺(R^d).

The probability kernel {P_r : 0 < r ≤ 1} is the conditional distribution for P given ρ(x).
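For K equal to the closed unit disk in R² (star-shaped about the origin, with ρ(x) = |x| and ψ(x) = x/|x|), both conclusions, P{ρ(x) ≤ t} = t^d and the independence factorization, can be checked by simulation (a sketch; the set K, the value of t, and the tolerances are my own choices):

```python
import math
import random

random.seed(3)
samples = []
while len(samples) < 200_000:   # uniform points of the unit disk, by rejection
    x, y = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
    if x * x + y * y <= 1.0:
        samples.append((x, y))

d, t = 2, 0.6
n = len(samples)
frac = sum(1 for (x, y) in samples if math.hypot(x, y) <= t) / n
assert abs(frac - t ** d) < 0.01    # P{rho(x) <= t} = t^d

# Independence: the joint probability should factor into the product.
upper = sum(1 for (x, y) in samples if y > 0) / n
joint = sum(1 for (x, y) in samples if math.hypot(x, y) <= t and y > 0) / n
assert abs(joint - frac * upper) < 0.01
```

The event {y > 0} depends only on the direction ψ(x), so its joint probability with {ρ(x) ≤ t} factors, as the Example asserts.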
Example. Consider $\mathbb{P} := P^n$, the joint distribution for $n$ independent observations $x_1, \ldots, x_n$ from $P$, with $P$ equal to the uniform distribution on a compact, star-shaped subset $K$ of $\mathbb{R}^d$, as in Example <6>. Define $T(x) := \max_{i \le n} \rho(x_i)$. We may generate each $x_i$ from a pair of independent variables, $x_i = r_i y_i$, with $y_i \in \partial K$ having distribution $\mu$, and $r_i \in [0,1]$ having distribution $Q$. More formally,
$$\mathbb{P} f(x_1, \ldots, x_n) = \mathcal{Q}^r M^y f(r_1 y_1, \ldots, r_n y_n),$$
where $r := (r_1, \ldots, r_n)$ and $y := (y_1, \ldots, y_n)$, with $M := \mu^n$, the product measure on $(\partial K)^n$, and $\mathcal{Q} := Q^n$, the product measure on $[0,1]^n$. Then we have $T = \max_{i \le n} r_i$, with distribution $\nu$ for which $\nu[0, t] = (Q[0, t])^n = t^{nd}$ for $0 \le t \le 1$. Thus $\nu$ has density $ndt^{nd-1}$ with respect to Lebesgue measure on $[0, 1]$.
Problem [2] shows how to construct the conditional distribution $\mathcal{Q}_t$ for $\mathcal{Q}$ given $T = t$, namely: generate $n - 1$ independent observations $s_2, \ldots, s_n$ from the conditional distribution $Q_t := Q(\cdot \mid s \le t)$, then take $(r_1, \ldots, r_n)$ as a random permutation of $(t, s_2, \ldots, s_n)$. The representation for $\mathbb{P}$ becomes
$$\mathbb{P} f(x_1, \ldots, x_n) = \nu^t \mathcal{Q}_t^r M^y f(r_1 y_1, \ldots, r_n y_n).$$
In effect, conditional on $T = t$ the sample consists of $n - 1$ independent observations from $P(\cdot \mid \rho(x) \le t)$, the uniform distribution on $K_t$, together with one point of the form $ty$ on the boundary $\partial K_t$, the $n$ points being arranged in random order.
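The claimed distribution $\nu$ is also easy to simulate (again my own illustration, not from the text, with $K$ the unit disk): draw each $\rho(x_i)$ from $Q$ by inverse transform and check $\nu[0, t] = t^{nd}$.

```python
import numpy as np

# My own illustration: K = unit disk (d = 2), n = 3 observations per sample.
# Each rho(x_i) has distribution Q with Q[0, t] = t^d, so T = max_i rho(x_i)
# has distribution nu with nu[0, t] = t^(n*d).
rng = np.random.default_rng(1)
n, d, samples = 3, 2, 200_000
rho = rng.uniform(size=(samples, n)) ** (1.0 / d)   # inverse transform for Q
T = rho.max(axis=1)
prob_T = np.mean(T <= 0.8)                          # should be near 0.8**(n*d)
```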
REMARK. The two-step construction lets us reduce calculations involving $n$ independent observations from $P$ to the special case where one of the points lies on the boundary $\partial K$. The conditioning machinery provides a rigorous setting for Crofton's theorem from geometric probability.

The theorem dates from an 1885 article by M. W. Crofton in the Encyclopaedia Britannica. As noted by Eisenberg & Sullivan (2000), existing statements and proofs of the theorem (such as those in Chapter 2 of Kendall & Moran 1963, or Chapter 5 of Solomon 1978) are rather on the heuristic side. Eisenberg and Sullivan derived a version of the theorem by means of Kolmogorov conditional expectations, with supplementary smoothness assumptions. If the boundary is smooth, the measure $\mu$ is absolutely continuous with respect to surface measure, with a density expressible in terms of quantities familiar from differential geometry (as discussed by Baddeley 1977).

Of course there are many situations where one cannot immediately guess the form of the conditional distributions, and then one must rely on systematic methods such as those to be discussed in Sections 4 and 5.
3. Integration and disintegration
<8>
each $P_t$ concentrates on the graph of the map $T$. Indeed, provided the graph is a product measurable set, existence of the conditional distribution is equivalent to the representation of the image measure of $P$ under $x \mapsto (Tx, x)$ as $\mathcal{Q} \otimes \mathcal{P}$.
Existence of conditional distributions follows as a special case of a more general decomposition. The following Theorem is proved in Appendix F, where it is also shown that condition (ii) is satisfied by every sigma-finite Borel measure on a complete separable metric space.
<9>
Theorem. Let $X$ be a metric space equipped with its Borel sigma-field $\mathcal{A}$, and let $T$ be a measurable map from $X$ into a space $\mathcal{T}$, equipped with a sigma-field $\mathcal{B}$. Suppose $\lambda$ is a sigma-finite measure on $\mathcal{A}$ and $\mu$ is a sigma-finite measure on $\mathcal{B}$ dominating the image measure $T\lambda$. Suppose:

(i) the graph of $T$ is $\mathcal{A} \otimes \mathcal{B}$-measurable;

(ii) the measure $\lambda$ is expressible as a countable sum of finite measures, each with compact support.

Then there exists a kernel $\Lambda = \{\lambda_t : t \in \mathcal{T}\}$ from $(\mathcal{T}, \mathcal{B})$ to $(X, \mathcal{A})$ for which the image of $\lambda$ under the map $x \mapsto (Tx, x)$ equals $\mu \otimes \Lambda$, that is,

<10>  $$\lambda^x g(Tx, x) = \mu^t \lambda_t^x g(t, x) \qquad\text{for } g \in \mathcal{M}^+(\mathcal{T} \times X, \mathcal{B} \otimes \mathcal{A}).$$
REMARK. Assertion <10> requires that the $\lambda_t$ be countably additive measures on $\mathcal{A}$, for $\mu$ almost all $t$, and not just that $t \mapsto \lambda_t A$ be determined a.e. $[\mu]$, for each fixed $A$.
A kernel $\Lambda$ with the properties described by the Theorem is called a $(T, \mu)$-disintegration of $\lambda$. The construction of a disintegration is a sort of reverse operation to the "integration" methods used in Section 4.3 to construct measures on product spaces. As you will see in the next Section, disintegrations are essential for the understanding of general conditional densities.
You should not worry too much about the details of Theorem <9>. I have stated it in detail so that you can see how existence of disintegrations involves topological assumptions (or topologically inspired assumptions, as in Pachl 1978). Dellacherie & Meyer (1978, page 78) lamented that "The theorem on disintegration of measures has a bad reputation, and probabilists often try to avoid the use of conditional distributions ... But it really is simple and easy to prove." Perhaps the unpopularity is due, in part, to the role of topology in the proof. Many probabilists seem to regard topology as completely extraneous to any discussion of conditioning, or even to any discussion of abstract probability theory. Nevertheless, it is a sad reality of measure theoretic life that the axioms for countable additivity are not well suited to dealing with uncountable families of negligible sets, and that occasionally they need a little topological help.
REMARK. Topological ideas also come to the rescue of countable additivity in the modern theory of stochastic processes in continuous time. I think it is no accident that the abstract theory of (stochastic) processes has flourished more readily amongst the probabilists who are influenced by the French approach to measure theory, where measures are linear functionals and topological requirements are accepted as natural.
<11> Example. Let $\lambda$ denote Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$, let $X$ denote the map that projects $\mathbb{R}^2$ onto the first coordinate axis, and let $\mu$ denote one-dimensional Lebesgue measure on that axis. Write $\lambda_x$ for one-dimensional Lebesgue measure transplanted to live on $\{x\} \otimes \mathbb{R}$. That is, $\lambda_x$ is defined on $\mathcal{B}(\mathbb{R}^2)$ but it concentrates all its mass along a line orthogonal to the first coordinate axis. By the Tonelli Theorem, $\{\lambda_x : x \in \mathbb{R}\}$ is an $(X, \mu)$-disintegration of $\lambda$.
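In the discrete analog, the disintegration identity <10> is nothing more than the rearrangement of a double sum, which is worth seeing once. A finite sketch of my own (counting measures in place of Lebesgue measure, an assumption of this illustration):

```python
# My own finite illustration of a (T, mu)-disintegration: lambda = counting
# measure on a grid X x Y, T = first-coordinate projection, mu = counting
# measure on X, and lambda_x = counting measure on the slice {x} x Y.
X = [0.0, 1.0, 2.0]
Y = [0.0, 0.5, 1.0, 1.5]

def g(x, y):                      # an arbitrary g in M^+
    return x * x + 2.0 * y + 1.0

# lambda^{(x,y)} g, as a double sum over the grid ...
lhs = sum(g(x, y) for x in X for y in Y)
# ... equals mu^x lambda_x^y g, the iterated sum of the identity <10>.
rhs = sum(sum(g(x, y) for y in Y) for x in X)
```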
REMARK. Many authors (including Dellacherie & Meyer 1978, Section III.70) require $\Lambda$ to be a probability kernel, a restriction that excludes interesting and useful cases, such as the decomposition of Lebesgue measure from Example <11>. As shown by Problem [3], $\Lambda$ can be chosen as a probability kernel if and only if $\mu$ is equal to the image measure $T\lambda$. In particular, because the image of two-dimensional Lebesgue measure under a coordinate projection is not a sigma-finite measure, there is no way we could have chosen $\Lambda$ as a probability kernel in Example <11>.
4. Conditional densities
In introductory courses one learns to calculate conditional densities by dividing marginal densities into joint densities. Such a calculation has a natural generalization for conditional distributions: it is merely a matter of reinterpreting the meanings of the joint and marginal densities.

Recall the elementary construction. Suppose $P$ has density $p(x, y)$ with respect to Lebesgue measure $\lambda$ on $\mathcal{B}(\mathbb{R}^2)$. Then the $X$-marginal has density $q(x) := \int p(x, y)\,dy$ with respect to Lebesgue measure $\mu$ on the $X$-axis, and the conditional distribution $P_x(\cdot)$ is given by the conditional density $p(y \mid x) := p(x, y)/q(x)$. Typically one does not worry too hard about how to define the conditional distribution when $q(x) = 0$ or $q(x) = \infty$, but maybe one should.

The $dy$ in the definition of $q(x)$ corresponds to the disintegrating Lebesgue measure $\lambda_x$ from Example <11>. The marginal density $q(x)$ equals $\lambda_x p$. It is the density, with respect to Lebesgue measure $\mu$ on $\mathbb{R}$, of the image of $P$ under the coordinate map $X$. The probability distribution $P_x(\cdot)$ is absolutely continuous with respect to $\lambda_x$, with density $p(\cdot \mid x)$ standardized to integrate to 1.
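The elementary construction can be checked by quadrature. In this sketch (my own choice of example, not from the text) the joint density is $p(x, y) = x + y$ on the unit square, so $q(x) = x + 1/2$ and $p(y \mid x) = (x + y)/(x + 1/2)$; each conditional density integrates to 1, and integrating conditional expectations against the marginal recovers the joint integral.

```python
import numpy as np

trap = getattr(np, "trapezoid", None) or np.trapz   # name differs across numpy versions

# My own illustration: joint density p(x, y) = x + y on [0, 1]^2.
xs = np.linspace(0.0, 1.0, 1001)
ys = np.linspace(0.0, 1.0, 1001)
XX, YY = np.meshgrid(xs, ys, indexing="ij")
p = XX + YY

q = trap(p, ys, axis=1)              # marginal density q(x) = x + 1/2
p_cond = p / q[:, None]              # conditional density p(y | x)
mass = trap(p_cond, ys, axis=1)      # each conditional density integrates to 1

# Law of total expectation for f(x, y) = x*y: joint integral = iterated integral.
joint_mean = trap(trap(XX * YY * p, ys, axis=1), xs)     # exact value 1/3
cond_mean = trap(YY * p_cond, ys, axis=1)                # P_x-expectation of y
iterated = trap(xs * cond_mean * q, xs)
```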
To generalize the elementary formula, suppose a probability measure $P$ is dominated by a sigma-finite measure $\lambda$, which has a $(T, \mu)$-disintegration $\Lambda = \{\lambda_t : t \in \mathcal{T}\}$. The density $p(x) := dP/d\lambda$ corresponds to the joint density, and $q(t) := \lambda_t^x p(x)$ corresponds to the marginal density. If the analogy is appropriate, the ratio $p_t(x) := p(x)/q(t)$ should correspond to the conditional density, with $\lambda_t$ as the dominating measure. In fact, if we took a little care to avoid $0/0$ or $\infty/\infty$, by inserting an indicator function $\{0 < q < \infty\}$ into the definition, the elementary formula would indeed carry over to the general setting. For some purposes (such as the discussion of sufficiency in Section 7) it is better not to force $p_t(x)$ to be zero when $q(t) = 0$ or $q(t) = \infty$. The wording in part (iii) of the next Theorem is designed to accommodate a slightly more flexible choice for the conditional density.
<12> With $q(t) := \lambda_t^x p(x)$, suppose $p_t \in \mathcal{M}^+(X)$ is chosen so that $p(x) = q(Tx)\,p_{Tx}(x)$ a.e. $[\lambda]$. The image measure $Q := TP$ then has density $q$ with respect to $\mu$, because, for $h \in \mathcal{M}^+(\mathcal{T})$,
$$Q^t h(t) = \lambda^x\big(p(x)h(Tx)\big) = \mu^t \lambda_t^x\big(p(x)h(t)\big) = \mu^t\big(q(t)h(t)\big),$$
the first equality coming from the density $p = dP/d\lambda$ and the second from the disintegration of $\lambda$. Moreover, for $f \in \mathcal{M}^+(X)$,
$$
\begin{aligned}
Pf &= \lambda^x\big(p(x)f(x)\big) && \text{density } p = dP/d\lambda\\
&= \mu^t \lambda_t^x\big(q(t)\,p_t(x)f(x)\big) && \text{disintegration of } \lambda;\ \text{assumption on } p_t\\
&= \mu^t\big(q(t)\,\lambda_t^x(p_t(x)f(x))\big)\\
&= Q^t\big(P_t^x f(x)\big),
\end{aligned}
$$
where $P_t$ denotes the measure with density $p_t$ with respect to $\lambda_t$. That is, the measures $P_t$ play the role of the conditional distributions.
<13>
<14> That is, $Q$ has density $p(x, \theta)$ with respect to the product measure $\mu \otimes \pi$. The product measure has the trivial disintegration $(\mu \otimes \pi)_x = \pi$, if we regard $\pi$ as a measure living on $\{x\} \otimes \Theta$. It follows from Theorem <12> that the posterior distribution $Q_x$ has density $p(x, \theta)/\pi^\theta p(x, \theta)$ with respect to $\pi$. Why would a Bayesian probably not worry about the negligible sets where such a ratio is not well defined?
Example. Let $P$ and $Q$ be probability measures on $(X, \mathcal{A})$, with densities $p$ and $q$ with respect to a sigma-finite measure $\lambda$. The Hellinger distance $H(P, Q)$ was defined in Section 3.3 as the $\mathcal{L}^2(\lambda)$ distance between $\sqrt{p}$ and $\sqrt{q}$. Suppose $\lambda$ has a $(T, \mu)$-disintegration $\Lambda = \{\lambda_t : t \in \mathcal{T}\}$, with $T$ a measurable map from $X$ into $\mathcal{T}$. The image measures $TP$ and $TQ$ have densities $\bar{p}(t) = \lambda_t(p)$ and $\bar{q}(t) = \lambda_t(q)$ with respect to $\mu$. By the Cauchy-Schwarz inequality, $\lambda_t\sqrt{pq} \le \sqrt{\bar{p}(t)\bar{q}(t)}$, and hence

<15>  $$\lambda\sqrt{pq} = \mu^t \lambda_t^x \sqrt{p(x)q(x)} \le \mu^t \sqrt{\bar{p}(t)\bar{q}(t)}.$$

That is, the Hellinger affinity between $P$ and $Q$ is smaller than the Hellinger affinity between $TP$ and $TQ$. Equivalently, $H^2(TP, TQ) \le H^2(P, Q)$.
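For discrete distributions the comparison is a finite computation. In the sketch below (my own example, not from the text) $X$ is a six-point space arranged as a $2 \times 3$ grid, $T$ projects onto the row coordinate, and $\lambda$ is counting measure; the affinity can only increase when $P$ and $Q$ are replaced by $TP$ and $TQ$.

```python
import math

# My own illustration: X = {0,1} x {0,1,2}, lambda = counting measure,
# T = row index.  P and Q are given by probability mass functions p, q.
cells = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
p = dict(zip(cells, [0.10, 0.20, 0.10, 0.25, 0.15, 0.20]))
q = dict(zip(cells, [0.30, 0.05, 0.15, 0.10, 0.30, 0.10]))

# Hellinger affinity lambda sqrt(pq), and the affinity of the images TP, TQ.
affinity = sum(math.sqrt(p[k] * q[k]) for k in cells)
p_bar = {r: sum(v for (rr, c), v in p.items() if rr == r) for r in (0, 1)}
q_bar = {r: sum(v for (rr, c), v in q.items() if rr == r) for r in (0, 1)}
affinity_T = sum(math.sqrt(p_bar[r] * q_bar[r]) for r in (0, 1))

# Since H^2 = 2 - 2*affinity, affinity_T >= affinity is exactly
# H^2(TP, TQ) <= H^2(P, Q).
```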
Example. Let $\{P_\theta : \theta \in \Theta\}$ be a family of probability measures with densities $p(x, \theta)$ with respect to a sigma-finite measure $\lambda$ on $(X, \mathcal{A})$. The method of maximum likelihood promises good properties for the estimator defined by maximizing $p(x, \theta)$ as a function of $\theta$, for each observed $x$.

If $x$ itself is not known, but instead the value of a measurable function $T(x)$, taking values in some space $(\mathcal{T}, \mathcal{B})$, is observed, the method still applies. If $Q_\theta$, the distribution of $T$ under $P_\theta$, has density $q(t, \theta)$ with respect to some sigma-finite measure $\mu$, the estimate of $\theta$ is obtained by maximizing $q(t, \theta)$ as a function of $\theta$, for the observed value $t = T(x)$.
REMARK. [...] of statistics $(S(x), T(x))$ [...] incomplete data.
[...] for the density of $P_{\theta_i}(\cdot \mid T = t)$ with respect to $\nu$, and $q_i$ instead of $q(t, \theta_i)$, for $i = 1, 2$. Then $p(x, \theta_i) = x_i(x)q_i$, and
$$0 \le G(\theta_1) - G(\theta_2) = \nu^x x_0(x)\log[\ldots]$$
*5. Invariance

Formal verification of the disintegration property can sometimes be reduced to a symmetry, or invariance, argument, as in the following Exercise.
<16>
By a generating class argument it follows that the joint distribution of $R(x)$ and $\theta(x)$ equals the product $Q \otimes m$ on $\mathcal{B}(\mathbb{R}^+) \otimes \mathcal{B}[0, 2\pi)$, which implies independence.
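A probabilistic analog is easy to simulate (my own illustration, not from the text, with the rotation-invariant standard bivariate normal playing the role of the invariant measure): the radius $R(x)$ and angle $\theta(x)$ come out independent, with $\theta(x)$ uniform on $[0, 2\pi)$.

```python
import numpy as np

# My own illustration: the standard bivariate normal is invariant under every
# rotation, and its level sets {R = r} are circles, so R(x) and theta(x)
# should be independent, with theta(x) uniform on [0, 2*pi).
rng = np.random.default_rng(2)
xy = rng.standard_normal(size=(200_000, 2))
R = np.hypot(xy[:, 0], xy[:, 1])
theta = np.mod(np.arctan2(xy[:, 1], xy[:, 0]), 2.0 * np.pi)

unif = np.mean(theta <= np.pi / 2.0)                  # should be near 1/4
marg = np.mean(R <= 1.0)
prod = np.mean((theta <= np.pi / 2.0) & (R <= 1.0))   # near (1/4) * marg
```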
REMARK. A similar argument would work for any sigma-finite measure $\lambda$ that is invariant under a group $G$ of measurable transformations on $X$, if the level sets $\{T = t\}$ are also invariant. Let $G_0$ be a countable subset of $G$. Then the measures $\{\lambda_t\}$ must be invariant under each transformation in $G_0$, except possibly for those $t$ in a negligible set that depends on $G_0$. Often invariance under a suitable $G_0$ will characterize the $\{\lambda_t\}$ up to multiplicative constants.
distributions are uniform around circles of constant latitude. It follows that almost all circles of constant latitude must carry a uniform conditional distribution. Unfortunately, the equator, the only such circle that is also a great circle, corresponds to a negligible set of $T$ values. Thus there is no harm in assuming that $P_0$, the conditional distribution around the equator, is uniform, just as there is no harm in assuming it to carry any other conditional distribution. A uniform conditional distribution around the equator is not forced by the $T$-disintegration.

What happens if we change $T$ to pick off the longitude of the random point on the surface? It would be nonsense to argue that the conditional distributions stay the same: the level sets supporting those distributions are not even the same as before. Couldn't we just argue again by invariance? Certainly not. The great circles of constant $T$ are no longer invariant under a group of rotations; the disintegrating measures need no longer be invariant; the conditional distributions around the great circles of constant longitude are not uniform.
There is a general lesson to be learned from the last Example. Even if almost all level sets $\{T = t\}$ in one disintegration must carry particular distributions, it does not follow that similar sets in a different disintegration must carry the same distributions. It is an even worse error to assume that a conditional distribution that could be assigned to a region as a typical level set of one disintegration must be the same as the conditional distribution similar regions must carry when they appear as level sets of another disintegration.
6. Kolmogorov's abstract conditional expectation
<18>
$$P^\omega\big(\alpha(T\omega)X(\omega)\big) = Q^t\big(\alpha(t)\,g_X(t)\big), \qquad\text{where } g_X(t) := P_t^\omega X(\omega).$$
Even when the conditional distribution does not exist, it turns out that it is still possible to find a function $t \mapsto g(t, X)$ satisfying an analog of <18>, for each fixed $X$. Kolmogorov (1933) suggested that such a function should be interpreted as the conditional expectation of $X$ given $T = t$. Most authors write $\mathbb{E}(X \mid T = t)$ for $g(t, X)$. This abstract conditional expectation has many, but not all, of the properties associated with an expectation with respect to a conditional distribution. Kolmogorov's suggestion has the great merit that it imposes no extra regularity assumptions on the underlying probability spaces and measures, but at the cost of greater abstraction.

The properties of the Kolmogorov conditional expectation $X \mapsto g(t, X)$ are analogous to the properties of increasing linear functionals with the Monotone Convergence property, except for omnipresent a.e. $[Q]$ qualifiers.
<19>
Theorem. There is a map $X \mapsto g(t, X)$ from $\mathcal{M}^+(\Omega, \mathcal{F})$ to $\mathcal{M}^+(\mathcal{T}, \mathcal{B})$ with the property that $P^\omega\big(\alpha(T\omega)X(\omega)\big) = Q^t\big(\alpha(t)g(t, X)\big)$ for each $\alpha \in \mathcal{M}^+(\mathcal{T}, \mathcal{B})$. For each $X$, the function $t \mapsto g(t, X)$ is unique up to a $Q$-equivalence. The map has the following properties.

(i) If $X = h(T)$ for some $h$ in $\mathcal{M}^+(\mathcal{T}, \mathcal{B})$ then $g(t, X) = h(t)$ a.e. $[Q]$.

(ii) For $X_1, X_2$ in $\mathcal{M}^+(\Omega, \mathcal{F})$ and $h_1, h_2$ in $\mathcal{M}^+(\mathcal{T}, \mathcal{B})$,
$$g\big(t, h_1(T)X_1 + h_2(T)X_2\big) = h_1(t)g(t, X_1) + h_2(t)g(t, X_2) \qquad\text{a.e. } [Q].$$

(iii) If $0 \le X_1 \le X_2$ a.e. $[P]$ then $g(t, X_1) \le g(t, X_2)$ a.e. $[Q]$.

(iv) If $0 \le X_1 \le X_2 \le \ldots \uparrow X$ a.e. $[P]$ then $g(t, X_n) \uparrow g(t, X)$ a.e. $[Q]$.
To establish the existence and the four properties of the function $g(t, X)$, we must systematically translate them into corresponding assertions about averages, by means of the following simple result.
<20>
Lemma. Suppose $h_1, h_2 \in \mathcal{M}^+(\mathcal{T}, \mathcal{B})$ satisfy $Q^t\big(\alpha(t)h_1(t)\big) \le Q^t\big(\alpha(t)h_2(t)\big)$ for all $\alpha \in \mathcal{M}^+(\mathcal{T}, \mathcal{B})$. Then $h_1 \le h_2$ a.e. $[Q]$. If the inequality is replaced by an equality, that is, if $Q^t\big(\alpha(t)h_1(t)\big) = Q^t\big(\alpha(t)h_2(t)\big)$ for all $\alpha$, then $h_1 = h_2$ a.e. $[Q]$.

Proof. For a positive rational number $r$, let $A_r := \{t : h_2(t) < r < h_1(t)\}$. With $\alpha$ the indicator of $A_r$, there are no $\infty - \infty$ problems in subtracting to get $0 \le Q^t\big((h_2(t) - h_1(t))\{t \in A_r\}\big)$, which forces $QA_r = 0$ because $h_2 - h_1$ is strictly negative on $A_r$. Take the union over all positive rational $r$ to deduce that $Q\{h_2 < h_1\} = 0$. Reverse the roles of $h_1$ and $h_2$ to get the companion assertion, $Q\{h_1 < h_2\} = 0$, for the case where the inequality is replaced by an equality.
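On a finite space the whole construction is explicit, which makes the Theorem and the Lemma easy to experiment with. In this sketch (my own illustration, not from the text) $g(t, X) := P\big(X\{T = t\}\big)/Q\{t\}$, and the defining identity $P\big(\alpha(T)X\big) = Q^t\big(\alpha(t)g(t, X)\big)$ is checked exactly with rational arithmetic.

```python
from fractions import Fraction as F

# My own illustration: Omega = {0,...,5} with given probabilities, T(w) = w % 2.
omega = range(6)
prob = [F(1, 12), F(2, 12), F(3, 12), F(1, 12), F(2, 12), F(3, 12)]
T = lambda w: w % 2
X = lambda w: w * w                       # an arbitrary X in M^+

# Image measure Q = TP on the two-point space {0, 1}.
Q = {t: sum(p for w, p in zip(omega, prob) if T(w) == t) for t in (0, 1)}
# Kolmogorov conditional expectation g(t, X) = P(X {T = t}) / Q{t}.
g = {t: sum(p * X(w) for w, p in zip(omega, prob) if T(w) == t) / Q[t]
     for t in (0, 1)}

# Defining property: P(alpha(T) X) = Q^t(alpha(t) g(t, X)), for any alpha.
alpha = {0: F(2), 1: F(5)}
lhs = sum(p * alpha[T(w)] * X(w) for w, p in zip(omega, prob))
rhs = sum(Q[t] * alpha[t] * g[t] for t in (0, 1))
```

Because the arithmetic is exact, the two sides agree identically, not merely up to rounding.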
Brave readers might also want to dabble with conditional expectations for other cases where only one of $PX^+$ or $PX^-$ is finite. Beware of conditional $\infty - \infty$, whatever that might mean!

There are conditional analogs of the Fatou Lemma (Problem [7]), Dominated Convergence (Problem [8]), Jensen's inequality (Problem [13]), and so on. The derivations are essentially the same as for ordinary expectations, because the accumulation of countably many $Q$-negligible sets causes no difficulty when we deal with only countably many random variables $X$.
Conditioning on a sub-sigma-field
The choice of $Q$ as the image measure $TP$ lets us rewrite $Q^t\big(\alpha(t)g(t, X)\big)$ as $P\big(\alpha(T)g(T, X)\big)$. Both $g(T, X)$ and $\alpha(T)$ are $\sigma(T)$-measurable functions on $\Omega$. Indeed (recall the Problems to Chapter 2), every $\sigma(T)\backslash\mathcal{B}[0, \infty]$-measurable map into $[0, \infty]$ must be of the form $\alpha(T)$ for some $\alpha$ in $\mathcal{M}^+(\mathcal{T}, \mathcal{B})$. The defining property may be recast as: to each $X$ in $\mathcal{M}^+(\Omega, \mathcal{F})$ there exists an $X_T := g(T, X)$ in $\mathcal{M}^+(\Omega, \sigma(T))$ for which

<22>  $$P(WX) = P(WX_T) \qquad\text{for all } W \in \mathcal{M}^+(\Omega, \sigma(T)).$$

The random variable $X_T$, which is also denoted by $P(X \mid T)$ (or $\mathbb{E}(X \mid T)$, in traditional notation), is unique up to a $P$ almost sure equivalence.
REMARK. Be sure that you understand the distinction between the functions $g(t) := g(t, X)$ and $g(T) = g \circ T$; they live on different spaces, $\mathcal{T}$ and $\Omega$. If we write $P(X \mid T = t)$ for $g(t)$, it is tempting to write the nonsensical $P(X \mid T = T)$ for $g(T)$, instead of $P(X \mid T)$, or even $P(X \mid \sigma(T))$ (see below).
The conditional expectation $P(X \mid T)$ depends on $T$ only through the sigma-field $\sigma(T)$. If $S$ were another random element for which $\sigma(T) = \sigma(S)$, the conditional expectation $P(X \mid S)$ would be defined by the same requirement <22>. That is, except for the unavoidable nonuniqueness within a $P$-equivalence class of random variables, we have $P(X \mid T) = P(X \mid S)$. We could regard the conditioning information as coming from a sub-sigma-field of $\mathcal{F}$ rather than from a $T$ that generates the sub-sigma-field.
<23> Definition. Let $\mathcal{G}$ be a sub-sigma-field of $\mathcal{F}$. For each $X$ in $\mathcal{M}^+(\Omega, \mathcal{F})$, let $X_\mathcal{G}$ denote an element of $\mathcal{M}^+(\Omega, \mathcal{G})$ for which

<24>  $$P(WX) = P(WX_\mathcal{G}) \qquad\text{for all } W \in \mathcal{M}^+(\Omega, \mathcal{G}).$$
The variable $X_\mathcal{G}$ is called the conditional expectation of $X$ given the sub-sigma-field $\mathcal{G}$. It is unique up to equivalence.
Be careful that you remember to check that $X_\mathcal{G}$ is $\mathcal{G}$-measurable. Otherwise you might be tempted to leap to the conclusion that $X_\mathcal{G}$ equals $X$, as a trivial solution to the equality <24>. Such an identification is valid only when $X$ itself is (equivalent to) a $\mathcal{G}$-measurable random variable.
[...] a.e. $[P]$.

(iv) If $0 \le X_1 \le X_2 \le \ldots \uparrow X$ then $P(X_n \mid \mathcal{G}) \uparrow P(X \mid \mathcal{G})$ a.e. $[P]$.
REMARK. You might find it helpful to think of $\mathcal{G}$ as partial information: the information one obtains about $\omega$ by learning the value $g(\omega)$ for each $\mathcal{G}$-measurable random variable $g$. The value of the conditional expectation $P(X \mid \mathcal{G})$ is determined by the "$\mathcal{G}$-information"; it is a prediction of $X(\omega)$ based on partial information. (The conditional expectation can also be interpreted as the fair price to pay for the random return $X(\omega)$ after one learns the $\mathcal{G}$-information about $\omega$, as with the fair-price interpretation described in Section 1.5.) Whatever you prefer as a guide to intuition, be warned: you should not take the interpretation too literally, because there are examples (Problem [5]) where one can determine $\omega$ precisely from the values of all $\mathcal{G}$-measurable $g$, even with $\mathcal{G}$ a proper sub-sigma-field of $\mathcal{F}$.
<26>
<27>
<28>
Example. When $PX^2 < \infty$, the defining property of the conditional expectation $X_\mathcal{G} = P(X \mid \mathcal{G})$ may be written as $P\big((X - X_\mathcal{G})g\big) = 0$ for all bounded, $\mathcal{G}$-measurable $g$. A generating class argument extends the equality to all square-integrable, $\mathcal{G}$-measurable $g$. That is, $X - X_\mathcal{G}$ is orthogonal to $\mathcal{L}^2(\Omega, \mathcal{G}, P)$. The (equivalence class of the) conditional expectation is just the projection of $X$ onto $L^2(\Omega, \mathcal{G}, P)$, as a closed subspace of $L^2(\Omega, \mathcal{F}, P)$.

Some abstract conditioning results are readily explained in this setting. For example, if $\mathcal{G}_0 \subseteq \mathcal{G}_1$, for two sub-sigma-fields of $\mathcal{F}$, then the assertion
$$P\big(P(X \mid \mathcal{G}_1) \mid \mathcal{G}_0\big) = P(X \mid \mathcal{G}_0) \qquad\text{almost surely}$$
from Example <27> corresponds to the fact that $\pi_0 \circ \pi_1 = \pi_0$, where $\pi_i$ denotes the projection map from $L^2(\Omega, \mathcal{F}, P)$ onto $H_i := L^2(\Omega, \mathcal{G}_i, P)$. The $\pi_0$ on the left-hand side kills any component in $H_1$ that is orthogonal to $H_0$, leaving only the component in $H_0$.
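The projection picture can be verified directly on a finite space (my own illustration, not from the text): take the sigma-fields generated by two nested partitions, compute the conditional expectations by block averaging, then check orthogonality and the identity $\pi_0 \circ \pi_1 = \pi_0$.

```python
import numpy as np

# My own illustration: Omega = {0,...,7}, uniform P.  G1 is generated by the
# partition into pairs, G0 by the coarser partition into quadruples, so G0
# is contained in G1.
P = np.full(8, 1.0 / 8.0)
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def cond_exp(X, blocks):
    """Average X over each block of the partition: the L^2 projection."""
    out = np.empty_like(X)
    for b in blocks:
        out[b] = np.sum(P[b] * X[b]) / np.sum(P[b])
    return out

blocks1 = [[0, 1], [2, 3], [4, 5], [6, 7]]
blocks0 = [[0, 1, 2, 3], [4, 5, 6, 7]]
X1 = cond_exp(X, blocks1)              # P(X | G1)
X0 = cond_exp(X, blocks0)              # P(X | G0)

# Orthogonality: X - X1 integrates to zero against each block indicator,
# and the indicators span the G1-measurable functions.
resid = [np.sum(P[b] * (X[b] - X1[b])) for b in blocks1]
# Tower property: projecting X1 onto the coarser G0 recovers X0.
tower = cond_exp(X1, blocks0)
```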
*7. Sufficiency
The intuitive definition of sufficiency says that a statistic $T$ (a measurable map into some space $\mathcal{T}$) is sufficient for a family of probability measures $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ if the conditional distributions given $T$ do not depend on $\theta$. The value of $T(\omega)$ carries all the "information about $\theta$" there is to learn from an observation $\omega$ on $P_\theta$.

There are (at least) two ways of making the definition precise, using either the Kolmogorov notion of conditional expectation or the properties of conditional distributions in the sense of Section 2. To distinguish between the two approaches, I will use nonstandard terminology.
<29>
<30>
Example. Let $\Omega$ be the real line, equipped with its Borel sigma-field $\mathcal{F}$. For each $\theta > 0$ define $P_\theta$ as the probability measure on $\mathcal{F}$ that puts mass $1/2$ at each of the two points $\pm\theta$. Say that a set $B$ is symmetric if $-\omega \in B$ for each $\omega$ in $B$. Let $S$ be a fixed symmetric set not in $\mathcal{F}$, and not containing the origin. Write $\mathcal{G}_0$ for the sub-sigma-field of all symmetric Borel sets, and $\mathcal{G}_1$ for the sub-sigma-field of all Borel sets $B$ for which $B \cap S^c$ is symmetric.

A simple generating class argument shows that a Borel measurable function $g$ is $\mathcal{G}_0$-measurable if and only if $g(\omega) = g(-\omega)$ for all $\omega$, and it is $\mathcal{G}_1$-measurable if and only if $g(\omega) = g(-\omega)$ for all $\omega$ in $S^c$.

Consider an $X$ in $\mathcal{M}^+(\mathcal{F})$. The symmetric function $X_0(\omega) := \tfrac12\big(X(\omega) + X(-\omega)\big)$ is $\mathcal{G}_0$-measurable. For all $\theta > 0$ and all $g_0$ in $\mathcal{M}^+(\mathcal{G}_0)$,
$$P_\theta(Xg_0) = \tfrac12\big(X(\theta)g_0(\theta) + X(-\theta)g_0(-\theta)\big) = X_0(\theta)g_0(\theta) = P_\theta(X_0g_0).$$
That is, $X_0$ is a version of $P_\theta(X \mid \mathcal{G}_0)$ that does not depend on $\theta$. The sub-sigma-field $\mathcal{G}_0$ is weakly sufficient for $\mathcal{P} = \{P_\theta : \theta > 0\}$. In fact, a stronger assertion is possible. The sigma-field $\mathcal{G}_0$ is generated by the map $T(\omega) = |\omega|$. For each $\theta$ we can take the conditional distribution $P_t$ as the probability measure that places mass $1/2$ at $\pm t$. The statistic $T$ is strongly sufficient.
Now consider $X$ equal to the indicator of the half-line $H := [0, \infty)$. Could there be a $\mathcal{G}_1$-measurable function $X_1$ for which $P_\theta(Xg_1) = P_\theta(X_1g_1)$ for all $g_1$ in $\mathcal{M}^+(\mathcal{G}_1)$? If such an $X_1$ existed we would have
$$\tfrac12 g_1(\theta) = P_\theta\big(X(\omega)g_1(\omega)\big) = \tfrac12\big(X_1(\theta)g_1(\theta) + X_1(-\theta)g_1(-\theta)\big) \qquad\text{for all } \theta > 0.$$
For $\theta$ in $H \cap S$, take $g_1$ as the indicator function of the singleton set $\{\theta\}$, and then as the indicator function of the singleton set $\{-\theta\}$, to deduce that $X_1(\theta) = 1$ and $X_1(-\theta) = 0$. For $\theta$ in $H\backslash S$ take $g_1 \equiv 1$ to deduce (via the $\mathcal{G}_1$-measurability property $X_1(\theta) = X_1(-\theta)$) that $X_1(\theta) = X_1(-\theta) = 1/2$. That is, $\{X_1 = 1\} = H \cap S$, which would contradict the Borel measurability of $X_1$: the symmetric set $S = (H \cap S) \cup \big(-(H \cap S)\big)$ would then be Borel, contrary to the nonmeasurability of $S$. There can be no $\mathcal{G}_1$-measurable function $X_1$ that is a version of $P_\theta(X \mid \mathcal{G}_1)$ for all $\theta$. The sub-sigma-field $\mathcal{G}_1$ is not weakly sufficient for $\{P_\theta : \theta > 0\}$.
REMARK. The failure of weak sufficiency for $\mathcal{G}_1$ is due to the fact that the family $\{P_\theta\}$ is not dominated: that is, there exists no sigma-finite measure dominating each $P_\theta$. Problem [17] shows that the failure cannot occur for dominated families.
The concepts of strong and weak sufficiency coincide when $\Omega$ is a finite set and $P_\theta\{\omega\} > 0$ for each $\omega$ in $\Omega$ and each $\theta$ in $\Theta$. If $\{\omega : T\omega = t\}$ is nonempty then the sum $g(t, \theta) := \sum_{\omega'}\{T(\omega') = t\}P_\theta\{\omega'\}$ is nonzero and
$$P_\theta\{\omega \mid T = t\} = \begin{cases} P_\theta\{\omega\}/g(t, \theta) & \text{if } T(\omega) = t,\\ 0 & \text{otherwise.}\end{cases}$$
If $T$ is sufficient, the left-hand side is a function of $\omega$ and $t$ alone. If we write it as $H(\omega, t)$, then sufficiency implies $P_\theta\{\omega\} = g(t, \theta)H(\omega, t)$ whenever $T(\omega) = t$,
or, more succinctly, $P_\theta\{\omega\} = g(T(\omega), \theta)h(\omega)$, where $h(\omega) := H(\omega, T(\omega))$. Conversely, if $P_\theta\{\omega\}$ has such a factorization then, for $\omega$ with $T(\omega) = t$,
$$P_\theta\{\omega \mid T = t\} = \frac{g(t, \theta)h(\omega)}{\sum_{\omega'}\{T(\omega') = t\}g(t, \theta)h(\omega')} = \frac{h(\omega)}{\sum_{\omega'}\{T(\omega') = t\}h(\omega')},$$
which does not depend on $\theta$. Thus $T$ is sufficient if $P_\theta\{\omega\}$ factorizes in the way shown.

The factorization criterion also extends to more general settings. The proof of the analog for weak sufficiency, due to Halmos & Savage (1949), appears in the textbook of Lehmann (1959, Section 2.6). The corresponding proof for strong sufficiency is easier to recognize as a direct generalization of the argument for finite $\Omega$.
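The finite-$\Omega$ argument is worth running once on a concrete family (my own illustration, not from the text): $\omega \in \{0, 1\}^3$ with independent Bernoulli($\theta$) coordinates and $T(\omega) = \sum_i \omega_i$, so that $P_\theta\{\omega\} = \theta^{T(\omega)}(1 - \theta)^{n - T(\omega)}$ factorizes as $g(T(\omega), \theta)h(\omega)$ with $h \equiv 1$, and the conditional distribution given $T$ comes out free of $\theta$.

```python
from itertools import product

# My own illustration: n = 3 independent Bernoulli(theta) coordinates, T = sum.
n = 3

def p_theta(w, theta):
    t = sum(w)
    return theta ** t * (1.0 - theta) ** (n - t)   # = g(T(w), theta) * h(w), h = 1

def conditional_given_T(t, theta):
    """P_theta{w | T = t}: should not depend on theta when T is sufficient."""
    level = [w for w in product((0, 1), repeat=n) if sum(w) == t]
    total = sum(p_theta(w, theta) for w in level)
    return {w: p_theta(w, theta) / total for w in level}

c1 = conditional_given_T(2, 0.3)    # uniform over the 3 arrangements with T = 2
c2 = conditional_given_T(2, 0.8)    # the same distribution, for a different theta
```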
<31> [...] where
$$q(t, \theta) = \lambda_t^\omega\big(g(T\omega, \theta)h(\omega)\big) = g(t, \theta)H(t) \qquad\text{a.e. } [\mu].$$
The aberrant negligible sets are allowed to depend on $\theta$. Because $T\omega = t$ a.e. $[\mu \otimes \Lambda]$, the task reduces to showing that
$$g(t, \theta)h(\omega)\{0 < H(t) < \infty\} = g(t, \theta)h(\omega) \qquad\text{a.e. } [\mu \otimes \Lambda].$$
We need to show that the contributions to the right-hand side from the sets where $H$ is zero or infinite vanish a.e. $[\mu \otimes \Lambda]$. Equivalently, we need to show
$$\mu^t\lambda_t^\omega\big(g(t, \theta)h(\omega)\{H(t) = 0 \text{ or } \infty\}\big) = 0.$$
The $g(t, \theta)\{H(t) = 0 \text{ or } \infty\}$ factor slips outside the innermost integral with respect to $\lambda_t$, leaving $\mu^t\big(g(t, \theta)H(t)\{H(t) = 0 \text{ or } \infty\}\big)$. Clearly the product $g(t, \theta)H(t)$ is zero on the set $\{H = 0\}$. That $g(t, \theta)$ must be zero almost everywhere on the set $\{H = \infty\}$ follows from the fact that $q(t, \theta)$ is a probability density:
<32>
8. Problems
[1]
Suppose $g_1$ and $g_2$ are maps from $(X, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$, with product-measurable graphs. Define $f_i(x) := (x, g_i(x))$. Let $P_i = f_i(\mu_i)$, for probability measures $\mu_i$ on $\mathcal{A}$. Let $P = \alpha_1 P_1 + \alpha_2 P_2$, for constants $\alpha_i > 0$ with $\alpha_1 + \alpha_2 = 1$. Let $X$ denote the coordinate map onto the $X$ space. Show that the conditional probability distribution $P_x$ concentrates on the points $(x, g_1(x))$ and $(x, g_2(x))$. Find the conditional probabilities assigned to each point. Hint: Consider the density of $\mu_i$ with respect to $\mu_1 + \mu_2$.
[2]
[4]
For families of probability measures $\mathcal{P}$ and $\mathcal{Q}$ defined on the same sigma-field, define $\alpha_2(\mathcal{P}, \mathcal{Q})$ to equal $\sup\{\alpha_2(P, Q) : P \in \mathcal{P}, Q \in \mathcal{Q}\}$, where $\alpha_2$ denotes the Hellinger affinity, as in Section 3.3. Write $\mathcal{P} \otimes \mathcal{P}$ for $\{P_1 \otimes P_2 : P_i \in \mathcal{P}\}$, and $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ for its convex hull. Show that $\alpha_2\big(\mathrm{co}(\mathcal{P} \otimes \mathcal{P}), \mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})\big) \le \alpha_2\big(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q})\big)^2$ by the following steps. Write $A$ for $\alpha_2\big(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q})\big)$.

(i) For given $P = \sum_i \beta_i P_{1i} \otimes P_{2i}$ in $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ and $Q = \sum_j \gamma_j Q_{1j} \otimes Q_{2j}$ in $\mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})$, let $\lambda$ be any probability measure dominating all the component measures, with corresponding densities $p_{1i}(x)$, and so on. Show that $P$ has density $p(x, y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)$ with respect to $\lambda \otimes \lambda$, with a similar expression for the density $q(x, y)$ of $Q$.

(ii) Write $X$ for the projection map of $X \times X$ onto its first coordinate. Show that $XP$ has density $p(x) := \sum_i \beta_i p_{1i}(x)$ with respect to $\lambda$, with a similar expression for the density $q(x)$ of $XQ$. Deduce that $XP \in \mathrm{co}(\mathcal{P})$ and $XQ \in \mathrm{co}(\mathcal{Q})$.

(iii) Define $p_x(y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)/p(x)$ on the set $\{x : p(x) > 0\}$. Define $q_x(y)$ analogously. Show that $\lambda^y\sqrt{p_x(y)q_x(y)} \le A$ for all $x$.

(iv) Show that $\alpha_2(P, Q) = \lambda^x\big(\sqrt{p(x)q(x)}\,\lambda^y\sqrt{p_x(y)q_x(y)}\big) \le A^2$.
[5]
Let $P$ be Lebesgue measure on $\mathcal{B}$, the Borel sigma-field of $[0, 1]$. Let $\mathcal{G}$ denote the sigma-field generated by all the singletons in $[0, 1]$.

(i) Show that each member of $\mathcal{G}$ has probability either zero or one.

(ii) Deduce that $P(X \mid \mathcal{G}) = PX$ for each $X$ in $\mathcal{M}^+(\mathcal{B})$.

(iii) Show for each Borel measurable $X$ that $X(\omega)$ is uniquely determined once we know the values of all $\mathcal{G}$-measurable random variables.
[6]
[7]
Suppose $X_n \in \mathcal{M}^+(\Omega, \mathcal{F})$. Show that $P(\liminf_n X_n \mid \mathcal{G}) \le \liminf_n P(X_n \mid \mathcal{G})$ almost surely, for each sub-sigma-field $\mathcal{G}$. Hint: Derive from the conditional form of Monotone Convergence. Imitate the proof of Fatou's Lemma.
[8]
Suppose $X_n \to X$ almost surely, with $|X_n| \le H$ for an integrable $H$. Show that $P(X_n \mid \mathcal{G}) \to P(X \mid \mathcal{G})$ almost surely. Hint: Derive from the conditional form of Fatou's Lemma (Problem [7]). Imitate the proof of Dominated Convergence.
[9]
Suppose $PX^2 < \infty$. Show that $\mathrm{var}(X) = \mathrm{var}\big(P(X \mid \mathcal{G})\big) + P\big(\mathrm{var}(X \mid \mathcal{G})\big)$.
[10]
[13] For each convex function $\psi$ on the real line there exists a countable family of linear functions for which $\psi(x) = \sup_{i \in \mathbb{N}}(a_i + b_i x)$ for all $x$ (see Appendix C). Use this representation to prove the conditional form of Jensen's inequality: if $X$ and $\psi(X)$ are both integrable then $P(\psi(X) \mid \mathcal{G}) \ge \psi\big(P(X \mid \mathcal{G})\big)$ almost surely. Hint: For each $i$ argue that $P(\psi(X) \mid \mathcal{G}) \ge a_i + b_i P(X \mid \mathcal{G})$ almost surely. Question: Is integrability of $\psi(X)$ really needed?
[14] Let $P$ be a probability measure on $\mathcal{B}(\mathbb{R}^2)$ that is invariant under rotations $S_\theta$ for a dense set $\Theta_0$ of angles $\theta$. Show that it is invariant under all rotations. Hint: For bounded, continuous $f$, if $\theta_i \to \theta$ then $P^x f(S_{\theta_i}x) \to P^x f(S_\theta x)$.
[15] Find an invariance argument that produces the conditional distributions from Exercise <5>. Hint: Consider transformations $g_\theta$ that map $(x, y)$ into $(x, y_\theta)$, where $y_\theta = y + \theta$ if $y + \theta \le x$ and $y_\theta = y + \theta - x$ otherwise.
[16] Let $\mathcal{Q}$ be a family of probability measures on a sigma-field $\mathcal{A}$ dominated by a fixed sigma-finite measure $\lambda$. That is, each $Q$ in $\mathcal{Q}$ has a density $\delta_Q$ with respect to $\lambda$. Show that $\mathcal{Q}$ is also dominated by a probability measure of the form $\nu = \sum_{j=1}^\infty 2^{-j}Q_j$ with $Q_j \in \mathcal{Q}$ for each $j$, by following these steps.

(i) Show that there is no loss of generality in assuming that $\lambda$ is a finite measure. Hint: Express $\lambda$ as a sum of finite measures $\sum_i \lambda_i$, then choose positive constants $\alpha_i$ so that $\sum_i \alpha_i\lambda_i$ has finite total mass.

(ii) For each countable subfamily $\mathcal{S}$ of $\mathcal{Q}$ define $L(\mathcal{S}) := \lambda\big(\cup_{Q \in \mathcal{S}}\{\delta_Q > 0\}\big)$. Define $L := \sup\{L(\mathcal{S}) : \mathcal{S} \subseteq \mathcal{Q} \text{ countable}\}$. Find countable subsets $\mathcal{S}_n$ for which $L(\mathcal{S}_n) \uparrow L$. Show that $L = L(\mathcal{S}^*)$, where $\mathcal{S}^* := \cup_n \mathcal{S}_n$.

(iii) Write $X_0$ for $\cup_{Q \in \mathcal{S}^*}\{\delta_Q > 0\}$. For each $Q_0$ in $\mathcal{Q}$, show that $\lambda\big(\{\delta_{Q_0} > 0\}\backslash X_0\big) = 0$. Hint: $L(\mathcal{S}^* \cup \{Q_0\}) \le L(\mathcal{S}^*)$.

(iv) Enumerate $\mathcal{S}^*$ as $\{Q_j : j \in \mathbb{N}\}$. Define $\nu := \sum_{j=1}^\infty 2^{-j}Q_j$. If $f \in \mathcal{M}^+$ and $\nu f = 0$, show that $\lambda\big(f\{x \in X_0\}\big) = 0$. Deduce that $Q_0 f = 0$ for all $Q_0 \in \mathcal{Q}$. That is, $\nu$ dominates $\mathcal{Q}$.
[17]
[18] [...]
$$\frac{dP_\theta}{d\lambda} = [\ldots] \qquad\text{a.e. } [\lambda],$$
in the sense that the right-hand side can be taken as a density for $P_\theta$.
[19]
[20]
Let $(X, \mathcal{A}, P)$ be a probability space, and let $\mathbb{P}$ equal $P^n$, the $n$-fold product measure on $\mathcal{A}^n$. For each $x$ in $X$, let $\delta(x)$ denote the point mass at $x$. For $x$ in $X^n$, let $T(x)$ denote the so-called empirical measure, $n^{-1}\sum_{i \le n}\delta(x_i)$. Intuitively, if we are given the measure $T(x)$, the conditional distribution for $\mathbb{P}$ should give mass $1/n!$ to each of the $n!$ permutations of $(x_1, \ldots, x_n)$. Formalize this intuition by constructing a $(T, \pi)$-disintegration for $\mathbb{P}$. Warning: Not as easy as it seems. What is the $(\mathcal{T}, \mathcal{B})$ for this problem? What happens if the empirical measure is not supported by $n$ distinct points? How should you define $\mathbb{P}_t$ when $t$ does not correspond to a possible realization of an empirical measure?
9. Notes
The idea of defining abstract conditional probabilities and expectations as Radon-Nikodym derivatives is due to Kolmogorov (1933, Chapter 5).

(Regular) conditional distributions have a more complicated history. Loève (1978, Section 30.2) mentioned that the problem of existence was "investigated principally by Doob," but he cited no specific reference. Doob (1953, page 624) cited a counterexample to the unrestricted existence of a regular conditional probability, which also appears in the exercises to Section 48 of the 1969 printing of Halmos (1950). Doob's remarks suggest that the original edition of the Halmos book contained a slightly weaker form of the counterexample. Doob also noted that the counterexample destroyed a claim made in Doob (1938), an error pointed out by Dieudonné (no citation) and Andersen & Jessen (1948). Blackwell (1956) cited Dieudonné (1948) as the source of a counterexample to the unrestricted existence of regular conditional probabilities. Blackwell also proved existence of regular conditional distributions for (what are now known as) Blackwell spaces.
In point process theory, disintegrations appear as Palm distributions: conditional distributions given a point of the process at a particular position (Kallenberg 1969). Pfanzagl (1979) gave conditions under which a regular conditional distribution can be obtained by means of the elementary limit of ratios of probabilities. The existence of limits of carefully chosen ratios can also be established by martingale methods (see Chapter 6).
The Barndorff-Nielsen, Blæsild & Eriksen (1989) book contains much material on the invariance properties of conditional distributions.

Halmos & Savage (1949) cited a 1935 paper by Neyman, which I have not seen, as the source of the factorization criterion for (weak) sufficiency. See Bahadur (1954) for a detailed discussion of sufficiency. Example <30> is based on a meatier counterexample of Burkholder (1961). The traditional notion of sufficiency (what I have called weak sufficiency) has strange behavior for undominated families. I learned much about the subtleties of conditioning while working on the paper Chang & Pollard (1997), where we explored quite a range of statistical applications.

The result in Problem [4] is taken from Le Cam (1973) (see also Le Cam 1986, Section 16.4), who used it to establish asymptotic results in the theory of estimation and testing. Donoho & Liu (1991) adapted Le Cam's ideas to establish results about achievability of lower bounds for minimax rates of convergence of estimators.
REFERENCES
Le Cam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, New York.
Lehmann, E. L. (1959), Testing Statistical Hypotheses, Wiley, New York. Later edition published by Chapman and Hall.
Loève, M. (1978), Probability Theory, Springer, New York. Fourth edition, Part II.
Pachl, J. (1978), 'Disintegration and compact measures', Mathematica Scandinavica 43, 157-168.
Pfanzagl, J. (1979), 'Conditional distributions as derivatives', Annals of Probability 7, 1046-1050.
Solomon, H. (1978), Geometric Probability, NSF-CBMS Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics.
Chapter 6
Martingale et al.
SECTION 1 gives some examples of martingales, submartingales, and supermartingales.
SECTION 2 introduces stopping times and the sigma-fields corresponding to "information
available at a random time." A most important Stopping Time Lemma is proved,
extending the martingale properties to processes evaluated at stopping times.
SECTION 3 shows that positive supermartingales converge almost surely.
SECTION 4 presents a condition under which a submartingale can be written as a
difference between a positive martingale and a positive supermartingale (the Krickeberg
decomposition). A limit theorem for submartingales then follows.
SECTION *5 proves the Krickeberg decomposition.
SECTION *6 defines uniform integrability and shows how uniformly integrable martingales
are particularly well behaved.
SECTION *7 shows that martingale theory works just as well when time is reversed.
SECTION *8 uses reverse martingale theory to study exchangeable probability measures on
infinite product spaces. The de Finetti representation and the Hewitt-Savage zero-one
law are proved.
1.
The set $T$ has the interpretation of time, the sigma-field $\mathcal{F}_t$ has the interpretation
of information available at time $t$, and $X_t$ denotes some random quantity whose
value $X_t(\omega)$ is revealed at time $t$.

<1>   (MG)   $X_s = \mathbb{P}(X_t \mid \mathcal{F}_s)$  almost surely, for all $s < t$,

with the equivalent formulation

(MG)'   $\mathbb{P}X_sF = \mathbb{P}X_tF$  for all $F \in \mathcal{F}_s$, all $s < t$.
REMARK.
Often the filtration is fixed throughout an argument, or the particular
choice of filtration is not important for some assertion about the random variables. In
such cases it is easier to talk about a martingale $\{X_t : t \in T\}$ without explicit mention
of that filtration. If in doubt, we could always work with the natural filtration,
$\mathcal{F}_t := \sigma\{X_s : s \le t\}$, which takes care of adaptedness, by definition.
Analogously, if there is a need to identify the filtration explicitly, it is convenient
to speak of a martingale $\{(X_t,\mathcal{F}_t) : t \in T\}$, and so on.
Property (MG) has the interpretation that $X_s$ is the best predictor for $X_t$ based
on the information available at time $s$. The equivalent formulation (MG)' is a
minor repackaging of the definition of the conditional expectation $\mathbb{P}(X_t \mid \mathcal{F}_s)$. The
$\mathcal{F}_s$-measurability of $X_s$ comes as part of the adaptation assumption. Approximation
by simple functions, and a passage to the limit, gives another equivalence,

<2>   (MG)''   $\mathbb{P}X_sZ = \mathbb{P}X_tZ$  for all $Z \in \mathcal{M}_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$,

where $\mathcal{M}_{\mathrm{bdd}}(\mathcal{F}_s)$ denotes the set of all bounded, $\mathcal{F}_s$-measurable random variables.
The formulations (MG)' and (MG)'' have the advantage of removing the slippery
concept of conditioning on sigma-fields from the definition of a martingale. One
could develop much of the basic theory without explicit mention of conditioning,
which would have some pedagogic advantages, even though it would obscure one
of the important ideas behind the martingale concept.
Several of the desirable properties of martingales are shared by families of
random variables for which the defining equalities (MG) and (MG)' are relaxed to
inequalities. I find that one of the hardest things to remember about these martingale
relatives is which name goes with which direction of the inequality.

Definition. A family of integrable random variables $\{X_t : t \in T\}$ adapted to a
filtration $\{\mathcal{F}_t : t \in T\}$ is said to be a submartingale (for that filtration) if it satisfies
any (and hence all) of the following equivalent conditions:
(subMG)   $X_s \le \mathbb{P}(X_t \mid \mathcal{F}_s)$  almost surely, for all $s < t$;
(subMG)'   $\mathbb{P}X_sF \le \mathbb{P}X_tF$  for all $F \in \mathcal{F}_s$, all $s < t$;
(subMG)''   $\mathbb{P}X_sZ \le \mathbb{P}X_tZ$  for all nonnegative $Z$ in $\mathcal{M}_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$.
The family is said to be a supermartingale if $\{-X_t : t \in T\}$ is a submartingale, that is, if the analogous conditions with reversed inequalities hold.
REMARK.
It is largely a matter of taste, or convenience of notation for particular
applications, whether one works primarily with submartingales or supermartingales.
For most of this chapter, the index set $T$ will be discrete: either finite, or equal
to $\mathbb{N}$, the set of positive integers, or equal to one of
$\mathbb{N}_0 := \{0\} \cup \mathbb{N}$  or  $\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$  or  $\overline{\mathbb{N}}_0 := \mathbb{N}_0 \cup \{\infty\}$.
For some purposes it will be useful to have a distinctively labelled first or last element
in the index set. For example, if a limit $X_\infty := \lim_{n\in\mathbb{N}} X_n$ can be shown to exist, it is
natural to ask whether $\{X_n : n \in \overline{\mathbb{N}}\}$ also has sub- or supermartingale properties. Of
course such a question only makes sense if a corresponding sigma-field $\mathcal{F}_\infty$ exists.
If it is not otherwise defined, I will take $\mathcal{F}_\infty$ to be the sigma-field $\sigma(\cup_{t<\infty}\mathcal{F}_t)$.
Continuous time theory, where $T$ is a subinterval of $\mathbb{R}$, tends to be more
complicated than discrete time. The difficulties arise, in part, from problems
related to management of uncountable families of negligible sets associated with
uncountable collections of almost sure equality or inequality assertions. A nontrivial
part of the continuous time theory deals with sample path properties, that is, with
the behavior of a process $X_t(\omega)$ as a function of $t$ for fixed $\omega$, or with properties
of $X$ as a function of two variables. Such properties are typically derived from
probabilistic assertions about finite or countable subfamilies of the $\{X_t\}$ random
variables. An understanding of the discrete-time theory is an essential prerequisite
for more ambitious undertakings in continuous time; see Appendix E.
For discrete time, the (MG)' property becomes
$\mathbb{P}X_nF = \mathbb{P}X_mF$  for all $F \in \mathcal{F}_n$, all $n < m$.
It suffices to check the equality for $m = n+1$, with $n \in \mathbb{N}_0$, for then repeated
appeals to the special case extend the equality to $m = n+2$, then $m = n+3$, and so
on. A similar simplification applies to submartingales and supermartingales.
<3> Example. Let $\xi_1, \xi_2, \ldots$ be independent, integrable random variables with $\mathbb{P}\xi_i = 0$ for each $i$. The partial sums $X_n := \xi_1 + \cdots + \xi_n$ form a martingale for the natural filtration: almost surely,
<4>   $\mathbb{P}(X_n \mid \mathcal{F}_{n-1}) = X_{n-1} + \mathbb{P}(\xi_n \mid \mathcal{F}_{n-1}) = X_{n-1}$,
by independence: $\xi_n$ is independent of $X_{n-1}$, or of $\xi_1, \ldots, \xi_{n-1}$, if we work with the natural filtration.
<5> Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale and let $\Psi$ be a convex function for
which each $\Psi(X_n)$ is integrable. Then $\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale: the
required almost sure inequality, $\mathbb{P}(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi(X_{n-1})$, is a direct application
of the conditional expectation form of Jensen's inequality.
The companion result for submartingales is: if the convex function $\Psi$ is
increasing, if $\{X_n\}$ is a submartingale, and if each $\Psi(X_n)$ is integrable, then
$\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale, because
$\mathbb{P}(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi\big(\mathbb{P}(X_n \mid \mathcal{F}_{n-1})\big) \ge \Psi(X_{n-1})$  almost surely.  □
<6> Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale and let $\{H_n : n \in \mathbb{N}\}$ be a predictable sequence of bounded random variables, meaning that each $H_n$ is $\mathcal{F}_{n-1}$-measurable. The sequence $Y_n := X_0 + \sum_{i\le n} H_i(X_i - X_{i-1})$ is a new martingale: for $F \in \mathcal{F}_{n-1}$,
$\mathbb{P}(Y_n - Y_{n-1})F = \mathbb{P}H_n(X_n - X_{n-1})F$,
which equals zero by a simple generalization of (MG)''. (Use Dominated Convergence to accommodate integrable $Z$.) If $\{X_n : n \in \mathbb{N}_0\}$ is just a submartingale, a
similar argument shows that the new sequence is also a submartingale, provided the
predictable weights are also nonnegative.  □
Example. Suppose $X$ is an integrable random variable and $\{\mathcal{F}_t : t \in T\}$ is a
filtration. Define $X_t := \mathbb{P}(X \mid \mathcal{F}_t)$. Then the family $\{X_t : t \in T\}$ is a martingale with
respect to the filtration, because for $s < t$,
$\mathbb{P}(X_tF) = \mathbb{P}(XF)$  if $F \in \mathcal{F}_t$,
$\mathbb{P}(X_sF) = \mathbb{P}(XF)$  if $F \in \mathcal{F}_s$.
For $F \in \mathcal{F}_s \subseteq \mathcal{F}_t$ both equalities hold, whence $\mathbb{P}(X_sF) = \mathbb{P}(X_tF)$.
(We have just reproved the formula for conditioning on nested sigma-fields.)  □
<7> Example. Every sequence $\{X_n : n \in \mathbb{N}_0\}$ of integrable random variables adapted to
a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ can be broken into a sum of a martingale plus a sequence of
accumulated conditional expectations. To establish this fact, consider the increments
$\xi_n := X_n - X_{n-1}$. Each $\xi_n$ is integrable, but it need not have zero conditional
expectation given $\mathcal{F}_{n-1}$, the property that characterizes martingale differences.
Extraction of the martingale component is merely a matter of recentering the
increments to zero conditional expectations. Define $\eta_n := \mathbb{P}(\xi_n \mid \mathcal{F}_{n-1})$ and
$M_n := X_0 + (\xi_1 - \eta_1) + \cdots + (\xi_n - \eta_n),$
$A_n := \eta_1 + \cdots + \eta_n.$
Then $X_n = M_n + A_n$, with $\{M_n\}$ a martingale and each $A_n$ measurable with respect to $\mathcal{F}_{n-1}$; for a submartingale the $\eta_n$ are nonnegative, and $\{A_n\}$ is increasing.  □
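The recentering can be watched numerically. The following sketch is mine, not from the text: it uses the submartingale $X_n := S_n^2$ built from a fair $\pm1$ random walk, for which the conditional expectation of each increment is exactly 1, so $A_n = n$ and $M_n = S_n^2 - n$.

```python
import random

# Sketch (illustration only): Doob decomposition of the submartingale
# X_n = S_n**2, with S_n a fair +/-1 random walk.  Here
# eta_n = E(X_n - X_{n-1} | F_{n-1}) = 1, so A_n = n and the
# martingale part is M_n = S_n**2 - n.

def doob_path(n_steps, rng):
    """Return the path [(X_n, M_n, A_n)] for one simulated walk."""
    s, path = 0, []
    for n in range(1, n_steps + 1):
        s += rng.choice((-1, 1))
        x = s * s        # submartingale value X_n
        a = n            # predictable increasing part A_n
        path.append((x, x - a, a))
    return path

rng = random.Random(0)
path = doob_path(50, rng)
# Pathwise, X_n = M_n + A_n by construction.
print(all(x == m + a for x, m, a in path))

# The martingale property shows up in the mean: E M_50 = 0.
mean_m = sum(doob_path(50, rng)[-1][1] for _ in range(20000)) / 20000
print(abs(mean_m) < 2.0)
```

The Monte Carlo average of $M_{50}$ hovers near zero, while the pathwise identity $X_n = M_n + A_n$ holds exactly.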
REMARK.
The representation of a submartingale as a martingale plus an
increasing, predictable process is sometimes called the Doob decomposition. The
corresponding representation for continuous time, which is exceedingly difficult to
establish, is called the Doob-Meyer decomposition.
2.
Stopping times
The martingale property requires equalities $\mathbb{P}X_sF = \mathbb{P}X_tF$, for $s < t$ and $F \in \mathcal{F}_s$.
Much of the power of the theory comes from the fact that analogous inequalities
hold when $s$ and $t$ are replaced by certain types of random times. To make sense
of the broader assertion, we need to define objects such as $\mathcal{F}_\tau$ and $X_\tau$ for random
times $\tau$.
<8> Definition. A random variable $\tau$ taking values in $\overline{\mathbb{N}}_0$ is called a stopping time for a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ if $\{\tau \le n\} \in \mathcal{F}_n$ for each $n \in \mathbb{N}_0$.
<9> Example. Suppose $\{X_n : n \in \mathbb{N}_0\}$ is adapted to a filtration $\{\mathcal{F}_n\}$ and $B$ is a Borel set. The random variable $\tau(\omega) := \inf\{n : X_n(\omega) \in B\}$, with $\inf\emptyset := \infty$, is a stopping time, because
$\{\tau \le n\} = \cup_{i\le n}\{X_i \in B\} \in \mathcal{F}_n$  for $n \in \mathbb{N}_0$.
It is called the first hitting time of the set $B$. Do you see why it is convenient to
allow stopping times to take the value $+\infty$?
If $\mathcal{F}_i$ corresponds to the information available up to time $i$, how should we
define a sigma-field $\mathcal{F}_\tau$ to correspond to information available up to a random
time $\tau$? Intuitively, on the part of $\Omega$ where $\tau = i$ the sets in the sigma-field $\mathcal{F}_\tau$
should be the same as the sets in the sigma-field $\mathcal{F}_i$. That is, we could hope that
$\{F\{\tau=i\} : F \in \mathcal{F}_\tau\} = \{F\{\tau=i\} : F \in \mathcal{F}_i\}$  for each $i$,
which is equivalent to the requirement
<10>   $F\{\tau=i\} \in \mathcal{F}_i$  for all $i \in \mathbb{N}_0$.
For continuous time such a definition could become vacuous if all the sets $\{\tau=t\}$
were negligible, as sometimes happens. Instead, it is better to work with a definition
that makes sense in both discrete and continuous time, and which is equivalent
to <10> in discrete time.
<li>
Definition. Let x be a stopping time for a filtration {7t : t e T}9 taking values in
f := T U {oo}. If the sigmafield Joo is not already defined, take it to be a ( U l r J r ) .
The prex sigmafield 7X is defined to consist of all F for which F{x < t] 7t for
all t e T.
The class $\mathcal{F}_\tau$ would not be a sigma-field if $\tau$ were not a stopping time: the
property $\Omega \in \mathcal{F}_\tau$ requires $\{\tau \le t\} \in \mathcal{F}_t$ for all $t$.
REMARK.
Notice that $\mathcal{F}_\tau \subseteq \mathcal{F}_\infty$ (because $F\{\tau \le \infty\} \in \mathcal{F}_\infty$ if $F \in \mathcal{F}_\tau$), with
equality when $\tau \equiv \infty$. More generally, if $\tau$ takes a constant value, $t$, then $\mathcal{F}_\tau = \mathcal{F}_t$.
It would be very awkward if we had to distinguish between random variables taking
constant values and the constants themselves.
<12> Example. Every stopping time $\tau$ is $\mathcal{F}_\tau$-measurable: for each $\alpha \in \mathbb{R}^+$ and $t \in T$, the set $F := \{\tau \le \alpha\}$ satisfies $F\{\tau \le t\} = \{\tau \le \alpha \wedge t\} \in \mathcal{F}_t$.
That is, $\{\tau \le \alpha\} \in \mathcal{F}_\tau$ for all $\alpha \in \mathbb{R}^+$, from which the $\mathcal{F}_\tau$-measurability follows by
the usual generating class argument. It would be counterintuitive if the information
corresponding to the sigma-field $\mathcal{F}_\tau$ did not include the value taken by $\tau$ itself.  □
<13> Example. Suppose $\sigma$ and $\tau$ are both stopping times, for which $\sigma \le \tau$ always.
Then $\mathcal{F}_\sigma \subseteq \mathcal{F}_\tau$ because
$F\{\tau \le t\} = \big(F\{\sigma \le t\}\big)\{\tau \le t\}$  for all $t \in T$,
and both sets on the right-hand side are $\mathcal{F}_t$-measurable if $F \in \mathcal{F}_\sigma$.  □
<14> Exercise. Show that a random variable $Z$ is $\mathcal{F}_\tau$-measurable if and only if $Z\{\tau \le t\}$
is $\mathcal{F}_t$-measurable for all $t$ in $T$.
SOLUTION: For necessity, write $Z$ as a pointwise limit of $\mathcal{F}_\tau$-measurable simple
functions $Z_n$, then note that each $Z_n\{\tau \le t\}$ is a linear combination of indicator
functions of sets $F\{\tau \le t\}$ with $F \in \mathcal{F}_\tau$, each of which is $\mathcal{F}_t$-measurable.
For sufficiency, it is enough to show that $\{Z > \alpha\} \in \mathcal{F}_\tau$ and $\{Z < -\alpha\} \in \mathcal{F}_\tau$, for
each $\alpha \in \mathbb{R}^+$. For the first requirement, note that $\{Z > \alpha\}\{\tau \le t\} = \{Z\{\tau \le t\} > \alpha\}$,
which belongs to $\mathcal{F}_t$ for each $t$, because $Z\{\tau \le t\}$ is assumed to be $\mathcal{F}_t$-measurable.
Thus $\{Z > \alpha\} \in \mathcal{F}_\tau$. Argue similarly for the other requirement.  □
The definition of $X_\tau$ is almost straightforward. Given random variables
$\{X_t : t \in T\}$ and a stopping time $\tau$, we should define $X_\tau$ as the function taking
the value $X_t(\omega)$ when $\tau(\omega) = t$. If $\tau$ takes only values in $T$ there is no problem.
However, a slight embarrassment would occur when $\tau(\omega) = \infty$ if $\infty$ were not
a point of $T$, for then $X_\infty(\omega)$ need not be defined. In the happy situation when
there is a natural candidate for $X_\infty$, the embarrassment disappears with little fuss;
otherwise it is wiser to avoid the difficulty altogether by working only with the
random variable $X_\tau\{\tau < \infty\}$, which takes the value zero when $\tau$ is infinite.
Measurability of $X_\tau\{\tau < \infty\}$, even with respect to the sigma-field $\mathcal{F}_\infty$, requires
further assumptions about the $\{X_t\}$ for continuous time. For discrete time the task is
much easier. For example, if $\{X_n : n \in \mathbb{N}_0\}$ is adapted to a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$,
and $\tau$ is a stopping time for that filtration, then
$X_\tau\{\tau < \infty\}\{\tau \le t\} = \sum_{i\in\mathbb{N}_0} X_i\{\tau = i\}\{\tau \le t\}.$
For $i > t$ the $i$th summand is zero; for $i \le t$ it equals $X_i\{\tau = i\}$, which is
$\mathcal{F}_i$-measurable. The $\mathcal{F}_\tau$-measurability of $X_\tau\{\tau < \infty\}$ then follows by Exercise <14>.
The next Exercise illustrates the use of stopping times and the sigma-fields they
define. The discussion does not directly involve martingales, but they are lurking in
the background.
<15> Exercise. A deck of 52 cards (26 red, 26 black) is dealt out one card at a time,
face up. Once, and only once, you will be allowed to predict that the next card will
be red. What strategy will maximize your probability of predicting correctly?
SOLUTION: Write $R_i$ for the event $\{i\text{th card is red}\}$. Assume all permutations of
the deck are equally likely initially. Write $\mathcal{F}_n$ for the sigma-field generated by $R_1, \ldots, R_n$.
A strategy corresponds to a stopping time $\tau$ that takes values in $\{0, 1, \ldots, 51\}$: you
should try to maximize $\mathbb{P}R_{\tau+1}$.
Surprisingly, $\mathbb{P}R_{\tau+1} = 1/2$ for all such stopping rules. The intuitive explanation
is that you should always be indifferent, given that you have observed cards
$1, 2, \ldots, \tau$, between choosing card $\tau+1$ or choosing card 52. That is, it should
be true that $\mathbb{P}(R_{\tau+1} \mid \mathcal{F}_\tau) = \mathbb{P}(R_{52} \mid \mathcal{F}_\tau)$ almost surely; or, equivalently, that
$\mathbb{P}R_{\tau+1}F = \mathbb{P}R_{52}F$ for all $F \in \mathcal{F}_\tau$; or, equivalently, that
$\mathbb{P}R_{k+1}F\{\tau = k\} = \mathbb{P}R_{52}F\{\tau = k\}$  for $k = 0, 1, \ldots, 51$.
We could then deduce that $\mathbb{P}R_{\tau+1} = \mathbb{P}R_{52} = 1/2$. Of course, we only need the
case $F = \Omega$, but I'll carry along the general $F$ as an illustration of technique while
proving the assertion in the last display. By definition of $\mathcal{F}_\tau$,
$F\{\tau = k\} = F\{\tau \le k-1\}^c\{\tau \le k\} \in \mathcal{F}_k.$
That is, $F\{\tau = k\}$ must be of the form $\{(R_1, \ldots, R_k) \in B\}$ for some Borel
subset $B$ of $\mathbb{R}^k$. Symmetry of the joint distribution of $R_1, \ldots, R_{52}$ implies that the
random vector $(R_1, \ldots, R_k, R_{k+1})$ has the same distribution as the random vector
$(R_1, \ldots, R_k, R_{52})$, whence
$\mathbb{P}R_{k+1}\{(R_1, \ldots, R_k) \in B\} = \mathbb{P}R_{52}\{(R_1, \ldots, R_k) \in B\}.$  □
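The surprising claim is easy to probe by simulation. The sketch below is mine, not from the text, and the particular stopping rule (predict as soon as strictly more red than black cards remain; otherwise predict the last card) is an arbitrary, hypothetical choice: any rule should give success probability $1/2$.

```python
import random

# Sketch (hypothetical strategy, for illustration only): predict as
# soon as strictly more red than black cards remain; if that never
# happens, predict the 52nd card (tau = 51).

def play(rng):
    deck = [1] * 26 + [0] * 26   # 1 = red, 0 = black
    rng.shuffle(deck)
    reds_left = 26
    for seen in range(52):
        blacks_left = (52 - seen) - reds_left
        if reds_left > blacks_left or seen == 51:
            return deck[seen]     # our prediction: card number seen+1
        reds_left -= deck[seen]   # observe the card and keep dealing
    raise AssertionError("unreachable")

rng = random.Random(1)
trials = 100000
wins = sum(play(rng) for _ in range(trials))
print(abs(wins / trials - 0.5) < 0.01)
```

Over many shuffled decks the success frequency settles near $1/2$, matching the martingale argument.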
The martingale property tells us that $\mathbb{P}X_0 = \mathbb{P}X_i$ for $i = 1, \ldots, 51$. If we could
extend the equality to random $i$, by showing that
$\mathbb{P}X_\tau = \sum_i \mathbb{P}X_i\{\tau = i\} = \mathbb{P}X_0$
(the decomposition makes sense because $\{\tau = i\} \in \mathcal{F}_i$), then the surprising
conclusion from the Exercise would follow.
Clearly it would be useful if we could always assert that $\mathbb{P}X_\sigma = \mathbb{P}X_\tau$ for
every martingale, and every pair of stopping times. Unfortunately (or should I say
fortunately?) the result is not true without some extra assumptions. The simplest
and most useful case concerns finite time sets. If $\sigma$ takes values in a finite set $T$, and
if each $X_t$ is integrable, then $|X_\sigma| \le \sum_{t\in T}|X_t|$, which eliminates any integrability
difficulties. For an infinite index set, the integrability of $X_\sigma$ is not automatic.
<16> Stopping Time Lemma. Suppose $\sigma$ and $\tau$ are stopping times for a filtration
$\{\mathcal{F}_t : t \in T\}$, with $T$ finite. Suppose both stopping times take only values in $T$.
Let $F$ be a set in $\mathcal{F}_\sigma$ for which $\sigma(\omega) \le \tau(\omega)$ when $\omega \in F$. If $\{X_t : t \in T\}$ is
a submartingale, then $\mathbb{P}X_\sigma F \le \mathbb{P}X_\tau F$. For supermartingales, the inequality is
reversed. For martingales, the inequality becomes an equality.
Proof. Consider only the submartingale case. For simplicity of notation, suppose
$T = \{0, 1, \ldots, N\}$. Write each $X_n$ as a sum of increments, $X_n = X_0 + \xi_1 + \cdots + \xi_n$.
The inequality $\sigma \le \tau$, on $F$, lets us write
$X_\tau F = X_\sigma F + \sum_{1\le i\le N}\xi_i F\{\sigma < i \le \tau\}.$
Note that $\{\sigma < i \le \tau\}F = \big(\{\sigma \le i-1\}F\big)\{\tau \le i-1\}^c \in \mathcal{F}_{i-1}$. The expected value
of each summand is nonnegative, by (subMG)'.  □
REMARK.
If $\sigma \le \tau$ everywhere, the inequality for all $F$ in $\mathcal{F}_\sigma$ implies that
$X_\sigma \le \mathbb{P}(X_\tau \mid \mathcal{F}_\sigma)$ almost surely. That is, the submartingale (or martingale, or
supermartingale) property is preserved at bounded stopping times.
The Stopping Time Lemma, and its extensions to various cases with infinite
index sets, is basic to many of the most elegant martingale properties. Results for
a general stopping time $\tau$, taking values in $\overline{\mathbb{N}}$ or $\overline{\mathbb{N}}_0$, can often be deduced from
results for $\tau \wedge N$, followed by a passage to the limit as $N$ tends to infinity. (The
random variable $\tau \wedge N$ is a stopping time, because $\{\tau \wedge N \le n\}$ equals the whole
of $\Omega$ when $N \le n$, and equals $\{\tau \le n\}$ when $N > n$.) As Problem [1] shows, the
finiteness assumption on the index set $T$ is not just a notational convenience;
Lemma <16> can fail for infinite $T$.
It is amazing how many of the classical inequalities of probability theory can
be derived by invoking the Lemma for a suitable martingale (or submartingale or
supermartingale).
<17> Exercise. Let $\xi_1, \ldots, \xi_N$ be independent random variables (or even just martingale
increments) for which $\mathbb{P}\xi_i = 0$ and $\mathbb{P}\xi_i^2 < \infty$ for each $i$. Define $S_i := \xi_1 + \cdots + \xi_i$.
Prove the Kolmogorov maximal inequality: for each $\epsilon > 0$,
$\mathbb{P}\{\max_{1\le i\le N}|S_i| \ge \epsilon\} \le \mathbb{P}S_N^2/\epsilon^2.$
SOLUTION: The sequence $X_i := S_i^2$ is a submartingale. Define the stopping time $\sigma$
as the first $i$ for which $|S_i| \ge \epsilon$, with $\sigma := N$ if no such $i$ exists. On the set
$\{\max_i|S_i| \ge \epsilon\}$ we have $X_\sigma \ge \epsilon^2$, so that $\epsilon^2\{\max_i|S_i| \ge \epsilon\} \le X_\sigma$.
What happens in the case when $\sigma$ equals $N$ because $|S_i| < \epsilon$ for every $i$? Take
expectations, then invoke the Stopping Time Lemma (with $F = \Omega$ and $\tau := N$) for the
submartingale $\{X_i\}$, to deduce
$\epsilon^2\,\mathbb{P}\{\max_i|S_i| \ge \epsilon\} \le \mathbb{P}X_\sigma \le \mathbb{P}X_N = \mathbb{P}S_N^2,$
as asserted.  □
Notice how the Kolmogorov inequality improves upon the elementary bound
$\mathbb{P}\{|S_N| \ge \epsilon\} \le \mathbb{P}S_N^2/\epsilon^2$. Actually it is the same inequality, applied to $S_\sigma$ instead
of $S_N$, supplemented by a useful bound for $\mathbb{P}S_\sigma^2$ made possible by the submartingale
property. Kolmogorov (1928) established his inequality as the first step towards a
proof of various convergence results for sums of independent random variables.
More versatile maximal inequalities follow from more involved appeals to the
Stopping Time Lemma. For example, a strong law of large numbers can be proved
quite efficiently (Bauer 1981, Section 6.3) by an appeal to the next inequality.
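A quick numerical sanity check of the maximal inequality (my sketch, not part of the text), using $\pm1$ increments so that $\mathbb{P}S_N^2 = N$:

```python
import random

# Sketch: check Kolmogorov's maximal inequality
#   P{max_{i<=N} |S_i| >= eps} <= E S_N^2 / eps^2
# for +/-1 increments, where E S_N^2 = N.

def max_abs_partial_sum(n, rng):
    s, best = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        best = max(best, abs(s))
    return best

rng = random.Random(2)
N, eps, trials = 25, 10.0, 20000
hits = sum(max_abs_partial_sum(N, rng) >= eps for _ in range(trials))
lhs = hits / trials
rhs = N / eps ** 2
print(lhs <= rhs)
```

With $N = 25$ and $\epsilon = 10$ the bound is $0.25$, comfortably above the simulated frequency.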
<18> Exercise. Let $\xi_1, \xi_2, \ldots$ be independent random variables with $\mathbb{P}\xi_i = 0$ and $\mathbb{P}\xi_i^2 < \infty$ for each $i$, and let $S_i := \xi_1 + \cdots + \xi_i$, with $S_0 := 0$. For nonincreasing positive constants $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_n$, prove that
$\mathbb{P}\{\max_{1\le k\le n}\gamma_k|S_k| \ge 1\} \le \sum_{1\le i\le n}\gamma_i^2\,\mathbb{P}\xi_i^2.$
SOLUTION: Define $\mathcal{F}_i := \sigma(S_1, \ldots, S_i)$. Write $\mathbb{P}_i$ for $\mathbb{P}(\cdot \mid \mathcal{F}_i)$. Define $\eta_i :=
\gamma_i^2S_i^2 - \gamma_{i-1}^2S_{i-1}^2$ and $A_i := \mathbb{P}_{i-1}\eta_i$. By the Doob decomposition from Example <7>,
the sequence $M_k := \sum_{i=1}^k(\eta_i - A_i)$ is a martingale with respect to the filtration $\{\mathcal{F}_i\}$;
and $\gamma_k^2S_k^2 = (A_1 + \cdots + A_k) + M_k$. Define stopping times $\sigma := 0$ and
$\tau := n \wedge \inf\{k : \gamma_k|S_k| \ge 1\}$.
The asserted inequality now follows via the Stopping Time Lemma:
$\mathbb{P}\{\max_k\gamma_k|S_k| \ge 1\} \le \mathbb{P}\gamma_\tau^2S_\tau^2 = \mathbb{P}M_\tau + \mathbb{P}\sum_{i\le\tau}A_i \le \sum_{i\le n}\gamma_i^2\,\mathbb{P}\xi_i^2,$
because $\mathbb{P}M_\tau = \mathbb{P}M_\sigma = 0$ and $A_i \le \gamma_i^2\,\mathbb{P}_{i-1}\xi_i^2$, by the monotonicity of the $\gamma_i$.  □
3.
Let $\{X_n : n \in \mathbb{N}_0\}$ be a positive supermartingale adapted to a filtration $\{\mathcal{F}_i\}$, and let $0 \le \alpha < \beta$ be constants. Define stopping times
$\sigma_1 := \inf\{i \ge 0 : X_i \le \alpha\}, \qquad \tau_1 := \inf\{i > \sigma_1 : X_i \ge \beta\},$
$\sigma_2 := \inf\{i > \tau_1 : X_i \le \alpha\}, \qquad \tau_2 := \inf\{i > \sigma_2 : X_i \ge \beta\},$
and so on, with the convention that the infimum of an empty set is taken as $+\infty$.
Because the $\{X_i\}$ are adapted to $\{\mathcal{F}_i\}$, each $\sigma_i$ and $\tau_i$ is a stopping time for the
filtration. For example,
$\{\tau_1 \le k\} = \{X_i \le \alpha,\ X_j \ge \beta\ \text{for some } i < j \le k\},$
which could be written out explicitly as a finite union of events involving only
$X_0, \ldots, X_k$.
When $\tau_k$ is finite, the segment $\{X_i : \sigma_k \le i \le \tau_k\}$ is called the $k$th upcrossing
of the interval $[\alpha, \beta]$ by the process $\{X_n : n \in \mathbb{N}_0\}$. The event $\{\tau_k \le N\}$ may
be described, slightly informally, by saying that the process completes at least $k$
upcrossings of $[\alpha, \beta]$ up to time $N$.
<20> Theorem (Dubins's inequality). For a positive supermartingale $\{X_n : n \in \mathbb{N}_0\}$ and
constants $0 \le \alpha < \beta$,
$\mathbb{P}\{\tau_k < \infty\} \le (\alpha/\beta)^k$  for $k \in \mathbb{N}$.
Proof. Choose, and temporarily hold fixed, a finite positive integer $N$. Define $\tau_0$
to be identically zero. For $k \ge 1$, using the facts that $X_{\tau_k} \ge \beta$ when $\tau_k < \infty$ and
$X_{\sigma_k} \le \alpha$ when $\sigma_k < \infty$, we have
$\mathbb{P}\big(\beta\{\tau_k \le N\} + X_N\{\tau_k > N\}\big) \le \mathbb{P}X_{\tau_k\wedge N} \le \mathbb{P}X_{\sigma_k\wedge N} \le \mathbb{P}\big(\alpha\{\sigma_k \le N\} + X_N\{\sigma_k > N\}\big).$
That is, because $X_N \ge 0$ and $\{\sigma_k > N\} \subseteq \{\tau_k > N\}$,
$\beta\,\mathbb{P}\{\tau_k \le N\} \le \alpha\,\mathbb{P}\{\sigma_k \le N\} \le \alpha\,\mathbb{P}\{\tau_{k-1} \le N\}$  for $k \ge 1$.
Repeated appeals to this inequality, followed by a passage to the limit as $N \to \infty$,
leads to Dubins's inequality.  □
REMARK.
When $0 = \alpha < \beta$ we have $\mathbb{P}\{\tau_1 < \infty\} = 0$. By considering a
sequence of $\beta$ values decreasing to zero, we deduce that on the set $\{\sigma_1 < \infty\}$ we
must have $X_n = 0$ for all $n \ge \sigma_1$. That is, if a positive supermartingale hits zero
then it must stay there forever.
Notice that the main part of the argument, before $N$ was sent off to infinity,
involved only the variables $X_0, \ldots, X_N$. The result may fruitfully be reexpressed as
an assertion about positive supermartingales with a finite index set.
<21> Corollary. Let $\{X_n : 0 \le n \le N\}$ be a positive supermartingale with a finite index set. The probability that the sequence $X_0, X_1, \ldots, X_N$ completes at least $k$ upcrossings of $[\alpha, \beta]$ is at most $(\alpha/\beta)^k$.
<22> Theorem. Every positive supermartingale $\{X_n : n \in \mathbb{N}_0\}$ converges almost surely to an integrable limit $X_\infty$.
Proof. Let $D$ be the set on which $X_n$ does not converge (with $+\infty$ allowed as a limit). Then $D = \cup_{\alpha,\beta}D_{\alpha,\beta}$, where $D_{\alpha,\beta} := \{\liminf_nX_n < \alpha < \beta < \limsup_nX_n\}$,
with $\alpha, \beta$ ranging over all pairs of rational numbers. On $D_{\alpha,\beta}$ we must have $\tau_k < \infty$
for every $k$. Thus $\mathbb{P}D_{\alpha,\beta} \le (\alpha/\beta)^k$ for every $k$, which forces $\mathbb{P}D_{\alpha,\beta} = 0$, and $\mathbb{P}D = 0$.
The sequence $X_n$ converges to $X_\infty := \liminf X_n$ on the set $D^c$. Fatou's lemma,
and the fact that $\mathbb{P}X_n$ is nonincreasing, ensure that $X_\infty$ is integrable.  □
<23> Exercise. Suppose $\{\xi_i\}$ are independent, identically distributed random variables with $\mathbb{P}\{\xi_i = +1\} = p$ and $\mathbb{P}\{\xi_i = -1\} = 1-p$, where $p < 1/2$. Define the partial
sums $S_0 := 0$ and $S_n := \xi_1 + \cdots + \xi_n$, and the stopping time $\tau := \inf\{n : S_n = 1\}$.
Show that $\mathbb{P}\{\tau < \infty\} = p/(1-p)$.
SOLUTION: Write $\theta$ for $p/(1-p)$, and define $X_n := \theta^{-S_n}$. Then $\{X_n\}$ is a martingale
for the natural filtration, because $\mathbb{P}\theta^{-\xi_n} = p\theta^{-1} + (1-p)\theta = 1$, whence
$\mathbb{P}X_nF = \mathbb{P}\theta^{-\xi_n}\,\mathbb{P}X_{n-1}F = \mathbb{P}X_{n-1}F$  for $F$ in $\mathcal{F}_{n-1}$.
The sequence $\{X_{\tau\wedge n}\}$ is a positive martingale (Problem [3]). It follows that there
exists an integrable $X_\infty$ such that $X_{\tau\wedge n} \to X_\infty$ almost surely. The sequence $\{S_n\}$
cannot converge to a finite limit because $|S_n - S_{n-1}| = 1$ for all $n$. On the set where
$\tau = \infty$, convergence of $\theta^{-S_n}$ to a finite limit is possible only if $S_n \to -\infty$ and $\theta^{-S_n} \to 0$.
Thus,
$X_{\tau\wedge n} \to \theta^{-1}\{\tau < \infty\} + 0\{\tau = \infty\}$
almost surely.
The bounds $0 \le X_{\tau\wedge n} \le \theta^{-1}$ allow us to invoke Dominated Convergence to deduce
that $1 = \mathbb{P}X_{\tau\wedge n} \to \theta^{-1}\mathbb{P}\{\tau < \infty\}$, whence $\mathbb{P}\{\tau < \infty\} = \theta$.
Monotonicity of $\mathbb{P}\{\tau < \infty\}$ as a function of $p$ extends the solution to $p = 1/2$.  □
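The value $p/(1-p)$ can be checked by simulation; the sketch below is mine, not from the text, and the finite horizon (an assumption of the sketch) biases the estimate slightly low. With $p = 1/3$ the martingale argument gives $\mathbb{P}\{\tau < \infty\} = 1/2$.

```python
import random

# Sketch: estimate P{tau < infinity} for the first hit of +1 by a
# random walk with P{xi = +1} = p = 1/3.  The martingale argument
# gives p/(1-p) = 1/2.  The finite horizon truncates a walk that
# drifts downward, so the estimate is biased slightly low.

def hits_plus_one(p, horizon, rng):
    s = 0
    for _ in range(horizon):
        s += 1 if rng.random() < p else -1
        if s == 1:
            return True
    return False

rng = random.Random(3)
trials = 20000
est = sum(hits_plus_one(1 / 3, 500, rng) for _ in range(trials)) / trials
print(abs(est - 0.5) < 0.02)
```

Almost all hits of $+1$ happen within the first few steps, so the truncation at 500 steps costs essentially nothing.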
The almost sure limit $X_\infty$ of a positive supermartingale $\{X_n\}$ satisfies the
inequality $\liminf_n\mathbb{P}X_n \ge \mathbb{P}X_\infty$, by Fatou. The sequence $\{\mathbb{P}X_n\}$ is decreasing. Under
what circumstances does it converge to $\mathbb{P}X_\infty$? Equality certainly holds if
$\{X_n\}$ converges to $X_\infty$ in $\mathcal{L}^1$ norm. In fact, convergence of expectations is equivalent
to $\mathcal{L}^1$ convergence, because
$\mathbb{P}|X_n - X_\infty| = \mathbb{P}(X_\infty - X_n)^+ + \mathbb{P}(X_\infty - X_n)^-,$
with $\mathbb{P}(X_\infty - X_n)^+ \to 0$ by Dominated Convergence (note that $(X_\infty - X_n)^+ \le X_\infty$),
and $\mathbb{P}(X_\infty - X_n)^- = \mathbb{P}(X_\infty - X_n)^+ - \mathbb{P}X_\infty + \mathbb{P}X_n$.
<25> Example. (Branching processes) Let $Z_n$ denote the size of the $n$th generation of a
population in which each individual, independently of all others, produces a random
number of offspring drawn from a fixed distribution $P$ on $\mathbb{N}_0$ with finite mean $\mu$.
Start with $Z_0 = 1$. One is tempted to write $Z_n$ as a sum $\xi_{n1} + \cdots + \xi_{nZ_{n-1}}$ of
independent offspring counts, and then to assert that $\mathbb{P}(Z_n \mid Z_{n-1}) = \mu Z_{n-1}$. But
over which part of $\Omega$ should the assertion hold?
Just on $\{Z_{n-1} \ge 3\}$? On all of $\Omega$? Moreover, the notation invites the blunder of
ignoring the randomness of the range of summation, leading to the absurd assertion
that $\mathbb{P}\sum_{i\le Z_{n-1}}\xi_{ni}$ equals $\sum_{i\le Z_{n-1}}\mathbb{P}\xi_{ni} = Z_{n-1}\mu$. The corresponding assertion for an
expectation conditional on $Z_{n-1}$ is correct, but some of the doubts still linger.
It is much better to start with an entire family $\{\xi_{ni} : n \in \mathbb{N},\ i \in \mathbb{N}\}$ of
independent random variables, each with distribution $P$, then define $Z_0 = 1$ and
$Z_n := \sum_{i\in\mathbb{N}}\xi_{ni}\{i \le Z_{n-1}\}$ for $n \ge 1$. The random variable $Z_n$ is measurable with
respect to the sigma-field $\mathcal{F}_n := \sigma\{\xi_{ki} : k \le n,\ i \in \mathbb{N}\}$, and, almost surely,
$\mathbb{P}(Z_n \mid \mathcal{F}_{n-1}) = \sum_{i\in\mathbb{N}}\mathbb{P}(\xi_{ni} \mid \mathcal{F}_{n-1})\{i \le Z_{n-1}\} = \mu Z_{n-1}.$
If $\mu \le 1$, the $\{Z_n\}$ sequence is a positive supermartingale with respect to the $\{\mathcal{F}_n\}$
filtration. By Theorem <22>, there exists an integrable random variable $Z_\infty$ with
$Z_n \to Z_\infty$ almost surely.
A sequence of integers $Z_n(\omega)$ can converge to a finite limit $k$ only if $Z_n(\omega) = k$
for all $n$ large enough. If $k > 0$, the convergence would imply that, with nonzero
probability, only finitely many of the independent events $\{\sum_{i\le k}\xi_{ni} \ne k\}$ can occur.
By the converse to the Borel-Cantelli lemma, it would follow that $\sum_{i\le k}\xi_{ni} = k$
almost surely, which can happen only if $P\{1\} = 1$. In that case, $Z_n = 1$ for all $n$.
If $P\{1\} < 1$, then $Z_n$ must converge to zero, with probability one, if $\mu \le 1$. The
bunyips die out if the average number of offspring is less than or equal to 1.
If $\mu > 1$ the situation is more complex. If $P\{0\} = 0$ the population cannot
decrease, and $Z_n$ must then diverge to infinity with probability one. If $P\{0\} > 0$, the
convex function $g(t) := P^xt^x$ must have a unique value $\theta$ with $0 < \theta < 1$ for which
$g(\theta) = \theta$: the strictly convex function $h(t) := g(t) - t$ has $h(0) = P\{0\} > 0$ and
$h(1) = 0$, and its left-hand derivative $P^x(xt^{x-1} - 1)$ converges to $\mu - 1 > 0$ as $t$
increases to 1. The sequence $\{\theta^{Z_n}\}$ is a positive martingale:
$\mathbb{P}(\theta^{Z_n} \mid \mathcal{F}_{n-1}) = \{Z_{n-1} = 0\} + \sum_{k\ge1}\mathbb{P}\big(\theta^{\xi_{n1}+\cdots+\xi_{nk}}\{Z_{n-1} = k\} \mid \mathcal{F}_{n-1}\big)$
$= \{Z_{n-1} = 0\} + \sum_{k\ge1}g(\theta)^k\{Z_{n-1} = k\} = g(\theta)^{Z_{n-1}} = \theta^{Z_{n-1}},$
because $g(\theta) = \theta$.
The positive martingale $\{\theta^{Z_n}\}$ has an almost sure limit, $W$. The sequence $\{Z_n\}$ must
converge almost surely, with an infinite limit when $W = 0$. As with the situation
when $\mu \le 1$, the only other possible finite limit for $Z_n$ is 0, corresponding to $W = 1$.
Because $0 \le \theta^{Z_n} \le 1$ for every $n$, Dominated Convergence and the martingale
property give $\mathbb{P}\{W = 1\} = \lim_{n\to\infty}\mathbb{P}\theta^{Z_n} = \mathbb{P}\theta^{Z_0} = \theta$.
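As a numerical illustration (my example, not the author's): take offspring distribution $P\{0\} = 1/4$, $P\{1\} = 1/4$, $P\{2\} = 1/2$, so $\mu = 5/4 > 1$ and $g(t) = 1/4 + t/4 + t^2/2$, whose fixed point in $(0,1)$ is $\theta = 1/2$.

```python
import random

# Sketch: extinction probability of a branching process with
# offspring distribution P{0}=1/4, P{1}=1/4, P{2}=1/2 (mu = 5/4).
# The fixed point of g(t) = 1/4 + t/4 + t**2/2 in (0,1) is theta = 1/2.

def goes_extinct(rng, max_gen=200, cap=300):
    z = 1
    for _ in range(max_gen):
        if z == 0:
            return True
        if z > cap:   # so large that extinction is (almost) impossible
            return False
        # the tuple (0, 1, 2, 2) samples P{0}=1/4, P{1}=1/4, P{2}=1/2
        z = sum(rng.choice((0, 1, 2, 2)) for _ in range(z))
    return z == 0

rng = random.Random(4)
trials = 10000
est = sum(goes_extinct(rng) for _ in range(trials)) / trials
print(abs(est - 0.5) < 0.03)
```

The simulated extinction frequency sits near $\theta = 1/2$, as the martingale $\{\theta^{Z_n}\}$ predicts.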
whence $\phi = \sum_{k\in\mathbb{N}_0}\phi^k\,\mathbb{P}\{Z_1 = k\} = g(\phi)$. We must have either $\phi = 1$, meaning that
$X = 0$ almost surely, or else $\phi = \theta$, in which case $X > 0$ almost surely on $D^c$.
The latter must be the case if $X$ is nondegenerate, that is, if $\mathbb{P}\{X > 0\} > 0$, which
happens if and only if $P^x\big(x\log(1+x)\big) < \infty$; see Problem [14].

4.
<26> Theorem (Krickeberg decomposition). Each submartingale $\{X_n : n \in \mathbb{N}_0\}$ with $\sup_n\mathbb{P}X_n^+ < \infty$ can be written as the difference of a positive martingale and a positive supermartingale.
<27> Corollary. A submartingale $\{X_n : n \in \mathbb{N}_0\}$ with $\sup_n\mathbb{P}X_n^+ < \infty$ converges almost surely to an integrable limit $X_\infty$.
<28> Example. Let $\{M_n : n \in \mathbb{N}_0\}$ be a martingale with $M_0 = 0$ and increments bounded by 1, that is, $|M_n - M_{n-1}| \le 1$ for all $n$. For $M_n(\omega)$ to converge to a
finite limit, it is necessary that $\sup_nM_n(\omega)$ be finite. In fact, it is also a sufficient
condition. More precisely,
$\{\omega : \lim_nM_n(\omega)\ \text{exists as a finite limit}\} = \{\omega : \sup_nM_n(\omega) < \infty\}$
almost surely.
To establish that the right-hand side is (almost surely) a subset of the left-hand side,
for a fixed positive $C$ define $\tau$ as the first $n$ for which $M_n > C$, with $\tau = \infty$
when $\sup_nM_n \le C$. The martingale $X_n := M_{\tau\wedge n}$ is bounded above by the constant
$C + 1$, because the increment (if any) that pushes $M_n$ above $C$ cannot be larger
than 1. In particular, $\sup_n\mathbb{P}X_n^+ < \infty$, which ensures that $\{X_n\}$ converges almost
surely to a finite limit. On the set $\{\sup_nM_n \le C\}$ we have $M_n = X_n$ for all $n$, and
hence $M_n$ also converges almost surely to a finite limit on that set. Take a union
over a sequence of $C$ values increasing to $\infty$ to complete the argument.  □
REMARK.
Convergence of $M_n(\omega)$ to a finite limit also implies that
$\sup_n(-M_n(\omega)) < \infty$. The result therefore contains the surprising assertion that,
almost surely, finiteness of $\sup_nM_n(\omega)$ implies finiteness of $\sup_n(-M_n(\omega))$.
*5.
Associate with each $X_n$ the measure $\mu_n$ defined by $\mu_nF := \mathbb{P}(X_nF)$. The martingale property is then equivalent to the assertion
<29>   $\mu_{n+1}\big|_{\mathcal{F}_n} = \mu_n\big|_{\mathcal{F}_n}$  for each $n$,
where, in general, $\nu|_{\mathcal{G}}$ denotes the restriction of a measure $\nu$ to a sub-sigma-field $\mathcal{G}$. Similarly, the defining inequality (subMG)' for a submartingale, $\mu_{n+1}F :=
\mathbb{P}(X_{n+1}F) \ge \mathbb{P}(X_nF) =: \mu_nF$ for all $F \in \mathcal{F}_n$, is equivalent to
<30>   $\mu_{n+1}\big|_{\mathcal{F}_n} \ge \mu_n\big|_{\mathcal{F}_n}$  for each $n$.
Now consider the submartingale $\{S_n : n \in \mathbb{N}_0\}$ from the statement of the
Krickeberg decomposition. Define an increasing functional $\lambda : \mathcal{M}^+(\mathcal{F}) \to [0, \infty]$ by
$\lambda f := \limsup_n\mathbb{P}(S_n^+f)$  for $f \in \mathcal{M}^+(\mathcal{F})$.
*6.
Uniform integrability
Corollary <27> gave a sufficient condition for a submartingale $\{X_n\}$ to converge
almost surely to an integrable limit $X_\infty$. If $\{X_n\}$ happens to be a martingale, we
know that $X_n = \mathbb{P}(X_{n+m} \mid \mathcal{F}_n)$ for arbitrarily large $m$. It is tempting to leap to the
conclusion that
<32>   $X_n = \mathbb{P}(X_\infty \mid \mathcal{F}_n)?$
as suggested by a purely formal passage to the limit as $m$ tends to infinity. One
should perhaps look before one leaps.
<33>
Example. Reconsider the limit behavior of the partial sums [Sn] from Example <23> but with p = 1/3 and 0 2. The sequence Xn = 2Sn is a positive
martingale. By the strong law of large numbers, Sn/n *  1 / 3 almost surely, which
gives Sn > oo almost surely and XQQ = 0 as the limit of the martingale. Clearly
Xn is not equal to P(X<x> I y)
REMARK.
The branching process of Example <25> with $\mu = 1$ provides
another case of a nontrivial martingale converging almost surely to zero.
As you will learn in this Section, the condition for the validity of <32>
(without the cautionary question mark) is uniform integrability. Remember that
a family of random variables $\{Z_t : t \in T\}$ is said to be uniformly integrable
if $\sup_{t\in T}\mathbb{P}|Z_t|\{|Z_t| > M\} \to 0$ as $M \to \infty$. Remember also the following
characterization of $\mathcal{L}^1$ convergence, which was proved in Section 2.8.
<34> Lemma. Let $\{Z_n\}$ be a sequence of integrable random variables. Then $\mathbb{P}|Z_n - Z| \to 0$ for an integrable $Z$ if and only if $Z_n \to Z$ in probability and $\{Z_n\}$ is uniformly integrable.
<35> Lemma. For a fixed integrable random variable $Z$, the family of all conditional
expectations $\{\mathbb{P}(Z \mid \mathcal{G}) : \mathcal{G}$ a sub-sigma-field of $\mathcal{F}\}$ is uniformly integrable.
Proof. Write $Z_{\mathcal{G}}$ for $\mathbb{P}(Z \mid \mathcal{G})$. With no loss of generality, we may suppose
$Z \ge 0$, because $|Z_{\mathcal{G}}| \le \mathbb{P}(|Z| \mid \mathcal{G})$. Invoke the defining property of the conditional
expectation, and the fact that $\{Z_{\mathcal{G}} > M^2\} \in \mathcal{G}$, to rewrite $\mathbb{P}Z_{\mathcal{G}}\{Z_{\mathcal{G}} > M^2\}$ as
$\mathbb{P}Z\{Z_{\mathcal{G}} > M^2\} \le M\,\mathbb{P}\{Z_{\mathcal{G}} > M^2\} + \mathbb{P}Z\{Z > M\}.$
The first term on the right-hand side is less than $M\,\mathbb{P}Z_{\mathcal{G}}/M^2 = \mathbb{P}Z/M$, which tends
to zero as $M \to \infty$. The other term also tends to zero, because $Z$ is integrable.  □
More generally, if $X$ is an integrable random variable and $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ is a
filtration then $X_n := \mathbb{P}(X \mid \mathcal{F}_n)$ defines a uniformly integrable martingale. In fact,
every uniformly integrable martingale must be of this form.
<36> Theorem. Every uniformly integrable martingale $\{X_n : n \in \mathbb{N}_0\}$ converges almost surely and in $\mathcal{L}^1$ to an integrable random variable $X_\infty$ for which $X_n = \mathbb{P}(X_\infty \mid \mathcal{F}_n)$ for each $n$.
Proof. For $F$ in $\mathcal{F}_n$ and positive $m$, split $\mathbb{P}(X_\infty - X_n)F$ into
$\mathbb{P}(X_\infty - X_{n+m})F + \mathbb{P}(X_{n+m} - X_n)F.$
The $\mathcal{L}^1$ convergence makes the first term on the right-hand side converge to zero
as $m$ tends to infinity. The second term is zero for all positive $m$, by the martingale
property. Thus $\mathbb{P}X_\infty F = \mathbb{P}X_nF$ for every $F$ in $\mathcal{F}_n$.  □
*7.
<41> Theorem. Let $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ be a reversed positive supermartingale: the $\{\mathcal{G}_n\}$ form a decreasing filtration, each $X_n$ is a positive, $\mathcal{G}_n$-measurable random variable, and $\mathbb{P}(X_n \mid \mathcal{G}_{n+1}) \le X_{n+1}$ almost surely, for each $n$. Write $\mathcal{G}_\infty$ for $\cap_{n\in\mathbb{N}_0}\mathcal{G}_n$. Then:
(i) there exists an $X_\infty$ in $\mathcal{M}^+(\Omega, \mathcal{G}_\infty)$ for which $X_n \to X_\infty$ almost surely;
(ii) $\mathbb{P}(X_n \mid \mathcal{G}_\infty) \uparrow X_\infty$ almost surely;
(iii) $\mathbb{P}|X_n - X_\infty| \to 0$ if and only if $\sup_n\mathbb{P}X_n < \infty$.
Proof. The Corollary <21> to Dubins's inequality bounds by $(\alpha/\beta)^k$ the probability
that $X_N, X_{N-1}, \ldots, X_0$ completes at least $k$ upcrossings of the interval $[\alpha, \beta]$, no
matter how large we take $N$. As in the proof of Theorem <22>, it then follows
that $\mathbb{P}\{\limsup X_n > \beta > \alpha > \liminf X_n\} = 0$, for each pair $0 \le \alpha < \beta < \infty$, and
hence $X_n$ converges almost surely to a nonnegative limit $X_\infty$, which is necessarily
$\mathcal{G}_\infty$-measurable.
I will omit most of the "almost sure" qualifiers for the remainder of the proof.
Temporarily abbreviate $\mathbb{P}(\cdot \mid \mathcal{G}_n)$ to $\mathbb{P}_n(\cdot)$, for $n \in \overline{\mathbb{N}}_0$, and write $Z_n$ for $\mathbb{P}_\infty X_n$. From
the reversed supermartingale property, $\mathbb{P}_{n+1}X_n \le X_{n+1}$, and the rule for iterated
conditional expectations we get
$Z_n = \mathbb{P}_\infty X_n = \mathbb{P}_\infty(\mathbb{P}_{n+1}X_n) \le \mathbb{P}_\infty X_{n+1} = Z_{n+1}.$
Thus $Z_n \uparrow Z_\infty := \limsup_nZ_n$, which is $\mathcal{G}_\infty$-measurable.
For (ii) we need to show $Z_\infty = X_\infty$ almost surely. Equivalently, as both
variables are $\mathcal{G}_\infty$-measurable, we need to show $\mathbb{P}(Z_\infty G) = \mathbb{P}(X_\infty G)$ for each $G$
in $\mathcal{G}_\infty$. For such a $G$,
$\mathbb{P}(Z_\infty G) = \sup_{n\in\mathbb{N}_0}\mathbb{P}(Z_nG)$  (Monotone Convergence)
$= \sup_n\mathbb{P}(X_nG)$  (definition of $Z_n := \mathbb{P}_\infty X_n$)
For the necessity, just note that an $\mathcal{L}^1$ limit of a sequence of integrable random
variables is integrable.
The analog of the Krickeberg decomposition extends the result to reversed
submartingales $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$. The sequence $M_n := \mathbb{P}(X_0^+ \mid \mathcal{G}_n)$ is a
reversed positive martingale for which $M_n \ge \mathbb{P}(X_0 \mid \mathcal{G}_n) \ge X_n$ almost surely. Thus
$X_n = M_n - (M_n - X_n)$ decomposes $X_n$ into a difference of a reversed positive
martingale and a reversed positive supermartingale.
<43> Corollary. Every reversed martingale $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ converges almost surely and in $\mathcal{L}^1$ to the limit $X_\infty = \mathbb{P}(X_0 \mid \mathcal{G}_\infty)$.
Proof. The identification of the limit as the conditional expectation follows from the
facts that $\mathbb{P}(X_0G) = \mathbb{P}(X_nG)$, for each $n$, and $|\mathbb{P}(X_nG) - \mathbb{P}(X_\infty G)| \le \mathbb{P}|X_n - X_\infty| \to 0$,
for each $G$ in $\mathcal{G}_\infty$.  □
Reversed martingales arise naturally from symmetry arguments.
<44> Example. Let $\xi_1, \xi_2, \ldots$ be independent random variables, each with distribution $P$, and let $P_n := n^{-1}\sum_{i\le n}\delta_{\xi_i}$ denote the empirical measure built from the first $n$ observations. Knowledge of $P_n$ determines the values $\xi_1(\omega), \ldots, \xi_n(\omega)$ only up to a random relabelling of its support points.
REMARK.
Here I am arguing heuristically, assuming $P_n$ concentrates on $n$
distinct points. A similar heuristic could be developed when there are ties, but there
is no point in trying to be too precise at the moment. The problem of ties would
disappear from the formal argument.
Similarly, if we knew all $P_i$ for $i \ge n$ then we should be able to locate $\xi_i(\omega)$
exactly for $i \ge n+1$, but the values of $\xi_1(\omega), \ldots, \xi_n(\omega)$ would still be known only
up to random relabelling of the support points of $P_n$. The new information would
tell us no more about $\xi_1$ than we already knew from $P_n$. In other words, we should
have
<46>   $\mathbb{P}\big(f(\xi_1) \mid P_n, P_{n+1}, \ldots\big) = \frac{1}{n}\sum_{i=1}^nf(\xi_i) = P_nf.$
That is, $\{(P_nf, \mathcal{G}_n) : n \in \mathbb{N}\}$, with $\mathcal{G}_n := \sigma\{P_i : i \ge n\}$, would be a reversed martingale, for each fixed $f$.
It is possible to define $\mathcal{G}_n$ rigorously, then formalize the preceding heuristic
argument to establish the reversed martingale property. I will omit the details,
because it is simpler to replace $\mathcal{G}_n$ by the closely related $n$-symmetric sigma-field $\mathcal{S}_n$,
to be defined in the next Section, then invoke the more general symmetry arguments
(Example <50>) from that Section to show that $\{(P_nf, \mathcal{S}_n) : n \in \mathbb{N}\}$ is a reversed
martingale for each $P$-integrable $f$.
Corollary <43> ensures that $P_nf \to \mathbb{P}\big(f(\xi_1) \mid \mathcal{S}_\infty\big)$ almost surely. As you will
see in the next Section (Theorem <51>, to be precise), the sigma-field $\mathcal{S}_\infty$ is trivial
(it contains only events with probability zero or one) and $\mathbb{P}(X \mid \mathcal{S}_\infty) = \mathbb{P}X$, almost
surely, for each integrable random variable $X$. In particular, for each $P$-integrable
function $f$ we have
$P_nf \to Pf$  almost surely.
The special case $\mathcal{X} := \mathbb{R}$ and $P|x| < \infty$ and $f(x) = x$ recovers the Strong Law of
Large Numbers (SLLN) for independent, identically distributed summands.
In statistical problems it is sometimes necessary to prove a uniform analog of
the SLLN (a USLLN): for a class $\{f_\theta : \theta \in \Theta\}$ of $P$-integrable functions,
$\Delta_n := \sup_{\theta\in\Theta}\big|P_nf_\theta - Pf_\theta\big| \to 0$  almost surely.
The random variables $\Delta_n$ form a reversed submartingale, which converges almost
surely to a limit that, by the triviality of $\mathcal{S}_\infty$ (Theorem <51> again), is (almost surely)
constant. To prove the USLLN, we have only to show that the constant is zero. For
example, it would suffice to show $\mathbb{P}\Delta_n \to 0$, a great simplification of the original
task. See Pollard (1984, Section II.5) for details.

*8.
For each map $\pi : \mathbb{N} \to \mathbb{N}$ define $S_\pi : \mathcal{X}^{\mathbb{N}} \to \mathcal{X}^{\mathbb{N}}$ by
<47>   $S_\pi x := \big(x_{\pi(1)}, x_{\pi(2)}, \ldots\big)$  for $x \in \mathcal{X}^{\mathbb{N}}$,
and for each function $h$ on $\mathcal{X}^{\mathbb{N}}$ write $h_\pi$ for $h \circ S_\pi$. Call $\pi$ an $n$-permutation if it permutes the first $n$ coordinates and leaves the rest fixed, $\pi(i) = i$ for $i > n$; call it a finite permutation if it is an $n$-permutation for some finite $n$. A function $h$ for which
<49>   $h_\pi = h$  for every $n$-permutation $\pi$
is said to be $n$-symmetric.
Definition. A probability measure $\mathbb{P}$ on $\mathcal{A}^{\mathbb{N}}$ is said to be exchangeable if it is
invariant under $S_\pi$ for every finite permutation $\pi$, that is, if $\mathbb{P}h = \mathbb{P}h_\pi$ for every $h$
in $\mathcal{M}^+(\mathcal{X}^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}})$ and every finite permutation $\pi$. Equivalently, under $\mathbb{P}$ the random
vector $(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$ has the same distribution as $(x_1, x_2, \ldots, x_n)$, for every
$n$-permutation $\pi$, and every $n$.
The collection of all sets in $\mathcal{A}^{\mathbb{N}}$ whose indicator functions are $n$-symmetric
forms a sub-sigma-field $\mathcal{S}_n$ of $\mathcal{A}^{\mathbb{N}}$, the $n$-symmetric sigma-field. The $\mathcal{S}_n$-measurable
functions are those $\mathcal{A}^{\mathbb{N}}$-measurable functions that are $n$-symmetric. The sigma-fields
$\{\mathcal{S}_n : n \in \mathbb{N}\}$ form a decreasing filtration on $\mathcal{X}^{\mathbb{N}}$, with $\mathcal{S}_1 = \mathcal{A}^{\mathbb{N}}$.
<50> Example. Suppose $\mathbb{P}$ is exchangeable. For each $\mathbb{P}$-integrable $f$ and each $n$-permutation $\pi$, invariance gives $\mathbb{P}\big(f(x_{\pi(1)}) \mid \mathcal{S}_n\big) = \mathbb{P}\big(f(x_1) \mid \mathcal{S}_n\big)$, and averaging over the positions $\pi(1) = 1, \ldots, n$ leads to
$\mathbb{P}\big(f(x_1) \mid \mathcal{S}_n\big) = \frac{1}{n}\sum_{i=1}^nf(x_i) = P_nf.$
<51> Theorem. (Hewitt-Savage zero-one law) If $\mathbb{P} = P^{\mathbb{N}}$, the symmetric sigma-field $\mathcal{S}_\infty$ is trivial: for each $F$ in $\mathcal{S}_\infty$, either $\mathbb{P}F = 0$ or $\mathbb{P}F = 1$.
Proof. Write $h(x)$ for the indicator function of $F$, a set in $\mathcal{S}_\infty$. By definition,
$h_\pi = h$ for every finite permutation $\pi$. Equip $\mathcal{X}^{\mathbb{N}}$ with the filtration $\mathcal{F}_n := \sigma\{x_i : i \le n\}$.
Notice that $\mathcal{F}_\infty := \sigma(\cup_{n\in\mathbb{N}}\mathcal{F}_n) = \mathcal{A}^{\mathbb{N}} = \mathcal{S}_1$.
The martingale $Y_n := \mathbb{P}(h \mid \mathcal{F}_n)$ converges almost surely to $\mathbb{P}(h \mid \mathcal{F}_\infty) = h$,
and also, by Dominated Convergence, $\mathbb{P}|h - Y_n|^2 \to 0$.
The $\mathcal{F}_n$-measurable random variable $Y_n$ may be written as $h_n(x_1, \ldots, x_n)$, for
some $\mathcal{A}^n$-measurable $h_n$ on $\mathcal{X}^n$. The random variable $Z_n := h_n(x_{n+1}, \ldots, x_{2n})$ is
independent of $Y_n$, and it too converges in $\mathcal{L}^2$ to $h$: if $\pi$ denotes the $2n$-permutation
that interchanges $i$ and $i+n$, for $1 \le i \le n$, then, by exchangeability,
$\mathbb{P}|h(x) - Z_n|^2 = \mathbb{P}|h_\pi(x) - h_n(x_{\pi(n+1)}, \ldots, x_{\pi(2n)})|^2 = \mathbb{P}|h(x) - Y_n|^2 \to 0.$
The random variables $Z_n$ and $Y_n$ are independent, and they both converge in $\mathcal{L}^2(\mathbb{P})$-norm to $h$. Thus
$0 = \lim_n\mathbb{P}|Y_n - Z_n|^2 = \lim_n\big(\mathbb{P}Y_n^2 - 2(\mathbb{P}Y_n)(\mathbb{P}Z_n) + \mathbb{P}Z_n^2\big) = \mathbb{P}F - 2(\mathbb{P}F)^2 + \mathbb{P}F,$
whence $\mathbb{P}F = (\mathbb{P}F)^2$, forcing $\mathbb{P}F$ to equal 0 or 1.  □
161
Exchangeable measures can be built up from mixtures of product measures, in various senses. The simplest
general version of the de Finetti result is expressed as an assertion of conditional
independence.
<52> Theorem. Under an exchangeable $P$, the coordinate maps are conditionally independent given the symmetric sigma-field $\mathcal{S}_\infty$.

To see why, consider three bounded, measurable functions $f_1, f_2, f_3$ on $\mathbb{X}$ (the general case differs only notationally). Note that
$$(P_n f_1)(P_n f_2)(P_n f_3) = n^{-3} \sum_{i, j, k \leq n} f_1(x_i) f_2(x_j) f_3(x_k).$$
On the right-hand side, there are $n(n - 1)(n - 2)$ triples of distinct subscripts $(i, j, k)$,
leaving $O(n^2)$ of them with at least one duplicated subscript. The latter contribute
a sum bounded in absolute value by a multiple of $n^2$; the former appear in the sum
that Example <50> identifies as $P\big(f_1(x_1) f_2(x_2) f_3(x_3) \mid \mathcal{S}_n\big)$. Thus, letting $n \to \infty$,
$$P\big(f_1(x_1) f_2(x_2) f_3(x_3) \mid \mathcal{S}_\infty\big) = P\big(f_1(x_1) \mid \mathcal{S}_\infty\big)\, P\big(f_2(x_2) \mid \mathcal{S}_\infty\big)\, P\big(f_3(x_3) \mid \mathcal{S}_\infty\big) \quad \text{almost surely.}$$
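The counting step can be checked mechanically on a small sample (arbitrary numbers; the functions $f_1, f_2, f_3$ are assumptions for illustration): the product of empirical averages really is the average over all $n^3$ triples, of which exactly $n(n-1)(n-2)$ have distinct subscripts.

```python
import math
from itertools import product

xs = [0.3, 1.7, -0.4, 2.2, 0.9, -1.1]          # a sample x_1, ..., x_n
f1, f2, f3 = (lambda x: x), (lambda x: x * x), (lambda x: math.cos(x))
n = len(xs)

# (P_n f1)(P_n f2)(P_n f3) = n^{-3} * sum over ALL triples (i, j, k)
lhs = (sum(map(f1, xs)) / n) * (sum(map(f2, xs)) / n) * (sum(map(f3, xs)) / n)
rhs = sum(f1(xs[i]) * f2(xs[j]) * f3(xs[k])
          for i, j, k in product(range(n), repeat=3)) / n ** 3
print(lhs, rhs)

# triples with distinct subscripts number n(n-1)(n-2); the rest are O(n^2)
distinct = sum(1 for i, j, k in product(range(n), repeat=3)
               if i != j and j != k and i != k)
print(distinct, n * (n - 1) * (n - 2), n ** 3 - distinct)
```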
<53>
for all $h$ in $\mathcal{M}^+(\mathbb{T})$, which implies that $P_i = \pi$ a.e. $[Q]$, for each $i$.
For every finite subcollection $\{E_{i_1}, \ldots, E_{i_n}\}$ of $\mathcal{E}$, Theorem <52> asserts
$$P\{x_1 \in E_{i_1}, \ldots, x_n \in E_{i_n} \mid \mathcal{S}_\infty\} = \prod_{j=1}^{n} P\{x_j \in E_{i_j} \mid \mathcal{S}_\infty\},$$
which integrates to
$$P\{x_1 \in E_{i_1}, \ldots, x_n \in E_{i_n}\} = P \prod_{j=1}^{n} P\{x_j \in E_{i_j} \mid \mathcal{S}_\infty\}.$$
9. Problems
[1]
[2]
Let $\tau$ be a stopping time for the natural filtration generated by a sequence of random
variables $\{Z_n : n \in \mathbb{N}\}$. Show that $\mathcal{F}_\tau = \sigma\{Z_{\tau \wedge n} : n \in \mathbb{N}\}$.
[3]
Let $\{(Z_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ be a (sub)martingale and $\tau$ be a stopping time. Show that
$\{(Z_{\tau \wedge n}, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ is also a (sub)martingale. Hint: For $F$ in $\mathcal{F}_{n-1}$, consider
separately the contributions to $PZ_{n \wedge \tau}F$ and $PZ_{(n-1) \wedge \tau}F$ from the regions $\{\tau \leq n - 1\}$
and $\{\tau \geq n\}$.
[4]
Let $\tau$ be a stopping time for a filtration $\{\mathcal{F}_i : i \in \mathbb{N}_0\}$. For an integrable random
variable $X$, define $X_i := P(X \mid \mathcal{F}_i)$. Show that
$$P(X \mid \mathcal{F}_\tau) = \sum_{i \in \mathbb{N}_0} \{\tau = i\} X_i = X_\tau \qquad \text{almost surely.}$$
[5]
Let $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ be a positive supermartingale, and let $\sigma$ and $\tau$ be stopping
times (not necessarily bounded) for which $\sigma \leq \tau$ on a set $F$ in $\mathcal{F}_\sigma$. Show that
$PX_\sigma\{\sigma < \infty\}F \geq PX_\tau\{\tau < \infty\}F$. Hint: For each positive integer $N$, show
that $F_N := F\{\sigma \leq N\} \in \mathcal{F}_{\sigma \wedge N}$. Use the Stopping Time Lemma to prove that
$PX_{\sigma \wedge N}F_N \geq PX_{\tau \wedge N}F_N \geq PX_\tau\{\tau \leq N\}F$, then invoke Monotone Convergence.
[6]
For each positive supermartingale $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$, and stopping times $\sigma \leq \tau$,
show that $P\big(X_\tau\{\tau < \infty\} \mid \mathcal{F}_\sigma\big) \leq X_\sigma\{\sigma < \infty\}$ almost surely.
[7]
[8]
(Birnbaum & Marshall 1961) Let $0 = X_0, X_1, \ldots$ be nonnegative integrable random
variables that are adapted to a filtration $\{\mathcal{F}_i\}$. Suppose there exist constants $\theta_i$, with
$0 \leq \theta_i \leq 1$, for which
$$(*) \qquad P(X_i \mid \mathcal{F}_{i-1}) \geq \theta_i X_{i-1} \qquad \text{for } i \geq 1.$$
Let $C_1 \geq C_2 \geq \ldots \geq C_{N+1} = 0$ be constants. Prove the inequality
$$P\{\max_{i \leq N} C_i X_i \geq 1\} \leq \sum_{i=1}^{N} C_i \big(PX_i - \theta_i PX_{i-1}\big).$$
[9]
(Doob's $\mathcal{L}^p$ inequality) Let $\{(S_n, \mathcal{F}_n)\}$ be a nonnegative submartingale, with
$M_n := \max_{i \leq n} S_i$, and let $p > 1$ and $q := p/(p - 1)$ be conjugate exponents. Show
that $\|M_n\|_p \leq q \|S_n\|_p$, by following these steps.
(i) For $x > 0$, let $\tau$ be the first $i \leq n$ for which $S_i > x$ (with $\tau := n$ if $M_n \leq x$).
Show that
$$x P\{M_n > x\} \leq PS_\tau\{S_\tau > x\} \leq PS_n\{M_n > x\}.$$
(ii) Show that $PX^p = \int_0^\infty p x^{p-1} P\{X > x\}\,dx$ for each nonnegative random
variable $X$.
(iii) Show that $PM_n^p \leq q\,PS_nM_n^{p-1}$.
(iv) Bound the last product using Hölder's inequality, then rearrange to get the
stated inequality. (Any problems with infinite values?)
[10] Let $(\Omega, \mathcal{F}, P)$ be a probability space such that $\mathcal{F}$ is countably generated: that is,
$\mathcal{F} = \sigma\{B_1, B_2, \ldots\}$ for some sequence of sets $\{B_i\}$. Let $\mu$ be a finite measure on $\mathcal{F}$,
dominated by $P$. Let $\mathcal{F}_n := \sigma\{B_1, \ldots, B_n\}$.
(i) Show that there is a partition $\pi_n$ of $\Omega$ into at most $2^n$ disjoint sets from $\mathcal{F}_n$ such
that each $F$ in $\mathcal{F}_n$ is a union of sets from $\pi_n$.
(ii) Define $\mathcal{F}_n$-measurable random variables $X_n$ by: for $\omega \in A$, with $A \in \pi_n$,
$$X_n(\omega) := \begin{cases} \mu A / PA & \text{if } PA > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Show that $PX_nF = \mu F$ for all $F$ in $\mathcal{F}_n$.
(iii) Show that $\{(X_n, \mathcal{F}_n)\}$ is a positive martingale.
(iv) Show that $\{X_n\}$ is uniformly integrable. Hint: What do you know about
$\mu\{X_n > M\}$?
(v) Let $X_\infty$ denote the almost sure limit of the $\{X_n\}$. Show that $PX_\infty F = \mu F$ for
all $F$ in $\mathcal{F}$. That is, show that $X_\infty$ is a density for $\mu$ with respect to $P$.
[11]
[13]
Suppose the offspring distribution in Example <25> has finite mean $\mu > 1$ and
variance $\sigma^2$.
(i) Show that $\operatorname{var}(Z_n) = \sigma^2\mu^{n-1} + \mu^2\operatorname{var}(Z_{n-1})$.
(ii) Write $X_n$ for the martingale $Z_n/\mu^n$. Show that $\sup_n \operatorname{var}(X_n) < \infty$.
(iii) Deduce that $X_n$ converges both almost surely and in $\mathcal{L}^1$ to the limit $X$, and
hence $PX = 1$. In particular, the limit $X$ cannot be degenerate at 0.
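The variance recursion in part (i) can be iterated numerically (the offspring mean and variance below are assumed values for illustration): $\operatorname{var}(X_n) = \operatorname{var}(Z_n)/\mu^{2n}$ increases to the finite limit $\sigma^2/(\mu^2 - \mu)$, consistent with part (ii).

```python
# Assumed offspring moments, mu > 1 (illustration only).
mu, sigma2 = 1.5, 0.75

var_Z = [0.0]                      # Z_0 = 1, so var(Z_0) = 0
for n in range(1, 41):
    var_Z.append(sigma2 * mu ** (n - 1) + mu ** 2 * var_Z[-1])

var_X = [v / mu ** (2 * n) for n, v in enumerate(var_Z)]
limit = sigma2 / (mu ** 2 - mu)    # sum of the geometric increments
print(var_X[-1], limit)
```

Each step adds $\sigma^2\mu^{-(n+1)}$ to $\operatorname{var}(X_{n-1})$, a geometric series, which is exactly why the supremum in (ii) is finite when $\mu > 1$.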
[14]
Suppose the offspring distribution $P$ from Example <25> has finite mean $\mu > 1$.
Write $X_n$ for the martingale $Z_n/\mu^n$, which converges almost surely to an integrable
limit random variable $X$. Show that the limit $X$ is nondegenerate if and only if the
condition
$$Px\big(x\log(1 + x)\big) < \infty \tag{XLOGX}$$
holds. Follow these steps. Write $\mu_n$ for $Px\big(x\{x \leq \mu^n\}\big)$ and $P_n(\cdot)$ for expectations
conditional on $\mathcal{F}_n$.
(i) Show that $\sum_n (\mu - \mu_n) = Px\big(x\sum_n\{x > \mu^n\}\big)$, which is finite
if and only if (XLOGX) holds.
(ii) Define $\widetilde{X}_n := \mu^{-n}\sum_i \xi_i\{\xi_i \leq \mu^n\}\{i \leq Z_{n-1}\}$, where the $\xi_i$ are the offspring
counts for generation $n - 1$. Show that $P_{n-1}\widetilde{X}_n = \mu_n X_{n-1}/\mu$
almost surely. Show also that
$$X - X_N = \sum_{n \geq N+1}\big(X_n - X_{n-1}\big) \geq \sum_{n \geq N+1}\big(\widetilde{X}_n - X_{n-1}\big) \qquad \text{almost surely.}$$
[17]

10. Notes
De Moivre used what would now be seen as a martingale method in his solution
of the gambler's ruin problem. (Apparently first published in 1711, according
to Thatcher (1957). See pages 51-53 of the 1967 reprint of the third edition of
de Moivre (1718).)
The name martingale is due to Ville (1939). Lévy (1937, chapter VIII), expanding on earlier papers (Lévy 1934, 1935a, 1935b), had treated martingale differences,
results for sums of independent variables to martingales, including Kolmogorov's
maximal inequality and strong law of large numbers (the version proved in Section 4.6), and even a central limit theorem, extending Lindeberg's method (to be
discussed, for independent summands, in Section 7.2). He worked with martingales
stopped at random times, in order to have sums of conditional variances close to
specified constant values.
Doob (1940) established convergence theorems (without using stopping times)
for martingales and reversed martingales, calling them sequences with "property E."
He acknowledged (footnote to page 458) that the basic maximal inequalities were
"implicit in the work of Ville" and that the method of proof he used "was used by
Lévy (1937), in a related discussion." It was Doob, especially with his stochastic
processes book (Doob 1953; see, in particular, the historical notes to Chapter VII,
starting page 629), who was the major driving force behind the recognition of
martingales as one of the most important tools of probability theory. See Lévy's
comments in Note II of the 1954 edition of Lévy (1937) and in Lévy (1970,
page 118) for the relationship between his work and Doob's.
I first understood some martingale theory by reading the superb text of
Ash (1972, Chapter 7), and from conversations with Jim Pitman. The material
in Section 3 on positive supermartingales was inspired by an old set of notes for
lectures given by Pitman at Cambridge. I believe the lectures were based in part
on the original French edition of the book Neveu (1975). I have also borrowed
heavily from that book, particularly so for Theorems <26> and <41>. The book
of Hall & Heyde (1980), although aimed at central limit theory and its application,
contains much about martingales in discrete time. Dellacherie & Meyer (1982,
Chapter V) covered discrete-time martingales as a preliminary to the detailed study
of martingales in continuous time.
Exercise <15> comes from Aldous (1983, p. 47).
Inequality <20> is due to Dubins (1966). The upcrossing inequality of
Problem [11] comes from the same paper, slightly weakening an analogous
inequality of Doob (1953, page 316). Krickeberg (1963, Section IV.3) established
the decomposition (Theorem <26>) of submartingales as differences of positive
supermartingales.
I adapted the branching process result of Problem [14], which is due to Kesten
& Stigum (1966), from Asmussen & Hering (1983, Chapter II).
The reversed submartingale part of Example <44> comes from Pollard (1981).
The zero-one law of Theorem <51> for symmetric events is due to Hewitt &
Savage (1955). The study of exchangeability has progressed well beyond the
original representation theorem. Consult Aldous (1983) if you want to know more.
Chapter 7
Convergence in distribution
SECTION 1 defines the concepts of weak convergence for sequences of probability measures
on a metric space, and of convergence in distribution for sequences of random
elements of a metric space, and derives some of their consequences. Several equivalent
definitions for weak convergence are noted.
SECTION 2 establishes several more equivalences for weak convergence of probability
measures on the real line, then derives some central limit theorems for sums of
independent random variables by means of Lindeberg's substitution method.
SECTION 3 explains why the multivariate analogs of the methods from Section 2 are not
often explicitly applied.
SECTION 4 develops the calculus of stochastic order symbols.
SECTION *5 derives conditions under which sequences of probability measures have weakly
convergent subsequences.
1.
error terms, replacing them by a less specific assurance that the errors all disappear
in the limit. Explicit approximations would be better, but limit theorems are often
easier to work with.
In this Chapter you will learn about the notion of convergence traditionally
used for central limit theorems. To accommodate possible extensions (such as the
theories for convergence in distribution of stochastic processes, as in Pollard 1984),
I will start from a more general concept of convergence in distribution for random
elements of a general metric space, then specialize to the case of real random
variables. I must admit I am motivated not just by a desire for added generality.
I also wish to discourage my readers from clinging to inconvenient, old-fashioned
definitions involving pointwise convergence of distribution functions (at points of
continuity of the limit function).
Let $\mathcal{X}$ be a metric space, with metric $d(\cdot, \cdot)$, equipped with its Borel sigma-field
$\mathcal{B} := \mathcal{B}(\mathcal{X})$. A random element of $\mathcal{X}$ is just an $\mathcal{F}\backslash\mathcal{B}(\mathcal{X})$-measurable map $X$ from some
probability space $(\Omega, \mathcal{F}, P)$ into $\mathcal{X}$. Remember that the image measure $XP$ is called
the distribution of $X$ under $P$.

The concept of convergence in distribution of a sequence of random elements
$\{X_n\}$ depends on $X_n$ only through its distribution. It is really a concept of convergence
for probability measures, the image measures $X_nP_n$. There are many equivalent ways
to define convergence of a sequence of probability measures $\{P_n\}$, all defined
on $\mathcal{B}(\mathcal{X})$. I feel that it is best to start from a definition that is easy to work with,
and from which useful conclusions can be drawn quickly.
Definition. A real-valued function $\ell$ on a metric space $\mathcal{X}$ is said to satisfy a
Lipschitz condition if there exists a finite constant $K$ for which
$$|\ell(x) - \ell(y)| \leq K d(x, y) \qquad \text{for all } x, y \in \mathcal{X}.$$
Write $BL(\mathcal{X})$ for the vector space of all bounded Lipschitz functions on $\mathcal{X}$.
The space $BL(\mathcal{X})$ has a simple characterization via the quantity $\|f\|_{BL}$, defined
for all real-valued functions $f$ on $\mathcal{X}$ by
<2> $$\|f\|_{BL} := \max\big(K_1, 2K_2\big), \qquad \text{where } K_1 := \sup_{x \neq y} \frac{|f(x) - f(y)|}{d(x, y)} \text{ and } K_2 := \sup_x |f(x)|.$$
I have departed slightly from the usual definition of $\|\cdot\|_{BL}$, in order to get the neat
bound
<3> $$|f(x) - f(y)| \leq \|f\|_{BL}\big(1 \wedge d(x, y)\big) \qquad \text{for all } x, y \in \mathcal{X}.$$
The space $BL(\mathcal{X})$ consists precisely of those functions for which $\|f\|_{BL} < \infty$. It
is easy to show that $\|\cdot\|_{BL}$ is a norm when restricted to $BL(\mathcal{X})$. Moreover, slightly
tedious checking of various pointwise cases leads to inequalities such as
$$\|f_1 \vee f_2\|_{BL} \leq \max\big(\|f_1\|_{BL}, \|f_2\|_{BL}\big),$$
which implies that $BL(\mathcal{X})$ is stable under the formation of pointwise maxima of
pairs of functions. (Replace $f_i$ by $-f_i$ to deduce the same bound for $\|f_1 \wedge f_2\|_{BL}$,
and hence stability under pairwise minima.)
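On a finite set of points the norm and the two displayed inequalities can be checked directly. The sketch below assumes the definition $\|f\|_{BL} := \max(K_1, 2K_2)$, with $K_1$ the Lipschitz constant and $K_2$ the sup norm, which is one way to obtain bound <3> exactly as stated.

```python
def bl_norm(f, points):
    """max(Lipschitz constant, 2 * sup norm) over a finite subset of R."""
    sup = max(abs(f(x)) for x in points)
    lip = max((abs(f(x) - f(y)) / abs(x - y)
               for x in points for y in points if x != y), default=0.0)
    return max(lip, 2 * sup)

points = [i / 10 for i in range(-20, 21)]            # grid in R, d(x, y) = |x - y|
f1 = lambda x: min(1.0, max(-1.0, x))                # clipped identity
f2 = lambda x: 0.5 * abs(x) - 0.3

# bound <3>: |f(x) - f(y)| <= ||f||_BL * (1 ^ d(x, y))
K = bl_norm(f1, points)
ok3 = all(abs(f1(x) - f1(y)) <= K * min(1.0, abs(x - y)) + 1e-12
          for x in points for y in points)

# stability under pointwise maxima
fmax = lambda x: max(f1(x), f2(x))
ok4 = bl_norm(fmax, points) <= max(bl_norm(f1, points), bl_norm(f2, points)) + 1e-12
print(ok3, ok4)
```

Restricting to a grid only lowers the suprema, so a failure on the grid would disprove the inequalities; a success is consistent with them.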
<5> Definition. Say that a sequence $X_1, X_2, \ldots$ of random elements of $\mathcal{X}$
converges in distribution to a probability measure $P$ on $\mathcal{B}(\mathcal{X})$ if their distributions
converge weakly to $P$. Denote this convergence by $X_n \rightsquigarrow P$. If $X$ is a random
element with distribution $P$, write $X_n \rightsquigarrow X$ to mean $X_n \rightsquigarrow P$.
REMARK. Convergence in distribution is also called convergence in law (the
word law is a synonym for distribution) or weak convergence. For the study of
abstract empirical processes it was necessary (Hoffmann-Jørgensen 1984, Dudley 1985)
to extend the definition to nonmeasurable maps $X_n$ into $\mathcal{X}$. For that case, the
concept of distribution for $X_n$ is not defined. It turns out that the most natural and
successful definition requires convergence of outer expectations, $P^*h(X_n) \to Ph$, for
bounded, continuous functions $h$. That is, the convergence $X_n \rightsquigarrow P$ becomes the
primary concept, with no corresponding generalization needed for weak convergence
of probability measures.
It is important to remember that convergence in distribution, in general, says
nothing about pointwise convergence of the $X_n$ as functions. Indeed, each $X_n$ might
be defined on a different probability space, $(\Omega_n, \mathcal{F}_n, P_n)$, so that the very concept of
pointwise convergence is void. In that case, $X_n \rightsquigarrow P$ means that $X_n(P_n) \rightsquigarrow P$, that is,
$$P_n\ell(X_n) := (X_nP_n)\ell \to P\ell \qquad \text{for all } \ell \text{ in } BL(\mathcal{X}),$$
and $X_n \rightsquigarrow X$, with $X$ defined on $(\Omega, \mathcal{F}, P)$, means $P_n\ell(X_n) \to P\ell(X)$ for all $\ell$
in $BL(\mathcal{X})$.
Similarly, convergence in probability need not be well defined if the $X_n$ live
on different probability spaces. There is, however, one important exceptional case.
If $X_n \rightsquigarrow P$ with $P$ a probability measure putting mass 1 at a single point $x_0$ in $\mathcal{X}$,
then, for each $\epsilon > 0$,
$$P_n\{d(X_n, x_0) > \epsilon\} \leq P_n\ell_\epsilon(X_n) \to P\ell_\epsilon = 0, \qquad \text{where } \ell_\epsilon(x) := 1 \wedge \big(d(x, x_0)/\epsilon\big).$$
That is, $X_n \to x_0$ in probability. In the same spirit, if $X_n \rightsquigarrow P$ and $d(X_n, X'_n) \to 0$
in probability, then for each $\ell$ in $BL(\mathcal{X})$ with $\|\ell\|_{BL} \leq K$,
$$|P_n\ell(X_n) - P_n\ell(X'_n)| \leq K P_n\big(1 \wedge d(X_n, X'_n)\big) \leq K P_n\{d(X_n, X'_n) > \epsilon\} + K\epsilon \qquad \text{by <3>,}$$
for each $\epsilon > 0$.
The first term in the final bound tends to zero as $n \to \infty$, for each $\epsilon > 0$, by the
assumed convergence in probability. It follows that $P_n\ell(X'_n) \to P\ell$, that is, $X'_n \rightsquigarrow P$.
REMARK. A careful probabilist would worry about measurability of $d(X_n, X'_n)$.
If $\mathcal{X}$ were a separable metric space, Problem [6] would ensure measurability. In
general, one could reinterpret the assertion of convergence in probability to mean:
there exists a sequence of measurable functions $\{\Delta_n\}$ with $\Delta_n \geq d(X_n, X'_n)$ and
$\Delta_n \to 0$ in probability. The argument for the Example would be scarcely affected.
Proof. With no loss of generality, we may assume $g \geq 0$. For each $t > 0$,
the set $F_t := \{g \leq t\}$ is closed. The sequence of nonnegative $BL$ functions
$\ell_{k,t}(x) := t \wedge \big(k\,d(x, F_t)\big)$, for $k \in \mathbb{N}$, increases pointwise to $t\{g > t\}$, because
$d(x, F_t) > 0$ if and only if $g(x) > t$. (Compare with Problem [3].)
The countable collection $\mathcal{G}$ of all $\ell_{k,t}$ functions, for $k \in \mathbb{N}$ and positive rational $t$,
has pointwise supremum equal to $g$. Enumerate $\mathcal{G}$ as $\{h_1, h_2, \ldots\}$, then define
<8> $$\ell_i := \max_{j \leq i} h_j.$$
Take the supremum over $i$, using Monotone Convergence to show $\sup_i P\ell_i = Pg$, to
obtain (i). Put $g := -f$ to deduce (ii).
When specialized to the case of indicator functions, these semicontinuity assertions give
<9> $$\liminf_n P_nG \geq PG \text{ for each open } G, \qquad \limsup_n P_nF \leq PF \text{ for each closed } F.$$
<10> For a bounded function $f$, define the upper and lower semicontinuous envelopes
$$\bar{f}(x) := \inf_{G \ni x}\,\sup_{y \in G} f(y) \qquad \text{and} \qquad \underline{f}(x) := \sup_{G \ni x}\,\inf_{y \in G} f(y),$$
the infimum and supremum running over open neighborhoods $G$ of $x$. If $f$ is continuous
at $x$ then, for each $\epsilon > 0$, some neighborhood $G$ gives values $\sup_{y \in G} f(y)$ and $\inf_{y \in G} f(y)$
which differ by only $2\epsilon$ at $x$, thereby implying that $\bar{f}(x) = \underline{f}(x)$. The Borel
measurable set $C_f := \{x : \bar{f}(x) = \underline{f}(x)\}$ is precisely the set of all points at which $f$
is continuous.
Now suppose $P_n \rightsquigarrow P$. From Theorem <8>, and the inequality $\bar{f} \geq f \geq \underline{f}$,
we have
$$P\bar{f} \geq \limsup_n P_n\bar{f} \geq \limsup_n P_nf \geq \liminf_n P_nf \geq \liminf_n P_n\underline{f} \geq P\underline{f}.$$
When $f$ is continuous at almost all points, that is, when $P(\bar{f} \neq \underline{f}) = 0$, the two
outer bounds coincide with $Pf$, and $P_nf \to Pf$. In particular,
<12> $$P_nB \to PB \qquad \text{if } P_n \rightsquigarrow P \text{ and } P(\partial B) = 0,$$
because the discontinuities of the indicator function of a set occur only at its
boundary. A set with zero $P$ measure on its boundary is called a $P$-continuity
set. For example, an interval $(-\infty, x]$ on the real line is a continuity set for every
probability measure that puts zero mass at the point $x$. When specialized to real
random variables, assertion <12> gives the traditional convergence of distribution
functions at continuity points of the limit distribution function.
REMARK. Intuitively speaking, the closeness of probability measures in a weak
convergence sense is not sensitive to changes that have only small effects on functions
in $BL(\mathcal{X})$: arbitrary relocations of small amounts of mass (because the functions
in $BL(\mathcal{X})$ are bounded), or small relocations of arbitrarily large amounts of mass
(because the functions in $BL(\mathcal{X})$ satisfy a Lipschitz condition). The $P$-continuity
condition, $P(\partial B) = 0$, ensures that small rearrangements of $P$ masses near the
boundary of $B$ cannot have much effect on $PB$. See Problem [4] for a way of
making this idea precise, by constructing sequences $P'_n \rightsquigarrow P$ and $P''_n \rightsquigarrow P$ for which
$P'_nB \to PB$ and $P''_nB \not\to PB$.
Problem [10] shows that convergence for all $P$-continuity sets implies
the convergence $P_nf \to Pf$ for all bounded, measurable functions $f$ that are
continuous a.e. $[P]$, and, in particular, for all $f$ in $BL(\mathcal{X})$. Thus, any one of the
assertions in the following summary diagram of equivalences could be taken as
the definition of weak convergence, and then the other equivalences would become
theorems. It is largely a matter of taste, or convenience, which equivalent form one
chooses as the definition. It is worth noting that, because of the equivalences,
the concept of weak convergence does not depend on the particular choice for the
metric: all metrics generating the same topology lead to the same concept.
REMARK. Billingsley (1968, Section 2) applied the name Portmanteau theorem
to a subset of the equivalences shown in the following diagram. The circle of
ideas behind these equivalences goes back to Alexandroff (1940-43), who worked
in an abstract, nontopological setting. Prohorov (1956) developed a very useful
theory for weak convergence in complete separable metric spaces. Independently,
Le Cam (1957) developed an analogous theory for more general topological spaces.
(See also Varadarajan 1961.) For arbitrary topological spaces, Topsøe (1970,
page 41) chose the semicontinuity assertions (or more precisely, their analogs for
generalized sequences) to define weak convergence. Such a definition is needed to
build a nonvacuous theory, because there exist nontrivial spaces for which the only
continuous functions are the constants.
[Summary diagram of equivalent assertions for $P_n \rightsquigarrow P$.]
If $\ell \in BL(\mathcal{Y} \times \mathcal{Z})$ then the function $y \mapsto \ell(y, z_0)$ belongs to $BL(\mathcal{Y})$, and hence
$P_n\ell(Y_n, z_0) \to P\ell(Y, z_0)$. That is, the random elements $X'_n := (Y_n, z_0)$ converge
in distribution to $(Y, z_0)$. The random element $X_n := (Y_n, Z_n)$ is close to $X'_n$: in
fact, $d(X_n, X'_n) = d_{\mathcal{Z}}(Z_n, z_0) \to 0$ in probability. By Example <6>, it follows
that $X_n \rightsquigarrow (Y, z_0)$. If $T$ is a measurable function of $(y, z)$ that is continuous at
almost all realizations $(Y(\omega), z_0)$ of the limit process, Corollary <13> then gives
$T(Y_n, Z_n) \rightsquigarrow T(Y, z_0)$. The special cases where $\mathcal{Y} = \mathcal{Z} = \mathbb{R}$ and $T(y, z) = y + z$ or
$T(y, z) = yz$ are sometimes referred to as Slutsky's theorem.
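Slutsky's theorem lends itself to simulation (a sketch with assumed distributions: $Y_n$ a standardized sum of uniforms, $Z_n$ collapsing onto the constant $z_0 = 2$). The empirical distribution function of $Y_n + Z_n$ should approach the $N(2, 1)$ distribution function:

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

random.seed(2)
n_inner, n_samples = 30, 200_000

def draw_sum():
    # Y_n ~ approx N(0, 1): standardized sum of Uniform(-1, 1) variables
    y = sum(random.uniform(-1, 1) for _ in range(n_inner)) / math.sqrt(n_inner / 3)
    z = 2 + random.gauss(0, 1) / n_inner       # Z_n -> z0 = 2 in probability
    return y + z

sums = sorted(draw_sum() for _ in range(n_samples))
max_gap = max(abs((i + 1) / n_samples - normal_cdf(s - 2))
              for i, s in enumerate(sums))
print(max_gap)       # small: Y_n + Z_n is close to N(2, 1) in distribution
```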
<15>

2.
<16> Lemma. If $Pf(X_n) \to Pf(X)$ for each $f$ in the class $\mathcal{C}^\infty(\mathbb{R})$ of all bounded
functions with bounded derivatives of all orders, then $X_n \rightsquigarrow X$.

Proof. Let $Z$ have a $N(0, 1)$ distribution. For a fixed $\ell \in BL(\mathbb{R})$ and $\sigma > 0$, define
a smoothed function by convolution,
$$\ell_\sigma(x) := P\ell(x + \sigma Z) = \frac{1}{\sigma\sqrt{2\pi}}\int \ell(y)\exp\left(-\frac{(y - x)^2}{2\sigma^2}\right)dy.$$
For weak convergence of probability measures on the real line, we can augment
the equivalences from Section 1 by a further collection involving specific properties
of the real line:
$$P_n(-\infty, x] \to P(-\infty, x] \qquad \text{for all } x \in \mathbb{R} \text{ with } P\{x\} = 0,$$
$$P_ne^{itx} \to Pe^{itx} \qquad \text{for all real } t.$$
$$f(x + y) = f(x) + yf'(x) + \tfrac{1}{2}y^2f''(x) + R(x, y),$$
where $R(x, y) = \tfrac{1}{6}y^3f'''(x^*)$ for some $x^*$ lying between $x$ and $x + y$. Consequently,
$|R(x, y)| \leq C|y|^3$, with $C := \tfrac{1}{6}\sup_z|f'''(z)|$.
If $X$ and $Y$ are independent random variables, with $P|Y|^3 < \infty$, then
$$Pf(X + Y) = Pf(X) + P\big(Yf'(X)\big) + \tfrac{1}{2}P\big(Y^2f''(X)\big) + PR(X, Y).$$
Using independence to factorize two of the terms and bounding $|R(X, Y)|$ by $C|Y|^3$,
we get
$$\big|Pf(X + Y) - Pf(X) - (PY)\big(Pf'(X)\big) - \tfrac{1}{2}\big(PY^2\big)\big(Pf''(X)\big)\big| \leq C\,P|Y|^3.$$
Suppose $Z$ is another random variable, independent of $X$, with $P|Z|^3 < \infty$ and
$PZ = PY$ and $PZ^2 = PY^2$. Subtract, cancelling out the first and second moment
contributions, to leave
<17> $$|Pf(X + Y) - Pf(X + Z)| \leq C\big(P|Y|^3 + P|Z|^3\big).$$
For the particular case where $Z$ has a $N(\mu, \sigma^2)$ distribution, with $\mu := PY$ and
$\sigma^2 := \operatorname{var}(Y)$, the third moment bound simplifies slightly. For convenience, write $W$
for $(Z - \mu)/\sigma$, which has a $N(0, 1)$ distribution. Then
$$P|Z|^3 \leq 8|\mu|^3 + 8\sigma^3P|W|^3 \leq 8|PY|^3 + 8\big(PY^2\big)^{3/2}P|W|^3 \leq \big(8 + 8P|W|^3\big)P|Y|^3$$
by Jensen's inequality.
The right-hand side of <17> is therefore less than $C_1P|Y|^3$, where $C_1$ denotes the
constant $(9 + 8P|W|^3)C$.
Now consider independent random variables $\xi_1, \ldots, \xi_k$ with
$$\mu_i := P\xi_i, \qquad \sigma_i^2 := \operatorname{var}(\xi_i).$$
Independently of all the $\{\xi_i\}$, generate $\eta_i$ distributed $N(\mu_i, \sigma_i^2)$, for $i = 1, \ldots, k$.
Choose the $\{\eta_i\}$ so that all $2k$ variables are independent. Define
$$S := \xi_1 + \ldots + \xi_k \qquad \text{and} \qquad T := \eta_1 + \ldots + \eta_k.$$
Repeated application of inequality <17> will lead to the third moment bound for
the difference $Pf(S) - Pf(T)$.
For each $i$ define
$$X_i := \xi_1 + \ldots + \xi_{i-1} + \eta_{i+1} + \ldots + \eta_k, \qquad Y_i := \xi_i, \qquad Z_i := \eta_i.$$
The variables $X_i$, $Y_i$, and $Z_i$ are independent. From <17>, with the upper bound
simplified for the normally distributed $Z_i$,
$$|Pf(X_i + Y_i) - Pf(X_i + Z_i)| \leq C_1P|\xi_i|^3$$
for each $i$. At $i = k$ and $i = 1$ we recover the two sums of interest, $X_k + Y_k =
\xi_1 + \ldots + \xi_k = S$ and $X_1 + Z_1 = \eta_1 + \ldots + \eta_k = T$. Each substitution of a $Z_i$ for
a $Y_i$ replaces one more $\xi_i$ by the corresponding $\eta_i$; the $k$ substitutions replace all
the $\xi_i$ by the $\eta_i$. The accumulated change in the expectations is bounded by a sum
of third moment terms,
<18> $$|Pf(S) - Pf(T)| \leq C_1\sum\nolimits_{i \leq k}P|\xi_i|^3.$$
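The bound can be tested exactly in a case where both expectations are available in closed form (an illustration with assumed ingredients): $f(x) = \cos x$, so $\sup|f'''| \leq 1$ and $C = 1/6$; $\xi_i = \pm 1$ with probability $1/2$ each, so $P\cos(S) = (\cos 1)^k$ by independence, while $T \sim N(0, k)$ has $P\cos(T) = e^{-k/2}$.

```python
import math

C = 1 / 6                                   # sup|f'''| / 6 for f(x) = cos(x)
E_abs_W3 = 2 * math.sqrt(2 / math.pi)       # P|W|^3 for W ~ N(0, 1)
C1 = (9 + 8 * E_abs_W3) * C

gaps, bounds = [], []
for k in range(1, 30):
    Ef_S = math.cos(1.0) ** k               # P cos(xi_1 + ... + xi_k)
    Ef_T = math.exp(-k / 2)                 # P cos(T) for T ~ N(0, k)
    gaps.append(abs(Ef_S - Ef_T))
    bounds.append(C1 * k)                   # C1 * sum_i P|xi_i|^3, with P|xi_i|^3 = 1
print(max(g / b for g, b in zip(gaps, bounds)))
```

The ratio printed is well below 1: the third moment bound is far from sharp here, which is typical; its virtue is generality, not tightness.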
We have only to add an extra subscript to get the basic central limit theorem.
It is cleanest to state the theorem in terms of a triangular array of random
variables,
$$\begin{matrix} \xi_{1,1} & \ldots & \xi_{1,k(1)} \\ \xi_{2,1} & \ldots & \xi_{2,k(2)} \\ \xi_{3,1} & \ldots & \xi_{3,k(3)} \\ & \vdots & \end{matrix}$$
The variables within each row are assumed independent. Nothing need be assumed
about the relationship between variables in different rows; all calculations are
carried out for a fixed row. By working with triangular arrays we eliminate various
centering and scaling constants that might otherwise be needed.
<19> Theorem. Let $\{\xi_{n,i} : i = 1, \ldots, k(n)\}$ be a triangular array of random variables, independent within each row, for which
(i) $\sum_i P\xi_{n,i} \to \mu$;
(ii) $\sum_i \operatorname{var}(\xi_{n,i}) \to \sigma^2$;
(iii) $\sum_i P|\xi_{n,i}|^3 \to 0$.
Then $\sum_i \xi_{n,i} \rightsquigarrow N(\mu, \sigma^2)$.

<20> Exercise. If $X_n$ has a $\text{Bin}(n, p_n)$ distribution and if $np_n(1 - p_n) \to \infty$, show that
$$\frac{X_n - np_n}{\sqrt{np_n(1 - p_n)}} \rightsquigarrow N(0, 1).$$
SOLUTION: Manufacture a random variable with the same distribution as the
standardized $X_n$ as follows. Let $A_{n,1}, \ldots, A_{n,n}$ be independent events with $PA_{n,i} = p_n$,
for $i = 1, \ldots, n$. Define
$$\xi_{n,i} := \frac{A_{n,i} - p_n}{\sqrt{np_n(1 - p_n)}}.$$
Check the conditions of Theorem <19> with $\mu := 0$ and $\sigma^2 := 1$ for these $\xi_{n,i}$.
The centering was chosen to give $P\xi_{n,i} = 0$, so (i) holds. By direct calculation,
$\operatorname{var}(\xi_{n,i}) = 1/n$, so $\sum_i \operatorname{var}(\xi_{n,i}) = 1$: requirement (ii) holds. Finally, because $|A_{n,i} - p_n| \leq 1$,
$$\sum\nolimits_i P|\xi_{n,i}|^3 \leq \big(np_n(1 - p_n)\big)^{-1/2}\sum\nolimits_i P|\xi_{n,i}|^2 = \big(np_n(1 - p_n)\big)^{-1/2} \to 0. \qquad\square$$
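The conclusion can be checked against exact binomial probabilities (a sketch; the particular $n$ and $p$ below are arbitrary assumptions): the standardized binomial distribution function approaches the $N(0, 1)$ distribution function as $np(1 - p)$ grows.

```python
import math

def standardized_gap(n, p):
    """max_k |P{(X_n - np)/sd <= z_k} - Phi(z_k)| from exact Bin(n, p) probabilities."""
    sd = math.sqrt(n * p * (1 - p))
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        cdf += math.exp(log_pmf)
        z = (k - n * p) / sd
        gap = max(gap, abs(cdf - 0.5 * (1 + math.erf(z / math.sqrt(2)))))
    return gap

g_small, g_large = standardized_gap(50, 0.4), standardized_gap(2000, 0.4)
print(g_small, g_large)    # the gap shrinks as n p (1 - p) grows
```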
<21> Example. Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with $PX_1 = 0$ and $PX_1^2 = 1$. Then $(X_1 + \ldots + X_n)/\sqrt{n} \rightsquigarrow N(0, 1)$.
Apply Theorem <19> to the variables $\xi_{n,i} := X_i\{|X_i| \leq \sqrt{n}\}/\sqrt{n}$, for $i =
1, \ldots, n$. Notice that
$$\mu_n := PX_1\{|X_1| \leq \sqrt{n}\} = -PX_1\{|X_1| > \sqrt{n}\}, \qquad \text{because } PX_1 = 0,$$
which gives the bound
$$\sum\nolimits_i|P\xi_{n,i}| = \sqrt{n}\,|\mu_n| \leq \sqrt{n}\,P|X_1|\{|X_1| > \sqrt{n}\} \leq PX_1^2\{|X_1| > \sqrt{n}\} \to 0.$$
To control the sum of variances, use the identical distributions together with the fact
that $P\xi_{n,i} = \mu_n/\sqrt{n} = o(1/n)$ to deduce that
$$\sum\nolimits_i\operatorname{var}(\xi_{n,i}) = PX_1^2\{|X_1| \leq \sqrt{n}\} - \mu_n^2 \to 1.$$
For the third moment bound use
$$\sum\nolimits_iP|\xi_{n,i}|^3 = P|X_1|^3\{|X_1| \leq \sqrt{n}\}/\sqrt{n} \leq \epsilon\,PX_1^2 + PX_1^2\{|X_1| > \epsilon\sqrt{n}\},$$
for each $\epsilon > 0$, which forces $\sum_iP|\xi_{n,i}|^3 \to 0$.
It follows that $\sum_i\xi_{n,i} \rightsquigarrow N(0, 1)$. Complete the proof via an appeal to
Example <6>, using the inequality
$$P\Big\{\sum\nolimits_iX_i/\sqrt{n} \neq \sum\nolimits_i\xi_{n,i}\Big\} \leq nP\{|X_1| > \sqrt{n}\} \leq PX_1^2\{|X_1| > \sqrt{n}\} \to 0. \qquad\square$$
with conditions expressible as $H_n(\epsilon) \to 0$ for each $\epsilon > 0$, for some sequence
of functions $H_n$. It is convenient to be able to replace $\epsilon$ by a sequence $\{\epsilon_n\}$ that
converges to zero and also have $H_n(\epsilon_n) \to 0$. Roughly speaking, if a condition holds
for all fixed $\epsilon$ then it will also hold for sequences $\{\epsilon_n\}$ tending to zero slowly enough.

<22> Lemma. Suppose $H_n(\epsilon) \to 0$ as $n \to \infty$, for each fixed $\epsilon > 0$. Then there exists
a sequence $\{\epsilon_n\}$ of positive numbers converging to zero such that $H_n(\epsilon_n) \to 0$.

Proof. For each positive integer $k$ there exists an $n_k$ such that $|H_n(1/k)| \leq 1/k$ for
$n \geq n_k$. Without loss of generality, assume $n_1 < n_2 < \ldots$. Define
$$\epsilon_n := \begin{cases} \text{arbitrary} & \text{for } n < n_1, \\ 1/k & \text{for } n_k \leq n < n_{k+1}. \end{cases}$$
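The proof is constructive and can be transcribed directly (with an assumed $H_n(\epsilon) := e^{-n\epsilon}$ standing in for a generic sequence satisfying the hypothesis, and a finite horizon $N$ in place of all of $\mathbb{N}$):

```python
import math

def slow_epsilons(H, N, K):
    """epsilon_n -> 0 with H(n, epsilon_n) -> 0, following the proof of the Lemma:
    for each k <= K, pick n_k with |H(m, 1/k)| <= 1/k for every m in [n_k, N]."""
    n_ks = []
    n_start = 1
    for k in range(1, K + 1):
        n = n_start
        while not all(abs(H(m, 1 / k)) <= 1 / k for m in range(n, N + 1)):
            n += 1
        n_ks.append(n)
        n_start = n + 1                        # enforce n_1 < n_2 < ...
    eps = []
    for n in range(1, N + 1):
        k = sum(1 for nk in n_ks if nk <= n)   # largest k with n_k <= n
        eps.append(1.0 if k == 0 else 1 / k)   # 'arbitrary' value before n_1
    return eps

H = lambda n, e: math.exp(-n * e)              # H_n(eps) -> 0 for each fixed eps
eps = slow_epsilons(H, N=500, K=10)
print(eps[0], eps[-1], H(500, eps[-1]))
```

The construction makes the trade-off visible: $\epsilon_n$ decreases only when the next checkpoint $n_k$ has been passed, which is what "slowly enough" means.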
<23> Exercise. Let $\{X_{n,i} : i = 1, \ldots, k(n)\}$ be a triangular array of random variables, independent within each row, for which:
(i) $PX_{n,i} = 0$ for each $i$;
(ii) $\sum_i PX_{n,i}^2 = 1$;
(iii) $L_n(\epsilon) := \sum_i PX_{n,i}^2\{|X_{n,i}| > \epsilon\} \to 0$ for each $\epsilon > 0$ [Lindeberg's condition].
Show that $\sum_i X_{n,i} \rightsquigarrow N(0, 1)$.

SOLUTION: Invoke Lemma <22> with $H_n(\epsilon) := L_n(\epsilon)/\epsilon^2$ to find $\epsilon_n$ tending to zero
slowly enough to ensure that $L_n(\epsilon_n)/\epsilon_n^2 \to 0$. Define a triangular array of random
variables $\xi_{n,i} := X_{n,i}\{|X_{n,i}| \leq \epsilon_n\}$. Notice that
$$P\{\xi_{n,i} \neq X_{n,i} \text{ for some } i\} \leq \sum\nolimits_iP\{|X_{n,i}| > \epsilon_n\} \leq L_n(\epsilon_n)/\epsilon_n^2 \to 0.$$
By Example <6> it suffices to show that $\sum_i\xi_{n,i} \rightsquigarrow N(0, 1)$.
Check the conditions of Theorem <19>. For the first moments:
$$\sum\nolimits_i|P\xi_{n,i}| = \sum\nolimits_i\big|PX_{n,i}\{|X_{n,i}| > \epsilon_n\}\big| \leq L_n(\epsilon_n)/\epsilon_n \to 0.$$
For the variances, use the fact that $P\xi_{n,i} = -PX_{n,i}\{|X_{n,i}| > \epsilon_n\}$ to show that
$$\sum\nolimits_i\operatorname{var}(\xi_{n,i}) = \sum\nolimits_iPX_{n,i}^2\{|X_{n,i}| \leq \epsilon_n\} - \sum\nolimits_i\big(PX_{n,i}\{|X_{n,i}| > \epsilon_n\}\big)^2.$$
The first sum on the right-hand side equals $\sum_iPX_{n,i}^2 - L_n(\epsilon_n)$, which tends to 1.
The second sum is bounded above by $L_n(\epsilon_n)$. For the third moments:
$$\sum\nolimits_iP|\xi_{n,i}|^3 \leq \epsilon_n\sum\nolimits_iPX_{n,i}^2 = \epsilon_n \to 0. \qquad\square$$
<24>
The Fourier tools from Chapter 8 offer the simplest method for showing that the
distribution does not depend on the particular choice of $R$.
Problem [14] shows how to adapt the one-dimensional calculations to derive a
multivariate analog of approximation <18>. The assertion of Theorem <19> holds
if we reinterpret $\sigma^2$ to be a variance matrix. The derivations of other multivariate
central limit theorems follow in much the same way as before.

Example. If $X_1, X_2, \ldots$ are independent, identically distributed random vectors
with $PX_i = 0$ and $P|X_i|^2 < \infty$, then $(X_1 + \ldots + X_n)/\sqrt{n} \rightsquigarrow N(0, V)$, where $V :=
P(X_1X_1')$.
In short, the multivariate versions of the results from Section 2 present little
extra challenge if we merely translate the methods into vector notation. Fans
of the more traditional approach to the theory (based on pointwise convergence
of distribution functions) might find the extensions via multivariate distribution
functions more tedious. Textbooks seldom engage in such multivariate exercises,
because there is a more pleasant alternative: By means of a simple Fourier device
(to be discussed in Section 8.6), multivariate results can often be reduced directly
to their univariate analogs. There is not much incentive to engage in multivariate
proofs, except as a way of deriving explicit error bounds (see Section 10.4).
4.

It is easier to write something like $f(x) = f(0) + xf'(0) + o(|x|)$ near 0
than to write out the formal epsilon-delta details of the limit properties that define
differentiability of $f$ at 0.

The order symbols have stochastic analogs, which are almost indispensable in
advanced asymptotic theory. As with any very concise notation, it is easy to conceal
subtle errors if the symbols are not used carefully; but without them, all but the
simplest arguments become notationally overwhelming.

<25> Definition. For random vectors $\{X_n\}$ and nonnegative random variables $\{\alpha_n\}$,
write $X_n = O_p(\alpha_n)$ to mean: for each $\epsilon > 0$ there is a finite constant $M$ such that
$P\{|X_n| > M\alpha_n\} < \epsilon$ eventually. Write $X_n = o_p(\alpha_n)$ to mean: $P\{|X_n| > \epsilon\alpha_n\} \to 0$
for each $\epsilon > 0$.

Typically $\{\alpha_n\}$ is a sequence of constants, but occasionally random bounds are
useful.
REMARK. The notation allows us to write things like $o_p(\alpha_n) = O_p(\alpha_n)$, but not
$O_p(\alpha_n) = o_p(\alpha_n)$, meaning that if $X_n$ is of order $o_p(\alpha_n)$ then it is also of order $O_p(\alpha_n)$,
but not conversely. It might help to think of $O_p(\cdot)$ and $o_p(\cdot)$ as defining classes of
sequences of random variables, and perhaps even to write $X_n \in O_p(\alpha_n)$ or $X_n \in o_p(\alpha_n)$
instead of $X_n = O_p(\alpha_n)$ or $X_n = o_p(\alpha_n)$.
If you worry about the $2\epsilon$ in the bound, replace $\epsilon$ by $\epsilon/2$ throughout the
previous paragraph.
Example. If $\{X_n\}$ is a sequence of real random variables of order $O_p(1)$, what
can be asserted about $\{1/X_n\}$? Nothing. The stochastic order symbol $O_p(\cdot)$ conveys
no information about lower bounds. For example, if $X_n = 1/n$ then $X_n = O_p(1)$ but
$1/X_n = n \to \infty$. You should invent other examples.
The blunder of asserting $1/O_p(1) = O_p(1)$ is quite common. Be warned.
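The counterexample fits in a few lines of code (deterministic, so the $O_p$ statements hold trivially):

```python
# X_n = 1/n is O_p(1): |X_n| <= 1 for every n, with probability one.
X = [1 / n for n in range(1, 1001)]
assert all(abs(x) <= 1 for x in X)

# But 1/X_n = n eventually exceeds any fixed constant M.
recip = [1 / x for x in X]
M = 100.0
exceed = sum(1 for r in recip if r > M)
print(exceed)    # most of the sequence already exceeds M = 100
```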
Example. For a sequence of constants $\{a_n\}$ that tends to zero, suppose $X_n =
O_p(a_n)$. Let $g$ be a function, defined on the range space of the $\{X_n\}$, for which
$g(x) = o(|x|)$ near 0. Then the random variables $g(X_n)$ are of order $o_p(a_n)$. To
prove the assertion, for given $\epsilon > 0$ and $\epsilon' > 0$, find $M$ and then $\delta > 0$ such that
$P\{|X_n| > Ma_n\} < \epsilon'$ eventually, and $|g(x)| \leq \epsilon|x|/M$ when $|x| \leq \delta$. When $n$ is large
enough, we have $Ma_n \leq \delta$ and
$$P\{|g(X_n)| > \epsilon a_n\} \leq P\{|X_n| > Ma_n\} < \epsilon'. \qquad\square$$
*5.
<32>
<33>
<34>
$\limsup_n P_n\ell \leq 2\epsilon$ for every $\ell$ in $BL(\mathcal{X})^+$ for which $0 \leq \ell \leq \{x \notin K\}$: for such
an $\ell$, the open set $G := \{\ell < \epsilon\}$ contains $K$ and
<35> $$P_n\ell \leq \epsilon + P_nG^c \leq 2\epsilon \qquad \text{eventually.}$$
This equivalent form of uniform tightness is better suited to the passage to the limit.
<36> Theorem. Every uniformly tight sequence $\{P_n\}$ of probability measures on $\mathcal{B}(\mathcal{X})$ contains a subsequence that converges weakly to a tight probability measure.

<37> Lemma. For each $\delta > 0$, $\epsilon > 0$, and each compact set $K$, there exists a finite
collection $\mathcal{G} = \{g_0, g_1, \ldots, g_k\} \subseteq BL(\mathcal{X})^+$ such that:
(i) $g_0(x) + \ldots + g_k(x) = 1$ for each $x \in \mathcal{X}$;
(ii) the diameter of each set $\{g_i > 0\}$, for $i \geq 1$, is less than $\delta$;
(iii) $g_0 \leq \epsilon$ on $K$.
REMARK. A finite collection of nonnegative, continuous functions that sum to
one everywhere is called a continuous partition of unity.
Proof. Let $x_1, \ldots, x_k$ be the centers of open balls of radius $\delta/4$ whose union covers
the compact set $K$. Define functions $f_0 :\equiv \epsilon/2$ and $f_i(x) := \big(1 - 2d(x, x_i)/\delta\big)^+$, for
$i \geq 1$, in $BL(\mathcal{X})^+$. Notice that $f_i(x) = 0$ if $d(x, x_i) \geq \delta/2$, for $i \geq 1$. Thus the set
$\{f_i > 0\}$ has diameter less than $\delta$ for $i \geq 1$.
The function $F(x) := \sum_{i=0}^{k}f_i(x)$ is everywhere greater than $\epsilon/2$, and it belongs
to $BL(\mathcal{X})^+$. The nonnegative functions $g_i := f_i/F$ are bounded by 1 and satisfy a
Lipschitz condition:
$$|g_i(x) - g_i(y)| = \frac{|F(y)f_i(x) - F(x)f_i(y)|}{F(x)F(y)} \leq \frac{|f_i(x) - f_i(y)|}{F(x)} + \frac{|F(y) - F(x)|\,f_i(y)}{F(x)F(y)}.$$
For each $x$ in $K$ there is an $i$ for which $d(x, x_i) < \delta/4$. For that $i$ we have
$f_i(x) > 1/2$ and $g_0(x) \leq f_0(x)/f_i(x) < \epsilon$. The $g_i$ sum to 1 everywhere. They are
the required functions.
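The construction is concrete enough to run. A sketch on $\mathcal{X} = \mathbb{R}$ with $K = [0, 1]$ (the values of $\delta$, $\epsilon$, and the grid of centers are assumptions for illustration) checks properties (i)-(iii) of the Lemma at a grid of test points:

```python
delta, eps = 0.2, 0.1
centers = [i / 20 for i in range(21)]    # balls of radius delta/4 = 0.05 cover [0, 1]

f0 = lambda x: eps / 2
fs = [lambda x, c=c: max(0.0, 1 - 2 * abs(x - c) / delta) for c in centers]
F = lambda x: f0(x) + sum(f(x) for f in fs)
g = [lambda x: f0(x) / F(x)] + [lambda x, f=f: f(x) / F(x) for f in fs]

grid = [i / 1000 for i in range(-500, 1501)]       # test points in [-0.5, 1.5]
sums_to_one = all(abs(sum(gi(x) for gi in g) - 1) < 1e-12 for x in grid)    # (i)
small_diameter = all(max(pts) - min(pts) < delta
                     for f in fs
                     for pts in [[x for x in grid if f(x) > 0]])            # (ii)
g0_small_on_K = all(g[0](x) < eps for x in grid if 0 <= x <= 1)             # (iii)
print(sums_to_one, small_diameter, g0_small_on_K)
```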
Proof of Theorem <36>. Write $K_i$ for the compact set given by Definition <34>
with $\epsilon := 1/i$. For each $i$ in $\mathbb{N}$ write $\mathcal{G}_i$ for the finite collection of functions
in $BL(\mathcal{X})^+$ constructed via Lemma <37> with $\delta := \epsilon := 1/i$ and $K$ equal to $K_i$.
The class $\mathcal{G} := \cup_{i \in \mathbb{N}}\mathcal{G}_i$ is countable.
For each $g$ in $\mathcal{G}$ the sequence of real numbers $\{P_ng\}$ is bounded. It has a
convergent subsequence. Via a Cantor diagonalization argument, we can construct
a single subsequence $\mathbb{N}_1 \subseteq \mathbb{N}$ for which $\lim_{n \in \mathbb{N}_1}P_ng$ exists for every $g$ in $\mathcal{G}$.
The approximation properties of $\mathcal{G}$ will ensure existence of the limit $\lambda\ell :=
\lim_{n \in \mathbb{N}_1}P_n\ell$ for every $\ell$ in $BL(\mathcal{X})^+$. With no loss of generality, suppose $\|\ell\|_{BL} \leq 1$.
Given $\epsilon > 0$, choose an $i > 1/\epsilon$, then write $\mathcal{G}_i = \{g_0, g_1, \ldots, g_k\}$, with indexing
as in Lemma <37>. The open set $G_i := \{g_0 < \epsilon\}$ contains $K_i$, which ensures that
$\limsup_nP_nG_i^c \leq \epsilon$.
For each $j \geq 1$ let $x_j$ be any point at which $g_j(x_j) > 0$. If $x$ is any other point
with $g_j(x) > 0$ we have $|\ell(x) - \ell(x_j)| \leq d(x, x_j) \leq \epsilon$. It follows that, for every $x$
in $\mathcal{X}$,
$$\Big|\ell(x) - \sum_{j=1}^{k}\ell(x_j)g_j(x)\Big| \leq \epsilon\sum_{j=1}^{k}g_j(x) + g_0(x) \leq 2\epsilon + \{x \in G_i^c\}.$$
It then follows, via the existence of $\lim_{n \in \mathbb{N}_1}P_ng_j$, that $\limsup_{n \in \mathbb{N}_1}P_n\ell$ differs from
$\liminf_{n \in \mathbb{N}_1}P_n\ell$ by at most $6\epsilon$. The limit $\lambda\ell := \lim_{n \in \mathbb{N}_1}P_n\ell$ exists.
The limit functional $\lambda$ inherits linearity from the $P_n$. Clearly $\lambda 1 = 1$. It inherits
tightness from the uniform tightness of the sequence, as in <35>. From Theorem <33>, the functional $\lambda$ corresponds to a tight probability measure to which
$\{P_n : n \in \mathbb{N}_1\}$ converges in distribution.
REMARK. For readers who know more about general topology: the Cantor
diagonalization argument could be replaced by an argument with ultrafilters, or
universal subnets, for uniformly tight nets of probability measures on more general
topological spaces. Lemma <37> was needed only to allow us to work with
sequences; it could be avoided.
<38>

6. Problems

[1]

[2]

[3]

[4]
(i) For each $\epsilon > 0$, show that there is a partition of $\partial B$ into disjoint Borel sets
$\{D_i : i \in \mathbb{N}\}$, each with diameter less than $\epsilon$. Hint: Consider the union of balls
of radius $\epsilon/3$ centered at the points of a countable dense subset of $\partial B$.
(ii) For each $i$, find points $x_i \in B$ and $y_i \in B^c$ such that $d(x_i, D_i) < \epsilon$ and
$d(y_i, D_i) < \epsilon$. Define a probability measure $P'$ by replacing all the $P$ mass
in $D_i$ by a point mass (of size $PD_i$) at $x_i$, for each $i$. Define $P''$ similarly, by
concentrating the mass in each $D_i$ at $y_i$. Show that $P'B = P\bar{B}$ and $P''B = PB^\circ$
for each $\epsilon > 0$.
(iii) Show that, for each $\ell$ with $\|\ell\|_{BL} := K < \infty$,
$$|P'\ell - P\ell| \leq 2K\epsilon \qquad \text{and} \qquad |P''\ell - P\ell| \leq 2K\epsilon.$$
Deduce that $P' \rightsquigarrow P$ and $P'' \rightsquigarrow P$ as $\epsilon \to 0$, even though we have $P'B \equiv P\bar{B}$
and $P''B \equiv PB^\circ$, which differ from $PB$ when $P(\partial B) > 0$.
[5]

[6] Let $\mathcal{X}$ be a metric space. Show that the map $\psi : (x, y) \mapsto d(x, y)$ is continuous.
Deduce that $\psi$ is $\mathcal{B}(\mathcal{X}^2)\backslash\mathcal{B}(\mathbb{R})$-measurable. If $\mathcal{X}$ is separable, deduce that $\psi$ is
$\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{X})\backslash\mathcal{B}(\mathbb{R})$-measurable, and hence $\omega \mapsto d(X(\omega), Y(\omega))$ is measurable if $X$
and $Y$ are random elements of $\mathcal{X}$.
[7]
[8]
Let X be a separable metric space. Show that Δ(P, Q) := sup{|Pℓ − Qℓ| : ‖ℓ‖_BL ≤ 1} is a metric for weak convergence. That is, show that P_n ⇝ P if and only if Δ(P_n, P) → 0, by following these steps. (Read the proof of Lemma <37> for hints.)

(i) Show that Δ is a metric, and |Pℓ − Qℓ| ≤ ‖ℓ‖_BL Δ(P, Q), for each ℓ. Deduce that P_nℓ → Pℓ, for each ℓ ∈ BL(X), if Δ(P_n, P) → 0.

(ii) Let X₀ := {x_i : i ∈ N} be dense in X. For a fixed ε > 0, define f₀(x) :≡ ε and f_i(x) := (1 − d(x_i, x)/ε)⁺. Define G_k := ∪_{i=1}^k {f_i > 1/2}. Choose k so that PG_k^c < ε. Show that each function ℓ_i := f_i / Σ_{0≤j≤k} f_j belongs to BL(X). Show that ℓ₀(x) ≤ 2ε for x ∈ G_k. Show that Σ_{i=0}^k ℓ_i ≡ 1. Show that diam{ℓ_i > 0} ≤ 2ε for i ≥ 1.

(iii) For each i ≥ 1, choose an x_i from {ℓ_i > 0}. Show that, for each probability measure Q,

|Qℓ − Σ_{i=1}^k ℓ(x_i) Qℓ_i| ≤ 4ε + QG_k^c   if ‖ℓ‖_BL ≤ 1.

(iv) For each probability measure Q and each ℓ with ‖ℓ‖_BL ≤ 1, show that

|Qℓ − Pℓ| ≤ 8ε + QG_k^c + PG_k^c + Σ_{i=1}^k |Qℓ_i − Pℓ_i|.

(v) If P_n ⇝ P, deduce that Δ(P_n, P) ≤ 10ε eventually.
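For finitely supported P and Q, the supremum defining Δ(P, Q) involves only the values ℓ(x_i) at the support points, constrained by |ℓ(x_i)| ≤ 1 and |ℓ(x_i) − ℓ(x_j)| ≤ d(x_i, x_j); any feasible assignment extends to a function on the whole line with the same bounds. The sketch below is a crude grid search over those values, assuming the BL norm is max(sup|ℓ|, Lipschitz constant); the function name, grid step, and test measures are illustrative choices only.

```python
from itertools import product

def bl_distance_discrete(points, p, q, step=0.25):
    """Grid-search approximation to Delta(P, Q) = sup{|Pl - Ql| : ||l||_BL <= 1}
    for P, Q supported on a common finite set of real points."""
    n = len(points)
    grid = [round(-1 + k * step, 10) for k in range(int(2 / step) + 1)]
    best = 0.0
    for values in product(grid, repeat=n):
        # feasibility: |l(x_i) - l(x_j)| <= |x_i - x_j| (bound |l| <= 1 is built into grid)
        if all(abs(values[i] - values[j]) <= abs(points[i] - points[j])
               for i in range(n) for j in range(i + 1, n)):
            best = max(best, abs(sum((p[i] - q[i]) * values[i] for i in range(n))))
    return best

# Point masses delta_0 and delta_1: the optimal l gives l(0) - l(1) = 1.
print(bl_distance_discrete([0.0, 1.0], [1.0, 0.0], [0.0, 1.0]))  # -> 1.0
# For far-apart points the bound |l| <= 1 caps the distance at 2.
print(bl_distance_discrete([0.0, 4.0], [1.0, 0.0], [0.0, 1.0]))  # -> 2.0
```

The exact optimum here is a small linear program; the grid search is only meant to make the dual definition of Δ concrete.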
[9]
Let Y and Z be metric spaces, such that Y is separable. Let d be the metric on Y × Z defined in Example <14>. Let P₀, P be probability measures on B(Y), and Q₀, Q be probability measures on B(Z). For a fixed ℓ in BL(Y × Z), define h₀(z) := P₀^y ℓ(y, z) and g_Q(y) := Q^z ℓ(y, z).

(i) Show that ‖h₀‖_BL ≤ ‖ℓ‖_BL and ‖g_Q‖_BL ≤ ‖ℓ‖_BL.

(ii) Let Δ_Y and Δ_Z denote the analogs of the metric from Problem [8]. Show that

|(P ⊗ Q)ℓ − (P₀ ⊗ Q₀)ℓ| ≤ |Pg_Q − P₀g_Q| + |Qh₀ − Q₀h₀|.

(iii) Show that Δ_{Y×Z}(P ⊗ Q, P₀ ⊗ Q₀) ≤ Δ_Y(P, P₀) + Δ_Z(Q, Q₀).

(iv) If P_n ⇝ P₀ and Q_n ⇝ Q₀, show that P_n ⊗ Q_n ⇝ P₀ ⊗ Q₀, even if Z is not separable. (For separable Z, the result follows from (iii); otherwise use (ii).)
[10]
[11]
Suppose μ_n → μ and σ_n² → σ², with both limits finite. Let Z have a N(0,1) distribution. Show that |Pℓ(μ_n + σ_n Z) − Pℓ(μ + σZ)| ≤ ‖ℓ‖_BL P(1 ∧ (|μ_n − μ| + |σ_n − σ| |Z|)). Deduce, by Dominated Convergence, that N(μ_n, σ_n²) ⇝ N(μ, σ²).
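Because the limit N(μ, σ²) has a continuous distribution function, the convergence asserted in the Problem is equivalent to pointwise (indeed uniform) convergence of distribution functions, which can be watched numerically. A standard-library sketch; the grid and parameter sequences are ad hoc choices.

```python
import math

def norm_cdf(x, mu, sigma):
    # Phi((x - mu)/sigma) via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def sup_cdf_gap(mu_n, s_n, mu, s, grid=None):
    # Kolmogorov-style distance over a crude grid; enough to see the trend
    if grid is None:
        grid = [x / 10.0 for x in range(-80, 81)]
    return max(abs(norm_cdf(x, mu_n, s_n) - norm_cdf(x, mu, s)) for x in grid)

gaps = [sup_cdf_gap(1 + 1 / n, 2 + 1 / n, 1.0, 2.0) for n in (1, 10, 100, 1000)]
print(gaps)  # decreasing toward 0
```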
[12]

(ii) Extend the result to sequences of random vectors. Hint: Use part (i) to prove boundedness of {μ_n} and each diagonal element of {V_n}. Use Cauchy–Schwarz to deduce that all elements of {V_n} are bounded.
[13]
Suppose the random variables {X_n} converge in distribution to X, and that {A_n} and {B_n} are sequences of constants with A_n → A and B_n → B (both limits finite). Show that A_nX_n + B_n ⇝ AX + B. Generalize to random vectors.
[14]
Let Y be a random k-vector with μ := PY and V := var(Y). Let V have the representation V = LΛ²L′, with L an orthogonal matrix and Λ := diag(λ₁, …, λ_k), each λ_i nonnegative. Define R := LΛ. Let W be a random k-vector of independent N(0,1) random variables.

(i) Show that |μ| ≤ P|Y|. Hint: For a unit vector u in the direction of μ, use the fact that u′Y ≤ |Y|.

(ii) Show that P|RW|³ = P(Σ_i λ_i²W_i²)^{3/2} ≤ (trace V)^{3/2} P|N(0,1)|³.

(iii) Show that P|μ + RW|³ ≤ 8P|Y|³ + 8(P|Y|²)^{3/2} P|N(0,1)|³.
[15]
[16]
Let {X_n} be a sequence of random variables, all defined on the same probability space. If X_n = o_p(1), we know from Chapter 2 that there is a subsequence {X_{n_i} : i ∈ N} for which X_{n_i}(ω) = o(1) for almost all ω. If, instead, X_n = O_p(1), must there exist a subsequence for which X_{n_i}(ω) = O(1) for almost all ω? Hint: Let {ξ_i : i ∈ N₀} be a sequence of independent random variables, each distributed Uniform(0,1). Consider X_n := (ω − ξ_n)^{−1}.
[17]
[18] Let {X_n} and {Y_n} be sequences of random k-vectors, with X_n and Y_n defined on the same space and independent of each other. Suppose X_n − Y_n = O_p(1). Show that there exists a sequence of nonrandom vectors {a_n} for which X_n − a_n = O_p(1). Hint: For probability measures P and Q, show that if P^x Q^y ψ(x − y) ≤ M then P^x ψ(x − y) ≤ M for at least one y. Use Problem [17].
[19]
[20]
Let {X_{n,i}} be a triangular array of random variables, independent within each row and satisfying

(a) Σ_i P{|X_{n,i}| > ε} → 0 for each ε > 0,

(b) Σ_i PX_{n,i}{|X_{n,i}| …
Let {X_{n,i}} be a triangular array of random variables, independent within each row and satisfying

(i) max_i |X_{n,i}| → 0 in probability,

(ii) Σ_i PX_{n,i}{|X_{n,i}| ≤ ε} → μ for each ε > 0,

(iii) Σ_i var(X_{n,i}{|X_{n,i}| ≤ ε}) → σ² < ∞ for each ε > 0.

Show that Σ_i X_{n,i} ⇝ N(μ, σ²). Hint: Define η_{n,i} := X_{n,i}{|X_{n,i}| ≤ ε_n} and ξ_{n,i} := η_{n,i} − Pη_{n,i}, where ε_n tends to zero slowly enough that: (a) P{max_i |X_{n,i}| > ε_n} → 0; (b) Σ_i PX_{n,i}{|X_{n,i}| ≤ ε_n} → μ; and (c) Σ_i var(X_{n,i}{|X_{n,i}| ≤ ε_n}) → σ².
[22]
If each of the components {X_{ni}}, for i = 1, …, k, of a sequence of random k-vectors {X_n} is of order O_p(1), show that X_n = O_p(1).
[23]
7. Notes

See Daw & Pearson (1972) and Stigler (1986a, Chapter 2) for discussion of De Moivre's 1733 derivation of the normal approximation to the binomial distribution (reproduced on pages 243–250 of the 1967 reprint of de Moivre 1718), the first central limit theorem.
Theorem <19> is essentially due to Liapounoff (1900, 1901), although the
method of proof is due to Lindeberg (1922). I adapted my exposition of Lindeberg's
method, in Section 2, from Billingsley (1968, section 7), via Pollard (1984,
Section III.4). The development of the CLT, from the simple idea described in
Section 1 to formal limit theorems, has a long history, culminating in the work of
several authors during the 1920's and 30's. For example, building on Lindeberg's
ideas, Lévy (1931, Section 10) conjectured the form of the general necessary and
sufficient condition for a sum of independent random variables to be approximately
normally distributed, but established only the sufficiency. Apparently independently
of each other, Lévy (1935) and Feller (1935) established necessary conditions for
the CLT, under an assumption that individual summands satisfy a mild "asymptotic
negligibility" condition. See the discussion by Le Cam (1986). Chapter 4 of Lévy (1937) and Chapter 3 of Lévy (1970) provide more insights into Lévy's thinking about the CLT and the role of the normal distribution. The idea of using truncation to obtain general CLT's from the Lindeberg version of the theorem runs through much of Lévy's work.
Later works, such as Gnedenko & Kolmogorov (1949/68, Chapter 5) and
Petrov (1972/75, Section IV.4), treat the CLT as a special case of more general
limit theorems for infinitely divisible distributions; compare, for example, the direct
argument in Problem [20] with Theorem 3 in Section 25 of the former or with
Theorem 16 in Chapter 4 of the latter.
Theorem <36> for complete separable metric spaces is due to Prohorov (1956).
Independently, Le Cam (1957) proved similar results for more general topological
spaces. The monograph of Billingsley (1968) is still an excellent reference for
the theory of weak convergence on metric spaces. Together with the slightly
more abstract account by Parthasarathy (1967), Billingsley's exposition stimulated
widespread interest in weak convergence methods by probabilists and statisticians
(including me) during the 1970's. See Dudley (1989, Chapter 11) for an elegant
treatment that weaves in more recent ideas.
The stochastic order notation of Section 4 is due to Mann & Wald (1943). For
further examples see the survey paper by Chernoff (1956) and Pratt (1959).
REFERENCES
Le Cam, L. (1986), 'The central limit theorem around 1935', Statistical Science 1, 78–96.
Lévy, P. (1931), 'Sur les séries dont les termes sont des variables éventuelles indépendantes', Studia Mathematica 3, 119–155.
Lévy, P. (1935), 'Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou enchaînées', Journal de Math. Pures Appl. 14, 347–402.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Second edition, 1954.
Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.
Liapounoff, A. M. (1900), 'Sur une proposition de la théorie des probabilités', Bulletin de l'Académie impériale des Sciences de St.-Pétersbourg 13, 359–386.
Liapounoff, A. M. (1901), 'Nouvelle forme du théorème sur la limite de probabilité', Mémoires de l'Académie impériale des Sciences de St.-Pétersbourg.
Lindeberg, J. W. (1922), 'Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung', Mathematische Zeitschrift 15, 211–225.
Mann, H. B. & Wald, A. (1943), 'On stochastic limit and order relationships', Annals of Mathematical Statistics 14, 217–226.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York.
Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pratt, J. W. (1959), 'On a general concept of "in probability"', Annals of Mathematical Statistics 30, 549–558.
Prohorov, Y. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and Its Applications 1, 157–214.
Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, Massachusetts.
Topsøe, F. (1970), Topology and Measure, Vol. 133 of Springer Lecture Notes in Mathematics, Springer-Verlag, New York.
Varadarajan, V. S. (1961), 'Measures on topological spaces', Mat. Sbornik 55(97), 35–100. English translation in American Mathematical Society Translations 48 (1965), 161–228.
Chapter 8
Fourier transforms
SECTION 1 presents a few of the basic properties of Fourier transforms that make them
such a valuable tool of probability theory.
SECTION 2 exploits a mysterious coincidence, involving the Fourier transform and
the density function of the normal distribution, to establish inversion formulas for
recovering distributions from Fourier transforms.
SECTION *3 explains why the coincidence from Section 2 is not really so mysterious.
SECTION 4 shows that the inversion formula from Section 2 has a continuity property,
which explains why pointwise convergence of Fourier transforms implies convergence
in distribution.
SECTION *5 establishes a central limit theorem for triangular arrays of martingale
differences.
SECTION 6 extends the theory to multivariate distributions, pointing out how the calculations reduce to one-dimensional analogs for linear combinations of coordinate variables, the Cramér and Wold device.
SECTION *7 provides a direct proof (no Fourier theory) of the fact that the family of (one-dimensional) distributions for all linear combinations of a random vector uniquely determines its multivariate distribution.
SECTION *8 illustrates the use of complex-variable methods to prove a remarkable property of the normal distribution, the Lévy–Cramér theorem.
1.
The Fourier transform of a random variable X is defined by

ψ_X(t) := P^ω exp(i X(ω) t)   for t in R.

Fourier transforms are well defined for every probability measure on B(R), and

|ψ_P(t)| = |P^x exp(ixt)| ≤ P^x |exp(ixt)| = 1,

|ψ_P(t + δ) − ψ_P(t)| ≤ P^x |e^{ix(t+δ)} − e^{ixt}| = P^x |e^{ixδ} − 1|.
Theorem. If X₁, …, X_n are independent then the Fourier transform of X₁ + … + X_n factorizes into ψ_{X₁}(t) ⋯ ψ_{X_n}(t), for all real t.

Proof. Extend the factorization property of real functions of the X_j's to the complex functions exp(itX_j).

REMARK. Be careful that you do not invent a false converse to the Theorem. If X has a Cauchy distribution and Y = X then ψ_{X+Y}(t) = exp(−2|t|) = ψ_X(t)ψ_Y(t), but X and Y are certainly not independent (Problem [6]).
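A quick numerical illustration of the Theorem (not of its proof): for two independent Bernoulli(1/2) variables, the transform of the sum matches the product of the transforms at every t. The helper name is an illustrative choice.

```python
import cmath

def ft_discrete(dist, t):
    # Fourier transform sum_x p(x) e^{ixt} of a finitely supported distribution
    return sum(p * cmath.exp(1j * x * t) for x, p in dist.items())

X = {0: 0.5, 1: 0.5}             # Bernoulli(1/2)
S = {0: 0.25, 1: 0.5, 2: 0.25}   # law of X1 + X2 for independent X1, X2
for t in (0.3, 1.0, 2.7):
    assert abs(ft_discrete(S, t) - ft_discrete(X, t) ** 2) < 1e-12
```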
<2> Theorem. If P|X|^k < ∞, for a positive integer k, then the Fourier transform has the approximation

ψ_X(t) = 1 + itPX + ((it)²/2!)PX² + … + ((it)^k/k!)PX^k + o(|t|^k)

for t near 0.

Proof. Apply Problem [5] to P cos(tX) and P sin(tX).
<3> Example. The Poisson(λ) distribution has Fourier transform exp(λ(e^{it} − 1)). An appeal to the central limit theorem and a suitable passage to the limit will lead us to the Fourier transform for the normal distribution.

Suppose X₁, X₂, … is a sequence of independent random variables, each distributed Poisson(1). By the central limit theorem for identically distributed summands, Z_n := (X₁ + … + X_n − n)/√n ⇝ N(0,1). For each fixed t the function e^{ixt} is bounded and continuous in x. Thus lim_{n→∞} P exp(itZ_n) is the Fourier transform of the N(0,1) distribution. Evaluate the limit using Theorem <1>:

P exp(itZ_n) = e^{−it√n} ∏_{k=1}^n exp(e^{it/√n} − 1) = exp(n(e^{it/√n} − 1) − it√n).

Notice the way the error term behaves as a function of n for fixed t. In the limit we get exp(−t²/2) as the Fourier transform of the N(0,1) distribution.
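The passage to the limit can be made explicit by a two-term Taylor expansion of the exponent; this is a standard calculation, sketched here in the notation of the Example:

```latex
% Expand e^{ix} = 1 + ix - x^2/2 + O(|x|^3) with x = t/\sqrt{n}, t fixed:
\begin{aligned}
n\bigl(e^{it/\sqrt{n}} - 1\bigr) - it\sqrt{n}
  &= n\Bigl(\frac{it}{\sqrt{n}} - \frac{t^{2}}{2n} + O\bigl(n^{-3/2}\bigr)\Bigr) - it\sqrt{n} \\
  &= -\tfrac{1}{2}t^{2} + O\bigl(n^{-1/2}\bigr)
  \;\longrightarrow\; -\tfrac{1}{2}t^{2},
\end{aligned}
```

so P exp(itZ_n) → exp(−t²/2), with an O(n^{−1/2}) error for each fixed t.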
2. Inversion formula

Written out in terms of the N(0,1) density, the final result from Example <3> becomes

(1/√(2π)) ∫ exp(iyt − ½y²) dy = exp(−½t²)   for real t.
The function on the right-hand side looks very like a normal density; it lacks only the standardizing constant. If we substitute t/σ for t, then make a change of variables z = y/σ, we get an integral representation for the N(0, σ²) density φ_σ:

<4>   (1/2π) ∫ exp(izt − ½σ²z²) dz = (1/(σ√(2π))) exp(−½t²/σ²) =: φ_σ(t).

This equality is a special case of a general inversion formula that relates Fourier transforms to densities. Indeed, the general formula can be derived from the special case.
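As a sanity check on the special case, the N(0,1) transform ψ(t) = exp(−t²/2) should invert back to the N(0,1) density. A standard-library sketch; the truncation range and step are ad hoc choices (the Riemann sum is extremely accurate here because the integrand is smooth and rapidly decaying).

```python
import cmath, math

def invert(psi, y, zmax=12.0, n=4000):
    # Midpoint-rule approximation to (1/2pi) * integral exp(-izy) psi(z) dz
    h = 2 * zmax / n
    total = 0j
    for k in range(n):
        z = -zmax + (k + 0.5) * h
        total += cmath.exp(-1j * z * y) * psi(z) * h
    return (total / (2 * math.pi)).real

psi_normal = lambda z: math.exp(-z * z / 2)
for y in (0.0, 0.7, -1.3):
    exact = math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
    assert abs(invert(psi_normal, y) - exact) < 1e-6
```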
Suppose Z has a N(0,1) distribution, independently of a random variable X with Fourier transform ψ(·). From the convolution formula for densities with respect to Lebesgue measure (Section 4.4), for σ > 0 the sum X + σZ has a distribution with density f^(σ)(y) := P^ω φ_σ(y − X(ω)). Substituting for φ_σ from <4>, we have

<5>   f^(σ)(y) = (1/2π) ∫ exp(−izy − ½σ²z²) ψ(z) dz.

For each bounded, measurable h,

<7>   Ph(X + σZ) = ∫ h(y) f^(σ)(y) dy,

which shows, via formula <5>, that the Fourier transform of the random variable X uniquely determines the density for the distribution of X + σZ. Specialize to h in C(R), the class of all bounded, continuous real functions on R. By Dominated Convergence, Ph(X + σZ) → Ph(X) as σ → 0. Thus the Fourier transform uniquely determines all the expectations Ph(X), with h ranging over C(R). A generating class argument shows that these expectations uniquely determine the distribution of X, as a probability measure on B(R).
You might feel tempted to rearrange limit operations in the previous proof, to arrive at an assertion that

Ph(X) = lim_{σ→0} ∫ h(y) f^(σ)(y) dy = ∫ h(y) lim_{σ→0} f^(σ)(y) dy,

that is, to assert that the distribution of X has density

<8>   f(y) = lim_{σ→0} f^(σ)(y) = (1/2π) ∫ exp(−izy) ψ(z) dz

with respect to Lebesgue measure. Of course such temptation should be resisted. The migration of limits inside integrals typically requires some domination assumptions. Moreover, it would be exceedingly strange to derive densities for measures that are not absolutely continuous with respect to Lebesgue measure.
<9> Example. Let P denote the probability measure that puts mass 1/2 at ±1. It has Fourier transform ψ(t) = (e^{it} + e^{−it})/2 = cos t. The integral on the right-hand side of <8> does not exist for this Fourier transform. Application of formulas <5> then <4> gives

<10>   f^(σ)(y) = ½φ_σ(y + 1) + ½φ_σ(y − 1).

That is, f^(σ) is the density for the mixture ½N(−1, σ²) + ½N(+1, σ²), a density that is trying to behave like two point masses. The limit of f^(σ) does not exist in the ordinary sense.
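The identity <10> can be checked numerically: with ψ(t) = cos t, the smoothing integral of <5> should reproduce the two-bump mixture density. A standard-library sketch with ad hoc discretization choices.

```python
import math

def phi_sigma(t, sigma):
    # N(0, sigma^2) density
    return math.exp(-t * t / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def f_sigma(y, sigma, zmax=60.0, n=20000):
    # Real part of (1/2pi) * integral exp(-izy - sigma^2 z^2/2) cos(z) dz;
    # the imaginary part cancels because its integrand is odd in z.
    h = 2 * zmax / n
    total = 0.0
    for k in range(n):
        z = -zmax + (k + 0.5) * h
        total += math.cos(z * y) * math.exp(-sigma ** 2 * z * z / 2) * math.cos(z) * h
    return total / (2 * math.pi)

sigma = 0.5
for y in (0.0, 0.8, 2.0):
    mixture = 0.5 * phi_sigma(y - 1, sigma) + 0.5 * phi_sigma(y + 1, sigma)
    assert abs(f_sigma(y, sigma) - mixture) < 1e-6
```

Repeating the comparison with smaller σ shows the two bumps sharpening toward the point masses at ±1.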
Theorem. If P has Fourier transform ψ for which ∫ |ψ(z)| dz < ∞, then P is absolutely continuous with respect to Lebesgue measure, with a density

f(y) := (1/2π) ∫ exp(−izy) ψ(z) dz.

Moreover, by Dominated Convergence, for each bounded, continuous h,

lim_{σ→0} ∫ h(y) f^(σ)(y) dy = ∫ h(y) f(y) dy.

That is, Ph = ∫ hf for a large enough class of functions h to ensure (via a generating class argument) that P has density f.
REMARK. The inversion formula for integrable Fourier transforms is the basis for a huge body of theory, including Edgeworth expansions, rates of convergence in the central limit theorem, density estimation, and more. See for example the monograph of Petrov (1972/75).
*3. A mystery?
I have always been troubled by the mysterious workings of Fourier transforms. For example, it seems an enormous stroke of luck that the Fourier transform of the normal distribution is proportional to its density function, the key to the inversion formula <5>. Perhaps it would be better to argue slightly less elegantly, to see what is really going on.
Start from the slightly less mysterious fact that there exists at least one random variable W whose Fourier transform ψ₀ is Lebesgue integrable. (See Problem [3] for one way to ensure integrability of the Fourier transform.) Symmetrize by means of an independent copy W′ of W to get a random variable W − W′ with a real-valued Fourier transform ψ = |ψ₀|², which is also Lebesgue integrable. That is, there exists a probability distribution whose Fourier transform ψ is both nonnegative and Lebesgue integrable. For some finite constant c₀, the function c₀ψ defines a density (with respect to Lebesgue measure m on the real line) of a probability measure Q on B(R).

For h bounded and measurable,

P^x Q^y h(x + σy) = σ^{−1} P^x m^w (h(w) c₀ψ((w − x)/σ)),
4. Convergence in distribution

The representation <5> shows not only that the density f^(σ) of X + σZ is uniquely determined by the Fourier transform ψ_X of X but also that it depends on ψ_X in a continuous way. The factor exp(−½σ²z²) ensures that small perturbations of ψ_X do not greatly affect the integral. The traditional way to make this continuity idea precise involves an assertion about pointwise convergence of Fourier transforms. The proof makes use of a simple fact about smoothing and a simple consequence of Dominated Convergence known as Scheffé's Lemma:

Let f, f₁, f₂, … be nonnegative, μ-integrable functions for which f_n → f a.e. [μ] and μf_n → μf. Then μ|f_n − f| → 0.

See Section 3.1 for the proof.
<11> Lemma. Let X, X₁, X₂, … and Z be random variables for which X_n + σZ ⇝ X + σZ for each σ > 0. Then X_n ⇝ X.

Proof. For ℓ in BL(R),

|Pℓ(X_n) − Pℓ(X)| ≤ |Pℓ(X_n) − Pℓ(X_n + σZ)| + |Pℓ(X_n + σZ) − Pℓ(X + σZ)| + |Pℓ(X + σZ) − Pℓ(X)|.

On the right-hand side, the middle term tends to zero, by assumption. The other two terms are both bounded by ‖ℓ‖_BL P(1 ∧ σ|Z|), which tends to zero as σ → 0.

<12> Continuity Theorem. Let X, X₁, X₂, … be random variables with Fourier transforms ψ, ψ₁, ψ₂, … for which ψ_n(t) → ψ(t) for each real t. Then X_n ⇝ X.

Proof. Let Z have a N(0,1) distribution independent of X and all the X_n. The random variables X_n + σZ have distributions with densities f_n^(σ), given by <5>. Pointwise convergence ψ_n → ψ and Dominated Convergence (the integrand in <5> is dominated by the integrable exp(−½σ²z²)) give f_n^(σ)(y) → f^(σ)(y) for every y. For each h bounded in absolute value by a constant M,

|∫ h(y) f_n^(σ)(y) dy − ∫ h(y) f^(σ)(y) dy| ≤ M ∫ |f_n^(σ)(y) − f^(σ)(y)| dy → 0

by Scheffé's Lemma. Thus X_n + σZ ⇝ X + σZ for each σ > 0, and the asserted convergence in distribution follows via Lemma <11>.

<13>
Example. Suppose X₁, X₂, … are independent, identically distributed random variables with PX_i = 0 and PX_i² = 1. From Chapter 7 we know that

(X₁ + … + X_n)/√n ⇝ N(0,1).

Here is the proof of the same result using Fourier transforms.

Let ψ(t) be the Fourier transform of X₁. From Theorem <2>,

ψ(t) = 1 − ½t² + o(t²)   as t → 0.

The standardized sum has Fourier transform

ψ(t/√n)^n = (1 − ½t²/n + o(1/n))^n → exp(−t²/2)   as n → ∞.

The limit equals the N(0,1) Fourier transform. By the Continuity Theorem, the asymptotic normality of the standardized sum follows.
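The same limit can be watched numerically for a concrete distribution: for X_i uniform on (−√3, √3) (mean 0, variance 1), the standardized sum has the explicit Fourier transform below, which approaches exp(−t²/2). A small standard-library sketch; the function name is an illustrative choice.

```python
import math

def psi_std_uniform_sum(t, n):
    # FT of (X1+...+Xn)/sqrt(n) for Xi iid Uniform(-sqrt(3), sqrt(3)):
    # psi_X(s) = sin(sqrt(3) s)/(sqrt(3) s), evaluated at s = t/sqrt(n), raised to n
    s = t / math.sqrt(n)
    base = 1.0 if s == 0 else math.sin(math.sqrt(3) * s) / (math.sqrt(3) * s)
    return base ** n

for t in (0.5, 1.0, 2.0):
    target = math.exp(-t * t / 2)
    assert abs(psi_std_uniform_sum(t, 10000) - target) < 1e-3
```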
Certainly the calculations in the previous Example involve less work than the Lindeberg argument plus the truncations used in Chapter 7 to establish the same central limit theorem. I would point out, however, the amount of theory needed to establish the Continuity Theorem. Moreover, for the corresponding Fourier transform proofs of central limit theorems for more general triangular arrays, the calculations would parallel those for the Lindeberg method.
REMARK. A slightly stronger version of the Continuity Theorem (due to Cramér; see the Notes) also goes by the same name. The assumptions of the Theorem can be weakened (Problem [9]) to mere existence of a pointwise limit ψ(t) = lim_n ψ_n(t) with ψ continuous at the origin. Initially it need not be identified as the Fourier transform of some specified distribution. The stronger version of the Theorem asserts the existence of a distribution P for which ψ is the Fourier transform, such that X_n ⇝ P.
*5.

<14>
Temporarily write z_j for itξ_{n,j}. Notice that Σ_{j≤n} z_j² = −t² + o_p(1), by (i). Also, when |t|M_n ≤ 1/2, which happens with probability tending to 1,

Σ_{j≤n} |r(z_j)| ≤ Σ_{j≤n} |z_j|³ ≤ |t|³ M_n Σ_{j≤n} ξ_{n,j}² = o_p(1).

Thus … in probability.

To strengthen the assertion to P|Y_n| → 0, it is enough to show that sup_n P|Y_n|² < ∞, for then

P|Y_n| ≤ M^{−1} P|Y_n|² + P(|Y_n|{|Y_n| ≤ M}) = O(M^{−1}) + o(1).

…

which gives sup_n P|X_n|² ≤ sup_n exp(2t²)(1 + |t|² PM_n²) < ∞. The asserted central limit theorem follows.
REMARKS.

(i) Notice the role played by the stopping times τ_n. They ensure that properties of the increments that hold with probability tending to one can be made to hold everywhere, if we can ignore the effect of the increment corresponding to τ_n = j. To control ξ_{n,τ_n} the Theorem had to impose constraints on …

(ii) The sum Σ_j ξ_{n,j}² plays the same role as a sum of variances in the corresponding theory for independent variables. Martingale central limit theorems sometimes impose constraints on the sum of conditional variances P_{j−1}ξ_{n,j}². The sum of squared increments corresponds to the "square brackets" process [X_n] of a martingale, and the sum of conditional variances corresponds to the "pointy brackets" process ⟨X_n⟩. These two processes are also called the quadratic variation process for X_n and the compensator for X_n², respectively.
6.

Example.

<16>

Pass to image measures to deduce that the joint distribution Q_{X,Y} of X and Y has the same Fourier transform as the product Q_X ⊗ Q_Y of the marginal distributions. By the uniqueness result (i) for Fourier transforms of distributions on R^k, we must have Q_{X,Y} = Q_X ⊗ Q_Y. That is, X and Y are independent.

You might regard this result as a sort of converse to Theorem <1>. Don't slip into the error of checking only that the factorization holds when s and t happen to be equal.
Example. A random n-vector X is said to have a multivariate normal distribution if each linear combination t′X, for t ∈ R^n, has a normal distribution. In particular, each coordinate X_i has a normal distribution, with a finite mean and a finite variance. The random vector must have a well defined expectation μ := PX and variance matrix V := P(X − μ)(X − μ)′. The distribution of t′X is therefore N(t′μ, t′Vt), which implies that X must have Fourier transform

P exp(it′X) = exp(it′μ − ½t′Vt).

Write N(μ, V) for the probability distribution with this Fourier transform.

Every μ in R^n and nonnegative definite matrix V defines such a distribution. For if Z is an n-vector of independent N(0,1) random variables and if we factorize V = RR′, then X := μ + RZ has the stated Fourier transform. When V is nonsingular, the N(μ, V) distribution has density
(2π)^{−n/2} (det V)^{−1/2} exp(−½(x − μ)′V^{−1}(x − μ))

with respect to Lebesgue measure on R^n.
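The factorization step can be checked numerically. A sketch assuming NumPy is available; the particular μ and V below are arbitrary illustrative choices, with the factor R built from the eigendecomposition (so V = RR′, matching the construction above and in Problem [14] of Chapter 7).

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive definite V, factorized via its eigendecomposition V = L diag(lam) L'
V = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, L = np.linalg.eigh(V)
R = L @ np.diag(np.sqrt(lam))          # then R R' = V
assert np.allclose(R @ R.T, V)

mu = np.array([1.0, -1.0, 0.5])
Z = rng.standard_normal((200000, 3))   # rows: independent N(0,1) 3-vectors
X = mu + Z @ R.T                       # each row distributed N(mu, V)
assert np.allclose(X.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(X.T), V, atol=0.1)
```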
*7.
This result follows via the Jacobian formula (Rudin 1974, Chapter 8) for change of variable from the density (2π)^{−n/2} exp(−|x|²/2) for the N(0, I) distribution. (In fact, the Jacobian formula for nonsingular linear transformations is easy to establish by means of invariance arguments.)

If V is singular, the N(μ, V) distribution concentrates on a translate of a subspace, {x ∈ R^n : (x − μ)′y = 0 whenever Vy = 0}, that is, on μ + range(V).
… = C_m ∫₀^∞ (g(t)/t^{2m+1}) dt ≠ 0,
The expansion

e^{−x} = 1 − x + x²/2! − x³/3! + ⋯ + (−x)^{m−1}/(m − 1)! + (−x)^m exp(−x*)/m!,   with 0 ≤ x* ≤ x,

shows that e^{−x} is the sum of a polynomial of degree m − 1 plus a remainder term, r(x), of order O(x^m) near x = 0, such that r(x) > 0 for all x > 0 if m is even, and r(x) < 0 for all x > 0 if m is odd. Replace x by x²/2, divide by √(2π), then integrate from 0 to t to derive a corresponding expansion for the standard normal distribution function: Φ(t) = p(t) + R(t) for all t ≥ 0, where p(t) := Σ_{k=0}^{2m−1} β_k t^k and R(t) is a remainder of order O(t^{2m+1}) near t = 0 that takes only positive (if m is even) or negative (if m is odd) values. Note that β₀ = Φ(0) = 1/2. The constant κ := ∫₀^∞ R(t) t^{−2m−1} dt is finite and nonzero.
By construction,

…

The integral is nonzero if we can also ensure that Σ_i α_i θ_i^{2m} ≠ 0.

A simple piece of linear algebra establishes existence of a suitable vector α := (α₁, …, α_{2m+1}). Write U_k for (θ₁^k, …, θ_{2m+1}^k). The vector U_{2m} could not be a linear combination Σ_{k=0}^{2m−1} γ_k U_k, for otherwise the polynomial θ^{2m} − Σ_{k=0}^{2m−1} γ_k θ^k of degree 2m would have 2m + 1 distinct zeros, θ₁, …, θ_{2m+1}.
*8.

<17> …

A similar argument gives a similar bound for the lower tail. Thus

<18> …

It follows (Problem [11]) that the function g(z) := P exp(zX) is well defined and is an analytic function of z throughout the complex plane C. It inherits a growth condition from the tail bound <18>:

<19>   |g(z)| ≤ 1 + 2|z| ∫₀^∞ exp(|z|x − x²/2) dx ≤ exp(C + |z|²),   by <18>.

A similar argument shows that h(z) := P exp(zY) is also well defined for every z. By independence, g(z)h(z) = exp(z²/2) for all z ∈ C, and hence g(z) ≠ 0 for all z. It follows (Rudin 1974, Theorem 13.11) that there exists an analytic function γ(·) on C such that g(z) = exp(γ(z)). We may choose γ so that γ(0) = 0, because g(0) = 1. (In effect, log g can be defined as a single-valued, analytic function on C.) The analytic function γ has a power series expansion γ(z) = Σ_{n≥1} γ_n z^n that converges uniformly to γ on each bounded subset of C.
Decompose γ(re^{iθ}) into its real and imaginary parts, U(r, θ) + iV(r, θ). Then, from <19>, we have exp(U) = |exp(U + iV)| ≤ exp(C + r²), which gives

<20>   U(r, θ) ≤ C + r².

Uniform convergence of the power series expansion for γ on the circle |z| = r lets us integrate term by term, giving

∫₀^{2π} γ(re^{iθ}) exp(−inθ) dθ = 2πγ_n r^n   for n = 1, 2, 3, …,
∫₀^{2π} γ(re^{iθ}) exp(−inθ) dθ = 0   for n = 0, −1, −2, ….

Write γ_n = |γ_n| e^{iβ}. Combining these equations (for ±n and for n = 0) and taking real parts, we get

∫₀^{2π} U(r, θ)(2 + 2cos(nθ + β)) dθ = 2π|γ_n| r^n.

The integrand on the left-hand side is at most 4(C + r²). Let r tend to infinity to deduce that γ_n = 0 for n = 3, 4, 5, …. That is,

P exp(zX) = exp(γ₁z + γ₂z²).

Problem [12] shows why γ₁ must be real valued and γ₂ must be nonnegative. That is, X has the Fourier transform that characterizes the normal distribution.
9. Problems
[1]
[2]
Suppose a Fourier transform has |ψ_P(t₀)| = 1 for some nonzero t₀. That is, ψ_P(t₀) = exp(iθ₀) for some real θ₀. Show that P concentrates on the lattice of points {(θ₀ + 2πn)/t₀ : n ∈ Z}. Hint: Show ℜ(1 − P^x exp(it₀x − iθ₀)) = 0. What do you know about 1 − cos(t₀x − θ₀)?
[3]
Suppose f is both integrable with respect to Lebesgue measure m on the real line and absolutely continuous, in the sense that f(x) = m^t ({t ≤ x} ḟ(t)) for all x, for some integrable function ḟ.

(i) Show that f(x) → 0 as |x| → ∞. Hint: Show m^t ḟ(t) = 0 to handle x → +∞.
(ii) Show that m^x (f(x)e^{ixt}) = −m^x (e^{ixt} ḟ(x))/(it) for t ≠ 0. Hint: For safe Fubini, write the left-hand side as lim_{C→∞} m^x (e^{ixt}{|x| ≤ C} m^s (ḟ(s){s ≤ x})).

(iii) If a probability density has Lebesgue integrable derivatives up to kth order, prove that its Fourier transform is of order O(|t|^{−k}) as |t| → ∞.
[4]
Let m denote Lebesgue measure on B(R). For each f in L¹(m) show that m^x (f(x)e^{ixt}) → 0 as |t| → ∞. Hint: Check the result for f equal to the indicator function of a bounded interval. Show that the linear space spanned by such indicator functions is dense in L¹(m). Alternatively, approximate f by linear combinations of densities with integrable derivatives, then invoke Problem [3].
[5]
Let g(·) be a bounded real function on the real line with bounded derivatives up to order k + 1, and let X be a random variable for which P|X|^k < ∞. Let R_k(·) be the remainder in the Taylor expansion up to kth power:

R_k(x) := g(x) − Σ_{j=0}^{k} (x^j/j!) g^{(j)}(0).

(i) Show that |R_k(x)| ≤ C min(|x|^k, |x|^{k+1}), for some constant C.

(ii) Invoke Dominated Convergence to show that

Pg(tX) = Σ_{j=0}^{k} (t^j/j!) g^{(j)}(0) PX^j + o(t^k)   as t → 0.
[7]
[8]
[9]
Suppose X is a random variable with Fourier transform ψ_X(·). For each δ > 0, show that

(1/δ) ∫₀^δ (1 − ℜψ_X(t)) dt → 0   as δ → 0.

(iii) Suppose {X_n} is a sequence of random variables whose Fourier transforms ψ_n converge pointwise to a function ψ that is continuous at the origin. Show that X_n is of order O_p(1).

(iv) Show that the ψ from part (iii) equals the Fourier transform of some random variable Z representing the limit in distribution of a subsequence of {X_n}. Deduce via Theorem <12> that X_n ⇝ Z.
[10]
Suppose a random variable X has a Fourier transform ψ for which ψ(t) = 1 + O(t²) near t = 0.

(i) Show that there exists a finite constant C such that

C ≥ P(X²{|X| ≤ M})/2   for all M.

Deduce via Dominated Convergence that 2C ≥ PX².

(ii) Show that PX = 0.
[11]
[12]
10. Notes
Feller (1971, Chapters 15 and 16) is a good source for facts about Fourier transforms
(characteristic functions). Much of my exposition in the first four Sections is based
on his presentation, with help from Breiman (1968, Chapter 8).
The idea of generating functions is quite old; see the entries under generating functions in the index to Stigler (1986a) for descriptions of the contributions of De Moivre, Simpson, Lagrange, and Laplace.
Apparently Lévy borrowed the name characteristic function from Poincaré (who used it to refer to what is now known as the moment generating function), when rediscovering the usefulness of the Fourier transform for probability theory calculations, unaware of earlier contributions.
Lévy, P. (1922), 'Sur la détermination des lois de probabilité par leurs fonctions caractéristiques', Comptes Rendus de l'Académie des Sciences, Paris 175, 854–856.
Lévy, P. (1925), Calcul des Probabilités, Gauthier-Villars, Paris.
Lévy, P. (1934), 'Sur les intégrales dont les éléments sont des variables aléatoires indépendantes', Ann. Scuola Norm. Sup. Pisa (2) 3, 337–366.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Page references from the 1954 second edition.
Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.
Loève, M. (1973), 'Paul Lévy, 1886–1971', Annals of Probability 1, 1–18. Includes a list of Lévy's publications.
McLeish, D. L. (1974), 'Dependent central limit theorems and invariance principles', Annals of Probability 2, 620–628.
Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.
Radon, J. (1917), 'Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten', Ber. Verh. Sächs. Akad. Wiss. Leipzig, Math.-Nat. Kl. 69, 262–277.
Rudin, W. (1974), Real and Complex Analysis, second edn, McGraw-Hill, New York.
Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty
Before 1900, Harvard University Press, Cambridge, Massachusetts.
Walther, G. (1997), 'On a conjecture concerning a theorem of Cramer and Wold',
Journal of Multivariate Analysis 63, 313319. Addendum in same journal, 63,
431.
Chapter 9
Brownian motion
SECTION 1 collects together some facts about stochastic processes and the normal
distribution, for easier reference.
SECTION 2 defines Brownian motion as a Gaussian process indexed by a subinterval T
of the real line. Existence of Brownian motions with and without continuous sample
paths is discussed. Wiener measure is defined.
SECTION 3 constructs a Brownian motion with continuous sample paths, using an
orthogonal series expansion of square integrable functions.
SECTION *4 describes some of the finer properties (lack of differentiability, and a modulus of continuity) for Brownian motion sample paths.
SECTION 5 establishes the strong Markov property for Brownian motion. Roughly speaking,
the process starts afresh as a new Brownian motion after stopping times.
SECTION *6 describes a family of martingales that can be built from a Brownian motion,
then establishes Levy's martingale characterization of Brownian motion with continuous
sample paths.
SECTION *7 shows how square integrable functions of the whole Brownian motion path
can be represented as limits of weighted sums of increments. The result is a thinly
disguised version of a remarkable property of the isometric stochastic integral, which
is mentioned briefly.
SECTION *8 explains how the result from Section 7 is the key to the determination of
option prices in a popular model for changes in stock prices.
1. Prerequisites
Broadly speaking, Brownian motion is to stochastic process theory as the normal
distribution is to the theory for real random variables. They both arise as natural
limits for sums of small, independent contributions; they both have rescaling and
transformation properties that identify them amongst wider classes of possible
limits; and they have both been studied in great detail. Every probabilist, and
anyone dealing with continuous-time processes, should learn at least a little about
Brownian motion, one of the most basic and most useful of all stochastic processes.
This Chapter will define the process and explain a few of its properties.
The discussion will draw on a few basic ideas about stochastic processes, and
a few facts about the normal distribution, which are summarized in this Section.
2.
<1>
where P_F(A) := P(FA)/PF for all A in 𝔉. By the uniqueness theorem for Fourier
transforms, B_t − B_s has the same distribution under P_F as under P. In particular,
P_F g(B_t − B_s) = P g(B_t − B_s), for all bounded, measurable functions g.
Once we specify the distribution of B_0, the joint distribution of B_0, B_{t_1}, ..., B_{t_k},
for 0 < t_1 < ... < t_k, is uniquely determined. That is, the fidis are uniquely
determined by Definition <1> and the distribution of B_0. If B_0 is integrable then
so are all the B_t, and the process is a martingale. If B_0 has a normal distribution
(possibly degenerate), then all the fidis are multivariate normal, which makes B
a Gaussian process. If B_0 = x, for a constant x, the process is said to start at x.
In particular, when B_0 = 0, the Brownian motion is a centered Gaussian process,
whose fidis are uniquely determined by the covariances,

cov(B_s, B_t) = cov(B_s, B_s) + cov(B_s, B_t − B_s) = s + 0    if s ≤ t.
<2> Example. Let {(B_t, 𝔉_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous
sample paths. The quantity M(ω) := sup_t |B(t, ω)| is finite for each ω, because each
continuous function is bounded on the compact set [0, 1]. It is an 𝔉_1-measurable
random variable, because {ω : M(ω) > x} = ∪_{s∈S} {|B_s(ω)| > x}, for any countable,
dense subset S of [0, 1]. A countable union of sets from 𝔉_1 also belongs to 𝔉_1.
What happens when the process does not have continuous sample paths, but (i)
and (ii) are satisfied? To show you how bad it can get, I will perform pathwise
surgery on B to create a version that behaves badly. As you will see, the issue is
really one of managing uncountable families of negligible sets. Countable collections
of negligible sets can be ignored, but uncountable collections are capable of causing
real trouble.
From Problem [7], there is a partition of Ω into an uncountable union of
disjoint, 𝔉_1-measurable sets {Ω_t : 0 ≤ t ≤ 1}, with PΩ_t = 0 for every t. Let
ψ(ω) be an arbitrarily nasty, nonmeasurable, nonnegative function on Ω for which
ψ(ω) > M(ω) at each ω. Define B*(t, ω)
A Brownian motion that has continuous sample paths also defines a map,
ω ↦ B(·, ω), from Ω into the space C(T) of continuous, real-valued functions
on T. It becomes a random element of C(T) if we equip that space with its
finite-dimensional (or cylinder) sigma-field 𝒞(T), the sigma-field generated by the
cylinder sets {x ∈ C(T) : (x(t_1), ..., x(t_k)) ∈ A}, with {t_1, ..., t_k} ranging over all
finite subsets of T and A ranging over 𝔅(ℝ^k), for each finite k.
As an 𝔉\𝒞(T)-measurable map, ω ↦ B(·, ω), from Ω into C(T), a Brownian
motion induces a probability measure (its distribution or image measure) on 𝒞(T).
The distribution is uniquely determined by the fidis, because the collection of
cylinder sets is stable under finite intersections and it generates 𝒞(T). For the
simplest case, where B_0 = 0, the distribution is called Wiener measure on 𝒞(T), or,
less precisely, Wiener measure on T. I will denote it by W, relying on context to
identify T.
REMARK.
Each coordinate projection, X_t(x) := x(t), defines a random variable
on C(T). As a stochastic process on the probability space (C(T), 𝒞(T), W), the
family {X_t : t ∈ T} is a Brownian motion with continuous paths, started at 0. For
many purposes, the study of Brownian motion is just the study of W.
For the remainder of the Chapter, you may assume that all Brownian motions
satisfy requirements (i), (ii), and (iv), with B_0 = 0. That is, unless explicitly warned
otherwise, you may assume that all Brownian motions from now on are centered
Gaussian processes with continuous sample paths and cov(B_t, B_s) = t ∧ s, a process
known as standard Brownian motion.
3.
Note that G(h) is uniquely determined as an element of L2(P), but it is only defined
up to an almost sure equivalence as a random variable.
In particular, from the facts (c) and (d) in Section 1, the random variable G(h)
has a N(0, ‖h‖²) distribution. Moreover, for each finite subset {h_1, ..., h_k} of ℋ,
the random vector (G(h_1), ..., G(h_k)) has a multivariate normal distribution with
zero means and covariances given by P G(h_i)G(h_j) = ⟨h_i, h_j⟩. That is, all the fidis
of the process are centered multivariate normal. The family of random variables
{G(h) : h ∈ ℋ} is a Gaussian process that is sometimes called the isonormal process,
indexed by the Hilbert space ℋ (compare with Dudley 1973, page 67).
REMARK.
Notice that the map h ↦ G(h) is linear and continuous, as a function
from ℋ into L²(P). Thus G(h) can be recovered, up to an almost sure equivalence,
from the values {G(e_i) : i ∈ ℕ}, for any orthonormal basis {e_i : i ∈ ℕ} for ℋ.
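A concrete finite-dimensional illustration (mine, not the text's): take ℋ = ℝ^d with the Euclidean inner product, so that G(h) = Σ_i h_i Z_i for independent standard normals Z_i. The Python sketch below checks by simulation that cov(G(h), G(g)) = ⟨h, g⟩; all names are hypothetical.

```python
import random

def isonormal(h, z):
    # For H = R^d with the Euclidean inner product, the isonormal
    # process is exactly G(h) = sum_i <h, e_i> Z_i = h . z.
    return sum(hi * zi for hi, zi in zip(h, z))

random.seed(0)
d, n = 3, 200_000
h, g = [1.0, 2.0, 0.0], [0.5, -1.0, 2.0]
total = 0.0
for _ in range(n):
    z = [random.gauss(0.0, 1.0) for _ in range(d)]
    total += isonormal(h, z) * isonormal(g, z)  # centered product
cov = total / n  # should approximate <h, g> = -1.5
```

Linearity in h is immediate from the formula, and G(e_i) = Z_i recovers the basis values mentioned in the Remark.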
H_{i,k} := {J_{2i,k+1}} − {J_{2i+1,k+1}}, the difference of the indicator functions of
the two halves of J_{i,k} := [i/2^k, (i + 1)/2^k). Thus mH_{i,k}² = mJ_{i,k} = 2^{−k}. The
functions ψ_{i,k} := 2^{k/2}H_{i,k} are orthogonal, and each has L²(m) norm 1. As shown
in Section 3 of Appendix B, the collection of functions Ψ := {1} ∪ {ψ_{i,k} : 0 ≤ i < 2^k,
for k = 0, 1, 2, ...} is an orthonormal basis for L²(m). The Brownian motion has the
L²(m) representation,

<3>    B_t = ηt + Σ_{k=0}^∞ 2^{k/2} X_k(t)    with X_k(t) := Σ_{0≤i<2^k} η_{i,k} (f_t, H_{i,k}),

where η and the {η_{i,k}} are mutually independent N(0, 1) random variables, and f_t
denotes the indicator function of [0, t]. As a function of t, each (f_t, H_{i,k}) is nonzero
only in the interval J_{i,k}, within which it is piecewise linear, achieving its maximum
value of 2^{−(k+1)} at the midpoint:

(f_t, H_{i,k}) = 2^{−(k+1)} ∧ (t − 2i/2^{k+1})⁺ − 2^{−(k+1)} ∧ (t − (2i + 1)/2^{k+1})⁺.
[Figure: the tent function t ↦ (f_t, H_{i,k}), which rises from 0 at 2i/2^{k+1} to its peak
2^{−(k+1)} at (2i + 1)/2^{k+1}, and falls back to 0 at (2i + 2)/2^{k+1}.]
The process X_k(t) has continuous, piecewise-linear sample paths. It takes the
value 0 at t = i/2^k, for i = 0, 1, ..., 2^k. It takes the value η_{i,k}/2^{k+1} at the point
(2i + 1)/2^{k+1}. Thus sup_t |X_k(t)| = 2^{−(k+1)} max_i |η_{i,k}|. From property (e) in Section 1,

Σ_k 2^{k/2} P sup_t |X_k(t)| ≤ Σ_k 2^{−k/2−1} P max_{i<2^k} |η_{i,k}| < ∞,

which implies finiteness of Σ_k 2^{k/2} sup_t |X_k(t)| almost everywhere. With probability
one, the random series <3> for B_t converges uniformly in 0 ≤ t ≤ 1; almost
all sample paths of B are continuous functions of t. If we redefine B(t, ω) to be
identically zero for a negligible set of ω, we have a Brownian motion with all its
sample paths continuous.
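The construction is easy to put on a computer. The sketch below (function names mine) truncates series <3> at a finite number of levels and, as a deterministic sanity check, verifies via Parseval's identity that the coefficient variances at a fixed time t sum to nearly Var(B_t) = t:

```python
import random

def tent(t, i, k):
    # <f_t, H_{i,k}>: integral of the Haar function H_{i,k} over [0, t].
    # A tent supported on [i/2^k, (i+1)/2^k], with height 2^-(k+1) at
    # the midpoint (2i+1)/2^(k+1), the piecewise-linear function in <3>.
    h = 2.0 ** -(k + 1)
    return (min(h, max(0.0, t - i / 2**k))
            - min(h, max(0.0, t - (2 * i + 1) / 2**(k + 1))))

def brownian(ts, levels, rng):
    # Truncation of series <3>: B_t ~ eta*t + sum_k 2^(k/2) X_k(t).
    eta = rng.gauss(0.0, 1.0)
    coef = [[rng.gauss(0.0, 1.0) for _ in range(2**k)] for k in range(levels)]
    return [eta * t
            + sum(2 ** (k / 2) * coef[k][i] * tent(t, i, k)
                  for k in range(levels) for i in range(2**k))
            for t in ts]

# Parseval check: coefficient variances at time t sum (nearly) to t.
t, levels = 0.3, 14
var = t**2 + sum(2**k * tent(t, i, k)**2
                 for k in range(levels) for i in range(2**k))
print(abs(var - t) < 1e-3)
```

The truncation error at level K is of order 2^{−K}, which is why fourteen levels already reproduce the variance to three decimals.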
From the Brownian motion B indexed by [0, 1] we could build a Brownian
motion β indexed by ℝ⁺ by defining

β_t := (1 + t) B(t/(1 + t)) − t B(1)    for t ∈ ℝ⁺.

Clearly {β_t : t ∈ ℝ⁺} is a centered Gaussian process with continuous paths and
β_0 = B_0 = 0. You should check that cov(β_s, β_t) = s ∧ t, in order to complete the
argument that it is a Brownian motion.
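Checking cov(β_s, β_t) = s ∧ t is pure bilinearity. A tiny script (mine) expands the covariance from cov(B_u, B_v) = u ∧ v and confirms the collapse on a grid of time points:

```python
def cov_beta(s, t):
    # cov of beta_s, beta_t with beta_t := (1+t) B(t/(1+t)) - t B(1),
    # expanded by bilinearity from cov(B_u, B_v) = min(u, v).
    k = min
    u, v = s / (1 + s), t / (1 + t)
    return ((1 + s) * (1 + t) * k(u, v)
            - t * (1 + s) * k(u, 1)
            - s * (1 + t) * k(1, v)
            + s * t * k(1, 1))

ok = all(abs(cov_beta(s, t) - min(s, t)) < 1e-12
         for s in (0.0, 0.5, 1.3, 7.0) for t in (0.0, 0.2, 2.0, 9.5))
print(ok)
```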
REMARK.
We could also write Brownian motion indexed by ℝ⁺ as a doubly
infinite series,

B_t = Σ_{k∈ℤ} 2^{k/2} Σ_{i≥0} η_{i,k} (f_t, H_{i,k}),

which converges uniformly (almost surely) on each bounded subinterval. When we
focus on [0, 1], the terms for k < 0 contribute t Σ_{k<0} 2^{k/2} η_{0,k}, which corresponds to
the ηt in expansion <3>; and for k ≥ 0, only the terms with 0 ≤ i < 2^k contribute
to B_t.
*4.
with a finite derivative v; that is, x(t) = x(t_0) + (t − t_0)v + o(|t − t_0|) near t_0. For a
positive integer m, let i_m be the integer defined by (i_m − 1)/m ≤ t_0 < i_m/m. Then
the second difference

Δ_{i,m}x := x((i + 2)/m) − 2x((i + 1)/m) + x(i/m)

satisfies m Δ_{i_m,m}x → 0, because substitution of the expansion around t_0 into each
term cancels both x(t_0) and the contribution from v, leaving only o(1/m) remainders.
Similarly, both m Δ_{i_m+2,m}x and m Δ_{i_m+4,m}x must also converge to zero. By considering
successive second differences, we eliminate both t_0 and v, leaving the conclusion
that if x is differentiable (with a finite derivative) at an unspecified point of (0, 1)
then, for all m large enough, there must exist at least one i for which

m|Δ_{i,m}x| ≤ 1    and    m|Δ_{i+2,m}x| ≤ 1    and    m|Δ_{i+4,m}x| ≤ 1.
Apply the same reasoning to each sample path of the Brownian motion
{B_t : 0 ≤ t ≤ 1} to see that the set of ω for which B(·, ω) is somewhere differentiable
on (0, 1) is contained in

liminf_{m→∞} {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1 for some i},

where the Δ_{i,m}B are the second differences for the sample path B(·, ω). Each
of Δ_{0,m}B, Δ_{2,m}B, Δ_{4,m}B, ... has a N(0, 2/m) distribution (being a difference
of two independent random variables, each N(0, 1/m) distributed) and they are
independent. By Fatou's lemma, the probability of the displayed event is smaller
than the liminf of the probabilities

P ∪_i {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1}
    ≤ m (P{m|N(0, 2/m)| ≤ 1})³ = O(m^{−1/2}).
Thus almost all Brownian motion sample paths are nowhere differentiable.
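A small Monte Carlo experiment (helper names mine) forms the second differences over the grid {i/m} and compares the empirical probability of the triple-smallness event with the bound m(P{m|N(0, 2/m)| ≤ 1})³ = O(m^{−1/2}):

```python
import math, random

def prob_small_triple(m, trials, rng):
    # Estimate P{ some i: m|D_i| <= 1, m|D_{i+2}| <= 1, m|D_{i+4}| <= 1 },
    # where D_i is the second difference of B over the grid {j/m}; the
    # three differences use disjoint increments, so they are independent.
    hits = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, math.sqrt(1.0 / m)) for _ in range(m)]
        d = [g[j + 1] - g[j] for j in range(m - 1)]  # D_j ~ N(0, 2/m)
        if any(all(m * abs(d[i + r]) <= 1 for r in (0, 2, 4))
               for i in range(m - 5)):
            hits += 1
    return hits / trials

rng = random.Random(2)
m = 40
# P{|N(0, 2/m)| <= 1/m} via the error function
p_one = math.erf((1.0 / m) / math.sqrt(2 * (2.0 / m)))
bound = m * p_one ** 3           # the O(m^(-1/2)) bound from the text
est = prob_small_triple(m, 4000, rng)
print(est <= 2 * bound)
```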
The nondifferentiability is also suggested by the fact that (B_{t+δ} − B_t)/δ has
a N(0, 1/δ) distribution, which could not settle down to a finite limit as δ tends
to zero, for a fixed t. Indeed, the increment B_{t+δ} − B_t should be roughly of
magnitude √δ. The maximum such increment, D(δ) := sup_{0≤t≤1−δ} |B_{t+δ} − B_t|, is
even larger. For example, part (i) of Problem [2] shows, for small ε > 0 and k large
enough, that

P{ max_{0≤i<2^k} 2^{k/2} |B((i + 1)/2^k) − B(i/2^k)| > (1 − ε)√(2 log 2^k) }

is close to one, from which it follows that

limsup_{δ→0} D(δ)/√(2δ log(1/δ)) ≥ 1    almost surely.

A similar argument (Lévy 1937, page 172; see McKean 1969,
Section 1.6 for a concise presentation of the proof) leads to a sharper conclusion,

<6>    limsup_{δ→0} D(δ)/√(2δ log(1/δ)) = 1    almost surely.
More broadly, √(δ log(1/δ)) gives a global bound for the magnitude of the
increments (a modulus of continuity) as shown in the next Theorem. To avoid
trivial complications for δ near 1, it helps to increase the modulus slightly, by adding
a term that becomes inconsequential for δ near zero.
<7> Theorem. Define h(δ) := √(δ + δ log(1/δ)) for 0 < δ ≤ 1. Then for almost all ω
there is a finite constant C_ω such that

|B_t(ω) − B_s(ω)| ≤ C_ω h(|t − s|)    for all s, t in [0, 1].

Proof. From the series representation <3>,

|B_t − B_s| ≤ |η| |t − s| + Σ_k 2^{k/2} M_k Σ_i |(f_t − f_s, H_{i,k})|,    where M_k := max_i |η_{i,k}|.

For a fixed k, the function f_t − f_s is orthogonal to all H_{i,k} except possibly when t
or s belongs to the support interval J_{i,k}. There are at most two such intervals, and
for them we have the bound |(f_t − f_s, H_{i,k})| ≤ m((s, t] ∩ J_{i,k}) ≤ |s − t| ∧ 2^{−k}. Thus

sup_{s≠t} |B_t − B_s| / h(|t − s|) ≤ sup_{0<δ≤1} (|η| δ + 2 Σ_k 2^{k/2} M_k (δ ∧ 2^{−k})) / h(δ).

From Problem [2], for almost all ω there is a k_0(ω) such that M_k(ω) ≤ 2√(log 2^{k+1})
for k ≥ k_0(ω). From Problem [5] we then have

Σ_k 2^{k/2} (δ ∧ 2^{−k}) √(log 2^{k+1}) ≤ C_0 h(δ)

for a finite constant C_0, which, together with the fact that √δ ≤ h(δ), gives

sup_{s≠t} |B_t(ω) − B_s(ω)| / h(|t − s|) < ∞    for almost all ω. □
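Theorem <7> can be probed numerically: for a discretized path (a crude stand-in for B; names mine), the worst ratio |B_t − B_s|/h(|t − s|) over all grid pairs should stay below a moderate constant.

```python
import math, random

def h(delta):
    # the modulus from Theorem <7>: h(d) = sqrt(d + d*log(1/d))
    return math.sqrt(delta + delta * math.log(1.0 / delta))

rng = random.Random(3)
n = 512
b = [0.0]
for _ in range(n):
    # random-walk approximation of Brownian motion on the grid i/n
    b.append(b[-1] + rng.gauss(0.0, math.sqrt(1.0 / n)))

ratio = max(abs(b[j] - b[i]) / h((j - i) / n)
            for i in range(n + 1) for j in range(i + 1, n + 1))
print(ratio < 10.0)
```

The theorem promises only a finite random constant C_ω, so the point of the check is boundedness, not a specific value.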
5.
That is, P(F g(X)) = (PF)(W g) for each F ∈ 𝔉_t and (at least) for each bounded,
𝒞(ℝ⁺)-measurable function g on C(ℝ⁺). Equivalently,

P(g(X) | 𝔉_t) = W g    almost surely.
Intuitively speaking, the conditioning lets us treat the 𝔉_τ-measurable random variable
τ as a constant, so that B_t − B_τ has a N(0, t − τ) conditional distribution, which
is symmetric about 0. Unfortunately, this line of reasoning takes us beyond the
properties of the abstract, Kolmogorov conditional expectation; if you were paying
very careful attention while reading Chapter 5, you will recognize a covert appeal
to the existence of a conditional distribution. Fortunately, Fubini offers a way around
the technicality.

Rewrite the Markov property as a pathwise decomposition of Brownian motion
into two independent contributions: the restarted process R^tB, shifted back to the
origin (0, 0); and a killed process K^tB, defined by (K^tB)_s := B_{s∧t} for s ∈ ℝ⁺. More
formally, define S^t as the operator that shifts functions to the right,

(S^tx)(s) := x(s − t)    for s ≥ t,
(S^tx)(s) := 0    for 0 ≤ s < t.
Then B = K^tB + S^tR^tB. The Markov property lets us replace R^tB by a new
Brownian motion, independent of 𝔉_t, without changing the distributional properties
of the whole path. For example, for each 𝔉_t-measurable Y and (at least) for each
bounded, product measurable real function f, we have (via a generating class
argument)

P f(Y, B) = P^ω W^x f(Y(ω), K^tB(·, ω) + S^tx).
Now consider the effect of replacing the fixed t by a stopping time τ. For every
sample path we have B(·, ω) = K^{τ(ω)}B(·, ω) + S^{τ(ω)}R^{τ(ω)}B(·, ω). The decomposition
even makes sense for those ω at which τ(ω) = ∞, because K^∞x = x and S^∞ shifts
whatever we decide to define as R^∞x right out of the picture. For concreteness,
perhaps we could take S^∞x ≡ 0. Of course it would be a little embarrassing to
assert that R^τB is, conditionally, a Brownian motion at those ω.
Strong Markov property. Let B be a standard Brownian motion for a filtration
{𝔉_t : t ∈ ℝ⁺}, and let τ be a stopping time. Then for each 𝔉_τ-measurable random
element Y, and (at least) for each bounded, product measurable function f,

P f(Y, B) = P^ω W^x f(Y(ω), K^{τ(ω)}B(·, ω) + S^{τ(ω)}x).
Proof. A generating class argument reduces to the case where f(y, z) := g(y)h(z),
where g is bounded and measurable, and h(z) := h_0(z(s_1), ..., z(s_k)), with h_0 a
bounded, continuous function on ℝ^k and s_1, ..., s_k fixed values in ℝ⁺. Discretize
the stopping time by rounding up to the next multiple of n^{−1},

τ_n := 0{τ = 0} + Σ_{i∈ℕ} (i/n){(i − 1)/n < τ ≤ i/n} + ∞{τ = ∞}.
For each n,

P f(Y, B) = P f(Y, B){τ = ∞} + Σ_{i∈ℕ} P f(Y, B){τ_n = i/n}
    = P f(Y, B){τ = ∞} + Σ_{i∈ℕ} P^ω W^x {τ_n = i/n} g(Y) h(K^{i/n}B + S^{i/n}x)
    = P f(Y, B){τ = ∞} + Σ_{i∈ℕ} P^ω W^x {τ_n = i/n} g(Y) h(K^{τ_n}B + S^{τ_n}x).
As n tends to infinity, τ_n(ω) converges to τ(ω), and hence K^{τ_n}B + S^{τ_n}x → K^τB + S^τx
pointwise (in particular, at each s_j), for each ω and each x ∈ C(ℝ⁺). Continuity of h_0
then gives h(K^{τ_n}B + S^{τ_n}x) → h(K^τB + S^τx). An appeal to Dominated Convergence
completes the proof. □
Corollary. P(f(Y, B) | 𝔉_τ) = W^x f(Y, K^τB + S^τx)    almost surely.

For fixed t ∈ ℝ⁺,
*6.
If you differentiate, you will discover that the distribution of τ has density

a t^{−3/2} exp(−a²/(2t)) / √(2π)

with respect to Lebesgue measure on ℝ⁺.
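Assuming τ denotes the first-passage time of the Brownian motion to a level a > 0 (the standard setting for this density), the density can be checked deterministically against the reflection-principle formula P{τ ≤ t} = 2P{B_t ≥ a}. A short quadrature sketch, with names of my choosing:

```python
import math

def density(a, t):
    # first-passage density to level a > 0, as displayed in the text:
    # f(t) = a * t^(-3/2) * exp(-a^2 / (2t)) / sqrt(2*pi)
    return a * t ** -1.5 * math.exp(-a * a / (2 * t)) / math.sqrt(2 * math.pi)

def reflection_cdf(a, t):
    # P{tau <= t} = 2 P{B_t >= a}, by the reflection principle
    return 2 * (1 - 0.5 * (1 + math.erf(a / math.sqrt(2 * t))))

a, t = 1.0, 2.0
n = 100_000
# integrate the density over (0, t] by the midpoint rule
integral = sum(density(a, (i + 0.5) * t / n) * t / n for i in range(n))
print(abs(integral - reflection_cdf(a, t)) < 1e-4)
```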
<13>    P(F exp(θB_t)) = P(F exp(θB_s)) P exp(θΔ),    where Δ := B_t − B_s,

by independence, for each F in 𝔉_s; and P exp(θΔ) = exp(½θ²(t − s)). Expanding,

P F exp(θB_t) = PF + θ P(F B_t) + (θ²/2!) P(F B_t²) + (θ³/3!) P(F B_t³) + ...,

with a similar expansion for P(F exp(θB_s)). The series converge for all complex θ.
By equating coefficients of powers of θ, we obtain a sequence of equalities that
establish the martingale property for {B_t}, {B_t² − t}, and so on. As an exercise you might
find the term involving B_t³, then check the martingale property by direct calculation,
as I did for B_t² − t.
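A quick simulation sanity check (mine, with an arbitrary bounded 𝔉_s-measurable weight) of the exponential-martingale property; the same pattern checks {B_t} and {B_t² − t}:

```python
import math, random

rng = random.Random(4)
s, t, theta = 0.4, 1.0, 0.7
n = 200_000
lhs = rhs = 0.0
for _ in range(n):
    bs = rng.gauss(0.0, math.sqrt(s))
    bt = bs + rng.gauss(0.0, math.sqrt(t - s))  # independent increment
    f = 1.0 if bs > 0 else 0.0   # a bounded, F_s-measurable weight
    # martingale property: both weighted means should agree
    lhs += f * math.exp(theta * bt - theta**2 * t / 2)
    rhs += f * math.exp(theta * bs - theta**2 * s / 2)
print(abs(lhs - rhs) / n < 0.02)
```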
Given that Fourier transforms determine distributions, it is not surprising that
the martingale property of exp(θB_t − ½θ²t), for all θ, characterizes Brownian
motion (essentially equivalence (iii) of Definition <1>). It is less obvious that the
martingale property for the linear and quadratic polynomials alone should also
characterize Brownian motion with continuous sample paths. This striking fact is
actually just an elegant repackaging of a martingale central limit theorem. The
continuity of the sample paths lets us express the process as a sum of many small
martingale differences, whose individual contributions can be captured by Taylor
expansions to quadratic terms.
<14> Theorem. Let {(X_t, 𝔉_t) : t ∈ ℝ⁺} be a martingale with continuous sample
paths and X_0 = 0, for which {X_t² − t} is also a martingale. Then X is a standard
Brownian motion.

It suffices to show, for each real θ, all s < t, and each F in 𝔉_s, that

<15>    P(F exp(iθ(X_t − X_s) + θ_2(t − s))) = PF,    where θ_2 := ½θ².

I will present the argument only for the notationally simplest case, where s = 0 and
t = 1, leaving to you the minor modifications needed for the more general case.
In principle, the method of proof is just Taylor's theorem. Break X_1 into a
sum of increments Σ_{k=1}^n η_k, where η_k := X(k/n) − X((k − 1)/n). Write P_k for
expectations conditional on 𝔉_{k/n}. The two martingale assumptions give P_{k−1}η_k = 0
and v_{k−1}² := P_{k−1}η_k² = 1/n. Notice that Σ_{k=1}^n v_{k−1}² = 1. For fixed real θ, define

D_k := exp(iθ(η_1 + ... + η_k) + θ_2(v_0² + ... + v_{k−1}²)),
so that D_0 = 1 and D_n = exp(iθX_1 + θ_2). Continuity of the paths should make all
the η_k small, suggesting

D_k = D_{k−1} exp(iθη_k + θ_2v_{k−1}²) ≈ D_{k−1}(1 + iθη_k − θ_2η_k² + θ_2v_{k−1}²)    if v_{k−1}² is small,

so that P_{k−1}D_k ≈ D_{k−1}. To keep the increments genuinely small, replace each η_k
by a stopped increment X((k/n) ∧ τ_n) − X(((k − 1)/n) ∧ τ_n), for a stopping time τ_n
that stops the process before any increment can grow too large.
The conditional variance V_{k−1} of the stopped increment is not equal to the constant 1/n,
but it is true that V_0 + ... + V_{n−1} ≤ 1 and

P(V_0 + ... + V_{n−1}) = Σ_{k=1}^n P((k/n) ∧ τ_n − ((k − 1)/n) ∧ τ_n) = P(1 ∧ τ_n) → 1    as n → ∞,
To keep track of the remainder terms, use the bounds (compare with Problem [8])

e^x = 1 + x + A(x),        e^{iy} = 1 + iy − y²/2 + B(y),    with |B(y)| ≤ |y|³/6.

We still have D_0 = 1, but now D_n = exp(iθY_1 + θ_2(V_0 + ... + V_{n−1})), with Y_1
the sum of the stopped increments, and P(D_nF) converges, by Dominated
Convergence, to P(F exp(iθX_1 + θ_2)) as n → ∞. With bounds on the remainder
terms, the conditioning argument gives a more precise assertion,

|P(D_kF) − P(D_{k−1}F)| ≤ exp(θ_2) C_θ (n^{−1} + ε) P V_{k−1},
Suppose now that a process Z has increments satisfying, conditionally on 𝔉_t,

P(Z_{t+δ} − Z_t | 𝔉_t) ≈ μ(Z_t)δ    and    P((Z_{t+δ} − Z_t)² | 𝔉_t) ≈ σ²(Z_t)δ

for small δ > 0, where μ and σ are suitably smooth functions. If we break [0, t] into
a union of small intervals [t_i, t_{i+1}], each of length δ, then the standardized increments

Δ_iX := (Z(t_{i+1}) − Z(t_i) − μ(Z_{t_i})δ) / σ(Z_{t_i})

are martingale differences for which P((Δ_iX)² | 𝔉_{t_i}) ≈ δ. The sum X_t := Σ_{t_i≤t} Δ_iX
is a discrete martingale for which X_t² − t is also approximately a martingale. It is
possible (see, for example, Stroock & Varadhan 1979, Section 4.5) to make these
heuristics rigorous, by a formal passage to the limit as δ tends to zero, to build a
Brownian motion X from Z, so that Z_{t+δ} − Z_t ≈ μ(Z_t)δ + σ(Z_t)(X_{t+δ} − X_t). By
summing increments and formalizing another passage to the limit, we then represent
Z as a solution to a stochastic integral equation,
*7.
<16>
Theorem. The Hilbert space H := L²(Ω, 𝔉_1, P) equals the closure of the
subspace H_0 spanned by the constants together with the collection of all random
variables of the form (B_t − B_s)h_s, where h_s ranges over all bounded, 𝔉_s-measurable
random variables, and s, t range over all pairs for which 0 ≤ s < t ≤ 1.
Proof. It is enough if we show that a random variable Z in H that is orthogonal to H_0
is also orthogonal to every bounded random variable of the form V := g(B_{t_1}, ..., B_{t_k}).
For then a generating class argument would show that Z must be orthogonal to
every random variable in H, from which it would follow that Z = 0 almost surely.
It even suffices to consider, in place of V, only random variables of the form
U := Π_{j=1}^k exp(iθ_jB_{t_j}), for real constants θ_j. For if

P Z⁺ Π_{j=1}^k exp(iθ_jB_{t_j}) = P Z⁻ Π_{j=1}^k exp(iθ_jB_{t_j})

for all real {θ_j}, the uniqueness theorem for Fourier transforms implies equality of
the measures Q^±, on σ{B_{t_1}, ..., B_{t_k}}, with densities Z^± with respect to P, and hence
Q⁺V = Q⁻V for all V = g(B_{t_1}, ..., B_{t_k}).
Repeated application of the following equality (proved below),

<17>    P(Z Y exp(iθB_t)) = P(Z Y exp(iθB_s)) exp(−½θ²(t − s)),

for all real θ, all s < t, and all bounded, 𝔉_s-measurable random variables Y,
will establish orthogonality of Z and U. The argument is easy. First put Y_1 :=
Π_{j=1}^{k−1} exp(iθ_jB_{t_j}) and θ := θ_k to deduce

P Z Y_1 exp(iθ_kB_{t_k}) = P Z Y_1 exp(iθ_kB_{t_{k−1}}) exp(−½θ_k²(t_k − t_{k−1})).

Then repeat the exercise with Y_2 := Π_{j=1}^{k−2} exp(iθ_jB_{t_j}) and θ := θ_{k−1} + θ_k to deduce

P Z Y_2 exp(iθB_{t_{k−1}}) = P Z Y_2 exp(iθB_{t_{k−2}}) exp(−½θ²(t_{k−1} − t_{k−2})),

and so on; after k steps the remaining factor is a multiple of PZ, which is zero
because Z is orthogonal to the constants.
To prove <17>, break B_t − B_s into n increments: write δ := (t − s)/n, X_j := B_{s+jδ},
Δ_j := X_j − X_{j−1}, and f(x) := exp(iθx). A Taylor expansion gives

f(X_j) = f(X_{j−1}) + iθΔ_j f(X_{j−1}) − ½θ²Δ_j² f(X_{j−1}) + R_j,

with a remainder R_j bounded in absolute value by |θΔ_j|³/6. We may assume that δ is
small enough to ensure that |γ| ≤ 1. Orthogonality of Z to products of increments
with bounded, 𝔉_s-measurable random variables then gives, in the limit as n tends to
infinity, the asserted equality <17>.
REMARK.
Almost the same argument shows that H_0 is dense in L¹ :=
L¹(Ω, 𝔉_1, P), under the L¹ norm. The proof rests on the fact that if some
element X of L¹ were not in the closure of H_0, there would exist a bounded,
𝔉_1-measurable random variable Z for which P(WZ) = 0 for all W in H_0 but
P(XZ) = 1. (Hahn-Banach plus (L¹)* = L^∞, for those of you who know some
functional analysis.) The argument based on <17> would imply P(ZW) = 0 for
all bounded, 𝔉_1-measurable random variables W, and, in particular, PZ² = 0, a
contradiction.
The Theorem tells us that for each X in L²(Ω, 𝔉_1, P) and each ε > 0 there is
a constant c_0 and grid points 0 = t_0 < t_1 < ... < t_{k+1} = 1 such that

X = c_0 + Σ_{i=0}^k h_i (B_{t_{i+1}} − B_{t_i}) + R,

with each h_i bounded and 𝔉_{t_i}-measurable, and PR² < ε². Notice that |PX − c_0| =
|PR| ≤ ε, so we may as well absorb the difference c_0 − PX into the remainder, and
assume c_0 = PX.
REMARK.
The representation takes an even neater form if we encode the
{h_i} into a single function on [0, 1] × Ω, an elementary predictable process,
H(t, ω) := Σ_{i=0}^k h_i(ω){t_i < t ≤ t_{i+1}}. The function H is measurable with respect
to the predictable sigma-field 𝒫, the sub-sigma-field of 𝔅[0, 1] ⊗ 𝔉_1 generated by
the class of all adapted processes with left-continuous sample paths. Moreover, it is
square integrable for the measure μ := m ⊗ P, with m equal to Lebesgue measure
on [0, 1],

μ(H²) = Σ_{i=0}^k P h_i² (t_{i+1} − t_i).

For i < j, the random variable Δ_jB is independent of the 𝔉_{t_j}-measurable random
variable h_ih_j(Δ_iB); and for i = j, the random variable (Δ_iB)² is independent of h_i².
The cross-product terms disappear, leaving

P(Σ_i h_iΔ_iB)² = Σ_i P h_i² (t_{i+1} − t_i) = μ(H²).

Thus the map H ↦ ∫_0^1 H dB is an isometry from the space ℋ_0 of all elementary
predictable processes into L² := L²(Ω, 𝔉_1, P). It is not too difficult to show that ℋ_0
is dense in the space ℋ := L²([0, 1] × Ω, 𝒫, μ). The isometry therefore extends
to a map from ℋ into L², which defines the stochastic integral ∫_0^1 H dB for all
predictable processes that are square-integrable with respect to μ.
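The isometry can be illustrated by simulation. In the sketch below (the particular h_i are made up for the demo), each h is a clipped function of the path up to the left endpoint of its interval, so the process H is elementary predictable:

```python
import math, random

rng = random.Random(5)
grid = [0.0, 0.3, 0.6, 1.0]
n = 200_000
iso_lhs = iso_rhs = 0.0
for _ in range(n):
    b, stoch_int, mu_h2 = 0.0, 0.0, 0.0
    for t0, t1 in zip(grid, grid[1:]):
        # h depends only on the path up to t0, so H is predictable
        h = 1.0 if t0 == 0.0 else max(-1.0, min(1.0, b))
        db = rng.gauss(0.0, math.sqrt(t1 - t0))
        stoch_int += h * db          # elementary integral of H against B
        mu_h2 += h * h * (t1 - t0)   # pathwise integral of H^2 against m
        b += db
    iso_lhs += stoch_int ** 2        # estimates P (int H dB)^2
    iso_rhs += mu_h2                 # estimates mu(H^2)
print(abs(iso_lhs - iso_rhs) / n < 0.02)
```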
Theorem <16> implies that for each X in L² there exists a sequence {H_n}
in ℋ_0 for which P|X − PX − ∫_0^1 H_n dB|² → 0, the {H_n} forming a Cauchy
sequence in ℋ.
*8.
Option pricing
The lognormal model of stock prices assumes that the price of a particular stock
(standardized to take the value 1 at time 0) at time t is given by

<18>    S_t = exp(σB_t + (μ − ½σ²)t),

where B is a standard Brownian motion. The drift parameter μ is unknown, but the
volatility parameter σ is known (or is at least assumed to be well estimated). Notice
that S is a martingale, for the filtration 𝔉_t := σ{B_s : s ≤ t} = σ{S_s : s ≤ t}, if μ = 0.
The strange form of the parametrization makes more sense if we consider
relative increments in stock price over a small time interval,

(S_{t+δ} − S_t)/S_t = exp(σΔB + (μ − ½σ²)δ) − 1 ≈ μδ + σΔB + ½σ²((ΔB)² − δ),
    where ΔB := B_{t+δ} − B_t.
The square (ΔB)² has expected value δ. The term σ²δ/2 centers it to have zero
mean. As you will see, the centered variable eventually gets absorbed into a small
error term, leaving μδ + σΔB as the main contribution to the relative changes in
stock price over short time intervals. The model is sometimes written in symbolic
form as dS_t = μS_t dt + σS_t dB_t.
An option may be thought of as a random variable Y that is, potentially, a
function of the entire history {S_t : 0 ≤ t ≤ 1} of the stock price: one pays an
amount y_0 at time 0 in return for the promise of a payoff Y at time 1 that depends
on the performance of the stock. (That is, an option is a refined form of gambling
on the stock market.)
The question whose answer makes investment bankers rich is: what is the
appropriate price y_0? The elegant answer is: y_0 = QY, where Q is a probability
measure on 𝔉_1 that makes the stock price a martingale, because (as you will soon
learn) there are trading schemes whose net returns can be made as close to Y − y_0
as we please, in a probabilistic sense. If a trader offered the option for a price y
smaller than y_0, one could buy the option and also engage in a trading scheme, for
a net return arbitrarily close to (Y − y) − (Y − y_0) = y_0 − y > 0. A similar argument
can be made against an asking price greater than y_0.
A trading scheme consists of a finite set of times 0 ≤ t_0 < t_1 < ... < t_{k+1} ≤ 1
at which shares are traded: at time t_i, buy a quantity K_i of the stock, at a
cost K_iS(t_i), then sell at time t_{i+1} for an amount K_iS(t_{i+1}), for a return of K_iΔ_iS,
where Δ_iS := S(t_{i+1}) − S(t_i) denotes the change in stock price per share over the
time interval. The quantity K_i must be determined by the information available at
time t_i; that is, K_i must be 𝔉_{t_i}-measurable. We should also allow K_i to take negative
values, a purchase of a negative quantity being a sale and a sale of a negative
quantity being a purchase. The return from the trading scheme is Σ_{i=0}^k K_iΔ_iS.

We could assume that the spacing of the times between trades is as small as
we please. For example, purchase of K shares at time s followed by resale at time t
has the same return as purchases of K at times s + iδ and sales of K at times
s + (i + 1)δ, for i = 0, 1, ..., N − 1, with δ := (t − s)/N for an arbitrarily large N.
Conceptually, we could even pass to the mathematical limit of continuous trading,
in which case the errors of approximation could be driven to zero (almost surely).
Existence of the trading scheme to nearly duplicate the return Y − y_0 will
follow from the representation in the previous Section. Consider first the case where
μ = 0, which makes S a martingale. Let us assume that PY² is finite. Write y_0 for
PY. For an arbitrarily small ε > 0, Theorem <16> gives a finite collection of times
{t_i} and bounded, 𝔉_{t_i}-measurable random variables h_i for which

P|Y − y_0 − Σ_i h_iΔ_iB|² < ε².

If we could replace Δ_iB by Δ_iS/(σS_{t_i}) for each i, we would have a trading
scheme K_i := h_i/(σS_{t_i}) with the desired approximation property. If the time
intervals δ_i := t_{i+1} − t_i are all small enough, a simple calculation will show that such
a substitution increases the L² error only slightly. The error term

R_i := Δ_iB − Δ_iS/(σS_{t_i})
has zero expected value and is independent of 𝔉_{t_i}. You will see soon that it also
has a small L²(P) norm. To simplify the algebra, temporarily write x for σ√δ_i and
Z for Δ_iB/√δ_i, which has a standard normal distribution. Then

P R_i² = δ_i P(Z − (exp(xZ − ½x²) − 1)/x)² ≤ C σ²δ_i²    when x ≤ 1,

for a universal constant C.
The approximation Σ_i h_iΔ_iB to Y − y_0 equals Σ_i K_iΔ_iS + Σ_i h_iR_i. The first
sum represents the net return from a trading strategy. The other sum will be small
in L²(P) when max_i δ_i is small. Independence eliminates cross-product terms,

P(h_iR_i)(h_jR_j) = 0    if j > i,

and hence

P|Σ_i h_iR_i|² = Σ_i (Ph_i²)(PR_i²) ≤ Cσ² (max_i δ_i) Σ_i (Ph_i²) δ_i → 0    as max_i δ_i → 0.

The contribution from the R_i's can be absorbed into the other error of approximation,
increasing the ε by an arbitrarily small amount, leaving the desired trading strategy
approximation for Y − y_0.
Finally, what happens when μ is not zero? A change of measure will dispose
of its effects. For a fixed constant α, let Q be the probability measure on 𝔉_1 with
density exp(αB_1 − ½α²) with respect to P. Problem [11] shows that the process
X_t := B_t − αt is a Brownian motion under Q. The stock price is also a function
of X if we choose α = −μ/σ: then S_t = exp(σX_t − ½σ²t), a martingale under Q.
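With that choice of α, under Q we have S_1 = exp(σX_1 − ½σ²) with X_1 standard normal, so for a call-type payoff (S_1 − K)⁺ the price y_0 = Q(S_1 − K)⁺ has a closed form (a Black-Scholes-type formula with S_0 = 1, zero interest rate, and unit horizon; the formula and names below are my sketch, not the text's). The drift μ has dropped out entirely:

```python
import math, random

def bs_call(sigma, K):
    # Q-expectation of (S_1 - K)^+ with S_1 = exp(sigma*Z - sigma^2/2),
    # Z standard normal: Black-Scholes value with S_0 = 1, r = 0, T = 1.
    d1 = (math.log(1.0 / K) + sigma**2 / 2) / sigma
    d2 = d1 - sigma
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return phi(d1) - K * phi(d2)

rng = random.Random(6)
sigma, K, n = 0.4, 1.1, 300_000
mc = sum(max(math.exp(sigma * rng.gauss(0, 1) - sigma**2 / 2) - K, 0.0)
         for _ in range(n)) / n   # Monte Carlo estimate of Q(S_1 - K)^+
print(abs(mc - bs_call(sigma, K)) < 0.005)
```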
9.
Problems
[1]
[2]
for x > 0,
(i) For each small ε > 0, show that there exist strictly positive constants C and θ
for which P{M_n ≤ (1 − ε)ℓ_n} = (1 − 2P{Z_1 > (1 − ε)ℓ_n})^n ≤ exp(−Cn^θ), for
all n large enough. Deduce that liminf_n M_n/ℓ_n ≥ 1 almost surely.

(ii) For each ε > 0, show that P{M_n > (1 + ε)ℓ_n} ≤ n/n^{(1+ε)²}. Deduce that
limsup_k M_{n(k)}/ℓ_{n(k)} ≤ 1 almost surely.

(iii) Use the inequality M_n/ℓ_n ≤ M_{n(k+1)}/ℓ_{n(k)} for n(k) ≤ n ≤ n(k + 1), and the results
from parts (i) and (ii), to conclude that M_n/ℓ_n → 1 almost surely.
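The conclusion M_n/ℓ_n → 1 is easy to watch numerically, though convergence is slow (logarithmic scale). A sketch of mine:

```python
import math, random

rng = random.Random(7)
for n in (10_000, 200_000):
    # M_n = max of |Z_i| over n independent standard normals
    m_n = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
    ell_n = math.sqrt(2 * math.log(n))
    print(n, round(m_n / ell_n, 2))   # ratio creeps toward 1
```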
[3]
Let {(B_t, 𝔉_t) : 0 ≤ t ≤ 1} be a centered Brownian motion whose sample paths
need not be continuous. Show that there exists a centered Brownian motion
{B*_t : 0 ≤ t ≤ 1} with continuous sample paths, for which P{B*_t ≠ B_t} = 0 for
every t. Argue as follows. Let S_n := {i/2^n : i = 0, 1, ..., 2^n} and S := ∪_{n∈ℕ}S_n. For
each ω define

D_n(ω) := max_{0≤i<2^n} |B((i + 1)/2^n, ω) − B(i/2^n, ω)|.

(i) Use Problem [1] to prove that Σ_n PD_n < ∞. Deduce that there exists a
negligible set N such that ε_n(ω) := Σ_{k≥n} D_k(ω) → 0 as n → ∞ for ω ∉ N.

(ii) Let s and t be points in S_n for which |s − t| ≤ 2^{−m}, where n ≥ m. For each
k, write s_k for the largest value in S_k for which s_k ≤ s, and define t_k similarly.
(Thus s = s_n and t = t_n.) Show that
|B(s, ω) − B(t, ω)| ≤ 2ε_m(ω),

and hence

sup{|B(s, ω) − B(t, ω)| : s, t ∈ S, |s − t| ≤ 2^{−m}} ≤ 2ε_m(ω).
(v) Show that B*(t, ω) := lim_n B(t_n, ω) exists almost surely, where {t_n} is the sequence
defined in step (ii). For each ε > 0, deduce that

P{|B*_t − B_t| > ε} ≤ P liminf_n {|B(t_n) − B_t| > ε} ≤ liminf_n P{|B(t_n) − B_t| > ε} = 0.

(Remember that B(t) − B(t_n) has a N(0, t − t_n) distribution.) Conclude that
B*_t = B_t almost surely, for each t.
[4]
Suppose {B_s : s ∈ S} is a Brownian motion, with S the countable set of all dyadic
rationals in [0, 1], as in the previous Problem. Modify the argument from that
Problem to construct a standard Brownian motion indexed by [0, 1].
[5]
for 0 < δ.

[6] Let B be a standard Brownian motion indexed by [0, 1]. Show that the process
X_t := B_1 − B_{1−t}, for 0 ≤ t ≤ 1, is also a Brownian motion (with respect to its own
natural filtration, not the filtration for B).
[7]
Let {(B_t, 𝔉_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous sample paths.
Define τ(ω) as the smallest t at which the sample path B(·, ω) achieves its maximum
value, M(ω). (Note: τ is not a stopping time.) Follow these steps to show that
the sets Ω_t := {τ = t}, for 0 ≤ t ≤ 1, form a family suitable for the construction in
Example <2>.

(i) Show that {τ ≤ t} = {sup_{s≤t} B_s(ω) = M(ω)}. Deduce that τ is 𝔉_1-measurable.
Hint: Work with the process at rational times.

(ii) For each t in [0, 1), show that {τ = t} ⊆ {sup_{s>t}(B_s − B_t) ≤ 0}. Deduce via
the result from Exercise <12> that P{τ = t} = 0.

(iii) Use Problem [6] to show that P{τ = 1} = 0.
[8]
For all real y, show that e^{iy} = 1 + iy + (iy)²/2! + ... + (iy)^k/k! + R_k(y), with
|R_k(y)| ≤ |y|^{k+1}/(k + 1)!. Hint: For y > 0, use the fact that i∫_0^y R_k(x) dx = R_{k+1}(y).
[9]
Let 𝒳 := C(ℝ⁺), equipped with its metric d for uniform convergence on compacta.

(i) Show that (𝒳, d) is separable. Hint: Consider the countable collection of
piecewise-linear functions obtained by interpolating between a finite number of
"vertices" with rational coordinates.

(ii) Prove that the sigma-field 𝒞 generated by the finite-dimensional projections
coincides with the Borel sigma-field 𝔅(𝒳). Hint: For one inclusion, use
continuity of the projections. For the other, replace sup-norm distances
by suprema over subsets of rationals.
[10]
Let {B_t : t ∈ ℝ⁺} be a standard Brownian motion. For fixed constants α > 0 and
β > 0, show that P{B_t eventually hits the line α + βt} = exp(−2αβ), by following
these steps. Write τ for the first hitting time on the linear barrier. For fixed real θ,
let X_θ(t) denote the martingale exp(θB_t − ½θ²t).

(i) For each fixed t, show that 1 = PX_θ(t ∧ τ).
if you wish to be completely rigorous.
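The answer exp(−2αβ) can be probed by simulation (names mine). Monitoring a discretized path on a grid misses some crossings, and the horizon is finite, so the estimate is biased slightly downward, but for α = 1, β = 0.5 it should land near exp(−1) ≈ 0.37:

```python
import math, random

def hits_line(alpha, beta, horizon, dt, rng):
    # one Brownian path, monitored on a grid: does it reach alpha + beta*t?
    b, t = 0.0, 0.0
    while t < horizon:
        t += dt
        b += rng.gauss(0.0, math.sqrt(dt))
        if b >= alpha + beta * t:
            return True
    return False

rng = random.Random(8)
alpha, beta = 1.0, 0.5
trials = 2000
est = sum(hits_line(alpha, beta, 20.0, 0.01, rng)
          for _ in range(trials)) / trials
print(round(est, 2), round(math.exp(-2 * alpha * beta), 2))
```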
[11] Let {(B_t, 𝔉_t) : 0 ≤ t ≤ 1} be a Brownian motion defined on a probability space
(Ω, 𝔉, P). Let α be a positive real number.

(i) Show that the measure Q defined by dQ/dP := exp(αB_1 − ½α²) is a probability
measure on 𝔉.

(ii) Show that the process X_t := B_t − αt, for 0 ≤ t ≤ 1, is a Brownian motion
under Q. Hint: For fixed s < t, real θ, and F in 𝔉_s, show that
[12] Calculate the price of an option that allows you to purchase one unit of stock for
a price K at time 1, if you wish. Hint: Interpret (S_1 − K)⁺, or read a book about
options.
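The change of measure in Problem [11] can be checked by importance sampling (sketch mine): reweighting P-samples by dQ/dP = exp(αB_1 − ½α²) should make X_1 = B_1 − α standard normal under Q. Matching its first two moments:

```python
import math, random

rng = random.Random(9)
alpha, n = 0.8, 300_000
m1 = m2 = 0.0
for _ in range(n):
    b1 = rng.gauss(0.0, 1.0)
    w = math.exp(alpha * b1 - alpha**2 / 2)   # dQ/dP evaluated at B_1
    x1 = b1 - alpha                            # X_1 = B_1 - alpha * 1
    m1 += w * x1          # Q-mean of X_1, should be ~0
    m2 += w * x1 * x1     # Q-second moment of X_1, should be ~1
print(abs(m1 / n) < 0.03 and abs(m2 / n - 1.0) < 0.05)
```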
10.
Notes
The detailed study by Brush (1968) describes the history of Brownian motion:
from the recognition by Brown, in 1828, that it represented a physical, rather
than biological, phenomenon; through the mathematical theories of Einstein and
Smoluchowski, and the experimental evidence of Perrin, in the first decade of the
twentieth century.
Apparently Wiener (1923) was motivated to study the irregularity of the
Brownian motion sample paths, in part, by remarks of Perrin regarding the haphazard
motion of small particles, cited (page 133) in translation as "One realizes from such
examples how near the mathematicians are to the truth in refusing, by a logical
instinct, to admit the pretended geometrical demonstrations, which are regarded
as experimental evidence for the existence of a tangent at each point of a curve."
(Compare with Kac (1966, page 34), quoting from Wiener's autobiography.) Wiener
constructed "Wiener measure" as a linear functional on the space of continuous
functions, deriving the necessary countable additivity property from a property
slightly weaker than the modulus condition of Theorem <7>.
The construction in Section 3 is essentially due to Lévy (1939, Section 6), who
obtained the piecewise-linear approximations to Brownian motion by an interpolation
argument (compare with Theorem 1, Chapter 1 of Lévy 1948). He also referred to
Lévy (1937, page 172) for the derivation of a Hölder condition <6> for the sample
paths. The explicit construction via an orthogonal series expansion in the Haar basis
is due to Ciesielski (1961).
Theorem <14> in Section 6 is due to Lévy (1948, pages 77-78 of the
second edition), who merely stated the result, with a reference to Theorem 67.3
of Lévy (1937), which established a central limit theorem for (discrete time)
martingales under an assumption on the conditional variances analogous to the
martingale property for X_t² − t. Doob (1953, Theorem 11.9, Chapter VII) provided
a formal proof, similar to the proof in Section 6. The method could be streamlined
slightly, with replacement of increments over deterministic time intervals by
increments over random intervals. It could also be deduced from fancier results
in stochastic calculus, whose derivations ultimately reduce to calculations with
small increments of the process. I feel the direct method has pedagogic advantages,
because it makes very clear the vital role of sample path continuity.
The proof of Theorem <16> is a disguised version of Itô's formula for
stochastic integrals, applied to a particular function of Brownian motion (compare
with Durrett 1984, Section 2.14). The result is due to Kunita & Watanabe (1967),
although, as noted by Clark (1970), it follows easily from an expansion due to
Itô (1950). See also Dudley (1977) for an extension to 𝔉_1-measurable random
variables that are not necessarily square integrable, verifying a result first asserted
then retracted by Clark.
See Harrison & Pliska (1981) and Duffie (1992, Chapter 6) for a discussion
of stochastic calculus and option trading. The book by Wilmott, Howison &
Dewynne (1995) provides a gentler introduction to some of the finance ideas.
In the last three Sections of the Chapter, I was dabbling with ideas important
in stochastic calculus, without introducing the formal machinery of the subject.
Accordingly, there was some repetition of methods: chop the sample paths into
small increments; make Taylor expansions; dispose of remainder terms by arguments
reeking of martingale theory. As I remarked at the end of Section 6, if you wish
to pursue the theory any further it would be a good idea to invest some time in
studying the formal machinery. I found Chung & Williams (1990) a good place to
start, with Métivier (1982) as a reliable backup, and Dellacherie & Meyer (1982) as
a rigorous test of true understanding.
Chapter 10
1.
What is coupling?
A coupling of two probability measures, P and Q, consists of a probability space
(Ω, ℱ, ℙ) supporting two random elements X and Y, such that X has distribution P
and Y has distribution Q. Sometimes interesting relationships between P and Q
can be coded in some simple way into the joint distribution for X and Y. Three
examples should make the concept clearer.
<1>
Example. Let P_α denote the Bin(n, α) distribution. As α gets larger, the distribution should "concentrate on bigger values." More precisely, for each fixed x, the
tail probability P_α[x, n] should be an increasing function of α. A coupling argument
will give an easy proof.
Consider a β larger than α. Suppose we construct a pair of random variables,
X_α with distribution P_α and X_β with distribution P_β, such that X_α ≤ X_β almost
surely. Then we will have {X_α ≥ x} ≤ {X_β ≥ x} almost surely, from which we
would recover the desired inequality, P_α[x, n] ≤ P_β[x, n], by taking expectations
with respect to ℙ.
How might we construct the coupling? Binomials count successes in independent trials. Couple the trials and we couple the counts. Build the trials from
independent random variables U₁, ..., U_n, each uniformly distributed on (0, 1). That is,
define X_α := Σ_{i≤n} {U_i ≤ α} and X_β := Σ_{i≤n} {U_i ≤ β}. In fact, the construction
couples all P_γ, for 0 ≤ γ ≤ 1, simultaneously.
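As a concrete illustration, the shared-uniform construction can be sketched in a few lines of Python (the function `coupled_binomials` and its interface are our invention, not the book's):

```python
import random

def coupled_binomials(n, alphas, rng=None):
    """Build Bin(n, alpha) counts for every alpha in `alphas` from the SAME
    uniforms U_1, ..., U_n, so that X_alpha <= X_beta whenever alpha <= beta."""
    rng = rng or random.Random()
    us = [rng.random() for _ in range(n)]
    # {U_i <= a} is an indicator; summing the indicators counts successes
    return {a: sum(u <= a for u in us) for a in alphas}

counts = coupled_binomials(100, [0.2, 0.5, 0.8], random.Random(0))
assert counts[0.2] <= counts[0.5] <= counts[0.8]
```

Because every count is built from the same uniforms, the whole family {X_γ : 0 ≤ γ ≤ 1} is coupled simultaneously, as remarked above.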
<2>
Example. Let P denote the Bin(n, α) distribution and Q denote the approximating
Poisson(nα) distribution. A coupling argument will establish a total variation bound,
sup_A |PA − QA| ≤ nα², an elegant means for expressing the Poisson approximation
to the Binomial.
Start with the simplest case, where n equals 1. Find a probability measure ℙ
concentrated on {0, 1} × ℕ₀ with marginal distributions P := Bin(1, α) and
Q := Poisson(α). The strategy is simple: put as much mass as
we can on the diagonal, (0, 0) ∪ (1, 1), then spread the remaining
mass as needed to get the desired marginals. The atoms on the
diagonal are constrained by the inequalities
ℙ(0, 0) ≤ min(P{0}, Q{0}) = min(1 − α, e^{−α})  and  ℙ(1, 1) ≤ min(P{1}, Q{1}) = min(α, αe^{−α}).
To maximize, choose ℙ(0, 0) := 1 − α and ℙ(1, 1) := αe^{−α}. The
rest is arithmetic. We need ℙ(1, 0) := e^{−α} − 1 + α to attain the
marginal probability Q{0}, and ℙ(0, k) := 0, for k = 1, 2, ..., to attain the marginal
P{0} = 1 − α. The choices ℙ(1, k) := Q{k}, for k = 2, 3, ..., are then forced. The
total off-diagonal mass equals α − αe^{−α} ≤ α².
For the general case, take ℙ to be the n-fold product of measures of the
type constructed for n = 1. That is, construct n independent random vectors
(X₁, Y₁), ..., (X_n, Y_n) with each X_i distributed Bin(1, α), each Y_i distributed
Poisson(α), and ℙ{X_i ≠ Y_i} ≤ α². The sums X := Σ_i X_i and Y := Σ_i Y_i then have
the desired Binomial and Poisson distributions, and ℙ{X ≠ Y} ≤ Σ_i ℙ{X_i ≠ Y_i} ≤ nα².
The total variation bound follows from the inequality
|ℙ{X ∈ A} − ℙ{Y ∈ A}| = |ℙ{X ∈ A, X ≠ Y} − ℙ{Y ∈ A, X ≠ Y}| ≤ ℙ{X ≠ Y}.
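A minimal sketch of the n = 1 table (the helper name and the Poisson truncation level `kmax` are our assumptions) checks the marginals and the off-diagonal mass:

```python
import math

def bin1_poisson_coupling(alpha, kmax=40):
    """Joint distribution on {0,1} x {0,...,kmax} coupling Bin(1, alpha) with
    Poisson(alpha), following the maximal-diagonal recipe of Example <2>.
    `kmax` truncates the Poisson tail purely for bookkeeping."""
    pois = [math.exp(-alpha) * alpha**k / math.factorial(k) for k in range(kmax + 1)]
    joint = {}
    joint[(0, 0)] = 1 - alpha                       # min(P{0}, Q{0}) = 1 - alpha
    joint[(1, 1)] = alpha * math.exp(-alpha)        # min(P{1}, Q{1})
    joint[(1, 0)] = math.exp(-alpha) - (1 - alpha)  # fill out the Q{0} marginal
    for k in range(2, kmax + 1):
        joint[(1, k)] = pois[k]                     # forced choices
    return joint

alpha = 0.3
joint = bin1_poisson_coupling(alpha)
# marginal for the binomial coordinate
p0 = sum(v for (x, _), v in joint.items() if x == 0)
assert abs(p0 - (1 - alpha)) < 1e-9
# off-diagonal mass equals alpha - alpha*exp(-alpha) <= alpha**2
offdiag = sum(v for (x, y), v in joint.items() if x != y)
assert offdiag <= alpha**2 + 1e-12
```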
<3>
Example. Probability measures on the real line can be coupled through their
quantile functions. If X has distribution function F, with quantile function
q_F(u) := inf{x : F(x) ≥ u}, then
q_F(u) ≤ x  if and only if  u ≤ F(x),
so that X := q_F(U) has distribution function F when U is uniformly distributed
on (0, 1).
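A minimal sketch of this quantile coupling for toy discrete distributions (the helper `quantile` and the particular F and G below are our assumptions, for illustration only):

```python
import random

def quantile(F_points, u):
    """Hypothetical helper: quantile q_F(u) = inf{x : F(x) >= u} for a discrete
    distribution given as a sorted list of (x, F(x)) pairs."""
    for x, Fx in F_points:
        if Fx >= u:
            return x
    return F_points[-1][0]

# Couple two distributions through the SAME uniform: X = q_F(U), Y = q_G(U).
F = [(0, 0.2), (1, 0.7), (2, 1.0)]   # assumed toy distribution functions
G = [(0, 0.3), (1, 0.8), (2, 1.0)]   # G dominates F pointwise
rng = random.Random(1)
u = rng.random()
x, y = quantile(F, u), quantile(G, u)
# G(x) >= F(x) for all x forces q_G(u) <= q_F(u): the coupling is monotone
assert y <= x
```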
The representation lets us prove facts about weak convergence by means of the
tools for almost sure convergence. For example, in the Problems to Chapter 7, you
were asked to show that Δ(P, Q) := sup{|Pℓ − Qℓ| : ‖ℓ‖_BL ≤ 1} defines a metric
for weak convergence on the set of all Borel probability measures on a separable
metric space. (Refer to Section 7.1 for the definition of the bounded Lipschitz
norm.) If Δ(P_n, P) → 0 then P_nℓ → Pℓ for each ℓ with ‖ℓ‖_BL < ∞, that is,
P_n ⇝ P. Conversely, if P_n ⇝ P and we can find X_n with distribution P_n and X
with distribution P for which X_n → X almost surely (see Section 2 for the general
case), then
Δ(P_n, P) ≤ sup_{‖ℓ‖_BL ≤ 1} ℙ|ℓ(X_n) − ℓ(X)| ≤ ℙ(1 ∧ d(X_n, X)) → 0.
U\\BL<1
2.
In effect, the general constructions of the representing variables subsume the specific
calculations used in Chapter 7 to approximate {ℓ : ‖ℓ‖_BL ≤ 1} by a finite collection
of functions.
<4>
Theorem. For probability measures on the Borel sigma-field of a separable metric
space X, if P_n ⇝ P then there exist random elements X_n, with distributions P_n, and
X, with distribution P, for which X_n → X almost surely.
The main step in the proof involves construction of a joint distribution for Xn
and X. To avoid a profusion of subscripts, it is best to isolate this part of the
(m ⊗ P ⊗ K) f = m^u P^x K_{u,x}^y f(x, y).
In fact, {(P ⊗ K)_u : u ∈ (0, 1)} is a probability kernel from (0, 1) to X × X. Notice
also that the marginal distribution m^u P^x K_{u,x} for y is an m ⊗ P average of the K_{u,x}
probability measures on 𝔅(X). As an exercise in generating class methods, you
might check all the measurability properties needed to make these assertions precise.
<5>
K_{u,x} := Q(· | B_α)  when u ≤ 1 − ε and x ∈ B_α.
When u ≤ 1 − ε the recipe is: generate y from Q(· | B_α) when x ∈ B_α, which
ensures that x and y then belong to the same B_α. Integrate over u and x to find the
marginal probability that y lands in a Borel set A:
m^u P^x K_{u,x} A = QA,
as asserted.
REMARK.
Notice that the kernel K does nothing clever when u ∈ J_α. If we
were hoping for a result closer to the quantile coupling of Example <3>, we might
instead try to select y from a B_β that is close to x in some sense. Such refined
behavior would require a more detailed knowledge of the partition.
Proof of Theorem <4>.
The idea is simple. For each n we will construct an
appropriate probability kernel K⁽ⁿ⁾, from (0, 1) × X to X, via an appeal to the
<6>
by Dominated Convergence.
I noted in Example <3> that if Y_n has distribution P_n, and if each Y_n is
defined on a different probability space (Ω_n, ℱ_n, ℙ_n), then the convergence in
distribution Y_n ⇝ P cannot possibly imply almost sure convergence for Y_n.
Nevertheless, using an argument similar to the proof of Theorem <4>, Dudley (1985)
obtained something almost as good as almost sure convergence.
*3.
Strassen's Theorem
Once again let (X, d) be a separable metric space equipped with its Borel sigma-field 𝔅(X). For each subset A of X, and each ε > 0, define A^ε to be the closed set
{x ∈ X : d(x, A) ≤ ε}. The Prohorov distance between any P and Q from the set 𝒫
of all probability measures on 𝔅(X) is defined as
ρ(P, Q) := inf{ε > 0 : PB ≤ QB^ε + ε for all B in 𝔅(X)}.
Despite the apparent lack of symmetry in the definition, ρ is a metric (Problem [3])
on 𝒫.
REMARK.
Separability of X is convenient, but not essential when dealing with
the Prohorov metric. For example, it implies that 𝔅(X × X) = 𝔅(X) ⊗ 𝔅(X), which
ensures that d(X, X′) is measurable for each pair of random elements X and X′; and
if X_n → X almost surely then ℙ{d(X_n, X) > ε} → 0 for each ε > 0.
If ρ(P_n, P) → 0 then, for each closed F, we have P_nF ≤ PF^ε + ε eventually,
and hence limsup_n P_nF ≤ PF, implying that P_n ⇝ P. Theorem <4> makes it easy
to prove the converse. If X_n has distribution P_n and X has distribution P, and if
X_n → X almost surely, then for each ε > 0 there is an n_ε such that
ℙ{d(X_n, X) > ε} ≤ ε  for n ≥ n_ε.
For such n, and each B in 𝔅(X),
<7>
P_nB = ℙ{X_n ∈ B, d(X_n, X) ≤ ε} + ℙ{X_n ∈ B, d(X_n, X) > ε} ≤ ℙ{X ∈ B^ε} + ε = PB^ε + ε.
a stronger result; the tightness will be used only at the very end, to tidy up.)
Also, the role of e is slightly easier to understand if we replace it by two separate
constants.
<8>
Theorem. Let P and Q be tight probability measures on the Borel sigma-field 𝔅 of
a separable metric space X. Let ε and ε′ be positive constants. There exist random
elements X and Y of X with distributions P and Q such that ℙ{d(X, Y) > ε} ≤ ε′ if
and only if PB ≤ QB^ε + ε′ for all Borel sets B.
The argument for deducing the family of inequalities from existence of the
coupling is virtually the same as for <7>. For the other, more interesting direction, I
follow an elegant idea of Dudley (1976, Lecture 18). By approximation arguments
he reduced to the case where both P and Q concentrate on a finite set of atoms,
and then existence of the coupling followed by an appeal to the classical Marriage
Lemma (Problem [5]). I modify his argument to eliminate a few steps, by making
an appeal to the following generalization (proved in Problem [6]) of that Lemma.
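Before the proof, the easy direction of Theorem <8> can be checked by brute force on a tiny finite space. Everything named here (`check_strassen_direction`, the toy coupling) is illustrative only:

```python
from itertools import combinations

def check_strassen_direction(joint, points, d, eps, eps_prime):
    """Brute-force check (tiny spaces only) of the easy direction of
    Theorem <8>: a coupling with P{d(X,Y) > eps} <= eps' forces
    PB <= Q(B^eps) + eps' for every subset B."""
    P = {x: sum(v for (a, b), v in joint.items() if a == x) for x in points}
    Q = {y: sum(v for (a, b), v in joint.items() if b == y) for y in points}
    if sum(v for (a, b), v in joint.items() if d(a, b) > eps) > eps_prime:
        return False
    for r in range(len(points) + 1):
        for B in combinations(points, r):
            Beps = [y for y in points if any(d(y, b) <= eps for b in B)]
            if sum(P[x] for x in B) > sum(Q[y] for y in Beps) + eps_prime + 1e-12:
                return False
    return True

# toy coupling on {0, 1, 2} with metric |x - y|
joint = {(0, 0): 0.3, (1, 1): 0.3, (2, 2): 0.2, (0, 2): 0.2}
assert check_strassen_direction(joint, [0, 1, 2], lambda a, b: abs(a - b), 1, 0.2)
```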
<9>
μ(∪_{α∈A} B_α) + μ{∞} = μ(∪_{α∈A} R_α).
REMARK.
In effect, I have converted K to a probability kernel L from X to
X, by setting L_x equal to K_α|_X + K_α{∞}Q₀ when x ∈ B_α. The definition of Γ_δ is
equivalent to Γ_δ := P ⊗ L, in the sense of Section 4.4.
The measure Γ_δ has marginals P and Q because, for g and h in 𝓜⁺(X),
integration of g(x) against Γ_δ recovers Pg and integration of h(y) recovers Qh.
It concentrates most of its mass on the set D := ∪_{α=1}^m (B_α × B_α^ε):
Γ_δ D ≥ 1 − ε′ − δ.
When (x, y) belongs to D, we have x ∈ B_α and d(y, B_α) ≤ ε for some B_α with
diameter(B_α) < δ, and hence d(x, y) ≤ δ + ε. Thus Γ_δ assigns measure at least
1 − ε′ − δ to the closed set F_{δ+ε} := {(x, y) ∈ X × X : d(x, y) ≤ δ + ε}.
The tightness of both P and Q will let us eliminate δ, by passing to the limit
along a subsequence. For each η > 0 there exists a compact set C_η for which
PC_η^c < η and QC_η^c < η. The probability measure Γ_δ, which has marginals P and Q,
puts mass at most 2η outside the compact set C_η × C_η. The family {Γ_δ : δ > 0} is
uniformly tight, in the sense explained in Section 7.5. As shown in that Section,
there is a sequence {δ_i} tending to zero for which Γ_{δ_i} ⇝ Γ, with Γ a probability
measure on 𝔅 ⊗ 𝔅. It is a very easy exercise to check that Γ has marginals P and Q.
For each fixed t > ε, the weak convergence implies
ΓF_t ≥ limsup_i Γ_{δ_i} F_t ≥ limsup_i Γ_{δ_i} F_{ε+δ_i} ≥ 1 − ε′.
*4.
<10>
where
β :=
Proof. The existence of the asserted coupling (for a suitably rich probability space)
will follow via Theorem <8> if we can show for each Borel subset A of ℝ^k that
<11>
ℙ{S ∈ A} ≤ ℙ{T ∈ A^{3δ}} + ERROR,
with the ERROR equal to the upper bound stated in the Theorem. By choosing a
smooth (bounded derivatives up to third order) function f that approximates the
indicator function of A, in the sense that f ≈ 1 on A and f ≈ 0 outside A^{3δ}, we will
be able to deduce inequality <11> from the multivariate form of Lindeberg's method
(Section 7.3), which gives a third moment bound for a difference in expectations,
<12>
|ℙf(S) − ℙf(T)| ≤ C(ℙ|ξ₁|³ + ... + ℙ|ξ_n|³) = Cβ.
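A third-moment bound of this kind is easy to sanity-check by simulation. Here is a minimal sketch, with an assumed smooth test function standing in for f (everything named here is illustrative, not from the book):

```python
import math, random

def smooth_f(t, delta=0.5):
    """A smooth stand-in for the indicator of A = (-inf, 0], easing to 0
    over a band of width ~delta (an assumption for illustration)."""
    return 1 / (1 + math.exp(t / delta))

rng = random.Random(0)
n, reps = 400, 4000
# S: standardized sum of Rademacher variables; T: standard normal
ES = sum(smooth_f(sum(rng.choice((-1, 1)) for _ in range(n)) / n ** 0.5)
         for _ in range(reps)) / reps
ET = sum(smooth_f(rng.gauss(0, 1)) for _ in range(reps)) / reps
# the smoothed expectations are close, as the Lindeberg bound predicts
assert abs(ES - ET) < 0.05
```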
<14>
If b is small then a will be large, which would make log(1 + a) small compared
with a. If we make a slightly larger than 2k^{−1} log(1/b) we should get close to
equality. Actually, we can afford to have a be a larger multiple of log(1/b), because
extra multiplicative factors will just be absorbed into the constant C₀. With these
thoughts, it seems to me I cannot do much better than choose
<15>
The proof is complete, except for the construction of the smooth function /
satisfying <14>.
Before moving on to the construction of f, let us see what we can do with the
coupling from the Theorem in the case of identically distributed random vectors.
For convenience of notation write γ(x) for the function C₀x(1 + |log(1/x)|/k).
<16>
For fixed k, we can make the right-hand side as small as we please by choosing δ
as a large enough multiple of n^{−1/6}. Thus, with finite third moments, the coupling
error is of order O_p(n^{−1/6}).
For k = 1, this coupling is not the best possible. For example, under an assumption
of finite third moments, a theorem of Major (1976) gives a sequence of independent
random variables Y₁, Y₂, ..., each distributed N(0, V), for which
(1/√n) |Σ_{i≤n} X_i − Σ_{i≤n} Y_i| = o(n^{−1/6})  almost surely.
Major's result has the correct joint distributions for the approximating normals, as
n changes, as well as providing a slightly better rate.
<17>
Example. Yurinskii's coupling (and its refinements: see, for example, the
discussion near Lemma 2.12 of Dudley & Philipp 1983) is better suited to situations
where the dimension k can change with n.
Consider the case of a sequence of independent, identically distributed stochastic
processes {X_i(t) : t ∈ T}. Suppose ℙX₁(t) = 0 and |X₁(t)| ≤ 1 for every t. Under
suitable regularity conditions on the sample paths, we might try to show that the
standardized partial sum processes, Z_n(t) := (X₁(t) + ... + X_n(t))/√n, behave like a
centered Gaussian process {Z(t) : t ∈ T}, with the same covariance structure as X₁.
We might even try to couple the processes in such a way that sup_t |Z_n(t) − Z(t)| is
small in some probabilistic sense.
The obvious first step towards establishing a coupling of the processes is to
consider behavior on large finite subsets T(k) := {t₁, ..., t_k} of T, where k is allowed
to increase with n. The question becomes: How rapidly can k tend to infinity?
For fixed k, write ξ_i for the random k-vector with components X_i(t_j), for
j = 1, ..., k. We seek to couple (ξ₁ + ... + ξ_n)/√n with a random vector W_n,
distributed like {Z(t_j) : j = 1, ..., k}. The bound is almost the same as in
Example <16>, except for the fact that the third moment now has a dependence
on k,
ℙ|ξ_i|³ = ℙ (Σ_{j=1}^k |X_i(t_j)|²)^{3/2} ≤ k^{3/2}.
That is, max_{j≤k} |Z_n(t_j) − W_{n,j}| = o_p(1) if k increases more slowly than n^{1/5}.
Smoothing of indicator functions
There are at least two methods for construction of a smooth approximation / to a
set A. The first uses only the metric:
For an interval in one dimension, the approximation has the effect of replacing the
discontinuity at the boundary points by linear functions with slope ±1/δ. The second
method treats the indicator function of the set as an element of an 𝓛¹ space, and
constructs the approximation by means of convolution smoothing,
f(x) = m^w ({w ∈ A} φ_σ(w − x)),
where φ_σ denotes the N(0, σ²I_k) density and m denotes Lebesgue measure on ℝ^k.
(Any smooth density with rapidly decreasing tails would suffice.) A combination of
the two methods of smoothing will give the best bound:
<18>
Lemma. Let A be a Borel subset of ℝ^k, and let Z have a N(0, I_k) distribution. For
positive constants δ and σ define
g(x) := (1 − d(x, A^δ)/δ)⁺  and  f(x) := ℙg(x + σZ).
For fixed x and y, the function h(t) := f(x + ty), for 0 ≤ t ≤ 1, has second
derivative
h″(t) = σ^{−2} ℙ g(x + ty + σZ) ((y′Z)² − |y|²).
The Lipschitz property |g(x + ty + σZ) − g(x + σZ)| ≤ t|y|/δ then implies a bound
involving an intermediate point t* ∈ (0, 1).
For approximation <14>, first note that A^δ ≤ g ≤ A^{2δ} and 0 ≤ f ≤ 1
everywhere. Also ℙ{|Z| > δ/σ} ≤ ε, from Problem [7]. Thus
f(x) ≥ ℙg(x + σZ){|σZ| ≤ δ} = ℙ{|Z| ≤ δ/σ} ≥ 1 − ε  if x ∈ A,
and
f(x) ≤ ℙg(x + σZ){|σZ| ≤ δ} + ℙg(x + σZ){|σZ| > δ} ≤ ε  if x ∉ A^{3δ}.
5.
approximating N(n/2, n/4) has been the starting point for a growing collection of
striking approximation results, inspired by the publication of the fundamental paper
of Komlós, Major & Tusnády (1975).
<19>
Tusnády's Lemma. For each positive integer n there exists a deterministic, increasing function τ(n, ·) such that the random variable X := τ(n, η) has a Bin(n, 1/2)
distribution whenever η has a N(0, 1) distribution. With Y := n/2 + (√n/2)η, the
random variable X satisfies the inequalities
<20>
|X − Y| ≤ C₀(1 + η²)  and  |X − n/2| ≤ C₀√n (1 + |η|),
for a universal constant C₀.
6.
More generally, for each finite set of square integrable functions f₁, ..., f_k, the
random vector (ν_nf₁, ..., ν_nf_k) has a limiting multivariate normal distribution
with zero means and covariances P(f_if_j) − (Pf_i)(Pf_j). These finite dimensional
distributions identify a Gaussian process that is closely related to the isonormal
process {G(f) : f ∈ L²(P)} from Section 9.3.
Recall that G is a centered Gaussian process, defined on some probability space
(Ω, ℱ, ℙ), with cov(G(f), G(g)) = ⟨f, g⟩ = P(fg), the L²(P) inner product.
Haar basis, Ψ = {1} ∪ {ψ_{i,k} : 0 ≤ i < 2^k, k ∈ ℕ₀}, for L²(P) consists of rescaled
differences of indicator functions of the intervals J_{i,k} := J(i, k) := (i2^{−k}, (i + 1)2^{−k}],
ψ_{i,k} := 2^{k/2} (J_{2i,k+1} − J_{2i+1,k+1}) = 2^{k/2} (2J_{2i,k+1} − J_{i,k}).
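A quick numerical check of the orthonormality of these Haar functions (a sketch; the midpoint grid makes the Riemann sums exact for step functions of this kind):

```python
def psi(i, k, x):
    """Haar function psi_{i,k} on (0, 1]: +2^{k/2} on the left half of J_{i,k},
    -2^{k/2} on the right half, and 0 elsewhere."""
    scale = 2 ** (k / 2)
    lo, mid, hi = i / 2 ** k, (2 * i + 1) / 2 ** (k + 1), (i + 1) / 2 ** k
    if lo < x <= mid:
        return scale
    if mid < x <= hi:
        return -scale
    return 0.0

def inner(f, g, K=10):
    """Riemann sum on the dyadic grid of mesh 2^-K; exact for step functions
    that are constant on intervals of that mesh."""
    n = 2 ** K
    return sum(f((j + 0.5) / n) * g((j + 0.5) / n) for j in range(n)) / n

# orthonormality of a few basis elements
assert abs(inner(lambda x: psi(0, 1, x), lambda x: psi(0, 1, x)) - 1) < 1e-9
assert abs(inner(lambda x: psi(0, 1, x), lambda x: psi(1, 2, x))) < 1e-9
```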
REMARK.
For our current purposes, it is better to replace L²(P) by 𝓛²(P), the
space of square-integrable real functions whose P-equivalence classes define L²(P).
It will not matter that each G(f) is defined only up to a P-equivalence. We need
to work with the individual functions to have P_nf well defined. It need not be true
that P_nf = P_ng when f and g differ only on a P-negligible set.
<21>
G(f) = ⟨f, 1⟩G(1) + Σ_{k∈ℕ₀} Σ_{0≤i<2^k} ⟨f, ψ_{i,k}⟩ η_{i,k},  where η_{i,k} := G(ψ_{i,k}),
with convergence in the L²(P) sense. If we center each function f to have
zero expectation, we obtain a new Gaussian process, ν(f) := G(f − Pf) =
G(f) − (Pf)G(1), indexed by 𝓛²(P), whose covariances identify it as the limit
process for ν_n. Notice that ν(ψ_{i,k}) = G(ψ_{i,k}) = η_{i,k} almost surely, because Pψ_{i,k} = 0.
Thus we also have a series representation for ν_n,
<22>
ν_n(f) = Σ_{k∈ℕ₀} Σ_{0≤i<2^k} ⟨f, ψ_{i,k}⟩ ν_n(ψ_{i,k}).
REMARK.
Don't worry about the niceties of convergence: when the heuristics
are past I will be truncating the series at some finite k.
The expansion suggests a way of coupling the processes ν_n and ν, namely, find a
probability space on which ν_n(ψ_{i,k}) ≈ ν(ψ_{i,k}) = η_{i,k} for a large subset of the basis
functions. Such a coupling would have several advantages. First, the peculiarities
of each function f would be isolated in the behavior of the coefficients ⟨f, ψ_{i,k}⟩.
Subject to control of those coefficients, we could derive simultaneous couplings for
many different f's. Second, because the ψ functions are rescaled differences of
indicator functions of intervals, the ν_n(ψ_{i,k}) are rescaled differences of Binomial
counts. Tusnády's Lemma offers an excellent means for building Binomials from
standard normals. With some rescaling, we can then build versions of the ν_n(ψ_{i,k})
from the η_{i,k}.
The secret to success is a recursive argument, corresponding to the nesting
of the J_{i,k} intervals. Write node(i, k) for (i + 1/2)/2^k, the midpoint of J_{i,k}. Regard
node(2i, k + 1) and node(2i + 1, k + 1) as the children of node(i, k), corresponding
to the decomposition of J_{i,k} into the disjoint union of the two subintervals J_{2i,k+1}
and J_{2i+1,k+1}. The parent of node(i, k) is node(⌊i/2⌋, k − 1).
For each integer i with 0 ≤ i < 2^k there is a path back through the tree,
path(i, k) := {(i₀, 0), (i₁, 1), ..., (i_k, k)},
for which J(i_k, k) ⊂ J(i_{k−1}, k − 1) ⊂ ... ⊂ J(i₀, 0) = (0, 1]. That is, the path traces
through all the ancestors (parent, grandparent, ...) back to the root of the tree.
[Figure: the binary tree of dyadic intervals J_{i,k}, shown down to level k = 3 with leaves (0, 3), (1, 3), ..., (7, 3).]
and so on. That is, recursively divide the count X_{i,k} at each node(i, k) between
the two children of the node, using the normal variable η_{i,k} to determine the
Bin(X_{i,k}, 1/2) count assigned to the child at node(2i, k + 1). The joint distribution
for the X_{i,k} variables is the same as the joint distribution for the empirical counts
nP_nJ_{i,k}, because we have used the correct conditional distributions.
If we continued the process forever then, at least conceptually, we would
identify the locations of the n observations, without labelling. Each point would
be determined by a nested sequence of intervals. To avoid difficulties related to
pointwise convergence of the Haar expansion, we need to stop at some finite level,
say the mth, after which we could independently distribute the X_{i,m} observations (if
any) within J_{i,m}.
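The recursive allocation of counts can be sketched as follows. For simplicity the sketch draws each Bin(N, 1/2) split directly, where the Hungarian construction would instead produce it from a standard normal via Tusnády's τ(N, η):

```python
import random

def dyadic_counts(n, m, rng=None):
    """Recursively split a count of n observations down the dyadic tree to
    level m, assigning a Bin(N, 1/2) share to each left child; a stand-in
    for Tusnady's tau(N, eta), which would derive that binomial from a
    standard normal eta."""
    rng = rng or random.Random(0)
    counts = {(0, 0): n}                                  # X_{i,k} = count in J_{i,k}
    for k in range(m):
        for i in range(2 ** k):
            N = counts[(i, k)]
            left = sum(rng.random() < 0.5 for _ in range(N))  # Bin(N, 1/2)
            counts[(2 * i, k + 1)] = left
            counts[(2 * i + 1, k + 1)] = N - left
    return counts

counts = dyadic_counts(32, 3)
# the leaf counts at level 3 partition the original n observations
assert sum(counts[(i, 3)] for i in range(8)) == 32
```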
The recursive construction works well because Tusnády's Lemma, even in its
weakened form <20>, provides us with a quadratic bound in the normal variables
for the difference between ν_n(ψ_{i,k}) and the corresponding η_{i,k}.
<23>
Lemma.
There exists a universal constant C such that, for each k and 0 ≤ i < 2^k,
Consider the effect of the split of J_j into its two subintervals, J′ := J(2i_j, j + 1)
and J″ := J(2i_j + 1, j + 1). Write N for nP_nJ_j and X for nP_nJ′, so that
Δ_j := √n P_nJ_j = N/√n,  Δ′ := √n P_nJ′ = X/√n,  and  Δ″ := √n P_nJ″ = Δ_j − Δ′.
From inequality <20>, we have X = N/2 + √N η_j/2 + R, where
<24>
|R| ≤ C₀(1 + η_j²)  and  |X − N/2| ≤ C₀√N (1 + |η_j|).
By construction,
2X − N = √N η_j + 2R,
and hence
ν_n(ψ_j) = 2^{j/2} √n P_n(2J′ − J_j) = 2^{j/2} (√N η_j + 2R)/√n.
From the first inequality in <24>,
<25>
|ν_n(ψ_j) − √(2^jN/n) η_j| ≤ 2C₀ 2^{j/2} (1 + η_j²)/√n.
Invoke the inequality |√a − √b| ≤ |a − b|/√b, for positive a and b, to deduce that
max(|√Δ′ − √(Δ_j/2)|, |√Δ″ − √(Δ_j/2)|) ≤ 2C₀ 2^{j/2} (1 + η_j²)/√n.
From <25> with j = k, and the inequality from the previous line, deduce that
7.
<26>
Theorem. The processes ν_n can be coupled with a Brownian Bridge B in such a
way that
ℙ{ sup_{0≤t≤1} |ν_n(0, t] − B(t)| ≥ (x + C₁ log n)/√n } ≤ C₀ exp(−x)  for each x ≥ 0,
with constants C₁ and C₀ that depend on neither n nor x.
Proof. We will build ν_n from B, allocating counts down to intervals of length 2^{−m},
as described in Section 6. It will then remain only to control the behavior of
both processes over small intervals. Let T(m) denote the set of grid points
{i/2^m : i = 0, 1, ..., 2^m} in [0, 1]. For each t in T(m), both series <21> and <22>
terminate after k = m, because [0, t] is orthogonal to each ψ_{i,k} for k > m. That
is, using the Hungarian construction we can determine P_nJ_{i,m} for each i, and then
calculate
ν_n(0, t] = Σ_{k≤m} Σ_{0≤i<2^k} ⟨(0, t], ψ_{i,k}⟩ ν_n(ψ_{i,k})  for t in T(m),
by Lemma <23>
<27>
As t ranges over T(m), or even over the whole of (0,1), the path defining
Sm(t) ranges over the set of all 2 m paths from the root down to the mth level.
We bound the maximum difference between the two processes if we bound the
maximum of Sm(t). The maximum grows roughly linearly with m, the same rate as
the contribution from a single t. More precisely
<28>
I postpone the proof of this result, in order not to break the flow of the main
argument.
REMARK.
The constants 5 and 4 are not magical. They could be replaced by
any other pair of constants for which ℙ exp((N(0, 1)² − c₁)/c₂) ≤ 1/2.
From inequalities <27> and <28> we have
<29>
a bound of order (x + m)/√n for the maximum deviation between the two processes
over the grid T(m). Provided we choose m smaller than a constant multiple of
x + log n, this term will cause us no trouble.
We now have the easy task of extrapolating from the grid T(m) to the whole
of (0, 1). We can make 2^{−m} exceedingly small by choosing m close to a large
enough multiple of x + log n. In fact, when x ≥ 2, the choice of m such that
<30>
2n²eˣ ≥ 2^m ≥ n²eˣ
will suffice. As an exercise, you might want to play around with other m and the
various constants to get a neater statement for the Theorem.
We can afford to work with very crude estimates. For each s in (0, 1) write t_s
for the point of T(m) for which t_s ≤ s < t_s + 2^{−m}. Notice that
|ν_n(0, s] − ν_n(0, t_s]| ≤ (# points in (t_s, s])/√n + √n 2^{−m}.
The supremum over s is larger than 3/√n only when at least one J_{i,m} interval, for
0 ≤ i < 2^m, contains 2 or more observations, an event with probability less than
2^m (n choose 2) (2^{−m})² ≤ n² 2^{−m},
for m as in <30>. Similarly,
sup_s |B(s) − B(t_s)| ≤ sup_s |G[0, s] − G[0, t_s]| + sup_s |(s − t_s) G[0, 1]|
≤ max_{0≤i<2^m} sup_{s∈J_{i,m}} |G[0, s] − G[0, i/2^m]| + 2^{−m} |G[0, 1]|,
where {G[0, s] : 0 ≤ s ≤ 1} is a Brownian motion. The second term on the right-hand
side exceeds x/√n with probability less than exp(−4^m x²/(2n)). By the reflection
principle for Brownian motion (Section 9.5), the probability that the first term
exceeds x/√n is at most
2^{m+1} ℙ{ |N(0, 2^{−m})| > x/√n } = 2^{m+1} ℙ{ |N(0, 1)| > 2^{m/2} x/√n } ≤ 2^{m+1} exp(−2^m x²/(2n)).
For x ≥ 2 and m as in <30>, the sum of the two contributions from the Brownian
Bridge is much smaller than e^{−x}.
From <29>, and the inequality
|ν_n(0, s] − B(s)| ≤ |ν_n(0, s] − ν_n(0, t_s]| + |ν_n(0, t_s] − B(t_s)| + |B(t_s) − B(s)|,
together with the bounds from the previous paragraph, you should be able to
complete the argument.
By means of the quantile transformation, Theorem <26> extends immediately
to a bound for the empirical distribution function F_n generated from a sample
ξ₁, ..., ξ_n from a probability measure on the real line with distribution function F.
Again writing q_F for the quantile function, and recalling that we can generate the
sample as ξ_i = q_F(x_i), we have
F_n(t) = P_n{x : q_F(x) ≤ t} = P_n(0, F(t)],
which implies √n(F_n(t) − F(t)) = ν_n(0, F(t)]. Notice that F(t) ranges over a
subset of [0, 1] as t ranges over ℝ; and when F has no discontinuities, the range
covers all of (0, 1). Theorem <26> therefore implies
<31>
ℙ{ sup_t |√n(F_n(t) − F(t)) − B(F(t))| ≥ (x + C₁ log n)/√n } ≤ C₀ exp(−x)  for x ≥ 0.
Inequality <31> lets us deduce results about the empirical distribution function F_n from analogous results about the Brownian Bridge. For example, it implies
sup_t √n|F_n(t) − F(t)| ⇝ sup_t |B(F(t))|. If F has no discontinuities, the limit
distribution is the same as that of sup_s |B(s)|. That is, we have an instant derivation of the Kolmogorov–Smirnov theorem. The Csörgő & Révész book describes
other consequences that make much better use of all the hard work that went into
establishing the KMT inequality.
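As a quick illustration of the Kolmogorov–Smirnov connection, the following sketch computes sup_t |F_n(t) − t| exactly from the order statistics of a uniform sample (names are ours):

```python
import random

def ks_statistic(sample):
    """sup_t |F_n(t) - t| for a sample from the Uniform(0,1) distribution,
    computed exactly from the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
n = 2000
stat = (n ** 0.5) * ks_statistic([rng.random() for _ in range(n)])
# sqrt(n) times the KS statistic is approximately distributed as sup_s |B(s)|,
# so very large values are rare
assert 0 < stat < 5
```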
8.
Problems
[1]
Suppose F and F_n, for n ∈ ℕ, are distribution functions on the real line for which
F_n(x) → F(x) for each x in a dense subset D of the real line. Show that the
corresponding quantile functions Q_n converge pointwise to Q at all except (at worst)
a countable subset of points in (0, 1). Hint: Prove convergence at each continuity
point u₀ of Q. Given points x′, x″ in D with x′ < x₀ = Q(u₀) < x″, find δ > 0 such
that x′ < Q(u₀ − δ) and Q(u₀ + δ) < x″. Deduce that
F_n(x′) < F(x′) + δ ≤ u₀ ≤ F(x″) − δ < F_n(x″)  eventually.
[3]
Show that the Prohorov distance is a metric. Hint: For the triangle inequality,
use the inclusion (B^δ)^ε ⊆ B^{δ+ε}. For symmetry, consider ρ(P, Q) < δ < ε. Put
D := (B^δ)^c. Prove that D^δ ⊆ B^c, then deduce that 1 − PB^δ = PD ≤ QD^δ + δ ≤ 1 − QB + δ.
[4]
[5]
each princess is to find a frog on her list to marry, then clearly the "Desirable Frog
Condition" (DFC), #K(A) ≥ #A for each A ⊆ S, must be satisfied. Show that DFC
is also sufficient for happy princesses: under the DFC there exists a one-to-one
map π from S into K(S) such that π(σ) ∈ K(σ) for every σ in S. Hint: Translate
the following mathematical fairy tale into an inductive argument.
(i) Once upon a time there was a princess σ₀ who proposed to marry a frog τ₀
from her list. That would have left a collection S\{σ₀} of princesses with lists
K(σ)\{τ₀} to choose from. If the analog of the DFC had held for those lists,
an induction hypothesis would have made everyone happy.
(ii) Unfortunately, a collection A₀ ⊆ S\{σ₀} of princesses protested, on the grounds
that #K(A₀)\{τ₀} < #A₀: clearly not enough frogs to go around. They pointed
out that the DFC held with equality for A₀, and that their happiness could be
assured only if they had exclusive access to the frogs in K(A₀).
(iii) Everyone agreed with the assertion of the A₀. They got their exclusive access,
and, by induction, lived happily ever after.
(iv) The other princesses then got worried. Each collection B in S\A₀ asked,
"#K(B)\K(A₀) ≥ #B?" They were reassured, "Don't worry. Originally
#K(B ∪ A₀) ≥ #B + #A₀, and we all know that #K(A₀) = #A₀, so of course
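The DFC is exactly Hall's condition, and the inductive fairy tale corresponds to an augmenting-path matching algorithm. A small sketch (all names are ours):

```python
from itertools import combinations

def hall_condition(lists):
    """Check the Desirable Frog Condition #K(A) >= #A for every nonempty
    subset A of princesses (lists: dict princess -> set of acceptable frogs)."""
    people = list(lists)
    for r in range(1, len(people) + 1):
        for A in combinations(people, r):
            if len(set().union(*(lists[p] for p in A))) < len(A):
                return False
    return True

def match(lists):
    """Kuhn's augmenting-path algorithm: returns a one-to-one assignment
    princess -> frog with pi(p) in K(p), which exists iff the DFC holds."""
    owner = {}  # frog -> princess currently holding it
    def try_assign(p, seen):
        for f in lists[p]:
            if f not in seen:
                seen.add(f)
                if f not in owner or try_assign(owner[f], seen):
                    owner[f] = p
                    return True
        return False
    for p in lists:
        if not try_assign(p, set()):
            return None
    return {p: f for f, p in owner.items()}

lists = {"a": {1, 2}, "b": {1}, "c": {2, 3}}
assert hall_condition(lists)
pi = match(lists)
assert all(pi[p] in lists[p] for p in lists) and len(set(pi.values())) == 3
```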
Prove Lemma <9> by carrying out the following steps. Write R_A for ∪_{α∈A}R_α.
Argue by induction on the size of S. With no loss of generality, suppose S =
{1, 2, ..., m}. Check the case m = 1. Work from the inductive hypothesis that the
result is true for #S < m.
(i) Suppose there exists a proper subset A₀ of S for which νA₀ = μR_{A₀}. Define
R′_α := R_α\R_{A₀} for α ∉ A₀. Show that νA ≤ μR′_A for all A ⊆ S\A₀. Construct
K by invoking the inductive hypothesis separately for A₀ and S\A₀. (Compare
with part (iv) of Problem [5].)
Now suppose νA < μR_A for all proper subsets A of S. Write L_α for the probability
distribution μ(· | R_α), which concentrates on R_α.
(ii) Show that μ ≥ ν{1}L₁.
Write δ₁ for the unit mass at 1. Let θ₀ be the largest value in [0, ν{1}] for which
(μ − θ₀L₁)R_A ≥ (ν − θ₀δ₁)A for every A ⊆ S.
(iii) If θ₀ = ν{1}, use the inductive hypothesis to find a probability kernel from
S\{1} into T for which (μ − ν{1}L₁) ≥ Σ_{α≥2} ν{α}K_α. Define K₁ := L₁.
(iv) If θ₀ < ν{1}, show that there exists an A₀ ⊆ S for which (μ − θL₁)R_{A₀} <
(ν − θδ₁)A₀ when ν{1} ≥ θ > θ₀. Deduce that A₀ must be a proper subset
of S for which (μ − θ₀L₁)R_{A₀} = (ν − θ₀δ₁)A₀. Invoke part (i) to find a
probability kernel M for which μ − θ₀L₁ ≥ (ν{1} − θ₀)M₁ + Σ_{α≥2} ν{α}M_α.
Define K₁ := (θ₀/ν{1})L₁ + (1 − θ₀/ν{1})M₁.
[7]
Establish the bound ℙ{|N(0, I_k)| ≥ √(kx)} ≤ (xe^{1−x})^{k/2}, for x ≥ 1, as needed (with
√(kx) = δ/σ) for the proof of Lemma <18>. Hint: Show that
9.
Notes
In increasing degrees of generality, representations as in Theorem <4> are due to
Skorohod (1956), Dudley (1968), Wichura (1970), and Dudley (1985).
Prohorov (1956) defined his metric for probability measures on complete,
separable metric spaces. Theorem <8> is due to Strassen (1965). I adapted the
proof from Dudley (1976, Section 18), who used the Marriage Lemma (Problem [5])
to prove existence of the desired coupling in a special discrete case. Lemma <9>
is a continuous analog of the Marriage Lemma, slightly extending the method of
Pollard (1984, Lemma IV.24).
The discussion in Section 4 is adapted from an exposition by Le Cam (1988) of
the method of Yurinskii (1977). I think the slightly weaker bound stated by Yurinskii
may be the result of his choosing a slightly different tail bound for N(0, I_k), with
a correspondingly different choice for the smoothing parameter.
The idea for Example <17> comes from the construction used by Dudley &
Philipp (1983) to build strong approximations for sums of independent random
processes taking values in a Banach space. Massart (1989) refined the coupling
technique, as applied to empirical processes, using a Hungarian coupling in place
of the Yurinskii coupling.
The proof of the KMT approximation in the original paper (Komlós et al. 1975)
was based on the analog of the first inequality in <20>, for |X − n/2| smaller than a
tiny multiple of n. The proof of the elegant refinement in Lemma <19> appeared in
a 1977 dissertation of Tusnády, in Hungarian. I have seen an annotated extract from
the dissertation (courtesy of Sándor Csörgő). Csörgő & Révész (1981, page 133)
remarked that Tusnády's proof is "elementary" but not "simple". I agree. Bretagnolle
& Massart (1989, Appendix) published another proof, an exquisitely delicate exercise
in elementary calculus and careful handling of Stirling's approximation. The method
used in Appendix D resulted from a collaboration between Andrew Carter and me.
Lemma <23> repackages a construction from Komlós et al. (1975) that has
been refined by several authors, most notably Bretagnolle & Massart (1989),
Massart (1989), and Koltchinskii (1994).
REFERENCES
Bretagnolle, J. & Massart, P. (1989), 'Hungarian constructions from the nonasymptotic viewpoint', Annals of Probability 17, 239–256.
Csörgő, M. & Révész, P. (1981), Strong Approximations in Probability and Statistics, Academic Press, New York.
Dudley, R. M. (1968), 'Distances of probability measures and random variables', Annals of Mathematical Statistics 39, 1563–1572.
Dudley, R. M. (1976), 'Convergence of laws on metric spaces, with a view to statistical testing'. Lecture Note Series No. 45, Matematisk Institut, Aarhus University.
Dudley, R. M. (1985), 'An extended Wichura theorem, definitions of Donsker classes, and weighted empirical distributions', Springer Lecture Notes in Mathematics 1153, 141–178. Springer, New York.
Dudley, R. M. & Philipp, W. (1983), 'Invariance principles for sums of Banach space valued random elements and empirical processes', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 62, 509–552.
Kim, J. & Pollard, D. (1990), 'Cube root asymptotics', Annals of Statistics 18, 191–219.
Koltchinskii, V. I. (1994), 'Komlós–Major–Tusnády approximation for the general empirical process and Haar expansion of classes of functions', Journal of Theoretical Probability 7, 73–118.
Komlós, J., Major, P. & Tusnády, G. (1975), 'An approximation of partial sums of independent rv's, and the sample df. I', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 32, 111–131.
Le Cam, L. (1988), On the Prohorov distance between the empirical process and the associated Gaussian bridge, Technical report No. 170, Department of Statistics, U.C. Berkeley.
Major, P. (1976), 'The approximation of partial sums of independent rv's', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 213–220.
Massart, P. (1989), 'Strong approximation for multivariate empirical and related processes, via KMT constructions', Annals of Probability 17, 266–291.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2 of NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.
Prohorov, Yu. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and Its Applications 1, 157–214.
Skorohod, A. V. (1956), 'Limit theorems for stochastic processes', Theory of Probability and Its Applications 1, 261–290.
Strassen, V. (1965), 'The existence of probability measures with given marginals', Annals of Mathematical Statistics 36, 423–439.
Tusnády, G. (1977), A Study of Statistical Hypotheses, PhD thesis, Hungarian Academy of Sciences, Budapest. In Hungarian.
Chapter 11

1. LIL for normal summands
Two important ideas run in tandem through this Chapter: the existence of exponential
tail bounds for sums of independent random variables, and proofs of the law of the
iterated logarithm (LIL) in various contexts. You could read the Chapter as either
a study of exponential inequalities, with the LIL as a guiding application, or as a
study of the LIL, with the exponential inequalities as the main technical tool.
The LILs will all refer to partial sums S_n := X_1 + ... + X_n for sequences
of independent random variables {X_i} with PX_i = 0 and var(X_i) := σ_i² < ∞,
for each i. The words iterated logarithm refer to the role played by the function
L(x) := √(2x log log x). To avoid minor inconveniences (such as having to exclude
cases involving logarithms or square roots of negative numbers), I arbitrarily define
L(x) as 1 for x < e^e ≈ 15.15. Under various assumptions, we will be able to prove,
with V_n := var(S_n), that

<1>    limsup_{n→∞} S_n/L(V_n) = 1    almost surely,

together with analogous assertions about the liminf and the almost sure behavior of
the sequence {S_n/L(V_n)}. Equality <1> breaks naturally into a pair of assertions,

<2>    limsup_{n→∞} S_n/L(V_n) ≤ 1    a.s.

and

       limsup_{n→∞} S_n/L(V_n) ≥ 1    a.s.,
inequalities that I will refer to as the upper and lower halves of the LIL, or upper
and lower LILs, for short. In general, it will be easier to establish the upper half,
because the exponential inequalities required for that case are easier to prove.
As you will see, several of the techniques used for proving LIL's are refinements
of techniques used in Chapter 4 (appeals to the Borel–Cantelli lemma, truncation of
summands, bounding of whole blocks of terms by means of maximal inequalities)
for proving strong laws of large numbers (SLLN). Indeed, the LIL is sometimes
described as providing a rate of convergence for the SLLN.
The theory is easiest to understand when specialized to the normal distribution,
for which the following result holds.
<3> Theorem. For partial sums S_n of a sequence of independent N(0,1)-distributed random
variables, with L_n := L(n):

(i) limsup_{n→∞} S_n/L_n = 1    a.s.

(ii) liminf_{n→∞} S_n/L_n = −1    a.s.

(iii) S_n/L_n ∈ J infinitely often, a.s., for every open subinterval J of [−1, 1].
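The theorem invites a quick numerical illustration (not part of the text; a numpy sketch with an arbitrary seed and path length): simulate one long path of standard normal partial sums and watch S_n/L(n), with L as defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

def L(n):
    # L(n) := sqrt(2 n log log n), with the book's convention L(n) = 1 for n < e^e
    n = np.asarray(n, dtype=float)
    out = np.ones_like(n)
    big = n >= np.e ** np.e
    out[big] = np.sqrt(2.0 * n[big] * np.log(np.log(n[big])))
    return out

N = 10**6
S = np.cumsum(rng.standard_normal(N))
idx = np.arange(1, N + 1)
ratio = S / L(idx)

# The LIL predicts limsup S_n/L(n) = 1 and liminf = -1; over any finite
# horizon the running extremes typically sit somewhat below 1 in size.
rmax, rmin = ratio[100:].max(), ratio[100:].min()
print("running max of S_n/L(n):", rmax)
print("running min of S_n/L(n):", rmin)
```

Because the limsup is approached only along rare excursions, even a million steps usually leaves the running maximum visibly short of 1.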
Proof. The key requirement for the proof of the upper LIL is an exponential tail
bound, such as (see Appendix D for the proof)

<4>    P{S_n > x√n} ≤ exp(−x²/2)    for x > 0,

from which an appeal to the Borel–Cantelli lemma gives

<5>    limsup_{n→∞} S_n/√(2n log n) ≤ 1    almost surely.
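Since S_n/√n has exactly a N(0,1) distribution here, the tail bound can be checked directly against the exact normal tail (a small sketch using the complementary error function; the constant-free form exp(−x²/2) is the one assumed above):

```python
import math

def normal_tail(x):
    # P{N(0,1) > x}, computed via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in (0.5, 1.0, 2.0, 3.0, 4.0):
    bound = math.exp(-0.5 * x * x)
    # the exact tail always sits below the exponential bound
    print(f"x={x:3.1f}   P{{N(0,1) > x}} = {normal_tail(x):.3e}  <=  {bound:.3e}")
```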
This result is not quite what we need for the upper LIL. Somehow we must replace
the √(2n log n) factor by √(2n log log n), without disturbing the almost sure bound.
As with the proof of the SLLN in Section 4.6, the improvement is achieved
by collecting the S_n into blocks, then applying Borel–Cantelli to a bound for a
maximum over a block. To handle the contributions from within each block we
need a maximal inequality, such as the following one-sided analog of the bound
from Section 4.6.
<6>
Define blocks B_k := {n : n_k ≤ n < n_{k+1}}, where n_k/ρ^k → 1 for some
constant ρ > 1 (depending on γ) that needs to be specified. For a fixed γ > 1,

P{S_n > γL(n) for some n ∈ B_k} ≤ P{max_{n∈B_k} S_n > γL(n_k)} ≤ ...    by <4>,

a bound that is summable in k, whence S_n ≤ γL(n) eventually,
almost surely.
The lower half of the LIL asserts that, with probability one, S_n > γL(n)
infinitely often, for each fixed γ < 1. To prove this assertion, it is enough if we can
find a sequence {n(k)} along which S_{n(k)} > γL_{n(k)} infinitely often, with probability
one. (I will write n(k) and L_{n(k)}, instead of n_k and L(n_k), to avoid nearly invisible
subscripts.) The proof uses the Borel–Cantelli lemma in the other direction, namely:
if {A_k} is a sequence of independent events for which Σ_k PA_k = ∞ then the event
{A_k occurs infinitely often} has probability one.

The sums S_{n(k)} are not independent, because of shared summands. However
the events A_k := {S_{n(k)} − S_{n(k−1)} > γL_{n(k)}} are independent. If we can choose n(k)
so that Σ_k PA_k = ∞ then we will have

<7>    limsup_k (S_{n(k)} − S_{n(k−1)})/L_{n(k)} ≥ γ    almost surely.

If n(k) increases rapidly enough to ensure that L_{n(k−1)}/L_{n(k)} → 0, then <5> will
force S_{n(k−1)}/L_{n(k)} → 0 almost surely, and the lower LIL will follow.

We need a lower bound for PA_k. The increment S_{n(k)} − S_{n(k−1)} has a N(0, m(k))
distribution, where m(k) := n(k) − n(k−1). From Appendix D,

<8>    ...    for k large enough.

The choice n(k) := k^k ensures that both n(k)/m(k) → 1 and L_{n(k−1)}/L_{n(k)} → 0.
With θ close enough to 1, the lower bound behaves like (k log k)^{−α} for an α < 1,
making Σ_k PA_k diverge, and completing the proof of (i).
2. LIL for bounded summands
<9>    ...

Proof. Replace L(W_n) by the lower bound L(W_{n(k)}) to put the inequality in a
form amenable to an application of the Maximal Inequality <6>. Then argue, by
Tchebychev's inequality, that for n(k) ≤ n ≤ n(k + 1),

P{S_{n(k+1)} − S_n ≥ −δL(W_{n(k)})} ≥ 1 − var(S_{n(k+1)} − S_n)/(δL(W_{n(k)}))² ≥ ...

<10>    ...    when we focus on departures "not too far out into the tails."
<11> Bennett's Inequality. Let Y_1, ..., Y_n be independent random variables with
PY_i = 0 and Y_i ≤ M for each i, and put V := Σ_{i≤n} σ_i², where σ_i² := var(Y_i). Then

P{Y_1 + ... + Y_n ≥ x} ≤ exp(−(x²/2V) ψ(Mx/V))    for x ≥ 0,

with ψ the function defined in Appendix C.

Proof. Bound the moment generating function Π_{i≤n} P exp(tY_i). The function

<12>    Δ(x) := 2(e^x − 1 − x)/x²,    with Δ(0) = 1,

is nonnegative and increasing over the whole real line. Rewrite exp(tY_i) as

1 + tY_i + ½(tY_i)²Δ(tY_i) ≤ 1 + tY_i + ½t²Y_i²Δ(tM).
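Bennett's inequality can be compared against an empirical tail for bounded summands (a sketch; the normalization ψ(y) := 2((1+y)log(1+y) − y)/y² is the standard one and is assumed here to match the Appendix C function):

```python
import math
import random

def psi(y):
    # psi(y) := 2((1+y)log(1+y) - y)/y^2, with psi(0) = 1
    if y == 0.0:
        return 1.0
    return 2.0 * ((1.0 + y) * math.log(1.0 + y) - y) / (y * y)

def bennett_bound(x, V, M):
    # Bennett: P{Y_1 + ... + Y_n >= x} <= exp(-(x^2/2V) psi(Mx/V))
    return math.exp(-(x * x / (2.0 * V)) * psi(M * x / V))

random.seed(1)
n, M, trials = 200, 1.0, 5000
V = float(n)                      # each Rademacher sign has variance 1
x = 2.5 * math.sqrt(V)

hits = sum(1 for _ in range(trials)
           if sum(random.choice((-1.0, 1.0)) for _ in range(n)) >= x)
emp, bd = hits / trials, bennett_bound(x, V, M)
print("empirical tail:", emp, "  Bennett bound:", bd)
```

For small Mx/V the bound is close to the normal tail exponent exp(−x²/2V), which is exactly the "slightly corrupted normal bound" role it plays in this Chapter.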
<13> Corollary. The LIL <1> holds for sums of bounded, independent summands satisfying

<15>    V_n → ∞    and    M_n = o(√(V_n / log log V_n))    as n → ∞,
a condition introduced by Kolmogorov (1929) to prove the LIL <1> for partial
sums of independent random variables X_i with PX_i = 0 and |X_i| ≤ M_i.
The proof of the upper LIL under <15> is essentially the same as the proof
for uniformly bounded summands, as sketched above. The lower LIL requires an
analog of the exponential lower bound <8>. The next Section establishes this lower
bound, an extremely delicate exercise in Calculus. With this exponential bound, you
could prove the corresponding lower LIL by modifying the analogous proof from
Section 1 (or the proof of the lower LIL that will be sketched in Section 4).
*3. Kolmogorov's exponential lower bound

The constant x_0 will need to be large enough that η ≥ exp(−εx_0²/2) and

2 + 3(1 + ε)x exp(−x²(1 + ε)²κ) ≤ ½ exp(−½x²(1 + ε)²(1 − η))    for x ≥ x_0.
<18>    P exp(tS_n) = Π_{i≤n} P exp(tY_i) ≥ Π_{i≤n} (1 + ½t²σ_i²Δ(−tδ)) ≥ exp(...)    for 0 ≤ t ≤ 2κ/δ,

the product expansion coming from <12> and the fact that Σ_i σ_i² = V_n = 1. We also
have an upper bound for the same quantity,

<19>    P exp(tS_n) = P ∫_{−∞}^∞ t e^{ty} {S_n > y} dy ≤ 1 + ∫_0^∞ t e^{ty} P{S_n > y} dy.
The idea now is to choose t so that the last integrand is maximized somewhere
in a small interval J := [x, w], which contributes at most

<20>    ∫_x^w t e^{ty} P{S_n > x} dy ≤ e^{tw} P{S_n > x}

to the right-hand side of <19>. We need the other contributions to <19>, from y
outside J, to be relatively small. For such y, we can use Bennett's inequality to
bound the integrand by t exp(ty − ½y²ψ(yδ)). If we were to ignore the ψ factor,
the bound would be maximized at y = t, which suggests we make t slightly larger
than x but smaller than w. Specifically, choose t := (1 + ε)x and w := (1 + 4ε)x,
for a small ε that needs to be specified. (Note that t ≤ 2x ≤ 2κ/δ, as required
for <18>.)
When y is large, the ψ factor has a substantial effect on the y². However, using
the fact (Appendix C) that yψ(y) is an increasing function of y, and the constraint
x ≤ κ/δ, we have (yδ)ψ(yδ) ≥ (8xδ)ψ(8xδ) ≥ (8xδ)ψ(8κ) when y ≥ 8x, hence
½y²ψ(yδ) ≥ 2ty.

The contribution from the region where y ≥ 8x is therefore small if x ≤ κ/δ:

<21>    ∫_{8x}^∞ t e^{ty} P{S_n > y} dy ≤ ∫_{8x}^∞ t e^{−ty} dy = exp(−8tx).

Within the interval [0, 8x] we have ψ(yδ) ≥ ψ(8xδ) ≥ ψ(8κ) ≥ (1 + η)^{−1}
because x ≤ κ/δ, and the integrand t e^{ty} P{S_n > y} is less than
t exp(ty − y²/(2(1 + η))).
Notice that the exponent is maximized at y = (1 + η)t = (1 + η)(1 + ε)x, which lies
in the interior of J, with

min((1 + η)t − x, w − (1 + η)t) ≥ εx.
If divided by √(2π(1 + η)), a constant smaller than 3, the factor contributed by the
quadratic in y turns into the N((1 + η)t, (1 + η)) density. The contribution to the
bound <19> from y in [0, 8x]\J is less than

<22>    3t exp(½t²(1 + η)) P{|N(0, 1 + η)| > εx}.
Combining the inequalities from <18> and <19>, with the right-hand side
of the latter broken into contributions bounded via <21>, <22>, and <20>, then
rewritten as functions of x, we have

<23>    ... ≤ 2 + 3(1 + ε)x exp(−x²(1 + ε)²κ) + e^{tw} P{S_n > x}.

We need to absorb the first two terms on the left-hand side into the right-hand side,
which will happen for large enough x if we ensure that η is small enough.
Choose η := η(ε) to make this inequality hold (a value slightly smaller than ε²/2
will suffice), then find x_ε so that the sum of the first two terms in <23> is smaller
than half the right-hand side when x ≥ x_ε. We may also assume that x_ε is large
enough that η ≥ exp(−εx_ε²/2). Then we have

P{S_n > x} ≥ ½ exp(−½x²(1 + ε)²(1 + 4ε)),

which is possible because the left-hand side tends to 1/2 as ε decreases to zero.
Put x_0 equal to the corresponding x_ε.

*4. Identically distributed summands
<24> Theorem. For partial sums S_n of independent, identically distributed random
variables X_i with PX_i = 0 and PX_i² = 1, and L_n := L(n):

(i) limsup_{n→∞} S_n/L_n = 1    a.s.

(ii) liminf_{n→∞} S_n/L_n = −1    a.s.

(iii) S_n/L_n ∈ J infinitely often, a.s., for every open subinterval J of [−1, 1].
Most of the ideas needed for the proof are contained in Sections 1 and 2. I will
merely sketch the arguments needed to prove Theorem <24>, with emphasis on the
way the new idea, truncation, fits neatly with the other techniques.
The truncated variables will satisfy an analog of <15> with V_n := n, except
that the o(·) will be replaced by a fixed, small factor. The fragments discarded by
the truncation will be controlled by the following lemma, which formalizes an
idea of DeAcosta (1983).
<25> Lemma. The function g(x) := x/L(x) is increasing for x ≥ e^e, and
∫_{e^e}^t (1/L(x)) dx ≤ C g(t) for a constant C.

Proof. Differentiate.

...

The inequality also implies that ...

<26>    τ(x, M) := (x ∧ M) ∨ (−M).

For a fixed ε > 0, which will depend on γ, define, for i ≥ 17 > e^e,
Y_i := τ(X_i, εg_i) and Z_i := X_i − Y_i,
with g_i := g(i), as defined by Lemma <25>, and μ_i := PY_i. Note that |μ_i| ≤ P|Z_i|, because
P(Y_i + Z_i) = PX_i = 0.
REMARK.
Notice the dependence of the truncation level on i. Compare with the
truncation used to prove the SLLN under a first moment assumption in Section 4.7,
and the truncations used to prove central limit theorems in Section 7.2. For almost
sure convergence arguments it is common for each variable to have its own truncation
level, because ultimately Borel–Cantelli assertions must depend on convergence of a
single infinite series, rather than on convergence of changing sequences of partial
sums to zero.
The partial sum S_n decomposes into Σ_{i≤n}(Y_i − μ_i) + Σ_{i≤n} μ_i + Σ_{i≤n} Z_i. The first
sum will be handled essentially as in Section 2. The other two sums will be small
relative to L_n, by virtue of the following bound,

Σ_{i=17}^n |μ_i| ≤ Σ_{i=17}^n P|Z_i| ≤ Σ_{i=17}^n P|X_1|{|X_1| > εg_i}    (identical distributions)

≤ (1/ε) Σ_{i=17}^n PX_1²{|X_1| > εg_i}/g_i ≤ ... = o(L_n)    by Lemma <25>.
...

For fixed γ, the ψ factor can be brought as close to 1 as we please, by choosing ε
small enough. The other term in the exponent behaves like (γ²/ρ) log log ρ^k. With
appropriate choices for ρ and ε, we therefore have the tail probability decreasing
faster than k^{−α}, for some α > 1, which leads to the desired upper LIL.
Lower LIL

For a fixed γ < 1 we need to show limsup (T_n/L_n) ≥ γ almost surely. As in
Section 1, look along the subsequence n(k) := k^k. Write T for T_{n(k)} − T_{n(k−1)}, and V
for var(T) = V_{n(k)} − V_{n(k−1)}. We can make V/n(k) as close to 1 as we please, by
making k large enough. The summands contributing to T are bounded in absolute
value by δ√V, where δ := 2εg_{n(k)}/√V.

We need to bound P{T ≥ γL_{n(k)}} from below by a term of a divergent series.
Fix a θ > 1. Write x for γL_{n(k)}/√V. Inequality <16> gives

P{T ≥ γL_{n(k)}} = P{T ≥ x√V} ≥ exp(−½θx²) = exp(−θγ² · 2n(k) log log n(k)/(2V))
= exp(−θγ²(1 + o(1)) log log n(k)).

With ε small enough, the range eventually contains the desired x value. The rest of
the argument follows as in Section 1.
Cluster points

Assertion (iii) of Theorem <24> will follow from assertion (i), by means of an
ingenious projection argument, borrowed from Finkelstein (1971). Construct new
independent observations {X′_i} with the same distribution as the {X_i}, and let
S′_n := X′_1 + ... + X′_n. Write W_n for the random vector (S_n, S′_n)/L_n and u_θ for
the unit vector (cos θ, sin θ). For each fixed θ, the summands X_i cos θ + X′_i sin θ
defining W_n · u_θ have mean zero, variance one, and they are identically distributed.
From (i), limsup_{n→∞} (W_n · u_θ) = 1 almost surely. Given ε > 0, there exists a finite
collection of halfspaces {(x, y) · u_θ ≤ 1 + ε} whose intersection lies inside the ball
of radius 1 + 2ε about the origin. It follows that limsup |W_n| ≤ 1 + 2ε almost
surely. The geometry of the circle then forces W_n to visit each neighborhood of
the boundary point u_θ infinitely often, with probability one.

The projection of such a neighborhood onto the horizontal axis
gives a neighborhood of the point cos θ, which S_n/L_n must visit
infinitely often. After a casting out of a countable sequence
of negligible sets, for a countable collection of subintervals of
(−1, 1), we then deduce assertion (iii).
5. Problems

[1] Let X have a Bin(n, p) distribution. Define q := 1 − p. For 0 < x < nq, show that
... Hint: Bound the tail probability by exp(−t(np + x) + n log(q + pe^t)) for t ∈ R⁺,
then minimize the expression in the exponent by Calculus. For 0 < x < nq show
... For the second bound, use
convexity of ψ.
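The minimization in the hint can be carried out explicitly and checked against the exact binomial tail (a sketch; the parameter choices n = 50, p = 0.3 are arbitrary):

```python
import math

def binom_tail(n, p, k):
    # exact P{Bin(n,p) >= k}
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def chernoff_bound(n, p, x):
    # inf_t exp(-t(np + x) + n log(q + p e^t)); the minimizing t solves
    # n p e^t/(q + p e^t) = np + x, i.e. e^t = (np + x) q / ((nq - x) p)
    q = 1.0 - p
    t = math.log((n * p + x) * q / ((n * q - x) * p))
    return math.exp(-t * (n * p + x) + n * math.log(q + p * math.exp(t)))

n, p = 50, 0.3
for x in (2.0, 5.0, 10.0):
    exact = binom_tail(n, p, math.ceil(n * p + x))
    bound = chernoff_bound(n, p, x)
    print(f"x={x}: exact tail {exact:.4e} <= bound {bound:.4e}")
```

The inequality holds for every 0 < x < nq by Markov's inequality applied to exp(tX), so the check is a sanity test of the algebra rather than of the mathematics.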
[2] ...

[3] Let X_1, ..., X_n be independent random variables with PX_i := p_i and 0 ≤ X_i ≤ 1 for
each i. Let p̄ := (p_1 + ⋯ + p_n)/n =: 1 − q̄. Show that

...
(ii) Derive the same tail bound by a passage to the limit in the binomial bound
from Problem [1].
[4] For each random variable X with finite variance, and each constant M, show that
var(τ(X, M)) ≤ var(X), with τ as defined in <26>. Hint: Let X′ be an independent
copy of X. Show that 2var(τ(X, M)) = P|τ(X, M) − τ(X′, M)|² and also that
|τ(x, M) − τ(x′, M)| ≤ |x − x′| for all real x and x′.
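Assuming τ(x, M) is truncation at the levels ±M (consistent with the Lipschitz property in the hint), the variance inequality is easy to check by simulation:

```python
import random
import statistics

def tau(x, M):
    # truncation at +-M: tau(x, M) := max(-M, min(x, M))
    return max(-M, min(x, M))

random.seed(2)
xs = [random.gauss(0.0, 3.0) for _ in range(100000)]
v_full = statistics.pvariance(xs)
for M in (0.5, 1.0, 2.0, 5.0):
    v_trunc = statistics.pvariance([tau(x, M) for x in xs])
    print(f"M={M}: var(tau(X,M)) = {v_trunc:.3f} <= var(X) = {v_full:.3f}")
```

The hint's argument is visible in the code: τ is a 1-Lipschitz map, so it can only bring independent copies closer together, shrinking P|τ(X) − τ(X′)|² = 2var(τ(X)).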
[5] ...

6. Notes
My understanding of the LIL began with the reading of Feller (1968, Section VIII.5),
Lamperti (1966, Section 11) and Stout (1974, Chapter 5). I learned the idea of
regarding the Bennett inequality as a slightly corrupted (by the presence of the
ψ function in the exponent) analog of the normal tail bound from conversations
with Galen Shorack and Jon Wellner. Shorack (1980) systematically exploited the
idea to establish very sharp LIL results for the empirical distribution function.
For a beautiful exposition of the many applications of the idea to the study of
inequalities for the empirical distribution function and related processes see Shorack
& Wellner (1986, Chapter 11).
The method used to establish the Bennett inequality, but not the form of
the inequality, comes from Chow & Teicher (1978, page 338). They developed
exponential inequalities suitable for derivation of LIL results. I am uncertain about
the earlier history of the exponential bounds. Apparently (cf. Kolmogorov & Sarmanov 1960), Bernstein's inequality comes from a 1924 paper. The Bennett (1962)
and Hoeffding (1963) papers contain other tail bounds for sums of random variables,
with some references to further literature.
Apparently the first versions of the LIL with a log log bound are due to
Khinchin (1923, 1924). For the early history of the LIL, leading up to the definitive
version by Kolmogorov (1929), see Feller (1943). Hartman & Wintner (1941)
extended to the case of independent identically distributed summands with finite
second moments, by means of a truncation argument. (Actually, they did not
assume identical distributions, but only a domination condition for the tails, which
holds under the second moment condition for identically distributed summands.)
My exposition in Section 4 draws from DeAcosta (1983), who gave an elegant
alternative derivation (and extension) of the HartmanWintner version of the LIL.
I thank Jim Kuelbs for the proof of part (iii) of Theorem <3>.
Chapter 12

1. Introduction
Of all the probability distributions on multidimensional Euclidean spaces the
multivariate normal is the most studied and, in many ways, the most tractable.
In years past, the statistical subject known as "Multivariate Analysis" was almost
entirely devoted to the study of the multivariate normal. The literature on Gaussian
processes (stochastic processes whose finite dimensional distributions are all
multivariate normal) is vast. It is important to know a little about the multivariate
normal.
As you saw in Section 8.6, the multivariate normal is uniquely determined by
its vector of means and its matrix of covariances. In principle, everything that one
might want to know about the distribution can be determined by calculation of means
and covariances, but in practice it is not completely straightforward. In this Chapter
you will see two elegant examples of what can be achieved: Fernique's (1975)
inequality, which deduces important information about the spread in a multivariate
normal distribution from its covariances; and Borell's (1975) Gaussian isoperimetric
inequality, with a proof due to Ehrhard (1983a, 1983b). Both results are proved by
careful Calculus.
The Chapter provides only a very brief glimpse of Multivariate Analysis and
the theory of Gaussian processes, two topics that are covered in great detail in many
specialized texts. I have chosen merely to present examples that give the flavor of
some of the more modern theory. Both the Fernique and Borell inequalities have
found numerous applications in the recent research literature.
2. Fernique's inequality

The larger σ², the more spread out is the N(0, σ²) distribution. Fernique (1975,
page 18) proved a striking multivariate generalization of this simple fact.
<1> Theorem. Suppose X and Y both have centered (zero means) multivariate normal
distributions, with P|X_i − X_j|² ≤ P|Y_i − Y_j|² for all i, j. Then

P f(max_i X_i − min_i X_i) ≤ P f(max_i Y_i − min_i Y_i)

for each increasing, convex function f on R⁺.
The theorem lets us deduce inequalities for multivariate normal distributions,
with potentially complicated covariance structures, by making comparisons with
simpler processes.
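A Monte Carlo sketch of such a comparison (numpy; the equicorrelated-versus-independent choice and all parameters are arbitrary): shrinking the increments P|X_i − X_j|² below those of Y should shrink the expected range.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, rho = 8, 50000, 0.8

Z = rng.standard_normal((trials, n))
Z0 = rng.standard_normal((trials, 1))

# Equicorrelated X: P|X_i - X_j|^2 = 2(1 - rho^2) <= 2 = P|Y_i - Y_j|^2 for Y = Z
X = rho * Z0 + np.sqrt(1.0 - rho**2) * Z
range_X = X.max(axis=1) - X.min(axis=1)
range_Y = Z.max(axis=1) - Z.min(axis=1)

# Fernique's comparison (with f the identity) predicts the first mean is smaller.
print("mean range of X:", range_X.mean())
print("mean range of Y:", range_Y.mean())
```

The common factor ρZ_0 cancels in max − min, which is exactly why dependence of this kind can only reduce the spread.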
<2> Example. (Sudakov's minoration) Suppose P|Y_α − Y_β|² ≥ δ² for all α ≠ β. For
independent variables X_α, each N(0, δ²/2) distributed,

P|X_α − X_β|² = δ² ≤ P|Y_α − Y_β|²,

and Theorem <1> bounds P(max_α Y_α − min_α Y_α) from below by
P(max_α X_α − min_α X_α), which grows like a multiple of δ√(log n),
as asserted.

REMARK. The lower bound is sharp within a constant, in the following sense. If
P|Y_j − Y_k|² ≤ δ² for all j ≠ k then P max_i Y_i = PY_1 + P max_i(Y_i − Y_1) = P max_i(Y_i − Y_1)
and, by Jensen's inequality and monotonicity,

exp((P max_i(Y_i − Y_1)/2δ)²) ≤ P max_i exp((Y_i − Y_1)²/4δ²) ≤ n P exp(N(0, ¼)²).

Thus P max_i Y_i is bounded above by 2δ√(log(√2 n)).
*3. Proof of Fernique's inequality
where Δ := V_1 − V_0. Differentiation under the integral sign gives

<4>    H′(θ) = ½ Σ_{j,k} Δ_{jk} m(f(max_i x_i − min_i x_i) ∂²_{jk} g_θ)    by <3>.

Two integrations by parts will replace the ∂²_{jk}g_θ integrals by integrals involving the
nonnegative functions f′ and f″, reducing the expression for H′(θ) to a sum of
the form Σ_{j<k} (Δ_{jj} + Δ_{kk} − 2Δ_{jk}) (something nonnegative). To establish such a
representation, we need to keep track of contributions from subregions of R^n defined
by inequalities involving the functions

L(x) := max_i x_i    and    S(x) := min_i x_i,

together with their analogs L_j(x) and S_j(x), defined with the jth coordinate omitted.
Notice that L(x) = L_j(x) ∨ x_j, and S(x) = S_j(x) ∧ x_j, for each j.
Let m_j denote Lebesgue measure on the jth coordinate space R, and M_j denote
(n − 1)-dimensional Lebesgue measure on the product of the remaining coordinate
subspaces. That is, m_j integrates over x_j and M_j integrates over the remaining n − 1
coordinate variables. The product m_j ⊗ M_j equals m, Lebesgue measure on R^n.
The function x_j ↦ f(L − S) is absolutely continuous with almost sure derivative ...
Here, and subsequently, I ignore the m-negligible set of x for which there is a tie for
maximum or minimum. The function ∂_k g_θ decreases to zero exponentially fast as
max_i |x_i| → ∞. An integration by parts with respect to x_j gives
m_2(f′(L − S)({x_1 = L} − {x_1 = S}) g_θ) ...    a.e. [m_2].

The m_2-negligible set allows for x where x_1 and x_2 tie for maximum or minimum.
Integrate with respect to M_2 to get an expression for the mixed partial derivative,

M(f(L − S) ∂²_{12} g_θ) = ... = A_{1,2} − B_{1,2}.
Notice the curious form of the integral for A_{1,2}. It runs over n − 1 variables (x_2
omitted) ranging over the sets where either x_1 is the smallest or the largest of those
variables. The second argument of g_θ, previously occupied by x_2, is now occupied
by the extremal value x_1. We would get exactly the same integral if we interchanged
the roles of x_1 and x_2. Write A_{j,k} for the analogous integral with x_j taking over the
role of x_1 and x_k taking over the role of x_2. Nonnegativity of f′, and the symmetry
of the roles of the two variables, ensure that A_{j,k} = A_{k,j} ≥ 0 for all j ≠ k. Similarly,
write B_{j,k} for the analog of B_{1,2} with x_j taking over the role of x_1 and x_k taking
over the role of x_2. Nonnegativity of f″ implies that B_{j,k} = B_{k,j} ≥ 0 for all j ≠ k.
Repeated partial derivatives

The calculations for ∂²_{11} g_θ are similar. Integrate first with respect to x_1:

...

When split according to which of the variables x_2, ..., x_n achieves the extremal
value S_1 or L_1, the first integral breaks into the sum Σ_{k≥2} A_{1,k}, and the second into
Σ_{k≥2} B_{1,k}. The other repeated derivatives contribute similar expressions.
Collection of terms

Substitute the results from the two integrations by parts into <4>:

H′(θ) = ½ Σ_{j,k} Δ_{jk} M(f(L − S) ∂²_{jk} g_θ) = ... = Σ_{j<k} (Δ_{jj} + Δ_{kk} − 2Δ_{jk}) (A_{j,k} + B_{j,k})/2.
4. Gaussian isoperimetric inequality

<7> Theorem. Let A be a closed subset of R^n for which γ_nA ≥ Φ(α). Then
γ_nA^r ≥ Φ(α + r) for each r ≥ 0, where A^r denotes the set of points within
distance r of A and γ_n the standard normal distribution on R^n.

For coordinates Y_1, ..., Y_n of a centered multivariate normal vector, a median M
of max_{i≤n} Y_i, and σ² ≥ max_{i≤n} var(Y_i), the Theorem gives

<8>    P{max_{i≤n} Y_i ≤ M + σr} ≥ Φ(r)    and    P{max_{i≤n} Y_i ≥ M − σr} ≥ Φ(r),

that is,

<9>    P{|max_{i≤n} Y_i − M| > σr} ≤ 2Φ̄(r),    where Φ̄(r) := P{N(0, 1) > r}.
That is, the spread in the distribution of max_i Y_i about its median is no worse than
the spread in the distribution of the Y_i with largest variance.

In special cases, such as independent variables (Problem [3]), one can get
tighter bounds, but Borell's inequality has two great virtues: it is impervious to
the effects of possible dependence between the Y_i, and it does not depend on n.
In consequence, it implies similar concentration inequalities for general Gaussian
processes, with surprising consequences.
Each half of the Borell inequality follows easily from Theorem <7> if we
represent the Y_i as linear functions Y_i(x) := μ_i + θ_i′x on R^n equipped with the
measure γ_n. (That is, regard the vector of Y_i's as a linear transformation of a
vector of independent N(0, 1)'s.) The assumption about the variances becomes
var(Y_i) = |θ_i|² ≤ σ² for each i. By definition of the median M, the set

A := {x ∈ R^n : max_{i≤n} Y_i(x) ≤ M}

has γ_nA ≥ 1/2 = Φ(0). If a point x lies within a distance r of a point x_0 in A, then
|θ_i′x − θ_i′x_0| ≤ |θ_i| |x − x_0| ≤ σr, for each i. Thus the neighborhood A^r is contained
within {x : max_{i≤n} Y_i(x) ≤ M + σr}, and

P{max_{i≤n} Y_i ≤ M + σr} ≥ γ_nA^r ≥ Φ(0 + r),

as asserted by the upper half of the Borell inequality. The derivation for the
companion inequality is similar. □

<10> Example. Suppose X_1, X_2, ... is a Gaussian sequence of random variables. The
supremum S := sup_i X_i might take infinite values, but if P{S < ∞} > 0 then
Borell's inequality <9> will show that the distribution of S must have an upper
tail that decreases like that of a normal distribution. More precisely, the constant
σ² := sup_i var(X_i) must be finite, and there must exist some finite constant M such
that

<11>    P{S > M + σr} ≤ Φ̄(r) = P{N(0, 1) > r}.

Of course there is no comparable result needed for the lower tail, because S is
nonnegative.

The tail bound <11> ensures that S has finite moments of all orders, and in
fact P exp(αS²) < ∞ for all α < (2σ²)^{−1}. See Problem [8].
We can establish <11> by reducing the problem to finite sets of random
variables, to which Borell's inequality applies. Write S_n for max_{i≤n} X_i and M_n for
its median. Define σ_n² := max_{i≤n} var(X_i). From <9>,

P{|S_n − M_n| > σ_n r} ≤ 2Φ̄(r).
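The finite-max concentration can be seen in a simulation (numpy sketch; the moving-average construction is just one arbitrary way to produce correlated N(0,1) coordinates, and the median is estimated empirically):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, trials = 20, 100000

# Correlated Gaussian coordinates: neighboring moving averages of iid normals,
# scaled so that each Y_i has variance exactly 1.
Z = rng.standard_normal((trials, n + 1))
Y = (Z[:, :-1] + Z[:, 1:]) / math.sqrt(2.0)

S = Y.max(axis=1)
M = np.median(S)          # empirical median of the max
sigma = 1.0               # max_i var(Y_i)

for r in (0.5, 1.0, 2.0):
    emp = (S > M + sigma * r).mean()
    bound = 0.5 * math.erfc(r / math.sqrt(2.0))   # P{N(0,1) > r}
    print(f"r={r}: empirical {emp:.4f} <= Borell bound {bound:.4f}")
```

The empirical tail sits well below the bound: Borell's inequality pays no attention to the dependence, which here only helps the concentration.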
*5. Proof of the isoperimetric inequality
When t ≤ Φ(x⁻ − 2r) − δ the neighborhoods V_t and J^r are disjoint, and γ_1B_t^r =
γ_1A^r + γ_1J^r. For larger t, the two neighborhoods overlap, and γ_1B_t^r ≤ γ_1V_t + γ_1J^r.
If we choose t with γ_1I_t < γ_1J_t then B_t is an improvement over A. The following
Lemma shows that we get the most improvement by pushing t to one of the extreme
positions, because a concave function on [δ, t*] achieves its minimum at one of the
endpoints.
<13> Lemma. For each fixed r > 0 the function G(t) := γ_1V_t is a concave function
of t on [δ, 1 − δ].

Proof. It is enough if we show that the derivative G′ is a decreasing function on
(δ, 1 − δ). By direct differentiation of Φ(x⁺ + r) − Φ(x⁻ − r) as a function of t we
have

G′(t) = φ(x⁺ + r)/φ(x⁺) − φ(x⁻ − r)/φ(x⁻).

REMARK. Actually, log(φ(x + r)/φ(x)) = −xr − r²/2, which is clearly decreasing. I wrote
the argument using concavity because I suspect there might be a more general version
of the isoperimetric inequality provable by similar methods (cf. Bobkov 1996).

Both x⁺ and x⁻ are increasing functions of t. Thus G′ equals a decreasing
function of t minus the reciprocal of another decreasing function of t, which makes
G′ decreasing, and G concave.
With the appropriate t, the improvement B_t is also a finite union of disjoint
intervals, ∪_{i=1}^n J_i′, with either J_1′ := ∅ or J_1′ := [−∞, x⁺].

Now repeat the argument, replacing J_2′ by an interval that abuts either J_1′ or
J_3′. And so on. After at most n such operations, A is replaced by either a single
semi-infinite interval (the desired minimizing set; it doesn't matter whether the
interval extends off to −∞ or to +∞) or a union D := [−∞, z⁻] ∪ [z⁺, ∞], for
which γD = γA and γD^r ≤ γA^r.

For the second possibility we may assume that z⁺ − z⁻ > 2r to avoid the
trivial case where D^r = R. The complement of D is an interval of the form
(Φ^{−1}(t − δ), Φ^{−1}(t + δ)) for some t, where 2δ = γ_1(z⁻, z⁺). Calculations almost
identical to those for the proof of Lemma <13>, with r replaced by −r, show that

H(t) := Φ(Φ^{−1}(t + δ) − r) − Φ(Φ^{−1}(t − δ) + r)

is a convex function of t. It achieves its maximum at one of the extreme positions,
which corresponds to the transformation of D into a single semi-infinite interval.
The proof for the one-dimensional case of Theorem <7> is complete.
(ii) Two dimensions

The 1-shifts along each section parallel to the x-axis transform the closed set A into
another closed set B with γ_2B = γ_2A ≥ Φ(α).

<14> Lemma. ...
Proof. We need to show that γ_2B^r ≤ γ_2A^r. By Fubini, it is enough if we can show
that γ_1(B^r)^y ≤ γ_1(A^r)^y, for all y.

...

It follows that B is an improvement over A.
To make precise the idea that a sequence of shifts can make A look more like a
halfspace, we need some way of measuring how close a set is to being a halfspace.
The picture suggests a method. The idea is that there
should be a cone C of directions that we can follow from
each point of A without leaving A. (The jagged edges
on A and the cone C are meant to suggest that both sets
should be extended off to infinity; there is not enough
room on the page to display the whole sets.) If the vertex
angle θ of the cone were a full 180°, the set A would be a
halfspace (or the whole of R²). More formally, let us say
that a set A has θ spread if there exists a cone C with vertex angle θ such that
{x + y : x ∈ A, y ∈ C} ⊆ A. Call C the spreading cone.

For example, the set B produced by Lemma <14> has spread of at least zero,
with spreading cone C := {(x, 0) : x ≤ 0}.
<15> Lemma. If a closed set A has θ spread, for some θ less than π, there exists a
shift S_u such that S_uA has (π + θ)/2 spread.

Proof. To make the picture easier to draw I will assume that the axis of symmetry
of the cone C points along the y-axis. The required shift then has u pointing along
the x-axis.
Consider the sections through S_uA at heights y and y′ = y + δ, with δ > 0.
Both sections are intervals, (−∞, g(y)] and (−∞, g(y′)], where γ_1A_y := Φ(g(y))
and γ_1A_{y′} := Φ(g(y′)). Define ε := δ tan(θ/2). If x ∈ A_y then (x + t, y + δ) ∈ A
for all t with |t| ≤ ε, because A has θ spread. That is, the section A_{y′} contains a
one-dimensional neighborhood F := A_y[ε]. The cross section at y′ has γ_1 measure
greater than γ_1A_y[ε], which by part (i) is greater than Φ(g(y) + ε). That is,
Φ(g(y′)) ≥ Φ(g(y) + ε),
whence g(y′) ≥ g(y) + ε. It follows that S_uA has spread at least (π + θ)/2.
Repeated application of Lemma <15>, starting from A, produces improvements
B_1, B_2, B_3, B_4, ..., with spreads at least 0, π/2, 3π/4, 7π/8, and so on. Each B_i has
γ_2 measure equal to Φ(α) = γ_2A, and γ_2A^r ≥ γ_2B_1^r ≥ γ_2B_2^r ≥ ... We may even
rotate each B_n so that its spreading cone has axis parallel to the x axis.

The fact that B_n has spread π − 2ε_n, with ε_n → 0, and the fact that γ_2B_n = Φ(α)
together force B_n^r to lie close to the half space H := {(x, y) ∈ R² : x ≤ α + r}
eventually. More precisely, it forces liminf B_n^r ⊇ H, in the sense that each point of
H must belong to B_n^r for all n large enough.

For if a point (x_0 + r, y_0) of H were not in B_n^r then the point (x_0, y_0) could not
lie in B_n, which would ensure that no points of B_n lie outside the set

O_n := {(x, y) : x ≤ x_0 + |y − y_0| tan ε_n}.
6. Problems

[1] Let V_0 and V_1 be positive definite matrices. For 0 ≤ θ ≤ 1 let g_θ denote the
N(0, (1 − θ)V_0 + θV_1) density on R^n. Show that

∂g_θ(x)/∂θ = ½ Σ_{j,k} Δ_{jk} ∂²g_θ(x)/∂x_j∂x_k,    where Δ := V_1 − V_0,

by following these steps.
(i) Use Fourier inversion to show that g_θ(x) = (2π)^{−n} ∫ exp(ix′t − ½t′V_θt) dt.

(ii) Justify differentiation under the integral sign to show that

∂g_θ(x)/∂θ = −(2π)^{−n} ∫ ½ t′Δt exp(ix′t − ½t′V_θt) dt

and

∂²g_θ(x)/∂x_j∂x_k = (2π)^{−n} ∫ (it_j)(it_k) exp(ix′t − ½t′V_θt) dt.
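The identity in Problem [1] can be verified numerically by finite differences (a numpy sketch; the matrices V_0, V_1, the point x, and θ are arbitrary choices):

```python
import numpy as np

def density(x, V):
    # centered multivariate normal density at x with covariance V
    n = len(x)
    Vinv = np.linalg.inv(V)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(V))
    return norm * np.exp(-0.5 * x @ Vinv @ x)

V0 = np.array([[2.0, 0.3], [0.3, 1.0]])
V1 = np.array([[1.5, 0.8], [0.8, 2.0]])
Delta = V1 - V0
x = np.array([0.7, -0.4])
theta, h = 0.4, 1e-4

Vt = (1 - theta) * V0 + theta * V1

# d g_theta / d theta by central differences (d V_theta / d theta = Delta)
dtheta = (density(x, Vt + h * Delta) - density(x, Vt - h * Delta)) / (2 * h)

# (1/2) sum_{j,k} Delta_jk d^2 g / dx_j dx_k by mixed central differences
rhs = 0.0
for j in range(2):
    for k in range(2):
        ej, ek = np.eye(2)[j], np.eye(2)[k]
        d2 = (density(x + h*ej + h*ek, Vt) - density(x + h*ej - h*ek, Vt)
              - density(x - h*ej + h*ek, Vt) + density(x - h*ej - h*ek, Vt)) / (4*h*h)
        rhs += 0.5 * Delta[j, k] * d2

print("d g/d theta:", dtheta, "   (1/2) sum Delta_jk d2 g/dx_j dx_k:", rhs)
```

The two numbers agree to high accuracy, which is the heat-equation structure behind equality <3> and the Fernique and Slepian proofs.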
[3] ...

[4] Let X and Y be random variables for which both P{X > x} ≤ P{Y > x} and
P{X < −x} ≤ P{Y < −x}, for all x ≥ 0. Show that P exp(tX) ≤ P exp(tY) for all
nonnegative t. Hint: Consider P ∫_0^∞ t exp(tx){X > x} dx.
[5] ...

[6] Reduce the one-dimensional version of Theorem <7> to the case where A is a finite
union of intervals, by following these steps. Let Φ(α) := γ_1A.

(i) Show that it is enough to prove that γ_1A^{r+δ} ≥ Φ(r + α) for each δ > 0, because
A^{r+δ} ↓ A^r as δ decreases to zero.

(ii) Define an open set G := {x : d(x, A) < δ}. Show that γ_1G ≥ Φ(α). Hint: The set
G\A is open.

(iii) Show that there exists a countable family {J_i} of disjoint closed intervals for
which γ_1(G\∪_i J_i) = 0.

(iv) Choose N so that the closed set A_N := ∪_{i≤N} J_i has γ_1 measure greater than Φ(α).
Show that the assertion of the Theorem for A_N implies that γ_1A^{r+δ} ≥ Φ(r + α).
[7] (Slepian's (1962) inequality) Let X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) both have
centered multivariate normal distributions with var(X_j) = var(Y_j) for each j and
PX_jX_k ≤ PY_jY_k for all j ≠ k. Prove that P∪_j{X_j > α_j} ≥ P∪_j{Y_j > α_j} for all real
numbers α_1, ..., α_n. Hint: Use equality <3> to show that m^x(Π_{i≤n}{x_i ≤ α_i} g_θ(x)) is
a decreasing function of θ.
[8] Let S be a nonnegative random variable with P{S > M + r} ≤ C_0 exp(−½r²/σ²) for
all r ≥ 0, for positive constants C_0, M, and σ².
7. Notes
There is a huge amount of literature on Gaussian process theory, with which I am
only partially acquainted. I took the proof of the Sudakov minoration (Example <2>)
from Fernique (1975, page 27). According to Dudley (1999, notes to Section 2.3),
a stronger, but incorrect, result was first stated by Sudakov (1971). The result in
Example <io> is due to Marcus & Shepp (1972), who sharpened earlier results of
Landau & Shepp (1970) and Fernique.
See Dudley (1973) for a detailed discussion of sample path properties of
Gaussian processes, particularly regarding entropy characterizations of boundedness or
continuity of paths. Dudley (1999, Chapter 2) contains much interesting information
about Gaussian processes and their recent history. I also found the books by Jain &
Marcus (1978), Adler (1990, Section II), and Lifshits (1995) useful references.
The Lifshits book contains an exposition of Ehrhard's method, similar to the
one in Section 5, and proofs (Section 14) of the Fernique and Slepian inequalities.
See also the notes in that section of his book for a discussion of related work of
Schläfli and the contributions of Sudakov.
Borell's isoperimetric inequality was, apparently, also proved by similar methods
by Tsirel'son and Sudakov in a 1974 paper which I have not seen. Tsirel'son (1975)
mentioned the result and the method. The notes of Ledoux (1996, Section 4) discuss
the isoperimetric inequality in great detail.
REFERENCES
Billingsley, P. (1986), Probability and Measure, second edn, Wiley, New York.
Bobkov, S. (1996), 'Extremal properties of half-spaces for log-concave distributions', Annals of Probability 24, 35-48.
Borell, C. (1975), 'The Brunn-Minkowski inequality in Gauss space', Inventiones Math. 30, 207-216.
Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of Probability 1, 66-103.
Dudley, R. M. (1999), Uniform Central Limit Theorems, Cambridge University Press.
Ehrhard, A. (1983a), 'Un principe de symétrisation dans les espaces de Gauss', Springer Lecture Notes in Mathematics 990, 92-101.
Ehrhard, A. (1983b), 'Symétrisation dans l'espace de Gauss', Mathematica Scandinavica 53, 281-301.
Fernique, X. (1975), 'Régularité des trajectoires des fonctions aléatoires gaussiennes', Springer Lecture Notes in Mathematics 480, 1-97.
Jain, N. C. & Marcus, M. B. (1978), Advances in probability, in J. Kuelbs, ed., 'Probability in Banach Spaces', Vol. 4, Dekker, New York, pp. 81-196.
Landau, H. J. & Shepp, L. A. (1970), 'On the supremum of a Gaussian process', Sankhyā: The Indian Journal of Statistics, Series A 32, 369-378.
Ledoux, M. (1996), 'Isoperimetry and Gaussian analysis', Springer Lecture Notes in Mathematics 1648, 165-294.
Lifshits, M. A. (1995), Gaussian Random Functions, Kluwer.
Marcus, M. B. & Shepp, L. A. (1972), 'Sample behavior of Gaussian processes', Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability 2, 423-441.
Slepian, D. (1962), 'The one-sided barrier problem for Gaussian noise', Bell System Technical Journal 41, 463-501.
Sudakov, V. N. (1971), 'Gaussian random processes and measures of solid angles in Hilbert space', Soviet Math. Doklady 12, 412-415.
Tsirel'son, V. S. (1975), 'The density of the distribution of the maximum of a Gaussian process', Theory of Probability and Its Applications 20, 847-856.
Appendix A

1. Measures and inner measure
μ(∪ᵢAᵢ) = Σᵢ μAᵢ for sequences {Aᵢ : i ∈ N} of pairwise disjoint sets from A.
If property SF3 is weakened to require stability only under finite unions and
intersections, the class is called a field. If property M2 is weakened to hold only
for disjoint unions of finitely many sets from A, the set function is called a finitely
additive measure.
Where do measures come from? Typically one starts from a nonnegative real-valued set function μ defined on a small class of sets 𝒦₀, then extends to a sigma-field A containing 𝒦₀. One must at least assume "measure-like" properties for μ on 𝒦₀ if such an extension is to be possible. At a bare minimum,
(M₀)  μ∅ = 0.
The extension should also be inner regular:
μA = sup{μK : A ⊇ K ∈ 𝒦₀}  for each A in A.
REMARK. When 𝒦₀ consists of compact sets, a measure with the inner regularity property is often called a Radon measure.
The desired regularity property makes it clear how the extension of μ must be constructed, namely, by means of the inner measure μ*, defined for every subset A of X by μ*A := sup{μK : A ⊇ K ∈ 𝒦₀}.
In the business of building measures it pays to start small, imposing as few conditions on the initial domain 𝒦₀ as possible. The conditions are neatly expressed by means of some picturesque terminology. Think of X as a large expanse of muddy lawn, and think of subsets of X as paving stones to lay on the ground, with overlaps permitted. Then a collection of subsets of X would be a paving for X. The analogy might seem far-fetched, but it gives a concise way to describe properties of various classes of subsets. For example, a field is nothing but a (∅, ∪f, ∩f, c) paving, meaning that it contains the empty set and is stable under the formation of finite unions (∪f), finite intersections (∩f), and complements (c). A (∅, ∪c, ∩c, c) paving is just another name for a sigma-field; the ∪c and ∩c denote countable unions and intersections. With inner approximations the natural assumption is that 𝒦₀ be at least a (∅, ∪f, ∩f) paving, a lattice of subsets.
REMARK. Note well: a lattice is not assumed to be stable under differences or the taking of complements. Keep in mind the prime example, where 𝒦₀ denotes the class of compact subsets of a (Hausdorff) topological space, such as the real line. Inner approximation by compact sets has turned out to be a good thing for probability theory.
For a general lattice 𝒦₀, the role of the closed sets (remember F for fermé) is played by the class F(𝒦₀) of all subsets F for which FK ∈ 𝒦₀ for every K in 𝒦₀. (Of course, 𝒦₀ ⊆ F(𝒦₀). The inclusion is proper if X ∉ 𝒦₀.) The sigma-field B(𝒦₀) generated by F(𝒦₀) will play the role of the Borel sigma-field.
The first difficulty along the path leading to countably additive measures lies in the choice of the sigma-field A, in order that the restriction of μ* to A have the desired countable additivity properties. The Carathéodory splitting method identifies a suitable class of sets by means of an apparently weak substitute for the finite additivity property. Define S₀ as the class of all subsets S of X for which
<1>  μ*A = μ*(AS) + μ*(AS^c)  for every subset A of X.
If S ∈ S₀ then μ* adds the measures of the disjoint sets AS and AS^c correctly. As far as μ* is concerned, S splits the set A "properly."
<2>  Lemma. The class S₀ of all subsets S with property <1> is a field. The restriction of μ* to S₀ is a finitely additive measure.
Proof. Trivially S₀ contains the empty set (because μ*∅ = 0), and it is stable under the formation of complements. To establish the field property it therefore suffices to show that S₀ is stable under finite intersections.
Suppose S and T belong to S₀. Let A be an arbitrary subset of X. Split A into two pieces using S, then split each of those two pieces using T. From the defining property of S₀,
μ*A = μ*(AST) + μ*(AST^c) + μ*(AS^cT) + μ*(AS^cT^c).
Decompose A(ST)^c similarly to see that the last three terms sum to μ*A(ST)^c. The intersection ST splits A correctly; the class S₀ contains ST; the class is a field. If ST = ∅, choose A := S ∪ T to show that the restriction of μ* to S₀ is finitely additive.
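On a finite set X the splitting condition can be checked exhaustively, which makes the Lemma easy to see in action. The set function below is an arbitrary choice with μ*∅ = 0 (the only property the Lemma uses); it is deliberately not additive, and its S₀ collapses to the trivial field {∅, X}, illustrating how little is guaranteed without tightness.

```python
from itertools import chain, combinations
from math import ceil

X = frozenset(range(4))
subsets = {frozenset(s) for s in chain.from_iterable(
    combinations(range(4), r) for r in range(5))}

def mu_star(A):
    # Arbitrary set function with mu*(empty) = 0; deliberately not additive.
    return ceil(len(A) / 2)

# S0: the sets S that split every A properly, as in <1>.
S0 = {S for S in subsets
      if all(mu_star(A) == mu_star(A & S) + mu_star(A - S) for A in subsets)}

# Lemma <2>: S0 is a field (contains the empty set, stable under
# complements and finite intersections).
is_field = (frozenset() in S0
            and all(X - S in S0 for S in S0)
            and all(S & T in S0 for S in S0 for T in S0))
print(sorted(map(len, S0)), is_field)
```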
At the moment there is no guarantee that S₀ includes all the members of 𝒦₀, let alone all the members of B(𝒦₀). In fact, the Lemma has nothing to do with the choice of μ and 𝒦₀ beyond the fact that μ*∅ = 0. To ensure that S₀ ⊇ 𝒦₀ we must assume that μ has a property called 𝒦₀-tightness, an analog of finite additivity that compensates for the fact that the difference of two 𝒦₀ sets need not belong to 𝒦₀. Section 2 explains 𝒦₀-tightness. Section 3 adds the assumptions needed to make the restriction of μ* to S₀ a countably additive measure.
2. Tightness

If S₀ is to contain every member of 𝒦₀, every set K ∈ 𝒦₀ must split every set K₁ ∈ 𝒦₀ properly, in the sense of Definition <1>. Writing K₀ for K₁K, we then have the following property as a necessary condition for 𝒦₀ ⊆ S₀. It will turn out that the property is also sufficient.
<3>  Definition. Say that μ is 𝒦₀-tight if μK₁ = μK₀ + μ*(K₁\K₀) for all pairs of sets in 𝒦₀ with K₀ ⊆ K₁.
Additivity for disjoint 𝒦₀-sets implies superadditivity for the inner measure,
<4>  μ*(A ∪ B) ≥ μ*A + μ*B  when A and B are disjoint,
because the union of each inner approximating H for A and each inner approximating K for B is an inner approximating set for A ∪ B. Tightness also gives us a way to relate S₀ to 𝒦₀.
<5>  Lemma. Let 𝒦₀ be a (∅, ∪f, ∩f) paving, and μ be a 𝒦₀-tight set function. Then
(i) S ∈ S₀ if and only if μK ≤ μ*(KS) + μ*(K\S) for all K in 𝒦₀;
(ii) the field S₀ contains the field generated by F(𝒦₀).
Proof. Take a supremum in (i) over all K ⊆ A to get μ*A ≤ μ*(AS) + μ*(A\S). The superadditivity property <4> gives the reverse inequality.
If S ∈ F(𝒦₀) and K ∈ 𝒦₀, the pair K₁ := K and K₀ := KS are candidates for the tightness equality, μK = μ(KS) + μ*(K\S), implying the inequality in (i).

3. Countable additivity
Countable additivity ensures that measures are well behaved under countable limit operations. To fit with the lattice properties of 𝒦₀, it is most convenient to insert countable additivity into the construction of measures via a limit requirement that has been called σ-smoothness in the literature. I will stick with that term, rather than invent a more descriptive name (such as "continuity from above"), even though I feel that it conveys not quite the right image for a set function.
<6>  Definition. Say that μ is σ-smooth (along 𝒦₀) at a set K in 𝒦₀ if μKₙ ↓ μK for every decreasing sequence of sets {Kₙ} in 𝒦₀ with intersection K.
REMARK. It is important that μ takes only (finite) real values for sets in 𝒦₀. If λ is a countably additive measure on a sigma-field A, and Aₙ ↓ A_∞ with all Aₙ in A, then we need not have λAₙ ↓ λA_∞ unless λAₙ < ∞ for some n, as shown by the example of Lebesgue measure with Aₙ = [n, ∞) and A_∞ = ∅.
Notice that the definition concerns only those decreasing sequences in 𝒦₀ for which ∩_{n∈N} Kₙ ∈ 𝒦₀. At the moment, there is no presumption that 𝒦₀ be stable under countable intersections. As usual, the σ is to remind us of the restriction to countable families. There is a related property called τ-smoothness, which relaxes the assumption that there are only countably many Kₙ sets; see Problem [1].
Tightness simplifies the task of checking for σ-smoothness. The next proof is a good illustration of how one makes use of 𝒦₀-tightness and the fact that μ* has already been proven finitely additive on the field S₀.
<7>  Lemma. Let μ be a 𝒦₀-tight set function that is σ-smooth (along 𝒦₀) at ∅. Then μ is σ-smooth at every K in 𝒦₀.

<8>  Theorem. Let μ be a 𝒦₀-tight set function on a (∅, ∪f, ∩f) paving 𝒦₀, σ-smooth along 𝒦₀ at ∅. Then S₀ is a sigma-field containing F(𝒦₀), and the restriction of μ* to S₀ is a countably additive measure.
For countably many disjoint sets {Tᵢ : i ∈ N} from S₀ with union S, a passage to the limit along finite subunions shows that μ*S ≤ Σ_{i∈N} μ*Tᵢ. The reverse inequality follows from the superadditivity property <4>. The set function μ* is countably additive on the sigma-field S₀.
In one particularly important case we get σ-smoothness for free, without any extra assumptions on the set function μ. A paving 𝒦₀ is said to be compact (in the sense of Marczewski 1953) if: to each countable collection {Kᵢ : i ∈ N} of sets from 𝒦₀ with ∩_{i∈N} Kᵢ = ∅ there is some finite n for which ∩_{i≤n} Kᵢ = ∅. In particular, if Kᵢ ↓ ∅ then Kₙ = ∅ for some n. For such a paving, the σ-smoothness property places no constraint on μ beyond the standard assumption that μ∅ = 0.
<9>
4.
<10>  Lemma. Let μ be a 𝒦₀-tight set function on a (∅, ∪f, ∩f) paving 𝒦₀, which is σ-smooth along 𝒦₀ at ∅. Then the extension μ̄ of μ to the ∩c-closure 𝒦, defined by
<11>  μ̄H := inf{μK : H ⊆ K ∈ 𝒦₀}  for H ∈ 𝒦,
is 𝒦-tight and σ-smooth (along 𝒦) at ∅.
Proof. Lemma <7> gives us a simpler expression for μ̄. If {Kₙ : n ∈ N} ⊆ 𝒦₀ and Kₙ ↓ L ∈ 𝒦, then μ̄L = infₙ μKₙ, because, for each 𝒦₀-subset K with K ⊇ L,
μKₙ ≤ μ(Kₙ ∪ K) ↓ μK  as n → ∞,
by σ-smoothness of μ at K.
Together, Theorem <8> and Lemma <10> give a highly useful extension theorem for set functions defined initially on a lattice of subsets.
<12>  Theorem. Let μ be a 𝒦₀-tight set function on a (∅, ∪f, ∩f) paving 𝒦₀, which is σ-smooth along 𝒦₀ at ∅. Then μ has a 𝒦-regular, countably additive extension μ̃ to a sigma-field S containing B(𝒦), with
μ̃K = μ̄K  for K ∈ 𝒦,
μ̃S = sup{μ̄K : S ⊇ K ∈ 𝒦}  for S ∈ S.
5. Lebesgue measure
There are several ways in which to construct Lebesgue measure on R^k. The following method for R² is easily extended to other dimensions.
Take 𝒦₀ to consist of all finite unions of semi-open rectangles (α₁, β₁] × (α₂, β₂]. Each difference of two semi-open rectangles can be written as a disjoint union of at most eight similar rectangles. As a consequence, every member of 𝒦₀ has a representation as a finite union of disjoint semi-open rectangles, and 𝒦₀ is stable under the formation of differences. The initial definition of Lebesgue measure m, as a set function on 𝒦₀, might seem obvious: add up the areas of the disjoint rectangles. It is a surprisingly tricky exercise to prove rigorously that m is well defined and finitely additive on 𝒦₀.
REMARK. The corresponding argument is much easier in one dimension. It is, perhaps, simpler to consider only that case, then obtain Lebesgue measure in higher dimensions as a completion of products of one-dimensional Lebesgue measures.
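In one dimension the bookkeeping is easy to mechanize: represent a member of the lattice by a list of half-open intervals (a, b], merge overlapping pieces into a canonical disjoint representation, and let m be the total length. A sketch (the particular test sets are arbitrary choices) checking that m is well defined and finitely additive:

```python
def canonical(intervals):
    # Merge half-open intervals (a, b] into disjoint, ordered pieces.
    pieces = sorted((a, b) for a, b in intervals if a < b)
    merged = []
    for a, b in pieces:
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def m(intervals):   # measure of a finite union of (a, b]
    return sum(b - a for a, b in canonical(intervals))

# The value does not depend on the representation of the set:
assert m([(0, 3)]) == m([(0, 1), (1, 2), (2, 3)]) == m([(0, 2), (1, 3)]) == 3
# Finite additivity for disjoint members of the class:
A, B = [(0, 1), (4, 5)], [(2, 3)]
assert m(A + B) == m(A) + m(B)
print(m(A + B))
```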
It follows that mKₙ tends to zero as n tends to infinity. The finitely additive measure m is σ-smooth along 𝒦₀ at ∅. By Theorem <12>, it extends to a 𝒦-regular, countably additive measure on S, a sigma-field that contains all the sets in F(𝒦). You should convince yourself that 𝒦, the ∩c-closure of 𝒦₀, contains all compact subsets of R², and F(𝒦) contains all closed subsets. The sigma-field S is complete and contains the Borel sigma-field B(R²). In fact S is the Lebesgue sigma-field, the completion of the Borel sigma-field.
6. Integral representations
Throughout the book I have made heavy use of the fact that there is a one-to-one correspondence (via integrals) between measures and increasing linear functionals on M⁺ with the Monotone Convergence property. Occasionally (as in Sections 4.8 and 7.5), I needed an analogous correspondence for functionals on a subcone of M⁺. The methods from Sections 1, 2, and 3 can be used to construct measures representing such functionals if the subcone is stable under lattice-like operations.
<13>
(H1)
(H2)
(H3)
(H4)
belong to H⁺;
The best example of a lattice cone to keep in mind is the class C₀⁺(R^k) of all nonnegative, continuous functions with compact support on some Euclidean space R^k.
REMARK. By taking the positive part of the difference in H2, we keep the function nonnegative. Properties H1 and H2 are what one would get by taking the collection of all positive parts of members of a vector space of functions. Property H4 is sometimes called Stone's condition. It is slightly weaker than an assumption that the constant function 1 should belong to H⁺. Notice that the cone C₀⁺(R^k) satisfies H4, but it does not contain nonzero constants. Nevertheless, if h ∈ H⁺ and α is a positive constant then the function (h − α)⁺ = (h − α(1 ∧ h/α))⁺ belongs to H⁺.
<14>
(T1)
(T2)
(T3)
(T4)
<16>
Proof. We must prove that μ is σ-smooth along 𝒦₀ at ∅ and 𝒦₀-tight; and then prove that Th ≥ μh and Th ≤ μh for every h in H⁺.
σ-smoothness: Suppose Kₙ ∈ 𝒦₀ and Kₙ ↓ ∅. Express each Kₙ as a pointwise infimum of functions {h_{n,i}} in H⁺. Write hₙ for inf_{m≤n, i≤n} h_{m,i}. Then Kₙ ≤ hₙ ↓ 0, and hence μKₙ ≤ Thₙ ↓ 0, by the σ-smoothness for T and the definition of μ.
for h. Notice that {h > iε} ∈ 𝒦₀, by Lemma <15>. Find sequences {h_{n,i}} from H⁺ with h_{n,i} ↓ {h > iε}.
<17>
<18>
as t → 0, by σ-smoothness of T.
7. Problems
[1]
[2]
8. Notes
The construction via 𝒦-tight inner measures is a reworking of ideas from Topsøe (1970). The application to integral representations is a special case of results proved by Pollard & Topsøe (1975).
The book by Fremlin (1974) contains an extensive treatment of the relationship
between measures and linear functionals. The book by König (1997) develops the theory of measure and integration with a heavy emphasis on inner regularity.
See Pfanzagl & Pierlo (1969) for an exposition of the properties of pavings
compact in the sense of Marczewski.
REFERENCES
Appendix B
Hilbert spaces
SECTION 1 defines Hilbert space and presents two basic inequalities.
SECTION 2 establishes the existence of orthogonal projections onto closed subspaces of a
Hilbert space.
SECTION 3 defines orthonormal bases of Hilbert spaces. Vectors in the space have
representations as infinite linear combinations (convergent series) of basis vectors.
SECTION 4 shows how to construct a random process from an orthonormal sequence of
random variables and an orthonormal basis.
1. Definitions
Hilbert space is an infinite dimensional generalization of ordinary Euclidean space.
Arguments involving Hilbert spaces look similar to their analogs for Euclidean
space, with the addition of occasional precautions against possible difficulties with
infinite dimensionality.
<1>
2. Orthogonal projections
Many proofs for R^k that rely only on completeness carry over to Hilbert spaces. For example, if H₀ is a subspace of R^k, then to each vector x there exists a vector x₀ in H₀ that is closest to x. The vector x₀ is characterized by the property that x − x₀ is orthogonal to H₀. The vector x₀ is called the orthogonal projection of x onto H₀, or the component of x in the subspace H₀; the vector x − x₀ is called the component of x orthogonal to H₀.
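For subspaces of R^k both characterizations (closest point, and orthogonality of x − x₀ to the subspace) can be verified directly with a least-squares solve; the matrix and vector below are arbitrary random choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 3))   # columns span a 3-dim subspace H0 of R^6
x = rng.standard_normal(6)

coef, *_ = np.linalg.lstsq(B, x, rcond=None)
x0 = B @ coef                     # orthogonal projection of x onto H0

# The residual x - x0 is orthogonal to every vector in H0:
print(float(np.abs(B.T @ (x - x0)).max()))

# ... and x0 is the closest point of H0 to x:
for _ in range(100):
    y = B @ rng.standard_normal(3)
    assert np.linalg.norm(x - y) >= np.linalg.norm(x - x0) - 1e-9
```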
Projections also exist for closed subspaces of a Hilbert space H. (Recall that a subset G of H is said to be closed if it contains all its limit points: if {gₙ} ⊆ G and ‖gₙ − f‖ → 0 then f ∈ G.) Every finite dimensional subspace is automatically closed (Problem [2]); infinite dimensional subspaces need not be closed (Problem [3]).
<2>
<3>
For such an inequality to hold, the coefficient ⟨f − f₀, g⟩ of the linear term in t must be zero. (Otherwise what would happen for t close to zero?)
To prove uniqueness, suppose both f₀ and f₁ have the projection property. Then f₁ − f₀ in H₀ would be orthogonal to both f − f₀ and f − f₁, from which it follows that f₁ − f₀ is orthogonal to itself, whence f₁ = f₀.
Corollary. (Riesz-Fréchet Theorem) To each continuous linear map T from a Hilbert space H into R there exists a unique element h₀ in H such that Th = ⟨h, h₀⟩ for all h in H.
Proof. Uniqueness is easy to prove: if ⟨h, h₀⟩ = ⟨h, h₁⟩ for all h, then the choice h := h₀ − h₁ gives ‖h₀ − h₁‖ = 0.
Existence is also easy when T is identically zero, for then Th = 0 = ⟨h, 0⟩. So let us assume that there exists an h₂ with Th₂ ≠ 0. Let h₃ denote the component of h₂ that is orthogonal to the closed (by continuity of T) linear subspace H₀ := {h ∈ H : Th = 0} of H. Note that Th₃ = Th₂ − T(h₂ − h₃) = Th₂ ≠ 0.
For h in H, define Ch := Th/Th₃. The difference h − (Ch)h₃ belongs to H₀
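The construction in the proof can be traced through in R^n, where every continuous linear functional has the form Th = c·h and the representing element must come out equal to c itself; the particular functional below is an arbitrary choice.

```python
import numpy as np

c = np.array([2.0, -1.0, 0.5, 3.0])      # T(h) = c . h on R^4 (arbitrary)
T = lambda h: float(c @ h)

h2 = np.array([1.0, 0.0, 0.0, 0.0])      # any h2 with T(h2) != 0
assert T(h2) != 0

# H0 = kernel of T is the orthogonal complement of c, so the component
# of h2 orthogonal to H0 is its projection onto the span of c:
h3 = (c @ h2) / (c @ c) * c
# The recipe from the proof: h0 = (T h3 / ||h3||^2) h3 represents T.
h0 = T(h3) / (h3 @ h3) * h3

print(np.allclose(h0, c))   # for T(h) = c . h the representer is c itself
```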
3. Orthonormal bases
A family of vectors ψ = {ψᵢ : i ∈ I} in a Hilbert space H is said to be orthonormal if ⟨ψᵢ, ψᵢ⟩ = 1 for all i and ⟨ψᵢ, ψⱼ⟩ = 0 for all i ≠ j. The subspace H₀ spanned by ψ consists of all linear combinations Σ_{i∈J} αᵢψᵢ, with J ranging over all finite subsets of I. If H₀ is dense in H (that is, if the closure H̄₀ of H₀ equals H) then the family ψ is called an orthonormal basis for H.
<4>  Lemma. Let ψ = {ψᵢ : i ∈ I} be an orthonormal basis for H. Then:
(i) For each h in H, the set I_h := {i ∈ I : ⟨h, ψᵢ⟩ ≠ 0} is countable, and for every enumeration {i(1), i(2), ...} of I_h, the sum Σ_{k≤n} ⟨h, ψ_{i(k)}⟩ψ_{i(k)} converges in norm to h as n → ∞.
(ii) ⟨g, h⟩ = Σ_{i∈I} ⟨g, ψᵢ⟩⟨h, ψᵢ⟩ (Parseval's identity) for all g, h ∈ H. In particular, ‖h‖² = Σ_{i∈I} ⟨h, ψᵢ⟩².
Proof. For each finite subset J of I, the vector h − h_J is orthogonal to each ψᵢ with i ∈ J, where h_J := Σ_{i∈J} ⟨h, ψᵢ⟩ψᵢ, because ⟨h − h_J, ψᵢ⟩ = ⟨h, ψᵢ⟩ − ⟨h, ψᵢ⟩ = 0 for each i in J. The orthogonality implies
‖h‖² = ‖h − h_J‖² + Σ_{i∈J} ⟨h, ψᵢ⟩².
For each ε > 0, the set I_h(ε) := {i ∈ I : |⟨h, ψᵢ⟩| > ε} has cardinality N no greater than ‖h‖²/ε², because ‖h‖² ≥ Σᵢ {i ∈ I_h(ε)}⟨h, ψᵢ⟩² ≥ ε²N. It follows that the set I_h = ∪_k I_h(1/k) is countable, as asserted.
For a fixed h let {i(1), i(2), ...} be an enumeration of I_h. Write J(n) for the initial segment {i(1), ..., i(n)}. Denseness of H₀ means that to each ε > 0 there is some finite linear combination h̃ = Σ_{i∈J} αᵢψᵢ such that ‖h − h̃‖ < ε. By definition of I_h, the vector h is orthogonal to ψᵢ for each i in I\I_h. It follows that the vector h − Σ{i ∈ J ∩ I_h}αᵢψᵢ is orthogonal to {ψᵢ : i ∈ J\I_h}, whence
we reduce the distance between h and h̃ if we discard from J those i not in I_h. Without loss of generality we may therefore assume that J is a subset of I_h.
When n is large enough that J ⊆ J(n), the vector h − h_{J(n)} is orthogonal to the subspace H_{J(n)}, which contains both h_{J(n)} and h̃. For such n we have
ε² > ‖h − h_{J(n)} + h_{J(n)} − h̃‖²  by definition of h̃
= ‖h − h_{J(n)}‖² + ‖h_{J(n)} − h̃‖²  by orthogonality
≥ ‖h − h_{J(n)}‖².
That is, for each ε > 0 we have ‖h − h_{J(n)}‖ < ε for all n large enough, which is precisely what it means to say that Σ_{k=1}^∞ ⟨h, ψ_{i(k)}⟩ψ_{i(k)} converges (in norm) to h. The representation ‖h‖² = Σ_k ⟨h, ψ_{i(k)}⟩² then follows via continuity (Problem [1]) of the map g ↦ ‖g‖².
For Parseval's identity, let {j(1), j(2), ...} be an enumeration of I_g ∪ I_h. Then, from the special case just established, both ‖g + h‖² and ‖g − h‖² have series expansions, and the series for 4⟨g, h⟩ = ‖g + h‖² − ‖g − h‖² is obtained by subtraction. □

<5>
Example. (Haar basis) Let m denote Lebesgue measure on the Borel sigma-field of (0,1]. For k = 0, 1, 2, ..., partition (0,1] into subintervals J_{i,k} := (i2^{−k}, (i+1)2^{−k}], for i = 0, 1, ..., 2^k − 1. Define functions H_{i,k} := J_{2i,k+1} − J_{2i+1,k+1} and ψ_{i,k} := 2^{k/2}H_{i,k}, for 0 ≤ i < 2^k and k = 0, 1, 2, ...,
and so on. Take finite sums to deduce that H̄₀ contains the indicator of every dyadic interval (i2^{−k}, j2^{−k}]. The class E of such intervals is stable under finite intersections, and it generates the Borel sigma-field. The class D of all Borel sets whose indicators belong to the closure H̄₀ is a λ-system (easy proof), which contains E. Thus D contains σ(E), the Borel sigma-field on (0,1].
If H̄₀ were not equal to the whole of L²(m), there would exist a square-integrable function f orthogonal to H̄₀. In particular, f would be orthogonal to both {f > 0} and {f < 0}, which would force mf⁺ = mf⁻ = 0. That is, f = 0 a.e. [m].
The component of a function h in the subspace spanned by 1 is just the constant function mh. Each function h in L²(m) has a series expansion,
h = mh + Σ_{k} Σ_{0≤i<2^k} ⟨h, ψ_{i,k}⟩ψ_{i,k},
the partial sums converging in norm.
4.
Let {ξᵢ : i ∈ N} be an orthonormal sequence of random variables in L²(P), and let {ψᵢ} be an orthonormal basis for a separable Hilbert space H. For each h in H, the partial sums Σ_{i≤n} ⟨h, ψᵢ⟩ξᵢ converge in L²(P) as n → ∞.
Write X(h) for the limit Σ_{i=1}^∞ ⟨h, ψᵢ⟩ξᵢ, which is defined up to an almost sure equivalence as a square integrable random variable.
By Problem [1] and the Parseval identity, for g and h in H we have P(X(g)X(h)) = ⟨g, h⟩. In particular, PX(h)² = ‖h‖². The map h ↦ X(h) is a linear isometry between H and a linear subspace of L²(P): the map preserves inner products and distances.
The most important example of a series expansion arises when all the ξᵢ have independent, standard normal distributions. The family of random variables {X(h) : h ∈ H} is then called the isonormal process indexed by H. The particular case known as Brownian motion is discussed in Chapter 9.
5. Problems
[1]
Suppose gₙ → g and hₙ → h, as elements of a Hilbert space. Show that ⟨gₙ, hₙ⟩ → ⟨g, h⟩. Hint: Use Cauchy-Schwarz to bound terms like ⟨gₙ − g, hₙ⟩.
[2]
Let F be a finite subset of a Hilbert space H. Show that the subspace generated by F is a closed subset of H. Hint: Without loss of generality assume the elements {f₁, ..., fₖ} of F are linearly independent. For each i, find a vector ψᵢ such that ⟨fᵢ, ψᵢ⟩ = 1 but ⟨fⱼ, ψᵢ⟩ = 0 for j ≠ i. If hₙ = Σᵢ αᵢ(n)fᵢ converges to some h, deduce that {αᵢ(n)} converges for each i.
[3]
Let λ be a finite measure on B[0, 1]. Let H denote the collection of λ-equivalence classes {[h] : h is continuous}. Show that H is not a closed subspace of L²(λ) if λ equals Lebesgue measure. Could it be a closed subspace for some other choice of λ?
[4]
[5]
Use Zorn's Lemma to prove that every Hilbert space has at least one orthonormal basis. Hint: Order orthonormal families by inclusion. If Ψ is maximal for this ordering, show that there can be no nonzero element orthogonal to every member of Ψ.
[6]
6. Notes
Halmos (1957, Chapter I) is an excellent source for basic facts about Hilbert space.
See Dudley (1973) and Dudley (1989, page 378) for the isonormal process.
REFERENCES
Appendix C
Convexity
SECTION 1 defines convex sets and functions.
SECTION 2 shows that convex functions defined on subintervals of the real line have left- and right-hand derivatives everywhere.
SECTION 3 shows that convex functions on the real line can be recovered as integrals of their one-sided derivatives.
SECTION 4 shows that convex subsets of Euclidean spaces have nonempty relative interiors.
SECTION 5 derives various facts about separation of convex sets by linear functions.
1.
2. One-sided derivatives
Let f be a convex function, defined and real-valued at least on an interval J of the real line.
Consider any three points x₁ < x₂ < x₃, all in J. (For the moment, ignore the point x₀ shown in the picture.) Write α for (x₂ − x₁)/(x₃ − x₁), so that x₂ = αx₃ + (1 − α)x₁. By convexity, y₂ := αf(x₃) + (1 − α)f(x₁) ≥ f(x₂). Write S(xᵢ, xⱼ) for (f(xⱼ) − f(xᵢ))/(xⱼ − xᵢ), the slope of the chord joining the points (xᵢ, f(xᵢ)) and (xⱼ, f(xⱼ)). Then
S(x₁, x₂) = (f(x₂) − f(x₁))/(x₂ − x₁) ≤ (y₂ − f(x₁))/(x₂ − x₁) = S(x₁, x₃) ≤ S(x₂, x₃).
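The chain of slope inequalities is easy to test numerically for a specific convex function; exp and the sampled triples below are arbitrary choices.

```python
import math, random

f = math.exp                                  # any convex function
S = lambda x, y: (f(y) - f(x)) / (y - x)      # chord slope

random.seed(0)
for _ in range(1000):
    x1, x2, x3 = sorted(random.uniform(-3, 3) for _ in range(3))
    if x2 - x1 < 1e-9 or x3 - x2 < 1e-9:
        continue
    assert S(x1, x2) <= S(x1, x3) + 1e-12 <= S(x2, x3) + 2e-12
print("slope monotonicity holds on all sampled triples")
```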
The slopes therefore decrease as the right endpoint decreases, so that D₊(x₂) ≥ D₊(x₁) and S(x₁, x₃) decreases as x₃ ↓ x₁.
(i) The right-hand derivative D₊(x) exists:
(f(y) − f(x))/(y − x) ↓ D₊(x)  as y ↓ x,
for each x in [a, b). The function D₊(x) is increasing and right-continuous on [a, b). It is finite for a < x < b, but D₊(a) might possibly equal −∞.
(ii) The left-hand derivative D₋(x) exists:
(f(x) − f(z))/(x − z) ↑ D₋(x)  as z ↑ x,
for each x in (a, b]. The function D₋(x) is increasing and left-continuous on (a, b]. It is finite for a < x < b, but D₋(b) might possibly equal +∞.
(iii) For a < x < y < b,
D₊(x) ≤ (f(y) − f(x))/(y − x) ≤ D₋(y).
(iv) D₋(x) ≤ D₊(x) for each x in (a, b), and
f(w) ≥ f(x) + c(w − x)  for all w
whenever D₋(x) ≤ c ≤ D₊(x): for w > x use (f(w) − f(x))/(w − x) ≥ D₊(x) ≥ c; for w < x use (f(x) − f(w))/(x − w) ≤ D₋(x) ≤ c. □
<2>
Proof. Consider first the case n = 1. Suppose w ∈ C with w > x₀. The right-hand derivative D₊(x₀) = lim_{y↓x₀} (f(y) − f(x₀))/(y − x₀) must be nonnegative, because f(y) ≥ f(x₀) for y near x₀. Assertion (iv) of the Theorem then gives
f(w) ≥ f(x₀) + (w − x₀)D₊(x₀) ≥ f(x₀).
For example, the convex function
f(x) := −√x  for x > 0,  f(0) := 0,
has D₊(0) = −∞. If D₊(a) > −∞ then f is continuous at a, with f(a+) = f(a).

3. Integral representations
Convex functions on the real line are expressible as integrals of one-sided derivatives.
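For a concrete kinked case, take f(x) = |x| + x², whose right-hand derivative is D₊(t) = 2t − 1 for t < 0 and 2t + 1 for t ≥ 0. A midpoint Riemann sum (step size and endpoints are arbitrary choices) recovers f from D₊:

```python
def f(x):
    return abs(x) + x * x      # convex, with a kink at 0

def D_plus(t):                 # its right-hand derivative
    return 2 * t + (1 if t >= 0 else -1)

def recover(a, x, n=100000):   # f(a) + Int_a^x D_plus(t) dt, midpoint rule
    h = (x - a) / n
    return f(a) + sum(D_plus(a + (k + 0.5) * h) for k in range(n)) * h

for a, x in [(-1.0, 1.0), (-2.0, 0.5), (0.25, 3.0)]:
    print(a, x, recover(a, x), f(x))
```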
<4>  Theorem. For a ≤ x ≤ b,
f(x) = f(a) + ∫_a^x D₊(t) dt = f(a) + ∫_a^x D₋(t) dt.
For x_α := αx₁ + (1 − α)x₀, the representation gives
αf(x₁) + (1 − α)f(x₀) − f(x_α) = ∫ ((1 − α){t < x₀} + α{t < x₁} − {t < x_α}) D(t) dt
= ∫ (α{x_α < t < x₁} − (1 − α){x₀ < t < x_α}) D(t) dt,
which is nonnegative because D is increasing. For twice-differentiable f, two applications of the representation give
f(x) − f(0) − xf′(0) = x² ∫₀¹ ∫₀^s f″(xt) dt ds = x² ∫₀¹ (1 − t) f″(xt) dt,
from which it follows that (f(x) − f(0) − xf′(0))/x² is increasing whenever f″ is increasing.
For example, the choice f(x) := eˣ − 1 − x, with increasing second derivative f″(x) = eˣ, shows that the function
ψ(x) := 2(eˣ − 1 − x)/x²  for x ≠ 0,  ψ(0) := 1,
is nonnegative and increasing over the whole real line. The choice f(x) := (1 + x)log(1 + x) − x, for x > −1, with f′(x) = log(1 + x) and f″(x) = (1 + x)^{−1}, leads to the conclusion that the function
ψ̃(x) := 2((1 + x)log(1 + x) − x)/x²  for x ≠ 0,  ψ̃(0) := 1,
is nonnegative, convex, and decreasing. Also xψ̃(x) is increasing on R⁺, and ψ̃(x) ≥ (1 + x/3)^{−1}.

4.
<6>  Theorem. Let C be a convex subset of Rⁿ.
(i) There exists a smallest subspace V for which C ⊆ x₀ ⊕ V := {x₀ + x : x ∈ V}, for each x₀ ∈ C.
(ii) dim(V) = n if and only if C has a nonempty interior.
(iii) If int(C) ≠ ∅, there exists a convex, nonnegative function p defined on Rⁿ for which int(C) = {x : p(x) < 1} ⊆ C ⊆ {x : p(x) ≤ 1}, the last set being the closure of C.
Proof. With no loss of generality, suppose 0 ∈ C. Let x₁, ..., xₖ be a maximal set of linearly independent vectors from C, and let V be the subspace spanned by those vectors. Clearly C ⊆ V. If k < n, there exists a unit vector w orthogonal to V, and every point x of V is a limit of points x + tw not in V. Thus C has an empty interior.
If k = n, write x̄ for Σᵢ xᵢ/n. Each member of the usual orthonormal basis has a representation as a linear combination, eᵢ = Σⱼ aᵢⱼxⱼ. Choose an ε > 0 for which 2nε(Σᵢ aᵢⱼ²)^{1/2} ≤ 1 for every j. For every y := Σᵢ yᵢeᵢ in Rⁿ with |y| < ε, the coefficients βⱼ := (2n)^{−1} + Σᵢ aᵢⱼyᵢ are positive, summing to a quantity 1 − β₀ < 1, and x̄/2 + y = β₀·0 + Σⱼ βⱼxⱼ ∈ C. Thus x̄/2 is an interior point of C.
If int(C) ≠ ∅ we may, with no loss of generality, suppose 0 is an interior point. Define a map p : Rⁿ → R⁺ by p(z) := inf{t > 0 : z/t ∈ C}. It is easy to see that p(0) = 0, and p(αy) = αp(y) for α > 0. Convexity of C implies that p(z₁ + z₂) ≤ p(z₁) + p(z₂) for all zᵢ: if zᵢ/tᵢ ∈ C then
(z₁ + z₂)/(t₁ + t₂) = (t₁/(t₁ + t₂))(z₁/t₁) + (t₂/(t₁ + t₂))(z₂/t₂) ∈ C.
In particular, p is a convex function. Also p satisfies a Lipschitz condition: if y = Σᵢ yᵢeᵢ and z = Σᵢ zᵢeᵢ then
p(y − z) = p(Σᵢ (yᵢ − zᵢ)eᵢ) ≤ Σᵢ ((yᵢ − zᵢ)⁺p(eᵢ) + (yᵢ − zᵢ)⁻p(−eᵢ)) ≤ |y − z| Σᵢ (p(eᵢ) + p(−eᵢ)).
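The gauge p(z) := inf{t > 0 : z/t ∈ C} can be computed by bisection from a membership oracle alone, which makes its claimed properties easy to test; the box C := [−1, 1] × [−2, 2] is an arbitrary example, for which p(z) = max(|z₁|, |z₂|/2).

```python
def in_C(z):                   # membership oracle for C = [-1,1] x [-2,2]
    return abs(z[0]) <= 1 and abs(z[1]) <= 2

def p(z, hi=1e6, iters=100):
    # p(z) = inf{t > 0 : z/t in C}, by bisection (C convex, 0 interior).
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid > 0 and in_C((z[0] / mid, z[1] / mid)):
            hi = mid
        else:
            lo = mid
    return hi

assert abs(p((3.0, 1.0)) - 3.0) < 1e-6       # = max(|3|, |1|/2)
y, z = (0.3, -1.7), (-2.2, 0.9)              # sublinearity check
assert p((y[0] + z[0], y[1] + z[1])) <= p(y) + p(z) + 1e-6
print(p((3.0, 1.0)))
```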
5.
<7>
Proof. For all x₀, x₁ in V₀ and all s, t > 0, the bound T₀ ≤ f on V₀ and convexity of f give
T₀((sx₀ + tx₁)/(s + t)) ≤ f((s(x₀ + ty₁) + t(x₁ − sy₁))/(s + t)) ≤ (s f(x₀ + ty₁) + t f(x₁ − sy₁))/(s + t),
because the y₁ terms cancel from the convex combination. Rearranging,
<8>  ∞ > inf_{x₀∈V₀, t>0} (f(x₀ + ty₁) − T₀(x₀))/t ≥ sup_{x₁∈V₀, s>0} (T₀(x₁) − f(x₁ − sy₁))/s > −∞.
The infimum over x₀ and t > 0 on the left-hand side must be greater than or equal to the supremum over x₁ and s > 0 on the right-hand side, and both bounds must be finite. Existence of the desired real c follows.
REMARK. The vector space V need not be finite dimensional. We can order extensions of T₀, bounded above by f, by defining (T_α, V_α) ≥ (T_β, V_β) to mean that V_β is a subspace of V_α, and T_α is an extension of T_β. Zorn's lemma gives a maximal element of the set of extensions (T_γ, V_γ) ≥ (T₀, V₀). Lemma <7> shows that V_γ must equal the whole of V, otherwise there would be a further extension. That is, T₀ has an extension to a linear functional T defined on V with T(x) ≤ f(x) for every x in V. This result is a minor variation on the Hahn-Banach theorem from functional analysis (compare with page 62 of Dunford & Schwartz 1958).
Theorem. Let C be a convex subset of R^k and y₀ a point of R^k not in the relative interior of C.
(i) There exists a nonzero linear functional T on R^k for which T(y₀) ≥ sup_{x∈C} T(x).
(ii) If y₀ is not in the closure of C, then we may choose T so that T(y₀) > sup_{x∈C} T(x).
Proof. With no loss of generality, suppose 0 ∈ C. Let V denote the subspace spanned by C, as in Theorem <6>. If y₀ ∉ V, let ℓ be its component orthogonal to V. Then y₀·ℓ > 0 = x·ℓ for all x in C.
If y₀ ∈ V, the problem reduces to construction of a suitable linear functional T on V: we then have only to define T(z) := 0 for z orthogonal to V to complete the proof. Equivalently, we may suppose that V = Rⁿ. Define T₀ on V₀ := {ry₀ : r ∈ R} by T₀(ry₀) := rp(y₀), for the p defined in Theorem <6>. Note that T₀(y₀) = p(y₀) ≥ 1, because y₀ ∉ relint(C) = {p < 1}. Clearly T₀(x) ≤ p(x) for all x ∈ V₀. Invoke Lemma <7> repeatedly to extend T₀ to a linear functional T on Rⁿ, with T(x) ≤ p(x) for all x ∈ Rⁿ. In particular,
T(y₀) ≥ 1 ≥ p(x) ≥ T(x)  for all x in C.
<9>
<10>
<11>
Proof. Define C as the convex set {x₁ − x₂ : xᵢ ∈ Cᵢ}. The origin does not belong to C. Thus there is a nonzero linear functional T for which 0 = T(0) ≥ T(x₁ − x₂) for all xᵢ ∈ Cᵢ.
Corollary. For each closed convex subset F of Rⁿ there exists a countable family of closed halfspaces {Hᵢ : i ∈ N} for which F = ∩_{i∈N} Hᵢ.
Proof. Let {xᵢ : i ∈ N} be a countable dense subset of F^c. Define rᵢ as the distance from xᵢ to F, which is strictly positive for every i, because F^c is open. The open ball B(xᵢ, rᵢ) with radius rᵢ and center xᵢ is convex and disjoint from F. From the previous Corollary, there exist a unit vector ℓᵢ and a constant kᵢ for which ℓᵢ·y > kᵢ > ℓᵢ·x for all y ∈ B(xᵢ, rᵢ) and all x ∈ F. Define Hᵢ := {x ∈ Rⁿ : ℓᵢ·x ≤ kᵢ}.
Each x in F^c is the center of some open ball B(x, 3ε) disjoint from F. There is an xᵢ with |x − xᵢ| < ε. We then have rᵢ ≥ 2ε, because B(x, 3ε) ⊇ B(xᵢ, 2ε), and hence x − εℓᵢ ∈ B(xᵢ, rᵢ). The separation inequality ℓᵢ·(x − εℓᵢ) > kᵢ then implies ℓᵢ·x > kᵢ, that is, x ∉ Hᵢ.
Corollary. Let f be a convex (real-valued) function defined on a convex subset C of Rⁿ, such that epi(f) is a closed subset of R^{n+1}. Then there exist {dᵢ : i ∈ N} ⊆ Rⁿ and {cᵢ : i ∈ N} ⊆ R such that f(x) = sup_{i∈N}(cᵢ + dᵢ·x) for every x in C.
Proof. From the previous Corollary, and the definition of epi(f), there exist ℓᵢ ∈ Rⁿ and constants αᵢ, kᵢ ∈ R such that
∞ > t ≥ f(x) if and only if kᵢ ≥ ℓᵢ·x − tαᵢ  for all i ∈ N.
The ith inequality can hold for arbitrarily large t only if αᵢ ≥ 0. Define ψ(x) := sup_{αᵢ>0} (ℓᵢ·x − kᵢ)/αᵢ. Clearly f(x) ≥ ψ(x) for x ∈ C. If s < f(x) for an x in C then there must exist an i for which ℓᵢ·x − f(x)αᵢ ≤ kᵢ < ℓᵢ·x − sαᵢ, thereby forcing αᵢ > 0 and s < ψ(x).
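For f(x) = x² on R the affine minorants are explicit: the tangent at t is cₜ + dₜx with dₜ = 2t and cₜ = −t². A finite family of tangents (the grid is an arbitrary choice) already reproduces f from below to within the mesh:

```python
# f(x) = x^2 as a supremum of affine minorants c_i + d_i * x,
# with tangents taken at grid points t: d = 2t, c = -t^2.
tangents = [(-t * t, 2 * t) for t in [i / 10 for i in range(-30, 31)]]

def f(x):
    return x * x

def sup_affine(x):
    return max(c + d * x for c, d in tangents)

for x in [i / 7 for i in range(-20, 21)]:
    assert sup_affine(x) <= f(x) + 1e-12             # every line lies below f
    assert f(x) - sup_affine(x) <= 0.0025 + 1e-12    # gap <= (mesh/2)^2
print(sup_affine(1.5), f(1.5))
```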
6. Problems
[1]
[2]