
A User's Guide to Measure Theoretic Probability

This book grew from a need to teach a rigorous probability course to a mixed
audience - statisticians, mathematically inclined biostatisticians, mathematicians,
economists, and students of finance - at the advanced undergraduate/introductory
graduate level, without the luxury of a course in measure theory as a prerequisite.
The core of the book covers the basic topics of independence, conditioning, martingales, convergence in distribution, and Fourier transforms. In addition, there are
numerous sections treating topics traditionally thought of as more advanced, such as
coupling and the KMT strong approximation, option pricing via the equivalent martingale measure, and Fernique's inequality for Gaussian processes.
In a further break with tradition, the necessary measure theory is developed via
the identification of integrals with linear functionals on spaces of measurable functions, allowing quicker access to the full power of the measure theoretic methods.
The book is not just a presentation of mathematically rigorous theory; it is also
a discussion of why some of that theory takes its current form and how anyone could
have thought of those clever ideas in the first place. It is intended as a secure starting point for anyone who needs to invoke rigorous probabilistic arguments and to
understand what they mean.
David Pollard is Professor of Statistics and Mathematics at Yale University in New
Haven, Connecticut. His interests center on probability, measure theory, theoretical
and applied statistics, and econometrics. He believes strongly that research and
teaching (at all levels) should be intertwined. His book, Convergence of Stochastic
Processes (Springer-Verlag, 1984), successfully introduced many researchers and
graduate students to empirical process theory.

CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS

Editorial Board
R. Gill (Department of Mathematics, Utrecht University)
B. D. Ripley (Department of Statistics, University of Oxford)
S. Ross (Department of Industrial Engineering, University of California, Berkeley)
M. Stein (Department of Statistics, University of Chicago)
D. Williams (School of Mathematical Sciences, University of Bath)

This series of high quality upper-division textbooks and expository monographs covers
all aspects of stochastic applicable mathematics. The topics range from pure and applied
statistics to probability theory, operations research, optimization and mathematical programming. The books contain clear presentations of new developments in the field and
also of the state of the art in classical methods. While emphasizing rigorous treatment of
theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.
Already Published
1. Bootstrap Methods and Their Application, by A. C. Davison and D. V. Hinkley
2. Markov Chains, by J. Norris
3. Asymptotic Statistics, by A. W. van der Vaart
4. Wavelet Methods for Time Series Analysis, by Donald B. Percival and
Andrew T. Walden
5. Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers,
by Thomas Leonard and John S. J. Hsu
6. Empirical Processes in M-Estimation, by Sara van de Geer
7. Numerical Methods of Statistics, by John F. Monahan

A User's Guide to
Measure Theoretic
Probability

DAVID POLLARD
Yale University

CAMBRIDGE

UNIVERSITY PRESS

CAMBRIDGE UNIVERSITY PRESS

Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,


São Paulo, Delhi, Dubai, Tokyo, Mexico City
Cambridge University Press
32 Avenue of the Americas, New York, NY 10013-2473, USA
www.cambridge.org
Information on this title: www.cambridge.org/9780521002899
© David Pollard 2002

This publication is in copyright. Subject to statutory exception


and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2002
7th printing 2010

A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication Data

Pollard, David, 1950–
A user's guide to measure theoretic probability / David Pollard.
p. cm. (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references and index.
ISBN 0-521-80242-3 ISBN 0-521-00289-3 (pbk.)
1. Probabilities. 2. Measure theory. I. Title.
II. Cambridge series in statistical and probabilistic mathematics.
QA273 .P7735 2001
519.2 dc21
2001035270
ISBN 978-0-521-80242-0 Hardback
ISBN 978-0-521-00289-9 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for
external or third-party Internet Web sites referred to in this publication and does not guarantee that
any content on such Web sites is, or will remain, accurate or appropriate.

Contents
PREFACE xi

CHAPTER 1: MOTIVATION
1 Why bother with measure theory? 1
2 The cost and benefit of rigor 3
3 Where to start: probabilities or expectations? 5
4 The de Finetti notation 7
*5 Fair prices 11
6 Problems 13
7 Notes 14

CHAPTER 2: A MODICUM OF MEASURE THEORY
1 Measures and sigma-fields 17
2 Measurable functions 22
3 Integrals 26
*4 Construction of integrals from measures 29
5 Limit theorems 31
6 Negligible sets 33
*7 Lp spaces 36
*8 Uniform integrability 37
9 Image measures and distributions 39
10 Generating classes of sets 41
*11 Generating classes of functions 43
12 Problems 45
13 Notes 51

CHAPTER 3: DENSITIES AND DERIVATIVES
1 Densities and absolute continuity 53
*2 The Lebesgue decomposition 58
3 Distances and affinities between measures 59
4 The classical concept of absolute continuity 65
*5 Vitali covering lemma 68
*6 Densities as almost sure derivatives 70
7 Problems 71
8 Notes 75

CHAPTER 4: PRODUCT SPACES AND INDEPENDENCE
1 Independence 77
2 Independence of sigma-fields 80
3 Construction of measures on a product space 83
4 Product measures 88
*5 Beyond sigma-finiteness 93
6 SLLN via blocking 95
*7 SLLN for identically distributed summands 97
*8 Infinite product spaces 99
9 Problems 102
10 Notes 108

CHAPTER 5: CONDITIONING
1 Conditional distributions: the elementary case 111
2 Conditional distributions: the general case 113
3 Integration and disintegration 116
4 Conditional densities 118
*5 Invariance 121
6 Kolmogorov's abstract conditional expectation 123
*7 Sufficiency 128
8 Problems 131
9 Notes 135

CHAPTER 6: MARTINGALE ET AL.
1 What are they? 138
2 Stopping times 142
3 Convergence of positive supermartingales 147
4 Convergence of submartingales 151
*5 Proof of the Krickeberg decomposition 152
*6 Uniform integrability 153
*7 Reversed martingales 155
*8 Symmetry and exchangeability 159
9 Problems 162
10 Notes 166

CHAPTER 7: CONVERGENCE IN DISTRIBUTION
1 Definition and consequences 169
2 Lindeberg's method for the central limit theorem 176
3 Multivariate limit theorems 181
4 Stochastic order symbols 182
*5 Weakly convergent subsequences 184
6 Problems 186
7 Notes 190

CHAPTER 8: FOURIER TRANSFORMS
1 Definitions and basic properties 193
2 Inversion formula 195
3 A mystery? 198
4 Convergence in distribution 198
*5 A martingale central limit theorem 200
6 Multivariate Fourier transforms 202
*7 Cramér-Wold without Fourier transforms 203
*8 The Lévy-Cramér theorem 205
9 Problems 206
10 Notes 208

CHAPTER 9: BROWNIAN MOTION
1 Prerequisites 211
2 Brownian motion and Wiener measure 213
3 Existence of Brownian motion 215
*4 Finer properties of sample paths 217
5 Strong Markov property 219
*6 Martingale characterizations of Brownian motion 222
*7 Functionals of Brownian motion 226
*8 Option pricing 228
9 Problems 230
10 Notes 234

CHAPTER 10: REPRESENTATIONS AND COUPLINGS
1 What is coupling? 237
2 Almost sure representations 239
*3 Strassen's Theorem 242
*4 The Yurinskii coupling 244
5 Quantile coupling of Binomial with normal 248
6 Haar coupling - the Hungarian construction 249
7 The Komlós-Major-Tusnády coupling 252
8 Problems 256
9 Notes 258

CHAPTER 11: EXPONENTIAL TAILS AND THE LAW OF THE ITERATED LOGARITHM
1 LIL for normal summands 261
2 LIL for bounded summands 264
*3 Kolmogorov's exponential lower bound 266
*4 Identically distributed summands 268
5 Problems 271
6 Notes 272

CHAPTER 12: MULTIVARIATE NORMAL DISTRIBUTIONS
1 Introduction 274
*2 Fernique's inequality 275
*3 Proof of Fernique's inequality 276
4 Gaussian isoperimetric inequality 278
*5 Proof of the isoperimetric inequality 280
6 Problems 285
7 Notes 287

APPENDIX A: MEASURES AND INTEGRALS
1 Measures and inner measure 289
2 Tightness 291
3 Countable additivity 292
4 Extension to the ∩c-closure 294
5 Lebesgue measure 295
6 Integral representations 296
7 Problems 300
8 Notes 300

APPENDIX B: HILBERT SPACES
1 Definitions 301
2 Orthogonal projections 302
3 Orthonormal bases 303
4 Series expansions of random processes 305
5 Problems 306
6 Notes 306

APPENDIX C: CONVEXITY
1 Convex sets and functions 307
2 One-sided derivatives 308
3 Integral representations 310
4 Relative interior of a convex set 312
5 Separation of convex sets by linear functionals 313
6 Problems 315
7 Notes 316

APPENDIX D: BINOMIAL AND NORMAL DISTRIBUTIONS
1 Tails of the normal distributions 317
2 Quantile coupling of Binomial with normal 320
3 Proof of the approximation theorem 324
4 Notes 328

APPENDIX E: MARTINGALES IN CONTINUOUS TIME
1 Filtrations, sample paths, and stopping times 329
2 Preservation of martingale properties at stopping times 332
3 Supermartingales from their rational skeletons 334
4 The Brownian filtration 336
5 Problems 338
6 Notes 338

APPENDIX F: DISINTEGRATION OF MEASURES
1 Representation of measures on product spaces 339
2 Disintegrations with respect to a measurable map 342
3 Problems 343
4 Notes 345

INDEX 347

Preface
This book began life as a set of handwritten notes, distributed to students in my
one-semester graduate course on probability theory, a course that had humble aims:
to help the students understand results such as the strong law of large numbers, the
central limit theorem, conditioning, and some martingale theory. Along the way
they could expect to learn a little measure theory and maybe even a smattering of
functional analysis, but not as much as they would learn from a course on Measure
Theory or Functional Analysis.
In recent years the audience has consisted mainly of graduate students in
statistics and economics, most of whom have not studied measure theory. Most of
them have no intention of studying measure theory systematically, or of becoming
professional probabilists, but they do want to learn some rigorous probability
theory, in one semester.
Faced with the reality of an audience that might have neither the time nor
the inclination to devote itself completely to my favorite subject, I sought to
compress the essentials into a course as self-contained as I could make it. I tried
to pack into the first few weeks of the semester a crash course in measure theory,
with supplementary exercises and a whirlwind exposition (Appendix A) for the
enthusiasts. I tried to eliminate duplication of mathematical effort if it served no
useful role. After many years of chopping and compressing, the material that I most
wanted to cover all fit into a one-semester course, divided into 25 lectures, each
lasting from 60 to 75 minutes. My handwritten notes filled fewer than a hundred
pages.
I had every intention of making my little stack of notes into a little book. But I
couldn't resist expanding a bit here and a bit there, adding useful reference material,
spelling out ideas that I had struggled with on first acquaintance, slipping in extra
topics that my students have seemed to need when writing dissertations, and pulling
in material from other courses I have taught and neat tricks I have learned from my
friends. And soon it wasn't so little any more.
Many of the additions ended up in starred Sections, which contain harder
material or topics that can be skipped over without loss of continuity.
My treatment includes a few eccentricities that might upset some of my
professional colleagues. My most obvious departure from tradition is in the use
of linear functional notation for expectations, an approach I first encountered in
books by de Finetti. I attempt to explain the virtues of this notation in the first
two Chapters. Another slight novelty, at least for anyone already exposed to the
Kolmogorov interpretation of conditional expectations, appears in my treatment
of conditioning, in Chapter 5. For many years I have worried about the wide gap
between the free-wheeling conditioning calculations of an elementary probability
course and the formal manipulations demanded by rigor. I claim that a treatment
starting from the idea of conditional distributions offers one way of bridging the gap,


at least for many of the statistical applications of conditioning that have troubled me
the most.
The twelve Chapters and six Appendixes contain general explanations, remarks,
opinions, and blocks of more formal material. Theorems and Lemmas contain the
most important mathematical details. Examples contain gentler, or less formal,
explanations and illustrations. Supporting theoretical material is presented either
in the form of Exercises, with terse solutions, or as Problems (at the ends of the
Chapters) that work step-by-step through material that missed the cutoff as Exercises,
Lemmas, or Theorems. Some Problems are routine, to give students an opportunity
to digest the ideas in the text without great mental effort; some Problems are hard.
A possible one-semester course
Here is a list of the material that I usually try to cover in the one-semester
graduate course.
Chapter 1: Spend one lecture on why measure theory is worth the effort, using
a few of the Examples as illustrations. Introduce de Finetti notation, identifying
sets with their indicator functions and writing P for both probabilities of sets and
expectations of random variables. Mention, very briefly, the fair price Section as an
alternative to the frequency interpretation.
Chapter 2: Cover the unstarred Sections carefully, but omitting many details
from the Examples. Postpone Section 7 until Chapter 3. Postpone Section 8
until Chapter 6. Describe briefly the generating class theorem for functions, from
Section 11, without proofs.
Chapter 3: Cover Section 1, explaining the connection with the elementary notion
of a density. Take a short excursion into Hilbert space (explaining the projection
theorem as an extension of the result for Euclidean spaces) before presenting the
simple version of Radon-Nikodym. Mention briefly the classical concept of absolute
continuity, but give no details. Maybe say something about total variation.
Chapter 4: Cover Sections 1 and 2, leaving details of some arguments to the
students. Give a reminder about generating classes of functions. Describe the
construction of μ ⊗ Λ, only for a finite kernel Λ, via the iterated integral. Cover
product measures, using some of the Examples from Section 4. Explain the need for
the blocking idea from Section 6, using the Maximal Inequality to preview the idea
of a stopping time. Mention the truncation idea behind the version of the SLLN for
independent, identically distributed random variables with finite first moments, but
skip most of the proof.
Chapter 5: Discuss Section 1 carefully. Cover the high points of Sections 2
through 4. (They could be skipped without too much loss of continuity, but I prefer
not to move straight into Kolmogorov conditioning.) Cover Section 6.


Chapter 6: Cover Sections 1 through 4, but skipping over some Examples.


Characterize uniformly integrable martingales, using Section 6 and some of the
material postponed from Section 8 of Chapter 2, unless short of time.
Chapter 7: Cover the first four Sections, skipping some of the examples of central
limit theorems near the end of Section 2. Downplay multivariate results.
Chapter 8: Cover Sections 1, 2, 4, and 6.
If time is left over, cover a topic from the remaining Chapters.
Acknowledgments

I am particularly grateful to Richard Gill, who is a model of the constructive critic.


His comments repeatedly exposed weaknesses and errors in the manuscript. My
colleagues Joe Chang and Marten Wegkamp asked helpful questions while using
earlier drafts to teach graduate probability courses. Andries Lenstra provided some
important historical references.
Many cohorts of students worked through the notes, revealing points of
obscurity and confusion. In particular, Jeankyung Kim, Gheorghe Doros, Daniela
Cojocaru, and Peter Radchenko read carefully through several chapters and worked
through numerous Problems. Their comments led to a lot of rewriting.
Finally, I thank Lauren Cowles for years of good advice, and for her inexhaustible patience with an author who could never stop tinkering.
David Pollard
New Haven
February 2001

Chapter 1

Motivation
SECTION 1 offers some reasons for why anyone who uses probability should know about
the measure theoretic approach.
SECTION 2 describes some of the added complications, and some of the compensating
benefits that come with the rigorous treatment of probabilities as measures.
SECTION 3 argues that there are advantages in approaching the study of probability theory
via expectations, interpreted as linear functionals, as the basic concept.
SECTION 4 describes the de Finetti convention of identifying a set with its indicator
function, and of using the same symbol for a probability measure and its corresponding
expectation.
SECTION *5 presents a fair-price interpretation of probability, which emphasizes the
linearity properties of expectations. The interpretation is sometimes a useful guide to
intuition.

1. Why bother with measure theory?


Following the appearance of the little book by Kolmogorov (1933), which set forth
a measure theoretic foundation for probability theory, it has been widely accepted
that probabilities should be studied as special sorts of measures. (More or less
true; see the Notes to the Chapter.) Anyone who wants to understand modern
probability theory will have to learn something about measures and integrals, but it
takes surprisingly little to get started.
For a rigorous treatment of probability, the measure theoretic approach is a vast
improvement over the arguments usually presented in undergraduate courses. Let
me remind you of some difficulties with the typical introduction to probability.
Independence
There are various elementary definitions of independence for random variables. For
example, one can require factorization of distribution functions,
$$P\{X \le x, Y \le y\} = P\{X \le x\}\, P\{Y \le y\} \qquad \text{for all real } x, y.$$

The problem with this definition is that one needs to be able to calculate distribution
functions, which can make it impossible to establish rigorously some desirable

properties of independence. For example, suppose $X_1, \ldots, X_4$ are independent random variables. How would you show that
$$Y = X_1 X_2 \qquad\text{is independent of}\qquad Z = \sin\left(X_3 + X_4 + X_3 X_4 + X_4^2 + \sqrt{X_3^2 + X_4^2}\,\right),$$
by means of distribution functions? Somehow you would need to express events $\{Y \le y, Z \le z\}$ in terms of the events $\{X_i \le x_i\}$, which is not an easy task. (If you
did figure out how to do it, I could easily make up more taxing examples.)
You might also try to define independence via factorization of joint density
functions, but I could invent further examples to make your life miserable, such as
problems where the joint distributions of the random variables are not even given
by densities. And if you could grind out the joint densities, probably by means of
horrible calculations with Jacobians, you might end up with the mistaken impression
that independence had something to do with the smoothness of the transformations.
The difficulty disappears in a measure theoretic treatment, as you will see in
Chapter 4. Facts about independence correspond to facts about product measures.
Discrete versus continuous
Most introductory texts offer proofs of the Tchebychev inequality,
$$P\{|X - \mu| \ge \epsilon\} \le \mathrm{var}(X)/\epsilon^2,$$
where $\mu$ denotes the expected value of X. Many texts even offer two proofs, one for
the discrete case and another for the continuous case. Indeed, introductory courses
tend to split into at least two segments. First one establishes all manner of results
for discrete random variables and then one reproves almost the same results for
random variables with densities.
Unnecessary distinctions between discrete and continuous distributions disappear
in a measure theoretic treatment, as you will see in Chapter 3.
Univariate versus multivariate

The unnecessary repetition does not stop with the discrete/continuous dichotomy.
After one masters formulae for functions of a single random variable, the whole
process starts over for several random variables. The univariate definitions acquire a
prefix joint, leading to a whole host of new exercises in multivariate calculus: joint
densities, Jacobians, multiple integrals, joint moment generating functions, and so
on.
Again the distinctions largely disappear in a measure theoretic treatment.
Distributions are just image measures; joint distributions are just image measures for
maps into product spaces; the same definitions and theorems apply in both cases.
One saves a huge amount of unnecessary repetition by recognizing the role of image


measures (described in Chapter 2) and recognizing joint distributions as measures


on product spaces (described in Chapter 4).
Approximation of distributions

Roughly speaking, the central limit theorem asserts:


If $\xi_1, \ldots, \xi_n$ are independent random variables with zero expected values and
variances summing to one, and if none of the $\xi_i$ makes too large a contribution
to their sum, then $\xi_1 + \ldots + \xi_n$ is approximately N(0, 1) distributed.

What exactly does that mean? How can something with a discrete distribution,
such as a standardized Binomial, be approximated by a smooth normal distribution?
The traditional answer (which is sometimes presented explicitly in introductory
texts) involves pointwise convergence of distribution functions of random variables;
but the central limit theorem is seldom established (even in introductory texts) by
checking convergence of distribution functions. Instead, when proofs are given, they
typically involve checking of pointwise convergence for some sort of generating
function. The proof of the equivalence between convergence in distribution and
pointwise convergence of generating functions is usually omitted. The treatment of
convergence in distribution for random vectors is even murkier.
As you will see in Chapter 7, it is far cleaner to start from a definition involving
convergence of expectations of "smooth functions" of the random variables, an
approach that covers convergence in distribution for random variables, random
vectors, and even random elements of metric spaces, all within a single framework.
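To see what such a definition buys, here is a small Python sketch (my illustration, not from the text). For the smooth bounded function f(x) = cos x, the expectation of f under a standardized Binomial approaches its value under the standard normal, which is E cos(Z) = e^{-1/2}.

```python
# Sketch (not from the text): convergence in distribution, read through
# convergence of expectations of a smooth function.  S_n is a standardized
# Binomial(n, 1/2); for Z ~ N(0,1), E cos(Z) = exp(-1/2).
from math import comb, cos, sqrt, exp

def E_cos_standardized_binomial(n, p=0.5):
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * cos((k - mu) / sd)
               for k in range(n + 1))

target = exp(-0.5)  # E cos(Z) for standard normal Z
for n in (4, 16, 64, 256):
    print(n, abs(E_cos_standardized_binomial(n) - target))
# the discrepancy shrinks as n grows, even though each S_n is discrete
```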
***
In the long run the measure theoretic approach will save you much work and
help you avoid wasted effort with unnecessary distinctions.

2. The cost and benefit of rigor


In traditional terminology, probabilities are numbers in the range [0,1] attached to
events, that is, to subsets of a sample space Ω. They satisfy the rules
(i) $P\emptyset = 0$ and $P\Omega = 1$;
(ii) for disjoint events $A_1, A_2, \ldots$, the probability of their union, $P(\cup_i A_i)$, is equal
to $\sum_i P A_i$, the sum of the probabilities of the individual events.
When teaching introductory courses, I find that it pays to be a little vague
about the meaning of the dots in (ii), explaining only that it lets us calculate the
probability of an event by breaking it into disjoint pieces whose probabilities are
summed. Probabilities add up in the same way as lengths, areas, volumes, and
masses. The fact that we sometimes need a countable infinity of pieces (as in
calculations involving potentially infinite sequences of coin tosses, for example) is
best passed off as an obvious extension of the method for an arbitrarily large, finite
number of pieces.
In fact the extension is not at all obvious, mathematically speaking. As
explained by Hawkins (1979), the possibility of having the additivity property (ii)


hold for countable collections of disjoint events, a property known officially as


countable additivity, is one of the great discoveries of modern mathematics. In his
1902 doctoral dissertation, Henri Lebesgue invented a method for defining lengths
of complicated subsets of the real line, in a countably additive way. The definition
has the subtle feature that not every subset has a length. Indeed, under the usual
axioms of set theory, it is impossible to extend the concept of length to all subsets
of the real line while preserving countable additivity.
The same subtlety carries over to probability theory. In general, the collection
of events to which countably additive probabilities are assigned cannot include all
subsets of the sample space. The domain of the set function P (the probability
measure) is usually just a sigma-field, a collection of subsets of Ω with properties
that will be defined in Chapter 2.
Many probabilistic ideas are greatly simplified by reformulation as properties
of sigma-fields. For example, the unhelpful multitude of possible definitions for
independence coalesce nicely into a single concept of independence for sigma-fields.
The sigma-field limitation turns out to be less of a disadvantage than might be
feared. In fact, it has positive advantages when we wish to prove some probabilistic
fact about all events in some sigma-field $\mathcal{A}$. The obvious line of attack (first find an
explicit representation for the typical member of $\mathcal{A}$, then check the desired property
directly) usually fails. Instead, as you will see in Chapter 2, an indirect approach
often succeeds.
(a) Show directly that the desired property holds for all events in some subclass $\mathcal{E}$
of "simpler sets" from $\mathcal{A}$.
(b) Show that $\mathcal{A}$ is the smallest sigma-field for which $\mathcal{A} \supseteq \mathcal{E}$.
(c) Show that the desired property is preserved under various set theoretic
operations. For example, it might be possible to show that if two events have
the property then so does their union.
(d) Deduce from (c) that the collection $\mathcal{B}$ of all events with the property forms
a sigma-field of subsets of Ω. That is, $\mathcal{B}$ is a sigma-field, which, by (a), has
the property $\mathcal{B} \supseteq \mathcal{E}$.
(e) Conclude from (b) and (d) that $\mathcal{B} \supseteq \mathcal{A}$. That is, the property holds for all
members of $\mathcal{A}$.
REMARK.
Don't worry about the details for the moment. I include the outline
in this Chapter just to give the flavor of a typical measure theoretic proof. I have
found that some students have trouble adapting to this style of argument.

The indirect argument might seem complicated, but, with the help of a few key
theorems, it actually becomes routine. In the literature, it is not unusual to see
applications abbreviated to a remark like "a simple generating class argument shows
. . . , " with the reader left to fill in the routine details.
Lebesgue applied his definition of length (now known as Lebesgue measure)
to the construction of an integral, extending and improving on the Riemann
integral. Subsequent generalizations of Lebesgue's concept of measure (as in
the 1913 paper of Radon and other developments described in the Epilogue to


Hawkins 1979) eventually opened the way for Kolmogorov to identify probabilities
with measures on sigma-fields of events on general sample spaces. From the Preface
to Kolmogorov (1933), in the 1950 translation by Morrison:
The purpose of this monograph is to give an axiomatic foundation for the
theory of probability. The author set himself the task of putting in their natural
place, among the general notions of modern mathematics, the basic concepts of
probability theoryconcepts which until recently were considered to be quite
peculiar.
This task would have been a rather hopeless one before the introduction
of Lebesgue's theories of measure and integration. However, after Lebesgue's
publication of his investigations, the analogies between measure of a set and
probability of an event, and between integral of a function and mathematical
expectation of a random variable, became apparent. These analogies allowed of
further extensions; thus, for example, various properties of independent random
variables were seen to be in complete analogy with the corresponding properties
of orthogonal functions. But if probability theory was to be based on the above
analogies, it still was necessary to make the theories of measure and integration
independent of the geometric elements which were in the foreground with
Lebesgue. This has been done by Fréchet.
While a conception of probability theory based on the above general
viewpoints has been current for some time among certain mathematicians, there
was lacking a complete exposition of the whole system, free of extraneous
complications. (Cf., however, the book by Fréchet ...)

Kolmogorov identified random variables with a class of real-valued functions


(the measurable functions) possessing properties allowing them to coexist comfortably with the sigma-field. Thereby he was also able to identify the expectation
operation as a special case of integration with respect to a measure. For the newly
restricted class of random variables, in addition to the traditional properties
(i) $E(c_1 X_1 + c_2 X_2) = c_1 E(X_1) + c_2 E(X_2)$, for constants $c_1$ and $c_2$,
(ii) $E(X) \ge E(Y)$ if $X \ge Y$,
he could benefit from further properties implied by the countable additivity of the
probability measure.
As with the sigma-field requirement for events, the measurability restriction on
the random variables came with benefits. In modern terminology, no longer was E
just an increasing linear functional on the space of real random variables (with
some restrictions to avoid problems with infinities), but also it had acquired some
continuity properties, making possible a rigorous treatment of limiting operations in
probability theory.

3. Where to start: probabilities or expectations?


From the example set by Lebesgue and Kolmogorov, it would seem natural to start
with probabilities of events, then extend, via the operation of integration, to the study
of expectations of random variables. Indeed, in many parts of the mathematical
world that is the way it goes: probabilities are the basic quantities, from which
expectations of random variables are derived by various approximation arguments.


The apparently natural approach is by no means the only possibility, as anyone


brought up on the works of the fictitious French author Bourbaki could affirm.
(The treatment of measure theory, culminating with Bourbaki 1969, started from
integrals defined as linear functionals on appropriate spaces of functions.) Moreover,
historically speaking, expectation has a strong claim to being the preferred starting
point for a theory of probability. For instance, in his discussion of the 1657 book
Calculating in Games of Chance by Christian Huygens, Hacking (1978, page 97)
commented:
The fair prices worked out by Huygens are just what we would call the
expectations of the corresponding gambles. His approach made expectation a
more basic concept than probability, and this remained so for about a century.

The fair price interpretation is sketched in Section 5.


The measure theoretic history of integrals as linear functionals also extends
back to the early years of the twentieth century, starting with Daniell (1918), who
developed a general theory of integration via extension of linear functionals from
small spaces of functions to larger spaces. It is also significant that, in one of the
greatest triumphs of measure theory, Wiener (1923, Section 10) defined what is now
known as Wiener measure (thereby providing a rigorous basis for the mathematical
theory of Brownian motion) as an averaging operation for functionals defined on
Brownian motion paths, citing Daniell (1919) for the basic extension theorem.
There are even better reasons than historical precedent for working with expectations as the basic concept. Whittle (1992), in the Preface to an elegant, intermediate
level treatment of Probability via Expectation, presented some arguments:
(i) To begin with, people probably have a better intuition for what is meant by
an 'average value' than for what is meant by a 'probability.'
(ii) Certain important topics, such as optimization and approximation problems,
can be introduced and treated very quickly, just because they are phrased in
terms of expectations.
(iii) Most elementary treatments are bedeviled by the apparent need to ring
the changes of a particular proof or discussion for all the special cases of
continuous or discrete distribution, scalar or vector variables, etc. In the
expectations approach these are indeed seen as special cases, which can be
treated with uniformity and economy.
His list continued. I would add that:
(a) It is often easier to work with the linearity properties of integrals than with
the additivity properties of measures. For example, many useful probability
inequalities are but thinly disguised consequences of pointwise inequalities,
translated into probability form by the linearity and increasing properties of
expectations.
(b) The linear functional approach, via expectations, can save needless repetition
of arguments. Some theorems about probability measures, as set functions,
are just special cases of more general results about expectations.


(c) When constructing new probability measures, we save work by defining


the integral of measurable functions directly, rather than passing through
the preliminary step of building the set function then establishing theorems
about the corresponding integrals. As you will see repeatedly, definitions and
theorems sometimes collapse into a single operation when expressed directly
in terms of expectations, or integrals.
***

I will explain the essentials of measure theory in Chapter 2, starting from the
traditional set-function approach but working as quickly as I can towards systematic
use of expectations.

4. The de Finetti notation


The advantages of treating expectation as the basic concept are accentuated by
the use of an elegant notation strongly advocated by de Finetti (1972, 1974).
Knowing that many traditionally trained probabilists and statisticians find the
notation shocking, I will introduce it slowly, in an effort to explain why it is worth
at least a consideration. (Immediate enthusiastic acceptance is more than I could
hope for.)
Ordinary algebra is easier than Boolean algebra. The correspondence $A \leftrightarrow I_A$
between subsets A of a fixed set X and their indicator functions,
$$I_A(x) := \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A, \end{cases}$$
transforms Boolean algebra into ordinary pointwise algebra with functions. I claim
that probability theory becomes easier if one works systematically with expectations
of indicator functions, $E I_A$, rather than with the corresponding probabilities of
events.
Let me start with the assertions about algebra and Boolean algebra. The
operations of union and intersection correspond to pointwise maxima (denoted by
max or the symbol ∨) and pointwise minima (denoted by min or the symbol ∧), or
pointwise products:
$$I_{\cup_i A_i}(x) = \bigvee_i I_{A_i}(x) \qquad\text{and}\qquad I_{\cap_i A_i}(x) = \bigwedge_i I_{A_i}(x) = \prod_i I_{A_i}(x).$$
Complements correspond to subtraction from one: $I_{A^c}(x) = 1 - I_A(x)$. Derived
operations, such as the set theoretic difference $A \backslash B := A \cap B^c$ and the symmetric
difference, $A \triangle B := (A \backslash B) \cup (B \backslash A)$, also have simple algebraic counterparts:
$$I_{A \backslash B}(x) = \left(I_A(x) - I_B(x)\right)^+ := \max\left(0,\, I_A(x) - I_B(x)\right), \qquad I_{A \triangle B}(x) = |I_A(x) - I_B(x)|.$$
To check these identities, just note that the functions take only the values 0 and 1,
then determine which combinations of indicator values give a 1. For example,
$|I_A(x) - I_B(x)|$ takes the value 1 when exactly one of $I_A(x)$ and $I_B(x)$ equals 1.
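As a quick computational check (my addition, not the book's), the following Python sketch verifies these correspondences pointwise, for an arbitrarily chosen pair of subsets of a small finite set:

```python
# Sketch (not from the text): Boolean set operations become pointwise
# arithmetic on indicator functions.
ground = set(range(10))          # an arbitrary finite "universe"
A = {1, 2, 3, 4}                 # arbitrary example sets
B = {3, 4, 5, 6}

def I(S):
    """Indicator function of the set S."""
    return lambda x: 1 if x in S else 0

for x in ground:
    assert max(I(A)(x), I(B)(x)) == I(A | B)(x)            # union = max
    assert min(I(A)(x), I(B)(x)) == I(A & B)(x)            # intersection = min
    assert I(A)(x) * I(B)(x) == I(A & B)(x)                # ... = product
    assert 1 - I(A)(x) == I(ground - A)(x)                 # complement
    assert max(0, I(A)(x) - I(B)(x)) == I(A - B)(x)        # difference
    assert abs(I(A)(x) - I(B)(x)) == I(A ^ B)(x)           # symmetric diff.
print("all indicator identities hold pointwise")
```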


The algebra looks a little cleaner if we omit the argument x. For example, the
horrendous set theoretic relationship
$$\left(\cap_{i=1}^{k} A_i\right) \triangle \left(\cap_{i=1}^{k} B_i\right) \subseteq \cup_{i=1}^{k} \left(A_i \triangle B_i\right)$$
corresponds to the pointwise inequality
$$\Big|\prod\nolimits_{i=1}^{k} I_{A_i} - \prod\nolimits_{i=1}^{k} I_{B_i}\Big| \le \bigvee\nolimits_{i=1}^{k} \big|I_{A_i} - I_{B_i}\big|,$$
whose verification is easy: when the right-hand side takes the value 1 the inequality
is trivial, because the left-hand side can take only the values 0 or 1; and when the
right-hand side takes the value 0, we have $I_{A_i} = I_{B_i}$ for all i, which makes the
left-hand side zero.
<1>

Example. One could establish an identity such as
$$(A \triangle B) \triangle (C \triangle D) = A \triangle \left(B \triangle (C \triangle D)\right)$$
by expanding both sides into a union of many terms. It is easier to note the pattern
for indicator functions. The set $A \triangle B$ is the region where $I_A + I_B$ takes an odd value
(that is, the value 1); and $(A \triangle B) \triangle C$ is the region where $(I_A + I_B) + I_C$ takes an odd
value. And so on. In fact both sides of the set theoretic identity equal the region
where $I_A + I_B + I_C + I_D$ takes an odd value. Associativity of set theoretic differences
is a consequence of associativity of pointwise addition.
<2>

Example. The lim sup of a sequence of sets $\{A_n : n \in \mathbb{N}\}$ is defined as
$$\limsup_n A_n := \bigcap_{n=1}^{\infty} \bigcup_{i \ge n} A_i.$$
That is, the lim sup consists of those x for which, to each n there exists an $i \ge n$
such that $x \in A_i$. Equivalently, it consists of those x for which $x \in A_i$ for infinitely
many i. In other words,
$$I_{\limsup_n A_n} = \limsup_n I_{A_n}.$$
Do you really need to learn the new concept of the lim sup of a sequence
of sets? Theorems that work for lim sups of sequences of functions automatically
carry over to theorems about sets. There is no need to prove everything twice. The
correspondence between sets and their indicators saves us from unnecessary work.
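The following Python sketch (my illustration, with an arbitrary periodic sequence of sets) computes the lim sup both ways, through the set definition and through the indicator sequences:

```python
# Sketch (not from the text): lim sup of sets computed via indicators.
def A(n):
    # arbitrary periodic example: A_n cycles through {0,9}, {1,9}, {2,9}
    return {n % 3, 9}

HORIZON = 99          # long enough for the periodic pattern to settle
universe = range(10)

# definition: intersection over n of the unions of A_i for i >= n
# (truncate n so every tail window spans at least one full period)
limsup = set(universe)
for n in range(HORIZON - 3):
    limsup &= set().union(*(A(i) for i in range(n, HORIZON)))

# indicator version: x belongs iff I_{A_n}(x) = 1 infinitely often,
# i.e. (for a periodic sequence) somewhere within the final period
via_indicators = {x for x in universe
                  if any(x in A(i) for i in range(HORIZON - 3, HORIZON))}

print(limsup == via_indicators, limsup)   # True {0, 1, 2, 9}
```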
After some repetition, it becomes tiresome to have to keep writing the I for
the indicator function. It would be much easier to write something like $\tilde{A}$ in place
of $I_A$. The indicator of the lim sup of a sequence of sets would then be written
$\limsup_n \tilde{A}_n$, with only the tilde to remind us that we are referring to functions. But
why do we need reminding? As the example showed, the concept for the lim sup
of sets is really just a special case of the concept for sequences of functions. Why
preserve a distinction that hardly matters?
There is a well established tradition in Mathematics for choosing notation that
eliminates inessential distinctions. For example, we use the same symbol 3 for the
natural number and the real number, writing 3 + 6 = 9 as an assertion both about
addition of natural numbers and about addition of real numbers.

[Margin figure: the number 3 as a natural number and as a real number, linked by the map ψ.]

It does not matter if we cannot tell immediately which
interpretation is intended, because we know there is a one-to-one
correspondence between natural numbers and a subset of the real
numbers, which preserves all the properties of interest. Formally,
there is a map $\psi : \mathbb{N} \to \mathbb{R}$ for which
$$\psi(x +_{\text{natural}} y) = \psi(x) +_{\text{real}} \psi(y) \qquad \text{for all } x, y \text{ in } \mathbb{N},$$
with analogous equalities for other operations. (Notice that I even
took care to distinguish between addition as a function from $\mathbb{N} \times \mathbb{N}$
to $\mathbb{N}$ and as a function from $\mathbb{R} \times \mathbb{R}$ to $\mathbb{R}$.) The map ψ is an
isomorphism between $\mathbb{N}$ and a subset of $\mathbb{R}$.
REMARK.
Of course there are some situations where we need to distinguish
between a natural number and its real counterpart. For example, it would be highly
confusing to use indistinguishable symbols when first developing the properties of the
real number system from the properties of the natural numbers. Also, some computer
languages get very upset when a function that expects a floating point argument is
fed an integer variable; some languages even insist on an explicit conversion between
types.

We are faced with a similar overabundance of notation in the correspondence


between sets and their indicator functions. Formally, and traditionally, we have a
map $A \to I_A$ from sets into a subset of the nonnegative real functions. The map
preserves the important operations. It is firmly in the Mathematical tradition that
we should follow de Finetti's suggestion and use the same symbol for a set and its
indicator function.
REMARK.
A very similar convention has been advocated by the renowned
computer scientist, Donald Knuth, in an expository article (Knuth 1992). He attributed
the idea to Kenneth Iverson, the inventor of the programming language APL.

In de Finetti's notation the assertion from Example <2> becomes


$$\limsup_n A_n = \limsup_n A_n,$$
a fact that is quite easy to remember. The theorem about lim sups of sequences
of sets has become incorporated into the notation; we have one less theorem to
remember.
The second piece of de Finetti notation is suggested by the same logic that
encourages us to replace $+_{\text{natural}}$ and $+_{\text{real}}$ by the single addition symbol: use the
same symbol when extending the domain of definition of a function. For example,
the symbol "sin" denotes both the function defined on the real line and its extension
to the complex domain. More generally, if we have a function g with domain $G_0$,
which can be identified with a subset $\tilde{G}_0$ of some $\tilde{G}$ via a correspondence $x \leftrightarrow \tilde{x}$,
and if $\tilde{g}$ is a function on $\tilde{G}$ for which $\tilde{g}(\tilde{x}) = g(x)$ for x in $G_0$, then why not write g
instead of $\tilde{g}$ for the function with the larger domain?
With probability theory we often use P to denote a probability measure, as a
map from a class $\mathcal{A}$ (a sigma-field) of subsets of some Ω into the subinterval [0,1]
of the real line. The correspondence $A \leftrightarrow \tilde{A} := I_A$, between a set A and its indicator
function $\tilde{A}$, establishes a correspondence between $\mathcal{A}$ and a subset of the collection of
random variables on Ω. The expectation maps random variables into real numbers,
in such a way that $E(\tilde{A}) = P(A)$. This line of thinking leads us to de Finetti's
second suggestion: use the same symbol for expectation and probability measure,
writing PX instead of EX, and so on.
The de Finetti notation has an immediate advantage when we deal with several
probability measures, P, Q, ... simultaneously. Instead of having to invent new
symbols $E_P$, $E_Q$, ..., we reuse P for the expectation corresponding to P, and so on.
REMARK.
You might have the concern that you will not be able to tell whether
PA refers to the probability of an event or the expected value of the corresponding
indicator function. The ambiguity should not matter. Both interpretations give the
same number; you will never be faced with a choice between two different values
when choosing an interpretation. If this ambivalence worries you, I would suggest
going systematically with the expectation/indicator function interpretation. It will
never lead you astray.

<3>

Example. For a finite collection of events $A_1, \ldots, A_n$, the so-called method of
inclusion and exclusion asserts that the probability of the union $\cup_{i \le n} A_i$ equals
$$\sum_i P A_i - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \ldots \pm P(A_1 \cap A_2 \cap \ldots \cap A_n).$$
The equality comes by taking expectations on both sides of an identity for (indicator)
functions,
$$\bigcup\nolimits_{i \le n} A_i = \sum_i A_i - \sum_{i<j} A_i A_j + \sum_{i<j<k} A_i A_j A_k - \ldots \pm A_1 A_2 \ldots A_n.$$
The right-hand side of this identity is just the expanded version of $1 - \prod_{i \le n} (1 - A_i)$.
The identity is equivalent to

<4>

$$\bigcup\nolimits_{i \le n} A_i = 1 - \prod\nolimits_{i \le n} (1 - A_i),$$

which presents two ways of expressing the indicator function of $\cup_{i \le n} A_i$. See
Problem [1] for a generalization.
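A quick numerical confirmation (my addition, not the book's) of the identity <4> and the resulting probability formula, for random events in a finite uniform space:

```python
# Sketch (not from the text): inclusion-exclusion checked with events
# treated as indicator functions on a finite sample space.
import itertools, random

random.seed(0)
omega = list(range(30))                               # finite sample space
events = [set(random.sample(omega, 10)) for _ in range(4)]
P = lambda S: len(S) / len(omega)                     # uniform probability

lhs = P(set().union(*events))                         # P of the union

rhs = 0.0                                             # alternating sum
for k in range(1, len(events) + 1):
    for J in itertools.combinations(events, k):
        inter = set(omega).intersection(*J)
        rhs += (-1) ** (k + 1) * P(inter)

print(abs(lhs - rhs) < 1e-12)                         # True
```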
Example. Consider Tchebychev's inequality, $P\{|X - \mu| \ge \epsilon\} \le \mathrm{var}(X)/\epsilon^2$, for
each $\epsilon > 0$ and each random variable X with expected value $\mu := PX$ and finite
variance, $\mathrm{var}(X) := P(X - \mu)^2$. On the left-hand side of the inequality we have
the probability of an event. Or is it the expectation of an indicator function?
Either interpretation is correct, but the second is more helpful. The inequality is
a consequence of the increasing property for expectations invoked for a pair of
functions, $\{|X - \mu| \ge \epsilon\} \le (X - \mu)^2/\epsilon^2$. The indicator function on the left-hand
side takes only the values 0 and 1. The quadratic function on the right-hand side is
nonnegative, and is $\ge 1$ whenever the left-hand side equals 1.
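The pointwise inequality can be checked mechanically; here is a brief Python sketch (my own illustration, with an arbitrarily chosen distribution) that verifies it sample by sample and then averages:

```python
# Sketch (not from the text): Tchebychev's inequality as the expectation
# of a pointwise inequality between an indicator and a quadratic.
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # arbitrary choice
mu = sum(xs) / len(xs)
eps = 1.5

indicator = [1.0 if abs(x - mu) >= eps else 0.0 for x in xs]
quadratic = [(x - mu) ** 2 / eps ** 2 for x in xs]

assert all(i <= q for i, q in zip(indicator, quadratic))  # pointwise
print(sum(indicator) / len(xs), "<=", sum(quadratic) / len(xs))
```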
***
For the remainder of the book, I will be using the same symbol for a set and
its indicator function, and writing P instead of E for expectation.
REMARK.
For me, the most compelling reason to adopt the de Finetti notation,
and work with P as a linear functional defined for random variables, was not that
I would save on symbols, nor any of the other good reasons listed at the end of


Section 3. Instead, I favor the notation because, once the initial shock of seeing old
symbols used in new ways wore off, it made probability theory easier. I can truly
claim to have gained better insight into classical techniques through the mere fact of
translating them into the new notation. I even find it easier to invent new arguments
when working with a notation that encourages thinking in terms of linearity, and
which does not overemphasize the special role for expectations of functions that take
only the values 0 and 1 by according them a different symbol.
The hope that I might convince probability users of some of the advantages
of de Finetti notation was, in fact, one of my motivations for originally deciding to
write yet another book about an old subject.

*5. Fair prices


For the understanding of this book the interpretation of probability as a model for
uncertainty is not essential. You could study it purely as a piece of mathematics,
divorced from any interpretation, but then you would forgo much of the intuition
that accompanies the various interpretations.
The most widely accepted view interprets probabilities and expectations as
long run averages, anticipating the formal laws of large numbers that make precise
a sense in which averages should settle down to expectations over a long sequence
of independent trials. As an aid to intuition I also like another interpretation, which
does not depend on a preliminary concept of independence, and which concentrates
attention on the linearity properties of expectations.
Consider a situation (a bet if you will) where you stand to receive an uncertain
return X. You could think of X as a random variable, a real-valued function on a
set Ω. For the moment forget about any probability measure on Ω. Suppose you
consider p(X) to be the fair price to pay now in order to receive X at some later
time. (By fair I mean that you should be prepared to take either side of the bet. In
particular, you should be prepared to accept a payment p(X) from me now in return
for giving me an amount X later.) What properties should p(·) have?
REMARK.
As noted in Section 3, the value p(X) corresponds to an expected
value of the random variable X. If you already know about the possibility of infinite
expectations, you will realize that I would have to impose some restrictions on the
class of random variables for which fair prices are defined, if I were seriously trying
to construct a rigorous system of axioms. It would suffice to restrict the argument to
bounded random variables.

Your net return will be the random quantity $X'(\omega) := X(\omega) - p(X)$. Call
the random variable X' a fair return, the net return from a fair trade. Unless you
start worrying about utilities, in which case you might consult Savage (1954) or
Ferguson (1967, Section 1.4), you should find the following properties reasonable.
(i) fair + fair = fair. That is, if you consider p(X) fair for X and p(Y) fair
for Y then you should be prepared to make both bets, paying p(X) + p(Y) to
receive X + Y.
(ii) constant x fair = fair. That is, you shouldn't object if I suggest you pay
2p(X) to receive 2X (actually, that particular example is a special case of (i))

12

Chapter 1:

Motivation

or 3.76/?(X) to receive 3.76X, or -p(X) to receive - X . The last example


corresponds to willingness to take either side of a fair bet. In general, to
receive cX you should pay cp(X), for constant c.
Properties (i) and (ii) imply that the collection of all fair returns is a vector space.
There is a third reasonable property that goes by several names: coherency or
nonexistence of a Dutch book, the no-arbitrage requirement, or the no-free-lunch
principle:
(iii) There is no fair return X′ for which X′(ω) ≥ 0 for all ω, with strict inequality
for at least one ω.
(Students of decision theory might be reminded of the concept of admissibility.)
If you were to declare such an X' to be fair I would be delighted to offer you the
opportunity to receive a net return of $10^{100} X'$. I couldn't lose.
<5>

Lemma. Properties (i), (ii), and (iii) imply that p(·) is an increasing linear
functional on random variables. The fair returns are those random variables for
which p(X) = 0.
Proof. For constants α and β, and random variables X and Y with fair prices p(X)
and p(Y), consider the combined effect of the following fair bets:
you pay me αp(X) to receive αX;
you pay me βp(Y) to receive βY;
I pay you p(αX + βY) to receive αX + βY.
Your net return is a constant,
$$c = p(\alpha X + \beta Y) - \alpha p(X) - \beta p(Y).$$
If c > 0 you violate (iii); if c < 0 take the other side of the bet to violate (iii). That
proves linearity.
To prove that p(·) is increasing, suppose X(ω) ≥ Y(ω) for all ω. If you claim
that p(X) < p(Y) then I would be happy for you to accept the bet that delivers
$$(Y - p(Y)) - (X - p(X)) = -(X - Y) - (p(Y) - p(X)),$$
which is always < 0, again contradicting (iii).
If both X and X - p(X) are considered fair, then the constant return p(X) =
X - (X - p(X)) is fair, which would contradict (iii) unless p(X) = 0. □
As a special case, consider the bet that returns 1 if an event F occurs, and 0
otherwise. If you identify the event F with the random variable taking the value 1
on F and 0 on $F^c$ (that is, the indicator of the event F), then it follows directly
from Lemma <5> that p(·) is additive: $p(F_1 \cup F_2) = p(F_1) + p(F_2)$ for disjoint
events $F_1$ and $F_2$. That is, p defines a finitely additive set-function on events. The
set function p(·) has most of the properties required of a probability measure. As
an exercise you might show that $p(\emptyset) = 0$ and $p(\Omega) = 1$.
Contingent bets
Things become much more interesting if you are prepared to make a bet to receive an
amount X but only when some event F occurs. That is, the bet is made contingent


on the occurrence of F. Typically, knowledge of the occurrence of F should change


the fair price, which we could denote by p(X | F). Expressed more compactly,
the bet that returns (X - p(X | F)) F is fair. The indicator function F ensures that
money changes hands only when F occurs.
<6>

Lemma. If Ω is partitioned into disjoint events $F_1, \ldots, F_k$, and X is a random
variable, then $p(X) = \sum_{i=1}^{k} p(F_i)\, p(X \mid F_i)$.

Proof. For a single $F_i$, argue by linearity that
$$0 = p\left(X F_i - p(X \mid F_i) F_i\right) = p(X F_i) - p(X \mid F_i)\, p(F_i).$$
Sum over i, using linearity again, together with the fact that $X = \sum_i X F_i$, to deduce
that $p(X) = \sum_i p(X F_i) = \sum_i p(F_i)\, p(X \mid F_i)$, as asserted. □
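As a toy check of the partition formula (my addition; the uniform prices on a six-point Ω below are an assumption made purely for illustration):

```python
# Sketch (not from the text): p(X) = sum_i p(F_i) p(X | F_i) for a finite
# partition, with fair prices given by a uniform weighting.
omega = range(6)                          # six equally likely outcomes
w = {o: 1 / 6 for o in omega}             # price of the point bets
X = {o: (o + 1) ** 2 for o in omega}      # an arbitrary payoff
partition = [{0, 1}, {2, 3}, {4, 5}]      # disjoint events covering omega

p = lambda F: sum(w[o] for o in F)                         # p(F)
p_given = lambda X, F: sum(w[o] * X[o] for o in F) / p(F)  # p(X | F)

direct = sum(w[o] * X[o] for o in omega)
via_partition = sum(p(F) * p_given(X, F) for F in partition)
print(abs(direct - via_partition) < 1e-12)  # True
```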
Why should we restrict the Lemma to finite partitions? If we allowed countable
partitions we would get the countable additivity property, the key requirement in
the theory of measures. I would be suspicious of such an extension of the simple
argument for finite partitions. It makes a tacit assumption that a combination of
countably many fair bets is again fair. If we accept that assumption, then why not
accept that arbitrary combinations of fair events are fair? For uncountably infinite
collections we would run into awkward contradictions. For example, suppose ω is
generated from a uniform distribution on [0,1]. Let $X_t$ be the random variable that
returns 1 if ω = t and 0 otherwise. By symmetry one might expect $p(X_t) = c$ for
some constant c that doesn't depend on t. But there can be no c for which
$$1 = p(1) = p\left(\sum_{0 \le t \le 1} X_t\right) = \sum_{0 \le t \le 1} p(X_t) = \begin{cases} \infty & \text{if } c > 0, \\ 0 & \text{if } c = 0. \end{cases}$$
Perhaps our intuition about the infinite rests on shaky analogies with the finite.
REMARK.
I do not insist that probabilities must be interpreted as fair prices, just
as I do not accept that all probabilities must be interpreted as assertions about long
run frequencies. It is convenient that both interpretations lead to almost the same
mathematical formalism. You are free to join either camp, or both, and still play by
the same probability rules.

6. Problems

[1]

Let $A_1, \ldots, A_N$ be events in a probability space $(\Omega, \mathcal{F}, P)$. For each subset J
of $\{1, 2, \ldots, N\}$ write $A_J$ for $\cap_{j \in J} A_j$. Define $S_k := \sum_{|J|=k} P A_J$, where $|J|$ denotes the number of indices in J. For $0 \le m \le N$ show that the probability
P{exactly m of the $A_i$'s occur} equals
$$\binom{m}{m} S_m - \binom{m+1}{m} S_{m+1} + \ldots \pm \binom{N}{m} S_N.$$
Hint: For a dummy variable z, show that $\prod_{i=1}^{N} \left((1 - A_i) + z A_i\right) = \sum_{k=0}^{N} \sum_{|J|=k} (z - 1)^k A_J$. Expand
the left-hand side, take expectations, then interpret the coefficient of $z^m$.
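For readers who like to test such formulas numerically, here is a Python sketch (mine, not the book's) that checks the "exactly m" identity on randomly chosen events:

```python
# Sketch (not from the text): check
#   P{exactly m of the A_i occur} = sum_{k>=m} (-1)^(k-m) C(k,m) S_k
# on random events in a finite uniform probability space.
import itertools, random
from math import comb

random.seed(2)
omega = list(range(40))
A = [set(random.sample(omega, random.randrange(5, 20))) for _ in range(5)]
N = len(A)
P = lambda S: len(S) / len(omega)

S = [0.0] * (N + 1)                       # S_k = sum over |J| = k of P(A_J)
for k in range(N + 1):
    for J in itertools.combinations(A, k):
        S[k] += P(set(omega).intersection(*J)) if J else 1.0

for m in range(N + 1):
    exact = P({o for o in omega if sum(o in a for a in A) == m})
    formula = sum((-1) ** (k - m) * comb(k, m) * S[k] for k in range(m, N + 1))
    assert abs(exact - formula) < 1e-12
print("identity verified for m = 0, ..., N")
```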

[2]

Rederive the assertion of Lemma <6> by consideration of the net return from the
following system of bets:
(i) for each i, pay $c_i p(F_i)$ in order to receive $c_i$ if $F_i$ occurs, where $c_i := p(X \mid F_i)$;
(ii) pay $-p(X)$ in order to receive $-X$;
(iii) for each i, make a bet contingent on $F_i$, paying $c_i$ (if $F_i$ occurs) to receive X.


[3]

For an increasing sequence of events $\{A_n : n \in \mathbb{N}\}$ with union A, show $P A_n \uparrow P A$.

7. Notes
See Dubins & Savage (1964) for an illustration of what is possible in a theory of
probability without countable additivity.
The ideas leading up to Lebesgue's creation of his integral are described in
fascinating detail in the excellent book of Hawkins (1979), which has been the
starting point for most of my forays into the history of measure theory. Lebesgue
first developed his new definition of the integral for his doctoral dissertation
(Lebesgue 1902), then presented parts of his theory in the 1902-1903 Peccot course
of lectures (Lebesgue 1904). The 1928 revision of the 1904 volume greatly expanded
the coverage, including a treatment of the more general (Lebesgue-)Stieltjes integral.
See also Lebesgue (1926), for a clear description of some of the ideas involved in
the development of measure theory, and the Note Historique of Bourbaki (1969), for
a discussion of later developments.
Of course it is a vast oversimplification to imagine that probability theory
abruptly became a specialized branch of measure theory in 1933. As Kolmogorov
himself made clear, the crucial idea was the measure theory of Lebesgue. Kolmogorov's little book was significant not just for "putting in their natural place,
among the general notions of modern mathematics, the basic concepts of probability
theory", but also for adding new ideas, such as probability distributions in infinite
dimensional spaces (reinventing results of Daniell 1919) and a general theory of
conditional probabilities and conditional expectations.
Measure theoretic ideas were used in probability theory well before 1933.
For example, in the Note at the end of Lévy (1925) there was a clear statement
of the countable additivity requirement for probabilities, but Lévy did not adopt
the complete measure theoretic formalism; and Khinchin & Kolmogorov (1925)
explicitly constructed their random variables as functions on [0,1], in order to avail
themselves of the properties of Lebesgue measure.
It is also not true that acceptance of the measure theoretic foundation was total
and immediate. For example, eight years after Kolmogorov's book appeared, von
Mises (1941, page 198) asserted (emphasis in the original):
In recapitulating this paragraph I may say: First, the axioms of Kolmogorov
are concerned with the distribution function within one kollektiv and are
supplementary to my theory, not a substitute for it. Second, using the notion of
measure zero in an absolute way without reference to the arbitrarily assumed
measure system, leads to essential inconsistencies.

See also the argument for the measure theoretic framework in the accompanying
paper by Doob (1941), and the comments by both authors that follow (von Mises &
Doob 1941).
For more about Kolmogorov's pivotal role in the history of modern probability,
see: Shiryaev (2000), and the other articles in the same collection; the memorial


articles in the Annals of Probability, volume 17 (1989); and von Plato (1994), which
also contains discussions of the work of von Mises and de Finetti.
REFERENCES

Bourbaki, N. (1969), Integration sur les espaces topologiques separes, Elements de


math6matique, Hermann, Paris. Fascicule XXXV, Livre VI, Chapitre IX.
Daniell, P. J. (1918), 'A general form of integral', Annals of Mathematics (series 2)
19, 279-294.
Daniell, P. J. (1919), 'Functions of limited variation in an infinite number of
dimensions', Annals of Mathematics (series 2) 21, 30-38.
de Finetti, B. (1972), Probability, Induction, and Statistics, Wiley, New York,
de Finetti, B. (1974), Theory of Probability, Wiley, New York. First of two volumes
translated from Teoria Delle probability, published 1970. The second volume
appeared under the same title in 1975.
Doob, J. L. (1941), 'Probability as measure', Annals of Mathematical Statistics
12, 206-214.
Dubins, L. & Savage, L. (1964), How to Gamble if You Must, McGraw-Hill.
Ferguson, T. S. (1967), Mathematical Statistics: A Decision Theoretic Approach,
Academic Press, Boston.
Fréchet, M. (1915), 'Sur l'intégrale d'une fonctionnelle étendue à un ensemble
abstrait', Bull. Soc. Math. France 43, 248-265.
Hacking, I. (1978), The Emergence of Probability, Cambridge University Press.
Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development,
second edn, Chelsea, New York.
Khinchin, A. Y. & Kolmogorov, A. (1925), 'Über Konvergenz von Reihen, deren
Glieder durch den Zufall bestimmt werden', Mat. Sbornik 32, 668-677.
Knuth, D. E. (1992), 'Two notes on notation', American Mathematical Monthly
99, 403-422.
Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin. Second English edition, Foundations of Probability, 1950,
published by Chelsea, New York.
Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to
Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7.
Included in the first volume of his Œuvres Scientifiques, published in 1972 by
L'Enseignement Mathématique.
Lebesgue, H. (1904), Leçons sur l'intégration et la recherche des fonctions primitives,
first edn, Gauthier-Villars, Paris. Included in the second volume of his Œuvres
Scientifiques, published in 1972 by L'Enseignement Mathématique. Second
edition published in 1928. Third edition, 'an unabridged reprint of the second
edition, with minor changes and corrections', published in 1973 by Chelsea,
New York.


Lebesgue, H. (1926), 'Sur le développement de la notion d'intégrale', Matematisk
Tidsskrift B. English version in the book Measure and Integral, edited and
translated by Kenneth O. May.
Lévy, P. (1925), Calcul des Probabilités, Gauthier-Villars, Paris.
Radon, J. (1913), 'Theorie und Anwendungen der absolut additiven Mengenfunktionen',
Sitzungsberichten der Kaiserlichen Akademie der Wissenschaften in Wien.
Mathematisch-naturwissenschaftliche Klasse 122, 1295-1438.
Savage, L. J. (1954), The Foundations of Statistics, Wiley, New York. Second edition,
Dover, New York, 1972.
Shiryaev, A. N. (2000), Andrei Nikolaevich Kolmogorov: a biographical sketch
of his life and creative paths, in 'Kolmogorov in Perspective', American
Mathematical Society/London Mathematical Society.
von Mises, R. (1941), 'On the foundations of probability and statistics', Annals of
Mathematical Statistics 12, 191-205.
von Mises, R. & Doob, J. L. (1941), 'Discussion of papers on probability theory',
Annals of Mathematical Statistics 12, 215-217.
von Plato, J. (1994), Creating Modern Probability: its Mathematics, Physics and
Philosophy in Historical Perspective, Cambridge University Press.
Whittle, P. (1992), Probability via Expectation, third edn, Springer-Verlag, New York.
First edition 1970, under the title "Probability".
Wiener, N. (1923), 'Differential-space', Journal of Mathematics and Physics 2,
131-174. Reprinted in Selected Papers of Norbert Wiener, MIT Press, 1964.

Chapter 2

A modicum of measure theory


SECTION 1 defines measures and sigma-fields.
SECTION 2 defines measurable functions.
SECTION 3 defines the integral with respect to a measure as a linear functional on a cone
of measurable functions. The definition sidesteps the details of the construction of
integrals from measures.
SECTION *4 constructs integrals of nonnegative measurable functions with respect to a
countably additive measure.
SECTION 5 establishes the Dominated Convergence theorem, the Swiss Army knife of
measure theoretic probability.
SECTION 6 collects together a number of simple facts related to sets of measure zero.
SECTION *7 presents a few facts about spaces of functions with integrable pth powers,
with emphasis on the case p = 2, which defines a Hilbert space.
SECTION 8 defines uniform integrability, a condition slightly weaker than domination.
Convergence in 𝓛¹ is characterized as convergence in probability plus uniform
integrability.
SECTION 9 defines the image measure, which includes the concept of the distribution of a
random variable as a special case.
SECTION 10 explains how generating class arguments, for classes of sets, make measure
theory easy.
SECTION *11 extends generating class arguments to classes of functions.

1. Measures and sigma-fields


As promised in Chapter 1, we begin with measures as set functions, then work
quickly towards the interpretation of integrals as linear functionals. Once we are
past the purely set-theoretic preliminaries, I will start using the de Finetti notation
(Section 1.4) in earnest, writing the same symbol for a set and its indicator function.
Our starting point is a measure space: a triple (X, A, μ), with X a set, A a class
of subsets of X, and μ a function that attaches a nonnegative number (possibly +∞)
to each set in A. The class A and the set function μ are required to have properties
that facilitate calculations involving limits along sequences.

<1> Definition. Call a class A a sigma-field of subsets of X if:
(i) the empty set ∅ and the whole space X both belong to A;
(ii) if A belongs to A then so does its complement Aᶜ;
(iii) if A₁, A₂, ... is a countable collection of sets in A then both the union ∪ᵢAᵢ
and the intersection ∩ᵢAᵢ are also in A.
Some of the requirements are redundant as stated. For example, once we
have ∅ ∈ A then (ii) implies X ∈ A. When we come to establish properties about
sigma-fields it will be convenient to have the list of defining properties pared down
to a minimum, to reduce the amount of mechanical checking. The theorems will
be as sparing as possible in the amount of work they require for establishing the
sigma-field properties, but for now redundancy does not hurt.
The collection A need not contain every subset of X, a fact forced upon us in
general if we want μ to have the properties of a countably additive measure.
<2> Definition. A function μ defined on the sigma-field A is called a (countably
additive, nonnegative) measure if:
(i) 0 ≤ μA ≤ ∞ for each A in A;
(ii) μ∅ = 0;
(iii) if A₁, A₂, ... is a countable collection of pairwise disjoint sets in A then
    μ(∪ᵢ Aᵢ) = Σᵢ μAᵢ.
A measure μ for which μX = 1 is called a probability measure, and the
corresponding (X, A, μ) is called a probability space. For this special case it is
traditional to use a symbol like P for the measure, a symbol like Ω for the set,
and a symbol like 𝓕 for the sigma-field. A triple (Ω, 𝓕, P) will always denote a
probability space.
Usually the qualifications "countably additive, nonnegative" are omitted, on the
grounds that these properties are the most commonly assumed; the most common
cases deserve the shortest names. Only when there is some doubt about whether
the measures are assumed to have all the properties of Definition <2> should the
qualifiers be attached. For example, one speaks of "finitely additive measures"
when an analog of property (iii) is assumed only for finite disjoint collections, or
"signed measures" when the value of μA is not necessarily nonnegative. When
finitely additive or signed measures are under discussion it makes sense to mention
explicitly when a particular measure is nonnegative or countably additive, but, in
general, you should go with the shorter name.
Where do measures come from? The most basic constructions start from a set
function μ defined on a small collection of subsets ℰ, such as the collection of all
subintervals of the real line. One checks that μ has properties consistent with the
requirements of Definition <2>. One seeks to extend the domain of definition while
preserving the countable additivity properties of the set function. As you saw in
Chapter 1, theorems guaranteeing existence of such extensions were the culmination
of a long sequence of refinements in the concept of integration (Hawkins 1979).
They represent one of the great achievements of modern mathematics, even though
those theorems now occupy only a handful of pages in most measure theory texts.

Finite additivity has several appealing interpretations (such as the fair prices
of Section 1.5) that have given it ready acceptance as an axiom for a model of
real-world uncertainty. Countable additivity is sometimes regarded with suspicion,
or justified as a matter of mathematical convenience. (However, see Problem [6] for
an equivalent form of countable additivity, which has some claim to intuitive appeal.)
It is difficult to develop a simple probability theory without countable additivity,
which gives one the licence (for only a small fee) to integrate series term-by-term,
differentiate under integrals, and interchange other limiting operations.
The classical constructions are significant for my exposition mostly because they
ensure existence of the measures needed to express the basic results of probability
theory. I will relegate the details to the Problems and to Appendix A. If you crave
a more systematic treatment you might consult one of the many excellent texts on
measure theory, such as Royden (1968).
The constructions do not (indeed cannot, in general) lead to countably
additive measures on the class of all subsets of a given X. Typically, they extend
a set function defined on a class ℰ of sets to a measure defined on the sigma-field
σ(ℰ) generated by ℰ, or to only slightly larger sigma-fields. By definition,
    σ(ℰ) := smallest sigma-field on X containing all sets from ℰ
          = {A ⊆ X : A ∈ 𝓕 for every sigma-field 𝓕 with ℰ ⊆ 𝓕}.
The representation given by the second line ensures existence of a smallest sigma-field
containing ℰ. The method of definition is analogous to many definitions of
"smallest ... containing a fixed class" in mathematics; think of generated subgroups
or linear subspaces spanned by a collection of vectors, for example. For the
definition to work one needs to check that sigma-fields have two properties:
(i) If {𝓕ᵢ : i ∈ J} is a nonempty collection of sigma-fields on X then ∩ᵢ∈J 𝓕ᵢ, the
collection of all the subsets of X that belong to every 𝓕ᵢ, is also a sigma-field.
(ii) For each ℰ there exists at least one sigma-field 𝓕 containing all the sets in ℰ.
You should check property (i) as an exercise. Property (ii) is trivial, because the
collection of all subsets of X is a sigma-field.
REMARK.
Proofs of existence of nonmeasurable sets typically depend on
some deep set-theoretic principle, such as the Axiom of Choice. Mathematicians
who can live with different rules for set theory can have bigger sigma-fields. See
Dudley (1989, Section 3.4) or Oxtoby (1971, Section 5) for details.

<3> Exercise. Suppose X consists of five points a, b, c, d, and e. Suppose ℰ consists
of two sets, E₁ = {a, b, c} and E₂ = {c, d, e}. Find the sigma-field generated by ℰ.
SOLUTION: For this simple example we can proceed by mechanical application of
the properties that a sigma-field σ(ℰ) must possess. In addition to the obvious ∅
and X, it must contain each of the sets
    F₁ := {a, b} = E₁ ∩ E₂ᶜ    and    F₂ := {c} = E₁ ∩ E₂,
    F₃ := {d, e} = E₁ᶜ ∩ E₂    and    F₄ := {a, b, d, e} = F₁ ∪ F₃.

Further experimentation creates no new members of σ(ℰ); the sigma-field consists
of the sets
    ∅,  F₁,  F₂,  F₃,  F₁ ∪ F₃,  F₁ ∪ F₂ = E₁,  F₂ ∪ F₃ = E₂,  X.
The sets F₁, F₂, F₃ are the atoms of the sigma-field; every member of σ(ℰ) is a
union of some collection (possibly empty) of the Fᵢ. The only measurable subsets
of Fᵢ are the empty set and Fᵢ itself. There are no measurable protons or neutrons
hiding inside these atoms.
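For readers who like to check such bookkeeping by machine, here is a small Python sketch (the helper sigma_field and its brute-force closure strategy are illustrative assumptions, not anything from the text); on a finite X it suffices to close under complements and pairwise unions, because intersections then come free by De Morgan's laws:

    from itertools import combinations

    def sigma_field(X, E):
        # Close E (plus the empty set and X) under complement and union;
        # on a finite X this closure is exactly the generated sigma-field.
        sets = {frozenset(), frozenset(X)} | {frozenset(A) for A in E}
        changed = True
        while changed:
            changed = False
            for A in list(sets):
                if frozenset(X) - A not in sets:
                    sets.add(frozenset(X) - A)
                    changed = True
            for A, B in combinations(list(sets), 2):
                if A | B not in sets:
                    sets.add(A | B)
                    changed = True
        return sets

    X = {'a', 'b', 'c', 'd', 'e'}
    E = [{'a', 'b', 'c'}, {'c', 'd', 'e'}]
    for S in sorted(sigma_field(X, E), key=lambda s: (len(s), sorted(s))):
        print(sorted(S))   # prints exactly the eight sets listed above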
An unsystematic construction might work for finite sets, but it cannot generate
all members of a sigma-field in general. Indeed, we cannot even hope to list all
the members of an infinite sigma-field. Instead we must find a less explicit way to
characterize its sets.

<4> Example. By definition, the Borel sigma-field on the real line, denoted by 𝔅(R),
is the sigma-field generated by the open subsets. We could also denote it by σ(𝒢),
where 𝒢 stands for the class of all open subsets of R. There are several other
generating classes for 𝔅(R). For example, as you will soon see, the class ℰ of all
intervals (−∞, t], with t ∈ R, is a generating class.
It might appear a hopeless task to prove that σ(ℰ) = 𝔅(R) if we cannot
explicitly list the members of both sigma-fields, but actually the proof is quite
routine. You should try to understand the style of argument because it is often used
in probability theory.
The equality of sigma-fields is established by two inclusions, σ(ℰ) ⊆ σ(𝒢) and
σ(𝒢) ⊆ σ(ℰ), both of which follow from more easily established results. First we
must prove that ℰ ⊆ σ(𝒢), showing that σ(𝒢) is one of the sigma-fields 𝓕 that enter
into the intersection defining σ(ℰ), and hence σ(ℰ) ⊆ σ(𝒢). The other inclusion
follows similarly if we show that 𝒢 ⊆ σ(ℰ).
Each interval (−∞, t] in ℰ has a representation ∩_{n=1}^∞ (−∞, t + n⁻¹), a countable
intersection of open sets. The sigma-field σ(𝒢) contains all open sets, and it is
stable under countable intersections. It therefore contains each (−∞, t]. That is,
ℰ ⊆ σ(𝒢).
The argument for 𝒢 ⊆ σ(ℰ) is only slightly harder. It depends on the fact
that an open subset of the real line can be written as a countable union of open
intervals. Such an interval has a representation (a, b) = (−∞, b) ∩ (−∞, a]ᶜ, and
(−∞, b) = ∪_{n=1}^∞ (−∞, b − n⁻¹]. That is, every open set can be built up from sets
in ℰ using operations that are guaranteed not to take us outside the sigma-field σ(ℰ).
My explanation has been moderately detailed. In a published paper the
reasoning would probably be abbreviated to something like "a generating class
argument shows that ...," with the routine details left to the reader.
REMARK. The generating class argument often reduces to an assertion like: A
is a sigma-field and A ⊇ ℰ, therefore A = σ(A) ⊇ σ(ℰ).

<5> Example. A class ℰ of subsets of a set X is called a field if it contains the empty
set and is stable under complements, finite unions, and finite intersections. For a
field ℰ, write ℰ_δ for the class of all possible intersections of countable subclasses
of ℰ, and ℰ_σ for the class of all possible unions of countable subclasses of ℰ.
Of course if ℰ is a sigma-field then ℰ = ℰ_δ = ℰ_σ, but, in general, the inclusions
σ(ℰ) ⊇ ℰ_δ ⊇ ℰ and σ(ℰ) ⊇ ℰ_σ ⊇ ℰ will be proper. For example, if X = R and ℰ
consists of all finite unions of half-open intervals (a, b], with possibly a = −∞ or
b = +∞, then the set of rationals does not belong to ℰ_σ and the complement of the
same set does not belong to ℰ_δ.
Let μ be a finite measure on σ(ℰ). Even though σ(ℰ) might be much larger
than either ℰ_σ or ℰ_δ, a generating class argument will show that all sets in σ(ℰ) can
be inner approximated by ℰ_δ, in the sense that
    μA = sup{μF : A ⊇ F ∈ ℰ_δ}    for each A in σ(ℰ),
and outer approximated by ℰ_σ, in the sense that
    μA = inf{μG : A ⊆ G ∈ ℰ_σ}    for each A in σ(ℰ).

REMARK. Incidentally, I chose the letters G and F to remind myself of open
and closed sets, which have similar approximation properties for Borel measures on
metric spaces; see Problem [12].

It helps to work on both approximation properties at the same time. Denote by
𝔅₀ the class of all sets in σ(ℰ) that can be both inner and outer approximated. A
set B belongs to 𝔅₀ if and only if, to each ε > 0, there exist F ∈ ℰ_δ and G ∈ ℰ_σ such
that F ⊆ B ⊆ G and μ(G\F) < ε. I'll call the sets F and G an ε-sandwich for B.
Trivially 𝔅₀ ⊇ ℰ, because each member of ℰ belongs to both ℰ_σ and ℰ_δ. The
approximation result will follow if we show that 𝔅₀ is a sigma-field, for then we
will have 𝔅₀ = σ(𝔅₀) ⊇ σ(ℰ).
Symmetry of the definition ensures that 𝔅₀ is stable under complements: if
F ⊆ B ⊆ G is an ε-sandwich for B, then Gᶜ ⊆ Bᶜ ⊆ Fᶜ is an ε-sandwich for Bᶜ.
To show that 𝔅₀ is stable under countable unions, consider a countable collection
{Bₙ : n ∈ N} of sets from 𝔅₀. We need to slice the bread thinner as n gets larger:
choose ε/2ⁿ-sandwiches Fₙ ⊆ Bₙ ⊆ Gₙ for each n. The union ∪ₙBₙ is sandwiched
between the sets G := ∪ₙGₙ and H := ∪ₙFₙ; and the sets are close in μ measure
because
    μ(∪ₙGₙ \ ∪ₙFₙ) ≤ Σₙ μ(Gₙ\Fₙ) < Σₙ ε/2ⁿ = ε.
REMARK. Can you prove this inequality? Do you see why ∪ₙGₙ \ ∪ₙFₙ ⊆
∪ₙ(Gₙ\Fₙ), and why countable additivity implies that the measure of a countable union
of (not necessarily disjoint) sets is smaller than the sum of their measures? If not,
just wait until Section 3, after which you can argue that ∪ₙGₙ \ ∪ₙFₙ ≤ Σₙ(Gₙ\Fₙ),
as an inequality between indicator functions, and μ(Σₙ(Gₙ\Fₙ)) = Σₙ μ(Gₙ\Fₙ)
by Monotone Convergence.

We have an ε-sandwich, but the bread might not be of the right type. It is
certainly true that G ∈ ℰ_σ (a countable union of countable unions is a countable
union), but the set H need not belong to ℰ_δ. However, the sets H_N := ∪_{n≤N} Fₙ do
belong to ℰ_δ, and countable additivity implies that μH_N ↑ μH.
REMARK. Do you see why? If not, wait for Monotone Convergence again.
If we choose a large enough N we have a 2ε-sandwich H_N ⊆ ∪ₙBₙ ⊆ G. □

The measure m on 𝔅(R) for which m(a, b] = b − a is called Lebesgue measure.


Another sort of generating class argument (see Section 10) can be used to show
that the values m(B) for B in 𝔅(R) are uniquely determined by the values given to
intervals; there can exist at most one measure on 𝔅(R) with the stated property. It
is harder to show that at least one such measure exists. Despite any intuitions you
might have about length, the construction of Lebesgue measure is not trivial; see
Appendix A. Indeed, Henri Lebesgue became famous for proving existence of the
measure and showing how much could be done with the new integration theory.
The name Lebesgue measure is also given to an extension of m to a measure
on a sigma-field, sometimes called the Lebesgue sigma-field, which is slightly larger
than 𝔅(R). I will have more to say about the extension in Section 6.
Borel sigma-fields are defined in similar fashion for any topological space X.
That is, 𝔅(X) denotes the sigma-field generated by the open subsets of X.
Sets in a sigma-field A are said to be measurable or A-measurable. In
probability theory they are also called events. Good functions will also be given the
title measurable. Try not to get confused when you really need to know whether an
object is a set or a function.

2. Measurable functions
Let X be a set equipped with a sigma-field A, let 𝒴 be a set equipped with a
sigma-field 𝔅, and let T be a function (also called a map) from X to 𝒴. We say that T
is A\𝔅-measurable if the inverse image {x ∈ X : Tx ∈ B} belongs to A for each
B in 𝔅. Sometimes the inverse image is denoted by {T ∈ B} or T⁻¹B. Don't be
fooled by the T⁻¹ notation into treating T⁻¹ as a function from 𝒴 into X: it's not,
unless T is one-to-one (and onto, if you want to have domain 𝒴). Sometimes an
A\𝔅-measurable map is referred to in abbreviated form as just A-measurable, or
just 𝔅-measurable, or just measurable, if there is no ambiguity about the unspecified
sigma-fields.
[Diagram: the map T carries points of (X, A) to points of (𝒴, 𝔅).]
For example, if 𝒴 = R and 𝔅 equals the Borel sigma-field 𝔅(R), it is common
to drop the 𝔅(R) specification and refer to the map as being A-measurable, or as
being Borel measurable if A is understood and there is any doubt about which
sigma-field to use for the real line. In this book, you may assume that any sigma-field
on R is its Borel sigma-field, unless explicitly specified otherwise. It can get confusing
if you misinterpret where the unspecified sigma-fields live. My advice would be
that you imagine a picture showing the two spaces involved, with any missing
sigma-field labels filled in.


Sometimes the functions come first, and the sigma-fields are chosen specifically
to make those functions measurable.
<6> Definition. Let ℋ be a class of functions on a set X. Suppose the typical h in ℋ
maps X into a space 𝒴_h, equipped with a sigma-field 𝔅_h. Then the sigma-field σ(ℋ)
generated by ℋ is defined as σ{h⁻¹(B) : B ∈ 𝔅_h, h ∈ ℋ}. It is the smallest
sigma-field A₀ on X for which each h in ℋ is A₀\𝔅_h-measurable.

<7> Example. If 𝔅 = σ(ℰ) for some class ℰ of subsets of 𝒴 then a map T is
A\σ(ℰ)-measurable if and only if T⁻¹E ∈ A for every E in ℰ. You should prove
this assertion by checking that {B ∈ 𝔅 : T⁻¹B ∈ A} is a sigma-field, and then
arguing from the definition of a generating class.
In particular, to establish A\𝔅(R)-measurability of a map into the real line
it is enough to check the inverse images of intervals of the form (t, ∞), with t
ranging over R. (In fact, we could restrict t to a countable dense subset of R,
such as the set of rationals: How would you build an interval (t, ∞) from intervals
(tᵢ, ∞) with rational tᵢ?) That is, a real-valued function f is Borel-measurable if
{x ∈ X : f(x) > t} ∈ A for each real t. There are many similar assertions obtained
by using other generating classes for 𝔅(R). Some authors use particular generating
classes for the definition of measurability, and then derive facts about inverse images
of Borel sets as theorems.
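One way to answer the parenthetical question: every interval (t, ∞) is a countable union of intervals with rational endpoints,
    (t, ∞) = ∪{(r, ∞) : r ∈ Q, r > t},
because each x > t lies in (r, ∞) for any rational r strictly between t and x.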

It will be convenient to consider not just real-valued functions on a set X,
but also functions from X into the extended real line R̄ := [−∞, ∞]. The Borel
sigma-field 𝔅(R̄) is generated by the class of open sets, or, more explicitly, by all
sets in 𝔅(R) together with the two singletons {−∞} and {∞}. It is an easy exercise
to show that 𝔅(R̄) is generated by the class of all sets of the form (t, ∞], for t in R,
and by the class of all sets of the form [−∞, t), for t in R. We could even restrict t
to any countable dense subset of R.
<8> Example. Let a set X be equipped with a sigma-field A. Let {fₙ : n ∈ N} be a
sequence of A\𝔅(R)-measurable functions from X into R. Define functions f and g
by taking pointwise suprema and infima: f(x) := supₙ fₙ(x) and g(x) := infₙ fₙ(x).
Notice that f might take the value +∞, and g might take the value −∞, at some
points of X. We may consider both as maps from X into R̄. (In fact, the whole
argument is unchanged if the fₙ functions themselves are also allowed to take
infinite values.)
The function f is A\𝔅(R̄)-measurable because
    {x : f(x) > t} = ∪ₙ{x : fₙ(x) > t} ∈ A    for each real t:
for each fixed x, the supremum of the real numbers fₙ(x) is strictly greater than t
if and only if fₙ(x) > t for at least one n. Example <7> shows why we have only
to check inverse images for such intervals.
The same generating class is not as convenient for proving measurability of g.
It is not true that an infimum of a sequence of real numbers is strictly greater than t
if and only if all of the numbers are strictly greater than t: think of the sequence
{n⁻¹ : n = 1, 2, 3, ...}, whose infimum is zero. Instead you should argue via the
identity {x : g(x) < t} = ∪ₙ{x : fₙ(x) < t} ∈ A for each real t.

From Example <8> and the representations lim supₙ fₙ(x) = inf_{n∈N} sup_{m≥n} f_m(x)
and lim infₙ fₙ(x) = sup_{n∈N} inf_{m≥n} f_m(x), it follows that the lim sup or lim inf of a
sequence of measurable (real- or extended real-valued) functions is also measurable.
In particular, if the limit exists it is measurable.
Measurability is also preserved by the usual algebraic operations (sums,
differences, products, and so on) provided we take care to avoid illegal pointwise
calculations such as ∞ − ∞ or 0/0. There are several ways to establish these
stability properties. One of the more direct methods depends on the fact that R has
a countable dense subset, as illustrated by the following argument for sums.
<9> Example. Let f and g be 𝔅(R)-measurable functions, with pointwise sum
h(x) = f(x) + g(x). (I exclude infinite values because I don't want to get caught
up with inconclusive discussions of how we might proceed at points x where
f(x) = +∞ and g(x) = −∞, or f(x) = −∞ and g(x) = +∞.) How can we prove
that h is also a 𝔅(R)-measurable function?
It is true that
    {x : h(x) > t} = ∪_{s∈R} ({x : f(x) = s} ∩ {x : g(x) > t − s}),
and it is true that the set {x : f(x) = s} ∩ {x : g(x) > t − s} is measurable for each s
and t, but sigma-fields are not required to have any particular stability properties for
uncountable unions. Instead we should argue that at each x for which f(x) + g(x) > t
there exists a rational number r such that f(x) > r > t − g(x). Conversely, if there
is a rational r for which f(x) > r > t − g(x) then f(x) + g(x) > t. Thus
    {x : h(x) > t} = ∪_{r∈Q} ({x : f(x) > r} ∩ {x : g(x) > t − r}),
where Q denotes the countable set of rational numbers. A countable union of
intersections of pairs of measurable sets is measurable. The sum h is a measurable
function.
As a little exercise you might try to extend the argument from the last Example
to the case where f and g are allowed to take the value +∞ (but not the value −∞).
If you want practice at playing with rationals, try to prove measurability of products
(be careful with inequalities if dividing by negative numbers) or try Problem [4],
which shows why a direct attack on the lim sup requires careful handling of
inequalities in the limit.
The real significance of measurability becomes apparent when one works
through the construction of integrals with respect to measures, as in Section 4. For
the moment it is important only that you understand that the family of all measurable
functions is stable under most of the familiar operations of analysis.

<10> Definition. The class M(X, A), or M(X) or just M for short, consists of all
A\𝔅(R̄)-measurable functions from X into R̄. The class M⁺(X, A), or M⁺(X) or
just M⁺ for short, consists of the nonnegative functions in M(X, A).
If you desired exquisite precision you could write M(X, A, R̄, 𝔅(R̄)), to
eliminate all ambiguity about domain, range, and sigma-fields.
The collection M⁺ is a cone (stable under sums and multiplication of functions
by positive constants). It is also stable under products, pointwise limits of sequences,
and suprema or infima of countable collections of functions. It is not a vector space,
because it is not stable under subtraction; but it does have the property that if f and
g belong to M⁺ and g takes only real values, then the positive part (f − g)⁺, defined
by taking the pointwise maximum of f(x) − g(x) with 0, also belongs to M⁺. You
could adapt the argument from Example <9> to establish the last fact.
It proves convenient to work with M⁺ rather than with the whole of M, thereby
eliminating many problems with ∞ − ∞. As you will soon learn, integrals have
some convenient properties when restricted to nonnegative functions.
For our purposes, one of the most important facts about M⁺ will be the
possibility of approximation by simple functions, that is, by measurable functions
of the form s := Σᵢ αᵢAᵢ, for finite collections of real numbers αᵢ and events Aᵢ
from A. If the Aᵢ are disjoint, s(x) equals αᵢ when x ∈ Aᵢ, for some i, and is
zero otherwise. If the Aᵢ are not disjoint, the nonzero values taken by s are sums
of various subsets of the {αᵢ}. Don't forget: the symbol Aᵢ gets interpreted as an
indicator function when we start doing algebra. I will write M⁺_simple for the cone of
all simple functions in M⁺.

<11> Lemma. For each f in M⁺ the sequence {fₙ} ⊆ M⁺_simple defined by
    fₙ := 2⁻ⁿ Σ_{i=1}^{4ⁿ} {f ≥ i/2ⁿ}
has the property 0 ≤ f₁(x) ≤ f₂(x) ≤ ... ≤ fₙ(x) ↑ f(x) at every x.
REMARK. The definition of fₙ involves algebra, so you must interpret {f ≥ i/2ⁿ}
as the indicator function of the set of all points x for which f(x) ≥ i/2ⁿ.

Proof. At each x, count the number of nonzero indicator values. If f(x) ≥ 2ⁿ, all
4ⁿ summands contribute a 1, giving fₙ(x) = 2ⁿ. If k2⁻ⁿ ≤ f(x) < (k + 1)2⁻ⁿ, for
some integer k from {0, 1, 2, ..., 4ⁿ − 1}, then exactly k of the summands contribute
a 1, giving fₙ(x) = k2⁻ⁿ. (Check that the last assertion makes sense when k
equals 0.) That is, for 0 ≤ f(x) < 2ⁿ, the function fₙ rounds f down to an integer
multiple of 2⁻ⁿ, from which the convergence and monotone increasing properties
follow.

If you do not find the monotonicity assertion convincing, you could argue,
more formally, that
    fₙ = 2⁻⁽ⁿ⁺¹⁾ Σ_{i=1}^{4ⁿ} ({f ≥ 2i/2ⁿ⁺¹} + {f ≥ 2i/2ⁿ⁺¹})
       ≤ 2⁻⁽ⁿ⁺¹⁾ Σ_{i=1}^{4ⁿ} ({f ≥ (2i−1)/2ⁿ⁺¹} + {f ≥ 2i/2ⁿ⁺¹}) ≤ fₙ₊₁,
which reflects the effect of doubling the maximum value and halving the step size
when going from the nth to the (n+1)st approximation.
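To see the rounding in action, here is a tiny Python sketch (the choice f = exp and the evaluation point 1.3 are arbitrary illustrations, not part of the Lemma):

    import math

    def f_n(f, x, n):
        # 2^{-n} times the number of i in {1,...,4^n} with f(x) >= i/2^n:
        # f(x) rounded down to a multiple of 2^{-n}, capped at 2^n.
        return sum(f(x) >= i / 2**n for i in range(1, 4**n + 1)) / 2**n

    f = math.exp
    for n in range(1, 6):
        print(n, f_n(f, 1.3, n))   # 2.0, 3.5, 3.625, 3.625, 3.65625 -> e^1.3

The printed values increase to f(1.3) = 3.6693..., exactly as the Lemma asserts.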

As an exercise you might prove that the product of functions in M⁺ also
belongs to M⁺, by expressing the product as a pointwise limit of products of simple
functions. Notice how the convention 0 × ∞ = 0 is needed to ensure the correct
limit behavior at points where one of the factors is zero.

3. Integrals
Just as ∫ₐᵇ f(x) dx represents a sort of limiting sum of f(x) values weighted by
small lengths of intervals (the ∫ sign is a long "S", for sum, and the dx is a sort
of limiting increment), so can the general integral ∫ f(x) μ(dx) be defined as a
limit of weighted sums, but with weights provided by the measure μ. The formal
definition involves limiting operations that depend on the assumed measurability of
the function f. You can skip the details of the construction (Section 4) by taking
the following result as an axiomatic property of the integral.

<12> Theorem. For each measure μ on (X, A) there is a uniquely determined functional,
a map μ̃ from M⁺(X, A) into [0, ∞], having the following properties:
(i) μ̃(1_A) = μA for each A in A;
(ii) μ̃(0) = 0, where the first zero stands for the zero function;
(iii) μ̃(αf + βg) = αμ̃(f) + βμ̃(g) for nonnegative real numbers α, β and functions f, g in M⁺;
(iv) if f, g are in M⁺ and f ≤ g everywhere then μ̃(f) ≤ μ̃(g);
(v) if f₁, f₂, ... is a sequence in M⁺ with 0 ≤ f₁(x) ≤ f₂(x) ≤ ... ↑ f(x) for
each x in X then μ̃(fₙ) ↑ μ̃(f).
I will refer to (iii) as linearity, even though M⁺ is not a vector space.
It will imply a linearity property when μ̃ is extended to a vector subspace of M.
Property (iv) is redundant because it follows from (iii) and nonnegativity. Property (ii)
is also redundant: put A = ∅ in (i); or, interpreting 0 × ∞ as 0, put α = β = 0 and
f = g = 0 in (iii). We need to make sure the bad case μ̃f = ∞, for all f in M⁺,
does not slip through if we start stripping away redundant requirements.
Notice that the limit function f in (v) automatically belongs to M⁺. The
limit assertion itself is called the Monotone Convergence property. It corresponds
directly to countable additivity of the measure. Indeed, if {Aᵢ : i ∈ N} is a countable
collection of disjoint sets from A then the functions fₙ := A₁ + ... + Aₙ increase
pointwise to the indicator function of A := ∪_{i∈N} Aᵢ, so that Monotone Convergence
and linearity imply μA = Σᵢ μAᵢ.
REMARK. You should ponder the role played by +∞ in Theorem <12>. For
example, what does αμ̃(f) mean if α = 0 and μ̃(f) = ∞? The interpretation
depends on the convention that 0 × ∞ = 0.
In general you should be suspicious of any convention involving ∞. Pay
careful attention to cases where it operates. For example, how would the five
assertions be affected if we adopted a new convention, whereby 0 × ∞ = 6? Would
the Theorem still hold? Where exactly would it fail? I feel uneasy if it is not
clear how a convention is disposing of awkward cases. My advice: be very, very
careful with any calculations involving infinity. Subtle errors are easy to miss when
concealed within a convention.

There is a companion to Theorem <12> that shows why it is largely a matter
of taste whether one starts from measures or integrals as the more primitive measure
theoretic concept.
<13> Theorem. Let μ̃ be a map from M⁺ to [0, ∞] that satisfies properties (ii)
through (v) of Theorem <12>. Then the set function defined on the sigma-field A
by (i) is a (countably additive, nonnegative) measure, with μ̃ the functional that it
generates.
Lemma <11> provides the link between the measure μ and the functional μ̃.
For a given f in M⁺, let {fₙ} be the sequence defined by the Lemma. Then
    μ̃f = limₙ→∞ μ̃fₙ = limₙ→∞ 2⁻ⁿ Σ_{i=1}^{4ⁿ} μ{f ≥ i/2ⁿ},
the first equality by Monotone Convergence, the second by linearity. The value
of μ̃f is uniquely determined by μ, as a set function on A. It is even possible to
use the equality, or something very similar, as the basis for a direct construction of
the integral, from which properties (i) through (v) are then derived, as you will see
in Section 4.
In summary: There is a one-to-one correspondence between measures on
the sigma-field A and increasing linear functionals on M⁺(A) with the Monotone
Convergence property. To each measure μ there is a uniquely determined
functional μ̃ for which μ̃(1_A) = μ(A) for every A in A.
The functional μ̃ is usually called an integral with respect to μ, and is variously
denoted by ∫ f dμ or ∫ f(x) μ(dx) or ∫ f(x) dμ(x) or ∫_X f dμ. With the de Finetti notation,
where we identify a set A with its indicator function, the functional μ̃ is just an
extension of μ from a smaller domain (indicators of sets in A) to a larger domain
(all of M⁺). Accordingly, we should have no qualms about denoting it by the same
symbol. I will write μf for the integral. With this notation, assertion (i) of
Theorem <12> becomes: μA = μA for all A in A. You probably can't tell that the A on the
left-hand side is an indicator function and the μ is an integral, but you don't need
to be able to tell; that is precisely what (i) asserts.
REMARK. In elementary algebra we rely on parentheses, or precedence, to make
our meaning clear. For example, both (ax) + b and ax + b have the same meaning,
because multiplication has higher precedence than addition. With traditional notation,
the ∫ and the dμ act like parentheses, enclosing the integrand and separating it
from following terms. With linear functional notation, we sometimes need explicit
parentheses to make the meaning unambiguous. As a way of eliminating some
parentheses, I often work with the convention that integration has lower precedence
than exponentiation, multiplication, and division, but higher precedence than addition
or subtraction. Thus I intend you to read μfg + 6 as (μ(fg)) + 6. I would write
μ(fg + 6) if the 6 were part of the integrand.

Some of the traditional notations also remove ambiguity when functions of
several variables appear in the integrand. For example, in ∫ f(x, y) μ(dx) the y
variable is held fixed while the μ operates on the first argument of the function.
When a similar ambiguity might arise with linear functional notation, I will append
a superscript, as in μˣf(x, y), to make clear which variable is involved in the
integration.

<14> Example. Suppose μ is a finite measure (that is, μX < ∞) and f is a function
in M⁺. Then μf < ∞ if and only if Σ_{n=1}^∞ μ{f ≥ n} < ∞.
The assertion is just a pointwise inequality in disguise. By considering
separately values for which k ≤ f(x) < k + 1, for k = 0, 1, 2, ..., you can verify the
pointwise inequality between functions,
    Σ_{n=1}^∞ {f ≥ n} ≤ f ≤ 1 + Σ_{n=1}^∞ {f ≥ n}.
In fact, the sum on the left-hand side defines ⌊f(x)⌋, the largest integer ≤ f(x),
and the right-hand side gives the smallest integer > f(x). From the leftmost
inequality,
    μf ≥ μ(Σ_{n=1}^∞ {f ≥ n})                      increasing
       = lim_{N→∞} μ(Σ_{n=1}^N {f ≥ n})            Monotone Convergence
       = lim_{N→∞} Σ_{n=1}^N μ{f ≥ n}              linearity
       = Σ_{n=1}^∞ μ{f ≥ n}.
A similar argument gives a companion upper bound. Thus the pointwise inequality
integrates out to Σ_{n=1}^∞ μ{f ≥ n} ≤ μf ≤ μX + Σ_{n=1}^∞ μ{f ≥ n}, from which the
asserted equivalence follows.
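For a probability measure and a nonnegative integer-valued f, the two bounds collapse to the exact identity μf = Σ_{n=1}^∞ μ{f ≥ n}, since then f = Σ_{n=1}^∞ {f ≥ n} pointwise. A quick numerical check in Python (the five-point measure and the values of f are arbitrary illustrations):

    # A probability measure on five points and an integer-valued f.
    weights = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.25, 'e': 0.15}
    f = {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 4}

    mu_f = sum(weights[x] * f[x] for x in weights)
    tails = sum(sum(weights[x] for x in weights if f[x] >= n)
                for n in range(1, max(f.values()) + 1))
    print(mu_f, tails)   # both equal 1.9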
Extension of the integral to a larger class of functions
Every function f in M can be decomposed into a difference f = f⁺ − f⁻ of two
functions in M⁺, where f⁺(x) := max(f(x), 0) and f⁻(x) := max(−f(x), 0). To
extend μ from M⁺ to a linear functional on M we should define μf := μf⁺ − μf⁻.
This definition works if at least one of μf⁺ and μf⁻ is finite; otherwise we get
the dreaded ∞ − ∞. If both μf⁺ < ∞ and μf⁻ < ∞ (or equivalently, f is
measurable and μ|f| < ∞) the function f is said to be integrable or μ-integrable.
The linearity property (iii) of Theorem <12> carries over partially to M if ∞ − ∞
problems are excluded, although it becomes tedious to handle all the awkward cases
involving ∞. The constants α and β need no longer be nonnegative. Also if both
f and g are integrable and if f ≤ g then μf ≤ μg, with obvious extensions to
certain cases involving ∞.

<15> Definition. The set of all real-valued, μ-integrable functions in M is denoted
by 𝓛¹(μ).
The set 𝓛¹(μ) is a vector space (stable under pointwise addition and multiplication
by real numbers). The integral μ defines an increasing linear functional on
𝓛¹(μ), in the sense that μf ≥ μg if f ≥ g pointwise. The Monotone Convergence
property implies other powerful limit results for functions in 𝓛¹(μ), as described in
Section 5. By restricting μ to 𝓛¹(μ), we eliminate problems with ∞ − ∞.

For each f in 𝓛¹(μ), its 𝓛¹ norm is defined as ‖f‖₁ := μ|f|. Strictly speaking,
‖·‖₁ is only a seminorm, because ‖f‖₁ = 0 need not imply that f is the zero
function; as you will see in Section 6, it implies only that μ{f ≠ 0} = 0. It
is common practice to ignore the small distinction and refer to ‖·‖₁ as a norm.
<16> Example. Let Ψ be a convex, real-valued function on R. The function Ψ is
measurable (because {Ψ ≤ t} is an interval for each real t), and for each x₀ in R
there is a constant α such that Ψ(x) ≥ Ψ(x₀) + α(x − x₀) for all x (Appendix C).
Let P be a probability measure, and X be an integrable random variable.
Choose x₀ := PX. From the inequality Ψ(x) ≥ −|Ψ(x₀)| − |α|(|x| + |x₀|) we
deduce that PΨ(X)⁻ ≤ |Ψ(x₀)| + |α|(P|X| + |x₀|) < ∞. Thus we should have no
∞ − ∞ worries in taking expectations (that is, integrating with respect to P) to
deduce that PΨ(X) ≥ Ψ(PX) + α(PX − x₀) = Ψ(PX), a result known as Jensen's
inequality. One way to remember the direction of the inequality is to note that
0 ≤ var(X) = PX² − (PX)², which corresponds to the case Ψ(x) = x².
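For instance, if P puts mass 1/3 on each of the points 0, 1, and 4, and Ψ(x) = x², then
    PΨ(X) = (0 + 1 + 16)/3 = 17/3,
which indeed exceeds Ψ(PX) = (5/3)² = 25/9.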
Integrals with respect to Lebesgue measure
Lebesgue measure m on 𝔅(R) corresponds to length: m[a, b] = b − a for each
interval. I will occasionally revert to the traditional ways of writing such integrals,
    ∫_{−∞}^{∞} f(x) dx = mf    and    mˣ(f(x){a ≤ x ≤ b}) = ∫ₐᵇ f(x) dx.
Don't worry about confusing the Lebesgue integral with the Riemann integral over
finite intervals. Whenever the Riemann is well defined, so is the Lebesgue, and the
two sorts of integral have the same value. The Lebesgue is a more general concept.
Indeed, facts about the Riemann are often established by an appeal to theorems
about the Lebesgue. You do not have to abandon what you already know about
integration over finite intervals.
The improper Riemann integral, ∫_{−∞}^{∞} f(x) dx = limₙ→∞ ∫_{−n}^{n} f(x) dx, also agrees
with the Lebesgue integral provided m|f| < ∞. If m|f| = ∞, as in the case of
the function f(x) := Σ_{n=1}^∞ {n ≤ x < n + 1}(−1)ⁿ/n, the improper Riemann integral
might exist as a finite limit, while the Lebesgue integral mf does not exist.

*4. Construction of integrals from measures
To construct the integral μ̃ as a functional on M⁺(X, A), starting from a measure
μ on the sigma-field A, we use approximation from below by means of simple
functions.
First we must define μ̃ on M⁺_simple. The representation of a simple function as a
linear combination of indicator functions is not unique, but the additivity properties
of the measure μ will let us use any representation to define the integral. For
example, if s := 3A₁ + 7A₂ = 3A₁A₂ᶜ + 10A₁A₂ + 7A₁ᶜA₂, then
    μ̃(s) = 3μ(A₁A₂ᶜ) + 10μ(A₁A₂) + 7μ(A₁ᶜA₂).
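Finite additivity makes the two representations agree:
    3μA₁ + 7μA₂ = 3(μ(A₁A₂ᶜ) + μ(A₁A₂)) + 7(μ(A₁A₂) + μ(A₁ᶜA₂))
                = 3μ(A₁A₂ᶜ) + 10μ(A₁A₂) + 7μ(A₁ᶜA₂),
because A₁ is the disjoint union of A₁A₂ᶜ and A₁A₂, and A₂ is the disjoint union of A₁A₂ and A₁ᶜA₂.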

More generally, if s := Σᵢ αᵢAᵢ has another representation s = Σⱼ βⱼBⱼ then
Σᵢ αᵢμAᵢ = Σⱼ βⱼμBⱼ. Proof? Thus we can uniquely define μ̃(s) for a simple
function s := Σᵢ αᵢAᵢ by μ̃(s) := Σᵢ αᵢμAᵢ.
Define the increasing functional μ̃ on M⁺ by
    μ̃(f) := sup{μ̃(s) : f ≥ s ∈ M⁺_simple}.
That is, the integral of f is a supremum of integrals of nonnegative simple functions
less than f.
From the representation of simple functions as linear combinations of disjoint
sets in A, it is easy to show that μ̃(1_A) = μA for every A in A. It is also easy to
show that μ̃(0) = 0, and μ̃(αf) = αμ̃(f) for nonnegative real α, and
    <17>    μ̃(f) + μ̃(g) ≤ μ̃(f + g).
The last inequality, which is usually referred to as the superadditivity property,
follows from the fact that if f ≥ u and g ≥ v, and both u and v are simple, then
f + g ≥ u + v, with u + v simple.
Only the Monotone Convergence property and the companion to <17>,
    <18>    μ̃(f + g) ≤ μ̃(f) + μ̃(g),
require real work. Here you will see why measurability is needed.
Proof of inequality <18>. Let s be a simple function ≤ f + g, and let ε be a small
positive number. It is enough to construct simple functions u, v with u ≤ f and
v ≤ g such that u + v ≥ (1 − ε)s. For then μ̃f + μ̃g ≥ μ̃u + μ̃v ≥ (1 − ε)μ̃s, from
which the subadditivity inequality <18> follows by taking a supremum over simple
functions then letting ε tend to zero.
For simplicity of notation I will assume s to be very simple: s := A. You can
repeat the argument for each Aᵢ in a representation Σᵢ αᵢAᵢ with disjoint Aᵢ to get
the general result. Suppose ε = 1/m for some positive integer m. Write εⱼ for j/m.
Define simple functions
    u := A{f ≥ 1} + Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} εⱼ₋₁,
    v := Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} (1 − εⱼ).
The measurability of f ensures A-measurability of all
the sets entering into the definitions of u and v. For the
inequality v ≤ g, notice that f + g ≥ 1 on A, so g ≥ 1 − εⱼ = v when εⱼ₋₁ ≤ f < εⱼ
on A. Finally, note that the simple functions were chosen so that
    u + v = A{f ≥ 1} + Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} (1 − ε) ≥ (1 − ε)A,
as desired. □
Proof of the Monotone Convergence property. Suppose fₙ ∈ M⁺
and fₙ ↑ f. Suppose f ≥ s := Σᵢ αᵢAᵢ, with the Aᵢ disjoint
sets in A and αᵢ > 0. Define approximating simple functions
    sₙ := Σᵢ (1 − ε)αᵢ Aᵢ{fₙ > (1 − ε)αᵢ}.
Clearly sₙ ≤ fₙ. The simple function sₙ is one of those that enters into the
supremum defining μ̃fₙ. It follows that
    μ̃fₙ ≥ μ̃(sₙ) = (1 − ε) Σᵢ αᵢ μ(Aᵢ{fₙ > (1 − ε)αᵢ}).
On the set Aᵢ the functions fₙ increase monotonely to f, which is ≥ αᵢ. The sets
Aᵢ{fₙ > (1 − ε)αᵢ} expand up to the whole of Aᵢ. Countable additivity implies that
the μ measures of those sets increase to μAᵢ. It follows that
    limₙ μ̃fₙ ≥ lim supₙ μ̃sₙ ≥ (1 − ε)μ̃s.
Take a supremum over simple s ≤ f then let ε tend to zero to complete the proof. □

5. Limit theorems
Theorem <13> identified an integral on M⁺ as an increasing linear functional with
the Monotone Convergence property:
    <19>    if 0 ≤ fₙ ↑ f then μ(limₙ→∞ fₙ) = limₙ→∞ μfₙ.
Two direct consequences of this limit property have important applications throughout
probability theory. The first, Fatou's Lemma, asserts a weaker limit property
for nonnegative functions when the convergence and monotonicity assumptions are
dropped. The second, Dominated Convergence, drops the monotonicity and nonnegativity
but imposes an extra domination condition on the convergent sequence {fₙ}.
I have slowly realized over the years that many simple probabilistic results can be
established by Dominated Convergence arguments. The Dominated Convergence
Theorem is the Swiss Army Knife of probability theory.
It is important that you understand why some conditions are needed before
we can interchange integration (which is a limiting operation) with an explicit
limit, as in <19>. Variations on the following example form the basis for many
counterexamples.
<20> Example. Let μ be Lebesgue measure on 𝔅[0, 1] and let {αₙ} be a sequence
of positive numbers. The function fₙ(x) := αₙ{0 < x ≤ 1/n} converges to zero,
pointwise, but its integral μ(fₙ) = αₙ/n need not converge to zero. For example,
αₙ = n² gives μfₙ = n → ∞; the integrals diverge. And
    αₙ := 6n for n even, 3n for n odd    gives    μfₙ = 6 for n even, 3 for n odd.
The integrals oscillate.
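The two failure modes are easy to tabulate; a small Python sketch (illustration only):

    # mu(f_n) = alpha_n / n for f_n = alpha_n * {0 < x <= 1/n} under Lebesgue
    # measure on [0, 1]; the pointwise limit of f_n is 0 in both columns.
    for n in range(1, 7):
        diverging = n**2 / n                                  # alpha_n = n^2
        oscillating = (6 * n if n % 2 == 0 else 3 * n) / n    # alpha_n = 6n or 3n
        print(n, diverging, oscillating)   # first column grows; second alternates 3, 6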


<21> Fatou's Lemma. For every sequence {fₙ} in M⁺ (not necessarily convergent),
    μ(lim infₙ fₙ) ≤ lim infₙ μfₙ.
Proof. Write f for lim inf fₙ. Remember what a lim inf means. Define gₙ :=
inf_{m≥n} f_m. Then gₙ ≤ fₙ for every n and the {gₙ} sequence increases monotonely to
the function f. By Monotone Convergence, μf = limₙ→∞ μgₙ. By the increasing
property, μgₙ ≤ μfₙ for each n, and hence limₙ→∞ μgₙ ≤ lim infₙ→∞ μfₙ. □

For dominated sequences of functions, a splicing together of two Fatou Lemma
assertions gives two lim inf consequences that combine to produce a limit result.
(See Problem [10] for a generalization.)
<22> Dominated Convergence. Let {fₙ} be a sequence of μ-integrable functions for
which limₙ fₙ(x) exists for all x. Suppose there exists a μ-integrable function F,
which does not depend on n, such that |fₙ(x)| ≤ F(x) for all x and all n. Then the
limit function is integrable and μ(limₙ→∞ fₙ) = limₙ→∞ μfₙ.
Proof. The limit function is also bounded in absolute value by F, and hence it is
integrable.
Apply Fatou's Lemma to the two sequences {F + fₙ} and {F − fₙ} in M⁺, to
get
    μ(lim inf(F + fₙ)) ≤ lim inf μ(F + fₙ) = lim inf (μF + μfₙ),
    μ(lim inf(F − fₙ)) ≤ lim inf μ(F − fₙ) = lim inf (μF − μfₙ).
Simplify, using the fact that a lim inf is the same as a lim for convergent sequences:
    μ(F + lim fₙ) ≤ μF + lim inf μfₙ,
    μ(F − lim fₙ) ≤ μF − lim sup μfₙ.
Notice that we cannot yet assert that the lim inf on the right-hand side is actually a
limit. The negative sign turns a lim inf into a lim sup.
Cancel out the finite number μF then rearrange, leaving
    lim sup μfₙ ≤ μ(lim fₙ) ≤ lim inf μfₙ.
The convergence assertion follows. □


REMARK. You might well object to some of the steps in the proof on ∞ − ∞
grounds. For example, what does F(x) + fₙ(x) mean at a point where F(x) = ∞
and fₙ(x) = −∞? To eliminate such problems, replace F by F{F < ∞} and fₙ
by fₙ{F < ∞}, then appeal to Lemma <26> in the next Section to ensure that the
integrals are not affected.
The function F is said to dominate the sequence {fₙ}. The assumption in
Theorem <22> could also be written as μ(supₙ |fₙ|) < ∞, with F := supₙ |fₙ| as
the dominating function. It is a common mistake amongst students new to the result
to allow F to depend on n.

Dominated Convergence turns up in many situations that you might not at first
recognize as examples of an interchange in the order of two limit procedures.
<23> Example. Do you know why
    (d/dt) ∫₀¹ e^{tx} x^{5/2} (1 − x)^{7/2} dx = ∫₀¹ e^{tx} x^{7/2} (1 − x)^{7/2} dx ?
Of course I just differentiated under the integral sign, but why is that allowed? The
neatest justification uses a Dominated Convergence argument.
More generally, for each t in an interval (−δ, δ) about the origin let f(·, t) be a
μ-integrable function on X, such that the function f(x, ·) is differentiable in (−δ, δ)
for each x. We need to justify taking the derivative at t = 0 inside the integral, to
conclude that
    <24>    (d/dt) (μˣ f(x, t)) |_{t=0} = μˣ ((∂/∂t) f(x, t) |_{t=0}).
Domination of the partial derivative will suffice.


Write g(t) for μˣf(x, t) and Δ(x, t) for the partial derivative (∂/∂t)f(x, t). Suppose
there exists a μ-integrable function M such that
    |Δ(x, t)| ≤ M(x)    for all x, all t ∈ (−δ, δ).
To establish <24>, it is enough to show that
    <25>    (g(hₙ) − g(0))/hₙ → μˣ Δ(x, 0)
for every sequence {hₙ} of nonzero real numbers tending to zero. (Please make sure
that you understand why continuous limits can be replaced by sequential limits in this
way. It is a common simplification.) With no loss of generality, suppose δ > hₙ > 0
for all n. The ratio on the left-hand side of <25> equals the μ integral of the
function fₙ(x) := (f(x, hₙ) − f(x, 0))/hₙ. By assumption, fₙ(x) → Δ(x, 0) for
every x. The sequence {fₙ} is dominated by M: by the mean-value theorem, for
each x there exists a tₓ in (−hₙ, hₙ) ⊆ (−δ, δ) for which |fₙ(x)| = |Δ(x, tₓ)| ≤ M(x).
An appeal to Dominated Convergence completes the argument. □
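A numerical check of the opening identity, sketched with scipy's quad (the step h and the symmetric difference quotient are illustrative choices, not part of the argument):

    from math import exp
    from scipy.integrate import quad

    def g(t):
        # g(t) = integral over [0,1] of exp(t*x) * x^{5/2} * (1-x)^{7/2}
        return quad(lambda x: exp(t * x) * x**2.5 * (1 - x)**3.5, 0, 1)[0]

    h = 1e-5
    lhs = (g(h) - g(-h)) / (2 * h)                          # derivative of g at t = 0
    rhs = quad(lambda x: x**3.5 * (1 - x)**3.5, 0, 1)[0]    # integrand differentiated in t
    print(lhs, rhs)   # both approximately 0.003356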

6. Negligible sets
A set N in A for which μN = 0 is said to be μ-negligible. (Some authors use
the term μ-null, but I find it causes confusion with null as a name for the empty
set.) As the name suggests, we can usually ignore bad things that happen only on
a negligible set. A property that holds everywhere except possibly for those x in a
μ-negligible set of points is said to hold μ-almost everywhere or μ-almost surely,
abbreviated to a.e. [μ] or a.s. [μ], with the [μ] omitted when understood.
There are several useful facts about negligible sets that are easy to prove and
exceedingly useful to have formally stated. They depend on countable additivity,
via its Monotone Convergence generalization. I state them only for nonnegative
functions, leaving the obvious extensions for 𝓛¹(μ) to you.

<26> Lemma. For every measure μ:
(i) if g ∈ M⁺ and μg < ∞ then g < ∞ a.e. [μ];
(ii) if g, h ∈ M⁺ and g = h a.e. [μ] then μg = μh;
(iii) if N₁, N₂, ... is a sequence of negligible sets then ∪ᵢ Nᵢ is also negligible;
(iv) if g ∈ M⁺ and μg = 0 then g = 0 a.e. [μ].
Proof. For (i): Integrate out the inequality g ≥ n{g = ∞} for each positive integer n
to get ∞ > μg ≥ nμ{g = ∞}. Let n tend to infinity to deduce that μ{g = ∞} = 0.
For (ii): Invoke the increasing and Monotone Convergence properties of
integrals, starting from the pointwise bound h ≤ g + ∞{h ≠ g} = limₙ (g + n{h ≠ g}),
to deduce that μh ≤ limₙ (μg + nμ{h ≠ g}) = μg. Reverse the roles of g and h to
get the reverse inequality.
For (iii): Invoke Monotone Convergence for the right-hand side of the pointwise
inequality ∪ᵢ Nᵢ ≤ Σᵢ Nᵢ to get μ(∪ᵢ Nᵢ) ≤ μ(Σᵢ Nᵢ) = Σᵢ μNᵢ = 0.
For (iv): Put Nₙ := {g > 1/n} for n = 1, 2, .... Then μNₙ ≤ nμg = 0, from
which it follows that {g > 0} = ∪ₙ Nₙ is negligible.
REMARK.
Notice the appeals to countable additivity, via the Monotone
Convergence property, in the proofs. Results such as (iv) fail without countable
additivity, which might trouble those brave souls who would want to develop a
probability theory using only finite additivity.

Property (iii) can be restated as: if A ∈ A and A is covered by a countable
family of negligible sets then A is negligible. Actually we can drop the assumption
that A ∈ A if we enlarge the sigma-field slightly.
<27> Definition. The μ-completion of the sigma-field A is the class A_μ of all those
sets B for which there exist sets A₀, A₁ in A with A₀ ⊆ B ⊆ A₁ and μ(A₁\A₀) = 0.
You should check that A_μ is a sigma-field and that μ has a unique extension
to a measure on A_μ, defined by μB := μA₀ = μA₁, with A₀ and A₁ as in the
Definition. More generally, for each f in M⁺(X, A_μ), you should show that there
exist functions f₀, g₀ in M⁺(X, A) for which f₀ ≤ f ≤ f₀ + g₀ and μg₀ = 0. Of
course, we then have μf := μf₀.
The Lebesgue sigma-field on the real line is the completion of the Borel
sigma-field with respect to Lebesgue measure.

<28> Example. Here is one of the standard methods for proving that some measurable
set A has zero μ measure. Find a measurable function f for which f(x) > 0, for
all x in A, and μ(fA) = 0. From part (iv) of Lemma <26> deduce that fA = 0
a.e. [μ]. That is, f(x) = 0 for almost all x in A. The set A = {x ∈ A : f(x) > 0}
must be negligible.
Many limit theorems in probability theory assert facts about sequences that
hold only almost everywhere.

<29> Example. (Generalized Borel-Cantelli lemma) Suppose {fₙ} is a sequence in M⁺
for which Σₙ μfₙ < ∞. By Monotone Convergence, μ(Σₙ fₙ) = Σₙ μfₙ < ∞.
Part (i) of Lemma <26> then gives Σₙ fₙ(x) < ∞ for μ almost all x.
For the special case of a probability measure with each fₙ an indicator function
of a set in A, the convergence property is called the Borel-Cantelli lemma: If
Σₙ PAₙ < ∞ then Σₙ Aₙ < ∞ almost surely. That is,
    P{ω ∈ Ω : ω ∈ Aₙ for infinitely many n} = 0,
a trivial result that, nevertheless, is the basis for much probabilistic limit theory.
The event in the last display is often written in abbreviated form, {Aₙ i.o.}.
REMARK. For sequences of independent events, there is a second part to
the Borel-Cantelli lemma (Problem [1]), which asserts that if Σₙ PAₙ = ∞ then
P{Aₙ i.o.} = 1. Problem [2] establishes an even stronger converse, replacing
independence by a weaker limit property.

The Borel-Cantelli argument often takes the following form when invoked
to establish almost sure convergence. You should make sure you understand the
method, because the details are usually omitted in the literature.
Suppose {Xₙ} is a sequence of random variables (all defined on the same Ω)
for which Σₙ P{|Xₙ| > ε} < ∞ for each ε > 0. By Borel-Cantelli, to each ε > 0
there is a P-negligible set N(ε) for which Σₙ{|Xₙ(ω)| > ε} < ∞ if ω ∈ N(ε)ᶜ. A
sum of integers converges if and only if the summands are eventually zero. Thus to
each ω in N(ε)ᶜ there exists a finite n(ε, ω) such that |Xₙ(ω)| ≤ ε when n ≥ n(ε, ω).
We have an uncountable family of negligible sets {N(ε) : ε > 0}. We are allowed
to neglect only countable unions of negligible sets. Replace ε by a sequence of
values such as 1, 1/2, 1/3, 1/4, ..., tending to zero. Define N := ∪_{k∈N} N(1/k),
which, by part (iii) of Lemma <26>, is negligible. For each ω in Nᶜ we have
|Xₙ(ω)| ≤ 1/k when n ≥ n(1/k, ω). Consequently, Xₙ(ω) → 0 as n → ∞ for each
ω in Nᶜ; the sequence {Xₙ} converges to zero almost surely.
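The method can be watched in simulation; a Python sketch, under the illustrative assumption that Xₙ is nonzero with probability n⁻², so that Σₙ P{|Xₙ| > ε} < ∞ for every ε > 0:

    import random

    random.seed(1)
    last = 0
    for n in range(1, 100001):
        if random.random() < 1 / n**2:   # X_n nonzero with probability n^{-2}
            last = n
    print(last)   # on a typical sample path, only finitely many (small) n are nonzero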
For measure theoretic arguments with a fixed μ, it is natural to treat as identical
those functions that are equal almost everywhere. Many theorems have trivial
modifications, with equalities replaced by almost sure equalities, convergence replaced
by almost sure convergence, and so on. For example, Dominated Convergence holds
in a slightly strengthened form:
    Let {fₙ} be a sequence of measurable functions for which fₙ(x) → f(x) at μ
    almost all x. Suppose there exists a μ-integrable function F, which does not
    depend on n, such that |fₙ(x)| ≤ F(x) for μ almost all x and all n. Then
    μfₙ → μf.
Most practitioners of probability learn to ignore negligible sets (and then suffer
slightly when they come to some stochastic process arguments where the handling
of uncountable families of negligible sets requires more delicacy). For example,
if I could show that a sequence {fₙ} converges almost everywhere I would hardly
hesitate to write: Define f := limₙ fₙ. What happens at those x where fₙ(x) does
not converge? If hard pressed I might write:
    Define f(x) := limₙ fₙ(x) on the set where the limit exists, and f(x) := 0 otherwise.
You might then wonder if the function so-defined were measurable (it is), or if the
set where the limit exists is measurable (it is). A sneakier solution would be to
write: Define f(x) := lim supₙ fₙ(x). It doesn't much matter what happens on the
negligible set where the lim sup is not equal to the lim inf, which happens only when
the limit does not exist.
A more formal way to equate functions equal almost everywhere is to work with
equivalence classes, [f] := {g ∈ M : f = g a.e. [μ]}. The almost sure equivalence
also partitions 𝓛¹(μ) into equivalence classes, for which we can define μ[f] := μg
for an arbitrary choice of g from [f]. The collection of all these equivalence classes
is denoted by L¹(μ). The L¹ norm, ‖[f]‖₁ := ‖f‖₁, is a true norm on L¹(μ), because
[f] equals the equivalence class of the zero function when ‖[f]‖₁ = 0. Few authors
are careful about maintaining the distinction between f and [f], or between L¹(μ)
and 𝓛¹(μ).

*7. 𝓛ᵖ spaces
For each real number p with p ≥ 1 the 𝓛ᵖ-norm is defined on M(X, A, μ) by
‖f‖_p := (μ|f|ᵖ)^{1/p}. Problem [17] shows that the map f ↦ ‖f‖_p satisfies the
triangle inequality, ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p, at least when restricted to real-valued
functions in M.
As with the 𝓛¹-norm, it is not quite correct to call ‖·‖_p a norm, for two
reasons: there are measurable functions for which ‖f‖_p = ∞, and there are nonzero
measurable functions for which ‖f‖_p = 0. We avoid the first complication by
restricting attention to the vector space 𝓛ᵖ := 𝓛ᵖ(X, A, μ) of all real-valued,
A-measurable functions for which ‖f‖_p < ∞. We could avoid the second complication
by working with the vector space Lᵖ := Lᵖ(X, A, μ) of μ-equivalence classes of
functions in 𝓛ᵖ(X, A, μ). That is, the members of Lᵖ are the μ-equivalence classes,
[f] := {g ∈ 𝓛ᵖ : g = f a.e. [μ]}, with f in 𝓛ᵖ. (See Problem [20] for the limiting
case, p = ∞.)
REMARK. The correct term for ‖·‖_p on 𝓛ᵖ is pseudonorm, meaning that it
has all the properties of a norm (triangle inequality, and ‖cf‖ = |c| ‖f‖ for real
constants c) except that it might be zero for nonzero functions. Again, few authors
are careful about maintaining the distinction between 𝓛ᵖ and Lᵖ.
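The triangle inequality is easy to test for a measure concentrated on finitely many points; a Python sketch (the weights and function values are arbitrary illustrations):

    # ||f||_p = (sum_x |f(x)|^p * mu{x})^{1/p} for a measure on four points
    mu = [0.5, 1.0, 2.0, 0.25]
    f = [1.0, -2.0, 0.5, 3.0]
    g = [-1.5, 0.5, 2.0, -1.0]

    def norm_p(h, p):
        return sum(abs(v)**p * w for v, w in zip(h, mu))**(1 / p)

    fg = [a + b for a, b in zip(f, g)]
    for p in (1, 2, 3):
        print(p, norm_p(fg, p) <= norm_p(f, p) + norm_p(g, p))   # True each time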

Problem [19] shows that the norm defines a complete pseudometric on $\mathcal{L}^p$ (and a complete metric on $L^p$). That is, if $\{f_n\}$ is a Cauchy sequence of functions in $\mathcal{L}^p$ (meaning that $\|f_n - f_m\|_p \to 0$ as $\min(m, n) \to \infty$) then there exists a function $f$ in $\mathcal{L}^p$ for which $\|f_n - f\|_p \to 0$. The limit function $f$ is unique up to a $\mu$-equivalence.
For our purposes, the case where $p$ equals 2 will be the most important. The pseudonorm is then generated by an inner product (or, more correctly, a "pseudo" inner product), $\langle f, g\rangle := \mu(fg)$. That is, $\|f\|_2^2 = \langle f, f\rangle$. The inner product has the properties:
(a) $\langle \alpha f + \beta g, h\rangle = \alpha\langle f, h\rangle + \beta\langle g, h\rangle$ for all real $\alpha, \beta$, all $f, g, h$ in $\mathcal{L}^2$;
(b) $\langle f, g\rangle = \langle g, f\rangle$ for all $f, g$ in $\mathcal{L}^2$;
(c) $\langle f, f\rangle \ge 0$, with equality if and only if $f = 0$ a.e. $[\mu]$.
If we work with the equivalence classes of $L^2$ then (c) is replaced by the assertion that $\langle [f], [f]\rangle$ equals zero if and only if $[f]$ is zero, as required for a true inner product.
A vector space equipped with an inner product whose corresponding norm
defines a complete metric is called a Hilbert space, a generalization of ordinary
Euclidean space. Arguments involving Hilbert spaces look similar to their analogs
for Euclidean space, with an occasional precaution against possible difficulties
with infinite dimensionality. Many results in Probability and Statistics rely on
Hilbert space methods: information inequalities; the Blackwell-Rao theorem;
the construction of densities and abstract conditional expectations; Hellinger
differentiability; prediction in time series; Gaussian process theory; martingale
theory; stochastic integration; and much more.

Some of the basic theory for Hilbert space is established in Appendix B. For the next several Chapters, the following two Hilbert space results, specialized to $\mathcal{L}^2$ spaces, will suffice.
(1) Cauchy-Schwarz inequality: $|\mu(fg)| \le \|f\|_2\,\|g\|_2$ for all $f, g$ in $\mathcal{L}^2(\mu)$, which follows from the Hölder inequality (Problem [15]).
(2) Orthogonal projections: Let $\mathcal{H}_0$ be a closed subspace of $\mathcal{L}^2(\mu)$. For each $f$ in $\mathcal{L}^2$ there is an $f_0$ in $\mathcal{H}_0$, the (orthogonal) projection of $f$ onto $\mathcal{H}_0$, for which $f - f_0$ is orthogonal to $\mathcal{H}_0$, that is, $\langle f - f_0, g\rangle = 0$ for all $g$ in $\mathcal{H}_0$. The point $f_0$ minimizes $\|f - h\|_2$ over all $h$ in $\mathcal{H}_0$. The projection $f_0$ is unique up to a $\mu$-almost sure equivalence.
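The simplest instance is worth recording (a standard observation, not from the text above): for a probability measure $P$, take $\mathcal{H}_0$ to be the subspace of constant functions in $\mathcal{L}^2(P)$. The projection of $f$ is the constant $f_0 := Pf$, because $\langle f - Pf, c\rangle = c\,P(f - Pf) = 0$ for every constant $c$, and the decomposition $\|f - c\|_2^2 = \|f - Pf\|_2^2 + (Pf - c)^2$ shows that $c = Pf$ minimizes the distance. In probabilistic language: the mean is the best constant predictor in mean square, and the minimized squared distance is the variance.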
REMARK. A closed subspace $\mathcal{H}_0$ of $\mathcal{L}^2$ must contain all $f$ in $\mathcal{L}^2$ for which there exist $f_n \in \mathcal{H}_0$ with $\|f_n - f\|_2 \to 0$. In particular, if $f$ belongs to $\mathcal{H}_0$ and $g = f$ a.e. $[\mu]$ then $g$ must also belong to $\mathcal{H}_0$. If $\mathcal{H}_0$ is closed, the set of equivalence classes $\widetilde{\mathcal{H}}_0 := \{[f] : f \in \mathcal{H}_0\}$ must be a closed subspace of $L^2(\mu)$, and $\mathcal{H}_0$ must equal the union of all equivalence classes in $\widetilde{\mathcal{H}}_0$.

For us the most important subspaces of $\mathcal{L}^2(X, \mathcal{A}, \mu)$ will be defined by the sub-sigma-fields $\mathcal{A}_0$ of $\mathcal{A}$. Let $\mathcal{H}_0 = \mathcal{L}^2(X, \mathcal{A}_0, \mu)$. The corresponding $L^2(X, \mathcal{A}_0, \mu)$ is a

Hilbert space in its own right, and therefore it is a closed subspace of $L^2(X, \mathcal{A}, \mu)$. Consequently $\mathcal{H}_0$ is a complete subspace of $\mathcal{L}^2$: if $\{f_n\}$ is a Cauchy sequence in $\mathcal{H}_0$ then there exists an $f_0 \in \mathcal{H}_0$ such that $\|f_n - f_0\|_2 \to 0$. However, $\{f_n\}$ also converges to every other $\mathcal{A}$-measurable $f$ for which $f = f_0$ a.e. $[\mu]$. Unless $\mathcal{A}_0$ contains all $\mu$-negligible sets from $\mathcal{A}$, the limit $f$ need not be $\mathcal{A}_0$-measurable; the subspace $\mathcal{H}_0$ need not be closed. If we work instead with the corresponding $L^2(X, \mathcal{A}, \mu)$ and $L^2(X, \mathcal{A}_0, \mu)$ we do get a closed subspace, because the equivalence class of the limit function is uniquely determined.

*8. Uniform integrability


Suppose $\{f_n\}$ is a sequence of measurable functions converging almost surely to a limit $f$. If the sequence is dominated by some $\mu$-integrable function $F$, then $2F \ge |f_n - f| \to 0$ almost surely, from which it follows, via Dominated Convergence, that $\mu|f_n - f| \to 0$. That is, domination plus almost sure convergence imply convergence in $\mathcal{L}^1(\mu)$ norm. The converse is not true: $\mu$ equal to Lebesgue measure and $f_n(x) := n\{(n+1)^{-1} < x \le n^{-1}\}$ provides an instance of $\mathcal{L}^1$ convergence without domination.
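A quick calculation confirms both halves of the claim: $\mu f_n = n\left(n^{-1} - (n+1)^{-1}\right) = (n+1)^{-1} \to 0$, so $f_n \to 0$ in $\mathcal{L}^1(\mu)$; but the intervals $((n+1)^{-1}, n^{-1}]$ are disjoint, so any dominating $F$ would satisfy $F \ge \sup_n f_n$, whence $\mu F \ge \sum_{n=1}^\infty (n+1)^{-1} = \infty$.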
At least when we deal with finite measures, there is an elegant circle of equivalences, involving a concept (convergence in measure) slightly weaker than almost sure convergence and a concept (uniform integrability) slightly weaker than domination. With no loss of generality, I will explain the connections for a sequence of random variables $\{X_n\}$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
The sequence is said to converge in probability to a random variable $X$, sometimes written $X_n \to X$ in probability, if $\mathbb{P}\{|X_n - X| > \epsilon\} \to 0$ for each $\epsilon > 0$. Problem [14] guides you through the proofs of the following facts.


(a) If $\{X_n\}$ converges to $X$ almost surely then $X_n \to X$ in probability, but the converse is false: there exist sequences that converge in probability but not almost surely.
(b) If $\{X_n\}$ converges in probability to $X$, there is an increasing sequence of positive integers $\{n(k)\}$ for which $\lim_{k\to\infty} X_{n(k)} = X$ almost surely.
If a random variable $Z$ is integrable then a Dominated Convergence argument shows that $\mathbb{P}|Z|\{|Z| > M\} \to 0$ as $M \to \infty$. Uniform integrability requires that the convergence hold uniformly over a class of random variables. Very roughly speaking, it lets us act almost as if all the random variables were bounded by a constant $M$, at least as far as $\mathcal{L}^1$ arguments are concerned.
<30> Definition. A family of random variables $\{Z_t : t \in T\}$ is said to be uniformly integrable if $\sup_{t\in T} \mathbb{P}|Z_t|\{|Z_t| > M\} \to 0$ as $M \to \infty$.
It is sometimes slightly more convenient to check for uniform integrability by means of an $\epsilon$-$\delta$ characterization.

<31> Lemma. A family of random variables $\{Z_t : t \in T\}$ is uniformly integrable if and only if both the following conditions hold:
(i) $\sup_{t\in T} \mathbb{P}|Z_t| < \infty$;
(ii) for each $\epsilon > 0$ there exists a $\delta > 0$ such that $\sup_{t\in T} \mathbb{P}|Z_t|F < \epsilon$ for every event $F$ with $\mathbb{P}F < \delta$.
REMARK. Requirement (i) is superfluous if, for each $\delta > 0$, the space $\Omega$ can be partitioned into finitely many pieces each with measure less than $\delta$.

Proof. Given uniform integrability, (i) follows from $\mathbb{P}|Z_t| \le M + \mathbb{P}|Z_t|\{|Z_t| > M\}$, and (ii) follows from $\mathbb{P}|Z_t|F \le M\,\mathbb{P}F + \mathbb{P}|Z_t|\{|Z_t| > M\}$.
Conversely, if (i) and (ii) hold then the event $\{|Z_t| > M\}$ is a candidate for the $F$ in (ii) when $M$ is so large that $\mathbb{P}\{|Z_t| > M\} \le \sup_{t\in T} \mathbb{P}|Z_t|/M < \delta$. It follows that $\sup_{t\in T} \mathbb{P}|Z_t|\{|Z_t| > M\} \le \epsilon$ if $M$ is large enough.
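A standard sufficient condition (not needed for the Lemma, but often the easiest way to verify uniform integrability in practice): if $\sup_{t\in T} \mathbb{P}|Z_t|^p < \infty$ for some $p > 1$, then on the event $\{|Z_t| > M\}$ we have $|Z_t| \le |Z_t|^p/M^{p-1}$, whence $\sup_{t\in T} \mathbb{P}|Z_t|\{|Z_t| > M\} \le M^{1-p}\sup_{t\in T}\mathbb{P}|Z_t|^p \to 0$ as $M \to \infty$.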
[Diagram: $X_n \to X$ almost surely, together with domination, implies $X_n \to X$ in $\mathcal{L}^1(\mathbb{P})$ with each $X_n$ integrable; almost sure convergence implies convergence in probability, by (a), and convergence in probability yields an almost surely convergent subsequence, by (b); convergence in probability together with uniform integrability is equivalent to convergence in $\mathcal{L}^1(\mathbb{P})$.]

The diagram summarizes the interconnections between the various convergence concepts, with each arrow denoting an implication. The relationship between almost sure convergence and convergence in probability corresponds to results (a) and (b) noted above. A family $\{Z_t : t \in T\}$ dominated by an integrable random variable $Y$


is also uniformly integrable, because $Y\{Y > M\} \ge |Z_t|\{|Z_t| > M\}$ for every $t$. Only the implications leading to and from the box for $\mathcal{L}^1$ convergence remain to be proved.
<32> Theorem. Let $\{X_n : n \in \mathbb{N}\}$ be a sequence of integrable random variables. The following two conditions are equivalent:
(i) The sequence is uniformly integrable and it converges in probability to a random variable $X_\infty$, which is necessarily integrable.
(ii) The sequence converges in $\mathcal{L}^1$ norm, $\mathbb{P}|X_n - X_\infty| \to 0$, with a limit $X_\infty$ that is necessarily integrable.
Proof. Suppose (i) holds. The assertion about integrability of $X_\infty$ follows from Fatou's lemma, because $|X_{n'}| \to |X_\infty|$ almost surely along some subsequence, so that $\mathbb{P}|X_\infty| \le \liminf_{n'} \mathbb{P}|X_{n'}| \le \sup_n \mathbb{P}|X_n| < \infty$. To prove $\mathcal{L}^1$ convergence, first split according to whether $|X_n - X_\infty|$ is less than $\epsilon$ or not, and then split according to whether $\max(|X_n|, |X_\infty|)$ is less than some large constant $M$ or not:
$$\mathbb{P}|X_n - X_\infty| \le \epsilon + \mathbb{P}(|X_n| + |X_\infty|)\{|X_n - X_\infty| > \epsilon\} \le \epsilon + 2M\,\mathbb{P}\{|X_n - X_\infty| > \epsilon\} + \mathbb{P}(|X_n| + |X_\infty|)\{|X_n| \vee |X_\infty| > M\}.$$
Split the event $\{|X_n| \vee |X_\infty| > M\}$ according to which of the two random variables is larger, to bound the last term by $2\mathbb{P}|X_n|\{|X_n| > M\} + 2\mathbb{P}|X_\infty|\{|X_\infty| > M\}$. Invoke uniform integrability of $\{X_n\}$ and integrability of $X_\infty$ to find an $M$ that makes this bound small, uniformly in $n$. With $M$ fixed, the convergence in probability sends $2M\,\mathbb{P}\{|X_n - X_\infty| > \epsilon\}$ to zero as $n \to \infty$.
Conversely, if the sequence converges in $\mathcal{L}^1$, then $X_\infty$ must be integrable, because $\mathbb{P}|X_\infty| \le \mathbb{P}|X_n| + \mathbb{P}|X_n - X_\infty|$ for each $n$. When $|X_n| \le M$ or $|X_\infty| > M/2$, the inequality
$$|X_n|\{|X_n| > M\} \le |X_\infty|\{|X_\infty| > M/2\} + 2|X_n - X_\infty|$$
is easy to check; and when $|X_\infty| \le M/2$ and $|X_n| > M$, it follows from the inequality $|X_n - X_\infty| \ge |X_n| - |X_\infty| \ge |X_n|/2$. Take expectations, choose $M$ large enough to make the contribution from $X_\infty$ small, then let $n$ tend to infinity to find an $n_0$ such that $\mathbb{P}|X_n|\{|X_n| > M\} \le \epsilon$ for $n \ge n_0$. Increase $M$ if necessary to handle the corresponding tail contributions for $n < n_0$.

9. Image measures and distributions


Suppose $\mu$ is a measure on a sigma-field $\mathcal{A}$ of subsets of $X$ and $T$ is a map from $X$ into a set $\mathcal{Y}$, equipped with a sigma-field $\mathcal{B}$. If $T$ is $\mathcal{A}\backslash\mathcal{B}$-measurable we can carry $\mu$ over to $\mathcal{Y}$, by defining
<33>  $\nu B := \mu(T^{-1}B)$  for each $B$ in $\mathcal{B}$.
Actually the operation is more one of carrying the sets back to the measure rather than carrying the measure over to the sets, but the net result is the definition of a new set function on $\mathcal{B}$.


It is easy to check that $\nu$ is a measure on $\mathcal{B}$, using facts such as $T^{-1}(B^c) = (T^{-1}B)^c$ and $T^{-1}(\cup_i B_i) = \cup_i T^{-1}B_i$. It is called the image measure of $\mu$ under $T$, or just the image measure, and is denoted by $\mu T^{-1}$ or $\mu T$ or $T(\mu)$, or even just $T\mu$. The third and fourth forms, which I prefer to use, have the nice property that if $\mu$ is a point mass concentrated at $x$ then $T(\mu)$ denotes a point mass concentrated at $T(x)$.

Starting from definition <33> we could prove facts about integrals with respect to the image measure $\nu$. For example, we could show
<34>  $\nu g = \mu(g \circ T)$  for all $g \in \mathcal{M}^+(\mathcal{Y}, \mathcal{B})$.
The small circle symbol $\circ$ denotes the composition of functions: $(g \circ T)(x) := g(Tx)$. The proof of <34> could follow the traditional path: first argue by linearity from <33> to establish the result for simple functions; then take monotone limits of simple functions to extend to $\mathcal{M}^+(\mathcal{Y}, \mathcal{B})$.
There is another method for constructing image measures that gets <34> all in one step. Define an increasing linear functional $\nu$ on $\mathcal{M}^+(\mathcal{Y}, \mathcal{B})$ by $\nu g := \mu(g \circ T)$. It inherits the Monotone Convergence property directly from $\mu$, because if $0 \le g_n \uparrow g$ then $0 \le g_n \circ T \uparrow g \circ T$. By Theorem <13> it corresponds to a uniquely determined measure on $\mathcal{B}$. When restricted to indicator functions of measurable sets the new measure coincides with the measure defined by <33>, because if $g$ is the indicator function of $B$ then $g \circ T$ is the indicator function of $T^{-1}B$. (Why?) We have gained a theorem with almost no extra work, by starting with the linear functional as the definition of the image measure.
Using the notation $T\mu$ for the image measure, we could rewrite the defining equality as $(T\mu)(g) := \mu(g \circ T)$, at least for all $g \in \mathcal{M}^+(\mathcal{Y}, \mathcal{B})$, a relationship that I find easier to remember.
REMARK. In the last sentence I used the qualifier at least, as a reminder that the equality could easily be extended to other cases. For example, by splitting into positive and negative parts then subtracting, we could extend the equality to functions in $\mathcal{L}^1(\mathcal{Y}, \mathcal{B}, \nu)$. And so on.
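The equality $(T\mu)(g) = \mu(g \circ T)$ is also easy to watch numerically. The following sketch (illustrative only; the map, the test function, and the density of the image measure are my own choices, not from the text) takes $\mu$ to be the uniform distribution on $(0,1)$ and $T(x) := x^2$, so that $T\mu$ has density $1/(2\sqrt{y})$ on $(0,1)$, a standard change-of-variables fact assumed here.

    import numpy as np

    rng = np.random.default_rng(0)

    # mu = Uniform(0,1), T(x) = x^2, g(y) = cos(y)
    x = rng.uniform(size=10**6)
    lhs = np.cos(x**2).mean()                 # Monte Carlo estimate of mu(g o T)

    # (T mu) has density 1/(2 sqrt(y)) on (0,1); integrate g against it by
    # the midpoint rule on a fine grid (the singularity at 0 is integrable).
    n = 10**6
    y = (np.arange(n) + 0.5) / n
    rhs = (np.cos(y) / (2.0 * np.sqrt(y))).mean()

    print(lhs, rhs)                           # both approximately 0.9045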

Several familiar probabilistic objects are just image measures. If $X$ is a random variable, the image measure $X(\mathbb{P})$ on $\mathcal{B}(\mathbb{R})$ is often written $\mathbb{P}_X$, and is called the distribution of $X$. More generally, if $X$ and $Y$ are random variables defined on the same probability space, they together define a random vector, a (measurable, see Chapter 4) map $T(\omega) = (X(\omega), Y(\omega))$ from $\Omega$ into $\mathbb{R}^2$. The image measure $T(\mathbb{P})$ on $\mathcal{B}(\mathbb{R}^2)$ is called the joint distribution of $X$ and $Y$, and is often denoted by $\mathbb{P}_{X,Y}$. Similar terminology applies for larger collections of random variables.
Image measures also figure in a construction that is discussed nonrigorously in many introductory textbooks. Let $P$ be a probability measure on $\mathcal{B}(\mathbb{R})$. Its distribution function (also known as a cumulative distribution function) is defined by $F_P(x) := P(-\infty, x]$ for $x \in \mathbb{R}$. Don't confuse distribution, as a synonym for probability measure, with distribution function, which is a function derived from the measures of a particular collection of sets. The distribution function has the following properties.
(a) It is increasing, with $\lim_{x\to-\infty} F_P(x) = 0$ and $\lim_{x\to\infty} F_P(x) = 1$.


(b) It is continuous from the right: to each $\epsilon > 0$ and $x \in \mathbb{R}$, there exists a $\delta > 0$ such that $F_P(x) \le F_P(y) \le F_P(x) + \epsilon$ for $x \le y \le x + \delta$.
Property (a) follows from the fact that the integral is an increasing functional, and from Dominated Convergence applied to the sequences $(-\infty, -n] \downarrow \emptyset$ and $(-\infty, n] \uparrow \mathbb{R}$ as $n \to \infty$. Property (b) also follows from Dominated Convergence, applied to the sequence $(-\infty, x + 1/n] \downarrow (-\infty, x]$.
Except in introductory textbooks, and in works dealing with the order properties of the real line (such as the study of ranks and order statistics), distribution functions have a reduced role to play in modern probability theory, mostly in connection with the following method for building measures on $\mathcal{B}(\mathbb{R})$ as images of Lebesgue measure. In probability theory the construction often goes by the name of quantile transformation.
<35> Example. There is a converse to the assertions (a) and (b) about distribution functions. Suppose $F$ is a right-continuous, increasing function on $\mathbb{R}$ for which $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to\infty} F(x) = 1$. Then there exists a probability measure $P$ such that $P(-\infty, x] = F(x)$ for all real $x$. To construct such a $P$, consider the quantile function $q$, defined by $q(t) := \inf\{x : F(x) \ge t\}$ for $0 < t < 1$.
By right continuity of the increasing function $F$, the set $\{x \in \mathbb{R} : F(x) \ge t\}$ is a closed interval of the form $[\alpha, \infty)$, with $\alpha = q(t)$. That is, for all $x \in \mathbb{R}$ and all $t \in (0, 1)$,
<36>  $F(x) \ge t$  if and only if  $x \ge q(t)$.
In general there are many plausible, but false, equalities related to <36>. For example, it is not true in general that $F(q(t)) = t$. However, if $F$ is continuous and strictly increasing, then $q$ is just the inverse function of $F$, and the plausible equalities hold.
Let $m$ denote Lebesgue measure restricted to the Borel sigma-field on $(0, 1)$. The image measure $P := q(m)$ has the desired property,
$$P(-\infty, x] = m\{t : q(t) \le x\} = m\{t : t \le F(x)\} = F(x),$$
the first equality by definition of the image measure, and the second by equality <36>. The result is often restated as: if $\xi$ has a Uniform$(0, 1)$ distribution then $q(\xi)$ has distribution function $F$.
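The same identity underlies the standard simulation recipe known as inverse transform sampling. A minimal sketch (illustrative only; the exponential target is my choice, not from the text), where $F(x) = 1 - e^{-x}$ for $x \ge 0$ and so $q(t) = -\log(1 - t)$:

    import numpy as np

    rng = np.random.default_rng(1)

    # xi ~ Uniform(0,1) plays the role of the measure m
    t = rng.uniform(size=10**6)
    samples = -np.log1p(-t)        # q(xi), which has distribution function F

    # empirical check: the proportion of samples <= x approximates F(x)
    for x in (0.5, 1.0, 2.0):
        print(x, (samples <= x).mean(), 1 - np.exp(-x))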
10. Generating classes of sets
To prove that all sets in a sigma-field $\mathcal{A}$ have some property one often resorts to a generating-class argument. The simplest form of such an argument has three steps:
(i) Show that all members of a subclass $\mathcal{E}$ have the property.
(ii) Show that $\mathcal{A} \subseteq \sigma(\mathcal{E})$.
(iii) Show that $\mathcal{A}_0 := \{A \in \mathcal{A} : A \text{ has the property}\}$ is a sigma-field.
Then one deduces that $\mathcal{A}_0 = \sigma(\mathcal{A}_0) \supseteq \sigma(\mathcal{E}) \supseteq \mathcal{A}$, whence $\mathcal{A}_0 = \mathcal{A}$. That is, the property holds for all sets in $\mathcal{A}$.


For some properties, direct verification of all the sigma-field requirements for $\mathcal{A}_0$ proves too difficult. In such situations an indirect argument sometimes succeeds if $\mathcal{E}$ has some extra structure. For example, if it is possible to establish that $\mathcal{A}_0$ is a $\lambda$-system of sets, then one need only check one extra requirement for $\mathcal{E}$ in order to produce a successful generating-class argument.
<37> Definition. A class $\mathcal{D}$ of subsets of $X$ is called a $\lambda$-system if
(i) $X \in \mathcal{D}$;
(ii) if $D_1, D_2 \in \mathcal{D}$ and $D_1 \supseteq D_2$ then $D_1 \backslash D_2 \in \mathcal{D}$;
(iii) if $\{D_n\}$ is an increasing sequence of sets in $\mathcal{D}$ then $\cup_n D_n \in \mathcal{D}$.
REMARK. Some authors start from a slightly different definition, replacing requirement (iii) by
(iii)' if $\{D_n\}$ is a sequence of disjoint sets in $\mathcal{D}$ then $\cup_n D_n \in \mathcal{D}$.
The change in definition would have little effect on the role played by $\lambda$-systems. Many authors (including me, until recently) use the name Dynkin class instead of $\lambda$-system, but the name Sierpiński class would be more appropriate. See the Notes at the end of this Chapter.

Notice that a $\lambda$-system is also a sigma-field if and only if it is stable under finite intersections. This stability property can be inherited from a subclass $\mathcal{E}$, as in the next Theorem, which is sometimes referred to as the $\pi$-$\lambda$ theorem. The $\pi$ stands for product, an indirect reference to the stability of the subclass $\mathcal{E}$ under finite intersections (products). I think that the letter $\lambda$ stands for limit, an indirect reference to property (iii).
<38> Theorem. If $\mathcal{E}$ is stable under finite intersections, and if $\mathcal{D}$ is a $\lambda$-system with $\mathcal{D} \supseteq \mathcal{E}$, then $\mathcal{D} \supseteq \sigma(\mathcal{E})$.
Proof. It would be enough to show that $\mathcal{D}$ is a sigma-field, by establishing that it is stable under finite intersections, but that is a little more than I know how to do. Instead we need to work with a (possibly) smaller $\lambda$-system $\mathcal{D}_0$, with $\mathcal{D} \supseteq \mathcal{D}_0 \supseteq \mathcal{E}$, for which generating class arguments can extend the assumption
<39>  $E_1 E_2 \in \mathcal{E}$  for all $E_1, E_2$ in $\mathcal{E}$
to an assertion that
<40>  $D_1 D_2 \in \mathcal{D}_0$  for all $D_1, D_2$ in $\mathcal{D}_0$.
It will then follow that $\mathcal{D}_0$ is a sigma-field, which contains $\mathcal{E}$, and hence $\mathcal{D} \supseteq \mathcal{D}_0 \supseteq \sigma(\mathcal{E})$.
The choice of $\mathcal{D}_0$ is easy. Let $\{\mathcal{D}_\alpha : \alpha \in A\}$ be the collection of all $\lambda$-systems with $\mathcal{D}_\alpha \supseteq \mathcal{E}$, one of them being the $\mathcal{D}$ we started with. Let $\mathcal{D}_0$ equal the intersection of all these $\mathcal{D}_\alpha$. That is, let $\mathcal{D}_0$ consist of all sets $D$ for which $D \in \mathcal{D}_\alpha$ for each $\alpha$. I leave it to you to check the easy details that prove $\mathcal{D}_0$ to be a $\lambda$-system. In other words, $\mathcal{D}_0$ is the smallest $\lambda$-system containing $\mathcal{E}$; it is the $\lambda$-system generated by $\mathcal{E}$.
To upgrade <39> to <40> we have to replace each $E_i$ on the left-hand side by a $D_i$ in $\mathcal{D}_0$, without going outside the class $\mathcal{D}_0$. The trick is to work one component at a time. Start with $E_1$. Define $\mathcal{D}_1 := \{A : AE \in \mathcal{D}_0 \text{ for each } E \in \mathcal{E}\}$. From <39>, we have $\mathcal{D}_1 \supseteq \mathcal{E}$. If we show that $\mathcal{D}_1$ is a $\lambda$-system then it will follow that $\mathcal{D}_1 \supseteq \mathcal{D}_0$, because $\mathcal{D}_0$ is the smallest $\lambda$-system containing $\mathcal{E}$. Actually, the assertion that $\mathcal{D}_1$ is


a $\lambda$-system is trivial; it follows immediately from the $\lambda$-system properties for $\mathcal{D}_0$ and identities like $(A_1 \backslash A_2)E = (A_1 E)\backslash(A_2 E)$ and $(\cup_i A_i)E = \cup_i(A_i E)$.
The inclusion $\mathcal{D}_1 \supseteq \mathcal{D}_0$ implies that $D_1 E_2 \in \mathcal{D}_0$ for all $D_1 \in \mathcal{D}_0$ and all $E_2 \in \mathcal{E}$. Put another way (this step is the only subtlety in the proof), we can assert that the class $\mathcal{D}_2 := \{B : BD \in \mathcal{D}_0 \text{ for each } D \in \mathcal{D}_0\}$ contains $\mathcal{E}$. Just write $D_1$ instead of $D$, and $E_2$ instead of $B$, in the definition to see that it is only a matter of switching the order of the sets.
Argue in the same way as for $\mathcal{D}_1$ to show that $\mathcal{D}_2$ is also a $\lambda$-system. It then follows that $\mathcal{D}_2 \supseteq \mathcal{D}_0$, which is another way of expressing assertion <40>.
The proof of the last Theorem is typical of many generating class arguments,
in that it is trivial once one knows what one has to check. The Theorem, or its
analog for classes of functions (see the next Section), will be my main method for
establishing sigma-field properties. You will be getting plenty of practice at filling
in the details behind frequent assertions of "a generating class argument shows that
...." Here is a typical example to get you started.

<41> Exercise. Let $\mu$ and $\nu$ be finite measures on $\mathcal{B}(\mathbb{R})$ with the same distribution function. That is, $\mu(-\infty, t] = \nu(-\infty, t]$ for all real $t$. Show that $\mu B = \nu B$ for all $B \in \mathcal{B}(\mathbb{R})$, that is, $\mu = \nu$ as Borel measures.
SOLUTION: Write $\mathcal{E}$ for the class of all intervals $(-\infty, t]$, with $t \in \mathbb{R}$. Clearly $\mathcal{E}$ is stable under finite intersections. From Example <4>, we know that $\sigma(\mathcal{E}) = \mathcal{B}(\mathbb{R})$. It is easy to check that the class $\mathcal{D} := \{B \in \mathcal{B}(\mathbb{R}) : \mu B = \nu B\}$ is a $\lambda$-system. For example, if $B_n \in \mathcal{D}$ and $B_n \uparrow B$ then $\mu B = \lim_n \mu B_n = \lim_n \nu B_n = \nu B$, by Monotone Convergence. It follows from Theorem <38> that $\mathcal{D} \supseteq \sigma(\mathcal{E}) = \mathcal{B}(\mathbb{R})$, and the equality of the two Borel measures is established.
When you employ a $\lambda$-system argument be sure to verify the properties required of $\mathcal{E}$. The next Example shows what can happen if you forget about the stability under finite intersections.

<42> Example. Consider a set $X$ consisting of four points, labelled nw, ne, sw, and se. Let $\mathcal{E}$ consist of $X$ and the subsets $N = \{nw, ne\}$, $S = \{sw, se\}$, $E = \{ne, se\}$, and $W = \{nw, sw\}$. Notice that $\mathcal{E}$ generates the sigma-field of all subsets of $X$, but it is not stable under finite intersections. Let $\mu$ and $\nu$ be probability measures for which
$$\mu(\text{nw}) = 1/2, \quad \mu(\text{ne}) = 0, \quad \mu(\text{sw}) = 0, \quad \mu(\text{se}) = 1/2,$$
$$\nu(\text{nw}) = 0, \quad \nu(\text{ne}) = 1/2, \quad \nu(\text{sw}) = 1/2, \quad \nu(\text{se}) = 0.$$
Both measures give the value 1/2 to each of $N$, $S$, $E$, and $W$, but they differ in the values they give to the four singletons.

*11. Generating classes of functions


Theorem <38> is often used as the starting point for proving facts about measurable functions. One first invokes the Theorem to establish a property for sets in a sigma-field, then one extends by taking limits of simple functions to $\mathcal{M}^+$ and beyond, using Monotone Convergence and linearity arguments. Sometimes it is simpler to invoke an analog of the $\lambda$-system property for classes of functions.

<43> Definition. Call a class $\mathcal{H}^+$ of bounded, nonnegative functions on a set $X$ a $\lambda$-cone if:
(i) $\mathcal{H}^+$ is a cone, that is, if $h_1, h_2 \in \mathcal{H}^+$ and $\alpha_1$ and $\alpha_2$ are nonnegative constants then $\alpha_1 h_1 + \alpha_2 h_2 \in \mathcal{H}^+$;
(ii) each nonnegative constant function belongs to $\mathcal{H}^+$;
(iii) if $h_1, h_2 \in \mathcal{H}^+$ and $h_1 \ge h_2$ then $h_1 - h_2 \in \mathcal{H}^+$;
(iv) if $\{h_n\}$ is an increasing sequence of functions in $\mathcal{H}^+$ whose pointwise limit $h$ is bounded then $h \in \mathcal{H}^+$.
Typically $\mathcal{H}^+$ consists of the nonnegative functions in a vector space of bounded functions that is stable under pairwise maxima and minima.
REMARK. The name $\lambda$-cone is not standard. I found it hard to come up with a name that was both suggestive of the defining properties and analogous to the name for the corresponding classes of sets. For a while I used the term Dynkin-cone but abandoned it for historical reasons. (See the Notes.) I also toyed with the name cdl-cone, as a reminder that the cone contains the (positive) constant functions and that it is stable under (proper) differences and (monotone increasing) limits of uniformly bounded sequences.

The sigma-field properties of $\lambda$-cones are slightly harder to establish than their $\lambda$-system analogs, but the reward of more streamlined proofs will make the extra, one-time effort worthwhile. First we need an analog of the fact that a $\lambda$-system that is stable under finite intersections is also a sigma-field.
<44> Lemma. If a $\lambda$-cone $\mathcal{H}^+$ is stable under the formation of pointwise products of pairs of functions then it consists of all bounded, nonnegative, $\sigma(\mathcal{H}^+)$-measurable functions, where $\sigma(\mathcal{H}^+)$ denotes the sigma-field generated by $\mathcal{H}^+$.
Proof. First note that $\mathcal{H}^+$ must be stable under uniform limits. For suppose $h_n \to h$ uniformly, with $h_n \in \mathcal{H}^+$. Write $\delta_n$ for $2^{-n}$. With no loss of generality we may suppose $h_n + \delta_n \ge h \ge h_n - \delta_n$ for all $n$. Notice that
$$h_n + 3\delta_n = h_n + \delta_n + \delta_{n-1} \ge h + \delta_{n-1} \ge h_{n-1}.$$
From the monotone convergence $0 \le h_n + 3(\delta_1 + \ldots + \delta_n) \uparrow h + 3$, deduce that $h + 3 \in \mathcal{H}^+$, and hence, via the proper difference property (iii), $h \in \mathcal{H}^+$.
Via uniform limits we can now show that $\mathcal{H}^+$ is stable under composition with any continuous nonnegative function $f$. Let $h$ be a member of $\mathcal{H}^+$, bounded above by a constant $D$. By a trivial generalization of Problem [25], there exists a sequence of polynomials $p_n(\cdot)$ such that $\sup_{|t| \le D} |p_n(t) - f(t)| \le 1/n$. The function $f_n(h) := p_n(h) + 1/n$ takes only nonnegative values, and it converges uniformly to $f(h)$. Suppose $f_n(t) = a_0 + a_1 t + \ldots + a_k t^k$. Then
$$f_n(h) = \sum_i a_i^+ h^i - \sum_i a_i^- h^i \ge 0.$$
By virtue of properties (i) and (ii) of $\lambda$-cones, and the assumed stability under products, both terms on the right-hand side belong to $\mathcal{H}^+$. The proper differencing property then gives $f_n(h) \in \mathcal{H}^+$. Pass uniformly to the limit to get $f(h) \in \mathcal{H}^+$.
Write $\mathcal{E}$ for the class of all sets of the form $\{h < C\}$, with $h \in \mathcal{H}^+$ and $C$ a positive constant. From Example <7>, every $h$ in $\mathcal{H}^+$ is $\sigma(\mathcal{E})$-measurable, and hence $\sigma(\mathcal{E}) = \sigma(\mathcal{H}^+)$. For a fixed $h$ and $C$, the continuous function $(1 - (h/C)^n)^+$ of $h$ belongs to $\mathcal{H}^+$, and it increases monotonely to the indicator of $\{h < C\}$. Thus the indicators of all sets in $\mathcal{E}$ belong to $\mathcal{H}^+$. The assumptions about $\mathcal{H}^+$ ensure that the class $\mathcal{F}$ of all sets whose indicator functions belong to $\mathcal{H}^+$ is stable under finite intersections (products), complements (subtract from 1), and increasing countable unions (monotone increasing limits). That is, $\mathcal{F}$ is a $\lambda$-system, stable under finite intersections, and containing $\mathcal{E}$. It is a sigma-field containing $\mathcal{E}$. Thus $\mathcal{F} \supseteq \sigma(\mathcal{E}) = \sigma(\mathcal{H}^+)$. That is, $\mathcal{H}^+$ contains all indicators of sets in $\sigma(\mathcal{H}^+)$.
Finally, let $k$ be a bounded, nonnegative, $\sigma(\mathcal{H}^+)$-measurable function. From the fact that each of the sets $\{k \ge i/2^n\}$, for $i = 1, \ldots, 4^n$, belongs to the cone $\mathcal{H}^+$, we have $k_n := 2^{-n}\sum_{i=1}^{4^n}\{k \ge i/2^n\} \in \mathcal{H}^+$. The functions $k_n$ increase monotonely to $k$, which consequently also belongs to $\mathcal{H}^+$.
<45> Theorem. Let $\mathcal{H}^+$ be a $\lambda$-cone of bounded, nonnegative functions, and $\mathcal{G}$ be a subclass of $\mathcal{H}^+$ that is stable under the formation of pointwise products of pairs of functions. Then $\mathcal{H}^+$ contains all bounded, nonnegative, $\sigma(\mathcal{G})$-measurable functions.
Proof. Let $\mathcal{H}_0^+$ be the smallest $\lambda$-cone containing $\mathcal{G}$. From the previous Lemma, it is enough to show that $\mathcal{H}_0^+$ is stable under pairwise products.
Argue as in Theorem <38> for $\lambda$-systems of sets. A routine calculation shows that $\mathcal{H}_1^+ := \{h \in \mathcal{H}_0^+ : hg \in \mathcal{H}_0^+ \text{ for all } g \text{ in } \mathcal{G}\}$ is a $\lambda$-cone containing $\mathcal{G}$, and hence $\mathcal{H}_1^+ = \mathcal{H}_0^+$. That is, $h_0 g \in \mathcal{H}_0^+$ for all $h_0 \in \mathcal{H}_0^+$ and $g \in \mathcal{G}$. Similarly, the class $\mathcal{H}_2^+ := \{h \in \mathcal{H}_0^+ : h_0 h \in \mathcal{H}_0^+ \text{ for all } h_0 \text{ in } \mathcal{H}_0^+\}$ is a $\lambda$-cone. By the result for $\mathcal{H}_1^+$ we have $\mathcal{H}_2^+ \supseteq \mathcal{G}$, and hence $\mathcal{H}_2^+ = \mathcal{H}_0^+$. That is, $\mathcal{H}_0^+$ is stable under products.

<46> Exercise. Let $\mu$ be a finite measure on $\mathcal{B}(\mathbb{R}^k)$. Write $C_0$ for the vector space of all continuous real functions on $\mathbb{R}^k$ with compact support. Suppose $f$ belongs to $\mathcal{L}^1(\mu)$. Show that for each $\epsilon > 0$ there exists a $g$ in $C_0$ such that $\mu|f - g| < \epsilon$. That is, show that $C_0$ is dense in $\mathcal{L}^1(\mu)$ under its $\mathcal{L}^1$ norm.
SOLUTION: Define $\mathcal{H}$ as the collection of all bounded functions in $\mathcal{L}^1(\mu)$ that can be approximated arbitrarily closely by functions from $C_0$. Check that the class $\mathcal{H}^+$ of nonnegative functions in $\mathcal{H}$ is a $\lambda$-cone. Trivially it contains $C_0^+$, the class of nonnegative members of $C_0$. The sigma-field $\sigma(C_0^+)$ coincides with the Borel sigma-field. (Why?) The class $\mathcal{H}^+$ consists of all bounded, nonnegative Borel measurable functions.
To approximate a general $f$ in $\mathcal{L}^1(\mu)$, first reduce to the case of nonnegative functions by splitting into positive and negative parts. Then invoke Dominated Convergence to find a finite $n$ for which $\mu|f^+ - f^+ \wedge n| < \epsilon$, then approximate $f^+ \wedge n$ by a member of $C_0^+$. See Problem [26] for the extension of the approximation result to infinite measures.

12. Problems

[1] Suppose events $A_1, A_2, \ldots$ in a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ are independent, meaning that $\mathbb{P}(A_{i_1}A_{i_2}\ldots A_{i_k}) = \mathbb{P}A_{i_1}\mathbb{P}A_{i_2}\ldots\mathbb{P}A_{i_k}$ for all choices of distinct subscripts $i_1, i_2, \ldots, i_k$, all $k$. Suppose $\sum_{i=1}^\infty \mathbb{P}A_i = \infty$.


(i) Using the inequality $e^{-x} \ge 1 - x$, show that
$$\mathbb{P}\max_{n \le i \le m} A_i = 1 - \prod_{n \le i \le m}(1 - \mathbb{P}A_i) \ge 1 - \exp\Big(-\sum_{n \le i \le m} \mathbb{P}A_i\Big).$$
(ii) Let $m$ then $n$ tend to infinity, to deduce (via Dominated Convergence) that $\mathbb{P}\{\limsup_i A_i\} = 1$. That is, $\mathbb{P}\{A_i \text{ i.o.}\} = 1$.
REMARK. The result gives a converse for the Borel-Cantelli lemma from Example <29>. The next Problem establishes a similar result under weaker assumptions.
[2] Let $A_1, A_2, \ldots$ be events in a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Define $X_n := A_1 + \ldots + A_n$ and $\sigma_n := \mathbb{P}X_n$. Suppose $\sigma_n \to \infty$ and $\|X_n/\sigma_n\|_2 \to 1$. (Compare with the inequality $\|X_n/\sigma_n\|_2 \ge 1$, which follows from Jensen's inequality.)
(i) Show that $\mathbb{P}\{|X_n/\sigma_n - 1| \ge 1/k\} \le k^2\left(\|X_n/\sigma_n\|_2^2 - 1\right)$ for each positive integer $k$.
(ii) By an appropriate choice of $k$ (depending on $n$) in (i), deduce that $\sum_{i=1}^\infty A_i = \infty$ almost surely.
(iii) Prove that $\sum_{i=m}^\infty A_i = \infty$ almost surely, for each fixed $m$. Hint: Show that the two convergence assumptions also hold for the sequence $A_m, A_{m+1}, \ldots$.
(iv) Deduce that $\mathbb{P}\{\omega \in A_i \text{ i.o.}\} = 1$.
(v) If $\{B_i\}$ is a sequence of events for which $\sum_i \mathbb{P}B_i = \infty$ and $\mathbb{P}B_iB_j = \mathbb{P}B_i\mathbb{P}B_j$ for $i \ne j$, show that $\mathbb{P}\{\omega \in B_i \text{ i.o.}\} = 1$.
[3] Suppose $T$ is a function from a set $X$ into a set $\mathcal{Y}$, and suppose that $\mathcal{Y}$ is equipped with a $\sigma$-field $\mathcal{B}$. Define $\mathcal{A}$ as the sigma-field of sets of the form $T^{-1}B$, with $B$ in $\mathcal{B}$. Suppose $f \in \mathcal{M}^+(X, \mathcal{A})$. Show that there exists a $\mathcal{B}\backslash\mathcal{B}[0, \infty]$-measurable function $g$ from $\mathcal{Y}$ into $[0, \infty]$ such that $f(x) = g(T(x))$ for all $x$ in $X$, by following these steps.
(i) Show that $\mathcal{A}$ is a $\sigma$-field on $X$. (It is called the $\sigma$-field generated by the map $T$. It is often denoted by $\sigma(T)$.)
(ii) Show that $\{f \ge i/2^n\} = T^{-1}B_{i,n}$ for some $B_{i,n}$ in $\mathcal{B}$. Define
$$f_n := 2^{-n}\sum_{i=1}^{4^n}\{f \ge i/2^n\} \qquad\text{and}\qquad g_n := 2^{-n}\sum_{i=1}^{4^n} B_{i,n}.$$
Show that $f_n(x) = g_n(T(x))$ for all $x$.
(iii) Define $g(y) := \limsup_n g_n(y)$ for each $y$ in $\mathcal{Y}$. Show that $g$ has the desired property. (Question: Why can't we define $g(y) := \lim_n g_n(y)$?)
[4] Let $g_1, g_2, \ldots$ be $\mathcal{A}\backslash\mathcal{B}(\mathbb{R})$-measurable functions from $X$ into $\mathbb{R}$. Show that
$$\{\limsup_n g_n > t\} = \bigcup_{r \in \mathbb{Q},\, r > t}\ \bigcap_{m=1}^\infty\ \bigcup_{n \ge m}\{g_n > r\}.$$
Deduce, without any appeal to Example <8>, that $\limsup_n g_n$ is $\mathcal{A}\backslash\mathcal{B}(\overline{\mathbb{R}})$-measurable. Warning: Be careful about strict inequalities that turn into nonstrict inequalities in the limit; it is possible to have $x_n > x$ for all $n$ and still have $\limsup_n x_n = x$.
[5] Suppose a class $\mathcal{E}$ of sets cannot separate a particular pair of points $x, y$: for every $E$ in $\mathcal{E}$, either $\{x, y\} \subseteq E$ or $\{x, y\} \subseteq E^c$. Show that $\sigma(\mathcal{E})$ also cannot separate the pair.

[6] A collection of sets $\mathcal{F}_0$ that is stable under finite unions, finite intersections, and complements is called a field. A nonnegative set function $\mu$ defined on $\mathcal{F}_0$ is called a finitely additive measure if $\mu(\cup_{i \le n}F_i) = \sum_{i \le n}\mu F_i$ for every finite collection of disjoint sets in $\mathcal{F}_0$. The set function is said to be countably additive on $\mathcal{F}_0$ if $\mu(\cup_{i \in \mathbb{N}}F_i) = \sum_{i \in \mathbb{N}}\mu F_i$ for every countable collection of disjoint sets in $\mathcal{F}_0$ whose union belongs to $\mathcal{F}_0$. Suppose $\mu X < \infty$. Show that $\mu$ is countably additive on $\mathcal{F}_0$ if and only if $\mu A_n \downarrow 0$ for every decreasing sequence in $\mathcal{F}_0$ with empty intersection. Hint: For the argument in one direction, consider the union of differences $A_i\backslash A_{i+1}$.

[7] Let $f_1, \ldots, f_n$ be functions in $\mathcal{M}^+(X, \mathcal{A})$, and let $\mu$ be a measure on $\mathcal{A}$. Show that
$$\mu(\vee_i f_i) \le \sum_i \mu f_i \le \mu(\vee_i f_i) + \sum_{i<j}\mu(f_i \wedge f_j),$$
where $\vee$ denotes pointwise maxima of functions and $\wedge$ denotes pointwise minima.

[8] Let $\mu$ be a finite measure and $f$ be a measurable function. For each positive integer $k$, show that $\mu|f|^k < \infty$ if and only if $\sum_{n=1}^\infty n^{k-1}\mu\{|f| \ge n\} < \infty$.

[9] Suppose $\nu := T\mu$, the image of the measure $\mu$ under the measurable map $T$. Show that $f \in \mathcal{L}^1(\nu)$ if and only if $f \circ T \in \mathcal{L}^1(\mu)$, in which case $\nu f = \mu(f \circ T)$.

[10] Let $\{h_n\}$, $\{f_n\}$, and $\{g_n\}$ be sequences of $\mu$-integrable functions that converge $\mu$ almost everywhere to limits $h$, $f$, and $g$. Suppose $h_n(x) \le f_n(x) \le g_n(x)$ for all $x$. Suppose also that $\mu h_n \to \mu h$ and $\mu g_n \to \mu g$. Adapt the proof of Dominated Convergence to prove that $\mu f_n \to \mu f$.

[11] A collection of sets is called a monotone class if it is stable under unions of increasing sequences and intersections of decreasing sequences. Adapt the argument from Theorem <38> to prove: if a class $\mathcal{E}$ is stable under finite unions and complements then $\sigma(\mathcal{E})$ equals the smallest monotone class containing $\mathcal{E}$.

[12] Let $\mu$ be a finite measure on the Borel sigma-field $\mathcal{B}(X)$ of a metric space $X$. Call a set $B$ inner regular if $\mu B = \sup\{\mu F : B \supseteq F \text{ closed}\}$ and outer regular if $\mu B = \inf\{\mu G : B \subseteq G \text{ open}\}$.
(i) Prove that the class $\mathcal{B}_0$ of all Borel sets that are both inner and outer regular is a sigma-field. Deduce that every Borel set is inner regular.
(ii) Suppose $\mu$ is tight: for each $\epsilon > 0$ there exists a compact $K_\epsilon$ such that $\mu K_\epsilon^c < \epsilon$. Show that the $F$ in the definition of inner regularity can then be assumed compact.
(iii) When $\mu$ is tight, show that there exists a sequence of disjoint compact subsets $\{K_i : i \in \mathbb{N}\}$ of $X$ such that $\mu(\cup_i K_i)^c = 0$.

[13] Let $\mu$ be a finite measure on the Borel sigma-field of a complete, separable metric space $X$. Show that $\mu$ is tight: for each $\epsilon > 0$ there exists a compact $K$ such that $\mu K^c < \epsilon$. Hint: For each positive integer $n$, show that the space $X$ is a countable union of closed balls with radius $1/n$. Find a finite family of such balls whose union $B_n$ has $\mu$ measure greater than $\mu X - \epsilon/2^n$. Show that $\cap_n B_n$ is compact, using the total-boundedness characterization of compact subsets of complete metric spaces.
[14] A sequence of random variables $\{X_n\}$ is said to converge in probability to a random variable $X$, written $X_n \to X$ in probability, if $\mathbb{P}\{|X_n - X| > \epsilon\} \to 0$ for each $\epsilon > 0$.
(i) If $X_n \to X$ almost surely, show that $1 \ge \{|X_n - X| > \epsilon\} \to 0$ almost surely. Deduce via Dominated Convergence that $X_n$ converges in probability to $X$.
(ii) Give an example of a sequence $\{X_n\}$ that converges to $X$ in probability but not almost surely.
(iii) Suppose $X_n \to X$ in probability. Show that there is an increasing sequence of positive integers $\{n(k)\}$ for which $\sum_k \mathbb{P}\{|X_{n(k)} - X| > 1/k\} < \infty$. Deduce that $X_{n(k)} \to X$ almost surely.

[15] Let $f$ and $g$ be measurable functions on $(X, \mathcal{A}, \mu)$, and $r$ and $s$ be positive real numbers for which $r^{-1} + s^{-1} = 1$. Show that $\mu|fg| \le (\mu|f|^r)^{1/r}(\mu|g|^s)^{1/s}$ by arguing as follows. First dispose of the trivial case where one of the factors on the right-hand side is 0 or $\infty$. Then, without loss of generality (why?), assume that $\mu|f|^r = 1 = \mu|g|^s$. Use concavity of the logarithm function to show that $|fg| \le |f|^r/r + |g|^s/s$, and then integrate with respect to $\mu$. This result is called the Hölder inequality.

[16] Generalize the Hölder inequality (Problem [15]) to more than two measurable functions: for $f_1, \ldots, f_k$ and positive real numbers $r_1, \ldots, r_k$ for which $\sum_i r_i^{-1} = 1$, show that $\mu|f_1 \ldots f_k| \le \prod_i (\mu|f_i|^{r_i})^{1/r_i}$.

[17] Let $(X, \mathcal{A}, \mu)$ be a measure space, $f$ and $g$ be measurable functions, and $r$ be a real number with $r \ge 1$. Define $\|f\|_r := (\mu|f|^r)^{1/r}$. Follow these steps to prove Minkowski's inequality: $\|f + g\|_r \le \|f\|_r + \|g\|_r$.
(i) From the inequality $|x + y|^r \le |2x|^r + |2y|^r$ deduce that $\|f + g\|_r < \infty$ if $\|f\|_r < \infty$ and $\|g\|_r < \infty$.
(ii) Dispose of trivial cases, such as $\|f\|_r = 0$ or $\|f\|_r = \infty$.
(iii) For arbitrary positive constants $c$ and $d$ argue by convexity that
$$\left(\frac{|f| + |g|}{c + d}\right)^r \le \frac{c}{c + d}\left(\frac{|f|}{c}\right)^r + \frac{d}{c + d}\left(\frac{|g|}{d}\right)^r.$$
(iv) Integrate, then choose $c := \|f\|_r$ and $d := \|g\|_r$ to complete the proof.

[18] For $f$ in $\mathcal{L}^1(\mu)$ define $\|f\|_1 := \mu|f|$. Let $\{f_n\}$ be a Cauchy sequence in $\mathcal{L}^1(\mu)$, that is, $\|f_n - f_m\|_1 \to 0$ as $\min(m, n) \to \infty$. Show that there exists an $f$ in $\mathcal{L}^1(\mu)$ for which $\|f_n - f\|_1 \to 0$, by following these steps.
(i) Find an increasing sequence $\{n(k)\}$ such that $\sum_k \|f_{n(k)} - f_{n(k+1)}\|_1 < \infty$. Deduce that the function $H := \sum_{k=1}^\infty |f_{n(k)} - f_{n(k+1)}|$ is integrable.
(ii) Show that there exists a real-valued, measurable function $f$ for which
$$H \ge |f_{n(k)}(x) - f(x)| \to 0 \qquad\text{as } k \to \infty\text{, for } \mu \text{ almost all } x.$$
Deduce that $\|f_{n(k)} - f\|_1 \to 0$ as $k \to \infty$.
(iii) Show that $f$ belongs to $\mathcal{L}^1(\mu)$ and $\|f_n - f\|_1 \to 0$ as $n \to \infty$.
[19] Let $\{f_n\}$ be a Cauchy sequence in $\mathcal{L}^p(X, \mathcal{A}, \mu)$, that is, $\|f_n - f_m\|_p \to 0$ as $\min(m, n) \to \infty$. Show that there exists a function $f$ in $\mathcal{L}^p(X, \mathcal{A}, \mu)$ for which $\|f_n - f\|_p \to 0$, by following these steps.
(i) Find an increasing sequence $\{n(k)\}$ such that $C := \sum_{k=1}^\infty \|f_{n(k)} - f_{n(k+1)}\|_p < \infty$. Define $H_\infty := \lim_{N\to\infty} H_N$, where $H_N := \sum_{k=1}^N |f_{n(k)} - f_{n(k+1)}|$ for $1 \le N \le \infty$. Use the triangle inequality to show that $\mu H_N^p \le C^p$ for all finite $N$. Then use Monotone Convergence to deduce that $\mu H_\infty^p \le C^p$.
(ii) Show that there exists a real-valued, measurable function $f$ for which $f_{n(k)}(x) \to f(x)$ as $k \to \infty$, a.e. $[\mu]$.
(iii) Show that $|f_{n(k)} - f| \le \sum_{i \ge k} |f_{n(i)} - f_{n(i+1)}| \le H_\infty$ a.e. $[\mu]$. Use Dominated Convergence to deduce that $\|f_{n(k)} - f\|_p \to 0$ as $k \to \infty$.
(iv) Deduce from (iii) that $f$ belongs to $\mathcal{L}^p(X, \mathcal{A}, \mu)$ and $\|f_n - f\|_p \to 0$ as $n \to \infty$.

[20] For each random variable $X$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ define
$$\|X\|_\infty := \inf\{c \in [0, \infty] : |X| \le c \text{ almost surely}\}.$$
Let $L^\infty := L^\infty(\Omega, \mathcal{F}, \mathbb{P})$ denote the set of equivalence classes of real-valued random variables with $\|X\|_\infty < \infty$. Show that $\|\cdot\|_\infty$ is a norm on $L^\infty$, which is a vector space, complete under the metric defined by $\|\cdot\|_\infty$.

[21] Let $\{X_t : t \in T\}$ be a collection of $\mathbb{R}$-valued random variables with possibly uncountable index set $T$. Complete the following argument to show that there exists a countable subset $T_0$ of $T$ such that the random variable $X := \sup_{t \in T_0} X_t$ has the properties
(a) $X \ge X_t$ almost surely, for each $t \in T$;
(b) if $Y \ge X_t$ almost surely, for each $t \in T$, then $Y \ge X$ almost surely.
(The random variable $X$ is called the essential supremum of the family. It is denoted by $\mathrm{ess\,sup}_{t \in T} X_t$. Part (b) shows that it is unique up to an almost sure equivalence.)
(i) Show that properties (a) and (b) are unaffected by a monotone, one-to-one transformation such as $x \mapsto x/(1 + |x|)$. Deduce that there is no loss of generality in assuming $|X_t| \le 1$ for all $t$.
(ii) Let $\delta := \sup\{\mathbb{P}\sup_{t \in S} X_t : \text{countable } S \subseteq T\}$. Choose countable $T_n$ such that $\mathbb{P}\sup_{t \in T_n} X_t > \delta - 1/n$. Let $T_0 := \cup_n T_n$. Show that $\mathbb{P}\sup_{t \in T_0} X_t = \delta$.
(iii) Suppose $t \notin T_0$. From the inequality $\delta \ge \mathbb{P}(X_t \vee X) \ge \mathbb{P}X = \delta$ deduce that $X \ge X_t$ almost surely.
(iv) For a $Y$ as in assertion (b), show that $Y \ge \sup_{t \in T_0} X_t = X$ almost surely.

[22] Let $\Psi$ be a convex, increasing function for which $\Psi(0) = 0$ and $\Psi(x) \to \infty$ as $x \to \infty$. (For example, $\Psi(x)$ could equal $x^p$ for some fixed $p \ge 1$, or $\exp(x) - 1$, or $\exp(x^2) - 1$.) Define $\mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ to be the set of all real-valued measurable functions on $X$ for which $\mu\Psi(|f|/c_0) < \infty$ for some positive real $c_0$. Define
$$\|f\|_\Psi := \inf\{c > 0 : \mu\Psi(|f|/c) \le 1\},$$
with the convention that the infimum of an empty set equals $+\infty$. For each $f, g$ in $\mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and each real $t$ prove the following assertions.
(i) $\|f\|_\Psi < \infty$. Hint: Apply Dominated Convergence to $\mu\Psi(|f|/c)$.
(ii) $f + g \in \mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and the triangle inequality holds: $\|f + g\|_\Psi \le \|f\|_\Psi + \|g\|_\Psi$. Hint: If $c > \|f\|_\Psi$ and $d > \|g\|_\Psi$, deduce that
$$\Psi\left(\frac{|f + g|}{c + d}\right) \le \frac{c}{c + d}\Psi\left(\frac{|f|}{c}\right) + \frac{d}{c + d}\Psi\left(\frac{|g|}{d}\right)$$
by convexity of $\Psi$.
(iii) $tf \in \mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and $\|tf\|_\Psi = |t|\,\|f\|_\Psi$.
REMARK. $\|\cdot\|_\Psi$ is called an Orlicz "norm"; to make it a true norm one should work with equivalence classes of functions equal $\mu$ almost everywhere. The $\mathcal{L}^p$ norms correspond to the special case $\Psi(x) = x^p$, for some $p \ge 1$.

[23] Define $\|f\|_\Psi$ and $\mathcal{L}^\Psi$ as in Problem [22]. Let $\{f_n\}$ be a Cauchy sequence in $\mathcal{L}^\Psi(\mu)$, that is, $\|f_n - f_m\|_\Psi \to 0$ as $\min(m, n) \to \infty$. Show that there exists an $f$ in $\mathcal{L}^\Psi(\mu)$ for which $\|f_n - f\|_\Psi \to 0$, by following these steps.
(i) Let $\{g_i\}$ be a nonnegative sequence in $\mathcal{L}^\Psi(\mu)$ for which $C := \sum_i \|g_i\|_\Psi < \infty$. Show that the function $G := \sum_i g_i$ is finite almost everywhere and $\|G\|_\Psi \le \sum_i \|g_i\|_\Psi < \infty$. Hint: Use Problem [22] to show that $\mu\Psi\left(\sum_{i \le n} g_i/C\right) \le 1$ for each $n$, then justify a passage to the limit.
(ii) Find an increasing sequence $\{n(k)\}$ such that $\sum_{k=1}^\infty \|f_{n(k)} - f_{n(k+1)}\|_\Psi < \infty$. Deduce that the functions $H_L := \sum_{k \ge L} |f_{n(k)} - f_{n(k+1)}|$ satisfy
$$\infty > \|H_1\|_\Psi \ge \|H_2\|_\Psi \ge \ldots \to 0.$$
(iii) Show that there exists a real-valued, measurable function $f$ for which $|f_{n(k)}(x) - f(x)| \to 0$ as $k \to \infty$, for $\mu$ almost all $x$.
(iv) Given $\epsilon > 0$, choose $L$ so that $\|H_L\|_\Psi < \epsilon$. For $i > L$, show that
$$\Psi(H_L/\epsilon) \ge \Psi(|f_{n(L)} - f_{n(i)}|/\epsilon) \to \Psi(|f_{n(L)} - f|/\epsilon).$$
Deduce that $\|f_{n(L)} - f\|_\Psi \le \epsilon$.
(v) Show that $f$ belongs to $\mathcal{L}^\Psi(\mu)$ and $\|f_n - f\|_\Psi \to 0$ as $n \to \infty$.
[24] Let $\Psi$ be a convex increasing function with $\Psi(0) = 0$, as in Problem [22]. Let $\Psi^{-1}$ denote its inverse function. If $X_1, \ldots, X_N \in \mathcal{L}^\Psi(\Omega, \mathcal{F}, \mathbb{P})$, show that
$$\mathbb{P}\max_{i \le N} |X_i| \le \Psi^{-1}(N)\max_{i \le N}\|X_i\|_\Psi.$$
Hint: Consider $\Psi(\mathbb{P}\max_i |X_i|/C)$ with $C > \max_{i \le N}\|X_i\|_\Psi$.
REMARK. Compare with van der Vaart & Wellner (1996, page 96): if also $\limsup_{x,y\to\infty}\Psi(x)\Psi(y)/\Psi(cxy) < \infty$ for some constant $c > 0$ then $\|\max_{i \le N}|X_i|\|_\Psi \le K\Psi^{-1}(N)\max_{i \le N}\|X_i\|_\Psi$ for a constant $K$ depending only on $\Psi$. See page 105 of their Problems and Complements for related counterexamples.


[25] For each $\theta$ in $[0, 1]$ let $X_{n,\theta}$ be a random variable with a Binomial$(n, \theta)$ distribution. That is, $\mathbb{P}\{X_{n,\theta} = k\} = \binom{n}{k}\theta^k(1 - \theta)^{n-k}$ for $k = 0, 1, \ldots, n$. You may assume these elementary facts: $\mathbb{P}X_{n,\theta} = n\theta$ and $\mathbb{P}(X_{n,\theta} - n\theta)^2 = n\theta(1 - \theta)$. Let $f$ be a continuous function defined on $[0, 1]$.
(i) Show that $p_n(\theta) := \mathbb{P}f(X_{n,\theta}/n)$ is a polynomial in $\theta$.
(ii) Suppose $|f| \le M$, for a constant $M$. For a fixed $\epsilon > 0$, invoke (uniform) continuity to find a $\delta > 0$ such that $|f(s) - f(t)| \le \epsilon$ whenever $|s - t| \le \delta$, for all $s, t$ in $[0, 1]$. Show that
$$|f(x/n) - f(\theta)| \le \epsilon + 2M\{|(x/n) - \theta| > \delta\} \le \epsilon + \frac{2M}{\delta^2}|(x/n) - \theta|^2.$$
(iii) Deduce that $\sup_{0 \le \theta \le 1} |p_n(\theta) - f(\theta)| \le 2\epsilon$ for $n$ large enough. That is, deduce that $f$ can be uniformly approximated by polynomials over the range $[0, 1]$, a result known as the Weierstrass approximation theorem.
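The polynomials $p_n$ are the classical Bernstein polynomials, and the approximation is easy to watch numerically. A minimal sketch (illustrative only; the target function $f(x) = |x - 1/2|$ is my choice, not from the Problem), using the binomial probability mass function from scipy:

    import numpy as np
    from scipy.stats import binom

    def bernstein(f, n, theta):
        # p_n(theta) = P f(X_{n,theta}/n), with X_{n,theta} ~ Binomial(n, theta)
        k = np.arange(n + 1)
        return np.sum(binom.pmf(k, n, theta) * f(k / n))

    f = lambda x: np.abs(x - 0.5)       # continuous but not differentiable at 1/2
    grid = np.linspace(0.0, 1.0, 201)
    for n in (10, 100, 1000):
        err = max(abs(bernstein(f, n, t) - f(t)) for t in grid)
        print(n, err)                    # the sup-distance shrinks as n grows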
[26] Extend the approximation result from Example <46> to the case of an infinite measure $\mu$ on $\mathcal{B}(\mathbb{R}^k)$ that gives finite measure to each compact set. Hint: Let $B$ be a closed ball of radius large enough to ensure $\mu|f|B^c < \epsilon$. Write $\mu_B$ for the restriction of $\mu$ to $B$. Invoke the result from the Example to find a $g$ in $C_0$ such that $\mu_B|f - g| < \epsilon$. Find $C_0$ functions $1 \ge h_1 \ge h_2 \ge \ldots \downarrow B$. Consider approximations $gh_i$ for $i$ large enough.

13. Notes
I recommend Royden (1968) as a good source for measure theory. The books of Ash (1972) and Dudley (1989) are also excellent references, for both measure theory and probability. Dudley's book contains particularly interesting historical notes. See Hawkins (1979, Chapter 4) to appreciate the subtlety of the idea of a negligible set.
The result from Problem [10] is often attributed to Pratt (1960), but, as he noted (in his 1966 Acknowledgment of Priority), it is actually much older.
Theorem <38> (the $\pi$-$\lambda$ theorem for generating classes of sets) is often attributed to Dynkin (1960, Section 1.1), although Sierpiński (1928) had earlier proved a slightly stronger result (covering generation of sigma-rings, not just sigma-fields). I adapted the analogous result for classes of functions, Theorem <45>, from Protter (1990, page 7) and Dellacherie & Meyer (1978, page 14). Compare with the "Sierpiński Stability Lemma" for sets, and the "Functional Sierpiński Lemma" presented by Hoffmann-Jørgensen (1994, pages 8, 54, 60).
REFERENCES

Ash, R. B. (1972), Real Analysis and Probability, Academic Press, New York.
Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Dynkin, E. B. (1960), Theory of Markov Processes, Pergamon.
Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.
Hoffmann-Jørgensen, J. (1994), Probability with a View toward Statistics, Vol. I, Chapman and Hall, New York.
Oxtoby, J. (1971), Measure and Category, Springer-Verlag.
Pratt, J. W. (1960), 'On interchanging limits and integrals', Annals of Mathematical Statistics 31, 74-77. Acknowledgement of priority, same journal, vol 37 (1966), page 1407.
Protter, P. (1990), Stochastic Integration and Differential Equations, Springer, New York.
Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.
Sierpiński, W. (1928), 'Un théorème général sur les familles d'ensembles', Fundamenta Mathematicae 12, 206-210.
van der Vaart, A. W. & Wellner, J. A. (1996), Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag.

Chapter 3

Densities and derivatives


SECTION 1 explains why the traditional split of introductory probability courses into two segments (the study of discrete distributions, and the study of "continuous" distributions) is unnecessary in a measure theoretic treatment. Absolute continuity of one measure with respect to another measure is defined. A simple case of the Radon-Nikodym theorem is proved.
SECTION *2 establishes the Lebesgue decomposition of a measure into parts absolutely continuous and singular with respect to another measure, a result that includes the Radon-Nikodym theorem as a particular case.
SECTION 3 shows how densities enter into the definitions of various distances between measures.
SECTION 4 explains the connection between the classical concept of absolute continuity and its measure theoretic generalization. Part of the Fundamental Theorem of Calculus is deduced from the Radon-Nikodym theorem.
SECTION *5 establishes the Vitali covering lemma, the key to the identification of derivatives as densities.
SECTION *6 presents the proof of the other part of the Fundamental Theorem of Calculus, showing that absolutely continuous functions (on the real line) are Lebesgue integrals of their derivatives, which exist almost everywhere.

1. Densities and absolute continuity

Nonnegative measurable functions create new measures from old.
Let $(X, \mathcal{A}, \mu)$ be a measure space, and let $\Delta(\cdot)$ be a function in $\mathcal{M}^+(X, \mathcal{A})$. The increasing, linear functional defined on $\mathcal{M}^+(X, \mathcal{A})$ by $\nu f := \mu(f\Delta)$ inherits from $\mu$ the Monotone Convergence property, which identifies it as an integral with respect to a measure $\nu$ on $\mathcal{A}$.
The measure $\mu$ is said to dominate $\nu$; the measure $\nu$ is said to have density $\Delta$ with respect to $\mu$. This relationship is often indicated symbolically as $\Delta = d\nu/d\mu$, which fits well with the traditional notation,
$$\int f(x)\,d\nu(x) = \int f(x)\frac{d\nu}{d\mu}(x)\,d\mu(x).$$
The $d\mu$ symbols "cancel out," as in the change of variable formula for Lebesgue integrals.


REMARK. The density $d\nu/d\mu$ is often called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, a reference to the result described in Theorem <4> below. The word derivative suggests a limit of a ratio of $\nu$ and $\mu$ measures of "small" sets. For $\mu$ equal to Lebesgue measure on a Euclidean space, $d\nu/d\mu$ can indeed be recovered as such a limit. Section 4 explains the one-dimensional case. Chapter 6 will give another interpretation via martingales.

For example, if $\mu$ is Lebesgue measure on $\mathcal{B}(\mathbb{R})$, the probability measure defined by the density $\Delta(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ with respect to $\mu$ is called the standard normal distribution, usually denoted by $N(0, 1)$. If $\mu$ is counting measure on $\mathbb{N}_0$ (that is, mass 1 at each nonnegative integer), the probability measure defined by the density $\Delta(x) = e^{-\theta}\theta^x/x!$ is called the Poisson$(\theta)$ distribution, for each positive constant $\theta$. If $\mu$ is Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$, the probability measure defined by the density $\Delta(x, y) = (2\pi)^{-1}\exp\left(-(x^2 + y^2)/2\right)$ with respect to $\mu$ is called the standard bivariate normal distribution. The qualifier joint sometimes creeps into the description of densities with respect to Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$ or $\mathcal{B}(\mathbb{R}^k)$. From a measure theoretic point of view the qualifier is superfluous, but it is a comforting probabilistic tradition perhaps worthy of preservation.
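When the dominating measure is counting measure, the recipe $\nu f = \mu(f\Delta)$ reduces to a weighted sum. A small sketch (illustrative only; the value $\theta = 2.5$ and the truncation point are my choices):

    import math

    theta = 2.5
    # density of Poisson(theta) with respect to counting measure on {0, 1, 2, ...}
    delta = lambda x: math.exp(-theta) * theta**x / math.factorial(x)

    # nu f = mu(f Delta) = sum_x f(x) Delta(x); the tail beyond 100 is negligible
    print(sum(delta(x) for x in range(100)))        # total mass, about 1.0
    print(sum(x * delta(x) for x in range(100)))    # mean, about theta = 2.5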
Under mild assumptions, which rule out the sort of pathological behavior involving sets with infinite measure described by Problems [1] and [2], it is not hard to show (Problem [3]) that the density is unique up to a $\mu$-equivalence. The simplest way to avoid the pathologies is an assumption that the dominating measure $\mu$ is sigma-finite, meaning that there exists a partition of $X$ into countably many disjoint measurable sets $X_1, X_2, \ldots$ with $\mu X_i < \infty$ for each $i$.
Existence of a density is a property that depends on two measures. Even measures that don't fit the traditional idea of a continuous distribution can be specified by densities, as in the case of measures dominated by a counting measure. Some introductory texts use the technically correct term density in that case, much to the confusion of students who have come to think that densities have something to do with continuous functions. More generally, every measure could be thought of as having a density, because $d\mu/d\mu = 1$, which is a perfectly useless fact. Densities are useful because they allow integrals with respect to one measure to be reexpressed as integrals with respect to a different measure.
The distribution function of a probability measure dominated by Lebesgue measure on $\mathcal{B}(\mathbb{R})$ is continuous, as a map from $\mathbb{R}$ into $\mathbb{R}$; but not every probability measure with a continuous distribution function has a density with respect to Lebesgue measure.
<1> Example. Let $m$ denote Lebesgue measure on $[0, 1)$. Each point $x$ in $[0, 1)$ has a binary expansion $x = \sum_{n=1}^\infty x_n 2^{-n}$ with each $x_n$ either 0 or 1. To ensure uniqueness of the $\{x_n\}$, choose the expansion that ends in a string of zeros when $x$ is a dyadic rational. The set $\{x_n = 1\}$ is then a finite union of intervals of the form $[a, b)$, with both endpoints dyadic rationals. The map $T(x) := \sum_{n=1}^\infty 2x_n 3^{-n}$ from $[0, 1)$ back into itself is measurable. The image measure $\nu := T(m)$ concentrates on the compact subset $C = \cap_n C_n$ of $[0, 1]$ obtained by successive removal of "middle thirds" of subintervals:
$$C_1 = [0, 1/3] \cup [2/3, 1],$$
$$C_2 = [0, 1/9] \cup [2/9, 3/9] \cup [6/9, 7/9] \cup [8/9, 1],$$
and so on. The set $C$, which is called the Cantor set, has Lebesgue measure less than $mC_n = (2/3)^n$ for every $n$. That is, $mC = 0$.
The distribution function $F(x) := \nu[0, x]$, for $0 \le x \le 1$, has the strange property that it is constant on each of the open intervals that make up each $C_n^c$, because $\nu$ puts zero mass in those intervals. Thus $F$ has a zero derivative at each point of $C^c = \cup_n C_n^c$, a set with Lebesgue measure one.
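The map $T$ is easy to simulate: it replaces each binary digit 1 of $x$ by a ternary digit 2. A small sketch (illustrative only; the truncation depth is my choice) draws $x$ from $m$ via independent fair binary digits and checks that $\nu = T(m)$ puts no mass in the removed middle third:

    import numpy as np

    rng = np.random.default_rng(2)
    N, depth = 10**5, 40
    bits = rng.integers(0, 2, size=(N, depth))     # binary digits x_n of x ~ m
    powers = 3.0 ** -np.arange(1, depth + 1)
    y = (2 * bits) @ powers                        # T(x) = sum_n 2 x_n 3^{-n}

    print(((y > 1/3) & (y < 2/3)).mean())          # 0.0: no mass in (1/3, 2/3)
    print((y <= 1/3).mean(), (y <= 2/3).mean())    # both about 0.5 = F(1/3) = F(2/3)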
The distinction between a continuous function and a function expressible
as an integral was recognized early in the history of measure theory, with the
name absolute continuity being used to denote the stronger property. The original
definition of absolute continuity (Section 4) is now a special case of a streamlined
characterization that applies not just to measures on the real line.
REMARK. Probability measures dominated by Lebesgue measure correspond to the continuous distributions of introductory courses, although the correct term is distributions absolutely continuous with respect to Lebesgue measure. By extension, random variables whose distributions are dominated by Lebesgue measure are sometimes called "continuous random variables," which I regard as a harmful abuse of terminology. There need be nothing continuous about a "continuous random variable" as a function from a set $\Omega$ into $\mathbb{R}$. Indeed, there need be no topology on $\Omega$; the very concept of continuity for the function might be void. Many a student of probability has been misled into assuming topological properties for "continuous random variables." I try to avoid using the term.

<2> Definition. A measure $\nu$ is said to be absolutely continuous with respect to a measure $\mu$, written $\nu \ll \mu$, if every $\mu$-negligible set is also $\nu$-negligible.

Clearly, a measure $\nu$ given by a density with respect to a dominating measure $\mu$ is also absolutely continuous with respect to $\mu$.

<3> Example. If the measure $\nu$ is finite, there is an equivalent formulation of absolute continuity that looks more like a continuity property.
Let $\nu$ and $\mu$ be measures on $(X, \mathcal{A})$, with $\nu X < \infty$. Suppose that to each $\epsilon > 0$ there exists a $\delta > 0$ such that $\nu A < \epsilon$ for each $A \in \mathcal{A}$ with $\mu A < \delta$. Then clearly $\nu$ must be absolutely continuous with respect to $\mu$: for each measurable set $A$ with $\mu A = 0$ we must have $\nu A < \epsilon$ for every positive $\epsilon$.
Conversely, if $\nu$ fails to have the $\epsilon$-$\delta$ property then there exist some $\epsilon > 0$ and a sequence of sets $\{A_n\}$ with $\mu A_n \to 0$ but $\nu A_n \ge \epsilon$ infinitely often. With no loss of generality (or by working with a subsequence) we may assume that $\mu A_n \le 2^{-n}$ and $\nu A_n \ge \epsilon$ for all $n$. Define $A := \{A_n \text{ i.o.}\} = \limsup_n A_n$. Finiteness of $\sum_n \mu A_n$ implies $\sum_n A_n < \infty$ a.e. $[\mu]$, and hence $\mu A = 0$; but Dominated Convergence, using the assumption $\nu X < \infty$, gives $\nu A = \lim_n \nu\left(\sup_{i \ge n} A_i\right) \ge \epsilon$. Thus $\nu$ is not absolutely continuous with respect to $\mu$.
In other words, the $\epsilon$-$\delta$ property is equivalent to absolute continuity, at least when $\nu$ is a finite measure. The equivalence can fail if $\nu$ is not a finite measure.

For example, if $\mu$ denotes Lebesgue measure on $\mathcal{B}(\mathbb{R})$ and $\nu$ is the measure defined by the density $|x|$, then the interval $(n, n + n^{-1})$ has $\mu$ measure $n^{-1}$ but $\nu$ measure greater than 1.
REMARK. For finite measures it might seem that absolute continuity should also have an equivalent formulation in terms of functionals on $\mathcal{M}^+$. However, even if $\nu \ll \mu$, it need not be true that to each $\epsilon > 0$ there exists a $\delta > 0$ such that: $f \in \mathcal{M}^+$ and $\mu f < \delta$ imply $\nu f < \epsilon$. For example, let $\mu$ be Lebesgue measure on $\mathcal{B}(0, 1)$ and $\nu$ be the finite measure with density $\Delta(x) := x^{-1/2}$ with respect to $\mu$. The functions $f_n(x) := \Delta(x)\{n^{-2} < x < n^{-1}\}$ have the property that $\mu f_n \le \int_0^{1/n} x^{-1/2}\,dx \to 0$ as $n \to \infty$, but $\nu f_n = \int_{n^{-2}}^{n^{-1}} x^{-1}\,dx = \log n \to \infty$, even though $\nu \ll \mu$.

Existence of a density and absolute continuity are equivalent properties if we exclude some pathological examples, such as those presented in Problems [1] and [2].
<4> Radon-Nikodym Theorem. Let $\mu$ be a sigma-finite measure on a space $(X, \mathcal{A})$. Then every sigma-finite measure that is absolutely continuous with respect to $\mu$ has a density, which is unique up to $\mu$-equivalence.

The Theorem is a special case of the slightly more general result known as the Lebesgue decomposition, which is proved in Section 2 using projections in Hilbert spaces. Most of the ideas needed to prove the general version of the Theorem appear in simpler form in the following proof of a special case.

<5> Lemma. Suppose $\nu$ and $\mu$ are finite measures on $(X, \mathcal{A})$, with $\nu \le \mu$, that is, $\nu f \le \mu f$ for all $f$ in $\mathcal{M}^+(X, \mathcal{A})$. Then $\nu$ has a density $\Delta$ with respect to $\mu$ for which $0 \le \Delta \le 1$ everywhere.
Proof. The linear subspace $\mathcal{H}_0 := \{f \in \mathcal{L}^2(\mu) : \nu f = 0\}$ of $\mathcal{L}^2(\mu)$ is closed for convergence in the $\mathcal{L}^2(\mu)$ norm: if $\mu|f_n - f|^2 \to 0$ then, by the Cauchy-Schwarz inequality,
$$|\nu f_n - \nu f|^2 \le (\nu 1^2)(\nu|f_n - f|^2) \le (\nu X)\,\mu|f_n - f|^2 \to 0.$$
Except when $\nu$ is the zero measure (in which case the result is trivial), the constant function 1 does not belong to $\mathcal{H}_0$. From Section 2.7, there exist functions $g_0 \in \mathcal{H}_0$ and $g_1$ orthogonal to $\mathcal{H}_0$ for which $1 = g_0 + g_1$. Notice that $\nu g_1 = \nu 1 \ne 0$. The desired density will be a constant multiple of $g_1$.
Consider any $f$ in $\mathcal{L}^2(\mu)$. With $C := \nu f/\nu g_1$ the function $f - Cg_1$ belongs to $\mathcal{H}_0$, because $\nu(f - Cg_1) = \nu f - C\nu g_1 = 0$. Orthogonality of $f - Cg_1$ and $g_1$ gives
$$0 = \langle f - Cg_1, g_1\rangle = \mu(fg_1) - \frac{\nu f}{\nu g_1}\mu(g_1^2),$$
which rearranges to $\nu f = \mu(f\Delta)$ where $\Delta := (\nu g_1/\mu g_1^2)\,g_1$.
The inequality $0 \le \nu\{\Delta < 0\} = \mu\Delta\{\Delta < 0\}$ ensures that $\Delta \ge 0$ a.e. $[\mu]$; and the inequalities $\nu\{\Delta > 1\} = \mu\Delta\{\Delta > 1\} \ge \mu\{\Delta > 1\} \ge \nu\{\Delta > 1\}$ force $\Delta \le 1$ a.e. $[\mu]$, for otherwise the middle inequality would be strict. Replacement of $\Delta$ by $\Delta\{0 \le \Delta \le 1\}$ therefore has no effect on the representation $\nu f = \mu(f\Delta)$ for $f \in \mathcal{L}^2(\mu)$. To extend the equality to $f$ in $\mathcal{M}^+$, first invoke it for the $\mathcal{L}^2(\mu)$ function $n \wedge f$, then invoke Monotone Convergence twice as $n$ tends to infinity.


Not all measures on $(X, \mathcal{A})$ need be dominated by a given $\mu$. The extreme example is a measure $\nu$ that is singular with respect to $\mu$, meaning that there exists a measurable subset $S$ for which $\mu S^c = 0 = \nu S$. That is, the two measures concentrate on disjoint parts of $X$, a situation denoted by writing $\nu \perp \mu$. Perhaps it would be better to say that the two measures are mutually singular, to emphasize the symmetry of the relationship. For example, discrete measures (those that concentrate on countable subsets) are singular with respect to Lebesgue measure on the real line. There also exist singular measures (with respect to Lebesgue measure) that give zero mass to each countable set, such as the probability measure $\nu$ from Example <1>.
Avoidance of all probability measures except those dominated by a counting measure or a Lebesgue measure, as in introductory probability courses, imposes awkward constraints on what one can achieve with probability theory. The restriction becomes particularly tedious for functions of more than a single random variable, for then one is limited to smooth transformations for which image measures and densities (with respect to Lebesgue measure) can be calculated by means of Jacobians. The unfortunate effects of an artificially restricted theory permeate much of the statistical literature, where sometimes inappropriate and unnecessary requirements are imposed merely to accommodate a lack of an appropriate measure theoretic foundation.
REMARK. Why should absolute continuity with respect to Lebesgue measure or counting measure play such a central role in introductory probability theory? I believe the answer is just a matter of definition, or rather, a lack of definition. For a probability measure $P$ concentrated on a countable set of points, expectations $Pg(X)$ become countable sums, which can be handled by elementary methods. For general probability measures the definition of $Pg(X)$ is typically not a matter of elementary calculation. However, if $X$ has a distribution $P$ with density $\Delta$ with respect to Lebesgue measure, then $Pg(X) = Pg = \int \Delta(x)g(x)\,dx$. The last integral has the familiar look of a Riemann integral, which is the subject of elementary Calculus courses. Seldom would $\Delta$ or $g$ be complicated enough to require the interpretation as a Lebesgue integral; one stays away from such functions when teaching an introductory course.

From the measure theoretic viewpoint, densities are not just a crutch for support
of an inadequate integration theory; they become a useful tool for exploiting absolute
continuity for pairs of measures.
In much statistical theory, the actual choice of dominating measure matters
little. The following result, which is often called Scheffé's lemma, is typical.
<6>

Exercise. Suppose $\{P_n\}$ is a sequence of probability measures with densities $\{\Delta_n\}$ with respect to a measure $\mu$. Suppose $\Delta_n$ converges almost everywhere $[\mu]$ to the density $\Delta := dP/d\mu$ of a probability measure $P$. Show that $\mu|\Delta_n - \Delta| \to 0$.

SOLUTION: Write $\mu|\Delta - \Delta_n|$ as

$$\mu(\Delta - \Delta_n)^+ + \mu(\Delta - \Delta_n)^- = 2\mu(\Delta - \Delta_n)^+ - \mu(\Delta - \Delta_n).$$

On the right-hand side, the second term equals zero, because both densities integrate to 1, and the first term tends to zero by Dominated Convergence, because $\Delta \ge (\Delta - \Delta_n)^+ \to 0$ a.e. $[\mu]$.
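As a quick numerical illustration (my addition, not part of the text), one can watch the $\mathcal{L}^1$ convergence for a concrete family: the Beta$(n/(n+1), 1)$ densities converge pointwise on $(0,1)$ to the Uniform$(0,1)$ density, so Scheffé's lemma forces $\mathcal{L}^1$ convergence.

    # Scheffe's lemma in action: Beta(n/(n+1),1) densities vs Uniform(0,1)
    import numpy as np

    x = np.linspace(1e-6, 1 - 1e-6, 200_000)   # grid on (0,1)
    dx = x[1] - x[0]
    delta = np.ones_like(x)                    # Uniform(0,1) density
    for n in [1, 10, 100, 1000]:
        a = n / (n + 1.0)
        delta_n = a * x**(a - 1.0)             # Beta(a,1) density: a x^(a-1)
        l1 = np.sum(np.abs(delta_n - delta)) * dx   # crude L^1 distance
        print(f"n = {n:5d}   L1 distance = {l1:.4f}")
    # The printed distances decrease toward 0, as the lemma guarantees.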


Convergence in $\mathcal{L}^1(\mu)$ of densities is equivalent to convergence of the probability measures in total variation, a topic discussed further in Section 3.

*2. The Lebesgue decomposition

Absolute continuity and singularity represent the two extremes for the relationship between two measures on the same space.

<7>

Lebesgue decomposition. Let $\mu$ be a sigma-finite measure on a space $(\mathcal{X}, \mathcal{A})$. To each sigma-finite measure $\nu$ on $\mathcal{A}$ there exists a $\mu$-negligible set $\mathcal{N}$ and a real-valued $\Delta$ in $M^+$ such that

<8>  $$\nu f = \nu(f\mathcal{N}) + \mu(f\Delta) \qquad \text{for all } f \in M^+(\mathcal{X}, \mathcal{A}).$$

The set $\mathcal{N}$ and the function $\Delta\mathcal{N}^c$ are unique up to a $\nu + \mu$ almost sure equivalence.
REMARK. Of course the value of $\Delta$ on $\mathcal{N}$ has no effect on $\mu(f\Delta)$ or $\nu(f\mathcal{N})$. Some authors adopt the convention that $\Delta = \infty$ on the set $\mathcal{N}$. The convention has no effect on the equality <8>, but it does make $\Delta$ unique $\nu + \mu$ almost surely.
The restriction $\nu_\perp$ of $\nu$ to $\mathcal{N}$ is called the part of $\nu$ that is singular with respect to $\mu$. The restriction $\nu_{\rm abs}$ of $\nu$ to $\mathcal{N}^c$ is called the part of $\nu$ that is absolutely continuous with respect to $\mu$. Problem [11] shows that the decomposition $\nu = \nu_\perp + \nu_{\rm abs}$, into singular and dominated components, is unique.

There is a corresponding decomposition $\mu = \mu_\perp + \mu_{\rm abs}$ into parts singular and absolutely continuous with respect to $\nu$. Together the two decompositions partition the underlying space into four measurable sets: a set that is negligible for both measures; a set where they are mutually absolutely continuous; and two sets where the singular components $\mu_\perp$ and $\nu_\perp$ concentrate.

[Figure: the partition of $\mathcal{X}$ into four sets. On $\{\Delta = 0,\ \nu = 0\}$ the singular part $\mu_\perp$ concentrates; on $\{0 < \Delta < \infty\}$ the two measures are mutually absolutely continuous, with $d\nu_{\rm abs} = \Delta\,d\mu_{\rm abs}$ and $d\mu_{\rm abs} = \Delta^{-1}\,d\nu_{\rm abs}$; one set receives mass from neither measure; and on the set where $1/\Delta = 0$ and $\mu = 0$, the singular part $\nu_\perp$ concentrates.]

Proof. Consider first the question of existence. With no loss of generality we may assume that both $\nu$ and $\mu$ are finite measures. The general decomposition would follow by piecing together the results for countably many disjoint subsets of $\mathcal{X}$.

Define $\lambda := \nu + \mu$. Note that $\nu \le \lambda$. From Lemma <5>, there is an $\mathcal{A}$-measurable function $\Delta_0$, taking values in $[0,1]$, for which $\nu f = \lambda(f\Delta_0)$ for all $f$ in $M^+$. Define $\mathcal{N} := \{\Delta_0 = 1\}$, a set that has zero $\mu$ measure because

$$\nu\{\Delta_0 = 1\} = \nu\Delta_0\{\Delta_0 = 1\} + \mu\Delta_0\{\Delta_0 = 1\} = \nu\{\Delta_0 = 1\} + \mu\{\Delta_0 = 1\}.$$


Define

$$\Delta := \frac{\Delta_0}{1 - \Delta_0}\{\Delta_0 < 1\}.$$

We have to show that $\nu(f\mathcal{N}^c) = \mu(f\Delta)$ for all $f$ in $M^+$. For such an $f$, and each positive integer $n$, the defining property of $\Delta_0$ gives

$$\nu(f \wedge n)\{\Delta_0 < 1\} = \nu(f \wedge n)\Delta_0\{\Delta_0 < 1\} + \mu(f \wedge n)\Delta_0\{\Delta_0 < 1\},$$

which rearranges (no problems with $\infty - \infty$) to

$$\nu(f \wedge n)(1 - \Delta_0)\{\Delta_0 < 1\} = \mu(f \wedge n)\Delta_0\{\Delta_0 < 1\}.$$

Appeal twice to Monotone Convergence as $n \to \infty$ to deduce

$$\nu f(1 - \Delta_0)\{\Delta_0 < 1\} = \mu f\Delta_0\{\Delta_0 < 1\} \qquad \text{for all } f \in M^+.$$

Replace $f$ by $f\{\Delta_0 < 1\}/(1 - \Delta_0)$, which also belongs to $M^+$, to complete the proof of existence.

The proof of uniqueness of the representation, up to various almost sure equivalences, follows a similar style of argument. Problem [9] will step you through the details.
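As a concrete illustration (my addition, not from the text): on $\mathcal{X} = [0, 1]$ take $\mu$ to be Lebesgue measure and $\nu := \delta_0 + \rho$, where $\delta_0$ is the point mass at 0 and $\rho$ has density $2x$ with respect to $\mu$. Then $\mathcal{N} = \{0\}$ and $\Delta(x) = 2x$ satisfy

$$\nu f = \nu(f\mathcal{N}) + \mu(f\Delta) = f(0) + \int_0^1 f(x)\,2x\,dx \qquad \text{for all } f \in M^+,$$

with $\nu_\perp = \delta_0$ concentrated on the $\mu$-negligible set $\mathcal{N}$ and $\nu_{\rm abs} = \rho$ dominated by $\mu$.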

3. Distances and affinities between measures


Let $\mu_1$ and $\mu_2$ be two finite measures on $(\mathcal{X}, \mathcal{A})$. We may suppose both $\mu_1$ and $\mu_2$ are absolutely continuous with respect to some dominating (nonnegative) measure $\lambda$, with densities $m_1$ and $m_2$. For example, we could choose $\lambda = \mu_1 + \mu_2$.

For the purposes of this subsection, a finite signed measure will be any set function expressible as a difference $\mu_1 - \mu_2$ of finite, nonnegative measures. In fact every real valued, countably additive set function defined on a sigma-field can be represented in that way, but we shall not be needing the more basic characterization. If $\mu_i$ has density $m_i$ with respect to $\lambda$ then $\mu_1 - \mu_2$ has density $m_1 - m_2$.

Throughout the section, $M_{\rm bdd}$ will denote the space of all bounded, real-valued, $\mathcal{A}$-measurable functions on $\mathcal{X}$, and $M^+_{\rm bdd}$ will denote the cone of nonnegative functions in $M_{\rm bdd}$.

There are a number of closely related distances between the measures, all of which involve calculations with densities. Several easily proved facts about these distances have important application in mathematical statistics.
Total variation distance

The total variation norm $\|\mu\|_1$ of the signed measure $\mu$ is defined as $\sup_\pi \sum_{A \in \pi}|\mu A|$, where the supremum ranges over all partitions $\pi$ of $\mathcal{X}$ into finitely many measurable sets. The total variation distance between two signed measures is the norm of their difference. In terms of the density $m$ for $\mu$ with respect to $\lambda$, we have

$$\sum_{A \in \pi}|\mu A| = \sum_{A \in \pi}|\lambda(mA)| \le \sum_{A \in \pi}\lambda|m|A = \lambda|m|.$$


Equality is achieved for a partition with two sets: $A_1 := \{m \ge 0\}$ and $A_2 := \{m < 0\}$. In particular, the total variation distance between two measures equals the $\mathcal{L}^1$ distance between their densities; in fact, total variation distance is often referred to as $\mathcal{L}^1$ distance. The initial definition of $\|\mu\|_1$, in which the dominating measure does not appear, shows that the $\mathcal{L}^1$ norm does not depend on the particular choice of dominating measure. The $\mathcal{L}^1$ norm also equals $\sup_{|f| \le 1}|\mu f|$, because $|\mu f| = |\lambda(mf)| \le \lambda(|m|\,|f|) \le \lambda|m|$ for $|f| \le 1$, with equality for $f := \{m \ge 0\} - \{m < 0\}$.

REMARK. Some authors also refer to $v(\mu_1, \mu_2) := \sup_{A \in \mathcal{A}}|\mu_1 A - \mu_2 A|$ as the total variation distance, which can be confusing. In the special case when $\mu_1$ and $\mu_2$ are probability measures, the signed measure $\mu := \mu_1 - \mu_2$ satisfies $\mu\{m_1 > m_2\} = -\mu\{m_1 < m_2\}$, whence $\|\mu\|_1 = 2v(\mu_1, \mu_2)$.
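For a finite space, both descriptions of the norm can be computed directly. The following check (my own, not from the text) compares the $\mathcal{L}^1$ formula with the supremum over sets, for a pair of probabilities on four points:

    # total variation: L^1 of densities vs 2 * sup_A |mu1(A) - mu2(A)|
    import itertools
    import numpy as np

    m1 = np.array([0.5, 0.3, 0.1, 0.1])       # density of mu1 (counting measure)
    m2 = np.array([0.25, 0.25, 0.25, 0.25])   # density of mu2

    l1 = np.abs(m1 - m2).sum()                # lambda|m1 - m2|

    # sup over all subsets A of the four-point space
    v = max(abs(m1[list(A)].sum() - m2[list(A)].sum())
            for r in range(5) for A in itertools.combinations(range(4), r))

    print(l1, 2 * v)   # the two values agree: here 0.6 and 0.6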

The affinity between two finite signed measures

The affinity between $\mu_1$ and $\mu_2$ is defined as

$$\alpha_1(\mu_1, \mu_2) := \inf\{\mu_1 f_1 + \mu_2 f_2 : f_1, f_2 \in M^+_{\rm bdd},\ f_1 + f_2 \ge 1\} = \inf\{\lambda(m_1 f_1 + m_2 f_2) : f_1, f_2 \in M^+_{\rm bdd},\ f_1 + f_2 \ge 1\} = \lambda(m_1 \wedge m_2),$$

the infimum being achieved by $f_1 = \{m_1 \le m_2\}$ and $f_2 = \{m_1 > m_2\}$. That is, the affinity equals the minimum of $\mu_1 A + \mu_2 A^c$ over all $A$ in $\mathcal{A}$.

REMARK. For probability measures, the minimizing set has the statistical interpretation of the (nonrandomized) test between the two hypotheses $\mu_1$ and $\mu_2$ that minimizes the sum of the type one and type two errors.

The pointwise equality $2(m_1 \wedge m_2) = m_1 + m_2 - |m_1 - m_2|$ gives the connection between affinity and $\mathcal{L}^1$ distance,

$$2\alpha_1(\mu_1, \mu_2) = \lambda(m_1 + m_2 - |m_1 - m_2|) = \mu_1\mathcal{X} + \mu_2\mathcal{X} - \|\mu_1 - \mu_2\|_1.$$

Both the affinity and the $\mathcal{L}^1$ distance are related to a natural ordering on the space of all finite signed measures on $\mathcal{A}$, defined by:

$$\nu \le \nu' \quad \text{means} \quad \nu f \le \nu' f \text{ for each } f \text{ in } M^+_{\rm bdd}.$$
<9>

Example. To each pair of finite, signed measures $\mu_1$ and $\mu_2$ there exists a largest measure $\nu$ for which $\nu \le \mu_1$ and $\nu \le \mu_2$. It is determined by its density $m_1 \wedge m_2$ with respect to the dominating $\lambda$. It is easy to see that the $\nu$ so defined is smaller than both $\mu_1$ and $\mu_2$. To prove that it is the largest such measure, let $\mu$ be any other signed measure with the same property. Without loss of generality, we may assume $\mu$ has a density $m$ with respect to $\lambda$. For each bounded, nonnegative $f$, we then have

$$\lambda(mf) = \mu f = \mu f\{m_1 > m_2\} + \mu f\{m_1 \le m_2\} \le \mu_2 f\{m_1 > m_2\} + \mu_1 f\{m_1 \le m_2\} = \lambda\left((m_1 \wedge m_2)f\right).$$

The particular choice $f = \{m > m_1 \wedge m_2\}$ then leads to the conclusion that $m \le m_1 \wedge m_2$ a.e. $[\lambda]$, and hence $\mu \le \nu$ as measures.


The measure $\nu$ is also denoted by $\mu_1 \wedge \mu_2$ and is called the measure theoretic minimum of $\mu_1$ and $\mu_2$. For nonnegative measures, the affinity $\alpha_1(\mu_1, \mu_2)$ equals the $\mathcal{L}^1$ norm of $\mu_1 \wedge \mu_2$.

By a similar argument, it is easy to show that the density $m_1 \vee m_2$ defines the smallest measure $\nu$ (necessarily nonnegative) for which $\nu \ge \mu_1$ and $\nu \ge \mu_2$.
Hellinger distance between probability measures

Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a dominating measure $\lambda$. The square roots of the densities, $\sqrt{p}$ and $\sqrt{q}$, are both square integrable; they both belong to $\mathcal{L}^2(\lambda)$. The Hellinger distance between the two measures is defined as the $\mathcal{L}^2$ distance between the square roots of their densities,

$$H(P, Q)^2 := \lambda\left(\sqrt{p} - \sqrt{q}\right)^2 = \lambda(p + q - 2\sqrt{pq}) = 2 - 2\lambda\sqrt{pq}.$$

Again the distance does not depend on the choice of dominating measure (Problem [13]). The quantity $\lambda\sqrt{pq}$ is called the Hellinger affinity between the two probabilities, and is denoted by $\alpha_2(P, Q)$. Clearly $\sqrt{pq} \ge p \wedge q$, from which it follows that $\alpha_2(P, Q) \ge \alpha_1(P, Q)$ and $H(P, Q)^2 \le \|P - Q\|_1$. The Cauchy-Schwarz inequality gives a useful lower bound:

$$\|P - Q\|_1^2 = \left(\lambda\left|\sqrt{p} - \sqrt{q}\right|\left(\sqrt{p} + \sqrt{q}\right)\right)^2 \le \lambda\left(\sqrt{p} - \sqrt{q}\right)^2\,\lambda\left(\sqrt{p} + \sqrt{q}\right)^2.$$

Substituting for the Hellinger affinity, we get $\|P - Q\|_1 \le H(P, Q)\left(4 - H(P, Q)^2\right)^{1/2}$, which is smaller than $2H(P, Q)$. The Hellinger distance defines a bounded metric on the space of all probability measures on $\mathcal{A}$. Convergence in that metric is equivalent to convergence in $\mathcal{L}^1$ norm, because

<10>  $$H(P, Q)^2 \le \|P - Q\|_1 \le 2H(P, Q).$$

The Hellinger distance satisfies the inequality $0 \le H(P, Q) \le \sqrt{2}$. The equality at 0 occurs when $\sqrt{p} = \sqrt{q}$ almost surely $[\lambda]$, that is, when $P = Q$ as measures on $\mathcal{A}$. Equality at $\sqrt{2}$ occurs when the Hellinger affinity is zero, that is, when $pq = 0$ almost surely $[\lambda]$, which is the condition that $P$ and $Q$ be mutually singular. For example, discrete distributions (concentrated on a countable set) are always at the maximum Hellinger distance from nonatomic distributions (zero mass at each point).
REMARK. Some authors prefer to have an upper bound of 1 for the Hellinger distance; they include an extra factor of one half in the definition of $H(P, Q)^2$.
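The chain of inequalities relating $H$ and the $\mathcal{L}^1$ distance is easy to check numerically. The following sketch (my addition, not from the text) uses two probabilities on four points:

    # check H^2 <= ||P-Q||_1 <= H*sqrt(4 - H^2) <= 2H for discrete P, Q
    import numpy as np

    p = np.array([0.5, 0.3, 0.2, 0.0])    # density of P w.r.t. counting measure
    q = np.array([0.2, 0.2, 0.3, 0.3])    # density of Q

    l1 = np.abs(p - q).sum()                       # ||P - Q||_1
    H2 = ((np.sqrt(p) - np.sqrt(q))**2).sum()      # squared Hellinger distance
    H = np.sqrt(H2)

    print(H2, l1, H * np.sqrt(4 - H2), 2 * H)
    # the four printed values are increasing, as the inequalities require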
Relative entropy

Let $P$ and $Q$ be two probability measures with densities $p$ and $q$ with respect to some dominating measure $\lambda$. The relative entropy (also known as the Kullback-Leibler "distance," even though it is not a metric) between $P$ and $Q$ is defined as

$$D(P\|Q) := \lambda\left(p\log(p/q)\right).$$


At first sight, it is not obvious that the definition cannot suffer from the $\infty - \infty$ problem. A Taylor expansion comes to the rescue: for $x > -1$,

<11>  $$(1 + x)\log(1 + x) = x + R(x),$$

with $R(x) = \tfrac{1}{2}x^2/(1 + x^*)$ for some $x^*$ between 0 and $x$. When $p > 0$ and $q > 0$, put $x = (p - q)/q$, discard the nonnegative remainder term, then multiply through by $q$ to get $p\log(p/q) \ge p - q$. The same inequality also holds at points where $p > 0$ and $q = 0$, with the left-hand side interpreted as $\infty$; and at points where $p = 0$ we get no contribution to the defining integral. It follows not only that $\lambda p\left(\log(p/q)\right)^- < \infty$ but also that $D(P\|Q) \ge \lambda(p - q) = 0$. The relative entropy is well defined and nonnegative. (Nonnegativity would also follow via Jensen's inequality.) Moreover, if $\lambda\{p > 0 = q\} > 0$ then $p\log(p/q)$ is infinite on a set of positive measure, which forces $D(P\|Q) = \infty$. That is, the relative entropy is infinite unless $P$ is absolutely continuous with respect to $Q$. It can also be infinite even if $P$ and $Q$ are mutually absolutely continuous (Problem [15]).
As with the $\mathcal{L}^1$ and Hellinger distances, the relative entropy does not depend on the choice of the dominating measure $\lambda$ (Problem [14]).

It is easy to deduce from the conditions for equality in Jensen's inequality that $D(P\|Q) = 0$ if and only if $P = Q$. An even stronger assertion follows from the inequality

<12>  $$D(P\|Q) \ge H^2(P, Q).$$

This inequality is trivially true unless $P$ is absolutely continuous with respect to $Q$, in which case we can take $\lambda$ equal to $Q$. For that case, define $\eta := \sqrt{p} - 1$. Note that $Q\eta^2 = H^2(P, Q)$ and

$$1 = Qp = Q(1 + \eta)^2 = 1 + 2Q\eta + Q\eta^2,$$

which implies that $2Q\eta = -H^2(P, Q)$. Hence, using the inequality $\log(1 + t) \ge t/(1 + t)$ for $t > -1$,

$$D(P\|Q) = 2Q\left((1 + \eta)^2\log(1 + \eta)\right) \ge 2Q\left((1 + \eta)\eta\right) = 2Q\eta + 2Q\eta^2 = H^2(P, Q),$$

as asserted.

In a similar vein, there is a lower bound for the relative entropy involving the $\mathcal{L}^1$ distance. Inequalities <10> and <12> together imply $D(P\|Q) \ge \tfrac{1}{4}\|P - Q\|_1^2$. A more direct argument will give a slightly better bound,

<13>  $$D(P\|Q) \ge \tfrac{1}{2}\|P - Q\|_1^2.$$

The improvement comes from a refinement (Problem [19]) of the error term in equality <11>, namely, $R(x) \ge \tfrac{1}{2}x^2/(1 + x/3)$ for $x > -1$.


To establish <13> we may once more assume, with no loss of generality, that $P$ is absolutely continuous with respect to $Q$. Write $1 + \delta$ for the density $dP/dQ$. Notice that $Q\delta = 0$ and $Q|\delta| = \|P - Q\|_1$. Then deduce that

$$D(P\|Q) = Q\left((1 + \delta)\log(1 + \delta) - \delta\right) \ge \tfrac{1}{2}Q\left(\frac{\delta^2}{1 + \delta/3}\right).$$

Multiply the right-hand side by $1 = Q(1 + \delta/3)$, then invoke the Cauchy-Schwarz inequality to bound the product from below by half the square of

$$Q\left(\frac{|\delta|}{\sqrt{1 + \delta/3}}\sqrt{1 + \delta/3}\right) = Q|\delta| = \|P - Q\|_1.$$

The asserted inequality <13> follows.
<14>

Example. Let $P_\theta$ denote the $N(\theta, 1)$ distribution on the real line with density $\phi(x - \theta)$ with respect to Lebesgue measure, where $\phi(x) = \exp(-x^2/2)/\sqrt{2\pi}$. Each of the three distances between $P_0$ and $P_\theta$ can be calculated in closed form:

$$D(P_0\|P_\theta) = P_0^x\left(-\tfrac{1}{2}x^2 + \tfrac{1}{2}(x - \theta)^2\right) = \tfrac{1}{2}\theta^2,$$

and

$$H^2(P_0, P_\theta) = 2 - 2\int_{-\infty}^{\infty}\exp\left(-\tfrac{1}{4}x^2 - \tfrac{1}{4}(x - \theta)^2\right)\frac{dx}{\sqrt{2\pi}} = 2 - 2\exp(-\theta^2/8),$$

and

$$\|P_0 - P_\theta\|_1 = 2\int\left(\phi(x) - \phi(x - \theta)\right)^+ dx = 2\int_{-\infty}^{\theta/2}\left(\phi(x) - \phi(x - \theta)\right)dx = 2\left(\Phi(\tfrac{1}{2}\theta) - \Phi(-\tfrac{1}{2}\theta)\right),$$

where $\Phi$ denotes the $N(0,1)$ distribution function.
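The three closed forms can be confirmed by numerical integration. The following sketch (my addition, not from the text; it assumes scipy is available) checks them at $\theta = 1.3$:

    # numerical check of the three closed forms for N(0,1) vs N(theta,1)
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    theta = 1.3
    phi0 = norm(0, 1).pdf
    phi1 = norm(theta, 1).pdf

    KL, _ = quad(lambda x: phi0(x) * np.log(phi0(x) / phi1(x)), -12, 12)
    H2, _ = quad(lambda x: (np.sqrt(phi0(x)) - np.sqrt(phi1(x)))**2, -12, 12)
    L1, _ = quad(lambda x: abs(phi0(x) - phi1(x)), -12, 12)

    print(KL, theta**2 / 2)                                  # both ~0.845
    print(H2, 2 - 2 * np.exp(-theta**2 / 8))                 # both ~0.381
    print(L1, 2 * (norm.cdf(theta/2) - norm.cdf(-theta/2)))  # both ~0.969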


For $\theta$ near zero, Taylor expansion gives

$$H^2(P_0, P_\theta) = \tfrac{1}{4}\theta^2 + O(\theta^4) \qquad \text{and} \qquad \|P_0 - P_\theta\|_1 = \theta\sqrt{2/\pi} + O(\theta^3).$$

Inequalities <10>, <12>, and <13> then become, for $\theta$ near zero,

$$\tfrac{1}{16}\theta^4 + O(\theta^6) \le \theta\sqrt{2/\pi} + O(\theta^3) \le \theta + O(\theta^3),$$
$$\tfrac{1}{2}\theta^2 \ge \tfrac{1}{4}\theta^2 + O(\theta^4), \qquad \tfrac{1}{2}\theta^2 \ge \tfrac{1}{\pi}\theta^2 + O(\theta^4).$$

The comparisons show that there is little room for improvement, except possibly for the lower bound in <10>.

<15>
Example. The method from the previous Example can be extended to other
families of densities, providing a better indication of how much improvement might
be possible in the three inequalities. I will proceed heuristically, ignoring questions


of convergence and integrability, and assuming existence of well behaved Taylor


expansions.
Suppose $P_\theta$ has a density $\exp\left(g(x - \theta)\right)$ with respect to Lebesgue measure on the real line. Write $\dot{g}$ and $\ddot{g}$ for the first and second derivatives of the function $g(x)$, so that

$$g(x - \theta) - g(x) = -\theta\dot{g}(x) + \tfrac{1}{2}\theta^2\ddot{g}(x) + \text{terms of order } |\theta|^3$$

and

$$\exp\left(g(x - \theta) - g(x)\right) = 1 - \theta\dot{g}(x) + \tfrac{1}{2}\theta^2\ddot{g}(x) + \tfrac{1}{2}\left(-\theta\dot{g}(x) + \tfrac{1}{2}\theta^2\ddot{g}(x)\right)^2 + \ldots = 1 - \theta\dot{g}(x) + \tfrac{1}{2}\theta^2\left(\ddot{g}(x) + \dot{g}(x)^2\right) + \text{terms of order } |\theta|^3.$$

From the equalities

$$1 = \int_{-\infty}^{\infty}\exp\left(g(x - \theta)\right)dx = P_0\exp\left(g(x - \theta) - g(x)\right) = 1 - \theta P_0\dot{g}(x) + \tfrac{1}{2}\theta^2 P_0\left(\ddot{g}(x) + \dot{g}(x)^2\right) + \text{terms of order } |\theta|^3$$

we can conclude first that $P_0\dot{g} = 0$, and then $P_0\left(\ddot{g}(x) + \dot{g}(x)^2\right) = 0$.
REMARK.
The last two assertions can be made rigorous by domination
assumptions, allowing derivatives to be taken inside integral signs. Readers familiar
with the information inequality from theoretical statistics will recognize the two ways
of representing the information function for the family of densities, together with the
zero-mean property of the score function.

Continuing the heuristic, we will get approximations for the three distances for values of $\theta$ near zero. First,

$$D(P_0\|P_\theta) = P_0\left(g(x) - g(x - \theta)\right) = \tfrac{1}{2}\theta^2 P_0\dot{g}^2 + \ldots$$

From the Taylor approximation,

$$\exp\left(\tfrac{1}{2}g(x - \theta) - \tfrac{1}{2}g(x)\right) = 1 - \tfrac{1}{2}\theta\dot{g} + \tfrac{1}{4}\theta^2\left(\ddot{g} + \tfrac{1}{2}\dot{g}^2\right) + \ldots$$

we get

$$H^2(P_0, P_\theta) = 2 - 2P_0\exp\left(\tfrac{1}{2}g(x - \theta) - \tfrac{1}{2}g(x)\right) = \tfrac{1}{4}\theta^2 P_0\dot{g}^2 + \ldots$$

And finally,

$$\|P_0 - P_\theta\|_1 = P_0\left|\exp\left(g(x - \theta) - g(x)\right) - 1\right| = |\theta|\,P_0|\dot{g}| + \ldots$$

If we could find a $g$ for which $(P_0|\dot{g}|)^2 = P_0\dot{g}^2$, the two inequalities

$$2H(P_0, P_\theta) \ge \|P_0 - P_\theta\|_1 \qquad \text{and} \qquad D(P_0\|P_\theta) \ge \tfrac{1}{2}\|P_0 - P_\theta\|_1^2$$

would become sharp up to terms of order $\theta^2$. The desired equality for $\dot{g}$ would force $|\dot{g}|$ to be a constant, which would lead us to the density $f(x) = \tfrac{1}{2}e^{-|x|}$. Of course $\log f$ is not differentiable at the origin, an oversight we could remedy by a slight smoothing of $|x|$ near the origin. In that way we could construct arbitrarily smooth densities with $P_0\dot{g}^2$ as close as we please to $(P_0|\dot{g}|)^2$. There can be no improvement in the constants in the last two displayed inequalities.

4. The classical concept of absolute continuity

The fundamental problem of Calculus concerns the interpretation of differentiation and integration, for functions of a real variable, as inverse operations: Which functions on the real line can be expressed as integrals of their derivatives? Clearly, if $H(x) = \int_a^x h(t)\,dt$, with $h$ Lebesgue integrable, then $H$ is a continuous function, but it also possesses a stronger property.

<16>

Definition. A real valued function $H$ defined on an interval $[a, b]$ of the real line is said to be absolutely continuous if to each $\epsilon > 0$ there exists a $\delta > 0$ such that $\sum_i |H(b_i) - H(a_i)| < \epsilon$ for all finite collections of nonoverlapping subintervals $[a_i, b_i]$ of $[a, b]$ for which $\sum_i (b_i - a_i) < \delta$.

REMARK. In the Definition, and subsequently, nonoverlapping means that the interiors of the intervals are disjoint. For example, $[0,1]$ and $[1,2]$ are nonoverlapping. Their intersection has zero Lebesgue measure.

Note the strong similarity between Definition <16> and the reformulation of the measure theoretic definition of absolute continuity as an $\epsilon$-$\delta$ property, in Example <3>.
The connection between absolute continuity of functions and integration of
derivatives was established by Lebesgue (1904). It is one of the most celebrated
results of classical analysis.
<17>

Fundamental Theorem of Calculus. A real valued function $H$ defined on an interval $[a, b]$ is absolutely continuous if and only if the following three conditions hold:

(i) the derivative $H'(x)$ exists at Lebesgue almost all points of $[a, b]$;

(ii) the derivative $H'$ is Lebesgue integrable;

(iii) $H(x) - H(a) = \int_a^x H'(t)\,dt$ for each $x$ in $[a, b]$.

REMARK. Of course it is actually immaterial for (ii) and (iii) how $H'$ is defined on the Lebesgue negligible set of points at which the derivative does not exist. For example, we could take $H'$ as the measurable function $\limsup_{n\to\infty} n\left(H(x + n^{-1}) - H(x)\right)$. The proof in Section 6 provides two other natural choices.

We may fruitfully think of the Fundamental Theorem as making two separate assertions:

(FT1) A real-valued function $H$ on $[a, b]$ is absolutely continuous if and only if it has the representation

<18>  $$H(x) = H(a) + \int_a^x h(t)\,dt \qquad \text{for all } x \in [a, b],$$

for some Lebesgue integrable $h$ defined on $[a, b]$. The function $h$ is unique up to Lebesgue almost sure equivalence.

(FT2) If $H$ is absolutely continuous then it is differentiable almost everywhere, with derivative equal Lebesgue almost everywhere to the function $h$ from <18>.


As shown at the end of this Section, Assertion FT1 can be recovered from the Radon-Nikodym theorem for measures. The proof of Assertion FT2 (Section 6), which identifies the density $h$ with a derivative defined almost everywhere, requires an auxiliary result known as the Vitali Covering Lemma (Section 5).

Notice what the Fundamental Theorem does not assert: that differentiability almost everywhere should allow the function to be recovered as an integral of that derivative. As a pointwise limit of continuous functions, $(H(x + \delta) - H(x))/\delta$, the derivative $H'(x)$ is measurable, when it exists. However, it need not be Lebesgue integrable, as noted by Lebesgue in his doctoral thesis (Lebesgue 1902): the function $F(x) := x^2\sin(1/x^2)$, with $F(0) = 0$, has a derivative at all points of the real line, but $F'(x)$ behaves like $-2x^{-1}\cos(1/x^2)$ for $x$ near 0, which prevents integrability on any interval that contains 0 (Problem [20]). The function $F$ is not absolutely continuous.
Neither does the Fundamental Theorem assert that absolute continuity follows from almost sure existence of $H'$ with $\int_a^b |H'(x)|\,dx$ finite. For example, the function $F$ constructed in Example <1>, whose derivative exists and is equal to zero outside a set of zero Lebesgue measure, cannot be recovered by integration of the derivative, even though that derivative is integrable. It is therefore quite surprising that existence of even a one sided, integrable derivative everywhere is enough to ensure absolute continuity.
<19>

Theorem. Let $H$ be a continuous function defined on an interval $[a, b]$, with a (real-valued) right-hand derivative $h(x) := \lim_{\delta \downarrow 0}\left(H(x + \delta) - H(x)\right)/\delta$ existing at each point of $[a, b)$. If $h$ is Lebesgue integrable then $H(x) = H(a) + \int_a^x h(t)\,dt$ for each $x$ in $[a, b]$, and hence $H$ is an absolutely continuous function on $[a, b]$.

See Problem [21] for an outline of the proof. The Theorem justifies the usual treatment in introductory Calculus courses of integration as an inverse operation to differentiation.
Proof of FT1. If $H$ is given by representation <18> then its increments are controlled by the measure $\nu$ defined on $[a, b]$ by its density $|h|$ with respect to Lebesgue measure $m$ on $\mathcal{B}[a, b]$. If $\{[a_i, b_i] : i = 1, \ldots, k\}$ is a family of nonoverlapping subintervals of $[a, b]$ then $\sum_{i=1}^k |H(b_i) - H(a_i)| \le \sum_{i=1}^k \nu[a_i, b_i]$. From Example <3>, when the Lebesgue measure $\sum_i(b_i - a_i)$ of the set $A = \cup_i[a_i, b_i]$ is small enough, the $\nu$ measure is also small, because $\nu$ is absolutely continuous (in the sense of measures) with respect to $m$. It follows that $H$ is absolutely continuous as a function on $[a, b]$.
Conversely, an absolutely continuous function $H$ defines two nonnegative functions $F^+$ and $F^-$ on $[a, b]$ by

<20>  $$F^{\pm}(x) = \sup_{\pi(x)} \sum_i \left(H(x_i) - H(x_{i-1})\right)^{\pm},$$

where the suprema run over the collection $\pi(x)$ of all finite partitions $a = x_0 < x_1 < \ldots < x_k = x$ of $[a, x]$, for each $x$ in $[a, b]$. At $x = a$ all partitions are degenerate and $F^{\pm}(a) = 0$. By splitting $[a, x]$ into a finite union of intervals of length less than $\delta$, you will see that both functions are real valued. More precisely, $F^{\pm}(x) \le \epsilon\lceil(b - a)/\delta\rceil$, where $\epsilon$ and $\delta$ come from Definition <16>.


Both $F^{\pm}$ are increasing functions, for the following reason. First note that insertion of an extra point into a partition $\{x_i\}$ increases the defining sums. In particular, if $a \le y < x$ then we may arrange that $y$ is a point in the $\pi(x)$ partition, so that the sums on the right-hand side of <20> are larger than the corresponding sums for $\pi(y)$. More precisely, $F^{\pm}(x) = F^{\pm}(y) + \sup\sum_i\left(H(y_i) - H(y_{i-1})\right)^{\pm}$, where the supremum runs over all finite partitions $y = y_0 < y_1 < \ldots < y_k = x$ of $[y, x]$. When $|x - y| < \delta$, with $\delta > 0$ as in Definition <16>, each sum for the supremum is less than $\epsilon$, and hence $|F^{\pm}(x) - F^{\pm}(y)| \le \epsilon$. That is, both $F^{\pm}$ are continuous functions on $[a, b]$. A similar argument applied to finite nonoverlapping collections of intervals $\{[y_i, x_i]\}$ leads to the stronger conclusion that both $F^{\pm}$ are absolutely continuous functions.
REMARK. We didn't really need absolute continuity to break $H$ into a difference of two increasing functions. It would suffice to have $\sup\sum_i |H(y_i) - H(y_{i-1})| < \infty$, where the supremum runs over all finite partitions $y = y_0 < y_1 < \ldots < y_k = x$ of $[y, x]$. Such a function is said to have bounded variation on the interval $[a, b]$. Close inspection of the arguments in Section 6 would reveal that functions of bounded variation have a derivative almost everywhere. Without absolute continuity, we cannot recover the function by integrating that derivative, as shown by Example <1>.

As shown in Section 2.9 using quantile transformations, increasing functions such as $F^{\pm}$ correspond to distribution functions of measures: there exist two finite measures $\nu^{\pm}$ on $\mathcal{B}[a, b]$ such that $\nu^{\pm}(a, x] = F^{\pm}(x)$ for $x$ in $[a, b]$. Notice that $\nu^{\pm}$ put zero mass at each point, by continuity of $F^{\pm}$. The absolute continuity of the $F^{\pm}$ functions translates into an assertion regarding the class $\mathcal{E}$ of all subsets of $[a, b]$ expressible as a finite union of intervals $(a_i, x_i]$: for each $\epsilon > 0$ there exists a $\delta > 0$ such that $\nu^{\pm}E < \epsilon$ for every set $E$ in $\mathcal{E}$ with $mE < \delta$.

A simple $\lambda$-class generating argument shows that $\mathcal{E}$ is a dense subclass of $\mathcal{B}[a, b]$ in the $\mathcal{L}^1(\mu)$ sense, where $\mu = m + \nu^+ + \nu^-$. That is, for each set $B$ in $\mathcal{B}[a, b]$ and each $\epsilon' > 0$ there exists an $E$ in $\mathcal{E}$ for which $\mu|B - E| < \epsilon'$. In particular, if we put $\epsilon' = \min(\delta/2, \epsilon)$ then $mB < \delta/2$ implies $mE < \delta$, from which we get $\nu^{\pm}E < \epsilon$ and $\nu^{\pm}B < 2\epsilon$. That is, both $\nu^{\pm}$ are absolutely continuous (as measures) with respect to $m$.

By the Radon-Nikodym Theorem, there exist $m$-integrable functions $h^{\pm}$ for which $\nu^{\pm}f = m(h^{\pm}f)$ for all $f$ in $M^+$. In particular, $F^{\pm}(x) = \nu^{\pm}(a, x] = \int_a^x h^{\pm}(t)\,dt$ for $a \le x \le b$.
Finally, note that for each partition $\{x_i\}$ in $\pi(x)$ we have

$$\sum_i\left(H(x_i) - H(x_{i-1})\right)^+ - \sum_i\left(H(x_i) - H(x_{i-1})\right)^- = \sum_i\left(H(x_i) - H(x_{i-1})\right),$$

which reduces to $H(x) - H(a)$ after cancellations. We can choose the partition to give simultaneously values as close as we please to both suprema in <20>. In the limit we get

$$H(x) - H(a) = F^+(x) - F^-(x) = \int_a^x\left(h^+(t) - h^-(t)\right)dt.$$

That is, representation <18> holds with $h = h^+ - h^-$. The uniqueness of the representing $h$ can be established by another $\lambda$-class generating argument.

*5. Vitali covering lemma

Suppose $D$ is a Borel subset of $\mathbb{R}^d$ with finite Lebesgue measure $mD$. There are various ways in which we may approximate $D$ by simpler sets. For example, from Section 2.1 and related problems, to each $\epsilon > 0$ there exists an open set $G \supseteq D$ and a compact set $K \subseteq D$ for which $m(G\backslash K) < \epsilon$. In particular, we could take $G$ as a countable union of open cubes. In general, we cannot hope to represent $K$ as a union of closed cubes, because those sets would all have to lie in the interior of $D$, a set whose measure might be strictly smaller than $mD$. However, if we allow the cubes to poke slightly outside $D$ we can even approximate by a finite union of disjoint closed cubes, as a consequence of a Vitali covering theorem.

There are many different results presented in texts as the Vitali theorem, each involving different levels of generality and different degrees of subtlety in the proof. For the application to FT2, a simple version for approximation by intervals on the real line would suffice, but I will present a version for $\mathbb{R}^d$, in which the sets are not assumed to be cubes, but are instead required to be "not too skinny". The extra generality is useful; and it causes few extra difficulties in the proof, indeed, it helps to highlight the beautiful underlying idea; and the pictures look nicer in more than one dimension.

Of course skinny is not a technical term. Instead, for a fixed constant $\gamma > 0$, let us say that a measurable set $F$ is $\gamma$-regular if there exists an open ball $B_F$ with $B_F \supseteq F$ and $mF/mB_F \ge \gamma$. We will sometimes need to write $B_F$ as $B(x, r)$, the open ball with center $x$ and radius $r$.

<21>

Definition. Call a collection $\mathcal{V}$ of closed subsets of $\mathbb{R}^d$ a $\gamma$-regular Vitali covering of a set $E$ if each member of $\mathcal{V}$ is $\gamma$-regular and if, to each $\epsilon > 0$, each point of $E$ belongs to a set $F$ (depending on the point and $\epsilon$) from $\mathcal{V}$ with diameter less than $\epsilon$.

Put another way, if we write $\mathcal{V}_\epsilon$ for $\{F \in \mathcal{V} : \mathrm{diam}(F) < \epsilon\}$, then the Vitali covering property is equivalent to $E \subseteq \cup\{F : F \in \mathcal{V}_\epsilon\}$ for every $\epsilon > 0$. Notice that if $G$ is an open set with $G \supseteq E$, then $\{F \in \mathcal{V} : F \subseteq G\}$ is also a Vitali covering for $E$: for each $x$ in $E$, all points close enough to $x$ must lie within $G$.

<22>

Vitali Covering Lemma. For some fixed $\gamma > 0$, let $\mathcal{V}$ be a $\gamma$-regular Vitali covering for a set $D$ in $\mathcal{B}(\mathbb{R}^d)$ with finite Lebesgue measure. Then there exists a countable family $\{F_i\}$ of disjoint sets from $\mathcal{V}$ for which the set $D\backslash(\cup_i F_i)$ has zero Lebesgue measure.

REMARK. More refined versions of the Theorem (such as the results presented by Saks 1937, Section IV.3, or Wheeden & Zygmund 1977, Section 7.3) allow $\gamma$ to vary from point to point within $D$, and relax assumptions of measurability or finiteness of $mD$. We will have no need for the more general versions.

Proof. The result is trivial if $mD$ is zero, so we may assume that $mD > 0$. The proof works via repeated reduction of $mD$ by removal of unions of finite families of disjoint sets from $\mathcal{V}$. The method for the first step sets the pattern.

Fix an $\epsilon > 0$. As you will see later, we need $\epsilon$ small enough that the constant $\rho := 3^{-d}(1 - \epsilon)\gamma - \epsilon$ is strictly positive.


Find an open set $G$ and a compact set $K$ for which $G \supseteq D \supseteq K$ and $m(G\backslash K) < \epsilon\,mD$. Discard all those members of $\mathcal{V}$ that are not subsets of $G$. What remains is still a Vitali covering of $D$.

REMARK. As the proof proceeds, various sets will be discarded from $\mathcal{V}$. Rather than invent a new symbol for the class of sets that remains after each discard, I will reuse the symbol $\mathcal{V}$. That is, $\mathcal{V}$ will denote different things at different stages of the proof, but in every case it will be a Vitali covering for a subset of interest.

The family of open balls $\{B_F : F \in \mathcal{V}\}$ covers the compact set $K$. It has a finite subcover, corresponding to sets $F_1, \ldots, F_m$. The ball $B_{F_i}$ is of the form $B(x_i, r_i)$. We may assume that the sets are numbered so that $r_1 \ge r_2 \ge \ldots \ge r_m$.

Define a subset $J$ of $\{1, 2, \ldots, m\}$ by successively considering each $j$, in the order $1, 2, \ldots, m$, for inclusion in $J$, rejecting $j$ if $F_j$ intersects an $F_i$ for an $i$ already accepted into $J$. For example, with the five sets shown in the picture, we would have $J = \{1, 2, 5\}$: we include 1, then 2 (because $F_2 \cap F_1 = \emptyset$), reject 3 (because $F_3 \cap F_1 \neq \emptyset$), reject 4 (because $F_4 \cap F_2 \neq \emptyset$), then accept 5 (because $F_5 \cap F_1 = \emptyset$ and $F_5 \cap F_2 = \emptyset$).

For each excluded $j$ there is an $i$ in $J$ for which $i < j$ and $F_i \cap F_j \neq \emptyset$. The ordering of the radii ensures that $B(x_i, 3r_i) \supseteq B(x_j, r_j)$: if $z \in F_i \cap F_j$ and $y \in B(x_j, r_j)$, we have

$$|x_i - y| \le |x_i - z| + |z - y| < r_i + 2r_j \le 3r_i.$$

Thus $\cup_{i \in J}B(x_i, 3r_i) \supseteq \cup_{j \le m}B(x_j, r_j) \supseteq K$. The open balls $B(x_i, 3r_i)$ might poke outside $G$, but the corresponding sets $F_i$ from $\mathcal{V}$, and their union, $E_1 := \cup_{i \in J}F_i$, are closed subsets of $G$.
Pairwise disjointness of the sets $\{F_i : i \in J\}$ allows us to calculate the measure of their union by adding, which, together with the regularity property

$$mF_i \ge \gamma\,mB(x_i, r_i) = \left(\gamma 3^{-d}\right)mB(x_i, 3r_i),$$

gives us a lower bound for the measure of $E_1$,

$$mE_1 = \sum_{i \in J} mF_i \ge \gamma 3^{-d}\sum_{i \in J} mB(x_i, 3r_i) \ge \gamma 3^{-d}\,m\left(\cup_{i \in J}B(x_i, 3r_i)\right) \ge \gamma 3^{-d}\,mK \ge \gamma 3^{-d}(1 - \epsilon)\,mD.$$

It follows that

$$m(D\backslash E_1) \le mG - mE_1 < (1 + \epsilon)mD - \left(3^{-d}(1 - \epsilon)\gamma\right)mD = (1 - \rho)mD.$$

That is, $E_1$ carves away at least a fraction $\rho$ of the Lebesgue measure of $D$, thereby completing the first step in the construction.
The second step is analogous. Discard all those members of $\mathcal{V}$ that intersect $E_1$. What remains is a new Vitali covering $\mathcal{V}$ for $D\backslash E_1$, because each point of $D\backslash E_1$ is at a strictly positive distance from the closed set $E_1$. Repeat the argument from the previous paragraphs to find disjoint sets from $\mathcal{V}$ whose union, $E_2$, removes at least a fraction $\rho$ of the Lebesgue measure of $D\backslash E_1$. Then $m\left(D\backslash(E_1 \cup E_2)\right) \le (1 - \rho)^2 mD$, and the sets that make up $E_1 \cup E_2$ are all disjoint. And so on. In the limit we have the countable disjoint family whose existence is asserted by the Lemma.
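The greedy selection in the proof is easy to implement for balls. The following sketch (my own illustration, not from the text; the centers and radii are arbitrary) carries out the selection and checks the tripling property that drives the covering bound:

    # greedy Vitali selection: sort by decreasing radius, accept disjoint balls
    import numpy as np

    rng = np.random.default_rng(0)
    balls = [(rng.uniform(0, 10, size=2), rng.uniform(0.2, 1.5)) for _ in range(40)]
    balls.sort(key=lambda b: -b[1])          # radii in decreasing order

    accepted = []
    for x, r in balls:
        # two balls are disjoint exactly when centers are at least r1 + r2 apart
        if all(np.linalg.norm(x - y) >= r + s for y, s in accepted):
            accepted.append((x, r))

    # every ball meets an accepted ball of radius >= its own, so it lies
    # inside that accepted ball with its radius tripled (the "3r" trick)
    for x, r in balls:
        assert any(np.linalg.norm(x - y) < r + s and s >= r - 1e-12 and
                   np.linalg.norm(x - y) + r <= 3 * s + 1e-9
                   for y, s in accepted)
    print(f"accepted {len(accepted)} disjoint balls out of {len(balls)}")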

*6. Densities as almost sure derivatives

With Lemma <22> we have the means to prove FT2: if

$$H(x) - H(a) = \int_a^x h(t)\,dt \qquad \text{for } a \le x \le b,$$

for a Lebesgue integrable function $h$, then $H$ has a derivative at almost all points of $[a, b]$ and that derivative coincides with $h$ almost everywhere.

It suffices if we consider the case where $h$ is nonnegative, from which the general case would follow by breaking $h$ into its positive and negative parts. For nonnegative $h$, write $\nu$ for the measure on $\mathcal{B}[a, b]$ with density $h$ with respect to Lebesgue measure $m$. Define $\mathcal{E}_n(x)$ as the set of all nondegenerate (that is, nonzero length) closed intervals $E$ for which $x \in E \subseteq (x - n^{-1}, x + n^{-1})$. Notice that if the derivative $H'(x)$ exists then it can be recovered as a limit of ratios $\nu E_n/mE_n$, with $E_n \in \mathcal{E}_n(x)$.
Define functions

$$f_n(x) := \sup\left\{\frac{\nu E}{mE} : E \in \mathcal{E}_n(x)\right\} \qquad \text{and} \qquad g_n(x) := \inf\left\{\frac{\nu E}{mE} : E \in \mathcal{E}_n(x)\right\}.$$

Both the sets $\{f_n > r\}$ and $\{g_n < r\}$ are open, and hence both $f_n$ and $g_n$ are Borel measurable. For example, if $f_n(x) > r$ then there exists an interval $E$ in $\mathcal{E}_n(x)$ with $\nu E > r\,mE$. Continuity of $H$ at the endpoints of $E$ lets us expand $E$ slightly, ensuring that $x$ is an interior point of $E$ while keeping the ratio above $r$ and keeping $E$ within $(x - n^{-1}, x + n^{-1})$. If $y$ is close enough to $x$, the interval $E$ also belongs to $\mathcal{E}_n(y)$, and hence $f_n(y) > r$. The monotone limits $f(x) := \inf_n f_n(x)$ and $g(x) := \sup_n g_n(x)$ are also Borel measurable.
Clearly $f(x) \ge g(x)$ everywhere. For $0 < \delta < 1/n$, both $[x, x + \delta]$ and $[x - \delta, x]$ belong to $\mathcal{E}_n(x)$, and hence both

$$\frac{H(x + \delta) - H(x)}{\delta} \qquad \text{and} \qquad \frac{H(x) - H(x - \delta)}{\delta}$$

lie between $g_n(x)$ and $f_n(x)$. In the limit as first $\delta$ tends to zero then $n$ tends to infinity we have

$$g(x) \le \liminf_{\delta \downarrow 0}\frac{H(x + \delta) - H(x)}{\delta} \le \limsup_{\delta \downarrow 0}\frac{H(x + \delta) - H(x)}{\delta} \le f(x),$$

with an analogous pair of inequalities for $[x - \delta, x]$. At points where $g(x) = h(x) = f(x)$ the derivative $H'(x)$ must exist and be equal to $h(x)$. The next Lemma will help us prove the almost sure equality of $f$, $g$, and $h$.
<23>

Lemma. Let $A$ be a Borel set with finite Lebesgue measure and $r$ be a positive constant.

(i) If $f(x) > r$ for all $x$ in $A$ then $\nu A \ge r\,mA$.

(ii) If $g(x) < r$ for all $x$ in $A$ then $\nu A \le r\,mA$.


Proof. Let $G$ be an open set and $K$ be a compact set, with $G \supseteq A \supseteq K$.

For (i), note that the collection $\mathcal{V}$ of all nondegenerate closed subintervals $E$ of $G$ that satisfy the inequality $\nu E \ge r\,mE$ is a Vitali covering of $A$ (because $f_n \ge f > r$ on $A$ for every $n$). By Lemma <22> there is a countable family of disjoint intervals $\{E_i\}$ from $\mathcal{V}$ whose union $L$ covers $A$ up to an $m$-negligible set. We then have

$$\nu G \ge \nu L \qquad \text{because } G \text{ contains all intervals from } \mathcal{V}$$
$$= \textstyle\sum_i \nu E_i \qquad \text{disjoint intervals}$$
$$\ge \textstyle\sum_i r\,mE_i \qquad \text{definition of } \mathcal{V}$$
$$= r\,mL \qquad \text{disjoint intervals}$$
$$\ge r\,mA \qquad \text{covering property.}$$

Take the infimum over $G$ to obtain the assertion from (i).

The argument for (ii) is similar: Reverse the direction of the inequality in the definition of $\mathcal{V}$, then interchange the roles of $\nu$ and $rm$ in the string of inequalities for the covering.
To complete the proof of FT2, for any pair of rational numbers with $r > s > 0$ let $A = \{x \in [a, b] : h(x) \le s < r \le f(x)\}$. From Lemma <23> and the fact that $\nu$ has density $h$ with respect to $m$ we get $s\,mA \ge \nu A \ge r\,mA$. Conclude that $mA = 0$. Cast out countably many negligible sets, one for each pair of rationals, to deduce that $h \ge f$ a.e. $[m]$ on $[a, b]$. Argue similarly to show that $h \le g$ a.e. $[m]$ on $[a, b]$. Together with the fact that $f \ge g$ these two inequalities imply that $f = g = h$ at almost all points of $[a, b]$.
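Numerically, the theorem says that averages of $h$ over shrinking intervals recover $h$. A small sketch (my own, not from the text) at a point of continuity of a chosen density:

    # nu[x-r, x+r] / m[x-r, x+r] -> h(x) as r -> 0
    import numpy as np

    h = lambda t: np.where(t < 0.5, 1.5, 0.5)   # a density with a jump at 0.5
    x = 0.3                                      # a point of continuity of h

    for r in [0.1, 0.01, 0.001]:
        t = np.linspace(x - r, x + r, 10_001)
        nu = np.sum(h(t)) * (t[1] - t[0])        # crude quadrature for nu[x-r, x+r]
        print(r, nu / (2 * r))                   # ratios approach h(0.3) = 1.5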

7. Problems

[1] Let $\mu$ be the measure defined for all subsets of $\mathcal{X}$ by $\mu A = +\infty$ if $A \neq \emptyset$, and $\mu\emptyset = 0$. Show that $\mu f = \mu(2f)$ for all nonnegative functions $f$.

[2] Let $\lambda$ denote the measure on $\mathbb{R}$ with $\lambda A$ equal to the number of points in $A$ (possibly infinite), and let $\mu A = \infty$ for all nonempty $A$. Show that $\lambda$ has no density with respect to $\mu$, even though both measures have the same negligible sets.

[3] Suppose $\Delta_1$ and $\Delta_2$ are functions in $M^+(\mathcal{X}, \mathcal{A})$ for which $\mu(f\Delta_1) = \mu(f\Delta_2)$ for all $f$ in $M^+(\mathcal{X}, \mathcal{A})$, for some measure $\mu$.

(i) Show that $\Delta_1 = \Delta_2$ a.e. $[\mu]$ if $\mu\Delta_1 < \infty$. Hint: Consider the equality $\mu\Delta_1\{\Delta_1 > \Delta_2\} = \mu\Delta_2\{\Delta_1 > \Delta_2\}$.

(ii) Show that the assumption of finiteness of $\mu\Delta_1$ is not required for the conclusion $\Delta_1 = \Delta_2$ a.e. $[\mu]$ if $\mu$ is a sigma-finite measure. Hint: For each positive rational number $r$ show that $\mu(\Delta_1 - \Delta_2)A\{\Delta_1 > r > \Delta_2\} = 0$ for each set $A$ with $\mu A < \infty$.
[4] Let $P$ and $Q$ be probability measures on $(\Omega, \mathcal{F})$, and let $\mathcal{E}$ be a generating class for $\mathcal{F}$ that is stable under finite intersections and contains $\Omega$. Suppose there exists a $p$ in $\mathcal{L}^1(Q)$ such that $PE = Q(pE)$ for every $E$ in $\mathcal{E}$. Show that $P$ has density $p$ with respect to $Q$.
[5] Does Example <3> have an analog for $\mathcal{L}^p$ convergence, for some $p > 1$: if $\nu$ is a finite measure dominated by a measure $\lambda$, and if $\{f_n\}$ is a sequence of measurable functions for which $\lambda|f_n|^p \to 0$, does it follow that $\nu|f_n|^p \to 0$?

[6] Let $\nu$ and $\mu$ be finite measures on the sigma-field $\sigma(\mathcal{E})$ generated by a countable field $\mathcal{E}$. Suppose that for each $\epsilon > 0$ there exists a $\delta > 0$ such that $\nu E < \epsilon$ for each $E$ in $\mathcal{E}$ with $\mu E < \delta$. Show that $\nu$ is absolutely continuous with respect to $\mu$, as measures on $\sigma(\mathcal{E})$. Hint: Suppose $\mu A = 0$ for an $A$ in $\sigma(\mathcal{E})$. Show that there exists a countable subclass $\{E_n\}$ of $\mathcal{E}$ such that $\cup_n E_n \supseteq A$ and $\mu(\cup_n E_n) < \delta$. Use the properties of fields to argue that the $E_n$ sets can be chosen disjoint. Deduce that $\mu(\cup_{n \le N}E_n) < \delta$ for each finite $N$. Conclude that $\nu A \le \lim_N \nu(\cup_{n \le N}E_n) \le \epsilon$.

[7] Let $\nu$ and $\mu$ be finite measures on $(\mathcal{X}, \mathcal{A})$. Suppose $\nu$ has Lebesgue decomposition $(\Delta, \mathcal{N})$ with respect to $\mu$. Find the Lebesgue decomposition of $\mu$ with respect to $\nu$. Hint: Consider $1/\Delta$ on $\{0 < \Delta < \infty\}$.

[8] Show that a measure $\mu$ is sigma-finite if and only if there exists a strictly positive, measurable function $\Psi$ for which $\mu\Psi < \infty$. Hint: If $\mu$ is sigma-finite, consider functions of the form $\Psi(x) = \sum_i \alpha_i\{x \in A_i\}$, for appropriate partitions.

[9] Suppose $(\mathcal{N}_1, \Delta_1)$ and $(\mathcal{N}_2, \Delta_2)$ are both Lebesgue decompositions for a finite measure $\nu$ with respect to a finite measure $\mu$. Prove $\mathcal{N}_1 = \mathcal{N}_2$ a.e. $[\nu + \mu]$ and $\Delta_1\mathcal{N}_1^c = \Delta_2\mathcal{N}_2^c$ a.e. $[\nu + \mu]$ by the following steps.

(i) Show that $\mathcal{N}_1 = \mathcal{N}_2$ a.e. $[\nu]$ by means of the equality $\nu(\mathcal{N}_2\mathcal{N}_1^c) = \nu(\mathcal{N}_2\mathcal{N}_1^c\mathcal{N}_1) + \mu(\mathcal{N}_2\mathcal{N}_1^c\Delta_1) = 0$, and a companion equality obtained by interchanging the roles of the two decompositions. Using the trivial fact that $\mathcal{N}_1 = \mathcal{N}_2$ a.e. $[\mu]$ (because both sets are $\mu$-negligible), deduce that $\mathcal{N}_1 = \mathcal{N}_2$ a.e. $[\nu + \mu]$.

(ii) Use the result from part (i) to show that

$$\mu(f\Delta_1\mathcal{N}_1^c) = \nu(f\mathcal{N}_1^c) = \nu(f\mathcal{N}_2^c) = \mu(f\Delta_2\mathcal{N}_2^c) \qquad \text{for all } f \in M^+.$$

Deduce that $\Delta_1\mathcal{N}_1^c = \Delta_2\mathcal{N}_2^c$ a.e. $[\mu]$ by arguing as in Problem [3].

(iii) Use part (ii) to show that

$$\nu\{\Delta_1\mathcal{N}_1^c > \Delta_2\mathcal{N}_2^c\} = \nu\{\Delta_1\mathcal{N}_1^c > \Delta_2\mathcal{N}_2^c\}\mathcal{N}_1 + \mu\{\Delta_1\mathcal{N}_1^c > \Delta_2\mathcal{N}_2^c\}\Delta_1\mathcal{N}_1^c = 0.$$

Hint: What value does $\Delta_1\mathcal{N}_1^c$ take on $\mathcal{N}_1$? Argue similarly for the companion equality. Deduce that $\Delta_1\mathcal{N}_1^c = \Delta_2\mathcal{N}_2^c$ a.e. $[\nu + \mu]$.

(iv) Extend the argument to prove the analogous uniqueness assertion for the Lebesgue decomposition when $\nu$ and $\mu$ are sigma-finite measures. Hint: Decompose $\mathcal{X}$ into countably many sets $\{\mathcal{X}_i : i \in \mathbb{N}\}$ with $(\nu + \mu)\mathcal{X}_i < \infty$ for each $i$.
[10] Suppose $\nu$ and $\mu$ are both sigma-finite measures on $(\mathcal{X}, \mathcal{A})$.

(i) Show that $\nu + \mu$ is also sigma-finite. Deduce via Problem [8] that there exists a strictly positive, measurable function $\Phi_0$ for which $\nu\Phi_0 < \infty$ and $\mu\Phi_0 < \infty$.

(ii) Define $\nu_0 f := \nu(f\Phi_0)$ and $\mu_0 f := \mu(f\Phi_0)$ for each $f$ in $M^+$. Show that $\nu_0$ and $\mu_0$ are both finite measures.

(iii) From the Lebesgue decomposition $\nu_0 f = \nu_0(f\mathcal{N}) + \mu_0(f\Delta)$ for all $f \in M^+$, where $\mu_0\mathcal{N} = 0$ and $\Delta \in M^+$, derive the corresponding Lebesgue decomposition for $\nu$ with respect to $\mu$.

(iv) From the uniqueness result of Problem [9], applied to $\nu_0$ and $\mu_0$, deduce that the Lebesgue decomposition for $\nu$ is unique up to the appropriate almost sure equivalences.
[11] Let $\nu$ be a finite measure for which $\nu = \nu_1 + \lambda_1 = \nu_2 + \lambda_2$, where each $\nu_i$ is dominated by a fixed sigma-finite measure $\mu$ and each $\lambda_i$ is singular with respect to $\mu$. Show that $\nu_1 = \nu_2$ and $\lambda_1 = \lambda_2$. Hint: If $\mu S_i^c = 0 = \lambda_i S_i$ and $d\nu_i/d\mu = \Delta_i$, for $i = 1, 2$, show that $\mu(f\Delta_1 S_1 S_2) = \mu(f\Delta_2 S_1 S_2)$ for all $f$ in $M^+$, by arguing as in Problem [3].

[12] Let $\mu_1$ and $\mu_2$ be finite measures with densities $m_1$ and $m_2$ with respect to a sigma-finite measure $\lambda$. Define $(\mu_1 - \mu_2)^+$ as the measure with density $(m_1 - m_2)^+$ with respect to $\lambda$.

(i) Show that $(\mu_1 - \mu_2)^+ B = \sup_\pi \sum_{A \in \pi}(\mu_1 - \mu_2)(AB)$, the supremum running over all finite partitions $\pi$ of $\mathcal{X}$.

(ii) Show that $|\mu_1 - \mu_2| = (\mu_1 - \mu_2)^+ + (\mu_2 - \mu_1)^+$.

[13] Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a sigma-finite measure $\lambda$. For fixed $\alpha \ge 1$, show that $A_\alpha(P, Q) := \lambda|p^{1/\alpha} - q^{1/\alpha}|^\alpha$ does not depend on the choice of dominating measure. Hint: Let $\mu$ be another sigma-finite dominating measure. Write $\psi$ for the density of $\lambda$ with respect to $\lambda + \mu$. Show that $dP/d(\lambda + \mu) = \psi p$ and $dQ/d(\lambda + \mu) = \psi q$. Express $A_\alpha(P, Q)$ as an integral with respect to $\lambda + \mu$. Argue similarly for $\mu$.

[14] Adapt the argument from the previous Problem to show that the relative entropy $D(P\|Q)$ does not depend on the choice of dominating measure.

[15] Let $P$ be the standard Cauchy distribution on the real line, and let $Q$ be the standard normal distribution. Show that $D(P\|Q) = \infty$, even though $P$ and $Q$ are mutually absolutely continuous.

[16] Let $P$ and $Q$ be probability measures defined on the same sigma-field $\mathcal{F}$. A randomized test is a measurable function $f$ with $0 \le f \le 1$. (For observation $\omega$, the value $f(\omega)$ gives the probability of rejecting the "hypothesis" $P$.) Find the test that minimizes $Pf + Q(1 - f)$.

[17] Let $P$ and $Q$ be finite measures defined on the same sigma-field $\mathcal{F}$, with densities $p$ and $q$ with respect to a measure $\mu$. Suppose $\mathcal{X}_0$ is a measurable set with the property that there exists a nonnegative constant $K$ such that $q \ge Kp$ on $\mathcal{X}_0$ and $q \le Kp$ on $\mathcal{X}_0^c$. For each $\mathcal{F}$-measurable function $f$ with $0 \le f \le 1$ and $Pf \le P\mathcal{X}_0$, prove that $Qf \le Q\mathcal{X}_0$. Hint: Prove that $(q - Kp)(2\mathcal{X}_0 - 1) \ge (q - Kp)(2f - 1)$, then integrate. To statisticians this result is known as the Neyman-Pearson Lemma.


[18] Let $\mu = \mu_1 - \mu_2$ and $\mu' = \mu_1' - \mu_2'$ be signed measures, in the sense of Section 3. Show that $\nu := (\mu_1 + \mu_2') \wedge (\mu_1' + \mu_2) - (\mu_2 + \mu_2')$ is the largest signed measure for which $\nu \le \mu$ and $\nu \le \mu'$.

[19] Let $f(x) = (1 + x)\log(1 + x) - x$, for $x > -1$. Use the representations

$$f(x) = \int_0^x\!\!\int_0^s \frac{dt\,ds}{1 + t} \qquad \text{and} \qquad \frac{x^2/2}{1 + x/3} = \int_0^x\!\!\int_0^s \frac{dt\,ds}{(1 + t/3)^3}$$

to show that $f(x) \ge \tfrac{1}{2}x^2(1 + x/3)^{-1}$.

[20] Show that the function $x^{-1}\cos(x^{-2})$ is not integrable on $[0, 1]$. Hint: Consider contributions from intervals where $x^{-2}$ lies in a range $n\pi \pm \pi/4$. The intervals have lengths of order $n^{-3/2}$, but the $x^{-1}$ contributes a factor of order $n^{1/2}$.

[21] Prove Theorem <19> by establishing the following assertions. With no loss of generality, assume $a = 0$ and $b = 1$ and $H(0) = 0$. Write $m$ for Lebesgue measure on $\mathcal{B}[0, 1]$. Fix $\epsilon > 0$. Define $A_n := \{n\epsilon \le h < (n + 1)\epsilon\}$ for each integer $n$, both positive and negative. Note that

$$h^+(x) = \sum_{n \ge 0} h(x)\{x \in A_n\} \qquad \text{and} \qquad h^-(x) = -\sum_{n < 0} h(x)\{x \in A_n\}.$$

(i) There exist open sets $G_n$ and compact sets $K_n$ with $K_n \subseteq A_n \subseteq G_n$ and $m(G_n\backslash K_n) < \epsilon_n$, with $\epsilon_n$ so small that the functions

$$f(x) := \sum_{n \ge 0}(n + 1)\epsilon\{x \in G_n\} \qquad \text{and} \qquad g(x) := \sum_{n < 0}|n + 1|\epsilon\{x \in K_n\}$$

satisfy the inequalities $f \ge h^+$ and $g \le h^-$ and $m(f - h^+) + m(h^- - g) < 2\epsilon$.

(ii) The function $\tilde{h} := \epsilon + f - g_N$, where $g_N(x) := \sum_{-N \le n < 0}|n + 1|\epsilon\{x \in K_n\}$ for a large enough positive integer $N$, has the properties $\tilde{h} \ge h + \epsilon$ and $m(\tilde{h} - h) < 3\epsilon$.

(iii) The function $\tilde{h}$ is lower semicontinuous, that is, for each real number $r$ the set $\{\tilde{h} > r\}$ is open.

(iv) For $0 \le x \le 1$, define a continuous function $G(x) := \int_0^x \tilde{h}(t)\,dt - H(x)$. It achieves its maximum at some point $y$ in $[0, 1]$. We cannot have $y < 1$, for otherwise there would be a small $\delta > 0$ for which $\tilde{h}(t) > h(y) + \epsilon/2$ for $y \le t < y + \delta \le 1$, and $(H(y + \delta) - H(y))/\delta < h(y) + \epsilon/4$, from which we could conclude that $(G(y + \delta) - G(y))/\delta > h(y) + \epsilon/2 - (h(y) + \epsilon/4) = \epsilon/4$, a contradiction. Thus $G(1) \ge G(0) = 0$. That is, $H(1) \le \int_0^1 \tilde{h}(t)\,dt \le \int_0^1 h(t)\,dt + 3\epsilon$, for each $\epsilon > 0$.

(v) Similarly $H(x) \le \int_0^x h(t)\,dt$ for $0 \le x \le 1$.

(vi) An analogous argument applied to $-H$ gives the reverse inequality, letting us conclude that $H(x) = \int_0^x h(t)\,dt$ for $0 \le x \le 1$. The function $H$ is absolutely continuous on $[0, 1]$.
[22] Let $H$ be a continuous real function defined on $[0, 1)$. Suppose the right-hand derivative $h$ exists and is finite everywhere. If $h$ is an increasing function, show that $H$ is convex. Hint: For fixed $0 \le x_0 < x_1 < 1$ and $0 \le \alpha \le 1$ define $x_\alpha := (1 - \alpha)x_0 + \alpha x_1$. Show that

$$(1 - \alpha)H(x_0) + \alpha H(x_1) - H(x_\alpha) = \int_0^1\left(\alpha\{x_\alpha \le t < x_1\} - (1 - \alpha)\{x_0 \le t < x_\alpha\}\right)h(t)\,dt \ge 0.$$


[23] Let $h$ be integrable with respect to Lebesgue measure $m$ on $\mathbb{R}^d$. Show that

$$\lim_{r \downarrow 0}\frac{1}{mB(z, r)}\,m^x\left(\{|x - z| < r\}h(x)\right) = h(z) \qquad \text{a.e. } [m].$$

Generalize by replacing open balls by decreasing sequences of closed sets $F_n \downarrow \{z\}$ that are regular in the sense of Section 5.

8. Notes

The Fundamental Theorem <17> is due to Lebesgue, proved in part in his doctoral dissertation (Lebesgue 1902), and completed in his Peccot lectures (Lebesgue 1904). Read Hawkins (1979) if you want to appreciate the subtlety and great significance of Lebesgue's contributions. Note, in particular, Hawkins's comments on page 145 regarding introduction of the term absolute continuity. Compare the footnotes on page 129 (as reprinted in his collected works) of the 1904 edition of the Peccot lectures,

For a function to be an indefinite integral, it is necessary in addition that its total variation over a countably infinite collection of intervals of total length $\ell$ tend to zero with $\ell$.

If, in the statement on page 94, one does not require $f(x)$ to be bounded, nor $F(x)$ to have bounded derived numbers, but only the preceding condition, one obtains a definition of the integral equivalent to the one developed in this Chapter and applicable to all summable functions, bounded or not.

and on page 188 of the reprinted 1928 edition,

In the first edition of this book I had pointed out this statement, in a footnote on page 128, quite incidentally and without proof. M. Vitali rediscovered this theorem and published the first proof of it (Acc. Reale delle Sc. di Torino, 1904-1905). It was on the occasion of this theorem that M. Vitali introduced, for functions of one variable, the name absolutely continuous function, and showed the simplicity and clarity that the whole theory takes on when this notion is placed at its foundation.

The essay by Lebesgue (1926) contains a clear account of the transformation


of absolute continuity from a property of functions to a property of measures. The
book of Benedetto (1976), which is particularly interesting for its discussion of the
role played by Vitali, and the Notes to Chapters 5 and 7 of Dudley (1989), provide
more historical background.
For the discussion in Sections 4 through 6, I borrowed ideas from Saks (1937,
Chapters 4 and 7), Royden (1968, Chapter 8), Benedetto (1976, Chapter 4), and
Wheeden & Zygmund (1977, Chapter 7). The methods of Section 6 extend easily
to higher dimensional Euclidean spaces.
See Royden (1968, Chapter 6) and Dunford & Schwartz (1958, Chapter 3) for
more about the Radon-Nikodym theorem. According to Dudley (1989, page 141),
the method of proof used in Section 2 is due to von Neumann (1940), but I have
not seen the original paper.

Inequality <13> is due to Kemperman (1969), Csiszár (1967), and Kullback (1967). (In fact, Kullback established a slightly better inequality.) My proof
is based on Kemperman's argument. See Devroye (1987, Chapter 1) for other
inequalities involving distances between probability densities.
REFERENCES

Benedetto, J. J. (1976), Real Variable and Integration, Mathematische Leitfäden, Teubner, Stuttgart. Subtitled "with Historical Notes".
Csiszár, I. (1967), 'Information-type measures of difference of probability distributions and indirect observations', Studia Scientiarum Mathematicarum Hungarica 2, 299-318.
Devroye, L. (1987), A Course in Density Estimation, Birkhäuser, Boston.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Dunford, N. & Schwartz, J. T. (1958), Linear Operators, Part I: General Theory, Wiley.
Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.
Kemperman, J. H. B. (1969), On the optimum rate of transmitting information, in 'Probability and Information Theory', Springer-Verlag. Lecture Notes in Mathematics, 89, pages 126-169.
Kullback, S. (1967), 'A lower bound for discrimination information in terms of variation', IEEE Transactions on Information Theory 13, 126-127.
Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7. Included in the first volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique.
Lebesgue, H. (1904), Leçons sur l'intégration et la recherche des fonctions primitives, first edn, Gauthier-Villars, Paris. Included in the second volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique. Second edition published in 1928. Third edition, 'an unabridged reprint of the second edition, with minor changes and corrections', published in 1973 by Chelsea, New York.
Lebesgue, H. (1926), 'Sur le développement de la notion d'intégrale', Matematisk Tidsskrift B. English version in the book Measure and Integral, edited and translated by Kenneth O. May.
Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.
Saks, S. (1937), Theory of the Integral, second edn, Dover. English translation of the second edition of a volume first published in French in 1933. Page references are to the 1964 Dover edition.
von Neumann, J. (1940), 'On rings of operators, III', Annals of Mathematics 41, 94-161.
Wheeden, R. & Zygmund, A. (1977), Measure and Integral: An Introduction to Real Analysis, Marcel Dekker.

Chapter 4

Product spaces and independence


SECTION 1 introduces independence as a property that justifies some sort of factorization
of probabilities or expectations. A key factorization Theorem is stated, with proof
deferred to the next Section, as motivation for the measure theoretic approach. The
Theorem is illustrated by a derivation of a simple form of the strong law of large
numbers, under an assumption of bounded fourth moments.
SECTION 2 formally defines independence as a property of sigma-fields. The key Theorem
from Section 1 is used as motivation for the introduction of a few standard techniques
for dealing with independence. Product sigma-fields are defined.
SECTION 3 describes a method for constructing measures on product spaces, starting from
a family of kernels.
SECTION 4 specializes the results from Section 3 to define product measures. The Tonelli
and Fubini theorems are deduced. Several important applications are presented.
SECTION *5 discusses some difficulties encountered in extending the results of Sections 3
and 4 when the measures are not sigma-finite.
SECTION 6 introduces a blocking technique to refine the proof of the strong law of large numbers from Section 1, to get a version that requires only a second moment condition.
SECTION *7 introduces a truncation technique to further refine the proof of the strong
law of large numbers, to get a version that requires only a first moment condition for
identically distributed summands.
SECTION *8 discusses the construction of probability measures on products of countably
many spaces.

1. Independence

Much classical probability theory, such as the laws of large numbers and central limit theorems, rests on assumptions of independence, which justify factorizations for probabilities of intersections of events or expectations for products of random variables.

An elementary treatment usually starts from the definition of independence for events. Two events $A$ and $B$ are said to be independent if $\mathbb{P}(AB) = (\mathbb{P}A)(\mathbb{P}B)$; three events $A$, $B$, and $C$ are said to be independent if not only $\mathbb{P}(ABC) = (\mathbb{P}A)(\mathbb{P}B)(\mathbb{P}C)$ but also $\mathbb{P}(AB) = (\mathbb{P}A)(\mathbb{P}B)$ and $\mathbb{P}(AC) = (\mathbb{P}A)(\mathbb{P}C)$ and $\mathbb{P}(BC) = (\mathbb{P}B)(\mathbb{P}C)$. And so on. There are similar definitions for independence of random variables, in terms of joint distribution functions or joint densities. The definitions have two


things in common: they all assert some type of factorization; and they do not lend themselves to elementary derivation of desirable facts about independence. The measure theoretic approach, by contrast, simplifies the study of independence by eliminating unnecessary duplications of definitions, replacing them by a single concept of independence for sigma-fields, from which useful consequences are easily deduced. For example, the following key assertion is impossible to derive by elementary means, but requires only routine effort (see Section 2) to establish by measure theoretic arguments.

<1>

Theorem. Let $Z_1, \ldots, Z_n$ be independent random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. If $f \in M^+(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ and $g \in M^+(\mathbb{R}^{n-k}, \mathcal{B}(\mathbb{R}^{n-k}))$ then $f(Z_1, \ldots, Z_k)$ and $g(Z_{k+1}, \ldots, Z_n)$ are independent random variables, and

$$\mathbb{P}f(Z_1, \ldots, Z_k)g(Z_{k+1}, \ldots, Z_n) = \mathbb{P}f(Z_1, \ldots, Z_k)\,\mathbb{P}g(Z_{k+1}, \ldots, Z_n).$$

<2>

Corollary. The same conclusion (independence and factorization) holds for Borel measurable functions $f$ and $g$ taking both positive and negative values if both $f(Z_1, \ldots, Z_k)$ and $g(Z_{k+1}, \ldots, Z_n)$ are integrable.

As you will see at the end of Section 2, the Corollary follows easily from addition and subtraction of analogous results for the functions $f^{\pm}$ and $g^{\pm}$. Problem [10] shows that the result also extends to cases where some of the integrals are infinite, provided $\infty - \infty$ problems are ruled out.
The best way for you to understand the worth of Theorem <1> and its Corollary is to see it used. At the risk of interrupting the flow of ideas, I will digress slightly to present an instructive application.

The proof of the strong law of large numbers (often referred to by means of the acronym SLLN) illustrates well the use of Corollary <2>. Actually, several slightly different results answer to the name SLLN. A law of large numbers asserts convergence of averages to expectations, in some sense. The word "strong" specifies almost sure convergence. The various SLLN's differ in the assumptions made about the individual summands. The most common form invoked in statistical applications goes as follows.
<3>

Theorem. (Kolmogorov) Let $X_1, X_2, \ldots$ be independent, integrable random variables, each with the same distribution and common expectation $\mu$. Then the average $(X_1 + \ldots + X_n)/n$ converges almost surely to $\mu$.

REMARK. If $\mathbb{P}|X_1| = \infty$ then $(X_1 + \ldots + X_n)/n$ cannot converge almost surely to a finite limit (Problem [21]). Moreover Kolmogorov's zero-one law (Example <12>) implies that it cannot even converge to a finite limit at each point of a set with strictly positive probability. If only one of $\mathbb{P}X_1^{\pm}$ is infinite, the average still converges almost surely to $\mathbb{P}X_1$ (Problem [20]).

A complete proof of this form of the SLLN is quite a challenge. The classical
proof (a modified version of which appears in Sections 6 and 7) combines a number
of tricks that are more easily understood if introduced as separate ideas and not
just rolled into one monolithic argument. The basic idea is not too hard to grasp
when we have bounded fourth moments; it involves little more than an application
of Corollary <2> and an appeal to the Borel-Cantelli lemma from Section 2.6.


For theoretical purposes, for summands that need not all have the same distribution, it is cleaner to work with the centered variables $X_i - \mathbb{P}X_i$, which is equivalent to an assumption that all variables have zero expected values.

<4>

Theorem. Let $X_1, X_2, \ldots$ be independent random variables with $\mathbb{P}X_i = 0$ for every $i$ and $\sup_i \mathbb{P}X_i^4 < \infty$. Then $(X_1 + \ldots + X_n)/n \to 0$ almost surely.

Proof. Define $S_n := X_1 + \ldots + X_n$. It is good enough to show, for each $\epsilon > 0$, that

<5>  $$\sum_n \mathbb{P}\{|S_n|/n > \epsilon\} < \infty.$$

Do you remember why? If not, you should refer to Section 2.6 for a detailed explanation of the Borel-Cantelli argument: the series $\sum_n \{|S_n|/n > \epsilon\}$ must converge almost surely, which implies that $\limsup |S_n/n| \le \epsilon$ almost surely, from which the conclusion $\limsup |S_n/n| = 0$ follows after a casting out of a sequence of negligible sets.

Bound the $n$th term of the sum in <5> by $(n\epsilon)^{-4}\mathbb{P}(X_1 + \ldots + X_n)^4$. Expand the fourth power:

$$(X_1 + \ldots + X_n)^4 = \sum_i X_i^4 + (\text{lots of terms like } X_1^3X_2) + (\text{terms like } 6X_1^2X_2^2) + (\text{lots of terms like } X_1^2X_2X_3) + (\text{lots of terms like } X_1X_2X_3X_4).$$

The contributions to $\mathbb{P}(X_1 + \ldots + X_n)^4$ from the five groups of terms are:

(1) $\mathbb{P}\sum_{i \le n}X_i^4 \le nM$, where $M := \sup_i \mathbb{P}X_i^4$;

(2) zero, because $\mathbb{P}(X_1^3X_2) = (\mathbb{P}X_1^3)(\mathbb{P}X_2) = 0$;

(3) less than $3n^2M$, because $\mathbb{P}(X_i^2X_j^2) \le \tfrac{1}{2}\left(\mathbb{P}X_i^4 + \mathbb{P}X_j^4\right) \le M$;

(4) zero, because $\mathbb{P}(X_1^2X_2X_3) = (\mathbb{P}X_1^2X_2)(\mathbb{P}X_3) = 0$;

(5) zero, because $\mathbb{P}(X_1X_2X_3X_4) = (\mathbb{P}X_1X_2X_3)(\mathbb{P}X_4) = 0$.

Notice all the factorizations due to independence. Combining these bounds and equalities we get $\mathbb{P}\{|S_n|/n > \epsilon\} = O(n^{-2})$, from which <5> follows.
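The argument can be watched in action numerically. The following sketch (my addition, not from the text) uses bounded uniform summands, which certainly satisfy the fourth moment condition:

    # running averages of centered, bounded summands shrink toward 0
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    x = rng.uniform(-1, 1, size=n)          # PX_i = 0 and PX_i^4 <= 1
    avg = np.cumsum(x) / np.arange(1, n + 1)
    for k in [10, 1000, 100_000]:
        print(k, avg[k - 1])                # averages approach 0 as k grows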
If you feel that Theorem <4> is good enough for 'practical purposes,' and that all the extra work to whittle a fourth moment assumption down to a first moment assumption is hardly worth the gain in generality, you might like to contemplate the following example. How natural, or restrictive, would it be if we were to assume finite fourth moments?

<6>

Example. Let $\{P_\theta : \theta = 0, 1, \ldots, N\}$ be a finite family of distinct probability measures, defined by densities $\{p_\theta\}$ with respect to a measure $\mu$. Suppose observations $X_1, X_2, \ldots$ are generated independently from $P_0$. The maximum likelihood estimator $\hat{\theta}_n(\omega)$ is defined as the value that maximizes $L_n(\theta, \omega) := \prod_{i \le n} p_\theta(X_i(\omega))$. The SLLN will show that $\mathbb{P}\{\hat{\theta}_n = 0 \text{ eventually}\} = 1$. That is, the maximum likelihood estimator eventually picks the true value of $\theta$.


It will be enough to show, for each $\theta \neq 0$, that with probability one, $\log\left(L_n(\theta)/L_n(0)\right) < 0$ eventually. For fixed $\theta \neq 0$ define $\ell_i := \log\left(p_\theta(X_i)/p_0(X_i)\right)$. By Jensen's inequality, with a strict inequality because $P_\theta \neq P_0$,

$$\mathbb{P}\ell_i < \log\mathbb{P}\left(p_\theta(X_i)/p_0(X_i)\right) = \log\mu^x p_\theta(x)\{p_0(x) \neq 0\} \le 0.$$

By the SLLN (or its extension from Problem [20] if $\mathbb{P}\ell_i = -\infty$), for almost all $\omega$ there exists a finite $n_0(\omega, \theta)$ for which $0 > n^{-1}\sum_{i \le n}\ell_i = n^{-1}\log\left(L_n(\theta)/L_n(0)\right)$ when $n \ge n_0(\omega, \theta)$. When $n \ge \max_\theta n_0(\omega, \theta)$, we have $\max_{\theta \ge 1}L_n(\theta) < L_n(0)$, in which case the maximizing $\hat{\theta}_n$ prefers 0 to each $\theta \ge 1$.
REMARK.
Notice that the argument would not work if the index set were infinite.
To handle such sets, one typically imposes compactness assumptions to reduce to the
finite case, by means of a much-imitated method originally due to Wald (1949).

2.

Independence of sigma-fields
Technically speaking, the best treatment of independence starts with the concept
of independent sub-sigma-fields of 3", for a fixed probability space (2,7, P). This
Section will develop the appropriate definitions and techniques for dealing with
independence of sigma-fields, using the ideas needed for the proof of Theorem <2>
as motivation.

<7>

Definition. Let (2, ? , P) be a probability space. Sub-sigma-fields S i . . . . . S


of J are said to be independent if

P(Gj ... Gn) = (PGO ... (PGn)

for all G, 9,, for i = 1,.. .n.

An infinite collection of sub-sigma-fields {3/ : i e 1} is said to be independent if


each finite subcollection is independent, that is, ifF(DisGi) = r L e s ^ ' fr ^
finite subsets S of /, and all choices Gi e Si for each i in S.
The definition neatly captures all the factorizations involved in the elementary
definitions of independence for more than two events.
<8>

Example. Let A, B, and C be events. They generate sigma-fields A =


{0, A, A\ ft), and $ = {0, B, Bc, 2}, and e = {0, C, C \ Q). Independence of
the three sigma-fields requires factorization for 43 = 64 triples of events, amongst
which are the four factorizations stated at the start of Section 1 as the elementary definition of independence for the three events A, B, and C. In fact, all
64 factorizations are consequences of those four. For example, any factorization
where one of the factors is the empty set will reduce to the identity 0 = 0. The
factorization F(ABCC) = (PA)(P#C)(PC) follows from (AC) = (PA)(PC) and
F(ABC) = (PA)(PJ?)(PC), by subtraction. And so on.
Generating class arguments, such as the 7t-k Theorem from Section 2.10, make
it easy to derive facts about independent sigma-fields. For example, Problem [8]
uses such arguments in a routine way to establish the following result.

81

4.2 Independence of sigma-fields

<9>

Theorem. Let 1, . . . , be classes of measurable sets, each class stable under


finite intersections and containing the whole space Q. If
F(E\E2...)

= (FE\)(FE2)...(FEn)

for all Et ,, for i = 1 , 2 , . . . , n ,

tfien the sigma-fields a(i), a ( 2 ) , . . . , cr(n) are independent


REMARK.
The requirement that 2 G , for each i is just a sneaky way of
getting factorizations for intersections of fewer than n sets.
<io>

<n>

<12>

Corollary. Let {; : i /} be classes of measurable sets, each stable under finite


intersections. IfF(nieSEi) = f|iS PT, for all finite subsets S of I, and all choices
Et , for each i in S, then the sigma-fields a(,), for i / , are independent
Proof. Notice the alternative to requiring Q , for every 1. Theorem <9>
establishes independence for each finite subcollection.
Corollary. Let {S* : 1 /} he independent sigma-fields. If{Ij : j J} are disjoint
subsets of I, then the sigma-fields a (u I /.Si), for j J, are independent
Proof. Invoke Corollary <io> with , consisting of the collection of all finite
intersections of sets chosen from U,-/.SiExample. Let {Si : 1 N} be a sequence of independent sigma-fields. For each n
let Jin denote the sigma-field generated by U l>n Si. The tail sigma-field is defined
as Oioo := nn3n. Kolmogorov's zero-one law asserts that, for each H in Iftoo, either
FH = 0 or FH = 1. Equivalently, the sigma-field JCoo is independent of itself, so
that F(HH) = (FH)(FH) for every H in JCoo.
For each finite n, Corollary < n > implies independence of 0in, Si, . , S . From
the fact that IKQO *Kn for every n, it then follows that each finite subcollection
of {CKQO, Si : 1 N} is independent, and hence the whole collection of sigmafields is independent. From Corollary < n > again, IKoo and 3 ^ := a (U I 6 NSI) are
independent. To complete the argument, note that Joo 2 ^KooRandom variables (or random vectors, or random elements of more general
spaces) inherit their definition of independence from the sigma-fields they generate.
Recall that if X is a map from Q into a set X, equipped with a sigma-field A, then
the sigma-field a(X) on Q generated by X is defined as the smallest sigma-field S for
which X is S\A-rneasurable. it consists of all sets of the form {co e Q : X(<o) e A],
with A A.
REMARK.
The extra generality gained by allowing maps into arbitrary measurable
spaces will not be wasted; but in the first instance you could safely imagine each
space to be the real line, ignoring the fact that the definition also covers independence
of random vectors and independence of stochastic processes.

<13>

Definition. Measurable maps Xif for i e I, from & into measurable spaces
(Xi,Ai) are said to be independent if the sigma-fields that they generate are
independent, that is, if

\iS

PlfjfX/ Ai}\ = I~]P{X| A;},


ieS

for all finite subsets S of the index set I, and all choices of At e At for i S.

82

Chapter 4:

Product spaces and independence

Results about independent random variables are usually easy to deduce from
the corresponding results about independent sigma-fields.
<15>

Example.

Real random variables Xi and X2 for which

{X{ <xu X2< x2] = nX\ < *i}P{X2 ^ *2)


D

fo

all xux2 in R

are independent, because the collections of sets ,- = {{X,- < *} : JC R} are both
stable under finite intersections, and <T(X,) = a(,).
We now have the tools needed to establish Theorem <2>. Write Xi
for / ( Z i , . . . , Zk) and X2 for g(Z*+i,..., Zn). Write Sf for a(Zi), the sigmafield generated by the random variable Z,-. From Corollary < n > , the sigma-fields
$\ := <r (Si U ... U 3*) and J 2 := 0 (9M U ... U gn) are independent. If we can
show that Xi is 3j\33(R)-measurable and X2 is ^XB^-measurable, then their
independence will follow: we will have the desired factorization for all sets of the
form {Xi A\) and {X2 A2}, for Borel sets A\ and A2.
Consider first the measurability property for Xi. Temporarily write Z for
(Zi,..., Zu), a map from Q into R*. We need to show that the set
{XleA} = {Zef~l(A)}
belongs to 5\ for every A in S(R). The !B(R*)\S(R)-measurability of / ensures
that f~l{A) B(R*). We therefore need only show that {Z B] e 7\ for every B
in B(R*), that is, that the map Z is ?i\$(R*)-measurable.
As with many measure theoretic problems, it is better to turn the question
around and ask: For how extensive a class of sets B does {Z e B] belong to 7\1
It is very easy to show that the class 25o of all such B is a sigma-field; so Z is an
^iX^o-nieasurable function. Moreover, for all choices of A S(R), the set
D := {(zi,..., z*) R* : z,- A for i = 1,..., k]
belongs to *BQ because {Z D] = n/{Zi e Dj] e $\. As shown in Problem [6],
the collection of all such D sets generates the Borel sigma-field (R*). Thus
B(R*) c Bo, and {Z B] e 7X for all B e (R*). It follows that X\ is 5Ti\B(R)measurable. Similarly, X2 is 3r2\(R)-measurable. The random variables Xi and
X2 are independent, as asserted by Theorem <2>.
The whole argument can be carried over to random elements of more general
spaces if we work with the right sigma-fields.

< 16>

Definition. Let X\,..., Xn be sets equipped with sigma-fields A\,..., An. The
set of all ordered n-tuples (x\9..., xn)9 with xt X, for each i is denoted by
X\ x ... x Xn or X(<n Xi. It is called the product of the {Xi}. A set of the form
A\ x . . . x An = { ( J C I , . . . , xn) Xi x . . . x X n : jcf i4/ for each 1},

with At Ai for each i, is called a measurable rectangle. The product sigma-field


A\ ... 0 An onX\ x . . . x Xn is defined to be the sigma-field generated by all
measurable rectangles.
REMARK.
Even if n equals 2 and Xi = X2 = R, there is is no presumption
that either A\ or A2 is an intervala measurable rectangle might be composed of

4.2

83

Independence of sigma-fields
many disjoint pieces. The symbol <g> in place of x is intended as a reminder that
A\ <8>*42 consists of more than the set of all measurable rectangles A\ x A2.

If Z, is an 3\Ai-measurable map from 2 into X,, for i = 1 , . . . , n, then


the map co H* Z(CO) = (Z\(co),..., Zn(co)) from Q into X = Xi x . . . x Xn is
3Vi-measurable, where A denotes the product sigma-field A\ <g>... <g> An. If / is
an >l\B(R)-measurable real-valued function on X then / ( Z ) is ^^(RJ-measurable.
The second assertion of Theorem < i > is now reduced to a factorization
property for products of independent random variables, a result easily deduced from
the defining factorization for independence of sigma-fields by means of the usual
approximation arguments.
<n>

Lemma. Let X and Y be independent random variables. If either X > 0 and


Y > 0, or both X and Y are integrable, then F(XY) == (FX)(FY). The product XY
is integrable if both X and Y are integrable.
Proof. Consider first the case of nonnegative variables. Express X and Y as
monotone increasing limits of simple random variables (as in Section 2.2), Xn :
2""1 Ei<i<4{X > l/2n} and Yn := 2~n Ei<i<4{^ > 'V2"}- Then, for each n,
(XnYn) = 4"" J2ij W({* > V2nHY > j/2n})
= 4~n E/,y ( p ( ^ > '7 2n }) ( p (^ > 7/2 n })
n

by independence
n

= (2- Zi nx > i/i }) (i~ Ej nr > j/2 ])


= (FXn)(FYn).
Invoke Monotone Convergence twice in the passage to the limit to deduce F(XY) =
For the case of integrable random variables, factorize expectations for products
of positive and negative parts, P(X F ) = (PX )(Pr ). Each of the four products
represented by the right-hand side is finite. Complete the argument by splitting each
term on the right-hand side of the decomposition

3.

into a product of expectations, then refactorize as (PX + - FX~)(FY+ - FY~).


Integrability of XY follows from a similar decomposition for

Construction of measures on a product space


The probabilistic concepts of independence and conditioning are both closely related
to the measure theoretic constructions for measures on product spaces. As you
will see in Chapter 5, conditioning may be thought of as a inverse operation to a
general construction whereby a measure on a product space is built from families
of measures on the component spaces. For probability measures the components
have the interpretation of distributions involved in a two-stage experiment. Product
measures, and independence, correspond to the special case where the second stage
of the experiment does not depend on the first stage. Many traditional facts about

84

Chapter 4:

Product spaces and independence

independence, such as the assertion of Theorem <2>, have interpretations as facts


about product measures.
If you want to understand independence then you should learn about product
measures. If you want to understand conditioning you should learn about the more
general construction. We kill two birds with one stone by starting with the general
case. The full generality will be needed in Chapter 5.
To keep the notation simple, I will mostly consider only measures on a
product of two spaces, (X,A) and (y, $). Sometimes I will abbreviate symbols like
M+(X x y, A <g> B) to M+(X x y), with the product sigma-field assumed, or to
M+(A 0 ), with the product space assumed.
To each measure r on A *B there correspond two marginal measures fi and
A, defined by
/xA := T(A x y) for A A

and

kB := T(X x B) for B <B.

Equivalently, ix is the image of T under the coordinate projection X, which takes


(x, y) to JC, and k is the image under the projection onto the other coordinate space.
In particular, if F(X x y) = 1 then the marginals give the distributions of the
coordinate projections as X- or y-valued random variables on the probability space

<18>

In general, each marginal has the same total mass as T, which can lead to
bizarre behavior if F (X x y) = oo. For example, if X = y = R and r is Lebesgue
measure on B(R2) then each marginal assigns infinite mass to all except the Lebesgue
negligible subsets of R.
As you will see in Section 4, if F is a probability measure under which
the coordinate projections define independent random variables then r can be
reconstructed as a product of its marginal distributions. Without independence, T
is not completely determined by its marginals. Instead we need a whole family
A = {kx : x X} of measures on $, together with the marginal /x. The construction
will make senseand be usefulfor more than just probability measures, provided
the members of A are tied together by a measurability assumption.
Definition. Call a family of measures A = {kx : x X} on 3 a kernel from (X, A)
to (y, !B) if the map x H XXB is A-measurable for each B in S. In addition, call A
a probability kernel ifkx(^) = 1 for each x.
When there is no ambiguity regarding the sigma-fields, I will also speak of
kernels from X to y.
REMARK.
Probability kernels are also known by several other names: Markov
kernels, randomizations, conditional distributions, and (particularly when X is
interpreted as a parameter space) statistical models.

Suppose /x is a measure on A and A = {kx : x X} is a kernel from (X, A)


to (y, IB). The main idea behind the construction is definition of a measure on
via an iterated integral, fix (kxf(x, y)).
REMARK.
Remember the notation for distinguishing between arguments. The
superscript y means that the kx measure integrates out f(x9 y) with x held fixed.
The measure /x then integrates out over the resulting function of JC. Notice that

4.3

Construction of measures on a product space

85

superscripts denote dummy variables of integration, and subscripts denotes variables


that are held fixed. In traditional notation the iterated integral would be written

/ / /(*, y) kx(dy)n(dxl or / ( / /(*, v) kx(dy)) fi(dx).

For the iterated integral to make sense, we need to establish two key measurability properties: for each product measurable function / and each fixed x, the
map y H> f(x, y) should be ^-measurable; and the map x H> kyxf{x, y) should be
^.-measurable. In order to establish conditions under which these two measurability
properties hold, I will make use of the generating class method for A-cones, as
developed in Section 2.11.
REMARK.
It is unfortunate that the letter X should have two distinct meanings
in this Chapter: as a measure or member of a family of measures on V, and as a
prefix suggesting the idea of stability under bounded limits. Sometimes there are not
enough good symbols to go around.
Recall that a A-cone on a set Q is a family 3i + of bounded, nonnegative
functions on X with the properties:
(i) 2{ + is a cone, that is, if h\, h2 e IK+ and <x\ and a2 are nonnegative constants
then a\h\ 4- ot2h2 JC+;
(ii) each nonnegative constant function belongs to !K+;
(iii) if huh2

JC+ and hx > h2 then h\ - h2 e 0-C+;

(iv) if [hn] is an increasing sequence of functions in 5f+ whose pointwise limit h


is bounded then h e 0i+.
Recall also that: if a sigma-field 3 on 1 is generated by a subclass S (of a
k-cone 3i+) that is stable under die formation of pointwise products of pairs of
functions, then every bounded, nonnegative, ^-measurable function belongs to 0i+.
Let me illustrate the application of X-cone methods, by establishing the first
of the desired measurability properties. It would be more elegant to combine the
proofs for both properties into a single generating class argument, but I feel it is
good to see the method first in a simple setting.
<19>

Lemma. For each f in M(X x y, A <g> !B), and each fixed x in X, the function
J H - f(x,y) is *B-measurable.
Proof. It is enough to consider the case of bounded nonnegative / . The general case
would then follow by splitting / into positive and negative parts, and representing
each part as a pointwise limit of bounded functions, / * = limn ( / A n).
Write !K+ for the collection of all bounded, nonnegative, A S-measurable
functions on X x ^ for which the stated measurablity property holds. It is routine
to check the four properties identifying 0i+ as a A.-cone. It contains the class S of
all indicator functions g(jc, y) := {JC A, y B] of measurable rectangles, because
g(x, ) is either the zero function or the indicator of the set B. The class S is stable
under pointwise products, and it generates A !B.
The application of the generating argument to establish the second desired
measurablity property can be surprisingly delicate for kernels assigning infinite
measure to y. A finiteness assumption eliminates all difficulties.

86
<20>

Chapter 4:

Product spaces and independence

Theorem. Let A = {Xx : x X} be a kernel from (X, A) to (y, 2 ) with Xxy < oo
for all x, and let /x be a measure on A. Then, for each function f in M + (Xx y, .A<g>2),
(i) y H> /(JC, y) is 'B-measurable for each fixed x;
(ii) x H> Xlf(x, y) is A-measurable;
(Hi) the iterated integral (/i A ) ( / ) := nx (Xyxf(x, y)), for f in M+(X x y, yi<8>2),
defines a measure on *A <g> 2 .
Proof. Property (i) merely restates the assertion of Lemma <19>. For (ii), consider
the class IH+ := {/ e M + (X x y) : / bounded and satisfying (ii)}. Note that
kyxf{x,y) < oo for each JC if / JC4*, so there is no problem with infinite values
when subtracting to show that kyf\(x, y) klf2(x, y) is ^-measurable if f\ > fo and
both functions belong to IK+. It is just as easy to check the other three properties
needed to show that IK+ is a A.-cone. All indicator functions of measurable rectangles
belong to JC+, because Xx{x A, y B] = {JC A}(kxB). Property (ii) therefore
holds for all bounded functions in M + (X x y). A Monotone Convergence argument,
Xyxf(x,y)=

lim^(WA/(jc,y)),

noo

extends the property to all of M+(X x y).


It follows immediately that the iterated integral is well defined, thereby defining
an increasing linear functional \i % A on M + (X x y). The functional has the
Monotone Convergence property: if 0 < / \ f then

= \ix uimX^/n(jc, y) j

Monotone Convergence for Xx

= lim iix (Xyxfn (x, y))

Monotone Convergence for //.

Thus /x 0 A has all the properties required of a functional corresponding to a measure


on A 0 2 .
If /x is a probability measure and A is a probabilty kernel, then / i A defines
a probability measure on the product sigma-field of X x y. It defines a joint
distribution for the coordinate projections, X and Y. In Chapter 5, the probability
measure kx will be identified as the conditional distribution for Y, given that X = x.
The construction can be extended further by means of a second probability kernel,
N = {vx,y : (JC, y) e X x y}, from (X x y, A <g> 2 ) to another space, (X, e). For /
in M + (X x y x Z, ^1 (8) 2 6), the iterated integral
(Qi A) N) f = (/x A)x-y ( i y / ( * , y, z)) = /** (XJ (v*,y/(jf, y, z)))
is well defined. It corresponds to a probability measure on A <g> 2 <8> e, which defines
a joint distribution for the three coordinate projections, X, F, Z. We can also define a
probability kernel A <g> N from X to y x Z by means of the map JC H> A* (v* yg()>, z)),
for ^ in M + (y x Z, 2 <8) 6), thereby identifying the joint distribution as /x 0 (A <g> A/").
It makes no difference which way we interpret the probability on A <g> 2 e, which
from now on I will write as /x (8) A JV. A similar construction works for any
finite number of probability kernels. The extension to a countably infinite sequence
involves a more delicate argument, which will be presented in Section 8.

4.3

87

Construction of measures on a product space

Infinite kernels

The construction of p 0 A, for general measures and kernels, extends easily to


product measurable sets that can be covered by a countable collection of measurable
rectangles, in each of which an analog of the conditions of Theorem <20> hold.
<2i> Definition. Say that a kernel is sigma-finite if there exist countably many
measurable rectangles Ar- x ft for which X x y = UIGN>*; X ft and for which
XxBi < oo for all x A,-, for each i.
<22> Corollary. If A is a sigma-finite kernel, then the three assertions of Theorem <20>
still hold. The countably additive measure fi A on A 0 3 , defined via the iterated
integral, is sigma-finite if fi is a sigma-finite measure on A.
Proof. Only property (ii), the measurablity of x \-> Xxf(x,y) for each / in
M + (X x y), requires any new argument, because the finiteness of Xxy was used only
in the proof of the corresponding assertion in the Theorem.
Temporarily, I will use the word rectangle as an abbreviation for measurable
rectangle. The difference of two rectangles can be written as a disjoint union of two
CxD
other rectangles: (Ax B)\(C x D) = (ACC x B)U
(AC x BDC). In particular, we can write At x B2 as
I a disjoint union of two rectangles, each disjoint from
A
c
CxBD
A\ x B\; then write A3 x B3 as a disjoint union of at
most four rectangles, each disjoint from both A\ x B\
R
I
and A2 x B2; and so on. In other words, with no loss
of generality, we may assume the rectangles A,- x ft are pairwise disjoint, thereby
ensuring that i e N A,- x Bf = 1.
For / in M + (X x y), the integral kyxf(x, y) breaks into a countable sum,

D
<23>

<24>

K E,- N({* At, y e Bi)f{x% y)) = | N { * Ai)ki({y e ,}/(*, y)).


For the ith summand, we may regard Xx as a kernel from Af to Bi9 which lets us
invoke Theorem <20> to establish measurability of that summand as a function of x.
Thus A.*/(x, y) is a countable sum of .A-measurable functions, which establishes (ii).
If /x is sigma-finite, we may assume, with no loss of generality, that /x A, < 00 for
every 1. Define A in = {x Af : XxBi < n}. Then /x <8> A (Af> x 5,-) < n^Aiyn < 00,
and X x y = UIN,nN (Af> x ft).
Example. If X = y = R and each Xx and fi equals (one-dimensional) Lebesgue
measure X on S(M), then the resulting /x A may be taken as the definition of
two-dimensional Lebesgue measure 1112 on S(E 2 ). (For a more direct construction,
see Section A.5 of Appendix A.)
Example. Let ft denote the N(0,1) distribution on 3(M), and Xx denote the
N(px, 1 - p2) distribution, also on S(M), for a constant p with |p| < 1. For
/ M+(l2),
//
/
AW
x /ex
(-jc 2 /2)
A) / = m I P , xn yy I
x

/exp (-x2/2 - (y - px)2/2(l - p2))

= m nr I

- p2)

88

Chapter 4:

Product spaces and independence

That is, /JL 0 A is absolutely continuous with respect to two-dimensional Lebesgue


measure, with density
(JC2 2pxy H

2nJl^l

4.

\
)

The probability measure //, 0 A is called the standard bivariate normal distribution
with correlation p .

Product measures
A particularly important special case of the constructions from the previous Section
arises when A = [k] for a fixed measure k on , that is, when kx = k for all x.
Sigma-finiteness of the kernel A is then equivalent to sigma-finiteness of A. in the
usual sense: the space y should be a countable union of sets each with finite k
measure. I will abbreviate / A 0 {k} to /z0A., the conventional notation for the product
of the measures \x and A. That is, \x 0 k is the measure on A 0 defined by the
linear functional
(Ji A.) ( / ) := \ix {kyf(xy y))

<25>

for / M + (X x y, A 0 B).

In particular, (/z 0 k) (A x ) = (fiA)(kB).


If /x is also sigma-finite, we can reverse the roles of the two measures,
integrating first with respect to /JL and then with respect to k, to build another
measure on A 0 !B, which again would give the value (/jiA)(kB) to the measurable
rectangle A x B. As shown in Problem [11], the equality for the generating class
of all measurable rectangles ensures that the new linear functional defines the same
measure on A 0 S.
Tonelli T h e o r e m . If ii is a sigma-finite measure on (X, A), and k is a sigma-finite
measure on (y, ) , then, for each f in M+ (X x y, A 0 !B),
(i) y i-> /(JC, y) is <B-measurable for each fixed JC, and x H> /(JC, y) is Ameasurable for each fixed y;
(ii) x H> kyf(x, y) is A-measurable, and y h* iixf(x, y) is *B-measurable;
(Hi) (jiX)f = iL* Wfix, 3O) = W (/**/(*, 30).
Proo/. Assertions (i) and (ii) follow immediately from Corollary <22>. The third
assertion merely restates the fact that the linear fiinctionals defined by both iterated
integrals correspond to the same measure on A 0 !B.
See Problem [12] for an example emphasizing the need for sigma-finiteness.

<26>

E x a m p l e . Let fi be a sigma-finite measure on A. For / in M + (X, A) and each


constant p > 1, we can express \x (fp) as an iterated integral,
H (fp) = nx ( m > (pyp-l{y

: f(x) >y>

0})) ,

where m denotes Lebesgue measure on 3(R). It is not hardalthough a little


messy, as you will see from Problem [2]to show that the function g(x,y) :=
pyp~{[f(x) > y > 0}, on X x R, is product measurable. Tonelli lets us reverse the

4.4

89

Product measures

order of integration. Abbreviating fix[y : f(x) > y > 0} to / z { / > y] and writing
the Lebesgue integral in traditional notational, we then conclude that

In particular, if /z (fp)
y -> 00.

= pf
< oo then /z{/Jo
> 3;} must decrease to zero faster than y~p

as

The definition of product measures, and the Tonelli Theorem, can be extended
to collections / z i , . . . , \in of more than two sigma-finite measures, as for kernels.
<27>

Example. Apparently every mathematician is supposed to know the value of the


constant C := i!!^ exp(-* 2 ) dx. With the help of Tonelli, you too will discover
that C = y/n. Let m denote Lebesgue measure on S(R) and mi = m <g> m denote
Lebesgue measure on !B(R2). Then
C2 = m W exp(-x 2 - y2) = m*'* (mz{;t2 + /

< z}e~z} .

The ni2 measure of the ball [x2 + y2 < z}, for fixed positive z, equals nz, A change
in the order of integration leaves mz (7r{0 < z}ze~z) = n as the value for C 2 .
The Tonelli Theorem is often invoked to establish integrability of a product
measurable (extended-) real-valued function / , by showing that at least one of
the iterated integrals /z* (A^l/Ot, y)\) or Xy (/Z X |/(JC, ;y)|) is finite. In that case, the
Theorem also asserts equality for pairs of iterated integrals for the positive and
negative parts of the function:

HxVf+(x, y) = W / + ( * > y) < 00,


with a similar assertion for / " . As a consequence, the ^-measurable set

tyi := {x : AV+(*. y) = ooor \yf~(x, y) = 00}


has zero /z-measure, and the analogously defined S-measurable set Nx has zero

A. measure. For x N^ the integral Wf(x,y) := kyf+(x,y) - Vf~{x,y) is

well defined and finite. If we replace / by the product measurable function


/(JC, y) := /(JC, y){x Nv, y Nk], the negligible sets of bad behavior disappear,
leaving an assertion similar to the Tonelli Theorem but for integrable functions taking
both positive and negative values. Less formally, we can rely on the convention that
a function can be left undefined on a negligible set without affecting its integrability
properties.
<28>

Corollary (Pubini Theorem). For sigma-finite measures /z and A., and a product
measurable function f with (/JL 0 A.) | / | < 00,
(i) y H> f(xyy) is *B -measurable for each fixed x; and x \-+ f(x,y)
A-measurable for each fixed y;

is

(ii) the integral kyf(x, y) is well defined and finite fi almost everywhere, and
x H>- A//(JC, >>) is /x-integrable; the integral /z*/(JC, y) is well defined and
finite k almost everywhere, and y H* \XX f{x,y) is X-integrable;
(Hi) (/z (g) k) f = /z* ( * v u .

y)) = v (M*/(*. y))-

90

Chapter 4:

Product spaces and independence

REMARKS.
If we add similar almost sure qualifiers to assertion (i), then the
Fubini Theorem also works for functions that are measurable with respect to ? , the
/it 0 X completion of the product sigma-field. The result is easy to deduce from
the Theorem as stated, because each ^-measurable function / can be sandwiched
between two product measurable functions, fo < f < fu with /o = / i , a.e. [fx <g> A.].
Many authors work with the slightly more general version, stated for the completion,
but then the Tonelli Theorem also needs almost sure qualifiers.
Without integrability of the function / , the Fubini Theorem can fail, as shown
by Problem [13]. Strictly speaking, the sigma-finiteness of the measures is not
essential, but little is gained by eliminating it from the assumptions of the Theorem.
As explained in the next Section, under the traditional definition of products for
general measures, integrable functions must almost concentrate on a countable union
of measurable rectangles each with finite product measure.

Product measures correspond to joint distributions for independent random


variables or random elements. For example, suppose X is a random element of
a space (X, A) (that is, X is an ^.A-measurable map from Q into X), and Y is
a random element of a space (^, ) . Let X have (marginal) distribution fi and Y
have (marginal) distribution X. That is, F{X A] = fiA for each A in A, and
P{F e B] = XB for each B in tB. The joint distribution of X and Y is the image
measure of P under the map co Y-> (X((O), Y(CO)). It is the probability measure Q
on A $ defined by QD = Pf (X, Y) e D). If X and F are independent, and if D
is a measurable rectangle, then QD factorizes:
Q(A x B) = F{X e A, Y e B] = F{X e A}F{Y e B} =

(fiA)(XB).

That is, QD = (/x X) D for each D in the generating class of measurable rectangles.
It follows that Q = }JL <g> X, as measures on the product sigma-field.
Conversely, if Q is a product measure then F{X e A, Y e B] factorizes,
implying that X and Y are independent.
In short: random variables (or random elements of general spaces) are independent if and only if their joint distribution is the product of their marginal
distributions.
Facts about independence can often be deduced from analogous facts about
product measures. In effect, the proofs of the Tonelli/Fubini Theorems are equivalent
to several of the standard generating class and Monotone Convergence tricks used
for independence arguments. Moreover, the results for product measures are stated
for functions, which eliminates a layer of argument needed to extend independence
factorizations from sets to functions.
<29>

E x a m p l e . The factorization asserted by Theorem <2> follows from Tonelli's


Theorem. Let X = ( Z j , . . . , Zk) and Y = (Zk+u , Zn). The function (JC, y) H>
f(x)g(y) is product measurable. Why? With Q, /x, and v as above,

P(f(X)g(Y))

= (P/(X)) (Fg(Y))

= Q x y (f(x)g(y))
image measure
x y
= (^ 0 v) * (f(x)g(y))
independence
y
x
= v (fx f(x)g(y))
Tonelli
images measures.

4.4

<30>

91

Product measures

In the third line on the right-hand side the factor g(y) behaves like a constant for
the [i integral.
Example. The image of fx v, a product of finite measures on R* x R*, under
the map T(x, y) = x + y is called the convolution of the two measures, and is
denoted by ft v (or v /x, the order of the factors being irrelevant). If fi and
v are probability measures, and X and Y are independent random vectors with
distributions /JL and v, then the product measure /x 0 v gives their joint distribution,
and the convolution $i v gives the distribution of the sum X 4- F.
By the Tonelli Theorem and the definition of image measure,
QL * v)(/) = iixvyf(x

+ y)

for / M+(R*).

When v has a density &(-) with respect to it-dimensional Lebesgue measure m, the
innermost integral on the right-hand side can be written, in traditional notation, as
/ Hy)f(* + y)dy, which invites us to make a change of variable and rewrite it
as / 8(y x)f(y)dy. More formally, we could invoke the invariance of Lebesgue
measure under translations (Problem [15]) to justify the reexpression. Whichever
way we justify the change, the convolution becomes
= myiix8(y - x)f(y)

by Tonelli.

Writing g(y) := [ix8(y x), we have (/Lt v) / = my (g(y)f(y))>


density g with respect to m.

That is, JJL v has

REMARK.
The statistical techniques known as density estimation and nonparametric smoothing rely heavily on convolutions.
<3i>

Exercise.

The N(0, a2) distribution on the real line is defined by the density

with respect to Lebesgue measure. Show that the convolution of two normal
distributions is normal: N(0\,af) N($2, o%) = N{0\ + 02, a\ + a%).
SOLUTION: The convolution formula from Example o o > (with \x as the N(0\Jarf)
distribution and 8 as the N(02, a2) density) becomes
Ky

2)

dx.

Make the change of variable z = x 0\, and replace y by 0\ -h ^2 + y (to get a neater
expression).
r2

Complete the square, to rewrite the exponent as

rv-T^ 2 \

1 dz.

92

Chapter 4:

Product spaces and independence

The coefficient of y2 simplifies to - ( 2 a 2 ) " 1 , where a2 = of + or|. When integrated


out, the quadratic in z contributes a factor y/lnr, leaving the appropriate multiple
of exp(y2/2o2). The convolution result follows.
REMARK.
We could avoid most of the algebra in the Example, by noting that
g(y + Q\ + #2) has the form C\ exp(Ciy2), for constants Ci and C2. Nonnegativity
and integrability of the density would force both constants to be strictly positive,
which would show that g must be some N(i,o2) density. Calculation of means and
variances would then identify /x and a2.
There is also a quicker derivation of the result, based on Fourier transforms
(see Chapter 8), but a distaste for circular reasoning has compelled me to inflict on
you the grubby Calculus details of the direct convolution argument: I will derive the
form of the N(0, a1) Fourier transform by means of a central limit theorem, for the
proof of which I will use the convolution property of normals.

<32>

Example. Recall from Section 2.2 the definition of the distribution function Fx
and its corresponding quantile function for a random variable X:
Fx(x) = F{X < x]

for x e R,

qx(u) = inf{f : Fx(t) > u]

for 0 < u < 1.

The quantile function is almost an inverse to the distribution function, in the sense
that Fx(x) > u if and only if qx(u) < x. As a random variable on (0,1) equipped
with its Borel sigma-field and Lebesgue measure P, the function X := qx(u) has
the same distribution as X. Similarly, if Y has distribution function FY and quantile
function qy, the random variable Y := qy(u) has the same distribution as Y.
Notice that X and Y are both defined on the same (0,1), even though the
original variables need not be defined on the same space. If X and Y do happen
to be defined on the same 2 their joint distribution need not be the same as the
joint distribution for X and Y. In fact, the new variables are closer to each other,
in various senses. For example, several applications of Tonelli will show that
P|X - Y\P > F\X - Y\P for each p > 1.
As a first step, calculate an inequality for tail probabilities.
{X > JC, Y > y] < min (P{X > JC}, F{Y > y})

= 1 - Fx(x) v FY(y)
u>
0

<33>

Fx(x)vFY(y)}du
), y <qy(u)}du

= P{X > JC, Y > y)


We also have P{X > JC} = P{X > x] and F{Y > y) = F{Y > y], from equality of the
marginal distributions. By subtraction,
P({X > x) + (Y > y] - 2{X > JC, Y > y})

<34>

> P({X > JC} + {Y > y] - 2{X > JC, Y > y})

for all JC and y.

93

4.4 Product measures

The left-hand side can be rewritten as

I P ({X(a))

> x f y > Y(a>)} + {X(co)

< x , y < Y(co)})

a nonnegative function just begging for an application of Tonelli. For each real
constant s, put y = x + s then integrate over x with respect to Lebesgue measure m
on !B(E). Tonelli lets us interchange the order of integration, leaving
IP (mx{X(co) >x> Y(co)
= IP ((X(co) - Y(co)

- s) + mx{X((o) < x < Y(co)

- s})

+ s ) + (K(o>) - s - X(co))+)

Argue similarly for the right-hand side of <34>, to deduce that


P|X - Y + s\ > P|X - Y + s\

for all real s.

For each nonnegative t, invoke the inequality for s = r then 5 = - r , then add.
P(|X - F + r| + |X - Y - t\) > P(|X - Y H- f| + |X - Y - t\)

for all t > 0.

An appeal to the identity |z + f| + |z - f| = 2* + 2(|z| - 1 ) , for z E and


t > 0, followed by a cancellation of common terms, then leaves us with a useful
relationship, which neatly captures the idea that X and Y are more tightly coupled
than X and F.

P (|X - Y| - 0 + > P (i* ~ ?i ~ 0 +

<35>

f o r a11 f

^ -

Various interesting inequalities follow from <35>. Putting t equal to zero we


get P|X - Y\ > P|X - Y\. For p > 1, note the identity

D? = p(p-l)

(D- t)tp~2 dt = p(p - l)m^ [tp-2 (D - 0 + )

for D > 0,

where mo denotes Lebesgue measure on S (M+). Temporarily write A for |X - F|


and A for |X - Y\. Two more appeals to Tonelli then give

P|X - Y\P = p{p - l)m (tp-2 (A(co) > p(p D

*5.

See Problem [17] for the analogous inequality, P^(|X - F|) > P^(|X - F|), for
every convex, increasing function f on 1 + .

Beyond sigma-finiteness
Even when the kernel A is not sigma-finite, it is possible to define fi A as a
measure on the whole of A S . Unfortunately, the obvious methodby means
of an iterated integralcan fail for measurability reasons. Even when kx does
not depend on x, the map x h* A*/(JC, y) need not be measurable for all / in
M+i% x y, A (8) B), as shown by the following counterexample.

94

<36>

Chapter 4:

Product spaces and independence

Example. Let X equal [0,1], equipped with Borel sigma-field A, and y also
equal [0,1], but equipped with the sigma-field of all subsets of [0,1]. Let be a
subset of [0,1], not .A-measurable. Let A. be the counting measure on , interpreted
as a measure on . That is, A. B equals the cardinality of E n B, for every subset B
of [0,1]. The Borel measurable subset D = {x = y] of [0,1] 2 also belongs to the
larger sigma-field A <g> 23. The function x \-+ ky{(x, y) e D) equals 1 when x ,
and is 0 otherwise; it is not yi-measurable.
In general, the difficulty lies in the possible nonmeasurability of the function
x - A.J/(JC, JO. When / equals the indicator function of a measurable rectangle
the difficulty disappears, because x H> XXB is assumed to be .A-measurable for
every B in !B. If [ix (A.{JC A,y B}) < OO then Ao := {x e A : XXB = oo} has
zero fi measure. The method of Theorem <20> defines a measure on the product
sigma-field of (A\Ao) x B, by means of an iterated integral,

Tf := fix (kyf(x, y))

if / M+ and / = 0 outside (A\A0) x B.

For the remainder of the Section, I will use the letter F, instead of the symbol
M <g> A, to avoid any inadvertent suggestion of an iterated integral.
The same definition also works if / is required to be zero only outside A x /?,
provided we ignore possible nonmeasurability of Xyxf(x,y) on a /x-negligible set.
More formally, we could assume [i to be complete, so that the behavior for x in AQ
has no effect on the measurability. The method of Corollary <22> can then be
applied to extend F to a measure on a larger collection of subsets.
<37>

Definition. Write % for the collection of all A *B-measurable sets D for which
there exist measurable rectangles with D c U/eNAj x Bt and ixx ({JC e Ai)kxBi) < oo
for each i. Denote by M+(X x y, # ) , or just M + (#), the collection of functions f
in M+(X x y,.A B) for which {/ / 0} e X.
The collection Jl is stable under countable unions and countable intersections,
and differences (that is, if /?,- % then R\\R2 #)that is, Jl is a sigma-ring.
It need not be stable under complements, because the whole product space X x y
might not have a covering by a sequence of measurable rectangles with the desired
propertythat is, Jl need not be a sigma-field.
The definition of a countably additive measure on a sigma-ring is essentially
the same as the definition for a sigma-field. There is a one-to-one correspondence
between measures defined on*Rand increasing linear functionals on M + (#) with the
Monotone Convergence property. In particular, the iterated integral /x* (A.*/(JC, y))
defines an increasing linear functional on / e M + (#), corresponding to a measure
on ft.
For product measurable sets not in % there is no unique way to proceed.
A minor extension of the classical approach for product measures (using
approximation from above, as in Section 12.4 of Royden 1968) suggests we should
define Y(D) as the infimum of i N M* ({x Ai}A.xfi,-), taken over all countable
coverings of D by measurable rectangles, U/ NA; X B,-. This definition is equivalent
to putting T(D) = oo when D (./i:B)\3. Consequently, we would have
F ( / ) = oo for each nonnegative, product measurable / not in M + (ft). For the

4.5 Beyond sigma-finiteness

95

particular case where k == A., if / is product measurable and r ( | / | ) < oo, then
{/ ^ 0} c U I NA| x Bt, for some countable collection of measurable rectangles with
I N (jiAi) (kBi) < oo. For each i in the set J^ := {i : /xA, = oo} we must have
XBj = 0, which implies that the set Nk := UijfiBi is ^.-negligible. Similarly, the
set Nn := UlyxA|, where Jx := {i : kBt = oo}, is /^-negligible. When restricted to
Xo := Ul^yMi4l, the measure fi is sigma-finite; and when restricted to % := U^Z?,-,
the measure X is sigma-finite. Corollary <28> gives the assertions of the Fubini
Theorem for the restriction of / to Xo x %. For trivial reasons, the Fubini Theorem
also holds for the restriction of / to N^ x y or X x Nx.
REMARK.
In effect, with the classical approach, the general Fubini Theorem is
made to hold by forcing integrable functions to concentrate on regions where both
measures are sigma-finite. The general form of the Fubini Theorem is really just a
minor extension of the theorem for sigma-finite measures. I see some virtue in being
content with the definition of the general /i <8> A as a measure on the sigma-ring 31.

6.

<38>

SLLN via blocking


The fourth moment bound in Theorem <4> is an unnecessarily strong requirement.
It reflects the crudity of the Borel-Cantelli argument based on <5> as a method for
proving almost sure convergence. Successive terms contribute almost the same tail
probability to the sum, because the averages do not change very rapidly with n. It
is possible to do better by breaking the sequence of averages into blocks of similar
terms. We need to choose the blocks so that the maximum of the terms over each
block behaves almost like a single average, for then the Borel-Cantelli argument
need be applied only to a subsequence of the averages.
Probabilistic bounds involving maxima of collections of random variables are
often called maximal inequalities. For the SLLN problem there are several types
of maximal inequality that could be used. I am fond of the following result because
it also proves useful in other contexts. The proof introduces the important idea of a
first passage time.
Maximal Inequality. Let Z\, ..., Z# be independent random variables, and \,
2, and p be positive constants for which P{|Z,- + Z,+i + ... + ZN\ < e2] > l/P for
each i. Then

p|max|Zi + ... + Z,| ><?! +2\ <


REMARK.
The same inequality holds if the {Z,} are independent random vectors,
and | | is interpreted as length. The inequality is but one example of a large class
of results based on a simple principle. Suppose we wish to bound the probability
that a sequence {5, : i = 0, 1 , . . . , N] ever enters some region IR. If it does enter %
there will be a first time, r, at which it does so. If we can show that the process has
a (conditional) probability of at least \/fi of 'staying near the scene of the crime'
(that is, of remaining within a distance 6 of D at time N), then the probability of
the event {ZN within of 31} should be at least \/fi times the probability of the
event {the process hits CR}. Of course the time r will be a random variable, which

96

Chapter 4:

Product spaces and independence

means that SN ST need not behave like a typical increment SN 5,-. We avoid that
complication by arguing separately for each event {r = i).
There are also conditional probability versions of the inequality, relaxing the
independence assumption.

Proof. Define

5, := Z, + ..

+ z,,

-st
Define a random
A

y \ i

r\

variable r ( a first passage time) by

te2
Id

*
X
of the asserted inequality equals

first i for which \St 1 > ^i


A^ if 15,1 <e{ + for all i.
Notice that the events {r = i) for i = 1 , . . . , N
are disjoint. The probability on the left-hand side

P{r = i and |5 | > e\ 4- Q for some i }


= YriLi ^ { r = *' \$\

>

i + ^}

disjoint events

< X!|Li P{* = , I-S.-I > \ 4- 2} )8P{|7f I < 62}

definition of .

The event {T = 1, |5,-| > 1 4- 62} is a function of Z i , . . . , Z,-; the event {|7}| < 62}
is a function of Z,+i,..., Z#. By independence, the product of the probabilities in
the last displayed line is equal to the probability of the intersection. The sum is less
than
If I Si-1 > 61 4- 62 and |7}| < 62 then certainly \SN\ > \. The sum is less than
Q

P E I nr

= /, |5AT| > 6,} = /8P{|SA,| > 61}, as asserted.

REMARK.
Notice how the disjointness of the events [r i} for i = 1 , . . . , N
was used twice: first to break the maximal event into pieces and then to reassemble
the bounds on the pieces. Also, you should ponder the choice of r value for the
case where \St\ < \ +2 for all i. Where would the proof fail if I had chosen r = 1
instead of T = N for that case?

The Maximal Inequality will control the behavior of the averages over blocks of
successive partial sums, chosen short enough that the ft constants stay bounded. The
Borel-Cantelli argument can then be applied along the subsequence corresponding
to the endpoints of the blocks. The longer the blocks, the sparser the subsequence
of endpoints, and the easier it becomes to establish convergence for the sum of tail
probabilities along the subsequence. With block lengths increasing geometrically
fast, under mild second moment conditions we get both acceptable ft values and tail
probabilities decreasing rapidly enough for the Borel-Cantelli argument.
<39>

Theorem. (Kolmogorov) Let X\, X2, . . . be independent random variables with


FXi = 0 for every i. If \ FXf/i2 < 00 then (X\ + . . . 4- Xn)/n - 0 almost surely.
Proof. Define St := X\ 4 - . . . 4- Xt and
V(i) := of + . . . + or? = PS?
Bk : {n : nk < n < nk+\]

where or? := PX?,


where nk := 2*, for k = 1, 2 , . . . .

97

4.6 SLLN via blocking


REMARK.
The nk define the blocks of terms to which the Maximal Inequality
will be applied. You might try experimenting with other block sizes, to get a feel
for how much latitude we have here. For example, what would happen if we took
nk = 3*, or kkl

The series ]T^ V (nk) /nj is convergent, because

ED

*=1 , = 1

i=\

The innermost sum on the right-hand side is just the tail of a geometric series,
started at *(i), the smallest k for which i < 2*. The sum equals

The almost sure convergence assertion of the Theorem is equivalent to


<40>

max

\Sn\

neBk

ft

>0

as k - oo.

It therefore suffices if we show, for each > 0, that


|5

"'

- < o o .

Replace the n in the denominator for thefcthsummand by the smaller number i


then expand the range from Bk to {n < nk+\] to bound the probability by
P

max \Sn\

[n<nk+\

By the Maximal Inequality, this probability is less than


PkP{\SHk+l\ > enk] <
where
Pkx = min P{|5njk+1 - Sn\ < nk]
> 1-max

Knk+l

"; > 1 -

y2v7'i/ ^ 1

as k -> oo.

The A:th term in <4i> is eventually smaller than a constant multiple of


which establishes the desired convergence.

V(nk+\)/n\+v

*7. SLLN for identically distributed summands


Theorem <39> is not only useful in its own right, but also it acts as the stepping
stone towards the version of the SLLN stated as Theorem <3>, where the moment
assumptions are relaxed even further, at the cost of a requirement that the variables
have identical distributions. The proof of Theorem <3> requires a truncation
argument and several appeals to the assumption of identical distributions in order
to reduce the calculation of bounds involving many Xt to pointwise inequalities

98

Chapter 4:

Product spaces and independence

involving only a typical X\. Actually slightly less would suffice. We only need a
way to bound various integrals by quantities that do not change with n.
Proof of Theorem <3>. With no loss of generality, suppose fi = 0 (or replace X,by the centered variable X, JJL)9 SO that the Theorem becomes: for independent,
identically distributed, integrable random variables X\, X2,... with PX,- = 0,

-> 0

almost surely.

Break each X,- into two contributions: a central part, F, = X,-{|X,-| < 1}, which,
when recentered at its expectation /x, = PX,-{|X,-| < 1}, becomes a candidate for
Theorem <39>; and a tail part, X,- - Yt = X,-{|X,-| > 1}, which can be handled by
relatively crude summability arguments. Notice that, after truncation, the summands
no longer have identical distributions.
REMARK.
The choice of i as the ith truncation level is determined by the
available moment information. More generally, finiteness of a /?th moment would
allow us to truncate at il/p for almost sure convergence arguments.

The truncation constants increase fast enough to allow us to dispose of various


fragments by means of the first moment assumption. For example, using the identity
0 = PX/ = fit + PX,-{|X,-| > /}, and the identical distribution assumption, we have
< P|X,|min
in (l, l-^)
-^ -* 0,

<42>

the final convergence to zero following by Dominated Convergence. Notice how


the contribution from X,{iX,| < i) was related to a contribution from X,-{|Xj| > 1}.
Without that trick, the analog of <42> would have given the useless upper
bound P|X! I.
The rate of growth of the truncation constants is also fast enough to ensure
that the truncation has little effect on the summands, as far as the almost sure
convergence of the averages is concerned:
00

00

X>f*/ # Yt) < ]TP{|X,| > 1} < P|Xi| < 00.


i=l

i=l

Again the identical distributions have reduced the argument to calculation of a


pointwise bound involving the single Xi. It follows that, with probability one, there
exists a positive integer io(co) such that, for n > io(co),
<43>

- T Xi(a>) - n T-*
n
i<n

< - T \Xi(a>)\ - 0.
nn *f

Notice that the last sum does not change as n increases.


The truncation constants increase slowly enough to allow us to bound second
moment quantities for the truncated variables by first moments of the original
variables. Here I leave some minor calculus details to you (see Problem [25]). First
you should establish a deterministic inequality: there exists a finite constant C such
that
<44>

|JC|2 J^fkl
1=1

< J'JTJ S C\X\


l

for each real x.

4.7

99

SLLN for identically distributed summands

From this bound and the inequality P(Y, - /z,) 2 < py? = PX^{|Xi| < i}, deduce that
oo p / y . _ ii.)2

y\

.2

< CPIXI i < oo.

It follows by Theorem <39> that


<45>

- /AF/ lit) -^ 0
n

almost surely.

T<n

The asserted SLLN for the identically distributed {X,} then follows from the
results <42> and <43> and <45>.

*8.

Infinite product spaces


The SLLN makes an assertion about an infinite sequence [Xn : n e N} of independent
random variables. How do we know such sequences exist? More generally, for an
assertion about any sequence of random variables {Xn}, with prescribed behavior for
the joint distributions of finite subcollections of the variables, how do we know there
exists a probability space (2, 7, P) on which the {Xn} can be defined? Depending
on one's attitude towards rigor, the question of existence is either a technical detail
or a vital preliminary to any serious study of limit theorems. If you prefer, at least
initially, to take the existence as a matter of faith, then you will probably want to
skip this Section.
To accommodate more general random objects, suppose X,- takes values in a
measurable space (X,-, .A,-). For finite collections of random elements X i , . . . , X n , we
can take Q as a product space X,<n X,- equipped with its product sigma-field l < n A ,
with the random elements X, as the coordinate projections. For infinite collections
{Xn : n N) of random elements, we need a way of building a probability measure
on Q = X I N Xf, the set of all infinite sequences co := (JCI, JC2,...) with JC, X, for
each i. The measure should live on the product sigma field 7 := I G N A , which is
generated by the measurable hyperrectangles X I N A,- with At A for each i.
It is natural to start from a prescribed family {Pw : n e N} of desired finite
dimensional distributions for the random elements. That is, P n is a probability
measure defined on the product sigma-field 3>, := A\ 0 . . . 0 An of the product
space Qn := Xi x . . . x X n , for each n. The X,'s correspond to the coordinate
projections on these product spaces. Of course we need the P n 's to be consistent in
the distributions they give to the variables, in the sense that

<46>

P n + i (F x X B+ i) = P B + i { ( X i , . . . , XH) e F, Xn+i X n+1 } = FnF

all F in 7 , ,

or, equivalently,
P w+ i(jri, ...*) = Vng(xu .*)

for

all g e M + ( Q n , %).

Such a family of probability measures is said to be a consistent family of finite


dimensional distributions. Within this framework, the existence problem becomes:
For a consistent family of finite dimensional distributions {Fn : n N}, when
does there exist a probability measure P on 7 for which the joint distribution
of the coordinate projections (X\,..., X n ) equals P n , for each finite n?

100

Chapter 4:

Product spaces and independence

Roughly speaking, when such a P exists (as a countably additive probability


measure on JF), we are justified in speaking of the joint distribution of the whole
sequence {Xn : n N}. If such a P does not exist, assertions such as almost sure
convergence require delicate interpretation.
The sigma-field J n is also generated by the cone 9+ of all bounded, nonnegative,
Jn -measurable real functions on Qn. Write !K+ for the corresponding cone of
all functions on Q of the form h(co) = gn(co\n), where co\n denotes the initial
segment (jti, ..., xn) of eo and gn e St- A routine argument shows that the cone
J{+ := UnN3iJ generates the product sigma-field 7 on Q. Consistency of the family
{Fn : n N} lets us define unambiguously an increasing linear functional P on 3C1"
by
<47>

Fh : = Fngn

if h(co) = gn(co\n).

The functional is well defined because condition <46> ensures that Fngn = Fn+\gn+\
when h(co) = gn+i(a)\n + 1) = gn(co\n), that is, when h depends on only the first n
coordinates of a>. Some authors would call P a finitely additive probability.
From Appendix A.6, the functional P has an extension to a (countably additive)
probability measure on 7 if and only if it is cr-smooth at 0, meaning that P/*, \ 0
for every sequence {hi} in J( + for which 1 > hi \, 0 pointwise.
REMARK.
The a-smoothness property requires more than countable additivity
of each individual P rt . Indeed, Andersen & Jessen (1948) gave an example of weird
spaces {Xj} for which P is not countably additive (see Problem [29] for details).
The failure occurs because a decreasing sequence {hi} in 5 + need not depend on
only a fixed, finite set of coordinates. If there were such a finite set, that is, if
hi(co) = gi(con) for a fixed n, then we would have P/i, = Pnfc, > 0 if hi | 0. In
general, ht might depend on the first nt coordinates, with nt oo, precluding the
reduction to a fixed .

In the literature, sufficient conditions for a-smoothness generally take one of two
forms. The first, due to Daniell (1919), and later rediscovered by Kolmogorov (1933,
Section III.4) in slightly greater generality, imposes topological assumptions on
the X,- coordinate spaces and a tightness assumption on the P n 's. Probabilists who
have a distaste for topological solutions to probability puzzles sometimes find the
Daniell/Kolmogorov conditions unpalatable, in general, even though those conditions
hold automatically for products of real lines.
REMARK.
Remember that a probability measure Q on a topological space X is
called tight if, to each > 0 there is a compact subset Ke of X such that Q ^ < e.
If X = Xi x . . . x X, then Q is tight if and only if each of its marginals is tight.

The other sufficient condition, due to Ionescu Tulcea (1949), makes no


assumptions about existence of a topology, but instead requires a connection between
the {Fn} by means of a family of probability kernels, An+\ = {k^ : con e Qn} from
Qn to X n+ i, for each n e N. Indeed, a sequence of probability measures can be
constructed from such a family, starting from an arbitrary probability measure Pi
<48>

Fn := Pi 0 A2 0 A 2 . . . An.

101

4.8 Infinite product spaces


More succinctly, Fn = Pn_i <8> An for n > 2. The requirement that each An be
a probability kernel ensures that {n : n e N} is a consistent family of finite
dimensional distributions.
<49>

Theorem. Suppose [Fn : n N} be the consistent family of finite dimensional


distributions defined via probability kernels, Pn = Pn_i <g> An for each n, as in <48>.
Then the P defined by <47> has a/i extension to a countably additive probability
measure on the product sigma-field 7.
Proof. We have only to establish the σ-smoothness at 0. Equivalently, suppose we
have a sequence {hᵢ : i ∈ N} in 𝓗⁺ with 1 ≥ hᵢ ↓ h, but for which infᵢ Phᵢ ≥ ε for
some ε > 0. We need to show that h is not the zero function. That is, we need to
find a point ω̄ = (x̄₁, x̄₂, ...) in Ω for which h(ω̄) > 0.

Construct ω̄ one coordinate at a time. With no loss of generality we may
assume hₙ depends on only the first n coordinates, that is, hₙ(ω) = gₙ(ω|n), so that

<50>    Phₙ = Pₙgₙ = (P₁ ⊗ Λ₂ ⊗ ... ⊗ Λₙ) gₙ(x₁, x₂, ..., xₙ) ≥ ε.

The product Λ₂ ⊗ ... ⊗ Λₙ defines a probability kernel from X₁ to X_{2≤i≤n} Xᵢ. Define
functions fₙ on X₁ by

    fₙ(x₁) := (λ_{x₁} ⊗ Λ₃ ⊗ ... ⊗ Λₙ) gₙ(x₁, x₂, ..., xₙ)    for n ≥ 2,

with f₁(x₁) = g₁(x₁). Then P₁fₙ = Pₙgₙ ≥ ε for each n. The assumed monotonicity,

<51>    gₙ(ω|n) = hₙ(ω) ≥ h_{n+1}(ω) = g_{n+1}(ω|n+1)    for all ω,

implies that {fₙ : n ∈ N} is a decreasing sequence of measurable functions. By
Dominated Convergence, ε ≤ P₁fₙ ↓ P₁ inf_{i∈N} fᵢ, from which we may deduce
existence of at least one value x̄₁ for which fₙ(x̄₁) ≥ ε for all n. In particular,
g₁(x̄₁) ≥ ε.

Hold x̄₁ fixed for the rest of the argument. The defining property of x̄₁ becomes

    (λ_{x̄₁} ⊗ Λ₃ ⊗ ... ⊗ Λₙ) gₙ(x̄₁, x₂, ..., xₙ) ≥ ε    for n ≥ 2.

Notice the similarity to the final inequality in <50>, with the role of P₁ taken over
by λ_{x̄₁}. Repeating the argument, we find an x̄₂ for which

    (λ_{(x̄₁,x̄₂)} ⊗ Λ₄ ⊗ ... ⊗ Λₙ) gₙ(x̄₁, x̄₂, x₃, ..., xₙ) ≥ ε    for n ≥ 3,

and with g₂(x̄₁, x̄₂) ≥ ε.

And so on. In this way construct an ω̄ = (x̄₁, x̄₂, ...) for which hₙ(ω̄) =
gₙ(ω̄|n) ≥ ε for all n. In the limit we have h(ω̄) ≥ ε, which ensures that h is not
the zero function. Sigma-smoothness at zero, and the asserted countable additivity
for P, follow. □
<52>    Corollary. For probability measures Pᵢ on arbitrary measure spaces (Xᵢ, 𝓐ᵢ)
there exists a probability measure P (also denoted by ⊗_{i∈N} Pᵢ) for which
P(A₁ × A₂ × ... × Aₖ × X_{i>k} Xᵢ) = ∏_{i≤k} PᵢAᵢ, for all measurable rectangles.

Proof. Specialize to the case where λ_{ωᵢ₋₁}(·) = Pᵢ(·) for all ω_{i−1} ∈ Ω_{i−1}. □

The proof for existence of a countably additive P under topological assumptions
is similar, with only a slight change in the definition of 𝓗⁺.

<53>    Theorem. (Daniell/Kolmogorov extension theorem) Suppose {Pₙ : n ∈ N} is a
consistent family of finite dimensional distributions. If each Xᵢ is a separable metric
space, equipped with its Borel sigma-field 𝓐ᵢ, and if each Pₙ is tight, then the P
defined by <47> has an extension to a tight probability measure on the product
sigma-field 𝓕.
REMARK. As you will see in Chapter 5, the tightness assumption actually ensures
that the Pₙ satisfy the conditions of Theorem <49>. However, the construction of
the probability kernels requires even more topological work than the direct proof of
Theorem <53> given below.

Proof. As shown in Section A.6 of Appendix A, we may simplify definition <47>
by restricting it to bounded, continuous, nonnegative gₙ. Countable additivity of P
is implied by its σ-smoothness at zero for the smaller 𝓗⁺ class. That is, it suffices
to prove that Phₙ ↓ 0 for every sequence {hₙ : n ∈ N} from 𝓗⁺ that decreases
pointwise to the zero function.

As in the proof of Theorem <49>, we may consider a sequence {hᵢ : i ∈ N}
in 𝓗⁺ with 1 ≥ hᵢ ↓ h, but for which infᵢ Phᵢ ≥ ε for some ε > 0. Once again, with
no loss of generality we may assume hₙ depends on only the first n coordinates,
that is, hₙ(ω) = gₙ(ω|n); but now the gₙ functions are also continuous. Again we
need to find a point ω̄ = (x̄₁, x̄₂, ...) in Ω for which h(ω̄) > 0.

The tightness assumption lets us construct compact subsets Kₙ ⊆ Ωₙ for which
PₙKₙ > 1 − ε/2, with the added property that Kₙ × X_{n+1} ⊇ K_{n+1}, for every n: use
the fact that sup_K P_{n+1}((Kₙ × X_{n+1}) ∩ K) = P_{n+1}(Kₙ × X_{n+1}) = PₙKₙ, with the
supremum running over all compact subsets of Ω_{n+1}. The construction of Kₙ ensures
that Pₙ(gₙ(ωₙ){ωₙ ∈ Kₙ}) ≥ ε/2, so the compact set Lₙ := {ωₙ ∈ Kₙ : gₙ(ωₙ) ≥ ε/2}
is nonempty. Moreover, Lₙ × X_{n+1} ⊇ L_{n+1}.

By a Cantor diagonalization argument (Problem [30]), ∩_{n∈N}(Lₙ × X_{i>n} Xᵢ) ≠ ∅.
That is, there exists an ω̄ in Ω for which ω̄|n ∈ Kₙ and hₙ(ω̄) ≥ ε/2 for every n. It
follows that h(ω̄) ≥ ε/2. The probability measure P gives mass ≥ 1 − ε/2 to the set
∩_{n∈N}(Kₙ × X_{i>n} Xᵢ), which Problem [30] shows is compact. □

9. Problems

[1] Let B₁, B₂, ... be independent events for which Σᵢ PBᵢ = ∞. Show that
P{Bᵢ infinitely often} = 1 by following these steps.
(i) Show that P(B₁ᶜ B₂ᶜ ... Bₙᶜ) ≤ exp(−Σᵢ₌₁ⁿ PBᵢ) → 0.
(ii) Deduce that ∏ᵢ₌₁^∞ Bᵢᶜ = 0 almost surely.
(iii) Deduce that Σᵢ₌₁^∞ Bᵢ ≥ 1 almost surely.
(iv) Deduce that Σ_{i≥m} Bᵢ ≥ 1 almost surely, for each finite m. Hint: The events
Bₘ, B_{m+1}, ... are independent.
(v) Complete the proof.
Remark: This result is a converse to the Borel-Cantelli Lemma discussed in Section 2.6.
A stronger converse was established in the Problems to Chapter 2.
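A quick numerical companion to step (i) (a sketch, with the hypothetical choice PBᵢ = 1/(i + 1), for which Σᵢ PBᵢ = ∞): the probability of seeing none of the first n events is squeezed to zero beneath the exponential bound.

    import math

    # independent events with P(B_i) = 1/(i+1); the sum of these diverges
    probs = [1.0 / (i + 1) for i in range(1, 2001)]
    prod, total = 1.0, 0.0
    for n, p in enumerate(probs, start=1):
        prod *= 1.0 - p        # P(B_1^c B_2^c ... B_n^c), by independence
        total += p
        if n in (10, 100, 1000, 2000):
            # the product never exceeds exp(-sum), and both tend to zero
            print(n, prod, math.exp(-total))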


[2] In Example <26> we needed the function g(x, y) = p y^{p−1}{f(x) > y > 0} to be
product measurable if f is 𝓐-measurable. Prove this fact by following these steps.
(i) The map φ : R² → R defined by φ(s, t) = {s > t} is 𝓑(R²)\𝓑(R)-measurable.
(What do you know about inverse images for φ?)
(ii) Prove that the map (x, y) ↦ (f(x), y) is 𝓐 ⊗ 𝓑(R)\𝓑(R) ⊗ 𝓑(R)-measurable.
(iii) Show that the composition of measurable functions is measurable, if the various
sigma-fields fit together in the right way.
(iv) Complete the argument.

[3] Show that every continuous real function on a topological space X is 𝓑(X)\𝓑(R)-measurable.
Hint: What do you know about inverse images of open sets?

[4] Let X be a topological space. Say that its topology is countably generated if there
exists a countable class of open sets 𝓖₀ such that G = ∪{G₀ : G ⊇ G₀ ∈ 𝓖₀} for
each open set G. Show that such a 𝓖₀ generates the Borel sigma-field on X.

[5] Let (X, d) be a separable metric space with a countable dense subset X₀. Show
that its topology is countably generated. [Lindelöf's theorem.] Hint: Let 𝓖₀ be the
countable class of all open balls of rational radius centered at a point of X₀. If
x ∈ G, find a rational r such that G contains the ball of radius 2r centered at x, then
find a point of X₀ lying within a distance r of x.

[6] Let X and Y be topological spaces equipped with their Borel sigma-fields 𝓑(X)
and 𝓑(Y). Equip X × Y with the product topology and its Borel sigma-field 𝓑(X × Y).
(The open sets in the product space are, by definition, all possible unions of sets
G × H, with G open in X and H open in Y.)
(i) Show that 𝓑(X) ⊗ 𝓑(Y) ⊆ 𝓑(X × Y).
(ii) If both X and Y have countably generated topologies, prove equality of the two
sigma-fields on the product space.
(iii) Show that 𝓑(Rⁿ) = 𝓑(Rᵏ) ⊗ 𝓑(R^{n−k}) for each 1 ≤ k < n.

[7] Let X be a set with cardinality greater than 2^ℵ₀, the cardinality of the set of all
sequences of 0's and 1's. Equip X with the topology for which all subsets
are open.
(i) Show that the Borel sigma-field 𝓑(X × X) consists of all subsets of X × X.
(ii) Show that 𝓑(X) ⊗ 𝓑(X) equals
    ∪{σ(𝓔) : 𝓔 a countable class of measurable rectangles}.
(iii) For a given countable class 𝓔 of subsets of X define an equivalence relation
x ~ y if and only if {x ∈ C} = {y ∈ C} for all C in 𝓔. Show that there are
at most 2^ℵ₀ equivalence classes. Deduce that there exists at least one pair of
distinct points x₀ and y₀ such that x₀ ~ y₀. Deduce that σ(𝓔) cannot separate
the pair (x₀, y₀).
(iv) Show that the diagonal Δ = {(x, y) ∈ X × X : x = y} cannot belong to the
product sigma-field 𝓑(X) ⊗ 𝓑(X), which is therefore a proper sub-sigma-field
of 𝓑(X × X). Hint: Suppose Δ ∈ σ(𝓔) for some countable class 𝓔 of measurable
rectangles. Find a pair of distinct points x₀, y₀ such that no member of 𝓔, or
of σ(𝓔), can extract a proper subset of F = {(x₀, x₀), (y₀, x₀), (x₀, y₀), (y₀, y₀)},
but FΔ = {(x₀, x₀), (y₀, y₀)}.
[8] Prove Theorem <9> by following these steps.
(i) For fixed E₂, ..., Eₙ in 𝓔₂, ..., 𝓔ₙ, define 𝓓₁ as the class of all events D for
which P(DE₂...Eₙ) = (PD)(PE₂)...(PEₙ). Show that 𝓓₁ is a λ-class.
(ii) Deduce that 𝓓₁ ⊇ σ(𝓔₁). That is, that the analog of the hypothesized
factorization with 𝓔₁ replaced by σ(𝓔₁) also holds.
(iii) Argue similarly that each subsequent 𝓔ᵢ can also be replaced by its σ(𝓔ᵢ).

[9] Let Z̄ₙ = (Z₁ + ... + Zₙ)/n, for a sequence {Zᵢ} of independent random variables.
(i) Use the Kolmogorov zero-one law (Example <12>) to show that the set
{lim sup Z̄ₙ > r} is a tail event for each constant r, and hence it has probability
either zero or one. Deduce that lim sup Z̄ₙ = c₀ almost surely, for some
constant c₀ (possibly ±∞).
(ii) If Z̄ₙ converges to a finite limit (possibly random) at each ω in a set A with
PA > 0, show that in fact there must exist a finite constant c₀ for which
Z̄ₙ → c₀ almost surely.

[10] Let X and Y be independent, real-valued random variables for which P(XY) is
well defined, that is, either P(XY)⁺ < ∞ or P(XY)⁻ < ∞. Suppose neither X
nor Y is degenerate (equal to zero almost surely). Show that both PX and PY
are well defined and P(XY) = (PX)(PY). Hint: What would you learn from
∞ > P(XY)⁺ = P(X⁺Y⁺) + P(X⁻Y⁻)?

[11] For sigma-finite measures μ and λ, on sigma-fields 𝓐 and 𝓑, show that the only
measure Γ on 𝓐 ⊗ 𝓑 for which Γ(A × B) = (μA)(λB) for all A ∈ 𝓐 and B ∈ 𝓑
is the product measure μ ⊗ λ. Hint: Use the π-λ theorem when both measures are
finite. Extend to the sigma-finite case by breaking the underlying product space into
a countable union of measurable rectangles, Aᵢ × Bⱼ, with μAᵢ < ∞ and λBⱼ < ∞,
for all i, j.

[12] Let m denote Lebesgue measure and λ denote counting measure (that is, λA equals
the number of points in A, possibly infinite), both on 𝓑(R). Let f(x, y) = {x = y}.
Show that mˣλʸf(x, y) = ∞ but λʸmˣf(x, y) = 0. Why does the Tonelli Theorem
not apply?

[13] Let λ and μ both denote counting measure on the sigma-field of all subsets of N. Let
f(x, y) := {y = x} − {y = x + 1}. Show that μˣλʸf(x, y) = 0 but λʸμˣf(x, y) = 1.
Why does the Fubini Theorem not apply?
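The disagreement of the two iterated sums in Problem [13] is easy to see numerically. In the sketch below (not from the text), each inner sum is computed exactly, because only finitely many terms are nonzero; only the outer sums are truncated.

    def f(x, y):
        return (1 if y == x else 0) - (1 if y == x + 1 else 0)

    N = 100  # truncation of the outer sums only
    # mu^x lambda^y f: for fixed x the inner sum over y is exactly 0
    row = [sum(f(x, y) for y in range(1, x + 2)) for x in range(1, N + 1)]
    # lambda^y mu^x f: for fixed y the inner sum over x is {y == 1}
    col = [sum(f(x, y) for x in range(1, y + 1)) for y in range(1, N + 1)]
    print(sum(row), sum(col))   # prints 0 1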

[14] For nonnegative random variables Z₁, ..., Zₘ, show that P(Z₁Z₂...Zₘ) is no
greater than ∫₀¹ q_{Z₁}(u) q_{Z₂}(u) ... q_{Zₘ}(u) du, with q_{Zᵢ} the quantile function for Zᵢ.

[15] Let m denote Lebesgue measure on 𝓑(Rᵏ). For each fixed a in Rᵏ, define mₐf :=
mˣf(x + a). Check that mₐ is a measure on 𝓑(Rᵏ) for which mₐ[a, b] = m[a, b],
for each rectangle [a, b] = Xᵢ[aᵢ, bᵢ]. Deduce that mₐ = m. That is, m is invariant
under translation: mₐf = ∫ f(x + a) dx = ∫ f(x) dx = mf.
[16] Let μ and ν be finite measures on 𝓑(R). Define distribution functions F(t) :=
μ(−∞, t] and G(t) := ν(−∞, t].
(i) Show that there are at most countably many points x, atoms, for which μ{x} > 0.
(ii) Show that
    νᵗF(t) + μˢG(s) = μ(R)ν(R) + Σᵢ μ{xᵢ}ν{xᵢ},
where {xᵢ : i ∈ N} contains all the atoms for both measures.
(iii) Explain how (ii) is related to the integration-by-parts formula:
    ∫ F(t) (dG(t)/dt) dt = F(∞)G(∞) − ∫ G(t) (dF(t)/dt) dt.
Hint: Read Section 3.4.
[17] In the notation of Example <32>, show that PΨ(|X − Y|) ≥ PΨ(|X̃ − Ỹ|), for
every convex, increasing function Ψ on R⁺. Hint: Use the fact that Ψ(x) =
Ψ(0) + ∫₀ˣ H(t) dt, where H is an increasing right-continuous function on R⁺.
Represent H(t) as μ(0, t] for some measure μ. Consider P mˢ μᵗ {0 < t < s < |X − Y|},
remembering inequality <35>.

[18] Let P = ⊗_{i≤n}Pᵢ and Q = ⊗_{i≤n}Qᵢ, where Pᵢ and Qᵢ are defined on the same
sigma-field 𝓐ᵢ. Show that
    H²(P, Q) = 2 − 2∏_{i≤n}(1 − ½H²(Pᵢ, Qᵢ)) ≤ Σ_{i≤n} H²(Pᵢ, Qᵢ),
where H² denotes squared Hellinger distance, as defined in Section 3.3. Hint:
For the equality, factorize Hellinger affinities calculated using dominating measures
λᵢ = (Pᵢ + Qᵢ)/2. For the inequality, establish the identity Σ_{i≤n} yᵢ + ∏_{i≤n}(1 − yᵢ) ≥ 1
for all 0 ≤ yᵢ ≤ 1.
[19] (One-sided version of <38>) Let Z₁, ..., Z_N be independent random variables, and
ε₁, ε₂, and β be nonnegative constants for which P{Zᵢ + Z_{i+1} + ... + Z_N ≥ −ε₂} ≥ 1/β
for each i. Show that P{max_{i≤N}(Z₁ + ... + Zᵢ) ≥ ε₁ + ε₂} ≤ βP{Z₁ + ... + Z_N ≥ ε₁}.

[20] Let X₁, X₂, ... be independent, identically distributed, random variables with
PX₁⁺ = ∞ > PX₁⁻. Show that Σ_{i≤n}Xᵢ/n → ∞ almost surely. Hint: Apply
Theorem <3> to the random variables {Xᵢ⁻} and {m ∧ Xᵢ⁺}, for positive constants m.
Note that P(m ∧ X₁⁺) → ∞ as m → ∞.

[21] Let X₁, X₂, ... be independent, identically distributed, random variables with
P|X₁| = ∞. Let Sₙ = X₁ + ... + Xₙ. Show that Sₙ/n cannot converge almost surely
to a finite value. Hint: If Sₙ/n → c, show that (S_{n+1} − Sₙ)/n → 0 almost surely.
Deduce from Problem [1] that Σₙ P{|Xₙ| > n} < ∞. Argue for a contradiction by
showing that P|X₁| ≤ 1 + Σₙ P{|X₁| > n}.

[22] (Kronecker's lemma) Let {bᵢ} and {xᵢ} be sequences of real numbers for which
0 < b₁ ≤ b₂ ≤ ... → ∞ and Σᵢ₌₁^∞ xᵢ is convergent (with a finite limit). Show that
Σᵢ₌₁ⁿ bᵢxᵢ/bₙ → 0 as n → ∞, by following these steps.
(i) Express bᵢ as a₁ + ... + aᵢ, for a sequence {aⱼ} of nonnegative numbers. By a
change in the order of summation, show, for m < n, that Σᵢ₌₁ⁿ bᵢxᵢ equals
    cₘ + Σ{1 ≤ j ≤ n} aⱼ Σ{max(m + 1, j) ≤ i ≤ n} xᵢ,    where cₘ := Σᵢ₌₁ᵐ bᵢxᵢ.
(ii) Given ε > 0, find an m such that |Σᵢ₌ₚⁿ xᵢ| < ε whenever m < p ≤ n.
(iii) With m as in (ii), and n > m, show that |Σᵢ₌₁ⁿ bᵢxᵢ| ≤ |cₘ| + ε Σⱼ₌₁ⁿ aⱼ = |cₘ| + εbₙ.
(iv) Deduce the asserted convergence.
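A two-line numerical illustration of Kronecker's lemma (a sketch, with the made-up choices bᵢ = i and xᵢ = (−1)ⁱ/i, whose series converges):

    # sum_{i<=n} b_i x_i / b_n with b_i = i and x_i = (-1)^i / i
    for n in (10, 100, 1000, 10000):
        print(n, sum(i * ((-1) ** i / i) for i in range(1, n + 1)) / n)
    # the printed averages tend to 0, as the lemma asserts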
[23] Let Y₁, Y₂, ... be independent, identically distributed random variables, with
P|Y₁|^α < ∞ for some fixed α with 0 < α < 1. Define β = 1/α and Zᵢ = Yᵢ{|Yᵢ|^α ≤ i}.
(i) Show that Σᵢ₌₁^∞ PZᵢ²/i^{2β} < ∞ and Σᵢ₌₁^∞ P|Zᵢ|/i^β < ∞.
(ii) Deduce that Σᵢ₌₁^∞ Zᵢ/i^β is convergent almost surely.
(iii) Show that Σᵢ P{|Yᵢ|^α > i} < ∞.
(iv) Deduce that Σᵢ₌₁^∞ Yᵢ/i^β is convergent almost surely.
(v) Deduce via Kronecker's Lemma (Problem [22]) that n^{−1/α}(Y₁ + ... + Yₙ)
converges to 0 almost surely.
[24] Let Sₙ = X₁ + ... + Xₙ, a sum of independent, identically distributed random
variables with P|X₁|^{2k} < ∞ for some positive integer k and PXᵢ = 0. Show that
P(Sₙ^{2k}) = O(nᵏ) as n → ∞.

[25] Establish inequality <44>. Hint: First show that Σ_{i≥k} i⁻² decreases like k⁻¹, for
k = 2, 3, ..., by comparing with ∫_{k−1}^∞ y⁻² dy. Then convert to a continuous range. It
helps to consider the case |x| ≤ 2 separately, to avoid inadvertent division by zero.

[26] (Etemadi 1981) Let X₁, X₂, ... be independent, integrable random variables, with
common expected value μ. Give another proof of the SLLN by following these
steps. As in Section 7, define Yᵢ = Xᵢ{|Xᵢ| ≤ i} and μᵢ = PYᵢ. Let Tₙ = Σ_{i≤n} Yᵢ.
(i) Argue that there is no loss of generality in assuming Xᵢ ≥ 0. Hint: Consider
positive and negative parts.
(ii) Show that PTₙ/n → μ as n → ∞ and (Sₙ − Tₙ)/n → 0 almost surely.
(iii) Show that var(Tₙ) ≤ Σᵢ PX₁²{X₁ ≤ i ≤ n}.
(iv) For a fixed ρ > 1, let {kₙ} be an increasing sequence of positive integers such
that kₙ/ρⁿ → 1. Show that Σₙ {i ≤ kₙ}/kₙ² ≤ C/i² for each positive integer i,
for some constant C.
(v) Use parts (iii) and (iv), and Problem [25], to show that
    Σₙ P{|T_{kₙ} − PT_{kₙ}| > εkₙ} ≤ Cε⁻² Σᵢ PX₁²{X₁ ≤ i}/i² < ∞.
Deduce via the Borel-Cantelli lemma that (T_{kₙ} − PT_{kₙ})/kₙ → 0 almost surely.
(vi) Deduce that S_{kₙ}/kₙ → μ almost surely, as n → ∞.
(vii) For each ρ' > ρ, show that
    (1/ρ') S_{kₙ}/kₙ ≤ Sₘ/m ≤ ρ' S_{k_{n+1}}/k_{n+1}    for kₙ ≤ m ≤ k_{n+1},
when n is large enough.
(viii) Deduce that lim sup Sₘ/m and lim inf Sₘ/m both lie between μ/ρ and μρ, with
probability one.
(ix) Cast out a sequence of negligible sets as ρ decreases to 1 to deduce that
Sₘ/m → μ almost surely.
[27] Let P be a probability measure on (Ω, 𝓕). Suppose X is a (possibly nonmeasurable)
subset of Ω with outer probability 1, that is, PF = 1 for every set F with X ⊆ F ∈ 𝓕.
(i) If F₁ and F₂ are sets in 𝓕 for which F₁X = F₂X, show that PF₁ = PF₂.
(ii) Write 𝓐 for the collection of all sets of the form FX, with F ∈ 𝓕. Show that
𝓐 is a sigma-field (the so-called trace sigma-field).
(iii) Show that M⁺(X, 𝓐) = {restriction of f to X : f ∈ M⁺(Ω, 𝓕)}.
(iv) Show that Q(FX) := PF is a well defined (by (i)) probability measure on 𝓐.
(v) Show that Qf̃ = Pf, for each f in M⁺(Ω, 𝓕) whose restriction to X equals f̃.
[28] Let m denote Lebesgue measure on the Borel sigma-field of [0, 1). Define an
equivalence relation on [0, 1) by x ~ y if x − y is rational. Let A₀ be any subset
containing exactly one point from each equivalence class. Let {rᵢ : i ∈ N} be an
enumeration for the set of all rational numbers in (0, 1). Write Aᵢ for the set of all
numbers of the form x + rᵢ, with x ∈ A₀ and addition carried out modulo 1.
(i) Show that the sets Aᵢ, for i ≥ 0, are disjoint, and ∪_{i≥0}Aᵢ = [0, 1).
(ii) Let D₀ be a Borel measurable subset of A₀. Define Dᵢ as the set of points
x + rᵢ, with x ∈ D₀ and addition carried out modulo 1. Show mD₀ = mDᵢ for
each i, and 1 ≥ Σ_{i≥0} mDᵢ. Deduce that mD₀ = 0.
(iii) Deduce that A₀ cannot be measurable (not even for the completion of the Borel
sigma-field), for otherwise [0, 1) would be a countable union of Lebesgue
negligible sets.
(iv) Suppose D is a Borel measurable subset of ∪_{i≥n}Aᵢ, for some finite n. Show that
[0, 1) contains countably many disjoint translations of D. Deduce that mD = 0.
(v) Define Xₙ = ∪_{i≥n}Aᵢ. Deduce from (iv) that each Xₙ has Lebesgue outer
measure 1 (that is, mB ≥ 1 for each Borel set B ⊇ Xₙ), but ∩ₙXₙ = ∅.

[29] Let m denote Lebesgue measure on [0, 1), and let {Xₙ : n ∈ N} be a decreasing
sequence of (nonmeasurable) subsets with ∩ₙXₙ = ∅ but with each Xₙ having outer
measure 1, as in Problem [28]. Write 𝓐ₙ for the trace of 𝓑 := 𝓑[0, 1) on Xₙ, as in
Problem [27]. Write Ωₙ for X_{i≤n} Xᵢ and 𝓕ₙ for ⊗_{i≤n}𝓐ᵢ, as in Section 8.
(i) Show that each function f in M⁺(Ωₙ, 𝓕ₙ) is the restriction to Ωₙ of a function f̃ in
M⁺([0, 1)ⁿ, 𝓑ⁿ).
(ii) Show that Pₙf := mᵗf̃(t, t, ..., t), for f ∈ M⁺(Ωₙ, 𝓕ₙ), defines a consistent
family of finite dimensional distributions.
(iii) Let gₙ denote the indicator function of {ωₙ ∈ Ωₙ : x₁ = x₂ = ... = xₙ}, and let
fₙ(ω) = gₙ(ω|n) be the corresponding functions in 𝓗⁺. Show that Pₙgₙ = 1
for all n, even though fₙ ↓ 0 pointwise.
(iv) Deduce that the functional P, as defined by <47>, is not sigma-smooth at zero.
[30] Let Ω = X_{i∈N} Xᵢ be a product of metric spaces. Define Ωₙ = X_{i≤n} Xᵢ and
Sₙ = X_{i>n} Xᵢ. For each n let Kₙ be a compact subset of Ωₙ, with the property
that Hₙ := ∩_{i≤n}(Kᵢ × Sᵢ) ≠ ∅ for each finite n. Write πᵢ for the projection map
from Ω onto Ωᵢ. Show that H := ∩_{i∈N} Hᵢ is a nonempty compact subset of Ω, by
following these steps. (Remember, for metric spaces, compactness is equivalent to
the property that each sequence has a convergent subsequence.)
(i) Let {zₙ : n ∈ N} be a sequence with zₙ ∈ Hₙ. Use compactness of each Kᵢ to
find subsequences N₁ ⊇ N₂ ⊇ N₃ ⊇ ... of N for which y(i) := lim_{n∈Nᵢ} πᵢzₙ
exists, as a point of Kᵢ. Define N∞ to be the subsequence whose ith member
equals the ith member of Nᵢ. Show that lim_{n∈N∞} πᵢzₙ = y(i) for every i.
(ii) Show that the first i components of y(i + 1) coincide with y(i) for each i.
Deduce that there exists a y in Ω for which πᵢy = y(i) ∈ Kᵢ, for every i.
Deduce that y ∈ H, and hence H ≠ ∅.
(iii) Specialize to the case where zₙ ∈ H for every n. Show that there is a
subsequence that converges to a point of H.

[31] Let T be an uncountable index set, and let Ω = X_{t∈T}X_t denote the set of all
functions ω : T → ∪_t X_t with ω(t) ∈ X_t for each t. Let 𝓐_t be a sigma-field
on X_t. For each S ⊆ T, define 𝓕_S to be the smallest sigma-field for which each of the
maps ω ↦ ω(s), for s ∈ S, is 𝓕_S\𝓐_s-measurable. The product sigma-field ⊗_{t∈T}𝓐_t is
defined to equal 𝓕_T.
(i) Show that 𝓕_T = ∪_S 𝓕_S, the union running over all countable subsets S of T.
(ii) For each countable S ⊆ T, let P_S be a probability measure on 𝓕_S, with the
property that P_S equals the restriction of P_{S'} to 𝓕_S if S ⊆ S'. Show that
PF := P_S F for F ∈ 𝓕_S defines a countably additive probability measure on 𝓕_T.

10. Notes
According to the account by Hawkins (1979, pages 154-162), the Fubini result
(in a 1907 paper, discussed by Hawkins) was originally stated only for products
of Lebesgue measure on bounded intervals. I do not personally know whether the
appellation Tonelli's Theorem is historically justified, but Dudley (1989, page 113)
cited a 1909 paper of Tonelli as correcting an error in the Fubini paper. As noted
by Hawkins, Lebesgue also has a claim to being the inventor of the measure
theoretic version of the theorem for iterated integrals: in his thesis (pages 44-51
of Lebesgue 1902) he established a form of the theorem for bounded measurable
functions defined on a rectangle. He expressed the two-dimensional Lebesgue
integral as an iterated inner or outer one-dimensional integral. With modern
hindsight, his result essentially contains the Fubini Theorem for Lebesgue measure
on intervals, but the reformulation involves some later refinements. Royden (1968,
Section 12.4) gave an excellent discussion of the distinctions between the two
theorems, Tonelli and Fubini, and the need for something like sigma-finiteness in
the Tonelli theorem.

I do not know, in general, whether my definition of sigma-finiteness of a kernel,
in the sense of Definition <21>, is equivalent to the apparently weaker property of
sigma-finiteness for each measure λₓ, for x ∈ X.

The inequality between 𝓛² norms in Example <32> was noted by
Fréchet (1957). He cited earlier works of Salvemini, Bass, and Dall'Aglio (none
of which I have seen) containing more specialized forms of the result, based on
Fréchet (1951). The 𝓛¹ version of the result is also in the literature (see the
comments by Dudley 1989, page 342). I do not know whether the general version of
the inequality, as in Problem [17], has been stated before.

The Maximal Inequality <38> is usually attributed to a 1939 paper of Ottaviani,
which I have not seen. Theorem <39> is due to Kolmogorov (1928). Theorem <3>
was stated without proof by Kolmogorov (1933, p 57; English edition p 69), with the
remark that the proof had not been published. However, the necessary techniques
for the proof were already contained in his earlier papers (Kolmogorov 1928, 1930).
Problem [26] presents a slight repackaging of an alternative method of proof due
to Etemadi (1981). By splitting summands into positive and negative parts, he was
able to greatly simplify the method for handling blocks of partial sums.

Daniell (1919) constructed measures on countable products of bounded
subintervals of the real line. Kolmogorov (1933, Section III.4), apparently unaware
of Daniell's work, proved the extension theorem for arbitrary products of real lines.
As shown by Problem [31], the extension from countable to uncountable products is
almost automatic. Theorem <49> is due to Ionescu Tulcea (1949). See Doob (1953,
pages 613-615) or Neveu (1965, Section 5.1) for different arrangements of the proof.
Apparently (Doob 1953, page 639) there was quite a history of incorrect assertions
before the question of existence of measures on infinite product spaces was settled.
Andersen & Jessen (1948), as well as providing a counterexample (Problem [29])
to the general analog of the Kolmogorov extension theorem, also suggested that the
Ionescu Tulcea form was more widely known:

    In the terminology of the theory of probability this means, that the case of
    dependent variables cannot be treated for abstract variables in the same manner
    as for unrestricted real variables. Professor Doob has kindly pointed out, what
    was also known to us, that this case may be dealt with along similar lines as the
    case of independent variables (product measures) when conditional probability
    measures are supposed to exist. This question will be treated in a forthcoming
    paper by Doob and Jessen.
REFERENCES

Andersen, E. S. & Jessen, B. (1948), 'On the introduction of measures in infinite
product sets', Danske Vid. Selsk. Mat.-Fys. Medd.
Daniell, P. J. (1919), 'Functions of limited variation in an infinite number of
dimensions', Annals of Mathematics (series 2) 21, 30-38.
Doob, J. L. (1953), Stochastic Processes, Wiley, New York.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Etemadi, N. (1981), 'An elementary proof of the strong law of large numbers',
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 55, 119-122.
Fréchet, M. (1951), 'Sur les tableaux de corrélation dont les marges sont données',
Annales de l'Université de Lyon 14, 53-77.
Fréchet, M. (1957), 'Sur la distance de deux lois de probabilité', Comptes Rendus de
l'Académie des Sciences, Paris, Sér. I Math 244, 689-692.
Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development,
second edn, Chelsea, New York.
Ionescu Tulcea, C. T. (1949), 'Mesures dans les espaces produits', Lincei Rend. Sc.
fis. mat. e nat. 7, 208-211.
Kolmogorov, A. (1928), 'Über die Summen durch den Zufall bestimmter unabhängiger
Größen', Mathematische Annalen 99, 309-319. Corrections: same
journal, volume 102, 1929, pages 484-488.
Kolmogorov, A. (1930), 'Sur la loi forte des grands nombres', Comptes Rendus de
l'Académie des Sciences, Paris 191, 910-912.
Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag,
Berlin. Second English Edition, Foundations of Probability 1950,
published by Chelsea, New York.
Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to
Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7.
Included in the first volume of his Œuvres Scientifiques, published in 1972 by
L'Enseignement Mathématique.
Neveu, J. (1965), Mathematical Foundations of the Calculus of Probability, Holden-Day,
San Francisco.
Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.
Wald, A. (1949), 'Note on the consistency of the maximum likelihood estimate',
Annals of Mathematical Statistics 20, 595-601.

Chapter 5

Conditioning
SECTION 1 considers the elementary case of conditioning on a map that takes only finitely
many different values, as motivation for the general definition.
SECTION 2 defines conditional probability distributions for conditioning on the value of a
general measurable map.
SECTION 3 discusses existence of conditional distributions by means of a slightly more
general concept, disintegration, which is essential for the understanding of general
conditional densities.
SECTION 4 defines conditional densities. It develops the general analog of the elementary
formula for a conditional density: (joint density)/(marginal density).
SECTION *5 illustrates how conditional distributions can be identified by symmetry
considerations. The classical Borel paradox is presented as a warning against the
misuse of symmetry.
SECTION 6 discusses the abstract Kolmogorov conditional expectation, explaining why it is
natural to take the conditioning information to be a sub-sigma-field.
SECTION *7 discusses the statistical concept of sufficiency.

1. Conditional distributions: the elementary case


In introductory probability courses, conditional probabilities of events are defined
as ratios, P(A | B) = P(AB)/PB, provided PB ≠ 0. The division by PB ensures that
P(· | B) is also a probability measure, which puts zero mass outside the set B, that
is, P(Bᶜ | B) = 0. The conditional expectation of a random variable X is defined as
its expectation with respect to P(· | B), or, more succinctly, P(X | B) = P(XB)/PB.
If PB = 0, the conditional probabilities and conditional expectations are either left
undefined or are extracted by some heuristic limiting argument. For example, if Y
is a random variable with P{Y = y} = 0 for each possible value y, one hopes
that something like P(A | Y = y) = lim_{δ→0} P(A | y ≤ Y ≤ y + δ) exists and is a
probability measure for each fixed y. Rigorous proofs lie well beyond the scope of
the typical introductory course.

In applications of conditioning, the definitions get turned around, to derive
probabilities and expectations from conditional distributions constructed by appeals
to symmetry or modelling assumptions. The typical calculation starts from a
partition of the sample space Ω into finitely many disjoint events, such as the sets
{T = t} where some random variable T takes each of its possible values 1, 2, ..., n.
From the probabilities P{T = t} and the conditional distributions P(· | T = t), for
each t, one calculates expected values as weighted averages:

<1>    PX = Σₜ P(X{T = t}) = Σₜ P(X | T = t) P{T = t}.

Notice that the weights Q{t} := P{T = t} define a probability measure Q on the range
space 𝓣 = {1, 2, ..., n}, the distribution of T under P. (That is, Q = TP.) Also,
if there is no ambiguity about the choice of T, it helps notationally to abbreviate
the conditional distribution to Pₜ(·), writing Pₜ(X) instead of P(X | T = t). The
probability measure Pₜ lives on Ω, with Pₜ{T ≠ t} = 0 for each t in 𝓣. With these
simplifications in notation, formula <1> can be rewritten more concisely as

<2>    PX = Qᵗ(PₜX),

with the interpretation that the probability measure P is a weighted average of
the family of probability measures 𝓟 = {Pₜ : t ∈ 𝓣}. The new formula also has
a suggestive interpretation as a two-step method for generating an observation ω
from P:
(i) First generate a t from the distribution Q on 𝓣.
(ii) Given the value t from step (i), generate ω from the distribution Pₜ.
Notice that Pₜ concentrates on the set of ω for which T(ω) = t. The value of T(ω)
from step (ii) must therefore equal the t from step (i).
<3>    Example. Suppose a deck of 26 red and 26 black cards is well shuffled, that is,
all 52! permutations of the cards are equally likely. Let A denote the event {top and
bottom cards red}, and let T be the map into 𝓣 = {red, black} that gives the color of
the top card. Then T has distribution Q given by

    Q{red} = P{T = red} = 1/2    and    Q{black} = P{T = black} = 1/2.

By symmetry, the conditional distribution Pₜ gives equal probability to all permutations
of the remaining 51 cards. In particular

    P_red A = P{top and bottom cards red | T = red} = 25/51,
    P_black A = P{top and bottom cards red | T = black} = 0/51,

from which we deduce that

    PA = Q{red}(25/51) + Q{black}(0/51) = 25/102.

Notice how we were able to assign Pₜ probabilities by appeals to symmetry, rather
than by a direct calculation of a ratio.
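Readers who want to see the weighted average <2> at work can simulate the deck. A small Monte Carlo sketch (not from the text):

    import random

    deck = ['R'] * 26 + ['B'] * 26
    trials, hits = 100_000, 0
    for _ in range(trials):
        random.shuffle(deck)   # all 52! orders equally likely
        if deck[0] == 'R' and deck[-1] == 'R':   # top and bottom cards red
            hits += 1
    print(hits / trials, 25 / 102)   # both close to 0.2451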
Section 2 will describe the extension of formula <2> to more general families
of conditional distributions {Pₜ : t ∈ 𝓣}. Unfortunately, for subtle technical reasons
(discussed in Appendix F), conditional distributions do not always exist. In such
situations we must settle for a weaker concept of conditional expectation, following
an approach introduced by Kolmogorov (1933, Chapter 5), as explained in Section 6.

REMARK. I claim that it is usually easier to think in terms of conditional
distributions, despite the technical caveats regarding existence. Kolmogorov's abstract
conditional expectations can be thought of as rescuing just one of the desirable
properties of conditional distributions in situations where the full conditional
distribution does not exist. The rescue comes at the cost of some loss in intuitive
appeal, and with some undesirable side effects. Not all intuitively desirable properties
of conditioning survive the nonexistence of a conditional probability distribution.
Section 7 will provide an example, where the abstract approach to conditioning
allows some counterintuitive cases to slip through the definition of sufficiency.

In some situations, such as the study of martingales, we need conditional
expectations for only a small collection of random variables. In those situations we
do not need the full conditional distribution, and so the abstract Kolmogorov approach
suffices.

2. Conditional distributions: the general case

With some small precautions about negligible sets, the representation of P as a
weighted average of distributions living on the level sets {T = t}, as in <2>, makes
sense in a more general setting.
<4>    Definition. Let T be an 𝓕\𝓑-measurable map from a probability space (Ω, 𝓕, P)
into a measurable space (𝓣, 𝓑). Let Q equal TP, the distribution of T under P. Call
a family 𝓟 = {Pₜ : t ∈ 𝓣} of probability measures on 𝓕 the conditional probability
distribution of P given T if
(i) Pₜ{T ≠ t} = 0 for Q almost all t in 𝓣,
(ii) the map t ↦ Pₜf is 𝓑-measurable and Pf = Qᵗ(Pₜf), for each f in M⁺(Ω, 𝓕).

In the language of Chapter 4, the family 𝓟 is a probability kernel from (𝓣, 𝓑)
to (Ω, 𝓕). The fine print about an exceptional Q-negligible set in (i) protects us
against those t not in the range of T; if {T = t} were empty then Pₜ would have
nowhere to live. We could equally well escape embarrassment by allowing Pₜ to
have total mass not equal to 1 for a Q-negligible set of t.

The Definition errs slightly in referring to the conditional probability distribution,
as if it were unique. Clearly we could change Pₜ on a Q-negligible set of t and still
have the two defining properties satisfied. Under mild conditions, that is the extent
of the possible nonuniqueness: see Theorem <9>.

REMARK. Many authors work with the slightly weaker concept of a regular
conditional distribution, substituting a requirement such as

    P(h(Tω)X(ω)) = Qᵗ(h(t)PₜX)    for all h in M⁺(𝓣, 𝓑)

for the concentration property (i). As you will see in Section 3, the difference
between the two concepts comes down to little more than a question of measurability
of a particular subset of Ω × 𝓣.

In some problems, where intuitively obvious candidates for the conditional
distributions exist, it is easy to check directly properties (i) and (ii) of the Definition.

<5>    Exercise. Let P denote Lebesgue measure on 𝓑([0, 1]²). Let T(x, y) = max(x, y).
Show that the conditional probability distributions {Pₜ} for P given T are uniform
on the sets {T = t}.

SOLUTION: Write m for Lebesgue measure on 𝓑[0, 1]. For 0 < t ≤ 1, formalize
the idea of a uniform distribution on the set

    {T = t} = {(x, t) : 0 ≤ x ≤ t} ∪ {(t, y) : 0 ≤ y ≤ t}

by defining

    Pₜf := (1/2t) mˣ(f(x, t){0 ≤ x ≤ t}) + (1/2t) mʸ(f(t, y){0 ≤ y ≤ t}).

You should check that Pₜ{T ≠ t} = 0 by direct substitution. The definition of Pₜ
for t = 0 will not matter, because P{T = 0} = 0. Tonelli gives measurability of
t ↦ Pₜf for each f in M⁺([0, 1]²).

The image measure Q is determined by the values it gives to the generating
class of all intervals [0, t],

    Q[0, t] = P{(x, y) : max(x, y) ≤ t} = (m[0, t])² = t²    for 0 ≤ t ≤ 1.

That is, Q is the measure that has density 2t with respect to m.

It remains to show that the {Pₜ} satisfy property (ii) required of a
conditional distribution:

    Qᵗ(Pₜf) = mᵗ 2t ((1/2t) mˣ(f(x, t){0 ≤ x ≤ t}) + (1/2t) mʸ(f(t, y){0 ≤ y ≤ t}))
            = mᵗmˣ(f(x, t){0 ≤ x ≤ t}) + mᵗmʸ(f(t, y){0 ≤ y ≤ t}).

Replace the dummy variables t, y in the last iterated integral by new dummy
variables x, t to see that the sum is just a decomposition of Pf = (m ⊗ m)f into
contributions from the two triangular regions {x < t} and {t < x}. The overlap
between the two regions, and the missing edges, all have zero m ⊗ m measure. □
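A numerical check of property (ii) for this Exercise (a sketch, with the arbitrary test function f(x, y) = x + y², for which both sides equal 5/6): approximate Pf by a Riemann sum over the square, and Qᵗ(Pₜf) by a Riemann sum over t.

    n = 400
    h = 1.0 / n
    f = lambda x, y: x + y * y   # an arbitrary test function

    # P f = integral of f over the unit square
    pf = sum(f((i + .5) * h, (j + .5) * h)
             for i in range(n) for j in range(n)) * h * h

    def P_t(t):   # uniform distribution on the two edges of {T = t}
        xs = [(i + .5) * (t / n) for i in range(n)]
        edges = sum(f(x, t) for x in xs) + sum(f(t, y) for y in xs)
        return edges * (t / n) / (2 * t)

    # Q has density 2t, so Q^t(P_t f) = integral of 2t * P_t f over [0, 1]
    rhs = sum(2 * t * P_t(t) * h for t in [(k + .5) * h for k in range(n)])
    print(pf, rhs)   # both approximately 5/6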
The decomposition of the uniform distribution on the square, into an average of
uniform conditional distributions on the boundaries of subsquares, has an extension
to more general sets, but the conditional distributions need no longer be uniform.
<6>    Example. Let K be a compact subset of R^d, star-shaped around the origin. That
is, if y ∈ K and 0 ≤ t ≤ 1 then ty ∈ K.

    [Figure: a star-shaped set K, a shrunken copy Kₜ, and the point x/ρ(x) on the boundary ∂K.]

For each t > 0, let Kₜ denote the compact subset {ty : y ∈ K}. The sets {Kₜ : t > 0}
are nested, shrinking to {0} as t decreases to zero. The sets define a function
ρ : R^d → R⁺, by ρ(x) := inf{t > 0 : x ∈ Kₜ} = inf{t : x/t ∈ K}. In fact, the
infimum is achieved for each x in K\{0}, with 1 ≥ ρ(x) > 0. Only when x = 0
do we have ρ(x) = 0. The set {x : ρ(x) = 1} is a subset of the boundary, ∂K, a
proper subset unless 0 lies in the interior of K. For each x in K\{0}, the point
ψ(x) := x/ρ(x) lies in ∂K. Notice that ψ(x/t) = ψ(x) for each t > 0.

Let P denote the uniform distribution on K, defined by

    Pf := mˣ(f(x){x ∈ K})/mK,    where m is Lebesgue measure on R^d.

The scaling property of Lebesgue measure under the transformation x ↦ x/t,

    mˣf(x/t) = t^d mʸf(y)    for f ∈ M⁺(R^d) and t > 0,

will imply independence of ρ(x) and ψ(x) under P. For 0 ≤ t ≤ 1 and g ∈ M⁺(∂K),

    Pˣ(g(ψ(x)){ρ(x) ≤ t}) = mˣ(g(ψ(x/t)){x/t ∈ K})/mK
                          = t^d mʸ(g(ψ(y)){y ∈ K})/mK.

In particular, P{ρ(x) ≤ t} = t^d for 0 ≤ t ≤ 1, and

    Pˣ(g(ψ(x)){ρ(x) ≤ t}) = (Pˣg(ψ(x))) P{ρ(x) ≤ t}.

A generating class argument then leads to the factorization needed for independence.

Write μ for the distribution of ψ(x), a probability measure concentrated on ∂K,
and Q for the distribution with density d t^{d−1} with respect to Lebesgue measure
on [0, 1]. The image of μ under the map x ↦ xr, for a fixed r in [0, 1], defines
a probability measure P_r concentrated on the set {x : ρ(x) = r} ⊆ ∂K_r. The
defining equality x = ρ(x)ψ(x) then has the interpretation: if R has distribution Q,
independently of Y, which has distribution μ, then RY has distribution P, that is,

    Pˣf(x) = Qʳμʸf(ry) = Qʳ P_rˣ f(x)    for f ∈ M⁺(R^d).

The probability kernel {P_r : 0 ≤ r ≤ 1} is the conditional distribution for P
given ρ(x). □
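For the simplest star-shaped set, the unit disk in R² (so ρ(x) = |x| and ψ(x) = x/|x|), the factorization is easy to check by simulation. A rough Monte Carlo sketch (not part of the text):

    import math, random

    trials = 100_000
    hit_r = hit_a = hit_both = 0
    for _ in range(trials):
        while True:   # rejection sampling: uniform distribution on the disk
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if x * x + y * y <= 1:
                break
        r, angle = math.hypot(x, y), math.atan2(y, x)
        if r <= 0.7: hit_r += 1
        if angle > 0: hit_a += 1
        if r <= 0.7 and angle > 0: hit_both += 1
    print(hit_r / trials, 0.7 ** 2)     # P{rho(x) <= t} = t^d with d = 2
    print(hit_both / trials, (hit_r / trials) * (hit_a / trials))  # independence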
<7>    Example. Consider ℙ := Pⁿ, the joint distribution for n independent observations
x₁, ..., xₙ from P, with P equal to the uniform distribution on a compact, star-shaped
subset K of R^d, as in Example <6>. Define T(x) := max_{i≤n} ρ(xᵢ). We
may generate each xᵢ from a pair of independent variables, xᵢ = rᵢyᵢ, with yᵢ ∈ ∂K
having distribution μ, and rᵢ ∈ [0, 1] having distribution Q. More formally,

    ℙf(x₁, ..., xₙ) = 𝐐ʳMʸf(r₁y₁, ..., rₙyₙ),

where r := (r₁, ..., rₙ) and y := (y₁, ..., yₙ), with M := μⁿ, the product measure on
(∂K)ⁿ, and 𝐐 := Qⁿ, the product measure on [0, 1]ⁿ. Then we have T = max_{i≤n} rᵢ,
with distribution ν for which ν[0, t] = (Q[0, t])ⁿ = t^{nd} for 0 ≤ t ≤ 1. Thus ν has
density nd t^{nd−1} with respect to Lebesgue measure on [0, 1].

Problem [2] shows how to construct the conditional distribution 𝐐ₜ for 𝐐
given T = t, namely: generate n − 1 independent observations s₂, ..., sₙ from
the conditional distribution Q_t := Q(· | s ≤ t), then take (r₁, ..., rₙ) as a random
permutation of (t, s₂, ..., sₙ). The representation for ℙ becomes

    ℙf(x₁, ..., xₙ) = νᵗ𝐐ₜʳMʸf(r₁y₁, ..., rₙyₙ).

Thus the conditional distribution ℙₜ equals 𝐐ₜ ⊗ M, for each t in [0, 1]. Less
formally, to generate a random sample x := (x₁, ..., xₙ) from the conditional
probability distribution ℙₜ:
(i) independently generate w from μ and y₂, ..., yₙ from Pⁿ⁻¹;
(ii) take x as a random permutation of (tw, ty₂, ..., tyₙ). □

REMARK. The two-step construction lets us reduce calculations involving n
independent observations from P to the special case where one of the points lies
on the boundary ∂K. The conditioning machinery provides a rigorous setting for
Crofton's theorem from geometric probability.

The theorem dates from an 1885 article by M. F. Crofton in the Encyclopaedia
Britannica. As noted by Eisenberg & Sullivan (2000), existing statements and
proofs (such as those in Chapter 2 of Kendall & Moran 1963, or Chapter 5 of
Solomon 1978) of the theorem are rather on the heuristic side. Eisenberg and
Sullivan derived a version of the theorem by means of Kolmogorov conditional
expectations, with supplementary smoothness assumptions. If the boundary is
smooth, the measure μ is absolutely continuous with respect to surface measure,
with a density expressible in terms of quantities familiar from differential geometry
(as discussed by Baddeley 1977).

Of course there are many situations where one cannot immediately guess the
form of the conditional distributions, and then one must rely on systematic methods
such as those to be discussed in Sections 4 and 5.

3. Integration and disintegration

Suppose P has conditional distribution 𝓟 = {Pₜ : t ∈ 𝓣} given T, as in Definition <4>.
As shown in Section 4.3, from the probability kernel 𝓟 and the distribution Q, it
is possible to construct a probability measure Q ⊗ 𝓟 on the product sigma-field by
means of an iterated integral, (Q ⊗ 𝓟)(g) := Qᵗ(Pₜ g(ω, t)) for g ∈ M⁺(Ω × 𝓣, 𝓕 ⊗ 𝓑).
This measure has marginal distributions P and Q, as may be seen by restricting g
to functions of ω alone or t alone. The concentration property (i) of the Definition
ensures that Pₜ g(ω, t) = Pₜ g(ω, Tω) for Q almost all t, which, by property (ii) of
the Definition, leads to

<8>    (Q ⊗ 𝓟)(g) = Qᵗ(Pₜ g(ω, t)) = Qᵗ(Pₜ g(ω, Tω)) = P g(ω, Tω).

That is, Q ⊗ 𝓟 is the joint distribution of ω and Tω, the image of P under the map
γ : ω ↦ (ω, Tω), which "lifts P up to live on the graph."

    [Figure: γ(P) concentrates on graph(T); each Pₜ concentrates on the level set {T = t}.]
That is, the joint distribution concentrates on the set

    graph(T) := {(ω, t) ∈ Ω × 𝓣 : t = Tω},

the graph of the map T. Indeed, provided the graph is a product measurable set,
existence of the conditional distribution is equivalent to the representation of the
image measure γ(P) as Q ⊗ 𝓟.

Existence of conditional distributions follows as a special case of a more
general decomposition. The following Theorem is proved in Appendix F, where it
is also shown that condition (ii) is satisfied by every sigma-finite Borel measure on
a complete separable metric space.
<9>    Theorem. Let X be a metric space equipped with its Borel sigma-field 𝓐, and
let T be a measurable map from X into a space 𝓣, equipped with a sigma-field 𝓑.
Suppose λ is a sigma-finite measure on 𝓐 and μ is a sigma-finite measure on 𝓑
dominating the image measure Tλ. Suppose:
(i) the graph of T is 𝓐 ⊗ 𝓑-measurable;
(ii) the measure λ is expressible as a countable sum of finite measures, each with
compact support.
Then there exists a kernel Λ = {λₜ : t ∈ 𝓣} from (𝓣, 𝓑) to (X, 𝓐) for which the
image of λ under the map x ↦ (x, Tx) equals μ ⊗ Λ, that is,

<10>    λˣg(x, Tx) = (μ ⊗ Λ)(g) := μᵗλₜˣg(x, t)    for g ∈ M⁺(X × 𝓣, 𝓐 ⊗ 𝓑).

Moreover, property <10> is equivalent to the two requirements
(iii) λₜ{T ≠ t} = 0 for μ almost all t,
(iv) λˣf(x) = μᵗλₜˣf(x), for each f in M⁺(X, 𝓐),
or to the assertion that Tx = t for μ ⊗ Λ almost all (x, t). The kernel Λ is unique
up to a μ almost sure equivalence.

REMARK. The uniqueness assertion is quite strong. If Λ = {λₜ : t ∈ 𝓣} and
Λ̃ = {λ̃ₜ : t ∈ 𝓣} are two kernels with the stated properties, it requires λₜ = λ̃ₜ, as
measures on 𝓐, for μ almost all t, and not just that λₜA = λ̃ₜA a.e. [μ], for each A.

A kernel Λ with the properties described by the Theorem is called a (T, μ)-disintegration
of λ. The construction of a disintegration is a sort of reverse operation
to the "integration" methods used in Section 4.3 to construct measures on product
spaces. As you will see in the next Section, disintegrations are essential for the
understanding of general conditional densities.

You should not worry too much about the details of Theorem <9>. I have
stated it in detail so that you can see how existence of disintegrations involves
topological assumptions (or topologically inspired assumptions, as in Pachl 1978).
Dellacherie & Meyer (1978, page 78) lamented that "The theorem on disintegration
of measures has a bad reputation, and probabilists often try to avoid the use of
conditional distributions ... But it really is simple and easy to prove." Perhaps the
unpopularity is due, in part, to the role of topology in the proof. Many probabilists
seem to regard topology as completely extraneous to any discussion of conditioning,
or even to any discussion of abstract probability theory. Nevertheless, it is a sad
reality of measure theoretic life that the axioms for countable additivity are not well
suited to dealing with uncountable families of negligible sets, and that occasionally
they need a little topological help.
REMARK. Topological ideas also come to the rescue of countable additivity in
the modern theory of stochastic processes in continuous time. I think it is no accident
that the abstract theory of (stochastic) processes has flourished more readily amongst
the probabilists who are influenced by the French approach to measure theory, where
measures are linear functionals and topological requirements are accepted as natural.

<11>    Example. Let λ denote Lebesgue measure on 𝓑(R²), let X denote the map that
projects R² onto the first coordinate axis, and let μ denote the one-dimensional
Lebesgue measure on that axis. Write λₓ for one-dimensional Lebesgue measure
transplanted to live on {x} ⊗ R. That is, λₓ is defined on 𝓑(R²) but it concentrates
all its mass along a line orthogonal to the first coordinate axis. By the Tonelli
Theorem, {λₓ : x ∈ R} is an (X, μ)-disintegration of λ.

REMARK. Many authors (including Dellacherie & Meyer 1978, Section III-70)
require Λ to be a probability kernel, a restriction that excludes interesting and useful
cases, such as the decomposition of Lebesgue measure from Example <11>. As
shown by Problem [3], Λ can be chosen as a probability kernel if and only if μ is
equal to the image measure Tλ. In particular, because the image of two-dimensional
Lebesgue measure under a coordinate projection is not a sigma-finite measure, there
is no way we could have chosen Λ as a probability kernel in Example <11>.

4. Conditional densities

In introductory courses one learns to calculate conditional densities by dividing
marginal densities into joint densities. Such a calculation has a natural generalization
for conditional distributions: it is merely a matter of reinterpreting the meanings of
the joint and marginal densities.

Recall the elementary construction. Suppose P has density p(x, y) with respect
to Lebesgue measure λ on 𝓑(R²). Then the X-marginal has density q(x) :=
∫ p(x, y) dy with respect to Lebesgue measure μ on the X-axis, and the conditional
distribution Pₓ(·) is given by the conditional density p(y | x) := p(x, y)/q(x).
Typically one does not worry too hard about how to define conditional distributions
when q(x) = 0 or q(x) = ∞, but maybe one should.

The dy in the definition of q(x) corresponds to the disintegrating Lebesgue
measure λₓ from Example <11>. The marginal density q(x) equals λₓp. It is the
density, with respect to Lebesgue measure μ on R, of the image of P under the
coordinate map X. The probability distribution Pₓ(·) is absolutely continuous with
respect to λₓ, with density p(· | x) standardized to integrate to 1.

To generalize the elementary formula, suppose a probability measure P
is dominated by a sigma-finite measure λ, which has a (T, μ)-disintegration
Λ = {λₜ : t ∈ 𝓣}. The density p(x) := dP/dλ corresponds to the joint density, and
q(t) := λₜˣp(x) corresponds to the marginal density. If the analogy is appropriate,
the ratio pₜ(x) := p(x)/q(t) should correspond to the conditional density, with λₜ
as the dominating measure. In fact, if we took a little care to avoid 0/0 or ∞/∞
by inserting an indicator function {0 < q < ∞} into the definition, the elementary
formula would indeed carry over to the general setting. For some purposes (such
as the discussion of sufficiency in Section 7) it is better not to force pₜ(x) to be
zero when q(t) = 0 or q(t) = ∞. The wording in part (iii) of the next Theorem is
designed to accommodate a slightly more flexible choice for the conditional density.
<12>    Theorem. Suppose P is a probability measure with density p with respect to a
sigma-finite measure λ that has a (T, μ)-disintegration Λ = {λₜ : t ∈ 𝓣}. Then
(i) the image measure Q := TP has density q(t) := λₜp with respect to μ;
(ii) the set {(x, t) : q(t) = ∞ or q(t) = 0 < p(x)} has zero μ ⊗ Λ measure.
Let pₜ(x) be an 𝓐 ⊗ 𝓑-measurable function for which q(t)pₜ(x) = p(x) a.e. [μ ⊗ Λ].
Then
(iii) the Pₜ defined by dPₜ/dλₜ := pₜ(·) is a probability measure, for Q almost
all t, and P has conditional probability distributions {Pₜ : t ∈ 𝓣} given T.

Proof. For each h in M⁺(𝓣),

    Qh = Pˣh(Tx)                [image measure]
       = λˣ(p(x)h(Tx))          [density p = dP/dλ]
       = μᵗ(h(t)λₜˣp(x))         [by <10> with g(x, t) = p(x)h(t)].

Thus Q has density q(t) := λₜˣp(x) with respect to μ.

From the fact that μq = 1 we have μ{q = ∞} = 0. Also, if q(t) = 0 for some t
then p(x) = 0 for λₜ almost all x. Assertion (ii) follows.

For Assertion (iii), first note that Pₜ concentrates on {T = t}, for μ almost all t
(and hence Q almost all t), because it is absolutely continuous with respect to λₜ.
The set {t : q(t) = 0 or ∞} has zero Q measure. When 0 < q(t) < ∞ we have
λₜˣpₜ(x) = λₜˣp(x)/q(t) = 1. Also, for f ∈ M⁺(X),

    Pf = λˣ(p(x)f(x))            [density p = dP/dλ]
       = μᵗλₜˣ(p(x)f(x))          [disintegration of λ]
       = μᵗλₜˣ(q(t)pₜ(x)f(x))     [assumption on pₜ]
       = μᵗ(q(t)λₜˣ(pₜ(x)f(x)))
       = Qᵗ(Pₜˣf(x)),

as required for the conditional distribution. □

REMARK. Notice that I did not prove that every Pₜ is a probability measure. In
fact, the best we can assert, as in Problem [3], is that Q-almost all Pₜ are probability
measures. Of course we have no control over pₜ(x) when q(t) is zero or infinite.
For maximum precision you might like to make appropriate almost sure modifications
to Definition <4>.

<13>    Example. Let {P_θ : θ ∈ Θ} be a family of probability measures on X. If the
map θ ↦ P_θf is measurable for each f ∈ M⁺(X), and if π is a probability (a Bayesian
prior distribution) on Θ, then Q = π ⊗ 𝓟 is a probability measure on X ⊗ Θ. The
coordinate maps X (onto X) and T (onto Θ) have Q as their joint distribution.
The conditional distribution of X given T = θ is P_θ. The conditional distribution
Qₓ(·) = Q(· | X = x) is called the Bayesian posterior distribution.

If P_θ has a density p(x, θ) with respect to a sigma-finite μ on X, then

    Qf = π^θ μˣ(p(x, θ)f(x, θ))    for f ∈ M⁺(X ⊗ Θ).

That is, Q has density p(x, θ) with respect to the product measure μ ⊗ π. The
product measure has the trivial disintegration (μ ⊗ π)ₓ = π, if we regard π to be
a measure living on {x} ⊗ Θ. It follows from Theorem <12> that the posterior
distribution Qₓ has density p(x, θ)/π^θ p(x, θ) with respect to π. Why would a
Bayesian probably not worry about the negligible sets where such a ratio is not well
defined? □
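In the simplest discrete setting the posterior formula p(x, θ)/π^θ p(x, θ) is a few lines of arithmetic. A sketch (with a made-up binomial model and a uniform prior on a grid of θ values):

    from math import comb

    n_trials, x_obs = 10, 7                    # hypothetical data
    thetas = [(k + 0.5) / 50 for k in range(50)]
    prior = [1.0 / 50] * 50                    # discrete uniform prior
    lik = [comb(n_trials, x_obs) * t ** x_obs * (1 - t) ** (n_trials - x_obs)
           for t in thetas]                    # p(x, theta) for the observed x
    marginal = sum(l * w for l, w in zip(lik, prior))   # pi^theta p(x, theta)
    posterior = [l * w / marginal for l, w in zip(lik, prior)]
    print(sum(posterior))                   # 1.0: the posterior is a probability
    print(max(zip(posterior, thetas))[1])   # posterior mode, near 0.7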
<14>    Example. Let P and Q be probability measures on (X, 𝓐), with densities p and
q with respect to a sigma-finite measure λ. The Hellinger distance H(P, Q) was
defined in Section 3.3 as the 𝓛²(λ) distance between √p and √q. Suppose λ has a
(T, μ)-disintegration Λ = {λₜ : t ∈ 𝓣}, with T a measurable map from X into 𝓣.
The image measures TP and TQ have densities p̄(t) = λₜ(p) and q̄(t) = λₜ(q)
with respect to μ. By the Cauchy-Schwarz inequality, (λₜ√(pq))² ≤ (λₜp)(λₜq), and hence

    λ√(pq) = μᵗλₜˣ√(p(x)q(x)) ≤ μᵗ√(p̄(t)q̄(t)).

That is, the Hellinger affinity between P and Q is smaller than the Hellinger affinity
between TP and TQ. Equivalently, H²(TP, TQ) ≤ H²(P, Q). □
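The inequality H²(TP, TQ) ≤ H²(P, Q) is transparent for finite X, where a map T simply merges atoms. A small check (with made-up probabilities, and T the parity map on {0, ..., 5}):

    import math

    p = [0.05, 0.10, 0.20, 0.25, 0.25, 0.15]
    q = [0.15, 0.20, 0.25, 0.20, 0.10, 0.10]

    def h2(a, b):   # squared Hellinger distance for discrete distributions
        return sum((math.sqrt(s) - math.sqrt(t)) ** 2 for s, t in zip(a, b))

    # T merges atoms by parity; image measures just add the merged masses
    tp = [sum(p[0::2]), sum(p[1::2])]
    tq = [sum(q[0::2]), sum(q[1::2])]
    print(h2(tp, tq), '<=', h2(p, q))   # the first number is the smaller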
<15>    Example. Let {P_θ : θ ∈ Θ} be a family of probability measures with densities
p(x, θ) with respect to a sigma-finite measure λ on (X, 𝓐). The method of maximum
likelihood promises good properties for the estimator defined by maximizing p(x, θ)
as a function of θ, for each observed x.

If x itself is not known, but instead the value of a measurable function T(x),
taking values in some space (𝓣, 𝓑), is observed, the method still applies. If Q_θ,
the distribution of T under P_θ, has density q(t, θ) with respect to some sigma-finite
measure μ, the estimate of θ is obtained by maximizing q(t, θ) as a function of θ,
for the observed value t = T(x).

REMARK. Often we can conceive of x as a one-to-one function of some pair
of statistics (S(x), T(x)). By observing only T(x) we find ourselves working with
incomplete data.

If direct maximization of θ ↦ q(t, θ) is awkward, there is an alternative
iterative method, due to Dempster, Laird & Rubin (1977), which at least offers
a way of finding a sequence of θ values with q(t, θ) increasing at each iteration.
Their method is called the EM-algorithm, the E standing for expectation, the M for
maximization. Each iteration consists of a two-step procedure to replace an initial
guess θ₀ by a new guess θ₁.

(E-step) Calculate G(θ) = P_{θ₀}(log p(x, θ) | T = t).
(M-step) Maximize G, or at least find a θ₁ for which G(θ₁) ≥ G(θ₀).

It is easiest to analyze the algorithm when λ has a (T, μ)-disintegration, so that
dQ_θ/dμ = q(t, θ) = λₜˣp(x, θ). Throughout the calculation t is held fixed, so there
is no ambiguity in writing ν for the measure λₜ. Similarly, it is cleaner to write πᵢ
for the density of P_{θᵢ}(· | T = t) with respect to ν, and qᵢ instead of q(t, θᵢ), for
i = 0, 1. Then p(x, θᵢ) = πᵢ(x)qᵢ, and

    0 ≤ G(θ₁) − G(θ₀) = νˣ(π₀(x) log(π₁(x)q₁/(π₀(x)q₀))),

which implies log(q₁/q₀) ≥ νˣ(π₀(x) log(π₀(x)/π₁(x))). The last integral defines
the Kullback-Leibler distance, which is always nonnegative, by Jensen's inequality.
Perhaps more informative is the lower bound (ν|π₁ − π₀|)²/2, proved in Section 3.3,
which quantifies the improvement in q(t, θ) from the EM iteration.

REMARK. I paid no attention to possible division by zero in the calculations
leading to the lower bound for log(q₁/q₀). You might like to provide appropriate
integrability assumptions, and insert indicator functions to guard against 0/0. It
would also be helpful to ignore the story about missing data and conditioning: as the
simplifications in notation should make clear, the EM algorithm works for any q(θ)
expressible as νˣp(x, θ), for any measure ν and any ν-integrable, nonnegative p.
Some regularity assumptions are needed to ensure finiteness of the various integrals
appearing in the Example.
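As the Remark says, the EM iteration applies whenever q(θ) = νˣp(x, θ). A compact Python sketch for one standard instance, a 50-50 mixture of N(μ₀, 1) and N(μ₁, 1), with the component labels playing the role of the missing data (the model, data, and starting values are invented for this illustration; the printed log likelihood is nondecreasing, as the Example guarantees):

    import math, random

    random.seed(1)
    data = [random.gauss(-2, 1) for _ in range(200)] + \
           [random.gauss(3, 1) for _ in range(200)]
    phi = lambda z, m: math.exp(-0.5 * (z - m) ** 2) / math.sqrt(2 * math.pi)

    def loglik(m0, m1):
        return sum(math.log(0.5 * phi(z, m0) + 0.5 * phi(z, m1)) for z in data)

    m0, m1 = 0.0, 1.0   # arbitrary starting guesses
    for step in range(10):
        # E-step: conditional probability of component 1, given the observation
        w = [0.5 * phi(z, m1) / (0.5 * phi(z, m0) + 0.5 * phi(z, m1))
             for z in data]
        # M-step: weighted means maximize the expected complete-data log likelihood
        m1 = sum(wi * z for wi, z in zip(w, data)) / sum(w)
        m0 = sum((1 - wi) * z for wi, z in zip(w, data)) / sum(1 - wi for wi in w)
        print(step, round(m0, 3), round(m1, 3), round(loglik(m0, m1), 3))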

*5. Invariance

Formal verification of the disintegration property can sometimes be reduced to a
symmetry, or invariance, argument, as in the following Exercise.

<16>    Exercise. Let P be a rotationally symmetric probability measure on 𝓑(R²), such
as the standard bivariate normal distribution. That is, if S_θ denotes the map for
rotation through an angle θ about the origin, then S_θP = P for all θ. Define R(x)
to equal (x₁² + x₂²)^{1/2}, the distance from the point x = (x₁, x₂) to the origin. Show
that P has conditional probability distributions P_r uniform on the sets {R = r}. That
is, show that the polar angle θ(x) is uniformly distributed on [0, 2π), independently
of R.

SOLUTION: Let Q denote the distribution of R under P. Let λ_r denote the uniform
probability measure on {R = r}, which can be characterized by invariance, S_θλ_r = λ_r,
for θ in a countable, dense set Θ₀ of rotational angles (Problem [14]). We need to
show that P_r = λ_r, for Q-almost all r.

The problem lies in the transfer of rotational invariance from P to each of
the conditional probabilities. Consider a fixed θ in the countable Θ₀ and a fixed f
in M⁺(R²). Then

    Pf = (S_θP)ʸf(y)        [invariance of P]
       = Pˣf(S_θx)          [image measure]
       = QʳP_rˣf(S_θx)       [disintegration]
       = Qʳ(S_θP_r)ʸf(y)     [image measure].

When P_r concentrates on the set {R = r}, so does S_θP_r. It follows that the family
{S_θP_r : r ∈ R⁺} is another disintegration for P. By uniqueness of disintegrations,
there exists a Q-negligible set 𝓝_θ such that S_θP_r = P_r for all r not in 𝓝_θ. Cast out
a sequence of negligible sets to deduce that, for Q almost all r, the probability
measure P_r concentrates on {R = r} and it is invariant under all Θ₀ rotations, which
implies that P_r = λ_r, as asserted.

Invariance of P_r implies that θ(x) has the uniform conditional distribution,
m := Uniform[0, 2π), under P_r. Thus, for g ∈ M⁺(R⁺) and h ∈ M⁺[0, 2π),

    Pˣ(g(R(x))h(θ(x))) = QʳP_rˣ(g(r)h(θ(x))) = (Qʳg(r))(m^θ h(θ)).

By a generating class argument it follows that the joint distribution of R(x) and θ(x)
equals the product Q ⊗ m on 𝓑(R⁺) ⊗ 𝓑[0, 2π), which implies independence. □
REMARK. A similar argument would work for any sigma-finite measure λ
that is invariant under a group 𝓖 of measurable transformations on X, if the level
sets {T = t} are also invariant. Let 𝓖₀ be a countable subset of 𝓖. Then the
measures {λₜ} must be invariant under each transformation in 𝓖₀, except possibly for
those t in a negligible set that depends on 𝓖₀. Often invariance under a suitable 𝓖₀
will characterize the {λₜ} up to multiplicative constants.

Appeals to symmetry can be misleading if not formalized as invariance
arguments. The classical Borel paradox, in the next Example, is a case in point.
<n>

Example. Let x be a point chosen at random (from the uniform distribution P)


on the surface of the Earth. Intuitively, if the point lies somewhere along the equator
then the longitude should be uniformly distributed over the range [180, 180].
But what is so special about the equator? Given that the point lies on any particular
great circle, its position should be uniformly distributed around that circle. In
particular, for a great circle through the poles (that is, conditional on the longitude)
there should be conditional probability 1/4 that the point lies north of latitude 45N.
Average out over the longitude to deduce that the point has probability 1/4 of lying
in the spherical cap extending from the north pole down to the 45 parallel of
latitude. Unfortunately, that cap does not cover 1/4 of the earth's surface area, as
one would require for a point uniformly distributed over the whole surface. That is
Borel's paradox.
The apparent paradox does not mean that one cannot argue by symmetry in
assigning conditional distributions. As Kolmogorov (1933, page 51) put it, "the
concept of a conditional probability with regard to an isolated hypothesis whose
probability equals 0 is inadmissible. For we can obtain a probability distribution
for [the latitude] on the meridian circle only if we regard this circle as an element
of the decomposition of the entire spherical surface into meridian circles with the
given poles." In other words, the conditional distributions on circles are not defined
unless the disintegration in which they are level sets is specified. Even then, the
conditional distributions are only determined up to an almost sure equivalence.
We can, however, argue by invariance for a uniform conditional distribution on almost all circles of constant latitude in their roles as level sets for a latitude disintegration. Suppose $T$ picks off the latitude of the point on the surface. The sets of constant $T$ value are the parallels of constant latitude. Each such parallel is mapped onto itself by rotations about the polar axis; the sets $\{T = t\}$ are invariant under each member of that group of rotations. The uniform measure $P$ is also invariant under such rotations. As in Example <16>, it follows that almost all $P_t$ measures must be invariant under those rotations. That is, almost all the conditional probability distributions are uniform around circles of constant latitude. It follows that almost all great circles of constant latitude must carry a uniform conditional distribution. Unfortunately, the equator is the only such great circle; it corresponds to a negligible set of $T$ values. Thus there is no harm in assuming that $P_0$, the conditional distribution around the equator, is uniform, just as there is no harm in assuming it to carry any other conditional distribution. A uniform conditional distribution around the equator is not forced by the $T$ disintegration.
What happens if we change $T$ to pick off the longitude of the random point on the surface? It would be nonsense to argue that the conditional distributions stay the same: the level sets supporting those distributions are not even the same as before. Couldn't we just argue again by invariance? Certainly not. The great circles of constant $T$ are no longer invariant under a group of rotations; the disintegrating measures need no longer be invariant; the conditional distributions around the great circles of constant longitude are not uniform.
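The paradox shows up clearly in simulation. The following sketch (my illustration, with arbitrary band widths) conditions a uniform point on the sphere to lie in a thin band: around the equator the longitude is nearly uniform, but around a meridian the latitude has density proportional to the cosine of the latitude, so the probability of lying above latitude $45^\circ$N is about 0.146, not 1/4.

```python
# Illustration (not from the text) of the Borel paradox: thin equatorial and
# meridian bands carry different conditional shapes.
import numpy as np

rng = np.random.default_rng(1)
p = rng.standard_normal((2_000_000, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)       # uniform on the sphere
lat = np.arcsin(p[:, 2])                            # latitude in [-pi/2, pi/2]
lon = np.arctan2(p[:, 1], p[:, 0])                  # longitude in (-pi, pi]

eps = 0.01
near_equator = np.abs(lat) < eps
near_meridian = np.abs(lon) < eps

# Longitude given the equatorial band: close to uniform on (-pi, pi].
print(np.histogram(lon[near_equator], bins=4)[0])   # roughly equal counts

# Latitude given the meridian band: density ~ cos(lat)/2, so
# P{lat > pi/4} ~ (1 - sin(pi/4))/2 ~ 0.146, not the naive 1/4.
print((lat[near_meridian] > np.pi / 4).mean())
```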
There is a general lesson to be learned from the last Example. Even if almost
all level sets $\{T = t\}$ in one disintegration must carry particular distributions, it
does not follow that similar sets in a different disintegration must carry the same
distributions. It is an even worse error to assume that a conditional distribution that
could be assigned to a region as a typical level set of one disintegration must be the
same as the conditional distribution similar regions must carry if they appear as level
sets of another disintegration.

6. Kolmogorov's abstract conditional expectation


Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $T$ be an $\mathcal{F}\backslash\mathcal{B}$-measurable map into a set $\mathcal{T}$ equipped with a sigma-field $\mathcal{B}$. Write $Q$ for the image measure $TP$.
Suppose the conditional distribution $\{P_t : t \in \mathcal{T}\}$ of $P$ given $T$ exists. For a fixed $X$ in $M^+(\Omega, \mathcal{F})$, the conditional expectation $g_X(t) := P_t^\omega X(\omega)$ is a function in $M^+(\mathcal{T}, \mathcal{B})$, with the property that $P_t^\omega(\alpha(T\omega)X(\omega)) = \alpha(t)g_X(t)$ a.e. $[Q]$, and hence

<18>  $P^\omega(\alpha(T\omega)X(\omega)) = Q^t(\alpha(t)g_X(t))$  for each $\alpha \in M^+(\mathcal{T}, \mathcal{B})$.

Even when the conditional distribution does not exist, it turns out that it is still possible to find a function $t \mapsto g(t, X)$ satisfying an analog of <18>, for each fixed $X$. Kolmogorov (1933) suggested that such a function should be interpreted as the conditional expectation of $X$ given $T = t$. Most authors write $E(X \mid T = t)$ for $g(t, X)$. This abstract conditional expectation has many, but not all, of the properties associated with an expectation with respect to a conditional distribution. Kolmogorov's suggestion has the great merit that it imposes no extra regularity assumptions on the underlying probability spaces and measures, but at the cost of greater abstraction.
The properties of the Kolmogorov conditional expectation $X \mapsto g(t, X)$ are analogous to the properties of increasing linear functionals with the Monotone Convergence property, except for omnipresent a.e. $[Q]$ qualifiers.

<19>
Theorem. There is a map $X \mapsto g(t, X)$ from $M^+(\Omega, \mathcal{F})$ to $M^+(\mathcal{T}, \mathcal{B})$ with the property that $P^\omega(\alpha(T\omega)X(\omega)) = Q^t(\alpha(t)g(t, X))$ for each $\alpha \in M^+(\mathcal{T}, \mathcal{B})$. For each $X$, the function $t \mapsto g(t, X)$ is unique up to a $Q$-equivalence. The map has the following properties.
(i) If $X = h(T)$ for some $h$ in $M^+(\mathcal{T}, \mathcal{B})$ then $g(t, X) = h(t)$ a.e. $[Q]$.
(ii) For $X_1, X_2$ in $M^+(\Omega, \mathcal{F})$ and $h_1, h_2$ in $M^+(\mathcal{T}, \mathcal{B})$,
$g(t, h_1(T)X_1 + h_2(T)X_2) = h_1(t)g(t, X_1) + h_2(t)g(t, X_2)$  a.e. $[Q]$.
(iii) If $0 \le X_1 \le X_2$ a.e. $[P]$ then $g(t, X_1) \le g(t, X_2)$ a.e. $[Q]$.
(iv) If $0 \le X_1 \le X_2 \le \ldots \uparrow X$ a.e. $[P]$ then $g(t, X_n) \uparrow g(t, X)$ a.e. $[Q]$.
To establish the existence and the four properties of the function $g(t, X)$, we must systematically translate to corresponding assertions about averages, by means of the following simple result.
<20>

Lemma. Let $h_1$ and $h_2$ be functions in $M^+(\mathcal{T}, \mathcal{B})$ for which

$Q^t(\alpha(t)h_1(t)) \le Q^t(\alpha(t)h_2(t))$  for all $\alpha \in M^+(\mathcal{T}, \mathcal{B})$.

Then $h_1 \le h_2$ a.e. $[Q]$. If the inequality is replaced by an equality, that is, if $Q^t(\alpha(t)h_1(t)) = Q^t(\alpha(t)h_2(t))$ for all $\alpha$, then $h_1 = h_2$ a.e. $[Q]$.

Proof. For a positive rational number $r$, let $A_r := \{t : h_2(t) < r < h_1(t)\}$. There are no $\infty - \infty$ problems in subtracting to get $0 \le Q^t((h_2(t) - h_1(t))\{t \in A_r\})$, which forces $QA_r = 0$ because $h_2 - h_1$ is strictly negative on $A_r$. Take the union over all positive rational $r$ to deduce that $Q\{h_2 < h_1\} = 0$. Reverse the roles of $h_1$ and $h_2$ to get the companion assertion, $Q\{h_1 < h_2\} = 0$, for the case where the inequality is replaced by an equality.

Proof of Theorem <19>. As before, to simplify notation, abbreviate $M^+(\Omega, \mathcal{F})$ to $M^+(\Omega)$ and $M^+(\mathcal{T}, \mathcal{B})$ to $M^+(\mathcal{T})$.
Fix an $X$ in $M^+(\Omega)$. For each $n \in \mathbb{N}$, define an increasing linear functional $\nu_n$ on $M^+(\mathcal{T})$ by $\nu_n(\alpha) := P^\omega((X(\omega) \wedge n)\alpha(T\omega))$ for $\alpha \in M^+(\mathcal{T})$. It is easy to check that $\nu_n(\alpha) \le nQ(\alpha)$ and that $\nu_n$ has the Monotone Convergence property. Thus each $\nu_n$ corresponds to a finite measure on $\mathcal{B}$, absolutely continuous with respect to $Q$. By the Radon-Nikodym Theorem (Section 3.1), there exist bounded densities $\gamma_n$ in $M^+(\mathcal{T})$ for which $\nu_n\alpha = Q(\gamma_n\alpha)$ for each $\alpha$ in $M^+(\mathcal{T})$. By Lemma <20>, the inequality $Q(\gamma_n\alpha) = \nu_n\alpha \le \nu_{n+1}\alpha = Q(\gamma_{n+1}\alpha)$ for all $\alpha \in M^+(\mathcal{T})$ implies that $\gamma_n \le \gamma_{n+1}$ a.e. $[Q]$. That is, $\{\gamma_n\}$ is increasing a.e. $[Q]$ to $\gamma := \limsup_n \gamma_n \in M^+(\mathcal{T})$. Two appeals to Monotone Convergence then give the desired equality,

$P^\omega(\alpha(T\omega)X(\omega)) = \lim_{n\to\infty} \nu_n\alpha = \lim_{n\to\infty} Q^t(\gamma_n(t)\alpha(t)) = Q^t(\gamma(t)\alpha(t)),$

for $\alpha \in M^+(\mathcal{T})$. The uniqueness of $\gamma$ up to $Q$-equivalence follows directly from Lemma <20>. Arbitrarily choose from the $Q$-equivalence class of all possible $\gamma$'s one member, and call it $g(t, X)$.
Again by Lemma <20>, the first three assertions are equivalent to the relationships, for all $\alpha \in M^+(\mathcal{T})$,
(i) $Q(\alpha(t)g(t, X)) = Q(\alpha(t)h(t))$, when $X = h(T)$,
(ii) $Q(\alpha(t)g(t, h_1(T)X_1 + h_2(T)X_2)) = Q(\alpha(t)(h_1(t)g(t, X_1) + h_2(t)g(t, X_2)))$,
(iii) $Q(\alpha(t)g(t, X_1)) \le Q(\alpha(t)g(t, X_2))$.
Systematically replace all expressions like $Q(F(t)g(t, Z))$ by the corresponding $P(F(T)Z)$, and expressions like $QG(t)$ by the corresponding $PG(T)$, to get further equivalent relationships,
(i) $P(\alpha(T)X) = P(\alpha(T)h(T))$,
(ii) $P(\alpha(T)(h_1(T)X_1 + h_2(T)X_2)) = P(\alpha(T)h_1(T)X_1 + \alpha(T)h_2(T)X_2)$,
(iii) $P(\alpha(T)X_1) \le P(\alpha(T)X_2)$.
The first three assertions follow.
For the fourth assertion, note that (iii) implies $g(t, X_n) \uparrow \gamma(t) := \limsup_n g(t, X_n)$ a.e. $[Q]$. Then apply Monotone Convergence twice, with $\alpha$ in $M^+(\mathcal{T})$, to get

$Q(\alpha(t)\gamma(t)) = \lim_n Q(\alpha(t)g(t, X_n)) = \lim_n P(\alpha(T)X_n) = P(\alpha(T)X) = Q(\alpha(t)g(t, X)),$

from which (iv) follows via Lemma <20>. $\square$
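On a finite space the Radon-Nikodym construction in the proof reduces to arithmetic. A minimal sketch (my illustration; the space, weights, and conditioning map are arbitrary): $g(t, X)$ is a ratio of weighted sums, and the defining property <18> becomes an identity between finite sums.

```python
# Finite-space sketch (not from the text): the Kolmogorov conditional
# expectation g(t, X) as a ratio of sums, plus a check of property <18>.
import numpy as np

omega = np.arange(6)                         # Omega = {0, ..., 5}
prob = np.array([.1, .2, .1, .2, .3, .1])    # the probability measure P
X = omega.astype(float) ** 2                 # a random variable X(w) = w^2
T = omega % 3                                # the conditioning map T

g = {}
for t in np.unique(T):
    mask = T == t
    g[t] = (prob[mask] * X[mask]).sum() / prob[mask].sum()  # g(t, X)
print(g)

# Property <18>: P(alpha(T) X) = Q(alpha(t) g(t, X)) for an arbitrary alpha.
alpha = np.array([2.0, -1.0, 0.5])
lhs = (prob * alpha[T] * X).sum()
rhs = sum(prob[T == t].sum() * alpha[t] * g[t] for t in np.unique(T))
print(np.isclose(lhs, rhs))                  # True
```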
At the risk of misinterpreting the symbol as an expectation of $X$ with respect to a conditional probability measure $P(\cdot \mid T = t)$, I will write $P(X \mid T = t)$ instead of $g(t, X)$ for the Kolmogorov conditional expectation, by analogy with the traditional $E(X \mid T = t)$.
If it were possible to combine the negligible sets corresponding to the a.e. $[Q]$ qualifiers on each of the assertions (ii), (iii), and (iv) of the Theorem into a single $Q$-negligible set $\mathcal{N}$, the maps $\{g(t, \cdot) : t \notin \mathcal{N}\}$ would define a family of increasing linear functionals on $M^+(\Omega, \mathcal{F})$, each with the Monotone Convergence property. Moreover, (ii) would even allow us to treat functions of $T$ as constants when conditioning on $T$. Such functionals would then define the conditional distribution as a family of probability measures on $\mathcal{F}$. Unfortunately, we accumulate uncountably many negligible sets as we cycle (ii)-(iv) through all the required combinations of functions in $M^+(\Omega)$ and positive constants. If we could somehow reduce (ii)-(iv) to a countable collection of requirements, each involving the exclusion of a single negligible set, then the task of constructing the conditional distribution would become easy. The topological assumptions in Theorem <9> are used precisely to achieve this effect.
REMARK.
The accumulation of negligible sets causes no difficulty when we
deal with only countably many random variables X. In such circumstances, the
Kolmogorov conditional expectation effectively has the properties of an expectation
with respect to a conditional distribution.

For general $P$-integrable $X$, define

$P(X \mid T = t) := P(X^+ \mid T = t) - P(X^- \mid T = t),$

with the conditional expectations for positive and negative parts of $X$ defined as in Theorem <19>. Notice that both those conditional expectations can be taken (almost) everywhere finite, because they have finite expectations: choose $\alpha \equiv 1$ in the defining equality to get $Q^t P(X^\pm \mid T = t) = PX^\pm < \infty$. To avoid any problem with infinite expectations, we could also restrict the $\alpha$ to be a bounded $\mathcal{B}$-measurable
random variable, and define $P(X \mid T = t)$ for integrable $X$ as the $Q$-integrable random variable $g(t, X)$ for which

<21>  $P(\alpha(T)X) = Q(\alpha(t)g(t, X))$  for all bounded, $\mathcal{B}$-measurable $\alpha$.

Brave readers might also want to dabble with conditional expectations for other cases where only one of $PX^+$ or $PX^-$ is finite. Beware of conditional $\infty - \infty$, whatever that might mean!
There are conditional analogs of the Fatou Lemma (Problem [7]), Dominated
Convergence (Problem [8]), Jensen's inequality (Problem [13]), and so on. The
derivations are essentially the same as for ordinary expectations, because the
accumulation of countably many Q-negligible sets causes no difficulty when we
deal with only countably many random variables X.
Conditioning on a sub-sigma-field
The choice of $Q$ as the image measure lets us rewrite $Q^t(\alpha(t)g(t, X))$ as $P(\alpha(T)g(T, X))$. Both $g(T, X)$ and $\alpha(T)$ are $\sigma(T)$-measurable functions on $\Omega$. Indeed (recall the Problems to Chapter 2), every $\sigma(T)\backslash\mathcal{B}[0, \infty]$-measurable map into $[0, \infty]$ must be of the form $\alpha(T)$ for some $\alpha$ in $M^+(\mathcal{T}, \mathcal{B})$. The defining property may be recast as: to each $X$ in $M^+(\Omega, \mathcal{F})$ there exists an $X_T := g(T, X)$ in $M^+(\Omega, \sigma(T))$ for which

<22>  $P(WX) = P(WX_T)$  for each $W$ in $M^+(\Omega, \sigma(T))$.

The random variable $X_T$, which is also denoted by $P(X \mid T)$ (or $E(X \mid T)$, in traditional notation), is unique up to a $P$ almost sure equivalence.
REMARK. Be sure that you understand the distinction between the functions $g(t) := g(t, X)$ and $g(T) = g \circ T$; they live on different spaces, $\mathcal{T}$ and $\Omega$. If we write $P(X \mid T = t)$ for $g(t)$ it is tempting to write the nonsensical $P(X \mid T = T)$ for $g(T)$, instead of $P(X \mid T)$, or even $P(X \mid \sigma(T))$ (see below).

The conditional expectation $P(X \mid T)$ depends on $T$ only through the sigma-field $\sigma(T)$. If $S$ were another random element for which $\sigma(T) = \sigma(S)$, the conditional expectation $P(X \mid S)$ would be defined by the same requirement <22>. That is, except for the unavoidable nonuniqueness within a $P$-equivalence class of random variables, we have $P(X \mid T) = P(X \mid S)$. We could regard the conditioning information as coming from a sub-sigma-field of $\mathcal{F}$ rather than from a $T$ that generates the sub-sigma-field.
<23>

Definition. Let $X$ belong to $M^+(\Omega, \mathcal{F})$ and $P$ be a probability measure on $\mathcal{F}$. For each sub-sigma-field $\mathcal{G}$ of $\mathcal{F}$, the conditional expectation $P(X \mid \mathcal{G})$ is the random variable $X_\mathcal{G}$ in $M^+(\Omega, \mathcal{G})$ for which

<24>  $P(gX) = P(gX_\mathcal{G})$  for each $g$ in $M^+(\Omega, \mathcal{G})$.

The variable $X_\mathcal{G}$ is called the conditional expectation of $X$ given the sub-sigma-field $\mathcal{G}$. It is unique up to $P$-equivalence.
Be careful that you remember to check that $X_\mathcal{G}$ is $\mathcal{G}$-measurable. Otherwise you might be tempted to leap to the conclusion that $X_\mathcal{G}$ equals $X$, as a trivial solution to the equality <24>. Such an identification is valid only when $X$ itself is $\mathcal{G}$-measurable. That is, $P(X \mid \mathcal{G}) = X$ when $X$ is $\mathcal{G}$-measurable (cf. part (i) of Theorem <19>).
REMARK. The traditional notation for $P(X \mid \mathcal{G})$ is $E(X \mid \mathcal{G})$. It would perhaps be more precise to speak of a conditional expectation, to stress the nonuniqueness. But, as with most situations involving an almost sure equivalence, the tradition is to ignore such niceties, except for the occasional almost sure reminder.
Some authors write the conditional expectation $P(X \mid \mathcal{G})$ as $\mathcal{G}X$, a prefix notation that stresses its role as a linear map from random variables to random variables. If I were not such a traditionalist, I might adopt this more concise notation, which has much to recommend it.
The existence and almost sure uniqueness of $P(X \mid \mathcal{G})$ present no new challenges: it is just $P(X \mid T)$ for the identity map $T$ from $(\Omega, \mathcal{F})$ onto $(\Omega, \mathcal{G})$. Theorem <19> specializes easily to give us "measure-like" properties of $P(\cdot \mid \mathcal{G})$.
<25>

Theorem. For a fixed sub-sigma-field $\mathcal{G}$ of $\mathcal{F}$, conditional expectations have the following properties.
(i) $P(X \mid \mathcal{G}) = X$ a.e. $[P]$, for each $\mathcal{G}$-measurable $X$.
(ii) For $X_1, X_2$ in $M^+(\mathcal{F})$ and $g_1, g_2$ in $M^+(\mathcal{G})$,
$P(g_1X_1 + g_2X_2 \mid \mathcal{G}) = g_1P(X_1 \mid \mathcal{G}) + g_2P(X_2 \mid \mathcal{G})$  a.e. $[P]$.
(iii) If $0 \le X_1 \le X_2$ then $P(X_1 \mid \mathcal{G}) \le P(X_2 \mid \mathcal{G})$ a.e. $[P]$.
(iv) If $0 \le X_1 \le X_2 \le \ldots \uparrow X$ then $P(X_n \mid \mathcal{G}) \uparrow P(X \mid \mathcal{G})$ a.e. $[P]$.
REMARK. You might find it helpful to think of $\mathcal{G}$ as partial information: the information one obtains about $\omega$ by learning the value $g(\omega)$ for each $\mathcal{G}$-measurable random variable $g$. The value of the conditional expectation $P(X \mid \mathcal{G})$ is determined by the "$\mathcal{G}$-information"; it is a prediction of $X(\omega)$ based on partial information. (The conditional expectation can also be interpreted as the fair price to pay for the random return $X(\omega)$ after one learns the $\mathcal{G}$-information about $\omega$, as with the fair-price interpretation described in Section 1.5.) Whatever you prefer as a guide to intuition, be warned: you should not take the interpretation too literally, because there are examples (Problem [5]) where one can determine $\omega$ precisely from the values of all $\mathcal{G}$-measurable $g$, even with $\mathcal{G}$ a proper sub-sigma-field of $\mathcal{F}$.
<26>

Example. Suppose an $X$ in $M^+(\Omega, \mathcal{F})$ is independent of all $\mathcal{G}$-measurable random variables. (That is, $\sigma(X)$ is independent of $\mathcal{G}$.) What does the partial information heuristic for conditioning on sigma-fields suggest for the value $P(X \mid \mathcal{G})$?
Information from $\mathcal{G}$ leaves us just as ignorant about $X$ as when we started, when the prediction of $X$ was the constant $PX$. It would seem that we should have $P(X \mid \mathcal{G}) = PX$ when $X$ is independent of $\mathcal{G}$. Indeed, the constant random variable $C := PX$ is $\mathcal{G}$-measurable and, for all $g$ in $M^+(\mathcal{G})$, independence gives $P(gX) = (Pg)(PX) = P(gC)$, as required to show that $C = P(X \mid \mathcal{G})$.

<27>

Example. Suppose $\mathcal{G}_0$ and $\mathcal{G}_1$ are sub-sigma-fields of $\mathcal{F}$ with $\mathcal{G}_0 \subseteq \mathcal{G}_1$, and let $X \in M^+(\Omega, \mathcal{F})$. Let $X_i := P(X \mid \mathcal{G}_i)$, for $i = 0, 1$. That is, $X_i \in M^+(\Omega, \mathcal{G}_i)$ and $P(g_iX) = P(g_iX_i)$ for each $g_i$ in $M^+(\Omega, \mathcal{G}_i)$. In particular, $P(g_0X_1) = P(g_0X) = P(g_0X_0)$ for each $g_0$ in $M^+(\Omega, \mathcal{G}_0)$, because $M^+(\Omega, \mathcal{G}_0) \subseteq M^+(\Omega, \mathcal{G}_1)$. That is, the random variable $X_0$ has the two properties characterizing $P(X_1 \mid \mathcal{G}_0)$. Put another way, $P(P(X \mid \mathcal{G}_1) \mid \mathcal{G}_0) = P(X \mid \mathcal{G}_0)$ almost surely.

<28>

Example. When $PX^2 < \infty$, the defining property of the conditional expectation $X_\mathcal{G} = P(X \mid \mathcal{G})$ may be written as $P((X - X_\mathcal{G})g) = 0$ for all bounded, $\mathcal{G}$-measurable $g$. A generating class argument extends the equality to all square-integrable, $\mathcal{G}$-measurable $Z$. That is, $X - X_\mathcal{G}$ is orthogonal to $L^2(\Omega, \mathcal{G}, P)$. The (equivalence class of the) conditional expectation is just the projection of $X$ onto $L^2(\Omega, \mathcal{G}, P)$, as a closed subspace of $L^2(\Omega, \mathcal{F}, P)$.
Some abstract conditioning results are readily explained in this setting. For example, if $\mathcal{G}_0 \subseteq \mathcal{G}_1$, for two sub-sigma-fields of $\mathcal{F}$, then the assertion

$P(P(X \mid \mathcal{G}_1) \mid \mathcal{G}_0) = P(X \mid \mathcal{G}_0)$  almost surely

from Example <27> corresponds to the fact that $\pi_0 \circ \pi_1 = \pi_0$, where $\pi_i$ denotes the projection map from $L^2(\Omega, \mathcal{F}, P)$ onto $H_i := L^2(\Omega, \mathcal{G}_i, P)$. The $\pi_0$ on the left-hand side kills any component in $H_1$ that is orthogonal to $H_0$, leaving only the component in $H_0$.
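The projection description can be verified directly on a finite space, where a sub-sigma-field generated by a partition makes $P(X \mid \mathcal{G})$ the vector of weighted block averages. A small sketch (my illustration; the measure, $X$, and partition are arbitrary):

```python
# Sketch (not from the text): P(X | G) for G generated by a partition is the
# block-wise weighted average, i.e. the L2(P)-projection of X onto the space
# of functions constant on each block.
import numpy as np

prob = np.array([.15, .05, .3, .1, .25, .15])   # P on Omega = {0, ..., 5}
X = np.array([3., -1., 4., 1., 5., 9.])
blocks = np.array([0, 0, 1, 1, 2, 2])           # partition generating G

XG = np.empty_like(X)
for b in np.unique(blocks):
    m = blocks == b
    XG[m] = (prob[m] * X[m]).sum() / prob[m].sum()   # block average

# Orthogonality: X - XG is P-orthogonal to every G-measurable g.
for g in (np.ones(6), (blocks == 1).astype(float), blocks.astype(float)):
    print((prob * (X - XG) * g).sum())          # all ~ 0
```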

*7. Sufficiency
The intuitive definition of sufficiency says that a statistic $T$ (a measurable map into some space $\mathcal{T}$) is sufficient for a family of probability measures $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ if the conditional distributions given $T$ do not depend on $\theta$. The value of $T(\omega)$ carries all the "information about $\theta$" there is to learn from an observation $\omega$ on $P_\theta$.
There are (at least) two ways of making the definition precise, using either the Kolmogorov notion of conditional expectation or the properties of conditional distributions in the sense of Section 2. To distinguish between the two approaches, I will use nonstandard terminology.

<29>

Definition. Say that $T$ is strongly sufficient for a family of probability measures $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ on $(\Omega, \mathcal{F})$ if there is a probability kernel $\{P_t : t \in \mathcal{T}\}$ that serves as a conditional distribution for each $P_\theta$ given $T$.
Say that $T$ is weakly sufficient for $\mathcal{P}_\Theta$ if for each $X$ in $M^+(\Omega, \mathcal{F})$ there exists a version of the Kolmogorov conditional expectation $P_\theta(X \mid T = t)$ that does not depend on $\theta$. Say that a sub-sigma-field $\mathcal{G}$ of $\mathcal{F}$ is weakly sufficient for $\mathcal{P}_\Theta$ if, for each $X$ in $M^+(\Omega, \mathcal{F})$ there exists a version of $P_\theta(X \mid \mathcal{G})$ that does not depend on $\theta$.

REMARK. I add the subscript $\Theta$ to $\mathcal{P}_\Theta$ to ensure that there is no confusion with $\mathcal{P}$ as a probability kernel from $\mathcal{T}$ to $\Omega$.

The concept of strong sufficiency places more restrictions on the underlying probability spaces, but those restrictions have the added benefit of making some arguments more intuitive. The weaker concept has the advantage that it requires only the Kolmogorov definition of conditioning. However, it also has the drawback of allowing some counterintuitive consequences. For example, if a sub-sigma-field $\mathcal{G}_0$ is weakly sufficient for a family $\mathcal{P}_\Theta$ then, in the intuitive sense explained in Section 6, it provides all the information available about $\theta$ from an observation on an unknown $P_\theta$ in $\mathcal{P}_\Theta$. If $\mathcal{G}_1$ is another sub-sigma-field, with $\mathcal{G}_0 \subseteq \mathcal{G}_1$, then intuitively $\mathcal{G}_1$ should provide even more information about $\theta$, which should make $\mathcal{G}_1$ weakly sufficient for $\mathcal{P}_\Theta$. Unfortunately, as the next Example shows, the intuition is not correct.

<30>
Example. Let $\Omega$ be the real line, equipped with its Borel sigma-field $\mathcal{F}$. For each $\theta > 0$ define $P_\theta$ as the probability measure on $\mathcal{F}$ that puts mass 1/2 at each of the two points $\pm\theta$. Say that a set $B$ is symmetric if $-\omega \in B$ for each $\omega$ in $B$. Let $S$ be a fixed symmetric set not in $\mathcal{F}$, and not containing the origin. Write $\mathcal{G}_0$ for the sub-sigma-field of all symmetric Borel sets, and $\mathcal{G}_1$ for the sub-sigma-field of all Borel sets $B$ for which $BS^c$ is symmetric.
A simple generating class argument shows that a Borel measurable function $g$ is $\mathcal{G}_0$-measurable if and only if $g(\omega) = g(-\omega)$ for all $\omega$, and it is $\mathcal{G}_1$-measurable if and only if $g(\omega) = g(-\omega)$ for all $\omega$ in $S^c$.
Consider an $X$ in $M^+(\mathcal{F})$. The symmetric function $X_0(\omega) := \frac{1}{2}(X(\omega) + X(-\omega))$ is $\mathcal{G}_0$-measurable. For all $\theta > 0$ and all $g_0$ in $M^+(\mathcal{G}_0)$,

$P_\theta(Xg_0) = \tfrac{1}{2}(X(\theta)g_0(\theta) + X(-\theta)g_0(-\theta)) = X_0(\theta)g_0(\theta) = P_\theta(X_0g_0).$

That is, $X_0$ is a version of $P_\theta(X \mid \mathcal{G}_0)$ that does not depend on $\theta$. The sub-sigma-field $\mathcal{G}_0$ is weakly sufficient for $\mathcal{P}_\Theta = \{P_\theta : \theta > 0\}$. In fact, a stronger assertion is possible. The sigma-field $\mathcal{G}_0$ is generated by the map $T(\omega) = |\omega|$. For each $t$ we can take the conditional distribution $P_t$ as the probability measure that places mass 1/2 at $\pm t$. The statistic $T$ is strongly sufficient.
Now consider $X$ equal to the indicator of the half-line $H = [0, \infty)$. Could there be a $\mathcal{G}_1$-measurable function $X_1$ for which $P_\theta(Xg_1) = P_\theta(X_1g_1)$ for all $g_1$ in $M^+(\mathcal{G}_1)$? If such an $X_1$ existed we would have

$\tfrac{1}{2}g_1(\theta) = P_\theta(X_1(\omega)g_1(\omega)) = \tfrac{1}{2}X_1(\theta)g_1(\theta) + \tfrac{1}{2}X_1(-\theta)g_1(-\theta)$  for all $\theta > 0$.

For $\theta$ in $H \cap S$, take $g_1$ as the indicator function of the singleton set $\{\theta\}$, and then as the indicator function of the singleton set $\{-\theta\}$, to deduce that $X_1(\theta) = 1$ and $X_1(-\theta) = 0$. For $\theta$ in $H \backslash S$ take $g_1 \equiv 1$ to deduce (via the $\mathcal{G}_1$-measurability property $X_1(\theta) = X_1(-\theta)$) that $X_1(\theta) = X_1(-\theta) = 1/2$. That is, $\{X_1 = 1\} = H \cap S$, which would contradict the Borel measurability of $X_1$ and the nonmeasurability of $S$. There can be no $\mathcal{G}_1$-measurable function $X_1$ that is a version of $P_\theta(X \mid \mathcal{G}_1)$ for all $\theta$. The sub-sigma-field $\mathcal{G}_1$ is not weakly sufficient for $\mathcal{P}_\Theta$.
REMARK. The failure of weak sufficiency for $\mathcal{G}_1$ is due to the fact that the family $\mathcal{P}_\Theta$ is not dominated: that is, there exists no sigma-finite measure dominating each $P_\theta$. Problem [17] shows that the failure cannot occur for dominated families.

The concepts of strong and weak sufficiency coincide when $\Omega$ is a finite set and $P_\theta\{\omega\} > 0$ for each $\omega$ in $\Omega$ and each $\theta$ in $\Theta$. If $\{\omega : T\omega = t\}$ is nonempty then the sum $g(t, \theta) := \sum_{\omega'} \{T(\omega') = t\} P_\theta\{\omega'\}$ is nonzero and

$P_\theta\{\omega \mid T = t\} = P_\theta\{\omega\}/g(t, \theta)$  if $T(\omega) = t$, and $0$ otherwise.

If $T$ is sufficient, the left-hand side is a function of $\omega$ and $t$ alone. If we write it as $H(\omega, t)$, then sufficiency implies

$P_\theta\{\omega\} = g(t, \theta)H(\omega, t)$  if $T(\omega) = t$,

or, more succinctly, $P_\theta\{\omega\} = g(T(\omega), \theta)h(\omega)$ where $h(\omega) := H(\omega, T(\omega))$. Conversely, if $P_\theta\{\omega\}$ has such a factorization then, for $\omega$ with $T(\omega) = t$,

$P_\theta\{\omega \mid T = t\} = \dfrac{g(t, \theta)h(\omega)}{\sum_{\omega'} \{T(\omega') = t\}\, g(t, \theta)h(\omega')} = \dfrac{h(\omega)}{\sum_{\omega'} \{T(\omega') = t\}\, h(\omega')},$

which does not depend on $\theta$. Thus $T$ is sufficient if $P_\theta\{\omega\}$ factorizes in the way shown.
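A concrete finite instance (my illustration, not from the text): for $n$ independent Bernoulli($\theta$) trials with $T$ the number of successes, $P_\theta\{\omega\} = \theta^{T(\omega)}(1 - \theta)^{n - T(\omega)}$ factorizes with $h \equiv 1$, and the conditional distribution given $T = t$ is uniform over the arrangements, whatever $\theta$.

```python
# Sketch (not from the text): the factorization criterion for Bernoulli
# trials.  Conditionally on T = t, every arrangement has probability
# 1/binomial(n, t), free of theta.
from itertools import product

n, t = 4, 2
for theta in (0.2, 0.7):
    p = {w: theta ** sum(w) * (1 - theta) ** (n - sum(w))
         for w in product((0, 1), repeat=n)}        # P_theta{w}
    level = {w: pw for w, pw in p.items() if sum(w) == t}
    norm = sum(level.values())      # = g(t, theta) * C(n, t), since h = 1
    cond = {w: pw / norm for w, pw in level.items()}
    print(theta, sorted(set(round(c, 12) for c in cond.values())))  # [1/6]
```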
The factorization criterion also extends to more general settings. The proof of the analog for weak sufficiency, due to Halmos & Savage (1949), appears in the textbook of Lehmann (1959, Section 2.6). The corresponding proof for strong sufficiency is easier to recognize as a direct generalization of the argument for finite $\Omega$.
<31>

Example. Suppose each $P_\theta$ has a density $p(\omega, \theta)$ with respect to a sigma-finite measure $\lambda$, and that $\lambda$ has a $(T, \mu)$-disintegration $\{\lambda_t : t \in \mathcal{T}\}$. If the density factorizes, $p(\omega, \theta) = g(T\omega, \theta)h(\omega)$, then Theorem <12> will show that $T$ is strongly sufficient for $\mathcal{P}_\Theta$.
Ignoring problems related to division by zero, we would expect the conditional distribution $P_t$ to be dominated by $\lambda_t$, with density

$\dfrac{p(\omega, \theta)}{\lambda_t^\omega p(\omega, \theta)} = \dfrac{g(T\omega, \theta)h(\omega)}{\lambda_t^\omega (g(T\omega, \theta)h(\omega))}.$

For $\lambda_t$ almost all $\omega$, the $T\omega$ in the last numerator and denominator are both equal to $t$, allowing us to cancel out a $g(t, \theta)$ factor, leaving a conditional density that doesn't depend on $\theta$.
Let us try to be more careful about division by zero. Define

$p_t(\omega) := \dfrac{h(\omega)}{H(t)}\{0 < H(t) < \infty\}$  where $H(t) := \lambda_t h$.

A proof that $p_t$ is a conditional density for each $P_\theta$ requires careful handling of contributions from the sets where $H$ is zero or infinite. According to Theorem <12>, it suffices to show, for each fixed $\theta$, that

$q(t, \theta)p_t(\omega) = p(\omega, \theta)$  a.e. $[\mu \otimes \lambda]$,

where

$q(t, \theta) := \lambda_t^\omega(g(T\omega, \theta)h(\omega)) = g(t, \theta)H(t)$  a.e. $[\mu]$.

The aberrant negligible sets are allowed to depend on $\theta$. Because $T\omega = t$ a.e. $[\mu \otimes \lambda]$, the task reduces to showing that

$g(t, \theta)h(\omega)\{0 < H(t) < \infty\} = g(t, \theta)h(\omega)$  a.e. $[\mu \otimes \lambda]$.

We need to show that the contributions to the right-hand side from the sets where $H$ is zero or infinite vanish a.e. $[\mu \otimes \lambda]$. Equivalently, we need to show

$\mu^t \lambda_t^\omega \left( g(t, \theta)h(\omega)\{H(t) = 0 \text{ or } \infty\} \right) = 0.$

The $g(t, \theta)\{H(t) = 0 \text{ or } \infty\}$ factor slips outside the innermost integral with respect to $\lambda_t$, leaving $\mu^t(g(t, \theta)H(t)\{H(t) = 0 \text{ or } \infty\})$. Clearly the product $g(t, \theta)H(t)$ is zero on the set $\{H = 0\}$. That $g(t, \theta)$ must be zero almost everywhere on the set $\{H = \infty\}$ follows from the fact that $q(t, \theta)$ is a probability density: $\mu^t q(t, \theta) = 1 < \infty$, so $g(t, \theta)H(t)$ is $\mu$-integrable, which forces $g(t, \theta) = 0$ a.e. $[\mu]$ on $\{H = \infty\}$.
The strong sufficiency follows.


See Problem [18] for a converse, where strong sufficiency implies existence of a factorizable density.

<32>

Example. Let $P_\theta$ denote the uniform distribution on $[0, \theta]^2$, for $\theta > 0$. Let $T(x, y) := \max(x, y)$. As in Example <5>, the conditional probability distributions $P_t$ given $T$ are uniform on the sets $\{T = t\}$. That is, $T$ is a strongly sufficient statistic for the family $\{P_\theta : \theta > 0\}$. The same conclusion follows directly from the factorization criterion, because $P_\theta$ has density $p(x, y, \theta) = \theta^{-2}\{T(x, y) \le \theta\}$ with respect to Lebesgue measure on $(\mathbb{R}^+)^2$.
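A quick simulation (my sketch; $\theta$, $t$, and the window width are arbitrary) makes the sufficiency in Example <32> visible: conditioning on $T$ landing near $t$ spreads the point uniformly over the corresponding level set, for any $\theta$.

```python
# Sketch (not from the text): for the uniform distribution on [0, theta]^2,
# given T = max(x, y) near t the remaining coordinate is Uniform[0, t],
# whatever theta -- the conditional distribution is uniform on {T = t}.
import numpy as np

rng = np.random.default_rng(2)
theta, t, dt = 3.0, 1.0, 0.01
xy = rng.uniform(0, theta, size=(2_000_000, 2))
m = xy.max(axis=1)
sel = xy[(m > t) & (m < t + dt)]                # condition on T close to t
other = sel.min(axis=1)                         # position along the level set
print(other.mean(), t / 2)                      # mean ~ t/2
print(np.histogram(other, bins=4, range=(0, t))[0])  # roughly equal counts
```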

8. Problems
[1]

Suppose $g_1$ and $g_2$ are maps from $(\mathcal{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$, with product-measurable graphs. Define $f_i(x) := (x, g_i(x))$. Let $P_i := f_i(\mu_i)$, for probability measures $\mu_i$ on $\mathcal{A}$. Let $P := \alpha_1 P_1 + \alpha_2 P_2$, for constants $\alpha_i > 0$ with $\alpha_1 + \alpha_2 = 1$. Let $X$ denote the coordinate map onto the $\mathcal{X}$ space. Show that the conditional probability distribution $P_x$ concentrates on the points $(x, g_1(x))$ and $(x, g_2(x))$. Find the conditional probabilities assigned to each point. Hint: Consider the density of $\mu_i$ with respect to $\mu_1 + \mu_2$.

[2]

Let Q be a nonatomic probability measure on S(R) with corresponding distribution


function F. Let Q = Qn, the joint distribution of n independent observations
JCI,..., xn from Q. Define T(x\,..., xn) := maxf<n JC,.
(i) Show that T has distribution v, defined by P(-OO, f] = F(t)n.
(ii) For each fixed t M, show that
v(-oo, t] = ] T \ ^ Q{*, < / , . . . , JC < r, T = jc,}
Q{

r,^2 < xuX3 <x\9...9xH

<xi)

Deduce that v has density nFn~~l with respect to Q.


(iii) Write Qt for the distribution of x\ conditional on JCI < t. That is,
G^ (g(x){x < t}) /F(t). Write r for the probability measure degenerate at the
ponit t. Show that the conditional distribution Q,, for Q given T = t, equals
That is, to generate a sample JCI, . . . , x n from Q,, select an JC, to equal r (with
probability n~l for each 0* then generate the remaining n - 1 observations
independently from the conditional distribution Qt.
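For readers who like to see such identities numerically before proving them, a sketch (my illustration, with $Q$ taken as the Exponential(1) distribution and arbitrary sample sizes) checks part (i):

```python
# Numerical check (not from the text) for Problem [2](i): with Q the
# Exponential(1) distribution, T = max of n draws has P{T <= t} = F(t)^n.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 200_000
T = rng.exponential(size=(reps, n)).max(axis=1)
F = lambda t: 1 - np.exp(-t)                 # distribution function of Q
for t in (0.5, 1.0, 2.0):
    print((T <= t).mean(), F(t) ** n)        # empirical vs F(t)^n
```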
[3]

Show that a $(T, \mu)$-disintegration $\lambda = \{\lambda_t : t \in \mathcal{T}\}$ of a sigma-finite measure $\lambda$ can be chosen as a probability kernel if and only if the sigma-finite measure $\mu$ is equal to the image measure $T\lambda$. Hint: Write $\ell(t)$ for the total mass $\lambda_t 1$. Show that $(T\lambda)g = \mu^t(g(t)\ell(t))$ for all $g$ in $M^+(\mathcal{B})$. Prove that the right-hand side reduces to $\mu g$ for all $g$ if and only if $\ell(t) = 1$ for $\mu$ almost all $t$.

[4]
For families of probability measures $\mathcal{P}$ and $\mathcal{Q}$ defined on the same sigma-field, define $\alpha_2(\mathcal{P}, \mathcal{Q})$ to equal $\sup\{\alpha_2(P, Q) : P \in \mathcal{P}, Q \in \mathcal{Q}\}$, where $\alpha_2$ denotes the Hellinger affinity, as in Section 3.3. Write $\mathcal{P} \otimes \mathcal{P}$ for $\{P_1 \otimes P_2 : P_i \in \mathcal{P}\}$, and $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ for its convex hull. Show that $\alpha_2(\mathrm{co}(\mathcal{P} \otimes \mathcal{P}), \mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})) \le \alpha_2(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q}))^2$ by the following steps. Write $\Delta$ for $\alpha_2(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q}))$.
(i) For given $P = \sum_i \beta_i P_{1i} \otimes P_{2i}$ in $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ and $Q = \sum_j \gamma_j Q_{1j} \otimes Q_{2j}$ in $\mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})$, let $\lambda$ be any probability measure dominating all the component measures, with corresponding densities $p_{1i}(x)$, and so on. Show that $P$ has density $p(x, y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)$ with respect to $\lambda \otimes \lambda$, with a similar expression for the density $q(x, y)$ of $Q$.
(ii) Write $X$ for the projection map of $\mathcal{X} \times \mathcal{X}$ onto its first coordinate. Show that $XP$ has density $\bar{p}(x) := \sum_i \beta_i p_{1i}(x)$ with respect to $\lambda$, with a similar expression for the density $\bar{q}(x)$ of $XQ$. Deduce that $XP \in \mathrm{co}(\mathcal{P})$ and $XQ \in \mathrm{co}(\mathcal{Q})$.
(iii) Define $p_x(y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)/\bar{p}(x)$ on the set $\{x : \bar{p}(x) > 0\}$. Define $q_x(y)$ analogously. Show that $\lambda^y \sqrt{p_x(y)q_x(y)} \le \Delta$ for all $x$.
(iv) Deduce that $\alpha_2(P, Q) \le \Delta^2$.
[5]

Let $P$ be Lebesgue measure on $\mathcal{B}$, the Borel sigma-field of $[0, 1]$. Let $\mathcal{G}$ denote the sigma-field generated by all the singletons in $[0, 1]$.
(i) Show that each member of $\mathcal{G}$ has probability either zero or one.
(ii) Deduce that $P(X \mid \mathcal{G}) = PX$ for each $X$ in $M^+(\mathcal{B})$.
(iii) Show for each Borel measurable $X$ that $X(\omega)$ is uniquely determined once we know the values of all $\mathcal{G}$-measurable random variables.

[6]

Let $X$ be an integrable random variable, and $Z$ be a $\mathcal{G}$-measurable random variable for which $XZ$ is integrable. Show that $P(XZ) = P(YZ)$ where $Y := P(X \mid \mathcal{G})$.

[7]

Suppose $X_n \in M^+(\mathcal{F})$. Show that $P(\liminf_n X_n \mid \mathcal{G}) \le \liminf_n P(X_n \mid \mathcal{G})$ almost surely, for each sub-sigma-field $\mathcal{G}$. Hint: Derive from the conditional form of Monotone Convergence. Imitate the proof of Fatou's Lemma.

[8]

Suppose $X_n \to X$ almost surely, with $|X_n| \le H$ for an integrable $H$. Show that $P(X_n \mid \mathcal{G}) \to P(X \mid \mathcal{G})$ almost surely. Hint: Derive from the conditional form of Fatou's Lemma (Problem [7]). Imitate the proof of Dominated Convergence.

[9]

Suppose $PX^2 < \infty$. Show that $\mathrm{var}(X) = \mathrm{var}(P(X \mid \mathcal{G})) + P(\mathrm{var}(X \mid \mathcal{G}))$.

[10]

Let $X$ be an integrable random variable on a probability space $(\Omega, \mathcal{F}, P)$. Let $\mathcal{G}$ be a sub-sigma-field of $\mathcal{F}$ containing all $P$-negligible sets. Show that $X$ is $\mathcal{G}$-measurable if and only if $P(XW) = 0$ for every bounded random variable $W$ with $P(W \mid \mathcal{G}) = 0$ almost surely. (Compare with the corresponding statement for random variables that are square integrable: $Z \in L^2(\mathcal{G})$ if and only if it is orthogonal to every square integrable $W$ that is orthogonal to $L^2(\mathcal{G})$.) Hint: For a fixed real $t$ with $P\{X = t\} = 0$ define $Z_t := P(\{X > t\} \mid \mathcal{G})$ and $W_t := \{X > t\} - Z_t$. Show that $(X - t)W_t \ge 0$ almost surely, but $P((X - t)W_t) = 0$. Deduce that $W_t = 0$ almost surely, and hence $\{X > t\}$ is $\mathcal{G}$-measurable.

[11] Let $S_0, S_1, \ldots, S_N$ be random vectors, and $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_N \subseteq \mathcal{F}$ be sigma-fields such that $S_i$ is $\mathcal{F}_i$-measurable for each $i$. Let $\epsilon$, $\alpha$, and $\beta$ be positive numbers for which

$\beta P\{|S_N - S_i| \le (1 - \alpha)|S_i| \mid \mathcal{F}_i\} \ge \{|S_i| > \epsilon\}$  almost surely.

Show that $P\{\max_i |S_i| > \epsilon\} \le \beta P\{|S_N| > \alpha\epsilon\}$. Hint: Let $\tau$ denote the first $i$ for which $|S_i| > \epsilon$. Show that $\{\tau = i\} \in \mathcal{F}_i$. Use the definition of conditional expectation to bound $P\{|S_i| > \epsilon, \tau = i\}$ by $\beta P\{|S_N| > \alpha\epsilon, \tau = i\}$.
[12]

Suppose $P$ is a probability measure on $(\mathcal{X}, \mathcal{A})$ with density $p(x)$ with respect to a probability measure $\lambda$. Let $Q$ equal the image $TP$, and $\mu$ equal $T\lambda$, for a measurable map $T$ into $(\mathcal{T}, \mathcal{B})$. Without assuming existence of disintegrations or conditional distributions, show:
(i) $Q$ has density $q(t) := \lambda(p \mid T = t)$ with respect to $\mu$.
(ii) For $X \in M^+(\mathcal{X})$, show that $P(X \mid T = t)$, the Kolmogorov conditional expectation, is given by $\{0 < q(t) < \infty\}\lambda(Xp \mid T = t)/q(t)$, up to a $Q$-equivalence.

[13] For each convex function $\psi$ on the real line there exists a countable family of linear functions for which $\psi(x) = \sup_{i \in \mathbb{N}}(a_i + b_i x)$ for all $x$ (see Appendix C). Use this representation to prove the conditional form of Jensen's inequality: if $X$ and $\psi(X)$ are both integrable then $P(\psi(X) \mid \mathcal{G}) \ge \psi(P(X \mid \mathcal{G}))$ almost surely. Hint: For each $i$ argue that $P(\psi(X) \mid \mathcal{G}) \ge a_i + b_i P(X \mid \mathcal{G})$ almost surely. Question: Is integrability of $\psi(X)$ really needed?
[14] Let $P$ be a probability measure on $\mathcal{B}(\mathbb{R}^2)$ that is invariant under rotations $S_\theta$ for a dense set $\Theta_0$ of angles. Show that it is invariant under all rotations. Hint: For bounded, continuous $f$, if $\theta_i \to \theta$ then $P^x f(S_{\theta_i}x) \to P^x f(S_\theta x)$.
[15] Find an invariance argument that produces the conditional distributions from Exercise <5>. Hint: Consider transformations $g_\theta$ that map $(x, y)$ into $(x, y_\theta)$, where $y_\theta = y + \theta$ if $y + \theta \le x$, and $y_\theta = y + \theta - x$ otherwise.
[16] Let $\mathcal{Q}$ be a family of probability measures on a sigma-field $\mathcal{A}$ dominated by a fixed sigma-finite measure $\lambda$. That is, each $Q$ in $\mathcal{Q}$ has a density $\delta_Q$ with respect to $\lambda$. Show that $\mathcal{Q}$ is also dominated by a probability measure of the form $\nu = \sum_{j=1}^\infty 2^{-j} Q_j$ with $Q_j \in \mathcal{Q}$ for each $j$, by following these steps.
(i) Show that there is no loss of generality in assuming that $\lambda$ is a finite measure. Hint: Express $\lambda$ as a sum of finite measures $\sum_i \lambda_i$, then choose positive constants $\alpha_i$ so that $\sum_i \alpha_i \lambda_i$ has finite total mass.
(ii) For each countable subfamily $\mathcal{S}$ of $\mathcal{Q}$ define $L(\mathcal{S}) := \lambda(\cup_{Q \in \mathcal{S}}\{\delta_Q > 0\})$. Define $L := \sup\{L(\mathcal{S}) : \mathcal{Q} \supseteq \mathcal{S} \text{ countable}\}$. Find countable subsets $\mathcal{S}_n$ for which $L(\mathcal{S}_n) \uparrow L$. Show that $L = L(\mathcal{S}^*)$, where $\mathcal{S}^* := \cup_n \mathcal{S}_n$.
(iii) Write $\mathcal{X}_0$ for $\cup_{Q \in \mathcal{S}^*}\{\delta_Q > 0\}$. For each $Q_0$ in $\mathcal{Q}$, show that $\lambda(\{\delta_{Q_0} > 0\}\backslash\mathcal{X}_0) = 0$. Hint: $L(\mathcal{S}^* \cup \{Q_0\}) \le L(\mathcal{S}^*)$.
(iv) Enumerate $\mathcal{S}^*$ as $\{Q_j : j \in \mathbb{N}\}$. Define $\nu := \sum_{j=1}^\infty 2^{-j}Q_j$. If $f \in M^+$ and $\nu f = 0$, show that $\lambda(f\mathcal{X}_0) = 0$. Deduce that $Q_0 f = 0$ for all $Q_0 \in \mathcal{Q}$. That is, $\nu$ dominates $\mathcal{Q}$.

[17] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures on $\mathcal{F}$, dominated by a sigma-finite measure $\lambda$. Suppose $\mathcal{G}_0$ is a weakly sufficient sub-sigma-field, and $\mathcal{G}_0 \subseteq \mathcal{G}_1$ for another sub-sigma-field $\mathcal{G}_1 \subseteq \mathcal{F}$. Show that $\mathcal{G}_1$ is also weakly sufficient, by following these steps.
(i) With no loss of generality (Problem [16]), suppose $\lambda = \sum_{j \in \mathbb{N}} 2^{-j} P_{\theta_j}$. Write $f_\theta$ for the ($\mathcal{G}_0$-measurable) density of $P_\theta|_{\mathcal{G}_0}$ with respect to $\lambda|_{\mathcal{G}_0}$. For an $X$ in $M^+(\mathcal{F})$, write $X_0$ for the version of $P_\theta(X \mid \mathcal{G}_0)$ that doesn't depend on $\theta$. Show that $X_0$ is also a version of $\lambda(X \mid \mathcal{G}_0)$. Deduce that $P_\theta X = P_\theta X_0 = \lambda(f_\theta X_0) = \lambda(f_\theta X)$, that is, $f_\theta$ is also the density of $P_\theta$ with respect to $\lambda$, as measures on $\mathcal{F}$.
(ii) For an $X$ in $M^+(\mathcal{F})$, write $X_1$ for $\lambda(X \mid \mathcal{G}_1)$. For $g_1 \in M^+(\mathcal{G}_1)$, show that $P_\theta(g_1X) = \lambda(f_\theta g_1 X) = \lambda(f_\theta g_1 X_1) = P_\theta(g_1X_1)$. Deduce that $X_1$ is a version of $P_\theta(X \mid \mathcal{G}_1)$ that doesn't depend on $\theta$.

[18] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures dominated by a sigma-finite measure $\lambda$. Suppose $T$ is a strongly sufficient statistic for $\mathcal{P}_\Theta$, meaning that there exists a probability kernel $\{P_t : t \in \mathcal{T}\}$ that is a conditional distribution for each $P_\theta$ given $T$. Show that there exist versions of the densities $dP_\theta/d\lambda$ of the form $g(T\omega, \theta)h(\omega)$, by following these steps.
(i) From Problem [16], there exists a dominating probability measure for $\mathcal{P}_\Theta$ of the form $P = \sum_i 2^{-i}P_{\theta_i}$, for some countable subfamily $\{P_{\theta_i}\}$ of $\mathcal{P}_\Theta$. Show that $\{P_t : t \in \mathcal{T}\}$ is also a conditional distribution for $P$ given $T$.
(ii) By part (i) of Theorem <12>, $Q_\theta := TP_\theta$ is dominated by $Q := TP$. Write $g(t, \theta)$ for some choice of the density $dQ_\theta/dQ$. Show that $P_\theta f = P^\omega(g(T\omega, \theta)f(\omega))$ for $f \in M^+(\Omega)$.
(iii) Deduce that

$\dfrac{dP_\theta}{d\lambda}(\omega) = g(T\omega, \theta)\dfrac{dP}{d\lambda}(\omega)$  a.e. $[\lambda]$,

in the sense that the right-hand side can be taken as a density for $P_\theta$.
[19] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures dominated by a sigma-finite measure $\lambda$. Suppose $T$ is a strongly sufficient statistic for $\mathcal{P}_\Theta$. Suppose $T^*$ is another statistic, taking values in a set $\mathcal{T}^*$ equipped with a countably generated sigma-field $\mathcal{B}^*$ containing the singletons, such that $\sigma(T) \subseteq \sigma(T^*)$. Show that $T$ can be written as a measurable function of $T^*$. Deduce via the factorization criterion that $T^*$ is also strongly sufficient.

[20] Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space, and let $\mathbb{P}$ equal $P^n$, the $n$-fold product measure on $\mathcal{A}^n$. For each $x$ in $\mathcal{X}$, let $\delta(x)$ denote the point mass at $x$. For $x$ in $\mathcal{X}^n$, let $T(x)$ denote the so-called empirical measure, $n^{-1}\sum_{i \le n}\delta(x_i)$. Intuitively, if we are given the measure $T(x)$, the conditional distribution for $\mathbb{P}$ should give mass $1/n!$ to each of the $n!$ permutations of $(x_1, \ldots, x_n)$. Formalize this intuition by constructing a $(T, T\mathbb{P})$-disintegration for $\mathbb{P}$. Warning: Not as easy as it seems. What is the target space $(\mathcal{T}, \mathcal{B})$ for this problem? What happens if the empirical measure is not supported by $n$ distinct points? How should you define $\mathbb{P}_t$ when $t$ does not correspond to a possible realization of an empirical measure?

9. Notes

The idea of defining abstract conditional probabilities and expectations as Radon-Nikodym derivatives is due to Kolmogorov (1933, Chapter 5).
(Regular) conditional distributions have a more complicated history. Loève (1978, Section 30.2) mentioned that the problem of existence was "investigated principally by Doob," but he cited no specific reference. Doob (1953, page 624) cited a counterexample to the unrestricted existence of a regular conditional probability, which also appears in the exercises to Section 48 of the 1969 printing of Halmos (1950). Doob's remarks suggest that the original edition of the Halmos book contained a slightly weaker form of the counterexample. Doob also noted that the counterexample destroyed a claim made in Doob (1938), an error pointed out by Dieudonné (no citation) and Andersen & Jessen (1948). Blackwell (1956) cited Dieudonné (1948) as the source of a counterexample for unrestricted existence of regular conditional probabilities. Blackwell also proved existence of regular conditional distributions for (what are now known as) Blackwell spaces.
In point process theory, disintegrations appear as Palm distributions: conditional distributions given a point of the process at a particular position (Kallenberg 1969). Pfanzagl (1979) gave conditions under which a regular conditional distribution can be obtained by means of the elementary limit of ratios of probabilities. The existence of limits of carefully chosen ratios can also be established by martingale methods (see Chapter 6).
The Barndorff-Nielsen, Blaesild & Eriksen (1989) book contains much material
on the invariance properties of conditional distributions.
Halmos & Savage (1949) cited a 1935 paper by Neyman, which I have not seen, as the source of the factorization criterion for (weak) sufficiency. See Bahadur (1954) for a detailed discussion of sufficiency. Example <30> is based on a meatier counterexample of Burkholder (1961). The traditional notion of sufficiency (what I have called weak sufficiency) has strange behavior for undominated families.
I learned much about the subtleties of conditioning while working on the paper
Chang & Pollard (1997), where we explored quite a range of statistical applications.
The result in Problem [4] is taken from Le Cam (1973) (see also Le Cam 1986,
Section 16.4), who used it to establish asymptotic results in the theory of estimation
and testing. Donoho & Liu (1991) adapted Le Cam's ideas to establish results about
achievability of lower bounds for minimax rates of convergence of estimators.

REFERENCES

Andersen, E. S. & Jessen, B. (1948), 'On the introduction of measures in infinite product sets', Danske Vid. Selsk. Mat.-Fys. Medd.
Baddeley, A. (1977), 'Integrals on a moving manifold and geometrical probability',
Advances in Applied Probability 9, 588-603.
Bahadur, R. R. (1954), 'Sufficiency and statistical decision functions', Annals of
Mathematical Statistics 25, 423-462.
Barndorff-Nielsen, O. E., Blaesild, P. & Eriksen, P. S. (1989), Decomposition and
Invariance of Measures, and Statistical Transformation Models, Vol. 58 of Springer
Lecture Notes in Statistics, Springer-Verlag, New York.
Blackwell, D. (1956), On a class of probability spaces, in J. Neyman, ed., 'Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability', Vol. I, University of California Press, Berkeley, pp. 1-6.
Burkholder, D. L. (1961), 'Sufficiency in the undominated case', Annals of Mathematical Statistics 32, 1191-1200.
Chang, J. & Pollard, D. (1997), 'Conditioning as disintegration', Statistica Neerlandica 51, 287-317.
Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland,
Amsterdam.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion)', Journal of the Royal Statistical Society, Series B 39, 1-38.
Dieudonné, J. (1948), 'Sur le théorème de Lebesgue-Nikodym, III', Ann. Univ. Grenoble 23, 25-53.
Donoho, D. L. & Liu, R. C. (1991), 'Geometrizing rates of convergence, II', Annals of Statistics 19, 633-667.
Doob, J. L. (1938), 'Stochastic processes with integral-valued parameter', Transactions of the American Mathematical Society 44, 87-150.
Doob, J. L. (1953), Stochastic Processes, Wiley, New York.
Eisenberg, B. & Sullivan, R. (2000), 'Crofton's differential equation', American
Mathematical Monthly pp. 129-139.
Halmos, P. R. (1950), Measure Theory, Van Nostrand, New York, NY. July 1969
reprinting.
Halmos, P. R. & Savage, L. J. (1949), 'Application of the Radon-Nikodym theorem to
the theory of sufficient statistics', Annals of Mathematical Statistics 20, 225-241.
Kallenberg, O. (1969), Random Measures, Akademie-Verlag, Berlin. US publisher:
Academic Press.
Kendall, M. G. & Moran, P. A. P. (1963), Geometric Probability, Griffin.
Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin. Second English Edition, Foundations of Probability 1950, published by Chelsea, New York.
Le Cam, L. (1973), 'Convergence of estimates under dimensionality restrictions',
Annals of Statistics 1, 38-53.

Le Cam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, New York.
Lehmann, E. L. (1959), Testing Statistical Hypotheses, Wiley, New York. Later
edition published by Chapman and Hall.
Loève, M. (1978), Probability Theory, Springer, New York. Fourth Edition, Part II.
Pachl, J. (1978), 'Disintegration and compact measures', Mathematica Scandinavica
43, 157-168.
Pfanzagl, J. (1979), 'Conditional distributions as derivatives', Annals of Probability
7, 1046-1050.
Solomon, H. (1978), Geometric Probability, NSF-CBMS Regional Conference Series
in Applied Mathematics, Society for Industrial and Applied Mathematics.

Chapter 6

Martingale et al.
SECTION 1 gives some examples of martingales, submartingales, and supermartingales.
SECTION 2 introduces stopping times and the sigma-fields corresponding to "information available at a random time." A most important Stopping Time Lemma is proved, extending the martingale properties to processes evaluated at stopping times.
SECTION 3 shows that positive supermartingales converge almost surely.
SECTION 4 presents a condition under which a submartingale can be written as a difference between a positive martingale and a positive supermartingale (the Krickeberg decomposition). A limit theorem for submartingales then follows.
SECTION *5 proves the Krickeberg decomposition.
SECTION *6 defines uniform integrability and shows how uniformly integrable martingales are particularly well behaved.
SECTION *7 shows that martingale theory works just as well when time is reversed.
SECTION *8 uses reverse martingale theory to study exchangeable probability measures on infinite product spaces. The de Finetti representation and the Hewitt-Savage zero-one law are proved.

1. What are they?


The theory of martingales (and submartingales and supermartingales and other
related concepts) has had a profound effect on modern probability theory. Whole
branches of probability, such as stochastic calculus, rest on martingale foundations.
The theory is elegant and powerful: amazing consequences flow from an innocuous
assumption regarding conditional expectations. Every serious user of probability
needs to know at least the rudiments of martingale theory.
A little notation goes a long way in martingale theory. A fixed probability
space (2, J, P) sits in the background. The key new ingredients are:
(i) a subset T of the extended real line R;
(ii) &filtration{7t : t e T), that is, a collection of sub-sigma-fields of 7 for
which <JS c 7t if s < t\

(iii) a family of integrable random variables {Xt : t e T] adapted to the filtration,


that is, Xt is J,-measurable for each f in T.

The set $T$ has the interpretation of time, the sigma-field $\mathcal{F}_t$ has the interpretation of information available at time $t$, and $X_t$ denotes some random quantity whose value $X_t(\omega)$ is revealed at time $t$.
<1>

Definition. A family of integrable random variables $\{X_t : t \in T\}$ adapted to a filtration $\{\mathcal{F}_t : t \in T\}$ is said to be a martingale (for that filtration) if

(MG)  $X_s = P(X_t \mid \mathcal{F}_s)$  almost surely, for all $s < t$.

Equivalently, the random variables should satisfy

(MG)'  $PX_sF = PX_tF$  for all $F \in \mathcal{F}_s$, all $s < t$.

REMARK. Often the filtration is fixed throughout an argument, or the particular choice of filtration is not important for some assertion about the random variables. In such cases it is easier to talk about a martingale $\{X_t : t \in T\}$ without explicit mention of that filtration. If in doubt, we could always work with the natural filtration, $\mathcal{F}_t := \sigma\{X_s : s \le t\}$, which takes care of adaptedness, by definition. Analogously, if there is a need to identify the filtration explicitly, it is convenient to speak of a martingale $\{(X_t, \mathcal{F}_t) : t \in T\}$, and so on.

Property (MG) has the interpretation that $X_s$ is the best predictor for $X_t$ based on the information available at time $s$. The equivalent formulation (MG)' is a minor repackaging of the definition of the conditional expectation $P(X_t \mid \mathcal{F}_s)$. The $\mathcal{F}_s$-measurability of $X_s$ comes as part of the adaptation assumption. Approximation by simple functions, and a passage to the limit, gives another equivalence,

(MG)''  $PX_sZ = PX_tZ$  for all $Z \in M_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$,

where $M_{\mathrm{bdd}}(\mathcal{F}_s)$ denotes the set of all bounded, $\mathcal{F}_s$-measurable random variables. The formulations (MG)' and (MG)'' have the advantage of removing the slippery concept of conditioning on sigma-fields from the definition of a martingale. One could develop much of the basic theory without explicit mention of conditioning, which would have some pedagogic advantages, even though it would obscure one of the important ideas behind the martingale concept.
Several of the desirable properties of martingales are shared by families of random variables for which the defining equalities (MG) and (MG)' are relaxed to inequalities. I find that one of the hardest things to remember about these martingale relatives is which name goes with which direction of the inequality.
<2>

Definition. A family of integrable random variables $\{X_t : t \in T\}$ adapted to a filtration $\{\mathcal{F}_t : t \in T\}$ is said to be a submartingale (for that filtration) if it satisfies any (and hence all) of the following equivalent conditions:

(subMG)  $X_s \le P(X_t \mid \mathcal{F}_s)$  almost surely, for all $s < t$;
(subMG)'  $PX_sF \le PX_tF$  for all $F \in \mathcal{F}_s$, all $s < t$;
(subMG)''  $PX_sZ \le PX_tZ$  for all $Z \in M^+_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$.

The family is said to be a supermartingale (for that filtration) if $\{-X_t : t \in T\}$ is a submartingale. That is, the analogous requirements (superMG), (superMG)', and (superMG)'' reverse the direction of the inequalities.

REMARK. It is largely a matter of taste, or convenience of notation for particular applications, whether one works primarily with submartingales or supermartingales.

For most of this Chapter, the index set $T$ will be discrete, either finite or equal to $\mathbb{N}$, the set of positive integers, or equal to one of

$\mathbb{N}_0 := \{0\} \cup \mathbb{N}$  or  $\bar{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$  or  $\bar{\mathbb{N}}_0 := \{0\} \cup \mathbb{N} \cup \{\infty\}$.

For some purposes it will be useful to have a distinctively labelled first or last element in the index set. For example, if a limit $X_\infty := \lim_{n \in \mathbb{N}} X_n$ can be shown to exist, it is natural to ask whether $\{X_n : n \in \bar{\mathbb{N}}\}$ also has sub- or supermartingale properties. Of course such a question only makes sense if a corresponding sigma-field $\mathcal{F}_\infty$ exists. If it is not otherwise defined, I will take $\mathcal{F}_\infty$ to be the sigma-field $\sigma(\cup_{i < \infty}\mathcal{F}_i)$.
Continuous time theory, where $T$ is a subinterval of $\mathbb{R}$, tends to be more complicated than discrete time. The difficulties arise, in part, from problems related to management of uncountable families of negligible sets associated with uncountable collections of almost sure equality or inequality assertions. A nontrivial part of the continuous time theory deals with sample path properties, that is, with the behavior of a process $X_t(\omega)$ as a function of $t$ for fixed $\omega$ or with properties of $X$ as a function of two variables. Such properties are typically derived from probabilistic assertions about finite or countable subfamilies of the $\{X_t\}$ random variables. An understanding of the discrete-time theory is an essential prerequisite for more ambitious undertakings in continuous time (see Appendix E).
For discrete time, the (MG)' property becomes

$PX_nF = PX_mF$  for all $F \in \mathcal{F}_n$, all $n < m$.

It suffices to check the equality for $m = n + 1$, with $n \in \mathbb{N}_0$, for then repeated appeals to the special case extend the equality to $m = n + 2$, then $m = n + 3$, and so on. A similar simplification applies to submartingales and supermartingales.
<3>

Example. Martingales generalize the theory for sums of independent random variables. Let $\xi_1, \xi_2, \ldots$ be independent, integrable random variables with $P\xi_n = 0$ for each $n$. Define $X_0 := 0$ and $X_n := \xi_1 + \ldots + \xi_n$. The sequence $\{X_n : n \in \mathbb{N}_0\}$ is a martingale with respect to the natural filtration, because for $F \in \mathcal{F}_{n-1}$,

$P(X_n - X_{n-1})F = (P\xi_n)(PF) = 0$  by independence.

You could write $F$ as a measurable function of $X_1, \ldots, X_{n-1}$, or of $\xi_1, \ldots, \xi_{n-1}$, if you prefer to work with random variables.
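A simulation sketch of the defining property (my illustration; the step distribution, horizon, and event $F$ are arbitrary): for the random walk of Example <3>, $PX_mF$ and $PX_nF$ agree for an $F$ determined by the first $n$ steps.

```python
# Sketch (not from the text) of (MG)' for a mean-zero random walk:
# P(X_m F) ~ P(X_n F) for F in F_n, here F = {X_n > 0} with m > n.
import numpy as np

rng = np.random.default_rng(4)
steps = rng.choice([-1.0, 1.0], size=(500_000, 10))   # P(xi_n) = 0
X = steps.cumsum(axis=1)
n, m = 4, 9
F = X[:, n] > 0                                        # an event in F_n
print((X[:, m] * F).mean(), (X[:, n] * F).mean())      # nearly equal
```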

<4>

Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale and let $\Psi$ be a convex function for which each $\Psi(X_n)$ is integrable. Then $\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale: the required almost sure inequality, $P(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi(X_{n-1})$, is a direct application of the conditional expectation form of Jensen's inequality.
The companion result for submartingales is: if the convex function $\Psi$ is increasing, if $\{X_n\}$ is a submartingale, and if each $\Psi(X_n)$ is integrable, then $\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale, because

$P(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi(P(X_n \mid \mathcal{F}_{n-1})) \ge \Psi(X_{n-1}).$
Two good examples to remember: if $\{X_n\}$ is a martingale and each $X_n$ is square integrable then $\{X_n^2\}$ is a submartingale; and if $\{X_n\}$ is a submartingale then $\{X_n^+\}$ is also a submartingale.
<5>

Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale written as a sum of increments, $X_n := X_0 + \xi_1 + \ldots + \xi_n$. Not surprisingly, the $\{\xi_i\}$ are called martingale differences. Each $\xi_n$ is integrable and $P(\xi_n \mid \mathcal{F}_{n-1}) = 0$ almost surely, for $n \in \mathbb{N}$.
A new martingale can be built by weighting the increments using predictable functions $\{H_n : n \in \mathbb{N}\}$, meaning that each $H_n$ should be an $\mathcal{F}_{n-1}$-measurable random variable, a more stringent requirement than adaptedness. The value of the weight becomes known before time $n$; it is known before it gets applied to the next increment. If we assume that each $H_n\xi_n$ is integrable then the sequence

$Z_n := X_0 + H_1\xi_1 + \ldots + H_n\xi_n$

is both integrable and adapted. It is a martingale because, for $F \in \mathcal{F}_{n-1}$,

$P(Z_n - Z_{n-1})F = P(\xi_n H_n F),$

which equals zero by a simple generalization of (MG)''. (Use Dominated Convergence to accommodate integrable $Z$.) If $\{X_n : n \in \mathbb{N}_0\}$ is just a submartingale, a similar argument shows that the new sequence is also a submartingale, provided the predictable weights are also nonnegative.
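Example <5> is the precise statement that a betting system based only on the past cannot bias a fair game. In the sketch below (mine; the doubling rule is an arbitrary predictable choice), the weighted sums keep mean zero:

```python
# Sketch (not from the text): a predictable weighting H_n ("double the
# stake after each loss", computed from tosses 1..n-1 only) leaves the
# weighted sums Z_n a martingale, so P(Z_n) stays at 0.
import numpy as np

rng = np.random.default_rng(5)
xi = rng.choice([-1.0, 1.0], size=(400_000, 12))   # martingale differences
H = np.ones_like(xi)
for k in range(1, xi.shape[1]):
    H[:, k] = np.where(xi[:, k - 1] < 0, 2 * H[:, k - 1], 1.0)  # F_{k-1}-meas.
Z = (H * xi).cumsum(axis=1)
print(Z.mean(axis=0))            # every entry near 0, up to Monte Carlo error
```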
<6>

Example. Suppose $X$ is an integrable random variable and $\{\mathcal{F}_t : t \in T\}$ is a filtration. Define $X_t := P(X \mid \mathcal{F}_t)$. Then the family $\{X_t : t \in T\}$ is a martingale with respect to the filtration, because for $s < t$,

$P(X_tF) = P(XF)$  if $F \in \mathcal{F}_t$,
$P(X_sF) = P(XF)$  if $F \in \mathcal{F}_s$,

and $\mathcal{F}_s \subseteq \mathcal{F}_t$. (We have just reproved the formula for conditioning on nested sigma-fields.)
<7>

Example. Every sequence $\{X_n : n \in \mathbb{N}_0\}$ of integrable random variables adapted to a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ can be broken into a sum of a martingale plus a sequence of accumulated conditional expectations. To establish this fact, consider the increments $\xi_n := X_n - X_{n-1}$. Each is integrable, but it need not have zero conditional expectation given $\mathcal{F}_{n-1}$, the property that characterizes martingale differences. Extraction of the martingale component is merely a matter of recentering the increments to zero conditional expectations. Define $\eta_n := P(\xi_n \mid \mathcal{F}_{n-1})$ and

$M_n := X_0 + (\xi_1 - \eta_1) + \ldots + (\xi_n - \eta_n),$
$A_n := \eta_1 + \ldots + \eta_n.$

Then $X_n = M_n + A_n$, with $\{M_n\}$ a martingale and $\{A_n\}$ a predictable sequence.
Often $\{A_n\}$ will have some nice behavior, perhaps due to the smoothing involved in the taking of a conditional expectation, or perhaps due to some other special property of the $\{X_n\}$. For example, if $\{X_n\}$ were a submartingale the $\eta_n$ would all be nonnegative (almost surely) and $\{A_n\}$ would be an increasing sequence of random variables. Such properties are useful for establishing limit theory and inequalities; see Example <18> for an illustration of the general method.

REMARK.
The representation of a submartingale as a martingale plus an
increasing, predictable process is sometimes called the Doob decomposition. The
corresponding representation for continuous time, which is exceedingly difficult to
establish, is called the Doob-Meyer decomposition.
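The decomposition is directly computable in simple cases. A sketch (my illustration, not the text's): for the submartingale $X_n = S_n^2$ built from a simple random walk, $\eta_n = P(X_n - X_{n-1} \mid \mathcal{F}_{n-1}) = 1$, so $A_n = n$ and $M_n = S_n^2 - n$ is a martingale.

```python
# Sketch (not from the text) of the Doob decomposition for X_n = S_n^2 with
# S_n a simple random walk: compensator A_n = n, martingale M_n = S_n^2 - n.
import numpy as np

rng = np.random.default_rng(6)
eps = rng.choice([-1.0, 1.0], size=(300_000, 20))
S = eps.cumsum(axis=1)
M = S ** 2 - np.arange(1, 21)     # subtract A_n = n
print(M.mean(axis=0))             # all entries close to P(M_n) = 0
```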

2. Stopping times

The martingale property requires equalities $PX_sF = PX_tF$, for $s < t$ and $F \in \mathcal{F}_s$. Much of the power of the theory comes from the fact that analogous inequalities hold when $s$ and $t$ are replaced by certain types of random times. To make sense of the broader assertion, we need to define objects such as $\mathcal{F}_\tau$ and $X_\tau$ for random times $\tau$.

<8>

Definition. A random variable $\tau$ taking values in $\bar{T} := T \cup \{\infty\}$ is called a stopping time for a filtration $\{\mathcal{F}_t : t \in T\}$ if $\{\tau \le t\} \in \mathcal{F}_t$ for each $t$ in $T$.

In discrete time, with $T = \mathbb{N}_0$, the defining property is equivalent to

$\{\tau = n\} \in \mathcal{F}_n$  for each $n$ in $\mathbb{N}_0$,

because $\{\tau \le n\} = \cup_{i \le n}\{\tau = i\}$ and $\{\tau = n\} = \{\tau \le n\}\{\tau \le n - 1\}^c$.

<9>

Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be adapted to a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$, and let $B$ be a Borel subset of $\mathbb{R}$. Define $\tau(\omega) := \inf\{n : X_n(\omega) \in B\}$, with the interpretation that the infimum of the empty set equals $+\infty$. That is, $\tau(\omega) = +\infty$ if $X_n(\omega) \notin B$ for all $n$. The extended-real-valued random variable $\tau$ is a stopping time because

$\{\tau \le n\} = \cup_{i \le n}\{X_i \in B\} \in \mathcal{F}_n$  for $n \in \mathbb{N}_0$.

It is called the first hitting time of the set $B$. Do you see why it is convenient to allow stopping times to take the value $+\infty$?
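Computationally, a first hitting time is just a first-index search along each sample path, with some sentinel standing in for $+\infty$ on paths that never enter $B$. A minimal sketch (mine, with arbitrary walk and threshold):

```python
# Sketch (not from the text): first hitting time of B = [2, infinity) for a
# simple random walk; the horizon length stands in for tau = +infinity.
import numpy as np

rng = np.random.default_rng(8)
X = rng.choice([-1, 1], size=(8, 30)).cumsum(axis=1)
hit = X >= 2
tau = np.where(hit.any(axis=1), hit.argmax(axis=1), X.shape[1])
print(tau)            # one stopping-time value per sample path
```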
If $\mathcal{F}_t$ corresponds to the information available up to time $t$, how should we define a sigma-field $\mathcal{F}_\tau$ to correspond to information available up to a random time $\tau$? Intuitively, on the part of $\Omega$ where $\tau = i$ the sets in the sigma-field $\mathcal{F}_\tau$ should be the same as the sets in the sigma-field $\mathcal{F}_i$. That is, we could hope that

$\{F\{\tau = i\} : F \in \mathcal{F}_\tau\} = \{F\{\tau = i\} : F \in \mathcal{F}_i\}$  for each $i$.

These equalities would be suitable as a definition of $\mathcal{F}_\tau$ in discrete time; we could define $\mathcal{F}_\tau$ to consist of all those $F$ in $\mathcal{F}$ for which

<10>  $F\{\tau = i\} \in \mathcal{F}_i$  for all $i \in \mathbb{N}_0$.

For continuous time such a definition could become vacuous if all the sets $\{\tau = t\}$ were negligible, as sometimes happens. Instead, it is better to work with a definition that makes sense in both discrete and continuous time, and which is equivalent to <10> in discrete time.
<11>

Definition. Let $\tau$ be a stopping time for a filtration $\{\mathcal{F}_t : t \in T\}$, taking values in $\bar{T} := T \cup \{\infty\}$. If the sigma-field $\mathcal{F}_\infty$ is not already defined, take it to be $\sigma(\cup_{t \in T}\mathcal{F}_t)$. The pre-$\tau$ sigma-field $\mathcal{F}_\tau$ is defined to consist of all $F$ for which $F\{\tau \le t\} \in \mathcal{F}_t$ for all $t \in T$.
The class $\mathcal{F}_\tau$ would not be a sigma-field if $\tau$ were not a stopping time: the property $\Omega \in \mathcal{F}_\tau$ requires $\{\tau \le t\} \in \mathcal{F}_t$ for all $t$.

REMARK. Notice that $\mathcal{F}_\tau \subseteq \mathcal{F}_\infty$ (because $F\{\tau < \infty\} \in \mathcal{F}_\infty$ if $F \in \mathcal{F}_\tau$), with equality when $\tau \equiv \infty$. More generally, if $\tau$ takes a constant value, $t$, then $\mathcal{F}_\tau = \mathcal{F}_t$. It would be very awkward if we had to distinguish between random variables taking constant values and the constants themselves.
<12>

Example. The stopping time $\tau$ is measurable with respect to $\mathcal{F}_\tau$, because, for each $a \in \mathbb{R}^+$ and $t \in T$,

$\{\tau \le a\}\{\tau \le t\} = \{\tau \le a \wedge t\} \in \mathcal{F}_{a \wedge t} \subseteq \mathcal{F}_t.$

That is, $\{\tau \le a\} \in \mathcal{F}_\tau$ for all $a \in \mathbb{R}^+$, from which the $\mathcal{F}_\tau$-measurability follows by the usual generating class argument. It would be counterintuitive if the information corresponding to the sigma-field $\mathcal{F}_\tau$ did not include the value taken by $\tau$ itself.
<13>

Example. Suppose $\sigma$ and $\tau$ are both stopping times, for which $\sigma \le \tau$ always. Then $\mathcal{F}_\sigma \subseteq \mathcal{F}_\tau$ because

$F\{\tau \le t\} = (F\{\sigma \le t\})\{\tau \le t\}$  for all $t \in T$,

and both sets on the right-hand side are $\mathcal{F}_t$-measurable if $F \in \mathcal{F}_\sigma$.
and both sets on the right-hand side are 7t -measurable if F 5aExercise. Show that a random variable Z is !?>-measurable if and only if Z{x < t]
is ^-measurable for all t in T.
SOLUTION: For necessity, write Z as a pointwise limit of ^-measurable simple
functions Zn, then note that each Zn{x < t) is a linear combination of indicator
functions of Fr-measurable sets.
For sufficiency, it is enough to show that {Z > a] e 3> and {Z < a] e 9>, for
each a e M+. For the first requirement, note that {Z > a]{x <t} = {Z{x <t}> a},
which belongs to J, for each f, because Z{x < t] is assumed to be J,-measurable.
Thus ( Z > a ) e ? T . Argue similarly for the other requirement.
The definition of $X_\tau$ is almost straightforward. Given random variables $\{X_t : t \in T\}$ and a stopping time $\tau$, we should define $X_\tau$ as the function taking the value $X_t(\omega)$ when $\tau(\omega) = t$. If $\tau$ takes only values in $T$ there is no problem. However, a slight embarrassment would occur when $\tau(\omega) = \infty$ if $\infty$ were not a point of $T$, for then $X_\infty(\omega)$ need not be defined. In the happy situation when there is a natural candidate for $X_\infty$, the embarrassment disappears with little fuss; otherwise it is wiser to avoid the difficulty altogether by working only with the random variable $X_\tau\{\tau < \infty\}$, which takes the value zero when $\tau$ is infinite.
Measurability of $X_\tau\{\tau < \infty\}$, even with respect to the sigma-field $\mathcal{F}$, requires further assumptions about the $\{X_t\}$ for continuous time. For discrete time the task is much easier. For example, if $\{X_n : n \in \mathbb{N}_0\}$ is adapted to a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$, and $\tau$ is a stopping time for that filtration, then

$X_\tau\{\tau < \infty\}\{\tau \le t\} = \sum_{i \in \mathbb{N}_0} X_i\{\tau = i\}\{i \le t\}.$

For $i > t$ the $i$th summand is zero; for $i \le t$ it equals $X_i\{\tau = i\}$, which is $\mathcal{F}_i$-measurable. The $\mathcal{F}_\tau$-measurability of $X_\tau\{\tau < \infty\}$ then follows by Exercise <14>.


The next Exercise illustrates the use of stopping times and the $\sigma$-fields they define. The discussion does not directly involve martingales, but they are lurking in the background.
<15>
Exercise. A deck of 52 cards (26 red, 26 black) is dealt out one card at a time, face up. Once, and only once, you will be allowed to predict that the next card will be red. What strategy will maximize your probability of predicting correctly?

SOLUTION: Write $R_i$ for the event $\{i\text{th card is red}\}$. Assume all permutations of the deck are equally likely initially. Write $\mathcal{F}_n$ for the $\sigma$-field generated by $R_1, \ldots, R_n$. A strategy corresponds to a stopping time $\tau$ that takes values in $\{0, 1, \ldots, 51\}$: you should try to maximize $\mathbb{P}R_{\tau+1}$.

Surprisingly, $\mathbb{P}R_{\tau+1} = 1/2$ for all such stopping rules. The intuitive explanation is that you should always be indifferent, given that you have observed cards $1, 2, \ldots, \tau$, between choosing card $\tau+1$ or choosing card 52. That is, it should be true that $\mathbb{P}(R_{\tau+1} \mid \mathcal{F}_\tau) = \mathbb{P}(R_{52} \mid \mathcal{F}_\tau)$ almost surely; or, equivalently, that $\mathbb{P}R_{\tau+1}F = \mathbb{P}R_{52}F$ for all $F \in \mathcal{F}_\tau$; or, equivalently, that
$$\mathbb{P}R_{k+1}F\{\tau = k\} = \mathbb{P}R_{52}F\{\tau = k\} \qquad\text{for all } F \in \mathcal{F}_\tau \text{ and } k = 0, 1, \ldots, 51.$$
We could then deduce that $\mathbb{P}R_{\tau+1} = \mathbb{P}R_{52} = 1/2$. Of course, we only need the case $F = \Omega$, but I'll carry along the general $F$ as an illustration of technique while proving the assertion in the last display. By definition of $\mathcal{F}_\tau$,
$$F\{\tau = k\} = \left(F\{\tau \le k\}\right)\{\tau \le k-1\}^c \in \mathcal{F}_k.$$
That is, $F\{\tau = k\}$ must be of the form $\{(R_1, \ldots, R_k) \in B\}$ for some Borel subset $B$ of $\mathbb{R}^k$. Symmetry of the joint distribution of $R_1, \ldots, R_{52}$ implies that the random vector $(R_1, \ldots, R_k, R_{k+1})$ has the same distribution as the random vector $(R_1, \ldots, R_k, R_{52})$, whence
$$\mathbb{P}R_{k+1}\{(R_1, \ldots, R_k) \in B\} = \mathbb{P}R_{52}\{(R_1, \ldots, R_k) \in B\}.$$
See Section 8 for more about symmetry and martingale properties. □
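REMARK. A quick Monte Carlo check of this surprising conclusion may be reassuring. The following Python sketch (not from the text; the two stopping rules are arbitrary illustrations) estimates $\mathbb{P}R_{\tau+1}$ for any rule that decides, from the colors seen so far, when to predict:

    import random

    def play(strategy, trials=100_000):
        """Estimate P(next card is red) when `strategy` decides, from the
        colors seen so far, whether to predict on the next card."""
        wins = 0
        for _ in range(trials):
            deck = [1] * 26 + [0] * 26          # 1 = red, 0 = black
            random.shuffle(deck)
            k = 0                                # cards already dealt
            while k < 51 and not strategy(deck[:k]):
                k += 1
            wins += deck[k]                      # predict that card k+1 is red
        return wins / trials

    wait_to_end = lambda seen: False                     # always predict on card 52
    red_rich = lambda seen: 2 * sum(seen) < len(seen)    # predict once more blacks than reds seen
    print(play(wait_to_end), play(red_rich))             # both close to 0.5

Both estimates hover near $1/2$, no matter how cleverly the rule exploits the counts of cards already seen.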


The hidden martingale in the previous Exercise is $X_n$, the proportion of red cards remaining in the deck after $n$ cards have been dealt. You could check the martingale property by first verifying that $\mathbb{P}(R_{n+1} \mid \mathcal{F}_n) = X_n$ (an equality that is obvious if one thinks in terms of conditional distributions), then calculating
$$(52-n-1)\mathbb{P}(X_{n+1} \mid \mathcal{F}_n) = \mathbb{P}\left((52-n)X_n - R_{n+1} \mid \mathcal{F}_n\right) = (52-n)X_n - \mathbb{P}(R_{n+1} \mid \mathcal{F}_n) = (52-n-1)X_n.$$
The problem then asks for the stopping time to maximize
$$\mathbb{P}R_{\tau+1} = \sum_i \mathbb{P}R_{i+1}\{\tau = i\} = \sum_i \mathbb{P}X_i\{\tau = i\} = \mathbb{P}X_\tau \qquad\text{because } \{\tau = i\} \in \mathcal{F}_i.$$
The martingale property tells us that $\mathbb{P}X_0 = \mathbb{P}X_i$ for $i = 1, \ldots, 51$. If we could extend the equality to random $i$, by showing that $\mathbb{P}X_\tau = \mathbb{P}X_0$, then the surprising conclusion from the Exercise would follow.
Clearly it would be useful if we could always assert that $\mathbb{P}X_\sigma = \mathbb{P}X_\tau$ for every martingale, and every pair of stopping times. Unfortunately (or should I say fortunately?) the result is not true without some extra assumptions. The simplest and most useful case concerns finite time sets. If $\sigma$ takes values in a finite set $T$, and if each $X_t$ is integrable, then $|X_\sigma| \le \sum_{t \in T}|X_t|$, which eliminates any integrability difficulties. For an infinite index set, the integrability of $X_\sigma$ is not automatic.
<16>
Stopping Time Lemma. Suppose $\sigma$ and $\tau$ are stopping times for a filtration $\{\mathcal{F}_t : t \in T\}$, with $T$ finite. Suppose both stopping times take only values in $T$. Let $F$ be a set in $\mathcal{F}_\sigma$ for which $\sigma(\omega) \le \tau(\omega)$ when $\omega \in F$. If $\{X_t : t \in T\}$ is a submartingale, then $\mathbb{P}X_\sigma F \le \mathbb{P}X_\tau F$. For supermartingales, the inequality is reversed. For martingales, the inequality becomes an equality.

Proof. Consider only the submartingale case. For simplicity of notation, suppose $T = \{0, 1, \ldots, N\}$. Write each $X_n$ as a sum of increments, $X_n = X_0 + \xi_1 + \ldots + \xi_n$, where $\xi_i := X_i - X_{i-1}$. The inequality $\sigma \le \tau$, on $F$, lets us write
$$X_\tau F - X_\sigma F = \left(X_0F + \sum_{1 \le i \le N}\{i \le \tau\}F\xi_i\right) - \left(X_0F + \sum_{1 \le i \le N}\{i \le \sigma\}F\xi_i\right) = \sum_{1 \le i \le N}\{\sigma < i \le \tau\}F\xi_i.$$
Note that $\{\sigma < i \le \tau\}F = \left(\{\sigma \le i-1\}F\right)\{\tau \le i-1\}^c \in \mathcal{F}_{i-1}$. The expected value of each summand is nonnegative, by (subMG)'. □

REMARK. If $\sigma \le \tau$ everywhere, the inequality for all $F$ in $\mathcal{F}_\sigma$ implies that $X_\sigma \le \mathbb{P}(X_\tau \mid \mathcal{F}_\sigma)$ almost surely. That is, the submartingale (or martingale, or supermartingale) property is preserved at bounded stopping times.
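REMARK. The equality $\mathbb{P}X_\sigma = \mathbb{P}X_\tau$ for a martingale with bounded stopping times $\sigma \le \tau$ is easy to confirm numerically. A small Python sketch (an illustration with assumed hitting-time rules, not from the text), using a simple symmetric random walk:

    import random

    N = 10
    def walk():                       # one path X_0, ..., X_N of a symmetric walk
        x, path = 0, [0]
        for _ in range(N):
            x += random.choice((-1, 1))
            path.append(x)
        return path

    def sigma(path):                  # first hit of -2, else N: a bounded stopping time
        return next((i for i, x in enumerate(path) if x == -2), N)

    def tau(path):                    # first hit of +3 at or after sigma, else N
        s = sigma(path)
        return next((i for i in range(s, N + 1) if path[i] == 3), N)

    trials, s_sum, t_sum = 200_000, 0.0, 0.0
    for _ in range(trials):
        p = walk()
        s_sum += p[sigma(p)]
        t_sum += p[tau(p)]
    print(s_sum / trials, t_sum / trials)   # both near 0 = PX_0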

The Stopping Time Lemma, and its extensions to various cases with infinite index sets, is basic to many of the most elegant martingale properties. Results for a general stopping time $\tau$, taking values in $\mathbb{N}$ or $\mathbb{N}_0$, can often be deduced from results for $\tau \wedge N$, followed by a passage to the limit as $N$ tends to infinity. (The random variable $\tau \wedge N$ is a stopping time, because $\{\tau \wedge N \le n\}$ equals the whole of $\Omega$ when $N \le n$, and equals $\{\tau \le n\}$ when $N > n$.) As Problem [1] shows, the finiteness assumption on the index set $T$ is not just a notational convenience; the Lemma <16> can fail for infinite $T$.

It is amazing how many of the classical inequalities of probability theory can be derived by invoking the Lemma for a suitable martingale (or submartingale or supermartingale).

<17>
Exercise. Let $\xi_1, \ldots, \xi_N$ be independent random variables (or even just martingale increments) for which $\mathbb{P}\xi_i = 0$ and $\mathbb{P}\xi_i^2 < \infty$ for each $i$. Define $S_i := \xi_1 + \ldots + \xi_i$. Prove the Kolmogorov maximal inequality: for each $\epsilon > 0$,
$$\mathbb{P}\left\{\max_{1 \le i \le N}|S_i| \ge \epsilon\right\} \le \mathbb{P}S_N^2/\epsilon^2.$$
SOLUTION: The random variables $X_i := S_i^2$ form a submartingale, for the natural filtration. Define stopping times $\tau \equiv N$ and $\sigma := $ first $i$ such that $|S_i| \ge \epsilon$, with the convention that $\sigma = N$ if $|S_i| < \epsilon$ for every $i$. Why is $\sigma$ a stopping time? Check the pointwise bound,
$$\epsilon^2\left\{\max_i |S_i| \ge \epsilon\right\} = \epsilon^2\{X_\sigma \ge \epsilon^2\} \le X_\sigma.$$
What happens in the case when $\sigma$ equals $N$ because $|S_i| < \epsilon$ for every $i$? Take expectations, then invoke the Stopping Time Lemma (with $F = \Omega$) for the submartingale $\{X_i\}$, to deduce
$$\epsilon^2\mathbb{P}\left\{\max_i |S_i| \ge \epsilon\right\} \le \mathbb{P}X_\sigma \le \mathbb{P}X_\tau = \mathbb{P}S_N^2,$$
as asserted. □

Notice how the Kolmogorov inequality improves upon the elementary bound $\mathbb{P}\{|S_N| \ge \epsilon\} \le \mathbb{P}S_N^2/\epsilon^2$. Actually it is the same inequality, applied to $S_\sigma$ instead of $S_N$, supplemented by a useful bound for $\mathbb{P}S_\sigma^2$ made possible by the submartingale property. Kolmogorov (1928) established his inequality as the first step towards a proof of various convergence results for sums of independent random variables.

More versatile maximal inequalities follow from more involved appeals to the Stopping Time Lemma. For example, a strong law of large numbers can be proved quite efficiently (Bauer 1981, Section 6.3) by an appeal to the next inequality.
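REMARK. A simulation makes the comparison concrete. The sketch below (illustrative parameters, not from the text) estimates $\mathbb{P}\{\max_{i \le N}|S_i| \ge \epsilon\}$ for $\pm 1$ steps, where $\mathbb{P}S_N^2 = N$, and prints it next to the Kolmogorov bound $N/\epsilon^2$:

    import random

    N, eps, trials = 50, 15.0, 100_000
    exceed = 0
    for _ in range(trials):
        s, mx = 0, 0
        for _ in range(N):
            s += random.choice((-1, 1))
            mx = max(mx, abs(s))
        exceed += (mx >= eps)
    print(exceed / trials, "<=", N / eps**2)   # estimate vs. the bound 50/225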

<18>
Exercise. Let $0 = S_0, \ldots, S_N$ be a martingale with $v_i := \mathbb{P}(S_i - S_{i-1})^2 < \infty$ for each $i$. Let $\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_N$ be nonnegative constants. Prove the Hájek-Rényi inequality:
$$\mathbb{P}\left\{\max_i \gamma_i|S_i| \ge 1\right\} \le \sum_{i \le N}\gamma_i^2 v_i.$$
SOLUTION: Define $\mathcal{F}_i := \sigma(S_1, \ldots, S_i)$. Write $\mathbb{P}_i$ for $\mathbb{P}(\cdot \mid \mathcal{F}_i)$. Define $\eta_i := \gamma_i^2S_i^2 - \gamma_{i-1}^2S_{i-1}^2$, and $\Delta_i := \mathbb{P}_{i-1}\eta_i$. By the Doob decomposition from Example <7>, the sequence $M_k := \sum_{i=1}^k(\eta_i - \Delta_i)$ is a martingale with respect to the filtration $\{\mathcal{F}_i\}$; and $\gamma_k^2S_k^2 = (\Delta_1 + \ldots + \Delta_k) + M_k$. Define stopping times $\sigma \equiv 0$ and
$$\tau := \begin{cases}\text{first } i \text{ such that } \gamma_i|S_i| \ge 1,\\ N \text{ if } \gamma_i|S_i| < 1 \text{ for all } i.\end{cases}$$
The main idea is to bound each $\Delta_1 + \ldots + \Delta_k$ by a single random variable $A$, whose expectation will become the right-hand side of the asserted inequality.

Construct $A$ from the martingale differences $\xi_i := S_i - S_{i-1}$ for $i = 1, \ldots, N$. For each $i$, use the fact that $S_{i-1}$ is $\mathcal{F}_{i-1}$-measurable to bound the contribution of $\Delta_i$:
$$\Delta_i = \mathbb{P}_{i-1}\left(\gamma_i^2(S_{i-1} + \xi_i)^2 - \gamma_{i-1}^2S_{i-1}^2\right) = \gamma_i^2\mathbb{P}_{i-1}\xi_i^2 + 2\gamma_i^2S_{i-1}\mathbb{P}_{i-1}\xi_i + (\gamma_i^2 - \gamma_{i-1}^2)S_{i-1}^2.$$
The middle term on the last line vanishes, by the martingale difference property, and the last term is negative, because $\gamma_i^2 \le \gamma_{i-1}^2$. The sum of the three terms is less than the nonnegative quantity $\gamma_i^2\mathbb{P}(\xi_i^2 \mid \mathcal{F}_{i-1})$, and
$$A := \sum_{i \le N}\gamma_i^2\mathbb{P}_{i-1}\xi_i^2 \ge \sum_{i \le k}\Delta_i \qquad\text{for each } k, \text{ as required.}$$

The asserted inequality now follows via the Stopping Time Lemma:
$$\mathbb{P}\left\{\max_i \gamma_i|S_i| \ge 1\right\} \le \mathbb{P}\gamma_\tau^2S_\tau^2 = \mathbb{P}(\Delta_1 + \ldots + \Delta_\tau) + \mathbb{P}M_\tau \le \mathbb{P}A,$$
because $\Delta_1 + \ldots + \Delta_\tau \le A$ and $\mathbb{P}M_\tau = \mathbb{P}M_\sigma = 0$. □


The method of proof in Example <18> is worth remembering; it can be used
to derive several other bounds.

3. Convergence of positive supermartingales


In several respects the theory for positive (meaning nonnegative) supermartingales $\{X_n : n \in \mathbb{N}_0\}$ is particularly elegant. For example (Problem [5]), the Stopping Time Lemma extends naturally to pairs of unbounded stopping times for positive supermartingales. Even more pleasantly surprising, positive supermartingales converge almost surely, to an integrable limit, as will be shown in this Section.

The key result for the proof of convergence is an elegant lemma (Dubins's Inequality) that shows why a positive supermartingale $\{X_n\}$ cannot oscillate between two levels infinitely often.

For fixed constants $\alpha$ and $\beta$ with $0 \le \alpha < \beta < \infty$ define increasing sequences of random times at which the process might drop below $\alpha$ or rise above $\beta$:
$$\sigma_1 := \inf\{i \ge 0 : X_i \le \alpha\}, \qquad \tau_1 := \inf\{i \ge \sigma_1 : X_i \ge \beta\},$$
$$\sigma_2 := \inf\{i \ge \tau_1 : X_i \le \alpha\}, \qquad \tau_2 := \inf\{i \ge \sigma_2 : X_i \ge \beta\},$$
and so on, with the convention that the infimum of an empty set is taken as $+\infty$.

[Figure: a sample path of $\{X_n\}$, with the times $\sigma_1 < \tau_1 < \sigma_2 < \tau_2$ marking successive drops below $\alpha$ and rises above $\beta$.]

Because the $\{X_i\}$ are adapted to $\{\mathcal{F}_i\}$, each $\sigma_i$ and $\tau_i$ is a stopping time for the filtration. For example,
$$\{\tau_1 \le k\} = \{X_i \le \alpha,\ X_j \ge \beta \text{ for some } i \le j \le k\},$$
which could be written out explicitly as a finite union of events involving only $X_0, \ldots, X_k$.

When $\tau_k$ is finite, the segment $\{X_i : \sigma_k \le i \le \tau_k\}$ is called the $k$th upcrossing of the interval $[\alpha, \beta]$ by the process $\{X_n : n \in \mathbb{N}_0\}$. The event $\{\tau_k \le N\}$ may be described, slightly informally, by saying that the process completes at least $k$ upcrossings of $[\alpha, \beta]$ up to time $N$.

<20>
Dubins's inequality. For a positive supermartingale $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$, constants $0 \le \alpha < \beta < \infty$, and stopping times as defined above,
$$\mathbb{P}\{\tau_k < \infty\} \le (\alpha/\beta)^k \qquad\text{for } k \in \mathbb{N}.$$

Proof. Choose, and temporarily hold fixed, a finite positive integer $N$. Define $\tau_0$ to be identically zero. For $k \ge 1$, using the fact that $X_{\tau_k} \ge \beta$ when $\tau_k < \infty$ and $X_{\sigma_k} \le \alpha$ when $\sigma_k < \infty$, we have
$$\mathbb{P}\left(\beta\{\tau_k \le N\} + X_N\{\tau_k > N\}\right) \le \mathbb{P}X_{\tau_k \wedge N} \le \mathbb{P}X_{\sigma_k \wedge N} \qquad\text{(Stopping Time Lemma)}$$
$$\le \mathbb{P}\left(\alpha\{\sigma_k \le N\} + X_N\{\sigma_k > N\}\right),$$
which rearranges to give
$$\beta\mathbb{P}\{\tau_k \le N\} \le \alpha\mathbb{P}\{\sigma_k \le N\} + \mathbb{P}X_N\left(\{\sigma_k > N\} - \{\tau_k > N\}\right) \le \alpha\mathbb{P}\{\tau_{k-1} \le N\}$$
because $\tau_{k-1} \le \sigma_k \le \tau_k$ and $X_N \ge 0$. That is,
$$\mathbb{P}\{\tau_k \le N\} \le \frac{\alpha}{\beta}\,\mathbb{P}\{\tau_{k-1} \le N\} \qquad\text{for } k \ge 1.$$
Repeated appeals to this inequality, followed by a passage to the limit as $N \to \infty$, lead to Dubins's Inequality. □

REMARK. When $0 = \alpha < \beta$ we have $\mathbb{P}\{\tau_1 < \infty\} = 0$. By considering a sequence of $\beta$ values decreasing to zero, we deduce that on the set $\{\sigma_1 < \infty\}$ we must have $X_n = 0$ for all $n \ge \sigma_1$. That is, if a positive supermartingale hits zero then it must stay there forever.

Notice that the main part of the argument, before $N$ was sent off to infinity, involved only the variables $X_0, \ldots, X_N$. The result may fruitfully be reexpressed as an assertion about positive supermartingales with a finite index set.

<21>
Corollary. Let $\{(X_n, \mathcal{F}_n) : n = 0, 1, \ldots, N\}$ be a positive supermartingale with a finite index set. For each pair of constants $0 \le \alpha < \beta < \infty$, the probability that the process completes at least $k$ upcrossings is less than $(\alpha/\beta)^k$.

<22>
Theorem. Every positive supermartingale converges almost surely to a nonnegative, integrable limit.

Proof. To prove almost sure convergence (with possibly an infinite limit) of the sequence $\{X_n\}$, it is enough to show that the event
$$D = \{\omega : \limsup X_n(\omega) > \liminf X_n(\omega)\}$$
is negligible. Decompose $D$ into a countable union of events
$$D_{\alpha,\beta} = \{\limsup X_n > \beta > \alpha > \liminf X_n\},$$
with $\alpha, \beta$ ranging over all pairs of rational numbers. On $D_{\alpha,\beta}$ we must have $\tau_k < \infty$ for every $k$. Thus $\mathbb{P}D_{\alpha,\beta} \le (\alpha/\beta)^k$ for every $k$, which forces $\mathbb{P}D_{\alpha,\beta} = 0$, and $\mathbb{P}D = 0$.

The sequence $X_n$ converges to $X_\infty := \liminf X_n$ on the set $D^c$. Fatou's lemma, and the fact that $\mathbb{P}X_n$ is nonincreasing, ensure that $X_\infty$ is integrable. □

<23>
Exercise. Suppose $\{\xi_i\}$ are independent, identically distributed random variables with $\mathbb{P}\{\xi_i = +1\} = p$ and $\mathbb{P}\{\xi_i = -1\} = 1 - p$. Define the partial sums $S_0 = 0$ and $S_i = \xi_1 + \ldots + \xi_i$ for $i \ge 1$. For $1/2 \le p \le 1$, show that
$$\mathbb{P}\{S_i = -1 \text{ for at least one } i\} = (1-p)/p.$$
SOLUTION: Consider a fixed $p$ with $1/2 < p < 1$. Define $\theta = (1-p)/p$. Define $\tau = \inf\{i \in \mathbb{N} : S_i = -1\}$. We are trying to show that $\mathbb{P}\{\tau < \infty\} = \theta$. Observe that $X_n = \theta^{S_n}$ is a positive martingale with respect to the filtration $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$: by independence and the equality $\mathbb{P}\theta^{\xi_n} = 1$,
$$\mathbb{P}X_nF = \mathbb{P}\theta^{\xi_n}\theta^{S_{n-1}}F = \left(\mathbb{P}\theta^{\xi_n}\right)\mathbb{P}X_{n-1}F = \mathbb{P}X_{n-1}F \qquad\text{for } F \text{ in } \mathcal{F}_{n-1}.$$
The sequence $\{X_{\tau \wedge n}\}$ is a positive martingale (Problem [3]). It follows that there exists an integrable $X_\infty$ such that $X_{\tau \wedge n} \to X_\infty$ almost surely. The sequence $\{S_n\}$ cannot converge to a finite limit because $|S_n - S_{n-1}| = 1$ for all $n$. On the set where $\tau = \infty$, convergence of $\theta^{S_n}$ to a finite limit is possible only if $S_n \to \infty$ and $\theta^{S_n} \to 0$. Thus,
$$X_{\tau \wedge n} \to \theta^{-1}\{\tau < \infty\} + 0\{\tau = \infty\} \qquad\text{almost surely.}$$
The bounds $0 \le X_{\tau \wedge n} \le \theta^{-1}$ allow us to invoke Dominated Convergence to deduce that $1 = \mathbb{P}X_{\tau \wedge n} \to \theta^{-1}\mathbb{P}\{\tau < \infty\}$.

Monotonicity of $\mathbb{P}\{\tau < \infty\}$ as a function of $p$ extends the solution to $p = 1/2$. □
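REMARK. The hitting probability $(1-p)/p$ is easy to confirm by simulation. A Python sketch (the truncation at max_steps is an assumption; for $p > 1/2$ a walk that has drifted far upward almost never returns to $-1$):

    import random

    def hits_minus_one(p, max_steps=10_000):
        s = 0
        for _ in range(max_steps):
            s += 1 if random.random() < p else -1
            if s == -1:
                return True
        return False                      # treat long upward drifts as never hitting -1

    p, trials = 0.7, 100_000
    est = sum(hits_minus_one(p) for _ in range(trials)) / trials
    print(est, "vs", (1 - p) / p)         # about 0.4286 for p = 0.7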
The almost sure limit $X_\infty$ of a positive supermartingale $\{X_n\}$ satisfies the inequality $\liminf \mathbb{P}X_n \ge \mathbb{P}X_\infty$, by Fatou. The sequence $\{\mathbb{P}X_n\}$ is decreasing. Under what circumstances does it converge to $\mathbb{P}X_\infty$? Equality certainly holds if $\{X_n\}$ converges to $X_\infty$ in $L^1$ norm. In fact, convergence of expectations is equivalent to $L^1$ convergence, because
$$\mathbb{P}|X_n - X_\infty| = \mathbb{P}(X_\infty - X_n)^+ + \mathbb{P}(X_\infty - X_n)^- = 2\mathbb{P}(X_\infty - X_n)^+ - (\mathbb{P}X_\infty - \mathbb{P}X_n).$$
On the right-hand side the first contribution tends to zero, by Dominated Convergence, because $X_\infty \ge (X_\infty - X_n)^+ \to 0$ almost surely. (I just reproved Scheffé's lemma.)

<24>
Corollary. A positive supermartingale $\{X_n\}$ converges in $L^1$ to its limit $X_\infty$ if and only if $\mathbb{P}X_n \to \mathbb{P}X_\infty$.

<25>
Example. Female bunyips reproduce once every ten years, according to a fixed offspring distribution $P$ on $\mathbb{N}_0$. Different bunyips reproduce independently of each other. What is the behavior of the number $Z_n$ of $n$th generation offspring from Lucy bunyip, the first of the line, as $n$ gets large? (The process $\{Z_n : n \in \mathbb{N}_0\}$ is usually called a branching process.)

Write $\mu$ for the expected number of offspring for a single bunyip. If reproduction went strictly according to averages, the $n$th generation size would equal $\mu^n$. Intuitively, if $\mu > 1$ there could be an explosion of the bunyip population; if $\mu < 1$ bunyips would be driven to extinction; if $\mu = 1$, something else might happen. A martingale argument will lead rigorously to a similar conclusion.

Given $Z_{n-1} = k$, the size of the $n$th generation is a sum of $k$ independent random variables, each with distribution $P$. Perhaps we could write $Z_n = \sum_{i=1}^{Z_{n-1}}\xi_{ni}$, with the $\{\xi_{ni} : i = 1, \ldots, Z_{n-1}\}$ (conditionally) independently distributed like $P$. I have a few difficulties with that representation. For example, where is $\xi_{n3}$ defined? Just on $\{Z_{n-1} \ge 3\}$? On all of $\Omega$? Moreover, the notation invites the blunder of ignoring the randomness of the range of summation, leading to an absurd assertion that $\mathbb{P}\sum_{i=1}^{Z_{n-1}}\xi_{ni}$ equals $\sum_{i=1}^{Z_{n-1}}\mathbb{P}\xi_{ni} = Z_{n-1}\mu$. The corresponding assertion for an expectation conditional on $Z_{n-1}$ is correct, but some of the doubts still linger.

It is much better to start with an entire family $\{\xi_{ni} : n \in \mathbb{N}, i \in \mathbb{N}\}$ of independent random variables, each with distribution $P$, then define $Z_0 \equiv 1$ and $Z_n := \sum_{i \in \mathbb{N}}\xi_{ni}\{i \le Z_{n-1}\}$ for $n \ge 1$. The random variable $Z_n$ is measurable with respect to the sigma-field $\mathcal{F}_n := \sigma\{\xi_{ki} : k \le n, i \in \mathbb{N}\}$, and, almost surely,
$$\mathbb{P}(Z_n \mid \mathcal{F}_{n-1}) = \sum_{i \in \mathbb{N}}\{i \le Z_{n-1}\}\mathbb{P}(\xi_{ni} \mid \mathcal{F}_{n-1}) \qquad\text{because } Z_{n-1} \text{ is } \mathcal{F}_{n-1}\text{-measurable}$$
$$= \sum_{i \in \mathbb{N}}\{i \le Z_{n-1}\}\mathbb{P}\xi_{ni} \qquad\text{because } \xi_{ni} \text{ is independent of } \mathcal{F}_{n-1}$$
$$= \mu Z_{n-1}.$$
If $\mu \le 1$, the $\{Z_n\}$ sequence is a positive supermartingale with respect to the $\{\mathcal{F}_n\}$ filtration. By Theorem <22>, there exists an integrable random variable $Z_\infty$ with $Z_n \to Z_\infty$ almost surely.

A sequence of integers $Z_n(\omega)$ can converge to a finite limit $k$ only if $Z_n(\omega) = k$ for all $n$ large enough. If $k > 0$, the convergence would imply that, with nonzero probability, only finitely many of the independent events $\{\sum_{i \le k}\xi_{ni} \ne k\}$ can occur. By the converse to the Borel-Cantelli lemma, it would follow that $\sum_{i \le k}\xi_{ni} = k$ almost surely, which can happen only if $P\{1\} = 1$. In that case, $Z_n \equiv 1$ for all $n$. If $P\{1\} < 1$, then $Z_n$ must converge to zero, with probability one, if $\mu \le 1$. The bunyips die out if the average number of offspring is less than or equal to 1.

If $\mu > 1$ the situation is more complex. If $P\{0\} = 0$ the population cannot decrease and $Z_n$ must then diverge to infinity with probability one. If $P\{0\} > 0$ the convex function $g(t) := P^x t^x$ must have a unique value $\theta$ with $0 < \theta < 1$ for which $g(\theta) = \theta$: the strictly convex function $h(t) := g(t) - t$ has $h(0) = P\{0\} > 0$ and $h(1) = 0$, and its left-hand derivative $P^x(xt^{x-1} - 1)$ converges to $\mu - 1 > 0$ as $t$ increases to 1. The sequence $\{\theta^{Z_n}\}$ is a positive martingale:
$$\mathbb{P}(\theta^{Z_n} \mid \mathcal{F}_{n-1}) = \{Z_{n-1} = 0\} + \sum_{k \ge 1}\mathbb{P}\left(\theta^{\xi_{n1} + \ldots + \xi_{nk}}\{Z_{n-1} = k\} \mid \mathcal{F}_{n-1}\right)$$
$$= \{Z_{n-1} = 0\} + \sum_{k \ge 1}\{Z_{n-1} = k\}g(\theta)^k = g(\theta)^{Z_{n-1}} = \theta^{Z_{n-1}} \qquad\text{because } g(\theta) = \theta.$$
The positive martingale $\{\theta^{Z_n}\}$ has an almost sure limit, $W$. The sequence $\{Z_n\}$ must converge almost surely, with an infinite limit when $W = 0$. As with the situation when $\mu \le 1$, the only other possible limit for $Z_n$ is 0, corresponding to $W = 1$. Because $0 \le \theta^{Z_n} \le 1$ for every $n$, Dominated Convergence and the martingale property give $\mathbb{P}\{W = 1\} = \lim_{n \to \infty}\mathbb{P}\theta^{Z_n} = \mathbb{P}\theta^{Z_0} = \theta$.


In summary: On the set $D := \{W = 1\}$, which has probability $\theta$, the bunyip population eventually dies out; on $D^c$, the population explodes, that is, $Z_n \to \infty$.

It is possible to say a lot more about the almost sure behavior of the process $\{Z_n\}$ when $\mu > 1$. For example, the sequence $X_n := Z_n/\mu^n$ is a positive martingale, which must converge to an integrable random variable $X$. On the set $\{X > 0\}$, the process $\{Z_n\}$ grows geometrically fast, like $\mu^n$. On the set $D$ we must have $X = 0$, but it is not obvious whether or not we might have $X = 0$ for some realizations where the process does not die out.

There is a simple argument to show that, in fact, either $X = 0$ almost surely or $X > 0$ almost surely on $D^c$. With a little extra bookkeeping we could keep track of the first generation ancestor of each bunyip in the later generations. If we write $Z_n^{(j)}$ for the members of the $n$th generation descended from the $j$th (possibly hypothetical) member of the first generation, then $Z_n = \sum_{j \in \mathbb{N}}Z_n^{(j)}\{j \le Z_1\}$. The $Z_n^{(j)}$, for $j = 1, 2, \ldots$, are independent random variables, each with the same distribution as $Z_{n-1}$, and each independent of $Z_1$. In particular, for each $j$, we have $Z_n^{(j)}/\mu^{n-1} \to X^{(j)}$ almost surely, where the $X^{(j)}$, for $j = 1, 2, \ldots$, are independent random variables, each distributed like $X$, and
$$\mu X = \sum_{j \in \mathbb{N}}X^{(j)}\{j \le Z_1\} \qquad\text{almost surely.}$$
Write $\phi$ for $\mathbb{P}\{X = 0\}$. Then, by independence,
$$\phi = \sum_{k \in \mathbb{N}_0}\phi^k P\{Z_1 = k\} = g(\phi).$$
We must have either $\phi = 1$, meaning that $X = 0$ almost surely, or else $\phi = \theta$, in which case $X > 0$ almost surely on $D^c$. The latter must be the case if $X$ is nondegenerate, that is, if $\mathbb{P}\{X > 0\} > 0$, which happens if and only if $P^x\left(x\log(1+x)\right) < \infty$; see Problem [14]. □
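REMARK. Both conclusions, the extinction probability $\theta$ and the dichotomy between dying out and exploding, can be watched in a small simulation. The offspring distribution below is an arbitrary illustration with $\mu = 1.25$; for it, $g(t) = 0.25 + 0.25t + 0.5t^2$ and $\theta = 0.5$:

    import random

    probs = {0: 0.25, 1: 0.25, 2: 0.5}           # illustrative P, mu = 1.25
    def offspring():
        u = random.random()
        return 0 if u < 0.25 else (1 if u < 0.5 else 2)

    def extinct(generations=200):
        z = 1
        for _ in range(generations):
            if z == 0:
                return True
            z = sum(offspring() for _ in range(z))
            if z > 10_000:                       # exploding; extinction now very unlikely
                return False
        return z == 0

    trials = 20_000
    print(sum(extinct() for _ in range(trials)) / trials)   # near theta = 0.5

    g = lambda t: sum(pk * t**k for k, pk in probs.items())
    t = 0.0
    for _ in range(200):                         # iterate t <- g(t) from 0 to reach the
        t = g(t)                                 # smallest fixed point of g in [0, 1]
    print(t)                                     # 0.5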

4. Convergence of submartingales

Theorem <22> can be extended to a large class of submartingales by means of the following decomposition Theorem, whose proof appears in the next Section.

<26>
Krickeberg decomposition. Let $\{S_n : n \in \mathbb{N}_0\}$ be a submartingale for which $\sup_n \mathbb{P}S_n^+ < \infty$. Then there exists a positive martingale $\{M_n\}$ and a positive supermartingale $\{X_n\}$ such that $S_n = M_n - X_n$ almost surely, for each $n$.

<27>
Corollary. A submartingale with $\sup_n \mathbb{P}S_n^+ < \infty$ converges almost surely to an integrable limit.

For a direct proof of this convergence result, via an upcrossing inequality for supermartingales that are not necessarily nonnegative, see Problem [11].

REMARK. Finiteness of $\sup_n \mathbb{P}S_n^+$ is equivalent to the finiteness of $\sup_n \mathbb{P}|S_n|$, because $|S_n| = 2S_n^+ - (S_n^+ - S_n^-)$ and, by the submartingale property, $\mathbb{P}(S_n^+ - S_n^-) = \mathbb{P}S_n$ increases with $n$.

<28>
Example. (Section 68 of Lévy 1937.) Let $\{M_n : n \in \mathbb{N}_0\}$ be a martingale such that $|M_n - M_{n-1}| \le 1$, for all $n$, and $M_0 = 0$. In order that $M_n(\omega)$ converge to a finite limit, it is necessary that $\sup_n M_n(\omega)$ be finite. In fact, it is also a sufficient condition. More precisely,
$$\{\omega : \lim M_n(\omega) \text{ exists as a finite limit}\} = \{\omega : \sup_n M_n(\omega) < \infty\} \qquad\text{almost surely.}$$
To establish that the right-hand side is (almost surely) a subset of the left-hand side, for a fixed positive $C$ define $\tau$ as the first $n$ for which $M_n > C$, with $\tau = \infty$ when $\sup_n M_n \le C$. The martingale $X_n := M_{\tau \wedge n}$ is bounded above by the constant $C + 1$, because the increment (if any) that pushes $M_n$ above $C$ cannot be larger than 1. In particular, $\sup_n \mathbb{P}X_n^+ < \infty$, which ensures that $\{X_n\}$ converges almost surely to a finite limit. On the set $\{\sup_n M_n \le C\}$ we have $M_n = X_n$ for all $n$, and hence $M_n$ also converges almost surely to a finite limit on that set. Take a union over a sequence of $C$ values increasing to $\infty$ to complete the argument.

REMARK. Convergence of $M_n(\omega)$ to a finite limit also implies that $\sup_n |M_n(\omega)| < \infty$. The result therefore contains the surprising assertion that, almost surely, finiteness of $\sup_n M_n(\omega)$ implies finiteness of $\sup_n |M_n(\omega)|$.

As a special case, consider a sequence $\{A_n\}$ of events adapted to a filtration $\{\mathcal{F}_n\}$. The martingale $M_n := \sum_{i=1}^n\left(\{A_i\} - \mathbb{P}(A_i \mid \mathcal{F}_{i-1})\right)$ has increments bounded in absolute value by 1. For almost all $\omega$, finiteness of $\sum_{n \in \mathbb{N}}\{\omega \in A_n\}$ implies $\sup_n M_n(\omega) < \infty$, and hence convergence of the sum of conditional probabilities. Argue similarly for the martingale $\{-M_n\}$ to conclude that
$$\left\{\omega : \sum_{n=1}^\infty\{\omega \in A_n\} < \infty\right\} = \left\{\omega : \sum_{n=1}^\infty\mathbb{P}(A_n \mid \mathcal{F}_{n-1}) < \infty\right\} \qquad\text{almost surely,}$$
a remarkable generalization of the Borel-Cantelli lemma for sequences of independent events. □
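REMARK. The equivalence can be watched in action. In the Python sketch below (the conditional probabilities are an arbitrary illustration, not from the text), the event $A_n$ occurs with conditional probability $1/n$ if $A_{n-1}$ occurred and $1/n^2$ otherwise; on each path, the count of occurrences and the running sum of conditional probabilities settle down together:

    import random

    def path(N=100_000):
        prev, count, cond_sum = False, 0, 0.0
        for n in range(1, N + 1):
            p = 1.0 / n if prev else 1.0 / n**2
            cond_sum += p                    # running sum of P(A_n | F_{n-1})
            prev = random.random() < p
            count += prev                    # running count of occurrences of A_n
        return count, cond_sum

    for _ in range(5):
        print(path())                        # both coordinates stay small together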

*5. Proof of the Krickeberg decomposition

It is easiest to understand the proof by reinterpreting the result as an assertion about measures. To each integrable random variable $X$ on $(\Omega, \mathcal{F}, \mathbb{P})$ there corresponds a signed measure $\mu$ defined on $\mathcal{F}$ by $\mu F := \mathbb{P}(XF)$ for $F \in \mathcal{F}$. The measure can also be written as a difference of two nonnegative measures $\mu^+$ and $\mu^-$, defined by $\mu^+F := \mathbb{P}(X^+F)$ and $\mu^-F := \mathbb{P}(X^-F)$, for $F \in \mathcal{F}$.

By equivalence (MG)', a sequence of integrable random variables $\{X_n : n \in \mathbb{N}_0\}$ adapted to a filtration $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ is a martingale if and only if the corresponding sequence of measures $\{\mu_n\}$ on $\mathcal{F}$ has the property
<29>
$$\mu_{n+1}\big|_{\mathcal{F}_n} = \mu_n\big|_{\mathcal{F}_n} \qquad\text{for each } n,$$
where, in general, $\nu|_{\mathcal{G}}$ denotes the restriction of a measure $\nu$ to a sub-sigma-field $\mathcal{G}$. Similarly, the defining inequality (subMG)' for a submartingale, $\mu_{n+1}F := \mathbb{P}(X_{n+1}F) \ge \mathbb{P}(X_nF) =: \mu_nF$ for all $F \in \mathcal{F}_n$, is equivalent to
<30>
$$\mu_{n+1}\big|_{\mathcal{F}_n} \ge \mu_n\big|_{\mathcal{F}_n} \qquad\text{for each } n.$$

Now consider the submartingale $\{S_n : n \in \mathbb{N}_0\}$ from the statement of the Krickeberg decomposition. Define an increasing functional $\lambda : \mathcal{M}^+(\mathcal{F}) \to [0, \infty]$ by
$$\lambda f := \limsup_n \mathbb{P}(S_n^+f) \qquad\text{for } f \in \mathcal{M}^+(\mathcal{F}).$$
Notice that $\lambda 1 = \limsup_n \mathbb{P}S_n^+$, which is finite, by assumption. The functional also has a property analogous to absolute continuity: if $\mathbb{P}f = 0$ then $\lambda f = 0$.

Write $\lambda_k$ for the restriction of $\lambda$ to $\mathcal{M}^+(\mathcal{F}_k)$. For $f$ in $\mathcal{M}^+(\mathcal{F}_k)$, the submartingale property for $\{S_n^+\}$ ensures that $\mathbb{P}S_n^+f$ increases with $n$ for $n \ge k$. Thus
<31>
$$\lambda_kf := \lambda f = \lim_n \mathbb{P}(S_n^+f) = \sup_{n \ge k}\mathbb{P}(S_n^+f) \qquad\text{if } f \in \mathcal{M}^+(\mathcal{F}_k).$$
The increasing functional $\lambda_k$ is linear (because linearity is preserved by limits), and it inherits the Monotone Convergence property from $\mathbb{P}$: for functions in $\mathcal{M}^+(\mathcal{F}_k)$ with $0 \le f_i \uparrow f$,
$$\sup_i \lambda_kf_i = \sup_i \sup_{n \ge k}\mathbb{P}(S_n^+f_i) = \sup_{n \ge k}\sup_i \mathbb{P}(S_n^+f_i) = \sup_{n \ge k}\mathbb{P}(S_n^+f) = \lambda_kf.$$
It defines a finite measure on $\mathcal{F}_k$ that is absolutely continuous with respect to $\mathbb{P}|_{\mathcal{F}_k}$. Write $M_k$ for the corresponding density in $\mathcal{M}^+(\Omega, \mathcal{F}_k)$.

The analog of <29> identifies $\{M_k\}$ as a nonnegative martingale, because $\lambda_{k+1}|_{\mathcal{F}_k} = \lambda|_{\mathcal{F}_k} = \lambda_k$. Moreover, $M_k \ge S_k^+$ almost surely because
$$\mathbb{P}M_k\{M_k < S_k^+\} = \lambda_k\{M_k < S_k^+\} \ge \mathbb{P}S_k^+\{M_k < S_k^+\},$$
the last inequality following from <31> with $f := \{M_k < S_k^+\}$. The random variables $X_k := M_k - S_k$ are almost surely nonnegative. Also, for $F \in \mathcal{F}_k$,
$$\mathbb{P}X_kF = \mathbb{P}M_kF - \mathbb{P}S_kF \ge \mathbb{P}M_{k+1}F - \mathbb{P}S_{k+1}F = \mathbb{P}X_{k+1}F,$$
because $\{M_k\}$ is a martingale and $\{S_k\}$ is a submartingale. It follows that $\{X_k\}$ is a supermartingale, as required for the Krickeberg decomposition. □
*6. Uniform integrability

Corollary <27> gave a sufficient condition for a submartingale $\{X_n\}$ to converge almost surely to an integrable limit $X_\infty$. If $\{X_n\}$ happens to be a martingale, we know that $X_n = \mathbb{P}(X_{n+m} \mid \mathcal{F}_n)$ for arbitrarily large $m$. It is tempting to leap to the conclusion that
<32>
$$X_n \overset{?}{=} \mathbb{P}(X_\infty \mid \mathcal{F}_n),$$
as suggested by a purely formal passage to the limit as $m$ tends to infinity. One should perhaps look before one leaps.

<33>
Example. Reconsider the limit behavior of the partial sums $\{S_n\}$ from Example <23> but with $p = 1/3$ and $\theta = 2$. The sequence $X_n = 2^{S_n}$ is a positive martingale. By the strong law of large numbers, $S_n/n \to -1/3$ almost surely, which gives $S_n \to -\infty$ almost surely and $X_\infty = 0$ as the limit of the martingale. Clearly $X_n$ is not equal to $\mathbb{P}(X_\infty \mid \mathcal{F}_n)$. □


REMARK. The branching process of Example <25> with $\mu = 1$ provides another case of a nontrivial martingale converging almost surely to zero.

As you will learn in this Section, the condition for the validity of <32> (without the cautionary question mark) is uniform integrability. Remember that a family of random variables $\{Z_t : t \in T\}$ is said to be uniformly integrable if $\sup_{t \in T}\mathbb{P}|Z_t|\{|Z_t| > M\} \to 0$ as $M \to \infty$. Remember also the following characterization of $L^1$ convergence, which was proved in Section 2.8.
<34>
Theorem. Let $\{Z_n : n \in \mathbb{N}\}$ be a sequence of integrable random variables. The following two conditions are equivalent.
(i) The sequence is uniformly integrable and it converges in probability to a random variable $Z_\infty$, which is necessarily integrable.
(ii) The sequence converges in $L^1$ norm, $\mathbb{P}|Z_n - Z_\infty| \to 0$, to an integrable random variable $Z_\infty$.
The necessity of uniform integrability for <32> follows immediately from a general property of conditional expectations.

<35>
Lemma. For a fixed integrable random variable $Z$, the family of all conditional expectations $\{\mathbb{P}(Z \mid \mathcal{G}) : \mathcal{G} \text{ a sub-sigma-field of } \mathcal{F}\}$ is uniformly integrable.

Proof. Write $Z_{\mathcal{G}}$ for $\mathbb{P}(Z \mid \mathcal{G})$. With no loss of generality, we may suppose $Z \ge 0$, because $|Z_{\mathcal{G}}| \le \mathbb{P}(|Z| \mid \mathcal{G})$. Invoke the defining property of the conditional expectation, and the fact that $\{Z_{\mathcal{G}} > M^2\} \in \mathcal{G}$, to rewrite $\mathbb{P}Z_{\mathcal{G}}\{Z_{\mathcal{G}} > M^2\}$ as
$$\mathbb{P}Z\{Z_{\mathcal{G}} > M^2\} \le M\mathbb{P}\{Z_{\mathcal{G}} > M^2\} + \mathbb{P}Z\{Z > M\}.$$
The first term on the right-hand side is less than $M\mathbb{P}Z_{\mathcal{G}}/M^2 = \mathbb{P}Z/M$, which tends to zero as $M \to \infty$. The other term also tends to zero, because $Z$ is integrable. □
More generally, if $X$ is an integrable random variable and $\{\mathcal{F}_n : n \in \mathbb{N}_0\}$ is a filtration then $X_n := \mathbb{P}(X \mid \mathcal{F}_n)$ defines a uniformly integrable martingale. In fact, every uniformly integrable martingale must be of this form.

<36>
Theorem. Every uniformly integrable martingale $\{X_n : n \in \mathbb{N}_0\}$ converges almost surely and in $L^1$ to an integrable random variable $X_\infty$, for which $X_n = \mathbb{P}(X_\infty \mid \mathcal{F}_n)$. Moreover, if $X_n := \mathbb{P}(X \mid \mathcal{F}_n)$ for some integrable $X$ then $X_\infty = \mathbb{P}(X \mid \mathcal{F}_\infty)$, where $\mathcal{F}_\infty := \sigma\left(\cup_{n \in \mathbb{N}}\mathcal{F}_n\right)$.

Proof. Uniform integrability implies finiteness of $\sup_n \mathbb{P}|X_n|$, which lets us deduce via Corollary <27> the almost sure convergence to the integrable limit $X_\infty$. Almost sure convergence implies convergence in probability, which uniform integrability and Theorem <34> strengthen to $L^1$ convergence. To show that $X_n = \mathbb{P}(X_\infty \mid \mathcal{F}_n)$, fix an $F$ in $\mathcal{F}_n$. Then, for all positive $m$,
$$|\mathbb{P}X_\infty F - \mathbb{P}X_nF| \le \mathbb{P}|X_\infty - X_{n+m}| + |\mathbb{P}X_{n+m}F - \mathbb{P}X_nF|.$$
The $L^1$ convergence makes the first term on the right-hand side converge to zero as $m$ tends to infinity. The second term is zero for all positive $m$, by the martingale property. Thus $\mathbb{P}X_\infty F = \mathbb{P}X_nF$ for every $F$ in $\mathcal{F}_n$.

If $\mathbb{P}(X \mid \mathcal{F}_n) = X_n = \mathbb{P}(X_\infty \mid \mathcal{F}_n)$ then $\mathbb{P}XF = \mathbb{P}X_\infty F$ for each $F$ in $\mathcal{F}_n$. A generating class argument then gives the equality for all $F$ in $\mathcal{F}_\infty$, which characterizes the $\mathcal{F}_\infty$-measurable random variable $X_\infty$ as the conditional expectation $\mathbb{P}(X \mid \mathcal{F}_\infty)$. □

REMARK. More concisely: the uniformly integrable martingales $\{X_n : n \in \mathbb{N}\}$ are precisely those that can be extended to martingales $\{X_n : n \in \overline{\mathbb{N}}\}$, with $\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$. Such a martingale is sometimes said to be closed on the right.
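REMARK. The closure property is easy to visualize for a concrete Doob martingale. Take $X := f(U)$ with $U$ uniform on $(0,1)$ and $\mathcal{F}_n$ generated by the first $n$ binary digits of $U$; then $X_n = \mathbb{P}(X \mid \mathcal{F}_n)$ is the average of $f$ over a shrinking dyadic interval. A Python sketch (the choice $f(u) = u^2$ is an arbitrary illustration, not from the text):

    import random

    def doob_path(levels=20):
        """X_n := P(f(U) | first n binary digits of U) for f(u) = u*u,
        computed exactly as the average of f over the pinned-down dyadic interval."""
        u = random.random()
        lo, hi, xs = 0.0, 1.0, []
        for _ in range(levels):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if u >= mid else (lo, mid)
            xs.append((hi**3 - lo**3) / (3 * (hi - lo)))   # exact mean of u^2 on [lo, hi]
        return u, xs

    u, xs = doob_path()
    print(u * u, xs[-1])      # X_n has nearly reached X_infinity = f(U)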

<37>
Example. Classical statistical models often consist of a parametric family $\mathcal{P} = \{\mathbb{P}_\theta : \theta \in \Theta\}$ of probability measures that define joint distributions of infinite sequences $\omega := (\omega_1, \omega_2, \ldots)$ of possible observations. More formally, each $\mathbb{P}_\theta$ could be thought of as a probability measure on $\mathbb{R}^{\mathbb{N}}$, with random variables $X_i$ as the coordinate maps.

For simplicity, suppose $\Theta$ is a Borel subset of a Euclidean space. The parameter $\theta$ is said to be consistently estimated by a sequence of measurable functions $\hat\theta_n = \hat\theta_n(\omega_1, \ldots, \omega_n)$ if
<38>
$$\mathbb{P}_\theta\{|\hat\theta_n - \theta| > \epsilon\} \to 0 \qquad\text{for each } \epsilon > 0 \text{ and each } \theta \text{ in } \Theta.$$
A Bayesian would define a joint distribution $\mathbb{Q} := \pi \otimes \mathcal{P}$ for $\theta$ and $\omega$ by equipping $\Theta$ with a prior probability distribution $\pi$. The conditional distributions $\mathbb{Q}_{n,t}$ given the random vectors $T_n := (X_1, \ldots, X_n)$ are called posterior distributions. We could also regard $\pi_{n,\omega}(\cdot) := \mathbb{Q}_{n,T_n(\omega)}(\cdot)$ as random probability measures on the product space. An expectation with respect to $\pi_{n,\omega}$ is a version of a conditional expectation given the sigma-field $\mathcal{F}_n := \sigma(X_1, \ldots, X_n)$.

A mysterious sounding result of Doob (1949) asserts that mere existence of some consistent estimator for $\theta$ ensures that the $\pi_{n,\omega}$ distributions will concentrate around the right value, in the delicate sense that for $\pi$-almost all $\theta$, the $\pi_{n,\omega}$ measure of each neighborhood of $\theta$ tends to one for $\mathbb{P}_\theta$-almost all $\omega$.

The mystery dissipates when one understands the role of the consistent estimator. When averaged out over the prior, property <38> implies (via Dominated Convergence) that $\mathbb{Q}\{(\theta, \omega) : |\hat\theta_n(\omega) - \theta| > \epsilon\} \to 0$. A $\mathbb{Q}$-almost surely convergent subsequence identifies $\theta$ as an $\mathcal{F}_\infty$-measurable random variable, $T(\omega)$, on the product space, up to a $\mathbb{Q}$ equivalence. That is, $\theta = T(\omega)$ a.e. $[\mathbb{Q}]$.

Let $\mathcal{U}$ be a countable collection of open sets generating the topology of $\Theta$. That is, each open set should equal the union of the $\mathcal{U}$-sets that it contains. For each $U$ in $\mathcal{U}$, the sequence of posterior probabilities $\pi_{n,\omega}\{\theta \in U\} = \mathbb{Q}\{\theta \in U \mid \mathcal{F}_n\}$ defines a uniformly integrable martingale, which converges $\mathbb{Q}$-almost surely to
$$\mathbb{Q}\{\theta \in U \mid \mathcal{F}_\infty\} = \{\theta \in U\} \qquad\text{because } \{\theta \in U\} = \{T(\omega) \in U\} \in \mathcal{F}_\infty, \text{ almost surely.}$$
Cast out a sequence of $\mathbb{Q}$-negligible sets, leaving a set $E$ with $\mathbb{Q}E = 1$ and $\pi_{n,\omega}\{\theta \in U\} \to \{\theta \in U\}$ for all $U$ in $\mathcal{U}$, all $(\theta, \omega) \in E$, which implies Doob's result. □
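REMARK. For a concrete feel for the posterior concentration, consider coin tossing: a uniform prior on $\theta$, observations i.i.d. Bernoulli($\theta_0$). The Python sketch below (grid integration of the Beta-shaped posterior; the parameter values are illustrative assumptions) shows the posterior mass of a small neighborhood of $\theta_0$ climbing toward one:

    import math, random

    def posterior_mass(theta0=0.3, eps=0.05, n=2000, grid=4000):
        k = sum(random.random() < theta0 for _ in range(n))   # number of successes
        log_dens = lambda t: k * math.log(t) + (n - k) * math.log(1 - t)
        ts = [(i + 0.5) / grid for i in range(grid)]
        m = max(log_dens(t) for t in ts)                      # stabilize the exponentials
        w = [math.exp(log_dens(t) - m) for t in ts]
        near = sum(wi for t, wi in zip(ts, w) if abs(t - theta0) < eps)
        return near / sum(w)

    for n in (10, 100, 2000):
        print(n, posterior_mass(n=n))    # mass of (theta0 - eps, theta0 + eps) grows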
*7. Reversed martingales

Martingale theory gets easier when the index set $T$ has a largest element, as in the case $T = -\mathbb{N}_0 := \{-n : n \in \mathbb{N}_0\}$. Equivalently, one can reverse the "direction of time," by considering families of integrable random variables $\{X_t : t \in T\}$ adapted to decreasing filtrations, families of sub-sigma-fields $\{\mathcal{G}_t : t \in T\}$ for which $\mathcal{G}_s \supseteq \mathcal{G}_t$ when $s \le t$. For such a family, it is natural to define $\mathcal{G}_\infty := \cap_{t \in T}\mathcal{G}_t$ if it is not already defined.
<39>
Definition. Let $\{X_n : n \in \mathbb{N}_0\}$ be a sequence of integrable random variables, adapted to a decreasing filtration $\{\mathcal{G}_n : n \in \mathbb{N}_0\}$. Call $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ a reversed supermartingale if $\mathbb{P}(X_n \mid \mathcal{G}_{n+1}) \le X_{n+1}$ almost surely, for each $n$. Define reversed submartingales and reversed martingales analogously.

That is, $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ is a reversed supermartingale if and only if $\{(X_{-n}, \mathcal{G}_{-n}) : -n \in -\mathbb{N}_0\}$ is a supermartingale. In particular, for each fixed $N$, the finite sequence $X_N, X_{N-1}, \ldots, X_0$ is a supermartingale with respect to the filtration $\mathcal{G}_N \subseteq \mathcal{G}_{N-1} \subseteq \ldots \subseteq \mathcal{G}_0$.
<40>
Example. If $\{\mathcal{G}_n : n \in \mathbb{N}_0\}$ is a decreasing filtration and $X$ is an integrable random variable, the sequence $X_n := \mathbb{P}(X \mid \mathcal{G}_n)$ defines a uniformly integrable (Lemma <35>) reversed martingale. □

The theory for reversed positive supermartingales is analogous to the theory from Section 3, except for the slight complication that the sequence $\{\mathbb{P}X_n : n \in \mathbb{N}_0\}$ might be increasing, and therefore it is not automatically bounded.

<41>
Theorem. For every reversed, positive supermartingale $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$:
(i) there exists an $X_\infty$ in $\mathcal{M}^+(\Omega, \mathcal{G}_\infty)$ for which $X_n \to X_\infty$ almost surely;
(ii) $\mathbb{P}(X_n \mid \mathcal{G}_\infty) \uparrow X_\infty$ almost surely;
(iii) $\mathbb{P}|X_n - X_\infty| \to 0$ if and only if $\sup_n \mathbb{P}X_n < \infty$.

Proof. The Corollary <21> to Dubins's Inequality bounds by $(\alpha/\beta)^k$ the probability that $X_N, X_{N-1}, \ldots, X_0$ completes at least $k$ upcrossings of the interval $[\alpha, \beta]$, no matter how large we take $N$. As in the proof of Theorem <22>, it then follows that $\mathbb{P}\{\limsup X_n > \beta > \alpha > \liminf X_n\} = 0$, for each pair $0 \le \alpha < \beta < \infty$, and hence $X_n$ converges almost surely to a nonnegative limit $X_\infty$, which is necessarily $\mathcal{G}_\infty$-measurable.

I will omit most of the "almost sure" qualifiers for the remainder of the proof. Temporarily abbreviate $\mathbb{P}(\cdot \mid \mathcal{G}_n)$ to $\mathbb{P}_n(\cdot)$, for $n \in \overline{\mathbb{N}}_0$, and write $Z_n$ for $\mathbb{P}_\infty X_n$. From the reversed supermartingale property, $\mathbb{P}_{n+1}X_n \le X_{n+1}$, and the rule for iterated conditional expectations we get
$$Z_n = \mathbb{P}_\infty X_n = \mathbb{P}_\infty\left(\mathbb{P}_{n+1}X_n\right) \le \mathbb{P}_\infty X_{n+1} = Z_{n+1}.$$
Thus $Z_n \uparrow Z_\infty := \limsup_n Z_n$, which is $\mathcal{G}_\infty$-measurable.

For (ii) we need to show $Z_\infty = X_\infty$ almost surely. Equivalently, as both variables are $\mathcal{G}_\infty$-measurable, we need to show $\mathbb{P}(Z_\infty G) = \mathbb{P}(X_\infty G)$ for each $G$ in $\mathcal{G}_\infty$. For such a $G$,
$$\mathbb{P}(Z_\infty G) = \sup_{n \in \mathbb{N}_0}\mathbb{P}(Z_nG) \qquad\text{Monotone Convergence}$$
$$= \sup_n \mathbb{P}(X_nG) \qquad\text{definition of } Z_n := \mathbb{P}_\infty X_n$$
$$= \sup_n \sup_{m \in \mathbb{N}}\mathbb{P}\left((X_n \wedge m)G\right) \qquad\text{Monotone Convergence, for fixed } n$$
$$= \sup_m \sup_n \mathbb{P}\left((X_n \wedge m)G\right).$$
The sequence $\{X_n \wedge m : n \in \mathbb{N}_0\}$ is a uniformly bounded, reversed positive supermartingale, for each fixed $m$. Thus $\mathbb{P}((X_n \wedge m)G)$ increases with $n$ and, by Dominated Convergence, its limit equals $\mathbb{P}((X_\infty \wedge m)G)$. Thus
$$\mathbb{P}(Z_\infty G) = \sup_m \mathbb{P}\left((X_\infty \wedge m)G\right) = \mathbb{P}(X_\infty G),$$
the final equality by Monotone Convergence. Assertion (ii) follows.

Monotone Convergence and (ii) imply that $\mathbb{P}X_\infty = \sup_n \mathbb{P}X_n$. Finiteness of the supremum is equivalent to integrability of $X_\infty$. The sufficiency in assertion (iii) follows by the usual Dominated Convergence trick (also known as Scheffé's lemma):
$$\mathbb{P}|X_\infty - X_n| = 2\mathbb{P}(X_\infty - X_n)^+ - (\mathbb{P}X_\infty - \mathbb{P}X_n) \to 0.$$
For the necessity, just note that an $L^1$ limit of a sequence of integrable random variables is integrable. □
The analog of the Krickeberg decomposition extends the result to reversed submartingales $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$. The sequence $M_n := \mathbb{P}(X_0^+ \mid \mathcal{G}_n)$ is a reversed positive martingale for which $M_n \ge \mathbb{P}(X_0 \mid \mathcal{G}_n) \ge X_n$ almost surely. Thus $X_n = M_n - (M_n - X_n)$ decomposes $X_n$ into a difference of a reversed positive martingale and a reversed positive supermartingale.

<42>
Corollary. Every reversed submartingale $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ converges almost surely. The limit is integrable, and the sequence also converges in the $L^1$ sense, if $\inf_n \mathbb{P}X_n > -\infty$.

Proof. Apply Theorem <41> to both reversed positive supermartingales $\{M_n - X_n\}$ and $\{M_n\}$, noting that $\sup_n \mathbb{P}M_n = \mathbb{P}X_0^+$ and $\sup_n \mathbb{P}(M_n - X_n) = \mathbb{P}X_0^+ - \inf_n \mathbb{P}X_n$. □

<43>
Corollary. Every reversed martingale $\{(X_n, \mathcal{G}_n) : n \in \mathbb{N}_0\}$ converges almost surely, and in $L^1$, to the limit $X_\infty := \mathbb{P}(X_0 \mid \mathcal{G}_\infty)$, where $\mathcal{G}_\infty := \cap_{n \in \mathbb{N}_0}\mathcal{G}_n$.

Proof. The identification of the limit as the conditional expectation follows from the facts that $\mathbb{P}(X_0G) = \mathbb{P}(X_nG)$, for each $n$, and $|\mathbb{P}(X_nG) - \mathbb{P}(X_\infty G)| \le \mathbb{P}|X_n - X_\infty| \to 0$, for each $G$ in $\mathcal{G}_\infty$. □
Reversed martingales arise naturally from symmetry arguments.
<44>
Example. Let $\{\xi_i : i \in \mathbb{N}\}$ be a sequence of independent random elements taking values in a set $\mathcal{X}$ equipped with a sigma-field $\mathcal{A}$. Suppose each $\xi_i$ induces the same distribution $P$ on $\mathcal{A}$, that is, $\mathbb{P}f(\xi_i) = Pf$ for each $f$ in $\mathcal{M}^+(\mathcal{X}, \mathcal{A})$. For each $n$ define the empirical measure $P_{n,\omega}$ (or just $P_n$, if there is no need to emphasize the dependence on $\omega$) on $\mathcal{A}$ as the probability measure that places mass $n^{-1}$ at each of the points $\xi_1(\omega), \ldots, \xi_n(\omega)$. That is, $P_{n,\omega}f := n^{-1}\sum_{i \le n}f(\xi_i(\omega))$.

Intuitively speaking, knowledge of $P_n$ tells us everything about the values $\xi_1(\omega), \ldots, \xi_n(\omega)$ except for the order in which they were produced. Conditionally on $P_n$, we know that $\xi_1$ should be one of the points supporting $P_n$, but we don't know which one. The conditional distribution of $\xi_1$ given $P_n$ should put probability $n^{-1}$ at each support point, and it seems we should then have
<45>
$$\mathbb{P}(f(\xi_1) \mid P_n) = P_nf.$$
REMARK. Here I am arguing, heuristically, assuming $P_n$ concentrates on $n$ distinct points. A similar heuristic could be developed when there are ties, but there is no point in trying to be too precise at the moment. The problem of ties would disappear from the formal argument.

Similarly, if we knew all $P_i$ for $i \ge n$ then we should be able to locate $\xi_i(\omega)$ exactly for $i \ge n+1$, but the values of $\xi_1(\omega), \ldots, \xi_n(\omega)$ would still be known only up to random relabelling of the support points of $P_n$. The new information would tell us no more about $\xi_1$ than we already knew from $P_n$. In other words, we should have
<46>
$$\mathbb{P}(f(\xi_1) \mid \mathcal{G}_n) = \ldots = \mathbb{P}(f(\xi_n) \mid \mathcal{G}_n) = P_nf \qquad\text{where } \mathcal{G}_n := \sigma(P_n, P_{n+1}, \ldots),$$
which would then give
$$\mathbb{P}(P_{n-1}f \mid \mathcal{G}_n) = \frac{1}{n-1}\sum_{i=1}^{n-1}\mathbb{P}(f(\xi_i) \mid \mathcal{G}_n) = P_nf.$$
That is, $\{(P_nf, \mathcal{G}_n) : n \in \mathbb{N}\}$ would be a reversed martingale, for each fixed $f$.

It is possible to define $\mathcal{G}_n$ rigorously, then formalize the preceding heuristic argument to establish the reversed martingale property. I will omit the details, because it is simpler to replace $\mathcal{G}_n$ by the closely related $n$-symmetric sigma-field $\mathcal{S}_n$, to be defined in the next Section, then invoke the more general symmetry arguments (Example <50>) from that Section to show that $\{(P_nf, \mathcal{S}_n) : n \in \mathbb{N}\}$ is a reversed martingale for each $P$-integrable $f$.

Corollary <43> ensures that $P_nf \to \mathbb{P}(f(\xi_1) \mid \mathcal{S}_\infty)$ almost surely. As you will see in the next Section (Theorem <51>, to be precise), the sigma-field $\mathcal{S}_\infty$ is trivial (it contains only events with probability zero or one) and $\mathbb{P}(X \mid \mathcal{S}_\infty) = \mathbb{P}X$, almost surely, for each integrable random variable $X$. In particular, for each $P$-integrable function $f$ we have
$$n^{-1}\sum_{i \le n}f(\xi_i) = P_nf \to Pf \qquad\text{almost surely.}$$
The special case $\mathcal{X} := \mathbb{R}$ and $P|\xi_1| < \infty$ and $f(x) = x$ recovers the Strong Law of Large Numbers (SLLN) for independent, identically distributed summands.
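REMARK. The empirical-measure form of the SLLN is the easiest of all to simulate. A minimal Python sketch, assuming $P$ uniform on $(0,1)$ and $f(x) = x^2$ (so $Pf = 1/3$); both choices are arbitrary illustrations:

    import random

    f = lambda x: x * x
    s = 0.0
    for n in range(1, 100_001):
        s += f(random.random())
        if n % 20_000 == 0:
            print(n, s / n)        # P_n f, drifting toward Pf = 1/3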
In statistical problems it is sometimes necessary to prove a uniform analog of the SLLN (a USLLN):
$$\Delta_n := \sup_\theta|P_nf_\theta - Pf_\theta| \to 0 \qquad\text{almost surely,}$$
where $\{f_\theta : \theta \in \Theta\}$ is a class of $P$-integrable functions on $\mathcal{X}$. Corollary <42> can greatly simplify the task of establishing such a USLLN.

To avoid measurability difficulties, let me consider only the case where $\Theta$ is countable. Write $X_{n,\theta}$ for $P_nf_\theta - Pf_\theta$. Also, assume that the envelope $F := \sup_\theta|f_\theta|$ is $P$-integrable, so that $\mathbb{P}\Delta_n \le \mathbb{P}(P_nF + PF) = 2PF < \infty$.

For each fixed $\theta$, we know that $\{(X_{n,\theta}, \mathcal{S}_n) : n \in \mathbb{N}\}$ is a reversed martingale, and hence
$$\mathbb{P}(\Delta_n \mid \mathcal{S}_{n+1}) \ge \sup_\theta\left|\mathbb{P}(X_{n,\theta} \mid \mathcal{S}_{n+1})\right| = \sup_\theta|X_{n+1,\theta}| = \Delta_{n+1}.$$
That is, $\{(\Delta_n, \mathcal{S}_n) : n \in \mathbb{N}\}$ is a reversed submartingale. From Corollary <42>, $\Delta_n$ converges almost surely to a $\mathcal{S}_\infty$-measurable random variable $\Delta_\infty$, which by the triviality of $\mathcal{S}_\infty$ (Theorem <51> again) is (almost surely) constant. To prove the USLLN, we have only to show that the constant is zero. For example, it would suffice to show $\mathbb{P}\Delta_n \to 0$, a great simplification of the original task. See Pollard (1984, Section II.5) for details. □

*8. Symmetry and exchangeability

The results in this Section involve probability measures on infinite product spaces. You might want to consult Section 4.8 for notation and the construction of product measures.

The symmetry arguments from Example <44> did not really require an assumption of independence. The reversed martingale methods can be applied to more general situations where probability distributions have symmetry properties. Rather than work with random elements of a space $(\mathcal{X}, \mathcal{A})$, it is simpler to deal with their joint distribution as a probability measure on the product sigma-field $\mathcal{A}^{\mathbb{N}}$ of the product space $\mathcal{X}^{\mathbb{N}}$, the space of all sequences, $\mathbf{x} := (x_1, x_2, \ldots)$, on $\mathcal{X}$. We can think of the coordinate maps $\xi_i(\mathbf{x}) := x_i$ as a sequence of random elements of $(\mathcal{X}, \mathcal{A})$, when it helps.

<47>
Definition. Call a one-to-one map $\pi$ from $\mathbb{N}$ onto itself an $n$-permutation if $\pi(i) = i$ for $i > n$. Write $\mathcal{R}(n)$ for the set of all $n!$ distinct $n$-permutations. Call $\cup_{n \in \mathbb{N}}\mathcal{R}(n)$ the set of all finite permutations of $\mathbb{N}$. Write $S_\pi$ for the map, from $\mathcal{X}^{\mathbb{N}}$ back onto itself, defined by
$$S_\pi\mathbf{x} := (x_{\pi(1)}, x_{\pi(2)}, \ldots) \qquad\text{if } \pi \in \mathcal{R}(n).$$
Say that a function $h$ on $\mathcal{X}^{\mathbb{N}}$ is $n$-symmetric if it is unaffected by all $n$-permutations, that is, if $h_\pi(\mathbf{x}) := h(S_\pi\mathbf{x}) = h(\mathbf{x})$ for every $n$-permutation $\pi$.
<48>
Example. Let $f$ be a real valued function on $\mathcal{X}$. Then the function $\sum_{i=1}^m f(x_i)$ is $n$-symmetric for every $m \ge n$, and the function $\limsup_{m \to \infty}\sum_{i=1}^m f(x_i)/m$ is $n$-symmetric for every $n$.

Let $g$ be a real valued function on $\mathcal{X} \otimes \mathcal{X}$. Then $g(x_1, x_2) + g(x_2, x_1)$ is 2-symmetric. More generally, $\sum_{1 \le i \ne j \le m}g(x_i, x_j)$ is $n$-symmetric for every $m \ge n$.

For every real valued function $f$ on $\mathcal{X}^{\mathbb{N}}$, the function
$$F(\mathbf{x}) := \sum_{\pi \in \mathcal{R}(n)}f_\pi(\mathbf{x}) = \sum_{\pi \in \mathcal{R}(n)}f(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}, x_{n+1}, \ldots)$$
is $n$-symmetric. □
<49>
Definition. A probability measure $\mathbb{P}$ on $\mathcal{A}^{\mathbb{N}}$ is said to be exchangeable if it is invariant under $S_\pi$ for every finite permutation $\pi$, that is, if $\mathbb{P}h = \mathbb{P}h_\pi$ for every $h$ in $\mathcal{M}^+(\mathcal{X}^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}})$ and every finite permutation $\pi$. Equivalently, under $\mathbb{P}$ the random vector $(\xi_{\pi(1)}, \xi_{\pi(2)}, \ldots, \xi_{\pi(n)})$ has the same distribution as $(\xi_1, \xi_2, \ldots, \xi_n)$, for every $n$-permutation $\pi$, and every $n$.

The collection of all sets in $\mathcal{A}^{\mathbb{N}}$ whose indicator functions are $n$-symmetric forms a sub-sigma-field $\mathcal{S}_n$ of $\mathcal{A}^{\mathbb{N}}$, the $n$-symmetric sigma-field. The $\mathcal{S}_n$-measurable functions are those $\mathcal{A}^{\mathbb{N}}$-measurable functions that are $n$-symmetric. The sigma-fields $\{\mathcal{S}_n : n \in \mathbb{N}\}$ form a decreasing filtration on $\mathcal{X}^{\mathbb{N}}$, with $\mathcal{S}_1 = \mathcal{A}^{\mathbb{N}}$.
<50>
Example. Suppose $\mathbb{P}$ is exchangeable. Let $f$ be a fixed $\mathbb{P}$-integrable function on $\mathcal{X}^{\mathbb{N}}$. Then a symmetry argument will show that
$$\mathbb{P}(f \mid \mathcal{S}_n) = \frac{1}{n!}\sum_{\pi \in \mathcal{R}(n)}f_\pi.$$
The function (call it $F(\mathbf{x})$) on the right-hand side is $n$-symmetric, and hence $\mathcal{S}_n$-measurable. Also, for each bounded, $\mathcal{S}_n$-measurable function $H$,
$$\mathbb{P}\left(f(\mathbf{x})H(\mathbf{x})\right) = \mathbb{P}(f_\pi H_\pi) \qquad\text{for all } \pi, \text{ by exchangeability}$$
$$= \mathbb{P}(f_\pi H) \qquad\text{for all } \pi \text{ in } \mathcal{R}(n), \text{ because } H_\pi = H.$$
Average over all $\pi$ in $\mathcal{R}(n)$ to deduce that $\mathbb{P}(fH) = \mathbb{P}(FH)$, the defining property of the conditional expectation.

As a special case, if $f$ depends only on the first coordinate then we have
$$\mathbb{P}(f(x_1) \mid \mathcal{S}_n) = \frac{1}{n!}\sum_{\pi \in \mathcal{R}(n)}f(x_{\pi(1)}) = P_nf,$$
where $P_n$ denotes the empirical measure, as in Example <44>: each index $i \le n$ equals $\pi(1)$ for exactly $(n-1)!$ of the $n$-permutations. □
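REMARK. The combinatorial fact behind the last display, that each index appears as $\pi(1)$ equally often, can be checked by brute force. A Python sketch (the values and the function $f$ are arbitrary illustrations):

    from itertools import permutations
    from math import factorial

    x = ['a', 'b', 'c', 'd']                 # observed values x_1, ..., x_n (n = 4)
    f = lambda v: ord(v) * 1.5               # an arbitrary real-valued f on X
    n = len(x)

    perm_avg = sum(f(x[pi[0]]) for pi in permutations(range(n))) / factorial(n)
    emp_avg = sum(f(xi) for xi in x) / n     # the empirical average P_n f
    print(perm_avg, emp_avg)                 # identical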


When the coordinate maps are independent under an exchangeable $\mathbb{P}$, the symmetric sigma-field $\mathcal{S}_\infty$ becomes trivial, and conditional expectations (such as $\mathbb{P}(f(x_1) \mid \mathcal{S}_\infty)$) reduce to constants.

<51>
Theorem. (Hewitt-Savage zero-one law) If $\mathbb{P} = P^{\mathbb{N}}$, the symmetric sigma-field $\mathcal{S}_\infty$ is trivial: for each $F$ in $\mathcal{S}_\infty$, either $\mathbb{P}F = 0$ or $\mathbb{P}F = 1$.

Proof. Write $h(\mathbf{x})$ for the indicator function of $F$, a set in $\mathcal{S}_\infty$. By definition, $h_\pi = h$ for every finite permutation. Equip $\mathcal{X}^{\mathbb{N}}$ with the filtration $\mathcal{F}_n := \sigma\{x_i : i \le n\}$. Notice that $\mathcal{F}_\infty := \sigma\left(\cup_{n \in \mathbb{N}}\mathcal{F}_n\right) = \mathcal{A}^{\mathbb{N}} = \mathcal{S}_1$.

The martingale $Y_n := \mathbb{P}(F \mid \mathcal{F}_n)$ converges almost surely to $\mathbb{P}(F \mid \mathcal{F}_\infty) = h$, and also, by Dominated Convergence, $\mathbb{P}|h - Y_n|^2 \to 0$.

The $\mathcal{F}_n$-measurable random variable $Y_n$ may be written as $h_n(x_1, \ldots, x_n)$, for some $\mathcal{A}^n$-measurable $h_n$ on $\mathcal{X}^n$. The random variable $Z_n := h_n(x_{n+1}, \ldots, x_{2n})$ is independent of $Y_n$, and it too converges in $L^2$ to $h$: if $\pi$ denotes the $2n$-permutation that interchanges $i$ and $i+n$, for $1 \le i \le n$, then, by exchangeability,
$$\mathbb{P}|h(\mathbf{x}) - Z_n|^2 = \mathbb{P}\left|h_\pi(\mathbf{x}) - h_n(x_{\pi(1)}, \ldots, x_{\pi(n)})\right|^2 = \mathbb{P}|h(\mathbf{x}) - Y_n|^2 \to 0.$$
The random variables $Z_n$ and $Y_n$ are independent, and they both converge in $L^2(\mathbb{P})$-norm to $h$. Thus
$$0 = \lim \mathbb{P}|Y_n - Z_n|^2 = \lim\left(\mathbb{P}Y_n^2 - 2(\mathbb{P}Y_n)(\mathbb{P}Z_n) + \mathbb{P}Z_n^2\right) = \mathbb{P}F - 2(\mathbb{P}F)^2 + \mathbb{P}F.$$
It follows that either $\mathbb{P}F = 0$ or $\mathbb{P}F = 1$. □


In a sense made precise by Problem [17], the product measures $P^{\mathbb{N}}$ are the extreme examples of exchangeable probability measures: they are the extreme points in the convex set of all exchangeable probability measures on $\mathcal{A}^{\mathbb{N}}$. A celebrated result of de Finetti (1937) asserts that all the exchangeable probabilities can be built up from mixtures of product measures, in various senses. The simplest general version of the de Finetti result is expressed as an assertion of conditional independence.
<52>
Theorem. Under an exchangeable probability distribution $\mathbb{P}$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}})$, the coordinate maps are conditionally independent given the symmetric sigma-field $\mathcal{S}_\infty$. That is, for all sets $A_i$ in $\mathcal{A}$,
$$\mathbb{P}(x_1 \in A_1, \ldots, x_m \in A_m \mid \mathcal{S}_\infty) = \mathbb{P}(x_1 \in A_1 \mid \mathcal{S}_\infty) \times \ldots \times \mathbb{P}(x_m \in A_m \mid \mathcal{S}_\infty)$$
almost surely, for every $m$.

Proof. Consider only the typical case where $m = 3$. The proof of the general case is similar. Write $f_i$ for the indicator function of $A_i$. Abbreviate $\mathbb{P}(\cdot \mid \mathcal{S}_n)$ to $\mathbb{P}_n$, for $n \in \overline{\mathbb{N}}$. From Example <50>, for $n \ge 3$,
$$n^3\left(\mathbb{P}_nf_1(x_1)\right)\left(\mathbb{P}_nf_2(x_2)\right)\left(\mathbb{P}_nf_3(x_3)\right) = \sum\{1 \le i, j, k \le n\}\,f_1(x_i)f_2(x_j)f_3(x_k).$$
On the right-hand side, there are $n(n-1)(n-2)$ triples of distinct subscripts $(i, j, k)$, leaving $O(n^2)$ of them with at least one duplicated subscript. The latter contribute a sum bounded in absolute value by a multiple of $n^2$; the former appear in the sum that Example <50> identifies as $n(n-1)(n-2)\,\mathbb{P}_n\left(f_1(x_1)f_2(x_2)f_3(x_3)\right)$. Thus
$$\left(\mathbb{P}_nf_1(x_1)\right)\left(\mathbb{P}_nf_2(x_2)\right)\left(\mathbb{P}_nf_3(x_3)\right) = \frac{n(n-1)(n-2)}{n^3}\,\mathbb{P}_n\left(f_1(x_1)f_2(x_2)f_3(x_3)\right) + O(n^{-1}).$$
By the convergence of reversed martingales, in the limit we get
$$\left(\mathbb{P}_\infty f_1(x_1)\right)\left(\mathbb{P}_\infty f_2(x_2)\right)\left(\mathbb{P}_\infty f_3(x_3)\right) = \mathbb{P}_\infty\left(f_1(x_1)f_2(x_2)f_3(x_3)\right),$$
the desired factorization. □


When conditional distributions exist, it is easy to extract from Theorem <52> the representation of $\mathbb{P}$ as a mixture of product measures.

<53>
Theorem. Let $\mathcal{A}$ be the Borel sigma-field of a separable metric space $\mathcal{X}$. Let $\mathbb{P}$ be an exchangeable probability measure on $\mathcal{A}^{\mathbb{N}}$, under which the distribution $P$ of $x_1$ is tight. Then there exists an $\mathcal{S}_\infty$-measurable map $T$ into $\mathcal{T} := [0,1]^{\mathbb{N}}$, with distribution $Q$, for which conditional distributions $\{\mathbb{P}_t : t \in \mathcal{T}\}$ exist, and $\mathbb{P}_t = P_t^{\mathbb{N}}$, a product measure, for $Q$ almost all $t$.

Proof. Let $\mathcal{E} := \{E_i : i \in \mathbb{N}\}$ be a countable generating class for the sigma-field $\mathcal{A}$, stable under finite intersections and containing $\mathcal{X}$. For each $i$ let $T_i(\mathbf{x})$ be a version of $\mathbb{P}(x_1 \in E_i \mid \mathcal{S}_\infty)$. By symmetry, $T_i(\mathbf{x})$ is also a version of $\mathbb{P}(x_j \in E_i \mid \mathcal{S}_\infty)$, for every $j$. Define $T$ as the map from $\mathcal{X}^{\mathbb{N}}$ into $\mathcal{T} := [0,1]^{\mathbb{N}}$ for which $T(\mathbf{x})$ has $i$th coordinate $T_i(\mathbf{x})$.

The joint distribution of $x_1$ and $T$ is a probability measure $\Gamma$ on the product sigma-field of $\mathcal{X} \times \mathcal{T}$, with marginals $P$ and $Q$. As shown in Section 1 of Appendix F, the assumptions on $\mathbb{P}$ ensure existence of a probability kernel $\mathcal{P} := \{P_t : t \in \mathcal{T}\}$ for which
$$\mathbb{P}g(x_1, T) = \Gamma^{x,t}g(x, t) = Q^tP_t^xg(x, t) \qquad\text{for all } g \text{ in } \mathcal{M}^+(\mathcal{X} \times \mathcal{T}).$$
In particular, by definition of $T_i$ and the $\mathcal{S}_\infty$-measurability of $T$,
$$Q^t\left(t_ih(t)\right) = \mathbb{P}\left(T_ih(T)\right) = \mathbb{P}\left(\{x_1 \in E_i\}h(T)\right) = Q^t\left(h(t)P_tE_i\right)$$
for all $h$ in $\mathcal{M}^+(\mathcal{T})$, which implies that $P_tE_i = t_i$ a.e. $[Q]$, for each $i$.

For every finite subcollection $\{E_{i_1}, \ldots, E_{i_n}\}$ of $\mathcal{E}$, Theorem <52> asserts
$$\mathbb{P}\{x_1 \in E_{i_1}, \ldots, x_n \in E_{i_n} \mid \mathcal{S}_\infty\} = \prod_{j=1}^n\mathbb{P}\{x_j \in E_{i_j} \mid \mathcal{S}_\infty\} = \prod_{j=1}^nT_{i_j},$$
which integrates to
$$\mathbb{P}\{x_1 \in E_{i_1}, \ldots, x_n \in E_{i_n}\} = Q^t\prod_{j=1}^nt_{i_j} = Q^t\prod_{j=1}^nP_tE_{i_j}.$$
A routine generating class argument completes the proof. □
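REMARK. The Pólya urn gives a concrete exchangeable sequence in which the de Finetti mixture is visible. Starting from one red and one black ball, draw a ball, replace it, and add another of the same color; the color sequence is exchangeable, and conditionally on the limiting frequency (which is uniformly distributed) the draws behave like i.i.d. coin flips. A Python sketch (an illustration, not from the text):

    import random

    def polya_colors(n=2000):
        r = b = 1
        out = []
        for _ in range(n):
            red = random.random() < r / (r + b)
            out.append(red)
            r, b = (r + 1, b) if red else (r, b + 1)
        return out

    for _ in range(5):
        cs = polya_colors()
        print(sum(cs) / len(cs))   # the limiting frequency varies from run to run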

9. Problems

[1] Follow these steps to construct an example of a martingale $\{Z_i\}$ and a stopping time $\tau$ for which $\mathbb{P}Z_0 \ne \mathbb{P}Z_\tau\{\tau < \infty\}$.
(i) Let $\xi_1, \xi_2, \ldots$ be independent, identically distributed random variables with $\mathbb{P}\{\xi_i = +1\} = 1/3$ and $\mathbb{P}\{\xi_i = -1\} = 2/3$. Define $X_0 = 0$ and $X_i := \xi_1 + \ldots + \xi_i$ and $Z_i := 2^{X_i}$, for $i \ge 1$. Show that $\{Z_i\}$ is a martingale with respect to an appropriate filtration.
(ii) Define $\tau := \inf\{i : X_i = -1\}$. Show that $\tau$ is a stopping time, finite almost everywhere. Hint: Use the SLLN.
(iii) Show that $\mathbb{P}Z_0 > \mathbb{P}Z_\tau$. (Should you worry about what happens on the set $\{\tau = \infty\}$?)

[2] Let $\tau$ be a stopping time for the natural filtration generated by a sequence of random variables $\{Z_n : n \in \mathbb{N}\}$. Show that $\mathcal{F}_\tau = \sigma\{Z_{\tau \wedge n} : n \in \mathbb{N}\}$.

[3] Let $\{(Z_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ be a (sub)martingale and $\tau$ be a stopping time. Show that $\{(Z_{\tau \wedge n}, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ is also a (sub)martingale. Hint: For $F$ in $\mathcal{F}_{n-1}$, consider separately the contributions to $\mathbb{P}Z_{n \wedge \tau}F$ and $\mathbb{P}Z_{(n-1) \wedge \tau}F$ from the regions $\{\tau \le n-1\}$ and $\{\tau \ge n\}$.

[4] Let $\tau$ be a stopping time for a filtration $\{\mathcal{F}_i : i \in \mathbb{N}_0\}$. For an integrable random variable $X$, define $X_i := \mathbb{P}(X \mid \mathcal{F}_i)$. Show that
$$\mathbb{P}(X \mid \mathcal{F}_\tau) = \sum_{i \in \mathbb{N}_0}\{\tau = i\}X_i = X_\tau \qquad\text{almost surely.}$$
Hint: Start with $X \ge 0$, so that there are no convergence problems.


[5] Let $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ be a positive supermartingale, and let $\sigma$ and $\tau$ be stopping times (not necessarily bounded) for which $\sigma \le \tau$ on a set $F$ in $\mathcal{F}_\sigma$. Show that $\mathbb{P}X_\sigma\{\sigma < \infty\}F \ge \mathbb{P}X_\tau\{\tau < \infty\}F$. Hint: For each positive integer $N$, show that $F_N := F\{\sigma \le N\} \in \mathcal{F}_{\sigma \wedge N}$. Use the Stopping Time Lemma to prove that $\mathbb{P}X_{\sigma \wedge N}F_N \ge \mathbb{P}X_{\tau \wedge N}F_N \ge \mathbb{P}X_\tau\{\tau \le N\}F$, then invoke Monotone Convergence.


[6] For each positive supermartingale $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$, and stopping times $\sigma \le \tau$, show that $\mathbb{P}(X_\tau\{\tau < \infty\} \mid \mathcal{F}_\sigma) \le X_\sigma\{\sigma < \infty\}$ almost surely.

[7] (Kolmogorov 1928) Let $\xi_1, \ldots, \xi_n$ be independent random variables with $\mathbb{P}\xi_i = 0$ and $|\xi_i| \le 1$ for each $i$. Define $X_i := \xi_1 + \ldots + \xi_i$ and $V_i := \mathbb{P}X_i^2$. For each $\epsilon > 0$ show that $\mathbb{P}\{\max_{i \le n}|X_i| \le \epsilon\} \le (1+\epsilon)^2/V_n$. Note the direction of the inequalities. Hint: Define a stopping time $\tau$ for which $V_n\mathbb{P}\{\max_{i \le n}|X_i| \le \epsilon\} \le \mathbb{P}V_\tau\{\tau = n\}$. Show that $\mathbb{P}V_\tau = \mathbb{P}X_\tau^2 \le (1+\epsilon)^2$.

[8] (Birnbaum & Marshall 1961) Let $0 = X_0, X_1, \ldots$ be nonnegative integrable random variables that are adapted to a filtration $\{\mathcal{F}_i\}$. Suppose there exist constants $\theta_i$, with $0 \le \theta_i \le 1$, for which
$$(*) \qquad \mathbb{P}(X_i \mid \mathcal{F}_{i-1}) \ge \theta_iX_{i-1} \qquad\text{for } i \ge 1.$$
Let $C_1 \ge C_2 \ge \ldots \ge C_{N+1} = 0$ be constants. Prove the inequality
$$(**) \qquad \mathbb{P}\left\{\max_{i \le N}C_iX_i \ge 1\right\} \le \sum_{i=1}^N(C_i - \theta_{i+1}C_{i+1})\mathbb{P}X_i,$$
by following these steps.
(i) Interpret (*) to mean that there exist nonnegative, $\mathcal{F}_{i-1}$-measurable random variables $Y_{i-1}$ for which $\mathbb{P}(X_i \mid \mathcal{F}_{i-1}) = Y_{i-1} + \theta_iX_{i-1}$ almost surely. Put $Z_i := X_i - Y_{i-1} - \theta_iX_{i-1}$. Show that $C_iX_i \le C_{i-1}X_{i-1} + C_iZ_i + C_iY_{i-1}$ almost surely.
(ii) Deduce that $C_iX_i \le M_i + A_i$, where $M_i$ is a martingale with $M_0 = 0$ and $A_i := C_1Y_0 + \ldots + C_iY_{i-1}$.
(iii) Show that the left-hand side of inequality (**) is less than $\mathbb{P}C_\tau X_\tau$ for an appropriate stopping time $\tau$, then rearrange the sum for $\mathbb{P}A_N$ to get the asserted upper bound.
[9] (Doob 1953, page 317) Suppose $S_1, \ldots, S_n$ is a nonnegative submartingale, with $\mathbb{P}S_n^p < \infty$ for some fixed $p > 1$. Let $q > 1$ be defined by $p^{-1} + q^{-1} = 1$. Show that $\mathbb{P}\left(\max_{i \le n}S_i^p\right) \le q^p\mathbb{P}S_n^p$, by following these steps.
(i) Write $M_n$ for $\max_{i \le n}S_i$. For fixed $x > 0$, and an appropriate stopping time $\tau$, apply the Stopping Time Lemma to show that
$$x\mathbb{P}\{M_n \ge x\} \le \mathbb{P}S_\tau\{S_\tau \ge x\} \le \mathbb{P}S_n\{M_n \ge x\}.$$
(ii) Show that $\mathbb{P}X^p = \int_0^\infty px^{p-1}\mathbb{P}\{X \ge x\}\,dx$ for each nonnegative random variable $X$.
(iii) Show that $\mathbb{P}M_n^p \le q\mathbb{P}S_nM_n^{p-1}$.
(iv) Bound the last product using Hölder's inequality, then rearrange to get the stated inequality. (Any problems with infinite values?)
[10] Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space such that $\mathcal{F}$ is countably generated: that is, $\mathcal{F} = \sigma\{B_1, B_2, \ldots\}$ for some sequence of sets $\{B_i\}$. Let $\mu$ be a finite measure on $\mathcal{F}$, dominated by $\mathbb{P}$. Let $\mathcal{F}_n := \sigma\{B_1, \ldots, B_n\}$.
(i) Show that there is a partition $\pi_n$ of $\Omega$ into at most $2^n$ disjoint sets from $\mathcal{F}_n$ such that each $F$ in $\mathcal{F}_n$ is a union of sets from $\pi_n$.
(ii) Define $\mathcal{F}_n$-measurable random variables $X_n$ by: for $\omega \in A \in \pi_n$,
$$X_n(\omega) := \begin{cases}\mu A/\mathbb{P}A & \text{if } \mathbb{P}A > 0,\\ 0 & \text{otherwise.}\end{cases}$$
Show that $\mathbb{P}X_nF = \mu F$ for all $F$ in $\mathcal{F}_n$.
(iii) Show that $\{(X_n, \mathcal{F}_n)\}$ is a positive martingale.
(iv) Show that $\{X_n\}$ is uniformly integrable. Hint: What do you know about $\mu\{X_n > M\}$?
(v) Let $X_\infty$ denote the almost sure limit of the $\{X_n\}$. Show that $\mathbb{P}X_\infty F = \mu F$ for all $F$ in $\mathcal{F}$. That is, show that $X_\infty$ is a density for $\mu$ with respect to $\mathbb{P}$.
[11] Let $\{(X_n, \mathcal{F}_n) : n \in \mathbb{N}_0\}$ be a submartingale. For fixed constants $\alpha < \beta$ (not necessarily nonnegative), define stopping times $\sigma_1 \le \tau_1 \le \sigma_2 \le \ldots$, as in Section 3. Establish the upcrossing inequality,
$$k(\beta - \alpha)\mathbb{P}\{\tau_k \le N\} \le \mathbb{P}(X_N - \alpha)^+,$$
for each positive integer $N$, by following these steps.
(i) Show that $Z_n := (X_n - \alpha)^+$ is a positive submartingale, with $Z_{\sigma_i} = 0$ if $\sigma_i < \infty$ and $Z_{\tau_i} \ge \beta - \alpha$ if $\tau_i < \infty$.
(ii) For each $i$, show that $Z_{\tau_i \wedge N} - Z_{\sigma_i \wedge N} \ge (\beta - \alpha)\{\tau_i \le N\}$. Hint: Consider separately the three cases $\sigma_i > N$, $\sigma_i \le N < \tau_i$, and $\tau_i \le N$.
(iii) Show that $-\mathbb{P}Z_{\sigma_1 \wedge N} + \mathbb{P}Z_{\tau_k \wedge N} \ge k(\beta - \alpha)\mathbb{P}\{\tau_k \le N\}$. Hint: Take expectations then sum over $i$ in the inequality from part (ii). Use the Stopping Time Lemma for submartingales to prove $\mathbb{P}Z_{\tau_i \wedge N} - \mathbb{P}Z_{\sigma_{i+1} \wedge N} \le 0$.
(iv) Show that $\mathbb{P}Z_{\tau_k \wedge N} \le \mathbb{P}Z_N = \mathbb{P}(X_N - \alpha)^+$.
[12] Reprove Corollary <27> (a submartingale $\{X_n : n \in \mathbb{N}_0\}$ converges almost surely to an integrable limit if $\sup_n \mathbb{P}X_n^+ < \infty$) by following these steps.
(i) For fixed $\alpha < \beta$, use the upcrossing inequality from Problem [11] to prove that $\mathbb{P}\{\liminf_n X_n < \alpha < \beta < \limsup_n X_n\} = 0$.
(ii) Deduce that $\{X_n\}$ converges almost surely to a limit random variable $X$ that might take the values $\pm\infty$.
(iii) Prove that $\mathbb{P}|X_n| \le 2\mathbb{P}X_n^+ - \mathbb{P}X_1$ for every $n$. Deduce via Fatou's lemma that $\mathbb{P}|X| < \infty$.

[13] Suppose the offspring distribution in Example <25> has finite mean $\mu > 1$ and variance $\sigma^2$.
(i) Show that $\operatorname{var}(Z_n) = \sigma^2\mu^{n-1} + \mu^2\operatorname{var}(Z_{n-1})$.
(ii) Write $X_n$ for the martingale $Z_n/\mu^n$. Show that $\sup_n \operatorname{var}(X_n) < \infty$.
(iii) Deduce that $X_n$ converges both almost surely and in $L^1$ to the limit $X$, and hence $\mathbb{P}X = 1$. In particular, the limit $X$ cannot be degenerate at 0.

[14] Suppose the offspring distribution $P$ from Example <25> has finite mean $\mu > 1$. Write $X_n$ for the martingale $Z_n/\mu^n$, which converges almost surely to an integrable limit random variable $X$. Show that the limit $X$ is nondegenerate if and only if the condition
$$(\text{XLOGX}) \qquad P^x\left(x\log(1+x)\right) < \infty$$
holds. Follow these steps. Write $\mu_n$ for $P^x\left(x\{x \le \mu^n\}\right)$ and $\mathbb{P}_n(\cdot)$ for expectations conditional on $\mathcal{F}_n$.
(i) Show that $\sum_n(\mu - \mu_n) = P^x\left(x\sum_n\{x > \mu^n\}\right)$, which converges to a finite limit if and only if (XLOGX) holds.
(ii) Define $\widetilde{X}_n := \mu^{-n}\sum_i\xi_{ni}\{\xi_{ni} \le \mu^n\}\{i \le Z_{n-1}\}$. Show that $\mathbb{P}_{n-1}\widetilde{X}_n = \mu_nX_{n-1}/\mu$ almost surely. Show also that
$$X - X_N = \sum_{n \ge N+1}(X_n - X_{n-1}) \ge \sum_{n \ge N+1}(\widetilde{X}_n - X_{n-1}) \qquad\text{almost surely.}$$
(iii) Show that, for some constant $C_1$,
$$\sum_n\mathbb{P}\{\widetilde{X}_n \ne X_n\} \le \sum_n\mu^{n-1}P^x\{x > \mu^n\} \le C_1\mu < \infty.$$
Deduce that $\sum_n(X_n - \widetilde{X}_n)$ converges almost surely to a finite limit.
(iv) Write $\operatorname{var}_{n-1}$ for the conditional variance corresponding to $\mathbb{P}_{n-1}$. Deduce, via (ii), that
$$\sum_n\mathbb{P}\left(\widetilde{X}_n - \mu_nX_{n-1}/\mu\right)^2 \le \sum_n\mu^{-n-1}P^x\left(x^2\{x \le \mu^n\}\right) \le C_2\mu < \infty,$$
for some constant $C_2$. Conclude that $\sum_n\left(\widetilde{X}_n - \mu_nX_{n-1}/\mu\right)$ is a martingale, which converges both almost surely and in $L^2$.
(v) Deduce from (iii), (iv), and the fact that $\sum_n(X_n - X_{n-1})$ converges almost surely, that $\sum_nX_{n-1}(1 - \mu_n/\mu)$ converges almost surely to a finite limit.
(vi) Suppose $\mathbb{P}\{X > 0\} > 0$. Show that there exists an $\omega$ for which both $\sum_nX_{n-1}(\omega)(1 - \mu_n/\mu) < \infty$ and $\lim X_{n-1}(\omega) > 0$. Deduce via (i) that (XLOGX) holds.
(vii) Suppose (XLOGX) holds. From (i) deduce that $\mathbb{P}\left(\sum_nX_{n-1}(1 - \mu_n/\mu)\right) < \infty$. Deduce via (iv) that $\sum_n(\widetilde{X}_n - X_{n-1})$ converges in $L^1$. Deduce via (ii) that
$$\mathbb{P}X \ge \mathbb{P}\left(X_N + \sum_{n \ge N+1}(\widetilde{X}_n - X_{n-1})\right) \to 1 \qquad\text{as } N \to \infty,$$
from which it follows that $X$ is nondegenerate. (In fact, $\mathbb{P}|X_n - X| \to 0$. Why?)
[15] Let $\{\xi_i : i \in \mathbb{N}\}$ be a martingale difference array for which $\sum_{i \in \mathbb{N}}\mathbb{P}(\xi_i^2/i^2) < \infty$.
(i) Define $X_n := \sum_{i=1}^n\xi_i/i$. Show that $\sup_n\mathbb{P}X_n^2 < \infty$. Deduce that $X_n(\omega)$ converges to a finite limit for almost all $\omega$.
(ii) Invoke Kronecker's lemma to deduce that $n^{-1}\sum_{i=1}^n\xi_i \to 0$ almost surely.
[16] Suppose $\{X_n : n \in \mathbb{N}\}$ is an exchangeable sequence of square-integrable random variables. Show that $\operatorname{cov}(X_1, X_2) \ge 0$. Hint: Each $X_i$ must have the same variance, $V$; each pair $X_i, X_j$, for $i \ne j$, must have the same covariance, $C$. Consider $\operatorname{var}\left(\sum_{i \le n}X_i\right)$ for arbitrarily large $n$.

[17] (Hewitt & Savage 1955, Section 5) Let $\mathbb{P}$ be exchangeable, in the sense of Definition <49>.
(i) Let $f$ be a bounded, $\mathcal{A}^n$-measurable function on $\mathcal{X}^n$. Define $X := f(x_1, \ldots, x_n)$ and $Y := f(x_{n+1}, \ldots, x_{2n})$. Use Problem [16] to show that $\mathbb{P}(XY) \ge (\mathbb{P}X)(\mathbb{P}Y)$, with equality if $\mathbb{P}$ is a product measure.
(ii) Suppose $\mathbb{P} = \alpha_1\mathbb{Q}_1 + \alpha_2\mathbb{Q}_2$, with $\alpha_i > 0$ and $\alpha_1 + \alpha_2 = 1$, where $\mathbb{Q}_1$ and $\mathbb{Q}_2$ are distinct exchangeable probability measures. Let $f$ be a bounded, measurable function on some $\mathcal{X}^n$ for which $\mu_1 := \mathbb{Q}_1f(x_1, \ldots, x_n) \ne \mathbb{Q}_2f(x_1, \ldots, x_n) =: \mu_2$. Define $X$ and $Y$ as in part (i). Show that $\mathbb{P}(XY) > (\mathbb{P}X)(\mathbb{P}Y)$. Hint: Use strict convexity of the square function to show that $\alpha_1\mu_1^2 + \alpha_2\mu_2^2 > (\alpha_1\mu_1 + \alpha_2\mu_2)^2$. Deduce that $\mathbb{P}$ is not a product measure.
(iii) Suppose $\mathbb{P}$ is not a product measure. Explain why there exists an $E \in \mathcal{A}^n$ and a bounded measurable function $g$ for which
$$\mathbb{P}\left(\{z \in E\}g(x_{n+1}, x_{n+2}, \ldots)\right) \ne \left(\mathbb{P}\{z \in E\}\right)\left(\mathbb{P}g(x_{n+1}, x_{n+2}, \ldots)\right),$$
where $z := (x_1, \ldots, x_n)$. Define $\alpha := \mathbb{P}\{z \in E\}$. Show that $0 < \alpha < 1$. For each $h \in \mathcal{M}^+(\mathcal{X}^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}})$, define
$$\mathbb{Q}_1h := \mathbb{P}\left(\{z \in E\}h(x_{n+1}, x_{n+2}, \ldots)\right)/\alpha,$$
$$\mathbb{Q}_2h := \mathbb{P}\left(\{z \in E^c\}h(x_{n+1}, x_{n+2}, \ldots)\right)/(1 - \alpha).$$
Show that $\mathbb{Q}_1$ and $\mathbb{Q}_2$ correspond to distinct exchangeable probability measures for which $\mathbb{P} = \alpha\mathbb{Q}_1 + (1 - \alpha)\mathbb{Q}_2$. That is, $\mathbb{P}$ is not an extreme point of the set of all exchangeable probability measures on $\mathcal{A}^{\mathbb{N}}$.

10. Notes
De Moivre used what would now be seen as a martingale method in his solution of the gambler's ruin problem. (Apparently first published in 1711, according to Thatcher (1957). See pages 51-53 of the 1967 reprint of the third edition of de Moivre (1718).)

The name martingale is due to Ville (1939). Lévy (1937, chapter VIII), expanding on earlier papers (Lévy 1934, 1935a, 1935b), had treated martingale differences, identifying them as sequences satisfying his condition (C). He extended several results for sums of independent variables to martingales, including Kolmogorov's maximal inequality and strong law of large numbers (the version proved in Section 4.6), and even a central limit theorem, extending Lindeberg's method (to be discussed, for independent summands, in Section 7.2). He worked with martingales stopped at random times, in order to have sums of conditional variances close to specified constant values.

Doob (1940) established convergence theorems (without using stopping times) for martingales and reversed martingales, calling them sequences with "property E." He acknowledged (footnote to page 458) that the basic maximal inequalities were "implicit in the work of Ville" and that the method of proof he used "was used by Lévy (1937), in a related discussion." It was Doob, especially with his stochastic processes book (Doob 1953; see, in particular, the historical notes to Chapter VII, starting page 629), who was the major driving force behind the recognition of martingales as one of the most important tools of probability theory. See Lévy's comments in Note II of the 1954 edition of Lévy (1937) and in Lévy (1970, page 118) for the relationship between his work and Doob's.
I first understood some martingale theory by reading the superb text of
Ash (1972, Chapter 7), and from conversations with Jim Pitman. The material
in Section 3 on positive supermartingales was inspired by an old set of notes for
lectures given by Pitman at Cambridge. I believe the lectures were based in part
on the original French edition of the book Neveu (1975). I have also borrowed
heavily from that book, particularly so for Theorems <26> and <4i>. The book
of Hall & Heyde (1980), although aimed at central limit theory and its application,
contains much about martingales in discrete time. Dellacherie & Meyer (1982,
Chapter V) covered discrete-time martingales as a preliminary to the detailed study
of martingales in continuous time.
Exercise <15> comes from Aldous (1983, p. 47).
Inequality <20> is due to Dubins (1966). The upcrossing inequality of
Problem [11] comes from the same paper, slightly weakening an analogous
inequality of Doob (1953, page 316). Krickeberg (1963, Section IV.3) established
the decomposition (Theorem <26>) of submartingales as differences of positive
supermartingales.
I adapted the branching process result of Problem [14], which is due to Kesten
& Stigum (1966), from Asmussen & Hering (1983, Chapter II).
The reversed submartingale part of Example <44> comes from Pollard (1981).
The zero-one law of Theorem <5i> for symmetric events is due to Hewitt &
Savage (1955). The study of exchangeability has progressed well beyond the
original representation theorem. Consult Aldous (1983) if you want to know more.
REFERENCES

Aldous, D. (1983), 'Exchangeability and related topics', Springer Lecture Notes in


Mathematics 1117, 1-198.
Ash, R. B. (1972), Real Analysis and Probability, Academic Press, New York.
Asmussen, S. & Hering, H. (1983), Branching processes, Birkhauser.
Bauer, H. (1981), Probability Theory and Elements of Measure Theory, second english
edn, Academic Press.
Birnbaum, Z. W. & Marshall, A. W. (1961), 'Some multivariate Chebyshev
inequalities with extensions to continuous parameter processes', Annals of
Mathematical Statistics pp. 687-703.
de Finetti, B. (1937), 'La prevision: ses lois logiques, ses sources subjectives',
Annales de VInstitut Henri Poincare 7, 1-68. English translation by H. Kyburg
in Kyberg & Smokier 1980.
de Moivre, A. (1718), The Doctrine of Chances, first edn. Second edition 1738.
Third edition, from 1756, reprinted in 1967 by Chelsea, New York.

168

Chapter 6:

Martingale et al.

Dellacherie, C. & Meyer, P. A. (1982), Probabilities and Potential B: Theory of


Martingales, North-Holland, Amsterdam.
Doob, J. L. (1940), 'Regularity properties of certain families of chance variables',
Transactions of the American Mathematical Society 47, 455-486.
Doob, J. L. (1949), 'Application of the theory of martingales', Colloques Internationaux du Centre National de la Recherche Scientifique pp. 23-27.
Doob, J. L. (1953), Stochastic Processes, Wiley, New York.
Dubins, L. E. (1966), 4A note on upcrossings of semimartingales', Annals of
Mathematical Statistics 37, 728.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application,
Academic Press, New York, NY.
Hewitt, E. & Savage, L. J. (1955), 'Symmetric measures on cartesian products',
Transactions of the American Mathematical Society 80, 470-501.
Kesten, H. & Stigum, B. P. (1966), 'Additional limit theorems for indecomposable
multidimensional Galton-Watson process', Annals of Mathematical Statistics
37, 1463-1481.
Kolmogorov, A. (1928), 'Uber die Summen durch den Zufall bestimmter unabhangiger GroBen', Mathematische Annalen 99, 309-319. Corrections: same
journal, volume 102, 1929, pages 484-488.
Krickeberg, K. (1963), Wahrscheinlichkeitstheorie, Teubner. English translation,
1965, Addison-Wesley.
Kyberg, H. E. & Smokier, H. E. (1980), Studies in Subjective Probability, second
edn, Krieger, Huntington, New York. Reprint of the 1964 Wiley edition.
Levy, P. (1934), *Uaddition de variables aleatoires enchalnees et la loi de Gauss',
Bull Soc. Math. France 62, 42-43.
Levy, P. (1935a), Troprietds asymptotiques des sommes de variables aleatoires
enchainees', Comptes Rendus de VAcademie des Sciences, Paris 199, 627-629.
Levy, P. (19356), 'Proprietes asymptotiques des sommes de variables aleatoires
enchainees', Bull. Soc. math 59, 1-32.
Levy, P. (1937), Theorie de Vaddition des variables aleatoires, Gauthier-Villars, Paris.
Second edition, 1954.
Levy, P. (1970), Quelques Aspects de la Pensee d'un Mathematicien, Blanchard, Paris.
Neveu, J. (1975), Discrete-Parameter Martingales, North-Holland, Amsterdam.
Pollard, D. (1981), 'Limit theorems for empirical processes', Zeitschrift fur
Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 181-195.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Thatcher, A. R. (1957), 'A note on the early solutions of the problem of the duration
of play', Biometrika 44, 515-518.
Ville, J. (1939), Etude Critique de la Notion de Collectif, Gauthier-Villars, Paris.

Chapter 7

Convergence in distribution
SECTION I defines the concepts of weak convergence for sequences of probability measures
on a metric space, and of convergence in distribution for sequences of random
elements of a metric space and derives some of their consequences. Several equivalent
definitions for weak convergence are noted.
SECTION 2 establishes several more equivalences for weak convergence of probability
measures on the real line, then derives some central limit theorems for sums of
independent random variables by means of Lindeberg*s substitution method.
SECTION 3 explains why the multivariate analogs of the methods from Section 2 are not
often explicitly applied.
SECTION 4 develops the calculus of stochastic order symbols.
SECTION *5 derives conditions under which sequences of probability measures have weakly
convergent subsequences.

1.

Definition and consequences


Roughly speaking, central limit theorems give conditions under which sums of
random variable have approximate normal distributions. For example:
If i , . . . , $ are independent random variables with P& = 0 for each i and
y \ var(,) = 1, and if none of the f, makes too large a contribution to their sum,
then . f, is approximately N(0> 1) distributed.

The traditional way to formalize approximate normality requires, for each


real JC, that {J2t & < *) ** P{Z < x] where Z has a N(0,1) distribution. Of
course the variable Z is used just as a convenient way to describe a calculation
with the N(0,1) probability measure; Z could be replaced by any other random
variable with the same distribution. The assertion does not mean that ]Tf. f Z, as
functions defined on a common probability space. Indeed, the Z need not even live
on the same space as the {&}. We could remove the temptation to misinterpret the
approximation by instead writing P C - f t < x] & P(oo, JC] where P denotes the
AT(0,1) probability measure.
Assertions about approximate distributions of random variables are usually
expresssed as limit theorems. For example, the sum could be treated as one of
a sequence of such sums, with the approximation interpreted as an assertion of
convergence to a limit. We thereby avoid all sorts of messy details about the size of

170

<i>

Chapter 7:

Convergence in distribution

error terms, replacing them by a less specific assurance that the errors all disappear
in the limit. Explicit approximations would be better, but limit theorems are often
easier to work with.
In this Chapter you will learn about the notion of convergence traditionally
used for central limit theorems. To accommodate possible extensions (such as the
theories for convergence in distribution of stochastic processes, as in Pollard 1984),
I will start from a more general concept of convergence in distribution for random
elements of a general metric space, then specialize to the case of real random
variables. I must admit I am motivated not just by a desire for added generality.
I also wish to discourage my readers from clinging to inconvenient, old-fashioned
definitions involving pointwise convergence of distribution functions (at points of
continuity of the limit function).
Let X be a metric space, with metric d(-, ), equipped with its Borel cr-field
$ := 3(X). A random element of X is just an 3r\3(X)-measurable map from some
probability space (Q, 7% P) into X. Remember that the image measure XP is called
the distribution of X under P.
The concept of convergence in distribution of a sequence of random elements
{Xn} depends on Xn only through its distribution. It is really a concept of convergence
for probability measures, the image measures XFn, There are many equivalent ways
to define convergence of a sequence of probability measures {Pn}, all defined
on 3(X). I feel that it is best to start from a definition that is easy to work with,
and from which useful conclusions can be drawn quickly.
Definition. A real-valued function on a metric space X is said to satisfy a
Lipschitz condition if there exists a finite constant K for which
\t(x) - i(y)\ < Kd(x, y)

for all x and y in X.

Write BL(X) for the vector space of all bounded Lipschitz functions on X.
The space BL(X) has a simple characterization via the quantity ||/||BL defined
for all real valued functions / on X by
where K\ := sup ^M

<2>

x^y

^ W l and K2 := sup |/(JC)|.

d(X, y)

I have departed slightly from the usual definition of || \\BL> in order to get the neat
bound
<3>

\f(x)

- f(y)\

< UWBL (1 A d(x, 3O)

for all x, y e X.

The space BL(X) consists precisely of those functions for which \\\\BL < oo. It
is easy to show that || \\BL is a norm when restricted to BL(X). Moreover, slightly
tedious checking of various pointwise cases leads to inequalities such as
ll/i v f2\\BL < max (||/, UL, UIWBL) ,
which implies that BL(X) is stable under the formation of pointwise maxima of
pairs of functions. (Replace ft by -ft to deduce the same bound for ||/i A /2\\BL,
and hence stability under pairwise minima.)

171

7.1 Definition and consequences


REMARK.
It is also easy to show that BL(X) is complete under || \\BL: that
is, if {in : n e N} c BL(X) and \\in - tm\\BL -> 0 as min(m,n) - oo then
\\L-i\\BL - * 0 for a uniquely determined t in BL(X). In fact, BL(X) is a Banach
lattice, but that fact will play no explicit role in this book.

It is easy to manufacture useful members of BL(X) by means of the distance


function
d(x, B) := inf{d(x, y) : y B]
for B c X.
Problem [1] shows that \d(x, B) - d(y, B)\ < d(x, y). Thus functions such as
Za,p,,B(x) := a A f$d(x, B), for positive constants a and /*, and B e l , all belong
to BL(X).
<4>

Definition. Say that a sequence of probability measures {Pn}, defined on !B(X),


converges weakly to a probability measure P, on !B(X), if Pnl - PI for all I in
BL(X). Write Pn ^ P to denote weak convergence.
REMARK.
Functional analytically minded readers might prefer the term weak-*
convergence. Many authors use the symbol => to denote weak convergence. I beg
my students to avoid this notation, because it too readily leads to indecipherable
homework assertions, such as Pn=$P=>TPn=$TP.
(Which = is an implication
sign?)

<5>

Definition. Say that a sequence X\, X%, ...of random elements ofX of X
converges in distribution to a probability measure P on !B(X) if their distributions
converge weakly to P. Denote this convergence by Xn ~* P. If X is a random
element with distribution P, write Xn*^> X to mean Xn ^> P.
REMARK.
Convergence in distribution is also called convergence in law (the
word law is a synonym for distribution) or weak convergence. For the study of
abstract empirical process it was necessary (Hoffmann-J0rgensen 1984, Dudley 1985)
to extend the definition to nonmeasurable maps Xn into X. For that case, the
concept of distribution for Xn is not defined. It turns out that the most natural and
successful definition requires convergence of outer expectations, *h(Xn) -> Ph9 for
bounded, continuous, functions h. That is, the convergence Xn -^ P becomes the
primary concept, with no corresponding generalization needed for weak convergence
of probability measures.

(X,)
It is important to remember that convergence in distribution, in general, says
nothing about pointwise convergence of the Xn as functions. Indeed, each Xn might
be defined on a different probability space, (Qn, J n , Pn) so that the very concept of
pointwise is void. In that case, Xn ** P means that Xn (Pn) ~+ P, that is,
Fn(Xn) := (Xnn)(i) - Pi

for all I in BL(X);

172

Chapter 7:

Convergence in distribution

and Xn -> X, with X defined on (2,y,P), means Pt(Xn) -+ i(X) for all
in L(X).
Similarly, convergence in probability need not be well defined if the Xn live
on different probability spaces. There is, however, one important exceptional case.
If Xn ~> P with P a probability measure putting mass 1 at a single point xo in X,
then, for each e > 0,
Fn{d(Xn, XQ) >}<

Wne(Xn) - * Pi = 0

where i(x)

: = 1 A (rf(jc,

xo)/e).

That is, Xn converges in probability to XQ.


<6>

E x a m p l e . Suppose Xn ~* P and {Xfn} is another sequence for which d(Xn, Xfn)


converges to zero in probability. Then X'n P. For if \\1\\BL = # < oo,
|Pw(Xn) - Fn(X'J\ < Fn\i(Xn) -

i(Xfn)\

<KFn(lAd(Xn,Xfn))
< KF{d(Xn, X'n)>} + K

by <3>
for each <? > 0.

The first term in the final bound tends to zero as n -> oo, for each > 0, by the
assumed convergence in probability. It follows that nl(X'n) - P.
REMARK.
A careful probabilist would worry about measurability of d(Xn, X'n).
If X were a separable metric space, Problem [6] would ensure measurability. In
general, one could reinterpret the assertion of convergence in probability to mean:
there exists a sequence of measurable functions {An} with Art > d(Xn, X'n) and
Aw -> 0 in probabilty. The argument for the Example would be scarcely affected.

Convergence for expectations of the functions in BL(X) will lead to convergence


for expectations of other types of function, by means of various approximation
schemes. The argument for semicontinuous functions is typical. Recall that a
function g : X M is said to be lower semicontinuous (LSC) if {JC : g(x) > t} is
an open set for each fixed t. Similary, a function / : X ~> E is said to be upper
semicontinuous (USC) if {x : f(x) < t} is an open set for each fixed t. (That is,
/ is USC if and only if - / is LSC.) If a function is both LSC and UCS then it
is continuous. The prototypical example for lower semicontinuity is the indicator
function of an open set; the prototypical example for upper semicontinuity is the
indicator function of a closed set.
<7>

L e m m a . If g is a lower semicontinuous function that is bounded from below by


a constant, on a metric space X, then there exists a sequence {< : i e N} in BL(X)
for which 4 (JC) f g(x) at each x.

173

7.1 Definition and consequences

Proof. With no loss of generality, we may assume g > 0. For each t > 0,
the set Ft := {g < t] is closed. The sequence of nonnegative BL functions
4 r(x) := t A (kd(x, Ft)) for ^ N increases pointwise to t{g > r}, because
d(x, Ft) > 0 if and only if g(x) > t. (Compare with Problem [3].)
The countable collection S of all *,, functions, for k G N and positive rational t
has pointwise supremum equal to g. Enumerate S as {*i,ft2> } then define

<8>

U :=ma.Xj<jhj.

Theorem. Suppose Pn -^ P. Then


(i) liminf^oo Png > Pg for each lower semicontinuous function g that is
bounded from below by a constant,
(ii) limsup^^ Pnf < Pf for each upper semicontinuous function f that is
bounded from above by a constant.
Proof For a given LSC g, bounded from below, invoke Lemma <7> to find an
increasing sequence {,} in BL(X) with ,- f g pointwise. Then, for fixed i,
liminf Png > liminf Pn, = P,
n

because Pn, -> Plt.

Take the supremum over i, using Monotone Convergence to show supf P, = Pg, to
obtain (ii). Put g = / to deduce (ii).
When specialized to the case of indicator functions, we have

<9>
<io>

liminf* PnG > PG for all open G )


\
if Pn - P .
limsupn PnF < PF for all closed FJ
Example. Let / be a bounded, measurable function on X. The collection of all
lower semicontinuous functions g with g < f has a largest member, because the
supremum of any family S of LSC functions is also LSC: {JC : supgS g(x) > t] =
Vgesi* g(x) > t), a union of open sets. By analogy with the notation for the
interior of a set, write / for this largest LSC < / . The analogy is helpful, because /
equals the indicator function of B when / equals die indicator of a set B. Similarly,
there is a smallest USC function / that is everywhere > / , and / is the indicator
of B when / is the indicator of B.
For simplicity suppose 0 < / < 1. We have /(JC) > /(JC) > /(JC) for all x.
At a point JC where /(JC) = /(JC) the set {y : f(y) > /(JC) e] is an open
neighborhood of JC within which f(y) > /(JC) - . Similarly, if /(JC) = /(JC), there
is a neighborhood of x on which f(y) < f(x) + e. In short, / is continuous at each
point of the set {JC : /(JC) = /(JC)}. Conversely, if / is continuous at a point x then,
for each e > 0, there is some open neighborhood G of JC for which |/(JC) - f(y)\ < c
for y G. We may assume < I. Then the function / is sandwiched between a
LSC function and an USC function,

(fix) -){yeG}-

2{y i G] < f(y) < (/(JC) + 6) {y e G) 4- 2{y G},

which differ by only 2e at JC, thereby implying that /(JC) = /(JC). The Borel
measurable set C/ := {JC : /(JC) = /(JC)} is precisely the set of all points at which /
is continuous.

174

Chapter 7:

Convergence in distribution

Now suppose Pn **+ P . From Theorem < 8 > , and the inequality / > / > / ,
we have
Pf > l i m s u p P n f > l i m s u p P n f > liminf Pnf > liminf Pnf > Pf.
n

If Pf = p / , which happens if PCf = 1, then we have convergence, Pnf -> Pf.


That is, for a bounded, Borel measurable, real function / on X,
<n>

Pnf - Pf

if Pn -w P and / is continuous P almost everywhere.

When specialized to the case of an indicator function of a set B e S(X), we have


<12>

PnB -> PB

if Pn ~* P and P (dB) = 0,

because the discontinuities of the indicator function of a set occur only at its
boundary. A set with zero P measure on its boundary is called a P-continuity
set. For example, an interval (oo, x] on the real line is a continuity set for every
probability measure that puts zero mass at the point JC. When specialized to real
random variables, assertion <12> gives the traditional convergence of distribution
functions at continuity points of the limit distribution function.
REMARK.
Intuitively speaking, the closeness of probability measures in a weak
convergence sense is not sensitive to changes that have only small effects on functions
in BL(X): arbitrary relocations of small amounts of mass (because the functions
in BL(X) are bounded), or small relocations of arbitrarily large amounts of mass
(because the functions in BL(X) satisfy a Lipschitz condition). The P-continuity
condition, P (dB) = 0, ensures that small rearrangements of P masses near the
boundary of B cannot have much effect on PB. See Problem [4] for a way of
making this idea precise, by constructing sequences P'n -^ P and Pn* -** P for which
P'nB - PB and P'^B - *

PB.

Problem <io> shows that convergence for all P-continuity sets implies
the convergence Pnf Pf for all bounded, measurable functions / that are
continuous a.e. [P], and, in particular, for all / in BL(X). Thus, any one of the
assertions in the following summary diagram of equivalences could be taken as
the definition of weak convergence, and then the other equivalences would become
theorems. It is largely a matter of taste, or convenience, which equivalent form one
chooses as the definition. It is is worth noting that, because of the equivalences,
the concept of weak convergence does not depend on the particular choice for the
metric: all metrics generating the same topology lead to the same concept.
REMARK.
Billingsley (1968, Section 2) applied the name Portmanteau theorem
to a subset of the equivalences shown in the following diagram. The circle of
ideas behind these equivalences goes back to Alexandroff (1940-43), who worked
in an abstract, nontopological setting. Prohorov (1956) developed a very useful
theory for weak convergence in complete separable metric spaces. Independently,
Le Cam (1957) developed an analogous theory for more general topological spaces.
(See also Varadarajan 1961.) For arbitrary topological spaces, Tops0e (1970,
page 41) chose the semicontinuity assertions (or more precisely, their analogs for
generalized sequences) to define weak convergence. Such a definition is needed to
build a nonvacuous theory, because there exist nontrivia! spaces for which the only
continuous functions are the constants.

7.1 Definition and consequences

175

PnB -> PB for all Borel sets B with P(dB) = 0


liming PnGz

Problem [10]

PG for all open G

<+ limsup,, PnF s PF for all closed F

lim infn Png & Pg for all bounded,


lower semicontinuous functions g

limsupn Pnf Pf for all bounded,


upper semicontinuous functions /
Example <io>

Pnf -+ Pf for all bounded / with P{x e X : / discontinuous at x] = 0


Pf for all bounded, continuous
Theorem <8>

Pn -* Pi for all I in BL(X)

def

Equivalences for weak convergence of Borel probability measures on a general


metric space. Further equivalences, for the special case of probability measures
on the real line, are stated in the next Section.

The following consequence of <ii> is often referred to as the Continuous


Mapping Theorem, even though the mapping in question need not be continuous.
It would be more accurate, if clumsier, to call it the Almost-Surely-Continuous
Mapping Theorem.
Corollary. Let T be a tB(X)yB(^)-measurable map from X into another metric
space y, which is continuous at each point of a measurable subset CT- If Pn ~* P
and PCT = 1 then T(Pn) ~> T(P).

Proof. For BLty), the composition / := I o T is continuous at each point


of CT. From <li>, we have (TPn)(t) := Pni(T) -> Pi(T) =: (TP)(i).
REMARK.
The equivalent assertion for random elements is: if Xn
is continuous at almost all realizations X(<o) then T(Xn) -~* T(X).

X and T

Example. Suppose Yn ** F, as random elements of a metric space (y,/y), and


Zn ~> ZQ in probability, as random elements of a metric space (Z, dz)* Equip
X := y x 2, with the metric
d (x\, x2) := max (dy(y\, y2), dz(z\, zi))

where xt := (yit zt).

If BLQd x Z) then the function y H> (y, zo) belongs to BL(y>, and hence
Wni(Yn,zo) -> P(F,z0). That is, the random elements X'n := (YnJzo) converge
in distribution to (F, ZQ). The random element Xn := (Fw, Zn) is close to X'n: in
fact, d(Xn,Xfn) dz(Zn$zo) ~> 0 in probability. By Example <6>, it follows
that Xn ~* (F, zo). If T is a measurable function of (y, z) that is continuous at
almost all realizations (F(<y), zo) of the limit process, Corollary <13> then gives
T(Yny Zn) ~* T(Y, zo). The special cases where y = Z = R and T(y, z) = y + z or
, z) = yz are sometimes referred to as Slutsky's theorem.

176
<15>

D
2.

Chapter 7:

Convergence in distribution

E x a m p l e . Suppose 1,^2, are independent, identically distributed random


Rk-vectors with Pft = 0 and P(&/) = /*. Define K,, = (1 + ... + $n)/>/n. As
you will see in Section 3, the sequence Krt converges in distribution to a random
vector Y whose components are independent N(0, 1) variables. Corollary <13>
gives convergence in distribution of the squared lengths,
The limit distribution is xh by definition of that distribution.
Statistical applications of this result sometimes involve the added complication
that the ,- are not necessarily standardized to have covariance equal to the identity
matrix but instead P(&/) = V. In that case, Yn -> Y9 where Y has the N(0, V)
distribution. If V is nonsingular, the random variable YrnV~xYn converges in
distribution to Y'V~lY, which again has a xl distribution. Sometimes V has to be
estimated, by means of a nonsingular random matrix Vn, leading to consideration
of the random variable Y'nV~xYn. If Vn converges in probability to V (meaning
convergence in probability of each component), then it follows from Example <14>
that (Yn, Vn) -~> (F, V), in the sense of convergence of random elements of R*+*2.
Corollary <13> then gives Y'nV~xYn ~> YrV~]Y, because the map (y, A) H* y'A~ly
is continous at each point of R*+*2 where the matrix A is nonsingular.
Lindeberg's method for the central limit theorem
The first Section showed what rewards we reap once we have established convergence
in distribution of a sequence of random elements of a metric space. I will now explain
a method to establish such convergence, for sums of real random variables, using
the central limit theorem (CLT) as the motivating example. It will be notationally
convenient to work directly with convergence in distribution of random variables,
rather than weak convergence of probability measures. To simplify even further, let
us assume all random variables are defined on a single probability space (2, JF, P).
The CLT, in its various forms, gives conditions under which a sum of
independent variables 1 4 - . . . -f * has an approximate normal distribution, in the
sense that P(i + . . . + *) for I in BL(R) is close to the corresponding expectation
for the normal distribution. Lindeberg's (1922) method transforms the sum into
a sum of normal increments by successive replacement of each ,- by a normally
distributed y/f with the same expected value and variance. The errors accumulated
during the replacements are bounded using a Taylor expansion of the function / .
Of course the method can work only if / is smooth enough to allow the Taylor
expansion. More precisely, it requires functions with bounded derivatives up to
third order. We therefore need first to check that convergence of expectations for
such smooth functions suffices to establish convergence in distribution. In fact,
convergence for an even smaller class of functions will suffice.

<16>

L e m m a . IfPf(Xn)
-> P / ( X ) for each f in the class e(R) of all bounded
functions with bounded derivatives of all orders, then Xn > X.

111

7.2 Lindeberg's method for the central limit theorem

Proof. Let Z have a AT(0,1) distribution. For a fixed e BL(R) and tr > 0, define
a smoothed function by convolution,
a(x) :=

<TZ)

exp (-|(y

:=

(y)dy.

The function a belongs to e(R): a Dominated Convergence argument justifies


repeated differentiation under the integral sign (Problem [15]). As a tends to zero,
a converges uniformly to , because
\a{x) - {x)\ < F\(x + aZ) - {x)\ < |j||BLP (1 A a\Z\) -* 0,

again by Dominated Convergence.


Given > 0, fix a a for which supx \a(x) - (x)\ < . Then observe that
\{Xn) - P(X)| lies within 2t of Wa{Xn) - Ma(X% which converges to zero as
n -+ oo, because a
REMARK.
There is nothing special about the choice of the N(0, 1) distribution
for Z in the construction of a. It matters only that the distribution have a density
with bounded derivatives of all orders, and that differentiation under the integral sign
in the convolution integral can be justified.

For weak convergence of probability measures on the real line, we can augment
the equivalences from Section 1 by a further collection involving specific properties
of the real line.
P(-oo, x] -+ P(-oo, x] for all x e R with P{x] = 0

Pnf -> P / for all / in e (R)


P - / ^ P / for all / in e(R)
pxeixt _^ pxeixt

for

all real t

Further equivalences for weak convergence of probability measures on the


real line. Lemma <16> handles the implications leading from C(]R) to weak
convergence. Example < 10> then gives the convergence of distribution functions.
Problem [5] gives the approximation arguments leading from convergence of
distribution functions (in fact, it even treats the messier case for R 2 ) to weak
convergence. The final equivalence, involving the complex exponentials, will be
explained in Chapter 8.

Lindeberg's method needs only bounded derivatives up to third order. Define


C3(E) to to be the class of all such bounded functions. Of course convergence of
expectations for functions in C3(R) is more than enough to establish convergence in
distribution, because e3(M) 2 e(R).

178

Chapter 7:

Convergence in distribution

For a fixed / in C3(E), define

which is finite by assumption. By Taylor's Theorem,

f{x + y) = f(x) + yf'(x) 4- \y2f(x)

+ R(x, y),

3 lff

where R(x, y) = \y f {x*) for some JC* lying between x and x 4- j . Consequently,
|i?(jc, y)| < C|y|3

for all JC and y.

If X and F are independent random variables, with P|F| 3 < oo, then
P/(X 4- F) = P/(X) 4- P (F/'(X)) + | P ^F 2 /"(X)) + P1?(X, F).
Using independence to factorize two of the terms and bounding \R(Xr Y)\ by C|F| 3 ,
we get
|P/(X + F) - P/(X) - (PF) (P/'(X)) - (FY2) (Ff"(X)) \ < CP|F| 3 .
Suppose Z is another random variable independent of X, with P|Z| 3 < oo and
PZ = PF and PZ 2 = PF 2 . Subtract, cancelling out the first and second moment
contributions, to leave
|P/(X + F) - P/(X + Z)\ < C (P|F| 3 4- P|Z| 3 ) .
For the particular case where Z has a N(fi, a2) distribution, with $i := PF and
a1 := var(F), the third moment bound simplifies slightly. For convenience, write W
for (Z ji)/cr, which has a 7V(0, 1) distribution. Then
P|Z| 3 <8|/x| 3 + 8<r3P|W|3
< 8(P|F|) 3 + 8(P|F| 2 ) 3/2 P|W| 3
< ^8 -f 8P|W\3^ P|F| 3

by Jensen's inequality.

The right-hand side of <17> is therefore less than CiP|F| 3 where C\ denotes the
constant (9 + 8P|W|3)C.
Now consider independent random variables j , . . . , & with

m := Pfe ,

af := varfe),

P|fe|3 < oo.

Independently of all the {,}, generate rji distributed N(fjLir a 2 ), for i = 1,..., k.
Choose the {r/,} so that all 2k variables are independent. Define
S:=i+...+&

and

T := m 4-... 4- m .

The sum T has a normal distribution with


P r = /Lii4-... + M*

and

var(T) = af + ... 4-or2.

Repeated application of inequality < n > will lead to the third moment bound for
the difference P/(5) - Vf(T).
For each i define

7.2 Lindeberg's method for the central limit theorem

179

The variables Xi9 Yh and Z, are independent. From <17> with the upper bound
simplified for the normally distributed Z,-,
Zt)\ <
for each i. At i = k and i = 1 we recover the two sums of interest, Xk + F* =
i 4-... + & = S and Xi + Zi = r?i + ... + rjk = T. Each substitution of a Z, for
a Ff replaces one more ft by the corresponding r?,; the fc substitutions replace all
the ft by the rji. The accumulated change in the expectations is bounded by a sum
of third moment terms,
|P/(S) - p/(r>i < ct
We have only to add an extra subscript to get the basic central limit theorem.
It is cleanest to state the theorem in terms of a triangular array of random
variables,
lr2,*(2)
-3,*(3)

The variables within each row are assumed independent. Nothing need be assumed
about the relationship between variables in different rows; all calculations are
carried out for a fixed row. By working with triangular arrays we eliminate various
centering and scaling constants that might otherwise be needed.
<19>

Theorem. Let , j , . . . , ,*(>, for n = 1,2,..., be a triangular array of random


variables, independent within rows, such that
0) Ei Pfti,/ -* M, witfi /x finite,
(ii) . var(ft,(j) - a 2 < oo,
(Hi) E i ' l & . H 3 - > 0 .
e n Ei<*() $. ~* ^(M. cr2 ) as n -> oo.
Pmo/. Choose / in C3(M). Apply inequality <18> and (iii) to show that fQ2i 5f/)
equals P/(r n ) + c?(l), where Tn is iV(/iw,aw2) distributed with /xn -> /JL by (i) and
2
by (ii). Deduce (see Problem [11]) that Tn ~* N(ii,cr2)9 whence
a 2 - a

<20>

Exercise. If Xn has a Bin (n, pn) distribution and if npn(l /*)-* oo, show that
Xn /?
iv(0,1).
? n (l - pn)
Manufacture a random variable with the same distribution as the
standardized Xn as follows. Let A r t j , . . . , AH%n be independent events with AnJ =
for i = 1, . . . , n . Define
SOLUTION:

fn.i :=

where an := y/npn(\ - pn).

Then E/<n ?.i ^as the same distribution as (Xn - npn)/an.

180

Chapter 7:

Convergence in distribution

Check the conditions of Theorem <19> with /x := 0 and a2 := 1 for these n>l.
The centering was chosen to given Fi-nj = 0, so (i) holds. By direct calculation,

so J2i var (&u) = 1- Requirement (ii) holds. Finally, because \Anj - pn\ < 1,
Ei P I M * < *nl Ei V\$nj\2 = <*? -* 0.
D

<2i>

It follows that Ei<n &M ~* N(Q, 1), as required.


Theorem <19> can be extended to random variables that don't satisfy the
moment conditionsindeed, to variables that don't even have finite momentsby
means of truncation arguments. The theorem for independent, identically distributed
random variables with finite variances illustrates the truncation technique well. The
method of proof will delight all fans of Dominated Convergence, which once again
emerges as the right tool to handle truncation arguments for identically distributed
variables.
Theorem. Let X\, Xi,... be independent, identically distributed random variables
with WXi = 0 and FXf = 1. Then (Xr + ... + XH)/Jn - N(0,1).
Proof. The argument will depend on three applications of Dominated Convergence,
with dominating variable X\:

Apply Theorem <19> to the variables fw?l := Xf{|Xi| < y/n]/y/n, for i =
1, . . . , n . Notice that
,| > v ^ } / ^

i
which gives the bound
^

because WX( = 0,
+ 0.

To control the sum of variances, use the identical distributions together with the fact
that Fi;nj = ixnjn = o(\/n) to deduce that
Ei var(fc,fi) = P
For the third moment bound use

?(

U%) > 0.

It follows that /?n,i '^ ^(0* !) Complete the proof via an appeal to
Example <6>, using the inequality

^1 ^ E
D

to show that Ei<n xi/^


" Ei $nj -> 0 in probability.
Similar truncation techniques can be applied to derive even more general
forms of the central limit theorem from Theorem <19>. Often we have to deal

7.2 Lindeberg's method for the central limit theorem

181

with conditions expressible as Hn(e) - 0 for each e > 0, for some sequence
of functions Hn. It is convenient to be able to replace by a sequence {} that
converges to zero and also have Hn(n) -> 0. Roughly speaking, if a condition holds
for all fixed then it will also hold for sequences n tending to zero slowly enough.
<22>

Lemma. Suppose Hn() - 0 as n - oo, for each fixed c > 0. Then there exists
a sequence n -> 0 such that Hn(n) -+ 0.
Proof. For each positive integer k there exists an * such that \Hn(\/k)\ < l/k for
n > *. Without loss of generality, assume n\ < ni <
Define

{ arbitrary
l/k

for n < n\9


for nk < n < *+iThat is, for n > ni, we have = 1/Jfcn, where kn is the positive integer k for
which nk < n < n*+i. Clearly kn -^ oo as n - oo. Also, for n > i, we have
\Hn(n)\ < \/kny which converges to zero as n -> oo.
The form of the central limit theorem in the next Exercise is essentially due to
Lindeberg (1922). The result actually includes Theorem <2i> as a special case.

<23>

Exercise. Let {Xnj} be a triangular array of random variables, independent within


each row, such that:
(i) FXnJ = 0 for all n and i;

(ii) E,PX, = 1;
(iii) Ln(e) := J \ T>xlti{\XHti\ > e] -> 0 for each 6 > 0 [Lindeberg's condition].
Show that \ ^.i ^ ^(0,1).
2
SOLUTION: Invoke Lemma <22> with Hn(e) := Ln()/ tofinden tending to zero
slowly enough to ensure that Ln{en)/el - 0. Define a triangular array of random
variables ,, := Xnj[\Xnj\ < n}. Notice that
Pf&.i # X,,,/ for some i} < ^ PdX^I > n} < L n ( n )/^ -> 0.
By Example <6> it suffices to show that * ?,i ^* ^(0.1).
Check the conditions of Theorem <19>. For the first moments:
I > c n }| < Ln(cn)/n ^ 0.
For the variances, use the fact that PfWfI = PXn>l{|Xn>l| > } to show that
, vatfe,,) = E i PX^dX,,.,! < } - E i (-PX^dX,,.,-! > n}f.
The first sum on the right-hand side equals E i ^ 2 i ~ Ln(n), which tends to 1.
The second sum is bounded above by Ln(n). For the third moments:

The central limit theorem follows.

3. Multivariate limit theorems


The arguments for proving convergence in distribution of random Rk vectors
multivariate limit theoremsare similar to those for random variables. Indeed,

182

Chapter 7:

Convergence in distribution

with an occasional reinterpretation of a square as a squared length of a vector, and


a product as an inner product, the arguments in Section 2 carry over to random
vectors.
Perhaps the only subtlety in the multivariate analog of Theorem <19> arises
from the factorization of quadratic terms like P(F'/(X)y), with Y independent of
the k x k random matrix of second derivatives /(X). We could resort to an explicit
expansion into a sum of terms P(y J F i /(X) l i ), but it is more elegant to reinterpret
the quadratic as the expected trace of a 1 x 1 matrix, then rearrange it as
Ptraee(F7(X)F) = Plrace(/(X)IT') = trace ((P/(X)) (WYYf)).
We are again in the position to approximate P / ( ^ X/) by means of a sequence of
substitutions of variables with the same expected values and covariances.
The new variables should be chosen as multivariate normals. For each fi e Rk
and each nonnegative definite matrix V, the N(fi,V) could be defined as the
distribution of fi + RW, with W a vector of independent iV(0, l)'s and R any k x k
matrix for which RRf = V. It is easy to check that
P(M + RW) = ii + RVW = n

<24>

and

var(/x + RW) = RVM(W)R' = F.

The Fourier tools from Chapter 8 offer the simplest method for showing that the
distribution does not depend on the particular choice of R.
Problem [14] shows how to adapt the one-dimensional calculations to derive a
multivariate analog of approximation <18>. The assertion of Theorem <19> holds
if we reinterpret a 2 to be a variance matrix. The derivations of other multivariate
central limit theorems follow in much the same way as before.
Example. If Xi, X2,... are independent, identically distributed random vectors
with FXt = 0 and P|X,|2 < 00, then (Xi + ... + Xn)/Jn ~> W(0, V), where V :=
P(XiX',).
In short, the multivariate versions of the results from Section 2 present little
extra challenge if we merely translate the methods into vector notation. Fans
of the more traditional approach to the theory (based on pointwise convergence
of distribution functions) might find the extensions via multivariate distribution
functions more tedious. Textbooks seldom engage in such multivariate exercises,
because there is a more pleasant alternative: By means of a simple Fourier device
(to be discussed in Section 8.6), multivariate results can often be reduced directly
to their univariate analogs. There is not much incentive to engage in multivariate
proofs, except as a way of deriving explicit error bounds (see Section 10.4).

4.

Stochastic order symbols


You are probably familiar with the 0() and o(-) notation from real analysis. They
allow one to avoid specifying constants in many arguments, thereby simplifying the
notational load. For example, it is neater to write
f(x) = /(0) 4- Jtf(O) 4- *(|jc|)

near 0,

183

7.4 Stochastic order symbols

<25>

than to write out the formal epsilon-delta details of the limit properties that define
differentiability of / at 0.
The order symbols have stochastic analogs, which are almost indispensable in
advanced asymptotic theory. As with any very concise notation, it is easy to conceal
subtle errors if the symbols are not used carefully; but without them, all but the
simplest arguments become notationally overwhelming.
Definition. For random vectors {Xn} and nonnegative random variables {}
write Xn = 0 p (a n ) to mean; for each > 0 there is a finite constant M such that
F{\Xn\ > Man] < eventually. Write Xn = op(otn) to mean: W{\Xn\ > ectn} -> 0
for each > 0.
Topically {otn} is a sequence of constants, but occasionally random bounds are
useful.
REMARK.
The notation allows us to write things like op(ctn) = Op{otn), but not
Op(an) = op(an)f meaning that if Xn is of order op(an) then it is also of order Op(an)
but not conversely. It might help to think of Op() and op(<) as defining classes of
sequences of random variables and perhaps even to write Xn e Op(an) or Xn op(an)
instead of Xn = Op(an)

<26>

<27>
D
<28>

or Xn =

op(an).

Example. The assertion Xn = op{\) means the same as Xn -* 0 in probability.


When specialized to random vectors, the result in Example <6> becomes: if
Xn ~* P then Xn + op(\) ~> P- The op{\) here replaces the sequence Xn - X'n.
Example. If Xn ~* P then Xn = 0^(1). From <9> with G as the open ball of
radius M centered at the origin, liminfP{Xn G} > PG. If M is large enough,
PG > 1 - , which implies that {\Xn\ >M}< eventually.
Example. Be careful with the interpretation of an assertion such as Op{\) +
0P(1) = Op(\). The three 0 p (l) symbols do not refer to the same sequence of
random vectors; it would be a major blunder to cancel out the 0p{\) to conclude that
Op{\) = 0 . The assertion is actually shorthand for: if Xn = 0 p (l) and Yn = Op(l)
The assertion is easy to verify. Given e > 0, choose a constant M so that
{\Xn\ > M} < e md{\Yn\ > M} < eventually. Then, eventually,
{\Xn 4- Yn\ > 2M) < {\Xn\ > M) + P{|yn| > M} < 2c.

D
<29>

<30>

If you worry about the 2e in the bound, replace e by e/2 throughout the
previous paragraph.
Example. If [Xn) is a sequence of real random variables of order O p (l), what
can be asserted about {l/Xn}l Nothing. The stochastic order symbol Op() conveys
no information about lower bounds. For example, if Xn = \/n then Xn = Op{\) but
l/Xn = n -> oo. You should invent other examples.
The blunder of asserting l/0 p (l) = 0 p (l) is quite common. Be warned.
Example. For a sequence of constants [an] that tends to zero, suppose Xn =
Op(an). Let g be a function, defined on the range space of the {Xw}, for which
g(x) = O(|JC|) near 0. Then the random variables g(Xn) are of order op(an). To
prove the assertion, for given > 0 and f > 0, find M and then 5 > 0 such that

184

Chapter 7:

Convergence in distribution

P{|XW| > Man] < f eventually, and |g(jt)| < e\x\/M when \x\ < 8. When n is large
enough, we have Man < 8 and
\g(Xn)\ > ^ M a w | < P{|XM| > Man]
D
<3i>

That is, g(Xn) = op(an), or, more cryptically, o(Op(an)) =

op(an).

E x a m p l e . The so-called delta method gives a simple way to analyze smooth


transformations of sequences of random vectors. Suppose XQ is a fixed vector in
Rk, and {Xn} is a sequence of random vectors for which Zn = y/n{Xn - XQ) ~ Z.
Suppose g is a measurable function from R* into R ; that is differentiate at JCO. That
is, there exists an / x k matrix D such that
g(x0 + 8) =

*5.

where |/?(5)| = o(\8\) as 5 -> 0. If we replace 8 by the random quantity Zn/y/n


we get y ^ g ^ / i ) - g(xo)) = DZn + y/nR(Zn/^/n).
From Example <30> we have
R(Zn/*fn) = o(Op(lfyfn)) op{\/^/n), from which it follows, via Example <26>,
that Vn(^(X B ) - g(x0)) + o p (l) ^> DZ.

Weakly convergent subsequences


In a compact metric space, each sequence has a convergent subsequence. Sequences
of probability measures that concentrate most of their mass on compact subsets of
a metric space have a similar property, a result that provides a powerful method for
proving existence of probability measures.

<32>

Definition. A probability measure P on 3(X) is said to be tight if to each


positive e there exists a compact set K such that PK > 1 e.
For the purposes of weak convergence arguments it is more convenient to
have tight measures identified with particular linear functionals on BL(X), with X a
metric space. The following characterization is a special case of a result proved in
Section 6 of Appendix A.

<33>

Theorem. A linear functional X : BL(X)+ H* M+ with A.1 = 1 defines a tight


probability measure if and only if it is functionally tight: to each positive there
exists a compact set K such that XI < for every I in BL(X)+ for which < Kc.
In order that a limit functional on BL(X) + inherit the functional tightness
property from a convergent sequence, it suffices that an analogous property hold
"uniformly along the sequence." It turns out that a property slightly weaker than
requiring supn PnKc < is enough.

<34>

Definition. Call a sequence of probability measures {Pn} on the Borel sigma-field


of a metric space uniformly tight if to each e > 0 there exists a compact set K
such that limsupM_^oo PnGc <e for every open set G containing K.
Uniform tightness implies (and, apart from a few
inconsequential constant factors, is equivalent to) the assertion
that for each > 0 there is a compact set K such that
K - -----G

185

7.5 Weakly convergent subsequences

limsup Pnt < 2c for every I in L(X)+ for which 0 < I < K*: for such an , the
open set G := [ < e} contains K and
<35>

Pni < + PnGc < 2

eventually.

This equivalent form of uniform tightness is better suited to the passage to the limit.
<36>

Theorem. (Prohorov 1956, Le Cam 1957) Every uniformly tight sequence of


probability measures on the Borel sigma-field of a metric space has a subsequence
that converges weakly to a tight probability measure.
Construction of the limit distribution, in the form of a tight linear functional
on Z?L(X)+, will be achieved by a Cantor diagonalization argument, applied to a
countable family of functions of the type descibed in the following Lemma.

<37>

Lemma. For each 8 > 0, > 0, and each compact set K there exists a finite
collection 9 = {go, 8\,, > g*} Q BL(X)+ such that:
(i) goM 4-... + gk(x) = 1 for each x X;
(ii) the diameter of each set {g, > 0}, for i > 1, is less than 8.
(Hi) go < on K.
REMARK. A finite collection of nonnegative, continuous functions that sum to
one everywhere is called a continuous partition of unity.
Proof. Let j q , . . . , x* be the centers of open balls of radius 8/4 whose union covers
the compact set K. Define functions f0 = 6/2 and ft(x) := (1 - 2d(x, *i)/<S)+, for
i > 1, in BL(X)+. Notice that ft(x) = 0 if d(x, xfi > 5/2, for i > 1. Thus the set
{fi > 0} has diameter less than 8 for i > 1.
The function F{x) := X^=o fi(x) is everywhere greater than c/2, and it belongs
to BL(X)+. The nonnegative functions g, := fi/F are bounded by 1 and satisfy a
Lipschitz condition:

\F(y)Mx) - F(x)My)\
\F(y) - F(x)\My)
F(x)

F(x)F(y)

For each x in K there is an i for which d(x,x{) < 8/4. For that i we have
fi(x) > 1/2 and go(x) < fo(x)/ft(x) < e. The g, sum to 1 everywhere. They are
the required functions.
Proof of Theorem <36>. Write Ki for the compact set given by Definition <34>
with := 1/i. For each i in N write 9* for the finite collection of functions
in BL(X)+ constructed via Lemma <37> with 8 := := l/i and K equal to Ki.
The class S := U^NS, is countable.
For each g in 9 the sequence of real numbers Png is bounded. It has a
convergent subsequence. Via a Cantor diagonalization argument, we can construct
a single subsequence Ni c N for which lim^eNj Png exists for every g in 9The approximation properties of 9 will ensure existence of the limit XI :=
limnNl Pn for every I in BL(X)+. With no loss of generality, suppose U\\BL < 1.

186

Chapter 7:

Convergence in distribution

Given e > 0, choose an i > l/e, then write Si = (&0t &i* g*h w ^ h indexing
as in Lemma <37>. The open set G, := {go < e] contains Kt, which ensures that
limsup n PnG? <.
For each j > 1 let x, be any point at which gj(xj) > 0. If x is any other point
with gj(x) > 0 we have \(x) - (JC,-)| < d(x, xy) < . It follows that, for every x
in X,

which integrates to give


2.

It then follows, via the existence of HmnM, Pngj9 that limsup nNl Pnl differs from
liminfneNi Pnf- by at most 6c. The limit Xt := limwN, Pn exists.
The limit functional X inherits linearity from P. Clearly XI = 1. It inherits
tightness from the uniform tightness of the sequence, as in <35>. From Theorem <33>, the functional X corresponds to a tight probability measure to which
{Pn : n Nj} converges in distribution.
REMARK.
For readers who know more about general topology: The Cantor
diagonalization argument could be replaced by an argument with ultrafilters, or
universal subsets, for uniformly tight nets of probability measures on more general
topological spaces. Lemma <37> was needed only to allow us to work with
sequences; it could be avoided.

<38>

E x a m p l e . Let [Xn] be a sequence of R*-valued random vectors of order 0 P (1).


If F{\Xn\ > M] < e eventually then limsupP{X n ^ G) < for every open set G that
contains the compact ball {JC : \x\ < M}. That is, {Xn} is uniformly tight. It has a
subsequence that converges in distribution to a probability measure on !B(R*).

6,

Problems

[1]

Let B be a subset of a metric space. For each pair of points jq and X2 in X,


show that infyBd(x\, y) < d(x\,X2) + inf yfl dfa, v). Deduce that the function
fB(x) := d(x, B) satisfies the Lipschitz condition \/B(X) fniy)\ < d(x, y).

[2]

For real-valued functions / , g on X, prove that \\fg\\BL <

[3]

Let B be a subset of a metric space X. Show that


{x : d(x, B) = 0} = B := closure of B,
{x : d(x, Bc) > 0} = B := interior of B,
{x : d(x, B) = 0 = d(x, Bc)} = B\B = 3B = boundary of B.
Hint: If d(x, B) = 0, there exists points xn e B with d(x, xn) - 0.

[4]

Let P be a probability measure on a separable metric space, and B b e a Borel set


for which P(BB) > 0.

187

7.6 Problems

(i) For each e > 0, show that there is a partition of SB into disjoint Borel sets
{Df : i N} each with diameter less than . Hint: Consider the union of balls
of radius e/3 centered at the points of a countable dense subset of dB.
(ii) For each i, find points jcf B and yt e Bc such that </(*,-, Dt) < e and
d(yi, Di) < e. Define a probability measure P' by replacing all the P mass
in Di by a point mass (of size PDt) at *,-, for each i. Define P" similarly, by
concentrating the mass in each D at yt. Show that P^B = PB and P"B = PB
for each > 0.
(iii) Show that, for each with ||||BL := K < oo,

|P - P'| < KP (SB)

and

|P - P"l\ < KeP {dB).

Deduce that P' P and P " ~> P as e -> 0, even though we have PfB ==

PB > PB and P" == PB < PB.


[5] For x = (JCI, JC2) R2, define Q% = {(y\,y2) e M2 : yi < x\, yt < X2}, the quadrant
with vertex x. Let {Xn} be a sequence of random vectors and P be a probability
measure such that F{Xn <2x} -> PQX for all x such that P (dQx) = 0. Show that
Xn ^ P by following these steps.
(i) Write for the class of all lines parallel to a coordinate axis. Show that all
except countably many lines in have zero P measure. (Hint: How many
horizontal lines can have P measure greater than 1/n?)
(ii) Given e BL(R2), with 0 < < 1, show that there exist disjoint rectangles
Si,..., Sm with sides parallel to the coordinate axes, such that (a) P (35,) = 0
for each each i; (b) the set B = (Jf $ has P-measure greater than 1 e\ (c) the
function oscillates by less than on each Si.
(iii) Choose arbitrarily points x, in Si, for i = 1,..., m. Define functions g(x) =
J2A* ^ Si}(Xi). Show that |g(x) - (x)| < c + {x i B] for all x.
(iv) Use a sandwiching argument to show that F(Xn) ~> P, then deduce that
Xn^P.
[6]

Let X be a metric space. Show that the map t/r : (JC, y) -> d(x, y) is continuous.
Deduce that f is S(X2)\3(E)-measurable. If X is separable, deduce that \(r is
$(X) 0 B(X)\(E)-measurable, and hence co H> d(X(w), Y(<o)) is measurable if X
and Y are random elements of X.

[7]

If P = Q for each in BL(%), show that P = 0 as measures on 3(X).

[8]

Let X be a separable metric space. Show that A(P, Q) := sup{|P Ql\ : \\\\BL < 1}
is a metric for weak convergence. That is, show that Pn ~* P if and only if
A(Pn, P) - 0, by following these steps. (Read the proof of Lemma <37> for
hints.)
(i) Show that A is a metric, and \Pl - Q\ < ||||BLA(e, P), for each L Deduce
that Pn -+ P, for each G BL(X), if D(Pn, P) ~ 0.
(ii) Let Xo := [xf : i e N} be dense in X. For a fixed 6 > 0, define fo(x) ==
and ft(x) := (1 -rf(jc,^)/O + . Define Gk := uf =1 {^ > 1/2}. Choose k so

188

Chapter 7:

Convergence in distribution

that PGkk < . Show that each function lt := /i/o<i<* ft belongs to BL(X).
Show that 0(x) < 2c for x e Gk. Show that ]Cf=0^ s * S h o w t h a t
diam{, > 0} < 2e for i > 1.
(iii) For each i > 1, choose an xt from {/ > 0}. Show that
L

4 + G\

if

(iv) For each probability measure G> and each I with \\1\\BL S 1 show that
|fi - P| < 8 + QG\ + PG + E t i \QU - Plil
(v) If Pn ~* P, deduce that A(Pn, P) < 10^ eventually.
[9]

Let y and Z be metric spaces, such that y is separable. Let d be the metric on
defined in Example <14>. Let Po, P be probability measures on S(y), and Qo Q be
probability measure on B(Z). For a fixed I in BL(V x Z), define ho(z) := PQ (y, z)
and gQ(y) := G*(y,z).
(i) Show that ||AOIUL < P b i and \\gQ\\BL < W\BL.
(ii) Let Ay and A*, denote the analogs of the metric from Problem [8], Show that
I (P 0 Q)l - (Po 0 fioXI < i ^ c - P68Q\ + \Qho - Qoho\
(iii) Show that AyxZ(P G, Po 0 Go) < AV(P, Po) + AZ(G, Go).
(iv) If Pn ^ Po and Qn ^ Go, show that Pn G ^ ^o Go, even if 2, is not
separable. (For separable Z, the result follows from (iii); otherwise use (ii).)

[10]

Suppose [Xn] are random elements of X, and P is a probability measure P on S(X),


for which P{Xn B] -> PB for every P-continuity set. Let / be a bounded
measurable function on X (with no loss of generality assume 0 < / < 1) that is
continuous at all points except those of a P-negligible set K.
(i) For each real t, show that the boundary of the set {/ > t] is contained
in 3sf U {/ = t}. Deduce that {/ > t] is a P-continuity set for almost all
(Lebesgue measure) /. Hint: Consider sequences xn - x and yn -> JC with
/(*) > t > f(yn).
(ii) Show that f(Xn) = /0! F{/(XW) > t)dt -+ Pf.

[11]

Suppose iin -> /i, and an2 - a 2 , with both limits finite. Let Z have a #(0,1) distribution. Show that |P(Mn+ornZ)~P(M+aZ)| < ||||BiLP(l A (\iiH - n\ + |orn - a| |Z|)).
Deduce, by Dominated Convergence, that N(iin, a 2 ) ^ iV(/x, a 2 ).

[12]

Suppose Xn has a ^(^n, a 2 ) distribution, and Xn ^ P.


(i) Show that ^ := lim/in and a 2 := lima 2 must exist as finite limits, and that P
must be the N(/i, a 2 ) distribution. Hint: Choose M with P{M} = 0 = P{-M]
and P[-M, M] > 3/4. If |/xw| > M or if an is large enough, show that
P{|XJ > M} > 1/2. Show that all convergent subsequences of (inn,crn) must
converge to the same limit.

189

7.6 Problems

(ii) Extend the result to sequences of random vectors. Hint: Use part (i) to prove
boundedness of {/xn} and each diagonal element of {Vn}. Use Cauchy-Schwarz
to deduce that all elements of [Vn] are bounded.
[13]

Suppose the random variables {Xn} converge in distribution to X, and that {An} and
{Bn} are sequences of constants with An -* A and Bn - B (both limits finite).
Show that AnXn + Bn ~* AX + B. Generalize to random vectors.

[14]

Let Y be a randomfc-vectorwith fi := FY and V := var(r). Let V have the representation V = LA2L\ with L an orthogonal matrix and A := diag(A.i,..., A*) each A.,
nonnegative. Define R := LA. Let W be a random ^-vector of independent #(0,1)
random variables.
(i) Show that |/x| < P|F|. Hint: For a unit vector u in the direction of fi, use the
fact that uY < \Y\.
(ii) Show that W\RW\3 = P| ]T\ A.,-W-|3 < (trace V) 3/2 PI#(0, l)| 3 .
(iii) Show that P|/z + RW\3 < 8P|F| 3 -f- 8 (P|K|2)3/2P|JV(0,1)|3.

[15]

Let / be a bounded, measurable, real-valued function on the real line. Let k be a


Lebesgue integrable function, with derivative k'. Suppose there exists a Lebesgue
integrable function M with \k(x + 8) - k(x)\ < \8\M(x) for all |5| < 1 and all x.
(i) Define g(x) := f f(x + y)k(y)dy = f f(z)k(z - x)dz. Use Dominated
Convergence to justify differentiation under the integral sign to prove that g is
differentiable, with derivative gf(x) = / f(x 4- y)kf(y) dy.
(ii) Let k(x) := p(x) exp(~jr2/2), with p a polynomial in x. Show that A: and each of
its derivatives satisfies the stated assumptions. Deduce that the corresponding g
has bounded derivatives of all orders. Hint: Consider the case p(x) := xd.
Show that \e* - 1| < |f|^|r| for all real t.
(iii) For each a > 0, show that the function x H> k(x/a)/a
derivatives also satisfies the assumptions.

and each of its

[16]

Let {Xn} be a sequence of random variables, all defined on the same probability
space. If Xn = 0P(1), we know from Chapter 2 that there is a subsequence
{Xni : i N} for which Xn.(co) = o(l) for almost all a>. If, instead, Xn = 0 P (1),
must there exist a subsequence for which Xni(co) = 0(1) for almost all of! Hint:
Let & : i No} be a sequence of independent random variables, each distributed
Uniform(0,1). Consider Xn := (o - fei)""1.

[17] Let ψ be a strictly increasing function on ℝ+ with ψ(0) = 0 and ψ(t) → 1 as t → ∞. Show that a sequence of random vectors {Xn} is of order Op(1) if and only if lim sup_n Pψ(|Xn|) < 1.

[18] Let {Xn} and {Yn} be sequences of random k-vectors, with Xn and Yn defined on the same space and independent of each other. Suppose Xn − Yn = Op(1). Show that there exists a sequence of nonrandom vectors {an} for which Xn − an = Op(1). Hint: For probability measures P and Q, show that if P^x Q^y ψ(|x − y|) ≤ M then P^x ψ(|x − y|) ≤ M for at least one y. Use Problem [17].


[19] If X has a Poisson(λ) distribution, show that √X − √λ ⇝ N(0, 1/4) as λ → ∞.

[20] Let {Xn,i} be a triangular array of random variables, independent within each row and satisfying
(a) Σi P{|Xn,i| > ε} → 0 for each ε > 0,
(b) Σi P(X²n,i {|Xn,i| ≤ ε}) → 1 for each ε > 0.
Show that Σi Xn,i − An ⇝ N(0,1), where An := Σi P Xn,i{|Xn,i| ≤ 1}. Hint: Consider truncated variables ηn,i := Xn,i{|Xn,i| ≤ εn} and ξn,i := ηn,i − Pηn,i, for a suitable {εn} sequence.
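A quick Monte Carlo check of Problem [19] (Python with numpy; the sample sizes are arbitrary): for large λ the transformed variable √X − √λ should be nearly centered with standard deviation close to 1/2.

    import numpy as np

    rng = np.random.default_rng(1)
    for lam in [10.0, 100.0, 1000.0]:
        x = rng.poisson(lam, size=200_000)
        w = np.sqrt(x) - np.sqrt(lam)       # variance-stabilizing transformation
        print(f"lambda={lam:6.0f}   mean={w.mean():+.4f}   sd={w.std():.4f}   (target sd = 0.5)")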
[21] Let {Xn,i} be a triangular array of random variables, independent within each row and satisfying
(i) maxi |Xn,i| → 0 in probability,
(ii) Σi P Xn,i{|Xn,i| ≤ ε} → μ for each ε > 0,
(iii) Σi var(Xn,i{|Xn,i| ≤ ε}) → σ² < ∞ for each ε > 0.
Show that Σi Xn,i ⇝ N(μ, σ²). Hint: Define ηn,i := Xn,i{|Xn,i| ≤ εn} and ξn,i := ηn,i − Pηn,i, where εn tends to zero slowly enough that: (a) P{maxi |Xn,i| > εn} → 0; (b) Σi P Xn,i{|Xn,i| ≤ εn} → μ; and (c) Σi var(Xn,i{|Xn,i| ≤ εn}) → σ².

[22] If each of the components {Xn,i}, for i = 1, ..., k, of a sequence of random k-vectors {Xn} is of order Op(1), show that Xn = Op(1).

[23] Suppose f(x) = o(|x|) as x → 0, and g(x) = O(|x|^k) as x → 0. Suppose Xn = Op(an) for some sequence of constants an tending to zero. Derive a bound for the rate at which f(g(Xn)) tends to zero.

7. Notes
See Daw & Pearson (1972) and Stigler (1986a, Chapter 2) for discussion of De Moivre's 1733 derivation of the normal approximation to the binomial distribution (reproduced on pages 243-250 of the 1967 reprint of de Moivre 1718), the first central limit theorem.
Theorem <19> is essentially due to Liapounoff (1900, 1901), although the method of proof is due to Lindeberg (1922). I adapted my exposition of Lindeberg's method, in Section 2, from Billingsley (1968, Section 7), via Pollard (1984, Section III.4). The development of the CLT, from the simple idea described in Section 1 to formal limit theorems, has a long history, culminating in the work of several authors during the 1920's and 30's. For example, building on Lindeberg's ideas, Lévy (1931, Section 10) conjectured the form of the general necessary and sufficient condition for a sum of independent random variables to be approximately normally distributed, but established only the sufficiency. Apparently independently of each other, Lévy (1935) and Feller (1935) established necessary conditions for the CLT, under an assumption that individual summands satisfy a mild "asymptotic negligibility" condition. See the discussion by Le Cam (1986). Chapter 4 of Lévy (1937) and Chapter 3 of Lévy (1970) provide more insights into Lévy's


thinking about the CLT and the role of the normal distribution. The idea of using truncation to obtain general CLT's from the Lindeberg version of the theorem runs through much of Lévy's work.
Later works, such as Gnedenko & Kolmogorov (1949/68, Chapter 5) and Petrov (1972/75, Section IV.4), treat the CLT as a special case of more general limit theorems for infinitely divisible distributions; compare, for example, the direct argument in Problem [20] with Theorem 3 in Section 25 of the former or with Theorem 16 in Chapter 4 of the latter.
Theorem <36> for complete separable metric spaces is due to Prohorov (1956).
Independently, Le Cam (1957) proved similar results for more general topological
spaces. The monograph of Billingsley (1968) is still an excellent reference for
the theory of weak convergence on metric spaces. Together with the slightly
more abstract account by Parthasarathy (1967), Billingsley's exposition stimulated
widespread interest in weak convergence methods by probabilists and statisticians
(including me) during the 1970's. See Dudley (1989, Chapter 11) for an elegant
treatment that weaves in more recent ideas.
The stochastic order notation of Section 4 is due to Mann & Wald (1943). For
further examples see the survey paper by Chernoff (1956) and Pratt (1959).
REFERENCES

Alexandroff, A. D. (1940-43), 'Additive set functions in abstract spaces', Mat. Sbornik. Chapter 1: 50(NS 8) 1940, 307-342; Chapters 2 and 3: 51(NS 9) 1941, 563-621; Chapters 4 and 5: 55(NS 13) 1943, 169-234.
Billingsley, P. (1968), Convergence of Probability Measures, Wiley, New York.
Chernoff, H. (1956), 'Large sample theory: parametric case', Annals of Mathematical Statistics 27, 1-22.
Daw, R. H. & Pearson, E. S. (1972), 'Abraham De Moivre's 1733 derivation of the normal curve: a bibliographical note', Biometrika 59, 677-680.
de Moivre, A. (1718), The Doctrine of Chances, first edn. Second edition 1738. Third edition, from 1756, reprinted in 1967 by Chelsea, New York.
Dudley, R. M. (1985), 'An extended Wichura theorem, definitions of Donsker classes, and weighted empirical distributions', Springer Lecture Notes in Mathematics 1153, 141-178. Springer, New York.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Feller, W. (1935), 'Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung', Mathematische Zeitschrift 40, 521-559. Part II, same journal, 42 (1937), 301-312.
Gnedenko, B. V. & Kolmogorov, A. N. (1949/68), Limit Theorems for Sums of Independent Random Variables, Addison-Wesley. English translation in 1968, of original Russian edition from 1949.
Hoffmann-Jørgensen, J. (1984), Stochastic Processes on Polish Spaces, Unpublished manuscript, Aarhus University, Denmark.
Le Cam, L. (1957), 'Convergence in distribution of stochastic processes', University of California Publications in Statistics 2, 207-236.
Le Cam, L. (1986), 'The central limit theorem around 1935', Statistical Science 1, 78-96.
Lévy, P. (1931), 'Sur les séries dont les termes sont des variables éventuelles indépendantes', Studia Mathematica 3, 119-155.
Lévy, P. (1935), 'Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou enchaînées', Journal de Math. Pures Appl. 14, 347-402.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Second edition, 1954.
Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.
Liapounoff, A. M. (1900), 'Sur une proposition de la théorie des probabilités', Bulletin de l'Académie impériale des Sciences de St. Pétersbourg 13, 359-386.
Liapounoff, A. M. (1901), 'Nouvelle forme du théorème sur la limite de probabilité', Mémoires de l'Académie impériale des Sciences de St. Pétersbourg.
Lindeberg, J. W. (1922), 'Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung', Mathematische Zeitschrift 15, 211-225.
Mann, H. B. & Wald, A. (1943), 'On stochastic limit and order relationships', Annals of Mathematical Statistics 14, 217-226.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic, New York.
Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pratt, J. W. (1959), 'On a general concept of "in probability"', Annals of Mathematical Statistics 30, 549-558.
Prohorov, Y. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and Its Applications 1, 157-214.
Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, Massachusetts.
Topsøe, F. (1970), Topology and Measure, Vol. 133 of Springer Lecture Notes in Mathematics, Springer-Verlag, New York.
Varadarajan, V. S. (1961), 'Measures on topological spaces', Mat. Sbornik 55(97), 35-100. American Mathematical Society Translations 48 (1965), 161-228.

Chapter 8

Fourier transforms
SECTION 1 presents a few of the basic properties of Fourier transforms that make them
such a valuable tool of probability theory.
SECTION 2 exploits a mysterious coincidence, involving the Fourier transform and
the density function of the normal distribution, to establish inversion formulas for
recovering distributions from Fourier transforms.
SECTION *3 explains why the coincidence from Section 2 is not really so mysterious.
SECTION 4 shows that the inversion formula from Section 2 has a continuity property,
which explains why pointwise convergence of Fourier transforms implies convergence
in distribution.
SECTION *5 establishes a central limit theorem for triangular arrays of martingale
differences.
SECTION 6 extends the theory to multivariate distributions, pointing out how the calculations
reduce to one-dimensional analogs for linear combinations of coordinate variables:
the Cramér and Wold device.
SECTION *7 provides a direct proof (no Fourier theory) of the fact that the family of
(one-dimensional) distributions for all linear combinations of a random vector uniquely
determines its multivariate distribution.
SECTION *8 illustrates the use of complex-variable methods to prove a remarkable property
of the normal distribution: the Lévy-Cramér theorem.

1. Definitions and basic properties


Some probabilistic calculations simplify when reexpressed in terms of suitable
transformations, such as the probability generating function (especially for random
variables taking only positive integer values), the Laplace transform (especially
for random variables taking only nonnegative values), or the moment generating
function (for random variables with rapidly decreasing tail probabilities). The
Fourier transform shares many of the desirable properties of these transforms
without the restrictions on the types of random variable to which it is best applied,
but with the slight drawback that we must deal with random variables that can take
complex values.
The integral of a complex-valued function, f := g + ih, is defined by splitting
into real (ℜf := g) and imaginary (ℑf := h) parts: μf := μg + iμh. These integrals
inherit linearity and the dominated convergence property from their real-valued
inherit linearity and the dominated convergence property from their real-valued


counterparts. The increasing functional property becomes meaningless for complex integrals: the complex numbers are not ordered. The inequality |μf| ≤ μ|f| still holds if |·| is interpreted as the modulus of a complex number (Problem [1]).
The Fourier transform (which is often referred to as the characteristic function in the probability and statistics literature) of a probability measure P on B(ℝ) is defined by
    ψP(t) := P^x e^{ixt}    for t in ℝ.
Similarly, the Fourier transform of a real random variable X is defined by
    ψX(t) := P exp(iX(ω)t)    for t in ℝ.
That is, ψX is the Fourier transform of the distribution of X.


REMARK. Without the i in the exponent, we would be defining the moment generating function, Pe^{Xt}, which might be infinite except at t = 0, as in the case of the Cauchy distribution (Problem [6]).

Fourier transforms are well defined for every probability measure on B(ℝ), and
    |ψP(t)| = |P^x exp(ixt)| ≤ P^x |exp(ixt)| = 1    for all real t.
They are uniformly continuous, because
    |ψP(t + δ) − ψP(t)| ≤ P^x |e^{ix(t+δ)} − e^{ixt}| = P^x |e^{ixδ} − 1|,
which tends to zero as δ → 0, by Dominated Convergence. As a map from ℝ into the complex plane ℂ, the Fourier transform defines a curve that always lies within the unit disk. The curve touches the boundary of the disk at 1 + 0i, corresponding to t = 0. If it also touches for some nonzero value of t, then P must concentrate on a regularly spaced, countable subset of ℝ (Problem [2]). If P is absolutely continuous with respect to Lebesgue measure then ψP(t) → 0 as |t| → ∞ (Problem [4]).
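The boundedness and the value at t = 0 are easy to see in an empirical version of the transform. A small sketch (Python with numpy; the Exponential(1) example is my own choice, with exact transform 1/(1 − it) for comparison):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1.0, size=100_000)

    def ecf(t):
        """Empirical characteristic function: average of exp(ixt) over the sample."""
        return np.mean(np.exp(1j * np.outer(t, x)), axis=1)

    t = np.linspace(-10.0, 10.0, 9)         # includes t = 0
    psi = ecf(t)
    print(np.round(np.abs(psi), 4))         # every modulus is at most (about) 1
    print(psi[t == 0.0])                    # exactly 1 at t = 0
    print(np.max(np.abs(psi - 1.0 / (1.0 - 1j * t))))   # close to the exact transform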
Fourier methods are particularly effective for dealing with sums of independent
random variables, chiefly due to the following simplification.
<1> Theorem. If X1, ..., Xn are independent then the Fourier transform of X1 + ... + Xn factorizes into ψ_{X1}(t) · · · ψ_{Xn}(t), for all real t.
Proof. Extend the factorization property of real functions of the Xj's to the complex functions exp(itXj).
REMARK. Be careful that you do not invent a false converse to the Theorem. If X has a Cauchy distribution and Y = X then ψ_{X+Y}(t) = exp(−2|t|) = ψX(t)ψY(t), but X and Y are certainly not independent (Problem [6]).
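Both the Theorem and the warning can be checked by simulation. A sketch (Python with numpy; the particular distributions and the value of t are mine):

    import numpy as np

    rng = np.random.default_rng(3)
    n, t = 200_000, 1.7
    cf = lambda sample: np.mean(np.exp(1j * t * sample))

    x = rng.uniform(-1.0, 1.0, n)           # independent of y
    y = rng.standard_normal(n)
    print(cf(x + y), cf(x) * cf(y))         # agree, as Theorem <1> predicts

    c = rng.standard_cauchy(n)              # the false converse: Y = X, Cauchy
    print(cf(2 * c), cf(c) ** 2)            # both approximate exp(-2|t|)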

There are various ways to extract information about a distribution from


its Fourier transform. For example, the next Theorem shows that existence of
finite moments of the distribution gives polynomial approximations to the Fourier
transform, corresponding to the purely formal operation of taking expectations
term-by-term in the Taylor expansion for exp(iXt).


<2> Theorem. If P|X|^k < ∞, for a positive integer k, then the Fourier transform has the approximation
    ψX(t) = 1 + itPX + ((it)²/2!)PX² + ... + ((it)^k/k!)PX^k + o(|t|^k)    for t near 0.
Proof. Apply Problem [5] to P cos(Xt) and P sin(Xt).
<3> Example. The Poisson(λ) distribution has Fourier transform
    ψ(t) = Σ_{k≥0} e^{ikt} e^{−λ} λ^k / k! = exp(λ(e^{it} − 1)).
An appeal to the central limit theorem and a suitable passage to the limit will lead us to the Fourier transform for the normal distribution.
Suppose X1, X2, ... is a sequence of independent random variables, each distributed Poisson(1). By the central limit theorem for identically distributed summands, Zn := (X1 + ... + Xn − n)/√n ⇝ N(0,1). For each fixed t the function e^{ixt} is bounded and continuous in x. Thus lim_{n→∞} P exp(itZn) is the Fourier transform of the N(0,1) distribution. Evaluate the limit using Theorem <1>:
    P exp(itZn) = exp(−it√n) Π_{k=1}^n P exp(itXk/√n) = exp(n(e^{it/√n} − 1) − it√n).
The last exponent has the approximation
    n(e^{it/√n} − 1) − it√n = −½t² + O(|t|³/√n).
Notice the way the error term behaves as a function of n for fixed t. In the limit we get exp(−t²/2) as the Fourier transform of the N(0,1) distribution.
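The behavior of the error term can be watched directly. A numerical sketch (Python with numpy; t = 2 is an arbitrary choice) evaluates the exponent n(e^{it/√n} − 1) − it√n and compares it with −t²/2:

    import numpy as np

    t = 2.0
    for n in [10, 100, 10_000, 1_000_000]:
        exponent = n * (np.exp(1j * t / np.sqrt(n)) - 1.0) - 1j * t * np.sqrt(n)
        print(f"n={n:8d}   exponent={exponent:.6f}   |error|={abs(exponent + t**2 / 2):.2e}")

The |error| column decreases at the rate |t|³/√n, as claimed.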
REMARK. If Z has a N(0,1) distribution and s is real,
    P exp(sZ) = (2π)^{−1/2} ∫_{−∞}^{∞} exp(sz − ½z²) dz = (2π)^{−1/2} ∫_{−∞}^{∞} exp(½s² − ½(z − s)²) dz = exp(s²/2).
Formally, we have only to replace s by it to get the Fourier transform for Z. A rigorous justification requires some complex-variable theory, such as the uniqueness of analytic continuations.

2. Inversion formula

Written out in terms of the N(0,1) density, the final result from Example <3> becomes
    (2π)^{−1/2} ∫ exp(iyt − ½y²) dy = exp(−½t²)    for real t.


The function on the right-hand side looks very like a normal density; it lacks only the standardizing constant. If we substitute t/σ for t, then make a change of variables z = y/σ, we get an integral representation for the N(0, σ²) density φσ:
<4>    (1/2π) ∫ exp(−izt − ½σ²z²) dz = exp(−½t²/σ²) / (σ√(2π)) =: φσ(t).

This equality is a special case of a general inversion formula that relates Fourier transforms to densities. Indeed, the general formula can be derived from the special case.
Suppose Z has a N(0,1) distribution, independently of a random variable X with Fourier transform ψ(·). From the convolution formula for densities with respect to Lebesgue measure (Section 4.4), for σ > 0 the sum X + σZ has a distribution with density f_σ(y) := P^ω φσ(y − X(ω)). Substituting for φσ from <4>, we have
    f_σ(y) = (1/2π) P^ω ∫_{−∞}^{∞} exp(−iz(y − X(ω)) − ½σ²z²) dz.
The integrand, which is bounded in absolute value by exp(−σ²z²/2), is integrable (as a function of ω and z) with respect to the product of P with Lebesgue measure. Interchanging the order of integration, as justified by the Fubini theorem, we get the basic integral representation,
<5>    f_σ(y) = (1/2π) ∫_{−∞}^{∞} ψ(z) exp(−izy − ½σ²z²) dz.
Notice that ψ(t) exp(−σ²t²/2) is the Fourier transform of X + σZ. If we write P for the distribution of X, the formula becomes
<6>    density of P * N(0, σ²) = (1/2π) ∫ e^{−izy} (Fourier transform of P * N(0, σ²)) dz,
a special case of the inversion formula to be proved in Theorem <10>.


Limiting arguments as σ tends to zero in the basic formula <5>, or <6>, lead to several important conclusions.
<7> Theorem. The distribution of a random variable is uniquely determined by its Fourier transform.
Proof. For h a bounded measurable function on ℝ and f_σ as defined above,
    P h(X + σZ) = ∫_{−∞}^{∞} h(y) f_σ(y) dy,
which shows, via formula <5>, that the Fourier transform of the random variable X uniquely determines the density for the distribution of X + σZ. Specialize to h in C(ℝ), the class of all bounded, continuous real functions on ℝ. By Dominated Convergence, Ph(X + σZ) → Ph(X) as σ → 0. Thus the Fourier transform uniquely determines all the expectations Ph(X), with h ranging over C(ℝ). A generating class argument shows that these expectations uniquely determine the distribution of X, as a probability measure on B(ℝ).


You might feel tempted to rearrange limit operations in the previous proof, to arrive at an assertion that
    Ph(X) = lim_{σ→0} ∫ h(y) f_σ(y) dy = ∫ h(y) lim_{σ→0} f_σ(y) dy,
and thereby conclude that the distribution of X has density
<8>    f(y) = lim_{σ→0} f_σ(y) = (1/2π) ∫ ψ(z) exp(−izy) dz
with respect to Lebesgue measure. Of course such temptation should be resisted. The migration of limits inside integrals typically requires some domination assumptions. Moreover, it would be exceedingly strange to derive densities for measures that are not absolutely continuous with respect to Lebesgue measure.
<9> Example. Let P denote the probability measure that puts mass 1/2 at ±1. It has Fourier transform ψ(t) = (e^{it} + e^{−it})/2. The integral on the right-hand side of <8> does not exist for this Fourier transform. Application of formulas <5> then <4> gives
    f_σ(y) = (1/4π) ∫_{−∞}^{∞} (e^{iz} + e^{−iz}) exp(−izy − ½σ²z²) dz = ½φσ(y − 1) + ½φσ(y + 1).
That is, f_σ is the density for the mixture ½N(−1, σ²) + ½N(+1, σ²), a density that is trying to behave like two point masses. The limit of f_σ does not exist in the ordinary sense.
<10> Theorem. If P has Fourier transform ψ for which ∫_{−∞}^{∞} |ψ(z)| dz < ∞, then P is absolutely continuous with respect to Lebesgue measure, with a density
    f(y) := (1/2π) ∫_{−∞}^{∞} ψ(z) e^{−izy} dz
that is bounded and uniformly continuous.


Proof. The convolution density f_σ(y), as in <5>, is bounded by the constant C := ∫|ψ|/(2π). Uniform continuity follows as for Fourier transforms. By Dominated Convergence, f_σ(y) converges pointwise to f. Thus 0 ≤ f ≤ C. If h is continuous and vanishes outside a bounded interval [−M, M], a second appeal to Dominated Convergence gives
    Ph = lim_{σ→0} ∫_{−M}^{M} h(y) f_σ(y) dy = ∫_{−M}^{M} h(y) f(y) dy.
That is, Ph = ∫ hf for a large enough class of functions h to ensure (via a generating class argument) that P has density f.
REMARK.
The inversion formula for integrable Fourier transforms is the basis
for a huge body of theory, including Edgeworth expansions, rates of convergence
in the central limit theorem, density estimation, and more. See for example the
monograph of Petrov (1972/75).
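The inversion formula of Theorem <10> can be tried numerically on the integrable transform ψ(t) = 1/(1 + t²) from Problem [6]. A sketch (Python with numpy; the truncation level and grid are ad hoc choices of mine) recovers the double-exponential density ½e^{−|y|}:

    import numpy as np

    z = np.linspace(-1000.0, 1000.0, 2_000_001)
    dz = z[1] - z[0]
    psi = 1.0 / (1.0 + z ** 2)              # transform of the double exponential

    def density(y):
        """Riemann-sum approximation to (1/2pi) int psi(z) exp(-izy) dz."""
        return np.real(np.sum(psi * np.exp(-1j * z * y)) * dz) / (2 * np.pi)

    for y in [0.0, 1.0, 2.0]:
        print(f"y={y}:   inverted={density(y):.5f}   true={0.5 * np.exp(-abs(y)):.5f}")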

*3. A mystery?
I have always been troubled by the mysterious workings of Fourier transforms. For example, it seems an enormous stroke of luck that the Fourier transform of the normal distribution is proportional to its density function, the key to the inversion formula <5>. Perhaps it would be better to argue slightly less elegantly, to see what is really going on.
Start from the slightly less mysterious fact that there exists at least one random variable W whose Fourier transform ψ0 is Lebesgue integrable. (See Problem [3] for one way to ensure integrability of the Fourier transform.) Symmetrize by means of an independent copy W′ of W to get a random variable W − W′ with a real-valued Fourier transform ψ = |ψ0|², which is also Lebesgue integrable. That is, there exists a probability distribution Q whose Fourier transform ψ is both nonnegative and Lebesgue integrable. For some finite constant c_Q, the function c_Q ψ defines a density (with respect to Lebesgue measure m on the real line) of a probability measure Q̃ on B(ℝ).
For h bounded and measurable,
    P^x Q̃^y h(x + σy) = P^x m^y (c_Q (Q^z e^{iyz}) h(x + σy)).
If h vanishes outside a bounded interval, Fubini lets us interchange the order of integration, make a change of variable w = x + σy in the Lebesgue integration, then change the order of integration again, to reexpress the right-hand side as
    c_Q σ^{−1} Q^z m^w (h(w) P^x exp(iz(w − x)/σ)),
a function of the Fourier transform of P. Again we have a special case of the inversion formula, from which all results flow.
The method in Section 2 corresponds to the case where both Q and Q̃ are normal distributions, but that coincidence is not vital to the method.

4. Convergence in distribution

The representation <5> shows not only that the density f_σ of X + σZ is uniquely determined by the Fourier transform ψX of X but also that it depends on ψX in a continuous way. The factor exp(−σ²z²/2) ensures that small perturbations of ψX do not greatly affect the integral. The traditional way to make this continuity idea precise involves an assertion about pointwise convergence of Fourier transforms. The proof makes use of a simple fact about smoothing and a simple consequence of Dominated Convergence known as Scheffé's Lemma:
    Let f, f1, f2, ... be nonnegative, μ-integrable functions for which fn → f a.e. [μ] and μfn → μf. Then μ|fn − f| → 0.
See Section 3.1 for the proof.

<11> Lemma. Let X, X1, X2, ... and Z be random variables for which Xn + σZ ⇝ X + σZ for each σ > 0. Then Xn ⇝ X.
Proof. For ℓ in BL(ℝ),
    |Pℓ(Xn) − Pℓ(X)| ≤ P|ℓ(Xn) − ℓ(Xn + σZ)| + |Pℓ(Xn + σZ) − Pℓ(X + σZ)| + P|ℓ(X + σZ) − ℓ(X)|.
On the right-hand side, the middle term tends to zero, by assumption. The other two terms are both bounded by ‖ℓ‖_BL P(1 ∧ σ|Z|), which tends to zero as σ → 0.
<12> Continuity Theorem. Let X, X1, X2, ... be random variables with Fourier transforms ψ, ψ1, ψ2, ... for which ψn(t) → ψ(t) for each real t. Then Xn ⇝ X.
Proof. Let Z have a N(0,1) distribution independent of X and all the Xn. The random variables Xn + σZ have distributions with densities
    f_{n,σ}(y) = (1/2π) ∫ ψn(t) exp(−ity − ½σ²t²) dt
with respect to Lebesgue measure, and X + σZ has a distribution with a similarly defined density f_σ. By Dominated Convergence, f_{n,σ}(y) → f_σ(y) as n → ∞, for each fixed y. If h is bounded and measurable,
    |Ph(Xn + σZ) − Ph(X + σZ)| = |∫ h(y) f_{n,σ}(y) dy − ∫ h(y) f_σ(y) dy|
        ≤ M ∫ |f_{n,σ}(y) − f_σ(y)| dy    where M := sup |h|
        → 0    by Scheffé's Lemma.
Thus Xn + σZ ⇝ X + σZ for each σ > 0, and the asserted convergence in distribution follows via Lemma <11>.
<13> Example. Suppose X1, X2, ... are independent, identically distributed random variables with PXi = 0 and PXi² = 1. From Chapter 7 we know that
    (X1 + ... + Xn)/√n ⇝ N(0, 1).
Here is the proof of the same result using Fourier transforms.
Let ψ(·) be the Fourier transform of X1. From Theorem <2>,
    ψ(t) = 1 − ½t² + o(t²)    as t → 0.
In particular, for fixed t,
    ψ(t/√n) = 1 − ½t²/n + o(1/n)    as n → ∞.
The standardized sum has Fourier transform
    ψ(t/√n)^n = (1 − ½t²/n + o(1/n))^n → exp(−½t²).
The limit equals the N(0,1) Fourier transform. By the Continuity Theorem, the asymptotic normality of the standardized sum follows.
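For a concrete instance, take X1 uniform on [−√3, √3], which has mean zero, variance one, and transform ψ(t) = sin(√3 t)/(√3 t). A numerical sketch (Python with numpy; the grid of t values and of n is arbitrary):

    import numpy as np

    def psi(t):
        """Transform of Uniform[-sqrt(3), sqrt(3)]; np.sinc(x) = sin(pi x)/(pi x)."""
        return np.sinc(np.sqrt(3) * np.asarray(t) / np.pi)

    t = np.array([0.5, 1.0, 2.0])
    for n in [1, 10, 100, 1000]:
        print(n, np.round(psi(t / np.sqrt(n)) ** n, 5), np.round(np.exp(-t ** 2 / 2), 5))

The two arrays printed on each line agree more and more closely as n grows.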
Certainly the calculations in the previous Example involve less work than the Lindeberg argument plus the truncations used in Chapter 7 to establish the same central limit theorem. I would point out, however, the amount of theory needed to establish the Continuity Theorem. Moreover, for the corresponding Fourier transform proofs of central limit theorems for more general triangular arrays, the calculations would parallel those for the Lindeberg method.
REMARK. A slightly stronger version of the Continuity Theorem (due to Cramér; see the Notes) also goes by the same name. The assumptions of the Theorem can be weakened (Problem [9]) to mere existence of a pointwise limit ψ(t) = limn ψn(t) with ψ continuous at the origin. Initially it need not be identified as the Fourier transform of some specified distribution. The stronger version of the Theorem asserts the existence of a distribution P for which ψ is the Fourier transform, such that Xn ⇝ P.

*5. A martingale central limit theorem

Fourier methods have some advantages over the methods of Chapter 7. For example, the proof of the following important theorem seems to depend in an essential way on the factorization properties of the exponential function.
REMARK. The Lindeberg method for independent summands, as explained in Section 7.2, can be extended to martingale differences under a natural assumption on the sum of conditional variances; see Lévy (1937, Section 67).

<14> Theorem. (McLeish 1974) For each n in ℕ let {ξn,j : j = 1, ..., kn} be a martingale difference array, with respect to a filtration {Fn,j}, for which:
(i) Σj ξ²n,j → 1 in probability;
(ii) maxj |ξn,j| → 0 in probability;
(iii) supn P maxj ξ²n,j < ∞.
Then Σj ξn,j ⇝ N(0, 1) as n → ∞.
Proof. Write Sn for Σj ξn,j and Mn for maxj |ξn,j|. Denote expectations conditional on Fn,j by Pj(·). The omission of the n subscript should cause no confusion, because all calculations will be carried out for a fixed n. Let me also abbreviate ξn,j to ξj, and simplify notation by assuming kn = n.
By the Continuity Theorem <12>, it suffices to show that P exp(itSn) → exp(−t²/2), for each fixed t in ℝ. A Taylor expansion of log(1 + itξj), when Mn is small, gives exp(itξj + t²ξ²j/2) ≈ 1 + itξj, and hence, via (i),
    exp(itSn) ≈ exp(−t²/2) Π_{j≤n} (1 + itξj).
Define Xm := Π_{j≤m} (1 + itξj), for m = 1, ..., n, with X0 := 1. Each Xm has expected value 1, because PXm = P(X_{m−1}(1 + itP_{m−1}ξm)) = PX_{m−1} = ... = PX0, which suggests P exp(itSn) ≈ exp(−t²/2)PXn = exp(−t²/2).
For the formal proof, use the error bound
    log(1 + z) = z − z²/2 + r(z)    with |r(z)| ≤ |z|³ for |z| ≤ 1/2.


Temporarily write zj for itξj. Notice that Σ_{j≤n} z²j = −t² + op(1), by (i). Also, when |t|Mn ≤ 1/2, which happens with probability tending to 1,
    Σ_{j≤n} |r(zj)| ≤ |t|³ Σ_{j≤n} |ξj|³ ≤ |t|³ Mn Σ_{j≤n} ξ²j = op(1)    by (ii) and (i).
Thus
    exp(itSn + ½t²) = Xn exp(½t² + ½Σ_{j≤n} z²j − Σ_{j≤n} r(zj)) = Xn exp(op(1)),
and
    Yn := exp(itSn + ½t²) − Xn → 0    in probability.
To strengthen the assertion to PYn → 0, it is enough to show that supn P|Yn|² < ∞, for then
    P|Yn| ≤ M^{−1} P|Yn|² + P(M ∧ |Yn|) = O(M^{−1}) + o(1).

(Compare with uniform integrability, as defined in Section 2.8.) The contribution from Sn to Yn is bounded in absolute value by a constant. We have only to control the contribution from Xn.
Consider first the obvious bound,
    |Xn|² = Π_{j≤n} (1 + |zj|²) = Π_{j≤n} (1 + t²ξ²j) ≤ exp(t² Σ_{j≤n} ξ²j).
By (i), the expression on the right-hand side is of order Op(1), but it needn't be bounded everywhere by a constant. We can achieve something closer to uniform boundedness by means of a stopping time argument. Write Qn(m) for Σ_{j≤m} ξ²j. Define stopping times τn := inf{m : Qn(m) > 2}, with the usual convention that τn := n if the sum never exceeds 2. Redefine zj as itξj{j ≤ τn}, a new sequence of martingale differences, for which P{itSn ≠ Σ_{j≤n} zj} ≤ P{τn < n} → 0. We have only to prove that P exp(Σ_{j≤n} zj + ½t²) → 1.
Repeat the argument from the previous paragraph, but with Xn redefined using the new zj's. We then have
    |Xn|² = Π_{j≤n} (1 + t²ξ²j{j ≤ τn}) ≤ exp(t² Σ_j ξ²j{j < τn}) (1 + t²M²n),
which gives supn P|Xn|² ≤ exp(2t²)(1 + t² supn P|Mn|²) < ∞. The asserted central limit theorem follows.
REMARKS.
(i) Notice the role played by the stopping times τn. They ensure that properties of the increments that hold with probability tending to one can be made to hold everywhere, if we can ignore the effect of the increment corresponding to j = τn. To control ξ_{n,τn}, the Theorem had to impose constraints on Mn, via (iii).
(ii) The sum Σj ξ²n,j plays the same role as a sum of variances in the corresponding theory for independent variables. Martingale central limit theorems sometimes impose constraints on the sum of conditional variances Σj P_{j−1} ξ²j. The sum of squared increments corresponds to the "square brackets" process [X]n of a martingale, and the sum of conditional variances corresponds to the "pointy brackets" process ⟨X⟩n. These two processes are also called the quadratic variation process for Xn and the compensator for X²n, respectively.
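A simulation of the Theorem, under assumptions of my own choosing: the increments ξn,j = cj Zj/√n below form a genuine martingale difference array (the weight cj depends only on the past, through Zj−1), Σj ξ²n,j → 1 by the law of large numbers, and the row sums behave like N(0,1). A sketch in Python with numpy:

    import numpy as np

    rng = np.random.default_rng(4)

    def row_sum(n):
        """One realization of S_n = sum_j xi_{n,j} for a martingale difference array."""
        z = rng.standard_normal(n + 1)
        c = np.where(z[:-1] > 0.0, np.sqrt(1.5), np.sqrt(0.5))   # predictable weights
        xi = c * z[1:] / np.sqrt(n)       # P_{j-1} xi_j = 0;  sum_j xi_j^2 -> 1
        return xi.sum()

    n, reps = 500, 20_000
    s = np.array([row_sum(n) for _ in range(reps)])
    print(f"mean={s.mean():+.3f}   var={s.var():.3f}")           # near 0 and 1
    print("P{S_n <= 1} ~", np.mean(s <= 1.0), "  (Phi(1) ~ 0.8413)")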

6. Multivariate Fourier transforms


The (multivariate) Fourier transform of a random k-vector X is defined as ψX(t) := P exp(it′X) for t ∈ ℝ^k. Many of the results for the one-dimensional Fourier transform carry over to the multidimensional setting with only notational changes. For example, if ψX is integrable with respect to Lebesgue measure m_k on ℝ^k then the distribution of X is absolutely continuous with respect to m_k, with density
    f(y) := (2π)^{−k} ∫ exp(−it′y) ψX(t) dt.
Once again the Fourier transform uniquely determines the distribution of the random vector, and the pointwise convergence of Fourier transforms implies convergence in distribution. These two results have two highly useful consequences:
(i) The distribution of X is uniquely determined by the family of distributions of all linear combinations t′X, as t ranges over ℝ^k.
(ii) If t′Xn ⇝ t′X for each t in ℝ^k then Xn ⇝ X.
Both assertions follow from the trivial fact that P exp(it′Y) is both the multivariate Fourier transform of the random vector Y, evaluated at t, and the Fourier transform of the random variable t′Y, evaluated at 1. The reduction to the one-dimensional case via linear combinations is usually called the Cramér-Wold device.
Consequence (ii) shows why one seldom bothers with direct proofs of multivariate limit theorems: They can usually be deduced from their one-dimensional analogues. For example, the multivariate central limit theorem of Section 7.3 is an immediate consequence of its univariate counterpart.
<15> Example. Suppose X is a random k-vector and Y is a random ℓ-vector for which
    P exp(is′X + it′Y) = P exp(is′X) P exp(it′Y)    for all s ∈ ℝ^k, all t ∈ ℝ^ℓ.
Pass to image measures to deduce that the joint distribution Q_{X,Y} of X and Y has the same Fourier transform as the product Q_X ⊗ Q_Y of the marginal distributions. By the uniqueness result (i) for Fourier transforms of distributions on ℝ^{k+ℓ}, we must have Q_{X,Y} = Q_X ⊗ Q_Y. That is, X and Y are independent.
You might regard this result as a sort of converse to Theorem <1>. Don't slip into the error of checking only that the factorization holds when s and t happen to be equal.
<16> Example. A random n-vector X is said to have a multivariate normal distribution if each linear combination t′X, for t ∈ ℝⁿ, has a normal distribution. In particular, each coordinate Xi has a normal distribution, with a finite mean and a finite variance. The random vector must have a well defined expectation μ := PX and variance matrix V := P(X − μ)(X − μ)′. The distribution of t′X is therefore N(t′μ, t′Vt), which implies that X must have Fourier transform
    P exp(it′X) = exp(it′μ − ½t′Vt)    for all t in ℝⁿ.
Write N(μ, V) for the probability distribution with this Fourier transform.
Every μ in ℝⁿ and nonnegative definite matrix V defines such a distribution. For if Z is an n-vector of independent N(0,1) random variables and if we factorize V as AA′, with A an n×n matrix, then the random vector X := μ + AZ has Fourier transform ψX(t) = P exp(it′(μ + AZ)) = exp(it′μ)ψZ(A′t). The random vector Z has Fourier transform ψZ(s) = Π_{j≤n} ψ(sj) = exp(−½|s|²), from which it follows that ψZ(A′t) = exp(−½|A′t|²) = exp(−½t′Vt).
If V is nonsingular, the N(μ, V) distribution is absolutely continuous with respect to n-dimensional Lebesgue measure, with density
    (2π)^{−n/2} (det V)^{−1/2} exp(−½(x − μ)′V^{−1}(x − μ)).
This result follows via the Jacobian formula (Rudin 1974, Chapter 8) for change of variable from the density (2π)^{−n/2} exp(−|x|²/2) for the N(0, I) distribution. (In fact, the Jacobian formula for nonsingular linear transformations is easy to establish by means of invariance arguments.)
If V is singular, the N(μ, V) distribution concentrates on a translate of a subspace, {x ∈ ℝⁿ : (x − μ)′V(x − μ) = 0} = {μ + y : A′y = 0}, where V = AA′.
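The construction X := μ + AZ is exactly how multivariate normal samplers work in practice. A minimal sketch (Python with numpy; the particular μ and V are hypothetical), using the Cholesky factorization V = AA′:

    import numpy as np

    rng = np.random.default_rng(5)
    mu = np.array([1.0, -2.0, 0.0])
    V = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])         # nonnegative definite (here nonsingular)

    A = np.linalg.cholesky(V)               # V = A A'
    Z = rng.standard_normal((100_000, 3))   # independent N(0,1) coordinates
    X = mu + Z @ A.T                        # each row is one draw of mu + A Z

    print(np.round(X.mean(axis=0), 2))      # ~ mu
    print(np.round(np.cov(X.T), 2))         # ~ V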

*7. Cramér-Wold without Fourier transforms


Fourier methods were long regarded as essential underpinning for the Cramér-Wold device. Walther (1997) recently found a beautiful direct argument that avoids use of Fourier transforms altogether. I will describe only his method for showing that linear combinations characterize a distribution, which depends on two facts about the normal distribution:
(i) If Z is a vector of k independent random variables, each distributed N(0,1), and if B is a vector of constants, then B · Z has a N(0, |B|²) distribution. (This fact can be established by a direct convolution argument, which makes no use of Fourier transforms.)
(ii) Write Φ for the standard normal distribution function. If θ1, ..., θ_{2m+1} are distinct real numbers, then there exist real numbers a1, ..., a_{2m+1} such that the function g(t) := Σi ai Φ(θi t), for t ≥ 0, is of order O(t^{2m+1}) near t = 0, and hence g(t)/t^{2m+1} is Lebesgue integrable. Moreover, the {ai} can be chosen so that ∫₀^∞ g(t)/t^{2m+1} dt ≠ 0.
Walther gave an explicit construction for the constants {ai} for a particular choice of the {θi}. (Actually, he also needed to add another well chosen constant a0 to g to get the desired properties.) I will give a different proof of (ii) at the end of this Section.
Write m for Lebesgue measure on ℝ^{2m}. The function g serves to define an m-integrable function F on ℝ^{2m} by F(u) := g(1/|u|), for which
    mF = m^u g(1/|u|) = Cm ∫₀^∞ r^{2m−1} g(1/r) dr = Cm ∫₀^∞ g(t)/t^{2m+1} dt ≠ 0,
where Cm denotes the surface area of the unit sphere in ℝ^{2m}.


Let h be any bounded, continuous, real function on ℝ^{2m}. Integrability of F justifies an appeal to Dominated Convergence to deduce that
    Ph(X) = lim_{σ→0} P^ω m^u (h(X(ω) + σu) F(u)) / mF.
A change of variable in the m integral, followed by an appeal to Fubini, gives
    P^ω m^u (h(X(ω) + σu) F(u)) = σ^{−2m} m^y P^ω (h(y) F((y − X(ω))/σ)).
The last expectation is determined by the distributions of linear functions of X, as seen by an appeal to property (i), first conditioning on X, for a random normal vector Z that is independent of X:
    P^ω F((y − X(ω))/σ) = Σi ai P^ω Φ(θi σ / |y − X(ω)|) = Σi ai P^ω P_Z {(y − X(ω)) · Z ≤ θi σ}.
Condition on Z to see that the last expression is uniquely determined by the distributions of X · z for z in ℝ^{2m}.
Thus the distributions of the linear combinations X · z, with z ranging over ℝ^{2m}, uniquely determine the expectation Ph(X) for every bounded, continuous, real function h, which is enough to determine the distribution of X.
REMARK. The Cramér-Wold result for random vectors taking values in ℝ^{2m−1} follows directly from the result for ℝ^{2m}: we have only to append one more coordinate variable to X. There is no loss in having a proof for only even dimensions.

Proof of assertion (ii)

The Taylor expansion of the exponential function,
    e^{−x} = 1 − x + x²/2! − x³/3! + ... + (−x)^{m−1}/(m−1)! + ((−x)^m/m!) exp(−x*)    with 0 ≤ x* ≤ x,
shows that e^{−x} is the sum of a polynomial of degree m−1 plus a remainder term, r(x), of order O(x^m) near x = 0, such that r(x) > 0 for all x > 0 if m is even, and r(x) < 0 for all x > 0 if m is odd. Replace x by x²/2, divide by √(2π), then integrate from 0 to t, to derive a corresponding expansion for the standard normal distribution function: Φ(t) = p(t) + R(t) for all t ≥ 0, where p(t) := Σ_{k=0}^{2m−1} βk t^k and R(t) is a remainder of order O(t^{2m+1}) near t = 0 that takes only positive (if m is even) or negative (if m is odd) values. Note that β0 = Φ(0) = 1/2. The constant K := ∫₀^∞ R(t) t^{−2m−1} dt is finite and nonzero.
By construction,
    g(t) := Σi ai Φ(θi t) = Σi ai R(θi t) + Σ_{k=0}^{2m−1} βk t^k Σi ai θi^k.
If we can choose the {ai} such that Σi ai θi^k = 0 for k = 0, 1, ..., 2m−1, then the contributions from the polynomials p(θi t) disappear, leaving
    ∫₀^∞ g(t)/t^{2m+1} dt = Σi ai ∫₀^∞ R(θi t) t^{−2m−1} dt = K Σi ai θi^{2m}.
The integral is nonzero if we can also ensure that Σi ai θi^{2m} ≠ 0.
A simple piece of linear algebra establishes existence of a suitable vector a := (a1, ..., a_{2m+1}). Write Uk for (θ1^k, ..., θ_{2m+1}^k). The vector U_{2m} could not be a linear combination Σ_{k=0}^{2m−1} γk Uk, for otherwise the polynomial θ^{2m} − Σ_{k=0}^{2m−1} γk θ^k of degree 2m would have 2m+1 distinct roots, θ1, ..., θ_{2m+1}, a contradiction. The component of U_{2m} that is orthogonal to all the other Uk vectors defines a suitable a.
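The orthogonal-component argument translates directly into a computation. A sketch (Python with numpy; m and the θi are arbitrary choices): build the vectors Uk, project U_{2m} off the span of the others, and check that Σi ai θi^k vanishes for k < 2m but not for k = 2m.

    import numpy as np

    m = 2
    theta = np.linspace(0.5, 2.5, 2 * m + 1)             # distinct theta_i
    U = np.vander(theta, N=2 * m + 1, increasing=True)   # columns U_0, ..., U_{2m}

    Q, _ = np.linalg.qr(U[:, : 2 * m])       # orthonormal basis for span{U_0..U_{2m-1}}
    u_last = U[:, 2 * m]
    a = u_last - Q @ (Q.T @ u_last)          # component of U_{2m} orthogonal to the span

    print(np.round(a @ U[:, : 2 * m], 10))   # sum_i a_i theta_i^k = 0 for k < 2m
    print(a @ u_last)                        # nonzero: equals |a|^2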

*8. The Lévy-Cramér theorem

If X and Y are independent random variables, each normally distributed, it is easy to verify by direct calculation that X + Y has a normal distribution. Surprisingly, there is a converse to this simple result.
<17> Lévy-Cramér theorem. If X and Y are independent random variables with X + Y normally distributed then both X and Y have normal distributions.
REMARK. The proof of the theorem makes use of several facts about analytic functions of a complex variable, such as existence of power series expansions. See Chapters 10 and 13 of the Rudin (1974) text for the required theory.

Proof. With no loss of generality we may assume X + Y to have a standard normal distribution, and Y to have a zero median, that is, P{Y ≥ 0} ≥ ½ and P{Y ≤ 0} ≥ ½. Then, for each x > 0,
    P{X > x} ≤ 2P{Y ≥ 0}P{X > x}    zero median
            = 2P{Y ≥ 0, X > x}      independence
            ≤ 2P{X + Y > x}
            ≤ exp(−x²/2)            normal tail bound from Section D.1.
A similar argument gives a similar bound for the lower tail. Thus
<18>    P{|X| > x} ≤ 2 exp(−x²/2)    for all x > 0.

It follows (Problem [11]) that the function g(z) := P exp(zX) is well defined and is an analytic function of z throughout the complex plane ℂ. It inherits a growth condition from the tail bound <18>:
    |g(z)| ≤ P|exp(zX)| ≤ P exp(|z| |X|)
          = 1 + |z| ∫₀^∞ exp(|z|x) P{|X| > x} dx
          ≤ 1 + 2|z| ∫₀^∞ exp(|z|x − x²/2) dx    by <18>
<19>      ≤ exp(C + |z|²)    for some constant C.
A similar argument shows that h(z) := P exp(zY) is also well defined for every z. By independence, g(z)h(z) = exp(z²/2) for all z ∈ ℂ, and hence g(z) ≠ 0 for all z. It follows (Rudin 1974, Theorem 13.11) that there exists an analytic function γ(·) on ℂ such that g(z) = exp(γ(z)). We may choose γ so that γ(0) = 0, because g(0) = 1. (In effect, log g can be defined as a single-valued, analytic function on ℂ.) The analytic function γ has a power series expansion γ(z) = Σ_{n=1}^∞ γn z^n that converges uniformly to γ on each bounded subset of ℂ.


Decompose γ(re^{iθ}) into its real and imaginary parts, U(r, θ) + iV(r, θ). Then, from <19>, we have exp(U) = |exp(U + iV)| ≤ exp(C + r²), which gives
<20>    U(r, θ) ≤ C + r²    for all re^{iθ} ∈ ℂ.
Uniform convergence of the power series expansion for γ on the circle |z| = r lets us integrate term-by-term, giving
    ∫₀^{2π} γ(re^{iθ}) exp(−inθ) dθ = 2π γn r^n    for n = 1, 2, 3, ...
                                    = 0             for n = 0, −1, −2, ...
In particular, for n = 1, 2, 3, ... and real β,
    ∫₀^{2π} γ(re^{iθ}) (exp(−inθ − iβ) + exp(inθ + iβ) + 2) dθ = 2π γn r^n e^{−iβ}.
Choose β so that γn e^{−iβ} = |γn|, then equate real parts to deduce that
    ∫₀^{2π} U(r, θ) (2 + 2 cos(nθ + β)) dθ = 2π |γn| r^n.
The integrand on the left-hand side is less than 4(C + r²). Let r tend to infinity to deduce that γn = 0 for n = 3, 4, 5, .... That is,
    P exp(zX) = exp(γ1 z + γ2 z²).
Problem [12] shows why γ1 must be real valued and γ2 must be nonnegative. That is, X has the Fourier transform that characterizes the normal distribution.

9. Problems

[1] Let f = (f1, ..., fk) be a vector of μ-integrable, real-valued functions. Define μf as the vector (μf1, ..., μfk).
(i) Show that |μf| ≤ μ|f|, where |·| denotes the usual Euclidean norm. Hint: Let a be a unit vector in the direction of μf. Note that a′f ≤ |f|.
(ii) Let f = f1 + if2 be a complex-valued function, with μ-integrable real and imaginary parts f1 and f2. Show that |μf| ≤ μ|f|, where |·| denotes the modulus of a complex number.

[2] Suppose a Fourier transform has |ψP(t0)| = 1 for some nonzero t0. That is, ψP(t0) = exp(iθ0) for some real θ0. Show that P concentrates on the lattice of points {(θ0 + 2πn)/t0 : n ∈ ℤ}. Hint: Show ℜ(1 − P^x exp(it0x − iθ0)) = 0. What do you know about 1 − cos(t0x − θ0)?

[3] Suppose f is both integrable with respect to Lebesgue measure m on the real line and absolutely continuous, in the sense that f(x) = m^t({t ≤ x}f′(t)), for all x, for some integrable function f′.
(i) Show that f(x) → 0 as |x| → ∞. Hint: Show m^t f′(t) = 0, to handle x → ∞.
(ii) Show that m^x(f(x)e^{ixt}) = −m^x(e^{ixt}f′(x))/(it) for t ≠ 0. Hint: For safe Fubini, write the left-hand side as lim_{C→∞} m^x(e^{ixt}{|x| ≤ C} m^s(f′(s){s ≤ x})).
(iii) If a probability density has Lebesgue integrable derivatives up to k-th order, prove that its Fourier transform is of order O(|t|^{−k}) as |t| → ∞.
[4] Let m denote Lebesgue measure on B(ℝ). For each f in L¹(m) show that m^x(f(x)e^{ixt}) → 0 as |t| → ∞. Hint: Check the result for f equal to the indicator function of a bounded interval. Show that the linear space spanned by such indicator functions is dense in L¹(m). Alternatively, approximate f by linear combinations of densities with integrable derivatives, then invoke Problem [3].

[5] Let g(·) be a bounded real function on the real line with bounded derivatives up to order k+1, and let X be a random variable for which P|X|^k < ∞. Let Rk(·) be the remainder in the Taylor expansion up to k-th power:
    g(x) = g(0) + xg′(0) + ... + (x^k/k!)g^{(k)}(0) + Rk(x).
(i) Show that |Rk(x)| ≤ C min(|x|^k, |x|^{k+1}), for some constant C.
(ii) Invoke Dominated Convergence to show that
    Pg(Xt) = g(0) + tg′(0)PX + ... + (t^k/k!)g^{(k)}(0)PX^k + o(t^k)    as t → 0.

Let P denote the double-exponential distribution, given by the density p(x) =


V2exp(-|jc|) with respect to Lebesgue measure on !B(R).
(i) Show that Pxexs = 1/(1 - s 2 ) for real s with |*| < 1.
(ii) By a leap of faith (or by an appeal to analytic continuation), deduce that
x/rP(t) = 1/(1 +1 2 ) fort eR.
(iii) The Cauchy probability distribution is given by the density q(x) = n~l/(l -f x2).
Apply the inversion formula from Theorem <io> to deduce that the Cauchy
distribution has Fourier transform exp(\t\).
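Parts (i) and (iii) can be sanity-checked by Monte Carlo (a sketch in Python with numpy; s = 0.4 keeps the variance of e^{sX} finite):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.laplace(loc=0.0, scale=1.0, size=500_000)    # density exp(-|x|)/2

    s = 0.4
    print(np.mean(np.exp(s * x)), 1.0 / (1.0 - s ** 2))          # part (i)

    t = 1.5
    c = rng.standard_cauchy(500_000)
    print(np.mean(np.exp(1j * t * c)).real, np.exp(-abs(t)))     # part (iii)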

[7] Suppose (X, Y) has a multivariate normal distribution with cov(X, Y) = 0. Show that X and Y are independent. (Hint: What is the Fourier transform of P_X ⊗ P_Y?)

[8] Let X1, ..., Xn be independent, Uniform(0,1) distributed random variables. Calculate the logarithm of the Fourier transform of n^{−1/2}(X1 + ... + Xn − n/2) up to a remainder term of order O(n^{−2}).

[9] Suppose X is a random variable with Fourier transform ψX(·).
(i) For each δ > 0, show that
    (1/2δ) ∫_{−δ}^{δ} (1 − ℜψX(t)) dt ≥ (1 − ε0) P{|X| ≥ 1/δ},
where ε0 := sup_{|x|≥1} |sin x / x| < 1.
(ii) If ψ is a complex-valued function, continuous at the origin, show that
    (1/2δ) ∫_{−δ}^{δ} (ψ(0) − ψ(t)) dt → 0    as δ → 0.
(iii) Suppose {Xn} is a sequence of random variables whose Fourier transforms ψn converge pointwise to a function ψ that is continuous at the origin. Show that Xn is of order Op(1).
(iv) Show that the ψ from part (iii) equals the Fourier transform of some random variable Z representing the limit in distribution of a subsequence of {Xn}. Deduce via Theorem <12> that Xn ⇝ Z.
[10] Suppose a random variable X has a Fourier transform ψ for which ψ(t) = 1 + O(t²) near t = 0.
(i) Show that there exists a finite constant C such that C ≥ P((1 − cos(Xt))/t² {|X| ≤ M}) for all M and all t ≠ 0. Deduce via Dominated Convergence that 2C ≥ PX².
(ii) Show that PX = 0.

[11] Suppose X is a random variable, and f(z, X) is a jointly measurable function that is analytic in a neighborhood N := {z ∈ ℂ : |z − z0| < δ} of a point z0 in the complex plane. Write f′(z, X) for the derivative with respect to z. Suppose |f′(z, X)| ≤ M(X) for |z − z0| < δ, where PM(X) < ∞. Show that Pf(z, X) is analytic in N, with derivative Pf′(z, X). Hint: Reduce to the corresponding theorem for differentiation with respect to a real variable by defining g(t, X) := f(z0 + th, X) for 0 ≤ t ≤ 1, with h fixed.

[12] Suppose a random variable X has Fourier transform ψ(t) = exp(ict − dt²), for complex numbers c := c1 + ic2 and d := d1 + id2.
(i) Deduce from the facts that |ψ(t)| ≤ 1 and ψ(t) = 1 − c2t + ic1t + o(t) near t = 0 that c2 = 0.
(ii) Show that X − c1 has Fourier transform exp(−dt²) = 1 + O(t²) near t = 0. Deduce from Theorem <2> and Problem [10] that 2d = P|X − c1|², which is nonnegative.

10. Notes
Feller (1971, Chapters 15 and 16) is a good source for facts about Fourier transforms (characteristic functions). Much of my exposition in the first four Sections is based on his presentation, with help from Breiman (1968, Chapter 8).
The idea of generating functions is quite old; see the entries under generating functions in the index to Stigler (1986a), for descriptions of the contributions of De Moivre, Simpson, Lagrange, and Laplace.
Apparently Lévy borrowed the name characteristic function from Poincaré (who used it to refer to what is now known as the moment generating function), when rediscovering the usefulness of the Fourier transform for probability theory calculations, unaware of earlier contributions.
Lévy (1922) extended the classical Fourier inversion formula to general probability distributions on the real line. He also used an inversion formula to prove a form of the Continuity Theorem slightly weaker than Theorem <12>. (He required uniform convergence on bounded intervals, but his method of proof works just as well with pointwise convergence.) He noted that the theorem could be proved by reduction to the case of bounded densities, by convolution smoothing, offering the normal as a suitable source of smoothing; compare with Lévy (1925, page 197), where he used convolution with a uniform distribution for the same purpose. The slightly stronger version of the Continuity Theorem described in Problem [9] is due to Cramér (1937), albeit originally incorrectly stated (Cramér 1976, page 525).
The book by Hall & Heyde (1980) is one of the best references for martingale theory in discrete time. It contains a slightly stronger form of Theorem <14>.
The results in Section 6 concerning characterizations via linear combinations come from Cramér & Wold (1936). In the 1998 Addendum to his 1997 paper, Walther noted that Radon (1917) had proved similar results, also without the use of Fourier theory. I have not seen Radon's paper.
I borrowed the proof of the Lévy-Cramér theorem from Chow & Teicher (1978, Section 8.4). The result was conjectured by Lévy (1934, final paragraph), then proved by Cramér (1936); see Cramér (1976, page 522) and Lévy (1970, page 111). The last part of the proof in Section 8 essentially establishes a special case of the result of Hadamard originally invoked by Cramér. See Le Cam (1986, page 80) and Loève (1973, page 3) for further discussion of why the result plays such a key role in the statement of necessary and sufficient conditions for the central limit theorem to hold.
REFERENCES

Breiman, L. (1968), Probability, first edn, Addison-Wesley, Reading, Massachusetts.
Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchangeability, Martingales, Springer, New York.
Cramér, H. (1936), 'Über eine Eigenschaft der normalen Verteilungsfunktion', Mathematische Zeitschrift 41, 405-414.
Cramér, H. (1937), Random Variables and Probability Distributions, Cambridge University Press.
Cramér, H. (1976), 'Half a century with probability theory: some personal recollections', Annals of Probability 4, 509-546.
Cramér, H. & Wold, H. (1936), 'Some theorems on distribution functions', Journal of the London Mathematical Society 11, 290-294.
Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, second edn, Wiley, New York.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, New York.
Le Cam, L. (1986), 'The central limit theorem around 1935', Statistical Science 1, 78-96.
Lévy, P. (1922), 'Sur la détermination des lois de probabilité par leurs fonctions caractéristiques', Comptes Rendus de l'Académie des Sciences, Paris 175, 854-856.
Lévy, P. (1925), Calcul des Probabilités, Gauthier-Villars, Paris.
Lévy, P. (1934), 'Sur les intégrales dont les éléments sont des variables aléatoires indépendantes', Ann. Scuola Norm. Sup. Pisa (2) 3, 337-366.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Page references from the 1954 second edition.
Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.
Loève, M. (1973), 'Paul Lévy, 1886-1971', Annals of Probability 1, 1-18. Includes a list of Lévy's publications.
McLeish, D. L. (1974), 'Dependent central limit theorems and invariance principles', Annals of Probability 2, 620-628.
Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.
Radon, J. (1917), 'Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten', Ber. Verh. Sächs. Akad. Wiss. Leipzig, Math.-Nat. Kl. 69, 262-277.
Rudin, W. (1974), Real and Complex Analysis, second edn, McGraw-Hill, New York.
Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, Massachusetts.
Walther, G. (1997), 'On a conjecture concerning a theorem of Cramér and Wold', Journal of Multivariate Analysis 63, 313-319. Addendum in same journal, 63, 431.

Chapter 9

Brownian motion
SECTION 1 collects together some facts about stochastic processes and the normal
distribution, for easier reference.
SECTION 2 defines Brownian motion as a Gaussian process indexed by a subinterval T
of the real line. Existence of Brownian motions with and without continuous sample
paths is discussed. Wiener measure is defined.
SECTION 3 constructs a Brownian motion with continuous sample paths, using an
orthogonal series expansion of square integrable functions.
SECTION *4 describes some of the finer properties (lack of differentiability, and a modulus
of continuity) for Brownian motion sample paths.
SECTION 5 establishes the strong Markov property for Brownian motion. Roughly speaking,
the process starts afresh as a new Brownian motion after stopping times.
SECTION *6 describes a family of martingales that can be built from a Brownian motion,
then establishes Lévy's martingale characterization of Brownian motion with continuous
sample paths.
SECTION *7 shows how square integrable functions of the whole Brownian motion path
can be represented as limits of weighted sums of increments. The result is a thinly
disguised version of a remarkable property of the isometric stochastic integral, which
is mentioned briefly.
SECTION *8 explains how the result from Section 7 is the key to the determination of
option prices in a popular model for changes in stock prices.

1. Prerequisites
Broadly speaking, Brownian motion is to stochastic process theory as the normal
distribution is to the theory for real random variables. They both arise as natural
limits for sums of small, independent contributions; they both have rescaling and
transformation properties that identify them amongst wider classes of possible
limits; and they have both been studied in great detail. Every probabilist, and
anyone dealing with continuous-time processes, should learn at least a little about
Brownian motion, one of the most basic and most useful of all stochastic processes.
This Chapter will define the process and explain a few of its properties.
The discussion will draw on a few basic ideas about stochastic processes, and
a few facts about the normal distribution, which are summarized in this Section.


A stochastic process is just a family of random variables {Xt : t ∈ T}, all defined on the same probability space, say (Ω, F, P). Throughout the Chapter, the index set T will always be ℝ+ or a subinterval [0, a] of ℝ+. You should think of the parameter t as time, with the stochastic process evolving in time.
I will use the symbols Xt(ω) and X(t, ω) interchangeably. The latter notation suggests that we regard the whole process as a single function X on T × Ω, and use the single letter X to refer to the whole family of random variables. We can also treat X(t, ω) as a family of functions X(·, ω) defined on T, one for each ω. Each of these functions, t ↦ X(t, ω), is called a sample path of the process. Each viewpoint (a family of random variables, a single function of two arguments, and a family of sample paths) has its advantages.
As time passes, we learn more about the process and about other random variables defined on Ω, a situation represented (as in Chapter 6) by a filtration: a family of sub-sigma-fields {Ft : t ∈ T} of F for which Fs ⊆ Ft whenever s ≤ t. A stochastic process {Xt : t ∈ T} is said to be adapted to the filtration if Xt is Ft-measurable for each t. On occasion it will be helpful to enlarge a filtration slightly, replacing Ft by the sigma-field F̃t generated by Ft ∪ N, with N the class of P-negligible subsets of Ω. I will refer to {F̃t : t ∈ T} as the completed filtration.
The joint distributions of the subcollections {Xt : t ∈ S}, with S ranging over all the finite subsets of T, are called the finite dimensional distributions (or fidis) of the process. If all the fidis are multivariate normal, the process is said to be Gaussian. If each Xt has zero expected value, the process is said to be centered.
The striking behavior of Brownian motion will be largely determined by just a few properties of the normal distribution (see Section 8.6):
(a) A multivariate normal distribution is uniquely determined by its vector of means and its variance matrix. (The Fourier transform is a function of those two quantities.)
(b) If X and Y have a bivariate normal distribution, then X is independent of Y if and only if cov(X, Y) = 0. That is, under an assumption of joint normality, independence is equivalent to orthogonality of X − μX and Y − μY, in the L²(P) sense, where μX := PX and μY := PY.
(c) If Z1, Z2, ..., Zn are independent random variables, each with a N(0,1) distribution, then all linear combinations Σi ai Zi have normal distributions. The joint distributions of finite collections of linear combinations of Z1, ..., Zn are multivariate normal.
(d) If {Xn} is a sequence of random vectors with multivariate normal distributions that converges in distribution to a random vector X, then X has a multivariate normal distribution. The expected value PXn must converge to PX, and the covariance matrix var(Xn) must converge to var(X). Convergence in L²(P) implies convergence in distribution.
(e) If Z1, Z2, ..., Zn are independent random variables, each with a N(0,1) distribution, then P max_{i≤n} |Zi| ≤ 2√(1 + log n) and max_{i≤n} |Zi| / √(2 log n) → 1 almost surely as n → ∞. (See Problems [1] and [2] for proofs.)
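Fact (e) is easy to watch numerically (a sketch in Python with numpy; 200 replications per n, chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(6)
    for n in [10, 1_000, 100_000]:
        m = np.abs(rng.standard_normal((200, n))).max(axis=1)    # copies of max|Z_i|
        print(f"n={n:6d}   P max|Z_i| ~ {m.mean():.3f}   "
              f"2*sqrt(1+log n) = {2 * np.sqrt(1 + np.log(n)):.3f}   "
              f"max/sqrt(2 log n) ~ {(m / np.sqrt(2 * np.log(n))).mean():.3f}")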


2. Brownian motion and Wiener measure


There are several closely related definitions for Brownian motion, each focusing on
a different desirable property of the process. Let us start from a minimal definition,
then build towards a more comfortable set of properties.

<1> Definition. A stochastic process B := {Bt : t ∈ T}, adapted to a filtration {Ft}, is said to be a Brownian motion (for that filtration) if its increments have the following two properties:
(i) for all s < t the increment Bt − Bs is independent of Fs;
(ii) for all s < t the increment Bt − Bs has a N(0, t − s) distribution.
Equivalently,
(iii) PF exp(iθ(Bt − Bs) + ½θ²(t − s)) = PF, for all s < t, all θ ∈ ℝ, all F ∈ Fs.
Proof of the equivalence. Necessity of (iii) follows from the independence and the fact that the N(0, t − s) distribution has Fourier transform exp(−θ²(t − s)/2). For sufficiency, first take F = Ω to show that Bt − Bs is N(0, t − s) distributed. To establish independence, we have only to show that P(g(Bt − Bs)F) = (Pg(Bt − Bs))(PF) for all bounded, measurable functions g. The assertion is trivial if PF = 0. When PF ≠ 0, equality (iii) may be rewritten as
    PF exp(iθ(Bt − Bs)) = exp(−θ²(t − s)/2) = P exp(iθ(Bt − Bs)),
where PF(A) := P(FA)/PF for all A in F. By the uniqueness theorem for Fourier transforms, Bt − Bs has the same distribution under PF as under P. In particular, PF g(Bt − Bs) = Pg(Bt − Bs), for all bounded, measurable functions g.
Once we specify the distribution of B0, the joint distribution of B0, Bt1, ..., Btk, for 0 < t1 < ... < tk, is uniquely determined. That is, the fidis are uniquely determined by Definition <1> and the distribution of B0. If B0 is integrable then so are all the Bt, and the process is a martingale. If B0 has a normal distribution (possibly degenerate), then all the fidis are multivariate normal, which makes B a Gaussian process. If B0 = x, for a constant x, the process is said to start at x. In particular, when B0 = 0, the Brownian motion is a centered Gaussian process, whose fidis are uniquely determined by the covariances,
    cov(Bs, Bt) = cov(Bs, Bs) + cov(Bs, Bt − Bs) = s + 0    if s ≤ t.
More succinctly, cov(Bs, Bt) = s ∧ t for all s, t in ℝ+.


Often one speaks of a Brownian motion without explicit mention of the filtration, in which case it is implicit that F_t equals F_t^B := σ{B_s : s ≤ t}, the natural or Brownian filtration. In that case, a simple generating class argument shows that property (i) is equivalent to the assertion:
(i)′ for all choices of t_0 < t_1 < t_2 < ... < t_k from T, the random variables {B(t_j) − B(t_{j−1}) : j = 1, 2, ..., k} and B(t_0) are independent.
A centered Gaussian process with cov(B_s, B_t) = s ∧ t for all s, t in T is a Brownian motion for the natural filtration.
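That covariance pins down every fidi, so finite collections (B_{t_1}, ..., B_{t_k}) can be sampled directly from the multivariate normal distribution with covariance matrix [t_i ∧ t_j]. A minimal sketch (Python with numpy; the grid is an arbitrary choice of mine, not from the text):

```python
import numpy as np

# Sample one fidi of a centered Gaussian process with cov(B_s, B_t) = s ^ t:
# (B_{t_1},...,B_{t_k}) is multivariate normal with covariance matrix t_i ^ t_j.
rng = np.random.default_rng(1)
t = np.linspace(0.1, 1.0, 10)        # an arbitrary grid of time points
cov = np.minimum.outer(t, t)         # entries t_i ^ t_j (the minimum)
L = np.linalg.cholesky(cov)          # cov = L L^T
B = L @ rng.standard_normal(t.size)  # one draw of the fidi
print(np.round(B, 3))
```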


A further property is usually added to the list of requirements for Brownian motion, namely that it have continuous sample paths:
(iv) For each fixed ω, the sample path B(·, ω) is a continuous function on T.
Some authors give the mistaken impression that property (iv) follows from properties (i) and (ii). The proper assertion (Problem [3]) is that there exists a stochastic process that satisfies (i) and (ii), and has continuous sample paths, or, more precisely, if {B_t : t ∈ T} satisfies (i) and (ii), then there exists another process {B_t* : t ∈ T}, defined on the same probability space, for which (i), (ii), and (iv) hold, and for which P{B_t = B_t*} = 1 at every t. The new process is called a version of the original process. Notice that B_t* need not be F_t-measurable if F_t does not contain enough negligible sets, but it is measurable with respect to the completed sigma-field F̃_t. In fact, B* is a Brownian motion for the completed filtration: the added negligible sets have no effect on the calculations required to prove that B_t* − B_s* is independent of F̃_s.
In truth, a Brownian motion that did not have all, or almost all, of its sample paths continuous would not be a very nice beast. Many of the beautiful properties of Brownian motion depend on the continuity of its sample paths.
<2>    Example. Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous sample paths. The quantity M(ω) := sup_t |B(t, ω)| is finite for each ω, because each continuous function is bounded on the compact set [0,1]. It is an F_1-measurable random variable, because {ω : M(ω) > x} = ∪_{s∈S} {|B_s(ω)| > x}, for any countable, dense subset S of [0,1]. A countable union of sets from F_1 also belongs to F_1.
What happens when the process does not have continuous sample paths, but (i) and (ii) are satisfied? To show you how bad it can get, I will perform pathwise surgery on B to create a version that behaves badly. As you will see, the issue is really one of managing uncountable families of negligible sets. Countable collections of negligible sets can be ignored, but uncountable collections are capable of causing real trouble.
From Problem [7], there is a partition of Ω into an uncountable union of disjoint, F_1-measurable sets {Ω_t : 0 ≤ t ≤ 1}, with PΩ_t = 0 for every t. Let ψ(ω) be an arbitrarily nasty, nonmeasurable, nonnegative function on Ω for which ψ(ω) > M(ω) at each ω. Define
B*(t, ω) := B(t, ω){ω ∉ Ω_t} + ψ(ω){ω ∈ Ω_t}.
By construction, P{B_t* = B_t} = 1 for every t. The joint distributions of finite, or countable, collections of B_t* variables are the same as the joint distributions for the corresponding collections of B_t. In particular, B* is a Brownian motion, with respect to the completed filtration.
The construction ensures that |B*(·, ω)| is maximized at the t for which ω ∈ Ω_t, and sup_s |B*(s, ω)| = ψ(ω). We have built ourselves a process satisfying requirements (i) and (ii) of Definition <1>, but by deliberately violating (iv), we have created a nasty nonmeasurability.
REMARK.
The point of the Example is not that anyone might choose a bad
Brownian motion, like B*, in preference to one with continuous sample paths, but
rather that there is nothing in requirements (i) and (ii) to exclude the bad version.
Continuity of a path requires cooperation of uncountably many random variables,



a cooperation that cannot be ensured by requirements expressed purely in terms of
joint distributions of finite, or countable, subfamilies of random variables.

A Brownian motion that has continuous sample paths also defines a map, ω ↦ B(·, ω), from Ω into the space C(T) of continuous, real valued functions on T. It becomes a random element of C(T) if we equip that space with its finite dimensional (or cylinder) sigma-field 𝒞(T), the sigma-field generated by the cylinder sets {x ∈ C(T) : (x(t_1), ..., x(t_k)) ∈ A}, with {t_1, ..., t_k} ranging over all finite subsets of T and A ranging over B(R^k), for each finite k.
As an F\𝒞(T)-measurable map, ω ↦ B(·, ω), from Ω into C(T), a Brownian motion induces a probability measure (its distribution or image measure) on 𝒞(T). The distribution is uniquely determined by the fidis, because the collection of cylinder sets is stable under finite intersections and it generates 𝒞(T). For the simplest case, where B_0 = 0, the distribution is called Wiener measure on 𝒞(T), or, less precisely, Wiener measure on T. I will denote it by W, relying on context to identify T.
REMARK. Each coordinate projection, X_t(x) := x(t), defines a random variable on C(T). As a stochastic process on the probability space (C(T), 𝒞(T), W), the family {X_t : t ∈ T} is a Brownian motion with continuous paths, started at 0. For many purposes, the study of Brownian motion is just the study of W.

For the remainder of the Chapter, you may assume that all Brownian motions satisfy requirements (i), (ii), and (iv), with B_0 = 0. That is, unless explicitly warned otherwise, you may assume that all Brownian motions from now on are centered Gaussian processes with continuous sample paths and cov(B_t, B_s) = t ∧ s, a process that I will refer to as standard Brownian motion (on T).

3. Existence of Brownian motion


It takes some ingenuity to build a Brownian motion with continuous sample paths, a
feat first achieved with mathematical rigor by Wiener (1923). This Section contains
one construction, based on a few facts about Hilbert spaces, all of which are are
established in Section 4 of Appendix B.
Suppose ℋ is a Hilbert space with a countable orthonormal basis {ψ_i : i ∈ N}. Let {η_i} be a sequence of independent N(0,1) random variables, defined on some probability space (Ω, F, P). For each fixed h in ℋ, the sequence of random variables G_n(h) := Σ_{i≤n} ⟨h, ψ_i⟩ η_i converges in L²(P) to a limit, G(h) := Σ_{i∈N} ⟨h, ψ_i⟩ η_i, and by Parseval's identity,
cov(G_n(h_1), G_n(h_2)) = Σ_{i≤n} ⟨h_1, ψ_i⟩⟨h_2, ψ_i⟩ → Σ_{i∈N} ⟨h_1, ψ_i⟩⟨h_2, ψ_i⟩ = ⟨h_1, h_2⟩.
Note that G(h) is uniquely determined as an element of L²(P), but it is only defined up to an almost sure equivalence as a random variable.
In particular, from the facts (c) and (d) in Section 1, the random variable G(h) has a N(0, ‖h‖²) distribution. Moreover, for each finite subset {h_1, ..., h_k} of ℋ, the random vector (G(h_1), ..., G(h_k)) has a multivariate normal distribution with zero means and covariances given by P G(h_i)G(h_j) = ⟨h_i, h_j⟩. That is, all the fidis of the process are centered multivariate normal. The family of random variables {G(h) : h ∈ ℋ} is a Gaussian process that is sometimes called the isonormal process, indexed by the Hilbert space ℋ (compare with Dudley 1973, page 67).
REMARK. Notice that the map h ↦ G(h) is linear and continuous, as a function from ℋ into L²(P). Thus G(h) can be recovered, up to an almost sure equivalence, from the values {G(e_i) : i ∈ N} for any orthonormal basis {e_i : i ∈ N} for ℋ.

To build a Brownian motion indexed by [0,1], specialize to the case where ℋ := L²(m), with m equal to Lebesgue measure on [0,1]. Write f_t for the indicator function of the interval [0, t]. The subset {f_t : t ∈ [0,1]} of ℋ defines a centered Gaussian process B_t := G(f_t) indexed by [0,1], with
cov(B_s, B_t) = m(f_s f_t) = m[0, s ∧ t] = s ∧ t.
That is, if we take F_t as σ{B_s : s ≤ t} then {(B_t, F_t) : t ∈ [0,1]} is a centered Gaussian process with the covariances that identify it as a Brownian motion indexed by [0,1], in the sense that properties (i) and (ii) of Definition <1> hold. The question of sample path continuity is more delicate.
Each partial sum G_n defines a process with continuous paths, because, by Cauchy-Schwarz,
|G_n(f_s) − G_n(f_t)|² ≤ Σ_{i≤n} ⟨f_s − f_t, ψ_i⟩² Σ_{i≤n} η_i²
and Σ_{i≤n} ⟨f_s − f_t, ψ_i⟩² ≤ ‖f_s − f_t‖² = |s − t|. We need to preserve continuity in the limit as n tends to infinity. Convergence uniform in t would suffice.
Something slightly weaker than uniform convergence is easy to check if we work with the orthonormal basis of Haar functions on [0,1]. It is most natural to specify this basis via a double indexing scheme. For k = 0, 1, ... and 0 ≤ i < 2^k, define H_{i,k} as a difference of indicator functions,
H_{i,k}(s) := {i2^{−k} < s ≤ (i + ½)2^{−k}} − {(i + ½)2^{−k} < s ≤ (i + 1)2^{−k}}.
Notice that |H_{i,k}| is the indicator function of the interval J_{i,k} := (i2^{−k}, (i + 1)2^{−k}], and H_{i,k} = J_{2i,k+1} − J_{2i+1,k+1}. Thus mH_{i,k}² = mJ_{i,k} = 2^{−k}. The functions ψ_{i,k} := 2^{k/2} H_{i,k} are orthogonal, and each has L²(m) norm 1. As shown in Section 3 of Appendix B, the collection of functions Ψ := {1} ∪ {ψ_{i,k} : 0 ≤ i < 2^k, for k = 0, 1, 2, ...} is an orthonormal basis for L²(m). The Brownian motion has the L²(m) representation,
<3>    B_t = ηt + Σ_{k=0}^∞ 2^{k/2} X_k(t)    with X_k(t) := Σ_{0≤i<2^k} η_{i,k} ⟨f_t, H_{i,k}⟩,
where η and the {η_{i,k}} are mutually independent N(0,1) random variables.
As a function of t, each ⟨f_t, H_{i,k}⟩ is nonzero only in the interval J_{i,k}, within which it is piecewise linear, achieving its maximum value of 2^{−(k+1)} at the midpoint:
⟨f_t, H_{i,k}⟩ = 2^{−(k+1)} ∧ (t − i2^{−k})^+ − 2^{−(k+1)} ∧ (t − (2i + 1)2^{−(k+1)})^+.

[Figure: the tent function ⟨f_t, H_{i,k}⟩ and the sample path of X_k, with peaks of height η_{i,k}/2^{k+1} above the midpoints (2i + 1)/2^{k+1} of the intervals (2i/2^{k+1}, (2i + 2)/2^{k+1}].]

The process X_k(t) has continuous, piecewise linear sample paths. It takes the value 0 at t = i/2^k for i = 0, 1, ..., 2^k. It takes the value η_{i,k}/2^{k+1} at the point (2i + 1)/2^{k+1}. Thus sup_t |X_k(t)| = 2^{−(k+1)} max_i |η_{i,k}|. From property (e) in Section 1,
Σ_{k=0}^∞ 2^{k/2} P sup_t |X_k(t)| ≤ Σ_{k=0}^∞ 2^{k/2} 2^{−(k+1)} · 2√(1 + log 2^k) < ∞,
which implies finiteness of Σ_k 2^{k/2} sup_t |X_k(t)| almost everywhere. With probability one, the random series <3> for B_t converges uniformly in 0 ≤ t ≤ 1; almost all sample paths of B are continuous functions of t. If we redefine B(t, ω) to be identically zero for a negligible set of ω, we have a Brownian motion with all its sample paths continuous.
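The construction doubles as a simulation recipe: truncate the series <3> at a finite level K. A sketch (Python with numpy; the truncation level K and the time grid are my own arbitrary choices), with the tent functions ⟨f_t, H_{i,k}⟩ coded from the formula displayed above:

```python
import numpy as np

# Approximate Brownian motion on [0,1] by truncating the series <3>:
# B_t ~ eta * t + sum_{k=0}^K 2^{k/2} sum_i eta_{i,k} <f_t, H_{i,k}>.
rng = np.random.default_rng(2)
K = 12                                   # truncation level (an assumption)
t = np.linspace(0.0, 1.0, 1001)
B = rng.standard_normal() * t            # the eta * t term
for k in range(K + 1):
    half = 2.0 ** (-k - 1)               # half-width 2^{-(k+1)}
    eta = rng.standard_normal(2 ** k)    # the eta_{i,k} at level k
    for i in range(2 ** k):
        # tent function <f_t, H_{i,k}>, peaking at the midpoint of J_{i,k}
        tent = (np.minimum(np.clip(t - i * 2.0 ** (-k), 0.0, None), half)
                - np.minimum(np.clip(t - (2 * i + 1) * half, 0.0, None), half))
        B += 2.0 ** (k / 2) * eta[i] * tent
print(np.round(B[::250], 3))             # a few values along one sample path
```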
From the Brownian motion B indexed by [0,1] we could build a Brownian motion β indexed by R^+ by defining
β_t := (1 + t) B(t/(1 + t)) − tB(1)    for t ∈ R^+.
Clearly {β_t : t ∈ R^+} is a centered Gaussian process with continuous paths and β_0 = B_0 = 0. You should check that cov(β_s, β_t) = s ∧ t, in order to complete the argument that it is a Brownian motion.
REMARK. We could also write Brownian motion indexed by R^+ as a doubly infinite series,
B_t = Σ_{k=−∞}^∞ 2^{k/2} Σ_{i=0}^∞ η_{i,k} ⟨f_t, H_{i,k}⟩,
which converges uniformly (almost surely) on each bounded subinterval. When we focus on [0,1], the terms for k < 0 contribute t Σ_{k<0} 2^{k/2} η_{0,k}, which corresponds to the ηt in expansion <3>; and for k ≥ 0, only the terms with 0 ≤ i < 2^k contribute to B_t.

*4. Finer properties of sample paths


Brownian motion can be constructed to have all of its sample paths continuous, but,
almost all of its paths must be nowhere differentiable. The heuristic explanation
for this extreme irregularity is: existence of a derivative would imply approximate
proportionality of successive small increments, implying a degree of cooperation
that is highly unlikely for a succession of independent normals.
A formal proof can be built from the heuristic. Consider first a single continuous function x on [0,1], which happens to be differentiable at some point t_0 in (0,1), with a finite derivative v, that is, x(t) = x(t_0) + (t − t_0)v + o(|t − t_0|) near t_0. For a positive integer m, let i_m be the integer defined by (i_m − 1)/m ≤ t_0 < i_m/m, and write Δ_{i,m}x for the second difference x((i + 2)/m) − 2x((i + 1)/m) + x(i/m). Then both increments x((i_m + 2)/m) − x((i_m + 1)/m) and x((i_m + 1)/m) − x(i_m/m) equal v/m + o(1/m), and hence
m Δ_{i_m,m}x = m (x((i_m + 2)/m) − 2x((i_m + 1)/m) + x(i_m/m)) → 0    as m → ∞.
Similarly, both m Δ_{i_m+2,m}x and m Δ_{i_m+4,m}x must also converge to zero. By considering successive second differences, we eliminate both t_0 and v, leaving the conclusion that if x is differentiable (with a finite derivative) at an unspecified point of (0,1) then, for all m large enough, there must exist at least one i for which
m|Δ_{i,m}x| ≤ 1    and    m|Δ_{i+2,m}x| ≤ 1    and    m|Δ_{i+4,m}x| ≤ 1.

Apply the same reasoning to each sample path of the Brownian motion {B_t : 0 ≤ t ≤ 1} to see that the set of ω for which B(·, ω) is somewhere differentiable on (0,1) is contained in
liminf_{m→∞} {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1 for some i},
where the Δ_{i,m}B are the second differences for the sample path B(·, ω). Each of Δ_{0,m}B, Δ_{2,m}B, Δ_{4,m}B, ... has a N(0, 2/m) distribution (being a difference of two independent random variables, each N(0, 1/m) distributed) and they are independent. By Fatou's lemma, the probability of the displayed event is smaller than the liminf of the probabilities
P ∪_i {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1} ≤ m (P{|N(0, 2/m)| ≤ 1/m})³ = O(m^{−1/2}).
Thus almost all Brownian motion sample paths are nowhere differentiable.
The nondifferentiability is also suggested by the fact that (B_{t+δ} − B_t)/δ has a N(0, 1/δ) distribution, which could not settle down to a finite limit as δ tends to zero, for a fixed t. Indeed, the increment B_{t+δ} − B_t should be roughly of magnitude √δ. The maximum such increment, D(δ) := sup_{0≤t≤1−δ} |B_{t+δ} − B_t|, is even larger. For example, part (i) of Problem [2] shows, for small ε > 0 and k large enough, that
P{ max_{0≤i<2^k} 2^{k/2} |B((i + 1)2^{−k}) − B(i2^{−k})| ≤ (1 − ε)√(2 log 2^k) } ≤ exp(−C 2^{kθ}),
for constants C > 0 and θ > 0 depending on ε. A Borel-Cantelli argument with δ_k := 2^{−k} then gives
<5>    limsup_{δ→0} D(δ)/√(2δ log(1/δ)) ≥ limsup_k max_i |B(iδ_k + δ_k) − B(iδ_k)|/√(2δ_k log(1/δ_k)) ≥ 1
almost surely. A similar argument (Lévy 1937, page 172; see McKean 1969, Section 1.6 for a concise presentation of the proof) leads to a sharper conclusion,
<6>    limsup_{δ→0} D(δ)/√(2δ log(1/δ)) = 1    almost surely.
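As a sanity check on <6>, one can estimate D(δ) from a finely discretized path. A rough sketch (Python with numpy; the grid size is an assumption of the sketch, and discretization makes the estimate only indicative):

```python
import numpy as np

# Estimate D(delta)/sqrt(2 delta log(1/delta)) for a discretized path,
# where D(delta) := sup_{0<=t<=1-delta} |B_{t+delta} - B_t|.
rng = np.random.default_rng(3)
n = 2 ** 20                              # grid points on [0,1] (assumption)
B = np.concatenate([[0.0], np.cumsum(rng.standard_normal(n)) * np.sqrt(1.0 / n)])
for k in (8, 12, 16):
    lag = n // 2 ** k                    # so delta = 2^{-k}
    delta = lag / n
    D = np.abs(B[lag:] - B[:-lag]).max()
    print(f"delta = 2^-{k}:  ratio = {D / np.sqrt(2 * delta * np.log(1 / delta)):.3f}")
```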


More broadly, √(δ log(1/δ)) gives a global bound for the magnitude of the increments, a modulus of continuity, as shown in the next Theorem. To avoid trivial complications for δ near 1, it helps to increase the modulus slightly, by adding a term that becomes inconsequential for δ near zero.
<7>    Theorem. Define h(δ) := √δ + √(δ log(1/δ)) for 0 < δ ≤ 1. Then for almost all ω there is a finite constant C_ω such that
|B_t(ω) − B_s(ω)| ≤ C_ω h(|t − s|)    for all s, t in [0,1],
where B is a standard Brownian motion indexed by [0,1].


Proof. Consider a pair with 0 < s < t < 1. Temporarily write 8 for t s. From
the series representation <3>,

\Bt - Bs\ < \(t - s)fj\ + YZo2k/2\Xk(t)

"

X (5)l

- * w + T Z o Hi2*/2 \{f< - f" Hi^\Mk

where Mk :

= max< i^*i-

For a fixed k, the function ft fs is orthogonal to all /f,-f* except possibly when t
or s belongs to the support interval /,,*. There are at most two such intervals, and
for them we have the bound |(/, - fsi Huk)\ < m ((5, t] n Jt,k) < \s - t\ A 2~k. Thus
SUp

s^t

h \\t s\)

< SUp
o<s<\

From Problem [2], for almost all co there is a ko(co) such that Mk(co) < 2^/log(l -1-2*)
for k > ko(co). From Problem [5] we then have

for a finite constant Co, which, together with the fact that 8 < h(8), gives
sup
D

5.

the desired bound.

Strong Markov property


Let B be a standard Brownian motion indexed by R^+. For each fixed t, define the restarted process R^t B by moving the origin to the point (t, B_t). That is, (R^t B)_s := B_{t+s} − B_t at each time s in R^+.
For simplicity, let me temporarily write X_s instead of (R^t B)_s. As a process, {X_s : s ∈ R^+} is adapted to the filtration G_s := F_{s+t} for s ∈ R^+. Moreover, each increment X_{s+δ} − X_s has a N(0, δ) distribution, independent of G_s. Thus X is also a standard Brownian motion. It has distribution W, Wiener measure on 𝒞(R^+). For each finite subset S of R^+, the collection of random variables {X_s : s ∈ S} is independent of F_t. Via the usual sort of generating class argument it then follows that X, as a random element of C(R^+), is independent of F_t.


That is, P F g(X) = (PF)(P g(X)) for each F ∈ F_t and (at least) for each bounded, 𝒞(R^+)-measurable function g on C(R^+). Equivalently,
<8>    P(g(X) | F_t) = Wg    almost surely.
The fact that R^t B is a Brownian motion independent of F_t is known as the Markov property of Brownian motion.

If we replace the fixed time t by a stopping time τ, we get a stronger assertion, known as the strong Markov property. Roughly speaking, the restarted process R^τ B is a Brownian motion independent of the pre-τ sigma-field F_τ, which consists of all F for which F{τ ≤ t} is F_t-measurable, for 0 ≤ t ≤ ∞.
REMARK. Remember that F_∞ is defined, if not otherwise specified, as the sigma-field generated by ∪{F_t : t ∈ R^+}. We need to ensure that F_τ ⊆ F_∞ to avoid embarrassing ambiguities about the definition of F_τ for the extreme case of a stopping time that is everywhere infinite.

We could write the stronger assertion as P(g(X) | F_τ) = Wg almost surely on the set {τ < ∞}, but that equality does not quite capture everything we need, as you will discover when we consider the reflection principle, in Example <12>. In that Example we will meet a stopping time τ for which B_τ = a, a positive constant, whenever τ(ω) ≤ t. We will need to make an assertion like
P{B_t < a | F_τ} = P{B_t − B_τ < 0 | F_τ} = ½    on the set {τ ≤ t}.
Intuitively speaking, the conditioning lets us treat the F_τ-measurable random variable τ as a constant, so that B_t − B_τ has a N(0, t − τ) conditional distribution, which is symmetric about 0. Unfortunately, this line of reasoning takes us beyond the properties of the abstract, Kolmogorov conditional expectation: if you were paying very careful attention while reading Chapter 5, you will recognize a covert appeal to existence of a conditional distribution. Fortunately, Fubini offers a way around the technicality.
Rewrite the Markov property as a pathwise decomposition of Brownian motion into two independent contributions: R^t B, shifted to the origin (t, 0); and a killed process K^t B, defined by (K^t B)_s := B_{s∧t} for s ∈ R^+. More formally, define S^t as the operator that shifts functions to the right,
(S^t x)(s) := x(s − t) for s ≥ t,    := 0 for 0 ≤ s < t.


Then B = K^t B + S^t R^t B. The Markov property lets us replace R^t B by a new Brownian motion, independent of F_t, without changing the distributional properties of the whole path. For example, for each F_t-measurable Y and (at least) for each bounded, product measurable real function f, we have (via a generating class argument)
<9>    P f(Y, B) = P^ω W^x f(Y(ω), K^t B(·, ω) + S^t x).
The Y could take values in some arbitrary measurable space. We might take Y = K^t B, for instance.
REMARK. If Y takes values in (X, A), we can define an A ⊗ 𝒞(R^+) ⊗ B([0, ∞])-measurable function g by g(y, z, t) := W^x f(y, K^t z + S^t x). Then the Markov property becomes P f(Y, B) = P^ω g(Y(ω), B(·, ω), t), for F_t-measurable Y. Multiple appeals to the Tonelli/Fubini theorem would establish the necessary measurability properties.

[Figure: a sample path decomposed into its killed part K^τ x and its shifted, restarted part.]
Now consider the effect of replacing the fixed t by a stopping time τ. For every sample path we have B(·, ω) = K^{τ(ω)}B(·, ω) + S^{τ(ω)}R^{τ(ω)}B(·, ω). The decomposition even makes sense for those ω at which τ(ω) = ∞, because K^∞ x = x and S^∞ shifts whatever we decide to define as R^∞ x right out of the picture. For concreteness, perhaps we could take R^∞ x ≡ 0. Of course it would be a little embarrassing to assert that R^τ B is, conditionally, a Brownian motion at those ω.
<10>    Strong Markov property. Let B be a standard Brownian motion for a filtration {F_t : t ∈ R^+}, and let τ be a stopping time. Then for each F_τ-measurable random element Y, and (at least) for each bounded, product measurable function f,
P f(Y, B) = P^ω W^x f(Y(ω), K^{τ(ω)}B(·, ω) + S^{τ(ω)} x).
REMARK. In the notation from the previous Remark, the assertion becomes: P f(Y, B) = P^ω g(Y(ω), B(·, ω), τ(ω)), for F_τ-measurable Y.

Proof. A generating class argument reduces to the case where f(y, z) := g(y)h(z), where g is bounded and measurable, and h(z) := h_0(z(s_1), ..., z(s_k)) with h_0 a bounded, continuous function on R^k and s_1, ..., s_k fixed values in R^+. Discretize the stopping time by rounding up to the next multiple of n^{−1},
τ_n := 0{τ = 0} + Σ_{i∈N} (i/n){(i − 1)/n < τ ≤ i/n} + ∞{τ = ∞}.
For each n,
P f(Y, B) = P f(Y, B){τ = ∞} + Σ_{i∈N} P({τ_n = i/n} g(Y) h(K^{i/n}B + S^{i/n}R^{i/n}B)).


The product {τ_n = i/n}g(Y) is F_{i/n}-measurable, because Y is F_τ-measurable. By <9>, with f(Y, B) replaced by ({τ_n = i/n}g(Y))h(B), the ith summand equals
P^ω W^x {τ_n = i/n} g(Y) h(K^{i/n}B + S^{i/n}x) = P^ω W^x {τ_n = i/n} g(Y) h(K^{τ_n}B + S^{τ_n}x).
Sum over i to deduce that
<11>    P f(Y, B) = P^ω W^x g(Y) h(K^{τ_n}B + S^{τ_n}x).
As n tends to infinity, τ_n(ω) converges to τ(ω), and hence K^{τ_n}B + S^{τ_n}x → K^τ B + S^τ x pointwise (in particular, at each s_j), for each ω and each x ∈ C(R^+). Continuity of h_0 then gives h(K^{τ_n}B + S^{τ_n}x) → h(K^τ B + S^τ x). An appeal to Dominated Convergence completes the proof. □
Corollary. For each W-integrable function f on C(R^+),
P(f(B) | F_τ) = W^x f(K^τ B + S^τ x)    almost surely,
for each stopping time τ.


<12>    Exercise. Let B be a standard Brownian motion indexed by R^+, and let a be a positive constant. Define τ := inf{t : B_t ≥ a}. Use the strong Markov property to find the distribution of τ.
SOLUTION: For fixed t ∈ R^+,
P{τ ≤ t} = P{τ ≤ t, B_t > a} + P{τ ≤ t, B_t ≤ a}.
The first contribution on the right-hand side equals P{B_t > a} = P{N(0, t) > a}, because the inequality for τ is superfluous (by continuity) when B_t > a. Invoke Theorem <10> to write the second term as
P^ω ({τ(ω) ≤ t} W{x : B(t ∧ τ(ω), ω) + x(t − τ(ω)) ≤ a}).
For each ω with τ(ω) ≤ t we have B(t ∧ τ(ω)) = a and W{x : x(t − τ(ω)) ≤ 0} = ½. By Fubini, the term equals ½P{τ ≤ t}. Thus
P{τ ≤ t} = 2P{B_t > a} = 2P{N(0,1) > a/√t}.
If you differentiate you will discover that the distribution of τ has density
a t^{−3/2} exp(−a²/2t)/√(2π)
with respect to Lebesgue measure on R^+. □
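The identity P{τ ≤ t} = 2P{B_t > a} from Example <12> invites a quick simulation. A sketch (Python with numpy; the grid and replication counts are assumptions of the sketch, and the discretized maximum slightly undershoots the continuous one):

```python
import numpy as np
from math import erf, sqrt

# Compare the empirical frequency of {tau <= t} with 2 P{B_t > a},
# where tau is the first passage time of Brownian motion to the level a.
rng = np.random.default_rng(4)
a, t, n, reps = 1.0, 1.0, 1000, 4000     # parameters (assumptions)
dB = rng.standard_normal((reps, n)) * np.sqrt(t / n)
hit = (np.cumsum(dB, axis=1).max(axis=1) >= a).mean()
exact = 2 * (1 - 0.5 * (1 + erf(a / sqrt(2 * t))))
print(f"empirical {hit:.4f}  vs  2 P{{B_t > a}} = {exact:.4f}")
```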

*6. Martingale characterizations of Brownian motion


A centered Brownian motion is a martingale. This fact is just the simplest instance of a method for building martingales from polynomial functions of B_t and t.

<13>    Example. If {(B_t, F_t) : t ∈ T} is a centered Brownian motion then a direct calculation shows that {B_t² − t} is a martingale with respect to the same filtration: if s < t and F ∈ F_s, then
P B_t² F = P(B_s² + 2B_s Δ + Δ²)F    where Δ := B_t − B_s
         = P B_s² F + 2P(B_s F)(PΔ) + (PΔ²)(PF)    by independence.
After substitution of 0 for PΔ and t − s for PΔ², the equality rearranges to give P(B_t² − t)F = P(B_s² − s)F for all F in F_s, the asserted martingale property.
Similar arguments could be used for higher degree polynomials, but it is easier to code all the martingale assertions into a single identity. For each fixed complex θ, the process M_t := exp(θB_t − tθ²/2) is a (complex-valued) martingale: for F in F_s,
P M_t F = P exp(θB_s + θΔ − tθ²/2)F
        = P(exp(θB_s − sθ²/2)F) P exp(θΔ − (t − s)θ²/2)    by independence
        = P M_s F    because Δ is N(0, t − s) distributed.
A dominated convergence argument would justify integration term-by-term to produce a power-series expansion,
P M_t F = P exp(θB_t)F exp(−θ²t/2) = (PF + θ P B_t F + (θ²/2!) P B_t² F + (θ³/3!) P B_t³ F + ...)(1 − θ²t/2 + θ⁴t²/8 − ...),
with a similar expansion for P exp(θB_s)F. The series converge for all complex θ. By equating coefficients of powers of θ, we obtain a sequence of equalities that establish the martingale property for {B_t}, {B_t² − t}, .... As an exercise you might find the term involving B_t³, then check the martingale property by direct calculation, as I did for B_t² − t.
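The coefficient extraction can be mechanized. A small sketch using sympy (my choice of tool, not the text's) expands exp(θx − tθ²/2) and reads off the polynomials whose evaluations at x = B_t are martingales:

```python
import sympy as sp

# Expand exp(theta*x - t*theta**2/2) in powers of theta; the coefficient
# of theta**n times n! is a polynomial p_n(x,t) with p_n(B_t, t) a martingale.
theta, x, t = sp.symbols('theta x t')
expansion = sp.expand(sp.series(sp.exp(theta * x - t * theta ** 2 / 2),
                                theta, 0, 5).removeO())
for n in range(1, 5):
    print(n, sp.expand(sp.factorial(n) * expansion.coeff(theta, n)))
# prints: x, x**2 - t, x**3 - 3*t*x, x**4 - 6*t*x**2 + 3*t**2
```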
Given that Fourier transforms determine distributions, it is not surprising that the martingale property of exp(θB_t − tθ²/2), for all θ, characterizes Brownian motion (essentially equivalence (iii) of Definition <1>). It is less obvious that the martingale property for the linear and quadratic polynomials alone should also characterize Brownian motion with continuous sample paths. This striking fact is actually just an elegant repackaging of a martingale central limit theorem. The continuity of the sample paths lets us express the process as a sum of many small martingale differences, whose individual contributions can be captured by Taylor expansions to quadratic terms.

<14>    Theorem. (Lévy) Suppose {(X_t, F_t) : t ∈ R^+} is a martingale with continuous sample paths, and X_0 = 0. If {(X_t² − t, F_t) : t ∈ R^+} is also a martingale then the process is a standard Brownian motion.
Proof. From equivalence (iii) of Definition <1>, it is enough to prove, for each real θ and each F in F_s, that
<15>    P exp(iθ(X_t − X_s) + θ₂(t − s))F = PF    where θ₂ := ½θ².
I will present the argument only for the notationally simplest case where s = 0 and t = 1, leaving to you the minor modifications needed for the more general case.
In principle, the method of proof is just Taylor's theorem. Break X_1 into a sum of increments Σ_{k=1}^n η_k, where η_k := X(k/n) − X((k − 1)/n). Write P_k for expectations conditional on F_{k/n}. The two martingale assumptions give P_{k−1}η_k = 0 and v_{k−1} := P_{k−1}η_k² = 1/n. Notice that Σ_{k=1}^n v_{k−1} = 1. For fixed real θ, define
D_k := exp(iθ(η_1 + ... + η_k) + θ₂(v_0 + ... + v_{k−1})),
so that D_0 = 1 and D_n = exp(iθX_1 + θ₂). Continuity of the paths should make all the η_k small, suggesting
D_k = D_{k−1} exp(θ₂v_{k−1} + iθη_k) ≈ D_{k−1}(1 + θ₂v_{k−1} + ...)(1 + iθη_k − θ₂η_k² + ...) ≈ D_{k−1}    if η_k² ≈ v_{k−1} is small.
Averaging over an F in F_0 we get P(D_k F) ≈ P(D_{k−1}F). Repeated appeals to this approximation give P(D_n F) ≈ PF, from which <15> would follow in the limit as n → ∞.
n > oo.
For a rigorous proof we must pay attention to the remainder terms, and therein lies a small difficulty. Continuity of the sample paths does make all the increments η_k small with high probability, but we need slightly more control to ensure that the remainders cause no trouble when we take expectations. A stopping time trick will solve the problem. Here we need to make use of a result for martingales in continuous time:
If {(M_t, F_t) : t ∈ T} is a martingale with right continuous sample paths, and if τ is a stopping time for the filtration, then {(M_{t∧τ}, F_t) : t ∈ T} is also a martingale.
The analogous result for discrete-time martingales was established as a Problem in Chapter 6. For continuous time, some extra measurability questions must be settled, and further approximation arguments are required. For details see Appendix E.
For Lévy's theorem, choose the stopping time so that the increments of the stopped process X_{t∧τ} are bounded by a small quantity. Fix an ε > 0. For each n in N, define
τ_n := 1 ∧ inf{t ∈ R^+ : |X_t − X_s| ≥ ε for some s with t − n^{−1} ≤ s ≤ t}.
Each τ_n is a stopping time: by sample path continuity, the set {τ_n < t} can be written as a countable union of F_t-measurable sets, {|X_r − X_{r′}| > ε}, with r and r′ ranging over pairs of rational numbers from [0, t] with |r − r′| < n^{−1}. Sample path continuity (via uniform continuity on compact time intervals) also implies that, for each ω we have τ_n(ω) = 1 eventually, that is, for all n ≥ n_0(ω).
For fixed n and ε, write Y_t for the stopped martingale X(t ∧ τ_n). The martingale increments ξ_k := Y(k/n) − Y((k − 1)/n) are all bounded in absolute value by ε, and from the martingale property of Y_t² − (t ∧ τ_n),
V_{k−1} := P_{k−1}ξ_k² = P_{k−1}((k/n) ∧ τ_n − ((k − 1)/n) ∧ τ_n) ≤ 1/n.
The conditional variance V_{k−1} is not equal to the constant 1/n, but it is true that V_0 + ... + V_{n−1} ≤ 1 and
P(V_0 + ... + V_{n−1}) = Σ_{k=1}^n P((k/n) ∧ τ_n − ((k − 1)/n) ∧ τ_n) = P(1 ∧ τ_n) → 1    as n → ∞,
from which it follows that 1 − (V_0 + ... + V_{n−1}) → 0 in probability.
Replace the D_k defined above by the analogous quantity for the Y process,
D_k := exp(iθ(ξ_1 + ... + ξ_k) + θ₂(V_0 + ... + V_{k−1})).


To keep track of the remainder terms, use the bounds (compare with Problem [8])
e^x = 1 + x + A(x)    with 0 ≤ A(x) ≤ x² for 0 ≤ x ≤ 1,
e^{iy} = 1 + iy − y²/2 + B(y)    with |B(y)| ≤ |y|³ for all y.
We still have D_0 = 1, but now P D_n F = P F exp(iθY_1 + θ₂(V_0 + ... + V_{n−1})), which, by Dominated Convergence, converges to P F exp(iθX_1 + θ₂) as n → ∞. With bounds on the remainder terms, the conditioning argument gives a more precise assertion,
P_{k−1}D_k = D_{k−1}(1 + θ₂V_{k−1} + A(θ₂V_{k−1}))(1 − θ₂V_{k−1} + P_{k−1}B(θξ_k)) = D_{k−1} + R_k,
where
|R_k| ≤ |D_{k−1}| ((θ₂² + θ₂)V_{k−1}² + |θ|³ P_{k−1}|ξ_k|³ + ...) ≤ exp(θ₂)((θ₂² + θ₂)V_{k−1}/n + |θ|³ ε V_{k−1} + ...) ≤ C_θ (n^{−1} + ε) V_{k−1},
because P_{k−1}ξ_k² = V_{k−1} ≤ n^{−1} and |ξ_k| ≤ ε, for a constant C_θ that depends on θ. Averaging over an F in F_0 we then get
|P(F D_n) − PF| ≤ Σ_{k=1}^n |P(F D_k) − P(F D_{k−1})| ≤ Σ_{k=1}^n P|R_k| ≤ C_θ (n^{−1} + ε).
Let n tend to infinity then ε tend to zero to complete the argument. □


REMARK. The Lévy characterization explains why Brownian motion plays a central role in the theory of Itô diffusions. Roughly speaking, such a diffusion is an adapted process {Z_t : t ∈ R^+} with continuous sample paths for which
P(Z_{t+δ} − Z_t | F_t) ≈ μ(Z_t)δ    and    P((Z_{t+δ} − Z_t)² | F_t) ≈ σ²(Z_t)δ
for small δ > 0, where μ and σ are suitably smooth functions. If we break [0, t] into a union of small intervals [t_i, t_{i+1}], each of length δ, then the standardized increments
Δ_i X := (Z(t_{i+1}) − Z(t_i) − μ(Z_{t_i})δ)/σ(Z_{t_i})
are martingale differences for which P((Δ_i X)² | F_{t_i}) ≈ δ. The sum X_t := Σ_i Δ_i X is a discrete martingale for which X_t² − t is also approximately a martingale. It is possible (see, for example, Stroock & Varadhan 1979, Section 4.5) to make these heuristics rigorous, by a formal passage to the limit as δ tends to zero, to build a Brownian motion X from Z, so that Z_{t+δ} − Z_t ≈ μ(Z_t)δ + σ(Z_t)(X_{t+δ} − X_t). By summing increments and formalizing another passage to the limit, we then represent Z as a solution to a stochastic integral equation,
Z_t = Z_0 + ∫_0^t μ(Z_s) ds + ∫_0^t σ(Z_s) dX_s,
showing that the diffusion is driven by the Brownian motion X.


The probability theory needed to formalize the heuristics is mostly at the level of the current Chapter. However, if you wish to pursue the idea further it would be better to invest some time in studying systematically the methods of stochastic calculus, and the theory of the Itô stochastic integral, as developed by Stroock & Varadhan (1979), or (in the more general setting of stochastic integrals with respect to martingales) by Chung & Williams (1990). See also the comments at the end of Section 7 for more about stochastic integrals.

*7. Functionals of Brownian motion


Let B be a standard Brownian motion indexed by [0, 1], with [7f : 0 < t < 1}
its natural filtration. How complicated can Jf-measurable random variables be?
Remember that these are the random variables expressible as measurable functionals
of the whole sample path. The answer for square integrable random variables is
quite surprising; and it has remarkable consequences, as you will see in Section 8.

<16>    Theorem. The Hilbert space H := L²(Ω, F_1, P) equals the closure of the subspace H_0 spanned by the constants together with the collection of all random variables of the form (B_t − B_s)h_s, where h_s ranges over all bounded, F_s-measurable random variables, and s, t range over all pairs for which 0 ≤ s < t ≤ 1.
Proof. It is enough if we show that a random variable Z in H that is orthogonal to H_0 is also orthogonal to every bounded random variable of the form V := g(B_{t_1}, ..., B_{t_k}). For then a generating class argument would show that Z must be orthogonal to every random variable in H, from which it would follow that Z = 0 almost surely.
It even suffices to consider, in place of V, only random variables of the form U := Π_{j=1}^k exp(iθ_j B_{t_j}), for real constants θ_j. For if
P Z⁺ Π_{j=1}^k exp(iθ_j B_{t_j}) = P Z⁻ Π_{j=1}^k exp(iθ_j B_{t_j})
for all real {θ_j}, the uniqueness theorem for Fourier transforms implies equality of the measures Q^±, on σ{B_{t_1}, ..., B_{t_k}}, with densities Z^± with respect to P, and hence Q⁺V = Q⁻V for all V = g(B_{t_1}, ..., B_{t_k}).
Repeated application of the following equality (proved below),
<17>    P Z Y e^{iθB_t} = P Z Y e^{iθB_s} exp(−½θ²(t − s))
for all real θ, all s < t, and all bounded, F_s-measurable random variables Y, will establish orthogonality of Z and U. The argument is easy. First put Y_1 := Π_{j=1}^{k−1} exp(iθ_j B_{t_j}) and θ := θ_k to deduce
P Z Y_1 exp(iθ_k B_{t_k}) = P Z Y_1 exp(iθ_k B_{t_{k−1}}) exp(−½θ_k²(t_k − t_{k−1})).
Then repeat the exercise with Y_2 := Π_{j=1}^{k−2} exp(iθ_j B_{t_j}) and θ := θ_{k−1} + θ_k to get
P Z Y_2 exp(iθ B_{t_{k−1}}) = P Z Y_2 exp(iθ B_{t_{k−2}}) exp(−½θ²(t_{k−1} − t_{k−2})).
And so on. In effect, we replace successive complex exponentials in B_{t_j} by nonrandom factors, and replace θ_j by a sum Σ_{α≥j} θ_α. After k steps, we are left with a product of nonrandom factors multiplied by PZ, which is zero because Z is orthogonal to the constant function 1.
To prove equality <17>, break the interval [s, t] into n subintervals each of length δ := (t − s)/n, with endpoints s = s_0 < s_1 < ... < s_n = t. Write X_j for B(s_j) and Δ_j for X_j − X_{j−1}, which has a N(0, δ) distribution. Abbreviate the conditional expectation P(· | F_{s_j}) to P_j(·).
Temporarily write f(y) for e^{iθy}. Notice that f′(y) = iθf(y) and f″(y) = −2θ₂f(y), where θ₂ := θ²/2. From Problem [8],
|f(x + h) − f(x) − hf′(x) − ½h²f″(x)| ≤ |θh|³/6.



Substitute x := Xj-\ and h := A7 to get

f(Xj) = f(Xj-l)

+ WAjf(Xj-l)-02AJf(Xj-l)

+ Rj

where \Rj\ < |0A,| 3 /6

= (1 - 02S)f(Xj-i) + iOAjf(Xj-i) - 02(Af - S)f(XJ.l)

+ Rj.

The random variable ,- := -02(AJ - 8)f(Xj-\) is 7*-measurable, with P/_i/ = 0


and P|,| 2 < Cj62 for some constant C\ depending on 0. Similarly, F\Rj\2 < C283.
Write y for the constant l-028. Notice that y" = (l-02(t-s)/n)n
-+ exp(-02(t-s))
as n -> 00. Multiply the expansion of / ( X , ) by ZY then take expected values. The
contribution from Ajf(Xj^j) is zero because YAjf{Xj-i)
e J%, Thus we have a
recurrence formula,

Repeated substitutions, starting with j =n9 give an explicit formula,


FZYf(Xn) = ynFZYf(Xo) + FZY QTJ=1 Yn~j{Hj + *
By Cauchy-Schwarz, the last term is bounded in absolute value by ||ZF||2 times

We may assume that 8 is small enough to ensure that \y\ < 1. Orthogonality of the
{,} then gives

The other term is bounded by

In the limit, as n tends to infinity, we get the asserted equality < n > .
REMARK.
Almost the same argument shows that Ho is dense in Li :=
L!(2, 3^,P), under the Lx norm. The proof rests on the fact that if some
element X of L| were not in the closure of Ho, there would exist a bounded,
yf-measurable random variable Z for which V(WZ) = 0 for all W in Ho but
P(XZ) = 1. (Hahn-Banach plus (L1)* = L, for those of you who know some
functional analysis.) The argument based on <17> would imply P(ZW) = 0 for
all bounded, ? f -measurable random variables W, and, in particular, PZ 2 = 0, a
contradiction.
The Theorem tells us that for each X in L²(Ω, F_1, P) and each ε > 0 there are a constant c_0 and grid points 0 = t_0 < t_1 < ... < t_{k+1} = 1 such that
X(ω) = c_0 + Σ_{i=0}^k h_i(ω) Δ_i B + R(ω)    where Δ_i B := B_{t_{i+1}} − B_{t_i},
with each h_i bounded and F_{t_i}-measurable, and PR² < ε². Notice that |PX − c_0| = |PR| ≤ ε, so we may as well absorb the difference c_0 − PX into the remainder, and assume c_0 = PX.
REMARK. The representation takes an even neater form if we encode the {h_i} into a single function on [0,1] × Ω, an elementary predictable process, H(t, ω) := Σ_{i=0}^k h_i(ω){t_i < t ≤ t_{i+1}}. The function H is measurable with respect to the predictable sigma-field 𝒫, the sub-sigma-field of B[0,1] ⊗ F_1 generated by the class of all adapted processes with left-continuous sample paths. Moreover, it is square integrable for the measure μ := m ⊗ P, with m equal to Lebesgue measure on [0,1],
μH² = m^t P^ω Σ_i h_i(ω)²{t_i < t ≤ t_{i+1}} = Σ_i (P h_i²)(t_{i+1} − t_i).
The stochastic integral of H with respect to B is defined as
∫_0^1 H dB := Σ_i h_i Δ_i B.
The random variable defined in this way is also square integrable. For i < j, the random variable Δ_j B is independent of the F_{t_j}-measurable random variable h_i h_j(Δ_i B); and for i = j, the random variable (Δ_i B)² is independent of h_i². The cross-product terms disappear, leaving
P|∫_0^1 H dB|² = Σ_i (P h_i²) P(Δ_i B)² = Σ_i (P h_i²)(t_{i+1} − t_i) = μH².
Thus the map H ↦ ∫_0^1 H dB is an isometry from the space ℋ_0 of all elementary predictable processes into L² := L²(Ω, F_1, P). It is not too difficult to show that ℋ_0 is dense in the space ℋ := L²([0,1] × Ω, 𝒫, μ). The isometry therefore extends to a map from ℋ into L², which defines the stochastic integral ∫_0^1 H dB for all predictable processes that are square-integrable with respect to μ.
Theorem <16> implies that for each X in L² there exists a sequence {H_n} in ℋ_0 for which P|X − PX − ∫_0^1 H_n dB|² → 0. In consequence, {H_n} is a Cauchy sequence, μ|H_n − H_m|² → 0 as n, m → ∞. Completeness ensures existence of an H in ℋ for which
P|∫_0^1 H_n dB − ∫_0^1 H dB|² = μ|H_n − H|² → 0.
In summary: X = PX + ∫_0^1 H dB almost surely, a most elegant representation.
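The isometry is concrete enough to check by Monte Carlo for one elementary process. A sketch (Python with numpy; taking h_i := sign(B_{t_i}) is my own arbitrary choice of a bounded, F_{t_i}-measurable integrand):

```python
import numpy as np

# Monte Carlo check of P |sum_i h_i Delta_i B|^2 = mu H^2 on a uniform grid,
# with h_i := sign(B_{t_i}); then P h_i^2 = 1 for i >= 1 (and h_0 = 0),
# so mu H^2 = (k - 1)/k.
rng = np.random.default_rng(5)
k, reps = 100, 20000
dB = rng.standard_normal((reps, k)) * np.sqrt(1.0 / k)
B = np.cumsum(dB, axis=1)
h = np.sign(np.hstack([np.zeros((reps, 1)), B[:, :-1]]))   # predictable
integral = (h * dB).sum(axis=1)
print(f"P|int H dB|^2 ~ {(integral ** 2).mean():.4f}, mu H^2 = {(k - 1) / k:.4f}")
```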

*8. Option pricing
The lognormal model of stock prices assumes that the price of a particular stock (standardized to take the value 1 at time 0) at time t is given by
<18>    S_t = exp(σB_t + (μ − ½σ²)t)    for 0 ≤ t ≤ 1,
where B is a standard Brownian motion. The drift parameter μ is unknown, but the volatility parameter σ is known (or is at least assumed to be well estimated). Notice that S is a martingale, for the filtration F_t := σ{B_s : s ≤ t} = σ{S_s : s ≤ t}, if μ = 0.
The strange form of the parametrization makes more sense if we consider relative increments in stock price over a small time interval,
(S_{t+δ} − S_t)/S_t = exp((μ − ½σ²)δ + σΔB) − 1    where ΔB := B_{t+δ} − B_t
                   = μδ − ½σ²δ + σΔB + ½((μ − ½σ²)δ + σΔB)² + ...
The square (ΔB)² has expected value δ. The term −½σ²δ centers it to have zero mean. As you will see, the centered variable eventually gets absorbed into a small error term, leaving μδ + σΔB as the main contribution to the relative changes in stock price over short time intervals. The model is sometimes written in symbolic form as dS_t = μS_t dt + σS_t dB_t.
An option may be thought of as a random variable Y that is, potentially, a function of the entire history {S_t : 0 ≤ t ≤ 1} of the stock price: one pays an amount y_0 at time 0 in return for the promise of a return Y at time 1 that depends on the performance of the stock. (That is, an option is a refined form of gambling on the stock market.)
The question whose answer makes investment bankers rich is: What is the appropriate price y_0? The elegant answer is: y_0 = QY, where Q is a probability measure on F_1 that makes the stock price a martingale, because (as you will soon learn) there are trading schemes whose net returns can be made as close to Y − y_0 as we please, in a probabilistic sense. If a trader offered the option for a price y smaller than y_0, one could buy the option and also engage in a trading scheme for a net return arbitrarily close to (Y − y) − (Y − y_0) = y_0 − y > 0. A similar argument can be made against an asking price greater than y_0.
A trading scheme consists of a finite set of times 0 ≤ t_0 < t_1 < ... < t_{k+1} ≤ 1 at which shares are to be traded: at time t_i, buy a quantity K_i of the stock, at the cost K_i S(t_i), then sell at time t_{i+1} for an amount K_i S(t_{i+1}), for a return of K_i Δ_i S, where Δ_i S := S(t_{i+1}) − S(t_i) denotes the change in stock price per share over the time interval. The quantity K_i must be determined by the information available at time t_i, that is, K_i must be F_{t_i}-measurable. We should also allow K_i to take negative values, a purchase of a negative quantity being a sale and a sale of a negative quantity being a purchase. The return from the trading scheme is Σ_{i=0}^k K_i Δ_i S.
We could assume that the spacing of the times between trades is as small as we please. For example, purchase of K shares at time s followed by resale at time t has the same return as purchases of K at times s + iδ and sales of K at times s + (i + 1)δ, for i = 0, 1, ..., N − 1, with δ := (t − s)/N for an arbitrarily large N. Conceptually, we could even pass to the mathematical limit of continuous trading, in which case the errors of approximation could be driven to zero (almost surely).
Existence of the trading scheme to nearly duplicate the return Y − y_0 will follow from the representation in the previous Section. Consider first the case where μ = 0, which makes S a martingale. Let us assume that PY² is finite. Write y_0 for PY. For an arbitrarily small ε > 0, Theorem <16> gives a finite collection of times {t_i} and bounded, F_{t_i}-measurable random variables h_i for which
P|Y − y_0 − Σ_i h_i Δ_i B|² < ε²    where Δ_i B := B(t_{i+1}) − B(t_i).
If we could replace Δ_i B by Δ_i S/(σS_{t_i}) for each i we would have a trading scheme K_i := h_i/(σS_{t_i}) with the desired approximation property. If the time intervals δ_i := t_{i+1} − t_i are all small enough, a simple calculation will show that such a substitution increases the L² error only slightly. The error term
R_{i+1} := Δ_i S/S_{t_i} − σΔ_i B = exp(σΔ_i B − ½σ²δ_i) − 1 − σΔ_i B


has zero expected value and it is independent of F_{t_i}. You will see soon that it also has a small L²(P) norm. To simplify the algebra, temporarily write τ for σ√δ_i and Z for Δ_i B/√δ_i, which has a standard normal distribution. Then
P R_{i+1}² = P(exp(τZ − ½τ²) − 1 − τZ)²
           = P exp(2τZ − τ²) + 1 + τ² − 2P((1 + τZ) exp(τZ − ½τ²))
           = exp(τ²) − 1 − τ² ≤ τ⁴    when τ ≤ 1,
using the facts P e^{τZ} = e^{τ²/2} and P(Z e^{τZ}) = τ e^{τ²/2}.
The approximation Σ_i h_i Δ_i B to Y − y_0 equals Σ_i K_i Δ_i S − Σ_i h_i R_{i+1}/σ. The first sum represents the net return from a trading strategy. The other sum will be small in L²(P) when max_i δ_i is small. Independence eliminates cross-product terms,
P(h_i R_{i+1} h_j R_{j+1}) = P(h_i R_{i+1} h_j)(P R_{j+1}) = 0    if j > i,
and hence
P|Σ_i h_i R_{i+1}|² = Σ_i (P h_i²)(P R_{i+1}²) ≤ Σ_i (P h_i²)(σ⁴δ_i²)    if max_i δ_i ≤ 1/σ²
                   ≤ σ⁴ (max_i δ_i) Σ_i (P h_i²) P(Δ_i B)² = σ⁴ (max_i δ_i) P|Σ_i h_i Δ_i B|² → 0    as max_i δ_i → 0.
The contribution from the R_i's can be absorbed into the other error of approximation, increasing the ε by an arbitrarily small amount, leaving the desired trading scheme approximation for Y − y_0.
Finally, what happens when μ is not zero? A change of measure will dispose of its effects. For a fixed constant α, let Q be the probability measure on F_1 with density exp(αB_1 − ½α²) with respect to P. Problem [11] shows that the process X_t := B_t − αt is a Brownian motion under Q. The stock price is also a function of X if we choose α := −μ/σ,
S_t = exp(σ(X_t + αt) + (μ − ½σ²)t) = exp(σX_t − ½σ²t)    for 0 ≤ t ≤ 1.
If we replace P by Q, and the Brownian motion B by the Brownian motion X, then a repeat of the argument for the case μ = 0 shows that the appropriate price for the option is now y_0 = QY.
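For a concrete Y, the price QY reduces to a routine Monte Carlo computation, since under Q the price involves σ but not μ. A sketch (Python with numpy; the parameter values are pure assumptions of mine) for the call option Y := (S_1 − K)^+ of Problem [12]:

```python
import numpy as np

# Price y0 = Q (S_1 - K)^+ where S_t = exp(sigma*X_t - sigma^2 t / 2)
# and X is a Brownian motion under the martingale measure Q.
rng = np.random.default_rng(6)
sigma, K, reps = 0.2, 1.0, 10 ** 6       # parameter values (assumptions)
X1 = rng.standard_normal(reps)           # X_1 is N(0,1) under Q
S1 = np.exp(sigma * X1 - 0.5 * sigma ** 2)
print(f"Monte Carlo price y0 ~ {np.maximum(S1 - K, 0.0).mean():.4f}")
```

For these particular values (unit initial price, strike K = 1, zero interest rate) the answer should be close to the Black-Scholes value 2Φ(σ/2) − 1 ≈ 0.0797.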
REMARK.
Of course there is a hidden assumption that Y is Q square-integrable,
or just Q-integrable if you accept the Remark following Theorem <16>.

9. Problems
[1] Let Z_i have a N(0, σ_i²) distribution, for i = 1, 2, ..., n. Prove that
P max_{i≤n} |Z_i| ≤ √(P max_{i≤n} |Z_i|²) ≤ 2σ√(1 + log n),    where σ := max_i σ_i.
Hint: Argue via Jensen's inequality that
exp(P max_i |Z_i|²/4σ²) ≤ P exp(max_i |Z_i|²/4σ²) ≤ Σ_i P exp(|Z_i|²/4σ²).
Bound the right-hand side by n√2, then take logarithms.
REMARK. For Problem [1] we do not need to assume that the {Z_i} have a multivariate normal distribution. Nowhere in the argument do we need to know anything about the joint distribution. When the Z_i are independent, or nearly so, the bound is quite good. For the extreme case of n independent N(0, σ²) variables, the inequality from part (i) of Problem [2] shows that P max_{i≤n} |Z_i| ≥ C_0 σ√(1 + log n) for a universal constant C_0. See Section 12.2 for a precise way (Sudakov's minoration) of capturing the idea of approximate independence when the {Z_i} have a multivariate normal distribution. When the variables are highly dependent, the bound is not good: in the extreme case where Z_i ≡ Z_1, which has a N(0, σ²) distribution, P max_{i≤n} |Z_i| = σP|N(0,1)|, which does not increase with n.

[2] Let Z_1, Z_2, ... be independent random variables, each distributed N(0,1). Define M_n := max_{i≤n} |Z_i| and ℓ_n := √(2 log n) and n(k) := 2^k. Use the tail bound from Appendix D,
(1/x − 1/x³)(2π)^{−1/2} exp(−x²/2) ≤ P{Z_i > x} ≤ (1/x)(2π)^{−1/2} exp(−x²/2)    for x > 0,
to prove that M_n/√(2 log n) → 1 almost surely. Argue as follows.
(i) For each small ε > 0, show that there exist strictly positive constants C and θ for which P{M_n ≤ (1 − ε)ℓ_n} = (1 − 2P{Z_1 > (1 − ε)ℓ_n})^n ≤ exp(−Cn^θ), for all n large enough. Deduce that liminf_n M_n/ℓ_n ≥ 1 almost surely.
(ii) For each ε > 0, show that P{M_n ≥ (1 + ε)ℓ_n} ≤ n/n^{(1+ε)²}. Deduce that limsup_k M_{n(k)}/ℓ_{n(k)} ≤ 1 almost surely.
(iii) Use the inequality M_n/ℓ_n ≤ M_{n(k+1)}/ℓ_{n(k)} for n(k) ≤ n ≤ n(k + 1), and the results from parts (i) and (ii), to conclude that M_n/ℓ_n → 1 almost surely.
[3] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a centered Brownian motion whose sample paths need not be continuous. Show that there exists a centered Brownian motion {B_t* : 0 ≤ t ≤ 1} with continuous sample paths, for which P{B_t* ≠ B_t} = 0 for every t. Argue as follows. Let S_n := {i/2^n : i = 0, 1, ..., 2^n} and S := ∪_{n∈N} S_n. For each ω define
D_n(ω) := max_{0≤i<2^n} |B((i + 1)/2^n, ω) − B(i/2^n, ω)|.
(i) Use Problem [1] to prove that Σ_n P D_n < ∞. Deduce that there exists a negligible set N such that ε_n(ω) := Σ_{k≥n} D_k(ω) → 0 as n → ∞ for ω ∉ N.
(ii) Let s and t be points in S_n for which |s − t| ≤ 2^{−m}, where n > m. For each k, write s_k for the largest value in S_k for which s_k ≤ s, and define t_k similarly. (Thus s = s_n and t = t_n.) Show that
|B(s, ω) − B(t, ω)| ≤ D_m(ω) + 2 Σ_{k>m} D_k(ω) ≤ 2ε_m(ω).


(iii) Deduce that
max{|B(s, ω) − B(t, ω)| : s, t ∈ S_n and |s − t| ≤ 2^{−m}} ≤ 2ε_m(ω),
and hence
sup{|B(s, ω) − B(t, ω)| : s, t ∈ S and |s − t| ≤ 2^{−m}} ≤ 2ε_m(ω).
(iv) For each t in [0,1] and each ω ∈ N^c, define
B*(t, ω) := lim_n sup{B(s, ω) : s ∈ S and t − 2^{−n} < s ≤ t}.
Define B*(t, ω) to be zero if ω ∈ N. (Notice that B* is measurable with respect to the completion F̃_t.) Show that, for all t, t′ in [0,1],
|B*(t, ω) − B*(t′, ω)| ≤ 2ε_m(ω)    if |t − t′| ≤ 2^{−m}.
(v) Show that B*(t, ω) = lim_n B(t_n, ω) almost surely, where {t_n} is the sequence defined in step (ii). For each ε > 0, deduce that
P{|B_t* − B_t| > ε} ≤ P liminf_n {|B(t_n) − B_t| > ε} ≤ liminf_n P{|B(t_n) − B_t| > ε} = 0.
(Remember that B(t) − B(t_n) has a N(0, t − t_n) distribution.) Conclude that B_t* = B_t almost surely, for each t.
[4] Suppose {B_s : s ∈ S} is a Brownian motion, with S the countable set of all dyadic rationals in [0,1], as in the previous Problem. Modify the argument from that Problem to construct a standard Brownian motion indexed by [0,1].

[5] Show that there exists a finite constant C_0 for which
Σ_{k=0}^∞ (δ ∧ 2^{−k}) 2^{k/2} √(1 + log 2^k) ≤ C_0 (√δ + √(δ log(1/δ)))    for 0 < δ ≤ 1.
Hint: For 2^{−m} ≥ δ > 2^{−m−1}, bound the sum by
δ Σ_{k≤m} 2^{k/2} √(1 + log 2^k) + Σ_{k>m} 2^{−k/2} √(1 + log 2^k).
The ratio of successive terms in the last sum converges to 1/√2.


[6] Let B be a standard Brownian motion indexed by [0,1]. Show that the process X_t := B_1 − B_{1−t}, for 0 ≤ t ≤ 1, is also a Brownian motion (with respect to its own natural filtration, not the filtration for B).

[7] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous sample paths. Define τ(ω) as the smallest t at which the sample path B(·, ω) achieves its maximum value, M(ω). (Note: τ is not a stopping time.) Follow these steps to show that the sets Ω_t := {τ = t} for 0 ≤ t ≤ 1 form a family suitable for the construction in Example <2>.
(i) Show that {τ ≤ t} = {sup_{s≤t} B_s(ω) = M(ω)}. Deduce that τ is F_1-measurable. Hint: Work with the process at rational times.
(ii) For each t in [0,1), show that {τ = t} ⊆ {sup_{t<s≤1}(B_s − B_t) ≤ 0}. Deduce via the result from Exercise <12> that P{τ = t} = 0.
(iii) Use Problem [6] to show that P{τ = 1} = 0.


[8] For all real y, show that e^{iy} = 1 + iy + (iy)²/2! + ... + (iy)^k/k! + R_k(y) with |R_k(y)| ≤ |y|^{k+1}/(k + 1)!. Hint: for y > 0 use the fact that i ∫_0^y R_k(x) dx = R_{k+1}(y).

[9] Let X := C(R^+), equipped with its metric d for uniform convergence on compacta.
(i) Show that (X, d) is separable. Hint: Consider the countable collection of piecewise linear functions obtained by interpolating between a finite number of "vertices" with rational coordinates.
(ii) Prove that the sigma-field 𝒞 generated by the finite dimensional projections coincides with the Borel sigma-field B(X). Hint: For one inclusion use continuity of the projections. For the other inclusion replace sup-norm distances by suprema over subsets of rationals.

[10] Let {B_t : t ∈ R^+} be a standard Brownian motion. For fixed constants α > 0 and β > 0 show that P{B_t eventually hits the line α + βt} = exp(−2αβ), by following these steps. Write τ for the first hitting time on the linear barrier. For fixed real θ, let X_θ(t) denote the martingale exp(θB_t − ½θ²t).
(i) For each fixed t, show that 1 = P X_θ(t ∧ τ). Hint: You might consult Appendix E if you wish to be completely rigorous.
(ii) If θ ≥ 2β, show that 0 ≤ X_θ(t ∧ τ) ≤ exp(θα).
(iii) If θ > 2β, show that X_θ(t ∧ τ) → 0 as t → ∞ on the set {τ = ∞}.
(iv) Deduce that 1 = P exp(θα + (θβ − ½θ²)τ){τ < ∞} for each θ > 2β. Then let θ decrease to 2β to conclude the argument.
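A crude simulation over a long, finely discretized horizon agrees with the value exp(−2αβ). A sketch (Python with numpy; horizon, grid, and replication counts are assumptions, so the empirical frequency slightly undershoots the true probability):

```python
import numpy as np

# Estimate P{ B_t ever hits alpha + beta*t } and compare with exp(-2*alpha*beta).
rng = np.random.default_rng(7)
alpha, beta, T, n, reps = 1.0, 0.5, 50.0, 50000, 1000
dt = T / n
barrier = alpha + beta * dt * np.arange(1, n + 1)
hits = sum(np.any(np.cumsum(rng.standard_normal(n)) * np.sqrt(dt) >= barrier)
           for _ in range(reps))
print(f"empirical {hits / reps:.3f}  vs  exp(-2 alpha beta) = {np.exp(-2 * alpha * beta):.3f}")
```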
[11] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion defined on a probability space (Ω, F, P). Let α be a positive real number.
(i) Show that the measure Q defined by dQ/dP := exp(αB_1 − ½α²) is a probability measure on F.
(ii) Show that the process X_t := B_t − αt for 0 ≤ t ≤ 1 is a Brownian motion under Q. Hint: For fixed s < t, real θ, and F in F_s, show that
Q exp(iθ(X_t − X_s) + ½θ²(t − s))F
= P exp(α(B_1 − B_t) − ½α²(1 − t)) × P exp((α + iθ)(B_t − B_s) − ½(α + iθ)²(t − s)) × P exp(αB_s − ½α²s)F.
Notice that the right-hand side reduces to an expression that does not depend on θ, which identifies it as QF.
[12] Calculate the price of an option that allows you to purchase one unit of stock for a price K at time 1, if you wish. Hint: Interpret (S_1 − K)^+, or read a book about options.

10. Notes
The detailed study by Brush (1968) describes the history of Brownian motion:
from the recognition by Brown, in 1828, that it represented a physical, rather
than biological, phenomenon; through the mathematical theories of Einstein and
Smoluchowski, and the experimental evidence of Perrin, in the first decade of the
twentieth century.
Apparently Wiener (1923) was motivated to study the irregularity of the
Brownian motion sample paths, in part, by remarks of Perrin regarding the haphazard
motion of small particles, cited (page 133) in translation as "One realizes from such
examples how near the mathematicians are to the truth in refusing, by a logical
instinct, to admit the pretended geometrical demonstrations, which are regarded
as experimental evidence for the existence of a tangent at each point of a curve."
(Compare with Kac (1966, page 34), quoting from Wiener's autobiography.) Wiener
constructed "Wiener measure" as a linear functional on the space of continuous
functions, deriving the necessary countable additivity property from a property
slightly weaker than the modulus condition of Theorem <7>.
The construction in Section 3 is essentially due to Lévy (1939, Section 6), who obtained the piecewise linear approximations to Brownian motion by an interpolation argument (compare with Theorem 1, Chapter 1 of Lévy 1948). He also referred to Lévy (1937, page 172) for the derivation of a Hölder condition <6> for the sample paths. The explicit construction via an orthogonal series expansion in the Haar basis is due to Ciesielski (1961).
Theorem <14> in Section 6 is due to Lévy (1948, pages 77-78 of the second edition), who merely stated the result, with a reference to Theorem 67.3 of Lévy (1937), which established a central limit theorem for (discrete time) martingales under an assumption on the conditional variances analogous to the martingale property for X_t² − t. Doob (1953, Theorem 11.9, Chapter VII) provided a formal proof, similar to the proof in Section 6. The method could be streamlined slightly, with replacement of increments over deterministic time intervals by increments over random intervals. It could also be deduced from fancier results in stochastic calculus, whose derivations ultimately reduce to calculations with small increments of the process. I feel the direct method has pedagogic advantages, because it makes very clear the vital role of sample path continuity.
The proof of Theorem <16> is a disguised version of Itô's formula for stochastic integrals, applied to a particular function of Brownian motion (compare with Durrett 1984, Section 2.14). The result is due to Kunita & Watanabe (1967), although, as noted by Clark (1970), it follows easily from an expansion due to Itô (1950). See also Dudley (1977) for an extension to F_1-measurable random variables that are not necessarily square integrable, verifying a result first asserted then retracted by Clark.
See Harrison & Pliska (1981) and Duffie (1992, Chapter 6) for a discussion of stochastic calculus and option trading. The book by Wilmott, Howison & Dewynne (1995) provides a gentler introduction to some of the finance ideas.

In the last three Sections of the Chapter, I was dabbling with ideas important in stochastic calculus, without introducing the formal machinery of the subject. Accordingly, there was some repetition of methods: chop the sample paths into small increments; make Taylor expansions; dispose of remainder terms by arguments reeking of martingale theory. As I remarked at the end of Section 6, if you wish to pursue the theory any further it would be a good idea to invest some time in studying the formal machinery. I found Chung & Williams (1990) a good place to start, with Métivier (1982) as a reliable backup, and Dellacherie & Meyer (1982) as a rigorous test of true understanding.
REFERENCES

Brush, S. G. (1968), 'A history of random processes: I. Brownian movement from Brown to Perrin', Archive for History of the Exact Sciences pp. 1-36.
Chung, K. L. & Williams, R. J. (1990), Introduction to Stochastic Integration, Birkhäuser, Boston.
Ciesielski, Z. (1961), 'Hölder condition for realization of Gaussian processes', Transactions of the American Mathematical Society 99, 403-413.
Clark, J. M. C. (1970), 'The representation of functionals of Brownian motion by stochastic integrals', Annals of Mathematical Statistics 41, 1282-1295. Correction, ibid. 42 (1971), 1778.
Dellacherie, C. & Meyer, P. A. (1982), Probabilities and Potential B: Theory of Martingales, North-Holland, Amsterdam.
Doob, J. L. (1953), Stochastic Processes, Wiley, New York.
Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of Probability 1, 66-103.
Dudley, R. M. (1977), 'Wiener functionals as Itô integrals', Annals of Probability 5, 140-141.
Duffie, D. (1992), Dynamic Asset Pricing Theory, Princeton University Press.
Durrett, R. (1984), Brownian Motion and Martingales in Analysis, Wadsworth, Belmont CA.
Harrison, J. M. & Pliska, S. R. (1981), 'Martingales and stochastic integrals in the theory of continuous trading', Stochastic Processes and their Applications 11, 215-260.
Itô, K. (1950), 'Multiple Wiener integral', J. Math. Society Japan 3, 158-169.
Kac, M. (1966), 'Wiener and integration in function spaces', Bulletin of the American Mathematical Society 72, 52-68. One of several articles in a special issue of the journal, devoted to the life and work of Norbert Wiener.
Kunita, H. & Watanabe, S. (1967), 'On square integrable martingales', Nagoya Math. J. 30, 209-245.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. References from the 1954 second edition.
Lévy, P. (1939), 'Sur certains processus stochastiques homogènes', Compositio Mathematica 7, 283-339.
Lévy, P. (1948), Processus stochastiques et mouvement brownien, Gauthier-Villars, Paris. Second edition, 1965.
McKean, H. P. (1969), Stochastic Integrals, Academic Press.
Métivier, M. (1982), Semimartingales: A Course on Stochastic Processes, De Gruyter, Berlin.
Stroock, D. W. & Varadhan, S. R. S. (1979), Multidimensional Diffusion Processes, Springer, New York.
Wiener, N. (1923), 'Differential-space', Journal of Mathematics and Physics 2, 131-174. Reprinted in Selected Papers of Norbert Wiener, MIT Press, 1964.
Wilmott, P., Howison, S. & Dewynne, J. (1995), The Mathematics of Financial Derivatives: a Student Introduction, Cambridge University Press.

Chapter 10

Representations and couplings

SECTION 1 illustrates the usefulness of coupling, by means of three simple examples.
SECTION 2 describes how sequences of random elements of separable metric spaces that converge in distribution can be represented by sequences that converge almost surely.
SECTION *3 establishes Strassen's Theorem, which translates the Prohorov distance between two probability measures into a coupling.
SECTION *4 establishes Yurinskii's coupling for sums of independent random vectors to normally distributed random vectors.
SECTION 5 describes a deceptively simple example (Tusnády's Lemma) of a quantile coupling, between a symmetric Binomial distribution and its corresponding normal approximation.
SECTION 6 uses the Tusnády Lemma to couple the Haar coefficients for the expansions of an empirical process and a generalized Brownian Bridge.
SECTION 7 derives one of the most striking results of modern probability theory, the KMT coupling of the uniform empirical process with the Brownian Bridge process.

1.

What is coupling?
A coupling of two probability measures, P and Q, consists of a probability space
(2, J, P) supporting two random elements X and F, such that X has distribution P
and Y has distribution Q. Sometimes interesting relationships between P and Q
can be coded in some simple way into the joint distribution for X and Y. Three
examples should make the concept clearer.

<i>

Example. Let Pa denote the Bin(n,a) distribution. As a gets larger, the distribution should "concentrate on bigger values." More precisely, for each fixed JC, the
tail probability Pa[x, n] should be an increasing function of a. A coupling argument
will give an easy proof.
Consider a f$ larger than a. Suppose we construct a pair of random variables,
Xa with distribution Pa and Xp with distribution Pp, such that Xa < Xp almost
surely. Then we will have {Xa > x] < {Xp > x] almost surely, from which we
would recover the desired inequality, Pa[jc,n] < Pp[x,n]9 by taking expectations
with respect to P.
How might we construct the coupling? Binomials count successes in independent trials. Couple the trials and we couple the counts. Build the trials from

238

<2>

Chapter 10:

Representations and couplings

independent random variables /,-, each uniformly distributed on (0,1). That is,
define Xa := ,-<{/, < ot} and Xp := J^i<n{Ui < P). In fact, the construction
couples all Py, for 0 < y < 1, simultaneously.
Example. Let P denote the Bin(n, a) distribution and Q denote the approximating
Poisson(na) distribution. A coupling argument will establish a total variation bound,
supA \PA QA\ < na2, an elegant means for expressing the Poisson approximation
to the Binomial.
Start with the simplest case, where n equals 1. Find a probability measure P
concentrated on {0,1} x No with marginal distributions P : Bin(l,a) and
Q := Poisson(a). The strategy is simple: put as much mass as
7k!
we can on the diagonal, (0,0) U (1,1), then spread the remaining
mass as needed to get the desired marginals. The atoms on the
diagonal are constrained by the inequalities
P(0,0) < min (P{0), 0(0})) = min (1 - a, e~a),
To maximize, choose P(0,0) := 1 - a and P(l, 1) := ae~a. The
1-a
rest is arithmetic. We need P(l, 0) := e~a - 1 + a to attain the
marginal probability Q{0], and P(0, k) := 0, for k = 1, 2 , . . . , to attain the marginal
P{0] = 1 - a. The choices P(l, k) := Q{k], for k = 2, 3 , . . . , are then forced. The
total off-diagonal mass equals a ae~a < a2.
For the general case, take P to be the n-fold product of measures of the
type constructed for n = 1. That is, construct n independent random vectors
(X\,Y\), . . . , ( X n , y n ) with each Xt distributed Bin(l,a), each Y( distributed
Poisson(a), and P{X, ^ Yi] < a2. The sums X := . X, and Y := \ F, then have
the desired Binomial and Poisson distributions, and P{X ^ Y] < X],-P{^i # Yt) <
na2. The total variation bound follows from the inequality
|P{X e A] - F{Y A}| = |P{X A, X # Y) - F{Y e A, X / Y}\ < P{X ^ Y],

for every subset A of integers.


The first Example is an instance of a general method for coupling probability
measures on the real line by means of quantile functions. Suppose P has distribution function F and Q has distribution function G, with corresponding quantile
functions qF and qG. Remember from Section 2.9 that, for each 0 < i# < 1,
u < F(x)

if and only if

# F ( ) < x.

In particular, if U is uniformly distributed on (0,1) then


n<lF(U) <x}

= F{U < F(x)} = F(JC),

so that X := qF(U) must have distribution P. We couple P with Q by using the


same U to define the random variable Y := qc(U) with distribution Q.
A slight variation on the quantile coupling is available when G is one-to-one
with range covering the whole of (0,1). In that case, qG is a true inverse function
for G, and U = G(Y). The random variable X := qpG{Y) is then an increasing
function of y, a useful property. Section 5 will describe a spectacularly successful
example of a quantile coupling expressed in this form.

239

10.1 What is coupling?

<3>

Example. Suppose {Pn} is a sequence of probability measures on the real line,


for which Pn ^ P. Write Fn and F for the corresponding distribution functions, and
qn and q for the quantile functions. From Section 7.1 we know that Fn(x) - F(x)
at each x for which P{x] = 0, which implies (Problem [1]) that qn(u) - <?(M) at
Lebesgue almost all u in (0,1). If we use a single /, distributed uniformly on (0,1),
to construct the variables Xn := qn(U) and X := q(U), then we have Xn - X almost
surely. That is we have represented the weakly convergent sequence of measures by
an almost surely convergent sequence of random variables.
REMARK.
It might happen that the measures {Pn} are the distributions of some
other sequence of random variables, {Yn}. Then, necessarily, Yn -w P; but the
construction does not assert that Yn converges almost surely. Indeed, we might even
have the Yn defined on different probability spaces, which would completely rule out
any possible thought of almost sure convergence. The construction ensures that each
Xn has marginal distribution Pn, the same as Yn, but the joint distribution of the Xn's
has nothing to do with the joint distribution of the IVs (which is only well defined
if the Yn all live on the same probability space). Indeed, that is the whole point of
the construction: we have artificially manufactured the joint distribution for the Xn's
in order that they converge, not just in the distributional sense, but also in the almost
sure sense.

The representation lets us prove facts about weak convergence by means of the
tools for almost sure convergence. For example, in the problems to Chapter 7, you
were asked to show that A(P, Q) := sup{\Pi - Ql\ : \\i\\BL < 1} defines a metric
for weak convergence on the set of all Borel probability measures on a separable
metric space. (Refer to Section 7.1 for the definition of the bounded Lipschitz
norm.) If A(P n , P) - 0 then Pnf -> Pf for each / with ||/||^ L < oo, that is,
Pn ~* P. Conversely, if Pn ~> P and can we find Xn with distribution Pn and X
with distribution P for which Xn X almost surely (see Section 2 for the general
case), then

A(PniP)< sup

F\l(Xn)-l(X)\<F(lA\Xn-X\)->0.

U\\BL<1

2.

In effect, the general constructions of the representing variables subsume the specific
calculations used in Chapter 7 to approximate { : \\1\\BL < 1} by a finite collection
of functions.

Almost sure representations


The representation from Example <3> has extensions to more general spaces. The
result for separable metric spaces gives the flavor of the result without getting us
caught up in too many measure theoretic details.

<4>

Theorem. For probability measures on the Borel sigma field of a separable metric
space X, if Pn ~> P then there exist random elements Xn, with distributions Pn, and
X, with distribution P, for which Xn - X almost surely.
The main step in the proof involves construction of a joint distribution for Xn
and X. To avoid a profusion of subscripts, it is best to isolate this part of the

240

Chapter 10:

Representations and couplings

construction into a separate lemma. Once again, a single uniformly distributed U


(that is, with distribution equal to Lebesgue measure m on 3 ( 0 , 1)) will eventually
provide the thread that ties together the various couplings into a single sequence
converging almost surely. The construction builds the joint distribution via a
probability kernel, K from (0,1) x X into X.
Recall, from Section 4.3, that such a kernel consists of a family of probability
measures {KUiX() : e (0, 1), x e X} with (uyx) f-+ KUXB measurable for each
fixed B in !B(X). We define a measure on the product sigma-field of (0,1) x X x X
by

( m 0 F 0 K)uxyf(u.

JC, y) := mu {PxKyuxf{u, JC, y)).

Less formally: we independently generate an observation u from the uniform


distribution m and an observation x from P, then we generate a y from the
corresponding Kux. The expression in parentheses on the right-hand side also
defines a probability distribution, (P K)uy on X x X,
(P K)xuyf{x,

y) := PxKyuxf{x,

y)

for each fixed u.

In fact, {(P <8> K)u : u e (0, 1)} is a probability kernel from (0, 1) to X x X. Notice
also that the marginal distribution muPxKu%x for y is a m <g> P average of the Kux
probability measures on !B(X). As an exercise in generating class methods, you
might check all the measurability properties needed to make these assertions precise.
<5>

L e m m a . Let P and Q be probability measures on the Borel sigma- field B(X).


Suppose there is a partition of X into disjoint Borel sets Bo, B\t...9
Bm, and a
positive constant , for which QBa > (1 e)PBa for each a. Then there exists a
probability kernel K from (0, 1) x X to X for which Q = muPxKUtX and for which
(P <g> K)u concentrates on U a (Ba x Ba) whenever u < 1 - e.
Proof. Rewrite the assumption as QBa = 8a +
numbers 8a must sum to because a QBa =
the conditional distribution, which can be taken
on Ba if QBa = 0. Partition the interval (1 - 6,
xnJa = 8a. Define

KuA) = J^a ({M e J] + {U -

(1 )PBay where the nonnegative


]T a PBa = 1. Write Q(- | Ba) for
as an arbitrary probability measure
1) into disjoint subintervals Ja with

- *' X

e Ba]

Q(

I Ba)'

When u < 1 - the recipe is: generate y from Q(- \ Ba) when x e Ba, which
ensures that JC and y then belong to the same Ba. Integrate over u and x to find the
marginal probability that y lands in a Borel set A:

muPxKuxA =

(8a + (1 - )PBa) Q(A \ Ba) = ^ ( G ^ ) Q ( A | Ba) = QA.

as asserted.
REMARK.
Notice that the kernel K does nothing clever when u e Ja. If we
were hoping for a result closer to the quantile coupling of Example <3>, we might
instead try to select y from a Bp that is close to xy in some sense. Such refined
behavior would require a more detailed knowledge of the partition.
Proof of Theorem < 4 > .
The idea is simple. For each n we will construct an
appropriate probability kernel K^\ from (0, 1) x X to X, via an appeal to the

10.2 Almost sure representations

241

Lemma, with Q equal to the corresponding Pn and 6 depending on n. We then


independently generate Xn((o) from K^x> for each n, with u an observation from m
independent of an observation X(co) := x from P.
The inequality required by the Lemma would follow from convergence in
distribution if each Ba were a P-continuity set (that is, if each boundary dBa had
zero P measuresee Section 7.1), for then we would have PnBa -+ PBa as n -> oo.
Problem [4] shows how to construct such a partition n := {Bo, B\,..., Bm] for an
arbitrarily small e > 0, with two additional properties,
(i) PB0 <
(ii) diameter(#a) < for each a > 1.

<6>

We shall need a a whole family of such partitions, nk := [Batk : a = 0, 1 , . . . , m*},


corresponding to values k := 2~* for each k e N.
To each k there exists an nk for which /># > (1 - ek)PB for all in 7r^,
when n > nk. With no loss of generality we may assume that 1 < n\ < ni < ...,
which ensures that for each n greater than n\ there exists a unique k := k(n) for
which nk < n < n*+i. Write K^x for the probability kernel defined by Lemma <5>
for Q := Pn with := ek{n)j and nk{n) as the partition. Define P as the probability
measure m (8) P ( ^ n e N ^ ) on the product sigma-field of Q := (0, 1) x X x XN.
The generic point of & is a sequence o> := (w, JC, yi, >>2,.. .) Define X(co) := JC and
Xn(co):=yn.
Why does Xn converge P-almost surely to X? First note that 4 PBo,* < ooBorel-Cantelli therefore ensures that, for almost all x and every u in (0,1), there
exists a &o = ^o(w *) for which u < \ ek and JC ^ fio,* for all k > ko. For such
(W,JC) and it > ko we have (jc,^n) Va>\Ba,k x Bajt for nk < n < nk+\y by the
concentration property of the kernels. That is, both X((o) and Xn(co) fall within the
same Bak with a > 1, a set with diameter less than ek. Think your way through that
convoluted assertion and you will realize we have shown something even stronger
than almost sure convergence.
Example. Suppose Pn -~* P as probability measures on the Borel sigma-field of
a separable metric space, and suppose that [Tn] is a sequence of measurable maps
into another metric space y. If P-almost all x have the property that Tn(xn) - T(x)
for every sequence {*} converging to JC, then the sequence of image measures
also converges in distribution, TnPn ~ TP, as probability measures on the Borel
sigma-field of y. The proof is easy is we represent {Pn} by the sequence {Xw}, as
in the Theorem. For each I in BLi$)y we have l(Tn(Xn(a>))) - l(T(X(co))) for
P-almost all co. Thus

(TnPn)i = Ft(Tn(Xn)) -+ l(T(X)) = (TP)t,

by Dominated Convergence.
I noted in Example <3> that if Yn has distribution Pn, and if each Yn is
defined on a different probability space (Qn, J n ,P n ), then the convergence in
distribution Yn ~* P cannot possibly imply almost sure convergence for Yn.
Nevertheless, using an argument similar to the proof of Theorem <4>, Dudley (1985)
obtained something almost as good as almost sure convergence.

Chapter 10:

242

Representations and couplings

He built a single probability space ( Q , y , P )


supporting measurable maps \jrn, into Qn, and X,
into X, with distributions Prt = fn (P) and P = X (P),
for which Yn(x//n(co)) -> X(a>) for P almost all co.
In effect, the ^ maps pull Yn back to 2, where the
notions of pointwise and almost sure convergence
make sense.
Actually, Dudley established a more delicate
result, for Yn that need not be measurable as maps
into X, a generalization needed to accommodate
an application in the theory of abstract empirical
processes. See Pollard (1990, Section 9) for a discussion of some of the conceptual
and technical difficultiessuch as the meaning of convergence in distribution for
maps that don't have distributions in the usual sensethat are resolved by Dudley's
construction. See Kim & Pollard (1990, Section 2) for an example of the subtle
advantages of Dudley's form of the representation theorem.

*3.

Strassen's Theorem
Once again let (X, d) be a separable metric space equipped with its Borel sigmafield 'B(X). For each subset A of X, and each 6 > 0, define A to be the closed set
{x e X : d(x, A) < }. The Prohorov distance between any P and Q from the set 7
of all probability measures on (X) is defined as
P(P, Q) '= inf{ > 0 : PB < QB + for all B in B(X)}.
Despite the apparent lack of symmetry in the definition, p is a metric (Problem [3])
on 9.
REMARK.
Separability of X is convenient, but not essential when dealing with
the Prohorov metric. For example, it implies that 3(X x X) = 3(X) <g)23(X), which
ensures that d(X, X') is measurable for each pair of random elements X and X'\ and
if Xn -> X almost surely then P{d(XHf X) > 6} -* 0 for each > 0.
If p(Pni P) -> 0 then, for each closed F we have PnF < PF + e eventually,
and hence limsup n PnF < PF, implying that Pn -w P. Theorem <4> makes it easy
to prove the converse. If Xn has distribution Pn and X has distribution P, and if
Xn -> X almost surely, then for each > 0 there is an n such that
F{d(Xn,X)

>} <e

for n > n .

For every Borel set B, when n > n we have


PnB

< P { X n e B, d(Xni

X) < 6} + P{d(Xni

X)>}<
) }

{X
PB + .
{X e B}} + = PB

Thus p is actually a metric for weak convergence of probability measures.


The Prohorov metric also has an elegant (and useful, as will be shown by
Section 4) coupling interpretation, due to Strassen (1965). I will present a slightly
restricted version of the result, by placing a tightness assumption on the probabilities,
in order to simplify the statement of the Theorem. (Actually, the proof will establish

243

10.3 Strassen's Theorem

a stronger result; the tightness will be used only at the very end, to tidy up.)
Also, the role of e is slightly easier to understand if we replace it by two separate
constants.
<8>

Theorem. Let P and Q be tight probability measures on the Borel sigma field *B of
a separable metric space X. Let e and e' be positive constants. There exists random
elements X andY ofX with distributions P and Q such that F[d(X, Y) > e] < e' if
and only if PB < QB + f for all Borel sets B.
The argument for deducing the family of inequalities from existence of the
coupling is virtually the same as <7>. For the other, more interesting direction, I
follow an elegant idea of Dudley (1976, Lecture 18). By approximation arguments
he reduced to the case where both P and Q concentrate on a finite set of atoms,
and then existence of the coupling followed by an appeal to the classical Marriage
Lemma (Problem [5]). I modify his argument to eliminate a few steps, by making
an appeal to the following generalization (proved in Problem [6]) of that Lemma.

<9>

Lemma. Let v be a finite measure on a


a sigma-field *B on a set T. Suppose {Ra
with the domination property that v(A) <
exists a probability kernel K from S to T
and
Haes vWKa < M.

finite set S and ^ be a finite measure on


: a e S] is a collection of measurable sets
/x(U a/ i^a) for all A c 5. Then there
with Ka concentrated on Ra for each a

Proof of Theorem <8>.


The measure P will live on X x X, with X and Y as
the coordinate maps. It will be the limit of a weakly convergent subsequence of a
uniformly tight family {s : 8 > 0}, obtained by an appeal to the Prohorov/Le Cam
theorem from Section 7.5.
Construct Fs via a "discretization" of P, which brings the problem within the
ambit of Lemma <9>. For a small, positive 8, which will eventually be sent to
zero, partition X into finitely many disjoint, Borel sets Bo,B\,...,Bm
with PBo < 8
and diameter(#a) < 8 for a > 1. (Compare with the construction in Problem [4].)
Define a probability measure v, concentrated on the finite set 5 := { 0 , 1 , . . . , m}, by
v[a] := PBa for a = 0 , . . . , m. Augment X by a point oo. Extend Q to a measure
/ i o n l : = l u {oo} by placing mass e' at oo. Define Ra as B% U {oo}. With these
definitions, the measures v and /x satisfy the requirements of Lemma <9>: for each
subset A of 5,
v(A) = P (UaABa) < Q (UaABaY +f=Q

(uaGABa) + /x{oo} = /z (U aA /? a ).

The Lemma ensures existence of a probability kernel A\ from 5 to 7, with


KaBa + Ka{oo] = KaRa = 1 for each a and J^a v{a}KaA < /JLA for every Borel
subset A of T. In particular, ]P a v{a}KaB < QB for all B S. The nonnegative
measure Q - J^a v{a}Ka\x on *B has total mass
* := 1 " v{<x}KaX = v[a}Ka{oo} < /x{oo} = *'.
Write this measure as TQQ, with go a probability measure on !B. (If r = 0, choose
Qo arbitrarily.) We then have Qh = rQoh + a v{a}Kah for all h M + (X).
Define a probability measure P5 on !B !B by
F8f := Px ( l o { j c Ba] (Ka + Ka{oo}Q0)y / ( * , y))

for / M+(X x X).

244

Chapter 10:

Representations and couplings

REMARK.
In effect, I have converted K to a probability kernel L from X to
X, by setting Lx equal to Ka\x + Ka{oo]Q0 when x Ba. The definition of Fs is
equivalent to Fs := P <> L, in the sense of Section 4.4.

The measure F8 has marginals P and > because, for g and A in M+(X),
= P* Q H B { * *} (* a X + Ka{oo}) g(x))

= P*,

J^ ff P{* } (Kah + tfa{oo}Q0fc) =

W}

It concentrates most of its mass on the set D :== U^=1 (Ba x #*),

F8D > Y2=x

pX

( ^ *}*' 30
= 1- r -

When (JC, y) belongs to D, we have x e Ba and */(>, 5 a ) < 6 for some Ba with
diameter(5a) < 8, and hence J(JC, y) < 8 + e. Thus Fs assigns measure at least
1 - ' - 8 to the closed set F8+ := {(JC, y ) e X x l : J(JC, ^) < 6 + }.
The tightness of both P and ? will let us eliminate 5, by passing to the limit
along a subsequence. For each rj > 0 there exists a compact set Cn for which
PC$ < t] and gCJJ < rj. The probability measure Fs, which has marginals P and Q,
puts mass at most 2rj outside the compact set C^ x C^. The family {P^ : 5 > 0} is
uniformly tight, in the sense explained in Section 7.5. As shown in that Section,
there is a sequence {<$,} tending to zero for which F8i ~> P, with P a probability
measure on !B<g>$. It is a very easy exercise to check that P has marginals P and Q.
For each fixed t > 6, the weak convergence implies
PF, > limsup, F8iFt > limsup, F8iF+8i > 1 - *'.

*4.

Let t decrease to to complete the proof.

The Yurinskii coupling


The multivariate central limit theorem gives conditions under which a sum S of
independent random vectors i , . . . , has an approximate normal distribution.
Theorem <4> would translate the corresponding distributional convergence into a
coupling between the standardized sum and a random vector with the appropriate normal distribution. When the random vectors have finite third moments, Theorem <8>
improves the result by giving a rate of convergence (albeit in probability).

<io>

Theorem. Let i , . . . , n be independent random k-vectors with P& = 0 for each i


and p := . P|| 3 finite. Let S := i + ... + n. For each 8 > 0 there exists a
random vector T with a N(0, var(S)) distribution such that
F{\S -T\>

38} < C0B (l 4- ^log^/B^\

for some universal constant Co-

w here

B :=

245

10.4 The Yurinskii coupling


REMARK.
The result stated by Yurinskii (1977) took a slightly different form.
I have followed Le Cam (1988, Theorem 1) in reworking the Yurinskii's methods.
Both those authors developed bounds on the Prohorov distance, by making an
explicit choice for 8. The Le Cam preprint is particularly helpful in its discussion of
heuristics behind how one balances the effect of various parameters to get a good
bound.

Proof. The existence of the asserted coupling (for a suitably rich probability space)
will follow via Theorem <8> if we can show for each Borel subset A of R* that
P{5 A] < {T A38} + ERROR,
with the ERROR equal to the upper bound stated in the Theorem. By choosing a
smooth (bounded derivatives up to third order) function / that approximates the
indicator function of A, in the sense that / 1 on A and / 0 outside A33, we will
be able to deduce inequality < i i > from the multivariate form of Lindeberg's method
(Section 7.3), which gives a third moment bound for a difference in expectations,

<i2>

IP/(5)

- p/(m < c

(PI$, i + . . . + P I & I )

= cp.

More precisely, if the constant Cf is such that


<13>

\f(x + y) - fix) - y'f(x) - \y'f{x)y\ < Cf\y\3

for all x and y,

then we may take C = (9 4- 8P|JV(O, l)l ) Cf < 15C/.


For a fixed Borel set A, Lemma <18> at the end of the Section will show
how to construct a smooth function / for which approximation <13> holds with
Cf = (cr28)-1 and for which, if 8 > aVk,

<14>

3S

(1 - e){x e A) < f{x) < + (1 - e){x A ]

I =-(f.
V e" '

where

The Lindeberg bound <12>, with C0 = \5p/(a28) = 155(1 + a ) , then gives


F{S A} < (1 - )-lVf(S)
< (1 - )~l (P/(D + 155(1 + a))
< F{T A38} + ,'

where ' := + l**]

a)

We need to choose a, as a function of k and B, to make e' small.


Clearly the bound <15> is useful only when is small, in which case the
( l - e ) factor in the denominator contributes only an extra contant factor to the
final bound. We should concentrate on the numerator. Similarly, the assertion of
the Theorem is trivial if B is not small. Provided we make sure Co > e, we may
assume B < e~\ that is, log(l/#) > 1.
To get within a factor 2 of minimizing a sum of two nonnegative functions,
one increasing and the other decreasing, it suffices to equate the two contributions.
This fact suggests we choose a to make

-(-*)

log(l + a) j log(l/fl) 4- O(k~x).

246

Chapter 10:

Representations and couplings

If B is small then a will be large, which would make log(l + a) small compared
with a. If we make a slightly larger than 2k~l log(l/#) we should get close to
equality. Actually, we can afford to have a a larger multiple of log(l/Z?), because
extra multiplicative factors will just be absorbed into constant Co. With these
thoughts, it seems to me I cannot do much better than choose

which at least has the virtue of giving a clean bound:

- yV y <and hence

d-6)
-1-e-i'
The proof is complete, except for the construction of the smooth function /
satisfying <14>.
Before moving on to the construction of / , let us see what we can do with the
coupling from the Theorem in the case of identically distributed random vectors.
For convenience of notation write y*(x) for the function Co* (1 + I log(l/jt)|/fc).

<16>

Example. Let 1, 2. be independent, identically distributed random A:-vectors


with Pi = 0, var(i) := V, and /x3 := P|i| 3 < 00. Write Sn for 1 + ....
The central limit theorem asserts that Sn/*Jn ~> Af(O, V). Theorem <io>, asserts
existence of a sequence of random vectors Wn, each distributed N(0, V) for which

For fixed k, we can make the right-hand side as small as we please by choosing 5
as a large enough enough multiple of n~ 1/6 . Thus, with finite third moments,

-wn

= Opin

1/6

via the Yurinskii coupling.

For k = 1, this coupling is not the best possible. For example, under an assumption
of finite third moments, a theorem of Major (1976) gives a sequence of independent
random variables Y\, Y2,..., each distributed Af(O, V), for which
= op(n~Xf6)

<n>

almost surely.

Major's result has the correct joint distributions for the approximating normals, as
n changes, as well as providing a slightly better rate.
Example. Yurinskii's coupling (and its refinements: see, for example, the
discussion near Lemma 2.12 of Dudley & Philipp 1983) is better suited to situations
where the dimension k can change with n.
Consider the case of a sequence of independent, identially distributed stochastic
processes {Xt(t) : t e T}. Suppose X\(t) = 0 and \X\(t)\ < 1 for every t. Under
suitable regularity conditions on the sample paths, we might try to show that the
standardized partial sum processes, Zn(t) := (X[(t) + . . . + Xn(t))/y/n, behave like a

10.4 The Yurinskii coupling

247

centered Gaussian process {Z(t) : t e T], with the same covariance structure as X\.
We might even try to couple the processes in such a way that sup, \Zn(t) Z(t)\ is
small in some probabilistic sense.
The obvious first step towards establishing a coupling of the processes is to
consider behavior on large finite subsets T(k) := {t\,..., f*} of 7 , where k is allowed
to increase with n. The question becomes: How rapidly can k tend to infinity?
For fixed k, write , for the random fc-vector with components X/(/ y ), for
j = 1 , . . . , fc. We seek to couple (1 + . . . + %n)/y/n with a random vector Wn,
distributed like {Z(fy) : j = 1,...,A:}. The bound is almost the same as in
Example <16>, except for the fact that the third moment now has a dependence
on fc,
3/2

/i

<*3/2Pl-

\ * 7=1

]P. xj \

^0

, the coupling bound becomes

if it = o ( 1/5 ) and 5 -> 0 slowly enough.

That is, max,<* |Z n (r,) - WnJ\ = o p (l) if k increases more slowly than n I / 5 .
Smoothing of indicator functions
There are at least two methods for construction of a smooth approximation / to a
set A. The first uses only the metric:

For an interval in one dimension, the approximation has the effect of replacing the
discontinuity at the boundary points by linear functions with slope 1/5. The second
method treats the indicator function of the set as an element of an C{ space, and
constructs the approximation by means of convolution smoothing,

f(x)=mw({weA)4>*(w-x)),
where <pa denotes the N(), a2Ik) density and m denotes Lebesgue measure on
(Any smooth density with rapidly decreasing tails would suffice.) A combination of
the two methods of smoothing will give the best bound:

Chapter 10:

248

<18>

Representations and couplings

Lemma. Let A be a Borel subset of R*. let Z have a N(0, /*) distribution. For
positive constants 8 and o define
g(x) := (l - ^ A l \

and

f(x) := Pg(x + oZ) = m" (g(w)</>a(w - x)).

Then f satisfies <13> with C := (cr28)~l, and approximation <14> holds.


Proof. The function / inherits some smoothness from g and some from the
convolving standard normal density 0 a , which has derivatives
and

^<

Ol

For fixed x and y, the function h(t) := f(x + ty), for 0 < r < 1, has second
derivative
.

.
jc + ry + <rZ) ( ( / Z ) 2 - |y

The Lipschitz property \g(x + ty + oZ) - g(x + oZ)\ < t\y\/S then implies

\h(t) - M0)| < ^ 1


The asserted inequality <13> then follows from a Taylor expansion,
|A(1) - h(0) - h(0) - i/i(0)| = \ \h(t*) - *(0)|

where t* e (0,1).

For approximation <14>, first note that A8 < g < A28 and 0 < / < 1
everywhere. Also P{|Z| > 8/a] < e, from Problem [7]. Thus
fix) > Wg(x + aZ){\aZ\ < 8} = P{|Z| < 5/a} > 1 - 6

if * e A,

and
+ aZ){\aZ\ < 8} + Pg(jc 4- <TZ){|<XZ| > 5} < *

if * A35.

5.

Quantile coupling of Binomial with normal


As noted in Section 1, if rj is distributed N(0, 1), with distribution function <I>,
and if q denotes the Bin(n, 1/2) quantile function, then the random variable
X := q{(r))) has exactly a Bin(n, 1/2) distribution. In a sense made precise by
the following Lemma, X is very close to the random variable Y := n/2 + rj^/n/4,
which has a N{n/2,n/4) distribution. The coupling of the Bin(n, 1/2) with its

70.5

249

Quantile coupling of Binomial with normal

approximating N(n/2, n/4) has been the starting point for a growing collection of
striking approximation results, inspired by the publication of the fundamental paper
of Koml6s, Major & Tusnady (1975).
<19>

Tusnady's Lemma. For each positive integer n there exists a deterministic, increasing function r(n, ) such that the random variable X := r(n, rj) has a Bin(n, 1/2)
distribution whenever rj has a N(0,1) distribution. The random variable X satisfies
the inequalities

and

< x

where Y := - 4- rj./-, which has a N (-, - ) distribution.


2
V4
\2 4/
At first glance it is easy to underestimate the delicacy of these two inequalities.
Both X and Y have mean n/2 and standard deviation of order <Jn. It would be no
challenge to construct a coupling for which |X - Y\ is of order yfn\ the Lemma
gives a coupling for which \X Y | is bounded by a quantity whose distribution does
not even change with n.
The original proof (Tusndy 1977) of the Lemma is challenging. Appendix D
contains an alternative derivation of similar inequalities. To simplify the argument,
I have made no effort to derive the best constants for the bound. In fact, the precise
constants appearing in the Lemma will have no importance for us. It will be enough
for us to have a universal constant Co for which there exists couplings such that
<20>

\X-Y\<

Co (l + 772)

and

x- ^

+ M),

a weaker bound that follows easily from the inequalities in Appendix D.

6.

Haar couplingthe Hungarian construction


Let JCI,...,JC B be n independent observations from the uniform distribution P
on (0,1]. The empirical measure Pn is defined as the discrete distribution that puts
mass \/n at each of jq, ...,*. That is, Pnf := " = 1 /(JC/)/n, for each function /
on (0, 1]. Notice that nPnD has a Bin(, PD) distribution for each Borel set D. The
standardized measure vn := */n(Pn - P) is called the uniform empirical process.
For each square integrable function / ,
vnf = n~l/2 " = 1 (/(*,) - Pf) - JV(O, a})

where a) = Pf2 - (P/) 2 .

More generally, for each finite set of square integrable functions f\,..., /*, the
random vector (v n /i,..., vnfk) has a limiting multivariate normal distribution
with zero means and covariances P(ftfj) - (Pfi)(Pfj). These finite dimensional
distributions identify a Gaussian process that is closely related to the isonormal
process {G(f) : / L2(P)} from Section 9.3.
Recall that G is a centered Gaussian process, defined on some probability space
(2,y,P), with cov(G(/), G(g)) = (f9g) = P(fg), the L2(P) inner product. The

250

Chapter 10:

Representations and couplings

Haar basis, *I> = {1} U [i/ritk : 0 < i < 2*, k (= N o }, for L2(P) consists of rescaled
differences of indicator functions of intervals Jiik := / ( / , k) := (i2~k, (i + 1)2~~*],
fi%k := 2*/2 (7 2 a +i - 7 2 l > u +i) = 2k/2 (2/2.M+1 - J/,0

for 0 < i < 2*.

REMARK.
For our current purposes, it is better to replace L2(P) by 2 ( P ) , the
space of square-integrable real functions whose P-equivalence classes define L2(P).
It will not matter that each G(f) is defined only up to a P-equivalence. We need
to work with the individual functions to have Pnf well defined. It need not be true
that Pnf = Png when / and g differ only on a P-negligible set.

Each function in 2 ( P ) has a series expansion,


which converges in the 2 ( P ) sense. The random variables If := G(l) and
rjik := G(\lfitk) are independent, each with a Af(0,1) distribution, and

G(f) =
with convergence in the L2(P) sense. If we center each function / to have
zero expectation, we obtain a new Gaussian process, v(f) := G ( / Pf) =
G(f) (P/)G(1), indexed by 2 (P), whose covariances identify it as the limit
process for vn. Notice that v(x/rik) = G{yjf^k) = r\i%k almost surely, because P^/,* = 0Thus we also have a series representation for v,

At least in a heuristic sense, we could attempt a similar series expansion of the


empirical process,

<22>

vn(f) 2 X X
REMARK.
Don't worry about the niceties of convergence: when the heuristics
are past I will be truncating the series at some finite k.

The expansion suggests a way of coupling the process vn and v, namely, find a
probability space on which vn(\lritk) ^ v(\lsijk) = rjitk for a large subset of the basis
functions. Such a coupling would have several advantages. First, the peculiarities
of each function / would be isolated in the behavior of the coefficients (/, ^r,-,*).
Subject to control of those coefficients, we could derive simultaneous couplings for
many different / ' s . Second, because the ^ functions are rescaled differences of
indicator functions of intervals, the vn(\lrik) are rescaled differences of Binomial
counts. Tusnady's Lemma offers an excellent means for building Binomials from
standard normals. With some rescaling, we can then build versions of the v,,^/,*)
from the rjink.
The secret to success is a recursive argument, corresponding to the nesting
of the Juk intervals. Write node(i, k) for (i + V2)/2*, the midpoint of Jiik. Regard
node(2i, k + 1) and node(2i + 1,fc+ 1) as the children of node(i, k), corresponding
to the decomposition of Ji%k into the disjoint union of the two subintervals 72/,*+1
and 72,+u+i- The parent of node(i, k) is node(Li/2J, k - 1).

251

10.6 Haar couplingthe Hungarian construction

For each integer i with 0 < i < 2k there is a path back through the tree,
path(i\ k) := {(/o, 0), {i\, 1 ) , . . . , (ik% k)}

where I'O = 0 and ik = i,

for which /(/*, k) c J(ik-\,k - 1) C . . . C / ( 0 , 0 ) = (0,1]. That is, the path traces
through all the ancestors (parent, grandparent, . . . ) back to the root of the tree.

JlA
h2
(

(0.3)

(1.3)

(2,3)

(3,3)

(4,3)

(5,3)

(6,3)

h2

h2

J0,3](

(7.3)

The recursive argument constructs successively refined approximations to Pn


by assigning the numbers of observations X,-,* amongst x\, xi,..., xn that land in
each interval /,,*. Notice that, conditional on X,-t* = N, the two offspring counts
must sum to N, with X2i,*+i having a conditional Bin(Af, 1/2) distribution. Via
Lemma <19> define
Xo,i := r(n, ^0,0) = : n - X i j ,
Xo,2 := r(Xo,i, T]O,\) = *o,i - Xi,2,

X2,2 = r ( X i j , ?7i,i) = : Xi f i - X3,2,

and so on. That is, recursively divide the count X,* at each node(i\k) between
the two children of the node, using the normal variable 77,* to determine the
Bin(Xj,*, 1/2) count assigned to the child at node(2i, k -f 1). The joint distribution
for the Xi%k variables is the same as the joint distribution for the empirical counts
nPnJi,k, because we have used the correct conditional distributions.
If we continued the process forever then, at least conceptually, we would
identify the locations of the n observations, without labelling. Each point would
be determined by a nested sequence of intervals. To avoid difficulties related to
pointwise convergence of the Haar expansion, we need to stop at some finite level,
say the mth, after which we could independently distribute the Xl>m observations (if
any) within 7,,m.
The recursive construction works well because Tusnddy's Lemma, even in its
weakened form <20>, provides us with a quadratic bound in the normal variables
for the difference between vn(\lrifk) and the corresponding fy*.
<23>

Lemma.

There exists a universal constant C such that, for each k and 0 < 1* < 2k,

where {(ij, j) : j 0 , 1 , . . . , k] is a path from the root down to node(i*, k).


Proof. Abbreviate /((,, j) to J/, and rji.j to rjj, and so on, for j = 0 , 1 , . . .,k.
Notice that the random variable PnJj has expected value PJj = 2~j, and a small
variance, so we might hope that all of the random variables A7 := 2j PnJj should be
close to 1. Of course Ao = 1.

252

Chapter JO:

Representations and couplings

Consider the effect of the split of 7, into its two subintervals, / ' := 7(2/ ; , j +1)
and J" := J(2ij + IJ + 1). Write N for nPnJj and X for nPnJ', so that
A,- = VN/n and A' := V+lPnJ' = V+lX/n and A /; := V+xPnJ" = 2A7- - A'.
From inequality <20>, we have X = N / 2 + y/Nrjj/2 + /?, where
|/?| < C 0 (l + r?;2)

<24>

and

|X - N/2\ <

C0VN

(l + | ^ | ) .

By construction,
_ .,

N + VNr>, + 2R

, /

A A " . 2/?N

and hence
v B ^ = V27p n (2f - Jj) =
From the first inequality in <24>,
<25>

2Co

|v n ^- - r]j\ <

V7

From the second inequality in <24>,


|A" - A y | = |A' - Ay| =

|X - N/2\ < Co

Invoke the inequality |V - <fb\ <\a- b\/Vb, for positive a andfc,to deduce that
Iv/A^T - yA~| < max (|^/A 7 - y/AJ\, I^A17 - ^ A j | ) < ICQV'1

(1 + |i?y|) /Vn.

From <25> with j = k, and the inequality from the previous line, deduce that

Mvntk - ml < 2CO2*/2 (l + nl) + M


< 2C 0 2*' 2 ( l + r, 2 ) + 2C 0
D

7.

Bound 1^1 + l*7*fyl by 1 + ^ + 5 ^ + 5 ^ , then collect terms involving r/^, to


complete the proof.

The Komlos-Major-Tusnady coupling


The coupling method suggested by expansions <2i> and <22> works particularly
well when restricted to the set of indicator functions of intervals, ft{x) = {0 < x < f},
for 0 < t < 1. For that case, the limit process {v(0, t] : 0 < t < 1), which can
be chosen to have continuous sample paths, is called the Brownian Bridge, or
tied-down Brownian motion, often written as {B(t) : 0 < t < 1}.

<26>

T h e o r e m . (KMT coupling) There exists a Brownian Bridge [B(t) : 0 < t < 1}


with continuous sample paths, and a uniform empirical process vn, for which

P ] sup Ivn(0, t] - B(t)\ > Cx X+

gn \ < Coexp(-x)
V rt
j
with constants C\ and Co that depend on neither n nor x.
[0<r<l

for all x > 0,

253

10.7 The Komlds-Major-Tusnddy coupling


REMARK.
Notice that the exponent on the right-hand side is somewhat arbitrary;
we could change it to any other positive multiple of x by changing the constant C\
on the left-hand side. By the same reasoning, it would suffice to get a bound like
C2 exp(C2JC) + C3 exp(C3JC) 4- C4 exp(c*x) for various positive constants C, and
c,, for then we could recover the cleaner looking version by adjusting C\ and Co. In
my opinion, the exact constants are unimportant; the form of the inequality is what
counts. Similarly, it would suffice to consider only values of x bounded away from
zero, such as x > c0, because the asserted inequality is trivial for x < c0 if Co > ec.
It is easier to adjust constants at the end of an argument, to get a clean-looking
inequality. When reading proofs in the literature, I sometimes find it frustrating to
struggle with a collection of exquisitely defined constants at the start of a proof,
eventually to discover that the author has merely been aiming for a tidy final bound.

Proof. We will build vn from B, allocating counts down to intervals of length 2" m ,
as described in Section 6. It will then remain only to control the behavior of
both processes over small intervals. Let T(m) denote the set of grid points
[i/2m : 1 = 0, 1 , . . . , 2m} in [0, 1]. For each t in T(m), both series <2i> and <22>
terminate after k = m, because [0, t] is orthogonal to each Vu for k > m. That
is, using the Hungarian construction we can determine Pn7,,m for each i, and then
calculate
v>n(0, t] = J ^

J2t VnWi,k)(ft, ti,k)

for t in T(m),

which we need to show is close to


B(t) := v(0, t] = 5 ^ J2t *i.k(ft> *uk)
for t in T(m).
Notice that B(0) = B(\) = 0 = v n (0,0] = v n (0,1]. We need only consider / in
7(m)\{0,1}. For each k, at most one coefficient (/,, yfrik) is nonzero, corresponding
to the interval for which t e 7,^, and it is bounded in absolute value by 2~k/2. The
corresponding nodes determine a path ( 0 , 0 ) , . . . , (i ; , 7 ) , . . . , (im, m) down to the
mth level. The difference between the processes at t is controlled by the quadratic
function,
Sm(t) := ^ m = 0 rif.j
where t e J(ih j) for each j ,
of the normal variables at the nodes of this path:
(0, t] - B(t)\ < J2=0 rf\VnWik,k) ~ mk*\2-k/2
< J2i0 <j<k<

m}C2-* )/2 ( l + rjtj}

< 4C Y^m_0 ( l + ilf-jj

by Lemma <23>

summing the geometric series

<27>

As t ranges over T(m), or even over the whole of (0,1), the path defining
Sm(t) ranges over the set of all 2 m paths from the root down to the mth level.
We bound the maximum difference between the two processes if we bound the
maximum of Sm(t). The maximum grows roughly linearly with m, the same rate as
the contribution from a single t. More precisely
<28>

P{max, Sm(t) > 5m + JC} < 2exp(-x/4)

for each x > 0.

254

Chapter 10:

Representations and couplings

I postpone the proof of this result, in order not to break the flow of the main
argument.
REMARK.
The constants 5 and 4 are not magical. They could be replaced by
any other pair of constants for which Pexp((iV(0, I) 2 ci)/c2) < 1/2.
From inequalities <27> and <28> we have
+

<29> p{ max \vn(0,t]-B(t)\ >4C


[tT(m)

<30>

* J" m\ <p{maxSm(O >jc + 5m| <exp(-Jt/4).


J

jn

I t

Provided we choose m smaller than a constant multiple of x + logw, this term will
cause us no trouble.
We now have the easy task of extrapolating from the grid T(m) to the whole
of (0,1). We can make 2"m exceedingly small by choosing m close to a large
enough multiple of JC + logn. In fact, when x > 2, the choice of m such that
2 n V > 2m > n2ex
will suffice. As an exercise, you might want to play around with other m and the
various constants to get a neater statement for the Theorem.
We can afford to work with very crude estimates. For each s in (0, 1) write ts
for the point of T(m) for which ts < s < ts + 2~m. Notice that
MO, s] - vn(0, ts]\ < # points in (* s]/</n + Vw2"m.
The supremum over s is larger than 3/y/n only when at least one / I m interval, for
0 < i < 2m contains 2 or more observations, an event with probability less than

G) (2~ )

m 2

Similarly,

< n22~m < e~x

for m as in <30>.

sup, \B(s) - B(ts)\ < sup, |G[0, s] - G[0, ts]\ + sup, \(s - ts)Jj]
< max sup |G[0, s] - G[0, i/2m]\ -h 2~ m |^
0<i<2mseJ.m

from which it follows that


j)-B(f,)|> ^ 1 <2 m P

where B is a Brownian motion. The second term on the right-hand side is less than
exp (-4mjc2/2n). By the reflection principle for Brownian motion (Section 9.5), the
first term equals

JC 1

2m/2jc 1

2m r 2 \

\B(2-m)\ > = 2m+1P \N(0, 1)| > j=- < 2 m+1 exp ( y - J .
For x > 2 and m as in <30>, the sum of the two contributions from the Brownian
Bridge is much smaller than e~x.
From <29>, and the inequality
MO, *] - B(s)\ < |vn(0, s] - vn(0, ts]\ + \vn(0, ts] - B(ts)\ + \B(ts) - B(s)\,
together with the bounds from the previous paragraph, you should be able to
complete the argument.

255

10.7 The Komlos-Major-Tusnddy coupling

Proof of inequality <28>.


Write Rm for max, Sm(t). Think of the binary tree of
depth m as two binary trees of depth m 1 rooted at node(0,1) and node(l, 1), to
see that that Rm has the same distribution as ^ 0 - | - m a x ( r , 7'), where T and T both
have the same distribution as Rm-u and 770,0, T9 and V are independent. Write Dk
for Pexp((J?* - 5A:)/4). Notice that
^-5/4DQ

P e x p ( i ^ 0 - I ) = V2exp(-5/4) < 1/2.

For m > 1, independence lets us bound Dm by


( r - 5(m - 1), r - 5(m - 1)))
- 1)) + P e x p ( r - |(m - 1))) = D m _,.
By induction, Pexp(/? m - 5m)/4 = Dm < Do = V2. Thus
P{Rm > 5m + JC} < Pexp ((/?m - 5m)/4) exp(-x/4) < \/2exp(-Jt/4),
D

as asserted.
By means of the quantile transformation, Theorem <26> extends immediately
to a bound for the empirical distribution function Fn generated from a sample
1 , . . . , n from a probability measure on the real line with distribution function F.
Again writing qF for the quantile function, and recalling that we can generate the
sample as & = qF(xt), we have

which implies y/n(Fn(t) - F(t)) = vrt(0, F(r)]. Notice that F(t) ranges over a
subset of [0,1] as t ranges over R; and when F has no discontinuities, the range
covers all of (0,1). Theorem <26> therefore implies
p f supK/n (Fn(f) - F(t)) - B(F(t))\ >

C l

* + 1 g " | < coe~x

for x > 0.

Put another way, we have an almost sure representation Fn(t) = F(t) +


+ Rn(t), where, for example, supr \Rn(t)\ = 6>p (n"1 logn).
REMARK.
From a given Brownian Bridge B and a given n we have constructed
a sample * i , . . . , xn from the uniform distribution. From the same By we could
also generate a sample JCJ, . . . , x'nJ x'n+l of size n + 1. However, it is not true that
JC, = JC,' for 1 < n\ it is not true that * i , . . . , * , JC^+1 are mutually independent. If we
wished to have the samples relate properly to each other we would have to change
the Brownian Bridge with n. There is a version of KMT called the Kiefer coupling,
which gets the correct joint distributions between the samples at the cost of a weaker
error bound. See Csorg6 & Revesz (1981, Chapter 4) for further explanation.

Inequality <3i> lets us deduce results about the empirical distribution function Fn from analogous results about the Brownian Bridge. For example, it implies
sup r V^l^i(0 ~ ^(01 ^ sup, |J?(F(f))|. If F has no discontinuities, the limit
distribution is the same as that of sup5 |#(.s)|. That is, we have an instant derivation of the Kolmogorov-Smirov theorem. The Csorg6 & Rev6sz book describes
other consequences that make much better use of all the hard work that went into
establishing the KMT inequality.

256

Chapter 10:

Representations and couplings

8.

Problems

[1]

Suppose F and Fn9 for n e N are distribution functions on the real line for which
Fn(x) - F(x) for each x in a dense subset D of the real line. Show that the
corresponding quantile functions Qn converge pointwise to Q at all except (at worst)
a countable subset of points in (0, 1). Hint: Prove convergence at each continuity
point MO of Q. Given points x\ JC" in D with x' < xo = Q(uo) < x", find 8 > 0 such
that JC' < Q(M 0 - 5) and Q(M 0 + 8) < x". Deduce that
FH(x') < F(x) + 8 < MO < F(x") - 5 < Fn(jc")

eventually,

in which case x < Qn(uo) < x".


[2]

Let P and Q be two probability measures defined on the same sigma-field A of a


set X. The total variation distance v = v(P, Q) is defined as sup AeA \PA - QA\.
(i) Suppose X are Y are random elements of X, defined on the same probability
space (Q, J, P), with distributions P and Q. Show that F*{X # K} > t^P, 6 ) .
Hint: Choose a measurable set D D ( X ^ F ) with PD = P*{X ^ F}. Note that
P{X A] - F{Y e A) = P{X G A} n D - P{Y e A] n D.
(ii) Suppose the diagonal A : = { ( j c , j ) G X x X : ; c = j } i s product measurable.
Recall from Section 3.3 that V = 1-(PA Q)(X) = (P- Q) + (X) =
(Q-P)+(X).
+
+
Define a probability measure P = i ( P Q) 0 (Q P) 4- A., where A. is the
image of P A Q under the map x H^ (JC, JC). Let X and F be the coordinate maps.
Show that X has distribution P and 7 has distribution Q, and P{X ^ Y] - v.

[3]

Show that the Prohorov distance is a metric. Hint: For the triangle inequality,
use the inclusion (By' c B+. For symmetry, consider /o(P, (2) < 8 < e. Put
D c = B*. Prove that D 5 c Bc, then deduce that 1 - PB < QD8 + 8 < 1 - QB + 5.

[4]

Let P be a Borel probability measure concentrated on the closure of a countable


subset S = {JC/ : / e N) of a metric space X. For fixed > 0, follow these
steps to show that there exist a partition of X into finitely many P-continuity sets
Co, C\,.. , Cm such that PCQ < e and diameter(C,-) < for i > 1.
(i) For each JC in X, show that there are at most countably many closed balls B
centered at x with P(dB) > 0.
(ii) For each xt in 5, find a ball Z?, centered at xt with radius between e/4 and 6/2
and P(dBt) = 0.
(iii) Show that U,-NB,- contains the closure of 5. Hint: Each point of the closure
lies within c/4 of at least one *,-.
(iv) Show that P (u,<mZ?i) > 1 e when m is large enough.
(v) Show that the sets Ct := Bt\ D\<j<i Bj and Co := (Uf^B,-)0 have the desired
properties.

[5]

(2)e O(Dc 9Jlatriage cmma) Suppose 5 is a finite set of princesses. Suppose


each princess, a, has a list, AT (a), of frogs desirable for marriage. For each
collection A c 5, the combined list of frogs equals K(A) = (JfA'Ccr) : 0" A}. If

257

10.8 Problems

each princess is to find a frog on her list to marry, then clearly the "Desirable Frog
Condition" (DFC), #K(A) > #A, for each A c s , must be satisfied. Show that DFC
is also sufficient for happy princesses: under the DFC there exists a one-to-one
map n from S into K(S) such that 7t(a) e K(a) for every a in S. Hint: Translate
the following mathematical fairy tale into an inductive argument.
(i) Once upon a time there was a princess ao who proposed to marry a frog TO
from her list. That would have left a collection 5\{TO} of princesses with lists
K(G)\{TO]
to choose from. If the analog of the DFC had held for those lists,
an induction hypothesis would have made everyone happy.
(ii) Unfortunately, a collection Ao c S\{cro) of princesses protested, on the grounds
that #X(AO)\{TO] < #Ao; clearly not enough frogs to go around. They pointed
out that the DFC held with equality for Ao, and that their happiness could be
assured only if they had exclusive access to the frogs in K(Ao).
(iii) Everyone agreed with the assertion of the Ao. They got their exclusive access,
and, by induction, lived happily ever after.
(iv) The other princesses then got worried. Each collection B in 5\Ao asked,
> ##?" They were reassured, "Don't worry. Originally
"#K(B)\K(A0)
#K(B U Ao) > #B + #A 0 , and we all know that #(A 0 ) = #A 0 , so of course

#K(B)\K(A0) = #K(B U Ao) - #tf(A0) > #B.


You too can live happily ever after, by induction." And they did.
[6]

Prove Lemma <9> by carrying out on the following steps. Write RA for UaeARa.
Argue by induction on the size of S. With no loss of generality, suppose 5 =
{ 1 , 2 , . . . , m}. Check the case m = 1. Work from the inductive hypothesis that the
result is true for #5 < m.
(i) Suppose there exists a proper subset Ao of S for which vAo = V>RA0- Define
R'a = Ra\RAo for a $ Ao. Show that vA < ILR'A for all A c S\A0. Construct
K by invoking the inductive hypothesis separately for Ao and 5\Ao- (Compare
with part (iv) of Problem [5].)
Now suppose vA < /JLRA for all proper subsets A of S. Write La for the probability
distribution [i(- | Ra), which concentrates on Ra.
(ii) Show that /x > v{l}L\.

Hint: Show iiB > v{l}/z(#/?i)//z/?i for all B c Rx.

Write \ for the unit mass at 1. Let 0o be the largest value in [0, v{\}] for which
(/x - OoL\)RA > (v - 0o*i)A for every A c 5.
(iii) If Go = v{l}, use the inductive hypothesis to find a probability kernel from
S\{1] into T for which (/x - v{l}L{) > a > 2 v{a}Ka. Define Kx = L\.
(iv) If $o < v{l}, show that there exists an Ao 5 S for which (/x 9L\)RAQ
<
(v - 0i)Ao when v{l] > 9 > 0o- Deduce that Ao must be a proper subset
of S for which (/x - 0oL\)RAo = (v - 0o^i)Ao. Invoke part (i) to find a
probability kernel M for which ^ - 0o^i > (v{l} - 0o)^i + L a >2 v{
Define Kx := (0 o /v{l})Li + (1 - 0oMl})Mi.

258

[7]

Chapter 10:

Representations and couplings

Establish the bound P{|N(O, /*)| > Vkx} < (xe{-x)k/2, for x > 1, as needed (with
Vibe = 8/a) for the proof of Lemma <18>. Hint: Show that

P{|W(0, h)\2 > kx] < exp(-tkx)(l - 2t)~k/2

for 0 < t < 1/2

which is minimized at t = j(l x~ ).


[8]

Let Fm and Gn be empirical distribution functions, constructed from independent


samples (of sizes m and n) from the same distribution function F on the real line.
Show that
sup, \Fm(t) - Gn(t)\ ~> sup, |B(F(O)|
as min(m, n) -* oo.
m 4- n
Hint: Use <3i>. Show that otB^(s) + pB^is) is a Brownian Bridge if a2 + 2 = 1
and # p Z?^ are independent Brownian Bridges.

9.

Notes
In increasing degrees of generality, representations as in Theorem <4> are due to
Skorohod (1956), Dudley (1968), Wichura (1970), and Dudley (1985).
Prohorov (1956) defined his metric for probability measures on complete,
separable metric spaces. Theorem <8> is due to Strassen (1965). I adapted the
proof from Dudley (1976, Section 18), who used the Marriage Lemma (Problem [5])
to prove existence of the desired coupling in a special discrete case. Lemma <9>
is a continuous analog of the Marriage Lemma, slightly extending the method of
Pollard (1984, Lemma IV.24).
The discussion in Section 4 is adapted from an exposition of Yurinskii (1977)'s
method by Le Cam (1988). I think the slightly weaker bound stated by Yurinskii
may be the result of his choosing a slightly different tail bound for |Af(O, /*)|, with
a correspondingly different choice for the smoothing parameter.
The idea for Example < n > comes from the construction used by Dudley &
Philipp (1983) to build strong approximations for sums of independent random
processes taking values in a Banach space. Massart (1989) refined the coupling
technique, as applied to empiricial processes, using a Hungarian coupling in place
of the Yurinskii coupling.
The proof of the KMT approximation in the original paper (Komlos et al. 1975)
was based on the analog of the first inequality in <20>, for \X - n/2\ smaller than a
tiny multiple of n. The proof of the elegant refinement in Lemma <19> appeared in
a 1977 dissertation of Tusnady, in Hungarian. I have seen an annotated extract from
the dissertation (courtesy of Sandor Csorgo). Csorgo & Revesz (1981, page 133)
remarked that Tusnady's proof is "elementary" but not "simple". I agree. Bretagnolle
& Massart (1989, Appendix) published another proof, an exquisitely delicate exercise
in elementary calculus and careful handling of Stirling's approximation. The method
used in Appendix D resulted from a collaboration between Andrew Carter and me.
Lemma <23> repackages a construction from Komlos et al. (1975) that has
been refined by several authors, most notably Bretagnolle & Massart (1989),
Massart (1989), and Koltchinskii (1994).

259

10.9 Notes
REFERENCES

Bretagnolle, J. & Massart, P. (1989), 'Hungarian constructions from the nonasymptotic viewpoint', Annals of Probability 17, 239-256.
Csorgo, M. & Rev6sz, P. (1981), Strong Approximations in Probability and Statistics,
Academic Press, New York.
Dudley, R. M. (1968), 'Distances of probability measures and random variables',
Annals of Mathematical Statistics 39, 15631572.
Dudley, R. M. (1976), 'Convergence of laws on metric spaces, with a view to
statistical testing'. Lecture Note Series No. 45, Matematisk Institut, Aarhus
University.
Dudley, R. M. (1985), 'An extended Wichura theorem, definitions of Donsker classes,
and weighted empirical distributions', Springer Lecture Notes in Mathematics
1153, 141-178. Springer, New York.
Dudley, R. M. & Philipp, W. (1983), 'Invariance principles for sums of Banach space valued random elements and empirical processes', Zeitschrift fur
Wahrscheinlichkeitstheorie und Verwandte Gebiete 62, 509-552.
Kim, J. & Pollard, D. (1990), 'Cube root asymptotics', Annals of Statistics 18, 191
219.
Koltchinskii, V. I. (1994), 'Komlos-Major-Tusnady approximation for the general
empirical process and Haar expansion of classes of functions', Journal of
Theoretical Probability 7, 73-118.
Komlos, J., Major, P. & Tusnady, G. (1975), 'An approximation of partial sums of
independent rv-s, and the sample df. I', Zeitschrift fiir Wahrscheinlichkeitstheorie
und Verwandte Gebiete 32, 111-131.
Le Cam, L. (1988), On the Prohorov distance between the empirical process and
the associated Gaussian bridge, Technical report, Department of Statistics, U.C.
Berkeley. Technical report No. 170.
Major, P. (1976), 'The approximation of partial sums of independent rv's', Zeitschrift
fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 213220.
Massart, P. (1989), 'Strong approximation for multivariate empirical and related
processes, via KMT constructions', Annals of Probability 17, 266-291.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2 of NSFCBMS Regional Conference Series in Probability and Statistics, Institute of
Mathematical Statistics, Hayward, CA.
Prohorov, Yu. V. (1956), 'Convergence of random processes and limit theorems in
probability theory', Theory Probability and Its Applications 1, 157-214.
Skorohod, A. V. (1956), 'Limit theorems for stochastic processes', Theory Probability
and Its Applications 1, 261-290.
Strassen, V. (1965), 'The existence of probability measures with given marginals',
Annals of Mathematical Statistics 36, 423-439.
Tusnady, G. (1977), A study of Statistical Hypotheses, PhD thesis, Hungarian
Academy of Sciences, Budapest. In Hungarian.

260

Chapter 10:

Representations and couplings

Wichura, M. J. (1970), 'On the construction of almost uniformly convergent random


variables with given weakly convergent image laws', Annals of Mathematical
Statistics 41, 284-291.
Yurinskii, V. V. (1977), 'On the error of the Gaussian approximation for convolutions', Theory Probability and Its Applications 2, 236-247.

Chapter 11

Exponential tails and the


law of the iterated logarithm
SECTION I introduces the law of the iterated logarithm (LIL) through the technically
simplest case: independent standard normal summands.
SECTION 2 extends the results from Section I to sums of independent bounded random
variables, by means of Bennett's exponential inequality. It is noted that the bounds
on the variables could increase slowly without destroying the limit assertion, thereby
pointing to the easy (upper) half of Kolmogorov's definitive LIL.
SECTION *3 derives the very delicate exponential lower bound for bounded summands,
needed to prove the companion lower half for Kolmogorov's LIL.
SECTION *4 shows how truncation arguments extend Kolmogorov's LIL to the case of
independent, identically distributed summands with finite second moments.

1.

LIL for n o r m a l s u m m a n d s
Two important ideas run in tandem through this Chapter: the existence of exponential
tail bounds for sums of independent random variables, and proofs of the law of the
iterated logarithm (LIL) in various contexts. You could read the Chapter as either
a study of exponential inequalities, with the LIL as a guiding application, or as a
study of the LIL, with the exponential inequalities as the main technical tool.
The LIL's will all refer to partial sums Sn := X\ + . . . + Xn for sequences
of independent random variables [Xi] with PX, = 0 and var(Z,) := a? < oo,
for each i. The words iterated logarithm refer to the role played by function
L(x) := y^jtloglogx. To avoid minor inconveniences (such as having to exclude
cases involving logarithms or square roots of negative numbers), I arbitrarily define
L{x) as 1 for x < ee % 15.15. Under various assumptions, we will be able to prove,
with Vn := var(Sn), that

<i>

limsupn_)>00 Sn/L (Vn) = 1

almost surely,

together with analogous assertions about the lim inf and the almost sure behavior of
the sequence {Sn/L (Vn)}. Equality < i > breaks naturally into a pair of assertions,
<2>

limsup,,.^ Sn/L (Vn) < 1

and

l i m s u p , , ^ Sn/L (Vn) > 1

a.s.,

262

Chapter 11:

Exponential tails and the LIL

inequalities that I will refer to as the upper and lower halves of the LIL, or upper
and lower LIUs, for short. In general, it will be easier to establish the upper half,
because the exponential inequalities required for that case are easier to prove.
As you will see, several of the techniques used for proving LIL's are refinements
of techniques used in Chapter 4 (appeals to the Borel-Cantelli lemma, truncation of
summands, bounding of whole blocks of terms by means of maximal inequalities)
for proving strong laws of large numbers (SLLN). Indeed, the LIL is sometimes
described as providing a rate of convergence for the SLLN.
The theory is easiest to understand when specialized to the normal distribution,
for which the following result holds.
<3>

Theorem.
variables,

For the partial sums {Sn} of a sequence of independent N(0,1) random

(i) l i m s u p ^ ^ Sn/Ln = 1
(ii) liminf^oo Sn/Ln = - 1

a.s.
a.s.

(Hi) Sn/Ln e J infinitely often, a.s., for every open subinterval J of [ 1,1].
Proof. The key requirement for the proof of the upper LIL is an exponential tail
bound, such as (see Appendix D for the proof)
<4>

P{Sn > xy/n] < I exp(-jc 2 /2)


If we take xn := yy/2\ogn

for x > 0.

for some fixed y > 1, then

which, by the Borel-Cantelli lemma, implies


n
lim sup
<y
almost surely, for fixed y > 1.
n-Kx> y/2n log n
Cast out a sequence of negligible sets, for a sequence of y values decreasing to 1,
to deduce that

<5>

< 1
almost surely.
lim sup
n-*oo y/2nlogn
This result is not quite what we need for the upper LIL. Somehow we must replace
the y/2n log n factor by y/2n log log /i, without disturbing the almost sure bound.
As with the proof of the SLLN in Section 4.6, the improvement is achieved
by collecting the Sn into blocks, then applying Borel-Cantelli to a bound for a
maximum over a block. To handle the contributions from within each block we
need a maximal inequality, such as the following one-sided analog of the bound
from Section 4.6.

<6>

M a x i m a l Inequality. Let i, . . . , %N be independent random variables, and x9


, and p be nonnegative constants such that P J V ^ 7 > e} > l/P for 2 < i < N.
Then P {
}
The proof is almost identical to the proof for the two-sided bound. For independent,
standard normal summands, symmetry lets us take ft = 2 with = 0.

263

11.1 LJL for normal summands

Define blocks /?* := {n : w* < n < n*+i}, where n*/p* -> 1 for some
constant p > 1 (depending on y) that needs to be specified. For a fixed y > 1,
P{S >y(w) for some n e *}
< P{maxSn > yUrik)}
nzBk

< 2F{Snk+l > yL{nk)}

because L(n) is increasing


by <6> with p = 2 and = 0

< exp (-|}/ 2 L(n*) 2 /n*+i)

by <4>.

The expression in the exponent increases like


yV
,
y2
^ - ^ l o g l o g p * = Mlogfc + loglogp).
If p < y, the bound for block Bk eventually decreases faster than k~a, for
some a > 1. The Borel-Cantelli lemma then implies that, with probability one, only
finitely many of the events [Sn > yL(n) for some n e Bk} occur. As before, it then
follows that limsup^^^ Sn/L{ri) < 1 almost surely, the upper half of the LIL. By
symmetry, we also have limsupn (-Sn/L(n)) < 1 almost surely, and hence
<7>

limsupn \Sn\/L(n) < 1

almost surely.

The lower half of the LIL asserts that, with probability one, Sn > yL(n)
infinitely often, for each fixed y < 1. To prove this assertion, it is enough if we can
find a sequence [n(k)} along which Sn(k) > yLnik) infinitely often, with probability
one. (I will write n(k) and Ln, instead of nk and L(n), to avoid nearly invisible
subscripts.) The proof uses the Borel-Cantelli lemma in the other direction, namely:
if {Ak} is a sequence of independent events for which J^k FAk = oo then the event
[Ak occurs infinitely often} has probability one.
The sums Sn(k) are not independent, because of shared summands. However
the events Ak := {(*) - Sn{k-\) > yLn(k)} are independent. If we can choose n(k)
so that J^k FAk = oo then we will have
l i m supk S"<*> ~ Sk~ > Y
Ln{k)

almost surely.

If n(k) increases rapidly enough to ensure that Ln(*-i)/*i(*) ~* 0, then <7> will
force Sn(k-\)/Ln(k) - 0 almost surely, and the lower LIL will follow.
We need a lower bound for FAk. The increment (*> Sn(k-\) has a AT(O, m(k))
distribution, where m(k) := n{k) - n(k - 1). From Appendix D,

> exp (-0x2/2j

<8>

for all x large enough, if 0 > 1.

Thus, for fixed 0 > 1,


/-0y 2 2n(fc)loglogn(*)\
v/
fe
FAk > exp I
f
I
\
2m (k)
/

, .
r
for it large enough.

The choice n(k) := /:* ensures that both n(k)/m(k) - 1 and Ln{k-\)/Ln{k) -> 0.
With 0 close enough to 1, the lower bound behaves like (k\o%k)~a for an a < 1,
making J2k ^ * diverge, and completing the proof of (i).

264

2.

Chapter 11:

Exponential tails and the LIL

Assertion (ii) follows from (i) by symmetry of the normal distribution.


For (iii) note that J^nF{\Sn - Sn-\\ > y^logn} < n e x p ( - 2 1 o g n ) < oo. By
Borel-Cantelli, Sn Sn-\ O (y/\ognj a.s., which (after some algebra) implies
(Sn/Ln) (5 n _i/L n _i) - 0 a.s.. As [Sn/Ln] oscillates between neighborhoods of
-hi and 1 it must pass through each intervening J infinitely often.

LIL for bounded summands


The upper LIL for normal variables relied on symmetry of the summands (for the
appeal to the Maximal Inequality) and the exponential bound <4>. For sequences
{Sn} generated in other ways we will not have such a clean tail bound, and we need
not have symmetry, but the arguments behind the LIL can be adapted in some cases.

<9>

Lemma. Let Tn := i + . . . + be a sum of independent random variables


with Pft = 0 and a? := var(&) < oo. Suppose {Wn} is an increasing sequence
of constants with o\ + . . . + a2 < Wn -> oo, and {n(k) : k e N} is an increasing
sequence for which Wn^+\)/ Wn^) is bounded. Then, for constants k > 1 and 8 > 0,
W{Tn > (X + 8)L(Wn) for some n with n(k) <n<n(k+
< 2P{rn(*+1) > XL(Wn{k))}

1)}

for all k large enough.

Proof. Replace L(Wn) by the lower bound L(Wn^))y to put the inequality in the
form amenable to an application of the Maximal Inequality <6>. Then argue, by
Tchebychev's inequality, that for n(k) < n <n(k+ 1),
?

XT (XV

- Sn > -8L(Wn(k))}

\\ ^ 1

> 1

82L(Wn(k))

282Wn(k) loglog Wn(k)'

which tends to 1 as A: tends to infinity.


If we are to imitate the proof for normal summands, the other key requirement
is existence of an exponential tail bound. For bounded summands there is a simple
exponential inequality, which looks like <4> except for the appearance of an extra
factor in the exponent, a factor involving the nonegative function

-10

. = | 2 ((1 + x) log(l + x) - JC) /x2 for x > - 1 and x # 0


11
for JC = 0.
The function yfr is convex and decreasing (Appendix C). For the
moment, it is enough to know that V(*) % 1 when JC ^ 0, so that
the inequalities look similar to the inequalities for normal tails

when we focus on departures "not too far out into the tails."
<ii>

Bennett's Inequality.

Let Y\y , Yn be independent random variables with

(i) FYf = 0 and a := FY2 < oo


(H) Yi < M for every i, for some finite constant M.
For each constant W > o2 -\

h a2,

Yn > x} < exp [~^f i^ff)

for x

265

11.2 LIL for bounded summands


Proof. For each t > 0,
P{7, + + Yn > x] < e~xt Y\^n

FexpitYi).

As shown in Appendix C, the function


A(x) := 2(ex - 1 - JC)/JC2,

<12>

with A(0) = 1,

is nonnegative and increasing over the whole real line. Rewrite cxp(tYt) as
1 + tYi + \{t Yt)2A(t Yi) < 1 + tYi +

\t2Y2&{tM).

Take expectations, then invoke 1 + a < ea to bound the tail probability by


( + \tVA(rM))

< exp (-** + If

Minimize the exponent by putting Mf equal to log(l + Mx/W), then rearrange to


get the stated upper bound.
The inequality ir(x) > (1 +x/3)~l, also established in Appendix C, gives a
slight weakening of Bennett's inequality,

<13>

Corollary.

(Bernstein's inequality) Under the conditions of < n > ,

With Lemma <9> and Bennett's inequality, we have enough to establish an


upper LIL, at least for summands X, bounded in absolute value by a fixed constant M.
Assume var(Sn) := Vn -* oo as n -> oo. For a fixed p > 1 (depending on y), define
blocks by putting n(k) := max{n : Vn < pk}. The fact that fl^(ik)+1 < M2 = ^(Vn^))
ensures that Vn{k)/pk -> 1 as A; -> oo. From Lemma <9> with Ww equal to Vn,
F{Sn > (k 4- 5)L(Vn) for some n with (*:) < n < n(k + 1)}
< 2{Sn(M) >
2VniM)

VHiM)

The expression in the exponent increases like X22/o*loglog(p*)^ (o(


The ^(^(1)) converges to 1, and therefore can be absorbed into other factors. If
p < A., the bound for block Bk again eventually decreases faster than k~a, for
some a > 1. The upper half of the LIL, as in first assertion of <2>, then follows
via Borel-Cantelli and the casting out of a sequence of negligible sets.
Notice that uniform boundedness of the summands was needed twice:
(i) to show that <r2(jt)+1 = o(Vn(k))y thereby ensuring that Vn(*)/P* -* 1 as k -> oo;
(ii) to show that the argument of the ^ factor in the exponent tends to zero.
The same properties also hold for sequences with |X n | < Mn, where {Mn} is a slowly
diverging sequence of constants. In fact, they hold if and
<15>

-> oo
oo
Vn ->

and
and

Mnn=o
=o \JV
\JVnn/^g\o%
/^g\o%
M

Vnn jj
V

as n -> oo,

266

Chapter 11:

Exponential tails and the LIL

a condition introduced by Kolmogorov (1929) to prove the LIL <i> for partial
sums of independent random variables X, with PX, = 0 and |X,| < Af,.
The proof of the upper LIL under <15> is essentially the same as the proof
for uniformly bounded summands, as sketched above. The lower LIL requires an
analog of the exponential lower bound <8>. The next Section establishes this lower
bound, an extremely delicate exercise in Calculus. With this exponential bound, you
could prove the corresponding lower LIL by modifying the analogous proof from
Section 1 (or the proof of the lower LIL that will be sketched in Section 4).

*3.

Kolmogorov's exponential lower bound


Let X,, Sn and Vn be as before. Suppose also that |X/| < 8^/V^ for / = 1,2,..., w,
where 8 is a small positive constant. Then for each constant 0 > 1 there exists
an JCO > 0 and a K (both depending only on 0) such that
F[Sn > Xy/Vn} > exp {-\0x2\

for JC0 < x < K/8.

Proof. The constants will be determined by a collection of requirements that


emerge during the course of the argument. As with many proofs of this type, the
requirements make little intuitive sense out of context, but it is useful to have them
collected in one place, so that the dependences between the constants are clear.
As you will soon see: the constant 0 will determine a small > 0, which in turn
will determine an even smaller rj > 0 (in fact, we will choose rj slightly smaller
than 62/2), and a small, positive K depending on and r). Specifically, we will
need K so small that
max ((1 + rj)~\ (1 + ))

and

with ^ as in <io> and A as in <12>. To avoid an accumulation of many trivial


constraints, assume (redundantly) that 0 < 17 < e < 1. We will also need

and

- 6 - (1 + 40(1 + ) + i ( l + )2(1 - ri) > -0/2.

The constant Jto will need to be large enough that | > exp(exfy and
2 + 3(1 + 6)JC exp (|JC 2 (1 + <02K) < \ exp (jjc 2 (l + <02(l - *?))

for JC > JC0.

Now let the argument begin.


REMARK.
AS noted by Dudley (1989, page 379), it is notoriously difficult to
manage the constants and ranges correctly for the Kolmogorov inequality. With the
constraints made explicit, I hope that my errors will be easier to detect and repair.

267

11.3 Kolmogorov's exponential lower bound

With no loss of generality, assume Vn = 1 (equivalently, divide each X,


by \/V^), and consequently o} = FXf < 82 for each i. By almost the same
reasoning as in the proof of Bennett's inequality, for t > 0,

)=
> PJ/<w (l + \t2a2A{-t8)\

because A increasing, and Xt > -8

2 2

-t a A(-t8)
2

i I

If 0 < t < 2K/8 then A(-2K)

(J2
^

9 A

, -I

< A(-t8)

i+2K

(1 + 30

< A(0) = 1. We then have

2 2

U o A(-2K)\^

via log(l

)~

__,

CXP

(^

(1

"

the second equality coming from < n > and the fact that , a? = Vn = 1. We also
have an upper bound for the same quantity,
Pexp(fSn) = P /

tety{Sn

>y}dy<\+(

tetyF{Sn

> y}dy.

The idea now is to choose t so that the last integrand is maximized somewhere
in a small interval J := [x, w], which contributes at most
<20>

/ tetetytyP{Sn >x}dy<
etwF{Sn > x]
Jx
to the right-hand side of <19>. We need the other contributions to <19>, from y
outside / , to be relatively small. For such y9 we can use Bennett's inequality to
bound the integrand by texp(ty \y2ty{y8)). If we were to ignore the x// factor,
the bound would be maximized at y t, which suggests we make t slightly larger
than x but smaller than w. Specifically, choose t := (1 -f )x and w := (1 H- 4C)JC,
for a small e that needs to be specified. (Note that t < 2x < 2K/8, as required
for <18>.)
When y is large, the \/r factor has a substantial effect on the y2. However, using
the fact (Appendix C) that y\/r(y) is an increasing function of y, and the constraint
x < K/S9 we have (y8)f(y8) > (Sx8)\/r(Sx8) > (Sx8)\lr(SK) when y > 8JC, hence
) > 2yt

by

The contribution from the region where y > 8JC is therefore small if x < K/8:
/OO

tetyf>{Sn>y}dy

/OO

< I

texp(ty - 2ty) dy = exp(-8rjc) < 1.

i%x
hx
Within the interval [0, 8JC] we have \fr(y8) > \/r(Sx8) > ir(SK) > (1 + rj)~l
because x < K/8, and by < n > , and the integrand tetyV{Sn > y] is less than

Notice that the exponent is maximized at y = (1 4- r])t = (1 -f rj)(l -h 6)JC, which lies
in the interior of 7, with
min ((1 + rj)t - JC, w - (1 + rj)f) > ex

because (1 + rj)(l + ) < 1 + 3e.

268

Chapter 11:

Exponential tails and the LIL

If divided by yj2n(\ + r\), a constant smaller than 3, the factor contributed by the
quadratic in y turns into the N{{\ + rf)t, (1 + 77)) density. The contribution to the
bound <19> from y in [0, 8 J C ] \ / is less than
<22>

t exp (\t\\

+ r/)) 3P{|tf(0, (1 + r/))| > ex] < 3f exp

Combining the inequalities from <18> and <19>, with the right-hand side
of the latter broken into contributions bounded via <2i>, <22>, and <20> then
rewritten as functions of JC, we have
2+3(1
2

+ exp (JC (1 + 4 0 ( 1 + 6)) P{Sn

<23>

> exp ( | J C 2 ( 1 + <02(l - ??))

> JC}

for 0 < x < K/8.

We need to absorb the first two terms on the left-hand side into the right-hand side,
which will happen for large enough x if we ensure that

Choose r] := rj() to make this inequality hold (a value slightly smaller than 2 /2
will suffice), then find x so that the sum of the first two terms in <23> is smaller
ght-hand side when x > x. We may also assume that | > exp(6JC
exp(6JC22).
than half the right-hand
Then we have
( 2 - x22(l + 4<0(l
sn >x}> exp (~x

when JC6 < x < K/8. Finally, we choose so small that


-e - ( 1 + 4 0 ( 1 + 6 ) + | ( 1 +*) 2 (1 - 17) > - 0 / 2 ,

*4.

which is possible because the left-hand side tends to 1/2 as e decreases to zero.
Put JCO equal to the corresponding x.

Identically distributed summands


Kolmogorov's LIL for bounded summands under the constraint <15> extends to
identically distributed summands by means of a truncation argument, an idea due to
Hartman & Wintner (1941). That is, the normality assumption can be dropped from
Theorem <3>.

<24>

T h e o r e m . For the sequence of partial sums {} of a sequence of independent,


identically distributed random variables {Xf} with PX; = 0 and var(X/) = 1,

(i) l i m s u p ^ ^ Sn/Ln = 1
(ii) liminf^oo Sn/Ln = - 1

a.s.
a.s.

(Hi) Sn/Ln J infinitely often, a.s., for every open subinterval J of [ 1,1].

11.4

269

Identically distributed summands

Most of the ideas needed for the proof are contained in Sections 1 and 2. I will
merely sketch the arguments needed to prove Theorem <24>, with emphasis on the
way the new idea, truncation, fits neatly with the other techniques.
The truncated variables will satisfy an analog of <15> with Vn := n, except
that the o{>) will be replaced by a fixed, small factor. The fragments discarded by
the truncation will be controlled by the the following lemma, which formalizes an
idea of DeAcosta (1983).
<25>

Lemma.

The function

is strictly increasing and

for some constant C.

dx<Ct

Jee

L(X)

for t > g{ee\

Proof. Differentiate.
> - (1 - - )>0
2 ^ p f = - - -
g(x)
x log log x log x x x \
e)
The inequality also implies that
1

Q (x ^

for x > ee.

*2.p

SL1 < Cg\x)


where C :=
3.16,
L(x)
x
e-l
from which it follows, for t > g(ee), that f*~l(t) j^dx < feg~l(t) Cg'(x) dx < Ct.
Upper LIL
Consider first the steps needed to prove limsup(S n /L n ) < y a.s., for a fixed y > 1.
It helps to work with a smooth form of truncation, so that (Problem [4]) the variances
of the truncated variables are necessarily smaller than the variances of the original
variables. For each positive constant M define a function from E onto [M, M] by

<26>

T(X,

M) := -M{x < -M] + JC{|JC| < M} + M{x > M}.

For a fixed e > 0, which will depend on y, define, for i > 17 > 1 -f eey
with gi := g(i), as defined by Lemma <25>. Note that |/x,| < P|Z,|, because
V(Yi + Z/) = PX/ = 0.
REMARK.
Notice the dependence of the truncation level on i. Compare with the
truncation used to prove the SLLN under a first moment assumption in Section 4.7,
and the truncations used to prove central limit theorems in Section 7.2. For almost
sure convergence arguments it is common for each variable to have its own truncation
level, because ultimately Borel-Cantelli assertions must depend on convergence of a
single infinite series, rather than on convergence of changing sequences of partial
sums to zero.

270

Chapter 11:

Exponential tails and the UL

The partial sum Sn decomposes into J2i<n & + Zw<n Mi + /< zi- Th e first
sum will be handled essentially as in Section 2. The other two sums will be small
relative to Ln, by virtue of the following bound,
Zll < j _L_li < 2 J
i=17 ^'

i=17

^'

'

lllg

n/

i=17

identical distnbutions

^'

<F\\X\\y^[g l(\X\\/e) > i] I


\

r-?L

1=1/

< P (|Xi |C|Xi |/6)


< oo

L(x)

dx)
I
/

by Lemma <25>

because FX\ < oo.

The series started from i = 1 also converges. By Kronecker's lemma,


L~x Yli<n Li\l*i\/Li -> 0 as n - > o o . Similarly, finiteness of the expected value of
J2i \Zi\~/Li implies J2i<n %* = (Ln), almost surely.
To handle the contribution from the {ft}, write Tn for J2i<n & anc* Vnforvar(rn).
By Dominated Convergence, var(ft) - 1 as i > oo, implying Vn/n -> 1 as n -> oo.
Write y as 1 4- 25, with 5 > 0. We need to prove limsup(r n /L n ) < 1 + 28 almost
surely. As in Section 2, use blocks Bk := {n() < n < n(k -h 1)}, with n(k)/pk - 1,
for a > 1 to be specified. Invoke Lemma <9> with Wn := n and A := 1 + 8
to reduce to behavior along the geometrically spaced subsequence. Then invoke
Bennett's inequality,
Tn(k+i) > kLnik)] < exp I

2^+1)

f I

Vn(k+\)

I I.
))

The argument of the ^ factor behaves like

LH(M)n(k+l)
For fixed y, the f factor can be brought as close to 1 as we please, by choosing
small enough. The other term in the exponent behaves like (A. 2 /p)loglogp*. With
appropriate choices for p and e, we therefore have the tail probability decreasing
faster than k~a, for some a > 1, which leads to the desired upper LIL.
Lower LIL
For a fixed y < 1 we need to show limsup(7 n /L n ) > y almost surely. As in
Section 1, look along a subsequence n(k) := kk. Write T for Tn(k) - Tn(k-\), and V
for var(r) = Vn{k) V^-i). We can make V/n(k) as close to 1 as we please, by
making k large enough. The summands contributing to T are bounded in abolute
value by 8y/V, where 8 := 2tgnik)/W.
We need to bound F[T > yLn(k)] from
below by a term of a divergent series. Fix a 0 > 1. Write JC for yLn{k)/y/V.
Inequality <16> gives

F[T > yLnik)] = F{T > xW) > exp (-^Jc 2 ) = exp (-

2V

271

11.4 Identically distributed summands


provided JCQ < x < K/8, that is, provided

2n(k)loglogn(k)
Y n(*)(l+o(l))

26 V

"(*)

With small enough, the range eventually contains the desired x value. The rest of
the argument follows as in Section 1.
Cluster points
Assertion (iii) of Theorem <24> will follow from assertion (i), by means of an
ingenious projection argument, borrowed from Finkelstein (1971). Construct new
independent observations {X,} with the same distribution as the {X,}, and let
Sn := X\ + . . . + Xn. Write Wn for the random vector (S n , Sn)/Ln and u9 for
the unit vector (cos 0, sin 0). For each fixed 0, the random variables Wn UQ
X, cos0 4- Xi sin# have mean zero, variance one, and they are identically distributed.
From (i), limsup,,^^ (Wn UQ) 1 almost surely. Given 6 > 0, there exists a finite
collection of halfspaces {(x,y) - uo < 1 + 6 } whose intersection lies inside the ball
of radius 1 + 26 about the origin. It follows that limsup \Wn\ < 1 + 2 6 almost
surely. The geometry of the circle then forces Wn to visit each neighborhood of
the boundary point ue infinitely often, with probability one.
The projection of such a neighborhood onto the horizontal axis
gives a neighborhood of the point cos#, which Sn/Ln must visit
infinitely often. After a casting out of a countable sequence
of negligible sets, for a countable collection of subintervals of
( - 1 , 1 ) , we then deduce assertion (iii).

5,

Problems

[1]

Let X have a Bin(n, p) distribution. Define q := 1 - p. For 0 < x < nq, show that

Hint: Bound the tail probability by exp (-t(np + JC) + n log(q + pe1)) for t e R+,
then minimize the expression in the exponent by Calculus. For 0 < x < nq show
? ) . For the second bound, use
convexity of x//.
[2]

l-x/nqj

Let X\,...,
Xn be independent random variables with PX,- := p\ and 0 < Xt < 1 for
each i. Let p := (p\ H
h pn)/n =: I -q. Show that

F {,, *, > nP + X} < exp (-.+

( ^ ) )

for 0 < , < ,

272

Chapter 11:

Exponential tails and the LIL

Hint: For t M+, bound the tail probability by


exp(-t(np + *))Pexp(f EX,-) = exp (-t(np + x) + ,< l o g(# + A*'))
Use concavity of the logarithm function to increase the bound to a form amenable
to the method of Problem [1].
[3]

Suppose X has a Poisson(A.) distribution.


(i) By direct minimization of exp (-t(X + JC)) Pexp(rX) over R+, prove that

(ii) Derive the same tail bound by a passage to the limit in the binomial bound
from Problem [1].
[4]

For each random variable X with finite variance, and each constant M, show that
var(r(X, M)) < var(X), with r as defined in <26>. Hint: Let X' be an independent
copy of X. Show that 2var(r(X, M)) = P|r(X, M) - r(X', M)\2 and also that
r
|T(JC, M) - T(JC', M)\ <\x- x'\ for all real x and x .

[5]

Let {Xn} be a sequence of independent, identically distributed two-dimensional


random vectors with PXf = 0 and var(X,) = h. Define Sn = Xi -f ... + Xn.
(i) Show that limsup|5 n |/L n < 1 almost surely.
(ii) Show that, with probability one, the sequence Sn/Ln visits every open subset of
the unit ball {|JC| < 1} infinitely often. Hint: Project three-dimensional random
vectors.

6.

Notes
My understanding of the LIL began with the reading of Feller (1968, Section VIII.5),
Lamperti (1966, Section 11) and Stout (1974, Chapter 5). I learned the idea of
regarding the Bennett inequality as a slightly corrupted (by the presence of the
\js function in the exponent) analog the the normal tail bound from conversations
with Galen Shorack and Jon Wellner. Shorack (1980) systematically exploited the
idea to establish very sharp LIL results for the empirical distribution function.
For a beautiful exposition of the many applications of the idea to the study of
inequalities for the empirical distribution function and related processes see Shorack
& Wellner (1986, Chapter 11).
The method used to establish the Bennett inequality, but not the form of
the inequality, comes from Chow & Teicher (1978, page 338). They developed
exponential inequalities suitable for derivation of LIL results. I am uncertain about
the earlier history of the exponential bounds. Apparently (cf. Kolmogorov & Sarmanov 1960), Bernstein's inequality comes from a 1924 paper. The Bennett (1962)
and Hoeffding (1963) papers contain other tail bounds for sums of random variables,
with some references to further literature.
Apparently the first versions of the LIL with a log log bound are due to
Khinchin (1923, 1924). For the early history of the LIL, leading up to the definitive

273

11.6 Notes

version by Kolmogorov (1929), see Feller (1943). Hartman & Wintner (1941)
extended to the case of independent identically distributed summands with finite
second moments, by means of a truncation argument. (Actually, they did not
assume identical distributions, but only a domination condition for the tails, which
holds under the second moment condition for identically distributed summands.)
My exposition in Section 4 draws from DeAcosta (1983), who gave an elegant
alternative derivation (and extension) of the Hartman-Wintner version of the LIL.
I thank Jim Kuelbs for the proof of part (iii) of Theorem <3>.
REFERENCES

Bennett, G. (1962), 'Probability inequalities for the sum of independent random


variables', Journal of the American Statistical Association 57, 33-45.
Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchangeability, Martingales, Springer, New York.
DeAcosta, A. (1983), 'A new proof of the Hartman-Wintner law of the iterated
logarithm', Annals of Probability 11, 270-276.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Feller, W. (1943), 'The general form of the so-called law of the iterated logarithm',
Transactions of the American Mathematical Society 54, 373-402.
Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. 1,
third edn, Wiley, New York.
Finkelstein, H. (1971), 'The law of the iterated logarithm for empirical distributions',
Annals of Mathematical Statistics 42, 607-615.
Hartman, P. & Wintner, A. (1941), 'On the law of the iterated logarithm', American
Journal of Mathematics 63, 169-176.
Hoeffding, W. (1963), 'Probability inequalities for sums of bounded random
variables', Journal of the American Statistical Association 58, 13-30.
Khinchin, A. Ya. (1923), 'Uber dyadische Briiche', Math. Zeit. 18, 109-116.
Khinchin, A. Ya. (1924), 'Uber einen Satz der Wahrscheinlichkeitsrechnung',
Fundamenta Mathematicae 6, 9-20.
Kolmogorov, A. (1929), 'Uber das Gesetz des Iterierten Logarithmus', Mathematische
Annalen 101, 126-135.
Kolmogorov, A. N. & Sarmanov, O. V. (1960), 'The work of S. N. Bernshtein on
the theory of probability', Theory Probability and Its Applications 5, 197-203.
Lamperti, J. (1966), Probability: A Survey of the Mathematical Theory, W. A.
Benjamin, New York.
Shorack, G. R. (1980), 'Some law of the iterated logarithm type results for the
empirical process', Australian Journal of Statistics 22(1), 50-59.
Shorack, G. R. & Wellner, J. A. (1986), Empirical Processes with Applications to
Statistics, Wiley, New York.
Stout, W. F. (1974), Almost Sure Convergence, Academic Press.

Chapter 12

Multivariate normal distributions


SECTION 1 explains why you will not learn from this Chapter everything there is to know
about the multivariate normal distribution.
SECTION 2 introduces Fernique's inequality. As illustration, Sudakov's lower bound for
the expected value of a maximum of correlated normals is derived.
SECTION * J proves Fernique 's inequality.
SECTION 4 introduces the Gaussian isoperimetric inequlity. As an application, BorelVs tail
bound for the distribution of the maximum of correlated normals is derived.
SECTION *5 proves the Gaussian isoperimetric inequlity.

1.

Introduction
Of all the probability distributions on multidimensional Euclidean spaces the
multivariate normal is the most studied and, in many ways, the most tractable.
In years past, the statistical subject known as "Multivariate Analysis" was almost
entirely devoted to the study of the multivariate normal. The literature on Gaussian
processesstochastic processes whose finite dimensional distributions are all
multivariate normalis vast. It is important to know a little about the multivariate
normal.
As you saw in Section 8.6, the multivariate normal is uniquely determined by
its vector of means and its matrix of covariances. In principle, everything that one
might want to know about the distribution can be determined by calculation of means
and covariances, but in practice it is not completely straightforward. In this Chapter
you will see two elegant examples of what can be achieved: Fernique's (1975)
inequality, which deduces important information about the spread in a multivariate
normal distribution from its covariances; and Borell's (1975) Gaussian isoperimetric
inequality, with a proof due to Ehrhard (1983a, 1983b). Both results are proved by
careful Calculus.
The Chapter provides only a very brief glimpse of Multivariate Analysis and
the theory of Gaussian processes, two topics that are covered in great detail in many
specialized texts. I have chosen merely to present examples that give the flavor of
some of the more modern theory. Both the Fernique and Borell inequalities have
found numerous applications in the recent research literature.

275

12.2 Fernique's inequality

2.

Fernique's inequality
The larger <r2, the more spread out is the N(0,cr2) distribution. Fernique (1975,
page 18) proved a striking multivariate generalization of this simple fact.

<i>

Theorem. Suppose X and Y both have centered (zero means) multivariate normal
distributions, with P|X,- - Xj\2 < F\Yt - Yj\2 for all i, j . Then
P/(max/ Xt - min, Xt) < P/(max, Yt - min, Yt)
for each increasing, convex function f on R + .
The theorem lets us deduce inequalties for multivariate normal distributions,
with potentially complicated covariance structures, by making comparisons with
simpler processes.

<2>

Example. (Sudakov's minoration) Let Y := (Y\, Y2,..., Yn) have a centered


multivariate normal distribution, with F\Yj - F*|2 > 82 for all j ^ k. Fernique's
inequality will show that Pmax/<n Y( > C8y/log2n, where C := P|N(0, l)\/V%.
The result is trivial for n = 1. For n > 2, let k be the largest integer for which
n > 2*. Note that 2k > k + 1 > log 2 n > k. Reindex the variables [Yt : 1 < i < 2*}
by the vectors a in the set A = {1, +1}* of all it-tuples of 1 values. Write
Ya instead of Yt. The precise correspondence between A and {1 : 1 < 1 < 2*} is
unimportant.
Build another centered multivariate normal family {Xa : a A},
Xa := \8k-X12 Y^i=x aiwi

where

W i , . . . , Wlk arc independent N(0, l)'s.

For a ^ ,

\Xa - Xt\2 = \82k~l

. ( ( * , - - ft)2 <82<

F\Ya - Yp\2.

From Fernique's inequality with f(t) := t we get


P (maxa Ya nun*, Ya) > P (maxa Xa
Symmetry of the multivariate normal implies that maxa Ya has the same distribution
as maXa(-Ya) = -min Ya, and similarly for the X's. The last inequality implies
Prnax* Ya > Pmax Xa = ^ /

For each realization W := ( W i , . . . , W*), the maximum of J],,-Wi is achieved when


each a, takes the same sign as W,. (Of course, the maximizing a depends on W.)
The lower bound equals

as asserted.
REMARK.
The lower bound is sharp within a constant, in the following sense. If
F\Yj-Yk\2 < 82 for all j ^ k then Pmax,- Yt = PF,+ Pmax^yi-y!) = P m a x ^ y , - ^ )
and, by Jensen's inequality and monotonicity,
exp (Pma Xl (r, - Y})/28)2 < Pmax, exp ((7, - Yx)2/482) < nPexp (N (0, \
Thus Pmax, Yt is bounded above by

28y/\og(\/2n).

276

*3.

Chapter 12:

Multivariate normal distributions

Proof of Fernique's inequality


A straightforward approximation argument (see Appendix C) reduces to the case
where / has a bounded continuous second derivative with bounded support. Also
we may assume that the covariance matrices Vb := var(X) and Vi := var(F) are
both nonsingular n xn matrices: We could prove the result for X + eZ and Y + Z,
where Z is a distributed Af (0, /) independently of X and F, then let tend to zero.
To simplify notation, I will write 9, for d/Sxj and djk for 32/dxj3xk.
For nonsingular Vb and Vi, the covariance matrix V$ := (1 0)Vb + 0V\ is
nonsingular for 0 < 0 < 1. The N(0, V#) distribution has a density ge with respect
to Lebesgue measure 9Jt on Rn. A simple calculation (Problem [1]) based on the
Fourier inversion formula for the N(0, Ve) density gives

^-(x) = I J2j,k A M a JW*>

where

A := Vi - Vo.

The Theorem will be proved if we can show that the function


H(0) := mx (f (max, xt - min, *,-) gG(x))
is increasing in 6. The assumptions on / justify differentation under the 97? to get

H'(0) = mxf (max, x, - min, jt,) geto


= \ J2j,k ^j^x

(f (max/ * ~ ni x0 df.k8e)

by <3>.

Two integrations by parts will replace the QJl-integrals by integrals involving the
nonnegative functions / ' and /", reducing the expression for H'(0) to a sum of
the form J2j<k (^jj + &kk 2A^) (something nonnegative). To establish such a
representation, we need to keep track of contributions from subregions of W defined
by inequalities involving the functions
L(x) := max, xt

and

S(x) := mint JC,

Lj (x):= max, {/ ^ j}x,

and

5} (x) := min, {i ^ j}xt.

Notice that L{x) = L7(JC) v jt, and S(x) = Sj(x) A xj9 for each j .
Let my denote Lebesgue measure on the y'th coordinate space R, and 9JI, denote
(n - l)-dimensional Lebesgue measures on the product of the remaining coordinate
subspaces. That is, m, integrates over Xj and 971, integrates over the remaining n - 1
coordinate variables. The product of m7 0 9)ty equals 971, Lebesgue measure on Rn.
The function Xj \-> fL S) is absolutely continuous with almost sure derivative

fix; - Sj){xj > Lj] - f(Lj-Xj)[xj

< Sj] = /'(L - 5) ({xj = L) - {Xj = 5}). Here,

and subsequently, I ignore the 9Jt-negligible set of x for which there is a tie for
maximum or minimum. The function dkge decreases to zero exponentially fast as
max, \xi | -> oo. An integration by parts with respect to Xj gives

Mxf (L - S) dfkg9 = Wlj (mjfixj v Lj - Xj A Sjty (dkge))


= -Mj (mjfiL - S) ({XJ = L) - {*, =
= -m (f(L - 5) ({*, = L] - {Xj =

277

12.3 Proof of Fernique's inequality

The second integration-by-parts proceeds slightly differently for j = k or j # k.


To simplify notation, I will temporarily assume j = 1 and either k = 1 or k = 2;
and I will replace * 3 , . . . , xn or JC2, * 3 , . . . , xn by a long dash () to keep attention
focussed on the variables actively involved in the calculation.
Mixed partial derivative
With j := 1 and k := 2, consider the final integrand of <5> as a function of JC2Rewrite the difference of indicator functions as {x2 < x\ = L2} - {X2 > x\ = S2}.
Integration by parts, with respect to JC2, gives
m2 (f'(L - S) ({JC, = L) - [xx = S}) d2ge)
= /'(L -

S){JC,

= L2}ge(x)\xxl2_oo

- S){Xl = S 2 }^(*)|~ =JC1

- f\L

- m2 (/"(L - S) ({X2 = L] - {X2 = 5}) ({xi = L} -

{JC,

= 5}) go)

= / U 2 - S2)S*(*i, jcj,)({*, = L2} + {JC! = 52})


+ m2 (/"(L - 5) ({xi = 5,

JC2

= L} +

{JCI

= L,

JC2

= S}) g0)

a.e. [m2].

The OT2 negligible set allows for JC where x\ and JC2 tie for maximum or minimum.
Integrate with respect to 9^2 to get an expression for the mixed partial derivative.

mf(L - S)dl2gd = -m2 (f(L2 - S2){{xl = S2) + {x{ = L2))ge(xuxu))


- m (f"(L := -A{,2

5)({JC,

= 5,

JC2

= L] +

{JC,

= L,

JC2

S])g9(x))

- #1,2.

Notice the curious form of the integral for Ax,2- It runs over n 1 variables (JC2
omitted) ranging over the sets where either JCI is the smallest or the largest of those
variables. The second argument of g#, previously occupied by JC2, is now occupied
by the extremal value JC, . We would get exactly the same integral if we interchanged
the roles of JCI and JC2. Write A/t* for the analogous integral with JC, taking over the
role of JCI and JC* taking over the role of JC2. Nonnegativity of / ' , and the symmetry
of the roles of the two variables, ensure that A,* = A* ,7 > 0 for all j ^ k. Similarly,
write Bjk for the analog of #i, 2 with Xj taking over the role of xx and JC* taking
over the role of JC2. Nonegativity of / " implies that Bjk = Bkj > 0 for all j ^ k.
Repeated partial derivative
The calculations for d\xge are similar. Integrate first with respect to JCI.
m, (f(L

- S) ({xx =L}-

{xx = S}) dvg0)

- mi (f"(L

- S) ({xx =L}-

{xx = S})

- m f ( L - S)ge(x) ({xx =S} + {xx = L})


Then integrate over the remaining n 1 variables, to conclude that
Lx - S{) (ge(Sx,) +
m (f'\L

ge(Lu))

- S)ge(x) ({xx =S} + {xx = L}))

278

Chapter 12:

Multivariate normal distributions

When split according to which of the variables X2>..., xn achieves the extremal
value S\ or L\y the first integral breaks into the sum Ylk>2 ^i.*> an( ^ ^ e s e c o n d i
12k>2 ^M- ^ e o t her repeated derivatives contribute similar expressions.
Collection of terms
Substitute the results from the two integrations by parts into <4>.
H

'W = 5 E , ^'^f(L
= 2EA

~ S) dli

The assumption of the Theorem tells us that

Ajj + Akk - 2Ajk = PF/ - PXy2 + PF,2 - PX2 - 2(FYjJk - FXjXk)


=
\Yj-Yk\2-\Xj-Xk\2>0.
D

4.

Thus H'(0) > 0, and Fernique's inequality follows.

Gaussian isoperimetric inequality


Let yk denote the standard normal, N(0, Ik)9 distribution on R*. Write <t> for the onedimensional N(0, 1) distribution function, and write </>(x) := (27r)"1/2exp(x2/2)
for its density with respect to Lebesgue measure.
For each Borel subset A of Rk and each r > 0 define Ar := {JC : d(x, A) < r},
a set that I will call a neighborhood of A. The isoperimetric problem requires
minimization of ykAr over all A with a fixed value of ykA. Borell (1975) showed
that the minimizing choice is a closed half-space H. The neighborhood Hr is another
half-space. By rotational symmetry of the standard normal, the calculation of y^W
reduces to a one-dimensional problem: if y^H := <J>(a) then ykHr = 3>(r + a ) .
The term isoperimetric comes from an analogy with the classical isoperimetric
inequality for minimization of surface area of a set with fixed Lebesgue measure. If
one avoids the tricky problem of defining the surface area of a general Borel set A
by substituting the Lebesgue measure of the thin shell A r \A, for very small r, the
problem becomes one of minimizing the Lebesgue measure mAr for a fixed value
of xnA. Replace Lebesgue measure by yk and we have the Gaussian analog of the
modified isoperimetric problem.

<7>

Theorem. The Gaussian measure of the neighbourhood Ar is minimized, for


a given value of ykA, by choosing A as a closed half-space. More generally,
YkAr > <J> (r + a), for every Borel subset A ofRk with ykA > <l>(a).
It is the reduction from a fc-dimensional problem, with k arbitrarily large, to
a one-dimensional calculation for the lower bound that makes BorelFs result so
powerful, as shown by the inequalities in the next Example.

279

12.4 Gaussian isoperimetric inequality

<8>

<9>

Example. Recall that a median of a (real valued) random variable X is any


constant m for which F{X >m}> V2 and P{X <m}> V2. Such an m always exists,
but it need not be unique.
Suppose Y\,...,Yn have a multivariate normal distribution. Define S :=
max,<* \Yi\9 and let M be a median of 5. (In fact, the inequalities will force M
to be unique.) Define a2 := max, var(X,). Borell's concentration inequality asserts
that F{S> M + r] < {N(0,a2) > r} and {S < M-r) < F{N(0,a2) < - r } , for
each r > 0. More succinctly,
P{|max \Yt\ - M)\ > r] < {\N(0, o2))\ > r}

for each r > 0.

That is, the spread in the distribution of max |y,| about its median is no worse than
the spread in the distribution of the Yt with largest variance.
In special cases, such as independent variables (Problem [3]), one can get
tighter bounds, but Borell's inequality has two great virtues: it is impervious to
the effects of possible dependence between the Yiy and it does not depend on w.
In consequence, it implies similar concentration inequalities for general Gaussian
processes, with surprising consequences.
Each half of the Borell inequality follows easily from Theorem <7> if we
represent the K, as linear functions Yi(x) := /i, -f 0[x on W equipped with the
measure yn. (That is, regard the vector of Yi's as a linear transformation of a
vector of independent N(0, l)'s.) The assumption about the variances becomes
var(yi) = \9i\2 < a2 for each i. By definition of the median M, the set
A := {JC e Rn : max/<n |F/(JC)| < M]
has ynA > 1/2 = <f>(0). If a point x lies within a distance r of a point *o in A, then
\9[x - 0/jto| < \$i\ \x - xo\ < or, for each i. Thus the neighborhood Ar is contained
within {x : maxf<n |F/(JC)| < M + ar}, and
P{max/<* \Y{\ < M + err] > ynAr > <I>(0 + r),
D
<io>

as asserted by the upper half of the Borell inequality. The derivation for the
companion inequality is similar.
Example. Suppose X\, X2,... is a Gaussian sequence of random variables. The
supremum S := sup, |X,| might take infinite values, but if P{5 < 00} > 0 then
Borell's inequality <9> will show that the distribution of S must have an upper
tail that decreases like that of a normal distribution. More precisely, the constant
a2 := supt var(X|) must be finite, and there must exist some finite constant M such
that
{S > M + or) < O(r) = P{AT(0, 1) > r]

for all r > 0.

Of course there is no comparable result needed for the lower tail, because S is
nonnegative.
The tail bound < n > ensures that S has finite moments of all orders, and in
fact Pexp(aS 2 ) < 00 for all a < (2a 2 )- 1 . See Problem [8].

280

Chapter 12:

Multivariate normal distributions

We can establish < n > by reducing the problem to finite sets of random
variables, to which Borell's inequality applies. Write Sn for maxt<n |X,-| and Mn for
its median. Define a 2 := max,<n var(X,). From <9>,

F{\Sn-Mn\>onr}<4>(r).

*5.

The assumption on S ensures existence of a finite constant C and an > 0 such


P{S < C] > , which implies that P{|X,| <C}> P{Sn < C] > for all n and i < n.
These inequalities place an upper bound of 2C2/ne2 on var(X,), and thereby also
on a 2 , because P{|N(/x, r 2 ) | <C}< 2C/r^/2n. Choose an ro for which <t>(ro) < .
From <12> we have {Sn < Mn crnro} < > which excludes the possibility that
Mn - crnro might be larger than C. The constant M := supn Mn is therefore bounded
above by C + aro. From <12> we then get P{Sn > M + or) < <l>(r) for all n and
all r > 0, an assertion stronger than < n > .

Proof of the isoperimetric inequality


First note that we may assume A is closed, because Ar = Ar and ykA > ykA.
As a convenient abbreviation, I will call a closed set B an improvement over A
if tt# > YkA and )/*r < ykAr. Following Ehrhard (1983a, 1983b), I prove the
Theorem in three steps:
(i) Establish the Theorem for the one-dimensional case, which Problem [6]
shows can be reduced by a simple approximation argument to the case
where A is a finite union of disjoint closed intervals Jt := [jcj~, *+], where
oo < JCJ" < JCJ^ < x^ < . . . < x+ < +oo. The method of proof depends only
on the fact that the logarithm of the standard normal density 0 is a concave
function. It works by showing that we can improve A by successively fusing
each Jt with its neighboring interval on the left or the right, until eventually
we are left with either a single semi-infinite interval or a union of two such
intervals. A further convexity argument disposes of the two-interval case.
(ii) Establish the two-dimension version of the Theorem by an analog of the
classical Steiner symmetrization method (Billingsley 1986, Section 19),
which draws on the one-dimensional result. For a Borel subset A of M2,
Fubini's Theorem asserts that the y-section Ay := {JC R : (JC, y) A] is
Borel measurable, and that y\Ay is a Borel measurable function. Define a
function g(y) to satisfy the equality <P(g(y)) := y\Ay. Then the y-sections
By of the set B := {(x, y) : y < g(x)} have the same y\ measure as the
corresponding Ay.

12.5 Proof of the isoperimetric inequality

281

Intuitively speaking, the set B is obtained from A by sliding and


stretching each of its jt-sections into a semi-infinite interval with the same y\
measure. (The jagged left edge for B in the picture is supposed to suggest
that the set extends off to -oo.) I will call the operation that transforms A
into B a 1-shift.
Problem [5] shows that B is closed if A is closed. Fubini's theorem
ensures that yiB = yiA. The one-dimensional version of the Theorem will
then ensure that B is an improvement over A.
The same 1-shift idea works for every direction, not just for slices
parallel to the jc-axis. I will write Su for the 1-shift operator in the direction
of a unit vector w. (The picture corresponds to the case where u is the
unit vector u := ( - 1 , 0 ) that points back along the x-axis.) By means of a
sequence of such shifts, we can rearrange A into a set arbitrarily close to a
half space, with an improvement at each step. A formal limit argument then
establishes the two-dimensional version of the Theorem.
(iii) Establish the fc-dimensional version of the Theorem by induction on the
dimension. It will turn out that the two dimensional case involves most of
the hard work. For example, two applications of the result for two dimensions
will give the result for three dimensions.
DETAILS OF THE PROOF

(i) One dimension


We have to show that a half-line is an improvement over A := U,-</,-, a finite union
of disjoint closed intervals.
If jcj1" > x^ 2r we may replace J\ and J2 by the single interval [JCJ", x]
without changing y\Ar. Thus, with no loss of generality, we may suppose J\ and
the set J = U/>2/, are a distance at least 2r apart, in which case y\ Ar = y\ J[ + y\Jr.
Define 28 := y\J\ = <&(xf) - 4>(jcf) and 2t\ := <J>(jtf) + <E>(jcf), so that
J\ has endpoints &~l(t\ 8). Consider the effect of replacing J\ by another
interval /, := [JC~, JC+]. If we take JC" := <&~l(t - 8) and JC+ := O" 1 ^ + 8), with
8 < t < t* = O(JC^) - 6, then y\lt = y\J\. When t = 8, the interval lt is semiinfinite, [00, JC+]; when t = t*, the intervals It and J2 touch at JC+ = JC^~. The sets
Bt := It U / and A have the same y\ measure.

t+5 1

t-S

282

Chapter 12:

Multivariate normal distributions

When t < O(JC^ - 2 r ) -8 the neighborhoods Vt and Jr are disjoint, and y\Brt =
yiAr + y i / r . For larger f, the two neighborhoods overlap, and y\Brt < y\Vt +y\Jr.
If we choose t with y\l\ < y\J\ then Bt is an improvement over A. The following
Lemma shows that we get the most improvement by pushing t to one of the extreme
positions, because a concave function on [8, t*] achieves its minimum at one of the
endpoints.
<13>

L e m m a . For each fixed r > 0 the function G{t) := y\Vt is a concave function
oft on [8,1 -<$].
Proof. It is enough if we show that the derivative Gr is a decreasing function on
(6, 1 - 8). By direct differentiation of <I>(x+ 4- r) - <&(x~ - r) as a function of t we
have
G'(t) = ^ ( *

^ r ) - ^X

~r\

Concavity of log0 implies that <t>'(x)/<p(x) is a decreasing function of JC. Thus


h(x) := <f>{x -f r)/0(jc) is a decreasing function of JC, because

REMARK.
Actually, \ogh(x) = jcr r 2 / 2 , which is clearly decreasing. I wrote
the argument using concavity because I suspect there might be a more general version
of the isoperimetric inequality provable by similar methods (cf. Bobkov 1996).

Both JC+ and x~ are increasing functions of t. Thus G' equals a decreasing
function of / minus the reciprocal of another decreasing function of t, which makes
G' decreasing, and G concave.
With the appropriate t, the improvement Bt is also a finite union of disjoint
intervals, UJL, J/, with either J[ := 0 or /( := [-oo, x+].
Now repeat the argument, replacing J^ by an interval that abuts either J[ or
Jy And so on. After at most n such operations, A is replaced by either a single
semi-infinite interval (the desired minimizing setit doesn't matter whether the
interval extends off to oo or to +oo) or a union D := [oo, z~] U [z + , oo], for
which yD = yA and yDr < yAr.
For the second possibility we may assume that z+ z~ > 2r to avoid the
trivial case where Dr = E. The complement of D is an interval of the form
(<P~l(t - 8), <P~l(t + 8)) for some t, where 28 = yi(z",z + ). Calculations almost
identical to those for the proof of Lemma <13>, with r replaced by r, show that
H(t) := <D (&~l (t + 8) - r ) - <D U>~1 (t - 8) + A
is a convex function of t. It achieves its maximum at one of the extreme positions,
which corresponds to the transformation of D into a single semi-infinite interval.
The proof for the one-dimensional case of Theorem <7> is complete.
(ii) Two-dimensions
The 1-shifts along each section parallel to the x-axis transform the closed set A into
another closed set B with yiB = yiA <!>().

12.5 Proof of the isoperimetric inequality

<14>

Lemma.

283

The set B is an improvement over A.

Proof. We need to show that yiBr < yikT. By Fubini, it is enough if we can show
that Yi(Br)y < yi(Ar)y, for all y.

The section (Br)yi through Br at a fixed y\ is a closed set. Consider the


boundary point (x\,y\) in (Br)yi, with JCI as large as possible. By definition, there
exists a point (*o, yo) in B that lies within a distance r of (x\, y\): if |JCI - xo\ :=
and \y\ yo\ := 8 then e2 + 82 < r 2 . (Both e and 5 depend on y\9 of course.)
For each t > 0 the point (JCI - f, ji) lies within a distance r of (XQ - f, >^o), which
also belongs to B. Thus Br is of the form {(JC, y) e R2 : x < ^ r (y)}, for some
function g r ().
Write ^ [ c ] and A^|>] for the one-dimensional ^-neighborhoods of the yosections B^ and AM. Notice that (Ar)J1 contains the set G = A^|>] <g> {3^1}, because
all points of G lie within a distance (e2 4- 52)1/2 < r of A. Thus

because (A r )^ 2 G
+ O from part (i), because y\ An = <I>(^(jo))
> (gr(yi))
because gr(y\) = xi < x0 + < g(;yo) +
r
= yi (# )yi
definition of ^ r .
It follows that B is an improvement over A.
To make precise the idea that a sequence of shifts can make A look more like a
halfspace, we need some way of measuring how close a set is to being a half-space.
The picture suggests a method. The idea is that there
should be a cone C of directions that we can follow from
each point of A without leaving A. (The jagged edges
on A and the cone C are meant to suggest that both sets
should be extended off to infinitythere is not enough
room on the page to display the whole sets.) If the vertex
angle 0 of the cone were a full 180, the set A would be a
half-space (or the whole of R2). More formally, let us say
that a set A has a 0 spread if there exists a cone C with vertex angle 0 such that
{JC + y : x e A, y C] c A. Call C the spreading cone.
For example, the set B produced by Lemma <14> has spread of at least zero,
with spreading cone C := {(JC, 0) : x < 0}.

Chapter 12:

284

Multivariate normal distributions

Lemma. If a closed set A has 0 spread, for some 0 less than n, there exists a
shift Su such that SUA has (n + 0)/2 spread.
Proof. To make the picture easier to draw I will assume that the axis of symmetry
of the cone C points along the y-axis. The required shift then has u pointing along
the jc-axis.
0/2 . 0/2

(g(y),y)
Consider the sections through SUA at heights y and yf = y + 6, with 8 > 0.
Both sections are intervals, (oo, g(y)] and ( o o , g ( / ) ] , where y\Ay := <P(g(y))
and yiAy := <&(g(y')). Define 6 := <5tan(0/2). If JC Ay then (x + t,y +
8)eA
for all f with \t\ < , because A has 0 spread. That is, the section Ay contains a
one-dimension neighborhood F := Ay[e]. The cross section at y' has y\ measure
greater than y\Ay\<e\, which by part (i) is greater than <&(y\Ay + ). That is,

(g(y')) Y\Ay' > Y\Ay[] > <&(g(y) + ),

whence g(y') > g(y) + e. It follows that Su A has spread at least {n + 0)/2.
Repeated application of Lemma <15>, starting from A produces improvements
#i, #2> #3> ^4, . with spreads at least 0, 7r/2, 37r/4, 77r/8, and so on. Each Z?,- has
y2 measure equal to O(a) = y2A, and y2Ar > y2B\ > y2Br2 >
We may even
rotate each Bn so that its spreading cone has axis parallel to the JC axis.
The fact that Bn has spread n 2en, with en 0, and the fact that y2Bn O(a)
together force Brn to lie close to the half space H := {(JC, y) e R2 : x < a + r]
eventually. More precisely, it forces liminf Brn > H, in the sense that each point of
H must belong to Brn for all n large enough.
For if a point (JCO + r, yo) of H were not in Brn then the point (JCO, yo) could not
lie in Bn, which would ensure that no points of Bn lie outside the set
On := {(x,y) :x < \y - angle en

a+r

yo\tenen}.

However, the set Dn converges to a halfspace with yi measure


equal to 4>(JCO), which is strictly smaller than <J>(a) = y2Bn.
Eventually Bn must therefore contain points outside Dn,
in which case (JCO, yo) Bn and (JCO + r, yo) Brn. Fatou's
Lemma then completes the proof of the two-dimensional
isoperimetric assertion:
y2Ar > liminfy2Brn > ^liminf Brn >

12.5 Proof of the isoperimetric inequality

285

(iii) More than two dimensions


A formal proof uses induction on the dimension, invoking the result from part (ii)
to reduce for R* to the result for R*"1. Then another application of part (ii) reduces
to R*~2. And so on.
For simplicity of notation, I will explain only the reduction from R 3 to R2.
Consider a closed subset A of R 3 with yiA := a. Write Ay for its y-section, so that
A = U ^ <g> {y}. Define a function g by the equality <P(g(y)) := YiAy. The closed
set B := {(JC, y% z) : x < g(y)} has yiBy = yiAy for every y, and hence (Fubini)
yiB = y$A. Call B a Irshift of A.
The set B has all its z-sections equal to C = {(*, y) R 2 : * < g (>>)}.
That is, B = C (8) [z e R}. The closed set # r has all its jc-sections equal to the
two-dimensional neighborhood C[r].
The proof that B improves upon A is almost identical to the proof of
Lemma <14>. Indeed the picture from the proof can be reinterpreted as a zsection of the (three-dimensional) picture for the present proof. Only small changes
in wording are needed. In fact, here is a repeat of the argument, with changes
indicated in boldface:
We need to show that ~f$Br <^Ar.
By Fubini, it is enough if we can
show that *Y2(Br)y < 7 2 ( 4 % far all y.
The section (Br)y] through Br at a fixed y\ is a closed set. Consider
the boundary point (x\, y\,Z\) in (Br)yi with Z\ as large as possible. By
definition, there exists a point (jto, yo, Zo), with Zo = Zi, in B that lies
within a distance r of (x\, y\, Z\): if \x\ jtol = and \y\ yol = $ then
e 2 + 2 < r 2 . (Both and 8 depend on Z\, of course.) For each t > 0 the point
(x\ f, y\, Zo) lies within a distance r of (*o t, yo, Zo), which also belongs
to B. Thus Br is of the form {(JC, y, z) e R 2 : x < gr(y), - o o < z < oo}, for
some function ^ r ( ) Write Byqle] and Ayo[e] for the two-dimensional -neighborhoods of
the yo-sections Byo and Ayo. ...
And so on.
The set B := C (8) {z R} is not a halfspace, but it can be transformed into
one by means of a second 2-shift, with sections taken orthogonal to the y axis. The
isoperimetric theorem for R 3 then follows.
The argument for higher dimensions is similar.

6.

Problems

[1]

Let Vb and Vi be positive definition matrices. For 0 < 0 < 1 let gB denote the
, (1 - 0)V0 + 0V\) density on R n . Show that

by following these steps.

286

Chapter 12:

Multivariate normal distributions

(i) Use Fourier inversion to show that ge{x) = (2n)~n f exp (-ix't - \t'Vet) dt.
(ii) Justify differentiation under the integral sign to show that

dgeix)
30

= (2nyn f ^ exp (-ix't - \t\OA + V0)t) dt


J oO
= (2nyn f -\t'At exp (-ix't - \t\OA + V0)t) dt

and

= (2n)~n I

J dXjdx

= (2ny j(-itj){-itk)txp(-ix't

- \t\0A + V0)f) dt.

(iii) Collect terms.


[2]

Suppose Xn has a multivariate normal distribution, and Xn ~* X. Show that X also


has a multivariate normal distribution. Hint: Reduce to the one-dimensional case
by working with linear combinations. Also show that if N(/i n ,a 2 ) converges then
both {/xn} and [on) must be bounded. Argue along subsequences.

[3]

Let Zi, Z 2 , . . . be independent random variables, each distributed N(0,1). Show


that the distribution of Mn := max,-<n Z, concentrates largely in a range of order
(logn)~ 1/2 by following these steps. Define an := (21ogn) 1/2 and Ln \ loglogn.
(i) Remember (Appendix D) that the tail function O(JC) := P{N(0,1) > A
decreases like 4>(x)/x as x -> 00, in the sense that the ratio of the two functions
tends to 1. For a fixed constant C\ define xn := an (C + Ln)/an. Show that
log (n$>(xn)) converges to Co := C \ log(47r) as n -> cx>.
(ii) Deduce that logP{Mn < xn) = n log (l - <t>(jcrt)) - e~CQ.
(iii) Deduce that an(Mn an) -f Ln converges in distribution as n -> 00.

[4]

Let X and Y be random variables for which both {X > x] < F{Z > x] and
F{X < -x] < F{Z < -x), for all x > 0. Show that Pexp(rX) < Pexp(/K) for all
nonnegative t. Hint: Consider F footexp(tx)[X > x]dx.

[5]

Let A be a closed subset of R2, with sections Ay := {JC e R : (JC, y) e A] having


Gaussian measure g(y) := y\Ay. Let B denote the 1 -shift, as in Lemma <14>.
(i) If yn -> yy show that limsupA^ < Ay, in the sense of pointwise convergence
of indicator functions. (If JC Ayn infinitely often, deduce that (JC, y) e A.)
(ii) If yn -* yy use Fatou's lemma to prove that lim sup(>>,,) < g(y).
(iii) If (xn, yn) e B and (*, yn) -+ (JC, >^), show that JC < g(y), that is, (JC, y) e B.

[6]

Reduce the one-dimensional version of Theorem <7> to the case where A is a finite
union of intervals, by following these steps. Let 4>(a) := y\A.
(i) Show that it is enough prove that yAr+s > O(r + a) for each 8 > 0, because
Ar+8 I Ar as 8 decreases to zero.

287

12.6 Problems

(ii) Define an open set G := {x : d(x, A) < 8}. Show that yG > a. Hint: The set
G\A is open.
(iii) Show that there exists a countable family {/,} of disjoint closed intervals for
which yi(G\U l -7 l -) = 0.
(iv) Choose N so that the closed set AN := U,<#// has y\ measure greater than <J>(a).
Show that the assertion of the Theorem for AN implies that yAr+s > <t>(r + a).
[7]

(Slepian's (1962) inequality) Let X = (Xu . . . , Xn) and Y = (K,,..., Yn) both have
centered multivariate normal distributions with var(Xy) = var(ly) for each j and
FXjXk < FYjYk for all j / k. Prove that PU, [Xj > ctj] > PUy [Yj > a,} for all real
numbers a\,..., an. Hint: Use equality <3> to show that \ix J~[/<fI{jc, < oti)g$(x) is
a decreasing function of 0.

[8]

Let 5 be a nonnegative random variable with P{5 > M + r] < Coexp (~\r2/a2)
all r > 0, for positive constants Co, M, and a1.

for

(i) Show that Pexp(aS 2 ) = 1 + /02;yaexp(cry2)P{S > y)dy.


(ii) For a < l/(2<r2), prove that Pexp(a5 2 ) < oo.

7.

Notes
There is a huge amount of literature on Gaussian process theory, with which I am
only partially acquainted. I took the proof of the Sudakov minoration (Example <2>)
from Fernique (1975, page 27). According to Dudley (1999, notes to Section 2.3),
a stronger, but incorrect, result was first stated by Sudakov (1971). The result in
Example <io> is due to Marcus & Shepp (1972), who sharpened earlier results of
Landau & Shepp (1970) and Fernique.
See Dudley (1973) for a detailed discussion of sample path properties of
Gaussian process, particularly regarding entropy characterizations of boundedness or
continuity of paths. Dudley (1999, Chapter 2) contains much interesting information
about Gaussian processes and their recent history. I also found the books by Jain &
Marcus (1978), Adler (1990, Section II), and Lifshits (1995) useful references.
The Lifshits book contains an exposition of Ehrhard's method, similar to the
one in Section 5, and proofs (Section 14) of the Fernique and Slepian inequalities.
See also the notes in that section of his book for a discussion of related work of
Schlafli and the contributions of Sudakov.
BorelFs isoperimetric inequality was, apparently, also proved by similar methods
by TsireFson and Sudakov in a 1974 paper which I have not seen. TsireFson (1975)
mentioned the result and the method. The notes of Ledoux (1996, Section 4) discuss
the isoperimetric inequality in great detail.
REFERENCES

Adler, R. J. (1990), An introduction to continuity, Extrema, and Related Topics for


General Gaussian Processes^ Vol. 12 of Lecture Notes-Monograph series, Institute
of Mathematical Statistics, Hayward, CA.

288

Chapter 12:

Multivariate normal distributions

Billingsley, P. (1986), Probability and Measure, second edn, Wiley, New York.
Bobkov, S. (1996), 'Extremal properties of half-spaces for log-concave distributions',
Annals of Probability 24, 35-48.
Borell, C. (1975), T h e Brunn-Minkowski inequality in Gauss space', Inventiones
Math. 30, 207-216.
Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of
Probability 1, 66-103.
Dudley, R. M. (1999), Uniform Central Limit Theorems, Cambridge University Press.
Ehrhard, A. (1983a), 'Un principe de symetrisation dans les espaces de Gauss',
Springer Lecture Notes in Mathematics 990, 92-101.
Ehrhard, A. (1983b), 'Symetrisation dans l'espace de Gauss', Mathematica Scandinavica 53, 281-301.
Fernique, X. (1975), 'Regularity des trajectoires des fonctions aleatoires gaussiennes',
Springer Lecture Notes in Mathematics 480, 1-97.
Jain, N. C. & Marcus, M. B. (1978), Advances in probability, in J. Kuelbs, ed.,
'Probability in Banach Spaces', Vol. 4, Dekker, New York, pp. 81-196.
Landau, H. J. & Shepp, L. A. (1970), 'On the supremum of a Gaussian process',
Sankhyd: The Indian Journal of Statistics, Series A 32, 369-378.
Ledoux, M. (1996), 'Isoperimetry and Gaussian analysis', Springer Lecture Notes in
Mathematics 1648, 165-294.
Lifshits, M. A. (1995), Gaussian Random Functions, Kluwer.
Marcus, M. B. & Shepp, L. A. (1972), 'Sample behavior of Gaussian processes',
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and
Probability 2, 423-441.
Slepian, D. (1962), 'The one-sided barrier problem for Gaussian noise', Bell System
Technical Journal 41, 463-501.
Sudakov, V. N. (1971), 'Gaussian random processes and measures of solid angles in
Hilbert space', Soviet Math. Doklady 12, 412-415.
Tsirel'son, V. S. (1975), 'The density of the distribution of the maximum of a
Gaussian process', Theory Probability and Its Applications 20, 847-856.

Appendix A

Measures and integrals


SECTION I introduces a method for constructing a measure by inner approximation,
starting from a set function defined on a lattice of sets.
SECTION 2 defines a "tightness" property, which ensures that a set function has an extension
to a finitely additive measure on a field determined by the class of approximating sets.
SECTION 3 defines a tlsigma-smoothness" property, which ensures that a tight set function
has an extension to a countably additive measure on a sigma-field.
SECTION 4 shows how to extend a tight, sigma-smooth set function from a lattice to its
closure under countable intersections.
SECTION 5 constructs Lebesgue measure on Euclidean space.
SECTION 6 proves a general form of the Riesz representation theorem, which expresses
linear functionals on cones of functions as integrals with respect to countably additive
measures.

1.

M e a s u r e s a n d inner m e a s u r e

Recall the definition of a countably additive measure on sigma-field. A sigma-field


yiona set X is a class of subsets of X with the following properties.
(SFj)
The empty set 0 and the whole space X both belong to A.
(SF2)
(SF3)

If A belongs to A then so does its complement Ac.


For countable [At : 1 N} c A, both U,-A,- and O, A, are also in A.

A function /x defined on the sigma-field A is called a countably additive (nonnegative)


measure if it has the following properties.
(Mj)
fx0 = 0 < \iA < 00 for each A in A.
(M2)

\i (Uji4,-) = Yli M^i for sequences {At; : 1 N} of pairwise disjoint sets from A.

If property SF3 is weakened to require stability only under finite unions and
intersections, the class is called a field. If property M2 is weakened to hold only
for disjoint unions of finitely many sets from A, the set function is called a finitely
additive measure.
Where do measures come from? Typically one starts from a nonnegative
real-valued set-function /x defined on a small class of sets DCo, then extends to a
sigma-field A containing fto- One must at least assume "measure-like" properties
for \i on 3Co if such an extension is to be possible. At a bare minimum,

290

(Mo)

Appendix A:

Measures and integrals

M is an increasing map from %o into R+ for which /x0 = 0.


Note that we need %o to contain 0 for Mo to make sense. I will assume that Mo
holds thoughout this Appendix. As a convenient reminder, I will also reserve the
name set function on %o for those /x that satisfy Mo.
The extension can proceed by various approximation arguments. In the
first three Sections of this Appendix, I will describe only the method based on
approximation of sets from inside. Although not entirely traditional, the method has
the advantage that it leads to measures with a useful approximation property called
Xo-regularity:
fiA = supiiiK : A 2 K XQ]

for

each A in A.

REMARK.
When X consists of compact sets, a measure with the inner regularity
property is often called a Radon measure.
The desired regularity property makes it clear how the extension of /x must be
constructed, namely, by means of the inner measure /z*, defined for every subset A
of X by ^A := supluK : A 3 K e Xo}.
In the business of building measures it pays to start small, imposing as few
conditions on the initial domain Xo as possible. The conditions are neatly expressed
by means of some picturesque terminology. Think of X as a large expanse of muddy
lawn, and think of subsets of X as paving stones to lay on the ground, with overlaps
permitted. Then a collection of subsets of X would be a paving for X. The analogy
might seem far-fetched, but it gives a concise way to describe properties of various
classes of subsets. For example, a field is nothing but a (0, U/, n / , c ) paving,
meaning that it contains the empty set and is stable under the formation of finite
unions (U/), finite intersections ( n / ) , and complements ( c ). A (0, Uc, He, c ) paving
is just another name for a sigma-fieldthe Uc and Pic denote countable unions and
intersections. With inner approximations the natural assumption is that %o be at
least a (0, U/, Of) pavinga lattice of subsets.
REMARK.
Note well. A lattice is not assumed to be stable under differences
or the taking of complements. Keep in mind the prime example, where %o denotes
the class of compact subsets of a (Hausdorff) topological space, such as the real
line. Inner approximation by compact sets has turned out to be a good thing for
probability theory.

For a general lattice 3Co, the role of the closed sets (remember f for ferm6)
is played by the class J(3Co) of all subsets F for which FK Xo for every K in
Xo. (Of course, Xo c 3"(3C0)- The inclusion is proper if X Xo) The sigma-field
^>{Xo) generated by 7(%o) will play the role of the Borel sigma-field.
The first difficulty along the path leading to countably additive measures lies
in the choice of the sigma-field A, in order that the restriction of /x* to A has the
desired countable additivity properties. The CarathSodory splitting method identifies
a suitable class of sets by means of an apparently weak substitute for the finite
additivity property. Define So as the class of all subsets 5 of X for which
ji*i4 = fi*(AS) +

/JL*(ASC)

for all subsets A of X.

A.I Measures and inner measure

291

If A e So then /z* adds the measures of the disjoint sets AS and ASC correctly. As
far as /z* is concerned, S splits the set A "properly."
<2>

s
c

Lemma. The class So of all subsets S with the property < i > is a field. The
restriction of /x* to So is a finitely additive measure.
Proof. Trivially So contains the empty set (because /z*0 = 0) and it is stable under
the formation of complements. To establish the field property it suffices to show
that So is stable under finite intersections.
Suppose S and T belong to So. Let A be an arbitrary subset
JC
T
of X. Split A into two pieces using S, then split each of those two
pieces using T. From the defining property of So,

li+A = /z* (AS) + /z* (ASC)


= it* {AST) + /z* (ASTC) + /z* (ASCT) + /z, (ASCTC) .

Decompose A(ST)C similarly to see that the last three terms sum to
li*A(ST)c. The intersection ST splits A correctly; the class So contains ST; the
class is a field. If ST = 0, choose A := S U T to show that the restriction of /z*
to So is finitely additive.
At the moment there is no guarantee that So includes all the members of XQ,
let alone all the members of $(9Co). In fact, the Lemma has nothing to do with the
choice of /z and XQ beyond the fact that /z*(0) = 0. To ensure that So 2 Xo we
must assume that /z has a property called %o-tightness, an analog of finite additivity
that compensates for the fact that the difference of two Xo sets need not belong
to Xo. Section 2 explains Xo-tightness. Section 3 adds the assumptions needed to
make the restriction of /z* to So a countable additivity measure.

2.

Tightness
If So is to contain every member of Xo, every set K e Xo must split every
set K\ 3Co properly, in the sense of Definition < i > ,
Writing Ko for ATi AT, we then have the following property as a necessary condition
for 3C0 ^ So. It will turn out that the property is also sufficient.

<3>

Definition. Say that a set function /z on Xo is Xo-tight if^iK\ = /z Ao+/z*(Af i \ # o )


for all pairs of sets in Xo with K\ 2 Ao.
The intuition is that there exists a set K 3Co that almost fills out K\\Ko, in the
sense that \iK ixK\ - /z#o- More formally, for each e > 0 there exists a K( e Xo
with K( c K\\Ko and fiK > ILK\ fiKo . As a convenient abbreviation, I will
say that such a K fills out the difference Ki\Ko within an .
Tightness is as close as we come to having Xo stable under proper differences.
It implies a weak additivity property: if K and H are disjoint members of Xo then
= iiH -f/z^f, because the supremum in the definition of ii*((HUK)\K)
is

292

Appendix A:

Measures and integrals

achieved by H. Additivity for disjoint !Ko-sets implies superadditivity for the inner
measure,
<4>

/x*(A U B ) > />i*A + /x*

for all disjoint A and #,

because the union of each inner approximating H for A and each inner approximating
K for B is an inner approximating set for A U B. Tightness also gives us a way to
relate S o to DC0.
<5>

Lemma.

Let Xo be a (0, U/, Of) paving, and fi be Xo-tight set function. Then

(i) S So if and only if [iK < /** (KS) + /x* (K\S) for all K in Xo;
(ii) the field So contains the field generated by 7(Xo)-

3.

Proof. Take a supremum in (i) over all K c A to get /x*A < /x* (AS) + /x* (A\S).
The superadditivity property <4> gives the reverse inequality.
If S e 7(0Co) and K e Xo, the pair K{ := K and Ko := KS are candidates for
the tightness equality, [iK = /JL (KS) + n* (K\S), implying the inequality in (i).

Countable additivity
Countable additivity ensures that measures are well behaved under countable limit
operations. To fit with the lattice properties of 9Co> it is most convenient to insert
countable additivity into the construction of measures via a limit requirement that
has been called a-smoothness in the literature. I will stick with that term, rather
than invent a more descriptive term (such as a -continuity from above), even though
I feel that it conveys not quite the right image for a set function.

<6>

Definition, Say that /x is cr-smooth (along XQ) at a set K in Xo if \iKn \, \xK for
every decreasing sequence of sets {Kn} in Xo with intersection K.
REMARK.
It is important that [i takes only (finite) real values for sets in Xo.
If A. is a countably additive measure on a sigma-field A, and An I Aoo with all A,
in A, then we need not have XAn | XA^ unless XAn < oo for some n, as shown by
the example of Lebesgue measure with An = [n, oo) and A^ = 0.
Notice that the definition concerns only those decreasing sequences in XQ for
which HnGN Kn Xo. At the moment, there is no presumption that Xo be stable
under countable intersections. As usual, the a is to remind us of the restriction to
countable families. There is a related property called r-smoothness, which relaxes
the assumption that there are only countably many Kn setssee Problem [1].
Tightness simplifies the task of checking for a-smoothness. The next proof is
a good illustration of how one makes use of 3Co-tightness and the fact that ii* has
already been proven finitely additive on the field So-

<7>

Lemma. If a Xo-tight set function on a (0, U/, n / ) paving Xo is a-smooth at 0


then it is cr-smooth at every set in XQ.

A. 3

Countable additivity

293

Proof. Suppose Kn i

^ with all Kt in Ko. Find an H Xo that fills out the


difference K^K^ within e. Write L for H U A^. Finite
additivity of /i* on So lets us break ixKn into the sum
The middle term decreases to zero as n -> oo because
KnH i K^H = 0. The last term is less than

which is less than c, by construction.


If 3Co is a stable under countable intersections, the a-smoothness property
translates easily into countable additivity for /JL* as a set function on So.

<8>

Theorem. Let Xo be a lattice of subsets of X that is stable under countable


intersections, that is, a (0, U/, He) paving. Let [i be a Xo-tight set function on a
Xo, with associated inner measure IM*A := sup{[iK : A 2 K e Xo). Suppose n is
a-smooth at 0 (along Xo). Then
(i) the class
So := {S c X : [iK < ii*(KS) + fx*(K\S) for all K in Xo)
is a sigma-field on X;
(ii) So 2 (3Co), the sigma-field generated by J(Xo);
(Hi) the restriction of it* to So is a Xo-regular, countably additive measure;
(iv) So is complete: if S\ 2 B D So with St So and ii*(S\\So) = 0 then B e SoProof. From Lemma <5>, we know that So is a field that contains J(3Co). To
prove (i) and (ii), it suffices to show that the union S := U,-6N7} of a sequence of sets
in So also belongs to So, by establishing the inequality [iK < it* (KS) + ii* (K\S),
for each choice of K in Xo.
Write Sn for U/<n7}. For a fixed > 0 and each i, choose a 3Co-subset
AT, of K\Si for which ^ATi > /x* (^\5/) - c/2 1 . Define
Ln := n,<nA'I. Then, by the finite additivity of /x* on So,
H*(K\Sn) - ixLn < Zi<n (l**(K\Si) - nKi) < .

which gives /x*

The
Loo
[iLn
< iiLn

sequence of sets {Ln} decreases to a ^Co-subset


of K\S. By the a-smoothness at Loo we have
< jitLoo + < M* (A'X^) + , for n large enough,
-f < /x* (A^\5) -f 2e, whence
-f /x* (K\Sn)

because Sn 6 S o

It follows that 5 So.


When K S, the inequality fiK < n* (KSn) + ii* (K\S) + 2^ and the finite
additivity of /z* on So imply [iK < ^2i<n/A*(KTi) + 2e. Take the supremum
over all 9Co-subsets of 5, let n tend to infinity, then e tend to zero, to deduce

294

Appendix A:

Measures and integrals

that IJL+S < !, N /i*7}. The reverse inequality follows from the superadditivity
property <4>. The set function /z* is countably additive on the the sigma-field So.
For (iv), note that \iK = \L+ (KSo) + /x* (K\SO), which is smaller than
/x, (KB) + /x, (K\SO + ^(KS]SCO)

< ix* (KB) + /x* (K\B) + 0,

for every K in XoIn one particularly important case we get a-smoothness for free, without any
extra assumptions on the set function /x. A paving Xo is said to be compact (in the
sense of Marczewski 1953) if: to each countable collection {Ki : i N) of sets from
Xo with CiienKi = 0 there is some finite n for which nf<nAT, = 0. In particular,
if Ki I 0 then Kn = 0 for some n. For such a paving, the a -smoothness property
places no constraint on [i beyond the standard assumption that /x0 = 0.

<9>

E x a m p l e . Let Xo be a collection of closed, compact subsets of a topological


space X. Suppose {Ka : a e A] is a subcollection of Xo for which HaAKa = 0.
Arbitrarily choose an o from A. The collection of open sets Ga := K* for a e A
covers the compact set Kao. By the definition of compactness, there exists a finite
subcover. That is, for some c*i,..., am we have Kao c U^jGa. = (rr=lKai)c. Thus
n=oKai = 0. In particular, Xo is also compact in the Marczewski sense.
REMARK.
Notice that the Marczewski concept involves only countable subcollections of Xo, whereas the topological analog from Example <9> applies to
arbitrary subcollections. The stronger property turns out to be useful for proving
r-smoothness, a property stronger than a-smoothness. See Problem [11 for the
definition of r-smoothness.

4.

Extension to the nc-closure


If Xo is not stable under countable intersections, a-smoothness is not quite enough
to make /x* countably additive on So- We must instead work with a slightly richer
approximating class, derived from Xo by taking its He-closure: the class X of
all intersections of countable subcollections from Xo. Clearly X is stable under
countable intersections. Also stability under finite unions is preserved, because
(DietiHi) U (Dj^Kj) = n M N x N (Hi U Kj) ,
a countable intersection of sets from Xo. Note also that if Xo is a compact paving
then so is X.
The next Lemma shows that the natural extension of /z to a set function on X
inherits the desirable cr-smoothness and tightness properties.

<io>

<n>

Lemma. Let \i be a Xo-tight set function on a (0, U/, Of) paving Xo, which is
a-smooth along Xo at 0. Then the extension /x of /x to the He-closure X9 defined
by
{iH := infifiK : H c K e Xo]
is X-tight and a-smooth (along X) at 0.

for H e X,

A.4

Extension to the He-closure

295

Proof. Lemma <7> gives us a simpler expression for /I. If {Kn : n e N} c Xo and
Kn i L e X, then jlL = infn /ztf,, because, for each 9Co-subset K with K 2 L,
(^* U AT)

by a-smoothness of /x at K.

The a-smoothness at 0 is easy. If Hn := f]jN ^J ^ ^ d Hn i 0 then the


sets := fl/<*,,< *<; belong to 0Co, and /Jn = HXH2... // c ^ n | 0. It follows
that iiHn < fiKn I 0.
The 3C-tightness is slightly trickier. Suppose H\ 2 Ho, with both sets in X. Let
{Kn} be a decreasing sequence of sets in Xo with intersection # i . For a fixed > 0,
choose a AT in 3Co with K o> Ho and iiK < fiHo + . With no loss of generality we
may assume that /if c K\. Invoke 3Co-tightness to find a DCo-subset L of K\\K for
which /xL > /JLK\ - fiK - > fiK\ - jlHo - 2e. Notice that the sequence {LKn}
decreases to LH\y a DC-subset of H\\K c H\\Ho. The finite
additivity of ^*, when restricted to So, gives
li(LKn) = / i L + / x ^ - / x ( L U Kn)
\xH\

as n

oo.

jiHo - 2e, as required for DC-tightness.


REMARK.
It is helpful, but not essential, to have a different symbol for the
extension of \i to a larger domain while we are establishing properties for that
extension. For example, it reminds us not to assume that has the same properties
as /x before we have proved as much. Once the result is proven, the \x has served
its purpose, and it can then safely be replaced by \i.
A similar argument might be made about the distinction between JCo and DC,
but there is some virtue in retaining the subscript as a reminder than DC0 is assumed
stable only under finite intersections.

Together, Theorem <8> and Lemma <io> give a highly useful extension
theorem for set functions defined initially on a lattice of subsets.
<12>

Theorem. Let Xo be a (0, U/, n / ) paving of subsets of X, and let X denote


its He-closure. Let /JL : Xo -> R+ be a Xo-tight set function that is sigma-smooth
along Xo at 0. Then /x has a unique extension to a complete, X-regular, countably
additive measure on a sigma-field , defined by
IiK := MfaKo : K c Ko e DC0}
liS := sup{/x/i: : S 2 K e X)

for K e X,
for 5 e S.

The sigma-field S contains all sets F for which FK e X for all K in X. In


particular, S 2 9C 2 DC0.
REMARK.

Remember: <r-smoothness is automatic if DCo is a compact paving.

5. Lebesgue measure
There are several ways in which to construct Lebesgue measure on Rk. The
following method for R2 is easily extended to other dimensions.

296

Appendix A:

Measures and integrals

Take %o to consist of all finite unions of semi-open rectangles (ot\, P\](a2, fh]Each difference of two semi-open rectangles can be written as a disjoint union of
at most eight similar rectangles. As a consequence, every member
of 9Co has a representation as a finite union of disjoint semi-open
rectangles, and Xo is stable under the formation of differences. The
initial definition of Lebesgue measure m, as a set function on 9Co,
might seem obviousadd up the areas of the disjoint rectangles.
It is a surprisingly tricky exercise to prove rigorously that m is well defined and
finitely additive on 3C0.
REMARK.
The corresponding argument is much easier in one dimension. It is,
perhaps, simpler to consider only that case, then obtain Lebesgue measure in higher
dimensions as a completion of products of one-dimensional Lebesgue measures.

The 3Co-tightness of m is trivial, because XQ is stable under differences: if


*i 2 #o, w J th both sets in XOy then K\\K0 e Xo and mK\ - mK0 = m(K\\K0).
To establish <x-smoothness, consider a decreasing sequence [Kn] with empty
intersection. Fix > 0. If we shrink each component rectangle of Kn by a small
enough amount we obtain a set Ln in X$ whose closure Ln is a compact subset of Kn
and for which m(Kn\Ln) < e/2n. The family of compact sets {Ln : n = 1, 2,...}
has empty intersection. For some finite N we must have n,<# / = 0, so that
mKN < m (D^Li) + /<* m(Ki\Li) < 0 + ,<

e/2'.

It follows that mKn tends to zero as n tends to infinity. The finitely additive
measure m is 3Co-smooth at 0. By Theorem <12>, it extends to a 3C-regular,
countably additive measure on S, a sigma-field that contains all the sets in J(X).
You should convince yourself that X, the He-closure of Xo, contains all compact
subsets of R 2 , and 7{X) contains all closed subsets. The sigma-field 8 is complete
and contains the Borel sigma-field 23(R2). In fact S is the Lebesgue sigma-field, the
closure of the Borel sigma-field.

6,

Integral representations
Throughout the book I have made heavy use of the fact that there is a one-to-one
correspondence (via integrals) between measures and increasing linear functional
on M+ with the Monotone Convergence property. Occasionally (as in Sections 4.8
and 7.5), I needed an analogous correspondence for functional on a subcone
of M + . The methods from Sections 1, 2, and 3 can be used to construct measures
representing such functionals if the subcone is stable under lattice-like operations.

<13>

Definition. Call a collection !K+ of nonnegative real functions on a set X a lattice


cone if it has the following properties. For h, h\ and hi in JC+, and <x\ and c*2
in M + :

(H\)
(H2)
(H3)
(H4)

ai h i -f a2fr2 belongs to ft*;


hx\h2 := (h{ - h2)+ belongs to W+;
the pointwise minimum h\ A h2 and maximum h\Vh2
h A 1 belongs to W+.

belong to 5{ + ;

A.6

Integral representations

297

The best example of a lattice cone to keep in mind is the class CQ (Rk) of
all nonnegative, continuous functions with compact support on some Euclidean
space R*.
REMARK.
By taking the positive part of the difference in H2, we keep the
function nonnegative. Properties Hi and H2 are what one would get by taking
the collection of all positive parts of members of a vector space of functions.
Property H4 is sometimes called Stone's condition. It is slightly weaker than an
assumption that the constant function 1 should belong to <K+. Notice that the cone
CQ (Rk) satisfies R,, but it does not contain nonzero constants. Nevertheless, if
h 3f + and a is a positive constant then the function (h a ) + = (h a(\ A h/a))+
belongs to !K+.

<14>
(T\)
(T2)
(T3)
(T4)

Definition. Say that a map T : 2C+ - R+ is an increasing linear functional if,


forh\, hi in 9C+, andot\, oti in R + ;
T(a\h\ -\-a2h2) = ot\Th\ -\-a2Th2l
Th\ < Th2 ifh\ < h2 pointwise.
Call the functional a-smooth at 0 if
Thn I 0 whenever the sequence [hn] in 3-C+ decreases pointwise to zero.
Say that T has the truncation property if
< +
T(h An)-4 Th as n ^ 00, for each h in
K .
REMARK.
For an increasing linear functional, T3 is equivalent to an apparently
stronger property,
(T 3 )

if hn I tin with all /i, in J{ + then Thn | Th^


because Thn < Th^ + 7'(/irt\/i00) I Thoo + 0. Property T4 will allow us to reduce
the representation of arbitrary members of J{+ as integrals to the representation for
bounded functions in !K+.

If fi is a countably additive measure on a sigma-field A, and all the functions


in !K+ are /x-integrable, then the Th := [ih defines a functional on*K^satisfying Tj
through T4. The converse problemfind a \i to represent a given functional Tis
called the integral representation problem. Theorem <8> will provide a solution
to the problem in some generality.
Let Xo denote the class of all sets K for which there exists a countable
subfamily of *K+ with pointwise infimum equal to (the indicator function of) K.
Equivalently, by virtue of H3, there is a decreasing sequence in 9C+ converging
pointwise to K. It is easy to show that %o is a (0, U/, Oc)-paving of subsets for X.
Moreover, as the next Lemma shows, the functions in IK+ are related to Xo and
7(Xo) in much the same way that nonnegative, continuous functions with compact
support in R* are related to compact and closed sets.
<15>

Lemma. For each h in 3{+ and each nonnegative constant a,


(i) lh>a}eX0ifct>0,
and
(ii) {h < a] e 7(XO).
Proof. For (i), note that [h > a] = infneN (l A n (h - a + n~l)+\ a pointwise
infimum of a sequence of functions in <K. For (ii), for a given K in 9Co find a
sequence {hn : n e N} c J{ + that decreases to K. Then note that K{h < a] =
infn hn\ (nh na) + , a set that must therefore belong to XQ.

298

<16>

Appendix A:

Measures and integrals

T h e o r e m . Let Ji+ be a lattice cone of functions, satisfying requirements H\


through H4, and T be an increasing, linear functional on J{ + satisfying conditions T\
through T4. Then the set function defined on X o by fxK := inf{Th : K <h CK+}
is Xo-tight and o-smooth along Xo at 0. Its extension to a Xo-regular measure
on (Xo) represents the functional, that is, Th = [ih for all h in M + . There is only
one Xo-regular measure on 25(30)) whose integral represents T.
REMARK.
Notice that we can replace the infimum in the definition of /x by an
infimum along any decreasing sequence {/*} in !K + with pointwise limit K. For if
K <he <K+, then infn Thn < infw T (hn v h) = Th, by T 2 and Tr

Proof. We must prove that \i is a-smooth along Xo at 0 and Xo-tight; and then
prove that Th > [ih and Th < \xh for every h in 0i+.
(T-smoothness: Suppose Kn e XQ and Kn I 0. Express Kn as a pointwise infimum
of functions {hnj} in tKv. Write hn for infw<rt,/<n/iw,,. Then Kn < hn I 0, and
hence \iKn < Thn I 0 by the er-smoothness for T and the definition of /JL.

Xo-tightness: Consider sets K\ 3 Ko in Xo. Choose 9+ functions g > Ko and


hn 4 K\ andfixa positive constant t < 1. The 3C~-function gn := (hn n(g\t))+
decreases pointwise to the set L : K\{g < t] c K\\Ko. Also, it is trivially true
that g > tK\{g > t}. From the inequality gn -f g > f ATi we get iiK\ < T(gn + g)/t,
because (gn +g)/t is one of the 5{+-functions that enters into the definition of /JLK\.
Let n tend to infinity, take an infimum over all g > Ko, then let t increase to 1, to
deduce that fiKi < IJLL + IAKO, as required for Xo-tightness.
By Theorem <8>, the set function /x extends to a Xo-regular measure on !B(Xo)
Inequality Th > /x/i: Suppose h > u := X!y=iay^y G ^-simple* ^ e n e e ( * t o
show that 7/i > iiu \ YLj^j^^r
^ e m a Y assume that the yi-measurable sets Aj
are disjoint. Choose Xo sets Kj c Aj, thereby defining another simple function
v := X^=i a ./^0 - U- Fi n( * sequences hnj from !K+ with hnj I otjKj, so that
J2j Thnj I Zj oijiiKj = [iv. With no loss of generality, assume h > hnj for all n
and j . Then we have a pointwise bound, Yljhnj < h + J2i<jhni A An;-, because
maxy hnj < h and each of the smaller hnj summands must appear in the last sum.
Thus
<Th + Y.i<jT{hni^hnj).
As n tends to infinity, hni A hnj | K(Kj = 0. By a-smoothness of T, the right-hand
side decreases to Th, leaving /iv < Th. Take the supremum over all Kj c Aj, then
take the supremum over all u < h, to deduce that fih <Th.
Inequality Th < fih: Invoke property T4 to reduce to the case of a bounded h.
For a fixed e > 0, approximate h by a simple function s := e Xl/lit^ - f'K w ^ h
steps of size . Here N is a fixed value large enough to make Ne an upper bound

A.6

299

Integral representations

for h. Notice that {h > ie] 3Co, by Lemma <15>. Find sequences hni from W+
with hni | [h > ie). Then we have

Sf<h<

(hA)+S(

< (hA) + 1t\hni<

from which it follows that


-> T(h

) -I- Y,1L\ V{h > ie)

< 7(/i A e) -f iih


-> fih

D
<n>

<18>

as * - oo

because Ylh=\lh ^ i) = s( < h

as f - 0, by a-smoothness of T.

Uniqueness: Let v be another 3Co-regular representing measure. If hn I K %Qy


and hn e 3i+, then fiK = limn /x/in = limn T/in = limn v/in = vK. Regularity extends
the equality to all sets in S(X 0 ).
Example. Let !K+ equal CQ (X), the cone of all nonnegative, continuous functions
with compact support on a locally compact, Hausdorff space X. For example, X
might be Rk. Let T be an increasing linear functional on Cj(X).
Property T4 holds for the trivial reason that each member of Cj (X) is bounded.
Property T3 is automatic, for a less trivial reason. Suppose hn I 0. Without loss
of generality, K > h\ for some compact K. Choose h in ej(X) with h > K. For
fixed e > 0, the union of the open sets [hn < } covers K. For some finite N, the
set {hN < } contains K, in which case hN < eK < /i, and ThN < eTh. The
a-smoothness follows.
The functional T has a representation Th = [ih on Cj (X), for a DCo-regular
measure ix. The domain of /x need not contain all the Borel sets. However, by an
analog of Lemma <io> outlined in Problem [1], it could be extended to a Borel
measure without disturbing the representation.
Example. Let J{+ be a lattice cone of bounded continuous functions on a
topological space, and let T : 1K+ -* R + be a linear functional (necessarily
increasing) with the property that to each > 0 there exists a compact set K
for which Th < e if 0 < h < Kc. (In Section 7.5, such a functional was called
functionally tight.)
Suppose 1 G ^K+. The functional is automatically cr-smooth: if 1 > ht ], 0 then
eventually K c {hi < e), in which case Th{ < T ((hi - O + + *) < + T(\). In
fact, the same argument shows that the functional is also r-smooth, in the sense of
Problem [2].
The functional T is represented by a measure fi on the sigma-field generated
by M + . Suppose there exists a sequence {*,-} c Jf+ for which 1 > hx \ K(. (The
version of the representation theorem for r-smooth functional, as described by
Problem [2], shows that it is even enough to have W+ generate the underlying
topology.) Then fiK = lim, Th{ = 7(1) - lim, T(\ - ht) > T(l) - . That is, \i is
a tight measure, in the sense that it concentrates most of its mass on a compact set.
It is inner regular with respect to approximation by the paving of compact sets.

300

Appendix A:

Measures and integrals

7.

Problems

[1]

A family of sets U is said to be downward filtering if to each pair U\, Ui in It


there exists a U3 in U with U\ n U2 2 U3. A set function \x : Xo -> M+ is said
to be r-smooth if inf{/x : AT U} = /x(ntl) for every downward filtering family
IX c %0. Write X for the fia-closure of a (0, U/, n / ) paving 3C0, the collection of
all possible intersections of subclasses of Xo.
(i) Show that X is a (0, U/, Ha) paving (stable under arbitrary intersections).
(ii) Show that a 3Co-tight set function that is r-smooth at 0 has a 9C-tight, r-additive
extension to X.

[2]

Say that an increasing functional T on 0i+ is x-smooth at zero if inf{Th : h e V] for


each subfamily V of !K+ that is downward filtering to the zero function. (That is, to
each h\ and /12 in V there is an /13 in V with h\ A / I 2 > ^3 and the pointwise infimum
of all functions in V is everywhere zero.) Extend Theorem <16> to r-smooth
functional by constructing a 3C-regular representing measure from the class X of
sets representable as pointwise infima of subclasses of 5{ + .

8,

Notes
The construction via DC-tight inner measures is a reworking of ideas from
Tops0e (1970). The application to integral representations is a special case of
results proved by Pollard & Tops0e (1975).
The book by Fremlin (1974) contains an extensive treatment of the relationship
between measures and linear functionals. The book by Konig (1997) develops the
theory of measure and integration with a heavy emphasis on inner regularity.
See Pfanzagl & Pierlo (1969) for an exposition of the properties of pavings
compact in the sense of Marczewski.
REFERENCES

Fremlin, D. H. (1974), Topological Riesz Spaces and Measure Theory, Cambridge


University Press.
Konig, H. (1997), Measure and Integration: An Advanced Course in Basic Procedures
and Applications, Springer-Verlag.
Marczewski, E. (1953), *On compact measures', Fundamenta Mathematicae pp. 113124.
Pfanzagl, J. & Pierlo, N. (1969), Compact systems of sets, Vol. 16 of Springer Lecture
Notes in Mathematics, Springer-Verlag, New York.
Pollard, D. & Tops0e, F. (1975), *A unified approach to Riesz type representation
theorems', Studia Mathematica 54, 173-190.
Tops0e, F. (1970), Topology and Measure, Vol. 133 of Springer Lecture Notes in
Mathematics, Springer-Verlag, New York.

Appendix B

Hilbert spaces
SECTION 1 defines Hilbert space and presents two basic inequalities.
SECTION 2 establishes the existence of orthogonal projections onto closed subspaces of a
Hilbert space.
SECTION 3 defines orthonormal bases of Hilbert spaces. Vectors in the space have
representations as infinite linear combinations (convergent series) of basis vectors.
SECTION 4 shows how to construct a random process from an orthonormal sequence of
random variables and an orthonormal basis.

1.

Definitions
Hilbert space is an infinite dimensional generalization of ordinary Euclidean space.
Arguments involving Hilbert spaces look similar to their analogs for Euclidean
space, with the addition of occasional precautions against possible difficulties with
infinite dimensionality.

<i>

Definition. A Hilbert space is a vector space IK equipped with an inner product


(, ) (a map from*K<8> IK into R) which satisfies the following requirements.
(a) (ctf + fig, h) = <*</, h) + p{g, h) for all real a, p all /, g, h in IK.
(b) (f,g) =
{g,f)forallf,ginM.
(c) (/. /> > 0 with equality if and only if / = 0.
(d) *K is complete for the norm defined by \\f\\ := J{f, / ) . That is, if {/} is a
Cauchy sequence in IK, meaning \\fn - fm\\ -> 0 as min(m, n) -> oo, then
there exists an f inOi for which \\fn f\\ - 0.
Two elements / and g of *K are said to be orthogonal, written / g, if
(/g) = 0. An element / is said to be orthogonal to a subset G of IK, written
/ G, if / J. g for every g in G.
The prime examples of Hilbert spaces are ordinary Euclidean space and L2(/i),
the set of equivalence classes of measurable real-valued functions whose squares are
/i-integrable, for a fixed measure /x. See Section 2.7 for discussion of why we need
to work with /^-equivalence classes to get property (c).
Hilbert space shares several properties with ordinary Euclidean space.

302

Appendix B:

Hilbert spaces

Cauchy-Schwarz inequality: | ( / , g)\ < \\f\\ \\g\\ for all / , g in M.


The inequality is trivial if either ||/|| = 0 or ||g|| = 0. Otherwise it follows
immediately from an expansion of the left-hand side of the inequality

Ill/ll

11*11 i

Triangle Inequality: 11/ + *|| < ||/|| + ||*|| for all / , g in W.


The square of the left-hand side equals ||/|| 2 + 2(/, g) + ||g|| 2 , which is less
than H/ll2 + 2H/II ||g|| + ||g|| 2 = (||/|| + ||g||) 2 , by Cauchy-Schwarz.

2.

Orthogonal projections
Many proofs for Rk that rely only on completeness carry over to Hilbert spaces. For
example, if Wo is a subspace of R*, then to each vector x there exists a vector xo
in Wo that is closest to x. The vector jto is characterized by the property that x xo
is orthogonal to Wo. The vector JCO is called the orthogonal projection of x onto Wo,
or the component of x in the subspace Wo; the vector x jto is called the component
of x orthogonal to Wo.
Projections also exist for closed subspaces of a Hilbert space W. (Recall that a
subset 9 of W is said to be closed if it contains all its limit points: if [gn] c g and
\\gn f\\ 0 then / G S ) Every finite dimensional subspace is automatically closed
(Problem [2]); infinite dimensional subspaces need not be closed (Problem [3]).

<2>

T h e o r e m . Let Wo be a closed subspace of a Hilbert space W. For each f in W


there is a unique fo in Wo, the orthogonal projection of f onto Wo, for which f - fo
is orthogonal to Wo- The point fo minimizes \\f - h\\ over all h in Wo.
Proof For a fixed / in W, define 8 := inf{||/ h\\ : h G WO}. Choose [hn) in Wo
such that | | / /in II 8. For arbitrary g, g' in W, cancellation of cross-product
terms leads to the identity
II* + *'ll2 + II* - *'ll 2 = 2||g|| 2 + 2||g / || 2 .
Put g := / - hn and g' := f - hm to get
4 | | / - (* + hm)/2\\2 + \\hm - hn\\2 = 2 | | / - hn\\2 + 2 | | / - hm\\2.
The first term on the left-hand side must be > 482 because (hn +hm)/2 belongs to
Wo- Both terms on the right-hand side converge to 2<52 as min(m,n) - oo. Thus
\\hm - K\\ - 0 as min(m, n) -+ oo. That is, {hn} is a Cauchy sequence.
Completeness of W ensures that hn converges to some /o in W. As Wo is closed
and each hn belongs to Wo, we deduce that /o e Wo. The infimum 8 is achieved at
/o, because | | / - / 0 || < | | / - hn\\ 4- \\hn - / 0 || -+ 8.
To prove the orthogonality assertion, for a fixed g in Wo consider the squared
distance from / to /o + fg,

11/ - (/o + *)||2 = 11/ - /oil2 + 2r{/ - /o, g) + /2||g||2,

B.2

303

Orthogonal projections

as a function of the real variable t. The vector /o + tg belongs to !Ko . It is one of


those vectors in the range of the the infimum that defines S. It follows that
82 + 2t{f - /o, g) + f 2 |||| 2 > 82

<3>

for all real t.

For such an inequality to hold, the coefficient, ( / - /o, g), of the linear term in t
must be zero. (Otherwise what would happen for t close to zero?)
To prove uniqueness, suppose both /o and f\ have the projection property.
Then f\ - /o in 0<o would be orthogonal to both / - /o and / - f\, from which it
follows that / i - /o is orthogonal to itself, whence f\ = fa.
Corollary. (Riesz-Frechet Theorem) To each continuous linear map T from a
Hilbert space 0< into R there exists a unique element ho in Oi such that Th {h, ho)
for all h in *H.
Proof. Uniqueness is easy to prove: if (h,ho) = (h,h\) for h := ho - h\ then
l l * o - * i 11=0.
Existence is also easy when Th = 0 = (ft, 0). So let us assume that there exists
an h2 with Th2 # 0. Let h3 denote the component of h2 that is orthogonal to the
closed (by continuity of T) linear subspace Wo := [h e 0{ : Th = 0} of <K. Note
that Th3 = Th2 - T(h2 - h3) = Th2 ^ 0.
For h in IK, define Ch := Th/Th3. The difference h - Q/13 belongs to 3K0

3.

because T(h-Chh3) = 0, and therefore 0 = (h-Chh3,h3) =


(h,h3)-(Th/Th3)\\h3\\2.
2
The choice ho := (Th3/\\h3\\ )h3 gives the desired representation.

Orthonormal bases
A family of vectors ^ = [fa? : 1 e 1} in a Hilbert space !K is said to be orthonormal
if (^1, tyi) = 1 for all 1 and (^r,-, ^y> = 0 for all i ^ j . The subspace JCo spanned
by ^ consists of all linear combinations J Z i e j ^ ^ . with J ranging over all finite
subsets of / . If Jio is dense in *K (that is, if the closure *Ko of 0<o equals !K) then
the family \jr is called an orthonormal basis for 3~C.

<4>

Lemma.

Let ^ be an orthonormal basis for *K.

(i) For each h in Oi, the set h := {1 e / : (h, ^1) # 0} is countable, and for every
enumeration {i(l), i(2),...} of /A, the sum 5Z*=i(^ fnk))^i{k) converges in
norm to h as n 00.
(ii) (g, A) = //(> ^>(*, ^,) (ParsevaVs identity) for all g,heK.
2

In particular,

I|A|I = E< /K*.^>I REMARK.


The sum in assertion (ii) actually runs over only a countable subset
of /. The assertion should be understood as convergence of partial sums for all
possible enumerations of that countable subset.
Proof. For each finite subset J of / , the subspace 'Kj spanned by the finite set
{t^, : 1 7} is closedsee Problem [2]. The projection of h onto Jij equals

Appendix B:

304

Hilbert spaces

hj := j /(A, ti)fi, because (A - hj, ft) = {A, ft) - (A, ft) = 0 for each i in / .
The orthogonality implies
\\h\\2 = \\h -hj\\

+ \\hj\\2 = ||A - hj\\2 + J2izj{h>

^)2*

For each > 0, the set Ih() := {i / : |(A, fi)\ > } has cardinality N no greater
than \\h\\2/2, because \\h\\2 > .{1 e /*()} (A, ft)2 > e2iV. It follows that the set
/A = Uknh(l/k) is countable, as asserted.
For a fixed A let (i(l), i(2),...} be an enumeration of 4 . Write /(n) for the
initial segment ( i ( l ) , . . . , i(n)}. Denseness of Jio means that to each 6 > 0 there is
some finite linear combination h = J2tj a * ^ s u c h ^ a t \\h h\\ < . By definition
of 4 , the vector /i is orthogonal to ft for each i in i \ 4 . It follows that the vector
h \{i" 6 J f i h}&ityi is orthogonal to {^/ : i J\//,}, whence
||A - hf

= \\h - EiV / H /}a^ll 2 + II /{* A ' } ^ l l 2 .

We reduce the distance between h and h if we discard from J those i not in lh.
Without loss of generality we may therefore assume that J is a subset of //,.
When n is large enough that / c J(n) we have h hj^) orthogonal to the
subspace Jij(n), which contains both hj(n) and h. For such n we have
2 > ||A - hj(H) + hJ{n) - h\\2

= IIA - h

definition of h

|| + || hJin) -hf

by orthogonality

That is, for each > 0 we have ||A A/(W)|| < for all n large enoughprecisely
what it means to say that J2kL\(h' !M*))ifo(*) converges (in norm) to A. The
representation ||A||2 = $^ij(A, ^V(*))2 then follows via continuity (Problem [1]) of
the map g H* ||g|| 2 .
For ParsevaFs identity, let (j(l), j(2),...} be an enumeration of / g U Ih. Then,
from the special case just established,

D
<5>

The series for 4{g, A) = ||g + A||2 - ||g - A||2 is obtained by subtraction.
E x a m p l e . (Haar basis) Let m denote Lebesgue measure on the Borel sigmafield of (0,1]. For k = 0 , 1 , 2 , . . . , partition (0,1] into subintervals J
:=
(i2~*, (1 4-1)2"*], for i = 0 , 1 , . . . , 2 * - 1 . Define functions Hik := J2i,M -/2i+i f *+i
and 1ritk := VHa, for 0 < i < 2k, and k = 0 , 1 , 2
Ho.,

e(

H,,

Hu

H2i2

H 32

-1 f-

Jo.3 y J , 3 y J2.3 y J..3 y J43 y J5.3 y J6.3 K J73 j

The collection of functions 4> := {1} U {^fjk : k e N o ; 0 < 1 < 2*} is an


orthonormal family in L 2 (m). A generating class argument will show that ^ is an
orthonormal basis.

305

B.3 Orthonormal bases

Each Jitk belongs to the subspace IKo spanned by ^:


7o.i = 5 0
Jo,2 = \ (/o.I + #0,1) = Jb.i - Ki.

^2,2 = \ (/i.i + J?u) = /i.i - J3.2

and so on. Take finite sums to deduce that IKo contains the class of all indicator
functions of dyadic intervals (i/2*, j/2*]. Note that is stable under finite intersections and that it generates the Borel sigma-field. The class D of all Borel sets whose
indicators belong to the closure IKo is a A.-system (easy proof), which contains .
Thus IKo contains <r(), the Borel sigma-field on (0,1].
If IKo were not equal to the whole of L2(m) there would exist a square-integrable
function / orthogonal to IKo. In particular, / would be orthogonal to both {/ > 0}
and {/ < 0}, which would force m / + = m/~ = 0. That is, / = 0 a.e. [m].
The component of a function h in the subspace spanned by 1 is just the constant
function m/t. Each function h in L2(m) has a series expansion,

4.

with convergence in the L2(m) sense.

Series expansions of r a n d o m processes


Suppose IK is a Hilbert space (necessarily separablesee Problem [6]) with a
countable orthonormal basis * = {fa : i e N}. Let {& : i N} be a sequence
of random variables, on some probability space (Qy 7, P), that is orthonormal for
the L2(P) inner product: Pftfy = 1 if 1 = j , zero otherwise. For each h in IK, the
sequence of random variables
is a Cauchy sequence in L2(F), because convergence of Xw(A, Vo)2 implies
P !<<<,<*. ^i)6| 2 = Hn<i<m(h, +t)2-+0

as n - oo.

Write X(h) for the limit ^l\(h, ^)ft, which is defined up to an almost sure
equivalence as a square integrable random variable.
By Problem [1] and the Parseval identity, for g and h in IK we have

VX(g)X(h) = lim FXn(g)Xn(h) = lim ? "

(g, ^)(* f ^ ) = (g, h).

In particular, FX(h)2 = ||/i||2. The map h -> X(/i) is a /m^ar isometry between IK
and a linear subspace of L2(P): the map preserves inner products and distances.
The most important example of a series expansion arises when all the ft
have independent, standard normal distributions. The family of random variables
[X(h) : h IK} is then called the isonormal process indexed by IK. The particular
case known as Brownian motion is discussed in Chapter 9.

306

Appendix B:

Hilbert spaces

5.

Problems

[1]

Suppose gn -> g and hn -> /i, as elements of a Hilbert space. Show that (gn, hn) ->
(g, /i). Hint: Use Cauchy-Schwarz to bound terms like (gn - g, hn).

[2]

Let 3~ be a finite subset of a Hilbert space IK. Show that the subspace generated
by 3 is a closed subset of IK. Hint: Without loss of generality assume the elements
{/i /*} of J are linearly independent. For each i, find a vector Vv such that
(/,, fa) = 1 but (/)-, V>) = 0 f r j # ' W ^* = Y,i<*i(n)fi converges to some fc,
deduce that {a,(n)} converges for each i.

[3]

Let A. be a finite measure on 3[0, 1]. Let IK denote the collection of A-equivalence
classes {[h] : h is continuous }. Show that IK is not a closed subspace of L2(k) if
k equals Lebesgue measure. Could it be a closed subspace for some other choice
of X?

[4]

Let IK be a closed, convex subset of a Hilbert space IK.


(i) Show that to each / in IK there is a unique /o in X for which ||/ - /oil =
inf{||/ - h\\ : h e X). Hint: Mimic the proof of Theorem <2>.
(ii) Show that ( / - fo, g - /o) < 0 for all g in X. Hint: Consider the distance
from / to (1 - t)f0 + tg for 0 < t < 1.
(iii) Give a (finite-dimensional) example where ( / - /o, g /o) < 0 for all g
in 3C\{/o).

[5]

Use Zorn's Lemma to prove that every Hilbert space has at least one orthonormal
basis. Hint: Order orthonormal bases by inclusion. If 4> is maximal for this
ordering, show that there can be no nonzero element orthogonal to every member
of 4/.

[6]

Let IK be a Hilbert space with an orthonormal basis 4> := {^ : i / } . Show that


/ is countable if and only if IK is separable (that is, it contains a countable, dense
subset). Hint: if / is countable, consider finite linear combinations X^gjC*,-^; w ^ *
the a, rational. Conversely, if {h\, hi,...} is dense, construct an orthonormal basis
inductively by defining g{ := ht - ^{j < i}(*,-, Vo) md $i : = gi/WgiW w h e n gi # 0.

6.

Notes
Halmos (1957, Chapter I) is an excellent source for basic facts about Hilbert space.
See Dudley (1973) and Dudley (1989, page 378) for the isonormal process.
REFERENCES

Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of


Probability 1, 66-103.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Halmos, P. R. (1957), Introduction to Hilbert space, second edn, Chelsea.

Appendix C

Convexity
SECTION I defines convex sets and functions.
SECTION 2 shows that convex functions defined on subintervals of the real line have leftand right-hand derivatives everywhere.
SECTION 3 shows that convex functions on the real line can be recovered as integrals of
their one-sided derivatives.
SECTION 4 shows that convex subsets of Euclidean spaces have nonempty relative interiors.
SECTION 5 derives various facts about separation of convex sets by linear functions.

1.

Convex sets and functions


A subset C of a vector space is said to be convex if it contains all the line segments
joining pairs of its points, that is,
otx\ + (1 - )JC2 e C

for all JCI , X2 C and all 0 < a < 1.

A real-valued function / defined on a convex subset C (of a vector space V) is said


to be convex if

f(ax{ + (1 - a)x2) < af(x\) + (1 - a)f(x2)

for all x\, x2 e C and 0 < a < 1.

Equivalently, the epigraph of the function,


epi(/) := {(JC, 0 e C x R : t > / ( * ) } ,
is a convex subset of C x IR. Some authors (such as Rockafellar 1970) define /(JC)
to equal +oo for x e V\C, so that the function is convex on the whole of V, and
epi(/) is a convex subset of V x R.
This Appendix will establish several facts about convex functions and sets,
mostly for Euclidean spaces. In particular, the facts include the following results as
special cases.
(i) For a convex function / defined at least on an open interval of the real line
(possibly the whole real line), there exists a countable collection of linear
functions for which /(JC) = sup IN (a, 4- fax) on that interval.
(ii) If a real-valued function / has an increasing, real-valued right-hand derivative
at each point of an open interval, then / is convex on that interval. In
particular, if / is twice differentiable, with f" > 0, then / is convex.

Appendix C:

308

Convexity

(iii) If a convex function / on a convex subset C c f has a local minimum at


a point JCO, that is, if /(JC) > f(xo) for all x in a neighborhood of JCO, then
/ ( w ) > f(x0) for all u; in C.
(iv) If Ci and C 2 are disjoint convex subsets of E n then there exists a nonzero
t in Rn for which sup xCl x -I < infxc2 x * Z- Th a t *s> * e linear functional
x H> x I separates the two convex sets.

2.

One-sided derivatives
Let / be a convex function, defined and real-valued at least on an interval J of the
real line.
Consider any three points JCI < JC2 < JC3, all in J. (For the moment, ignore
the point JCO shown in the picture.) Write a for (JC2 jci)/to *i)> so that
X2 = 0CX3 + (1 - <x)x\. By convexity, y2 := a / t o ) + (1 - ot)f(x\) > / t o ) . Write
S(xi,Xj) for (f(xj) - f(xj))/(xj - JC/), the slope of the chord joining the points
(JC,, /(JC,)) and (JC,, /(*,)). Then

/to) - /to)
= S(xux3)

yi -

f(xi)

X2-X\

/to)-/to)

S(xux2).

slope S(x 2 ,x 3 ) ,
slope S(x,,x 3 )-

From the second inequality it follows that 5(JCI,JC) decreases as x decreases


to x\. That is, / has right-hand derivative D+(x\) at jq, if there are points of J
that are larger than x\. The limit might equal - 0 0 , as in the case of the function
f(x) = *fx defined on R + , with JCI = 0. However, if there is at least one point JCO
of / for which *o < x\ then the limit D+(JCI) must be finite: Replacing {JCI, JC2, JC3}
in the argument just made by {JCO, JCI, JC2}, we have S(JCO,JCI) < S(JCI,JC2), implying
that - 0 0 < 5(JCO,JCI) < D+(JCI).
The inequality 5(JCI, JC) < S(x\, JC2) < 5 t o , JC') if JCI < JC < JC2 < JC', leads to the
conclusion that D+ is an increasing function. Moreover, it is continuous from the

309

C.2 One-sided derivatives

right, because
D+(x2)

< S(jt2, JC3) - > 5(JCJ, JC3)

-* >+(*!)

as JC2 I x\, for fixed JC3

as x3 i x\.

Analogous arguments show that S(JCO,JCI) increases to a limit D_(JCJ) as JCO


increases to x\. That is, / has left-hand derivative DJ(JCI) at JCI, if there are points
of J that are smaller than x\.
If JCI is an interior point of J then both left-hand and right-hand derivatives
exist, and Z>_(JCI) < D+(JCI). The inequality may be strict, as in the case where
f(x) = \x\ with x\ = 0. The left-hand derivative has properties analogous to those
of theright-handderivative. The following Theorem summarizes.
<1> Theorem. Let $f$ be a convex, real-valued function defined (at least) on a bounded interval $[a,b]$ of the real line. The following properties hold.

(i) The right-hand derivative $D_+(x)$ exists, with
$$\frac{f(y) - f(x)}{y - x} \downarrow D_+(x) \qquad \text{as } y \downarrow x,$$
for each $x$ in $[a,b)$. The function $D_+(x)$ is increasing and right-continuous on $[a,b)$. It is finite for $a < x < b$, but $D_+(a)$ might possibly equal $-\infty$.

(ii) The left-hand derivative $D_-(x)$ exists, with
$$\frac{f(x) - f(z)}{x - z} \uparrow D_-(x) \qquad \text{as } z \uparrow x,$$
for each $x$ in $(a,b]$. The function $D_-(x)$ is increasing and left-continuous on $(a,b]$. It is finite for $a < x < b$, but $D_-(b)$ might possibly equal $+\infty$.

(iii) For $a \le x < y \le b$,
$$D_+(x) \le \frac{f(y) - f(x)}{y - x} \le D_-(y).$$

(iv) $D_-(x) \le D_+(x)$ for each $x$ in $(a,b)$, and
$$f(w) \ge f(x) + c(w - x) \qquad \text{for all } w \text{ in } [a,b],$$
for each real $c$ with $D_-(x) \le c \le D_+(x)$.


Proof. Only the second part of assertion (iv) remains to be proved. For $w > x$ use
$$\frac{f(w) - f(x)}{w - x} = S(x, w) \ge D_+(x) \ge c;$$
for $w < x$ use
$$\frac{f(x) - f(w)}{x - w} = S(w, x) \le D_-(x) \le c,$$
where $S(\cdot, \cdot)$ denotes the slope function, as above. □
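Assertion (iv) is the supporting-line (subgradient) property. A small numerical check (my sketch, not from the text), for $f(x) = |x|$ at $x = 0$, where $D_-(0) = -1$ and $D_+(0) = 1$:

```python
import numpy as np

def support_line_check(f, x, c, grid):
    """Check the supporting-line inequality f(w) >= f(x) + c*(w - x)
    from part (iv) of the Theorem, on a finite grid of points w."""
    w = np.asarray(grid)
    return bool(np.all(f(w) >= f(x) + c * (w - x) - 1e-12))

grid = np.linspace(-2.0, 2.0, 401)
# Any c in [D_-(0), D_+(0)] = [-1, 1] gives a supporting line at 0.
print(all(support_line_check(np.abs, 0.0, c, grid) for c in (-1.0, -0.3, 0.0, 1.0)))  # True
print(support_line_check(np.abs, 0.0, 1.5, grid))  # False: 1.5 exceeds D_+(0)
```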


<2> Corollary. If a convex function $f$ on a convex subset $C \subseteq \mathbb{R}^n$ has a local minimum at a point $x_0$, that is, if $f(x) \ge f(x_0)$ for all $x$ in a neighborhood of $x_0$, then $f(w) \ge f(x_0)$ for all $w$ in $C$.


Proof. Consider first the case $n = 1$. Suppose $w \in C$ with $w > x_0$. The right-hand derivative $D_+(x_0) = \lim_{y \downarrow x_0} \left(f(y) - f(x_0)\right)/(y - x_0)$ must be nonnegative, because $f(y) \ge f(x_0)$ for $y$ near $x_0$. Assertion (iv) of the Theorem then gives
$$f(w) \ge f(x_0) + (w - x_0)D_+(x_0) \ge f(x_0).$$
The argument for $w < x_0$ is similar.

For general $\mathbb{R}^n$, apply the result for $\mathbb{R}$ along each straight line through $x_0$. □
Existence of finite left-hand and right-hand derivatives ensures that $f$ is continuous at each point of the open interval $(a,b)$. It might not be continuous at the endpoints, as shown by the example
$$f(x) := \begin{cases} -\sqrt{x} & \text{for } x > 0, \\ 1 & \text{for } x = 0. \end{cases}$$
Of course, we could recover continuity by redefining $f(0)$ to equal $0$, the value of the limit $f(0+) := \lim_{w \downarrow 0} f(w)$.
<3> Corollary. Let $f$ be a convex, real-valued function on an interval $[a,b]$. There exists a countable collection of linear functions $d_i + c_i w$, for which the convex function $\psi(w) := \sup_{i \in \mathbb{N}} (d_i + c_i w)$ is everywhere $\le f(w)$, with equality except possibly at the endpoints $w = a$ or $w = b$, where $\psi(a) = f(a+)$ and $\psi(b) = f(b-)$.

Proof. Let $X_0 := \{x_i : i \in \mathbb{N}\}$ be a countable dense subset of $(a,b)$. Define $c_i := D_+(x_i)$ and $d_i := f(x_i) - c_i x_i$. By assertion (iv) of the Theorem, $f(w) \ge d_i + c_i w$ for $a \le w \le b$, for each $i$, and hence $f(w) \ge \psi(w)$.
If $a < w < b$ then (iv) also implies that $f(x_i) \ge f(w) + (x_i - w)D_+(w)$, and hence
$$\psi(w) \ge f(x_i) + c_i(w - x_i) \ge f(w) - (x_i - w)\left(D_+(x_i) - D_+(w)\right) \qquad \text{for all } x_i.$$
Let $x_i$ decrease to $w$ (through $X_0$) to conclude, via right-continuity of $D_+$ at $w$, that $\psi(w) \ge f(w)$.
If $D_+(a) > -\infty$ then $f$ is continuous at $a$, and
$$f(a) \ge \psi(a) \ge \limsup_i \left(f(x_i) + (a - x_i)c_i\right) = f(a+) = f(a).$$
If $D_+(a) = -\infty$ then $f$ must be decreasing in some neighborhood $N$ of $a$, with $c_i \le 0$ when $x_i \in N$, and
$$\psi(a) \ge \sup_{x_i \in N} \left(f(x_i) + (a - x_i)c_i\right) \ge \sup_{x_i \in N} f(x_i) = f(a+).$$
If $\psi(a)$ were strictly greater than $f(a+)$, the open set $\{w : \psi(w) > f(a+)\}$ would contain a neighborhood of $a$, which would imply existence of points $w$ in $N\backslash\{a\}$ for which $\psi(w) > f(a+) \ge f(w)$, contradicting the inequality $\psi(w) \le f(w)$. A similar argument works at the other endpoint. □
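The construction in the proof is concrete enough to compute. The following sketch (mine, not from the text; $f(w) = w^2$ on $[-1,1]$ and the 25 tangent points are arbitrary choices) builds $\psi$ as the maximum of tangent lines $d_i + c_i w$:

```python
import numpy as np

# Approximate a convex f from below by the supremum of tangent lines
# d_i + c_i * w taken at finitely many points, as in Corollary <3>.
f = lambda w: w ** 2          # example convex function on [-1, 1]
df = lambda w: 2 * w          # here D_+(x) = f'(x) = 2x

x_pts = np.linspace(-0.99, 0.99, 25)    # stand-in for a dense subset of (a, b)
c = df(x_pts)                            # slopes     c_i = D_+(x_i)
d = f(x_pts) - c * x_pts                 # intercepts d_i = f(x_i) - c_i x_i

w = np.linspace(-1, 1, 201)
psi = np.max(d[:, None] + c[:, None] * w[None, :], axis=0)  # psi(w) = max_i (d_i + c_i w)

print(bool(np.all(psi <= f(w) + 1e-12)))  # True: psi never exceeds f
print(np.max(f(w) - psi))                 # small: psi approximates f from below
```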

3. Integral representations

Convex functions on the real line are expressible as integrals of one-sided derivatives.


<4> Theorem. If $f$ is real-valued and convex on $[a,b]$, with $f(a) = f(a+)$ and $f(b) = f(b-)$, then both $D_+(x)$ and $D_-(x)$ are integrable with respect to Lebesgue measure on $[a,b]$, and
$$f(x) = f(a) + \int_a^x D_+(t)\,dt = f(a) + \int_a^x D_-(t)\,dt \qquad \text{for } a \le x \le b.$$
Proof. Choose $\alpha$ and $\beta$ with $a < \alpha < \beta < x$. For a positive integer $n$, define $\delta := (\beta - \alpha)/n$ and $x_i := \alpha + i\delta$ for $i = 0, 1, \ldots, n$. Both $D_+$ and $D_-$ are bounded on $[\alpha, \beta]$. For $i = 2, \ldots, n-1$, part (iii) of Theorem <1> and monotonicity of both one-sided derivatives give
$$\int_{x_{i-2}}^{x_{i-1}} D_+(t)\,dt \le \delta D_+(x_{i-1}) \le f(x_i) - f(x_{i-1}) \le \delta D_-(x_i) \le \int_{x_i}^{x_{i+1}} D_-(t)\,dt,$$
which sums to give
$$\int_{\alpha}^{x_{n-2}} D_+(t)\,dt \le f(x_{n-1}) - f(x_1) \le \int_{x_2}^{\beta} D_-(t)\,dt.$$
Let $n$ tend to infinity, invoking Dominated Convergence and continuity of $f$, to deduce that $\int_{\alpha}^{\beta} D_+(t)\,dt \le f(\beta) - f(\alpha) \le \int_{\alpha}^{\beta} D_-(t)\,dt$. Both inequalities must actually be equalities, because $D_-(t) \le D_+(t)$ for all $t$ in $(a,b)$.

Let $\alpha$ decrease to $a$. Monotone Convergence (the functions $D_{\pm}$ are bounded above by $D_+(\beta)$ on $(a, \beta]$) and continuity of $f$ at $a$ give $f(\beta) - f(a) = \int_a^{\beta} D_+(t)\,dt = \int_a^{\beta} D_-(t)\,dt$. In particular, the negative parts of both $D_{\pm}$ are integrable. Then let $\beta$ increase to $x$ to deduce, via a similar argument, the asserted integral expressions for $f(x) - f(a)$, and the integrability of $D_{\pm}$ on $[a,b]$. □
Conversely, suppose $f$ is a continuous function defined on an interval $[a,b]$, with an increasing, real-valued right-hand derivative $D_+(t)$ existing at each point of $[a,b)$. On each closed proper subinterval $[a,x]$, the function $D_+$ is bounded, and hence Lebesgue integrable. From Section 3.4, $f(x) - f(a) = \int_a^x D_+(t)\,dt$ for all $a \le x < b$. Equality for $x = b$ also follows, by continuity and Monotone Convergence. A simple argument will show that $f$ is then convex on $[a,b]$.

More generally, suppose $D$ is an increasing, real-valued function defined (at least) on $[a,b)$. Define $g(x) := \int_a^x D(t)\,dt$, for $a \le x \le b$. (Possibly $g(b) = \infty$.) Then $g$ is convex. For if $a \le x_0 < x_1 \le b$ and $0 < \alpha < 1$ and $x_\alpha := (1-\alpha)x_0 + \alpha x_1$, then
$$(1-\alpha)g(x_0) + \alpha g(x_1) - g(x_\alpha) = \int_a^b \left((1-\alpha)\{t \le x_0\} + \alpha\{t \le x_1\} - \{t \le x_\alpha\}\right)D(t)\,dt$$
$$= \int \left(\alpha\{x_\alpha < t \le x_1\} - (1-\alpha)\{x_0 < t \le x_\alpha\}\right)D(t)\,dt$$
$$\ge \left(\alpha(x_1 - x_\alpha) - (1-\alpha)(x_\alpha - x_0)\right)D(x_\alpha) = 0.$$
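The final calculation can be checked numerically. In this sketch (mine, not from the text), $D(t) = \lfloor 3t \rfloor$ is increasing but discontinuous, and the Riemann sums for $g(x) := \int_0^x D(t)\,dt$ have nonnegative second differences, a discrete counterpart of convexity:

```python
import numpy as np

# Integrating an increasing D yields a convex g, per the argument above.
# D(t) = floor(3t) is increasing but not continuous; g is computed by a
# simple left-endpoint Riemann sum on a fine grid.
D = lambda t: np.floor(3 * t)
t = np.linspace(0.0, 2.0, 2001)
dt = t[1] - t[0]
g = np.concatenate([[0.0], np.cumsum(D(t[:-1]) * dt)])  # g(x) ~ integral_0^x D(t) dt

# Convexity of the sampled g: second differences should be nonnegative.
print(bool(np.all(np.diff(g, 2) >= -1e-12)))   # True
```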


<5> Example. Let $f$ be a twice continuously differentiable (actually, absolute continuity of $f'$ would suffice) convex function, defined on a convex interval $J \subseteq \mathbb{R}$ that contains the origin. Suppose $f(0) = f'(0) = 0$. The representations
$$f(x) = x\int\{0 \le s \le 1\}f'(xs)\,ds = x^2\iint\{0 \le t \le s \le 1\}f''(xt)\,dt\,ds = x^2\int_0^1 (1-t)f''(xt)\,dt$$
establish the following facts.


(i) The function $f(x)/x$ is increasing.

(ii) The function $\phi(x) := 2f(x)/x^2$ is nonnegative and convex.

(iii) If $f''$ is increasing then so is $\phi$.
Moreover, when $f''$ is convex, Jensen's inequality for the uniform distribution $\lambda$ on the triangular region $\{0 \le t \le s \le 1\}$ implies that
$$\phi(x) = \lambda^{s,t} f''(xt) \ge f''\left(\lambda^{s,t}(xt)\right) = f''(x/3).$$
Two special cases of these results were needed in Chapter 10, to establish the Bennett inequality and to establish Kolmogorov's exponential lower bound. The choice $f(x) := e^x - 1 - x$, with $f''(x) = e^x$, leads to the conclusion that the function
$$x \mapsto \begin{cases} 2(e^x - 1 - x)/x^2 & \text{for } x \ne 0, \\ 1 & \text{for } x = 0, \end{cases}$$
is nonnegative and increasing over the whole real line. The choice $f(x) := (1+x)\log(1+x) - x$, for $x > -1$, with $f'(x) = \log(1+x)$ and $f''(x) = (1+x)^{-1}$, leads to the conclusion that the function
$$\psi(x) := \begin{cases} 2\left((1+x)\log(1+x) - x\right)/x^2 & \text{for } x \ne 0, \\ 1 & \text{for } x = 0, \end{cases}$$
is nonnegative, convex, and decreasing. Also $x\psi(x)$ is increasing on $\mathbb{R}^+$, and $\psi(x) \ge (1 + x/3)^{-1}$.
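Both conclusions about the second function are easy to confirm numerically; a minimal sketch (mine, not from the text; the grid and tolerances are arbitrary):

```python
import numpy as np

# Numerical check of the Bennett-inequality function:
# psi(x) = 2((1+x)log(1+x) - x)/x^2, with psi(0) = 1 by continuity.
def psi(x):
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = 2 * ((1 + x[nz]) * np.log1p(x[nz]) - x[nz]) / x[nz] ** 2
    return out

x = np.linspace(-0.9, 10.0, 2001)
print(bool(np.all(np.diff(psi(x)) <= 1e-12)))           # decreasing
print(bool(np.all(psi(x) >= 1 / (1 + x / 3) - 1e-12)))  # psi(x) >= (1 + x/3)^(-1)
```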

4. Relative interior of a convex set

Convex subsets of Euclidean spaces either have interior points, or they can be regarded as embedded in lower dimensional subspaces within which they have interior points.

<6> Theorem. Let $C$ be a convex subset of $\mathbb{R}^n$.

(i) There exists a smallest subspace $V$ for which $C \subseteq x_0 \oplus V := \{x_0 + x : x \in V\}$, for each $x_0 \in C$.

(ii) $\dim(V) = n$ if and only if $C$ has a nonempty interior.

(iii) If $\operatorname{int}(C) \ne \emptyset$, there exists a convex, nonnegative function $p$ defined on $\mathbb{R}^n$ for which $\operatorname{int}(C) = \{x : p(x) < 1\} \subseteq C \subseteq \{x : p(x) \le 1\} = \overline{C}$.
Proof. With no loss of generality, suppose $0 \in C$. Let $x_1, \ldots, x_k$ be a maximal set of linearly independent vectors from $C$, and let $V$ be the subspace spanned by those vectors. Clearly $C \subseteq V$. If $k < n$, there exists a unit vector $w$ orthogonal to $V$, and every point $x$ of $V$ is a limit of points $x + tw$ not in $V$. Thus $C$ has an empty interior.


If $k = n$, write $\bar{x}$ for $\sum_i x_i/n$. Each member of the usual orthonormal basis has a representation as a linear combination, $e_i = \sum_j a_{i,j}x_j$. Choose $\epsilon > 0$ for which $2n\epsilon\left(\sum_i a_{i,j}^2\right)^{1/2} < 1$ for every $j$. For every $y := \sum_i y_i e_i$ in $\mathbb{R}^n$ with $|y| < \epsilon$, the coefficients $\beta_j := (2n)^{-1} + \sum_i a_{i,j}y_i$ are positive, summing to a quantity $1 - \beta_0 < 1$, and $\bar{x}/2 + y = \beta_0 \cdot 0 + \sum_j \beta_j x_j \in C$. That is, $\bar{x}/2$ is an interior point of $C$.

If $\operatorname{int}(C) \ne \emptyset$, we may, with no loss of generality, suppose $0$ is an interior point. Define a map $p : \mathbb{R}^n \to \mathbb{R}^+$ by $p(z) := \inf\{t > 0 : z/t \in C\}$. It is easy to see that $p(0) = 0$, and $p(\alpha y) = \alpha p(y)$ for $\alpha \ge 0$. Convexity of $C$ implies that $p(z_1 + z_2) \le p(z_1) + p(z_2)$ for all $z_i$: if $z_i/t_i \in C$ then
$$\frac{z_1 + z_2}{t_1 + t_2} = \frac{t_1}{t_1 + t_2}\,\frac{z_1}{t_1} + \frac{t_2}{t_1 + t_2}\,\frac{z_2}{t_2} \in C.$$
In particular, $p$ is a convex function. Also $p$ satisfies a Lipschitz condition: if $y = \sum_i y_i e_i$ and $z = \sum_i z_i e_i$ then
$$p(y - z) = p\left(\sum\nolimits_i (y_i - z_i)e_i\right) \le \sum\nolimits_i \left((y_i - z_i)^+ p(e_i) + (y_i - z_i)^- p(-e_i)\right) \le |y - z|\left(\sum\nolimits_i p(e_i)^2 + p(-e_i)^2\right)^{1/2}.$$
Thus $\{p < 1\}$ is open and $\{p \le 1\}$ is closed.


Clearly p(jc) < 1 for every x in C; and if p(jc) < 1 then *o := x/t e C for some
t < 1, implying x = (1 -1)0 + '*o C. Thus {z : p(z) < 1} c C c {^ : pfe) < 1}.
Every point JC with p(x) = 1 lies on the boundary, being a limit of points x(l n~l)
from C and Cc. Assertion (iii) follows.
If C c jc0 V c Rn, with dim(V) = )t < n, we can identify V with R* and C
with a subset of R*. By part (ii) of the Theorem, C has a nonempty interior, as a
subset of JCO 0 V. That is, there exist points x of C with open neighborhoods (in Rn)
for which N n (JCO 0 V) c C. The set of all such points is called the relative interior
of C, and is denoted by rel-int(C). Part (iii) of the Theorem has an immediate
extension,
rel-int(C) c C c rel-int(C),
with a corresponding representation via a convex function p defined only on JCQ 0 V.
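The gauge $p$ can be computed by bisection from a membership oracle for $C$. A minimal sketch (mine, not from the text; the box $C$ and the iteration count are arbitrary choices):

```python
import numpy as np

# Compute the gauge p(z) = inf{t > 0 : z/t in C} by bisection, given a
# membership oracle for a convex C containing 0 in its interior.
def gauge(in_C, z, t_hi=1e6, iters=60):
    lo, hi = 0.0, t_hi                  # p(z) lies in (lo, hi]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if in_C(z / mid):               # z/mid in C  implies  p(z) <= mid
            hi = mid
        else:
            lo = mid
    return hi

box = lambda x: max(abs(x[0]) / 2.0, abs(x[1]) / 3.0) <= 1.0  # C = [-2,2] x [-3,3]
z = np.array([1.0, 1.5])
print(gauge(box, z))                     # ~0.5 = max(1/2, 1.5/3)
print(gauge(box, 2 * z) / gauge(box, z)) # ~2: positive homogeneity of p
```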

5. Separation of convex sets by linear functionals

The theorems asserting existence of separating linear functionals depend on the following simple extension result.

<7> Lemma. Let $f$ be a real-valued convex function, defined on a vector space $V$. Let $T_0$ be a linear functional defined on a vector subspace $V_0$, on which $T_0(x) \le f(x)$ for all $x \in V_0$. Let $y_1$ be a point of $V$ not in $V_0$. There exists an extension of $T_0$ to a linear functional $T_1$ on the subspace $V_1$ spanned by $V_0 \cup \{y_1\}$ for which $T_1(z) \le f(z)$ on $V_1$.


Proof. Each point $z$ in $V_1$ has a unique representation $z := x + ry_1$, for some $x \in V_0$ and some $r \in \mathbb{R}$. We need to find a value for $T_1(y_1)$ for which $f(x + ry_1) \ge T_0(x) + rT_1(y_1)$ for all $r \in \mathbb{R}$. Equivalently, we need a real number $c$ such that
$$\inf_{x_0 \in V_0,\, t>0} \frac{f(x_0 + ty_1) - T_0(x_0)}{t} \;\ge\; c \;\ge\; \sup_{x_1 \in V_0,\, s>0} \frac{T_0(x_1) - f(x_1 - sy_1)}{s},$$
for then $T_1(y_1) := c$ will give the desired extension.

For given $x_0$, $x_1$ in $V_0$ and $s, t > 0$, define $\alpha := s/(s+t)$ and $x_\alpha := \alpha x_0 + (1-\alpha)x_1$. Then, by convexity of $f$ on $V_1$ and linearity of $T_0$ on $V_0$,
$$\frac{s}{s+t}f(x_0 + ty_1) + \frac{t}{s+t}f(x_1 - sy_1) \ge f(x_\alpha) \ge T_0(x_\alpha) = \frac{s}{s+t}T_0(x_0) + \frac{t}{s+t}T_0(x_1),$$
which implies
$$\infty > \frac{f(x_0 + ty_1) - T_0(x_0)}{t} \ge \frac{T_0(x_1) - f(x_1 - sy_1)}{s} > -\infty.$$
The infimum over $x_0$ and $t > 0$ on the left-hand side must be greater than or equal to the supremum over $x_1$ and $s > 0$ on the right-hand side, and both bounds must be finite. Existence of the desired real $c$ follows. □
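The two bounds in the proof can be approximated numerically in a concrete case. In this sketch (mine, not from the text), $V = \mathbb{R}^2$, $V_0$ is the $x$-axis, $f(x,y) = \sqrt{x^2+y^2}$, $T_0 \equiv 0$ on $V_0$, and $y_1 = (0,1)$; grid search recovers the interval $[-1, 1]$ of admissible constants $c$:

```python
import numpy as np

# Approximate   inf_{x0, t>0} (f(x0 + t*y1) - T0(x0))/t   and
#               sup_{x1, s>0} (T0(x1) - f(x1 - s*y1))/s
# for f(x, y) = sqrt(x^2 + y^2), T0 = 0 on the x-axis, y1 = (0, 1).
f = lambda x, y: np.hypot(x, y)
xs = np.linspace(-10, 10, 401)
ts = np.linspace(0.01, 10, 400)
X, T = np.meshgrid(xs, ts)

upper = np.min(f(X, T) / T)     # inf over x0 and t > 0  ->  1.0
lower = np.max(-f(X, -T) / T)   # sup over x1 and s > 0  -> -1.0
print(lower, upper)             # any c in [lower, upper] works, e.g. c = 0
```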
REMARK. The vector space $V$ need not be finite dimensional. We can order extensions of $T_0$, bounded above by $f$, by defining $(T_\alpha, V_\alpha) \ge (T_\beta, V_\beta)$ to mean that $V_\beta$ is a subspace of $V_\alpha$, and $T_\alpha$ is an extension of $T_\beta$. Zorn's lemma gives a maximal element of the set of extensions $(T_\gamma, V_\gamma) \ge (T_0, V_0)$. Lemma <7> shows that $V_\gamma$ must equal the whole of $V$, otherwise there would be a further extension. That is, $T_0$ has an extension to a linear functional $T$ defined on $V$ with $T(x) \le f(x)$ for every $x$ in $V$. This result is a minor variation on the Hahn-Banach theorem from functional analysis (compare with page 62 of Dunford & Schwartz 1958).
<8> Theorem. Let $C$ be a convex subset of $\mathbb{R}^n$ and $y_0$ be a point not in rel-int$(C)$.

(i) There exists a nonzero linear functional $T$ on $\mathbb{R}^n$ for which $T(y_0) \ge \sup_{x \in \overline{C}} T(x)$.

(ii) If $y_0 \notin \overline{C}$, then we may choose $T$ so that $T(y_0) > \sup_{x \in \overline{C}} T(x)$.

Proof. With no loss of generality, suppose $0 \in C$. Let $V$ denote the subspace spanned by $C$, as in Theorem <6>. If $y_0 \notin V$, let $\ell$ be its component orthogonal to $V$. Then $y_0 \cdot \ell > 0 = x \cdot \ell$ for all $x$ in $C$.

If $y_0 \in V$, the problem reduces to construction of a suitable linear functional $T$ on $V$: we then have only to define $T(z) := 0$ for $z$ orthogonal to $V$ to complete the proof. Equivalently, we may suppose that $V = \mathbb{R}^n$. Define $T_0$ on $V_0 := \{ry_0 : r \in \mathbb{R}\}$ by $T_0(ry_0) := rp(y_0)$, for the $p$ defined in Theorem <6>. Note that $T_0(y_0) = p(y_0) \ge 1$, because $y_0 \notin \text{rel-int}(C) = \{p < 1\}$. Clearly $T_0(x) \le p(x)$ for all $x \in V_0$. Invoke Lemma <7> repeatedly to extend $T_0$ to a linear functional $T$ on $\mathbb{R}^n$, with $T(x) \le p(x)$ for all $x \in \mathbb{R}^n$. In particular,
$$T(y_0) \ge 1 \ge p(x) \ge T(x) \qquad \text{for all } x \in \overline{C} = \{p \le 1\}.$$
For (ii), note that $T(y_0) = p(y_0) > 1$ if $y_0 \notin \overline{C}$. □


<9> Corollary. Let $C_1$ and $C_2$ be disjoint convex subsets of $\mathbb{R}^n$. Then there is a nonzero linear functional $T$ for which $\inf_{x \in C_2} T(x) \ge \sup_{x \in C_1} T(x)$.


Proof. Define $C$ as the convex set $\{x_1 - x_2 : x_i \in C_i\}$. The origin does not belong to $C$. Thus there is a nonzero linear functional for which $0 = T(0) \ge T(x_1 - x_2)$ for all $x_i \in C_i$. □
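A concrete instance of the Corollary (my illustration, not from the text): for two disjoint closed Euclidean balls, the functional $T(x) = \ell \cdot x$, with $\ell$ along the line of centers, separates them.

```python
import numpy as np

# Two disjoint closed balls C1 = B(c1, r1) and C2 = B(c2, r2): with l along
# the line of centers, T(x) = l . x satisfies inf over C2 >= sup over C1.
c1, r1 = np.array([0.0, 0.0]), 1.5
c2, r2 = np.array([4.0, 1.0]), 1.0
l = (c2 - c1) / np.linalg.norm(c2 - c1)

sup_C1 = l @ c1 + r1     # sup of T over B(c1, r1), since |l| = 1
inf_C2 = l @ c2 - r2     # inf of T over B(c2, r2)
print(inf_C2 >= sup_C1)  # True whenever |c2 - c1| >= r1 + r2
```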
<10> Corollary. For each closed convex subset $F$ of $\mathbb{R}^n$ there exists a countable family of closed halfspaces $\{H_i : i \in \mathbb{N}\}$ for which $F = \cap_{i \in \mathbb{N}} H_i$.

Proof. Let $\{x_i : i \in \mathbb{N}\}$ be a countable dense subset of $F^c$. Define $r_i$ as the distance from $x_i$ to $F$, which is strictly positive for every $i$, because $F^c$ is open. The open ball $B(x_i, r_i)$, with radius $r_i$ and center $x_i$, is convex and disjoint from $F$. From the previous Corollary, there exists a unit vector $\ell_i$ and a constant $k_i$ for which $\ell_i \cdot y \ge k_i \ge \ell_i \cdot x$ for all $y \in B(x_i, r_i)$ and all $x \in F$. Define $H_i := \{x \in \mathbb{R}^n : \ell_i \cdot x \le k_i\}$.

Each $x$ in $F^c$ is the center of some open ball $B(x, 3\epsilon)$ disjoint from $F$. There is an $x_i$ with $|x - x_i| < \epsilon$. We then have $r_i \ge 2\epsilon$, because $B(x, 3\epsilon) \supseteq B(x_i, 2\epsilon)$, and hence $x - \epsilon\ell_i \in B(x_i, r_i)$. The separation inequality $\ell_i \cdot (x - \epsilon\ell_i) \ge k_i$ then implies $\ell_i \cdot x \ge k_i + \epsilon > k_i$, that is, $x \notin H_i$. □
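The countable construction can be mimicked with finitely many halfspaces. In this sketch (mine, not from the text; a finite family only gives an outer approximation), 64 halfspaces tangent to the closed unit disk stand in for the dense family of the proof:

```python
import numpy as np

# Represent (approximately) the closed unit disk as an intersection of
# halfspaces H_i = {x : l_i . x <= k_i} tangent to the disk.
angles = np.linspace(0, 2 * np.pi, 64, endpoint=False)
L = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit normals l_i
k = np.ones(len(angles))                                # thresholds k_i = 1

in_intersection = lambda x: bool(np.all(L @ x <= k))
print(in_intersection(np.array([0.5, 0.5])))  # True: a point of the disk
print(in_intersection(np.array([1.2, 0.0])))  # False: violates the halfspace with normal (1, 0)
```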
<11> Corollary. Let $f$ be a convex (real-valued) function defined on a convex subset $C$ of $\mathbb{R}^n$, such that $\operatorname{epi}(f)$ is a closed subset of $\mathbb{R}^{n+1}$. Then there exist $\{d_i : i \in \mathbb{N}\} \subseteq \mathbb{R}^n$ and $\{c_i : i \in \mathbb{N}\} \subseteq \mathbb{R}$ such that $f(x) = \sup_{i \in \mathbb{N}} (c_i + d_i \cdot x)$ for every $x$ in $C$.

Proof. From the previous Corollary, and the definition of $\operatorname{epi}(f)$, there exist $\ell_i \in \mathbb{R}^n$ and constants $\alpha_i, k_i \in \mathbb{R}$ such that
$$\infty > t \ge f(x) \quad \text{if and only if} \quad t\alpha_i \ge \ell_i \cdot x - k_i \quad \text{for all } i \in \mathbb{N}.$$
The $i$th inequality can hold for arbitrarily large $t$ only if $\alpha_i \ge 0$. Define $\psi(x) := \sup_{\alpha_i > 0} \left(\ell_i \cdot x - k_i\right)/\alpha_i$. Clearly $f(x) \ge \psi(x)$ for $x \in C$. If $s < f(x)$ for an $x$ in $C$ then there must exist an $i$ for which $\ell_i \cdot x - f(x)\alpha_i \le k_i < \ell_i \cdot x - s\alpha_i$, thereby forcing $\alpha_i > 0$ and $s < \psi(x)$. □

6. Problems

[1] Let $f$ be the convex function, taking values in $\mathbb{R} \cup \{\infty\}$, defined by
$$f(x, y) := \begin{cases} -y^{1/2} & \text{for } 0 \le y \le 1 \text{ and } x \in \mathbb{R}, \\ \infty & \text{otherwise.} \end{cases}$$
Let $T_0$ denote the linear function defined on the $x$-axis by $T_0(x, 0) := 0$ for all $x \in \mathbb{R}$. Show that $T_0$ has no extension to a linear functional on $\mathbb{R}^2$ for which $T(x, y) \le f(x, y)$ everywhere, even though $T_0 \le f$ along the $x$-axis.

[2] Suppose $X$ is a random variable for which the moment generating function, $M(t) := P\exp(tX)$, exists (and is finite) for $t$ in an open interval $J$ about the origin of the real line. Write $P_t$ for the probability measure with density $e^{tX}/M(t)$ with respect to $P$, for $t \in J$, with corresponding variance $\operatorname{var}_t(\cdot)$. Define $\Lambda(t) := \log M(t)$.

(i) Use Dominated Convergence to justify the operations needed to show that
$$\Lambda'(t) = M'(t)/M(t) = P\left(Xe^{tX}/M(t)\right) = P_t X,$$
$$\Lambda''(t) = \operatorname{var}_t(X).$$