
Graduate Texts in Mathematics 95

Editorial Board
F. W. Gehring
P. R. Halmos (Managing Editor)
C. C. Moore

"Order out of chaos"


(Courtesy of Professor A. T. Fomenko of the Moscow State University)

A. N. Shiryayev

Probability
Translated by R. P. Boas

With 54 Illustrations

Springer Science+Business Media, LLC

A. N. Shiryayev
Steklov Mathematical Institute
Vavilova 42
GSP-1 117333 Moscow
U.S.S.R.

R. P. Boas (Translator)
Department of Mathematics
Northwestern University
Evanston, IL 60201
U.S.A.

Editorial Board

P. R. Halmos
Managing Editor
Department of Mathematics
Indiana University
Bloomington, IN 47405
U.S.A.

F. W. Gehring
Department of Mathematics
University of Michigan
Ann Arbor, MI 48109
U.S.A.

C. C. Moore
Department of Mathematics
University of California at Berkeley
Berkeley, CA 94720
U.S.A.

AMS Classification: 60-01


Library of Congress Cataloging in Publication Data
Shiriaev, Al'bert Nikolaevich.
Probability.
(Graduate texts in mathematics; 95)
Translation of: Veroiatnost'.
Bibliography: p.
Includes index.
1. Probabilities. I. Title. II. Series.
QA273.S54413 1984    519    83-14813

Original Russian edition: Veroiatnost'. Moscow: Nauka, 1979.
This book is part of the Springer Series in Soviet Mathematics.

© 1984 by Springer Science+Business Media New York


Originally published by Springer-Verlag New York, Inc. in 1984
Softcover reprint of the hardcover 1st edition 1984
All rights reserved. No part of this book may be translated or reproduced in any
form without written permission from Springer Science+Business Media, LLC.

Typeset by Composition House Ltd., Salisbury, England.

9 8 7 6 5 4 3 2 1
ISBN 978-1-4899-0020-3
ISBN 978-1-4899-0018-0 (eBook)
DOI 10.1007/978-1-4899-0018-0

Preface

This textbook is based on a three-semester course of lectures given by the
author in recent years in the Mechanics-Mathematics Faculty of Moscow
State University and issued, in part, in mimeographed form under the title
Probability, Statistics, Stochastic Processes, I, II by the Moscow State
University Press.
We follow tradition by devoting the first part of the course (roughly one
semester) to the elementary theory of probability (Chapter I). This begins
with the construction of probabilistic models with finitely many outcomes
and introduces such fundamental probabilistic concepts as sample spaces,
events, probability, independence, random variables, expectation, correlation, conditional probabilities, and so on.
Many probabilistic and statistical regularities are effectively illustrated
even by the simplest random walk generated by Bernoulli trials. In this
connection we study both classical results (law of large numbers, local and
integral De Moivre and Laplace theorems) and more modern results (for
example, the arc sine law).
The first chapter concludes with a discussion of dependent random variables generated by martingales and by Markov chains.
Chapters II-IV form an expanded version of the second part of the course
(second semester). Here we present (Chapter II) Kolmogorov's generally
accepted axiomatization of probability theory and the mathematical methods
that constitute the tools of modern probability theory (σ-algebras, measures
and their representations, the Lebesgue integral, random variables and
random elements, characteristic functions, conditional expectation with
respect to a σ-algebra, Gaussian systems, and so on). Note that two measure-theoretical results, Carathéodory's theorem on the extension of measures
and the Radon-Nikodym theorem, are quoted without proof.

The third chapter is devoted to problems about weak convergence of
probability distributions and the method of characteristic functions for
proving limit theorems. We introduce the concepts of relative compactness
and tightness of families of probability distributions, and prove (for the
real line) Prohorov's theorem on the equivalence of these concepts.
The same part of the course discusses properties "with probability 1"
for sequences and sums of independent random variables (Chapter IV). We
give proofs of the "zero or one laws" of Kolmogorov and of Hewitt and
Savage, tests for the convergence of series, and conditions for the strong law
of large numbers. The law of the iterated logarithm is stated for arbitrary
sequences of independent identically distributed random variables with
finite second moments, and proved under the assumption that the variables
have Gaussian distributions.
Finally, the third part of the book (Chapters V-VIII) is devoted to random
processes with discrete parameters (random sequences). Chapters V and VI
are devoted to the theory of stationary random sequences, where "stationary" is interpreted either in the strict or the wide sense. The theory of random
sequences that are stationary in the strict sense is based on the ideas of
ergodic theory: measure preserving transformations, ergodicity, mixing, etc.
We reproduce a simple proof (by A. Garsia) of the maximal ergodic theorem;
this also lets us give a simple proof of the Birkhoff-Khinchin ergodic theorem.
The discussion of sequences of random variables that are stationary in
the wide sense begins with a proof of the spectral representation of the
covariance function. Then we introduce orthogonal stochastic measures, and
integrals with respect to these, and establish the spectral representation of
the sequences themselves. We also discuss a number of statistical problems:
estimating the covariance function and the spectral density, extrapolation,
interpolation and filtering. The chapter includes material on the Kalman-Bucy filter and its generalizations.
The seventh chapter discusses the basic results of the theory of martingales
and related ideas. This material has only rarely been included in traditional
courses in probability theory. In the last chapter, which is devoted to Markov
chains, the greatest attention is given to problems on the asymptotic behavior
of Markov chains with countably many states.
Each section ends with problems of various kinds: some of them ask for
proofs of statements made but not proved in the text, some consist of
propositions that will be used later, some are intended to give additional
information about the circle of ideas that is under discussion, and finally,
some are simple exercises.
In designing the course and preparing this text, the author has used a
variety of sources on probability theory. The Historical and Bibliographical
Notes indicate both the historical sources of the results, and supplementary
references for the material under consideration.
The numbering system and form of references is the following. Each
section has its own enumeration of theorems, lemmas and formulas (with
no indication of chapter or section). For a reference to a result from a
different section of the same chapter, we use double numbering, with the
first number indicating the number of the section (thus (2.10) means formula
(10) of §2). For references to a different chapter we use triple numbering
(thus formula (II.4.3) means formula (3) of §4 of Chapter II). Works listed
in the References at the end of the book have the form [L n], where L is a
letter and n is a numeral.
The author takes this opportunity to thank his teacher A. N. Kolmogorov,
and B. V. Gnedenko and Yu. V. Prohorov, from whom he learned probability
theory and under whose direction he had the opportunity of using it. For
discussions and advice, the author also thanks his colleagues in the Departments of Probability Theory and Mathematical Statistics at the Moscow
State University, and his colleagues in the Section on probability theory of the
Steklov Mathematical Institute of the Academy of Sciences of the U.S.S.R.
Moscow
Steklov Mathematical Institute
A. N. SHIRYAYEV

Translator's acknowledgement. I am grateful both to the author and to
my colleague C. T. Ionescu Tulcea for advice about terminology.
R. P. B.

Contents

Introduction

CHAPTER I
Elementary Probability Theory
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes
§2. Some Classical Models and Distributions
§3. Conditional Probability. Independence
§4. Random Variables and Their Properties
§5. The Bernoulli Scheme. I. The Law of Large Numbers
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson)
§7. Estimating the Probability of Success in the Bernoulli Scheme
§8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing
§10. Random Walk. II. Reflection Principle. Arcsine Law
§11. Martingales. Some Applications to the Random Walk
§12. Markov Chains. Ergodic Theorem. Strong Markov Property

CHAPTER II
Mathematical Foundations of Probability Theory
§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms
§2. Algebras and σ-algebras. Measurable Spaces
§3. Methods of Introducing Probability Measures on Measurable Spaces
§4. Random Variables. I.
§5. Random Elements
§6. Lebesgue Integral. Expectation
§7. Conditional Probabilities and Conditional Expectations with Respect to a σ-Algebra
§8. Random Variables. II.
§9. Construction of a Process with Given Finite-Dimensional Distribution
§10. Various Kinds of Convergence of Sequences of Random Variables
§11. The Hilbert Space of Random Variables with Finite Second Moment
§12. Characteristic Functions
§13. Gaussian Systems

CHAPTER III
Convergence of Probability Measures. Central Limit Theorem
§1. Weak Convergence of Probability Measures and Distributions
§2. Relative Compactness and Tightness of Families of Probability Distributions
§3. Proofs of Limit Theorems by the Method of Characteristic Functions
§4. Central Limit Theorem for Sums of Independent Random Variables
§5. Infinitely Divisible and Stable Distributions
§6. Rapidity of Convergence in the Central Limit Theorem
§7. Rapidity of Convergence in Poisson's Theorem

CHAPTER IV
Sequences and Sums of Independent Random Variables
§1. Zero-or-One Laws
§2. Convergence of Series
§3. Strong Law of Large Numbers
§4. Law of the Iterated Logarithm

CHAPTER V
Stationary (Strict Sense) Random Sequences and Ergodic Theory
§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations
§2. Ergodicity and Mixing
§3. Ergodic Theorems

CHAPTER VI
Stationary (Wide Sense) Random Sequences. $L^2$ Theory
§1. Spectral Representation of the Covariance Function
§2. Orthogonal Stochastic Measures and Stochastic Integrals
§3. Spectral Representation of Stationary (Wide Sense) Sequences
§4. Statistical Estimation of the Covariance Function and the Spectral Density
§5. Wold's Expansion
§6. Extrapolation, Interpolation and Filtering
§7. The Kalman-Bucy Filter and Its Generalizations

CHAPTER VII
Sequences of Random Variables that Form Martingales
§1. Definitions of Martingales and Related Concepts
§2. Preservation of the Martingale Property Under Time Change at a Random Time
§3. Fundamental Inequalities
§4. General Theorems on the Convergence of Submartingales and Martingales
§5. Sets of Convergence of Submartingales and Martingales
§6. Absolute Continuity and Singularity of Probability Distributions
§7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary
§8. Central Limit Theorem for Sums of Dependent Random Variables

CHAPTER VIII
Sequences of Random Variables that Form Markov Chains
§1. Definitions and Basic Properties
§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties of the Transition Probabilities $p_{ij}^{(n)}$
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties of the Probabilities $p_{ii}^{(n)}$
§4. On the Existence of Limits and of Stationary Distributions
§5. Examples

Historical and Bibliographical Notes

References

Index of Symbols

Index

Introduction

The subject matter of probability theory is the mathematical analysis of
random events, i.e. of those empirical phenomena which, under certain
circumstances, can be described by saying that:
They do not have deterministic regularity (observations of them do not
yield the same outcome) whereas at the same time;
They possess some statistical regularity (indicated by the statistical
stability of their frequency).
We illustrate with the classical example of a "fair" toss of an "unbiased"
coin. It is clearly impossible to predict with certainty the outcome of each
toss. The results of successive experiments are very irregular (now "head,"
now "tail") and we seem to have no possibility of discovering any regularity
in such experiments. However, if we carry out a large number of "independent" experiments with an "unbiased" coin we can observe a very definite
statistical regularity, namely that "head" appears with a frequency that is
"close" to 1/2.
Statistical stability of a frequency is very likely to suggest a hypothesis
about a possible quantitative estimate of the "randomness" of some event A
connected with the results of the experiments. With this starting point,
probability theory postulates that corresponding to an event A there is a
definite number P(A), called the probability of the event, whose intrinsic
property is that as the number of "independent" trials (experiments) increases the frequency of event A is approximated by P(A).
Applied to our example, this means that it is natural to assign the probability 1/2 to the event A that consists of obtaining "head" in a toss of an
"unbiased" coin.

There is no difficulty in multiplying examples in which it is very easy to
obtain numerical values intuitively for the probabilities of one or another
event. However, these examples are all of a similar nature and involve (so far)
undefined concepts such as "fair" toss, "unbiased" coin, "independence,"
etc.
Having been invented to investigate the quantitative aspects of "randomness," probability theory, like every exact science, became such a science
only at the point when the concept of a probabilistic model had been clearly
formulated and axiomatized. In this connection it is natural for us to discuss,
although only briefly, the fundamental steps in the development of probability theory.
Probability theory, as a science, originated in the middle of the seventeenth
century with Pascal (1623-1662), Fermat (1601-1665) and Huygens
(1629-1695). Although special calculations of probabilities in games of chance
had been made earlier, in the fifteenth and sixteenth centuries, by Italian
mathematicians (Cardano, Pacioli, Tartaglia, etc.), the first general methods
for solving such problems were apparently given in the famous correspondence between Pascal and Fermat, begun in 1654, and in the first book on
probability theory, De Ratiociniis in Aleae Ludo (On Calculations in Games of
Chance), published by Huygens in 1657. It was at this time that the fundamental concept of "mathematical expectation" was developed and theorems
on the addition and multiplication of probabilities were established.
The real history of probability theory begins with the work of James
Bernoulli (1654-1705), Ars Conjectandi (The Art of Guessing) published in
1713, in which he proved (quite rigorously) the first limit theorem of probability theory, the law of large numbers; and of De Moivre (1667-1754),
Miscellanea Analytica Supplementum (a rough translation might be The
Analytic Method or Analytic Miscellany, 1730), in which the central limit
theorem was stated and proved for the first time (for symmetric Bernoulli
trials).
Bernoulli was probably the first to realize the importance of considering
infinite sequences of random trials and to make a clear distinction between
the probability of an event and the frequency of its realization. De Moivre
deserves the credit for defining such concepts as independence, mathematical
expectation, and conditional probability.
In 1812 there appeared Laplace's (1749-1827) great treatise Théorie
Analytique des Probabilités (Analytic Theory of Probability) in which he
presented his own results in probability theory as well as those of his predecessors. In particular, he generalized De Moivre's theorem to the general
(unsymmetric) case of Bernoulli trials, and at the same time presented De
Moivre's results in a more complete form.
Laplace's most important contribution was the application of probabilistic methods to errors of observation. He formulated the idea of considering errors of observation as the cumulative results of adding a large number
of independent elementary errors. From this it followed that under rather
general conditions the distribution of errors of observation must be at least
approximately normal.
The work of Poisson (1781-1840) and Gauss (1777-1855) belongs to the
same epoch in the development of probability theory, when the center of the
stage was held by limit theorems.
In contemporary probability theory we think of Poisson in connection
with the distribution and the process that bear his name. Gauss is credited
with originating the theory of errors and, in particular, with creating the
fundamental method of least squares.
The next important period in the development of probability theory is
connected with the names of P. L. Chebyshev (1821-1894), A. A. Markov
(1856-1922), and A. M. Lyapunov (1857-1918), who developed effective
methods for proving limit theorems for sums of independent but arbitrarily
distributed random variables.
The number of Chebyshev's publications in probability theory is not
large-four in all-but it would be hard to overestimate their role in probability theory and in the development of the classical Russian school of that
subject.
"On the methodological side, the revolution brought about by Chebyshev
was not only his insistence for the first time on complete rigor in the proofs of
limit theorems, ... but also, and principally, that Chebyshev always tried to
obtain precise estimates for the deviations from the limiting regularities that are
available for large but finite numbers of trials, in the form of inequalities that are
valid unconditionally for any number of trials."
(A. N. KOLMOGOROV [30])

Before Chebyshev the main interest in probability theory had been in the
calculation of the probabilities of random events. He, however, was the
first to realize clearly and exploit the full strength of the concepts of random
variables and their mathematical expectations.
The leading exponent of Chebyshev's ideas was his devoted student
Markov, to whom there belongs the indisputable credit of presenting his
teacher's results with complete clarity. Among Markov's own significant
contributions to probability theory were his pioneering investigations of
limit theorems for sums of independent random variables and the creation
of a new branch of probability theory, the theory of dependent random
variables that form what we now call a Markov chain.
" ... Markov's classical course in the calculus of probability and his original
papers, which are models of precision and clarity, contributed to the greatest
extent to the transformation of probability theory into one of the most significant
branches of mathematics and to a wide extension of the ideas and methods of
Chebyshev."
(S. N. BERNSTEIN [3])

To prove the central limit theorem of probability theory (the theorem
on convergence to the normal distribution), Chebyshev and Markov used
what is known as the method of moments. With more general hypotheses
and a simpler method, the method of characteristic functions, the theorem
was obtained by Lyapunov. The subsequent development of the theory has
shown that the method of characteristic functions is a powerful analytic
tool for establishing the most diverse limit theorems.
The modern period in the development of probability theory begins with
its axiomatization. The first work in this direction was done by S. N. Bernstein (1880-1968), R. von Mises (1883-1953), and E. Borel (1871-1956).
A. N. Kolmogorov's book Foundations of the Theory of Probability appeared
in 1933. Here he presented the axiomatic theory that has become generally
accepted and is not only applicable to all the classical branches of probability
theory, but also provides a firm foundation for the development of new
branches that have arisen from questions in the sciences and involve infinite-dimensional distributions.
The treatment in the present book is based on Kolmogorov's axiomatic
approach. However, to prevent formalities and logical subtleties from obscuring the intuitive ideas, our exposition begins with the elementary theory of
probability, whose elementariness is merely that in the corresponding
probabilistic models we consider only experiments with finitely many outcomes. Thereafter we present the foundations of probability theory in their
most general form.
The 1920s and '30s saw a rapid development of one of the new branches of
probability theory, the theory of stochastic processes, which studies families
of random variables that evolve with time. We have seen the creation of
theories of Markov processes, stationary processes, martingales, and limit
theorems for stochastic processes. Information theory is a recent addition.
The present book is principally concerned with stochastic processes with
discrete parameters: random sequences. However, the material presented
in the second chapter provides a solid foundation (particularly of a logical
nature) for the study of the general theory of stochastic processes.
It was also in the 1920s and '30s that mathematical statistics became a
separate mathematical discipline. In a certain sense mathematical statistics
deals with inverses of the problems of probability: If the basic aim of probability theory is to calculate the probabilities of complicated events under a
given probabilistic model, mathematical statistics sets itself the inverse
problem: to clarify the structure of probabilistic-statistical models by
means of observations of various complicated events.
Some of the problems and methods of mathematical statistics are also
discussed in this book. However, all that is presented in detail here is probability theory and the theory of stochastic processes with discrete parameters.

CHAPTER I

Elementary Probability Theory

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes

1. Let us consider an experiment of which all possible results are included
in a finite number of outcomes $\omega_1, \ldots, \omega_N$. We do not need to know the
nature of these outcomes, only that there are a finite number $N$ of them.
We call $\omega_1, \ldots, \omega_N$ elementary events, or sample points, and the finite set

$$\Omega = \{\omega_1, \ldots, \omega_N\},$$

the space of elementary events or the sample space.


The choice of the space of elementary events is the first step in formulating
a probabilistic model for an experiment. Let us consider some examples of
sample spaces.
EXAMPLE 1. For a single toss of a coin the sample space $\Omega$ consists of two
points:

$$\Omega = \{H, T\},$$

where H = "head" and T = "tail". (We exclude possibilities like "the coin
stands on edge," "the coin disappears," etc.)

EXAMPLE 2. For n tosses of a coin the sample space is

$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = H \text{ or } T\}$$

and the total number $N(\Omega)$ of outcomes is $2^n$.
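As an illustration of Example 2, the sample space for n tosses can be listed mechanically. The following is a minimal sketch, not part of the original text; the function name `coin_sample_space` is a hypothetical choice made here.

```python
from itertools import product

def coin_sample_space(n):
    """Return all outcomes (a_1, ..., a_n) with each a_i equal to 'H' or 'T'."""
    return list(product("HT", repeat=n))

omega = coin_sample_space(3)
print(len(omega))  # 8, i.e. 2**3
```

Each outcome is a tuple such as `('H', 'T', 'H')`, so the count of outcomes can be compared directly with $2^n$ for small n.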

EXAMPLE 3. First toss a coin. If it falls "head" then toss a die (with six faces
numbered 1, 2, 3, 4, 5, 6); if it falls "tail", toss the coin again. The sample
space for this experiment is

$$\Omega = \{H1, H2, H3, H4, H5, H6, TH, TT\}.$$
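The two-stage experiment of Example 3 can be written out in the same spirit. This is only a sketch under the assumption that outcomes are encoded as short strings; the function name `coin_then_die_or_coin` is hypothetical.

```python
def coin_then_die_or_coin():
    """List the outcomes of: toss a coin; on head roll a die, on tail toss again."""
    outcomes = []
    # First toss is a head: it is followed by one of the six die faces.
    outcomes += [f"H{face}" for face in range(1, 7)]
    # First toss is a tail: it is followed by a second coin toss.
    outcomes += [f"T{second}" for second in "HT"]
    return outcomes

print(coin_then_die_or_coin())
```

The second stage depends on the result of the first, which is why this sample space is not a plain Cartesian product.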

We now consider some more complicated examples involving the selection of n balls from an urn containing M distinguishable balls.
2. EXAMPLE 4 (Sampling with replacement). This is an experiment in which
after each step the selected ball is returned again. In this case each sample of
n balls can be presented in the form $(a_1, \ldots, a_n)$, where $a_i$ is the label of the
ball selected at the ith step. It is clear that in sampling with replacement
each $a_i$ can have any of the M values 1, 2, ..., M. The description of the
sample space depends in an essential way on whether we consider samples
like, for example, (4, 1, 2, 1) and (1, 4, 2, 1) as different or the same. It is
customary to distinguish two cases: ordered samples and unordered samples.
In the first case samples containing the same elements, but arranged
differently, are considered to be different. In the second case the order of
the elements is disregarded and the two samples are considered to be the
same. To emphasize which kind of sample we are considering, we use the
notation $(a_1, \ldots, a_n)$ for ordered samples and $[a_1, \ldots, a_n]$ for unordered
samples.
Thus for ordered samples the sample space has the form

$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = 1, \ldots, M\}$$

and the number of (different) outcomes is

$$N(\Omega) = M^n. \tag{1}$$

If, however, we consider unordered samples, then

$$\Omega = \{\omega : \omega = [a_1, \ldots, a_n],\ a_i = 1, \ldots, M\}.$$

Clearly the number $N(\Omega)$ of (different) unordered samples is smaller than
the number of ordered samples. Let us show that in the present case

$$N(\Omega) = C_{M+n-1}^{n}, \tag{2}$$

where $C_k^l \equiv k!/[l!\,(k-l)!]$ is the number of combinations of k elements,
taken l at a time.
We prove this by induction. Let $N(M, n)$ be the number of outcomes of
interest. It is clear that when $k \le M$ we have

$$N(k, 1) = k = C_k^1.$$

Now suppose that $N(k, n) = C_{k+n-1}^{n}$ for $k \le M$; we show that this formula
continues to hold when n is replaced by n + 1. For the unordered samples
$[a_1, \ldots, a_{n+1}]$ that we are considering, we may suppose that the elements
are arranged in nondecreasing order: $a_1 \le a_2 \le \cdots \le a_{n+1}$. It is clear that the
number of unordered samples with $a_1 = 1$ is $N(M, n)$, the number with
$a_1 = 2$ is $N(M-1, n)$, etc. Consequently

$$\begin{aligned}
N(M, n+1) &= N(M, n) + N(M-1, n) + \cdots + N(1, n) \\
&= C_{M+n-1}^{n} + C_{M-1+n-1}^{n} + \cdots + C_{n}^{n} \\
&= (C_{M+n}^{n+1} - C_{M+n-1}^{n+1}) + (C_{M-1+n}^{n+1} - C_{M-1+n-1}^{n+1}) + \cdots + (C_{n+2}^{n+1} - C_{n+1}^{n+1}) + C_{n}^{n} \\
&= C_{M+n}^{n+1};
\end{aligned}$$

here we have used the easily verified property

$$C_k^{l-1} + C_k^{l} = C_{k+1}^{l}$$

of the binomial coefficients, together with $C_n^n = C_{n+1}^{n+1} = 1$.
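Formula (2) and the recursion used in the induction can be checked numerically for small M and n. The sketch below is an illustration only; the function names are hypothetical, and the standard-library calls `itertools.combinations_with_replacement` and `math.comb` are used in place of a hand-written enumeration.

```python
from itertools import combinations_with_replacement
from math import comb

def count_unordered_with_replacement(M, n):
    """Count unordered samples [a_1, ..., a_n] with labels 1..M by listing them."""
    return sum(1 for _ in combinations_with_replacement(range(1, M + 1), n))

def count_by_recursion(M, n):
    """N(M, n) from N(k, 1) = k and N(M, n+1) = N(M, n) + N(M-1, n) + ... + N(1, n)."""
    if n == 1:
        return M
    return sum(count_by_recursion(k, n - 1) for k in range(1, M + 1))

# Compare direct enumeration, the recursion, and the closed form (2).
for M in range(1, 6):
    for n in range(1, 5):
        closed_form = comb(M + n - 1, n)
        assert count_unordered_with_replacement(M, n) == closed_form
        assert count_by_recursion(M, n) == closed_form
```

The recursion mirrors the proof: fixing the smallest element $a_1 = j$ leaves n further elements to be chosen from the M - j + 1 labels j, ..., M.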
EXAMPLE 5 (Sampling without replacement). Suppose that $n \le M$ and that
the selected balls are not returned. In this case we again consider two possibilities, namely ordered and unordered samples.
For ordered samples without replacement the sample space is

$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_k \ne a_l,\ k \ne l,\ a_i = 1, \ldots, M\},$$

and the number of elements of this set (called permutations) is $M(M-1) \cdots (M-n+1)$. We denote this by $(M)_n$ or $A_M^n$ and call it "the number of
permutations of M things, n at a time."
For unordered samples (called combinations) the sample space

$$\Omega = \{\omega : \omega = [a_1, \ldots, a_n],\ a_k \ne a_l,\ k \ne l,\ a_i = 1, \ldots, M\}$$

consists of

$$N(\Omega) = C_M^n \tag{3}$$

elements. In fact, from each unordered sample $[a_1, \ldots, a_n]$ consisting of
distinct elements we can obtain n! ordered samples. Consequently

$$N(\Omega) \cdot n! = (M)_n$$

and therefore

$$N(\Omega) = \frac{(M)_n}{n!} = C_M^n.$$
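The four sampling schemes of Examples 4 and 5 can be enumerated side by side and their sizes compared with formulas (1), (2), (3) and with $(M)_n$. This is a minimal sketch for the case M = 3, n = 2; the function name `sample_spaces` and the dictionary keys are hypothetical, and unordered samples are represented by tuples in nondecreasing order.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

def sample_spaces(M, n):
    """Enumerate the four sample spaces for drawing n balls from labels 1..M."""
    labels = range(1, M + 1)
    return {
        ("with replacement", "ordered"): list(product(labels, repeat=n)),
        ("with replacement", "unordered"): list(combinations_with_replacement(labels, n)),
        ("without replacement", "ordered"): list(permutations(labels, n)),
        ("without replacement", "unordered"): list(combinations(labels, n)),
    }

M, n = 3, 2
spaces = sample_spaces(M, n)
counts = {key: len(value) for key, value in spaces.items()}

# The enumerated sizes agree with M**n, C(M+n-1, n), (M)_n and C(M, n).
assert counts[("with replacement", "ordered")] == M ** n
assert counts[("with replacement", "unordered")] == comb(M + n - 1, n)
assert counts[("without replacement", "ordered")] == perm(M, n)
assert counts[("without replacement", "unordered")] == comb(M, n)
print(counts)
```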

The results on the numbers of samples of n from an urn with M balls are
presented in Table 1.

Table 1

                       Ordered      Unordered
With replacement       $M^n$        $C_{M+n-1}^n$
Without replacement    $(M)_n$      $C_M^n$

For the case M = 3 and n = 2, the corresponding sample spaces are
displayed in Table 2.

EXAMPLE 6 (Distribution of objects in cells). We consider the structure of
the sample space in the problem of placing n objects (balls, etc.) in M cells
(boxes, etc.). For example, such problems arise in statistical physics in studying the distribution of n particles (which might be protons, electrons, ...)
among M states (which might be energy levels).
Let the cells be numbered 1, 2, ..., M, and suppose first that the objects
are distinguishable (numbered 1, 2, ..., n). Then a distribution of the n
objects among the M cells is completely described by an ordered set
$(a_1, \ldots, a_n)$, where $a_i$ is the index of the cell containing object i. However,
if the objects are indistinguishable their distribution among the M cells
is completely determined by the unordered set $[a_1, \ldots, a_n]$, where $a_i$ is the
index of the cell into which an object is put at the ith step.
Comparing this situation with Examples 4 and 5, we have the following
correspondences:

(ordered samples) ↔ (distinguishable objects),
(unordered samples) ↔ (indistinguishable objects),

by which we mean that to an instance of an ordered (unordered) sample of
n balls from an urn containing M balls there corresponds (one and only one)
instance of distributing n distinguishable (indistinguishable) objects among
M cells.

Table 2 (sample spaces for M = 3, n = 2)

With replacement, ordered:
(1, 1) (1, 2) (1, 3)
(2, 1) (2, 2) (2, 3)
(3, 1) (3, 2) (3, 3)

With replacement, unordered:
[1, 1] [2, 2] [3, 3]
[1, 2] [1, 3]
[2, 3]

Without replacement, ordered:
(1, 2) (1, 3)
(2, 1) (2, 3)
(3, 1) (3, 2)

Without replacement, unordered:
[1, 2] [1, 3]
[2, 3]
In a similar sense we have the following correspondences:

(sampling with replacement) ↔ (a cell may receive any number of objects),
(sampling without replacement) ↔ (a cell may receive at most one object).

These correspondences generate others of the same kind:

(an unordered sample in sampling without replacement) ↔ (indistinguishable objects in the problem of distribution among cells when each cell may receive at most one object),

etc.; so that we can use Examples 4 and 5 to describe the sample space for
the problem of distributing distinguishable or indistinguishable objects
among cells either with exclusion (a cell may receive at most one object) or
without exclusion (a cell may receive any number of objects).
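The correspondence between unordered samples and distributions of indistinguishable objects can be made concrete by mapping each sample to its occupancy vector, that is, the number of objects in each cell. The sketch below assumes this encoding, which is not used in the text itself; the function name `occupancy` is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

def occupancy(sample, M):
    """Number of objects placed in each of the cells 1..M by the given sample."""
    counts = Counter(sample)
    return tuple(counts.get(cell, 0) for cell in range(1, M + 1))

M, n = 3, 2
unordered = list(combinations_with_replacement(range(1, M + 1), n))
vectors = [occupancy(sample, M) for sample in unordered]

# Different unordered samples give different occupancy vectors (one-to-one).
assert len(set(vectors)) == len(unordered)

# "With exclusion" keeps only the distributions with at most one object per cell.
with_exclusion = [v for v in vectors if max(v) <= 1]
print(vectors)
print(with_exclusion)
```

For M = 3 and n = 2 this yields six occupancy vectors without exclusion and three with exclusion, matching the unordered entries of Table 2.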
Table 3 displays the distributions of two objects among three cells. For
distinguishable objects, we denote them by W (white) and B (black). For
indistinguishable objects, the presence of an object in a cell is indicated
by a +.

Table 3

Without exclusion, distinguishable objects:
|WB| | |   | |WB| |   | | |WB|
|W|B| |    |B|W| |    |W| |B|
|B| |W|    | |W|B|    | |B|W|

Without exclusion, indistinguishable objects:
|++| | |   | |++| |   | | |++|
|+|+| |    |+| |+|    | |+|+|

With exclusion, distinguishable objects:
|W|B| |    |B|W| |    |W| |B|
|B| |W|    | |W|B|    | |B|W|

With exclusion, indistinguishable objects:
|+|+| |    |+| |+|    | |+|+|

Table 4

N(Ω) in the problem of placing n objects in M cells

                     Distinguishable objects           Indistinguishable objects
Without exclusion    $M^n$                             $C_{M+n-1}^n$                  With replacement
                     (Maxwell-Boltzmann statistics)    (Bose-Einstein statistics)
With exclusion       $(M)_n$                           $C_M^n$                        Without replacement
                                                       (Fermi-Dirac statistics)
                     Ordered samples                   Unordered samples

N(Ω) in the problem of choosing n balls from an urn containing M balls

The duality that we have observed between the two problems gives us an
obvious way of finding the number of outcomes in the problem of placing
objects in cells. The results, which include the results in Table 1, are given in
Table 4.
In statistical physics one says that distinguishable (or indistinguishable,
respectively) particles that are not subject to the Pauli exclusion principle†
obey Maxwell-Boltzmann statistics (or, respectively, Bose-Einstein statistics). If, however, the particles are indistinguishable and are subject to the
exclusion principle, they obey Fermi-Dirac statistics (see Table 4). For
example, electrons, protons and neutrons obey Fermi-Dirac statistics.
Photons and pions obey Bose-Einstein statistics. Distinguishable particles
that are subject to the exclusion principle do not occur in physics.

3. In addition to the concept of sample space we now need the fundamental
concept of event.
Experimenters are ordinarily interested, not in what particular outcome
occurs as the result of a trial, but in whether the outcome belongs to some
subset of the set of all possible outcomes. We shall describe as events all
subsets $A \subseteq \Omega$ for which, under the conditions of the experiment, it is possible
to say either "the outcome $\omega \in A$" or "the outcome $\omega \notin A$."

† At most one particle in each cell. (Translator)


For example, let a coin be tossed three times. The sample space $\Omega$ consists
of the eight points
$$\Omega = \{HHH, HHT, \ldots, TTT\}$$
and if we are able to observe (determine, measure, etc.) the results of all three
tosses, we say that the set
$$A = \{HHH, HHT, HTH, THH\}$$
is the event consisting of the appearance of at least two heads. If, however,
we can determine only the result of the first toss, this set $A$ cannot be considered
to be an event, since there is no way to give either a positive or negative
answer to the question of whether a specific outcome $\omega$ belongs to $A$.
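The distinction between an arbitrary subset and an event can be checked mechanically. The following is a minimal sketch in Python; the helper name `observable_from_first_toss` is our own and only illustrates the idea that a set is an event when the available observation always decides membership.

```python
from itertools import product

# Sample space for three tosses: all strings of length 3 over {H, T}.
omega = {"".join(t) for t in product("HT", repeat=3)}

# The set "at least two heads", as a subset of the sample space.
at_least_two_heads = {w for w in omega if w.count("H") >= 2}

def observable_from_first_toss(event, space):
    """True if membership in `event` is decided by the first toss alone."""
    for first in "HT":
        answers = {w in event for w in space if w[0] == first}
        if len(answers) > 1:   # same first toss, but different answers
            return False
    return True

first_is_head = {w for w in omega if w[0] == "H"}

print(sorted(at_least_two_heads))
print(observable_from_first_toss(at_least_two_heads, omega))  # False
print(observable_from_first_toss(first_is_head, omega))       # True
```

When only the first toss is observed, the set "at least two heads" fails the test, whereas "the first toss is a head" passes it.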
Starting from a given collection of sets that are events, we can form new
events by means of statements containing the logical connectives "or,"
"and," and "not," which correspond in the language of set theory to the
operations "union," "intersection," and "complement."
If $A$ and $B$ are sets, their union, denoted by $A \cup B$, is the set of points that
belong either to $A$ or to $B$:
$$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}.$$
In the language of probability theory, $A \cup B$ is the event consisting of the
realization either of $A$ or of $B$.
The intersection of $A$ and $B$, denoted by $A \cap B$, or by $AB$, is the set of
points that belong to both $A$ and $B$:
$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}.$$
The event $A \cap B$ consists of the simultaneous realization of both $A$ and $B$.


For example, if $A = \{HH, HT, TH\}$ and $B = \{TT, TH, HT\}$ then
$$A \cup B = \{HH, HT, TH, TT\} \ (= \Omega), \qquad A \cap B = \{TH, HT\}.$$

If $A$ is a subset of $\Omega$, its complement, denoted by $\overline{A}$, is the set of points of
$\Omega$ that do not belong to $A$.
If $B \setminus A$ denotes the difference of $B$ and $A$ (i.e. the set of points that belong
to $B$ but not to $A$) then $\overline{A} = \Omega \setminus A$. In the language of probability, $\overline{A}$ is
the event consisting of the nonrealization of $A$. For example, if $A =
\{HH, HT, TH\}$ then $\overline{A} = \{TT\}$, the event in which two successive tails occur.
The sets $A$ and $\overline{A}$ have no points in common and consequently $A \cap \overline{A}$ is
empty. We denote the empty set by $\varnothing$. In probability theory, $\varnothing$ is called an
impossible event. The set $\Omega$ is naturally called the certain event.
When $A$ and $B$ are disjoint ($AB = \varnothing$), the union $A \cup B$ is called the
sum of $A$ and $B$ and written $A + B$.
If we consider a collection $\mathscr{A}_0$ of sets $A \subseteq \Omega$ we may use the set-theoretic
operators $\cup$, $\cap$ and $\setminus$ to form a new collection of sets from the elements of
$\mathscr{A}_0$; these sets are again events. If we adjoin the certain and impossible
events $\Omega$ and $\varnothing$ we obtain a collection $\mathscr{A}$ of sets which is an algebra, i.e. a
collection of subsets of $\Omega$ for which

(1) $\Omega \in \mathscr{A}$,
(2) if $A \in \mathscr{A}$, $B \in \mathscr{A}$, the sets $A \cup B$, $A \cap B$, $A \setminus B$ also belong to $\mathscr{A}$.
It follows from what we have said that it will be advisable to consider
collections of events that form algebras. In the future we shall consider only
such collections.
Here are some examples of algebras of events:

(a) $\{\Omega, \varnothing\}$, the collection consisting of $\Omega$ and the empty set (we call this the
trivial algebra);
(b) $\{A, \overline{A}, \Omega, \varnothing\}$, the collection generated by $A$;
(c) $\mathscr{A} = \{A : A \subseteq \Omega\}$, the collection consisting of all the subsets of $\Omega$
(including the empty set $\varnothing$).
It is easy to check that all these algebras of events can be obtained from the
following principle.
We say that a collection
$$\mathscr{D} = \{D_1, \ldots, D_n\}$$
of sets is a decomposition of $\Omega$, and call the $D_i$ the atoms of the decomposition,
if the $D_i$ are not empty, are pairwise disjoint, and their sum is $\Omega$:
$$D_1 + \cdots + D_n = \Omega.$$

For example, if $\Omega$ consists of three points, $\Omega = \{1, 2, 3\}$, there are five
different decompositions:

$\mathscr{D}_1 = \{D_1\}$ with $D_1 = \{1, 2, 3\}$;
$\mathscr{D}_2 = \{D_1, D_2\}$ with $D_1 = \{1, 2\}$, $D_2 = \{3\}$;
$\mathscr{D}_3 = \{D_1, D_2\}$ with $D_1 = \{1, 3\}$, $D_2 = \{2\}$;
$\mathscr{D}_4 = \{D_1, D_2\}$ with $D_1 = \{2, 3\}$, $D_2 = \{1\}$;
$\mathscr{D}_5 = \{D_1, D_2, D_3\}$ with $D_1 = \{1\}$, $D_2 = \{2\}$, $D_3 = \{3\}$.

(For the general number of decompositions of a finite set see Problem 2.)
If we consider all unions of the sets in $\mathscr{D}$, the resulting collection of sets,
together with the empty set, forms an algebra, called the algebra induced by
$\mathscr{D}$, and denoted by $\alpha(\mathscr{D})$. Thus the elements of $\alpha(\mathscr{D})$ consist of the empty set
together with the sums of sets which are atoms of $\mathscr{D}$.
Thus if $\mathscr{D}$ is a decomposition, there is associated with it a specific algebra
$\mathscr{B} = \alpha(\mathscr{D})$.
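Decompositions and the algebras they induce are easy to enumerate for a small space. The sketch below is only illustrative; the function names `decompositions` and `induced_algebra` are our own. It lists every decomposition of a finite set and forms the induced algebra as the collection of all unions of atoms.

```python
from itertools import combinations

def decompositions(points):
    """Yield every decomposition of the list `points` as a list of atoms."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for partial in decompositions(rest):
        # Put `first` into one of the existing atoms ...
        for i in range(len(partial)):
            yield partial[:i] + [partial[i] | {first}] + partial[i + 1:]
        # ... or let it form a new atom of its own.
        yield partial + [frozenset({first})]

def induced_algebra(atoms):
    """All unions of atoms, together with the empty set."""
    sets = set()
    for r in range(len(atoms) + 1):
        for chosen in combinations(atoms, r):
            sets.add(frozenset().union(*chosen))
    return sets

for d in decompositions([1, 2, 3]):
    print([sorted(atom) for atom in d], len(induced_algebra(d)))
```

For the three-point space this prints five decompositions, and a decomposition with $k$ atoms induces an algebra of $2^k$ sets.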
The converse is also true. Let $\mathscr{B}$ be an algebra of subsets of a finite space
$\Omega$. Then there is a unique decomposition $\mathscr{D}$ whose atoms are the elements of
$\mathscr{B}$, with $\mathscr{B} = \alpha(\mathscr{D})$. In fact, let $D \in \mathscr{B}$ be nonempty and let $D$ have the property that for
every $B \in \mathscr{B}$ the set $D \cap B$ either coincides with $D$ or is empty. Then this
collection of sets $D$ forms a decomposition $\mathscr{D}$ with the required property
$\alpha(\mathscr{D}) = \mathscr{B}$. In Example (a), $\mathscr{D}$ is the trivial decomposition consisting of the
single set $D_1 = \Omega$; in (b), $\mathscr{D} = \{A, \overline{A}\}$. The most fine-grained decomposition
$\mathscr{D}$, which consists of the singletons $\{\omega_i\}$, $\omega_i \in \Omega$, induces the algebra in
Example (c), i.e. the algebra of all subsets of $\Omega$.
Let $\mathscr{D}_1$ and $\mathscr{D}_2$ be two decompositions. We say that $\mathscr{D}_2$ is finer than $\mathscr{D}_1$,
and write $\mathscr{D}_1 \preccurlyeq \mathscr{D}_2$, if $\alpha(\mathscr{D}_1) \subseteq \alpha(\mathscr{D}_2)$.
Let us show that if $\Omega$ consists, as we assumed above, of a finite number of
points $\omega_1, \ldots, \omega_N$, then the number $N(\mathscr{A})$ of sets in the collection $\mathscr{A}$ of all
subsets of $\Omega$ is equal to $2^N$. In fact, every nonempty set $A \in \mathscr{A}$ can be represented as
$A = \{\omega_{i_1}, \ldots, \omega_{i_k}\}$, where $\omega_{i_j} \in \Omega$, $1 \le k \le N$. With this set we associate the
sequence of zeros and ones
$$(0, \ldots, 0, 1, 0, \ldots, 0, 1, \ldots),$$
where there are ones in the positions $i_1, \ldots, i_k$ and zeros elsewhere. Then
for a given $k$ the number of different sets $A$ of the form $\{\omega_{i_1}, \ldots, \omega_{i_k}\}$ is the
same as the number of ways in which $k$ ones ($k$ indistinguishable objects)
can be placed in $N$ positions ($N$ cells). According to Table 4 (see the lower
right-hand square) we see that this number is $C_N^k$. Hence (counting the empty
set) we find that
$$N(\mathscr{A}) = 1 + C_N^1 + \cdots + C_N^N = (1 + 1)^N = 2^N.$$

4. We have now taken the first two steps in defining a probabilistic model
of an experiment with a finite number of outcomes: we have selected a sample
space and a collection $\mathscr{A}$ of subsets, which form an algebra and are called
events. We now take the next step, to assign to each sample point (outcome)
$\omega_i \in \Omega$, $i = 1, \ldots, N$, a weight. This is denoted by $p(\omega_i)$ and called the
probability of the outcome $\omega_i$; we assume that it has the following properties:
(a) $0 \le p(\omega_i) \le 1$ (nonnegativity),
(b) $p(\omega_1) + \cdots + p(\omega_N) = 1$ (normalization).
Starting from the given probabilities $p(\omega_i)$ of the outcomes $\omega_i$, we define
the probability $P(A)$ of any event $A \in \mathscr{A}$ by
$$P(A) = \sum_{\{i:\, \omega_i \in A\}} p(\omega_i). \tag{4}$$
Finally, we say that a triple
$$(\Omega, \mathscr{A}, P),$$
where $\Omega = \{\omega_1, \ldots, \omega_N\}$, $\mathscr{A}$ is an algebra of subsets of $\Omega$ and
$$P = \{P(A);\ A \in \mathscr{A}\},$$
defines (or assigns) a probabilistic model, or a probability space, of experiments
with a (finite) space $\Omega$ of outcomes and algebra $\mathscr{A}$ of events.
The following properties of probability follow from (4):
$$P(\varnothing) = 0, \tag{5}$$
$$P(\Omega) = 1, \tag{6}$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{7}$$
In particular, if $A \cap B = \varnothing$, then
$$P(A + B) = P(A) + P(B) \tag{8}$$
and
$$P(\overline{A}) = 1 - P(A). \tag{9}$$
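Definition (4) and properties (5)-(9) can be checked on any small example. In the following sketch the weights of a loaded die are hypothetical values chosen only for illustration, and `P` is our own helper implementing the sum in (4) with exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical weights p(w) for a loaded die; any nonnegative
# numbers summing to 1 would serve equally well.
p = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 6),
     4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 4)}
omega = set(p)

def P(event):
    """Formula (4): sum the weights of the sample points in the event."""
    return sum((p[w] for w in event), Fraction(0))

A = {2, 4, 6}   # an even number of spots
B = {4, 5, 6}   # at least four spots

print(P(set()), P(omega))                  # properties (5) and (6)
print(P(A | B), P(A) + P(B) - P(A & B))    # property (7)
print(P(omega - A), 1 - P(A))              # property (9)
```

Because the arithmetic is exact, the two sides of (7) and of (9) agree as fractions rather than only approximately.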

5. In constructing a probabilistic model for a specific situation, the construction
of the sample space $\Omega$ and the algebra $\mathscr{A}$ of events is ordinarily
not difficult. In elementary probability theory one usually takes the algebra
$\mathscr{A}$ to be the algebra of all subsets of $\Omega$. Any difficulty that may arise is in
assigning probabilities to the sample points. In principle, the solution to this
problem lies outside the domain of probability theory, and we shall not
consider it in detail. We consider that our fundamental problem is not the
question of how to assign probabilities, but how to calculate the probabilities
of complicated events (elements of $\mathscr{A}$) from the probabilities of the
sample points.
It is clear from a mathematical point of view that for finite sample spaces
we can obtain all conceivable (finite) probability spaces by assigning nonnegative
numbers $p_1, \ldots, p_N$, satisfying the condition $p_1 + \cdots + p_N = 1$, to
the outcomes $\omega_1, \ldots, \omega_N$.
The validity of the assignments of the numbers $p_1, \ldots, p_N$ can, in specific
cases, be checked to a certain extent by using the law of large numbers
(which will be discussed later on). It states that in a long series of "independent"
experiments, carried out under identical conditions, the frequencies
with which the elementary events appear are "close" to their probabilities.
In connection with the difficulty of assigning probabilities to outcomes,
we note that there are many actual situations in which for reasons of symmetry
it seems reasonable to consider all conceivable outcomes as equally
probable. In such cases, if the sample space consists of points $\omega_1, \ldots, \omega_N$,
with $N < \infty$, we put
$$p(\omega_1) = \cdots = p(\omega_N) = 1/N,$$
and consequently
$$P(A) = N(A)/N \tag{10}$$
for every event $A \in \mathscr{A}$, where $N(A)$ is the number of sample points in $A$.
This is called the classical method of assigning probabilities. It is clear that
in this case the calculation of $P(A)$ reduces to calculating the number of
outcomes belonging to $A$. This is usually done by combinatorial methods,
so that combinatorics, applied to finite sets, plays a significant role in the
calculus of probabilities.
EXAMPLE 7 (Coincidence problem). Let an urn contain $M$ balls numbered
$1, 2, \ldots, M$. We draw an ordered sample of size $n$ with replacement. It is
clear that then
$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = 1, \ldots, M\}$$
and $N(\Omega) = M^n$. Using the classical assignment of probabilities, we consider
the $M^n$ outcomes equally probable and ask for the probability of the event
$$A = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i \ne a_j,\ i \ne j\},$$
i.e., the event in which there is no repetition. Clearly $N(A) = M(M - 1) \cdots
(M - n + 1)$, and therefore
$$P(A) = \frac{(M)_n}{M^n} = \left(1 - \frac{1}{M}\right)\left(1 - \frac{2}{M}\right) \cdots \left(1 - \frac{n - 1}{M}\right). \tag{11}$$
This problem has the following striking interpretation. Suppose that
there are $n$ students in a class. Let us suppose that each student's birthday
is on one of 365 days and that all days are equally probable. The question
is, what is the probability $P_n$ that there are at least two students in the class
whose birthdays coincide? If we interpret selection of birthdays as selection
of balls from an urn containing 365 balls, then by (11)
$$P_n = 1 - \frac{(365)_n}{365^n}.$$
The following table lists the values of $P_n$ for some values of $n$:

$n$      4      16     22     23     40     64
$P_n$    0.016  0.284  0.476  0.507  0.891  0.997

It is interesting to note that (unexpectedly!) the size of class in which there
is probability $\frac{1}{2}$ of finding at least two students with the same birthday is not
very large: only 23.
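The values in the table follow directly from (11). A short sketch, in which the function name `coincidence_probability` is our own choice:

```python
def coincidence_probability(n, M=365):
    """P_n = 1 - (M)_n / M^n, computed as a running product as in (11)."""
    no_repeat = 1.0
    for k in range(n):
        no_repeat *= (M - k) / M
    return 1.0 - no_repeat

for n in (4, 16, 22, 23, 40, 64):
    print(n, round(coincidence_probability(n), 3))

# Smallest class size for which a coincidence is at least as likely as not.
smallest = next(n for n in range(1, 366) if coincidence_probability(n) >= 0.5)
print(smallest)
```

Multiplying the factors one at a time avoids forming the very large numbers $(365)_n$ and $365^n$ separately.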
EXAMPLE 8 (Prizes in a lottery). Consider a lottery that is run in the following
way. There are $M$ tickets numbered $1, 2, \ldots, M$, of which $n$, numbered
$1, \ldots, n$, win prizes ($M \ge 2n$). You buy $n$ tickets, and ask for the probability
($P$, say) of winning at least one prize.
Since the order in which the tickets are drawn plays no role in the presence
or absence of winners in your purchase, we may suppose that the sample space
has the form
$$\Omega = \{\omega : \omega = [a_1, \ldots, a_n],\ a_k \ne a_l,\ k \ne l,\ a_i = 1, \ldots, M\}.$$
By Table 1, $N(\Omega) = C_M^n$. Now let
$$A_0 = \{\omega : \omega = [a_1, \ldots, a_n],\ a_k \ne a_l,\ k \ne l,\ a_i = n + 1, \ldots, M\}$$
be the event that there is no winner in the set of tickets you bought. Again
by Table 1, $N(A_0) = C_{M-n}^n$. Therefore
$$P(A_0) = \frac{C_{M-n}^n}{C_M^n} = \frac{(M - n)_n}{(M)_n} = \left(1 - \frac{n}{M}\right)\left(1 - \frac{n}{M - 1}\right) \cdots \left(1 - \frac{n}{M - n + 1}\right)$$
and consequently
$$P = 1 - P(A_0) = 1 - \left(1 - \frac{n}{M}\right)\left(1 - \frac{n}{M - 1}\right) \cdots \left(1 - \frac{n}{M - n + 1}\right).$$
If $M = n^2$ and $n \to \infty$, then $P(A_0) \to e^{-1}$ and
$$P \to 1 - e^{-1} \approx 0.632.$$
The convergence is quite fast: for $n = 10$ the probability is already $P = 0.670$.
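The rate of convergence can be inspected numerically. The sketch below evaluates $P = 1 - C_{M-n}^n / C_M^n$ with $M = n^2$; the function name `win_probability` is our own.

```python
from math import comb, exp

def win_probability(n, M):
    """P = 1 - C(M - n, n) / C(M, n): at least one prize among n tickets."""
    return 1 - comb(M - n, n) / comb(M, n)

limit = 1 - exp(-1)
for n in (5, 10, 100):
    print(n, round(win_probability(n, n * n), 4), round(limit, 4))
```

Python's exact integer arithmetic keeps the two binomial coefficients exact, so the only rounding occurs in the final division.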


6. PROBLEMS

1. Establish the following properties of the operators $\cap$ and $\cup$:
$$A \cup B = B \cup A, \qquad AB = BA \qquad \text{(commutativity)},$$
$$A \cup (B \cup C) = (A \cup B) \cup C, \qquad A(BC) = (AB)C \qquad \text{(associativity)},$$
$$A(B \cup C) = AB \cup AC, \qquad A \cup (BC) = (A \cup B)(A \cup C) \qquad \text{(distributivity)},$$
$$A \cup A = A, \qquad AA = A \qquad \text{(idempotency)}.$$
Show also that
$$\overline{A \cup B} = \overline{A} \cap \overline{B}, \qquad \overline{AB} = \overline{A} \cup \overline{B}.$$

2. Let $\Omega$ contain $N$ elements. Show that the number $d(N)$ of different decompositions of
$\Omega$ is given by the formula
$$d(N) = e^{-1} \sum_{k=0}^{\infty} \frac{k^N}{k!}. \tag{12}$$
(Hint: Show that
$$d(N) = \sum_{k=0}^{N-1} C_{N-1}^k\, d(k), \qquad \text{where } d(0) = 1,$$
and then verify that the series in (12) satisfies the same recurrence relation.)

3. For any finite collection of sets $A_1, \ldots, A_n$,
$$P(A_1 \cup \cdots \cup A_n) \le P(A_1) + \cdots + P(A_n).$$

4. Let $A$ and $B$ be events. Show that $A\overline{B} \cup B\overline{A}$ is the event in which exactly one of $A$
and $B$ occurs. Moreover,
$$P(A\overline{B} \cup B\overline{A}) = P(A) + P(B) - 2P(AB).$$

5. Let $A_1, \ldots, A_n$ be events, and define $S_0, S_1, \ldots, S_n$ as follows: $S_0 = 1$,
$$S_r = \sum_{J_r} P(A_{k_1} \cap \cdots \cap A_{k_r}), \qquad 1 \le r \le n,$$
where the sum is over the unordered subsets $J_r = [k_1, \ldots, k_r]$ of $\{1, \ldots, n\}$.
Let $B_m$ be the event in which exactly $m$ of the events $A_1, \ldots, A_n$ occur simultaneously.
Show that
$$P(B_m) = \sum_{r=m}^{n} (-1)^{r-m} C_r^m S_r.$$
In particular, for $m = 0$
$$P(B_0) = 1 - S_1 + S_2 - \cdots + (-1)^n S_n.$$
Show also that the probability that at least $m$ of the events $A_1, \ldots, A_n$ occur
simultaneously is
$$P(B_m) + \cdots + P(B_n) = \sum_{r=m}^{n} (-1)^{r-m} C_{r-1}^{m-1} S_r.$$
In particular, the probability that at least one of the events $A_1, \ldots, A_n$ occurs is
$$P(B_1) + \cdots + P(B_n) = S_1 - S_2 + \cdots + (-1)^{n-1} S_n.$$

2. Some Classical Models and Distributions


1. Binomial distribution. Let a coin be tossed $n$ times, and let the results be recorded
as an ordered set $(a_1, \ldots, a_n)$, where $a_i = 1$ for a head ("success") and $a_i = 0$
for a tail ("failure"). The sample space is
$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = 0, 1\}.$$
To each sample point $\omega = (a_1, \ldots, a_n)$ we assign the probability
$$p(\omega) = p^{\sum a_i} q^{\,n - \sum a_i},$$
where the nonnegative numbers $p$ and $q$ satisfy $p + q = 1$. In the first place,
we verify that this assignment of the weights $p(\omega)$ is consistent. It is enough
to show that $\sum_{\omega \in \Omega} p(\omega) = 1$.
We consider all outcomes $\omega = (a_1, \ldots, a_n)$ for which $\sum_i a_i = k$, where
$k = 0, 1, \ldots, n$. According to Table 4 (distribution of $k$ indistinguishable
ones in $n$ places) the number of these outcomes is $C_n^k$. Therefore
$$\sum_{\omega \in \Omega} p(\omega) = \sum_{k=0}^{n} C_n^k p^k q^{n-k} = (p + q)^n = 1.$$
Thus the space $\Omega$ together with the collection $\mathscr{A}$ of all its subsets and the
probabilities $P(A) = \sum_{\omega \in A} p(\omega)$, $A \in \mathscr{A}$, defines a probabilistic model. It is
natural to call this the probabilistic model for $n$ tosses of a coin.
In the case $n = 1$, when the sample space contains just the two points
$\omega = 1$ ("success") and $\omega = 0$ ("failure"), it is natural to call $p(1) = p$ the
probability of success. We shall see later that this model for $n$ tosses of a
coin can be thought of as the result of $n$ "independent" experiments with
probability $p$ of success at each trial.
Let us consider the events
$$A_k = \{\omega : \omega = (a_1, \ldots, a_n),\ a_1 + \cdots + a_n = k\}, \qquad k = 0, 1, \ldots, n,$$
consisting of exactly $k$ successes. It follows from what we said above that
$$P(A_k) = C_n^k p^k q^{n-k}, \tag{1}$$
and $\sum_{k=0}^{n} P(A_k) = 1$.
The set of probabilities $(P(A_0), \ldots, P(A_n))$ is called the binomial distribution
(the number of successes in a sample of size $n$). This distribution plays an
extremely important role in probability theory since it arises in the most
diverse probabilistic models. We write $P_n(k) = P(A_k)$, $k = 0, 1, \ldots, n$.
Figure 1 shows the binomial distribution in the case $p = \frac{1}{2}$ (symmetric coin)
for $n = 5, 10, 20$.
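Formula (1) can be compared with a direct summation of the weights $p(\omega)$ over the sample space. The following sketch uses exact fractions; the function names and the value $p = 1/3$ are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_pmf(n, p):
    """The binomial distribution (P_n(0), ..., P_n(n)) from formula (1)."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

def pmf_by_enumeration(n, p):
    """Sum the weights p(w) over all 2^n sample points, grouped by k."""
    q = 1 - p
    totals = [0] * (n + 1)
    for omega in product((0, 1), repeat=n):
        k = sum(omega)
        totals[k] += p**k * q**(n - k)
    return totals

p = Fraction(1, 3)
print(binomial_pmf(4, p))
print(binomial_pmf(4, p) == pmf_by_enumeration(4, p))
print(sum(binomial_pmf(20, Fraction(1, 2))))
```

The enumeration grows as $2^n$ and is useful only as a check for small $n$; formula (1) is what one uses in practice.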
We now present a different model (in essence, equivalent to the preceding
one) which describes the random walk of a "particle."
Let the particle start at the origin, and after unit time let it take a unit
step upward or downward (Figure 2).
Consequently after $n$ steps the particle can have moved at most $n$ units
up or $n$ units down. It is clear that each path $\omega$ of the particle is completely
specified by a set $(a_1, \ldots, a_n)$, where $a_i = +1$ if the particle moves up at the
$i$th step, and $a_i = -1$ if it moves down. Let us assign to each path $\omega$ the
weight $p(\omega) = p^{\nu(\omega)} q^{\,n - \nu(\omega)}$, where $\nu(\omega)$ is the number of $+1$'s in the sequence
$\omega = (a_1, \ldots, a_n)$, i.e. $\nu(\omega) = [(a_1 + \cdots + a_n) + n]/2$, and the nonnegative
numbers $p$ and $q$ satisfy $p + q = 1$.
Since $\sum_{\omega \in \Omega} p(\omega) = 1$, the set of probabilities $p(\omega)$ together with the space
$\Omega$ of paths $\omega = (a_1, \ldots, a_n)$ and its subsets define an acceptable probabilistic
model of the motion of the particle for $n$ steps.
Let us ask the following question: What is the probability of the event $A_k$
that after $n$ steps the particle is at a point with ordinate $k$? This condition
is satisfied by those paths $\omega$ for which $\nu(\omega) - (n - \nu(\omega)) = k$, i.e.
$$\nu(\omega) = \frac{n + k}{2}.$$

Figure 1. Graph of the binomial probabilities $P_n(k)$ for $n = 5, 10, 20$.

The number of such paths (see Table 4) is $C_n^{(n+k)/2}$, and therefore
$$P(A_k) = C_n^{(n+k)/2}\, p^{(n+k)/2} q^{(n-k)/2}.$$
Consequently the binomial distribution $(P(A_{-n}), \ldots, P(A_0), \ldots, P(A_n))$
can be said to describe the probability distribution for the position of the
particle after $n$ steps.
Note that in the symmetric case ($p = q = \frac{1}{2}$), when the probabilities of
the individual paths are equal to $2^{-n}$,
$$P(A_k) = C_n^{(n+k)/2} \cdot 2^{-n}.$$
Let us investigate the asymptotic behavior of these probabilities for large $n$.
If the number of steps is $2n$, it follows from the properties of the binomial
coefficients that the largest of the probabilities $P(A_k)$, $|k| \le 2n$, is
$$P(A_0) = C_{2n}^n \cdot 2^{-2n}.$$

Figure 2


Figure 3. Beginning of the binomial distribution.

From Stirling's formula (see formula (6) in Section 4)
$$n! \sim \sqrt{2\pi n}\; e^{-n} n^n.\dagger$$
Consequently
$$C_{2n}^n = \frac{(2n)!}{(n!)^2} \sim 2^{2n} \cdot \frac{1}{\sqrt{\pi n}}$$
and therefore for large $n$
$$P(A_0) \sim \frac{1}{\sqrt{\pi n}}.$$
Figure 3 represents the beginning of the binomial distribution for $2n$
steps of a random walk (in contrast to Figure 2, the time axis is now directed
upward).
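The quality of the approximation $P(A_0) \sim 1/\sqrt{\pi n}$ is easy to examine numerically. A minimal sketch for the symmetric walk, with the function name `central_probability` being our own:

```python
from math import comb, pi, sqrt

def central_probability(n):
    """P(A_0) = C(2n, n) * 2^(-2n) for a symmetric walk of 2n steps."""
    return comb(2 * n, n) / 4**n

for n in (10, 100, 1000):
    exact = central_probability(n)
    approx = 1 / sqrt(pi * n)
    print(n, exact, approx, exact / approx)
```

The ratio in the last column approaches 1 as $n$ grows, which is exactly what the relation $\sim$ asserts.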
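2. Multinomial distribution. Generalizing the preceding model, we now
suppose that the sample space is
$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = b_1, \ldots, b_r\},$$
where $b_1, \ldots, b_r$ are given numbers. Let $\nu_i(\omega)$ be the number of elements of
$\omega = (a_1, \ldots, a_n)$ that are equal to $b_i$, $i = 1, \ldots, r$, and define the probability
of $\omega$ by
$$p(\omega) = p_1^{\nu_1(\omega)} \cdots p_r^{\nu_r(\omega)},$$
where $p_i \ge 0$ and $p_1 + \cdots + p_r = 1$. Note that
$$\sum_{\omega \in \Omega} p(\omega) = \sum_{\substack{n_1 \ge 0, \ldots, n_r \ge 0 \\ n_1 + \cdots + n_r = n}} C_n(n_1, \ldots, n_r)\, p_1^{n_1} \cdots p_r^{n_r},$$
where $C_n(n_1, \ldots, n_r)$ is the number of (ordered) sequences $(a_1, \ldots, a_n)$ in
which $b_1$ occurs $n_1$ times, ..., $b_r$ occurs $n_r$ times. Since $n_1$ elements $b_1$ can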

† The notation $f(n) \sim g(n)$ means that $f(n)/g(n) \to 1$ as $n \to \infty$.

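be distributed into $n$ positions in $C_n^{n_1}$ ways; $n_2$ elements $b_2$ into $n - n_1$
positions in $C_{n-n_1}^{n_2}$ ways, etc., we have
$$C_n(n_1, \ldots, n_r) = C_n^{n_1} \cdot C_{n-n_1}^{n_2} \cdots C_{n-(n_1+\cdots+n_{r-1})}^{n_r}
= \frac{n!}{n_1!\,(n - n_1)!} \cdot \frac{(n - n_1)!}{n_2!\,(n - n_1 - n_2)!} \cdots 1
= \frac{n!}{n_1! \cdots n_r!}.$$
Therefore
$$\sum_{\omega \in \Omega} p(\omega) = \sum_{\substack{n_1 \ge 0, \ldots, n_r \ge 0 \\ n_1 + \cdots + n_r = n}} \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r} = (p_1 + \cdots + p_r)^n = 1,$$
and consequently we have defined an acceptable method of assigning
probabilities.
Let
$$A_{n_1, \ldots, n_r} = \{\omega : \nu_1(\omega) = n_1, \ldots, \nu_r(\omega) = n_r\}.$$
Then
$$P(A_{n_1, \ldots, n_r}) = C_n(n_1, \ldots, n_r)\, p_1^{n_1} \cdots p_r^{n_r}. \tag{2}$$
The set of probabilities
$$\{P(A_{n_1, \ldots, n_r})\}$$
is called the multinomial (or polynomial) distribution.
We emphasize that both this distribution and its special case, the binomial
distribution, originate from problems about sampling with replacement.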
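The counting argument for $C_n(n_1, \ldots, n_r)$ and the normalization of the multinomial probabilities can both be verified on a small case. In this sketch the function names and the probabilities $(1/2, 1/3, 1/6)$ are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import factorial, prod

def multinomial_coefficient(counts):
    """C_n(n_1, ..., n_r) = n! / (n_1! ... n_r!)."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

def multinomial_probability(counts, probs):
    """P(A_{n_1,...,n_r}) as in formula (2)."""
    return multinomial_coefficient(counts) * prod(p**c for p, c in zip(probs, counts))

def count_by_enumeration(counts):
    """Count the sequences in which symbol i occurs counts[i] times."""
    n, r = sum(counts), len(counts)
    return sum(1 for seq in product(range(r), repeat=n)
               if all(seq.count(i) == counts[i] for i in range(r)))

probs = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))   # illustrative values
n = 4
total = sum(multinomial_probability((n1, n2, n - n1 - n2), probs)
            for n1 in range(n + 1) for n2 in range(n + 1 - n1))
print(multinomial_coefficient((2, 1, 1)), count_by_enumeration((2, 1, 1)), total)
```

The first two printed numbers agree, confirming the formula for $C_n(n_1, \ldots, n_r)$ in this case, and the total of the probabilities is exactly 1.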
3. The multidimensional hypergeometric distribution occurs in problems that
involve sampling without replacement.
Consider, for example, an urn containing $M$ balls numbered $1, 2, \ldots, M$,
where $M_1$ balls have the color $b_1$, ..., $M_r$ balls have the color $b_r$, and
$M_1 + \cdots + M_r = M$. Suppose that we draw a sample of size $n < M$ without
replacement. The sample space is
$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_k \ne a_l,\ k \ne l,\ a_i = 1, \ldots, M\}$$
and $N(\Omega) = (M)_n$. Let us suppose that the sample points are equiprobable,
and find the probability of the event $B_{n_1, \ldots, n_r}$ in which $n_1$ balls have
color $b_1$, ..., $n_r$ balls have color $b_r$, where $n_1 + \cdots + n_r = n$. It is easy to
show that
$$N(B_{n_1, \ldots, n_r}) = C_n(n_1, \ldots, n_r)\,(M_1)_{n_1} \cdots (M_r)_{n_r},$$
and therefore
$$P(B_{n_1, \ldots, n_r}) = \frac{N(B_{n_1, \ldots, n_r})}{N(\Omega)} = \frac{C_{M_1}^{n_1} \cdots C_{M_r}^{n_r}}{C_M^n}. \tag{3}$$
The set of probabilities $\{P(B_{n_1, \ldots, n_r})\}$ is called the multidimensional
hypergeometric distribution. When $r = 2$ it is simply called the hypergeometric
distribution because its "generating function" is a hypergeometric function.
The structure of the multidimensional hypergeometric distribution is
rather complicated. For example, the probability
$$P(B_{n_1, n_2}) = \frac{C_{M_1}^{n_1} C_{M_2}^{n_2}}{C_M^n}, \qquad n_1 + n_2 = n, \quad M_1 + M_2 = M, \tag{4}$$
contains nine factorials. However, it is easily established that if $M \to \infty$
and $M_1 \to \infty$ in such a way that $M_1/M \to p$ (and therefore $M_2/M \to 1 - p$)
then
$$P(B_{n_1, n_2}) \to C_{n_1+n_2}^{n_1}\, p^{n_1} (1 - p)^{n_2}. \tag{5}$$
In other words, under the present hypotheses the hypergeometric distribution
is approximated by the binomial; this is intuitively clear since
when $M$ and $M_1$ are large (but finite), sampling without replacement ought
to give almost the same result as sampling with replacement.

EXAMPLE. Let us use (4) to find the probability of picking six "lucky" numbers
in a lottery of the following kind (this is an abstract formulation of the
"sportloto," which is well known in Russia):
There are 49 balls numbered from 1 to 49; six of them are lucky (colored
red, say, whereas the rest are white). We draw a sample of six balls, without
replacement. The question is, What is the probability that all six of these
balls are lucky? Taking $M = 49$, $M_1 = 6$, $n_1 = 6$, $n_2 = 0$, we see that the
event of interest, namely
$$B_{6,0} = \{6 \text{ balls, all lucky}\}$$
has, by (4), probability
$$P(B_{6,0}) = \frac{1}{C_{49}^6} \approx 7.2 \times 10^{-8}.$$
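Both the lottery probability and the binomial approximation (5) can be checked with a few lines of code. In this sketch the function name `hypergeometric` and the urn sizes used for the approximation are our own illustrative choices.

```python
from math import comb

def hypergeometric(n1, n2, M1, M2):
    """Formula (4): P(B_{n1,n2}) = C(M1,n1) C(M2,n2) / C(M1+M2, n1+n2)."""
    return comb(M1, n1) * comb(M2, n2) / comb(M1 + M2, n1 + n2)

# The lottery example: 6 lucky balls among 49, all six drawn balls lucky.
print(hypergeometric(6, 0, 6, 43))

# Binomial approximation (5) with p = 0.3 for a large urn.
p, n1, n2 = 0.3, 2, 3
binomial = comb(n1 + n2, n1) * p**n1 * (1 - p)**n2
print(hypergeometric(n1, n2, 300_000, 700_000), binomial)
```

With a million balls in the urn the hypergeometric and binomial values agree to several decimal places, as (5) suggests.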

4. The numbers $n!$ increase extremely rapidly with $n$. For example,
$$10! = 3{,}628{,}800, \qquad 15! = 1{,}307{,}674{,}368{,}000,$$
and $100!$ has 158 digits. Hence from either the theoretical or the computational
point of view, it is important to know Stirling's formula,
$$n! = \sqrt{2\pi n} \left(\frac{n}{e}\right)^n \exp\left(\frac{\theta_n}{12n}\right), \qquad 0 < \theta_n < 1, \tag{6}$$
whose proof can be found in most textbooks on mathematical analysis
(see also [69]).
5. PROBLEMS

1. Prove formula (5).

2. Show that for the multinomial distribution $\{P(A_{n_1, \ldots, n_r})\}$ the maximum probability
is attained at a point $(k_1, \ldots, k_r)$ that satisfies the inequalities $np_i - 1 <
k_i \le (n + r - 1)p_i$, $i = 1, \ldots, r$.

3. One-dimensional Ising model. Consider $n$ particles located at the points $1, 2, \ldots, n$.
Suppose that each particle is of one of two types, and that there are $n_1$ particles of the
first type and $n_2$ of the second ($n_1 + n_2 = n$). We suppose that all $n!$ arrangements of
the particles are equally probable.
Construct a corresponding probabilistic model and find the probability of the
event $A_n(m_{11}, m_{12}, m_{21}, m_{22}) = \{\nu_{11} = m_{11}, \ldots, \nu_{22} = m_{22}\}$, where $\nu_{ij}$ is the number
of particles of type $i$ following particles of type $j$ ($i, j = 1, 2$).

4. Prove the following identities by probabilistic reasoning:
$$\sum_{k=0}^{n} (C_n^k)^2 = C_{2n}^n,$$
$$\sum_{k=0}^{n} (-1)^{n-k} C_m^k = C_{m-1}^n, \qquad m \ge n + 1,$$
$$\sum_{k=0}^{m} k(k - 1)\, C_m^k = m(m - 1)\, 2^{m-2}, \qquad m \ge 2.$$

3. Conditional Probability. Independence


1. The concept of probabilities of events lets us answer questions of the following
kind: If there are $M$ balls in an urn, $M_1$ white and $M_2$ black, what is
the probability $P(A)$ of the event $A$ that a selected ball is white? With the
classical approach, $P(A) = M_1/M$.
The concept of conditional probability, which will be introduced below,
lets us answer questions of the following kind: What is the probability that
the second ball is white (event $B$) under the condition that the first ball was
also white (event $A$)? (We are thinking of sampling without replacement.)
It is natural to reason as follows: if the first ball is white, then at the
second step we have an urn containing $M - 1$ balls, of which $M_1 - 1$ are
white and $M_2$ black; hence it seems reasonable to suppose that the (conditional)
probability in question is $(M_1 - 1)/(M - 1)$.


We now give a definition of conditional probability that is consistent
with our intuitive ideas.
Let $(\Omega, \mathscr{A}, P)$ be a (finite) probability space and $A$ an event (i.e. $A \in \mathscr{A}$).

Definition 1. The conditional probability of event $B$ assuming event $A$ with
$P(A) > 0$ (denoted by $P(B \mid A)$) is
$$\frac{P(AB)}{P(A)}. \tag{1}$$
In the classical approach we have $P(A) = N(A)/N(\Omega)$, $P(AB) =
N(AB)/N(\Omega)$, and therefore
$$P(B \mid A) = \frac{N(AB)}{N(A)}. \tag{2}$$

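From Definition 1 we immediately get the following properties of conditional
probability:
$$P(A \mid A) = 1,$$
$$P(\varnothing \mid A) = 0,$$
$$P(B \mid A) = 1, \qquad B \supseteq A,$$
$$P(B_1 + B_2 \mid A) = P(B_1 \mid A) + P(B_2 \mid A).$$
It follows from these properties that for a given set $A$ the conditional
probability $P(\cdot \mid A)$ has the same properties on the space $(\Omega \cap A, \mathscr{A} \cap A)$,
where $\mathscr{A} \cap A = \{B \cap A : B \in \mathscr{A}\}$, that the original probability $P(\cdot)$ has on
$(\Omega, \mathscr{A})$.
Note that
$$P(B \mid A) + P(\overline{B} \mid A) = 1;$$
however in general
$$P(B \mid A) + P(B \mid \overline{A}) \ne 1,$$
$$P(B \mid A) + P(\overline{B} \mid \overline{A}) \ne 1.$$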

EXAMPLE 1. Consider a family with two children. We ask for the probability
that both children are boys, assuming

(a) that the older child is a boy;
(b) that at least one of the children is a boy.

The sample space is
$$\Omega = \{BB, BG, GB, GG\},$$
where $BG$ means that the older child is a boy and the younger is a girl, etc.
Let us suppose that all sample points are equally probable:
$$P(BB) = P(BG) = P(GB) = P(GG) = \tfrac{1}{4}.$$
Let $A$ be the event that the older child is a boy, and $B$, that the younger
child is a boy. Then $A \cup B$ is the event that at least one child is a boy, and
$AB$ is the event that both children are boys. In question (a) we want the
conditional probability $P(AB \mid A)$, and in (b), the conditional probability
$P(AB \mid A \cup B)$.
It is easy to see that
$$P(AB \mid A) = \frac{P(AB)}{P(A)} = \frac{1/4}{1/2} = \frac{1}{2},$$
$$P(AB \mid A \cup B) = \frac{P(AB)}{P(A \cup B)} = \frac{1/4}{3/4} = \frac{1}{3}.$$
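The two answers can be reproduced by applying Definition 1 directly to the four-point sample space. A minimal sketch, in which `P` and `conditional` are our own helper names:

```python
from fractions import Fraction

omega = ["BB", "BG", "GB", "GG"]              # older child listed first
p = {w: Fraction(1, 4) for w in omega}

def P(event):
    return sum((p[w] for w in event), Fraction(0))

def conditional(B, A):
    """P(B | A) = P(AB) / P(A); meaningful only when P(A) > 0."""
    return P(B & A) / P(A)

A = {w for w in omega if w[0] == "B"}         # the older child is a boy
B = {w for w in omega if w[1] == "B"}         # the younger child is a boy

print(conditional(A & B, A))                  # question (a): 1/2
print(conditional(A & B, A | B))              # question (b): 1/3
```

The difference between the two answers comes entirely from the conditioning event: $A$ contains two sample points, whereas $A \cup B$ contains three.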

2. The simple but important formula (3), below, is called the formula for
total probability. It provides the basic means for calculating the probabilities
of complicated events by using conditional probabilities.
Consider a decomposition $\mathscr{D} = \{A_1, \ldots, A_n\}$ with $P(A_i) > 0$, $i = 1, \ldots, n$
(such a decomposition is often called a complete set of disjoint events). It
is clear that
$$B = BA_1 + \cdots + BA_n$$
and therefore
$$P(B) = \sum_{i=1}^{n} P(BA_i).$$
But
$$P(BA_i) = P(B \mid A_i) P(A_i).$$
Hence we have the formula for total probability:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i). \tag{3}$$
In particular, if $0 < P(A) < 1$, then
$$P(B) = P(B \mid A) P(A) + P(B \mid \overline{A}) P(\overline{A}). \tag{4}$$

EXAMPLE 2. An urn contains $M$ balls, $m$ of which are "lucky." We ask for the
probability that the second ball drawn is lucky (assuming that the result of
the first draw is unknown, that a sample of size 2 is drawn without replacement,
and that all outcomes are equally probable). Let $A$ be the event that
the first ball is lucky, $B$ the event that the second is lucky. Then
$$P(B \mid A) = \frac{P(BA)}{P(A)} = \frac{\dfrac{m(m - 1)}{M(M - 1)}}{\dfrac{m}{M}} = \frac{m - 1}{M - 1},$$
$$P(B \mid \overline{A}) = \frac{P(B\overline{A})}{P(\overline{A})} = \frac{\dfrac{m(M - m)}{M(M - 1)}}{\dfrac{M - m}{M}} = \frac{m}{M - 1}$$
and
$$P(B) = P(B \mid A) P(A) + P(B \mid \overline{A}) P(\overline{A})
= \frac{m - 1}{M - 1} \cdot \frac{m}{M} + \frac{m}{M - 1} \cdot \frac{M - m}{M} = \frac{m}{M}.$$
It is interesting to observe that $P(A)$ is precisely $m/M$. Hence, when the
nature of the first ball is unknown, it does not affect the probability that the
second ball is lucky.
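The result $P(B) = m/M$ can be confirmed both from the formula for total probability and by counting ordered pairs of draws. A sketch with our own function names, using $M = 10$ and $m = 3$ as illustrative values:

```python
from fractions import Fraction
from itertools import permutations

def second_draw_lucky(M, m):
    """P(B) from the formula for total probability (4)."""
    P_A = Fraction(m, M)
    P_B_given_A = Fraction(m - 1, M - 1)
    P_B_given_not_A = Fraction(m, M - 1)
    return P_B_given_A * P_A + P_B_given_not_A * (1 - P_A)

def second_draw_lucky_by_counting(M, m):
    """Classical probability: balls 0, ..., m-1 are lucky; count ordered pairs."""
    pairs = list(permutations(range(M), 2))
    favourable = sum(1 for _, second in pairs if second < m)
    return Fraction(favourable, len(pairs))

print(second_draw_lucky(10, 3), second_draw_lucky_by_counting(10, 3))
```

Both computations give $3/10$, the same as the probability that the first ball is lucky.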
By the definition of conditional probability (with $P(A) > 0$),
$$P(AB) = P(B \mid A) P(A). \tag{5}$$
This formula, the multiplication formula for probabilities, can be generalized
(by induction) as follows: If $A_1, \ldots, A_{n-1}$ are events with $P(A_1 \cdots A_{n-1}) > 0$,
then
$$P(A_1 \cdots A_n) = P(A_1) P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cdots A_{n-1}) \tag{6}$$
(here $A_1 \cdots A_n = A_1 \cap A_2 \cap \cdots \cap A_n$).

3. Suppose that $A$ and $B$ are events with $P(A) > 0$ and $P(B) > 0$. Then
along with (5) we have the parallel formula
$$P(AB) = P(A \mid B) P(B). \tag{7}$$
From (5) and (7) we obtain Bayes's formula
$$P(A \mid B) = \frac{P(A) P(B \mid A)}{P(B)}. \tag{8}$$
If the events $A_1, \ldots, A_n$ form a decomposition of $\Omega$, (3) and (8) imply
Bayes's theorem:
$$P(A_i \mid B) = \frac{P(A_i) P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j) P(B \mid A_j)}. \tag{9}$$
In statistical applications, $A_1, \ldots, A_n$ ($A_1 + \cdots + A_n = \Omega$) are often
called hypotheses, and $P(A_i)$ is called the a priori† probability of $A_i$. The
conditional probability $P(A_i \mid B)$ is considered as the a posteriori probability
of $A_i$ after the occurrence of event $B$.
EXAMPLE 3. Let an urn contain two coins: $A_1$, a fair coin with probability
$\frac{1}{2}$ of falling H; and $A_2$, a biased coin with probability $\frac{1}{3}$ of falling H. A coin is
drawn at random and tossed. Suppose that it falls heads. We ask for the
probability that the fair coin was selected.
Let us construct the corresponding probabilistic model. Here it is natural
to take the sample space to be the set $\Omega = \{A_1H, A_1T, A_2H, A_2T\}$, which
describes all possible outcomes of a selection and a toss ($A_1H$ means that
coin $A_1$ was selected and fell heads, etc.). The probabilities $p(\omega)$ of the various
outcomes have to be assigned so that, according to the statement of the
problem,
$$P(A_1) = P(A_2) = \tfrac{1}{2}$$
and
$$P(H \mid A_1) = \tfrac{1}{2}, \qquad P(H \mid A_2) = \tfrac{1}{3}.$$
With these assignments, the probabilities of the sample points are uniquely
determined:
$$P(A_1H) = \tfrac{1}{4}, \qquad P(A_1T) = \tfrac{1}{4}, \qquad P(A_2H) = \tfrac{1}{6}, \qquad P(A_2T) = \tfrac{1}{3}.$$
Then by Bayes's formula the probability in question is
$$P(A_1 \mid H) = \frac{P(A_1) P(H \mid A_1)}{P(A_1) P(H \mid A_1) + P(A_2) P(H \mid A_2)} = \frac{3}{5},$$
and therefore
$$P(A_2 \mid H) = \frac{2}{5}.$$
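Bayes's theorem (9) translates into a short computation. The sketch below assumes that the biased coin falls heads with probability $1/3$; the function name `posterior` and the dictionary keys are our own.

```python
from fractions import Fraction

# A priori probabilities of the hypotheses and the probability of heads
# under each; the value 1/3 for the biased coin is an assumption here.
prior = {"A1": Fraction(1, 2), "A2": Fraction(1, 2)}
heads_given = {"A1": Fraction(1, 2), "A2": Fraction(1, 3)}

def posterior(prior, likelihood):
    """Bayes's theorem (9): a posteriori probabilities of the hypotheses."""
    total = sum(prior[h] * likelihood[h] for h in prior)   # P(B) by (3)
    return {h: prior[h] * likelihood[h] / total for h in prior}

post = posterior(prior, heads_given)
print(post["A1"], post["A2"])   # 3/5 2/5
```

The denominator is the total probability of heads, so the a posteriori probabilities necessarily add up to 1.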

4. In a certain sense, the concept of independence, which we are now going to
introduce, plays a central role in probability theory: it is precisely this concept
that distinguishes probability theory from the general theory of measure
spaces.

† A priori: before the experiment; a posteriori: after the experiment.


If $A$ and $B$ are two events, it is natural to say that $B$ is independent of $A$
if knowing that $A$ has occurred has no effect on the probability of $B$. In other
words, "$B$ is independent of $A$" if
$$P(B \mid A) = P(B) \tag{10}$$
(we are supposing that $P(A) > 0$).
Since
$$P(B \mid A) = \frac{P(AB)}{P(A)},$$
it follows from (10) that
$$P(AB) = P(A) P(B). \tag{11}$$
In exactly the same way, if $P(B) > 0$ it is natural to say that "$A$ is independent
of $B$" if
$$P(A \mid B) = P(A).$$
Hence we again obtain (11), which is symmetric in $A$ and $B$ and still makes
sense when the probabilities of these events are zero.
After these preliminaries, we introduce the following definition.

Definition 2. Events $A$ and $B$ are called independent or statistically independent
(with respect to the probability $P$) if
$$P(AB) = P(A) P(B).$$

In probability theory it is often convenient to consider not only independence of events (or sets) but also independence of collections of events (or
sets).
Accordingly, we introduce the following definition.

Definition 3. Two algebras $\mathscr{A}_1$ and $\mathscr{A}_2$ of events (or sets) are called independent
or statistically independent (with respect to the probability $P$) if all pairs
of sets $A_1$ and $A_2$, belonging respectively to $\mathscr{A}_1$ and $\mathscr{A}_2$, are independent.
For example, let us consider the two algebras
$$\mathscr{A}_1 = \{A_1, \overline{A}_1, \varnothing, \Omega\} \quad \text{and} \quad \mathscr{A}_2 = \{A_2, \overline{A}_2, \varnothing, \Omega\},$$
where $A_1$ and $A_2$ are subsets of $\Omega$. It is easy to verify that $\mathscr{A}_1$ and $\mathscr{A}_2$ are
independent if and only if $A_1$ and $A_2$ are independent. In fact, the independence
of $\mathscr{A}_1$ and $\mathscr{A}_2$ means the independence of the 16 pairs of events $A_1$ and $A_2$,
$A_1$ and $\overline{A}_2$, ..., $\Omega$ and $\Omega$. Consequently $A_1$ and $A_2$ are independent. Conversely,
if $A_1$ and $A_2$ are independent, we have to show that the other 15
pairs of events are independent. Let us verify, for example, the independence
of $A_1$ and $\overline{A}_2$. We have
$$P(A_1 \overline{A}_2) = P(A_1) - P(A_1 A_2) = P(A_1) - P(A_1) P(A_2) = P(A_1)(1 - P(A_2)) = P(A_1) P(\overline{A}_2).$$
The independence of the other pairs is verified similarly.
5. The concept of independence of two sets or two algebras of sets can be
extended to any finite number of sets or algebras of sets.
Thus we say that the sets A 1, ... , An are collectively independent or
statistically independent (with respect to the probability P) if fork = 1, ... , n
and 1 ~ i 1 < i 2 < < ik ~ n

= P(A;,) P(A;J

P(A;, A;)

(12)

The algebras d 1 , ... , dn of sets are called independent or statistically


independent (with respect to the probability P) if all sets A 1 , .. , An belonging
respectively to d 1 , , dn are independent.
Note that pairwise independence of events does not imply their independence. In fact if, for example, $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$ and all outcomes are equiprobable, it is easily verified that the events

$$A = \{\omega_1, \omega_2\}, \qquad B = \{\omega_1, \omega_3\}, \qquad C = \{\omega_1, \omega_4\}$$

are pairwise independent, whereas

$$P(ABC) = \tfrac{1}{4} \ne \left(\tfrac{1}{2}\right)^3 = P(A)P(B)P(C).$$

Also note that if

$$P(ABC) = P(A)P(B)P(C)$$

for events A, B and C, it by no means follows that these events are pairwise independent. In fact, let $\Omega$ consist of the 36 ordered pairs $(i, j)$, where $i, j = 1, 2, \ldots, 6$ and all the pairs are equiprobable. Then if $A = \{(i, j): j = 1, 2 \text{ or } 5\}$, $B = \{(i, j): j = 4, 5 \text{ or } 6\}$, $C = \{(i, j): i + j = 9\}$ we have

$$P(AB) = \tfrac{1}{6} \ne \tfrac{1}{4} = P(A)P(B),$$
$$P(AC) = \tfrac{1}{36} \ne \tfrac{1}{18} = P(A)P(C),$$
$$P(BC) = \tfrac{1}{12} \ne \tfrac{1}{18} = P(B)P(C),$$

but also

$$P(ABC) = \tfrac{1}{36} = P(A)P(B)P(C).$$
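Both counterexamples can be checked by brute-force enumeration. The following is a minimal sketch in Python, assuming equiprobable outcomes; the helper names `prob`, `pairwise_independent` and `product_rule_holds` are illustrative, and the three sets used for the four-point space (with the outcomes coded as 1, ..., 4) are the standard choice assumed here.

```python
from fractions import Fraction
from itertools import combinations, product

def prob(event, omega):
    """P(event) when all outcomes of the finite set omega are equiprobable."""
    return Fraction(len(event & omega), len(omega))

def pairwise_independent(events, omega):
    """True if P(XY) = P(X)P(Y) for every pair of distinct events."""
    return all(prob(x & y, omega) == prob(x, omega) * prob(y, omega)
               for x, y in combinations(events, 2))

def product_rule_holds(events, omega):
    """True if the probability of the full intersection equals the product."""
    inter, rhs = omega, Fraction(1)
    for e in events:
        inter = inter & e
        rhs *= prob(e, omega)
    return prob(inter, omega) == rhs

# First example: four equiprobable outcomes (assumed sets, see lead-in).
omega4 = frozenset({1, 2, 3, 4})
ex1 = [frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 4})]

# Second example: 36 equiprobable ordered pairs (i, j).
omega36 = frozenset(product(range(1, 7), repeat=2))
A = frozenset(w for w in omega36 if w[1] in (1, 2, 5))
B = frozenset(w for w in omega36 if w[1] in (4, 5, 6))
C = frozenset(w for w in omega36 if w[0] + w[1] == 9)
ex2 = [A, B, C]

print(pairwise_independent(ex1, omega4), product_rule_holds(ex1, omega4))
print(pairwise_independent(ex2, omega36), product_rule_holds(ex2, omega36))
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are tested without rounding error.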

6. Let us consider in more detail, from the point of view of independence, the classical model $(\Omega, \mathscr{A}, P)$ that was introduced in §2 and used as a basis for the binomial distribution.

In this model

$$\Omega = \{\omega: \omega = (a_1, \ldots, a_n),\ a_i = 0, 1\}, \qquad \mathscr{A} = \{A: A \subseteq \Omega\}$$

and

$$p(\omega) = p^{\sum a_i} q^{n - \sum a_i}. \tag{13}$$

Consider an event $A \subseteq \Omega$. We say that this event depends on a trial at time k if it is determined by the value $a_k$ alone. Examples of such events are

$$A_k = \{\omega: a_k = 1\}, \qquad \bar{A}_k = \{\omega: a_k = 0\}.$$

Let us consider the sequence of algebras $\mathscr{A}_1, \mathscr{A}_2, \ldots, \mathscr{A}_n$, where $\mathscr{A}_k = \{A_k, \bar{A}_k, \varnothing, \Omega\}$, and show that under (13) these algebras are independent.
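It is clear that

$$P(A_k) = \sum_{\{\omega:\, a_k = 1\}} p(\omega) = \sum_{\{\omega:\, a_k = 1\}} p^{\sum a_i} q^{n - \sum a_i}$$
$$= p \sum_{(a_1, \ldots, a_{k-1}, a_{k+1}, \ldots, a_n)} p^{a_1 + \cdots + a_{k-1} + a_{k+1} + \cdots + a_n}\, q^{(n-1) - (a_1 + \cdots + a_{k-1} + a_{k+1} + \cdots + a_n)}$$
$$= p \sum_{l=0}^{n-1} C_{n-1}^{l}\, p^{l} q^{(n-1)-l} = p,$$

and a similar calculation shows that $P(\bar{A}_k) = q$ and that, for $k \ne l$,

$$P(A_k A_l) = p^2, \qquad P(A_k \bar{A}_l) = pq, \qquad P(\bar{A}_k \bar{A}_l) = q^2.$$

It is easy to deduce from this that $\mathscr{A}_k$ and $\mathscr{A}_l$ are independent for $k \ne l$.

It can be shown in the same way that $\mathscr{A}_1, \mathscr{A}_2, \ldots, \mathscr{A}_n$ are independent. This is the basis for saying that our model $(\Omega, \mathscr{A}, P)$ corresponds to "n independent trials with two outcomes and probability p of success." James Bernoulli was the first to study this model systematically, and established the law of large numbers (§5) for it. Accordingly, this model is also called the Bernoulli scheme with two outcomes (success and failure) and probability p of success.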
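The probabilities of events that depend on single trials can be confirmed numerically by enumerating the whole sample space. The sketch below assumes small illustrative values of n and p; the function names `bernoulli_scheme` and `prob` are hypothetical helpers, not notation from the text.

```python
from itertools import product

def bernoulli_scheme(n, p):
    """Return {omega: p(omega)} with p(omega) = p^(sum a_i) * q^(n - sum a_i)."""
    q = 1 - p
    return {w: p ** sum(w) * q ** (n - sum(w))
            for w in product((0, 1), repeat=n)}

def prob(space, predicate):
    """P(A) for the event A = {omega: predicate(omega) is true}."""
    return sum(weight for w, weight in space.items() if predicate(w))

n, p = 4, 0.3          # illustrative values
q = 1 - p
space = bernoulli_scheme(n, p)
k, l = 0, 2            # two different trials (0-based positions in omega)

p_k = prob(space, lambda w: w[k] == 1)                      # P(A_k)
p_kl = prob(space, lambda w: w[k] == 1 and w[l] == 1)       # P(A_k A_l)
p_k_not_l = prob(space, lambda w: w[k] == 1 and w[l] == 0)  # P(A_k, not A_l)
print(p_k, p_kl, p_k_not_l)
```

Up to floating-point rounding, the three printed numbers should agree with p, p squared and pq.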
A detailed study of the probability space for the Bernoulli scheme shows
that it has the structure of a direct product of probability spaces, defined
as follows.
Suppose that we are given a collection $(\Omega_1, \mathscr{B}_1, P_1), \ldots, (\Omega_n, \mathscr{B}_n, P_n)$ of finite probability spaces. Form the space $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ of points $\omega = (a_1, \ldots, a_n)$, where $a_i \in \Omega_i$. Let $\mathscr{A} = \mathscr{B}_1 \otimes \cdots \otimes \mathscr{B}_n$ be the algebra of the subsets of $\Omega$ that consists of sums of sets of the form

$$A = B_1 \times B_2 \times \cdots \times B_n$$

with $B_i \in \mathscr{B}_i$. Finally, for $\omega = (a_1, \ldots, a_n)$ take $p(\omega) = p_1(a_1) \cdots p_n(a_n)$ and define P(A) for the set $A = B_1 \times B_2 \times \cdots \times B_n$ by

$$P(A) = \sum_{\{a_1 \in B_1, \ldots, a_n \in B_n\}} p_1(a_1) \cdots p_n(a_n).$$

It is easy to verify that $P(\Omega) = 1$ and therefore the triple $(\Omega, \mathscr{A}, P)$ defines a probability space. This space is called the direct product of the probability spaces $(\Omega_1, \mathscr{B}_1, P_1), \ldots, (\Omega_n, \mathscr{B}_n, P_n)$.
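We note an easily verified property of the direct product of probability spaces: with respect to P, the events

$$A_1 = \{\omega: a_1 \in B_1\}, \quad \ldots, \quad A_n = \{\omega: a_n \in B_n\},$$

where $B_i \in \mathscr{B}_i$, are independent. In the same way, the algebras of subsets of $\Omega$,

$$\mathscr{A}_1 = \{A_1: A_1 = \{\omega: a_1 \in B_1\},\ B_1 \in \mathscr{B}_1\}, \quad \ldots, \quad \mathscr{A}_n = \{A_n: A_n = \{\omega: a_n \in B_n\},\ B_n \in \mathscr{B}_n\},$$

are independent.

It is clear from our construction that the Bernoulli scheme $(\Omega, \mathscr{A}, P)$ with

$$\Omega = \{\omega: \omega = (a_1, \ldots, a_n),\ a_i = 0 \text{ or } 1\}, \qquad \mathscr{A} = \{A: A \subseteq \Omega\} \qquad \text{and} \qquad p(\omega) = p^{\sum a_i} q^{n - \sum a_i}$$

can be thought of as the direct product of the probability spaces $(\Omega_i, \mathscr{B}_i, P_i)$, $i = 1, 2, \ldots, n$, where

$$\Omega_i = \{0, 1\}, \qquad \mathscr{B}_i = \{\{0\}, \{1\}, \varnothing, \Omega_i\},$$
$$P_i(\{1\}) = p, \qquad P_i(\{0\}) = q.$$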

7. PROBLEMS

1. Give examples to show that in general the equations

$$P(B \mid A) + P(B \mid \bar{A}) = 1,$$
$$P(B \mid A) + P(\bar{B} \mid \bar{A}) = 1$$

are false.

2. An urn contains M balls, of which $M_1$ are white. Consider a sample of size n. Let $B_j$ be the event that the ball selected at the jth step is white, and $A_k$ the event that a sample of size n contains exactly k white balls. Show that

$$P(B_j \mid A_k) = k/n$$

both for sampling with replacement and for sampling without replacement.

3. Let $A_1, \ldots, A_n$ be independent events. Then

$$P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) = 1 - \prod_{i=1}^{n} P(\bar{A}_i).$$

4. Let $A_1, \ldots, A_n$ be independent events with $P(A_i) = p_i$. Then the probability $P_0$ that none of the events occurs is

$$P_0 = \prod_{i=1}^{n} (1 - p_i).$$

5. Let A and B be independent events. In terms of P(A) and P(B), find the probabilities
of the events that exactly k, at least k, and at most k of A and B occur (k = 0, 1, 2).
6. Let event A be independent of itself, i.e. let A and A be independent. Show that
P(A) is either 0 or 1.
7. Let event A have P(A) = 0 or 1. Show that A and an arbitrary event B are independent.
8. Consider the electric circuit shown in Figure 4:

Figure 4
Each of the switches A, B, C, D, and E is independently open or closed with
probabilities p and q, respectively. Find the probability that a signal fed in at "input"
will be received at "output". If the signal is received, what is the conditional probability that E is open?

4. Random Variables and Their Properties


1. Let $(\Omega, \mathscr{A}, P)$ be a probabilistic model of an experiment with a finite number of outcomes, $N(\Omega) < \infty$, where $\mathscr{A}$ is the algebra of all subsets of $\Omega$. We observe that in the examples above, where we calculated the probabilities of various events $A \in \mathscr{A}$, the specific nature of the sample space $\Omega$ was of no interest. We were interested only in numerical properties depending on the sample points. For example, we were interested in the probability of some number of successes in a series of n trials, in the probability distribution for the number of objects in cells, etc.

The concept "random variable," which we now introduce (later it will be given a more general form) serves to define quantities that are subject to "measurement" in random experiments.

Definition 1. Any numerical function $\xi = \xi(\omega)$ defined on a (finite) sample space $\Omega$ is called a (simple) random variable. (The reason for the term "simple" random variable will become clear after the introduction of the general concept of random variable in §4 of Chapter II.)

EXAMPLE 1. In the model of two tosses of a coin with sample space $\Omega = \{HH, HT, TH, TT\}$, define a random variable $\xi = \xi(\omega)$ by the table

    ω       HH   HT   TH   TT
    ξ(ω)     2    1    1    0

Here, from its very definition, $\xi(\omega)$ is nothing but the number of heads in the outcome $\omega$.

Another extremely simple example of a random variable is the indicator (or characteristic function) of a set $A \in \mathscr{A}$:

$$\xi = I_A(\omega),$$

where†

$$I_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \notin A. \end{cases}$$

When experimenters are concerned with random variables that describe observations, their main interest is in the probabilities with which the random variables take various values. From this point of view they are interested, not in the distribution of the probability P over $(\Omega, \mathscr{A})$, but in its distribution over the range of a random variable. Since we are considering the case when $\Omega$ contains only a finite number of points, the range X of the random variable $\xi$ is also finite. Let $X = \{x_1, \ldots, x_m\}$, where the (different) numbers $x_1, \ldots, x_m$ exhaust the values of $\xi$.

Let $\mathscr{X}$ be the collection of all subsets of X, and let $B \in \mathscr{X}$. We can also interpret B as an event if the sample space is taken to be X, the set of values of $\xi$.

On $(X, \mathscr{X})$, consider the probability $P_\xi(\cdot)$ induced by $\xi$ according to the formula

$$P_\xi(B) = P\{\omega: \xi(\omega) \in B\}, \qquad B \in \mathscr{X}.$$

It is clear that the values of this probability are completely determined by the probabilities

$$P_\xi(x_i) = P\{\omega: \xi(\omega) = x_i\}, \qquad x_i \in X.$$

The set of numbers $\{P_\xi(x_1), \ldots, P_\xi(x_m)\}$ is called the probability distribution of the random variable $\xi$.

† The notation I(A) is also used. For frequently used properties of indicators see Problem 1.
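The passage from the probability P on the sample space to the induced distribution on the range of a random variable is a simple grouping operation. Here is a minimal sketch for the two-toss example, assuming a fair coin (the text does not fix the weights); `induced_distribution` and `prob_of_set` are illustrative names.

```python
from collections import defaultdict
from fractions import Fraction

# Sample space of two coin tosses; equal weights are an assumption (fair coin).
P = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

def xi(w):
    """Number of heads in the outcome w."""
    return w.count("H")

def induced_distribution(P, xi):
    """Return {x: P{w: xi(w) = x}} for a finite probability space P."""
    dist = defaultdict(Fraction)
    for w, weight in P.items():
        dist[xi(w)] += weight
    return dict(dist)

def prob_of_set(dist, B):
    """Induced probability of a set B of values: sum of dist[x] over x in B."""
    return sum(dist.get(x, 0) for x in B)

P_xi = induced_distribution(P, xi)
print(P_xi)
print(prob_of_set(P_xi, {1, 2}))   # probability of at least one head
```

Once `P_xi` is known, the original sample space is no longer needed to answer questions about the values of the random variable.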

EXAMPLE 2. A random variable $\xi$ that takes the two values 1 and 0 with probabilities p ("success") and q ("failure"), is called a Bernoulli† random variable. Clearly

$$P_\xi(x) = p^x q^{1-x}, \qquad x = 0, 1. \tag{1}$$

A binomial (or binomially distributed) random variable $\xi$ is a random variable that takes the n + 1 values 0, 1, ..., n with probabilities

$$P_\xi(x) = C_n^x\, p^x q^{n-x}, \qquad x = 0, 1, \ldots, n. \tag{2}$$

Note that here and in many subsequent examples we do not specify the sample spaces $(\Omega, \mathscr{A}, P)$, but are interested only in the values of the random variables and their probability distributions.

The probabilistic structure of the random variables $\xi$ is completely specified by the probability distributions $\{P_\xi(x_i),\ i = 1, \ldots, m\}$. The concept of distribution function, which we now introduce, yields an equivalent description of the probabilistic structure of the random variables.

Definition 2. Let $x \in R^1$. The function

$$F_\xi(x) = P\{\omega: \xi(\omega) \le x\}$$

is called the distribution function of the random variable $\xi$.

Clearly

$$F_\xi(x) = \sum_{\{i:\, x_i \le x\}} P_\xi(x_i)$$

and

$$P_\xi(x_i) = F_\xi(x_i) - F_\xi(x_i-),$$

where $F_\xi(x-) = \lim_{y \uparrow x} F_\xi(y)$.

If we suppose that $x_1 < x_2 < \cdots < x_m$ and put $F_\xi(x_0) = 0$, then

$$P_\xi(x_i) = F_\xi(x_i) - F_\xi(x_{i-1}), \qquad i = 1, \ldots, m.$$

The following diagrams (Figure 5) exhibit $P_\xi(x)$ and $F_\xi(x)$ for a binomial random variable.

It follows immediately from Definition 2 that the distribution function $F_\xi = F_\xi(x)$ has the following properties:

(1) $F_\xi(-\infty) = 0$, $F_\xi(+\infty) = 1$;
(2) $F_\xi(x)$ is continuous on the right ($F_\xi(x+) = F_\xi(x)$) and piecewise constant.

† We use the terms "Bernoulli, binomial, Poisson, Gaussian, ..., random variables" for what are more usually called random variables with Bernoulli, binomial, Poisson, Gaussian, ..., distributions.
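The equivalence between a probability distribution and its distribution function can be illustrated for a binomial random variable. The sketch below uses illustrative values n = 5 and p = 0.4; the function names are hypothetical.

```python
from math import comb

def binomial_pmf(n, p):
    """Return [P(0), ..., P(n)] with P(x) = C(n, x) p^x q^(n - x), formula (2)."""
    q = 1 - p
    return [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]

def distribution_function(values, probs, x):
    """F(x) = sum of P(x_i) over all x_i <= x."""
    return sum(pr for v, pr in zip(values, probs) if v <= x)

n, p = 5, 0.4                      # illustrative values
values = list(range(n + 1))
probs = binomial_pmf(n, p)

# F evaluated at the points x_0 < x_1 < ... < x_n of the range.
F = [distribution_function(values, probs, x) for x in values]

# The probabilities are recovered as the jumps F(x_i) - F(x_{i-1}).
jumps = [F[0]] + [F[i] - F[i - 1] for i in range(1, n + 1)]
print(F)
print(jumps)
```

Between two neighbouring points of the range the function is constant, so `distribution_function(values, probs, 2.5)` returns the same number as at x = 2.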

[Figure 5: graphs of the probability distribution $P_\xi(x)$ and the distribution function $F_\xi(x)$ of a binomial random variable]

Along with random variables it is often necessary to consider random vectors $\xi = (\xi_1, \ldots, \xi_r)$ whose components are random variables. For example, when we considered the multinomial distribution we were dealing with a random vector $\nu = (\nu_1, \ldots, \nu_r)$, where $\nu_i = \nu_i(\omega)$ is the number of elements equal to $b_i$, $i = 1, \ldots, r$, in the sequence $\omega = (a_1, \ldots, a_n)$.

The set of probabilities

$$P_\xi(x_1, \ldots, x_r) = P\{\omega: \xi_1(\omega) = x_1, \ldots, \xi_r(\omega) = x_r\},$$

where $x_i \in X_i$, the range of $\xi_i$, is called the probability distribution of the random vector $\xi$, and the function

$$F_\xi(x_1, \ldots, x_r) = P\{\omega: \xi_1(\omega) \le x_1, \ldots, \xi_r(\omega) \le x_r\},$$

where $x_i \in R^1$, is called the distribution function of the random vector $\xi = (\xi_1, \ldots, \xi_r)$.

For example, for the random vector $\nu = (\nu_1, \ldots, \nu_r)$ mentioned above,

$$P_\nu(n_1, \ldots, n_r) = \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r}$$

(see (2.2)).

2. Let $\xi_1, \ldots, \xi_r$ be a set of random variables with values in a (finite) set $X \subseteq R^1$. Let $\mathscr{X}$ be the algebra of subsets of X.

Definition 3. The random variables $\xi_1, \ldots, \xi_r$ are said to be independent (collectively independent) if

$$P\{\xi_1 = x_1, \ldots, \xi_r = x_r\} = P\{\xi_1 = x_1\} \cdots P\{\xi_r = x_r\}$$

for all $x_1, \ldots, x_r \in X$; or, equivalently, if

$$P\{\xi_1 \in B_1, \ldots, \xi_r \in B_r\} = P\{\xi_1 \in B_1\} \cdots P\{\xi_r \in B_r\}$$

for all $B_1, \ldots, B_r \in \mathscr{X}$.

We can get a very simple example of independent random variables from the Bernoulli scheme. Let

$$\Omega = \{\omega: \omega = (a_1, \ldots, a_n),\ a_i = 0, 1\}, \qquad p(\omega) = p^{\sum a_i} q^{n - \sum a_i}$$

and $\xi_i(\omega) = a_i$ for $\omega = (a_1, \ldots, a_n)$, $i = 1, \ldots, n$. Then the random variables $\xi_1, \xi_2, \ldots, \xi_n$ are independent, as follows from the independence of the events

$$A_1 = \{\omega: a_1 = 1\}, \quad \ldots, \quad A_n = \{\omega: a_n = 1\},$$

which was established in §3.
3. We shall frequently encounter the problem of finding the probability distributions of random variables that are functions $f(\xi_1, \ldots, \xi_r)$ of random variables $\xi_1, \ldots, \xi_r$. For the present we consider only the determination of the distribution of a sum $\zeta = \xi + \eta$ of random variables.

If $\xi$ and $\eta$ take values in the respective sets $X = \{x_1, \ldots, x_k\}$ and $Y = \{y_1, \ldots, y_l\}$, the random variable $\zeta = \xi + \eta$ takes values in the set $Z = \{z: z = x_i + y_j,\ i = 1, \ldots, k;\ j = 1, \ldots, l\}$. Then it is clear that

$$P_\zeta(z) = P\{\zeta = z\} = P\{\xi + \eta = z\} = \sum_{\{(i,j):\, x_i + y_j = z\}} P\{\xi = x_i,\ \eta = y_j\}.$$

The case of independent random variables $\xi$ and $\eta$ is particularly important. In this case

$$P\{\xi = x_i,\ \eta = y_j\} = P\{\xi = x_i\}\, P\{\eta = y_j\},$$

and therefore

$$P_\zeta(z) = \sum_{\{(i,j):\, x_i + y_j = z\}} P_\xi(x_i) P_\eta(y_j) = \sum_{i=1}^{k} P_\xi(x_i) P_\eta(z - x_i) \tag{3}$$

for all $z \in Z$, where in the last sum $P_\eta(z - x_i)$ is taken to be zero if $z - x_i \notin Y$.

For example, if $\xi$ and $\eta$ are independent Bernoulli random variables, taking the values 1 and 0 with respective probabilities p and q, then $Z = \{0, 1, 2\}$ and

$$P_\zeta(0) = P_\xi(0) P_\eta(0) = q^2,$$
$$P_\zeta(1) = P_\xi(0) P_\eta(1) + P_\xi(1) P_\eta(0) = 2pq,$$
$$P_\zeta(2) = P_\xi(1) P_\eta(1) = p^2.$$

It is easy to show by induction that if $\xi_1, \xi_2, \ldots, \xi_n$ are independent Bernoulli random variables with $P\{\xi_i = 1\} = p$, $P\{\xi_i = 0\} = q$, then the random variable $\zeta = \xi_1 + \cdots + \xi_n$ has the binomial distribution

$$P_\zeta(k) = C_n^k\, p^k q^{n-k}, \qquad k = 0, 1, \ldots, n. \tag{4}$$
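Formula (3) is a discrete convolution, and the induction leading to (4) amounts to applying it n times. A minimal sketch, with distributions stored as dictionaries `{value: probability}` and an illustrative choice p = 1/3 (the name `convolve` is an assumption of this sketch):

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

def convolve(P_xi, P_eta):
    """Distribution of xi + eta for independent xi and eta, as in formula (3)."""
    P_zeta = defaultdict(Fraction)
    for x, px in P_xi.items():
        for y, py in P_eta.items():
            P_zeta[x + y] += px * py
    return dict(P_zeta)

p = Fraction(1, 3)                 # illustrative value
q = 1 - p
bern = {1: p, 0: q}                # Bernoulli distribution

two = convolve(bern, bern)         # distribution of the sum of two Bernoulli variables

# n-fold sum: start from the distribution concentrated at 0 and add one term at a time.
n = 5
dist = {0: Fraction(1)}
for _ in range(n):
    dist = convolve(dist, bern)

binomial = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}
print(dist == binomial)
```

The double loop runs over all pairs (x, y), which is the first form of the sum in (3); grouping by z = x + y is done by the dictionary key.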

4. We now turn to the important concept of the expectation, or mean value, of a random variable.

Let $(\Omega, \mathscr{A}, P)$ be a (finite) probability space and $\xi = \xi(\omega)$ a random variable with values in the set $X = \{x_1, \ldots, x_k\}$. If we put $A_i = \{\omega: \xi = x_i\}$, $i = 1, \ldots, k$, then $\xi$ can evidently be represented as

$$\xi(\omega) = \sum_{i=1}^{k} x_i I(A_i), \tag{5}$$

where the sets $A_1, \ldots, A_k$ form a decomposition of $\Omega$ (i.e., they are pairwise disjoint and their sum is $\Omega$; see Subsection 3 of §1).

Let $p_i = P\{\xi = x_i\}$. It is intuitively plausible that if we observe the values of the random variable $\xi$ in "n repetitions of identical experiments", the value $x_i$ ought to be encountered about $p_i n$ times, $i = 1, \ldots, k$. Hence the mean value calculated from the results of n experiments is roughly

$$\frac{1}{n}\,[n p_1 x_1 + \cdots + n p_k x_k] = \sum_{i=1}^{k} p_i x_i.$$

This discussion provides the motivation for the following definition.

Definition 4. The expectation† or mean value of the random variable $\xi = \sum_{i=1}^{k} x_i I(A_i)$ is the number

$$E\xi = \sum_{i=1}^{k} x_i P(A_i). \tag{6}$$

Since $A_i = \{\omega: \xi(\omega) = x_i\}$ and $P_\xi(x_i) = P(A_i)$, we have

$$E\xi = \sum_{i=1}^{k} x_i P_\xi(x_i). \tag{7}$$

Recalling the definition of $F_\xi = F_\xi(x)$ and writing

$$\Delta F_\xi(x) = F_\xi(x) - F_\xi(x-),$$

we obtain $P_\xi(x_i) = \Delta F_\xi(x_i)$ and consequently

$$E\xi = \sum_{i=1}^{k} x_i\, \Delta F_\xi(x_i). \tag{8}$$

† Also known as mathematical expectation, or expected value, or (especially in physics) expectation value. (Translator)

Before discussing the properties of the expectation, we remark that it is often convenient to use another representation of the random variable $\xi$, namely

$$\xi(\omega) = \sum_{j=1}^{l} x'_j I(B_j),$$

where $B_1 + \cdots + B_l = \Omega$, but some of the $x'_j$ may be repeated. In this case $E\xi$ can be calculated from the formula $\sum_{j=1}^{l} x'_j P(B_j)$, which differs formally from (5) because in (5) the $x_i$ are all different. In fact,

$$\sum_{\{j:\, x'_j = x_i\}} x'_j P(B_j) = x_i \sum_{\{j:\, x'_j = x_i\}} P(B_j) = x_i P(A_i)$$

and therefore

$$\sum_{j=1}^{l} x'_j P(B_j) = \sum_{i=1}^{k} x_i P(A_i).$$

5. We list the basic properties of the expectation:

(1) If $\xi \ge 0$ then $E\xi \ge 0$.
(2) $E(a\xi + b\eta) = aE\xi + bE\eta$, where a and b are constants.
(3) If $\xi \ge \eta$ then $E\xi \ge E\eta$.
(4) $|E\xi| \le E|\xi|$.
(5) If $\xi$ and $\eta$ are independent, then $E\xi\eta = E\xi \cdot E\eta$.
(6) $(E|\xi\eta|)^2 \le E\xi^2 \cdot E\eta^2$ (Cauchy-Bunyakovskii inequality).†
(7) If $\xi = I(A)$ then $E\xi = P(A)$.

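Properties (1) and (7) are evident. To prove (2), let

$$\xi = \sum_i x_i I(A_i), \qquad \eta = \sum_j y_j I(B_j).$$

Then

$$a\xi + b\eta = a \sum_{i,j} x_i I(A_i \cap B_j) + b \sum_{i,j} y_j I(A_i \cap B_j) = \sum_{i,j} (a x_i + b y_j) I(A_i \cap B_j)$$

and

$$E(a\xi + b\eta) = \sum_{i,j} (a x_i + b y_j) P(A_i \cap B_j) = \sum_i a x_i P(A_i) + \sum_j b y_j P(B_j)$$
$$= a \sum_i x_i P(A_i) + b \sum_j y_j P(B_j) = aE\xi + bE\eta.$$

† Also known as the Cauchy-Schwarz or Schwarz inequality. (Translator)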

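Property (3) follows from (1) and (2). Property (4) is evident, since

$$|E\xi| = \Bigl|\sum_i x_i P(A_i)\Bigr| \le \sum_i |x_i| P(A_i) = E|\xi|.$$

To prove (5) we note that

$$E\xi\eta = E\Bigl(\sum_i x_i I(A_i)\Bigr)\Bigl(\sum_j y_j I(B_j)\Bigr) = E \sum_{i,j} x_i y_j I(A_i \cap B_j) = \sum_{i,j} x_i y_j P(A_i \cap B_j)$$
$$= \sum_{i,j} x_i y_j P(A_i) P(B_j) = \Bigl(\sum_i x_i P(A_i)\Bigr)\Bigl(\sum_j y_j P(B_j)\Bigr) = E\xi \cdot E\eta,$$

where we have used the property that for independent random variables the events

$$A_i = \{\omega: \xi(\omega) = x_i\} \qquad \text{and} \qquad B_j = \{\omega: \eta(\omega) = y_j\}$$

are independent: $P(A_i \cap B_j) = P(A_i) P(B_j)$.

To prove property (6) we observe that

$$\xi^2 = \sum_i x_i^2 I(A_i), \qquad \eta^2 = \sum_j y_j^2 I(B_j)$$

and

$$E\xi^2 = \sum_i x_i^2 P(A_i), \qquad E\eta^2 = \sum_j y_j^2 P(B_j).$$

Let $E\xi^2 > 0$, $E\eta^2 > 0$. Put

$$\tilde{\xi} = \frac{\xi}{\sqrt{E\xi^2}}, \qquad \tilde{\eta} = \frac{\eta}{\sqrt{E\eta^2}}.$$

Since $2|\tilde{\xi}\tilde{\eta}| \le \tilde{\xi}^2 + \tilde{\eta}^2$, we have $2E|\tilde{\xi}\tilde{\eta}| \le E\tilde{\xi}^2 + E\tilde{\eta}^2 = 2$. Therefore $E|\tilde{\xi}\tilde{\eta}| \le 1$ and $(E|\xi\eta|)^2 \le E\xi^2 \cdot E\eta^2$.

However, if, say, $E\xi^2 = 0$, this means that $\sum_i x_i^2 P(A_i) = 0$ and consequently the mean value of $\xi$ is 0, and $P\{\omega: \xi(\omega) = 0\} = 1$. Therefore if at least one of $E\xi^2$ or $E\eta^2$ is zero, it is evident that $E|\xi\eta| = 0$ and consequently the Cauchy-Bunyakovskii inequality still holds.

Remark. Property (5) generalizes in an obvious way to any finite number of random variables: if $\xi_1, \ldots, \xi_r$ are independent, then

$$E\xi_1 \cdots \xi_r = E\xi_1 \cdots E\xi_r.$$

The proof can be given in the same way as for the case r = 2, or by induction.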

EXAMPLE 3. Let $\xi$ be a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q. Then

$$E\xi = 1 \cdot P\{\xi = 1\} + 0 \cdot P\{\xi = 0\} = p.$$

EXAMPLE 4. Let $\xi_1, \ldots, \xi_n$ be n Bernoulli random variables with $P\{\xi_i = 1\} = p$, $P\{\xi_i = 0\} = q$, $p + q = 1$. Then if

$$S_n = \xi_1 + \cdots + \xi_n$$

we find that

$$ES_n = np.$$

This result can be obtained in a different way. It is easy to see that $ES_n$ is not changed if we assume that the Bernoulli random variables $\xi_1, \ldots, \xi_n$ are independent. With this assumption, we have according to (4)

$$P(S_n = k) = C_n^k\, p^k q^{n-k}, \qquad k = 0, 1, \ldots, n.$$

Therefore

$$ES_n = \sum_{k=0}^{n} k P(S_n = k) = \sum_{k=0}^{n} k\, C_n^k\, p^k q^{n-k} = \sum_{k=0}^{n} k\, \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}$$
$$= np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,((n-1)-(k-1))!}\, p^{k-1} q^{(n-1)-(k-1)} = np \sum_{l=0}^{n-1} \frac{(n-1)!}{l!\,((n-1)-l)!}\, p^{l} q^{(n-1)-l} = np.$$

However, the first method is more direct.
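The two routes to $ES_n = np$ can be compared numerically. The sketch below computes the expectation once by linearity and once from the binomial distribution; n = 10 and p = 3/10 are illustrative values, and `expectation` is a hypothetical helper.

```python
from fractions import Fraction
from math import comb

def expectation(dist):
    """Sum of x * P(x) over the distribution {x: P(x)}, as in formula (7)."""
    return sum(x * px for x, px in dist.items())

n, p = 10, Fraction(3, 10)         # illustrative values
q = 1 - p

bern = {1: p, 0: q}
binom = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}

mean_by_linearity = n * expectation(bern)   # property (2): n copies of E xi_i = p
mean_by_sum = expectation(binom)            # direct summation over k
print(mean_by_linearity, mean_by_sum)
```

Both values should coincide with np, here 3.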

6. Let $\xi = \sum_i x_i I(A_i)$, where $A_i = \{\omega: \xi(\omega) = x_i\}$, and let $\varphi = \varphi(\xi(\omega))$ be a function of $\xi(\omega)$. If $B_j = \{\omega: \varphi(\xi(\omega)) = y_j\}$, then

$$\varphi(\xi(\omega)) = \sum_j y_j I(B_j),$$

and consequently

$$E\varphi = \sum_j y_j P(B_j) = \sum_j y_j P_\varphi(y_j). \tag{9}$$

But it is also clear that

$$\varphi(\xi(\omega)) = \sum_i \varphi(x_i) I(A_i).$$

Hence, as in (9), the expectation of the random variable $\varphi = \varphi(\xi)$ can be calculated as

$$E\varphi(\xi) = \sum_i \varphi(x_i) P_\xi(x_i).$$

7. The important notion of the variance of a random variable $\xi$ indicates the amount of scatter of the values of $\xi$ around $E\xi$.

Definition 5. The variance (also called the dispersion) of the random variable $\xi$ (denoted by $V\xi$) is

$$V\xi = E(\xi - E\xi)^2.$$

The number $\sigma = +\sqrt{V\xi}$ is called the standard deviation.

Since

$$E(\xi - E\xi)^2 = E(\xi^2 - 2\xi \cdot E\xi + (E\xi)^2) = E\xi^2 - (E\xi)^2,$$

we have

$$V\xi = E\xi^2 - (E\xi)^2.$$

Clearly $V\xi \ge 0$. It follows from the definition that

$$V(a + b\xi) = b^2 V\xi, \qquad \text{where a and b are constants.}$$

In particular, $Va = 0$, $V(b\xi) = b^2 V\xi$.


Let $\xi$ and $\eta$ be random variables. Then

$$V(\xi + \eta) = E((\xi - E\xi) + (\eta - E\eta))^2 = V\xi + V\eta + 2E(\xi - E\xi)(\eta - E\eta).$$

Write

$$\operatorname{cov}(\xi, \eta) = E(\xi - E\xi)(\eta - E\eta).$$

This number is called the covariance of $\xi$ and $\eta$. If $V\xi > 0$ and $V\eta > 0$, then

$$\rho(\xi, \eta) = \frac{\operatorname{cov}(\xi, \eta)}{\sqrt{V\xi \cdot V\eta}}$$

is called the correlation coefficient of $\xi$ and $\eta$. It is easy to show (see Problem 7 below) that if $\rho(\xi, \eta) = \pm 1$, then $\xi$ and $\eta$ are linearly dependent:

$$\eta = a\xi + b,$$

with a > 0 if $\rho(\xi, \eta) = 1$ and a < 0 if $\rho(\xi, \eta) = -1$.

We observe immediately that if $\xi$ and $\eta$ are independent, so are $\xi - E\xi$ and $\eta - E\eta$. Consequently by Property (5) of expectations,

$$\operatorname{cov}(\xi, \eta) = E(\xi - E\xi) \cdot E(\eta - E\eta) = 0.$$

Using the notation that we introduced for covariance, we have

$$V(\xi + \eta) = V\xi + V\eta + 2\operatorname{cov}(\xi, \eta); \tag{10}$$

if $\xi$ and $\eta$ are independent, the variance of the sum $\xi + \eta$ is equal to the sum of the variances,

$$V(\xi + \eta) = V\xi + V\eta. \tag{11}$$

It follows from (10) that (11) is still valid under weaker hypotheses than the independence of $\xi$ and $\eta$. In fact, it is enough to suppose that $\xi$ and $\eta$ are uncorrelated, i.e. $\operatorname{cov}(\xi, \eta) = 0$.

Remark. If $\xi$ and $\eta$ are uncorrelated, it does not follow in general that they are independent. Here is a simple example. Let the random variable $\alpha$ take the values 0, $\pi/2$ and $\pi$ with probability 1/3. Then $\xi = \sin\alpha$ and $\eta = \cos\alpha$ are uncorrelated; however, they are not only stochastically dependent (i.e., not independent with respect to the probability P):

$$P\{\xi = 1,\ \eta = 1\} = 0 \ne \tfrac{1}{9} = P\{\xi = 1\}\, P\{\eta = 1\},$$

but even functionally dependent: $\xi^2 + \eta^2 = 1$.
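The example in the Remark is small enough to verify by hand or by a few lines of code. In the sketch below the values of $(\sin\alpha, \cos\alpha)$ at $\alpha = 0, \pi/2, \pi$ are entered exactly as (0, 1), (1, 0), (0, -1) to avoid rounding; the covariance is computed from the identity $\operatorname{cov}(\xi, \eta) = E\xi\eta - E\xi \cdot E\eta$, which follows from the definition by linearity. The helper name `E` is illustrative.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint distribution of (xi, eta) = (sin a, cos a) for a = 0, pi/2, pi.
joint = {(0, 1): third, (1, 0): third, (0, -1): third}

def E(f):
    """Expectation of f(xi, eta) under the joint distribution."""
    return sum(f(x, y) * pr for (x, y), pr in joint.items())

cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

p_joint = joint.get((1, 1), Fraction(0))                     # P{xi = 1, eta = 1}
p_xi_1 = sum(pr for (x, _), pr in joint.items() if x == 1)   # P{xi = 1}
p_eta_1 = sum(pr for (_, y), pr in joint.items() if y == 1)  # P{eta = 1}

print(cov, p_joint, p_xi_1 * p_eta_1)
```

The covariance vanishes although the joint probability (0) differs from the product of the marginals (1/9).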


Properties (10) and (11) can be extended in the obvious way to any number of random variables:

$$V\Bigl(\sum_{i=1}^{n} \xi_i\Bigr) = \sum_{i=1}^{n} V\xi_i + 2 \sum_{i > j} \operatorname{cov}(\xi_i, \xi_j). \tag{12}$$

In particular, if $\xi_1, \ldots, \xi_n$ are pairwise independent (pairwise uncorrelated is sufficient), then

$$V\Bigl(\sum_{i=1}^{n} \xi_i\Bigr) = \sum_{i=1}^{n} V\xi_i. \tag{13}$$

EXAMPLE 5. If $\xi$ is a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q, then

$$V\xi = E(\xi - E\xi)^2 = E(\xi - p)^2 = (1 - p)^2 p + p^2 q = pq.$$

It follows that if $\xi_1, \ldots, \xi_n$ are independent identically distributed Bernoulli random variables, and $S_n = \xi_1 + \cdots + \xi_n$, then

$$VS_n = npq. \tag{14}$$
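Formula (14) can be checked against the definition of the variance applied directly to the binomial distribution. A minimal sketch with illustrative values n = 8, p = 1/4; `variance` is a hypothetical helper.

```python
from fractions import Fraction
from math import comb

def variance(dist):
    """E(xi - E xi)^2 for a distribution given as {x: P(x)}."""
    mean = sum(x * px for x, px in dist.items())
    return sum((x - mean) ** 2 * px for x, px in dist.items())

n, p = 8, Fraction(1, 4)           # illustrative values
q = 1 - p

bern = {1: p, 0: q}
binom = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}

print(variance(bern), variance(binom), n * p * q)
```

The first number should equal pq and the last two should agree.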

8. Consider two random variables $\xi$ and $\eta$. Suppose that only $\xi$ can be observed. If $\xi$ and $\eta$ are correlated, we may expect that knowing the value of $\xi$ allows us to make some inference about the values of the unobserved variable $\eta$.

Any function $f = f(\xi)$ of $\xi$ is called an estimator for $\eta$. We say that an estimator $f^* = f^*(\xi)$ is best in the mean-square sense if

$$E(\eta - f^*(\xi))^2 = \inf_f E(\eta - f(\xi))^2.$$

Let us show how to find a best estimator in the class of linear estimators $\lambda(\xi) = a + b\xi$. We consider the function $g(a, b) = E(\eta - (a + b\xi))^2$. Differentiating g(a, b) with respect to a and b, we obtain

$$\frac{\partial g(a, b)}{\partial a} = -2E[\eta - (a + b\xi)],$$
$$\frac{\partial g(a, b)}{\partial b} = -2E[(\eta - (a + b\xi))\xi],$$

whence, setting the derivatives equal to zero, we find that the best mean-square linear estimator is $\lambda^*(\xi) = a^* + b^*\xi$, where

$$a^* = E\eta - b^* E\xi, \qquad b^* = \frac{\operatorname{cov}(\xi, \eta)}{V\xi}. \tag{15}$$

In other words,

$$\lambda^*(\xi) = E\eta + \frac{\operatorname{cov}(\xi, \eta)}{V\xi}\,(\xi - E\xi). \tag{16}$$

The number $E(\eta - \lambda^*(\xi))^2$ is called the mean-square error of observation. An easy calculation shows that it is equal to

$$\Delta^* = E(\eta - \lambda^*(\xi))^2 = V\eta - \frac{\operatorname{cov}^2(\xi, \eta)}{V\xi} = V\eta \cdot [1 - \rho^2(\xi, \eta)].$$

Consequently, the larger (in absolute value) the correlation coefficient $\rho(\xi, \eta)$ between $\xi$ and $\eta$, the smaller the mean-square error of observation $\Delta^*$. In particular, if $|\rho(\xi, \eta)| = 1$ then $\Delta^* = 0$ (cf. Problem 7). On the other hand, if $\xi$ and $\eta$ are uncorrelated ($\rho(\xi, \eta) = 0$), then $\lambda^*(\xi) = E\eta$, i.e. in the absence of correlation between $\xi$ and $\eta$ the best estimate of $\eta$ in terms of $\xi$ is simply $E\eta$ (cf. Problem 4).
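The coefficients (15) and the mean-square error can be computed directly from a joint distribution. The following is a sketch under stated assumptions: the joint distribution `joint` is a made-up four-point example, and `expect`, `mean_square_error` and `best_linear_estimator` are hypothetical helper names. It requires $V\xi > 0$.

```python
from fractions import Fraction

def expect(joint, f):
    """Expectation of f(xi, eta) for a joint distribution {(x, y): probability}."""
    return sum(f(x, y) * pr for (x, y), pr in joint.items())

def mean_square_error(joint, a, b):
    """g(a, b) = E(eta - (a + b * xi))^2."""
    return expect(joint, lambda x, y: (y - (a + b * x)) ** 2)

def best_linear_estimator(joint):
    """Return (a*, b*) from formula (15); assumes V xi > 0."""
    m_xi = expect(joint, lambda x, y: x)
    m_eta = expect(joint, lambda x, y: y)
    v_xi = expect(joint, lambda x, y: (x - m_xi) ** 2)
    cov = expect(joint, lambda x, y: (x - m_xi) * (y - m_eta))
    b = cov / v_xi
    a = m_eta - b * m_xi
    return a, b

# Hypothetical joint distribution of (xi, eta), chosen only for illustration.
quarter = Fraction(1, 4)
joint = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (2, 3): quarter}

a_star, b_star = best_linear_estimator(joint)
best = mean_square_error(joint, a_star, b_star)
print(a_star, b_star, best)
```

Perturbing either coefficient, for instance `mean_square_error(joint, a_star + 1, b_star)`, should never give a smaller error than `best`.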

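9. PROBLEMS

1. Verify the following properties of indicators $I_A = I_A(\omega)$:

$$I_{AB} = I_A \cdot I_B,$$
$$I_{A \cup B} = I_A + I_B - I_{AB}.$$

The indicator of $\bigcup_{i=1}^{n} A_i$ is $1 - \prod_{i=1}^{n} (1 - I_{A_i})$, the indicator of $\overline{\bigcup_{i=1}^{n} A_i}$ is $\prod_{i=1}^{n} (1 - I_{A_i})$, and the indicator of $\sum_{i=1}^{n} A_i$ is $\sum_{i=1}^{n} I_{A_i}$. Moreover,

$$I_{A \triangle B} = (I_A - I_B)^2,$$

where $A \triangle B$ is the symmetric difference of A and B, i.e. the set $(A \setminus B) \cup (B \setminus A)$.

2. Let $\xi_1, \ldots, \xi_n$ be independent random variables and

$$\xi_{\min} = \min(\xi_1, \ldots, \xi_n), \qquad \xi_{\max} = \max(\xi_1, \ldots, \xi_n).$$

Show that

$$P\{\xi_{\min} \ge x\} = \prod_{i=1}^{n} P\{\xi_i \ge x\}, \qquad P\{\xi_{\max} < x\} = \prod_{i=1}^{n} P\{\xi_i < x\}.$$

3. Let $\xi_1, \ldots, \xi_n$ be independent Bernoulli random variables such that

$$P\{\xi_i = 0\} = 1 - \lambda_i \Delta, \qquad P\{\xi_i = 1\} = \lambda_i \Delta,$$

where $\Delta$ is a small number, $\Delta > 0$, $\lambda_i > 0$. Show that

$$P\{\xi_1 + \cdots + \xi_n = 1\} = \Bigl(\sum_{i=1}^{n} \lambda_i\Bigr)\Delta + O(\Delta^2),$$
$$P\{\xi_1 + \cdots + \xi_n > 1\} = O(\Delta^2).$$

4. Show that $\inf_{-\infty < a < \infty} E(\xi - a)^2$ is attained for $a = E\xi$ and consequently

$$\inf_{-\infty < a < \infty} E(\xi - a)^2 = V\xi.$$

5. Let $\xi$ be a random variable with distribution function $F_\xi(x)$ and let $m_e$ be a median of $F_\xi(x)$, i.e. a point such that

$$F_\xi(m_e-) \le \tfrac{1}{2} \le F_\xi(m_e).$$

Show that

$$\inf_{-\infty < a < \infty} E|\xi - a| = E|\xi - m_e|.$$

6. Let $P_\xi(x) = P\{\xi = x\}$ and $F_\xi(x) = P\{\xi \le x\}$. Show that

$$P_{a\xi + b}(x) = P_\xi\Bigl(\frac{x - b}{a}\Bigr), \qquad F_{a\xi + b}(x) = F_\xi\Bigl(\frac{x - b}{a}\Bigr)$$

for a > 0 and $-\infty < b < \infty$. If $y \ge 0$, then

$$F_{\xi^2}(y) = F_\xi(+\sqrt{y}) - F_\xi(-\sqrt{y}) + P_\xi(-\sqrt{y}).$$

Let $\xi^+ = \max(\xi, 0)$. Then

$$F_{\xi^+}(x) = \begin{cases} 0, & x < 0, \\ F_\xi(0), & x = 0, \\ F_\xi(x), & x > 0. \end{cases}$$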

7. Let $\xi$ and $\eta$ be random variables with $V\xi > 0$, $V\eta > 0$, and let $\rho = \rho(\xi, \eta)$ be their correlation coefficient. Show that $|\rho| \le 1$. If $|\rho| = 1$, find constants a and b such that $\eta = a\xi + b$. Moreover, if $\rho = 1$, then

$$\frac{\eta - E\eta}{\sqrt{V\eta}} = \frac{\xi - E\xi}{\sqrt{V\xi}}$$

(and therefore a > 0), whereas if $\rho = -1$, then

$$\frac{\eta - E\eta}{\sqrt{V\eta}} = -\frac{\xi - E\xi}{\sqrt{V\xi}}$$

(and therefore a < 0).

8. Let $\xi$ and $\eta$ be random variables with $E\xi = E\eta = 0$, $V\xi = V\eta = 1$ and correlation coefficient $\rho = \rho(\xi, \eta)$. Show that

$$E \max(\xi^2, \eta^2) \le 1 + \sqrt{1 - \rho^2}.$$

9. Use the equation

$$\Bigl(\text{Indicator of } \bigcap_{i=1}^{n} \bar{A}_i\Bigr) = \prod_{i=1}^{n} (1 - I_{A_i})$$

to deduce the formula $P(B_0) = 1 - S_1 + S_2 + \cdots \pm S_n$ from Problem 4 of §1.

10. Let $\xi_1, \ldots, \xi_n$ be independent random variables, $\varphi_1 = \varphi_1(\xi_1, \ldots, \xi_k)$ and $\varphi_2 = \varphi_2(\xi_{k+1}, \ldots, \xi_n)$, functions respectively of $\xi_1, \ldots, \xi_k$ and $\xi_{k+1}, \ldots, \xi_n$. Show that the random variables $\varphi_1$ and $\varphi_2$ are independent.

11. Show that the random variables $\xi_1, \ldots, \xi_n$ are independent if and only if

$$F_{\xi_1, \ldots, \xi_n}(x_1, \ldots, x_n) = F_{\xi_1}(x_1) \cdots F_{\xi_n}(x_n)$$

for all $x_1, \ldots, x_n$, where $F_{\xi_1, \ldots, \xi_n}(x_1, \ldots, x_n) = P\{\xi_1 \le x_1, \ldots, \xi_n \le x_n\}$.

12. Show that the random variable $\xi$ is independent of itself (i.e., $\xi$ and $\xi$ are independent) if and only if $\xi$ = const.

13. Under what hypotheses on $\xi$ are the random variables $\xi$ and $\sin \xi$ independent?

14. Let $\xi$ and $\eta$ be independent random variables and $\eta \ne 0$. Express the probabilities $P\{\xi\eta \le z\}$ and $P\{\xi/\eta \le z\}$ in terms of the probabilities $P_\xi(x)$ and $P_\eta(y)$.

5. The Bernoulli Scheme. I. The Law of Large Numbers

1. In accordance with the definitions given above, a triple $(\Omega, \mathscr{A}, P)$ with

$$\Omega = \{\omega: \omega = (a_1, \ldots, a_n),\ a_i = 0, 1\}, \qquad \mathscr{A} = \{A: A \subseteq \Omega\}, \qquad p(\omega) = p^{\sum a_i} q^{n - \sum a_i}$$

is called a probabilistic model of n independent experiments with two outcomes, or a Bernoulli scheme.

In this and the next section we study some limiting properties (in a sense described below) for Bernoulli schemes. These are best expressed in terms of random variables and of the probabilities of events connected with them.

We introduce random variables $\xi_1, \ldots, \xi_n$ by taking $\xi_i(\omega) = a_i$, $i = 1, \ldots, n$, where $\omega = (a_1, \ldots, a_n)$. As we saw above, the Bernoulli variables $\xi_i(\omega)$ are independent and identically distributed:

$$P\{\xi_i = 1\} = p, \qquad P\{\xi_i = 0\} = q, \qquad i = 1, \ldots, n.$$

It is natural to think of $\xi_i$ as describing the result of an experiment at the ith stage (or at time i).

Let us put $S_0(\omega) \equiv 0$ and

$$S_k = \xi_1 + \cdots + \xi_k, \qquad k = 1, \ldots, n.$$

As we found above, $ES_n = np$ and consequently

$$E\,\frac{S_n}{n} = p. \tag{1}$$

In other words, the mean value of the frequency of "success", i.e. $S_n/n$, coincides with the probability p of success. Hence we are led to ask how much the frequency $S_n/n$ of success differs from its probability p.

We first note that we cannot expect that, for a sufficiently small $\varepsilon > 0$ and for sufficiently large n, the deviation of $S_n/n$ from p is less than $\varepsilon$ for all $\omega$, i.e. that

$$\left|\frac{S_n(\omega)}{n} - p\right| \le \varepsilon, \qquad \omega \in \Omega. \tag{2}$$

In fact, when 0 < p < 1,

$$P\Bigl\{\frac{S_n}{n} = 1\Bigr\} = P\{\xi_1 = 1, \ldots, \xi_n = 1\} = p^n,$$
$$P\Bigl\{\frac{S_n}{n} = 0\Bigr\} = P\{\xi_1 = 0, \ldots, \xi_n = 0\} = q^n,$$

whence it follows that (2) is not satisfied for sufficiently small $\varepsilon > 0$.

We observe, however, that when n is large the probabilities of the events $\{S_n/n = 1\}$ and $\{S_n/n = 0\}$ are small. It is therefore natural to expect that the total probability of the events for which $|[S_n(\omega)/n] - p| > \varepsilon$ will also be small when n is sufficiently large.

We shall accordingly try to estimate the probability of the event $\{\omega: |[S_n(\omega)/n] - p| > \varepsilon\}$. For this purpose we need the following inequality, which was discovered by Chebyshev.

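Chebyshev's inequality. Let $(\Omega, \mathscr{A}, P)$ be a probability space and $\xi = \xi(\omega)$ a nonnegative random variable. Then

$$P\{\xi \ge \varepsilon\} \le E\xi/\varepsilon \tag{3}$$

for all $\varepsilon > 0$.

PROOF. We notice that

$$\xi = \xi I(\xi \ge \varepsilon) + \xi I(\xi < \varepsilon) \ge \xi I(\xi \ge \varepsilon) \ge \varepsilon I(\xi \ge \varepsilon),$$

where I(A) is the indicator of A.

Then, by the properties of the expectation,

$$E\xi \ge \varepsilon E I(\xi \ge \varepsilon) = \varepsilon P\{\xi \ge \varepsilon\},$$

which establishes (3).

Corollary. If $\xi$ is any random variable, we have for $\varepsilon > 0$,

$$P\{|\xi| \ge \varepsilon\} \le E|\xi|/\varepsilon,$$
$$P\{|\xi| \ge \varepsilon\} = P\{\xi^2 \ge \varepsilon^2\} \le E\xi^2/\varepsilon^2, \tag{4}$$
$$P\{|\xi - E\xi| \ge \varepsilon\} \le V\xi/\varepsilon^2.$$

In the last of these inequalities, take $\xi = S_n/n$. Then using (4.14), we obtain

$$V\,\frac{S_n}{n} = \frac{VS_n}{n^2} = \frac{npq}{n^2} = \frac{pq}{n}.$$

Therefore

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| \ge \varepsilon\Bigr\} \le \frac{pq}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}, \tag{5}$$

from which we see that for large n there is rather small probability that the frequency $S_n/n$ of success deviates from the probability p by more than $\varepsilon$.

For $n \ge 1$ and $0 \le k \le n$, write

$$P_n(k) = C_n^k\, p^k q^{n-k}.$$

Then

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| \ge \varepsilon\Bigr\} = \sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k),$$

and we have actually shown that

$$\sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k) \le \frac{pq}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}, \tag{6}$$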

i.e. we have proved an inequality that could also have been obtained analytically, without using the probabilistic interpretation.

[Figure 6: graph of the binomial probabilities $P_n(k)$, with the points $np - n\varepsilon$, $np$ and $np + n\varepsilon$ marked on the k-axis]

It is clear from (6) that

$$\sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k) \to 0, \qquad n \to \infty. \tag{7}$$
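To see how the tail sum behaves in comparison with the Chebyshev bound $pq/(n\varepsilon^2)$, one can evaluate both for a few values of n. The sketch below uses exact rational arithmetic so that the condition $|k/n - p| \ge \varepsilon$ is tested without rounding; p = 1/2 and $\varepsilon$ = 1/10 are illustrative, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_tail(n, p, eps):
    """Exact sum of P_n(k) over all k with |k/n - p| >= eps."""
    q = 1 - p
    return sum(comb(n, k) * p ** k * q ** (n - k)
               for k in range(n + 1)
               if abs(Fraction(k, n) - p) >= eps)

def chebyshev_bound(n, p, eps):
    """Right-hand side pq / (n eps^2) of the Chebyshev estimate."""
    return p * (1 - p) / (n * eps ** 2)

p, eps = Fraction(1, 2), Fraction(1, 10)    # illustrative values
for n in (10, 100, 1000):
    tail = binomial_tail(n, p, eps)
    bound = chebyshev_bound(n, p, eps)
    print(n, float(tail), float(bound))
```

For small n the bound may exceed 1 and then carries no information; as n grows both columns decrease, the exact tail much faster than the bound.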

We can clarify this graphically in the following way. Let us represent the binomial distribution $\{P_n(k),\ 0 \le k \le n\}$ as in Figure 6.

Then as n increases the graph spreads out and becomes flatter. At the same time the sum of $P_n(k)$, over k for which $np - n\varepsilon \le k < np + n\varepsilon$, tends to 1.

Let us think of the sequence of random variables $S_0, S_1, \ldots, S_n$ as the path of a wandering particle. Then (7) has the following interpretation.

Let us draw lines from the origin of slopes $kp$, $k(p + \varepsilon)$, and $k(p - \varepsilon)$. Then on the average the path follows the $kp$ line, and for every $\varepsilon > 0$ we can say that when n is sufficiently large there is a large probability that the point $S_n$ specifying the position of the particle at time n lies in the interval $[n(p - \varepsilon),\ n(p + \varepsilon)]$; see Figure 7.

[Figure 7: a path $S_k$ of the particle together with the lines $kp$, $k(p + \varepsilon)$ and $k(p - \varepsilon)$]

We would like to write (7) in the following form:

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| \ge \varepsilon\Bigr\} \to 0, \qquad n \to \infty. \tag{8}$$

However, we must keep in mind that there is a delicate point involved here. Indeed, the form (8) is really justified only if P is a probability on a space $(\Omega, \mathscr{A})$ on which infinitely many sequences of independent Bernoulli random variables $\xi_1, \xi_2, \ldots$ are defined. Such spaces can actually be constructed and (8) can be justified in a completely rigorous probabilistic sense (see Corollary 1 below, the end of §4, Chapter II, and Theorem 1, §9, Chapter II). For the time being, if we want to attach a meaning to the analytic statement (7), using the language of probability theory, we have proved only the following.

Let $(\Omega^{(n)}, \mathscr{A}^{(n)}, P^{(n)})$, $n \ge 1$, be a sequence of Bernoulli schemes such that

$$\Omega^{(n)} = \{\omega^{(n)}: \omega^{(n)} = (a_1^{(n)}, \ldots, a_n^{(n)}),\ a_i^{(n)} = 0, 1\},$$
$$\mathscr{A}^{(n)} = \{A: A \subseteq \Omega^{(n)}\},$$
$$p^{(n)}(\omega^{(n)}) = p^{\sum a_i^{(n)}} q^{n - \sum a_i^{(n)}}$$

and

$$S_k^{(n)}(\omega^{(n)}) = \xi_1^{(n)}(\omega^{(n)}) + \cdots + \xi_k^{(n)}(\omega^{(n)}),$$

where, for $n \ge 1$, $\xi_1^{(n)}, \ldots, \xi_n^{(n)}$ are sequences of independent identically distributed Bernoulli random variables.

Then

$$P^{(n)}\Bigl\{\omega^{(n)}: \Bigl|\frac{S_n^{(n)}(\omega^{(n)})}{n} - p\Bigr| \ge \varepsilon\Bigr\} = \sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k) \to 0, \qquad n \to \infty. \tag{9}$$


Statements like (7)-(9) go by the name of James Bernoulli's law of large numbers. We may remark that to be precise, Bernoulli's proof consisted in establishing (7), which he did quite rigorously by using estimates for the "tails" of the binomial probabilities $P_n(k)$ (for the values of k for which $|(k/n) - p| \ge \varepsilon$). A direct calculation of the sum of the tail probabilities of the binomial distribution $\sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k)$ is a rather difficult problem for large n, and the resulting formulas are ill adapted for actual estimates of the probability with which the frequencies $S_n/n$ differ from p by less than $\varepsilon$. Important progress resulted from the discovery by De Moivre (for p = 1/2) and then by Laplace (for 0 < p < 1) of simple asymptotic formulas for $P_n(k)$, which led not only to new proofs of the law of large numbers but also to more precise statements of both local and integral limit theorems, the essence of which is that for large n and at least for $k \sim np$,

$$P_n(k) \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-(k - np)^2/(2npq)}$$

and

$$\sum_{l=0}^{k} P_n(l) \sim \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(k - np)/\sqrt{npq}} e^{-x^2/2}\, dx.$$

2. The next section will be devoted to precise statements and proofs of these results. For the present we consider the question of the real meaning of the law of large numbers, and of its empirical interpretation.

Let us carry out a large number, say N, of series of experiments, each of which consists of "n independent trials with probability p of the event C of interest." Let $S_n^i/n$ be the frequency of event C in the ith series and $N_\varepsilon$ the number of series in which the frequency deviates from p by less than $\varepsilon$: $N_\varepsilon$ is the number of i's for which $|(S_n^i/n) - p| \le \varepsilon$. Then

$$N_\varepsilon/N \sim P_\varepsilon, \tag{10}$$

where $P_\varepsilon = P\{|(S_n^1/n) - p| \le \varepsilon\}$.

It is important to emphasize that an attempt to make (10) precise inevitably involves the introduction of some probability measure, just as an estimate for the deviation of $S_n/n$ from p becomes possible only after the introduction of a probability measure P.
3. Let us consider the estimate obtained above,

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| \ge \varepsilon\Bigr\} = \sum_{\{k:\, |(k/n) - p| \ge \varepsilon\}} P_n(k) \le \frac{1}{4n\varepsilon^2}, \tag{11}$$

as an answer to the following question that is typical of mathematical statistics: what is the least number n of observations that is guaranteed to have (for arbitrary 0 < p < 1)

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| \le \varepsilon\Bigr\} \ge 1 - \alpha, \tag{12}$$

where $\alpha$ is a given number (usually small)?

It follows from (11) that this number is the smallest integer n for which

$$n \ge \frac{1}{4\varepsilon^2\alpha}. \tag{13}$$

For example, if $\alpha = 0.05$ and $\varepsilon = 0.02$, then 12 500 observations guarantee that (12) will hold independently of the value of the unknown parameter p.

Later (Subsection 5, §6) we shall see that this number is much overstated; this came about because Chebyshev's inequality provides only a very crude upper bound for $P\{|(S_n/n) - p| \ge \varepsilon\}$.
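The sample-size rule (13) is a one-line computation. The sketch below reproduces the figure of 12 500 observations quoted in the text; the function name `chebyshev_sample_size` is an illustrative choice, and the arguments are passed as decimal strings so that they are converted to exact fractions.

```python
from fractions import Fraction
from math import ceil

def chebyshev_sample_size(eps, alpha):
    """Smallest integer n with n >= 1 / (4 * eps^2 * alpha), as in (13)."""
    eps, alpha = Fraction(eps), Fraction(alpha)
    return ceil(1 / (4 * eps ** 2 * alpha))

n = chebyshev_sample_size("0.02", "0.05")
print(n)   # 12500
```

Halving $\varepsilon$ multiplies the required n by four, while halving $\alpha$ only doubles it.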
4. Let us write

$$C(n, \varepsilon) = \Bigl\{\omega: \Bigl|\frac{S_n(\omega)}{n} - p\Bigr| \le \varepsilon\Bigr\}.$$

From the law of large numbers that we proved, it follows that for every $\varepsilon > 0$ and for sufficiently large n, the probability of the set $C(n, \varepsilon)$ is close to 1. In this sense it is natural to call paths (realizations) $\omega$ that are in $C(n, \varepsilon)$ typical (or $(n, \varepsilon)$-typical).


We ask the following question: How many typical realizations are there,
and what is the weight $p(\omega)$ of a typical realization?
For this purpose we first notice that the total number $N(\Omega)$ of points is $2^n$,
and that if $p = 0$ or $1$, the set of typical paths $C(n, \varepsilon)$ contains only the single
path $(0, 0, \ldots, 0)$ or $(1, 1, \ldots, 1)$. However, if $p = \frac12$, it is intuitively clear that
"almost all" paths (all except those of the form $(0, 0, \ldots, 0)$ or $(1, 1, \ldots, 1)$)
are typical and that consequently there should be about $2^n$ of them.
It turns out that we can give a definitive answer to the question whenever
$0 < p < 1$; it will then appear that both the number of typical realizations
and the weights $p(\omega)$ are determined by a function of $p$ called the entropy.
In order to present the corresponding results in more depth, it will be
helpful to consider the somewhat more general scheme of Subsection 2 of
§2 instead of the Bernoulli scheme itself.
Let $(p_1, p_2, \ldots, p_r)$ be a finite probability distribution, i.e. a set of nonnegative numbers satisfying $p_1 + \cdots + p_r = 1$. The entropy of this distribution is

$$H = -\sum_{i=1}^{r} p_i \ln p_i, \tag{14}$$

with $0 \ln 0 = 0$. It is clear that $H \ge 0$, and $H = 0$ if and only if every $p_i$,
with one exception, is zero. The function $f(x) = -x \ln x$, $0 \le x \le 1$, is
convex upward, so that, as we know from the theory of convex functions,

$$\frac{f(x_1) + \cdots + f(x_r)}{r} \le f\left(\frac{x_1 + \cdots + x_r}{r}\right).$$

Consequently

$$H = -\sum_{i=1}^{r} p_i \ln p_i \le -r \cdot \frac{p_1 + \cdots + p_r}{r} \cdot \ln\left(\frac{p_1 + \cdots + p_r}{r}\right) = \ln r.$$

In other words, the entropy attains its largest value for $p_1 = \cdots = p_r = 1/r$
(see Figure 8 for $H = H(p)$ in the case $r = 2$).
Figure 8. The function $H(p) = -p \ln p - (1 - p) \ln(1 - p)$.

If we consider the probability distribution $(p_1, p_2, \ldots, p_r)$ as giving the
probabilities for the occurrence of events $A_1, A_2, \ldots, A_r$, say, then it is quite
clear that the "degree of indeterminacy" of an event will be different for
different distributions. If, for example, $p_1 = 1$, $p_2 = \cdots = p_r = 0$, it is clear
that this distribution does not admit any indeterminacy: we can say with
complete certainty that the result of the experiment will be $A_1$. On the other
hand, if $p_1 = \cdots = p_r = 1/r$, the distribution has maximal indeterminacy,
in the sense that it is impossible to discover any preference for the occurrence
of one event rather than another.
Consequently it is important to have a quantitative measure of the indeterminacy of different probability distributions, so that we may compare
them in this respect. The entropy successfully provides such a measure of
indeterminacy; it plays an important role in statistical mechanics and in many
significant problems of coding and communication theory.
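
As a small numerical illustration of these properties, the following sketch evaluates the entropy (14) for a degenerate, a non-uniform and a uniform distribution on $r = 4$ points. The function name `entropy` and the sample distributions are illustrative choices.

```python
from math import log

def entropy(p):
    """H = -sum p_i ln p_i, with the convention 0 ln 0 = 0."""
    return sum(-x * log(x) for x in p if x > 0)

r = 4
print(entropy([1.0, 0.0, 0.0, 0.0]))                # no indeterminacy
print(entropy([0.5, 0.25, 0.125, 0.125]), log(r))   # strictly below ln r
print(entropy([1 / r] * r), log(r))                 # maximal: equals ln r
```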
Suppose now that the sample space is

$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n),\ a_i = 1, \ldots, r\}$$

and that $p(\omega) = p_1^{\nu_1(\omega)} \cdots p_r^{\nu_r(\omega)}$, where $\nu_i(\omega)$ is the number of occurrences of $i$
in the sequence $\omega$, and $(p_1, \ldots, p_r)$ is a probability distribution.
For $\varepsilon > 0$ and $n = 1, 2, \ldots$, let us put

$$C(n, \varepsilon) = \left\{\omega : \left|\frac{\nu_i(\omega)}{n} - p_i\right| < \varepsilon,\ i = 1, \ldots, r\right\}.$$

It is clear that

$$P(C(n, \varepsilon)) \ge 1 - \sum_{i=1}^{r} P\left\{\left|\frac{\nu_i(\omega)}{n} - p_i\right| \ge \varepsilon\right\},$$

and the probabilities $P\{|(\nu_i(\omega)/n) - p_i| \ge \varepsilon\}$ are
arbitrarily small when $n$ is sufficiently large, by the law of large numbers
applied to the random variables

$$\xi_k(\omega) = \begin{cases} 1, & a_k = i, \\ 0, & a_k \ne i, \end{cases} \qquad k = 1, \ldots, n.$$

Hence for large $n$ the probability of the event $C(n, \varepsilon)$ is close to 1. Thus, as in
the case $r = 2$, a path in $C(n, \varepsilon)$ can be said to be typical.
If all $p_i > 0$, then for every $\omega \in \Omega$

$$p(\omega) = \exp\left\{-n \sum_{k=1}^{r} \left(-\frac{\nu_k(\omega)}{n} \ln p_k\right)\right\}.$$

Consequently if $\omega$ is a typical path, we have

$$\left|\sum_{k=1}^{r} \left(-\frac{\nu_k(\omega)}{n} \ln p_k\right) - H\right| \le -\sum_{k=1}^{r} \left|\frac{\nu_k(\omega)}{n} - p_k\right| \ln p_k \le -\varepsilon \sum_{k=1}^{r} \ln p_k.$$

It follows that for typical paths the probability $p(\omega)$ is close to $e^{-nH}$ and, since by the law of large numbers the typical paths "almost" exhaust $\Omega$
when $n$ is large, the number of such paths must be of order $e^{nH}$. These considerations lead up to the following proposition.


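Theorem (Macmillan). Let $p_i > 0$, $i = 1, \ldots, r$ and $0 < \varepsilon < 1$. Then there is
an $n_0 = n_0(\varepsilon; p_1, \ldots, p_r)$ such that for all $n > n_0$

(a) $e^{n(H - \varepsilon)} \le N(C(n, \varepsilon_1)) \le e^{n(H + \varepsilon)}$;

(b) $e^{-n(H + \varepsilon)} \le p(\omega) \le e^{-n(H - \varepsilon)}$, $\omega \in C(n, \varepsilon_1)$;

(c) $P(C(n, \varepsilon_1)) = \sum_{\omega \in C(n, \varepsilon_1)} p(\omega) \to 1$, $n \to \infty$,

where $\varepsilon_1$ is the smaller of $\varepsilon$ and $\varepsilon \big/ \left\{-2 \sum_{k=1}^{r} \ln p_k\right\}$.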

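PROOF. Conclusion (c) follows from the law of large numbers. To establish
the other conclusions, we notice that if $\omega \in C(n, \varepsilon_1)$ then

$$np_k - \varepsilon_1 n < \nu_k(\omega) < np_k + \varepsilon_1 n, \qquad k = 1, \ldots, r,$$

and therefore, since $\ln p_k \le 0$,

$$p(\omega) = \exp\left\{\sum_k \nu_k \ln p_k\right\} < \exp\left\{n \sum_k p_k \ln p_k - \varepsilon_1 n \sum_k \ln p_k\right\} \le \exp\{-n(H - \tfrac12 \varepsilon)\}.$$

Similarly

$$p(\omega) > \exp\{-n(H + \tfrac12 \varepsilon)\}.$$

Consequently (b) is now established.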


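Furthermore, since

$$P(C(n, \varepsilon_1)) \ge N(C(n, \varepsilon_1)) \cdot \min_{\omega \in C(n, \varepsilon_1)} p(\omega),$$

we have

$$N(C(n, \varepsilon_1)) \le \frac{P(C(n, \varepsilon_1))}{\min_{\omega \in C(n, \varepsilon_1)} p(\omega)} < \frac{1}{e^{-n(H + (1/2)\varepsilon)}} = e^{n(H + (1/2)\varepsilon)}$$

and similarly

$$N(C(n, \varepsilon_1)) \ge \frac{P(C(n, \varepsilon_1))}{\max_{\omega \in C(n, \varepsilon_1)} p(\omega)} > P(C(n, \varepsilon_1))\, e^{n(H - (1/2)\varepsilon)}.$$

Since $P(C(n, \varepsilon_1)) \to 1$, $n \to \infty$, there is an $n_1$ such that $P(C(n, \varepsilon_1)) > 1 - \varepsilon$
for $n > n_1$, and therefore

$$N(C(n, \varepsilon_1)) \ge (1 - \varepsilon) \exp\{n(H - \tfrac12 \varepsilon)\} = \exp\{n(H - \varepsilon) + (\tfrac12 n\varepsilon + \ln(1 - \varepsilon))\}.$$

Let $n_2$ be such that

$$\tfrac12 n\varepsilon + \ln(1 - \varepsilon) > 0$$

for $n > n_2$. Then when $n \ge n_0 = \max(n_1, n_2)$ we have

$$N(C(n, \varepsilon_1)) \ge e^{n(H - \varepsilon)}.$$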

This completes the proof of the theorem.
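
The theorem can be illustrated numerically in the Bernoulli case $r = 2$, where a path is typical exactly when its number $k$ of 1's satisfies $|k/n - p| < \varepsilon$. The sketch below counts the typical paths and compares $(1/n)\ln N(C(n, \varepsilon))$ with $H$. The parameter values $n = 1000$, $p = 3/10$, $\varepsilon = 1/20$ and the function name are illustrative assumptions; the quantity `eps` in the code plays the role of $\varepsilon_1$.

```python
from fractions import Fraction
from math import comb, exp, log

def typical_set(n, p, eps):
    """For r = 2, return (N, P): the number and the total probability of paths in C(n, eps)."""
    q = 1 - p
    N, P = 0, 0.0
    for k in range(n + 1):                     # k = number of 1's in the path
        if abs(Fraction(k, n) - p) < eps:      # exact test of |k/n - p| < eps
            c = comb(n, k)                     # number of paths with k ones
            N += c
            P += exp(log(c) + k * log(p) + (n - k) * log(q))
    return N, P

n, p, eps = 1000, Fraction(3, 10), Fraction(1, 20)
H = -float(p) * log(p) - float(1 - p) * log(1 - p)
N, P = typical_set(n, p, eps)
print("H =", H)
print("(1/n) ln N =", log(N) / n)
print("P(C) =", P)
```

By the argument in the proof, the exponent $(1/n)\ln N$ differs from $H$ by at most about $\varepsilon_1 \cdot (-\sum_k \ln p_k)$ once $P(C)$ is close to 1.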


5. The law of large numbers for Bernoulli schemes lets us give a simple and
elegant proof of Weierstrass's theorem on the approximation of continuous
functions by polynomials.
Let $f = f(p)$ be a continuous function on the interval $[0, 1]$. We introduce
the polynomials

$$B_n(p) = \sum_{k=0}^{n} f\left(\frac{k}{n}\right) C_n^k p^k q^{n-k}, \qquad q = 1 - p,$$

which are called Bernstein polynomials after the inventor of this proof of
Weierstrass's theorem.
If $\xi_1, \ldots, \xi_n$ is a sequence of independent Bernoulli random variables
with $P\{\xi_i = 1\} = p$, $P\{\xi_i = 0\} = q$ and $S_n = \xi_1 + \cdots + \xi_n$, then

$$E f\left(\frac{S_n}{n}\right) = B_n(p).$$

Since the function $f = f(p)$, being continuous on $[0, 1]$, is uniformly continuous, for every $\varepsilon > 0$ we can find $\delta > 0$ such that $|f(x) - f(y)| \le \varepsilon$
whenever $|x - y| \le \delta$. It is also clear that the function is bounded: $|f(x)| \le M < \infty$.
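Using this and (5), we obtain

$$|f(p) - B_n(p)| = \left|\sum_{k=0}^{n} \left[f(p) - f\left(\frac{k}{n}\right)\right] C_n^k p^k q^{n-k}\right|$$

$$\le \sum_{\{k:\,|(k/n)-p| \le \delta\}} \left|f(p) - f\left(\frac{k}{n}\right)\right| C_n^k p^k q^{n-k} + \sum_{\{k:\,|(k/n)-p| > \delta\}} \left|f(p) - f\left(\frac{k}{n}\right)\right| C_n^k p^k q^{n-k}$$

$$\le \varepsilon + 2M \sum_{\{k:\,|(k/n)-p| > \delta\}} C_n^k p^k q^{n-k} \le \varepsilon + \frac{2M}{4n\delta^2} = \varepsilon + \frac{M}{2n\delta^2}.$$

Hence

$$\lim_{n \to \infty} \max_{0 \le p \le 1} |f(p) - B_n(p)| = 0,$$

which is the conclusion of Weierstrass's theorem.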
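
The convergence of the Bernstein polynomials can be observed numerically. The sketch below is an illustration under assumed choices: the test function $f(x) = |x - \frac12|$, the grid of 201 points and the function names are not from the text. For this particular $f$, which satisfies $|f(x) - f(y)| \le |x - y|$, the error at $p$ is at most $E|S_n/n - p| \le \sqrt{pq/n} \le 1/(2\sqrt{n})$, and that bound is printed for comparison.

```python
from math import comb, sqrt

def bernstein(f, n, p):
    """B_n(p) = sum_k f(k/n) C(n,k) p^k q^(n-k), i.e. E f(S_n / n)."""
    q = 1.0 - p
    return sum(f(k / n) * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

def max_error(f, n, grid=201):
    """max |f(p) - B_n(p)| over an evenly spaced grid on [0, 1]."""
    pts = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(p) - bernstein(f, n, p)) for p in pts)

f = lambda x: abs(x - 0.5)      # continuous, not differentiable at 1/2
for n in (10, 100, 400):
    print(n, max_error(f, n), 1 / (2 * sqrt(n)))
```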


6. PROBLEMS

1. Let $\xi$ and $\eta$ be random variables with correlation coefficient $\rho$. Establish the following
two-dimensional analog of Chebyshev's inequality:

$$P\{|\xi - E\xi| \ge \varepsilon\sqrt{V\xi} \ \text{or}\ |\eta - E\eta| \ge \varepsilon\sqrt{V\eta}\} \le \frac{1}{\varepsilon^2}\left(1 + \sqrt{1 - \rho^2}\right).$$

(Hint: Use the result of Problem 8 of §4.)

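2. Let $f = f(x)$ be a nonnegative even function that is nondecreasing for positive $x$.
Then for a random variable $\xi$ with $|\xi(\omega)| \le C$,

$$\frac{E f(\xi) - f(\varepsilon)}{f(C)} \le P\{|\xi - E\xi| \ge \varepsilon\} \le \frac{E f(\xi - E\xi)}{f(\varepsilon)}.$$

In particular, if $f(x) = x^2$,

$$\frac{E\xi^2 - \varepsilon^2}{C^2} \le P\{|\xi - E\xi| \ge \varepsilon\} \le \frac{V\xi}{\varepsilon^2}.$$

3. Let $\xi_1, \ldots, \xi_n$ be a sequence of independent random variables with $V\xi_i \le C$. Then

$$P\left\{\left|\frac{\xi_1 + \cdots + \xi_n}{n} - \frac{E(\xi_1 + \cdots + \xi_n)}{n}\right| \ge \varepsilon\right\} \le \frac{C}{n\varepsilon^2}. \tag{15}$$

(With the same reservations as in (8), inequality (15) implies the validity of the law of
large numbers in more general contexts than Bernoulli schemes.)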
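4. Let $\xi_1, \ldots, \xi_n$ be independent Bernoulli random variables with $P\{\xi_i = 1\} = p > 0$,
$P\{\xi_i = -1\} = 1 - p$. Derive the following inequality of Bernstein: there is a number
$a > 0$ such that

$$P\left\{\left|\frac{S_n}{n} - (2p - 1)\right| \ge \varepsilon\right\} \le 2e^{-a\varepsilon^2 n},$$

where $S_n = \xi_1 + \cdots + \xi_n$ and $\varepsilon > 0$.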

6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson)
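1. As in the preceding section, let

$$S_n = \xi_1 + \cdots + \xi_n.$$

Then

$$E\,\frac{S_n}{n} = p, \tag{1}$$

and by (4.14)

$$E\left(\frac{S_n}{n} - p\right)^2 = \frac{pq}{n}. \tag{2}$$

It follows from (1) that $S_n/n \sim p$, where the equivalence symbol $\sim$ has been
given a precise meaning in the law of large numbers in terms of an inequality
for $P\{|(S_n/n) - p| \ge \varepsilon\}$. It is natural to suppose that, in a similar way, the
relation

$$\left|\frac{S_n}{n} - p\right| \sim \sqrt{\frac{pq}{n}}, \tag{3}$$

which follows from (2), can also be given a precise probabilistic meaning
involving, for example, probabilities of the form

$$P\left\{\left|\frac{S_n}{n} - p\right| \le x\sqrt{\frac{pq}{n}}\right\}, \qquad x \in R^1,$$

or equivalently

$$P\left\{\left|\frac{S_n - ES_n}{\sqrt{VS_n}}\right| \le x\right\}$$

(since $ES_n = np$ and $VS_n = npq$).
If, as before, we write

$$P_n(k) = C_n^k p^k q^{n-k}, \qquad 0 \le k \le n,$$

for $n \ge 1$, then

$$P\left\{\left|\frac{S_n - ES_n}{\sqrt{VS_n}}\right| \le x\right\} = \sum_{\{k:\,|(k-np)/\sqrt{npq}| \le x\}} P_n(k). \tag{4}$$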

We set the problem of finding convenient asymptotic formulas, as $n \to \infty$,
for $P_n(k)$ and for their sum over the values of $k$ that satisfy the condition on
the right-hand side of (4).
The following result provides an answer not only for these values of $k$
(that is, for those satisfying $|k - np| = O(\sqrt{npq})$) but also for those satisfying
$|k - np| = o(npq)^{2/3}$.

Local Limit Theorem. Let $0 < p < 1$; then

$$P_n(k) \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-(k-np)^2/(2npq)}, \tag{5}$$

uniformly for $k$ such that $|k - np| = o(npq)^{2/3}$, i.e. as $n \to \infty$

$$\sup_{\{k:\,|k-np| \le \varphi(n)\}} \left|\frac{P_n(k)}{\dfrac{1}{\sqrt{2\pi npq}}\, e^{-(k-np)^2/(2npq)}} - 1\right| \to 0,$$

where $\varphi(n) = o(npq)^{2/3}$.
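
A quick numerical check of (5) is sketched below. It prints the ratio of the binomial probability to the normal expression for a few $k$ within about two standard deviations of $np$. The values $n = 1000$, $p = 0.3$ and the function names are illustrative assumptions; the binomial term is evaluated in log space with `lgamma` to avoid overflow.

```python
from math import exp, lgamma, log, pi, sqrt

def binom_pmf(n, k, p):
    """P_n(k) = C(n,k) p^k q^(n-k), computed in log space."""
    log_c = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_c + k * log(p) + (n - k) * log(1 - p))

def local_normal(n, k, p):
    """Right-hand side of (5)."""
    npq = n * p * (1 - p)
    return exp(-(k - n * p) ** 2 / (2 * npq)) / sqrt(2 * pi * npq)

n, p = 1000, 0.3
for k in (270, 285, 300, 315, 330):
    print(k, binom_pmf(n, k, p) / local_normal(n, k, p))
```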


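The proof depends on Stirling's formula (2.6)

$$n! = \sqrt{2\pi n}\, e^{-n} n^n (1 + R(n)),$$

where $R(n) \to 0$ as $n \to \infty$.
Then if $n \to \infty$, $k \to \infty$, $n - k \to \infty$, we have

$$C_n^k = \frac{n!}{k!\,(n-k)!} = \frac{\sqrt{2\pi n}\, e^{-n} n^n}{\sqrt{2\pi k \cdot 2\pi(n-k)}\; e^{-k} k^k \cdot e^{-(n-k)} (n-k)^{n-k}} \cdot \frac{1 + R(n)}{(1 + R(k))(1 + R(n-k))}$$

$$= \frac{1}{\sqrt{2\pi n\, \dfrac{k}{n}\left(1 - \dfrac{k}{n}\right)}} \cdot \frac{1 + \varepsilon(n, k, n-k)}{\left(\dfrac{k}{n}\right)^k \left(1 - \dfrac{k}{n}\right)^{n-k}},$$

where $\varepsilon = \varepsilon(n, k, n-k)$ is defined in an evident way and $\varepsilon \to 0$ as $n \to \infty$,
$k \to \infty$, $n - k \to \infty$.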
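Therefore

$$P_n(k) = C_n^k p^k q^{n-k} = \frac{1}{\sqrt{2\pi n\, \dfrac{k}{n}\left(1 - \dfrac{k}{n}\right)}} \cdot \frac{p^k (1-p)^{n-k}}{\left(\dfrac{k}{n}\right)^k \left(1 - \dfrac{k}{n}\right)^{n-k}}\, (1 + \varepsilon).$$

Write $\hat p = k/n$. Then

$$P_n(k) = \frac{1}{\sqrt{2\pi n \hat p(1 - \hat p)}} \left(\frac{p}{\hat p}\right)^k \left(\frac{1-p}{1-\hat p}\right)^{n-k} (1 + \varepsilon)$$

$$= \frac{1}{\sqrt{2\pi n \hat p(1 - \hat p)}} \exp\left\{k \ln\frac{p}{\hat p} + (n-k) \ln\frac{1-p}{1-\hat p}\right\} (1 + \varepsilon)$$

$$= \frac{1}{\sqrt{2\pi n \hat p(1 - \hat p)}} \exp\left\{n\left[\frac{k}{n} \ln\frac{p}{\hat p} + \left(1 - \frac{k}{n}\right) \ln\frac{1-p}{1-\hat p}\right]\right\} (1 + \varepsilon)$$

$$= \frac{1}{\sqrt{2\pi n \hat p(1 - \hat p)}} \exp\{-nH(\hat p)\}\,(1 + \varepsilon),$$

where

$$H(x) = x \ln\frac{x}{p} + (1 - x) \ln\frac{1-x}{1-p}.$$

We are considering values of $k$ such that $|k - np| = o(npq)^{2/3}$, and consequently $p - \hat p \to 0$, $n \to \infty$.

Since, for $0 < x < 1$,

$$H'(x) = \ln\frac{x}{p} - \ln\frac{1-x}{1-p}, \qquad H''(x) = \frac{1}{x} + \frac{1}{1-x}, \qquad H'''(x) = -\frac{1}{x^2} + \frac{1}{(1-x)^2},$$

if we write $H(\hat p)$ in the form $H(p + (\hat p - p))$ and use Taylor's formula, we
find that for sufficiently large $n$

$$H(\hat p) = H(p) + H'(p)(\hat p - p) + \tfrac12 H''(p)(\hat p - p)^2 + O(|\hat p - p|^3)$$

$$= \frac12\left(\frac1p + \frac1q\right)(\hat p - p)^2 + O(|\hat p - p|^3).$$

Consequently

$$P_n(k) = \frac{1}{\sqrt{2\pi n \hat p(1 - \hat p)}} \exp\left\{-\frac{n}{2pq}(\hat p - p)^2 + n\,O(|\hat p - p|^3)\right\} (1 + \varepsilon).$$

Notice that

$$\frac{n}{2pq}(\hat p - p)^2 = \frac{n}{2pq}\left(\frac{k}{n} - p\right)^2 = \frac{(k - np)^2}{2npq}.$$

Therefore

$$P_n(k) = \frac{1}{\sqrt{2\pi npq}}\, e^{-(k-np)^2/(2npq)}\,(1 + \varepsilon'(n, k, n-k)),$$

where

$$1 + \varepsilon'(n, k, n-k) = (1 + \varepsilon(n, k, n-k)) \exp\{n\,O(|p - \hat p|^3)\} \sqrt{\frac{p(1-p)}{\hat p(1-\hat p)}}$$

and, as is easily seen,

$$\sup |\varepsilon'(n, k, n-k)| \to 0, \qquad n \to \infty,$$

if the sup is taken over the values of $k$ for which

$$|k - np| \le \varphi(n), \qquad \varphi(n) = o(npq)^{2/3}.$$

This completes the proof.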

Corollary. The conclusion of the local limit theorem can be put in the following
equivalent form: For all $x \in R^1$ such that $x = o(npq)^{1/6}$, and for $np + x\sqrt{npq}$
an integer from the set $\{0, 1, \ldots, n\}$,

$$P_n(np + x\sqrt{npq}) \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-x^2/2}, \tag{7}$$

i.e. as $n \to \infty$,

$$\sup_{\{x:\,|x| \le \psi(n)\}} \left|\frac{P_n(np + x\sqrt{npq})}{\dfrac{1}{\sqrt{2\pi npq}}\, e^{-x^2/2}} - 1\right| \to 0, \tag{8}$$

where $\psi(n) = o(npq)^{1/6}$.

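With the reservations made in connection with formula (5.8), we can
reformulate these results in probabilistic language in the following way:

$$P\{S_n = k\} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-(k-np)^2/(2npq)}, \qquad |k - np| = o(npq)^{2/3}, \tag{9}$$

$$P\left\{\frac{S_n - np}{\sqrt{npq}} = x\right\} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-x^2/2}, \qquad x = o(npq)^{1/6}. \tag{10}$$

(In the last formula $np + x\sqrt{npq}$ is assumed to have one of the values
$0, 1, \ldots, n$.)
If we put $t_k = (k - np)/\sqrt{npq}$ and $\Delta t_k = t_{k+1} - t_k = 1/\sqrt{npq}$, the preceding formula assumes the form

$$P\left\{\frac{S_n - np}{\sqrt{npq}} = t_k\right\} \sim \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2}, \qquad t_k = o(npq)^{1/6}. \tag{11}$$

It is clear that $\Delta t_k = 1/\sqrt{npq} \to 0$ and the set of points $\{t_k\}$ as it were
"fills" the real line. It is natural to expect that (11) can be used to obtain the
integral formula

$$P\left\{a < \frac{S_n - np}{\sqrt{npq}} \le b\right\} \sim \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx, \qquad -\infty < a \le b < \infty.$$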

Let us now give a precise statement.


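2. For $-\infty < a \le b < \infty$ let

$$P_n(a, b] = \sum_{a < x \le b} P_n(np + x\sqrt{npq}),$$

where the summation is over those $x$ for which $np + x\sqrt{npq}$ is an integer.

It follows from the local theorem (see also (11)) that for all $t_k$ defined by
$k = np + t_k\sqrt{npq}$ and satisfying $|t_k| \le T < \infty$,

$$P_n(np + t_k\sqrt{npq}) = \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2}\,[1 + \varepsilon(t_k, n)], \tag{12}$$

where

$$\sup_{|t_k| \le T} |\varepsilon(t_k, n)| \to 0, \qquad n \to \infty. \tag{13}$$

Consequently, if $a$ and $b$ are given so that $-T \le a \le b \le T$, then

$$\sum_{a < t_k \le b} P_n(np + t_k\sqrt{npq}) = \sum_{a < t_k \le b} \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2} + \sum_{a < t_k \le b} \varepsilon(t_k, n)\, \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2}$$

$$= \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx + R_n^{(1)}(a, b) + R_n^{(2)}(a, b), \tag{14}$$

where

$$R_n^{(1)}(a, b) = \sum_{a < t_k \le b} \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2} - \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx,$$

$$R_n^{(2)}(a, b) = \sum_{a < t_k \le b} \varepsilon(t_k, n)\, \frac{\Delta t_k}{\sqrt{2\pi}}\, e^{-t_k^2/2}.$$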

From the standard properties of Riemann sums,

$$\sup_{-T \le a \le b \le T} |R_n^{(1)}(a, b)| \to 0, \qquad n \to \infty. \tag{15}$$

It is also clear that

$$\sup_{-T \le a \le b \le T} |R_n^{(2)}(a, b)| \le \sup_{|t_k| \le T} |\varepsilon(t_k, n)| \times \left[\frac{1}{\sqrt{2\pi}} \int_{-T}^{T} e^{-x^2/2}\,dx + \sup_{-T \le a \le b \le T} |R_n^{(1)}(a, b)|\right] \to 0, \tag{16}$$

where the convergence of the right-hand side to zero follows from (15) and
from

$$\frac{1}{\sqrt{2\pi}} \int_{-T}^{T} e^{-x^2/2}\,dx \le \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 1, \tag{17}$$

the value of the last integral being well known.

We write

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

Then it follows from (14)-(16) that

$$\sup_{-T \le a \le b \le T} |P_n(a, b] - (\Phi(b) - \Phi(a))| \to 0, \qquad n \to \infty. \tag{18}$$

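We now show that this result holds for $T = \infty$ as well as for finite $T$. By
(17), corresponding to a given $\varepsilon > 0$ we can find a finite $T = T(\varepsilon)$ such that

$$\frac{1}{\sqrt{2\pi}} \int_{-T}^{T} e^{-x^2/2}\,dx > 1 - \tfrac14 \varepsilon. \tag{19}$$

According to (18), we can find an $N$ such that for all $n > N$ and $T = T(\varepsilon)$
we have

$$\sup_{-T \le a \le b \le T} |P_n(a, b] - (\Phi(b) - \Phi(a))| < \tfrac14 \varepsilon. \tag{20}$$

It follows from this and (19) that

$$P_n(-T, T] > 1 - \tfrac12 \varepsilon,$$

and consequently

$$P_n(-\infty, -T] + P_n(T, \infty) \le \tfrac12 \varepsilon,$$

where $P_n(-\infty, T] = \lim_{S \downarrow -\infty} P_n(S, T]$ and $P_n(T, \infty) = \lim_{S \uparrow \infty} P_n(T, S]$.
Therefore for $-\infty \le a \le -T < T \le b \le \infty$,

$$\left|P_n(a, b] - \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx\right| \le \left|P_n(-T, T] - \frac{1}{\sqrt{2\pi}} \int_{-T}^{T} e^{-x^2/2}\,dx\right|$$

$$+ \left|P_n(a, -T] - \frac{1}{\sqrt{2\pi}} \int_a^{-T} e^{-x^2/2}\,dx\right| + \left|P_n(T, b] - \frac{1}{\sqrt{2\pi}} \int_T^b e^{-x^2/2}\,dx\right|$$

$$\le \tfrac14 \varepsilon + P_n(-\infty, -T] + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-T} e^{-x^2/2}\,dx + P_n(T, \infty) + \frac{1}{\sqrt{2\pi}} \int_T^{\infty} e^{-x^2/2}\,dx$$

$$\le \tfrac14 \varepsilon + \tfrac12 \varepsilon + \tfrac18 \varepsilon + \tfrac18 \varepsilon = \varepsilon.$$

By using (18) it is now easy to see that $P_n(a, b]$ tends uniformly to $\Phi(b) - \Phi(a)$ for $-\infty \le a < b \le \infty$.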
Thus we have proved the following theorem.

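De Moivre-Laplace Integral Theorem. Let $0 < p < 1$,

$$P_n(k) = C_n^k p^k q^{n-k}, \qquad P_n(a, b] = \sum_{a < x \le b} P_n(np + x\sqrt{npq}).$$

Then

$$\sup_{-\infty \le a < b \le \infty} \left|P_n(a, b] - \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx\right| \to 0, \qquad n \to \infty. \tag{21}$$

With the same reservations as in (5.8), (21) can be stated in probabilistic
language in the following way:

$$\sup_{-\infty \le a < b \le \infty} \left|P\left\{a < \frac{S_n - ES_n}{\sqrt{VS_n}} \le b\right\} - \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx\right| \to 0, \qquad n \to \infty.$$

It follows at once from this formula that

$$P\{A < S_n \le B\} - \left[\Phi\left(\frac{B - np}{\sqrt{npq}}\right) - \Phi\left(\frac{A - np}{\sqrt{npq}}\right)\right] \to 0, \tag{22}$$

as $n \to \infty$, whenever $-\infty \le A < B \le \infty$.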


EXAMPLE. A true die is tossed 12 000 times. We ask for the probability $P$ that
the number of 6's lies in the interval $(1800, 2100]$.
The required probability is

$$P = \sum_{1800 < k \le 2100} C_{12\,000}^{k} \left(\frac16\right)^k \left(\frac56\right)^{12\,000 - k}.$$

An exact calculation of this sum would obviously be rather difficult.
However, if we use the integral theorem we find that the probability $P$ in
question is ($n = 12\,000$, $p = \frac16$, $a = 1800$, $b = 2100$)

$$\Phi\left(\frac{2100 - 2000}{\sqrt{12\,000 \cdot \frac16 \cdot \frac56}}\right) - \Phi\left(\frac{1800 - 2000}{\sqrt{12\,000 \cdot \frac16 \cdot \frac56}}\right) = \Phi(\sqrt6) - \Phi(-2\sqrt6) \approx \Phi(2.449) - \Phi(-4.898) \approx 0.992,$$

where the values of $\Phi(2.449)$ and $\Phi(-4.898)$ were taken from tables of $\Phi(x)$
(this is the normal distribution function; see Subsection 6 below).
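
With a computer the exact sum is no longer hard to evaluate, so the approximation can be compared with it directly. The following is a sketch under stated assumptions: $\Phi$ is evaluated through the error function `math.erf`, the binomial terms are computed in log space with `lgamma`, and the function names are illustrative.

```python
from math import erf, exp, lgamma, log, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binom_interval(n, p, lo, hi):
    """P{lo < S_n <= hi}, summed term by term in log space."""
    total = 0.0
    for k in range(lo + 1, hi + 1):
        log_c = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_c + k * log(p) + (n - k) * log(1 - p))
    return total

n, p, A, B = 12000, 1 / 6, 1800, 2100
s = sqrt(n * p * (1 - p))
approx = Phi((B - n * p) / s) - Phi((A - n * p) / s)   # formula (22)
exact = binom_interval(n, p, A, B)
print("normal approximation:", approx)
print("exact binomial sum:  ", exact)
```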
3. We have plotted a graph of $P_n(np + x\sqrt{npq})$ (with $x$ assumed such that
$np + x\sqrt{npq}$ is an integer) in Figure 9.
Then the local theorem says that when $x = o(npq)^{1/6}$, the curve
$(1/\sqrt{2\pi npq})\,e^{-x^2/2}$ provides a close fit to $P_n(np + x\sqrt{npq})$. On the other hand
the integral theorem says that $P_n(a, b] = P\{a\sqrt{npq} < S_n - np \le b\sqrt{npq}\} = P\{np + a\sqrt{npq} < S_n \le np + b\sqrt{npq}\}$ is closely approximated by the integral
$(1/\sqrt{2\pi}) \int_a^b e^{-x^2/2}\,dx$.

Figure 9. Graph of $P_n(np + x\sqrt{npq})$.

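We write

$$F_n(x) = P_n(-\infty, x] \quad \left(= P\left\{\frac{S_n - np}{\sqrt{npq}} \le x\right\}\right).$$

Then it follows from (21) that

$$\sup_{-\infty \le x \le \infty} |F_n(x) - \Phi(x)| \to 0, \qquad n \to \infty. \tag{23}$$

It is natural to ask how rapid the approach to zero is in (21) and (23),
as $n \to \infty$. We quote a result in this direction (a special case of the Berry-Esseen theorem: see §6 in Chapter III):

$$\sup_{-\infty \le x \le \infty} |F_n(x) - \Phi(x)| \le \frac{p^2 + q^2}{\sqrt{npq}}. \tag{24}$$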

It is important to recognize that the order of the estimate $(1/\sqrt{npq})$
cannot be improved; this means that the approximation of $F_n(x)$ by $\Phi(x)$
can be poor for values of $p$ that are close to 0 or 1, even when $n$ is large. This
suggests the question of whether there is a better method of approximation
for the probabilities of interest when $p$ or $q$ is small, something better than
the normal approximation given by the local and integral theorems. In this
connection we note that for $p = \frac12$, say, the binomial distribution $\{P_n(k)\}$ is
symmetric (Figure 10). However, for small $p$ the binomial distribution is
asymmetric (Figure 10), and hence it is not reasonable to expect that the
normal approximation will be satisfactory.
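
The effect of small $p$ on the quality of the normal approximation can be seen numerically. The sketch below computes $\sup_x |F_n(x) - \Phi(x)|$ exactly (the supremum is attained at a jump of $F_n$, either at the jump or in its left limit) and prints it beside the bound (24). The values $n = 100$, $p = 0.5$ and $p = 0.05$ are illustrative choices.

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sup_distance(n, p):
    """sup_x |F_n(x) - Phi(x)| for F_n the distribution function of (S_n - np)/sqrt(npq)."""
    q = 1.0 - p
    s = sqrt(n * p * q)
    F, worst = 0.0, 0.0
    for k in range(n + 1):
        t = (k - n * p) / s
        worst = max(worst, abs(F - Phi(t)))    # left limit of F_n at the jump t
        F += comb(n, k) * p**k * q**(n - k)
        worst = max(worst, abs(F - Phi(t)))    # value of F_n at the jump t
    return worst

n = 100
for p in (0.5, 0.05):
    bound = (p * p + (1 - p) ** 2) / sqrt(n * p * (1 - p))
    print(p, sup_distance(n, p), bound)
```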
Figure 10. The binomial probabilities $P_n(k)$ for $n = 10$ (two panels: $p = \frac12$, and a smaller value of $p$).

4. It turns out that for small values of $p$ the distribution known as the Poisson
distribution provides a good approximation to $\{P_n(k)\}$.
Let

$$P_n(k) = \begin{cases} C_n^k p^k q^{n-k}, & k = 0, 1, \ldots, n, \\ 0, & k = n + 1, n + 2, \ldots, \end{cases}$$

and suppose that $p$ is a function $p(n)$ of $n$.

Poisson's Theorem. Let $p(n) \to 0$, $n \to \infty$, in such a way that $np(n) \to \lambda$,
where $\lambda > 0$. Then for $k = 0, 1, 2, \ldots$,

$$P_n(k) \to \pi_k, \qquad n \to \infty, \tag{25}$$

where

$$\pi_k = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, \ldots. \tag{26}$$

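The proof is extremely simple. Since $p(n) = (\lambda/n) + o(1/n)$ by hypothesis,
for a given $k = 0, 1, \ldots$ and sufficiently large $n$,

$$P_n(k) = C_n^k p^k q^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!} \left[\frac{\lambda}{n} + o\left(\frac1n\right)\right]^k \left[1 - \frac{\lambda}{n} + o\left(\frac1n\right)\right]^{n-k}.$$

But

$$n(n-1)\cdots(n-k+1) \left[\frac{\lambda}{n} + o\left(\frac1n\right)\right]^k = \frac{n(n-1)\cdots(n-k+1)}{n^k}\, [\lambda + o(1)]^k \to \lambda^k, \qquad n \to \infty,$$

and

$$\left[1 - \frac{\lambda}{n} + o\left(\frac1n\right)\right]^{n-k} \to e^{-\lambda}, \qquad n \to \infty,$$

which establishes (25).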


The set of numbers $\{\pi_k,\ k = 0, 1, \ldots\}$ defines the Poisson probability
distribution ($\pi_k \ge 0$, $\sum_{k=0}^{\infty} \pi_k = 1$). Notice that all the (discrete) distributions
considered previously were concentrated at only a finite number of points.
The Poisson distribution is the first example that we have encountered of a
(discrete) distribution concentrated at a countable number of points.
The following result of Prohorov exhibits the rapidity with which $P_n(k)$
converges to $\pi_k$ as $n \to \infty$: if $np(n) = \lambda > 0$, then
(27)
(For the proof of (27), see §7 of Chapter III.)
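
The speed of convergence in Poisson's theorem can be examined numerically. The sketch below computes $\sum_k |P_n(k) - \pi_k|$ for $p = \lambda/n$ using the recurrences $P_n(k+1) = P_n(k)\,\frac{n-k}{k+1}\,\frac{p}{q}$ and $\pi_{k+1} = \pi_k\,\frac{\lambda}{k+1}$. The value $\lambda = 3$, the sample sizes and the function name are illustrative assumptions, and the code requires $\lambda < n$.

```python
from math import exp

def l1_distance(n, lam):
    """sum_k |P_n(k) - pi_k| for p = lam / n, with P_n(k) = 0 for k > n."""
    p = lam / n
    q = 1.0 - p
    b = q**n                 # P_n(0)
    pois = exp(-lam)         # pi_0
    total, pois_mass = 0.0, 0.0
    for k in range(n + 1):
        total += abs(b - pois)
        pois_mass += pois
        b *= (n - k) / (k + 1) * p / q     # P_n(k+1) from P_n(k)
        pois *= lam / (k + 1)              # pi_{k+1} from pi_k
    return total + max(0.0, 1.0 - pois_mass)   # Poisson mass beyond k = n

lam = 3.0
for n in (30, 300, 3000):
    print(n, l1_distance(n, lam))
```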

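5. Let us return to the De Moivre-Laplace limit theorem, and show how it
implies the law of large numbers (with the same reservation that was made
in connection with (5.8)). Since

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} = P\left\{\left|\frac{S_n - np}{\sqrt{npq}}\right| \le \varepsilon\sqrt{\frac{n}{pq}}\right\},$$

it is clear from (21) that when $\varepsilon > 0$

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} - \frac{1}{\sqrt{2\pi}} \int_{-\varepsilon\sqrt{n/(pq)}}^{\varepsilon\sqrt{n/(pq)}} e^{-x^2/2}\,dx \to 0, \qquad n \to \infty, \tag{28}$$

whence

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} \to 1, \qquad n \to \infty,$$

which is the conclusion of the law of large numbers.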


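From (28)

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} \sim \frac{1}{\sqrt{2\pi}} \int_{-\varepsilon\sqrt{n/(pq)}}^{\varepsilon\sqrt{n/(pq)}} e^{-x^2/2}\,dx, \qquad n \to \infty, \tag{29}$$

whereas Chebyshev's inequality yielded only

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} \ge 1 - \frac{pq}{n\varepsilon^2}.$$

It was shown at the end of §5 that Chebyshev's inequality yielded the estimate

$$n \ge \frac{1}{4\varepsilon^2 \alpha}$$

for the number of observations needed for the validity of the inequality

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} \ge 1 - \alpha.$$

Thus with $\varepsilon = 0.02$ and $\alpha = 0.05$, 12 500 observations were needed. We can
now solve the same problem by using the approximation (29).
We define the number $k(\alpha)$ by

$$\frac{1}{\sqrt{2\pi}} \int_{-k(\alpha)}^{k(\alpha)} e^{-x^2/2}\,dx = 1 - \alpha.$$

Since $\varepsilon\sqrt{n/(pq)} \ge 2\varepsilon\sqrt{n}$, if we define $n$ as the smallest integer satisfying

$$2\varepsilon\sqrt{n} \ge k(\alpha) \tag{30}$$

we find that

$$P\left\{\left|\frac{S_n}{n} - p\right| \le \varepsilon\right\} \gtrsim 1 - \alpha. \tag{31}$$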

We find from (30) that the smallest integer $n$ satisfying

$$n \ge \frac{k^2(\alpha)}{4\varepsilon^2}$$

guarantees that (31) is satisfied, and the accuracy of the approximation can
easily be established by using (24).
Taking $\varepsilon = 0.02$, $\alpha = 0.05$, we find that in fact 2500 observations suffice,
rather than the 12 500 found by using Chebyshev's inequality. The values
of $k(\alpha)$ have been tabulated. We quote a number of values of $k(\alpha)$ for various
values of $\alpha$:

    α         k(α)
    0.50      0.675
    0.3173    1.000
    0.10      1.645
    0.05      1.960
    0.0454    2.000
    0.01      2.576
    0.0027    3.000
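
The tabulated values and the resulting sample size can be reproduced with the standard library. This is a sketch under the assumption that $k(\alpha)$ is obtained as the quantile $\Phi^{-1}(1 - \alpha/2)$, which is equivalent to its definition above by the symmetry of the normal density; the function names are illustrative. With $k(0.05) \approx 1.960$ the formula gives a value of $n$ slightly below the round figure 2500 quoted in the text, which is what the same formula gives when $k(\alpha)$ is rounded up to 2.

```python
from math import ceil
from statistics import NormalDist

def k(alpha):
    """Solve (1/sqrt(2 pi)) * integral_{-k}^{k} exp(-x^2/2) dx = 1 - alpha for k."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def normal_sample_size(eps, alpha):
    """Smallest integer n with n >= k(alpha)^2 / (4 eps^2)."""
    return ceil(k(alpha) ** 2 / (4.0 * eps**2))

for alpha in (0.50, 0.3173, 0.10, 0.05, 0.0454, 0.01, 0.0027):
    print(alpha, round(k(alpha), 3))
print(normal_sample_size(0.02, 0.05))
```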

6. The function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt, \tag{32}$$

which was introduced above and occurs in the De Moivre-Laplace integral
theorem, plays an exceptionally important role in probability theory. It is
known as the normal or Gaussian distribution on the real line, with the
(normal or Gaussian) density

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad x \in R^1.$$

Figure 11. Graph of the normal probability density $\varphi(x)$.

Figure 12. Graph of the normal distribution $\Phi(x)$.

We have already encountered (discrete) distributions concentrated on a
finite or countable set of points. The normal distribution belongs to another
important class of distributions that arise in probability theory. We have
mentioned its exceptional role; this comes about, first of all, because under
rather general hypotheses, sums of a large number of independent random
variables (not necessarily Bernoulli variables) are closely approximated by
the normal distribution (§4 of Chapter III). For the present we mention only
some of the simplest properties of $\varphi(x)$ and $\Phi(x)$, whose graphs are shown in
Figures 11 and 12.
The function $\varphi(x)$ is a symmetric bell-shaped curve, decreasing very
rapidly with increasing $|x|$: thus $\varphi(1) = 0.24197$, $\varphi(2) = 0.053991$, $\varphi(3) = 0.004432$, $\varphi(4) = 0.000134$, $\varphi(4.5) = 0.000016$. Its maximum is attained at
$x = 0$ and is equal to $(2\pi)^{-1/2} \approx 0.399$.
The curve $\Phi(x) = (1/\sqrt{2\pi}) \int_{-\infty}^{x} e^{-t^2/2}\,dt$ approaches 1 very rapidly
as $x$ increases: $\Phi(1) = 0.841345$, $\Phi(2) = 0.977250$, $\Phi(3) = 0.998650$, $\Phi(4) = 0.999968$, $\Phi(4.5) = 0.999997$.
For tables of $\varphi(x)$ and $\Phi(x)$, as well as of other important functions that
are used in probability theory and mathematical statistics, see [A1].
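
In place of tables, the values quoted above can be recomputed directly. A minimal sketch, assuming $\Phi$ is expressed through the error function as $\Phi(x) = \frac12(1 + \operatorname{erf}(x/\sqrt2))$:

```python
from math import erf, exp, pi, sqrt

def phi(x):
    """Normal density."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def Phi(x):
    """Normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for x in (1, 2, 3, 4, 4.5):
    print(x, round(phi(x), 6), round(Phi(x), 6))
```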

7. PROBLEMS
1. Let $n = 100$, $p = \frac{1}{10}, \frac{2}{10}, \frac{3}{10}, \frac{4}{10}, \frac{5}{10}$. Using tables (for example, those in [A1]) of the
binomial and Poisson distributions, compare the values of the probabilities

$$P\{10 < S_{100} \le 12\}, \quad P\{20 < S_{100} \le 22\}, \quad P\{33 < S_{100} \le 35\}, \quad P\{40 < S_{100} \le 42\}, \quad P\{50 < S_{100} \le 52\}$$

with the corresponding values given by the normal and Poisson approximations.

2. Let $p = \frac12$ and $Z_n = 2S_n - n$ (the excess of 1's over 0's in $n$ trials). Show that

$$\sup_j \left|\sqrt{\pi n}\, P\{Z_{2n} = j\} - e^{-j^2/(4n)}\right| \to 0, \qquad n \to \infty.$$

3. Show that the rate of convergence in Poisson's theorem is given by

7. Estimating the Probability of Success in the Bernoulli Scheme
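1. In the Bernoulli scheme $(\Omega, \mathscr{A}, P)$ with

$$\Omega = \{\omega : \omega = (x_1, \ldots, x_n),\ x_i = 0, 1\}, \qquad \mathscr{A} = \{A : A \subseteq \Omega\}, \qquad p(\omega) = p^{\sum x_i} q^{n - \sum x_i},$$

we supposed that $p$ (the probability of success) was known.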


Let us now suppose that $p$ is not known in advance and that we want to
determine it by observing the outcomes of experiments; or, what amounts
to the same thing, by observations of the random variables $\xi_1, \ldots, \xi_n$, where
$\xi_i(\omega) = x_i$. This is a typical problem of mathematical statistics, and can be
formulated in various ways. We shall consider two of the possible formulations: the problem of estimation and the problem of constructing confidence
intervals.
In the notation used in mathematical statistics, the unknown parameter
is denoted by $\theta$, assuming a priori that $\theta$ belongs to the set $\Theta = [0, 1]$. We
say that the set $(\Omega, \mathscr{A}, P_\theta;\ \theta \in \Theta)$ with $p_\theta(\omega) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}$ is a probabilistic-statistical model (corresponding to "$n$ independent trials" with probability of "success" $\theta \in \Theta$), and any function $T_n = T_n(\omega)$ with values in $\Theta$ is
called an estimator.
If $S_n = \xi_1 + \cdots + \xi_n$ and $T_n^* = S_n/n$, it follows from the law of large
numbers that $T_n^*$ is consistent, in the sense that ($\varepsilon > 0$)

$$P_\theta\{|T_n^* - \theta| \ge \varepsilon\} \to 0, \qquad n \to \infty. \tag{1}$$

Moreover, this estimator is unbiased: for every $\theta$

$$E_\theta T_n^* = \theta, \tag{2}$$

where $E_\theta$ is the expectation corresponding to the probability $P_\theta$.
The property of being unbiased is quite natural: it expresses the fact that
any reasonable estimate ought, at least "on the average," to lead to the
desired result. However, it is easy to see that $T_n^*$ is not the only unbiased
estimator. For example, the same property is possessed by every estimator

$$T_n = \frac{b_1 x_1 + \cdots + b_n x_n}{n},$$

where $b_1 + \cdots + b_n = n$. Moreover, the law of large numbers (1) is also
satisfied by such estimators (at least if $|b_i| \le K < \infty$; see Problem 2(b), §3,
Chapter III) and so these estimators $T_n$ are just as "good" as $T_n^*$.
In this connection there arises the question of how to compare different
unbiased estimators, and which of them to describe as best, or optimal.
With the same meaning of "estimator," it is natural to suppose that an
estimator is better, the smaller its deviation from the parameter that is being
estimated. On this basis, we call an estimator $\tilde T_n$ efficient (in the class of unbiased estimators $T_n$) if

$$V_\theta \tilde T_n = \inf_{T_n} V_\theta T_n, \qquad \theta \in \Theta, \tag{3}$$

where $V_\theta T_n$ is the dispersion of $T_n$, i.e. $E_\theta(T_n - \theta)^2$.


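Let us show that the estimator $T_n^*$, considered above, is efficient. We have

$$V_\theta T_n^* = V_\theta\left(\frac{S_n}{n}\right) = \frac{V_\theta S_n}{n^2} = \frac{n\theta(1 - \theta)}{n^2} = \frac{\theta(1 - \theta)}{n}. \tag{4}$$

Hence to establish that $T_n^*$ is efficient, we have only to show that

$$\inf_{T_n} V_\theta T_n \ge \frac{\theta(1 - \theta)}{n}. \tag{5}$$

This is obvious for $\theta = 0$ or 1. Let $\theta \in (0, 1)$ and

$$p_\theta(x_i) = \theta^{x_i}(1 - \theta)^{1 - x_i}.$$

It is clear that

$$p_\theta(\omega) = \prod_{i=1}^{n} p_\theta(x_i).$$

Let us write

$$L_\theta(\omega) = \ln p_\theta(\omega).$$

Then

$$L_\theta(\omega) = \ln\theta \cdot \sum x_i + \ln(1 - \theta) \sum (1 - x_i)$$

and

$$\frac{\partial L_\theta(\omega)}{\partial\theta} = \frac{\sum (x_i - \theta)}{\theta(1 - \theta)}.$$

Since

$$1 = \sum_{\omega} p_\theta(\omega),$$

and since $T_n$ is unbiased,

$$\theta = E_\theta T_n = \sum_{\omega} T_n(\omega)\, p_\theta(\omega).$$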

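After differentiating with respect to $\theta$, we find that

$$0 = \sum_{\omega} \frac{\partial p_\theta(\omega)}{\partial\theta} = \sum_{\omega} \frac{\partial p_\theta(\omega)/\partial\theta}{p_\theta(\omega)}\, p_\theta(\omega) = E_\theta\left[\frac{\partial L_\theta(\omega)}{\partial\theta}\right],$$

$$1 = \sum_{\omega} T_n(\omega)\, \frac{\partial p_\theta(\omega)/\partial\theta}{p_\theta(\omega)}\, p_\theta(\omega) = E_\theta\left[T_n\, \frac{\partial L_\theta(\omega)}{\partial\theta}\right].$$

Therefore

$$1 = E_\theta\left[(T_n - \theta)\, \frac{\partial L_\theta(\omega)}{\partial\theta}\right]$$

and by the Cauchy-Bunyakovskii inequality,

$$1 \le E_\theta[T_n - \theta]^2 \cdot E_\theta\left[\frac{\partial L_\theta(\omega)}{\partial\theta}\right]^2,$$

whence

$$E_\theta[T_n - \theta]^2 \ge \frac{1}{I_n(\theta)}, \tag{6}$$

where

$$I_n(\theta) = E_\theta\left[\frac{\partial L_\theta(\omega)}{\partial\theta}\right]^2$$

is known as Fisher's information.
From (6) we can obtain a special case of the Rao-Cramér inequality
for unbiased estimators $T_n$:

$$\inf_{T_n} V_\theta T_n \ge \frac{1}{I_n(\theta)}. \tag{7}$$

In the present case

$$I_n(\theta) = E_\theta\left[\frac{\partial L_\theta(\omega)}{\partial\theta}\right]^2 = E_\theta\left[\frac{\sum (\xi_i - \theta)}{\theta(1 - \theta)}\right]^2 = \frac{n\theta(1 - \theta)}{[\theta(1 - \theta)]^2} = \frac{n}{\theta(1 - \theta)},$$

which also establishes (5), from which, as we already noticed, there follows
the efficiency of the unbiased estimator $T_n^* = S_n/n$ for the unknown parameter $\theta$.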
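
The identities $V_\theta T_n^* = \theta(1-\theta)/n$ and $I_n(\theta) = n/(\theta(1-\theta))$ can be confirmed by exact summation for small $n$. The sketch below groups the sample points by $k = \sum x_i$, so that the weights are the binomial probabilities $P_\theta\{S_n = k\}$; rational arithmetic keeps the check exact. The values $n = 12$, $\theta = 1/3$ and the function name are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def check_efficiency(n, theta):
    """Return (V_theta T*_n, I_n(theta)), computed by exact summation over k = S_n."""
    var, info = Fraction(0), Fraction(0)
    for k in range(n + 1):
        w = comb(n, k) * theta**k * (1 - theta) ** (n - k)   # P_theta{S_n = k}
        score = (k - n * theta) / (theta * (1 - theta))      # dL/dtheta when sum x_i = k
        var += w * (Fraction(k, n) - theta) ** 2
        info += w * score**2
    return var, info

theta = Fraction(1, 3)
var, info = check_efficiency(12, theta)
print(var, 1 / info)     # the variance of T*_n attains the bound 1 / I_n(theta)
```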

2. It is evident that, in considering $T_n^*$ as a pointwise estimator for $\theta$, we have
introduced a certain amount of inaccuracy. It can even happen that the
numerical value of $T_n^*$ calculated from observations of $x_1, \ldots, x_n$ differs
rather severely from the true value $\theta$. Hence it would be advisable to determine the size of the error.
It would be too much to hope that $T_n^*(\omega)$ differs little from the true value
$\theta$ for all sample points $\omega$. However, we know from the law of large numbers
that for every $\delta > 0$ and for sufficiently large $n$, the probability of the event
$\{|\theta - T_n^*(\omega)| > \delta\}$ will be arbitrarily small.
By Chebyshev's inequality

$$P_\theta\{|\theta - T_n^*| > \delta\} \le \frac{V_\theta T_n^*}{\delta^2} = \frac{\theta(1 - \theta)}{n\delta^2}$$

and therefore, for every $\lambda > 0$,

$$P_\theta\left\{|\theta - T_n^*| \le \lambda\sqrt{\frac{\theta(1 - \theta)}{n}}\right\} \ge 1 - \frac{1}{\lambda^2}.$$

If we take, for example, $\lambda = 3$, then with $P_\theta$-probability greater than 0.888
($1 - (1/3^2) = \frac89 \approx 0.8889$) the event

$$|\theta - T_n^*| \le 3\sqrt{\frac{\theta(1 - \theta)}{n}}$$

will be realized, and a fortiori the event

$$|\theta - T_n^*| \le \frac{3}{2\sqrt{n}},$$

since $\theta(1 - \theta) \le \frac14$.
Therefore

$$P_\theta\left\{|\theta - T_n^*| \le \frac{3}{2\sqrt{n}}\right\} = P_\theta\left\{T_n^* - \frac{3}{2\sqrt{n}} \le \theta \le T_n^* + \frac{3}{2\sqrt{n}}\right\} \ge 0.8888.$$

In other words, we can say with probability greater than 0.8888 that the exact
value of $\theta$ is in the interval $[T_n^* - (3/2\sqrt{n}),\ T_n^* + (3/2\sqrt{n})]$. This statement
is sometimes written in the symbolic form

$$\theta \approx T_n^* \pm \frac{3}{2\sqrt{n}} \quad (\ge 88\%),$$

where "$\ge 88\%$" means "in more than 88% of all cases."
The interval $[T_n^* - (3/2\sqrt{n}),\ T_n^* + (3/2\sqrt{n})]$ is an example of what are
called confidence intervals for the unknown parameter.

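Definition. An interval of the form

$$[\psi_1(\omega), \psi_2(\omega)],$$

where $\psi_1(\omega)$ and $\psi_2(\omega)$ are functions of sample points, is called a confidence
interval of reliability $1 - \delta$ (or of significance level $\delta$) if

$$P_\theta\{\psi_1(\omega) \le \theta \le \psi_2(\omega)\} \ge 1 - \delta$$

for all $\theta \in \Theta$.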

The preceding discussion shows that the interval

$$\left[T_n^* - \frac{\lambda}{2\sqrt{n}},\ T_n^* + \frac{\lambda}{2\sqrt{n}}\right]$$

has reliability $1 - (1/\lambda^2)$. In point of fact, the reliability of this confidence
interval is considerably higher, since Chebyshev's inequality gives only
crude estimates of the probabilities of events.
To obtain more precise results we notice that

$$\left\{\omega : |\theta - T_n^*| \le \lambda\sqrt{\frac{\theta(1 - \theta)}{n}}\right\} = \{\omega : \psi_1(T_n^*, n) \le \theta \le \psi_2(T_n^*, n)\},$$

where $\psi_1 = \psi_1(T_n^*, n)$ and $\psi_2 = \psi_2(T_n^*, n)$ are the roots of the quadratic
equation

$$(\theta - T_n^*)^2 = \frac{\lambda^2}{n}\, \theta(1 - \theta),$$

which describes an ellipse situated as shown in Figure 13.
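Figure 13.

Now let

$$F_\theta^n(x) = P_\theta\left\{\frac{S_n - n\theta}{\sqrt{n\theta(1 - \theta)}} \le x\right\}.$$

Then by (6.24)

$$\sup_x |F_\theta^n(x) - \Phi(x)| \le \frac{1}{\sqrt{n\theta(1 - \theta)}}.$$

Therefore if we know a priori that

$$0 < \Delta \le \theta \le 1 - \Delta < 1,$$

where $\Delta$ is a constant, then

$$\sup_x |F_\theta^n(x) - \Phi(x)| \le \frac{1}{\Delta\sqrt{n}}$$

and consequently

$$P_\theta\{\psi_1(T_n^*, n) \le \theta \le \psi_2(T_n^*, n)\} = P_\theta\left\{|\theta - T_n^*| \le \lambda\sqrt{\frac{\theta(1 - \theta)}{n}}\right\} = P_\theta\left\{\left|\frac{S_n - n\theta}{\sqrt{n\theta(1 - \theta)}}\right| \le \lambda\right\} \ge (2\Phi(\lambda) - 1) - \frac{2}{\Delta\sqrt{n}}.$$

Let $\lambda^*$ be the smallest $\lambda$ for which

$$(2\Phi(\lambda) - 1) - \frac{2}{\Delta\sqrt{n}} \ge 1 - \delta^*,$$

where $\delta^*$ is a given significance level. Putting $\delta = \delta^* - (2/\Delta\sqrt{n})$, we find
that $\lambda^*$ satisfies the equation

$$\Phi(\lambda) = 1 - \tfrac12 \delta.$$

For large $n$ we may neglect the term $2/\Delta\sqrt{n}$ and assume that $\lambda^*$ satisfies

$$\Phi(\lambda^*) = 1 - \tfrac12 \delta^*.$$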
In particular, if $\lambda^* = 3$ then $1 - \delta^* = 0.9973\ldots$. Then with probability
approximately 0.9973

$$T_n^* - 3\sqrt{\frac{\theta(1 - \theta)}{n}} \le \theta \le T_n^* + 3\sqrt{\frac{\theta(1 - \theta)}{n}} \tag{8}$$

or, after iterating and then suppressing terms of order $O(n^{-3/4})$, we obtain

$$T_n^* - 3\sqrt{\frac{T_n^*(1 - T_n^*)}{n}} \le \theta \le T_n^* + 3\sqrt{\frac{T_n^*(1 - T_n^*)}{n}}. \tag{9}$$

Hence it follows that the confidence interval

$$\left[T_n^* - \frac{3}{2\sqrt{n}},\ T_n^* + \frac{3}{2\sqrt{n}}\right] \tag{10}$$

has (for large $n$) reliability 0.9973 (whereas Chebyshev's inequality only
provided reliability approximately 0.8889).
Thus we can make the following practical application. Let us carry out
a large number $N$ of series of experiments, in each of which we estimate the
parameter $\theta$ after $n$ observations. Then in about 99.73% of the $N$ cases, in
each series the estimate will differ from the true value of the parameter by
at most $3/(2\sqrt{n})$. (On this topic see also the end of §5.)
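
The three intervals discussed above can be compared numerically for a given observed frequency. The sketch below computes the roots $\psi_1, \psi_2$ of the quadratic equation $(\theta - T_n^*)^2 = (\lambda^2/n)\,\theta(1-\theta)$, the plug-in interval (9) and the cruder interval (10). The observed value $T_n^* = 0.3$, the sample size $n = 400$ and the function names are hypothetical choices for illustration.

```python
from math import sqrt

def quadratic_interval(t, n, lam):
    """Roots psi_1 <= psi_2 of (theta - t)^2 = (lam^2 / n) * theta * (1 - theta)."""
    c = lam * lam / n
    root = sqrt(c * (c + 4.0 * t * (1.0 - t)))     # square root of the discriminant
    return (2.0 * t + c - root) / (2.0 * (1.0 + c)), (2.0 * t + c + root) / (2.0 * (1.0 + c))

def plug_in_interval(t, n, lam):
    """Interval (9): t -/+ lam * sqrt(t (1 - t) / n)."""
    d = lam * sqrt(t * (1.0 - t) / n)
    return t - d, t + d

def crude_interval(t, n, lam):
    """Interval (10): t -/+ lam / (2 sqrt(n)), using theta (1 - theta) <= 1/4."""
    d = lam / (2.0 * sqrt(n))
    return t - d, t + d

t, n, lam = 0.3, 400, 3.0      # hypothetical observed frequency and sample size
for interval in (quadratic_interval, plug_in_interval, crude_interval):
    print(interval.__name__, interval(t, n, lam))
```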

3. PROBLEMS
1. Let it be known a priori that $\theta$ has a value in the set $\Theta_0 \subseteq [0, 1]$. Construct an unbiased
estimator for $\theta$, taking values only in $\Theta_0$.

2. Under the hypotheses of the preceding problem, find an analog of the Rao-Cramér
inequality and discuss the problem of efficient estimators.
3. Under the hypotheses of the first problem, discuss the construction of confidence
intervals for $\theta$.

8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions
1. Let $(\Omega, \mathscr{A}, P)$ be a finite probability space and

$$\mathscr{D} = \{D_1, \ldots, D_k\}$$

a decomposition of $\Omega$ ($D_i \in \mathscr{A}$, $P(D_i) > 0$, $i = 1, \ldots, k$, and $D_1 + \cdots + D_k = \Omega$). Also let $A$ be an event from $\mathscr{A}$ and $P(A \mid D_i)$ the conditional probability of
$A$ with respect to $D_i$.
With a set of conditional probabilities $\{P(A \mid D_i),\ i = 1, \ldots, k\}$ we may
associate the random variable

$$\pi(\omega) = \sum_{i=1}^{k} P(A \mid D_i)\, I_{D_i}(\omega) \tag{1}$$

(cf. (4.5)), that takes the values $P(A \mid D_i)$ on the atoms $D_i$. To emphasize
that this random variable is associated specifically with the decomposition
$\mathscr{D}$, we denote it by

$$P(A \mid \mathscr{D}) \quad \text{or} \quad P(A \mid \mathscr{D})(\omega)$$

and call it the conditional probability of the event $A$ with respect to the decomposition $\mathscr{D}$.

This concept, as well as the more general concept of conditional probabilities with respect to a $\sigma$-algebra, which will be introduced later, plays an important role in probability theory, a role that will be developed progressively
as we proceed.
We mention some of the simplest properties of conditional probabilities:

$$P(A + B \mid \mathscr{D}) = P(A \mid \mathscr{D}) + P(B \mid \mathscr{D}); \tag{2}$$

if $\mathscr{D}$ is the trivial decomposition consisting of the single set $\Omega$ then

$$P(A \mid \Omega) = P(A). \tag{3}$$


The definition of $P(A \mid \mathcal{D})$ as a random variable lets us speak of its expectation; by using this, we can write the formula (3.3) for total probability in the following compact form:

$$E\,P(A \mid \mathcal{D}) = P(A). \tag{4}$$

In fact, since

$$P(A \mid \mathcal{D}) = \sum_{i=1}^{k} P(A \mid D_i)\, I_{D_i}(\omega),$$

then by the definition of expectation (see (4.5) and (4.6))

$$E\,P(A \mid \mathcal{D}) = \sum_{i=1}^{k} P(A \mid D_i)\,P(D_i) = \sum_{i=1}^{k} P(A D_i) = P(A).$$
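Formula (4) can be illustrated on a small finite space. The sketch below assumes a hypothetical example that is not in the text: $\Omega = \{1, \ldots, 6\}$ with equal probabilities, the decomposition $\{1,2\}, \{3,4,5\}, \{6\}$, and the event $A$ of even outcomes. It builds the random variable $P(A \mid \mathcal{D})$ as in (1) and compares its expectation with $P(A)$, using exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical finite probability space (an assumption made for illustration).
P = {w: Fraction(1, 6) for w in range(1, 7)}
decomposition = [{1, 2}, {3, 4, 5}, {6}]
A = {2, 4, 6}

def prob(event):
    """Probability of a subset of the sample space."""
    return sum(P[w] for w in event)

def cond_prob_wrt_decomposition(event, atoms):
    """Random variable P(event | D): constant P(event | D_i) on each atom D_i."""
    rv = {}
    for atom in atoms:
        value = prob(event & atom) / prob(atom)
        for w in atom:
            rv[w] = value
    return rv

def expectation(rv):
    """Expectation of a random variable given as a dict w -> value."""
    return sum(rv[w] * P[w] for w in rv)

pi = cond_prob_wrt_decomposition(A, decomposition)
print(pi)                       # values of P(A | D) at each sample point
print(expectation(pi), prob(A)) # both sides of formula (4)
```

Because the values are `Fraction` objects, the two sides of (4) can be compared for exact equality rather than up to rounding.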

Now let $\eta = \eta(\omega)$ be a random variable that takes the values $y_1, \ldots, y_k$ with positive probabilities:

$$\eta(\omega) = \sum_{j=1}^{k} y_j\, I_{D_j}(\omega),$$

where $D_j = \{\omega : \eta(\omega) = y_j\}$. The decomposition $\mathcal{D}_\eta = \{D_1, \ldots, D_k\}$ is called the decomposition induced by $\eta$. The conditional probability $P(A \mid \mathcal{D}_\eta)$ will be denoted by $P(A \mid \eta)$ or $P(A \mid \eta)(\omega)$, and called the conditional probability of $A$ with respect to the random variable $\eta$. We also denote by $P(A \mid \eta = y_j)$ the conditional probability $P(A \mid D_j)$, where $D_j = \{\omega : \eta(\omega) = y_j\}$.
Similarly, if $\eta_1, \eta_2, \ldots, \eta_m$ are random variables and $\mathcal{D}_{\eta_1, \eta_2, \ldots, \eta_m}$ is the decomposition induced by $\eta_1, \eta_2, \ldots, \eta_m$ with atoms

$$D_{y_1, y_2, \ldots, y_m} = \{\omega : \eta_1(\omega) = y_1, \ldots, \eta_m(\omega) = y_m\},$$

then $P(A \mid \mathcal{D}_{\eta_1, \eta_2, \ldots, \eta_m})$ will be denoted by $P(A \mid \eta_1, \eta_2, \ldots, \eta_m)$ and called the conditional probability of $A$ with respect to $\eta_1, \eta_2, \ldots, \eta_m$.
EXAMPLE 1. Let $\xi$ and $\eta$ be independent identically distributed random variables, each taking the values 1 and 0 with probabilities $p$ and $q$. For $k = 0, 1, 2$, let us find the conditional probability $P(\xi + \eta = k \mid \eta)$ of the event $A = \{\omega : \xi + \eta = k\}$ with respect to $\eta$.
To do this, we first notice the following useful general fact: if $\xi$ and $\eta$ are independent random variables with respective values $x$ and $y$, then

$$P(\xi + \eta = z \mid \eta = y) = P(\xi + y = z). \tag{5}$$

In fact,

$$P(\xi + \eta = z \mid \eta = y) = \frac{P(\xi + \eta = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z)\,P(\eta = y)}{P(\eta = y)} = P(\xi + y = z).$$


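Using this formula for the case at hand, we find that

$$P(\xi + \eta = k \mid \eta) = P(\xi + \eta = k \mid \eta = 0)\, I_{\{\eta = 0\}}(\omega) + P(\xi + \eta = k \mid \eta = 1)\, I_{\{\eta = 1\}}(\omega) = P(\xi = k)\, I_{\{\eta = 0\}}(\omega) + P(\xi = k - 1)\, I_{\{\eta = 1\}}(\omega).$$

Thus

$$P(\xi + \eta = k \mid \eta) = \begin{cases} q\, I_{\{\eta = 0\}}, & k = 0, \\ p\, I_{\{\eta = 0\}} + q\, I_{\{\eta = 1\}}, & k = 1, \\ p\, I_{\{\eta = 1\}}, & k = 2, \end{cases} \tag{6}$$

or equivalently

$$P(\xi + \eta = k \mid \eta) = \begin{cases} q(1 - \eta), & k = 0, \\ p(1 - \eta) + q\eta, & k = 1, \\ p\eta, & k = 2. \end{cases} \tag{7}$$

2. Let $\xi = \xi(\omega)$ be a random variable with values in the set $X = \{x_1, \ldots, x_l\}$:

$$\xi = \sum_{j=1}^{l} x_j\, I_{A_j}(\omega),$$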

and let $\mathcal{D} = \{D_1, \ldots, D_k\}$ be a decomposition. Just as we defined the expectation of $\xi$ with respect to the probabilities $P(A_j)$, $j = 1, \ldots, l$,

$$E\xi = \sum_{j=1}^{l} x_j\, P(A_j), \tag{8}$$

it is now natural to define the conditional expectation of $\xi$ with respect to $\mathcal{D}$ by using the conditional probabilities $P(A_j \mid \mathcal{D})$, $j = 1, \ldots, l$. We denote this expectation by $E(\xi \mid \mathcal{D})$ or $E(\xi \mid \mathcal{D})(\omega)$, and define it by the formula

$$E(\xi \mid \mathcal{D}) = \sum_{j=1}^{l} x_j\, P(A_j \mid \mathcal{D}). \tag{9}$$

According to this definition the conditional expectation $E(\xi \mid \mathcal{D})(\omega)$ is a random variable which, at all sample points $\omega$ belonging to the same atom $D_i$, takes the same value $\sum_{j=1}^{l} x_j P(A_j \mid D_i)$. This observation shows that the definition of $E(\xi \mid \mathcal{D})$ could have been expressed differently. In fact, we could first define $E(\xi \mid D_i)$, the conditional expectation of $\xi$ with respect to $D_i$, by

$$E(\xi \mid D_i) = \sum_{j=1}^{l} x_j\, P(A_j \mid D_i) \tag{10}$$

and then define

$$E(\xi \mid \mathcal{D})(\omega) = \sum_{i=1}^{k} E(\xi \mid D_i)\, I_{D_i}(\omega) \tag{11}$$

(see the diagram in Figure 14).

Figure 14. Diagram relating $P(\cdot)$, $P(\cdot \mid D)$, $P(\cdot \mid \mathcal{D})$ to $E\xi$, $E(\xi \mid D)$, $E(\xi \mid \mathcal{D})$ through formulas (8), (10), (9) and (3.1), (1), (11).
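It is also useful to notice that $E(\xi \mid D)$ and $E(\xi \mid \mathcal{D})$ are independent of the representation of $\xi$.
The following properties of conditional expectations follow immediately from the definitions:

$$E(a\xi + b\eta \mid \mathcal{D}) = aE(\xi \mid \mathcal{D}) + bE(\eta \mid \mathcal{D}), \quad a \text{ and } b \text{ constants}; \tag{12}$$

$$E(\xi \mid \Omega) = E\xi; \tag{13}$$

$$E(C \mid \mathcal{D}) = C, \quad C \text{ constant}; \tag{14}$$

$$\text{if } \xi = I_A(\omega), \text{ then } E(\xi \mid \mathcal{D}) = P(A \mid \mathcal{D}). \tag{15}$$

The last equation shows, in particular, that properties of conditional probabilities can be deduced directly from properties of conditional expectations.
The following important property generalizes the formula for total probability (4):

$$E\,E(\xi \mid \mathcal{D}) = E\xi. \tag{16}$$

For the proof, it is enough to notice that by (4)

$$E\,E(\xi \mid \mathcal{D}) = E \sum_{j=1}^{l} x_j\, P(A_j \mid \mathcal{D}) = \sum_{j=1}^{l} x_j\, E\,P(A_j \mid \mathcal{D}) = \sum_{j=1}^{l} x_j\, P(A_j) = E\xi.$$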

Let $\mathcal{D} = \{D_1, \ldots, D_k\}$ be a decomposition and $\eta = \eta(\omega)$ a random variable. We say that $\eta$ is measurable with respect to this decomposition, or $\mathcal{D}$-measurable, if $\mathcal{D}_\eta \preceq \mathcal{D}$, i.e. $\eta = \eta(\omega)$ can be represented in the form

$$\eta(\omega) = \sum_{i=1}^{k} y_i\, I_{D_i}(\omega),$$

where some $y_i$ might be equal. In other words, a random variable is $\mathcal{D}$-measurable if and only if it takes constant values on the atoms of $\mathcal{D}$.

EXAMPLE 2. If $\mathcal{D}$ is the trivial decomposition, $\mathcal{D} = \{\Omega\}$, then $\eta$ is $\mathcal{D}$-measurable if and only if $\eta \equiv C$, where $C$ is a constant. Every random variable $\eta$ is measurable with respect to $\mathcal{D}_\eta$.


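Suppose that the random variable $\eta$ is $\mathcal{D}$-measurable. Then

$$E(\xi\eta \mid \mathcal{D}) = \eta\, E(\xi \mid \mathcal{D}) \tag{17}$$

and in particular

$$E(\eta \mid \mathcal{D}) = \eta \qquad (E(\eta \mid \mathcal{D}_\eta) = \eta). \tag{18}$$

To establish (17) we observe that if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, then

$$\xi\eta = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i\, I_{A_j D_i}$$

and therefore

$$\begin{aligned}
E(\xi\eta \mid \mathcal{D}) &= \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i\, P(A_j D_i \mid \mathcal{D}) \\
&= \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i \sum_{m=1}^{k} P(A_j D_i \mid D_m)\, I_{D_m}(\omega) \\
&= \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i\, P(A_j D_i \mid D_i)\, I_{D_i}(\omega) \\
&= \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i\, P(A_j \mid D_i)\, I_{D_i}(\omega).
\end{aligned} \tag{19}$$

On the other hand, since $I_{D_i}^2 = I_{D_i}$ and $I_{D_i} I_{D_m} = 0$, $i \ne m$, we obtain

$$\begin{aligned}
\eta\, E(\xi \mid \mathcal{D}) &= \left[\sum_{i=1}^{k} y_i\, I_{D_i}(\omega)\right] \left[\sum_{j=1}^{l} x_j\, P(A_j \mid \mathcal{D})\right] \\
&= \left[\sum_{i=1}^{k} y_i\, I_{D_i}(\omega)\right] \sum_{m=1}^{k} \left[\sum_{j=1}^{l} x_j\, P(A_j \mid D_m)\right] I_{D_m}(\omega) \\
&= \sum_{i=1}^{k} \sum_{j=1}^{l} y_i x_j\, P(A_j \mid D_i)\, I_{D_i}(\omega),
\end{aligned}$$

which, with (19), establishes (17).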


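We shall establish another important property of conditional expectations. Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two decompositions, with $\mathcal{D}_1 \preceq \mathcal{D}_2$ ($\mathcal{D}_2$ is "finer" than $\mathcal{D}_1$). Then

$$E[E(\xi \mid \mathcal{D}_2) \mid \mathcal{D}_1] = E(\xi \mid \mathcal{D}_1). \tag{20}$$

For the proof, suppose that

$$\mathcal{D}_1 = \{D_{11}, \ldots, D_{1m}\}, \qquad \mathcal{D}_2 = \{D_{21}, \ldots, D_{2n}\}.$$

Then if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, we have

$$E(\xi \mid \mathcal{D}_2) = \sum_{j=1}^{l} x_j\, P(A_j \mid \mathcal{D}_2),$$

and it is sufficient to establish that

$$E[P(A_j \mid \mathcal{D}_2) \mid \mathcal{D}_1] = P(A_j \mid \mathcal{D}_1). \tag{21}$$

Since

$$P(A_j \mid \mathcal{D}_2) = \sum_{q=1}^{n} P(A_j \mid D_{2q})\, I_{D_{2q}},$$

we have

$$\begin{aligned}
E[P(A_j \mid \mathcal{D}_2) \mid \mathcal{D}_1] &= \sum_{q=1}^{n} P(A_j \mid D_{2q})\, P(D_{2q} \mid \mathcal{D}_1) \\
&= \sum_{p=1}^{m} I_{D_{1p}} \sum_{q=1}^{n} P(A_j \mid D_{2q})\, P(D_{2q} \mid D_{1p}) \\
&= \sum_{p=1}^{m} I_{D_{1p}} \sum_{\{q :\, D_{2q} \subseteq D_{1p}\}} P(A_j \mid D_{2q})\, P(D_{2q} \mid D_{1p}) \\
&= \sum_{p=1}^{m} I_{D_{1p}}\, P(A_j \mid D_{1p}) = P(A_j \mid \mathcal{D}_1),
\end{aligned}$$

which establishes (21).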
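Properties (16) and (20) are easy to test by direct computation on a finite space. The following sketch assumes a hypothetical eight-point space with equal probabilities, a random variable $\xi(\omega) = \omega^2$, and two decompositions of which the second refines the first; none of these choices comes from the text. The conditional expectation is computed atom by atom as the probability-weighted average of $\xi$ over the atom, which is equivalent to (10).

```python
from fractions import Fraction

# Hypothetical finite space and random variable (assumptions for illustration).
P = {w: Fraction(1, 8) for w in range(8)}
xi = {w: w * w for w in P}
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]    # plays the role of D_1
fine = [{0, 1}, {2, 3}, {4, 5, 6}, {7}]  # plays the role of D_2, finer than D_1

def cond_exp(rv, atoms):
    """E(rv | D) as a dict: on each atom, the P-weighted average of rv over it."""
    out = {}
    for atom in atoms:
        p_atom = sum(P[w] for w in atom)
        value = sum(rv[w] * P[w] for w in atom) / p_atom
        for w in atom:
            out[w] = value
    return out

def expectation(rv):
    """Unconditional expectation of a random variable given as a dict."""
    return sum(rv[w] * P[w] for w in P)

inner = cond_exp(xi, fine)
print(cond_exp(inner, coarse) == cond_exp(xi, coarse))  # property (20)
print(expectation(inner) == expectation(xi))            # property (16)
```

Exact fractions are used so that both identities can be checked with `==`.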


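When $\mathcal{D}$ is induced by the random variables $\eta_1, \ldots, \eta_k$ ($\mathcal{D} = \mathcal{D}_{\eta_1, \ldots, \eta_k}$), the conditional expectation $E(\xi \mid \mathcal{D}_{\eta_1, \ldots, \eta_k})$ is denoted by $E(\xi \mid \eta_1, \ldots, \eta_k)$, or $E(\xi \mid \eta_1, \ldots, \eta_k)(\omega)$, and is called the conditional expectation of $\xi$ with respect to $\eta_1, \ldots, \eta_k$.
It follows immediately from the definition of $E(\xi \mid \eta)$ that if $\xi$ and $\eta$ are independent, then

$$E(\xi \mid \eta) = E\xi. \tag{22}$$

From (18) it also follows that

$$E(\eta \mid \eta) = \eta. \tag{23}$$

Property (22) admits the following generalization. Let $\xi$ be independent of $\mathcal{D}$ (i.e. for each $D_i \in \mathcal{D}$ the random variables $\xi$ and $I_{D_i}$ are independent). Then

$$E(\xi \mid \mathcal{D}) = E\xi. \tag{24}$$

As a special case of (20) we obtain the following useful formula:

$$E[E(\xi \mid \eta_1, \eta_2) \mid \eta_1] = E(\xi \mid \eta_1). \tag{25}$$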


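EXAMPLE 3. Let us find $E(\xi + \eta \mid \eta)$ for the random variables $\xi$ and $\eta$ considered in Example 1. By (22) and (23),

$$E(\xi + \eta \mid \eta) = E\xi + \eta = p + \eta.$$

This result can also be obtained by starting from (8):

$$E(\xi + \eta \mid \eta) = \sum_{k=0}^{2} k\, P(\xi + \eta = k \mid \eta) = p(1 - \eta) + q\eta + 2p\eta = p + \eta.$$

EXAMPLE 4. Let $\xi$ and $\eta$ be independent and identically distributed random variables. Then

$$E(\xi \mid \xi + \eta) = E(\eta \mid \xi + \eta) = \frac{\xi + \eta}{2}. \tag{26}$$

In fact, if we assume for simplicity that $\xi$ and $\eta$ take the values $1, 2, \ldots, m$, we find ($1 \le k \le m$, $2 \le l \le 2m$)

$$\begin{aligned}
P(\xi = k \mid \xi + \eta = l) &= \frac{P(\xi = k,\ \xi + \eta = l)}{P(\xi + \eta = l)} = \frac{P(\xi = k,\ \eta = l - k)}{P(\xi + \eta = l)} \\
&= \frac{P(\xi = k)\,P(\eta = l - k)}{P(\xi + \eta = l)} = \frac{P(\eta = k)\,P(\xi = l - k)}{P(\xi + \eta = l)} \\
&= P(\eta = k \mid \xi + \eta = l).
\end{aligned}$$

This establishes the first equation in (26). To prove the second, it is enough to notice that

$$2E(\xi \mid \xi + \eta) = E(\xi \mid \xi + \eta) + E(\eta \mid \xi + \eta) = E(\xi + \eta \mid \xi + \eta) = \xi + \eta.$$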
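Examples 3 and 4 can be confirmed by enumerating the four outcomes of the pair $(\xi, \eta)$ from Example 1. This is a minimal sketch with the assumed value $p = 1/3$ (any $0 < p < 1$ would do); the helper `cond_exp` is illustrative and averages a function over the level sets of the conditioning variable.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 3)  # assumed value of p, chosen only for illustration
q = 1 - p
weight = {1: p, 0: q}
# Sample points are pairs (value of xi, value of eta); independence gives product weights.
P = {(x, y): weight[x] * weight[y] for x, y in product((0, 1), repeat=2)}

def cond_exp(f, g):
    """E(f | g) as a dict: average of f over each level set {g = const}."""
    out = {}
    for w in P:
        level = [v for v in P if g(v) == g(w)]
        out[w] = sum(f(v) * P[v] for v in level) / sum(P[v] for v in level)
    return out

def xi(w):
    return w[0]

def eta(w):
    return w[1]

def total(w):
    return w[0] + w[1]

example3 = cond_exp(total, eta)  # should equal p + eta
example4 = cond_exp(xi, total)   # should equal (xi + eta) / 2
print(all(example3[w] == p + eta(w) for w in P))
print(all(example4[w] == Fraction(total(w), 2) for w in P))
```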

3. We have already noticed in §1 that to each decomposition $\mathcal{D} = \{D_1, \ldots, D_k\}$ of the finite set $\Omega$ there corresponds an algebra $\alpha(\mathcal{D})$ of subsets of $\Omega$. The converse is also true: every algebra $\mathcal{B}$ of subsets of the finite space $\Omega$ generates a decomposition $\mathcal{D}$ ($\mathcal{B} = \alpha(\mathcal{D})$). Consequently there is a one-to-one correspondence between algebras and decompositions of a finite space $\Omega$. This should be kept in mind in connection with the concept, which will be introduced later, of conditional expectation with respect to the special systems of sets called $\sigma$-algebras.
For finite spaces, the concepts of algebra and $\sigma$-algebra coincide. It will turn out that if $\mathcal{B}$ is an algebra, the conditional expectation $E(\xi \mid \mathcal{B})$ of a random variable $\xi$ with respect to $\mathcal{B}$ (to be introduced in §7 of Chapter II) simply coincides with $E(\xi \mid \mathcal{D})$, the expectation of $\xi$ with respect to the decomposition $\mathcal{D}$ such that $\mathcal{B} = \alpha(\mathcal{D})$. In this sense we can, in dealing with finite spaces in the future, not distinguish between $E(\xi \mid \mathcal{B})$ and $E(\xi \mid \mathcal{D})$, understanding in each case that $E(\xi \mid \mathcal{B})$ is simply defined to be $E(\xi \mid \mathcal{D})$.

4. PROBLEMS
1. Give an example of random variables $\xi$ and $\eta$ which are not independent but for which

$$E(\xi \mid \eta) = E\xi.$$

(Cf. (22).)

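2. The conditional variance of $\xi$ with respect to $\mathcal{D}$ is the random variable

$$V(\xi \mid \mathcal{D}) = E[(\xi - E(\xi \mid \mathcal{D}))^2 \mid \mathcal{D}].$$

Show that

$$V\xi = E\,V(\xi \mid \mathcal{D}) + V\,E(\xi \mid \mathcal{D}).$$

3. Starting from (17), show that for every function $f = f(\eta)$ the conditional expectation $E(\xi \mid \eta)$ has the property

$$E[f(\eta)\,E(\xi \mid \eta)] = E[\xi f(\eta)].$$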

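4. Let $\xi$ and $\eta$ be random variables. Show that $\inf_f E(\eta - f(\xi))^2$ is attained for $f^*(\xi) = E(\eta \mid \xi)$. (Consequently, the best estimator for $\eta$ in terms of $\xi$, in the mean-square sense, is the conditional expectation $E(\eta \mid \xi)$.)
5. Let $\xi_1, \ldots, \xi_n, \tau$ be independent random variables, where $\xi_1, \ldots, \xi_n$ are identically distributed and $\tau$ takes the values $1, 2, \ldots, n$. Show that if $S_\tau = \xi_1 + \cdots + \xi_\tau$ is the sum of a random number of the random variables,

$$E(S_\tau \mid \tau) = \tau\, E\xi_1, \qquad V(S_\tau \mid \tau) = \tau\, V\xi_1$$

and

$$ES_\tau = E\tau \cdot E\xi_1, \qquad VS_\tau = E\tau \cdot V\xi_1 + V\tau \cdot (E\xi_1)^2.$$

6. Establish equation (24).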

9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing
1. The value of the limit theorems of §6 for Bernoulli schemes is not just
that they provide convenient formulas for calculating probabilities $P(S_n = k)$
and $P(A < S_n \le B)$. They have the additional significance of being of a
universal nature, i.e. they remain useful not only for independent Bernoulli
random variables that have only two values, but also for variables of much
more general character. In this sense the Bernoulli scheme appears as the
simplest model, on the basis of which we can recognize many probabilistic
regularities which are inherent also in much more general models.
In this and the next section we shall discuss a number of new probabilistic
regularities, some of which are quite surprising. The ones that we discuss are


again based on the Bernoulli scheme, although many results on the nature
of random oscillations remain valid for random walks of a more general
kind.
2. Consider the Bernoulli scheme $(\Omega, \mathcal{A}, P)$, where $\Omega = \{\omega : \omega = (x_1, \ldots, x_n),\ x_i = \pm 1\}$, $\mathcal{A}$ consists of all subsets of $\Omega$, and $p(\omega) = p^{\nu(\omega)} q^{n - \nu(\omega)}$, $\nu(\omega) = (\sum x_i + n)/2$. Let $\xi_i(\omega) = x_i$, $i = 1, \ldots, n$. Then, as we know, the sequence $\xi_1, \ldots, \xi_n$ is a sequence of independent Bernoulli random variables,

$$P(\xi_i = 1) = p, \qquad P(\xi_i = -1) = q, \qquad p + q = 1.$$

Let us put $S_0 = 0$, $S_k = \xi_1 + \cdots + \xi_k$, $1 \le k \le n$. The sequence $S_0, S_1, \ldots, S_n$ can be considered as the path of the random motion of a particle starting at zero. Here $S_{k+1} = S_k + \xi_{k+1}$, i.e. if the particle has reached the point $S_k$ at time $k$, then at time $k + 1$ it is displaced either one unit up (with probability $p$) or one unit down (with probability $q$).
Let $A$ and $B$ be integers, $A \le 0 \le B$. An interesting problem about this random walk is to find the probability that after $n$ steps the moving particle has left the interval $(A, B)$. It is also of interest to ask with what probability the particle leaves $(A, B)$ at $A$ or at $B$.
That these are natural questions to ask becomes particularly clear if we interpret them in terms of a gambling game. Consider two players (first and second) who start with respective bankrolls $(-A)$ and $B$. If $\xi_i = +1$, we suppose that the second player pays one unit to the first; if $\xi_i = -1$, the first pays the second. Then $S_k = \xi_1 + \cdots + \xi_k$ can be interpreted as the amount won by the first player from the second (if $S_k < 0$, this is actually the amount lost by the first player to the second) after $k$ turns.
At the instant $k \le n$ at which for the first time $S_k = B$ ($S_k = A$) the bankroll of the second (first) player is reduced to zero; in other words, that player is ruined. (If $k < n$, we suppose that the game ends at time $k$, although the random walk itself is well defined up to time $n$, inclusive.)
Before we turn to a precise formulation, let us introduce some notation. Let $x$ be an integer in the interval $[A, B]$ and for $0 \le k \le n$ let $S_k^x = x + S_k$,

$$\tau_k^x = \min\{0 \le l \le k : S_l^x = A \text{ or } B\}, \tag{1}$$

where we agree to take $\tau_k^x = k$ if $A < S_l^x < B$ for all $0 \le l \le k$.
For each $k$ in $0 \le k \le n$ and $x \in [A, B]$, the instant $\tau_k^x$, called a stopping time (see §11), is an integer-valued random variable defined on the sample space $\Omega$ (the dependence of $\tau_k^x$ on $\Omega$ is not explicitly indicated).
It is clear that for all $l < k$ the set $\{\omega : \tau_k^x = l\}$ is the event that the random walk $\{S_i^x,\ 0 \le i \le k\}$, starting at time zero at the point $x$, leaves the interval $(A, B)$ at time $l$. It is also clear that when $l \le k$ the sets $\{\omega : \tau_k^x = l,\ S_l^x = A\}$ and $\{\omega : \tau_k^x = l,\ S_l^x = B\}$ represent the events that the wandering particle leaves the interval $(A, B)$ at time $l$ through $A$ or $B$ respectively.

For $0 \le k \le n$, we write

$$\mathcal{A}_k^x = \sum_{0 \le l \le k} \{\omega : \tau_k^x = l,\ S_l^x = A\}, \qquad \mathcal{B}_k^x = \sum_{0 \le l \le k} \{\omega : \tau_k^x = l,\ S_l^x = B\}, \tag{2}$$

and let

$$\alpha_k(x) = P(\mathcal{A}_k^x), \qquad \beta_k(x) = P(\mathcal{B}_k^x)$$

be the probabilities that the particle leaves $(A, B)$, through $A$ or $B$ respectively, during the time interval $[0, k]$. For these probabilities we can find recurrent relations from which we can successively determine $\alpha_1(x), \ldots, \alpha_n(x)$ and $\beta_1(x), \ldots, \beta_n(x)$.
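Let, then, $A < x < B$. It is clear that $\alpha_0(x) = \beta_0(x) = 0$. Now suppose $1 \le k \le n$. Then by (8.5),

$$\begin{aligned}
\beta_k(x) = P(\mathcal{B}_k^x) &= P(\mathcal{B}_k^x \mid S_1^x = x + 1)\,P(\xi_1 = 1) + P(\mathcal{B}_k^x \mid S_1^x = x - 1)\,P(\xi_1 = -1) \\
&= p\,P(\mathcal{B}_k^x \mid S_1^x = x + 1) + q\,P(\mathcal{B}_k^x \mid S_1^x = x - 1).
\end{aligned} \tag{3}$$

We now show that

$$P(\mathcal{B}_k^x \mid S_1^x = x + 1) = P(\mathcal{B}_{k-1}^{x+1}), \qquad P(\mathcal{B}_k^x \mid S_1^x = x - 1) = P(\mathcal{B}_{k-1}^{x-1}).$$

To do this, we notice that $\mathcal{B}_k^x$ can be represented in the form

$$\mathcal{B}_k^x = \{\omega : (x,\ x + \xi_1,\ \ldots,\ x + \xi_1 + \cdots + \xi_k) \in B_k^x\},$$

where $B_k^x$ is the set of paths of the form

$$(x,\ x + x_1,\ \ldots,\ x + x_1 + \cdots + x_k)$$

with $x_i = \pm 1$, which during the time $[0, k]$ first leave $(A, B)$ at $B$ (Figure 15).

Figure 15. Example of a path from the set $B_k^x$.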


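We represent $B_k^x$ in the form $B_k^{x,x+1} + B_k^{x,x-1}$, where $B_k^{x,x+1}$ and $B_k^{x,x-1}$ are the paths in $B_k^x$ for which $x_1 = +1$ or $x_1 = -1$, respectively.
Notice that the paths $(x,\ x + 1,\ x + 1 + x_2,\ \ldots,\ x + 1 + x_2 + \cdots + x_k)$ in $B_k^{x,x+1}$ are in one-to-one correspondence with the paths

$$(x + 1,\ x + 1 + x_2,\ \ldots,\ x + 1 + x_2 + \cdots + x_k)$$

in $B_{k-1}^{x+1}$. The same is true for the paths in $B_k^{x,x-1}$. Using these facts, together with independence, the identical distribution of $\xi_1, \ldots, \xi_k$, and (8.6), we obtain

$$\begin{aligned}
P(\mathcal{B}_k^x \mid S_1^x = x + 1) &= P(\mathcal{B}_k^x \mid \xi_1 = 1) \\
&= P\{(x,\ x + \xi_1,\ \ldots,\ x + \xi_1 + \cdots + \xi_k) \in B_k^x \mid \xi_1 = 1\} \\
&= P\{(x + 1,\ x + 1 + \xi_2,\ \ldots,\ x + 1 + \xi_2 + \cdots + \xi_k) \in B_{k-1}^{x+1}\} \\
&= P\{(x + 1,\ x + 1 + \xi_1,\ \ldots,\ x + 1 + \xi_1 + \cdots + \xi_{k-1}) \in B_{k-1}^{x+1}\} \\
&= P(\mathcal{B}_{k-1}^{x+1}).
\end{aligned}$$

In the same way,

$$P(\mathcal{B}_k^x \mid S_1^x = x - 1) = P(\mathcal{B}_{k-1}^{x-1}).$$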

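Consequently, by (3) with $x \in (A, B)$ and $k \le n$,

$$\beta_k(x) = p\,\beta_{k-1}(x + 1) + q\,\beta_{k-1}(x - 1), \tag{4}$$

where

$$\beta_l(B) = 1, \qquad \beta_l(A) = 0, \qquad 0 \le l \le n. \tag{5}$$

Similarly

$$\alpha_k(x) = p\,\alpha_{k-1}(x + 1) + q\,\alpha_{k-1}(x - 1) \tag{6}$$

with

$$\alpha_l(A) = 1, \qquad \alpha_l(B) = 0, \qquad 0 \le l \le n.$$

Since $\alpha_0(x) = \beta_0(x) = 0$, $x \in (A, B)$, these recurrent relations can (at least in principle) be solved for the probabilities

$$\alpha_1(x), \ldots, \alpha_n(x) \quad \text{and} \quad \beta_1(x), \ldots, \beta_n(x).$$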
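The recurrent relations (4)-(6) translate directly into an iteration over $k$. The sketch below is one possible implementation under the assumption $A < B$; the function name `exit_probabilities` and the sample parameters are hypothetical. It starts from $\alpha_0 = \beta_0 = 0$ inside $(A, B)$, keeps the boundary values fixed, and applies the recurrences $n$ times with exact fractions.

```python
from fractions import Fraction

def exit_probabilities(A, B, p, n):
    """Return (alpha_n, beta_n) as dicts over x = A, ..., B using recurrences (4)-(6)."""
    q = 1 - p
    xs = range(A, B + 1)
    alpha = {x: Fraction(0) for x in xs}
    beta = {x: Fraction(0) for x in xs}
    # Boundary values, the same for every step l (see (5) and the line after (6)).
    alpha[A], beta[B] = Fraction(1), Fraction(1)
    for _ in range(n):
        new_alpha, new_beta = dict(alpha), dict(beta)
        for x in range(A + 1, B):
            new_alpha[x] = p * alpha[x + 1] + q * alpha[x - 1]
            new_beta[x] = p * beta[x + 1] + q * beta[x - 1]
        alpha, beta = new_alpha, new_beta
    return alpha, beta

# Usage: symmetric walk on (-2, 2), two steps, starting point x = 0.
alpha_2, beta_2 = exit_probabilities(-2, 2, Fraction(1, 2), 2)
print(alpha_2[0], beta_2[0])
```

For this small case the result can be checked by hand: the walk leaves $(-2, 2)$ within two steps only along the paths $(+1, +1)$ or $(-1, -1)$, each of probability $1/4$.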

Putting aside any explicit calculation of the probabilities, we ask for their values for large $n$.
For this purpose we notice that since $\mathcal{B}_{k-1}^x \subset \mathcal{B}_k^x$, $k \le n$, we have $\beta_{k-1}(x) \le \beta_k(x) \le 1$. It is therefore natural to expect (and this is actually the case; see Subsection 3) that for sufficiently large $n$ the probability $\beta_n(x)$ will be close to the solution $\beta(x)$ of the equation

$$\beta(x) = p\,\beta(x + 1) + q\,\beta(x - 1) \tag{7}$$

with the boundary conditions

$$\beta(B) = 1, \qquad \beta(A) = 0, \tag{8}$$

that result from a formal approach to the limit in (4) and (5).
To solve the problem in (7) and (8), we first suppose that $p \ne q$. We see easily that the equation has the two particular solutions $a$ and $b(q/p)^x$, where $a$ and $b$ are constants. Hence we look for a solution of the form

$$\beta(x) = a + b(q/p)^x. \tag{9}$$

Taking account of (8), we find that for $A \le x \le B$

$$\beta(x) = \frac{(q/p)^x - (q/p)^A}{(q/p)^B - (q/p)^A}. \tag{10}$$

Let us show that this is the only solution of our problem. It is enough to show that all solutions of the problem in (7) and (8) admit the representation (9).
Let $\tilde{\beta}(x)$ be a solution with $\tilde{\beta}(A) = 0$, $\tilde{\beta}(B) = 1$. We can always find constants $\tilde{a}$ and $\tilde{b}$ such that

$$\tilde{a} + \tilde{b}(q/p)^A = \tilde{\beta}(A), \qquad \tilde{a} + \tilde{b}(q/p)^{A+1} = \tilde{\beta}(A + 1).$$

Then it follows from (7) that

$$\tilde{\beta}(A + 2) = \tilde{a} + \tilde{b}(q/p)^{A+2}$$

and generally

$$\tilde{\beta}(x) = \tilde{a} + \tilde{b}(q/p)^x.$$

Consequently the solution (10) is the only solution of our problem.


A similar discussion shows that the only solution of

$$\alpha(x) = p\,\alpha(x + 1) + q\,\alpha(x - 1), \qquad x \in (A, B) \tag{11}$$

with the boundary conditions

$$\alpha(A) = 1, \qquad \alpha(B) = 0 \tag{12}$$

is given by the formula

$$\alpha(x) = \frac{(q/p)^B - (q/p)^x}{(q/p)^B - (q/p)^A}, \qquad A \le x \le B. \tag{13}$$

If $p = q = \frac{1}{2}$, the only solutions $\beta(x)$ and $\alpha(x)$ of (7), (8) and (11), (12) are respectively

$$\beta(x) = \frac{x - A}{B - A} \tag{14}$$

and

$$\alpha(x) = \frac{B - x}{B - A}. \tag{15}$$
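The closed forms (10) and (13)-(15) can be checked against the difference equations (7), (11) and their boundary conditions. The following sketch assumes $0 < p < 1$ is supplied as a `Fraction` and $A < B$ are integers; the function names `beta` and `alpha` and the sample parameters are illustrative only.

```python
from fractions import Fraction

def beta(x, A, B, p):
    """Limiting probability of leaving (A, B) through B, formulas (10) and (14)."""
    q = 1 - p
    if p == q:
        return Fraction(x - A, B - A)
    r = q / p
    return (r ** x - r ** A) / (r ** B - r ** A)

def alpha(x, A, B, p):
    """Limiting probability of leaving (A, B) through A, formulas (13) and (15)."""
    q = 1 - p
    if p == q:
        return Fraction(B - x, B - A)
    r = q / p
    return (r ** B - r ** x) / (r ** B - r ** A)

# Usage with assumed parameters A = -3, B = 4, p = 2/5.
A, B, p = -3, 4, Fraction(2, 5)
q = 1 - p
for x in range(A + 1, B):
    # Equation (7) for beta and equation (11) for alpha hold at every interior point.
    assert beta(x, A, B, p) == p * beta(x + 1, A, B, p) + q * beta(x - 1, A, B, p)
    assert alpha(x, A, B, p) == p * alpha(x + 1, A, B, p) + q * alpha(x - 1, A, B, p)
print(alpha(0, A, B, p), beta(0, A, B, p))
```

Since `Fraction` supports negative integer exponents exactly, the checks inside the loop are exact identities, not floating-point approximations.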


We note that

$$\alpha(x) + \beta(x) = 1 \tag{16}$$

for $0 \le p \le 1$.
We call $\alpha(x)$ and $\beta(x)$ the probabilities of ruin for the first and second players, respectively (when the first player's bankroll is $x - A$, and the second player's is $B - x$) under the assumption of infinitely many turns, which of course presupposes an infinite sequence of independent Bernoulli random variables $\xi_1, \xi_2, \ldots$, where $\xi_i = +1$ is treated as a gain for the first player, and $\xi_i = -1$ as a loss. The probability space $(\Omega, \mathcal{A}, P)$ considered at the beginning of this section turns out to be too small to allow such an infinite sequence of independent variables. We shall see later that such a sequence can actually be constructed and that $\beta(x)$ and $\alpha(x)$ are in fact the probabilities of ruin in an unbounded number of steps.
We now take up some corollaries of the preceding formulas.
If we take $A = 0$, $0 \le x \le B$, then the definition of $\beta(x)$ implies that this is the probability that a particle starting at $x$ arrives at $B$ before it reaches 0. It follows from (10) and (14) (Figure 16) that

$$\beta(x) = \begin{cases} x/B, & p = q = \frac{1}{2}, \\[4pt] \dfrac{(q/p)^x - 1}{(q/p)^B - 1}, & p \ne q. \end{cases} \tag{17}$$

Figure 16. Graph of $\beta(x)$, the probability that a particle starting from $x$ reaches $B$ before reaching 0.

Now let $q > p$, which means that the game is unfavorable for the first player, whose limiting probability of being ruined, namely $\alpha = \alpha(0)$, is given by

$$\alpha = \frac{(q/p)^B - 1}{(q/p)^B - (q/p)^A}.$$

Next suppose that the rules of the game are changed: the original bankrolls of the players are still $(-A)$ and $B$, but the payoff for each player is now $\frac{1}{2}$, rather than 1 as before. In other words, now let $P(\xi_i = \frac{1}{2}) = p$, $P(\xi_i = -\frac{1}{2}) = q$. In this case let us denote the limiting probability of ruin for the first player by $\alpha_{1/2}$. Then

$$\alpha_{1/2} = \frac{(q/p)^{2B} - 1}{(q/p)^{2B} - (q/p)^{2A}},$$

and therefore

$$\alpha_{1/2} = \alpha \cdot \frac{(q/p)^B + 1}{(q/p)^B + (q/p)^A} > \alpha,$$

if $q > p$.
Hence we can draw the following conclusion: if the game is unfavorable to the first player (i.e., $q > p$) then doubling the stake decreases the probability of ruin.
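A short numerical illustration of this conclusion follows. It is a sketch under assumed parameters that do not appear in the text (bankrolls $-A = B = 10$ and $p = 0.45$); the function `ruin_first_player` simply evaluates the formula for $\alpha$ after measuring both bankrolls in units of the stake, so stake $\frac{1}{2}$ reproduces $\alpha_{1/2}$.

```python
def ruin_first_player(A, B, p, stake=1.0):
    """Limiting ruin probability of the first player for p != q, starting from x = 0.

    Bankrolls (-A) and B are measured in units of the stake, so a stake of 1/2
    replaces A and B by 2A and 2B, as in the formula for alpha_{1/2}.
    """
    q = 1.0 - p
    r = q / p
    a, b = A / stake, B / stake
    return (r ** b - 1.0) / (r ** b - r ** a)

# Assumed parameters: an unfavorable game (q > p) with equal bankrolls of 10 units.
A, B, p = -10, 10, 0.45
for stake in (0.5, 1.0, 2.0):
    print(stake, ruin_first_player(A, B, p, stake))
```

With these parameters the printed ruin probability decreases as the stake grows, in line with the inequality $\alpha_{1/2} > \alpha$.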
3. We now turn to the question of how fast $\alpha_n(x)$ and $\beta_n(x)$ approach their limiting values $\alpha(x)$ and $\beta(x)$.
Let us suppose for simplicity that $x = 0$ and put

$$\alpha_n = \alpha_n(0), \qquad \beta_n = \beta_n(0), \qquad \gamma_n = 1 - (\alpha_n + \beta_n).$$

It is clear that

$$\gamma_n = P\{A < S_k < B,\ 0 \le k \le n\},$$

where $\{A < S_k < B,\ 0 \le k \le n\}$ denotes the event

$$\bigcap_{0 \le k \le n} \{A < S_k < B\}.$$

Let $n = rm$, where $r$ and $m$ are integers and

$$\begin{aligned}
\zeta_1 &= \xi_1 + \cdots + \xi_m, \\
\zeta_2 &= \xi_{m+1} + \cdots + \xi_{2m}, \\
&\ \ \vdots \\
\zeta_r &= \xi_{m(r-1)+1} + \cdots + \xi_{rm}.
\end{aligned}$$

Then if $C = |A| + B$, it is easy to see that

$$\{A < S_k < B,\ 1 \le k \le rm\} \subseteq \{|\zeta_1| < C, \ldots, |\zeta_r| < C\},$$

and therefore, since $\zeta_1, \ldots, \zeta_r$ are independent and identically distributed,

$$\gamma_n \le P\{|\zeta_1| < C, \ldots, |\zeta_r| < C\} = \prod_{i=1}^{r} P\{|\zeta_i| < C\} = (P\{|\zeta_1| < C\})^r. \tag{18}$$

We notice that $V\zeta_1 = m[1 - (p - q)^2]$. Hence, for $0 < p < 1$ and sufficiently large $m$,

$$P\{|\zeta_1| < C\} \le \varepsilon_1, \tag{19}$$

where $\varepsilon_1 < 1$, since $V\zeta_1 \le C^2$ if $P\{|\zeta_1| \le C\} = 1$.
If $p = 0$ or $p = 1$, then $P\{|\zeta_1| < C\} = 0$ for sufficiently large $m$, and consequently (19) is satisfied for $0 \le p \le 1$.
It follows from (18) and (19) that for sufficiently large $n$

$$\gamma_n \le \varepsilon^n, \tag{20}$$

where $\varepsilon = \varepsilon_1^{1/m} < 1$.
According to (16), $\alpha + \beta = 1$. Therefore

$$(\alpha - \alpha_n) + (\beta - \beta_n) = \gamma_n,$$

and since $\alpha \ge \alpha_n$, $\beta \ge \beta_n$, we have

$$0 \le \alpha - \alpha_n \le \gamma_n \le \varepsilon^n, \qquad 0 \le \beta - \beta_n \le \gamma_n \le \varepsilon^n, \qquad \varepsilon < 1.$$

There are similar inequalities for the differences $\alpha(x) - \alpha_n(x)$ and $\beta(x) - \beta_n(x)$.
4. We now consider the question of the mean duration of the random walk.
Let $m_k(x) = E\tau_k^x$ be the expectation of the stopping time $\tau_k^x$, $k \le n$. Proceeding as in the derivation of the recurrent relations for $\beta_k(x)$, we find that, for $x \in (A, B)$,

$$\begin{aligned}
m_k(x) = E\tau_k^x &= \sum_{1 \le l \le k} l\,P(\tau_k^x = l) \\
&= \sum_{1 \le l \le k} l\,[p\,P(\tau_k^x = l \mid \xi_1 = 1) + q\,P(\tau_k^x = l \mid \xi_1 = -1)] \\
&= \sum_{1 \le l \le k} l\,[p\,P(\tau_{k-1}^{x+1} = l - 1) + q\,P(\tau_{k-1}^{x-1} = l - 1)] \\
&= \sum_{0 \le l \le k-1} (l + 1)\,[p\,P(\tau_{k-1}^{x+1} = l) + q\,P(\tau_{k-1}^{x-1} = l)] \\
&= p\,m_{k-1}(x + 1) + q\,m_{k-1}(x - 1) + \sum_{0 \le l \le k-1} [p\,P(\tau_{k-1}^{x+1} = l) + q\,P(\tau_{k-1}^{x-1} = l)] \\
&= p\,m_{k-1}(x + 1) + q\,m_{k-1}(x - 1) + 1.
\end{aligned}$$

Thus, for $x \in (A, B)$ and $0 \le k \le n$, the functions $m_k(x)$ satisfy the recurrent relations

$$m_k(x) = 1 + p\,m_{k-1}(x + 1) + q\,m_{k-1}(x - 1), \tag{21}$$

with $m_0(x) = 0$. From these equations together with the boundary conditions

$$m_k(A) = m_k(B) = 0,$$

we can successively find $m_1(x), \ldots, m_n(x)$.
Since $m_k(x) \le m_{k+1}(x)$, the limit

$$m(x) = \lim_{n \to \infty} m_n(x) \tag{22}$$

exists, and by (21) it satisfies the equation

$$m(x) = 1 + p\,m(x + 1) + q\,m(x - 1) \tag{23}$$

with the boundary conditions

$$m(A) = m(B) = 0. \tag{24}$$

To solve this equation, we first suppose that

$$m(x) < \infty, \qquad x \in (A, B). \tag{25}$$

Then if $p \ne q$ there is a particular solution of the form $x/(q - p)$ and the general solution (see (9)) can be written in the form

$$m(x) = \frac{x}{q - p} + a + b\left(\frac{q}{p}\right)^x.$$

Then by using the boundary conditions $m(A) = m(B) = 0$ we find that

$$m(x) = \frac{1}{p - q}\,\bigl(B\beta(x) + A\alpha(x) - x\bigr), \tag{26}$$

where $\beta(x)$ and $\alpha(x)$ are defined by (10) and (13). If $p = q = \frac{1}{2}$, the general solution of (23) has the form

$$m(x) = a + bx - x^2,$$

and since $m(A) = m(B) = 0$ we have

$$m(x) = (B - x)(x - A). \tag{27}$$

It follows, in particular, that if the players start with equal bankrolls ($B = -A$), then

$$m(0) = B^2.$$

If we take $B = 10$, and suppose that each turn takes a second, then the (limiting) time to the ruin of one player is rather long: 100 seconds.
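The figure of 100 seconds can be reproduced by iterating the recurrence (21) until it stabilizes. The sketch below makes two assumptions that are not in the text: floating-point arithmetic is used, and the number of iterations (5000) is simply taken large enough for the symmetric walk on $(-10, 10)$ to be close to its limit. The function name `mean_duration` is illustrative.

```python
def mean_duration(A, B, p, n):
    """Return m_n as a dict over x = A, ..., B, computed from recurrence (21)."""
    q = 1.0 - p
    m = {x: 0.0 for x in range(A, B + 1)}  # m_0(x) = 0; boundary values stay 0
    for _ in range(n):
        new_m = dict(m)
        for x in range(A + 1, B):
            new_m[x] = 1.0 + p * m[x + 1] + q * m[x - 1]
        m = new_m
    return m

# Symmetric game with equal bankrolls B = -A = 10.
m = mean_duration(-10, 10, 0.5, 5000)
print(m[0])                     # close to B**2 = 100
print(m[3], (10 - 3) * (3 + 10))  # compare with (B - x)(x - A) from (27)
```

The same function can be run with $p \ne q$ and compared with (26), which the test below does for one assumed parameter set.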
We obtained (26) and (27) under the assumption that $m(x) < \infty$, $x \in (A, B)$. Let us now show that in fact $m(x)$ is finite for all $x \in (A, B)$. We consider only the case $x = 0$; the general case can be analyzed similarly.
Let $p = q = \frac{1}{2}$. We introduce the random variable $S_{\tau_n}$ defined in terms of the sequence $S_0, S_1, \ldots, S_n$ and the stopping time $\tau_n = \tau_n^0$ by the equation

$$S_{\tau_n} = \sum_{k=0}^{n} S_k\, I_{\{\tau_n = k\}}(\omega). \tag{28}$$

The descriptive meaning of $S_{\tau_n}$ is clear: it is the position reached by the random walk at the stopping time $\tau_n$. Thus, if $\tau_n < n$, then $S_{\tau_n} = A$ or $B$; if $\tau_n = n$, then $A \le S_{\tau_n} \le B$.

Let us show that when $p = q = \frac{1}{2}$,

$$ES_{\tau_n} = 0, \tag{29}$$

$$ES_{\tau_n}^2 = E\tau_n. \tag{30}$$

To establish the first equation we notice that

$$\begin{aligned}
ES_{\tau_n} &= \sum_{k=0}^{n} E[S_k\, I_{\{\tau_n = k\}}(\omega)] \\
&= \sum_{k=0}^{n} E[S_n\, I_{\{\tau_n = k\}}(\omega)] + \sum_{k=0}^{n} E[(S_k - S_n)\, I_{\{\tau_n = k\}}(\omega)] \\
&= ES_n + \sum_{k=0}^{n} E[(S_k - S_n)\, I_{\{\tau_n = k\}}(\omega)],
\end{aligned} \tag{31}$$

where we evidently have $ES_n = 0$. Let us show that

$$\sum_{k=0}^{n} E[(S_k - S_n)\, I_{\{\tau_n = k\}}(\omega)] = 0.$$

To do this, we notice that $\{\tau_n > k\} = \{A < S_1 < B, \ldots, A < S_k < B\}$ when $0 \le k < n$. The event $\{A < S_1 < B, \ldots, A < S_k < B\}$ can evidently be written in the form

$$\{\omega : (\xi_1, \ldots, \xi_k) \in A_k\}, \tag{32}$$

where $A_k$ is a subset of $\{-1, +1\}^k$. In other words, this set is determined by just the values of $\xi_1, \ldots, \xi_k$ and does not depend on $\xi_{k+1}, \ldots, \xi_n$. Since

$$\{\tau_n = k\} = \{\tau_n > k - 1\} \setminus \{\tau_n > k\},$$

this is also a set of the form (32). It then follows from the independence of $\xi_1, \ldots, \xi_n$ and from Problem 9 of §4 that the random variables $S_n - S_k$ and $I_{\{\tau_n = k\}}$ are independent, and therefore

$$E[(S_n - S_k)\, I_{\{\tau_n = k\}}] = E[S_n - S_k] \cdot E\,I_{\{\tau_n = k\}} = 0.$$

Hence we have established (29).
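We can prove (30) by the same method:

$$\begin{aligned}
ES_{\tau_n}^2 &= \sum_{k=0}^{n} E\,S_k^2\, I_{\{\tau_n = k\}} = \sum_{k=0}^{n} E\bigl([S_n + (S_k - S_n)]^2\, I_{\{\tau_n = k\}}\bigr) \\
&= \sum_{k=0}^{n} \bigl[E\,S_n^2\, I_{\{\tau_n = k\}} + 2E\,S_n(S_k - S_n)\, I_{\{\tau_n = k\}} + E(S_n - S_k)^2\, I_{\{\tau_n = k\}}\bigr] \\
&= ES_n^2 - \sum_{k=0}^{n} E(S_n - S_k)^2\, I_{\{\tau_n = k\}} \\
&= n - \sum_{k=0}^{n} (n - k)\,P(\tau_n = k) = \sum_{k=0}^{n} k\,P(\tau_n = k) = E\tau_n.
\end{aligned}$$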
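Identities (29) and (30) can be verified exhaustively for small $n$ by enumerating all $2^n$ equally likely paths of the symmetric walk. The sketch below assumes $A < 0 < B$ and uses hypothetical values $A = -2$, $B = 3$, $n = 8$; the function name `stopped_moments` is illustrative.

```python
from fractions import Fraction
from itertools import product

def stopped_moments(A, B, n):
    """Return (E S_tau, E S_tau**2, E tau) for the symmetric walk started at 0.

    tau is the first time the walk hits A or B, or n if that has not happened
    by time n. All 2**n step sequences are enumerated, each with weight 2**-n.
    """
    sum_s = sum_s2 = sum_tau = 0
    for steps in product((-1, 1), repeat=n):
        s, tau = 0, n
        for k, step in enumerate(steps, start=1):
            s += step
            if s == A or s == B:
                tau = k
                break
        sum_s += s
        sum_s2 += s * s
        sum_tau += tau
    weight = Fraction(1, 2 ** n)
    return sum_s * weight, sum_s2 * weight, sum_tau * weight

es, es2, etau = stopped_moments(-2, 3, 8)
print(es, es2, etau)  # es should be 0 and es2 should equal etau
```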

Thus we have (29) and (30) when $p = q = \frac{1}{2}$. For general $p$ and $q$ ($p + q = 1$) it can be shown similarly that

$$ES_{\tau_n} = (p - q) \cdot E\tau_n, \tag{33}$$

$$E[S_{\tau_n} - \tau_n \cdot E\xi_1]^2 = V\xi_1 \cdot E\tau_n, \tag{34}$$

where $E\xi_1 = p - q$, $V\xi_1 = 1 - (p - q)^2$.
With the aid of the results obtained so far we can now show that $\lim_{n \to \infty} m_n(0) = m(0) < \infty$.
If $p = q = \frac{1}{2}$, then by (30)

$$E\tau_n \le \max(A^2, B^2). \tag{35}$$

If $p \ne q$, then by (33),

$$E\tau_n \le \frac{\max(|A|, B)}{|p - q|}, \tag{36}$$

from which it is clear that $m(0) < \infty$.

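We also notice that when $p = q = \frac{1}{2}$

$$E\tau_n = ES_{\tau_n}^2 = A^2 \alpha_n + B^2 \beta_n + E\bigl[S_n^2\, I_{\{A < S_k < B,\ 0 \le k \le n\}}\bigr]$$

and therefore

$$A^2 \alpha_n + B^2 \beta_n \le E\tau_n \le A^2 \alpha_n + B^2 \beta_n + \max(A^2, B^2)\,\gamma_n.$$

It follows from this and (20) that as $n \to \infty$, $E\tau_n$ converges with exponential rapidity to

$$m(0) = A^2 \alpha + B^2 \beta = A^2\, \frac{B}{B - A} - B^2\, \frac{A}{B - A} = |AB|.$$

There is a similar result when $p \ne q$:

$$E\tau_n \to m(0) = \frac{\alpha A + \beta B}{p - q},$$

exponentially fast.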

5. PROBLEMS
1. Establish the following generalizations of (33) and (34):

$$ES_{\tau_n^x}^x = x + (p - q)\,E\tau_n^x,$$

$$E[S_{\tau_n^x}^x - \tau_n^x \cdot E\xi_1]^2 = V\xi_1 \cdot E\tau_n^x + x^2.$$

2. Investigate the limits of $\alpha(x)$, $\beta(x)$, and $m(x)$ when the level $A \downarrow -\infty$.
3. Let $p = q = \frac{1}{2}$ in the Bernoulli scheme. What is the order of $E|S_n|$ for large $n$?
4. Two players each toss their own symmetric coins, independently. Show that the probability that each has the same number of heads after $n$ tosses is $2^{-2n} \sum_{k=0}^{n} (C_n^k)^2$. Hence deduce the equation $\sum_{k=0}^{n} (C_n^k)^2 = C_{2n}^n$.
Let $\sigma_n$ be the first time when the number of heads for the first player coincides with the number of heads for the second player (if this happens within $n$ tosses; $\sigma_n = n + 1$ if there is no such time). Find $E \min(\sigma_n, n)$.

10. Random Walk. II. Reflection Principle. Arcsine Law
1. As in the preceding section, we suppose that $\xi_1, \xi_2, \ldots, \xi_{2n}$ is a sequence of independent identically distributed Bernoulli random variables with

$$P(\xi_i = 1) = p, \qquad P(\xi_i = -1) = q,$$

$$S_k = \xi_1 + \cdots + \xi_k, \quad 1 \le k \le 2n; \qquad S_0 = 0.$$

We define

$$\sigma_{2n} = \min\{1 \le k \le 2n : S_k = 0\},$$

putting $\sigma_{2n} = \infty$ if $S_k \ne 0$ for $1 \le k \le 2n$.
The descriptive meaning of $\sigma_{2n}$ is clear: it is the time of first return to zero. Properties of this time are studied in the present section, where we assume that the random walk is symmetric, i.e. $p = q = \frac{1}{2}$.
For $0 \le k \le n$ we write

$$u_{2k} = P(S_{2k} = 0), \qquad f_{2k} = P(\sigma_{2n} = 2k). \tag{1}$$

It is clear that $u_0 = 1$ and

$$u_{2k} = C_{2k}^{k} \cdot 2^{-2k}.$$

Our immediate aim is to show that for $1 \le k \le n$ the probability $f_{2k}$ is given by

$$f_{2k} = \frac{1}{2k}\, u_{2(k-1)}. \tag{2}$$

It is clear that

$$\{\sigma_{2n} = 2k\} = \{S_1 \ne 0,\ S_2 \ne 0,\ \ldots,\ S_{2k-1} \ne 0,\ S_{2k} = 0\}$$

for $1 \le k \le n$, and by symmetry

$$f_{2k} = P\{S_1 \ne 0, \ldots, S_{2k-1} \ne 0,\ S_{2k} = 0\} = 2P\{S_1 > 0, \ldots, S_{2k-1} > 0,\ S_{2k} = 0\}. \tag{3}$$

A sequence $(S_0, \ldots, S_k)$ is called a path of length $k$; we denote by $L_k(A)$ the number of paths of length $k$ having some specified property $A$. Then

$$\begin{aligned}
f_{2k} &= 2 \sum_{(a_{2k+1}, \ldots, a_{2n})} L_{2n}(S_1 > 0, \ldots, S_{2k-1} > 0,\ S_{2k} = 0, \\
&\qquad\qquad \text{and } S_{2k+1} = a_{2k+1}, \ldots, S_{2n} = a_{2k+1} + \cdots + a_{2n}) \cdot 2^{-2n} \\
&= 2L_{2k}(S_1 > 0, \ldots, S_{2k-1} > 0,\ S_{2k} = 0) \cdot 2^{-2k},
\end{aligned} \tag{4}$$

where the summation is over all sets $(a_{2k+1}, \ldots, a_{2n})$ with $a_i = \pm 1$.
Consequently the determination of the probability $f_{2k}$ reduces to calculating the number of paths $L_{2k}(S_1 > 0, \ldots, S_{2k-1} > 0,\ S_{2k} = 0)$.

Lemma 1. Let $a$ and $b$ be nonnegative integers, $a - b > 0$ and $k = a + b$. Then

$$L_k(S_1 > 0, \ldots, S_{k-1} > 0,\ S_k = a - b) = \frac{a - b}{k}\, C_k^a. \tag{5}$$

PROOF. In fact,

$$\begin{aligned}
L_k(S_1 > 0, \ldots, S_{k-1} > 0,\ S_k = a - b) &= L_k(S_1 = 1,\ S_2 > 0, \ldots, S_{k-1} > 0,\ S_k = a - b) \\
&= L_k(S_1 = 1,\ S_k = a - b) - L_k(S_1 = 1,\ S_k = a - b; \\
&\qquad \text{and } \exists\, i,\ 2 \le i \le k - 1, \text{ such that } S_i \le 0).
\end{aligned} \tag{6}$$

In other words, the number of positive paths $(S_1, S_2, \ldots, S_k)$ that originate at $(1, 1)$ and terminate at $(k, a - b)$ is the same as the total number of paths from $(1, 1)$ to $(k, a - b)$ after excluding the paths that touch or intersect the time axis.*
We now notice that

$$L_k(S_1 = 1,\ S_k = a - b;\ \exists\, i,\ 2 \le i \le k - 1, \text{ such that } S_i \le 0) = L_k(S_1 = -1,\ S_k = a - b), \tag{7}$$

i.e. the number of paths from $\alpha = (1, 1)$ to $\beta = (k, a - b)$ that touch or intersect the time axis is equal to the total number of paths that connect $\alpha^* = (1, -1)$ with $\beta$. The proof of this statement, known as the reflection principle, follows from the easily established one-to-one correspondence between the paths $A = (S_1, \ldots, S_a, S_{a+1}, \ldots, S_k)$ joining $\alpha$ and $\beta$, and paths $B = (-S_1, \ldots, -S_a, S_{a+1}, \ldots, S_k)$ joining $\alpha^*$ and $\beta$ (Figure 17); $a$ is the first point where $A$ and $B$ reach zero.

* A path $(S_1, \ldots, S_k)$ is called positive (or nonnegative) if all $S_i > 0$ ($S_i \ge 0$); a path is said to touch the time axis if $S_j \ge 0$ or else $S_j \le 0$, for $1 \le j \le k$, and there is an $i$, $1 \le i \le k$, such that $S_i = 0$; and a path is said to intersect the time axis if there are two times $i$ and $j$ such that $S_i > 0$ and $S_j < 0$.

Figure 17. The reflection principle.

From (6) and (7) we find

$$\begin{aligned}
L_k(S_1 > 0, \ldots, S_{k-1} > 0,\ S_k = a - b) &= L_k(S_1 = 1,\ S_k = a - b) - L_k(S_1 = -1,\ S_k = a - b) \\
&= C_{k-1}^{a-1} - C_{k-1}^{a} = \frac{a - b}{k}\, C_k^a,
\end{aligned}$$

which establishes (5).
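Formula (5) can be checked by brute force for small $k$. The sketch below counts, by direct enumeration of all $\pm 1$ step sequences of length $k = a + b$, the paths that stay strictly positive and end at $a - b$, and compares the count with $\frac{a-b}{k} C_k^a$. The function name `count_positive_paths` and the tested range of $a$, $b$ are choices made for illustration.

```python
from itertools import product
from math import comb

def count_positive_paths(a, b):
    """Count paths with S_1 > 0, ..., S_{k-1} > 0 and S_k = a - b, where k = a + b."""
    k = a + b
    count = 0
    for steps in product((-1, 1), repeat=k):
        s, positive = 0, True
        for step in steps:
            s += step
            if s <= 0:
                positive = False
                break
        if positive and s == a - b:
            count += 1
    return count

# Compare with (a - b)/k * C(k, a), written without division to stay in integers.
for a in range(1, 6):
    for b in range(0, a):
        k = a + b
        assert count_positive_paths(a, b) * k == (a - b) * comb(k, a)
print(count_positive_paths(3, 2))  # paths of length 5 ending at 1
```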


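Turning to the calculation of $f_{2k}$, we find that by (4) and (5) (with $a = k$, $b = k - 1$),

$$\begin{aligned}
f_{2k} &= 2L_{2k}(S_1 > 0, \ldots, S_{2k-1} > 0,\ S_{2k} = 0) \cdot 2^{-2k} \\
&= 2L_{2k-1}(S_1 > 0, \ldots, S_{2k-1} = 1) \cdot 2^{-2k} \\
&= 2 \cdot 2^{-2k} \cdot \frac{1}{2k - 1}\, C_{2k-1}^{k} = \frac{1}{2k}\, u_{2(k-1)}.
\end{aligned}$$

Hence (2) is established.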
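Formula (2) can also be confirmed numerically. The following sketch enumerates all symmetric walks of length $2n$ for an assumed small value $n = 5$, records the time of first return to zero, and compares the resulting probabilities with $\frac{1}{2k} u_{2(k-1)}$, where $u_{2k} = C_{2k}^k 2^{-2k}$. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

def u(two_k):
    """u_{2k} = P(S_{2k} = 0) = C(2k, k) * 2**(-2k) for the symmetric walk."""
    k = two_k // 2
    return Fraction(comb(2 * k, k), 4 ** k)

def first_return_distribution(n):
    """Return {2k: f_{2k}} for k = 1, ..., n by enumerating all paths of length 2n."""
    counts = {}
    for steps in product((-1, 1), repeat=2 * n):
        s = 0
        for time, step in enumerate(steps, start=1):
            s += step
            if s == 0:
                counts[time] = counts.get(time, 0) + 1
                break
    return {time: Fraction(c, 4 ** n) for time, c in counts.items()}

f = first_return_distribution(5)
for k in range(1, 6):
    print(2 * k, f[2 * k], u(2 * (k - 1)) / (2 * k))
```

The same values also equal the differences $u_{2(k-1)} - u_{2k}$, which follows from the explicit expression for $u_{2k}$.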


We present an alternative proof of this formula, based on the following observation. A straightforward verification shows that

$$\frac{1}{2k}\, u_{2(k-1)} = u_{2(k-1)} - u_{2k}. \tag{8}$$

At the same time, it is clear that

$$\{\sigma_{2n} = 2k\} = \{\sigma_{2n} > 2(k - 1)\} \setminus \{\sigma_{2n} > 2k\},$$

$$\{\sigma_{2n} > 2l\} = \{S_1 \ne 0, \ldots, S_{2l} \ne 0\}$$

and therefore

$$\{\sigma_{2n} = 2k\} = \{S_1 \ne 0, \ldots, S_{2(k-1)} \ne 0\} \setminus \{S_1 \ne 0, \ldots, S_{2k} \ne 0\}.$$

Hence

$$f_{2k} = P\{S_1 \ne 0, \ldots, S_{2(k-1)} \ne 0\} - P\{S_1 \ne 0, \ldots, S_{2k} \ne 0\},$$

and consequently, because of (8), in order to show that $f_{2k} = (1/2k)\,u_{2(k-1)}$ it is enough to show only that

$$L_{2k}(S_1 \ne 0, \ldots, S_{2k} \ne 0) = L_{2k}(S_{2k} = 0). \tag{9}$$

For this purpose we notice that evidently

$$L_{2k}(S_1 \ne 0, \ldots, S_{2k} \ne 0) = 2L_{2k}(S_1 > 0, \ldots, S_{2k} > 0).$$

Hence to verify (9) we need only establish that

$$2L_{2k}(S_1 > 0, \ldots, S_{2k} > 0) = L_{2k}(S_1 \ge 0, \ldots, S_{2k} \ge 0) \tag{10}$$

and

$$L_{2k}(S_1 \ge 0, \ldots, S_{2k} \ge 0) = L_{2k}(S_{2k} = 0). \tag{11}$$

Figure 18
Now (10) will be established if we show that we can establish a one-to-one correspondence between the nonnegative paths $A = (S_1, \ldots, S_{2k})$ for which at least one $S_i = 0$, and the positive paths $B = (S_1, \ldots, S_{2k})$.
Let $A = (S_1, \ldots, S_{2k})$ be a nonnegative path for which the first zero occurs at the point $a$ (i.e., $S_a = 0$). Let us construct the path, starting at $(a, 2)$, $(S_a + 2,\ S_{a+1} + 2,\ \ldots,\ S_{2k} + 2)$ (indicated by the broken lines in Figure 18). Then the path $B = (S_1, \ldots, S_{a-1},\ S_a + 2, \ldots, S_{2k} + 2)$ is positive.
Conversely, let $B = (S_1, \ldots, S_{2k})$ be a positive path and $b$ the last instant at which $S_b = 1$ (Figure 19). Then the path

$$A = (S_1, \ldots, S_b,\ S_{b+1} - 2, \ldots, S_{2k} - 2)$$

is nonnegative. It follows from these constructions that there is a one-to-one correspondence between the positive paths and the nonnegative paths with at least one $S_i = 0$. Therefore formula (10) is established.

Figure 19

Figure 20

We now establish (11). From symmetry and (10) it is enough to show that

$$L_{2k}(S_1 > 0, \ldots, S_{2k} > 0) + L_{2k}(S_1 \ge 0, \ldots, S_{2k} \ge 0 \text{ and } \exists\, i,\ 1 \le i \le 2k, \text{ such that } S_i = 0) = L_{2k}(S_{2k} = 0).$$

The set of paths $(S_{2k} = 0)$ can be represented as the sum of the two sets $\mathscr{C}_1$ and $\mathscr{C}_2$, where $\mathscr{C}_1$ contains the paths $(S_0, \dots, S_{2k})$ that have just one minimum, and $\mathscr{C}_2$ contains those for which the minimum is attained at at least two points.
Let $C_1 \in \mathscr{C}_1$ (Figure 20) and let $l$ be the minimum point. We put the path $C_1 = (S_0, S_1, \dots, S_{2k})$ in correspondence with the path $C_1^*$ obtained in the following way (Figure 21). We reflect $(S_0, S_1, \dots, S_l)$ around the vertical line through the point $l$, and displace the resulting path to the right and upward, thus releasing it from the point $(2k, 0)$. Then we move the origin to the point $(l, -m)$. The resulting path $C_1^*$ will be positive.
In the same way, if $C_2 \in \mathscr{C}_2$ we can use the same device to put it into correspondence with a nonnegative path $C_2^*$.

Figure 21

Conversely, let $C_1^* = (S_1 > 0, \dots, S_{2k} > 0)$ be a positive path with $S_{2k} = 2m$ (see Figure 21). We make it correspond to the path $C_1$ that is obtained in the following way. Let $p$ be the last point at which $S_p = m$. Reflect $(S_p, \dots, S_{2k})$ with respect to the vertical line $x = p$ and displace the resulting path downward and to the left until its right-hand end coincides with the point $(0, 0)$. Then we move the origin to the left-hand end of the resulting path (this is just the path drawn in Figure 20). The resulting path $C_1 = (S_0, \dots, S_{2k})$ has a minimum at $S_{2k} = 0$. A similar construction applied to paths $(S_1 \ge 0, \dots, S_{2k} \ge 0$ and $\exists\, i$, $1 \le i \le 2k$, with $S_i = 0)$ leads to paths for which there are at least two minima and $S_{2k} = 0$. Hence we have established a one-to-one correspondence, which establishes (11).
Therefore we have established (9) and consequently also the formula

$$f_{2k} = u_{2(k-1)} - u_{2k} = \frac{1}{2k}\,u_{2(k-1)}.$$
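The counting identities (9), (10) and (11) are easy to check by brute force for small $k$. The following is only an illustrative sketch: the helper names `partial_sums` and `path_counts` are ours, not the text's, and the enumeration is feasible only for small values of $k$.

```python
from itertools import product

def partial_sums(steps):
    """Return [S_1, ..., S_m] for a tuple of +1/-1 steps."""
    sums, s = [], 0
    for x in steps:
        s += x
        sums.append(s)
    return sums

def path_counts(k):
    """Count paths of length 2k: (never zero, positive, nonnegative, S_2k = 0)."""
    nonzero = positive = nonneg = end_zero = 0
    for steps in product((1, -1), repeat=2 * k):
        s = partial_sums(steps)
        nonzero += all(v != 0 for v in s)
        positive += all(v > 0 for v in s)
        nonneg += all(v >= 0 for v in s)
        end_zero += s[-1] == 0
    return nonzero, positive, nonneg, end_zero

for k in range(1, 6):
    nonzero, positive, nonneg, end_zero = path_counts(k)
    # (9): nonzero == end_zero, (10): 2 * positive == nonneg, (11): nonneg == end_zero
    print(k, nonzero, 2 * positive, nonneg, end_zero)
```

For each $k$ the four printed counts should coincide, which is exactly what (9), (10) and (11) assert.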

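By Stirling's formula

$$u_{2k} = C_{2k}^{k}\,2^{-2k} \sim \frac{1}{\sqrt{\pi k}}, \qquad k \to \infty.$$

Therefore

$$f_{2k} \sim \frac{1}{2\sqrt{\pi}\,k^{3/2}}, \qquad k \to \infty.$$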

Hence it follows that the expectation of the first time when zero is reached, namely

$$E\min(\sigma_{2n}, 2n) = \sum_{k=1}^{n} 2k\,P(\sigma_{2n} = 2k) + 2n\,u_{2n} = \sum_{k=1}^{n} u_{2(k-1)} + 2n\,u_{2n},$$

can be arbitrarily large.

In addition, $\sum_{k=1}^{\infty} u_{2(k-1)} = \infty$, and consequently the limiting value of the mean time for the walk to reach zero (in an unbounded number of steps) is $\infty$.
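As a hedged numerical illustration, the formula $E\min(\sigma_{2n}, 2n) = \sum_{k=1}^{n} u_{2(k-1)} + 2n\,u_{2n}$ can be compared with a direct enumeration of paths for small $n$ and then evaluated for larger $n$ to see the growth. The function names below are ours, introduced only for this sketch.

```python
from fractions import Fraction
from itertools import product
from math import comb

def u(two_k):
    """u_{2k} = C(2k, k) 2^{-2k}, the probability that S_{2k} = 0."""
    k = two_k // 2
    return Fraction(comb(2 * k, k), 4 ** k)

def mean_capped_return_time(n):
    """E min(sigma_2n, 2n) from the formula sum_{k=1}^n u_{2(k-1)} + 2n u_{2n}."""
    return sum(u(2 * (k - 1)) for k in range(1, n + 1)) + 2 * n * u(2 * n)

def mean_capped_return_time_brute(n):
    """The same expectation, enumerating all 2^{2n} equally likely paths."""
    total = 0
    for steps in product((1, -1), repeat=2 * n):
        s, first_zero = 0, 2 * n          # value 2n if the walk never returns
        for i, x in enumerate(steps, start=1):
            s += x
            if s == 0:
                first_zero = i
                break
        total += first_zero
    return Fraction(total, 4 ** n)

for n in (10, 100, 1000):
    print(n, float(mean_capped_return_time(n)))
```

The printed values keep increasing with $n$, in agreement with the divergence of $\sum u_{2(k-1)}$.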
This property accounts for many of the unexpected properties of the symmetric random walk that we have been discussing. For example, it would be natural to suppose that after time $2n$ the number of zero net scores in a game between two equally matched players ($p = q = \tfrac12$), i.e. the number of instants $i$ at which $S_i = 0$, would be proportional to $2n$. However, in fact the number of zeros has order $\sqrt{2n}$ (see [F1]). Hence it follows, in particular, that, contrary to intuition, the "typical" walk $(S_0, S_1, \dots, S_n)$ does not have a sinusoidal character (so that roughly half the time the particle would be on the positive side and half the time on the negative side), but instead must resemble a stretched-out wave. The precise formulation of this statement is given by the arcsine law, which we proceed to investigate.

2. Let $P_{2k,2n}$ be the probability that during the interval $[0, 2n]$ the particle spends $2k$ units of time on the positive side.*

Lemma 2. Let $u_0 = 1$ and $0 \le k \le n$. Then

$$P_{2k,2n} = u_{2k} \cdot u_{2n-2k}. \qquad (12)$$

PROOF. It was shown above that $f_{2k} = u_{2(k-1)} - u_{2k}$. Let us show that

$$u_{2k} = \sum_{r=1}^{k} f_{2r}\,u_{2(k-r)}. \qquad (13)$$

Since $\{S_{2k} = 0\} \subseteq \{\sigma_{2n} \le 2k\}$, we have

$$\{S_{2k} = 0\} = \{S_{2k} = 0\} \cap \{\sigma_{2n} \le 2k\} = \sum_{1 \le l \le k} \{S_{2k} = 0\} \cap \{\sigma_{2n} = 2l\}.$$

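Consequently

$$u_{2k} = P(S_{2k} = 0) = \sum_{1 \le l \le k} P(S_{2k} = 0,\ \sigma_{2n} = 2l) = \sum_{1 \le l \le k} P(S_{2k} = 0 \mid \sigma_{2n} = 2l)\,P(\sigma_{2n} = 2l).$$

But

$$\begin{aligned}
P(S_{2k} = 0 \mid \sigma_{2n} = 2l) &= P(S_{2k} = 0 \mid S_1 \ne 0, \dots, S_{2l-1} \ne 0,\ S_{2l} = 0) \\
&= P(S_{2l} + (\xi_{2l+1} + \dots + \xi_{2k}) = 0 \mid S_1 \ne 0, \dots, S_{2l-1} \ne 0,\ S_{2l} = 0) \\
&= P(S_{2l} + (\xi_{2l+1} + \dots + \xi_{2k}) = 0 \mid S_{2l} = 0) \\
&= P(\xi_{2l+1} + \dots + \xi_{2k} = 0) = P(S_{2(k-l)} = 0).
\end{aligned}$$

Therefore

$$u_{2k} = \sum_{1 \le l \le k} P(S_{2(k-l)} = 0)\,P(\sigma_{2n} = 2l),$$

which establishes (13).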


We turn now to the proof of (12). It is obviously true for $k = 0$ and $k = n$. Now let $1 \le k \le n - 1$. If the particle is on the positive side for exactly $2k$ instants, it must pass through zero. Let $2r$ be the time of first passage through zero. There are two possibilities: either $S_j \ge 0$ for $j \le 2r$, or $S_j \le 0$ for $j \le 2r$.
The number of paths of the first kind is easily seen to be

$$\left(\tfrac12\,2^{2r} f_{2r}\right) \cdot 2^{2(n-r)}\,P_{2(k-r),2(n-r)} = \tfrac12 \cdot 2^{2n} \cdot f_{2r} \cdot P_{2(k-r),2(n-r)}.$$

* We say that the particle is on the positive side in the interval $[m - 1, m]$ if one, at least, of the values $S_{m-1}$ and $S_m$ is positive.

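The corresponding number of paths of the second kind is

$$\tfrac12 \cdot 2^{2n} \cdot f_{2r} \cdot P_{2k,2(n-r)}.$$

Consequently, for $1 \le k \le n - 1$,

$$P_{2k,2n} = \frac12 \sum_{r=1}^{k} f_{2r}\,P_{2(k-r),2(n-r)} + \frac12 \sum_{r=1}^{n-k} f_{2r}\,P_{2k,2(n-r)}. \qquad (14)$$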

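Let us suppose that $P_{2k,2m} = u_{2k} \cdot u_{2m-2k}$ holds for $m = 1, \dots, n - 1$. Then we find from (13) and (14) that

$$P_{2k,2n} = \tfrac12\,u_{2n-2k} \sum_{r=1}^{k} f_{2r}\,u_{2k-2r} + \tfrac12\,u_{2k} \sum_{r=1}^{n-k} f_{2r}\,u_{2n-2r-2k} = \tfrac12\,u_{2n-2k}\,u_{2k} + \tfrac12\,u_{2k}\,u_{2n-2k} = u_{2k} \cdot u_{2n-2k}.$$

This completes the proof of the lemma.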
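Lemma 2 can also be checked directly for small $n$ by enumerating all paths and counting, for each one, the unit intervals on which the particle is on the positive side in the sense of the footnote. This is a sketch under that convention; `positive_time_distribution` is an illustrative name of ours.

```python
from fractions import Fraction
from itertools import product
from math import comb

def u(two_k):
    """u_{2k} = C(2k, k) 2^{-2k}."""
    k = two_k // 2
    return Fraction(comb(2 * k, k), 4 ** k)

def positive_time_distribution(n):
    """Return {2k: P_{2k,2n}} by enumerating all paths of length 2n."""
    counts = {2 * k: 0 for k in range(n + 1)}
    for steps in product((1, -1), repeat=2 * n):
        prev, time_positive = 0, 0
        for x in steps:
            cur = prev + x
            # the interval [m-1, m] counts as positive if S_{m-1} > 0 or S_m > 0
            if prev > 0 or cur > 0:
                time_positive += 1
            prev = cur
        counts[time_positive] += 1
    return {t: Fraction(c, 4 ** n) for t, c in counts.items()}

n = 4
dist = positive_time_distribution(n)
for k in range(n + 1):
    print(2 * k, dist[2 * k], u(2 * k) * u(2 * n - 2 * k))
```

The two printed columns of probabilities should agree for every $k$; note that the extreme values $k = 0$ and $k = n$ are the most probable ones.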


Now let $\gamma(2n)$ be the number of time units that the particle spends on the positive axis in the interval $[0, 2n]$. Then, when $x < 1$,

$$P\left\{\frac12 < \frac{\gamma(2n)}{2n} \le x\right\} = \sum_{\{k:\ 1/2 < 2k/2n \le x\}} P_{2k,2n}.$$

Since

$$u_{2k} \sim \frac{1}{\sqrt{\pi k}}$$

as $k \to \infty$, we have

$$P_{2k,2n} = u_{2k}\,u_{2(n-k)} \sim \frac{1}{\pi\sqrt{k(n-k)}}$$

as $k \to \infty$ and $n - k \to \infty$.
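Therefore

$$\sum_{\{k:\ 1/2 < 2k/2n \le x\}} P_{2k,2n} - \sum_{\{k:\ 1/2 < 2k/2n \le x\}} \frac{1}{\pi n}\left[\frac{k}{n}\left(1 - \frac{k}{n}\right)\right]^{-1/2} \to 0, \qquad n \to \infty,$$

whence

$$\sum_{\{k:\ 1/2 < 2k/2n \le x\}} P_{2k,2n} - \frac{1}{\pi}\int_{1/2}^{x} \frac{dt}{\sqrt{t(1-t)}} \to 0, \qquad n \to \infty.$$

But, by symmetry,

$$\sum_{\{k:\ k/n \le 1/2\}} P_{2k,2n} \to \frac12$$

and

$$\frac{1}{\pi}\int_{1/2}^{x} \frac{dt}{\sqrt{t(1-t)}} = \frac{2}{\pi}\arcsin\sqrt{x} - \frac12.$$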

Consequently we have proved the following theorem.

Theorem (Arcsine Law). The probability that the fraction of the time spent by the particle on the positive side is at most $x$ tends to $2\pi^{-1}\arcsin\sqrt{x}$:

$$\sum_{\{k:\ k/n \le x\}} P_{2k,2n} \to 2\pi^{-1}\arcsin\sqrt{x}. \qquad (15)$$

We remark that the integrand $p(t)$ in the integral

$$\frac{1}{\pi}\int_0^x \frac{dt}{\sqrt{t(1-t)}}$$

represents a U-shaped curve that tends to infinity as $t \to 0$ or $1$.
Hence it follows that, for large $n$,

$$P\left\{0 < \frac{\gamma(2n)}{2n} \le \Delta\right\} > P\left\{\frac12 < \frac{\gamma(2n)}{2n} \le \frac12 + \Delta\right\},$$

i.e., it is more likely that the fraction of the time spent by the particle on the positive side is close to zero or one, than to the intuitive value $\tfrac12$.
Using a table of arcsines and noting that the convergence in (15) is indeed
quite rapid, we find that

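$$P\left\{\frac{\gamma(2n)}{2n} \le 0.024\right\} \approx 0.1, \qquad P\left\{\frac{\gamma(2n)}{2n} \le 0.1\right\} \approx 0.2,$$

$$P\left\{\frac{\gamma(2n)}{2n} \le 0.2\right\} \approx 0.3, \qquad P\left\{\frac{\gamma(2n)}{2n} \le 0.65\right\} \approx 0.6.$$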
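Instead of a table of arcsines, the limiting values in (15) can be evaluated numerically. The short sketch below uses only the limit $2\pi^{-1}\arcsin\sqrt{x}$ from the theorem; the function name `arcsine_cdf` is ours.

```python
from math import asin, pi, sqrt

def arcsine_cdf(x):
    """Limiting probability that the fraction of time on the positive side is <= x."""
    return 2 / pi * asin(sqrt(x))

for x in (0.024, 0.1, 0.2, 0.65):
    print(f"x = {x:5.3f}  ->  {arcsine_cdf(x):.3f}")
```

The printed values can be compared with the rounded figures 0.1, 0.2, 0.3 and 0.6 quoted above.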


Hence if, say, n = 1000, then in about one case in ten, the particle spends
only 24 units of time on the positive axis and therefore spends the greatest
amount of time, 976 units, on the negative axis.
3. PROBLEMS

1. How fast does $E\min(\sigma_{2n}, 2n) \to \infty$ as $n \to \infty$?
2. Let $\tau_n = \min\{1 \le k \le n: S_k = 1\}$, where we take $\tau_n = \infty$ if $S_k < 1$ for $1 \le k \le n$. What is the limit of $E\min(\tau_n, n)$ as $n \to \infty$ for symmetric ($p = q = \tfrac12$) and for unsymmetric ($p \ne q$) walks?

11. Martingales. Some Applications to the Random Walk
1. The Bernoulli random walk discussed above was generated by a sequence $\xi_1, \dots, \xi_n$ of independent random variables. In this and the next section we introduce two important classes of dependent random variables, those that constitute martingales and Markov chains.
The theory of martingales will be developed in detail in Chapter VII. Here we shall present only the essential definitions, prove a theorem on the preservation of the martingale property for stopping times, and apply this to deduce the "ballot theorem." In turn, the latter theorem will be used for another proof of proposition (10.5), which was obtained above by applying the reflection principle.

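2. Let $(\Omega, \mathscr{A}, P)$ be a finite probability space and $\mathscr{D}_1 \preccurlyeq \mathscr{D}_2 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$ a sequence of decompositions.

Definition 1. A sequence of random variables $\xi_1, \dots, \xi_n$ is called a martingale (with respect to the decomposition $\mathscr{D}_1 \preccurlyeq \mathscr{D}_2 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$) if
(1) $\xi_k$ is $\mathscr{D}_k$-measurable,
(2) $E(\xi_{k+1} \mid \mathscr{D}_k) = \xi_k$, $1 \le k \le n - 1$.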

In order to emphasize the system of decompositions with respect to which the random variables form a martingale, we shall use the notation

$$\xi = (\xi_k, \mathscr{D}_k)_{1 \le k \le n}, \qquad (1)$$

where for the sake of simplicity we often do not mention explicitly that $1 \le k \le n$.
When $\mathscr{D}_k$ is induced by $\xi_1, \dots, \xi_k$, i.e.

$$\mathscr{D}_k = \mathscr{D}_{\xi_1, \dots, \xi_k},$$

instead of saying that $\xi = (\xi_k, \mathscr{D}_k)$ is a martingale, we simply say that the sequence $\xi = (\xi_k)$ is a martingale.
Here are some examples of martingales.
EXAMPLE 1. Let $\eta_1, \dots, \eta_n$ be independent Bernoulli random variables with

$$P(\eta_k = 1) = P(\eta_k = -1) = \tfrac12,$$

$S_k = \eta_1 + \dots + \eta_k$ and $\mathscr{D}_k = \mathscr{D}_{\eta_1, \dots, \eta_k}$.
We observe that the decompositions $\mathscr{D}_k$ have a simple structure:

$$\mathscr{D}_1 = \{D^+, D^-\},$$

where $D^+ = \{\omega: \eta_1 = +1\}$, $D^- = \{\omega: \eta_1 = -1\}$,

$$\mathscr{D}_2 = \{D^{++}, D^{+-}, D^{-+}, D^{--}\},$$

where $D^{++} = \{\omega: \eta_1 = +1, \eta_2 = +1\}, \dots, D^{--} = \{\omega: \eta_1 = -1, \eta_2 = -1\}$, etc.
It is also easy to see that $\mathscr{D}_{\eta_1, \dots, \eta_k} = \mathscr{D}_{S_1, \dots, S_k}$.

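Let us show that $(S_k, \mathscr{D}_k)$ forms a martingale. In fact, $S_k$ is $\mathscr{D}_k$-measurable, and by (8.12), (8.18) and (8.24),

$$E(S_{k+1} \mid \mathscr{D}_k) = E(S_k + \eta_{k+1} \mid \mathscr{D}_k) = E(S_k \mid \mathscr{D}_k) + E(\eta_{k+1} \mid \mathscr{D}_k) = S_k + E\eta_{k+1} = S_k.$$

If we put $S_0 = 0$ and take $\mathscr{D}_0 = \{\Omega\}$, the trivial decomposition, then the sequence $(S_k, \mathscr{D}_k)_{0 \le k \le n}$ also forms a martingale.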

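EXAMPLE 2. Let $\eta_1, \dots, \eta_n$ be independent Bernoulli random variables with $P(\eta_i = 1) = p$, $P(\eta_i = -1) = q$. If $p \ne q$, each of the sequences $\xi = (\xi_k)$ with

$$\xi_k = \left(\frac{q}{p}\right)^{S_k}, \qquad \xi_k = S_k - k(p - q),$$

where $S_k = \eta_1 + \dots + \eta_k$, is a martingale.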
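EXAMPLE 3. Let $\eta$ be a random variable, $\mathscr{D}_1 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$, and

$$\xi_k = E(\eta \mid \mathscr{D}_k). \qquad (2)$$

Then the sequence $\xi = (\xi_k, \mathscr{D}_k)$ is a martingale. In fact, it is evident that $E(\eta \mid \mathscr{D}_k)$ is $\mathscr{D}_k$-measurable, and by (8.20)

$$E(\xi_{k+1} \mid \mathscr{D}_k) = E[E(\eta \mid \mathscr{D}_{k+1}) \mid \mathscr{D}_k] = E(\eta \mid \mathscr{D}_k) = \xi_k.$$

In this connection we notice that if $\xi = (\xi_k, \mathscr{D}_k)$ is any martingale, then by (8.20)

$$\xi_k = E(\xi_{k+1} \mid \mathscr{D}_k) = E[E(\xi_{k+2} \mid \mathscr{D}_{k+1}) \mid \mathscr{D}_k] = E(\xi_{k+2} \mid \mathscr{D}_k) = \dots = E(\xi_n \mid \mathscr{D}_k). \qquad (3)$$

Consequently the set of martingales $\xi = (\xi_k, \mathscr{D}_k)$ is exhausted by the martingales of the form (2). (We note that for infinite sequences $\xi = (\xi_k, \mathscr{D}_k)_{k \ge 1}$ this is, in general, no longer the case; see Problem 7 in §1 of Chapter VII.)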

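EXAMPLE 4. Let $\eta_1, \dots, \eta_n$ be a sequence of independent identically distributed random variables, $S_k = \eta_1 + \dots + \eta_k$, and $\mathscr{D}_1 = \mathscr{D}_{S_n}$, $\mathscr{D}_2 = \mathscr{D}_{S_n, S_{n-1}}, \dots$, $\mathscr{D}_n = \mathscr{D}_{S_n, S_{n-1}, \dots, S_1}$. Let us show that the sequence $\xi = (\xi_k, \mathscr{D}_k)$ with

$$\xi_1 = \frac{S_n}{n}, \quad \xi_2 = \frac{S_{n-1}}{n-1}, \quad \dots, \quad \xi_k = \frac{S_{n+1-k}}{n+1-k}, \quad \dots, \quad \xi_n = S_1$$

is a martingale. In the first place, it is clear that $\mathscr{D}_k \preccurlyeq \mathscr{D}_{k+1}$ and $\xi_k$ is $\mathscr{D}_k$-measurable. Moreover, we have by symmetry, for $j \le n - k + 1$,

$$E(\eta_j \mid \mathscr{D}_k) = E(\eta_1 \mid \mathscr{D}_k) \qquad (4)$$

(compare (8.26)). Therefore

$$(n - k + 1)\,E(\eta_1 \mid \mathscr{D}_k) = \sum_{j=1}^{n-k+1} E(\eta_j \mid \mathscr{D}_k) = E(S_{n-k+1} \mid \mathscr{D}_k) = S_{n-k+1}$$

and consequently

$$\xi_k = \frac{S_{n-k+1}}{n-k+1} = E(\eta_1 \mid \mathscr{D}_k),$$

and it follows from Example 3 that $\xi = (\xi_k, \mathscr{D}_k)$ is a martingale.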

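EXAMPLE 5. Let $\eta_1, \dots, \eta_n$ be independent Bernoulli random variables with

$$P(\eta_i = +1) = P(\eta_i = -1) = \tfrac12,$$

$S_k = \eta_1 + \dots + \eta_k$. Let $A$ and $B$ be integers, $A < 0 < B$. Then with $0 < \lambda < \pi/2$, the sequence $\xi = (\xi_k, \mathscr{D}_k)$ with $\mathscr{D}_k = \mathscr{D}_{S_1, \dots, S_k}$ and

$$\xi_k = (\cos\lambda)^{-k} \exp\left\{i\lambda\left(S_k - \frac{B + A}{2}\right)\right\} \qquad (5)$$

is a complex martingale (i.e., the real and imaginary parts of $\xi_k$ form martingales).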

3. It follows from the definition of a martingale that the expectation $E\xi_k$ is the same for every $k$:

$$E\xi_k = E\xi_1.$$

It turns out that this property persists if time $k$ is replaced by a random time.
In order to formulate this property we introduce the following definition.

Definition 2. A random variable $\tau = \tau(\omega)$ that takes the values $1, 2, \dots, n$ is called a stopping time (with respect to a decomposition $(\mathscr{D}_k)_{1 \le k \le n}$, $\mathscr{D}_1 \preccurlyeq \mathscr{D}_2 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$) if, for $k = 1, \dots, n$, the random variable $I_{\{\tau = k\}}(\omega)$ is $\mathscr{D}_k$-measurable.
If we consider $\mathscr{D}_k$ as the decomposition induced by observations for $k$ steps (for example, $\mathscr{D}_k = \mathscr{D}_{\eta_1, \dots, \eta_k}$, the decomposition induced by the variables $\eta_1, \dots, \eta_k$), then the $\mathscr{D}_k$-measurability of $I_{\{\tau = k\}}(\omega)$ means that the realization or nonrealization of the event $\{\tau = k\}$ is determined only by observations for $k$ steps (and is independent of the "future").
If $\mathscr{B}_k = \alpha(\mathscr{D}_k)$, then the $\mathscr{D}_k$-measurability of $I_{\{\tau = k\}}(\omega)$ is equivalent to the assumption that

$$\{\tau = k\} \in \mathscr{B}_k. \qquad (6)$$
We have already introduced specific examples of stopping times: the times $\tau_k$, $\sigma_{2n}$ introduced in §§9 and 10. Those times are special cases of stopping times of the form

$$\tau^A = \min\{0 < k \le n: \xi_k \in A\}, \qquad \sigma^A = \min\{0 \le k \le n: \xi_k \in A\}, \qquad (7)$$

which are the times (respectively the first time after zero and the first time) for a sequence $\xi_0, \xi_1, \dots, \xi_n$ to attain a point of the set $A$.

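4. Theorem 1. Let $\xi = (\xi_k, \mathscr{D}_k)_{1 \le k \le n}$ be a martingale and $\tau$ a stopping time with respect to the decomposition $(\mathscr{D}_k)_{1 \le k \le n}$. Then

$$E(\xi_\tau \mid \mathscr{D}_1) = \xi_1, \qquad (8)$$

where

$$\xi_\tau = \sum_{k=1}^{n} \xi_k\,I_{\{\tau = k\}}(\omega) \qquad (9)$$

and

$$E\xi_\tau = E\xi_1. \qquad (10)$$

PROOF (compare the proof of (9.29)). Let $D \in \mathscr{D}_1$. Using (3) and the properties of conditional expectations, and noting that $I_{\{\tau = l\}} I_D$ is $\mathscr{D}_l$-measurable, we find that

$$\begin{aligned}
E(\xi_\tau \mid D) &= \frac{E(\xi_\tau I_D)}{P(D)} = \frac{1}{P(D)} \sum_{l=1}^{n} E(\xi_l\,I_{\{\tau = l\}}\,I_D) \\
&= \frac{1}{P(D)} \sum_{l=1}^{n} E[E(\xi_n \mid \mathscr{D}_l)\,I_{\{\tau = l\}}\,I_D] = \frac{1}{P(D)} \sum_{l=1}^{n} E[\xi_n\,I_{\{\tau = l\}}\,I_D] \\
&= \frac{1}{P(D)}\,E(\xi_n I_D) = E(\xi_n \mid D),
\end{aligned}$$

and consequently

$$E(\xi_\tau \mid \mathscr{D}_1) = E(\xi_n \mid \mathscr{D}_1) = \xi_1.$$

The equation $E\xi_\tau = E\xi_1$ then follows in an obvious way.
This completes the proof of the theorem.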

Corollary. For the martingale $(S_k, \mathscr{D}_k)_{1 \le k \le n}$ of Example 1, and any stopping time $\tau$ (with respect to $(\mathscr{D}_k)$) we have the formulas

$$ES_\tau = 0, \qquad ES_\tau^2 = E\tau, \qquad (11)$$

known as Wald's identities (cf. (9.29) and (9.30); see also Problem 1 and Theorem 3 in §2 of Chapter VII).

5. Let us use Theorem 1 to establish the following proposition.

Theorem 2 (Ballot Theorem). Let $\eta_1, \dots, \eta_n$ be a sequence of independent identically distributed random variables whose values are nonnegative integers, $S_k = \eta_1 + \dots + \eta_k$, $1 \le k \le n$. Then

$$P\{S_k < k \text{ for all } k,\ 1 \le k \le n \mid S_n\} = \left(1 - \frac{S_n}{n}\right)^{+}, \qquad (12)$$

where $a^+ = \max(a, 0)$.


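PROOF. On the set $\{\omega: S_n \ge n\}$ the formula is evident. We therefore prove (12) for the sample points at which $S_n < n$.
Let us consider the martingale $\xi = (\xi_k, \mathscr{D}_k)_{1 \le k \le n}$ introduced in Example 4, with $\xi_k = S_{n+1-k}/(n + 1 - k)$ and $\mathscr{D}_k = \mathscr{D}_{S_{n+1-k}, \dots, S_n}$.
We define

$$\tau = \min\{1 \le k \le n: \xi_k \ge 1\},$$

taking $\tau = n$ on the set $\{\xi_k < 1 \text{ for all } k \text{ such that } 1 \le k \le n\} = \{\max_{1 \le l \le n}(S_l/l) < 1\}$. It is clear that $\xi_\tau = \xi_n = S_1 = 0$ on this set, and therefore

$$\left\{\max_{1 \le l \le n} \frac{S_l}{l} < 1\right\} = \left\{\max_{1 \le l \le n} \frac{S_l}{l} < 1,\ S_n < n\right\} \subseteq \{\xi_\tau = 0\}. \qquad (13)$$

Now let us consider those outcomes for which simultaneously $\max_{1 \le l \le n}(S_l/l) \ge 1$ and $S_n < n$. Write $\sigma = n + 1 - \tau$. It is easy to see that

$$\sigma = \max\{1 \le k \le n: S_k \ge k\}$$

and therefore (since $S_n < n$) we have $\sigma < n$, $S_\sigma \ge \sigma$, and $S_{\sigma+1} < \sigma + 1$. Consequently $\eta_{\sigma+1} = S_{\sigma+1} - S_\sigma < (\sigma + 1) - \sigma = 1$, i.e. $\eta_{\sigma+1} = 0$. Therefore $\sigma \le S_\sigma = S_{\sigma+1} < \sigma + 1$, and consequently $S_\sigma = \sigma$ and

$$\xi_\tau = \frac{S_{n+1-\tau}}{n+1-\tau} = \frac{S_\sigma}{\sigma} = 1.$$

Therefore

$$\left\{\max_{1 \le l \le n} \frac{S_l}{l} \ge 1,\ S_n < n\right\} \subseteq \{\xi_\tau = 1\}. \qquad (14)$$

From (13) and (14) we find that

$$\left\{\max_{1 \le l \le n} \frac{S_l}{l} \ge 1,\ S_n < n\right\} = \{\xi_\tau = 1\} \cap \{S_n < n\}.$$

Therefore, on the set $\{S_n < n\}$, we have

$$P\left\{\max_{1 \le l \le n} \frac{S_l}{l} \ge 1 \,\Big|\, S_n\right\} = P\{\xi_\tau = 1 \mid S_n\} = E(\xi_\tau \mid S_n),$$

where the last equation follows because $\xi_\tau$ takes only the two values 0 and 1.
Let us notice now that $E(\xi_\tau \mid S_n) = E(\xi_\tau \mid \mathscr{D}_1)$, and (by Theorem 1) $E(\xi_\tau \mid \mathscr{D}_1) = \xi_1 = S_n/n$. Consequently, on the set $\{S_n < n\}$ we have $P\{S_k < k \text{ for all } k \text{ such that } 1 \le k \le n \mid S_n\} = 1 - (S_n/n)$.
This completes the proof of the theorem.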

We now apply this theorem to obtain a different proof of Lemma 1 of §10, and explain why it is called the ballot theorem.
Let $\xi_1, \dots, \xi_n$ be independent Bernoulli random variables with

$$P(\xi_i = 1) = P(\xi_i = -1) = \tfrac12,$$

$S_k = \xi_1 + \dots + \xi_k$ and $a$, $b$ nonnegative integers such that $a - b > 0$, $a + b = n$. We are going to show that

$$P\{S_1 > 0, \dots, S_n > 0 \mid S_n = a - b\} = \frac{a - b}{a + b}. \qquad (15)$$

In fact, by symmetry,

$$\begin{aligned}
P\{S_1 > 0, \dots, S_n > 0 \mid S_n = a - b\} &= P\{S_1 < 0, \dots, S_n < 0 \mid S_n = -(a - b)\} \\
&= P\{S_1 + 1 < 1, \dots, S_n + n < n \mid S_n + n = n - (a - b)\} \\
&= P\{\eta_1 < 1, \dots, \eta_1 + \dots + \eta_n < n \mid \eta_1 + \dots + \eta_n = n - (a - b)\} \\
&= \left[1 - \frac{n - (a - b)}{n}\right]^{+} = \frac{a - b}{n} = \frac{a - b}{a + b},
\end{aligned}$$

where we have put $\eta_k = \xi_k + 1$ and applied (12).
Now formula (10.5) follows from (15) in an evident way; the formula was also established in Lemma 1 of §10 by using the reflection principle.

Let us interpret $\xi_i = +1$ as a vote for candidate $A$ and $\xi_i = -1$ as a vote for $B$. Then $S_k$ is the difference between the numbers of votes cast for $A$ and $B$ at the time when $k$ votes have been recorded, and

$$P\{S_1 > 0, \dots, S_n > 0 \mid S_n = a - b\}$$

is the probability that $A$ was always ahead of $B$, with the understanding that $A$ received $a$ votes in all, $B$ received $b$ votes, and $a - b > 0$, $a + b = n$. According to (15) this probability is $(a - b)/n$.
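The ballot interpretation lends itself to a direct check: conditionally on $S_n = a - b$, all orders of the $a$ votes for $A$ and $b$ votes for $B$ are equally likely, so the probability in (15) is a ratio of two counts. The sketch below enumerates the orders; the function name `always_ahead_probability` is ours.

```python
from fractions import Fraction
from itertools import product

def always_ahead_probability(a, b):
    """P{S_1 > 0, ..., S_n > 0 | S_n = a - b} by enumerating vote orders, n = a + b."""
    n = a + b
    ahead = total = 0
    for steps in product((1, -1), repeat=n):
        if sum(steps) != a - b:
            continue                      # keep only outcomes with S_n = a - b
        total += 1
        s, ok = 0, True
        for x in steps:
            s += x
            if s <= 0:                    # A is not strictly ahead at this moment
                ok = False
                break
        ahead += ok
    return Fraction(ahead, total)

for a, b in ((3, 2), (4, 1), (5, 3)):
    print(a, b, always_ahead_probability(a, b), Fraction(a - b, a + b))
```

In each printed row the enumerated probability should equal $(a - b)/(a + b)$.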

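6. PROBLEMS

1. Let $\mathscr{D}_0 \preccurlyeq \mathscr{D}_1 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$ be a sequence of decompositions with $\mathscr{D}_0 = \{\Omega\}$, and let $\eta_k$ be $\mathscr{D}_k$-measurable variables, $1 \le k \le n$. Show that the sequence $\xi = (\xi_k, \mathscr{D}_k)$ with

$$\xi_k = \sum_{l=1}^{k} [\eta_l - E(\eta_l \mid \mathscr{D}_{l-1})]$$

is a martingale.
2. Let the random variables $\eta_1, \dots, \eta_k$ satisfy $E(\eta_k \mid \eta_1, \dots, \eta_{k-1}) = 0$. Show that the sequence $\xi = (\xi_k)_{1 \le k \le n}$ with $\xi_1 = \eta_1$ and

$$\xi_{k+1} = \sum_{i=1}^{k} \eta_{i+1}\,f_i(\eta_1, \dots, \eta_i),$$

where $f_i$ are given functions, is a martingale.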


3. Show that every martingale $\xi = (\xi_k, \mathscr{D}_k)$ has uncorrelated increments: if $a < b < c < d$ then

$$\operatorname{cov}(\xi_d - \xi_c,\ \xi_b - \xi_a) = 0.$$
4. Let $\xi = (\xi_1, \dots, \xi_n)$ be a random sequence such that $\xi_k$ is $\mathscr{D}_k$-measurable ($\mathscr{D}_1 \preccurlyeq \mathscr{D}_2 \preccurlyeq \dots \preccurlyeq \mathscr{D}_n$). Show that a necessary and sufficient condition for this sequence to be a martingale (with respect to the system $(\mathscr{D}_k)$) is that $E\xi_\tau = E\xi_1$ for every stopping time $\tau$ (with respect to $(\mathscr{D}_k)$). (The phrase "for every stopping time" can be replaced by "for every stopping time that assumes two values.")
5. Show that if $\xi = (\xi_k, \mathscr{D}_k)_{1 \le k \le n}$ is a martingale and $\tau$ is a stopping time, then

$$E[\xi_n\,I_{\{\tau = k\}}] = E[\xi_k\,I_{\{\tau = k\}}]$$

for every $k$.
6. Let $\xi = (\xi_k, \mathscr{D}_k)$ and $\eta = (\eta_k, \mathscr{D}_k)$ be two martingales, $\xi_1 = \eta_1 = 0$. Show that

$$E\xi_n\eta_n = \sum_{k=2}^{n} E(\xi_k - \xi_{k-1})(\eta_k - \eta_{k-1})$$

and in particular that

$$E\xi_n^2 = \sum_{k=2}^{n} E(\xi_k - \xi_{k-1})^2.$$

7. Let $\eta_1, \dots, \eta_n$ be a sequence of independent identically distributed random variables with $E\eta_i = 0$. Show that the sequence $\xi = (\xi_k)$ with

$$\xi_k = \left(\sum_{i=1}^{k} \eta_i\right)^{2} - k\,E\eta_1^2, \qquad \xi_k = \frac{\exp \lambda(\eta_1 + \dots + \eta_k)}{(E \exp \lambda\eta_1)^k}$$

is a martingale.
8. Let $\eta_1, \dots, \eta_n$ be a sequence of independent identically distributed random variables taking values in a finite set $Y$. Let $f_0(y) = P(\eta_1 = y)$, $y \in Y$, and let $f_1(y)$ be a nonnegative function with $\sum_{y \in Y} f_1(y) = 1$. Show that the sequence $\xi = (\xi_k, \mathscr{D}_k^{\eta})$ with $\mathscr{D}_k^{\eta} = \mathscr{D}_{\eta_1, \dots, \eta_k}$,

$$\xi_k = \frac{f_1(\eta_1) \cdots f_1(\eta_k)}{f_0(\eta_1) \cdots f_0(\eta_k)},$$

is a martingale. (The variables $\xi_k$, known as likelihood ratios, are extremely important in mathematical statistics.)

12. Markov Chains. Ergodic Theorem. Strong Markov Property

1. We have discussed the Bernoulli scheme with

$$\Omega = \{\omega: \omega = (x_1, \dots, x_n),\ x_i = 0, 1\},$$

where the probability $p(\omega)$ of each outcome is given by

$$p(\omega) = p(x_1) \cdots p(x_n), \qquad (1)$$

with $p(x) = p^x q^{1-x}$. With these hypotheses, the variables $\xi_1, \dots, \xi_n$ with $\xi_i(\omega) = x_i$ are independent and identically distributed with

$$P(\xi_1 = x) = \dots = P(\xi_n = x) = p(x), \qquad x = 0, 1.$$

If we replace (1) by

$$p(\omega) = p_1(x_1) \cdots p_n(x_n),$$

where $p_i(x) = p_i^x (1 - p_i)^{1-x}$, $0 \le p_i \le 1$, the random variables $\xi_1, \dots, \xi_n$ are still independent, but in general are differently distributed:

$$P(\xi_1 = x) = p_1(x), \quad \dots, \quad P(\xi_n = x) = p_n(x).$$

We now consider a generalization that leads to dependent random variables that form what is known as a Markov chain.
Let us suppose that

$$\Omega = \{\omega: \omega = (x_0, x_1, \dots, x_n),\ x_i \in X\},$$

where $X$ is a finite set. Let there be given nonnegative functions $p_0(x)$, $p_1(x, y), \dots, p_n(x, y)$ such that

$$\sum_{x \in X} p_0(x) = 1, \qquad \sum_{y \in X} p_k(x, y) = 1, \quad k = 1, \dots, n;\ x \in X. \qquad (2)$$

For each $\omega = (x_0, x_1, \dots, x_n)$, put

$$p(\omega) = p_0(x_0)\,p_1(x_0, x_1) \cdots p_n(x_{n-1}, x_n). \qquad (3)$$

It is easily verified that $\sum_{\omega \in \Omega} p(\omega) = 1$, and consequently the set of numbers $p(\omega)$ together with the space $\Omega$ and the collection of its subsets defines a probabilistic model, which it is usual to call a model of experiments that form a Markov chain.
Let us introduce the random variables $\xi_0, \xi_1, \dots, \xi_n$ with $\xi_i(\omega) = x_i$. A simple calculation shows that

$$P(\xi_0 = a) = p_0(a),$$

$$P(\xi_0 = a_0, \dots, \xi_k = a_k) = p_0(a_0)\,p_1(a_0, a_1) \cdots p_k(a_{k-1}, a_k). \qquad (4)$$

We now establish the validity of the following fundamental property of conditional probabilities:

$$P\{\xi_{k+1} = a_{k+1} \mid \xi_k = a_k, \dots, \xi_0 = a_0\} = P\{\xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} \qquad (5)$$

(under the assumption that $P(\xi_k = a_k, \dots, \xi_0 = a_0) > 0$).

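By (4),

$$\begin{aligned}
P\{\xi_{k+1} = a_{k+1} \mid \xi_k = a_k, \dots, \xi_0 = a_0\} &= \frac{P\{\xi_{k+1} = a_{k+1}, \dots, \xi_0 = a_0\}}{P\{\xi_k = a_k, \dots, \xi_0 = a_0\}} \\
&= \frac{p_0(a_0)\,p_1(a_0, a_1) \cdots p_{k+1}(a_k, a_{k+1})}{p_0(a_0) \cdots p_k(a_{k-1}, a_k)} = p_{k+1}(a_k, a_{k+1}).
\end{aligned}$$

In a similar way we verify

$$P\{\xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} = p_{k+1}(a_k, a_{k+1}), \qquad (6)$$

which establishes (5).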

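Let $\mathscr{D}_k^{\xi} = \mathscr{D}_{\xi_0, \dots, \xi_k}$ be the decomposition induced by $\xi_0, \dots, \xi_k$, and $\mathscr{B}_k^{\xi} = \alpha(\mathscr{D}_k^{\xi})$.
Then, in the notation introduced in §8, it follows from (5) that

$$P\{\xi_{k+1} = a_{k+1} \mid \mathscr{B}_k^{\xi}\} = P\{\xi_{k+1} = a_{k+1} \mid \xi_k\} \qquad (7)$$

or

$$P\{\xi_{k+1} = a_{k+1} \mid \xi_0, \dots, \xi_k\} = P\{\xi_{k+1} = a_{k+1} \mid \xi_k\}.$$

If we use the evident equation

$$P(AB \mid C) = P(A \mid BC)\,P(B \mid C),$$

we find from (7) that

$$P\{\xi_n = a_n, \dots, \xi_{k+1} = a_{k+1} \mid \mathscr{B}_k^{\xi}\} = P\{\xi_n = a_n, \dots, \xi_{k+1} = a_{k+1} \mid \xi_k\} \qquad (8)$$

or

$$P\{\xi_n = a_n, \dots, \xi_{k+1} = a_{k+1} \mid \xi_0, \dots, \xi_k\} = P\{\xi_n = a_n, \dots, \xi_{k+1} = a_{k+1} \mid \xi_k\}. \qquad (9)$$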

This equation admits the following intuitive interpretation. Let us think of $\xi_k$ as the position of a particle "at present," $(\xi_0, \dots, \xi_{k-1})$ as the "past," and $(\xi_{k+1}, \dots, \xi_n)$ as the "future." Then (9) says that if the past and the present are given, the future depends only on the present and is independent of how the particle arrived at $\xi_k$, i.e. is independent of the past $(\xi_0, \dots, \xi_{k-1})$.
Let $F = \{\xi_n = a_n, \dots, \xi_{k+1} = a_{k+1}\}$, $N = \{\xi_k = a_k\}$,

$$B = \{\xi_{k-1} = a_{k-1}, \dots, \xi_0 = a_0\}.$$

Then it follows from (9) that

$$P(F \mid NB) = P(F \mid N),$$

from which we easily find that

$$P(FB \mid N) = P(F \mid N)\,P(B \mid N). \qquad (10)$$

In other words, it follows from (7) that for a given present $N$, the future $F$ and the past $B$ are independent. It is easily shown that the converse also holds: if (10) holds for all $k = 0, 1, \dots, n - 1$, then (7) holds for every $k = 0, 1, \dots, n - 1$.

The property of the independence of future and past, or, what is the same thing, the lack of dependence of the future on the past when the present is given, is called the Markov property, and the corresponding sequence of random variables $\xi_0, \dots, \xi_n$ is a Markov chain.
Consequently if the probabilities $p(\omega)$ of the sample points are given by (3), the sequence $(\xi_0, \dots, \xi_n)$ with $\xi_i(\omega) = x_i$ forms a Markov chain.
We give the following formal definition.

Definition. Let $(\Omega, \mathscr{A}, P)$ be a (finite) probability space and let $\xi = (\xi_0, \dots, \xi_n)$ be a sequence of random variables with values in a (finite) set $X$. If (7) is satisfied, the sequence $\xi = (\xi_0, \dots, \xi_n)$ is called a (finite) Markov chain.
The set $X$ is called the phase space or state space of the chain. The set of probabilities $(p_0(x))$, $x \in X$, with $p_0(x) = P(\xi_0 = x)$ is the initial distribution, and the matrix $\|p_k(x, y)\|$, $x, y \in X$, with $p_k(x, y) = P\{\xi_k = y \mid \xi_{k-1} = x\}$ is the matrix of transition probabilities (from state $x$ to state $y$) at time $k = 1, \dots, n$.

When the transition probabilities $p_k(x, y)$ are independent of $k$, that is, $p_k(x, y) = p(x, y)$, the sequence $\xi = (\xi_0, \dots, \xi_n)$ is called a homogeneous Markov chain with transition matrix $\|p(x, y)\|$.

Notice that the matrix $\|p(x, y)\|$ is stochastic: its elements are nonnegative and the sum of the elements in each row is 1: $\sum_y p(x, y) = 1$, $x \in X$.
We shall suppose that the phase space $X$ is a finite set of integers ($X = \{0, 1, \dots, N\}$, $X = \{0, \pm 1, \dots, \pm N\}$, etc.), and use the traditional notation $p_i = p_0(i)$ and $p_{ij} = p(i, j)$.
It is clear that the properties of homogeneous Markov chains are completely determined by the initial distributions $p_i$ and the transition probabilities $p_{ij}$. In specific cases we describe the evolution of the chain, not by writing out the matrix $\|p_{ij}\|$ explicitly, but by a (directed) graph whose vertices are the states in $X$, and an arrow from state $i$ to state $j$ with the number $p_{ij}$ over it indicates that it is possible to pass from point $i$ to point $j$ with probability $p_{ij}$. When $p_{ij} = 0$, the corresponding arrow is omitted.
EXAMPLE 1. Let $X = \{0, 1, 2\}$ and

$$\|p_{ij}\| = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac12 & 0 & \tfrac12 \\ \tfrac23 & 0 & \tfrac13 \end{pmatrix}.$$

The following graph corresponds to this matrix:

[Transition graph on the states 0, 1, 2]

Here state 0 is said to be absorbing: if the particle gets into this state it remains there, since $p_{00} = 1$. From state 1 the particle goes to the adjacent states 0 or 2 with equal probabilities; state 2 has the property that the particle remains there with probability $\tfrac13$ and goes to state 0 with probability $\tfrac23$.
EXAMPLE 2. Let $X = \{0, \pm 1, \dots, \pm N\}$, $p_0 = 1$, $p_{NN} = p_{(-N)(-N)} = 1$, and, for $|i| < N$,

$$p_{ij} = \begin{cases} p, & j = i + 1, \\ q, & j = i - 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$

The transitions corresponding to this chain can be presented graphically in the following way ($N = 3$):

[Transition graph on the states $-3, \dots, 3$]

This chain corresponds to the two-player game discussed earlier, when each player has a bankroll $N$ and at each turn the first player wins $+1$ from the second with probability $p$, and loses (wins $-1$) with probability $q$. If we think of state $i$ as the amount won by the first player from the second, then reaching state $N$ or $-N$ means the ruin of the second or first player, respectively.
In fact, if $\eta_1, \eta_2, \dots, \eta_n$ are independent Bernoulli random variables with $P(\eta_i = +1) = p$, $P(\eta_i = -1) = q$, $S_0 = 0$ and $S_k = \eta_1 + \dots + \eta_k$ the amounts won by the first player from the second, then the sequence $S_0, S_1, \dots, S_n$ is a Markov chain with $p_0 = 1$ and transition matrix (11), since

$$\begin{aligned}
P\{S_{k+1} = j \mid S_k = i_k, S_{k-1} = i_{k-1}, \dots\} &= P\{S_k + \eta_{k+1} = j \mid S_k = i_k, S_{k-1} = i_{k-1}, \dots\} \\
&= P\{S_k + \eta_{k+1} = j \mid S_k = i_k\} = P\{\eta_{k+1} = j - i_k\}.
\end{aligned}$$

This Markov chain has a very simple structure:

$$S_{k+1} = S_k + \eta_{k+1}, \qquad 0 \le k \le n - 1,$$

where $\eta_1, \eta_2, \dots, \eta_n$ is a sequence of independent random variables.
The same considerations show that if $\xi_0, \eta_1, \dots, \eta_n$ are independent random variables then the sequence $\xi_0, \xi_1, \dots, \xi_n$ with

$$\xi_{k+1} = f_k(\xi_k, \eta_{k+1}), \qquad 0 \le k \le n - 1, \qquad (12)$$

is also a Markov chain.
It is worth noting in this connection that a Markov chain constructed in this way can be considered as a natural probabilistic analog of a (deterministic) sequence $x = (x_0, \dots, x_n)$ generated by the recurrent equations

$$x_{k+1} = f_k(x_k).$$

We now give another example of a Markov chain of the form (12); this example arises in queueing theory.
EXAMPLE 3. At a taxi stand let taxis arrive at unit intervals of time (one at a time). If no one is waiting at the stand, the taxi leaves immediately. Let $\eta_k$ be the number of passengers who arrive at the stand at time $k$, and suppose that $\eta_1, \dots, \eta_n$ are independent random variables. Let $\xi_k$ be the length of the waiting line at time $k$, $\xi_0 = 0$. Then if $\xi_k = i$, at the next time $k + 1$ the length $\xi_{k+1}$ of the waiting line is equal to

$$j = \begin{cases} \eta_{k+1} & \text{if } i = 0, \\ i - 1 + \eta_{k+1} & \text{if } i \ge 1. \end{cases}$$

In other words,

$$\xi_{k+1} = (\xi_k - 1)^+ + \eta_{k+1}, \qquad 0 \le k \le n - 1,$$

where $a^+ = \max(a, 0)$, and therefore the sequence $\xi = (\xi_0, \dots, \xi_n)$ is a Markov chain.
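The recursion for the waiting line is easy to trace by hand or by machine. The sketch below applies $\xi_{k+1} = (\xi_k - 1)^+ + \eta_{k+1}$ to a given list of arrival counts; the function name `queue_lengths` and the sample arrivals are ours, chosen only for illustration.

```python
def queue_lengths(arrivals, start=0):
    """Return [xi_0, xi_1, ..., xi_n] for xi_{k+1} = (xi_k - 1)^+ + eta_{k+1}."""
    lengths = [start]
    for eta in arrivals:
        # one waiting passenger (if any) leaves with the taxi, then eta new ones arrive
        lengths.append(max(lengths[-1] - 1, 0) + eta)
    return lengths

print(queue_lengths([2, 0, 0, 3, 1]))
```

Each new value depends on the past only through the current length of the line and the fresh arrivals, which is the structure (12).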

EXAMPLE 4. This example comes from the theory of branching processes. A branching process with discrete times is a sequence of random variables $\xi_0, \xi_1, \dots, \xi_n$, where $\xi_k$ is interpreted as the number of particles in existence at time $k$, and the process of creation and annihilation of particles is as follows: each particle, independently of the other particles and of the "prehistory" of the process, is transformed into $j$ particles with probability $p_j$, $j = 0, 1, \dots, M$.
We suppose that at the initial time there is just one particle, $\xi_0 = 1$. If at time $k$ there are $\xi_k$ particles (numbered $1, 2, \dots, \xi_k$), then by assumption $\xi_{k+1}$ is given as a random sum of random variables,

$$\xi_{k+1} = \eta_1^{(k)} + \dots + \eta_{\xi_k}^{(k)},$$

where $\eta_i^{(k)}$ is the number of particles produced by particle number $i$. It is clear that if $\xi_k = 0$ then $\xi_{k+1} = 0$. If we suppose that all the random variables $\eta_j^{(k)}$, $k \ge 0$, are independent of each other, we obtain

$$\begin{aligned}
P\{\xi_{k+1} = i_{k+1} \mid \xi_k = i_k, \xi_{k-1} = i_{k-1}, \dots\} &= P\{\xi_{k+1} = i_{k+1} \mid \xi_k = i_k\} \\
&= P\{\eta_1^{(k)} + \dots + \eta_{i_k}^{(k)} = i_{k+1}\}.
\end{aligned}$$

It is evident from this that the sequence $\xi_0, \xi_1, \dots, \xi_n$ is a Markov chain.
A particularly interesting case is that in which each particle either vanishes with probability $q$ or divides in two with probability $p$, $p + q = 1$. In this case it is easy to calculate that

$$p_{ij} = P\{\xi_{k+1} = j \mid \xi_k = i\}$$

is given by the formula

$$p_{ij} = \begin{cases} C_i^{j/2}\,p^{j/2}\,q^{\,i - j/2}, & j = 0, 2, \dots, 2i, \\ 0 & \text{in all other cases.} \end{cases}$$
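For the case in which each particle either vanishes or divides in two, the transition probabilities are binomial: $j$ particles at the next step means that exactly $j/2$ of the $i$ particles divided. A short sketch under that reading follows; the function name `transition_probability` is ours, and exact fractions are used so that the row sums can be checked without rounding.

```python
from fractions import Fraction
from math import comb

def transition_probability(i, j, p):
    """p_ij when each of i particles divides in two (prob. p) or vanishes (prob. 1 - p)."""
    if j % 2 or not 0 <= j <= 2 * i:
        return Fraction(0)               # an odd or out-of-range number of offspring
    splits = j // 2                      # number of particles that divide
    return comb(i, splits) * p ** splits * (1 - p) ** (i - splits)

p = Fraction(1, 3)
row = [transition_probability(3, j, p) for j in range(0, 7)]
print(row, sum(row))
```

The printed row is the distribution of $\xi_{k+1}$ given $\xi_k = 3$; its entries should add up to 1.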

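2. Let $\xi = (\xi_k, \Pi, \mathbb{P})$ be a homogeneous Markov chain with starting vectors (rows) $\Pi = (p_i)$ and transition matrix $\mathbb{P} = \|p_{ij}\|$. It is clear that

$$p_{ij} = P\{\xi_1 = j \mid \xi_0 = i\} = \dots = P\{\xi_n = j \mid \xi_{n-1} = i\}.$$

We shall use the notation

$$p_{ij}^{(k)} = P\{\xi_k = j \mid \xi_0 = i\}$$

for the probability of a transition from state $i$ to state $j$ in $k$ steps, and

$$p_j^{(k)} = P\{\xi_k = j\}$$

for the probability of finding the particle at point $j$ at time $k$. Also let

$$\Pi^{(k)} = \|p_i^{(k)}\|, \qquad \mathbb{P}^{(k)} = \|p_{ij}^{(k)}\|.$$

Let us show that the transition probabilities $p_{ij}^{(k)}$ satisfy the Kolmogorov-Chapman equation

$$p_{ij}^{(k+l)} = \sum_{\alpha} p_{i\alpha}^{(k)}\,p_{\alpha j}^{(l)}, \qquad (13)$$

or, in matrix form,

$$\mathbb{P}^{(k+l)} = \mathbb{P}^{(k)} \cdot \mathbb{P}^{(l)}. \qquad (14)$$

The proof is extremely simple: using the formula for total probability and the Markov property, we obtain

$$\begin{aligned}
p_{ij}^{(k+l)} &= P(\xi_{k+l} = j \mid \xi_0 = i) = \sum_{\alpha} P(\xi_{k+l} = j,\ \xi_k = \alpha \mid \xi_0 = i) \\
&= \sum_{\alpha} P(\xi_{k+l} = j \mid \xi_k = \alpha)\,P(\xi_k = \alpha \mid \xi_0 = i) = \sum_{\alpha} p_{\alpha j}^{(l)}\,p_{i\alpha}^{(k)}.
\end{aligned}$$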


~

The following two cases of (13) are particularly important: the backward equation

$$p_{ij}^{(l+1)} = \sum_{\alpha} p_{i\alpha}\,p_{\alpha j}^{(l)} \qquad (15)$$

and the forward equation

$$p_{ij}^{(k+1)} = \sum_{\alpha} p_{i\alpha}^{(k)}\,p_{\alpha j} \qquad (16)$$

(see Figures 22 and 23). The forward and backward equations can be written in the following matrix forms

$$\mathbb{P}^{(k+1)} = \mathbb{P}^{(k)} \cdot \mathbb{P}, \qquad (17)$$

$$\mathbb{P}^{(k+1)} = \mathbb{P} \cdot \mathbb{P}^{(k)}. \qquad (18)$$

Figure 22. For the backward equation.

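Similarly, we find for the (unconditional) probabilities $p_j^{(k)}$ that

$$p_j^{(k+l)} = \sum_{\alpha} p_{\alpha}^{(k)}\,p_{\alpha j}^{(l)}, \qquad (19)$$

or in matrix form

$$\Pi^{(k+l)} = \Pi^{(k)} \cdot \mathbb{P}^{(l)}.$$

In particular,

$$\Pi^{(k+1)} = \Pi^{(k)} \cdot \mathbb{P}$$

(forward equation) and

$$\Pi^{(k+1)} = \Pi^{(1)} \cdot \mathbb{P}^{(k)}$$

(backward equation). Since $\mathbb{P}^{(1)} = \mathbb{P}$, $\Pi^{(0)} = \Pi$, it follows from these equations that

$$\mathbb{P}^{(k)} = \mathbb{P}^k, \qquad \Pi^{(k)} = \Pi \cdot \mathbb{P}^k.$$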

Consequently for homogeneous Markov chains the $k$-step transition probabilities $p_{ij}^{(k)}$ are the elements of the $k$th powers of the matrix $\mathbb{P}$, so that many properties of such chains can be investigated by the methods of matrix analysis.
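Since the $k$-step transition probabilities are the entries of the $k$th power of the transition matrix, the Kolmogorov-Chapman equation can be checked mechanically. The following sketch uses exact fractions and a made-up three-state stochastic matrix; the helper names `matmul`, `matpow` and `distribution_at` are ours.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)] for i in range(n)]

def matpow(p, k):
    """k-th power of p (k >= 0), i.e. the matrix of k-step transition probabilities."""
    n = len(p)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, p)
    return result

# Illustrative stochastic matrix on three states (values chosen only for the demo).
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 3), F(2, 3)]]
initial = [F(1), F(0), F(0)]             # the chain starts in state 0

def distribution_at(k):
    """Row vector of P{xi_k = j}: the initial row times the k-th power of P."""
    pk = matpow(P, k)
    return [sum(initial[i] * pk[i][j] for i in range(len(P))) for j in range(len(P))]

print(matpow(P, 5) == matmul(matpow(P, 2), matpow(P, 3)))
print(distribution_at(4))
```

The first line printed compares the 5-step matrix with the product of the 2-step and 3-step matrices, as in (14); the second is the distribution of the chain at time 4.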

Figure 23. For the forward equation.

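EXAMPLE 5. Consider a homogeneous Markov chain with the two states 0 and 1 and the matrix

$$\mathbb{P} = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}.$$

It is easy to calculate that

$$\mathbb{P}^2 = \begin{pmatrix} p_{00}^2 + p_{01}p_{10} & p_{01}(p_{00} + p_{11}) \\ p_{10}(p_{00} + p_{11}) & p_{11}^2 + p_{01}p_{10} \end{pmatrix}$$

and (by induction)

$$\mathbb{P}^n = \frac{1}{2 - p_{00} - p_{11}} \begin{pmatrix} 1 - p_{11} & 1 - p_{00} \\ 1 - p_{11} & 1 - p_{00} \end{pmatrix} + \frac{(p_{00} + p_{11} - 1)^n}{2 - p_{00} - p_{11}} \begin{pmatrix} 1 - p_{00} & -(1 - p_{00}) \\ -(1 - p_{11}) & 1 - p_{11} \end{pmatrix}$$

(under the hypothesis that $|p_{00} + p_{11} - 1| < 1$).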


Hence it is clear that if the elements of $\mathbb{P}$ satisfy $|p_{00} + p_{11} - 1| < 1$ (in particular, if all the transition probabilities $p_{ij}$ are positive), then as $n \to \infty$

$$\mathbb{P}^n \to \frac{1}{2 - p_{00} - p_{11}} \begin{pmatrix} 1 - p_{11} & 1 - p_{00} \\ 1 - p_{11} & 1 - p_{00} \end{pmatrix}, \qquad (20)$$

and therefore

$$\lim_n p_{i0}^{(n)} = \frac{1 - p_{11}}{2 - p_{00} - p_{11}}, \qquad \lim_n p_{i1}^{(n)} = \frac{1 - p_{00}}{2 - p_{00} - p_{11}}.$$
Consequently if $|p_{00} + p_{11} - 1| < 1$, such a Markov chain exhibits regular behavior of the following kind: the influence of the initial state on the probability of finding the particle in one state or another eventually becomes negligible ($p_{ij}^{(n)}$ approach limits $\pi_j$, independent of $i$ and forming a probability distribution: $\pi_0 \ge 0$, $\pi_1 \ge 0$, $\pi_0 + \pi_1 = 1$); if also all $p_{ij} > 0$ then $\pi_0 > 0$ and $\pi_1 > 0$.
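The closed form for the $n$th power of a two-state transition matrix, and its limit, can be compared with repeated multiplication. The sketch below assumes $p_{01} = 1 - p_{00}$ and $p_{10} = 1 - p_{11}$ (the rows sum to 1) and $p_{00} + p_{11} \ne 2$; the function names and the sample values are ours.

```python
from fractions import Fraction

def two_state_power(p00, p11, n):
    """Closed form for the n-th power of [[p00, 1 - p00], [1 - p11, p11]]."""
    a, b = 1 - p00, 1 - p11              # a = p01, b = p10
    d = 2 - p00 - p11                    # equals a + b, assumed nonzero
    lam = (p00 + p11 - 1) ** n
    return [[(b + lam * a) / d, (a - lam * a) / d],
            [(b - lam * b) / d, (a + lam * b) / d]]

def two_state_limit(p00, p11):
    """Limits of p_i0^(n) and p_i1^(n), valid when |p00 + p11 - 1| < 1."""
    d = 2 - p00 - p11
    return [(1 - p11) / d, (1 - p00) / d]

p00, p11 = Fraction(3, 4), Fraction(1, 2)     # illustrative values
print(two_state_power(p00, p11, 1))           # reproduces the matrix itself
print(two_state_power(p00, p11, 10))
print(two_state_limit(p00, p11))
```

Both rows of the tenth power are already close to the limiting row, illustrating how quickly the influence of the initial state disappears.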
3. The following theorem describes a wide class of Markov chains that have the property called ergodicity: the limits $\pi_j = \lim_n p_{ij}^{(n)}$ not only exist, are independent of $i$, and form a probability distribution ($\pi_j \ge 0$, $\sum_j \pi_j = 1$), but also $\pi_j > 0$ for all $j$ (such a distribution $\pi_j$ is said to be ergodic).

Theorem 1 (Ergodic Theorem). Let $\mathbb{P} = \|p_{ij}\|$ be the transition matrix of a chain with a finite state space $X = \{1, 2, \dots, N\}$.

(a) If there is an $n_0$ such that

$$\min_{i,j} p_{ij}^{(n_0)} > 0, \qquad (21)$$

then there are numbers $\pi_1, \dots, \pi_N$ such that

$$\pi_j > 0, \qquad \sum_j \pi_j = 1 \qquad (22)$$

and

$$p_{ij}^{(n)} \to \pi_j, \qquad n \to \infty, \qquad (23)$$

for every $i \in X$.
(b) Conversely, if there are numbers $\pi_1, \dots, \pi_N$ satisfying (22) and (23), there is an $n_0$ such that (21) holds.
(c) The numbers $(\pi_1, \dots, \pi_N)$ satisfy the equations

$$\pi_j = \sum_{\alpha} \pi_{\alpha}\,p_{\alpha j}, \qquad j = 1, \dots, N. \qquad (24)$$

PROOF.

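(a) Let

$$m_j^{(n)} = \min_i p_{ij}^{(n)}, \qquad M_j^{(n)} = \max_i p_{ij}^{(n)}.$$

Since

$$p_{ij}^{(n+1)} = \sum_\alpha p_{i\alpha} p_{\alpha j}^{(n)}, \tag{25}$$

we have

$$m_j^{(n+1)} = \min_i p_{ij}^{(n+1)} = \min_i \sum_\alpha p_{i\alpha} p_{\alpha j}^{(n)} \ge \min_i \sum_\alpha p_{i\alpha} \min_\alpha p_{\alpha j}^{(n)} = m_j^{(n)},$$

whence $m_j^{(n)} \le m_j^{(n+1)}$ and similarly $M_j^{(n)} \ge M_j^{(n+1)}$. Consequently, to establish (23) it will be enough to prove that

$$M_j^{(n)} - m_j^{(n)} \to 0, \qquad n \to \infty, \quad j = 1, \ldots, N.$$

Let $\varepsilon = \min_{i,j} p_{ij}^{(n_0)} > 0$. Then

$$p_{ij}^{(n_0+n)} = \sum_\alpha p_{i\alpha}^{(n_0)} p_{\alpha j}^{(n)} = \sum_\alpha \bigl[p_{i\alpha}^{(n_0)} - \varepsilon p_{j\alpha}^{(n)}\bigr] p_{\alpha j}^{(n)} + \varepsilon \sum_\alpha p_{j\alpha}^{(n)} p_{\alpha j}^{(n)} = \sum_\alpha \bigl[p_{i\alpha}^{(n_0)} - \varepsilon p_{j\alpha}^{(n)}\bigr] p_{\alpha j}^{(n)} + \varepsilon p_{jj}^{(2n)}.$$

But $p_{i\alpha}^{(n_0)} - \varepsilon p_{j\alpha}^{(n)} \ge 0$; therefore

$$p_{ij}^{(n_0+n)} \ge m_j^{(n)} \sum_\alpha \bigl[p_{i\alpha}^{(n_0)} - \varepsilon p_{j\alpha}^{(n)}\bigr] + \varepsilon p_{jj}^{(2n)} = m_j^{(n)}(1 - \varepsilon) + \varepsilon p_{jj}^{(2n)},$$

and consequently

$$m_j^{(n_0+n)} \ge m_j^{(n)}(1 - \varepsilon) + \varepsilon p_{jj}^{(2n)}.$$

In a similar way

$$M_j^{(n_0+n)} \le M_j^{(n)}(1 - \varepsilon) + \varepsilon p_{jj}^{(2n)}.$$

Combining these inequalities, we obtain

$$M_j^{(n_0+n)} - m_j^{(n_0+n)} \le \bigl(M_j^{(n)} - m_j^{(n)}\bigr)(1 - \varepsilon),$$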


and consequently

$$M_j^{(kn_0+n)} - m_j^{(kn_0+n)} \le \bigl(M_j^{(n)} - m_j^{(n)}\bigr)(1 - \varepsilon)^k \downarrow 0, \qquad k \to \infty.$$

Thus $M_j^{(n_p)} - m_j^{(n_p)} \to 0$ for some subsequence $n_p$, $n_p \to \infty$. But the difference $M_j^{(n)} - m_j^{(n)}$ is monotonic in $n$, and therefore $M_j^{(n)} - m_j^{(n)} \to 0$, $n \to \infty$.

If we put $\pi_j = \lim_n m_j^{(n)}$, it follows from the preceding inequalities that

$$\bigl|p_{ij}^{(n)} - \pi_j\bigr| \le M_j^{(n)} - m_j^{(n)} \le (1 - \varepsilon)^{[n/n_0] - 1}$$

for $n \ge n_0$, that is, $p_{ij}^{(n)}$ converges to its limit $\pi_j$ geometrically (i.e., as fast as a geometric progression).

It is also clear that $m_j^{(n)} \ge m_j^{(n_0)} \ge \varepsilon > 0$ for $n \ge n_0$, and therefore $\pi_j > 0$.
(b) Inequality (21) follows from (23) and (25).
(c) Equation (24) follows from (23) and (25).
This completes the proof of the theorem.
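The geometric bound obtained in part (a) can be observed on a small chain. The sketch below, in plain Python, tracks $\max_j (M_j^{(n)} - m_j^{(n)})$ for the powers of a three-state matrix; the matrix `P`, the helper names and the choice $n_0 = 2$ are illustrative assumptions and not taken from the text.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def column_spread(p):
    """Largest difference M_j - m_j between the max and min of a column of p."""
    n = len(p)
    return max(max(row[j] for row in p) - min(row[j] for row in p)
               for j in range(n))


# Illustrative chain (an assumption): P has zero entries, but P^2 is positive.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
N0 = 2
EPS = min(min(row) for row in mat_mul(P, P))  # the epsilon of the proof

spreads = []          # spreads[n - 1] = max_j (M_j^(n) - m_j^(n))
power = P
for n in range(1, 31):
    spreads.append(column_spread(power))
    power = mat_mul(power, P)
```

In this example the spreads are nonincreasing in $n$ and stay below $(1 - \varepsilon)^{[n/n_0] - 1}$ for $n \ge n_0$, in agreement with the estimate in the proof; the actual rate of decay is typically faster than this bound.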
4. Equations (24) play a major role in the theory of Markov chains. A nonnegative solution $(\pi_1, \ldots, \pi_N)$ satisfying $\sum_\alpha \pi_\alpha = 1$ is said to be a stationary or invariant probability distribution for the Markov chain with transition matrix $\|p_{ij}\|$. The reason for this terminology is as follows.

Let us select an initial distribution $(\pi_1, \ldots, \pi_N)$ and take $p_j = \pi_j$. Then

$$p_j^{(1)} = \sum_\alpha \pi_\alpha p_{\alpha j} = \pi_j$$

and in general $p_j^{(n)} = \pi_j$. In other words, if we take $(\pi_1, \ldots, \pi_N)$ as the initial distribution, this distribution is unchanged as time goes on, i.e. for any $k$

$$P(\xi_k = j) = P(\xi_0 = j), \qquad j = 1, \ldots, N.$$

Moreover, with this initial distribution the Markov chain $\xi = (\xi, \Pi, \mathbb{P})$ is really stationary: the joint distribution of the vector $(\xi_k, \xi_{k+1}, \ldots, \xi_{k+l})$ is independent of $k$ for all $l$ (assuming that $k + l \le n$).

Property (21) guarantees both the existence of limits $\pi_j = \lim p_{ij}^{(n)}$, which are independent of $i$, and the existence of an ergodic distribution, i.e. one with $\pi_j > 0$. The distribution $(\pi_1, \ldots, \pi_N)$ is also a stationary distribution. Let us now show that the set $(\pi_1, \ldots, \pi_N)$ is the only stationary distribution.

In fact, let $(\tilde\pi_1, \ldots, \tilde\pi_N)$ be another stationary distribution. Then

$$\tilde\pi_j = \sum_\alpha \tilde\pi_\alpha p_{\alpha j} = \cdots = \sum_\alpha \tilde\pi_\alpha p_{\alpha j}^{(n)},$$

and since $p_{\alpha j}^{(n)} \to \pi_j$ we have

$$\tilde\pi_j = \sum_\alpha (\tilde\pi_\alpha \cdot \pi_j) = \pi_j.$$

These problems will be investigated in detail in Chapter VIII for Markov


chains with countably many states as well as with finitely many states.


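We note that a stationary probability distribution (even unique) may exist for a nonergodic chain. In fact, if

$$\mathbb{P} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$

then

$$\mathbb{P}^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mathbb{P}^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$

and consequently the limits $\lim p_{ij}^{(n)}$ do not exist. At the same time, the system

$$\pi_j = \sum_\alpha \pi_\alpha p_{\alpha j}, \qquad j = 1, 2,$$

reduces to

$$\pi_1 = \pi_2, \qquad \pi_2 = \pi_1,$$

of which the unique solution satisfying $\pi_1 + \pi_2 = 1$ is $(\tfrac12, \tfrac12)$.

We also notice that for Example 5 the system (24) has the form

$$\pi_0 = \pi_0 p_{00} + \pi_1 p_{10}, \qquad \pi_1 = \pi_0 p_{01} + \pi_1 p_{11},$$

from which, by the condition $\pi_0 + \pi_1 = 1$, we find that the unique stationary distribution $(\pi_0, \pi_1)$ coincides with the one obtained above:

$$\pi_0 = \frac{1 - p_{11}}{2 - p_{00} - p_{11}}, \qquad \pi_1 = \frac{1 - p_{00}}{2 - p_{00} - p_{11}}.$$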
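A stationary distribution can be computed by solving the linear system (24) together with the normalization $\sum_j \pi_j = 1$. The following sketch does this exactly with rational arithmetic; the function name `stationary_distribution` is hypothetical, and the code assumes that the solution is unique (otherwise the elimination finds no pivot and raises `StopIteration`).

```python
from fractions import Fraction


def stationary_distribution(p):
    """Solve pi = pi P with sum(pi) = 1 exactly, assuming a unique solution.

    p is a square stochastic matrix (list of rows) of Fractions or ints.
    """
    n = len(p)
    # Equation j (j = 0..n-2): sum_a pi_a * (p[a][j] - delta_aj) = 0.
    rows = [[Fraction(p[a][j]) - (1 if a == j else 0) for a in range(n)]
            + [Fraction(0)] for j in range(n - 1)]
    # The last equation of (24) is redundant; replace it by normalization.
    rows.append([Fraction(1)] * n + [Fraction(1)])
    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


# The nonergodic "flip" chain and an illustrative instance of Example 5.
FLIP = [[0, 1], [1, 0]]
TWO_STATE = [[Fraction(9, 10), Fraction(1, 10)],
             [Fraction(2, 5), Fraction(3, 5)]]
```

For `FLIP` the function returns $(\tfrac12, \tfrac12)$ even though the powers of the matrix do not converge, which illustrates that stationarity alone does not imply ergodicity.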

We now consider some corollaries of the ergodic theorem.


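Let $A$ be a set of states, $A \subseteq X$, and

$$I_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$

Consider

$$v_A(n) = \frac{I_A(\xi_0) + \cdots + I_A(\xi_n)}{n + 1},$$

which is the fraction of the time spent by the particle in the set $A$. Since

$$E[I_A(\xi_k) \mid \xi_0 = i] = P(\xi_k \in A \mid \xi_0 = i) = \sum_{j \in A} p_{ij}^{(k)} \ \bigl(= p_i^{(k)}(A)\bigr),$$

we have

$$E[v_A(n) \mid \xi_0 = i] = \frac{1}{n + 1} \sum_{k=0}^{n} p_i^{(k)}(A)$$

and in particular

$$E[v_{\{j\}}(n) \mid \xi_0 = i] = \frac{1}{n + 1} \sum_{k=0}^{n} p_{ij}^{(k)}.$$

It is known from analysis (see also Lemma 1 in §3 of Chapter IV) that if $a_n \to a$ then $(a_0 + \cdots + a_n)/(n + 1) \to a$, $n \to \infty$. Hence if $p_{ij}^{(k)} \to \pi_j$, $k \to \infty$, then

$$E[v_{\{j\}}(n) \mid \xi_0 = i] \to \pi_j, \qquad E[v_A(n) \mid \xi_0 = i] \to \pi_A,$$

where

$$\pi_A = \sum_{j \in A} \pi_j.$$

For ergodic chains one can in fact prove more, namely that the following result holds for $I_A(\xi_0), \ldots, I_A(\xi_n), \ldots$.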

Law of Large Numbers. If $\xi_0, \xi_1, \ldots$ form a finite ergodic Markov chain, then

$$P\{|v_A(n) - \pi_A| > \varepsilon\} \to 0, \qquad n \to \infty, \tag{26}$$

for every $\varepsilon > 0$ and every initial distribution.


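Before we undertake the proof, let us notice that we cannot apply the results of §5 directly to $I_A(\xi_0), \ldots, I_A(\xi_n), \ldots$, since these variables are, in general, dependent. However, the proof can be carried through along the same lines as for independent variables if we again use Chebyshev's inequality, and apply the fact that for an ergodic chain with finitely many states there is a number $\rho$, $0 < \rho < 1$, such that

$$\bigl|p_{ij}^{(n)} - \pi_j\bigr| \le C \rho^n. \tag{27}$$

Let us consider states $i$ and $j$ (which might be the same) and show that, for $\varepsilon > 0$,

$$P\{|v_{\{j\}}(n) - \pi_j| > \varepsilon \mid \xi_0 = i\} \to 0, \qquad n \to \infty.$$

By Chebyshev's inequality,

$$P\{|v_{\{j\}}(n) - \pi_j| > \varepsilon \mid \xi_0 = i\} \le \frac{E\{|v_{\{j\}}(n) - \pi_j|^2 \mid \xi_0 = i\}}{\varepsilon^2}.$$

Hence we have only to show that

$$E\{|v_{\{j\}}(n) - \pi_j|^2 \mid \xi_0 = i\} \to 0, \qquad n \to \infty. \tag{28}$$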


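A simple calculation shows that

$$E\{|v_{\{j\}}(n) - \pi_j|^2 \mid \xi_0 = i\} = \frac{1}{(n + 1)^2} E\Bigl\{\Bigl[\sum_{k=0}^{n} \bigl(I_{\{j\}}(\xi_k) - \pi_j\bigr)\Bigr]^2 \Bigm| \xi_0 = i\Bigr\} = \frac{1}{(n + 1)^2} \sum_{k=0}^{n} \sum_{l=0}^{n} m_{ij}^{(k,l)},$$

where

$$m_{ij}^{(k,l)} = E\{I_{\{j\}}(\xi_k) I_{\{j\}}(\xi_l) \mid \xi_0 = i\} - \pi_j E[I_{\{j\}}(\xi_k) \mid \xi_0 = i] - \pi_j E[I_{\{j\}}(\xi_l) \mid \xi_0 = i] + \pi_j^2 = p_{ij}^{(s)} p_{jj}^{(t)} - \pi_j p_{ij}^{(k)} - \pi_j p_{ij}^{(l)} + \pi_j^2,$$

$$s = \min(k, l) \quad \text{and} \quad t = |k - l|.$$

By (27),

$$p_{ij}^{(n)} = \pi_j + \varepsilon_{ij}^{(n)}, \qquad \bigl|\varepsilon_{ij}^{(n)}\bigr| \le C \rho^n.$$

Therefore

$$\bigl|m_{ij}^{(k,l)}\bigr| \le C_1 [\rho^s + \rho^t + \rho^k + \rho^l],$$

where $C_1$ is a constant. Consequently

$$\frac{1}{(n + 1)^2} \sum_{k=0}^{n} \sum_{l=0}^{n} m_{ij}^{(k,l)} \le \frac{C_1}{(n + 1)^2} \sum_{k=0}^{n} \sum_{l=0}^{n} [\rho^s + \rho^t + \rho^k + \rho^l] \le \frac{4 C_1}{(n + 1)^2} \cdot \frac{2(n + 1)}{1 - \rho} = \frac{8 C_1}{(n + 1)(1 - \rho)} \to 0, \qquad n \to \infty.$$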

Then (28) follows from this, and we obtain (26) in an obvious way.
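The law of large numbers (26) can be illustrated by simulation. The sketch below generates one trajectory of a two-state chain and returns the fraction of time $v_A(n)$ spent in a set $A$; the function name `occupation_fraction`, the transition matrix and the trajectory length are illustrative assumptions, and a single simulated path is of course only suggestive, not a proof.

```python
import random


def occupation_fraction(p, start, target_states, n, rng):
    """Simulate xi_0, ..., xi_n and return v_A(n) for A = target_states."""
    state = start
    hits = 1 if state in target_states else 0
    for _ in range(n):
        # Draw the next state from row `state` of the transition matrix.
        state = rng.choices(range(len(p)), weights=p[state])[0]
        hits += state in target_states
    return hits / (n + 1)


# Illustrative ergodic chain (an assumption); its stationary
# distribution is (0.8, 0.2) by the formulas of Example 5.
P = [[0.9, 0.1],
     [0.4, 0.6]]

v_from_0 = occupation_fraction(P, 0, {0}, 100_000, random.Random(1))
v_from_1 = occupation_fraction(P, 1, {0}, 100_000, random.Random(2))
```

Both `v_from_0` and `v_from_1` should be close to $\pi_0 = 0.8$, whichever state the trajectory starts from.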
5. In §9 we gave, for a random walk $S_0, S_1, \ldots$ generated by a Bernoulli scheme, recurrent equations for the probability and the expectation of the exit time at either boundary. We now derive similar equations for Markov chains.

Let $\xi = (\xi_0, \ldots, \xi_n)$ be a Markov chain with transition matrix $\|p_{ij}\|$ and phase space $X = \{0, \pm 1, \ldots, \pm N\}$. Let $A$ and $B$ be two integers, $-N \le A \le 0 \le B \le N$, and $x \in X$. Let $\mathscr{B}_{k+1}$ be the set of paths $(x_0, x_1, \ldots, x_k)$, $x_i \in X$, that leave the interval $(A, B)$ for the first time at the upper end, i.e. leave $(A, B)$ by going into the set $(B, B + 1, \ldots, N)$.

For $A \le x \le B$, put

$$\beta_k(x) = P\{(\xi_0, \ldots, \xi_k) \in \mathscr{B}_{k+1} \mid \xi_0 = x\}.$$

In order to find these probabilities (for the first exit of the Markov chain from $(A, B)$ through the upper boundary) we use the method that was applied in the deduction of the backward equations.


We have

$$\beta_k(x) = P\{(\xi_0, \ldots, \xi_k) \in \mathscr{B}_{k+1} \mid \xi_0 = x\} = \sum_y p_{xy} \cdot P\{(\xi_0, \ldots, \xi_k) \in \mathscr{B}_{k+1} \mid \xi_0 = x, \xi_1 = y\},$$

where, as is easily seen by using the Markov property and the homogeneity of the chain,

$$P\{(\xi_0, \ldots, \xi_k) \in \mathscr{B}_{k+1} \mid \xi_0 = x, \xi_1 = y\} = P\{(x, y, \xi_2, \ldots, \xi_k) \in \mathscr{B}_{k+1} \mid \xi_0 = x, \xi_1 = y\}$$
$$= P\{(y, \xi_2, \ldots, \xi_k) \in \mathscr{B}_k \mid \xi_1 = y\} = P\{(y, \xi_1, \ldots, \xi_{k-1}) \in \mathscr{B}_k \mid \xi_0 = y\} = \beta_{k-1}(y).$$

Therefore

$$\beta_k(x) = \sum_y p_{xy} \beta_{k-1}(y)$$

for $A < x < B$ and $1 \le k \le n$. Moreover, it is clear that

$$\beta_k(x) = 1, \qquad x = B, B + 1, \ldots, N,$$

and

$$\beta_k(x) = 0, \qquad x = -N, \ldots, A.$$

In a similar way we can find equations for $\alpha_k(x)$, the probabilities for first exit from $(A, B)$ through the lower boundary.

Let $\tau_k = \min\{0 \le l \le k : \xi_l \notin (A, B)\}$, where $\tau_k = k$ if the set $\{\cdot\} = \varnothing$. Then the same method, applied to $m_k(x) = E(\tau_k \mid \xi_0 = x)$, leads to the following recurrent equations:

$$m_k(x) = 1 + \sum_y m_{k-1}(y) p_{xy}$$

(here $1 \le k \le n$, $A < x < B$). We define

$$m_k(x) = 0, \qquad x \notin (A, B).$$

It is clear that if the transition matrix is given by (11) the equations for $\alpha_k(x)$, $\beta_k(x)$ and $m_k(x)$ become the corresponding equations from §9, where they were obtained by essentially the same method that was used here.

These equations have the most interesting applications in the limiting case when the walk continues for an unbounded length of time. Just as in §9, the corresponding equations can be obtained by a formal limiting process ($k \to \infty$).

By way of example, we consider the Markov chain with states $\{0, 1, \ldots, B\}$ and transition probabilities

$$p_{00} = 1, \qquad p_{BB} = 1,$$

and

$$p_{ij} = \begin{cases} p_i > 0, & j = i + 1, \\ r_i, & j = i, \\ q_i > 0, & j = i - 1, \end{cases}$$

for $1 \le i \le B - 1$, where $p_i + q_i + r_i = 1$.
For this chain, the corresponding graph is

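[Figure: transition graph on the states 0, 1, ..., B, with arrows labeled $q_1$, $p_1$, ..., $q_{B-1}$, $p_{B-1}$ between neighboring states.]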

It is clear that states 0 and $B$ are absorbing, whereas for every other state $i$ the particle stays there with probability $r_i$, moves one step to the right with probability $p_i$ and to the left with probability $q_i$.

Let us find $\alpha(x) = \lim_{k \to \infty} \alpha_k(x)$, the limit of the probability that a particle starting at the point $x$ arrives at state zero before reaching state $B$. Taking limits as $k \to \infty$ in the equations for $\alpha_k(x)$, we find that

$$\alpha(j) = q_j \alpha(j - 1) + r_j \alpha(j) + p_j \alpha(j + 1)$$

when $0 < j < B$, with the boundary conditions

$$\alpha(0) = 1, \qquad \alpha(B) = 0.$$

Since $r_j = 1 - q_j - p_j$, we have

$$p_j(\alpha(j + 1) - \alpha(j)) = q_j(\alpha(j) - \alpha(j - 1))$$

and consequently

$$\alpha(j + 1) - \alpha(j) = \rho_j(\alpha(1) - 1),$$

where

$$\rho_j = \frac{q_1 \cdots q_j}{p_1 \cdots p_j}, \qquad \rho_0 = 1.$$

But

$$\alpha(j + 1) - 1 = \sum_{i=0}^{j} (\alpha(i + 1) - \alpha(i)).$$

Therefore

$$\alpha(j + 1) - 1 = (\alpha(1) - 1) \sum_{i=0}^{j} \rho_i.$$

If $j = B - 1$, we have $\alpha(j + 1) = \alpha(B) = 0$, and therefore

$$\alpha(1) - 1 = -\frac{1}{\sum_{i=0}^{B-1} \rho_i},$$

whence

$$\alpha(1) = \frac{\sum_{i=1}^{B-1} \rho_i}{\sum_{i=0}^{B-1} \rho_i} \qquad \text{and} \qquad \alpha(j) = \frac{\sum_{i=j}^{B-1} \rho_i}{\sum_{i=0}^{B-1} \rho_i}, \quad j = 1, \ldots, B.$$

(This should be compared with the results of §9.)
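The formula for $\alpha(j)$ is easy to evaluate exactly. The sketch below computes $\rho_j$ and $\alpha(j)$ with rational arithmetic; the function names and the convention that the lists `p` and `q` are indexed by the state (with an unused entry at index 0) are assumptions made for illustration.

```python
from fractions import Fraction


def rho_values(p, q, B):
    """Return [rho_0, ..., rho_{B-1}] with rho_0 = 1 and
    rho_j = (q_1 ... q_j) / (p_1 ... p_j)."""
    rho = [Fraction(1)]
    for j in range(1, B):
        rho.append(rho[-1] * Fraction(q[j]) / Fraction(p[j]))
    return rho


def absorption_at_zero(p, q, B):
    """Return [alpha(0), ..., alpha(B)]: the probability of reaching
    state 0 before state B, starting from each state."""
    rho = rho_values(p, q, B)
    total = sum(rho)
    inner = [sum(rho[j:]) / total for j in range(1, B)]
    return [Fraction(1)] + inner + [Fraction(0)]


# Illustrative data (assumptions): index 0 is a placeholder, states 1..B-1 matter.
H = Fraction(1, 2)
SYM_P = [Fraction(0), H, H, H, H]          # symmetric walk, B = 5, r_i = 0
SYM_Q = [Fraction(0), H, H, H, H]
GEN_P = [Fraction(0), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]   # B = 4
GEN_Q = [Fraction(0), Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)]
```

For the symmetric walk all $\rho_j = 1$, and the formula reduces to $\alpha(j) = (B - j)/B$, the classical gambler's-ruin probability.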


Now let $m(x) = \lim_k m_k(x)$, the limiting value of the average time taken to arrive at one of the states 0 or $B$. Then $m(0) = m(B) = 0$,

$$m(x) = 1 + \sum_y m(y) p_{xy},$$

and consequently for the example that we are considering,

$$m(j) = 1 + q_j m(j - 1) + r_j m(j) + p_j m(j + 1)$$

for $j = 1, 2, \ldots, B - 1$. To find $m(j)$ we put

$$M(j) = m(j) - m(j - 1), \qquad j = 1, \ldots, B.$$

Then

$$p_j M(j + 1) = q_j M(j) - 1, \qquad j = 1, \ldots, B - 1,$$

and consequently we find that

$$M(j + 1) = \rho_j M(1) - R_j,$$

where

$$\rho_j = \frac{q_1 \cdots q_j}{p_1 \cdots p_j}, \quad \rho_0 = 1, \qquad R_j = \frac{1}{p_j}\Bigl[1 + \frac{q_j}{p_{j-1}} + \cdots + \frac{q_j \cdots q_2}{p_{j-1} \cdots p_1}\Bigr], \quad R_0 = 0$$

(equivalently, $R_j = (q_j/p_j) R_{j-1} + 1/p_j$). Therefore

$$m(j) = m(j) - m(0) = \sum_{i=0}^{j-1} M(i + 1) = \sum_{i=0}^{j-1} (\rho_i m(1) - R_i) = m(1) \sum_{i=0}^{j-1} \rho_i - \sum_{i=0}^{j-1} R_i.$$

It remains only to determine $m(1)$. But $m(B) = 0$, and therefore

$$m(1) = \frac{\sum_{i=0}^{B-1} R_i}{\sum_{i=0}^{B-1} \rho_i},$$

and for $1 < j \le B$,

$$m(j) = \sum_{i=0}^{j-1} \rho_i \cdot \frac{\sum_{i=0}^{B-1} R_i}{\sum_{i=0}^{B-1} \rho_i} - \sum_{i=0}^{j-1} R_i.$$

(This should be compared with the results in §9 for the case $r_i = 0$, $p_i = p$, $q_i = q$.)
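The expression for $m(j)$ can be evaluated in the same way. The following sketch builds $\rho_j$ and $R_j$ from the recursion $R_j = (q_j/p_j) R_{j-1} + 1/p_j$, which follows from $p_j M(j+1) = q_j M(j) - 1$; the function name and the state-indexed lists `p`, `q` (with an unused entry at index 0) are illustrative assumptions.

```python
from fractions import Fraction


def mean_absorption_time(p, q, B):
    """Return [m(0), ..., m(B)], the mean time to reach 0 or B.

    p[j], q[j] are Fractions for the states j = 1..B-1; index 0 is unused.
    """
    rho = [Fraction(1)]   # rho_0 = 1
    R = [Fraction(0)]     # R_0 = 0
    for j in range(1, B):
        ratio = q[j] / p[j]
        rho.append(rho[-1] * ratio)
        R.append(R[-1] * ratio + 1 / p[j])
    m1 = sum(R) / sum(rho)
    # m(j) = m(1) * (rho_0 + ... + rho_{j-1}) - (R_0 + ... + R_{j-1})
    return [m1 * sum(rho[:j]) - sum(R[:j]) for j in range(B + 1)]


# Illustrative data (assumptions).
H = Fraction(1, 2)
SYM = [Fraction(0)] + [H] * 5                      # symmetric walk, B = 6
GEN_P = [Fraction(0), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]   # B = 4
GEN_Q = [Fraction(0), Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)]
```

For the symmetric walk with $r_i = 0$ this gives $m(j) = j(B - j)$, the familiar mean duration of the fair game.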

6. In this subsection we consider a stronger version of the Markov property (8), namely that it remains valid if time $k$ is replaced by a random time (see also Theorem 2). The significance of this, the strong Markov property, can be illustrated in particular by the example of the derivation of the recurrent relations (38), which play an important role in the classification of the states of Markov chains (Chapter VIII).

Let $\xi = (\xi_0, \ldots, \xi_n)$ be a homogeneous Markov chain with transition matrix $\|p_{ij}\|$; let $\mathscr{D}^\xi = (\mathscr{D}_k^\xi)_{0 \le k \le n}$ be a system of decompositions, $\mathscr{D}_k^\xi = \mathscr{D}_{\xi_0, \ldots, \xi_k}$. Let $\mathscr{B}_k^\xi$ denote the algebra $\alpha(\mathscr{D}_k^\xi)$ generated by the decomposition $\mathscr{D}_k^\xi$.
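We first put the Markov property (8) into a somewhat different form. Let $B \in \mathscr{B}_k^\xi$. Let us show that then

$$P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid B \cap (\xi_k = a_k)\} = P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} \tag{29}$$

(assuming that $P\{B \cap (\xi_k = a_k)\} > 0$). In fact, $B$ can be represented in the form

$$B = \sum{}^{*} \{\xi_0 = a_0^*, \ldots, \xi_k = a_k^*\},$$

where $\sum^*$ extends over some set of collections $(a_0^*, \ldots, a_k^*)$. Consequently

$$P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid B \cap (\xi_k = a_k)\} = \frac{P\{(\xi_n = a_n, \ldots, \xi_k = a_k) \cap B\}}{P\{(\xi_k = a_k) \cap B\}}$$
$$= \frac{\sum^* P\{(\xi_n = a_n, \ldots, \xi_k = a_k) \cap (\xi_0 = a_0^*, \ldots, \xi_k = a_k^*)\}}{P\{(\xi_k = a_k) \cap B\}}. \tag{30}$$

But, by the Markov property,

$$P\{(\xi_n = a_n, \ldots, \xi_k = a_k) \cap (\xi_0 = a_0^*, \ldots, \xi_k = a_k^*)\}$$
$$= \begin{cases} P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid \xi_0 = a_0^*, \ldots, \xi_k = a_k^*\} \, P\{\xi_0 = a_0^*, \ldots, \xi_k = a_k^*\} & \text{if } a_k = a_k^*, \\ 0 & \text{if } a_k \ne a_k^*, \end{cases}$$
$$= \begin{cases} P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} \, P\{\xi_0 = a_0^*, \ldots, \xi_k = a_k^*\} & \text{if } a_k = a_k^*, \\ 0 & \text{if } a_k \ne a_k^*, \end{cases}$$
$$= P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} \, P\{(\xi_k = a_k) \cap (\xi_0 = a_0^*, \ldots, \xi_k = a_k^*)\}.$$

Therefore the sum $\sum^*$ in (30) is equal to

$$P\{\xi_n = a_n, \ldots, \xi_{k+1} = a_{k+1} \mid \xi_k = a_k\} \, P\{(\xi_k = a_k) \cap B\}.$$

This establishes (29).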


Let $\tau$ be a stopping time (with respect to the system $\mathscr{D}^\xi = (\mathscr{D}_k^\xi)_{0 \le k \le n}$; see Definition 2 in §11).

Definition. We say that a set $B$ in the algebra $\mathscr{B}_n^\xi$ belongs to the system of sets $\mathscr{B}_\tau^\xi$ if, for each $k$, $0 \le k \le n$,

$$B \cap \{\tau = k\} \in \mathscr{B}_k^\xi. \tag{31}$$

It is easily verified that the collection of such sets $B$ forms an algebra (called the algebra of events observed at time $\tau$).

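Theorem 2. Let $\xi = (\xi_0, \ldots, \xi_n)$ be a homogeneous Markov chain with transition matrix $\|p_{ij}\|$, $\tau$ a stopping time (with respect to $\mathscr{D}^\xi$), $B \in \mathscr{B}_\tau^\xi$ and $A = \{\omega : \tau + l \le n\}$. Then if $P\{A \cap B \cap (\xi_\tau = a_0)\} > 0$, we have

$$P\{\xi_{\tau+l} = a_l, \ldots, \xi_{\tau+1} = a_1 \mid A \cap B \cap (\xi_\tau = a_0)\} = P\{\xi_{\tau+l} = a_l, \ldots, \xi_{\tau+1} = a_1 \mid A \cap (\xi_\tau = a_0)\}, \tag{32}$$

and if $P\{A \cap (\xi_\tau = a_0)\} > 0$ then

$$P\{\xi_{\tau+l} = a_l, \ldots, \xi_{\tau+1} = a_1 \mid A \cap (\xi_\tau = a_0)\} = p_{a_0 a_1} \cdots p_{a_{l-1} a_l}. \tag{33}$$

For the sake of simplicity, we give the proof only for the case $l = 1$. Since $B \cap (\tau = k) \in \mathscr{B}_k^\xi$, we have, according to (29),

$$P\{\xi_{\tau+1} = a_1, A \cap B \cap (\xi_\tau = a_0)\} = \sum_{k \le n-1} P\{\xi_{k+1} = a_1, \xi_k = a_0, \tau = k, B\}$$
$$= \sum_{k \le n-1} P\{\xi_{k+1} = a_1 \mid \xi_k = a_0, \tau = k, B\} \, P\{\xi_k = a_0, \tau = k, B\}$$
$$= \sum_{k \le n-1} P\{\xi_{k+1} = a_1 \mid \xi_k = a_0\} \, P\{\xi_k = a_0, \tau = k, B\}$$
$$= p_{a_0 a_1} \sum_{k \le n-1} P\{\xi_k = a_0, \tau = k, B\} = p_{a_0 a_1} \cdot P\{A \cap B \cap (\xi_\tau = a_0)\},$$

which simultaneously establishes (32) and (33) (for (33) we have to take $B = \Omega$).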

Remark. When $l = 1$, the strong Markov property (32), (33) is evidently equivalent to the property that

$$P\{\xi_{\tau+1} \in C \mid A \cap B \cap (\xi_\tau = a_0)\} = P_{a_0}(C) \tag{34}$$

for every $C \subseteq X$, where

$$P_{a_0}(C) = \sum_{a_1 \in C} p_{a_0 a_1}.$$

In turn, (34) can be restated as follows: on the set $A = \{\tau \le n - 1\}$,

$$P\{\xi_{\tau+1} \in C \mid \mathscr{B}_\tau^\xi\} = P_{\xi_\tau}(C), \tag{35}$$

which is a form of the strong Markov property that is commonly used in the general theory of homogeneous Markov processes.
7. Let $\xi = (\xi_0, \ldots, \xi_n)$ be a homogeneous Markov chain with transition matrix $\|p_{ij}\|$, and let

$$f_{ii}^{(k)} = P\{\xi_k = i, \ \xi_l \ne i, \ 1 \le l \le k - 1 \mid \xi_0 = i\} \tag{36}$$

and

$$f_{ij}^{(k)} = P\{\xi_k = j, \ \xi_l \ne j, \ 1 \le l \le k - 1 \mid \xi_0 = i\} \tag{37}$$

for $i \ne j$ be respectively the probability of first return to state $i$ at time $k$ and the probability of first arrival at state $j$ at time $k$.

Let us show that

$$p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)} p_{jj}^{(n-k)}, \qquad \text{where } p_{jj}^{(0)} = 1. \tag{38}$$

The intuitive meaning of the formula is clear: to go from state $i$ to state $j$ in $n$ steps, it is necessary to reach state $j$ for the first time in $k$ steps ($1 \le k \le n$) and then to go from state $j$ to state $j$ in $n - k$ steps. We now give a rigorous derivation.

Let $j$ be given and

$$\tau = \min\{1 \le k \le n : \xi_k = j\},$$

assuming that $\tau = n + 1$ if $\{\cdot\} = \varnothing$. Then $f_{ij}^{(k)} = P\{\tau = k \mid \xi_0 = i\}$ and

$$p_{ij}^{(n)} = P\{\xi_n = j \mid \xi_0 = i\} = \sum_{1 \le k \le n} P\{\xi_n = j, \tau = k \mid \xi_0 = i\} = \sum_{1 \le k \le n} P\{\xi_{\tau+n-k} = j, \tau = k \mid \xi_0 = i\}, \tag{39}$$

where the last equation follows because $\xi_{\tau+n-k} = \xi_n$ on the set $\{\tau = k\}$. Moreover, the set $\{\tau = k\} = \{\tau = k, \xi_\tau = j\}$ for every $k$, $1 \le k \le n$. Therefore if $P\{\xi_0 = i, \tau = k\} > 0$, it follows from Theorem 2 that

$$P\{\xi_{\tau+n-k} = j \mid \xi_0 = i, \tau = k\} = P\{\xi_{\tau+n-k} = j \mid \xi_0 = i, \tau = k, \xi_\tau = j\} = P\{\xi_{\tau+n-k} = j \mid \xi_\tau = j\} = p_{jj}^{(n-k)}$$

and by (37)

$$p_{ij}^{(n)} = \sum_{k=1}^{n} P\{\xi_{\tau+n-k} = j \mid \xi_0 = i, \tau = k\} \, P\{\tau = k \mid \xi_0 = i\} = \sum_{k=1}^{n} p_{jj}^{(n-k)} f_{ij}^{(k)},$$

which establishes (38).
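Relation (38) can be verified exactly on a small example. The sketch below computes the first-passage probabilities by the one-step recursion $f_{ij}^{(1)} = p_{ij}$, $f_{ij}^{(k)} = \sum_{y \ne j} p_{iy} f_{yj}^{(k-1)}$ (which follows from definitions (36), (37) and the Markov property) and compares both sides of (38). The function names and the three-state matrix are illustrative assumptions.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step_matrices(p, n_max):
    """Return the list [P^0, P^1, ..., P^n_max]."""
    size = len(p)
    powers = [[[Fraction(int(i == j)) for j in range(size)] for i in range(size)]]
    for _ in range(n_max):
        powers.append(mat_mul(powers[-1], p))
    return powers


def first_passage(p, j, n_max):
    """Return f with f[k][i] = f_ij^(k) for k = 1..n_max (f[0] is unused)."""
    size = len(p)
    f = [None, [p[i][j] for i in range(size)]]
    for _ in range(2, n_max + 1):
        prev = f[-1]
        # First step to some y != j, then first arrival at j in k - 1 steps.
        f.append([sum(p[i][y] * prev[y] for y in range(size) if y != j)
                  for i in range(size)])
    return f


# Illustrative chain (an assumption, not from the text).
H = Fraction(1, 2)
P = [[H, H, 0],
     [0, H, H],
     [H, 0, H]]
N_MAX = 8
POWERS = n_step_matrices(P, N_MAX)
```

With rational arithmetic the two sides of (38) agree exactly for every $i$, $j$ and $n \le$ `N_MAX`, including the case $i = j$ of definition (36).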


8. PROBLEMS

1. Let $\xi = (\xi_0, \ldots, \xi_n)$ be a Markov chain with values in $X$ and $f = f(x)$ ($x \in X$) a function. Will the sequence $(f(\xi_0), \ldots, f(\xi_n))$ form a Markov chain? Will the "reversed" sequence

$$(\xi_n, \xi_{n-1}, \ldots, \xi_0)$$

form a Markov chain?

2. Let $\mathbb{P} = \|p_{ij}\|$, $1 \le i, j \le r$, be a stochastic matrix and $\lambda$ an eigenvalue of the matrix, i.e. a root of the characteristic equation $\det \|\mathbb{P} - \lambda E\| = 0$. Show that $\lambda_0 = 1$ is an eigenvalue and that all the other eigenvalues have moduli not exceeding 1. If all the eigenvalues $\lambda_1, \ldots, \lambda_r$ are distinct, then $p_{ij}^{(k)}$ admits the representation

$$p_{ij}^{(k)} = \pi_j + a_{ij}(1) \lambda_1^k + \cdots + a_{ij}(r) \lambda_r^k,$$

where $\pi_j, a_{ij}(1), \ldots, a_{ij}(r)$ can be expressed in terms of the elements of $\mathbb{P}$. (It follows from this algebraic approach to the study of Markov chains that, in particular, when $|\lambda_1| < 1, \ldots, |\lambda_r| < 1$, the limit $\lim p_{ij}^{(k)}$ exists for every $j$ and is independent of $i$.)

3. Let $\xi = (\xi_0, \ldots, \xi_n)$ be a homogeneous Markov chain with state space $X$ and transition matrix $\mathbb{P} = \|p_{xy}\|$. Let

$$T\varphi(x) = E[\varphi(\xi_1) \mid \xi_0 = x] \quad \Bigl(= \sum_y \varphi(y) p_{xy}\Bigr).$$

Let the nonnegative function $\varphi$ satisfy

$$T\varphi(x) = \varphi(x), \qquad x \in X.$$

Show that the sequence of random variables

$$\zeta = (\zeta_k, \mathscr{D}_k^\xi) \quad \text{with} \quad \zeta_k = \varphi(\xi_k)$$

is a martingale.

4. Let $\xi = (\xi, \Pi, \mathbb{P})$ and $\tilde\xi = (\tilde\xi, \tilde\Pi, \mathbb{P})$ be two Markov chains with different initial distributions $\Pi = (p_1, \ldots, p_r)$ and $\tilde\Pi = (\tilde p_1, \ldots, \tilde p_r)$. Show that if $\min_{i,j} p_{ij} \ge \varepsilon > 0$ then

$$\sum_{i=1}^{r} \bigl|\tilde p_i^{(n)} - p_i^{(n)}\bigr| \le 2(1 - \varepsilon)^n.$$

CHAPTER II

Mathematical Foundations of Probability Theory

§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms
1. The models introduced in the preceding chapter enabled us to give a probabilistic-statistical description of experiments with a finite number of outcomes. For example, the triple $(\Omega, \mathscr{A}, P)$ with

$$\Omega = \{\omega : \omega = (a_1, \ldots, a_n), \ a_i = 0, 1\}, \qquad \mathscr{A} = \{A : A \subseteq \Omega\}$$

and $p(\omega) = p^{\sum a_i} q^{n - \sum a_i}$, is a model for the experiment in which a coin is tossed $n$ times "independently" with probability $p$ of falling head. In this model the number $N(\Omega)$ of outcomes, i.e. the number of points in $\Omega$, is the finite number $2^n$.

We now consider the problem of constructing a probabilistic model for the experiment consisting of an infinite number of independent tosses of a coin when at each step the probability of falling head is $p$.

It is natural to take the set of outcomes to be the set

$$\Omega = \{\omega : \omega = (a_1, a_2, \ldots), \ a_i = 0, 1\},$$

i.e. the space of sequences $\omega = (a_1, a_2, \ldots)$ whose elements are 0 or 1.

What is the cardinality $N(\Omega)$ of $\Omega$? It is well known that every number $a \in [0, 1)$ has a unique binary expansion (containing an infinite number of zeros)

$$a = \frac{a_1}{2} + \frac{a_2}{2^2} + \cdots \qquad (a_i = 0, 1).$$


Hence it is clear that there is a one-to-one correspondence between the points $\omega$ of $\Omega$ and the points $a$ of the set $[0, 1)$, and therefore $\Omega$ has the cardinality of the continuum.

Consequently if we wish to construct a probabilistic model to describe experiments like tossing a coin infinitely often, we must consider spaces $\Omega$ of a rather complicated nature.

We shall now try to see what probabilities ought reasonably to be assigned (or assumed) in a model of infinitely many independent tosses of a fair coin ($p = q = \tfrac12$).

Since we may take $\Omega$ to be the set $[0, 1)$, our problem can be considered as the problem of choosing points at random from this set. For reasons of symmetry, it is clear that all outcomes ought to be equiprobable. But the set $[0, 1)$ is uncountable, and if we suppose that its probability is 1, then it follows that the probability $p(\omega)$ of each outcome certainly must equal zero. However, this assignment of probabilities ($p(\omega) = 0$, $\omega \in [0, 1)$) does not lead very far. The fact is that we are ordinarily not interested in the probability of one outcome or another, but in the probability that the result of the experiment is in one or another specified set $A$ of outcomes (an event). In elementary probability theory we use the probabilities $p(\omega)$ to find the probability $P(A)$ of the event $A$: $P(A) = \sum_{\omega \in A} p(\omega)$. In the present case, with $p(\omega) = 0$, $\omega \in [0, 1)$, we cannot define, for example, the probability that a point chosen at random from $[0, 1)$ belongs to the set $[0, \tfrac12)$. At the same time, it is intuitively clear that this probability should be $\tfrac12$.

These remarks should suggest that in constructing probabilistic models for uncountable spaces $\Omega$ we must assign probabilities, not to individual outcomes but to subsets of $\Omega$. The same reasoning as in the first chapter shows that the collection of sets to which probabilities are assigned must be closed with respect to unions, intersections and complements. Here the following definition is useful.

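Definition 1. Let $\Omega$ be a set of points $\omega$. A system $\mathscr{A}$ of subsets of $\Omega$ is called an algebra if

(a) $\Omega \in \mathscr{A}$,
(b) $A, B \in \mathscr{A} \Rightarrow A \cup B \in \mathscr{A}$, $A \cap B \in \mathscr{A}$,
(c) $A \in \mathscr{A} \Rightarrow \bar{A} \in \mathscr{A}$.

(Notice that in condition (b) it is sufficient to require only that either $A \cup B \in \mathscr{A}$ or that $A \cap B \in \mathscr{A}$, since $A \cup B = \overline{\bar{A} \cap \bar{B}}$ and $A \cap B = \overline{\bar{A} \cup \bar{B}}$.)

The next definition is needed in formulating the concept of a probabilistic model.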

Definition 2. Let $\mathscr{A}$ be an algebra of subsets of $\Omega$. A set function $\mu = \mu(A)$, $A \in \mathscr{A}$, taking values in $[0, \infty]$, is called a finitely additive measure defined on $\mathscr{A}$ if

$$\mu(A + B) = \mu(A) + \mu(B) \tag{1}$$

for every pair of disjoint sets $A$ and $B$ in $\mathscr{A}$.

A finitely additive measure $\mu$ with $\mu(\Omega) < \infty$ is called finite, and when $\mu(\Omega) = 1$ it is called a finitely additive probability measure, or a finitely additive probability.

2. We now define a probabilistic model (in the extended sense).

Definition 3. An ordered triple $(\Omega, \mathscr{A}, P)$, where

(a) $\Omega$ is a set of points $\omega$;
(b) $\mathscr{A}$ is an algebra of subsets of $\Omega$;
(c) $P$ is a finitely additive probability on $\mathscr{A}$,

is a probabilistic model in the extended sense.

It turns out, however, that this model is too broad to lead to a fruitful mathematical theory. Consequently we must restrict both the class of subsets of $\Omega$ that we consider, and the class of admissible probability measures.

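Definition 4. A system $\mathscr{F}$ of subsets of $\Omega$ is a $\sigma$-algebra if it is an algebra and satisfies the following additional condition (stronger than (b) of Definition 1):

(b*) if $A_n \in \mathscr{F}$, $n = 1, 2, \ldots$, then

$$\bigcup A_n \in \mathscr{F}, \qquad \bigcap A_n \in \mathscr{F}$$

(it is sufficient to require either that $\bigcup A_n \in \mathscr{F}$ or that $\bigcap A_n \in \mathscr{F}$).

Definition 5. The space $\Omega$ together with a $\sigma$-algebra $\mathscr{F}$ of its subsets is a measurable space, and is denoted by $(\Omega, \mathscr{F})$.

Definition 6. A finitely additive measure $\mu$ defined on an algebra $\mathscr{A}$ of subsets of $\Omega$ is countably additive (or $\sigma$-additive), or simply a measure, if, for all pairwise disjoint sets $A_1, A_2, \ldots$ in $\mathscr{A}$ whose sum $\sum_{n=1}^{\infty} A_n$ also belongs to $\mathscr{A}$,

$$\mu\Bigl(\sum_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} \mu(A_n).$$

A finitely additive measure $\mu$ is said to be $\sigma$-finite if $\Omega$ can be represented in the form

$$\Omega = \sum_{n=1}^{\infty} \Omega_n, \qquad \Omega_n \in \mathscr{A},$$

with $\mu(\Omega_n) < \infty$, $n = 1, 2, \ldots$.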

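If a countably additive measure $P$ on the algebra $\mathscr{A}$ satisfies $P(\Omega) = 1$, it is called a probability measure or a probability (defined on the sets that belong to the algebra $\mathscr{A}$).

Probability measures have the following properties.

If $\varnothing$ is the empty set then

$$P(\varnothing) = 0.$$

If $A, B \in \mathscr{A}$ then

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

If $A, B \in \mathscr{A}$ and $B \subseteq A$ then

$$P(B) \le P(A).$$

If $A_n \in \mathscr{A}$, $n = 1, 2, \ldots$, and $\bigcup A_n \in \mathscr{A}$, then

$$P(A_1 \cup A_2 \cup \cdots) \le P(A_1) + P(A_2) + \cdots.$$

The first three properties are evident. To establish the last one it is enough to observe that $\bigcup_{n=1}^{\infty} A_n = \sum_{n=1}^{\infty} B_n$, where $B_1 = A_1$, $B_n = \bar{A}_1 \cap \cdots \cap \bar{A}_{n-1} \cap A_n$, $n \ge 2$, $B_i \cap B_j = \varnothing$, $i \ne j$, and therefore

$$P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = P\Bigl(\sum_{n=1}^{\infty} B_n\Bigr) = \sum_{n=1}^{\infty} P(B_n) \le \sum_{n=1}^{\infty} P(A_n).$$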

The next theorem, which has many applications, provides conditions


under which a finitely additive set function is actually countably additive.

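Theorem. Let $P$ be a finitely additive set function defined over the algebra $\mathscr{A}$, with $P(\Omega) = 1$. The following four conditions are equivalent:

(1) $P$ is $\sigma$-additive ($P$ is a probability);

(2) $P$ is continuous from below, i.e. for any sets $A_1, A_2, \ldots \in \mathscr{A}$ such that $A_n \subseteq A_{n+1}$ and $\bigcup_{n=1}^{\infty} A_n \in \mathscr{A}$,

$$\lim_n P(A_n) = P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr);$$

(3) $P$ is continuous from above, i.e. for any sets $A_1, A_2, \ldots \in \mathscr{A}$ such that $A_n \supseteq A_{n+1}$ and $\bigcap_{n=1}^{\infty} A_n \in \mathscr{A}$,

$$\lim_n P(A_n) = P\Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr);$$

(4) $P$ is continuous at $\varnothing$, i.e. for any sets $A_1, A_2, \ldots \in \mathscr{A}$ such that $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^{\infty} A_n = \varnothing$,

$$\lim_n P(A_n) = 0.$$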

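PROOF. (1) ⇒ (2). Since

$$\bigcup_{n=1}^{\infty} A_n = A_1 + (A_2 \setminus A_1) + (A_3 \setminus A_2) + \cdots,$$

we have

$$P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = P(A_1) + P(A_2 \setminus A_1) + P(A_3 \setminus A_2) + \cdots = P(A_1) + P(A_2) - P(A_1) + P(A_3) - P(A_2) + \cdots = \lim_n P(A_n).$$

(2) ⇒ (3). Let $n \ge 1$; then

$$P(A_n) = P(A_1 \setminus (A_1 \setminus A_n)) = P(A_1) - P(A_1 \setminus A_n).$$

The sequence $\{A_1 \setminus A_n\}_{n \ge 1}$ of sets is nondecreasing (see the table in Subsection 3 below) and

$$\bigcup_{n=1}^{\infty} (A_1 \setminus A_n) = A_1 \setminus \bigcap_{n=1}^{\infty} A_n.$$

Then, by (2),

$$\lim_n P(A_1 \setminus A_n) = P\Bigl(\bigcup_{n=1}^{\infty} (A_1 \setminus A_n)\Bigr)$$

and therefore

$$\lim_n P(A_n) = P(A_1) - \lim_n P(A_1 \setminus A_n) = P(A_1) - P\Bigl(\bigcup_{n=1}^{\infty} (A_1 \setminus A_n)\Bigr) = P(A_1) - P\Bigl(A_1 \setminus \bigcap_{n=1}^{\infty} A_n\Bigr) = P(A_1) - P(A_1) + P\Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr) = P\Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr).$$

(3) ⇒ (4). Obvious.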

(4) ⇒ (1). Let $A_1, A_2, \ldots \in \mathscr{A}$ be pairwise disjoint and let $\sum_{n=1}^{\infty} A_n \in \mathscr{A}$. Then

$$P\Bigl(\sum_{i=1}^{\infty} A_i\Bigr) = P\Bigl(\sum_{i=1}^{n} A_i\Bigr) + P\Bigl(\sum_{i=n+1}^{\infty} A_i\Bigr),$$

and since $\sum_{i=n+1}^{\infty} A_i \downarrow \varnothing$, $n \to \infty$, we have

$$\sum_{i=1}^{\infty} P(A_i) = \lim_n \sum_{i=1}^{n} P(A_i) = \lim_n P\Bigl(\sum_{i=1}^{n} A_i\Bigr) = \lim_n \Bigl[P\Bigl(\sum_{i=1}^{\infty} A_i\Bigr) - P\Bigl(\sum_{i=n+1}^{\infty} A_i\Bigr)\Bigr] = P\Bigl(\sum_{i=1}^{\infty} A_i\Bigr).$$

Table

| Notation | Set-theoretic interpretation | Interpretation in probability theory |
|---|---|---|
| $\omega$ | element or point | outcome, sample point, elementary event |
| $\Omega$ | set of points | sample space; certain event |
| $\mathscr{F}$ | $\sigma$-algebra of subsets | $\sigma$-algebra of events |
| $A \in \mathscr{F}$ | set of points | event (if $\omega \in A$, we say that event $A$ occurs) |
| $\bar{A} = \Omega \setminus A$ | complement of $A$, i.e. the set of points $\omega$ that are not in $A$ | event that $A$ does not occur |
| $A \cup B$ | union of $A$ and $B$, i.e. the set of points $\omega$ belonging either to $A$ or to $B$ | event that either $A$ or $B$ occurs |
| $A \cap B$ (or $AB$) | intersection of $A$ and $B$, i.e. the set of points $\omega$ belonging to both $A$ and $B$ | event that both $A$ and $B$ occur |
| $\varnothing$ | empty set | impossible event |
| $A \cap B = \varnothing$ | $A$ and $B$ are disjoint | events $A$ and $B$ are mutually exclusive, i.e. cannot occur simultaneously |
| $A + B$ | sum of sets, i.e. union of disjoint sets | event that one of two mutually exclusive events occurs |
| $A \setminus B$ | difference of $A$ and $B$, i.e. the set of points that belong to $A$ but not to $B$ | event that $A$ occurs and $B$ does not |
| $A \triangle B$ | symmetric difference of sets, i.e. $(A \setminus B) \cup (B \setminus A)$ | event that $A$ or $B$ occurs, but not both |
| $\bigcup_{n=1}^{\infty} A_n$ | union of the sets $A_1, A_2, \ldots$ | event that at least one of $A_1, A_2, \ldots$ occurs |
| $\sum_{n=1}^{\infty} A_n$ | sum, i.e. union of pairwise disjoint sets $A_1, A_2, \ldots$ | event that one of the mutually exclusive events $A_1, A_2, \ldots$ occurs |
| $\bigcap_{n=1}^{\infty} A_n$ | intersection of $A_1, A_2, \ldots$ | event that all the events $A_1, A_2, \ldots$ occur |
| $A_n \uparrow A$ (or $A = \lim \uparrow A_n$) | the increasing sequence of sets $A_n$ converges to $A$, i.e. $A_1 \subseteq A_2 \subseteq \cdots$ and $A = \bigcup_{n=1}^{\infty} A_n$ | the increasing sequence of events converges to event $A$ |
| $A_n \downarrow A$ (or $A = \lim \downarrow A_n$) | the decreasing sequence of sets $A_n$ converges to $A$, i.e. $A_1 \supseteq A_2 \supseteq \cdots$ and $A = \bigcap_{n=1}^{\infty} A_n$ | the decreasing sequence of events converges to event $A$ |
| $\overline{\lim}\, A_n$ (or $\limsup A_n$ or* $\{A_n \text{ i.o.}\}$) | the set $\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$ | event that infinitely many of events $A_1, A_2, \ldots$ occur |
| $\underline{\lim}\, A_n$ (or $\liminf A_n$) | the set $\bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k$ | event that all the events $A_1, A_2, \ldots$ occur with the possible exception of a finite number of them |

* i.o. = infinitely often.

3. We can now formulate Kolmogorov's generally accepted axiom system, which forms the basis for the concept of a probability space.

Fundamental Definition. An ordered triple $(\Omega, \mathscr{F}, P)$ where

(a) $\Omega$ is a set of points $\omega$,
(b) $\mathscr{F}$ is a $\sigma$-algebra of subsets of $\Omega$,
(c) $P$ is a probability on $\mathscr{F}$,

is called a probabilistic model or a probability space. Here $\Omega$ is the sample space or space of elementary events, the sets $A$ in $\mathscr{F}$ are events, and $P(A)$ is the probability of the event $A$.

It is clear from the definition that the axiomatic formulation of probability theory is based on set theory and measure theory. Accordingly, it is useful to have a table (pp. 134-135) displaying the ways in which various concepts are interpreted in the two theories. In the next two sections we shall give examples of the measurable spaces that are most important for probability theory and of how probabilities are assigned on them.

4. PROBLEMS

1. Let $\Omega = \{r : r \in [0, 1]\}$ be the set of rational points of $[0, 1]$, $\mathscr{A}$ the algebra of sets each of which is a finite sum of disjoint sets $A$ of one of the forms $\{r : a < r < b\}$, $\{r : a \le r < b\}$, $\{r : a < r \le b\}$, $\{r : a \le r \le b\}$, and $P(A) = b - a$. Show that $P(A)$, $A \in \mathscr{A}$, is a finitely additive set function but not countably additive.

2. Let $\Omega$ be a countable set and $\mathscr{F}$ the collection of all its subsets. Put $\mu(A) = 0$ if $A$ is finite and $\mu(A) = \infty$ if $A$ is infinite. Show that the set function $\mu$ is finitely additive but not countably additive.

3. Let $\mu$ be a finite measure on a $\sigma$-algebra $\mathscr{F}$, $A_n \in \mathscr{F}$, $n = 1, 2, \ldots$, and $A = \lim_n A_n$ (i.e., $A = \overline{\lim}_n A_n = \underline{\lim}_n A_n$). Show that $\mu(A) = \lim_n \mu(A_n)$.

4. Prove that $P(A \triangle B) = P(A) + P(B) - 2P(A \cap B)$.


5. Show that the "distances" $\rho_1(A, B)$ and $\rho_2(A, B)$ defined by

$$\rho_1(A, B) = P(A \triangle B), \qquad \rho_2(A, B) = \begin{cases} \dfrac{P(A \triangle B)}{P(A \cup B)} & \text{if } P(A \cup B) \ne 0, \\ 0 & \text{if } P(A \cup B) = 0 \end{cases}$$

satisfy the triangle inequality.

6. Let $\mu$ be a finitely additive measure on an algebra $\mathscr{A}$, and let the sets $A_1, A_2, \ldots \in \mathscr{A}$ be pairwise disjoint and satisfy $A = \sum_{i=1}^{\infty} A_i \in \mathscr{A}$. Then $\mu(A) \ge \sum_{i=1}^{\infty} \mu(A_i)$.

7. Prove that

$$\overline{\limsup A_n} = \liminf \bar{A}_n, \qquad \overline{\liminf A_n} = \limsup \bar{A}_n,$$
$$\liminf A_n \subseteq \limsup A_n, \qquad \limsup (A_n \cup B_n) = \limsup A_n \cup \limsup B_n,$$
$$\limsup A_n \cap \liminf B_n \subseteq \limsup (A_n \cap B_n) \subseteq \limsup A_n \cap \limsup B_n.$$

If $A_n \uparrow A$ or $A_n \downarrow A$, then

$$\liminf A_n = \limsup A_n.$$

8. Let $\{x_n\}$ be a sequence of numbers and $A_n = (-\infty, x_n)$. Show that $x = \limsup x_n$ and $A = \limsup A_n$ are related in the following way: $(-\infty, x) \subseteq A \subseteq (-\infty, x]$. In other words, $A$ is equal to either $(-\infty, x)$ or to $(-\infty, x]$.

9. Give an example to show that if a measure takes the value $+\infty$, it does not follow in general that countable additivity implies continuity at $\varnothing$.

§2. Algebras and $\sigma$-Algebras. Measurable Spaces

1. Algebras and $\sigma$-algebras are the components out of which probabilistic models are constructed. We shall present some examples and a number of results for these systems.

Let $\Omega$ be a sample space. Evidently each of the collections of sets

$$\mathscr{F}_* = \{\varnothing, \Omega\}, \qquad \mathscr{F}^* = \{A : A \subseteq \Omega\}$$

is both an algebra and a $\sigma$-algebra. In fact, $\mathscr{F}_*$ is trivial, the "poorest" $\sigma$-algebra, whereas $\mathscr{F}^*$ is the "richest" $\sigma$-algebra, consisting of all subsets of $\Omega$.

When $\Omega$ is a finite space, the $\sigma$-algebra $\mathscr{F}^*$ is fully surveyable, and commonly serves as the system of events in the elementary theory. However, when the space is uncountable the class $\mathscr{F}^*$ is much too large, since it is impossible to define "probability" on such a system of sets in any consistent way.

If $A \subseteq \Omega$, the system

$$\mathscr{F}_A = \{A, \bar{A}, \varnothing, \Omega\}$$


is another example of an algebra (and a $\sigma$-algebra), the algebra (or $\sigma$-algebra) generated by $A$.

This system of sets is a special case of the systems generated by decompositions. In fact, let

$$\mathscr{D} = \{D_1, D_2, \ldots\}$$

be a countable decomposition of $\Omega$ into nonempty sets:

$$\Omega = D_1 + D_2 + \cdots; \qquad D_i \cap D_j = \varnothing, \quad i \ne j.$$

Then the system $\mathscr{A} = \alpha(\mathscr{D})$, formed by the sets that are unions of finite numbers of elements of the decomposition, is an algebra.

The following lemma is particularly useful since it establishes the important principle that there is a smallest algebra, or $\sigma$-algebra, containing a given collection of sets.

Lemma 1. Let $\mathscr{E}$ be a collection of subsets of $\Omega$. Then there are a smallest algebra $\alpha(\mathscr{E})$ and a smallest $\sigma$-algebra $\sigma(\mathscr{E})$ containing all the sets that are in $\mathscr{E}$.

PROOF. The class $\mathscr{F}^*$ of all subsets of $\Omega$ is a $\sigma$-algebra. Therefore there are at least one algebra and one $\sigma$-algebra containing $\mathscr{E}$. We now define $\alpha(\mathscr{E})$ (or $\sigma(\mathscr{E})$) to consist of all sets that belong to every algebra (or $\sigma$-algebra) containing $\mathscr{E}$. It is easy to verify that this system is an algebra (or $\sigma$-algebra) and indeed the smallest.
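When $\Omega$ is finite, the smallest algebra containing a given collection of sets can be found constructively, by closing the collection under complements and unions until nothing new appears. This is a different route from the intersection argument of the proof, and it works only for finite $\Omega$; the function name `generated_algebra` in the following sketch is hypothetical.

```python
def generated_algebra(omega, collection):
    """Smallest algebra of subsets of a finite set omega containing
    every set in `collection`, found by repeated closure."""
    omega = frozenset(omega)
    algebra = {frozenset(), omega} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        for a in list(algebra):
            # Candidates: the complement of a and its unions with known sets.
            candidates = [omega - a] + [a | b for b in algebra]
            for c in candidates:
                if c not in algebra:
                    algebra.add(c)
                    changed = True
    return algebra


# Illustrative collections on a four-point space (assumptions).
ALG_ONE = generated_algebra({1, 2, 3, 4}, [{1}])
ALG_TWO = generated_algebra({1, 2, 3, 4}, [{1}, {2}])
```

Closure under intersections is not imposed explicitly; it follows from closure under complements and unions by De Morgan's laws. For a single generating set $A$ the result is the four-element system $\{A, \bar{A}, \varnothing, \Omega\}$.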

Remark. The algebra $\alpha(\mathscr{E})$ (or $\sigma(\mathscr{E})$, respectively) is often referred to as the smallest algebra (or $\sigma$-algebra) generated by $\mathscr{E}$.

We often need to know what additional conditions will make an algebra, or some other system of sets, into a $\sigma$-algebra. We shall present several results of this kind.

Definition 1. A collection $\mathscr{M}$ of subsets of $\Omega$ is a monotonic class if $A_n \in \mathscr{M}$, $n = 1, 2, \ldots$, together with $A_n \uparrow A$ or $A_n \downarrow A$, implies that $A \in \mathscr{M}$.

Let $\mathscr{E}$ be a system of sets. Let $\mu(\mathscr{E})$ be the smallest monotonic class containing $\mathscr{E}$. (The proof of the existence of this class is like the proof of Lemma 1.)

Lemma 2. A necessary and sufficient condition for an algebra $\mathscr{A}$ to be a $\sigma$-algebra is that it is a monotonic class.

PROOF. A $\sigma$-algebra is evidently a monotonic class. Now let $\mathscr{A}$ be a monotonic class and $A_n \in \mathscr{A}$, $n = 1, 2, \ldots$. It is clear that $B_n = \bigcup_{i=1}^{n} A_i \in \mathscr{A}$ and $B_n \subseteq B_{n+1}$. Consequently, by the definition of a monotonic class, $B_n \uparrow \bigcup_{i=1}^{\infty} A_i \in \mathscr{A}$. Similarly we could show that $\bigcap_{i=1}^{\infty} A_i \in \mathscr{A}$.

By using this lemma, we can prove that, starting with an algebra $\mathscr{A}$, we can construct the $\sigma$-algebra $\sigma(\mathscr{A})$ by means of monotonic limiting processes.


2. Algebras and a-Algebras. Measurable Spaces

Theorem 1. Let $\mathscr{A}$ be an algebra. Then

$$\mu(\mathscr{A}) = \sigma(\mathscr{A}). \tag{1}$$

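PROOF. By Lemma 2, $\mu(\mathscr{A}) \subseteq \sigma(\mathscr{A})$. Hence it is enough to show that $\mu(\mathscr{A})$ is a $\sigma$-algebra. But $\mathscr{M} = \mu(\mathscr{A})$ is a monotonic class, and therefore, by Lemma 2 again, it is enough to show that $\mu(\mathscr{A})$ is an algebra.

Let $A \in \mathscr{M}$; we show that $\bar{A} \in \mathscr{M}$. For this purpose, we shall apply a principle that will often be used in the future, the principle of appropriate sets, which we now illustrate.

Let

$$\tilde{\mathscr{M}} = \{B : B \in \mathscr{M},\ \bar{B} \in \mathscr{M}\}$$

be the sets that have the property that concerns us. It is evident that $\mathscr{A} \subseteq \tilde{\mathscr{M}} \subseteq \mathscr{M}$. Let us show that $\tilde{\mathscr{M}}$ is a monotonic class.

Let $B_n \in \tilde{\mathscr{M}}$; then $B_n \in \mathscr{M}$, $\bar{B}_n \in \mathscr{M}$, and therefore

$$\lim \uparrow B_n \in \mathscr{M}, \quad \lim \uparrow \bar{B}_n \in \mathscr{M}, \quad \lim \downarrow B_n \in \mathscr{M}, \quad \lim \downarrow \bar{B}_n \in \mathscr{M}.$$

Consequently

$$\overline{\lim \uparrow B_n} = \lim \downarrow \bar{B}_n \in \mathscr{M}, \qquad \overline{\lim \downarrow B_n} = \lim \uparrow \bar{B}_n \in \mathscr{M},$$

$$\overline{\lim \uparrow \bar{B}_n} = \lim \downarrow B_n \in \mathscr{M}, \qquad \overline{\lim \downarrow \bar{B}_n} = \lim \uparrow B_n \in \mathscr{M},$$

and therefore $\tilde{\mathscr{M}}$ is a monotonic class. But $\tilde{\mathscr{M}} \subseteq \mathscr{M}$ and $\mathscr{M}$ is the smallest monotonic class. Therefore $\tilde{\mathscr{M}} = \mathscr{M}$, and if $A \in \mathscr{M} = \mu(\mathscr{A})$, then we also have $\bar{A} \in \mathscr{M}$, i.e. $\mathscr{M}$ is closed under the operation of taking complements.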


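Let us now show that $\mathscr{M}$ is closed under intersections.

Let $A \in \mathscr{M}$ and

$$\mathscr{M}_A = \{B : B \in \mathscr{M},\ A \cap B \in \mathscr{M}\}.$$

From the equations

$$\lim \downarrow (A \cap B_n) = A \cap \lim \downarrow B_n, \qquad \lim \uparrow (A \cap B_n) = A \cap \lim \uparrow B_n$$

it follows that $\mathscr{M}_A$ is a monotonic class.

Moreover, it is easily verified that

$$(A \in \mathscr{M}_B) \Leftrightarrow (B \in \mathscr{M}_A). \tag{2}$$

Now let $A \in \mathscr{A}$; then since $\mathscr{A}$ is an algebra, for every $B \in \mathscr{A}$ the set $A \cap B \in \mathscr{A}$ and therefore

$$\mathscr{A} \subseteq \mathscr{M}_A \subseteq \mathscr{M}.$$

But $\mathscr{M}_A$ is a monotonic class (since $\lim \uparrow AB_n = A \lim \uparrow B_n$ and $\lim \downarrow AB_n = A \lim \downarrow B_n$), and $\mathscr{M}$ is the smallest monotonic class. Therefore $\mathscr{M}_A = \mathscr{M}$ for all $A \in \mathscr{A}$. But then it follows from (2) that

$$(A \in \mathscr{M}_B) \Leftrightarrow (B \in \mathscr{M}_A = \mathscr{M})$$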


II. Mathematical Foundations of Probability Theory

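whenever $A \in \mathscr{A}$ and $B \in \mathscr{M}$. Consequently if $A \in \mathscr{A}$ then

$$A \in \mathscr{M}_B$$

for every $B \in \mathscr{M}$. Since $A$ is any set in $\mathscr{A}$, it follows that

$$\mathscr{A} \subseteq \mathscr{M}_B \subseteq \mathscr{M}.$$

Therefore for every $B \in \mathscr{M}$

$$\mathscr{M}_B = \mathscr{M},$$

i.e. if $B \in \mathscr{M}$ and $C \in \mathscr{M}$ then $C \cap B \in \mathscr{M}$.

Thus $\mathscr{M}$ is closed under complementation and intersection (and therefore under unions). Consequently $\mathscr{M}$ is an algebra, and the theorem is established.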

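Definition 2. Let $\Omega$ be a space. A class $\mathscr{D}$ of subsets of $\Omega$ is a $d$-system if

(a) $\Omega \in \mathscr{D}$;
(b) $A, B \in \mathscr{D}$, $A \subseteq B \Rightarrow B \setminus A \in \mathscr{D}$;
(c) $A_n \in \mathscr{D}$, $A_n \subseteq A_{n+1} \Rightarrow \bigcup A_n \in \mathscr{D}$.

If $\mathscr{E}$ is a collection of sets then $d(\mathscr{E})$ denotes the smallest $d$-system containing $\mathscr{E}$.

Theorem 2. If the collection $\mathscr{E}$ of sets is closed under intersections, then

$$d(\mathscr{E}) = \sigma(\mathscr{E}). \tag{3}$$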

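PROOF. Every $\sigma$-algebra is a $d$-system, and consequently $d(\mathscr{E}) \subseteq \sigma(\mathscr{E})$. Hence if we prove that $d(\mathscr{E})$ is closed under intersections, $d(\mathscr{E})$ must be a $\sigma$-algebra and then, of course, the opposite inclusion $\sigma(\mathscr{E}) \subseteq d(\mathscr{E})$ is valid.

The proof once again uses the principle of appropriate sets.

Let

$$\mathscr{E}_1 = \{B \in d(\mathscr{E}) : B \cap A \in d(\mathscr{E}) \text{ for all } A \in \mathscr{E}\}.$$

If $B \in \mathscr{E}$ then $B \cap A \in \mathscr{E}$ for all $A \in \mathscr{E}$ and therefore $\mathscr{E} \subseteq \mathscr{E}_1$. But $\mathscr{E}_1$ is a $d$-system. Hence $d(\mathscr{E}) \subseteq \mathscr{E}_1$. On the other hand, $\mathscr{E}_1 \subseteq d(\mathscr{E})$ by definition. Consequently

$$\mathscr{E}_1 = d(\mathscr{E}).$$

Now let

$$\mathscr{E}_2 = \{B \in d(\mathscr{E}) : B \cap A \in d(\mathscr{E}) \text{ for all } A \in d(\mathscr{E})\}.$$

Again it is easily verified that $\mathscr{E}_2$ is a $d$-system. If $B \in \mathscr{E}$, then by the definition of $\mathscr{E}_1$ we obtain that $B \cap A \in d(\mathscr{E})$ for all $A \in \mathscr{E}_1 = d(\mathscr{E})$. Consequently $\mathscr{E} \subseteq \mathscr{E}_2$ and $d(\mathscr{E}) \subseteq \mathscr{E}_2$. But $d(\mathscr{E}) \supseteq \mathscr{E}_2$; hence $d(\mathscr{E}) = \mathscr{E}_2$, and therefore whenever $A$ and $B$ are in $d(\mathscr{E})$, the set $A \cap B$ also belongs to $d(\mathscr{E})$, i.e. $d(\mathscr{E})$ is closed under intersections.

This completes the proof of the theorem.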
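Theorem 2 can be checked by brute force on a small finite space. The sketch below is an illustration only, with helper names (`sigma_algebra`, `d_system`) of our own choosing. On a finite $\Omega$ condition (c) holds automatically, so closing under proper differences is enough to obtain $d(\mathscr{E})$. The second example shows that the hypothesis of closure under intersections cannot simply be dropped.

```python
def sigma_algebra(omega, gens):
    """Close gens under complement and union (sufficient on a finite omega)."""
    omega = frozenset(omega)
    fam = {frozenset(), omega} | {frozenset(g) for g in gens}
    while True:
        new = {omega - a for a in fam} | {a | b for a in fam for b in fam}
        if new <= fam:
            return fam
        fam |= new

def d_system(omega, gens):
    """Close gens (together with omega) under differences B - A for A inside B."""
    omega = frozenset(omega)
    fam = {omega} | {frozenset(g) for g in gens}
    while True:
        new = {b - a for a in fam for b in fam if a <= b}
        if new <= fam:
            return fam
        fam |= new

omega = {1, 2, 3, 4}
closed = [{1}, {1, 2}, {1, 2, 3}]   # closed under intersections
not_closed = [{1, 2}, {2, 3}]       # {1,2} & {2,3} = {2} is missing

print(d_system(omega, closed) == sigma_algebra(omega, closed))
print(len(d_system(omega, not_closed)), len(sigma_algebra(omega, not_closed)))
```

In the second example the $d$-system contains only $\varnothing$, $\Omega$, the two generators and their complements, while the $\sigma$-algebra consists of all subsets of $\Omega$.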
We next consider some measurable spaces $(\Omega, \mathscr{F})$ which are extremely important for probability theory.

2. The measurable space $(R, \mathscr{B}(R))$. Let $R = (-\infty, \infty)$ be the real line and

$$(a, b] = \{x \in R : a < x \le b\}$$

for all $a$ and $b$, $-\infty \le a < b < \infty$. The interval $(a, \infty]$ is taken to be $(a, \infty)$. (This convention is required if the complement of an interval $(-\infty, b]$ is to be an interval of the same form, i.e. open on the left and closed on the right.)

Let $\mathscr{A}$ be the system of subsets of $R$ which are finite sums of disjoint intervals of the form $(a, b]$:

$$A \in \mathscr{A} \quad \text{if} \quad A = \sum_{i=1}^{n} (a_i, b_i], \qquad n < \infty.$$

It is easily verified that this system of sets, in which we also include the empty set $\varnothing$, is an algebra. However, it is not a $\sigma$-algebra, since if $A_n = (0, 1 - 1/n] \in \mathscr{A}$, we have $\bigcup_n A_n = (0, 1) \notin \mathscr{A}$.

Let $\mathscr{B}(R)$ be the smallest $\sigma$-algebra $\sigma(\mathscr{A})$ containing $\mathscr{A}$. This $\sigma$-algebra, which plays an important role in analysis, is called the Borel algebra of subsets of the real line, and its sets are called Borel sets.

If $\mathscr{I}$ is the system of intervals $I$ of the form $(a, b]$, and $\sigma(\mathscr{I})$ is the smallest $\sigma$-algebra containing $\mathscr{I}$, it is easily verified that $\sigma(\mathscr{I})$ is the Borel algebra. In other words, we can obtain the Borel algebra from $\mathscr{I}$ without going through the algebra $\mathscr{A}$, since $\sigma(\mathscr{I}) = \sigma(\alpha(\mathscr{I}))$.
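The claim that finite sums of intervals $(a, b]$ form an algebra can be made concrete. The following sketch is an illustration only: the list-of-pairs representation and the names `normalize`, `complement`, `union` and `intersection` are ours. It stores a set of the algebra as a sorted list of disjoint half-open intervals and shows that complements and unions stay within this representation, using the convention $(a, \infty] = (a, \infty)$.

```python
import math

def normalize(intervals):
    """Merge intervals (a, b] into a sorted list of disjoint, non-adjacent ones."""
    out = []
    for a, b in sorted(i for i in intervals if i[0] < i[1]):
        if out and a <= out[-1][1]:
            # (a, b] overlaps or touches the previous interval: merge them.
            out[-1] = (out[-1][0], max(out[-1][1], b))
        else:
            out.append((a, b))
    return out

def complement(intervals):
    """Complement in R; the pieces are again of the form (a, b]."""
    out, left = [], -math.inf
    for a, b in normalize(intervals):
        if left < a:
            out.append((left, a))
        left = b
    if left < math.inf:
        out.append((left, math.inf))
    return out

def union(x, y):
    return normalize(x + y)

def intersection(x, y):
    # De Morgan: the intersection is the complement of the union of complements.
    return complement(union(complement(x), complement(y)))

A = [(0, 1), (2, 3)]
print(complement(A))
print(intersection([(0, 2)], [(1, 3)]))
```

No finite list of this kind can represent $(0, 1)$, which is the content of the remark that $\mathscr{A}$ is not a $\sigma$-algebra.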
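We observe that

$$(a, b) = \bigcup_{n=1}^{\infty} \left(a, b - \tfrac{1}{n}\right], \quad a < b,$$

$$[a, b] = \bigcap_{n=1}^{\infty} \left(a - \tfrac{1}{n}, b\right], \quad a < b,$$

$$\{a\} = \bigcap_{n=1}^{\infty} \left(a - \tfrac{1}{n}, a\right].$$

Thus the Borel algebra contains not only intervals $(a, b]$ but also the singletons $\{a\}$ and all sets of the six forms

$$(a, b), \quad [a, b], \quad [a, b), \quad (-\infty, b), \quad (-\infty, b], \quad (a, \infty). \tag{4}$$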


Let us also notice that the construction of $\mathscr{B}(R)$ could have been based on any of the six kinds of intervals instead of on $(a, b]$, since all the minimal $\sigma$-algebras generated by systems of intervals of any of the forms (4) are the same as $\mathscr{B}(R)$.

Sometimes it is useful to deal with the $\sigma$-algebra $\mathscr{B}(\bar{R})$ of subsets of the extended real line $\bar{R} = [-\infty, \infty]$. This is the smallest $\sigma$-algebra generated by intervals of the form

$$(a, b] = \{x \in \bar{R} : a < x \le b\}, \qquad -\infty \le a < b \le \infty,$$

where $(-\infty, b]$ is to stand for the set $\{x \in \bar{R} : -\infty \le x \le b\}$.

Remark 1. The measurable space $(R, \mathscr{B}(R))$ is often denoted by $(R, \mathscr{B})$ or $(R^1, \mathscr{B}_1)$.

Remark 2. Let us introduce the metric

$$\rho_1(x, y) = \frac{|x - y|}{1 + |x - y|}$$

on the real line $R$ (this is equivalent to the usual metric $|x - y|$) and let $\mathscr{B}_0(R)$ be the smallest $\sigma$-algebra generated by the open sets $S_\rho(x^0) = \{x \in R : \rho_1(x, x^0) < \rho\}$, $\rho > 0$, $x^0 \in R$. Then $\mathscr{B}_0(R) = \mathscr{B}(R)$ (see Problem 7).
3. The measurable space $(R^n, \mathscr{B}(R^n))$. Let $R^n = R \times \cdots \times R$ be the direct, or Cartesian, product of $n$ copies of the real line, i.e. the set of ordered $n$-tuples $x = (x_1, \dots, x_n)$, where $-\infty < x_k < \infty$, $k = 1, \dots, n$. The set

$$I = I_1 \times \cdots \times I_n,$$

where $I_k = (a_k, b_k]$, i.e. the set $\{x \in R^n : x_k \in I_k,\ k = 1, \dots, n\}$, is called a rectangle, and $I_k$ is a side of the rectangle. Let $\mathscr{I}$ be the set of all rectangles $I$. The smallest $\sigma$-algebra $\sigma(\mathscr{I})$ generated by the system $\mathscr{I}$ is the Borel algebra of subsets of $R^n$ and is denoted by $\mathscr{B}(R^n)$. Let us show that we can arrive at this Borel algebra by starting in a different way.

Instead of the rectangles $I = I_1 \times \cdots \times I_n$ let us consider the rectangles $B = B_1 \times \cdots \times B_n$ with Borel sides ($B_k$ is the Borel subset of the real line that appears in the $k$th place in the direct product $R \times \cdots \times R$). The smallest $\sigma$-algebra containing all rectangles with Borel sides is denoted by

$$\mathscr{B}(R) \otimes \cdots \otimes \mathscr{B}(R)$$

and called the direct product of the $\sigma$-algebras $\mathscr{B}(R)$. Let us show that in fact

$$\mathscr{B}(R^n) = \mathscr{B}(R) \otimes \cdots \otimes \mathscr{B}(R).$$

In other words, the smallest $\sigma$-algebras generated by the rectangles $I = I_1 \times \cdots \times I_n$ and by the (broader) class of rectangles $B = B_1 \times \cdots \times B_n$ with Borel sides are actually the same.

The proof depends on the following proposition.

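Lemma 3. Let $\mathscr{E}$ be a class of subsets of $\Omega$, let $B \subseteq \Omega$, and define

$$\mathscr{E} \cap B = \{A \cap B : A \in \mathscr{E}\}. \tag{5}$$

Then

$$\sigma(\mathscr{E} \cap B) = \sigma(\mathscr{E}) \cap B. \tag{6}$$

PROOF. Since $\mathscr{E} \subseteq \sigma(\mathscr{E})$, we have

$$\mathscr{E} \cap B \subseteq \sigma(\mathscr{E}) \cap B. \tag{7}$$

But $\sigma(\mathscr{E}) \cap B$ is a $\sigma$-algebra; hence it follows from (7) that

$$\sigma(\mathscr{E} \cap B) \subseteq \sigma(\mathscr{E}) \cap B.$$

To prove the conclusion in the opposite direction, we again use the principle of appropriate sets.

Define

$$\mathscr{C}_B = \{A \in \sigma(\mathscr{E}) : A \cap B \in \sigma(\mathscr{E} \cap B)\}.$$

Since $\sigma(\mathscr{E})$ and $\sigma(\mathscr{E} \cap B)$ are $\sigma$-algebras, $\mathscr{C}_B$ is also a $\sigma$-algebra, and evidently

$$\mathscr{E} \subseteq \mathscr{C}_B \subseteq \sigma(\mathscr{E}),$$

whence $\sigma(\mathscr{E}) \subseteq \sigma(\mathscr{C}_B) = \mathscr{C}_B \subseteq \sigma(\mathscr{E})$ and therefore $\sigma(\mathscr{E}) = \mathscr{C}_B$. Therefore

$$A \cap B \in \sigma(\mathscr{E} \cap B)$$

for every $A \in \sigma(\mathscr{E})$, and consequently $\sigma(\mathscr{E}) \cap B \subseteq \sigma(\mathscr{E} \cap B)$.

This completes the proof of the lemma.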
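Identity (6) is easy to test on a finite example. The following sketch is an illustration only. The helper `sigma` is our own brute-force closure under complement (relative to the given universe) and union. The point of the example is that on the left-hand side of (6) the $\sigma$-algebra is formed inside $B$, while on the right-hand side it is formed inside $\Omega$ and then cut down to $B$.

```python
def sigma(universe, gens):
    """Smallest family of subsets of universe containing gens and closed
    under complement (within universe) and union; universe is finite."""
    universe = frozenset(universe)
    fam = {frozenset(), universe} | {frozenset(g) for g in gens}
    while True:
        new = {universe - a for a in fam} | {a | b for a in fam for b in fam}
        if new <= fam:
            return fam
        fam |= new

# Hypothetical example data.
omega = {1, 2, 3, 4, 5}
E = [{1, 2}, {2, 3, 4}]
B = frozenset({2, 3, 5})

left = sigma(B, [set(e) & B for e in E])     # sigma(E cap B), formed inside B
right = {a & B for a in sigma(omega, E)}     # sigma(E) cap B
print(left == right, len(left))
```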
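Proof that $\mathscr{B}(R^n)$ and $\mathscr{B} \otimes \cdots \otimes \mathscr{B}$ are the same. This is obvious for $n = 1$. We now show that it is true for $n = 2$.

Since $\mathscr{B}(R^2) \subseteq \mathscr{B} \otimes \mathscr{B}$, it is enough to show that the Borel rectangle $B_1 \times B_2$ belongs to $\mathscr{B}(R^2)$.

Let $R^2 = R_1 \times R_2$, where $R_1$ and $R_2$ are the "first" and "second" real lines, $\tilde{\mathscr{B}}_1 = \mathscr{B}_1 \times R_2$, $\tilde{\mathscr{B}}_2 = R_1 \times \mathscr{B}_2$, where $\mathscr{B}_1 \times R_2$ (or $R_1 \times \mathscr{B}_2$) is the collection of sets of the form $B_1 \times R_2$ (or $R_1 \times B_2$), with $B_1 \in \mathscr{B}_1$ (or $B_2 \in \mathscr{B}_2$). Also let $\mathscr{I}_1$ and $\mathscr{I}_2$ be the sets of intervals in $R_1$ and $R_2$, and $\tilde{\mathscr{I}}_1 = \mathscr{I}_1 \times R_2$, $\tilde{\mathscr{I}}_2 = R_1 \times \mathscr{I}_2$. Then, writing $\tilde{B}_1 = B_1 \times R_2$ and $\tilde{B}_2 = R_1 \times B_2$, we have by (6)

$$B_1 \times B_2 = \tilde{B}_1 \cap \tilde{B}_2 \in \tilde{\mathscr{B}}_1 \cap \tilde{B}_2 = \sigma(\tilde{\mathscr{I}}_1) \cap \tilde{B}_2 = \sigma(\tilde{\mathscr{I}}_1 \cap \tilde{B}_2) \subseteq \sigma(\tilde{\mathscr{I}}_1 \cap \tilde{\mathscr{I}}_2) = \sigma(\mathscr{I}_1 \times \mathscr{I}_2),$$

as was to be proved.

The case of any $n$, $n > 2$, can be discussed in the same way.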

Remark. Let $\mathscr{B}_0(R^n)$ be the smallest $\sigma$-algebra generated by the open sets

$$S_\rho(x^0) = \{x \in R^n : \rho_n(x, x^0) < \rho\}, \qquad x^0 \in R^n, \quad \rho > 0,$$

in the metric

$$\rho_n(x, x^0) = \sum_{k=1}^{n} 2^{-k} \rho_1(x_k, x_k^0),$$

where $x = (x_1, \dots, x_n)$, $x^0 = (x_1^0, \dots, x_n^0)$.

Then $\mathscr{B}_0(R^n) = \mathscr{B}(R^n)$ (Problem 7).

4. The measurable space $(R^\infty, \mathscr{B}(R^\infty))$ plays a significant role in probability theory, since it is used as the basis for constructing probabilistic models of experiments with infinitely many steps.

The space $R^\infty$ is the space of ordered sequences of numbers,

$$x = (x_1, x_2, \dots), \qquad -\infty < x_k < \infty, \quad k = 1, 2, \dots$$

Let $I_k$ and $B_k$ denote, respectively, the intervals $(a_k, b_k]$ and the Borel subsets of the $k$th line (with coordinate $x_k$). We consider the cylinder sets

$$\mathscr{J}(I_1 \times \cdots \times I_n) = \{x : x = (x_1, x_2, \dots),\ x_1 \in I_1, \dots, x_n \in I_n\}, \tag{8}$$

$$\mathscr{J}(B_1 \times \cdots \times B_n) = \{x : x = (x_1, x_2, \dots),\ x_1 \in B_1, \dots, x_n \in B_n\}, \tag{9}$$

$$\mathscr{J}(B^n) = \{x : (x_1, \dots, x_n) \in B^n\}, \tag{10}$$

where $B^n$ is a Borel set in $\mathscr{B}(R^n)$. Each cylinder $\mathscr{J}(B_1 \times \cdots \times B_n)$, or $\mathscr{J}(B^n)$, can also be thought of as a cylinder with base in $R^{n+1}, R^{n+2}, \dots$, since

$$\mathscr{J}(B_1 \times \cdots \times B_n) = \mathscr{J}(B_1 \times \cdots \times B_n \times R),$$

$$\mathscr{J}(B^n) = \mathscr{J}(B^{n+1}),$$

where $B^{n+1} = B^n \times R$.

It follows that both systems of cylinders $\mathscr{J}(B_1 \times \cdots \times B_n)$ and $\mathscr{J}(B^n)$ are algebras. It is easy to verify that the unions of disjoint cylinders

$$\mathscr{J}(I_1 \times \cdots \times I_n)$$

also form an algebra. Let $\mathscr{B}(R^\infty)$, $\mathscr{B}_1(R^\infty)$ and $\mathscr{B}_2(R^\infty)$ be the smallest $\sigma$-algebras containing all the sets (8), (9) or (10), respectively. (The $\sigma$-algebra $\mathscr{B}_1(R^\infty)$ is often denoted by $\mathscr{B}(R) \otimes \mathscr{B}(R) \otimes \cdots$.) It is clear that $\mathscr{B}(R^\infty) \subseteq \mathscr{B}_1(R^\infty) \subseteq \mathscr{B}_2(R^\infty)$. As a matter of fact, all three $\sigma$-algebras are the same.
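To prove this, we put

$$\mathscr{C}_n = \{A \subseteq R^n : \{x : (x_1, \dots, x_n) \in A\} \in \mathscr{B}(R^\infty)\}$$

for $n = 1, 2, \dots$. Let $B^n \in \mathscr{B}(R^n)$. Then

$$B^n \in \mathscr{C}_n \subseteq \mathscr{B}(R^\infty).$$

But $\mathscr{C}_n$ is a $\sigma$-algebra, and therefore

$$\mathscr{B}(R^n) \subseteq \sigma(\mathscr{C}_n) = \mathscr{C}_n;$$

consequently

$$\mathscr{B}_2(R^\infty) \subseteq \mathscr{B}(R^\infty).$$

Thus $\mathscr{B}(R^\infty) = \mathscr{B}_1(R^\infty) = \mathscr{B}_2(R^\infty)$.

From now on we shall describe sets in $\mathscr{B}(R^\infty)$ as Borel sets (in $R^\infty$).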

Remark. Let $\mathscr{B}_0(R^\infty)$ be the smallest $\sigma$-algebra generated by the open sets

$$S_\rho(x^0) = \{x \in R^\infty : \rho_\infty(x, x^0) < \rho\}, \qquad \rho > 0, \quad x^0 \in R^\infty,$$

in the metric

$$\rho_\infty(x, x^0) = \sum_{k=1}^{\infty} 2^{-k} \rho_1(x_k, x_k^0),$$

where $x = (x_1, x_2, \dots)$, $x^0 = (x_1^0, x_2^0, \dots)$. Then

$$\mathscr{B}(R^\infty) = \mathscr{B}_0(R^\infty)$$

(Problem 7).

Here are some examples of Borel sets in $R^\infty$:

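(a) $\{x \in R^\infty : \sup x_n > a\}$, $\{x \in R^\infty : \inf x_n < a\}$;
(b) $\{x \in R^\infty : \overline{\lim}\, x_n \le a\}$, $\{x \in R^\infty : \underline{\lim}\, x_n > a\}$, where, as usual,

$$\overline{\lim}\, x_n = \inf_n \sup_{m \ge n} x_m, \qquad \underline{\lim}\, x_n = \sup_n \inf_{m \ge n} x_m;$$

(c) $\{x \in R^\infty : x_n \to\}$, the set of $x \in R^\infty$ for which $\lim x_n$ exists and is finite;
(d) $\{x \in R^\infty : \lim x_n > a\}$;
(e) $\{x \in R^\infty : \sum_{n=1}^{\infty} |x_n| > a\}$;
(f) $\{x \in R^\infty : \sum_{k=1}^{n} x_k = 0 \text{ for at least one } n \ge 1\}$.

To be convinced, for example, that sets in (a) belong to the system $\mathscr{B}(R^\infty)$, it is enough to observe that

$$\{x : \sup x_n > a\} = \bigcup_n \{x : x_n > a\} \in \mathscr{B}(R^\infty),$$

$$\{x : \inf x_n < a\} = \bigcup_n \{x : x_n < a\} \in \mathscr{B}(R^\infty).$$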

5. The measurable space $(R^T, \mathscr{B}(R^T))$, where $T$ is an arbitrary set. The space $R^T$ is the collection of real functions $x = (x_t)$ defined for $t \in T$.† In general we shall be interested in the case when $T$ is an uncountable subset of the real line. For simplicity and definiteness we shall suppose for the present that $T = [0, \infty)$.

† We shall also use the notations $x = (x_t)_{t \in T}$ and $x = (x_t)$, $t \in T$, for elements of $R^T$.

We shall consider three types of cylinder sets

$$\mathscr{J}_{t_1, \dots, t_n}(I_1 \times \cdots \times I_n) = \{x : x_{t_1} \in I_1, \dots, x_{t_n} \in I_n\}, \tag{11}$$

$$\mathscr{J}_{t_1, \dots, t_n}(B_1 \times \cdots \times B_n) = \{x : x_{t_1} \in B_1, \dots, x_{t_n} \in B_n\}, \tag{12}$$

$$\mathscr{J}_{t_1, \dots, t_n}(B^n) = \{x : (x_{t_1}, \dots, x_{t_n}) \in B^n\}, \tag{13}$$

where $I_k$ is a set of the form $(a_k, b_k]$, $B_k$ is a Borel set on the line, and $B^n$ is a Borel set in $R^n$.

The set $\mathscr{J}_{t_1, \dots, t_n}(I_1 \times \cdots \times I_n)$ is just the set of functions that, at times $t_1, \dots, t_n$, "get through the windows" $I_1, \dots, I_n$ and at other times have arbitrary values (Figure 24).

Let $\mathscr{B}(R^T)$, $\mathscr{B}_1(R^T)$ and $\mathscr{B}_2(R^T)$ be the smallest $\sigma$-algebras corresponding respectively to the cylinder sets (11), (12) and (13). It is clear that

$$\mathscr{B}(R^T) \subseteq \mathscr{B}_1(R^T) \subseteq \mathscr{B}_2(R^T). \tag{14}$$

As a matter of fact, all three of these $\sigma$-algebras are the same. Moreover, we can give a complete description of the structure of their sets.

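Theorem 3. Let $T$ be any uncountable set. Then $\mathscr{B}(R^T) = \mathscr{B}_1(R^T) = \mathscr{B}_2(R^T)$, and every set $A \in \mathscr{B}(R^T)$ has the following structure: there are a countable set of points $t_1, t_2, \dots$ of $T$ and a Borel set $B$ in $\mathscr{B}(R^\infty)$ such that

$$A = \{x : (x_{t_1}, x_{t_2}, \dots) \in B\}. \tag{15}$$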
PROOF. Let $\mathscr{E}$ denote the collection of sets of the form (15) (for various aggregates $(t_1, t_2, \dots)$ and Borel sets $B$ in $\mathscr{B}(R^\infty)$). If $A_1, A_2, \dots \in \mathscr{E}$ and the corresponding aggregates are $T^{(1)} = (t_1^{(1)}, t_2^{(1)}, \dots)$, $T^{(2)} = (t_1^{(2)}, t_2^{(2)}, \dots)$, $\dots$, then the set $T^{(\infty)} = \bigcup_k T^{(k)}$ can be taken as a basis, so that every $A_i$ has a representation

$$A_i = \{x : (x_{\tau_1}, x_{\tau_2}, \dots) \in B_i\},$$

where $B_i$ is a set in one and the same $\sigma$-algebra $\mathscr{B}(R^\infty)$, and $\tau_i \in T^{(\infty)}$.

Hence it follows that the system $\mathscr{E}$ is a $\sigma$-algebra. Clearly this $\sigma$-algebra contains all cylinder sets of the form (13) and, since $\mathscr{B}_2(R^T)$ is the smallest $\sigma$-algebra containing these sets, and since we have (14), we obtain

$$\mathscr{B}(R^T) \subseteq \mathscr{B}_1(R^T) \subseteq \mathscr{B}_2(R^T) \subseteq \mathscr{E}. \tag{16}$$

Let us consider a set $A$ from $\mathscr{E}$, represented in the form (15). For a given aggregate $(t_1, t_2, \dots)$, the same reasoning as for the space $(R^\infty, \mathscr{B}(R^\infty))$ shows that $A$ is an element of the $\sigma$-algebra generated by the cylinder sets (11). But this $\sigma$-algebra evidently belongs to the $\sigma$-algebra $\mathscr{B}(R^T)$; together with (16), this establishes both conclusions of the theorem.

[Figure 24]
Thus every Borel set $A$ in the $\sigma$-algebra $\mathscr{B}(R^T)$ is determined by restrictions imposed on the functions $x = (x_t)$, $t \in T$, on an at most countable set of points $t_1, t_2, \dots$. Hence it follows, in particular, that the sets

$$A_1 = \{x : \sup x_t < C \text{ for all } t \in [0, 1]\},$$

$$A_2 = \{x : x_t = 0 \text{ for at least one } t \in [0, 1]\},$$

$$A_3 = \{x : x_t \text{ is continuous at a given point } t_0 \in [0, 1]\},$$

which depend on the behavior of the function on an uncountable set of points, cannot be Borel sets. And indeed none of these three sets belongs to $\mathscr{B}(R^{[0,1]})$.

Let us establish this for $A_1$. If $A_1 \in \mathscr{B}(R^{[0,1]})$, then by our theorem there are a point $(t_1^0, t_2^0, \dots)$ and a set $B^0 \in \mathscr{B}(R^\infty)$ such that

$$\{x : \sup_t x_t < C,\ t \in [0, 1]\} = \{x : (x_{t_1^0}, x_{t_2^0}, \dots) \in B^0\}.$$

It is clear that the function $y_t \equiv C - 1$ belongs to $A_1$, and consequently $(y_{t_1^0}, y_{t_2^0}, \dots) \in B^0$. Now form the function

$$z_t = \begin{cases} C - 1, & t \in (t_1^0, t_2^0, \dots), \\ C + 1, & t \notin (t_1^0, t_2^0, \dots). \end{cases}$$

It is clear that

$$(y_{t_1^0}, y_{t_2^0}, \dots) = (z_{t_1^0}, z_{t_2^0}, \dots),$$

and consequently the function $z = (z_t)$ belongs to the set $\{x : (x_{t_1^0}, x_{t_2^0}, \dots) \in B^0\}$. But at the same time it is clear that it does not belong to the set $\{x : \sup x_t < C\}$. This contradiction shows that $A_1 \notin \mathscr{B}(R^{[0,1]})$.


Since the sets $A_1$, $A_2$ and $A_3$ are nonmeasurable with respect to the $\sigma$-algebra $\mathscr{B}(R^{[0,1]})$ in the space of all functions $x = (x_t)$, $t \in [0, 1]$, it is natural to consider a smaller class of functions for which these sets are measurable. It is intuitively clear that this will be the case if we take the initial space to be, for example, the space of continuous functions.

6. The measurable space $(C, \mathscr{B}(C))$. Let $T = [0, 1]$ and let $C$ be the space of continuous functions $x = (x_t)$, $0 \le t \le 1$. This is a metric space with the metric $\rho(x, y) = \sup_{t \in T} |x_t - y_t|$. We introduce two $\sigma$-algebras in $C$: $\mathscr{B}(C)$ is the $\sigma$-algebra generated by the cylinder sets, and $\mathscr{B}_0(C)$ is generated by the open sets (open with respect to the metric $\rho(x, y)$). Let us show that in fact these $\sigma$-algebras are the same: $\mathscr{B}(C) = \mathscr{B}_0(C)$.

Let $B = \{x : x_{t_0} < b\}$ be a cylinder set. It is easy to see that this set is open. Hence it follows that $\{x : x_{t_1} < b_1, \dots, x_{t_n} < b_n\} \in \mathscr{B}_0(C)$, and therefore $\mathscr{B}(C) \subseteq \mathscr{B}_0(C)$.

Conversely, consider a set $B_\rho = \{y : y \in S_\rho(x^0)\}$ where $x^0$ is an element of $C$ and $S_\rho(x^0) = \{x \in C : \sup_{t \in T} |x_t - x_t^0| < \rho\}$ is an open ball with center at $x^0$. Since the functions in $C$ are continuous,

$$B_\rho = \{y \in C : y \in S_\rho(x^0)\} = \{y \in C : \max_t |y_t - x_t^0| < \rho\} = \bigcap_{t_k} \{y \in C : |y_{t_k} - x_{t_k}^0| < \rho\} \in \mathscr{B}(C), \tag{17}$$

where $t_k$ are the rational points of $[0, 1]$. Therefore $\mathscr{B}_0(C) \subseteq \mathscr{B}(C)$.

The following example is fundamental.

7. The measurable space $(D, \mathscr{B}(D))$, where $D$ is the space of functions $x = (x_t)$, $t \in [0, 1]$, that are continuous on the right ($x_t = x_{t+}$ for all $t < 1$) and have limits from the left (at every $t > 0$).

Just as for $C$, we can introduce a metric $d(x, y)$ on $D$ such that the $\sigma$-algebra $\mathscr{B}_0(D)$ generated by the open sets will coincide with the $\sigma$-algebra $\mathscr{B}(D)$ generated by the cylinder sets. This metric $d(x, y)$, which was introduced by Skorohod, is defined as follows:

$$d(x, y) = \inf\Big\{\varepsilon > 0 : \exists\, \lambda \in \Lambda : \sup_t |x_t - y_{\lambda(t)}| + \sup_t |t - \lambda(t)| \le \varepsilon\Big\}, \tag{18}$$

where $\Lambda$ is the set of strictly increasing functions $\lambda = \lambda(t)$ that are continuous on $[0, 1]$ and have $\lambda(0) = 0$, $\lambda(1) = 1$.

8. The measurable space $(\prod_{t \in T} \Omega_t, \boxtimes_{t \in T} \mathscr{F}_t)$. Along with the space $(R^T, \mathscr{B}(R^T))$, which is the direct product of $T$ copies of the real line together with the system of Borel sets, probability theory also uses the measurable space $(\prod_{t \in T} \Omega_t, \boxtimes_{t \in T} \mathscr{F}_t)$, which is defined in the following way.


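Let $T$ be any set of indices and $(\Omega_t, \mathscr{F}_t)$ a measurable space, $t \in T$. Let $\Omega = \prod_{t \in T} \Omega_t$, the set of functions $\omega = (\omega_t)$, $t \in T$, such that $\omega_t \in \Omega_t$ for each $t \in T$.

The collection of cylinder sets

$$\mathscr{J}_{t_1, \dots, t_n}(B_1 \times \cdots \times B_n) = \{\omega : \omega_{t_1} \in B_1, \dots, \omega_{t_n} \in B_n\},$$

where $B_i \in \mathscr{F}_{t_i}$, is easily shown to be an algebra. The smallest $\sigma$-algebra containing all these cylinder sets is denoted by $\boxtimes_{t \in T} \mathscr{F}_t$, and the measurable space $(\prod_{t \in T} \Omega_t, \boxtimes_{t \in T} \mathscr{F}_t)$ is called the direct product of the measurable spaces $(\Omega_t, \mathscr{F}_t)$, $t \in T$.

9. PROBLEMS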

1. Let $\mathscr{B}_1$ and $\mathscr{B}_2$ be $\sigma$-algebras of subsets of $\Omega$. Are the following systems of sets $\sigma$-algebras?

$$\mathscr{B}_1 \cap \mathscr{B}_2 = \{A : A \in \mathscr{B}_1 \text{ and } A \in \mathscr{B}_2\},$$

$$\mathscr{B}_1 \cup \mathscr{B}_2 = \{A : A \in \mathscr{B}_1 \text{ or } A \in \mathscr{B}_2\}.$$

2. Let $\mathscr{D} = \{D_1, D_2, \dots\}$ be a countable decomposition of $\Omega$ and $\mathscr{B} = \sigma(\mathscr{D})$. Are there also only countably many sets in $\mathscr{B}$?

3. Show that $\mathscr{B}(R^n) \otimes \mathscr{B}(R) = \mathscr{B}(R^{n+1})$.

4. Prove that the sets (b)-(f) (see Subsection 4) belong to $\mathscr{B}(R^\infty)$.

5. Prove that the sets $A_2$ and $A_3$ (see Subsection 5) do not belong to $\mathscr{B}(R^{[0,1]})$.

6. Prove that the function (18) actually defines a metric.

7. Prove that $\mathscr{B}_0(R^n) = \mathscr{B}(R^n)$, $n \ge 1$, and $\mathscr{B}_0(R^\infty) = \mathscr{B}(R^\infty)$.

8. Let $C = C[0, \infty)$ be the space of continuous functions $x = (x_t)$ defined for $t \ge 0$. Show that with the metric

$$\rho(x, y) = \sum_{n=1}^{\infty} 2^{-n} \min\Big[\sup_{0 \le t \le n} |x_t - y_t|,\ 1\Big], \qquad x, y \in C,$$

this is a complete separable metric space and that the $\sigma$-algebra $\mathscr{B}_0(C)$ generated by the open sets coincides with the $\sigma$-algebra $\mathscr{B}(C)$ generated by the cylinder sets.

3. Methods of Introducing Probability Measures on Measurable Spaces

1. The measurable space $(R, \mathscr{B}(R))$. Let $P = P(A)$ be a probability measure defined on the Borel subsets $A$ of the real line. Take $A = (-\infty, x]$ and put

$$F(x) = P(-\infty, x], \qquad x \in R. \tag{1}$$


This function has the following properties:

(1) $F(x)$ is nondecreasing;
(2) $F(-\infty) = 0$, $F(+\infty) = 1$, where

$$F(-\infty) = \lim_{x \downarrow -\infty} F(x), \qquad F(+\infty) = \lim_{x \uparrow \infty} F(x);$$

(3) $F(x)$ is continuous on the right and has a limit on the left at each $x \in R$.

The first property is evident, and the other two follow from the continuity properties of probability measures.

Definition 1. Every function $F = F(x)$ satisfying conditions (1)-(3) is called a distribution function (on the real line $R$).

Thus to every probability measure $P$ on $(R, \mathscr{B}(R))$ there corresponds (by (1)) a distribution function. It turns out that the converse is also true.

Theorem 1. Let $F = F(x)$ be a distribution function on the real line $R$. There exists a unique probability measure $P$ on $(R, \mathscr{B}(R))$ such that

$$P(a, b] = F(b) - F(a) \tag{2}$$

for all $a$, $b$, $-\infty \le a < b < \infty$.


PROOF. Let $\mathscr{A}$ be the algebra of the subsets $A$ of $R$ that are finite sums of disjoint intervals of the form $(a, b]$:

$$A = \sum_{k=1}^{n} (a_k, b_k].$$

On these sets we define a set function $P_0$ by putting

$$P_0(A) = \sum_{k=1}^{n} [F(b_k) - F(a_k)], \qquad A \in \mathscr{A}. \tag{3}$$

This formula defines, evidently uniquely, a finitely additive set function on $\mathscr{A}$. Therefore if we show that this function is also countably additive on this algebra, the existence and uniqueness of the required measure $P$ on $\mathscr{B}(R)$ will follow immediately from a general result of measure theory (which we quote without proof).
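Formula (3) translates directly into a computation. The sketch below is an illustration only. The two distribution functions and the name `p0` are our own choices, and a set of the algebra is assumed to be given as a list of disjoint pairs `(a, b)` standing for intervals $(a, b]$.

```python
import math

def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1]."""
    return min(max(x, 0.0), 1.0)

def exponential_cdf(x, rate=2.0):
    """Distribution function of an exponential law (the rate is a hypothetical choice)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def p0(intervals, F):
    """P_0 of a finite sum of disjoint intervals (a, b], following formula (3)."""
    return sum(F(b) - F(a) for a, b in intervals)

A = [(0.1, 0.3), (0.5, 0.9)]
print(p0(A, uniform_cdf))

# Finite additivity: splitting the whole line into two pieces gives total mass 1.
whole_line = [(-math.inf, 1.0), (1.0, math.inf)]
print(p0(whole_line, exponential_cdf))
```

The code only evaluates the finitely additive function $P_0$; the countable additivity needed for the extension is what the remainder of the proof establishes.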

Carathéodory's Theorem. Let $\Omega$ be a space, $\mathscr{A}$ an algebra of its subsets, and $\mathscr{B} = \sigma(\mathscr{A})$ the smallest $\sigma$-algebra containing $\mathscr{A}$. Let $\mu_0$ be a $\sigma$-finite measure on $(\Omega, \mathscr{A})$. Then there is a unique measure $\mu$ on $(\Omega, \sigma(\mathscr{A}))$ which is an extension of $\mu_0$, i.e. satisfies

$$\mu(A) = \mu_0(A), \qquad A \in \mathscr{A}.$$


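We are now to show that $P_0$ is countably additive on $\mathscr{A}$. By a theorem from §1 it is enough to show that $P_0$ is continuous at $\varnothing$, i.e. to verify that

$$P_0(A_n) \downarrow 0, \qquad A_n \downarrow \varnothing, \quad A_n \in \mathscr{A}.$$

Let $A_1, A_2, \dots$ be a sequence of sets from $\mathscr{A}$ with the property $A_n \downarrow \varnothing$. Let us suppose first that the sets $A_n$ belong to a closed interval $[-N, N]$, $N < \infty$. Since each $A_n$ is the sum of finitely many intervals of the form $(a, b]$ and since

$$P_0(a', b] = F(b) - F(a') \to F(b) - F(a) = P_0(a, b]$$

as $a' \downarrow a$, because $F(x)$ is continuous on the right, we can find, for every $A_n$, a set $B_n \in \mathscr{A}$ such that its closure $[B_n] \subseteq A_n$ and

$$P_0(A_n) - P_0(B_n) \le \varepsilon \cdot 2^{-n},$$

where $\varepsilon$ is a preassigned positive number.

By hypothesis, $\bigcap A_n = \varnothing$ and therefore $\bigcap [B_n] = \varnothing$. But the sets $[B_n]$ are closed, and therefore there is a finite $n_0 = n_0(\varepsilon)$ such that

$$\bigcap_{n=1}^{n_0} [B_n] = \varnothing. \tag{4}$$

(In fact, $[-N, N]$ is compact, and the collection of sets $\{[-N, N] \setminus [B_n]\}_{n \ge 1}$ is an open covering of this compact set. By the Heine-Borel theorem there is a finite subcovering:

$$\bigcup_{n=1}^{n_0} ([-N, N] \setminus [B_n]) = [-N, N]$$

and therefore $\bigcap_{n=1}^{n_0} [B_n] = \varnothing$.)

Using (4) and the inclusions $A_{n_0} \subseteq A_{n_0 - 1} \subseteq \cdots \subseteq A_1$, we obtain

$$P_0(A_{n_0}) = P_0\Big(A_{n_0} \setminus \bigcap_{k=1}^{n_0} B_k\Big) + P_0\Big(\bigcap_{k=1}^{n_0} B_k\Big) = P_0\Big(A_{n_0} \setminus \bigcap_{k=1}^{n_0} B_k\Big)$$

$$\le P_0\Big(\bigcup_{k=1}^{n_0} (A_k \setminus B_k)\Big) \le \sum_{k=1}^{n_0} P_0(A_k \setminus B_k) \le \sum_{k=1}^{n_0} \varepsilon \cdot 2^{-k} \le \varepsilon.$$

Therefore $P_0(A_n) \downarrow 0$, $n \to \infty$.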


We now abandon the assumption that $A_n \subseteq [-N, N]$ for some $N$. Take an $\varepsilon > 0$ and choose $N$ so that $P_0[-N, N] > 1 - \varepsilon/2$. Then, since

$$A_n = A_n \cap [-N, N] + A_n \cap \overline{[-N, N]},$$

we have

$$P_0(A_n) = P_0(A_n \cap [-N, N]) + P_0(A_n \cap \overline{[-N, N]}) \le P_0(A_n \cap [-N, N]) + \varepsilon/2$$

and, applying the preceding reasoning (replacing $A_n$ by $A_n \cap [-N, N]$), we find that $P_0(A_n \cap [-N, N]) \le \varepsilon/2$ for sufficiently large $n$. Hence once again $P_0(A_n) \downarrow 0$, $n \to \infty$. This completes the proof of the theorem.
Thus there is a one-to-one correspondence between probability measures $P$ on $(R, \mathscr{B}(R))$ and distribution functions $F$ on the real line $R$. The measure $P$ constructed from the function $F$ is usually called the Lebesgue-Stieltjes probability measure corresponding to the distribution function $F$.

The case when

$$F(x) = \begin{cases} 0, & x < 0, \\ x, & 0 \le x \le 1, \\ 1, & x > 1 \end{cases}$$

is particularly important. In this case the corresponding probability measure (denoted by $\lambda$) is Lebesgue measure on $[0, 1]$. Clearly $\lambda(a, b] = b - a$. In other words, the Lebesgue measure of $(a, b]$ (as well as of any of the intervals $(a, b)$, $[a, b]$ or $[a, b)$) is simply its length $b - a$.
Let

$$\mathscr{B}([0, 1]) = \{A \cap [0, 1] : A \in \mathscr{B}(R)\}$$

be the collection of Borel subsets of $[0, 1]$. It is often necessary to consider, besides these sets, the Lebesgue measurable subsets of $[0, 1]$. We say that a set $\Lambda \subseteq [0, 1]$ belongs to $\bar{\mathscr{B}}([0, 1])$ if there are Borel sets $A$ and $B$ such that $A \subseteq \Lambda \subseteq B$ and $\lambda(B \setminus A) = 0$. It is easily verified that $\bar{\mathscr{B}}([0, 1])$ is a $\sigma$-algebra. It is known as the system of Lebesgue measurable subsets of $[0, 1]$. Clearly $\mathscr{B}([0, 1]) \subseteq \bar{\mathscr{B}}([0, 1])$.

The measure $\lambda$, defined so far only for sets in $\mathscr{B}([0, 1])$, extends in a natural way to the system $\bar{\mathscr{B}}([0, 1])$ of Lebesgue measurable sets. Specifically, if $\Lambda \in \bar{\mathscr{B}}([0, 1])$ and $A \subseteq \Lambda \subseteq B$, where $A$ and $B \in \mathscr{B}([0, 1])$ and $\lambda(B \setminus A) = 0$, we define $\bar{\lambda}(\Lambda) = \lambda(A)$. The set function $\bar{\lambda} = \bar{\lambda}(\Lambda)$, $\Lambda \in \bar{\mathscr{B}}([0, 1])$, is easily seen to be a probability measure on $([0, 1], \bar{\mathscr{B}}([0, 1]))$. It is usually called Lebesgue measure (on the system of Lebesgue-measurable sets).

Remark. This process of completing (or extending) a measure can be applied, and is useful, in other situations. For example, let $(\Omega, \mathscr{F}, P)$ be a probability space. Let $\bar{\mathscr{F}}^P$ be the collection of all the subsets $A$ of $\Omega$ for which there are sets $B_1$ and $B_2$ of $\mathscr{F}$ such that $B_1 \subseteq A \subseteq B_2$ and $P(B_2 \setminus B_1) = 0$. The probability measure can be defined for sets $A \in \bar{\mathscr{F}}^P$ in a natural way (by $P(A) = P(B_1)$). The resulting probability space is the completion of $(\Omega, \mathscr{F}, P)$ with respect to $P$.

A probability measure such that $\bar{\mathscr{F}}^P = \mathscr{F}$ is called complete, and the corresponding space $(\Omega, \mathscr{F}, P)$ is a complete probability space.
The correspondence between probability measures $P$ and distribution functions $F$ established by the equation $P(a, b] = F(b) - F(a)$ makes it possible to construct various probability measures by obtaining the corresponding distribution functions.

[Figure 25: graph of a piecewise constant distribution function $F(x)$ with jumps $\Delta F(x_1)$, $\Delta F(x_2)$, $\Delta F(x_3)$ at the points $x_1$, $x_2$, $x_3$]

Discrete measures are measures $P$ for which the corresponding distributions $F = F(x)$ are piecewise constant (Figure 25), changing their values at the points $x_1, x_2, \dots$ ($\Delta F(x_i) > 0$, where $\Delta F(x) = F(x) - F(x-)$). In this case the measure is concentrated at the points $x_1, x_2, \dots$:

$$P(\{x_k\}) = \Delta F(x_k) > 0, \qquad \sum_k P(\{x_k\}) = 1.$$

The set of numbers $(p_1, p_2, \dots)$, where $p_k = P(\{x_k\})$, is called a discrete probability distribution and the corresponding distribution function $F = F(x)$ is called discrete.

We present a table of the commonest types of discrete probability distribution, with their names.
Table 1

Distribution         Probabilities $p_k$                                      Parameters
Discrete uniform     $1/N$, $k = 1, 2, \dots, N$                              $N = 1, 2, \dots$
Bernoulli            $p_1 = p$, $p_0 = q$                                     $0 \le p \le 1$, $q = 1 - p$
Binomial             $C_n^k p^k q^{n-k}$, $k = 0, 1, \dots, n$                $0 \le p \le 1$, $q = 1 - p$, $n = 1, 2, \dots$
Poisson              $e^{-\lambda} \lambda^k / k!$, $k = 0, 1, \dots$         $\lambda > 0$
Geometric            $q^{k-1} p$, $k = 1, 2, \dots$                           $0 \le p \le 1$, $q = 1 - p$
Negative binomial    $C_{k-1}^{r-1} p^r q^{k-r}$, $k = r, r + 1, \dots$       $0 \le p \le 1$, $q = 1 - p$, $r = 1, 2, \dots$
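As a sanity check on the entries of Table 1, each family of numbers $p_k$ should sum to 1. The following sketch is an illustration only. The parameter values and the truncation points of the infinite sums are arbitrary choices of ours, and the geometric law is taken on $k = 1, 2, \dots$, which is the range on which $q^{k-1} p$ sums to 1.

```python
import math

def binomial(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric(k, p):
    return (1 - p) ** (k - 1) * p

def negative_binomial(k, r, p):
    return math.comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

# Infinite sums are truncated where the remaining tail is negligible.
totals = {
    "binomial": sum(binomial(k, 10, 0.3) for k in range(0, 11)),
    "poisson": sum(poisson(k, 2.5) for k in range(0, 100)),
    "geometric": sum(geometric(k, 0.3) for k in range(1, 200)),
    "negative binomial": sum(negative_binomial(k, 3, 0.4) for k in range(3, 400)),
}
for name, total in totals.items():
    print(name, total)
```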

Absolutely continuous measures. These are measures for which the corresponding distribution functions are such that

$$F(x) = \int_{-\infty}^{x} f(t)\, dt, \tag{5}$$

where $f = f(t)$ are nonnegative functions and the integral is at first taken in the Riemann sense, but later (see §6) in that of Lebesgue.

The function $f = f(x)$, $x \in R$, is the density of the distribution function $F = F(x)$ (or the density of the probability distribution, or simply the density) and $F = F(x)$ is called absolutely continuous.

It is clear that every nonnegative $f = f(x)$ that is Riemann integrable and such that $\int_{-\infty}^{\infty} f(x)\, dx = 1$ defines a distribution function by (5). Table 2 presents some important examples of various kinds of densities $f = f(x)$ with their names and parameters (a density $f(x)$ is taken to be zero for values of $x$ not listed in the table).
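Table 2

Distribution                                              Density                                                                                                   Parameters
Uniform on $[a, b]$                                       $1/(b - a)$, $a \le x \le b$                                                                              $a, b \in R$; $a < b$
Normal or Gaussian                                        $(2\pi\sigma^2)^{-1/2} e^{-(x - m)^2/(2\sigma^2)}$, $x \in R$                                             $m \in R$, $\sigma > 0$
Gamma                                                     $\dfrac{x^{\alpha - 1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}$, $x \ge 0$                            $\alpha > 0$, $\beta > 0$
Beta                                                      $\dfrac{x^{r-1}(1 - x)^{s-1}}{B(r, s)}$, $0 \le x \le 1$                                                  $r > 0$, $s > 0$
Exponential (gamma with $\alpha = 1$, $\beta = 1/\lambda$)  $\lambda e^{-\lambda x}$, $x \ge 0$                                                                     $\lambda > 0$
Bilateral exponential                                     $\tfrac{1}{2}\lambda e^{-\lambda |x|}$, $x \in R$                                                         $\lambda > 0$
Chi-squared, $\chi^2$ (gamma with $\alpha = n/2$, $\beta = 2$)  $\dfrac{2^{-n/2} x^{n/2 - 1} e^{-x/2}}{\Gamma(n/2)}$, $x \ge 0$                                    $n = 1, 2, \dots$
Student, $t$                                              $\dfrac{\Gamma(\tfrac{1}{2}(n + 1))}{(n\pi)^{1/2}\Gamma(n/2)} \left(1 + \dfrac{x^2}{n}\right)^{-(n+1)/2}$, $x \in R$   $n = 1, 2, \dots$
$F$                                                       $\dfrac{(m/n)^{m/2}}{B(m/2, n/2)} \cdot \dfrac{x^{m/2 - 1}}{(1 + mx/n)^{(m+n)/2}}$, $x \ge 0$            $m, n = 1, 2, \dots$
Cauchy                                                    $\dfrac{\theta}{\pi(x^2 + \theta^2)}$, $x \in R$                                                          $\theta > 0$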
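Relation (5) and the normalization $\int f = 1$ can be checked numerically for the standard densities. The sketch below is an illustration only. The midpoint rule, the integration limits and the parameter values are our own choices, and the densities are written in their usual textbook form (normal with mean $m$ and standard deviation $\sigma$, gamma with shape $\alpha$ and scale $\beta$).

```python
import math

def normal_density(x, m=0.0, sigma=1.0):
    return math.exp(-((x - m) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def gamma_density(x, alpha, beta):
    if x < 0:
        return 0.0
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta**alpha)

def integrate(f, lo, hi, steps=100_000):
    """Midpoint Riemann sum of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# Total mass of each density (the tails outside the limits are negligible).
print(integrate(normal_density, -10.0, 10.0))
print(integrate(lambda x: gamma_density(x, 2.0, 1.5), 0.0, 100.0))

# F(0) for the standard normal law, computed from (5).
print(integrate(normal_density, -10.0, 0.0))

# The exponential density is the gamma density with alpha = 1, beta = 1/lambda.
lam = 3.0
print(gamma_density(0.7, 1.0, 1.0 / lam), lam * math.exp(-lam * 0.7))
```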

Singular measures. These are measures whose distribution functions are continuous but have all their points of increase on sets of zero Lebesgue measure. We do not discuss this case in detail; we merely give an example of such a function.

We consider the interval $[0, 1]$ and construct $F(x)$ by the following procedure originated by Cantor.

We divide $[0, 1]$ into thirds and put (Figure 26)

$$F_1(x) = \begin{cases} \tfrac{1}{2}, & x \in (\tfrac{1}{3}, \tfrac{2}{3}), \\ 0, & x = 0, \\ 1, & x = 1, \end{cases}$$

defining it in the intermediate intervals by linear interpolation.

[Figure 26]

Then we divide each of the intervals $[0, \tfrac{1}{3}]$ and $[\tfrac{2}{3}, 1]$ into three parts and define the function (Figure 27)

$$F_2(x) = \begin{cases} \tfrac{1}{2}, & x \in (\tfrac{1}{3}, \tfrac{2}{3}), \\ \tfrac{1}{4}, & x \in (\tfrac{1}{9}, \tfrac{2}{9}), \\ \tfrac{3}{4}, & x \in (\tfrac{7}{9}, \tfrac{8}{9}), \\ 0, & x = 0, \\ 1, & x = 1, \end{cases}$$

with its values at other points determined by linear interpolation.

[Figure 27]

Continuing this process, we construct a sequence of functions $F_n(x)$, $n = 1, 2, \dots$, which converges to a nondecreasing continuous function $F(x)$ (the Cantor function), whose points of increase ($x$ is a point of increase of $F(x)$ if $F(x + \varepsilon) - F(x - \varepsilon) > 0$ for every $\varepsilon > 0$) form a set of Lebesgue measure zero. In fact, it is clear from the construction of $F(x)$ that the total length of the intervals $(\tfrac{1}{3}, \tfrac{2}{3})$, $(\tfrac{1}{9}, \tfrac{2}{9})$, $(\tfrac{7}{9}, \tfrac{8}{9})$, $\dots$ on which the function is constant is

$$\frac{1}{3} + \frac{2}{9} + \frac{4}{27} + \cdots = \frac{1}{3} \sum_{n=0}^{\infty} \left(\frac{2}{3}\right)^n = 1. \tag{6}$$

Let $\mathscr{N}$ be the set of points of increase of the Cantor function $F(x)$. It follows from (6) that $\lambda(\mathscr{N}) = 0$. At the same time, if $\mu$ is the measure corresponding to the Cantor function $F(x)$, we have $\mu(\mathscr{N}) = 1$. (We then say that the measure is singular with respect to Lebesgue measure $\lambda$.)
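The approximating functions $F_n$ of the Cantor construction satisfy a simple recursion, which the following sketch uses. It is an illustration only. The names `cantor_step` and `removed_length` are ours, and starting the recursion from $F_0(x) = x$ is our convention (it reproduces the functions $F_1$ and $F_2$ described in the construction). The second function evaluates the partial sums of the series in (6).

```python
def cantor_step(n, x):
    """F_n(x) for x in [0, 1]: constant 1/2 on the middle third, and a
    rescaled copy of F_{n-1} on each of the two outer thirds."""
    if n == 0:
        return x
    if x <= 1 / 3:
        return 0.5 * cantor_step(n - 1, 3 * x)
    if x >= 2 / 3:
        return 0.5 + 0.5 * cantor_step(n - 1, 3 * x - 2)
    return 0.5

def removed_length(n):
    """Total length of the intervals of constancy created in the first n steps."""
    return sum((1 / 3) * (2 / 3) ** k for k in range(n))

print(cantor_step(2, 0.15), cantor_step(2, 0.5), cantor_step(2, 0.8))
print(removed_length(10), removed_length(50))
```

The partial sums equal $1 - (2/3)^n$, so the intervals of constancy exhaust the length of $[0, 1]$ in the limit.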
Without any further discussion of possible types of distribution functions, we merely observe that in fact the three types that have been mentioned cover all possibilities. More precisely, every distribution function can be represented in the form $p_1 F_1 + p_2 F_2 + p_3 F_3$, where $F_1$ is discrete, $F_2$ is absolutely continuous, and $F_3$ is singular, and $p_i$ are nonnegative numbers, $p_1 + p_2 + p_3 = 1$.
2. Theorem 1 establishes a one-to-one correspondence between probability measures on $(R, \mathscr{B}(R))$ and distribution functions on $R$. An analysis of the proof of the theorem shows that in fact a stronger theorem is true, one that in particular lets us introduce Lebesgue measure on the real line.

Let $\mu$ be a $\sigma$-finite measure on $(\Omega, \mathscr{A})$, where $\mathscr{A}$ is an algebra of subsets of $\Omega$. It turns out that the conclusion of Carathéodory's theorem on the extension of a measure from an algebra $\mathscr{A}$ to a minimal $\sigma$-algebra $\sigma(\mathscr{A})$ remains valid with a $\sigma$-finite measure; this makes it possible to generalize Theorem 1.

A Lebesgue-Stieltjes measure on $(R, \mathscr{B}(R))$ is a (countably additive) measure $\mu$ such that the measure $\mu(I)$ of every bounded interval $I$ is finite. A generalized distribution function on the real line $R$ is a nondecreasing function $G = G(x)$, with values on $(-\infty, \infty)$, that is continuous on the right.

Theorem 1 can be generalized to the statement that the formula

$$\mu(a, b] = G(b) - G(a), \qquad a < b,$$

again establishes a one-to-one correspondence between Lebesgue-Stieltjes measures $\mu$ and generalized distribution functions $G$.

In fact, if $G(+\infty) - G(-\infty) < \infty$, the proof of Theorem 1 can be taken over without any change, since this case reduces to the case when $G(+\infty) - G(-\infty) = 1$ and $G(-\infty) = 0$.

Now let $G(+\infty) - G(-\infty) = \infty$. Put

$$G_n(x) = \begin{cases} G(x), & |x| \le n, \\ G(n), & x > n, \\ G(-n), & x < -n. \end{cases}$$

157

3. Methods of Introducing Probability Measures on Measurable Spaces

On the algebra $\mathscr{A}$ let us define a finitely additive measure $\mu_0$ such that
$\mu_0(a, b] = G(b) - G(a)$, and let $\mu_n$ be the finitely additive measure previously
constructed (by Theorem 1) from $G_n(x)$.
Evidently $\mu_n \uparrow \mu_0$ on $\mathscr{A}$. Now let $A_1, A_2, \ldots$ be disjoint sets in $\mathscr{A}$ and
$A \equiv \sum A_n \in \mathscr{A}$. Then (Problem 6 of §1)

$$\mu_0(A) \ge \sum_{n=1}^{\infty} \mu_0(A_n).$$

If $\sum_{n=1}^{\infty} \mu_0(A_n) = \infty$ then $\mu_0(A) = \sum_{n=1}^{\infty} \mu_0(A_n)$. Let us suppose that
$\sum \mu_0(A_n) < \infty$. Then

$$\mu_0(A) = \lim_n \mu_n(A) = \lim_n \sum_{k=1}^{\infty} \mu_n(A_k).$$

By hypothesis, $\sum \mu_0(A_n) < \infty$. Therefore

$$0 \le \mu_0(A) - \sum_{k=1}^{\infty} \mu_0(A_k) = \lim_n \Big[ \sum_{k=1}^{\infty} \big(\mu_n(A_k) - \mu_0(A_k)\big) \Big] \le 0,$$

since $\mu_n \le \mu_0$.
Thus a $\sigma$-finite finitely additive measure $\mu_0$ is countably additive on $\mathscr{A}$,
and therefore (by Carathéodory's theorem) it can be extended to a countably
additive measure $\mu$ on $\sigma(\mathscr{A})$.
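The case $G(x) = x$ is particularly important. The measure $\lambda$ corresponding
to this generalized distribution function is Lebesgue measure on $(R, \mathscr{B}(R))$.
As for the interval $[0, 1]$ of the real line, we can define the system $\bar{\mathscr{B}}(R)$ by
writing $\Lambda \in \bar{\mathscr{B}}(R)$ if there are Borel sets $A$ and $B$ such that $A \subseteq \Lambda \subseteq B$,
$\lambda(B \setminus A) = 0$. Then Lebesgue measure $\bar{\lambda}$ on $\bar{\mathscr{B}}(R)$ is defined by $\bar{\lambda}(\Lambda) = \lambda(A)$
if $A \subseteq \Lambda \subseteq B$, $\Lambda \in \bar{\mathscr{B}}(R)$ and $\lambda(B \setminus A) = 0$.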
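The correspondence between a generalized distribution function and the measure of a half-open interval can be checked numerically. The following is only a minimal sketch: the helper names `ls_measure`, `lebesgue` and `cubic` are illustrative and do not come from the text, and `cubic` is simply one assumed example of a continuous nondecreasing function with infinite total increase.

```python
def ls_measure(G, a, b):
    """Measure of the interval (a, b] induced by G: mu(a, b] = G(b) - G(a)."""
    if a >= b:
        raise ValueError("need a < b")
    return G(b) - G(a)


def lebesgue(x):
    # G(x) = x: the induced measure of (a, b] is its length
    return x


def cubic(x):
    # assumed example: nondecreasing, continuous, G(+inf) - G(-inf) = inf
    return x ** 3


length = ls_measure(lebesgue, 1.0, 3.0)   # length of (1, 3]
mass = ls_measure(cubic, 1.0, 3.0)        # G(3) - G(1)
# finite additivity on adjacent intervals: (0, 1] and (1, 3] make up (0, 3]
pieces = ls_measure(cubic, 0.0, 1.0) + ls_measure(cubic, 1.0, 3.0)
whole = ls_measure(cubic, 0.0, 3.0)
```

With $G(x) = x$ the value `length` is the ordinary length of the interval, which is the sense in which Lebesgue measure arises as a special case.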
3. The measurable space $(R^n, \mathscr{B}(R^n))$. Let us suppose, as for the real line, that
$P$ is a probability measure on $(R^n, \mathscr{B}(R^n))$.
Let us write

$$F_n(x_1, \ldots, x_n) = P\big((-\infty, x_1] \times \cdots \times (-\infty, x_n]\big),$$

or, in a more compact form,

$$F_n(x) = P(-\infty, x],$$

where $x = (x_1, \ldots, x_n)$, $(-\infty, x] = (-\infty, x_1] \times \cdots \times (-\infty, x_n]$.
Let us introduce the difference operator $\Delta_{a_i b_i}\colon R^n \to R$, defined by the
formula

$$\Delta_{a_i b_i} F_n(x_1, \ldots, x_n) = F_n(x_1, \ldots, x_{i-1}, b_i, x_{i+1}, \ldots, x_n) - F_n(x_1, \ldots, x_{i-1}, a_i, x_{i+1}, \ldots, x_n),$$

II. Mathematical Foundations of Probability Theory

where $a_i \le b_i$. A simple calculation shows that

$$\Delta_{a_1 b_1} \cdots \Delta_{a_n b_n} F_n(x_1, \ldots, x_n) = P(a, b], \tag{7}$$

where $(a, b] = (a_1, b_1] \times \cdots \times (a_n, b_n]$. Hence it is clear, in particular, that
(in contrast to the one-dimensional case) $P(a, b]$ is in general not equal to
$F_n(b) - F_n(a)$.
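For $n = 2$ the iterated difference expands to $F_2(b_1, b_2) - F_2(a_1, b_2) - F_2(b_1, a_2) + F_2(a_1, a_2)$. The sketch below evaluates the iterated difference as a signed sum over the vertices of the rectangle and compares it with the naive difference $F_n(b) - F_n(a)$. The function names and the choice of $F_2$ (a product of two uniform distribution functions on $[0, 1]$) are assumptions made for illustration.

```python
from itertools import product


def rect_prob(F, a, b):
    """Iterated difference of F over (a, b], as a signed sum over the 2^n vertices."""
    n = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=n):
        # choice[i] == 1 selects b_i, choice[i] == 0 selects a_i
        vertex = [b[i] if c else a[i] for i, c in enumerate(choice)]
        sign = (-1) ** (n - sum(choice))   # one factor -1 for every a_i used
        total += sign * F(vertex)
    return total


def uniform_cdf(t):
    # distribution function of the uniform law on [0, 1]
    return min(max(t, 0.0), 1.0)


def F2(x):
    # assumed example: product of two uniform distribution functions
    return uniform_cdf(x[0]) * uniform_cdf(x[1])


a, b = (0.2, 0.1), (0.5, 0.4)
p_rect = rect_prob(F2, a, b)   # measure of (0.2, 0.5] x (0.1, 0.4]
naive = F2(b) - F2(a)          # not the measure of the rectangle
```

Here `p_rect` equals the area $0.3 \cdot 0.3$ of the rectangle, while `naive` is twice as large, which illustrates the contrast with the one-dimensional case.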
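Since $P(a, b] \ge 0$, it follows from (7) that

$$\Delta_{a_1 b_1} \cdots \Delta_{a_n b_n} F_n(x_1, \ldots, x_n) \ge 0 \tag{8}$$

for arbitrary $a = (a_1, \ldots, a_n)$, $b = (b_1, \ldots, b_n)$.
It also follows from the continuity of $P$ that $F_n(x_1, \ldots, x_n)$ is continuous
on the right with respect to the variables collectively, i.e. if $x^{(k)} \downarrow x$, $x^{(k)} = (x_1^{(k)}, \ldots, x_n^{(k)})$, then

$$F_n(x^{(k)}) \downarrow F_n(x), \qquad k \to \infty. \tag{9}$$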

It is also clear that

$$\lim_{x \uparrow \infty} F_n(x_1, \ldots, x_n) = 1 \tag{10}$$

and

$$\lim_{x \downarrow y} F_n(x_1, \ldots, x_n) = 0, \tag{11}$$

if at least one coordinate of $y$ is $-\infty$.

Definition 2. An n-dimensional distribution function (on $R^n$) is a function
$F = F(x_1, \ldots, x_n)$ with properties (8)-(11).
The following result can be established by the same reasoning as in
Theorem 1.

Theorem 2. Let $F = F_n(x_1, \ldots, x_n)$ be a distribution function on $R^n$. Then there
is a unique probability measure $P$ on $(R^n, \mathscr{B}(R^n))$ such that

$$P(a, b] = \Delta_{a_1 b_1} \cdots \Delta_{a_n b_n} F_n(x_1, \ldots, x_n). \tag{12}$$

Here are some examples of n-dimensional distribution functions.
Let $F^1, \ldots, F^n$ be one-dimensional distribution functions (on $R$) and

$$F_n(x_1, \ldots, x_n) = F^1(x_1) \cdots F^n(x_n).$$

It is clear that this function is continuous on the right and satisfies (10) and
(11). It is also easy to verify that

$$\Delta_{a_1 b_1} \cdots \Delta_{a_n b_n} F_n(x_1, \ldots, x_n) = \prod_{k=1}^{n} \big[F^k(b_k) - F^k(a_k)\big] \ge 0.$$

Consequently $F_n(x_1, \ldots, x_n)$ is a distribution function.

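The case when

$$F^k(x_k) = \begin{cases} 0, & x_k < 0, \\ x_k, & 0 \le x_k \le 1, \\ 1, & x_k > 1 \end{cases}$$

is particularly important. In this case, for $0 \le x_k \le 1$, $k = 1, \ldots, n$,

$$F_n(x_1, \ldots, x_n) = x_1 \cdots x_n.$$

The probability measure corresponding to this n-dimensional distribution
function is n-dimensional Lebesgue measure on $[0, 1]^n$.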
Many n-dimensional distribution functions appear in the form

$$F_n(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_n(t_1, \ldots, t_n)\, dt_1 \cdots dt_n,$$

where $f_n(t_1, \ldots, t_n)$ is a nonnegative function such that

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_n(t_1, \ldots, t_n)\, dt_1 \cdots dt_n = 1,$$

and the integrals are Riemann (more generally, Lebesgue) integrals. The
function $f = f_n(t_1, \ldots, t_n)$ is called the density of the n-dimensional distribution function, the density of the n-dimensional probability distribution,
or simply an n-dimensional density.
When $n = 1$, the function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/(2\sigma^2)}, \qquad x \in R,$$

with $\sigma > 0$ is the density of the (nondegenerate) Gaussian or normal distribution. There are natural analogs of this density when $n > 1$.
Let $\mathbb{R} = \|r_{ij}\|$ be a nonnegative definite symmetric $n \times n$ matrix:

$$\sum_{i,j=1}^{n} r_{ij} \lambda_i \lambda_j \ge 0, \qquad \lambda_i \in R, \quad i = 1, \ldots, n, \qquad r_{ij} = r_{ji}.$$

When $\mathbb{R}$ is a positive definite matrix, $|\mathbb{R}| = \det \mathbb{R} > 0$ and consequently there
is an inverse matrix $A = \|a_{ij}\|$. Then the function

$$f_n(x_1, \ldots, x_n) = \frac{|A|^{1/2}}{(2\pi)^{n/2}} \exp\Big\{ -\frac{1}{2} \sum_{i,j=1}^{n} a_{ij}(x_i - m_i)(x_j - m_j) \Big\}, \tag{13}$$

where $m_i \in R$, $i = 1, \ldots, n$, has the property that its (Riemann) integral over
the whole space equals 1 (this will be proved in §13) and therefore, since it is
also positive, it is a density.
This function is the density of the n-dimensional (nondegenerate) Gaussian
or normal distribution (with vector mean $m = (m_1, \ldots, m_n)$ and covariance
matrix $\mathbb{R} = A^{-1}$).

Figure 28. Density of the two-dimensional Gaussian distribution.

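When $n = 2$ the density $f_2(x_1, x_2)$ can be put in the form

$$f_2(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\Big\{ -\frac{1}{2(1 - \rho^2)} \Big[ \frac{(x_1 - m_1)^2}{\sigma_1^2} - 2\rho\, \frac{(x_1 - m_1)(x_2 - m_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - m_2)^2}{\sigma_2^2} \Big] \Big\}, \tag{14}$$

where $\sigma_i > 0$, $|\rho| < 1$. (The meanings of the parameters $m_i$, $\sigma_i$ and $\rho$ will be
explained in §8.)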
Figure 28 indicates the form of the two-dimensional Gaussian density.
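As a numerical sanity check, the two-dimensional Gaussian density can be evaluated both in the matrix form (13), with $A$ the inverse of the covariance matrix, and in the standard form with parameters $m_1, m_2, \sigma_1, \sigma_2, \rho$, and its integral can be approximated by a Riemann sum. The following sketch uses only the standard library; the function names, the parameter values and the truncation of the plane to a square are assumptions made for illustration.

```python
import math


def gauss2_density(x1, x2, m1, m2, s1, s2, rho):
    """Two-dimensional Gaussian density in the (m, sigma, rho) parametrization."""
    z1 = (x1 - m1) / s1
    z2 = (x2 - m2) / s2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm


def gauss2_density_matrix(x1, x2, m1, m2, s1, s2, rho):
    """The same density written as in (13), with A the inverse covariance matrix."""
    r11, r12, r22 = s1 * s1, rho * s1 * s2, s2 * s2   # covariance matrix entries
    det_r = r11 * r22 - r12 * r12
    a11, a12, a22 = r22 / det_r, -r12 / det_r, r11 / det_r
    det_a = a11 * a22 - a12 * a12
    d1, d2 = x1 - m1, x2 - m2
    q = a11 * d1 * d1 + 2.0 * a12 * d1 * d2 + a22 * d2 * d2
    return math.sqrt(det_a) / (2.0 * math.pi) * math.exp(-0.5 * q)


def riemann_total(params, half_width=8.0, steps=200):
    """Midpoint Riemann sum of the density over a square centered at the mean."""
    m1, m2 = params[0], params[1]
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x1 = m1 - half_width + (i + 0.5) * h
        for j in range(steps):
            x2 = m2 - half_width + (j + 0.5) * h
            total += gauss2_density(x1, x2, *params)
    return total * h * h


params = (1.0, -0.5, 1.0, 2.0, 0.6)   # m1, m2, sigma1, sigma2, rho (assumed values)
total_mass = riemann_total(params)
gap = abs(gauss2_density(0.3, 0.7, *params) - gauss2_density_matrix(0.3, 0.7, *params))
```

The value `total_mass` is close to 1 up to the truncation error of the square, and `gap` is at the level of rounding error, which is consistent with the two forms describing the same density.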

Remark. As in the case $n = 1$, Theorem 2 can be generalized to (similarly
defined) Lebesgue-Stieltjes measures on $(R^n, \mathscr{B}(R^n))$ and generalized
distribution functions on $R^n$. When the generalized distribution function
$G_n(x_1, \ldots, x_n)$ is $x_1 \cdots x_n$, the corresponding measure is Lebesgue measure
on the Borel sets of $R^n$. It clearly satisfies

$$\lambda(a, b] = \prod_{i=1}^{n} (b_i - a_i),$$

i.e. the Lebesgue measure of the "rectangle"

$$(a, b] = (a_1, b_1] \times \cdots \times (a_n, b_n]$$

is its "content."

4. The measurable space $(R^\infty, \mathscr{B}(R^\infty))$. For the spaces $R^n$, $n \ge 1$, the probability measures were constructed in the following way: first for elementary
sets (rectangles $(a, b]$), then, in a natural way, for sets $A = \sum (a_i, b_i]$, and
finally, by using Carathéodory's theorem, for sets in $\mathscr{B}(R^n)$.
A similar construction for probability measures also works for the space
$(R^\infty, \mathscr{B}(R^\infty))$.

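Let

$$\mathscr{I}_n(B) = \{x \in R^\infty : (x_1, \ldots, x_n) \in B\}, \qquad B \in \mathscr{B}(R^n),$$

denote a cylinder set in $R^\infty$ with base $B \in \mathscr{B}(R^n)$. We see at once that it is
natural to take the cylinder sets as elementary sets in $R^\infty$, with their probabilities defined by the probability measure on the sets of $\mathscr{B}(R^\infty)$.
Let $P$ be a probability measure on $(R^\infty, \mathscr{B}(R^\infty))$. For $n = 1, 2, \ldots$, we
take

$$P_n(B) = P(\mathscr{I}_n(B)), \qquad B \in \mathscr{B}(R^n). \tag{15}$$

The sequence of probability measures $P_1, P_2, \ldots$ defined respectively on
$(R, \mathscr{B}(R)), (R^2, \mathscr{B}(R^2)), \ldots$, has the following evident consistency property:
for $n = 1, 2, \ldots$ and $B \in \mathscr{B}(R^n)$,

$$P_{n+1}(B \times R) = P_n(B). \tag{16}$$

It is noteworthy that the converse also holds.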

Theorem 3 (Kolmogorov's Theorem on the Extension of Measures in
$(R^\infty, \mathscr{B}(R^\infty))$). Let $P_1, P_2, \ldots$ be a sequence of probability measures on
$(R, \mathscr{B}(R)), (R^2, \mathscr{B}(R^2)), \ldots$, possessing the consistency property (16). Then
there is a unique probability measure $P$ on $(R^\infty, \mathscr{B}(R^\infty))$ such that

$$P(\mathscr{I}_n(B)) = P_n(B), \qquad B \in \mathscr{B}(R^n), \tag{17}$$

for $n = 1, 2, \ldots$.

PROOF. Let $B^n \in \mathscr{B}(R^n)$ and let $\mathscr{I}_n(B^n)$ be the cylinder with base $B^n$. We assign
the measure $P(\mathscr{I}_n(B^n))$ to this cylinder by taking $P(\mathscr{I}_n(B^n)) = P_n(B^n)$.
Let us show that, in virtue of the consistency condition, this definition is
consistent, i.e. the value of $P(\mathscr{I}_n(B^n))$ is independent of the representation of
the set $\mathscr{I}_n(B^n)$. In fact, let the same cylinder be represented in two ways:

$$\mathscr{I}_n(B^n) = \mathscr{I}_{n+k}(B^{n+k}).$$

It follows that, if $(x_1, \ldots, x_{n+k}) \in R^{n+k}$, we have

$$(x_1, \ldots, x_n) \in B^n \iff (x_1, \ldots, x_{n+k}) \in B^{n+k}, \tag{18}$$

and therefore, by (16) and (18),

$$P_n(B^n) = P_{n+1}\big((x_1, \ldots, x_{n+1}) : (x_1, \ldots, x_n) \in B^n\big) = \cdots = P_{n+k}\big((x_1, \ldots, x_{n+k}) : (x_1, \ldots, x_n) \in B^n\big) = P_{n+k}(B^{n+k}).$$

Let $\mathscr{A}(R^\infty)$ denote the collection of all cylinder sets $\hat{B}^n = \mathscr{I}_n(B^n)$, $B^n \in \mathscr{B}(R^n)$,
$n = 1, 2, \ldots$.

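Now let $\hat{B}_1, \ldots, \hat{B}_k$ be disjoint sets in $\mathscr{A}(R^\infty)$. We may suppose without loss
of generality that $\hat{B}_i = \mathscr{I}_n(B_i^n)$, $i = 1, \ldots, k$, for some $n$, where $B_1^n, \ldots, B_k^n$ are
disjoint sets in $\mathscr{B}(R^n)$. Then

$$P\Big(\sum_{i=1}^{k} \hat{B}_i\Big) = P\Big(\sum_{i=1}^{k} \mathscr{I}_n(B_i^n)\Big) = P_n\Big(\sum_{i=1}^{k} B_i^n\Big) = \sum_{i=1}^{k} P_n(B_i^n) = \sum_{i=1}^{k} P(\hat{B}_i),$$

i.e. the set function $P$ is finitely additive on the algebra $\mathscr{A}(R^\infty)$.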
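Let us show that $P$ is "continuous at zero," i.e. if the sequence of sets
$\hat{B}_n \downarrow \varnothing$, $n \to \infty$, then $P(\hat{B}_n) \to 0$, $n \to \infty$. Suppose the contrary, i.e. let
$\lim P(\hat{B}_n) = \delta > 0$. We may suppose without loss of generality that $\{\hat{B}_n\}$
has the form

$$\hat{B}_n = \{x : (x_1, \ldots, x_n) \in B_n\}, \qquad B_n \in \mathscr{B}(R^n).$$

We use the following property of probability measures $P_n$ on $(R^n, \mathscr{B}(R^n))$
(see Problem 9): if $B_n \in \mathscr{B}(R^n)$, for a given $\delta > 0$ we can find a compact set
$A_n \in \mathscr{B}(R^n)$ such that $A_n \subseteq B_n$ and

$$P_n(B_n \setminus A_n) \le \delta / 2^{n+1}.$$

Therefore if

$$\hat{A}_n = \{x : (x_1, \ldots, x_n) \in A_n\},$$

we have

$$P(\hat{B}_n \setminus \hat{A}_n) = P_n(B_n \setminus A_n) \le \delta / 2^{n+1}.$$

Form the set $\hat{C}_n = \bigcap_{k=1}^{n} \hat{A}_k$ and let $C_n$ be such that

$$\hat{C}_n = \{x : (x_1, \ldots, x_n) \in C_n\}.$$

Then, since the sets $\hat{B}_n$ decrease, we obtain

$$P(\hat{B}_n \setminus \hat{C}_n) \le \sum_{k=1}^{n} P(\hat{B}_n \setminus \hat{A}_k) \le \sum_{k=1}^{n} P(\hat{B}_k \setminus \hat{A}_k) \le \delta / 2.$$

But by assumption $\lim_n P(\hat{B}_n) = \delta > 0$, and therefore $\lim_n P(\hat{C}_n) \ge \delta / 2 > 0$.
Let us show that this contradicts the condition $\hat{C}_n \downarrow \varnothing$.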
Let us choose a point $\hat{x}^{(n)} = (x_1^{(n)}, x_2^{(n)}, \ldots)$ in $\hat{C}_n$. Then $(x_1^{(n)}, \ldots, x_n^{(n)}) \in C_n$
for $n \ge 1$.
Let $(n_1)$ be a subsequence of $(n)$ such that $x_1^{(n_1)} \to x_1^0$, where $x_1^0$ is a point
in $C_1$. (Such a sequence exists since $x_1^{(n)} \in C_1$ and $C_1$ is compact.) Then select
a subsequence $(n_2)$ of $(n_1)$ such that $(x_1^{(n_2)}, x_2^{(n_2)}) \to (x_1^0, x_2^0) \in C_2$. Similarly let
$(x_1^{(n_k)}, \ldots, x_k^{(n_k)}) \to (x_1^0, \ldots, x_k^0) \in C_k$. Finally form the diagonal sequence
$(m_k)$, where $m_k$ is the $k$th term of $(n_k)$. Then $x_i^{(m_k)} \to x_i^0$ as $m_k \to \infty$ for $i = 1, 2, \ldots$;
and $(x_1^0, x_2^0, \ldots) \in \hat{C}_n$ for $n = 1, 2, \ldots$, which evidently contradicts the assumption that $\hat{C}_n \downarrow \varnothing$, $n \to \infty$. This completes the proof of the theorem.

Remark. In the present case, the space $R^\infty$ is a countable product of lines,
$R^\infty = R \times R \times \cdots$. It is natural to ask whether Theorem 3 remains true if
$(R^\infty, \mathscr{B}(R^\infty))$ is replaced by a direct product of measurable spaces $(\Omega_i, \mathscr{F}_i)$,
$i = 1, 2, \ldots$.
We may notice that in the preceding proof the only topological property
of the real line that was used was that every set in $\mathscr{B}(R^n)$ contains a compact
subset whose probability measure is arbitrarily close to the probability
measure of the whole set. It is known, however, that this is a property not only
of spaces $(R^n, \mathscr{B}(R^n))$, but also of arbitrary complete separable metric spaces
with $\sigma$-algebras generated by the open sets.
Consequently Theorem 3 remains valid if we suppose that $P_1, P_2, \ldots$ is a
sequence of consistent probability measures on

$$(\Omega_1, \mathscr{F}_1), \quad (\Omega_1 \times \Omega_2, \mathscr{F}_1 \otimes \mathscr{F}_2), \quad \ldots,$$

where $(\Omega_i, \mathscr{F}_i)$ are complete separable metric spaces with $\sigma$-algebras $\mathscr{F}_i$
generated by open sets, and $(R^\infty, \mathscr{B}(R^\infty))$ is replaced by

$$(\Omega_1 \times \Omega_2 \times \cdots, \ \mathscr{F}_1 \otimes \mathscr{F}_2 \otimes \cdots).$$

In §9 (Theorem 2) it will be shown that the result of Theorem 3 remains
valid for arbitrary measurable spaces $(\Omega_i, \mathscr{F}_i)$ if the measures $P_n$ are concentrated in a particular way. However, Theorem 3 may fail in the general
case (without any hypotheses on the topological nature of the measurable
spaces or on the structure of the family of measures $\{P_n\}$). This is shown by
the following example.
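Let us consider the space $\Omega = (0, 1]$, which is evidently not complete, and
construct a sequence $\mathscr{F}_1 \subseteq \mathscr{F}_2 \subseteq \cdots$ of $\sigma$-algebras in the following way. For
$n = 1, 2, \ldots$, let

$$\varphi_n(\omega) = \begin{cases} 1, & 0 < \omega < 1/n, \\ 0, & 1/n \le \omega \le 1, \end{cases}$$

$$\mathscr{C}_n = \{A \subseteq \Omega : A = \{\omega : \varphi_n(\omega) \in B\},\ B \in \mathscr{B}(R)\}$$

and let $\mathscr{F}_n = \sigma\{\mathscr{C}_1, \ldots, \mathscr{C}_n\}$ be the smallest $\sigma$-algebra containing the sets
$\mathscr{C}_1, \ldots, \mathscr{C}_n$. Clearly $\mathscr{F}_1 \subseteq \mathscr{F}_2 \subseteq \cdots$. Let $\mathscr{F} = \sigma(\bigcup \mathscr{F}_n)$ be the smallest
$\sigma$-algebra containing all the $\mathscr{F}_n$. Consider the measurable space $(\Omega, \mathscr{F}_n)$
and define a probability measure $P_n$ on it as follows:

$$P_n\{\omega : (\varphi_1(\omega), \ldots, \varphi_n(\omega)) \in B^n\} = \begin{cases} 1 & \text{if } (1, \ldots, 1) \in B^n, \\ 0 & \text{otherwise}, \end{cases}$$

where $B^n \in \mathscr{B}(R^n)$. It is easy to see that the family $\{P_n\}$ is consistent: if
$A \in \mathscr{F}_n$ then $P_{n+1}(A) = P_n(A)$. However, we claim that there is no probability
measure $P$ on $(\Omega, \mathscr{F})$ such that its restriction $P \mid \mathscr{F}_n$ (i.e., the measure $P$
considered only on sets in $\mathscr{F}_n$) coincides with $P_n$ for $n = 1, 2, \ldots$. In fact, let
us suppose that such a probability measure $P$ exists. Then

$$P\{\omega : \varphi_1(\omega) = \cdots = \varphi_n(\omega) = 1\} = P_n\{\omega : \varphi_1(\omega) = \cdots = \varphi_n(\omega) = 1\} = 1 \tag{19}$$

for $n = 1, 2, \ldots$. But

$$\{\omega : \varphi_1(\omega) = \cdots = \varphi_n(\omega) = 1\} = (0, 1/n) \downarrow \varnothing,$$

which contradicts (19) and the hypothesis of countable additivity (and therefore continuity at the "zero" $\varnothing$) of the set function $P$.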
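The key point of the example is that the sets $\{\omega : \varphi_1(\omega) = \cdots = \varphi_n(\omega) = 1\} = (0, 1/n)$ decrease to the empty set: every fixed $\omega \in (0, 1]$ drops out of them for some finite $n$. The short sketch below illustrates this for a few sample points; the function names and the sample points are assumptions chosen for illustration, and exact rational arithmetic is used to avoid rounding at the endpoints $1/n$.

```python
from fractions import Fraction


def phi(n, w):
    """phi_n(w) = 1 if 0 < w < 1/n, and 0 otherwise, for w in (0, 1]."""
    return 1 if 0 < w < Fraction(1, n) else 0


def all_ones(n, w):
    """True when phi_1(w) = ... = phi_n(w) = 1, i.e. when w lies in (0, 1/n)."""
    return all(phi(k, w) == 1 for k in range(1, n + 1))


def first_exit(w):
    """Smallest n for which w is outside (0, 1/n); finite for every w in (0, 1]."""
    n = 1
    while all_ones(n, w):
        n += 1
    return n


samples = (Fraction(1, 2), Fraction(1, 10), Fraction(3, 1000))
exits = {w: first_exit(w) for w in samples}
```

Each sample point has a finite exit index, so no point of $(0, 1]$ belongs to all the sets $(0, 1/n)$, although each of them would have to carry probability 1.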
We now give an example of a probability measure on $(R^\infty, \mathscr{B}(R^\infty))$. Let
$F_1(x), F_2(x), \ldots$ be a sequence of one-dimensional distribution functions.
Define the functions $G_1(x) = F_1(x)$, $G_2(x_1, x_2) = F_1(x_1) F_2(x_2), \ldots$, and denote
the corresponding probability measures on $(R, \mathscr{B}(R)), (R^2, \mathscr{B}(R^2)), \ldots$ by
$P_1, P_2, \ldots$. Then it follows from Theorem 3 that there is a measure $P$ on
$(R^\infty, \mathscr{B}(R^\infty))$ such that

$$P\{x \in R^\infty : (x_1, \ldots, x_n) \in B\} = P_n(B), \qquad B \in \mathscr{B}(R^n),$$

and, in particular,

$$P\{x \in R^\infty : x_1 \le a_1, \ldots, x_n \le a_n\} = F_1(a_1) \cdots F_n(a_n).$$

Let us take $F_i(x)$ to be a Bernoulli distribution,

$$F_i(x) = \begin{cases} 0, & x < 0, \\ q, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$$

Then we can say that there is a probability measure $P$ on the space $\Omega$ of
sequences of numbers $x = (x_1, x_2, \ldots)$, $x_i = 0$ or $1$, together with the $\sigma$-algebra
of its Borel subsets, such that

$$P\{x : x_1 = a_1, \ldots, x_n = a_n\} = p^{\sum a_i} q^{\,n - \sum a_i}.$$

This is precisely the result that was not available in the first chapter for
stating the law of large numbers in the form (I.5.8).
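For the Bernoulli case the finite-dimensional measures can be written down explicitly, and the consistency property (16) can be verified by direct summation. The sketch below takes $p = 1 - q$ as the probability of the value 1; the function names and the numerical value of $p$ are assumptions made for illustration.

```python
from itertools import product


def cylinder_prob(a, p):
    """P{x: x_1 = a_1, ..., x_n = a_n} = p^(sum a_i) * q^(n - sum a_i), q = 1 - p."""
    q = 1.0 - p
    ones = sum(a)
    return p ** ones * q ** (len(a) - ones)


def prob(B, p):
    """Measure of a finite set B of 0-1 tuples, all of the same length."""
    return sum(cylinder_prob(a, p) for a in B)


p = 0.3                                   # assumed value of P{x_i = 1}
B = {(1, 0), (1, 1)}                      # base in {0, 1}^2: first coordinate is 1
B_times_R = {a + (x,) for a in B for x in (0, 1)}
lhs = prob(B_times_R, p)                  # P_3(B x {0, 1})
rhs = prob(B, p)                          # P_2(B)
total = prob(set(product((0, 1), repeat=4)), p)
```

The values `lhs` and `rhs` agree, as the consistency property requires, and `total` confirms that the four-dimensional measure has total mass 1.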
5. The measurable space $(R^T, \mathscr{B}(R^T))$. Let $T$ be a set of indices $t \in T$ and $R_t$ a
real line corresponding to the index $t$. We consider a finite unordered set
$\tau = [t_1, \ldots, t_n]$ of distinct indices $t_i$, $t_i \in T$, $n \ge 1$, and let $P_\tau$ be a probability
measure on $(R^\tau, \mathscr{B}(R^\tau))$, where $R^\tau = R_{t_1} \times \cdots \times R_{t_n}$.
We say that the family $\{P_\tau\}$ of probability measures, where $\tau$ runs through
all finite unordered sets, is consistent if, for all sets $\tau = [t_1, \ldots, t_n]$ and
$\sigma = [s_1, \ldots, s_k]$ such that $\sigma \subseteq \tau$ we have

$$P_\sigma\{(x_{s_1}, \ldots, x_{s_k}) : (x_{s_1}, \ldots, x_{s_k}) \in B\} = P_\tau\{(x_{t_1}, \ldots, x_{t_n}) : (x_{s_1}, \ldots, x_{s_k}) \in B\} \tag{20}$$

for every $B \in \mathscr{B}(R^\sigma)$.

Theorem 4 (Kolmogorov's Theorem on the Extension of Measures in
$(R^T, \mathscr{B}(R^T))$). Let $\{P_\tau\}$ be a consistent family of probability measures on
$(R^\tau, \mathscr{B}(R^\tau))$. Then there is a unique probability measure $P$ on $(R^T, \mathscr{B}(R^T))$ such
that

$$P(\mathscr{I}_\tau(B)) = P_\tau(B) \tag{21}$$

for all unordered sets $\tau = [t_1, \ldots, t_n]$ of different indices $t_i \in T$, $B \in \mathscr{B}(R^\tau)$ and
$\mathscr{I}_\tau(B) = \{x \in R^T : (x_{t_1}, \ldots, x_{t_n}) \in B\}$.
PROOF. Let the set $\hat{B} \in \mathscr{B}(R^T)$. By the theorem of §2 there is an at most countable set $S = \{s_1, s_2, \ldots\} \subseteq T$ such that $\hat{B} = \{x : (x_{s_1}, x_{s_2}, \ldots) \in B\}$, where
$B \in \mathscr{B}(R^S)$, $R^S = R_{s_1} \times R_{s_2} \times \cdots$. In other words, $\hat{B} = \mathscr{I}_S(B)$ is a cylinder
set with base $B \in \mathscr{B}(R^S)$.
We can define a set function $P$ on such cylinder sets by putting

$$P(\mathscr{I}_S(B)) = P_S(B), \tag{22}$$

where $P_S$ is the probability measure whose existence is guaranteed by
Theorem 3. We claim that $P$ is in fact the measure whose existence is asserted
in the theorem. To establish this we first verify that the definition (22) is
consistent, i.e. that it leads to a unique value of $P(\hat{B})$ for all possible representations of $\hat{B}$; and second, that this set function is countably additive.
Let $\hat{B} = \mathscr{I}_{S_1}(B_1)$ and $\hat{B} = \mathscr{I}_{S_2}(B_2)$. It is clear that then $\hat{B} = \mathscr{I}_{S_1 \cup S_2}(B_3)$
with some $B_3 \in \mathscr{B}(R^{S_1 \cup S_2})$; therefore it is enough to show that if $S \subseteq S'$
and $B \in \mathscr{B}(R^S)$, then $P_{S'}(B') = P_S(B)$, where

$$B' = \{(x_{s_1'}, x_{s_2'}, \ldots) : (x_{s_1}, x_{s_2}, \ldots) \in B\}$$

with $S' = \{s_1', s_2', \ldots\}$, $S = \{s_1, s_2, \ldots\}$. But by the assumed consistency of
(20) this equation follows immediately from Theorem 3. This establishes that
the value of $P(\hat{B})$ is independent of the representation of $\hat{B}$.
To verify the countable additivity of $P$, let us suppose that $\{\hat{B}_n\}$ is a sequence
of pairwise disjoint sets in $\mathscr{B}(R^T)$. Then there is an at most countable set
$S \subseteq T$ such that $\hat{B}_n = \mathscr{I}_S(B_n)$ for all $n \ge 1$, where $B_n \in \mathscr{B}(R^S)$. Since $P_S$
is a probability measure, we have

$$P\Big(\sum \hat{B}_n\Big) = P\Big(\sum \mathscr{I}_S(B_n)\Big) = P_S\Big(\sum B_n\Big) = \sum P_S(B_n) = \sum P(\mathscr{I}_S(B_n)) = \sum P(\hat{B}_n).$$

Finally, property (21) follows immediately from the way in which $P$ was
constructed.
This completes the proof.

Remark 1. We emphasize that $T$ is any set of indices. Hence, by the remark
after Theorem 3, the present theorem remains valid if we replace the real
lines $R_t$ by arbitrary complete separable metric spaces $\Omega_t$ (with $\sigma$-algebras
generated by open sets).

Remark 2. The original probability measures $\{P_\tau\}$ were assumed defined
on unordered sets $\tau = [t_1, \ldots, t_n]$ of different indices. It is also possible to
start from a family of probability measures $\{P_\tau\}$ where $\tau$ runs through all
ordered sets $\tau = (t_1, \ldots, t_n)$ of different indices. In this case, in order to have
Theorem 4 hold we have to adjoin to (20) a further consistency condition:

$$P_{(t_1, \ldots, t_n)}(A_{t_1} \times \cdots \times A_{t_n}) = P_{(t_{i_1}, \ldots, t_{i_n})}(A_{t_{i_1}} \times \cdots \times A_{t_{i_n}}), \tag{23}$$

where $(i_1, \ldots, i_n)$ is an arbitrary permutation of $(1, \ldots, n)$ and $A_{t_i} \in \mathscr{B}(R_{t_i})$. As
a necessary condition for the existence of $P$ this follows from (21) (with
$P_{[t_1, \ldots, t_n]}(B)$ replaced by $P_{(t_1, \ldots, t_n)}(B)$).
From now on we shall assume that the sets $\tau$ under consideration are
unordered. If $T$ is a subset of the real line (or some completely ordered set),
we may assume without loss of generality that the set $\tau = [t_1, \ldots, t_n]$
satisfies $t_1 < t_2 < \cdots < t_n$. Consequently it is enough to define "finite-dimensional" probabilities only for sets $\tau = [t_1, \ldots, t_n]$ for which $t_1 < t_2 < \cdots < t_n$.
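Now consider the case $T = [0, \infty)$. Then $R^T$ is the space of all real functions $x = (x_t)_{t \ge 0}$. A fundamental example of a probability measure on
$(R^{[0, \infty)}, \mathscr{B}(R^{[0, \infty)}))$ is Wiener measure, constructed as follows.
Consider the family $\{\varphi_t(y \mid x)\}_{t \ge 0}$ of Gaussian densities (as functions of $y$
for fixed $x$):

$$\varphi_t(y \mid x) = \frac{1}{\sqrt{2\pi t}}\, e^{-(y - x)^2/(2t)}, \qquad y \in R,$$

and for each $\tau = [t_1, \ldots, t_n]$, $t_1 < t_2 < \cdots < t_n$, and each set

$$B = I_1 \times \cdots \times I_n,$$

construct the measure $P_\tau(B)$ according to the formula

$$P_\tau(I_1 \times \cdots \times I_n) = \int_{I_1} \cdots \int_{I_n} \varphi_{t_1}(a_1 \mid 0)\, \varphi_{t_2 - t_1}(a_2 \mid a_1) \cdots \varphi_{t_n - t_{n-1}}(a_n \mid a_{n-1})\, da_1 \cdots da_n \tag{24}$$

(integration in the Riemann sense). Now we define the set function $P$ for each
cylinder set $\mathscr{I}_{t_1 \ldots t_n}(I_1 \times \cdots \times I_n) = \{x \in R^T : x_{t_1} \in I_1, \ldots, x_{t_n} \in I_n\}$ by taking

$$P(\mathscr{I}_{t_1 \ldots t_n}(I_1 \times \cdots \times I_n)) = P_{[t_1 \ldots t_n]}(I_1 \times \cdots \times I_n).$$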

The intuitive meaning of this method of assigning a measure to the cylinder
set $\mathscr{I}_{t_1 \ldots t_n}(I_1 \times \cdots \times I_n)$ is as follows.
The set $\mathscr{I}_{t_1 \ldots t_n}(I_1 \times \cdots \times I_n)$ is the set of functions that at times $t_1, \ldots, t_n$
pass through the "windows" $I_1, \ldots, I_n$ (see Figure 24 in §2). We shall interpret
$\varphi_{t_k - t_{k-1}}(a_k \mid a_{k-1})$ as the probability that a particle, starting at $a_{k-1}$,
arrives in a neighborhood of $a_k$ after a time $t_k - t_{k-1}$. Then the product of densities that
appears in (24) describes a certain independence of the increments of the
displacements of the moving "particle" in the time intervals

$$[0, t_1], \ [t_1, t_2], \ \ldots, \ [t_{n-1}, t_n].$$

The family of measures $\{P_\tau\}$ constructed in this way is easily seen to be
consistent, and therefore can be extended to a measure on $(R^{[0, \infty)}, \mathscr{B}(R^{[0, \infty)}))$.
The measure so obtained plays an important role in probability theory. It
was introduced by N. Wiener and is known as Wiener measure.
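The consistency of the finite-dimensional measures can be illustrated numerically in the simplest case: integrating the first coordinate over the whole line in the two-dimensional measure for the times $t_1 < t_2$ should reproduce the one-dimensional measure for the time $t_2$. The sketch below approximates the integrals in (24) by the midpoint rule, with the Gaussian transition density $\varphi_t(y \mid x)$ of variance $t$ centered at $x$. The function names, the numerical values, and the truncation of the real line to $[-10, 10]$ are assumptions made for illustration.

```python
import math


def phi(t, y, x):
    """Gaussian transition density phi_t(y | x), t > 0."""
    return math.exp(-(y - x) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def midpoints(lo, hi, steps):
    """Midpoints of a uniform partition of [lo, hi] together with the step size."""
    h = (hi - lo) / steps
    return [lo + (i + 0.5) * h for i in range(steps)], h


def p_tau(times, intervals, steps=200):
    """Midpoint-rule approximation of formula (24) for n = 1 or n = 2."""
    if len(times) == 1:
        pts, h = midpoints(*intervals[0], steps)
        return sum(phi(times[0], a1, 0.0) for a1 in pts) * h
    (t1, t2), (I1, I2) = times, intervals
    pts1, h1 = midpoints(*I1, steps)
    pts2, h2 = midpoints(*I2, steps)
    total = 0.0
    for a1 in pts1:
        weight = phi(t1, a1, 0.0)
        total += weight * sum(phi(t2 - t1, a2, a1) for a2 in pts2)
    return total * h1 * h2


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


t1, t2 = 0.5, 1.5
I2 = (-0.3, 1.0)
whole_line = (-10.0, 10.0)                # truncation of R, an approximation
two_dim = p_tau((t1, t2), (whole_line, I2))
one_dim = p_tau((t2,), (I2,))
exact = normal_cdf(I2[1] / math.sqrt(t2)) - normal_cdf(I2[0] / math.sqrt(t2))
```

The three numbers `two_dim`, `one_dim` and `exact` agree up to the discretization error, which is the one-step case of the consistency asserted in the text.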

6. PROBLEMS

1. Let $F(x) = P(-\infty, x]$. Verify the following formulas:

$$P(a, b] = F(b) - F(a), \qquad P(a, b) = F(b-) - F(a),$$

$$P[a, b] = F(b) - F(a-), \qquad P[a, b) = F(b-) - F(a-),$$

$$P\{x\} = F(x) - F(x-),$$

where $F(x-) = \lim_{y \uparrow x} F(y)$.
2. Verify (7).

3. Prove Theorem 2.
4. Show that a distribution function $F = F(x)$ on $R$ has at most a countable set of
points of discontinuity. Does a corresponding result hold for distribution functions
on $R^n$?
5. Show that each of the functions

$$G(x, y) = \begin{cases} 1, & x + y \ge 0, \\ 0, & x + y < 0, \end{cases}$$

$$G(x, y) = [x + y], \ \text{the integral part of } x + y,$$

is continuous on the right, and nondecreasing in each argument, but is not a (generalized)
distribution function on $R^2$.
6. Let $\mu$ be the Lebesgue-Stieltjes measure generated by a continuous distribution
function. Show that if the set $A$ is at most countable, then $\mu(A) = 0$.
7. Let $c$ be the cardinal number of the continuum. Show that the cardinal number of the
collection of Borel sets in $R^n$ is $c$, whereas that of the collection of Lebesgue measurable sets is $2^c$.

8. Let $(\Omega, \mathscr{F}, P)$ be a probability space and $\mathscr{A}$ an algebra of subsets of $\Omega$ such that
$\sigma(\mathscr{A}) = \mathscr{F}$. Using the principle of appropriate sets, prove that for every $\varepsilon > 0$ and
$B \in \mathscr{F}$ there is a set $A \in \mathscr{A}$ such that

$$P(A \,\triangle\, B) \le \varepsilon.$$

9. Let $P$ be a probability measure on $(R^n, \mathscr{B}(R^n))$. Using Problem 8, show that, for
every $\varepsilon > 0$ and $B \in \mathscr{B}(R^n)$, there is a compact set $A \in \mathscr{B}(R^n)$ such that $A \subseteq B$
and

$$P(B \setminus A) \le \varepsilon.$$

(This was used in the proof of Theorem 3.)
10. Verify the consistency of the measure defined by (21).

4. Random Variables. I
1. Let $(\Omega, \mathscr{F})$ be a measurable space and let $(R, \mathscr{B}(R))$ be the real line with
the system $\mathscr{B}(R)$ of Borel sets.

Definition 1. A real function $\xi = \xi(\omega)$ defined on $(\Omega, \mathscr{F})$ is an $\mathscr{F}$-measurable
function, or a random variable, if

$$\{\omega : \xi(\omega) \in B\} \in \mathscr{F} \tag{1}$$

for every $B \in \mathscr{B}(R)$; or, equivalently, if the inverse image

$$\xi^{-1}(B) \equiv \{\omega : \xi(\omega) \in B\}$$

is a measurable set in $\Omega$.
When $(\Omega, \mathscr{F}) = (R^n, \mathscr{B}(R^n))$, the $\mathscr{B}(R^n)$-measurable functions are called
Borel functions.

The simplest example of a random variable is the indicator $I_A(\omega)$ of an
arbitrary (measurable) set $A \in \mathscr{F}$.
A random variable $\xi$ that has a representation

$$\xi(\omega) = \sum_{i=1}^{\infty} x_i I_{A_i}(\omega), \tag{2}$$

where $\sum A_i = \Omega$, $A_i \in \mathscr{F}$, is called discrete. If the sum in (2) is finite, the
random variable is called simple.
With the same interpretation as in §4 of Chapter I, we may say that a
random variable is a numerical property of an experiment, with a value
depending on "chance." Here the requirement (1) of measurability is fundamental, for the following reason. If a probability measure $P$ is defined on
$(\Omega, \mathscr{F})$, it then makes sense to speak of the probability of the event $\{\xi(\omega) \in B\}$
that the value of the random variable belongs to a Borel set $B$.
We introduce the following definitions.

Definition 2. A probability measure $P_\xi$ on $(R, \mathscr{B}(R))$ with

$$P_\xi(B) = P\{\omega : \xi(\omega) \in B\}, \qquad B \in \mathscr{B}(R),$$

is called the probability distribution of $\xi$ on $(R, \mathscr{B}(R))$.

Definition 3. The function

$$F_\xi(x) = P\{\omega : \xi(\omega) \le x\}, \qquad x \in R,$$

is called the distribution function of $\xi$.

For a discrete random variable the measure $P_\xi$ is concentrated on an at
most countable set and can be represented in the form

$$P_\xi(B) = \sum_{\{k :\, x_k \in B\}} p(x_k), \tag{3}$$

where $p(x_k) = P\{\xi = x_k\} = \Delta F_\xi(x_k)$.


The converse is evidently true: If $P_\xi$ is represented in the form (3) then $\xi$
is a discrete random variable.
A random variable $\xi$ is called continuous if its distribution function $F_\xi(x)$
is continuous for $x \in R$.
A random variable $\xi$ is called absolutely continuous if there is a nonnegative
function $f = f_\xi(x)$, called its density, such that

$$F_\xi(x) = \int_{-\infty}^{x} f_\xi(y)\, dy, \qquad x \in R, \tag{4}$$

(the integral can be taken in the Riemann sense, or more generally in that of
Lebesgue; see §6 below).
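For a discrete random variable the distribution, the distribution function and its jumps are related in a way that is easy to tabulate. The following minimal sketch uses a hypothetical law with three values; the dictionary `pmf` and the function names are assumptions made for illustration.

```python
from fractions import Fraction

# hypothetical discrete law: values x_k with probabilities p(x_k)
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}


def prob(B):
    """P_xi(B) as the sum of p(x_k) over x_k in B; B is given as a predicate."""
    return sum(p for x, p in pmf.items() if B(x))


def F(x):
    """Distribution function F_xi(x) = P{xi <= x}."""
    return prob(lambda y: y <= x)


def F_left(x):
    """Left limit F_xi(x-) = P{xi < x}."""
    return prob(lambda y: y < x)


def jump(x):
    """Jump of the distribution function at x: F_xi(x) - F_xi(x-)."""
    return F(x) - F_left(x)


jumps = {x: jump(x) for x in pmf}
```

The dictionary `jumps` reproduces `pmf`, in agreement with the relation $p(x_k) = \Delta F_\xi(x_k)$, and `F` is a right-continuous step function that is constant between the values $x_k$.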
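2. To establish that a function $\xi = \xi(\omega)$ is a random variable, we have to
verify property (1) for all sets $B \in \mathscr{B}(R)$. The following lemma shows that the
class of such "test" sets can be considerably narrowed.

Lemma 1. Let $\mathscr{E}$ be a system of sets such that $\sigma(\mathscr{E}) = \mathscr{B}(R)$. A necessary and
sufficient condition that a function $\xi = \xi(\omega)$ is $\mathscr{F}$-measurable is that

$$\{\omega : \xi(\omega) \in E\} \in \mathscr{F} \tag{5}$$

for all $E \in \mathscr{E}$.

PROOF. The necessity is evident. To prove the sufficiency we again use the
principle of appropriate sets.
Let $\mathscr{D}$ be the system of those Borel sets $D$ in $\mathscr{B}(R)$ for which $\xi^{-1}(D) \in \mathscr{F}$.
The operation "form the inverse image" is easily shown to preserve the set-theoretic operations of union, intersection and complement:

$$\xi^{-1}\Big(\bigcup_\alpha B_\alpha\Big) = \bigcup_\alpha \xi^{-1}(B_\alpha), \qquad \xi^{-1}\Big(\bigcap_\alpha B_\alpha\Big) = \bigcap_\alpha \xi^{-1}(B_\alpha), \qquad \xi^{-1}(\overline{B}_\alpha) = \overline{\xi^{-1}(B_\alpha)}. \tag{6}$$

It follows that $\mathscr{D}$ is a $\sigma$-algebra. Therefore

$$\mathscr{E} \subseteq \mathscr{D} \subseteq \mathscr{B}(R)$$

and

$$\sigma(\mathscr{E}) \subseteq \sigma(\mathscr{D}) = \mathscr{D} \subseteq \mathscr{B}(R).$$

But $\sigma(\mathscr{E}) = \mathscr{B}(R)$ and consequently $\mathscr{D} = \mathscr{B}(R)$.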

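Corollary. A necessary and sufficient condition for $\xi = \xi(\omega)$ to be a random
variable is that

$$\{\omega : \xi(\omega) < x\} \in \mathscr{F}$$

for every $x \in R$, or that

$$\{\omega : \xi(\omega) \le x\} \in \mathscr{F}$$

for every $x \in R$.

The proof is immediate, since each of the systems

$$\mathscr{E}_1 = \{x : x < c,\ c \in R\}, \qquad \mathscr{E}_2 = \{x : x \le c,\ c \in R\}$$

generates the $\sigma$-algebra $\mathscr{B}(R)$: $\sigma(\mathscr{E}_1) = \sigma(\mathscr{E}_2) = \mathscr{B}(R)$ (see §2).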


The following lemma makes it possible to construct random variables as
functions of other random variables.

Lemma 2. Let $\varphi = \varphi(x)$ be a Borel function and $\xi = \xi(\omega)$ a random variable.
Then the composition $\eta = \varphi \circ \xi$, i.e. the function $\eta(\omega) = \varphi(\xi(\omega))$, is also a
random variable.

The proof follows from the equations

$$\{\omega : \eta(\omega) \in B\} = \{\omega : \varphi(\xi(\omega)) \in B\} = \{\omega : \xi(\omega) \in \varphi^{-1}(B)\} \in \mathscr{F} \tag{7}$$

for $B \in \mathscr{B}(R)$, since $\varphi^{-1}(B) \in \mathscr{B}(R)$.
Therefore if $\xi$ is a random variable, so are, for example, $\xi^n$, $\xi^+ = \max(\xi, 0)$,
$\xi^- = -\min(\xi, 0)$, and $|\xi|$, since the functions $x^n$, $x^+$, $x^-$ and $|x|$ are Borel
functions (Problem 4).

3. Starting from a given collection of random variables $\{\xi_n\}$, we can construct
new functions, for example, $\sum_{k=1}^{\infty} |\xi_k|$, $\limsup \xi_n$, $\liminf \xi_n$, etc. Notice that in
general such functions take values on the extended real line $\bar{R} = [-\infty, \infty]$.
Hence it is advisable to extend the class of $\mathscr{F}$-measurable functions somewhat
by allowing them to take the values $\pm\infty$.

Definition 4. A function $\xi = \xi(\omega)$ defined on $(\Omega, \mathscr{F})$ with values in $\bar{R} = [-\infty, \infty]$
will be called an extended random variable if condition (1) is
satisfied for every Borel set $B \in \mathscr{B}(R)$.
The following theorem, despite its simplicity, is the key to the construction
of the Lebesgue integral (§6).

Theorem 1.
(a) For every random variable $\xi = \xi(\omega)$ (extended ones included) there is a
sequence of simple random variables $\xi_1, \xi_2, \ldots$, such that $|\xi_n| \le |\xi|$ and
$\xi_n(\omega) \to \xi(\omega)$, $n \to \infty$, for all $\omega \in \Omega$.
(b) If also $\xi(\omega) \ge 0$, there is a sequence of simple random variables $\xi_1, \xi_2, \ldots$,
such that $\xi_n(\omega) \uparrow \xi(\omega)$, $n \to \infty$, for all $\omega \in \Omega$.

PROOF. We begin by proving the second statement. For $n = 1, 2, \ldots$, put

$$\xi_n(\omega) = \sum_{k=1}^{n 2^n} \frac{k - 1}{2^n}\, I_{k,n}(\omega) + n\, I_{\{\xi(\omega) \ge n\}}(\omega),$$

where $I_{k,n}$ is the indicator of the set $\{(k - 1)/2^n \le \xi(\omega) < k/2^n\}$. It is easy
to verify that the sequence $\xi_n(\omega)$ so constructed is such that $\xi_n(\omega) \uparrow \xi(\omega)$
for all $\omega \in \Omega$. The first statement follows from this if we merely observe that
$\xi$ can be represented in the form $\xi = \xi^+ - \xi^-$. This completes the proof of
the theorem.
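The construction in the proof of part (b) acts on each value $\xi(\omega)$ separately: values below $n$ are rounded down to the nearest multiple of $2^{-n}$, and values of at least $n$ are replaced by $n$. A minimal sketch of this pointwise rule follows; the function name and the sample value are assumptions made for illustration.

```python
import math


def simple_approx(value, n):
    """n-th dyadic approximation xi_n(omega) of a nonnegative value xi(omega)."""
    if value >= n:
        return float(n)                   # the term n * I{xi >= n}
    # value lies in [(k-1)/2^n, k/2^n) for exactly one k; take the left endpoint
    return math.floor(value * 2 ** n) / 2 ** n


value = math.pi                           # sample value of xi(omega)
approximations = [simple_approx(value, n) for n in range(1, 12)]
capped = simple_approx(math.inf, 5)       # extended value +infinity is cut off at n
```

The list `approximations` is nondecreasing, never exceeds `value`, and from $n = 4$ on lies within $2^{-n}$ of it, which is the pointwise monotone convergence asserted in the theorem.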
We next show that the class of extended random variables is closed under
pointwise convergence. For this purpose, we note first that if $\xi_1, \xi_2, \ldots$ is a
sequence of extended random variables, then $\sup \xi_n$, $\inf \xi_n$, $\limsup \xi_n$ and $\liminf \xi_n$
are also random variables (possibly extended). This follows immediately from

$$\{\omega : \sup \xi_n > x\} = \bigcup_n \{\omega : \xi_n > x\} \in \mathscr{F},$$

$$\{\omega : \inf \xi_n < x\} = \bigcup_n \{\omega : \xi_n < x\} \in \mathscr{F},$$

and

$$\limsup \xi_n = \inf_n \sup_{m \ge n} \xi_m, \qquad \liminf \xi_n = \sup_n \inf_{m \ge n} \xi_m.$$

Theorem 2. Let $\xi_1, \xi_2, \ldots$ be a sequence of extended random variables and
$\xi(\omega) = \lim \xi_n(\omega)$. Then $\xi(\omega)$ is also an extended random variable.

The proof follows immediately from the remark above and the fact that

$$\{\omega : \xi(\omega) < x\} = \{\omega : \lim \xi_n(\omega) < x\}$$

$$= \{\omega : \limsup \xi_n(\omega) = \liminf \xi_n(\omega)\} \cap \{\limsup \xi_n(\omega) < x\}$$

$$= \Omega \cap \{\limsup \xi_n(\omega) < x\} = \{\limsup \xi_n(\omega) < x\} \in \mathscr{F}.$$

4. We mention a few more properties of the simplest functions of random
variables considered on the measurable space $(\Omega, \mathscr{F})$ and possibly taking
values on the extended real line $\bar{R} = [-\infty, \infty]$.†
If $\xi$ and $\eta$ are random variables, $\xi + \eta$, $\xi - \eta$, $\xi\eta$, and $\xi/\eta$ are also random
variables (assuming that they are defined, i.e. that no indeterminate forms like
$\infty - \infty$, $\infty/\infty$, $a/0$ occur).
In fact, let $\{\xi_n\}$ and $\{\eta_n\}$ be sequences of simple random variables converging to
$\xi$ and $\eta$ (see Theorem 1). Then

$$\xi_n \pm \eta_n \to \xi \pm \eta,$$

$$\xi_n \eta_n \to \xi \eta,$$

$$\frac{\xi_n}{\eta_n + \frac{1}{n} I_{\{\eta_n = 0\}}(\omega)} \to \frac{\xi}{\eta}.$$

The functions on the left-hand sides of these relations are simple random
variables. Therefore, by Theorem 2, the limit functions $\xi \pm \eta$, $\xi\eta$ and $\xi/\eta$
are also random variables.

5. Let $\xi$ be a random variable. Let us consider sets from $\mathscr{F}$ of the form
$\{\omega : \xi(\omega) \in B\}$, $B \in \mathscr{B}(R)$. It is easily verified that they form a $\sigma$-algebra,
called the $\sigma$-algebra generated by $\xi$, and denoted by $\mathscr{F}_\xi$.
If $\varphi$ is a Borel function, it follows from Lemma 2 that the function $\eta = \varphi \circ \xi$
is also a random variable, and in fact $\mathscr{F}_\xi$-measurable, i.e. such that

$$\{\omega : \eta(\omega) \in B\} \in \mathscr{F}_\xi, \qquad B \in \mathscr{B}(R)$$

(see (7)). It turns out that the converse is also true.

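Theorem 3. Let $\eta$ be an $\mathscr{F}_\xi$-measurable random variable. Then there is a Borel
function $\varphi$ such that $\eta = \varphi \circ \xi$, i.e. $\eta(\omega) = \varphi(\xi(\omega))$ for every $\omega \in \Omega$.

PROOF. Let $\Phi_\xi$ be the class of $\mathscr{F}_\xi$-measurable functions $\eta = \eta(\omega)$ and $\tilde{\Phi}_\xi$
the class of $\mathscr{F}_\xi$-measurable functions representable in the form $\varphi \circ \xi$, where
$\varphi$ is a Borel function. It is clear that $\tilde{\Phi}_\xi \subseteq \Phi_\xi$. The conclusion of the theorem
is that in fact $\tilde{\Phi}_\xi = \Phi_\xi$.
Let $A \in \mathscr{F}_\xi$ and $\eta(\omega) = I_A(\omega)$. Let us show that $\eta \in \tilde{\Phi}_\xi$. In fact, if $A \in \mathscr{F}_\xi$
there is a $B \in \mathscr{B}(R)$ such that $A = \{\omega : \xi(\omega) \in B\}$. Let

$$\chi_B(x) = \begin{cases} 1, & x \in B, \\ 0, & x \notin B. \end{cases}$$

Then $I_A(\omega) = \chi_B(\xi(\omega)) \in \tilde{\Phi}_\xi$. Hence it follows that every simple $\mathscr{F}_\xi$-measurable
function $\sum_{i=1}^{n} c_i I_{A_i}(\omega)$, $A_i \in \mathscr{F}_\xi$, also belongs to $\tilde{\Phi}_\xi$.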

† We shall assume the usual conventions about arithmetic operations in $\bar{R}$: if $a \in R$ then
$a \pm \infty = \pm\infty$, $a/\pm\infty = 0$; $a \cdot \infty = \infty$ if $a > 0$, and $a \cdot \infty = -\infty$ if $a < 0$; $0 \cdot (\pm\infty) = 0$,
$\infty + \infty = \infty$, $-\infty - \infty = -\infty$.

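Now let $\eta$ be an arbitrary $\mathscr{F}_\xi$-measurable function. By Theorem 1 there
is a sequence of simple $\mathscr{F}_\xi$-measurable functions $\{\eta_n\}$ such that $\eta_n(\omega) \to \eta(\omega)$,
$n \to \infty$, $\omega \in \Omega$. As we just showed, there are Borel functions $\varphi_n = \varphi_n(x)$ such
that $\eta_n(\omega) = \varphi_n(\xi(\omega))$. Then $\varphi_n(\xi(\omega)) \to \eta(\omega)$, $n \to \infty$, $\omega \in \Omega$.
Let $B$ denote the set $\{x \in R : \lim_n \varphi_n(x) \text{ exists}\}$. This is a Borel set. Therefore

$$\varphi(x) = \begin{cases} \lim_n \varphi_n(x), & x \in B, \\ 0, & x \notin B \end{cases}$$

is also a Borel function (see Problem 7).
But then it is evident that $\eta(\omega) = \lim_n \varphi_n(\xi(\omega)) = \varphi(\xi(\omega))$ for all $\omega \in \Omega$.
Consequently $\tilde{\Phi}_\xi = \Phi_\xi$.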

6. Let us consider a probability space $(\Omega, \mathscr{F}, P)$, where the $\sigma$-algebra $\mathscr{F}$ is
generated by a finite or countably infinite decomposition $\mathscr{D} = \{D_1, D_2, \ldots\}$,
$\sum D_i = \Omega$, $P(D_i) > 0$. Here we shall suppose that the $D_i$ are atoms with
respect to $P$, i.e. if $A \subseteq D_i$, $A \in \mathscr{F}$, then either $P(A) = 0$ or $P(D_i \setminus A) = 0$.

Lemma 3. Let $\xi$ be an $\mathscr{F}$-measurable function, where $\mathscr{F} = \sigma(\mathscr{D})$. Then $\xi$ is
constant on the atoms of the decomposition, i.e. $\xi$ has the representation

$$\xi(\omega) = \sum_{k=1}^{\infty} x_k I_{D_k}(\omega) \qquad (P\text{-a.s.}).\text{†} \tag{8}$$

(The notation "$\xi = \eta$ ($P$-a.s.)" means that $P(\xi \ne \eta) = 0$.)

PROOF. Let $D$ be an atom of the decomposition with respect to $P$. Let us
show that $\xi$ is constant ($P$-a.s.), i.e. $P\{D \cap (\xi \ne \text{const})\} = 0$.
Let $K = \sup\{x \in R : P\{D \cap (\xi < x)\} = 0\}$. Then

$$P\{D \cap (\xi < K)\} = P\Big[ \bigcup_{\substack{r < K \\ \text{rational } r}} \{\omega \in D : \xi(\omega) < r\} \Big] = 0,$$

since if $P\{D \cap (\xi < x)\} = 0$, then also $P\{D \cap (\xi < y)\} = 0$ for all $y \le x$.
Let $x > K$; then $P\{D \cap (\xi < x)\} > 0$ and therefore $P\{D \cap (\xi \ge x)\} = 0$,
since $D$ is an atom. Therefore

$$P\{D \cap (\xi > K)\} = P\Big[ \bigcup_{\substack{r > K \\ \text{rational } r}} \{\omega \in D : \xi \ge r\} \Big] = 0.$$

Thus

$$P\{D \cap (\xi > K)\} = P\{D \cap (\xi < K)\} = 0$$

and therefore $P\{D \cap (\xi \ne K)\} = 0$.
Then (8) follows in general since $\sum D_i = \Omega$. This completes the proof of
the lemma.

† a.s. = almost surely.

7. PROBLEMS

1. Show that the random variable $\xi$ is continuous if and only if $P(\xi = x) = 0$ for all
$x \in R$.

2. If $|\xi|$ is $\mathscr{F}$-measurable, is it true that $\xi$ is also $\mathscr{F}$-measurable?

3. Show that $\xi = \xi(\omega)$ is an extended random variable if and only if $\{\omega : \xi(\omega) \in B\} \in \mathscr{F}$
for all $B \in \mathscr{B}(\bar{R})$.

4. Prove that $x^n$, $x^+ = \max(x, 0)$, $x^- = -\min(x, 0)$, and $|x| = x^+ + x^-$ are Borel
functions.

5. If $\xi$ and $\eta$ are $\mathscr{F}$-measurable, then $\{\omega : \xi(\omega) = \eta(\omega)\} \in \mathscr{F}$.

6. Let $\xi$ and $\eta$ be random variables on $(\Omega, \mathscr{F})$, and $A \in \mathscr{F}$. Then the function

$$\zeta(\omega) = \xi(\omega) I_A + \eta(\omega) I_{\bar{A}}$$

is also a random variable.

7. Let $\xi_1, \ldots, \xi_n$ be random variables and $\varphi(x_1, \ldots, x_n)$ a Borel function. Show that
$\varphi(\xi_1(\omega), \ldots, \xi_n(\omega))$ is also a random variable.

8. Let $\xi$ and $\eta$ be random variables, both taking the values $1, 2, \ldots, N$. Suppose that
$\mathscr{F}_\xi = \mathscr{F}_\eta$. Show that there is a permutation $(i_1, i_2, \ldots, i_N)$ of $(1, 2, \ldots, N)$ such that
$\{\omega : \xi = j\} = \{\omega : \eta = i_j\}$ for $j = 1, 2, \ldots, N$.

5. Random Elements
1. In addition to random variables, probability theory and its applications
involve random objects of more general kinds, for example random points,
vectors, functions, processes, fields, sets, measures, etc. In this connection it is
desirable to have the concept of a random object of any kind.

Definition 1. Let $(\Omega, \mathscr{F})$ and $(E, \mathscr{E})$ be measurable spaces. We say that a
function $X = X(\omega)$, defined on $\Omega$ and taking values in $E$, is $\mathscr{F}/\mathscr{E}$-measurable,
or is a random element (with values in $E$), if

$$\{\omega : X(\omega) \in B\} \in \mathscr{F} \tag{1}$$

for every $B \in \mathscr{E}$. Random elements (with values in $E$) are sometimes called
$E$-valued random variables.
Let us consider some special cases.
If $(E, \mathscr{E}) = (R, \mathscr{B}(R))$, the definition of a random element is the same as the
definition of a random variable (§4).
Let (E, cC) = (R", f!I(R")). Then a random element X(m) is a "random
point" in R". If nk is the projection of R" on the kth coordinate axis, X(m) can

175

5. Random Elements

be represented in the form


(2)
where ~k = nk o X.
It follows from (1) that
BE ~(R) we have

{w: ~k(w) E B}

~k

is an ordinary random variable. In fact, for

{w: ~ 1 (w) E R, ... , ~k- 1 ER, ~k E B, ~k+ 1 E R, .. .}


R x B x R x .. x R)} E ~

= {w: X(w) E (R x .. x

since R x x R x B x R x x R E ~(R").

Definition 2. An ordered set $(\eta_1(\omega), \ldots, \eta_n(\omega))$ of random variables is called an n-dimensional random vector.
According to this definition, every random element $X(\omega)$ with values in $R^n$ is an n-dimensional random vector. The converse is also true: every random vector $X(\omega) = (\xi_1(\omega), \ldots, \xi_n(\omega))$ is a random element in $R^n$. In fact, if $B_k \in \mathcal{B}(R)$, $k = 1, \ldots, n$, then

$$\{\omega : X(\omega) \in (B_1 \times \cdots \times B_n)\} = \bigcap_{k=1}^{n} \{\omega : \xi_k(\omega) \in B_k\} \in \mathcal{F}.$$

But $\mathcal{B}(R^n)$ is the smallest $\sigma$-algebra containing the sets $B_1 \times \cdots \times B_n$. Consequently we find immediately, by an evident generalization of Lemma 1 of §4, that whenever $B \in \mathcal{B}(R^n)$, the set $\{\omega : X(\omega) \in B\}$ belongs to $\mathcal{F}$.
Let $(E, \mathcal{E}) = (Z, \mathcal{B}(Z))$, where $Z$ is the set of complex numbers $x + iy$, $x, y \in R$, and $\mathcal{B}(Z)$ is the smallest $\sigma$-algebra containing the sets $\{z : z = x + iy,\ a_1 < x \le b_1,\ a_2 < y \le b_2\}$. It follows from the discussion above that a complex-valued random variable $Z(\omega)$ can be represented as $Z(\omega) = X(\omega) + iY(\omega)$, where $X(\omega)$ and $Y(\omega)$ are random variables. Hence we may also call $Z(\omega)$ a complex random variable.
Let $(E, \mathcal{E}) = (R^T, \mathcal{B}(R^T))$, where $T$ is a subset of the real line. In this case every random element $X = X(\omega)$ can evidently be represented as $X = (\xi_t)_{t \in T}$ with $\xi_t = \pi_t \circ X$, and is called a random function with time domain $T$.

Definition 3. Let $T$ be a subset of the real line. A set of random variables $X = (\xi_t)_{t \in T}$ is called a random process† with time domain $T$.
If $T = \{1, 2, \ldots\}$ we call $X = (\xi_1, \xi_2, \ldots)$ a random process with discrete time, or a random sequence.
If $T = [0, 1]$, $(-\infty, \infty)$, $[0, \infty)$, ..., we call $X = (\xi_t)_{t \in T}$ a random process with continuous time.

† Or stochastic process (Translator).

It is easy to show, by using the structure of the $\sigma$-algebra $\mathcal{B}(R^T)$ (§2), that every random process $X = (\xi_t)_{t \in T}$ (in the sense of Definition 3) is also a random function on the space $(R^T, \mathcal{B}(R^T))$.

Definition 4. Let $X = (\xi_t)_{t \in T}$ be a random process. For each given $\omega \in \Omega$ the function $(\xi_t(\omega))_{t \in T}$ is said to be a realization or a trajectory of the process, corresponding to the outcome $\omega$.
The following definition is a natural generalization of Definition 2 of §4.

Definition 5. Let $X = (\xi_t)_{t \in T}$ be a random process. The probability measure $P_X$ on $(R^T, \mathcal{B}(R^T))$ defined by

$$P_X(B) = P\{\omega : X(\omega) \in B\}, \qquad B \in \mathcal{B}(R^T),$$

is called the probability distribution of $X$. The probabilities

$$P_{t_1, \ldots, t_n}(B) = P\{\omega : (\xi_{t_1}, \ldots, \xi_{t_n}) \in B\}$$

with $t_1 < t_2 < \cdots < t_n$, $t_i \in T$, are called finite-dimensional probabilities (or probability distributions). The functions

$$F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = P\{\omega : \xi_{t_1} \le x_1, \ldots, \xi_{t_n} \le x_n\}$$

with $t_1 < t_2 < \cdots < t_n$, $t_i \in T$, are called finite-dimensional distribution functions.

Let $(E, \mathcal{E}) = (C, \mathcal{B}_0(C))$, where $C$ is the space of continuous functions $x = (x_t)_{t \in T}$ on $T = [0, 1]$ and $\mathcal{B}_0(C)$ is the $\sigma$-algebra generated by the open sets (§2). We show that every random element $X$ on $(C, \mathcal{B}_0(C))$ is also a random process with continuous trajectories in the sense of Definition 3.
In fact, according to §2 the set $A = \{x \in C : x_t < a\}$ is open, and hence belongs to $\mathcal{B}_0(C)$. Therefore

$$\{\omega : \xi_t(\omega) < a\} = \{\omega : X(\omega) \in A\} \in \mathcal{F}.$$

On the other hand, let $X = (\xi_t(\omega))_{t \in T}$ be a random process (in the sense of Definition 3) whose trajectories are continuous functions for every $\omega \in \Omega$. According to (2.14),

$$\{x \in C : x \in S_\rho(x^0)\} = \bigcap_{t_k} \{x \in C : |x_{t_k} - x^0_{t_k}| < \rho\},$$

where $t_k$ are the rational points of $[0, 1]$. Therefore

$$\{\omega : X(\omega) \in S_\rho(X^0(\omega))\} = \bigcap_{t_k} \{\omega : |\xi_{t_k}(\omega) - \xi^0_{t_k}(\omega)| < \rho\} \in \mathcal{F},$$

and therefore we also have $\{\omega : X(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}_0(C)$.

Similar reasoning will show that every random element of the space $(D, \mathcal{B}_0(D))$ can be considered as a random process with trajectories in the space of functions with no discontinuities of the second kind; and conversely.

2. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(E_\alpha, \mathcal{E}_\alpha)$ measurable spaces, where $\alpha$ belongs to an (arbitrary) set $\mathfrak{A}$.

Definition 6. We say that the $\mathcal{F}/\mathcal{E}_\alpha$-measurable functions $(X_\alpha(\omega))$, $\alpha \in \mathfrak{A}$, are independent (or collectively independent) if, for every finite set of indices $\alpha_1, \ldots, \alpha_n$ the random elements $X_{\alpha_1}, \ldots, X_{\alpha_n}$ are independent, i.e.

$$P(X_{\alpha_1} \in B_{\alpha_1}, \ldots, X_{\alpha_n} \in B_{\alpha_n}) = P(X_{\alpha_1} \in B_{\alpha_1}) \cdots P(X_{\alpha_n} \in B_{\alpha_n}), \tag{3}$$

where $B_\alpha \in \mathcal{E}_\alpha$.

Let $\mathfrak{A} = \{1, 2, \ldots, n\}$, let $\xi_\alpha$, $\alpha \in \mathfrak{A}$, be random variables, and let

$$F_\xi(x_1, \ldots, x_n) = P(\xi_1 \le x_1, \ldots, \xi_n \le x_n)$$

be the n-dimensional distribution function of the random vector $\xi = (\xi_1, \ldots, \xi_n)$. Let $F_{\xi_i}(x_i)$ be the distribution functions of the random variables $\xi_i$, $i = 1, \ldots, n$.

Theorem. A necessary and sufficient condition for the random variables $\xi_1, \ldots, \xi_n$ to be independent is that

$$F_\xi(x_1, \ldots, x_n) = F_{\xi_1}(x_1) \cdots F_{\xi_n}(x_n) \tag{4}$$

for all $(x_1, \ldots, x_n) \in R^n$.
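As an illustration of criterion (4), here is a minimal sketch on a finite sample space. The two-dice model, the dependent counterexample, and all function names are assumptions made for this example, not part of the text. Because the distribution functions here are step functions with jumps only at integers, checking (4) on an integer grid is enough.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def joint_cdf(law, x1, x2):
    """F_xi(x1, x2) = P(xi_1 <= x1, xi_2 <= x2) for a law on pairs."""
    return sum(p for (i, j), p in law.items() if i <= x1 and j <= x2)

def marginal_cdf(law, index, x):
    """Distribution function of the coordinate with the given index."""
    return sum(p for w, p in law.items() if w[index] <= x)

def factorizes(law, grid):
    """Check condition (4) at every grid point."""
    return all(
        joint_cdf(law, x1, x2) == marginal_cdf(law, 0, x1) * marginal_cdf(law, 1, x2)
        for x1, x2 in product(grid, grid)
    )

grid = range(0, 8)

# Assumed model 1: two independent fair dice (product measure).
independent_law = {(i, j): Fraction(1, 36) for i, j in product(FACES, FACES)}
# Assumed model 2: xi_2 = xi_1, so all mass sits on the diagonal.
dependent_law = {(i, i): Fraction(1, 6) for i in FACES}

independent_case = factorizes(independent_law, grid)
dependent_case = factorizes(dependent_law, grid)
print(independent_case, dependent_case)
```

Exact rational arithmetic is used so that the equality in (4) can be tested without rounding tolerances.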

PROOF. The necessity is evident. To prove the sufficiency we put

$$a = (a_1, \ldots, a_n), \qquad b = (b_1, \ldots, b_n),$$
$$P_\xi(a, b] = P\{\omega : a_1 < \xi_1 \le b_1, \ldots, a_n < \xi_n \le b_n\},$$
$$P_{\xi_i}(a_i, b_i] = P\{a_i < \xi_i \le b_i\}.$$

Then

$$P_\xi(a, b] = \prod_{i=1}^{n} [F_{\xi_i}(b_i) - F_{\xi_i}(a_i)] = \prod_{i=1}^{n} P_{\xi_i}(a_i, b_i]$$

by (4) and (3.7), and therefore

$$P\{\xi_1 \in I_1, \ldots, \xi_n \in I_n\} = \prod_{i=1}^{n} P\{\xi_i \in I_i\}, \tag{5}$$

where $I_i = (a_i, b_i]$.

We fix $I_2, \ldots, I_n$ and show that

$$P\{\xi_1 \in B_1, \xi_2 \in I_2, \ldots, \xi_n \in I_n\} = P\{\xi_1 \in B_1\} \prod_{i=2}^{n} P\{\xi_i \in I_i\} \tag{6}$$

for all $B_1 \in \mathcal{B}(R)$. Let $\mathcal{M}$ be the collection of sets in $\mathcal{B}(R)$ for which (6) holds. Then $\mathcal{M}$ evidently contains the algebra $\mathcal{A}$ of sets consisting of sums of disjoint intervals of the form $I_1 = (a_1, b_1]$. Hence $\mathcal{A} \subseteq \mathcal{M} \subseteq \mathcal{B}(R)$. From the countable additivity (and therefore continuity) of probability measures it also follows that $\mathcal{M}$ is a monotonic class. Therefore (see Subsection 1 of §2)

$$\mu(\mathcal{A}) \subseteq \mathcal{M} \subseteq \mathcal{B}(R).$$

But $\mu(\mathcal{A}) = \sigma(\mathcal{A}) = \mathcal{B}(R)$ by Theorem 1 of §2. Therefore $\mathcal{M} = \mathcal{B}(R)$.

Thus (6) is established. Now fix $B_1, I_3, \ldots, I_n$; by the same method we can establish (6) with $I_2$ replaced by the Borel set $B_2$. Continuing in this way, we can evidently arrive at the required equation,

$$P(\xi_1 \in B_1, \ldots, \xi_n \in B_n) = P(\xi_1 \in B_1) \cdots P(\xi_n \in B_n),$$

where $B_i \in \mathcal{B}(R)$. This completes the proof of the theorem.

3. PROBLEMS

1. Let $\xi_1, \ldots, \xi_n$ be discrete random variables. Show that they are independent if and only if

$$P(\xi_1 = x_1, \ldots, \xi_n = x_n) = \prod_{i=1}^{n} P(\xi_i = x_i)$$

for all real $x_1, \ldots, x_n$.

2. Carry out the proof that every random function (in the sense of Definition 1) is a random process (in the sense of Definition 3) and conversely.
3. Let $X_1, \ldots, X_n$ be random elements with values in $(E_1, \mathcal{E}_1), \ldots, (E_n, \mathcal{E}_n)$, respectively. In addition let $(E_1', \mathcal{E}_1'), \ldots, (E_n', \mathcal{E}_n')$ be measurable spaces and let $g_1, \ldots, g_n$ be $\mathcal{E}_1/\mathcal{E}_1', \ldots, \mathcal{E}_n/\mathcal{E}_n'$-measurable functions, respectively. Show that if $X_1, \ldots, X_n$ are independent, the random elements $g_1 \circ X_1, \ldots, g_n \circ X_n$ are also independent.

§6. Lebesgue Integral. Expectation


1. When $(\Omega, \mathcal{F}, P)$ is a finite probability space and $\xi = \xi(\omega)$ is a simple random variable,

$$\xi(\omega) = \sum_{k=1}^{n} x_k I_{A_k}(\omega), \tag{1}$$

the expectation $E\xi$ was defined in §4 of Chapter I. The same definition of the expectation $E\xi$ of a simple random variable $\xi$ can be used for any probability space $(\Omega, \mathcal{F}, P)$. That is, we define

$$E\xi = \sum_{k=1}^{n} x_k P(A_k). \tag{2}$$

This definition is consistent (in the sense that $E\xi$ is independent of the particular representation of $\xi$ in the form (1)), as can be shown just as for finite probability spaces. The simplest properties of the expectation can be established similarly (see Subsection 5 of §4 of Chapter I).
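The consistency of definition (2) can be checked by hand on a small example. The sketch below assumes a six-point sample space with uniform probabilities (a hypothetical choice) and evaluates (2) for two different representations of the same simple random variable.

```python
from fractions import Fraction

# Assumed sample space: Omega = {0, ..., 5} with the uniform measure.
OMEGA = range(6)
P = {w: Fraction(1, 6) for w in OMEGA}

def prob(event):
    """P(A) for a set A of sample points."""
    return sum(P[w] for w in event)

def expectation(representation):
    """Formula (2): sum of x_k * P(A_k) over pairs (x_k, A_k)."""
    return sum(x * prob(a) for x, a in representation)

# The same xi written in two ways: xi = 1 on {0,1,2} and xi = 3 on {3,4,5}.
coarse = [(1, {0, 1, 2}), (3, {3, 4, 5})]
fine = [(1, {0}), (1, {1, 2}), (3, {3, 4}), (3, {5})]

e_coarse = expectation(coarse)
e_fine = expectation(fine)
print(e_coarse, e_fine)
```

Both representations use a decomposition of the sample space into disjoint sets, as in (1); only the fineness of the decomposition differs.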
In the present section we shall define and study the properties of the expectation $E\xi$ of an arbitrary random variable. In the language of analysis, $E\xi$ is merely the Lebesgue integral of the $\mathcal{F}$-measurable function $\xi = \xi(\omega)$ with respect to the measure $P$. In addition to $E\xi$ we shall use the notation $\int_\Omega \xi(\omega) P(d\omega)$ or $\int_\Omega \xi \, dP$.
Let $\xi = \xi(\omega)$ be a nonnegative random variable. We construct a sequence of simple nonnegative random variables $\{\xi_n\}_{n \ge 1}$ such that $\xi_n(\omega) \uparrow \xi(\omega)$, $n \to \infty$, for each $\omega \in \Omega$ (see Theorem 1 in §4).
Since $E\xi_n \le E\xi_{n+1}$ (cf. Property 3) of Subsection 5, §4, Chapter I), the limit $\lim_n E\xi_n$ exists, possibly with the value $+\infty$.
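To see the monotone approximation at work, consider the following sketch. It assumes the concrete example $\xi(\omega) = \omega^2$ on $[0, 1]$ with Lebesgue measure and the dyadic approximations $\xi_n = 2^{-n} \lfloor 2^n \xi \rfloor$ (capped at $n$, which is immaterial here since $\xi \le 1$); both choices are illustrative assumptions. The expectations $E\xi_n$ are computed from the exact interval lengths and increase to $1/3$.

```python
import math

def expectation_of_approximation(n):
    """E xi_n for xi(w) = w**2 on ([0, 1], Lebesgue measure).

    xi_n takes the value (k - 1) / 2**n on the set
    {(k - 1) / 2**n <= w**2 < k / 2**n}, whose Lebesgue measure is
    sqrt(k / 2**n) - sqrt((k - 1) / 2**n).
    """
    total = 0.0
    for k in range(1, 2**n + 1):
        level = (k - 1) / 2**n
        mass = math.sqrt(k / 2**n) - math.sqrt(level)
        total += level * mass
    return total

values = [expectation_of_approximation(n) for n in range(1, 11)]
for n, v in enumerate(values, start=1):
    print(n, round(v, 6))
```

Since $0 \le \xi - \xi_n \le 2^{-n}$ pointwise, the gap between $E\xi_n$ and $1/3$ is at most $2^{-n}$.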

Definition 1. The Lebesgue integral of the nonnegative random variable $\xi = \xi(\omega)$, or its expectation, is

$$E\xi = \lim_n E\xi_n. \tag{3}$$

To see that this definition is consistent, we need to show that the limit is independent of the choice of the approximating sequence $\{\xi_n\}$. In other words, we need to show that if $\xi_n \uparrow \xi$ and $\eta_m \uparrow \xi$, where $\{\eta_m\}$ is a sequence of simple functions, then

$$\lim_n E\xi_n = \lim_m E\eta_m. \tag{4}$$

Lemma 1. Let $\eta$ and $\xi_n$ be simple random variables, $n \ge 1$, and

$$\xi_n \uparrow \xi \ge \eta.$$

Then

$$\lim_n E\xi_n \ge E\eta. \tag{5}$$

PROOF. Let $\varepsilon > 0$ and

$$A_n = \{\omega : \xi_n \ge \eta - \varepsilon\}.$$

It is clear that $A_n \uparrow \Omega$ and

$$\xi_n = \xi_n I_{A_n} + \xi_n I_{\bar A_n} \ge \xi_n I_{A_n} \ge (\eta - \varepsilon) I_{A_n}.$$

Hence by the properties of the expectations of simple random variables we find that

$$E\xi_n \ge E(\eta - \varepsilon) I_{A_n} = E\eta I_{A_n} - \varepsilon P(A_n) = E\eta - E\eta I_{\bar A_n} - \varepsilon P(A_n) \ge E\eta - C P(\bar A_n) - \varepsilon,$$

where $C = \max_\omega \eta(\omega)$. Since $P(\bar A_n) \to 0$ and $\varepsilon$ is arbitrary, the required inequality (5) follows.
It follows from this lemma that $\lim_n E\xi_n \ge \lim_m E\eta_m$ and by symmetry $\lim_m E\eta_m \ge \lim_n E\xi_n$, which proves (4).
The following remark is often useful.

Remark 1. The expectation $E\xi$ of the nonnegative random variable $\xi$ satisfies

$$E\xi = \sup_{\{s \in S :\, s \le \xi\}} Es, \tag{6}$$

where $S = \{s\}$ is a set of simple random variables (Problem 1).

Thus the expectation is well defined for nonnegative random variables. We now consider the general case.
Let $\xi$ be a random variable and $\xi^+ = \max(\xi, 0)$, $\xi^- = -\min(\xi, 0)$.

Definition 2. We say that the expectation $E\xi$ of the random variable $\xi$ exists, or is defined, if at least one of $E\xi^+$ and $E\xi^-$ is finite:

$$\min(E\xi^+, E\xi^-) < \infty.$$

In this case we define

$$E\xi = E\xi^+ - E\xi^-.$$

The expectation $E\xi$ is also called the Lebesgue integral (of the function $\xi$ with respect to the probability measure $P$).

Definition 3. We say that the expectation of $\xi$ is finite if $E\xi^+ < \infty$ and $E\xi^- < \infty$.

Since $|\xi| = \xi^+ + \xi^-$, the finiteness of $E\xi$, or $|E\xi| < \infty$, is equivalent to $E|\xi| < \infty$. (In this sense one says that the Lebesgue integral is absolutely convergent.)

Remark 2. In addition to the expectation $E\xi$, significant numerical characteristics of a random variable $\xi$ are the numbers $E\xi^r$ (if defined) and $E|\xi|^r$, $r > 0$, which are known as the moment of order $r$ (or $r$th moment) and the absolute moment of order $r$ (or absolute $r$th moment) of $\xi$.

Remark 3. In the definition of the Lebesgue integral $\int_\Omega \xi(\omega) P(d\omega)$ given above, we suppose that $P$ was a probability measure ($P(\Omega) = 1$) and that the $\mathcal{F}$-measurable functions (random variables) $\xi$ had values in $R = (-\infty, \infty)$. Suppose now that $\mu$ is any measure defined on a measurable space $(\Omega, \mathcal{F})$ and possibly taking the value $+\infty$, and that $\xi = \xi(\omega)$ is an $\mathcal{F}$-measurable function with values in $\bar R = [-\infty, \infty]$ (an extended random variable). In this case the Lebesgue integral $\int_\Omega \xi(\omega) \mu(d\omega)$ is defined in the same way: first, for nonnegative simple $\xi$ (by (2) with $P$ replaced by $\mu$), then for arbitrary nonnegative $\xi$, and in general by the formula

$$\int_\Omega \xi(\omega) \mu(d\omega) = \int_\Omega \xi^+ \mu(d\omega) - \int_\Omega \xi^- \mu(d\omega),$$

provided that no indeterminacy of the form $\infty - \infty$ arises.

A case that is particularly important for mathematical analysis is that in which $(\Omega, \mathcal{F}) = (R, \mathcal{B}(R))$ and $\mu$ is Lebesgue measure. In this case the integral $\int_R \xi(x) \mu(dx)$ is written $\int_R \xi(x)\,dx$, or $\int_{-\infty}^{\infty} \xi(x)\,dx$, or (L) $\int_{-\infty}^{\infty} \xi(x)\,dx$ to emphasize its difference from the Riemann integral (R) $\int_{-\infty}^{\infty} \xi(x)\,dx$. If the measure $\mu$ (Lebesgue-Stieltjes) corresponds to a generalized distribution function $G = G(x)$, the integral $\int_R \xi(x) \mu(dx)$ is also called a Lebesgue-Stieltjes integral and is denoted by (L-S) $\int_R \xi(x) G(dx)$, a notation that distinguishes it from the corresponding Riemann-Stieltjes integral

$$\text{(R-S)} \int_R \xi(x) G(dx)$$

(see Subsection 10 below).

It will be clear from what follows (Property D) that if $E\xi$ is defined then so is the expectation $E(\xi I_A)$ for every $A \in \mathcal{F}$. The notations $E(\xi; A)$ or $\int_A \xi \, dP$ are often used for $E(\xi I_A)$ or its equivalent, $\int_\Omega \xi I_A \, dP$. The integral $\int_A \xi \, dP$ is called the Lebesgue integral of $\xi$ with respect to $P$ over the set $A$.
Similarly, we write $\int_A \xi \, d\mu$ instead of $\int_\Omega \xi I_A \, d\mu$ for an arbitrary measure $\mu$. In particular, if $\mu$ is an n-dimensional Lebesgue-Stieltjes measure, and $A = (a_1, b_1] \times \cdots \times (a_n, b_n]$, we use the notation

$$\int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} \xi(x_1, \ldots, x_n)\, \mu(dx_1, \ldots, dx_n) \quad \text{instead of} \quad \int_A \xi \, d\mu.$$

If $\mu$ is Lebesgue measure, we write simply $dx_1 \cdots dx_n$ instead of $\mu(dx_1, \ldots, dx_n)$.

2. Properties of the expectation $E\xi$ of the random variable $\xi$.

A. Let $c$ be a constant and let $E\xi$ exist. Then $E(c\xi)$ exists and

$$E(c\xi) = cE\xi.$$

B. Let $\xi \le \eta$; then

$$E\xi \le E\eta$$

with the understanding that

if $-\infty < E\xi$ then $-\infty < E\eta$ and $E\xi \le E\eta$

or

if $E\eta < \infty$ then $E\xi < \infty$ and $E\xi \le E\eta$.

C. If $E\xi$ exists then

$$|E\xi| \le E|\xi|.$$

D. If $E\xi$ exists then $E(\xi I_A)$ exists for each $A \in \mathcal{F}$; if $E\xi$ is finite, $E(\xi I_A)$ is finite.

E. If $\xi$ and $\eta$ are nonnegative random variables, or such that $E|\xi| < \infty$ and $E|\eta| < \infty$, then

$$E(\xi + \eta) = E\xi + E\eta.$$

(See Problem 2 for a generalization.)

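Let us establish A-E.

A. This is obvious for simple random variables. Let $\xi \ge 0$, $\xi_n \uparrow \xi$, where $\xi_n$ are simple random variables and $c \ge 0$. Then $c\xi_n \uparrow c\xi$ and therefore

$$E(c\xi) = \lim E(c\xi_n) = c \lim E\xi_n = cE\xi.$$

In the general case we need to use the representation $\xi = \xi^+ - \xi^-$ and notice that $(c\xi)^+ = c\xi^+$, $(c\xi)^- = c\xi^-$ when $c \ge 0$, whereas when $c < 0$, $(c\xi)^+ = -c\xi^-$, $(c\xi)^- = -c\xi^+$.

B. If $0 \le \xi \le \eta$, then $E\xi$ and $E\eta$ are defined and the inequality $E\xi \le E\eta$ follows directly from (6). Now let $E\xi > -\infty$; then $E\xi^- < \infty$. If $\xi \le \eta$, we have $\xi^+ \le \eta^+$ and $\xi^- \ge \eta^-$. Therefore $E\eta^- \le E\xi^- < \infty$; consequently $E\eta$ is defined and $E\xi = E\xi^+ - E\xi^- \le E\eta^+ - E\eta^- = E\eta$. The case when $E\eta < \infty$ can be discussed similarly.

C. Since $-|\xi| \le \xi \le |\xi|$, Properties A and B imply

$$-E|\xi| \le E\xi \le E|\xi|,$$

i.e. $|E\xi| \le E|\xi|$.

D. This follows from B and

$$(\xi I_A)^+ = \xi^+ I_A \le \xi^+, \qquad (\xi I_A)^- = \xi^- I_A \le \xi^-.$$

E. Let $\xi \ge 0$, $\eta \ge 0$, and let $\{\xi_n\}$ and $\{\eta_n\}$ be sequences of simple functions such that $\xi_n \uparrow \xi$ and $\eta_n \uparrow \eta$. Then $E(\xi_n + \eta_n) = E\xi_n + E\eta_n$ and

$$E(\xi_n + \eta_n) \uparrow E(\xi + \eta), \qquad E\xi_n \uparrow E\xi, \qquad E\eta_n \uparrow E\eta,$$

and therefore $E(\xi + \eta) = E\xi + E\eta$. The case when $E|\xi| < \infty$ and $E|\eta| < \infty$ reduces to this if we use the facts that

$$\xi + \eta = (\xi^+ + \eta^+) - (\xi^- + \eta^-), \qquad \xi^+ + \eta^+ \le |\xi| + |\eta|,$$

and

$$\xi^- + \eta^- \le |\xi| + |\eta|.$$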

The following group of statements about expectations involve the notion of "P-almost surely." We say that a property holds "P-almost surely" if there is a set $\mathcal{N} \in \mathcal{F}$ with $P(\mathcal{N}) = 0$ such that the property holds for every point $\omega$ of $\Omega \setminus \mathcal{N}$. Instead of "P-almost surely" we often say "P-almost everywhere" or simply "almost surely" (a.s.) or "almost everywhere" (a.e.).

F. If $\xi = 0$ (a.s.) then $E\xi = 0$.

In fact, if $\xi$ is a simple random variable, $\xi = \sum x_k I_{A_k}(\omega)$ and $x_k \ne 0$, we have $P(A_k) = 0$ by hypothesis and therefore $E\xi = 0$. If $\xi \ge 0$ and $0 \le s \le \xi$, where $s$ is a simple random variable, then $s = 0$ (a.s.) and consequently $Es = 0$ and $E\xi = \sup_{\{s \in S :\, s \le \xi\}} Es = 0$. The general case follows from this by means of the representation $\xi = \xi^+ - \xi^-$ and the facts that $\xi^+ \le |\xi|$, $\xi^- \le |\xi|$, and $|\xi| = 0$ (a.s.).

G. If $\xi = \eta$ (a.s.) and $E|\xi| < \infty$, then $E|\eta| < \infty$ and $E\xi = E\eta$ (see also Problem 3).

In fact, let $\mathcal{N} = \{\omega : \xi \ne \eta\}$. Then $P(\mathcal{N}) = 0$ and $\xi = \xi I_{\mathcal{N}} + \xi I_{\bar{\mathcal{N}}}$, $\eta = \eta I_{\mathcal{N}} + \eta I_{\bar{\mathcal{N}}} = \eta I_{\mathcal{N}} + \xi I_{\bar{\mathcal{N}}}$. By Properties E and F, we have $E\xi = E\xi I_{\mathcal{N}} + E\xi I_{\bar{\mathcal{N}}} = E\eta I_{\bar{\mathcal{N}}}$. But $E\eta I_{\mathcal{N}} = 0$, and therefore $E\xi = E\eta I_{\bar{\mathcal{N}}} + E\eta I_{\mathcal{N}} = E\eta$, by Property E.

H. Let $\xi \ge 0$ and $E\xi = 0$. Then $\xi = 0$ (a.s.).

For the proof, let $A = \{\omega : \xi(\omega) > 0\}$, $A_n = \{\omega : \xi(\omega) \ge 1/n\}$. It is clear that $A_n \uparrow A$ and $0 \le \xi I_{A_n} \le \xi I_A$. Hence, by Property B,

$$0 \le E\xi I_{A_n} \le E\xi = 0.$$

Consequently

$$0 = E\xi I_{A_n} \ge \frac{1}{n} P(A_n)$$

and therefore $P(A_n) = 0$ for all $n \ge 1$. But $P(A) = \lim P(A_n)$ and therefore $P(A) = 0$.

I. Let $\xi$ and $\eta$ be such that $E|\xi| < \infty$, $E|\eta| < \infty$ and $E(\xi I_A) \le E(\eta I_A)$ for all $A \in \mathcal{F}$. Then $\xi \le \eta$ (a.s.).

In fact, let $B = \{\omega : \xi(\omega) > \eta(\omega)\}$. Then $E(\eta I_B) \le E(\xi I_B) \le E(\eta I_B)$ and therefore $E(\xi I_B) = E(\eta I_B)$. By Property E, we have $E((\xi - \eta) I_B) = 0$ and by Property H we have $(\xi - \eta) I_B = 0$ (a.s.), whence $P(B) = 0$.

J. Let $\xi$ be an extended random variable and $E|\xi| < \infty$. Then $|\xi| < \infty$ (a.s.). In fact, let $A = \{\omega : |\xi(\omega)| = \infty\}$ and $P(A) > 0$. Then $E|\xi| \ge E(|\xi| I_A) = \infty \cdot P(A) = \infty$, which contradicts the hypothesis $E|\xi| < \infty$. (See also Problem 4.)

3. Here we consider the fundamental theorems on taking limits under the expectation sign (or the Lebesgue integral sign).

Theorem 1 (On Monotone Convergence). Let $\eta, \xi, \xi_1, \xi_2, \ldots$ be random variables.

(a) If $\xi_n \ge \eta$ for all $n \ge 1$, $E\eta > -\infty$, and $\xi_n \uparrow \xi$, then

$$E\xi_n \uparrow E\xi.$$

(b) If $\xi_n \le \eta$ for all $n \ge 1$, $E\eta < \infty$, and $\xi_n \downarrow \xi$, then

$$E\xi_n \downarrow E\xi.$$

PROOF. (a) First suppose that $\eta \ge 0$. For each $k \ge 1$ let $\{\xi_k^{(n)}\}_{n \ge 1}$ be a sequence of simple functions such that $\xi_k^{(n)} \uparrow \xi_k$, $n \to \infty$. Put $\zeta^{(n)} = \max_{1 \le k \le n} \xi_k^{(n)}$. Then

$$\zeta^{(n-1)} \le \zeta^{(n)} = \max_{1 \le k \le n} \xi_k^{(n)} \le \max_{1 \le k \le n} \xi_k = \xi_n.$$

Let $\zeta = \lim_n \zeta^{(n)}$. Since

$$\xi_k^{(n)} \le \zeta^{(n)} \le \xi_n$$

for $1 \le k \le n$, we find by taking limits as $n \to \infty$ that

$$\xi_k \le \zeta \le \xi$$

for every $k \ge 1$ and therefore $\xi = \zeta$.

The random variables $\zeta^{(n)}$ are simple and $\zeta^{(n)} \uparrow \zeta$. Therefore

$$E\xi = E\zeta = \lim E\zeta^{(n)} \le \lim E\xi_n.$$

On the other hand, it is obvious, since $\xi_n \le \xi_{n+1} \le \xi$, that

$$\lim E\xi_n \le E\xi.$$

Consequently $\lim E\xi_n = E\xi$.

Now let $\eta$ be any random variable with $E\eta > -\infty$.
If $E\eta = \infty$ then $E\xi_n = E\xi = \infty$ by Property B, and our proposition is proved. Let $E\eta < \infty$. Then instead of $E\eta > -\infty$ we find $E|\eta| < \infty$. It is clear that $0 \le \xi_n - \eta \uparrow \xi - \eta$ for all $\omega \in \Omega$. Therefore by what has been established, $E(\xi_n - \eta) \uparrow E(\xi - \eta)$ and therefore (by Property E and Problem 2)

$$E\xi_n - E\eta \uparrow E\xi - E\eta.$$

But $E|\eta| < \infty$, and therefore $E\xi_n \uparrow E\xi$, $n \to \infty$.
The proof of (b) follows from (a) if we replace the original variables by their negatives.

Corollary. Let $\{\eta_n\}_{n \ge 1}$ be a sequence of nonnegative random variables. Then

$$E \sum_{n=1}^{\infty} \eta_n = \sum_{n=1}^{\infty} E\eta_n.$$

The proof follows from Property E (see also Problem 2), the monotone convergence theorem, and the remark that

$$\sum_{n=1}^{k} \eta_n \uparrow \sum_{n=1}^{\infty} \eta_n, \qquad k \to \infty.$$

Theorem 2 (Fatou's Lemma). Let $\eta, \xi_1, \xi_2, \ldots$ be random variables.

(a) If $\xi_n \ge \eta$ for all $n \ge 1$ and $E\eta > -\infty$, then

$$E \liminf \xi_n \le \liminf E\xi_n.$$

(b) If $\xi_n \le \eta$ for all $n \ge 1$ and $E\eta < \infty$, then

$$\limsup E\xi_n \le E \limsup \xi_n.$$

(c) If $|\xi_n| \le \eta$ for all $n \ge 1$ and $E\eta < \infty$, then

$$E \liminf \xi_n \le \liminf E\xi_n \le \limsup E\xi_n \le E \limsup \xi_n. \tag{7}$$

PROOF. (a) Let $\zeta_n = \inf_{m \ge n} \xi_m$; then

$$\liminf \xi_n = \lim_n \inf_{m \ge n} \xi_m = \lim_n \zeta_n.$$

It is clear that $\zeta_n \uparrow \liminf \xi_n$ and $\zeta_n \ge \eta$ for all $n \ge 1$. Then by Theorem 1

$$E \liminf \xi_n = E \lim_n \zeta_n = \lim_n E\zeta_n = \liminf E\zeta_n \le \liminf E\xi_n,$$

which establishes (a). The second conclusion follows from the first. The third is a corollary of the first two.
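The inequality in part (a) can be strict. A standard illustration, assumed here and not taken from the text, is $\xi_n = n I_{(0, 1/n)}$ on $[0, 1]$ with Lebesgue measure: each $E\xi_n$ equals 1, while $\xi_n(\omega) \to 0$ for every $\omega$. The sketch below evaluates both sides; the tail infimum over a finite range of $n$ serves as a proxy for the lower limit at the chosen sample points.

```python
from fractions import Fraction

def xi(n, w):
    """xi_n(w) = n on the open interval (0, 1/n), and 0 elsewhere."""
    return n if 0 < w < Fraction(1, n) else 0

def expectation(n):
    """xi_n is simple: value n on a set of Lebesgue measure 1/n."""
    return n * Fraction(1, n)

# Sample points k/100 in [0, 1]; for n >= 200 each of them lies outside (0, 1/n).
sample_points = [Fraction(k, 100) for k in range(0, 101)]
pointwise_liminf = [min(xi(n, w) for n in range(200, 1000)) for w in sample_points]

liminf_of_expectations = min(expectation(n) for n in range(200, 1000))
print(max(pointwise_liminf), liminf_of_expectations)
```

Here $E \liminf \xi_n = 0 < 1 = \liminf E\xi_n$, so no dominating integrable bound from above can exist for this sequence.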

Theorem 3 (Lebesgue's Theorem on Dominated Convergence). Let $\eta, \xi, \xi_1, \xi_2, \ldots$ be random variables such that $|\xi_n| \le \eta$, $E\eta < \infty$ and $\xi_n \to \xi$ (a.s.). Then $E|\xi| < \infty$,

$$E\xi_n \to E\xi \tag{8}$$

and

$$E|\xi_n - \xi| \to 0 \tag{9}$$

as $n \to \infty$.

PROOF. Formula (7) is valid by Fatou's lemma. By hypothesis, $\liminf \xi_n = \limsup \xi_n = \xi$ (a.s.). Therefore by Property G,

$$E \liminf \xi_n = \liminf E\xi_n = \limsup E\xi_n = E \limsup \xi_n = E\xi,$$

which establishes (8). It is also clear that $|\xi| \le \eta$. Hence $E|\xi| < \infty$.
Conclusion (9) can be proved in the same way if we observe that $|\xi_n - \xi| \le 2\eta$.

Corollary. Let $\eta, \xi, \xi_1, \xi_2, \ldots$ be random variables such that $|\xi_n| \le \eta$, $\xi_n \to \xi$ (a.s.) and $E\eta^p < \infty$ for some $p > 0$. Then $E|\xi|^p < \infty$ and $E|\xi - \xi_n|^p \to 0$, $n \to \infty$.

For the proof, it is sufficient to observe that

$$|\xi| \le \eta, \qquad |\xi - \xi_n|^p \le (|\xi| + |\xi_n|)^p \le (2\eta)^p.$$

The condition "$|\xi_n| \le \eta$, $E\eta < \infty$" that appears in Fatou's lemma and the dominated convergence theorem, and ensures the validity of formulas (7)-(9), can be somewhat weakened. In order to be able to state the corresponding result (Theorem 4), we introduce the following definition.

Definition 4. A family $\{\xi_n\}_{n \ge 1}$ of random variables is said to be uniformly integrable if

$$\sup_n \int_{\{|\xi_n| > c\}} |\xi_n| \, P(d\omega) \to 0, \qquad c \to \infty, \tag{10}$$

or, in a different notation,

$$\sup_n E[\,|\xi_n| I_{\{|\xi_n| > c\}}] \to 0, \qquad c \to \infty. \tag{11}$$

It is clear that if $\xi_n$, $n \ge 1$, satisfy $|\xi_n| \le \eta$, $E\eta < \infty$, then the family $\{\xi_n\}_{n \ge 1}$ is uniformly integrable.
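Condition (11) can be examined in closed form for simple spike families on $[0, 1]$ with Lebesgue measure. The two families below are assumptions chosen for illustration: $\xi_n = n I_{(0, 1/n)}$, for which the tail expectation stays at 1 for every $c$, and $\xi_n = \sqrt{n}\, I_{(0, 1/n)}$, for which it tends to zero. The supremum over all $n$ is replaced by a maximum over a finite range, which is only a proxy.

```python
import math

def tail_spike(n, c):
    """E[|xi_n| I(|xi_n| > c)] for xi_n = n * I_(0, 1/n): n * (1/n) when n > c."""
    return 1.0 if n > c else 0.0

def tail_sqrt(n, c):
    """The same quantity for xi_n = sqrt(n) * I_(0, 1/n)."""
    return math.sqrt(n) / n if math.sqrt(n) > c else 0.0

def sup_tail(tail, c, n_max=20000):
    """Finite proxy for the supremum over n in (11)."""
    return max(tail(n, c) for n in range(1, n_max + 1))

levels = [1.0, 10.0, 100.0]
spike_sups = [sup_tail(tail_spike, c) for c in levels]
sqrt_sups = [sup_tail(tail_sqrt, c) for c in levels]
print(spike_sups)
print(sqrt_sups)
```

The first family is therefore not uniformly integrable, while the second is; for the second family $E\xi_n^2 = 1$ for all $n$, a bounded second moment.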
Theorem 4. Let $\{\xi_n\}_{n \ge 1}$ be a uniformly integrable family of random variables. Then

(a) $E \liminf \xi_n \le \liminf E\xi_n \le \limsup E\xi_n \le E \limsup \xi_n$.
(b) If in addition $\xi_n \to \xi$ (a.s.) then $\xi$ is integrable and

$$E\xi_n \to E\xi, \qquad n \to \infty,$$
$$E|\xi_n - \xi| \to 0, \qquad n \to \infty.$$

PROOF. (a) For every $c > 0$

$$E\xi_n = E[\xi_n I_{\{\xi_n < -c\}}] + E[\xi_n I_{\{\xi_n \ge -c\}}]. \tag{12}$$

By uniform integrability, for every $\varepsilon > 0$ we can take $c$ so large that

$$\sup_n |E[\xi_n I_{\{\xi_n < -c\}}]| < \varepsilon. \tag{13}$$

By Fatou's lemma,

$$\liminf E[\xi_n I_{\{\xi_n \ge -c\}}] \ge E[\liminf \xi_n I_{\{\xi_n \ge -c\}}].$$

But $\xi_n I_{\{\xi_n \ge -c\}} \ge \xi_n$ and therefore

$$\liminf E[\xi_n I_{\{\xi_n \ge -c\}}] \ge E[\liminf \xi_n]. \tag{14}$$

From (12)-(14) we obtain

$$\liminf E\xi_n \ge E[\liminf \xi_n] - \varepsilon.$$

Since $\varepsilon > 0$ is arbitrary, it follows that $\liminf E\xi_n \ge E \liminf \xi_n$. The inequality with upper limits, $\limsup E\xi_n \le E \limsup \xi_n$, is proved similarly.
Conclusion (b) can be deduced from (a) as in Theorem 3.
The deeper significance of the concept of uniform integrability is revealed by the following theorem, which gives a necessary and sufficient condition for taking limits under the expectation sign.

Theorem 5. Let $0 \le \xi_n \to \xi$ and $E\xi_n < \infty$. Then $E\xi_n \to E\xi < \infty$ if and only if the family $\{\xi_n\}_{n \ge 1}$ is uniformly integrable.

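PROOF. The sufficiency follows from conclusion (b) of Theorem 4. For the proof of the necessity we consider the (at most countable) set

$$A = \{a : P(\xi = a) > 0\}.$$

Then we have $\xi_n I_{\{\xi_n < a\}} \to \xi I_{\{\xi < a\}}$ for each $a \notin A$, and the family

$$\{\xi_n I_{\{\xi_n < a\}}\}_{n \ge 1}$$

is uniformly integrable. Hence, by the sufficiency part of the theorem, we have $E\xi_n I_{\{\xi_n < a\}} \to E\xi I_{\{\xi < a\}}$, $a \notin A$, and therefore

$$E\xi_n I_{\{\xi_n \ge a\}} \to E\xi I_{\{\xi \ge a\}}, \qquad a \notin A, \quad n \to \infty. \tag{15}$$

Take an $\varepsilon > 0$ and choose $a_0 \notin A$ so large that $E\xi I_{\{\xi \ge a_0\}} < \varepsilon/2$; then choose $N_0$ so large that

$$E\xi_n I_{\{\xi_n \ge a_0\}} \le E\xi I_{\{\xi \ge a_0\}} + \varepsilon/2$$

for all $n \ge N_0$, and consequently $E\xi_n I_{\{\xi_n \ge a_0\}} \le \varepsilon$. Then choose $a_1 \ge a_0$ so large that $E\xi_n I_{\{\xi_n \ge a_1\}} \le \varepsilon$ for all $n \le N_0$. Then we have

$$\sup_n E\xi_n I_{\{\xi_n \ge a_1\}} \le \varepsilon,$$

which establishes the uniform integrability of the family $\{\xi_n\}_{n \ge 1}$ of random variables.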

4. Let us notice some tests for uniform integrability.

We first observe that if $\{\xi_n\}$ is a family of uniformly integrable random variables, then

$$\sup_n E|\xi_n| < \infty. \tag{16}$$

In fact, for a given $\varepsilon > 0$ and sufficiently large $c > 0$

$$\sup_n E|\xi_n| = \sup_n [E(|\xi_n| I_{\{|\xi_n| \ge c\}}) + E(|\xi_n| I_{\{|\xi_n| < c\}})]$$
$$\le \sup_n E(|\xi_n| I_{\{|\xi_n| \ge c\}}) + \sup_n E(|\xi_n| I_{\{|\xi_n| < c\}}) \le \varepsilon + c,$$

which establishes (16).

It turns out that (16) together with a condition of uniform continuity is necessary and sufficient for uniform integrability.

Lemma 2. A necessary and sufficient condition for a family $\{\xi_n\}_{n \ge 1}$ of random variables to be uniformly integrable is that $E|\xi_n|$, $n \ge 1$, are uniformly bounded (i.e., (16) holds) and that $E\{|\xi_n| I_A\}$, $n \ge 1$, are uniformly absolutely continuous (i.e. $\sup_n E\{|\xi_n| I_A\} \to 0$ when $P(A) \to 0$).

PROOF. Necessity. Condition (16) was verified above. Moreover,

$$E\{|\xi_n| I_A\} = E\{|\xi_n| I_{A \cap \{|\xi_n| \ge c\}}\} + E\{|\xi_n| I_{A \cap \{|\xi_n| < c\}}\} \le E\{|\xi_n| I_{\{|\xi_n| \ge c\}}\} + cP(A). \tag{17}$$

Take $c$ so large that $\sup_n E\{|\xi_n| I_{\{|\xi_n| \ge c\}}\} \le \varepsilon/2$. Then if $P(A) \le \varepsilon/2c$, we have

$$\sup_n E\{|\xi_n| I_A\} \le \varepsilon$$

by (17). This establishes the uniform absolute continuity.

Sufficiency. Let $\varepsilon > 0$ and $\delta > 0$ be chosen so that $P(A) < \delta$ implies that $E(|\xi_n| I_A) \le \varepsilon$, uniformly in $n$. Since

$$E|\xi_n| \ge E|\xi_n| I_{\{|\xi_n| \ge c\}} \ge cP\{|\xi_n| \ge c\}$$

for every $c > 0$ (cf. Chebyshev's inequality), we have

$$\sup_n P\{|\xi_n| \ge c\} \le \frac{1}{c} \sup_n E|\xi_n| \to 0, \qquad c \to \infty,$$

and therefore, when $c$ is sufficiently large, any set $\{|\xi_n| \ge c\}$, $n \ge 1$, can be taken as $A$. Therefore $\sup_n E(|\xi_n| I_{\{|\xi_n| \ge c\}}) \le \varepsilon$, which establishes the uniform integrability. This completes the proof of the lemma.
The following proposition provides a simple sufficient condition for uniform integrability.

Lemma 3. Let $\xi_1, \xi_2, \ldots$ be a sequence of integrable random variables and $G = G(t)$ a nonnegative increasing function, defined for $t \ge 0$, such that

$$\lim_{t \to \infty} \frac{G(t)}{t} = \infty, \tag{18}$$

$$\sup_n E[G(|\xi_n|)] < \infty. \tag{19}$$

Then the family $\{\xi_n\}_{n \ge 1}$ is uniformly integrable.

PROOF. Let $\varepsilon > 0$, $M = \sup_n E[G(|\xi_n|)]$, $a = M/\varepsilon$. Take $c$ so large that $G(t)/t \ge a$ for $t \ge c$. Then

$$E[\,|\xi_n| I_{\{|\xi_n| \ge c\}}] \le \frac{1}{a} E[G(|\xi_n|) I_{\{|\xi_n| \ge c\}}] \le \frac{M}{a} = \varepsilon$$

uniformly for $n \ge 1$.
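5. If $\xi$ and $\eta$ are independent simple random variables, we can show, as in Subsection 5 of §4 of Chapter I, that $E\xi\eta = E\xi \cdot E\eta$. Let us now establish a similar proposition in the general case (see also Problem 5).

Theorem 6. Let $\xi$ and $\eta$ be independent random variables with $E|\xi| < \infty$, $E|\eta| < \infty$. Then $E|\xi\eta| < \infty$ and

$$E\xi\eta = E\xi \cdot E\eta. \tag{20}$$

PROOF. First let $\xi \ge 0$, $\eta \ge 0$. Put

$$\xi_n = \sum_{k=0}^{\infty} \frac{k}{n} I_{\{k/n \le \xi(\omega) < (k+1)/n\}}, \qquad \eta_n = \sum_{k=0}^{\infty} \frac{k}{n} I_{\{k/n \le \eta(\omega) < (k+1)/n\}}.$$

Then $\xi_n \le \xi$, $|\xi_n - \xi| \le 1/n$ and $\eta_n \le \eta$, $|\eta_n - \eta| \le 1/n$. Since $E\xi < \infty$ and $E\eta < \infty$, it follows from Lebesgue's dominated convergence theorem that

$$\lim E\xi_n = E\xi, \qquad \lim E\eta_n = E\eta.$$

Moreover, since $\xi$ and $\eta$ are independent,

$$E\xi_n\eta_n = \sum_{k,l \ge 0} \frac{kl}{n^2} E I_{\{k/n \le \xi < (k+1)/n\}} I_{\{l/n \le \eta < (l+1)/n\}}$$
$$= \sum_{k,l \ge 0} \frac{kl}{n^2} E I_{\{k/n \le \xi < (k+1)/n\}} \cdot E I_{\{l/n \le \eta < (l+1)/n\}} = E\xi_n \cdot E\eta_n.$$

Now notice that

$$|E\xi\eta - E\xi_n\eta_n| \le E|\xi\eta - \xi_n\eta_n| \le E[\,|\xi| \cdot |\eta - \eta_n|\,] + E[\,|\eta_n| \cdot |\xi - \xi_n|\,] \le \frac{1}{n} E\xi + \frac{1}{n} E\Bigl(\eta + \frac{1}{n}\Bigr) \to 0, \qquad n \to \infty.$$

Therefore $E\xi\eta = \lim_n E\xi_n\eta_n = \lim E\xi_n \cdot \lim E\eta_n = E\xi \cdot E\eta$, and $E\xi\eta < \infty$.
The general case reduces to this one if we use the representations $\xi = \xi^+ - \xi^-$, $\eta = \eta^+ - \eta^-$, $\xi\eta = \xi^+\eta^+ - \xi^-\eta^+ - \xi^+\eta^- + \xi^-\eta^-$. This completes the proof.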
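Formula (20) is easy to verify exactly for discrete random variables. In the sketch below the two marginal laws are hypothetical; independence is imposed by taking the product measure as the joint law, and a dependent pair ($\eta = \xi$) is included for contrast.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal laws: value -> probability.
law_xi = {-1: Fraction(1, 4), 2: Fraction(3, 4)}
law_eta = {0: Fraction(1, 3), 3: Fraction(1, 3), 6: Fraction(1, 3)}

# Independence: the joint law is the product of the marginals.
joint = {
    (x, y): px * py
    for (x, px), (y, py) in product(law_xi.items(), law_eta.items())
}

e_xi = sum(x * p for x, p in law_xi.items())
e_eta = sum(y * p for y, p in law_eta.items())
e_product = sum(x * y * p for (x, y), p in joint.items())

# Dependent contrast: eta = xi, so E(xi * eta) = E(xi ** 2).
e_square = sum(x * x * p for x, p in law_xi.items())

print(e_product, e_xi * e_eta, e_square, e_xi * e_xi)
```

With these laws $E\xi = 5/4$ and $E\eta = 3$, so both sides of (20) equal $15/4$, while in the dependent case $E\xi^2 = 13/4$ differs from $(E\xi)^2 = 25/16$.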
6. The inequalities for expectations that we develop in this subsection are regularly used both in probability theory and in analysis.

Chebyshev's Inequality. Let $\xi$ be a nonnegative random variable. Then for every $\varepsilon > 0$

$$P(\xi \ge \varepsilon) \le \frac{E\xi}{\varepsilon}. \tag{21}$$

The proof follows immediately from

$$E\xi \ge E[\xi \cdot I_{\{\xi \ge \varepsilon\}}] \ge \varepsilon E I_{\{\xi \ge \varepsilon\}} = \varepsilon P(\xi \ge \varepsilon).$$

From (21) we can obtain the following variant of Chebyshev's inequality: If $\xi$ is any random variable then

$$P(|\xi| \ge \varepsilon) \le \frac{E\xi^2}{\varepsilon^2} \tag{22}$$

and

$$P(|\xi - E\xi| \ge \varepsilon) \le \frac{V\xi}{\varepsilon^2}, \tag{23}$$

where $V\xi = E(\xi - E\xi)^2$ is the variance of $\xi$.
The Cauchy-Bunyakovskii Inequality. Let $\xi$ and $\eta$ satisfy $E\xi^2 < \infty$, $E\eta^2 < \infty$. Then $E|\xi\eta| < \infty$ and

$$(E|\xi\eta|)^2 \le E\xi^2 \cdot E\eta^2. \tag{24}$$

PROOF. Suppose that $E\xi^2 > 0$, $E\eta^2 > 0$. Then, with $\tilde\xi = \xi/\sqrt{E\xi^2}$, $\tilde\eta = \eta/\sqrt{E\eta^2}$, we find, since $2|\tilde\xi\tilde\eta| \le \tilde\xi^2 + \tilde\eta^2$, that

$$2E|\tilde\xi\tilde\eta| \le E\tilde\xi^2 + E\tilde\eta^2 = 2,$$

i.e. $E|\tilde\xi\tilde\eta| \le 1$, which establishes (24).

On the other hand if, say, $E\xi^2 = 0$, then $\xi = 0$ (a.s.) by Property I, and then $E\xi\eta = 0$ by Property F, i.e. (24) is still satisfied.
Jensen's Inequality. Let the Borel function $g = g(x)$ be convex downward and $E|\xi| < \infty$. Then

$$g(E\xi) \le Eg(\xi). \tag{25}$$

PROOF. If $g = g(x)$ is convex downward, for each $x_0 \in R$ there is a number $\lambda(x_0)$ such that

$$g(x) \ge g(x_0) + (x - x_0) \cdot \lambda(x_0) \tag{26}$$

for all $x \in R$. Putting $x = \xi$ and $x_0 = E\xi$, we find from (26) that

$$g(\xi) \ge g(E\xi) + (\xi - E\xi) \cdot \lambda(E\xi),$$

and consequently $Eg(\xi) \ge g(E\xi)$.

A whole series of useful inequalities can be derived from Jensen's inequality. We obtain the following one as an example.

Lyapunov's Inequality. If $0 < s < t$,

$$(E|\xi|^s)^{1/s} \le (E|\xi|^t)^{1/t}. \tag{27}$$

To prove this, let $r = t/s$. Then, putting $\eta = |\xi|^s$ and applying Jensen's inequality to $g(x) = |x|^r$, we obtain $|E\eta|^r \le E|\eta|^r$, i.e.

$$(E|\xi|^s)^{t/s} \le E|\xi|^t,$$

which establishes (27).

The following chain of inequalities among absolute moments is a consequence of Lyapunov's inequality:

$$E|\xi| \le (E|\xi|^2)^{1/2} \le \cdots \le (E|\xi|^n)^{1/n}. \tag{28}$$
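Both Jensen's inequality (25) and the monotonicity of $(E|\xi|^s)^{1/s}$ in $s$ from (27) can be observed numerically. The three-point law, the convex function $g(x) = x^2$, and the list of orders below are assumptions made for this sketch; a small tolerance absorbs floating-point rounding.

```python
# Hypothetical three-point law: value -> probability.
law = {-2.0: 0.2, 0.5: 0.5, 3.0: 0.3}

def expect(g):
    """E g(xi) for the discrete law above."""
    return sum(g(x) * p for x, p in law.items())

def g(x):
    """A convex-downward function used for Jensen's inequality."""
    return x * x

jensen_lhs = g(expect(lambda x: x))  # g(E xi)
jensen_rhs = expect(g)               # E g(xi)

def moment_norm(s):
    """(E |xi|^s)^(1/s), the quantity compared in Lyapunov's inequality."""
    return expect(lambda x: abs(x) ** s) ** (1.0 / s)

orders = [0.5, 1.0, 2.0, 3.0, 4.0]
norms = [moment_norm(s) for s in orders]
lyapunov_holds = all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))

print(jensen_lhs, jensen_rhs)
print([round(v, 4) for v in norms], lyapunov_holds)
```

Because $|\xi|$ is not constant under this law, the norms increase strictly with the order.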
Hölder's Inequality. Let $1 < p < \infty$, $1 < q < \infty$, and $(1/p) + (1/q) = 1$. If $E|\xi|^p < \infty$ and $E|\eta|^q < \infty$, then $E|\xi\eta| < \infty$ and

$$E|\xi\eta| \le (E|\xi|^p)^{1/p} (E|\eta|^q)^{1/q}. \tag{29}$$

If $E|\xi|^p = 0$ or $E|\eta|^q = 0$, (29) follows immediately as for the Cauchy-Bunyakovskii inequality (which is the special case $p = q = 2$ of Hölder's inequality).
Now let $E|\xi|^p > 0$, $E|\eta|^q > 0$ and

$$\tilde\xi = \frac{|\xi|}{(E|\xi|^p)^{1/p}}, \qquad \tilde\eta = \frac{|\eta|}{(E|\eta|^q)^{1/q}}.$$

We apply the inequality

$$x^a y^b \le ax + by, \tag{30}$$

which holds for positive $x, y, a, b$ and $a + b = 1$, and follows immediately from the concavity of the logarithm:

$$\ln[ax + by] \ge a \ln x + b \ln y = \ln x^a y^b.$$

Then, putting $x = \tilde\xi^p$, $y = \tilde\eta^q$, $a = 1/p$, $b = 1/q$, we find that

$$\tilde\xi\tilde\eta \le \frac{1}{p} \tilde\xi^p + \frac{1}{q} \tilde\eta^q,$$

whence

$$E\tilde\xi\tilde\eta \le \frac{1}{p} E\tilde\xi^p + \frac{1}{q} E\tilde\eta^q = \frac{1}{p} + \frac{1}{q} = 1.$$

This establishes (29).

Minkowski's Inequality. If $E|\xi|^p < \infty$, $E|\eta|^p < \infty$, $1 \le p < \infty$, then we have $E|\xi + \eta|^p < \infty$ and

$$(E|\xi + \eta|^p)^{1/p} \le (E|\xi|^p)^{1/p} + (E|\eta|^p)^{1/p}. \tag{31}$$

We begin by establishing the following inequality: if $a, b > 0$ and $p \ge 1$, then

$$(a + b)^p \le 2^{p-1}(a^p + b^p). \tag{32}$$

In fact, consider the function $F(x) = (a + x)^p - 2^{p-1}(a^p + x^p)$. Then

$$F'(x) = p(a + x)^{p-1} - 2^{p-1} p x^{p-1},$$

and since $p \ge 1$, we have $F'(a) = 0$, $F'(x) \ge 0$ for $x < a$ and $F'(x) \le 0$ for $x > a$. Therefore

$$F(b) \le \max F(x) = F(a) = 0,$$

from which (32) follows.
According to this inequality,

$$|\xi + \eta|^p \le (|\xi| + |\eta|)^p \le 2^{p-1}(|\xi|^p + |\eta|^p) \tag{33}$$

and therefore if $E|\xi|^p < \infty$ and $E|\eta|^p < \infty$ it follows that $E|\xi + \eta|^p < \infty$.
If $p = 1$, inequality (31) follows from (33).
Now suppose that $p > 1$. Take $q > 1$ so that $(1/p) + (1/q) = 1$. Then

$$|\xi + \eta|^p = |\xi + \eta| \cdot |\xi + \eta|^{p-1} \le |\xi| \cdot |\xi + \eta|^{p-1} + |\eta| \cdot |\xi + \eta|^{p-1}. \tag{34}$$

Notice that $(p - 1)q = p$. Consequently

$$E(|\xi + \eta|^{p-1})^q = E|\xi + \eta|^p < \infty,$$

and therefore by Hölder's inequality

$$E(|\xi| \, |\xi + \eta|^{p-1}) \le (E|\xi|^p)^{1/p} (E|\xi + \eta|^{(p-1)q})^{1/q} = (E|\xi|^p)^{1/p} (E|\xi + \eta|^p)^{1/q} < \infty.$$

In the same way,

$$E(|\eta| \, |\xi + \eta|^{p-1}) \le (E|\eta|^p)^{1/p} (E|\xi + \eta|^p)^{1/q}.$$

Consequently, by (34),

$$E|\xi + \eta|^p \le (E|\xi + \eta|^p)^{1/q} \bigl((E|\xi|^p)^{1/p} + (E|\eta|^p)^{1/p}\bigr). \tag{35}$$

If $E|\xi + \eta|^p = 0$, the desired inequality (31) is evident. Now let $E|\xi + \eta|^p > 0$. Then we obtain

$$(E|\xi + \eta|^p)^{1 - (1/q)} \le (E|\xi|^p)^{1/p} + (E|\eta|^p)^{1/p}$$

from (35), and (31) follows since $1 - (1/q) = 1/p$.
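A quick numerical check of Hölder's inequality (29) and Minkowski's inequality (31) may help fix the roles of $p$ and $q$. The sketch below assumes a four-point sample space with uniform probabilities and arbitrary illustrative values for the two random variables; a small tolerance is added for floating-point rounding.

```python
# Hypothetical values of xi and eta on a four-point space with uniform P.
xi = [1.0, -2.0, 0.5, 3.0]
eta = [2.0, 1.0, -1.5, 0.0]
n = len(xi)

def expect(values):
    """Expectation under the uniform measure."""
    return sum(values) / n

def lp_norm(values, p):
    """(E |v|^p)^(1/p)."""
    return expect([abs(v) ** p for v in values]) ** (1.0 / p)

checks = []
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
    holder = (
        expect([abs(a * b) for a, b in zip(xi, eta)])
        <= lp_norm(xi, p) * lp_norm(eta, q) + 1e-12
    )
    minkowski = (
        lp_norm([a + b for a, b in zip(xi, eta)], p)
        <= lp_norm(xi, p) + lp_norm(eta, p) + 1e-12
    )
    checks.append((p, holder, minkowski))

print(checks)
```

For $p = 2$ the first check reduces to the Cauchy-Bunyakovskii inequality (24).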

7. Let $\xi$ be a random variable for which $E\xi$ is defined. Then, by Property D, the set function

$$Q(A) = \int_A \xi \, dP, \qquad A \in \mathcal{F}, \tag{36}$$

is well defined. Let us show that this function is countably additive.

First suppose that $\xi$ is nonnegative. If $A_1, A_2, \ldots$ are pairwise disjoint sets from $\mathcal{F}$ and $A = \sum A_n$, the corollary to Theorem 1 implies that

$$Q(A) = E(\xi \cdot I_A) = E(\xi \cdot I_{\sum A_n}) = E\Bigl(\sum \xi \cdot I_{A_n}\Bigr) = \sum E(\xi \cdot I_{A_n}) = \sum Q(A_n).$$

If $\xi$ is an arbitrary random variable for which $E\xi$ is defined, the countable additivity of $Q(A)$ follows from the representation

$$Q(A) = Q^+(A) - Q^-(A), \tag{37}$$

where

$$Q^+(A) = \int_A \xi^+ \, dP, \qquad Q^-(A) = \int_A \xi^- \, dP,$$

together with the countable additivity for nonnegative random variables and the fact that $\min(Q^+(\Omega), Q^-(\Omega)) < \infty$.
Thus if $E\xi$ is defined, the set function $Q = Q(A)$ is a signed measure: a countably additive set function representable as $Q = Q_1 - Q_2$, where at least one of the measures $Q_1$ and $Q_2$ is finite.
We now show that $Q = Q(A)$ has the following important property of absolute continuity with respect to $P$:

if $P(A) = 0$ then $Q(A) = 0$ $(A \in \mathcal{F})$

(this property is denoted by the abbreviation $Q \ll P$).

To prove this, it is sufficient to consider nonnegative random variables. If $\xi = \sum_{k=1}^{n} x_k I_{A_k}$ is a simple nonnegative random variable and $P(A) = 0$, then

$$Q(A) = E(\xi \cdot I_A) = \sum_{k=1}^{n} x_k P(A_k \cap A) = 0.$$

If $\{\xi_n\}_{n \ge 1}$ is a sequence of nonnegative simple functions such that $\xi_n \uparrow \xi \ge 0$, then the theorem on monotone convergence shows that

$$Q(A) = E(\xi \cdot I_A) = \lim E(\xi_n \cdot I_A) = 0,$$

since $E(\xi_n \cdot I_A) = 0$ for all $n \ge 1$ and $A$ with $P(A) = 0$.
Thus the Lebesgue integral $Q(A) = \int_A \xi \, dP$, considered as a function of sets $A \in \mathcal{F}$, is a signed measure that is absolutely continuous with respect to $P$ ($Q \ll P$). It is quite remarkable that the converse is also valid.

194

II. Mathematical Foundations of Probability Theory

Radon-Nikodym Theorem. Let $(\Omega, \mathcal{F})$ be a measurable space, $\mu$ a $\sigma$-finite measure, and $\lambda$ a signed measure (i.e., $\lambda = \lambda_1 - \lambda_2$, where at least one of the measures $\lambda_1$ and $\lambda_2$ is finite) which is absolutely continuous with respect to $\mu$. Then there is an $\mathcal{F}$-measurable function $f = f(\omega)$ with values in $\bar{R} = [-\infty, \infty]$ such that

\[ \lambda(A) = \int_A f(\omega) \, \mu(d\omega), \qquad A \in \mathcal{F}. \tag{38} \]

The function $f(\omega)$ is unique up to sets of $\mu$-measure zero: if $h = h(\omega)$ is another $\mathcal{F}$-measurable function such that $\lambda(A) = \int_A h(\omega) \, \mu(d\omega)$, $A \in \mathcal{F}$, then $\mu\{\omega : f(\omega) \ne h(\omega)\} = 0$.

If $\lambda$ is a measure, then $f = f(\omega)$ has its values in $\bar{R}_+ = [0, \infty]$.

Remark. The function $f = f(\omega)$ in the representation (38) is called the Radon-Nikodym derivative or the density of the measure $\lambda$ with respect to $\mu$, and denoted by $d\lambda/d\mu$ or $(d\lambda/d\mu)(\omega)$.

The Radon-Nikodym theorem, which we quote without proof, will play a key role in the construction of conditional expectations (§7).
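On a finite sample space the Radon-Nikodym derivative can be written down directly, which may help fix ideas. The following is only a sketch under assumptions of our own: the four-point space, the masses of `mu` and `lam`, and the helper names `radon_nikodym_derivative` and `integrate` are all hypothetical illustrations, not part of the theorem. The density is the ratio of point masses wherever $\mu$ charges a point, and is arbitrary on $\mu$-null points.

```python
from fractions import Fraction

def radon_nikodym_derivative(lam, mu):
    """Density f = d(lam)/d(mu) on a finite space, given point masses.

    lam, mu: dicts mapping each sample point to its mass.
    Requires lam << mu: mu[w] == 0 must imply lam[w] == 0.
    """
    f = {}
    for w in mu:
        if mu[w] == 0:
            if lam[w] != 0:
                raise ValueError("lam is not absolutely continuous w.r.t. mu")
            f[w] = Fraction(0)  # arbitrary value on a mu-null point
        else:
            f[w] = Fraction(lam[w]) / Fraction(mu[w])
    return f

def integrate(f, mu, A):
    """Integral of f over the set A with respect to mu (a finite sum)."""
    return sum(f[w] * mu[w] for w in A)

# Hypothetical example: mu is a probability measure, lam a signed measure.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(0)}
lam = {"a": Fraction(1), "b": Fraction(-1, 2), "c": Fraction(0), "d": Fraction(0)}
f = radon_nikodym_derivative(lam, mu)
```

In this example the representation (38) reduces to the identity `integrate(f, mu, A) == sum(lam[w] for w in A)` for every subset `A` of the sample space.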
8. If $\xi = \sum_{i=1}^n x_i I_{A_i}$ is a simple random variable with $A_i = \{\omega : \xi(\omega) = x_i\}$, then

\[ Eg(\xi) = \sum g(x_i) P(A_i) = \sum g(x_i) \, \Delta F_\xi(x_i). \tag{39} \]

In other words, in order to calculate the expectation of a function of the (simple) random variable $\xi$ it is unnecessary to know the probability measure $P$ completely; it is enough to know the probability distribution $P_\xi$ or, equivalently, the distribution function $F_\xi$ of $\xi$.

The following important theorem generalizes this property.
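Formula (39) can be checked on a small example. The sketch below assumes a hypothetical four-point probability space and a hypothetical random variable `xi`; the function names are ours. It computes $Eg(\xi)$ once pointwise on the sample space and once from the distribution $P_\xi$ alone.

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical finite probability space and simple random variable xi.
P = {"w1": Fraction(1, 6), "w2": Fraction(1, 3), "w3": Fraction(1, 4), "w4": Fraction(1, 4)}
xi = {"w1": -1, "w2": 2, "w3": 2, "w4": 5}

def g(x):
    return x * x

def expectation_on_omega(g, xi, P):
    """E g(xi) computed pointwise on the sample space."""
    return sum(g(xi[w]) * P[w] for w in P)

def distribution(xi, P):
    """The distribution P_xi: total mass carried by each value x_i."""
    dist = defaultdict(Fraction)
    for w, p in P.items():
        dist[xi[w]] += p
    return dict(dist)

def expectation_from_distribution(g, dist):
    """E g(xi) computed from P_xi alone, as in (39)."""
    return sum(g(x) * p for x, p in dist.items())
```

The second computation never looks at the sample points themselves, only at the masses $P_\xi\{x_i\}$, which is the point of the remark above.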

Theorem 7 (Change of Variables in a Lebesgue Integral). Let $(\Omega, \mathcal{F})$ and $(E, \mathcal{E})$ be measurable spaces and $X = X(\omega)$ an $\mathcal{F}/\mathcal{E}$-measurable function with values in $E$. Let $P$ be a probability measure on $(\Omega, \mathcal{F})$ and $P_X$ the probability measure on $(E, \mathcal{E})$ induced by $X = X(\omega)$:

\[ P_X(A) = P\{\omega : X(\omega) \in A\}, \qquad A \in \mathcal{E}. \tag{40} \]

Then

\[ \int_A g(x) \, P_X(dx) = \int_{X^{-1}(A)} g(X(\omega)) \, P(d\omega), \qquad A \in \mathcal{E}, \tag{41} \]

for every $\mathcal{E}$-measurable function $g = g(x)$, $x \in E$ (in the sense that if one integral exists, the other is well defined, and the two are equal).

PROOF. Let $A \in \mathcal{E}$ and $g(x) = I_B(x)$, where $B \in \mathcal{E}$. Then (41) becomes

\[ P_X(AB) = P(X^{-1}(A) \cap X^{-1}(B)), \tag{42} \]

which follows from (40) and the observation that $X^{-1}(A) \cap X^{-1}(B) = X^{-1}(A \cap B)$.

It follows from (42) that (41) is valid for nonnegative simple functions $g = g(x)$, and therefore, by the monotone convergence theorem, also for all nonnegative $\mathcal{E}$-measurable functions.

In the general case we need only represent $g$ as $g^+ - g^-$. Then, since (41) is valid for $g^+$ and $g^-$, if (for example) $\int_A g^+(x) \, P_X(dx) < \infty$, we have

\[ \int_{X^{-1}(A)} g^+(X(\omega)) \, P(d\omega) < \infty \]

also, and therefore the existence of $\int_A g(x) \, P_X(dx)$ implies the existence of $\int_{X^{-1}(A)} g(X(\omega)) \, P(d\omega)$.
Corollary. Let $(E, \mathcal{E}) = (R, \mathcal{B}(R))$ and let $\xi = \xi(\omega)$ be a random variable with probability distribution $P_\xi$. Then if $g = g(x)$ is a Borel function and either of the integrals $\int_A g(x) \, P_\xi(dx)$ or $\int_{\xi^{-1}(A)} g(\xi(\omega)) \, P(d\omega)$ exists, we have

\[ \int_A g(x) \, P_\xi(dx) = \int_{\xi^{-1}(A)} g(\xi(\omega)) \, P(d\omega). \]

In particular, for $A = R$ we obtain

\[ Eg(\xi(\omega)) = \int_\Omega g(\xi(\omega)) \, P(d\omega) = \int_R g(x) \, P_\xi(dx). \tag{43} \]

The measure $P_\xi$ can be uniquely reconstructed from the distribution function $F_\xi$ (Theorem 1 of §3). Hence the Lebesgue integral $\int_R g(x) \, P_\xi(dx)$ is often denoted by $\int_R g(x) \, F_\xi(dx)$ and called a Lebesgue-Stieltjes integral (with respect to the measure corresponding to the distribution function $F_\xi(x)$).

Let us consider the case when $F_\xi(x)$ has a density $f_\xi(x)$, i.e. let

\[ F_\xi(x) = \int_{-\infty}^x f_\xi(y) \, dy, \tag{44} \]

where $f_\xi = f_\xi(x)$ is a nonnegative Borel function and the integral is a Lebesgue integral with respect to Lebesgue measure on the set $(-\infty, x]$ (see Remark 2 in Subsection 1). With the assumption of (44), formula (43) takes the form

\[ Eg(\xi(\omega)) = \int_{-\infty}^{\infty} g(x) f_\xi(x) \, dx, \tag{45} \]

where the integral is the Lebesgue integral of the function $g(x) f_\xi(x)$ with respect to Lebesgue measure. In fact, if $g(x) = I_B(x)$, $B \in \mathcal{B}(R)$, the formula becomes

\[ P_\xi(B) = \int_B f_\xi(x) \, dx, \qquad B \in \mathcal{B}(R); \tag{46} \]

its correctness follows from Theorem 1 of §3 and the formula

\[ F_\xi(b) - F_\xi(a) = \int_a^b f_\xi(x) \, dx. \]

In the general case, the proof is the same as for Theorem 7.
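9. Let us consider the special case of measurable spaces $(\Omega, \mathcal{F})$ with a measure $\mu$, where $\Omega = \Omega_1 \times \Omega_2$, $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, and $\mu = \mu_1 \times \mu_2$ is the direct product of measures $\mu_1$ and $\mu_2$ (i.e., the measure on $\mathcal{F}$ such that

\[ (\mu_1 \times \mu_2)(A \times B) = \mu_1(A) \mu_2(B), \qquad A \in \mathcal{F}_1, \; B \in \mathcal{F}_2; \]

the existence of this measure follows from the proof of Theorem 8).

The following theorem plays the same role as the theorem on the reduction of a double Riemann integral to an iterated integral.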

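Theorem 8 (Fubini's Theorem). Let $\xi = \xi(\omega_1, \omega_2)$ be an $\mathcal{F}_1 \otimes \mathcal{F}_2$-measurable function, integrable with respect to the measure $\mu_1 \times \mu_2$:

\[ \int_{\Omega_1 \times \Omega_2} |\xi(\omega_1, \omega_2)| \, d(\mu_1 \times \mu_2) < \infty. \tag{47} \]

Then the integrals $\int_{\Omega_1} \xi(\omega_1, \omega_2) \, \mu_1(d\omega_1)$ and $\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)$

(1) are defined for all $\omega_1$ and $\omega_2$;
(2) are respectively $\mathcal{F}_2$- and $\mathcal{F}_1$-measurable functions with

\[ \mu_2\Bigl\{\omega_2 : \int_{\Omega_1} |\xi(\omega_1, \omega_2)| \, \mu_1(d\omega_1) = \infty\Bigr\} = 0, \qquad \mu_1\Bigl\{\omega_1 : \int_{\Omega_2} |\xi(\omega_1, \omega_2)| \, \mu_2(d\omega_2) = \infty\Bigr\} = 0; \tag{48} \]

and (3)

\[ \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) = \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_2} \Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2) \, \mu_1(d\omega_1)\Bigr] \mu_2(d\omega_2). \tag{49} \]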

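PROOF. We first show that $\xi_{\omega_1}(\omega_2) = \xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable with respect to $\omega_2$, for each $\omega_1 \in \Omega_1$.

Let $F \in \mathcal{F}_1 \otimes \mathcal{F}_2$ and $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$. Let

\[ F_{\omega_1} = \{\omega_2 \in \Omega_2 : (\omega_1, \omega_2) \in F\} \]

be the cross-section of $F$ at $\omega_1$, and let $\mathcal{C}_{\omega_1} = \{F \in \mathcal{F} : F_{\omega_1} \in \mathcal{F}_2\}$. We must show that $\mathcal{C}_{\omega_1} = \mathcal{F}$ for every $\omega_1$.

If $F = A \times B$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

\[ (A \times B)_{\omega_1} = \begin{cases} B & \text{if } \omega_1 \in A, \\ \varnothing & \text{if } \omega_1 \notin A. \end{cases} \]

Hence rectangles with measurable sides belong to $\mathcal{C}_{\omega_1}$. In addition, if $F \in \mathcal{F}$, then $(\overline{F})_{\omega_1} = \overline{F_{\omega_1}}$, and if $\{F^n\}_{n \ge 1}$ are sets in $\mathcal{F}$, then $(\bigcup F^n)_{\omega_1} = \bigcup F^n_{\omega_1}$. It follows that $\mathcal{C}_{\omega_1} = \mathcal{F}$.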
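Now let $\xi(\omega_1, \omega_2) \ge 0$. Then, since the function $\xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable for each $\omega_1$, the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)$ is defined. Let us show that this integral is an $\mathcal{F}_1$-measurable function and

\[ \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2). \tag{50} \]

Let us suppose that $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$. Then since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

\[ \int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2) \, \mu_2(d\omega_2) = I_A(\omega_1) \int_{\Omega_2} I_B(\omega_2) \, \mu_2(d\omega_2), \tag{51} \]

and consequently the integral on the left of (51) is an $\mathcal{F}_1$-measurable function.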

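Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$. Let us show that the integral $f(\omega_1) = \int_{\Omega_2} I_F(\omega_1, \omega_2) \, \mu_2(d\omega_2)$ is $\mathcal{F}_1$-measurable. For this purpose we put $\mathcal{C} = \{F \in \mathcal{F} : f(\omega_1) \text{ is } \mathcal{F}_1\text{-measurable}\}$. According to what has been proved, the set $A \times B$ belongs to $\mathcal{C}$ ($A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$) and therefore the algebra $\mathcal{A}$ consisting of finite sums of disjoint sets of this form also belongs to $\mathcal{C}$. It follows from the monotone convergence theorem that $\mathcal{C}$ is a monotonic class, $\mathcal{C} = \mu(\mathcal{C})$. Therefore, because of the inclusions $\mathcal{A} \subseteq \mathcal{C} \subseteq \mathcal{F}$ and Theorem 1 of §2, we have $\mathcal{F} = \sigma(\mathcal{A}) = \mu(\mathcal{A}) \subseteq \mu(\mathcal{C}) = \mathcal{C} \subseteq \mathcal{F}$, i.e. $\mathcal{C} = \mathcal{F}$.

Finally, if $\xi(\omega_1, \omega_2)$ is an arbitrary nonnegative $\mathcal{F}$-measurable function, the $\mathcal{F}_1$-measurability of the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)$ follows from the monotone convergence theorem and Theorem 2 of §4.

Let us now show that the measure $\mu = \mu_1 \times \mu_2$ defined on $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, with the property $(\mu_1 \times \mu_2)(A \times B) = \mu_1(A) \mu_2(B)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, actually exists and is unique.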
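For $F \in \mathcal{F}$ we put

\[ \mu(F) = \int_{\Omega_1} \Bigl[\int_{\Omega_2} I_{F_{\omega_1}}(\omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1). \]

As we have shown, the inner integral is an $\mathcal{F}_1$-measurable function, and consequently the set function $\mu(F)$ is actually defined for $F \in \mathcal{F}$. It is clear that if $F = A \times B$, then $\mu(A \times B) = \mu_1(A) \mu_2(B)$. Now let $\{F^n\}$ be disjoint sets from $\mathcal{F}$. Then

\[ \mu\Bigl(\sum F^n\Bigr) = \int_{\Omega_1} \Bigl[\int_{\Omega_2} I_{(\sum F^n)_{\omega_1}}(\omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_1} \sum_n \Bigl[\int_{\Omega_2} I_{F^n_{\omega_1}}(\omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) \]

\[ = \sum_n \int_{\Omega_1} \Bigl[\int_{\Omega_2} I_{F^n_{\omega_1}}(\omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \sum_n \mu(F^n), \]

i.e. $\mu$ is a ($\sigma$-finite) measure on $\mathcal{F}$.

It follows from Carathéodory's theorem that this measure $\mu$ is the unique measure with the property that $\mu(A \times B) = \mu_1(A) \mu_2(B)$.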
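We can now establish (50). If $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

\[ \int_{\Omega_1 \times \Omega_2} I_{A \times B}(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) = (\mu_1 \times \mu_2)(A \times B), \tag{52} \]

and since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

\[ \int_{\Omega_1} \Bigl[\int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_1} \Bigl[I_A(\omega_1) \int_{\Omega_2} I_B(\omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \mu_1(A) \mu_2(B). \tag{53} \]

But, by the definition of $\mu_1 \times \mu_2$,

\[ (\mu_1 \times \mu_2)(A \times B) = \mu_1(A) \mu_2(B). \]

Hence it follows from (52) and (53) that (50) is valid for $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$.

Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F}$. The set function

\[ \lambda(F) = \int_{\Omega_1 \times \Omega_2} I_F(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2), \qquad F \in \mathcal{F}, \]

is evidently a $\sigma$-finite measure. It is also easily verified that the set function

\[ \nu(F) = \int_{\Omega_1} \Bigl[\int_{\Omega_2} I_F(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) \]

is a $\sigma$-finite measure. By (52) and (53), $\lambda$ and $\nu$ coincide on sets of the form $F = A \times B$, and therefore on the algebra $\mathcal{A}$. Hence it follows by Carathéodory's theorem that $\lambda$ and $\nu$ coincide for all $F \in \mathcal{F}$.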


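We turn now to the proof of the full conclusion of Fubini's theorem. By (47),

\[ \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) < \infty, \qquad \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) < \infty. \]

By what has already been proved, the integral $\int_{\Omega_2} \xi^+(\omega_1, \omega_2) \, \mu_2(d\omega_2)$ is an $\mathcal{F}_1$-measurable function of $\omega_1$ and

\[ \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) < \infty. \]

Consequently by Problem 4 (see also Property J in Subsection 2)

\[ \int_{\Omega_2} \xi^+(\omega_1, \omega_2) \, \mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}). \]

In the same way

\[ \int_{\Omega_2} \xi^-(\omega_1, \omega_2) \, \mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}), \]

and therefore

\[ \mu_1\Bigl\{\omega_1 : \int_{\Omega_2} |\xi(\omega_1, \omega_2)| \, \mu_2(d\omega_2) = \infty\Bigr\} = 0. \]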

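It is clear that, except on a set $\mathcal{N}$ of $\mu_1$-measure zero,

\[ \int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2) = \int_{\Omega_2} \xi^+(\omega_1, \omega_2) \, \mu_2(d\omega_2) - \int_{\Omega_2} \xi^-(\omega_1, \omega_2) \, \mu_2(d\omega_2). \tag{54} \]

Taking the integrals to be zero for $\omega_1 \in \mathcal{N}$, we may suppose that (54) holds for all $\omega_1 \in \Omega_1$. Then, integrating (54) with respect to $\mu_1$ and using (50), we obtain

\[ \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) = \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) - \int_{\Omega_1} \Bigl[\int_{\Omega_2} \xi^-(\omega_1, \omega_2) \, \mu_2(d\omega_2)\Bigr] \mu_1(d\omega_1) \]

\[ = \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) - \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2). \]

Similarly we can establish the first equation in (48) and the equation

\[ \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2) \, d(\mu_1 \times \mu_2) = \int_{\Omega_2} \Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2) \, \mu_1(d\omega_1)\Bigr] \mu_2(d\omega_2). \]

This completes the proof of the theorem.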

Corollary. If $\int_{\Omega_1} \bigl[\int_{\Omega_2} |\xi(\omega_1, \omega_2)| \, \mu_2(d\omega_2)\bigr] \mu_1(d\omega_1) < \infty$, the conclusion of Fubini's theorem is still valid.

In fact, under this hypothesis (47) follows from (50), and consequently the conclusions of Fubini's theorem hold.
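For finite measures on finite sets every integral is a finite sum, and the three expressions in (49) can be compared directly. The following sketch uses hypothetical point masses `mu1`, `mu2` and a hypothetical integrand `xi` of our own choosing; it illustrates the statement and is in no sense a proof.

```python
from fractions import Fraction

# Hypothetical finite measures mu1 on Omega1 and mu2 on Omega2 (point masses).
mu1 = {0: Fraction(1, 2), 1: Fraction(3, 2), 2: Fraction(1)}
mu2 = {"u": Fraction(2), "v": Fraction(1, 3)}

def xi(w1, w2):
    """An arbitrary integrand that takes both signs."""
    return (w1 - 1) * (3 if w2 == "u" else -5)

def double_integral():
    """Integral of xi with respect to the product measure mu1 x mu2."""
    return sum(xi(a, b) * mu1[a] * mu2[b] for a in mu1 for b in mu2)

def iterated_12():
    """Inner integral over Omega2, outer integral over Omega1."""
    return sum(mu1[a] * sum(xi(a, b) * mu2[b] for b in mu2) for a in mu1)

def iterated_21():
    """Inner integral over Omega1, outer integral over Omega2."""
    return sum(mu2[b] * sum(xi(a, b) * mu1[a] for a in mu1) for b in mu2)
```

Here the product measure of a point $(a, b)$ is taken to be `mu1[a] * mu2[b]`, in agreement with the defining property $(\mu_1 \times \mu_2)(A \times B) = \mu_1(A)\mu_2(B)$.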

EXAMPLE. Let $(\xi, \eta)$ be a pair of random variables whose distribution has a two-dimensional density $f_{\xi\eta}(x, y)$, i.e.

\[ P((\xi, \eta) \in B) = \int_B f_{\xi\eta}(x, y) \, dx \, dy, \qquad B \in \mathcal{B}(R^2), \]

where $f_{\xi\eta}(x, y)$ is a nonnegative $\mathcal{B}(R^2)$-measurable function, and the integral is a Lebesgue integral with respect to two-dimensional Lebesgue measure.

Let us show that the one-dimensional distributions for $\xi$ and $\eta$ have densities $f_\xi(x)$ and $f_\eta(y)$, and furthermore

\[ f_\xi(x) = \int_{-\infty}^{\infty} f_{\xi\eta}(x, y) \, dy \qquad \text{and} \qquad f_\eta(y) = \int_{-\infty}^{\infty} f_{\xi\eta}(x, y) \, dx. \tag{55} \]

In fact, if $A \in \mathcal{B}(R)$, then by Fubini's theorem

\[ P(\xi \in A) = P((\xi, \eta) \in A \times R) = \int_{A \times R} f_{\xi\eta}(x, y) \, dx \, dy = \int_A \Bigl[\int_R f_{\xi\eta}(x, y) \, dy\Bigr] dx. \]

This establishes both the existence of a density for the probability distribution of $\xi$ and the first formula in (55). The second formula is established similarly.
According to the theorem in §5, a necessary and sufficient condition that $\xi$ and $\eta$ are independent is that

\[ F_{\xi\eta}(x, y) = F_\xi(x) F_\eta(y), \qquad (x, y) \in R^2. \]

Let us show that when there is a two-dimensional density $f_{\xi\eta}(x, y)$, the variables $\xi$ and $\eta$ are independent if and only if

\[ f_{\xi\eta}(x, y) = f_\xi(x) f_\eta(y) \tag{56} \]

(where the equation is to be understood in the sense of holding almost surely with respect to two-dimensional Lebesgue measure).

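In fact, if (56) holds, then by Fubini's theorem

\[ F_{\xi\eta}(x, y) = \int_{(-\infty, x] \times (-\infty, y]} f_{\xi\eta}(x, y) \, dx \, dy = \int_{(-\infty, x] \times (-\infty, y]} f_\xi(x) f_\eta(y) \, dx \, dy = \int_{(-\infty, x]} f_\xi(x) \, dx \Bigl(\int_{(-\infty, y]} f_\eta(y) \, dy\Bigr) = F_\xi(x) F_\eta(y) \]

and consequently $\xi$ and $\eta$ are independent.

Conversely, if they are independent and have a density $f_{\xi\eta}(x, y)$, then again by Fubini's theorem

\[ \int_{(-\infty, x] \times (-\infty, y]} f_{\xi\eta}(x, y) \, dx \, dy = \Bigl(\int_{(-\infty, x]} f_\xi(x) \, dx\Bigr) \Bigl(\int_{(-\infty, y]} f_\eta(y) \, dy\Bigr) = \int_{(-\infty, x] \times (-\infty, y]} f_\xi(x) f_\eta(y) \, dx \, dy. \]

It follows that

\[ \int_B f_{\xi\eta}(x, y) \, dx \, dy = \int_B f_\xi(x) f_\eta(y) \, dx \, dy \]

for every $B \in \mathcal{B}(R^2)$, and it is easily deduced from Property I that (56) holds.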
10. In this subsection we discuss the relation between the Lebesgue and Riemann integrals.

We first observe that the construction of the Lebesgue integral is independent of the measurable space $(\Omega, \mathcal{F})$ on which the integrands are given. On the other hand, the Riemann integral is not defined on abstract spaces in general, and for $\Omega = R^n$ it is defined sequentially: first for $R^1$, and then extended, with corresponding changes, to the case $n > 1$.

We emphasize that the constructions of the Riemann and Lebesgue integrals are based on different ideas. The first step in the construction of the Riemann integral is to group the points $x \in R^1$ according to their distances along the $x$ axis. On the other hand, in Lebesgue's construction (for $\Omega = R^1$) the points $x \in R^1$ are grouped according to a different principle: by the distances between the values of the integrand. It is a consequence of these different approaches that the Riemann approximating sums have limits only for "mildly" discontinuous functions, whereas the Lebesgue sums converge to limits for a much wider class of functions.

Let us recall the definition of the Riemann-Stieltjes integral. Let $G = G(x)$ be a generalized distribution function on $R$ (see Subsection 2 of §3) and $\mu$ its corresponding Lebesgue-Stieltjes measure, and let $g = g(x)$ be a bounded function that vanishes outside $[a, b]$.


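Consider a decomposition $\mathcal{P} = \{x_0, \ldots, x_n\}$,

\[ a = x_0 < x_1 < \cdots < x_n = b, \]

of $[a, b]$, and form the upper and lower sums

\[ \overline{\sum}_{\mathcal{P}} = \sum_{i=1}^n \bar{g}_i [G(x_i) - G(x_{i-1})], \qquad \underline{\sum}_{\mathcal{P}} = \sum_{i=1}^n \underline{g}_i [G(x_i) - G(x_{i-1})], \]

where

\[ \bar{g}_i = \sup_{x_{i-1} < y \le x_i} g(y), \qquad \underline{g}_i = \inf_{x_{i-1} < y \le x_i} g(y). \]

Define simple functions $\bar{g}_{\mathcal{P}}(x)$ and $\underline{g}_{\mathcal{P}}(x)$ by taking

\[ \bar{g}_{\mathcal{P}}(x) = \bar{g}_i, \qquad \underline{g}_{\mathcal{P}}(x) = \underline{g}_i \]

on $x_{i-1} < x \le x_i$, and define $\bar{g}_{\mathcal{P}}(a) = \underline{g}_{\mathcal{P}}(a) = g(a)$. It is clear that then

\[ \overline{\sum}_{\mathcal{P}} = (\text{L-S}) \int_a^b \bar{g}_{\mathcal{P}}(x) \, G(dx) \qquad \text{and} \qquad \underline{\sum}_{\mathcal{P}} = (\text{L-S}) \int_a^b \underline{g}_{\mathcal{P}}(x) \, G(dx). \]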

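Now let $\{\mathcal{P}_k\}$ be a sequence of decompositions such that $\mathcal{P}_k \subseteq \mathcal{P}_{k+1}$. Then

\[ \bar{g}_{\mathcal{P}_1} \ge \bar{g}_{\mathcal{P}_2} \ge \cdots \ge g \ge \cdots \ge \underline{g}_{\mathcal{P}_2} \ge \underline{g}_{\mathcal{P}_1}, \]

and if $|g(x)| \le C$ we have, by the dominated convergence theorem,

\[ \lim_{k \to \infty} \overline{\sum}_{\mathcal{P}_k} = (\text{L-S}) \int_a^b \bar{g}(x) \, G(dx), \qquad \lim_{k \to \infty} \underline{\sum}_{\mathcal{P}_k} = (\text{L-S}) \int_a^b \underline{g}(x) \, G(dx), \tag{57} \]

where $\bar{g}(x) = \lim_k \bar{g}_{\mathcal{P}_k}(x)$, $\underline{g}(x) = \lim_k \underline{g}_{\mathcal{P}_k}(x)$.

If the limits $\lim_k \overline{\sum}_{\mathcal{P}_k}$ and $\lim_k \underline{\sum}_{\mathcal{P}_k}$ are finite and equal, and their common value is independent of the sequence of decompositions $\{\mathcal{P}_k\}$, we say that $g = g(x)$ is Riemann-Stieltjes integrable, and the common value of the limits is denoted by

\[ (\text{R-S}) \int_a^b g(x) \, G(dx). \tag{58} \]

When $G(x) = x$, the integral is called a Riemann integral and denoted by

\[ (\text{R}) \int_a^b g(x) \, dx. \]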
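The upper and lower sums are easy to compute for a concrete integrand. The sketch below makes simplifying assumptions of our own: $g$ is nondecreasing and continuous, so that its supremum and infimum over $(x_{i-1}, x_i]$ are $g(x_i)$ and $g(x_{i-1})$, and the decompositions are the nested dyadic partitions of $[0, 1]$. The choices $g(x) = x^2$, $G(x) = x$ and all function names are illustrative only.

```python
from fractions import Fraction

def upper_lower_sums(g, G, partition):
    """Upper and lower Riemann-Stieltjes sums for a nondecreasing continuous g.

    For such g the supremum of g over (x_{i-1}, x_i] is g(x_i) and the
    infimum is g(x_{i-1}); this shortcut is an assumption of the sketch.
    """
    upper = lower = Fraction(0)
    for left, right in zip(partition, partition[1:]):
        increment = G(right) - G(left)
        upper += g(right) * increment
        lower += g(left) * increment
    return upper, lower

def uniform_partition(a, b, n):
    """Points a = x_0 < x_1 < ... < x_n = b with equal spacing."""
    return [a + (b - a) * Fraction(i, n) for i in range(n + 1)]

def g(x):
    return x * x      # integrand

def G(x):
    return x          # G(x) = x gives the ordinary Riemann integral

# Dyadic partitions are nested, as required of the sequence of decompositions.
for k in range(1, 6):
    up, lo = upper_lower_sums(g, G, uniform_partition(0, 1, 2 ** k))
    print(2 ** k, float(lo), float(up))
```

For this integrand the gap between the two sums on a uniform partition into $n$ parts is exactly $(g(1) - g(0))/n$, so both sums are squeezed toward the common value of the integral as the partition is refined.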


Now let $(\text{L-S}) \int_a^b g(x) \, G(dx)$ be the corresponding Lebesgue-Stieltjes integral (see Remark 2 in Subsection 2).

Theorem 9. If $g = g(x)$ is continuous on $[a, b]$, it is Riemann-Stieltjes integrable and

\[ (\text{R-S}) \int_a^b g(x) \, G(dx) = (\text{L-S}) \int_a^b g(x) \, G(dx). \tag{59} \]

PROOF. Since $g(x)$ is continuous, we have $\bar{g}(x) = g(x) = \underline{g}(x)$. Hence by (57) $\lim_{k \to \infty} \overline{\sum}_{\mathcal{P}_k} = \lim_{k \to \infty} \underline{\sum}_{\mathcal{P}_k}$. Consequently $g = g(x)$ is Riemann-Stieltjes integrable, and (59) holds (again by (57)).

Let us consider in more detail the question of the correspondence between the Riemann and Lebesgue integrals for the case of Lebesgue measure on the line $R$.

Theorem 10. Let $g(x)$ be a bounded function on $[a, b]$.

(a) The function $g = g(x)$ is Riemann integrable on $[a, b]$ if and only if it is continuous almost everywhere (with respect to Lebesgue measure $\bar{\lambda}$ on $\bar{\mathcal{B}}([a, b])$).

(b) If $g = g(x)$ is Riemann integrable, it is Lebesgue integrable and

\[ (\text{R}) \int_a^b g(x) \, dx = (\text{L}) \int_a^b g(x) \, \bar{\lambda}(dx). \tag{60} \]

PROOF. (a) Let $g = g(x)$ be Riemann integrable. Then, by (57),

\[ (\text{L}) \int_a^b \underline{g}(x) \, \bar{\lambda}(dx) = (\text{L}) \int_a^b \bar{g}(x) \, \bar{\lambda}(dx). \]

But $\underline{g}(x) \le g(x) \le \bar{g}(x)$, and hence by Property H

\[ \underline{g}(x) = g(x) = \bar{g}(x) \qquad (\bar{\lambda}\text{-a.s.}), \tag{61} \]

from which it is easy to see that $g(x)$ is continuous almost everywhere (with respect to $\bar{\lambda}$).

Conversely, let $g = g(x)$ be continuous almost everywhere (with respect to $\bar{\lambda}$). Then (61) is satisfied and consequently $g(x)$ differs from the (Borel) measurable function $\bar{g}(x)$ only on a set $\mathcal{N}$ with $\bar{\lambda}(\mathcal{N}) = 0$. But then

\[ \{x : g(x) \le c\} = \{x : g(x) \le c\} \cap \overline{\mathcal{N}} + \{x : g(x) \le c\} \cap \mathcal{N} = \{x : \bar{g}(x) \le c\} \cap \overline{\mathcal{N}} + \{x : g(x) \le c\} \cap \mathcal{N}. \]

It is clear that the set $\{x : \bar{g}(x) \le c\} \cap \overline{\mathcal{N}} \in \bar{\mathcal{B}}([a, b])$, and that $\{x : g(x) \le c\} \cap \mathcal{N}$ is a subset of $\mathcal{N}$ having Lebesgue measure $\bar{\lambda}$ equal to zero and therefore also belonging to $\bar{\mathcal{B}}([a, b])$. Therefore $g(x)$ is $\bar{\mathcal{B}}([a, b])$-measurable and, as a bounded function, is Lebesgue integrable. Therefore by Property G,

\[ (\text{L}) \int_a^b \bar{g}(x) \, \bar{\lambda}(dx) = (\text{L}) \int_a^b \underline{g}(x) \, \bar{\lambda}(dx) = (\text{L}) \int_a^b g(x) \, \bar{\lambda}(dx), \]

which completes the proof of (a).

(b) If $g = g(x)$ is Riemann integrable, then according to (a) it is continuous ($\bar{\lambda}$-a.s.). It was shown above that then $g(x)$ is Lebesgue integrable and its Riemann and Lebesgue integrals are equal.

This completes the proof of the theorem.

Remark. Let $\mu$ be a Lebesgue-Stieltjes measure on $\mathcal{B}([a, b])$. Let $\bar{\mathcal{B}}_\mu([a, b])$ be the system consisting of those subsets $\Lambda \subseteq [a, b]$ for which there are sets $A$ and $B$ in $\mathcal{B}([a, b])$ such that $A \subseteq \Lambda \subseteq B$ and $\mu(B \setminus A) = 0$. Let $\bar{\mu}$ be an extension of $\mu$ to $\bar{\mathcal{B}}_\mu([a, b])$ ($\bar{\mu}(\Lambda) = \mu(A)$ for $\Lambda$ such that $A \subseteq \Lambda \subseteq B$ and $\mu(B \setminus A) = 0$). Then the conclusion of the theorem remains valid if we consider $\bar{\mu}$ instead of Lebesgue measure $\bar{\lambda}$, and the Riemann-Stieltjes and Lebesgue-Stieltjes integrals with respect to $\bar{\mu}$ instead of the Riemann and Lebesgue integrals.

11. In this part we present a useful theorem on integration by parts for the Lebesgue-Stieltjes integral.

Let two generalized distribution functions $F = F(x)$ and $G = G(x)$ be given on $(R, \mathcal{B}(R))$.

Theorem 11. The following formulas are valid for all real $a$ and $b$, $a < b$:

\[ F(b)G(b) - F(a)G(a) = \int_a^b F(s-) \, dG(s) + \int_a^b G(s) \, dF(s), \tag{62} \]

or equivalently

\[ F(b)G(b) - F(a)G(a) = \int_a^b F(s-) \, dG(s) + \int_a^b G(s-) \, dF(s) + \sum_{a < s \le b} \Delta F(s) \, \Delta G(s), \tag{63} \]

where $F(s-) = \lim_{t \uparrow s} F(t)$, $\Delta F(s) = F(s) - F(s-)$.

Remark 1. Formula (62) can be written symbolically in "differential" form

\[ d(FG) = F_- \, dG + G \, dF. \tag{64} \]


Remark 2. The conclusion of the theorem remains valid for functions $F$ and $G$ of bounded variation on $[a, b]$. (Every such function that is continuous on the right and has limits on the left can be represented as the difference of two monotone nondecreasing functions.)
PROOF. We first recall that in accordance with Subsection 1 an integral $\int_a^b (\cdot)$ means $\int_{(a, b]} (\cdot)$. Then (see formula (2) in §3)

\[ (F(b) - F(a))(G(b) - G(a)) = \int_a^b dF(s) \cdot \int_a^b dG(t). \]

Let $F \times G$ denote the direct product of the measures corresponding to $F$ and $G$. Then by Fubini's theorem

\[ (F(b) - F(a))(G(b) - G(a)) = \int_{(a, b] \times (a, b]} d(F \times G)(s, t) \]

\[ = \int_{(a, b] \times (a, b]} I_{\{s \ge t\}}(s, t) \, d(F \times G)(s, t) + \int_{(a, b] \times (a, b]} I_{\{s < t\}}(s, t) \, d(F \times G)(s, t) \]

\[ = \int_{(a, b]} (G(s) - G(a)) \, dF(s) + \int_{(a, b]} (F(t-) - F(a)) \, dG(t) \]

\[ = \int_a^b G(s) \, dF(s) + \int_a^b F(s-) \, dG(s) - G(a)(F(b) - F(a)) - F(a)(G(b) - G(a)), \tag{65} \]

where $I_A$ is the indicator of the set $A$.

Formula (62) follows immediately from (65). In turn, (63) follows from (62) if we observe that

\[ \int_a^b (G(s) - G(s-)) \, dF(s) = \sum_{a < s \le b} \Delta G(s) \cdot \Delta F(s). \tag{66} \]
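Formulas (62) and (63) can be verified exactly when $F$ and $G$ are step functions, since every Lebesgue-Stieltjes integral then reduces to a finite sum over jump points. The sketch below is an illustration under that assumption; the class `StepFunction`, the helper `integral`, and the jump data are hypothetical choices of ours. The two functions share a jump at $s = 2$, so the correction term in (63) is not zero.

```python
from fractions import Fraction

class StepFunction:
    """Right-continuous nondecreasing step function given by its jumps."""

    def __init__(self, jumps):
        self.jumps = dict(jumps)          # location -> jump size (> 0)

    def __call__(self, x):                # value F(x): jumps at s <= x count
        return sum(h for s, h in self.jumps.items() if s <= x)

    def left(self, x):                    # left limit F(x-): jumps at s < x only
        return sum(h for s, h in self.jumps.items() if s < x)

def integral(h, G, a, b):
    """Lebesgue-Stieltjes integral of h over (a, b] with respect to dG."""
    return sum(h(s) * dG for s, dG in G.jumps.items() if a < s <= b)

# Hypothetical jump data with a common jump at s = 2.
F = StepFunction({1: Fraction(1, 2), 2: Fraction(1, 3), 4: Fraction(1)})
G = StepFunction({2: Fraction(2), 3: Fraction(1, 4)})
a, b = 0, 5

lhs = F(b) * G(b) - F(a) * G(a)
rhs_62 = integral(F.left, G, a, b) + integral(G, F, a, b)
common = sum(F.jumps[s] * G.jumps[s]
             for s in F.jumps if s in G.jumps and a < s <= b)
rhs_63 = integral(F.left, G, a, b) + integral(G.left, F, a, b) + common
```

Replacing `F.left` by `F` in `rhs_62` would count the common jump twice, which is why (62) is not symmetric in $F$ and $G$.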

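Corollary 1. If $F(x)$ and $G(x)$ are distribution functions, then

\[ F(x)G(x) = \int_{-\infty}^x F(s-) \, dG(s) + \int_{-\infty}^x G(s) \, dF(s). \tag{67} \]

If also

\[ F(x) = \int_{-\infty}^x f(s) \, ds, \]

then

\[ F(x)G(x) = \int_{-\infty}^x F(s) \, dG(s) + \int_{-\infty}^x G(s) f(s) \, ds. \tag{68} \]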


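Corollary 2. Let $\xi$ be a random variable with distribution function $F(x)$ and $E|\xi|^n < \infty$. Then

\[ \int_0^\infty x^n \, dF(x) = n \int_0^\infty x^{n-1} [1 - F(x)] \, dx, \tag{69} \]

\[ \int_{-\infty}^0 |x|^n \, dF(x) = -\int_0^\infty x^n \, dF(-x) = n \int_0^\infty x^{n-1} F(-x) \, dx \tag{70} \]

and

\[ E|\xi|^n = \int_{-\infty}^\infty |x|^n \, dF(x) = n \int_0^\infty x^{n-1} [1 - F(x) + F(-x)] \, dx. \tag{71} \]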

To prove (69) we observe that

\[ \int_0^b x^n \, dF(x) = -\int_0^b x^n \, d(1 - F(x)) = -b^n (1 - F(b)) + n \int_0^b x^{n-1} (1 - F(x)) \, dx. \tag{72} \]

Let us show that since $E|\xi|^n < \infty$,

\[ b^n (1 - F(b) + F(-b)) \le b^n P(|\xi| \ge b) \to 0. \tag{73} \]

In fact,

\[ E|\xi|^n = \sum_{k=1}^\infty \int_{k-1}^k |x|^n \, dF(x) < \infty \]

and therefore

\[ \sum_{k \ge b+1} \int_{k-1}^k |x|^n \, dF(x) \to 0, \qquad b \to \infty. \]

But

\[ \sum_{k \ge b+1} \int_{k-1}^k |x|^n \, dF(x) \ge b^n P(|\xi| \ge b), \]

which establishes (73).

Taking the limit as $b \to \infty$ in (72), we obtain (69).

Formula (70) is proved similarly, and (71) follows from (69) and (70).
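For a nonnegative random variable with finitely many values, both sides of (69) can be evaluated exactly, because $1 - F(x)$ is constant between consecutive atoms. The following sketch rests on that assumption; the distribution `dist` and the function names are hypothetical.

```python
from fractions import Fraction

def moment_direct(dist, n):
    """E xi^n computed as the sum of x^n * P(xi = x)."""
    return sum(Fraction(x) ** n * p for x, p in dist.items())

def moment_from_tail(dist, n):
    """n * integral over [0, inf) of x^(n-1) * (1 - F(x)) dx, evaluated exactly.

    For a discrete nonnegative xi, 1 - F is constant between consecutive
    atoms, so the integral splits into terms (right^n - left^n) * P(xi > left).
    """
    points = sorted(set(dist) | {0})
    total = Fraction(0)
    for left, right in zip(points, points[1:]):
        tail = sum(p for x, p in dist.items() if x > left)   # 1 - F(left)
        total += (Fraction(right) ** n - Fraction(left) ** n) * tail
    return total

# Hypothetical distribution of a nonnegative random variable.
dist = {0: Fraction(1, 10), 1: Fraction(2, 5), 3: Fraction(3, 10), 7: Fraction(1, 5)}
```

In this discrete setting the agreement of the two functions is an instance of summation by parts, the finite counterpart of the integration by parts used in (72).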

12. Let $A(t)$, $t \ge 0$, be a function of locally bounded variation (i.e., of bounded variation on each finite interval $[a, b]$), which is continuous on the right and has limits on the left. Consider the equation

\[ Z_t = 1 + \int_0^t Z_{s-} \, dA(s), \tag{74} \]

which can be written in differential form as

\[ dZ = Z_- \, dA, \qquad Z_0 = 1. \tag{75} \]

The formula that we have proved for integration by parts lets us solve (74) explicitly in the class of functions of bounded variation.

We introduce the function

\[ \mathcal{E}_t(A) = e^{A(t) - A(0)} \prod_{0 \le s \le t} (1 + \Delta A(s)) e^{-\Delta A(s)}, \tag{76} \]

where $\Delta A(s) = A(s) - A(s-)$ for $s > 0$, and $\Delta A(0) = 0$.

The function $A(s)$, $0 \le s \le t$, has bounded variation and therefore has at most a countable number of discontinuities, and so the series $\sum_{0 \le s \le t} |\Delta A(s)|$ converges. It follows that

\[ \prod_{0 \le s \le t} (1 + \Delta A(s)) e^{-\Delta A(s)} \]

is a function of locally bounded variation.

If $A^c(t) = A(t) - \sum_{0 \le s \le t} \Delta A(s)$ is the continuous component of $A(t)$, we can rewrite (76) in the form

\[ \mathcal{E}_t(A) = e^{A^c(t) - A^c(0)} \prod_{0 \le s \le t} (1 + \Delta A(s)). \tag{77} \]

Let us write

\[ F(t) = e^{A^c(t) - A^c(0)}, \qquad G(t) = \prod_{0 \le s \le t} (1 + \Delta A(s)). \]

Then by (62)

\[ \mathcal{E}_t(A) = F(t)G(t) = 1 + \int_0^t F(s) \, dG(s) + \int_0^t G(s-) \, dF(s) \]

\[ = 1 + \sum_{0 < s \le t} F(s) G(s-) \Delta A(s) + \int_0^t G(s-) F(s) \, dA^c(s) = 1 + \int_0^t \mathcal{E}_{s-}(A) \, dA(s). \]

Therefore $\mathcal{E}_t(A)$, $t \ge 0$, is a (locally bounded) solution of (74). Let us show that this is the only locally bounded solution.
Suppose that there are two such solutions and let $Y = Y(t)$, $t \ge 0$, be their difference. Then

\[ Y(t) = \int_0^t Y(s-) \, dA(s). \]

Put

\[ T = \inf\{t \ge 0 : Y(t) \ne 0\}, \]

where we take $T = \infty$ if $Y(t) = 0$ for $t \ge 0$.

Since $A(t)$ is a function of locally bounded variation, there are two generalized distribution functions $A_1(t)$ and $A_2(t)$ such that $A(t) = A_1(t) - A_2(t)$. If we suppose that $T < \infty$, we can find a finite $T' > T$ such that

\[ [A_1(T') + A_2(T')] - [A_1(T) + A_2(T)] \le \tfrac{1}{2}. \]

Then it follows from the equation

\[ Y(t) = \int_T^t Y(s-) \, dA(s), \qquad t \ge T, \]

that

\[ \sup_{t \le T'} |Y(t)| \le \tfrac{1}{2} \sup_{t \le T'} |Y(t)| \]

and since $\sup |Y(t)| < \infty$, we have $Y(t) = 0$ for $T < t \le T'$, contradicting the assumption that $T < \infty$.

Thus we have proved the following theorem.

Theorem 12. There is a unique locally bounded solution of (74), and it is given by (76).
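When $A$ is a pure-jump function with finitely many jumps, its continuous component is constant, (77) reduces to a finite product, and the integral in (74) reduces to a finite sum. The sketch below checks the equation in that special case only; the jump data and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical pure-jump A: jump times (all > 0) mapped to jump sizes Delta A(s).
jumps = {1: Fraction(1, 2), 2: Fraction(-1, 3), 5: Fraction(2), 6: Fraction(-3, 2)}

def exponential(t, strict=False):
    """E_t(A) from (77) when the continuous part of A is constant.

    With strict=True only jumps at s < t are used, which gives the left
    limit E_{t-}(A).
    """
    value = Fraction(1)
    for s, dA in sorted(jumps.items()):
        if s < t or (s == t and not strict):
            value *= 1 + dA
    return value

def right_side_of_74(t):
    """1 + integral over (0, t] of E_{s-}(A) dA(s), a finite sum here."""
    return 1 + sum(exponential(s, strict=True) * dA
                   for s, dA in jumps.items() if s <= t)
```

At each jump time $s$ the solution is multiplied by $1 + \Delta A(s)$, which is the same as adding $Z_{s-} \Delta A(s)$; this is the content of $dZ = Z_- \, dA$ in the pure-jump case.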
13. PROBLEMS

1. Establish the representation (6).

2. Prove the following extension of Property E. Let $\xi$ and $\eta$ be random variables for which $E\xi$ and $E\eta$ are defined and the sum $E\xi + E\eta$ is meaningful (does not have the form $\infty - \infty$ or $-\infty + \infty$). Then

\[ E(\xi + \eta) = E\xi + E\eta. \]

3. Generalize Property G by showing that if $\xi = \eta$ (a.s.) and $E\xi$ exists, then $E\eta$ exists and $E\xi = E\eta$.

4. Let $\xi$ be an extended random variable, $\mu$ a $\sigma$-finite measure, and $\int_\Omega |\xi| \, d\mu < \infty$. Show that $|\xi| < \infty$ ($\mu$-a.s.) (cf. Property J).

5. Let $\mu$ be a $\sigma$-finite measure, $\xi$ and $\eta$ extended random variables for which $E\xi$ and $E\eta$ are defined. If $\int_A \xi \, dP \le \int_A \eta \, dP$ for all $A \in \mathcal{F}$, then $\xi \le \eta$ ($\mu$-a.s.). (Cf. Property I.)

6. Let $\xi$ and $\eta$ be independent nonnegative random variables. Show that $E\xi\eta = E\xi \cdot E\eta$.
7. Using Fatou's lemma, show that

8. Find an example to show that in general it is impossible to weaken the hypothesis "$|\xi_n| \le \eta$, $E\eta < \infty$" in the dominated convergence theorem.

9. Find an example to show that in general the hypothesis "$\xi_n \ge \eta$, $E\eta > -\infty$" in Fatou's lemma cannot be omitted.

10. Prove the following variants of Fatou's lemma. Let the family $\{\xi_n^+\}_{n \ge 1}$ of random variables be uniformly integrable and let $E \limsup \xi_n$ exist. Then

\[ \limsup E\xi_n \le E \limsup \xi_n. \]

Let $\xi_n \le \eta_n$, $n \ge 1$, where the family $\{\xi_n^+\}_{n \ge 1}$ is uniformly integrable and $\eta_n$ converges a.s. (or only in probability; see §10 below) to a random variable $\eta$. Then $\limsup E\xi_n \le E \limsup \xi_n$.
Let~.

11. Dirichlet's function

\[ d(x) = \begin{cases} 1, & x \text{ irrational}, \\ 0, & x \text{ rational}, \end{cases} \]

is defined on $[0, 1]$, Lebesgue integrable, but not Riemann integrable. Why?

12. Find an example of a sequence of Riemann integrable functions $\{f_n\}_{n \ge 1}$, defined on $[0, 1]$, such that $|f_n| \le 1$, $f_n \to f$ almost everywhere (with Lebesgue measure), but $f$ is not Riemann integrable.

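13. Let $(a_{ij};\ i, j \ge 1)$ be a sequence of real numbers such that $\sum_{i,j} |a_{ij}| < \infty$. Deduce from Fubini's theorem that

\[ \sum_{(i,j)} a_{ij} = \sum_i \Bigl(\sum_j a_{ij}\Bigr) = \sum_j \Bigl(\sum_i a_{ij}\Bigr). \tag{78} \]

14. Find an example of a sequence $(a_{ij};\ i, j \ge 1)$ for which $\sum_{i,j} |a_{ij}| = \infty$ and the equation in (78) does not hold.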
15. Starting from simple functions and using the theorem on taking limits under the Lebesgue integral sign, prove the following result on integration by substitution. Let $h = h(y)$ be a nondecreasing continuously differentiable function on $[a, b]$, and let $f(x)$ be (Lebesgue) integrable on $[h(a), h(b)]$. Then the function $f(h(y)) h'(y)$ is integrable on $[a, b]$ and

\[ \int_{h(a)}^{h(b)} f(x) \, dx = \int_a^b f(h(y)) h'(y) \, dy. \]

16. Prove formula (70).


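17. Let $\xi, \xi_1, \xi_2, \ldots$ be nonnegative integrable random variables such that $E\xi_n \to E\xi$ and $P(\xi - \xi_n > \varepsilon) \to 0$ for every $\varepsilon > 0$. Show that then $E|\xi_n - \xi| \to 0$, $n \to \infty$.

18. Let $\xi, \eta, \zeta$ and $\xi_n, \eta_n, \zeta_n$, $n \ge 1$, be random variables such that

\[ \xi_n \xrightarrow{P} \xi, \quad \eta_n \xrightarrow{P} \eta, \quad \zeta_n \xrightarrow{P} \zeta, \quad \eta_n \le \xi_n \le \zeta_n, \; n \ge 1, \quad E\zeta_n \to E\zeta, \quad E\eta_n \to E\eta, \]

and the expectations $E\xi$, $E\eta$, $E\zeta$ are finite. Show that then $E\xi_n \to E\xi$ (Pratt's lemma). If also $\eta_n \le 0 \le \zeta_n$ then $E|\xi_n - \xi| \to 0$.

Deduce that if $\xi_n \xrightarrow{P} \xi$, $E|\xi_n| \to E|\xi|$ and $E|\xi| < \infty$, then $E|\xi_n - \xi| \to 0$.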


7. Conditional Probabilities and Conditional Expectations with Respect to a $\sigma$-Algebra
1. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $A \in \mathcal{F}$ be an event such that $P(A) > 0$. As for finite probability spaces, the conditional probability of $B$ with respect to $A$ (denoted by $P(B \mid A)$) means $P(BA)/P(A)$, and the conditional probability of $B$ with respect to the finite or countable decomposition $\mathcal{D} = \{D_1, D_2, \ldots\}$ with $P(D_i) > 0$, $i \ge 1$ (denoted by $P(B \mid \mathcal{D})$) is the random variable equal to $P(B \mid D_i)$ for $\omega \in D_i$, $i \ge 1$:

\[ P(B \mid \mathcal{D}) = \sum_{i \ge 1} P(B \mid D_i) I_{D_i}(\omega). \]

In a similar way, if $\xi$ is a random variable for which $E\xi$ is defined, the conditional expectation of $\xi$ with respect to the event $A$ with $P(A) > 0$ (denoted by $E(\xi \mid A)$) is $E(\xi I_A)/P(A)$ (cf. (I.8.10)).

The random variable $P(B \mid \mathcal{D})$ is evidently measurable with respect to the $\sigma$-algebra $\mathcal{G} = \sigma(\mathcal{D})$, and is consequently also denoted by $P(B \mid \mathcal{G})$ (see §8 of Chapter I).
However, in probability theory we may have to consider conditional
probabilities with respect to events whose probabilities are zero.
Consider, for example, the following experiment. Let $\xi$ be a random variable that is uniformly distributed on $[0, 1]$. If $\xi = x$, toss a coin for which the probability of head is $x$, and the probability of tail is $1 - x$. Let $\nu$ be the number of heads in $n$ independent tosses of this coin. What is the "conditional probability $P(\nu = k \mid \xi = x)$"? Since $P(\xi = x) = 0$, the conditional probability $P(\nu = k \mid \xi = x)$ is undefined, although it is intuitively plausible that "it ought to be $C_n^k x^k (1 - x)^{n-k}$."

Let us now give a general definition of conditional expectation (and, in particular, of conditional probability) with respect to a $\sigma$-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$, and compare it with the definition given in §8 of Chapter I for finite probability spaces.
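Returning for a moment to the coin-tossing experiment: if one accepts the intuitive value $C_n^k x^k (1 - x)^{n-k}$ for the conditional probability and averages it over the uniform distribution of $\xi$, the unconditional distribution of $\nu$ can be computed exactly. The sketch below does this by expanding the polynomial and integrating term by term; the averaging step is an assumption made for illustration (it is justified only once conditional probabilities have been defined), and the function name `prob_heads` is ours.

```python
from fractions import Fraction
from math import comb

def prob_heads(n, k):
    """Integral over [0, 1] of C(n, k) * x^k * (1 - x)^(n - k) dx, exactly.

    The factor (1 - x)^(n - k) is expanded by the binomial theorem and each
    monomial x^(k + j) integrates to 1 / (k + j + 1).
    """
    integral = sum(Fraction((-1) ** j * comb(n - k, j), k + j + 1)
                   for j in range(n - k + 1))
    return comb(n, k) * integral
```

Under this assumption every value $k = 0, 1, \ldots, n$ turns out to have the same probability $1/(n + 1)$.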
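2. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $\mathcal{G}$ a $\sigma$-algebra, $\mathcal{G} \subseteq \mathcal{F}$ ($\mathcal{G}$ is a $\sigma$-subalgebra of $\mathcal{F}$), and $\xi = \xi(\omega)$ a random variable. Recall that, according to §6, the expectation $E\xi$ was defined in two stages: first for a nonnegative random variable $\xi$, then in the general case by

\[ E\xi = E\xi^+ - E\xi^-, \]

and only under the assumption that

\[ \min(E\xi^-, E\xi^+) < \infty. \]

A similar two-stage construction is also used to define conditional expectations $E(\xi \mid \mathcal{G})$.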


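Definition 1.
(1) The conditional expectation of a nonnegative random variable $\xi$ with respect to the $\sigma$-algebra $\mathcal{G}$ is a nonnegative extended random variable, denoted by $E(\xi \mid \mathcal{G})$ or $E(\xi \mid \mathcal{G})(\omega)$, such that

(a) $E(\xi \mid \mathcal{G})$ is $\mathcal{G}$-measurable;
(b) for every $A \in \mathcal{G}$

\[ \int_A \xi \, dP = \int_A E(\xi \mid \mathcal{G}) \, dP. \tag{1} \]

(2) The conditional expectation $E(\xi \mid \mathcal{G})$, or $E(\xi \mid \mathcal{G})(\omega)$, of any random variable $\xi$ with respect to the $\sigma$-algebra $\mathcal{G}$, is considered to be defined if

\[ \min(E(\xi^+ \mid \mathcal{G}), E(\xi^- \mid \mathcal{G})) < \infty, \]

P-a.s., and it is given by the formula

\[ E(\xi \mid \mathcal{G}) \equiv E(\xi^+ \mid \mathcal{G}) - E(\xi^- \mid \mathcal{G}), \]

where, on the set (of probability zero) of sample points for which $E(\xi^+ \mid \mathcal{G}) = E(\xi^- \mid \mathcal{G}) = \infty$, the difference $E(\xi^+ \mid \mathcal{G}) - E(\xi^- \mid \mathcal{G})$ is given an arbitrary value, for example zero.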
We begin by showing that, for nonnegative random variables, $E(\xi \mid \mathcal{G})$ actually exists. By (6.36) the set function

\[ Q(A) = \int_A \xi \, dP, \qquad A \in \mathcal{G}, \tag{2} \]

is a measure on $(\Omega, \mathcal{G})$, and is absolutely continuous with respect to $P$ (considered on $(\Omega, \mathcal{G})$, $\mathcal{G} \subseteq \mathcal{F}$). Therefore (by the Radon-Nikodym theorem) there is a nonnegative $\mathcal{G}$-measurable extended random variable $E(\xi \mid \mathcal{G})$ such that

\[ Q(A) = \int_A E(\xi \mid \mathcal{G}) \, dP. \tag{3} \]

Then (1) follows from (2) and (3).

Remark 1. In accordance with the Radon-Nikodym theorem, the conditional expectation $E(\xi \mid \mathcal{G})$ is defined only up to sets of $P$-measure zero. In other words, $E(\xi \mid \mathcal{G})$ can be taken to be any $\mathcal{G}$-measurable function $f(\omega)$ for which $Q(A) = \int_A f(\omega) \, dP$, $A \in \mathcal{G}$ (a "variant" of the conditional expectation).

Let us observe that, in accordance with the remark on the Radon-Nikodym theorem,

\[ E(\xi \mid \mathcal{G}) \equiv \frac{dQ}{dP}(\omega), \tag{4} \]

i.e. the conditional expectation is just the Radon-Nikodym derivative of the measure $Q$ with respect to $P$ (considered on $(\Omega, \mathcal{G})$).
Remark 2. In connection with (1), we observe that we cannot in general put $E(\xi \mid \mathcal{G}) = \xi$, since $\xi$ is not necessarily $\mathcal{G}$-measurable.

Remark 3. Suppose that $\xi$ is a random variable for which $E\xi$ does not exist. Then $E(\xi \mid \mathcal{G})$ may be definable as a $\mathcal{G}$-measurable function for which (1) holds. This is usually just what happens. Our definition $E(\xi \mid \mathcal{G}) \equiv E(\xi^+ \mid \mathcal{G}) - E(\xi^- \mid \mathcal{G})$ has the advantage that for the trivial $\sigma$-algebra $\mathcal{G} = \{\varnothing, \Omega\}$ it reduces to the definition of $E\xi$ but does not presuppose the existence of $E\xi$. (For example, if $\xi$ is a random variable with $E\xi^+ = \infty$, $E\xi^- = \infty$, and $\mathcal{G} = \mathcal{F}$, then $E\xi$ is not defined, but in terms of Definition 1, $E(\xi \mid \mathcal{G})$ exists and is simply $\xi = \xi^+ - \xi^-$.)

Remark 4. Let the random variable $\xi$ have a conditional expectation $E(\xi \mid \mathcal{G})$ with respect to the $\sigma$-algebra $\mathcal{G}$. The conditional variance (denoted by $V(\xi \mid \mathcal{G})$ or $V(\xi \mid \mathcal{G})(\omega)$) of $\xi$ is the random variable

\[ V(\xi \mid \mathcal{G}) \equiv E[(\xi - E(\xi \mid \mathcal{G}))^2 \mid \mathcal{G}]. \]

(Cf. the definition of the conditional variance $V(\xi \mid \mathcal{D})$ of $\xi$ with respect to a decomposition $\mathcal{D}$, as given in Problem 2, §8, Chapter I.)
Definition 2. Let $B \in \mathcal{F}$. The conditional expectation $E(I_B \mid \mathcal{G})$ is denoted by $P(B \mid \mathcal{G})$, or $P(B \mid \mathcal{G})(\omega)$, and is called the conditional probability of the event $B$ with respect to the $\sigma$-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$.

It follows from Definitions 1 and 2 that, for a given $B \in \mathcal{F}$, $P(B \mid \mathcal{G})$ is a random variable such that

(a) $P(B \mid \mathcal{G})$ is $\mathcal{G}$-measurable,
(b) for every $A \in \mathcal{G}$

\[ P(A \cap B) = \int_A P(B \mid \mathcal{G}) \, dP. \tag{5} \]

Definition 3. Let $\xi$ be a random variable and $\mathcal{G}_\eta$ the $\sigma$-algebra generated by a random element $\eta$. Then $E(\xi \mid \mathcal{G}_\eta)$, if defined, is denoted by $E(\xi \mid \eta)$ or $E(\xi \mid \eta)(\omega)$, and is called the conditional expectation of $\xi$ with respect to $\eta$.

The conditional probability $P(B \mid \mathcal{G}_\eta)$ is denoted by $P(B \mid \eta)$ or $P(B \mid \eta)(\omega)$, and is called the conditional probability of $B$ with respect to $\eta$.
3. Let us show that the definition of $E(\xi \mid \mathcal{G})$ given here agrees with the definition of conditional expectation in §8 of Chapter I.

Let $\mathcal{D} = \{D_1, D_2, \ldots\}$ be a finite or countable decomposition with atoms $D_i$ with respect to the probability $P$ (i.e. $P(D_i) > 0$, and if $A \subseteq D_i$, then either $P(A) = 0$ or $P(D_i \setminus A) = 0$).

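Theorem 1. If $\mathcal{G} = \sigma(\mathcal{D})$ and $\xi$ is a random variable for which $E\xi$ is defined, then

\[ E(\xi \mid \mathcal{G}) = E(\xi \mid D_i) \qquad (P\text{-a.s. on } D_i) \tag{6} \]

or equivalently

\[ E(\xi \mid \mathcal{G}) = \frac{E(\xi I_{D_i})}{P(D_i)} \qquad (P\text{-a.s. on } D_i). \]

(The notation "$\xi = \eta$ ($P$-a.s. on $A$)," or "$\xi = \eta$ ($A$; $P$-a.s.)" means that $P(A \cap \{\xi \ne \eta\}) = 0$.)

PROOF. According to Lemma 3 of §4, $E(\xi \mid \mathcal{G}) = K_i$ on $D_i$, where $K_i$ are constants. But

\[ \int_{D_i} \xi \, dP = \int_{D_i} E(\xi \mid \mathcal{G}) \, dP = K_i P(D_i), \]

whence

\[ K_i = \frac{1}{P(D_i)} \int_{D_i} \xi \, dP = \frac{E(\xi I_{D_i})}{P(D_i)} = E(\xi \mid D_i). \]

This completes the proof of the theorem.

Consequently the concept of the conditional expectation $E(\xi \mid \mathcal{D})$ with respect to a finite decomposition $\mathcal{D} = \{D_1, \ldots, D_n\}$, as introduced in Chapter I, is a special case of the concept of conditional expectation with respect to the $\sigma$-algebra $\mathcal{G} = \sigma(\mathcal{D})$.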
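Theorem 1 gives a recipe that is easy to carry out on a finite probability space: average $\xi$ over each atom. The sketch below uses a hypothetical five-point space, random variable and decomposition of our own choosing, and then checks the defining property (1) on every set of $\sigma(\mathcal{D})$, that is, on every union of atoms.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space, random variable, and decomposition.
P = {1: Fraction(1, 8), 2: Fraction(1, 8), 3: Fraction(1, 4),
     4: Fraction(1, 4), 5: Fraction(1, 4)}
xi = {1: 4, 2: -2, 3: 1, 4: 0, 5: 6}
decomposition = [{1, 2}, {3}, {4, 5}]

def conditional_expectation(xi, P, decomposition):
    """E(xi | sigma(D)) as a function on the sample space, using (6)."""
    result = {}
    for D in decomposition:
        K = sum(xi[w] * P[w] for w in D) / sum(P[w] for w in D)
        for w in D:
            result[w] = K          # constant on each atom
    return result

def integral(f, P, A):
    """Integral of f over A with respect to P (a finite sum)."""
    return sum(f[w] * P[w] for w in A)

cond = conditional_expectation(xi, P, decomposition)

# Property (1): the integrals of xi and E(xi | G) agree on every set of G.
for r in range(len(decomposition) + 1):
    for atoms in combinations(decomposition, r):
        A = set().union(*atoms)
        assert integral(xi, P, A) == integral(cond, P, A)
```

Note that `cond` generally differs from `xi` pointwise (compare Remark 2); only the integrals over sets of the smaller $\sigma$-algebra coincide.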
4. Properties of conditional expectations. (We shall suppose that the expectations are defined for all the random variables that we consider and that
~ ~ $'.)

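A*. If $C$ is a constant and $\xi = C$ (a.s.), then $E(\xi \mid \mathscr{G}) = C$ (a.s.).
B*. If $\xi \le \eta$ (a.s.) then $E(\xi \mid \mathscr{G}) \le E(\eta \mid \mathscr{G})$ (a.s.).
C*. $|E(\xi \mid \mathscr{G})| \le E(|\xi| \mid \mathscr{G})$ (a.s.).
D*. If $a$, $b$ are constants and $aE\xi + bE\eta$ is defined, then
$$E(a\xi + b\eta \mid \mathscr{G}) = aE(\xi \mid \mathscr{G}) + bE(\eta \mid \mathscr{G}) \quad \text{(a.s.)}.$$
E*. Let $\mathscr{F}_* = \{\varnothing, \Omega\}$ be the trivial $\sigma$-algebra. Then
$$E(\xi \mid \mathscr{F}_*) = E\xi \quad \text{(a.s.)}.$$
F*. $E(\xi \mid \mathscr{F}) = \xi$ (a.s.).
G*. $E(E(\xi \mid \mathscr{G})) = E\xi$.
H*. If $\mathscr{G}_1 \subseteq \mathscr{G}_2$ then
$$E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1] = E(\xi \mid \mathscr{G}_1) \quad \text{(a.s.)}.$$
I*. If $\mathscr{G}_1 \supseteq \mathscr{G}_2$ then
$$E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1] = E(\xi \mid \mathscr{G}_2) \quad \text{(a.s.)}.$$
J*. Let a random variable $\xi$ for which $E\xi$ is defined be independent of the
$\sigma$-algebra $\mathscr{G}$ (i.e., independent of $I_B$, $B \in \mathscr{G}$). Then
$$E(\xi \mid \mathscr{G}) = E\xi \quad \text{(a.s.)}.$$
K*. Let $\eta$ be a $\mathscr{G}$-measurable random variable, $E|\eta| < \infty$ and $E|\xi\eta| < \infty$.
Then
$$E(\xi\eta \mid \mathscr{G}) = \eta E(\xi \mid \mathscr{G}) \quad \text{(a.s.)}.$$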
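Properties G*, H* and K* can be checked by direct computation when the $\sigma$-algebras are generated by finite partitions. The sketch below assumes a uniform eight-point sample space; the helper `cond_exp`, the partitions `coarse` and `fine`, and the variables `xi` and `eta` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def cond_exp(prob, xi, partition):
    """E(xi | sigma(partition)) on a finite sample space, as a dict omega -> value."""
    out = {}
    for atom in partition:
        p_atom = sum(prob[w] for w in atom)
        mean = sum(xi[w] * prob[w] for w in atom) / p_atom
        for w in atom:
            out[w] = mean
    return out

omega = range(8)
prob = {w: Fraction(1, 8) for w in omega}
xi = {w: Fraction(w * w) for w in omega}

coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]        # generates G_1
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]      # generates G_2, with G_1 contained in G_2

# H*: E[E(xi | G_2) | G_1] = E(xi | G_1)
tower = cond_exp(prob, cond_exp(prob, xi, fine), coarse)
direct = cond_exp(prob, xi, coarse)

# K*: eta is G_2-measurable, i.e. constant on each atom of `fine`
eta = {w: Fraction(w // 2 + 1) for w in omega}
xi_eta = {w: xi[w] * eta[w] for w in omega}
lhs = cond_exp(prob, xi_eta, fine)
e_fine = cond_exp(prob, xi, fine)
rhs = {w: eta[w] * e_fine[w] for w in omega}

# G*: E(E(xi | G)) = E(xi)
mean_xi = sum(xi[w] * prob[w] for w in omega)
mean_cond = sum(direct[w] * prob[w] for w in omega)
```

On a finite space every null set with respect to a strictly positive `prob` is empty, so the "almost sure" equalities hold here as exact equalities of dictionaries.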

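Let us establish these properties.

A*. A constant function is measurable with respect to $\mathscr{G}$. Therefore we need
only verify that
$$\int_A \xi\,dP = \int_A C\,dP, \quad A \in \mathscr{G}.$$
But, by the hypothesis $\xi = C$ (a.s.) and Property G of §6, this equation
is obviously satisfied.
B*. If $\xi \le \eta$ (a.s.), then by Property B of §6
$$\int_A \xi\,dP \le \int_A \eta\,dP, \quad A \in \mathscr{G},$$
and therefore
$$\int_A E(\xi \mid \mathscr{G})\,dP \le \int_A E(\eta \mid \mathscr{G})\,dP, \quad A \in \mathscr{G}.$$
The required inequality now follows from Property I (§6).
C*. This follows from the preceding property if we observe that $-|\xi| \le \xi \le |\xi|$.
D*. If $A \in \mathscr{G}$ then by Problem 2 of §6,
$$\int_A (a\xi + b\eta)\,dP = \int_A a\xi\,dP + \int_A b\eta\,dP = \int_A aE(\xi \mid \mathscr{G})\,dP + \int_A bE(\eta \mid \mathscr{G})\,dP = \int_A [aE(\xi \mid \mathscr{G}) + bE(\eta \mid \mathscr{G})]\,dP,$$
which establishes D*.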


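E*. This property follows from the remark that $E\xi$ is an $\mathscr{F}_*$-measurable
function and the evident fact that if $A = \Omega$ or $A = \varnothing$ then
$$\int_A \xi\,dP = \int_A E\xi\,dP.$$
F*. Since $\xi$ is $\mathscr{F}$-measurable and
$$\int_A \xi\,dP = \int_A \xi\,dP, \quad A \in \mathscr{F},$$
we have $E(\xi \mid \mathscr{F}) = \xi$ (a.s.).
G*. This follows from E* and H* by taking $\mathscr{G}_1 = \{\varnothing, \Omega\}$ and $\mathscr{G}_2 = \mathscr{G}$.
H*. Let $A \in \mathscr{G}_1$; then
$$\int_A E(\xi \mid \mathscr{G}_1)\,dP = \int_A \xi\,dP.$$
Since $\mathscr{G}_1 \subseteq \mathscr{G}_2$, we have $A \in \mathscr{G}_2$ and therefore
$$\int_A E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1]\,dP = \int_A E(\xi \mid \mathscr{G}_2)\,dP = \int_A \xi\,dP.$$
Consequently, when $A \in \mathscr{G}_1$,
$$\int_A E(\xi \mid \mathscr{G}_1)\,dP = \int_A E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1]\,dP$$
and by Property I (§6) and Problem 5 (§6)
$$E(\xi \mid \mathscr{G}_1) = E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1] \quad \text{(a.s.)}.$$
I*. If $A \in \mathscr{G}_1$, then by the definition of $E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1]$
$$\int_A E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1]\,dP = \int_A E(\xi \mid \mathscr{G}_2)\,dP.$$
The function $E(\xi \mid \mathscr{G}_2)$ is $\mathscr{G}_2$-measurable and, since $\mathscr{G}_2 \subseteq \mathscr{G}_1$, also
$\mathscr{G}_1$-measurable. It follows that $E(\xi \mid \mathscr{G}_2)$ is a variant of the expectation
$E[E(\xi \mid \mathscr{G}_2) \mid \mathscr{G}_1]$, which proves Property I*.
J*. Since $E\xi$ is a $\mathscr{G}$-measurable function, we have only to verify that
$$\int_B \xi\,dP = \int_B E\xi\,dP, \quad B \in \mathscr{G},$$
i.e. that $E[\xi \cdot I_B] = E\xi \cdot EI_B$. If $E|\xi| < \infty$, this follows immediately from
Theorem 6 of §6. The general case can be reduced to this by applying
Problem 6 of §6.
The proof of Property K* will be given a little later; it depends on conclusion (a) of the following theorem.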


II. Mathematical Foundations of Probability Theory

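Theorem 2 (On Taking Limits Under the Expectation Sign). Let $\{\xi_n\}_{n \ge 1}$
be a sequence of extended random variables.

(a) If $|\xi_n| \le \eta$, $E\eta < \infty$ and $\xi_n \to \xi$ (a.s.), then
$$E(\xi_n \mid \mathscr{G}) \to E(\xi \mid \mathscr{G}) \quad \text{(a.s.)}$$
and
$$E(|\xi_n - \xi| \mid \mathscr{G}) \to 0 \quad \text{(a.s.)}.$$
(b) If $\xi_n \ge \eta$, $E\eta > -\infty$ and $\xi_n \uparrow \xi$ (a.s.), then
$$E(\xi_n \mid \mathscr{G}) \uparrow E(\xi \mid \mathscr{G}) \quad \text{(a.s.)}.$$
(c) If $\xi_n \le \eta$, $E\eta < \infty$, and $\xi_n \downarrow \xi$ (a.s.), then
$$E(\xi_n \mid \mathscr{G}) \downarrow E(\xi \mid \mathscr{G}) \quad \text{(a.s.)}.$$
(d) If $\xi_n \ge \eta$, $E\eta > -\infty$, then
$$E(\liminf \xi_n \mid \mathscr{G}) \le \liminf E(\xi_n \mid \mathscr{G}) \quad \text{(a.s.)}.$$
(e) If $\xi_n \le \eta$, $E\eta < \infty$, then
$$\limsup E(\xi_n \mid \mathscr{G}) \le E(\limsup \xi_n \mid \mathscr{G}) \quad \text{(a.s.)}.$$
(f) If $\xi_n \ge 0$ then
$$E\Bigl(\sum \xi_n \Bigm| \mathscr{G}\Bigr) = \sum E(\xi_n \mid \mathscr{G}) \quad \text{(a.s.)}.$$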

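PROOF. (a) Let $\zeta_n = \sup_{m \ge n} |\xi_m - \xi|$. Since $\xi_n \to \xi$ (a.s.), we have $\zeta_n \downarrow 0$
(a.s.). The expectations $E\xi_n$ and $E\xi$ are finite; therefore by Properties D*
and C* (a.s.)
$$|E(\xi_n \mid \mathscr{G}) - E(\xi \mid \mathscr{G})| = |E(\xi_n - \xi \mid \mathscr{G})| \le E(|\xi_n - \xi| \mid \mathscr{G}) \le E(\zeta_n \mid \mathscr{G}).$$
Since $E(\zeta_{n+1} \mid \mathscr{G}) \le E(\zeta_n \mid \mathscr{G})$ (a.s.), the limit $h = \lim_n E(\zeta_n \mid \mathscr{G})$ exists (a.s.). Then
$$0 \le \int_\Omega h\,dP \le \int_\Omega E(\zeta_n \mid \mathscr{G})\,dP = \int_\Omega \zeta_n\,dP \to 0, \quad n \to \infty,$$
where the last statement follows from the dominated convergence theorem,
since $0 \le \zeta_n \le 2\eta$, $E\eta < \infty$. Consequently $\int_\Omega h\,dP = 0$ and then $h = 0$
(a.s.) by Property H.
(b) First let $\eta \equiv 0$. Since $E(\xi_n \mid \mathscr{G}) \le E(\xi_{n+1} \mid \mathscr{G})$ (a.s.) the limit $\zeta(\omega) =
\lim_n E(\xi_n \mid \mathscr{G})$ exists (a.s.). Then by the equation
$$\int_A \xi_n\,dP = \int_A E(\xi_n \mid \mathscr{G})\,dP, \quad A \in \mathscr{G},$$
and the theorem on monotone convergence,
$$\int_A \xi\,dP = \int_A \zeta\,dP, \quad A \in \mathscr{G}.$$
Since $\zeta$ is $\mathscr{G}$-measurable, it follows that $E(\xi \mid \mathscr{G}) = \zeta$ (a.s.) by Property I and Problem 5 of §6.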


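For the proof in the general case, we observe that $0 \le \xi_n^+ \uparrow \xi^+$, and by
what has been proved,
$$E(\xi_n^+ \mid \mathscr{G}) \uparrow E(\xi^+ \mid \mathscr{G}) \quad \text{(a.s.)}. \tag{7}$$
But $0 \le \xi_n^- \le \eta^-$, $E\eta^- < \infty$, and $\xi_n^- \to \xi^-$ (a.s.); therefore by (a)
$$E(\xi_n^- \mid \mathscr{G}) \to E(\xi^- \mid \mathscr{G}),$$
which, with (7), proves (b).
Conclusion (c) follows from (b).
(d) Let $\zeta_n = \inf_{m \ge n} \xi_m$; then $\zeta_n \uparrow \zeta$, where $\zeta = \liminf \xi_n$. According to (b),
$E(\zeta_n \mid \mathscr{G}) \uparrow E(\zeta \mid \mathscr{G})$ (a.s.). Therefore (a.s.)
$$E(\liminf \xi_n \mid \mathscr{G}) = E(\zeta \mid \mathscr{G}) = \lim_n E(\zeta_n \mid \mathscr{G}) = \liminf E(\zeta_n \mid \mathscr{G}) \le \liminf E(\xi_n \mid \mathscr{G}).$$
Conclusion (e) follows from (d).
(f) If $\xi_n \ge 0$, by Property D* we have
$$E\Bigl(\sum_{k=1}^n \xi_k \Bigm| \mathscr{G}\Bigr) = \sum_{k=1}^n E(\xi_k \mid \mathscr{G}) \quad \text{(a.s.)}$$
which, with (b), establishes the required result.
This completes the proof of the theorem.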
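We can now establish Property K*. Let $\eta = I_B$, $B \in \mathscr{G}$. Then, for every
$A \in \mathscr{G}$,
$$\int_A \xi\eta\,dP = \int_{A \cap B} \xi\,dP = \int_{A \cap B} E(\xi \mid \mathscr{G})\,dP = \int_A I_B E(\xi \mid \mathscr{G})\,dP = \int_A \eta E(\xi \mid \mathscr{G})\,dP.$$
By the additivity of the Lebesgue integral, the equation
$$\int_A \xi\eta\,dP = \int_A \eta E(\xi \mid \mathscr{G})\,dP, \quad A \in \mathscr{G}, \tag{8}$$
remains valid for the simple random variables $\eta = \sum_{k=1}^n y_k I_{B_k}$, $B_k \in \mathscr{G}$.
Therefore, by Property I (§6), we have
$$E(\xi\eta \mid \mathscr{G}) = \eta E(\xi \mid \mathscr{G}) \quad \text{(a.s.)} \tag{9}$$
for these random variables.
Now let $\eta$ be any $\mathscr{G}$-measurable random variable with $E|\eta| < \infty$, and
let $\{\eta_n\}_{n \ge 1}$ be a sequence of simple $\mathscr{G}$-measurable random variables such that
$|\eta_n| \le |\eta|$ and $\eta_n \to \eta$. Then by (9)
$$E(\xi\eta_n \mid \mathscr{G}) = \eta_n E(\xi \mid \mathscr{G}) \quad \text{(a.s.)}.$$
It is clear that $|\xi\eta_n| \le |\xi\eta|$, where $E|\xi\eta| < \infty$. Therefore $E(\xi\eta_n \mid \mathscr{G}) \to
E(\xi\eta \mid \mathscr{G})$ (a.s.) by Property (a). In addition, since $E|\xi| < \infty$, we have $E(\xi \mid \mathscr{G})$
finite (a.s.) (see Property C* and Property J of §6). Therefore $\eta_n E(\xi \mid \mathscr{G}) \to
\eta E(\xi \mid \mathscr{G})$ (a.s.). (The hypothesis that $E(\xi \mid \mathscr{G})$ is finite, almost surely, is essential,
since, according to the footnote on p. 172, $0 \cdot \infty = 0$, but if $\eta_n = 1/n$, $\eta \equiv 0$,
we have $1/n \cdot \infty \not\to 0 \cdot \infty = 0$.)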


5. Here we consider the more detailed structure of conditional expectations
$E(\xi \mid \mathscr{G}_\eta)$, which we also denote, as usual, by $E(\xi \mid \eta)$.
Since $E(\xi \mid \eta)$ is a $\mathscr{G}_\eta$-measurable function, then by Theorem 3 of §4 (more
precisely, by its obvious modification for extended random variables) there
is a Borel function $m = m(y)$ from $R$ to $R$ such that
$$m(\eta(\omega)) = E(\xi \mid \eta)(\omega) \tag{10}$$
for all $\omega \in \Omega$. We denote this function $m(y)$ by $E(\xi \mid \eta = y)$ and call it the
conditional expectation of $\xi$ with respect to the event $\{\eta = y\}$, or the conditional
expectation of $\xi$ under the condition that $\eta = y$.
Correspondingly, by the definition of $E(\xi \mid \eta)$,
$$\int_A \xi\,dP = \int_A E(\xi \mid \eta)\,dP = \int_A m(\eta)\,dP, \quad A \in \mathscr{G}_\eta. \tag{11}$$
Therefore by Theorem 7 of §6 (on change of variable under the Lebesgue
integral sign)
$$\int_{\{\omega:\, \eta \in B\}} m(\eta)\,dP = \int_B m(y)\,P_\eta(dy), \quad B \in \mathscr{B}(R), \tag{12}$$
where $P_\eta$ is the probability distribution of $\eta$. Consequently $m = m(y)$ is a
Borel function such that
$$\int_{\{\omega:\, \eta \in B\}} \xi\,dP = \int_B m(y)\,dP_\eta \tag{13}$$
for every $B \in \mathscr{B}(R)$.
This remark shows that we can give a different definition of the conditional
expectation $E(\xi \mid \eta = y)$.

Definition 4. Let $\xi$ and $\eta$ be random variables (possibly, extended) and let
$E\xi$ be defined. The conditional expectation of the random variable $\xi$ under
the condition that $\eta = y$ is any $\mathscr{B}(R)$-measurable function $m = m(y)$ for
which
$$\int_{\{\omega:\, \eta \in B\}} \xi\,dP = \int_B m(y)\,P_\eta(dy), \quad B \in \mathscr{B}(R). \tag{14}$$
That such a function exists follows again from the Radon-Nikodym theorem
if we observe that the set function
$$Q(B) = \int_{\{\omega:\, \eta \in B\}} \xi\,dP$$
is a signed measure absolutely continuous with respect to the measure $P_\eta$.


Now suppose that $m(y)$ is a conditional expectation in the sense of Definition 4. Then if we again apply the theorem on change of variable under the
Lebesgue integral sign, we obtain
$$\int_{\{\omega:\, \eta \in B\}} \xi\,dP = \int_B m(y)\,P_\eta(dy) = \int_{\{\omega:\, \eta \in B\}} m(\eta)\,dP, \quad B \in \mathscr{B}(R).$$
The function $m(\eta)$ is $\mathscr{G}_\eta$-measurable, and the sets $\{\omega: \eta \in B\}$, $B \in \mathscr{B}(R)$,
exhaust the sets of $\mathscr{G}_\eta$.
Hence it follows that $m(\eta)$ is the expectation $E(\xi \mid \eta)$. Consequently if we
know $E(\xi \mid \eta = y)$ we can reconstruct $E(\xi \mid \eta)$, and conversely from $E(\xi \mid \eta)$ we
can find $E(\xi \mid \eta = y)$.
From an intuitive point of view, the conditional expectation $E(\xi \mid \eta = y)$
is simpler and more natural than $E(\xi \mid \eta)$. However, $E(\xi \mid \eta)$, considered as a
$\mathscr{G}_\eta$-measurable random variable, is more convenient to work with.
Observe that Properties A*-K* above and the conclusions of Theorem 2
can easily be transferred to $E(\xi \mid \eta = y)$ (replacing "almost surely" by
"$P_\eta$-almost surely"). Thus, for example, Property K* transforms as follows:
if $E|\xi| < \infty$ and $E|\xi f(\eta)| < \infty$, where $f = f(y)$ is a $\mathscr{B}(R)$-measurable function, then
$$E(\xi f(\eta) \mid \eta = y) = f(y)E(\xi \mid \eta = y) \quad (P_\eta\text{-a.s.}). \tag{15}$$
In addition (cf. Property J*), if $\xi$ and $\eta$ are independent, then
$$E(\xi \mid \eta = y) = E\xi \quad (P_\eta\text{-a.s.}).$$

We also observe that if $B \in \mathscr{B}(R^2)$ and $\xi$ and $\eta$ are independent, then
$$E[I_B(\xi, \eta) \mid \eta = y] = EI_B(\xi, y) \quad (P_\eta\text{-a.s.}), \tag{16}$$
and if $\varphi = \varphi(x, y)$ is a $\mathscr{B}(R^2)$-measurable function such that $E|\varphi(\xi, \eta)| < \infty$,
then
$$E[\varphi(\xi, \eta) \mid \eta = y] = E[\varphi(\xi, y)] \quad (P_\eta\text{-a.s.}).$$
To prove (16) we make the following observation. If $B = B_1 \times B_2$, the
validity of (16) will follow from
$$\int_{\{\omega:\, \eta \in A\}} I_{B_1 \times B_2}(\xi, \eta)\,P(d\omega) = \int_{\{y \in A\}} EI_{B_1 \times B_2}(\xi, y)\,P_\eta(dy).$$
But the left-hand side is $P\{\xi \in B_1, \eta \in A \cap B_2\}$, and the right-hand side is
$P(\xi \in B_1)P(\eta \in A \cap B_2)$; their equality follows from the independence of $\xi$
and $\eta$. In the general case the proof depends on an application of Theorem 1,
§2, on monotone classes (cf. the corresponding part of the proof of Fubini's
theorem).

Definition 5. The conditional probability of the event $A \in \mathscr{F}$ under the condition that $\eta = y$ (notation: $P(A \mid \eta = y)$) is $E(I_A \mid \eta = y)$.


It is clear that $P(A \mid \eta = y)$ can be defined as the $\mathscr{B}(R)$-measurable function
such that
$$P(A \cap \{\eta \in B\}) = \int_B P(A \mid \eta = y)\,P_\eta(dy), \quad B \in \mathscr{B}(R). \tag{17}$$

6. Let us calculate some examples of conditional probabilities and conditional expectations.

EXAMPLE 1. Let $\eta$ be a discrete random variable with $P(\eta = y_k) > 0$,
$\sum_{k=1}^\infty P(\eta = y_k) = 1$. Then
$$P(A \mid \eta = y_k) = \frac{P(A \cap \{\eta = y_k\})}{P(\eta = y_k)}, \quad k \ge 1.$$
For $y \notin \{y_1, y_2, \ldots\}$ the conditional probability $P(A \mid \eta = y)$ can be defined
in any way, for example as zero.
If $\xi$ is a random variable for which $E\xi$ exists, then
$$E(\xi \mid \eta = y_k) = \frac{1}{P(\eta = y_k)} \int_{\{\omega:\, \eta = y_k\}} \xi\,dP.$$
When $y \notin \{y_1, y_2, \ldots\}$ the conditional expectation $E(\xi \mid \eta = y)$ can be defined
in any way (for example, as zero).

EXAMPLE 2. Let $(\xi, \eta)$ be a pair of random variables whose distribution has a
density $f_{\xi\eta}(x, y)$:
$$P\{(\xi, \eta) \in B\} = \int_B f_{\xi\eta}(x, y)\,dx\,dy, \quad B \in \mathscr{B}(R^2).$$
Let $f_\xi(x)$ and $f_\eta(y)$ be the densities of the probability distributions of $\xi$ and $\eta$
(see (6.46), (6.55) and (6.56)).
Let us put
$$f_{\xi|\eta}(x \mid y) = \frac{f_{\xi\eta}(x, y)}{f_\eta(y)}, \tag{18}$$
taking $f_{\xi|\eta}(x \mid y) = 0$ if $f_\eta(y) = 0$.
Then
$$P(\xi \in C \mid \eta = y) = \int_C f_{\xi|\eta}(x \mid y)\,dx, \quad C \in \mathscr{B}(R), \tag{19}$$
i.e. $f_{\xi|\eta}(x \mid y)$ is the density of a conditional probability distribution.


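In fact, in order to prove (19) it is enough to verify (17) for $B \in \mathscr{B}(R)$,
$A = \{\xi \in C\}$. By (6.43), (6.45) and Fubini's theorem,
$$\int_B \Bigl[\int_C f_{\xi|\eta}(x \mid y)\,dx\Bigr] P_\eta(dy) = \int_B \Bigl[\int_C f_{\xi|\eta}(x \mid y)\,dx\Bigr] f_\eta(y)\,dy$$
$$= \int_{C \times B} f_{\xi|\eta}(x \mid y) f_\eta(y)\,dx\,dy = \int_{C \times B} f_{\xi\eta}(x, y)\,dx\,dy$$
$$= P\{(\xi, \eta) \in C \times B\} = P\{(\xi \in C) \cap (\eta \in B)\},$$
which proves (17).
In a similar way we can show that if $E\xi$ exists, then
$$E(\xi \mid \eta = y) = \int_{-\infty}^{\infty} x f_{\xi|\eta}(x \mid y)\,dx. \tag{20}$$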
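Formulas (18) and (20) can be evaluated numerically for a concrete density. The sketch below assumes the joint density $f(x, y) = x + y$ on the unit square, which is not taken from the text; the function names and the midpoint-rule integrator are likewise illustrative.

```python
def joint_density(x, y):
    """Assumed joint density of (xi, eta): f(x, y) = x + y on the unit square."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def integrate(func, lo, hi, steps=2000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def marginal_eta(y):
    """f_eta(y), the integral of f(x, y) over x."""
    return integrate(lambda x: joint_density(x, y), 0.0, 1.0)

def conditional_density(x, y):
    """Formula (18), with the value 0 where f_eta(y) = 0."""
    f_y = marginal_eta(y)
    return joint_density(x, y) / f_y if f_y > 0 else 0.0

def conditional_mean(y):
    """Formula (20): the integral of x f(x | y) over x."""
    f_y = marginal_eta(y)
    if f_y <= 0:
        return 0.0  # arbitrary choice where the conditional density vanishes
    return integrate(lambda x: x * joint_density(x, y), 0.0, 1.0) / f_y

mean_at_half = conditional_mean(0.5)
```

For this density the integrals can also be done by hand, giving $E(\xi \mid \eta = y) = (1/3 + y/2)/(1/2 + y)$ for $0 \le y \le 1$, so the numerical value at $y = 1/2$ can be compared with $7/12$.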
EXAMPLE 3. Let the length of time that a piece of apparatus will continue to
operate be described by a nonnegative random variable $\eta = \eta(\omega)$ whose
distribution $F_\eta(y)$ has a density $f_\eta(y)$ (naturally, $F_\eta(y) = f_\eta(y) = 0$ for $y < 0$).
Find the conditional expectation $E(\eta - a \mid \eta \ge a)$, i.e. the average time for
which the apparatus will continue to operate on the hypothesis that it has
already been operating for time $a$.
Let $P(\eta \ge a) > 0$. Then according to the definition (see Subsection 1) and
(6.45),
$$E(\eta - a \mid \eta \ge a) = \frac{E[(\eta - a)I_{\{\eta \ge a\}}]}{P(\eta \ge a)} = \frac{\int_\Omega (\eta - a)I_{\{\eta \ge a\}}\,P(d\omega)}{P(\eta \ge a)} = \frac{\int_a^\infty (y - a)f_\eta(y)\,dy}{\int_a^\infty f_\eta(y)\,dy}.$$
It is interesting to observe that if $\eta$ is exponentially distributed, i.e.
$$f_\eta(y) = \begin{cases} \lambda e^{-\lambda y}, & y \ge 0, \\ 0, & y < 0, \end{cases} \tag{21}$$
then $E\eta = E(\eta \mid \eta \ge 0) = 1/\lambda$ and $E(\eta - a \mid \eta \ge a) = 1/\lambda$ for every $a > 0$. In
other words, in this case the average time for which the apparatus continues
to operate, assuming that it has already operated for time $a$, is independent
of $a$ and simply equals the average time $E\eta$.
Under the assumption (21) we can find the conditional distribution
$P(\eta - a \le x \mid \eta \ge a)$.


We have
$$P(\eta - a \le x \mid \eta \ge a) = \frac{P(a \le \eta \le a + x)}{P(\eta \ge a)} = \frac{F_\eta(a + x) - F_\eta(a) + P(\eta = a)}{1 - F_\eta(a) + P(\eta = a)}$$
$$= \frac{[1 - e^{-\lambda(a + x)}] - [1 - e^{-\lambda a}]}{1 - [1 - e^{-\lambda a}]} = \frac{e^{-\lambda a}[1 - e^{-\lambda x}]}{e^{-\lambda a}} = 1 - e^{-\lambda x}.$$
Therefore the conditional distribution $P(\eta - a \le x \mid \eta \ge a)$ is the same
as the unconditional distribution $P(\eta \le x)$. This remarkable property
is unique to the exponential distribution: there are no other distributions
that have densities and possess the property $P(\eta - a \le x \mid \eta \ge a) = P(\eta \le x)$,
$a \ge 0$, $0 \le x < \infty$.
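The absence of aftereffect for the exponential distribution can also be observed in simulation. The following is a rough sketch, not a proof; the parameter values `lam = 2.0` and `a = 0.5`, the sample size, the seed, and the helper names are arbitrary assumptions made for the illustration.

```python
import math
import random

def residual_life_sample(lam, a, n, seed=0):
    """Draw eta ~ Exp(lam) n times; return eta - a for the draws with eta >= a."""
    rng = random.Random(seed)
    draws = (rng.expovariate(lam) for _ in range(n))
    return [y - a for y in draws if y >= a]

def empirical_cdf(sample, x):
    """Fraction of sample points that do not exceed x."""
    return sum(1 for v in sample if v <= x) / len(sample)

lam, a = 2.0, 0.5
residuals = residual_life_sample(lam, a, n=200_000)

# Empirical counterpart of E(eta - a | eta >= a); the text gives 1 / lam.
mean_residual = sum(residuals) / len(residuals)

# Empirical counterpart of P(eta - a <= x | eta >= a) at x = 0.5;
# the text gives 1 - exp(-lam * x).
cdf_at_half = empirical_cdf(residuals, 0.5)
```

Changing `a` should leave both quantities essentially unchanged, up to sampling error, which is the content of the example.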
EXAMPLE 4 (Buffon's needle). Suppose that we toss a needle of unit length
"at random" onto a pair of parallel straight lines, a unit distance apart, in
a plane. What is the probability that the needle will intersect at least one of the
lines?
To solve this problem we must first define what it means to toss the
needle "at random." Let $\xi$ be the distance from the midpoint of the needle to
the left-hand line. We shall suppose that $\xi$ is uniformly distributed on $[0, 1]$,
and (see Figure 29) that the angle $\theta$ is uniformly distributed on $[-\pi/2, \pi/2]$.
In addition, we shall assume that $\xi$ and $\theta$ are independent.
Let $A$ be the event that the needle intersects one of the lines. It is easy to
see that if
$$B = \bigl\{(\alpha, x): |\alpha| \le \tfrac{\pi}{2},\ x \in [0, \tfrac12 \cos\alpha] \cup [1 - \tfrac12 \cos\alpha, 1]\bigr\},$$
then $A = \{\omega: (\theta, \xi) \in B\}$, and therefore the probability in question is
$$P(A) = EI_A(\omega) = EI_B(\theta(\omega), \xi(\omega)).$$

Figure 29


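By Property G* and formula (16),
$$EI_B(\theta(\omega), \xi(\omega)) = E\bigl(E[I_B(\theta(\omega), \xi(\omega)) \mid \theta(\omega)]\bigr) = \int_\Omega E[I_B(\theta(\omega), \xi(\omega)) \mid \theta(\omega)]\,P(d\omega)$$
$$= \int_{-\pi/2}^{\pi/2} E[I_B(\theta(\omega), \xi(\omega)) \mid \theta(\omega) = \alpha]\,P_\theta(d\alpha) = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} EI_B(\alpha, \xi(\omega))\,d\alpha = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} \cos\alpha\,d\alpha = \frac{2}{\pi},$$
where we have used the fact that
$$EI_B(\alpha, \xi(\omega)) = P\{\xi \in [0, \tfrac12 \cos\alpha] \cup [1 - \tfrac12 \cos\alpha, 1]\} = \cos\alpha.$$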

Thus the probability that a "random" toss of the needle intersects one of
the lines is $2/\pi$. This result could be used as the basis for an experimental
evaluation of $\pi$. In fact, let the needle be tossed $N$ times independently.
Define $\xi_i$ to be 1 if the needle intersects a line on the $i$th toss, and 0 otherwise.
Then by the law of large numbers (see, for example, (I.5.6))
$$P\Bigl\{\Bigl|\frac{\xi_1 + \cdots + \xi_N}{N} - P(A)\Bigr| > \varepsilon\Bigr\} \to 0, \quad N \to \infty,$$
for every $\varepsilon > 0$.
In this sense the frequency satisfies
$$\frac{\xi_1 + \cdots + \xi_N}{N} \approx P(A) = \frac{2}{\pi}$$
and therefore
$$\frac{2N}{\xi_1 + \cdots + \xi_N} \approx \pi.$$
This formula has actually been used for a statistical evaluation of $\pi$. In
1850, R. Wolf (an astronomer in Zurich) threw a needle 5000 times and
obtained the value 3.1596 for $\pi$. Apparently this problem was one of the first
applications (now known as Monte Carlo methods) of probabilistic-statistical regularities to numerical analysis.
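The experiment can be imitated on a computer. The sketch below is an illustration under the model of the example (midpoint position uniform on $[0, 1]$, angle uniform on $[-\pi/2, \pi/2]$, the two independent); the function names, the number of tosses and the seed are arbitrary choices. Note that the program uses the value of $\pi$ to sample the angle, so it demonstrates the law of large numbers rather than an independent way of computing $\pi$.

```python
import math
import random

def needle_crosses(x, alpha):
    """Crossing criterion from the set B: x in [0, cos(alpha)/2] or [1 - cos(alpha)/2, 1]."""
    half = 0.5 * math.cos(alpha)
    return x <= half or x >= 1.0 - half

def estimate_pi(n_tosses, seed=0):
    """Return 2N / (xi_1 + ... + xi_N) for N simulated tosses."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tosses):
        x = rng.random()                                 # midpoint position, uniform on [0, 1)
        alpha = rng.uniform(-math.pi / 2, math.pi / 2)   # angle, uniform on [-pi/2, pi/2]
        hits += needle_crosses(x, alpha)
    return 2.0 * n_tosses / hits

pi_hat = estimate_pi(200_000)
```

With a few hundred thousand tosses the estimate typically agrees with $\pi$ to about two decimal places, which gives a feeling for how slowly such frequency estimates converge.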
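7. If $\{\xi_n\}_{n \ge 1}$ is a sequence of nonnegative random variables, then according
to conclusion (f) of Theorem 2,
$$E\Bigl(\sum \xi_n \Bigm| \mathscr{G}\Bigr) = \sum E(\xi_n \mid \mathscr{G}) \quad \text{(a.s.)}.$$
In particular, if $B_1, B_2, \ldots$ is a sequence of pairwise disjoint sets,
$$P\Bigl(\bigcup_n B_n \Bigm| \mathscr{G}\Bigr) = \sum_n P(B_n \mid \mathscr{G}) \quad \text{(a.s.)}. \tag{22}$$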


It must be emphasized that this equation is satisfied only almost surely
and that consequently the conditional probability $P(B \mid \mathscr{G})(\omega)$ cannot be
considered as a measure in $B$ for given $\omega$. One might suppose that, except
for a set $\mathscr{N}$ of measure zero, $P(\cdot \mid \mathscr{G})(\omega)$ would still be a measure for $\omega \notin \mathscr{N}$.
However, in general this is not the case, for the following reason. Let
$\mathscr{N}(B_1, B_2, \ldots)$ be the set of sample points $\omega$ such that the countable additivity
property (22) fails for these $B_1, B_2, \ldots$. Then the excluded set $\mathscr{N}$ is
$$\mathscr{N} = \bigcup \mathscr{N}(B_1, B_2, \ldots), \tag{23}$$
where the union is taken over all $B_1, B_2, \ldots$ in $\mathscr{F}$. Although the $P$-measure
of each set $\mathscr{N}(B_1, B_2, \ldots)$ is zero, the $P$-measure of $\mathscr{N}$ can be different from
zero (because of an uncountable union in (23)). (Recall that the Lebesgue
measure of a single point is zero, but the measure of the set $\mathscr{N} = [0, 1]$,
which is an uncountable sum of the individual points $\{x\}$, is 1.)
However, it would be convenient if the conditional probability $P(\cdot \mid \mathscr{G})(\omega)$
were a measure for each $\omega \in \Omega$, since then, for example, the calculation of
conditional expectations $E(\xi \mid \mathscr{G})$ could be carried out (see Theorem 3 below)
in a simple way by averaging with respect to the measure $P(\cdot \mid \mathscr{G})(\omega)$:
$$E(\xi \mid \mathscr{G}) = \int_\Omega \xi(\tilde\omega)\,P(d\tilde\omega \mid \mathscr{G}) \quad \text{(a.s.)}$$
(cf. (I.8.10)).
We introduce the following definition.
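Definition 6. A function $P(\omega; B)$, defined for all $\omega \in \Omega$ and $B \in \mathscr{F}$, is a regular
conditional probability with respect to $\mathscr{G}$ if

(a) $P(\omega; \cdot)$ is a probability measure on $\mathscr{F}$ for every $\omega \in \Omega$;
(b) for each $B \in \mathscr{F}$ the function $P(\omega; B)$, as a function of $\omega$, is a variant of the
conditional probability $P(B \mid \mathscr{G})(\omega)$, i.e. $P(\omega; B) = P(B \mid \mathscr{G})(\omega)$ (a.s.).

Theorem 3. Let $P(\omega; B)$ be a regular conditional probability with respect to
$\mathscr{G}$ and let $\xi$ be an integrable random variable. Then
$$E(\xi \mid \mathscr{G})(\omega) = \int_\Omega \xi(\tilde\omega)\,P(\omega; d\tilde\omega) \quad \text{(a.s.)}. \tag{24}$$
PROOF. If $\xi = I_B$, $B \in \mathscr{F}$, the required formula (24) becomes
$$P(B \mid \mathscr{G})(\omega) = P(\omega; B) \quad \text{(a.s.)},$$
which holds by Definition 6(b). Consequently (24) holds for simple functions.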


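Now let $\xi \ge 0$ and $\xi_n \uparrow \xi$, where $\xi_n$ are simple functions. Then by (b) of
Theorem 2 we have $E(\xi \mid \mathscr{G})(\omega) = \lim_n E(\xi_n \mid \mathscr{G})(\omega)$ (a.s.). But since $P(\omega; \cdot)$
is a measure for every $\omega \in \Omega$, we have
$$\lim_n E(\xi_n \mid \mathscr{G})(\omega) = \lim_n \int_\Omega \xi_n(\tilde\omega)\,P(\omega; d\tilde\omega) = \int_\Omega \xi(\tilde\omega)\,P(\omega; d\tilde\omega)$$
by the monotone convergence theorem.
The general case reduces to this one if we use the representation $\xi = \xi^+ - \xi^-$.
This completes the proof.

Corollary. Let $\mathscr{G} = \mathscr{G}_\eta$, where $\eta$ is a random variable, and let the pair $(\xi, \eta)$
have a probability distribution with density $f_{\xi\eta}(x, y)$. Let $E|g(\xi)| < \infty$. Then
$$E(g(\xi) \mid \eta = y) = \int_{-\infty}^{\infty} g(x) f_{\xi|\eta}(x \mid y)\,dx,$$
where $f_{\xi|\eta}(x \mid y)$ is the density of the conditional distribution (see (18)).

In order to be able to state the basic result on the existence of regular
conditional probabilities, we need the following definitions.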

Definition 7. Let $(E, \mathscr{E})$ be a measurable space, $X = X(\omega)$ a random element
with values in $E$, and $\mathscr{G}$ a $\sigma$-subalgebra of $\mathscr{F}$. A function $Q(\omega; B)$, defined
for $\omega \in \Omega$ and $B \in \mathscr{E}$, is a regular conditional distribution of $X$ with respect to
$\mathscr{G}$ if

(a) for each $\omega \in \Omega$ the function $Q(\omega; B)$ is a probability measure on $(E, \mathscr{E})$;
(b) for each $B \in \mathscr{E}$ the function $Q(\omega; B)$, as a function of $\omega$, is a variant of the
conditional probability $P(X \in B \mid \mathscr{G})(\omega)$, i.e.
$$Q(\omega; B) = P(X \in B \mid \mathscr{G})(\omega) \quad \text{(a.s.)}.$$

Definition 8. Let $\xi$ be a random variable. A function $F = F(\omega; x)$, $\omega \in \Omega$,
$x \in R$, is a regular distribution function for $\xi$ with respect to $\mathscr{G}$ if:

(a) $F(\omega; x)$ is, for each $\omega \in \Omega$, a distribution function on $R$;
(b) $F(\omega; x) = P(\xi \le x \mid \mathscr{G})(\omega)$ (a.s.), for each $x \in R$.

Theorem 4. A regular distribution function and a regular conditional distribution function always exist for the random variable $\xi$ with respect to $\mathscr{G}$.


PROOF. For each rational number $r \in R$, define $F_r(\omega) = P(\xi \le r \mid \mathscr{G})(\omega)$,
where $P(\xi \le r \mid \mathscr{G})(\omega) = E(I_{\{\xi \le r\}} \mid \mathscr{G})(\omega)$ is any variant of the conditional
probability, with respect to $\mathscr{G}$, of the event $\{\xi \le r\}$. Let $\{r_i\}$ be the set of
rational numbers in $R$. If $r_i < r_j$, Property B* implies that $P(\xi \le r_i \mid \mathscr{G}) \le
P(\xi \le r_j \mid \mathscr{G})$ (a.s.), and therefore if $A_{ij} = \{\omega: F_{r_j}(\omega) < F_{r_i}(\omega)\}$, $A = \bigcup A_{ij}$,
we have $P(A) = 0$. In other words, the set of points $\omega$ at which the distribution function $F_r(\omega)$, $r \in \{r_i\}$, fails to be monotonic has measure zero.
Now let
$$B_i = \Bigl\{\omega: \lim_{n \to \infty} F_{r_i + (1/n)}(\omega) \ne F_{r_i}(\omega)\Bigr\}, \quad B = \bigcup_{i=1}^\infty B_i.$$
It is clear that $I_{\{\xi \le r_i + (1/n)\}} \downarrow I_{\{\xi \le r_i\}}$, $n \to \infty$. Therefore, by (a) of Theorem 2,
$F_{r_i + (1/n)}(\omega) \to F_{r_i}(\omega)$ (a.s.), and therefore the set $B$ on which continuity on the
right fails (with respect to the rational numbers) also has measure zero,
$P(B) = 0$.
In addition, let
$$C = \Bigl\{\omega: \lim_{n \to \infty} F_n(\omega) \ne 1\Bigr\} \cup \Bigl\{\omega: \lim_{n \to -\infty} F_n(\omega) > 0\Bigr\}.$$
Then, since $\{\xi \le n\} \uparrow \Omega$, $n \to \infty$, and $\{\xi \le n\} \downarrow \varnothing$, $n \to -\infty$, we have
$P(C) = 0$.
Now put
$$F(\omega; x) = \begin{cases} \lim_{r \downarrow x} F_r(\omega), & \omega \notin A \cup B \cup C, \\ G(x), & \omega \in A \cup B \cup C, \end{cases}$$
where $G(x)$ is any distribution function on $R$; we show that $F(\omega; x)$ satisfies
the conditions of Definition 8.
Let $\omega \notin A \cup B \cup C$. Then it is clear that $F(\omega; x)$ is a nondecreasing function of $x$. If $x < x' \le r$, then $F(\omega; x) \le F(\omega; x') \le F(\omega; r) = F_r(\omega) \downarrow F(\omega; x)$
when $r \downarrow x$. Consequently $F(\omega; x)$ is continuous on the right. Similarly
$\lim_{x \to \infty} F(\omega; x) = 1$, $\lim_{x \to -\infty} F(\omega; x) = 0$. Since $F(\omega; x) = G(x)$ when
$\omega \in A \cup B \cup C$, it follows that $F(\omega; x)$ is a distribution function on $R$ for
every $\omega \in \Omega$, i.e. condition (a) of Definition 8 is satisfied.
By construction, $P(\xi \le r \mid \mathscr{G})(\omega) = F_r(\omega) = F(\omega; r)$. If $r \downarrow x$, we have
$F(\omega; r) \downarrow F(\omega; x)$ for all $\omega \in \Omega$ by the continuity on the right that we just
established. But by conclusion (a) of Theorem 2, we have $P(\xi \le r \mid \mathscr{G})(\omega) \to
P(\xi \le x \mid \mathscr{G})(\omega)$ (a.s.). Therefore $F(\omega; x) = P(\xi \le x \mid \mathscr{G})(\omega)$ (a.s.), which
establishes condition (b) of Definition 8.
We now turn to the proof of the existence of a regular conditional distribution of $\xi$ with respect to $\mathscr{G}$. Let $F(\omega; x)$ be the function constructed above. Put
$$Q(\omega; B) = \int_B F(\omega; dx),$$


where the integral is a Lebesgue-Stieltjes integral. From the properties of
the integral (see §6, Subsection 7), it follows that $Q(\omega; B)$ is a measure in $B$
for each given $\omega \in \Omega$. To establish that $Q(\omega; B)$ is a variant of the conditional
probability $P(\xi \in B \mid \mathscr{G})(\omega)$, we use the principle of appropriate sets.
Let $\mathscr{C}$ be the collection of sets $B$ in $\mathscr{B}(R)$ for which $Q(\omega; B) =
P(\xi \in B \mid \mathscr{G})(\omega)$ (a.s.). Since $F(\omega; x) = P(\xi \le x \mid \mathscr{G})(\omega)$ (a.s.), the system $\mathscr{C}$
contains the sets $B$ of the form $B = (-\infty, x]$, $x \in R$. Therefore $\mathscr{C}$ also
contains the intervals of the form $(a, b]$, and the algebra $\mathscr{A}$ consisting of finite
sums of disjoint sets of the form $(a, b]$. Then it follows from the continuity
properties of $Q(\omega; B)$ ($\omega$ fixed) and from conclusion (b) of Theorem 2 that $\mathscr{C}$
is a monotone class, and since $\mathscr{A} \subseteq \mathscr{C} \subseteq \mathscr{B}(R)$, we have, from Theorem 1
of §2,
$$\mathscr{B}(R) = \sigma(\mathscr{A}) \subseteq \sigma(\mathscr{C}) = \mu(\mathscr{C}) = \mathscr{C} \subseteq \mathscr{B}(R),$$
whence $\mathscr{C} = \mathscr{B}(R)$.
This completes the proof of the theorem.
By using topological considerations we can extend the conclusion of
Theorem 4 on the existence of a regular conditional distribution to random
elements with values in what are known as Borel spaces. We need the following definition.

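Definition 9. A measurable space $(E, \mathscr{E})$ is a Borel space if it is Borel equivalent
to a Borel subset of the real line, i.e. there is a one-to-one mapping $\varphi = \varphi(e)$:
$(E, \mathscr{E}) \to (R, \mathscr{B}(R))$ such that

(1) $\varphi(E) \equiv \{\varphi(e): e \in E\}$ is a set in $\mathscr{B}(R)$;
(2) $\varphi$ is $\mathscr{E}$-measurable ($\varphi^{-1}(A) \in \mathscr{E}$, $A \in \varphi(E) \cap \mathscr{B}(R)$);
(3) $\varphi^{-1}$ is $\mathscr{B}(R)/\mathscr{E}$-measurable ($\varphi(B) \in \varphi(E) \cap \mathscr{B}(R)$, $B \in \mathscr{E}$).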

Theorem 5. Let $X = X(\omega)$ be a random element with values in the Borel space
$(E, \mathscr{E})$. Then there is a regular conditional distribution of $X$ with respect to $\mathscr{G}$.

PROOF. Let $\varphi = \varphi(e)$ be the function in Definition 9. By (2), $\varphi(X(\omega))$ is a
random variable. Hence, by Theorem 4, we can define the conditional
distribution $Q(\omega; A)$ of $\varphi(X(\omega))$ with respect to $\mathscr{G}$, $A \in \varphi(E) \cap \mathscr{B}(R)$.
We introduce the function $\tilde{Q}(\omega; B) = Q(\omega; \varphi(B))$, $B \in \mathscr{E}$. By (3) of
Definition 9, $\varphi(B) \in \varphi(E) \cap \mathscr{B}(R)$ and consequently $\tilde{Q}(\omega; B)$ is defined.
Evidently $\tilde{Q}(\omega; B)$ is a measure in $B \in \mathscr{E}$ for every $\omega$. Now fix $B \in \mathscr{E}$. By the
one-to-one character of the mapping $\varphi = \varphi(e)$,
$$\tilde{Q}(\omega; B) = Q(\omega; \varphi(B)) = P\{\varphi(X) \in \varphi(B) \mid \mathscr{G}\} = P\{X \in B \mid \mathscr{G}\} \quad \text{(a.s.)}.$$
Therefore $\tilde{Q}(\omega; B)$ is a regular conditional distribution of $X$ with respect
to $\mathscr{G}$.
This completes the proof of the theorem.


Corollary. Let $X = X(\omega)$ be a random element with values in a complete separable metric space $(E, \mathscr{E})$. Then there is a regular conditional distribution of $X$
with respect to $\mathscr{G}$. In particular, such a distribution exists for the spaces
$(R^n, \mathscr{B}(R^n))$ and $(R^\infty, \mathscr{B}(R^\infty))$.
The proof follows from Theorem 5 and the well known topological result
that such spaces are Borel spaces.

8. The theory of conditional expectations developed above makes it possible
to give a generalization of Bayes's theorem; this has applications in statistics.
Recall that if $\mathscr{D} = \{A_1, \ldots, A_n\}$ is a partition of the space $\Omega$ with
$P(A_i) > 0$, Bayes's theorem (I.3.9) states that
$$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{\sum_{j=1}^n P(A_j)P(B \mid A_j)} \tag{25}$$
for every $B$ with $P(B) > 0$. Therefore if $\theta = \sum_{i=1}^n a_i I_{A_i}$ is a discrete random
variable then, according to (I.8.10),
$$E[g(\theta) \mid B] = \frac{\sum_{i=1}^n g(a_i)P(A_i)P(B \mid A_i)}{\sum_{j=1}^n P(A_j)P(B \mid A_j)}, \tag{26}$$
or
$$E[g(\theta) \mid B] = \frac{\int_{-\infty}^{\infty} g(a)P(B \mid \theta = a)\,P_\theta(da)}{\int_{-\infty}^{\infty} P(B \mid \theta = a)\,P_\theta(da)}. \tag{27}$$
On the basis of the definition of $E[g(\theta) \mid B]$ given at the beginning of this
section, it is easy to establish that (27) holds for all events $B$ with $P(B) > 0$,
random variables $\theta$ and functions $g = g(a)$ with $E|g(\theta)| < \infty$.
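For a discrete $\theta$ the ratio in (26) is a finite weighted average and can be computed directly. The following sketch is only an illustration; the function `posterior_expectation` and the coin example (three candidate biases, the event "two heads in two tosses") are assumptions and do not appear in the text.

```python
from fractions import Fraction

def posterior_expectation(g, values, prior, likelihood):
    """Formula (26): E[g(theta) | B] for theta equal to a_i on A_i.

    values:     the a_i
    prior:      the probabilities P(A_i)
    likelihood: the conditional probabilities P(B | A_i)
    """
    weights = [p * l for p, l in zip(prior, likelihood)]   # P(A_i) P(B | A_i)
    total = sum(weights)                                   # denominator of (26), equal to P(B)
    return sum(g(a) * w for a, w in zip(values, weights)) / total

# Assumed example: theta is the unknown probability of heads of a coin,
# B is the event that the first two tosses are both heads.
values = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3
likelihood = [a * a for a in values]

post_mean = posterior_expectation(lambda a: a, values, prior, likelihood)
```

Taking $g \equiv 1$ returns 1, which is a quick sanity check that the weights in the numerator and denominator of (26) match.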
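We now consider an analog of (27) for conditional expectations
$E[g(\theta) \mid \mathscr{G}]$ with respect to a $\sigma$-algebra $\mathscr{G}$, $\mathscr{G} \subseteq \mathscr{F}$.
Let
$$Q(B) = \int_B g(\theta)\,P(d\omega), \quad B \in \mathscr{G}. \tag{28}$$
Then by (4)
$$E[g(\theta) \mid \mathscr{G}] = \frac{dQ}{dP}(\omega). \tag{29}$$
We also consider the $\sigma$-algebra $\mathscr{G}_\theta$. Then, by (5),
$$P(B) = \int_\Omega P(B \mid \mathscr{G}_\theta)\,dP \tag{30}$$
or, by the formula for change of variable in Lebesgue integrals,
$$P(B) = \int_{-\infty}^{\infty} P(B \mid \theta = a)\,P_\theta(da). \tag{31}$$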


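Since
$$Q(B) = E[g(\theta)I_B] = E[g(\theta) \cdot E(I_B \mid \mathscr{G}_\theta)],$$
we have
$$Q(B) = \int_{-\infty}^{\infty} g(a)P(B \mid \theta = a)\,P_\theta(da). \tag{32}$$
Now suppose that the conditional probability $P(B \mid \theta = a)$ is regular and
admits the representation
$$P(B \mid \theta = a) = \int_B \rho(\omega; a)\,\lambda(d\omega), \tag{33}$$
where $\rho = \rho(\omega; a)$ is nonnegative and measurable in the two variables
jointly, and $\lambda$ is a $\sigma$-finite measure on $(\Omega, \mathscr{G})$.
Let $E|g(\theta)| < \infty$. Let us show that ($P$-a.s.)
$$E[g(\theta) \mid \mathscr{G}] = \frac{\int_{-\infty}^{\infty} g(a)\rho(\omega; a)\,P_\theta(da)}{\int_{-\infty}^{\infty} \rho(\omega; a)\,P_\theta(da)} \tag{34}$$
(generalized Bayes theorem).
In proving (34) we shall need the following lemma.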

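Lemma. Let $(\Omega, \mathscr{F})$ be a measurable space.

(a) Let $\mu$ and $\lambda$ be $\sigma$-finite measures, $\mu \ll \lambda$, and $f = f(\omega)$ an $\mathscr{F}$-measurable function.
Then
$$\int_\Omega f\,d\mu = \int_\Omega f\,\frac{d\mu}{d\lambda}\,d\lambda \tag{35}$$
(in the sense that if either integral exists, the other exists and they are equal).
(b) If $\nu$ is a signed measure and $\mu$, $\lambda$ are $\sigma$-finite measures, $\nu \ll \mu$, $\mu \ll \lambda$, then
$$\frac{d\nu}{d\lambda} = \frac{d\nu}{d\mu} \cdot \frac{d\mu}{d\lambda} \quad (\lambda\text{-a.s.}) \tag{36}$$
and
$$\frac{d\nu}{d\mu} = \frac{d\nu}{d\lambda} \Bigm/ \frac{d\mu}{d\lambda} \quad (\mu\text{-a.s.}). \tag{37}$$
PROOF. (a) Since
$$\mu(A) = \int_A \frac{d\mu}{d\lambda}\,d\lambda, \quad A \in \mathscr{F},$$
(35) is evidently satisfied for simple functions $f = \sum f_i I_{A_i}$. The general case
follows from the representation $f = f^+ - f^-$ and the monotone convergence theorem.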


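(b) From (a) with $f = d\nu/d\mu$ we obtain
$$\nu(A) = \int_A \frac{d\nu}{d\mu}\,d\mu = \int_A \frac{d\nu}{d\mu} \cdot \frac{d\mu}{d\lambda}\,d\lambda.$$
Then $\nu \ll \lambda$ and therefore
$$\nu(A) = \int_A \frac{d\nu}{d\lambda}\,d\lambda,$$
whence (36) follows since $A$ is arbitrary, by Property I (§6).
Property (37) follows from (36) and the remark that
$$\mu\Bigl\{\omega: \frac{d\mu}{d\lambda} = 0\Bigr\} = \int_{\{\omega:\, d\mu/d\lambda = 0\}} \frac{d\mu}{d\lambda}\,d\lambda = 0$$
(on the set $\{\omega: d\mu/d\lambda = 0\}$ the right-hand side of (37) can be defined arbitrarily, for example as zero). This completes the proof of the lemma.
To prove (34) we observe that by Fubini's theorem and (33),
$$Q(B) = \int_B \Bigl[\int_{-\infty}^{\infty} g(a)\rho(\omega; a)\,P_\theta(da)\Bigr] \lambda(d\omega), \tag{38}$$
$$P(B) = \int_B \Bigl[\int_{-\infty}^{\infty} \rho(\omega; a)\,P_\theta(da)\Bigr] \lambda(d\omega). \tag{39}$$
Then by the lemma
$$\frac{dQ}{dP} = \frac{dQ/d\lambda}{dP/d\lambda} \quad (P\text{-a.s.}).$$
Taking account of (38), (39) and (29), we have (34).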

Remark. Formula (34) remains valid if we replace $\theta$ by a random element
with values in some measurable space $(E, \mathscr{E})$ (and replace integration over
$R$ by integration over $E$).
Let us consider some special cases of (34).
Let the $\sigma$-algebra $\mathscr{G}$ be generated by the random variable $\xi$, $\mathscr{G} = \mathscr{G}_\xi$.
Suppose that
$$P(\xi \in A \mid \theta = a) = \int_A q(x; a)\,\lambda(dx), \quad A \in \mathscr{B}(R), \tag{40}$$
where $q = q(x; a)$ is a nonnegative function, measurable with respect to both
variables jointly, and $\lambda$ is a $\sigma$-finite measure on $(R, \mathscr{B}(R))$. Then we obtain
$$E[g(\theta) \mid \xi = x] = \frac{\int_{-\infty}^{\infty} g(a)q(x; a)\,P_\theta(da)}{\int_{-\infty}^{\infty} q(x; a)\,P_\theta(da)}. \tag{41}$$
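When $\lambda$ is Lebesgue measure and $P_\theta$ has a density, the two integrals in (41) can be approximated by quadrature. The sketch below assumes a specific model that is not part of the text: $\theta \sim \mathscr{N}(0, 1)$ and, given $\theta = a$, $\xi \sim \mathscr{N}(a, 1)$. The function names, the integration range and the step count are also illustrative choices.

```python
import math

def normal_pdf(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo, hi, steps=4000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def posterior_expectation(g, x, lo=-12.0, hi=12.0):
    """Ratio of integrals from (41), truncated to [lo, hi].

    Assumed model: q(x; a) is the N(a, 1) density in x, and
    P_theta(da) = f_theta(a) da with f_theta the N(0, 1) density.
    """
    q = lambda a: normal_pdf(x, a, 1.0)
    prior = lambda a: normal_pdf(a, 0.0, 1.0)
    numerator = integrate(lambda a: g(a) * q(a) * prior(a), lo, hi)
    denominator = integrate(lambda a: q(a) * prior(a), lo, hi)
    return numerator / denominator

post_mean = posterior_expectation(lambda a: a, x=1.0)
post_second_moment = posterior_expectation(lambda a: a * a, x=1.0)
```

For this particular model the conditional distribution of $\theta$ given $\xi = x$ is known in closed form to be $\mathscr{N}(x/2, 1/2)$, which gives values to compare the quadrature against.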


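In particular, let $(\theta, \xi)$ be a pair of discrete random variables, $\theta = \sum a_i I_{A_i}$,
$\xi = \sum x_j I_{B_j}$. Then, taking $\lambda$ to be the counting measure ($\lambda(\{x_i\}) = 1$, $i = 1, 2, \ldots$)
we find from (40) that
$$E[g(\theta) \mid \xi = x_j] = \frac{\sum_i g(a_i)P(\xi = x_j \mid \theta = a_i)P(\theta = a_i)}{\sum_i P(\xi = x_j \mid \theta = a_i)P(\theta = a_i)}. \tag{42}$$
(Compare (26).)
Now let $(\theta, \xi)$ be a pair of absolutely continuous random variables with density
$f_{\theta,\xi}(a, x)$. Then by (19) the representation (40) applies with $q(x; a) =
f_{\xi|\theta}(x \mid a)$ and Lebesgue measure $\lambda$. Therefore
$$E[g(\theta) \mid \xi = x] = \frac{\int_{-\infty}^{\infty} g(a)f_{\xi|\theta}(x \mid a)f_\theta(a)\,da}{\int_{-\infty}^{\infty} f_{\xi|\theta}(x \mid a)f_\theta(a)\,da}. \tag{43}$$

9. PROBLEMS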

1. Let $\xi$ and $\eta$ be independent identically distributed random variables with $E\xi$ defined.
Show that
$$E(\xi \mid \xi + \eta) = E(\eta \mid \xi + \eta) = \frac{\xi + \eta}{2} \quad \text{(a.s.)}.$$

2. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables with
$E|\xi_i| < \infty$. Show that
$$E(\xi_1 \mid S_n, S_{n+1}, \ldots) = \frac{S_n}{n} \quad \text{(a.s.)},$$
where $S_n = \xi_1 + \cdots + \xi_n$.

3. Suppose that the random elements $(X, Y)$ are such that there is a regular distribution
$P_x(B) = P(Y \in B \mid X = x)$. Show that if $E|g(X, Y)| < \infty$ then
$$E[g(X, Y) \mid X = x] = \int g(x, y)\,P_x(dy) \quad (P_X\text{-a.s.}).$$

4. Let $\xi$ be a random variable with distribution function $F_\xi(x)$. Show that
$$E(\xi \mid a < \xi \le b) = \frac{\int_a^b x\,dF_\xi(x)}{F_\xi(b) - F_\xi(a)}$$
(assuming that $F_\xi(b) - F_\xi(a) > 0$).

5. Let $g = g(x)$ be a convex Borel function with $E|g(\xi)| < \infty$. Show that Jensen's
inequality
$$g(E(\xi \mid \mathscr{G})) \le E(g(\xi) \mid \mathscr{G})$$
holds for the conditional expectations.


6. Show that a necessary and sufficient condition for the random variable $\xi$ and the
$\sigma$-algebra $\mathscr{G}$ to be independent (i.e., the random variables $\xi$ and $I_B(\omega)$ are independent for every $B \in \mathscr{G}$) is that $E(g(\xi) \mid \mathscr{G}) = Eg(\xi)$ for every Borel function $g(x)$ with
$E|g(\xi)| < \infty$.

7. Let $\xi$ be a nonnegative random variable and $\mathscr{G}$ a $\sigma$-algebra, $\mathscr{G} \subseteq \mathscr{F}$. Show that
$E(\xi \mid \mathscr{G}) < \infty$ (a.s.) if and only if the measure $Q$, defined on sets $A \in \mathscr{G}$ by $Q(A) =
\int_A \xi\,dP$, is $\sigma$-finite.

8. Random Variables. II
1. In the first chapter we introduced characteristics of simple random
variables, such as the variance, covariance, and correlation coefficient. These
extend similarly to the general case. Let $(\Omega, \mathscr{F}, P)$ be a probability space and
$\xi = \xi(\omega)$ a random variable for which $E\xi$ is defined.
The variance of $\xi$ is
$$V\xi = E(\xi - E\xi)^2.$$
The number $\sigma = +\sqrt{V\xi}$ is the standard deviation.
If $\xi$ is a random variable with a Gaussian (normal) density
$$f_\xi(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - m)^2 / 2\sigma^2}, \quad \sigma > 0, \quad -\infty < m < \infty, \tag{1}$$
the meaning of the parameters $m$ and $\sigma$ in (1) is very simple:
$$m = E\xi, \quad \sigma^2 = V\xi.$$
Hence the probability distribution of this random variable $\xi$, which we call
Gaussian, or normally distributed, is completely determined by its mean
value $m$ and variance $\sigma^2$. (It is often convenient to write $\xi \sim \mathscr{N}(m, \sigma^2)$.)
Now let $(\xi, \eta)$ be a pair of random variables. Their covariance is
$$\operatorname{cov}(\xi, \eta) = E(\xi - E\xi)(\eta - E\eta) \tag{2}$$
(assuming that the expectations are defined).
If $\operatorname{cov}(\xi, \eta) = 0$ we say that $\xi$ and $\eta$ are uncorrelated.
If $V\xi > 0$ and $V\eta > 0$, the number
$$\rho(\xi, \eta) \equiv \frac{\operatorname{cov}(\xi, \eta)}{\sqrt{V\xi \cdot V\eta}} \tag{3}$$
is the correlation coefficient of $\xi$ and $\eta$.
The properties of variance, covariance, and correlation coefficient were
investigated in §4 of Chapter I for simple random variables. In the general
case these properties can be stated in a completely analogous way.


Let $\xi = (\xi_1, \ldots, \xi_n)$ be a random vector whose components have finite
second moments. The covariance matrix of $\xi$ is the $n \times n$ matrix $\mathbb{R} = \|R_{ij}\|$,
where $R_{ij} = \operatorname{cov}(\xi_i, \xi_j)$. It is clear that $\mathbb{R}$ is symmetric. Moreover, it is nonnegative definite, i.e.
$$\sum_{i,j=1}^n R_{ij}\lambda_i\lambda_j \ge 0$$
for all $\lambda_i \in R$, $i = 1, \ldots, n$, since
$$\sum_{i,j=1}^n R_{ij}\lambda_i\lambda_j = E\Bigl[\sum_{i=1}^n (\xi_i - E\xi_i)\lambda_i\Bigr]^2 \ge 0.$$
The following lemma shows that the converse is also true.

Lemma. A necessary and sufficient condition that an $n \times n$ matrix $\mathbb{R}$ is the
covariance matrix of a vector $\xi = (\xi_1, \ldots, \xi_n)$ is that the matrix is symmetric
and nonnegative definite, or, equivalently, that there is an $n \times k$ matrix $A$
($1 \le k \le n$) such that
$$\mathbb{R} = AA^T,$$
where $T$ denotes the transpose.

PROOF. We showed above that every covariance matrix is symmetric and
nonnegative definite.
Conversely, let $\mathbb{R}$ be a matrix with these properties. We know from matrix
theory that corresponding to every symmetric nonnegative definite matrix $\mathbb{R}$
there is an orthogonal matrix $\mathscr{O}$ (i.e., $\mathscr{O}\mathscr{O}^T = E$, the unit matrix) such that
$$\mathscr{O}^T \mathbb{R} \mathscr{O} = D,$$
where
$$D = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix}$$
is a diagonal matrix with nonnegative elements $d_i$, $i = 1, \ldots, n$.
It follows that
$$\mathbb{R} = \mathscr{O}D\mathscr{O}^T = (\mathscr{O}B)(B^T\mathscr{O}^T),$$
where $B$ is the diagonal matrix with elements $b_i = +\sqrt{d_i}$, $i = 1, \ldots, n$.
Consequently if we put $A = \mathscr{O}B$ we have the required representation
$\mathbb{R} = AA^T$ for $\mathbb{R}$.
It is clear that every matrix $AA^T$ is symmetric and nonnegative definite.
Consequently we have only to show that $\mathbb{R}$ is the covariance matrix of some
random vector.
Let $\eta_1, \eta_2, \ldots, \eta_n$ be a sequence of independent normally distributed random variables, $\mathcal{N}(0, 1)$. (The existence of such a sequence follows, for example, from Corollary 1 of Theorem 1, §9, and in principle could easily be derived from Theorem 2 of §3.) Then the random vector $\xi = A\eta$ (vectors are thought of as column vectors) has the required properties. In fact,

$$E\xi\xi^{T} = E(A\eta)(A\eta)^{T} = A \cdot E\eta\eta^{T} \cdot A^{T} = AEA^{T} = AA^{T}.$$

(If $\zeta = \|\zeta_{ij}\|$ is a matrix whose elements are random variables, $E\zeta$ means the matrix $\|E\zeta_{ij}\|$.)
This completes the proof of the lemma.
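
The factorization in the lemma can be checked numerically. The sketch below is not the construction used in the proof: instead of the orthogonal diagonalization it uses a Cholesky factorization, which produces a different (lower-triangular) factor $A$ and assumes that the matrix is strictly positive definite. The matrix `R` and all function names are hypothetical and chosen only for illustration.

```python
import math

def cholesky(R):
    """Lower-triangular A with A A^T = R, for symmetric positive definite R."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                # math.sqrt raises ValueError if R is not positive definite
                A[i][j] = math.sqrt(R[i][i] - s)
            else:
                A[i][j] = (R[i][j] - s) / A[j][j]
    return A

def a_times_a_transpose(A):
    """Return the matrix product A A^T for a square matrix A."""
    n = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def quadratic_form(R, lam):
    """Return sum_{i,j} R_ij * lam_i * lam_j."""
    n = len(R)
    return sum(R[i][j] * lam[i] * lam[j] for i in range(n) for j in range(n))

# Hypothetical covariance matrix, used only as an example.
R = [[4.0, 2.0, 0.6],
     [2.0, 2.0, 0.5],
     [0.6, 0.5, 1.0]]
A = cholesky(R)
R_rebuilt = a_times_a_transpose(A)
```

If $\eta$ is a vector of independent $\mathcal{N}(0, 1)$ variables, then $\xi = A\eta$ has covariance matrix `R_rebuilt`, by the same computation as in the proof.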

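We now turn our attention to the two-dimensional Gaussian (normal) density

$$f_{\xi\eta}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\frac{(x - m_1)^2}{\sigma_1^2} - 2\rho\frac{(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \frac{(y - m_2)^2}{\sigma_2^2}\right]\right\}, \tag{4}$$

characterized by the five parameters $m_1$, $m_2$, $\sigma_1$, $\sigma_2$ and $\rho$ (cf. (3.14)), where $|m_1| < \infty$, $|m_2| < \infty$, $\sigma_1 > 0$, $\sigma_2 > 0$, $|\rho| < 1$. An easy calculation identifies these parameters:

$$m_1 = E\xi, \quad m_2 = E\eta, \quad \sigma_1^2 = V\xi, \quad \sigma_2^2 = V\eta, \quad \rho = \rho(\xi, \eta).$$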

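In §4 of Chapter I we explained that if $\xi$ and $\eta$ are uncorrelated ($\rho(\xi, \eta) = 0$), it does not follow that they are independent. However, if the pair $(\xi, \eta)$ is Gaussian, it does follow that if $\xi$ and $\eta$ are uncorrelated then they are independent.
In fact, if $\rho = 0$ in (4), then

$$f_{\xi\eta}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-(x - m_1)^2/2\sigma_1^2}\, e^{-(y - m_2)^2/2\sigma_2^2}.$$

But by (6.55) and (4),

$$f_\xi(x) = \int_{-\infty}^{\infty} f_{\xi\eta}(x, y)\,dy = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(x - m_1)^2/2\sigma_1^2}, \qquad f_\eta(y) = \int_{-\infty}^{\infty} f_{\xi\eta}(x, y)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(y - m_2)^2/2\sigma_2^2}.$$

Consequently

$$f_{\xi\eta}(x, y) = f_\xi(x) \cdot f_\eta(y),$$

from which it follows that $\xi$ and $\eta$ are independent (see the end of Subsection 8 of §6).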
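
As a numerical illustration of this factorization, the following sketch evaluates the standard two-dimensional normal density and compares it with the product of the two one-dimensional normal densities. The function names and the sample parameter values are assumptions made only for this example.

```python
import math

def normal_pdf(x, m, s):
    """One-dimensional normal density with mean m and standard deviation s."""
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def bivariate_pdf(x, y, m1, m2, s1, s2, rho):
    """Two-dimensional normal density with correlation coefficient rho, |rho| < 1."""
    q = ((x - m1) ** 2 / s1 ** 2
         - 2 * rho * (x - m1) * (y - m2) / (s1 * s2)
         + (y - m2) ** 2 / s2 ** 2)
    norm = 2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2)
    return math.exp(-q / (2 * (1 - rho ** 2))) / norm

# Hypothetical parameters: m1 = 0, m2 = 1, sigma1 = 1, sigma2 = 2.
joint_uncorrelated = bivariate_pdf(1.0, 2.0, 0.0, 1.0, 1.0, 2.0, 0.0)
joint_correlated = bivariate_pdf(1.0, 2.0, 0.0, 1.0, 1.0, 2.0, 0.5)
product_of_marginals = normal_pdf(1.0, 0.0, 1.0) * normal_pdf(2.0, 1.0, 2.0)
```

With `rho = 0` the joint density agrees with the product of the marginals up to rounding; with `rho = 0.5` it does not.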


2. A striking example of the utility of the concept of conditional expectation (introduced in §7) is its application to the solution of the following problem which is connected with estimation theory (cf. Subsection 8 of §4 of Chapter I).
Let $(\xi, \eta)$ be a pair of random variables such that $\xi$ is observable but $\eta$ is not. We ask how the unobservable component $\eta$ can be "estimated" from the knowledge of observations of $\xi$.
To state the problem more precisely, we need to define the concept of an estimator. Let $\varphi = \varphi(x)$ be a Borel function. We call the random variable $\varphi(\xi)$ an estimator of $\eta$ in terms of $\xi$, and $E[\eta - \varphi(\xi)]^2$ the (mean square) error of this estimator. An estimator $\varphi^*(\xi)$ is called optimal (in the mean-square sense) if

$$\Delta \equiv E[\eta - \varphi^*(\xi)]^2 = \inf_{\varphi} E[\eta - \varphi(\xi)]^2, \tag{5}$$

where inf is taken over all Borel functions $\varphi = \varphi(x)$.


Theorem 1. Let $E\eta^2 < \infty$. Then there is an optimal estimator $\varphi^* = \varphi^*(\xi)$ and $\varphi^*(x)$ can be taken to be the function

$$\varphi^*(x) = E(\eta \mid \xi = x). \tag{6}$$

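PROOF. Without loss of generality we may consider only estimators $\varphi(\xi)$ for which $E\varphi^2(\xi) < \infty$. Then if $\varphi(\xi)$ is such an estimator, and $\varphi^*(\xi) = E(\eta \mid \xi)$, we have

$$\begin{aligned}
E[\eta - \varphi(\xi)]^2 &= E[(\eta - \varphi^*(\xi)) + (\varphi^*(\xi) - \varphi(\xi))]^2 \\
&= E[\eta - \varphi^*(\xi)]^2 + E[\varphi^*(\xi) - \varphi(\xi)]^2 + 2E[(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi))] \\
&\ge E[\eta - \varphi^*(\xi)]^2,
\end{aligned}$$

since $E[\varphi^*(\xi) - \varphi(\xi)]^2 \ge 0$ and, by the properties of conditional expectations,

$$\begin{aligned}
E[(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi))] &= E\{E[(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi)) \mid \xi]\} \\
&= E\{(\varphi^*(\xi) - \varphi(\xi))\,E(\eta - \varphi^*(\xi) \mid \xi)\} = 0.
\end{aligned}$$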

This completes the proof of the theorem.
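
The decomposition used in the proof can be seen on a small discrete example. The sketch below assumes a hypothetical joint distribution of $(\xi, \eta)$ with finitely many values; it computes $\varphi^*(x) = E(\eta \mid \xi = x)$ and checks that the error of a competing estimator exceeds the optimal error by exactly $E[\varphi^*(\xi) - \varphi(\xi)]^2$. All names and numbers are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical joint distribution: joint[(x, y)] = P(xi = x, eta = y).
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1,
         (1, 2): 0.3, (2, 1): 0.1, (2, 3): 0.2}

def conditional_mean(joint):
    """Return the map x -> E(eta | xi = x)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for (x, y), p in joint.items():
        num[x] += y * p
        den[x] += p
    return {x: num[x] / den[x] for x in den}

def mse(joint, phi):
    """Mean square error E[eta - phi(xi)]^2."""
    return sum(p * (y - phi(x)) ** 2 for (x, y), p in joint.items())

def gap(joint, phi_star, phi):
    """E[phi_star(xi) - phi(xi)]^2."""
    return sum(p * (phi_star[x] - phi(x)) ** 2 for (x, y), p in joint.items())

phi_star = conditional_mean(joint)
best = mse(joint, lambda x: phi_star[x])

def competitor(x):
    """A competing estimator phi(x) = x."""
    return x

other = mse(joint, competitor)
excess = gap(joint, phi_star, competitor)
```

Here `other` equals `best + excess`, which is the identity from the proof with the cross term equal to zero.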


Remark. It is clear from the proof that the conclusion of the theorem is still valid when $\xi$ is not merely a random variable but any random element with values in a measurable space $(E, \mathcal{E})$. We would then assume that $\varphi = \varphi(x)$ is an $\mathcal{E}/\mathcal{B}(R)$-measurable function.
Let us consider the form of $\varphi^*(x)$ on the hypothesis that $(\xi, \eta)$ is a Gaussian pair with density given by (4).


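From (1), (4) and (7.10) we find that the density $f_{\eta \mid \xi}(y \mid x)$ of the conditional probability distribution is given by

$$f_{\eta \mid \xi}(y \mid x) = \frac{1}{\sqrt{2\pi(1 - \rho^2)}\,\sigma_2}\, \exp\left\{-\frac{(y - m(x))^2}{2\sigma_2^2(1 - \rho^2)}\right\}, \tag{7}$$

where

$$m(x) = m_2 + \frac{\sigma_2}{\sigma_1}\,\rho\,(x - m_1). \tag{8}$$

Then by the Corollary of Theorem 3, §7,

$$E(\eta \mid \xi = x) = \int_{-\infty}^{\infty} y f_{\eta \mid \xi}(y \mid x)\,dy = m(x) \tag{9}$$

and

$$V(\eta \mid \xi = x) \equiv E[(\eta - E(\eta \mid \xi = x))^2 \mid \xi = x] = \int_{-\infty}^{\infty} (y - m(x))^2 f_{\eta \mid \xi}(y \mid x)\,dy = \sigma_2^2(1 - \rho^2). \tag{10}$$

Notice that the conditional variance $V(\eta \mid \xi = x)$ is independent of $x$ and therefore

$$\Delta = E[\eta - E(\eta \mid \xi)]^2 = \sigma_2^2(1 - \rho^2). \tag{11}$$

Formulas (9) and (11) were obtained under the assumption that $V\xi > 0$ and $V\eta > 0$. However, if $V\xi > 0$ and $V\eta = 0$ they are still evidently valid. Hence we have the following result (cf. (1.4.16) and (1.4.17)).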

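Theorem 2. Let $(\xi, \eta)$ be a Gaussian vector with $V\xi > 0$. Then the optimal estimator of $\eta$ in terms of $\xi$ is

$$E(\eta \mid \xi) = E\eta + \frac{\operatorname{cov}(\xi, \eta)}{V\xi}\,(\xi - E\xi), \tag{12}$$

and its error is

$$\Delta \equiv E[\eta - E(\eta \mid \xi)]^2 = V\eta - \frac{\operatorname{cov}^2(\xi, \eta)}{V\xi}. \tag{13}$$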

Remark. The curve $y(x) = E(\eta \mid \xi = x)$ is the curve of regression of $\eta$ on $\xi$ or of $\eta$ with respect to $\xi$. In the Gaussian case $E(\eta \mid \xi = x) = a + bx$ and consequently the regression of $\eta$ on $\xi$ is linear. Hence it is not surprising that the right-hand sides of (12) and (13) agree with the corresponding parts of (1.4.16) and (1.4.17) for the optimal linear estimator and its error.


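Corollary. Let $\varepsilon_1$ and $\varepsilon_2$ be independent Gaussian random variables with mean zero and unit variance, and

$$\xi = a_1\varepsilon_1 + a_2\varepsilon_2, \qquad \eta = b_1\varepsilon_1 + b_2\varepsilon_2.$$

Then $E\xi = E\eta = 0$, $V\xi = a_1^2 + a_2^2$, $V\eta = b_1^2 + b_2^2$, $\operatorname{cov}(\xi, \eta) = a_1b_1 + a_2b_2$, and if $a_1^2 + a_2^2 > 0$, then

$$E(\eta \mid \xi) = \frac{a_1b_1 + a_2b_2}{a_1^2 + a_2^2}\,\xi, \tag{14}$$

$$\Delta = \frac{(a_1b_2 - a_2b_1)^2}{a_1^2 + a_2^2}. \tag{15}$$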
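
The error formula of the Corollary is the general Gaussian error "variance of $\eta$ minus squared covariance over variance of $\xi$" written in terms of the coefficients. The sketch below, with a hypothetical function name and sample coefficients, computes the slope of the optimal estimator and the error in both forms so that the two can be compared.

```python
def gaussian_estimator(a1, a2, b1, b2):
    """For xi = a1*e1 + a2*e2 and eta = b1*e1 + b2*e2 with independent
    standard normal e1, e2, return (slope, error_general, error_corollary)."""
    var_xi = a1 * a1 + a2 * a2
    var_eta = b1 * b1 + b2 * b2
    cov = a1 * b1 + a2 * b2
    if var_xi <= 0:
        raise ValueError("V(xi) must be positive")
    slope = cov / var_xi                               # coefficient of xi in E(eta | xi)
    error_general = var_eta - cov * cov / var_xi       # V(eta) - cov^2 / V(xi)
    error_corollary = (a1 * b2 - a2 * b1) ** 2 / var_xi
    return slope, error_general, error_corollary

slope, err_general, err_corollary = gaussian_estimator(1.0, 2.0, 3.0, -1.0)
```

The agreement of the two error values is the identity $(a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1b_1 + a_2b_2)^2 = (a_1b_2 - a_2b_1)^2$.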

3. Let us consider the problem of determining the distribution functions of random variables that are functions of other random variables.
Let $\xi$ be a random variable with distribution function $F_\xi(x)$ (and density $f_\xi(x)$, if it exists), let $\varphi = \varphi(x)$ be a Borel function and $\eta = \varphi(\xi)$. Letting $I_y = (-\infty, y]$, we obtain

$$F_\eta(y) = P(\eta \le y) = P(\varphi(\xi) \in I_y) = P(\xi \in \varphi^{-1}(I_y)) = \int_{\varphi^{-1}(I_y)} F_\xi(dx), \tag{16}$$

which expresses the distribution function $F_\eta(y)$ in terms of $F_\xi(x)$ and $\varphi$.
For example, if $\eta = a\xi + b$, $a > 0$, we have

$$F_\eta(y) = P\Bigl(\xi \le \frac{y - b}{a}\Bigr) = F_\xi\Bigl(\frac{y - b}{a}\Bigr). \tag{17}$$

If $\eta = \xi^2$, it is evident that $F_\eta(y) = 0$ for $y < 0$, while for $y \ge 0$

$$F_\eta(y) = P(\xi^2 \le y) = P(-\sqrt{y} \le \xi \le \sqrt{y}) = F_\xi(\sqrt{y}) - F_\xi(-\sqrt{y}) + P(\xi = -\sqrt{y}). \tag{18}$$

We now turn to the problem of determining $f_\eta(y)$.

Let us suppose that the range of $\xi$ is a (finite or infinite) open interval $I = (a, b)$, and that the function $\varphi = \varphi(x)$, with domain $(a, b)$, is continuously differentiable and either strictly increasing or strictly decreasing. We also suppose that $\varphi'(x) \ne 0$, $x \in I$. Let us write $h(y) = \varphi^{-1}(y)$ and suppose for definiteness that $\varphi(x)$ is strictly increasing. Then when $y \in \varphi(I)$,

$$F_\eta(y) = P(\eta \le y) = P(\varphi(\xi) \le y) = P(\xi \le h(y)) = \int_{-\infty}^{h(y)} f_\xi(x)\,dx. \tag{19}$$


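By Problem 15 of §6,

$$\int_{-\infty}^{h(y)} f_\xi(x)\,dx = \int_{-\infty}^{y} f_\xi(h(z))\,h'(z)\,dz \tag{20}$$

and therefore

$$f_\eta(y) = f_\xi(h(y))\,h'(y). \tag{21}$$

Similarly, if $\varphi(x)$ is strictly decreasing,

$$f_\eta(y) = f_\xi(h(y))\,(-h'(y)).$$

Hence in either case

$$f_\eta(y) = f_\xi(h(y))\,|h'(y)|. \tag{22}$$

For example, if $\eta = a\xi + b$, $a \ne 0$, we have

$$h(y) = \frac{y - b}{a} \quad \text{and} \quad f_\eta(y) = \frac{1}{|a|}\, f_\xi\Bigl(\frac{y - b}{a}\Bigr).$$

If $\xi \sim \mathcal{N}(m, \sigma^2)$ and $\eta = e^{\xi}$, we find from (22) that

$$f_\eta(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma y} \exp\left[-\dfrac{\ln^2(y/M)}{2\sigma^2}\right], & y > 0, \\[2ex] 0, & y \le 0, \end{cases} \tag{23}$$

with $M = e^m$.
A probability distribution with the density (23) is said to be lognormal (logarithmically normal).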
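
The change-of-variables rule can be checked numerically on the lognormal case. In the sketch below (function names and parameter values are illustrative assumptions) the distribution function of $\eta = e^{\xi}$ is obtained directly as $F_\xi(\ln y)$, and its numerical derivative is compared with the lognormal density computed from $f_\xi(h(y))\,|h'(y)|$ with $h(y) = \ln y$.

```python
import math

def normal_cdf(x, m, s):
    """Distribution function of N(m, s^2)."""
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def lognormal_pdf(y, m, s):
    """Density of exp(xi) for xi ~ N(m, s^2), i.e. f_xi(ln y) / y for y > 0."""
    if y <= 0:
        return 0.0
    M = math.exp(m)
    return (math.exp(-math.log(y / M) ** 2 / (2 * s * s))
            / (math.sqrt(2 * math.pi) * s * y))

def lognormal_cdf(y, m, s):
    """P(exp(xi) <= y) = P(xi <= ln y) for y > 0."""
    return normal_cdf(math.log(y), m, s) if y > 0 else 0.0

def numeric_derivative(F, y, h=1e-5):
    """Central difference approximation of F'(y)."""
    return (F(y + h) - F(y - h)) / (2 * h)

m, s = 0.3, 0.8  # hypothetical parameters
checks = [(y, numeric_derivative(lambda t: lognormal_cdf(t, m, s), y),
           lognormal_pdf(y, m, s)) for y in (0.5, 1.0, 2.5)]
```

Each triple in `checks` holds a point, the numerical derivative of the distribution function there, and the closed-form density; the last two agree up to the discretization error.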
If $\varphi = \varphi(x)$ is neither strictly increasing nor strictly decreasing, formula (22) is inapplicable. However, the following generalization suffices for many applications.
Let $\varphi = \varphi(x)$ be defined on the set $\sum_{k=1}^{n} [a_k, b_k]$, continuously differentiable and either strictly increasing or strictly decreasing on each open interval $I_k = (a_k, b_k)$, and with $\varphi'(x) \ne 0$ for $x \in I_k$. Let $h_k = h_k(y)$ be the inverse of $\varphi(x)$ for $x \in I_k$. Then we have the following generalization of (22):

$$f_\eta(y) = \sum_{k=1}^{n} f_\xi(h_k(y))\,|h_k'(y)| \cdot I_{D_k}(y), \tag{24}$$

where $D_k$ is the domain of $h_k(y)$.

For example, if $\eta = \xi^2$ we can take $I_1 = (-\infty, 0)$, $I_2 = (0, \infty)$, and find that $h_1(y) = -\sqrt{y}$, $h_2(y) = \sqrt{y}$, and therefore

$$f_\eta(y) = \begin{cases} \dfrac{1}{2\sqrt{y}}\,[f_\xi(\sqrt{y}) + f_\xi(-\sqrt{y})], & y > 0, \\[2ex] 0, & y \le 0. \end{cases} \tag{25}$$


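We can observe that this result also follows from (18), since $P(\xi = -\sqrt{y}) = 0$.
In particular, if $\xi \sim \mathcal{N}(0, 1)$,

$$f_{\xi^2}(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi y}}\, e^{-y/2}, & y > 0, \\[2ex] 0, & y \le 0. \end{cases} \tag{26}$$

A straightforward calculation shows that

$$f_{|\xi|}(y) = \begin{cases} f_\xi(y) + f_\xi(-y), & y > 0, \\ 0, & y \le 0, \end{cases} \tag{27}$$

$$f_{+\sqrt{|\xi|}}(y) = \begin{cases} 2y\,(f_\xi(y^2) + f_\xi(-y^2)), & y > 0, \\ 0, & y \le 0. \end{cases} \tag{28}$$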
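4. We now consider functions of several random variables.

If $\xi$ and $\eta$ are random variables with joint distribution $F_{\xi\eta}(x, y)$, and $\varphi = \varphi(x, y)$ is a Borel function, then if we put $\zeta = \varphi(\xi, \eta)$ we see at once that

$$F_\zeta(z) = \iint_{\{x, y:\ \varphi(x, y) \le z\}} dF_{\xi\eta}(x, y). \tag{29}$$

For example, if $\varphi(x, y) = x + y$, and $\xi$ and $\eta$ are independent (and therefore $F_{\xi\eta}(x, y) = F_\xi(x) \cdot F_\eta(y)$) then Fubini's theorem shows that

$$\begin{aligned}
F_\zeta(z) &= \iint_{\{x, y:\ x + y \le z\}} dF_\xi(x)\,dF_\eta(y) = \iint_{R^2} I_{\{x + y \le z\}}(x, y)\,dF_\xi(x)\,dF_\eta(y) \\
&= \int_{-\infty}^{\infty} dF_\xi(x) \left\{\int_{-\infty}^{\infty} I_{\{x + y \le z\}}(x, y)\,dF_\eta(y)\right\} = \int_{-\infty}^{\infty} F_\eta(z - x)\,dF_\xi(x)
\end{aligned} \tag{30}$$

and similarly

$$F_\zeta(z) = \int_{-\infty}^{\infty} F_\xi(z - y)\,dF_\eta(y). \tag{31}$$

If $F$ and $G$ are distribution functions, the function

$$H(z) = \int_{-\infty}^{\infty} F(z - x)\,dG(x)$$

is denoted by $F * G$ and called the convolution of $F$ and $G$.
Thus the distribution function $F_\zeta$ of the sum of two independent random variables $\xi$ and $\eta$ is the convolution of their distribution functions $F_\xi$ and $F_\eta$:

$$F_\zeta = F_\xi * F_\eta.$$

It is clear that $F_\xi * F_\eta = F_\eta * F_\xi$.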


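Now suppose that the independent random variables $\xi$ and $\eta$ have densities $f_\xi$ and $f_\eta$. Then we find from (31), with another application of Fubini's theorem, that

$$\begin{aligned}
F_\zeta(z) &= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{z - y} f_\xi(u)\,du\right] f_\eta(y)\,dy = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{z} f_\xi(u - y)\,du\right] f_\eta(y)\,dy \\
&= \int_{-\infty}^{z} \left[\int_{-\infty}^{\infty} f_\xi(u - y) f_\eta(y)\,dy\right] du,
\end{aligned}$$

whence

$$f_\zeta(z) = \int_{-\infty}^{\infty} f_\xi(z - y) f_\eta(y)\,dy, \tag{32}$$

and similarly

$$f_\zeta(z) = \int_{-\infty}^{\infty} f_\eta(z - x) f_\xi(x)\,dx. \tag{33}$$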
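Let us see some examples of the use of these formulas.
Let $\xi_1, \xi_2, \ldots, \xi_n$ be a sequence of independent identically distributed random variables with the uniform density on $[-1, 1]$:

$$f(x) = \begin{cases} \tfrac{1}{2}, & |x| \le 1, \\ 0, & |x| > 1. \end{cases}$$

Then by (32) we have

$$f_{\xi_1 + \xi_2}(x) = \begin{cases} \dfrac{2 - |x|}{4}, & |x| \le 2, \\[1.5ex] 0, & |x| > 2, \end{cases}$$

$$f_{\xi_1 + \xi_2 + \xi_3}(x) = \begin{cases} \dfrac{(3 - |x|)^2}{16}, & 1 \le |x| \le 3, \\[1.5ex] \dfrac{3 - x^2}{8}, & 0 \le |x| \le 1, \\[1.5ex] 0, & |x| > 3, \end{cases}$$

and by induction

$$f_{\xi_1 + \cdots + \xi_n}(x) = \begin{cases} \dfrac{1}{2^n (n - 1)!} \displaystyle\sum_{k=0}^{[(n + x)/2]} (-1)^k C_n^k (n + x - 2k)^{n - 1}, & |x| \le n, \\[1.5ex] 0, & |x| > n. \end{cases}$$

Now let $\xi \sim \mathcal{N}(m_1, \sigma_1^2)$ and $\eta \sim \mathcal{N}(m_2, \sigma_2^2)$. If we write

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},$$

then

$$f_\xi(x) = \frac{1}{\sigma_1}\,\varphi\Bigl(\frac{x - m_1}{\sigma_1}\Bigr), \qquad f_\eta(x) = \frac{1}{\sigma_2}\,\varphi\Bigl(\frac{x - m_2}{\sigma_2}\Bigr),$$

and the formula

$$f_{\xi + \eta}(x) = \frac{1}{\sqrt{\sigma_1^2 + \sigma_2^2}}\,\varphi\Bigl(\frac{x - (m_1 + m_2)}{\sqrt{\sigma_1^2 + \sigma_2^2}}\Bigr)$$

follows easily from (32).

Therefore the sum of two independent Gaussian random variables is again a Gaussian random variable with mean $m_1 + m_2$ and variance $\sigma_1^2 + \sigma_2^2$.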
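
For the sums of independent variables uniform on $[-1, 1]$, the general alternating-sum formula can be compared with the closed forms for two and three summands. The sketch below implements the density $\frac{1}{2^n (n-1)!} \sum_{k=0}^{[(n+x)/2]} (-1)^k C_n^k (n + x - 2k)^{n-1}$ for $|x| \le n$; the function names are hypothetical, and the code is meant for $n \ge 2$ (for $n = 1$ the endpoint $x = 1$ would need separate care).

```python
import math

def uniform_sum_pdf(x, n):
    """Density at x of the sum of n independent Uniform[-1, 1] variables (n >= 2)."""
    if abs(x) > n:
        return 0.0
    upper = math.floor((n + x) / 2)
    total = sum((-1) ** k * math.comb(n, k) * (n + x - 2 * k) ** (n - 1)
                for k in range(upper + 1))
    return total / (2 ** n * math.factorial(n - 1))

def two_term_pdf(x):
    """Closed form for n = 2: (2 - |x|) / 4 on [-2, 2]."""
    return (2 - abs(x)) / 4 if abs(x) <= 2 else 0.0

def three_term_pdf(x):
    """Closed form for n = 3."""
    if abs(x) > 3:
        return 0.0
    if abs(x) <= 1:
        return (3 - x * x) / 8
    return (3 - abs(x)) ** 2 / 16

points = [-2.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.75]
diff2 = max(abs(uniform_sum_pdf(x, 2) - two_term_pdf(x)) for x in points)
diff3 = max(abs(uniform_sum_pdf(x, 3) - three_term_pdf(x)) for x in points)
```

Both `diff2` and `diff3` are zero up to rounding on the sample points.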

Let $\xi_1, \ldots, \xi_n$ be independent random variables each of which is normally distributed with mean 0 and variance 1. Then it follows easily from (26) (by induction) that

$$f_{\xi_1^2 + \cdots + \xi_n^2}(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{(n/2) - 1} e^{-x/2}, & x > 0, \\[1.5ex] 0, & x \le 0. \end{cases} \tag{34}$$

The variable $\xi_1^2 + \cdots + \xi_n^2$ is usually denoted by $\chi_n^2$, and its distribution (with density (34)) is the $\chi^2$-distribution ("chi-square distribution") with $n$ degrees of freedom (cf. Table 2 in §3).
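
Two quick consistency checks on the chi-square density: for $n = 1$ it should reduce to the density (26) of the square of a standard normal variable, and convolving the densities for 2 and 2 degrees of freedom by formula (32) should give the density for 4 degrees of freedom. The sketch below uses a simple midpoint rule; the function names and the number of steps are illustrative assumptions.

```python
import math

def chi2_pdf(x, n):
    """Chi-square density with n degrees of freedom."""
    if x <= 0:
        return 0.0
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

def squared_normal_pdf(y):
    """Density of xi^2 for xi ~ N(0, 1), formula (26)."""
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y) if y > 0 else 0.0

def convolve_at(f, g, z, steps=2000):
    """Midpoint-rule value of the convolution integral of f and g at z > 0,
    for densities f and g that vanish on the negative half-line."""
    h = z / steps
    return h * sum(f(z - (i + 0.5) * h) * g((i + 0.5) * h) for i in range(steps))

def chi2_two(x):
    """Chi-square density with 2 degrees of freedom."""
    return chi2_pdf(x, 2)

conv_value = convolve_at(chi2_two, chi2_two, 3.0)
direct_value = chi2_pdf(3.0, 4)
```

The choice of 2 + 2 degrees of freedom avoids the integrable singularity at zero that the density for one degree of freedom has, which a plain midpoint rule handles poorly.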

If we write $\chi_n = +\sqrt{\chi_n^2}$, it follows from (28) and (34) that

$$f_{\chi_n}(x) = \begin{cases} \dfrac{2x^{n - 1} e^{-x^2/2}}{2^{n/2}\,\Gamma(n/2)}, & x \ge 0, \\[1.5ex] 0, & x < 0. \end{cases} \tag{35}$$

The probability distribution with this density is the $\chi$-distribution (chi-distribution) with $n$ degrees of freedom.
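Again let $\xi$ and $\eta$ be independent random variables with densities $f_\xi$ and $f_\eta$. Then

$$F_{\xi\eta}(z) = \iint_{\{x, y:\ xy \le z\}} f_\xi(x) f_\eta(y)\,dx\,dy, \qquad F_{\xi/\eta}(z) = \iint_{\{x, y:\ x/y \le z\}} f_\xi(x) f_\eta(y)\,dx\,dy.$$

Hence we easily obtain

$$f_{\xi\eta}(z) = \int_{-\infty}^{\infty} f_\xi\Bigl(\frac{z}{y}\Bigr) f_\eta(y)\,\frac{dy}{|y|} = \int_{-\infty}^{\infty} f_\eta\Bigl(\frac{z}{x}\Bigr) f_\xi(x)\,\frac{dx}{|x|} \tag{36}$$

and

$$f_{\xi/\eta}(z) = \int_{-\infty}^{\infty} f_\xi(zy) f_\eta(y)\,|y|\,dy. \tag{37}$$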
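
As an illustration of the ratio formula $f_{\xi/\eta}(z) = \int f_\xi(zy) f_\eta(y)\,|y|\,dy$, the sketch below evaluates the integral numerically for two independent standard normal variables and compares the result with the Cauchy density $1/(\pi(1 + z^2))$, which is the known distribution of such a ratio. The integration limits, step count and names are assumptions of this sketch.

```python
import math

def std_normal_pdf(x):
    """Density of N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def ratio_pdf(z, f_xi, f_eta, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of f_xi(z*y) * f_eta(y) * |y|
    over [lo, hi]; the tails outside are assumed negligible."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += f_xi(z * y) * f_eta(y) * abs(y)
    return total * h

def cauchy_pdf(z):
    """Standard Cauchy density."""
    return 1.0 / (math.pi * (1.0 + z * z))

errors = [abs(ratio_pdf(z, std_normal_pdf, std_normal_pdf) - cauchy_pdf(z))
          for z in (0.0, 0.7, -2.0)]
```

The grid is symmetric about zero, so the kink of $|y|$ at the origin falls on a cell boundary and does not degrade the midpoint rule.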


Putting $\xi = \xi_0$ and $\eta = \sqrt{(\xi_1^2 + \cdots + \xi_n^2)/n}$ in (37), where $\xi_0, \xi_1, \ldots, \xi_n$ are independent Gaussian random variables with mean 0 and variance $\sigma^2 > 0$, and using (35), we find that

$$f_{\xi_0/\sqrt{(1/n)(\xi_1^2 + \cdots + \xi_n^2)}}(x) = \frac{1}{\sqrt{\pi n}}\, \frac{\Gamma\bigl(\frac{n + 1}{2}\bigr)}{\Gamma\bigl(\frac{n}{2}\bigr)}\, \frac{1}{\bigl(1 + \frac{x^2}{n}\bigr)^{(n + 1)/2}}. \tag{38}$$

The variable $\xi_0/\sqrt{(1/n)(\xi_1^2 + \cdots + \xi_n^2)}$ is denoted by $t$, and its distribution is the t-distribution, or Student's distribution, with $n$ degrees of freedom (cf. Table 2 in §3). Observe that this distribution is independent of $\sigma$.
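
The Student density can be sanity-checked against two cases with elementary closed forms: for $n = 1$ it is the Cauchy density $1/(\pi(1 + x^2))$, and for $n = 2$ it simplifies to $(2 + x^2)^{-3/2}$. The following sketch, with a hypothetical function name, evaluates the density with the gamma function from the standard library.

```python
import math

def student_t_pdf(x, n):
    """Student t density with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(math.pi * n) * math.gamma(n / 2))
    return c / (1 + x * x / n) ** ((n + 1) / 2)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
err_cauchy = max(abs(student_t_pdf(x, 1) - 1 / (math.pi * (1 + x * x)))
                 for x in points)
err_two = max(abs(student_t_pdf(x, 2) - (2 + x * x) ** -1.5) for x in points)
```

Note that `math.gamma` overflows for large arguments, so for very large $n$ one would work with `math.lgamma` instead.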

5. PROBLEMS

1. Verify formulas (9), (10), (24), (27), (28), and (34)-(38).


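2. Let $\xi_1, \ldots, \xi_n$, $n \ge 2$, be independent identically distributed random variables with distribution function $F(x)$ (and density $f(x)$, if it exists), and let $\bar{\xi} = \max(\xi_1, \ldots, \xi_n)$, $\underline{\xi} = \min(\xi_1, \ldots, \xi_n)$, $\rho = \bar{\xi} - \underline{\xi}$. Show that

$$F_{\bar{\xi}, \underline{\xi}}(y, x) = \begin{cases} (F(y))^n - (F(y) - F(x))^n, & y > x, \\ (F(y))^n, & y \le x, \end{cases}$$

$$f_{\bar{\xi}, \underline{\xi}}(y, x) = \begin{cases} n(n - 1)[F(y) - F(x)]^{n - 2} f(x) f(y), & y > x, \\ 0, & y < x, \end{cases}$$

$$F_\rho(x) = \begin{cases} n \displaystyle\int_{-\infty}^{\infty} [F(y) - F(y - x)]^{n - 1} f(y)\,dy, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

$$f_\rho(x) = \begin{cases} n(n - 1) \displaystyle\int_{-\infty}^{\infty} [F(y) - F(y - x)]^{n - 2} f(y - x) f(y)\,dy, & x > 0, \\ 0, & x < 0. \end{cases}$$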

3. Let $\xi_1$ and $\xi_2$ be independent Poisson random variables with respective parameters $\lambda_1$ and $\lambda_2$. Show that $\xi_1 + \xi_2$ has a Poisson distribution with parameter $\lambda_1 + \lambda_2$.
4. Let $m_1 = m_2 = 0$ in (4). Show that

5. The maximal correlation coefficient of $\xi$ and $\eta$ is $\rho^*(\xi, \eta) = \sup_{u, v} \rho(u(\xi), v(\eta))$, where the supremum is taken over the Borel functions $u = u(x)$ and $v = v(x)$ for which the correlation coefficient $\rho(u(\xi), v(\eta))$ is defined. Show that $\xi$ and $\eta$ are independent if and only if $\rho^*(\xi, \eta) = 0$.
6. Let $\tau_1, \tau_2, \ldots, \tau_n$ be independent nonnegative identically distributed random variables with the exponential density

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0.$$

Show that the distribution of $\tau_1 + \cdots + \tau_k$ has the density

$$\frac{\lambda^k t^{k - 1} e^{-\lambda t}}{(k - 1)!}, \quad t \ge 0, \quad 1 \le k \le n,$$

and that

$$P(\tau_1 + \cdots + \tau_k > t) = \sum_{i=0}^{k - 1} e^{-\lambda t}\, \frac{(\lambda t)^i}{i!}.$$

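7. Let $\xi \sim \mathcal{N}(0, \sigma^2)$. Show that, for every $p \ge 1$,

$$E|\xi|^p = C_p \sigma^p,$$

where

$$C_p = \frac{2^{p/2}}{\pi^{1/2}}\, \Gamma\Bigl(\frac{p + 1}{2}\Bigr)$$

and $\Gamma(s) = \int_0^\infty e^{-x} x^{s - 1}\,dx$ is the gamma function. In particular, for each integer $n \ge 1$,

$$E\xi^{2n} = (2n - 1)!!\,\sigma^{2n}.$$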

§9. Construction of a Process with Given Finite-Dimensional Distribution
1. Let $\xi = \xi(\omega)$ be a random variable defined on the probability space $(\Omega, \mathcal{F}, P)$, and let

$$F_\xi(x) = P\{\omega : \xi(\omega) \le x\}$$

be its distribution function. It is clear that $F_\xi(x)$ is a distribution function on the real line in the sense of Definition 1 of §3.
We now ask the following question. Let $F = F(x)$ be a distribution function on $R$. Does there exist a random variable whose distribution function is $F(x)$?

One reason for asking this question is as follows. Many statements in probability theory begin, "Let $\xi$ be a random variable with the distribution function $F(x)$; then ...". Consequently if a statement of this kind is to be meaningful we need to be certain that the object under consideration actually exists. Since to know a random variable we first have to know its domain $(\Omega, \mathcal{F})$, and in order to speak of its distribution we need to have a probability measure $P$ on $(\Omega, \mathcal{F})$, a correct way of phrasing the question of the existence of a random variable with a given distribution function $F(x)$ is this:
Do there exist a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $\xi = \xi(\omega)$ on it, such that

$$P\{\omega : \xi(\omega) \le x\} = F(x)?$$


Let us show that the answer is positive, and essentially contained in Theorem 1 of §1.
In fact, let us put

$$\Omega = R, \qquad \mathcal{F} = \mathcal{B}(R).$$

It follows from Theorem 1 of §1 that there is a probability measure $P$ (and only one) on $(R, \mathcal{B}(R))$ for which $P(a, b] = F(b) - F(a)$, $a < b$.
Put $\xi(\omega) \equiv \omega$. Then

$$P\{\omega : \xi(\omega) \le x\} = P\{\omega : \omega \le x\} = P(-\infty, x] = F(x).$$

Consequently we have constructed the required probability space and the random variable on it.
2. Let us now ask a similar question for random processes.
Let $X = (\xi_t)_{t \in T}$ be a random process (in the sense of Definition 3, §5) defined on the probability space $(\Omega, \mathcal{F}, P)$, with $t \in T \subseteq R$.
From a physical point of view, the most fundamental characteristic of a random process is the set $\{F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)\}$ of its finite-dimensional distribution functions

$$F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = P\{\omega : \xi_{t_1} \le x_1, \ldots, \xi_{t_n} \le x_n\}, \tag{1}$$

defined for all sets $t_1, \ldots, t_n$ with $t_1 < t_2 < \cdots < t_n$.
We see from (1) that, for each set $t_1, \ldots, t_n$ with $t_1 < t_2 < \cdots < t_n$ the functions $F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)$ are $n$-dimensional distribution functions (in the sense of Definition 2, §3) and that the collection $\{F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)\}$ has the following consistency property:

$$\lim_{x_k \uparrow \infty} F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = F_{t_1, \ldots, \hat{t}_k, \ldots, t_n}(x_1, \ldots, \hat{x}_k, \ldots, x_n), \tag{2}$$

where $\hat{\ }$ indicates an omitted coordinate.

Now it is natural to ask the following question: under what conditions can a given family $\{F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)\}$ of distribution functions $F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)$ (in the sense of Definition 2, §3) be the family of finite-dimensional distribution functions of a random process? It is quite remarkable that all such conditions are covered by the consistency condition (2).

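Theorem 1 (Kolmogorov's Theorem on the Existence of a Process). Let $\{F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)\}$, with $t_i \in T \subseteq R$, $t_1 < t_2 < \cdots < t_n$, $n \ge 1$, be a given family of finite-dimensional distribution functions, satisfying the consistency condition (2). Then there are a probability space $(\Omega, \mathcal{F}, P)$ and a random process $X = (\xi_t)_{t \in T}$ such that

$$P\{\omega : \xi_{t_1} \le x_1, \ldots, \xi_{t_n} \le x_n\} = F_{t_1, \ldots, t_n}(x_1, \ldots, x_n). \tag{3}$$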


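PROOF. Put

$$\Omega = R^T, \qquad \mathcal{F} = \mathcal{B}(R^T),$$

i.e. take $\Omega$ to be the space of real functions $\omega = (\omega_t)_{t \in T}$ with the $\sigma$-algebra generated by the cylindrical sets.
Let $\tau = [t_1, \ldots, t_n]$, $t_1 < t_2 < \cdots < t_n$. Then by Theorem 2 of §3 we can construct on the space $(R^n, \mathcal{B}(R^n))$ a unique probability measure $P_\tau$ such that

$$P_\tau\{(\omega_{t_1}, \ldots, \omega_{t_n}) : \omega_{t_1} \le x_1, \ldots, \omega_{t_n} \le x_n\} = F_{t_1, \ldots, t_n}(x_1, \ldots, x_n). \tag{4}$$

It follows from the consistency condition (2) that the family $\{P_\tau\}$ is also consistent (see (3.20)). According to Theorem 4 of §3 there is a probability measure $P$ on $(R^T, \mathcal{B}(R^T))$ such that

$$P\{\omega : (\omega_{t_1}, \ldots, \omega_{t_n}) \in B\} = P_\tau(B)$$

for every set $\tau = [t_1, \ldots, t_n]$, $t_1 < \cdots < t_n$.

From this, it also follows that (4) is satisfied. Therefore the required random process $X = (\xi_t(\omega))_{t \in T}$ can be taken to be the process defined by

$$\xi_t(\omega) = \omega_t, \quad t \in T. \tag{5}$$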

This completes the proof of the theorem.

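Remark 1. The probability space $(R^T, \mathcal{B}(R^T), P)$ that we have constructed is called canonical, and the construction given by (5) is called the coordinate method of constructing the process.

Remark 2. Let $(E_\alpha, \mathcal{E}_\alpha)$ be complete separable metric spaces, where $\alpha$ belongs to some set $\mathfrak{A}$ of indices. Let $\{P_\tau\}$ be a set of consistent finite-dimensional distribution functions $P_\tau$, $\tau = [\alpha_1, \ldots, \alpha_n]$, on

$$(E_{\alpha_1} \times \cdots \times E_{\alpha_n},\ \mathcal{E}_{\alpha_1} \otimes \cdots \otimes \mathcal{E}_{\alpha_n}).$$

Then there are a probability space $(\Omega, \mathcal{F}, P)$ and a family of $\mathcal{F}/\mathcal{E}_\alpha$-measurable functions $(X_\alpha(\omega))_{\alpha \in \mathfrak{A}}$ such that

$$P\{(X_{\alpha_1}, \ldots, X_{\alpha_n}) \in B\} = P_\tau(B)$$

for all $\tau = [\alpha_1, \ldots, \alpha_n]$ and $B \in \mathcal{E}_{\alpha_1} \otimes \cdots \otimes \mathcal{E}_{\alpha_n}$.

This result, which generalizes Theorem 1, follows from Theorem 4 of §3 if we put $\Omega = \prod_\alpha E_\alpha$, $\mathcal{F} = \bigotimes_\alpha \mathcal{E}_\alpha$ and $X_\alpha(\omega) = \omega_\alpha$ for each $\omega = (\omega_\alpha)$, $\alpha \in \mathfrak{A}$.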

Corollary 1. Let $F_1(x), F_2(x), \ldots$ be a sequence of one-dimensional distribution functions. Then there exist a probability space $(\Omega, \mathcal{F}, P)$ and a sequence of independent random variables $\xi_1, \xi_2, \ldots$ such that

$$P\{\omega : \xi_i(\omega) \le x\} = F_i(x). \tag{6}$$


In particular, there is a probability space $(\Omega, \mathcal{F}, P)$ on which an infinite sequence of Bernoulli random variables is defined (in this connection see Subsection 2 of §5 of Chapter I). Notice that $\Omega$ can be taken to be the space

$$\Omega = \{\omega : \omega = (a_1, a_2, \ldots),\ a_i = 0, 1\}$$

(cf. also Theorem 2).

To establish the corollary it is enough to put $F_{1, \ldots, n}(x_1, \ldots, x_n) = F_1(x_1) \cdots F_n(x_n)$ and apply Theorem 1.

Corollary 2. Let $T = [0, \infty)$ and let $\{p(s, x; t, B)\}$ be a family of nonnegative functions defined for $s, t \in T$, $t > s$, $x \in R$, $B \in \mathcal{B}(R)$, and satisfying the following conditions:
(a) $p(s, x; t, B)$ is a probability measure on $B$ for given $s$, $x$ and $t$;
(b) for given $s$, $t$ and $B$, the function $p(s, x; t, B)$ is a Borel function of $x$;
(c) for $0 \le s < t < r$ and $B \in \mathcal{B}(R)$, the Kolmogorov-Chapman equation

$$p(s, x; r, B) = \int_R p(s, x; t, dy)\,p(t, y; r, B) \tag{7}$$

is satisfied.
Also let $\pi = \pi(B)$ be a probability measure on $(R, \mathcal{B}(R))$. Then there are a probability space $(\Omega, \mathcal{F}, P)$ and a random process $X = (\xi_t)_{t \ge 0}$ defined on it, such that

$$P\{\xi_{t_0} \in B_0, \xi_{t_1} \in B_1, \ldots, \xi_{t_n} \in B_n\} = \int_{B_0} \pi(dx_0) \int_{B_1} p(0, x_0; t_1, dx_1) \cdots \int_{B_n} p(t_{n-1}, x_{n-1}; t_n, dx_n) \tag{8}$$

for $0 = t_0 < t_1 < \cdots < t_n$.

The process $X$ so constructed is a Markov process with initial distribution $\pi$ and transition probabilities $\{p(s, x; t, B)\}$.

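Corollary 3. Let $T = \{0, 1, 2, \ldots\}$ and let $\{p_k(x; B)\}$ be a family of nonnegative functions defined for $k \ge 1$, $x \in R$, $B \in \mathcal{B}(R)$, and such that $p_k(x; B)$ is a probability measure on $B$ (for given $k$ and $x$) and measurable in $x$ (for given $k$ and $B$). In addition, let $\pi = \pi(B)$ be a probability measure on $(R, \mathcal{B}(R))$.
Then there is a probability space $(\Omega, \mathcal{F}, P)$ with a family of random variables $X = \{\xi_0, \xi_1, \ldots\}$ defined on it, such that

$$P\{\xi_0 \in B_0, \xi_1 \in B_1, \ldots, \xi_n \in B_n\} = \int_{B_0} \pi(dx_0) \int_{B_1} p_1(x_0; dx_1) \cdots \int_{B_n} p_n(x_{n-1}; dx_n).$$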


3. In the situation of Corollary 1, there is a sequence of independent random variables $\xi_1, \xi_2, \ldots$ whose one-dimensional distribution functions are $F_1, F_2, \ldots$, respectively.
Now let $(E_1, \mathcal{E}_1), (E_2, \mathcal{E}_2), \ldots$ be complete separable metric spaces and let $P_1, P_2, \ldots$ be probability measures on them. Then it follows from Remark 2 that there are a probability space $(\Omega, \mathcal{F}, P)$ and a sequence of independent elements $X_1, X_2, \ldots$ such that $X_n$ is $\mathcal{F}/\mathcal{E}_n$-measurable and $P(X_n \in B) = P_n(B)$, $B \in \mathcal{E}_n$.
It turns out that this result remains valid when the spaces $(E_n, \mathcal{E}_n)$ are arbitrary measurable spaces.

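Theorem 2 (Ionescu Tulcea's Theorem on Extending a Measure and the Existence of a Random Sequence). Let $(\Omega_n, \mathcal{F}_n)$, $n = 1, 2, \ldots$, be arbitrary measurable spaces and $\Omega = \prod \Omega_n$, $\mathcal{F} = \bigotimes \mathcal{F}_n$. Suppose that a probability measure $P_1$ is given on $(\Omega_1, \mathcal{F}_1)$ and that, for every set $(\omega_1, \ldots, \omega_n) \in \Omega_1 \times \cdots \times \Omega_n$, $n \ge 1$, probability measures $P(\omega_1, \ldots, \omega_n; \cdot)$ are given on $(\Omega_{n+1}, \mathcal{F}_{n+1})$. Suppose that for every $B \in \mathcal{F}_{n+1}$ the functions $P(\omega_1, \ldots, \omega_n; B)$ are Borel functions of $(\omega_1, \ldots, \omega_n)$ and let

$$P_n(A_1 \times \cdots \times A_n) = \int_{A_1} P_1(d\omega_1) \int_{A_2} P(\omega_1; d\omega_2) \cdots \int_{A_n} P(\omega_1, \ldots, \omega_{n-1}; d\omega_n), \quad A_i \in \mathcal{F}_i, \quad n \ge 1. \tag{9}$$

Then there is a unique probability measure $P$ on $(\Omega, \mathcal{F})$ such that

$$P\{\omega : \omega_1 \in A_1, \ldots, \omega_n \in A_n\} = P_n(A_1 \times \cdots \times A_n) \tag{10}$$

for every $n \ge 1$, and there is a random sequence $X = (X_1(\omega), X_2(\omega), \ldots)$ such that

$$P\{\omega : X_1(\omega) \in A_1, \ldots, X_n(\omega) \in A_n\} = P_n(A_1 \times \cdots \times A_n), \tag{11}$$

where $A_i \in \mathcal{F}_i$.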
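PROOF. The first step is to establish that for each $n > 1$ the set function $P_n$ defined by (9) on the rectangle $A_1 \times \cdots \times A_n$ can be extended to the $\sigma$-algebra $\mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$.
For each $n \ge 2$ and $B \in \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$ we put

$$P_n(B) = \int_{\Omega_1} P_1(d\omega_1) \int_{\Omega_2} P(\omega_1; d\omega_2) \cdots \int_{\Omega_{n-1}} P(\omega_1, \ldots, \omega_{n-2}; d\omega_{n-1}) \int_{\Omega_n} I_B(\omega_1, \ldots, \omega_n)\,P(\omega_1, \ldots, \omega_{n-1}; d\omega_n). \tag{12}$$

It is easily seen that when $B = A_1 \times \cdots \times A_n$ the right-hand side of (12) is the same as the right-hand side of (9). Moreover, when $n = 2$ it can be shown, just as in Theorem 8 of §6, that $P_2$ is a measure. Consequently it is easily established by induction that $P_n$ is a measure for all $n \ge 2$.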
The next step is the same as in Kolmogorov's theorem on the extension of a measure in $(R^\infty, \mathcal{B}(R^\infty))$ (Theorem 3, §3). Thus for every cylindrical set $J_n(B) = \{\omega \in \Omega : (\omega_1, \ldots, \omega_n) \in B\}$, $B \in \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$, we define the set function $P$ by

$$P(J_n(B)) = P_n(B). \tag{13}$$

If we use (12) and the fact that $P(\omega_1, \ldots, \omega_k; \cdot)$ are measures, it is easy to establish that the definition (13) is consistent, in the sense that the value of $P(J_n(B))$ is independent of the representation of the cylindrical set.
It follows that the set function $P$ defined in (13) for cylindrical sets, and in an obvious way on the algebra that contains all the cylindrical sets, is a finitely additive measure on this algebra. It remains to verify its countable additivity and apply Carathéodory's theorem.
In Theorem 3 of §3 the corresponding verification was based on the property of $(R^n, \mathcal{B}(R^n))$ that for every Borel set $B$ there is a compact set $A \subseteq B$ whose probability measure is arbitrarily close to the measure of $B$. In the present case this part of the proof needs to be modified in the following way.
As in Theorem 3 of §3, let $\{\hat{B}_n\}_{n \ge 1}$ be a sequence of cylindrical sets

$$\hat{B}_n = \{\omega : (\omega_1, \ldots, \omega_n) \in B_n\},$$

that decrease to the empty set $\varnothing$, but have

$$\lim_{n \to \infty} P(\hat{B}_n) > 0. \tag{14}$$

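For $n > 1$, we have from (12)

$$P(\hat{B}_n) = \int_{\Omega_1} f_n^{(1)}(\omega_1)\,P_1(d\omega_1),$$

where

$$f_n^{(1)}(\omega_1) = \int_{\Omega_2} P(\omega_1; d\omega_2) \cdots \int_{\Omega_n} I_{B_n}(\omega_1, \ldots, \omega_n)\,P(\omega_1, \ldots, \omega_{n-1}; d\omega_n).$$

Since $\hat{B}_{n+1} \subseteq \hat{B}_n$, we have $B_{n+1} \subseteq B_n \times \Omega_{n+1}$ and therefore

$$I_{B_{n+1}}(\omega_1, \ldots, \omega_{n+1}) \le I_{B_n}(\omega_1, \ldots, \omega_n)\,I_{\Omega_{n+1}}(\omega_{n+1}).$$

Hence the sequence $\{f_n^{(1)}(\omega_1)\}_{n \ge 1}$ decreases. Let $f^{(1)}(\omega_1) = \lim_n f_n^{(1)}(\omega_1)$. By the dominated convergence theorem

$$\lim_n P(\hat{B}_n) = \lim_n \int_{\Omega_1} f_n^{(1)}(\omega_1)\,P_1(d\omega_1) = \int_{\Omega_1} f^{(1)}(\omega_1)\,P_1(d\omega_1).$$

By hypothesis, $\lim_n P(\hat{B}_n) > 0$. It follows that there is an $\omega_1^0 \in B_1$ such that $f^{(1)}(\omega_1^0) > 0$, since if $\omega_1 \notin B_1$ then $f_n^{(1)}(\omega_1) = 0$ for $n \ge 1$.

Moreover, for $n > 2$,

$$f_n^{(1)}(\omega_1^0) = \int_{\Omega_2} f_n^{(2)}(\omega_2)\,P(\omega_1^0; d\omega_2), \tag{15}$$

where

$$f_n^{(2)}(\omega_2) = \int_{\Omega_3} P(\omega_1^0, \omega_2; d\omega_3) \cdots \int_{\Omega_n} I_{B_n}(\omega_1^0, \omega_2, \ldots, \omega_n)\,P(\omega_1^0, \omega_2, \ldots, \omega_{n-1}; d\omega_n).$$

We can establish, as for $\{f_n^{(1)}(\omega_1)\}$, that $\{f_n^{(2)}(\omega_2)\}$ is decreasing. Let $f^{(2)}(\omega_2) = \lim_{n \to \infty} f_n^{(2)}(\omega_2)$. Then it follows from (15) that

$$0 < f^{(1)}(\omega_1^0) = \int_{\Omega_2} f^{(2)}(\omega_2)\,P(\omega_1^0; d\omega_2),$$

and there is a point $\omega_2^0 \in \Omega_2$ such that $f^{(2)}(\omega_2^0) > 0$. Then $(\omega_1^0, \omega_2^0) \in B_2$. Continuing this process, we find a point $(\omega_1^0, \ldots, \omega_n^0) \in B_n$ for each $n$. Consequently $(\omega_1^0, \ldots, \omega_n^0, \ldots) \in \bigcap \hat{B}_n$, but by hypothesis we have $\bigcap \hat{B}_n = \varnothing$. This contradiction shows that $\lim_n P(\hat{B}_n) = 0$.
Thus we have proved the part of the theorem about the existence of the probability measure $P$. The other part follows from this by putting $X_n(\omega) = \omega_n$, $n \ge 1$.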

Corollary 1. Let $(E_n, \mathcal{E}_n)_{n \ge 1}$ be any measurable spaces and $(P_n)_{n \ge 1}$, measures on them. Then there are a probability space $(\Omega, \mathcal{F}, P)$ and a family of independent random elements $X_1, X_2, \ldots$ with values in $(E_1, \mathcal{E}_1), (E_2, \mathcal{E}_2), \ldots$, respectively, such that

$$P\{\omega : X_n(\omega) \in B\} = P_n(B), \quad B \in \mathcal{E}_n, \quad n \ge 1.$$

Corollary 2. Let $E = \{1, 2, \ldots\}$, and let $\{p_k(x, y)\}$ be a family of nonnegative functions, $k \ge 1$, $x, y \in E$, such that $\sum_{y \in E} p_k(x, y) = 1$, $x \in E$, $k \ge 1$. Also let $\pi = \pi(x)$ be a probability distribution on $E$ (that is, $\pi(x) \ge 0$, $\sum_{x \in E} \pi(x) = 1$).
Then there are a probability space $(\Omega, \mathcal{F}, P)$ and a family $X = \{\xi_0, \xi_1, \ldots\}$ of random variables on it, such that

$$P\{\xi_0 = x_0, \xi_1 = x_1, \ldots, \xi_n = x_n\} = \pi(x_0)\,p_1(x_0, x_1) \cdots p_n(x_{n-1}, x_n) \tag{16}$$

(cf. (1.12.4)) for all $x_i \in E$ and $n \ge 1$. We may take $\Omega$ to be the space

$$\Omega = \{\omega : \omega = (x_0, x_1, \ldots),\ x_i \in E\}.$$

A sequence $X = \{\xi_0, \xi_1, \ldots\}$ of random variables satisfying (16) is a Markov chain with a countable set $E$ of states, transition matrix $\{p_k(x, y)\}$ and initial probability distribution $\pi$. (Cf. the definition in §12 of Chapter I.)
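
Formula (16) assigns a probability to every finite path, and these probabilities are consistent in the sense that summing over the last coordinate gives the probability of the shorter path. The sketch below illustrates this on a hypothetical two-state homogeneous chain; the initial distribution, the transition table and the function names are assumptions chosen only for the example.

```python
from itertools import product

def path_probability(pi, p, path):
    """Probability of the path (x_0, ..., x_n): pi(x_0) * p_1(x_0, x_1) * ... * p_n(x_{n-1}, x_n).
    pi: dict state -> probability; p: function (k, x, y) -> p_k(x, y)."""
    prob = pi.get(path[0], 0.0)
    for k in range(1, len(path)):
        prob *= p(k, path[k - 1], path[k])
    return prob

# Hypothetical chain on the states {1, 2}, homogeneous in k.
pi = {1: 0.5, 2: 0.5}
P = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.4, 2: 0.6}}

def p(k, x, y):
    """Transition probability p_k(x, y); here it does not depend on k."""
    return P[x][y]

single = path_probability(pi, p, [1, 1, 2])
total = sum(path_probability(pi, p, path) for path in product([1, 2], repeat=4))
marginal = sum(path_probability(pi, p, [1, 1, y]) for y in (1, 2))
shorter = path_probability(pi, p, [1, 1])
```

The equality of `marginal` and `shorter` is the discrete form of the consistency condition that makes the extension theorem applicable.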

4. PROBLEMS

1. Let $\Omega = [0, 1]$, let $\mathcal{F}$ be the class of Borel subsets of $[0, 1]$, and let $P$ be Lebesgue measure on $[0, 1]$. Show that the space $(\Omega, \mathcal{F}, P)$ is universal in the following sense. For every distribution function $F(x)$ there is a random variable $\xi = \xi(\omega)$ on $(\Omega, \mathcal{F}, P)$ such that its distribution function $F_\xi(x) = P(\xi \le x)$ coincides with $F(x)$. (Hint. $\xi(\omega) = F^{-1}(\omega)$, $0 < \omega < 1$, where $F^{-1}(\omega) = \sup\{x : F(x) < \omega\}$, when $0 < \omega < 1$, and $\xi(0)$, $\xi(1)$ can be chosen arbitrarily.)

2. Verify the consistency of the families of distributions in the corollaries to Theorems 1 and 2.
3. Deduce Corollary 2, Theorem 2, from Theorem 1.

§10. Various Kinds of Convergence of Sequences of Random Variables

1. Just as in analysis, in probability theory we need to use various kinds of convergence of random variables. Four of these are particularly important: in probability, with probability one, in mean of order p, in distribution.

First some definitions. Let $\xi, \xi_1, \xi_2, \ldots$ be random variables defined on a probability space $(\Omega, \mathcal{F}, P)$.

Definition 1. The sequence $\xi_1, \xi_2, \ldots$ of random variables converges in probability to the random variable $\xi$ (notation: $\xi_n \xrightarrow{P} \xi$) if for every $\varepsilon > 0$

$$P\{|\xi_n - \xi| > \varepsilon\} \to 0, \quad n \to \infty. \tag{1}$$

We have already encountered this convergence in connection with the law of large numbers for a Bernoulli scheme, which stated that

$$P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| > \varepsilon\Bigr\} \to 0, \quad n \to \infty$$

(see §5 of Chapter I). In analysis this is known as convergence in measure.

Definition 2. The sequence $\xi_1, \xi_2, \ldots$ of random variables converges with probability one (almost surely, almost everywhere) to the random variable $\xi$ if

$$P\{\omega : \xi_n \not\to \xi\} = 0, \tag{2}$$

i.e. if the set of sample points $\omega$ for which $\xi_n(\omega)$ does not converge to $\xi$ has probability zero.
This convergence is denoted by $\xi_n \to \xi$ (P-a.s.), or $\xi_n \xrightarrow{\text{a.s.}} \xi$ or $\xi_n \xrightarrow{\text{a.e.}} \xi$.


Definition 3. The sequence $\xi_1, \xi_2, \ldots$ of random variables converges in mean of order $p$, $0 < p < \infty$, to the random variable $\xi$ if

$$E|\xi_n - \xi|^p \to 0, \quad n \to \infty. \tag{3}$$

In analysis this is known as convergence in $L^p$, and denoted by $\xi_n \xrightarrow{L^p} \xi$. In the special case $p = 2$ it is called mean square convergence and denoted by $\xi = \text{l.i.m.}\ \xi_n$ (for "limit in the mean").

Definition 4. The sequence $\xi_1, \xi_2, \ldots$ of random variables converges in distribution to the random variable $\xi$ (notation: $\xi_n \xrightarrow{d} \xi$) if

$$Ef(\xi_n) \to Ef(\xi), \quad n \to \infty, \tag{4}$$

for every bounded continuous function $f = f(x)$. The reason for the terminology is that, according to what will be proved in Chapter III, §1, condition (4) is equivalent to the convergence of the distribution $F_{\xi_n}(x)$ to $F_\xi(x)$ at each point $x$ of continuity of $F_\xi(x)$. This convergence is denoted by $F_{\xi_n} \Rightarrow F_\xi$.

We emphasize that the convergence of random variables in distribution is defined only in terms of the convergence of their distribution functions. Therefore it makes sense to discuss this mode of convergence even when the random variables are defined on different probability spaces. This convergence will be studied in detail in Chapter III, where, in particular, we shall explain why in the definition of $F_{\xi_n} \Rightarrow F_\xi$ we require only convergence at points of continuity of $F_\xi(x)$ and not at all $x$.

2. In solving problems of analysis on the convergence (in one sense or another) of a given sequence of functions, it is useful to have the concept of a fundamental sequence (or Cauchy sequence). We can introduce a similar concept for each of the first three kinds of convergence of a sequence of random variables.
Let us say that a sequence $\{\xi_n\}_{n \ge 1}$ of random variables is fundamental in probability, or with probability 1, or in mean of order $p$, $0 < p < \infty$, if the corresponding one of the following properties is satisfied: $P\{|\xi_n - \xi_m| > \varepsilon\} \to 0$ as $m, n \to \infty$ for every $\varepsilon > 0$; the sequence $\{\xi_n(\omega)\}_{n \ge 1}$ is fundamental for almost all $\omega \in \Omega$; the sequence $\{\xi_n(\omega)\}_{n \ge 1}$ is fundamental in $L^p$, i.e. $E|\xi_n - \xi_m|^p \to 0$ as $n, m \to \infty$.

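3. Theorem 1.

(a) A necessary and sufficient condition that $\xi_n \to \xi$ (P-a.s.) is that

$$P\Bigl\{\sup_{k \ge n} |\xi_k - \xi| \ge \varepsilon\Bigr\} \to 0, \quad n \to \infty, \tag{5}$$

for every $\varepsilon > 0$.

(b) The sequence $\{\xi_n\}_{n \ge 1}$ is fundamental with probability 1 if and only if

$$P\Bigl\{\sup_{k \ge n,\ l \ge n} |\xi_k - \xi_l| \ge \varepsilon\Bigr\} \to 0, \quad n \to \infty, \tag{6}$$

for every $\varepsilon > 0$; or equivalently

$$P\Bigl\{\sup_{k \ge 0} |\xi_{n+k} - \xi_n| \ge \varepsilon\Bigr\} \to 0, \quad n \to \infty. \tag{7}$$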

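PROOF. (a) Let $A_n^\varepsilon = \{\omega : |\xi_n - \xi| \ge \varepsilon\}$, $A^\varepsilon = \varlimsup A_n^\varepsilon \equiv \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k^\varepsilon$. Then

$$\{\omega : \xi_n \not\to \xi\} = \bigcup_{\varepsilon > 0} A^\varepsilon = \bigcup_{m=1}^{\infty} A^{1/m}.$$

But

$$P(A^\varepsilon) = \lim_n P\Bigl(\bigcup_{k \ge n} A_k^\varepsilon\Bigr).$$

Hence (a) follows from the following chain of implications:

$$\begin{aligned}
0 = P\{\omega : \xi_n \not\to \xi\} = P\Bigl(\bigcup_{\varepsilon > 0} A^\varepsilon\Bigr) &\Leftrightarrow P\Bigl(\bigcup_{m=1}^{\infty} A^{1/m}\Bigr) = 0 \\
&\Leftrightarrow P(A^{1/m}) = 0,\ m \ge 1 \Leftrightarrow P(A^\varepsilon) = 0,\ \varepsilon > 0 \\
&\Leftrightarrow P\Bigl(\bigcup_{k \ge n} A_k^\varepsilon\Bigr) \to 0,\ n \to \infty \\
&\Leftrightarrow P\Bigl(\sup_{k \ge n} |\xi_k - \xi| \ge \varepsilon\Bigr) \to 0,\ n \to \infty.
\end{aligned}$$

(b) Let

$$B_{k,l}^\varepsilon = \{\omega : |\xi_k - \xi_l| \ge \varepsilon\}, \qquad B^\varepsilon = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n,\ l \ge n} B_{k,l}^\varepsilon.$$

Then $\{\omega : \{\xi_n(\omega)\}_{n \ge 1} \text{ is not fundamental}\} = \bigcup_{\varepsilon > 0} B^\varepsilon$, and it can be shown as in (a) that $P\{\omega : \{\xi_n(\omega)\}_{n \ge 1} \text{ is not fundamental}\} = 0 \Leftrightarrow$ (6). The equivalence of (6) and (7) follows from the obvious inequalities

$$\sup_{k \ge 0} |\xi_{n+k} - \xi_n| \le \sup_{k \ge 0,\ l \ge 0} |\xi_{n+k} - \xi_{n+l}| \le 2 \sup_{k \ge 0} |\xi_{n+k} - \xi_n|.$$

This completes the proof of the theorem.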

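Corollary. Since

$$P\Bigl\{\sup_{k \ge n} |\xi_k - \xi| \ge \varepsilon\Bigr\} = P\Bigl\{\bigcup_{k \ge n} (|\xi_k - \xi| \ge \varepsilon)\Bigr\} \le \sum_{k \ge n} P\{|\xi_k - \xi| \ge \varepsilon\},$$

a sufficient condition for $\xi_n \xrightarrow{\text{a.s.}} \xi$ is that

$$\sum_{k=1}^{\infty} P\{|\xi_k - \xi| \ge \varepsilon\} < \infty \tag{8}$$

is satisfied for every $\varepsilon > 0$.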


It is appropriate to observe at this point that the reasoning used in
obtaining (8) lets us establish the following simple but important result which
is essential in studying properties that are satisfied with probability 1.
Let $A_1, A_2, \ldots$ be a sequence of events in $\mathcal{F}$. Let (see the table in §1) $\{A_n \text{ i.o.}\}$ denote the event $\varlimsup A_n$ that consists in the realization of infinitely many of $A_1, A_2, \ldots$.

Borel-Cantelli Lemma.

(a) If $\sum P(A_n) < \infty$ then $P\{A_n \text{ i.o.}\} = 0$.
(b) If $\sum P(A_n) = \infty$ and $A_1, A_2, \ldots$ are independent, then $P\{A_n \text{ i.o.}\} = 1$.
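PROOF. (a) By definition
\[ \{A_n \text{ i.o.}\} = \overline{\lim}\, A_n = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k. \]
Consequently
\[ P\{A_n \text{ i.o.}\} = P\Bigl\{\bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k\Bigr\} = \lim_{n} P\Bigl(\bigcup_{k \ge n} A_k\Bigr) \le \lim_{n} \sum_{k \ge n} P(A_k), \]
and (a) follows.

(b) If $A_1, A_2, \ldots$ are independent, so are $\bar{A}_1, \bar{A}_2, \ldots$. Hence for $N \ge n$ we have
\[ P\Bigl(\bigcap_{k=n}^{N} \bar{A}_k\Bigr) = \prod_{k=n}^{N} P(\bar{A}_k), \]
and it is then easy to deduce that
\[ P\Bigl(\bigcap_{k=n}^{\infty} \bar{A}_k\Bigr) = \prod_{k=n}^{\infty} P(\bar{A}_k). \tag{9} \]
Since $\log(1 - x) \le -x$, $0 \le x < 1$,
\[ \log \prod_{k=n}^{\infty} [1 - P(A_k)] = \sum_{k=n}^{\infty} \log[1 - P(A_k)] \le -\sum_{k=n}^{\infty} P(A_k) = -\infty. \]
Consequently
\[ P\Bigl(\bigcap_{k=n}^{\infty} \bar{A}_k\Bigr) = 0 \]
for all $n$, and therefore $P(A_n \text{ i.o.}) = 1$.
This completes the proof of the lemma.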
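As a small numerical illustration of part (b) of the lemma (a sketch only; the two probability sequences and the helper name `prob_none_occur` are assumptions chosen for this example, not part of the text), one can evaluate the finite products $\prod_{k=n}^{N} [1 - P(A_k)]$ exactly. For $P(A_k) = 1/k$ the product telescopes to $(n-1)/N$ and tends to zero, whereas for $P(A_k) = 1/k^2$ it equals $\frac{n-1}{n} \cdot \frac{N+1}{N}$ and stays bounded away from zero.

```python
from fractions import Fraction


def prob_none_occur(p, n, N):
    """P(no A_k occurs for n <= k <= N) for independent events with P(A_k) = p(k)."""
    result = Fraction(1)
    for k in range(n, N + 1):
        result *= 1 - p(k)
    return result


def harmonic(k):
    # sum of P(A_k) diverges: case (b) of the lemma applies
    return Fraction(1, k)


def inverse_square(k):
    # sum of P(A_k) converges: case (a) of the lemma applies
    return Fraction(1, k * k)


n = 2
for N in (10, 100, 1000):
    divergent = prob_none_occur(harmonic, n, N)
    convergent = prob_none_occur(inverse_square, n, N)
    print(N, float(divergent), float(convergent))
```

In the divergent case the printed values shrink like $1/N$, in line with $P\bigl(\bigcap_{k \ge n} \bar{A}_k\bigr) = 0$; in the convergent case they approach $(n-1)/n$.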


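Corollary 1. If $A_n^{\varepsilon} = \{\omega : |\xi_n - \xi| \ge \varepsilon\}$ then (8) shows that $\sum_{n=1}^{\infty} P(A_n^{\varepsilon}) < \infty$, $\varepsilon > 0$, and then by the Borel-Cantelli lemma we have $P(A^{\varepsilon}) = 0$, $\varepsilon > 0$, where $A^{\varepsilon} = \overline{\lim}\, A_n^{\varepsilon}$. Therefore
\[ \sum P\{|\xi_k - \xi| \ge \varepsilon\} < \infty,\ \varepsilon > 0 \ \Rightarrow\ P(A^{\varepsilon}) = 0,\ \varepsilon > 0 \ \Rightarrow\ P\{\omega : \xi_n \nrightarrow \xi\} = 0, \]
as we already observed above.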

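Corollary 2. Let $(\varepsilon_n)_{n \ge 1}$ be a sequence of positive numbers such that $\varepsilon_n \downarrow 0$, $n \to \infty$. If
\[ \sum_{n=1}^{\infty} P\{|\xi_n - \xi| \ge \varepsilon_n\} < \infty, \tag{10} \]
then $\xi_n \xrightarrow{\text{a.s.}} \xi$.

In fact, let $A_n = \{|\xi_n - \xi| \ge \varepsilon_n\}$. Then $P(A_n \text{ i.o.}) = 0$ by the Borel-Cantelli lemma. This means that, for almost every $\omega \in \Omega$, there is an $N = N(\omega)$ such that $|\xi_n(\omega) - \xi(\omega)| \le \varepsilon_n$ for $n \ge N(\omega)$. But $\varepsilon_n \downarrow 0$, and therefore $\xi_n(\omega) \to \xi(\omega)$ for almost every $\omega \in \Omega$.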

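4. Theorem 2. We have the following implications:
\[ \xi_n \xrightarrow{\text{a.s.}} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \tag{11} \]
\[ \xi_n \xrightarrow{L^p} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \qquad p > 0, \tag{12} \]
\[ \xi_n \xrightarrow{P} \xi \ \Rightarrow\ \xi_n \xrightarrow{d} \xi. \tag{13} \]

PROOF. Statement (11) follows from comparing the definition of convergence in probability with (5), and (12) follows from Chebyshev's inequality.
To prove (13), let $f(x)$ be a continuous function, let $|f(x)| \le c$, let $\varepsilon > 0$, and let $N$ be such that $P(|\xi| > N) \le \varepsilon/4c$. Take $\delta$ so that $|f(x) - f(y)| \le \varepsilon/2c$ for $|x| < N$ and $|x - y| \le \delta$. Then (cf. the proof of Weierstrass's theorem in Subsection 5, §5, Chapter I)
\begin{align*}
E|f(\xi_n) - f(\xi)| &= E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| \le N) \\
 &\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| > N) \\
 &\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| > \delta) \\
 &\le \varepsilon/2 + \varepsilon/2 + 2c\,P\{|\xi_n - \xi| > \delta\} \\
 &= \varepsilon + 2c\,P\{|\xi_n - \xi| > \delta\}.
\end{align*}
But $P\{|\xi_n - \xi| > \delta\} \to 0$, and hence $E|f(\xi_n) - f(\xi)| \le 2\varepsilon$ for sufficiently large $n$; since $\varepsilon > 0$ is arbitrary, this establishes (13).
This completes the proof of the theorem.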
We now present a number of examples which show, in particular, that the
converses of (11) and (12) are false in general.

EXAMPLE 1 ($\xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$; $\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure. Put
\[ A_n^i = \Bigl[\frac{i-1}{n}, \frac{i}{n}\Bigr], \qquad \xi_n^i = I_{A_n^i}(\omega), \qquad i = 1, 2, \ldots, n;\ n \ge 1. \]
Then the sequence
\[ \{\xi_1^1;\ \xi_2^1, \xi_2^2;\ \xi_3^1, \xi_3^2, \xi_3^3;\ \ldots\} \]
of random variables converges both in probability and in mean of order $p > 0$, but does not converge at any point $\omega \in [0, 1]$.
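The behaviour in Example 1 can be checked directly at a fixed point. The following sketch (the sample point $\omega = 1/3$, the cut-off of 30 blocks, and the function names are assumptions made for illustration) lists the values of the "sliding indicator" sequence at one $\omega$: every block $n$ contains at least one 1 and, for $n \ge 2$, at least one 0, so the numerical sequence keeps oscillating, while $E|\xi_n^i|^p = 1/n \to 0$.

```python
from fractions import Fraction


def indicator(n, i, omega):
    """xi_n^i(omega): indicator of the closed interval [(i-1)/n, i/n]."""
    return 1 if Fraction(i - 1, n) <= omega <= Fraction(i, n) else 0


def path(omega, max_n):
    """Values of xi_1^1; xi_2^1, xi_2^2; ... at a fixed omega, for blocks 1..max_n."""
    return [indicator(n, i, omega)
            for n in range(1, max_n + 1)
            for i in range(1, n + 1)]


omega = Fraction(1, 3)          # hypothetical sample point
values = path(omega, 30)

# Number of ones inside each block n = 1, ..., 30.
ones_per_block = [sum(indicator(n, i, omega) for i in range(1, n + 1))
                  for n in range(1, 31)]
print(ones_per_block)
```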
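EXAMPLE 2 ($\xi_n \xrightarrow{\text{a.s.}} \xi \Rightarrow \xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{L^p} \xi$, $p > 0$). Again let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure, and let
\[ \xi_n(\omega) = \begin{cases} e^n, & 0 \le \omega \le 1/n, \\ 0, & \omega > 1/n. \end{cases} \]
Then $\{\xi_n\}$ converges with probability 1 (and therefore in probability) to zero, but
\[ E|\xi_n|^p = \frac{e^{np}}{n} \to \infty, \qquad n \to \infty, \]
for every $p > 0$.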


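EXAMPLE 3 ($\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\{\xi_n\}$ be a sequence of independent random variables with
\[ P(\xi_n = 1) = p_n, \qquad P(\xi_n = 0) = 1 - p_n. \]
Then it is easy to show that
\[ \xi_n \xrightarrow{P} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{14} \]
\[ \xi_n \xrightarrow{L^p} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{15} \]
\[ \xi_n \xrightarrow{\text{a.s.}} 0 \ \Rightarrow\ \sum_{n=1}^{\infty} p_n < \infty. \tag{16} \]
In particular, if $p_n = 1/n$ then $\xi_n \xrightarrow{L^p} 0$ for every $p > 0$, but $\xi_n$ does not converge to 0 almost surely.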

The following theorem singles out an interesting case when almost sure convergence implies convergence in $L^1$.

Theorem 3. Let $(\xi_n)$ be a sequence of nonnegative random variables such that $\xi_n \xrightarrow{\text{a.s.}} \xi$ and $E\xi_n \to E\xi < \infty$. Then
\[ E|\xi_n - \xi| \to 0, \qquad n \to \infty. \tag{17} \]

PROOF. We have $E\xi_n < \infty$ for sufficiently large $n$, and therefore for such $n$
\begin{align*}
E|\xi - \xi_n| &= E(\xi - \xi_n) I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi) I_{\{\xi_n > \xi\}} \\
 &= 2E(\xi - \xi_n) I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi).
\end{align*}
But $0 \le (\xi - \xi_n) I_{\{\xi \ge \xi_n\}} \le \xi$. Therefore, by the dominated convergence theorem, $\lim_n E(\xi - \xi_n) I_{\{\xi \ge \xi_n\}} = 0$, which together with $E\xi_n \to E\xi$ proves (17).

Remark. The dominated convergence theorem also holds when almost sure convergence is replaced by convergence in probability (see Problem 1). Hence in Theorem 3 we may replace "$\xi_n \xrightarrow{\text{a.s.}} \xi$" by "$\xi_n \xrightarrow{P} \xi$."

5. It is shown in analysis that every fundamental sequence $(x_n)$, $x_n \in R$, is convergent (Cauchy criterion). Let us give a similar result for the convergence of a sequence of random variables.

Theorem 4 (Cauchy Criterion for Almost Sure Convergence). A necessary and sufficient condition for the sequence $(\xi_n)_{n \ge 1}$ of random variables to converge with probability 1 (to a random variable $\xi$) is that it is fundamental with probability 1.

PROOF. If $\xi_n \xrightarrow{\text{a.s.}} \xi$ then
\[ \sup_{k \ge n,\ l \ge n} |\xi_k - \xi_l| \le \sup_{k \ge n} |\xi_k - \xi| + \sup_{l \ge n} |\xi_l - \xi|, \]
whence the necessity follows.
Now let $(\xi_n)_{n \ge 1}$ be fundamental with probability 1. Let $\mathscr{N} = \{\omega : (\xi_n(\omega)) \text{ is not fundamental}\}$. Then whenever $\omega \in \Omega \setminus \mathscr{N}$ the sequence of numbers $(\xi_n(\omega))_{n \ge 1}$ is fundamental and, by Cauchy's criterion for sequences of numbers, $\lim \xi_n(\omega)$ exists. Let
\[ \xi(\omega) = \begin{cases} \lim \xi_n(\omega), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}. \end{cases} \tag{18} \]
The function so defined is a random variable, and evidently $\xi_n \xrightarrow{\text{a.s.}} \xi$.
This completes the proof.
Before considering the case of convergence in probability, let us establish
the following useful result.

Theorem 5. If the sequence $(\xi_n)$ is fundamental (or convergent) in probability, it contains a subsequence $(\xi_{n_k})$ that is fundamental (or convergent) with probability 1.

PROOF. Let $(\xi_n)$ be fundamental in probability. By Theorem 4, it is enough to show that it contains a subsequence that converges almost surely.
Take $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which
\[ P\{|\xi_t - \xi_s| > 2^{-k}\} < 2^{-k} \]
for all $s \ge n$, $t \ge n$. Then
\[ \sum_{k} P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k}\} < \sum_{k} 2^{-k} < \infty \]
and by the Borel-Cantelli lemma
\[ P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k} \text{ i.o.}\} = 0. \]
Hence
\[ \sum_{k=1}^{\infty} |\xi_{n_{k+1}} - \xi_{n_k}| < \infty \]
with probability 1.
Let $\mathscr{N} = \{\omega : \sum |\xi_{n_{k+1}} - \xi_{n_k}| = \infty\}$. Then if we put
\[ \xi(\omega) = \begin{cases} \xi_{n_1}(\omega) + \sum_{k=1}^{\infty} \bigl(\xi_{n_{k+1}}(\omega) - \xi_{n_k}(\omega)\bigr), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}, \end{cases} \]
we obtain $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$.
If the original sequence converges in probability, then it is fundamental in probability (see also (19)), and consequently this case reduces to the one already considered.
This completes the proof of the theorem.

Theorem 6 (Cauchy Criterion for Convergence in Probability). A necessary and sufficient condition for a sequence $(\xi_n)_{n \ge 1}$ of random variables to converge in probability is that it is fundamental in probability.

PROOF. If $\xi_n \xrightarrow{P} \xi$ then
\[ P\{|\xi_n - \xi_m| \ge \varepsilon\} \le P\{|\xi_n - \xi| \ge \varepsilon/2\} + P\{|\xi_m - \xi| \ge \varepsilon/2\} \tag{19} \]
and consequently $(\xi_n)$ is fundamental in probability.
Conversely, if $(\xi_n)$ is fundamental in probability, by Theorem 5 there are a subsequence $(\xi_{n_k})$ and a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$. But then
\[ P\{|\xi_n - \xi| \ge \varepsilon\} \le P\{|\xi_n - \xi_{n_k}| \ge \varepsilon/2\} + P\{|\xi_{n_k} - \xi| \ge \varepsilon/2\}, \]
from which it is clear that $\xi_n \xrightarrow{P} \xi$. This completes the proof.

Before discussing convergence in mean of order $p$, we make some observations about $L^p$ spaces.

We denote by $L^p = L^p(\Omega, \mathscr{F}, P)$ the space of random variables $\xi = \xi(\omega)$ with $E|\xi|^p \equiv \int_{\Omega} |\xi|^p \, dP < \infty$. Suppose that $p \ge 1$ and put
\[ \|\xi\|_p = (E|\xi|^p)^{1/p}. \]
It is clear that
\[ \|\xi\|_p \ge 0, \tag{20} \]
\[ \|c\xi\|_p = |c| \, \|\xi\|_p, \qquad c \text{ constant}, \tag{21} \]
and by Minkowski's inequality (6.31)
\[ \|\xi + \eta\|_p \le \|\xi\|_p + \|\eta\|_p. \tag{22} \]
Hence, in accordance with the usual terminology of functional analysis, the function $\|\cdot\|_p$, defined on $L^p$ and satisfying (20)-(22), is (for $p \ge 1$) a semi-norm.
For it to be a norm, it must also satisfy
\[ \|\xi\|_p = 0 \ \Rightarrow\ \xi = 0. \tag{23} \]
This property is, of course, not satisfied, since according to Property H (§6) we can only say that $\xi = 0$ almost surely. However, if $L^p$ means the space whose elements are not random variables $\xi$ with $E|\xi|^p < \infty$, but equivalence classes of random variables ($\xi$ is equivalent to $\eta$ if $\xi = \eta$ almost surely), then $\|\cdot\|_p$ becomes a norm, so that $L^p$ is a normed linear space. If we select from each equivalence class of random variables a single element, taking the identically zero function as the representative of the class equivalent to it, we obtain a space (also denoted by $L^p$) which is actually a normed linear space of functions (rather than of equivalence classes).
It is a basic result of functional analysis that the spaces $L^p$, $p \ge 1$, are complete, i.e. that every fundamental sequence has a limit. Let us state and prove this in probabilistic language.

Theorem 7 (Cauchy Test for Convergence in Mean pth Power). A necessary and sufficient condition that a sequence $(\xi_n)_{n \ge 1}$ of random variables in $L^p$ converges in mean of order $p$ to a random variable in $L^p$ is that the sequence is fundamental in mean of order $p$.

PROOF. The necessity follows from Minkowski's inequality. Let $(\xi_n)$ be fundamental ($\|\xi_n - \xi_m\|_p \to 0$, $n, m \to \infty$). As in the proof of Theorem 5, we select a subsequence $(\xi_{n_k})$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$, where $\xi$ is a random variable with $\|\xi\|_p < \infty$.
Let $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which
\[ \|\xi_t - \xi_s\|_p < 2^{-2k} \]
for all $s \ge n$, $t \ge n$. Let
\[ A_k = \{\omega : |\xi_{n_{k+1}} - \xi_{n_k}| \ge 2^{-k}\}. \]
Then by Chebyshev's inequality
\[ P(A_k) \le \frac{E|\xi_{n_{k+1}} - \xi_{n_k}|^p}{2^{-kp}} \le \frac{2^{-2kp}}{2^{-kp}} = 2^{-kp} \le 2^{-k}. \]
As in Theorem 5, we deduce that there is a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$.
We now deduce that $\|\xi_n - \xi\|_p \to 0$ as $n \to \infty$. To do this, we fix $\varepsilon > 0$ and choose $N = N(\varepsilon)$ so that $\|\xi_n - \xi_m\|_p^p < \varepsilon$ for all $n \ge N$, $m \ge N$. Then for any fixed $n \ge N$, by Fatou's lemma,
\begin{align*}
E|\xi_n - \xi|^p &= E\Bigl\{\lim_{n_k \to \infty} |\xi_n - \xi_{n_k}|^p\Bigr\} = E\Bigl\{\varliminf_{n_k \to \infty} |\xi_n - \xi_{n_k}|^p\Bigr\} \\
 &\le \varliminf_{n_k \to \infty} E|\xi_n - \xi_{n_k}|^p = \varliminf_{n_k \to \infty} \|\xi_n - \xi_{n_k}\|_p^p \le \varepsilon.
\end{align*}
Consequently $E|\xi_n - \xi|^p \to 0$, $n \to \infty$. It is also clear that since $\xi = (\xi - \xi_n) + \xi_n$ we have $E|\xi|^p < \infty$ by Minkowski's inequality.
This completes the proof of the theorem.

Remark 1. In the terminology of functional analysis a complete normed linear space is called a Banach space. Thus $L^p$, $p \ge 1$, is a Banach space.

Remark 2. If $0 < p < 1$, the function $\|\xi\|_p = (E|\xi|^p)^{1/p}$ does not satisfy the triangle inequality (22) and consequently is not a norm. Nevertheless the space (of equivalence classes) $L^p$, $0 < p < 1$, is complete in the metric $d(\xi, \eta) \equiv E|\xi - \eta|^p$.

Remark 3. Let $L^{\infty} = L^{\infty}(\Omega, \mathscr{F}, P)$ be the space (of equivalence classes of) random variables $\xi = \xi(\omega)$ for which $\|\xi\|_{\infty} < \infty$, where $\|\xi\|_{\infty}$, the essential supremum of $\xi$, is defined by
\[ \|\xi\|_{\infty} \equiv \operatorname{ess\,sup} |\xi| \equiv \inf\{0 \le c \le \infty : P(|\xi| > c) = 0\}. \]
The function $\|\cdot\|_{\infty}$ is a norm, and $L^{\infty}$ is complete in this norm.

6. PROBLEMS

1. Use Theorem 5 to show that almost sure convergence can be replaced by convergence in probability in Theorems 3 and 4 of §6.
2. Prove that $L^{\infty}$ is complete.
3. Show that if $\xi_n \xrightarrow{P} \xi$ and also $\xi_n \xrightarrow{P} \eta$ then $\xi$ and $\eta$ are equivalent ($P(\xi \ne \eta) = 0$).
4. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$, and let $\xi$ and $\eta$ be equivalent. Show that
\[ P\{|\xi_n - \eta_n| \ge \varepsilon\} \to 0, \qquad n \to \infty, \]
for every $\varepsilon > 0$.
5. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$. Show that $a\xi_n + b\eta_n \xrightarrow{P} a\xi + b\eta$ ($a$, $b$ constants), $|\xi_n| \xrightarrow{P} |\xi|$, $\xi_n \eta_n \xrightarrow{P} \xi\eta$.
6. Let $(\xi_n - \xi)^2 \to 0$. Show that $\xi_n^2 \to \xi^2$.
7. Show that if $\xi_n \xrightarrow{d} C$, where $C$ is a constant, then this sequence converges in probability:
\[ \xi_n \xrightarrow{d} C \ \Rightarrow\ \xi_n \xrightarrow{P} C. \]
8. Let $(\xi_n)_{n \ge 1}$ have the property that $\sum_{n=1}^{\infty} E|\xi_n|^p < \infty$ for some $p > 0$. Show that $\xi_n \to 0$ (P-a.s.).
9. Let $(\xi_n)_{n \ge 1}$ be a sequence of independent identically distributed random variables. Show that
\[ E|\xi_1| < \infty \ \Leftrightarrow\ \sum_{n=1}^{\infty} P\{|\xi_1| > \varepsilon n\} < \infty. \]
10. Let $(\xi_n)_{n \ge 1}$ be a sequence of random variables. Suppose that there are a random variable $\xi$ and a sequence $\{n_k\}$ such that $\xi_{n_k} \to \xi$ (P-a.s.) and $\max_{n_{k-1} < l \le n_k} |\xi_l - \xi_{n_{k-1}}| \to 0$ (P-a.s.) as $k \to \infty$. Show that then $\xi_n \to \xi$ (P-a.s.).
11. Let the $d$-metric on the set of random variables be defined by
\[ d(\xi, \eta) = E\,\frac{|\xi - \eta|}{1 + |\xi - \eta|} \]
and identify random variables that coincide almost surely. Show that convergence in probability is equivalent to convergence in the $d$-metric.
12. Show that there is no metric on the set of random variables such that convergence
in that metric is equivalent to almost sure convergence.

11. The Hilbert Space of Random Variables with Finite Second Moment

1. An important role among the Banach spaces $L^p$, $p \ge 1$, is played by the space $L^2 = L^2(\Omega, \mathscr{F}, P)$, the space of (equivalence classes of) random variables with finite second moments.
If $\xi$ and $\eta \in L^2$, we put
\[ (\xi, \eta) \equiv E\xi\eta. \tag{1} \]
It is clear that if $\xi, \eta, \zeta \in L^2$ then
\[ (a\xi + b\eta, \zeta) = a(\xi, \zeta) + b(\eta, \zeta), \qquad a, b \in R, \]
\[ (\xi, \xi) \ge 0 \]
and
\[ (\xi, \xi) = 0 \ \Leftrightarrow\ \xi = 0. \]
Consequently $(\xi, \eta)$ is a scalar product. The space $L^2$ is complete with respect to the norm
\[ \|\xi\| = (\xi, \xi)^{1/2} \tag{2} \]
induced by this scalar product (as was shown in §10). In accordance with the terminology of functional analysis, a space with the scalar product (1) is a Hilbert space.
Hilbert space methods are extensively used in probability theory to study properties that depend only on the first two moments of random variables ("$L^2$-theory"). Here we shall introduce the basic concepts and facts that will be needed for an exposition of $L^2$-theory (Chapter VI).

2. Two random variables $\xi$ and $\eta$ in $L^2$ are said to be orthogonal ($\xi \perp \eta$) if $(\xi, \eta) \equiv E\xi\eta = 0$. According to §8, $\xi$ and $\eta$ are uncorrelated if $\operatorname{cov}(\xi, \eta) = 0$, i.e. if
\[ E\xi\eta = E\xi \, E\eta. \]
It follows that the properties of being orthogonal and of being uncorrelated coincide for random variables with zero mean values.
A set $M \subseteq L^2$ is a system of orthogonal random variables if $\xi \perp \eta$ for every $\xi, \eta \in M$ ($\xi \ne \eta$).
If also $\|\xi\| = 1$ for every $\xi \in M$, then $M$ is an orthonormal system.

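3. Let $M = \{\eta_1, \ldots, \eta_n\}$ be an orthonormal system and $\xi$ any random variable in $L^2$. Let us find, in the class of linear estimators $\sum_{i=1}^{n} a_i \eta_i$, the best mean-square estimator for $\xi$ (cf. Subsection 2, §8).
A simple computation shows that
\begin{align*}
E\Bigl|\xi - \sum_{i=1}^{n} a_i \eta_i\Bigr|^2 &\equiv \Bigl\|\xi - \sum_{i=1}^{n} a_i \eta_i\Bigr\|^2 = \Bigl(\xi - \sum_{i=1}^{n} a_i \eta_i,\ \xi - \sum_{i=1}^{n} a_i \eta_i\Bigr) \\
 &= \|\xi\|^2 - 2\sum_{i=1}^{n} a_i (\xi, \eta_i) + \Bigl(\sum_{i=1}^{n} a_i \eta_i,\ \sum_{i=1}^{n} a_i \eta_i\Bigr) \\
 &= \|\xi\|^2 - 2\sum_{i=1}^{n} a_i (\xi, \eta_i) + \sum_{i=1}^{n} a_i^2 \\
 &= \|\xi\|^2 - \sum_{i=1}^{n} |(\xi, \eta_i)|^2 + \sum_{i=1}^{n} |a_i - (\xi, \eta_i)|^2 \\
 &\ge \|\xi\|^2 - \sum_{i=1}^{n} |(\xi, \eta_i)|^2, \tag{3}
\end{align*}
where we used the equation
\[ a_i^2 - 2a_i(\xi, \eta_i) = |a_i - (\xi, \eta_i)|^2 - |(\xi, \eta_i)|^2. \]
It is now clear that the infimum of $E|\xi - \sum_{i=1}^{n} a_i \eta_i|^2$ over all real $a_1, \ldots, a_n$ is attained for $a_i = (\xi, \eta_i)$, $i = 1, \ldots, n$.
Consequently the best (in the mean-square sense) estimator for $\xi$ in terms of $\eta_1, \ldots, \eta_n$ is
\[ \hat{\xi} = \sum_{i=1}^{n} (\xi, \eta_i) \eta_i. \tag{4} \]
Here
\[ \Delta \equiv \inf E\Bigl|\xi - \sum_{i=1}^{n} a_i \eta_i\Bigr|^2 = E|\xi - \hat{\xi}|^2 = \|\xi\|^2 - \sum_{i=1}^{n} |(\xi, \eta_i)|^2 \tag{5} \]
(compare (1.4.17) and (8.13)).
Inequality (3) also implies Bessel's inequality: if $M = \{\eta_1, \eta_2, \ldots\}$ is an orthonormal system and $\xi \in L^2$, then
\[ \sum_{i=1}^{\infty} |(\xi, \eta_i)|^2 \le \|\xi\|^2; \tag{6} \]
and equality is attained if and only if
\[ \xi = \operatorname*{l.i.m.}_{n} \sum_{i=1}^{n} (\xi, \eta_i) \eta_i. \tag{7} \]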

The best linear estimator of $\xi$ is often denoted by $\hat{E}(\xi \mid \eta_1, \ldots, \eta_n)$ and called the conditional expectation (of $\xi$ with respect to $\eta_1, \ldots, \eta_n$) in the wide sense.
The reason for the terminology is as follows. If we consider all estimators $\varphi = \varphi(\eta_1, \ldots, \eta_n)$ of $\xi$ in terms of $\eta_1, \ldots, \eta_n$ (where $\varphi$ is a Borel function), the best estimator will be $\varphi^* = E(\xi \mid \eta_1, \ldots, \eta_n)$, i.e. the conditional expectation of $\xi$ with respect to $\eta_1, \ldots, \eta_n$ (cf. Theorem 1, §8). Hence the best linear estimator is, by analogy, denoted by $\hat{E}(\xi \mid \eta_1, \ldots, \eta_n)$ and called the conditional expectation in the wide sense. We note that if $\eta_1, \ldots, \eta_n$ form a Gaussian system (see §13 below), then $\hat{E}(\xi \mid \eta_1, \ldots, \eta_n)$ and $E(\xi \mid \eta_1, \ldots, \eta_n)$ are the same.
Let us discuss the geometric meaning of $\hat{\xi} = \hat{E}(\xi \mid \eta_1, \ldots, \eta_n)$.
Let $\mathscr{L} = \mathscr{L}\{\eta_1, \ldots, \eta_n\}$ denote the linear manifold spanned by the orthonormal system of random variables $\eta_1, \ldots, \eta_n$ (i.e., the set of random variables of the form $\sum_{i=1}^{n} a_i \eta_i$, $a_i \in R$).
Then it follows from the preceding discussion that $\xi$ admits the "orthogonal decomposition"
\[ \xi = \hat{\xi} + (\xi - \hat{\xi}), \tag{8} \]
where $\hat{\xi} \in \mathscr{L}$ and $\xi - \hat{\xi} \perp \mathscr{L}$ in the sense that $\xi - \hat{\xi} \perp \lambda$ for every $\lambda \in \mathscr{L}$. It is natural to call $\hat{\xi}$ the projection of $\xi$ on $\mathscr{L}$ (the element of $\mathscr{L}$ "closest" to $\xi$), and to say that $\xi - \hat{\xi}$ is perpendicular to $\mathscr{L}$.
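The projection formula and the error identity can be checked on a toy example. The sketch below assumes a four-point probability space with uniform measure, two hand-picked orthonormal variables, and an arbitrary $\xi$ (all of these, and the helper names `inner` and `project`, are illustrative assumptions). It computes the coefficients $(\xi, \eta_i)$, the projection, and the squared error, using exact rational arithmetic.

```python
from fractions import Fraction

# Uniform probability on a four-point sample space (an assumption for illustration).
PROB = [Fraction(1, 4)] * 4


def inner(x, y):
    """Scalar product (x, y) = E[x * y] on the finite probability space."""
    return sum(p * a * b for p, a, b in zip(PROB, x, y))


def project(xi, system):
    """Coefficients (xi, eta_i) and the projection of xi on span(system)."""
    coeffs = [inner(xi, eta) for eta in system]
    xi_hat = [sum(c * eta[w] for c, eta in zip(coeffs, system))
              for w in range(len(xi))]
    return coeffs, xi_hat


eta1 = [1, 1, -1, -1]   # E[eta1^2] = 1
eta2 = [1, -1, 1, -1]   # E[eta2^2] = 1 and E[eta1 * eta2] = 0
xi = [3, 1, 4, 1]

coeffs, xi_hat = project(xi, [eta1, eta2])
residual = [a - b for a, b in zip(xi, xi_hat)]
error = inner(residual, residual)            # E|xi - xi_hat|^2

print(coeffs, xi_hat, error)
```

The residual is orthogonal to both $\eta_1$ and $\eta_2$, and the error equals $\|\xi\|^2 - \sum_i |(\xi, \eta_i)|^2$.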
4. The concept of orthonormality of the random variables $\eta_1, \ldots, \eta_n$ makes it easy to find the best linear estimator (the projection) $\hat{\xi}$ of $\xi$ in terms of $\eta_1, \ldots, \eta_n$. The situation becomes complicated if we give up the hypothesis of orthonormality. However, the case of arbitrary $\eta_1, \ldots, \eta_n$ can in a certain sense be reduced to the case of orthonormal random variables, as will be shown below. We shall suppose for the sake of simplicity that all our random variables have zero mean values.
We shall say that the random variables $\eta_1, \ldots, \eta_n$ are linearly independent if the equation
\[ \sum_{i=1}^{n} a_i \eta_i = 0 \quad (P\text{-a.s.}) \]
is satisfied only when all $a_i$ are zero.
Consider the covariance matrix
\[ \mathbb{R} = E\eta\eta^{T} \]
of the vector $\eta = (\eta_1, \ldots, \eta_n)$. It is symmetric and nonnegative definite, and as noticed in §8, can be diagonalized by an orthogonal matrix $\mathcal{O}$:
\[ \mathcal{O}^{T} \mathbb{R}\, \mathcal{O} = D, \]
where
\[ D = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix} \]
has nonnegative elements $d_i$, the eigenvalues of $\mathbb{R}$, i.e. the zeros $\lambda$ of the characteristic equation $\det(\mathbb{R} - \lambda E) = 0$ (here $E$ is the unit matrix).
If $\eta_1, \ldots, \eta_n$ are linearly independent, the Gram determinant ($\det \mathbb{R}$) is not zero and therefore $d_i > 0$. Let
\[ B = \begin{pmatrix} \sqrt{d_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{d_n} \end{pmatrix} \]
and
\[ \beta = B^{-1} \mathcal{O}^{T} \eta. \tag{9} \]
Then the covariance matrix of $\beta$ is
\[ E\beta\beta^{T} = B^{-1} \mathcal{O}^{T} E\eta\eta^{T} \mathcal{O} B^{-1} = B^{-1} \mathcal{O}^{T} \mathbb{R}\, \mathcal{O} B^{-1} = E, \]
and therefore $\beta = (\beta_1, \ldots, \beta_n)$ consists of uncorrelated random variables.
It is also clear that
\[ \eta = (\mathcal{O} B) \beta. \tag{10} \]
Consequently if $\eta_1, \ldots, \eta_n$ are linearly independent there is an orthonormal system such that (9) and (10) hold. Here
\[ \mathscr{L}\{\eta_1, \ldots, \eta_n\} = \mathscr{L}\{\beta_1, \ldots, \beta_n\}. \]
This method of constructing an orthonormal system $\beta_1, \ldots, \beta_n$ is frequently inconvenient. The reason is that if we think of $\eta_i$ as the value of the random sequence $(\eta_1, \ldots, \eta_n)$ at the instant $i$, the value $\beta_i$ constructed above depends not only on the "past," $(\eta_1, \ldots, \eta_i)$, but also on the "future," $(\eta_{i+1}, \ldots, \eta_n)$. The Gram-Schmidt orthogonalization process, described below, does not have this defect, and moreover has the advantage that it can be applied to an infinite sequence of linearly independent random variables (i.e. to a sequence in which every finite set of the variables are linearly independent).
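Let $\eta_1, \eta_2, \ldots$ be a sequence of linearly independent random variables in $L^2$. We construct a sequence $\varepsilon_1, \varepsilon_2, \ldots$ as follows. Let $\varepsilon_1 = \eta_1 / \|\eta_1\|$. If $\varepsilon_1, \ldots, \varepsilon_{n-1}$ have been selected so that they are orthonormal, then
\[ \varepsilon_n = \frac{\eta_n - \hat{\eta}_n}{\|\eta_n - \hat{\eta}_n\|}, \tag{11} \]
where $\hat{\eta}_n$ is the projection of $\eta_n$ on the linear manifold $\mathscr{L}(\varepsilon_1, \ldots, \varepsilon_{n-1})$ generated by $\varepsilon_1, \ldots, \varepsilon_{n-1}$:
\[ \hat{\eta}_n = \sum_{k=1}^{n-1} (\eta_n, \varepsilon_k) \varepsilon_k. \tag{12} \]
Since $\eta_1, \ldots, \eta_n$ are linearly independent and $\mathscr{L}\{\eta_1, \ldots, \eta_{n-1}\} = \mathscr{L}\{\varepsilon_1, \ldots, \varepsilon_{n-1}\}$, we have $\|\eta_n - \hat{\eta}_n\| > 0$ and consequently $\varepsilon_n$ is well defined.
By construction, $\|\varepsilon_n\| = 1$ for $n \ge 1$, and it is clear that $(\varepsilon_n, \varepsilon_k) = 0$ for $k < n$. Hence the sequence $\varepsilon_1, \varepsilon_2, \ldots$ is orthonormal. Moreover, by (11),
\[ \eta_n = \hat{\eta}_n + b_n \varepsilon_n, \]
where $b_n = \|\eta_n - \hat{\eta}_n\|$ and $\hat{\eta}_n$ is defined by (12).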
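The Gram-Schmidt recursion (11)-(12) can be sketched on a finite probability space. In the example below the five-point sample space, the uniform weights, and the input variables $1, x, x^2$ are assumptions chosen only to have something concrete to orthogonalize; `gram_schmidt` is a hypothetical helper name.

```python
import math


def inner(x, y, prob):
    """Scalar product (x, y) = E[x * y] with respect to the weights prob."""
    return sum(p * a * b for p, a, b in zip(prob, x, y))


def gram_schmidt(etas, prob):
    """Orthonormalize linearly independent random variables, following (11)-(12)."""
    eps = []
    for eta in etas:
        # Projection of eta on the span of the variables already constructed, as in (12).
        hat = [0.0] * len(eta)
        for e in eps:
            c = inner(eta, e, prob)
            hat = [h + c * v for h, v in zip(hat, e)]
        diff = [a - h for a, h in zip(eta, hat)]
        b = math.sqrt(inner(diff, diff, prob))     # b_n = ||eta_n - hat(eta)_n||
        if b < 1e-12:
            raise ValueError("input variables are linearly dependent")
        eps.append([d / b for d in diff])          # formula (11)
    return eps


points = [-2, -1, 0, 1, 2]
prob = [0.2] * 5
etas = [[1.0] * 5, [float(x) for x in points], [float(x * x) for x in points]]
eps = gram_schmidt(etas, prob)

for j in range(3):
    print([round(inner(eps[j], eps[k], prob), 12) for k in range(3)])
```

Each new $\varepsilon_n$ uses only $\eta_1, \ldots, \eta_n$, which is the "no dependence on the future" property mentioned in the text.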
Now let $\eta_1, \ldots, \eta_n$ be any set of random variables (not necessarily linearly independent). Let $\det \mathbb{R} = 0$, where $\mathbb{R} \equiv \|r_{ij}\|$ is the covariance matrix of $(\eta_1, \ldots, \eta_n)$, and let
\[ \operatorname{rank} \mathbb{R} = r < n. \]
Then, from linear algebra, the quadratic form
\[ Q(a) = \sum_{i,j=1}^{n} r_{ij} a_i a_j, \qquad a = (a_1, \ldots, a_n), \]
has the property that there are $n - r$ linearly independent vectors $a^{(1)}, \ldots, a^{(n-r)}$ such that $Q(a^{(i)}) = 0$, $i = 1, \ldots, n - r$.
But
\[ Q(a) = E\Bigl(\sum_{k=1}^{n} a_k \eta_k\Bigr)^2. \]
Consequently
\[ \sum_{k=1}^{n} a_k^{(i)} \eta_k = 0, \qquad i = 1, \ldots, n - r, \]
with probability 1.
In other words, there are $n - r$ linear relations among the variables $\eta_1, \ldots, \eta_n$. Therefore if, for example, $\eta_1, \ldots, \eta_r$ are linearly independent, the other variables $\eta_{r+1}, \ldots, \eta_n$ can be expressed linearly in terms of them, and consequently $\mathscr{L}\{\eta_1, \ldots, \eta_n\} = \mathscr{L}\{\eta_1, \ldots, \eta_r\}$. Hence it is clear that we can find $r$ orthonormal random variables $\varepsilon_1, \ldots, \varepsilon_r$ such that $\eta_1, \ldots, \eta_n$ can be expressed linearly in terms of them and $\mathscr{L}\{\eta_1, \ldots, \eta_n\} = \mathscr{L}\{\varepsilon_1, \ldots, \varepsilon_r\}$.
5. Let $\eta_1, \eta_2, \ldots$ be a sequence of random variables in $L^2$. Let $\mathscr{L} = \mathscr{L}\{\eta_1, \eta_2, \ldots\}$ be the linear manifold spanned by $\eta_1, \eta_2, \ldots$, i.e. the set of random variables of the form $\sum_{i=1}^{n} a_i \eta_i$, $n \ge 1$, $a_i \in R$. Then $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ denotes the closed linear manifold spanned by $\eta_1, \eta_2, \ldots$, i.e. the set of random variables in $\mathscr{L}$ together with their mean-square limits.
We say that a set $\eta_1, \eta_2, \ldots$ is a countable orthonormal basis (or a complete orthonormal system) if:

(a) $\eta_1, \eta_2, \ldots$ is an orthonormal system,
(b) $\overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\} = L^2$.

A Hilbert space with a countable orthonormal basis is said to be separable.
By (b), for every $\xi \in L^2$ and a given $\varepsilon > 0$ there are numbers $a_1, \ldots, a_n$ such that
\[ \Bigl\|\xi - \sum_{i=1}^{n} a_i \eta_i\Bigr\| \le \varepsilon. \]
Then by (3)
\[ \Bigl\|\xi - \sum_{i=1}^{n} (\xi, \eta_i) \eta_i\Bigr\| \le \varepsilon. \]
Consequently every element of a separable Hilbert space $L^2$ can be represented as
\[ \xi = \sum_{i=1}^{\infty} (\xi, \eta_i) \cdot \eta_i, \tag{13} \]
or more precisely as
\[ \xi = \operatorname*{l.i.m.}_{n} \sum_{i=1}^{n} (\xi, \eta_i) \eta_i. \]
We infer from this and (3) that Parseval's equation holds:
\[ \|\xi\|^2 = \sum_{i=1}^{\infty} |(\xi, \eta_i)|^2, \qquad \xi \in L^2. \tag{14} \]
It is easy to show that the converse is also valid: if $\eta_1, \eta_2, \ldots$ is an orthonormal system and either (13) or (14) is satisfied, then the system is a basis.
We now give some examples of separable Hilbert spaces and their bases.

EXAMPLE 1. Let $\Omega = R$, $\mathscr{F} = \mathscr{B}(R)$, and let $P$ be the Gaussian measure,
\[ P(-\infty, a] = \int_{-\infty}^{a} \varphi(x) \, dx, \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \]
Let $D = d/dx$ and
\[ H_n(x) = \frac{(-1)^n D^n \varphi(x)}{\varphi(x)}, \qquad n \ge 0. \tag{15} \]
We find easily that
\begin{align*}
D\varphi(x) &= -x\varphi(x), \\
D^2\varphi(x) &= (x^2 - 1)\varphi(x), \tag{16} \\
D^3\varphi(x) &= (3x - x^3)\varphi(x), \\
 &\ \ \vdots
\end{align*}
It follows that $H_n(x)$ are polynomials (the Hermite polynomials). From (15) and (16) we find that
\begin{align*}
H_0(x) &= 1, \\
H_1(x) &= x, \\
H_2(x) &= x^2 - 1, \\
H_3(x) &= x^3 - 3x, \\
 &\ \ \vdots
\end{align*}
A simple calculation shows that
\[ (H_m, H_n) = \int_{-\infty}^{\infty} H_m(x) H_n(x) \, dP = \int_{-\infty}^{\infty} H_m(x) H_n(x) \varphi(x) \, dx = n! \, \delta_{mn}, \]
where $\delta_{mn}$ is the Kronecker delta (0 if $m \ne n$, and 1 if $m = n$). Hence if we put
\[ h_n(x) = \frac{H_n(x)}{\sqrt{n!}}, \]
the system of normalized Hermite polynomials $\{h_n(x)\}_{n \ge 0}$ will be an orthonormal system. We know from functional analysis that if
\[ \lim_{c \downarrow 0} \int_{-\infty}^{\infty} e^{c|x|} \, P(dx) < \infty, \tag{17} \]
the system $\{1, x, x^2, \ldots\}$ is complete in $L^2$, i.e. every function $\xi = \xi(x)$ in $L^2$ can be represented either as $\sum_{i=0}^{n} a_i \eta_i(x)$, where $\eta_i(x) = x^i$, or as a limit of these functions (in the mean-square sense). If we apply the Gram-Schmidt orthogonalization process to the sequence $\eta_0(x), \eta_1(x), \ldots$, with $\eta_i(x) = x^i$, the resulting orthonormal system will be precisely the system of normalized Hermite polynomials. In the present case, (17) is satisfied. Hence $\{h_n(x)\}_{n \ge 0}$ is a basis and therefore every random variable $\xi = \xi(x)$ on this probability space can be represented in the form
\[ \xi(x) = \operatorname*{l.i.m.}_{n} \sum_{i=0}^{n} (\xi, h_i) h_i(x). \tag{18} \]
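The orthogonality relation for the Hermite polynomials can be verified exactly for small indices. The sketch below relies on two standard facts that are not spelled out in the text and are therefore stated here as assumptions: differentiating (15) gives the recurrence $H_{n+1}(x) = xH_n(x) - H_n'(x)$, and the standard Gaussian moments are $E x^{2k} = (2k-1)!!$ and $E x^{2k+1} = 0$. Polynomials are stored as integer coefficient lists, lowest degree first.

```python
import math


def hermite(n):
    """Coefficients of H_n via H_{k+1}(x) = x * H_k(x) - H_k'(x), starting from H_0 = 1."""
    h = [1]
    for _ in range(n):
        shifted = [0] + h                                  # coefficients of x * H_k
        deriv = [k * h[k] for k in range(1, len(h))]       # coefficients of H_k'
        deriv += [0] * (len(shifted) - len(deriv))
        h = [s - d for s, d in zip(shifted, deriv)]
    return h


def gaussian_moment(k):
    """E[x^k] for the standard Gaussian measure: 0 for odd k, (k-1)!! for even k."""
    if k % 2:
        return 0
    m = 1
    for j in range(1, k, 2):
        m *= j
    return m


def inner(p, q):
    """(p, q) = integral of p(x) q(x) phi(x) dx, computed term by term from moments."""
    return sum(a * b * gaussian_moment(i + j)
               for i, a in enumerate(p)
               for j, b in enumerate(q))


for m in range(5):
    print([inner(hermite(m), hermite(n)) for n in range(5)])
```

The printed matrix is diagonal with entries $n!$, in agreement with $(H_m, H_n) = n!\,\delta_{mn}$.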

EXAMPLE 2. Let $\Omega = \{0, 1, 2, \ldots\}$ and let $P = \{P_0, P_1, \ldots\}$ be the Poisson distribution
\[ P_x = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x = 0, 1, \ldots;\ \lambda > 0. \]
Put $\Delta f(x) = f(x) - f(x - 1)$ ($f(x) = 0$, $x < 0$), and by analogy with (15) define the Poisson-Charlier polynomials
\[ \Pi_n(x) = \frac{(-1)^n \Delta^n P_x}{P_x}, \quad n \ge 1, \qquad \Pi_0 = 1. \tag{19} \]
Since
\[ (\Pi_m, \Pi_n) = \sum_{x=0}^{\infty} \Pi_m(x) \Pi_n(x) P_x = c_n \delta_{mn}, \]
where $c_n$ are positive constants, the system of normalized Poisson-Charlier polynomials $\{\pi_n(x)\}_{n \ge 0}$, $\pi_n(x) = \Pi_n(x)/\sqrt{c_n}$, is an orthonormal system, which is a basis since it satisfies (17).
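A numerical check of the orthogonality of the Poisson-Charlier polynomials can be sketched as follows. It assumes the definition $\Pi_n(x) = (-1)^n \Delta^n P_x / P_x$ with the backward difference $\Delta f(x) = f(x) - f(x-1)$, and uses the identity $P_{x-j}/P_x = x(x-1)\cdots(x-j+1)/\lambda^j$ so that no division by small probabilities is needed. The value $\lambda = 2$, the truncation of the series at 200 terms, and the function names are assumptions made for the illustration.

```python
import math


def charlier(n, x, lam):
    """Pi_n(x) = (-1)^n * Delta^n P_x / P_x, expanded with binomial coefficients."""
    total = 0.0
    falling = 1.0                      # x (x-1) ... (x-j+1); equals 1 for j = 0
    for j in range(n + 1):
        total += (-1) ** j * math.comb(n, j) * falling / lam ** j
        falling *= (x - j)             # becomes 0 once j exceeds x, matching P_{x-j} = 0
    return (-1) ** n * total


def inner(m, n, lam, terms=200):
    """Truncated sum of Pi_m(x) Pi_n(x) P_x over x = 0, ..., terms - 1."""
    total = 0.0
    weight = math.exp(-lam)            # P_0
    for x in range(terms):
        total += charlier(m, x, lam) * charlier(n, x, lam) * weight
        weight *= lam / (x + 1)        # P_{x+1} = P_x * lam / (x + 1)
    return total


lam = 2.0
for m in range(3):
    print([round(inner(m, n, lam), 10) for n in range(3)])
```

For $\lambda = 2$ the diagonal entries come out as $1, 1/2, 1/2$, so in this instance the constants $c_n$ are positive and the off-diagonal entries vanish up to rounding.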
EXAMPLE 3. In this example we describe the Rademacher and Haar systems, which are of interest in function theory as well as in probability theory.
Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, and let $P$ be Lebesgue measure. As we mentioned in §1, every $x \in [0, 1]$ has a unique binary expansion
\[ x = \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots, \]
where $x_i = 0$ or 1. To ensure uniqueness of the expansion, we agree to consider only expansions containing an infinite number of zeros. Thus we choose the first of the two expansions
\[ \frac{1}{2} = \frac{1}{2} + \frac{0}{2^2} + \frac{0}{2^3} + \cdots = \frac{0}{2} + \frac{1}{2^2} + \frac{1}{2^3} + \cdots. \]
We define random variables $\xi_1(x), \xi_2(x), \ldots$ by putting
\[ \xi_n(x) = x_n. \]

Figure 30

Then for any numbers $a_i$, equal to 0 or 1,
\begin{align*}
P\{x : \xi_1 = a_1, \ldots, \xi_n = a_n\}
 &= P\Bigl\{x : \frac{a_1}{2} + \frac{a_2}{2^2} + \cdots + \frac{a_n}{2^n} \le x < \frac{a_1}{2} + \frac{a_2}{2^2} + \cdots + \frac{a_n}{2^n} + \frac{1}{2^n}\Bigr\} \\
 &= P\Bigl\{x : x \in \Bigl[\frac{a_1}{2} + \cdots + \frac{a_n}{2^n},\ \frac{a_1}{2} + \cdots + \frac{a_n}{2^n} + \frac{1}{2^n}\Bigr)\Bigr\} = \frac{1}{2^n}.
\end{align*}
It follows immediately that $\xi_1, \xi_2, \ldots$ form a sequence of independent Bernoulli random variables (Figure 30 shows the construction of $\xi_1 = \xi_1(x)$ and $\xi_2 = \xi_2(x)$).
If we now set $R_n(x) = 1 - 2\xi_n(x)$, $n \ge 1$, it is easily verified that $\{R_n\}$ (the Rademacher functions, Figure 31) are orthonormal:
\[ E R_n R_m = \int_0^1 R_n(x) R_m(x) \, dx = \delta_{nm}. \]
Notice that $(1, R_n) \equiv E R_n = 0$. It follows that this system is not complete.
However, the Rademacher system can be used to construct the Haar system, which also has a simple structure and is both orthonormal and complete.

Figure 31. Rademacher functions.

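Again let $\Omega = [0, 1)$ and $\mathscr{F} = \mathscr{B}([0, 1))$. Put
\[ H_1(x) = 1, \qquad H_2(x) = R_1(x), \]
\[ H_n(x) = \begin{cases} 2^{j/2} R_{j+1}(x), & \text{if } \dfrac{k-1}{2^j} \le x < \dfrac{k}{2^j},\ n = 2^j + k,\ 1 \le k \le 2^j,\ j \ge 1, \\ 0, & \text{otherwise.} \end{cases} \]
It is easy to see that $H_n(x)$ can also be written in the form
\[ H_{2^m + 1}(x) = \begin{cases} 2^{m/2}, & 0 \le x < 2^{-(m+1)}, \\ -2^{m/2}, & 2^{-(m+1)} \le x < 2^{-m}, \\ 0, & \text{otherwise,} \end{cases} \qquad m = 1, 2, \ldots, \]
\[ H_{2^m + j}(x) = H_{2^m + 1}\Bigl(x - \frac{j-1}{2^m}\Bigr), \qquad j = 1, \ldots, 2^m. \]
Figure 32 shows graphs of the first eight functions, to give an idea of the structure of the Haar functions.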
Figure 32. The Haar functions $H_1(x), \ldots, H_8(x)$.

It is easy to see that the Haar system is orthonormal. Moreover, it is complete both in $L^1$ and in $L^2$, i.e. if $f = f(x) \in L^p$ for $p = 1$ or 2, then
\[ \int_0^1 \Bigl| f(x) - \sum_{k=1}^{n} (f, H_k) H_k(x) \Bigr|^p dx \to 0, \qquad n \to \infty. \]
The system also has the property that
\[ \sum_{k=1}^{n} (f, H_k) H_k(x) \to f(x), \qquad n \to \infty, \]
with probability 1 (with respect to Lebesgue measure).
In §4, Chapter VII, we shall prove these facts by deriving them from general theorems on the convergence of martingales. This will, in particular, provide a good illustration of the application of martingale methods to the theory of functions.
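The orthonormality of the first few Haar functions can be confirmed numerically. The sketch below assumes the definition $H_1 = 1$, $H_2 = R_1$, and $H_n(x) = 2^{j/2} R_{j+1}(x)$ on $[(k-1)/2^j, k/2^j)$ for $n = 2^j + k$, with $R_n(x) = 1 - 2x_n$ and $x_n$ the $n$-th binary digit of $x$. Since $H_1, \ldots, H_8$ are constant on dyadic cells of length $1/8$, averaging over the midpoints of 64 equal cells reproduces the integrals exactly up to floating-point rounding; the grid size and function names are illustrative choices.

```python
def rademacher(n, x):
    """R_n(x) = 1 - 2 * x_n, where x_n is the n-th binary digit of x in [0, 1)."""
    digit = int(x * 2 ** n) % 2
    return 1 - 2 * digit


def haar(n, x):
    """Haar function H_n(x) on [0, 1), n >= 1."""
    if n == 1:
        return 1.0
    if n == 2:
        return float(rademacher(1, x))
    j = (n - 1).bit_length() - 1          # n = 2^j + k with 1 <= k <= 2^j
    k = n - 2 ** j
    if (k - 1) / 2 ** j <= x < k / 2 ** j:
        return 2 ** (j / 2) * rademacher(j + 1, x)
    return 0.0


def inner(f, g, cells=64):
    """Integral of f * g over [0, 1) for functions constant on dyadic cells."""
    mids = [(i + 0.5) / cells for i in range(cells)]
    return sum(f(x) * g(x) for x in mids) / cells


gram = [[inner(lambda x: haar(m, x), lambda x: haar(n, x)) for n in range(1, 9)]
        for m in range(1, 9)]
for row in gram:
    print([round(v, 12) for v in row])
```

The resulting 8 by 8 Gram matrix is the identity up to rounding.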

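6. If $\eta_1, \ldots, \eta_n$ is a finite orthonormal system then, as was shown above, for every random variable $\xi \in L^2$ there is a random variable $\hat{\xi}$ in the linear manifold $\mathscr{L} = \mathscr{L}\{\eta_1, \ldots, \eta_n\}$, namely the projection of $\xi$ on $\mathscr{L}$, such that
\[ \|\xi - \hat{\xi}\| = \inf\{\|\xi - \zeta\| : \zeta \in \mathscr{L}\{\eta_1, \ldots, \eta_n\}\}. \]
Here $\hat{\xi} = \sum_{i=1}^{n} (\xi, \eta_i) \eta_i$. This result has a natural generalization to the case when $\eta_1, \eta_2, \ldots$ is a countable orthonormal system (not necessarily a basis). In fact, we have the following result.

Theorem. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system of random variables, and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by the system. Then there is a unique element $\hat{\xi} \in \overline{\mathscr{L}}$ such that
\[ \|\xi - \hat{\xi}\| = \inf\{\|\xi - \zeta\| : \zeta \in \overline{\mathscr{L}}\}. \tag{20} \]
Moreover,
\[ \hat{\xi} = \operatorname*{l.i.m.}_{n} \sum_{i=1}^{n} (\xi, \eta_i) \eta_i \tag{21} \]
and $\xi - \hat{\xi} \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$.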

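PROOF. Let $d = \inf\{\|\xi - \zeta\| : \zeta \in \overline{\mathscr{L}}\}$ and choose a sequence $\zeta_1, \zeta_2, \ldots$ such that $\|\xi - \zeta_n\| \to d$. Let us show that this sequence is fundamental. A simple calculation shows that
\[ \|\zeta_n - \zeta_m\|^2 = 2\|\zeta_n - \xi\|^2 + 2\|\zeta_m - \xi\|^2 - 4\Bigl\|\frac{\zeta_n + \zeta_m}{2} - \xi\Bigr\|^2. \]
It is clear that $(\zeta_n + \zeta_m)/2 \in \overline{\mathscr{L}}$; consequently $\|[(\zeta_n + \zeta_m)/2] - \xi\|^2 \ge d^2$ and therefore $\|\zeta_n - \zeta_m\|^2 \to 0$, $n, m \to \infty$.
The space $L^2$ is complete (Theorem 7, §10). Hence there is an element $\hat{\xi}$ such that $\|\zeta_n - \hat{\xi}\| \to 0$. But $\overline{\mathscr{L}}$ is closed, so $\hat{\xi} \in \overline{\mathscr{L}}$. Moreover, $\|\zeta_n - \xi\| \to d$, and consequently $\|\xi - \hat{\xi}\| = d$, which establishes the existence of the required element.
Let us show that $\hat{\xi}$ is the only element of $\overline{\mathscr{L}}$ with the required property. Let $\tilde{\xi} \in \overline{\mathscr{L}}$ and let
\[ \|\xi - \hat{\xi}\| = \|\xi - \tilde{\xi}\| = d. \]
Then (by Problem 3)
\[ \|\hat{\xi} + \tilde{\xi} - 2\xi\|^2 + \|\hat{\xi} - \tilde{\xi}\|^2 = 2\|\hat{\xi} - \xi\|^2 + 2\|\tilde{\xi} - \xi\|^2 = 4d^2. \]
But
\[ \|\hat{\xi} + \tilde{\xi} - 2\xi\|^2 = 4\bigl\|\tfrac{1}{2}(\hat{\xi} + \tilde{\xi}) - \xi\bigr\|^2 \ge 4d^2. \]
Consequently $\|\hat{\xi} - \tilde{\xi}\|^2 = 0$. This establishes the uniqueness of the element of $\overline{\mathscr{L}}$ that is closest to $\xi$.
Now let us show that $\xi - \hat{\xi} \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$. By (20),
\[ \|\xi - \hat{\xi} - c\zeta\| \ge \|\xi - \hat{\xi}\| \]
for every $c \in R$. But
\[ \|\xi - \hat{\xi} - c\zeta\|^2 = \|\xi - \hat{\xi}\|^2 + c^2\|\zeta\|^2 - 2(\xi - \hat{\xi}, c\zeta). \]
Therefore
\[ c^2\|\zeta\|^2 \ge 2(\xi - \hat{\xi}, c\zeta). \tag{22} \]
Take $c = \lambda(\xi - \hat{\xi}, \zeta)$, $\lambda \in R$. Then we find from (22) that
\[ (\xi - \hat{\xi}, \zeta)^2 [\lambda^2 \|\zeta\|^2 - 2\lambda] \ge 0. \]
We have $\lambda^2 \|\zeta\|^2 - 2\lambda < 0$ if $\lambda$ is a sufficiently small positive number. Consequently $(\xi - \hat{\xi}, \zeta) = 0$, $\zeta \in \overline{\mathscr{L}}$.
It remains only to prove (21).
The set $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ is a closed subspace of $L^2$ and therefore a Hilbert space (with the same scalar product). Now the system $\eta_1, \eta_2, \ldots$ is a basis for $\overline{\mathscr{L}}$ (Problem 5), and consequently
\[ \hat{\xi} = \operatorname*{l.i.m.}_{n} \sum_{k=1}^{n} (\hat{\xi}, \eta_k) \eta_k. \tag{23} \]
But $\xi - \hat{\xi} \perp \eta_k$, $k \ge 1$, and therefore $(\xi, \eta_k) = (\hat{\xi}, \eta_k)$, $k \ge 1$. This, with (23), establishes (21).
This completes the proof of the theorem.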

Remark. As in the finite-dimensional case, we say that $\hat{\xi}$ is the projection of $\xi$ on $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$, that $\xi - \hat{\xi}$ is perpendicular to $\overline{\mathscr{L}}$, and that the representation
\[ \xi = \hat{\xi} + (\xi - \hat{\xi}) \]
is the orthogonal decomposition of $\xi$.
We also denote $\hat{\xi}$ by $\hat{E}(\xi \mid \eta_1, \eta_2, \ldots)$ and call it the conditional expectation in the wide sense (of $\xi$ with respect to $\eta_1, \eta_2, \ldots$). From the point of view of estimating $\xi$ in terms of $\eta_1, \eta_2, \ldots$, the variable $\hat{\xi}$ is the optimal linear estimator, with error
\[ \Delta \equiv E|\xi - \hat{\xi}|^2 \equiv \|\xi - \hat{\xi}\|^2 = \|\xi\|^2 - \sum_{i=1}^{\infty} |(\xi, \eta_i)|^2, \]
which follows from (5) and (23).

7. PROBLEMS
1. Show that if $\xi = \operatorname{l.i.m.} \xi_n$, then $\|\xi_n\| \to \|\xi\|$.
2. Show that if $\xi = \operatorname{l.i.m.} \xi_n$ and $\eta = \operatorname{l.i.m.} \eta_n$ then $(\xi_n, \eta_n) \to (\xi, \eta)$.
3. Show that the norm $\|\cdot\|$ has the parallelogram property
\[ \|\xi + \eta\|^2 + \|\xi - \eta\|^2 = 2(\|\xi\|^2 + \|\eta\|^2). \]
4. Let $(\xi_1, \ldots, \xi_n)$ be a family of orthogonal random variables. Show that they have the Pythagorean property,
\[ \Bigl\|\sum_{i=1}^{n} \xi_i\Bigr\|^2 = \sum_{i=1}^{n} \|\xi_i\|^2. \]
5. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by $\eta_1, \eta_2, \ldots$. Show that the system is a basis for the (Hilbert) space $\overline{\mathscr{L}}$.
6. Let $\xi_1, \xi_2, \ldots$ be a sequence of orthogonal random variables and $S_n = \xi_1 + \cdots + \xi_n$. Show that if $\sum_{n=1}^{\infty} E\xi_n^2 < \infty$ there is a random variable $S$ with $ES^2 < \infty$ such that $\operatorname{l.i.m.} S_n = S$, i.e. $\|S_n - S\|^2 = E|S_n - S|^2 \to 0$, $n \to \infty$.
7. Show that in the space $L^2 = L^2([-\pi, \pi], \mathscr{B}([-\pi, \pi]))$ with Lebesgue measure $\mu$ the system $\{(1/\sqrt{2\pi})\, e^{i\lambda n},\ n = 0, \pm 1, \ldots\}$ is an orthonormal basis.

12. Characteristic Functions


1. The method of characteristic functions is one of the main tools of the
analytic theory of probability. This will appear very clearly in Chapter III
in the proofs of limit theorems and, in particular, in the proof of the central
limit theorem, which generalizes the De Moivre-Laplace theorem. In the
present section we merely define characteristic functions and present their
basic properties.
First we make some general remarks.
Besides random variables which take real values, the theory of characteristic functions requires random variables that take complex values (see
Subsection 1 of 5).

273

12. Characteristic Functions

Many definitions and properties involving random variables can easily be carried over to the complex case. For example, the expectation $E\zeta$ of a complex random variable $\zeta = \xi + i\eta$ will exist if the expectations $E\xi$ and $E\eta$ exist. In this case we define $E\zeta = E\xi + iE\eta$. It is easy to deduce from the definition of the independence of random elements (Definition 6, §5) that the complex random variables $\zeta_1 = \xi_1 + i\eta_1$ and $\zeta_2 = \xi_2 + i\eta_2$ are independent if and only if the pairs $(\xi_1, \eta_1)$ and $(\xi_2, \eta_2)$ are independent; or, equivalently, the $\sigma$-algebras $\mathscr{L}_{\xi_1, \eta_1}$ and $\mathscr{L}_{\xi_2, \eta_2}$ are independent.
Besides the space $L^2$ of real random variables with finite second moment, we shall consider the Hilbert space of complex random variables $\zeta = \xi + i\eta$ with $E|\zeta|^2 < \infty$, where $|\zeta|^2 = \xi^2 + \eta^2$ and the scalar product $(\zeta_1, \zeta_2)$ is defined by $E\zeta_1\bar\zeta_2$, where $\bar\zeta_2$ is the complex conjugate of $\zeta_2$. The term "random variable" will now be used for both real and complex random variables, with a comment (when necessary) on which is intended.
Let us introduce some notation.
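We consider a vector $a \in R^n$ to be a column vector,

$$a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix},$$

and $a^T$ to be a row vector, $a^T = (a_1, \ldots, a_n)$. If $a$ and $b \in R^n$ their scalar product $(a, b)$ is $\sum_{i=1}^{n} a_i b_i$. Clearly $(a, b) = a^T b$.
If $a \in R^n$ and $\mathbb{R} = \|r_{ij}\|$ is an $n$ by $n$ matrix,

$$(\mathbb{R}a, a) = a^T \mathbb{R} a = \sum_{i,j=1}^{n} r_{ij} a_i a_j. \tag{1}$$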

2. Definition 1. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function in $(R^n, \mathscr{B}(R^n))$. Its characteristic function is

$$\varphi(t) = \int_{R^n} e^{i(t,x)}\,dF(x), \qquad t \in R^n. \tag{2}$$

Definition 2. If $\xi = (\xi_1, \ldots, \xi_n)$ is a random vector defined on the probability space $(\Omega, \mathscr{F}, P)$ with values in $R^n$, its characteristic function is

$$\varphi_\xi(t) = \int_{R^n} e^{i(t,x)}\,dF_\xi(x), \qquad t \in R^n, \tag{3}$$

where $F_\xi = F_\xi(x_1, \ldots, x_n)$ is the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$.

If $F(x)$ has a density $f = f(x)$ then

$$\varphi(t) = \int_{R^n} e^{i(t,x)} f(x)\,dx.$$


In other words, in this case the characteristic function is just the Fourier transform of $f(x)$.
It follows from (3) and Theorem 6.7 (on change of variable in a Lebesgue integral) that the characteristic function $\varphi_\xi(t)$ of a random vector can also be defined by

$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t \in R^n. \tag{4}$$

We now present some basic properties of characteristic functions, stated and proved for $n = 1$. Further important results for the general case will be given as problems.
Let $\xi = \xi(\omega)$ be a random variable, $F_\xi = F_\xi(x)$ its distribution function, and

$$\varphi_\xi(t) = Ee^{it\xi}$$

its characteristic function.
We see at once that if $\eta = a\xi + b$ then

$$\varphi_\eta(t) = Ee^{it\eta} = Ee^{it(a\xi + b)} = e^{itb}Ee^{iat\xi}.$$

Therefore

$$\varphi_\eta(t) = e^{itb}\varphi_\xi(at). \tag{5}$$

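Moreover, if $\xi_1, \xi_2, \ldots, \xi_n$ are independent random variables and $S_n = \xi_1 + \cdots + \xi_n$, then

$$\varphi_{S_n}(t) = \prod_{i=1}^{n} \varphi_{\xi_i}(t). \tag{6}$$

In fact,

$$\varphi_{S_n}(t) = Ee^{it(\xi_1 + \cdots + \xi_n)} = Ee^{it\xi_1} \cdots Ee^{it\xi_n} = \prod_{j=1}^{n} \varphi_{\xi_j}(t),$$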

where we have used the property that the expectation of a product of independent (bounded) random variables (either real or complex; see Theorem 6 of §6, and Problem 1) is equal to the product of their expectations.
Property (6) is the key to the proofs of limit theorems for sums of independent random variables by the method of characteristic functions (see §3, Chapter III). In this connection we note that the distribution function $F_{S_n}$ is expressed in terms of the distribution functions of the individual terms in a rather complicated way, namely $F_{S_n} = F_{\xi_1} * \cdots * F_{\xi_n}$ where $*$ denotes convolution (see §8, Subsection 4).
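The contrast between property (6) and the convolution formula can be seen numerically. The following is a minimal sketch, assuming two hypothetical discrete laws (the names `bernoulli`, `three_point`, `char_fn` and `convolve` are illustrative and not from the text): the law of the sum is obtained by an explicit convolution, while its characteristic function is just the product of the two characteristic functions.

```python
import cmath

def char_fn(pmf, t):
    """Characteristic function E exp(i t X) of a discrete law given as {value: prob}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

def convolve(pmf_a, pmf_b):
    """Law of X + Y for independent X ~ pmf_a and Y ~ pmf_b (discrete convolution)."""
    out = {}
    for x, p in pmf_a.items():
        for y, q in pmf_b.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out

# Hypothetical example laws, chosen only for illustration.
bernoulli = {0: 0.7, 1: 0.3}
three_point = {-1: 1 / 3, 0: 1 / 3, 2: 1 / 3}
law_of_sum = convolve(bernoulli, three_point)

t = 0.8
lhs = char_fn(law_of_sum, t)                           # cf of the sum
rhs = char_fn(bernoulli, t) * char_fn(three_point, t)  # product of the cfs, as in (6)
print(abs(lhs - rhs))  # difference at rounding-error level
```

The convolution loop grows with the product of the support sizes, whereas the right-hand side of (6) needs only one multiplication per term, which is why the product form is the convenient one for limit theorems.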
Here are some examples of characteristic functions.
EXAMPLE 1. Let $\xi$ be a Bernoulli random variable with $P(\xi = 1) = p$, $P(\xi = 0) = q$, $p + q = 1$, $1 > p > 0$; then

$$\varphi_\xi(t) = pe^{it} + q.$$


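If $\xi_1, \ldots, \xi_n$ are independent identically distributed random variables like $\xi$, then, writing $T_n = (S_n - np)/\sqrt{npq}$, we have

$$\varphi_{T_n}(t) = Ee^{iT_n t} = e^{-it\sqrt{np/q}}\left[pe^{it/\sqrt{npq}} + q\right]^n = \left[pe^{it\sqrt{q/(np)}} + qe^{-it\sqrt{p/(nq)}}\right]^n. \tag{7}$$

Notice that it follows that as $n \to \infty$

$$\varphi_{T_n}(t) \to e^{-t^2/2}, \qquad T_n = \frac{S_n - np}{\sqrt{npq}}. \tag{8}$$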
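The limit behaviour of the characteristic function of the normalized Bernoulli sum can be checked numerically. This is a small sketch, assuming the closed form $[pe^{it\sqrt{q/(np)}} + qe^{-it\sqrt{p/(nq)}}]^n$ of (7); the function name and the values of $p$, $t$ and $n$ are illustrative choices.

```python
import cmath
import math

def cf_normalized_sum(t, n, p):
    """Characteristic function of T_n = (S_n - n p) / sqrt(n p q), as in formula (7)."""
    q = 1.0 - p
    first = p * cmath.exp(1j * t * math.sqrt(q / (n * p)))
    second = q * cmath.exp(-1j * t * math.sqrt(p / (n * q)))
    return (first + second) ** n

t, p = 1.0, 0.3
limit = math.exp(-t * t / 2)  # value of the standard normal cf at t
for n in (10, 1000, 1000000):
    print(n, abs(cf_normalized_sum(t, n, p) - limit))  # the gap shrinks as n grows
```

The gap decreases roughly like $1/\sqrt{n}$ when $p \ne 1/2$, which is the order suggested by the next term of the expansion of (7).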
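EXAMPLE 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$, $|m| < \infty$, $\sigma^2 > 0$. Let us show that

$$\varphi_\xi(t) = e^{itm - t^2\sigma^2/2}. \tag{9}$$

Let $\eta = (\xi - m)/\sigma$. Then $\eta \sim \mathscr{N}(0, 1)$ and, since

$$\varphi_\xi(t) = e^{itm}\varphi_\eta(\sigma t)$$

by (5), it is enough to show that

$$\varphi_\eta(t) = e^{-t^2/2}. \tag{10}$$

We have

$$\begin{aligned}
\varphi_\eta(t) = Ee^{it\eta} &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx}e^{-x^2/2}\,dx
= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \sum_{n=0}^{\infty}\frac{(itx)^n}{n!}e^{-x^2/2}\,dx \\
&= \sum_{n=0}^{\infty}\frac{(it)^n}{n!}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n e^{-x^2/2}\,dx
= \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}(2n-1)!! \\
&= \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}\,\frac{(2n)!}{2^n n!}
= \sum_{n=0}^{\infty}\left(-\frac{t^2}{2}\right)^n \frac{1}{n!} = e^{-t^2/2},
\end{aligned}$$

where we have used the formula (see Problem 7 in §8)

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{2n}e^{-x^2/2}\,dx \equiv E\eta^{2n} = (2n-1)!!.$$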

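EXAMPLE 3. Let $\xi$ be a Poisson random variable,

$$P(\xi = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, \ldots.$$

Then

$$\varphi_\xi(t) = Ee^{it\xi} = \sum_{k=0}^{\infty} e^{itk}\frac{e^{-\lambda}\lambda^k}{k!} = \exp\{\lambda(e^{it} - 1)\}. \tag{11}$$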
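The Poisson characteristic function can be verified by summing the defining series directly. The sketch below assumes the closed form $\exp\{\lambda(e^{it} - 1)\}$ of (11); the function names and the truncation at 200 terms are illustrative choices.

```python
import cmath
import math

def poisson_cf_series(t, lam, terms=200):
    """Partial sum of  sum_k exp(i t k) P(xi = k)  for a Poisson(lam) variable."""
    total = 0j
    prob = math.exp(-lam)          # P(xi = 0)
    for k in range(terms):
        total += cmath.exp(1j * t * k) * prob
        prob *= lam / (k + 1)      # P(xi = k + 1) from P(xi = k)
    return total

def poisson_cf_closed(t, lam):
    """Closed form exp{lam (exp(i t) - 1)}."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

print(abs(poisson_cf_series(1.3, 2.5) - poisson_cf_closed(1.3, 2.5)))
```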


3. As we observed in §9, Subsection 1, with every distribution function in $(R, \mathscr{B}(R))$ we can associate a random variable of which it is the distribution function. Hence in discussing the properties of characteristic functions (in the sense either of Definition 1 or Definition 2), we may consider only characteristic functions $\varphi(t) = \varphi_\xi(t)$ of random variables $\xi = \xi(\omega)$.

Theorem 1. Let $\xi$ be a random variable with distribution function $F = F(x)$ and

$$\varphi(t) = Ee^{it\xi}$$

its characteristic function. Then $\varphi$ has the following properties:

(1) $|\varphi(t)| \le \varphi(0) = 1$;
(2) $\varphi(t)$ is uniformly continuous for $t \in R$;
(3) $\varphi(t) = \overline{\varphi(-t)}$;
(4) $\varphi(t)$ is real-valued if and only if $F$ is symmetric ($\int_B dF(x) = \int_{-B} dF(x)$), $B \in \mathscr{B}(R)$, $-B = \{-x : x \in B\}$;
(5) if $E|\xi|^n < \infty$ for some $n \ge 1$, then $\varphi^{(r)}(t)$ exists for every $r \le n$, and

$$\varphi^{(r)}(t) = \int_R (ix)^r e^{itx}\,dF(x), \tag{12}$$

$$E\xi^r = \frac{\varphi^{(r)}(0)}{i^r}, \tag{13}$$

$$\varphi(t) = \sum_{r=0}^{n} \frac{(it)^r}{r!}E\xi^r + \frac{(it)^n}{n!}\varepsilon_n(t), \tag{14}$$

where $|\varepsilon_n(t)| \le 3E|\xi|^n$ and $\varepsilon_n(t) \to 0$, $t \to 0$;
(6) if $\varphi^{(2n)}(0)$ exists and is finite then $E\xi^{2n} < \infty$;
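(7) if $E|\xi|^n < \infty$ for all $n \ge 1$ and

$$\overline{\lim_n}\ \frac{(E|\xi|^n)^{1/n}}{n} = \frac{1}{e \cdot R} < \infty,$$

then

$$\varphi(t) = \sum_{n=0}^{\infty} \frac{(it)^n}{n!}E\xi^n \tag{15}$$

for all $|t| < R$.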
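Formula (13), $E\xi^r = \varphi^{(r)}(0)/i^r$, can be illustrated by differentiating a characteristic function numerically. This is a rough sketch, assuming a Bernoulli variable with the hypothetical parameter $p = 0.3$ and central finite differences with step $h$; the second difference used here is the same symmetric quotient that appears in the proof of property (6).

```python
import cmath

def bernoulli_cf(t, p=0.3):
    """Characteristic function p e^{it} + q of a Bernoulli(p) variable."""
    return p * cmath.exp(1j * t) + (1 - p)

def first_two_moments(cf, h=1e-4):
    """Approximate E xi and E xi^2 from central differences of cf at t = 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)                # approximates phi'(0)
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h ** 2   # approximates phi''(0)
    return (d1 / 1j).real, (d2 / 1j ** 2).real

m1, m2 = first_two_moments(bernoulli_cf)
print(m1, m2)  # both close to p = 0.3, since xi^2 = xi for a Bernoulli variable
```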


PROOF. Properties (1) and (3) are evident. Property (2) follows from the inequality

$$|\varphi(t + h) - \varphi(t)| = |Ee^{it\xi}(e^{ih\xi} - 1)| \le E|e^{ih\xi} - 1|$$

and the dominated convergence theorem, according to which $E|e^{ih\xi} - 1| \to 0$, $h \to 0$.
Property (4). Let $F$ be symmetric. Then if $g(x)$ is a bounded odd Borel function, we have $\int_R g(x)\,dF(x) = 0$ (observe that for simple odd functions


this follows directly from the definition of the symmetry of $F$). Consequently $\int_R \sin tx\,dF(x) = 0$ and therefore

$$\varphi(t) = E\cos t\xi.$$

Conversely, let $\varphi_\xi(t)$ be a real function. Then by (3)

$$\varphi_{-\xi}(t) = \varphi_\xi(-t) = \overline{\varphi_\xi(t)} = \varphi_\xi(t), \qquad t \in R.$$

Hence (as will be shown below in Theorem 2) the distribution functions $F_{-\xi}$ and $F_\xi$ of the random variables $-\xi$ and $\xi$ are the same, and therefore (by Theorem 3.1)

$$P(\xi \in B) = P(-\xi \in B) = P(\xi \in -B)$$

for every $B \in \mathscr{B}(R)$.


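Property (5). If $E|\xi|^n < \infty$, we have $E|\xi|^r < \infty$ for $r \le n$, by Lyapunov's inequality (6.28).
Consider the difference quotient

$$\frac{\varphi(t + h) - \varphi(t)}{h} = Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right).$$

Since

$$\left|\frac{e^{ihx} - 1}{h}\right| \le |x|,$$

and $E|\xi| < \infty$, it follows from the dominated convergence theorem that the limit

$$\lim_{h \to 0} Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right)$$

exists and equals

$$Ee^{it\xi}\lim_{h \to 0}\left(\frac{e^{ih\xi} - 1}{h}\right) = iE(\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\,dF(x). \tag{16}$$

Hence $\varphi'(t)$ exists and

$$\varphi'(t) = i(E\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\,dF(x).$$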

The existence of the derivatives $\varphi^{(r)}(t)$, $1 < r \le n$, and the validity of (12), follow by induction.
Formula (13) follows immediately from (12). Let us now establish (14).
Since

$$e^{iy} = \cos y + i\sin y = \sum_{k=0}^{n-1}\frac{(iy)^k}{k!} + \frac{(iy)^n}{n!}[\cos\theta_1 y + i\sin\theta_2 y]$$


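for real $y$, with $|\theta_1| \le 1$ and $|\theta_2| \le 1$, we have

$$e^{it\xi} = \sum_{k=0}^{n-1}\frac{(it\xi)^k}{k!} + \frac{(it\xi)^n}{n!}[\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi] \tag{17}$$

and

$$Ee^{it\xi} = \sum_{k=0}^{n-1}\frac{(it)^k}{k!}E\xi^k + \frac{(it)^n}{n!}[E\xi^n + \varepsilon_n(t)], \tag{18}$$

where

$$\varepsilon_n(t) = E[\xi^n(\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi - 1)].$$

It is clear that $|\varepsilon_n(t)| \le 3E|\xi^n|$. The theorem on dominated convergence shows that $\varepsilon_n(t) \to 0$, $t \to 0$.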
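Property (6). We give a proof by induction. Suppose first that $\varphi''(0)$ exists and is finite. Let us show that in that case $E\xi^2 < \infty$. By L'Hôpital's rule and Fatou's lemma,

$$\begin{aligned}
\varphi''(0) &= \lim_{h \to 0}\frac{1}{2}\left[\frac{\varphi'(2h) - \varphi'(0)}{2h} + \frac{\varphi'(0) - \varphi'(-2h)}{2h}\right] \\
&= \lim_{h \to 0}\frac{2\varphi'(2h) - 2\varphi'(-2h)}{8h} = \lim_{h \to 0}\frac{1}{4h^2}[\varphi(2h) - 2\varphi(0) + \varphi(-2h)] \\
&= \lim_{h \to 0}\int_{-\infty}^{\infty}\left(\frac{e^{ihx} - e^{-ihx}}{2h}\right)^2 dF(x) \\
&= -\lim_{h \to 0}\int_{-\infty}^{\infty}\left(\frac{\sin hx}{hx}\right)^2 x^2\,dF(x) \le -\int_{-\infty}^{\infty}\lim_{h \to 0}\left(\frac{\sin hx}{hx}\right)^2 x^2\,dF(x) \\
&= -\int_{-\infty}^{\infty} x^2\,dF(x).
\end{aligned}$$

Therefore

$$\int_{-\infty}^{\infty} x^2\,dF(x) \le -\varphi''(0) < \infty.$$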

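Now let $\varphi^{(2k+2)}(0)$ exist, finite, and let $\int_{-\infty}^{\infty} x^{2k}\,dF(x) < \infty$. If $\int_{-\infty}^{\infty} x^{2k}\,dF(x) = 0$, then $\int_{-\infty}^{\infty} x^{2k+2}\,dF(x) = 0$ also. Hence we may suppose that $\int_{-\infty}^{\infty} x^{2k}\,dF(x) > 0$. Then, by Property (5),

$$\varphi^{(2k)}(t) = \int_{-\infty}^{\infty}(ix)^{2k}e^{itx}\,dF(x)$$

and therefore

$$(-1)^k\varphi^{(2k)}(t) = \int_{-\infty}^{\infty} e^{itx}\,dG(x),$$

where $G(x) = \int_{-\infty}^{x} u^{2k}\,dF(u)$.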


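Consequently the function $(-1)^k\varphi^{(2k)}(t)G^{-1}(\infty)$ is the characteristic function of the probability distribution $G(x) \cdot G^{-1}(\infty)$ and by what we have proved,

$$G^{-1}(\infty)\int_{-\infty}^{\infty} x^2\,dG(x) < \infty.$$

But $G^{-1}(\infty) > 0$, and therefore

$$\int_{-\infty}^{\infty} x^{2k+2}\,dF(x) = \int_{-\infty}^{\infty} x^2\,dG(x) < \infty.$$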

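Property (7). Let $0 < t_0 < R$. Then, by Stirling's formula we find that

$$\overline{\lim_n}\left(\frac{E|\xi|^n t_0^n}{n!}\right)^{1/n} < 1.$$

Consequently the series $\sum [E|\xi|^n t_0^n/n!]$ converges by Cauchy's test, and therefore the series $\sum_{r=0}^{\infty}[(it)^r/r!]E\xi^r$ converges for $|t| \le t_0$. But by (14), for $n \ge 1$,

$$\varphi(t) = \sum_{r=0}^{n}\frac{(it)^r}{r!}E\xi^r + R_n(t),$$

where $|R_n(t)| \le 3(|t|^n/n!)E|\xi|^n$. Therefore

$$\varphi(t) = \sum_{r=0}^{\infty}\frac{(it)^r}{r!}E\xi^r$$

for all $|t| < R$. This completes the proof of the theorem.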

Remark 1. By a method similar to that used for (14), we can establish that if $E|\xi|^n < \infty$ for some $n \ge 1$, then

$$\varphi(t) = \sum_{k=0}^{n}\frac{i^k(t - s)^k}{k!}\int_{-\infty}^{\infty} x^k e^{isx}\,dF(x) + \frac{i^n(t - s)^n}{n!}\varepsilon_n(t - s), \tag{19}$$

where $|\varepsilon_n(t - s)| \le 3E|\xi^n|$, and $\varepsilon_n(t - s) \to 0$ as $t - s \to 0$.

Remark 2. With reference to the condition that appears in Property (7),


see also Subsection 9, below, on the "uniqueness of the solution of the
moment problem."
4. The following theorem shows that the distribution function is uniquely determined by its characteristic function.


[Figure 33. Graph of the function $f^\varepsilon = f^\varepsilon(x)$.]

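Theorem 2 (Uniqueness). Let $F$ and $G$ be distribution functions with the same characteristic function, i.e.

$$\int_{-\infty}^{\infty} e^{itx}\,dF(x) = \int_{-\infty}^{\infty} e^{itx}\,dG(x) \tag{20}$$

for all $t \in R$. Then $F(x) \equiv G(x)$.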

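PROOF. Choose $a$ and $b \in R$, and $\varepsilon > 0$, and consider the function $f^\varepsilon = f^\varepsilon(x)$ shown in Figure 33. We show that

$$\int_{-\infty}^{\infty} f^\varepsilon(x)\,dF(x) = \int_{-\infty}^{\infty} f^\varepsilon(x)\,dG(x). \tag{21}$$

Let $n \ge 0$ be large enough so that $[a - \varepsilon, b + \varepsilon] \subseteq [-n, n]$, and let the sequence $\{\delta_n\}$ be such that $1 \ge \delta_n \downarrow 0$, $n \to \infty$. Like every continuous function on $[-n, n]$ that has equal values at the endpoints, $f^\varepsilon = f^\varepsilon(x)$ can be uniformly approximated by trigonometric polynomials (Weierstrass's theorem), i.e. there is a finite sum

$$f_n^\varepsilon(x) = \sum_k a_k \exp\left(i\pi x\frac{k}{n}\right) \tag{22}$$

such that

$$\sup_{-n \le x \le n}|f^\varepsilon(x) - f_n^\varepsilon(x)| \le \delta_n.$$

Let us extend the periodic function $f_n^\varepsilon(x)$ to all of $R$, and observe that

$$\sup_x |f_n^\varepsilon(x)| \le 2.$$

Then, since by (20)

$$\int_{-\infty}^{\infty} f_n^\varepsilon(x)\,dF(x) = \int_{-\infty}^{\infty} f_n^\varepsilon(x)\,dG(x), \tag{23}$$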


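we have

$$\begin{aligned}
\left|\int_{-\infty}^{\infty} f^\varepsilon(x)\,dF(x) - \int_{-\infty}^{\infty} f^\varepsilon(x)\,dG(x)\right|
&= \left|\int_{-n}^{n} f^\varepsilon\,dF - \int_{-n}^{n} f^\varepsilon\,dG\right| \\
&\le \left|\int_{-n}^{n} f_n^\varepsilon\,dF - \int_{-n}^{n} f_n^\varepsilon\,dG\right| + 2\delta_n \\
&\le \left|\int_{-\infty}^{\infty} f_n^\varepsilon\,dF - \int_{-\infty}^{\infty} f_n^\varepsilon\,dG\right| + 2\delta_n + 2F(\overline{[-n, n]}) + 2G(\overline{[-n, n]}),
\end{aligned} \tag{24}$$

where $F(A) = \int_A dF(x)$, $G(A) = \int_A dG(x)$, and $\overline{[-n, n]}$ denotes the complement of $[-n, n]$. As $n \to \infty$, the right-hand side of (24) tends to zero, and this establishes (21).
As $\varepsilon \to 0$, we have $f^\varepsilon(x) \to I_{(a,b]}(x)$. It follows from (21) by the theorem on dominated convergence that

$$\int_{-\infty}^{\infty} I_{(a,b]}(x)\,dF(x) = \int_{-\infty}^{\infty} I_{(a,b]}(x)\,dG(x),$$

i.e. $F(b) - F(a) = G(b) - G(a)$. Since $a$ and $b$ are arbitrary, it follows that $F(x) = G(x)$ for all $x \in R$.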
This completes the proof of the theorem.

5. The preceding theorem says that a distribution function $F = F(x)$ is uniquely determined by its characteristic function $\varphi = \varphi(t)$. The next theorem gives an explicit representation of $F$ in terms of $\varphi$.

Theorem 3 (Inversion Formula). Let $F = F(x)$ be a distribution function and

$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$$

its characteristic function.
(a) For pairs of points $a$ and $b$ ($a < b$) at which $F = F(x)$ is continuous,

$$F(b) - F(a) = \lim_{c \to \infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\varphi(t)\,dt; \tag{25}$$

(b) If $\int_{-\infty}^{\infty}|\varphi(t)|\,dt < \infty$, the distribution function $F(x)$ has a density $f(x)$,

$$F(x) = \int_{-\infty}^{x} f(y)\,dy \tag{26}$$

and

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt. \tag{27}$$

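PROOF. We first observe that if $F(x)$ has density $f(x)$ then

$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}f(x)\,dx, \tag{28}$$

and (27) is just the Fourier transform of the (integrable) function $\varphi(t)$. Integrating both sides of (27) and applying Fubini's theorem, we obtain

$$\begin{aligned}
F(b) - F(a) = \int_a^b f(x)\,dx &= \frac{1}{2\pi}\int_a^b\left[\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt\right]dx \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\left[\int_a^b e^{-itx}\,dx\right]dt
= \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\frac{e^{-ita} - e^{-itb}}{it}\,dt.
\end{aligned}$$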

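After these remarks, which to some extent clarify (25), we turn to the proof.
(a) We have

$$\begin{aligned}
\Phi_c &\equiv \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\varphi(t)\,dt
= \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\left[\int_{-\infty}^{\infty} e^{itx}\,dF(x)\right]dt \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}e^{itx}\,dt\right]dF(x)
= \int_{-\infty}^{\infty}\Psi_c(x)\,dF(x),
\end{aligned} \tag{29}$$

where we have put

$$\Psi_c(x) = \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}e^{itx}\,dt$$

and applied Fubini's theorem, which is applicable in this case because

$$\left|\frac{e^{-ita} - e^{-itb}}{it}\cdot e^{itx}\right| = \left|\frac{e^{-ita} - e^{-itb}}{it}\right| = \left|\int_a^b e^{-itx}\,dx\right| \le b - a$$

and

$$\int_{-c}^{c}\int_{-\infty}^{\infty}(b - a)\,dF(x)\,dt \le 2c(b - a) < \infty.$$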


In addition,

$$\begin{aligned}
\Psi_c(x) &= \frac{1}{2\pi}\int_{-c}^{c}\frac{\sin t(x - a) - \sin t(x - b)}{t}\,dt \\
&= \frac{1}{2\pi}\int_{-c(x-a)}^{c(x-a)}\frac{\sin v}{v}\,dv - \frac{1}{2\pi}\int_{-c(x-b)}^{c(x-b)}\frac{\sin u}{u}\,du.
\end{aligned} \tag{30}$$

The function

$$g(s, t) = \int_s^t \frac{\sin v}{v}\,dv$$

is uniformly continuous in $s$ and $t$, and

$$g(s, t) \to \pi \tag{31}$$

as $s \downarrow -\infty$ and $t \uparrow \infty$. Hence there is a constant $C$ such that $|\Psi_c(x)| < C < \infty$ for all $c$ and $x$. Moreover, it follows from (30) and (31) that

$$\Psi_c(x) \to \Psi(x), \qquad c \to \infty,$$

where

$$\Psi(x) = \begin{cases} 0, & x < a,\ x > b, \\ \tfrac{1}{2}, & x = a,\ x = b, \\ 1, & a < x < b. \end{cases}$$

Let $\mu$ be a measure on $(R, \mathscr{B}(R))$ such that $\mu(a, b] = F(b) - F(a)$. Then if we apply the dominated convergence theorem and use the formulas of Problem 1 of §3, we find that, as $c \to \infty$,

$$\begin{aligned}
\Phi_c = \int_{-\infty}^{\infty}\Psi_c(x)\,dF(x) &\to \int_{-\infty}^{\infty}\Psi(x)\,dF(x)
= \mu(a, b) + \tfrac{1}{2}\mu\{a\} + \tfrac{1}{2}\mu\{b\} \\
&= F(b-) - F(a) + \tfrac{1}{2}[F(a) - F(a-) + F(b) - F(b-)] \\
&= \frac{F(b) + F(b-)}{2} - \frac{F(a) + F(a-)}{2} = F(b) - F(a),
\end{aligned}$$

where the last equation holds for all points $a$ and $b$ of continuity of $F(x)$.
Hence (25) is established.
(b) Let $\int_{-\infty}^{\infty}|\varphi(t)|\,dt < \infty$. Write

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt.$$


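It follows from the dominated convergence theorem that this is a continuous function of $x$ and therefore is integrable on $[a, b]$. Consequently we find, applying Fubini's theorem again, that

$$\begin{aligned}
\int_a^b f(x)\,dx &= \int_a^b \frac{1}{2\pi}\left(\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt\right)dx
= \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\left[\int_a^b e^{-itx}\,dx\right]dt \\
&= \lim_{c \to \infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\varphi(t)\,dt = F(b) - F(a)
\end{aligned}$$

for all points $a$ and $b$ of continuity of $F(x)$.
Hence it follows that

$$F(x) = \int_{-\infty}^{x} f(y)\,dy, \qquad x \in R,$$

and since $f(x)$ is continuous and $F(x)$ is nondecreasing, $f(x)$ is the density of $F(x)$.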
This completes the proof of the theorem.
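The inversion formula (27) lends itself to a direct numerical check. The sketch below is only an illustration under stated assumptions: it takes the standard normal characteristic function $e^{-t^2/2}$ from (10), truncates the integral to a hypothetical window $[-10, 10]$, and uses the trapezoidal rule; the function names are not from the text.

```python
import cmath
import math

def density_from_cf(cf, x, t_max=10.0, steps=4000):
    """Trapezoidal approximation of (1 / 2 pi) * integral of exp(-i t x) cf(t) dt over [-t_max, t_max]."""
    h = 2 * t_max / steps
    total = 0j
    for j in range(steps + 1):
        t = -t_max + j * h
        weight = 0.5 if j in (0, steps) else 1.0   # end points get half weight
        total += weight * cmath.exp(-1j * t * x) * cf(t)
    return (total * h / (2 * math.pi)).real

def normal_cf(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-t * t / 2)

x = 0.7
approx = density_from_cf(normal_cf, x)
exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
print(approx, exact)
```

The truncation is harmless here because $|\varphi(t)|$ is integrable and decays very quickly; for a characteristic function that is not absolutely integrable only the form (25) is available.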

Corollary. The inversion formula (25) provides a second proof of Theorem 2.


Theorem 4. A necessary and sufficient condition for the components of the random vector $\xi = (\xi_1, \ldots, \xi_n)$ to be independent is that its characteristic function is the product of the characteristic functions of the components:

$$Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \prod_{k=1}^{n} Ee^{it_k\xi_k}, \qquad (t_1, \ldots, t_n) \in R^n.$$

PROOF. The necessity follows from Problem 1. To prove the sufficiency we let $F(x_1, \ldots, x_n)$ be the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$ and $F_k(x)$ the distribution functions of the $\xi_k$, $1 \le k \le n$. Put $G = G(x_1, \ldots, x_n) = F_1(x_1) \cdots F_n(x_n)$. Then, by Fubini's theorem, for all $(t_1, \ldots, t_n) \in R^n$,

$$\begin{aligned}
\int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\,dG(x_1 \ldots x_n) &= \prod_{k=1}^{n}\int_R e^{it_kx_k}\,dF_k(x_k)
= \prod_{k=1}^{n} Ee^{it_k\xi_k} \\
&= Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\,dF(x_1 \ldots x_n).
\end{aligned}$$

Therefore by Theorem 2 (or rather, by its multidimensional analog; see Problem 3) we have $F = G$, and consequently, by the theorem of §5, the random variables $\xi_1, \ldots, \xi_n$ are independent.


6. Theorem 1 gives us necessary conditions for a function to be a characteristic function. Hence if $\varphi = \varphi(t)$ fails to satisfy, for example, one of the first three conclusions of the theorem, that function cannot be a characteristic function. We quote without proof some results in the same direction.

Bochner-Khinchin Theorem. Let $\varphi(t)$ be continuous, $t \in R$, with $\varphi(0) = 1$. A necessary and sufficient condition that $\varphi(t)$ is a characteristic function is that it is positive semi-definite, i.e. that for all real $t_1, \ldots, t_n$ and all complex $\lambda_1, \ldots, \lambda_n$, $n = 1, 2, \ldots$,

$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar\lambda_j \ge 0. \tag{32}$$

The necessity of (32) is evident since if $\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$ then

$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar\lambda_j = \int_{-\infty}^{\infty}\left|\sum_{k=1}^{n}\lambda_k e^{it_kx}\right|^2 dF(x) \ge 0.$$

The proof of the sufficiency of (32) is more difficult.
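Condition (32) can be probed numerically for a candidate function by evaluating the quadratic form at a few points. The sketch below is an illustration under assumptions: the points $t = (-0.5, 0, 0.5)$ and weights $\lambda = (1, -1.9, 1)$ are hypothetical choices, `gaussian` is the normal characteristic function $e^{-t^2/2}$, and `quartic` is the function $e^{-t^4}$, taken here as a test case that is not a characteristic function.

```python
import math

def quadratic_form(phi, ts, lams):
    """Value of  sum_{i,j} phi(t_i - t_j) * lam_i * conj(lam_j)  from condition (32)."""
    total = 0j
    for ti, li in zip(ts, lams):
        for tj, lj in zip(ts, lams):
            total += phi(ti - tj) * li * complex(lj).conjugate()
    return total

def gaussian(t):
    return math.exp(-t * t / 2)   # characteristic function of N(0, 1)

def quartic(t):
    return math.exp(-t ** 4)      # continuous, equal to 1 at 0, but not positive semi-definite

ts = (-0.5, 0.0, 0.5)
lams = (1.0, -1.9, 1.0)
print(quadratic_form(gaussian, ts, lams).real)  # nonnegative, as (32) requires
print(quadratic_form(quartic, ts, lams).real)   # negative, so (32) fails
```

A single negative value is enough to rule a function out; nonnegative values at finitely many points prove nothing, since (32) must hold for every choice of points and weights.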


Pólya's Theorem. Let a continuous even function $\varphi(t)$ satisfy $\varphi(t) \ge 0$, $\varphi(0) = 1$, $\varphi(t) \to 0$ as $t \to \infty$ and let $\varphi(t)$ be convex on $0 \le t < \infty$. Then $\varphi(t)$ is a characteristic function.

This theorem provides a very convenient method of constructing characteristic functions. Examples are

$$\varphi_1(t) = e^{-|t|}, \qquad \varphi_2(t) = \begin{cases} 1 - |t|, & |t| \le 1, \\ 0, & |t| > 1. \end{cases}$$

Another is the function $\varphi_3(t)$ drawn in Figure 34. On $[-a, a]$, the function $\varphi_3(t)$ coincides with $\varphi_2(t)$. However, the corresponding distribution functions $F_2$ and $F_3$ are evidently different. This example shows that in general two characteristic functions can be the same on a finite interval without their distribution functions' being the same.

[Figure 34. Graph of the function $\varphi_3(t)$.]


Marcinkiewicz's Theorem. If a characteristic function $\varphi(t)$ is of the form $\exp \mathscr{P}(t)$, where $\mathscr{P}(t)$ is a polynomial, then this polynomial is of degree at most 2.
It follows, for example, that $e^{-t^4}$ is not a characteristic function.
7. The following theorem shows that a property of the characteristic
function of a random variable can lead to a nontrivial conclusion about the
nature of the random variable.

Theorem 5. Let $\varphi_\xi(t)$ be the characteristic function of the random variable $\xi$.

(a) If $|\varphi_\xi(t_0)| = 1$ for some $t_0 \ne 0$, then $\xi$ is concentrated at the points $a + nh$, $h = 2\pi/t_0$, for some $a$, that is,

$$\sum_{n=-\infty}^{\infty} P\{\xi = a + nh\} = 1, \tag{33}$$

where $a$ is a constant.

(b) If $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ for two different points $t$ and $\alpha t$, where $\alpha$ is irrational, then $\xi$ is degenerate:

$$P\{\xi = a\} = 1,$$

where $a$ is some number.

(c) If $|\varphi_\xi(t)| \equiv 1$, then $\xi$ is degenerate.

PROOF. (a) If $|\varphi_\xi(t_0)| = 1$, $t_0 \ne 0$, there is a number $a$ such that $\varphi(t_0) = e^{it_0a}$. Then

$$e^{it_0a} = \int_{-\infty}^{\infty} e^{it_0x}\,dF(x) \Rightarrow 1 = \int_{-\infty}^{\infty}\cos t_0(x - a)\,dF(x) \Rightarrow \int_{-\infty}^{\infty}[1 - \cos t_0(x - a)]\,dF(x) = 0.$$

Since $1 - \cos t_0(x - a) \ge 0$, it follows from property H (Subsection 2 of §6) that

$$1 = \cos t_0(\xi - a) \quad (P\text{-a.s.}),$$

which is equivalent to (33).


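(b) It follows from $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ and from (33) that

$$\sum_{n=-\infty}^{\infty} P\left\{\xi = a + \frac{2\pi}{t}n\right\} = \sum_{m=-\infty}^{\infty} P\left\{\xi = b + \frac{2\pi}{\alpha t}m\right\} = 1.$$

If $\xi$ is not degenerate, there must be at least two pairs of common points:

$$a + \frac{2\pi}{t}n_1 = b + \frac{2\pi}{\alpha t}m_1, \qquad a + \frac{2\pi}{t}n_2 = b + \frac{2\pi}{\alpha t}m_2,$$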


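in the sets

$$\left\{a + \frac{2\pi}{t}n,\ n = 0, \pm 1, \ldots\right\} \quad\text{and}\quad \left\{b + \frac{2\pi}{\alpha t}m,\ m = 0, \pm 1, \ldots\right\},$$

whence

$$\frac{2\pi}{t}(n_1 - n_2) = \frac{2\pi}{\alpha t}(m_1 - m_2),$$

and this contradicts the assumption that $\alpha$ is irrational. Conclusion (c) follows from (b).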
This completes the proof of the theorem.
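Part (a) of Theorem 5 can be illustrated in the converse direction: for a variable concentrated on a lattice $a + nh$, the modulus of the characteristic function returns to 1 at $t_0 = 2\pi/h$. The following is a minimal sketch with a hypothetical three-point lattice law ($a = 0.5$, $h = 2$); the names are illustrative only.

```python
import cmath
import math

def lattice_cf(t, pmf):
    """Characteristic function of a discrete law given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

a, h = 0.5, 2.0
pmf = {a - h: 0.3, a: 0.2, a + h: 0.5}   # mass only at points of the form a + n h
t0 = 2 * math.pi / h

print(abs(lattice_cf(t0, pmf)))   # equals 1 up to rounding
print(abs(lattice_cf(1.0, pmf)))  # strictly less than 1 at a generic point
```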

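8. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector,

$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t = (t_1, \ldots, t_k),$$

its characteristic function. Let us suppose that $E|\xi_i|^n < \infty$ for some $n \ge 1$, $i = 1, \ldots, k$. From the inequalities of Hölder (6.29) and Lyapunov (6.27) it follows that the (mixed) moments $E(\xi_1^{\nu_1} \cdots \xi_k^{\nu_k})$ exist for all nonnegative $\nu_1, \ldots, \nu_k$ such that $\nu_1 + \cdots + \nu_k \le n$.
As in Theorem 1, this implies the existence and continuity of the partial derivatives

$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1} \cdots \partial t_k^{\nu_k}}\varphi_\xi(t_1, \ldots, t_k)$$

for $\nu_1 + \cdots + \nu_k \le n$. Then if we expand $\varphi_\xi(t_1, \ldots, t_k)$ in a Taylor series, we see that

$$\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1! \cdots \nu_k!}m_\xi^{(\nu_1, \ldots, \nu_k)}t_1^{\nu_1} \cdots t_k^{\nu_k} + o(|t|^n), \tag{34}$$

where

$$m_\xi^{(\nu_1, \ldots, \nu_k)} = E\xi_1^{\nu_1} \cdots \xi_k^{\nu_k}$$

is the mixed moment of order $\nu = (\nu_1, \ldots, \nu_k)$.
Now $\varphi_\xi(t_1, \ldots, t_k)$ is continuous, $\varphi_\xi(0, \ldots, 0) = 1$, and consequently this function is different from zero in some neighborhood $|t| < \delta$ of zero. In this neighborhood the partial derivative

$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1} \cdots \partial t_k^{\nu_k}}\ln\varphi_\xi(t_1, \ldots, t_k)$$

exists and is continuous, where $\ln z$ denotes the principal value of the logarithm (if $z = re^{i\theta}$, we take $\ln z$ to be $\ln r + i\theta$). Hence we can expand $\ln\varphi_\xi(t_1, \ldots, t_k)$ by Taylor's formula,

$$\ln\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1! \cdots \nu_k!}s_\xi^{(\nu_1, \ldots, \nu_k)}t_1^{\nu_1} \cdots t_k^{\nu_k} + o(|t|^n), \tag{35}$$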

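where the coefficients $s_\xi^{(\nu_1, \ldots, \nu_k)}$ are the (mixed) semi-invariants or cumulants of order $\nu = (\nu_1, \ldots, \nu_k)$ of $\xi = (\xi_1, \ldots, \xi_k)$.
Observe that if $\xi$ and $\eta$ are independent, then

$$\ln\varphi_{\xi+\eta}(t) = \ln\varphi_\xi(t) + \ln\varphi_\eta(t), \tag{36}$$

and therefore

$$s_{\xi+\eta}^{(\nu_1, \ldots, \nu_k)} = s_\xi^{(\nu_1, \ldots, \nu_k)} + s_\eta^{(\nu_1, \ldots, \nu_k)}. \tag{37}$$

(It is this property that gives rise to the term "semi-invariant" for $s_\xi^{(\nu_1, \ldots, \nu_k)}$.)
To simplify the formulas and make (34) and (35) look "one-dimensional," we introduce the following notation.
If $\nu = (\nu_1, \ldots, \nu_k)$ is a vector whose components are nonnegative integers, we put

$$\nu! = \nu_1! \cdots \nu_k!, \qquad |\nu| = \nu_1 + \cdots + \nu_k, \qquad t^\nu = t_1^{\nu_1} \cdots t_k^{\nu_k}.$$

We also put $s_\xi^{(\nu)} = s_\xi^{(\nu_1, \ldots, \nu_k)}$, $m_\xi^{(\nu)} = m_\xi^{(\nu_1, \ldots, \nu_k)}$.
Then (34) and (35) can be written

$$\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}m_\xi^{(\nu)}t^\nu + o(|t|^n), \tag{38}$$

$$\ln\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}s_\xi^{(\nu)}t^\nu + o(|t|^n). \tag{39}$$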

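The following theorem and its corollaries give formulas that connect moments and semi-invariants.

Theorem 6. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector with $E|\xi_i|^n < \infty$, $i = 1, \ldots, k$, $n \ge 1$. Then for $\nu = (\nu_1, \ldots, \nu_k)$ such that $|\nu| \le n$

$$m_\xi^{(\nu)} = \sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{1}{q!}\,\frac{\nu!}{\lambda^{(1)}! \cdots \lambda^{(q)}!}\prod_{p=1}^{q} s_\xi^{(\lambda^{(p)})}, \tag{40}$$

$$s_\xi^{(\nu)} = \sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{(-1)^{q-1}}{q}\,\frac{\nu!}{\lambda^{(1)}! \cdots \lambda^{(q)}!}\prod_{p=1}^{q} m_\xi^{(\lambda^{(p)})}, \tag{41}$$

where $\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}$ indicates summation over all ordered sets of nonnegative integral vectors $\lambda^{(p)}$, $|\lambda^{(p)}| > 0$, whose sum is $\nu$.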
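PROOF. Since

$$\varphi_\xi(t) = \exp(\ln\varphi_\xi(t)),$$

if we expand the function exp by Taylor's formula and use (39), we obtain

$$\varphi_\xi(t) = 1 + \sum_{q=1}^{n}\frac{1}{q!}\left(\sum_{1 \le |\lambda| \le n}\frac{i^{|\lambda|}}{\lambda!}s_\xi^{(\lambda)}t^\lambda\right)^q + o(|t|^n). \tag{42}$$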


Comparing terms in $t^\nu$ on the right-hand sides of (38) and (42), and using $|\lambda^{(1)}| + \cdots + |\lambda^{(q)}| = |\lambda^{(1)} + \cdots + \lambda^{(q)}|$, we obtain (40).
Moreover,

$$\ln\varphi_\xi(t) = \ln\left[1 + \sum_{1 \le |\lambda| \le n}\frac{i^{|\lambda|}}{\lambda!}m_\xi^{(\lambda)}t^\lambda + o(|t|^n)\right]. \tag{43}$$

For small $z$ we have the expansion

$$\ln(1 + z) = \sum_{q=1}^{n}\frac{(-1)^{q-1}}{q}z^q + o(z^n).$$

Using this in (43) and then comparing the coefficients of $t^\nu$ with the corresponding coefficients on the right-hand side of (39), we obtain (41).

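Corollary 1. The following formulas connect moments and semi-invariants:

$$m_\xi^{(\nu)} = \sum_{\{r_1\lambda^{(1)} + \cdots + r_x\lambda^{(x)} = \nu\}}\frac{1}{r_1! \cdots r_x!}\,\frac{\nu!}{(\lambda^{(1)}!)^{r_1} \cdots (\lambda^{(x)}!)^{r_x}}\prod_{j=1}^{x}\left[s_\xi^{(\lambda^{(j)})}\right]^{r_j}, \tag{44}$$

$$s_\xi^{(\nu)} = \sum_{\{r_1\lambda^{(1)} + \cdots + r_x\lambda^{(x)} = \nu\}}\frac{(-1)^{q-1}(q - 1)!}{r_1! \cdots r_x!}\,\frac{\nu!}{(\lambda^{(1)}!)^{r_1} \cdots (\lambda^{(x)}!)^{r_x}}\prod_{j=1}^{x}\left[m_\xi^{(\lambda^{(j)})}\right]^{r_j}, \tag{45}$$

where $\sum_{\{r_1\lambda^{(1)} + \cdots + r_x\lambda^{(x)} = \nu\}}$ denotes summation over all unordered sets of different nonnegative integral vectors $\lambda^{(j)}$, $|\lambda^{(j)}| > 0$, and over all ordered sets of positive integral numbers $r_j$ such that $r_1\lambda^{(1)} + \cdots + r_x\lambda^{(x)} = \nu$.
To establish (44) we suppose that among all the vectors $\lambda^{(1)}, \ldots, \lambda^{(q)}$ that occur in (40), there are $r_1$ equal to $\lambda^{(i_1)}$, ..., $r_x$ equal to $\lambda^{(i_x)}$ ($r_j > 0$, $r_1 + \cdots + r_x = q$), where all the $\lambda^{(i_s)}$ are different. There are $q!/(r_1! \cdots r_x!)$ different sets of vectors, corresponding (except for order) with the set $\{\lambda^{(1)}, \ldots, \lambda^{(q)}\}$. But if two sets, say, $\{\lambda^{(1)}, \ldots, \lambda^{(q)}\}$ and $\{\bar\lambda^{(1)}, \ldots, \bar\lambda^{(q)}\}$ differ only in order, then $\prod_{p=1}^{q} s_\xi^{(\lambda^{(p)})} = \prod_{p=1}^{q} s_\xi^{(\bar\lambda^{(p)})}$. Hence if we identify sets that differ only in order, we obtain (44) from (40).
Formula (45) can be deduced from (41) in a similar way.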

Corollary 2. Let us consider the special case when $\nu = (1, \ldots, 1)$. In this case the moments $m_\xi^{(\nu)} \equiv E\xi_1 \cdots \xi_k$, and the corresponding semi-invariants, are called simple.

Formulas connecting simple moments and simple semi-invariants can be read off from the formulas given above. However, it is useful to have them written in a different way.
For this purpose, we introduce the following notation.
Let $\xi = (\xi_1, \ldots, \xi_k)$ be a vector, and $I_\xi = \{1, 2, \ldots, k\}$ its set of indices. If $I \subseteq I_\xi$, let $\xi_I$ denote the vector consisting of the components of $\xi$ whose


indices belong to $I$. Let $\chi(I)$ be the vector $(\chi_1, \ldots, \chi_n)$ for which $\chi_i = 1$ if $i \in I$, and $\chi_i = 0$ if $i \notin I$. These vectors are in one-to-one correspondence with the sets $I \subseteq I_\xi$. Hence we can write

$$m_\xi(I) = m_\xi^{(\chi(I))}, \qquad s_\xi(I) = s_\xi^{(\chi(I))}.$$

In other words, $m_\xi(I)$ and $s_\xi(I)$ are simple moments and semi-invariants of the subvector $\xi_I$ of $\xi$.
In accordance with the definition given on p. 12, a decomposition of a set $I$ is an unordered collection of disjoint nonempty sets $I_p$ such that $\sum_p I_p = I$.
In terms of these definitions, we have the formulas

$$m_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}\ \prod_{p=1}^{q} s_\xi(I_p), \tag{46}$$

$$s_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}(-1)^{q-1}(q - 1)!\prod_{p=1}^{q} m_\xi(I_p), \tag{47}$$

where $\sum_{\sum_{p=1}^{q} I_p = I}$ denotes summation over all decompositions of $I$, $1 \le q \le N(I)$.
We shall derive (46) from (44). If $\nu = \chi(I)$ and $\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu$, then $\lambda^{(p)} = \chi(I_p)$, $I_p \subseteq I$, where the $\lambda^{(p)}$ are all different, $\lambda^{(p)}! = \nu! = 1$, and every unordered set $\{\chi(I_1), \ldots, \chi(I_q)\}$ is in one-to-one correspondence with the decomposition $I = \sum_{p=1}^{q} I_p$. Consequently (46) follows from (44).
In a similar way, (47) follows from (45).
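The sums over decompositions in (46) and (47) can be evaluated mechanically by enumerating set partitions. The sketch below is one possible way to do this; the function names are illustrative, and the check uses the hypothetical case $\xi_1 = \cdots = \xi_4 = \xi$ with $\xi$ Poisson with parameter 2, for which every simple semi-invariant equals 2 and the simple moment of a block of size $r$ is $E\xi^r$.

```python
import math

def set_partitions(items):
    """Yield every decomposition of the list `items` into nonempty disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # put `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or into a new block of its own
        yield [[first]] + partition

def moment_from_cumulants(index_set, cumulant):
    """Formula (46): sum over decompositions of the product of simple semi-invariants."""
    total = 0.0
    for partition in set_partitions(list(index_set)):
        term = 1.0
        for block in partition:
            term *= cumulant(frozenset(block))
        total += term
    return total

def cumulant_from_moments(index_set, moment):
    """Formula (47): the same sum with weights (-1)^(q-1) (q-1)! and simple moments."""
    total = 0.0
    for partition in set_partitions(list(index_set)):
        q = len(partition)
        term = (-1) ** (q - 1) * math.factorial(q - 1)
        for block in partition:
            term *= moment(frozenset(block))
        total += term
    return total

poisson_moments = {1: 2.0, 2: 6.0, 3: 22.0, 4: 94.0}   # E xi^r for Poisson with parameter 2
I = {1, 2, 3, 4}
print(moment_from_cumulants(I, lambda block: 2.0))                        # 94.0
print(cumulant_from_moments(I, lambda block: poisson_moments[len(block)]))  # 2.0
```

The number of decompositions grows like the Bell numbers (15 for a four-element set), so this brute-force enumeration is meant only for small index sets.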
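EXAMPLE 1. Let $\xi$ be a random variable ($k = 1$) and $m_n = m_\xi^{(n)} = E\xi^n$, $s_n = s_\xi^{(n)}$. Then (40) and (41) imply the following formulas:

$$\begin{aligned}
m_1 &= s_1, \\
m_2 &= s_2 + s_1^2, \\
m_3 &= s_3 + 3s_1s_2 + s_1^3, \\
m_4 &= s_4 + 3s_2^2 + 4s_1s_3 + 6s_1^2s_2 + s_1^4,
\end{aligned} \tag{48}$$

and

$$\begin{aligned}
s_1 &= m_1 = E\xi, \\
s_2 &= m_2 - m_1^2 = V\xi, \\
s_3 &= m_3 - 3m_1m_2 + 2m_1^3, \\
s_4 &= m_4 - 3m_2^2 - 4m_1m_3 + 12m_1^2m_2 - 6m_1^4.
\end{aligned} \tag{49}$$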
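Formulas (48) and (49) are inverse to each other, which is easy to confirm in code. The following sketch simply transcribes the two systems; the function names are illustrative, and the test values come from a Poisson variable with the hypothetical parameter $\lambda = 2$, whose semi-invariants all equal $\lambda$.

```python
def moments_from_semi_invariants(s1, s2, s3, s4):
    """First four moments from the first four semi-invariants, as in (48)."""
    m1 = s1
    m2 = s2 + s1 ** 2
    m3 = s3 + 3 * s1 * s2 + s1 ** 3
    m4 = s4 + 3 * s2 ** 2 + 4 * s1 * s3 + 6 * s1 ** 2 * s2 + s1 ** 4
    return m1, m2, m3, m4

def semi_invariants_from_moments(m1, m2, m3, m4):
    """First four semi-invariants from the first four moments, as in (49)."""
    s1 = m1
    s2 = m2 - m1 ** 2
    s3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    s4 = m4 - 3 * m2 ** 2 - 4 * m1 * m3 + 12 * m1 ** 2 * m2 - 6 * m1 ** 4
    return s1, s2, s3, s4

poisson_moments = moments_from_semi_invariants(2, 2, 2, 2)
print(poisson_moments)                                 # (2, 6, 22, 94)
print(semi_invariants_from_moments(*poisson_moments))  # (2, 2, 2, 2)
```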


EXAMPLE 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$. Since, by (9),

$$\ln\varphi_\xi(t) = itm - \frac{t^2\sigma^2}{2},$$

we have $s_1 = m$, $s_2 = \sigma^2$ by (39), and all the semi-invariants, from the third on, are zero: $s_n = 0$, $n \ge 3$.
We may observe that by Marcinkiewicz's theorem a function $\exp\mathscr{P}(t)$, where $\mathscr{P}$ is a polynomial, can be a characteristic function only when the degree of that polynomial is at most 2. It follows, in particular, that the Gaussian distribution is the only distribution with the property that all its semi-invariants $s_n$ are zero from a certain index onward.
EXAMPLE 3. If $\xi$ is a Poisson random variable with parameter $\lambda > 0$, then by (11)

$$\ln\varphi_\xi(t) = \lambda(e^{it} - 1).$$

It follows that

$$s_n = \lambda \tag{50}$$

for all $n \ge 1$.
EXAMPLE 4. Let $\xi = (\xi_1, \ldots, \xi_n)$ be a random vector. Then

$$\begin{aligned}
m_\xi(1) &= s_\xi(1), \\
m_\xi(1, 2) &= s_\xi(1, 2) + s_\xi(1)s_\xi(2), \\
m_\xi(1, 2, 3) &= s_\xi(1, 2, 3) + s_\xi(1, 2)s_\xi(3) + s_\xi(1, 3)s_\xi(2) + s_\xi(2, 3)s_\xi(1) + s_\xi(1)s_\xi(2)s_\xi(3).
\end{aligned} \tag{51}$$

These formulas show that the simple moments can be expressed in terms of the simple semi-invariants in a very symmetric way. If we put $\xi_1 \equiv \xi_2 \equiv \cdots \equiv \xi_k$, we then, of course, obtain (48).
The group-theoretical origin of the coefficients in (48) becomes clear from (51). It also follows from (51) that

$$s_\xi(1, 2) = m_\xi(1, 2) - m_\xi(1)m_\xi(2) = E\xi_1\xi_2 - E\xi_1E\xi_2, \tag{52}$$

i.e., $s_\xi(1, 2)$ is just the covariance of $\xi_1$ and $\xi_2$.


9. Let $\xi$ be a random variable with distribution function $F = F(x)$ and characteristic function $\varphi(t)$. Let us suppose that all the moments $m_n = E\xi^n$, $n \ge 1$, exist.
It follows from Theorem 2 that a characteristic function uniquely determines a probability distribution. Let us now ask the following question


(uniqueness for the moment problem): Do the moments $\{m_n\}_{n \ge 1}$ determine the probability distribution?
More precisely, let $F$ and $G$ be distribution functions with the same moments, i.e.

$$\int_{-\infty}^{\infty} x^n\,dF(x) = \int_{-\infty}^{\infty} x^n\,dG(x) \tag{53}$$

for all integers $n \ge 0$. The question is whether $F$ and $G$ must be the same.
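In general, the answer is "no." To see this, consider the distribution $F$ with density

$$f(x) = \begin{cases} ke^{-\alpha x^\lambda}, & x > 0, \\ 0, & x \le 0, \end{cases}$$

where $\alpha > 0$, $0 < \lambda < \tfrac{1}{2}$, and $k$ is determined by the condition $\int_0^\infty f(x)\,dx = 1$.
Write $\beta = \alpha\tan\lambda\pi$ and let $g(x) = 0$ for $x \le 0$ and

$$g(x) = ke^{-\alpha x^\lambda}[1 + \varepsilon\sin(\beta x^\lambda)], \qquad |\varepsilon| < 1, \quad x > 0.$$

It is evident that $g(x) \ge 0$. Let us show that

$$\int_0^\infty x^n e^{-\alpha x^\lambda}\sin\beta x^\lambda\,dx = 0 \tag{54}$$

for all integers $n \ge 0$.
For $p > 0$ and complex $q$ with $\operatorname{Re} q > 0$, we have

$$\int_0^\infty t^{p-1}e^{-qt}\,dt = \frac{\Gamma(p)}{q^p}.$$

Take $p = (n + 1)/\lambda$, $q = \alpha + i\beta$, $t = x^\lambda$. Then

$$\begin{aligned}
\int_0^\infty x^{\lambda\{[(n+1)/\lambda] - 1\}}e^{-(\alpha + i\beta)x^\lambda}\lambda x^{\lambda - 1}\,dx
&= \lambda\int_0^\infty x^n e^{-\alpha x^\lambda}\cos\beta x^\lambda\,dx - i\lambda\int_0^\infty x^n e^{-\alpha x^\lambda}\sin\beta x^\lambda\,dx \\
&= \frac{\Gamma\left(\frac{n+1}{\lambda}\right)}{\alpha^{(n+1)/\lambda}(1 + i\tan\lambda\pi)^{(n+1)/\lambda}}.
\end{aligned} \tag{55}$$

But

$$\begin{aligned}
(1 + i\tan\lambda\pi)^{(n+1)/\lambda} &= (\cos\lambda\pi + i\sin\lambda\pi)^{(n+1)/\lambda}(\cos\lambda\pi)^{-(n+1)/\lambda} \\
&= e^{i\pi(n+1)}(\cos\lambda\pi)^{-(n+1)/\lambda} = \cos\pi(n + 1)\cdot(\cos\lambda\pi)^{-(n+1)/\lambda},
\end{aligned}$$

since $\sin\pi(n + 1) = 0$.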


Hence the right-hand side of (55) is real and therefore (54) is valid for all integral $n \ge 0$. Now let $G(x)$ be the distribution function with density $g(x)$. It follows from (54) that the distribution functions $F$ and $G$ have equal moments, i.e. (53) holds for all integers $n \ge 0$.
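The vanishing of the integrals in (54) can also be observed numerically. The sketch below is an illustration under assumptions: it fixes the hypothetical parameters $\lambda = 1/4$ and $\alpha = 1$ (so $\beta = \tan(\pi/4) = 1$), substitutes $u = x^\lambda$ to tame the heavy tail, truncates the range at $u = 80$, and applies Simpson's rule; none of these numerical choices come from the text.

```python
import math

def integral_54(n, lam=0.25, alpha=1.0, upper=80.0, steps=80000):
    """Simpson approximation of  int_0^inf x^n exp(-alpha x^lam) sin(beta x^lam) dx,
    computed after the substitution u = x^lam, with beta = alpha * tan(lam * pi)."""
    beta = alpha * math.tan(lam * math.pi)
    power = (n + 1) / lam - 1          # exponent of u after the change of variable

    def integrand(u):
        return u ** power * math.exp(-alpha * u) * math.sin(beta * u) / lam

    h = upper / steps                  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(j * h)
    return total * h / 3

for n in (0, 1):
    print(n, integral_54(n))  # both values are numerically zero
```

Since these integrals vanish, the perturbed density $g$ has the same moments as $f$ for every admissible $\varepsilon$, which is exactly the non-uniqueness described above.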
We now give some conditions that guarantee the uniqueness of the solution of the moment problem.

Theorem 7. Let $F = F(x)$ be a distribution function and $\mu_n = \int_{-\infty}^{\infty}|x|^n\,dF(x)$. If

$$\overline{\lim_{n \to \infty}}\ \frac{\mu_n^{1/n}}{n} < \infty, \tag{56}$$

the moments $\{m_n\}_{n \ge 1}$, where $m_n = \int_{-\infty}^{\infty} x^n\,dF(x)$, determine the distribution $F = F(x)$ uniquely.

PROOF. It follows from (56) and conclusion (7) of Theorem 1 that there is a $t_0 > 0$ such that, for all $|t| \le t_0$, the characteristic function

$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$$

can be represented in the form

$$\varphi(t) = \sum_{k=0}^{\infty}\frac{(it)^k}{k!}m_k$$

and consequently the moments $\{m_n\}_{n \ge 1}$ uniquely determine the characteristic function $\varphi(t)$ for $|t| \le t_0$.
Take a point $s$ with $|s| \le t_0/2$. Then, as in the proof of (15), we deduce from (56) that

$$\varphi(t) = \sum_{k=0}^{\infty}\frac{(t - s)^k}{k!}\varphi^{(k)}(s)$$

for $|t - s| \le t_0$, where

$$\varphi^{(k)}(s) = i^k\int_{-\infty}^{\infty} x^k e^{isx}\,dF(x)$$

is uniquely determined by the moments $\{m_n\}_{n \ge 1}$. Consequently the moments determine $\varphi(t)$ uniquely for $|t| \le \tfrac{3}{2}t_0$. Continuing this process, we see that $\{m_n\}_{n \ge 1}$ determines $\varphi(t)$ uniquely for all $t$, and therefore also determines $F(x)$.
This completes the proof of the theorem.

Corollary 1. The moments completely determine the probability distribution


if it is concentrated on a finite interval.


Corollary 2. A sufficient condition for the moment problem to have a unique solution is that

$$\overline{\lim_{n \to \infty}}\ \frac{(m_{2n})^{1/2n}}{2n} < \infty. \tag{57}$$

For the proof it is enough to observe that the odd moments can be estimated in terms of the even ones, and then use (56).

EXAMPLE. Let $F(x)$ be the normal distribution function,

$$F(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{x} e^{-t^2/2\sigma^2}\,dt.$$

Then $m_{2n+1} = 0$, $m_{2n} = [(2n)!/2^n n!]\sigma^{2n}$, and it follows from (57) that these are the moments only of the normal distribution.
Finally we state, without proof:
Carleman's test for the uniqueness of the moment problem.
(a) Let $\{m_n\}_{n \ge 1}$ be the moments of a probability distribution, and let

$$\sum_{n=0}^{\infty}\frac{1}{(m_{2n})^{1/2n}} = \infty.$$

Then they determine the probability distribution uniquely.
(b) If $\{m_n\}_{n \ge 1}$ are the moments of a distribution that is concentrated on $[0, \infty)$, then the solution will be unique if we require only that

$$\sum_{n=0}^{\infty}\frac{1}{(m_n)^{1/2n}} = \infty.$$

10. Let $F = F(x)$ and $G = G(x)$ be distribution functions with characteristic functions $f = f(t)$ and $g = g(t)$, respectively. The following theorem, which we give without proof, makes it possible to estimate how close $F$ and $G$ are to each other (in the uniform metric) in terms of the closeness of $f$ and $g$.

Theorem (Esseen's Inequality). Let $G(x)$ have derivative $G'(x)$ with $\sup|G'(x)| \le C$. Then for every $T > 0$

$$\sup_x |F(x) - G(x)| \le \frac{2}{\pi}\int_0^T\left|\frac{f(t) - g(t)}{t}\right|dt + \frac{24}{\pi T}\sup_x |G'(x)|. \tag{58}$$

(This will be used in §6 of Chapter III to prove a theorem on the rapidity of convergence in the central limit theorem.)

11. PROBLEMS

1. Let $\xi$ and $\eta$ be independent random variables, $f(x) = f_1(x) + if_2(x)$, $g(x) = g_1(x) + ig_2(x)$, where $f_k(x)$ and $g_k(x)$ are Borel functions, $k = 1, 2$. Show that if $E|f(\xi)| < \infty$ and $E|g(\eta)| < \infty$, then

$$E|f(\xi)g(\eta)| < \infty$$

and

$$Ef(\xi)g(\eta) = Ef(\xi) \cdot Eg(\eta).$$

2. Let $\xi = (\xi_1, \ldots, \xi_n)$ and $E\|\xi\|^n < \infty$, where $\|\xi\| = +\sqrt{\sum \xi_i^2}$. Show that

$$\varphi_\xi(t) = \sum_{k=0}^{n}\frac{i^k}{k!}E(t, \xi)^k + \varepsilon_n(t)\|t\|^n,$$

where $t = (t_1, \ldots, t_n)$ and $\varepsilon_n(t) \to 0$, $t \to 0$.

3. Prove Theorem 2 for $n$-dimensional distribution functions $F = F_n(x_1, \ldots, x_n)$ and $G = G_n(x_1, \ldots, x_n)$.

4. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function and $\varphi = \varphi(t_1, \ldots, t_n)$ its characteristic function. Using the notation of (3.12), establish the inversion formula

(We are to suppose that $(a, b]$ is an interval of continuity of $P(a, b]$, i.e. for $k = 1, \ldots, n$ the points $a_k$, $b_k$ are points of continuity of the marginal distribution functions $F_k(x_k)$ which are obtained from $F(x_1, \ldots, x_n)$ by taking all the variables except $x_k$ equal to $+\infty$.)

5. Let $\varphi_k(t)$, $k \ge 1$, be a characteristic function, and let the nonnegative numbers $\lambda_k$, $k \ge 1$, satisfy $\sum\lambda_k = 1$. Show that $\sum\lambda_k\varphi_k(t)$ is a characteristic function.
6. If $\varphi(t)$ is a characteristic function, are $\operatorname{Re}\varphi(t)$ and $\operatorname{Im}\varphi(t)$ characteristic functions?
7. Let $\varphi_1$, $\varphi_2$ and $\varphi_3$ be characteristic functions, and $\varphi_1\varphi_2 = \varphi_1\varphi_3$. Does it follow that $\varphi_2 = \varphi_3$?

8. Construct the characteristic functions of the distributions given in Tables 1 and 2 of §3.

9. Let $\xi$ be an integral-valued random variable and $\varphi_\xi(t)$ its characteristic function. Show that

$$P(\xi = k) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-ikt}\varphi_\xi(t)\,dt, \qquad k = 0, \pm 1, \pm 2, \ldots.$$

13. Gaussian Systems


1. Gaussian, or normal, distributions, random variables, processes, and
systems play an extremely important role in probability theory and in
mathematical statistics. This is explained in the first instance by the central
limit theorem (§4 of Chapter III and §8 of Chapter VII), of which the De
Moivre-Laplace limit theorem is a special case (§6, Chapter I). According
to this theorem, the normal distribution is universal in the sense that the
distribution of the sum of a large number of random variables or random
vectors, subject to some not very restrictive conditions, is closely approximated by this distribution.
This is what provides a theoretical explanation of the "law of errors" of
applied statistics, which says that errors of measurement that result from
large numbers of independent "elementary" errors obey the normal distribution.
A multidimensional Gaussian distribution is specified by a small number
of parameters; this is a definite advantage in using it in the construction of
simple probabilistic models. Gaussian random variables have finite second
moments, and consequently they can be studied by Hilbert space methods.
Here it is important that in the Gaussian case "uncorrelated" is equivalent
to "independent," so that the results of $L^2$-theory can be significantly
strengthened.
2. Let us recall that (see §8) a random variable $\xi = \xi(\omega)$ is Gaussian, or
normally distributed, with parameters $m$ and $\sigma^2$ ($\xi \sim \mathcal{N}(m, \sigma^2)$), $|m| < \infty$,
$\sigma^2 > 0$, if its density $f_\xi(x)$ has the form

\[ f_\xi(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-m)^2/2\sigma^2}, \tag{1} \]

where $\sigma = +\sqrt{\sigma^2}$.
As $\sigma \downarrow 0$, the density $f_\xi(x)$ "converges to the $\delta$-function supported at
$x = m$." It is natural to say that $\xi$ is normally distributed with mean $m$
and $\sigma^2 = 0$ ($\xi \sim \mathcal{N}(m, 0)$) if $\xi$ has the property that $P(\xi = m) = 1$.
We can, however, give a definition that applies both to the nondegenerate
($\sigma^2 > 0$) and the degenerate ($\sigma^2 = 0$) cases. Let us consider the characteristic
function $\varphi_\xi(t) \equiv E e^{it\xi}$, $t \in R$.
If $P(\xi = m) = 1$, then evidently

\[ \varphi_\xi(t) = e^{itm}, \tag{2} \]

whereas if $\xi \sim \mathcal{N}(m, \sigma^2)$, $\sigma^2 > 0$,

\[ \varphi_\xi(t) = e^{itm - (1/2) t^2 \sigma^2}. \tag{3} \]

It is obvious that when $\sigma^2 = 0$ the right-hand sides of (2) and (3) are the
same. It follows, by Theorem 1 of §12, that the Gaussian random variable
with parameters $m$ and $\sigma^2$ ($|m| < \infty$, $\sigma^2 \ge 0$) must be the same as the random
variable whose characteristic function is given by (3). This is an illustration
of the "attraction of characteristic functions," a very useful technique in the
multidimensional case.
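
Formula (3) can be checked numerically. The following is a minimal sketch, not part of the original text: it approximates $E e^{it\xi} = \int e^{itx} f_\xi(x)\,dx$ by a midpoint rule and compares the result with the closed form. The function names, the integration window of twelve standard deviations, and the step count are all illustrative assumptions.

```python
import cmath
import math


def normal_density(x, m, sigma):
    """Density (1) of N(m, sigma^2) for sigma > 0."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def char_fn_numeric(t, m, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E exp(i t xi) for xi ~ N(m, sigma^2).

    The integral is truncated to [m - half_width*sigma, m + half_width*sigma];
    the neglected tails are far below double precision for half_width = 12.
    """
    a = m - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0j
    for j in range(steps):
        x = a + (j + 0.5) * h
        total += cmath.exp(1j * t * x) * normal_density(x, m, sigma)
    return total * h


def char_fn_formula(t, m, sigma2):
    """Right-hand side of (3); for sigma2 = 0 it reduces to (2)."""
    return cmath.exp(1j * t * m - 0.5 * t * t * sigma2)
```

For instance, `char_fn_numeric(0.7, 1.0, 2.0)` and `char_fn_formula(0.7, 1.0, 4.0)` should agree to many decimal places, while `char_fn_formula(t, m, 0.0)` is exactly the degenerate case (2).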

Let $\xi = (\xi_1, \ldots, \xi_n)$ be a random vector and

\[ \varphi_\xi(t) = E e^{i(t, \xi)}, \qquad t = (t_1, \ldots, t_n) \in R^n, \tag{4} \]

its characteristic function (see Definition 2, §12).

Definition 1. A random vector $\xi = (\xi_1, \ldots, \xi_n)$ is Gaussian, or normally
distributed, if its characteristic function has the form

\[ \varphi_\xi(t) = e^{i(t, m) - (1/2)(\mathbb{R} t, t)}, \tag{5} \]

where $m = (m_1, \ldots, m_n)$, $|m_k| < \infty$ and $\mathbb{R} = \|r_{kl}\|$ is a symmetric nonnegative definite $n \times n$ matrix; we use the abbreviation $\xi \sim \mathcal{N}(m, \mathbb{R})$.

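This definition immediately makes us ask whether (5) is in fact a characteristic function. Let us show that it is.
First suppose that $\mathbb{R}$ is nonsingular. Then we can define the inverse
$A = \mathbb{R}^{-1}$ and the function

\[ f(x) = \frac{|A|^{1/2}}{(2\pi)^{n/2}} \exp\{-\tfrac{1}{2}(A(x - m), (x - m))\}, \tag{6} \]

where $x = (x_1, \ldots, x_n)$ and $|A| = \det A$. This function is nonnegative. Let
us show that

\[ \int_{R^n} e^{i(t, x)} f(x) \, dx = e^{i(t, m) - (1/2)(\mathbb{R} t, t)}, \]

or equivalently that

\[ I_n \equiv \int_{R^n} e^{i(t, x - m)} \frac{|A|^{1/2}}{(2\pi)^{n/2}} e^{-(1/2)(A(x - m), (x - m))} \, dx = e^{-(1/2)(\mathbb{R} t, t)}. \tag{7} \]

Let us make the change of variable

\[ x - m = \mathcal{O} u, \qquad t = \mathcal{O} v, \]

where $\mathcal{O}$ is an orthogonal matrix such that

\[ \mathcal{O}^T \mathbb{R} \mathcal{O} = D \]

and

\[ D = \operatorname{diag}(d_1, \ldots, d_n) \]

is a diagonal matrix with $d_i \ge 0$ (see the proof of the lemma in §8). Since
$|\mathbb{R}| = \det \mathbb{R} \ne 0$, we have $d_i > 0$, $i = 1, \ldots, n$. Therefore

\[ |A| = |\mathbb{R}^{-1}| = d_1^{-1} \cdots d_n^{-1}. \tag{8} \]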

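Moreover (for notation, see Subsection 1, §12)

\[
\begin{aligned}
i(t, x - m) - \tfrac{1}{2}(A(x - m), x - m) &= i(\mathcal{O} v, \mathcal{O} u) - \tfrac{1}{2}(A \mathcal{O} u, \mathcal{O} u) \\
&= i(\mathcal{O} v)^T \mathcal{O} u - \tfrac{1}{2}(\mathcal{O} u)^T A (\mathcal{O} u) \\
&= i v^T u - \tfrac{1}{2} u^T \mathcal{O}^T A \mathcal{O} u \\
&= i v^T u - \tfrac{1}{2} u^T D^{-1} u.
\end{aligned}
\]

Together with (8) and (12.9), this yields

\[
\begin{aligned}
I_n &= (2\pi)^{-n/2} (d_1 \cdots d_n)^{-1/2} \int_{R^n} \exp(i v^T u - \tfrac{1}{2} u^T D^{-1} u) \, du \\
&= \prod_{k=1}^{n} (2\pi d_k)^{-1/2} \int_{-\infty}^{\infty} \exp\left(i v_k u_k - \frac{u_k^2}{2 d_k}\right) du_k = \prod_{k=1}^{n} \exp(-\tfrac{1}{2} v_k^2 d_k) \\
&= \exp(-\tfrac{1}{2} v^T D v) = \exp(-\tfrac{1}{2} v^T \mathcal{O}^T \mathbb{R} \mathcal{O} v) = \exp(-\tfrac{1}{2} t^T \mathbb{R} t) = \exp(-\tfrac{1}{2}(\mathbb{R} t, t)).
\end{aligned}
\]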


It also follows from (6) that

\[ \int_{R^n} f(x) \, dx = 1. \tag{9} \]

Therefore (5) is the characteristic function of a nondegenerate $n$-dimensional Gaussian distribution (see Subsection 3, §3).
Now let $\mathbb{R}$ be singular. Take $\varepsilon > 0$ and consider the positive definite
symmetric matrix $\mathbb{R}^\varepsilon \equiv \mathbb{R} + \varepsilon E$. Then by what has been proved,

\[ \varphi^\varepsilon(t) = \exp\{i(t, m) - \tfrac{1}{2}(\mathbb{R}^\varepsilon t, t)\} \]

is a characteristic function:

\[ \varphi^\varepsilon(t) = \int_{R^n} e^{i(t, x)} \, dF_\varepsilon(x), \]

where $F_\varepsilon(x) = F_\varepsilon(x_1, \ldots, x_n)$ is an $n$-dimensional distribution function.
As $\varepsilon \to 0$,

\[ \varphi^\varepsilon(t) \to \varphi(t) = \exp\{i(t, m) - \tfrac{1}{2}(\mathbb{R} t, t)\}. \]

The limit function $\varphi(t)$ is continuous at $(0, \ldots, 0)$. Hence, by Theorem 1
and Problem 1 of §3 of Chapter III, it is a characteristic function.
We have therefore established Theorem 1.
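3. Let us now discuss the significance of the vector $m$ and the matrix
$\mathbb{R} = \|r_{kl}\|$ that appear in (5).
Since

\[ \ln \varphi_\xi(t) = i(t, m) - \tfrac{1}{2}(\mathbb{R} t, t) = i \sum_{k=1}^{n} t_k m_k - \frac{1}{2} \sum_{k,l=1}^{n} r_{kl} t_k t_l, \tag{10} \]

we find from (12.35) and the formulas that connect the moments and the
semi-invariants that

\[ m_1 = s_\xi^{(1, 0, \ldots, 0)} = E\xi_1, \quad \ldots, \quad m_n = s_\xi^{(0, \ldots, 0, 1)} = E\xi_n. \]

Similarly

\[ r_{11} = s_\xi^{(2, 0, \ldots, 0)} = V\xi_1, \]

and generally

\[ r_{kl} = \operatorname{cov}(\xi_k, \xi_l). \]

Consequently $m$ is the mean-value vector of $\xi$ and $\mathbb{R}$ is its covariance
matrix.
If $\mathbb{R}$ is nonsingular, we can obtain this result in a different way. In fact,
in this case $\xi$ has a density $f(x)$ given by (6).
A direct calculation shows that

\[
\begin{aligned}
E\xi_k &\equiv \int x_k f(x) \, dx = m_k, \\
\operatorname{cov}(\xi_k, \xi_l) &= \int (x_k - m_k)(x_l - m_l) f(x) \, dx = r_{kl}.
\end{aligned}
\tag{11}
\]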

4. Let us discuss some properties of Gaussian vectors.

Theorem 1
(a) The components of a Gaussian vector are uncorrelated if and only if they
are independent.
(b) A vector $\xi = (\xi_1, \ldots, \xi_n)$ is Gaussian if and only if, for every vector
$\lambda = (\lambda_1, \ldots, \lambda_n)$, $\lambda_k \in R$, the random variable $(\xi, \lambda) = \lambda_1 \xi_1 + \cdots + \lambda_n \xi_n$
has a Gaussian distribution.

PROOF. (a) If the components of $\xi = (\xi_1, \ldots, \xi_n)$ are uncorrelated, it follows
from the form of the characteristic function $\varphi_\xi(t)$ that it is a product of
characteristic functions. Therefore, by Theorem 4 of §12, the components
are independent.
The converse is evident, since independence always implies lack of correlation.
(b) If $\xi$ is a Gaussian vector, it follows from (5) that

\[ E \exp\{it(\xi_1 \lambda_1 + \cdots + \xi_n \lambda_n)\} = \exp\Big\{it\Big(\sum \lambda_k m_k\Big) - \frac{t^2}{2}\Big(\sum r_{kl} \lambda_k \lambda_l\Big)\Big\}, \qquad t \in R, \]

and consequently

\[ (\xi, \lambda) \sim \mathcal{N}\Big(\sum \lambda_k m_k, \ \sum r_{kl} \lambda_k \lambda_l\Big). \]

Conversely, to say that the random variable $(\xi, \lambda) = \xi_1 \lambda_1 + \cdots + \xi_n \lambda_n$
is Gaussian means, in particular, that

\[ E e^{i(\xi, \lambda)} = \exp\{i E(\xi, \lambda) - \tfrac{1}{2} V(\xi, \lambda)\} = \exp\Big\{i \sum \lambda_k E\xi_k - \frac{1}{2} \sum \lambda_k \lambda_l \operatorname{cov}(\xi_k, \xi_l)\Big\}. \]

Since $\lambda_1, \ldots, \lambda_n$ are arbitrary it follows from Definition 1 that the vector
$\xi = (\xi_1, \ldots, \xi_n)$ is Gaussian.
This completes the proof of the theorem.
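
Part (b) says that every linear form $(\xi, \lambda)$ of a Gaussian vector is Gaussian with mean $\sum \lambda_k m_k$ and variance $\sum r_{kl} \lambda_k \lambda_l$. The sketch below simply evaluates these two parameters for given $\lambda$, $m$, and $\mathbb{R}$; the function name and the list-of-lists matrix representation are illustrative choices, not notation from the text.

```python
def linear_form_params(lam, m, R):
    """Mean and variance of (xi, lam) when xi ~ N(m, R).

    lam, m are sequences of length n; R is an n x n covariance matrix
    given as a list of rows.
    """
    n = len(lam)
    mean = sum(lam[k] * m[k] for k in range(n))
    var = sum(R[k][l] * lam[k] * lam[l] for k in range(n) for l in range(n))
    return mean, var


# Example: xi_1 + xi_2 and xi_1 - xi_2 for a two-dimensional Gaussian vector.
m = [1.0, -2.0]
R = [[2.0, 1.0], [1.0, 3.0]]
sum_params = linear_form_params([1.0, 1.0], m, R)    # (-1.0, 7.0)
diff_params = linear_form_params([1.0, -1.0], m, R)  # (3.0, 3.0)
```

The variance is the quadratic form $(\mathbb{R}\lambda, \lambda)$, so it is nonnegative precisely because $\mathbb{R}$ is nonnegative definite.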

Remark. Let $(\theta, \xi)$ be a Gaussian vector with $\theta = (\theta_1, \ldots, \theta_k)$ and $\xi = (\xi_1, \ldots, \xi_l)$.
If $\theta$ and $\xi$ are uncorrelated, i.e. $\operatorname{cov}(\theta_i, \xi_j) = 0$, $i = 1, \ldots, k$;
$j = 1, \ldots, l$, they are independent.

The proof is the same as for conclusion (a) of the theorem.


Let $\xi = (\xi_1, \ldots, \xi_n)$ be a Gaussian vector; let us suppose, for simplicity,
that its mean-value vector is zero. If $\operatorname{rank} \mathbb{R} = r < n$, then (as was shown in
§11), there are $n - r$ linear relations connecting $\xi_1, \ldots, \xi_n$. We may then
suppose that, say, $\xi_1, \ldots, \xi_r$ are linearly independent, and the others can
be expressed linearly in terms of them. Hence all the basic properties of the
vector $\xi = (\xi_1, \ldots, \xi_n)$ are determined by the first $r$ components $(\xi_1, \ldots, \xi_r)$
for which the corresponding covariance matrix is already known to be
nonsingular.
Thus we may suppose that the original vector $\xi = (\xi_1, \ldots, \xi_n)$ had linearly
independent components and therefore that $|\mathbb{R}| > 0$.
Let $\mathcal{O}$ be an orthogonal matrix that diagonalizes $\mathbb{R}$,

\[ \mathcal{O}^T \mathbb{R} \mathcal{O} = D. \]

The diagonal elements of $D$ are positive and therefore determine the inverse
matrix. Put $B^2 = D$ and

\[ \beta = B^{-1} \mathcal{O}^T \xi. \]

Then it is easily verified that

\[ E e^{i(t, \beta)} = e^{-(1/2)(Et, t)}, \]

i.e. the vector $\beta = (\beta_1, \ldots, \beta_n)$ is a Gaussian vector with components that are
uncorrelated and therefore (Theorem 1) independent. Then if we write
$A = \mathcal{O} B$ we find that the original Gaussian vector $\xi = (\xi_1, \ldots, \xi_n)$ can be
represented as

\[ \xi = A\beta, \tag{12} \]

where $\beta = (\beta_1, \ldots, \beta_n)$ is a Gaussian vector with independent components,
$\beta_k \sim \mathcal{N}(0, 1)$. Hence we have the following result. Let $\xi = (\xi_1, \ldots, \xi_n)$ be a
vector with linearly independent components such that $E\xi_k = 0$, $k = 1, \ldots, n$.
This vector is Gaussian if and only if there are independent Gaussian
variables $\beta_1, \ldots, \beta_n$, $\beta_k \sim \mathcal{N}(0, 1)$, and a nonsingular matrix $A$ of order $n$
such that $\xi = A\beta$. Here $\mathbb{R} = AA^T$ is the covariance matrix of $\xi$.
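
The representation $\xi = A\beta$ with $\mathbb{R} = AA^T$ is also how Gaussian vectors are usually simulated. The sketch below is only an illustration: instead of the matrix $A = \mathcal{O}B$ built above from an orthogonal diagonalization, it uses the lower-triangular Cholesky factor, which is a different matrix with the same property $AA^T = \mathbb{R}$ when $\mathbb{R}$ is positive definite. All function names are hypothetical.

```python
import math
import random


def cholesky(R):
    """Lower-triangular A with A A^T = R, for a symmetric positive definite R."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(R[i][i] - s)
            else:
                A[i][j] = (R[i][j] - s) / A[j][j]
    return A


def times_transpose(A):
    """Return the product A A^T."""
    n = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sample_gaussian_vector(A, rng):
    """One draw of xi = A beta, with beta_k independent N(0, 1)."""
    n = len(A)
    beta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(A[i][k] * beta[k] for k in range(n)) for i in range(n)]


R = [[4.0, 2.0], [2.0, 3.0]]
A = cholesky(R)                      # [[2, 0], [1, sqrt(2)]]
xi = sample_gaussian_vector(A, random.Random(0))
```

Any nonsingular $A$ with $AA^T = \mathbb{R}$ yields the same distribution $\mathcal{N}(0, \mathbb{R})$ for $A\beta$; the triangular factor is chosen here only because it needs no eigenvalue computation.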
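If $|\mathbb{R}| \ne 0$, then by the Gram-Schmidt method (see §11)

\[ \xi_k = \hat{\xi}_k + b_k \varepsilon_k, \qquad k = 1, \ldots, n, \tag{13} \]

where, since $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_k) \sim \mathcal{N}(0, E)$ is a Gaussian vector,

\[ \hat{\xi}_k = \sum_{l=1}^{k-1} (\xi_k, \varepsilon_l) \varepsilon_l, \tag{14} \]

\[ b_k = \|\xi_k - \hat{\xi}_k\| \tag{15} \]

and

\[ \mathscr{L}\{\xi_1, \ldots, \xi_k\} = \mathscr{L}\{\varepsilon_1, \ldots, \varepsilon_k\}. \tag{16} \]

We see immediately from the orthogonal decomposition (13) that

\[ \hat{\xi}_k = E(\xi_k \mid \xi_{k-1}, \ldots, \xi_1). \tag{17} \]

From this, with (16) and (14), it follows that in the Gaussian case the conditional expectation $E(\xi_k \mid \xi_{k-1}, \ldots, \xi_1)$ is a linear function of $(\xi_1, \ldots, \xi_{k-1})$:

\[ E(\xi_k \mid \xi_{k-1}, \ldots, \xi_1) = \sum_{i=1}^{k-1} a_i \xi_i. \tag{18} \]

(This was proved in §8 for the case $k = 2$.)
Since, according to a remark made in Theorem 1 of §8, $E(\xi_k \mid \xi_{k-1}, \ldots, \xi_1)$
is an optimal estimator (in the mean-square sense) for $\xi_k$ in terms of
$\xi_1, \ldots, \xi_{k-1}$, it follows from (18) that in the Gaussian case the optimal
estimator is linear.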
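We shall use these results in looking for optimal estimators of $\theta = (\theta_1, \ldots, \theta_k)$
in terms of $\xi = (\xi_1, \ldots, \xi_l)$ under the hypothesis that $(\theta, \xi)$ is Gaussian. Let

\[ m_\theta = E\theta, \qquad m_\xi = E\xi \]

be the column-vector mean values and

\[
\begin{aligned}
V_{\theta\theta} &\equiv \operatorname{cov}(\theta, \theta) \equiv \|\operatorname{cov}(\theta_i, \theta_j)\|, & 1 \le i, j \le k, \\
V_{\theta\xi} &\equiv \operatorname{cov}(\theta, \xi) \equiv \|\operatorname{cov}(\theta_i, \xi_j)\|, & 1 \le i \le k, \ 1 \le j \le l, \\
V_{\xi\xi} &\equiv \operatorname{cov}(\xi, \xi) \equiv \|\operatorname{cov}(\xi_i, \xi_j)\|, & 1 \le i, j \le l,
\end{aligned}
\]

the covariance matrices. Let us suppose that $V_{\xi\xi}$ has an inverse. Then we have
the following theorem.

Theorem 2 (Theorem on Normal Correlation). For a Gaussian vector $(\theta, \xi)$,
the optimal estimator $E(\theta \mid \xi)$ of $\theta$ in terms of $\xi$, and its error matrix

\[ \Delta = E[\theta - E(\theta \mid \xi)][\theta - E(\theta \mid \xi)]^T \]

are given by the formulas

\[ E(\theta \mid \xi) = m_\theta + V_{\theta\xi} V_{\xi\xi}^{-1} (\xi - m_\xi), \tag{19} \]

\[ \Delta = V_{\theta\theta} - V_{\theta\xi} V_{\xi\xi}^{-1} (V_{\theta\xi})^T. \tag{20} \]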

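PROOF. Form the vector

\[ \eta = (\theta - m_\theta) - V_{\theta\xi} V_{\xi\xi}^{-1} (\xi - m_\xi). \tag{21} \]

We can verify at once that $E\eta(\xi - m_\xi)^T = 0$, i.e. $\eta$ is not correlated with
$(\xi - m_\xi)$. But since $(\theta, \xi)$ is Gaussian, the vector $(\eta, \xi)$ is also Gaussian. Hence
by the remark on Theorem 1, $\eta$ and $\xi - m_\xi$ are independent. Therefore $\eta$ and
$\xi$ are independent, and consequently $E(\eta \mid \xi) = E\eta = 0$. Therefore

\[ E[\theta - m_\theta \mid \xi] - V_{\theta\xi} V_{\xi\xi}^{-1} (\xi - m_\xi) = 0, \]

which establishes (19).
To establish (20) we consider the conditional covariance

\[ \operatorname{cov}(\theta, \theta \mid \xi) \equiv E[(\theta - E(\theta \mid \xi))(\theta - E(\theta \mid \xi))^T \mid \xi]. \tag{22} \]

Since $\theta - E(\theta \mid \xi) = \eta$, and $\eta$ and $\xi$ are independent, we find that

\[
\begin{aligned}
\operatorname{cov}(\theta, \theta \mid \xi) = E(\eta\eta^T \mid \xi) = E\eta\eta^T
&= V_{\theta\theta} + V_{\theta\xi} V_{\xi\xi}^{-1} V_{\xi\xi} V_{\xi\xi}^{-1} V_{\theta\xi}^T - 2 V_{\theta\xi} V_{\xi\xi}^{-1} V_{\xi\xi} V_{\xi\xi}^{-1} V_{\theta\xi}^T \\
&= V_{\theta\theta} - V_{\theta\xi} V_{\xi\xi}^{-1} V_{\theta\xi}^T.
\end{aligned}
\]

Since $\operatorname{cov}(\theta, \theta \mid \xi)$ does not depend on "chance," we have

\[ \Delta = E \operatorname{cov}(\theta, \theta \mid \xi) = \operatorname{cov}(\theta, \theta \mid \xi), \]

and this establishes (20).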
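
In the simplest case $k = l = 1$ the matrices in (19) and (20) are numbers, and the formulas can be evaluated directly. The following sketch does just that; the function and argument names are illustrative, and the case of vector-valued $\theta$, $\xi$ (which needs a matrix inverse) is deliberately left out.

```python
def normal_correlation_scalar(m_theta, m_xi, v_theta, v_xi, cov_theta_xi, x):
    """Scalar case of (19) and (20) for a Gaussian pair (theta, xi).

    Returns the optimal estimate E(theta | xi = x) and its mean-square
    error Delta. Requires v_xi > 0 (V_xixi must be invertible).
    """
    gain = cov_theta_xi / v_xi                      # V_thetaxi * V_xixi^{-1}
    estimate = m_theta + gain * (x - m_xi)          # formula (19)
    error = v_theta - cov_theta_xi ** 2 / v_xi      # formula (20)
    return estimate, error


estimate, error = normal_correlation_scalar(
    m_theta=1.0, m_xi=2.0, v_theta=4.0, v_xi=2.0, cov_theta_xi=1.0, x=4.0
)
```

With these numbers the estimate is $1 + \tfrac12(4 - 2) = 2$ and the error is $4 - \tfrac12 = 3.5$. The error does not depend on the observed value $x$, in agreement with the remark that $\operatorname{cov}(\theta, \theta \mid \xi)$ does not depend on "chance."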


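Corollary. Let $(\theta, \xi_1, \ldots, \xi_n)$ be an $(n + 1)$-dimensional Gaussian vector, with
$\xi_1, \ldots, \xi_n$ independent. Then

\[ E(\theta \mid \xi_1, \ldots, \xi_n) = E\theta + \sum_{i=1}^{n} \frac{\operatorname{cov}(\theta, \xi_i)}{V\xi_i} (\xi_i - E\xi_i), \]

\[ \Delta = V\theta - \sum_{i=1}^{n} \frac{\operatorname{cov}^2(\theta, \xi_i)}{V\xi_i} \]

(cf. (8.12) and (8.13)).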


5. Let $\xi_1, \xi_2, \ldots$ be a sequence of Gaussian random vectors that converge
in probability to $\xi$. Let us show that $\xi$ is also Gaussian.
In accordance with (b) of Theorem 1, it is enough to establish this only for
random variables.
Let $m_n = E\xi_n$, $\sigma_n^2 = V\xi_n$. Then by Lebesgue's dominated convergence
theorem

\[ \lim_{n \to \infty} e^{itm_n - (1/2)\sigma_n^2 t^2} = \lim_{n \to \infty} E e^{it\xi_n} = E e^{it\xi}. \]

It follows from the existence of the limit on the left-hand side that there are
numbers $m$ and $\sigma^2$ such that

\[ m = \lim_{n \to \infty} m_n, \qquad \sigma^2 = \lim_{n \to \infty} \sigma_n^2. \]

Consequently

\[ E e^{it\xi} = e^{itm - (1/2)\sigma^2 t^2}, \]

i.e. $\xi \sim \mathcal{N}(m, \sigma^2)$.
It follows, in particular, that the closed linear manifold $\mathscr{L}(\xi_1, \xi_2, \ldots)$
generated by the Gaussian variables $\xi_1, \xi_2, \ldots$ (see §11, Subsection 5) consists
of Gaussian variables.

6. We now turn to the concept of Gaussian systems in general.

Definition 2. A collection of random variables $\xi = (\xi_\alpha)$, where $\alpha$ belongs to
some index set $\mathfrak{A}$, is a Gaussian system if the random vector $(\xi_{\alpha_1}, \ldots, \xi_{\alpha_n})$ is
Gaussian for every $n \ge 1$ and all indices $\alpha_1, \ldots, \alpha_n$ chosen from $\mathfrak{A}$.
Let us notice some properties of Gaussian systems.
(a) If $\xi = (\xi_\alpha)$, $\alpha \in \mathfrak{A}$, is a Gaussian system, then every subsystem
$\xi' = (\xi_{\alpha'})$, $\alpha' \in \mathfrak{A}' \subseteq \mathfrak{A}$, is also Gaussian.
(b) If $\xi_\alpha$, $\alpha \in \mathfrak{A}$, are independent Gaussian variables, then the system
$\xi = (\xi_\alpha)$, $\alpha \in \mathfrak{A}$, is Gaussian.
(c) If $\xi = (\xi_\alpha)$, $\alpha \in \mathfrak{A}$, is a Gaussian system, the closed linear manifold $\mathscr{L}(\xi)$,
consisting of all variables of the form $\sum_{i=1}^{n} c_{\alpha_i} \xi_{\alpha_i}$, together with their
mean-square limits, forms a Gaussian system.

Let us observe that the converse of (a) is false in general. For example,
let $\xi_1$ and $\eta_1$ be independent and $\xi_1 \sim \mathcal{N}(0, 1)$, $\eta_1 \sim \mathcal{N}(0, 1)$. Define the
system

\[ (\xi, \eta) = \begin{cases} (\xi_1, |\eta_1|) & \text{if } \xi_1 \ge 0, \\ (\xi_1, -|\eta_1|) & \text{if } \xi_1 < 0. \end{cases} \tag{23} \]

Then it is easily verified that $\xi$ and $\eta$ are both Gaussian, but $(\xi, \eta)$ is not.
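
A short simulation makes the counterexample (23) concrete. The sketch below, with hypothetical function names and an arbitrary seed and sample size, draws pairs $(\xi, \eta)$ according to (23) and reports the sample mean and variance of $\eta$ together with the number of pairs whose components have opposite signs. The marginal statistics of $\eta$ look standard normal, yet $\xi\eta \ge 0$ always; a Gaussian pair with $\mathcal{N}(0,1)$ marginals and correlation strictly less than one would put positive probability on $\{\xi\eta < 0\}$.

```python
import random


def sample_pair(rng):
    """One draw of (xi, eta) defined by (23)."""
    xi1 = rng.gauss(0.0, 1.0)
    eta1 = rng.gauss(0.0, 1.0)
    eta = abs(eta1) if xi1 >= 0 else -abs(eta1)
    return xi1, eta


def summarize(n=100_000, seed=0):
    """Sample mean and variance of eta, and the count of pairs with xi * eta < 0."""
    rng = random.Random(seed)
    pairs = [sample_pair(rng) for _ in range(n)]
    mean_eta = sum(e for _, e in pairs) / n
    var_eta = sum(e * e for _, e in pairs) / n - mean_eta ** 2
    opposite_signs = sum(1 for x, e in pairs if x * e < 0)
    return mean_eta, var_eta, opposite_signs


mean_eta, var_eta, opposite_signs = summarize()
```

This is evidence rather than proof: the simulation only suggests that $\eta$ is standard normal, while the sign constraint holds exactly by construction.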
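Let $\xi = (\xi_\alpha)_{\alpha \in \mathfrak{A}}$ be a Gaussian system with mean-value vector $m = (m_\alpha)$,
$\alpha \in \mathfrak{A}$, and covariance matrix $\mathbb{R} = (r_{\alpha\beta})_{\alpha, \beta \in \mathfrak{A}}$, where $m_\alpha = E\xi_\alpha$. Then $\mathbb{R}$ is
evidently symmetric ($r_{\alpha\beta} = r_{\beta\alpha}$) and nonnegative definite in the sense that
for every vector $c = (c_\alpha)_{\alpha \in \mathfrak{A}}$ with values in $R^{\mathfrak{A}}$, and only a finite number of
nonzero coordinates $c_\alpha$,

\[ (\mathbb{R} c, c) \equiv \sum_{\alpha, \beta} r_{\alpha\beta} c_\alpha c_\beta \ge 0. \tag{24} \]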

We now ask the converse question. Suppose that we are given a parameter
set $\mathfrak{A} = \{\alpha\}$, a vector $m = (m_\alpha)_{\alpha \in \mathfrak{A}}$ and a symmetric nonnegative definite
matrix $\mathbb{R} = (r_{\alpha\beta})_{\alpha, \beta \in \mathfrak{A}}$. Do there exist a probability space $(\Omega, \mathscr{F}, P)$ and a
Gaussian system of random variables $\xi = (\xi_\alpha)_{\alpha \in \mathfrak{A}}$ on it, such that

\[ E\xi_\alpha = m_\alpha, \qquad \operatorname{cov}(\xi_\alpha, \xi_\beta) = r_{\alpha\beta}, \qquad \alpha, \beta \in \mathfrak{A}? \]

If we take a finite set $\alpha_1, \ldots, \alpha_n$, then for the vector $\bar{m} = (m_{\alpha_1}, \ldots, m_{\alpha_n})$
and the matrix $\overline{\mathbb{R}} = (r_{\alpha\beta})$, $\alpha, \beta = \alpha_1, \ldots, \alpha_n$, we can construct in $R^n$ the
Gaussian distribution $F_{\alpha_1, \ldots, \alpha_n}(x_1, \ldots, x_n)$ with characteristic function

\[ \varphi(t) = \exp\{i(t, \bar{m}) - \tfrac{1}{2}(\overline{\mathbb{R}} t, t)\}, \qquad t = (t_{\alpha_1}, \ldots, t_{\alpha_n}). \]

It is easily verified that the family

\[ \{F_{\alpha_1, \ldots, \alpha_n}(x_1, \ldots, x_n); \ \alpha_i \in \mathfrak{A}\} \]

is consistent. Consequently by Kolmogorov's theorem (Theorem 1, §9,
and Remark 2 on this) the answer to our question is positive.
7. If $\mathfrak{A} = \{1, 2, \ldots\}$, then in accordance with the terminology of §5 the
system of random variables $\xi = (\xi_\alpha)_{\alpha \in \mathfrak{A}}$ is a random sequence and is denoted
by $\xi = (\xi_1, \xi_2, \ldots)$. A Gaussian sequence is completely described by its
mean-value vector $m = (m_1, m_2, \ldots)$ and covariance matrix $\mathbb{R} = \|r_{ij}\|$,
$r_{ij} = \operatorname{cov}(\xi_i, \xi_j)$. In particular, if $r_{ij} = \sigma_i^2 \delta_{ij}$ then $\xi = (\xi_1, \xi_2, \ldots)$ is a
Gaussian sequence of independent random variables with $\xi_i \sim \mathcal{N}(m_i, \sigma_i^2)$,
$i \ge 1$.
When $\mathfrak{A} = [0, 1], [0, \infty), (-\infty, \infty), \ldots$, the system $\xi = (\xi_t)$, $t \in \mathfrak{A}$, is a
random process with continuous time.
Let us mention some examples of Gaussian random processes. If we take
their mean values to be zero, their probabilistic properties are completely
described by the covariance matrices $\|r_{st}\|$. We write $r(s, t)$ instead of $r_{st}$
and call it the covariance function.
EXAMPLE 1. If $T = [0, \infty)$ and

\[ r(s, t) = \min(s, t), \tag{25} \]

the Gaussian process $\xi = (\xi_t)_{t \ge 0}$ with this covariance function (see Problem
2) and $\xi_0 \equiv 0$ is a Brownian motion or Wiener process.
Observe that this process has independent increments; that is, for arbitrary
$t_1 < t_2 < \cdots < t_n$ the random variables

\[ \xi_{t_2} - \xi_{t_1}, \ \ldots, \ \xi_{t_n} - \xi_{t_{n-1}} \]

are independent. In fact, because the process is Gaussian it is enough to
verify only that the increments are uncorrelated. But if $s < t < u < v$ then

\[ E[\xi_t - \xi_s][\xi_v - \xi_u] = [r(t, v) - r(t, u)] - [r(s, v) - r(s, u)] = (t - t) - (s - s) = 0. \]
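
The covariance of two increments depends only on the covariance function, so the computation above is easy to reproduce. In the sketch below (function names are illustrative) the expression $[r(t,v) - r(t,u)] - [r(s,v) - r(s,u)]$ is evaluated for $r(s,t) = \min(s,t)$, first for disjoint intervals and then for overlapping ones.

```python
def wiener_cov(s, t):
    """Covariance function (25) of the Wiener process."""
    return min(s, t)


def increment_cov(s, t, u, v, cov=wiener_cov):
    """E[(xi_t - xi_s)(xi_v - xi_u)] for a zero-mean process with covariance `cov`."""
    return (cov(t, v) - cov(t, u)) - (cov(s, v) - cov(s, u))


disjoint = increment_cov(0.5, 1.0, 2.0, 3.5)     # s < t < u < v
overlapping = increment_cov(0.0, 2.0, 1.0, 3.0)  # intervals share [1, 2]
```

For disjoint intervals the result is $0$; for the overlapping intervals $(0, 2]$ and $(1, 3]$ it equals $1$, the length of the overlap.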

EXAMPLE 2. The process $\xi = (\xi_t)$, $0 \le t \le 1$, with $\xi_0 \equiv 0$ and

\[ r(s, t) = \min(s, t) - st \tag{26} \]

is a conditional Wiener process (observe that since $r(1, 1) = 0$ we have
$P(\xi_1 = 0) = 1$).

EXAMPLE 3. The process $\xi = (\xi_t)$, $-\infty < t < \infty$, with

\[ r(s, t) = e^{-|t - s|} \tag{27} \]

is a Gauss-Markov process.

8. PROBLEMS

1. Let $\xi_1$, $\xi_2$, $\xi_3$ be independent Gaussian random variables, $\xi_i \sim \mathcal{N}(0, 1)$. Show that

\[ \frac{\xi_1 + \xi_2 \xi_3}{\sqrt{1 + \xi_3^2}} \sim \mathcal{N}(0, 1). \]

(In this case we encounter the interesting problem of describing the nonlinear
transformations of independent Gaussian variables $\xi_1, \ldots, \xi_n$ whose distributions
are still Gaussian.)

2. Show that (25), (26) and (27) are nonnegative definite (and consequently are actually
covariance functions).

3. Let $A$ be an $m \times n$ matrix. An $n \times m$ matrix $A^{\oplus}$ is a pseudoinverse of $A$ if there are
matrices $U$ and $V$ such that

\[ A A^{\oplus} A = A, \qquad A^{\oplus} = U A^T = A^T V. \]

Show that $A^{\oplus}$ exists and is unique.

4. Show that (19) and (20) in the theorem on normal correlation remain valid when
$V_{\xi\xi}$ is singular provided that $V_{\xi\xi}^{-1}$ is replaced by $V_{\xi\xi}^{\oplus}$.

5. Let $(\theta, \xi) = (\theta_1, \ldots, \theta_k; \xi_1, \ldots, \xi_l)$ be a Gaussian vector with nonsingular matrix
$\Delta \equiv V_{\theta\theta} - V_{\theta\xi} V_{\xi\xi}^{\oplus} V_{\theta\xi}^T$. Show that the distribution function

\[ P(\theta \le a \mid \xi) = P(\theta_1 \le a_1, \ldots, \theta_k \le a_k \mid \xi) \]

has ($P$-a.s.) the density $p(a_1, \ldots, a_k \mid \xi)$ defined by

\[ \frac{|\Delta^{-1/2}|}{(2\pi)^{k/2}} \exp\{-\tfrac{1}{2}(a - E(\theta \mid \xi))^T \Delta^{-1} (a - E(\theta \mid \xi))\}. \]

6. (S. N. Bernstein). Let $\xi$ and $\eta$ be independent identically distributed random variables
with finite variances. Show that if $\xi + \eta$ and $\xi - \eta$ are independent, then $\xi$ and $\eta$
are Gaussian.

CHAPTER III

Convergence of Probability Measures. Central Limit Theorem

1. Weak Convergence of Probability Measures and Distributions
1. Many important results of probability theory are formulated as limit
theorems. So, indeed, were James Bernoulli's law of large numbers, as well
as the De Moivre-Laplace limit theorem, the theorems with which the true
theory of probability began.
In the present chapter we discuss two central aspects of limit theorems:
one is the concept of weak convergence; the other is the method of characteristic functions, one of the most powerful methods for proving and refining
limit theorems.
We begin by recalling the statement of the law of large numbers (Chapter
I, §5) for the Bernoulli scheme.
Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed
random variables with $P(\xi_i = 1) = p$, $P(\xi_i = 0) = q$, $p + q = 1$. In terms
of the concept of convergence in probability (Chapter II, §10), Bernoulli's
law of large numbers can be stated as follows:

\[ \frac{S_n}{n} \xrightarrow{P} p, \qquad n \to \infty, \tag{1} \]

where $S_n = \xi_1 + \cdots + \xi_n$. (It will be shown in Chapter IV that in fact we
have convergence with probability 1.)
We put

\[ F_n(x) = P\left\{\frac{S_n}{n} \le x\right\}, \qquad F(x) = \begin{cases} 1, & x \ge p, \\ 0, & x < p, \end{cases} \tag{2} \]

where $F(x)$ is the distribution function of the degenerate random variable
$\xi \equiv p$. Also let $P_n$ and $P$ be the probability measures on $(R, \mathscr{B}(R))$ corresponding to the distributions $F_n$ and $F$.
In accordance with Theorem 2 of §10, Chapter II, convergence in probability, $S_n/n \xrightarrow{P} p$, implies convergence in distribution, $S_n/n \xrightarrow{d} p$, which means that

\[ E f\left(\frac{S_n}{n}\right) \to E f(p), \qquad n \to \infty, \tag{3} \]

for every function $f = f(x)$ belonging to the class $C(R)$ of bounded continuous functions on $R$.
Since

\[ E f\left(\frac{S_n}{n}\right) = \int_R f(x) P_n(dx), \qquad E f(p) = \int_R f(x) P(dx), \]

(3) can be written in the form

\[ \int_R f(x) P_n(dx) \to \int_R f(x) P(dx), \qquad f \in C(R), \tag{4} \]

or (in accordance with §6 of Chapter II) in the form

\[ \int_R f(x) \, dF_n(x) \to \int_R f(x) \, dF(x), \qquad f \in C(R). \tag{5} \]

In analysis, (4) is called weak convergence (of $P_n$ to $P$, $n \to \infty$) and written
$P_n \xrightarrow{w} P$. It is also natural to call (5) weak convergence of $F_n$ to $F$ and denote
it by $F_n \xrightarrow{w} F$.
Thus we may say that in a Bernoulli scheme

\[ \frac{S_n}{n} \xrightarrow{P} p \ \Rightarrow \ F_n \xrightarrow{w} F. \tag{6} \]

It is also easy to see from (1) that, for the distribution functions defined
in (2),

\[ F_n(x) \to F(x), \qquad n \to \infty, \]

for all points $x \in R$ except for the single point $x = p$, where $F(x)$ has a discontinuity.
This shows that weak convergence $F_n \xrightarrow{w} F$ does not imply pointwise
convergence of $F_n(x)$ to $F(x)$, $n \to \infty$, for all points $x \in R$. However, it turns
out that, both for Bernoulli schemes and for arbitrary distribution functions,
weak convergence is equivalent (see Theorem 2 below) to "convergence
in general" in the sense of the following definition.
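
The failure of pointwise convergence at the discontinuity point $x = p$ can be seen in exact arithmetic. The sketch below, with an illustrative function name, computes $P\{S_n/n \le x\}$ from the binomial distribution using rational numbers, so that no rounding enters before the final conversion to a float. For $p = 1/2$ the values at $x = p$ tend to $1/2$, whereas the degenerate limit has $F(p) = 1$.

```python
from fractions import Fraction
from math import comb, floor


def cdf_of_mean(n, p, x):
    """P(S_n / n <= x) for S_n ~ Binomial(n, p), evaluated exactly.

    p and x may be Fractions, ints, or strings such as "1/2".
    """
    p, x = Fraction(p), Fraction(x)
    q = 1 - p
    k_max = min(n, floor(n * x))      # S_n <= n x  iff  S_n <= floor(n x)
    if k_max < 0:
        return 0.0
    total = sum(comb(n, k) * p ** k * q ** (n - k) for k in range(k_max + 1))
    return float(total)


at_p = [cdf_of_mean(n, "1/2", "1/2") for n in (10, 100, 1000)]
below_p = cdf_of_mean(1000, "1/2", "2/5")   # a continuity point with F = 0
above_p = cdf_of_mean(1000, "1/2", "3/5")   # a continuity point with F = 1
```

At the continuity points $x = 2/5$ and $x = 3/5$ the values are already very close to $0$ and $1$ for $n = 1000$, while the values at $x = p$ stay near $1/2$.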

Definition 1. A sequence of distribution functions $\{F_n\}$, defined on the real
line, converges in general to the distribution function $F$ (notation: $F_n \Rightarrow F$)
if as $n \to \infty$

\[ F_n(x) \to F(x), \qquad x \in P_C(F), \]

where $P_C(F)$ is the set of points of continuity of $F = F(x)$.

For Bernoulli schemes, $F = F(x)$ is degenerate, and it is easy to see
(see Problem 7 of §10, Chapter II) that

\[ (F_n \Rightarrow F) \ \Rightarrow \ \left(\frac{S_n}{n} \xrightarrow{P} p\right). \]

Therefore, taking account of Theorem 2 below,

\[ \left(\frac{S_n}{n} \xrightarrow{P} p\right) \Rightarrow (F_n \xrightarrow{w} F) \Leftrightarrow (F_n \Rightarrow F) \Rightarrow \left(\frac{S_n}{n} \xrightarrow{P} p\right) \tag{7} \]

and consequently the law of large numbers can be considered as a theorem
on the weak convergence of the distribution functions defined in (2).
Let us write

\[ F_n(x) = P\left\{\frac{S_n - np}{\sqrt{npq}} \le x\right\}, \qquad F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2} \, du. \tag{8} \]

The De Moivre-Laplace theorem (§6, Chapter I) states that $F_n(x) \to F(x)$
for all $x \in R$, and consequently $F_n \Rightarrow F$. Since, as we have observed, weak
convergence $F_n \xrightarrow{w} F$ and convergence in general, $F_n \Rightarrow F$, are equivalent,
we may therefore say that the De Moivre-Laplace theorem is also a theorem
on the weak convergence of the distribution functions defined by (8).
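
The convergence $F_n(x) \to F(x)$ in (8) can be observed numerically. The following sketch evaluates both functions on a small grid and reports the largest discrepancy; the function names, the grid, and the parameter values $n = 400$, $p = 0.3$ are illustrative assumptions, and the normal distribution function is computed from `math.erf`.

```python
import math


def normal_cdf(x):
    """F(x) in (8): the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def standardized_binomial_cdf(n, p, x):
    """F_n(x) in (8): P((S_n - np) / sqrt(npq) <= x) for S_n ~ Binomial(n, p)."""
    q = 1.0 - p
    k_max = min(n, math.floor(n * p + x * math.sqrt(n * p * q)))
    # range() is empty when k_max < 0, which correctly gives probability 0
    return sum(math.comb(n, k) * p ** k * q ** (n - k) for k in range(k_max + 1))


def max_gap(n, p, grid):
    """Largest value of |F_n(x) - F(x)| over the points of `grid`."""
    return max(abs(standardized_binomial_cdf(n, p, x) - normal_cdf(x)) for x in grid)


grid = [k / 2 for k in range(-6, 7)]     # -3.0, -2.5, ..., 3.0
gap_400 = max_gap(400, 0.3, grid)
```

Because $F_n$ is a step function with jumps of order $1/\sqrt{npq}$, the discrepancy cannot decay faster than that order; the rate of convergence is the subject of the estimates mentioned in connection with Esseen's inequality.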
These examples justify the concept of weak convergence of probability
measures that will be introduced below in Definition 2. Although, on the
real line, weak convergence is equivalent to convergence in general of the
corresponding distribution functions, it is preferable to use weak convergence
from the beginning. This is because in the first place it is easier to work with,
and in the second place it remains useful in more general spaces than the
real line, and in particular for metric spaces, including the especially important spaces $R^n$, $R^\infty$, $C$ and $D$ (see §3 of Chapter II).
2. Let $(E, \mathscr{E}, \rho)$ be a metric space with metric $\rho = \rho(x, y)$ and $\sigma$-algebra $\mathscr{E}$
of Borel subsets generated by the open sets, and let $P, P_1, P_2, \ldots$ be probability measures on $(E, \mathscr{E}, \rho)$.

Definition 2. A sequence of probability measures $\{P_n\}$ converges weakly to the
probability measure $P$ (notation: $P_n \xrightarrow{w} P$) if

\[ \int_E f(x) P_n(dx) \to \int_E f(x) P(dx) \tag{9} \]

for every function $f = f(x)$ in the class $C(E)$ of continuous bounded functions on $E$.

Definition 3. A sequence of probability measures $\{P_n\}$ converges in general
to the probability measure $P$ (notation: $P_n \Rightarrow P$) if

\[ P_n(A) \to P(A) \tag{10} \]

for every set $A$ of $\mathscr{E}$ for which

\[ P(\partial A) = 0. \tag{11} \]

(Here $\partial A$ denotes the boundary of $A$:

\[ \partial A = [A] \cap [\bar{A}], \]

where $[A]$ is the closure of $A$.)


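The following fundamental theorem shows the equivalence of the concepts of weak convergence and convergence in general for probability
measures, and contains still another equivalent statement.

Theorem 1. The following statements are equivalent.

(I) $P_n \xrightarrow{w} P$.
(II) $\limsup_n P_n(A) \le P(A)$, $A$ closed.
(III) $\liminf_n P_n(A) \ge P(A)$, $A$ open.
(IV) $P_n \Rightarrow P$.

PROOF. (I) $\Rightarrow$ (II). Let $A$ be closed, $f(x) = I_A(x)$ and

\[ f_\varepsilon(x) = g\left(\frac{1}{\varepsilon} \rho(x, A)\right), \qquad \varepsilon > 0, \]

where

\[ \rho(x, A) = \inf\{\rho(x, y) : y \in A\}, \qquad g(t) = \begin{cases} 1, & t \le 0, \\ 1 - t, & 0 \le t \le 1, \\ 0, & t \ge 1. \end{cases} \]

Let us also put

\[ A_\varepsilon = \{x : \rho(x, A) < \varepsilon\} \]

and observe that $A_\varepsilon \downarrow A$ as $\varepsilon \downarrow 0$.
Since $f_\varepsilon(x)$ is bounded, continuous, and satisfies

\[ P(A_\varepsilon) = \int_E I_{A_\varepsilon}(x) P(dx) \ge \int_E f_\varepsilon(x) P(dx), \]

we have

\[ \limsup_n P_n(A) = \limsup_n \int_E I_A(x) P_n(dx) \le \limsup_n \int_E f_\varepsilon(x) P_n(dx) = \int_E f_\varepsilon(x) P(dx) \le P(A_\varepsilon) \downarrow P(A), \qquad \varepsilon \downarrow 0, \]

which establishes the required implication.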

The implications (II) $\Rightarrow$ (III) and (III) $\Rightarrow$ (II) become obvious if we take
the complements of the sets concerned.
(III) $\Rightarrow$ (IV). Let $A^0 = A \setminus \partial A$ be the interior, and $[A]$ the closure, of $A$.
Then from (II), (III), and the hypothesis $P(\partial A) = 0$, we have

\[ \limsup_n P_n(A) \le \limsup_n P_n([A]) \le P([A]) = P(A), \]

\[ \liminf_n P_n(A) \ge \liminf_n P_n(A^0) \ge P(A^0) = P(A), \]

and therefore $P_n(A) \to P(A)$ for every $A$ such that $P(\partial A) = 0$.


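(IV) $\Rightarrow$ (I). Let $f = f(x)$ be a bounded continuous function with $|f(x)| \le M$. We put

\[ D = \{t \in R : P\{x : f(x) = t\} \ne 0\} \]

and consider a decomposition $T_k = (t_0, t_1, \ldots, t_k)$ of $[-M, M]$:

\[ -M = t_0 < t_1 < \cdots < t_k = M, \qquad k \ge 1, \]

with $t_i \notin D$, $i = 0, 1, \ldots, k$. (Observe that $D$ is at most countable since the
sets $f^{-1}\{t\}$ are disjoint and $P$ is finite.)
Let $B_i = \{x : t_i \le f(x) < t_{i+1}\}$. Since $f(x)$ is continuous and therefore
the set $f^{-1}(t_i, t_{i+1})$ is open, we have $\partial B_i \subseteq f^{-1}\{t_i\} \cup f^{-1}\{t_{i+1}\}$. The points
$t_i, t_{i+1} \notin D$; therefore $P(\partial B_i) = 0$ and, by (IV),

\[ \sum_{i=0}^{k-1} t_i P_n(B_i) \to \sum_{i=0}^{k-1} t_i P(B_i). \tag{12} \]

But

\[
\begin{aligned}
\left| \int_E f(x) P_n(dx) - \int_E f(x) P(dx) \right|
&\le \left| \int_E f(x) P_n(dx) - \sum_{i=0}^{k-1} t_i P_n(B_i) \right| + \left| \sum_{i=0}^{k-1} t_i P_n(B_i) - \sum_{i=0}^{k-1} t_i P(B_i) \right| \\
&\quad + \left| \sum_{i=0}^{k-1} t_i P(B_i) - \int_E f(x) P(dx) \right| \\
&\le 2 \max_{0 \le i \le k-1} (t_{i+1} - t_i) + \left| \sum_{i=0}^{k-1} t_i P_n(B_i) - \sum_{i=0}^{k-1} t_i P(B_i) \right|,
\end{aligned}
\]

whence, by (12), since the $T_k$ ($k \ge 1$) are arbitrary,

\[ \lim_n \int_E f(x) P_n(dx) = \int_E f(x) P(dx). \]

This completes the proof of the theorem.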
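
A simple example shows that the inequalities in (II) and (III) can be strict and why the condition $P(\partial A) = 0$ is needed in (IV). Take $P_n$ to be the unit mass at $1/n$ and $P$ the unit mass at $0$, so that $P_n \xrightarrow{w} P$ since $f(1/n) \to f(0)$ for continuous $f$. The sketch below represents a set by its indicator function; this representation and all names are illustrative choices.

```python
def p_n(n, indicator):
    """P_n(A) for the unit mass at 1/n; the set A is given by its indicator."""
    return 1.0 if indicator(1.0 / n) else 0.0


def p(indicator):
    """P(A) for the unit mass at 0."""
    return 1.0 if indicator(0.0) else 0.0


def closed_point(x):        # A = {0}, closed, with boundary {0}
    return x == 0.0


def open_interval(x):       # A = (0, 1), open, with boundary {0, 1}
    return 0.0 < x < 1.0


def symmetric_interval(x):  # A = (-1, 1), with boundary {-1, 1}, P-null
    return -1.0 < x < 1.0


ns = range(2, 200)
closed_values = {p_n(n, closed_point) for n in ns}        # all 0, while P(A) = 1
open_values = {p_n(n, open_interval) for n in ns}         # all 1, while P(A) = 0
null_boundary = {p_n(n, symmetric_interval) for n in ns}  # all 1, and P(A) = 1
```

For the closed set $\{0\}$ we get $\limsup_n P_n(A) = 0 < 1 = P(A)$, for the open set $(0, 1)$ we get $\liminf_n P_n(A) = 1 > 0 = P(A)$, and both sets have $P(\partial A) = 1$. For $(-1, 1)$, whose boundary has $P$-measure zero, $P_n(A) \to P(A)$ as (IV) requires.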

Remark 1. The functions $f(x) = I_A(x)$ and $f_\varepsilon(x)$ that appear in the proof
that (I) $\Rightarrow$ (II) are respectively upper semicontinuous and uniformly continuous.
Hence it is easy to show that each of the conditions of the theorem is equivalent to one of the following:
(V) $\int_E f(x) P_n(dx) \to \int_E f(x) P(dx)$ for all bounded uniformly continuous
$f(x)$;
(VI) $\limsup_n \int_E f(x) P_n(dx) \le \int_E f(x) P(dx)$ for all bounded $f(x)$ that are upper
semicontinuous ($\limsup_n f(x_n) \le f(x)$, $x_n \to x$);
(VII) $\liminf_n \int_E f(x) P_n(dx) \ge \int_E f(x) P(dx)$ for all bounded $f(x)$ that are lower
semicontinuous ($\liminf_n f(x_n) \ge f(x)$, $x_n \to x$).

Remark 2. Theorem 1 admits a natural generalization to the case when the
probability measures $P$ and $P_n$ defined on $(E, \mathscr{E}, \rho)$ are replaced by arbitrary
(not necessarily probability) finite measures $\mu$ and $\mu_n$. For such measures
we can introduce weak convergence $\mu_n \xrightarrow{w} \mu$ and convergence in general
$\mu_n \Rightarrow \mu$ and, just as in Theorem 1, we can establish the equivalence of the
following conditions:
(I*) $\mu_n \xrightarrow{w} \mu$;
(II*) $\limsup_n \mu_n(A) \le \mu(A)$, where $A$ is closed and $\mu_n(E) \to \mu(E)$;
(III*) $\liminf_n \mu_n(A) \ge \mu(A)$, where $A$ is open and $\mu_n(E) \to \mu(E)$;
(IV*) $\mu_n \Rightarrow \mu$.

Each of these is equivalent to any of (V*), (VI*), and (VII*), which are
(V), (VI), and (VII) with $P_n$ and $P$ replaced by $\mu_n$ and $\mu$.
3. Let $(R, \mathscr{B}(R))$ be the real line with the system $\mathscr{B}(R)$ of sets generated by
the Euclidean metric $\rho(x, y) = |x - y|$ (compare Remark 2 of subsection 2
of §2 of Chapter II). Let $P$ and $P_n$, $n \ge 1$, be probability measures on $(R, \mathscr{B}(R))$
and let $F$ and $F_n$, $n \ge 1$, be the corresponding distribution functions.

Theorem 2. The following conditions are equivalent:

(1) $P_n \xrightarrow{w} P$,
(2) $P_n \Rightarrow P$,
(3) $F_n \xrightarrow{w} F$,
(4) $F_n \Rightarrow F$.

PROOF. Since (2) $\Leftrightarrow$ (1) $\Leftrightarrow$ (3), it is enough to show that (2) $\Leftrightarrow$ (4).
If $P_n \Rightarrow P$, then in particular

\[ P_n(-\infty, x] \to P(-\infty, x] \]

for all $x \in R$ such that $P\{x\} = 0$. But this means that $F_n \Rightarrow F$.
Now let $F_n \Rightarrow F$. To prove that $P_n \Rightarrow P$ it is enough (by Theorem 1) to
show that $\liminf_n P_n(A) \ge P(A)$ for every open set $A$.
If $A$ is open, there is a countable collection of disjoint open intervals
$I_1, I_2, \ldots$ (of the form $(a, b)$) such that $A = \sum_{k=1}^{\infty} I_k$. Choose $\varepsilon > 0$ and in
each interval $I_k = (a_k, b_k)$ select a subinterval $I_k' = (a_k', b_k']$ such that $a_k', b_k' \in P_C(F)$
and $P(I_k) \le P(I_k') + \varepsilon \cdot 2^{-k}$. (Since $F(x)$ has at most countably
many discontinuities, such intervals $I_k'$, $k \ge 1$, certainly exist.) By Fatou's
lemma,

\[ \liminf_n P_n(A) = \liminf_n \sum_{k=1}^{\infty} P_n(I_k) \ge \sum_{k=1}^{\infty} \liminf_n P_n(I_k) \ge \sum_{k=1}^{\infty} \liminf_n P_n(I_k'). \]

But

\[ P_n(I_k') = F_n(b_k') - F_n(a_k') \to F(b_k') - F(a_k') = P(I_k'). \]

Therefore

\[ \liminf_n P_n(A) \ge \sum_{k=1}^{\infty} P(I_k') \ge \sum_{k=1}^{\infty} (P(I_k) - \varepsilon \cdot 2^{-k}) = P(A) - \varepsilon. \]

Since $\varepsilon > 0$ is arbitrary, this shows that $\liminf_n P_n(A) \ge P(A)$ if $A$ is open.
This completes the proof of the theorem.
4. Let $(E, \mathscr{E})$ be a measurable space. A collection $\mathscr{K}_0(E) \subseteq \mathscr{E}$ of subsets is
a determining class if whenever two probability measures $P$ and $Q$ on $(E, \mathscr{E})$
satisfy

\[ P(A) = Q(A) \quad \text{for all} \quad A \in \mathscr{K}_0(E) \]

it follows that the measures are identical, i.e.

\[ P(A) = Q(A) \quad \text{for all} \quad A \in \mathscr{E}. \]

If $(E, \mathscr{E}, \rho)$ is a metric space, a collection $\mathscr{K}_1(E) \subseteq \mathscr{E}$ is a convergence-determining class if whenever probability measures $P, P_1, P_2, \ldots$ satisfy

\[ P_n(A) \to P(A) \quad \text{for all} \quad A \in \mathscr{K}_1(E) \quad \text{with} \quad P(\partial A) = 0 \]

it follows that

\[ P_n(A) \to P(A) \quad \text{for all} \quad A \in \mathscr{E} \quad \text{with} \quad P(\partial A) = 0. \]

When $(E, \mathscr{E}) = (R, \mathscr{B}(R))$, we can take a determining class $\mathscr{K}_0(R)$ to be
the class of "elementary" sets $\mathscr{K} = \{(-\infty, x], x \in R\}$ (Theorem 1, §3,
Chapter II). It follows from the equivalence of (2) and (4) of Theorem 2
that this class $\mathscr{K}$ is also a convergence-determining class.
It is natural to ask about such determining classes in more general spaces.
For $R^n$, $n \ge 2$, the class $\mathscr{K}$ of "elementary" sets of the form $(-\infty, x] =
(-\infty, x_1] \times \cdots \times (-\infty, x_n]$, where $x = (x_1, \ldots, x_n) \in R^n$, is both a determining class (Theorem 2, §3, Chapter II) and a convergence-determining
class (Problem 2).

Figure 35. The function $x_n(t)$; the points $1/n$ and $2/n$ are marked on the $t$-axis.

For $R^\infty$ the cylindrical sets $\mathscr{K}_0(R^\infty)$ are the "elementary" sets whose
probabilities uniquely determine the probabilities of the Borel sets (Theorem
3, §3, Chapter II). It turns out that in this case the class of cylindrical sets is
also the class of convergence-determining sets (Problem 3). Therefore
$\mathscr{K}_1(R^\infty) = \mathscr{K}_0(R^\infty)$.
We might expect that the cylindrical sets would still constitute determining classes in more general spaces. However, this is, in general, not the
case.
For example, consider the space $(C, \mathscr{B}_0(C), \rho)$ with the uniform metric
$\rho$ (see subsection 6, §2, Chapter II). Let $P$ be the probability measure concentrated on the element $x = x(t) \equiv 0$, $0 \le t \le 1$, and let $P_n$, $n \ge 1$, be the
probability measures each of which is concentrated on the element $x = x_n(t)$
shown in Figure 35. It is easy to see that $P_n(A) \to P(A)$ for all cylindrical
sets $A$ with $P(\partial A) = 0$. But if we consider, for example, the set

\[ A = \{\alpha \in C : |\alpha(t)| \le \tfrac{1}{2}, \ 0 \le t \le 1\} \in \mathscr{B}_0(C), \]

then $P(\partial A) = 0$, $P_n(A) = 0$, $P(A) = 1$ and consequently $P_n \not\Rightarrow P$.
Therefore $\mathscr{K}_0(C) = \mathscr{B}_0(C)$ but $\mathscr{K}_0(C) \subset \mathscr{K}_1(C)$ (with strict inclusion).

5. PROBLEMS

1. Let us say that a function $F = F(x)$, defined on $R^n$, is continuous at $x \in R^n$ provided
that, for every $\varepsilon > 0$, there is a $\delta > 0$ such that $|F(x) - F(y)| < \varepsilon$ for all $y \in R^n$ that
satisfy

\[ x - \delta e < y < x + \delta e, \]

where $e = (1, \ldots, 1) \in R^n$. Let us say that a sequence of distribution functions $\{F_n\}$
converges in general to the distribution function $F$ ($F_n \Rightarrow F$) if $F_n(x) \to F(x)$, for all
points $x \in R^n$ where $F = F(x)$ is continuous.
Show that the conclusion of Theorem 2 remains valid for $R^n$, $n > 1$. (See the remark
on Theorem 2.)

2. Show that the class $\mathscr{K}$ of "elementary" sets in $R^n$ is a convergence-determining class.

3. Let $E$ be one of the spaces $R^\infty$, $C$ or $D$. Let us say that a sequence $\{P_n\}$ of probability
measures (defined on the $\sigma$-algebra $\mathscr{E}$ of Borel sets generated by the open sets) converges in general in the sense of finite-dimensional distributions to the probability
measure $P$ (notation: $P_n \stackrel{f}{\Rightarrow} P$) if $P_n(A) \to P(A)$, $n \to \infty$, for all cylindrical sets $A$ with
$P(\partial A) = 0$.
For $R^\infty$, show that

\[ (P_n \stackrel{f}{\Rightarrow} P) \Leftrightarrow (P_n \Rightarrow P). \]

4. Let $F$ and $G$ be distribution functions on the real line and let

\[ L(F, G) = \inf\{h > 0 : F(x - h) - h \le G(x) \le F(x + h) + h\} \]

be the Lévy distance (between $F$ and $G$). Show that convergence in general is equivalent to convergence in the Lévy metric:

\[ (F_n \Rightarrow F) \Leftrightarrow L(F_n, F) \to 0. \]

5. Let $F_n \Rightarrow F$ and let $F$ be continuous. Show that in this case $F_n(x)$ converges uniformly
to $F(x)$:

\[ \sup_x |F_n(x) - F(x)| \to 0, \qquad n \to \infty. \]

6. Prove the statement in Remark 1 on Theorem 1.

7. Establish the equivalence of (I*)-(IV*) as stated in Remark 2 on Theorem 1.

8. Show that $P_n \xrightarrow{w} P$ if and only if every subsequence $\{P_{n'}\}$ of $\{P_n\}$ contains a subsequence $\{P_{n''}\}$ such that $P_{n''} \xrightarrow{w} P$.

2. Relative Compactness and Tightness of Families of Probability Distributions
1. If we are given a sequence of probability measures, then before we can
consider the question of its (weak) convergence to some probability measure,
we have of course to establish whether the sequence converges in general
to some measure, or has at least one convergent subsequence.
For example, the sequence $\{P_n\}$, where $P_{2n} = P$, $P_{2n+1} = Q$, and $P$ and $Q$
are different probability measures, is evidently not convergent, but has the
two convergent subsequences $\{P_{2n}\}$ and $\{P_{2n+1}\}$.
It is easy to construct a sequence $\{P_n\}$ of probability measures $P_n$, $n \ge 1$,
that not only fails to converge, but contains no convergent subsequences at
all. All that we have to do is to take $P_n$, $n \ge 1$, to be concentrated at $\{n\}$ (that
is, $P_n\{n\} = 1$). In fact, since $\lim_n P_n(a, b] = 0$ whenever $a < b$, a limit measure
would have to be identically zero, contradicting the fact that $1 = P_n(R) \not\to 0$,
$n \to \infty$. It is interesting to observe that in this example the corresponding
sequence $\{F_n\}$ of distribution functions,

\[ F_n(x) = \begin{cases} 1, & x \ge n, \\ 0, & x < n, \end{cases} \]

is evidently convergent: for every $x \in R$,

\[ F_n(x) \to G(x) \equiv 0. \]

However, the limit function $G = G(x)$ is not a distribution function (in the
sense of Definition 1 of §3, Chapter II).
This instructive example shows that the space of distribution functions is
not compact. It also shows that if a sequence of distribution functions is to
converge to a limit that is also a distribution function, we must have some
hypothesis that will prevent mass from "escaping to infinity."
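The escape of mass in this example can be checked numerically. The following is a minimal sketch, assuming only the unit masses at the points n described above; the helper name `F` is ours and is not notation from the text.

```python
def F(n, x):
    """Distribution function of the unit mass concentrated at the point n."""
    return 1.0 if x >= n else 0.0

xs = [-5.0, 0.0, 10.0, 1000.0]

# For each fixed x we have F_n(x) = 0 as soon as n > x,
# so the pointwise limit is identically zero.
limits = [F(10**6, x) for x in xs]

# Nevertheless every F_n is a genuine distribution function: F_n(+inf) = 1.
total_mass = [F(n, float("inf")) for n in (1, 10, 100)]

print(limits, total_mass)
```

Each F_n carries total mass 1, while the pointwise limit carries none; this is the loss of mass that the tightness condition introduced below is meant to exclude.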
After these introductory remarks, which illustrate the kinds of difficulty
that can arise, we turn to the basic definitions.

2. Let us suppose that all measures are defined on the metric space $(E, \mathscr{E}, \rho)$.
Definition 1. A family of probability measures $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ is relatively
compact if every sequence of measures from $\mathscr{P}$ contains a subsequence which
converges weakly to a probability measure.
We emphasize that in this definition the limit measure is to be a probability
measure, although it need not belong to the original class $\mathscr{P}$. (This is why the
word "relatively" appears in the definition.)
It is often far from simple to verify that a given family of probability
measures is relatively compact. Consequently it is desirable to have simple
and usable tests for this property. We need the following definitions.
Definition 2. A family of probability measures $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ is tight if,
for every $\varepsilon > 0$, there is a compact set $K \subseteq E$ such that
$$\sup_{\alpha \in \mathfrak{A}} P_\alpha(E \setminus K) \le \varepsilon. \qquad (1)$$
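To make condition (1) concrete, here is a small numerical sketch for Gaussian measures on the real line. The two families (means confined to [-2, 2], and means n = 1, ..., 50, both with unit variance) and the helper `mass_outside` are illustrative assumptions, not taken from the text. The sketch tests a single compact set K = [-10, 10], so it illustrates the condition rather than proving tightness.

```python
import math

def mass_outside(m, sigma, K):
    """P(|X| > K) for X ~ N(m, sigma^2), computed from the normal distribution function."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - (Phi((K - m) / sigma) - Phi((-K - m) / sigma))

K = 10.0

# Means confined to a bounded set: the compact set [-K, K] carries almost all the mass
# of every member of the family.
bounded = max(mass_outside(m, 1.0, K) for m in (-2.0, 0.0, 2.0))

# Means drifting to infinity: some members put almost all their mass outside [-K, K].
drifting = max(mass_outside(float(n), 1.0, K) for n in range(1, 51))

print(bounded, drifting)
```

For the drifting family the supremum in (1) stays close to 1 for this K, and enlarging K only postpones the problem to larger means.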

Definition 3. A family of distribution functions $\mathscr{F} = \{F_\alpha;\ \alpha \in \mathfrak{A}\}$ defined on
$R^n$, $n \ge 1$, is relatively compact (or tight) if the same property is possessed by
the family $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ of probability measures, where $P_\alpha$ is the measure
constructed from $F_\alpha$.
3. The following result is fundamental for the study of weak convergence of
probability measures.

Theorem 1 (Prohorov's Theorem). Let $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ be a family of
probability measures defined on a complete separable metric space $(E, \mathscr{E}, \rho)$.
Then $\mathscr{P}$ is relatively compact if and only if it is tight.


We shall give the proof only when the space is the real line. (The proof can
be carried over, almost unchanged, to arbitrary Euclidean spaces $R^n$, $n \ge 2$.
Then the theorem can be extended successively to $R^\infty$, to $\sigma$-compact spaces,
and finally to general complete separable metric spaces, by reducing each
case to the preceding one.)
Necessity. Let the family $\mathscr{P} = \{P_\alpha : \alpha \in \mathfrak{A}\}$ of probability measures defined
on $(R, \mathscr{B}(R))$ be relatively compact but not tight. Then there is an $\varepsilon > 0$ such
that for every compact $K \subseteq R$
$$\sup_\alpha P_\alpha(R \setminus K) > \varepsilon,$$
and therefore, for each interval $I = (a, b)$,
$$\sup_\alpha P_\alpha(R \setminus I) > \varepsilon.$$
It follows that for every interval $I_n = (-n, n)$, $n \ge 1$, there is a measure $P_{\alpha_n}$
such that
$$P_{\alpha_n}(R \setminus I_n) > \varepsilon.$$
Since the original family $\mathscr{P}$ is relatively compact, we can select from $\{P_{\alpha_n}\}_{n \ge 1}$
a subsequence $\{P_{\alpha_{n_k}}\}$ such that $P_{\alpha_{n_k}} \xrightarrow{w} Q$, where $Q$ is a probability measure.
Then, by the equivalence of conditions (I) and (II) in Theorem 1 of §1, we
have
$$\limsup_{k \to \infty} P_{\alpha_{n_k}}(R \setminus I_n) \le Q(R \setminus I_n) \qquad (2)$$
for every $n \ge 1$. But $Q(R \setminus I_n) \downarrow 0$, $n \to \infty$, and the left side of (2) exceeds
$\varepsilon > 0$. This contradiction shows that relatively compact sets are tight.
To prove the sufficiency we need a general result (Helly's theorem) on the
sequential compactness of families of generalized distribution functions
(Subsection 2 of §3 of Chapter II).
Let $\mathscr{J} = \{G\}$ be the collection of generalized distribution functions
$G = G(x)$ that satisfy:

(1) $G(x)$ is nondecreasing;
(2) $0 \le G(-\infty)$, $G(+\infty) \le 1$;
(3) $G(x)$ is continuous on the right.
Then $\mathscr{J}$ clearly contains the class of distribution functions $\mathscr{F} = \{F\}$
for which $F(-\infty) = 0$ and $F(+\infty) = 1$.
Theorem 2 (Helly's Theorem). The class $\mathscr{J} = \{G\}$ of generalized distribution
functions is sequentially compact, i.e. for every sequence $\{G_n\}$ of functions
from $\mathscr{J}$ we can find a function $G \in \mathscr{J}$ and a sequence $\{n_k\} \subseteq \{n\}$ such that
$$G_{n_k}(x) \to G(x), \qquad k \to \infty,$$
for every point $x$ of the set $P_C(G)$ of points of continuity of $G = G(x)$.


PROOF. Let $T = \{x_1, x_2, \ldots\}$ be a countable dense subset of $R$. Since the
sequence of numbers $\{G_n(x_1)\}$ is bounded, there is a subsequence $N_1 = \{n_1^{(1)}, n_2^{(1)}, \ldots\}$
such that $G_{n_i^{(1)}}(x_1)$ approaches a limit $g_1$ as $i \to \infty$. Then we
extract from $N_1$ a subsequence $N_2 = \{n_1^{(2)}, n_2^{(2)}, \ldots\}$ such that $G_{n_i^{(2)}}(x_2)$
approaches a limit $g_2$ as $i \to \infty$; and so on.
On the set $T \subseteq R$ we can define a function $G_T(x)$ by
$$G_T(x_i) = g_i, \qquad x_i \in T,$$
and consider the "Cantor" diagonal sequence $N = \{n_1^{(1)}, n_2^{(2)}, \ldots\}$. Then, for
each $x_i \in T$, as $m \to \infty$, we have
$$G_{n_m^{(m)}}(x_i) \to G_T(x_i).$$
Finally, let us define $G = G(x)$ for all $x \in R$ by putting
$$G(x) = \inf\{G_T(y) : y \in T,\ y > x\}. \qquad (3)$$
We claim that $G = G(x)$ is the required function and $G_{n_m^{(m)}}(x) \to G(x)$ at all
points $x$ of continuity of $G$.
Since all the functions $G_n$ under consideration are nondecreasing, we have
$G_{n_m^{(m)}}(x) \le G_{n_m^{(m)}}(y)$ for all $x$ and $y$ that belong to $T$ and satisfy the inequality
$x \le y$. Hence for such $x$ and $y$,
$$G_T(x) \le G_T(y).$$
It follows from this and (3) that $G = G(x)$ is nondecreasing.
Now let us show that it is continuous on the right. Let $x_k \downarrow x$ and $d = \lim_k G(x_k)$.
Clearly $G(x) \le d$, and we have to show that actually $G(x) = d$.
Suppose the contrary, that is, let $G(x) < d$. It follows from (3) that there is
a $y \in T$, $x < y$, such that $G_T(y) < d$. But $x < x_k < y$ for sufficiently large $k$,
and therefore $G(x_k) \le G_T(y) < d$ and $\lim G(x_k) < d$, which contradicts
$d = \lim_k G(x_k)$. Thus we have constructed a function $G$ that belongs to $\mathscr{J}$.
We now establish that $G_{n_m^{(m)}}(x^0) \to G(x^0)$ for every $x^0 \in P_C(G)$.
If $x^0 < y \in T$, then
$$\limsup_m G_{n_m^{(m)}}(x^0) \le \limsup_m G_{n_m^{(m)}}(y) = G_T(y),$$
whence
$$\limsup_m G_{n_m^{(m)}}(x^0) \le \inf\{G_T(y) : y > x^0,\ y \in T\} = G(x^0). \qquad (4)$$
On the other hand, let $x^1 < y < x^0$, $y \in T$. Then
$$G(x^1) \le G_T(y) = \lim_m G_{n_m^{(m)}}(y) = \liminf_m G_{n_m^{(m)}}(y) \le \liminf_m G_{n_m^{(m)}}(x^0).$$
Hence if we let $x^1 \uparrow x^0$ we find that
$$G(x^0-) \le \liminf_m G_{n_m^{(m)}}(x^0). \qquad (5)$$
But if $G(x^0-) = G(x^0)$ we can infer from (4) and (5) that $G_{n_m^{(m)}}(x^0) \to G(x^0)$,
$m \to \infty$.
This completes the proof of the theorem.
We can now complete the proof of Theorem 1.
Sufficiency. Let the family $\mathscr{P}$ be tight and let $\{P_n\}$ be a sequence of probability measures from $\mathscr{P}$. Let $\{F_n\}$ be the corresponding sequence of distribution functions.
By Helly's theorem, there are a subsequence $\{F_{n_k}\} \subseteq \{F_n\}$ and a generalized
distribution function $G \in \mathscr{J}$ such that $F_{n_k}(x) \to G(x)$ for $x \in P_C(G)$. Let us
show that because $\mathscr{P}$ was assumed tight, the function $G = G(x)$ is in fact a
genuine distribution function ($G(-\infty) = 0$, $G(+\infty) = 1$).
Take $\varepsilon > 0$, and let $I = (a, b]$ be the interval for which
$$\sup_n P_n(R \setminus I) < \varepsilon,$$
or, equivalently,
$$1 - \varepsilon \le P_n(a, b], \qquad n \ge 1.$$
Choose points $a', b' \in P_C(G)$ such that $a' < a$, $b' > b$. Then
$$1 - \varepsilon \le P_{n_k}(a, b] \le P_{n_k}(a', b'] = F_{n_k}(b') - F_{n_k}(a') \to G(b') - G(a').$$
It follows that $G(+\infty) - G(-\infty) = 1$, and since $0 \le G(-\infty) \le G(+\infty) \le 1$, we have $G(-\infty) = 0$
and $G(+\infty) = 1$.
Therefore the limit function $G = G(x)$ is a distribution function and
$F_{n_k} \Rightarrow G$. Together with Theorem 2 of §1 this shows that $P_{n_k} \xrightarrow{w} Q$, where $Q$
is the probability measure corresponding to the distribution function $G$.
This completes the proof of Theorem 1.
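4. PROBLEMS

1. Carry out the proofs of Theorems 1 and 2 for $R^n$, $n \ge 2$.

2. Let $P_\alpha$ be a Gaussian measure on the real line, with parameters $m_\alpha$ and $\sigma_\alpha^2$, $\alpha \in \mathfrak{A}$.
Show that the family $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ is tight if and only if
$$|m_\alpha| \le a, \qquad \sigma_\alpha^2 \le b, \qquad \alpha \in \mathfrak{A}.$$

3. Construct examples of tight and nontight families $\mathscr{P} = \{P_\alpha;\ \alpha \in \mathfrak{A}\}$ of probability
measures defined on $(R^\infty, \mathscr{B}(R^\infty))$.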

§3. Proofs of Limit Theorems by the Method of Characteristic Functions

1. The proofs of the first limit theorems of probability theory (the law of
large numbers, and the De Moivre-Laplace and Poisson theorems for
Bernoulli schemes) were based on direct analysis of the limit functions of the
distributions $F_n$, which are expressed rather simply in terms of binomial
probabilities. (In the Bernoulli scheme, we are adding random variables that
take only two values, so that in principle we can find $F_n$ explicitly.) However,
it is practically impossible to apply a similar direct method to the study of
more complicated random variables.
The first step in proving limit theorems for sums of arbitrarily distributed
random variables was taken by Chebyshev. The inequality that he discovered,
and which is now known as Chebyshev's inequality, not only makes it
possible to give an elementary proof of James Bernoulli's law of large numbers,
but also lets us establish very general conditions for this law to hold, when
stated in the form
$$P\left\{\left|\frac{S_n}{n} - \frac{ES_n}{n}\right| \ge \varepsilon\right\} \to 0, \qquad n \to \infty, \quad \text{every } \varepsilon > 0, \qquad (1)$$
for sums $S_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$, of independent random variables. (See
Problem 2.)
Furthermore, Chebyshev created (and Markov perfected) the "moment
method" which made it possible to show that the conclusion of the De
Moivre-Laplace theorem, written in the form
$$P\left\{\frac{S_n - ES_n}{\sqrt{VS_n}} \le x\right\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du, \qquad (2)$$
is "universal," in the sense that it is valid under very general hypotheses
concerning the nature of the random variables. For this reason it is known as
the central limit theorem of probability theory.
Somewhat later Lyapunov proposed a different method for proving the
central limit theorem, based on the idea (which goes back to Laplace) of the
characteristic function of a probability distribution. Subsequent developments have shown that Lyapunov's method of characteristic functions is
extremely effective for proving the most diverse limit theorems. Consequently
it has been extensively developed and widely applied.
In essence, the method is as follows.
2. We already know (Chapter II, §12) that there is a one-to-one correspondence between distribution functions and characteristic functions. Hence we
can study the properties of distribution functions by using the corresponding
characteristic functions. It is a fortunate circumstance that weak convergence
$F_n \xrightarrow{w} F$ of distributions is equivalent to pointwise convergence $\varphi_n \to \varphi$ of
the corresponding characteristic functions. Moreover, we have the following
result, which provides the basic method of proving theorems on weak convergence for distributions on the real line.


Theorem 1 (Continuity Theorem). Let $\{F_n\}$ be a sequence of distribution
functions $F_n = F_n(x)$, $x \in R$, and let $\{\varphi_n\}$ be the corresponding sequence of
characteristic functions,
$$\varphi_n(t) = \int_{-\infty}^{\infty} e^{itx}\,dF_n(x), \qquad t \in R.$$
(1) If $F_n \xrightarrow{w} F$, where $F = F(x)$ is a distribution function, then $\varphi_n(t) \to \varphi(t)$,
$t \in R$, where $\varphi(t)$ is the characteristic function of $F = F(x)$.
(2) If $\lim_n \varphi_n(t)$ exists for each $t \in R$ and $\varphi(t) = \lim_n \varphi_n(t)$ is continuous at
$t = 0$, then $\varphi(t)$ is the characteristic function of a probability distribution
$F = F(x)$, and
$$F_n \xrightarrow{w} F.$$

The proof of conclusion (1) is an immediate consequence of the definition
of weak convergence, applied to the functions $\operatorname{Re} e^{itx}$ and $\operatorname{Im} e^{itx}$.
The proof of (2) requires some preliminary propositions.

Lemma 1. Let $\{P_n\}$ be a tight family of probability measures. Suppose that
every weakly convergent subsequence $\{P_{n'}\}$ of $\{P_n\}$ converges to the same
probability measure $P$. Then the whole sequence $\{P_n\}$ converges to $P$.
PROOF. Suppose that $P_n \not\xrightarrow{w} P$. Then there is a bounded continuous function
$f = f(x)$ such that
$$\int_R f(x)\,P_n(dx) \not\to \int_R f(x)\,P(dx).$$
It follows that there exist $\varepsilon > 0$ and an infinite sequence $\{n'\} \subseteq \{n\}$ such
that
$$\left|\int_R f(x)\,P_{n'}(dx) - \int_R f(x)\,P(dx)\right| \ge \varepsilon > 0. \qquad (3)$$
By Prohorov's theorem (§2) we can select a subsequence $\{P_{n''}\}$ of $\{P_{n'}\}$ such
that $P_{n''} \xrightarrow{w} Q$, where $Q$ is a probability measure.
By the hypotheses of the lemma, $Q = P$, and therefore
$$\int_R f(x)\,P_{n''}(dx) \to \int_R f(x)\,P(dx),$$
which leads to a contradiction with (3). This completes the proof of the
lemma.

Lemma 2. Let $\{P_n\}$ be a tight family of probability measures on $(R, \mathscr{B}(R))$.
A necessary and sufficient condition for the sequence $\{P_n\}$ to converge weakly
to a probability measure is that for each $t \in R$ the limit $\lim_n \varphi_n(t)$ exists, where
$\varphi_n(t)$ is the characteristic function of $P_n$:
$$\varphi_n(t) = \int_R e^{itx}\,P_n(dx).$$
PROOF. If $\{P_n\}$ is tight, by Prohorov's theorem there are a subsequence
$\{P_{n'}\}$ and a probability measure $P$ such that $P_{n'} \xrightarrow{w} P$. Suppose that the whole
sequence $\{P_n\}$ does not converge to $P$ ($P_n \not\xrightarrow{w} P$). Then, by Lemma 1, there are a
subsequence $\{P_{n''}\}$ and a probability measure $Q$ such that $P_{n''} \xrightarrow{w} Q$, and
$P \ne Q$.
Now we use the existence of $\lim_n \varphi_n(t)$ for each $t \in R$. Then
$$\lim_{n'} \int_R e^{itx}\,P_{n'}(dx) = \lim_{n''} \int_R e^{itx}\,P_{n''}(dx)$$
and therefore
$$\int_R e^{itx}\,P(dx) = \int_R e^{itx}\,Q(dx), \qquad t \in R.$$
But the characteristic function determines the distribution uniquely
(Theorem 2, §12, Chapter II). Hence $P = Q$, which contradicts the assumption
that $P_n \not\xrightarrow{w} P$.
The converse part of the lemma follows immediately from the definition
of weak convergence.
The following lemma estimates the "tails" of a distribution function in
terms of the behavior of its characteristic function in a neighborhood of
zero.

Lemma 3. Let $F = F(x)$ be a distribution function on the real line and let
$\varphi = \varphi(t)$ be its characteristic function. Then there is a constant $K > 0$ such
that for every $a > 0$
$$\int_{|x| \ge 1/a} dF(x) \le \frac{K}{a} \int_0^a [1 - \operatorname{Re} \varphi(t)]\,dt. \qquad (4)$$
PROOF. Since $\operatorname{Re} \varphi(t) = \int_{-\infty}^{\infty} \cos tx\,dF(x)$, we find by Fubini's theorem that
$$\frac{1}{a} \int_0^a [1 - \operatorname{Re} \varphi(t)]\,dt = \frac{1}{a} \int_0^a \left[\int_{-\infty}^{\infty} (1 - \cos tx)\,dF(x)\right] dt$$
$$= \int_{-\infty}^{\infty} \left[\frac{1}{a} \int_0^a (1 - \cos tx)\,dt\right] dF(x) = \int_{-\infty}^{\infty} \left(1 - \frac{\sin ax}{ax}\right) dF(x)$$
$$\ge \inf_{|y| \ge 1} \left(1 - \frac{\sin y}{y}\right) \cdot \int_{|ax| \ge 1} dF(x) = \frac{1}{K} \int_{|x| \ge 1/a} dF(x),$$
where
$$\frac{1}{K} = \inf_{|y| \ge 1} \left(1 - \frac{\sin y}{y}\right) = 1 - \sin 1 \ge \frac{1}{7},$$
so that (4) holds with $K = 7$. This establishes the lemma.
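Inequality (4) with K = 7 can be sanity-checked numerically. The sketch below assumes the standard normal distribution, whose characteristic function is exp(-t^2/2); the function names `tail` and `bound` and the midpoint rule are our own choices for illustration.

```python
import math

K = 7.0

def tail(a):
    """Left side of (4): P(|X| >= 1/a) for a standard normal X."""
    return math.erfc(1.0 / (a * math.sqrt(2.0)))

def bound(a, steps=10_000):
    """Right side of (4): (K/a) * integral_0^a [1 - exp(-t^2/2)] dt, by the midpoint rule."""
    h = a / steps
    integral = sum(1.0 - math.exp(-((j + 0.5) * h) ** 2 / 2.0) for j in range(steps)) * h
    return K / a * integral

checks = [(a, tail(a), bound(a)) for a in (0.1, 0.5, 1.0, 2.0, 5.0)]
for a, left, right in checks:
    print(a, left, right)
```

In this example the bound is far from sharp; what matters for the proof below is only that the right side of (4) is small when the characteristic function stays close to 1 near the origin.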


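Proof of conclusion (2) of Theorem 1. Let $\varphi_n(t) \to \varphi(t)$, $n \to \infty$, where
$\varphi(t)$ is continuous at 0. Let us show that it follows that the family of probability measures $\{P_n\}$ is tight, where $P_n$ is the measure corresponding to $F_n$.
By (4) and the dominated convergence theorem,
$$P_n\left\{R \setminus \left(-\frac{1}{a}, \frac{1}{a}\right)\right\} = \int_{|x| \ge 1/a} dF_n(x) \le \frac{K}{a} \int_0^a [1 - \operatorname{Re} \varphi_n(t)]\,dt \to \frac{K}{a} \int_0^a [1 - \operatorname{Re} \varphi(t)]\,dt$$
as $n \to \infty$.
Since, by hypothesis, $\varphi(t)$ is continuous at 0 and $\varphi(0) = 1$, for every $\varepsilon > 0$
there is an $a > 0$ such that
$$P_n\left\{R \setminus \left(-\frac{1}{a}, \frac{1}{a}\right)\right\} \le \varepsilon$$
for all $n \ge 1$. Consequently $\{P_n\}$ is tight, and by Lemma 2 there is a probability measure $P$ such that
$$P_n \xrightarrow{w} P.$$
Hence
$$\varphi_n(t) = \int_{-\infty}^{\infty} e^{itx}\,P_n(dx) \to \int_{-\infty}^{\infty} e^{itx}\,P(dx),$$
but also $\varphi_n(t) \to \varphi(t)$. Therefore $\varphi(t)$ is the characteristic function of $P$.
This completes the proof of the theorem.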

Corollary. Let $\{F_n\}$ be a sequence of distribution functions and $\{\varphi_n\}$ the
corresponding sequence of characteristic functions. Also let $F$ be a distribution
function and $\varphi$ its characteristic function. Then $F_n \xrightarrow{w} F$ if and only if $\varphi_n(t) \to \varphi(t)$
for all $t \in R$.
Remark. Let $\eta, \eta_1, \eta_2, \ldots$ be random variables and $F_{\eta_n} \xrightarrow{w} F_\eta$. In accordance
with the definition of §10 of Chapter II, we then say that the random variables
$\eta_1, \eta_2, \ldots$ converge to $\eta$ in distribution, and write $\eta_n \xrightarrow{d} \eta$.
Since this notation is self-explanatory, we shall frequently use it instead
of $F_{\eta_n} \xrightarrow{w} F_\eta$ when stating limit theorems.


3. In the next section, Theorem 1 will be applied to prove the central limit
theorem for independent but not identically distributed random variables.
In the present section we shall merely apply the method of characteristic
functions to prove some simple limit theorems.

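Theorem 2 (Law of Large Numbers). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent
identically distributed random variables with $E|\xi_1| < \infty$, $S_n = \xi_1 + \cdots + \xi_n$
and $E\xi_1 = m$. Then $S_n/n \xrightarrow{P} m$, that is, for every $\varepsilon > 0$
$$P\left\{\left|\frac{S_n}{n} - m\right| \ge \varepsilon\right\} \to 0, \qquad n \to \infty.$$
PROOF. Let $\varphi(t) = Ee^{it\xi_1}$ and $\varphi_{S_n/n}(t) = Ee^{itS_n/n}$. Since the random variables
are independent, we have
$$\varphi_{S_n/n}(t) = \left[\varphi\left(\frac{t}{n}\right)\right]^n$$
by (II.12.6). But according to (II.12.14)
$$\varphi(t) = 1 + itm + o(t), \qquad t \to 0.$$
Therefore for each given $t \in R$
$$\varphi\left(\frac{t}{n}\right) = 1 + i\frac{t}{n}m + o\left(\frac{1}{n}\right), \qquad n \to \infty,$$
and therefore
$$\varphi_{S_n/n}(t) = \left[1 + i\frac{t}{n}m + o\left(\frac{1}{n}\right)\right]^n \to e^{itm}.$$
The function $\varphi(t) = e^{itm}$ is continuous at 0 and is the characteristic function
of the degenerate probability distribution that is concentrated at $m$. Therefore
$$\frac{S_n}{n} \xrightarrow{d} m,$$
and consequently (see Problem 7, §10, Chapter II)
$$\frac{S_n}{n} \xrightarrow{P} m.$$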

This completes the proof of the theorem.
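The key limit in this proof, that the n-th power of the characteristic function evaluated at t/n tends to exp(itm), can be observed numerically. The sketch below assumes, purely for illustration, that the summands are exponentially distributed with mean m = 1, so that their characteristic function is 1/(1 - it); the value t = 2 is arbitrary.

```python
import cmath

def phi(t):
    """Characteristic function of the exponential distribution with mean 1 (illustrative choice)."""
    return 1.0 / (1.0 - 1j * t)

t, m = 2.0, 1.0
target = cmath.exp(1j * t * m)  # characteristic function of the unit mass at m

# Distance between the characteristic function of S_n / n and the limit, for growing n.
errors = [abs(phi(t / n) ** n - target) for n in (10, 100, 1000, 10000)]
print(errors)
```

The printed distances shrink roughly like 1/n, which is the rate suggested by the next term in the expansion of the characteristic function when the variance is finite.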


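Theorem 3 (Central Limit Theorem for Independent Identically Distributed
Random Variables). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically
distributed (nondegenerate) random variables with $E\xi_1^2 < \infty$ and $S_n = \xi_1 + \cdots + \xi_n$. Then as $n \to \infty$
$$P\left\{\frac{S_n - ES_n}{\sqrt{VS_n}} \le x\right\} \to \Phi(x), \qquad x \in R, \qquad (5)$$
where
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du.$$
PROOF. Let $E\xi_1 = m$, $V\xi_1 = \sigma^2$ and
$$\varphi(t) = Ee^{it(\xi_1 - m)}.$$
Then if we put
$$\varphi_n(t) = E\exp\left\{it\,\frac{S_n - ES_n}{\sqrt{VS_n}}\right\},$$
we find that
$$\varphi_n(t) = \left[\varphi\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n.$$
But by (II.12.14)
$$\varphi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2), \qquad t \to 0.$$
Therefore
$$\varphi_n(t) = \left[1 - \frac{\sigma^2 t^2}{2\sigma^2 n} + o\left(\frac{1}{n}\right)\right]^n \to e^{-t^2/2}$$
as $n \to \infty$ for fixed $t$.
The function $e^{-t^2/2}$ is the characteristic function of a random variable
(denoted by $\mathscr{N}(0, 1)$) with mean zero and unit variance. This, by Theorem 1,
also establishes (5). In accordance with the remark in Theorem 1, this can
also be written in the form
$$\frac{S_n - ES_n}{\sqrt{VS_n}} \xrightarrow{d} \mathscr{N}(0, 1). \qquad (6)$$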

This completes the proof of the theorem.
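The convergence of the characteristic functions used in this proof can be checked on a non-symmetric example. The sketch below assumes exponentially distributed summands with mean 1 and variance 1 (an illustrative choice, not one made in the text), so that the centered variable has characteristic function exp(-it)/(1 - it).

```python
import cmath
import math

def phi(t):
    """Characteristic function of xi - m for xi exponential with mean m = 1 (so sigma = 1)."""
    return cmath.exp(-1j * t) / (1.0 - 1j * t)

t = 1.0
target = math.exp(-t * t / 2.0)  # characteristic function of N(0, 1) at t

# Characteristic function of (S_n - E S_n) / sqrt(V S_n) is phi(t / sqrt(n)) ** n here.
errors = [abs(phi(t / math.sqrt(n)) ** n - target) for n in (100, 10_000, 1_000_000)]
print(errors)
```

For this skewed distribution the distance decays only like 1/sqrt(n), which reflects the third-moment term in the expansion of the characteristic function.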


The preceding two theorems have dealt with the behavior of the probabilities of (normalized and symmetrized) sums of independent and identically
distributed random variables. However, in order to state Poisson's theorem
(§6, Chapter I) we have to use a more general model.
Let us suppose that for each $n \ge 1$ we are given a sequence of independent
random variables $\xi_{n1}, \ldots, \xi_{nn}$. In other words, let there be given a triangular
array
$$\begin{pmatrix} \xi_{11} \\ \xi_{21},\ \xi_{22} \\ \xi_{31},\ \xi_{32},\ \xi_{33} \\ \cdots \end{pmatrix}$$
of random variables, those in each row being independent. Put
$$S_n = \xi_{n1} + \cdots + \xi_{nn}.$$

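Theorem 4 (Poisson's Theorem). For each $n \ge 1$ let the independent random
variables $\xi_{n1}, \ldots, \xi_{nn}$ be such that
$$P(\xi_{nk} = 1) = p_{nk}, \qquad P(\xi_{nk} = 0) = q_{nk}, \qquad p_{nk} + q_{nk} = 1.$$
Suppose that
$$\max_{1 \le k \le n} p_{nk} \to 0, \qquad n \to \infty,$$
and $\sum_{k=1}^n p_{nk} \to \lambda > 0$, $n \to \infty$. Then, for each $m = 0, 1, \ldots$,
$$P(S_n = m) \to \frac{e^{-\lambda}\lambda^m}{m!}, \qquad n \to \infty. \qquad (7)$$
PROOF. Since
$$Ee^{it\xi_{nk}} = p_{nk}e^{it} + q_{nk}$$
for $1 \le k \le n$, by our assumptions we have
$$\varphi_{S_n}(t) = Ee^{itS_n} = \prod_{k=1}^n (p_{nk}e^{it} + q_{nk}) = \prod_{k=1}^n (1 + p_{nk}(e^{it} - 1)) \to \exp\{\lambda(e^{it} - 1)\}, \qquad n \to \infty.$$
The function $\varphi(t) = \exp\{\lambda(e^{it} - 1)\}$ is the characteristic function of the
Poisson distribution (II.12.11), so that (7) is established.
If $\pi(\lambda)$ denotes a Poisson random variable with parameter $\lambda$, then (7) can
be written like (6), in the form
$$S_n \xrightarrow{d} \pi(\lambda).$$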

This completes the proof of the theorem.
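Statement (7) can be illustrated by computing the exact distribution of a row sum and comparing it with the Poisson probabilities. In the sketch below the row p_nk = 2*lam*k / (n(n+1)) is a hypothetical choice of ours: it is not identically distributed, its entries sum to lam, and its largest entry 2*lam/(n+1) tends to zero, so the hypotheses of the theorem hold.

```python
import math

def pmf_of_sum(ps):
    """Exact distribution of a sum of independent Bernoulli(p) variables, by convolution."""
    dist = [1.0]
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for m, w in enumerate(dist):
            new[m] += w * (1.0 - p)   # the new summand equals 0
            new[m + 1] += w * p       # the new summand equals 1
        dist = new
    return dist

lam = 2.0

def row(n):
    """Hypothetical n-th row of success probabilities; they add up to lam."""
    return [2.0 * lam * k / (n * (n + 1)) for k in range(1, n + 1)]

def max_gap(n, upto=10):
    """Largest |P(S_n = m) - e^{-lam} lam^m / m!| over m = 0, ..., upto."""
    dist = pmf_of_sum(row(n))
    return max(abs(dist[m] - math.exp(-lam) * lam ** m / math.factorial(m))
               for m in range(upto + 1))

gaps = [max_gap(n) for n in (10, 100, 1000)]
print(gaps)
```

The gaps decrease as n grows, in line with (7); the convolution is used instead of characteristic functions only because it gives the probabilities P(S_n = m) directly.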

4. PROBLEMS

1. Prove Theorem 1 for $R^n$, $n \ge 2$.

2. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with finite means
$E|\xi_n|$ and variances $V\xi_n$ such that $V\xi_n \le K < \infty$, where $K$ is a constant. Use
Chebyshev's inequality to prove the law of large numbers (1).

3. Show, as a corollary to Theorem 1, that the family $\{\varphi_n\}$ is uniformly continuous and
that $\varphi_n \to \varphi$ uniformly on every finite interval.

4. Let $\xi_n$, $n \ge 1$, be random variables with characteristic functions $\varphi_{\xi_n}(t)$, $n \ge 1$. Show
that $\xi_n \xrightarrow{d} 0$ if and only if $\varphi_{\xi_n}(t) \to 1$, $n \to \infty$, in some neighborhood of $t = 0$.

5. Let $X_1, X_2, \ldots$ be a sequence of independent random vectors (with values in $R^k$) with
mean zero and (finite) covariance matrix $\Gamma$. Show that
$$\frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{d} \mathscr{N}(0, \Gamma).$$
(Compare Theorem 3.)

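§4. Central Limit Theorem for Sums of Independent Random Variables

1. Let us suppose that for every $n \ge 1$ we have a sequence
$$\xi_{n1}, \ldots, \xi_{nn}$$
of independent random variables with
$$E\xi_{nk} = 0, \qquad V\xi_{nk} = \sigma_{nk}^2 > 0, \qquad \sum_{k=1}^n \sigma_{nk}^2 = 1.$$
Let $S_n = \xi_{n1} + \cdots + \xi_{nn}$,
$$F_{nk}(x) = P(\xi_{nk} \le x), \qquad \Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} e^{-y^2/2}\,dy, \qquad \Phi_{nk}(x) = \Phi\left(\frac{x}{\sigma_{nk}}\right).$$

Theorem 1. A sufficient (and necessary) condition that
$$S_n \xrightarrow{d} \mathscr{N}(0, 1)$$
is that
$$\sum_{k=1}^n \int_{|x| > \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx \to 0, \qquad n \to \infty, \qquad \text{(A)}$$
for every $\varepsilon > 0$.

This theorem implies, in particular, the traditional statement of the central
limit theorem under the Lindeberg condition.

Theorem 2. Suppose that the Lindeberg condition is satisfied, that is, that for
every $\varepsilon > 0$
$$\sum_{k=1}^n \int_{|x| > \varepsilon} x^2\,dF_{nk}(x) \to 0, \qquad n \to \infty; \qquad \text{(L)}$$
then
$$S_n \xrightarrow{d} \mathscr{N}(0, 1).$$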


Before proving these theorems (notice that Theorem 2 is an easy corollary
of Theorem 1) we discuss the significance of conditions (A) and (L).
Since
$$\max_{1 \le k \le n} E\xi_{nk}^2 \le \varepsilon^2 + \sum_{k=1}^n E(\xi_{nk}^2 I(|\xi_{nk}| > \varepsilon)),$$
it follows from the Lindeberg condition (L) that
$$\max_{1 \le k \le n} E\xi_{nk}^2 \to 0, \qquad n \to \infty. \qquad (1)$$
From this it follows (by Chebyshev's inequality) that the random variables
are asymptotically infinitesimal (negligible in the limit), that is, that, for
every $\varepsilon > 0$,
$$\max_{1 \le k \le n} P\{|\xi_{nk}| > \varepsilon\} \to 0, \qquad n \to \infty. \qquad (2)$$
Consequently we may say that Theorem 2 provides a condition for the
validity of the central limit theorem when the random variables that are
summed are asymptotically infinitesimal.
Limit theorems which depend on a condition of this type are known as
theorems with a classical formulation.
It is easy to give examples which satisfy neither the Lindeberg condition
nor the condition of being asymptotically infinitesimal, but where the central
limit theorem nevertheless holds. Here is a particularly simple example.
Let $\xi_1, \xi_2, \ldots$ be a sequence of independent normally distributed random
variables with $E\xi_n = 0$, $V\xi_1 = 1$, $V\xi_k = 2^{k-2}$, $k \ge 2$. Put $S_n = \xi_{n1} + \cdots + \xi_{nn}$
with
$$\xi_{nk} = \xi_k \Big/ \sqrt{\sum_{i=1}^n V\xi_i}.$$
It is easily verified (Problem 1) that neither the Lindeberg condition nor the
condition of being asymptotically infinitesimal is satisfied, although the
central limit theorem is evident since $S_n$ is normally distributed with $ES_n = 0$,
$VS_n = 1$.
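A short exact computation shows why this example is not asymptotically infinitesimal. The sketch below uses the variances 1, 1, 2, 4, ... from the example; the helper names are ours.

```python
from fractions import Fraction

def variances(n):
    """Variances V xi_1 = 1 and V xi_k = 2^(k-2) for k = 2, ..., n, as exact fractions."""
    return [Fraction(1)] + [Fraction(2) ** (k - 2) for k in range(2, n + 1)]

def largest_share(n):
    """Variance of the largest normalized summand xi_nn, i.e. max_k V xi_k / sum_i V xi_i."""
    v = variances(n)
    return max(v) / sum(v)

shares = [largest_share(n) for n in (2, 5, 20)]
print(shares)
```

Since the sum of the first n variances is 2^(n-1), the last summand always carries exactly half of the total variance. It is therefore normal with variance 1/2 for every n, so the probability in (2) does not tend to zero.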
Later (Theorem 3) we shall show that the Lindeberg condition (L) implies
condition (A). Nevertheless we may say that Theorem 1 also covers "nonclassical" situations (in which no condition of being asymptotically infinitesimal is used). In this sense we say that (A) is an example of a nonclassical condition for the validity of the central limit theorem.
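2. PROOF OF THEOREM 1. We shall not take the time to verify the necessity of
(A), but we prove its sufficiency.
Let
$$f_{nk}(t) = Ee^{it\xi_{nk}}, \qquad f_n(t) = Ee^{itS_n}, \qquad \varphi_{nk}(t) = \int_{-\infty}^{\infty} e^{itx}\,d\Phi_{nk}(x), \qquad \varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,d\Phi(x).$$
It follows from §12, Chapter II, that
$$f_n(t) = \prod_{k=1}^n f_{nk}(t), \qquad \varphi(t) = \prod_{k=1}^n \varphi_{nk}(t) = e^{-t^2/2}.$$
By the corollary to Theorem 1 of §3, we have $S_n \xrightarrow{d} \mathscr{N}(0, 1)$ if and only if
$f_n(t) \to \varphi(t)$, $n \to \infty$, for all real $t$.
We have
$$f_n(t) - \varphi(t) = \prod_{k=1}^n f_{nk}(t) - \prod_{k=1}^n \varphi_{nk}(t).$$
Since $|f_{nk}(t)| \le 1$ and $|\varphi_{nk}(t)| \le 1$, we obtain
$$|f_n(t) - \varphi(t)| = \left|\prod_{k=1}^n f_{nk}(t) - \prod_{k=1}^n \varphi_{nk}(t)\right| \le \sum_{k=1}^n |f_{nk}(t) - \varphi_{nk}(t)|$$
$$= \sum_{k=1}^n \left|\int_{-\infty}^{\infty} e^{itx}\,d(F_{nk} - \Phi_{nk})\right| = \sum_{k=1}^n \left|\int_{-\infty}^{\infty} \left(e^{itx} - itx + \tfrac{1}{2}t^2x^2\right) d(F_{nk} - \Phi_{nk})\right|, \qquad (3)$$
where we have used the equations
$$\int_{-\infty}^{\infty} x^i\,dF_{nk} = \int_{-\infty}^{\infty} x^i\,d\Phi_{nk}, \qquad i = 1, 2.$$
Let us integrate by parts (Theorem 11 of Chapter II, §6, and Remark 2
following it) in the integral
$$\int_a^b \left(e^{itx} - itx + \tfrac{1}{2}t^2x^2\right) d(F_{nk} - \Phi_{nk}),$$
and then let $a \to -\infty$ and $b \to \infty$. Since we have
$$e^{itx} - itx + \tfrac{1}{2}t^2x^2 = O(x^2)$$
and
$$x^2[1 - F_{nk}(x) + F_{nk}(-x)] \to 0, \qquad x \to \infty,$$
we obtain
$$\int_{-\infty}^{\infty} \left(e^{itx} - itx + \tfrac{1}{2}t^2x^2\right) d(F_{nk} - \Phi_{nk}) = -it \int_{-\infty}^{\infty} \left(e^{itx} - 1 - itx\right)(F_{nk}(x) - \Phi_{nk}(x))\,dx. \qquad (4)$$
From (3) and (4),
$$|f_n(t) - \varphi(t)| \le \sum_{k=1}^n \left|t \int_{-\infty}^{\infty} \left(e^{itx} - 1 - itx\right)(F_{nk}(x) - \Phi_{nk}(x))\,dx\right|$$
$$\le \tfrac{1}{2}|t|^3 \varepsilon \sum_{k=1}^n \int_{|x| \le \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx + 2t^2 \sum_{k=1}^n \int_{|x| > \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx$$
$$\le \varepsilon |t|^3 \sum_{k=1}^n \sigma_{nk}^2 + 2t^2 \sum_{k=1}^n \int_{|x| > \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx, \qquad (5)$$
where we have used the inequality
$$\int_{|x| \le \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx \le 2\sigma_{nk}^2, \qquad (6)$$
which is easily deduced from (71), §6, Chapter II.
Since $\varepsilon > 0$ is arbitrary and $\sum_{k=1}^n \sigma_{nk}^2 = 1$, it follows from (5) and (A) that $f_n(t) \to \varphi(t)$ as
$n \to \infty$, for all $t \in R$.
This completes the proof of the theorem.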
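3. The following theorem gives the connections between (A) and (L).
Theorem 3.
(1) (L) $\Rightarrow$ (A).
(2) If $\max_{1 \le k \le n} E\xi_{nk}^2 \to 0$, $n \to \infty$, then (A) $\Rightarrow$ (L).
PROOF. (1) We noticed above that (L) implies that $\max_{1 \le k \le n} \sigma_{nk}^2 \to 0$.
Consequently, since $\sum_{k=1}^n \sigma_{nk}^2 = 1$, we obtain
$$\sum_{k=1}^n \int_{|x| > \varepsilon} x^2\,d\Phi_{nk}(x) \le \int x^2\,d\Phi(x) \to 0, \qquad n \to \infty, \qquad (7)$$
where the integration on the right is over $|x| > \varepsilon/\sqrt{\max_{1 \le k \le n} \sigma_{nk}^2}$. This,
together with (L), shows that
$$\sum_{k=1}^n \int_{|x| > \varepsilon} x^2\,d[F_{nk}(x) + \Phi_{nk}(x)] \to 0, \qquad n \to \infty, \qquad (8)$$
for every $\varepsilon > 0$. Now fix $\varepsilon > 0$ and let $h = h(x)$ be a continuously differentiable
even function such that $|h(x)| \le x^2$, $h'(x)\,\mathrm{sgn}\,x \ge 0$, $h(x) = x^2$ for $|x| > 2\varepsilon$,
$h(x) = 0$ for $|x| \le \varepsilon$, and $|h'(x)| \le 4|x|$ for $\varepsilon < |x| \le 2\varepsilon$. Then, by (8),
$$\sum_{k=1}^n \int_{|x| > \varepsilon} h(x)\,d[F_{nk}(x) + \Phi_{nk}(x)] \to 0, \qquad n \to \infty.$$
By integration by parts we then find that, as $n \to \infty$,
$$\sum_{k=1}^n \int_{x \ge \varepsilon} h'(x)[(1 - F_{nk}(x)) + (1 - \Phi_{nk}(x))]\,dx = \sum_{k=1}^n \int_{x \ge \varepsilon} h(x)\,d[F_{nk} + \Phi_{nk}] \to 0,$$
$$\sum_{k=1}^n \int_{x \le -\varepsilon} h'(x)[F_{nk}(x) + \Phi_{nk}(x)]\,dx = -\sum_{k=1}^n \int_{x \le -\varepsilon} h(x)\,d[F_{nk} + \Phi_{nk}] \to 0.$$
Since $h'(x) = 2x$ for $|x| \ge 2\varepsilon$, we therefore have
$$\sum_{k=1}^n \int_{|x| \ge 2\varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx \to 0, \qquad n \to \infty.$$
Consequently, since $\varepsilon > 0$ is arbitrary, we obtain (L) $\Rightarrow$ (A).
(2) For the function $h = h(x)$ introduced above, we obtain by (7) and the
condition $\max_{1 \le k \le n} \sigma_{nk}^2 \to 0$ that
$$\sum_{k=1}^n \int_{|x| > \varepsilon} h(x)\,d\Phi_{nk}(x) \le \sum_{k=1}^n \int_{|x| > \varepsilon} x^2\,d\Phi_{nk}(x) \to 0, \qquad n \to \infty. \qquad (9)$$
Again integrating by parts, we find that when (A) is satisfied,
$$\left|\sum_{k=1}^n \int_{|x| \ge \varepsilon} h(x)\,d[F_{nk} - \Phi_{nk}]\right| \le \left|\sum_{k=1}^n \int_{x \ge \varepsilon} h(x)\,d[(1 - F_{nk}) - (1 - \Phi_{nk})]\right| + \left|\sum_{k=1}^n \int_{x \le -\varepsilon} h(x)\,d[F_{nk} - \Phi_{nk}]\right|$$
$$\le \sum_{k=1}^n \int_{x \ge \varepsilon} h'(x)\,|(1 - F_{nk}) - (1 - \Phi_{nk})|\,dx + \sum_{k=1}^n \int_{x \le -\varepsilon} |h'(x)|\,|F_{nk} - \Phi_{nk}|\,dx$$
$$= \sum_{k=1}^n \int_{|x| \ge \varepsilon} |h'(x)|\,|F_{nk} - \Phi_{nk}|\,dx \le 4 \sum_{k=1}^n \int_{|x| \ge \varepsilon} |x|\,|F_{nk}(x) - \Phi_{nk}(x)|\,dx \to 0. \qquad (10)$$
It follows from (9) and (10) that
$$\sum_{k=1}^n \int_{|x| \ge 2\varepsilon} x^2\,dF_{nk}(x) \le \sum_{k=1}^n \int_{|x| \ge \varepsilon} h(x)\,dF_{nk}(x) \to 0, \qquad n \to \infty;$$
that is, (L) is satisfied.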

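4. PROOF OF THEOREM 2. According to Theorem 3, Condition (L) implies
(A); hence Theorem 2 follows at once from Theorem 1.

5. We mention some corollaries in which $\xi_1, \xi_2, \ldots$ is a sequence of independent random variables with finite second moments. Let $m_k = E\xi_k$,
$\sigma_k^2 = V\xi_k > 0$, $S_n = \xi_1 + \cdots + \xi_n$, $V_n^2 = \sum_{k=1}^n \sigma_k^2$, and let $F_k = F_k(x)$ be the
distribution function of $\xi_k$.

Corollary 1. Let the Lindeberg condition be satisfied: for every $\varepsilon > 0$,
$$\frac{1}{V_n^2} \sum_{k=1}^n \int_{\{x : |x - m_k| \ge \varepsilon V_n\}} (x - m_k)^2\,dF_k(x) \to 0, \qquad n \to \infty. \qquad (11)$$
Then
$$\frac{S_n - ES_n}{\sqrt{VS_n}} \xrightarrow{d} \mathscr{N}(0, 1). \qquad (12)$$

Corollary 2. Let the Lyapunov condition be satisfied (for some $\delta > 0$):
$$\frac{1}{V_n^{2+\delta}} \sum_{k=1}^n E|\xi_k - m_k|^{2+\delta} \to 0, \qquad n \to \infty. \qquad (13)$$
Then the central limit theorem (12) holds.

It is enough to prove that the Lyapunov condition implies the Lindeberg
condition.
Let $\varepsilon > 0$; then
$$E|\xi_k - m_k|^{2+\delta} = \int_{-\infty}^{\infty} |x - m_k|^{2+\delta}\,dF_k(x) \ge \int_{\{x : |x - m_k| \ge \varepsilon V_n\}} |x - m_k|^{2+\delta}\,dF_k(x)$$
$$\ge \varepsilon^\delta V_n^\delta \int_{\{x : |x - m_k| \ge \varepsilon V_n\}} (x - m_k)^2\,dF_k(x)$$
and therefore
$$\frac{1}{V_n^2} \sum_{k=1}^n \int_{\{x : |x - m_k| \ge \varepsilon V_n\}} (x - m_k)^2\,dF_k(x) \le \frac{1}{\varepsilon^\delta} \cdot \frac{1}{V_n^{2+\delta}} \sum_{k=1}^n E|\xi_k - m_k|^{2+\delta} \to 0.$$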

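Corollary 3. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random
variables with $m = E\xi_1$ and $0 < \sigma^2 = V\xi_1 < \infty$. Then $V_n^2 = n\sigma^2$ and
$$\frac{1}{V_n^2} \sum_{k=1}^n \int_{\{x : |x - m| \ge \varepsilon V_n\}} |x - m|^2\,dF_k(x) = \frac{1}{\sigma^2} \int_{\{x : |x - m| \ge \varepsilon\sigma\sqrt{n}\}} |x - m|^2\,dF_1(x) \to 0,$$
since $\{x : |x - m| \ge \varepsilon\sigma\sqrt{n}\} \downarrow \varnothing$, $n \to \infty$, and $\sigma^2 = E|\xi_1 - m|^2 < \infty$.
Therefore the Lindeberg condition is satisfied and consequently Theorem 3
of §3 follows from Theorem 2.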

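Corollary 4. Let $\xi_1, \xi_2, \ldots$ be independent random variables such that, for all
$n \ge 1$,
$$|\xi_n| \le K,$$
where $K$ is a constant and $V_n \to \infty$ as $n \to \infty$. Then by Chebyshev's inequality
$$\int_{\{x : |x - m_k| \ge \varepsilon V_n\}} |x - m_k|^2\,dF_k(x) = E[(\xi_k - m_k)^2 I(|\xi_k - m_k| \ge \varepsilon V_n)]$$
$$\le (2K)^2 P\{|\xi_k - m_k| \ge \varepsilon V_n\} \le (2K)^2 \frac{\sigma_k^2}{\varepsilon^2 V_n^2} \to 0, \qquad n \to \infty.$$
Summing over $k$ and dividing by $V_n^2$, we see that the left side of (11) is at most $(2K)^2/(\varepsilon^2 V_n^2) \to 0$.
Consequently the Lindeberg condition is again applicable, and the central limit
theorem holds.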
6. We remarked (without proof) in Theorem 1 that condition (A) is also
necessary. The following (Lindeberg-Feller) theorem shows that, with the
supplementary condition $\max_{1 \le k \le n} E\xi_{nk}^2 \to 0$, the Lindeberg condition (L)
is also necessary.
Theorem 4. Let $\max_{1 \le k \le n} E\xi_{nk}^2 \to 0$, $n \to \infty$. Then (L) is necessary and sufficient
for the validity of the central limit theorem: $S_n \xrightarrow{d} \mathscr{N}(0, 1)$.
The proof is based on the following proposition, which estimates the "tails"
of the variance in terms of the behavior of the characteristic function at the
origin (compare Lemma 3, §3, Chapter III).

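Lemma. Let $\xi$ be a random variable with distribution function $F = F(x)$,
$E\xi = 0$, $V\xi = \sigma^2 < \infty$, and let $f(t) = Ee^{it\xi}$. Then for each $a > 0$
$$\int_{|x| \ge 1/a} x^2\,dF(x) \le \frac{1}{a^2}\left[\operatorname{Re} f(\sqrt{6}\,a) - 1 + 3a^2\sigma^2\right]. \qquad (14)$$
PROOF. We have
$$\operatorname{Re} f(t) - 1 + \tfrac{1}{2}\sigma^2t^2 = \tfrac{1}{2}\sigma^2t^2 - \int_{-\infty}^{\infty} [1 - \cos tx]\,dF(x)$$
$$= \tfrac{1}{2}\sigma^2t^2 - \int_{|x| < 1/a} [1 - \cos tx]\,dF(x) - \int_{|x| \ge 1/a} [1 - \cos tx]\,dF(x)$$
$$\ge \tfrac{1}{2}\sigma^2t^2 - \tfrac{1}{2}t^2 \int_{|x| < 1/a} x^2\,dF(x) - 2a^2 \int_{|x| \ge 1/a} x^2\,dF(x)$$
$$= \left(\tfrac{1}{2}t^2 - 2a^2\right) \int_{|x| \ge 1/a} x^2\,dF(x).$$
Taking $t = a\sqrt{6}$, we obtain (14), as required.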
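Inequality (14) can be checked by hand in a simple case. The sketch below assumes a symmetric two-point variable taking the values +1 and -1 with probability 1/2 each (our illustrative choice), for which the variance is 1 and the characteristic function is cos t.

```python
import math

def lhs(a):
    """Integral of x^2 dF(x) over |x| >= 1/a; all the mass sits at the points +1 and -1."""
    return 1.0 if 1.0 >= 1.0 / a else 0.0

def rhs(a, sigma2=1.0):
    """(1/a^2) [Re f(sqrt(6) a) - 1 + 3 a^2 sigma^2] with f(t) = cos t."""
    return (math.cos(math.sqrt(6.0) * a) - 1.0 + 3.0 * a * a * sigma2) / (a * a)

pairs = [(lhs(a), rhs(a)) for a in (0.25, 0.5, 1.0, 2.0, 3.0)]
for left, right in pairs:
    print(left, right)
```

For small a the left side vanishes and the right side is small but positive, which is the regime used in the proof of Theorem 4 below.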

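PROOF OF THEOREM 4. The sufficiency was established in Theorem 2. We now
prove the necessity.
Let
$$E\xi_{nk} = 0, \qquad \max_{1 \le k \le n} \sigma_{nk}^2 \to 0, \qquad n \to \infty. \qquad (15)$$
Since
$$\prod_{k=1}^n f_{nk}(t) \to e^{-(1/2)t^2}, \qquad (16)$$
we can find, for a given $t$, a number $n_0 = n_0(t)$ such that $\prod_{k=1}^n f_{nk}(t) > 0$ for
$n \ge n_0(t)$ and consequently
$$\ln \prod_{k=1}^n f_{nk}(t) = \sum_{k=1}^n \ln f_{nk}(t),$$
where the logarithms are well defined. Then since
$$|f_{nk}(t) - 1| \le \tfrac{1}{2}\sigma_{nk}^2 t^2,$$
we have, by (15),
$$\left|\sum_{k=1}^n \{\ln[1 + (f_{nk}(t) - 1)] - (f_{nk}(t) - 1)\}\right| \le \sum_{k=1}^n |f_{nk}(t) - 1|^2$$
$$\le \frac{t^4}{4} \max_{1 \le k \le n} \sigma_{nk}^2 \cdot \sum_{k=1}^n \sigma_{nk}^2 = \frac{t^4}{4} \max_{1 \le k \le n} \sigma_{nk}^2 \to 0, \qquad n \to \infty.$$
Consequently, by using (16), we have
$$\operatorname{Re} \sum_{k=1}^n [f_{nk}(t) - 1] + \tfrac{1}{2}t^2 = \sum_{k=1}^n \left[\operatorname{Re} f_{nk}(t) - 1 + \tfrac{1}{2}t^2\sigma_{nk}^2\right] \to 0.$$
In particular, if we take $t = a\sqrt{6}$ we find that
$$\sum_{k=1}^n \left[\operatorname{Re} f_{nk}(a\sqrt{6}) - 1 + 3a^2\sigma_{nk}^2\right] \to 0, \qquad n \to \infty,$$
and therefore, by (14), for every $\varepsilon = 1/a > 0$,
$$\sum_{k=1}^n \int_{|x| \ge \varepsilon} x^2\,dF_{nk}(x) \le \varepsilon^2 \sum_{k=1}^n \left[\operatorname{Re} f_{nk}(a\sqrt{6}) - 1 + 3a^2\sigma_{nk}^2\right] \to 0, \qquad n \to \infty,$$
which shows that the Lindeberg condition is satisfied.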


7. The method that we used for Theorem 1 can be used to obtain a corresponding condition for the central limit theorem without assuming that the
second moments are finite.
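For each $n \ge 1$ let
$$\xi_{n1}, \ldots, \xi_{nn}$$
be independent random variables with
$$E\xi_{nk} = 0.$$
Let $g = g(x)$ be a bounded nonnegative even function with the following
properties: $g(x) = x^2$ for $|x| \le 1$, $\min_{|x| \ge 1} g(x) > 0$, $|g'(x)| \le \mathrm{const}$.
Define $\Delta_{nk}(g)$ by the equation
$$\int_{-\infty}^{\infty} g(x)\,dF_{nk}(x) = \int_{-\infty}^{\infty} g(x)\,d\Phi\left(\frac{x}{\Delta_{nk}(g)}\right).$$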

Theorem 5. Let

and for each

> 0 let

Then

The proof is left to the reader (Problem 4); it can be carried out along the
same lines as the proof of Theorem 1.

8. PROBLEMS

1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent normally distributed random variables
with $E\xi_k = 0$, $k \ge 1$, and $V\xi_1 = 1$, $V\xi_k = 2^{k-2}$, $k \ge 2$. Show that $\{\xi_k\}$ does not satisfy
the Lindeberg condition and also is not asymptotically infinitesimal.

2. Prove (4).
3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables
with $E\xi_1 = 0$, $E\xi_1^2 = 1$. Show that

4. Prove Theorem 5. (Hint: use the method of the proof of Theorem 1, applying integration by parts twice in the integral $\int (e^{itx} - itx + \tfrac{1}{2}t^2x^2)\,d(F_{nk} - \Phi_{nk})$.)

5. Infinitely Divisible and Stable Distributions


1. In stating Poisson's theorem in 3 we found it necessary to use a triangular
array, supposing that for each n ~ 1 there was a sequence of independent
random variables {en k}, 1 ~ k ~ n.
Put

T,.

en, 1 +

+ en,n

n ~ 1.

(1)

The idea of an infinitely divisible distribution arises in the following problem: how can we determine all the distributions that can be expressed as limits of sequences of distributions of random variables $T_n$, $n \ge 1$?
Generally speaking, the problem of limit distributions is indeterminate in such great generality. Indeed, if $\xi$ is a random variable and $\xi_{n,1} = \xi$, $\xi_{n,k} = 0$, $1 < k \le n$, then $T_n \equiv \xi$ and consequently the limit distribution is the distribution of $\xi$, which can be arbitrary.
In order to have a more meaningful problem, we shall suppose in the present section that the variables $\xi_{n,1}, \ldots, \xi_{n,n}$ are, for each $n \ge 1$, not only independent, but also identically distributed.
Recall that this was the situation in Poisson's theorem (Theorem 4 of §3). The same framework also includes the central limit theorem (Theorem 3 of §3) for sums $S_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$, of independent identically distributed random variables $\xi_1, \xi_2, \ldots$. In fact, if we put

$$\xi_{n,k} = \frac{\xi_k - E\xi_k}{V_n}, \qquad V_n^2 = VS_n,$$

then

$$T_n = \sum_{k=1}^{n} \xi_{n,k} = \frac{S_n - ES_n}{V_n}.$$

Consequently both the normal and the Poisson distributions can be presented as limits in a triangular array. If $T_n \to T$, it is intuitively clear that since $T_n$ is a sum of independent identically distributed random variables, the limit variable $T$ must also be a sum of independent identically distributed random variables. With this in mind, we introduce the following definition.

Definition 1. A random variable $T$, its distribution $F_T$ and its characteristic function $\varphi_T$ are said to be infinitely divisible if, for each $n \ge 1$, there are independent identically distributed random variables $\eta_1, \ldots, \eta_n$ such that† $T \stackrel{d}{=} \eta_1 + \cdots + \eta_n$ (or, equivalently, $F_T = F_{\eta_1} * \cdots * F_{\eta_n}$, or $\varphi_T = (\varphi_{\eta_1})^n$).

Theorem 1. A random variable $T$ can be a limit in distribution of sums $T_n = \sum_{k=1}^{n} \xi_{n,k}$ if and only if $T$ is infinitely divisible.

† The notation $\xi \stackrel{d}{=} \eta$ means that the random variables $\xi$ and $\eta$ agree in distribution, i.e. $F_\xi(x) = F_\eta(x)$, $x \in R$.


Ill. Convergence of Probability Measures. Central Limit Theorem

PROOF. If $T$ is infinitely divisible, for each $n \ge 1$ there are independent identically distributed random variables $\xi_{n,1}, \ldots, \xi_{n,n}$ such that $T \stackrel{d}{=} \xi_{n,1} + \cdots + \xi_{n,n}$, and this means that $T \stackrel{d}{=} T_n$, $n \ge 1$.
Conversely, let $T_n \xrightarrow{d} T$. Let us show that $T$ is infinitely divisible, i.e. for each $k$ there are independent identically distributed random variables $\eta_1, \ldots, \eta_k$ such that $T \stackrel{d}{=} \eta_1 + \cdots + \eta_k$.
Choose a $k \ge 1$ and represent $T_{nk}$ in the form $\zeta_n^{(1)} + \cdots + \zeta_n^{(k)}$, where

$$\zeta_n^{(1)} = \xi_{nk,1} + \cdots + \xi_{nk,n}, \quad \ldots, \quad \zeta_n^{(k)} = \xi_{nk,n(k-1)+1} + \cdots + \xi_{nk,nk}.$$


Since $T_{nk} \xrightarrow{d} T$, $n \to \infty$, the sequence of distribution functions corresponding to the random variables $T_{nk}$, $n \ge 1$, is relatively compact and therefore, by Prohorov's theorem, is tight. Moreover,

$$[P(\zeta_n^{(1)} > z)]^k = P(\zeta_n^{(1)} > z, \ldots, \zeta_n^{(k)} > z) \le P(T_{nk} > kz)$$

and

$$[P(\zeta_n^{(1)} < -z)]^k = P(\zeta_n^{(1)} < -z, \ldots, \zeta_n^{(k)} < -z) \le P(T_{nk} < -kz).$$

The family of distributions for $\zeta_n^{(1)}$, $n \ge 1$, is tight because of the preceding two inequalities and because the family of distributions for $T_{nk}$, $n \ge 1$, is tight. Therefore there are a subsequence $\{n_i\} \subseteq \{n\}$ and a random variable $\eta_1$ such that $\zeta_{n_i}^{(1)} \xrightarrow{d} \eta_1$ as $n_i \to \infty$. Since the variables $\zeta_n^{(1)}, \ldots, \zeta_n^{(k)}$ are identically distributed, we have $\zeta_{n_i}^{(2)} \xrightarrow{d} \eta_2, \ldots, \zeta_{n_i}^{(k)} \xrightarrow{d} \eta_k$, where $\eta_1 \stackrel{d}{=} \eta_2 \stackrel{d}{=} \cdots \stackrel{d}{=} \eta_k$. Since $\zeta_n^{(1)}, \ldots, \zeta_n^{(k)}$ are independent, it follows from the corollary to Theorem 1 of §3 that $\eta_1, \ldots, \eta_k$ are independent and

$$T_{n_i k} = \zeta_{n_i}^{(1)} + \cdots + \zeta_{n_i}^{(k)} \xrightarrow{d} \eta_1 + \cdots + \eta_k.$$

But $T_{n_i k} \xrightarrow{d} T$; therefore (Problem 1)

$$T \stackrel{d}{=} \eta_1 + \cdots + \eta_k.$$

This completes the proof of the theorem.

Remark. The conclusion of the theorem remains valid if we replace the hypothesis that $\xi_{n,1}, \ldots, \xi_{n,n}$ are identically distributed for each $n \ge 1$ by the hypothesis that they are uniformly asymptotically infinitesimal (4.2).
2. To test whether a given random variable $T$ is infinitely divisible, it is simplest to begin with its characteristic function $\varphi(t)$. If we can find characteristic functions $\varphi_n(t)$ such that $\varphi(t) = [\varphi_n(t)]^n$ for every $n \ge 1$, then $T$ is infinitely divisible.
In the Gaussian case,

$$\varphi(t) = e^{itm} e^{-t^2\sigma^2/2},$$

and if we put

$$\varphi_n(t) = e^{itm/n} e^{-t^2\sigma^2/(2n)},$$

we see at once that $\varphi(t) = [\varphi_n(t)]^n$.


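In the Poisson case,

$$\varphi(t) = \exp\{\lambda(e^{it} - 1)\},$$

and if we put $\varphi_n(t) = \exp\{(\lambda/n)(e^{it} - 1)\}$ then $\varphi(t) = [\varphi_n(t)]^n$.
If a random variable $T$ has a $\Gamma$-distribution with density

$$f(x) = \begin{cases} \dfrac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

it is easy to show that its characteristic function is

$$\varphi(t) = \frac{1}{(1 - i\beta t)^{\alpha}}.$$

Consequently $\varphi(t) = [\varphi_n(t)]^n$ where

$$\varphi_n(t) = \frac{1}{(1 - i\beta t)^{\alpha/n}},$$

and therefore $T$ is infinitely divisible.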
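
The three factorizations above can be checked numerically. The following is only a sketch under stated assumptions: the parameter values ($m = 1.5$, $\sigma^2 = 2$, $\lambda = 3$, $\alpha = 2.5$, $\beta = 0.7$), the grid of $t$ values and the function names are chosen here purely for illustration and do not come from the text. For each family it compares $\varphi(t)$ with $[\varphi_n(t)]^n$, where $\varphi_n$ is the same characteristic function with the parameters divided by $n$ as described above.

```python
import cmath

def phi_gauss(t, m, sigma2):
    # Characteristic function of N(m, sigma2).
    return cmath.exp(1j * t * m - 0.5 * t * t * sigma2)

def phi_poisson(t, lam):
    # Characteristic function of the Poisson distribution with parameter lam.
    return cmath.exp(lam * (cmath.exp(1j * t) - 1.0))

def phi_gamma(t, alpha, beta):
    # Characteristic function of the Gamma distribution; the principal branch
    # of the power is unambiguous because Re(1 - i*beta*t) = 1 > 0.
    return (1.0 - 1j * beta * t) ** (-alpha)

def max_root_error(n, ts):
    # Largest value of |phi(t) - phi_n(t)^n| over the grid, for all three families.
    worst = 0.0
    for t in ts:
        pairs = [
            (phi_gauss(t, 1.5, 2.0), phi_gauss(t, 1.5 / n, 2.0 / n)),
            (phi_poisson(t, 3.0), phi_poisson(t, 3.0 / n)),
            (phi_gamma(t, 2.5, 0.7), phi_gamma(t, 2.5 / n, 0.7)),
        ]
        for whole, root in pairs:
            worst = max(worst, abs(whole - root ** n))
    return worst

ts = [k / 10.0 for k in range(-50, 51)]
errors = {n: max_root_error(n, ts) for n in (2, 5, 17)}
print(errors)  # every entry should be at the level of rounding error
```

Such a check does not prove infinite divisibility (that requires each $\varphi_n$ to be a characteristic function), but it confirms the algebraic identity $\varphi = \varphi_n^n$ on the chosen grid.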


We quote without proof the following result on the general form of the
characteristic functions of infinitely divisible distributions.

Theorem 2 (Levy-Khinchin Theorem). A random variable $T$ is infinitely divisible if and only if $\varphi(t) = \exp \psi(t)$ and

$$\psi(t) = it\beta - \frac{t^2\sigma^2}{2} + \int_{-\infty}^{\infty} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right) \frac{1 + x^2}{x^2}\, d\lambda(x), \tag{2}$$

where $\beta \in R$, $\sigma^2 \ge 0$ and $\lambda$ is a finite measure on $(R, \mathscr{B}(R))$ with $\lambda\{0\} = 0$.

3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables and $S_n = \xi_1 + \cdots + \xi_n$. Suppose that there are constants $b_n$ and $a_n > 0$, and a random variable $T$, such that

$$\frac{S_n - b_n}{a_n} \xrightarrow{d} T. \tag{3}$$

We ask for a description of the distributions (random variables $T$) that can be obtained as limit distributions in (3).
If the independent identically distributed random variables $\xi_1, \xi_2, \ldots$ satisfy $0 < \sigma^2 = V\xi_1 < \infty$, then if we put $b_n = nE\xi_1$ and $a_n = \sigma\sqrt{n}$, we find by §4 that $T$ has the normal distribution $\mathscr{N}(0, 1)$.
If $f(x) = \theta/\pi(x^2 + \theta^2)$ is the Cauchy density (with parameter $\theta > 0$) and $\xi_1, \xi_2, \ldots$ are independent random variables with density $f(x)$, the characteristic functions $\varphi_{\xi_1}(t)$ are equal to $e^{-\theta|t|}$ and therefore $\varphi_{S_n/n}(t) = (e^{-\theta|t|/n})^n = e^{-\theta|t|}$, i.e. $S_n/n$ also has a Cauchy distribution (with the same parameter $\theta$).
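
The Cauchy example can be illustrated by simulation. The sketch below is not part of the original argument; the sample sizes, the seed, and the helper names `cauchy_sample` and `quartiles` are assumptions made for this illustration. It uses the fact that a Cauchy distribution with parameter $\theta$ has quartiles at $\pm\theta$, so half of the interquartile range of the averages $S_n/n$ should stay close to $\theta$ instead of shrinking as $n$ grows.

```python
import math
import random

def cauchy_sample(theta, rng):
    # Inverse-CDF sampling: theta * tan(pi * (U - 1/2)), U uniform on [0, 1).
    return theta * math.tan(math.pi * (rng.random() - 0.5))

def quartiles(xs):
    # Crude empirical lower and upper quartiles.
    ys = sorted(xs)
    m = len(ys)
    return ys[m // 4], ys[(3 * m) // 4]

rng = random.Random(12345)          # fixed seed so the run is reproducible
theta, n, reps = 1.0, 20, 5000      # illustrative values only

# reps independent copies of S_n / n
means = [sum(cauchy_sample(theta, rng) for _ in range(n)) / n for _ in range(reps)]

q1, q3 = quartiles(means)
half_iqr = 0.5 * (q3 - q1)
print(half_iqr)  # expected to be near theta, up to sampling error
```

For summands with finite variance the same statistic would shrink like $1/\sqrt{n}$; here it does not, which reflects the normalization $a_n = n$ rather than $a_n = \sqrt{n}$.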

Consequently there are other limit distributions besides the normal: the Cauchy distribution, for example.
If we put $\xi_{nk} = (\xi_k/a_n) - (b_n/na_n)$, $1 \le k \le n$, we find that

$$\frac{S_n - b_n}{a_n} = \sum_{k=1}^{n} \xi_{n,k} \quad (= T_n).$$

Therefore all distributions for $T$ that can conceivably appear as limits in (3) are necessarily (in agreement with Theorem 1) infinitely divisible. However, the specific characteristics of the variable $T_n = (S_n - b_n)/a_n$ may make it possible to obtain further information on the structure of the limit distributions that arise.
For this reason we introduce the following definition.

Definition 2. A random variable $T$, its distribution function $F(x)$, and its characteristic function $\varphi(t)$ are stable if, for every $n \ge 1$, there are constants $a_n > 0$, $b_n$, and independent random variables $\xi_1, \ldots, \xi_n$ distributed like $T$, such that

$$a_n T + b_n \stackrel{d}{=} \xi_1 + \cdots + \xi_n \tag{4}$$

or, equivalently, $F[(x - b_n)/a_n] = \underbrace{F * \cdots * F}_{n \text{ times}}(x)$, or

$$[\varphi(t)]^n = \varphi(a_n t)\, e^{ib_n t}. \tag{5}$$


Theorem 3. A necessary and sufficient condition for the random variable $T$ to be a limit in distribution of random variables $(S_n - b_n)/a_n$, $a_n > 0$, is that $T$ is stable.

PROOF. If $T$ is stable, then by (4)

$$T \stackrel{d}{=} \frac{S_n - b_n}{a_n},$$

where $S_n = \xi_1 + \cdots + \xi_n$, and consequently $(S_n - b_n)/a_n \xrightarrow{d} T$.
Conversely, let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables, $S_n = \xi_1 + \cdots + \xi_n$ and $(S_n - b_n)/a_n \xrightarrow{d} T$, $a_n > 0$. Let us show that $T$ is a stable random variable.
If $T$ is degenerate, it is evidently stable. Let us suppose that $T$ is nondegenerate.
Choose $k \ge 1$ and write

$$S_n^{(1)} = \xi_1 + \cdots + \xi_n, \quad \ldots, \quad S_n^{(k)} = \xi_{(k-1)n+1} + \cdots + \xi_{kn},$$

$$T_n^{(1)} = \frac{S_n^{(1)} - b_n}{a_n}, \quad \ldots, \quad T_n^{(k)} = \frac{S_n^{(k)} - b_n}{a_n}.$$

It is clear that all the variables $T_n^{(1)}, \ldots, T_n^{(k)}$ have the same distribution and

$$T_n^{(i)} \xrightarrow{d} T, \qquad n \to \infty, \quad i = 1, \ldots, k.$$

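Write

$$U_n^{(k)} = T_n^{(1)} + \cdots + T_n^{(k)}.$$

Then

$$U_n^{(k)} \xrightarrow{d} T^{(1)} + \cdots + T^{(k)},$$

where $T^{(1)} \stackrel{d}{=} \cdots \stackrel{d}{=} T^{(k)} \stackrel{d}{=} T$.
On the other hand,

$$U_n^{(k)} = \frac{\xi_1 + \cdots + \xi_{kn} - kb_n}{a_n} = \frac{a_{kn}}{a_n}\left(\frac{\xi_1 + \cdots + \xi_{kn} - b_{kn}}{a_{kn}}\right) + \frac{b_{kn} - kb_n}{a_n} = \alpha_n^{(k)} V_{kn} + \beta_n^{(k)}, \tag{6}$$

where

$$\alpha_n^{(k)} = \frac{a_{kn}}{a_n}, \qquad \beta_n^{(k)} = \frac{b_{kn} - kb_n}{a_n}$$

and

$$V_{kn} = \frac{\xi_1 + \cdots + \xi_{kn} - b_{kn}}{a_{kn}}.$$

It is clear from (6) that

$$V_{kn} = \frac{U_n^{(k)} - \beta_n^{(k)}}{\alpha_n^{(k)}},$$

where $V_{kn} \xrightarrow{d} T$, $U_n^{(k)} \xrightarrow{d} T^{(1)} + \cdots + T^{(k)}$, $n \to \infty$.
It follows from the lemma established below that there are constants $\alpha^{(k)} > 0$ and $\beta^{(k)}$ such that $\alpha_n^{(k)} \to \alpha^{(k)}$ and $\beta_n^{(k)} \to \beta^{(k)}$ as $n \to \infty$. Therefore

$$T \stackrel{d}{=} \frac{T^{(1)} + \cdots + T^{(k)} - \beta^{(k)}}{\alpha^{(k)}},$$

which shows that $T$ is a stable random variable.
This completes the proof of the theorem.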
We now state and prove the lemma that we used above.

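Lemma. Let $\xi_n \xrightarrow{d} \xi$ and let there be constants $a_n > 0$ and $b_n$ such that

$$a_n\xi_n + b_n \xrightarrow{d} \tilde{\xi},$$

where the random variables $\xi$ and $\tilde{\xi}$ are not degenerate. Then there are constants $a > 0$ and $b$ such that $\lim a_n = a$, $\lim b_n = b$, and

$$\tilde{\xi} \stackrel{d}{=} a\xi + b.$$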

PROOF. Let $\varphi_n$, $\varphi$ and $\tilde{\varphi}$ be the characteristic functions of $\xi_n$, $\xi$ and $\tilde{\xi}$, respectively. Then $\varphi_{a_n\xi_n + b_n}(t)$, the characteristic function of $a_n\xi_n + b_n$, is equal to $e^{itb_n}\varphi_n(a_n t)$ and, by Theorem 1 and Problem 3 of §3,

$$e^{itb_n}\varphi_n(a_n t) \to \tilde{\varphi}(t), \tag{7}$$

$$\varphi_n(t) \to \varphi(t) \tag{8}$$

uniformly in $t$ on every finite interval.


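Let $\{n_i\}$ be a subsequence of $\{n\}$ such that $a_{n_i} \to a$. Let us first show that $a < \infty$. Suppose that $a = \infty$. By (7),

$$\sup_{|t| \le c} \bigl|\, |\varphi_n(a_n t)| - |\tilde{\varphi}(t)| \,\bigr| \to 0, \qquad n \to \infty,$$

for every $c > 0$. We replace $t$ by $t_0/a_{n_i}$. Then, since $a_{n_i} \to \infty$, we have

$$\bigl|\, |\varphi_{n_i}(t_0)| - |\tilde{\varphi}(t_0/a_{n_i})| \,\bigr| \to 0$$

and therefore

$$|\varphi_{n_i}(t_0)| \to |\tilde{\varphi}(0)| = 1.$$

But $|\varphi_{n_i}(t_0)| \to |\varphi(t_0)|$. Therefore $|\varphi(t_0)| = 1$ for every $t_0 \in R$, and consequently, by Theorem 5, §12, Chapter II, the random variable $\xi$ must be degenerate, which contradicts the hypotheses of the lemma.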
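Thus $a < \infty$. Now suppose that there are two subsequences $\{n_i\}$ and $\{n_i'\}$ such that $a_{n_i} \to a$, $a_{n_i'} \to a'$, where $a \ne a'$; suppose for definiteness that $0 \le a' < a$. Then by (7) and (8),

$$|\varphi_{n_i}(a_{n_i}t)| \to |\varphi(at)|, \qquad |\varphi_{n_i}(a_{n_i}t)| \to |\tilde{\varphi}(t)|$$

and

$$|\varphi_{n_i'}(a_{n_i'}t)| \to |\varphi(a't)|, \qquad |\varphi_{n_i'}(a_{n_i'}t)| \to |\tilde{\varphi}(t)|.$$

Consequently

$$|\varphi(at)| = |\varphi(a't)|,$$

and therefore, for all $t \in R$,

$$|\varphi(t)| = \left|\varphi\!\left(\frac{a'}{a}t\right)\right| = \cdots = \left|\varphi\!\left(\left(\frac{a'}{a}\right)^{n} t\right)\right| \to 1, \qquad n \to \infty.$$

Therefore $|\varphi(t)| = 1$ and, by Theorem 5 of §12, Chapter II, it follows that $\xi$ is a degenerate random variable. This contradiction shows that $a = a'$ and therefore that there is a finite limit $\lim a_n = a$, with $a \ge 0$.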
Let us now show that there is a limit $\lim b_n = b$, and that $a > 0$. Since (8) is satisfied uniformly on each finite interval, we have

$$\varphi_n(a_n t) \to \varphi(at),$$

and therefore, by (7), the limit $\lim_{n\to\infty} e^{itb_n}$ exists for all $t$ such that $\varphi(at) \ne 0$. Let $\delta > 0$ be such that $\varphi(at) \ne 0$ for all $|t| < \delta$. For such $t$, $\lim e^{itb_n}$ exists. Hence we can deduce (Problem 9) that $\limsup |b_n| < \infty$.
Let there be two sequences $\{n_i\}$ and $\{n_i'\}$ such that $\lim b_{n_i} = b$ and $\lim b_{n_i'} = b'$. Then

$$e^{itb} = e^{itb'}$$

for $|t| < \delta$, and consequently $b = b'$. Thus there is a finite limit $b = \lim b_n$ and, by (7),

$$\tilde{\varphi}(t) = e^{itb}\varphi(at),$$

which means that $\tilde{\xi} \stackrel{d}{=} a\xi + b$. Since $\tilde{\xi}$ is not degenerate, we have $a > 0$.
This completes the proof of the lemma.
4. We quote without proof a theorem on the general form of the characteristic functions of stable distributions.

Theorem 4 (Levy-Khinchin Representation). A random variable $T$ is stable if and only if its characteristic function $\varphi(t)$ has the form $\varphi(t) = \exp \psi(t)$,

$$\psi(t) = it\beta - d|t|^{\alpha}\left(1 + i\theta\frac{t}{|t|}G(t, \alpha)\right), \tag{9}$$

where $0 < \alpha < 2$, $\beta \in R$, $d \ge 0$, $|\theta| \le 1$, $t/|t| = 0$ for $t = 0$, and

$$G(t, \alpha) = \begin{cases} \tan \tfrac12\pi\alpha & \text{if } \alpha \ne 1, \\ (2/\pi)\log|t| & \text{if } \alpha = 1. \end{cases} \tag{10}$$
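
For a symmetric law, taking $\beta = 0$ and $\theta = 0$ in (9) gives $\varphi(t) = \exp\{-d|t|^{\alpha}\}$, and the stability relation $[\varphi(t)]^n = \varphi(a_n t)e^{ib_n t}$ can then be checked directly with $a_n = n^{1/\alpha}$, $b_n = 0$. The following sketch does this on a grid; the parameter values and the names `phi_sym_stable` and `stability_gap` are illustrative assumptions, not notation from the text.

```python
import math

def phi_sym_stable(t, alpha, d):
    # Symmetric stable characteristic function exp(-d * |t|**alpha).
    return math.exp(-d * abs(t) ** alpha)

def stability_gap(alpha, d, n, ts):
    # Largest value of |phi(t)^n - phi(a_n t)| on the grid, with a_n = n**(1/alpha), b_n = 0.
    a_n = n ** (1.0 / alpha)
    return max(
        abs(phi_sym_stable(t, alpha, d) ** n - phi_sym_stable(a_n * t, alpha, d))
        for t in ts
    )

ts = [k / 20.0 for k in range(-60, 61)]
gaps = [
    stability_gap(alpha, 0.8, n, ts)
    for alpha in (0.5, 1.0, 1.5, 2.0)   # alpha = 1: Cauchy, alpha = 2: normal
    for n in (2, 3, 10)
]
print(max(gaps))  # rounding-level discrepancy only
```

The normalizing constants $a_n = n^{1/\alpha}$ recover the two cases met earlier: $a_n = \sqrt{n}$ for the normal law and $a_n = n$ for the Cauchy law.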

Observe that it is easy to exhibit characteristic functions of symmetric stable distributions:

$$\varphi(t) = e^{-d|t|^{\alpha}}, \tag{11}$$

where $0 < \alpha \le 2$, $d \ge 0$.

5. PROBLEMS

1. Show that $\xi \stackrel{d}{=} \eta$ if $\xi_n \xrightarrow{d} \xi$ and $\xi_n \xrightarrow{d} \eta$.

2. Show that if $\varphi_1$ and $\varphi_2$ are infinitely divisible characteristic functions, so is $\varphi_1 \cdot \varphi_2$.

3. Let $\varphi_n$ be infinitely divisible characteristic functions and let $\varphi_n(t) \to \varphi(t)$ for every $t \in R$, where $\varphi(t)$ is a characteristic function. Show that $\varphi(t)$ is infinitely divisible.

4. Show that the characteristic function of an infinitely divisible distribution cannot take the value 0.

5. Give an example of a random variable that is infinitely divisible but not stable.

6. Show that a stable random variable $\xi$ always satisfies the inequality $E|\xi|^r < \infty$ for all $r \in (0, \alpha)$.

7. Show that if $\xi$ is a stable random variable with parameter $0 < \alpha \le 1$, then $\varphi(t)$ is not differentiable at $t = 0$.

8. Prove that $e^{-d|t|^{\alpha}}$ is a characteristic function provided that $d \ge 0$, $0 < \alpha \le 2$.

9. Let $(b_n)_{n \ge 1}$ be a sequence of numbers such that $\lim_n e^{itb_n}$ exists for all $|t| < \delta$, $\delta > 0$. Show that $\limsup |b_n| < \infty$.


6. Rapidity of Convergence in the Central Limit Theorem

1. Let $\xi_{n1}, \ldots, \xi_{nn}$ be a sequence of independent random variables, $S_n = \xi_{n1} + \cdots + \xi_{nn}$, $F_n(x) = P(S_n \le x)$. If $S_n \to \mathscr{N}(0, 1)$, then $F_n(x) \to \Phi(x)$ for every $x \in R$. Since $\Phi(x)$ is continuous, the convergence here is actually uniform (Problem 5 in §1):

$$\sup_x |F_n(x) - \Phi(x)| \to 0, \qquad n \to \infty. \tag{1}$$

In particular, it follows that

$$P(S_n \le x) - \Phi\!\left(\frac{x - ES_n}{\sqrt{VS_n}}\right) \to 0, \qquad n \to \infty$$

(under the assumption that $ES_n$ and $VS_n$ exist and are finite).

It is natural to ask how rapid the convergence in (1) is. We shall establish a result for the case when

$$S_n = \frac{\xi_1 + \cdots + \xi_n}{\sigma\sqrt{n}}, \qquad n \ge 1,$$

where $\xi_1, \xi_2, \ldots$ is a sequence of independent identically distributed random variables with $E\xi_k = 0$, $V\xi_k = \sigma^2$ and $E|\xi_1|^3 < \infty$.

Theorem (Berry and Esseen). We have the bound

$$\sup_x |F_n(x) - \Phi(x)| \le \frac{CE|\xi_1|^3}{\sigma^3\sqrt{n}}, \tag{2}$$

where $C$ is an absolute constant ($(2\pi)^{-1/2} \le C < 0.8$).


PROOF. For simplicity, let $\sigma^2 = 1$ and $\beta_3 = E|\xi_1|^3$. By Esseen's inequality (Subsection 10, §12, Chapter II)

$$\sup_x |F_n(x) - \Phi(x)| \le \frac{2}{\pi}\int_0^T \left|\frac{f_n(t) - \varphi(t)}{t}\right| dt + \frac{24}{\pi T}\cdot\frac{1}{\sqrt{2\pi}}, \tag{3}$$

where $\varphi(t) = e^{-t^2/2}$ and

$$f_n(t) = [f(t/\sqrt{n})]^n,$$

with $f(t) = Ee^{it\xi_1}$.
In (3) we may take $T$ arbitrarily. Let us choose

$$T = \sqrt{n}/(5\beta_3).$$

We are going to show that for this $T$,

$$|f_n(t) - \varphi(t)| \le \frac{7}{6}\frac{\beta_3}{\sqrt{n}}|t|^3 e^{-t^2/4}, \qquad |t| \le T. \tag{4}$$

The required estimate (2), with $C$ an absolute constant, will follow immediately from (3) by means of (4). (A more detailed analysis shows that $C < 0.8$.)
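We now turn to the proof of (4).
By Taylor's formula with integral remainder,

$$f(t) = 1 + \frac{t}{1!}f'(0) + \frac{t^2}{2!}f''(0) + \frac{t^3}{2}\int_0^1 (1 - v)^2 f'''(vt)\, dv. \tag{5}$$

Hence, since $f'(0) = 0$, $f''(0) = -1$ and $|f'''| \le \beta_3$,

$$f\!\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + \frac{\theta t^3}{6n^{3/2}}\beta_3,$$

where $|\theta| \le 1$.
If $|t| \le \sqrt{n}/(5\beta_3)$, we have, since $\beta_3 \ge \sigma^3 = 1$,

$$\frac{t^2}{2n} + \frac{|t^3|\beta_3}{6n^{3/2}} \le \frac{1}{25}.$$

Consequently

$$|f(t/\sqrt{n})| \ge \frac{24}{25}$$

for $|t| \le T = \sqrt{n}/(5\beta_3)$, and hence we may write

$$[f(t/\sqrt{n})]^n = e^{n \ln f(t/\sqrt{n})}. \tag{6}$$

By (5) (with $f(t)$ replaced by $\ln f(t/\sqrt{n})$) we find that

$$\ln f(t/\sqrt{n}) = -\frac{t^2}{2n} + \frac{\theta t^3}{6n^{3/2}}(\ln f)'''(\theta_1 t/\sqrt{n}); \tag{7}$$

here $|\theta_1| \le 1$ and

$$|(\ln f)'''| = \left|\bigl[f''' \cdot f^2 - 3f''f'f + 2(f')^3\bigr] f^{-3}\right| \le (\beta_3 + 3\beta_2\beta_1 + 2\beta_1^3)\left(\tfrac{24}{25}\right)^{-3} \le 7\beta_3, \tag{8}$$

where $\beta_k = E|\xi_1|^k$, $k = 1, 2, 3$.
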
Using the inequality $|e^z - 1| \le |z|e^{|z|}$, we can now show, for $|t| \le T = \sqrt{n}/(5\beta_3)$, that

$$\bigl|[f(t/\sqrt{n})]^n - e^{-t^2/2}\bigr| = \bigl|e^{n \ln f(t/\sqrt{n})} - e^{-t^2/2}\bigr| \le \frac{7}{6}\frac{\beta_3|t|^3}{\sqrt{n}}\exp\left\{-\frac{t^2}{2} + \frac{7}{6}|t|^3\frac{\beta_3}{\sqrt{n}}\right\} \le \frac{7}{6}\frac{\beta_3|t|^3}{\sqrt{n}}e^{-t^2/4}.$$

This completes the proof of the theorem.
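
The key estimate $|f_n(t) - e^{-t^2/2}| \le \tfrac{7}{6}\beta_3 n^{-1/2}|t|^3 e^{-t^2/4}$ for $|t| \le \sqrt{n}/(5\beta_3)$ can be examined numerically in a simple special case. The sketch below assumes symmetric Bernoulli summands ($\xi = \pm 1$ with probability $\tfrac12$ each), for which $\sigma^2 = 1$, $\beta_3 = 1$ and $f(t) = \cos t$; this choice of distribution, the grid size and the function name `worst_ratio` are assumptions made for the illustration only.

```python
import math

def worst_ratio(n, grid=400):
    # Largest ratio (left side)/(right side) of the estimate on 0 < t <= T,
    # for symmetric Bernoulli summands: f(t) = cos t, beta_3 = 1.
    T = math.sqrt(n) / 5.0
    worst = 0.0
    for j in range(1, grid + 1):
        t = T * j / grid
        lhs = abs(math.cos(t / math.sqrt(n)) ** n - math.exp(-t * t / 2.0))
        rhs = (7.0 / 6.0) * t ** 3 * math.exp(-t * t / 4.0) / math.sqrt(n)
        worst = max(worst, lhs / rhs)
    return worst

ratios = {n: worst_ratio(n) for n in (1, 4, 25, 100, 400)}
print(ratios)  # every ratio should be at most 1 if the estimate holds on the grid
```

In this example the ratios are far below 1, which is consistent with the remark in the proof that the argument does not aim at sharp constants.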

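Remark. We observe that unless we make some supplementary hypothesis about the behavior of the random variables that are added, (2) cannot be improved. In fact, let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables with

$$P(\xi_k = 1) = P(\xi_k = -1) = \tfrac12.$$

It is evident by symmetry that

$$2P\!\left(\sum_{k=1}^{2n}\xi_k < 0\right) + P\!\left(\sum_{k=1}^{2n}\xi_k = 0\right) = 1,$$

and hence by Stirling's formula

$$\left|P\!\left(\sum_{k=1}^{2n}\xi_k < 0\right) - \frac12\right| = \frac12 P\!\left(\sum_{k=1}^{2n}\xi_k = 0\right) = \frac12 C_{2n}^{n} \cdot 2^{-2n} \sim \frac{1}{2\sqrt{\pi n}} = \frac{1}{\sqrt{2\pi}\sqrt{2n}}.$$

It follows, in particular, that the constant $C$ in (2) cannot be less than $(2\pi)^{-1/2}$.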
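
For symmetric Bernoulli summands the distance $\sup_x |F_N(x) - \Phi(x)|$ can be computed exactly, since $F_N$ is a step function and the supremum is attained at a jump point (from the left or from the right). The following sketch, with hypothetical helper names `Phi` and `sup_distance` and an arbitrary choice of $N$ values, compares the exact distance with the upper bound $0.8/\sqrt{N}$ from (2) (here $\sigma = 1$ and $E|\xi_1|^3 = 1$) and with the lower estimate $\tfrac12 P(\xi_1 + \cdots + \xi_N = 0)$ used in the remark.

```python
import math

def Phi(x):
    # Standard normal distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(N):
    # Exact sup_x |F_N(x) - Phi(x)| for S_N = (xi_1 + ... + xi_N) / sqrt(N),
    # where xi_k = +1 or -1 with probability 1/2 each.
    worst, cdf = 0.0, 0.0
    for k in range(N + 1):
        x = (2 * k - N) / math.sqrt(N)             # k-th atom of S_N
        worst = max(worst, abs(cdf - Phi(x)))      # left limit F_N(x-)
        cdf += math.comb(N, k) / 2.0 ** N
        worst = max(worst, abs(cdf - Phi(x)))      # value F_N(x)
    return worst

results = {}
for N in (10, 100, 1000):
    d = sup_distance(N)
    lower = 0.5 * math.comb(N, N // 2) / 2 ** N    # (1/2) P(sum = 0), N even
    upper = 0.8 / math.sqrt(N)                     # right side of (2) with C = 0.8
    results[N] = (lower, d, upper)
    print(N, lower, d, upper, d * math.sqrt(N))
```

As $N$ grows, the last printed column should approach $(2\pi)^{-1/2} \approx 0.399$, in agreement with the lower bound on $C$.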

2. PROBLEMS

1. Prove (8).

2. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables with $E\xi_k = 0$, $V\xi_k = \sigma^2$ and $E|\xi_1|^3 < \infty$. It is known [53] that the following nonuniform inequality holds: for all $x \in R$,

$$|F_n(x) - \Phi(x)| \le \frac{CE|\xi_1|^3}{\sigma^3\sqrt{n}} \cdot \frac{1}{(1 + |x|)^3}.$$

Prove this, at least for Bernoulli random variables.

7. Rapidity of Convergence in Poisson's Theorem


1. Let $\eta_1, \eta_2, \ldots, \eta_n$ be independent Bernoulli random variables, taking the values 1 and 0 with probabilities

$$P(\eta_k = 1) = p_k, \qquad P(\eta_k = 0) = 1 - p_k, \qquad 1 \le k \le n.$$

Write

$$S = \eta_1 + \cdots + \eta_n, \qquad P_k = P(S = k), \qquad \pi_k = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, \ldots; \ \lambda > 0.$$

In §6 of Chapter I we observed that when $p_1 = \cdots = p_n = p$ and $\lambda = np$ we have Prohorov's inequality,

$$\sum_{k=0}^{\infty} |P_k - \pi_k| \le C_1(\lambda)p,$$

where $C_1(\lambda) = 2\min(2, \lambda)$.
When the $p_k$ are not necessarily equal, but $\sum_{k=1}^{n} p_k = \lambda$, LeCam showed that

$$\sum_{k=0}^{\infty} |P_k - \pi_k| \le C_2(\lambda)\max_{1 \le k \le n} p_k,$$

where $C_2(\lambda) = 2\min(9, \lambda)$.
The object of the present section is to prove the following theorem, which provides an estimate of the approximation of the $P_k$ by the $\pi_k$, specifically not including the assumption that $\sum_{k=1}^{n} p_k = \lambda$. Although the proof does not produce very good constants (see $C(\lambda)$ below), it may be of interest because of its simplicity.

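Theorem.
(1) We have the following inequality:

$$\sum_{k=0}^{\infty} |P_k - \pi_k| \le e^{2\lambda}\left(2 + 4\sum_{k=1}^{n} p_k\right)\min_i \sup_{0 \le s \le 1}\left|\sum_{k=0}^{[ns]} p_{i_k} - \lambda s\right|, \tag{1}$$

where min is taken over all permutations $i = (i_1, i_2, \ldots, i_n)$ of $(1, 2, \ldots, n)$, $p_{i_0} = 0$, and $[ns]$ is the integral part of $ns$.
(2) If $\sum_{k=1}^{n} p_k = \lambda$, then

$$\sum_{k=0}^{\infty} |P_k - \pi_k| \le C(\lambda)\min_i \sup_{0 \le s \le 1}\left|\sum_{k=0}^{[ns]} p_{i_k} - \lambda s\right| \le C(\lambda)\max_{1 \le k \le n} p_k, \tag{2}$$

where $C(\lambda) = (2 + 4\lambda)e^{2\lambda}$.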
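
To get a feeling for the size of the quantities involved, one can compute $\sum_k |P_k - \pi_k|$ exactly for a small example and compare it with the LeCam bound $2\min(9, \lambda)\max p_k$ and with the bound $(2 + 4\lambda)e^{2\lambda}\max p_k$ of the theorem. The sketch below is an illustration under assumptions: the probabilities in `ps` are arbitrary, and the function names are introduced here and are not part of the text.

```python
import math

def poisson_binomial_pmf(ps):
    # Distribution of eta_1 + ... + eta_n for independent Bernoulli(p_k) variables,
    # obtained by convolving one summand at a time.
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def l1_distance_to_poisson(ps):
    # sum_k |P_k - pi_k| with lambda = p_1 + ... + p_n; for k > n only pi_k contributes.
    lam = sum(ps)
    total, poisson_mass = 0.0, 0.0
    pi_k = math.exp(-lam)
    for k, mass in enumerate(poisson_binomial_pmf(ps)):
        total += abs(mass - pi_k)
        poisson_mass += pi_k
        pi_k *= lam / (k + 1)
    return total + max(0.0, 1.0 - poisson_mass)

ps = [0.02, 0.05, 0.01, 0.08, 0.03, 0.04, 0.06, 0.02, 0.07, 0.05]
lam = sum(ps)
dist = l1_distance_to_poisson(ps)
lecam = 2.0 * min(9.0, lam) * max(ps)
theorem = (2.0 + 4.0 * lam) * math.exp(2.0 * lam) * max(ps)
print(dist, lecam, theorem)
```

In this example the exact distance lies below both bounds, and the bound with $C(\lambda)$ is visibly the cruder of the two, as the text anticipates.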


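2. The key to the proof is the following lemma.

Lemma 1. Let $S(t) = \sum_{k=0}^{[nt]} \eta_k$, where $\eta_0 = 0$, $0 \le t \le 1$,

$$P_k(t) = P(S(t) = k), \qquad \pi_k(t) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \qquad k = 0, 1, \ldots.$$

Then for every $t$ in $0 \le t \le 1$,

$$\sum_{k=0}^{\infty} |P_k(t) - \pi_k(t)| \le e^{2\lambda t}\left(2 + 4\sum_{k=0}^{[nt]} p_k\right)\sup_{0 \le s \le t}\left|\sum_{k=0}^{[ns]} p_k - \lambda s\right|. \tag{3}$$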

PROOF. Let us introduce the variables

$$X_k(t) = I(S(t) = k),$$

where $I(A)$ is the indicator of $A$. For each sample point, the function $S(t)$, $0 \le t \le 1$, is a generalized distribution function, for which, consequently, the Lebesgue-Stieltjes integral

$$\int_0^t X_k(s-)\, dS(s)$$

is defined; it is in fact just the sum

$$\sum_{j=1}^{[nt]} X_k\!\left(\frac{j - 1}{n}\right)\eta_j.$$

A simple argument shows that $X_k(t)$, $k \ge 0$, satisfy

$$X_0(t) = 1 - \int_0^t X_0(s-)\, dS(s), \qquad X_k(t) = -\int_0^t [X_k(s-) - X_{k-1}(s-)]\, dS(s), \quad k \ge 1, \tag{4}$$

where $X_0(0) = 1$ and $X_k(0) = 0$ for $k \ge 1$.
Now $EX_k(t) = P_k(t)$ and

$$E\int_0^t X_k(s-)\, dS(s) = E\sum_{j=1}^{[nt]} X_k\!\left(\frac{j - 1}{n}\right)\eta_j = \sum_{j=1}^{[nt]} EX_k\!\left(\frac{j - 1}{n}\right)E\eta_j = \sum_{j=1}^{[nt]} P_k\!\left(\frac{j - 1}{n}\right)p_j = \int_0^t P_k(s-)\, dA(s),$$

where

$$A(t) = \sum_{k=0}^{[nt]} p_k \quad (= ES(t)).$$


Hence if we average the left-hand and right-hand sides of (4) we find that

$$P_0(t) = 1 - \int_0^t P_0(s-)\, dA(s), \qquad P_k(t) = -\int_0^t [P_k(s-) - P_{k-1}(s-)]\, dA(s), \quad k \ge 1. \tag{5}$$

In turn, it is easily verified that $\pi_k(t)$, $k \ge 0$, $0 \le t \le 1$, satisfy the similar system

$$\pi_0(t) = 1 - \int_0^t \pi_0(s-)\, d(\lambda s), \qquad \pi_k(t) = -\int_0^t [\pi_k(s-) - \pi_{k-1}(s-)]\, d(\lambda s), \quad k \ge 1.$$

Therefore

$$\pi_0(t) - P_0(t) = -\int_0^t [\pi_0(s-) - P_0(s-)]\, d(\lambda s) + \int_0^t P_0(s-)\, d(A(s) - \lambda s) \tag{6}$$

and

$$\pi_k(t) - P_k(t) = -\int_0^t [\pi_k(s-) - P_k(s-)]\, d(\lambda s) + \int_0^t [\pi_{k-1}(s-) - P_{k-1}(s-)]\, d(\lambda s) + \int_0^t [P_k(s-) - P_{k-1}(s-)]\, d[A(s) - \lambda s]. \tag{7}$$

By the formula for integration by parts (namely "$d(UV) = U\,dV + V\,dU$"; see Theorem 11, §6, Chapter II) and by (5),

$$\int_0^t P_0(s-)\, d(A(s) - \lambda s) = (A(t) - \lambda t)P_0(t) + \int_0^t (A(s) - \lambda s)P_0(s-)\, dA(s), \tag{8}$$

$$\int_0^t [P_k(s-) - P_{k-1}(s-)]\, d(A(s) - \lambda s) = (A(t) - \lambda t)(P_k(t) - P_{k-1}(t)) + \int_0^t [P_k(s-) - 2P_{k-1}(s-) + P_{k-2}(s-)](A(s) - \lambda s)\, dA(s), \tag{9}$$

where it is convenient to suppose that $P_{-1}(s) \equiv 0$.


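From (6)-(9) we obtain

$$\sum_{k=0}^{\infty} |\pi_k(t) - P_k(t)| \le 2\int_0^t \sum_{k=0}^{\infty} |\pi_k(s-) - P_k(s-)|\, d(\lambda s) + 2|A(t) - \lambda t| + 4A(t)\max_{0 \le s \le t} |A(s) - \lambda s|$$

$$\le 2\int_0^t \sum_{k=0}^{\infty} |\pi_k(s-) - P_k(s-)|\, d(\lambda s) + (2 + 4A(t))\max_{0 \le s \le t} |A(s) - \lambda s|.$$

Hence, by Lemma 2, which will be proved in Subsection 4,

$$\sum_{k=0}^{\infty} |P_k(t) - \pi_k(t)| \le e^{2\lambda t}(2 + 4A(t))\max_{0 \le s \le t} |A(s) - \lambda s|, \tag{10}$$

where we recall that $A(t) = \sum_{k=0}^{[nt]} p_k$.
This completes the proof of Lemma 1.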

3. PROOF OF THE THEOREM. Inequality (1) follows immediately from Lemma 1 if we observe that $P_k = P_k(1)$, $\pi_k = \pi_k(1)$, and that the probability $P_k = P\{\eta_1 + \cdots + \eta_n = k\}$ is the same as $P\{\eta_{i_1} + \cdots + \eta_{i_n} = k\}$, where $(i_1, i_2, \ldots, i_n)$ is an arbitrary permutation of $(1, 2, \ldots, n)$.
Moreover, the first inequality in (2) and the estimate (3) follow from (1). We have only to show that

$$\min_i \sup_{0 \le s \le 1}\left|\sum_{k=0}^{[ns]} p_{i_k} - \lambda s\right| \le \max_{1 \le k \le n} p_k, \tag{11}$$

where we may evidently suppose that $\lambda = 1$.
We write $F_i(s) = \sum_{k=0}^{[ns]} p_{i_k}$, $G(s) = s$, $0 \le s \le 1$. These are distribution functions: $F_i(s)$ is a discrete distribution function with masses $p_{i_1}, p_{i_2}, \ldots, p_{i_n}$ at the points $1/n, 2/n, \ldots, 1$; and $G(s)$ is a uniform distribution on $[0, 1]$.
Our object is to find a permutation $i^* = (i_1^*, \ldots, i_n^*)$ such that

$$\sup_{0 \le s \le 1} |F_{i^*}(s) - G(s)| \le \max_{1 \le k \le n} p_k.$$

Since

$$\sup_{0 \le s \le 1} |F_{i^*}(s) - G(s)| = \max_{1 \le k \le n}\left|F_{i^*}\!\left(\frac{k}{n}-\right) - \frac{k}{n}\right|,$$

it is sufficient to study the deviations $F_{i^*}(s-) - G(s)$ only at the points $s = k/n$, $k = 1, \ldots, n$.
We observe that if all the $p_k$ are equal ($p_1 = \cdots = p_n = 1/n$), then

$$\sup_{0 \le s \le 1} |F_i(s) - G(s)| = \frac{1}{n} = \max_{1 \le k \le n} p_k.$$


We shall accordingly suppose that at least one of $p_1, \ldots, p_n$ is not equal to $1/n$. With this assumption, we separate the set of numbers $p_1, \ldots, p_n$ into the union of two (nonempty) sets

$$A = \{p_i : p_i > 1/n\} \quad \text{and} \quad B = \{p_i : p_i \le 1/n\}.$$

To simplify the notation, we write $F^*(s) = F_{i^*}(s)$, $p_k^* = p_{i_k^*}$.
It is clear that $F^*(1) = 1$, $F^*(1 - (1/n)) = 1 - p_n^*$,

$$F^*\!\left(1 - \frac{2}{n}\right) = 1 - (p_n^* + p_{n-1}^*), \quad \ldots, \quad F^*\!\left(\frac{1}{n}\right) = 1 - (p_n^* + \cdots + p_2^*).$$

Hence we see that the distribution $F^*(s)$, $0 \le s \le 1$, can be generated by successively choosing $p_n^*$, then $p_{n-1}^*$, etc.
The following picture indicates the inductive construction of $p_n^*, p_{n-1}^*, \ldots, p_1^*$:

[Figure: the unit square $[0, 1] \times [0, 1]$ with its diagonal and a grid of step $1/n$; starting from the corner $(1, 1)$, a staircase descends by the successive amounts $p_n^*, p_{n-1}^*, \ldots$, with a dotted line through $(1, 1 - p_n^*)$ parallel to the diagonal.]

On the right-hand side of the square $[0, 1] \times [0, 1]$ we start from $(1, 1)$ and descend an amount $p_n^*$, where $p_n^*$ is the largest number (or one of the largest numbers) in $A$, and from the point $(1, 1 - p_n^*)$ we draw a (dotted) line parallel to the diagonal of the square that extends from $(0, 0)$ to $(1, 1)$.
Now we draw a horizontal line segment of length $1/n$ from the point $(1, 1 - p_n^*)$. Its left-hand end will be at the point $(1 - (1/n), 1 - p_n^*)$, from which we descend an amount $p_{n-1}^*$, where $p_{n-1}^*$ is the largest number (or one of the largest numbers) in $B$. Since $p_{n-1}^* \le 1/n$, it is easily seen that this line segment does not intersect the dotted line and therefore $G(1 - (1/n)) - F^*((1 - (1/n))-) \le p_n^*$.
From the point $(1 - (1/n), 1 - p_n^* - p_{n-1}^*)$, we again draw a horizontal line segment of length $1/n$. There are two possibilities: either the left-hand end of this interval $(1 - (2/n), 1 - p_n^* - p_{n-1}^*)$ falls below the diagonal or on it; or the point $(1 - (2/n), 1 - p_n^* - p_{n-1}^*)$ is above the diagonal. In the first case, we descend from this point by a line segment of length $p_{n-2}^*$, where $p_{n-2}^*$ is the largest number (or one of the largest numbers) in the set $B \setminus \{p_{n-1}^*\}$. (This set is not empty, since $p_1^* + \cdots + p_{n-2}^* \le (n - 2)/n$.) Again, it is clear that $G(1 - (2/n)) - F^*((1 - (2/n))-) \le p_n^*$. In the second case we descend by a line segment of length $p_{n-2}^*$, where $p_{n-2}^*$ is the largest number (or one of the largest numbers) in the set $A \setminus \{p_n^*\}$. (This set is not empty since $p_1^* + \cdots + p_{n-2}^* > (n - 2)/n$.) Since $p_{n-2}^* \le p_n^*$, it is clear that in this case

$$|F^*((1 - (2/n))-) - G(1 - (2/n))| \le p_n^*.$$

Continuing in this way, we construct the sequence $p_{n-3}^*, \ldots, p_1^*$.
It is clear from the construction that

$$\left|F^*\!\left(\frac{k}{n}-\right) - G\!\left(\frac{k}{n}\right)\right| \le p_n^*$$

for $1 \le k \le n$.
Since

$$\min_i \sup_{0 \le s \le 1}\left|\sum_{k=0}^{[ns]} p_{i_k} - s\right| \le \sup_{0 \le s \le 1} |F^*(s) - G(s)| \le p_n^*,$$

we have also established the second inequality in (2).

Remark. Let us observe that the lower bound

$$\min_i \sup_{0 \le s \le 1}\left|\sum_{k=0}^{[ns]} p_{i_k} - s\right| \ge \tfrac12 p_n^*$$

is evident.
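
For small $n$ the quantity $\min_i \sup_{0 \le s \le 1} |\sum_{k=0}^{[ns]} p_{i_k} - s|$ can be found by exhaustive search over all permutations, which gives a direct check of the two-sided bound $\tfrac12 \max p_k \le \min_i \sup_s |\cdot| \le \max p_k$ (with $\lambda = 1$). The sketch below is an illustration only: the probabilities in `ps` and the function names are assumptions, and the brute-force search is used in place of the constructive procedure described in the proof.

```python
import itertools

def sup_deviation(order):
    # sup over 0 <= s <= 1 of |F(s) - s|, where F has masses order[0], order[1], ...
    # at the points 1/n, 2/n, ..., 1.
    n = len(order)
    worst, cum = 0.0, 0.0
    for j in range(n):
        # On [j/n, (j+1)/n) the step function F equals cum, so the extreme
        # deviations on this interval occur at its two end points.
        worst = max(worst, abs(cum - j / n), abs(cum - (j + 1) / n))
        cum += order[j]
    return max(worst, abs(cum - 1.0))  # deviation at s = 1

def best_permutation(ps):
    # Exhaustive search; feasible only for small n.
    return min(itertools.permutations(ps), key=sup_deviation)

ps = (0.4, 0.3, 0.15, 0.1, 0.05)   # illustrative masses with total 1
best = best_permutation(ps)
best_value = sup_deviation(best)
print(best, best_value, max(ps))
```

The search also shows that the ordering matters: arranging the masses in decreasing order typically gives a much larger deviation than the best permutation.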
4. Let $A = A(t)$, $t \ge 0$, be a nondecreasing function, continuous on the right, with limits on the left, and with $A(0) = 0$. In accordance with Theorem 12 of §6, Chapter II, the equation

$$Z_t = K + \int_0^t Z_{s-}\, dA(s) \tag{12}$$

has (in the class of locally bounded functions that are continuous on the right and have limits on the left) a unique solution, given by the formula

$$Z_t = K\mathscr{E}_t(A), \tag{13}$$

where

$$\mathscr{E}_t(A) = e^{A(t)}\prod_{0 \le s \le t} (1 + \Delta A(s))e^{-\Delta A(s)}. \tag{14}$$

Let us now suppose that the function $V(t)$, $t \ge 0$, which is locally bounded, continuous on the right, and with limits on the left, satisfies

$$V_t \le K + \int_0^t V_{s-}\, dA(s), \tag{15}$$

where $K$ is a constant, for every $t \ge 0$.

Lemma 2. For every $t \ge 0$,

$$V_t \le K\mathscr{E}_t(A). \tag{16}$$

PROOF. Let $T = \inf\{t \ge 0 : V_t > K\mathscr{E}_t(A)\}$, where $\inf\{\varnothing\} = \infty$. If $T = \infty$, then (16) holds. Suppose $T < \infty$; we show that this leads to a contradiction.
By the definition of $T$,

$$V_T \ge K\mathscr{E}_T(A).$$

From this and (12),

$$V_T \ge K\mathscr{E}_T(A) = K + K\int_0^T \mathscr{E}_{s-}(A)\, dA(s) \ge K + \int_0^T V_{s-}\, dA(s) \ge V_T. \tag{17}$$

If $V_T > K\mathscr{E}_T(A)$, inequality (17) yields $V_T > V_T$, which is impossible, since $|V_T| < \infty$.
Now let $V_T = K\mathscr{E}_T(A)$. Then, by (17),

$$V_T = K + \int_0^T V_{s-}\, dA(s).$$

From the definition of $T$, and the continuity on the right of $V_t$, $K\mathscr{E}_t(A)$ and $A(t)$, it follows that there is an $h > 0$ such that when $T < t \le T + h$ we have

$$V_t > K\mathscr{E}_t(A) \quad \text{and} \quad A_{T+h} - A_T \le \tfrac12.$$

Let us write $\psi_t = V_t - K\mathscr{E}_t(A)$. Then

$$0 < \psi_t \le \int_T^t \psi_{s-}\, dA(s), \qquad T < t \le T + h,$$

and therefore

$$0 \le \sup_{T \le t \le T+h} \psi_t \le \tfrac12 \sup_{T \le t \le T+h} \psi_t.$$

Hence it follows that $\psi_t = 0$ for $T \le t \le T + h$, which contradicts the assumption that $T < \infty$.

Corollary (Gronwall-Bellman Lemma). In (15) let $A(t) = \int_0^t a(s)\, ds$ and $K \ge 0$. Then

$$V_t \le K\exp\left\{\int_0^t a(s)\, ds\right\}. \tag{18}$$
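
When $A$ is a pure-jump function, the solution of $Z_t = K + \int_0^t Z_{s-}\, dA(s)$ can be computed jump by jump and compared with the formula $K e^{A(t)}\prod_{s \le t}(1 + \Delta A(s))e^{-\Delta A(s)}$, which in this case reduces to $K\prod_{s \le t}(1 + \Delta A(s))$. The following sketch does this for an arbitrary list of jump sizes; the jump values, the constant $K = 2$ and the function names are assumptions made for the illustration.

```python
import math

def stochastic_exponential(jumps):
    # e^{A(t)} * prod (1 + dA) e^{-dA} for a pure-jump A whose jumps up to
    # time t are listed in `jumps`, so that A(t) = sum(jumps).
    A_t = sum(jumps)
    prod = 1.0
    for dA in jumps:
        prod *= (1.0 + dA) * math.exp(-dA)
    return math.exp(A_t) * prod

def solve_by_recursion(K, jumps):
    # Solve Z_t = K + int_0^t Z_{s-} dA(s) directly: at each jump the integral
    # increases by (left limit of Z) * (jump of A).
    Z, integral = K, 0.0
    for dA in jumps:
        integral += Z * dA      # Z is still the left limit Z_{s-} here
        Z = K + integral
    return Z

K = 2.0
jumps = [0.1, 0.25, 0.05, 0.4]
direct = solve_by_recursion(K, jumps)
formula = K * stochastic_exponential(jumps)
print(direct, formula)  # both should equal K * 1.1 * 1.25 * 1.05 * 1.4
```

For a continuous $A$ the product is empty and the formula gives $K e^{A(t)}$, which is the form used in the Gronwall-Bellman corollary and, with $A(s) = 2\lambda s$, the source of the factor $e^{2\lambda t}$ in Lemma 1.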

5. PROBLEMS

1. Prove formula (4).

2. Let $A = A(t)$, $B = B(t)$, $t \ge 0$, be functions of locally bounded variation, continuous on the right and with limits on the left. Let $A(0) = B(0) = 0$ and $\Delta A(t) > -1$, $t \ge 0$. Show that the equation

$$Z_t = \int_0^t Z_{s-}\, dA(s) + B(t)$$

has a unique solution $\mathscr{E}_t(A, B)$ of locally bounded variation, given by

3. Let $\xi$ and $\eta$ be random variables taking the values $0, 1, \ldots$. Let

$$\rho(\xi, \eta) = \sup |P(\xi \in A) - P(\eta \in A)|,$$

where sup is taken over all subsets of $\{0, 1, \ldots\}$. Prove that
(1) $\rho(\xi, \eta) = \tfrac12\sum_{k=0}^{\infty} |P(\xi = k) - P(\eta = k)|$;
(2) $\rho(\xi, \eta) \le \rho(\xi, \zeta) + \rho(\zeta, \eta)$;
(3) if $\zeta$ is independent of $(\xi, \eta)$, then $\rho(\xi + \zeta, \eta + \zeta) \le \rho(\xi, \eta)$;
(4) if the vectors $(\xi_1, \ldots, \xi_n)$ and $(\eta_1, \ldots, \eta_n)$ are independent, then

$$\rho\!\left(\sum_{i=1}^{n} \xi_i, \sum_{i=1}^{n} \eta_i\right) \le \sum_{i=1}^{n} \rho(\xi_i, \eta_i).$$

4. Let $\xi = \xi(p)$ be a Bernoulli random variable with $P(\xi = 1) = p$, $P(\xi = 0) = 1 - p$, $0 < p < 1$; and $\pi = \pi(p)$ a Poisson random variable with $E\pi = p$. Show that

$$\rho(\xi(p), \pi(p)) = p(1 - e^{-p}) \le p^2.$$

5. Let $\xi = \xi(p)$ be a Bernoulli random variable with $P(\xi = 1) = p$, $P(\xi = 0) = 1 - p$, $0 < p < 1$, and let $\pi = \pi(\lambda)$ be a Poisson random variable such that $P(\xi = 0) = P(\pi = 0)$. Show that $\lambda = -\ln(1 - p)$ and

$$\rho(\xi(p), \pi(\lambda)) = 1 - e^{-\lambda} - \lambda e^{-\lambda} \le \tfrac12\lambda^2.$$

6. Show, using Property (4) of Problem 3, and the conclusions of Problems 4 and 5, that if $\xi_1 = \xi_1(p_1), \ldots, \xi_n = \xi_n(p_n)$ are independent Bernoulli random variables, $0 < p_i < 1$, and $\lambda_i = -\ln(1 - p_i)$, $1 \le i \le n$, then

and

CHAPTER IV

Sequences and Sums of Independent Random Variables

1. Zero-or-One Laws

1. The series $\sum_{n=1}^{\infty} (1/n)$ diverges and the series $\sum_{n=1}^{\infty} (-1)^n(1/n)$ converges. We ask the following question. What can we say about the convergence or divergence of a series $\sum_{n=1}^{\infty} (\xi_n/n)$, where $\xi_1, \xi_2, \ldots$ is a sequence of independent identically distributed Bernoulli random variables with $P(\xi_1 = +1) = P(\xi_1 = -1) = \tfrac12$? In other words, what can be said about the convergence of a series whose general term is $\pm 1/n$, where the signs are chosen in a random manner, according to the sequence $\xi_1, \xi_2, \ldots$?
Let

$$A_1 = \left\{\omega : \sum_{n=1}^{\infty} \frac{\xi_n}{n} \text{ converges}\right\}$$

be the set of sample points for which $\sum_{n=1}^{\infty} (\xi_n/n)$ converges (to a finite number) and consider the probability $P(A_1)$ of this set. It is far from clear, to begin with, what values this probability might have. However, it is a remarkable fact that we are able to say that the probability can have only two values, 0 or 1. This is a corollary of Kolmogorov's "zero-one law," whose statement and proof form the main content of the present section.
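
A simulation cannot decide between the values 0 and 1, but it can suggest which one to expect. The sketch below generates one path of the partial sums of $\sum \pm 1/n$ with random signs and records the value at two checkpoints; the seed, the length of the path and the function name are assumptions made here. The observation that the tail contributions are small is consistent with the known fact (not established at this point of the text) that such a series converges with probability 1, the variances $1/n^2$ being summable.

```python
import random

def signed_harmonic_partial_sums(N, checkpoints, seed):
    # Partial sums of sum_{n <= N} xi_n / n with independent signs xi_n = +1 or -1,
    # each with probability 1/2; returns the values at the requested checkpoints.
    rng = random.Random(seed)
    total, out = 0.0, {}
    for n in range(1, N + 1):
        total += (1.0 if rng.random() < 0.5 else -1.0) / n
        if n in checkpoints:
            out[n] = total
    return out

sums = signed_harmonic_partial_sums(20000, {10000, 20000}, seed=7)
tail_change = abs(sums[20000] - sums[10000])
print(sums, tail_change)  # the change over the last 10000 terms is small
```

The standard deviation of the change between the two checkpoints is about $\sqrt{1/10000 - 1/20000} \approx 0.007$, so a change of even $0.1$ would be extremely unlikely for any seed.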
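2. Let $(\Omega, \mathscr{F}, P)$ be a probability space, and let $\xi_1, \xi_2, \ldots$ be a sequence of random variables. Let $\mathscr{F}_n^{\infty} = \sigma(\xi_n, \xi_{n+1}, \ldots)$ be the $\sigma$-algebra generated by $\xi_n, \xi_{n+1}, \ldots$, and write

$$\mathscr{X} = \bigcap_{n=1}^{\infty} \mathscr{F}_n^{\infty}.$$

Since an intersection of $\sigma$-algebras is again a $\sigma$-algebra, $\mathscr{X}$ is a $\sigma$-algebra. It is called a tail algebra (or terminal or asymptotic algebra), because every event $A \in \mathscr{X}$ is independent of the values of $\xi_1, \ldots, \xi_n$ for every finite number $n$, and is determined, so to speak, only by the behavior of the infinitely remote values of $\xi_1, \xi_2, \ldots$.
Since, for every $k \ge 1$,

$$A_1 = \left\{\sum_{n=1}^{\infty} \frac{\xi_n}{n} \text{ converges}\right\} = \left\{\sum_{n=k}^{\infty} \frac{\xi_n}{n} \text{ converges}\right\} \in \mathscr{F}_k^{\infty},$$

we have $A_1 \in \bigcap_k \mathscr{F}_k^{\infty} = \mathscr{X}$. In the same way, if $\xi_1, \xi_2, \ldots$ is any sequence,

$$A_2 = \left\{\sum_{n=1}^{\infty} \xi_n \text{ converges}\right\} \in \mathscr{X}.$$
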
The following events are also tail events:

$$A_3 = \{\xi_n \in I_n \text{ for infinitely many } n\},$$

where $I_n \in \mathscr{B}(R)$, $n \ge 1$;

$$A_4 = \left\{\varlimsup_n \xi_n < \infty\right\};$$

$$A_5 = \left\{\varlimsup_n \frac{\xi_1 + \cdots + \xi_n}{n} < \infty\right\};$$

$$A_6 = \left\{\varlimsup_n \frac{\xi_1 + \cdots + \xi_n}{n} < c\right\};$$

$$A_7 = \left\{\frac{S_n}{n} \text{ converges}\right\};$$

$$A_8 = \left\{\varlimsup_n \frac{S_n}{\sqrt{2n\log n}} = 1\right\}.$$

On the other hand,

$$B_1 = \{\xi_n = 0 \text{ for all } n \ge 1\},$$

$$B_2 = \left\{\lim_n (\xi_1 + \cdots + \xi_n) \text{ exists and is less than } c\right\}$$

are examples of events that do not belong to $\mathscr{X}$.
Let us now suppose that our random variables are independent. Then by the Borel-Cantelli lemma it follows that

$$P(A_3) = 0 \iff \sum P(\xi_n \in I_n) < \infty,$$

$$P(A_3) = 1 \iff \sum P(\xi_n \in I_n) = \infty.$$

Therefore the probability of $A_3$ can take only the values 0 or 1 according to the convergence or divergence of $\sum P(\xi_n \in I_n)$. This is Borel's zero-one law.


Theorem 1 (Kolmogorov's Zero-One Law). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables and let $A \in \mathscr{X}$. Then $P(A)$ can only have one of the values zero or one.

PROOF. The idea of the proof is to show that every tail event $A$ is independent of itself and therefore $P(A \cap A) = P(A) \cdot P(A)$, i.e. $P(A) = P^2(A)$, so that $P(A) = 0$ or 1.
If $A \in \mathscr{X}$ then $A \in \mathscr{F}_1^{\infty} = \sigma\{\xi_1, \xi_2, \ldots\} = \sigma(\bigcup_n \mathscr{F}_1^n)$, where $\mathscr{F}_1^n = \sigma\{\xi_1, \ldots, \xi_n\}$, and we can find (Problem 8, §3, Chapter II) sets $A_n \in \mathscr{F}_1^n$, $n \ge 1$, such that $P(A \,\triangle\, A_n) \to 0$, $n \to \infty$. Hence

$$P(A_n) \to P(A), \qquad P(A_n \cap A) \to P(A). \tag{1}$$

But if $A \in \mathscr{X}$, then $A \in \mathscr{F}_{n+1}^{\infty}$, and since $\mathscr{F}_{n+1}^{\infty}$ is independent of $\mathscr{F}_1^n$, the events $A_n$ and $A$ are independent for every $n \ge 1$, i.e. $P(A_n \cap A) = P(A_n)P(A)$. Hence it follows from (1) that $P(A) = P^2(A)$ and therefore $P(A) = 0$ or 1.
This completes the proof of the theorem.

Corollary. Let $\eta$ be a random variable that is measurable with respect to the tail $\sigma$-algebra $\mathscr{X}$, i.e. $\{\eta \in B\} \in \mathscr{X}$, $B \in \mathscr{B}(R)$. Then $\eta$ is degenerate, i.e. there is a constant $c$ such that $P(\eta = c) = 1$.
3. Theorem 2 below provides an example of a nontrivial application of Kolmogorov's zero-one law.
Let $\xi_1, \xi_2, \ldots$ be a sequence of independent Bernoulli random variables with $P(\xi_n = 1) = p$, $P(\xi_n = -1) = q$, $p + q = 1$, $n \ge 1$, and let $S_n = \xi_1 + \cdots + \xi_n$. It seems intuitively clear that in the symmetric case ($p = \tfrac12$) a "typical" path of the random walk $S_n$, $n \ge 1$, will cross zero infinitely often, whereas when $p \ne \tfrac12$ it will go off to infinity. Let us give a precise formulation.

Theorem 2. (a) If $p = \tfrac12$ then $P(S_n = 0 \text{ i.o.}) = 1$.
(b) If $p \ne \tfrac12$, then $P(S_n = 0 \text{ i.o.}) = 0$.

PROOF. We first observe that the event $B = (S_n = 0 \text{ i.o.})$ is not a tail event, i.e. $B \notin \mathscr{X} = \bigcap \mathscr{F}_n^{\infty}$, $\mathscr{F}_n^{\infty} = \sigma\{\xi_n, \xi_{n+1}, \ldots\}$. Consequently it is, in principle, not clear that $P(B)$ should have only the values 0 or 1.
Statement (b) is easily proved by applying (the first part of) the Borel-Cantelli lemma. In fact, if $B_{2n} = \{S_{2n} = 0\}$, then by Stirling's formula

$$P(B_{2n}) = C_{2n}^{n} p^n q^n \sim \frac{(4pq)^n}{\sqrt{\pi n}}$$

and therefore $\sum P(B_{2n}) < \infty$. Consequently $P(S_n = 0 \text{ i.o.}) = 0$.
To prove (a), it is enough to prove that the event

$$A = \left\{\varlimsup_n \frac{S_n}{\sqrt{n}} = \infty, \ \varliminf_n \frac{S_n}{\sqrt{n}} = -\infty\right\}$$

has probability 1 (since $A \subseteq B$).
Let

$$A_c = \left\{\varlimsup_n \frac{S_n}{\sqrt{n}} > c\right\} \cap \left\{\varliminf_n \frac{S_n}{\sqrt{n}} < -c\right\} \quad (= A_c' \cap A_c'').$$

Then $A_c \downarrow A$, $c \to \infty$, and all the events $A$, $A_c$, $A_c'$, $A_c''$ are tail events. Let us show that $P(A_c') = P(A_c'') = 1$ for each $c > 0$. Since $A_c' \in \mathscr{X}$ and $A_c'' \in \mathscr{X}$, it is sufficient to show only that $P(A_c') > 0$, $P(A_c'') > 0$. But by Problem 5

$$P\!\left(\varliminf_n \frac{S_n}{\sqrt{n}} < -c\right) = P\!\left(\varlimsup_n \frac{S_n}{\sqrt{n}} > c\right) \ge \varlimsup_n P\!\left(\frac{S_n}{\sqrt{n}} > c\right) > 0,$$

where the last inequality follows from the De Moivre-Laplace theorem.
Thus $P(A_c) = 1$ for all $c > 0$ and therefore $P(A) = \lim_{c \to \infty} P(A_c) = 1$.
This completes the proof of the theorem.
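
The dichotomy in the Borel-Cantelli step can be seen numerically from the partial sums of $\sum_n P(S_{2n} = 0) = \sum_n C_{2n}^{n}(pq)^n$. The sketch below computes them by a term recursion; the function name and the choices $p = 0.6$, $N = 2000$ are assumptions made for the illustration. For $p \ne \tfrac12$ the series converges (the generating function $\sum_{n \ge 0} C_{2n}^{n}x^n = (1 - 4x)^{-1/2}$ gives the sum $1/|p - q| - 1$), while for $p = \tfrac12$ the partial sums grow like $2\sqrt{N/\pi}$.

```python
def return_prob_partial_sum(p, N):
    # sum_{n=1}^{N} P(S_{2n} = 0) = sum_{n=1}^{N} C(2n, n) (p q)^n,
    # using C(2n, n) = C(2n - 2, n - 1) * 2(2n - 1)/n to avoid huge integers.
    x = p * (1.0 - p)
    term, total = 1.0, 0.0
    for n in range(1, N + 1):
        term *= 2.0 * (2 * n - 1) / n * x
        total += term
    return total

asymmetric = return_prob_partial_sum(0.6, 2000)   # should be close to 1/0.2 - 1 = 4
symmetric = return_prob_partial_sum(0.5, 2000)    # keeps growing with N
print(asymmetric, symmetric)
```

Divergence of the series in the symmetric case does not by itself give $P(S_n = 0 \text{ i.o.}) = 1$, because the events $\{S_{2n} = 0\}$ are not independent; this is why part (a) is proved through the tail events $A_c$ instead.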
4. Let us observe again that $B = \{S_n = 0 \text{ i.o.}\}$ is not a tail event. Nevertheless, it follows from Theorem 2 that, for a Bernoulli scheme, the probability of this event, just as for tail events, takes only the values 0 and 1. This phenomenon is not accidental: it is a corollary of the Hewitt-Savage zero-one law, which for independent identically distributed random variables extends the result of Theorem 1 to the class of "symmetric" events (which includes the class of tail events).
Let us give the essential definitions. A one-to-one mapping $\pi = (\pi_1, \pi_2, \ldots)$ of the set $(1, 2, \ldots)$ on itself is said to be a finite permutation if $\pi_n = n$ for every $n$ with a finite number of exceptions.
If $\xi = (\xi_1, \xi_2, \ldots)$ is a sequence of random variables, $\pi(\xi)$ denotes the sequence $(\xi_{\pi_1}, \xi_{\pi_2}, \ldots)$. If $A$ is the event $\{\xi \in B\}$, $B \in \mathscr{B}(R^{\infty})$, then $\pi(A)$ denotes the event $\{\pi(\xi) \in B\}$, $B \in \mathscr{B}(R^{\infty})$.
We call an event $A = \{\xi \in B\}$, $B \in \mathscr{B}(R^{\infty})$, symmetric if $\pi(A)$ coincides with $A$ for every finite permutation $\pi$.
An example of a symmetric event is $A = \{S_n = 0 \text{ i.o.}\}$, where $S_n = \xi_1 + \cdots + \xi_n$. Moreover, we may suppose (Problem 4) that every event in the tail $\sigma$-algebra $\mathscr{X}(S) = \bigcap \mathscr{F}_n^{\infty}(S)$, $\mathscr{F}_n^{\infty}(S) = \sigma\{\omega : S_n, S_{n+1}, \ldots\}$, generated by $S_1 = \xi_1$, $S_2 = \xi_1 + \xi_2, \ldots$ is symmetric.

Theorem 3 (Hewitt-Savage Zero-One Law). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables, and

$$A = \{\omega: (\xi_1, \xi_2, \ldots) \in B\}$$

a symmetric event. Then $P(A) = 0$ or 1.

PROOF. Let $A = \{\xi \in B\}$ be a symmetric event. Choose sets $B_n \in \mathscr{B}(R^n)$ such that, for $A_n = \{\omega: (\xi_1, \ldots, \xi_n) \in B_n\}$,

$$P(A \triangle A_n) \to 0, \qquad n \to \infty. \tag{2}$$


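Let $\pi_n$ denote the finite permutation that interchanges the blocks $(1, \ldots, n)$ and $(n+1, \ldots, 2n)$ and leaves all other indices fixed, so that $\pi_n(A_n) = \{(\xi_{n+1}, \ldots, \xi_{2n}) \in B_n\}$. Since the random variables $\xi_1, \xi_2, \ldots$ are independent and identically distributed, the probability distributions $P_\xi(B) = P(\xi \in B)$ and $P_{\pi_n(\xi)}(B) = P(\pi_n(\xi) \in B)$ coincide. Therefore

$$P(A \triangle A_n) = P_\xi(B \triangle B_n) = P_{\pi_n(\xi)}(B \triangle B_n). \tag{3}$$

Since $A$ is symmetric, we have

$$A \equiv \{\xi \in B\} = \pi_n(A) \equiv \{\pi_n(\xi) \in B\}.$$

Therefore

$$P_{\pi_n(\xi)}(B \triangle B_n) = P\{(\pi_n(\xi) \in B) \triangle (\pi_n(\xi) \in B_n)\} = P\{(\xi \in B) \triangle (\pi_n(\xi) \in B_n)\} = P\{A \triangle \pi_n(A_n)\}. \tag{4}$$

Hence, by (3) and (4),

$$P(A \triangle A_n) = P(A \triangle \pi_n(A_n)). \tag{5}$$

It then follows from (2) that

$$P(A \triangle (A_n \cap \pi_n(A_n))) \to 0, \qquad n \to \infty. \tag{6}$$

Hence, by (2), (5) and (6), we obtain

$$P(A_n) \to P(A), \qquad P(\pi_n(A_n)) \to P(A), \qquad P(A_n \cap \pi_n(A_n)) \to P(A). \tag{7}$$

Moreover, since $\xi_1, \xi_2, \ldots$ are independent,

$$P(A_n \cap \pi_n(A_n)) = P\{(\xi_1, \ldots, \xi_n) \in B_n,\ (\xi_{n+1}, \ldots, \xi_{2n}) \in B_n\} = P\{(\xi_1, \ldots, \xi_n) \in B_n\}\, P\{(\xi_{n+1}, \ldots, \xi_{2n}) \in B_n\} = P(A_n) P(\pi_n(A_n)),$$

whence by (7)

$$P(A) = P^2(A)$$

and therefore $P(A) = 0$ or 1.

This completes the proof of the theorem.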

5. PROBLEMS

1. Prove the corollary to Theorem 1.

2. Show that if $(\xi_n)$ is a sequence of independent random variables, the random variables $\limsup \xi_n$ and $\liminf \xi_n$ are degenerate.

3. Let $(\xi_n)$ be a sequence of independent random variables, $S_n = \xi_1 + \cdots + \xi_n$, and let the constants $b_n$ satisfy $0 < b_n \uparrow \infty$. Show that the random variables $\limsup (S_n/b_n)$ and $\liminf (S_n/b_n)$ are degenerate.

4. Let $S_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$, and $\mathscr{X}(S) = \bigcap_n \mathscr{F}_n^\infty(S)$, $\mathscr{F}_n^\infty(S) = \sigma\{\omega: S_n, S_{n+1}, \ldots\}$. Show that every event in $\mathscr{X}(S)$ is symmetric.

5. Let $(\xi_n)$ be a sequence of random variables. Show that $\{\limsup \xi_n > c\} \supseteq \limsup \{\xi_n > c\}$ for each $c > 0$.


2. Convergence of Series
1. Let us suppose that $\xi_1, \xi_2, \ldots$ is a sequence of independent random variables, $S_n = \xi_1 + \cdots + \xi_n$, and let $A$ be the set of sample points $\omega$ for which $\sum \xi_n(\omega)$ converges to a finite limit. It follows from Kolmogorov's zero-one law that $P(A) = 0$ or 1, i.e. the series $\sum \xi_n$ converges or diverges with probability 1. The object of the present section is to give criteria that will determine whether a sum of independent random variables converges or diverges.

Theorem 1 (Kolmogorov and Khinchin).

(a) Let $\mathsf{E}\xi_n = 0$, $n \ge 1$. Then if

$$\sum \mathsf{E}\xi_n^2 < \infty, \tag{1}$$

the series $\sum \xi_n$ converges with probability 1.

(b) If the random variables $\xi_n$, $n \ge 1$, are uniformly bounded (i.e., $P(|\xi_n| \le c) = 1$, $c < \infty$), the converse is true: the convergence of $\sum \xi_n$ with probability 1 implies (1).

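The proof depends on

Kolmogorov's Inequality
(a) Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with $\mathsf{E}\xi_i = 0$, $\mathsf{E}\xi_i^2 < \infty$, $i \le n$. Then for every $\varepsilon > 0$

$$P\Bigl\{\max_{1 \le k \le n} |S_k| \ge \varepsilon\Bigr\} \le \frac{\mathsf{E}S_n^2}{\varepsilon^2}. \tag{2}$$

(b) If also $P(|\xi_i| \le c) = 1$, $i \le n$, then

$$P\Bigl\{\max_{1 \le k \le n} |S_k| \ge \varepsilon\Bigr\} \ge 1 - \frac{(c + \varepsilon)^2}{\mathsf{E}S_n^2}. \tag{3}$$

PROOF. (a) Put

$$A = \{\max |S_k| \ge \varepsilon\}, \qquad A_k = \{|S_i| < \varepsilon,\ i = 1, \ldots, k-1,\ |S_k| \ge \varepsilon\}, \qquad 1 \le k \le n.$$

Then $A = \sum A_k$ and

$$\mathsf{E}S_n^2 \ge \mathsf{E}S_n^2 I_A = \sum \mathsf{E}S_n^2 I_{A_k}.$$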


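But

$$\mathsf{E}S_n^2 I_{A_k} = \mathsf{E}(S_k + (\xi_{k+1} + \cdots + \xi_n))^2 I_{A_k} = \mathsf{E}S_k^2 I_{A_k} + 2\mathsf{E}S_k(\xi_{k+1} + \cdots + \xi_n) I_{A_k} + \mathsf{E}(\xi_{k+1} + \cdots + \xi_n)^2 I_{A_k} \ge \mathsf{E}S_k^2 I_{A_k},$$

since

$$\mathsf{E}S_k(\xi_{k+1} + \cdots + \xi_n) I_{A_k} = \mathsf{E}S_k I_{A_k} \cdot \mathsf{E}(\xi_{k+1} + \cdots + \xi_n) = 0$$

because of independence and the conditions $\mathsf{E}\xi_i = 0$, $i \le n$. Hence

$$\mathsf{E}S_n^2 \ge \sum \mathsf{E}S_k^2 I_{A_k} \ge \varepsilon^2 \sum P(A_k) = \varepsilon^2 P(A),$$

which proves the first inequality.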


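(b) To prove (3), we observe that

$$\mathsf{E}S_n^2 I_A = \mathsf{E}S_n^2 - \mathsf{E}S_n^2 I_{\bar A} \ge \mathsf{E}S_n^2 - \varepsilon^2 P(\bar A) = \mathsf{E}S_n^2 - \varepsilon^2 + \varepsilon^2 P(A). \tag{4}$$

On the other hand, on the set $A_k$

$$|S_{k-1}| \le \varepsilon, \qquad |S_k| \le |S_{k-1}| + |\xi_k| \le \varepsilon + c,$$

and therefore

$$\mathsf{E}S_n^2 I_A = \sum_k \mathsf{E}S_k^2 I_{A_k} + \sum_k \mathsf{E}(I_{A_k}(S_n - S_k)^2) \le (\varepsilon + c)^2 \sum_k P(A_k) + \sum_{k=1}^n P(A_k) \sum_{j=k+1}^n \mathsf{E}\xi_j^2 \le P(A)\Bigl[(\varepsilon + c)^2 + \sum_{j=1}^n \mathsf{E}\xi_j^2\Bigr] = P(A)[(\varepsilon + c)^2 + \mathsf{E}S_n^2]. \tag{5}$$

From (4) and (5) we obtain

$$P(A) \ge \frac{\mathsf{E}S_n^2 - \varepsilon^2}{(\varepsilon + c)^2 + \mathsf{E}S_n^2 - \varepsilon^2} = 1 - \frac{(\varepsilon + c)^2}{(\varepsilon + c)^2 + \mathsf{E}S_n^2 - \varepsilon^2} \ge 1 - \frac{(\varepsilon + c)^2}{\mathsf{E}S_n^2}.$$

This completes the proof of (3).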
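
Both bounds in Kolmogorov's inequality can be verified exactly in a small case. The sketch below is an illustration only: it assumes the symmetric walk with steps $\pm 1$ (so $c = 1$ and $\mathsf{E}S_n^2 = n$), enumerates all $2^n$ paths, and uses hypothetical function names.

```python
from itertools import product


def max_abs_exceed_prob(n: int, eps: float) -> float:
    """Exact P(max_{k<=n} |S_k| >= eps) for the symmetric +-1 walk (all 2^n paths)."""
    hits = 0
    for steps in product((-1, 1), repeat=n):
        s = 0
        for x in steps:
            s += x
            if abs(s) >= eps:
                hits += 1
                break
    return hits / 2 ** n


def kolmogorov_bounds(n: int, eps: float, c: float = 1.0):
    """Lower bound (3) and upper bound (2), using E S_n^2 = n for +-1 steps."""
    second_moment = float(n)
    lower = 1.0 - (c + eps) ** 2 / second_moment
    upper = second_moment / eps ** 2
    return lower, upper


if __name__ == "__main__":
    n = 12
    for eps in (2, 3, 4, 5):
        lower, upper = kolmogorov_bounds(n, eps)
        print(eps, round(lower, 4), max_abs_exceed_prob(n, eps), round(upper, 4))
```

For small $n$ the bounds are often trivial (the lower one negative, the upper one above 1); they become informative when $\varepsilon^2$ is comparable to $\mathsf{E}S_n^2$.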


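PROOF OF THEOREM 1. (a) By Theorem 4 of §10, Chapter II, the sequence $(S_n)$, $n \ge 1$, converges with probability 1 if and only if it is fundamental with probability 1. By Theorem 1 of §10, Chapter II, the sequence $(S_n)$, $n \ge 1$, is fundamental (P-a.s.) if and only if

$$P\Bigl\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} \to 0, \qquad n \to \infty. \tag{6}$$

By (2),

$$P\Bigl\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} = \lim_{N \to \infty} P\Bigl\{\max_{1 \le k \le N} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} \le \lim_{N \to \infty} \frac{\sum_{k=n}^{n+N} \mathsf{E}\xi_k^2}{\varepsilon^2} = \frac{\sum_{k=n}^{\infty} \mathsf{E}\xi_k^2}{\varepsilon^2}.$$

Therefore (6) is satisfied if $\sum_{k=1}^\infty \mathsf{E}\xi_k^2 < \infty$, and consequently $\sum \xi_k$ converges with probability 1.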


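(b) Let $\sum \xi_k$ converge. Then, by (6), for sufficiently large $n$,

$$P\Bigl\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} < \tfrac12. \tag{7}$$

By (3),

$$P\Bigl\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} \ge 1 - \frac{(c + \varepsilon)^2}{\sum_{k=n}^\infty \mathsf{E}\xi_k^2}.$$

Therefore if we suppose that $\sum_{k=1}^\infty \mathsf{E}\xi_k^2 = \infty$, we obtain

$$P\Bigl\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Bigr\} = 1,$$

which contradicts (7).

This completes the proof of the theorem.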
EXAMPLE. If $\xi_1, \xi_2, \ldots$ is a sequence of independent Bernoulli random variables with $P(\xi_n = +1) = P(\xi_n = -1) = \tfrac12$, then the series $\sum \xi_n a_n$, with $|a_n| \le c$, converges with probability 1 if and only if $\sum a_n^2 < \infty$.
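
A simulation can make the dichotomy in this example visible. The sketch below is only an illustration under stated assumptions: the coefficient choices $a_n = 1/n$ (square summable) and $a_n = 1/\sqrt n$ (not square summable), the seed, and the function names are all hypothetical, and a finite run can suggest but not prove convergence.

```python
import random


def random_sign_series(a, n_terms, seed):
    """Partial sums of sum_n xi_n * a(n) with independent signs xi_n = +-1."""
    rng = random.Random(seed)
    total, sums = 0.0, []
    for n in range(1, n_terms + 1):
        total += rng.choice((-1.0, 1.0)) * a(n)
        sums.append(total)
    return sums


def tail_oscillation(sums, start):
    """Spread (max - min) of the partial sums from index `start` onward."""
    tail = sums[start:]
    return max(tail) - min(tail)


if __name__ == "__main__":
    square_summable = random_sign_series(lambda n: 1.0 / n, 100_000, seed=1)
    not_summable = random_sign_series(lambda n: n ** -0.5, 100_000, seed=1)
    print("a_n = 1/n      :", tail_oscillation(square_summable, 10_000))
    print("a_n = 1/sqrt(n):", tail_oscillation(not_summable, 10_000))
```

With $a_n = 1/n$ the partial sums beyond $n = 10^4$ should move very little, while with $a_n = 1/\sqrt n$ they continue to wander.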

2. Theorem 2 (Two-Series Theorem). A sufficient condition for the convergence of the series $\sum \xi_n$ of independent random variables, with probability 1, is that both series $\sum \mathsf{E}\xi_n$ and $\sum \mathsf{V}\xi_n$ converge. If $P(|\xi_n| \le c) = 1$, the condition is also necessary.

PROOF. If $\sum \mathsf{V}\xi_n < \infty$, then by Theorem 1 the series $\sum (\xi_n - \mathsf{E}\xi_n)$ converges (P-a.s.). But by hypothesis the series $\sum \mathsf{E}\xi_n$ converges; hence $\sum \xi_n$ converges (P-a.s.).
To prove the necessity we use the following symmetrization method. In addition to the sequence $\xi_1, \xi_2, \ldots$ we consider a different sequence $\tilde\xi_1, \tilde\xi_2, \ldots$ of independent random variables, independent of the original sequence, such that $\tilde\xi_n$ has the same distribution as $\xi_n$, $n \ge 1$. (When the original sample space is sufficiently rich, the existence of such a sequence follows from Theorem 1 of §9, Chapter II. We can also show that this assumption involves no loss of generality.)
Then if $\sum \xi_n$ converges (P-a.s.), the series $\sum \tilde\xi_n$ also converges, and hence so does $\sum (\xi_n - \tilde\xi_n)$. But $\mathsf{E}(\xi_n - \tilde\xi_n) = 0$ and $P(|\xi_n - \tilde\xi_n| \le 2c) = 1$. Therefore $\sum \mathsf{V}(\xi_n - \tilde\xi_n) < \infty$ by Theorem 1. In addition,

$$\sum \mathsf{V}\xi_n = \tfrac12 \sum \mathsf{V}(\xi_n - \tilde\xi_n) < \infty.$$

Consequently, by Theorem 1, $\sum (\xi_n - \mathsf{E}\xi_n)$ converges with probability 1, and therefore $\sum \mathsf{E}\xi_n$ converges.
Thus if $\sum \xi_n$ converges (P-a.s.) (and $P(|\xi_n| \le c) = 1$, $n \ge 1$) it follows that both $\sum \mathsf{E}\xi_n$ and $\sum \mathsf{V}\xi_n$ converge.
This completes the proof of the theorem.

3. The following theorem provides a necessary and sufficient condition for the convergence of $\sum \xi_n$ without any boundedness condition on the random variables.


Let $c$ be a constant and

$$\xi^c = \begin{cases} \xi, & |\xi| \le c, \\ 0, & |\xi| > c. \end{cases}$$

Theorem 3 (Kolmogorov's Three-Series Theorem). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables. A necessary condition for the convergence of $\sum \xi_n$ with probability 1 is that the series

$$\sum \mathsf{E}\xi_n^c, \qquad \sum \mathsf{V}\xi_n^c, \qquad \sum P(|\xi_n| \ge c)$$

converge for every $c > 0$; a sufficient condition is that these series converge for some $c > 0$.

PROOF. Sufficiency. By the two-series theorem, $\sum \xi_n^c$ converges with probability 1. But if $\sum P(|\xi_n| \ge c) < \infty$, then by the Borel-Cantelli lemma we have $\sum I(|\xi_n| \ge c) < \infty$ with probability 1. Therefore $\xi_n = \xi_n^c$ for all $n$ with at most finitely many exceptions. Therefore $\sum \xi_n$ also converges (P-a.s.).
Necessity. If $\sum \xi_n$ converges (P-a.s.) then $\xi_n \to 0$ (P-a.s.), and therefore, for every $c > 0$, at most a finite number of the events $\{|\xi_n| \ge c\}$ can occur (P-a.s.). Therefore $\sum I(|\xi_n| \ge c) < \infty$ (P-a.s.), and, by the second part of the Borel-Cantelli lemma, $\sum P(|\xi_n| \ge c) < \infty$. Moreover, the convergence of $\sum \xi_n$ implies the convergence of $\sum \xi_n^c$. Therefore, by the two-series theorem, both of the series $\sum \mathsf{E}\xi_n^c$ and $\sum \mathsf{V}\xi_n^c$ converge.
This completes the proof of the theorem.

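Corollary. Let $\xi_1, \xi_2, \ldots$ be independent variables with $\mathsf{E}\xi_n = 0$. Then if

$$\sum \mathsf{E}\frac{\xi_n^2}{1 + |\xi_n|} < \infty,$$

the series $\sum \xi_n$ converges with probability 1.

For the proof we observe that

$$\sum \mathsf{E}\frac{\xi_n^2}{1 + |\xi_n|} < \infty \iff \sum \mathsf{E}[\xi_n^2 I(|\xi_n| \le 1) + |\xi_n| I(|\xi_n| > 1)] < \infty.$$

Therefore if $\xi_n^1 = \xi_n I(|\xi_n| \le 1)$, we have

$$\sum \mathsf{E}(\xi_n^1)^2 < \infty.$$

Since $\mathsf{E}\xi_n = 0$, we have

$$\sum |\mathsf{E}\xi_n^1| = \sum |\mathsf{E}\xi_n I(|\xi_n| \le 1)| = \sum |\mathsf{E}\xi_n I(|\xi_n| > 1)| \le \sum \mathsf{E}|\xi_n| I(|\xi_n| > 1) < \infty.$$

Therefore both $\sum \mathsf{E}\xi_n^1$ and $\sum \mathsf{V}\xi_n^1$ converge. Moreover, by Chebyshev's inequality,

$$P\{|\xi_n| > 1\} = P\{|\xi_n| I(|\xi_n| > 1) > 1\} \le \mathsf{E}(|\xi_n| I(|\xi_n| > 1)).$$

Therefore $\sum P(|\xi_n| > 1) < \infty$. Hence the convergence of $\sum \xi_n$ follows from the three-series theorem.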


4. PROBLEMS

1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables, $S_n = \xi_1 + \cdots + \xi_n$. Show, using the three-series theorem, that
(a) if $\sum \xi_n^2 < \infty$ (P-a.s.) then $\sum \xi_n$ converges with probability 1 if and only if $\sum \mathsf{E}\xi_n I(|\xi_n| \le 1)$ converges;
(b) if $\sum \xi_n$ converges (P-a.s.) then $\sum \xi_n^2 < \infty$ (P-a.s.) if and only if

$$\sum (\mathsf{E}|\xi_n| I(|\xi_n| \le 1))^2 < \infty.$$

2. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables. Show that $\sum \xi_n^2 < \infty$ (P-a.s.) if and only if

$$\sum \mathsf{E}\frac{\xi_n^2}{1 + \xi_n^2} < \infty.$$

3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables. Show that $\sum \xi_n$ converges (P-a.s.) if and only if it converges in probability.

3. Strong Law of Large Numbers


1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with finite second moments; $S_n = \xi_1 + \cdots + \xi_n$. By Problem 2, §3, Chapter III, if the numbers $\mathsf{V}\xi_i$ are uniformly bounded, we have the law of large numbers:

$$\frac{S_n - \mathsf{E}S_n}{n} \xrightarrow{P} 0, \qquad n \to \infty. \tag{1}$$

A strong law of large numbers is a proposition in which convergence in probability is replaced by convergence with probability 1.
One of the earliest results in this direction is the following theorem.

Theorem 1 (Cantelli). Let $\xi_1, \xi_2, \ldots$ be independent random variables with finite fourth moments and let

$$\mathsf{E}|\xi_n - \mathsf{E}\xi_n|^4 \le C, \qquad n \ge 1,$$

for some constant $C$. Then as $n \to \infty$

$$\frac{S_n - \mathsf{E}S_n}{n} \to 0 \quad (P\text{-a.s.}). \tag{2}$$

PROOF. Without loss of generality, we may assume that $\mathsf{E}\xi_n = 0$ for $n \ge 1$. By the corollary to Theorem 1, §10, Chapter II, we will have $S_n/n \to 0$ (P-a.s.) provided that

$$\sum P\Bigl\{\Bigl|\frac{S_n}{n}\Bigr| \ge \varepsilon\Bigr\} < \infty$$

for every $\varepsilon > 0$. In turn, by Chebyshev's inequality, this will follow from

$$\sum \mathsf{E}\Bigl|\frac{S_n}{n}\Bigr|^4 < \infty.$$


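Let us show that this condition is actually satisfied under our hypotheses. We have

$$S_n^4 = (\xi_1 + \cdots + \xi_n)^4 = \sum_{i=1}^n \xi_i^4 + \sum_{i<j} \frac{4!}{2!\,2!}\, \xi_i^2 \xi_j^2 + \sum_{\substack{i \ne j,\ i \ne k \\ j < k}} \frac{4!}{2!\,1!\,1!}\, \xi_i^2 \xi_j \xi_k + \sum_{i<j<k<l} 4!\, \xi_i \xi_j \xi_k \xi_l + \sum_{i \ne j} \frac{4!}{3!\,1!}\, \xi_i^3 \xi_j.$$

Remembering that $\mathsf{E}\xi_k = 0$, $k \le n$, we then obtain

$$\mathsf{E}S_n^4 = \sum_{i=1}^n \mathsf{E}\xi_i^4 + 6 \sum_{\substack{i,j=1 \\ i<j}}^n \mathsf{E}\xi_i^2\, \mathsf{E}\xi_j^2 \le nC + 6 \sum_{\substack{i,j=1 \\ i<j}}^n \sqrt{\mathsf{E}\xi_i^4\, \mathsf{E}\xi_j^4} \le nC + \frac{6n(n-1)}{2}\, C = (3n^2 - 2n)C < 3n^2 C.$$

Consequently

$$\sum \mathsf{E}\Bigl(\frac{S_n}{n}\Bigr)^4 \le 3C \sum \frac{1}{n^2} < \infty.$$

This completes the proof of the theorem.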
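
The fourth-moment computation can be checked exactly in the simplest case. The sketch below is an illustration, not part of the original proof: it assumes symmetric steps $\xi_i = \pm 1$, for which $\mathsf{E}\xi_i^4 = \mathsf{E}\xi_i^2 = 1$ (so $C = 1$) and the bound $(3n^2 - 2n)C$ is attained with equality. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fourth_moment(n: int) -> Fraction:
    """Exact E S_n^4 for S_n = xi_1 + ... + xi_n with P(xi = 1) = P(xi = -1) = 1/2."""
    # S_n = 2K - n, where K ~ Binomial(n, 1/2) counts the +1 steps.
    return sum(Fraction(comb(n, k), 2 ** n) * (2 * k - n) ** 4 for k in range(n + 1))


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50):
        print(n, fourth_moment(n), 3 * n * n - 2 * n)
```

The two printed columns should agree for every $n$, since here $\mathsf{E}S_n^4 = n + 6 \cdot \tfrac{n(n-1)}{2} = 3n^2 - 2n$.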


2. The hypotheses of Theorem 1 can be considerably weakened by the use of more precise methods. In this way we obtain a stronger law of large numbers.

Theorem 2 (Kolmogorov). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with finite second moments, and let there be positive numbers $b_n$ such that $b_n \uparrow \infty$ and

$$\sum \frac{\mathsf{V}\xi_n}{b_n^2} < \infty. \tag{3}$$

Then

$$\frac{S_n - \mathsf{E}S_n}{b_n} \to 0 \quad (P\text{-a.s.}). \tag{4}$$

In particular, if

$$\sum \frac{\mathsf{V}\xi_n}{n^2} < \infty, \tag{5}$$

then

$$\frac{S_n - \mathsf{E}S_n}{n} \to 0 \quad (P\text{-a.s.}). \tag{6}$$

For the proof of this, and of Theorem 3 below, we need two lemmas.


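Lemma 1 (Toeplitz). Let $\{a_n\}$ be a sequence of nonnegative numbers, $b_n = \sum_{i=1}^n a_i$, $b_n > 0$ for $n \ge 1$, and $b_n \uparrow \infty$, $n \to \infty$. Let $\{x_n\}$ be a sequence of numbers converging to $x$. Then

$$\frac{1}{b_n} \sum_{j=1}^n a_j x_j \to x. \tag{7}$$

In particular, if $a_n = 1$ then

$$\frac{x_1 + \cdots + x_n}{n} \to x. \tag{8}$$

PROOF. Let $\varepsilon > 0$ and let $n_0 = n_0(\varepsilon)$ be such that $|x_n - x| \le \varepsilon/2$ for all $n \ge n_0$. Choose $n_1 > n_0$ so that

$$\frac{1}{b_{n_1}} \sum_{j=1}^{n_0} a_j |x_j - x| < \varepsilon/2.$$

Then, for $n > n_1$,

$$\Bigl|\frac{1}{b_n} \sum_{j=1}^n a_j x_j - x\Bigr| \le \frac{1}{b_n} \sum_{j=1}^{n_0} a_j |x_j - x| + \frac{1}{b_n} \sum_{j=n_0+1}^{n} a_j |x_j - x| \le \frac{1}{b_{n_1}} \sum_{j=1}^{n_0} a_j |x_j - x| + \frac{b_n - b_{n_0}}{b_n} \cdot \frac{\varepsilon}{2} \le \varepsilon.$$

This completes the proof of the lemma.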

Lemma 2 (Kronecker). Let $\{b_n\}$ be a sequence of positive increasing numbers, $b_n \uparrow \infty$, $n \to \infty$, and let $\{x_n\}$ be a sequence of numbers such that $\sum x_n$ converges. Then

$$\frac{1}{b_n} \sum_{j=1}^n b_j x_j \to 0, \qquad n \to \infty. \tag{9}$$

In particular, if $b_n = n$, $x_n = y_n/n$ and $\sum (y_n/n)$ converges, then

$$\frac{y_1 + \cdots + y_n}{n} \to 0, \qquad n \to \infty. \tag{10}$$

PROOF. Let $b_0 = 0$, $S_0 = 0$, $S_n = \sum_{j=1}^n x_j$. Then (by summation by parts)

$$\sum_{j=1}^n b_j x_j = \sum_{j=1}^n b_j (S_j - S_{j-1}) = b_n S_n - b_0 S_0 - \sum_{j=1}^n S_{j-1}(b_j - b_{j-1})$$


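and therefore, with $a_j = b_j - b_{j-1}$,

$$\frac{1}{b_n} \sum_{j=1}^n b_j x_j = S_n - \frac{1}{b_n} \sum_{j=1}^n S_{j-1} a_j \to 0,$$

since, if $S_n \to x$, then by Toeplitz' lemma,

$$\frac{1}{b_n} \sum_{j=1}^n S_{j-1} a_j \to x.$$

This establishes the lemma.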
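
A short numerical illustration of Kronecker's lemma may be useful here. It is a sketch under assumptions that are not in the text: the test sequences $x_j = \sin(j)/j$ (whose series converges) and $x_j = 1/j$ (whose series diverges), the weights $b_j = j$, and the function name are chosen only for illustration.

```python
import math


def kronecker_average(x, b, n):
    """(1/b_n) * sum_{j<=n} b_j * x_j for sequences given as functions of j."""
    return sum(b(j) * x(j) for j in range(1, n + 1)) / b(n)


if __name__ == "__main__":
    weights = lambda j: float(j)
    for n in (10, 1_000, 100_000):
        convergent = kronecker_average(lambda j: math.sin(j) / j, weights, n)
        divergent = kronecker_average(lambda j: 1.0 / j, weights, n)
        print(n, convergent, divergent)
```

In the first case the weighted averages shrink toward 0, as the lemma predicts; in the second case the hypothesis fails and the averages stay equal to 1.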


PROOF OF THEOREM 2. Since

$$\frac{S_n - \mathsf{E}S_n}{b_n} = \frac{1}{b_n} \sum_{k=1}^n b_k \Bigl(\frac{\xi_k - \mathsf{E}\xi_k}{b_k}\Bigr),$$

a sufficient condition for (4) is, by Kronecker's lemma, that the series $\sum [(\xi_k - \mathsf{E}\xi_k)/b_k]$ converges (P-a.s.). But this series does converge by (3) and Theorem 1 of §2.
This completes the proof of the theorem.

EXAMPLE 1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent Bernoulli random variables with $P(\xi_n = 1) = P(\xi_n = -1) = \tfrac12$. Then, since $\sum [1/(n \log^2 n)] < \infty$, we have

$$\frac{S_n}{\sqrt n \log n} \to 0 \quad (P\text{-a.s.}). \tag{11}$$
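
The normalization in (11) can be compared with the central-limit scaling $\sqrt n$ along a single simulated path. The following sketch is illustrative only: the seed, the checkpoints and the function name are assumptions, and one finite path merely suggests the almost-sure statement.

```python
import math
import random


def normalized_walk(n_steps, seed, checkpoints):
    """Return {n: (S_n / sqrt(n), S_n / (sqrt(n) * log n))} at the given checkpoints."""
    rng = random.Random(seed)
    marks = set(checkpoints)
    s, out = 0, {}
    for n in range(1, n_steps + 1):
        s += rng.choice((-1, 1))
        if n in marks:
            root = math.sqrt(n)
            out[n] = (s / root, s / (root * math.log(n)))
    return out


if __name__ == "__main__":
    values = normalized_walk(100_000, seed=7, checkpoints=(1_000, 10_000, 100_000))
    for n, (clt_scale, log_scale) in sorted(values.items()):
        print(n, round(clt_scale, 4), round(log_scale, 4))
```

The first column keeps fluctuating on the order of 1, while the second is damped by the extra factor $\log n$.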

3. In the case when the variables $\xi_1, \xi_2, \ldots$ are not only independent but also identically distributed, we can obtain a strong law of large numbers without requiring (as in Theorem 2) the existence of the second moment, provided that the first absolute moment exists.

Theorem 3 (Kolmogorov). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables with $\mathsf{E}|\xi_1| < \infty$. Then

$$\frac{S_n}{n} \to m \quad (P\text{-a.s.}), \tag{12}$$

where $m = \mathsf{E}\xi_1$.

We need the following lemma.

Lemma 3. Let $\xi$ be a nonnegative random variable. Then

$$\sum_{n=1}^\infty P(\xi \ge n) \le \mathsf{E}\xi \le 1 + \sum_{n=1}^\infty P(\xi \ge n). \tag{13}$$


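The proof consists of the following chain of inequalities:

$$\sum_{n=1}^\infty P(\xi \ge n) = \sum_{n=1}^\infty \sum_{k \ge n} P(k \le \xi < k+1) = \sum_{k=1}^\infty k P(k \le \xi < k+1) = \sum_{k=0}^\infty \mathsf{E}[k I(k \le \xi < k+1)]$$

$$\le \sum_{k=0}^\infty \mathsf{E}[\xi I(k \le \xi < k+1)] = \mathsf{E}\xi \le \sum_{k=0}^\infty \mathsf{E}[(k+1) I(k \le \xi < k+1)] = \sum_{k=0}^\infty (k+1) P(k \le \xi < k+1)$$

$$= \sum_{n=1}^\infty P(\xi \ge n) + \sum_{k=0}^\infty P(k \le \xi < k+1) = \sum_{n=1}^\infty P(\xi \ge n) + 1.$$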
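
Inequality (13) is easy to test on a distribution with explicit tails. The sketch below assumes, purely for illustration, an exponential variable with rate $\lambda$, for which $P(\xi \ge n) = e^{-\lambda n}$ and $\mathsf{E}\xi = 1/\lambda$; the function name and the truncation of the series are choices made here.

```python
import math


def tail_sum_exponential(rate: float, terms: int = 10_000) -> float:
    """Truncated sum_{n>=1} P(xi >= n) for xi ~ Exp(rate), where P(xi >= n) = exp(-rate*n)."""
    return sum(math.exp(-rate * n) for n in range(1, terms + 1))


if __name__ == "__main__":
    for rate in (0.1, 1.0, 3.0):
        s = tail_sum_exponential(rate)
        print(f"rate={rate}: {s:.4f} <= {1.0 / rate:.4f} <= {1.0 + s:.4f}")
```

Each printed line should display a valid chain of inequalities, with the mean sandwiched between the tail sum and the tail sum plus 1.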

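PROOF OF THEOREM 3. By Lemma 3 and the Borel-Cantelli lemma,

$$\mathsf{E}|\xi_1| < \infty \iff \sum P\{|\xi_1| \ge n\} < \infty \iff \sum P\{|\xi_n| \ge n\} < \infty \iff P\{|\xi_n| \ge n \text{ i.o.}\} = 0.$$

Hence $|\xi_n| < n$, except for a finite number of $n$, with probability 1.
Let us put

$$\tilde\xi_n = \begin{cases} \xi_n, & |\xi_n| < n, \\ 0, & |\xi_n| \ge n, \end{cases}$$

and suppose that $\mathsf{E}\xi_n = 0$, $n \ge 1$. Then $(\xi_1 + \cdots + \xi_n)/n \to 0$ (P-a.s.), if and only if $(\tilde\xi_1 + \cdots + \tilde\xi_n)/n \to 0$ (P-a.s.). Note that in general $\mathsf{E}\tilde\xi_n \ne 0$ but

$$\mathsf{E}\tilde\xi_n = \mathsf{E}\xi_n I(|\xi_n| < n) = \mathsf{E}\xi_1 I(|\xi_1| < n) \to \mathsf{E}\xi_1 = 0.$$

Hence by Toeplitz' lemma

$$\frac{1}{n} \sum_{k=1}^n \mathsf{E}\tilde\xi_k \to 0, \qquad n \to \infty,$$

and consequently $(\xi_1 + \cdots + \xi_n)/n \to 0$ (P-a.s.), if and only if

$$\frac{(\tilde\xi_1 - \mathsf{E}\tilde\xi_1) + \cdots + (\tilde\xi_n - \mathsf{E}\tilde\xi_n)}{n} \to 0 \quad (P\text{-a.s.}), \qquad n \to \infty. \tag{14}$$

Write $\bar\xi_n = \tilde\xi_n - \mathsf{E}\tilde\xi_n$. By Kronecker's lemma, (14) will be established if $\sum (\bar\xi_n/n)$ converges (P-a.s.). In turn, by Theorem 1 of §2, this will follow if we show that, when $\mathsf{E}|\xi_1| < \infty$, the series $\sum (\mathsf{V}\bar\xi_n/n^2)$ converges.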


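We have

$$\sum_{n=1}^\infty \frac{\mathsf{V}\bar\xi_n}{n^2} \le \sum_{n=1}^\infty \frac{\mathsf{E}\tilde\xi_n^2}{n^2} = \sum_{n=1}^\infty \frac{1}{n^2} \mathsf{E}[\xi_n^2 I(|\xi_n| < n)] = \sum_{n=1}^\infty \frac{1}{n^2} \sum_{k=1}^n \mathsf{E}[\xi_1^2 I(k-1 \le |\xi_1| < k)]$$

$$= \sum_{k=1}^\infty \mathsf{E}[\xi_1^2 I(k-1 \le |\xi_1| < k)] \sum_{n=k}^\infty \frac{1}{n^2} \le 2 \sum_{k=1}^\infty \frac{1}{k} \mathsf{E}[\xi_1^2 I(k-1 \le |\xi_1| < k)] \le 2 \sum_{k=1}^\infty \mathsf{E}[|\xi_1| I(k-1 \le |\xi_1| < k)] = 2\mathsf{E}|\xi_1| < \infty.$$

This completes the proof of the theorem.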

Remark 1. The theorem admits a converse in the following sense. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables such that

$$\frac{\xi_1 + \cdots + \xi_n}{n} \to C,$$

with probability 1, where $C$ is a (finite) constant. Then $\mathsf{E}|\xi_1| < \infty$ and $C = \mathsf{E}\xi_1$.
In fact, if $S_n/n \to C$ (P-a.s.) then

$$\frac{\xi_n}{n} = \frac{S_n}{n} - \Bigl(\frac{n-1}{n}\Bigr) \frac{S_{n-1}}{n-1} \to 0 \quad (P\text{-a.s.})$$

and therefore $P(|\xi_n| > n \text{ i.o.}) = 0$. By the Borel-Cantelli lemma,

$$\sum P(|\xi_1| > n) < \infty,$$

and by Lemma 3 we have $\mathsf{E}|\xi_1| < \infty$. Then it follows from the theorem that $C = \mathsf{E}\xi_1$.

Consequently for independent identically distributed random variables the condition $\mathsf{E}|\xi_1| < \infty$ is necessary and sufficient for the convergence (with probability 1) of the ratio $S_n/n$ to a finite limit.

Remark 2. If the expectation $m = \mathsf{E}\xi_1$ exists but is not necessarily finite, the conclusion (12) of the theorem remains valid.
In fact, let, for example, $\mathsf{E}\xi_1^- < \infty$ and $\mathsf{E}\xi_1^+ = \infty$. With $C > 0$, put

$$S_n^C = \sum_{i=1}^n \xi_i I(\xi_i \le C).$$


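Then (P-a.s.)

$$\liminf_n \frac{S_n}{n} \ge \liminf_n \frac{S_n^C}{n} = \mathsf{E}\xi_1 I(\xi_1 \le C).$$

But as $C \to \infty$,

$$\mathsf{E}\xi_1 I(\xi_1 \le C) \to \mathsf{E}\xi_1 = \infty;$$

therefore $S_n/n \to +\infty$ (P-a.s.).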

4. Let us give some applications of the strong law of large numbers.

EXAMPLE 1 (Application to number theory). Let $\Omega = [0, 1)$, let $\mathscr{B}$ be the algebra of Borel subsets of $\Omega$ and let $P$ be Lebesgue measure on $[0, 1)$. Consider the binary expansions $\omega = 0.\omega_1\omega_2\ldots$ of numbers $\omega \in \Omega$ (with infinitely many 0's) and define random variables $\xi_1(\omega), \xi_2(\omega), \ldots$ by putting $\xi_n(\omega) = \omega_n$. Since, for all $n \ge 1$ and all $x_1, \ldots, x_n$ taking the values 0 or 1,

$$\{\omega: \xi_1(\omega) = x_1, \ldots, \xi_n(\omega) = x_n\} = \Bigl\{\omega: \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots + \frac{x_n}{2^n} \le \omega < \frac{x_1}{2} + \cdots + \frac{x_n}{2^n} + \frac{1}{2^n}\Bigr\},$$

the $P$-measure of this set is $1/2^n$. It follows that $\xi_1, \ldots, \xi_n, \ldots$ is a sequence of independent identically distributed random variables with

$$P(\xi_1 = 0) = P(\xi_1 = 1) = \tfrac12.$$

Hence, by the strong law of large numbers, we have the following result of Borel: almost every number in $[0, 1)$ is normal, in the sense that with probability 1 the proportion of zeros and ones in its binary expansion tends to $\tfrac12$, i.e.

$$\frac{1}{n} \sum_{k=1}^n I(\xi_k = 1) \to \tfrac12 \quad (P\text{-a.s.}).$$

EXAMPLE 2 (The Monte Carlo method). Let $f(x)$ be a continuous function defined on $[0, 1]$, with values on $[0, 1]$. The following idea is the foundation of the statistical method of calculating $\int_0^1 f(x)\,dx$ (the "Monte Carlo method").
Let $\xi_1, \eta_1, \xi_2, \eta_2, \ldots$ be a sequence of independent random variables, uniformly distributed on $[0, 1]$. Put

$$\rho_i = \begin{cases} 1 & \text{if } f(\xi_i) > \eta_i, \\ 0 & \text{if } f(\xi_i) < \eta_i. \end{cases}$$

It is clear that

$$\mathsf{E}\rho_1 = P\{f(\xi_1) > \eta_1\} = \int_0^1 f(x)\,dx.$$


By the strong law of large numbers (Theorem 3)

$$\frac{1}{n} \sum_{i=1}^n \rho_i \to \int_0^1 f(x)\,dx \quad (P\text{-a.s.}).$$

Consequently we can approximate an integral $\int_0^1 f(x)\,dx$ by taking a simulation consisting of a pair of random variables $(\xi_i, \eta_i)$, $i \ge 1$, and then calculating $\rho_i$ and $(1/n) \sum_{i=1}^n \rho_i$.
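
The hit-or-miss scheme of this example translates directly into a few lines of code. The sketch below is a minimal illustration: the integrand $f(x) = x^2$, the sample size, the seed and the function name are assumptions made here, and Python's pseudo-random generator stands in for the independent uniform variables $\xi_i, \eta_i$.

```python
import random


def hit_or_miss(f, n, seed=0):
    """Estimate the integral of f over [0, 1] (with 0 <= f <= 1) by the mean of rho_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        xi, eta = rng.random(), rng.random()
        if f(xi) > eta:  # rho_i = 1 when the point (xi, eta) lies under the graph of f
            hits += 1
    return hits / n


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, hit_or_miss(lambda x: x * x, n, seed=1))
```

For $f(x) = x^2$ the estimates should approach $1/3$; since each $\rho_i$ is a Bernoulli variable, the standard error is $\sqrt{p(1-p)/n}$ with $p = \int_0^1 f(x)\,dx$.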

5. PROBLEMS

1. Show that $\mathsf{E}\xi^2 < \infty$ if and only if $\sum_{n=1}^\infty n P(|\xi| > n) < \infty$.

2. Supposing that $\xi_1, \xi_2, \ldots$ are independent and identically distributed, show that if $\mathsf{E}|\xi_1|^\alpha < \infty$ for some $\alpha$, $0 < \alpha < 1$, then $S_n/n^{1/\alpha} \to 0$ (P-a.s.), and if $\mathsf{E}|\xi_1|^\beta < \infty$ for some $\beta$, $1 \le \beta < 2$, then $(S_n - n\mathsf{E}\xi_1)/n^{1/\beta} \to 0$ (P-a.s.).

3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables and let $\mathsf{E}|\xi_1| = \infty$. Show that

$$\limsup_n \Bigl|\frac{S_n}{n} - a_n\Bigr| = \infty \quad (P\text{-a.s.})$$

for every sequence of constants $\{a_n\}$.

4. Show that a rational number on $[0, 1)$ is never normal (in the sense of Example 1, Subsection 4).

4. Law of the Iterated Logarithm


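1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent Bernoulli random variables with $P(\xi_n = 1) = P(\xi_n = -1) = \tfrac12$; let $S_n = \xi_1 + \cdots + \xi_n$. It follows from the proof of Theorem 2, §1, that

$$\limsup_n \frac{S_n}{\sqrt n} = +\infty, \qquad \liminf_n \frac{S_n}{\sqrt n} = -\infty \tag{1}$$

with probability 1. On the other hand, by (3.11),

$$\frac{S_n}{\sqrt n \log n} \to 0 \quad (P\text{-a.s.}). \tag{2}$$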

Let us compare these results.


It follows from (1) that with probability 1 the paths of $(S_n)_{n \ge 1}$ intersect the "curves" $\pm\varepsilon\sqrt n$ infinitely often for any given $\varepsilon$; but at the same time (2)


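shows that they only finitely often leave the region bounded by the curves $\pm\varepsilon\sqrt n \log n$. These two results yield useful information on the amplitude of the oscillations of the symmetric random walk $(S_n)_{n \ge 1}$. The law of the iterated logarithm, which we present below, improves this picture of the amplitude of the oscillations of $(S_n)_{n \ge 1}$.
Let us introduce the following definition. We call a function $\varphi^* = \varphi^*(n)$, $n \ge 1$, upper (for $(S_n)_{n \ge 1}$) if, with probability 1, $S_n \le \varphi^*(n)$ for all $n$ from $n = n_0(\omega)$ on.
We call a function $\varphi_* = \varphi_*(n)$, $n \ge 1$, lower (for $(S_n)_{n \ge 1}$) if, with probability 1, $S_n > \varphi_*(n)$ for infinitely many $n$.
Using these definitions, and appealing to (1) and (2), we can say that every function $\varphi^* = \varepsilon\sqrt n \log n$, $\varepsilon > 0$, is upper, whereas $\varphi_* = \varepsilon\sqrt n$ is lower, $\varepsilon > 0$.
Let $\varphi = \varphi(n)$ be a function and $\varphi^*_\varepsilon = (1 + \varepsilon)\varphi$, $\varphi_{*\varepsilon} = (1 - \varepsilon)\varphi$, where $\varepsilon > 0$. Then it is easily seen that

$$\Bigl\{\limsup_n \frac{S_n}{\varphi(n)} \le 1\Bigr\} = \Bigl\{\lim_n \Bigl[\sup_{m \ge n} \frac{S_m}{\varphi(m)}\Bigr] \le 1\Bigr\}$$
$$\iff \Bigl\{\sup_{m \ge n_1(\varepsilon)} \frac{S_m}{\varphi(m)} \le 1 + \varepsilon \text{ for every } \varepsilon > 0, \text{ from some } n_1(\varepsilon) \text{ on}\Bigr\}$$
$$\iff \{S_m \le (1 + \varepsilon)\varphi(m) \text{ for every } \varepsilon > 0, \text{ from some } n_1(\varepsilon) \text{ on}\}. \tag{3}$$

In the same way,

$$\Bigl\{\limsup_n \frac{S_n}{\varphi(n)} \ge 1\Bigr\} = \Bigl\{\lim_n \Bigl[\sup_{m \ge n} \frac{S_m}{\varphi(m)}\Bigr] \ge 1\Bigr\}$$
$$\iff \Bigl\{\sup_{m \ge n_2(\varepsilon)} \frac{S_m}{\varphi(m)} \ge 1 - \varepsilon \text{ for every } \varepsilon > 0, \text{ from some } n_2(\varepsilon) \text{ on}\Bigr\}$$
$$\iff \{S_m \ge (1 - \varepsilon)\varphi(m) \text{ for every } \varepsilon > 0 \text{ and for infinitely many } m \text{ larger than some } n_3(\varepsilon) \ge n_2(\varepsilon)\}. \tag{4}$$

It follows from (3) and (4) that in order to verify that each function $\varphi^*_\varepsilon = (1 + \varepsilon)\varphi$, $\varepsilon > 0$, is upper, we have to show that

$$P\Bigl\{\limsup_n \frac{S_n}{\varphi(n)} \le 1\Bigr\} = 1. \tag{5}$$

But to show that $\varphi_{*\varepsilon} = (1 - \varepsilon)\varphi$, $\varepsilon > 0$, is lower, we have to show that

$$P\Bigl\{\limsup_n \frac{S_n}{\varphi(n)} \ge 1\Bigr\} = 1. \tag{6}$$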


2. Theorem 1 (Law of the Iterated Logarithm). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables with $\mathsf{E}\xi_i = 0$ and $\mathsf{E}\xi_i^2 = \sigma^2 > 0$. Then

$$P\Bigl\{\limsup_n \frac{S_n}{\psi(n)} = 1\Bigr\} = 1, \tag{7}$$

where

$$\psi(n) = \sqrt{2\sigma^2 n \log\log n}. \tag{8}$$

For uniformly bounded random variables, the law of the iterated logarithm was established by Khinchin (1924). In 1929 Kolmogorov generalized this result to a wide class of independent variables. Under the conditions of Theorem 1, the law of the iterated logarithm was established by Hartman and Wintner (1941).
Since the proof of Theorem 1 is rather complicated, we shall confine ourselves to the special case when the random variables $\xi_n$ are normal, $\xi_n \sim \mathscr{N}(0, 1)$, $n \ge 1$.
We begin by proving two auxiliary results.

Lemma 1. Let $\xi_1, \ldots, \xi_n$ be independent random variables that are symmetrically distributed ($P(\xi_k \in B) = P(-\xi_k \in B)$ for every $B \in \mathscr{B}(R)$, $k \le n$). Then for every real number $a$

$$P\Bigl(\max_{1 \le k \le n} S_k > a\Bigr) \le 2P(S_n > a). \tag{9}$$

PROOF. Let $A = \{\max_{1 \le k \le n} S_k > a\}$, $A_k = \{S_i \le a,\ i \le k-1;\ S_k > a\}$ and $B = \{S_n > a\}$. Since $S_n > a$ on $A_k \cap \{S_n \ge S_k\}$ (because $S_k > a$ on $A_k$), we have

$$P(B \cap A_k) \ge P(A_k \cap \{S_n \ge S_k\}) = P(A_k) P(S_n \ge S_k) = P(A_k) P(\xi_{k+1} + \cdots + \xi_n \ge 0).$$

By the symmetry of the distributions of the random variables $\xi_1, \ldots, \xi_n$, we have

$$P(\xi_{k+1} + \cdots + \xi_n > 0) = P(\xi_{k+1} + \cdots + \xi_n < 0).$$

Hence $P(\xi_{k+1} + \cdots + \xi_n \ge 0) \ge \tfrac12$, and therefore

$$P(B) \ge \sum_{k=1}^n P(A_k \cap B) \ge \tfrac12 \sum_{k=1}^n P(A_k) = \tfrac12 P(A),$$

which establishes (9).
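
Inequality (9) can be confirmed exactly for a short symmetric walk. The sketch below is an illustration only: it assumes steps $\pm 1$ with probability $\tfrac12$ each (one particular symmetric distribution), enumerates all $2^n$ paths, and uses a hypothetical function name.

```python
from itertools import product


def max_and_end_probs(n: int, a: float):
    """Return (P(max_{k<=n} S_k > a), P(S_n > a)) for the symmetric +-1 walk, exactly."""
    max_hits = end_hits = 0
    for steps in product((-1, 1), repeat=n):
        s, running_max = 0, float("-inf")
        for x in steps:
            s += x
            running_max = max(running_max, s)
        max_hits += running_max > a
        end_hits += s > a
    total = 2 ** n
    return max_hits / total, end_hits / total


if __name__ == "__main__":
    for a in (0, 1, 2, 4):
        p_max, p_end = max_and_end_probs(12, a)
        print(a, p_max, 2 * p_end)
```

In every row the first probability should not exceed the second quantity.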

Lemma 2. Let $S_n \sim \mathscr{N}(0, \sigma^2(n))$, $\sigma^2(n) \uparrow \infty$, and let $a(n)$, $n \ge 1$, satisfy $a(n)/\sigma(n) \to \infty$, $n \to \infty$. Then

$$P(S_n > a(n)) \sim \frac{\sigma(n)}{\sqrt{2\pi}\, a(n)} \exp\{-\tfrac12 a^2(n)/\sigma^2(n)\}. \tag{10}$$


The proof follows from the asymptotic formula

$$\frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-y^2/2}\,dy \sim \frac{1}{\sqrt{2\pi}\, x}\, e^{-x^2/2}, \qquad x \to \infty,$$

since $S_n/\sigma(n) \sim \mathscr{N}(0, 1)$.
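
The asymptotic formula for the Gaussian tail can be checked with the complementary error function from the standard library. This is a minimal sketch with hypothetical function names; it relies on the identity $P(Z > x) = \tfrac12 \operatorname{erfc}(x/\sqrt2)$ for $Z \sim \mathscr{N}(0, 1)$.

```python
import math


def gaussian_tail(x: float) -> float:
    """P(Z > x) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_asymptotic(x: float) -> float:
    """Leading term exp(-x^2 / 2) / (sqrt(2*pi) * x) of the tail as x grows."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)


if __name__ == "__main__":
    for x in (1.0, 2.0, 5.0, 10.0):
        print(x, gaussian_tail(x) / tail_asymptotic(x))
```

The printed ratios should increase toward 1 from below as $x$ grows.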


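PROOF OF THEOREM 1 (for $\xi_i \sim \mathscr{N}(0, 1)$).
Let us first establish (5). Let $\varepsilon > 0$, $\lambda = 1 + \varepsilon$, $n_k = \lambda^k$, where $k \ge k_0$, and $k_0$ is chosen so that $\ln \ln k_0$ is defined. We also define

$$A_k = \{S_n > \lambda\psi(n) \text{ for some } n \in (n_k, n_{k+1}]\}, \tag{11}$$

and put

$$A = \{A_k \text{ i.o.}\} = \{S_n > \lambda\psi(n) \text{ for infinitely many } n\}.$$

In accordance with (3), we can establish (5) by showing that $P(A) = 0$.
Let us show that $\sum P(A_k) < \infty$. Then $P(A) = 0$ by the Borel-Cantelli lemma.
From (11), (9) and (10) we find that

$$P(A_k) \le P\{S_n > \lambda\psi(n_k) \text{ for some } n \in (n_k, n_{k+1}]\} \le P\{S_n > \lambda\psi(n_k) \text{ for some } n \le n_{k+1}\}$$
$$\le 2P\{S_{n_{k+1}} > \lambda\psi(n_k)\} \sim \frac{2\sqrt{n_{k+1}}}{\sqrt{2\pi}\, \lambda\psi(n_k)} \exp\Bigl\{-\tfrac12 \lambda^2 \bigl[\psi(n_k)/\sqrt{n_{k+1}}\bigr]^2\Bigr\} \le C_1 \exp(-\lambda \ln \ln \lambda^k) \le C_2 e^{-\lambda \ln k} = C_2 k^{-\lambda},$$

where $C_1$ and $C_2$ are constants. But $\sum_{k=1}^\infty k^{-\lambda} < \infty$, and therefore

$$\sum P(A_k) < \infty.$$

Consequently (5) is established.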


We turn now to the proof of (6). In accordance with (4) we must show that, with $\lambda = 1 - \varepsilon$, $\varepsilon > 0$, we have with probability 1 that $S_n \ge \lambda\psi(n)$ for infinitely many $n$.
Let us apply (5), which we just proved, to the sequence $(-S_n)_{n \ge 1}$. Then we find that for all $n$, with finitely many exceptions, $-S_n \le 2\psi(n)$ (P-a.s.). Consequently if $n_k = N^k$, $N > 1$, then for sufficiently large $k$,

$$S_{n_{k-1}} \ge -2\psi(n_{k-1}),$$

or, equivalently,

$$S_{n_k} \ge Y_k - 2\psi(n_{k-1}), \tag{12}$$

where $Y_k = S_{n_k} - S_{n_{k-1}}$.

Hence if we show that for infinitely many $k$

$$Y_k > \lambda\psi(n_k) + 2\psi(n_{k-1}), \tag{13}$$


this and (12) show that (P-a.s.) $S_{n_k} > \lambda\psi(n_k)$ for infinitely many $k$. Take some $\lambda' \in (\lambda, 1)$. Then there is an $N > 1$ such that for all $k$

$$\lambda'[2(N^k - N^{k-1}) \ln \ln N^k]^{1/2} > \lambda(2N^k \ln \ln N^k)^{1/2} + 2(2N^{k-1} \ln \ln N^{k-1})^{1/2} = \lambda\psi(N^k) + 2\psi(N^{k-1}).$$

It is now enough to show that

$$Y_k > \lambda'[2(N^k - N^{k-1}) \ln \ln N^k]^{1/2} \tag{14}$$

for infinitely many $k$. Evidently $Y_k \sim \mathscr{N}(0, N^k - N^{k-1})$. Therefore, by Lemma 2,

$$P\{Y_k > \lambda'[2(N^k - N^{k-1}) \ln \ln N^k]^{1/2}\} \sim \frac{1}{\sqrt{2\pi}\, \lambda' (2 \ln \ln N^k)^{1/2}}\, e^{-(\lambda')^2 \ln \ln N^k} \ge \frac{C_1 k^{-(\lambda')^2}}{(\ln k)^{1/2}} \ge \frac{C_2}{k \ln k}.$$

Since $\sum (1/k \ln k) = \infty$, it follows from the second part of the Borel-Cantelli lemma that, with probability 1, inequality (14) is satisfied for infinitely many $k$, so that (6) is established.
This completes the proof of the theorem.
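
A simulation can give a feel for the normalization $\psi(n)$, although the convergence in the law of the iterated logarithm is far too slow for a finite run to confirm the limit. The sketch below is illustrative only: it assumes standard normal steps ($\sigma = 1$), an arbitrary seed, a hypothetical function name, and a starting index large enough for $\ln \ln n$ to be positive.

```python
import math
import random


def max_lil_ratio(n_steps, seed, start=1_000):
    """Largest value of S_n / psi(n) over start <= n <= n_steps, with N(0, 1) steps."""
    rng = random.Random(seed)
    s, best = 0.0, float("-inf")
    for n in range(1, n_steps + 1):
        s += rng.gauss(0.0, 1.0)
        if n >= start:
            psi = math.sqrt(2.0 * n * math.log(math.log(n)))
            best = max(best, s / psi)
    return best


if __name__ == "__main__":
    for seed in range(5):
        print(seed, round(max_lil_ratio(200_000, seed), 4))
```

The printed maxima typically stay of order 1, in line with $(1 + \varepsilon)\psi$ being an upper function; they are not expected to be close to 1 for runs of this length.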

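Remark 1. Applying (7) to the random variables $(-S_n)_{n \ge 1}$, we find that (P-a.s.)

$$\liminf_n \frac{S_n}{\psi(n)} = -1. \tag{15}$$

It follows from (7) and (15) that the law of the iterated logarithm can be put in the form

$$P\Bigl\{\limsup_n \frac{|S_n|}{\psi(n)} = 1\Bigr\} = 1. \tag{16}$$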

Remark 2. The law of the iterated logarithm says that for every $\varepsilon > 0$ each function $\psi^*_\varepsilon = (1 + \varepsilon)\psi$ is upper, and $\psi_{*\varepsilon} = (1 - \varepsilon)\psi$ is lower.
The conclusion (7) is also equivalent to the statement that, for each $\varepsilon > 0$,

$$P\{|S_n| \ge (1 - \varepsilon)\psi(n) \text{ i.o.}\} = 1,$$
$$P\{|S_n| \ge (1 + \varepsilon)\psi(n) \text{ i.o.}\} = 0.$$

3. PROBLEMS

1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with $\xi_n \sim \mathscr{N}(0, 1)$. Show that

$$P\Bigl\{\limsup_n \frac{\xi_n}{\sqrt{2 \ln n}} = 1\Bigr\} = 1.$$


2. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables, distributed according to Poisson's law with parameter $\lambda > 0$. Show that (independently of $\lambda$)

$$P\Bigl\{\limsup_n \frac{\xi_n \ln \ln n}{\ln n} = 1\Bigr\} = 1.$$

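3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables with

$$\mathsf{E}e^{it\xi_1} = e^{-|t|^\alpha}, \qquad 0 < \alpha < 2.$$

Show that

$$P\Bigl\{\limsup_n \Bigl|\frac{S_n}{n^{1/\alpha}}\Bigr|^{1/(\ln \ln n)} = e^{1/\alpha}\Bigr\} = 1.$$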

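4. Establish the following generalization of (9). Let $\xi_1, \ldots, \xi_n$ be independent random variables. Levy's inequality

$$P\Bigl\{\max_{0 \le k \le n} [S_k + \mu(S_n - S_k)] > a\Bigr\} \le 2P(S_n > a), \qquad S_0 = 0,$$

holds for every real $a$, where $\mu(\xi)$ is the median of $\xi$, i.e. a constant such that

$$P(\xi \ge \mu(\xi)) \ge \tfrac12, \qquad P(\xi \le \mu(\xi)) \ge \tfrac12.$$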

CHAPTER V

Stationary (Strict Sense) Random Sequences and Ergodic Theory

1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations

1. Let $(\Omega, \mathscr{F}, P)$ be a probability space and $\xi = (\xi_1, \xi_2, \ldots)$ a sequence of random variables or, as we say, a random sequence. Let $\theta_k\xi$ denote the sequence $(\xi_{k+1}, \xi_{k+2}, \ldots)$.

Definition 1. A random sequence $\xi$ is stationary (in the strict sense) if the probability distributions of $\theta_k\xi$ and $\xi$ are the same for every $k \ge 1$:

$$P((\xi_1, \xi_2, \ldots) \in B) = P((\xi_{k+1}, \xi_{k+2}, \ldots) \in B), \qquad B \in \mathscr{B}(R^\infty).$$

The simplest example is a sequence $\xi = (\xi_1, \xi_2, \ldots)$ of independent identically distributed random variables. Starting from such a sequence, we can construct a broad class of stationary sequences $\eta = (\eta_1, \eta_2, \ldots)$ by choosing any Borel function $g(x_1, \ldots, x_n)$ and setting $\eta_k = g(\xi_k, \xi_{k+1}, \ldots, \xi_{k+n-1})$.
If $\xi = (\xi_1, \xi_2, \ldots)$ is a sequence of independent identically distributed random variables with $\mathsf{E}|\xi_1| < \infty$ and $\mathsf{E}\xi_1 = m$, the law of large numbers tells us that, with probability 1,

$$\frac{\xi_1 + \cdots + \xi_n}{n} \to m, \qquad n \to \infty.$$

In 1931 Birkhoff obtained a remarkable generalization of this fact for the case
of stationary sequences. The present chapter consists mainly of a proof of
Birkhoff's theorem.
The following presentation is based on the idea of measure-preserving
transformations, something that brings us in contact with an interesting


branch of analysis (ergodic theory), and at the same time shows the connection between this theory and stationary random processes.
Let $(\Omega, \mathscr{F}, P)$ be a probability space.

Definition 2. A transformation $T$ of $\Omega$ into $\Omega$ is measurable if, for every $A \in \mathscr{F}$,

$$T^{-1}A = \{\omega: T\omega \in A\} \in \mathscr{F}.$$

Definition 3. A measurable transformation $T$ is a measure-preserving transformation (or morphism) if, for every $A \in \mathscr{F}$,

$$P(T^{-1}A) = P(A).$$
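Let $T$ be a measure-preserving transformation, $T^n$ its $n$th iterate, and $\xi_1 = \xi_1(\omega)$ a random variable. Put $\xi_n(\omega) = \xi_1(T^{n-1}\omega)$, $n \ge 2$, and consider the sequence $\xi = (\xi_1, \xi_2, \ldots)$. We claim that this sequence is stationary.
In fact, let $A = \{\omega: \xi \in B\}$ and $A_1 = \{\omega: \theta_1\xi \in B\}$, where $B \in \mathscr{B}(R^\infty)$. Since $A = \{\omega: (\xi_1(\omega), \xi_1(T\omega), \ldots) \in B\}$, and $A_1 = \{\omega: (\xi_1(T\omega), \xi_1(T^2\omega), \ldots) \in B\}$, we have $\omega \in A_1$ if and only if $T\omega \in A$, i.e. $A_1 = T^{-1}A$. But $P(T^{-1}A) = P(A)$, and therefore $P(A_1) = P(A)$. Similarly $P(A_k) = P(A)$ for every $A_k = \{\omega: \theta_k\xi \in B\}$, $k \ge 2$.
Thus we can use measure-preserving transformations to construct stationary (strict sense) random sequences.
In a certain sense, there is a converse result: for every stationary sequence $\xi$ considered on $(\Omega, \mathscr{F}, P)$ we can construct a new probability space $(\tilde\Omega, \tilde{\mathscr{F}}, \tilde P)$, a random variable $\tilde\xi_1(\tilde\omega)$ and a measure-preserving transformation $\tilde T$, such that the distribution of $\tilde\xi = \{\tilde\xi_1(\tilde\omega), \tilde\xi_1(\tilde T\tilde\omega), \ldots\}$ coincides with the distribution of $\xi$.
In fact, take $\tilde\Omega$ to be the coordinate space $R^\infty$ and put $\tilde{\mathscr{F}} = \mathscr{B}(R^\infty)$, $\tilde P = P_\xi$, where $P_\xi(B) = P\{\omega: \xi \in B\}$, $B \in \mathscr{B}(R^\infty)$. The action of $\tilde T$ on $\tilde\Omega$ is given by

$$\tilde T(x_1, x_2, \ldots) = (x_2, x_3, \ldots).$$

If $\tilde\omega = (x_1, x_2, \ldots)$, put

$$\tilde\xi_1(\tilde\omega) = x_1, \qquad \tilde\xi_n(\tilde\omega) = \tilde\xi_1(\tilde T^{n-1}\tilde\omega), \qquad n \ge 2.$$

Now let $A = \{\tilde\omega: (x_1, \ldots, x_k) \in B\}$, $B \in \mathscr{B}(R^k)$, and

$$\tilde T^{-1}A = \{\tilde\omega: (x_2, \ldots, x_{k+1}) \in B\}.$$

Then the property of being stationary means that

$$\tilde P(A) = P\{\omega: (\xi_1, \ldots, \xi_k) \in B\} = P\{\omega: (\xi_2, \ldots, \xi_{k+1}) \in B\} = \tilde P(\tilde T^{-1}A),$$

i.e. $\tilde T$ is a measure-preserving transformation. Since $\tilde P\{\tilde\omega: (\tilde\xi_1, \ldots, \tilde\xi_k) \in B\} = P\{\omega: (\xi_1, \ldots, \xi_k) \in B\}$ for every $k$, it follows that $\xi$ and $\tilde\xi$ have the same distribution.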
Here are some examples of measure-preserving transformations.


EXAMPLE 1. Let $\Omega = \{\omega_1, \ldots, \omega_n\}$ consist of $n$ points (a finite number), $n \ge 2$, let $\mathscr{F}$ be the collection of its subsets, and let $T\omega_i = \omega_{i+1}$, $1 \le i \le n-1$, and $T\omega_n = \omega_1$. If $P(\omega_i) = 1/n$, the transformation $T$ is measure-preserving.

EXAMPLE 2. If $\Omega = [0, 1)$, $\mathscr{F} = \mathscr{B}([0, 1))$, $P$ is Lebesgue measure, $\lambda \in [0, 1)$, then $Tx = (x + \lambda) \bmod 1$ and $Tx = 2x \bmod 1$ are both measure-preserving transformations.
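
For the two maps of Example 2, the defining property $P(T^{-1}A) = P(A)$ can be checked exactly on intervals $A = [a, b)$ with rational endpoints. The sketch below is an illustration, not a proof: the function names are hypothetical, and the preimages are written out by hand for each map.

```python
from fractions import Fraction


def preimage_doubling(a, b):
    """T^{-1}[a, b) for Tx = 2x mod 1: two intervals of half the length."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]


def preimage_rotation(a, b, lam):
    """T^{-1}[a, b) for Tx = (x + lam) mod 1, written as intervals inside [0, 1)."""
    lo, hi = a - lam, b - lam
    if lo >= 0:
        return [(lo, hi)]
    if hi <= 0:
        return [(lo + 1, hi + 1)]
    return [(Fraction(0), hi), (lo + 1, Fraction(1))]  # the preimage wraps around 0


def lebesgue(intervals):
    """Total length of a list of disjoint intervals [lo, hi)."""
    return sum(hi - lo for lo, hi in intervals)


if __name__ == "__main__":
    a, b, lam = Fraction(1, 5), Fraction(3, 4), Fraction(1, 3)
    print(b - a, lebesgue(preimage_doubling(a, b)), lebesgue(preimage_rotation(a, b, lam)))
```

All three printed values coincide. Since intervals generate the Borel $\sigma$-algebra, agreement on intervals is the essential step in showing that the maps preserve Lebesgue measure.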

2. Let us pause to consider the physical hypotheses that led to the consideration of measure-preserving transformations.
Let us suppose that $\Omega$ is the phase space of a system that evolves (in discrete time) according to a given law of motion. If $\omega$ is the state at instant $n = 1$, then $T^n\omega$, where $T$ is the translation operator induced by the given law of motion, is the state attained by the system after $n$ steps. Moreover, if $A$ is some set of states $\omega$ then $T^{-1}A = \{\omega: T\omega \in A\}$ is, by definition, the set of states $\omega$ that lead to $A$ in one step. Therefore if we interpret $\Omega$ as an incompressible fluid, the condition $P(T^{-1}A) = P(A)$ can be thought of as the rather natural condition of conservation of volume. (For the classical conservative Hamiltonian systems, Liouville's theorem asserts that the corresponding transformation $T$ preserves Lebesgue measure.)

3. One of the earliest results on measure-preserving transformations was Poincaré's recurrence theorem (1912).

Theorem 1. Let $(\Omega, \mathscr{F}, P)$ be a probability space, let $T$ be a measure-preserving transformation, and let $A \in \mathscr{F}$. Then, for almost every point $\omega \in A$, we have $T^n\omega \in A$ for infinitely many $n \ge 1$.

PROOF. Let $C = \{\omega \in A: T^n\omega \notin A \text{ for all } n \ge 1\}$. Since $C \cap T^{-n}C = \varnothing$ for all $n \ge 1$, we have $T^{-m}C \cap T^{-(m+n)}C = T^{-m}(C \cap T^{-n}C) = \varnothing$. Therefore the sequence $\{T^{-n}C\}$ consists of disjoint sets of equal measure. Therefore $\sum_{n=0}^{\infty} P(C) = \sum_{n=0}^{\infty} P(T^{-n}C) \le P(\Omega) = 1$ and consequently $P(C) = 0$. Therefore, for almost every point $\omega \in A$, for at least one $n \ge 1$, we have $T^n\omega \in A$. It remains to show that $T^n\omega \in A$ for infinitely many $n$.

Let us apply the preceding result to $T^k$, $k \ge 1$. Then for every $\omega \in A \setminus N$, where $N$ is a set of probability zero, the union of the exceptional sets corresponding to the various values of $k$, there is an $n_k$ such that $(T^k)^{n_k}\omega \in A$. It is then clear that $T^n\omega \in A$ for infinitely many $n$. This completes the proof of the theorem.
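The recurrence property can be observed numerically for the rotation $Tx = (x + \lambda) \bmod 1$ of Example 2. The sketch below lists the return times of one orbit to a set $A = [a, b)$; the angle, the starting point and the interval are illustrative assumptions, and floating-point arithmetic is used, so this is an illustration rather than a proof.

```python
import math

def return_times(x0, lam, a, b, n_steps):
    """Times n in 1..n_steps with T^n x0 in [a, b), for T x = (x + lam) mod 1."""
    return [n for n in range(1, n_steps + 1) if a <= (x0 + n * lam) % 1.0 < b]

lam = math.sqrt(2) - 1            # an irrational rotation angle (illustrative choice)
times = return_times(0.05, lam, 0.0, 0.1, 10_000)
print(len(times), times[:5])      # number of returns and the first few return times
```

The starting point $0.05$ lies in $A = [0, 0.1)$, and the orbit keeps coming back to $A$, roughly once in every ten steps, in line with $P(A) = 0.1$.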

Corollary. Let $\xi(\omega) \ge 0$. Then

$$\sum_{k=0}^{\infty} \xi(T^k\omega) = \infty \quad (P\text{-a.s.})$$

on the set $\{\omega: \xi(\omega) > 0\}$.

In fact, let $A_n = \{\omega: \xi(\omega) \ge 1/n\}$. Then, according to the theorem, $\sum_{k=0}^{\infty} \xi(T^k\omega) = \infty$ (P-a.s.) on $A_n$, and the required result follows by letting $n \to \infty$.

Remark. The theorem remains valid if we replace the probability measure $P$ by any finite measure $\mu$ with $\mu(\Omega) < \infty$.

4. PROBLEMS

1. Let $T$ be a measure-preserving transformation and $\xi = \xi(\omega)$ a random variable whose expectation $E\xi(\omega)$ exists. Show that $E\xi(\omega) = E\xi(T\omega)$.
2. Show that the transformations in Examples 1 and 2 are measure-preserving.
3. Let $\Omega = [0, 1)$, $\mathscr{F} = \mathscr{B}([0, 1))$ and let $P$ be a measure whose distribution function is continuous. Show that the transformations $Tx = \lambda x$, $0 < \lambda < 1$, and $Tx = x^2$ are not measure-preserving.

2. Ergodicity and Mixing


1. In the present section $T$ denotes a measure-preserving transformation on the probability space $(\Omega, \mathscr{F}, P)$.

Definition 1. A set $A \in \mathscr{F}$ is invariant if $T^{-1}A = A$. A set $A \in \mathscr{F}$ is almost invariant if $A$ and $T^{-1}A$ differ only by a set of measure zero, i.e. $P(A \,\triangle\, T^{-1}A) = 0$.

It is easily verified that the classes $\mathscr{J}$ and $\mathscr{J}^*$ of invariant or almost invariant sets, respectively, are $\sigma$-algebras.

Definition 2. A measure-preserving transformation $T$ is ergodic (or metrically transitive) if every invariant set $A$ has measure either zero or one.

Definition 3. A random variable $\xi = \xi(\omega)$ is invariant (or almost invariant) if $\xi(\omega) = \xi(T\omega)$ for all $\omega \in \Omega$ (or for almost all $\omega \in \Omega$).

The following lemma establishes a connection between invariant and almost invariant sets.

Lemma 1. If $A$ is almost invariant, there is an invariant set $B$ such that $P(A \,\triangle\, B) = 0$.

PROOF. Let $B = \overline{\lim}\, T^{-n}A$. Then $T^{-1}B = \overline{\lim}\, T^{-(n+1)}A = B$, i.e. $B \in \mathscr{J}$. It is easily seen that $A \,\triangle\, B \subseteq \bigcup_{k=0}^{\infty} (T^{-k}A \,\triangle\, T^{-(k+1)}A)$. But

$$P(T^{-k}A \,\triangle\, T^{-(k+1)}A) = P(A \,\triangle\, T^{-1}A) = 0.$$

Hence $P(A \,\triangle\, B) = 0$.

Lemma 2. A transformation $T$ is ergodic if and only if every almost invariant set has measure zero or one.

PROOF. Let $A \in \mathscr{J}^*$; then according to Lemma 1 there is an invariant set $B$ such that $P(A \,\triangle\, B) = 0$. But $T$ is ergodic and therefore $P(B) = 0$ or 1. Therefore $P(A) = 0$ or 1. The converse is evident, since $\mathscr{J} \subseteq \mathscr{J}^*$. This completes the proof of the lemma.

Theorem 1. Let $T$ be a measure-preserving transformation. Then the following conditions are equivalent:
(1) $T$ is ergodic;
(2) every almost invariant random variable is (P-a.s.) constant;
(3) every invariant random variable is (P-a.s.) constant.

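PROOF. (1) $\Rightarrow$ (2). Let $T$ be ergodic and $\xi$ almost invariant, i.e. (P-a.s.) $\xi(\omega) = \xi(T\omega)$. Then for every $c \in R$ we have $A_c = \{\omega: \xi(\omega) \le c\} \in \mathscr{J}^*$, and then $P(A_c) = 0$ or 1 by Lemma 2. Let $C = \sup\{c: P(A_c) = 0\}$. Since $A_c \uparrow \Omega$ as $c \uparrow \infty$ and $A_c \downarrow \varnothing$ as $c \downarrow -\infty$, we have $|C| < \infty$. Then

$$P\{\omega: \xi(\omega) < C\} = P\Big\{\bigcup_{n=1}^{\infty} \Big\{\xi(\omega) \le C - \frac{1}{n}\Big\}\Big\} = 0,$$

and similarly $P\{\omega: \xi(\omega) > C\} = 0$. Consequently $P\{\omega: \xi(\omega) = C\} = 1$.

(2) $\Rightarrow$ (3). Evident.

(3) $\Rightarrow$ (1). Let $A \in \mathscr{J}$; then $I_A$ is an invariant random variable and therefore, (P-a.s.), $I_A = 0$ or $I_A = 1$, whence $P(A) = 0$ or 1.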

Remark. The conclusion of the theorem remains valid when "random variable" is replaced by "bounded random variable".

We illustrate the theorem with the following example.

EXAMPLE. Let $\Omega = [0, 1)$, $\mathscr{F} = \mathscr{B}([0, 1))$, let $P$ be Lebesgue measure and let $T\omega = (\omega + \lambda) \bmod 1$. Let us show that $T$ is ergodic if and only if $\lambda$ is irrational.

Let $\xi = \xi(\omega)$ be a random variable with $E\xi^2(\omega) < \infty$. Then we know that the Fourier series $\sum_{n=-\infty}^{\infty} c_n e^{2\pi i n\omega}$ of $\xi(\omega)$ converges in the mean square sense, $\sum |c_n|^2 < \infty$, and, because $T$ is a measure-preserving transformation (Example 2, §1), we have (Problem 1, §1) that for an invariant random variable $\xi$

$$c_n = E\xi(\omega)e^{-2\pi i n\omega} = E\xi(T\omega)e^{-2\pi i nT\omega} = e^{-2\pi i n\lambda}E\xi(T\omega)e^{-2\pi i n\omega} = e^{-2\pi i n\lambda}E\xi(\omega)e^{-2\pi i n\omega} = c_n e^{-2\pi i n\lambda}.$$

So $c_n(1 - e^{-2\pi i n\lambda}) = 0$. By hypothesis, $\lambda$ is irrational and therefore $e^{-2\pi i n\lambda} \ne 1$ for all $n \ne 0$. Therefore $c_n = 0$, $n \ne 0$, $\xi(\omega) = c_0$ (P-a.s.), and $T$ is ergodic by Theorem 1.

On the other hand, let $\lambda$ be rational, i.e. $\lambda = k/m$, where $k$ and $m$ are integers. Consider the set

$$A = \bigcup_{j=0}^{m-1} \Big\{\omega: \frac{2j}{2m} \le \omega < \frac{2j+1}{2m}\Big\}.$$

It is clear that this set is invariant; but $P(A) = \frac{1}{2}$. Consequently $T$ is not ergodic.
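For a rational angle $\lambda = k/m$ the failure of ergodicity can be checked exactly. The sketch below takes $A$ to be the union of the intervals $[2j/(2m), (2j+1)/(2m))$, $j = 0, \ldots, m-1$, and tests on a rational grid that $I_A(T\omega) = I_A(\omega)$; the values of $k$ and $m$ and the grid are illustrative assumptions.

```python
from fractions import Fraction
from math import floor

def in_A(w, m):
    """Indicator of A = union of [2j/(2m), (2j+1)/(2m)), j = 0, ..., m-1."""
    return floor(2 * m * w) % 2 == 0

def rotate(w, lam):
    """The rotation T w = (w + lam) mod 1."""
    return (w + lam) % 1

k, m = 3, 5                        # lambda = k/m, an illustrative rational angle
lam = Fraction(k, m)
grid = [Fraction(i, 40 * m) for i in range(40 * m)]

invariant = all(in_A(rotate(w, lam), m) == in_A(w, m) for w in grid)
share_in_A = Fraction(sum(in_A(w, m) for w in grid), len(grid))
print(invariant, share_in_A)
```

The shift by $k/m$ moves every interval of length $1/(2m)$ by an even number of such intervals, which is why membership in $A$ is preserved; the share of grid points in $A$ reflects $P(A) = 1/2$.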

2. Definition 4. A measure-preserving transformation is mixing (or has the mixing property) if, for all $A$ and $B \in \mathscr{F}$,

$$\lim_{n \to \infty} P(A \cap T^{-n}B) = P(A)P(B). \tag{1}$$

The following theorem establishes a connection between ergodicity and mixing.

Theorem 2. Every mixing transformation $T$ is ergodic.

PROOF. Let $A \in \mathscr{F}$, $B \in \mathscr{J}$. Then $B = T^{-n}B$, $n \ge 1$, and therefore

$$P(A \cap T^{-n}B) = P(A \cap B)$$

for all $n \ge 1$. Because of (1), $P(A \cap B) = P(A)P(B)$. Hence we find, when $A = B$, that $P(B) = P^2(B)$, and consequently $P(B) = 0$ or 1. This completes the proof.
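Condition (1) can be made concrete for the doubling map $Tx = 2x \bmod 1$ of Example 2 in §1. The sketch below computes $P(A \cap T^{-n}B)$ exactly when $A$ and $B$ are unions of the two halves of $[0, 1)$, using the fact that $T^n$ shifts the binary expansion of $x$ by $n$ digits. It only illustrates the definition on these special sets (an assumption of the sketch) and is not a proof that the map is mixing.

```python
from fractions import Fraction

def joint_measure(n, A_halves, B_halves):
    """P(A n T^{-n}B) for T x = 2x mod 1 when A and B are unions of the
    halves [0, 1/2) (label 0) and [1/2, 1) (label 1)."""
    m = n + 1                                  # number of binary digits involved
    total = Fraction(0)
    for cell in range(2 ** m):                 # cell = [cell/2^m, (cell+1)/2^m)
        first_digit = cell >> (m - 1)          # half containing the cell
        last_digit = cell & 1                  # half containing T^n of the cell
        if first_digit in A_halves and last_digit in B_halves:
            total += Fraction(1, 2 ** m)
    return total

A = B = {0}                                    # A = B = [0, 1/2)
print([joint_measure(n, A, B) for n in range(4)])
```

For $n = 0$ the value is $P(A \cap B) = 1/2$, while for every $n \ge 1$ it equals $P(A)P(B) = 1/4$, so for these sets the limit in (1) is attained already at $n = 1$.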

3. PROBLEMS

1. Show that a random variable is invariant if and only if it is $\mathscr{J}$-measurable.

2. Show that a set $A$ is almost invariant if and only if either $P(T^{-1}A \setminus A) = 0$ or $P(A \setminus T^{-1}A) = 0$.

3. Show that the transformation considered in the example of Subsection 1 of the present section is not mixing.

4. Show that a transformation is mixing if and only if, for all random variables $\xi$ and $\eta$ with $E\xi^2 < \infty$ and $E\eta^2 < \infty$,

$$E\xi(T^n\omega)\eta(\omega) \to E\xi(\omega)\,E\eta(\omega), \quad n \to \infty.$$

3. Ergodic Theorems
1. Theorem 1 (Birkhoff and Khinchin). Let $T$ be a measure-preserving transformation and $\xi = \xi(\omega)$ a random variable with $E|\xi| < \infty$. Then (P-a.s.)

$$\lim_n \frac{1}{n} \sum_{k=0}^{n-1} \xi(T^k\omega) = E(\xi \mid \mathscr{J}). \tag{1}$$

If also $T$ is ergodic then (P-a.s.)

$$\lim_n \frac{1}{n} \sum_{k=0}^{n-1} \xi(T^k\omega) = E\xi. \tag{2}$$
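The difference between the ergodic and the non-ergodic case can be seen on the rotation $Tx = (x + \lambda) \bmod 1$ of §2. The sketch below computes the time average of $\xi(x) = \cos^2(2\pi x)$, whose expectation under Lebesgue measure is $1/2$, along one orbit for an irrational and for a rational angle; the function, the starting point and the angles are illustrative assumptions, and the computation is in floating point.

```python
import math

def time_average(f, x0, lam, n):
    """(1/n) * sum_{k<n} f(T^k x0) for the rotation T x = (x + lam) mod 1."""
    return sum(f((x0 + k * lam) % 1.0) for k in range(n)) / n

f = lambda x: math.cos(2 * math.pi * x) ** 2     # E f = 1/2 under Lebesgue measure

irrational = time_average(f, 0.1, math.sqrt(2) - 1, 100_000)
rational = time_average(f, 0.1, 0.5, 100_000)
print(irrational, rational)
```

For the irrational angle the average is close to $E\xi = 1/2$. For $\lambda = 1/2$ the function $\xi$ is itself invariant, so the average stays at $\xi(x_0) = \cos^2(0.2\pi)$, a value of the non-constant limit $E(\xi \mid \mathscr{J})$.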

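The proof given below is based on the following proposition, whose simple proof was given by A. Garsia (1965).

Lemma (Maximal Ergodic Theorem). Let $T$ be a measure-preserving transformation, let $\xi$ be a random variable with $E|\xi| < \infty$, and let

$$S_k(\omega) = \xi(\omega) + \xi(T\omega) + \cdots + \xi(T^{k-1}\omega),$$

$$M_k(\omega) = \max\{0, S_1(\omega), \ldots, S_k(\omega)\}.$$

Then

$$E[\xi(\omega)I_{\{M_n > 0\}}(\omega)] \ge 0$$

for every $n \ge 1$.

PROOF. If $n \ge k$, we have $M_n(T\omega) \ge S_k(T\omega)$ and therefore $\xi(\omega) + M_n(T\omega) \ge \xi(\omega) + S_k(T\omega) = S_{k+1}(\omega)$. Since it is evident that $\xi(\omega) \ge S_1(\omega) - M_n(T\omega)$, we have

$$\xi(\omega) \ge \max\{S_1(\omega), \ldots, S_n(\omega)\} - M_n(T\omega).$$

Therefore

$$E[\xi(\omega)I_{\{M_n > 0\}}(\omega)] \ge E\big[(\max(S_1(\omega), \ldots, S_n(\omega)) - M_n(T\omega))I_{\{M_n > 0\}}(\omega)\big].$$

But $\max(S_1, \ldots, S_n) = M_n$ on the set $\{M_n > 0\}$. Consequently,

$$E[\xi(\omega)I_{\{M_n > 0\}}(\omega)] \ge E\{(M_n(\omega) - M_n(T\omega))I_{\{M_n(\omega) > 0\}}\} \ge E\{M_n(\omega) - M_n(T\omega)\} = 0,$$

since if $T$ is a measure-preserving transformation we have $EM_n(\omega) = EM_n(T\omega)$ (Problem 1, §1).

This completes the proof of the lemma.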
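PROOF OF THE THEOREM. Let us suppose that $E(\xi \mid \mathscr{J}) = 0$ (otherwise replace $\xi$ by $\xi - E(\xi \mid \mathscr{J})$).

Let $\bar\eta = \overline{\lim}(S_n/n)$ and $\underline\eta = \underline{\lim}(S_n/n)$. It will be enough to establish that (P-a.s.)

$$0 \le \underline\eta \le \bar\eta \le 0.$$

Consider the random variable $\bar\eta = \bar\eta(\omega)$. Since $\bar\eta(\omega) = \bar\eta(T\omega)$, the variable $\bar\eta$ is invariant and consequently, for every $\varepsilon > 0$, the set $A_\varepsilon = \{\bar\eta(\omega) > \varepsilon\}$ is also invariant. Let us introduce the new random variable

$$\xi^*(\omega) = (\xi(\omega) - \varepsilon)I_{A_\varepsilon}(\omega),$$

and put

$$S_k^*(\omega) = \xi^*(\omega) + \cdots + \xi^*(T^{k-1}\omega), \qquad M_k^*(\omega) = \max(0, S_1^*, \ldots, S_k^*).$$

Then, by the lemma,

$$E[\xi^* I_{\{M_n^* > 0\}}] \ge 0$$

for every $n \ge 1$. But as $n \to \infty$,

$$\{M_n^* > 0\} = \Big\{\max_{1 \le k \le n} S_k^* > 0\Big\} \uparrow \Big\{\sup_{k \ge 1} S_k^* > 0\Big\} = \Big\{\sup_{k \ge 1} \frac{S_k^*}{k} > 0\Big\} = \Big\{\sup_{k \ge 1} \frac{S_k}{k} > \varepsilon\Big\} \cap A_\varepsilon = A_\varepsilon,$$

where the last equation follows because $\sup_{k \ge 1}(S_k/k) \ge \bar\eta$, and $A_\varepsilon = \{\omega: \bar\eta > \varepsilon\}$.

Moreover, $E|\xi^*| \le E|\xi| + \varepsilon$. Hence, by the dominated convergence theorem,

$$0 \le E[\xi^* I_{\{M_n^* > 0\}}] \to E[\xi^* I_{A_\varepsilon}].$$

Thus

$$0 \le E[\xi^* I_{A_\varepsilon}] = E[(\xi - \varepsilon)I_{A_\varepsilon}] = E[\xi I_{A_\varepsilon}] - \varepsilon P(A_\varepsilon) = E[E(\xi \mid \mathscr{J})I_{A_\varepsilon}] - \varepsilon P(A_\varepsilon) = -\varepsilon P(A_\varepsilon),$$

so that $P(A_\varepsilon) = 0$ and therefore $P(\bar\eta \le 0) = 1$.

Similarly, if we consider $-\xi(\omega)$ instead of $\xi(\omega)$, we find that

$$\overline{\lim}\Big(-\frac{S_n}{n}\Big) = -\underline{\lim}\,\frac{S_n}{n} = -\underline\eta$$

and $P(-\underline\eta \le 0) = 1$, i.e. $P(\underline\eta \ge 0) = 1$. Therefore $0 \le \underline\eta \le \bar\eta \le 0$ (P-a.s.) and the first part of the theorem is established.

To prove the second part, we observe that since $E(\xi \mid \mathscr{J})$ is an invariant random variable, we have $E(\xi \mid \mathscr{J}) = E\xi$ (P-a.s.) in the ergodic case.

This completes the proof of the theorem.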
Corollary. A measure-preserving transformation $T$ is ergodic if and only if, for all $A$ and $B \in \mathscr{F}$,

$$\lim_n \frac{1}{n} \sum_{k=0}^{n-1} P(A \cap T^{-k}B) = P(A)P(B). \tag{3}$$

To prove the ergodicity of $T$ we use $A = B \in \mathscr{J}$ in (3). Then $A \cap T^{-k}B = B$ and therefore $P(B) = P^2(B)$, i.e. $P(B) = 0$ or 1. Conversely, let $T$ be ergodic. Then if we apply (2) to the random variable $\xi = I_B(\omega)$, where $B \in \mathscr{F}$, we find that (P-a.s.)

$$\lim_n \frac{1}{n} \sum_{k=0}^{n-1} I_{T^{-k}B}(\omega) = P(B).$$

If we now integrate both sides over $A \in \mathscr{F}$ and use the dominated convergence theorem, we obtain (3) as required.
2. We now show that, under the hypotheses of Theorem 1, there is not only
almost sure convergence in (1) and (2), but also convergence in mean. (This
result will be used below in the proof of Theorem 3.)

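Theorem 2. Let $T$ be a measure-preserving transformation and let $\xi = \xi(\omega)$ be a random variable with $E|\xi| < \infty$. Then

$$E\Big|\frac{1}{n} \sum_{k=0}^{n-1} \xi(T^k\omega) - E(\xi \mid \mathscr{J})\Big| \to 0, \quad n \to \infty. \tag{4}$$

If also $T$ is ergodic, then

$$E\Big|\frac{1}{n} \sum_{k=0}^{n-1} \xi(T^k\omega) - E\xi\Big| \to 0, \quad n \to \infty. \tag{5}$$

PROOF. For every $\varepsilon > 0$ there is a bounded random variable $\eta$ ($|\eta(\omega)| \le M$) such that $E|\xi - \eta| \le \varepsilon$. Then

$$E\Big|\frac{1}{n} \sum_{k=0}^{n-1} \xi(T^k\omega) - E(\xi \mid \mathscr{J})\Big| \le E\Big|\frac{1}{n} \sum_{k=0}^{n-1} \big(\xi(T^k\omega) - \eta(T^k\omega)\big)\Big| + E\Big|\frac{1}{n} \sum_{k=0}^{n-1} \eta(T^k\omega) - E(\eta \mid \mathscr{J})\Big| + E|E(\xi \mid \mathscr{J}) - E(\eta \mid \mathscr{J})|. \tag{6}$$

Since $|\eta| \le M$, then by the dominated convergence theorem and by using (1) we find that the second term on the right of (6) tends to zero as $n \to \infty$. The first and third terms are each at most $\varepsilon$. Hence for sufficiently large $n$ the left-hand side of (6) is less than $3\varepsilon$, so that (4) is proved. Finally, if $T$ is ergodic, then (5) follows from (4) and the remark that $E(\xi \mid \mathscr{J}) = E\xi$ (P-a.s.).

This completes the proof of the theorem.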
3. We now turn to the question of the validity of the ergodic theorem for stationary (strict sense) random sequences $\xi = (\xi_1, \xi_2, \ldots)$ defined on a probability space $(\Omega, \mathscr{F}, P)$. In general, $(\Omega, \mathscr{F}, P)$ need not carry any measure-preserving transformations, so that it is not possible to apply Theorem 1 directly. However, as we observed in §1, we can construct a coordinate probability space $(\tilde\Omega, \tilde{\mathscr{F}}, \tilde P)$, random variables $\tilde\xi = (\tilde\xi_1, \tilde\xi_2, \ldots)$, and a measure-preserving transformation $\tilde T$ such that $\tilde\xi_n(\tilde\omega) = \tilde\xi_1(\tilde T^{n-1}\tilde\omega)$ and the distributions of $\xi$ and $\tilde\xi$ are the same. Since such properties as almost sure convergence and convergence in the mean are determined only by probability distributions, from the convergence of $(1/n)\sum_{k=1}^{n} \tilde\xi_1(\tilde T^{k-1}\tilde\omega)$ ($\tilde P$-a.s. and in mean) to a random variable $\tilde\eta$ it follows that $(1/n)\sum_{k=1}^{n} \xi_k(\omega)$ also converges (P-a.s. and in mean) to a random variable $\eta$ such that $\eta \overset{d}{=} \tilde\eta$. It follows from Theorem 1 that if $E|\xi_1| < \infty$ then $\tilde\eta = \tilde E(\tilde\xi_1 \mid \tilde{\mathscr{J}})$, where $\tilde{\mathscr{J}}$ is the collection of invariant sets ($\tilde E$ is the average with respect to the measure $\tilde P$). We now describe the structure of $\eta$.

Definition 1. A set $A \in \mathscr{F}$ is invariant with respect to the sequence $\xi$ if there is a set $B \in \mathscr{B}(R^\infty)$ such that for $n \ge 1$

$$A = \{\omega: (\xi_n, \xi_{n+1}, \ldots) \in B\}.$$

The collection of all such invariant sets is a $\sigma$-algebra, denoted by $\mathscr{J}_\xi$.

Definition 2. A stationary sequence $\xi$ is ergodic if the measure of every invariant set is either 0 or 1.
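Let us now show that the random variable $\eta$ can be taken equal to $E(\xi_1 \mid \mathscr{J}_\xi)$. In fact, let $A \in \mathscr{J}_\xi$. Then since

$$E\Big|\frac{1}{n} \sum_{k=1}^{n} \xi_k - \eta\Big| \to 0,$$

we have

$$\frac{1}{n} \sum_{k=1}^{n} \int_A \xi_k\, dP \to \int_A \eta\, dP. \tag{7}$$

Let $B \in \mathscr{B}(R^\infty)$ be such that $A = \{\omega: (\xi_k, \xi_{k+1}, \ldots) \in B\}$ for all $k \ge 1$. Then since $\xi$ is stationary,

$$\int_A \xi_k\, dP = \int_{\{\omega: (\xi_k, \xi_{k+1}, \ldots) \in B\}} \xi_k\, dP = \int_{\{\omega: (\xi_1, \xi_2, \ldots) \in B\}} \xi_1\, dP = \int_A \xi_1\, dP.$$

Hence it follows from (7) that for all $A \in \mathscr{J}_\xi$

$$\int_A \xi_1\, dP = \int_A \eta\, dP,$$

which implies (see §7, Chapter II) that $\eta = E(\xi_1 \mid \mathscr{J}_\xi)$. Here $E(\xi_1 \mid \mathscr{J}_\xi) = E\xi_1$ if $\xi$ is ergodic.

Therefore we have proved the following theorem.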

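Theorem 3 (Ergodic Theorem). Let $\xi = (\xi_1, \xi_2, \ldots)$ be a stationary (strict sense) random sequence with $E|\xi_1| < \infty$. Then (P-a.s., and in the mean)

$$\lim_n \frac{1}{n} \sum_{k=1}^{n} \xi_k(\omega) = E(\xi_1 \mid \mathscr{J}_\xi).$$

If $\xi$ is also an ergodic sequence, then (P-a.s., and in the mean)

$$\lim_n \frac{1}{n} \sum_{k=1}^{n} \xi_k(\omega) = E\xi_1.$$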
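A simulation can illustrate why the limit in Theorem 3 is in general a random variable. The sketch below uses the stationary but non-ergodic sequence $\xi_k = Z + \varepsilon_k$, where the level $Z = \pm 1$ is drawn once per trajectory and the $\varepsilon_k$ are independent and uniform on $(-1, 1)$. The model, the seed and the sample size are illustrative assumptions, not taken from the text.

```python
import random

def sample_mean(n, level, rng):
    """Sample mean of xi_k = level + eps_k, k = 1..n, with eps_k i.i.d. uniform on (-1, 1)."""
    return sum(level + rng.uniform(-1.0, 1.0) for _ in range(n)) / n

rng = random.Random(0)                 # fixed seed, for reproducibility only
for _ in range(3):
    Z = rng.choice([-1.0, 1.0])        # random level, drawn once per trajectory
    print(Z, round(sample_mean(20_000, Z, rng), 2))
```

Each trajectory averages out to its own level $Z$ rather than to $E\xi_1 = 0$. For fixed $Z$ the sequence is one of independent identically distributed variables, which is the ergodic case, and the sample mean approaches the expectation.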

4. PROBLEMS

1. Let $\xi = (\xi_1, \xi_2, \ldots)$ be a Gaussian stationary sequence with $E\xi_n = 0$ and covariance function $R(n) = E\xi_{k+n}\xi_k$. Show that $R(n) \to 0$ is a sufficient condition for $\xi$ to be ergodic.

2. Show that every sequence $\xi = (\xi_1, \xi_2, \ldots)$ of independent identically distributed random variables is ergodic.

3. Show that a stationary sequence $\xi$ is ergodic if and only if

for every $B \in \mathscr{B}(R)$, $k = 1, 2, \ldots$.

CHAPTER VI

Stationary (Wide Sense) Random Sequences. $L^2$-Theory

1. Spectral Representation of the Covariance Function
1. According to the definition given in the preceding chapter, a random sequence $\xi = (\xi_1, \xi_2, \ldots)$ is stationary in the strict sense if, for every set $B \in \mathscr{B}(R^\infty)$ and every $n \ge 1$,

$$P\{(\xi_1, \xi_2, \ldots) \in B\} = P\{(\xi_{n+1}, \xi_{n+2}, \ldots) \in B\}. \tag{1}$$

It follows, in particular, that if $E\xi_1^2 < \infty$ then $E\xi_n$ is independent of $n$:

$$E\xi_n = E\xi_1, \tag{2}$$

and the covariance $\mathrm{cov}(\xi_{n+m}, \xi_n) = E(\xi_{n+m} - E\xi_{n+m})(\xi_n - E\xi_n)$ depends only on $m$:

$$\mathrm{cov}(\xi_{n+m}, \xi_n) = \mathrm{cov}(\xi_{1+m}, \xi_1). \tag{3}$$

In the present chapter we study sequences that are stationary in the wide sense (and have finite second moments), namely those for which (1) is replaced by the (weaker) conditions (2) and (3).

The random variables $\xi_n$ are understood to be defined for $n \in \mathbb{Z} = \{0, \pm 1, \ldots\}$ and to be complex-valued. The latter assumption not only does not complicate the theory, but makes it more elegant. It is also clear that results for real random variables can easily be obtained as special cases of the corresponding results for complex random variables.
Let $H^2 = H^2(\Omega, \mathscr{F}, P)$ be the space of (complex) random variables $\xi = \alpha + i\beta$, $\alpha, \beta \in R$, with $E|\xi|^2 < \infty$, where $|\xi|^2 = \alpha^2 + \beta^2$. If $\xi$ and $\eta \in H^2$, we put

$$(\xi, \eta) = E\xi\bar\eta, \tag{4}$$

where $\bar\eta = \alpha - i\beta$ is the complex conjugate of $\eta = \alpha + i\beta$ and

$$\|\xi\| = (\xi, \xi)^{1/2}. \tag{5}$$

As for real random variables, the space $H^2$ (more precisely, the space of equivalence classes of random variables; compare §§10 and 11 of Chapter II) is complete under the scalar product $(\xi, \eta)$ and norm $\|\xi\|$. In accordance with the terminology of functional analysis, $H^2$ is called the complex (or unitary) Hilbert space (of random variables considered on the probability space $(\Omega, \mathscr{F}, P)$).

If $\xi, \eta \in H^2$ their covariance is

$$\mathrm{cov}(\xi, \eta) = E(\xi - E\xi)\overline{(\eta - E\eta)}. \tag{6}$$

It follows from (4) and (6) that if $E\xi = E\eta = 0$ then

$$\mathrm{cov}(\xi, \eta) = (\xi, \eta). \tag{7}$$

Definition. A sequence of complex random variables $\xi = (\xi_n)_{n \in \mathbb{Z}}$ with $E|\xi_n|^2 < \infty$, $n \in \mathbb{Z}$, is stationary (in the wide sense) if, for all $n \in \mathbb{Z}$,

$$E\xi_n = E\xi_0, \qquad \mathrm{cov}(\xi_{k+n}, \xi_k) = \mathrm{cov}(\xi_n, \xi_0), \quad k \in \mathbb{Z}. \tag{8}$$

As a matter of convenience, we shall always suppose that $E\xi_0 = 0$. This involves no loss of generality, but does make it possible (by (7)) to identify the covariance with the scalar product and hence to apply the methods and results of the theory of Hilbert spaces.

Let us write

$$R(n) = \mathrm{cov}(\xi_n, \xi_0), \quad n \in \mathbb{Z}, \tag{9}$$

and (assuming $R(0) = E|\xi_0|^2 \ne 0$)

$$\rho(n) = \frac{R(n)}{R(0)}, \quad n \in \mathbb{Z}. \tag{10}$$

We call $R(n)$ the covariance function, and $\rho(n)$ the correlation function, of the sequence $\xi$ (assumed stationary in the wide sense).
It follows immediately from (9) that $R(n)$ is nonnegative-definite, i.e. for all complex numbers $a_1, \ldots, a_m$ and $t_1, \ldots, t_m \in \mathbb{Z}$, $m \ge 1$, we have

$$\sum_{i,j=1}^{m} a_i \bar a_j R(t_i - t_j) \ge 0. \tag{11}$$

It is then easy to deduce (either from (11) or directly from (9)) the following properties of the covariance function (see Problem 1):

$$R(0) \ge 0, \qquad R(-n) = \overline{R(n)}, \qquad |R(n)| \le R(0),$$

$$|R(n) - R(m)|^2 \le 2R(0)[R(0) - \mathrm{Re}\, R(n - m)]. \tag{12}$$
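Properties (11) and (12) can be checked numerically on a concrete covariance function. The sketch below assumes a covariance of the form $R(n) = \sum_k \sigma_k^2 e^{i\lambda_k n}$ with a small hypothetical set of intensities and frequencies (the form that appears in Example 2 of this section); the coefficients $a_i$ and times $t_i$ are likewise arbitrary illustrative choices.

```python
import cmath

# Hypothetical spectrum: pairs (sigma_k^2, lambda_k) with lambda_k in [-pi, pi).
SPECTRUM = [(2.0, 0.5), (1.0, -1.2), (0.5, 3.0)]

def R(n):
    """R(n) = sum_k sigma_k^2 * exp(i * lambda_k * n)."""
    return sum(s2 * cmath.exp(1j * lam * n) for s2, lam in SPECTRUM)

def quadratic_form(a, t):
    """sum_{i,j} a_i * conj(a_j) * R(t_i - t_j), the left-hand side of (11)."""
    return sum(a[i] * a[j].conjugate() * R(t[i] - t[j])
               for i in range(len(a)) for j in range(len(a)))

q = quadratic_form([1 + 2j, -0.5j, 3.0, -1 + 1j], [0, 1, 4, 7])
print(q.real >= 0, abs(q.imag) < 1e-9)
print(all(abs(R(n)) <= R(0).real + 1e-12 for n in range(-20, 21)))
```

The quadratic form is real and nonnegative because it equals $\sum_k \sigma_k^2 \big|\sum_i a_i e^{i\lambda_k t_i}\big|^2$; a numerical check of this kind can only support (11) for the chosen inputs, not establish it.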

2. Let us give some examples of stationary sequences $\xi = (\xi_n)_{n \in \mathbb{Z}}$. (From now on, the words "in the wide sense" and the statement $n \in \mathbb{Z}$ will both be omitted.)

EXAMPLE 1. Let $\xi_n = \xi_0 \cdot g(n)$, where $E\xi_0 = 0$, $E|\xi_0|^2 = 1$ and $g = g(n)$ is a function. The sequence $\xi = (\xi_n)$ will be stationary if and only if $g(k + n)\overline{g(k)}$ depends only on $n$. Hence it is easy to see that there is a $\lambda$ such that

$$g(n) = g(0)e^{i\lambda n}.$$

Consequently the sequence of random variables

$$\xi_n = \xi_0 \cdot g(0)e^{i\lambda n}$$

is stationary with

$$R(n) = |g(0)|^2 e^{i\lambda n}.$$

In particular, the random "constant" $\xi \equiv \xi_0$ is a stationary sequence.

EXAMPLE 2. An almost periodic sequence. Let


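$$\xi_n = \sum_{k=1}^{N} z_k e^{i\lambda_k n}, \tag{13}$$

where $z_1, \ldots, z_N$ are orthogonal ($Ez_i\bar z_j = 0$, $i \ne j$) random variables with zero means and $E|z_k|^2 = \sigma_k^2 > 0$; $-\pi \le \lambda_k < \pi$, $k = 1, \ldots, N$; $\lambda_i \ne \lambda_j$, $i \ne j$. The sequence $\xi = (\xi_n)$ is stationary with

$$R(n) = \sum_{k=1}^{N} \sigma_k^2 e^{i\lambda_k n}. \tag{14}$$

As a generalization of (13) we now suppose that

$$\xi_n = \sum_{k=-\infty}^{\infty} z_k e^{i\lambda_k n}, \tag{15}$$

where $z_k$, $k \in \mathbb{Z}$, have the same properties as in (13). If we suppose that $\sum_{k=-\infty}^{\infty} \sigma_k^2 < \infty$, the series on the right of (15) converges in mean-square and

$$R(n) = \sum_{k=-\infty}^{\infty} \sigma_k^2 e^{i\lambda_k n}. \tag{16}$$

Let us introduce the function

$$F(\lambda) = \sum_{\{k:\, \lambda_k \le \lambda\}} \sigma_k^2. \tag{17}$$

Then the covariance function (16) can be written as a Lebesgue-Stieltjes integral,

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda). \tag{18}$$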
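Formula (14) can be verified exactly on a small probability space. The sketch below takes four equally likely outcomes and three orthogonal, zero-mean amplitudes $z_k$ given by sign patterns; these values, the frequencies and the function names are illustrative assumptions. It compares $\mathrm{cov}(\xi_{k+n}, \xi_k)$, computed directly from the definition, with the right-hand side of (14).

```python
import cmath

OMEGA = range(4)                         # four equally likely outcomes
# Orthogonal, zero-mean "amplitudes" z_k(omega) (illustrative choice).
Z = [[2, 2, -2, -2], [1, -1, 1, -1], [3, -3, -3, 3]]
LAMBDA = [0.3, -1.0, 2.5]                # distinct frequencies in [-pi, pi)

def E(values):
    """Expectation with respect to the uniform measure on OMEGA."""
    return sum(values) / len(values)

def xi(n, w):
    """xi_n(omega) = sum_k z_k(omega) * exp(i * lambda_k * n), as in (13)."""
    return sum(z[w] * cmath.exp(1j * lam * n) for z, lam in zip(Z, LAMBDA))

def cov(n, k):
    """cov(xi_{k+n}, xi_k) = E xi_{k+n} * conj(xi_k); the means are zero."""
    return E([xi(k + n, w) * xi(k, w).conjugate() for w in OMEGA])

def R(n):
    """Right-hand side of (14) with sigma_k^2 = E|z_k|^2."""
    return sum(E([abs(v) ** 2 for v in z]) * cmath.exp(1j * lam * n)
               for z, lam in zip(Z, LAMBDA))

print(all(abs(cov(n, k) - R(n)) < 1e-9 for n in range(-5, 6) for k in range(-5, 6)))
```

The covariance does not depend on $k$, which is the stationarity of $\xi$, and the cross terms $Ez_i\bar z_j$, $i \ne j$, drop out by orthogonality.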

The stationary sequence (15) is represented as a sum of "harmonics" $e^{i\lambda_k n}$ with "frequencies" $\lambda_k$ and random "amplitudes" $z_k$ of "intensities" $\sigma_k^2 = E|z_k|^2$. Consequently the values of $F(\lambda)$ provide complete information on the "spectrum" of the sequence $\xi$, i.e. on the intensity with which each frequency appears in (15). By (18), the values of $F(\lambda)$ also completely determine the structure of the covariance function $R(n)$.

Up to a constant multiple, a (nondegenerate) $F(\lambda)$ is evidently a distribution function, which in the examples considered so far has been piecewise constant. It is quite remarkable that the covariance function of every stationary (wide sense) random sequence can be represented (see the theorem in Subsection 3) in the form (18), where $F(\lambda)$ is a distribution function (up to normalization), whose support is concentrated on $[-\pi, \pi)$, i.e. $F(\lambda) = 0$ for $\lambda < -\pi$ and $F(\lambda) = F(\pi)$ for $\lambda > \pi$.

The result on the integral representation of the covariance function, if compared with (15) and (16), suggests that every stationary sequence also admits an "integral" representation. This is in fact the case, as will be shown in §3 by using what we shall learn to call stochastic integrals with respect to orthogonal stochastic measures (§2).
EXAMPLE 3 (White noise). Let $\varepsilon = (\varepsilon_n)$ be an orthonormal sequence of random variables, $E\varepsilon_n = 0$, $E\varepsilon_i\bar\varepsilon_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. Such a sequence is evidently stationary, and

$$R(n) = \begin{cases} 1, & n = 0, \\ 0, & n \ne 0. \end{cases}$$

Observe that $R(n)$ can be represented in the form

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda), \tag{19}$$

where

$$F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\, d\nu; \qquad f(\lambda) = \frac{1}{2\pi}, \quad -\pi \le \lambda < \pi. \tag{20}$$

Comparison of the spectral functions (17) and (20) shows that whereas the spectrum in Example 2 is discrete, in the present example it is absolutely continuous with constant "spectral density" $f(\lambda) \equiv 1/(2\pi)$. In this sense we can say that the sequence $\varepsilon = (\varepsilon_n)$ "consists of harmonics of equal intensities." It is just this property that has led to calling such a sequence $\varepsilon = (\varepsilon_n)$ "white noise" by analogy with white light, which consists of different frequencies with the same intensities.

EXAMPLE 4 (Moving averages). Starting from the white noise $\varepsilon = (\varepsilon_n)$ introduced in Example 3, let us form the new sequence

$$\xi_n = \sum_{k=-\infty}^{\infty} a_k \varepsilon_{n-k}, \tag{21}$$

where $a_k$ are complex numbers such that $\sum_{k=-\infty}^{\infty} |a_k|^2 < \infty$. By Parseval's equation,

$$\mathrm{cov}(\xi_{n+m}, \xi_m) = \mathrm{cov}(\xi_n, \xi_0) = \sum_{k=-\infty}^{\infty} a_{n+k}\bar a_k,$$

so that $\xi = (\xi_k)$ is a stationary sequence, which we call the sequence obtained from $\varepsilon = (\varepsilon_k)$ by a (two-sided) moving average.

In the special case when the $a_k$ of negative index are zero, i.e.

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k},$$

the sequence $\xi = (\xi_n)$ is a one-sided moving average. If, in addition, $a_k = 0$ for $k > p$, i.e. if

$$\xi_n = a_0\varepsilon_n + a_1\varepsilon_{n-1} + \cdots + a_p\varepsilon_{n-p}, \tag{22}$$

then $\xi = (\xi_n)$ is a moving average of order $p$.

We can show (Problem 5) that (22) has a covariance function of the form $R(n) = \int_{-\pi}^{\pi} e^{i\lambda n} f(\lambda)\, d\lambda$, where the spectral density is

$$f(\lambda) = \frac{1}{2\pi}\, |P(e^{-i\lambda})|^2 \tag{23}$$

with

$$P(z) = a_0 + a_1 z + \cdots + a_p z^p.$$
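The relation between the covariance of a moving average of order $p$ and its spectral density can be checked numerically. The sketch below assumes the density has the form $f(\lambda) = |P(e^{-i\lambda})|^2/(2\pi)$ and compares $\sum_k a_{n+k}\bar a_k$ with a Riemann sum for $\int_{-\pi}^{\pi} e^{i\lambda n} f(\lambda)\, d\lambda$. The coefficients and the grid size are illustrative assumptions; white noise is the special case $p = 0$, $a_0 = 1$.

```python
import cmath
import math

def ma_covariance(a, n):
    """R(n) = sum_k a[n+k] * conj(a[k]) for a moving average with coefficients a[0..p]."""
    p = len(a) - 1
    return sum(a[n + k] * a[k].conjugate()
               for k in range(p + 1) if 0 <= n + k <= p)

def ma_spectral_density(a, lam):
    """f(lam) = |P(e^{-i lam})|^2 / (2 pi), with P(z) = a0 + a1 z + ... + ap z^p."""
    z = cmath.exp(-1j * lam)
    return abs(sum(ak * z ** k for k, ak in enumerate(a))) ** 2 / (2 * math.pi)

def covariance_from_density(f, n, grid=512):
    """Riemann sum for the integral of e^{i lam n} f(lam) over [-pi, pi)."""
    h = 2 * math.pi / grid
    return h * sum(cmath.exp(1j * (-math.pi + j * h) * n) * f(-math.pi + j * h)
                   for j in range(grid))

a = [1.0, 0.5 - 0.25j, 0.2j]                 # illustrative coefficients, p = 2
for n in range(-3, 4):
    numeric = covariance_from_density(lambda lam: ma_spectral_density(a, lam), n)
    print(n, abs(ma_covariance(a, n) - numeric) < 1e-9)
```

Because the integrand is a trigonometric polynomial of low degree, the equally spaced Riemann sum is exact up to rounding, so the agreement is not an artifact of a fine grid.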

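EXAMPLE 5 (Autoregression). Again let $\varepsilon = (\varepsilon_n)$ be white noise. We say that a random sequence $\xi = (\xi_n)$ is described by an autoregressive model of order $q$ if

$$\xi_n + b_1\xi_{n-1} + \cdots + b_q\xi_{n-q} = \varepsilon_n. \tag{24}$$

Under what conditions on $b_1, \ldots, b_q$ can we say that (24) has a stationary solution? To find an answer, let us begin with the case $q = 1$:

$$\xi_n = \alpha\xi_{n-1} + \varepsilon_n, \tag{25}$$

where $\alpha = -b_1$. If $|\alpha| < 1$, it is easy to verify that the stationary sequence $\tilde\xi = (\tilde\xi_n)$ with

$$\tilde\xi_n = \sum_{j=0}^{\infty} \alpha^j \varepsilon_{n-j} \tag{26}$$

is a solution of (25). (The series on the right of (26) converges in mean-square.) Let us now show that, in the class of stationary sequences $\xi = (\xi_n)$ (with finite second moments) this is the only solution. In fact, we find from (25), by successive iteration, that

$$\xi_n = \alpha\xi_{n-1} + \varepsilon_n = \alpha[\alpha\xi_{n-2} + \varepsilon_{n-1}] + \varepsilon_n = \cdots = \alpha^k\xi_{n-k} + \sum_{j=0}^{k-1} \alpha^j\varepsilon_{n-j}.$$

Hence it follows that

$$E\Big|\xi_n - \sum_{j=0}^{k-1} \alpha^j\varepsilon_{n-j}\Big|^2 = E|\alpha^k\xi_{n-k}|^2 = |\alpha|^{2k}E|\xi_{n-k}|^2 = |\alpha|^{2k}E|\xi_0|^2 \to 0, \quad k \to \infty.$$

Therefore when $|\alpha| < 1$ a stationary solution of (25) exists and is representable as the one-sided moving average (26).

There is a similar result for every $q > 1$: if all the zeros of the polynomial

$$Q(z) = 1 + b_1 z + \cdots + b_q z^q \tag{27}$$

lie outside the unit disk, then the autoregression equation (24) has a unique stationary solution, which is representable as a one-sided moving average (Problem 2). Here the covariance function $R(n)$ can be represented (Problem 5) in the form

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda), \qquad F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\, d\nu, \tag{28}$$

where

$$f(\lambda) = \frac{1}{2\pi} \cdot \frac{1}{|Q(e^{-i\lambda})|^2}. \tag{29}$$

In the special case $q = 1$, we find easily from (25) that $E\xi_0 = 0$,

$$E|\xi_0|^2 = \frac{1}{1 - |\alpha|^2},$$

and

$$R(n) = \frac{\alpha^n}{1 - |\alpha|^2}, \quad n \ge 0$$

(when $n < 0$ we have $R(n) = \overline{R(-n)}$). Here

$$f(\lambda) = \frac{1}{2\pi} \cdot \frac{1}{|1 - \alpha e^{-i\lambda}|^2}.$$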
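For $q = 1$ the closed form of the covariance and the spectral density can be compared numerically. The sketch below integrates $e^{i\lambda n} f(\lambda)$ with $f(\lambda) = 1/(2\pi|1 - \alpha e^{-i\lambda}|^2)$ by a Riemann sum and compares the result with $\alpha^n/(1 - |\alpha|^2)$ for $n \ge 0$; the value of $\alpha$ and the grid size are illustrative assumptions.

```python
import cmath
import math

def ar1_density(alpha, lam):
    """f(lam) = 1 / (2 pi |1 - alpha e^{-i lam}|^2), the spectral density for q = 1."""
    return 1.0 / (2 * math.pi * abs(1 - alpha * cmath.exp(-1j * lam)) ** 2)

def covariance_from_density(f, n, grid=2048):
    """Riemann sum for the integral of e^{i lam n} f(lam) over [-pi, pi)."""
    h = 2 * math.pi / grid
    return h * sum(cmath.exp(1j * (-math.pi + j * h) * n) * f(-math.pi + j * h)
                   for j in range(grid))

alpha = 0.5 + 0.3j                                # illustrative, |alpha| < 1
for n in range(4):
    closed_form = alpha ** n / (1 - abs(alpha) ** 2)
    numeric = covariance_from_density(lambda lam: ar1_density(alpha, lam), n)
    print(n, abs(closed_form - numeric) < 1e-9)
```

The agreement reflects the representation (26): with $a_j = \alpha^j$ the moving-average covariance $\sum_k a_{n+k}\bar a_k$ is a geometric series with sum $\alpha^n/(1 - |\alpha|^2)$.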
EXAMPLE 6. This example illustrates how autoregression arises in the construction of probabilistic models in hydrology. Consider a body of water; we try to construct a probabilistic model of the deviations of the level of the water from its average value because of variations in the inflow and evaporation from the surface.

If we take a year as the unit of time and let $H_n$ denote the water level in year $n$, we obtain the following balance equation:

$$H_{n+1} = H_n - KS(H_n) + \Sigma_{n+1}, \tag{30}$$

where $\Sigma_{n+1}$ is the inflow in year $(n + 1)$, $S(H)$ is the area of the surface of the water at level $H$, and $K$ is the coefficient of evaporation.

Let $\xi_n = H_n - \bar H$ be the deviation from the mean level (which is obtained from observations over many years) and suppose that $S(H) = S(\bar H) + c(H - \bar H)$. Then it follows from the balance equation that $\xi_n$ satisfies

$$\xi_{n+1} = \alpha\xi_n + \varepsilon_{n+1} \tag{31}$$

with $\alpha = 1 - cK$, $\varepsilon_n = \Sigma_n - KS(\bar H)$. It is natural to assume that the random variables $\varepsilon_n$ have zero means and are identically distributed. Then, as we showed in Example 5, equation (31) has (for $|\alpha| < 1$) a unique stationary solution, which we think of as the steady-state solution (with respect to time in years) of the oscillations of the level in the body of water.

As an example of practical conclusions that can be drawn from a (theoretical) model (31), we call attention to the possibility of predicting the level for the following year from the results of the observations of the present and preceding years. It turns out (see also Example 2 in §6) that (in the mean-square sense) the optimal linear estimator of $\xi_{n+1}$ in terms of the values of $\ldots, \xi_{n-1}, \xi_n$ is simply $\alpha\xi_n$.
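The prediction claim can be illustrated by simulation. The sketch below generates a trajectory of $\xi_{k+1} = \alpha\xi_k + \varepsilon_{k+1}$ and compares the empirical mean-square error of the predictor $\alpha\xi_n$ with that of the naive predictor $\xi_n$. The Gaussian noise, the value of $\alpha$, the seed and the burn-in length are assumptions made for the illustration and are not taken from the hydrological model.

```python
import random

def simulate_ar1(alpha, n, rng, burn_in=1000):
    """Trajectory of xi_{k+1} = alpha * xi_k + eps_{k+1} with i.i.d. N(0, 1) noise."""
    x, path = 0.0, []
    for k in range(n + burn_in):
        x = alpha * x + rng.gauss(0.0, 1.0)
        if k >= burn_in:                      # discard the initial transient
            path.append(x)
    return path

def mean_square_error(path, predict):
    """Empirical mean of (xi_{k+1} - predict(xi_k))^2 along the trajectory."""
    errors = [(path[k + 1] - predict(path[k])) ** 2 for k in range(len(path) - 1)]
    return sum(errors) / len(errors)

alpha = 0.7                                   # hypothetical value of 1 - cK
path = simulate_ar1(alpha, 20_000, random.Random(0))
print(mean_square_error(path, lambda x: alpha * x),   # predictor alpha * xi_n
      mean_square_error(path, lambda x: x))           # naive predictor xi_n
```

The error of $\alpha\xi_n$ is the noise variance $E|\varepsilon_{n+1}|^2$, while the naive predictor adds the term $(1 - \alpha)^2 E|\xi_n|^2$; the simulation reflects this ordering but does not by itself show optimality among all linear estimators.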
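EXAMPLE 7 (Autoregression and moving average (mixed model)). If we suppose that the right-hand side of (24) contains $a_0\varepsilon_n + a_1\varepsilon_{n-1} + \cdots + a_p\varepsilon_{n-p}$ instead of $\varepsilon_n$, we obtain a mixed model with autoregression and moving average of order $(p, q)$:

$$\xi_n + b_1\xi_{n-1} + \cdots + b_q\xi_{n-q} = a_0\varepsilon_n + a_1\varepsilon_{n-1} + \cdots + a_p\varepsilon_{n-p}. \tag{32}$$

Under the same hypotheses as in Example 5 on the zeros of $Q(z)$ it will be shown later (Corollary 2 to Theorem 3 of §3) that (32) has the stationary solution $\xi = (\xi_n)$ for which the covariance function is $R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda)$ with $F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\, d\nu$, where

$$f(\lambda) = \frac{1}{2\pi} \left|\frac{P(e^{-i\lambda})}{Q(e^{-i\lambda})}\right|^2.$$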

3. Theorem (Herglotz). Let $R(n)$ be the covariance function of a stationary (wide sense) random sequence with zero mean. Then there is, on $([-\pi, \pi), \mathscr{B}([-\pi, \pi)))$, a finite measure $F = F(B)$, $B \in \mathscr{B}([-\pi, \pi))$, such that for every $n \in \mathbb{Z}$

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n} F(d\lambda). \tag{33}$$

PROOF. For $N \ge 1$ and $\lambda \in [-\pi, \pi]$, put

$$f_N(\lambda) = \frac{1}{2\pi N} \sum_{k=1}^{N} \sum_{l=1}^{N} R(k - l)e^{-ik\lambda}e^{il\lambda}. \tag{34}$$

Since $R(n)$ is nonnegative definite, $f_N(\lambda)$ is nonnegative. Since there are $N - |m|$ pairs $(k, l)$ for which $k - l = m$, we have

$$f_N(\lambda) = \frac{1}{2\pi} \sum_{|m| < N} \Big(1 - \frac{|m|}{N}\Big) R(m)e^{-im\lambda}. \tag{35}$$

Let

$$F_N(B) = \int_B f_N(\lambda)\, d\lambda, \quad B \in \mathscr{B}([-\pi, \pi)).$$

Then

$$\int_{-\pi}^{\pi} e^{i\lambda n} F_N(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n} f_N(\lambda)\, d\lambda = \begin{cases} \Big(1 - \dfrac{|n|}{N}\Big) R(n), & |n| < N, \\ 0, & |n| \ge N. \end{cases} \tag{36}$$

The measures $F_N$, $N \ge 1$, are supported on the interval $[-\pi, \pi]$ and $F_N([-\pi, \pi]) = R(0) < \infty$ for all $N \ge 1$. Consequently the family of measures $\{F_N\}$, $N \ge 1$, is tight, and by Prohorov's theorem (Theorem 1 of §2 of Chapter III) there are a sequence $\{N_k\} \subseteq \{N\}$ and a measure $F$ such that $F_{N_k} \xrightarrow{w} F$. (The concepts of tightness, relative compactness, and weak convergence, together with Prohorov's theorem, can be extended in an obvious way from probability measures to any finite measures.)

It then follows from (36) that

$$\int_{-\pi}^{\pi} e^{i\lambda n} F(d\lambda) = \lim_{N_k \to \infty} \int_{-\pi}^{\pi} e^{i\lambda n} F_{N_k}(d\lambda) = R(n).$$

The measure $F$ so constructed is supported on $[-\pi, \pi]$. Without changing the integral $\int_{-\infty}^{\infty} e^{i\lambda n} F(d\lambda)$, we can redefine $F$ by transferring the "mass" $F(\{\pi\})$, which is concentrated at $\pi$, to $-\pi$. The resulting new measure (which we again denote by $F$) will be supported on $[-\pi, \pi)$.

This completes the proof of the theorem.
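The two facts about $f_N$ used in the proof, nonnegativity and the values of its Fourier coefficients, can be checked numerically. The sketch below assumes the covariance function $R(n) = \alpha^{|n|}$ with $0 < \alpha < 1$ (a multiple of the first-order autoregression covariance of Example 5) and evaluates $f_N$ through the single sum over $|m| < N$; the values of $\alpha$, $N$ and the grid are illustrative.

```python
import cmath
import math

def R(n, alpha=0.6):
    """Illustrative covariance function R(n) = alpha^{|n|} (assumed for the demo)."""
    return alpha ** abs(n)

def f_N(lam, N):
    """(1/2pi) * sum_{|m|<N} (1 - |m|/N) R(m) e^{-i m lam}."""
    total = sum((1 - abs(m) / N) * R(m) * cmath.exp(-1j * m * lam)
                for m in range(-N + 1, N))
    return total.real / (2 * math.pi)      # the sum is real since R(-m) = R(m)

def fourier_coefficient(n, N, grid=256):
    """Riemann sum for the integral of e^{i lam n} f_N(lam) over [-pi, pi)."""
    h = 2 * math.pi / grid
    return h * sum(cmath.exp(1j * (-math.pi + j * h) * n) * f_N(-math.pi + j * h, N)
                   for j in range(grid))

N = 8
print(min(f_N(-math.pi + j * 2 * math.pi / 256, N) for j in range(256)) >= 0)
print(abs(fourier_coefficient(3, N) - (1 - 3 / N) * R(3)) < 1e-9)
```

As $N$ grows the factor $1 - |n|/N$ tends to 1 for each fixed $n$, which is how the limit measure $F$ recovers $R(n)$ in the proof.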

Remark 1. The measure $F = F(B)$ involved in (33) is known as the spectral measure, and $F(\lambda) = F([-\pi, \lambda])$ as the spectral function, of the stationary sequence with covariance function $R(n)$.

In Example 2 above the spectral measure was discrete (concentrated at $\lambda_k$, $k = 0, \pm 1, \ldots$). In Examples 3-6 the spectral measures were absolutely continuous.

Remark 2. The spectral measure $F$ is uniquely defined by the covariance function. In fact, let $F_1$ and $F_2$ be two spectral measures and let

$$\int_{-\pi}^{\pi} e^{i\lambda n} F_1(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n} F_2(d\lambda), \quad n \in \mathbb{Z}.$$

Since every bounded continuous function $g(\lambda)$ can be uniformly approximated on $[-\pi, \pi)$ by trigonometric polynomials, we have

$$\int_{-\pi}^{\pi} g(\lambda) F_1(d\lambda) = \int_{-\pi}^{\pi} g(\lambda) F_2(d\lambda).$$

It follows (compare the proof in Theorem 2, §12, Chapter II) that $F_1(B) = F_2(B)$ for all $B \in \mathscr{B}([-\pi, \pi))$.

Remark 3. If $\xi = (\xi_n)$ is a stationary sequence of real random variables $\xi_n$, then

$$R(n) = \int_{-\pi}^{\pi} \cos \lambda n\, F(d\lambda).$$

4. PROBLEMS

1. Derive (12) from (11).

2. Show that the autoregression equation (24) has a stationary solution if all the zeros of the polynomial $Q(z)$ defined by (27) lie outside the unit disk.

3. Prove that the covariance function (28) admits the representation (29) with spectral density given by (30).

4. Show that the sequence $\xi = (\xi_n)$ of random variables, where

$$\xi_n = \sum_{k=1}^{\infty} (\alpha_k \sin \lambda_k n + \beta_k \cos \lambda_k n)$$

and $\alpha_k$ and $\beta_k$ are real random variables, can be represented in the form

$$\xi_n = \sum_{k=-\infty}^{\infty} z_k e^{i\lambda_k n}$$

with $z_k = \frac{1}{2}(\beta_k - i\alpha_k)$ for $k \ge 0$ and $z_k = \bar z_{-k}$, $\lambda_k = -\lambda_{-k}$ for $k < 0$.

5. Show that the spectral functions of the sequences (22) and (24) have densities given respectively by (23) and (29).

6. Show that if $\sum |R(n)| < \infty$, the spectral function $F(\lambda)$ has density $f(\lambda)$ given by

$$f(\lambda) = \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} e^{-i\lambda n} R(n).$$

2. Orthogonal Stochastic Measures and Stochastic Integrals
1. As we observed in §1, the integral representation of the covariance function and the example of a stationary sequence

$$\xi_n = \sum_{k=-\infty}^{\infty} z_k e^{i\lambda_k n} \tag{1}$$

with pairwise orthogonal random variables $z_k$, $k \in \mathbb{Z}$, suggest the possibility of representing an arbitrary stationary sequence as a corresponding integral generalization of (1).

If we put

$$Z(\lambda) = \sum_{\{k:\, \lambda_k \le \lambda\}} z_k, \tag{2}$$

we can rewrite (1) in the form

$$\xi_n = \sum_{k=-\infty}^{\infty} e^{i\lambda_k n} \Delta Z(\lambda_k), \tag{3}$$

where $\Delta Z(\lambda_k) \equiv Z(\lambda_k) - Z(\lambda_k -) = z_k$.

The right-hand side of (3) reminds us of an approximating sum for an integral $\int_{-\pi}^{\pi} e^{i\lambda n}\, dZ(\lambda)$ of Riemann-Stieltjes type. However, in the present case $Z(\lambda)$ is a random function (it also depends on $\omega$). Hence it is clear that for an integral representation of a general stationary sequence we need to use functions $Z(\lambda)$ that do not have bounded variation for each $\omega$. Consequently the simple interpretation of $\int_{-\pi}^{\pi} e^{i\lambda n}\, dZ(\lambda)$ as a Riemann-Stieltjes integral for each $\omega$ is inapplicable.

2. By analogy with the general ideas of the Lebesgue, Lebesgue-Stieltjes and Riemann-Stieltjes integrals (§6, Chapter II), we begin by defining stochastic measure.

Let $(\Omega, \mathscr{F}, P)$ be a probability space, and let $E$ be a set, with an algebra $\mathscr{E}_0$ of subsets and the $\sigma$-algebra $\mathscr{E}$ generated by $\mathscr{E}_0$.

Definition 1. A complex-valued function $Z(\Delta) = Z(\omega; \Delta)$, defined for $\omega \in \Omega$ and $\Delta \in \mathscr{E}_0$, is a finitely additive stochastic measure if
(1) $E|Z(\Delta)|^2 < \infty$ for every $\Delta \in \mathscr{E}_0$;
(2) for every pair $\Delta_1$ and $\Delta_2$ of disjoint sets in $\mathscr{E}_0$,

$$Z(\Delta_1 + \Delta_2) = Z(\Delta_1) + Z(\Delta_2) \quad (P\text{-a.s.}) \tag{4}$$

Definition 2. A finitely additive stochastic measure $Z(\Delta)$ is an elementary stochastic measure if, for all disjoint sets $\Delta_1, \Delta_2, \ldots$ of $\mathscr{E}_0$ such that $\Delta = \sum_{k=1}^{\infty} \Delta_k \in \mathscr{E}_0$,

$$E\Big|Z(\Delta) - \sum_{k=1}^{n} Z(\Delta_k)\Big|^2 \to 0, \quad n \to \infty. \tag{5}$$

Remark 1. In this definition of an elementary stochastic measure on subsets of $\mathscr{E}_0$, it is assumed that its values are in the Hilbert space $H^2 = H^2(\Omega, \mathscr{F}, P)$, and that countable additivity is understood in the mean-square sense (5). There are other definitions of stochastic measures, without the requirement of the existence of second moments, where countable additivity is defined (for example) in terms of convergence in probability or with probability one.

Remark 2. In analogy with nonstochastic measures, one can show that for finitely additive stochastic measures the condition (5) of countable additivity (in the mean-square sense) is equivalent to continuity (in the mean-square sense) at "zero":

$$E|Z(\Delta_n)|^2 \to 0, \quad \Delta_n \downarrow \varnothing, \quad \Delta_n \in \mathscr{E}_0. \tag{6}$$

A particularly important class of elementary stochastic measures consists of those that are orthogonal according to the following definition.

Definition 3. An elementary stochastic measure $Z(\Delta)$, $\Delta \in \mathscr{E}_0$, is orthogonal (or a measure with orthogonal values) if

$$EZ(\Delta_1)\overline{Z(\Delta_2)} = 0 \tag{7}$$

for every pair of disjoint sets $\Delta_1$ and $\Delta_2$ in $\mathscr{E}_0$; or, equivalently, if

$$EZ(\Delta_1)\overline{Z(\Delta_2)} = E|Z(\Delta_1 \cap \Delta_2)|^2 \tag{8}$$

for all $\Delta_1$ and $\Delta_2$ in $\mathscr{E}_0$.

We write

$$m(\Delta) = E|Z(\Delta)|^2, \quad \Delta \in \mathscr{E}_0. \tag{9}$$

For elementary orthogonal stochastic measures, the set function $m = m(\Delta)$, $\Delta \in \mathscr{E}_0$, is, as is easily verified, a finite measure, and consequently by Carathéodory's theorem (§3, Chapter II) it can be extended to $(E, \mathscr{E})$. The resulting measure will again be denoted by $m = m(\Delta)$ and called the structure function (of the elementary orthogonal stochastic measure $Z = Z(\Delta)$, $\Delta \in \mathscr{E}_0$).

The following question now arises naturally: since the set function $m = m(\Delta)$ defined on $(E, \mathscr{E}_0)$ admits an extension to $(E, \mathscr{E})$, where $\mathscr{E} = \sigma(\mathscr{E}_0)$, cannot an elementary orthogonal stochastic measure $Z = Z(\Delta)$, $\Delta \in \mathscr{E}_0$, be extended to sets $\Delta$ in $\mathscr{E}$ in such a way that $E|Z(\Delta)|^2 = m(\Delta)$, $\Delta \in \mathscr{E}$?

The answer is affirmative, as follows from the construction given below. This construction, at the same time, leads to the stochastic integral which we need for the integral representation of stationary sequences.
3. Let $Z = Z(\Delta)$ be an elementary orthogonal stochastic measure, $\Delta \in \mathscr{E}_0$, with structure function $m = m(\Delta)$, $\Delta \in \mathscr{E}$. For every function

$$f(\lambda) = \sum f_k I_{\Delta_k}(\lambda), \quad \Delta_k \in \mathscr{E}_0, \tag{10}$$

with only a finite number of different (complex) values, we define the random variable

$$\mathscr{I}(f) = \sum f_k Z(\Delta_k).$$

Let L 2 = L 2 (E, 8, m) be the Hilbert space of complex-valued functions


with the scalar product

(f, g)

f(A.)g(A.)m(dA.)

and the norm 11!11 = (f,J) 112 , and let H 2 = H 2 (0., $', P) be the Hilbert
space of complex-valued random variables with the scalar product

(~, t~)

E~ii

and the norm 11~11 = (~, ~) 1 ' 2 .


Then it is clear that, for every pair of functions $f$ and $g$ of the form (10),

$$(\mathscr{J}(f), \mathscr{J}(g)) = (f, g)$$

and

$$\|\mathscr{J}(f)\|^2 = \|f\|^2 = \int_E |f(\lambda)|^2\, m(d\lambda).$$

Now let $f \in L^2$ and let $\{f_n\}$ be functions of type (10) such that $\|f - f_n\| \to 0$, $n \to \infty$ (the existence of such functions follows from Problem 2). Consequently

$$\|\mathscr{J}(f_n) - \mathscr{J}(f_m)\| = \|f_n - f_m\| \to 0, \quad n, m \to \infty.$$

Therefore the sequence $\{\mathscr{J}(f_n)\}$ is fundamental in the mean-square sense and by Theorem 7, §10, Chapter II, there is a random variable (denoted by $\mathscr{J}(f)$) such that $\mathscr{J}(f) \in H^2$ and $\|\mathscr{J}(f_n) - \mathscr{J}(f)\| \to 0$, $n \to \infty$.
The random variable $\mathscr{J}(f)$ constructed in this way is uniquely defined (up to stochastic equivalence) and is independent of the choice of the approximating sequence $\{f_n\}$. We call it the stochastic integral of $f \in L^2$ with respect to the elementary orthogonal stochastic measure $Z$ and denote it by

$$\mathscr{J}(f) = \int_E f(\lambda)\, Z(d\lambda).$$

We note the following basic properties of the stochastic integral $\mathscr{J}(f)$; these are direct consequences of its construction (Problem 1). Let $g$, $f$, and $f_n \in L^2$. Then

$$(\mathscr{J}(f), \mathscr{J}(g)) = (f, g); \tag{11}$$

$$\|\mathscr{J}(f)\| = \|f\|; \tag{12}$$

$$\mathscr{J}(af + bg) = a\mathscr{J}(f) + b\mathscr{J}(g) \quad \text{(P-a.s.)}, \tag{13}$$

where $a$ and $b$ are constants;

$$\|\mathscr{J}(f_n) - \mathscr{J}(f)\| \to 0 \tag{14}$$

if $\|f_n - f\| \to 0$, $n \to \infty$.
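As a numerical illustration (not part of the original text), properties (11) to (13) can be checked exactly on a small finite model. The sketch below assumes a space $E$ consisting of $K$ atoms with masses `m[k]`, and realizes the orthogonal variables $Z(\{k\})$ on a finite probability space with uniform $\mathsf{P}$ by means of scaled columns of a discrete Fourier matrix; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def orthogonal_atoms(m):
    """Return an array Z with Z[w, k] = Z_k(w) on Omega = {0, ..., K} (uniform P).

    The K columns have zero mean and satisfy E Z_j conj(Z_k) = m[k] if j == k, else 0.
    """
    m = np.asarray(m, dtype=float)
    K = len(m)
    w = np.arange(K + 1)          # outcomes
    k = np.arange(1, K + 1)       # skip the constant column of the Fourier matrix
    return np.sqrt(m)[None, :] * np.exp(2j * np.pi * np.outer(w, k) / (K + 1))

def stochastic_integral(f, Z):
    """J(f)(w) = sum_k f_k Z_k(w) for a simple function taking the value f_k on atom k."""
    return Z @ np.asarray(f, dtype=complex)

def inner_H2(xi, eta):
    """Scalar product E xi conj(eta) under the uniform measure P."""
    return np.mean(xi * np.conj(eta))

def inner_L2(f, g, m):
    """Scalar product sum_k f_k conj(g_k) m_k in L^2(E, m)."""
    return np.sum(np.asarray(f) * np.conj(g) * np.asarray(m))
```

In this finite setting every $f \in L^2$ is already of the form (10), so no limiting procedure is needed; the isometry (11) reduces to the orthogonality of the columns of `Z`.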


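4. Let us use the preceding definition of the stochastic integral to extend the elementary stochastic measure $Z(\Delta)$, $\Delta \in \mathscr{E}_0$, to sets in $\mathscr{E} = \sigma(\mathscr{E}_0)$.
Since $m$ is assumed to be finite, we have $I_\Delta = I_\Delta(\lambda) \in L^2$ for all $\Delta \in \mathscr{E}$. Write $\tilde{Z}(\Delta) = \mathscr{J}(I_\Delta)$. It is clear that $\tilde{Z}(\Delta) = Z(\Delta)$ for $\Delta \in \mathscr{E}_0$. It follows from (13) that if $\Delta_1 \cap \Delta_2 = \varnothing$ for $\Delta_1$ and $\Delta_2 \in \mathscr{E}$, then

$$\tilde{Z}(\Delta_1 + \Delta_2) = \tilde{Z}(\Delta_1) + \tilde{Z}(\Delta_2) \quad \text{(P-a.s.)}$$

and it follows from (12) that

$$\mathsf{E}|\tilde{Z}(\Delta)|^2 = m(\Delta), \quad \Delta \in \mathscr{E}.$$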

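Let us show that the random set function $\tilde{Z}(\Delta)$, $\Delta \in \mathscr{E}$, is countably additive in the mean-square sense. In fact, let $\Delta_k \in \mathscr{E}$ and $\Delta = \sum_{k=1}^{\infty} \Delta_k$. Then

$$\tilde{Z}(\Delta) - \sum_{k=1}^{n} \tilde{Z}(\Delta_k) = \mathscr{J}(g_n),$$

where

$$g_n(\lambda) = I_\Delta(\lambda) - \sum_{k=1}^{n} I_{\Delta_k}(\lambda) = I_{\Sigma_n}(\lambda), \quad \Sigma_n = \sum_{k=n+1}^{\infty} \Delta_k.$$

But

$$\mathsf{E}|\mathscr{J}(g_n)|^2 = \|g_n\|^2 = m(\Sigma_n) \downarrow 0, \quad n \to \infty,$$

i.e.

$$\mathsf{E}\Bigl|\tilde{Z}(\Delta) - \sum_{k=1}^{n} \tilde{Z}(\Delta_k)\Bigr|^2 \to 0, \quad n \to \infty.$$

It also follows from (11) that

$$\mathsf{E}\tilde{Z}(\Delta_1)\overline{\tilde{Z}(\Delta_2)} = 0$$

when $\Delta_1 \cap \Delta_2 = \varnothing$, $\Delta_1, \Delta_2 \in \mathscr{E}$.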


Thus our function $\tilde{Z}(\Delta)$, defined on $\Delta \in \mathscr{E}$, is countably additive in the mean-square sense and coincides with $Z(\Delta)$ on the sets $\Delta \in \mathscr{E}_0$. We shall call $\tilde{Z}(\Delta)$, $\Delta \in \mathscr{E}$, an orthogonal stochastic measure (since it is an extension of the elementary orthogonal stochastic measure $Z(\Delta)$) with respect to the structure function $m(\Delta)$, $\Delta \in \mathscr{E}$; and we call the integral $\mathscr{J}(f) = \int_E f(\lambda)\,\tilde{Z}(d\lambda)$, defined above, a stochastic integral with respect to this measure.
5. We now consider the case $(E, \mathscr{E}) = (R, \mathscr{B}(R))$, which is the most important for our purposes. As we know (§3, Chapter II), there is a one-to-one correspondence between finite measures $m = m(\Delta)$ on $(R, \mathscr{B}(R))$ and certain (generalized) distribution functions $G = G(x)$, with $m(a, b] = G(b) - G(a)$.
It turns out that there is something similar for orthogonal stochastic measures. We introduce the following definition.

Definition 4. A set of (complex-valued) random variables $\{Z_\lambda\}$, $\lambda \in R$, defined on $(\Omega, \mathscr{F}, \mathsf{P})$, is a random process with orthogonal increments if
(1) $\mathsf{E}|Z_\lambda|^2 < \infty$, $\lambda \in R$;
(2) for every $\lambda \in R$

$$\mathsf{E}|Z_\lambda - Z_{\lambda_n}|^2 \to 0, \quad \lambda_n \downarrow \lambda;$$

(3) whenever $\lambda_1 < \lambda_2 < \lambda_3 < \lambda_4$,

$$\mathsf{E}(Z_{\lambda_4} - Z_{\lambda_3})\overline{(Z_{\lambda_2} - Z_{\lambda_1})} = 0.$$

Condition (3) is the condition of orthogonal increments. Condition (1) means that $Z_\lambda \in H^2$. Finally, condition (2) is included for technical reasons; it is a requirement of continuity on the right (in the mean-square sense) at each $\lambda \in R$.
Let $Z = Z(\Delta)$ be an orthogonal stochastic measure with respect to the structure function $m = m(\Delta)$, of finite measure, with the (generalized) distribution function $G(\lambda)$. Let us put

$$Z_\lambda = Z(-\infty, \lambda].$$

Then

$$\mathsf{E}|Z_\lambda|^2 = m(-\infty, \lambda] = G(\lambda) < \infty,$$

and (evidently) condition (3) is satisfied also. Then this process $\{Z_\lambda\}$ is called a process with orthogonal increments.
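On the other hand, if $\{Z_\lambda\}$ is such a process with $\mathsf{E}|Z_\lambda|^2 = G(\lambda)$, $G(-\infty) = 0$, $G(+\infty) < \infty$, we put

$$Z(\Delta) = Z_b - Z_a$$

when $\Delta = (a, b]$. Let $\mathscr{E}_0$ be the algebra of sets

$$\Delta = \sum_{k=1}^{n} (a_k, b_k]$$

and

$$Z(\Delta) = \sum_{k=1}^{n} Z(a_k, b_k].$$

It is clear that

$$\mathsf{E}|Z(\Delta)|^2 = m(\Delta),$$

where $m(\Delta) = \sum_{k=1}^{n} [G(b_k) - G(a_k)]$, and

$$\mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = 0$$

for disjoint intervals $\Delta_1 = (a_1, b_1]$ and $\Delta_2 = (a_2, b_2]$.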


Therefore $Z = Z(\Delta)$, $\Delta \in \mathscr{E}_0$, is an elementary stochastic measure with orthogonal values. The set function $m = m(\Delta)$, $\Delta \in \mathscr{E}_0$, has a unique extension to a measure on $\mathscr{E} = \mathscr{B}(R)$, and it follows from the preceding constructions that $Z = Z(\Delta)$, $\Delta \in \mathscr{E}_0$, can also be extended to the sets $\Delta \in \mathscr{E}$, where $\mathscr{E} = \mathscr{B}(R)$, and $\mathsf{E}|Z(\Delta)|^2 = m(\Delta)$, $\Delta \in \mathscr{B}(R)$.

Therefore there is a one-to-one correspondence between processes $\{Z_\lambda\}$, $\lambda \in R$, with orthogonal increments and $\mathsf{E}|Z_\lambda|^2 = G(\lambda)$, $G(-\infty) = 0$, $G(+\infty) < \infty$, and orthogonal stochastic measures $Z = Z(\Delta)$, $\Delta \in \mathscr{B}(R)$, with structure functions $m = m(\Delta)$. The correspondence is given by

$$Z_\lambda = Z(-\infty, \lambda], \quad G(\lambda) = m(-\infty, \lambda]$$

and

$$m(a, b] = G(b) - G(a).$$

By analogy with the usual notation of the theory of Riemann-Stieltjes integration, the stochastic integral $\int_R f(\lambda)\, dZ_\lambda$, where $\{Z_\lambda\}$ is a process with orthogonal increments, means the stochastic integral $\int_R f(\lambda)\, Z(d\lambda)$ with respect to the corresponding orthogonal stochastic measure.
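The correspondence between orthogonal stochastic measures and processes with orthogonal increments can be made concrete in a purely atomic case. The following sketch is an illustration under stated assumptions, not a construction from the text: the structure function is assumed to be concentrated on finitely many points `points` with masses `masses`, and the orthogonal variables attached to the atoms are realized on a finite probability space with uniform $\mathsf{P}$.

```python
import numpy as np

points = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])   # assumed atoms of the structure function m
masses = np.array([0.3, 1.0, 0.2, 0.5, 0.7])      # m({point}) for each atom

K = len(points)
w = np.arange(K + 1)                               # outcomes of a uniform probability space
k = np.arange(1, K + 1)
# Z_atoms[w, j]: value at outcome w of the zero-mean orthogonal variable attached to atom j
Z_atoms = np.sqrt(masses)[None, :] * np.exp(2j * np.pi * np.outer(w, k) / (K + 1))

def Z_process(lam):
    """Z_lambda = Z(-inf, lambda]: sum of the variables of all atoms located at or below lambda."""
    return Z_atoms[:, points <= lam].sum(axis=1)

def G(lam):
    """Generalized distribution function G(lambda) = m(-inf, lambda]."""
    return masses[points <= lam].sum()

def inner(xi, eta):
    """Scalar product E xi conj(eta) under the uniform measure."""
    return np.mean(xi * np.conj(eta))
```

With these definitions, $\mathsf{E}|Z_b - Z_a|^2$ equals $G(b) - G(a)$ and increments over disjoint intervals are orthogonal, which is exactly the content of the correspondence in this special case.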

6. PROBLEMS

1. Prove the equivalence of (5) and (6).

2. Let $f \in L^2$. Using the results of Chapter II (Theorem 1 of §4, the Corollary to Theorem 3 of §6, and Problem 9 of §3), prove that there is a sequence of functions $f_n$ of the form (10) such that $\|f - f_n\| \to 0$, $n \to \infty$.

3. Establish the following properties of an orthogonal stochastic measure $Z(\Delta)$ with structure function $m(\Delta)$:

$$\mathsf{E}|Z(\Delta_1) - Z(\Delta_2)|^2 = m(\Delta_1 \,\triangle\, \Delta_2),$$

$$Z(\Delta_1 \setminus \Delta_2) = Z(\Delta_1) - Z(\Delta_1 \cap \Delta_2) \quad \text{(P-a.s.)},$$

$$Z(\Delta_1 \,\triangle\, \Delta_2) = Z(\Delta_1) + Z(\Delta_2) - 2Z(\Delta_1 \cap \Delta_2) \quad \text{(P-a.s.)}.$$

3. Spectral Representation of Stationary (Wide Sense) Sequences

1. If $\xi = (\xi_n)$ is a stationary sequence with $\mathsf{E}\xi_n = 0$, $n \in Z$, then by the theorem of §1, there is a finite measure $F = F(\Delta)$ on $([-\pi, \pi), \mathscr{B}([-\pi, \pi)))$ such that its covariance function $R(n) = \operatorname{cov}(\xi_{k+n}, \xi_k)$ admits the spectral representation

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, F(d\lambda). \tag{1}$$

The following result provides the corresponding spectral representation of the sequence $\xi = (\xi_n)$, $n \in Z$, itself.

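Theorem 1. There is an orthogonal stochastic measure $Z = Z(\Delta)$, $\Delta \in \mathscr{B}([-\pi, \pi))$, such that for every $n \in Z$ (P-a.s.)

$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z(d\lambda). \tag{2}$$

Moreover, $\mathsf{E}|Z(\Delta)|^2 = F(\Delta)$.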

The simplest proof is based on properties of Hilbert spaces.
Let $L^2(F) = L^2(E, \mathscr{E}, F)$ be a Hilbert space of complex functions, $E = [-\pi, \pi)$, $\mathscr{E} = \mathscr{B}([-\pi, \pi))$, with the scalar product

$$(f, g) = \int_{-\pi}^{\pi} f(\lambda)\overline{g(\lambda)}\, F(d\lambda), \tag{3}$$

and let $L_0^2(F)$ be the linear manifold ($L_0^2(F) \subseteq L^2(F)$) spanned by $e_n = e_n(\lambda)$, $n \in Z$, where $e_n(\lambda) = e^{i\lambda n}$.
Observe that since $E = [-\pi, \pi)$ and $F$ is finite, the closure of $L_0^2(F)$ coincides (Problem 1) with $L^2(F)$:

$$\overline{L_0^2(F)} = L^2(F).$$

Also let $L_0^2(\xi)$ be the linear manifold spanned by the random variables $\xi_n$, $n \in Z$, and let $L^2(\xi)$ be its closure in the mean-square sense (with respect to $\mathsf{P}$).
We establish a one-to-one correspondence between the elements of $L_0^2(F)$ and $L_0^2(\xi)$, denoted by "$\leftrightarrow$", by setting

$$e_n \leftrightarrow \xi_n, \quad n \in Z, \tag{4}$$

and defining it for elements in general (more precisely, for equivalence classes of elements) by linearity:

$$\sum \alpha_n e_n \leftrightarrow \sum \alpha_n \xi_n \tag{5}$$

(here we suppose that only finitely many of the complex numbers $\alpha_n$ are different from zero).
Observe that (5) is a consistent definition, in the sense that $\sum \alpha_n e_n = 0$ almost everywhere with respect to $F$ if and only if $\sum \alpha_n \xi_n = 0$ (P-a.s.).
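The correspondence "$\leftrightarrow$" is an isometry, i.e. it preserves scalar products. In fact, by (3),

$$(e_n, e_m) = \int_{-\pi}^{\pi} e_n(\lambda)\overline{e_m(\lambda)}\, F(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda(n-m)}\, F(d\lambda) = R(n - m) = \mathsf{E}\xi_n\bar{\xi}_m = (\xi_n, \xi_m)$$

and similarly

$$\Bigl(\sum \alpha_n e_n, \sum \beta_n e_n\Bigr) = \Bigl(\sum \alpha_n \xi_n, \sum \beta_n \xi_n\Bigr). \tag{6}$$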

Now let $\eta \in L^2(\xi)$. Since $L^2(\xi) = \overline{L_0^2(\xi)}$, there is a sequence $\{\eta_n\}$ such that $\eta_n \in L_0^2(\xi)$ and $\|\eta_n - \eta\| \to 0$, $n \to \infty$. Consequently $\{\eta_n\}$ is a fundamental sequence and therefore so is the sequence $\{f_n\}$, where $f_n \in L_0^2(F)$ and $f_n \leftrightarrow \eta_n$. The space $L^2(F)$ is complete and consequently there is an $f \in L^2(F)$ such that $\|f_n - f\| \to 0$.
There is an evident converse: if $f \in L^2(F)$ and $\|f - f_n\| \to 0$, $f_n \in L_0^2(F)$, there is an element $\eta$ of $L^2(\xi)$ such that $\|\eta - \eta_n\| \to 0$, $\eta_n \in L_0^2(\xi)$ and $\eta_n \leftrightarrow f_n$.
Up to now the isometry "$\leftrightarrow$" has been defined only as between elements of $L_0^2(\xi)$ and $L_0^2(F)$. We extend it by continuity, taking $f \leftrightarrow \eta$ when $f$ and $\eta$ are the elements considered above. It is easily verified that the correspondence obtained in this way is one-to-one (between classes of equivalent random variables and of functions), is linear, and preserves scalar products.
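Consider the function $f(\lambda) = I_\Delta(\lambda)$, where $\Delta \in \mathscr{B}([-\pi, \pi))$, and let $Z(\Delta)$ be the element of $L^2(\xi)$ such that $I_\Delta(\lambda) \leftrightarrow Z(\Delta)$. It is clear that $\|I_\Delta(\lambda)\|^2 = F(\Delta)$ and therefore $\mathsf{E}|Z(\Delta)|^2 = F(\Delta)$. Moreover, if $\Delta_1 \cap \Delta_2 = \varnothing$, we have $\mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = 0$ and $\mathsf{E}|Z(\Delta) - \sum_{k=1}^{n} Z(\Delta_k)|^2 \to 0$, $n \to \infty$, where $\Delta = \sum_{k=1}^{\infty} \Delta_k$.
Hence the family of elements $Z(\Delta)$, $\Delta \in \mathscr{B}([-\pi, \pi))$, forms an orthogonal stochastic measure, with respect to which (according to §2) we can define the stochastic integral

$$\mathscr{J}(f) = \int_{-\pi}^{\pi} f(\lambda)\, Z(d\lambda), \quad f \in L^2(F).$$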

Let $f \in L^2(F)$ and $\eta \leftrightarrow f$. Denote the element $\eta$ by $\Phi(f)$ (more precisely, select single representatives from the corresponding equivalence classes of random variables or functions). Let us show that (P-a.s.)

$$\mathscr{J}(f) = \Phi(f). \tag{7}$$

In fact, if

$$f(\lambda) = \sum \alpha_k I_{\Delta_k}(\lambda) \tag{8}$$

is a finite linear combination of functions $I_{\Delta_k}(\lambda)$, $\Delta_k = (a_k, b_k]$, then, by the very definition of the stochastic integral, $\mathscr{J}(f) = \sum \alpha_k Z(\Delta_k)$, which is evidently equal to $\Phi(f)$. Therefore (7) is valid for functions of the form (8). But if $f \in L^2(F)$ and $\|f_n - f\| \to 0$, where $f_n$ are functions of the form (8), then $\|\Phi(f_n) - \Phi(f)\| \to 0$ and $\|\mathscr{J}(f_n) - \mathscr{J}(f)\| \to 0$ (by (2.14)). Therefore $\Phi(f) = \mathscr{J}(f)$ (P-a.s.).
Consider the function $f(\lambda) = e^{i\lambda n}$. Then $\Phi(e^{i\lambda n}) = \xi_n$ by (4), but on the other hand $\mathscr{J}(e^{i\lambda n}) = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z(d\lambda)$. Therefore

$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z(d\lambda), \quad n \in Z \quad \text{(P-a.s.)}$$

by (7). This completes the proof of the theorem.
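For a spectral measure concentrated on finitely many frequencies, the representation (2) becomes a finite sum of harmonics with orthogonal random amplitudes, and the relation between (1) and (2) can be verified directly. The sketch below is only an illustration under that assumption; the frequencies `freqs`, the masses `F_mass`, and the finite uniform probability space are all hypothetical choices made for the example.

```python
import numpy as np

freqs = np.array([-2.0, -0.7, 0.0, 1.3])    # assumed atoms lambda_k of F in [-pi, pi)
F_mass = np.array([0.4, 1.0, 0.3, 0.8])     # F({lambda_k})

K = len(freqs)
w = np.arange(K + 1)                         # outcomes, uniform probability
k = np.arange(1, K + 1)
# Z[w, j] = Z({lambda_j})(w): zero mean, orthogonal, E|Z({lambda_j})|^2 = F({lambda_j})
Z = np.sqrt(F_mass)[None, :] * np.exp(2j * np.pi * np.outer(w, k) / (K + 1))

def xi(n):
    """xi_n = sum_k exp(i * lambda_k * n) Z({lambda_k}), i.e. formula (2) for an atomic Z."""
    return Z @ np.exp(1j * freqs * n)

def R_spectral(n):
    """R(n) = sum_k exp(i * lambda_k * n) F({lambda_k}), i.e. formula (1) for an atomic F."""
    return np.sum(np.exp(1j * freqs * n) * F_mass)

def covariance(n, shift):
    """E xi_{n+shift} conj(xi_shift), which should not depend on shift."""
    return np.mean(xi(n + shift) * np.conj(xi(shift)))
```

The sequence built this way has zero mean and its covariance depends only on the lag, in agreement with $\mathsf{E}|Z(\Delta)|^2 = F(\Delta)$.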

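Corollary 1. Let $\xi = (\xi_n)$ be a stationary sequence of real random variables $\xi_n$, $n \in Z$. Then the stochastic measure $Z = Z(\Delta)$ involved in the spectral representation (2) has the property that

$$Z(\Delta) = \overline{Z(-\Delta)} \tag{9}$$

for every $\Delta \in \mathscr{B}([-\pi, \pi))$, where $-\Delta = \{\lambda : -\lambda \in \Delta\}$.

In fact, let $f(\lambda) = \sum \alpha_k e^{i\lambda k}$ and $\eta = \sum \alpha_k \xi_k$ (finite sums). Then $f \leftrightarrow \eta$ and therefore

$$\bar{\eta} = \sum \bar{\alpha}_k \xi_k \leftrightarrow \sum \bar{\alpha}_k e^{i\lambda k} = \overline{f(-\lambda)}. \tag{10}$$

Since $I_\Delta(\lambda) \leftrightarrow Z(\Delta)$, it follows from (10) that $I_\Delta(-\lambda) \leftrightarrow \overline{Z(\Delta)}$, or, equivalently, $I_{-\Delta}(\lambda) \leftrightarrow \overline{Z(\Delta)}$. On the other hand, $I_{-\Delta}(\lambda) \leftrightarrow Z(-\Delta)$. Therefore $\overline{Z(\Delta)} = Z(-\Delta)$ (P-a.s.).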

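Corollary 2. Again let $\xi = (\xi_n)$ be a stationary sequence of real random variables $\xi_n$ and $Z(\Delta) = Z_1(\Delta) + iZ_2(\Delta)$. Then

$$\mathsf{E}Z_1(\Delta_1)Z_2(\Delta_2) = 0 \tag{11}$$

for every $\Delta_1$ and $\Delta_2$; and if $\Delta_1 \cap \Delta_2 = \varnothing$ then

$$\mathsf{E}Z_1(\Delta_1)Z_1(\Delta_2) = 0, \quad \mathsf{E}Z_2(\Delta_1)Z_2(\Delta_2) = 0. \tag{12}$$

In fact, since $Z(\Delta) = \overline{Z(-\Delta)}$, we have

$$Z_1(-\Delta) = Z_1(\Delta), \quad Z_2(-\Delta) = -Z_2(\Delta). \tag{13}$$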

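Moreover, since $\mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = \mathsf{E}|Z(\Delta_1 \cap \Delta_2)|^2$, we have $\operatorname{Im} \mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = 0$, i.e.

$$\mathsf{E}Z_1(\Delta_1)Z_2(\Delta_2) - \mathsf{E}Z_2(\Delta_1)Z_1(\Delta_2) = 0. \tag{14}$$

If we take the interval $-\Delta_1$ instead of $\Delta_1$ we therefore obtain

$$\mathsf{E}Z_1(-\Delta_1)Z_2(\Delta_2) - \mathsf{E}Z_2(-\Delta_1)Z_1(\Delta_2) = 0,$$

which, by (13), can be transformed into

$$\mathsf{E}Z_1(\Delta_1)Z_2(\Delta_2) + \mathsf{E}Z_2(\Delta_1)Z_1(\Delta_2) = 0. \tag{15}$$

Then (11) follows from (14) and (15).

On the other hand, if $\Delta_1 \cap \Delta_2 = \varnothing$ then $\mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = 0$, whence $\operatorname{Re} \mathsf{E}Z(\Delta_1)\overline{Z(\Delta_2)} = 0$ and $\operatorname{Re} \mathsf{E}Z(-\Delta_1)\overline{Z(\Delta_2)} = 0$, which, with (13), provides an evident proof of (12).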

Corollary 3. Let $\xi = (\xi_n)$ be a Gaussian sequence. Then, for every family $\Delta_1, \ldots, \Delta_k$, the vector $(Z_1(\Delta_1), \ldots, Z_1(\Delta_k), Z_2(\Delta_1), \ldots, Z_2(\Delta_k))$ is normally distributed.

In fact, the linear manifold $L_0^2(\xi)$ consists of (complex-valued) Gaussian random variables $\eta$, i.e. the vector $(\operatorname{Re} \eta, \operatorname{Im} \eta)$ has a Gaussian distribution. Then, according to Subsection 5, §13, Chapter II, the closure of $L_0^2(\xi)$ also consists of Gaussian variables. It follows from Corollary 2 that, when $\xi = (\xi_n)$ is a Gaussian sequence, the real and imaginary parts $Z_1$ and $Z_2$ of $Z$ are independent in the sense that the families of random variables $(Z_1(\Delta_1), \ldots, Z_1(\Delta_k))$ and $(Z_2(\Delta_1), \ldots, Z_2(\Delta_k))$ are independent. It also follows from (12) that when the sets $\Delta_1, \ldots, \Delta_k$ are disjoint, the random variables $Z_i(\Delta_1), \ldots, Z_i(\Delta_k)$ are collectively independent, $i = 1, 2$.

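Corollary 4. If $\xi = (\xi_n)$ is a stationary sequence of real random variables, then (P-a.s.)

$$\xi_n = \int_{-\pi}^{\pi} \cos \lambda n\, Z_1(d\lambda) - \int_{-\pi}^{\pi} \sin \lambda n\, Z_2(d\lambda). \tag{16}$$

Remark. If $\{Z_\lambda\}$, $\lambda \in [-\pi, \pi)$, is a process with orthogonal increments, corresponding to an orthogonal stochastic measure $Z = Z(\Delta)$, then in accordance with §2 the spectral representation (2) can also be written in the following form:

$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, dZ_\lambda, \quad n \in Z. \tag{17}$$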

2. Let $\xi = (\xi_n)$ be a stationary sequence with the spectral representation (2) and let $\eta \in L^2(\xi)$. The following theorem describes the structure of such random variables.

Theorem 2. If $\eta \in L^2(\xi)$, there is a function $\varphi \in L^2(F)$ such that (P-a.s.)

$$\eta = \int_{-\pi}^{\pi} \varphi(\lambda)\, Z(d\lambda). \tag{18}$$

PROOF. If

$$\eta_n = \sum_{|k| \le n} \alpha_k \xi_k, \tag{19}$$

then by (2)

$$\eta_n = \int_{-\pi}^{\pi} \Bigl(\sum_{|k| \le n} \alpha_k e^{i\lambda k}\Bigr) Z(d\lambda), \tag{20}$$

i.e. (18) is satisfied with

$$\varphi_n(\lambda) = \sum_{|k| \le n} \alpha_k e^{i\lambda k}. \tag{21}$$

In the general case, when $\eta \in L^2(\xi)$, there are variables $\eta_n$ of type (19) such that $\|\eta - \eta_n\| \to 0$, $n \to \infty$. But then $\|\varphi_n - \varphi_m\| = \|\eta_n - \eta_m\| \to 0$, $n, m \to \infty$.

Consequently $\{\varphi_n\}$ is fundamental in $L^2(F)$ and therefore there is a function $\varphi \in L^2(F)$ such that $\|\varphi - \varphi_n\| \to 0$, $n \to \infty$.
By property (2.14) we have $\|\mathscr{J}(\varphi_n) - \mathscr{J}(\varphi)\| \to 0$, and since $\eta_n = \mathscr{J}(\varphi_n)$ we also have $\eta = \mathscr{J}(\varphi)$ (P-a.s.).
This completes the proof of the theorem.

Remark. Let $H_0(\xi)$ and $H_0(F)$ be the respective closed linear manifolds spanned by the variables $\xi_n$ and by the functions $e_n$ when $n \le 0$. Then if $\eta \in H_0(\xi)$ there is a function $\varphi \in H_0(F)$ such that (P-a.s.) $\eta = \int_{-\pi}^{\pi} \varphi(\lambda)\, Z(d\lambda)$.

3. Formula (18) describes the structure of the random variables that are obtained from $\xi_n$, $n \in Z$, by linear transformations, i.e. in the form of finite sums (19) and their mean-square limits.
A special but important class of such linear transformations is defined by means of what are known as (linear) filters. Let us suppose that, at instant $m$, a system (filter) receives as input a signal $x_m$, and that the output of the system is, at instant $n$, the signal $h(n - m)x_m$, where $h = h(s)$, $s \in Z$, is a complex-valued function called the impulse response (of the filter).
Therefore the total signal obtained from the input can be represented in the form

$$y_n = \sum_{m=-\infty}^{\infty} h(n - m)x_m. \tag{22}$$

For physically realizable systems, the values of the output at instant $n$ are determined only by the "past" values of the signal, i.e. the values $x_m$ for $m \le n$. It is therefore natural to call a filter with the impulse response $h(s)$ physically realizable if $h(s) = 0$ for all $s < 0$, in other words if

$$y_n = \sum_{m=-\infty}^{n} h(n - m)x_m = \sum_{m=0}^{\infty} h(m)x_{n-m}. \tag{23}$$

An important spectral characteristic of a filter with the impulse response $h$ is its Fourier transform

$$\varphi(\lambda) = \sum_{m=-\infty}^{\infty} e^{-i\lambda m} h(m), \tag{24}$$

known as the frequency characteristic or transfer function of the filter.


Let us now take up conditions, about which nothing has been said so far, for the convergence of the series in (22) and (24). Let us suppose that the input is a stationary random sequence $\xi = (\xi_n)$, $n \in Z$, with covariance function $R(n)$ and spectral decomposition (2). Then if

$$\sum_{k,l=-\infty}^{\infty} h(k)R(k - l)\overline{h(l)} < \infty, \tag{25}$$

the series $\sum_{m=-\infty}^{\infty} h(n - m)\xi_m$ converges in mean-square and therefore there is a stationary sequence $\eta = (\eta_n)$ with

$$\eta_n = \sum_{m=-\infty}^{\infty} h(n - m)\xi_m = \sum_{m=-\infty}^{\infty} h(m)\xi_{n-m}. \tag{26}$$

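In terms of the spectral measure, (25) is evidently equivalent to saying that $\varphi(\lambda) \in L^2(F)$, i.e.

$$\int_{-\pi}^{\pi} |\varphi(\lambda)|^2\, F(d\lambda) < \infty. \tag{27}$$

Under (25) or (27), we obtain the spectral representation

$$\eta_n = \int_{-\pi}^{\pi} e^{i\lambda n} \varphi(\lambda)\, Z(d\lambda) \tag{28}$$

of $\eta$ from (26) and (2). Consequently the covariance function $R_\eta(n)$ of $\eta$ is given by the formula

$$R_\eta(n) = \int_{-\pi}^{\pi} e^{i\lambda n} |\varphi(\lambda)|^2\, F(d\lambda). \tag{29}$$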
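Formulas (24) and (29) are easy to check numerically in the simplest situation. The sketch below assumes a white-noise input, so that $F(d\lambda) = d\lambda/2\pi$, and a filter with finitely many nonzero coefficients $h(0), \ldots, h(M-1)$; the function names and the grid size are choices made for this illustration only. The spectral formula is compared with the covariance $\sum_m h(m+n)\overline{h(m)}$ obtained directly from (26).

```python
import numpy as np

def transfer_function(h, lam):
    """phi(lambda) = sum_m exp(-i*lambda*m) h(m) for a causal filter h(0), ..., h(M-1)."""
    m = np.arange(len(h))
    return np.exp(-1j * np.outer(lam, m)) @ np.asarray(h, dtype=complex)

def output_covariance(h, n, grid_size=1024):
    """R_eta(n) from (29), assuming F(d lambda) = d lambda / (2 pi) (white-noise input)."""
    lam = -np.pi + 2 * np.pi * np.arange(grid_size) / grid_size
    phi = transfer_function(h, lam)
    # Uniform Riemann sum; exact here because the integrand is a trigonometric polynomial
    # of degree smaller than grid_size.
    return np.mean(np.exp(1j * lam * n) * np.abs(phi) ** 2)

def direct_covariance(h, n):
    """sum_m h(m + n) conj(h(m)), computed from the moving average (26) directly."""
    h = np.asarray(h, dtype=complex)
    M = len(h)
    return sum(h[m + n] * np.conj(h[m]) for m in range(M) if 0 <= m + n < M)
```

For example, with `h = [1, 0.5]` the value at lag zero is $1 + 0.25 = 1.25$ by either route.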

In particular, if the input to a filter with frequency characteristic $\varphi = \varphi(\lambda)$ is taken to be white noise $\varepsilon = (\varepsilon_n)$, the output will be a stationary sequence (moving average)

$$\eta_n = \sum_{m=-\infty}^{\infty} h(m)\varepsilon_{n-m} \tag{30}$$

with spectral density

$$f_\eta(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2.$$

The following theorem shows that there is a sense in which every stationary sequence with a spectral density is obtainable by means of a moving average.

Theorem 3. Let $\eta = (\eta_n)$ be a stationary sequence with spectral density $f_\eta(\lambda)$. Then (possibly at the expense of enlarging the original probability space) we can find a sequence $\varepsilon = (\varepsilon_n)$ representing white noise, and a filter, such that the representation (30) holds.

PROOF. For a given (nonnegative) function $f_\eta(\lambda)$ we can find a function $\varphi(\lambda)$ such that $f_\eta(\lambda) = (1/2\pi)|\varphi(\lambda)|^2$. Since $\int_{-\pi}^{\pi} f_\eta(\lambda)\, d\lambda < \infty$, we have $\varphi(\lambda) \in L^2(\mu)$, where $\mu$ is Lebesgue measure on $[-\pi, \pi)$. Hence $\varphi$ can be represented as a Fourier series (24) with $h(m) = (1/2\pi) \int_{-\pi}^{\pi} e^{im\lambda} \varphi(\lambda)\, d\lambda$, where convergence is understood in the sense that

$$\int_{-\pi}^{\pi} \Bigl|\varphi(\lambda) - \sum_{|m| \le n} e^{-i\lambda m} h(m)\Bigr|^2 d\lambda \to 0, \quad n \to \infty.$$

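Let

$$\eta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z(d\lambda), \quad n \in Z.$$

Besides the measure $Z = Z(\Delta)$ we introduce another independent orthogonal stochastic measure $\tilde{Z} = \tilde{Z}(\Delta)$ with $\mathsf{E}|\tilde{Z}(a, b]|^2 = (b - a)/2\pi$. (The possibility of constructing such a measure depends, in general, on having a sufficiently "rich" original probability space.) Let us put

$$\bar{Z}(\Delta) = \int_\Delta \varphi^{\oplus}(\lambda)\, Z(d\lambda) + \int_\Delta [1 - \varphi^{\oplus}(\lambda)\varphi(\lambda)]\, \tilde{Z}(d\lambda),$$

where

$$a^{\oplus} = \begin{cases} a^{-1}, & \text{if } a \ne 0, \\ 0, & \text{if } a = 0. \end{cases}$$

The stochastic measure $\bar{Z} = \bar{Z}(\Delta)$ is a measure with orthogonal values, and for every $\Delta = (a, b]$ we have

$$\mathsf{E}|\bar{Z}(\Delta)|^2 = \frac{1}{2\pi} \int_\Delta |\varphi^{\oplus}(\lambda)|^2 |\varphi(\lambda)|^2\, d\lambda + \frac{1}{2\pi} \int_\Delta |1 - \varphi^{\oplus}(\lambda)\varphi(\lambda)|^2\, d\lambda = \frac{|\Delta|}{2\pi},$$

where $|\Delta| = b - a$. Therefore the stationary sequence $\varepsilon = (\varepsilon_n)$, $n \in Z$, with

$$\varepsilon_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, \bar{Z}(d\lambda),$$

is a white noise.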
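We now observe that

$$\int_{-\pi}^{\pi} e^{i\lambda n} \varphi(\lambda)\, \bar{Z}(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z(d\lambda) = \eta_n \tag{31}$$

and, on the other hand, by property (2.14) (P-a.s.)

$$\int_{-\pi}^{\pi} e^{i\lambda n} \varphi(\lambda)\, \bar{Z}(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n} \Bigl(\sum_{m=-\infty}^{\infty} e^{-i\lambda m} h(m)\Bigr) \bar{Z}(d\lambda) = \sum_{m=-\infty}^{\infty} h(m) \int_{-\pi}^{\pi} e^{i\lambda(n-m)}\, \bar{Z}(d\lambda) = \sum_{m=-\infty}^{\infty} h(m)\varepsilon_{n-m},$$

which, together with (31), establishes the representation (30).
This completes the proof of the theorem.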

Remark. If $f_\eta(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure), the introduction of the auxiliary measure $\tilde{Z} = \tilde{Z}(\Delta)$ becomes unnecessary (since then $1 - \varphi^{\oplus}(\lambda)\varphi(\lambda) = 0$ almost everywhere with respect to Lebesgue measure), and the reservation concerning the necessity of extending the original probability space can be omitted.

Corollary 1. Let the spectral density $f_\eta(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure) and

$$f_\eta(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$

where

$$\varphi(\lambda) = \sum_{k=0}^{\infty} e^{-i\lambda k} h(k), \quad \sum_{k=0}^{\infty} |h(k)|^2 < \infty.$$

Then the sequence $\eta$ admits a representation as a one-sided moving average,

$$\eta_n = \sum_{m=0}^{\infty} h(m)\varepsilon_{n-m}.$$

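In particular, let $P(z) = a_0 + a_1 z + \cdots + a_p z^p$ be a polynomial that has no zeros on $\{z : |z| = 1\}$. Then the sequence $\eta = (\eta_n)$ with spectral density

$$f_\eta(\lambda) = \frac{1}{2\pi} |P(e^{-i\lambda})|^2$$

can be represented in the form

$$\eta_n = a_0 \varepsilon_n + a_1 \varepsilon_{n-1} + \cdots + a_p \varepsilon_{n-p}.$$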

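Corollary 2. Let $\xi = (\xi_n)$ be a sequence with rational spectral density

$$f_\xi(\lambda) = \frac{1}{2\pi} \left| \frac{P(e^{-i\lambda})}{Q(e^{-i\lambda})} \right|^2, \tag{32}$$

where $P(z) = a_0 + a_1 z + \cdots + a_p z^p$ and $Q(z) = 1 + b_1 z + \cdots + b_q z^q$.

Let us show that if $P(z)$ and $Q(z)$ have no zeros on $\{z : |z| = 1\}$, there is a white noise $\varepsilon = (\varepsilon_n)$ such that (P-a.s.)

$$\xi_n + b_1 \xi_{n-1} + \cdots + b_q \xi_{n-q} = a_0 \varepsilon_n + a_1 \varepsilon_{n-1} + \cdots + a_p \varepsilon_{n-p}. \tag{33}$$

Conversely, every stationary sequence $\xi = (\xi_n)$ that satisfies this equation with some white noise $\varepsilon = (\varepsilon_n)$ and some polynomial $Q(z)$ with no zeros on $\{z : |z| = 1\}$ has a spectral density (32).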
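In fact, let $\eta_n = \xi_n + b_1 \xi_{n-1} + \cdots + b_q \xi_{n-q}$. Then $f_\eta(\lambda) = (1/2\pi)|P(e^{-i\lambda})|^2$ and the required representation follows from Corollary 1.
On the other hand, if (33) holds and $F_\xi(\lambda)$ and $F_\eta(\lambda)$ are the spectral functions of $\xi$ and $\eta$, then

$$F_\eta(\lambda) = \int_{-\pi}^{\lambda} |Q(e^{-i\nu})|^2\, dF_\xi(\nu) = \frac{1}{2\pi} \int_{-\pi}^{\lambda} |P(e^{-i\nu})|^2\, d\nu.$$

Since $|Q(e^{-i\nu})|^2 > 0$, it follows that $F_\xi(\lambda)$ has a density defined by (32).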
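The rational density (32) can be evaluated directly from the coefficients in (33). The following sketch is an illustration, with hypothetical function names, of how one might compute $f_\xi(\lambda)$ and recover the variance $R(0) = \int_{-\pi}^{\pi} f_\xi(\lambda)\, d\lambda$ numerically; it assumes, as in the corollary, that $Q$ has no zeros on the unit circle.

```python
import numpy as np

def rational_spectral_density(a, b, lam):
    """f(lambda) = |P(e^{-i lambda})|^2 / (2 pi |Q(e^{-i lambda})|^2).

    a = (a_0, ..., a_p) are the coefficients of P and b = (b_1, ..., b_q) those of Q,
    whose constant term is 1. Q is assumed to have no zeros on the unit circle.
    """
    z = np.exp(-1j * np.asarray(lam, dtype=float))
    P = np.polyval(list(a)[::-1], z)              # polyval expects the highest degree first
    Q = np.polyval(list(b)[::-1] + [1.0], z)
    return np.abs(P) ** 2 / (2 * np.pi * np.abs(Q) ** 2)

def variance_from_density(a, b, grid_size=4096):
    """R(0) = integral of f over [-pi, pi), approximated by a uniform Riemann sum."""
    lam = -np.pi + 2 * np.pi * np.arange(grid_size) / grid_size
    return 2 * np.pi * np.mean(rational_spectral_density(a, b, lam))
```

As a check, for the autoregression $\xi_n - 0.5\,\xi_{n-1} = \varepsilon_n$ (that is, `a = [1.0]`, `b = [-0.5]`) the variance is $1/(1 - 0.25) = 4/3$, and for the moving average $\xi_n = \varepsilon_n + 0.5\,\varepsilon_{n-1}$ it is $1.25$.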


4. The following mean-square ergodic theorem can be thought of as an analog of the law of large numbers for stationary (wide sense) random sequences.

Theorem 4. Let $\xi = (\xi_n)$, $n \in Z$, be a stationary sequence with $\mathsf{E}\xi_n = 0$, covariance function (1), and spectral resolution (2). Then

$$\frac{1}{n} \sum_{k=0}^{n-1} \xi_k \xrightarrow{L^2} Z(\{0\}) \tag{34}$$

and

$$\frac{1}{n} \sum_{k=0}^{n-1} R(k) \to F(\{0\}). \tag{35}$$

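PROOF. By (2),

$$\frac{1}{n} \sum_{k=0}^{n-1} \xi_k = \int_{-\pi}^{\pi} \frac{1}{n} \sum_{k=0}^{n-1} e^{ik\lambda}\, Z(d\lambda) = \int_{-\pi}^{\pi} \varphi_n(\lambda)\, Z(d\lambda),$$

where

$$\varphi_n(\lambda) = \frac{1}{n} \sum_{k=0}^{n-1} e^{ik\lambda} = \begin{cases} 1, & \lambda = 0, \\ \dfrac{1}{n}\, \dfrac{e^{in\lambda} - 1}{e^{i\lambda} - 1}, & \lambda \ne 0. \end{cases} \tag{36}$$

Since $|\sin \lambda| \ge (2/\pi)|\lambda|$ for $|\lambda| \le \pi/2$, we have

$$|\varphi_n(\lambda)| = \left| \frac{\sin \frac{n\lambda}{2}}{n \sin \frac{\lambda}{2}} \right| \le \frac{\pi}{2} \left| \frac{\sin \frac{n\lambda}{2}}{\frac{n\lambda}{2}} \right| \le \frac{\pi}{2}.$$

Moreover, $\varphi_n(\lambda) \to I_{\{0\}}(\lambda)$ in $L^2(F)$ and therefore by (2.14)

$$\int_{-\pi}^{\pi} \varphi_n(\lambda)\, Z(d\lambda) \xrightarrow{L^2} \int_{-\pi}^{\pi} I_{\{0\}}(\lambda)\, Z(d\lambda) = Z(\{0\}),$$

which establishes (34).
Relation (35) can be proved in a similar way.
This completes the proof of the theorem.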

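Corollary. If the spectral function is continuous at zero, i.e. $F(\{0\}) = 0$, then $Z(\{0\}) = 0$ (P-a.s.) and by (34) and (35),

$$\frac{1}{n} \sum_{k=0}^{n-1} R(k) \to 0 \;\Rightarrow\; \frac{1}{n} \sum_{k=0}^{n-1} \xi_k \xrightarrow{L^2} 0.$$

Since

$$\Bigl| \frac{1}{n} \sum_{k=0}^{n-1} R(k) \Bigr|^2 = \Bigl| \mathsf{E}\Bigl( \frac{1}{n} \sum_{k=0}^{n-1} \xi_k \Bigr) \bar{\xi}_0 \Bigr|^2 \le \mathsf{E}|\xi_0|^2\, \mathsf{E}\Bigl| \frac{1}{n} \sum_{k=0}^{n-1} \xi_k \Bigr|^2,$$

the converse implication also holds:

$$\frac{1}{n} \sum_{k=0}^{n-1} \xi_k \xrightarrow{L^2} 0 \;\Rightarrow\; \frac{1}{n} \sum_{k=0}^{n-1} R(k) \to 0.$$

Therefore the condition $(1/n) \sum_{k=0}^{n-1} R(k) \to 0$ is necessary and sufficient for the convergence (in the mean-square sense) of the arithmetic means $(1/n) \sum_{k=0}^{n-1} \xi_k$ to zero. It follows that if the original sequence $\xi = (\xi_n)$ has expectation $m$ (that is, $\mathsf{E}\xi_0 = m$), then

$$\frac{1}{n} \sum_{k=0}^{n-1} R(k) \to 0 \;\Leftrightarrow\; \frac{1}{n} \sum_{k=0}^{n-1} \xi_k \xrightarrow{L^2} m, \tag{37}$$

where $R(n) = \mathsf{E}(\xi_n - \mathsf{E}\xi_n)\overline{(\xi_0 - \mathsf{E}\xi_0)}$.

Let us also observe that if $Z(\{0\}) \ne 0$ (P-a.s.) and $m = 0$, then $\xi_n$ "contains a random constant $\alpha$":

$$\xi_n = \alpha + \eta_n,$$

where $\alpha = Z(\{0\})$; and in the spectral representation $\eta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\, Z_\eta(d\lambda)$ the measure $Z_\eta = Z_\eta(\Delta)$ is such that $Z_\eta(\{0\}) = 0$ (P-a.s.). Conclusion (34) means that the arithmetic mean converges in mean-square to precisely this random constant $\alpha$.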
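Relation (35) can be observed numerically when the spectral measure is atomic. In the sketch below, the frequencies `freqs` and masses `F_mass` are assumptions made for the illustration; the first atom is placed at $\lambda = 0$, so the arithmetic means of $R(k)$ should approach its mass $F(\{0\})$, while the contributions of the other atoms average out.

```python
import numpy as np

freqs = np.array([0.0, 1.0, -2.5])      # assumed atoms of F; the first one sits at lambda = 0
F_mass = np.array([0.7, 1.2, 0.4])      # F({lambda}) for each atom

def R(k):
    """Covariance function R(k) = sum_j exp(i * lambda_j * k) F({lambda_j})."""
    return np.exp(1j * np.outer(k, freqs)) @ F_mass

def cesaro_mean_R(n):
    """(1/n) * sum_{k=0}^{n-1} R(k), the left-hand side of (35)."""
    return R(np.arange(n)).mean()
```

The rate of convergence here is of order $1/n$, in line with the bound on $|\varphi_n(\lambda)|$ for $\lambda \ne 0$ used in the proof.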

5. PROBLEMS

1. Show that $\overline{L_0^2(F)} = L^2(F)$ (for the notation see the proof of Theorem 1).

2. Let $\xi = (\xi_n)$ be a stationary sequence with the property that $\xi_{n+N} = \xi_n$ for some $N$ and all $n$. Show that the spectral representation of such a sequence reduces to (1.13).

3. Let $\xi = (\xi_n)$ be a stationary sequence such that $\mathsf{E}\xi_n = 0$ and

$$\frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} R(k - l) = \frac{1}{N} \sum_{|k| \le N-1} R(k) \Bigl[ 1 - \frac{|k|}{N} \Bigr] \le C N^{-\alpha}$$

for some $C > 0$, $\alpha > 0$. Use the Borel-Cantelli lemma to show that then

$$\frac{1}{N} \sum_{k=0}^{N-1} \xi_k \to 0 \quad \text{(P-a.s.)}.$$

4. Let the spectral density $f_\xi(\lambda)$ of the sequence $\xi = (\xi_m)$ be rational,

$$f_\xi(\lambda) = \frac{1}{2\pi} \left| \frac{P_{n-1}(e^{-i\lambda})}{Q_n(e^{-i\lambda})} \right|^2, \tag{38}$$

where $P_{n-1}(z) = a_0 + a_1 z + \cdots + a_{n-1} z^{n-1}$ and $Q_n(z) = 1 + b_1 z + \cdots + b_n z^n$, and all the zeros of these polynomials lie outside the unit disk.
Show that there is a white noise $\varepsilon = (\varepsilon_m)$, $m \in Z$, such that the sequence $(\xi_m)$ is a component of an $n$-dimensional sequence $(\xi_m^1, \xi_m^2, \ldots, \xi_m^n)$, $\xi_m^1 = \xi_m$, that satisfies the system of equations

$$\xi_{m+1}^i = \xi_m^{i+1} + \beta_i \varepsilon_{m+1}, \quad i = 1, \ldots, n - 1,$$

$$\xi_{m+1}^n = -\sum_{j=0}^{n-1} b_{n-j}\, \xi_m^{j+1} + \beta_n \varepsilon_{m+1}, \tag{39}$$

where $\beta_1 = a_0$, $\beta_i = a_{i-1} - \sum_{k=1}^{i-1} \beta_k b_{i-k}$.

4. Statistical Estimation of the Covariance Function and the Spectral Density

1. Problems of the statistical estimation of various characteristics of the probability distributions of random sequences arise in the most diverse branches of science (geophysics, medicine, economics, etc.). The material presented in this section will give the reader an idea of the concepts and methods of estimation, and of the difficulties that are encountered.
To begin with, let $\xi = (\xi_n)$, $n \in Z$, be a sequence, stationary in the wide sense (for simplicity, real) with expectation $\mathsf{E}\xi_n = m$ and covariance $R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\, F(d\lambda)$.
Let $x_0, x_1, \ldots, x_{N-1}$ be the results of observing the random variables $\xi_0, \xi_1, \ldots, \xi_{N-1}$. How are we then to construct a "good" estimator of the (unknown) mean value $m$?
Let us put

$$m_N(x) = \frac{1}{N} \sum_{k=0}^{N-1} x_k. \tag{1}$$

Then it follows from the elementary properties of the expectation that this is a "good" estimator of $m$ in the sense that it is unbiased "in the mean over all kinds of data $x_0, \ldots, x_{N-1}$", i.e.

$$\mathsf{E}m_N(\xi) = \mathsf{E}\Bigl( \frac{1}{N} \sum_{k=0}^{N-1} \xi_k \Bigr) = m. \tag{2}$$

In addition, it follows from Theorem 4 of §3 that when $(1/N) \sum_{k=0}^{N} R(k) \to 0$, $N \to \infty$, our estimator is consistent (in mean-square), i.e.

$$\mathsf{E}|m_N(\xi) - m|^2 \to 0, \quad N \to \infty. \tag{3}$$

Next we take up the problem of estimating the covariance function $R(n)$, the spectral function $F(\lambda) = F([-\pi, \lambda])$, and the spectral density $f(\lambda)$, all under the assumption that $m = 0$.

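Since $R(n) = \mathsf{E}\xi_{n+k}\xi_k$ it is natural to estimate this function on the basis of $N$ observations $x_0, x_1, \ldots, x_{N-1}$ (when $0 \le n < N$) by

$$\hat{R}_N(n; x) = \frac{1}{N - n} \sum_{k=0}^{N-n-1} x_{n+k} x_k.$$

It is clear that this estimator is unbiased in the sense that

$$\mathsf{E}\hat{R}_N(n; \xi) = R(n), \quad 0 \le n < N.$$

Let us now consider the question of its consistency. If we replace $\xi_k$ in (3.37) by $\xi_{n+k}\xi_k$ and suppose that the sequence $\xi = (\xi_n)$ under consideration has a fourth moment ($\mathsf{E}\xi_0^4 < \infty$), we find that the condition

$$\frac{1}{N} \sum_{k=0}^{N-1} \mathsf{E}[\xi_{n+k}\xi_k - R(n)][\xi_n\xi_0 - R(n)] \to 0, \quad N \to \infty, \tag{4}$$

is necessary and sufficient for

$$\mathsf{E}|\hat{R}_N(n; \xi) - R(n)|^2 \to 0, \quad N \to \infty. \tag{5}$$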
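Let us suppose that the original sequence $\xi = (\xi_n)$ is Gaussian (with zero mean and covariance $R(n)$). Then by (II.12.51)

$$\begin{aligned} \mathsf{E}[\xi_{n+k}\xi_k - R(n)][\xi_n\xi_0 - R(n)] &= \mathsf{E}\xi_{n+k}\xi_k\xi_n\xi_0 - R^2(n) \\ &= \mathsf{E}\xi_{n+k}\xi_k\, \mathsf{E}\xi_n\xi_0 + \mathsf{E}\xi_{n+k}\xi_n\, \mathsf{E}\xi_k\xi_0 + \mathsf{E}\xi_{n+k}\xi_0\, \mathsf{E}\xi_k\xi_n - R^2(n) \\ &= R^2(k) + R(n + k)R(n - k). \end{aligned}$$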

Therefore in the Gaussian case condition (4) is equivalent to

$$\frac{1}{N} \sum_{k=0}^{N-1} [R^2(k) + R(n + k)R(n - k)] \to 0, \quad N \to \infty. \tag{6}$$

Since $|R(n + k)R(n - k)| \le |R(n + k)|^2 + |R(n - k)|^2$, the condition

$$\frac{1}{N} \sum_{k=0}^{N-1} R^2(k) \to 0, \quad N \to \infty, \tag{7}$$

implies (6). Conversely, if (6) holds for $n = 0$, then (7) is satisfied.
We have now established the following theorem.

Theorem. Let $\xi = (\xi_n)$ be a Gaussian stationary sequence with $\mathsf{E}\xi_n = 0$ and covariance function $R(n)$. Then (7) is a necessary and sufficient condition that, for every $n \ge 0$, the estimator $\hat{R}_N(n; x)$ is mean-square consistent (i.e. that (5) is satisfied).
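The two estimators discussed in this subsection translate directly into code. The sketch below is a minimal illustration with hypothetical function names; the covariance estimator assumes real observations and a known zero mean, as in the text.

```python
import numpy as np

def mean_estimator(x):
    """m_N(x) = (1/N) * sum_{k=0}^{N-1} x_k, formula (1)."""
    return np.mean(x)

def covariance_estimator(x, n):
    """R_N(n; x) = (1/(N-n)) * sum_{k=0}^{N-n-1} x_{n+k} x_k for 0 <= n < N.

    Assumes real data from a sequence with zero mean (m = 0).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    if not 0 <= n < N:
        raise ValueError("lag n must satisfy 0 <= n < N")
    return np.dot(x[n:], x[:N - n]) / (N - n)
```

For the data `x = [1, 2, 3, 4]` the estimate at lag 1 is $(2\cdot 1 + 3\cdot 2 + 4\cdot 3)/3 = 20/3$; note that for lags close to $N$ only a few products enter the average, so the estimate becomes unreliable there.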


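Remark. If we use the spectral representation of the covariance function, we obtain

$$\frac{1}{N} \sum_{k=0}^{N-1} R^2(k) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \frac{1}{N} \sum_{k=0}^{N-1} e^{i(\lambda - \nu)k}\, F(d\lambda)F(d\nu) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} f_N(\lambda, \nu)\, F(d\lambda)F(d\nu),$$

where (compare (3.35))

$$f_N(\lambda, \nu) = \begin{cases} 1, & \lambda = \nu, \\ \dfrac{1 - e^{i(\lambda - \nu)N}}{N[1 - e^{i(\lambda - \nu)}]}, & \lambda \ne \nu. \end{cases}$$

But as $N \to \infty$

$$f_N(\lambda, \nu) \to f(\lambda, \nu) = \begin{cases} 1, & \lambda = \nu, \\ 0, & \lambda \ne \nu. \end{cases}$$

Therefore

$$\frac{1}{N} \sum_{k=0}^{N-1} R^2(k) \to \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} f(\lambda, \nu)\, F(d\lambda)F(d\nu) = \int_{-\pi}^{\pi} F(\{\lambda\})\, F(d\lambda) = \sum_\lambda F^2(\{\lambda\}),$$

where the sum over $\lambda$ contains at most a countable number of terms since the measure $F$ is finite.
Hence (7) is equivalent to

$$\sum_\lambda F^2(\{\lambda\}) = 0, \tag{8}$$

which means that the spectral function $F(\lambda) = F([-\pi, \lambda])$ is continuous.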

2. We now turn to the problem of finding estimators for the spectral function $F(\lambda)$ and the spectral density $f(\lambda)$ (under the assumption that they exist).
A method that naturally suggests itself for estimating the spectral density follows from the proof of Herglotz' theorem that we gave earlier. Recall that the function

$$f_N(\lambda) = \frac{1}{2\pi} \sum_{|n| < N} \Bigl( 1 - \frac{|n|}{N} \Bigr) R(n) e^{-i\lambda n}$$

introduced in §1 has the property that the function

$$F_N(\lambda) = \int_{-\pi}^{\lambda} f_N(\nu)\, d\nu \tag{9}$$

converges on the whole (Chapter III, §1) to the spectral function $F(\lambda)$. Therefore if $F(\lambda)$ has a density $f(\lambda)$, we have

$$\int_{-\pi}^{\lambda} f_N(\nu)\, d\nu \to \int_{-\pi}^{\lambda} f(\nu)\, d\nu \tag{10}$$

for each $\lambda \in [-\pi, \pi)$.


Starting from these facts and recalling that an estimator for $R(n)$ (on the basis of the observations $x_0, x_1, \ldots, x_{N-1}$) is $\hat{R}_N(n; x)$, we take as an estimator for $f(\lambda)$ the function

$$\hat{f}_N(\lambda; x) = \frac{1}{2\pi} \sum_{|n| < N} \Bigl( 1 - \frac{|n|}{N} \Bigr) \hat{R}_N(n; x) e^{-i\lambda n}, \tag{11}$$

putting $\hat{R}_N(n; x) = \hat{R}_N(|n|; x)$ for $|n| < N$.
The function $\hat{f}_N(\lambda; x)$ is known as a periodogram. It is easily verified that it can also be represented in the following more convenient form:

$$\hat{f}_N(\lambda; x) = \frac{1}{2\pi N} \Bigl| \sum_{n=0}^{N-1} x_n e^{-i\lambda n} \Bigr|^2. \tag{12}$$

Since $\mathsf{E}\hat{R}_N(n; \xi) = R(n)$, $|n| < N$, we have

$$\mathsf{E}\hat{f}_N(\lambda; \xi) = f_N(\lambda).$$
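The equivalence of the two forms (11) and (12) of the periodogram is "easily verified" by expanding the squared modulus, and it can also be confirmed numerically. The sketch below implements both forms for real data; the function names are hypothetical and the code is meant only as an illustration of the identity.

```python
import numpy as np

def covariance_estimator(x, n):
    """R_N(|n|; x) = (1/(N-|n|)) * sum_k x_{|n|+k} x_k for |n| < N (real data)."""
    n = abs(n)
    N = len(x)
    return np.dot(x[n:], x[:N - n]) / (N - n)

def periodogram_from_covariances(x, lam):
    """Formula (11): weighted sum of the covariance estimates."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    total = 0.0
    for n in range(-(N - 1), N):
        total += (1 - abs(n) / N) * covariance_estimator(x, n) * np.exp(-1j * lam * n)
    return total.real / (2 * np.pi)   # the imaginary parts cancel for real data

def periodogram_direct(x, lam):
    """Formula (12): squared modulus of the finite Fourier sum, divided by 2*pi*N."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    k = np.arange(N)
    return np.abs(np.sum(x * np.exp(-1j * lam * k))) ** 2 / (2 * np.pi * N)
```

The direct form (12) needs only one pass over the data for each frequency, which is why it is the more convenient of the two in practice.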
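If the spectral function $F(\lambda)$ has density $f(\lambda)$, then, since $f_N(\lambda)$ can also be written in the form (1.34), we find that

$$f_N(\lambda) = \frac{1}{2\pi N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \int_{-\pi}^{\pi} e^{i\nu(k-l)} e^{i\lambda(l-k)} f(\nu)\, d\nu = \int_{-\pi}^{\pi} \frac{1}{2\pi N} \Bigl| \sum_{k=0}^{N-1} e^{i(\nu - \lambda)k} \Bigr|^2 f(\nu)\, d\nu.$$

The function

$$\Phi_N(\lambda) = \frac{1}{2\pi N} \Bigl| \sum_{k=0}^{N-1} e^{i\lambda k} \Bigr|^2 = \frac{1}{2\pi N} \left| \frac{\sin \frac{\lambda}{2} N}{\sin \lambda/2} \right|^2$$

is the Fejér kernel. It is known, from the properties of this function, that for almost every $\lambda$ (with respect to Lebesgue measure)

$$\int_{-\pi}^{\pi} \Phi_N(\lambda - \nu) f(\nu)\, d\nu \to f(\lambda). \tag{13}$$

Therefore for almost every $\lambda \in [-\pi, \pi)$

$$\mathsf{E}\hat{f}_N(\lambda; \xi) \to f(\lambda); \tag{14}$$

in other words, the estimator $\hat{f}_N(\lambda; x)$ of $f(\lambda)$ on the basis of $x_0, x_1, \ldots, x_{N-1}$ is asymptotically unbiased.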
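The two expressions for the Fejér kernel, and the fact that it integrates to one over $[-\pi, \pi)$ (which is what makes the convolution with it an averaging operation), can be checked with a few lines of code. This is a sketch for illustration; the function names and the treatment of the removable singularity at $\lambda = 0$ are choices made here, not taken from the text.

```python
import numpy as np

def fejer_sum_form(N, lam):
    """Phi_N(lambda) = |sum_{k=0}^{N-1} exp(i*lambda*k)|^2 / (2*pi*N)."""
    k = np.arange(N)
    lam = np.asarray(lam, dtype=float)
    return np.abs(np.exp(1j * np.outer(lam, k)).sum(axis=1)) ** 2 / (2 * np.pi * N)

def fejer_closed_form(N, lam):
    """Phi_N(lambda) = (sin(N*lambda/2) / sin(lambda/2))^2 / (2*pi*N)."""
    lam = np.asarray(lam, dtype=float)
    num = np.sin(N * lam / 2)
    den = np.sin(lam / 2)
    out = np.full(lam.shape, N / (2 * np.pi))     # limiting value where sin(lambda/2) = 0
    ok = np.abs(den) > 1e-12
    out[ok] = (num[ok] / den[ok]) ** 2 / (2 * np.pi * N)
    return out

def fejer_total_mass(N, grid_size=64):
    """Integral of Phi_N over [-pi, pi) by a uniform Riemann sum (exact for N <= grid_size)."""
    lam = -np.pi + 2 * np.pi * np.arange(grid_size) / grid_size
    return 2 * np.pi * np.mean(fejer_sum_form(N, lam))
```

As $N$ grows the kernel concentrates near $\lambda = 0$, where its value is $N/2\pi$.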


In this sense the estimator $\hat{f}_N(\lambda; x)$ can be considered to be "good." However, at the individual observed values $x_0, \ldots, x_{N-1}$ the values of the periodogram $\hat{f}_N(\lambda; x)$ usually turn out to be far from the actual values $f(\lambda)$.
In fact, let $\xi = (\xi_n)$ be a stationary sequence of independent Gaussian random variables, $\xi_n \sim \mathscr{N}(0, 1)$. Then $f(\lambda) \equiv 1/2\pi$ and

$$\hat{f}_N(\lambda; \xi) = \frac{1}{2\pi} \Bigl| \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \xi_k e^{-i\lambda k} \Bigr|^2.$$

Then at the point $\lambda = 0$ we have $2\pi \hat{f}_N(0; \xi)$ coinciding in distribution with the square of the Gaussian random variable $\eta \sim \mathscr{N}(0, 1)$. Hence, for every $N$,

$$\mathsf{E}|\hat{f}_N(0; \xi) - f(0)|^2 = \frac{1}{4\pi^2} \mathsf{E}|\eta^2 - 1|^2 > 0.$$

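Moreover, an easy calculation shows that if $f(\lambda)$ is the spectral density of a stationary sequence $\xi = (\xi_n)$ that is constructed as a moving average:

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \tag{15}$$

with $\sum_{k=0}^{\infty} |a_k| < \infty$, $\sum_{k=0}^{\infty} |a_k|^2 < \infty$, where $\varepsilon = (\varepsilon_n)$ is white noise with $\mathsf{E}\varepsilon_0^4 < \infty$, then

$$\lim_{N \to \infty} \mathsf{E}|\hat{f}_N(\lambda; \xi) - f(\lambda)|^2 = \begin{cases} 2f^2(0), & \lambda = 0, \pm\pi, \\ f^2(\lambda), & \lambda \ne 0, \pm\pi. \end{cases} \tag{16}$$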

Hence it is clear that the periodogram cannot be a satisfactory estimator of the spectral density. To improve the situation, one often uses an estimator for $f(\lambda)$ of the form

$$\hat{f}_N^W(\lambda; x) = \int_{-\pi}^{\pi} W_N(\lambda - \nu) \hat{f}_N(\nu; x)\, d\nu, \tag{17}$$

which is obtained from the periodogram $\hat{f}_N(\lambda; x)$ and a smoothing function $W_N(\lambda)$, which we call a spectral window. Natural requirements on $W_N(\lambda)$ are:
(a) $W_N(\lambda)$ has a sharp maximum at $\lambda = 0$;
(b) $\int_{-\pi}^{\pi} W_N(\lambda)\, d\lambda = 1$;
(c) $\mathsf{E}|\hat{f}_N^W(\lambda; \xi) - f(\lambda)|^2 \to 0$, $N \to \infty$, $\lambda \in [-\pi, \pi)$.
By (14) and (b) the estimators $\hat{f}_N^W(\lambda; \xi)$ are asymptotically unbiased. Condition (c) is the condition of asymptotic consistency in mean-square, which, as we showed above, is violated for the periodogram. Finally, condition (a) ensures that the required frequency $\lambda$ is "picked out" from the periodogram.
Let us give some examples of estimators of the form (17).
Bartlett's estimator is based on the spectral window

$$W_N(\lambda) = a_N B(a_N \lambda),$$

where $a_N \uparrow \infty$, $a_N/N \to 0$, $N \to \infty$, and

$$B(\lambda) = \frac{1}{2\pi} \left| \frac{\sin(\lambda/2)}{\lambda/2} \right|^2.$$

Parzen's estimator takes the spectral window to be

$$W_N(\lambda) = a_N P(a_N \lambda),$$

where $a_N$ are the same as before and

$$P(\lambda) = \frac{3}{8\pi} \left| \frac{\sin(\lambda/4)}{\lambda/4} \right|^4.$$

Zhurbenko's estimator is constructed from a spectral window of the form

$$W_N(\lambda) = a_N Z(a_N \lambda)$$

with

$$Z(\lambda) = \begin{cases} -\dfrac{\alpha + 1}{2\alpha} |\lambda|^\alpha + \dfrac{\alpha + 1}{2\alpha}, & |\lambda| \le 1, \\ 0, & |\lambda| > 1, \end{cases}$$

where $0 < \alpha \le 2$ and the $a_N$ are selected in a particular way.
We shall not spend any more time on problems of estimating spectral densities; we merely note that there is an extensive statistical literature dealing with the construction of spectral windows and the comparison of the corresponding estimators $\hat{f}_N^W(\lambda; x)$.
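The constants in the three kernels $B$, $P$ and $Z$ are normalizations: each kernel has total integral one over the real line, so that the rescaled windows $a_N B(a_N \lambda)$, etc., approximately satisfy requirement (b) once $a_N$ is large. The following sketch checks this numerically; the function names, the integration range and the step are choices made for this illustration, and the Bartlett kernel is only integrated approximately because of its slowly decaying tails.

```python
import numpy as np

def bartlett_kernel(lam):
    """B(lambda) = (1/(2 pi)) * |sin(lambda/2) / (lambda/2)|^2."""
    # np.sinc(x) = sin(pi x) / (pi x), so np.sinc(lam / (2 pi)) = sin(lam/2) / (lam/2)
    return np.sinc(np.asarray(lam) / (2 * np.pi)) ** 2 / (2 * np.pi)

def parzen_kernel(lam):
    """P(lambda) = (3/(8 pi)) * |sin(lambda/4) / (lambda/4)|^4."""
    return 3 * np.sinc(np.asarray(lam) / (4 * np.pi)) ** 4 / (8 * np.pi)

def zhurbenko_kernel(lam, alpha):
    """Z(lambda) = (alpha+1)/(2 alpha) * (1 - |lambda|^alpha) on [-1, 1], zero elsewhere."""
    lam = np.asarray(lam, dtype=float)
    c = (alpha + 1) / (2 * alpha)
    return np.where(np.abs(lam) <= 1, c * (1 - np.abs(lam) ** alpha), 0.0)

def total_mass(kernel, half_width, step=1e-2):
    """Riemann-sum approximation of the integral of `kernel` over [-half_width, half_width)."""
    lam = np.arange(-half_width, half_width, step)
    return np.sum(kernel(lam)) * step
```

For instance, `total_mass(parzen_kernel, 500.0)` is close to one, and the same holds for the Zhurbenko kernel with any admissible `alpha` when integrated over $[-1, 1]$.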

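3. We now consider the problem of estimating the spectral function $F(\lambda) = F([-\pi, \lambda])$. We begin by defining

$$F_N(\lambda) = \int_{-\pi}^{\lambda} f_N(\nu)\, d\nu, \quad \hat{F}_N(\lambda; x) = \int_{-\pi}^{\lambda} \hat{f}_N(\nu; x)\, d\nu,$$

where $\hat{f}_N(\nu; x)$ is the periodogram constructed with $(x_0, x_1, \ldots, x_{N-1})$.
It follows from the proof of Herglotz' theorem (§1) that

$$\int_{-\pi}^{\pi} e^{i\lambda n}\, dF_N(\lambda) \to \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda)$$

for every $n \in Z$. Hence it follows (compare the corollary to Theorem 1, §3, Chapter III) that $F_N \Rightarrow F$, i.e. $F_N(\lambda)$ converges to $F(\lambda)$ at each point of continuity of $F(\lambda)$.
Observe that

$$\int_{-\pi}^{\pi} e^{i\lambda n}\, d\hat{F}_N(\lambda; \xi) = \hat{R}_N(n; \xi) \Bigl( 1 - \frac{|n|}{N} \Bigr)$$

for all $|n| < N$. Therefore if we suppose that $\hat{R}_N(n; \xi)$ converges to $R(n)$ with probability one as $N \to \infty$, we have

$$\int_{-\pi}^{\pi} e^{i\lambda n}\, d\hat{F}_N(\lambda; \xi) \to \int_{-\pi}^{\pi} e^{i\lambda n}\, dF(\lambda) \quad \text{(P-a.s.)}$$

and therefore $\hat{F}_N(\lambda; \xi) \Rightarrow F(\lambda)$ (P-a.s.).
It is then easy to deduce (if necessary, passing from a sequence to a subsequence) that if $\hat{R}_N(n; \xi) \to R(n)$ in probability, then $\hat{F}_N(\lambda; \xi) \Rightarrow F(\lambda)$ in probability.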
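4. PROBLEMS

1. In (15) let $\varepsilon_n \sim \mathcal{N}(0, 1)$. Show that

$$(N - n)\,V\hat R_N(n; \xi) \to 2\pi \int_{-\pi}^{\pi} \left(1 + e^{2in\lambda}\right) f^2(\lambda)\,d\lambda$$

for every $n$, as $N \to \infty$.

2. Establish (16) and the following generalization:

$$\lim_{N \to \infty} \operatorname{cov}\bigl(\hat f_N(\lambda; \xi), \hat f_N(\nu; \xi)\bigr) = \begin{cases} 2f^2(0), & \lambda = \nu = 0, \pm\pi, \\ f^2(\lambda), & \lambda = \nu \ne 0, \pm\pi, \\ 0, & \lambda \ne \nu. \end{cases}$$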

5. Wold's Expansion
1. In contrast to the representation (3.2), which gives an expansion of a
stationary sequence in the frequency domain, Wold's expansion operates
in the time domain. The main point of this expansion is that a stationary
sequence $\xi = (\xi_n)$, $n \in \mathbb{Z}$, can be represented as the sum of two stationary
sequences, one of which is completely predictable (in the sense that its
values are completely determined by its "past"), whereas the second does
not have this property.
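We begin with some definitions. Let $H_n(\xi) = \overline{L}^2(\xi^n)$ and $H(\xi) = \overline{L}^2(\xi)$
be closed linear manifolds, spanned respectively by $\xi^n = (\ldots, \xi_{n-1}, \xi_n)$ and
$\xi = (\ldots, \xi_{n-1}, \xi_n, \ldots)$. Let

$$S(\xi) = \bigcap_n H_n(\xi).$$

For every $\eta \in H(\xi)$, denote by

$$\hat\pi_n(\eta) = \hat E(\eta \mid H_n(\xi))$$

the projection of $\eta$ on the subspace $H_n(\xi)$ (see §11, Chapter II). We also write

$$\hat\pi_{-\infty}(\eta) = \hat E(\eta \mid S(\xi)).$$

Every element $\eta \in H(\xi)$ can be represented as

$$\eta = \hat\pi_{-\infty}(\eta) + \bigl(\eta - \hat\pi_{-\infty}(\eta)\bigr),$$

where $\eta - \hat\pi_{-\infty}(\eta) \perp \hat\pi_{-\infty}(\eta)$. Therefore $H(\xi)$ is represented as the orthogonal sum

$$H(\xi) = S(\xi) \oplus R(\xi),$$

where $S(\xi)$ consists of the elements $\hat\pi_{-\infty}(\eta)$ with $\eta \in H(\xi)$, and $R(\xi)$ consists
of the elements of the form $\eta - \hat\pi_{-\infty}(\eta)$.
We shall now assume that $E\xi_n = 0$ and $V\xi_n > 0$. Then $H(\xi)$ is automatically
nontrivial (contains elements different from zero).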

Definition 1. A stationary sequence $\xi = (\xi_n)$ is regular if

$$H(\xi) = R(\xi)$$

and singular if

$$H(\xi) = S(\xi).$$

Remark. Singular sequences are also called deterministic and regular
sequences are called purely or completely nondeterministic. If $S(\xi)$ is a proper
subspace of $H(\xi)$ we just say that $\xi$ is nondeterministic.
Theorem 1. Every stationary (wide sense) random sequence $\xi$ has a unique
decomposition

$$\xi_n = \xi_n^r + \xi_n^s, \tag{1}$$

where $\xi^r = (\xi_n^r)$ is regular and $\xi^s = (\xi_n^s)$ is singular. Here $\xi^r$ and $\xi^s$ are orthogonal
($\xi_n^r \perp \xi_m^s$ for all $n$ and $m$).

PROOF. We define

$$\xi_n^s = \hat E(\xi_n \mid S(\xi)), \qquad \xi_n^r = \xi_n - \xi_n^s.$$

Since $\xi_n^r \perp S(\xi)$ for every $n$, we have $S(\xi^r) \perp S(\xi)$. On the other hand, $S(\xi^r) \subseteq S(\xi)$
and therefore $S(\xi^r)$ is trivial (contains only random sequences that
coincide almost surely with zero). Consequently $\xi^r$ is regular.
Moreover, $H_n(\xi) \subseteq H_n(\xi^s) \oplus H_n(\xi^r)$ and $H_n(\xi^s) \subseteq H_n(\xi)$, $H_n(\xi^r) \subseteq H_n(\xi)$.
Therefore $H_n(\xi) = H_n(\xi^s) \oplus H_n(\xi^r)$ and hence

$$S(\xi) \subseteq H_n(\xi^s) \oplus H_n(\xi^r) \tag{2}$$

for every $n$. Since $\xi_n^r \perp S(\xi)$ it follows from (2) that

$$S(\xi) \subseteq H_n(\xi^s),$$

and therefore $S(\xi) \subseteq S(\xi^s) \subseteq H(\xi^s)$. But $\xi_n^s \in S(\xi)$; hence $H(\xi^s) \subseteq S(\xi)$ and
consequently

$$S(\xi) = S(\xi^s) = H(\xi^s),$$

which means that $\xi^s$ is singular.
The orthogonality of $\xi^s$ and $\xi^r$ follows in an obvious way from $\xi_n^s \in S(\xi)$
and $\xi_n^r \perp S(\xi)$.
Let us now show that (1) is unique. Let $\xi_n = \eta_n^r + \eta_n^s$, where $\eta^r$ and $\eta^s$ are
regular and singular orthogonal sequences. Then since $H_n(\eta^s) = H(\eta^s)$, we
have

$$H_n(\xi) = H_n(\eta^r) \oplus H_n(\eta^s) = H_n(\eta^r) \oplus H(\eta^s),$$

and therefore $S(\xi) = S(\eta^r) \oplus H(\eta^s)$. But $S(\eta^r)$ is trivial, and therefore
$S(\xi) = H(\eta^s)$.
Since $\eta_n^s \in H(\eta^s) = S(\xi)$ and $\eta_n^r \perp H(\eta^s) = S(\xi)$, we have
$\hat E(\xi_n \mid S(\xi)) = \hat E(\eta_n^r + \eta_n^s \mid S(\xi)) = \eta_n^s$, i.e. $\eta_n^s$ coincides with $\xi_n^s$; this establishes the uniqueness
of (1).
This completes the proof of the theorem.

2. Definition 2. Let $\xi = (\xi_n)$ be a nondegenerate stationary sequence. A
random sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi$) if
(a) $\varepsilon = (\varepsilon_n)$ consists of pairwise orthogonal random variables with $E\varepsilon_n = 0$,
$E|\varepsilon_n|^2 = 1$;
(b) $H_n(\xi) = H_n(\varepsilon)$ for all $n \in \mathbb{Z}$.

Remark. The reason for the term "innovation" is that $\varepsilon_{n+1}$ provides, so to
speak, new "information" not contained in $H_n(\xi)$ (in other words, "innovates"
in $H_n(\xi)$ the information that is needed for forming $H_{n+1}(\xi)$).
The following fundamental theorem establishes a connection between
one-sided moving averages (Example 4, §1) and regular sequences.

Theorem 2. A necessary and sufficient condition for a nondegenerate sequence
$\xi$ to be regular is that there are an innovation sequence $\varepsilon = (\varepsilon_n)$ and a sequence
$(a_n)$ of complex numbers, $n \ge 0$, with $\sum_{n=0}^{\infty} |a_n|^2 < \infty$, such that

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \qquad \text{(P-a.s.)} \tag{3}$$

PROOF. Necessity. We represent $H_n(\xi)$ in the form

$$H_n(\xi) = H_{n-1}(\xi) \oplus B_n.$$

Since $H_n(\xi)$ is spanned by elements of $H_{n-1}(\xi)$ and elements of the form
$\beta \xi_n$, where $\beta$ is a complex number, the dimension (dim) of $B_n$ is either zero
or one. But the space $H_n(\xi)$ cannot coincide with $H_{n-1}(\xi)$ for any value of $n$.
In fact, if $B_n$ is trivial for some $n$, then by stationarity $B_k$ is trivial for all $k$,
and therefore $H(\xi) = S(\xi)$, contradicting the assumption that $\xi$ is regular.
Thus $B_n$ has the dimension $\dim B_n = 1$. Let $\eta_n$ be a nonzero element of $B_n$.
Put

$$\varepsilon_n = \frac{\eta_n}{\|\eta_n\|},$$

where $\|\eta_n\|^2 = E|\eta_n|^2 > 0$.

For given $n$ and $k \ge 0$, consider the decomposition

$$H_n(\xi) = H_{n-k}(\xi) \oplus B_{n-k+1} \oplus \cdots \oplus B_n.$$

Then $\varepsilon_{n-k+1}, \ldots, \varepsilon_n$ is an orthogonal basis in $B_{n-k+1} \oplus \cdots \oplus B_n$ and

$$\xi_n = \sum_{j=0}^{k-1} a_j \varepsilon_{n-j} + \hat\pi_{n-k}(\xi_n), \tag{4}$$

where $a_j = E\xi_n \bar\varepsilon_{n-j}$.
By Bessel's inequality (II.11.16)

$$\sum_{j=0}^{\infty} |a_j|^2 \le \|\xi_n\|^2 < \infty.$$

It follows that $\sum_{j=0}^{\infty} a_j \varepsilon_{n-j}$ converges in mean square, and then, by (4),
equation (3) will be established as soon as we show that $\hat\pi_{n-k}(\xi_n) \xrightarrow{L^2} 0$, $k \to \infty$.

It is enough to consider the case $n = 0$. Since

and the terms that appear in this sum are orthogonal, we have for every
$k \ge 0$

Therefore the limit $\lim_{k \to \infty} \hat\pi_{-k}$ exists (in mean square). Now $\hat\pi_{-k} \in H_{-k}(\xi)$
for each $k$, and therefore the limit in question must belong to $\bigcap_{k \ge 0} H_{-k}(\xi) = S(\xi)$.
But, by assumption, $S(\xi)$ is trivial, and therefore $\hat\pi_{-k} \xrightarrow{L^2} 0$, $k \to \infty$.
Sufficiency. Let the nondegenerate sequence $\xi$ have a representation (3),
where $\varepsilon = (\varepsilon_n)$ is an orthonormal system (not necessarily satisfying the condition $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$). Then $H_n(\xi) \subseteq H_n(\varepsilon)$ and therefore
$S(\xi) = \bigcap_k H_k(\xi) \subseteq H_n(\varepsilon)$ for every $n$. But $\varepsilon_{n+1} \perp H_n(\varepsilon)$, and therefore $\varepsilon_{n+1} \perp S(\xi)$
and at the same time $\varepsilon = (\varepsilon_n)$ is a basis in $H(\xi)$. It follows that $S(\xi)$ is trivial,
and consequently $\xi$ is regular.
This completes the proof of the theorem.


Remark. It follows from the proof that a nondegenerate sequence $\xi$ is regular
if and only if it admits a representation as a one-sided moving average

$$\xi_n = \sum_{k=0}^{\infty} \tilde a_k \tilde\varepsilon_{n-k}, \tag{5}$$

where $\tilde\varepsilon = (\tilde\varepsilon_n)$ is an orthonormal system which (it is important to emphasize
this!) does not necessarily satisfy the condition $H_n(\xi) = H_n(\tilde\varepsilon)$, $n \in \mathbb{Z}$. In this
sense the conclusion of Theorem 2 says more, and specifically that for a
regular sequence $\xi$ there exist $a = (a_n)$ and an orthonormal system $\varepsilon = (\varepsilon_n)$
such that not only (5), but also (3), is satisfied, with $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$.

The following theorem is an immediate corollary of Theorems 1 and 2.

Theorem 3 (Wold's Expansion). If $\xi = (\xi_n)$ is a nondegenerate stationary
sequence, then

$$\xi_n = \xi_n^s + \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \tag{6}$$

where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi^r$).

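3. The significance of the concepts introduced here (regular and singular
sequences) becomes particularly clear if we consider the following (linear)
extrapolation problem, for whose solution the Wold expansion (6) is
especially useful.
Let $H_0(\xi) = \overline{L}^2(\xi^0)$ be the closed linear manifold spanned by the variables
$\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consider the problem of constructing an optimal (least-squares) linear estimator $\hat\xi_n$ of $\xi_n$ in terms of the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$.
It follows from §11, Chapter II, that

$$\hat\xi_n = \hat E(\xi_n \mid H_0(\xi)). \tag{7}$$

(In the notation of Subsection 1, $\hat\xi_n = \hat\pi_0(\xi_n)$.) Since $\xi^r$ and $\xi^s$ are orthogonal
and $H_0(\xi) = H_0(\xi^r) \oplus H_0(\xi^s)$, we obtain, by using (6),

$$\begin{aligned}
\hat\xi_n &= \hat E(\xi_n^s + \xi_n^r \mid H_0(\xi)) = \hat E(\xi_n^s \mid H_0(\xi)) + \hat E(\xi_n^r \mid H_0(\xi)) \\
&= \hat E(\xi_n^s \mid H_0(\xi^r) \oplus H_0(\xi^s)) + \hat E(\xi_n^r \mid H_0(\xi^r) \oplus H_0(\xi^s)) \\
&= \hat E(\xi_n^s \mid H_0(\xi^s)) + \hat E(\xi_n^r \mid H_0(\xi^r)) \\
&= \xi_n^s + \hat E\Bigl(\sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \Bigm| H_0(\xi^r)\Bigr).
\end{aligned}$$

In (6), the sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence for $\xi^r = (\xi_n^r)$ and
therefore $H_0(\xi^r) = H_0(\varepsilon)$. Therefore

$$\hat\xi_n = \xi_n^s + \hat E\Bigl(\sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \Bigm| H_0(\varepsilon)\Bigr) = \xi_n^s + \sum_{k=n}^{\infty} a_k \varepsilon_{n-k} \tag{8}$$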

and the mean-square error of predicting $\xi_n$ by $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is

$$\sigma_n^2 = E|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \tag{9}$$

We can draw two important conclusions.

(a) If $\xi$ is singular, then for every $n \ge 1$ the error (in the extrapolation) $\sigma_n^2$
is zero; in other words, we can predict $\xi_n$ without error from its "past"
$\xi^0 = (\ldots, \xi_{-1}, \xi_0)$.
(b) If $\xi$ is regular, then $\sigma_n^2 \le \sigma_{n+1}^2$ and

$$\lim_{n \to \infty} \sigma_n^2 = \sum_{k=0}^{\infty} |a_k|^2. \tag{10}$$

Since

$$\sum_{k=0}^{\infty} |a_k|^2 = E|\xi_n|^2,$$

it follows from (10) and (9) that

$$\hat\xi_n \xrightarrow{L^2} 0, \qquad n \to \infty;$$

i.e. as $n$ increases, the prediction of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ becomes
trivial (reducing simply to $E\xi_n = 0$).
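The prediction error of a regular sequence depends only on the coefficients of its one-sided moving average: for $n$ steps ahead it is the sum of the first $n$ squared moduli $|a_k|^2$, and it increases to $E|\xi_n|^2$. The following minimal Python sketch illustrates this; the function name `prediction_errors` and the coefficients $a_k = 2^{-k}$ (truncated at 60 terms) are hypothetical choices made only for the illustration.

```python
def prediction_errors(a, n_max):
    """Return [sigma_1^2, ..., sigma_{n_max}^2], where
    sigma_n^2 = sum_{k=0}^{n-1} |a_k|^2 for moving-average coefficients a."""
    errors = []
    acc = 0.0
    for n in range(1, n_max + 1):
        if n - 1 < len(a):
            acc += abs(a[n - 1]) ** 2
        errors.append(acc)
    return errors

# Hypothetical coefficients a_k = 2**(-k), truncated at 60 terms.
a = [0.5 ** k for k in range(60)]
errs = prediction_errors(a, 10)

# Variance of xi_n for the truncated expansion: sum of all |a_k|^2.
total = sum(abs(x) ** 2 for x in a)
```

Here `errs` is nondecreasing and bounded by `total`, in agreement with conclusion (b).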
4. Let us suppose that $\xi$ is a nondegenerate regular stationary sequence.
According to Theorem 2, every such sequence admits a representation as a
one-sided moving average

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \tag{11}$$

where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and the orthonormal sequence $\varepsilon = (\varepsilon_n)$ has the
important property that

$$H_n(\xi) = H_n(\varepsilon), \qquad n \in \mathbb{Z}. \tag{12}$$

The representation (11) means (see Subsection 3, §3) that $\xi_n$ can be
interpreted as the output signal of a physically realizable filter with impulse
response $a = (a_k)$, $k \ge 0$, when the input is $\varepsilon = (\varepsilon_n)$.
Like any sequence of two-sided moving averages, a regular sequence has
a spectral density $f(\lambda)$. But since a regular sequence admits a representation
as a one-sided moving average it is possible to obtain additional information
about properties of the spectral density.
In the first place, it is clear that

$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$

where

$$\varphi(\lambda) = \sum_{k=0}^{\infty} e^{-i\lambda k} a_k. \tag{13}$$

Put

$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \tag{14}$$

This function is analytic in the open domain $|z| < 1$ and since $\sum_{k=0}^{\infty} |a_k|^2 < \infty$
it belongs to the Hardy class $H^2$, the class of functions $g = g(z)$, analytic
in $|z| < 1$, satisfying

$$\sup_{0 \le r < 1} \frac{1}{2\pi} \int_{-\pi}^{\pi} |g(re^{i\theta})|^2\,d\theta < \infty. \tag{15}$$

In fact,

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |\Phi(re^{i\theta})|^2\,d\theta = \sum_{k=0}^{\infty} |a_k|^2 r^{2k}$$

and

$$\sup_{0 \le r < 1} \sum_{k=0}^{\infty} |a_k|^2 r^{2k} \le \sum_{k=0}^{\infty} |a_k|^2 < \infty.$$
It is shown in the theory of functions of a complex variable that the
boundary function $\Phi(e^{i\lambda})$, $-\pi \le \lambda < \pi$, of $\Phi \in H^2$, not identically zero, has
the property that

$$\int_{-\pi}^{\pi} \ln |\Phi(e^{i\lambda})|\,d\lambda > -\infty. \tag{16}$$

In our case

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2,$$

where $\Phi \in H^2$. Therefore

$$\ln f(\lambda) = -\ln 2\pi + 2 \ln |\Phi(e^{-i\lambda})|,$$

and consequently the spectral density $f(\lambda)$ of a regular process satisfies

$$\int_{-\pi}^{\pi} \ln f(\lambda)\,d\lambda > -\infty. \tag{17}$$

On the other hand, let the spectral density $f(\lambda)$ satisfy (17). It again follows
from the theory of functions of a complex variable that there is then a
function $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$ in the Hardy class $H^2$ such that (almost everywhere with respect to Lebesgue measure)

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2.$$

Therefore if we put $\varphi(\lambda) = \Phi(e^{-i\lambda})$ we obtain

$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$

where $\varphi(\lambda)$ is given by (13). Then it follows from the corollary to Theorem 3,
§3, that $\xi$ admits a representation as a one-sided moving average (11), where
$\varepsilon = (\varepsilon_n)$ is an orthonormal sequence. From this and from the Remark on
Theorem 2, it follows that $\xi$ is regular.
Thus we have the following theorem.

Theorem 4 (Kolmogorov). Let $\xi$ be a nondegenerate regular stationary
sequence. Then there is a spectral density $f(\lambda)$ such that

$$\int_{-\pi}^{\pi} \ln f(\lambda)\,d\lambda > -\infty. \tag{18}$$

In particular, $f(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure).
Conversely, if $\xi$ is a stationary sequence with a spectral density satisfying
(18), the sequence is regular.
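Condition (18) can be checked numerically for a concrete density. The sketch below (Python, standard library only) uses the density $f(\lambda) = (1/2\pi)(5 + 4\cos\lambda)$ treated in Example 1 of §6, and also evaluates the one-step prediction error by the Szegő-Kolmogorov formula quoted in Problem 3 of §6. The function names `f` and `log_integral` and the choice of a midpoint rule are assumptions of this illustration, not part of the text.

```python
import math

def f(lam):
    """Spectral density f(lam) = (5 + 4 cos lam) / (2 pi), i.e. |2 + e^{-i lam}|^2 / (2 pi)."""
    return (5.0 + 4.0 * math.cos(lam)) / (2.0 * math.pi)

def log_integral(density, steps=20_000):
    """Midpoint-rule approximation of the integral of ln(density) over [-pi, pi]."""
    h = 2.0 * math.pi / steps
    return h * sum(math.log(density(-math.pi + (j + 0.5) * h)) for j in range(steps))

log_int = log_integral(f)   # finite, so condition (18) holds for this density

# Szego-Kolmogorov formula: sigma_1^2 = 2 pi exp{ (1/2 pi) * integral of ln f }.
sigma1_sq = 2.0 * math.pi * math.exp(log_int / (2.0 * math.pi))
```

For this density the moving-average coefficients are $a_0 = 2$, $a_1 = 1$, so by (9) the one-step error should equal $|a_0|^2 = 4$, which is what `sigma1_sq` approximates.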
5. PROBLEMS

1. Show that a stationary sequence with discrete spectrum (piecewise-constant spectral
function $F(\lambda)$) is singular.
2. Let $\sigma_n^2 = E|\xi_n - \hat\xi_n|^2$, $\hat\xi_n = \hat E(\xi_n \mid H_0(\xi))$. Show that if $\sigma_n^2 = 0$ for some $n \ge 1$, the
sequence is singular; if $\sigma_n^2 \to R(0)$ as $n \to \infty$, the sequence is regular.
3. Show that the stationary sequence $\xi = (\xi_n)$, $\xi_n = e^{in\varphi}$, where $\varphi$ is a uniform random
variable on $[0, 2\pi]$, is regular. Find the estimator $\hat\xi_n$ and the number $\sigma_n^2$, and show
that the nonlinear estimator

$$\tilde\xi_n = \left(\frac{\xi_0}{\xi_{-1}}\right)^n$$

provides a correct estimate of $\xi_n$ by the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, i.e.

$$E|\tilde\xi_n - \xi_n|^2 = 0, \qquad n \ge 1.$$

6. Extrapolation, Interpolation and Filtering


1. Extrapolation. According to the preceding section, a singular sequence
admits an error-free prediction (extrapolation) of $\xi_n$, $n \ge 1$, in terms of the
"past," $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consequently it is reasonable, when considering
the problem of extrapolation for arbitrary stationary sequences, to begin
with the case of regular sequences.

According to Theorem 2 of §5, every regular sequence $\xi = (\xi_n)$ admits a
representation as a one-sided moving average,

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \tag{1}$$

with $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and some innovation sequence $\varepsilon = (\varepsilon_n)$. It follows
from §5 that the representation (1) solves the problem of finding the optimal
(linear) estimator $\hat\xi_n = \hat E(\xi_n \mid H_0(\xi))$ since, by (5.8),

$$\hat\xi_n = \sum_{k=n}^{\infty} a_k \varepsilon_{n-k} \tag{2}$$

and

$$\sigma_n^2 = E|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \tag{3}$$

However, this can be considered only as a theoretical solution, for the
following reasons.
The sequences that we consider are ordinarily not given to us by means
of their representations (1), but by their covariance functions $R(n)$ or the
spectral densities $f(\lambda)$ (which exist for regular sequences). Hence a solution
(2) can only be regarded as satisfactory if the coefficients $a_k$ are given in
terms of $R(n)$ or of $f(\lambda)$, and the $\varepsilon_k$ are given by the values $\ldots, \xi_{k-1}, \xi_k$.
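Without discussing the problem in general, we consider only the special
case (of interest in applications) when the spectral density has the form

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2, \tag{4}$$

where $\Phi(z) = \sum_{k=0}^{\infty} b_k z^k$ has radius of convergence $r > 1$ and has no zeros
in $|z| \le 1$.
Let

$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z(d\lambda) \tag{5}$$

be the spectral representation of $\xi = (\xi_n)$, $n \in \mathbb{Z}$.

Theorem 1. If the spectral density of $\xi$ has the form (4), then the optimal
(linear) estimator $\hat\xi_n$ of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is given by

$$\hat\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda), \tag{6}$$

where

$$\hat\varphi_n(\lambda) = e^{i\lambda n}\, \frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})} \tag{7}$$

and

$$\Phi_n(z) = \sum_{k=n}^{\infty} b_k z^k.$$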


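PROOF. According to the remark on Theorem 2 of §3, every variable
$\tilde\xi_n \in H_0(\xi)$ admits a representation in the form

$$\tilde\xi_n = \int_{-\pi}^{\pi} \tilde\varphi_n(\lambda) Z(d\lambda), \qquad \tilde\varphi_n \in H_0(F), \tag{8}$$

where $H_0(F)$ is the closed linear manifold spanned by the functions $e_n = e^{i\lambda n}$
for $n \le 0$ ($F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\,d\nu$).
Since

$$E|\xi_n - \tilde\xi_n|^2 = E\left|\int_{-\pi}^{\pi} \bigl(e^{i\lambda n} - \tilde\varphi_n(\lambda)\bigr) Z(d\lambda)\right|^2 = \int_{-\pi}^{\pi} |e^{i\lambda n} - \tilde\varphi_n(\lambda)|^2 f(\lambda)\,d\lambda,$$

the proof that (6) is optimal reduces to proving that

$$\inf_{\tilde\varphi_n \in H_0(F)} \int_{-\pi}^{\pi} |e^{i\lambda n} - \tilde\varphi_n(\lambda)|^2 f(\lambda)\,d\lambda = \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 f(\lambda)\,d\lambda. \tag{9}$$

It follows from Hilbert-space theory (§11, Chapter II) that the optimal
function $\hat\varphi_n(\lambda)$ (in the sense of (9)) is determined by the two conditions

$$\text{(1)}\ \ \hat\varphi_n(\lambda) \in H_0(F), \qquad \text{(2)}\ \ e^{i\lambda n} - \hat\varphi_n(\lambda) \perp H_0(F). \tag{10}$$

Since

$$e^{i\lambda n} \Phi_n(e^{-i\lambda}) = e^{i\lambda n}\bigl[b_n e^{-i\lambda n} + b_{n+1} e^{-i\lambda(n+1)} + \cdots\bigr] \in H_0(F)$$

and in a similar way $1/\Phi(e^{-i\lambda}) \in H_0(F)$, the function $\hat\varphi_n(\lambda)$ defined in (7)
belongs to $H_0(F)$. Therefore in proving that $\hat\varphi_n(\lambda)$ is optimal it is sufficient
to verify that, for every $m \ge 0$,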

ei'-n - cpn(.A.)

_L

ei'-m,

I.e.

rn 2:: 0.
The following chain of equations shows that this is actually the case:

n,m

2_

d).
f" eil(n-m)[1- <l>ie-.;;.)JI<l>(e-i'-)12
<l>(e ''-)

2n -n


where the last equation follows because, for m ~ 0 and r > 1,

f".

e- iAmeiM

dA. = 0.

This completes the proof of the theorem.

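Remark 1. Expanding $\hat\varphi_n(\lambda)$ in a Fourier series, $\hat\varphi_n(\lambda) = C_0 + C_{-1} e^{-i\lambda} + C_{-2} e^{-2i\lambda} + \cdots$, we find that the predicted
value $\hat\xi_n$ of $\xi_n$, $n \ge 1$, in terms of the past, $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, is given by the
formula

$$\hat\xi_n = C_0 \xi_0 + C_{-1} \xi_{-1} + C_{-2} \xi_{-2} + \cdots.$$

Remark 2. A typical example of a spectral density represented in the form
(4) is the rational function

$$f(\lambda) = \frac{1}{2\pi} \left| \frac{P(e^{-i\lambda})}{Q(e^{-i\lambda})} \right|^2,$$

where the polynomials $P(z) = a_0 + a_1 z + \cdots + a_p z^p$ and $Q(z) = 1 + b_1 z + \cdots + b_q z^q$
have no zeros in $\{z : |z| \le 1\}$.
In fact, in this case it is enough to put $\Phi(z) = P(z)/Q(z)$. Then $\Phi(z) = \sum_{k=0}^{\infty} C_k z^k$
and the radius of convergence of this series is greater than one.
Let us illustrate Theorem 1 with two examples.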

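EXAMPLE 1. Let the spectral density be

$$f(\lambda) = \frac{1}{2\pi} (5 + 4 \cos \lambda).$$

The corresponding covariance function $R(n)$ has the shape of a triangle with

$$R(0) = 5, \qquad R(\pm 1) = 2, \qquad R(n) = 0 \ \text{ for } |n| \ge 2. \tag{11}$$

Since this spectral density can be represented in the form

$$f(\lambda) = \frac{1}{2\pi} |2 + e^{-i\lambda}|^2,$$

we may apply Theorem 1. We find easily that

$$\hat\varphi_1(\lambda) = e^{i\lambda}\, \frac{e^{-i\lambda}}{2 + e^{-i\lambda}}, \qquad \hat\varphi_n(\lambda) = 0 \ \text{ for } n \ge 2. \tag{12}$$

Therefore $\hat\xi_n = 0$ for all $n \ge 2$, i.e. the (linear) prediction of $\xi_n$ in terms of
$\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is trivial, which is not at all surprising if we observe that,
by (11), the correlation between $\xi_n$ and any of $\xi_0, \xi_{-1}, \ldots$ is zero for $n \ge 2$.

For $n = 1$ we find from (6) and (12) that

$$\hat\xi_1 = \int_{-\pi}^{\pi} e^{i\lambda}\, \frac{e^{-i\lambda}}{2 + e^{-i\lambda}}\, Z(d\lambda) = \frac{1}{2} \int_{-\pi}^{\pi} \frac{1}{1 + \tfrac{1}{2} e^{-i\lambda}}\, Z(d\lambda) = \sum_{k=0}^{\infty} \frac{(-1)^k}{2^{k+1}} \int_{-\pi}^{\pi} e^{-ik\lambda} Z(d\lambda) = \sum_{k=0}^{\infty} \frac{(-1)^k \xi_{-k}}{2^{k+1}}.$$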
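The one-step predictor of Example 1 can be checked using only the covariance function (11). The following minimal Python sketch computes the mean-square error of a linear predictor $\sum_k c_k \xi_{-k}$ of $\xi_1$ with real weights; the function names `R` and `mse_one_step` and the truncation at 40 weights are assumptions of this illustration.

```python
def R(n):
    """Covariance function (11): R(0) = 5, R(+-1) = 2, R(n) = 0 for |n| >= 2."""
    n = abs(n)
    if n == 0:
        return 5.0
    if n == 1:
        return 2.0
    return 0.0

def mse_one_step(c):
    """E|xi_1 - sum_k c[k] xi_{-k}|^2 for real weights c, expressed through R."""
    K = len(c)
    cross = sum(c[k] * R(1 + k) for k in range(K))          # E xi_1 xi_{-k} = R(1 + k)
    quad = sum(c[j] * c[k] * R(j - k) for j in range(K) for k in range(K))
    return R(0) - 2.0 * cross + quad

# Weights of Example 1: c_k = (-1)^k / 2^(k+1), truncated at 40 terms.
c = [(-1.0) ** k / 2.0 ** (k + 1) for k in range(40)]
err = mse_one_step(c)

# For comparison: the best predictor that uses xi_0 alone has weight R(1)/R(0).
err_single = mse_one_step([R(1) / R(0)])
```

The value `err` is close to $|b_0|^2 = 4$, the one-step error given by formula (3) for $\Phi(z) = 2 + z$, while the predictor based on $\xi_0$ alone does slightly worse.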

EXAMPLE 2. Let the covariance function be

lal < 1.
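Then (see Example 5 in §1)

$$f(\lambda) = \frac{1}{2\pi}\, \frac{1 - |a|^2}{|1 - a e^{-i\lambda}|^2},$$

i.e.

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2,$$

where

$$\Phi(z) = \frac{(1 - |a|^2)^{1/2}}{1 - az} = (1 - |a|^2)^{1/2} \sum_{k=0}^{\infty} (az)^k,$$

from which $\hat\varphi_n(\lambda) = a^n$ and therefore

$$\hat\xi_n = \int_{-\pi}^{\pi} a^n Z(d\lambda) = a^n \xi_0.$$

In other words, in order to predict the value of $\xi_n$ from the observations
$\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ it is sufficient to know only the last observation $\xi_0$.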
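A short numerical check of Example 2, as a sketch under stated assumptions: we take $a$ real and use the covariance $R(n) = a^{|n|}$, which is the covariance function corresponding to the density of this example for real $a$. The Python function names below are illustrative. The code verifies that the residual $\xi_n - a^n \xi_0$ is orthogonal to every $\xi_{-k}$, $k \ge 0$, and that the prediction error equals $1 - a^{2n}$, in agreement with formula (3) for the coefficients $(1 - a^2)^{1/2} a^k$.

```python
def R(n, a):
    """Covariance R(n) = a**|n| (assumed: real a with |a| < 1)."""
    return a ** abs(n)

def residual_cov(n, k, a):
    """E (xi_n - a^n xi_0) xi_{-k}; zero for all k >= 0 if a^n xi_0 is the projection."""
    return R(n + k, a) - a ** n * R(k, a)

def mse(n, a):
    """E |xi_n - a^n xi_0|^2 expressed through R."""
    return R(0, a) - 2.0 * a ** n * R(n, a) + a ** (2 * n) * R(0, a)

def mse_from_wold(n, a):
    """sigma_n^2 = sum_{k<n} |a_k|^2 with a_k = sqrt(1 - a^2) * a^k."""
    return sum((1.0 - a * a) * a ** (2 * k) for k in range(n))

a = 0.7   # hypothetical value of the parameter
worst_residual = max(abs(residual_cov(n, k, a)) for n in range(1, 6) for k in range(0, 11))
```

For example, `mse(3, a)` and `mse_from_wold(3, a)` both approximate $1 - 0.7^6$.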

Remark 3. It follows from the Wold expansion of the regular sequence
$\xi = (\xi_n)$ with

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \tag{13}$$

that the spectral density $f(\lambda)$ admits the representation

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2, \tag{14}$$

where

$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \tag{15}$$

It is evident that the converse also holds, that is, if $f(\lambda)$ admits the representation (14) with a function $\Phi(z)$ of the form (15), then the Wold expansion of $\xi_n$
has the form (13). Therefore the problem of representing the spectral density
in the form (14) and the problem of determining the coefficients $a_k$ in the
Wold expansion are equivalent.

The assumptions that $\Phi(z)$ in Theorem 1 has no zeros for $|z| \le 1$ and that
$r > 1$ are in fact not essential. In other words, if the spectral density of a
regular sequence is represented in the form (14), then the optimal estimator $\hat\xi_n$
(in the mean square sense) for $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is determined by formulas (6) and (7).

Remark 4. Theorem 1 (with the preceding remark) solves the prediction


problem for regular sequences. Let us show that in fact the same answer
remains valid for arbitrary stationary sequences. More precisely, let

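and let $f^r(\lambda) = (1/2\pi)|\Phi(e^{-i\lambda})|^2$ be the spectral density of the regular
sequence $\xi^r = (\xi_n^r)$. Then $\hat\xi_n$ is determined by (6) and (7).
In fact, let (see Subsection 3, §5)

$$\hat\xi_n^r = \int_{-\pi}^{\pi} \hat\varphi_n^r(\lambda) Z^r(d\lambda),$$

where $Z^r(\Delta)$ is the orthogonal stochastic measure in the representation of the
regular sequence $\xi^r$. Then

$$\begin{aligned}
E|\xi_n - \hat\xi_n|^2 &= \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 F(d\lambda) \\
&\ge \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 f^r(\lambda)\,d\lambda \ge \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n^r(\lambda)|^2 f^r(\lambda)\,d\lambda \\
&= E|\xi_n^r - \hat\xi_n^r|^2.
\end{aligned} \tag{16}$$

But $\xi_n - \hat\xi_n = \xi_n^r - \hat\xi_n^r$. Hence $E|\xi_n - \hat\xi_n|^2 = E|\xi_n^r - \hat\xi_n^r|^2$, and it follows
from (16) that we may take $\hat\varphi_n(\lambda)$ to be $\hat\varphi_n^r(\lambda)$.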

2. Interpolation. Suppose that $\xi = (\xi_n)$ is a regular sequence with spectral
density $f(\lambda)$. The simplest interpolation problem is the problem of constructing the optimal (mean-square) linear estimator from the results of the
measurements $\{\xi_n, n = \pm 1, \pm 2, \ldots\}$ omitting $\xi_0$.
Let $H^0(\xi)$ be the closed linear manifold spanned by $\xi_n$, $n \ne 0$. Then
according to the results of Theorem 2, §3, every random variable $\eta \in H^0(\xi)$
can be represented in the form

$$\eta = \int_{-\pi}^{\pi} \varphi(\lambda) Z(d\lambda),$$

where $\varphi$ belongs to $H^0(F)$, the closed linear manifold spanned by the functions $e^{i\lambda n}$, $n \ne 0$. The estimator

$$\check\xi_0 = \int_{-\pi}^{\pi} \check\varphi(\lambda) Z(d\lambda) \tag{17}$$

will be optimal if and only if

$$\inf_{\eta \in H^0(\xi)} E|\xi_0 - \eta|^2 = \inf_{\varphi \in H^0(F)} \int_{-\pi}^{\pi} |1 - \varphi(\lambda)|^2 F(d\lambda) = \int_{-\pi}^{\pi} |1 - \check\varphi(\lambda)|^2 F(d\lambda) = E|\xi_0 - \check\xi_0|^2.$$

It follows from the perpendicularity properties of the Hilbert space
$H^0(F)$ that $\check\varphi(\lambda)$ is completely determined (compare (10)) by the two conditions

$$\text{(1)}\ \ \check\varphi(\lambda) \in H^0(F), \qquad \text{(2)}\ \ 1 - \check\varphi(\lambda) \perp H^0(F). \tag{18}$$

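Theorem 2 (Kolmogorov). Let $\xi = (\xi_n)$ be a regular sequence such that

$$\int_{-\pi}^{\pi} \frac{d\lambda}{f(\lambda)} < \infty. \tag{19}$$

Then

$$\check\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}, \tag{20}$$

where

$$\alpha = \frac{2\pi}{\displaystyle\int_{-\pi}^{\pi} \frac{d\lambda}{f(\lambda)}}, \tag{21}$$

and the interpolation error $\delta^2 = E|\xi_0 - \check\xi_0|^2$ is given by $\delta^2 = 2\pi \cdot \alpha$.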

PROOF. We shall give the proof only under very stringent hypotheses on the
spectral density, specifically that

$$0 < c \le f(\lambda) \le C < \infty. \tag{22}$$

It follows from condition (2) of (18) that

$$\int_{-\pi}^{\pi} [1 - \check\varphi(\lambda)]\, e^{in\lambda} f(\lambda)\,d\lambda = 0 \tag{23}$$

for every $n \ne 0$. By (22), the function $[1 - \check\varphi(\lambda)] f(\lambda)$ belongs to the Hilbert
space $L^2([-\pi, \pi], \mathscr{B}[-\pi, \pi], \mu)$ with Lebesgue measure $\mu$. In this space the
functions $\{e^{in\lambda}/\sqrt{2\pi},\ n = 0, \pm 1, \ldots\}$ form an orthonormal basis (Problem 7,
§11, Chapter II). Hence it follows from (23) that $[1 - \check\varphi(\lambda)] f(\lambda)$ is a constant,
which we denote by $\alpha$. Thus the second condition in (18) leads to the conclusion that

$$\check\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}. \tag{24}$$

Starting from the first condition (18), we now determine $\alpha$.
By (22), $\check\varphi \in L^2$ and the condition $\check\varphi \in H^0(F)$ is equivalent to the condition that $\check\varphi$ belongs to the closed (in the $L^2$ norm) linear manifold spanned
by the functions $e^{i\lambda n}$, $n \ne 0$. Hence it is clear that the zeroth coefficient in the
expansion of $\check\varphi(\lambda)$ must be zero. Therefore

$$0 = \int_{-\pi}^{\pi} \check\varphi(\lambda)\,d\lambda = 2\pi - \alpha \int_{-\pi}^{\pi} \frac{d\lambda}{f(\lambda)}$$

and hence $\alpha$ is determined by (21).
Finally,

$$\delta^2 = E|\xi_0 - \check\xi_0|^2 = \int_{-\pi}^{\pi} |1 - \check\varphi(\lambda)|^2 f(\lambda)\,d\lambda = |\alpha|^2 \int_{-\pi}^{\pi} \frac{f(\lambda)}{f^2(\lambda)}\,d\lambda = \frac{4\pi^2}{\displaystyle\int_{-\pi}^{\pi} \frac{d\lambda}{f(\lambda)}}.$$

This completes the proof (under condition (22)).
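Formulas (20) and (21) lend themselves to a direct numerical check. The sketch below (Python, standard library only) evaluates $\alpha$, the interpolation error $\delta^2 = 2\pi\alpha$ and the function $\check\varphi$ for the density of Example 2 above, under the assumption that the parameter $a$ is real; the function names `integrate` and `interpolation` and the value $a = 0.5$ are hypothetical choices for the illustration.

```python
import math

def integrate(g, steps=20_000):
    """Midpoint rule for the integral of g over [-pi, pi]."""
    h = 2.0 * math.pi / steps
    return h * sum(g(-math.pi + (j + 0.5) * h) for j in range(steps))

def interpolation(f):
    """Return (alpha, delta^2, phi) from (21), delta^2 = 2 pi alpha, and (20)."""
    alpha = 2.0 * math.pi / integrate(lambda lam: 1.0 / f(lam))
    delta_sq = 2.0 * math.pi * alpha
    phi = lambda lam: 1.0 - alpha / f(lam)
    return alpha, delta_sq, phi

a = 0.5   # hypothetical real parameter, |a| < 1

def f(lam):
    """Density of Example 2 for real a: (1 - a^2) / (2 pi |1 - a e^{-i lam}|^2)."""
    return (1.0 - a * a) / (2.0 * math.pi * (1.0 - 2.0 * a * math.cos(lam) + a * a))

alpha, delta_sq, phi = interpolation(f)
zeroth_coefficient = integrate(phi) / (2.0 * math.pi)
```

For real $a$ a direct calculation gives $\check\varphi(\lambda) = 2a\cos\lambda/(1 + a^2)$ and $\delta^2 = (1 - a^2)/(1 + a^2)$, so the optimal interpolator uses only the two neighbours $\xi_{-1}$ and $\xi_1$; the numerical values agree with these expressions.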


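Corollary. If

$$\check\varphi(\lambda) = \sum_{0 < |k| \le N} c_k e^{i\lambda k},$$

then

$$\check\xi_0 = \sum_{0 < |k| \le N} c_k \int_{-\pi}^{\pi} e^{i\lambda k} Z(d\lambda) = \sum_{0 < |k| \le N} c_k \xi_k.$$

EXAMPLE 3. Let $f(\lambda)$ be the spectral density in Example 2 above. Then an
easy calculation shows that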

and the interpolation error is

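3. Filtering. Let $(\theta, \xi) = ((\theta_n), (\xi_n))$, $n \in \mathbb{Z}$, be a partially observed sequence,
where $\theta = (\theta_n)$ and $\xi = (\xi_n)$ are respectively the unobserved and the observed
components.

Each of the sequences $\theta$ and $\xi$ will be supposed stationary (wide sense)
with zero mean; let their spectral representations be

$$\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\theta(d\lambda) \qquad \text{and} \qquad \xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\xi(d\lambda).$$

We write

$$F_\theta(\Delta) = E|Z_\theta(\Delta)|^2, \qquad F_\xi(\Delta) = E|Z_\xi(\Delta)|^2$$

and

$$F_{\theta\xi}(\Delta) = E Z_\theta(\Delta) \overline{Z_\xi(\Delta)}.$$

In addition, we suppose that $\theta$ and $\xi$ are connected in a stationary way, i.e.
that their covariance functions $\operatorname{cov}(\theta_n, \xi_m) = E\theta_n \bar\xi_m$ depend only on the
differences $n - m$. Let $R_{\theta\xi}(n) = E\theta_n \bar\xi_0$; then

$$R_{\theta\xi}(n) = \int_{-\pi}^{\pi} e^{i\lambda n} F_{\theta\xi}(d\lambda).$$

The filtering problem that we shall consider is the construction of the
optimal (mean-square) linear estimator $\hat\theta_n$ of $\theta_n$ in terms of some observation
of the sequence $\xi$.
The problem is easily solved under the assumption that $\hat\theta_n$ is to be constructed from all the values $\xi_m$, $m \in \mathbb{Z}$. In fact, since $\hat\theta_n = \hat E(\theta_n \mid H(\xi))$ there is
a function $\hat\varphi_n(\lambda)$ such that

$$\hat\theta_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z_\xi(d\lambda). \tag{25}$$

As in Subsections 1 and 2, the conditions to impose on the optimal $\hat\varphi_n(\lambda)$
are that

$$\text{(1)}\ \ \hat\varphi_n(\lambda) \in H(F_\xi), \qquad \text{(2)}\ \ \theta_n - \hat\theta_n \perp H(\xi).$$

From the latter condition we find

$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)} F_{\theta\xi}(d\lambda) - \int_{-\pi}^{\pi} e^{-i\lambda m} \hat\varphi_n(\lambda) F_\xi(d\lambda) = 0 \tag{26}$$

for every $m \in \mathbb{Z}$. Therefore if we suppose that $F_{\theta\xi}(\lambda)$ and $F_\xi(\lambda)$ have densities
$f_{\theta\xi}(\lambda)$ and $f_\xi(\lambda)$, we find from (26) that

$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)} \bigl[f_{\theta\xi}(\lambda) - e^{-i\lambda n} \hat\varphi_n(\lambda) f_\xi(\lambda)\bigr]\,d\lambda = 0.$$

If $f_\xi(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure) we
find immediately that

$$\hat\varphi_n(\lambda) = e^{i\lambda n} \hat\varphi(\lambda), \tag{27}$$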

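where

$$\hat\varphi(\lambda) = f_{\theta\xi}(\lambda) \cdot f_\xi^{\oplus}(\lambda)$$

and $f_\xi^{\oplus}(\lambda)$ is the "pseudotransform" of $f_\xi(\lambda)$, i.e.

$$f_\xi^{\oplus}(\lambda) = \begin{cases} [f_\xi(\lambda)]^{-1}, & f_\xi(\lambda) > 0, \\ 0, & f_\xi(\lambda) = 0. \end{cases}$$

Then the filtering error is

$$E|\theta_n - \hat\theta_n|^2 = \int_{-\pi}^{\pi} \bigl[f_\theta(\lambda) - |f_{\theta\xi}(\lambda)|^2 f_\xi^{\oplus}(\lambda)\bigr]\,d\lambda. \tag{28}$$

As is easily verified, $\hat\varphi \in H(F_\xi)$, and consequently the estimator (25), with
the function (27), is optimal.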
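The solution (27) and the error formula (28) can be evaluated numerically once the densities are given. The following is a minimal Python sketch, not part of the original text: the function names are illustrative, and the test case is a hypothetical one in which the observation is the sum of an unobserved white signal of variance $s$ and an uncorrelated white noise of variance $v$, so that $f_\theta = s/2\pi$, $f_{\theta\xi} = f_\theta$ and $f_\xi = (s + v)/2\pi$.

```python
import math

def pseudo_inverse(x):
    """The 'pseudotransform' of a nonnegative number: 1/x if x > 0, else 0."""
    return 1.0 / x if x > 0.0 else 0.0

def transfer_function(f_theta_xi, f_xi):
    """phi(lam) = f_theta_xi(lam) * pseudo-inverse of f_xi(lam), as in (27)."""
    return lambda lam: f_theta_xi(lam) * pseudo_inverse(f_xi(lam))

def filtering_error(f_theta, f_theta_xi, f_xi, steps=20_000):
    """Midpoint-rule value of (28): integral over [-pi, pi] of
    f_theta - |f_theta_xi|^2 * pseudo-inverse(f_xi)."""
    h = 2.0 * math.pi / steps
    total = 0.0
    for j in range(steps):
        lam = -math.pi + (j + 0.5) * h
        total += f_theta(lam) - abs(f_theta_xi(lam)) ** 2 * pseudo_inverse(f_xi(lam))
    return h * total

# Hypothetical inputs: white signal (variance s) plus uncorrelated white noise (variance v).
s, v = 1.0, 3.0
f_theta = lambda lam: s / (2.0 * math.pi)
f_theta_xi = f_theta                      # uncorrelated noise: cross density equals f_theta
f_xi = lambda lam: (s + v) / (2.0 * math.pi)

phi = transfer_function(f_theta_xi, f_xi)
err = filtering_error(f_theta, f_theta_xi, f_xi)
```

In this case the filter reduces to the constant $s/(s + v)$ and the error to $sv/(s + v)$, the familiar answer for estimating one of two uncorrelated summands from their sum.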
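EXAMPLE 4. Detection of a signal in the presence of noise. Let $\xi_n = \theta_n + \eta_n$,
where the signal $\theta = (\theta_n)$ and the noise $\eta = (\eta_n)$ are uncorrelated sequences
with spectral densities $f_\theta(\lambda)$ and $f_\eta(\lambda)$. Then

$$\hat\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n} \hat\varphi(\lambda) Z_\xi(d\lambda),$$

where

$$\hat\varphi(\lambda) = f_\theta(\lambda)\,[f_\theta(\lambda) + f_\eta(\lambda)]^{\oplus},$$

and the filtering error is

$$E|\theta_n - \hat\theta_n|^2 = \int_{-\pi}^{\pi} [f_\theta(\lambda) f_\eta(\lambda)]\,[f_\theta(\lambda) + f_\eta(\lambda)]^{\oplus}\,d\lambda.$$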
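The solution (25) obtained above can now be used to construct an optimal
estimator $\tilde\theta_{n+m}$ of $\theta_{n+m}$ as a result of observing $\xi_k$, $k \le n$, where $m$ is a given
element of $\mathbb{Z}$. Let us suppose that $\xi = (\xi_n)$ is regular, with spectral density

$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2,$$

where $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$. By the Wold expansion,

$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k},$$

where $\varepsilon = (\varepsilon_k)$ is white noise with the spectral resolution

$$\varepsilon_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\varepsilon(d\lambda).$$

Since

$$\tilde\theta_{n+m} = \hat E[\theta_{n+m} \mid H_n(\xi)] = \hat E\bigl[\hat E[\theta_{n+m} \mid H(\xi)] \bigm| H_n(\xi)\bigr] = \hat E[\hat\theta_{n+m} \mid H_n(\xi)]$$

and

$$\hat\theta_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda(n+m)} \hat\varphi(\lambda) \Phi(e^{-i\lambda}) Z_\varepsilon(d\lambda) = \sum_{k \le n+m} \hat a_{n+m-k}\, \varepsilon_k,$$

where

$$\hat a_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i\lambda k} \hat\varphi(\lambda) \Phi(e^{-i\lambda})\,d\lambda, \tag{29}$$

then

$$\tilde\theta_{n+m} = \hat E\Bigl[\sum_{k \le n+m} \hat a_{n+m-k}\, \varepsilon_k \Bigm| H_n(\xi)\Bigr].$$

But $H_n(\xi) = H_n(\varepsilon)$ and therefore

$$\tilde\theta_{n+m} = \sum_{k \le n} \hat a_{n+m-k}\, \varepsilon_k = \int_{-\pi}^{\pi} \Bigl[\sum_{k \le n} \hat a_{n+m-k}\, e^{i\lambda k}\Bigr] Z_\varepsilon(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n} \Bigl[\sum_{l=0}^{\infty} \hat a_{l+m}\, e^{-i\lambda l}\Bigr] \Phi^{\oplus}(e^{-i\lambda}) Z_\xi(d\lambda),$$

where $\Phi^{\oplus}$ is the pseudotransform of $\Phi$.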


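We have therefore established the following theorem.
Theorem 3. If the sequence $\xi = (\xi_n)$ under observation is regular, then the
optimal (mean-square) linear estimator $\tilde\theta_{n+m}$ of $\theta_{n+m}$ in terms of $\xi_k$, $k \le n$,
is given by

$$\tilde\theta_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda n} H_m(e^{-i\lambda}) Z_\xi(d\lambda), \tag{30}$$

where

$$H_m(e^{-i\lambda}) = \sum_{l=0}^{\infty} \hat a_{l+m}\, e^{-i\lambda l}\, \Phi^{\oplus}(e^{-i\lambda}) \tag{31}$$

and the coefficients $\hat a_k$ are defined by (29).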


4. PROBLEMS

1. Let $\xi$ be a nondegenerate regular sequence with spectral density (4). Show that $\Phi(z)$
has no zeros for $|z| \le 1$.
2. Show that the conclusion of Theorem 1 remains valid even without the hypotheses
that $\Phi(z)$ has radius of convergence $r > 1$ and that the zeros of $\Phi(z)$ all lie in $|z| > 1$.
3. Show that, for a regular process, the function $\Phi(z)$ introduced in (4) can be represented in the form

$$\Phi(z) = \sqrt{2\pi}\, \exp\Bigl\{\tfrac{1}{2} c_0 + \sum_{k=1}^{\infty} c_k z^k\Bigr\}, \qquad |z| < 1,$$

where

$$c_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{ik\lambda} \ln f(\lambda)\,d\lambda.$$

Deduce from this formula and (5.9) that the one-step prediction error $\sigma_1^2 = E|\hat\xi_1 - \xi_1|^2$
is given by the Szegő-Kolmogorov formula

$$\sigma_1^2 = 2\pi \exp\Bigl\{\frac{1}{2\pi} \int_{-\pi}^{\pi} \ln f(\lambda)\,d\lambda\Bigr\}.$$

4. Prove Theorem 2 without using (22).

5. Let a signal $\theta$ and a noise $\eta$, not correlated with each other, have spectral densities

$$f_\theta(\lambda) = \frac{1}{2\pi} \cdot \frac{1}{|1 + b_1 e^{-i\lambda}|^2} \qquad \text{and} \qquad f_\eta(\lambda) = \frac{1}{2\pi} \cdot \frac{1}{|1 + b_2 e^{-i\lambda}|^2}.$$

Using Theorem 3, find an estimator $\tilde\theta_{n+m}$ for $\theta_{n+m}$ in terms of $\xi_k$, $k \le n$, where
$\xi_k = \theta_k + \eta_k$. Consider the same problem for the spectral densities

$$f_\theta(\lambda) = \frac{1}{2\pi} |2 + e^{-i\lambda}|^2 \qquad \text{and} \qquad f_\eta(\lambda) = \frac{1}{2\pi}.$$
7. The Kalman-Bucy Filter and Its Generalizations


1. From a computational point of view, the solution presented above for
the problem of filtering out an unobservable component $\theta$ by means of
observations of $\xi$ is not practical, since, because it is expressed in terms of the
spectrum, it has to be carried out by spectral methods. In the method proposed
by Kalman and Bucy, the synthesis of the optimal filter is carried out recursively; this makes it possible to do it with a digital computer. There are
also other reasons for the wide use of the Kalman-Bucy filter, one being that
it still "works" even without the assumption that the sequence $(\theta, \xi)$ is
stationary.
We shall present not only the usual Kalman-Bucy method, but also
a generalization in which the recurrent equations determined by $(\theta, \xi)$
have coefficients that depend on all the data observed in the past.
Thus, let us suppose that $(\theta, \xi) = ((\theta_n), (\xi_n))$ is a partially observed
sequence, and let

$$\theta_n = (\theta_1(n), \ldots, \theta_k(n)) \qquad \text{and} \qquad \xi_n = (\xi_1(n), \ldots, \xi_l(n))$$

be governed by the recurrent equations

$$\begin{aligned}
\theta_{n+1} &= a_0(n, \xi) + a_1(n, \xi)\theta_n + b_1(n, \xi)\varepsilon_1(n+1) + b_2(n, \xi)\varepsilon_2(n+1), \\
\xi_{n+1} &= A_0(n, \xi) + A_1(n, \xi)\theta_n + B_1(n, \xi)\varepsilon_1(n+1) + B_2(n, \xi)\varepsilon_2(n+1).
\end{aligned} \tag{1}$$

Here

$$\varepsilon_1(n) = (\varepsilon_{11}(n), \ldots, \varepsilon_{1k}(n)) \qquad \text{and} \qquad \varepsilon_2(n) = (\varepsilon_{21}(n), \ldots, \varepsilon_{2l}(n))$$

are independent Gaussian vectors with independent components, each of
which is normally distributed with parameters 0 and 1; $a_0(n, \xi) = (a_{01}(n, \xi), \ldots, a_{0k}(n, \xi))$
and $A_0(n, \xi) = (A_{01}(n, \xi), \ldots, A_{0l}(n, \xi))$ are vector functions,
where the dependence on $\xi = \{\xi_0, \ldots, \xi_n\}$ is determined without looking
ahead, i.e. for a given $n$ the functions $a_{01}(n, \xi), \ldots, A_{0l}(n, \xi)$ depend only on
$\xi_0, \ldots, \xi_n$; the matrix functions

$$\begin{aligned}
b_1(n, \xi) &= \|b_{ij}^{(1)}(n, \xi)\|, & b_2(n, \xi) &= \|b_{ij}^{(2)}(n, \xi)\|, \\
B_1(n, \xi) &= \|B_{ij}^{(1)}(n, \xi)\|, & B_2(n, \xi) &= \|B_{ij}^{(2)}(n, \xi)\|, \\
a_1(n, \xi) &= \|a_{ij}^{(1)}(n, \xi)\|, & A_1(n, \xi) &= \|A_{ij}^{(1)}(n, \xi)\|
\end{aligned}$$

have orders $k \times k$, $k \times l$, $l \times k$, $l \times l$, $k \times k$, $l \times k$, respectively, and also
depend on $\xi$ without looking ahead. We also suppose that the initial vector
$(\theta_0, \xi_0)$ is independent of the sequences $\varepsilon_1 = (\varepsilon_1(n))$ and $\varepsilon_2 = (\varepsilon_2(n))$.
To simplify the presentation, we shall frequently not indicate the dependence of the coefficients on $\xi$.
So that the system (1) will have a solution with finite second moments,
we assume that $E(\|\theta_0\|^2 + \|\xi_0\|^2) < \infty$

and if $g(n, \xi)$ is any of the functions $a_{0i}$, $A_{0j}$, $b_{ij}^{(1)}$, $b_{ij}^{(2)}$, $B_{ij}^{(1)}$ or $B_{ij}^{(2)}$ then
$E|g(n, \xi)|^2 < \infty$, $n = 0, 1, \ldots$. With these assumptions, $(\theta, \xi)$ has
$E(\|\theta_n\|^2 + \|\xi_n\|^2) < \infty$, $n \ge 0$.
Now let $\mathscr{F}_n^\xi = \sigma\{\omega : \xi_0, \ldots, \xi_n\}$ be the smallest $\sigma$-algebra generated by
$\xi_0, \ldots, \xi_n$ and

$$m_n = E(\theta_n \mid \mathscr{F}_n^\xi), \qquad \gamma_n = E[(\theta_n - m_n)(\theta_n - m_n)^* \mid \mathscr{F}_n^\xi].$$

According to Theorem 1, §8, Chapter II, $m_n = (m_1(n), \ldots, m_k(n))$ is an optimal
estimator (in the mean square sense) for the vector $\theta_n = (\theta_1(n), \ldots, \theta_k(n))$,
and $E\gamma_n = E[(\theta_n - m_n)(\theta_n - m_n)^*]$ is the matrix of errors of observation.
To determine these matrices for arbitrary sequences $(\theta, \xi)$ governed by
equations (1) is a very difficult problem. However, there is a further supplementary condition on $(\theta_0, \xi_0)$ that leads to a system of recurrent equations
for $m_n$ and $\gamma_n$ that still contains the Kalman-Bucy filter. This is the condition
that the conditional distribution $P(\theta_0 \le a \mid \xi_0)$ is Gaussian,

$$P(\theta_0 \le a \mid \xi_0) = \frac{1}{\sqrt{2\pi\gamma_0}} \int_{-\infty}^{a} \exp\Bigl\{-\frac{(x - m_0)^2}{2\gamma_0}\Bigr\}\,dx, \tag{2}$$

with parameters $m_0 = m_0(\xi_0)$, $\gamma_0 = \gamma_0(\xi_0)$.
To begin with, let us establish an important auxiliary result.

Lemma 1. Under the assumptions made above about the coefficients of (1),
together with (2), the sequence $(\theta, \xi)$ is conditionally Gaussian, i.e. the conditional distribution function

$$P\{\theta_0 \le a_0, \ldots, \theta_n \le a_n \mid \mathscr{F}_n^\xi\}$$

is (P-a.s.) the distribution function of an $n$-dimensional Gaussian vector whose
mean and covariance matrix depend on $(\xi_0, \ldots, \xi_n)$.
PROOF. We prove only the Gaussian character of $P(\theta_n \le a \mid \mathscr{F}_n^\xi)$; this is
enough to let us obtain equations for $m_n$ and $\gamma_n$.
First we observe that (1) implies that the conditional distribution

$$P(\theta_{n+1} \le a_1, \xi_{n+1} \le x \mid \mathscr{F}_n^\xi, \theta_n = b)$$

is Gaussian with mean-value vector

$$\mathbb{A}_0 + \mathbb{A}_1 b = \begin{pmatrix} a_0 + a_1 b \\ A_0 + A_1 b \end{pmatrix}$$

and covariance matrix

$$\mathbb{B} = \begin{pmatrix} b \circ b & b \circ B \\ (b \circ B)^* & B \circ B \end{pmatrix},$$

where $b \circ b = b_1 b_1^* + b_2 b_2^*$, $b \circ B = b_1 B_1^* + b_2 B_2^*$, $B \circ B = B_1 B_1^* + B_2 B_2^*$.
Let $\zeta_n = (\theta_n, \xi_n)$ and $t = (t_1, \ldots, t_{k+l})$. Then

$$E[\exp(it^* \zeta_{n+1}) \mid \mathscr{F}_n^\xi, \theta_n] = \exp\{it^*(\mathbb{A}_0(n, \xi) + \mathbb{A}_1(n, \xi)\theta_n) - \tfrac{1}{2} t^* \mathbb{B}(n, \xi) t\}. \tag{3}$$

Suppose now that the conclusion of the lemma holds for some $n \ge 0$. Then

$$E[\exp(it^* \mathbb{A}_1(n, \xi)\theta_n) \mid \mathscr{F}_n^\xi] = \exp\bigl(it^* \mathbb{A}_1(n, \xi) m_n - \tfrac{1}{2} t^* (\mathbb{A}_1(n, \xi) \gamma_n \mathbb{A}_1^*(n, \xi)) t\bigr). \tag{4}$$

Let us show that (4) is also valid when $n$ is replaced by $n + 1$.
From (3) and (4), we have

$$E[\exp(it^* \zeta_{n+1}) \mid \mathscr{F}_n^\xi] = \exp\{it^*(\mathbb{A}_0(n, \xi) + \mathbb{A}_1(n, \xi) m_n) - \tfrac{1}{2} t^* \mathbb{B}(n, \xi) t - \tfrac{1}{2} t^* (\mathbb{A}_1(n, \xi) \gamma_n \mathbb{A}_1^*(n, \xi)) t\}.$$

Hence the conditional distribution

$$P(\theta_{n+1} \le a, \xi_{n+1} \le x \mid \mathscr{F}_n^\xi) \tag{5}$$

is Gaussian.
As in the proof of the theorem on normal correlation (Theorem 2, §13,
Chapter II) we can verify that there is a matrix $C$ such that the vector

$$\eta = [\theta_{n+1} - E(\theta_{n+1} \mid \mathscr{F}_n^\xi)] - C[\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]$$

has the property that (P-a.s.)

$$E[\eta(\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi))^* \mid \mathscr{F}_n^\xi] = 0.$$

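It follows that the conditionally-Gaussian vectors $\eta$ and $\xi_{n+1}$, considered
under the condition $\mathscr{F}_n^\xi$, are independent, i.e.

$$P(\eta \in A, \xi_{n+1} \in B \mid \mathscr{F}_n^\xi) = P(\eta \in A \mid \mathscr{F}_n^\xi) \cdot P(\xi_{n+1} \in B \mid \mathscr{F}_n^\xi)$$

for all $A \in \mathscr{B}(R^k)$, $B \in \mathscr{B}(R^l)$.
Therefore if $s = (s_1, \ldots, s_k)$ then

$$\begin{aligned}
E[\exp(is^* \theta_{n+1}) \mid \mathscr{F}_n^\xi, \xi_{n+1}]
&= E\{\exp(is^*[E(\theta_{n+1} \mid \mathscr{F}_n^\xi) + \eta + C[\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]]) \mid \mathscr{F}_n^\xi, \xi_{n+1}\} \\
&= \exp\{is^*[E(\theta_{n+1} \mid \mathscr{F}_n^\xi) + C[\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]]\}\, E[\exp(is^* \eta) \mid \mathscr{F}_n^\xi, \xi_{n+1}] \\
&= \exp\{is^*[E(\theta_{n+1} \mid \mathscr{F}_n^\xi) + C[\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]]\}\, E(\exp(is^* \eta) \mid \mathscr{F}_n^\xi).
\end{aligned} \tag{6}$$

By (5), the conditional distribution $P(\eta \le y \mid \mathscr{F}_n^\xi)$ is Gaussian. With (6),
this shows that the conditional distribution $P(\theta_{n+1} \le a \mid \mathscr{F}_{n+1}^\xi)$ is also
Gaussian.
This completes the proof of the lemma.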

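Theorem 1. Let $(\theta, \xi)$ be a partial observation of a sequence that satisfies the
system (1) and condition (2). Then $(m_n, \gamma_n)$ obey the following recursion
relations:

$$m_{n+1} = [a_0 + a_1 m_n] + [b \circ B + a_1 \gamma_n A_1^*]\,[B \circ B + A_1 \gamma_n A_1^*]^{\oplus}\,[\xi_{n+1} - A_0 - A_1 m_n], \tag{7}$$

$$\gamma_{n+1} = [a_1 \gamma_n a_1^* + b \circ b] - [b \circ B + a_1 \gamma_n A_1^*]\,[B \circ B + A_1 \gamma_n A_1^*]^{\oplus}\,[b \circ B + a_1 \gamma_n A_1^*]^*. \tag{8}$$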

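PROOF. From (1),

$$E(\theta_{n+1} \mid \mathscr{F}_n^\xi) = a_0 + a_1 m_n, \qquad E(\xi_{n+1} \mid \mathscr{F}_n^\xi) = A_0 + A_1 m_n \tag{9}$$

and

$$\begin{aligned}
\theta_{n+1} - E(\theta_{n+1} \mid \mathscr{F}_n^\xi) &= a_1[\theta_n - m_n] + b_1\varepsilon_1(n + 1) + b_2\varepsilon_2(n + 1), \\
\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi) &= A_1[\theta_n - m_n] + B_1\varepsilon_1(n + 1) + B_2\varepsilon_2(n + 1).
\end{aligned} \tag{10}$$

Let us write

$$\begin{aligned}
d_{11} &= \operatorname{cov}(\theta_{n+1}, \theta_{n+1} \mid \mathscr{F}_n^\xi) = E\{[\theta_{n+1} - E(\theta_{n+1} \mid \mathscr{F}_n^\xi)][\theta_{n+1} - E(\theta_{n+1} \mid \mathscr{F}_n^\xi)]^* \mid \mathscr{F}_n^\xi\}, \\
d_{12} &= \operatorname{cov}(\theta_{n+1}, \xi_{n+1} \mid \mathscr{F}_n^\xi) = E\{[\theta_{n+1} - E(\theta_{n+1} \mid \mathscr{F}_n^\xi)][\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]^* \mid \mathscr{F}_n^\xi\}, \\
d_{22} &= \operatorname{cov}(\xi_{n+1}, \xi_{n+1} \mid \mathscr{F}_n^\xi) = E\{[\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)][\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi)]^* \mid \mathscr{F}_n^\xi\}.
\end{aligned}$$

Then, by (10),

$$d_{11} = a_1\gamma_n a_1^* + b \circ b, \qquad d_{12} = a_1\gamma_n A_1^* + b \circ B, \qquad d_{22} = A_1\gamma_n A_1^* + B \circ B. \tag{11}$$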

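By the theorem on normal correlation (see Theorem 2 and Problem 4, §13, Chapter II),

$$m_{n+1} = E(\theta_{n+1} \mid \mathscr{F}_n^\xi, \xi_{n+1}) = E(\theta_{n+1} \mid \mathscr{F}_n^\xi) + d_{12}d_{22}^{\oplus}(\xi_{n+1} - E(\xi_{n+1} \mid \mathscr{F}_n^\xi))$$

and

$$\gamma_{n+1} = \operatorname{cov}(\theta_{n+1}, \theta_{n+1} \mid \mathscr{F}_n^\xi, \xi_{n+1}) = d_{11} - d_{12}d_{22}^{\oplus}d_{12}^*.$$

If we then use the expressions from (9) for $E(\theta_{n+1} \mid \mathscr{F}_n^\xi)$ and $E(\xi_{n+1} \mid \mathscr{F}_n^\xi)$
and those for $d_{11}$, $d_{12}$, $d_{22}$ from (11), we obtain the required recursion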
formulas (7) and (8).
This completes the proof of the theorem.
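
As an illustration, one step of the recursions (7) and (8) can be sketched numerically. This is a minimal sketch, assuming real-valued coefficients (so that the adjoint is the transpose) and using NumPy's Moore-Penrose pseudo-inverse in the role of the superscript $\oplus$. The function name `filter_step` and the scalar values in the example are hypothetical and chosen only for illustration.

```python
import numpy as np


def filter_step(m, gamma, xi_next, a0, a1, b1, b2, A0, A1, B1, B2):
    """One step of recursions (7)-(8) for real-valued coefficient arrays.

    m, a0 are vectors of length k; xi_next, A0 are vectors of length l;
    gamma, a1 are k x k; A1 is l x k; b1, b2, B1, B2 are noise coefficients.
    """
    bb = b1 @ b1.T + b2 @ b2.T   # b o b
    bB = b1 @ B1.T + b2 @ B2.T   # b o B
    BB = B1 @ B1.T + B2 @ B2.T   # B o B
    d12 = bB + a1 @ gamma @ A1.T
    d22 = BB + A1 @ gamma @ A1.T
    gain = d12 @ np.linalg.pinv(d22)   # pseudo-inverse stands in for the (+) superscript
    m_next = a0 + a1 @ m + gain @ (xi_next - A0 - A1 @ m)
    gamma_next = a1 @ gamma @ a1.T + bb - gain @ d12.T
    return m_next, gamma_next


# Hypothetical scalar example (k = l = 1); the numbers are illustrative only.
one = np.array([[1.0]])
zero = np.array([[0.0]])
m1, g1 = filter_step(
    m=np.array([0.0]), gamma=one, xi_next=np.array([2.0]),
    a0=np.array([0.0]), a1=0.5 * one, b1=one, b2=zero,
    A0=np.array([0.0]), A1=one, B1=zero, B2=one,
)
```

In this scalar example $d_{12} = 0.5$ and $d_{22} = 2$, so the gain is $0.25$; the step returns the new estimate and the new error matrix together, since (7) needs the same gain that appears in (8).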

Corollary 1. If the coefficients $a_0(n, \xi), \ldots, B_2(n, \xi)$ in (1) are independent of $\xi$, the corresponding method is known as the Kalman-Bucy method, and equations (7) and (8) for $m_n$ and $\gamma_n$ describe the Kalman-Bucy filter. It is important to observe that in this case the conditional and unconditional error matrices $\gamma_n$ agree, i.e.

$$\gamma_n \equiv E\gamma_n = E[(\theta_n - m_n)(\theta_n - m_n)^*].$$

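Corollary 2. Suppose that a partially observed sequence $(\theta_n, \xi_n)$ has the property that $\theta_n$ satisfies the first equation of (1), and that $\xi_n$ satisfies the equation

$$\xi_n = \tilde{A}_0(n - 1, \xi) + \tilde{A}_1(n - 1, \xi)\theta_n + \tilde{B}_1(n - 1, \xi)\varepsilon_1(n) + \tilde{B}_2(n - 1, \xi)\varepsilon_2(n). \tag{12}$$

Then evidently

$$\xi_{n+1} = \tilde{A}_0(n, \xi) + \tilde{A}_1(n, \xi)[a_0(n, \xi) + a_1(n, \xi)\theta_n + b_1(n, \xi)\varepsilon_1(n + 1) + b_2(n, \xi)\varepsilon_2(n + 1)] + \tilde{B}_1(n, \xi)\varepsilon_1(n + 1) + \tilde{B}_2(n, \xi)\varepsilon_2(n + 1),$$

and with the notation

$$A_0 = \tilde{A}_0 + \tilde{A}_1 a_0, \qquad A_1 = \tilde{A}_1 a_1, \qquad B_1 = \tilde{A}_1 b_1 + \tilde{B}_1, \qquad B_2 = \tilde{A}_1 b_2 + \tilde{B}_2,$$

we find that the case under consideration also reduces to the model (1), and
that $m_n$ and $\gamma_n$ satisfy (7) and (8).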
2. We now consider a linear model (compare (1))


$$\begin{aligned}
\theta_{n+1} &= a_0 + a_1\theta_n + a_2\xi_n + b_1\varepsilon_1(n + 1) + b_2\varepsilon_2(n + 1), \\
\xi_{n+1} &= A_0 + A_1\theta_n + A_2\xi_n + B_1\varepsilon_1(n + 1) + B_2\varepsilon_2(n + 1),
\end{aligned} \tag{13}$$

where the coefficients $a_0, \ldots, B_2$ may depend on $n$ (but not on $\xi$), and $\varepsilon_{ij}(n)$ are independent Gaussian random variables with $E\varepsilon_{ij}(n) = 0$ and $E\varepsilon_{ij}^2(n) = 1$.
Let (13) be solved under initial values $(\theta_0, \xi_0)$ such that the conditional distribution $P(\theta_0 \leq a \mid \xi_0)$ is Gaussian with parameters $m_0 = E(\theta_0 \mid \xi_0)$ and $\gamma_0 = \operatorname{cov}(\theta_0, \theta_0 \mid \xi_0) = E\gamma_0$. Then, by the theorem on normal correlation and (7) and (8), the optimal estimator $m_n = E(\theta_n \mid \mathscr{F}_n^\xi)$ is a linear function of $\xi_0, \xi_1, \ldots, \xi_n$.



This remark makes it possible to prove the following important statement
about the structure of the optimal linear filter without the assumption that
it is Gaussian.

Theorem 2. Let $(\theta, \xi) = (\theta_n, \xi_n)_{n \geq 0}$ be a partially observed sequence that satisfies (13), where $\varepsilon_{ij}(n)$ are uncorrelated random variables with $E\varepsilon_{ij}(n) = 0$, $E\varepsilon_{ij}^2(n) = 1$, and the components of the initial vector $(\theta_0, \xi_0)$ have finite second moments. Then the optimal linear estimator $m_n = \hat{E}(\theta_n \mid \xi_0, \ldots, \xi_n)$ satisfies (7) with $a_0(n, \xi) = a_0(n) + a_2(n)\xi_n$, $A_0(n, \xi) = A_0(n) + A_2(n)\xi_n$, and the error matrix $\gamma_n = E[(\theta_n - m_n)(\theta_n - m_n)^*]$ satisfies (8) with initial values

$$\begin{aligned}
m_0 &= \operatorname{cov}(\theta_0, \xi_0)\operatorname{cov}^{\oplus}(\xi_0, \xi_0) \cdot \xi_0, \\
\gamma_0 &= \operatorname{cov}(\theta_0, \theta_0) - \operatorname{cov}(\theta_0, \xi_0)\operatorname{cov}^{\oplus}(\xi_0, \xi_0)\operatorname{cov}^*(\theta_0, \xi_0).
\end{aligned} \tag{14}$$

For the proof of this theorem, we need the following lemma, which reveals
the role of the Gaussian case in determining optimal linear estimators.

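Lemma 2. Let $(\alpha, \beta)$ be a two-dimensional random vector with $E(\alpha^2 + \beta^2) < \infty$, and $(\tilde{\alpha}, \tilde{\beta})$ a two-dimensional Gaussian vector with the same first and second moments as $(\alpha, \beta)$, i.e.

$$E\tilde{\alpha}^i = E\alpha^i, \quad E\tilde{\beta}^i = E\beta^i, \quad i = 1, 2; \qquad E\tilde{\alpha}\tilde{\beta} = E\alpha\beta.$$

Let $\lambda(b)$ be a linear function of $b$ such that

$$\lambda(b) = E(\tilde{\alpha} \mid \tilde{\beta} = b).$$

Then $\lambda(\beta)$ is the optimal (in the mean square sense) linear estimator of $\alpha$ in terms of $\beta$, i.e.

$$\hat{E}(\alpha \mid \beta) = \lambda(\beta).$$

Here $E\lambda(\beta) = E\alpha$.

PROOF. We first observe that the existence of a linear function $\lambda(b)$ coinciding with $E(\tilde{\alpha} \mid \tilde{\beta} = b)$ follows from the theorem on normal correlation. Moreover, let $\bar{\lambda}(b)$ be any other linear estimator. Then

$$E[\tilde{\alpha} - \bar{\lambda}(\tilde{\beta})]^2 \geq E[\tilde{\alpha} - \lambda(\tilde{\beta})]^2$$

and since $\bar{\lambda}(b)$ and $\lambda(b)$ are linear and the hypotheses of the lemma are satisfied, we have

$$E[\alpha - \bar{\lambda}(\beta)]^2 = E[\tilde{\alpha} - \bar{\lambda}(\tilde{\beta})]^2 \geq E[\tilde{\alpha} - \lambda(\tilde{\beta})]^2 = E[\alpha - \lambda(\beta)]^2,$$

which shows that $\lambda(\beta)$ is optimal in the class of linear estimators. Finally,

$$E\lambda(\beta) = E\lambda(\tilde{\beta}) = E[E(\tilde{\alpha} \mid \tilde{\beta})] = E\tilde{\alpha} = E\alpha.$$
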

This completes the proof of the lemma.
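
The content of Lemma 2 is that the optimal linear estimator depends only on first and second moments; in the scalar case the theorem on normal correlation gives $\lambda(b) = E\alpha + \operatorname{cov}(\alpha, \beta)(V\beta)^{-1}(b - E\beta)$ when $V\beta > 0$. The following is a minimal numerical sketch of this fact on a deliberately non-Gaussian pair. The distributions, the sample size, and the variable names are assumptions made only for this illustration; the comparison is with an ordinary least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = rng.exponential(size=5000)                  # non-Gaussian regressor
alpha = beta**2 + rng.uniform(-1, 1, size=5000)    # nonlinear, non-Gaussian dependence

# Coefficients of lambda(b) computed from first and second sample moments only.
cov = np.cov(beta, alpha)
slope = cov[0, 1] / cov[0, 0]
intercept = alpha.mean() - slope * beta.mean()

# Direct least-squares fit of a straight line, for comparison.
ls_slope, ls_intercept = np.polyfit(beta, alpha, 1)
```

The two pairs of coefficients agree up to rounding, although the pair $(\alpha, \beta)$ is far from Gaussian.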
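PROOF OF THEOREM 2. We consider, besides (13), the system

$$\begin{aligned}
\tilde{\theta}_{n+1} &= a_0 + a_1\tilde{\theta}_n + a_2\tilde{\xi}_n + b_1\tilde{\varepsilon}_{11}(n + 1) + b_2\tilde{\varepsilon}_{12}(n + 1), \\
\tilde{\xi}_{n+1} &= A_0 + A_1\tilde{\theta}_n + A_2\tilde{\xi}_n + B_1\tilde{\varepsilon}_{21}(n + 1) + B_2\tilde{\varepsilon}_{22}(n + 1),
\end{aligned} \tag{15}$$

where $\tilde{\varepsilon}_{ij}(n)$ are independent Gaussian random variables with $E\tilde{\varepsilon}_{ij}(n) = 0$ and $E\tilde{\varepsilon}_{ij}^2(n) = 1$. Let $(\tilde{\theta}_0, \tilde{\xi}_0)$ also be a Gaussian vector which has the same first moment and covariance as $(\theta_0, \xi_0)$ and is independent of $\tilde{\varepsilon}_{ij}(n)$. Then since (15) is linear, the vector $(\tilde{\theta}_0, \ldots, \tilde{\theta}_n, \tilde{\xi}_0, \ldots, \tilde{\xi}_n)$ is Gaussian and therefore the conclusion of the theorem follows from Lemma 2 (more precisely, from its multidimensional analog) and the theorem on normal correlation.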
This completes the proof of the theorem.
3. Let us consider some illustrations of Theorems 1 and 2.
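EXAMPLE 1. Let $\theta = (\theta_n)$ and $\eta = (\eta_n)$ be two stationary (wide sense) uncorrelated random sequences with $E\theta_n = E\eta_n = 0$ and spectral densities

$$f_\theta(\lambda) = \frac{1}{2\pi|1 + b_1 e^{-i\lambda}|^2} \quad \text{and} \quad f_\eta(\lambda) = \frac{1}{2\pi|1 + b_2 e^{-i\lambda}|^2},$$

where $|b_1| < 1$, $|b_2| < 1$.
We are going to interpret $\theta$ as a useful signal and $\eta$ as noise, and suppose that observation produces a sequence $\xi = (\xi_n)$ with

$$\xi_n = \theta_n + \eta_n.$$

According to Corollary 2 to Theorem 3 of §3 there are (mutually uncorrelated) white noises $\varepsilon_1 = (\varepsilon_1(n))$ and $\varepsilon_2 = (\varepsilon_2(n))$ such that

$$\theta_{n+1} + b_1\theta_n = \varepsilon_1(n + 1), \qquad \eta_{n+1} + b_2\eta_n = \varepsilon_2(n + 1).$$

Then

$$\begin{aligned}
\xi_{n+1} = \theta_{n+1} + \eta_{n+1} &= -b_1\theta_n - b_2\eta_n + \varepsilon_1(n + 1) + \varepsilon_2(n + 1) \\
&= -b_2(\theta_n + \eta_n) - \theta_n(b_1 - b_2) + \varepsilon_1(n + 1) + \varepsilon_2(n + 1) \\
&= -b_2\xi_n - (b_1 - b_2)\theta_n + \varepsilon_1(n + 1) + \varepsilon_2(n + 1).
\end{aligned}$$

Hence $\theta$ and $\xi$ satisfy the recursion relations

$$\begin{aligned}
\theta_{n+1} &= -b_1\theta_n + \varepsilon_1(n + 1), \\
\xi_{n+1} &= -(b_1 - b_2)\theta_n - b_2\xi_n + \varepsilon_1(n + 1) + \varepsilon_2(n + 1),
\end{aligned} \tag{16}$$

and, according to Theorem 2, $m_n = \hat{E}(\theta_n \mid \xi_0, \ldots, \xi_n)$ and $\gamma_n = E(\theta_n - m_n)^2$ satisfy the following system of recursion equations for optimal linear filtering:

$$\begin{aligned}
m_{n+1} &= -b_1 m_n + \frac{1 + b_1(b_1 - b_2)\gamma_n}{2 + (b_1 - b_2)^2\gamma_n}\,[\xi_{n+1} + (b_1 - b_2)m_n + b_2\xi_n], \\
\gamma_{n+1} &= b_1^2\gamma_n + 1 - \frac{[1 + b_1(b_1 - b_2)\gamma_n]^2}{2 + (b_1 - b_2)^2\gamma_n}.
\end{aligned} \tag{17}$$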

Let us find the initial conditions under which we should solve this system.
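Write $d_{11} = E\theta_n^2$, $d_{12} = E\theta_n\xi_n$, $d_{22} = E\xi_n^2$. Then we find from (16) that

$$\begin{aligned}
d_{11} &= b_1^2 d_{11} + 1, \\
d_{12} &= b_1(b_1 - b_2)d_{11} + b_1 b_2 d_{12} + 1, \\
d_{22} &= (b_1 - b_2)^2 d_{11} + b_2^2 d_{22} + 2b_2(b_1 - b_2)d_{12} + 2,
\end{aligned}$$

from which

$$d_{11} = \frac{1}{1 - b_1^2}, \qquad d_{12} = \frac{1}{1 - b_1^2}, \qquad d_{22} = \frac{2 - b_1^2 - b_2^2}{(1 - b_1^2)(1 - b_2^2)},$$

which, by (14), leads to the following initial values:

$$\begin{aligned}
m_0 &= \frac{d_{12}}{d_{22}}\,\xi_0 = \frac{1 - b_2^2}{2 - b_1^2 - b_2^2}\,\xi_0, \\
\gamma_0 &= d_{11} - \frac{d_{12}^2}{d_{22}} = \frac{1}{1 - b_1^2} - \frac{1 - b_2^2}{(1 - b_1^2)(2 - b_1^2 - b_2^2)} = \frac{1}{2 - b_1^2 - b_2^2}.
\end{aligned} \tag{18}$$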

Thus the optimal (in the least squares sense) linear estimators $m_n$ for the
signal $\theta_n$ in terms of $\xi_0, \ldots, \xi_n$ and the mean-square error are determined
by the system of recurrent equations (17), solved under the initial conditions
(18). Observe that the equation for $\gamma_n$ does not contain any random components, and consequently the number $\gamma_n$, which is needed for finding $m_n$,
can be calculated in advance, before the filtering problem has been solved.
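
The remark that $\gamma_n$ can be computed in advance can be made concrete. The sketch below specializes (7) and (8) to the system (16), where $a_1 = -b_1$, $A_1 = -(b_1 - b_2)$, $A_0 = -b_2\xi_n$, $b \circ b = b \circ B = 1$ and $B \circ B = 2$, and starts from the initial values (18). The function names `gamma_sequence` and `run_filter` are hypothetical, and the code is an illustration under these assumptions rather than a definitive implementation.

```python
def gamma_sequence(b1, b2, n_steps):
    """Mean-square errors gamma_0, ..., gamma_{n_steps}; no observations are needed."""
    g = 1.0 / (2.0 - b1**2 - b2**2)          # gamma_0 from (18)
    out = [g]
    for _ in range(n_steps):
        num = 1.0 + b1 * (b1 - b2) * g
        den = 2.0 + (b1 - b2) ** 2 * g
        g = b1**2 * g + 1.0 - num**2 / den   # error recursion, from (8)
        out.append(g)
    return out


def run_filter(xi, b1, b2):
    """Optimal linear estimates m_0, ..., m_n for observations xi_0, ..., xi_n."""
    gammas = gamma_sequence(b1, b2, len(xi) - 1)
    m = (1.0 - b2**2) / (2.0 - b1**2 - b2**2) * xi[0]   # m_0 from (18)
    estimates = [m]
    for n in range(len(xi) - 1):
        g = gammas[n]
        gain = (1.0 + b1 * (b1 - b2) * g) / (2.0 + (b1 - b2) ** 2 * g)
        # estimate recursion, from (7)
        m = -b1 * m + gain * (xi[n + 1] + (b1 - b2) * m + b2 * xi[n])
        estimates.append(m)
    return estimates
```

A quick sanity check: when $b_1 = b_2$ the signal and the noise have the same second-order structure, the error stays constant at $\gamma_n = 1/(2(1 - b_1^2))$, and the estimator reduces to $m_n = \xi_n/2$.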

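EXAMPLE 2. This example is instructive because it shows that the result of Theorem 2 can be applied to find the optimal linear filter in a case where the sequence $(\theta, \xi)$ is described by a (nonlinear) system which is different from (13).

Let $\varepsilon_1 = (\varepsilon_1(n))$ and $\varepsilon_2 = (\varepsilon_2(n))$ be two independent Gaussian sequences of independent random variables with $E\varepsilon_i(n) = 0$ and $E\varepsilon_i^2(n) = 1$, $n \geq 1$. Consider a pair of sequences $(\theta, \xi) = (\theta_n, \xi_n)$, $n \geq 0$, with

$$\begin{aligned}
\theta_{n+1} &= a\theta_n + (1 + \theta_n)\varepsilon_1(n + 1), \\
\xi_{n+1} &= A\theta_n + \varepsilon_2(n + 1).
\end{aligned} \tag{19}$$

We shall suppose that $\theta_0$ is independent of $(\varepsilon_1, \varepsilon_2)$ and that $\theta_0 \sim \mathscr{N}(m_0, \gamma_0)$.
The system (19) is nonlinear, and Theorem 2 is not immediately applicable.
However, if we put

$$\tilde{\varepsilon}_1(n + 1) = \frac{1 + \theta_n}{\sqrt{E(1 + \theta_n)^2}}\,\varepsilon_1(n + 1),$$

we can observe that $E\tilde{\varepsilon}_1(n) = 0$, $E\tilde{\varepsilon}_1(n)\tilde{\varepsilon}_1(m) = 0$, $n \neq m$, $E\tilde{\varepsilon}_1^2(n) = 1$. Hence we have reduced (19) to a linear system

$$\begin{aligned}
\theta_{n+1} &= a_1\theta_n + b_1\tilde{\varepsilon}_1(n + 1), \\
\xi_{n+1} &= A_1\theta_n + \varepsilon_2(n + 1),
\end{aligned} \tag{20}$$

where $a_1 = a$, $A_1 = A$, $b_1 = \sqrt{E(1 + \theta_n)^2}$, and $\{\tilde{\varepsilon}_1(n)\}$ is a sequence of uncorrelated random variables.
Now (20) is a linear system of the same type as (13), and consequently the optimal linear estimator $m_n = \hat{E}(\theta_n \mid \xi_0, \ldots, \xi_n)$ and its error $\gamma_n$ can be determined from (7) and (8) via Theorem 2, applied in the following form in the present case:

$$\begin{aligned}
m_{n+1} &= a_1 m_n + \frac{a_1 A_1\gamma_n}{1 + A_1^2\gamma_n}\,[\xi_{n+1} - A_1 m_n], \\
\gamma_{n+1} &= (a_1^2\gamma_n + b_1^2) - \frac{(a_1 A_1\gamma_n)^2}{1 + A_1^2\gamma_n},
\end{aligned}$$

where $b_1 = \sqrt{E(1 + \theta_n)^2}$ must be found from the first equation in (19).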

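EXAMPLE 3. Estimators for parameters. Let $\theta = (\theta_1, \ldots, \theta_k)$ be a Gaussian vector with $E\theta = m$ and $\operatorname{cov}(\theta, \theta) = \gamma$. Suppose that (with known $m$ and $\gamma$) we want the optimal estimator of $\theta$ in terms of observations on an $l$-dimensional sequence $\xi = (\xi_n)$, $n \geq 0$, with

$$\xi_{n+1} = A_0(n, \xi) + A_1(n, \xi)\theta + B_1(n, \xi)\varepsilon_1(n + 1), \qquad \xi_0 = 0, \tag{21}$$

where $\varepsilon_1$ is as in (1).
Then from (7) and (8), with $m_n = E(\theta \mid \mathscr{F}_n^\xi)$ and $\gamma_n$, we find that

$$\begin{aligned}
m_{n+1} &= m_n + \gamma_n A_1^*(n, \xi)[(B_1 B_1^*)(n, \xi) + A_1(n, \xi)\gamma_n A_1^*(n, \xi)]^{\oplus}[\xi_{n+1} - A_0(n, \xi) - A_1(n, \xi)m_n], \\
\gamma_{n+1} &= \gamma_n - \gamma_n A_1^*(n, \xi)[(B_1 B_1^*)(n, \xi) + A_1(n, \xi)\gamma_n A_1^*(n, \xi)]^{\oplus}A_1(n, \xi)\gamma_n.
\end{aligned} \tag{22}$$

If the matrices $B_1 B_1^*$ are nonsingular, the solution of (22) is given by

$$\begin{aligned}
m_{n+1} &= \Big[E + \gamma\sum_{j=0}^{n} A_1^*(j, \xi)(B_1 B_1^*)^{-1}(j, \xi)A_1(j, \xi)\Big]^{-1}\Big[m + \gamma\sum_{j=0}^{n} A_1^*(j, \xi)(B_1 B_1^*)^{-1}(j, \xi)(\xi_{j+1} - A_0(j, \xi))\Big], \\
\gamma_{n+1} &= \Big[E + \gamma\sum_{j=0}^{n} A_1^*(j, \xi)(B_1 B_1^*)^{-1}(j, \xi)A_1(j, \xi)\Big]^{-1}\gamma,
\end{aligned} \tag{23}$$

where $E$ is a unit matrix.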
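
The agreement between the recursion (22) and the closed form (23) can be checked numerically before it is proved. The sketch below is only a sanity check under stated assumptions: real coefficients, coefficient matrices that do not depend on $\xi$, a noise vector of dimension $l$ so that $B_1$ is a square nonsingular matrix, and randomly generated (hypothetical) data. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, l, steps = 2, 3, 6

m0 = rng.normal(size=k)                          # prior mean m
g0 = np.array([[2.0, 0.5], [0.5, 1.0]])          # prior covariance gamma
A0 = rng.normal(size=(steps, l))
A1 = rng.normal(size=(steps, l, k))
B1 = 0.2 * rng.normal(size=(steps, l, l)) + np.eye(l)   # nonsingular by construction
xi = rng.normal(size=(steps + 1, l))             # xi[0] is not used by (22)

# Recursion (22); the inverse replaces the pseudo-inverse since B1 B1^* is nonsingular.
m_rec, g_rec = m0.copy(), g0.copy()
for n in range(steps):
    R = B1[n] @ B1[n].T
    S = R + A1[n] @ g_rec @ A1[n].T
    K = g_rec @ A1[n].T @ np.linalg.inv(S)
    m_rec = m_rec + K @ (xi[n + 1] - A0[n] - A1[n] @ m_rec)
    g_rec = g_rec - K @ A1[n] @ g_rec

# Closed form (23).
H = np.zeros((k, k))
v = np.zeros(k)
for j in range(steps):
    R_inv = np.linalg.inv(B1[j] @ B1[j].T)
    H += A1[j].T @ R_inv @ A1[j]
    v += A1[j].T @ R_inv @ (xi[j + 1] - A0[j])
M = np.linalg.inv(np.eye(k) + g0 @ H)
m_closed = M @ (m0 + g0 @ v)
g_closed = M @ g0
```

Both routes give the same estimate and the same error matrix up to rounding; the identity is algebraic, so it holds for any data sequence.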

4. PROBLEMS

1. Show that the vectors $m_n$ and $\theta_n - m_n$ in (1) are uncorrelated:

$$E[m_n^*(\theta_n - m_n)] = 0.$$

2. In (1), let $\gamma$ and the coefficients other than $a_0(n, \xi)$ and $A_0(n, \xi)$ be independent of "chance" (i.e. of $\xi$). Show that then the conditional covariance $\gamma_n$ is independent of "chance": $\gamma_n = E\gamma_n$.
3. Show that the solution of (22) is given by (23).
4. Let $(\theta, \xi) = (\theta_n, \xi_n)$ be a Gaussian sequence satisfying the following special case of (1):

$$\theta_{n+1} = a\theta_n + b\varepsilon_1(n + 1), \qquad \xi_{n+1} = A\theta_n + B\varepsilon_2(n + 1).$$

Show that if $A \neq 0$, $b \neq 0$, $B \neq 0$, the limiting error of filtering, $\gamma = \lim_{n \to \infty}\gamma_n$, exists and is determined as the positive root of the equation

$$\gamma^2 + \Big[\frac{B^2(1 - a^2)}{A^2} - b^2\Big]\gamma - \frac{b^2 B^2}{A^2} = 0.$$

CHAPTER VII

Sequences of Random Variables


That Form Martingales

1. Definitions of Martingales and Related Concepts


1. The study of the dependence of random variables arises in various ways
in probability theory. In the theory of stationary (wide sense) random
sequences, the basic indicator of dependence is the covariance function,
and the inferences made in this theory are determined by the properties of
that function. In the theory of Markov chains (§12 of Chapter I; Chapter
VIII) the basic dependence is supplied by the transition function, which
completely determines the development of the random variables involved
in Markov dependence.
In the present chapter (see also §11, Chapter I), we single out a rather wide
class of sequences of random variables (martingales and their generalizations) for which dependence can be studied by methods based on a discussion
of the properties of conditional expectations.

2. Let $(\Omega, \mathscr{F}, P)$ be a given probability space, and let $(\mathscr{F}_n)$ be a family of $\sigma$-algebras $\mathscr{F}_n$, $n \geq 0$, such that $\mathscr{F}_0 \subseteq \mathscr{F}_1 \subseteq \cdots \subseteq \mathscr{F}$.
Let $X_0, X_1, \ldots$ be a sequence of random variables defined on $(\Omega, \mathscr{F}, P)$. If, for each $n \geq 0$, the variable $X_n$ is $\mathscr{F}_n$-measurable, we say that the set $X = (X_n, \mathscr{F}_n)$, $n \geq 0$, or simply $X = (X_n, \mathscr{F}_n)$, is a stochastic sequence.
If a stochastic sequence $X = (X_n, \mathscr{F}_n)$ has the property that, for each $n \geq 1$, the variable $X_n$ is $\mathscr{F}_{n-1}$-measurable, we write $X = (X_n, \mathscr{F}_{n-1})$, taking $\mathscr{F}_{-1} = \mathscr{F}_0$, and call $X$ a predictable sequence. We call such a sequence increasing if $X_0 = 0$ and $X_n \leq X_{n+1}$ (P-a.s.).

Definition 1. A stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a martingale, or a submartingale, if, for all $n \geq 0$,

$$E|X_n| < \infty \tag{1}$$

and, respectively,

$$\begin{aligned}
E(X_{n+1} \mid \mathscr{F}_n) &= X_n \quad \text{(P-a.s.) (martingale)} \\
\text{or} \quad E(X_{n+1} \mid \mathscr{F}_n) &\geq X_n \quad \text{(P-a.s.) (submartingale).}
\end{aligned} \tag{2}$$

A stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a supermartingale if the sequence $-X = (-X_n, \mathscr{F}_n)$ is a submartingale.
In the special case when $\mathscr{F}_n = \mathscr{F}_n^X$, where $\mathscr{F}_n^X = \sigma\{\omega: X_0, \ldots, X_n\}$, and the stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a martingale (or submartingale), we say that the sequence $(X_n)_{n \geq 0}$ itself is a martingale (or submartingale).
It is easy to deduce from the properties of conditional expectations that (2) is equivalent to the property that, for every $n \geq 0$ and $A \in \mathscr{F}_n$,

$$\int_A X_{n+1}\,dP = \int_A X_n\,dP \quad \text{or} \quad \int_A X_{n+1}\,dP \geq \int_A X_n\,dP. \tag{3}$$


EXAMPLE 1. If $(\xi_n)_{n \geq 0}$ is a sequence of independent random variables with $E\xi_n = 0$ and $X_n = \xi_0 + \cdots + \xi_n$, $\mathscr{F}_n = \sigma\{\omega: \xi_0, \ldots, \xi_n\}$, the stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a martingale.

EXAMPLE 2. If $(\xi_n)_{n \geq 0}$ is a sequence of independent random variables with $E\xi_n = 1$, the stochastic sequence $(X_n, \mathscr{F}_n)$ with $X_n = \prod_{k=0}^{n}\xi_k$, $\mathscr{F}_n = \sigma\{\omega: \xi_0, \ldots, \xi_n\}$ is also a martingale.

EXAMPLE 3. Let $\xi$ be a random variable with $E|\xi| < \infty$ and

$$\mathscr{F}_0 \subseteq \mathscr{F}_1 \subseteq \cdots \subseteq \mathscr{F}.$$

Then the sequence $X = (X_n, \mathscr{F}_n)$ with $X_n = E(\xi \mid \mathscr{F}_n)$ is a martingale.

EXAMPLE 4. If $(\xi_n)_{n \geq 0}$ is a sequence of nonnegative integrable random variables, the sequence $(X_n)$ with $X_n = \xi_0 + \cdots + \xi_n$ is a submartingale.

EXAMPLE 5. If $X = (X_n, \mathscr{F}_n)$ is a martingale and $g(x)$ is convex downward with $E|g(X_n)| < \infty$, $n \geq 0$, then the stochastic sequence $(g(X_n), \mathscr{F}_n)$ is a submartingale (as follows from Jensen's inequality).
If $X = (X_n, \mathscr{F}_n)$ is a submartingale and $g(x)$ is convex downward and nondecreasing, with $E|g(X_n)| < \infty$ for all $n \geq 0$, then $(g(X_n), \mathscr{F}_n)$ is also a submartingale.

Assumption (1) in Definition 1 ensures the existence of the conditional expectations $E(X_{n+1} \mid \mathscr{F}_n)$, $n \geq 0$. However, these expectations can also exist without the assumption that $E|X_{n+1}| < \infty$. Recall that by §7 of Chapter II, $E(X_{n+1}^+ \mid \mathscr{F}_n)$ and $E(X_{n+1}^- \mid \mathscr{F}_n)$ are always defined. Let us write $A = B$ (P-a.s.) when $P(A \,\triangle\, B) = 0$. Then if

$$\{\omega: E(X_{n+1}^+ \mid \mathscr{F}_n) < \infty\} \cup \{\omega: E(X_{n+1}^- \mid \mathscr{F}_n) < \infty\} = \Omega \quad \text{(P-a.s.)}$$

we say that $E(X_{n+1} \mid \mathscr{F}_n)$ is also defined and is given by

$$E(X_{n+1} \mid \mathscr{F}_n) = E(X_{n+1}^+ \mid \mathscr{F}_n) - E(X_{n+1}^- \mid \mathscr{F}_n).$$

After this, the following definition is natural.

Definition 2. A stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a generalized martingale (or submartingale) if the conditional expectations $E(X_{n+1} \mid \mathscr{F}_n)$ are defined for every $n \geq 0$ and (2) is satisfied.

Notice that it follows from this definition that $E(X_{n+1}^- \mid \mathscr{F}_n) < \infty$ for a generalized submartingale, and that $E(|X_{n+1}| \mid \mathscr{F}_n) < \infty$ (P-a.s.) for a generalized martingale.
3. In the following definition we introduce the concept of a Markov time,
which plays a very important role in the subsequent theory.

Definition 3. A random variable $\tau = \tau(\omega)$ with values in the set $\{0, 1, \ldots, +\infty\}$ is a Markov time (with respect to $(\mathscr{F}_n)$) (or a random variable independent of the future) if, for each $n \geq 0$,

$$\{\tau = n\} \in \mathscr{F}_n. \tag{4}$$

When $P(\tau < \infty) = 1$, a Markov time $\tau$ is called a stopping time.

Let $X = (X_n, \mathscr{F}_n)$ be a stochastic sequence and let $\tau$ be a Markov time (with respect to $(\mathscr{F}_n)$). We write

$$X_\tau = \sum_{n=0}^{\infty} X_n I_{\{\tau = n\}}(\omega)$$

(hence $X_\tau = 0$ on the set $\{\omega: \tau = \infty\}$).
Then for every $B \in \mathscr{B}(R)$,

$$\{\omega: X_\tau \in B\} = \sum_{n=0}^{\infty}\{X_n \in B, \tau = n\}$$

and consequently $X_\tau$ is a random variable.


EXAMPLE 6. Let $X = (X_n, \mathscr{F}_n)$ be a stochastic sequence and let $B \in \mathscr{B}(R)$. Then the time of first hitting the set $B$, that is,

$$\tau_B = \inf\{n \geq 0: X_n \in B\}$$

(with $\tau_B = +\infty$ if $\{\cdot\} = \varnothing$) is a Markov time, since

$$\{\tau_B = n\} = \{X_0 \notin B, \ldots, X_{n-1} \notin B, X_n \in B\} \in \mathscr{F}_n$$

for every $n \geq 0$.
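EXAMPLE 7. Let $X = (X_n, \mathscr{F}_n)$ be a martingale (or submartingale) and $\tau$ a Markov time (with respect to $(\mathscr{F}_n)$). Then the "stopped" process $X^\tau = (X_{n \wedge \tau}, \mathscr{F}_n)$ is also a martingale (or submartingale).
In fact, the equation

$$X_{n \wedge \tau} = \sum_{m=0}^{n-1} X_m I_{\{\tau = m\}} + X_n I_{\{\tau \geq n\}}$$

implies that the variables $X_{n \wedge \tau}$ are $\mathscr{F}_n$-measurable, are integrable, and satisfy

$$X_{(n+1) \wedge \tau} - X_{n \wedge \tau} = I_{\{\tau > n\}}(X_{n+1} - X_n),$$

whence

$$E[X_{(n+1) \wedge \tau} - X_{n \wedge \tau} \mid \mathscr{F}_n] = I_{\{\tau > n\}}E[X_{n+1} - X_n \mid \mathscr{F}_n] = 0 \quad (\text{or} \geq 0).$$

Every system $(\mathscr{F}_n)$ and Markov time $\tau$ corresponding to it generate a collection of sets

$$\mathscr{F}_\tau = \{A \in \mathscr{F}: A \cap \{\tau = n\} \in \mathscr{F}_n \text{ for all } n \geq 0\}.$$

It is clear that $\Omega \in \mathscr{F}_\tau$ and $\mathscr{F}_\tau$ is closed under countable unions. Moreover, if $A \in \mathscr{F}_\tau$, then $\bar{A} \cap \{\tau = n\} = \{\tau = n\} \setminus (A \cap \{\tau = n\}) \in \mathscr{F}_n$ and therefore $\bar{A} \in \mathscr{F}_\tau$. Hence it follows that $\mathscr{F}_\tau$ is a $\sigma$-algebra.
If we think of $\mathscr{F}_n$ as a collection of events observed up to time $n$ (inclusive), then $\mathscr{F}_\tau$ can be thought of as a collection of events observed at the "random" time $\tau$.
It is easy to show (Problem 3) that the random variables $\tau$ and $X_\tau$ are $\mathscr{F}_\tau$-measurable.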

4. Definition 4. A stochastic sequence $X = (X_n, \mathscr{F}_n)$ is a local martingale (or submartingale) if there is a (localizing) sequence $(\tau_k)_{k \geq 1}$ of Markov times such that $\tau_k \leq \tau_{k+1}$ (P-a.s.), $\tau_k \uparrow \infty$ (P-a.s.) as $k \to \infty$, and every "stopped" sequence $X^{\tau_k} = (X_{\tau_k \wedge n}I_{\{\tau_k > 0\}}, \mathscr{F}_n)$ is a martingale (or submartingale).

In Theorem 1 below, we show that in fact the class of local martingales


coincides with the class of generalized martingales. Moreover, every local
martingale can be obtained by a "martingale transformation" from a martingale and a predictable sequence.

Definition 5. Let $Y = (Y_n, \mathscr{F}_n)$ be a stochastic sequence and let $V = (V_n, \mathscr{F}_{n-1})$ be a predictable sequence ($\mathscr{F}_{-1} = \mathscr{F}_0$). The stochastic sequence $V \cdot Y = ((V \cdot Y)_n, \mathscr{F}_n)$ with

$$(V \cdot Y)_n = V_0 Y_0 + \sum_{i=1}^{n} V_i \Delta Y_i, \tag{5}$$

where $\Delta Y_i = Y_i - Y_{i-1}$, is called the transform of $Y$ by $V$. If, in addition, $Y$ is a martingale, we say that $V \cdot Y$ is a martingale transform.
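
Formula (5) translates directly into a short computation on a single trajectory. The following is a minimal sketch; the function name `transform` and the sample values are hypothetical and serve only to illustrate the definition.

```python
def transform(V, Y):
    """Return the values (V . Y)_0, ..., (V . Y)_n defined by formula (5).

    V[i] is the predictable weight applied to the increment Y[i] - Y[i-1].
    """
    out = [V[0] * Y[0]]
    for i in range(1, len(Y)):
        out.append(out[-1] + V[i] * (Y[i] - Y[i - 1]))
    return out


# With unit weights the transform reproduces Y itself.
same = transform([1, 1, 1, 1], [0, 1, 0, 1])

# Weights 1, 2, 4 applied to the increments -1, -1, +1.
weighted = transform([0, 1, 2, 4], [0, -1, -2, -1])
```

In the second call the running values are $0, -1, -3, 1$: each increment of $Y$ is scaled by a weight that was fixed one step earlier, which is exactly what predictability of $V$ means.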

Theorem 1. Let $X = (X_n, \mathscr{F}_n)_{n \geq 0}$ be a stochastic sequence and let $X_0 = 0$ (P-a.s.). The following conditions are equivalent:
(a) $X$ is a local martingale;
(b) $X$ is a generalized martingale;
(c) $X$ is a martingale transform, i.e. there are a predictable sequence $V = (V_n, \mathscr{F}_{n-1})$ with $V_0 = 0$ and a martingale $Y = (Y_n, \mathscr{F}_n)$ with $Y_0 = 0$ such that $X = V \cdot Y$.
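PROOF. (a) $\Rightarrow$ (b). Let $X$ be a local martingale and let $(\tau_k)$ be a localizing sequence of Markov times for $X$. Then for every $m \geq 0$

$$E[|X_{m \wedge \tau_k}| I_{\{\tau_k > 0\}}] < \infty, \tag{6}$$

and therefore

$$E[|X_{(n+1) \wedge \tau_k}| I_{\{\tau_k > n\}}] = E[|X_{n+1}| I_{\{\tau_k > n\}}] < \infty. \tag{7}$$

The random variable $I_{\{\tau_k > n\}}$ is $\mathscr{F}_n$-measurable. Hence it follows from (7) that

$$E[|X_{n+1}| I_{\{\tau_k > n\}} \mid \mathscr{F}_n] = I_{\{\tau_k > n\}}E[|X_{n+1}| \mid \mathscr{F}_n] < \infty \quad \text{(P-a.s.)}.$$

Here $I_{\{\tau_k > n\}} \to 1$ (P-a.s.), $k \to \infty$, and therefore

$$E[|X_{n+1}| \mid \mathscr{F}_n] < \infty \quad \text{(P-a.s.)}. \tag{8}$$

Under this condition, $E[X_{n+1} \mid \mathscr{F}_n]$ is defined, and it remains only to show that $E[X_{n+1} \mid \mathscr{F}_n] = X_n$ (P-a.s.).
To do this, we need to show that

$$\int_A X_{n+1}\,dP = \int_A X_n\,dP$$

for $A \in \mathscr{F}_n$. By Problem 7, §7, Chapter II, we have $E[|X_{n+1}| \mid \mathscr{F}_n] < \infty$ (P-a.s.) if and only if the measure $\int_A |X_{n+1}|\,dP$, $A \in \mathscr{F}_n$, is $\sigma$-finite. Therefore if we show that the measure $\int_A |X_n|\,dP$, $A \in \mathscr{F}_n$, is also $\sigma$-finite, then in order to establish the equation in which we are interested it will be sufficient to establish it only for those sets $B \in \mathscr{F}_n$ for which $\int_B |X_{n+1}|\,dP < \infty$.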

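Since $X^{\tau_k}$ is a martingale, $|X^{\tau_k}| = (|X_{\tau_k \wedge n}| I_{\{\tau_k > 0\}}, \mathscr{F}_n)$ is a submartingale and hence, if we recall that $\{\tau_k > n\} \in \mathscr{F}_n$, we obtain

$$\int_{B \cap \{\tau_k > n\}} |X_n|\,dP = \int_{B \cap \{\tau_k > n\}} |X_{n \wedge \tau_k}I_{\{\tau_k > 0\}}|\,dP \leq \int_{B \cap \{\tau_k > n\}} |X_{(n+1) \wedge \tau_k}I_{\{\tau_k > 0\}}|\,dP = \int_{B \cap \{\tau_k > n\}} |X_{n+1}|\,dP.$$

Letting $k \to \infty$, we find that

$$\int_B |X_n|\,dP \leq \int_B |X_{n+1}|\,dP.$$

It follows that if $B \in \mathscr{F}_n$ is such that $\int_B |X_{n+1}|\,dP < \infty$, then (by Lebesgue's dominated convergence theorem) we can take limits as $k \to \infty$ in the martingale equation

$$\int_{B \cap \{\tau_k > n\}} X_n\,dP = \int_{B \cap \{\tau_k > n\}} X_{n+1}\,dP.$$

Thus

$$\int_B X_n\,dP = \int_B X_{n+1}\,dP$$

for $B \in \mathscr{F}_n$ such that $\int_B |X_{n+1}|\,dP < \infty$. Hence it follows that this equation is valid for all $B \in \mathscr{F}_n$, and implies $E(X_{n+1} \mid \mathscr{F}_n) = X_n$ (P-a.s.).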
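(b) $\Rightarrow$ (c). Let $\Delta X_n = X_n - X_{n-1}$, $n \geq 1$, and

$$V_0 = 0, \qquad V_n = E[|\Delta X_n| \mid \mathscr{F}_{n-1}], \quad n \geq 1.$$

Put $W_n = V_n^{\oplus}$ (that is, $W_n = V_n^{-1}$ if $V_n \neq 0$ and $W_n = 0$ if $V_n = 0$), $Y_0 = 0$, and

$$Y_n = \sum_{i=1}^{n} W_i \Delta X_i, \quad n \geq 1.$$

It is clear that

$$E[|\Delta Y_n| \mid \mathscr{F}_{n-1}] \leq 1 \quad \text{and} \quad E[\Delta Y_n \mid \mathscr{F}_{n-1}] = 0.$$

Consequently $Y = (Y_n, \mathscr{F}_n)$ is a martingale. Moreover, $X_0 = V_0 \cdot Y_0 = 0$ and $\Delta(V \cdot Y)_n = \Delta X_n$. Therefore

$$X = V \cdot Y.$$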
(c) $\Rightarrow$ (a). Let $X = V \cdot Y$ where $V$ is a predictable sequence, $Y$ is a martingale and $V_0 = Y_0 = 0$. Put

$$\tau_k = \inf\{n \geq 0: |V_{n+1}| > k\},$$

and suppose that $\tau_k = \infty$ if the set $\{\cdot\} = \varnothing$. Since $V_{n+1}$ is $\mathscr{F}_n$-measurable, the variables $\tau_k$ are Markov times for every $k \geq 1$.
Consider a "stopped" sequence $X^{\tau_k} = ((V \cdot Y)_{n \wedge \tau_k}I_{\{\tau_k > 0\}}, \mathscr{F}_n)$. On the set $\{\tau_k > 0\}$, the inequality $|V_{n \wedge \tau_k}| \leq k$ is in effect. Hence it follows that $E|(V \cdot Y)_{n \wedge \tau_k}I_{\{\tau_k > 0\}}| < \infty$ for every $n \geq 1$. In addition, for $n \geq 1$,

$$E\{[(V \cdot Y)_{(n+1) \wedge \tau_k} - (V \cdot Y)_{n \wedge \tau_k}]I_{\{\tau_k > 0\}} \mid \mathscr{F}_n\} = I_{\{\tau_k > 0\}} \cdot V_{(n+1) \wedge \tau_k}E\{Y_{(n+1) \wedge \tau_k} - Y_{n \wedge \tau_k} \mid \mathscr{F}_n\} = 0$$

since (see Example 7) $E\{Y_{(n+1) \wedge \tau_k} - Y_{n \wedge \tau_k} \mid \mathscr{F}_n\} = 0$.
Thus for every $k \geq 1$ the stochastic sequences $X^{\tau_k}$ are martingales, $\tau_k \uparrow \infty$ (P-a.s.), and consequently $X$ is a local martingale.
This completes the proof of the theorem.
5. EXAMPLE 8. Let $(\eta_n)_{n \geq 1}$ be a sequence of independent identically distributed Bernoulli random variables and let $P(\eta_n = 1) = p$, $P(\eta_n = -1) = q$, $p + q = 1$. We interpret the event $\{\eta_n = 1\}$ as success (gain) and $\{\eta_n = -1\}$ as failure (loss) of a player at the $n$th turn. Let us suppose that the player's stake at the $n$th turn is $V_n$. Then the player's total gain through the $n$th turn is

$$X_n = \sum_{i=1}^{n} V_i\eta_i = X_{n-1} + V_n\eta_n, \qquad X_0 = 0.$$

It is quite natural to suppose that the amount $V_n$ at the $n$th turn may depend on the results of the preceding turns, i.e. on $V_1, \ldots, V_{n-1}$ and on $\eta_1, \ldots, \eta_{n-1}$. In other words, if we put $\mathscr{F}_0 = \{\varnothing, \Omega\}$ and $\mathscr{F}_n = \sigma\{\omega: \eta_1, \ldots, \eta_n\}$, then $V_n$ is an $\mathscr{F}_{n-1}$-measurable random variable, i.e. the sequence $V = (V_n, \mathscr{F}_{n-1})$ that determines the player's "strategy" is predictable. Putting $Y_n = \eta_1 + \cdots + \eta_n$, we find that

$$X_n = \sum_{i=1}^{n} V_i \Delta Y_i,$$

i.e. the sequence $X = (X_n, \mathscr{F}_n)$ with $X_0 = 0$ is the transform of $Y$ by $V$.
From the player's point of view, the game in question is fair (or favorable, or unfavorable) if, at every stage, the conditional expectation

$$E(X_{n+1} - X_n \mid \mathscr{F}_n) = 0 \quad (\text{or} \geq 0 \text{ or} \leq 0).$$

Moreover, it is clear that the game is

fair if $p = q = \tfrac{1}{2}$,
favorable if $p > q$,
unfavorable if $p < q$.

Since $X = (X_n, \mathscr{F}_n)$ is a

martingale if $p = q = \tfrac{1}{2}$,
submartingale if $p > q$,
supermartingale if $p < q$,


we can say that the assumption that the game is fair (or favorable, or unfavorable) corresponds to the assumption that the sequence X is a martingale
(or submartingale, or supermartingale ).
Let us now consider the special class of strategies $V = (V_n, \mathscr{F}_{n-1})_{n \geq 1}$ with $V_1 = 1$ and (for $n > 1$)

$$V_n = \begin{cases} 2^{n-1} & \text{if } \eta_1 = -1, \ldots, \eta_{n-1} = -1, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

In such a strategy, a player, having started with a stake $V_1 = 1$, doubles the stake after a loss and drops out of the game immediately after a win.
If $\eta_1 = -1, \ldots, \eta_n = -1$, the total loss to the player after $n$ turns will be

$$\sum_{i=1}^{n} 2^{i-1} = 2^n - 1.$$

Therefore if also $\eta_{n+1} = 1$, we have

$$X_{n+1} = X_n + V_{n+1} = -(2^n - 1) + 2^n = 1.$$

Let $\tau = \inf\{n \geq 1: X_n = 1\}$. If $p = q = \tfrac{1}{2}$, i.e. the game in question is fair, then $P(\tau = n) = (\tfrac{1}{2})^n$, $P(\tau < \infty) = 1$, $P(X_\tau = 1) = 1$, and $EX_\tau = 1$.
Therefore even for a fair game, by applying the strategy (9), a player can in a finite time (with probability unity) complete the game "successfully," increasing his capital by one unit ($EX_\tau = 1 > X_0 = 0$).
In gambling practice, this system (doubling the stakes after a loss and
dropping out of the game after a win) is called a martingale. This is the origin
of the mathematical term "martingale."
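
The doubling strategy is simple enough to check by exact enumeration. The sketch below, with the hypothetical helper names `play` and `expected_gain`, follows one trajectory of the player's total gain and then averages the gain after a fixed number $n$ of turns over all $2^n$ equally likely outcome sequences of the fair game.

```python
from fractions import Fraction
from itertools import product


def play(outcomes):
    """Total gain X_0, ..., X_n under the doubling strategy for given +1/-1 outcomes."""
    x, path, stopped = 0, [0], False
    for n, eta in enumerate(outcomes, start=1):
        stake = 0 if stopped else 2 ** (n - 1)   # stake 2^(n-1) until the first win
        x += stake * eta
        if eta == 1:
            stopped = True                       # drop out after the first win
        path.append(x)
    return path


def expected_gain(n):
    """Exact E X_n for the fair game p = q = 1/2, by enumerating all outcomes."""
    total = Fraction(0)
    for outcomes in product((-1, 1), repeat=n):
        total += Fraction(play(outcomes)[-1], 2 ** n)
    return total
```

For example, `play([-1, -1, 1])` gives the path `[0, -1, -3, 1]`. For each fixed $n$ the exact mean is zero, because the rare run of $n$ losses costs $2^n - 1$ and has probability $2^{-n}$; the gain of one unit appears only at the random time of the first win.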

Remark. When $p = q = \tfrac{1}{2}$, the sequence $X = (X_n, \mathscr{F}_n)$ with $X_0 = 0$ is a martingale and therefore

$$EX_n = EX_0 = 0 \quad \text{for every } n \geq 1.$$

We may therefore expect that this equation is preserved if the instant $n$ is replaced by a random instant $\tau$. It will appear later (Theorem 1, §2) that $EX_\tau = EX_0$ in "typical" situations. Violations of this equation (as in the game discussed above) arise in what we may describe as physically unrealizable situations, when either $\tau$ or $|X_n|$ takes values that are much too large.
(Note that the game discussed above would be physically unrealizable, since
it supposes an unbounded time for playing and an unbounded initial capital
for the player.)

6. Definition 6. A stochastic sequence $\xi = (\xi_n, \mathscr{F}_n)$ is a martingale-difference if $E|\xi_n| < \infty$ for all $n \geq 0$ and

$$E(\xi_{n+1} \mid \mathscr{F}_n) = 0 \quad \text{(P-a.s.)}. \tag{10}$$

The connection between martingales and martingale-differences is clear from Definitions 1 and 6. Thus if $X = (X_n, \mathscr{F}_n)$ is a martingale, then $\xi = (\xi_n, \mathscr{F}_n)$ with $\xi_0 = X_0$ and $\xi_n = \Delta X_n$, $n \geq 1$, is a martingale-difference. In turn, if $\xi = (\xi_n, \mathscr{F}_n)$ is a martingale-difference, then $X = (X_n, \mathscr{F}_n)$ with $X_n = \xi_0 + \cdots + \xi_n$ is a martingale.
In agreement with this terminology, every sequence $\xi = (\xi_n)_{n \geq 0}$ of independent integrable random variables with $E\xi_n = 0$ is a martingale-difference (with $\mathscr{F}_n = \sigma\{\omega: \xi_0, \xi_1, \ldots, \xi_n\}$).

7. The following theorem elucidates the structure of submartingales (or supermartingales).

Theorem 2 (Doob). Let $X = (X_n, \mathscr{F}_n)$ be a submartingale. Then there are a martingale $m = (m_n, \mathscr{F}_n)$ and a predictable increasing sequence $A = (A_n, \mathscr{F}_{n-1})$ such that, for every $n \geq 0$, Doob's decomposition

$$X_n = m_n + A_n \quad \text{(P-a.s.)} \tag{11}$$

holds. A decomposition of this kind is unique.

PROOF. Let us put $m_0 = X_0$, $A_0 = 0$ and

$$m_n = m_0 + \sum_{j=0}^{n-1}[X_{j+1} - E(X_{j+1} \mid \mathscr{F}_j)], \tag{12}$$

$$A_n = \sum_{j=0}^{n-1}[E(X_{j+1} \mid \mathscr{F}_j) - X_j]. \tag{13}$$

It is evident that $m$ and $A$, defined in this way, have the required properties.
In addition, let $X_n = m_n' + A_n'$, where $m' = (m_n', \mathscr{F}_n)$ is a martingale and $A' = (A_n', \mathscr{F}_{n-1})$ is a predictable increasing sequence. Then

$$A_{n+1}' - A_n' = (A_{n+1} - A_n) + (m_{n+1} - m_n) - (m_{n+1}' - m_n'),$$

and if we take conditional expectations on both sides, we find that (P-a.s.) $A_{n+1}' - A_n' = A_{n+1} - A_n$. But $A_0 = A_0' = 0$, and therefore $A_n = A_n'$ and $m_n = m_n'$ (P-a.s.) for all $n \geq 0$.
This completes the proof of the theorem.
It follows from (11) that the sequence $A = (A_n, \mathscr{F}_{n-1})$ compensates $X = (X_n, \mathscr{F}_n)$ so that it becomes a martingale. This observation motivates the following definition.

Definition 7. A predictable increasing sequence $A = (A_n, \mathscr{F}_{n-1})$ appearing in the Doob decomposition (11) is called a compensator (of the submartingale $X$).
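
The construction in the proof of Theorem 2 can be carried out explicitly in a simple case. The sketch below takes a symmetric simple random walk $S_n$ (steps $\pm 1$ with probability $\tfrac{1}{2}$) and the sequence $X_n = f(S_n)$, for which $E(X_{j+1} \mid \mathscr{F}_j) = \tfrac{1}{2}f(S_j + 1) + \tfrac{1}{2}f(S_j - 1)$. The function name `doob_decomposition` and the choice of the walk are assumptions made only for this illustration; $X$ is a submartingale when $f$ is convex downward, e.g. $f(x) = x^2$ or $f(x) = |x|$.

```python
def doob_decomposition(steps, f):
    """Martingale part m and compensator A of X_n = f(S_n) along one path.

    steps is a list of +1/-1 increments of a symmetric simple random walk S.
    """
    s = 0
    m, A = [f(0)], [0.0]
    for eta in steps:
        cond = 0.5 * f(s + 1) + 0.5 * f(s - 1)   # E(X_{j+1} | F_j)
        A.append(A[-1] + cond - f(s))            # predictable increment of A
        m.append(m[-1] + f(s + eta) - cond)      # martingale increment of m
        s += eta
    return m, A


path = [1, 1, -1, 1, -1, -1, -1]
m_sq, A_sq = doob_decomposition(path, lambda x: x * x)
m_abs, A_abs = doob_decomposition([1, -1, -1, 1], abs)
```

For $f(x) = x^2$ the compensator is $A_n = n$, so $S_n^2 - n$ is the martingale part. For $f(x) = |x|$ the compensator increases by one exactly at the steps that start from zero, i.e. it counts the visits of the walk to the origin before time $n$.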
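The Doob decomposition plays a key role in the study of square integrable martingales $M = (M_n, \mathscr{F}_n)$, i.e. martingales for which $EM_n^2 < \infty$, $n \geq 0$; this depends on the observation that the stochastic sequence $M^2 = (M_n^2, \mathscr{F}_n)$ is a submartingale. According to Theorem 2 there are a martingale $m = (m_n, \mathscr{F}_n)$ and a predictable increasing sequence $\langle M \rangle = (\langle M \rangle_n, \mathscr{F}_{n-1})$ such that

$$M_n^2 = m_n + \langle M \rangle_n. \tag{14}$$

The sequence $\langle M \rangle$ is called the quadratic variation of $M$ and, in many respects, determines its structure and properties.
It follows from (13) that

$$\langle M \rangle_n = \sum_{j=1}^{n} E[(\Delta M_j)^2 \mid \mathscr{F}_{j-1}] \tag{15}$$

and, for all $l \leq k$,

$$E[(M_k - M_l)^2 \mid \mathscr{F}_l] = E[M_k^2 - M_l^2 \mid \mathscr{F}_l] = E[\langle M \rangle_k - \langle M \rangle_l \mid \mathscr{F}_l]. \tag{16}$$

In particular, if $M_0 = 0$ (P-a.s.) then

$$EM_k^2 = E\langle M \rangle_k. \tag{17}$$

It is useful to observe that if $M_0 = 0$ and $M_n = \xi_1 + \cdots + \xi_n$, where $(\xi_n)$ is a sequence of independent random variables with $E\xi_i = 0$ and $E\xi_i^2 < \infty$, the quadratic variation

$$\langle M \rangle_n = EM_n^2 = V\xi_1 + \cdots + V\xi_n \tag{18}$$

is not random, and indeed coincides with the variance.

If $X = (X_n, \mathscr{F}_n)$ and $Y = (Y_n, \mathscr{F}_n)$ are square integrable martingales, we put

$$\langle X, Y \rangle_n = \tfrac{1}{4}[\langle X + Y \rangle_n - \langle X - Y \rangle_n]. \tag{19}$$

It is easily verified that $(X_n Y_n - \langle X, Y \rangle_n, \mathscr{F}_n)$ is a martingale and therefore, for $l \leq k$,

$$E[(X_k - X_l)(Y_k - Y_l) \mid \mathscr{F}_l] = E[\langle X, Y \rangle_k - \langle X, Y \rangle_l \mid \mathscr{F}_l]. \tag{20}$$

In the case when $X_n = \xi_1 + \cdots + \xi_n$, $Y_n = \eta_1 + \cdots + \eta_n$, where $(\xi_n)$ and $(\eta_n)$ are sequences of independent random variables with $E\xi_i = E\eta_i = 0$ and $E\xi_i^2 < \infty$, $E\eta_i^2 < \infty$, the variable $\langle X, Y \rangle_n$ is given by

$$\langle X, Y \rangle_n = \sum_{i=1}^{n}\operatorname{cov}(\xi_i, \eta_i).$$

The sequence $\langle X, Y \rangle = (\langle X, Y \rangle_n, \mathscr{F}_{n-1})$ is often called the mutual variation of the (square integrable) martingales $X$ and $Y$.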
8. PROBLEMS

1. Show that (2) and (3) are equivalent.
2. Let $\sigma$ and $\tau$ be Markov times. Show that $\tau + \sigma$, $\tau \wedge \sigma$, and $\tau \vee \sigma$ are also Markov times; and if $P(\sigma \leq \tau) = 1$, then $\mathscr{F}_\sigma \subseteq \mathscr{F}_\tau$.
3. Show that $\tau$ and $X_\tau$ are $\mathscr{F}_\tau$-measurable.
4. Let $Y = (Y_n, \mathscr{F}_n)$ be a martingale (or submartingale), let $V = (V_n, \mathscr{F}_{n-1})$ be a predictable sequence, and let $(V \cdot Y)_n$ be integrable random variables, $n \geq 0$. Show that $V \cdot Y$ is a martingale (or submartingale).
5. Let $\mathscr{F}_1 \subseteq \mathscr{F}_2 \subseteq \cdots$ be a nondecreasing family of $\sigma$-algebras and $\xi$ an integrable random variable. Show that $(X_n)_{n \geq 1}$ with $X_n = E(\xi \mid \mathscr{F}_n)$ is a martingale.
6. Let $\mathscr{G}_1 \supseteq \mathscr{G}_2 \supseteq \cdots$ be a nonincreasing family of $\sigma$-algebras and let $\xi$ be an integrable random variable. Show that $(X_n)_{n \geq 1}$ with $X_n = E(\xi \mid \mathscr{G}_n)$ is a reversed martingale, i.e.

$$E(X_n \mid X_{n+1}, X_{n+2}, \ldots) = X_{n+1} \quad \text{(P-a.s.)}$$

for every $n \geq 1$.
7. Let $\xi_1, \xi_2, \xi_3, \ldots$ be independent random variables, $P(\xi_i = 0) = P(\xi_i = 2) = \tfrac{1}{2}$ and $X_n = \prod_{i=1}^{n}\xi_i$. Show that there do not exist an integrable random variable $\xi$ and a nondecreasing family $(\mathscr{F}_n)$ of $\sigma$-algebras such that $X_n = E(\xi \mid \mathscr{F}_n)$. (This example shows that not every martingale $(X_n)_{n \geq 1}$ can be represented in the form $(E(\xi \mid \mathscr{F}_n))_{n \geq 1}$; compare Example 3, §11, Chapter I.)
8. Let $X = (X_n, \mathscr{F}_n)$, $n \geq 0$, be a square integrable martingale with $EX_n = 0$. Show that it has orthogonal increments, i.e.

$$E\,\Delta X_m\,\Delta X_n = 0, \quad m \neq n,$$

where $\Delta X_k = X_k - X_{k-1}$ for $k \geq 1$ and $\Delta X_0 = X_0$.
(Consequently the square integrable martingales occupy a position in the class
of stochastic sequences with zero mean and finite second moments, intermediate
between sequences with independent increments and sequences with orthogonal
increments.)

2. Preservation of the Martingale Property Under


Time Change at a Random Time
1. If $X = (X_n, \mathscr{F}_n)_{n \geq 0}$ is a martingale, we have

$$EX_n = EX_0 \tag{1}$$

for every $n \geq 1$. Is this property preserved if the time $n$ is replaced by a Markov time $\tau$? Example 8 of the preceding section shows that, in general, the answer is "no": there exist a martingale $X$ and a Markov time $\tau$ (finite with probability 1) such that

$$EX_\tau \neq EX_0. \tag{2}$$

The following basic theorem describes the "typical" situation, in which, in particular, $EX_\tau = EX_0$.

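Theorem 1 (Doob). Let $X = (X_n, \mathscr{F}_n)$ be a martingale (or submartingale), and $\tau_1$ and $\tau_2$ stopping times for which

$$E|X_{\tau_i}| < \infty, \quad i = 1, 2, \tag{3}$$

$$\lim_{n \to \infty}\int_{\{\tau_i > n\}} |X_n|\,dP = 0, \quad i = 1, 2. \tag{4}$$

Then

$$E(X_{\tau_2} \mid \mathscr{F}_{\tau_1}) \underset{(\geq)}{=} X_{\tau_1} \quad (\{\tau_2 \geq \tau_1\};\ \text{P-a.s.}). \tag{5}$$

If also $P(\tau_1 \leq \tau_2) = 1$, then

$$EX_{\tau_2} \underset{(\geq)}{=} EX_{\tau_1}. \tag{6}$$

(Here and in the formulas below, read the upper symbol for martingales
and the lower symbol for submartingales.)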
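PROOF. It is sufficient to show that, for every $A \in \mathscr{F}_{\tau_1}$,

$$\int_{A \cap \{\tau_2 \geq \tau_1\}} X_{\tau_2}\,dP \underset{(\geq)}{=} \int_{A \cap \{\tau_2 \geq \tau_1\}} X_{\tau_1}\,dP. \tag{7}$$

For this, in turn, it is sufficient to show that, for every $n \geq 0$,

$$\int_{A \cap \{\tau_2 \geq \tau_1\} \cap \{\tau_1 = n\}} X_{\tau_2}\,dP \underset{(\geq)}{=} \int_{A \cap \{\tau_2 \geq \tau_1\} \cap \{\tau_1 = n\}} X_{\tau_1}\,dP,$$

or, what amounts to the same thing,

$$\int_{B \cap \{\tau_2 \geq n\}} X_{\tau_2}\,dP \underset{(\geq)}{=} \int_{B \cap \{\tau_2 \geq n\}} X_n\,dP, \tag{8}$$

where $B = A \cap \{\tau_1 = n\} \in \mathscr{F}_n$.
We have

$$\begin{aligned}
\int_{B \cap \{\tau_2 \geq n\}} X_n\,dP &= \int_{B \cap \{\tau_2 = n\}} X_n\,dP + \int_{B \cap \{\tau_2 > n\}} X_n\,dP \\
&\underset{(\leq)}{=} \int_{B \cap \{\tau_2 = n\}} X_{\tau_2}\,dP + \int_{B \cap \{\tau_2 > n\}} X_{n+1}\,dP \\
&\underset{(\leq)}{=} \cdots \underset{(\leq)}{=} \int_{B \cap \{n \leq \tau_2 \leq m\}} X_{\tau_2}\,dP + \int_{B \cap \{\tau_2 > m\}} X_m\,dP,
\end{aligned}$$

whence

$$\int_{B \cap \{n \leq \tau_2 \leq m\}} X_{\tau_2}\,dP \underset{(\geq)}{=} \int_{B \cap \{\tau_2 \geq n\}} X_n\,dP - \int_{B \cap \{\tau_2 > m\}} X_m\,dP$$

and since $X_m = 2X_m^+ - |X_m|$, we have, by (4),

$$\int_{B \cap \{\tau_2 \geq n\}} X_{\tau_2}\,dP \underset{(\geq)}{=} \lim_{m \to \infty}\Big[\int_{B \cap \{n \leq \tau_2\}} X_n\,dP - \int_{B \cap \{m < \tau_2\}} X_m\,dP\Big] = \int_{B \cap \{\tau_2 \geq n\}} X_n\,dP,$$

which establishes (8), and hence (5). Finally, (6) follows from (5).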
This completes the proof of the theorem.

Corollary 1. If there is a constant $N$ such that $P(\tau_1 \leq N) = 1$ and $P(\tau_2 \leq N) = 1$, then (3) and (4) are satisfied. Hence if, in addition, $P(\tau_1 \leq \tau_2) = 1$ and $X$ is a martingale, then

$$EX_0 = EX_{\tau_1} = EX_{\tau_2} = EX_N. \tag{9}$$
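
Corollary 1 can be checked exactly on a small example. The sketch below takes a symmetric simple random walk $S_n$ (a martingale with $S_0 = 0$) and the bounded stopping time $\tau \wedge N$, where $\tau$ is the first time the walk reaches $+\text{upper}$ or $-\text{lower}$. The helper names and the particular levels are hypothetical; the expectation is computed by enumerating all $2^N$ equally likely paths.

```python
from fractions import Fraction
from itertools import product


def stopped_value(steps, lower, upper):
    """Value of the walk at time min(tau, N), tau = first exit from (-lower, upper)."""
    s = 0
    for eta in steps:
        if s >= upper or s <= -lower:
            break                      # already stopped, ignore later steps
        s += eta
    return s


def expected_stopped_value(N, lower, upper):
    """Exact E S_{tau ^ N} for the symmetric walk, by enumerating all paths."""
    total = Fraction(0)
    for steps in product((-1, 1), repeat=N):
        total += Fraction(stopped_value(steps, lower, upper), 2 ** N)
    return total
```

Even with asymmetric levels, for instance `expected_stopped_value(6, 3, 1)`, the exact mean is zero, in agreement with (9); the stopping rule changes the distribution of the final value but not its expectation.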

Corollary 2. If the random variables $\{X_n\}$ are uniformly integrable (in particular, if $|X_n| \leq C < \infty$, $n \geq 0$), then (3) and (4) are satisfied.
In fact, $P(\tau_i > n) \to 0$, $n \to \infty$, and hence (4) follows from Lemma 2, §6, Chapter II. In addition, since the family $\{X_n\}$ is uniformly integrable, we have (see II.6.(16))

$$\sup_N E|X_N| < \infty. \tag{10}$$

If $\tau$ is a stopping time and $X$ is a submartingale, then by Corollary 1, applied to the bounded time $\tau_N = \tau \wedge N$,

$$EX_0 \leq EX_{\tau_N}.$$

Therefore

$$E|X_{\tau_N}| = 2EX_{\tau_N}^+ - EX_{\tau_N} \leq 2EX_{\tau_N}^+ - EX_0. \tag{11}$$

The sequence $X^+ = (X_n^+, \mathscr{F}_n)$ is a submartingale (Example 5, §1) and therefore

$$EX_{\tau_N}^+ = \sum_{j=0}^{N}\int_{\{\tau = j\}} X_j^+\,dP + \int_{\{\tau > N\}} X_N^+\,dP \leq \sum_{j=0}^{N}\int_{\{\tau = j\}} X_N^+\,dP + \int_{\{\tau > N\}} X_N^+\,dP = EX_N^+ \leq E|X_N| \leq \sup_N E|X_N|.$$

From this and (11) we have

$$E|X_{\tau_N}| \leq 3\sup_N E|X_N|,$$

and hence by Fatou's lemma

$$E|X_\tau| \leq 3\sup_N E|X_N|.$$

Therefore if we take $\tau = \tau_i$, $i = 1, 2$, and use (10), we obtain $E|X_{\tau_i}| < \infty$, $i = 1, 2$.

Remark. In Example 8 of the preceding section,

$$\int_{\{\tau > n\}} |X_n|\,dP = (2^n - 1)P\{\tau > n\} = (2^n - 1) \cdot 2^{-n} \to 1, \quad n \to \infty,$$

and consequently (4) is violated (for $\tau_2 = \tau$).

2. The following proposition, which we shall deduce from Theorem 1, is often useful in applications.

Theorem 2. Let $X = (X_n)$ be a martingale (or submartingale) and $\tau$ a stopping time (with respect to $(\mathcal{F}_n^X)$, where $\mathcal{F}_n^X = \sigma\{\omega: X_0, \ldots, X_n\}$). Suppose that

$$E\tau < \infty,$$

and that for all $n \ge 0$ and some constant $C$

$$E\{|X_{n+1} - X_n| \mid \mathcal{F}_n^X\} \le C \qquad (\{\tau \ge n\};\ P\text{-a.s.}).$$

Then

$$E|X_\tau| < \infty$$

and

$$EX_\tau \underset{(\ge)}{=} EX_0. \tag{12}$$

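We first verify that hypotheses (3) and (4) of Theorem 1 are satisfied with $\tau_2 = \tau$.

Let

$$Y_0 = |X_0|, \qquad Y_j = |X_j - X_{j-1}|, \quad j \ge 1.$$

Then $|X_\tau| \le \sum_{j=0}^{\tau} Y_j$ and

$$\begin{aligned}
E|X_\tau| &\le E\Big(\sum_{j=0}^{\tau} Y_j\Big) = \int_\Omega \sum_{j=0}^{\tau} Y_j\, dP = \sum_{n=0}^{\infty} \int_{\{\tau = n\}} \sum_{j=0}^{n} Y_j\, dP \\
&= \sum_{n=0}^{\infty} \sum_{j=0}^{n} \int_{\{\tau = n\}} Y_j\, dP = \sum_{j=0}^{\infty} \sum_{n=j}^{\infty} \int_{\{\tau = n\}} Y_j\, dP = \sum_{j=0}^{\infty} \int_{\{\tau \ge j\}} Y_j\, dP.
\end{aligned}$$

The set $\{\tau \ge j\} = \Omega \setminus \{\tau < j\} \in \mathcal{F}_{j-1}^X$, $j \ge 1$. Therefore

$$\int_{\{\tau \ge j\}} Y_j\, dP = \int_{\{\tau \ge j\}} E[Y_j \mid X_0, \ldots, X_{j-1}]\, dP \le C P\{\tau \ge j\}$$

for $j \ge 1$; and

$$E|X_\tau| \le E\Big(\sum_{j=0}^{\tau} Y_j\Big) \le E|X_0| + C \sum_{j=1}^{\infty} P\{\tau \ge j\} = E|X_0| + C E\tau < \infty. \tag{13}$$

Moreover, if $\tau > n$, then

$$\sum_{j=0}^{n} Y_j \le \sum_{j=0}^{\tau} Y_j,$$

and therefore

$$\int_{\{\tau > n\}} |X_n|\, dP \le \int_{\{\tau > n\}} \sum_{j=0}^{\tau} Y_j\, dP.$$

Hence since (by (13)) $E \sum_{j=0}^{\tau} Y_j < \infty$ and $\{\tau > n\} \downarrow \varnothing$, $n \to \infty$, the dominated convergence theorem yields

$$\lim_{n \to \infty} \int_{\{\tau > n\}} |X_n|\, dP \le \lim_{n \to \infty} \int_{\{\tau > n\}} \Big(\sum_{j=0}^{\tau} Y_j\Big)\, dP = 0.$$

Hence the hypotheses of Theorem 1 are satisfied, and (12) follows as required.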
This completes the proof of the theorem.
3. Here we present some applications of the preceding theorems.

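Theorem 3 (Wald's Identities). Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables with $E|\xi_i| < \infty$ and $\tau$ a stopping time (with respect to $(\mathcal{F}_n^\xi)$, where $\mathcal{F}_n^\xi = \sigma\{\omega: \xi_1, \ldots, \xi_n\}$, $\tau \ge 1$), and $E\tau < \infty$. Then

$$E(\xi_1 + \cdots + \xi_\tau) = E\xi_1 \cdot E\tau. \tag{14}$$

If also $E\xi_i^2 < \infty$ then

$$E\{(\xi_1 + \cdots + \xi_\tau) - \tau E\xi_1\}^2 = V\xi_1 \cdot E\tau. \tag{15}$$

PROOF. It is clear that $X = (X_n, \mathcal{F}_n^\xi)_{n \ge 1}$ with $X_n = (\xi_1 + \cdots + \xi_n) - nE\xi_1$ is a martingale with

$$E[|X_{n+1} - X_n| \mid X_1, \ldots, X_n] = E[|\xi_{n+1} - E\xi_1| \mid \xi_1, \ldots, \xi_n] = E|\xi_{n+1} - E\xi_1| \le 2E|\xi_1| < \infty.$$

Therefore $EX_\tau = EX_0 = 0$, by Theorem 2, and (14) is established.
Similar considerations applied to the martingale $Y = (Y_n, \mathcal{F}_n^\xi)$ with $Y_n = X_n^2 - nV\xi_1$ lead to a proof of (15).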
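Wald's identities (14) and (15) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original text. It assumes an illustrative setup chosen here: the $\xi_i$ are rolls of a fair die and $\tau$ is the number of rolls up to and including the first 6, so that $E\xi_1 = 3.5$, $V\xi_1 = 35/12$ and $E\tau = 6$. The function name `simulate_wald`, the sample size and the seed are hypothetical choices.

```python
import random

def simulate_wald(num_trials=100_000, seed=0):
    """Roll a fair die until the first 6 and return Monte Carlo estimates.

    tau is the number of rolls and S_tau the sum of all rolls, so tau is a
    stopping time for the rolls (illustrative assumption, not from the text).
    """
    rng = random.Random(seed)
    mean_xi = 3.5           # E xi_1 for a fair die
    var_xi = 35.0 / 12.0    # V xi_1 for a fair die
    sum_s = sum_tau = sum_sq = 0.0
    for _ in range(num_trials):
        s = tau = 0
        while True:
            xi = rng.randint(1, 6)
            s += xi
            tau += 1
            if xi == 6:     # the decision to stop uses only xi_1, ..., xi_tau
                break
        sum_s += s
        sum_tau += tau
        sum_sq += (s - tau * mean_xi) ** 2
    mean_tau = sum_tau / num_trials
    return {
        "E S_tau": sum_s / num_trials,                   # left side of (14)
        "E xi * E tau": mean_xi * mean_tau,              # right side of (14)
        "E (S_tau - tau E xi)^2": sum_sq / num_trials,   # left side of (15)
        "V xi * E tau": var_xi * mean_tau,               # right side of (15)
    }

estimates = simulate_wald()
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions both sides of (14) should be close to $3.5 \cdot 6 = 21$ and both sides of (15) close to $(35/12) \cdot 6 = 17.5$, up to sampling error.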

Corollary. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables with

$$P(\xi_i = 1) = P(\xi_i = -1) = \tfrac{1}{2}, \qquad S_n = \xi_1 + \cdots + \xi_n,$$

and $\tau = \inf\{n \ge 1: S_n = 1\}$. Then $P\{\tau < \infty\} = 1$ (see, for example, (I.9.20)) and therefore $P(S_\tau = 1) = 1$, $ES_\tau = 1$. Hence it follows from (14) that $E\tau = \infty$.
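Theorem 4 (Wald's Fundamental Identity). Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables, $S_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$. Let $\varphi(t) = Ee^{t\xi_1}$, $t \in R$, and for some $t_0 \ne 0$ let $\varphi(t_0)$ exist and $\varphi(t_0) \ge 1$.
If $\tau$ is a stopping time (with respect to $(\mathcal{F}_n^\xi)$, $\mathcal{F}_n^\xi = \sigma\{\omega: \xi_1, \ldots, \xi_n\}$, $\tau \ge 1$), such that $|S_n| \le C$ ($\{\tau \ge n\}$; P-a.s.) and $E\tau < \infty$, then

$$E\left[\frac{e^{t_0 S_\tau}}{(\varphi(t_0))^{\tau}}\right] = 1. \tag{16}$$

PROOF. Take

$$Y_n = e^{t_0 S_n} (\varphi(t_0))^{-n}.$$

Then $Y = (Y_n, \mathcal{F}_n^\xi)_{n \ge 1}$ is a martingale with $EY_n = 1$ and, on the set $\{\tau \ge n\}$,

$$\begin{aligned}
E\{|Y_{n+1} - Y_n| \mid Y_1, \ldots, Y_n\} &= Y_n E\left\{\left|\frac{e^{t_0 \xi_{n+1}}}{\varphi(t_0)} - 1\right| \,\Big|\, \xi_1, \ldots, \xi_n\right\} \\
&= Y_n E\{|e^{t_0 \xi_1} \varphi^{-1}(t_0) - 1|\} \le B < \infty,
\end{aligned}$$

where $B$ is a constant. Therefore Theorem 2 is applicable, and (16) follows since $EY_1 = 1$.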
This completes the proof.

EXAMPLE 1. This example illustrates the use of the preceding results to find the probabilities of ruin and the mean duration in games (see §9, Chapter I).
Let $\xi_1, \xi_2, \ldots$ be a sequence of independent Bernoulli random variables with $P(\xi_i = 1) = p$, $P(\xi_i = -1) = q$, $p + q = 1$, $S_n = \xi_1 + \cdots + \xi_n$, and

$$\tau = \inf\{n \ge 1: S_n = B \text{ or } A\}, \tag{17}$$

where $(-A)$ and $B$ are positive integers.

It follows from (I.9.20) that $P(\tau < \infty) = 1$ and $E\tau < \infty$. Then if $\alpha = P(S_\tau = A)$, $\beta = P(S_\tau = B)$, we have $\alpha + \beta = 1$. If $p = q = \tfrac{1}{2}$, we obtain from (14)

$$0 = ES_\tau = \alpha A + \beta B,$$

whence

$$\alpha = \frac{B}{B + |A|}, \qquad \beta = \frac{|A|}{B + |A|}.$$

Applying (15), we obtain

$$E\tau = ES_\tau^2 = \alpha A^2 + \beta B^2 = |AB|.$$


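However, if $p \ne q$ we find, by considering the martingale $((q/p)^{S_n})_{n \ge 1}$, that

$$E\left(\frac{q}{p}\right)^{S_\tau} = E\left(\frac{q}{p}\right)^{S_1} = 1,$$

and therefore

$$\alpha \left(\frac{q}{p}\right)^{A} + \beta \left(\frac{q}{p}\right)^{B} = 1.$$

Together with the equation $\alpha + \beta = 1$ this yields

$$\alpha = \frac{(q/p)^B - 1}{(q/p)^B - (q/p)^A}, \qquad \beta = \frac{1 - (q/p)^A}{(q/p)^B - (q/p)^A}. \tag{18}$$

Finally, since $ES_\tau = (p - q)E\tau$, we find

$$E\tau = \frac{ES_\tau}{p - q} = \frac{\alpha A + \beta B}{p - q},$$

where $\alpha$ and $\beta$ are defined by (18).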
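The ruin probabilities and mean duration found in Example 1 can be compared with a direct numerical computation. The sketch below is an illustration added here, not the author's method: it pushes the law of $S_n$ forward one step at a time on the set $\{\tau > n\}$ and collects the mass absorbed at $A$ and $B$. The function names `exit_statistics` and `formulas`, the truncation at 2000 steps, and the sample values $A = -3$, $B = 4$, $p = 0.6$ are assumptions made for the example.

```python
def exit_statistics(p, A, B, steps=2000):
    """Return (alpha, beta, E tau) for the walk started at 0, stopped at A or B.

    The mass that first reaches A or B at step n equals P{tau = n, S_tau = A}
    or P{tau = n, S_tau = B}; the series is truncated after `steps` terms.
    """
    q = 1.0 - p
    alive = {0: 1.0}                      # law of S_n restricted to {tau > n}
    alpha = beta = mean_tau = 0.0
    for n in range(1, steps + 1):
        moved = {}
        for x, mass in alive.items():
            moved[x + 1] = moved.get(x + 1, 0.0) + p * mass
            moved[x - 1] = moved.get(x - 1, 0.0) + q * mass
        hit_a = moved.pop(A, 0.0)
        hit_b = moved.pop(B, 0.0)
        alpha += hit_a
        beta += hit_b
        mean_tau += n * (hit_a + hit_b)
        alive = moved
    return alpha, beta, mean_tau

def formulas(p, A, B):
    """Closed forms from Example 1: the symmetric case, and (18) for p != q."""
    q = 1.0 - p
    if p == q:
        alpha = B / (B + abs(A))
        return alpha, 1.0 - alpha, float(abs(A * B))
    r = q / p
    alpha = (r ** B - 1.0) / (r ** B - r ** A)
    beta = (1.0 - r ** A) / (r ** B - r ** A)
    return alpha, beta, (alpha * A + beta * B) / (p - q)

A, B = -3, 4
fair_numeric, fair_exact = exit_statistics(0.5, A, B), formulas(0.5, A, B)
biased_numeric, biased_exact = exit_statistics(0.6, A, B), formulas(0.6, A, B)
print(fair_numeric, fair_exact)
print(biased_numeric, biased_exact)
```

Because $P\{\tau > n\}$ decays geometrically, the truncation error after 2000 steps is far below floating-point precision for these parameters.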


EXAMPLE 2. In the example considered above, let $p = q = \tfrac{1}{2}$. Let us show that for every $\lambda$ in $0 < \lambda < \pi/(B + |A|)$ and every time $\tau$ defined in (17),

$$E(\cos\lambda)^{-\tau} = \frac{\cos \lambda \dfrac{B + A}{2}}{\cos \lambda \dfrac{B + |A|}{2}}. \tag{19}$$

For this purpose we consider the martingale $X = (X_n, \mathcal{F}_n^\xi)_{n \ge 0}$ with

$$X_n = (\cos\lambda)^{-n} \cos\lambda\left(S_n - \frac{B + A}{2}\right) \tag{20}$$

and $S_0 = 0$. It is clear that

$$EX_n = EX_0 = \cos\lambda\frac{B + A}{2}. \tag{21}$$

Let us show that the family $\{X_{n \wedge \tau}\}$ is uniformly integrable. For this purpose we observe that, by Corollary 1 to Theorem 1 for $0 < \lambda < \pi/(B + |A|)$,

$$EX_0 = EX_{n \wedge \tau} = E(\cos\lambda)^{-(n \wedge \tau)} \cos\lambda\left(S_{n \wedge \tau} - \frac{B + A}{2}\right) \ge E(\cos\lambda)^{-(n \wedge \tau)} \cos\lambda\frac{B - A}{2}.$$

Therefore, by (21),

$$E(\cos\lambda)^{-(n \wedge \tau)} \le \frac{\cos \lambda \dfrac{B + A}{2}}{\cos \lambda \dfrac{B + |A|}{2}},$$

and consequently by Fatou's lemma,

$$E(\cos\lambda)^{-\tau} \le \frac{\cos \lambda \dfrac{B + A}{2}}{\cos \lambda \dfrac{B + |A|}{2}}. \tag{22}$$

Consequently, by (20),

$$|X_{n \wedge \tau}| \le (\cos\lambda)^{-\tau}.$$

With (22), this establishes the uniform integrability of the family $\{X_{n \wedge \tau}\}$. Then, by Corollary 2 to Theorem 1,

$$\cos\lambda\frac{B + A}{2} = EX_0 = EX_\tau = E(\cos\lambda)^{-\tau} \cos\lambda\frac{B - A}{2},$$

from which the required equality (19) follows.
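Formula (19) can also be verified numerically by summing the series $E(\cos\lambda)^{-\tau} = \sum_n (\cos\lambda)^{-n} P\{\tau = n\}$. The following is a minimal sketch under assumptions made here for illustration: the function name `expected_cos_power`, the truncation at 1000 steps, and the sample values $A = -2$, $B = 3$, $\lambda = 0.3$ (which satisfies $0 < \lambda < \pi/5$) are not from the text.

```python
import math

def expected_cos_power(lam, A, B, steps=1000):
    """Approximate E (cos lam)^(-tau) for the symmetric walk stopped at A or B."""
    alive = {0: 1.0}          # law of S_n restricted to {tau > n}, S_0 = 0
    weight = 1.0              # running value of (cos lam)^(-n)
    total = 0.0
    for _ in range(steps):
        weight /= math.cos(lam)
        moved = {}
        for x, mass in alive.items():
            for y in (x - 1, x + 1):
                moved[y] = moved.get(y, 0.0) + 0.5 * mass
        # mass absorbed at this step is P{tau = n}
        absorbed = moved.pop(A, 0.0) + moved.pop(B, 0.0)
        total += weight * absorbed
        alive = moved
    return total

A, B, lam = -2, 3, 0.3
numeric = expected_cos_power(lam, A, B)
closed_form = math.cos(lam * (B + A) / 2) / math.cos(lam * (B + abs(A)) / 2)
print(numeric, closed_form)
```

The series converges because $\lambda < \pi/(B + |A|)$: the tail probability $P\{\tau > n\}$ decays faster than $(\cos\lambda)^{-n}$ grows, which is the same restriction on $\lambda$ that appears in the example.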

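4. PROBLEMS

1. Show that Theorem 1 remains valid for submartingales if (4) is replaced by

$$\liminf_{n \to \infty} \int_{\{\tau_i > n\}} X_n^+\, dP = 0, \qquad i = 1, 2.$$

2. Let $X = (X_n, \mathcal{F}_n)_{n \ge 0}$ be a square-integrable martingale, $\tau$ a stopping time and

$$\liminf_{n \to \infty} \int_{\{\tau > n\}} X_n^2\, dP = 0, \qquad \liminf_{n \to \infty} \int_{\{\tau > n\}} |X_n|\, dP = 0.$$

Show that then

$$EX_\tau^2 = E\sum_{j=0}^{\tau} (\Delta X_j)^2,$$

where $\Delta X_0 = X_0$, $\Delta X_j = X_j - X_{j-1}$, $j \ge 1$.

3. Show that

$$E|X_\tau| \le \lim_{n \to \infty} E|X_n|$$

for every martingale or nonnegative submartingale $X = (X_n, \mathcal{F}_n)_{n \ge 0}$ and every stopping time $\tau$.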


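4. Let $X = (X_n, \mathcal{F}_n)_{n \ge 0}$ be a supermartingale such that $X_n \ge E(\xi \mid \mathcal{F}_n)$ (P-a.s.), $n \ge 0$, where $E|\xi| < \infty$. Show that if $\tau_1$ and $\tau_2$ are stopping times with $P(\tau_1 \le \tau_2) = 1$, then

$$X_{\tau_1} \ge E(X_{\tau_2} \mid \mathcal{F}_{\tau_1}) \qquad (P\text{-a.s.}).$$

5. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with $P(\xi_i = 1) = P(\xi_i = -1) = \tfrac{1}{2}$, $a$ and $b$ positive numbers, $b > a$,

$$X_n = a \sum_{k=1}^{n} I(\xi_k = +1) - b \sum_{k=1}^{n} I(\xi_k = -1)$$

and

$$\tau = \inf\{n \ge 1: X_n \le -r\}, \qquad r > 0.$$

Show that $Ee^{\lambda\tau} < \infty$ for $\lambda \le \alpha_0$ and $Ee^{\lambda\tau} = \infty$ for $\lambda > \alpha_0$, where

$$\alpha_0 = \frac{b}{a + b} \ln\frac{2b}{a + b} + \frac{a}{a + b} \ln\frac{2a}{a + b}.$$

6. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with $E\xi_i = 0$, $V\xi_i = \sigma_i^2$, $S_n = \xi_1 + \cdots + \xi_n$, $\mathcal{F}_n^\xi = \sigma\{\omega: \xi_1, \ldots, \xi_n\}$. Prove the following generalizations of Wald's identities (14) and (15): If $E\sum_{j=1}^{\tau} E|\xi_j| < \infty$ then $ES_\tau = 0$; if $E\sum_{j=1}^{\tau} E\xi_j^2 < \infty$, then

$$ES_\tau^2 = E\sum_{j=1}^{\tau} \xi_j^2 = E\sum_{j=1}^{\tau} \sigma_j^2. \tag{23}$$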

3. Fundamental Inequalities
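1. Let $X = (X_n, \mathcal{F}_n)_{n \ge 0}$ be a stochastic sequence,

$$X_n^* = \max_{0 \le j \le n} |X_j|, \qquad \|X_n\|_p = (E|X_n|^p)^{1/p}, \quad p > 0.$$

Theorem 1 (Doob). Let $X = (X_n, \mathcal{F}_n)$ be a nonnegative submartingale. Then, for every $\varepsilon > 0$ and all $n \ge 0$,

$$P\{X_n^* \ge \varepsilon\} \le \frac{1}{\varepsilon} \int_{\{X_n^* \ge \varepsilon\}} X_n\, dP \le \frac{EX_n}{\varepsilon}; \tag{1}$$

$$\|X_n\|_p \le \|X_n^*\|_p \le \frac{p}{p - 1} \|X_n\|_p \quad \text{if } p > 1; \tag{2}$$

$$\|X_n\|_p \le \|X_n^*\|_p \le \frac{e}{e - 1} \{1 + \|X_n \ln^+ X_n\|_p\} \quad \text{if } p = 1. \tag{3}$$

PROOF. Put

$$\tau_n = \min\{j \le n: X_j \ge \varepsilon\},$$

taking $\tau_n = n$ if $\max_{0 \le j \le n} X_j < \varepsilon$. Then, by (2.6),

$$EX_n \ge EX_{\tau_n} = \int_{\{X_n^* \ge \varepsilon\}} X_{\tau_n}\, dP + \int_{\{X_n^* < \varepsilon\}} X_{\tau_n}\, dP \ge \varepsilon \int_{\{X_n^* \ge \varepsilon\}} dP + \int_{\{X_n^* < \varepsilon\}} X_n\, dP.$$

Therefore

$$\varepsilon P\{X_n^* \ge \varepsilon\} \le EX_n - \int_{\{X_n^* < \varepsilon\}} X_n\, dP = \int_{\{X_n^* \ge \varepsilon\}} X_n\, dP \le EX_n,$$

which establishes (1).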


The first inequalities in (2) and (3) are evident.
In proving the second inequality in (2), we first suppose that

$$\|X_n^*\|_p < \infty, \tag{4}$$

and use the equation

$$E\xi^r = r \int_0^\infty t^{r-1} P(\xi \ge t)\, dt, \tag{5}$$

which is satisfied for every nonnegative random variable $\xi$ and for $r > 0$.
Then we find from (1) and Fubini's theorem that when $p > 1$,

$$\begin{aligned}
E(X_n^*)^p &= p \int_0^\infty t^{p-1} P\{X_n^* \ge t\}\, dt \le p \int_0^\infty t^{p-2} \left(\int_{\{X_n^* \ge t\}} X_n\, dP\right) dt \\
&= p \int_0^\infty t^{p-2} \left[\int_\Omega X_n I\{X_n^* \ge t\}\, dP\right] dt = p \int_\Omega X_n \left[\int_0^{X_n^*} t^{p-2}\, dt\right] dP \\
&= \frac{p}{p - 1} E[X_n (X_n^*)^{p-1}].
\end{aligned} \tag{6}$$

Hence by Hölder's inequality

$$E(X_n^*)^p \le q \|X_n\|_p \cdot \|(X_n^*)^{p-1}\|_q = q \|X_n\|_p [E(X_n^*)^p]^{1/q}, \tag{7}$$

where $q = p/(p - 1)$.

If (4) is satisfied, the second inequality in (2) follows at once.
If (4) is not satisfied, we may proceed as follows. In (6), we consider $(X_n^* \wedge L)$ instead of $X_n^*$, where $L$ is a constant. We obtain

$$E(X_n^* \wedge L)^p \le q E[X_n (X_n^* \wedge L)^{p-1}] \le q \|X_n\|_p [E(X_n^* \wedge L)^p]^{1/q},$$

and it follows from the inequality $E(X_n^* \wedge L)^p \le L^p < \infty$ that

$$E(X_n^* \wedge L)^p \le q^p EX_n^p = q^p \|X_n\|_p^p$$

and therefore

$$E(X_n^*)^p = \lim_{L \to \infty} E(X_n^* \wedge L)^p \le q^p \|X_n\|_p^p.$$


We can now prove the second inequality in (3).

Applying (1) again, we find that

$$\begin{aligned}
EX_n^* - 1 &\le E(X_n^* - 1)^+ = \int_0^\infty P\{X_n^* - 1 \ge t\}\, dt \\
&\le \int_0^\infty \frac{1}{1 + t} \left[\int_{\{X_n^* \ge 1 + t\}} X_n\, dP\right] dt = E X_n \int_0^{X_n^* - 1} \frac{dt}{1 + t} = EX_n \ln X_n^*.
\end{aligned}$$

Since

$$a \ln b \le a \ln^+ a + b e^{-1} \tag{8}$$

for all $a \ge 0$ and $b > 0$, we have

$$EX_n^* - 1 \le EX_n \ln X_n^* \le EX_n \ln^+ X_n + e^{-1} EX_n^*.$$

If $EX_n^* < \infty$ we then obtain (3) immediately. If $EX_n^* = \infty$, we again introduce $X_n^* \wedge L$ in place of $X_n^*$, and proceed as above.

This completes the proof of the theorem.

Corollary 1. Let $X = (X_n, \mathcal{F}_n)$ be a square-integrable martingale. Then $X^2 = (X_n^2, \mathcal{F}_n)$ is a submartingale, and it follows from (1) that

$$P\left\{\max_{j \le n} |X_j| \ge \varepsilon\right\} \le \frac{EX_n^2}{\varepsilon^2}. \tag{9}$$

In particular, if $X_j = \xi_0 + \cdots + \xi_j$, where $(\xi_j)$ is a sequence of independent random variables with $E\xi_j = 0$ and $E\xi_j^2 < \infty$, inequality (9) becomes Kolmogorov's inequality (§2, Chapter IV).

Corollary 2. If $X = (X_n, \mathcal{F}_n)$ is a square-integrable martingale, we find from (2) that

$$E\left[\max_{j \le n} X_j^2\right] \le 4EX_n^2. \tag{10}$$
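Doob's inequalities (1) and (2) can be checked exactly on a small example. The sketch below is an illustration added here under stated assumptions: it takes the nonnegative submartingale $X_j = |S_j|$, where $S_j$ is the symmetric simple random walk with $S_0 = 0$, and enumerates all $2^n$ sign paths. The function name `doob_check` and the values $n = 10$, $\varepsilon = 3$, $p = 2$ are hypothetical choices.

```python
from itertools import product

def doob_check(n=10, eps=3.0, p=2.0):
    """Evaluate both sides of (1) and (2) for X_j = |S_j| by full enumeration."""
    weight = 0.5 ** n                      # probability of each sign path
    prob_max = tail_mean = mean_x = moment_x = moment_max = 0.0
    for signs in product((-1, 1), repeat=n):
        s = 0
        x_star = 0                         # running maximum of |S_j|, j <= n
        for step in signs:
            s += step
            x_star = max(x_star, abs(s))
        x_n = abs(s)
        mean_x += weight * x_n
        moment_x += weight * x_n ** p
        moment_max += weight * x_star ** p
        if x_star >= eps:
            prob_max += weight
            tail_mean += weight * x_n
    norm_x = moment_x ** (1.0 / p)
    return {
        "P{X* >= eps}": prob_max,
        "E[X_n; X* >= eps] / eps": tail_mean / eps,
        "E X_n / eps": mean_x / eps,
        "||X_n||_p": norm_x,
        "||X*||_p": moment_max ** (1.0 / p),
        "p/(p-1) ||X_n||_p": p / (p - 1.0) * norm_x,
    }

report = doob_check()
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

The first three printed values should appear in nondecreasing order, as in (1), and the last three likewise, as in (2). With $p = 2$ the final comparison is the inequality $E[\max_{j \le n} X_j^2] \le 4EX_n^2$ of Corollary 2, since $S_n$ is a square-integrable martingale.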

2. Let $X = (X_n, \mathcal{F}_n)$ be a nonnegative submartingale and let

$$X_n = M_n + A_n$$

be its Doob decomposition. Then since $EM_n = 0$ it follows from (1) that

$$P\{X_n^* \ge \varepsilon\} \le \frac{EA_n}{\varepsilon}.$$

Theorem 2 (below) will show that this inequality is valid not only for submartingales but also for the wider class of sequences that satisfy a dominance relation, in the following sense.


Definition. Let $(X_n, \mathcal{F}_n)$ be a nonnegative stochastic sequence and let $A = (A_n, \mathcal{F}_{n-1})$ be an increasing predictable sequence. We say that $X$ is dominated by $A$ if

$$EX_\tau \le EA_\tau \tag{11}$$

for every stopping time $\tau$.


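Theorem 2. If $X = (X_n, \mathcal{F}_n)$ is a nonnegative stochastic sequence dominated by the increasing predictable sequence $A = (A_n, \mathcal{F}_{n-1})$, then we have, for every $\varepsilon > 0$, $a > 0$, and stopping time $\tau$,

$$P\{X_\tau^* \ge \varepsilon\} \le \frac{EA_\tau}{\varepsilon}, \tag{12}$$

$$P\{X_\tau^* \ge \varepsilon\} \le \frac{1}{\varepsilon} E(A_\tau \wedge a) + P(A_\tau \ge a), \tag{13}$$

and

$$\|X_\tau^*\|_p \le \left(\frac{2 - p}{1 - p}\right)^{1/p} \|A_\tau\|_p, \qquad 0 < p < 1. \tag{14}$$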

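PROOF. Put

$$\sigma_n = \min\{j \le \tau \wedge n: X_j \ge \varepsilon\},$$

taking $\sigma_n = \tau \wedge n$ if $\{\cdot\} = \varnothing$. Then

$$EA_\tau \ge EA_{\sigma_n} \ge EX_{\sigma_n} \ge \int_{\{X_{\tau \wedge n}^* > \varepsilon\}} X_{\sigma_n}\, dP \ge \varepsilon P\{X_{\tau \wedge n}^* > \varepsilon\},$$

whence

$$P\{X_{\tau \wedge n}^* > \varepsilon\} \le \frac{1}{\varepsilon} EA_\tau.$$

Inequality (12) now follows from Fatou's lemma.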


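To prove (13) we introduce the time

$$\gamma = \inf\{j: A_{j+1} \ge a\},$$

taking $\gamma = \infty$ if $\{\cdot\} = \varnothing$. Then

$$\begin{aligned}
P\{X_\tau^* \ge \varepsilon\} &= P\{X_\tau^* \ge \varepsilon,\ A_\tau < a\} + P\{X_\tau^* \ge \varepsilon,\ A_\tau \ge a\} \\
&\le P\{I_{\{A_\tau < a\}} X_\tau^* \ge \varepsilon\} + P\{A_\tau \ge a\} \\
&\le P\{X_{\tau \wedge \gamma}^* \ge \varepsilon\} + P\{A_\tau \ge a\} \le \frac{1}{\varepsilon} EA_{\tau \wedge \gamma} + P\{A_\tau \ge a\} \\
&\le \frac{1}{\varepsilon} E(A_\tau \wedge a) + P\{A_\tau \ge a\},
\end{aligned}$$

where we have used (12) and the inequality $I_{\{A_\tau < a\}} X_\tau^* \le X_{\tau \wedge \gamma}^*$. Finally, by (13),

$$\begin{aligned}
\|X_\tau^*\|_p^p = E(X_\tau^*)^p &= \int_0^\infty P\{(X_\tau^*)^p \ge t\}\, dt = \int_0^\infty P\{X_\tau^* \ge t^{1/p}\}\, dt \\
&\le \int_0^\infty t^{-1/p} E[A_\tau \wedge t^{1/p}]\, dt + \int_0^\infty P\{A_\tau^p \ge t\}\, dt \\
&= E\int_0^{A_\tau^p} dt + E\int_{A_\tau^p}^\infty (A_\tau t^{-1/p})\, dt + EA_\tau^p = \frac{2 - p}{1 - p} EA_\tau^p.
\end{aligned}$$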

This completes the proof of the theorem.


Remark. Suppose that under the hypotheses of the theorem the increasing sequence $A = (A_n, \mathcal{F}_n)$ is not predictable, but for some $c > 0$

$$P\left\{\sup_{k \ge 1} |\Delta A_k| \le c\right\} = 1,$$

where $\Delta A_k = A_k - A_{k-1}$ for $k \ge 1$. Then (compare (13)) we have the inequality

$$P(X_\tau^* \ge \varepsilon) \le \frac{1}{\varepsilon} E(A_\tau \wedge (a + c)) + P(A_\tau \ge a).$$

(The proof is the same as for (13) with $\gamma = \inf\{j: A_{j+1} \ge a\}$ replaced by $\gamma = \inf\{j: A_j \ge a\}$, taking account of the inequality $A_\gamma \le a + c$.)
Corollary. Let the sequences $X^n$ and $A^n$ satisfy, for each $n \ge 1$, the hypotheses of Theorem 2 or of the preceding Remark (with $P(\sup_k |\Delta A_k^n| \le c) = 1$), and for some sequence of stopping times $\{\tau_n\}$ let

$$A_{\tau_n}^n \xrightarrow{P} 0, \qquad n \to \infty.$$

Then

$$(X^n)_{\tau_n}^* \xrightarrow{P} 0, \qquad n \to \infty.$$

3. In this subsection we present (without proofs, but with applications) a


number of significant inequalities for martingales. These generalize the
inequalities of Khinchin and of Marcinkiewicz and Zygmund for sums of
independent random variables.
Khinchin's Inequalities. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed Bernoulli random variables with $P(\xi_i = 1) = P(\xi_i = -1) = \tfrac{1}{2}$ and let $(c_n)_{n \ge 1}$ be a sequence of numbers.
Then for every $p$, $0 < p < \infty$, there are universal constants $A_p$ and $B_p$ (independent of $(c_n)$) such that

$$A_p \left(\sum_{j=1}^{n} c_j^2\right)^{1/2} \le \left\|\sum_{j=1}^{n} c_j \xi_j\right\|_p \le B_p \left(\sum_{j=1}^{n} c_j^2\right)^{1/2} \tag{15}$$

for every $n \ge 1$.
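For $p = 4$ the two-sided bound in Khinchin's inequalities can be made explicit, which gives a quick sanity check. Expanding the fourth power and using $E\xi_i = 0$, $\xi_i^2 = 1$ gives $E(\sum_j c_j \xi_j)^4 = 3(\sum_j c_j^2)^2 - 2\sum_j c_j^4$, so the ratio $\|\sum_j c_j \xi_j\|_4 / (\sum_j c_j^2)^{1/2}$ lies between $1$ and $3^{1/4}$. The sketch below verifies this by enumerating all sign patterns; the function name `fourth_moment` and the coefficient vector are hypothetical choices made for illustration, and the constants $1$ and $3^{1/4}$ come from this computation rather than from the text.

```python
from itertools import product

def fourth_moment(coeffs):
    """Exact E (sum_j c_j xi_j)^4 for symmetric signs xi_j, by enumeration."""
    total = 0.0
    for signs in product((-1, 1), repeat=len(coeffs)):
        total += sum(c * s for c, s in zip(coeffs, signs)) ** 4
    return total / 2 ** len(coeffs)

coeffs = [1.0, -2.0, 0.5, 3.0, 1.5]         # illustrative coefficients c_j
sum_sq = sum(c * c for c in coeffs)
moment = fourth_moment(coeffs)
closed_form = 3.0 * sum_sq ** 2 - 2.0 * sum(c ** 4 for c in coeffs)
ratio = moment ** 0.25 / sum_sq ** 0.5      # ||X_n||_4 / (sum c_j^2)^(1/2)
print(moment, closed_form, ratio)
```

The ratio stays in a fixed interval whatever the coefficients are, which is exactly what the phrase "universal constants" in the statement means.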

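The following result generalizes these inequalities (for $p \ge 1$):

Marcinkiewicz and Zygmund's Inequalities. If $\xi_1, \xi_2, \ldots$ is a sequence of independent integrable random variables with $E\xi_i = 0$, then for $p \ge 1$ there are universal constants $A_p$ and $B_p$ (independent of $(\xi_n)$) such that

$$A_p \left\|\left(\sum_{j=1}^{n} \xi_j^2\right)^{1/2}\right\|_p \le \left\|\sum_{j=1}^{n} \xi_j\right\|_p \le B_p \left\|\left(\sum_{j=1}^{n} \xi_j^2\right)^{1/2}\right\|_p \tag{16}$$

for every $n \ge 1$.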

In (15) and (16) the sequences $X = (X_n)$ with $X_n = \sum_{j=1}^{n} c_j \xi_j$ and $X_n = \sum_{j=1}^{n} \xi_j$ are martingales. It is natural to ask whether the inequalities can be extended to arbitrary martingales.


The first result in this direction was obtained by Burkholder.

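Burkholder's Inequalities. If $X = (X_n, \mathcal{F}_n)$ is a martingale, then for every $p > 1$ there are universal constants $A_p$ and $B_p$ (independent of $X$) such that

$$A_p \|\sqrt{[X]_n}\|_p \le \|X_n\|_p \le B_p \|\sqrt{[X]_n}\|_p \tag{17}$$

for every $n \ge 1$, where $[X]_n$ is the quadratic variation of $X_n$,

$$[X]_n = \sum_{j=1}^{n} (\Delta X_j)^2, \qquad X_0 = 0. \tag{18}$$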

The constants AP and BP can be taken to have the values

It follows from (17), by using (2), that

$$A_p \|\sqrt{[X]_n}\|_p \le \|X_n^*\|_p \le B_p^* \|\sqrt{[X]_n}\|_p, \tag{19}$$

where

Burkholder's inequalities (17) hold for $p > 1$, whereas the Marcinkiewicz-Zygmund inequalities (16) also hold when $p = 1$. What can we say about the validity of (17) for $p = 1$? It turns out that a direct generalization to $p = 1$ is impossible, as the following example shows.

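EXAMPLE. Let $\xi_1, \xi_2, \ldots$ be independent Bernoulli random variables with $P(\xi_i = 1) = P(\xi_i = -1) = \tfrac{1}{2}$ and let

$$X_n = \sum_{j=1}^{n \wedge \tau} \xi_j,$$

where

$$\tau = \inf\left\{n \ge 1: \sum_{j=1}^{n} \xi_j = 1\right\}.$$

The sequence $X = (X_n, \mathcal{F}_n^\xi)$ is a martingale with

$$\|X_n\|_1 = E|X_n| = 2EX_n^+ \to 2, \qquad n \to \infty.$$

But

$$\|\sqrt{[X]_n}\|_1 = E\sqrt{[X]_n} = E\left(\sum_{j=1}^{\tau \wedge n} 1\right)^{1/2} = E\sqrt{\tau \wedge n} \to \infty.$$

Consequently the first inequality in (17) fails.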
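The failure of (17) at $p = 1$ can be seen numerically. The sketch below is an illustration added here and assumes the stopped walk $X_n = S_{n \wedge \tau}$ with $\tau = \inf\{n \ge 1: S_n = 1\}$ for the symmetric simple random walk $S_n$. It computes $E|X_n|$ and $E\sqrt{[X]_n} = E\sqrt{\tau \wedge n}$ exactly by propagating the law of the walk. The function name `stopped_walk_norms` and the chosen values of $n$ are hypothetical.

```python
import math

def stopped_walk_norms(n):
    """Return (E|X_n|, E sqrt([X]_n)) for the walk stopped on first hitting 1."""
    alive = {0: 1.0}        # law of S_k restricted to {tau > k}
    abs_mean = 0.0          # accumulates E|X_n|
    sqrt_qv = 0.0           # accumulates E sqrt(tau ^ n)
    for k in range(1, n + 1):
        moved = {}
        for x, mass in alive.items():
            for y in (x - 1, x + 1):
                moved[y] = moved.get(y, 0.0) + 0.5 * mass
        hit = moved.pop(1, 0.0)          # P{tau = k}
        abs_mean += hit                  # X_n = 1 on {tau <= n}
        sqrt_qv += hit * math.sqrt(k)    # [X]_n = tau on {tau <= n}
        alive = moved
    # Paths not yet stopped at time n: X_n = S_n and [X]_n = n.
    abs_mean += sum(mass * abs(x) for x, mass in alive.items())
    sqrt_qv += math.sqrt(n) * sum(alive.values())
    return abs_mean, sqrt_qv

results = {n: stopped_walk_norms(n) for n in (25, 100, 400)}
for n, (l1_norm, qv_norm) in results.items():
    print(f"n={n}: E|X_n|={l1_norm:.4f}, E sqrt([X]_n)={qv_norm:.4f}")
```

In this computation $E|X_n|$ stays below 2 while $E\sqrt{[X]_n}$ keeps increasing with $n$ (slowly, roughly like $\ln n$), so no constant $A_1 > 0$ can make the first inequality in (17) hold for all $n$.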


It turns out that when p = 1 we must generalize not (17), but (19) (which
is equivalent when p > 1).

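Davis's Inequality. If $X = (X_n, \mathcal{F}_n)$ is a martingale, there are universal constants $A$ and $B$, $0 < A < B < \infty$, such that

$$A \|\sqrt{[X]_n}\|_1 \le \|X_n^*\|_1 \le B \|\sqrt{[X]_n}\|_1, \tag{20}$$

i.e.

$$A E\sqrt{\sum_{j=1}^{n} (\Delta X_j)^2} \le E\left[\max_{1 \le j \le n} |X_j|\right] \le B E\sqrt{\sum_{j=1}^{n} (\Delta X_j)^2}.$$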

Corollary 1. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables; $S_n = \xi_1 + \cdots + \xi_n$. If $E|\xi_1| < \infty$ and $E\xi_1 = 0$, then according to Wald's identity (2.14) we have

$$ES_\tau = 0 \tag{21}$$

for every stopping time $\tau$ (with respect to $(\mathcal{F}_n^\xi)$) for which $E\tau < \infty$.

It turns out that (21) is still valid under a weaker hypothesis than $E\tau < \infty$ if we impose stronger conditions on the random variables. In fact, if

$$E|\xi_1|^r < \infty,$$

where $1 < r \le 2$, the condition $E\tau^{1/r} < \infty$ is a sufficient condition for $ES_\tau = 0$.
For the proof, we put $\tau_n = \tau \wedge n$, $Y = \sup_n |S_{\tau_n}|$, and let $m = [t^r]$ (integral part of $t^r$) for $t > 0$. By Corollary 1 to Theorem 1, §2, we have $ES_{\tau_n} = 0$. Therefore a sufficient condition for $ES_\tau = 0$ is (by the dominated convergence theorem) that $E \sup_n |S_{\tau_n}| < \infty$.


Using (1) and