
Probability Theory I

Heinrich von Weizsäcker


Based on notes taken by Martin Altmayer and Florian Schwahn

December 20, 2013

Introductory remark

The purpose of this text is to provide on 100 pages a first introduction to the most important mathematical tools of general probability theory, i.e. the probability essentials. Thus we cover similar ground to the nice book by J. Jacod and P. Protter [?] which has that phrase as its title. But even though nothing is really new, of course there are endless possible variations in the details of the presentation. This text contains a few of my favourite arguments. I still feel the influence of my first teachers Klaus Krickeberg and Hans Richter.
It is assumed that the reader is already familiar with basic measure theory and has sharpened the intuition by going through some elementary probability and statistics; more than enough material can be found in the texts [?] and [?] in the same series. Some of the measure theoretic proofs are provided since I believe they help to understand the essence of the result.
I am very grateful for the opportunity to teach this beautiful subject for many years. In particular my thanks go to Martin Altmayer and Florian Schwahn who prepared a preliminary
version of these notes during the winter 2008/9.

Kaiserslautern, December 20, 2013


Heinrich v. Weizsäcker

Contents

1  Probability spaces                                                          5
   1.1  Events                                                                 5
   1.2  Probability                                                            7
   1.3  Examples                                                               9

2  Construction of probability measures                                       12
   2.1  Carathéodory's extension theorem                                      12
   2.2  Monotone classes and uniqueness of probability measures               16

3  Random variables                                                           18
   3.1  Definitions and examples                                              18
   3.2  Verifying measurability                                               20

4  Expectation                                                                23
   4.1  Convergence theorems                                                  26
   4.2  Application: the moment generating function                           27

5  Conditional probability and independence                                   29
   5.1  Basic concepts                                                        29
   5.2  Joint distributions and Fubini's Theorem                              30
   5.3  More than two independent events or random variables                  33
   5.4  Infinite product experiments                                          35

6  Handling concrete distributions                                            38
   6.1  Order statistics and the Beta distribution                            38
   6.2  Transformation of densities                                           40
   6.3  Convolution; the Gamma-distribution                                   42

7  Higher Moments and the L^p-spaces                                          45
   7.1  The p-norm                                                            45
   7.2  Completeness                                                          46
   7.3  Second moments                                                        46

8  Convergence of random variables and the laws of large numbers              49
   8.1  The weak law of large numbers                                         49
   8.2  Convergence in probability versus almost sure convergence             50
   8.3  The strong law of large numbers                                       51

9  Convergence of distributions                                               54
   9.1  One-dimensional distribution functions                                54
   9.2  Weak convergence on general metric spaces                             56
   9.3  Convergence in total variation                                        59

10 Characteristic Functions                                                   60
   10.1  Complex valued random variables                                      60
   10.2  The definition and first properties                                  60
   10.3  The uniqueness theorem                                               63
   10.4  Lévy's continuity theorem                                            65

11 Central Limit Theorems                                                     67
   11.1  The CLT for multidimensional iid sequences                           67
   11.2  Fluctuation of the empirical distribution                            68
   11.3  Outlook: triangular arrays                                           68

12 Conditional Expectation                                                    71
   12.1  Prologue: Orthogonal projections in Hilbert space                    71
   12.2  Heuristic definition of conditional expectation                      71
   12.3  Formal definition of conditional expectation                         72
   12.4  Properties of conditional expectation                                73
   12.5  Uniform integrability                                                76

13 Martingales in discrete time                                               79
   13.1  Definition and examples                                              79
   13.2  Martingales, games and optional sampling                             81
   13.3  Doob's L^p-inequality                                                83

14 Martingale convergence                                                     85
   14.1  Doob's upcrossing inequality for submartingales                      85
   14.2  Increasing martingale convergence                                    86
   14.3  Uniformly integrable martingales                                     87
   14.4  Decreasing martingale convergence                                    88
   14.5  Laws of large numbers via martingale theory                          89

15 Brownian motion                                                            91
   15.1  Introduction                                                         91
   15.2  Dyadic time approximation of Brownian motion: Lévy's triangles       91
   15.3  Linear interpolations of random walks and their modulus of continuity 93
   15.4  Brownian motion exists                                               95
   15.5  Donsker's functional central limit theorem                           96
   15.6  An application: The arcsin law                                       98

Chapter 1

Probability spaces
In this chapter we briefly motivate the axioms of mathematical probability theory as it is developed in this course. It should be stressed that this does not really answer the question what probability is about. On the one hand the word probability can be used meaningfully in contexts beyond the range of the mathematical theory in the sense of this course. On the other hand several different interpretations of the functional meaning of probability are compatible with this mathematical framework.
Probability theory is based on the experience that uncertainty about some circumstances
or events is not the same as complete lack of knowledge about them. On the contrary, in
many situations it is possible to assign to these events a quantitative degree of certainty
or likelihood. Some events appear to the observer to be more likely than others and some
types of events happen more frequently than others. Consequently probability, understood
as this quantitative degree of certainty, has a subjective side to it, being dependent on the
judgement and state of knowledge of the observer, and an objective side, since often by
experimental experience different observers are necessarily led to a similar opinion.

1.1 Events

Here is a small list of statements to which you may attach a degree of certainty.
- A certain die shows 3.
- Two dice give the sum 9.
- The price of rice at some market place increases during the year 2010.
- This year, on December 1 there will be snow in Kaiserslautern.
- Last year, on December 1 there was snow in Kaiserslautern.
- The extinction of dinosaurs is due to the impact of a meteor.
- Paul loves Paula.
Not in all of these situations is it reasonable to attempt a mathematical theory.
Basic Rules for Events A probability model in our sense specifies a collection F of
events, for each of which a probability is assigned in a certain situation or in the mind of
a certain observer. An important assumption is that this collection F is closed under the
natural logical operations.


- If A is an event, then "not A" is also an event.
- If A, B are events, then "A and B" and "A or B" are also events.¹
- "A and not A" is the impossible event. "A or not A" is the sure event (either A happens or A does not happen; there is no third possibility: tertium non datur).
- "A and (B or C)" is equivalent to "(A and B) or (A and C)" (logical distributive law).

Under these logical operations the events form a Boolean algebra. The simplest example of a Boolean algebra is a system of subsets of a set: the operation "and" is replaced by intersection, "or" is replaced by union, and "not" by the formation of the complement.²
Definition 1.1 Let Ω be a set. A nonempty system F of subsets of Ω is called a (set-)algebra over Ω if it has the following properties:
1. If A ∈ F then Ω \ A =: A^c ∈ F.
2. If A, B ∈ F then A ∪ B, A ∩ B ∈ F.
Remark 1.2
1. Every algebra contains the empty set ∅ and the full set Ω, since they have the representations ∅ = A ∩ A^c and Ω = A ∪ A^c.
2. The points ω ∈ Ω are called elementary events.
For the development of probability the following rule is very convenient. It is less fundamental in a philosophical sense, but it is extremely useful because it allows one to simplify many statements. Moreover it allows one to use tools from measure theory.
Technical Rule If A_1, A_2, ... is a sequence of events, then "at least one of them happens" is also an event.
Example 1.3 A coin is thrown infinitely often. Let A_n be the event "the coin shows head at the n-th trial". Then we want to consider the event B = "at least once we get head".
Under our translation between logical operations and set operations the following definition is now very natural.
Definition 1.4 A set algebra F over the set Ω is called a σ-algebra if in addition:
If A_1, A_2, ... is a sequence of elements of F, then ⋃_{n=1}^∞ A_n ∈ F.
Remark 1.5 Using complements we see that for a σ-algebra F and for every sequence (A_n) of elements of F also the set

    ⋂_{n=1}^∞ A_n = ( ⋃_{n=1}^∞ A_n^c )^c

belongs to F. (We shall often need similar identities between sets. They may not be immediately clear for the inexperienced reader: The trick is of course to consider an arbitrary element of one side of the equation and to convince oneself that it must also be an element of the other side.)
¹ However in quantum mechanics this is too much to ask for: Suppose both A: "the electron has location in dx" and B: "the electron has momentum in dp" are statements which can be checked by a measurement and have a certain probability. Then typically the combined statement "A and B" cannot be checked and therefore it is not meaningfully connected to a probability. For a short look at quantum probability we refer to the last chapter in Williams [?].
² Actually according to the Stone representation theorem every Boolean algebra is isomorphic to a set algebra in the sense of Definition 1.1.


We quote some simple facts from measure theory.


Theorem 1.6 If several, possibly infinitely many, σ-algebras F_i are given on the same set Ω, then ⋂_i F_i is again a σ-algebra.
In particular the following definition makes sense: If E is a system of subsets of Ω, then the σ-algebra generated by E is defined by

    σ(E) := ⋂ { F : F σ-algebra over Ω, E ⊆ F }.

Theorem 1.6 shows that this is indeed a σ-algebra, and it is the smallest σ-algebra which contains E.
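On a finite set the generated σ-algebra can actually be computed by brute force, alternating closure under complements and (finite, which here equals countable) unions until nothing new appears. The following sketch is my own illustration of the definition; the function name and the example are not from the text.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra over the finite set `omega` containing `generators`.

    Iterates the two closure operations (complement, pairwise union)
    until the system of sets stops growing.
    """
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for a in current:                      # close under complements
            c = omega - a
            if c not in sigma:
                sigma.add(c)
                changed = True
        for a, b in combinations(current, 2):  # close under unions
            u = a | b
            if u not in sigma:
                sigma.add(u)
                changed = True
    return sigma

# sigma({{1}}) over {1,2,3,4} consists of the four sets
# emptyset, {1}, {2,3,4} and the full set.
sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}])
```

On an infinite Ω no such explicit construction is possible; this is exactly why the intersection in the definition of σ(E) is needed.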
Example 1.7 Let Ω be a metric space. The Borel σ-algebra B(Ω) = σ(O) is the intersection of all σ-algebras which contain the open sets. Clearly, two metrics with the same system of open sets also define the same Borel σ-algebra. The elements of this σ-algebra are called the Borel sets of Ω. In the special case Ω = R^d with the usual topology we sometimes write simply B^d instead of B(R^d).
Theorem 1.8 The Borel σ-algebra B^d can also be written as
- σ(K) where K = {compact subsets of R^d},
- σ(C) where C = {closed subsets of R^d},
- σ(R) where R = {(a_1, b_1] × ⋯ × (a_d, b_d] : a_i, b_i ∈ R}.
The idea of the proof is to use the following
Lemma 1.9 If E_1 ⊆ σ(E_2) then σ(E_1) ⊆ σ(E_2).
Definition 1.10 (Ω, F) is called a measurable space if F is a σ-algebra of subsets of Ω.

1.2 Probability

For the events A ∈ F we want to quantify the degree of possibility that A happens. As mentioned above, the question what probability really means has many different answers which we cannot discuss here in great detail. Many of these views agree that probability should satisfy the following
Basic Requirements
- Monotonicity: If A implies B then P(A) ≤ P(B), i.e. A ⊆ B ⇒ P(A) ≤ P(B).
- Normalization: P(∅) = 0, P(Ω) = 1.
- Additivity: P(A ∪ B) = P(A) + P(B) for disjoint A and B.
Here are three operational approaches to probability which make these rules appear reasonable:
1. Frequency: We make the (non-trivial) assumption that the underlying experiment can be repeated many times under the same conditions. Then

    P(A) ≈ (number of trials in which A happens) / (total number of trials).

In natural sciences one often observes empirically that this quotient stabilizes around a certain number, which we call the probability of A. The additivity, for instance, is then clear because if A and B are disjoint, i.e. if they cannot happen simultaneously in the same trial, then

    (number of trials in which A ∪ B happens) = (number of trials with A) + (number of trials with B).

While this approach is limited to the cases in which the above assumption is justified, it has the advantage that the number P(A) is objective in the sense that it does not depend on individual judgements of the observer.
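The stabilization of the relative frequency can be watched in a simulation. The sketch below is my own illustration (the event A is "success" in a Bernoulli trial with P(A) = 0.3; names and parameters are not from the text):

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Fraction of n_trials independent simulated Bernoulli(p)
    experiments in which the event A = "success" happens."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n_trials))
    return hits / n_trials

# The quotient stabilizes around p = P(A) as the number of trials grows.
freqs = [relative_frequency(0.3, n) for n in (100, 10_000, 1_000_000)]
```

With a fixed seed the run is reproducible; for growing trial counts the computed frequencies settle near 0.3, in accordance with the frequency interpretation.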
2. Betting strategies: Suppose some experiment is performed for the first time and only once. Then the frequency approach to the probability of an event A does not make sense. One way to arrive at a number P(A) which is based on the subjective judgement of the observer is to let the observer make a bet. Before the experiment, Paul bets (puts at stake) P(A) € on the event A, in the hope of getting 1 € if A happens. This means that he loses P(A) € if A^c happens, but he wins (1 − P(A)) € if A happens. The payoff function in € units is

    f_A(ω) = 1_A(ω) − P(A)

where 1_A denotes the indicator function of A:

    1_A(ω) := 1 if ω ∈ A, 0 otherwise.

In order to encourage Paul to give his true opinion about his chances he must also be ready to switch roles with his partner (or the bank). So he should also accept the payoff function −f_A. This is what we would call fair.
Now the basic assumption in the betting approach to subjective probability is the following additivity of the payoff function: If A and B are disjoint, i.e. mutually exclusive, then

    f_{A∪B} = f_A + f_B.

We observe that this implies the additivity P(A ∪ B) = P(A) + P(B) of the probability: the additivity of the payoff is equivalent to

    1_{A∪B} − P(A ∪ B) = 1_A − P(A) + 1_B − P(B).

Since 1_{A∪B} = 1_A + 1_B because of A ∩ B = ∅, this is precisely the additivity of the probability.

3. Insufficient reason: We consider the following situation: There are finitely many possible outcomes E_1, ..., E_n of the experiment which are mutually exclusive: exactly one of them occurs. Some of the outcomes E_i imply A; they are called favourable for A; the others imply A^c. Under the basic assumption

    There is no sufficient reason to consider one of the E_i to be more plausible than any of the others

we get a natural definition of

    P(A) = (number of favourable outcomes) / (total number of outcomes).

Of course this general principle of indifference or principle of insufficient reason is not always applicable, but it can be combined both with an objective and with a subjective view of probability.
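For a concrete instance of the principle of insufficient reason, take the statement from the list in Section 1.1 that two dice give the sum 9. Treating all 36 ordered outcomes as equally plausible, a short enumeration (illustrative code of my own, not part of the text) recovers P(A) = 4/36 = 1/9:

```python
from itertools import product
from fractions import Fraction

# All 36 equally plausible outcomes of throwing two distinguishable dice.
outcomes = list(product(range(1, 7), repeat=2))

# Favourable outcomes for A = "the sum is 9": (3,6), (4,5), (5,4), (6,3).
favourable = [o for o in outcomes if sum(o) == 9]

p = Fraction(len(favourable), len(outcomes))   # 4/36 = 1/9
```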


From now on we accept the three basic requirements. Moreover, for mathematical convenience we replace additivity by σ-additivity. This allows us to use the tools of measure theory. Moreover, in concrete situations σ-additivity usually can be verified.
Definition 1.11 If F is a σ-algebra then a measure is a mapping μ : F → [0, ∞] with μ(∅) = 0 which is σ-additive: If A_1, A_2, ... ∈ F are disjoint, then

    μ( ⋃_{n=1}^∞ A_n ) = Σ_{n=1}^∞ μ(A_n).

In this case (Ω, F, μ) is called a measure space.


We now give the basic definition. It is usually attributed to A.N. Kolmogorov's basic book [?], even though the idea of building probability theory on measure theory did not appear there for the first time. But Kolmogorov introduced a couple of fundamental techniques which heavily rely on this approach, in particular the concept of conditional expectation which we study later on.
Definition 1.12 A probability measure is a measure P with P(Ω) = 1. A measure space (Ω, F, P) with a probability measure P is called a probability space. A probability measure is also called a probability distribution.
We close this section with some elementary observations.
Remark 1.13 A probability measure P has the following properties:
- monotonicity: A ⊆ B ⇒ P(B) = P(A) + P(B \ A)
- finite subadditivity: P( ⋃_{i=1}^n A_i ) ≤ Σ_{i=1}^n P(A_i)
- σ-subadditivity: P( ⋃_{i=1}^∞ A_i ) ≤ Σ_{i=1}^∞ P(A_i)
- modularity: P(A ∪ B) + P(A ∩ B) = P(A) + P(B).
The symmetric difference A △ B := (A ∪ B) \ (A ∩ B) of two sets indicates the points where the two sets differ. The number P(A △ B) measures how much they differ in probability. This will be an important tool in the next chapter.
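The properties listed in Remark 1.13 are easy to check numerically on a finite Laplace space. A small sketch (the die example and the names are my own illustration) verifying modularity and computing a symmetric difference:

```python
from fractions import Fraction

# Laplace probability on the six faces of a die.
omega = frozenset(range(1, 7))

def P(A):
    return Fraction(len(frozenset(A) & omega), len(omega))

A, B = {1, 2, 3}, {3, 4}

# Modularity: P(A ∪ B) + P(A ∩ B) = P(A) + P(B).
modularity_lhs = P(A | B) + P(A & B)
modularity_rhs = P(A) + P(B)

# Symmetric difference A △ B = (A ∪ B) \ (A ∩ B): where the sets differ.
sym_diff = (A | B) - (A & B)
```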

1.3 Examples

We start with two very elementary examples.


1. The Bernoulli distribution Bin(1, p): Here Ω := {0, 1}, F := 2^Ω = {∅, {0}, {1}, Ω}, and P({1}) := p for a p ∈ (0, 1). Then P({0}) = 1 − p. This experiment is also called a Bernoulli trial. Independent repetitions of Bernoulli trials give rise to many interesting problems and results.
2. Laplace experiments: Here Ω is any finite set, again F := 2^Ω, and

    P(A) := |A| / |Ω|.
Many examples are of the following type: Ω is any countable set, F := 2^Ω is the power set of Ω, and P(A) := Σ_{k∈A} p_k where (p_k)_{k∈Ω} is a probability vector, i.e. a family of nonnegative real numbers with Σ_k p_k = 1. These probability spaces are called discrete. Of course every probability measure P on the power set of a countable set can be written in this form; simply take

    p_k = P({k}).

The next three examples define again discrete probability spaces.



3. Geometric distribution (This distribution occurs as waiting time distribution for the first success in a sequence of independent trials): Ω = N_0, p_k = Geom(p)(k) := (1 − p)^k p.
4. Binomial distribution (This gives the probability distribution for the number of successes in n independent Bernoulli trials each of which has success probability p): Ω = {0, ..., n},

    p_k = Bin(n, p)(k) := (n choose k) p^k (1 − p)^{n−k}.

5. Poisson distribution: Ω = N_0, p_k = Poiss(λ)(k) := e^{−λ} λ^k / k! for λ > 0. This distribution occurs as a limit of the binomial distributions when n is large and p is small but pn ≈ λ.
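The Poisson limit in Example 5 can be checked numerically. The sketch below is my own illustration (the parameter choices λ = 2 and n = 10000 are not from the text); it compares Bin(n, λ/n) with Poiss(λ) pointwise:

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    """Bin(n, p)(k) = (n choose k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """Poiss(lam)(k) = e^(-lam) lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

# With n large, p small and n*p = 2 fixed, Bin(n, p) approaches Poiss(2).
lam, n = 2.0, 10_000
gap = max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
          for k in range(20))
```

The maximal pointwise gap shrinks roughly like p = λ/n, consistent with the limit statement.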

In the following examples the set Ω is uncountable, and individual points get probability 0. In such situations the power set of Ω is too large to serve as the domain of definition of the probability measure. One needs the concept of σ-algebras in order to single out a sufficiently flexible subsystem of the power set on which the probability can be defined.
6. Uniform distribution: Ω := [0, 1], F := B([0, 1]) and P(A) := λ(A) with the Lebesgue measure λ. Similarly the uniform distribution can be defined on other intervals, and on more general sets even in higher dimensions: Let Ω be a Borel set in R^d with 0 < λ^d(Ω) < ∞. The uniform distribution on Ω is defined by

    P(A) = λ^d(A) / λ^d(Ω)    for all A ∈ B(Ω).

7. Let Ω = R^d and f : Ω → R be Borel measurable such that f ≥ 0 and ∫_{R^d} f(x) dx = 1. Then

    P(A) = ∫_A f(x) dx    for all A ∈ B(Ω)

defines a probability measure P. The function f is called a probability density. In some books, distributions P induced by a density are called continuous distributions. We prove the σ-additivity of P: Observe ∫_A f(x) dx = ∫ 1_A(x) f(x) dx. If we have disjoint events A_1, A_2, ..., we get

    1_{⋃_{i=1}^∞ A_i}(x) = Σ_{i=1}^∞ 1_{A_i}(x)    for all x ∈ Ω.

Integrating, the theorem of monotone convergence of integration theory yields

    ∫_{⋃_{i=1}^∞ A_i} f(x) dx = ∫ Σ_{i=1}^∞ 1_{A_i}(x) f(x) dx = Σ_{i=1}^∞ ∫_{A_i} f(x) dx

and thus

    P( ⋃_{i=1}^∞ A_i ) = Σ_{i=1}^∞ P(A_i).

Remark 1.14 If g : R^d → [0, ∞) is Borel measurable and 0 < ∫_{R^d} g(x) dx < ∞, then there exists a unique constant c ∈ (0, ∞) such that f := c·g is a probability density.
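Remark 1.14 amounts to computing the normalizing constant c = (∫ g)^{-1}. A numerical sketch for g(x) = 1/(1 + x²), whose integral over R equals π, so that c·g is the Cauchy density of Example 10 below; the midpoint-rule discretization and cutoff are my own illustration, not part of the text:

```python
from math import pi

def g(x):
    # An unnormalized, nonnegative integrable function.
    return 1.0 / (1.0 + x * x)

# Midpoint-rule approximation of the integral of g over [-L, L].
# The neglected tails contribute less than 2/L.
L, n = 1000.0, 200_000
h = 2 * L / n
integral = sum(g(-L + (i + 0.5) * h) for i in range(n)) * h

c = 1.0 / integral    # normalizing constant; exact value is 1/pi
```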
8. f(x) = 1_{(0,∞)}(x) λ e^{−λx} is the density of the Exp(λ) distribution.

9. f(x) = (1/√(2π)) e^{−x²/2} is the density of the N(0, 1) distribution.

10. f(x) = (1/π) · 1/(1 + x²) is the density of the Cauchy distribution with parameter 1.

Finally, if F is a σ-algebra and x is a point in the underlying set, then we consider the point mass or Dirac distribution in the point x:

    δ_x(A) := 1 if x ∈ A, 0 if x ∉ A;  i.e. δ_x(A) = 1_A(x).

Observe that for disjoint sets A_n

    δ_x( ⋃_{n=1}^∞ A_n ) = Σ_{n=1}^∞ δ_x(A_n).

Therefore δ_x is a probability measure on every σ-algebra over a set which contains the element x. A discrete distribution is just a (finite or infinite) convex combination of point masses. Consider for example the geometric distribution:

    Geom(p)(A) := Σ_{k∈A} p(1 − p)^k = Σ_{k=0}^∞ p(1 − p)^k δ_k(A).

This makes sense even if A contains not only integers. For example we can consider the geometric distribution not only as a distribution over the non-negative integers but also as a distribution on the Borel σ-algebra of the real line.
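Viewing Geom(p) as the convex combination Σ_k p(1 − p)^k δ_k makes it easy to evaluate on sets that are not subsets of N_0. A small sketch of my own (the truncation cutoff is an illustrative device; the sum is exact for sets meeting only finitely many of the first `cutoff` integers):

```python
from fractions import Fraction

def geom(p, A, cutoff=200):
    """Geom(p)(A) as a convex combination of point masses delta_k:
    only the points k in A ∩ {0, ..., cutoff-1} carry mass p(1-p)^k."""
    return sum(p * (1 - p) ** k for k in range(cutoff) if k in A)

p = Fraction(1, 2)

# Mass of the even integers: sum of (1/2)(1/4)^j, close to 2/3.
mass_evens = geom(p, set(range(0, 200, 2)))

# Non-integer points such as 2.5 simply carry no mass.
mass_with_junk = geom(p, {0, 1, 2.5})   # = p + p(1-p) = 3/4
```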

Chapter 2

Construction of probability
measures
The fact that we can define probabilities on a full σ-algebra has many advantages. However it is a nontrivial fact. Even in the case of geometric size on the real line this is so: It is clear that the system

    A = { ⋃_{i=1}^n J_i : each J_i is of the type (a, b] or (−∞, b] or (a, ∞) }

is a set-algebra, and letting |I| denote the length of an interval it is easy to define a finitely additive set function on A, e.g. by

    P( ⋃_{i=1}^n J_i ) = Σ_{i=1}^n |J_i ∩ [0, 1]|

if the J_i are disjoint. But A is not a σ-algebra and it is by no means clear that this set function has an extension to a measure on a σ-algebra which contains all intervals. Lebesgue's theory of integration implies this. But in probability theory we need similar extensions in much more general situations, and since this is so important we reproduce a proof in the first section. In the second section we prove a general uniqueness result for such extensions as a special case of the technique of monotone classes, which allows one to carry many statements from small systems of sets to much larger ones.

2.1 Carathéodory's extension theorem

Definition 2.1 Let A be an algebra over a set Ω. A function m : A → [0, ∞] is a content if m(∅) = 0 and m is additive, i.e. m(A ∪ B) = m(A) + m(B) for disjoint A, B ∈ A. A content m is σ-additive if

    m( ⋃_{n=1}^∞ A_n ) = Σ_{n=1}^∞ m(A_n)

whenever the A_n ∈ A, n ∈ N, are disjoint and ⋃_{n=1}^∞ A_n ∈ A.

Theorem 2.2 (Carathéodory) Let m : A → [0, 1] be a σ-additive content on the algebra A such that m(Ω) = 1. Then there is a unique extension of m to a probability measure μ on the σ-algebra F = σ(A). Furthermore for every F ∈ σ(A) there exists a sequence (A_n) in A such that μ(A_n △ F) → 0 for n → ∞.
In the proof we need a couple of lemmas.

Lemma 2.3 If A is an algebra which is closed under taking unions of countably many disjoint sets, then A is a σ-algebra.
Proof. We verify that A is closed under countable unions even without the disjointness assumption: Let (A_n) be a sequence of (not necessarily disjoint) elements of A. Consider

    B_1 = A_1,    B_n = A_n ∩ ( ⋃_{i=1}^{n−1} A_i )^c.

The B_n are disjoint, they are in A, and thus ⋃_{n=1}^∞ A_n = ⋃_{n=1}^∞ B_n ∈ A by assumption. Hence A is a σ-algebra.


In the remaining parts of the proof we will need the following definition (outer measure):

    μ*(E) := inf { Σ_{n=1}^∞ m(A_n) : E ⊆ ⋃_{n=1}^∞ A_n, A_n ∈ A },    E ⊆ Ω.

Lemma 2.4 μ* is an outer measure on 2^Ω, i.e.:
1. μ*(∅) = 0,
2. μ* is monotone: E ⊆ E' implies μ*(E) ≤ μ*(E'),
3. μ* is σ-subadditive: μ*( ⋃_{i=1}^∞ E_i ) ≤ Σ_{i=1}^∞ μ*(E_i).
Proof.
1. Set A_n = ∅ for all n. Then 0 ≤ μ*(∅) ≤ Σ_{n=1}^∞ m(A_n) = 0.
2. Clear.
3. Let ε > 0. Choose for each i a sequence (A_n^i) in A such that E_i ⊆ ⋃_{n=1}^∞ A_n^i and Σ_{n=1}^∞ m(A_n^i) < μ*(E_i) + ε 2^{−i}. Then

    ⋃_{i=1}^∞ E_i ⊆ ⋃_{i=1}^∞ ⋃_{n=1}^∞ A_n^i

and so

    μ*( ⋃_{i=1}^∞ E_i ) ≤ Σ_{i=1}^∞ Σ_{n=1}^∞ m(A_n^i) ≤ Σ_{i=1}^∞ ( μ*(E_i) + ε 2^{−i} ) = Σ_{i=1}^∞ μ*(E_i) + ε.

Since ε was arbitrary this yields the claim.


Lemma 2.5 For arbitrary subsets E, E' of Ω let d(E, E') := μ*(E △ E'). Then d satisfies the triangle inequality, and the operations ∪, ∩ and ^c are continuous for this pseudometric, i.e. d(A_n, A) → 0, d(B_n, B) → 0 implies d(A_n ∪ B_n, A ∪ B) → 0 and so on.
Proof. The triangle inequality follows from A △ C ⊆ (A △ B) ∪ (B △ C), and concerning the continuity (for example for ∪) we argue as follows: Assume μ*(A_n △ A) → 0 and μ*(B_n △ B) → 0. Then (A_n ∪ B_n) △ (A ∪ B) ⊆ (A_n △ A) ∪ (B_n △ B) and hence by monotonicity and subadditivity μ*((A_n ∪ B_n) △ (A ∪ B)) ≤ μ*(A_n △ A) + μ*(B_n △ B) → 0.
Lemma 2.6 μ*|_A = m, i.e. μ* is an extension of m.


Proof. Let A ∈ A and define A_1 := A and A_n := ∅ for n ≥ 2. Then A = ⋃_n A_n and m(A) = Σ_{n=1}^∞ m(A_n) ≥ μ*(A). For the converse inequality assume that A ⊆ ⋃_{n=1}^∞ A_n for some sets A_n ∈ A and define B_1 := A ∩ A_1 and

    B_n := A ∩ A_n ∩ ( ⋃_{i=1}^{n−1} A_i )^c.

Then B_n ⊆ A_n, A = ⋃_{n=1}^∞ B_n, and thus monotonicity and σ-additivity yield:

    Σ_{n=1}^∞ m(A_n) ≥ Σ_{n=1}^∞ m(B_n) = m( ⋃_{n=1}^∞ B_n ) = m(A).

Passing to the infimum in the definition of μ*(A) this yields m(A) ≤ μ*(A).

Proof (of Carathéodory's theorem (2.2)). We restate the definition of μ*:

    μ*(E) := inf { Σ_{n=1}^∞ m(A_n) : E ⊆ ⋃_{n=1}^∞ A_n, A_n ∈ A }.

Define d(A, B) = μ*(A △ B). Let Ā be the closure of A:

    Ā := { A ⊆ Ω : there is a sequence (A_n) in A with d(A_n, A) → 0 }.

We claim that Ā is a σ-algebra and μ*|_Ā is a probability measure. This implies that σ(A) ⊆ Ā and μ := μ*|_{σ(A)} is our extension.
1. μ* is continuous with respect to d: If d(A_n, A) → 0 then

    |μ*(A_n) − μ*(A)| = |d(∅, A_n) − d(∅, A)| ≤ d(A_n, A) → 0,

where we have used the inverse triangle inequality.
2. Ā is an algebra: If A, B ∈ Ā, then there exist sequences (A_n) and (B_n) in A with d(A_n, A) → 0 and d(B_n, B) → 0. By the continuity of ∪ with respect to d we have d(A_n ∪ B_n, A ∪ B) → 0. Thus A ∪ B ∈ Ā. The proof that Ā is closed under complements works similarly.
3. μ* is additive on Ā: Let A, B ∈ Ā be disjoint and choose sequences (A_n) and (B_n) as above. Then

    μ*(A) + μ*(B) = lim_n ( μ*(A_n) + μ*(B_n) )
                  = lim_n ( m(A_n) + m(B_n) )
                  = lim_n ( m(A_n ∪ B_n) + m(A_n ∩ B_n) )
                  = lim_n ( μ*(A_n ∪ B_n) + μ*(A_n ∩ B_n) ).

By continuity of the intersection d(A_n ∩ B_n, ∅) = d(A_n ∩ B_n, A ∩ B) → 0 and therefore μ*(A_n ∩ B_n) → μ*(∅) = 0. This yields

    μ*(A) + μ*(B) = lim_n μ*(A_n ∪ B_n) = μ*(A ∪ B).

4. Ā is a σ-algebra and μ* is σ-additive on Ā: Let A_i for i ∈ N be disjoint elements of Ā and A := ⋃_{i=1}^∞ A_i. Then

    Σ_{i=1}^∞ μ*(A_i) = lim_n Σ_{i=1}^n μ*(A_i) = lim_n μ*( ⋃_{i=1}^n A_i ) ≤ 1.

Since the sum is convergent the remainder Σ_{i=n+1}^∞ μ*(A_i) converges to 0 and thus

    μ*( A △ ⋃_{i=1}^n A_i ) = μ*( ⋃_{i=n+1}^∞ A_i ) ≤ Σ_{i=n+1}^∞ μ*(A_i) → 0,

where we have used the σ-subadditivity of μ*. This means that A lies in the closure of Ā, which equals Ā. Furthermore

    μ*(A) = lim_n μ*( ⋃_{i=1}^n A_i ) = Σ_{i=1}^∞ μ*(A_i).

The uniqueness will be shown in Corollary 2.14.

In order to apply the theorem it is useful to reformulate the condition of σ-additivity.

Proposition 2.7 Let A be an algebra and let m : A → R be additive (thus m(Ω) is finite). Then the following statements are equivalent:
1. m is σ-additive.
2. m is continuous from below: If A_n, A ∈ A, A_n ↑ A, then m(A_n) → m(A).
3. m is continuous from above at ∅: If B_n ∈ A, B_n ↓ ∅, then m(B_n) → 0.
Proof. 1.) ⇒ 2.): Define C_1 := A_1 and C_i := A_i \ A_{i−1} for i ≥ 2. Then

    m(A_n) = Σ_{i=1}^n m(C_i) → Σ_{i=1}^∞ m(C_i) = m( ⋃_{i=1}^∞ C_i ) = m(A).

2.) ⇒ 3.): Assume B_n ↓ ∅ and set A_n := B_n^c. Then A_n ↑ Ω and hence m(A_n) → m(Ω), which is finite by assumption. Finally m(B_n) = m(Ω) − m(A_n) → 0.
3.) ⇒ 1.): If C_1, C_2, ... ∈ A are disjoint with ⋃_{i=1}^∞ C_i ∈ A, then

    m( ⋃_{i=1}^∞ C_i ) − Σ_{i=1}^n m(C_i) = m( ⋃_{i=n+1}^∞ C_i ) → 0

since ⋃_{i=n+1}^∞ C_i ↓ ∅.

In particular the third condition is useful because often it can be checked easily with the
help of a compactness argument.
Example 2.8 Let A be the set-algebra introduced at the beginning of this chapter. Let F : R → [0, 1] be right-continuous and increasing. Define for disjoint intervals

    m( ⋃_{i=1}^m (a_i, b_i] ) = Σ_{i=1}^m ( F(b_i) − F(a_i) ).

Then clearly m is a content. It is even σ-additive. In fact let A_n = ⋃_{i=1}^{m_n} (a_i^n, b_i^n] decrease to the empty set. Let ε > 0. By right-continuity we can find α_i^n > a_i^n such that

    Σ_{n=1}^∞ Σ_{i=1}^{m_n} ( F(α_i^n) − F(a_i^n) ) < ε.

The compact sets A'_n = ⋃_{i=1}^{m_n} [α_i^n, b_i^n] have empty intersection and therefore there is a number N such that ⋂_{n=1}^N A'_n = ∅. We claim that m(A_r) < ε for all r ≥ N. Let x ∈ A_r; then x ∉ A'_n for at least one index n ≤ N. On the other hand x ∈ A_n for this index since the A_n decrease. Thus

    A_r ⊆ ⋃_{n=1}^N ( A_n \ A'_n ) ⊆ ⋃_{n=1}^N ⋃_{i=1}^{m_n} (a_i^n, α_i^n]

and by monotonicity and subadditivity of m

    m(A_r) ≤ Σ_{n=1}^N Σ_{i=1}^{m_n} ( F(α_i^n) − F(a_i^n) ) < ε.
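The interval content m of Example 2.8 is easy to experiment with for a concrete F. In the sketch below (my own illustration; this F is the distribution function of the uniform distribution on [0, 1]) we check finite additivity on a split interval:

```python
def m(F, intervals):
    """Content of a disjoint union of half-open intervals (a, b],
    defined from an increasing right-continuous F by sum of F(b) - F(a)."""
    return sum(F(b) - F(a) for a, b in intervals)

def F(x):
    # Distribution function of the uniform distribution on [0, 1].
    return min(max(x, 0.0), 1.0)

whole = m(F, [(0.0, 1.0)])                   # F(1) - F(0) = 1
split = m(F, [(0.0, 0.25), (0.25, 1.0)])     # finite additivity
```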

2.2 Monotone classes and uniqueness of probability measures

Definition 2.9 Let Ω be a set and let M be a system of subsets of Ω. M is called a monotone class if
1. Ω ∈ M,
2. if A, B ∈ M and A ⊆ B, then B \ A ∈ M,
3. if A_n ∈ M for all n ∈ N and A_n ↑ A, then A ∈ M.
Every σ-algebra is a monotone class, but in general monotone classes are not closed under finite unions. Sometimes a similar and essentially equivalent concept is used under the names Dynkin class or λ-system.
Example 2.10 Let P, Q be two probability measures on a σ-algebra F. Then M = {A ∈ F : P(A) = Q(A)} is a monotone class.
Proof.
1. P(Ω) = 1 = Q(Ω). Thus Ω ∈ M.
2. If A, B ∈ M and A ⊆ B then P(B \ A) = P(B) − P(A) = Q(B) − Q(A) = Q(B \ A). Hence B \ A ∈ M.
3. If A_n ∈ M such that A_n ↑ A then P(A) = lim P(A_n) = lim Q(A_n) = Q(A) and thus A ∈ M.

Lemma 2.11 If M is a monotone class and ∩-stable, then it is a σ-algebra.
Proof. M is closed under complements (A^c = Ω \ A) and Ω ∈ M. If A_1, A_2, ... ∈ M then

    A_1 ∪ ⋯ ∪ A_n = ( A_1^c ∩ ⋯ ∩ A_n^c )^c ∈ M    for all n.

Now ⋃_{i=1}^n A_i ↑ ⋃_{i=1}^∞ A_i implies ⋃_{i=1}^∞ A_i ∈ M.

Theorem 2.12 (Monotone class theorem) Let E be a system of subsets of Ω which is closed under ∩. Let M be a monotone class such that E ⊆ M. Then σ(E) ⊆ M.
Proof. Define the following system:

    M* := ⋂_{C monotone class, E ⊆ C} C = { A ∈ 2^Ω : A ∈ N whenever N is a monotone class which contains E }.

M* is itself a monotone class which contains E, and M* ⊆ M. Define

    D_1 := { A ∈ M* : A ∩ E ∈ M* for all E ∈ E }.

Then E ⊆ D_1: If A ∈ E and E ∈ E then A ∩ E ∈ E ⊆ M*. D_1 is a monotone class: Ω ∈ D_1. If A, B ∈ D_1 and A ⊆ B then (B \ A) ∩ E = (B ∩ E) \ (A ∩ E) ∈ M* since B ∩ E ∈ M* and A ∩ E ∈ M*. Hence B \ A ∈ D_1. Similarly for increasing sequences: If A_n ∈ D_1 and A_n ↑ A then A_n ∩ E ↑ A ∩ E ∈ M*.
Since D_1 is a monotone class which contains E, we get M* ⊆ D_1. This implies: A ∩ E ∈ M* for all E ∈ E and A ∈ M*. Define

    D_2 := { A ∈ M* : A ∩ E ∈ M* for all E ∈ M* }.

Then again E ⊆ D_2, and D_2 is a monotone class by an argument similar to the case of D_1. Hence M* ⊆ D_2. Therefore M* is closed under intersection, and thus M* is a σ-algebra which contains E: σ(E) ⊆ M* ⊆ M.

Theorem 2.13 (Uniqueness theorem for probability measures) If two probability measures agree on an ∩-stable system E then they agree on σ(E).
Proof. Consider the monotone class M = {A : P(A) = Q(A)} of Example 2.10 and apply the theorem.

Corollary 2.14 The measure μ in Carathéodory's Theorem is unique.
Proof. Assume μ, ν are two extensions of m from A to σ(A). Choose E = A, P = μ, Q = ν. Then P|_E = m = Q|_E. Since the algebra A is ∩-stable, Theorem 2.13 yields P|_{σ(A)} = Q|_{σ(A)}.


Chapter 3

Random variables
3.1 Definitions and examples

Definition 3.1 Let (Ω, F) and (S, B) be two measurable spaces. A map f : Ω → S is called F-B-measurable if f^{-1}(B) ∈ F for all B ∈ B. In the particular case where S = R^d and B = B(R^d) one talks about F-measurable functions.
Definition 3.2 Let (Ω, F, P) be a probability space. Then a random variable X with state space (S, B) is an F-B-measurable map X : Ω → S.
Example 3.3 In descriptive statistics examples of the following type are important. Let Ω be a finite set with the Laplace probability (cf. Section 1.3). Since the σ-algebra is 2^Ω, all functions on Ω are measurable irrespective of the σ-algebra in the image space. Therefore this σ-algebra often is not specified. The image space can be a set of numbers, a set of qualitative properties or any mixture. For example let Ω be the set of all students at the TU Kaiserslautern. Let

    X_1(ω) = major (subject) of ω,
    X_2(ω) = gender of ω,
    X(ω) = (X_1(ω), X_2(ω)),
    X_3(ω) = height of ω in cm.
Example 3.4 Here is a simple example of a geometric quantity considered as a random variable. Let Ω ⊆ R² be the unit disk with F = Borel(Ω). Then X(ω) = 1 − ‖ω‖ is the distance to the boundary of the random point ω. The measurability will be verified in Corollary 3.17.
Definition 3.5 Let X be a random variable with state space (S, B). Then P_X : B → [0, 1] defined by

P_X(B) := P(X⁻¹(B)) =: P({X ∈ B})

is called the distribution of X or the law of X. Sometimes it is denoted by L(X), and P({X ∈ B}) is denoted by P(X ∈ B) or P{X ∈ B}.
Observation 3.6 P_X is a probability measure on the σ-algebra B.

Proof. Clearly P_X(S) = P(X⁻¹(S)) = P(Ω) = 1. For the proof of σ-additivity let B₁, B₂, … be disjoint elements of B and let B := ⋃ᵢ₌₁^∞ Bᵢ. Then X⁻¹(B₁), X⁻¹(B₂), … are also disjoint and X⁻¹(B) = ⋃ᵢ₌₁^∞ X⁻¹(Bᵢ). Therefore,

P_X(B) = P(X⁻¹(B)) = Σᵢ₌₁^∞ P(X⁻¹(Bᵢ)) = Σᵢ₌₁^∞ P_X(Bᵢ). □



We adopt the following terminology: if X is a random variable such that P_X is some known distribution, e.g. P_X = Geom(p), then X is called a geometric random variable with parameter p; similarly if P_X = N(0, 1) then X is called a standard normal or standard Gaussian random variable.

Notation 3.7 If P_X = P̃ then we also write X ∼ P̃, e.g. X ∼ Geom(p).
Real valued random variables play a special role.
Definition 3.8 If X is a random variable with state space (R, B(R)) then X is called a real random variable. The function F_X : R → [0, 1] defined by F_X(z) = P_X((−∞, z]) = P(X ≤ z) is called the distribution function of X.

Remark 3.9 The law P_X of a real random variable is uniquely determined by F_X.

Proof. Let E := {(−∞, z] : z ∈ R}. Then E is closed under intersections. On the other hand σ(E) contains all intervals (a, b] = (−∞, b] \ (−∞, a] and hence it contains σ({(a, b] : a, b ∈ R}) = B(R) (Theorem 1.8). Therefore every probability measure Q with Q((−∞, z]) = F_X(z) = P_X((−∞, z]) must agree with P_X on the full σ-algebra B(R) by the uniqueness theorem (Theorem 2.13). □

Definition 3.10 If P_X has a density f (on R or Rᵈ) then f is also called a density of X and we write f = f_X.

Remark 3.11 If X has a density f_X on R, then F_X(z) = ∫_{−∞}^z f_X(x) dx for all z. If f_X is continuous then by the fundamental theorem of calculus F_X′(z) = f_X(z) for all z. If f_X is only Borel measurable, then by the (more subtle) differentiation theorem of Lebesgue this relation still holds Lebesgue-almost everywhere.
Example 3.12 If X ∼ N(0, 1) then

F_X(z) = Φ(z) := ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx.

The special function Φ is called the Gaussian error function.
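For numerical purposes Φ need not be integrated directly: it can be expressed through the error function found in most standard libraries, via Φ(z) = (1 + erf(z/√2))/2. A small sketch (the use of Python's `math.erf` is our choice of tool, not part of the text):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal distribution function Phi(z) = P(Z <= z),
    written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Symmetry of the density gives Phi(0) = 1/2 and Phi(-z) = 1 - Phi(z).
print(Phi(0.0), Phi(1.96))
```

The value Φ(1.96) ≈ 0.975 recovers the familiar two-sided 95% quantile of the standard normal distribution.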


Remark 3.13 If F is a distribution function, i.e. F = F_X for some random variable, then F is nondecreasing and continuous from the right. Moreover

lim_{z→−∞} F(z) = 0 and lim_{z→∞} F(z) = 1.

Proof. That F is nondecreasing follows from the monotonicity of probability measures. For the right-continuity let zₙ ↓ z and define Aₙ = {X ≤ zₙ}. Then A₁ ⊇ A₂ ⊇ … and {X ≤ z} = ⋂_{n∈N} Aₙ. Thus F_X(zₙ) = P(Aₙ) ↓ P(X ≤ z) = F_X(z). For the asymptotic behaviour at −∞ define Aₙ = {X ≤ −n}; then F(−n) = P(Aₙ) ↓ P(∅) = 0. Similarly F(n) = P{X ≤ n} ↑ P(Ω) = 1. This suffices since F_X is nondecreasing. □

If F is any function with the properties of the preceding remark, then we can define the generalized inverse G of F by

G(s) := inf{z : s ≤ F(z)}.

In fact F(G(s)) = s for all numbers s in the range of F. Recall also that a random variable U which is uniformly distributed on the unit interval satisfies P(U ∈ (a, b]) = b − a for all (a, b] ⊆ [0, 1].
Theorem 3.14 Let F be a function as in Remark 3.13 and let G be its generalized inverse. Let U ∼ Unif(0, 1), i.e. U is uniformly distributed on the unit interval. Then the random variable X̃ := G(U) has the distribution function F_X̃ = F.


Proof. For every z ∈ R we have by the right-continuity of F

P(X̃ ≤ z) = P(G(U) ≤ z) = P(inf{y : U ≤ F(y)} ≤ z)
= P(∃ y ≤ z : U ≤ F(y))
= P(U ≤ F(z))
= F(z),

as asserted. □

This theorem is useful not only theoretically, but also numerically for the simulation of a random variable with a given distribution P̃: we determine F from P̃, and G from F as above. Computers give good simulations of U, and then X̃ = G(U) is what we want. Of course this procedure is only efficient if the values of the function G are easy to compute.
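As a sketch of this recipe, consider the exponential distribution (our choice of example): for F(z) = 1 − e^{−λz}, z ≥ 0, the generalized inverse is G(s) = −ln(1 − s)/λ, which is cheap to evaluate.

```python
import random
from math import log

def sample_exponential(lam, rng):
    """Inverse transform sampling: for F(z) = 1 - exp(-lam*z) the
    generalized inverse is G(s) = -log(1 - s)/lam, so G(U) ~ Exp(lam)."""
    u = rng.random()               # a simulation of U ~ Unif(0, 1)
    return -log(1.0 - u) / lam

rng = random.Random(0)
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to E(X) = 1/lam = 0.5
```

The empirical mean of the sample is close to 1/λ, in line with Example 4.9 below.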

3.2 Verifying measurability

The verification of the measurability in the definition of a random variable looks difficult because it involves all elements of the σ-algebra in the state space. Here we collect some useful tools which simplify this task considerably.

Lemma 3.15 Let (Ω, F) and (S, B) be two measurable spaces. Let E be a system of subsets of S such that σ(E) = B. If X : Ω → S is a function such that {X ∈ E} ∈ F for all E ∈ E, then X is a random variable.
Proof. Consider the system G of all subsets G of S such that {X ∈ G} ∈ F. By assumption every E ∈ E belongs to G, i.e. E ⊆ G. In order to prove B ⊆ G, i.e. {X ∈ B} ∈ F for all B ∈ B, we verify that G is a σ-algebra:

1. ∅ ∈ G because X⁻¹(∅) = ∅ ∈ F.
2. If G ∈ G then X⁻¹(Gᶜ) = (X⁻¹(G))ᶜ ∈ F, hence Gᶜ ∈ G.
3. If Gₙ ∈ G for all n, then X⁻¹(⋃ₙ₌₁^∞ Gₙ) = ⋃ₙ₌₁^∞ X⁻¹(Gₙ) ∈ F, i.e. ⋃ₙ₌₁^∞ Gₙ ∈ G.

Therefore G is a σ-algebra which contains E and hence B = σ(E) ⊆ G by assumption. □

Remark 3.16 An elegant, but somewhat cryptic, version of the last lemma is the identity

f⁻¹(σ(E)) = σ(f⁻¹(E))

which holds for all maps f : Ω → S and systems E of subsets of the image space S.

Proof. "⊇" is proved like in the last lemma. "⊆" is easy since f⁻¹(σ(E)) is a σ-algebra which contains f⁻¹(E). □

Corollary 3.17 If S and T are metric spaces and f : S → T is continuous, then f is B(S)-B(T)-measurable, i.e. continuous functions are Borel-measurable.

Proof. Let O_T and O_S be the systems of all open sets in T and S, respectively. Since f is continuous, f⁻¹(O) is open in S for every O ∈ O_T, so f⁻¹(B(T)) = f⁻¹(σ(O_T)) = σ(f⁻¹(O_T)) ⊆ σ(O_S) = B(S). Hence f⁻¹(B) ∈ B(S) for all B ∈ B(T). □



Remark 3.18 The composition g ∘ f of two measurable maps is measurable.

Proof. Let (S, B₁), (T, B₂) and (W, B₃) be measurable spaces and f : S → T, g : T → W measurable. Then

(g ∘ f)⁻¹(B₃) = f⁻¹(g⁻¹(B₃)) ⊆ f⁻¹(B₂) ⊆ B₁. □

In particular, if X : Ω → Rᵈ is a random variable and f : Rᵈ → R^q is Borel, then Y := f(X) is measurable and hence a random variable.

Sometimes it is convenient to extend the concept of real valued random variables by allowing also the values +∞ and −∞.
Definition 3.19 We use the notations R̄ or [−∞, ∞] for R ∪ {−∞, ∞}. The set R̄ is equipped with the σ-algebra B(R̄) where

B(R̄) = {B ⊆ R̄ : B ∩ R ∈ B(R)}.

An F-B(R̄)-measurable map X : Ω → R̄ is called an extended real valued random variable. Note that this is consistent: a map X : Ω → R is a real valued random variable in the old sense if and only if it is an extended real valued random variable when X is considered as a map into R̄. We simply call X a random variable.
Proposition 3.20 A map X : Ω → R̄ is a random variable if and only if the sets {X ≤ a} are in F for all a ∈ R. The same holds for the sets {X < a}, {X ≥ a} or {X > a}.

Proof. "⇒" is trivial, and "⇐" follows from Lemma 3.15 if we consider

E = {[−∞, a] : a ∈ R},

since σ(E) = B(R̄). □

Example 3.21 If G : R → R is monotone, then G is Borel-measurable.

Proof. For every a ∈ R the set {G ≤ a} is a half line, hence in B(R). By Proposition 3.20, G is B(R)-B(R)-measurable and hence Borel. □

Definition 3.22 A map X : Ω → Rᵈ is called a random vector if it is F-B(Rᵈ)-measurable.
Proposition 3.23 Let X = (X₁, …, X_d) : Ω → Rᵈ. Then X is a random vector if and only if all Xᵢ are random variables.

Proof. "⇒": Suppose X is a random vector. Consider the i-th projection πᵢ : Rᵈ → R. πᵢ is continuous, hence Borel-measurable. Therefore Xᵢ = πᵢ ∘ X is a random variable.
"⇐": Observe B(Rᵈ) = σ(E) where E := {I₁ × ⋯ × I_d : the Iᵢ are intervals}. Assume that the Xᵢ are random variables. Then {X ∈ I₁ × ⋯ × I_d} = ⋂ᵢ₌₁^d {Xᵢ ∈ Iᵢ} ∈ F. So X⁻¹(E) ∈ F for all E ∈ E and by Lemma 3.15, X is a random vector. □

Corollary 3.24 If X, Y are real random variables then X + Y, X·Y, min{X, Y}, max{X, Y} are also random variables.¹

Proof. Consider the random vector (X, Y). Compose it with one of the operations in the assertion, e.g. addition + : (x, y) ↦ x + y. These operations are continuous on their domain of definition. Hence they are Borel and X + Y = + ∘ (X, Y) etc. are random variables. □

¹ they are defined point-wise on Ω


Proposition 3.25 If X₁, X₂, … are R̄-valued random variables then the point-wise defined maps supₙ Xₙ, infₙ Xₙ, lim supₙ Xₙ, lim infₙ Xₙ are random variables. Moreover the set

{ω : limₙ Xₙ(ω) exists} = {ω : lim supₙ Xₙ(ω) = lim infₙ Xₙ(ω)}

is in F.

Proof. supₙ Xₙ is a random variable, since {supₙ Xₙ ≤ a} = {ω : Xₙ(ω) ≤ a for all n ∈ N} = ⋂_{n∈N} {Xₙ ≤ a} ∈ F for all a ∈ R. The other three cases follow immediately from

infₙ Xₙ = −supₙ(−Xₙ),  lim supₙ Xₙ = infₙ sup_{m≥n} Xₘ,  lim infₙ Xₙ = supₙ inf_{m≥n} Xₘ.

For the last statement we need in addition the following fact: if X, Y are random variables then {X ≠ Y} and hence {X = Y} are in F, since

{X ≠ Y} = ⋃_{r∈Q} ({X < r} ∩ {Y ≥ r}) ∪ ({X ≥ r} ∩ {Y < r}) ∈ F. □

We conclude this chapter with the approximation of non-negative random variables by linear combinations of indicator functions of events. Note that 1_A is a random variable for all A ∈ F because {1_A ≤ a} is either Ω (if a ≥ 1) or Aᶜ (if 0 ≤ a < 1) or ∅ (if a < 0).

Theorem 3.26 Let X be a nonnegative real valued random variable. Then X = supₙ Xₙ where

Xₙ = Σ_{k=1}^{n·2ⁿ} k·2⁻ⁿ · 1_{k·2⁻ⁿ ≤ X < (k+1)·2⁻ⁿ},  n ∈ N.  (∗)

Proof. If X(ω) is given then for n ≥ X(ω) we have Xₙ(ω) ≤ X(ω) < Xₙ(ω) + 2⁻ⁿ because Xₙ(ω) is the left end point of the dyadic interval of length 2⁻ⁿ in which X(ω) lies. □

[Figure: the graphs of the approximating functions X₁, X₂, X₃, X₄ below the graph of X.]

Definition 3.27 X is called an elementary random variable if X = Σᵢ₌₁ⁿ aᵢ 1_{Aᵢ} for some real numbers aᵢ and events Aᵢ ∈ F.

Corollary 3.28 If X is a [0, ∞]-valued random variable, then there is a sequence (Xₙ) of elementary random variables with Xₙ ↑ X.

Proof. Replace X by min(X, n) in (∗). If X(ω) < ∞ then min(X(ω), n) = X(ω) for large n and by Theorem 3.26 sup_{n∈N} Xₙ(ω) = X(ω). Otherwise, if X(ω) = ∞, then Xₙ(ω) = n for all n ∈ N. □
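The construction (∗), combined with the truncation min(X, n), is easy to transcribe into code: the n-th approximation rounds min(X(ω), n) down to the nearest multiple of 2⁻ⁿ. A sketch (the function name is ours):

```python
from math import floor, pi

def dyadic_approx(x, n):
    """The n-th elementary approximation from Corollary 3.28:
    min(x, n) rounded down to the nearest multiple of 2**-n."""
    x = min(x, n)
    return floor(x * 2**n) / 2**n

# The approximations increase towards x, with error below 2**-n once n >= x.
print([dyadic_approx(pi, n) for n in range(1, 8)])
```

For a fixed nonnegative x the printed sequence is nondecreasing and converges to x, exactly as the corollary asserts point-wise.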


Chapter 4

Expectation

We assume that the reader knows the definition of the integral in measure theory. We review some basic properties which are important for the probabilistic interpretation.
Example 4.1 Let us suppose that in some game we get the amount a if A happens and b if Aᶜ happens. Then the gain is X(ω) = a·1_A(ω) + b·1_{Aᶜ}(ω). The expected gain is E(X) = a·P(A) + b·P(Aᶜ). If we pay m in advance for playing this game then a − m is the net gain and m − b is the net loss. Then m = E(X) solves the balance equation

P(A)(a − m) = P(Aᶜ)(m − b).
Remark 4.2 If (S, G, μ) is a measure space, then the integral ∫_S f dμ (or ∫_S f(x) dμ(x) or ∫_S f(x) μ(dx)) is characterized by three rules:

1. If f = Σᵢ₌₁ⁿ aᵢ 1_{Aᵢ} is an elementary function, then

∫ Σᵢ₌₁ⁿ aᵢ 1_{Aᵢ} dμ = Σᵢ₌₁ⁿ aᵢ μ(Aᵢ).

2. If f : S → [0, ∞] is G-measurable then ∫_S f dμ is defined and Beppo Levi's theorem of monotone convergence holds: if 0 ≤ fₙ ↑ f point-wise, then ∫ fₙ dμ ↑ ∫ f dμ.

3. If f takes both signs, then ∫ f dμ is defined only if |f| is integrable: ∫ |f| dμ < ∞. In this case the integral is defined as ∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ with f⁺ = max(0, f) and f⁻ = max(0, −f).
Definition 4.3 For an extended real valued random variable X the expectation E(X) is defined as ∫ X dP if this integral exists.

There are two basic cases in which the expectation exists: either X ≥ 0, and then E(X) ∈ [0, ∞]; or X also takes negative values, and then E(X) = E(X⁺) − E(X⁻) is only defined if both terms are finite. For elementary random variables clearly

E(Σᵢ₌₁ⁿ aᵢ 1_{Aᵢ}) = Σᵢ₌₁ⁿ aᵢ P(Aᵢ).

Remark 4.4 The expectation depends only on P_X. If for example X takes only the pairwise distinct values x₁, …, xₘ, then

X = Σ_{k=1}^m xₖ 1_{X = xₖ}

and therefore E(X) = Σ_{k=1}^m xₖ pₖ, where pₖ := P{X = xₖ} = P_X({xₖ}). The general case is treated in the following theorem.

Theorem 4.5 (Expectation rule) Let X : Ω → S be a random variable and g : S → R be measurable. Then the random variable g(X) has an expectation if and only if ∫_S g(x) dP_X(x) exists. In this case

E(g(X)) = ∫_S g(x) dP_X(x).

Proof. We prove the theorem by the standard procedure: in the first step we restrict ourselves to elementary functions, in the second step we allow non-negative measurable functions, and in the third step we generalize the result to measurable functions.

1. Let g be elementary: g = Σⱼ₌₁ⁿ bⱼ 1_{Bⱼ} where the Bⱼ are measurable in the σ-algebra of S. Then

g(X)(ω) = Σⱼ bⱼ 1_{Bⱼ}(X(ω)) = Σⱼ bⱼ 1_{X ∈ Bⱼ}(ω)

and therefore

E(g(X)) = ∫ g(X) dP = Σⱼ bⱼ P{X ∈ Bⱼ} = Σⱼ bⱼ P_X(Bⱼ) = ∫_S g dP_X.

2. Let g ≥ 0 be measurable. Then there exists a sequence (gₙ) of elementary measurable functions such that gₙ(x) ↑ g(x) for all x ∈ S. This implies gₙ(X)(ω) ↑ g(X)(ω) for all ω. By applying monotone convergence twice we get

E(g(X)) = limₙ E(gₙ(X)) = limₙ ∫ gₙ dP_X = ∫ g dP_X.

3. Now let g be an arbitrary measurable function. By the second step the assertion holds for |g|, g⁺ and g⁻. Thus E(|g(X)|) = ∫ |g(X)| dP < ∞ if and only if ∫_S |g| dP_X < ∞. In this case

E(g(X)) = E(g(X)⁺) − E(g(X)⁻) = E(g⁺(X)) − E(g⁻(X)) = ∫_S g dP_X. □

Example 4.6 Let X ∼ B(n, p). Then

E(X²) = Σ_{k=0}^n k² P_X({k}) = Σ_{k=0}^n k² (n choose k) pᵏ (1 − p)ⁿ⁻ᵏ = ∫_R x² dP_X.

Later we shall see several ways how to evaluate this expression explicitly.
Remark 4.7 Let δₓ be the Dirac measure at x: δₓ(A) = 1 if x ∈ A and δₓ(A) = 0 if x ∉ A. Note that for μ = Σ_{k=1}^n αₖ δ_{xₖ} the integral is the sum

∫ f dμ = Σ_{k=1}^n αₖ f(xₖ).

In particular ∫ f dδₓ = f(x).
The following gives the two most popular forms of the expectation rule.

Proposition 4.8
1. Let X be a random variable with a discrete distribution, i.e. P(X = xₖ) = pₖ for x₁, x₂, … and Σₖ pₖ = 1. Then E(g(X)) = Σ_{k=1}^∞ g(xₖ) pₖ.
2. If X is Rᵈ-valued and has a density f_X, then E(g(X)) = ∫_{Rᵈ} g(x) f_X(x) dx.

Example 4.9

1. Let X ∼ Exp(λ). Then f_X(x) = 1_{(0,∞)}(x) λe^{−λx} and

E(X) = ∫₀^∞ x λe^{−λx} dx = 1/λ.

2. Let X ∼ N(0, 1). Then

E(X) = ∫_{−∞}^∞ x (1/√(2π)) e^{−x²/2} dx = 0.

3. Let X ∼ Cauchy. In this case E(X) does not exist, because

E(|X|) = ∫_{−∞}^∞ |x| / (π(1 + x²)) dx = ∞.

Let us come back to discrete distributions on N₀. For these the expectation can sometimes be computed conveniently with the help of probability generating functions. An extension of this idea will come up in Section 4.2 on moment generating functions and in Chapter 10 on characteristic functions.
Definition 4.10 Let (pₖ)_{k∈N₀} be a probability vector and let X be a random variable with P{X = k} = pₖ. Then G_X(z) := Σ_{k=0}^∞ pₖ zᵏ is called the generating function of X.


Proposition 4.11 Define G_X′(1) := lim_{z↑1} G_X′(z) and G_X″(1) := lim_{z↑1} G_X″(z). Then

E(X) = G_X′(1) and E(X²) = G_X″(1) + G_X′(1).

Proof. Observe that for z < 1, with the counting measure μ = Σ_{k∈N₀} δₖ,

G_X′(z) = Σ_{k=0}^∞ k pₖ z^{k−1} = ∫_{[0,∞)} k pₖ z^{k−1} dμ(k).

By monotone convergence it follows that

limₙ G_X′(zₙ) = limₙ ∫_{[0,∞)} k pₖ zₙ^{k−1} dμ(k) = ∫_{[0,∞)} k pₖ dμ(k) = Σ_{k=0}^∞ k pₖ = E(X)

for arbitrary zₙ ↑ 1. This implies the first assertion. The second assertion follows similarly:

limₙ G_X″(zₙ) = limₙ Σ_{k=0}^∞ k(k−1) pₖ zₙ^{k−2} = Σ_{k=0}^∞ k(k−1) pₖ = E(X(X−1)) = E(X² − X)

for arbitrary zₙ ↑ 1, and additivity yields E(X²) = G_X″(1) + G_X′(1). □

Example 4.12 Let P_X = B(n, p). Then

G_X(z) = Σ_{k=0}^n (n choose k) (1 − p)ⁿ⁻ᵏ pᵏ zᵏ = (pz + 1 − p)ⁿ.

Now G_X′(z) = n(pz + 1 − p)ⁿ⁻¹ p and thus G_X′(1) = np = E(X). Furthermore

G_X″(z) = n(n − 1)(pz + 1 − p)ⁿ⁻² p²

and hence E(X²) = G_X″(1) + G_X′(1) = p²n(n − 1) + pn.
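These closed forms can be checked directly against the defining series for a concrete n and p (the numbers below are our choice of example; `math.comb` supplies the binomial coefficients):

```python
from math import comb

n, p = 10, 0.3
pk = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Moments computed straight from the distribution ...
EX  = sum(k * pk[k] for k in range(n + 1))
EX2 = sum(k**2 * pk[k] for k in range(n + 1))

# ... agree with the generating-function formulas of Example 4.12.
print(EX, n * p)                          # both equal n*p = 3
print(EX2, p**2 * n * (n - 1) + p * n)    # both equal 11.1 up to rounding
```

This also resolves the expression E(X²) left open in Example 4.6 for the chosen parameters.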
Definition 4.13 Let (Ω, F, P) be a probability space. An event F ∈ F is called a nullset if P(F) = 0. An event A happens almost surely (a.s.) if P(A) = 1, i.e. if Aᶜ is a nullset.


Observation 4.14 A countable union of nullsets is a nullset. If Aₙ happens almost surely for each n ∈ N, then ⋂ₙ₌₁^∞ Aₙ also happens almost surely:

P((⋂ₙ Aₙ)ᶜ) = P(⋃ₙ Aₙᶜ) ≤ Σₙ P(Aₙᶜ) = 0.

Proposition 4.15
1. If X and Y are random variables such that X = Y almost surely and X has an expectation, then Y has the same expectation.
2. If X ≤ Y almost surely, then E(X) ≤ E(Y), provided both expectations exist.
3. E(αX + βY) = αE(X) + βE(Y) if either X, Y ≥ 0 and α, β ≥ 0, or E(X), E(Y) are finite.
Theorem 4.16 (Markov's inequality) Let X be a nonnegative random variable and let a > 0. Then P(X ≥ a) ≤ (1/a) E(X).

Proof. Define Y = a·1_{X≥a}. Then X ≥ Y and thus E(X) ≥ E(Y) = a·P(X ≥ a). □

Corollary 4.17 If E(|X|) = 0, then X = 0 almost surely.

Proof. Let Aₙ = {|X| ≥ 1/n}. Then P(Aₙ) ≤ n·E(|X|) = 0 and hence ⋃ₙ₌₁^∞ Aₙ = {|X| > 0} is a nullset. □


Corollary 4.18 If E(|X|) < ∞ then |X| < ∞ almost surely.

Proof. P(|X| = ∞) = limₙ P(|X| ≥ n) = 0, i.e. P(|X| < ∞) = 1, since P(|X| ≥ n) ≤ (1/n) E(|X|) → 0. □

Corollary 4.19 (First Lemma of Borel-Cantelli) If (Aₙ) is a sequence of events such that Σₙ₌₁^∞ P(Aₙ) < ∞, then almost surely only finitely many of the Aₙ happen:

P(infinitely many Aₙ occur) = P(⋂ₙ₌₁^∞ ⋃_{m=n}^∞ Aₘ) = 0.

Proof. Let X = Σₙ₌₁^∞ 1_{Aₙ}. Then

E(X) = Σₙ₌₁^∞ E(1_{Aₙ}) = Σₙ₌₁^∞ P(Aₙ) < ∞

and X < ∞ almost surely by the previous corollary. □
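A simulation illustrates the lemma; the choice P(Aₙ) = 1/n² is ours, so that Σₙ P(Aₙ) = π²/6 < ∞ and almost surely only finitely many of the independent events Aₙ occur:

```python
import random

rng = random.Random(1)
N = 10_000
# Independent events with P(A_n) = 1/n^2; the series of probabilities converges.
occurred = [n for n in range(1, N + 1) if rng.random() < 1.0 / n**2]

# Despite simulating 10,000 events, only very few of them happen;
# the expected number of occurrences is sum 1/n^2 = pi^2/6 < 2.
print(len(occurred), occurred[:5])
```

Rerunning with different seeds changes which indices occur, but the total count stays small, as the lemma predicts.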

4.1 Convergence theorems

From the definition and Beppo Levi's Theorem one gets

Theorem 4.20 (Monotone convergence) Let (Xₙ) be a sequence of nonnegative random variables. Assume Xₙ ↑ X. Then E(Xₙ) ↑ E(X).

This implies the following useful rules:
Theorem 4.21 (Fatou's lemma) Let (Xₙ) be a sequence of nonnegative random variables. Then

E(lim infₙ Xₙ) ≤ lim infₙ E(Xₙ).

The same holds true if Xₙ ≥ Y for all n and a random variable Y with finite expectation.


Proof. If Xₙ ≥ 0 use lim infₙ Xₙ = supₙ (inf_{m≥n} Xₘ) and monotone convergence. In the second case switch to the non-negative sequence (Xₙ − Y)_{n∈N}. □

Combining this with the corresponding mirror estimate for the lim sup we get

Theorem 4.22 (Lebesgue's theorem of dominated convergence) Let (Xₙ) be a sequence of random variables such that there exists a random variable Y with E(Y) < ∞ and |Xₙ| ≤ Y for all n. If lim Xₙ = X almost surely, then E(X) = limₙ E(Xₙ) and limₙ E(|X − Xₙ|) = 0.


Theorem 4.23 (Parameter dependence) Let (X_t)_{a<t<b} be a 1-parameter family of random variables (on one common probability space (Ω, F, P)).

1. Suppose t ↦ X_t(ω) is continuous for all ω and there exists a random variable Y such that |X_t| ≤ Y for all t and E(Y) < ∞. Then t ↦ E(X_t) is continuous.

2. Suppose t ↦ X_t(ω) is continuously differentiable for all ω and there exists a random variable Y such that |(d/dt) X_t| ≤ Y for all t and E(Y) < ∞. Then t ↦ E(X_t) is continuously differentiable and

(d/dt) E(X_t) = E((d/dt) X_t).
Proof. Let (tₙ) and t₀ be in (a, b) such that tₙ → t₀.

1. This part follows from the theorem of dominated convergence: if tₙ → t₀ then limₙ E(X_{tₙ}) = E(X_{t₀}), hence t ↦ E(X_t) is continuous.

2. Define Δₙ := (X_{tₙ} − X_{t₀}) / (tₙ − t₀). Then

(E(X_{tₙ}) − E(X_{t₀})) / (tₙ − t₀) = E((X_{tₙ} − X_{t₀}) / (tₙ − t₀)) = E(Δₙ).

The mean value theorem gives

Δₙ(ω) = (d/dt) X_t(ω) |_{t = τₙ(ω)}

for some τₙ(ω) between t₀ and tₙ. Thus by assumption |Δₙ(ω)| ≤ Y(ω) and the theorem of dominated convergence implies

limₙ E(Δₙ) = E(limₙ (d/dt) X_t |_{t=τₙ}) = E((d/dt) X_t |_{t=t₀}).

Therefore t ↦ E(X_t) is differentiable with derivative E((d/dt) X_t), and t ↦ E((d/dt) X_t) is continuous by the first part applied to the random variables (d/dt) X_t. □


4.2 Application: the moment generating function

Definition 4.24 Let X be a real random variable. Assume there exists ε > 0 such that E(e^{λX}) < ∞ for all λ ∈ (−ε, ε). Then λ ↦ M_X(λ) := E(e^{λX}) is called the moment generating function of X.


Theorem 4.25 If the moment generating function M_X of a random variable X exists on an open interval (−ε, ε), then X has moments of all orders and for all λ ∈ (−ε, ε) one has

M_X(λ) = E(1 + λX + (λ²/2!) X² + ⋯) = 1 + m₁λ + m₂ λ²/2 + ⋯

where mᵣ = E(Xʳ) is the r-th moment of X. In particular

E(Xʳ) = (dʳ/dλʳ) M_X(λ) |_{λ=0}.

Proof. Let r ∈ N. We start with the second assertion: obviously (d/dλ) e^{λX} = X e^{λX} and thus the map λ ↦ e^{λX} is continuously differentiable. Choose ε′ < ε″ < ε. Then there is a constant cᵣ such that |x|ʳ e^{λx} ≤ cᵣ e^{ε″|x|} for all x ∈ R and |λ| ≤ ε′. By setting Y := cᵣ e^{ε″|X|} we get |Xʳ e^{λX}| ≤ Y and E(Y) < ∞. Theorem 4.23 now gives

(dʳ/dλʳ) M_X(λ) |_{λ=0} = E((dʳ/dλʳ) e^{λX}) |_{λ=0} = E(Xʳ).

For the first assertion one needs to extend this argument to the case of complex λ. Since |e^{(λ+iμ)X}| = e^{λX} one sees that the moment generating function is even holomorphic inside the circle {z ∈ C : |z| < ε}. Then by the second assertion the formula for M_X is just the Taylor expansion of this function. □

Example 4.26 Let Z ∼ N(0, 1) and X := μ + σZ. Then F_X(x) = P(X ≤ x) = P(Z ≤ (x−μ)/σ) = Φ((x−μ)/σ) and, writing φ for Φ′, we get the density

f_X(x) = F_X′(x) = (1/σ) φ((x−μ)/σ) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}.

Now we can compute the moment generating functions, at first for Z:

M_Z(λ) = E(e^{λZ}) = ∫_R e^{λx} (1/√(2π)) e^{−x²/2} dx = e^{λ²/2} ∫_R (1/√(2π)) e^{−(x−λ)²/2} dx = e^{λ²/2},

because the last integrand is a probability density. Finally

M_X(λ) = E(e^{λ(μ+σZ)}) = e^{λμ} E(e^{λσZ}) = e^{λμ} e^{(λσ)²/2} = e^{λμ + λ²σ²/2}.
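The Gaussian computation M_Z(λ) = e^{λ²/2} can also be checked numerically with a midpoint Riemann sum (the integration window [−10, 10] and the step count are our choices; the neglected tails are astronomically small):

```python
from math import exp, pi, sqrt

def mgf_standard_normal(lam, a=-10.0, b=10.0, steps=200_000):
    """Midpoint Riemann sum approximating E(exp(lam * Z)) for Z ~ N(0, 1)."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += exp(lam * x) * exp(-x * x / 2.0) / sqrt(2.0 * pi)
    return total * h

lam = 1.0
print(mgf_standard_normal(lam), exp(lam**2 / 2.0))  # both approx 1.6487
```

At λ = 0 the sum reduces to the total mass of the density and returns 1, another quick consistency check.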

Chapter 5

Conditional probability and independence

5.1 Basic concepts

Let (Ω, F, P) be a probability space.


Definition 5.1 Let A and B be two events such that P(B) > 0. Then

P(A|B) := P(A ∩ B) / P(B)

is called the conditional probability of A given B.


Definition 5.2 A and B are called independent if P(A ∩ B) = P(A)·P(B).

Remark 5.3
1. If P(B) > 0 then A, B independent is equivalent to P(A|B) = P(A).
2. If P(B) > 0 then P^B : F → [0, 1] defined as P^B(A) := P(A|B) is a probability measure.
3. If P(B) ∈ {0, 1} then B is independent of every A ∈ F.
4. P(A ∩ B ∩ C) = P(A)·P(B|A)·P(C|A ∩ B) if P(A ∩ B) > 0, and similarly for more than three events.
Proposition 5.4 If B₁, B₂, … are disjoint events and Ω = ⋃ᵢ₌₁^∞ Bᵢ, then the following hold:

1. (formula of complete/total probability)¹:

P(A) = Σᵢ₌₁^∞ P(Bᵢ) P(A|Bᵢ)

2. (Bayes' formula): For each j with P(Bⱼ) > 0

P(Bⱼ|A) = P(A|Bⱼ) P(Bⱼ) / Σᵢ₌₁^∞ P(A|Bᵢ) P(Bᵢ).

¹ in this formula P(A|Bᵢ) is only defined if P(Bᵢ) > 0, but otherwise the product is 0 anyway.
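Both formulas are one-liners once the priors P(Bⱼ) and the likelihoods P(A|Bⱼ) are given as lists. A sketch with a hypothetical diagnostic-test example (the numbers are ours, chosen for illustration: 1% prevalence, 99% sensitivity, 5% false-positive rate):

```python
def bayes_posterior(priors, likelihoods):
    """Posteriors P(B_j | A) from priors P(B_j) and likelihoods P(A | B_j);
    the denominator is P(A) by the formula of total probability."""
    total = sum(p * l for p, l in zip(priors, likelihoods))  # P(A)
    return [p * l / total for p, l in zip(priors, likelihoods)]

# B_1 = "ill", B_2 = "healthy"; A = "the test is positive".
priors      = [0.01, 0.99]
likelihoods = [0.99, 0.05]
posterior = bayes_posterior(priors, likelihoods)
print(posterior[0])  # P(ill | positive) = 1/6, despite the accurate test
```

The small posterior shows how a rare condition keeps P(Bⱼ|A) low even when P(A|Bⱼ) is close to 1.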


Definition 5.5
1. Let X : (Ω, F) → (S, B) and Y : (Ω, F) → (T, C) be two random variables. Then X and Y are called independent if P(X ∈ B, Y ∈ C) = P(X ∈ B)·P(Y ∈ C) for all B ∈ B and C ∈ C.
2. Let G and H be two subsystems of F. Then G and H are called independent if P(G ∩ H) = P(G)·P(H) for all G ∈ G and H ∈ H.

Definition 5.6 If X : Ω → (S, B) is a random variable, then the system of all sets {X ∈ B} for B ∈ B is a σ-algebra, the σ-algebra generated by X. It is denoted by σ(X).

Remark 5.7 Obviously X and Y are independent if and only if σ(X) and σ(Y) are independent.
Proposition 5.8 If X and Y are independent and φ : (S, B) → (S′, B′) and ψ : (T, C) → (T′, C′) are measurable, then φ(X) and ψ(Y) are also independent.

Proof. Let B′ ∈ B′ and C′ ∈ C′. Then {φ(X) ∈ B′} = {X ∈ φ⁻¹(B′)} and similarly {ψ(Y) ∈ C′} = {Y ∈ ψ⁻¹(C′)}. Thus

P(φ(X) ∈ B′, ψ(Y) ∈ C′) = P(X ∈ φ⁻¹(B′), Y ∈ ψ⁻¹(C′))
= P(X ∈ φ⁻¹(B′))·P(Y ∈ ψ⁻¹(C′))  (by independence)
= P(φ(X) ∈ B′)·P(ψ(Y) ∈ C′). □

Theorem 5.9 If E₁ and E₂ are ∩-stable systems of sets which are independent, then σ(E₁) and σ(E₂) are also independent.

Proof. Let G₁ := σ(E₁) and G₂ := σ(E₂).

1. Fix E₁ ∈ E₁. Claim: every G₂ ∈ G₂ is independent of E₁. If P(E₁) = 0 the claim is trivial. If P(E₁) > 0 then P^{E₁}(E₂) = P(E₂) for all E₂ ∈ E₂ by our assumption. Therefore the two probability measures P^{E₁} and P agree on E₂. The uniqueness theorem for probability measures (Theorem 2.13) implies that P^{E₁} = P on σ(E₂) = G₂. Therefore P(G₂ ∩ E₁) = P^{E₁}(G₂)·P(E₁) = P(G₂)·P(E₁) for all G₂ ∈ G₂.

2. Now fix G₂ ∈ G₂ and prove in the same way, using P^{G₂} and P, that every G₁ ∈ G₁ is independent of G₂. □
Corollary 5.10 Let A and B be two independent events. Then the indicators 1_A and 1_B are independent.

Proof. Define E₁ := {A}; then σ(E₁) = {∅, A, Aᶜ, Ω} = σ(1_A) = {{1_A ∈ M} : M ∈ B(R)}. Similarly define E₂ := {B}, so σ(E₂) = σ(1_B). By assumption E₁ and E₂ are independent, and with the theorem it follows that σ(E₁) and σ(E₂) are independent. Thus σ(1_A) and σ(1_B), and with Remark 5.7 also 1_A and 1_B, are independent. □

5.2 Joint distributions and Fubini's Theorem

In this section let (S, B) and (T, C) be two measurable spaces.

Definition 5.11 The product σ-algebra B ⊗ C is defined as σ({B × C : B ∈ B, C ∈ C}).


Remark 5.12 If X and Y are random variables with values in (S, B) and (T, C), then (X, Y) is a random variable with values in (S × T, B ⊗ C).

Proof. By assumption {(X, Y) ∈ B × C} = {X ∈ B} ∩ {Y ∈ C} ∈ F, hence (X, Y)⁻¹(E) ∈ F for all E ∈ E := {B × C : B ∈ B, C ∈ C}. This yields (X, Y)⁻¹(σ(E)) = σ((X, Y)⁻¹(E)) ⊆ F. Therefore (X, Y) is a random variable. □

Therefore the following definition is meaningful.

Definition 5.13 The probability measure P_{(X,Y)} : B ⊗ C ∋ E ↦ P{(X, Y) ∈ E} is called the joint distribution of (X, Y).
Proposition 5.14 Let μ and ν be two probability measures on B and C respectively. Then there is a unique probability measure μ ⊗ ν on B ⊗ C such that

μ ⊗ ν(B × C) = μ(B)·ν(C).  (∗∗)

μ ⊗ ν is called the product measure.

Uniqueness of μ ⊗ ν follows from our uniqueness theorem for probability measures (Theorem 2.13), since E := {B × C : B ∈ B, C ∈ C} is stable under intersection, (∗∗) determines the probability measure on E uniquely, and finally σ(E) = B ⊗ C. The existence of the product measure follows from the following lemma:
Lemma 5.15 For each D ∈ B ⊗ C and s ∈ S the sets D_s := {t ∈ T : (s, t) ∈ D} are in C and the map s ↦ ν(D_s) is measurable. The map

D ↦ ∫_S ν(D_s) dμ(s)

defines a probability measure on B ⊗ C.

Proof. If D = B × C then the claim holds, since then D_s = C if s ∈ B and D_s = ∅ if s ∉ B, and accordingly ν(D_s) = ν(C) if s ∈ B and ν(D_s) = 0 if s ∉ B.

The class M of all D ∈ σ(E) for which the claim holds is a monotone class:

1. S × T ∈ M since D_s = T ∈ C for each s.
2. If D ⊆ E and D, E ∈ M, then (E \ D)_s = E_s \ D_s ∈ C for all s and s ↦ ν((E \ D)_s) = ν(E_s) − ν(D_s) is a measurable function.
3. If Dₙ ↑ D and Dₙ ∈ M for all n, then D ∈ M because D_s = ⋃_{n∈N} (Dₙ)_s and ν(D_s) = limₙ ν((Dₙ)_s) is measurable.

Since E ⊆ M, the monotone class theorem (2.12) yields B ⊗ C = σ(E) ⊆ M and thus σ(E) = M.

The map defined above is a measure: if D₁, D₂, … are disjoint elements of B ⊗ C then

∫_S ν((⋃ᵢ₌₁^∞ Dᵢ)_s) dμ = ∫_S ν(⋃ᵢ₌₁^∞ (Dᵢ)_s) dμ = ∫_S Σᵢ₌₁^∞ ν((Dᵢ)_s) dμ = Σᵢ₌₁^∞ ∫_S ν((Dᵢ)_s) dμ,

where the interchange of sum and integral in the last step is justified by monotone convergence. □


Theorem 5.16 Let X and Y be two random variables with values in (S, B) and (T, C) respectively. Then X and Y are independent if and only if P_{(X,Y)} = P_X ⊗ P_Y.

Proof. That X and Y are independent means that for all B ∈ B and C ∈ C we have P(X ∈ B, Y ∈ C) = P(X ∈ B)·P(Y ∈ C), or equivalently P_{(X,Y)}(B × C) = P_X ⊗ P_Y(B × C). The uniqueness part of the previous proposition implies the assertion. □

Definition 5.17 Probability spaces are also called experiments. If (Ω₁, F₁, P₁), (Ω₂, F₂, P₂) are two experiments, then (Ω₁ × Ω₂, F₁ ⊗ F₂, P₁ ⊗ P₂) is called the independent coupling of the two experiments.

Remark 5.18 The word coupling is motivated by the fact that everything which can be described by one of the two experiments (Ωᵢ, Fᵢ, Pᵢ) can also be described by the coupling: a random variable, say, X : Ω₁ → R is substituted by X̃(ω₁, ω₂) = X(ω₁). Then X and X̃ have the same law:

P₁ ⊗ P₂(X̃ ≤ a) = P₁ ⊗ P₂({ω₁ : X(ω₁) ≤ a} × Ω₂) = P₁(X ≤ a)·P₂(Ω₂) = P₁(X ≤ a).
We shall need the following theorem mainly for probability measures. However it is sometimes useful to have it also for infinite measures like Lebesgue measure; therefore we state it in the general form. We recall that a measure space (S, B, μ) is called σ-finite if the set S has a partition into countably many sets Bₙ ∈ B with μ(Bₙ) < ∞. The definition of the product measure easily extends to the case in which both μ and ν are σ-finite.
Theorem 5.19 (Fubini) Let (S, B, μ) and (T, C, ν) be two σ-finite measure spaces. Let f : S × T → [0, ∞] be B ⊗ C-measurable. Then

∫_{S×T} f d(μ ⊗ ν) = ∫_S (∫_T f(s, t) dν(t)) dμ(s) = ∫_T (∫_S f(s, t) dμ(s)) dν(t).

In particular the inner integrals make sense since s ↦ ∫_T f(s, t) dν(t) is measurable.

Proof. (Sketch) We assume that μ and ν are probability measures. The extension to finite and then to σ-finite measures is straightforward. Let D ∈ B ⊗ C and Dᵗ := {s : (s, t) ∈ D}. Then

∫ 1_D d(μ ⊗ ν) = μ ⊗ ν(D) = ∫_S ν(D_s) dμ(s) = ∫_T μ(Dᵗ) dν(t),

where the last equality follows by uniqueness, since both sides define measures which agree on {B × C : B ∈ B, C ∈ C}. Furthermore

∫_S ν(D_s) dμ(s) = ∫_S ∫_T 1_D(s, t) dν(t) dμ(s) and ∫_T μ(Dᵗ) dν(t) = ∫_T ∫_S 1_D(s, t) dμ(s) dν(t).

The extension from indicators to non-negative measurable functions follows now by the standard procedure. □

Here is a useful and intuitive consequence.

Theorem 5.20 Let X and Y be two independent random variables with values in S and T respectively. Let f be as in Fubini's theorem. Then E(f(X, Y)) = E(F(X)) where F(s) := E(f(s, Y)).

Proof. Since X and Y are independent, P_{(X,Y)} = P_X ⊗ P_Y. Therefore

E(f(X, Y)) = ∫ f dP_{(X,Y)} = ∫ f d(P_X ⊗ P_Y) = ∫_S ∫_T f(s, t) dP_Y dP_X = ∫_S E(f(s, Y)) dP_X = E(F(X)). □



Example 5.21 (Product rule) Let X and Y be independent extended real random variables and either X, Y ≥ 0 or E(|X|) < ∞, E(|Y|) < ∞. Then E(X·Y) = E(X)·E(Y).

Proof.
1. Let X, Y ≥ 0 and define f(s, t) := s·t and F(s) := E(s·Y) = s·E(Y). Then

E(X·Y) = E(F(X)) = E(X·E(Y)) = E(X)·E(Y).

2. Let now E(|X|) < ∞, E(|Y|) < ∞. Since X and Y are independent, X⁺ and X⁻ are independent of Y⁺ and Y⁻, and therefore

E((X⁺ − X⁻)(Y⁺ − Y⁻)) = E(X⁺Y⁺ − X⁺Y⁻ − X⁻Y⁺ + X⁻Y⁻)
= E(X⁺)E(Y⁺) − E(X⁺)E(Y⁻) − E(X⁻)E(Y⁺) + E(X⁻)E(Y⁻)
= E(X)·E(Y). □
Another useful application of Fubini's Theorem is an alternative representation of the expectation of a non-negative random variable.

Corollary 5.22 For every random variable X ≥ 0 on a probability space (Ω, F, P) we have the representation of E(X) as a Lebesgue integral:

E(X) = ∫₀^∞ P(X > t) dt.

Proof. The set {(t, ω) : t < X(ω)} clearly is B ⊗ F-measurable. Thus

E(X) = ∫_Ω X(ω) dP = ∫_Ω ∫₀^{X(ω)} 1 dt dP = ∫_Ω ∫₀^∞ 1_{X>t} dt dP = ∫₀^∞ ∫_Ω 1_{X>t} dP dt = ∫₀^∞ P(X > t) dt. □
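For an integer-valued X ≥ 0 the corollary collapses to the tail-sum formula E(X) = Σ_{t=0}^∞ P(X > t), since t ↦ P(X > t) is constant on each interval [t, t+1). A sketch, with the uniform distribution on {0, …, 4} as our example:

```python
def expectation(pmf):
    """E(X) = sum of k * p_k for a pmf given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

def tail_sum(pmf):
    """E(X) via Corollary 5.22: the sum over t of P(X > t)."""
    return sum(sum(p for k, p in pmf.items() if k > t) for t in range(max(pmf)))

pmf = {k: 0.2 for k in range(5)}        # uniform on {0, 1, 2, 3, 4}
print(expectation(pmf), tail_sum(pmf))  # both equal 2 up to rounding
```

Both routes give the same value, as the corollary guarantees for any nonnegative integer-valued distribution.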
5.3 More than two independent events or random variables

Remark 5.23 If A, B and C are pairwise independent, this does not imply

P(A ∩ B ∩ C) = P(A)·P(B)·P(C).

Example: let A and B be independent with P(A) = P(B) = 1/2. Let C := A △ B (the symmetric difference). Then P(C) = 1/2 and A, B and C are pairwise independent. But A ∩ B ∩ C = ∅ and P(A ∩ B ∩ C) = 0 ≠ P(A)·P(B)·P(C).
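Since the underlying probability space has only four equally likely outcomes, the counterexample can be verified by brute-force enumeration (the encoding of A, B, C as predicates on {0, 1}² is ours):

```python
from itertools import product

# Omega = {0,1}^2 with the uniform measure; A = {first coin = 1},
# B = {second coin = 1}, C = symmetric difference of A and B.
outcomes = list(product([0, 1], repeat=2))

def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] == 1
B = lambda w: w[1] == 1
C = lambda w: w[0] != w[1]

# Pairwise independence holds for each of the three pairs ...
for E, F in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: E(w) and F(w)) == prob(E) * prob(F)

# ... but the triple product formula fails: A, B and C cannot all happen.
print(prob(lambda w: A(w) and B(w) and C(w)), prob(A) * prob(B) * prob(C))  # 0.0 vs 0.125
```

The enumeration makes the point of Remark 5.23 concrete: pairwise independence is strictly weaker than independence of the whole family.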

Definition 5.24 A₁, A₂, …, Aₙ are independent if P(A₁ ∩ … ∩ Aₙ) = Πᵢ₌₁ⁿ P(Aᵢ) and the same is true for all subfamilies. An infinite family of events is independent if each finite subfamily is independent.
Theorem 5.25 (Second Lemma of Borel-Cantelli) If A₁, A₂, … are independent and Σₙ₌₁^∞ P(Aₙ) = ∞, then P(infinitely many Aₙ occur) = 1.


Proof. We have to show P(A) = 1 for

A := ⋂ₙ₌₁^∞ ⋃_{m=n}^∞ Aₘ.

Observe that the complements are also independent and Aᶜ = ⋃ₙ₌₁^∞ ⋂_{m=n}^∞ Aₘᶜ. The inequality eˣ ≥ x + 1, i.e. 1 − P(Aₘ) ≤ e^{−P(Aₘ)}, implies

P(⋂_{m=n}^∞ Aₘᶜ) = lim_N P(⋂_{m=n}^N Aₘᶜ) = lim_N Π_{m=n}^N (1 − P(Aₘ)) ≤ lim_N Π_{m=n}^N e^{−P(Aₘ)} = lim_N exp(−Σ_{m=n}^N P(Aₘ)) = 0

for each n ∈ N. This yields P(⋃ₙ₌₁^∞ ⋂_{m=n}^∞ Aₘᶜ) = 0. □

Definition 5.26 A family (Aᵢ)_{i∈I} of systems of events is independent if each finite subfamily is independent in the sense that

P(⋂_{i∈J} Aᵢ) = Π_{i∈J} P(Aᵢ)  for all Aᵢ ∈ Aᵢ and finite J ⊆ I.

A family of random variables (Xᵢ)_{i∈I} is independent if the σ-algebras σ(Xᵢ) are independent.
Theorem 5.27 If (Eᵢ)_{i∈I} are independent and ∩-stable, then the σ-algebras σ(Eᵢ) are also independent.

Proof. We only need to consider finite index sets, say I = {1, …, n}, and use induction over n. For n = 2 the claim is Theorem 5.9. The set

E′ := {⋂_{i∈J} Eᵢ : Eᵢ ∈ Eᵢ, J ⊆ {1, …, n−1}}

is ∩-stable and Eᵢ ⊆ E′ for i = 1, …, n−1 (choose J = {i}). This yields σ(E′) = σ(⋃ᵢ₌₁^{n−1} Eᵢ).

Now by Theorem 5.9, σ(E′) and σ(Eₙ) are independent and thus for J ⊆ {1, …, n−1} and arbitrary Aᵢ ∈ σ(Eᵢ) we get by the induction hypothesis

P(⋂_{i∈J} Aᵢ ∩ Aₙ) = P(⋂_{i∈J} Aᵢ)·P(Aₙ) = Π_{i∈J} P(Aᵢ)·P(Aₙ). □

Theorem 5.28 (Kolmogorov's 0-1 law) Let X₁, X₂, … be independent. Consider the σ-algebra

A_∞ := ⋂ₙ₌₁^∞ σ{Xₙ, Xₙ₊₁, …}

where σ{Xₙ, Xₙ₊₁, …} is the smallest σ-algebra for which all Xₖ, k ≥ n, are measurable. Then P(A) ∈ {0, 1} for all A ∈ A_∞.


Proof. A := ⋃ₙ₌₁^∞ σ{X₁, …, Xₙ} is a set-algebra and σ(A) = σ{X₁, X₂, …}. By the measure extension theorem 2.2, A is dense in σ(A): for each A ∈ σ(A) and each ε > 0 there is some A′ ∈ A such that P(A △ A′) < ε.

Now let A ∈ A_∞ and A′ ∈ A be arbitrary. Then there exists n such that A′ ∈ σ{X₁, …, Xₙ} and A ∈ σ{Xₙ₊₁, Xₙ₊₂, …}. Therefore A and A′ are independent of each other. In particular, with (Aₖ)ₖ ⊆ A a sequence such that P(A △ Aₖ) → 0, we get

P(A) = P(A ∩ A) = limₖ P(A ∩ Aₖ) = limₖ P(A)·P(Aₖ) = P(A)²

and hence P(A) ∈ {0, 1}.

(Alternatively use that A is ∩-stable and therefore A and σ(A) are independent of each other. But A_∞ ⊆ σ(A), so every A ∈ A_∞ is independent of itself and P(A) = P(A)².) □
Remark 5.29 𝒜_∞ is called the asymptotic σ-algebra of the sequence X_1, X_2, .... A ∈ σ{X_n, X_{n+1}, ...} means roughly that A can be described by the random variables X_n, X_{n+1}, ..., whereas A ∈ 𝒜_∞ means that A is an asymptotic property of the sequence (X_n). One example is

A := { lim_{m→∞} (1/m) ∑_{i=1}^m X_i = 0 } = { lim_{m→∞} (1/m) ∑_{i=n}^m X_i = 0 }   for all n ∈ ℕ.

Note that if a random variable depends on the asymptotic behaviour of the sequence
X1 , X2 , . . . then it is almost surely constant. More generally,
Remark 5.30 Let A be a trivial -algebra, i.e. A contains only events with probability 0
or 1. Then each A-measurable real random variable Y is a.s. equal to its own expectation.
Proof. There is a c < ∞ such that P{|Y| > c} < 1. By the triviality assumption this forces P{|Y| > c} = 0, i.e. Y is a.s. bounded and in particular E(Y) exists. If one of the two events {Y > E(Y)} and {Y < E(Y)} had positive probability, it would have probability 1 by triviality; but then Y would lie a.s. strictly on one side of E(Y), which is impossible. Thus Y = E(Y) almost surely. □

5.4 Infinite product experiments

For a finite collection of probability spaces (Ω_1, F_1, P_1), ..., (Ω_n, F_n, P_n) we get their independent coupling in natural extrapolation from n = 2 on the product space Ω := Ω_1 × ··· × Ω_n with the product σ-algebra F := F_1 ⊗ ··· ⊗ F_n = σ(A_1 × ··· × A_n : A_i ∈ F_i). The product measure P := P_1 ⊗ ··· ⊗ P_n is characterized by

(⊗_{i=1}^n P_i)(A_1 × ··· × A_n) = ∏_{i=1}^n P_i(A_i)   for A_i ∈ F_i.

This measure could also be defined recursively by

(⊗_{i=1}^n P_i)(C) = ∫_{Ω_1} (⊗_{i=2}^n P_i)(C_s) dP_1(s)

where C_s := {(ω_2, ..., ω_n) : (s, ω_2, ..., ω_n) ∈ C}, s ∈ Ω_1.


Let now I be an arbitrary infinite index set and let (Ω_i, F_i, P_i) be probability spaces. Define the product space

Ω := ∏_{i∈I} Ω_i = { ω = (ω_i)_{i∈I} : ω_i ∈ Ω_i for all i ∈ I }.

On this space one has the projections π_i : Ω → Ω_i defined by π_{i_0}((ω_i)_{i∈I}) = ω_{i_0} and for subsets J ⊆ I similarly π_J : Ω → Ω_J with π_J((ω_i)_{i∈I}) = (ω_j)_{j∈J}. The product σ-algebra ⊗_{i∈I} F_i can be described in several ways:

⊗_{i∈I} F_i := σ{π_i : i ∈ I} = σ( π_J^{-1}(A) : A ∈ ⊗_{j∈J} F_j, J ⊆ I finite ).

Definition 5.31 A set of the form C = π_J^{-1}(A) = { ω ∈ Ω : (ω_j)_{j∈J} ∈ A } is called a cylinder set.

Example 5.32 Let I = {1, 2, 3}, J = {1, 2}, Ω_i := ℝ and A := {(ω_1, ω_2) ∈ ℝ² : ω_1² + ω_2² ≤ 1}. Then C = π_J^{-1}(A) is the three-dimensional cylinder of infinite height whose base is the unit disc.
Theorem 5.33 There is a unique probability measure on ⊗_{i∈I} F_i, denoted by ⊗_{i∈I} P_i, such that

(⊗_{i∈I} P_i) { ω : (ω_{j_1}, ..., ω_{j_n}) ∈ A_{j_1} × ··· × A_{j_n} } = ∏_{k=1}^n P_{j_k}(A_{j_k})

for all finite index sets J = {j_1, ..., j_n} ⊆ I and A_{j_k} ∈ F_{j_k} for all k = 1, ..., n. The probability space (∏_{i∈I} Ω_i, ⊗_{i∈I} F_i, ⊗_{i∈I} P_i) is called the independent coupling of the probability spaces (Ω_i, F_i, P_i).
Proof. 1. We first note that in the representation of a cylinder set the choice of the finite index set J is not unique: we can add an arbitrary element i to J by adding a trivial condition on the i-th coordinate:

C := { ω : (ω_{j_1}, ..., ω_{j_n}) ∈ A } = { ω : (ω_{j_1}, ..., ω_{j_n}, ω_i) ∈ A × Ω_i }.

2. The cylinder sets form an algebra:

C = π_J^{-1}(A) implies C^c = π_J^{-1}(A^c).

If C = π_J^{-1}(A) and C' = π_{J'}^{-1}(A') then we may replace J and J' by J ∪ J' by adding trivial constraints as above. Therefore we may assume J = J'. This means π_J^{-1}(A) ∩ π_J^{-1}(A') = π_J^{-1}(A ∩ A') is also a cylinder set.

3. We define P = ⊗_{i∈I} P_i on the algebra of cylinder sets by

P(π_J^{-1}(A)) := (⊗_{j∈J} P_j)(A).

This definition is meaningful since changing J in the sense of 1. does not change the value, e.g. P_1 ⊗ P_2 ⊗ P_3 (A × Ω_3) = P_1 ⊗ P_2 (A).

4. P is finitely additive: let C = π_J^{-1}(A) and C' = π_J^{-1}(A') such that C ∩ C' = ∅. Then A ∩ A' = ∅ and therefore

P(C ∪ C') = P(π_J^{-1}(A ∪ A')) = (⊗_{j∈J} P_j)(A ∪ A') = (⊗_{j∈J} P_j)(A) + (⊗_{j∈J} P_j)(A') = P(C) + P(C').

5. Now we have a finitely additive set function P on the algebra of cylinder sets which satisfies the equation in the assertion. It remains to prove that P can be extended uniquely to a measure on the σ-algebra F which is generated by the cylinder sets. By the measure extension theorem 2.2 it suffices to verify the σ-additivity of P on the cylinder algebra. By proposition 2.7 we only need to show that C^n ↓ ∅ implies P(C^n) → 0.

Let now C^1 ⊇ C^2 ⊇ ... be a sequence of cylinder sets such that C^n ↓ ∅ and C^n = π_{J_n}^{-1}(A^n). The J_n are finite and we may assume (confer step 1.) that J_1 ⊆ J_2 ⊆ ..., and by renaming we may further assume J_n = {1, ..., m_n} with m_n ↑ ∞.

Assume P(C^n) ≥ ε for all n and a fixed ε > 0. Write P^(1) := ⊗_{i∈I∖{1}} P_i. Then

ε ≤ P(C^n) = (P_1 ⊗ ··· ⊗ P_{m_n})(A^n) = ∫_{Ω_1} (P_2 ⊗ ··· ⊗ P_{m_n})(A^n_s) dP_1(s) = ∫_{Ω_1} P^(1)(C^n_s) dP_1(s) =: ∫_{Ω_1} f_n(s) dP_1(s).

Observe C^1_s ⊇ C^2_s ⊇ ... and therefore

1 ≥ f_1(s) ≥ f_2(s) ≥ ... ≥ f_n(s) ↓ f(s) := inf_n f_n(s).

This yields

∫ f(s) dP_1(s) = lim_n ∫ f_n(s) dP_1(s) ≥ ε.

Hence there exists s_1 ∈ Ω_1 such that f(s_1) ≥ ε, i.e. P^(1)(C^n_{s_1}) ≥ ε for all n ∈ ℕ. Using the same arguments for the finitely additive set function P^(1) we get an s_2 ∈ Ω_2 such that P^(2)(C^n_{(s_1,s_2)}) ≥ ε, where P^(2) := ⊗_{i∈I∖{1,2}} P_i. Recursively we find a sequence s_1, s_2, ... such that for all k and all n

(⊗_{i∈I, i∉{1,...,k}} P_i)(C^n_{s_1,...,s_k}) ≥ ε

and in particular C^n_{s_1,...,s_k} ≠ ∅.

Let ω ∈ Ω be such that ω_i = s_i for all i ∈ ℕ, and fix some n. Since C^n = π_{{1,...,m_n}}^{-1}(A^n) depends only on the first m_n coordinates and the section C^n_{s_1,...,s_{m_n}} is nonempty, there is some ω' ∈ C^n with ω'_i = s_i for i = 1, ..., m_n, and hence (s_1, ..., s_{m_n}) ∈ A^n. Because ω_i = s_i this means ω ∈ π_{{1,...,m_n}}^{-1}(A^n) = C^n and thus ω ∈ ⋂_{n=1}^∞ C^n, which is a contradiction to ⋂_{n=1}^∞ C^n = ∅. □

Corollary 5.34 Let P_n be a sequence of probability distributions on some state spaces (S_n, B_n). Then there are a probability space (Ω, F, P) and random variables X_n : Ω → S_n such that

• the X_n are independent,
• P_{X_n} = P_n for all n ∈ ℕ.

Proof. Let Ω := ∏_{n∈ℕ} S_n, F := ⊗_{n∈ℕ} B_n, P := ⊗_{n∈ℕ} P_n and define X_n(ω) = π_n(ω) = ω_n. Then

P(X_1 ∈ B_1, ..., X_n ∈ B_n) = P{ ω : ω_i ∈ B_i for i = 1, ..., n }
  = P{ ω : (ω_1, ..., ω_n) ∈ B_1 × ··· × B_n }
  = ∏_{i=1}^n P_i(B_i) = ∏_{i=1}^n P(X_i ∈ B_i)

and thus the X_n are independent. Finally P_{X_i}(B_i) = P(X_i ∈ B_i) = P_i(B_i). □

For example there exist (Ω, F, P) and a sequence of independent random variables X_n with P_{X_n} = N(0, 1).
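A finite piece of such a coupling is immediate to realize on a computer. The following sketch (our own, standard library only) draws independent N(0, 1) coordinates and checks the first two moments:

```python
import random

rng = random.Random(42)
n = 50_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # independent N(0,1) coordinates

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(f"empirical mean {mean:.3f}, empirical variance {var:.3f}")
```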

Chapter 6

Handling concrete distributions


The general methods of the previous chapter are now used to study a couple of concrete probability distributions. Many of them are closely related to certain special functions of classical analysis.

Proposition 6.1 Let X_1, X_2, ... be independent random variables ∼ Bin(1, p) (so-called Bernoulli trials) on some probability space.¹ For r ∈ ℕ let T_r be the time of the r-th success:

T_r(ω) = min{ k ∈ ℕ_0 : ∑_{i=1}^k X_i(ω) = r }.

Let k ≥ r. Then T_r has a negative binomial distribution:

P(T_r = k) = \binom{k-1}{r-1} p^r (1−p)^{k−r}.

Proof. The number of successes up to time k − 1 has a Bin(k−1, p)-distribution. Therefore

P(T_r = k) = P( ∑_{i=1}^{k−1} X_i = r − 1 ) · P(X_k = 1)
  = \binom{k-1}{r-1} p^{r−1} (1−p)^{(k−1)−(r−1)} · p
  = \binom{k-1}{r-1} p^r (1−p)^{k−r}. □
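A Monte Carlo check of the negative binomial formula (a sketch; the helper `time_of_rth_success` is ours):

```python
import random
from math import comb

rng = random.Random(2)
p, r, trials = 0.3, 2, 50_000

def time_of_rth_success():
    """Perform Bernoulli(p) trials until the r-th success; return its time."""
    successes, k = 0, 0
    while successes < r:
        k += 1
        if rng.random() < p:
            successes += 1
    return k

counts = {}
for _ in range(trials):
    k = time_of_rth_success()
    counts[k] = counts.get(k, 0) + 1

for k in range(r, r + 5):
    exact = comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)
    print(k, counts.get(k, 0) / trials, round(exact, 4))
```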

6.1 Order statistics and the Beta distribution

Our next goal is to show the use of Fubini's theorem in the computation of an explicit distribution. First we note the following simple but important fact: the joint distribution of two independent random vectors with densities is given by the product density.
Proposition 6.2 Let X and Y be random vectors with densities f_X and f_Y on ℝⁿ and ℝᵐ, respectively. Then X and Y are independent if and only if (X, Y) has the density f_{(X,Y)}(x, y) = f_X(x) · f_Y(y) on ℝ^{n+m}.
¹ The canonical space is Ω = {0, 1}^ℕ with P = ⊗_{i∈ℕ} Bin(1, p); recall that we now can rely on Corollary 5.34.

Proof. Let B ∈ ℬⁿ and C ∈ ℬᵐ be arbitrary Borel sets. Then

P(X ∈ B) · P(Y ∈ C) = ∫_B f_X dx · ∫_C f_Y dy = ∫_{B×C} f_X(x) f_Y(y) d(x, y).

So X and Y are independent if and only if

∫_{B×C} f_X(x) f_Y(y) d(x, y) = P(X ∈ B, Y ∈ C)   for all such B, C,

and this is equivalent to f_X(x) · f_Y(y) being the density of (X, Y). □

Example 6.3 Let T_1, T_2, ..., T_n be uniformly distributed over [0, 1]. Then the T_i are independent if and only if T = (T_1, ..., T_n) is uniformly distributed over [0, 1]ⁿ, which means that T has the density 1_{[0,1]ⁿ}.
Definition 6.4 If x_1, ..., x_n are real numbers, then x_{1:n}, ..., x_{n:n} are the order statistics of the vector (x_1, ..., x_n), i.e. x_{1:n} ≤ ··· ≤ x_{n:n} and (x_{1:n}, ..., x_{n:n}) is a permutation of (x_1, ..., x_n).
Definition 6.5 Let a, b > 0. Then B(a, b) := ∫_0^1 s^{a−1} (1−s)^{b−1} ds is the Beta function, and the probability density over [0, 1] of the form

β_{a,b}(s) = s^{a−1} (1−s)^{b−1} / B(a, b)

is called the β_{a,b}-density; the associated distribution is called the Beta distribution with parameters a, b.
Proposition 6.6 Let T_1, ..., T_n be independent and uniformly distributed over [0, 1] and for 1 ≤ r ≤ n let T_{r:n} be their r-th order statistic. Then T_{r:n} has a β_{r,n−r+1}-density.
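Before turning to the proof, a quick simulation makes the claim plausible: the β_{r,n−r+1}-distribution has mean r/(n+1), which the empirical mean of T_{r:n} should match (a sketch; the helper `order_stat_mean` is ours):

```python
import random

def order_stat_mean(n, r, trials=20_000, seed=1):
    """Empirical mean of the r-th order statistic of n iid Uniform[0,1] variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[r - 1]
    return total / trials

n, r = 5, 2
print(order_stat_mean(n, r), r / (n + 1))   # Beta(r, n-r+1) has mean r/(n+1)
```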
Proof. Since P(⋃_{i≠j} {T_i = T_j}) = 0 we can assume without loss of generality that the T_i are pairwise different. Furthermore observe that if some random variables X_1, ..., X_n are independent and identically distributed, then they are exchangeable, i.e.

P_{X_1,...,X_n} = P_{X_{π(1)},...,X_{π(n)}}   for each permutation π ∈ S_n.

Therefore

P(T_{r:n} ≤ c) = P( there is a permutation π such that T_{π(1)} < ··· < T_{π(n)} and T_{π(r)} ≤ c )
  = P( ⋃_{π∈S_n} { T_{π(1)} < ··· < T_{π(n)}, T_{π(r)} ≤ c } )
  = ∑_{π∈S_n} P(T_1 < ··· < T_n, T_r ≤ c)   since the union is disjoint
  = n! · P(T_1 < ··· < T_n, T_r ≤ c)
  = n! · E( f(T_r, T_1, ..., T_{r−1}, T_{r+1}, ..., T_n) )

for the function f defined by

f(s, t_1, ..., t_{r−1}, t_{r+1}, ..., t_n) = 1_{[0,c]}(s) · 1_{{t_1 < ··· < t_{r−1} < s < t_{r+1} < ··· < t_n}}.

We want to apply Theorem 5.20 to this function. Thus we introduce

F(s) = E( f(s, T_1, ..., T_{r−1}, T_{r+1}, ..., T_n) )
  = 1_{[0,c]}(s) · P(T_1 < ··· < T_{r−1} < s) · P(s < T_{r+1} < ··· < T_n)
  = 1_{[0,c]}(s) · s^{r−1}/(r−1)! · (1−s)^{n−r}/(n−r)!

and get

P(T_{r:n} ≤ c) = n! · E(F(T_r)) = n! ∫_0^1 F(s) ds
  = n!/((r−1)!(n−r)!) ∫_0^c s^{r−1} (1−s)^{n−r} ds
  = n!/((r−1)!(n−r)!) · B(r, n−r+1) ∫_0^c β_{r,n−r+1}(s) ds.

If we set c = 1 we get

n!/((r−1)!(n−r)!) · B(r, n−r+1) = 1.

This proves both our assertion and the following fact about the Beta function.

Corollary 6.7 B(r, n−r+1) = (r−1)! (n−r)! / n!.

6.2 Transformation of densities

Next we study how densities change under a non-linear transformation of coordinates. For the motivation we start with the polar coordinates of a standard Gaussian vector.
Example 6.8 If X_1 and X_2 are independent random variables and X_1, X_2 ∼ N(0, 1), then according to Proposition 6.2 (X_1, X_2) has the density

f_{(X_1,X_2)}(x_1, x_2) = (1/√(2π)) e^{−x_1²/2} · (1/√(2π)) e^{−x_2²/2} = (1/(2π)) e^{−(x_1²+x_2²)/2}.

Let (R, U) be the polar coordinates of (X_1, X_2): (X_1, X_2) = R · (cos U, sin U) =: g(R, U). Since the Jacobian of g is

Dg(r, u) = ( cos u   −r sin u
             sin u    r cos u )

with determinant r, we get by means of the transformation rule for multidimensional integrals, for every Borel set A,

P((R, U) ∈ g^{-1}(A)) = P((X_1, X_2) ∈ A) = ∫_A f_{(X_1,X_2)}(x_1, x_2) dx_1 dx_2 = ∫_{g^{-1}(A)} f_{(X_1,X_2)}(g(r, u)) · r dr du.

Thus the density of (R, U) is f_{(R,U)}(r, u) = r · f_{(X_1,X_2)}(g(r, u)) = (r/(2π)) e^{−r²/2}. The densities of R and U are f_R(r) = r e^{−r²/2} and f_U(u) = (1/(2π)) 1_{[0,2π)}(u).
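The density f_R(r) = r e^{−r²/2} integrates over [0, 1] to 1 − e^{−1/2}, which a simulation reproduces (a sketch, standard library only):

```python
import math
import random

rng = random.Random(7)
trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if math.hypot(rng.gauss(0, 1), rng.gauss(0, 1)) <= 1.0
)
empirical = hits / trials          # estimate of P(R <= 1)
exact = 1 - math.exp(-0.5)         # integral of r e^{-r^2/2} over [0, 1]
print(f"empirical {empirical:.4f} vs exact {exact:.4f}")
```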

The general result reads as follows.

Theorem 6.9 (Transformation rule for densities) Let X be a random vector with values in an open set U ⊆ ℝⁿ and let f_X be the density of X. Let g : U → ℝⁿ be continuously differentiable and Y := g(X). If

• U = N ∪ ⋃_i U_i, where the U_i are pairwise disjoint open sets and N is a nullset,
• g_i := g|_{U_i} is injective for each i,
• J(x) := |det Dg(x)| ≠ 0 for all x ∈ ⋃_i U_i,

then Y has the density

f_Y(y) = ∑_{i : y ∈ g(U_i)} f_X(g_i^{-1}(y)) / J(g_i^{-1}(y)).

Figure 6.1: In the proof of Remark 6.10 we decompose an open set U into a countable union of boxes U_i.

Proof. If we define f^i(y) := f_X(g_i^{-1}(y)) / J(g_i^{-1}(y)) we get for an arbitrary measurable set B ⊆ ℝⁿ (the nullset N does not contribute since X has a density):

P(Y ∈ B) = P(X ∈ g^{-1}(B)) = ∑_i P(X ∈ U_i ∩ g^{-1}(B)) = ∑_i P(X ∈ g_i^{-1}(B))
  = ∑_i ∫_{g_i^{-1}(B)} f_X(x) dx = ∑_i ∫_{g_i^{-1}(B)} J(x) f^i(g_i(x)) dx,

and by the transformation rule this equals

∑_i ∫_{B ∩ g(U_i)} f^i(y) dy = ∫_B f_Y(y) dy. □

Remark 6.10 Conversely, if Y = g(X) has a density and g is continuously differentiable, then a decomposition of U as in the theorem exists.

Proof. Consider the set of critical points C := {x : |det Dg(x)| = 0}. Then the set of critical values V := g(C) is a (Lebesgue) nullset². If Y has a density, then P(Y ∈ V) = ∫_V f_Y(y) dy = 0 and thus P(X ∈ C) = 0. The function g is locally invertible on U ∖ C, i.e. around every x ∈ U ∖ C there exists a little box which is parallel to the axes and on which g is injective. We can make these boxes U_i disjoint. Observe that U ∖ ⋃_i U_i is a nullset, and thus

U = N ∪ ⋃_i U_i   with N := U ∖ ⋃_i U_i

is our decomposition. □

² This is a special case of Sard's theorem and at least in one dimension plausible.

6.3 Convolution; the Gamma-distribution

Definition 6.11 The Gamma distribution Γ(α, λ) is the distribution with the density

γ_{α,λ}(x) = (λ^α x^{α−1} e^{−λx} / Γ(α)) 1_{ℝ₊}(x)

where Γ denotes the Gamma function

Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx.

Here α is called the shape parameter and λ is called the scale parameter. Observe that Γ satisfies Γ(1) = 1 and Γ(x + 1) = x · Γ(x) and thus Γ(n + 1) = n!.
Example 6.12 Let X ∼ N(0, 1) be a random variable and g(x) := x². Then |g'(x)| = 2|x| ≠ 0 for x ≠ 0. If we define U_1 = (−∞, 0) and U_2 = (0, ∞), then g^{-1}(y) = −√y on U_1 and g^{-1}(y) = √y on U_2, and thus Y = X² has the density

f_Y(y) = (1/√(2π)) e^{−(−√y)²/2} / (2√y) + (1/√(2π)) e^{−(√y)²/2} / (2√y)
  = (1/√(2πy)) e^{−y/2}
  = (1/√(2π)) y^{1/2−1} e^{−y/2},   y > 0.

This is the density of the Γ(1/2, 1/2)-distribution except that the normalizing factor has a different form. Since this constant is uniquely determined, we get the following corollary.

Corollary 6.13 Γ(1/2) = √π.
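Corollary 6.13 can be checked against the Gamma function of the standard library (`math.gamma`):

```python
import math

# Gamma(1/2) = sqrt(pi)  (Corollary 6.13)
print(math.gamma(0.5), math.sqrt(math.pi))

# and Gamma(n + 1) = n!, as noted in Definition 6.11:
for n in range(1, 6):
    assert math.isclose(math.gamma(n + 1), math.factorial(n))
```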
As an application of the transformation formula it is easy to prove that if X has a Γ(α, 1)-distribution, then (1/λ)X has a Γ(α, λ)-distribution. The family of Gamma distributions also behaves nicely with respect to convolution, which is another key operation in probability theory.

Definition 6.14 If μ and ν are two measures on ℬⁿ, the convolution μ ∗ ν is defined as the image measure of μ ⊗ ν under the map + : ℝⁿ × ℝⁿ → ℝⁿ:

μ ∗ ν(B) = μ ⊗ ν { (x, y) ∈ ℝⁿ × ℝⁿ : x + y ∈ B }.

Observe that

μ ∗ ν(B) = ∫ ν(B − x) dμ(x) = ∫ μ(B − y) dν(y).

Remark 6.15 1. If X and Y are independent and ℝⁿ-valued, then P_{X+Y} = P_X ∗ P_Y:

P(X + Y ∈ B) = P_{(X,Y)} {(x, y) : x + y ∈ B} = P_X ⊗ P_Y {(x, y) : x + y ∈ B} = P_X ∗ P_Y(B).

Figure 6.2: {(x, y) : x + y ∈ B} is the set between the diagonal lines.

2. If X and Y are independent and have the densities f_X and f_Y, respectively, then X + Y has the density f_X ∗ f_Y where

g ∗ h(z) = ∫ g(y) h(z − y) dy = ∫ h(x) g(z − x) dx.

3. If X and Y are independent and ℤ-valued and P_X = ∑_{k∈ℤ} p_k δ_k and P_Y = ∑_{k∈ℤ} q_k δ_k, then

P_X ∗ P_Y = ∑_{m∈ℤ} r_m δ_m   where r_m := ∑_{k+l=m} p_k q_l.

Example 6.16 Suppose that we have two Bernoulli variables: P_X = P_Y = Bin(1, 1/2). Then

P_{X+Y}({k}) = 1/4 for k = 0,  1/2 for k = 1,  1/4 for k = 2,

and this is exactly Bin(2, 1/2). More generally, if X_1, ..., X_n ∼ Bin(1, p) are independent, then P_{X_1+···+X_n} = P_{X_1} ∗ ··· ∗ P_{X_n} = Bin(n, p).
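The discrete convolution of part 3 of Remark 6.15 is a short computation; the sketch below (the helper `convolve` is ours) reproduces Bin(4, 1/2) from four Bernoulli factors:

```python
from collections import defaultdict
from math import comb

def convolve(p, q):
    """Convolution of two pmfs given as dicts {value: probability}."""
    r = defaultdict(float)
    for k, pk in p.items():
        for l, ql in q.items():
            r[k + l] += pk * ql
    return dict(r)

bernoulli = {0: 0.5, 1: 0.5}        # Bin(1, 1/2)
n = 4
pmf = {0: 1.0}
for _ in range(n):
    pmf = convolve(pmf, bernoulli)

binom = {k: comb(n, k) / 2**n for k in range(n + 1)}   # Bin(4, 1/2)
print(pmf)
print(binom)
```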
Proposition 6.17 Assume X ∼ Γ(r, λ) and Y ∼ Γ(t, λ) are independent. Then X + Y ∼ Γ(r + t, λ).

Proof. Using the remark after Corollary 6.13 we can assume without loss of generality that λ = 1. Then

f_X ∗ f_Y(z) = (1/(Γ(r)Γ(t))) ∫_0^z (z−y)^{r−1} e^{−(z−y)} y^{t−1} e^{−y} dy

where the integration limits are due to the indicator functions. Substituting y by zs yields

  = (e^{−z}/(Γ(r)Γ(t))) ∫_0^1 z^{r−1} (1−s)^{r−1} s^{t−1} z^{t−1} z ds
  = e^{−z} z^{r+t−1} B(r, t)/(Γ(r)Γ(t))
  = (B(r, t) Γ(r+t)/(Γ(r)Γ(t))) · (1/Γ(r+t)) z^{r+t−1} e^{−z}
  = (B(r, t) Γ(r+t)/(Γ(r)Γ(t))) · (density of Γ(r+t, 1)).

Since this is a density, the Γ(r+t, 1)-density is also a density and the integral of a density is 1, we get

B(r, t) Γ(r+t)/(Γ(r)Γ(t)) = 1.

This shows not only the assertion but also the following corollary.

Corollary 6.18 B(r, t) = Γ(r) Γ(t) / Γ(r + t).
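Corollary 6.18 is easy to verify numerically by comparing the Beta integral of Definition 6.5 with `math.gamma` (the midpoint integration helper is ours):

```python
import math

def beta_numeric(a, b, steps=100_000):
    """Numerically integrate B(a, b) = int_0^1 s^(a-1) (1-s)^(b-1) ds
    (midpoint rule, which avoids the endpoint singularities for a, b < 1)."""
    h = 1.0 / steps
    return sum(
        ((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
        for i in range(steps)
    ) * h

for a, b in [(2.0, 3.0), (0.5, 0.5), (1.5, 2.5)]:
    exact = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    print(f"B({a},{b}) ~ {beta_numeric(a, b):.5f} vs {exact:.5f}")
```

For a, b < 1 the midpoint rule converges slowly near the endpoints, so only a few decimals agree there.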

Definition 6.19 The χ²_n-distribution with n degrees of freedom is the law of Z_1² + ··· + Z_n², where the Z_i are independent identically distributed N(0, 1)-variables.

Proposition 6.20 The χ²_n-distribution is equal to the Γ(n/2, 1/2)-distribution. It has the density

γ_{n/2,1/2}(x) = x^{n/2−1} e^{−x/2} / (2^{n/2} Γ(n/2)).

Proof. According to Example 6.12, Z_i² has a Γ(1/2, 1/2)-distribution for each i. This implies that

P_{χ²_n} = Γ(1/2, 1/2) ∗ ··· ∗ Γ(1/2, 1/2)   (n factors)   = Γ(n/2, 1/2). □
We conclude this chapter by noting a by-product of our study of the Gamma distribution: we get the classical formula for the volume of the n-dimensional Euclidean unit ball for free. For this we have a look at the lower tail of the χ²_n-distribution as r → 0:

P(χ²_n ≤ r²) = ∫_0^{r²} x^{n/2−1} e^{−x/2} / (2^{n/2} Γ(n/2)) dx.

Now e^{−x/2} → 1 as x → 0 and thus

P(χ²_n ≤ r²) ∼ ∫_0^{r²} x^{n/2−1} / (2^{n/2} Γ(n/2)) dx = r^n / (2^{n/2} (n/2) Γ(n/2)).

On the other hand χ²_n = ‖Z‖² where Z := (Z_1, ..., Z_n) and hence

P(χ²_n ≤ r²) = P(‖Z‖ ≤ r) = ∫_{{x∈ℝⁿ : ‖x‖≤r}} (2π)^{−n/2} e^{−‖x‖²/2} dx ∼ (2π)^{−n/2} r^n vol B_n

with the n-dimensional unit ball B_n. Comparing the two values for P(χ²_n ≤ r²) yields

Corollary 6.21 The n-dimensional unit ball has the volume

vol B_n = π^{n/2} / Γ(n/2 + 1).

In particular vol B_n → 0 as n → ∞.

Figure 6.3: Volume of the unit ball in various dimensions:

  n         2        3        4        5         6
  vol B_n   π        4π/3     π²/2     8π²/15    π³/6
            ≈ 3.14   ≈ 4.19   ≈ 4.93   ≈ 5.26    ≈ 5.17
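The closed formula makes the surprising decay easy to tabulate (a sketch; the helper `vol_ball` is ours):

```python
import math

def vol_ball(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in range(1, 13):
    print(n, round(vol_ball(n), 4))   # peaks at n = 5, then decreases towards 0
```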

Chapter 7

Higher Moments and the L^p-spaces

7.1 The p-norm

Definition 7.1 If X is an (extended) real random variable, E(|X|^r) is the r-th absolute moment. For 0 < r < ∞ we write ‖X‖_r for E(|X|^r)^{1/r}. Further we define ‖X‖_∞ := inf{α : P(|X| > α) = 0}, and if this infimum does not exist, we set ‖X‖_∞ := ∞. For 0 < p ≤ ∞ the symbol L^p or L^p(P) or L^p(Ω, F, P) denotes the set of all random variables with ‖X‖_p < ∞.
Theorem 7.2 (Hölder's inequality) Suppose p, q ∈ [1, ∞] are conjugate numbers, i.e. 1/p + 1/q = 1. Then for all random variables X and Y one has

E(|X · Y|) ≤ ‖X‖_p ‖Y‖_q.

Proof. See the measure theory script. □

Observe that for p = q = 2 Hölder's inequality becomes the Cauchy-Schwarz inequality: E(|X · Y|) ≤ ‖X‖_2 ‖Y‖_2.
Theorem 7.3 (Minkowski's inequality) If X + Y is defined and p ≥ 1 then

‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p.

Remark 7.4 The inequalities of Hölder and Minkowski hold also in general measure spaces.
Corollary 7.5 On a probability space one has ‖X‖_r ≤ ‖X‖_s whenever 0 < r ≤ s < ∞.

Proof. Set p = s/r and let q be the conjugate of p. Then by Hölder's inequality:

‖X‖_r^s = E(|X|^r)^{s/r} = E(|X|^r · 1)^p ≤ (‖|X|^r‖_p ‖1‖_q)^p = E(|X|^{rp}) = E(|X|^s) = ‖X‖_s^s. □
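The monotonicity of the p-norms holds for every probability measure, hence in particular for the empirical distribution of a simulated sample, where it can be observed directly (a sketch; the sample and the helper `norm` are ours):

```python
import random

rng = random.Random(3)
xs = [2 * rng.random() for _ in range(50_000)]   # sample of X ~ Uniform[0, 2]

def norm(r):
    """p-norm of X under the empirical distribution of the sample."""
    return (sum(abs(x) ** r for x in xs) / len(xs)) ** (1 / r)

print([round(norm(r), 4) for r in (0.5, 1, 2, 4)])   # non-decreasing in r
```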


Example 7.6 If X ∼ Cauchy, then X ∈ L^r for all r < 1 since ∫_ℝ |x|^r / (π(1 + x²)) dx < ∞, but X ∉ L¹.

If X ∈ L^r, then X is almost surely equal to a finite valued random variable. We write X ∼ Y if X = Y almost surely. This defines an equivalence relation. Let ℒ^r be the set of equivalence classes of elements of L^r.

7.2 Completeness

Theorem 7.7 (ℒ^p, ‖·‖_p) is a Banach space for 1 ≤ p ≤ ∞.

Proof. We prove only the case p < ∞, since for p = ∞ the assertion can be shown similarly to the fact that C[a, b] is a Banach space. If p ≥ 1 then ‖·‖_p is a norm on ℒ^p; we easily verify the three properties:

• ‖X‖_p = 0 if and only if X = 0 almost surely, that is, X is in the 0-element of ℒ^p,
• ‖aX‖_p = |a| ‖X‖_p for all a ∈ ℝ,
• ‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p by Minkowski's inequality.

For the completeness we need to show that for a Cauchy sequence (X_n) in L^p there is some X ∈ L^p with ‖X_n − X‖_p → 0. Given this Cauchy sequence choose an increasing sequence (n_k) such that ‖X_{n_k} − X_m‖_p ≤ 2^{−k} for all m ≥ n_k. Then for Y_1 := X_{n_1} and Y_{k+1} := X_{n_{k+1}} − X_{n_k} we have ∑_{k=1}^∞ ‖Y_k‖_p < ∞. Now define a random variable Z := ∑_{k=1}^∞ |Y_k|; with monotone convergence and Minkowski's inequality we get

E(Z^p) = lim_{N→∞} E( (∑_{k=1}^N |Y_k|)^p ) = lim_{N→∞} ‖∑_{k=1}^N |Y_k|‖_p^p ≤ lim_{N→∞} ( ∑_{k=1}^N ‖Y_k‖_p )^p < ∞,

and thus Z < ∞ almost surely. Therefore the series ∑_{k=1}^∞ Y_k converges absolutely almost surely and

∑_{k=1}^∞ Y_k = lim_{N→∞} X_{n_N} =: X

exists almost everywhere. Furthermore

lim_{N→∞} ‖X − X_{n_N}‖_p^p = lim_{N→∞} ‖∑_{k=N+1}^∞ Y_k‖_p^p = 0,

and therefore ‖X‖_p ≤ ‖X − X_{n_N}‖_p + ‖X_{n_N}‖_p is finite and ‖X − X_N‖_p ≤ ‖X − X_{n_N}‖_p + ‖X_{n_N} − X_N‖_p where both terms converge to 0 as N → ∞. Finally pass to equivalence classes and the proof is complete. □


7.3 Second moments

Remark 7.8 In the case p = 2 the space ℒ² is a Hilbert space with scalar product ⟨X, Y⟩ = E(X·Y). If X ∈ L² then E(X²) < ∞ and thus by Cor. 7.5 also E(|X|) < ∞. We get

E(X²) − (E(X))² = E((X − E(X))²) = var(X) = σ²(X) < ∞.

Moreover |E(X)| ≤ √(E(X²)).


Definition 7.9 If X and Y are random variables we define the covariance of X and Y as

cov(X, Y) := E((X − E(X)) · (Y − E(Y)))

if this expectation is defined.

Remark 7.10 If X and Y are in L², then we get as an analogue of the Cauchy-Schwarz inequality for the covariance:

|cov(X, Y)| ≤ σ(X) σ(Y).

Definition 7.11 Let X and Y be random variables. Then σ(X) = √(σ²(X)) is called the standard deviation of X. The correlation or correlation coefficient between X and Y is defined by

ρ(X, Y) = cov(X, Y) / (σ(X) σ(Y)).

Observe that −1 ≤ ρ(X, Y) ≤ 1. X and Y are said to be uncorrelated if ρ(X, Y) = 0.
Remark 7.12

• If X and Y are independent, then they are uncorrelated.
• Every X ∈ L² can be standardized: X* = (X − E(X))/σ(X) has expectation 0 and variance 1.
• ρ(X, Y) = 1 means that X* equals Y* almost surely, whereas ρ(X, Y) = −1 implies X* = −Y* almost surely.
• cov(X, Y) = E(X·Y) − E(X)·E(Y).
Proposition 7.13

cov( ∑_i a_i X_i, ∑_j b_j Y_j ) = ∑_{i,j} a_i b_j cov(X_i, Y_j).

Proof. Follows directly because E(·) is a linear operation. □

Remark 7.14 (Chebyshev's inequality) For every X ∈ L² and every ε > 0 one has

P(|X − E(X)| ≥ ε) ≤ σ²(X)/ε².

Proof. We use again Markov's inequality:

P(|X − E(X)| ≥ ε) ≤ (1/ε²) E(|X − E(X)|²) = (1/ε²) ‖X − E(X)‖_2² = σ²(X)/ε². □

Corollary 7.15 (Formula of Bienaymé) Let X_1, X_2, ... be uncorrelated random variables. Then

σ²( ∑_{i=1}^n X_i ) = ∑_{i=1}^n σ²(X_i).

Proof. From cov(X, X) = σ²(X) and the last proposition we get

σ²( ∑_{i=1}^n X_i ) = cov( ∑_{i=1}^n X_i, ∑_{j=1}^n X_j ) = ∑_{i,j=1}^n cov(X_i, X_j) = ∑_{i=1}^n cov(X_i, X_i) = ∑_{i=1}^n σ²(X_i)

since the covariance is zero for two uncorrelated random variables. □

Definition 7.16 Let X_1, ..., X_d be elements of L². Then the matrix C = (c_{ij})_{1≤i,j≤d} with entries

c_{ij} = cov(X_i, X_j)

is called the covariance matrix of the vector X = (X_1, ..., X_d). It is denoted by cov(X).

Theorem 7.17 If X is a d-dimensional random vector and A : ℝ^d → ℝ^k is a linear map, then

cov(AX) = A cov(X) Aᵀ.

Every covariance matrix is positive semi-definite.


Proof. For the first part simply note that the l-th and the m-th components of the vector Y = AX have the covariance

cov( ∑_{i=1}^d a_{li} X_i, ∑_{j=1}^d a_{mj} X_j ) = ∑_{i,j=1}^d a_{li} cov(X_i, X_j) a_{mj} = (A cov(X) Aᵀ)_{lm}.

For the second part we consider the elements of ℝ^d as column vectors and apply for t ∈ ℝ^d the first part to the linear form tᵀ. Then

⟨cov(X) t, t⟩ = tᵀ cov(X) t = cov(tᵀ X) = var(tᵀ X) ≥ 0.

This proves that cov(X) is positive semi-definite. □
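Since the empirical covariance of a sample transforms in exactly the same way, the identity cov(AX) = A cov(X) Aᵀ can be checked on simulated data to machine precision (a sketch; all helper names are ours):

```python
import random

rng = random.Random(5)
n = 20_000
# X = (X1, X2) with X1 ~ N(0,1) and X2 = X1 + N(0,1), so cov(X) is not diagonal.
xs = []
for _ in range(n):
    x1 = rng.gauss(0, 1)
    xs.append((x1, x1 + rng.gauss(0, 1)))

A = [[1.0, 2.0], [0.0, 1.0]]    # a linear map R^2 -> R^2
ys = [(A[0][0] * u + A[0][1] * v, A[1][0] * u + A[1][1] * v) for u, v in xs]

def emp_cov(vs):
    """Empirical covariance matrix of a list of 2-d points."""
    m0 = sum(v[0] for v in vs) / len(vs)
    m1 = sum(v[1] for v in vs) / len(vs)
    return [
        [sum((v[0] - m0) ** 2 for v in vs) / len(vs),
         sum((v[0] - m0) * (v[1] - m1) for v in vs) / len(vs)],
        [sum((v[1] - m1) * (v[0] - m0) for v in vs) / len(vs),
         sum((v[1] - m1) ** 2 for v in vs) / len(vs)],
    ]

def mat_mul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

C = emp_cov(xs)
At = [[A[j][i] for j in range(2)] for i in range(2)]
lhs = emp_cov(ys)                 # cov(AX), estimated directly
rhs = mat_mul(mat_mul(A, C), At)  # A cov(X) A^T
print(lhs)
print(rhs)
```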

Chapter 8

Convergence of random variables and the laws of large numbers

8.1 The weak law of large numbers

Definition 8.1 Let X_n for n = 1, 2, ... and X be random variables on (Ω, F, P) with values in (S, d) where d is a metric on S. We say (X_n) converges to X in probability if P(d(X_n, X) > ε) → 0 as n → ∞ for each ε > 0. Sometimes this is denoted by X_n →_P X.

Proposition 8.2 Convergence in the r-th mean implies convergence in probability: if r > 0 and ‖X_n − X‖_r → 0, then X_n → X in probability.

Proof. From Markov's inequality we get

P(|X_n − X| > ε) = P(|X_n − X|^r > ε^r) ≤ (1/ε^r) E(|X_n − X|^r) = (‖X_n − X‖_r / ε)^r → 0. □


Theorem 8.3 (Weak Law of Large Numbers) Let X_1, X_2, ... be a sequence of uncorrelated real random variables with σ²(X_i) ≤ M < ∞ for all i. Then we have for all ε > 0

P( |(1/n) ∑_{i=1}^n (X_i − E(X_i))| > ε ) → 0,

that is, (1/n) ∑_{i=1}^n (X_i − E(X_i)) → 0 in probability.

Proof. By Chebyshev's inequality,

P( |(1/n) ∑_{i=1}^n (X_i − E(X_i))| > ε ) ≤ (1/ε²) var( (1/n) ∑_{i=1}^n X_i ) = (1/(n²ε²)) var( ∑_{i=1}^n X_i ).

Since the X_i are uncorrelated we have var(∑ X_i) = ∑ var(X_i) and thus

(1/(n²ε²)) var( ∑_{i=1}^n X_i ) = (1/(n²ε²)) ∑_{i=1}^n var(X_i) ≤ M/(nε²) → 0. □

In particular, if E(X_i) = μ for all i then (1/n) ∑_{i=1}^n X_i → μ in probability. If the X_i are independent Bin(1, p)-trials and p̂ = (1/n) · (number of successes up to n), then

P(|p̂ − p| > ε) ≤ p(1 − p)/(nε²).


Thus probabilities can be measured by performing a large number of trials, and we even get estimates for the error probability. But observe that if we want to double the precision, i.e. if we replace ε by ε/2, then we have to perform 4 times as many trials in order to keep the same bound for the error probability.
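This trade-off can be read off directly from the bound p(1−p)/(nε²). A small sketch (the helper name `n_needed` is ours, not from the text):

```python
import math

def n_needed(eps, delta, p=0.5):
    """Smallest n with p(1-p)/(n*eps^2) <= delta, i.e. enough Bernoulli(p)
    trials so that Chebyshev guarantees P(|p_hat - p| > eps) <= delta."""
    return math.ceil(p * (1 - p) / (delta * eps**2))

# Halving eps quadruples the required number of trials:
print(n_needed(0.1, 0.05), n_needed(0.05, 0.05))
```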

8.2 Convergence in probability versus almost sure convergence

First let us give some alternative descriptions of convergence in probability for real valued random variables.

Proposition 8.4 If (X_n) is a sequence of random variables which is uniformly bounded (i.e. |X_n| ≤ C a.s. for all n), then X_n → X in probability if and only if ‖X_n − X‖_r → 0 for all r > 0.

Proof. By Proposition 8.2 we only need to show that X_n →_P X implies ‖X_n − X‖_r → 0. Assume |X_n| ≤ C for all n almost surely; then |X_n − X| ≤ 2C almost surely and

‖X_n − X‖_r^r = E(|X_n − X|^r) ≤ E( ε^r 1_{{|X_n−X|<ε}} + (2C)^r 1_{{|X_n−X|≥ε}} ) ≤ ε^r + (2C)^r P(|X_n − X| ≥ ε) ≤ 2ε^r

for sufficiently large n, since X_n converges in probability. □
Notation 8.5 We frequently use the notations a ∧ b := min(a, b) and a ∨ b := max(a, b).

Corollary 8.6 X_n converges to X in probability if and only if E(1 ∧ |X_n − X|) → 0.

Proof. Since P(|X_n − X| ≥ ε) = P(1 ∧ |X_n − X| ≥ ε) for all ε ≤ 1 we get that X_n →_P X is equivalent to 1 ∧ |X_n − X| →_P 0. By the previous proposition this is equivalent to ‖1 ∧ |X_n − X|‖_1 → 0, that is E(1 ∧ |X_n − X|) → 0. □

Definition 8.7 We denote by L⁰ the space of all real valued random variables, and by ℒ⁰ the corresponding space of equivalence classes of elements of L⁰.

Observe that d̃(X, Y) := E(1 ∧ |X − Y|) is a pseudometric on the space L⁰. On ℒ⁰ we even get a metric. For indicator functions of events this (pseudo-)metric coincides with the one we used in the proof of Carathéodory's extension theorem: we have d̃(1_A, 1_B) = E(|1_A − 1_B|) = P(A △ B).
Now we compare convergence in probability to almost sure convergence. Let us first consider the special case X_n = 1_{A_n}. Then obviously X_n → 0 in probability if and only if P(A_n) → 0, whereas X_n → 0 almost surely if and only if

P(infinitely many A_n occur) = 0.

If the A_n are independent, then X_n → 0 almost surely is equivalent to ∑ P(A_n) < ∞ by the two lemmas of Borel-Cantelli (Corollary 4.19 and Theorem 5.25).

In particular convergence in probability does not imply almost sure convergence. However for monotone sequences the two concepts coincide:

Lemma 8.8 Let (Y_n) be a monotone sequence of random variables and suppose that the r.v. Y is a.s. finite. Then Y_n converges to Y a.s. if and only if Y_n →_P Y.


Proof. We may assume that Y_n is decreasing. Moreover, replacing Y_n by Y_n − Y we may assume Y = 0. Clearly Y_n → 0 a.s. if and only if 1 ∧ |Y_n| → 0 almost surely. By dominated convergence this is equivalent to E(1 ∧ |Y_n|) → 0. Now the assertion follows from Corollary 8.6. □

In general convergence in probability can be expressed in terms of almost sure convergence.

Proposition 8.9
1. X_n → X almost surely implies that X_n → X in probability.
2. X_n → X in probability if and only if for each subsequence (X_{n_k}) there is a subsubsequence (X_{n_{k_l}}) which converges to X almost surely.

Proof.
1. X_n → X almost surely implies sup_{m≥n} |X_m − X| → 0 almost surely. The sequence Y_n := sup_{m≥n} |X_m − X| is decreasing and non-negative. Hence Y_n → 0 in probability and also X_n → X in probability.
2. Suppose for each subsequence (X_{n_k}) there is a subsubsequence (X_{n_{k_l}}) converging to X almost surely. We claim convergence in probability. Assume to the contrary that P(|X_n − X| > ε) does not tend to 0 for some ε > 0. Then there exist a δ > 0 and a subsequence (X_{n_k}) such that P(|X_{n_k} − X| > ε) ≥ δ for all k ∈ ℕ. Thus no subsubsequence converges in probability, and by the first part this means that no subsubsequence converges almost surely, which is a contradiction to our assumption.
Conversely assume that X_n → X in probability. Then E(1 ∧ |X_n − X|) → 0. Thus there is a subsequence (X_{n_k}) satisfying

∑_{k=1}^∞ E(1 ∧ |X_{n_k} − X|) < ∞.

Then the series ∑_k 1 ∧ |X_{n_k} − X| converges a.s. and for each subsubsequence we get |X_{n_{k_l}} − X| → 0 a.s. □

Remark 8.10 We saw at the end of the previous section that convergence in probability is induced by a pseudometric. Thus we can speak of the topology of convergence in probability on the space of all real random variables on a given probability space.
However it is worthwhile to note that there is no "topology of almost sure convergence" in the sense that a sequence would converge for that topology τ if and only if it converges almost surely. Such a topology would be at least as fine as the topology of convergence in probability, since there are fewer converging sequences. On the other hand a τ-closed set F would be closed under almost sure convergence and hence, by the previous proposition, the set F would be closed under convergence in probability. Hence this topology would coincide with the topology of convergence in probability, in contradiction to our assumption that only the a.s. convergent sequences converge for τ.

8.3 The strong law of large numbers

Lemma 8.11 If (Y_n) is a sequence of real random variables and ∑_{n=1}^∞ E(|Y_n|^p) < ∞ for some p > 0, then Y_n → 0 almost surely.

Proof. As before the assumption implies ∑_{n=1}^∞ |Y_n|^p < ∞ almost surely and in particular |Y_n|^p → 0 almost surely. □

Theorem 8.12 (Strong Law of Large Numbers) Let X_1, X_2, ... be real random variables which are uncorrelated and let σ²(X_i) ≤ M < ∞ for all i ∈ ℕ. Then

(1/n) ∑_{i=1}^n (X_i − E(X_i)) → 0   almost surely.

Proof. We assume E(X_i) = 0 for all i, because otherwise we can replace X_i by X_i − E(X_i) without changing uncorrelatedness or the variance. If we set S_n = ∑_{i=1}^n X_i, we need to show that (1/n) S_n → 0 almost surely. We decompose the indices into segments between successive squares: for m² ≤ l < (m+1)² we have, since l ≥ m²,

(1/l)|S_l| ≤ (1/l)|S_{m²}| + (1/l)|S_l − S_{m²}| ≤ (1/m²)|S_{m²}| + D_m   (∗)

where

D_m := max_{m² < l < (m+1)²} (1/m²) | ∑_{i=m²+1}^l X_i |.

We want to apply the Lemma with p = 2 to both terms on the right hand side of (∗); then each tends to 0 almost surely as m → ∞, and hence (1/l) S_l → 0 almost surely. For the first term we get

∑_{m=1}^∞ E( ((1/m²)|S_{m²}|)² ) = ∑_{m=1}^∞ (1/m⁴) E(|S_{m²}|²) = ∑_{m=1}^∞ (1/m⁴) var(S_{m²}) ≤ ∑_{m=1}^∞ (1/m⁴) m² M = (π²/6) M < ∞.

For D_m we get similarly

∑_{m=1}^∞ E(D_m²) = ∑_{m=1}^∞ (1/m⁴) E( max_{m²<l<(m+1)²} | ∑_{i=m²+1}^l X_i |² )
  ≤ ∑_{m=1}^∞ (1/m⁴) ∑_{l=m²+1}^{(m+1)²−1} E( | ∑_{i=m²+1}^l X_i |² )
  = ∑_{m=1}^∞ (1/m⁴) ∑_{l=m²+1}^{(m+1)²−1} ∑_{i=m²+1}^l E(|X_i|²).

Observe that the second and the third sum contain at most 2m terms each, so we can continue with

  ≤ ∑_{m=1}^∞ (1/m⁴) · 2m · 2m · M < ∞.

By the Lemma both parts of (∗) converge to 0 almost surely. This completes the proof. □
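The almost sure stabilization of the averages is visible in a single simulated path (a sketch with ±1 coin flips, so E(X_i) = 0):

```python
import random

rng = random.Random(9)
n = 100_000
s = 0
averages = {}
for i in range(1, n + 1):
    s += 1 if rng.random() < 0.5 else -1   # X_i = +/-1 with probability 1/2
    if i in (100, 10_000, 100_000):
        averages[i] = s / i
print(averages)   # the running averages S_n / n settle near 0
```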

Remark 8.13 1. In particular the Strong Law of Large Numbers holds for independent random variables with bounded variance. But observe that in general uncorrelatedness does not imply independence: let Ω be the unit disk with the uniform distribution and let X and Y be the first and the second coordinate, respectively. Then X and Y are uncorrelated by symmetry (E(X) = E(Y) = E(X·Y) = 0), but not independent.
2. The fact that independent random variables with bounded variance satisfy the Strong Law of Large Numbers has been generalized in many other directions, e.g. Kolmogorov's Laws of Large Numbers, Doob's martingale convergence, Birkhoff's individual ergodic theorem. In the chapter on martingales we shall establish Kolmogorov's and Doob's results.
Remark 8.14 Even if the random variables X_1, X_2, ... are independent and identically distributed with values in ℝ, one needs some minimal condition on the moments in order to have a.s. convergence of the averages: if E(|X_i|) = ∞, then (1/n) S_n diverges a.s.
Proof. Recall that for a non-negative random variable Y we have

\[
E(Y) = \int_0^{\infty} P(Y > t)\, dt \le \sum_{n=0}^{\infty} P(Y > n).
\]

Set A_n := {|X_n| > n}. Since all X_n have the same law one has

\[
\sum_{n=0}^{\infty} P(A_n) = \sum_{n=0}^{\infty} P(|X_1| > n) \ge E(|X_1|) = \infty.
\]

Since the A_n are independent, we can apply the second Borel-Cantelli Lemma and get that almost surely infinitely many A_n occur, that is, almost surely (1/n)|X_n| ≥ 1 for infinitely many n. Suppose (1/n)S_n → a ∈ ℝ. Then

\[
\lim_{n\to\infty} \frac{1}{n} X_n
= \lim_{n\to\infty} \Bigl( \frac{1}{n} S_n - \frac{n-1}{n}\cdot\frac{1}{n-1} S_{n-1} \Bigr)
= a - a \lim_{n\to\infty} \frac{n-1}{n} = 0,
\]

which yields a contradiction.

Chapter 9

Convergence of distributions

9.1 One-dimensional distribution functions

We briefly review some facts about the convergence of one-dimensional distribution functions.

Proposition 9.1 Let F and F_n, n ∈ ℕ, be distribution functions on ℝ. Then the following statements are equivalent:

1. F_n(z) → F(z) for all z at which F is continuous.

2. F_n(z) → F(z) for all z in a dense set of points.

If F is continuous, then 1. and 2. are equivalent to

3. sup_z |F_n(z) − F(z)| → 0 as n → ∞.
Proof. We prove only that 1. and 2. imply 3. if F is continuous. Let ε > 0. We can choose z₁ < ⋯ < z_k such that |F(z_{i+1}) − F(z_i)| < ε, F(z₁) ≤ ε, F(z_k) ≥ 1 − ε and F_n(z_i) → F(z_i) for i = 1, …, k. Let n₀ be such that |F_n(z_i) − F(z_i)| ≤ ε for n ≥ n₀ and i = 1, …, k. We claim |F_n(z) − F(z)| ≤ 2ε for n ≥ n₀ and all z ∈ ℝ.
Either z ∈ [z_i, z_{i+1}) for some i, or z < z₁, or z ≥ z_k. In the first case

\[
F(z) - 2\varepsilon \le F(z_{i+1}) - 2\varepsilon \le F(z_i) - \varepsilon \le F_n(z_i) \le F_n(z) \le F_n(z_{i+1}) \le F(z_{i+1}) + \varepsilon \le F(z_i) + 2\varepsilon \le F(z) + 2\varepsilon,
\]

hence |F_n(z) − F(z)| ≤ 2ε for n ≥ n₀. In the case z < z₁ we have

\[
F(z) - 2\varepsilon \le F(z_1) - 2\varepsilon \le \varepsilon - 2\varepsilon \le 0 \le F_n(z) \le F_n(z_1) \le F(z_1) + \varepsilon \le 2\varepsilon \le F(z) + 2\varepsilon,
\]

and the third case (z ≥ z_k) is similar.

Definition 9.2 A sequence (μ_n) of probability measures on ℝ converges weakly to the probability measure μ, if the corresponding distribution functions converge in the sense of the previous proposition: F_n(z) → F(z) for all z at which F is continuous. In this case one also speaks of weak convergence of the distribution functions.
If for some random variables X_n, n ∈ ℕ, and X the laws μ_n = L(X_n) converge weakly to μ = L(X), then one says: X_n converges to X in law.

Definition 9.3 Let α ∈ (0, 1). Then z₀ is an α-quantile of F (or μ) if F(z₀) ≥ α ≥ lim_{z↗z₀} F(z). In the special case α = 1/2, the α-quantile is called median.

Observe that the α-quantiles are not necessarily unique: If there are two numbers z₁ < z₂ such that F(z) ≡ α on (z₁, z₂), then both z₁ and z₂ are α-quantiles. But this happens only for countably many α. The number F⁻¹(α) := inf{z : F(z) ≥ α} is an α-quantile, in fact it is the smallest α-quantile in case of non-uniqueness.
Proposition 9.4 Let F_n → F weakly and let z_n be an α-quantile of F_n.

1. If z_n → z, then z is an α-quantile of F.

2. Conversely, if the α-quantile z of F is unique, then z_n → z.

Proof. 1. Let u < z < v be points where F is continuous. Then eventually all z_n are in (u, v). Hence F(v) = lim F_n(v) ≥ α, since F_n(v) ≥ F_n(z_n) ≥ α, and similarly F(u) ≤ α. Now let v ↘ z and u ↗ z; by right-continuity F(z) ≥ α ≥ lim_{u↗z} F(u), i.e. z is an α-quantile of F.

2. lim sup z_n and lim inf z_n are α-quantiles of F by part 1. By uniqueness they agree with the α-quantile of F, and thus z_n → z.

Recall that on ([0, 1], B[0, 1], λ₁) the random variable X := F⁻¹ has the distribution function F_X = F. We shall see later in Proposition 9.10 that convergence in probability, and hence also a.s. convergence, implies convergence in law. The following is a partial converse.

Corollary 9.5 Suppose F_n → F weakly. Then there is a probability space (Ω, A, P) with real random variables X_n and X such that F_{X_n} = F_n, F_X = F and X_n → X almost surely.

Proof. Choose (Ω, A, P) := ([0, 1], B[0, 1], λ₁) and set X_n := F_n⁻¹ and X := F⁻¹. Then F_{X_n} = F_n and F_X = F. Furthermore, for all ω ∈ [0, 1] with countably many exceptions (namely those ω for which the ω-quantile of F is not unique) we get F_n⁻¹(ω) → F⁻¹(ω) by Proposition 9.4. Hence X_n(ω) → X(ω) almost surely with respect to λ₁.
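The quantile transform underlying this proof is also the standard recipe for simulation: X := F⁻¹(U) with U uniform on [0, 1] has distribution function F. A minimal sketch for the exponential distribution (our choice of example; the function name is ours):

```python
import math
import random

def quantile_exp(u, lam=1.0):
    """Generalized inverse F^{-1}(u) = inf{z : F(z) >= u} for the
    Exponential(lam) distribution, F(z) = 1 - exp(-lam * z)."""
    return -math.log(1.0 - u) / lam

rng = random.Random(1)
sample = [quantile_exp(rng.random()) for _ in range(200_000)]
mean = sum(sample) / len(sample)
# The sample mean should be close to E(X) = 1/lam = 1.
```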

Definition 9.6 Let X₁, …, X_n be random variables on a probability space (Ω, A, P) with state space (S, B). Then the empirical distribution of X₁, …, X_n is the random probability measure μ̂_n defined by

\[
\hat\mu_n(\omega) = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i(\omega)},
\qquad\text{i.e.}\qquad
\hat\mu_n(\omega)(B) = \frac{1}{n} \sum_{i=1}^{n} 1_B\bigl(X_i(\omega)\bigr).
\]

Observe that

\[
\int_S f(x)\, d\hat\mu_n(\omega) = \frac{1}{n} \sum_{i=1}^{n} f\bigl(X_i(\omega)\bigr).
\]

If S = ℝ, then the empirical distribution function is defined by

\[
\hat F_n(\omega)(z) := \hat\mu_n(\omega)\bigl((-\infty, z]\bigr) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \le z\}}(\omega).
\]
The following is sometimes called the Fundamental Theorem of Statistics. It says that every one-dimensional distribution can a.s. be recovered from iid observations.

Theorem 9.7 (Glivenko-Cantelli) Let X₁, X₂, … be independent identically distributed real random variables with distribution function F and law μ. Then almost surely

μ̂_n(ω) → μ weakly and F̂_n(ω) → F weakly.

Proof. Consider z ∈ ℝ and define Y_i := 1_{{X_i ≤ z}}. Then F̂_n(ω)(z) = (1/n) ∑_{i=1}^n Y_i(ω). The Y₁, Y₂, … are independent and B(1, p)-distributed with p = P(X_i ≤ z) = F(z), so E(Y_i) = F(z). By the Strong Law of Large Numbers F̂_n(ω)(z) → F(z) almost surely. Now let

\[
\Omega_0 = \{\omega : \hat F_n(\omega)(z) \to F(z)\ \ \forall z \in \mathbb{Q}\} = \bigcap_{z \in \mathbb{Q}} \{\omega : \hat F_n(\omega)(z) \to F(z)\}.
\]

Since P({ω : F̂_n(ω)(z) → F(z)}) = 1 for all z ∈ ℚ, we have P(Ω₀) = 1, and thus by Proposition 9.1 (convergence on a dense set) we have almost surely the weak convergence of F̂_n to F.
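The theorem can be watched at work numerically. The following sketch (function names are ours) builds the empirical distribution function of a uniform sample and measures its largest deviation from the true distribution function F(z) = z on a grid:

```python
import bisect
import random

def empirical_cdf(sample):
    """Return z -> (1/n) * #{i : X_i <= z}, as in Definition 9.6."""
    xs = sorted(sample)
    n = len(xs)
    return lambda z: bisect.bisect_right(xs, z) / n

rng = random.Random(2)
data = [rng.random() for _ in range(50_000)]   # iid Uniform(0, 1)
Fn = empirical_cdf(data)
# Largest deviation from F(z) = z over a fine grid:
dist = max(abs(Fn(k / 100) - k / 100) for k in range(101))
```

For a sample of this size the deviation is already very small, in line with the (in fact uniform) convergence of the empirical distribution function.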




9.2 Weak convergence on general metric spaces

The drawback of the definition of convergence in law in the one-dimensional case is that it relies heavily on the linear structure of the real numbers. The following approach works in general metric spaces. In Corollary 9.15 we shall see that in ℝ it is equivalent to the concept of the previous section.
Definition 9.8 Let (S, d) be a metric space.

1. Let μ and μ₁, μ₂, … be probability measures on B = B(S). We say that (μ_n) converges weakly to μ if

\[
\int_S \varphi\, d\mu_n \to \int_S \varphi\, d\mu
\]

for all φ in the space C_b(S) of bounded continuous functions on S.

2. Let X₁, X₂, … and X be random variables with state space (S, B). We say that (X_n) converges to X in distribution or in law, and write X_n →ᴸ X or X_n →ᴰ X, if P_{X_n} → P_X weakly, i.e. L(X_n) → L(X).

Remark 9.9
- The random variables X and X_n are not necessarily defined on a common probability space.
- X_n →ᴸ X is equivalent to

\[
\int_S \varphi\, dP_{X_n} = E\bigl(\varphi(X_n)\bigr) \to E\bigl(\varphi(X)\bigr) = \int_S \varphi\, dP_X
\]

for all φ ∈ C_b(S).

It is important to distinguish between convergence in probability and convergence in law.


Proposition 9.10 Let (X_n) and X be random variables defined on a common probability space (Ω, A, P) and assume that X_n → X in probability. Then X_n converges to X also in law.

Proof. Let us first assume that X_n → X almost surely. Then for φ ∈ C_b(S) we have φ(X_n) → φ(X) almost surely. Since φ is bounded, |φ(X_n)| ≤ C for some C < ∞. So we can apply dominated convergence to get E(φ(X_n)) → E(φ(X)). Hence X_n converges to X in law.
Now assume X_n → X in probability. If X_n does not converge to X in law, there is a function φ ∈ C_b(S) for which E(φ(X_n)) does not converge to E(φ(X)). Hence there exist a subsequence (n_k) and ε > 0 such that

\[
\bigl| E(\varphi(X_{n_k})) - E(\varphi(X)) \bigr| \ge \varepsilon
\]

for all k ∈ ℕ. But by Proposition 8.9 there is a sub-subsequence (X_{n_{k_l}}) which converges to X almost surely. Then X_{n_{k_l}} → X in law by the first part of the proof. But this is a contradiction.

The converse usually is meaningless in view of Remark 9.9.

Example 9.11 Let x ∈ S, x_n ∈ S, μ_n := δ_{x_n} and μ := δ_x. Then lim x_n = x is equivalent to μ_n → μ weakly; in fact both conditions are equivalent to

\[
\int_S \varphi\, d\delta_{x_n} = \varphi(x_n) \to \varphi(x) = \int_S \varphi\, d\delta_x
\]

for all φ ∈ C_b(S).

Proof. The implication "⇒" is clear, so assume μ_n → μ weakly but x_n ↛ x. Then there exist an ε > 0 and a subsequence (x_{n_k}) such that d(x_{n_k}, x) > ε. The mapping φ(y) := dist(y, B(x, ε)ᶜ) is in C_b(S). But φ(x_{n_k}) = 0 for all k while φ(x) = ε, and therefore δ_{x_{n_k}} ↛ δ_x, which is a contradiction.


Example 9.12 Let S := [0, 1] and define μ_n := (1/n) ∑_{i=1}^n δ_{i/n}. A standard fact about the Riemann integral is that for all bounded continuous functions φ

\[
\frac{1}{n} \sum_{i=1}^{n} \varphi\Bigl(\frac{i}{n}\Bigr) \to \int_0^1 \varphi(x)\, dx,
\]

i.e. μ_n → λ₁|_{B([0,1])} weakly.
Lemma 9.13 Let μ_n, n ∈ ℕ, and μ be probability measures. Let φ_k, k ∈ ℕ, be bounded measurable functions such that ∫φ_k dμ_n → ∫φ_k dμ as n → ∞ for all k ∈ ℕ.

1. If φ_k(x) ↗ φ(x) for all x, then ∫φ dμ ≤ lim inf_{n→∞} ∫φ dμ_n.

2. If φ_k → φ uniformly, then ∫φ dμ_n → ∫φ dμ.

Proof. 1. For each k we have

\[
\lim_{n\to\infty} \int \varphi_k\, d\mu_n \le \liminf_{n\to\infty} \int \varphi\, d\mu_n,
\]

and monotone convergence implies

\[
\int \varphi\, d\mu = \sup_k \int \varphi_k\, d\mu \le \liminf_{n\to\infty} \int \varphi\, d\mu_n.
\]

2. Define I_k(ν) = ∫φ_k dν, I(ν) = ∫φ dν and ε_k := sup_x |φ_k(x) − φ(x)|. Then

\[
|I_k(\nu) - I(\nu)| = \Bigl| \int \varphi_k\, d\nu - \int \varphi\, d\nu \Bigr| \le \int |\varphi_k - \varphi|\, d\nu \le \varepsilon_k
\]

for all probability measures ν. Thus I_k → I uniformly on the space of probability measures. By our assumption we have I_k(μ_n) → I_k(μ) as n → ∞, and uniform convergence implies that the same holds for I.

Theorem 9.14 (Portmanteau theorem) Let (S, d) be a metric space and let μ_n and μ be probability measures on B(S). Then the following statements are equivalent:

a) μ_n → μ weakly, i.e.

\[
\lim_{n\to\infty} \int \varphi\, d\mu_n = \int \varphi\, d\mu \tag{$*$}
\]

holds for all φ ∈ C_b(S).

b) (∗) holds for all bounded Lipschitz functions φ (i.e. |φ(x) − φ(y)| ≤ L d(x, y) for some L).

c) (∗) holds for all φ ∈ C, where C is a subset of C_b(S) such that for each open set U ⊆ S there is a sequence (φ_k) in C with φ_k(x) ↗ 1_U(x) for all x ∈ S. If S = ℝᵈ then for example the set C∞_comp(ℝᵈ) of all C∞-functions with compact support is such a set.

d) (∗) holds for all bounded Borel functions φ which are μ-a.s. continuous.

e) For each open U ⊆ S: lim inf_{n→∞} μ_n(U) ≥ μ(U).

f) For each closed F ⊆ S: lim sup_{n→∞} μ_n(F) ≤ μ(F).

g) For each G ∈ B with μ(∂G) = 0: μ_n(G) → μ(G).

Proof.
a) ⇒ b): trivial.
b) ⇒ c): Let U ⊆ S be open and define φ_k(x) := 1 ∧ k·dist(x, Uᶜ). Then φ_k ↗ 1_U, and the φ_k are Lipschitz since |φ_k(x) − φ_k(y)| ≤ k d(x, y) by the triangle inequality. Thus the bounded Lipschitz functions form a class C as required in c).

c) ⇒ e): Let an open set U be given and choose φ_k ∈ C such that φ_k ↗ 1_U. By assumption ∫φ_k dμ_n → ∫φ_k dμ, and part 1. of the previous lemma implies

\[
\liminf_{n\to\infty} \mu_n(U) = \liminf_{n\to\infty} \int 1_U\, d\mu_n \ge \int 1_U\, d\mu = \mu(U).
\]

e) ⇔ f): Simply pass to complements: F = Uᶜ, U = Fᶜ, μ_n(F) = 1 − μ_n(U) and μ(F) = 1 − μ(U).

f) ⇒ g): From Ḡ ⊇ G ⊇ G° and μ(Ḡ \ G°) = 0 we get μ(Ḡ) = μ(G) = μ(G°). Thus

\[
\limsup_n \mu_n(G) \le \limsup_n \mu_n(\bar G) \le \mu(\bar G) = \mu(G) = \mu(G^\circ) \le \liminf_n \mu_n(G^\circ) \le \liminf_n \mu_n(G).
\]

g) ⇒ d): Let φ be Borel with −C < φ < C. Define N = {x : φ is not continuous at x}. By assumption μ(N) = 0. We will approximate φ uniformly by functions φ_k of the form ∑_{i=1}^m a_i 1_{G_i} where μ(∂G_i) = 0. Then

\[
\int \varphi_k\, d\mu_n = \sum_i a_i\, \mu_n(G_i) \ \xrightarrow{\ g)\ }\ \sum_i a_i\, \mu(G_i) = \int \varphi_k\, d\mu,
\]

and part 2. of the previous lemma implies ∫φ dμ_n → ∫φ dμ, and we are done.
Let ε = 1/k. Choose a₁ < ⋯ < a_m such that a₁ ≤ −C, C < a_m and a_{i+1} − a_i < ε. Note that μ({φ = a}) > 0 only for countably many a (otherwise 1 = μ(S) ≥ ∑_a μ({φ = a}) = ∞). Therefore we can also require μ({φ = a_i}) = 0 for all i. Now define G_i = {x : a_i < φ(x) ≤ a_{i+1}}. Then |φ(x) − φ_k(x)| = |φ(x) − a_i| ≤ ε on G_i, thus the φ_k converge uniformly to φ.
It remains to show that μ(∂G_i) = 0: Let x ∈ Ḡ_i \ G_i°. Then x ∈ N or φ is continuous at x. In the latter case a_i ≤ φ(x) ≤ a_{i+1}, but because x ∉ G_i° we cannot have a_i < φ(x) < a_{i+1}. Therefore x ∈ N ∪ {φ = a_i} ∪ {φ = a_{i+1}}, which is a nullset.
d) ⇒ a): trivial.

Corollary 9.15 In the case S = ℝ the two definitions 9.2 and 9.8 of weak convergence of probability measures (and of convergence in law of random variables) are equivalent.

Proof. Assume that μ_n → μ in the sense of Definition 9.8. Let F_n and F be the corresponding distribution functions and let z be a point of continuity of F. Then the set G = (−∞, z] satisfies μ(∂G) = 0, and hence by condition g) in the Portmanteau theorem we get F_n(z) = μ_n(G) → μ(G) = F(z), i.e. the condition of Definition 9.2 holds.
Conversely suppose that F_n → F in the sense of Definition 9.2. According to Corollary 9.5 there are random variables X_n, X with these distribution functions such that X_n → X a.s., and hence in probability. Proposition 9.10 implies that X_n → X in law in the sense of Definition 9.8.

The following simple observation is extremely useful.

Theorem 9.16 (Mapping theorem) Let X_n, X be S-valued random variables such that X_n →ᴸ X. Let (S′, d′) be another metric space and f : S → S′ be Borel and P_X-a.s. continuous. Then f(X_n) →ᴸ f(X).

Proof. Let φ ∈ C_b(S′). Then φ ∘ f is bounded, Borel and P_X-a.s. continuous. Let μ_n = P_{X_n}, μ = P_X. By condition d) in the Portmanteau theorem we get

\[
E\bigl(\varphi(f(X_n))\bigr) = \int \varphi \circ f\, dP_{X_n} \to \int \varphi \circ f\, dP_X = E\bigl(\varphi(f(X))\bigr).
\]


Example 9.17 Let X_n, Y_n be real random variables such that P_{(X_n,Y_n)} → P_{(X,Y)} weakly and suppose P(Y = 0) = 0. Choose f continuous such that f(x, y) = x/y on ℝ × (ℝ \ {0}); then the mapping theorem implies

\[
\frac{X_n}{Y_n} \ \xrightarrow{\ L\ }\ \frac{X}{Y}.
\]

9.3 Convergence in total variation

We add a short section on a much stronger concept of convergence for probability distributions.
Definition 9.18 If P, Q are two probability measures on the same σ-algebra A, then

\[
\|P - Q\| = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|
\]

is called the total variation distance between P and Q.


The condition ‖P_n − P‖ → 0, i.e. convergence in total variation, implies of course set-wise convergence: P_n(A) → P(A) for all A ∈ A. For Borel probability measures on a metric space this is stronger than weak convergence, in view of part g) of Theorem 9.14.
Note that ‖P − Q‖ = 1 if P and Q are concentrated on disjoint sets; this applies in particular if P has a density and Q is a discrete distribution. For example in Example 9.12 the distributions μ_n converge to the uniform distribution weakly, but not in total variation.
Remark 9.19 Let P and Q have the densities f and g. Then the sup in Definition 9.18 is attained both at the set A⁺ = {f > g} and at the set A⁻ = {f < g}. Moreover

\[
\int |f - g|\, dx = 2\,\|P - Q\|.
\]

Proof. For every set A ∈ A we have

\[
|P(A) - Q(A)| = \Bigl| \int_{A \cap A^+} (f - g)\, dx - \int_{A \cap A^-} (g - f)\, dx \Bigr|
\le \max\Bigl( \int_{A^+} (f - g)\, dx,\ \int_{A^-} (g - f)\, dx \Bigr)
= \max\bigl( P(A^+) - Q(A^+),\ Q(A^-) - P(A^-) \bigr).
\]

Since

\[
\int_{A^+} (f - g)\, dx - \int_{A^-} (g - f)\, dx = 1 - 1 = 0
\qquad\text{and}\qquad
\int_{A^+} (f - g)\, dx + \int_{A^-} (g - f)\, dx = \int |f - g|\, dx,
\]

the result follows.

+

It is worthwile to note that pointwise convergence of densities to a density implies convergence in total variation of the corresponding distributions.
Theorem 9.20 (Lemma of Scheff) If Pn and P have densities fn and f and fn (x)
f (x) for each x, then Pn P in total variation, i.e.
Z
|fn (x) f (x)| dx 0.

Proof. Letting A+
n = {f > fn } we see by the above remark and dominated convergence
that
Z
+
+
kP Pn k = P (An ) Pn (An ) =
(f (x) fn (x))+ dx 0

and the assertion follows.
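By Remark 9.19 the total variation distance of two distributions with densities can be computed as half the L¹-distance of the densities. A small numerical sketch (the example distributions and function names are our own): for Exp(1) versus Exp(2) the set A⁺ = {f > g} is (ln 2, ∞), and the distance is P(A⁺) − Q(A⁺) = 1/2 − 1/4 = 1/4.

```python
import math

def tv_from_densities(f, g, lo, hi, n=100_000):
    """Midpoint-rule approximation of (1/2) * integral |f - g| dx,
    which equals ||P - Q|| when the mass outside [lo, hi] is negligible."""
    h = (hi - lo) / n
    return 0.5 * h * sum(
        abs(f(lo + (k + 0.5) * h) - g(lo + (k + 0.5) * h)) for k in range(n)
    )

f = lambda x: math.exp(-x)              # Exp(1) density on [0, inf)
g = lambda x: 2.0 * math.exp(-2.0 * x)  # Exp(2) density
tv = tv_from_densities(f, g, 0.0, 30.0)
# Analytically ||P - Q|| = 1/4 for this pair.
```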

Chapter 10

Characteristic Functions

10.1 Complex valued random variables

In this chapter we will use complex valued random variables. For such variables the expectation is defined as E(X + iY) := E(X) + i E(Y), where X and Y are real valued random variables. Most properties of the expectation carry over to this situation. In particular the formula |E(Z)| ≤ E(|Z|) and the dominated convergence theorem hold also for complex random variables.
Recall the following theorem about parameter-dependent expectations (see Theorem 4.23): If the function f(t, x) is continuously differentiable in t and if for all t ∈ (a, b) one has |∂f/∂t (t, X)| ≤ Y for some random variable Y with finite E(Y), then E(f(t, X)) is continuously differentiable in t ∈ (a, b) with

\[
\frac{d}{dt}\, E\bigl(f(t, X)\bigr) = E\Bigl(\frac{\partial}{\partial t} f(t, X)\Bigr).
\]

Theorem 10.1 The same is still true in the complex valued case, and also if the real variable t ∈ (a, b) is replaced by a complex variable z ∈ U ⊆ ℂ with U open and the differentiation is understood in the complex sense.

Proof. The first part is clear. For the complex variable situation it suffices to apply the real variable case to the partial derivatives and to note that the Cauchy-Riemann differential equations for f(·, x) carry over to the function E(f(·, X)).

10.2 The definition and first properties

Definition 10.2 If X is a random vector with values in ℝᵈ, then the function

\[
\varphi_X : \mathbb{R}^d \to \mathbb{C}, \qquad t = (t_1, \ldots, t_d) \mapsto E\bigl(e^{i\langle t, X\rangle}\bigr)
\]

is called the characteristic function of X. Similarly, if μ is a Borel probability distribution on ℝᵈ, then

\[
\varphi_\mu(t) := \int e^{i\langle t, x\rangle}\, d\mu(x)
\]

defines the characteristic function of μ.

Observe that the expectation formula gives

\[
\varphi_X(t) = \int e^{i\langle t, x\rangle}\, dP_X(x) = \varphi_{P_X}(t).
\]

Remark 10.3 If f ∈ L¹(ℝᵈ, λᵈ), then the Fourier transform Ff is defined by

\[
\mathcal{F}f(t) = (2\pi)^{-d/2} \int f(x)\, e^{i\langle t, x\rangle}\, dx.
\]

If X has the density f_X, then the characteristic function of X and the Fourier transform of f_X are related to each other via

\[
\varphi_X(t) = \int e^{i\langle t, x\rangle}\, f_X(x)\, dx = (2\pi)^{d/2}\, \mathcal{F}f_X(t).
\]
Here is a list of basic properties of the characteristic function.

Theorem 10.4 Let X be a random variable with values in ℝᵈ.

1. The characteristic function φ_X is continuous, |φ_X(t)| ≤ 1 and φ_X(0) = 1.

2. If E(‖X‖ᵏ) is finite for some k ∈ ℕ, then φ_X is k-times continuously differentiable. If d = 1 we have E(Xᵏ) = (−i)ᵏ φ_X⁽ᵏ⁾(0). In general we can compute the derivatives by

\[
\frac{\partial^k \varphi_X}{\partial^{j_1} t_1 \cdots \partial^{j_d} t_d}(t)
= i^k\, E\bigl(X_1^{j_1} \cdots X_d^{j_d}\, e^{i\langle t, X\rangle}\bigr)
\]

for j₁ + ⋯ + j_d = k.
Proof. 1. We have |φ_X(t)| = |E(e^{i⟨t,X⟩})| ≤ E(|e^{i⟨t,X⟩}|) = E(1) = 1 and φ_X(0) = E(e^{i⟨0,X⟩}) = E(1) = 1. Now let (t_n) be a sequence in ℝᵈ which converges to t. Then the sequence e^{i⟨t_n,X⟩} converges pointwise to e^{i⟨t,X⟩} and is dominated by |e^{i⟨t_n,X⟩}| = 1. Thus we can apply dominated convergence and get

\[
\lim_{n\to\infty} \varphi_X(t_n) = \lim_{n\to\infty} E\bigl(e^{i\langle t_n, X\rangle}\bigr) = E\bigl(\lim_{n\to\infty} e^{i\langle t_n, X\rangle}\bigr) = \varphi_X(t).
\]

2. First assume d = k = 1. Then we have |∂/∂t e^{itX}| = |iX e^{itX}| = |X|, and thus Theorem 10.1 implies

\[
\varphi_X'(t) = \frac{\partial}{\partial t}\, E\bigl(e^{itX}\bigr) = i\, E\bigl(X e^{itX}\bigr).
\]

Setting t = 0 yields φ′_X(0) = i E(X).
In the case k > 1 the assertion is proven by induction, using the fact that E(|X|ᵏ) < ∞ implies E(|X|ˡ) < ∞ for all l < k. For example for k = 2:

\[
\varphi_X''(t) = i\, \frac{\partial}{\partial t}\, E\bigl(X e^{itX}\bigr) = i^2\, E\bigl(X^2 e^{itX}\bigr).
\]

The case d > 1 is done similarly.

If the moment generating function M_X(λ) = E(e^{λX}) of a random variable exists (cf. Section 4.2), then the characteristic function can be considered as part of the analytic extension of M_X.

Proposition 10.5 Let d = 1 and ε > 0. Assume that M_X(λ) = E(e^{λX}) is finite for all λ ∈ (−ε, ε). Then the function M_X(z) := E(e^{zX}) is complex differentiable on the strip S := {z = λ + it : λ ∈ (−ε, ε), t ∈ ℝ}, and φ_X(t) = M_X(it).

Proof. Let z = λ + it ∈ S. Then

\[
E\Bigl( \Bigl| \frac{\partial}{\partial t} e^{zX} \Bigr| \Bigr)
= E\bigl( |X|\, e^{\lambda X} \bigr)
\le \frac{1}{\delta}\, E\bigl( e^{(\lambda+\delta)X} + e^{(\lambda-\delta)X} \bigr) < \infty
\]

for |λ| + δ < ε (as in the proof of Theorem 4.25; note |x| ≤ (e^{δx} + e^{−δx})/δ). This implies

\[
\frac{\partial}{\partial z}\, M_X(z) = \frac{\partial}{\partial z}\, E\bigl(e^{zX}\bigr) = E\bigl(X e^{zX}\bigr)
\]

in the strip S. The identity M_X(it) = E(e^{itX}) = φ_X(t) is trivial.

Example 10.6 Let X ∼ B(n, p). Then

\[
\varphi_X(t) = \sum_{k=0}^{n} e^{ikt}\, \binom{n}{k}\, p^k (1-p)^{n-k} = \bigl( p\, e^{it} + (1-p) \bigr)^n.
\]

Example 10.7 Let X ∼ Poisson(λ). Then

\[
\varphi_X(t) = \sum_{k=0}^{\infty} e^{ikt}\, e^{-\lambda}\, \frac{\lambda^k}{k!} = e^{-\lambda}\, e^{\lambda e^{it}} = e^{\lambda(e^{it}-1)}.
\]
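The closed form for the Poisson characteristic function can be checked against the defining series directly. A minimal sketch (function name is ours; the series is truncated, which is harmless since the terms decay factorially):

```python
import cmath
import math

def phi_poisson(t, lam, terms=60):
    """E(e^{itX}) for X ~ Poisson(lam), summed from the definition
    sum_k e^{ikt} * e^{-lam} * lam^k / k!, truncated after `terms` terms."""
    total, term = 0j, math.exp(-lam)   # term = e^{-lam} lam^k / k! at k = 0
    for k in range(terms):
        total += cmath.exp(1j * k * t) * term
        term *= lam / (k + 1)
    return total

lam, t = 3.0, 1.7
series = phi_poisson(t, lam)
closed_form = cmath.exp(lam * (cmath.exp(1j * t) - 1.0))
# The two values agree to within the truncation error.
```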

Example 10.8 Let X ∼ N(μ, σ²). Then M_X(λ) = e^{λμ + λ²σ²/2}, and thus

\[
\varphi_X(t) = e^{it\mu}\, e^{-\frac{t^2 \sigma^2}{2}}.
\]

In particular, if X ∼ N(0, 1), the characteristic function is e^{−t²/2}.

Recall that if X ∼ N(0, 1) and Y = σX + μ then Y ∼ N(μ, σ²); the last example shows φ_Y(t) = e^{itμ} φ_X(σt). This relationship can be extended:
Theorem 10.9 If X is a d-dimensional random vector and Y = AX + b, where A is a linear transformation ℝᵈ → ℝᵏ and b ∈ ℝᵏ, then

\[
\varphi_Y(s) = e^{i\langle s, b\rangle}\, \varphi_X(A^T s).
\]

Proof. For each s ∈ ℝᵏ we have

\[
\varphi_Y(s) = E\bigl(e^{i\langle s, Y\rangle}\bigr) = E\bigl(e^{i\langle s, AX+b\rangle}\bigr) = e^{i\langle s, b\rangle}\, E\bigl(e^{i\langle s, AX\rangle}\bigr) = e^{i\langle s, b\rangle}\, E\bigl(e^{i\langle A^T s, X\rangle}\bigr).
\]


Theorem 10.10 Let X and Y be independent ℝᵈ-valued random variables. Then

\[
\varphi_{X+Y}(t) = \varphi_X(t)\, \varphi_Y(t).
\]

Proof. The product rule (5.21) for expectations of independent random variables holds also for complex random variables, and thus

\[
\varphi_{X+Y}(t) = E\bigl(e^{i\langle t, X+Y\rangle}\bigr) = E\bigl(e^{i\langle t, X\rangle}\, e^{i\langle t, Y\rangle}\bigr) = E\bigl(e^{i\langle t, X\rangle}\bigr)\, E\bigl(e^{i\langle t, Y\rangle}\bigr).
\]

Example 10.11 Let X = (X₁, …, X_d) be a standard Gaussian vector, i.e. the X_i are independent identically N(0, 1)-distributed. Then Theorem 10.10 gives

\[
\varphi_X(t) = E\bigl(e^{i\langle t, X\rangle}\bigr) = E\bigl(e^{i \sum_{k=1}^{d} t_k X_k}\bigr) = \prod_{k=1}^{d} \varphi_{X_k}(t_k) = \prod_{k=1}^{d} e^{-\frac{t_k^2}{2}} = e^{-\frac{\|t\|^2}{2}}.
\]

10.3 The uniqueness theorem

The name characteristic function is motivated by the fact that a probability distribution is uniquely determined by its characteristic function. In a first step we prove this for distributions which are convolutions with a Gaussian law. In this case there is even an explicit inversion formula for the density in terms of the characteristic function.
Proposition 10.12 Let P be a probability distribution on ℝᵈ and let P_σ = P ∗ N(0, σ²I) = L(X + Z), where X, Z are independent with L(X) = P and L(Z) = N(0, σ²I), σ > 0. Then φ_{P_σ}(t) = φ_P(t) e^{−‖t‖²σ²/2}, and P_σ has the density

\[
f_\sigma(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \varphi_P(t)\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, e^{-i\langle x, t\rangle}\, dt.
\]

Proof. Let h be bounded measurable. Then

\[
\int_{\mathbb{R}^d} h(x)\, dP_\sigma(x)
= \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} h(y + t)\, P(dy)\, N(0, \sigma^2 I)(dt)
= \iint h(y + t)\, P(dy)\, \frac{1}{\sqrt{2\pi\sigma^2}^{\,d}}\, e^{-\frac{\|t\|^2}{2\sigma^2}}\, dt
= \int h(x) \int \frac{1}{\sqrt{2\pi\sigma^2}^{\,d}}\, e^{-\frac{\|x-y\|^2}{2\sigma^2}}\, P(dy)\, dx,
\]

where we substitute t := x − y. Thus the inner integral equals f_σ(x): using the identity

\[
\frac{1}{\sqrt{2\pi\sigma^2}^{\,d}}\, e^{-\frac{\|u\|^2}{2\sigma^2}}
= \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i\langle u, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, dt
\]

(a Gaussian integral, cf. Examples 10.8 and 10.11), we get

\[
f_\sigma(x) = \int \frac{1}{\sqrt{2\pi\sigma^2}^{\,d}}\, e^{-\frac{\|x-y\|^2}{2\sigma^2}}\, P(dy)
= \frac{1}{(2\pi)^d} \iint e^{-i\langle x-y, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, dt\, P(dy)
= \frac{1}{(2\pi)^d} \int e^{-i\langle x, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}} \int e^{i\langle y, t\rangle}\, P(dy)\, dt
= \frac{1}{(2\pi)^d} \int \varphi_P(t)\, e^{-i\langle x, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, dt.
\]

Theorem 10.13 Let P be a probability distribution on ℝᵈ. Then P is uniquely determined by φ_P. Moreover:

1. For each bounded and P-a.s. continuous function h we get

\[
\int h(x)\, dP(x) = \lim_{\sigma \to 0} \int f_\sigma(x)\, h(x)\, dx,
\]

where f_σ is given in terms of φ_P as in the proposition above.

2. If P has an almost surely continuous density f and if φ_P ∈ L¹(λᵈ), then

\[
f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \varphi_P(t)\, e^{-i\langle x, t\rangle}\, dt.
\]

Proof. 1. Clearly ∫f_σ(x)h(x)dx = E(h(X + σZ)), where X and Z are independent with L(X) = P and L(Z) = N(0, I). By dominated convergence this converges to E(h(X)) = ∫h(x)dP(x) as σ → 0. Uniqueness follows, since φ_P determines all f_σ and hence all integrals ∫h dP.



2. Now let f be a continuous density of P and let g be the density of N (0, 2 I). Then
Z

f (x) =
g (z)f (x z) dz = E(f (x Z)) f (x) as 0.

On the other hand


1
lim f (x) = lim
0
0 (2)d

P (t)ei<x,t> e

ktk2 2
2

1
dt =
(2)d

P (t)ei<x,t> dt

by dominated convergence since P is Lebesgue-integrable.

Example 10.14 Let d = 1 and let P be the two-sided exponential distribution, i.e. P has the density f(x) = ½ e^{−|x|}. We compute the characteristic function:

\[
\varphi_P(t) = \int \frac{1}{2}\, e^{-|x|}\, e^{ixt}\, dx
= \int \frac{1}{2}\, e^{-|x|}\, \frac{e^{ixt} + e^{-ixt}}{2}\, dx
= \int_0^{\infty} e^{-x} \cos(tx)\, dx,
\]

and integration by parts (twice) gives

\[
\int_0^{\infty} e^{-x} \cos(tx)\, dx = 1 - t^2 \int_0^{\infty} e^{-x} \cos(tx)\, dx.
\]

The last equation yields

\[
\varphi_P(t) = \int_0^{\infty} e^{-x} \cos(tx)\, dx = \frac{1}{1 + t^2}.
\]

This is integrable in t and we can apply the inversion formula of the theorem to get

\[
\frac{1}{2}\, e^{-|x|} = \frac{1}{2\pi} \int \frac{1}{1 + t^2}\, e^{-ixt}\, dt
= \frac{1}{2} \int \frac{1}{\pi(1 + t^2)}\, e^{-ixt}\, dt
= \frac{1}{2}\, \widehat{\mathrm{Cauchy}(1)}(-x).
\]

Thus, by symmetry, we also obtain the characteristic function of the Cauchy distribution: \(\widehat{\mathrm{Cauchy}(1)}(x) = e^{-|x|}\).
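The computation φ_P(t) = 1/(1 + t²) is easy to verify numerically: by the symmetry argument above the characteristic function reduces to a real cosine integral. A minimal sketch (function name and discretization are ours):

```python
import math

def phi_two_sided_exp(t, hi=50.0, n=50_000):
    """Midpoint-rule approximation of integral_0^inf e^{-x} cos(tx) dx,
    which equals the characteristic function of the density (1/2)e^{-|x|}.
    The tail beyond `hi` is negligible since e^{-50} is tiny."""
    h = hi / n
    return h * sum(
        math.exp(-(k + 0.5) * h) * math.cos(t * (k + 0.5) * h) for k in range(n)
    )

t = 2.0
approx = phi_two_sided_exp(t)
exact = 1.0 / (1.0 + t * t)   # = 0.2 here
```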
Remark 10.15 We see that e^{−|t|} and e^{−t²} = \(\widehat{N(0, 2)}\)(t) are both characteristic functions. One can show that e^{−|t|^α} is a characteristic function of a suitable distribution for each α ∈ (0, 2]. But for most α ≠ 1, 2 the density of the corresponding distribution (the α-stable symmetric distribution) is not known explicitly.

Remark 10.16 The uniqueness theorem implies that a multidimensional distribution P = L(X) on ℝᵈ is uniquely determined by its one-dimensional projections: It suffices to know the laws of all random variables of the form ⟨t, X⟩ = ∑_k t_k X_k, because

\[
\varphi_X(t) = \int_{\mathbb{R}^d} e^{i \sum_k t_k x_k}\, dP(x) = \int_{\mathbb{R}} e^{is}\, dP_{\sum_k t_k X_k}(s) = \varphi_{\sum_k t_k X_k}(1).
\]
P
Definition 10.17 A random vector X is called Gaussian, if all linear combinations ∑_k t_k X_k have a one-dimensional Gaussian distribution. In this case we call X₁, X₂, …, X_d jointly Gaussian.

Proposition 10.18
1. If X is Gaussian with expectation vector μ = E(X) = (E(X₁), …, E(X_d)) and covariance matrix C = (cov(X_k, X_l)), then φ_X(t) = e^{i⟨μ,t⟩} e^{−⟨Ct,t⟩/2}. In particular the law of a Gaussian vector is uniquely determined by μ and C. It is denoted by N(μ, C).
2. If X₁, …, X_d are independent and Gaussian, then they are jointly Gaussian.
3. If X₁, …, X_d are jointly Gaussian and uncorrelated, then they are independent.


Proof. 1. By assumption, for each t the real random variable ⟨t, X⟩ has a Gaussian law. Its expectation is ⟨t, μ⟩ by linearity, and by Theorem 7.17 its variance is equal to tᵀCt = ⟨Ct, t⟩. Thus, using Example 10.8,

\[
\varphi_X(t) = E\bigl(e^{i\langle t, X\rangle}\bigr) = \varphi_{\langle t, X\rangle}(1) = e^{i\langle \mu, t\rangle}\, e^{-\frac{\langle Ct, t\rangle}{2}}.
\]

2. Let t ∈ ℝᵈ be given and write μ_k = E(X_k), σ_k² = var(X_k). Then for all s ∈ ℝ one has

\[
\varphi_{\sum_k t_k X_k}(s) = \prod_{k} \varphi_{t_k X_k}(s) = e^{is \sum_k t_k \mu_k}\, e^{-\frac{s^2 \sum_k t_k^2 \sigma_k^2}{2}},
\]

which is the characteristic function of N(∑_k t_k μ_k, ∑_k t_k² σ_k²). Thus by uniqueness ∑_k t_k X_k has a Gaussian law.

3. If X = (X₁, …, X_d) ∼ N(μ, C) and if the X_i are uncorrelated, then C is a diagonal matrix, i.e. by steps 1. and 2. the joint law of the X_i is the same as that of independent Gaussian variables. So the X_i are independent.


10.4 Lévy's continuity theorem

The following theorem strengthens the uniqueness theorem considerably.


Theorem 10.19 (Lévy's continuity theorem) Let (P_n) and P be probability measures on ℝᵈ. Then P_n → P weakly if and only if φ_{P_n}(t) → φ_P(t) for all t ∈ ℝᵈ.

Proof. One direction is easy: Assume P_n → P weakly and choose an arbitrary t ∈ ℝᵈ. The function

\[
x \mapsto e^{i\langle t, x\rangle} = \cos\langle t, x\rangle + i \sin\langle t, x\rangle
\]

is bounded and continuous. Therefore

\[
\varphi_{P_n}(t) = \int e^{i\langle x, t\rangle}\, dP_n \to \int e^{i\langle x, t\rangle}\, dP = \varphi_P(t).
\]

Conversely assume that φ_{P_n}(t) → φ_P(t) for all t ∈ ℝᵈ. We recall from the previous section that for σ > 0 the law P_nᵟ := P_n ∗ N(0, σ²I) has the density

\[
f_n^\sigma(x) := \frac{1}{(2\pi)^d} \int \varphi_{P_n}(t)\, e^{-i\langle x, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, dt.
\]

The modulus of the integrand is bounded by e^{−‖t‖²σ²/2}, which is Lebesgue-integrable. Thus by dominated convergence f_nᵟ(x) converges for each x ∈ ℝᵈ to

\[
f^\sigma(x) := \frac{1}{(2\pi)^d} \int \varphi_P(t)\, e^{-i\langle x, t\rangle}\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, dt,
\]

the density of Pᵟ := P ∗ N(0, σ²I). This implies P_nᵟ → Pᵟ weakly: this follows from Scheffé's Lemma, but here is a simpler direct argument which does not use Section 9.3. Let U ⊆ ℝᵈ be open. With the help of Fatou's lemma we get

\[
\liminf_{n\to\infty} P_n^\sigma(U) = \liminf_{n\to\infty} \int 1_U(x)\, f_n^\sigma(x)\, dx
\ge \int \liminf_{n\to\infty} 1_U(x)\, f_n^\sigma(x)\, dx = \int 1_U(x)\, f^\sigma(x)\, dx = P^\sigma(U).
\]

Now the weak convergence P_nᵟ → Pᵟ follows with part e) of the Portmanteau theorem (9.14).


In order to conclude that P_n → P weakly, let h be a bounded Lipschitz-continuous function with Lipschitz constant L. Then, with X, Z independent, L(X) = P and L(Z) = N(0, I),

\[
\Bigl| \int h\, dP - \int h(x)\, f^\sigma(x)\, dx \Bigr|
= \bigl| E\bigl(h(X) - h(X + \sigma Z)\bigr) \bigr|
\le E\bigl(|h(X) - h(X + \sigma Z)|\bigr)
\le E\bigl(L \sigma \|Z\|\bigr) \le L \sigma \sqrt{E(\|Z\|^2)} = L \sigma \sqrt{d},
\]

since E(‖Z‖²) = E(∑_{i=1}^d Z_i²) = ∑_{i=1}^d 1 = d. Similarly |∫h dP_n − ∫h dP_nᵟ| ≤ √d σL for each n. Therefore

\[
\Bigl| \int h\, dP_n - \int h\, dP \Bigr| \le 2 L \sigma \sqrt{d} + \Bigl| \int h\, dP_n^\sigma - \int h\, dP^\sigma \Bigr|.
\]

Since P_nᵟ converges weakly to Pᵟ and σ can be chosen arbitrarily small, this implies P_n → P weakly by the Portmanteau theorem (9.14), part b).


Chapter 11

Central Limit Theorems

The first illustration of the power of characteristic functions is the ease with which they yield the central limit theorem (CLT). There are many versions of the CLT. Here we only prove one of them, together with an application, and give some hints at extensions.

11.1 The CLT for multidimensional iid sequences

Theorem 11.1 Let X₁, X₂, … be independent identically distributed random vectors with second moments. Let μ := E(X₁) ∈ ℝᵈ and C := cov(X₁) := (c_ij) with c_ij := cov(X₁ⁱ, X₁ʲ) be their common expectation vector and covariance matrix. Finally let S_n := ∑_{i=1}^n X_i ∈ ℝᵈ. Then

\[
\frac{S_n - n\mu}{\sqrt{n}}\ \xrightarrow{\ L\ }\ N(0, C).
\]

Proof. Let X_i′ := X_i − μ. Then E(X_i′) = 0 and S_n = ∑ X_i′ + nμ, so we can assume without loss of generality that μ = 0. By independence (Theorem 10.10)

\[
\varphi_{S_n/\sqrt{n}}(t) = \varphi_{\sum_{i=1}^{n} X_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) = \prod_{i=1}^{n} \varphi_{X_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) = \varphi_{X_1}\Bigl(\frac{t}{\sqrt{n}}\Bigr)^{\!n}.
\]

We calculate the derivatives of φ := φ_{X₁}:

\[
\frac{\partial \varphi}{\partial t_k}(0) = i\, E(X_1^k) = 0
\qquad\text{and}\qquad
\frac{\partial^2 \varphi}{\partial t_k \partial t_l}(0) = i^2\, E(X_1^k X_1^l) = -\mathrm{cov}(X_1^k, X_1^l) = -c_{kl}.
\]

The Taylor expansion

\[
\varphi(t) = \varphi(0) + \sum_k t_k\, \frac{\partial \varphi}{\partial t_k}(0) + \sum_{k,l} \frac{t_k t_l}{2}\, \frac{\partial^2 \varphi}{\partial t_k \partial t_l}(0) + o(\|t\|^2)
\]

gives

\[
\varphi_{S_n/\sqrt{n}}(t) = \varphi_{X_1}\Bigl(\frac{t}{\sqrt{n}}\Bigr)^{\!n} = \Bigl( 1 + 0 - \frac{1}{2n} \sum_{k,l} c_{kl}\, t_k t_l + \frac{r_n(t)}{n} \Bigr)^{\!n}
\]

with r_n(t) → 0. Since (1 + x/n + o(1/n))ⁿ → eˣ for all x ∈ ℝ, this implies

\[
\varphi_{S_n/\sqrt{n}}(t) \to e^{-\frac{1}{2}\langle Ct, t\rangle} = \varphi_Y(t),
\]

where Y ∼ N(0, C). Lévy's continuity theorem finishes the proof.
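As an illustration (not part of the text), the one-dimensional case of the theorem can be checked empirically: standardized sums of iid Uniform(0, 1) variables should hit any interval with roughly normal probability. The function name is ours; Φ(1) ≈ 0.8413 is the standard normal value.

```python
import math
import random

def standardized_sums(n_terms, n_reps, seed=3):
    """(S_n - n*mu) / (sigma * sqrt(n)) for iid Uniform(0,1) summands
    (mu = 1/2, sigma^2 = 1/12), repeated n_reps times."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    out = []
    for _ in range(n_reps):
        s = sum(rng.random() for _ in range(n_terms))
        out.append((s - n_terms * mu) / (sigma * math.sqrt(n_terms)))
    return out

vals = standardized_sums(50, 20_000)
frac_below_1 = sum(v <= 1.0 for v in vals) / len(vals)
# frac_below_1 should be close to Phi(1) ~ 0.8413.
```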


11.2 Fluctuation of the empirical distribution

Let X₁, X₂, … be independent identically distributed with values in an observation space (E, B) and law P. Let E = E₁ ∪̇ … ∪̇ E_d be a partition of E into d disjoint classes. Set ñ_k = |{i ≤ n : X_i ∈ E_k}|. Then ñ_k/n is the empirical frequency of the occurrence of the class E_k. In statistics the term

\[
D_n^2 := \sum_{k=1}^{d} \frac{(\tilde n_k - n p_k)^2}{n p_k}
\qquad\text{with } p_k = P(E_k)
\]

is used as a measure of how much the observation differs from the expected value.
Theorem 11.2 The law of the random variable D_n² converges weakly to a χ²_{d−1} distribution.

Proof. For i ∈ ℕ let Y_i be the k-th unit vector, where k ∈ {1, …, d} is the unique index with X_i ∈ E_k, i.e. Y_iᵏ = 1_{{X_i ∈ E_k}}. The sequence (Y_i) is an iid sequence of random vectors and ∑_{i=1}^n Y_i = (ñ₁, …, ñ_d). Moreover Y_iᵏ ∼ Bin(1, p_k) and p := E(Y_i) = (p₁, …, p_d). Then by Theorem 11.1 the law of

\[
\frac{\sum_{i=1}^{n} Y_i - np}{\sqrt{n}} = \Bigl( \frac{\tilde n_k - n p_k}{\sqrt{n}} \Bigr)_{k=1,\ldots,d}
\]

converges weakly to the Gaussian N(0, C)-distribution, where C is the covariance matrix of Y₁. Moreover

\[
D_n^2 = \Bigl\| W\, \frac{\sum_{i=1}^{n} Y_i - np}{\sqrt{n}} \Bigr\|^2
\qquad\text{where}\quad
W := \mathrm{diag}\Bigl( \frac{1}{\sqrt{p_1}}, \ldots, \frac{1}{\sqrt{p_d}} \Bigr).
\]

By the mapping principle of weak convergence L(D_n²) → L(‖WZ‖²), where Z ∼ N(0, C). To calculate C note that var(Y₁ᵏ) = p_k(1 − p_k) and cov(Y₁ᵏ, Y₁ˡ) = E(Y₁ᵏ Y₁ˡ) − E(Y₁ᵏ)E(Y₁ˡ) = −p_k p_l for k ≠ l. This yields

\[
C = \begin{pmatrix} p_1(1-p_1) & & -p_k p_l \\ & \ddots & \\ -p_k p_l & & p_d(1-p_d) \end{pmatrix}
= \mathrm{diag}(p_1, \ldots, p_d) - p^T p.
\]

The random vector WZ is Gaussian with expectation 0 and covariance matrix

\[
W C W^T = I - (Wp)^T\, Wp.
\]

The vector Wp has the form (√p₁, …, √p_d) and hence satisfies ‖Wp‖ = 1. Thus this covariance matrix is the matrix of the orthogonal projection onto the hyperplane (Wp)^⊥. Thus there exists an orthogonal matrix Q such that QWCWᵀQᵀ = diag(1, …, 1, 0). Since an orthogonal matrix does not change the norm we have ‖WZ‖² = ‖QWZ‖². Now cov(QWZ) = QWCWᵀQᵀ = diag(1, …, 1, 0) implies that QWZ is an N(0, diag(1, …, 1, 0))-vector. Thus the components of QWZ are independent, and the definition of the χ² distribution yields

\[
L(D_n^2) \to L(\|WZ\|^2) = L(\|QWZ\|^2) = L(\xi_1^2 + \cdots + \xi_{d-1}^2) = \chi^2_{d-1}
\]

with independent N(0, 1)-distributed random variables ξ_k.
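The limit can be sanity-checked by simulation, using the fact that a χ²_{d−1} distribution has mean d − 1 (indeed E(D_n²) = d − 1 holds exactly for every n, since E((ñ_k − np_k)²) = np_k(1 − p_k)). A minimal sketch with d = 3 classes (all names and parameters are our own choices):

```python
import random

def chi2_statistic(counts, probs, n):
    """D_n^2 = sum_k (n_k - n p_k)^2 / (n p_k)."""
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def simulate_mean_D2(probs, n=2_000, reps=2_000, seed=4):
    """Average of D_n^2 over many independent samples of size n."""
    rng = random.Random(seed)
    d, total = len(probs), 0.0
    for _ in range(reps):
        counts = [0] * d
        for _ in range(n):
            u, k, acc = rng.random(), 0, probs[0]
            while u > acc:          # pick the class of one observation
                k += 1
                acc += probs[k]
            counts[k] += 1
        total += chi2_statistic(counts, probs, n)
    return total / reps

mean_D2 = simulate_mean_D2([0.2, 0.3, 0.5])
# chi^2 with d - 1 = 2 degrees of freedom has mean 2.
```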

11.3 Outlook: triangular arrays

Let X₁, X₂, … be independent random variables, but either not necessarily with the same distribution, or iid but without variance. Is there a Central Limit Theorem in this situation? In the first case set s_n := √(∑_{i=1}^n var(X_i)) = σ(X₁ + ⋯ + X_n) and define

\[
X_{nj} := \frac{X_j - E(X_j)}{s_n}
\qquad\text{and}\qquad
S_n := \sum_{j=1}^{n} X_{nj}. \tag{11.1}
\]


Then E(Sn ) = 0 and var(Sn ) = 1, i.e. Sn is standardized. The following example shows
that in general Sn does not converge to N (0, 1) in distribution.
Example 11.3 Let p_i > 0 and define independent random variables

\[
X_i := \begin{cases} \pm\dfrac{1}{\sqrt{p_i}} & \text{with probability } \tfrac{1}{2} p_i \text{ each,} \\[2mm] 0 & \text{otherwise.} \end{cases}
\]

Thus E(X_i) = 0 and var(X_i) = E(X_i²) = ½p_i · (1/p_i) + ½p_i · (1/p_i) = 1, and s_n = √n. Choose the p_i such that ∑_{i=1}^∞ p_i < ∞. Then by Borel-Cantelli the series ∑_{i=1}^∞ X_i has almost surely only finitely many terms which are not 0. Therefore

\[
S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \to 0
\]

almost surely for n → ∞. So S_n does not converge to N(0, 1) in distribution.


Example 11.4 Let X_i be iid with a Cauchy(1)-distribution. Then for each n the average (1/n) ∑_{i=1}^n X_i also has a Cauchy(1)-law. This is easily checked by computing the characteristic function:

\[
\varphi_{\frac{1}{n}\sum_{i=1}^{n} X_i}(t) = \varphi_{\sum_{i=1}^{n} X_i}\Bigl(\frac{t}{n}\Bigr) = \Bigl( \varphi_{X_1}\Bigl(\frac{t}{n}\Bigr) \Bigr)^{\!n} = \bigl( e^{-|t/n|} \bigr)^n = e^{-|t|}.
\]

Therefore L((1/√n) ∑_{i=1}^n X_i) = L(√n X₁), so these laws asymptotically are spreading their mass far outside, and do not converge. So in this particular example (note that there is no finite variance) the proper normalization of ∑_{i=1}^n X_i in order to get a non-trivial limit distribution is no longer 1/√n but 1/n. In the case of infinite variance it is much harder to find the potential limit laws of properly normalized sums.
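The stability of the Cauchy law under averaging is easy to observe numerically: since the average of n Cauchy(1) variables is again Cauchy(1), the probability P(|average| > 1) stays at exactly 1/2 for every n, so no law of large numbers can hold. A minimal sketch using the quantile transform x = tan(π(u − 1/2)) to sample standard Cauchy variables (function name is ours):

```python
import math
import random

def cauchy_average(n, rng):
    """Average of n iid standard Cauchy variables, sampled via the
    quantile transform x = tan(pi * (u - 1/2)), u ~ Uniform(0, 1)."""
    return sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)) / n

rng = random.Random(5)
reps = 20_000
frac_large = sum(abs(cauchy_average(100, rng)) > 1.0 for _ in range(reps)) / reps
# Since the average is again Cauchy(1), P(|average| > 1) = 1/2 for every n.
```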
In this section we want to state without detailed proof a classical result which clarifies what is needed in order to avoid this phenomenon. The main idea in the Central Limit Theorem can be summarized as follows: The sum S_n = ∑_{i=1}^n X_{ni} is asymptotically Gaussian, assuming that

- for each n, the terms X_{ni}, 1 ≤ i ≤ n, are independent,
- none of them has exceptionally large influence, and
- the random variables have finite variance.
The second condition can be formalized e.g. as follows.

Definition 11.5 In a triangular array

\[
\begin{array}{ccccc}
X_{11} & \cdots & X_{1 k_1} & & \\
X_{21} & \cdots & \cdots & X_{2 k_2} & \\
\vdots & & & & \ddots
\end{array}
\]

of random variables the X_{nj} are asymptotically negligible, if for every ε > 0

\[
\max_{1 \le j \le k_n} P\bigl( |X_{nj}| > \varepsilon \bigr) \to 0
\]

as n → ∞.
In the case of finite variance we have the following basic result. The proof uses a careful analysis of the ingredients of the proof in the first section of this chapter, based on characteristic functions.

Theorem 11.6 Let X₁, X₂, … be independent with finite variance. Using the notation in (11.1), the following conditions are equivalent:

1. L(S_n) → N(0, 1) and the X_{nj} are asymptotically negligible.

2. Lindeberg condition: For each ε > 0

\[
L_n(\varepsilon) := \sum_{j=1}^{n} E\bigl( X_{nj}^2\, 1_{\{|X_{nj}| \ge \varepsilon\}} \bigr) \to 0
\]

as n → ∞.

Proof. 2. ⇒ 1. was shown by Lindeberg, the other direction by W. Feller. Confer [?].

Another condition, which is stronger than Lindeberg's condition, is Lyapunov's condition: There exists some δ > 0 such that

\[
\sum_{j=1}^{n} E\bigl( |X_{nj}|^{2+\delta} \bigr) \to 0
\]

as n → ∞. The CLT under Lyapunov's condition with δ = 1, i.e. under some assumption on the third moments, is known to follow from more elementary observations; for simplicity we refer to [?].
Another interesting question is the speed of the convergence. Here again the existence of
the third moment is useful.
Theorem 11.7 (Berry-Esseen) If the $X_i$ are independent identically distributed with $E(X_1) = 0$, $\sigma^2 := \operatorname{Var}(X_1) > 0$ and $\gamma :=
E(|X_1|^3) < \infty$, then for all $n \in \mathbb N$
\[
\sup_x \big| F_{S_n}(x) - \Phi(x) \big| \le 0.8\, \frac{\gamma}{\sigma^3 \sqrt n}.
\]
Proof. Again clever use of characteristic functions. The best known constant in the theorem
has improved over time. The value 0.8 is quoted from [?], see also [?], §15.
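The bound with the constant 0.8 can be checked exactly for fair coin flips, where $\sigma = \gamma = 1$. The following sketch (my own illustration, not from the text; `ks_to_normal` is a hypothetical helper name) computes the exact sup-distance between the distribution function of the normalized sum of $n$ symmetric $\pm 1$ variables and $\Phi$:

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ks_to_normal(n):
    """sup_x |F_{S_n}(x) - Phi(x)| for S_n = (X_1+...+X_n)/sqrt(n), X_i = +/-1 fair."""
    sup, cdf = 0.0, 0.0
    for j in range(n + 1):              # j = number of +1 outcomes
        x = (2 * j - n) / sqrt(n)       # corresponding value of S_n
        p = comb(n, j) / 2**n
        sup = max(sup, abs(phi(x) - cdf))   # just left of the jump at x
        cdf += p
        sup = max(sup, abs(cdf - phi(x)))   # at the jump
    return sup

# here sigma = gamma = 1, so the Berry-Esseen bound reads 0.8 / sqrt(n)
for n in [4, 16, 64, 256]:
    assert ks_to_normal(n) <= 0.8 / sqrt(n), n
```

The actual distance decays like $1/\sqrt{2\pi n}$ for this example, so the bound holds with room to spare.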


Chapter 12

Conditional Expectation
12.1 Prologue: Orthogonal projections in Hilbert space

Let $(H, \langle\cdot,\cdot\rangle)$ be a real Hilbert space. Of course, we shall think of $H = L^2(\Omega, \mathcal A, P)$ with
$\langle X, Y\rangle = E(XY)$. Conditional expectations are closely related to the idea of orthogonal
projections in a Hilbert space. This is a result from functional analysis but for the sake of
completeness we sketch the proof.
Theorem 12.1 Let $L$ be a closed linear subspace of $H$. Then there is a map $P_L : H \to L$,
which is called orthogonal projection, such that
1. $P_L x$ is the closest point to $x$ in $L$.
2. $P_L$ is linear.
3. $x - P_L x \perp y$ for all $y \in L$ (or equivalently $\langle y, x\rangle = \langle y, P_L x\rangle$ for all $y \in L$ and $x \in H$).
Proof. Let $x \in H$ be given and choose $z_n \in L$ such that $\|z_n - x\| \to \inf_{z \in L} \|z - x\| =: \delta$.
Apply the parallelogram identity
\[
\|\alpha + \beta\|^2 + \|\alpha - \beta\|^2 = 2(\|\alpha\|^2 + \|\beta\|^2)
\]
to $\alpha = \frac{z_n - x}{2}$ and $\beta = \frac{z_m - x}{2}$. Use $(z_n + z_m)/2 \in L$ and hence $\|\frac{z_n + z_m}{2} - x\| \ge \delta$ to show that
$\|\frac{z_n - z_m}{2}\|^2 \to 0$. Now the completeness of $H$ and the closedness of $L$ implies $z_n \to z$ for some
$z \in L$. Clearly, $z$ has the smallest distance to $x$ among all elements of $L$, and it is unique
with this property. We denote it by $P_L x$.
In order to show property 3. we calculate for a given $y \in L$
\[
\frac{d}{dt}\Big|_{t=0} \|x - (P_L x + ty)\|^2 = 0.
\]
This implies $\langle x - P_L x, y\rangle = 0$.
The linearity of $P_L$ is an easy consequence of property 3.
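In finite dimensions the orthogonal projection coincides with least squares, which makes the three properties easy to test numerically. A small sketch (my own, not from the text; the matrix `A` and the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))       # columns span a 3-dimensional subspace L of R^6
x = rng.normal(size=6)

c, *_ = np.linalg.lstsq(A, x, rcond=None)   # least squares <=> orthogonal projection
px = A @ c                                  # P_L x

# property 3: the residual x - P_L x is orthogonal to every generator of L
assert np.allclose(A.T @ (x - px), 0.0, atol=1e-10)

# property 1: P_L x is at least as close to x as 100 random points of L
for _ in range(100):
    y = A @ rng.normal(size=3)
    assert np.linalg.norm(x - px) <= np.linalg.norm(x - y) + 1e-12
```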

12.2 Heuristic definition of conditional expectation

Suppose $X$ and $Z$ are random variables on the same probability space. $E(X|Z)$, the conditional expectation of $X$ given $Z$, should be the expected value of $X$ under the condition
that we know the value of $Z$. The following list contains some properties which we expect
from a formal definition of conditional expectation:
• If $X$ and $Z$ are independent, then $E(X|Z) = E(X)$.
(Knowledge about $Z$ does not change the knowledge about $X$.)
• If $X = Z$, then $E(X|Z) = Z$. More generally: If $X = f(Z)$, then $E(X|Z) = f(Z)$.
• If $X = f(Z)X'$ with $X'$ independent of $Z$, then $E(X|Z) = f(Z)E(X')$.
(This and the preceding property reflect that $f(Z)$ behaves like a constant if $Z$ is
known.)
• If $g$ is bijective, then $E(X|Z) = E(X|g(Z))$.
(So what really matters is not the actual value of $Z$ but the sets which can be described
by $Z$, i.e. $E(X|Z)$ depends only on the $\sigma$-algebra generated by $Z$.)

Example 12.2 Here is a simple example in which we can guess the value of a conditional
expectation: Let $X_1$ and $X_2$ be the outcomes of two dice rolls: $\Omega := \{1,\dots,6\}^2$, $X_1(i,j) := i$,
$X_2(i,j) := j$. Define $Z := X_1 + X_2$. By symmetry we expect that $E(X_1|Z) = E(X_2|Z)$.
By linearity $E(X_1|Z) + E(X_2|Z) = E(X_1 + X_2|Z) = E(Z|Z) = Z$. Thus $E(X_1|Z) =
E(X_2|Z) = \frac{1}{2}Z$.
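The guess can be verified by brute-force enumeration of the 36 equally likely outcomes. A sketch of my own (the variable names are illustrative): on each block $\{Z = z\}$ the conditional expectation is just the block average of $X_1$.

```python
from fractions import Fraction
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # uniform, P = 1/36 each

totals = defaultdict(Fraction)   # sum of X1 over each block {Z = z}
counts = defaultdict(int)        # number of outcomes in each block
for i, j in omega:
    totals[i + j] += i
    counts[i + j] += 1

# E(X1 | Z = z) = block average of X1 over {Z = z}
cond_exp = {z: totals[z] / counts[z] for z in totals}
for z in range(2, 13):
    assert cond_exp[z] == Fraction(z, 2)    # confirms the guess E(X1|Z) = Z/2
```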

12.3 Formal definition of conditional expectation

Definition 12.3 Let $(\Omega, \mathcal A, P)$ be a probability space. Let $\mathcal G \subseteq \mathcal A$ be a sub-$\sigma$-algebra.
Let $X$ be either a non-negative random variable or in $L^1$. A random variable $Y$ is called
conditional expectation of $X$ given $\mathcal G$, if
• $Y$ is $\mathcal G$-measurable and
• $E(1_G X) = E(1_G Y)$ for all $G \in \mathcal G$.
In this case one writes $Y = E(X|\mathcal G)$ or $Y = E^{\mathcal G}(X)$ (almost surely).
Proposition 12.4 Conditional expectations are unique up to almost sure equality.
Proof. Suppose $Y$ and $Y'$ both satisfy the two conditions in the definition. Set $G := \{Y <
Y'\}$. Then $G \in \mathcal G$ and therefore
\[
E\big(1_G (Y' - Y)\big) = E(1_G Y') - E(1_G Y) = E(1_G X) - E(1_G X) = 0.
\]
Since $Y' - Y > 0$ on $G$ we have $1_G(Y' - Y) = 0$ almost surely. This implies $P(G) = 0$ and
thus $Y' \le Y$ almost surely. $Y \le Y'$ is shown analogously.
Example 12.5 Let $\Omega := \bigcup_{i=1}^\infty B_i$ with disjoint $B_i$ and $P(B_i) > 0$. Let
\[
\mathcal G := \sigma(\{B_i\}) = \Big\{ \bigcup_{i \in M} B_i : M \subseteq \{1, 2, \dots\} \Big\}.
\]
Thus a set $G$ belongs to $\mathcal G$ if and only if $G$ is the union of some of the $B_i$. Now define a random
variable
\[
Y(\omega) := \frac{1}{P(B_i)} E(1_{B_i} X) \qquad \text{if } \omega \in B_i.
\]
Then $Y$ is constant on each of the $B_i$. Therefore $Y$ is $\mathcal G$-measurable. If $G = \bigcup_{i \in M} B_i$, then
\[
E(1_G Y) = \sum_{i \in M} E(1_{B_i} Y) = \sum_{i \in M} P(B_i)\, \frac{1}{P(B_i)} E(1_{B_i} X) = \sum_{i \in M} E(1_{B_i} X) = E(1_G X).
\]
Therefore $E(1_G X) = E(1_G Y)$ for all $G \in \mathcal G$. Thus $Y = E(X|\mathcal G)$.


Proposition 12.6 If $\mathcal G$ is the $\sigma$-algebra generated by a partition $(B_i)_i$ of $\Omega$, then
\[
E(X|\mathcal G)(\omega) = \frac{1}{P(B_i)} E(1_{B_i} X) \qquad \text{for } \omega \in B_i.
\]


Example 12.7 We take our old example: Let $\Omega := \{1,\dots,6\}^2$ and define random variables
$X_1(i,j) := i$, $X_2(i,j) := j$ and $Z(i,j) = i + j$. We partition $\Omega$ according to the values of $Z$:
\[
\Omega = \bigcup_{k=2}^{12} \{Z = k\}
\]
and define
\[
\mathcal G := \sigma(Z) = \Big\{ \bigcup_{k \in M} \{Z = k\} : M \subseteq \{2,\dots,12\} \Big\} = \big\{ Z^{-1}(M) : M \subseteq \{2,\dots,12\} \big\}.
\]
Our old guess that $E(X_1|\mathcal G) = Z/2$, i.e. $E(X_1|\mathcal G) = k/2$ on $\{Z = k\}$, can now be checked for
each $k$, e.g. $k = 10$:
\[
E(X_1|\mathcal G)(\omega) = \frac{1}{P(Z = 10)}\, E\big(1_{\{Z=10\}} X_1\big) = \frac{1}{1/12} \Big( \frac{4}{36} + \frac{5}{36} + \frac{6}{36} \Big) = 5.
\]
Definition 12.8 If $\mathcal G = \sigma(Z)$ one also writes $E(X|Z)$ instead of $E(X|\mathcal G)$.
Proposition 12.9 Let $X, Z$ take countably many values $x_k, z_l$ and define $p_{k|l} = P(X =
x_k | Z = z_l)$ if $P(Z = z_l) > 0$. Then $E(X|Z)(\omega) = \sum_k x_k p_{k|l}$ if $Z(\omega) = z_l$.
Proof. Set $B_l = \{Z = z_l\}$ and use the previous proposition: We know that for $\omega \in B_l$
\[
E(X|Z)(\omega) = E(X|\mathcal G)(\omega) = \frac{1}{P(B_l)} E(1_{B_l} X) = \frac{1}{P(Z = z_l)} \sum_k E\big(1_{B_l \cap \{X = x_k\}}\, x_k\big)
= \sum_k x_k P(X = x_k | Z = z_l) = \sum_k x_k p_{k|l}.
\]

In particular $E(X|Z)$ is a function of $Z$ because it is constant on the sets $\{Z = z_l\}$. This is
a special case of the more general
Proposition 12.10 (Factorisation Lemma) If $\mathcal G = \sigma(Z)$ and $Y$ is $\mathcal G$-measurable extended real valued, then $Y = g(Z)$ for some measurable $g$.
Proof. If $Y$ is elementary, there exist $a_i \in \mathbb R$ and $G_i \in \mathcal G$ such that $Y = \sum a_i 1_{G_i}$. Since
$\mathcal G = \sigma(Z)$, we have $G_i = Z^{-1}(B_i)$ for some Borel sets $B_i$. Therefore
\[
Y = \sum_{i=1}^n a_i 1_{G_i} = \sum_{i=1}^n a_i 1_{B_i}(Z) = g(Z) \qquad \text{with } g = \sum_{i=1}^n a_i 1_{B_i}.
\]
If $Y$ is non-negative, there exist elementary random variables $Y_n$ with $Y_n \uparrow Y$ and measurable
functions $g_n$ such that $Y_n = g_n(Z)$. Define $g := \sup_n g_n$. Then $Y(\omega) = \sup_n Y_n(\omega) =
\sup_n g_n(Z(\omega)) = g(Z(\omega))$.
Finally, use the decomposition $Y = Y^+ - Y^-$ for arbitrary random variables.


12.4 Properties of conditional expectation

We can now verify that the formal definition of the previous section leads to a concept which
has the properties of the first section, and some other nice features.
Theorem 12.11
1. If $X$ is independent of $\mathcal G$ then $E(X|\mathcal G) = E(X)$ almost surely.
In particular $E(X|\{\emptyset, \Omega\}) = E(X)$.



2. If $X$ is $\mathcal G$-measurable then $E(X|\mathcal G) = X$ almost surely.
Proof. Clearly the constant random variable $E(X)$ is $\mathcal G$-measurable. If $G \in \mathcal G$ then $X$ and
$1_G$ are independent and
\[
E(1_G X) = E(1_G) E(X) = E\big(1_G\, E(X)\big).
\]
The second assertion follows trivially from the definition of conditional expectation and the
a.s. uniqueness.
Next we see that $\mathcal G$-measurable functions behave like constants for conditional expectations
with respect to $\mathcal G$:
Theorem 12.12 Let $R$ be $\mathcal G$-measurable and assume either $X \ge 0$, $R \ge 0$ or $X \in L^1$, $XR \in
L^1$. Then
1. $E(RX) = E\big(R\, E(X|\mathcal G)\big)$. In particular, taking $R = 1$ we get $E(X) = E\big(E(X|\mathcal G)\big)$.
2. $R\, E(X|\mathcal G) = E(RX|\mathcal G)$.
Proof.
1. If $R$ is an indicator function $1_G$ with $G \in \mathcal G$, the assertion follows from the
definition of $E(X|\mathcal G)$. By linearity the assertion holds also if $R$ is elementary and
$\mathcal G$-measurable, and by monotone convergence it holds for non-negative $R$ (choose elementary $R_n \uparrow R$). For general $R = R^+ - R^-$ use the linearity of conditional expectation
(see the next theorem).
2. Clearly $R\, E(X|\mathcal G)$ is $\mathcal G$-measurable and for $G \in \mathcal G$ we have by the first step
\[
E\big(1_G R\, E(X|\mathcal G)\big) = E(1_G R X) = E\big(1_G E(RX|\mathcal G)\big).
\]
But this is the defining property of $E(RX|\mathcal G)$.

The conditional expectation shares the following three basic properties of ordinary expectation.
Theorem 12.13
1. $X > X'$ a.s. implies $E(X|\mathcal G) > E(X'|\mathcal G)$ a.s., and similarly with $\ge$.
2. $E(aX + bX'|\mathcal G) = aE(X|\mathcal G) + bE(X'|\mathcal G)$ almost surely.
3. If $0 \le X_n \uparrow X$, then $E(X_n|\mathcal G) \uparrow E(X|\mathcal G)$ almost surely (monotone convergence for
conditional expectation).
Proof.
1. The first half of the proof of a.s. uniqueness (Proposition 12.4) proves the
assertion for $\ge$. Now assume $X > X'$ a.s. We already know $E(X|\mathcal G) \ge E(X'|\mathcal G)$ a.s.
Let $G = \{E(X|\mathcal G) = E(X'|\mathcal G)\}$. Then
\[
E(1_G X) = E\big(1_G E(X|\mathcal G)\big) = E\big(1_G E(X'|\mathcal G)\big) = E(1_G X').
\]
Since $X > X'$ a.s. we have necessarily $P(G) = 0$, which proves the assertion.
2. The right-hand side has the defining property of the left-hand side.
3. Let $Y := \sup_n E(X_n|\mathcal G)$; then $Y$ is $\mathcal G$-measurable and has the defining property of
$E(X|\mathcal G)$:
\[
E(1_G Y) = \sup_n E\big(1_G E(X_n|\mathcal G)\big) = \sup_n E(1_G X_n) = E(1_G X).
\]


Remark 12.14 Similarly one has a Lemma of Fatou and a theorem of dominated convergence for conditional expectation. They follow from monotone convergence precisely as in
Section 4.1.
For random variables in $L^2$ conditional expectation has an interpretation in terms of Hilbert
space projections.
Theorem 12.15 Let $\mathcal G \subseteq \mathcal A$ be a sub-$\sigma$-algebra. Then
1. The space $L^2_{\mathcal G} = \{[Y] \in L^2 : Y \text{ is } \mathcal G\text{-measurable}\}$ forms a closed linear subspace of the
Hilbert space $L^2$.
2. Let $X \in L^2$ and let $Y$ be a random variable such that $[Y] = P_{L^2_{\mathcal G}}([X])$ in the sense of
Theorem 12.1. Then $Y = E(X|\mathcal G)$ almost surely.
Proof.
1. Let $[Y_n] \in L^2_{\mathcal G}$ and assume $[Y_n] \to [Y] \in L^2$. Then $(Y_n)$ is Cauchy in
$L^2(\Omega, \mathcal G, P|_{\mathcal G})$ and by the completeness of this space there exists a $\mathcal G$-measurable $Y'$
with $\|Y_n - Y'\|_2 \to 0$ as $n \to \infty$. Thus $[Y] = [Y'] \in L^2_{\mathcal G}$.
2. By the definition of $P_{L^2_{\mathcal G}}$ we get $[X] - [Y] \perp [Z]$ for all $[Z] \in L^2_{\mathcal G}$ if $[Y] = P_{L^2_{\mathcal G}}[X]$. This
means $E((X - Y)Z) = 0$ for all $Z \in L^2(\Omega, \mathcal G, P)$, or $E(XZ) = E(YZ)$. We choose
$Z = 1_G$ for $G \in \mathcal G$ and get $Y = E(X|\mathcal G)$.
So in short, in $L^2$ the conditional expectation is the orthogonal projection onto $L^2_{\mathcal G}$. In particular it exists.
Corollary 12.16 Let $X \ge 0$ or $X \in L^1$. Then there is a conditional expectation $E(X|\mathcal G)$.
Proof. If $X$ is non-negative, we choose $X_n := \min(X, n) \in L^2$. Then $X_n \uparrow X$ and therefore
$E(X_n|\mathcal G)$, which exists by the previous theorem, converges to $E(X|\mathcal G)$ (and in particular this
expectation exists). In the general case decompose again $X = X^+ - X^-$.

For the next property of conditional expectation we need a small tool from real analysis.
Lemma 12.17 Let $I$ be an open real interval. For a convex function $\varphi : I \to \mathbb R$ there is a
sequence of points $t_n \in I$ and $c_n \in \mathbb R$ such that $\varphi(x) = \sup_n \big(\varphi(t_n) + c_n(x - t_n)\big)$ for all $x \in I$.
Proof. Convex functions are continuous in the interior of their domain. Moreover, the
convexity implies that for each $t \in I$ there exists $c(t)$ such that $\varphi(x) \ge \varphi(t) + c(t)(x - t)$
for all $x \in I$; in particular $c(t)$ is a sub-derivative at $t$.¹ Choose $(t_n)$ dense in $I$ and set
$\psi(x) := \sup_n \big(\varphi(t_n) + c(t_n)(x - t_n)\big)$. Clearly $\psi(x) \le \varphi(x)$. For the converse choose a sequence
$n_k$ such that $t_{n_k} \to x$. Then $(c(t_{n_k}))$ is bounded and thus by continuity we get
\[
\varphi(x) = \lim_k \varphi(t_{n_k}) = \lim_k \big(\varphi(t_{n_k}) + c(t_{n_k})(x - t_{n_k})\big) \le \psi(x).
\]
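The supporting-line representation is easy to visualize numerically. A sketch of my own (not from the text), for $\varphi(x) = x^2$ with sub-derivative $c(t) = 2t$ and a fine grid standing in for the dense sequence $(t_n)$:

```python
def phi(x):
    return x * x                  # a convex function on I = (-10, 10)

ts = [k / 100.0 for k in range(-1000, 1001)]   # stand-in for a dense sequence t_n

def psi(x):
    """Supremum of the supporting lines phi(t) + c(t)(x - t), c(t) = 2t."""
    return max(phi(t) + 2.0 * t * (x - t) for t in ts)

for x in [-3.7, -1.0, 0.0, 0.31, 2.5]:
    assert psi(x) <= phi(x) + 1e-12      # every supporting line lies below phi
    assert phi(x) - psi(x) < 1e-3        # the sup recovers phi up to grid error
```

Each line touches the parabola at its own $t$ and stays below it everywhere else; the supremum over a dense set of tangency points rebuilds $\varphi$.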

Theorem 12.18 (Jensen's inequality) Let $I$ be an open real interval. Let $\varphi : I \to \mathbb R$ be
convex and let $X$ be an $I$-valued random variable such that $X \in L^1$ and $\varphi(X) \in L^1$. Then
$E(X|\mathcal G) \in I$ a.s. and
\[
\varphi\big(E(X|\mathcal G)\big) \le E\big(\varphi(X)\,\big|\,\mathcal G\big).
\]
¹ If these general properties of convex functions are not obvious to the reader he/she may add them to the
assumptions on $\varphi$. In concrete applications $\varphi$ is even piece-wise smooth and these properties are immediate.


Proof. Part 1 of Theorem 12.13 implies that $E(X|\mathcal G) \in I$ a.s. The previous lemma yields
\[
E\big(\varphi(X)\,\big|\,\mathcal G\big) \ge \sup_n E\big(\varphi(t_n) + c_n(X - t_n)\,\big|\,\mathcal G\big)
= \sup_n \big(\varphi(t_n) + c_n(E(X|\mathcal G) - t_n)\big)
= \varphi\big(E(X|\mathcal G)\big).
\]

Corollary 12.19
For every $X \in L^1$ one has $|E(X|\mathcal G)| \le E(|X| \,|\, \mathcal G)$ and, more generally, if $p \ge 1$ and $X \in L^p$,
then $\|E(X|\mathcal G)\|_p \le \|X\|_p$.
Proof. The map $x \mapsto |x|^p$ is convex for $p \ge 1$, so by Jensen's inequality we get $|E(X|\mathcal G)|^p \le
E(|X|^p|\mathcal G)$; the first assertion follows, while the second one follows from $\|E(X|\mathcal G)\|_p^p =
E(|E(X|\mathcal G)|^p) \le E\big(E(|X|^p|\mathcal G)\big) = E(|X|^p) = \|X\|_p^p$.

One can take shortcuts when taking conditional expectations with respect to different
$\sigma$-algebras.
Theorem 12.20 (Tower property) If $X \in L^1$ and $\mathcal G \subseteq \mathcal H$ are sub-$\sigma$-algebras, then
\[
E\big(E(X|\mathcal H)\,\big|\,\mathcal G\big) = E(X|\mathcal G).
\]
Proof. Set $Y := E(E(X|\mathcal H)|\mathcal G)$. Then $Y$ is $\mathcal G$-measurable and for $G \in \mathcal G$
\[
E(1_G Y) = E\big(1_G E(X|\mathcal H)\big) = E(1_G X).
\]
Therefore $Y = E(X|\mathcal G)$.

12.5 Uniform integrability

In the next chapter we even consider conditional expectations of a random variable for
infinitely many $\sigma$-algebras simultaneously. In Theorem 12.25 below we show that the resulting family of random variables is uniformly integrable. This concept is useful also in other
contexts.
Definition 12.21 A family $(X_i)_{i \in I}$ of real random variables is called uniformly integrable,
if
1. $\sup_{i \in I} E(|X_i|) < \infty$ and
2. $\sup_{i \in I} E\big(|X_i|\, 1_{\{|X_i| \ge k\}}\big) \to 0$ as $k \to \infty$.
This concept allows an extension of the theorem of dominated convergence.
Theorem 12.22 Let $(X_n)_{n \in \mathbb N}$ be a uniformly integrable sequence. If $X_n$ converges to $X$
in distribution, then $E(X_n) \to E(X)$.
Proof. Define $\varphi_k(x) := \big((-k) \vee x\big) \wedge k$ and $a_{kn} := E(\varphi_k(X_n))$. For each $k$, $X_n \xrightarrow{\mathcal L} X$
implies that the $k$-th row of the matrix $(a_{kn})$ converges as $n \to \infty$:
\[
a_{kn} = E(\varphi_k(X_n)) \to E(\varphi_k(X)).
\]
Since $|\varphi_k(x)| \le |x|$, we get with the first part of Lemma 9.13 that
\[
E(|X|) \le \liminf_n E(|X_n|) < \infty.
\]

[Figure 12.1: Condition 2 for uniform integrability: $E\big(|X_n|\, 1_{\{|X_n| \ge k\}}\big)$ converges to 0 for $k \to \infty$.]
Therefore $E(X) = \lim_k E(\varphi_k(X))$, i.e. the row limits converge to $E(X)$. Similarly the
columns converge, even uniformly in $n$, since
\[
\sup_n |a_{kn} - E(X_n)| \le \sup_n E\big(|\varphi_k(X_n) - X_n|\big) \le \sup_n E\big(|X_n|\, 1_{\{|X_n| \ge k\}}\big) \to 0
\]
as $k \to \infty$. Therefore we can exchange the limits:
\[
E(X) = \lim_k E(\varphi_k(X)) = \lim_k \lim_n a_{kn} = \lim_n \lim_k a_{kn} = \lim_n E(X_n).
\]
Uniform integrability often is easily verified. Here are two sufficient criteria.
Proposition 12.23 A family of random variables $(X_i)_{i \in I}$ is uniformly integrable if either
• $|X_i| \le Y$ for all $i$ with a random variable $Y$ such that $E(|Y|) < \infty$, or
• $\sup_{i \in I} \|X_i\|_p < \infty$ for some $p > 1$.
Proof. Let us assume that $|X_i| \le Y$ for all $i$. Then $\sup_{i \in I} E(|X_i|) \le E(|Y|) < \infty$ and
\[
\sup_{i \in I} E\big(|X_i|\, 1_{\{|X_i| \ge k\}}\big) \le \sup_{i \in I} E\big(|Y|\, 1_{\{|X_i| \ge k\}}\big) \le E\big(|Y|\, 1_{\{|Y| \ge k\}}\big) \to 0
\]
as $k \to \infty$ by dominated convergence.
Now assume $\sup_{i \in I} \|X_i\|_p < \infty$. This implies
\[
\sup_{i \in I} E(|X_i|) = \sup_{i \in I} \|X_i\|_1 \le \sup_{i \in I} \|X_i\|_p < \infty.
\]
From Hölder's inequality follows furthermore
\[
\sup_{i \in I} E\big(|X_i|\, 1_{\{|X_i| \ge k\}}\big) \le \sup_{i \in I} \|X_i\|_p \, \big\| 1_{\{|X_i| \ge k\}} \big\|_q.
\]
The right-hand side converges to zero since
\[
\big\| 1_{\{|X_i| \ge k\}} \big\|_q = P(|X_i| \ge k)^{1/q} \le \Big( \frac{1}{k} \sup_{i \in I} E(|X_i|) \Big)^{1/q} \to 0
\]
by Markov's inequality.


Corollary 12.24 If $X_n \xrightarrow{\mathcal L} X$ and $\sup_n \|X_n\|_p < \infty$ with a $p > 0$, then $E(|X_n|^{p'}) \to E(|X|^{p'})$
for all $p' < p$.



Proof. Set $Y_n := |X_n|^{p'}$ and $Y := |X|^{p'}$. Then
\[
\sup_{n \in \mathbb N} \|Y_n\|_{p/p'} = \sup_n E\big( |X_n|^{p' \cdot p/p'} \big)^{p'/p} = \Big( \sup_n \|X_n\|_p \Big)^{p'} < \infty.
\]
Thus $(Y_n)$ is uniformly integrable and $E(Y_n)$ converges to $E(Y)$ by Theorem 12.22.

Here is the announced result on conditional expectations.


Theorem 12.25 Let $X \in L^1(P)$. Let $(\mathcal F_i)_{i \in I}$ be a family of $\sigma$-algebras and $X_i = E(X|\mathcal F_i)$.
Then $\{X_i\}_{i \in I}$ is uniformly integrable.
Proof. First observe: If $P(A_n) \to 0$ as $n \to \infty$, then $E(1_{A_n} X) \to 0$ as $n \to \infty$ (either
combine dominated convergence and Proposition 8.9, or Theorem 12.22 and Proposition
12.23). Therefore² for every $\varepsilon > 0$ there exists a $\delta > 0$ such that $P(A) < \delta$ implies
$E(1_A |X|) < \varepsilon$. Moreover by Jensen's inequality
\[
\sup_i E(|X_i|) \le \sup_i E\big(E(|X| \,|\, \mathcal F_i)\big) = E(|X|) < \infty.
\]
For $\varepsilon > 0$ choose $\delta$ as above and $k$ so large that $k > E(|X|)/\delta$. Then for all $i$ we get by Markov's
inequality $P(|X_i| > k) \le E(|X_i|)/k < \delta$ and
\[
E\big(|X_i| 1_{\{|X_i| > k\}}\big) \le E\big( E(|X| \,|\, \mathcal F_i)\, 1_{\{|X_i| > k\}} \big) = E\big(|X| 1_{\{|X_i| > k\}}\big) < \varepsilon.
\]
Therefore $\sup_i E\big(|X_i| 1_{\{|X_i| > k\}}\big) \to 0$ as $k \to \infty$.

² There is a converse of this statement: the theorem of Radon-Nikodym.

Chapter 13

Martingales in discrete time


13.1 Definition and examples

Definition 13.1 Let $(I, \le)$ be an ordered set, let $(\Omega, \mathcal A, P)$ be a probability space and let
$(X_i)_{i \in I}$ be a family of $\mathbb R$-valued integrable random variables on $(\Omega, \mathcal A, P)$.
1. A filtration $(\mathcal F_i)_{i \in I}$ of $(\Omega, \mathcal A, P)$ is an increasing family of sub-$\sigma$-algebras of $\mathcal A$, i.e.
$\mathcal F_i \subseteq \mathcal F_j \subseteq \mathcal A$ for all $i \le j$.
2. $(X_i)$ is an $(\mathcal F_i)$-martingale if $X_i = E(X_j|\mathcal F_i)$ for all $i \le j$.
3. Suppose $X_i$ is $\mathcal F_i$-measurable for all $i \in I$. Then $(X_i)$ is
• an $(\mathcal F_i)$-submartingale, if $X_i \le E(X_j|\mathcal F_i)$ for all $i \le j$,
• an $(\mathcal F_i)$-supermartingale, if $X_i \ge E(X_j|\mathcal F_i)$ for all $i \le j$.
In the sequel most of the time we consider index sets $I \subseteq \mathbb Z$, in particular $I = \{1,\dots,n\}$ or
$I = \mathbb N$. Very often we choose $\mathcal F_n = \sigma(Z_1,\dots,Z_n)$ for $n \in \mathbb N$ and some family $(Z_n)$, or even
$\mathcal F_n = \sigma(X_1,\dots,X_n)$. In the latter case $(\mathcal F_n)$ is called the natural filtration of $(X_n)$. If the
filtration is clear, it is omitted from the definition of (sub-/super-)martingales.
The simplest way to produce martingales is the following.
Theorem 13.2 For X L1 and an arbitrary filtration (Fi ) we define Xi := E(X|Fi ) for
i I. Then (Xi )iI is an (Fi )iI -martingale.
Proof. Use the tower property.

A frequent way in which submartingales arise is provided by Jensen's inequality.
Theorem 13.3 Let $\varphi : I \to \mathbb R$ be continuous and convex, let the $X_i$ be almost surely $I$-valued
and $X_i, \varphi(X_i) \in L^1$ for $i \in I$. Assume either
1. $(X_i)$ is a martingale, or
2. $(X_i)$ is a submartingale and $\varphi$ is increasing on the range of the $X_i$.
Then $(\varphi(X_i))$ is a submartingale.
Proof. Whenever $i \le j$ we have $E(\varphi(X_j)|\mathcal F_i) \ge \varphi(E(X_j|\mathcal F_i))$, and this is $= \varphi(X_i)$ in the
first case and $\ge \varphi(X_i)$ in the second case.

We now give a couple of examples.

Example 13.4
1. If $(X_i)$ is a martingale, then $(|X_i|)$ is a submartingale, and $(X_i^2)$ is a submartingale if
$E(X_i^2) < \infty$ for $i \in I$.
2. Set $X_1 := -2$, $X_2 := -1$, $X_3 := 0$. Then $(X_i)$ is a submartingale (for any filtration)
but $(X_i^2)$ is not. This illustrates the condition in Theorem 13.3 that $\varphi$ is increasing.
3. Let $X_1, X_2, \dots$ be independent with $E(X_i) = 0$ for $i \in \mathbb N$. Let $(\mathcal F_n)$ be the natural
filtration: $\mathcal F_n := \sigma(X_1,\dots,X_n)$. Define $S_n = \sum_{i=1}^n X_i$ for $n \in \mathbb N$. Then $(S_n)$ is an
$(\mathcal F_n)$-martingale. In fact by Theorem 12.11
\[
E(S_{n+1}|\mathcal F_n) = S_n + E(X_{n+1}|\mathcal F_n) = S_n + E(X_{n+1}) = S_n + 0 = S_n.
\]
The rest follows from the tower property (Theorem 12.20).
4. Let $X_1, X_2, \dots$ be independent with $E(X_i) = 1$ for $i \in \mathbb N$. Let $\mathcal F_n := \sigma(X_1,\dots,X_n)$
and define $M_n := \prod_{i=1}^n X_i$ for $n \in \mathbb N$. Then $(M_n)$ is an $(\mathcal F_n)$-martingale. Like in the
previous example we get
\[
E(M_{n+1}|\mathcal F_n) = M_n E(X_{n+1}|\mathcal F_n) = M_n E(X_{n+1}) = M_n \cdot 1 = M_n.
\]
5. Let $(\Omega, \mathcal A, P) := ([0,1], \mathcal B[0,1], \lambda)$ be the standard probability space on the unit interval. Let $I_{k,n} := [k \cdot 2^{-n}, (k+1) \cdot 2^{-n})$ for $k, n \in \mathbb N_0$, $k < 2^n$, and $I_{2^n,n} := \{1\}$.
Let
\[
\mathcal F_n := \sigma(\{I_{k,n} : k = 0,\dots,2^n\}) = \Big\{ \bigcup_{k \in K} I_{k,n} : K \subseteq \{0,\dots,2^n\} \Big\}.
\]
Then it is easy to see that $(\mathcal F_n)$ is a filtration, and for every $X \in L^1$ and all $s \in I_{k,n}$ we
have according to Example 12.5
\[
E(X|\mathcal F_n)(s) = \frac{1}{\lambda(I_{k,n})} E(1_{I_{k,n}} X) = 2^n \int_{k 2^{-n}}^{(k+1) 2^{-n}} X(t)\, dt.
\]
So this sequence of random variables is a martingale.
6. On the same probability space define
\[
\mathcal G_n := \{ G \in \mathcal B[0,1] : G = (G + 2^{-n}) \bmod 1 \}.
\]
The $\sigma$-algebra $\mathcal G_n$ consists of all Borel sets which are periodic with period $2^{-n}$. Then
for each $n \in \mathbb N$ and all $s \in [0,1]$
\[
E(X|\mathcal G_n)(s) = \frac{1}{2^n} \sum_{k=1}^{2^n} X\big((s + k \cdot 2^{-n}) \bmod 1\big).
\]
Note however that here $\mathcal G_{n+1} \subseteq \mathcal G_n$ and hence $(\mathcal G_n)$ is not a filtration with the index
set $\mathbb N$. But letting $\mathcal F_n$ be $\mathcal G_{-n}$ we get a filtration $(\mathcal F_n)_{n \in -\mathbb N}$ with the index set $-\mathbb N$:
indeed, if $n < m$ and $n, m \in -\mathbb N$ then $-m < -n$ and hence $\mathcal F_n \subseteq \mathcal F_m$. Therefore
$\big(E(X|\mathcal F_n)\big)_{n \in -\mathbb N}$ is a martingale.
Remark 13.5 Let us compare examples 5 and 6: In example 5, $E(X|\mathcal F_n)$ is a step function
which is a close approximation of $X$. If $X$ is a continuous function then it is easily verified
that for all $s \in [0,1]$
\[
E(X|\mathcal F_n)(s) \to X(s)
\]
as $n \to \infty$. In example 6, $E(X|\mathcal G_m)$ is an average of the values of $X$ at equidistant points
and hence an approximation of the Riemann integral, i.e. if $X$ is continuous then
\[
E(X|\mathcal G_m)(s) \to E(X) = \int_0^1 X(t)\, dt
\]


for all $s \in [0,1]$ as $m \to \infty$, or equivalently, $-m \to -\infty$. Both convergence statements stay
valid for all $X \in L^1$, at least for Lebesgue-almost all $s$. This will follow from the increasing
martingale convergence theorem in the case of example 5 and the decreasing martingale
convergence theorem in the case of example 6 (see Theorem 14.5, in particular Corollary
14.10, and Theorem 14.12).
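For continuous $X$ the two convergence statements can be seen numerically. A sketch of my own (not from the text), with the test function $X(t) = t^2$ and the helper names chosen for illustration:

```python
from math import floor

def X(t):
    return t * t          # a continuous test function on [0, 1]

def dyadic_cond_exp(s, n, grid=4096):
    """E(X|F_n)(s): average of X over the dyadic interval I_{k,n} containing s."""
    k = floor(s * 2**n)
    return sum(X((k + (i + 0.5) / grid) / 2**n) for i in range(grid)) / grid

def periodic_cond_exp(s, n):
    """E(X|G_n)(s): average of X over the shifts (s + k 2^{-n}) mod 1."""
    N = 2**n
    return sum(X((s + k / N) % 1.0) for k in range(N)) / N

s = 0.3
assert abs(dyadic_cond_exp(s, 10) - X(s)) < 1e-3         # example 5: -> X(s)
assert abs(periodic_cond_exp(s, 10) - 1.0 / 3.0) < 1e-3  # example 6: -> integral = 1/3
```

The dyadic block averages home in on the value of $X$ at $s$, while the periodic averages home in on $\int_0^1 X(t)\,dt = 1/3$.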

13.2 Martingales, games and optional sampling

Let $(\mathcal F_n)_{n \in \mathbb N_0}$ be a filtration and $(X_n)_{n \in \mathbb N_0}$ be a family of random variables such that $X_n$ is
$\mathcal F_n$-measurable. Think of $X_n - X_{n-1}$ as your net gain in the $n$-th game in a series of games
played at times $n = 1, 2, \dots$.
• $(X_n)$ is a martingale if and only if $E(X_n - X_{n-1}|\mathcal F_{n-1}) = 0$ (fair game).
• $(X_n)$ is a submartingale if and only if $E(X_n - X_{n-1}|\mathcal F_{n-1}) \ge 0$ (favorable game).
Definition 13.6 A sequence $(C_n)_{n \in \mathbb N}$ is called $(\mathcal F_n)$-previsible, if $C_n$ is $\mathcal F_{n-1}$-measurable
for $n \in \mathbb N$.
Think of $C_n$ as your stake in game $n$. You have to choose $C_n$ using the information available
at time $n-1$, i.e. just before the $n$-th game is played. Your gain on game $n$ is $C_n(X_n - X_{n-1})$
and your total gain up to time $n$ is
\[
(C \cdot X)_n := \sum_{k=1}^n C_k (X_k - X_{k-1}).
\]
For $n = 0$ we set $(C \cdot X)_n := 0$.


Definition 13.7 If $(C_n)_{n \in \mathbb N}$ is previsible, then $C \cdot X = \big((C \cdot X)_n\big)_{n \in \mathbb N_0}$ is called the
martingale transform of $(X_n)$ by $(C_n)$.
Theorem 13.8 (You can't beat the system!) Let $(C_n)$ be bounded, non-negative and
previsible and let $(X_n)$ be a (sub-)martingale. Then $C \cdot X$ is a (sub-)martingale.
Proof. Set $Y := C \cdot X$. Then $Y_n - Y_{n-1} = C_n(X_n - X_{n-1})$ and
\[
E(Y_n - Y_{n-1}|\mathcal F_{n-1}) = E\big(C_n(X_n - X_{n-1})\,\big|\,\mathcal F_{n-1}\big) = C_n E(X_n - X_{n-1}|\mathcal F_{n-1})
\begin{cases} = 0 & \text{if $(X_n)$ is a martingale} \\ \ge 0 & \text{if $(X_n)$ is a submartingale.} \end{cases}
\]
This implies the assertion.
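The fairness of any previsible betting scheme against a fair game can be checked by exact enumeration of all coin paths. A sketch of my own (the "bet 2 after a net loss, else bet 1" rule is a hypothetical previsible strategy, not from the text):

```python
from itertools import product
from fractions import Fraction

def strategy(history):
    """Previsible stake for the next game: any function of the past outcomes."""
    return 2 if sum(history) < 0 else 1    # hypothetical rule, illustrative only

n = 8
expected_gain = Fraction(0)
for signs in product([-1, 1], repeat=n):        # all coin paths, each of prob 2^-n
    gain = 0
    for k in range(n):
        gain += strategy(signs[:k]) * signs[k]  # C_k (X_k - X_{k-1}), C_k previsible
    expected_gain += Fraction(gain, 2**n)

assert expected_gain == 0   # a bounded previsible strategy cannot beat a fair game
```

Replacing `strategy` by any other function of the past leaves the expected gain at 0, exactly as Theorem 13.8 asserts.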

Definition 13.9 A map $T : \Omega \to \mathbb N_0 \cup \{+\infty\}$ is an $(\mathcal F_n)_{n=0,1,\dots}$ stopping time if $\{T \le n\} \in
\mathcal F_n$ for each $n \in \mathbb N_0$.
The decision to stop at (immediately after) time $T$ can be taken on the basis of observations
up to time $T$.
Example 13.10 Entrance time of the process $X_n$ into $B \in \mathcal B$: Let $\mathcal F_n := \sigma\{X_1,\dots,X_n\}$.
Then
\[
T = \begin{cases} \min\{k : X_k \in B\} & \\ +\infty & \text{if there is no such } k \end{cases}
\]
defines a stopping time. In contrast, the time of the last visit, e.g. $T' := \max\{n \le 17 : X_n \in
B\}$, is typically not a stopping time, since $\{T' \le 5\}$ depends on the behaviour of $X_6, \dots$.


Definition 13.11 Let $T$ be a stopping time. $X^T$ is the process defined by $X^T_n = X_{T \wedge n}$, i.e.
$X^T_n(\omega) = X_{T(\omega) \wedge n}(\omega)$.
Corollary 13.12 If $X$ is a supermartingale and $T$ is a stopping time, then the stopped
process $X^T$ is also a supermartingale. The same holds for martingales and submartingales.
Proof. $1_{[0,T]}(n, \omega) = 1_{\{n \le T\}}(\omega) = 1_{\{T \le n-1\}^c}(\omega)$. Therefore $\big(1_{\{n \le T\}}\big)_{n \in \mathbb N}$ is previsible.
Observe
\[
X_{T \wedge n} = X_0 + \sum_{k=1}^{T \wedge n} (X_k - X_{k-1}) = X_0 + \big(1_{[0,T]} \cdot X\big)_n,
\]
so we can apply the previous theorem with $C = 1_{[0,T]}$. In particular we get $E(X_{T \wedge n}) \le
E(X_0)$ for supermartingales, $E(X_{T \wedge n}) = E(X_0)$ for martingales and $E(X_{T \wedge n}) \ge E(X_0)$ for
submartingales.

Example 13.13 (Doubling Strategy) Let $Y_1, Y_2, \dots$ be iid with $P(Y_n = 1) = P(Y_n =
-1) = \frac{1}{2}$, $X_n = \sum_{i=1}^n Y_i$, $X_n - X_{n-1} = Y_n$. Let $Y_n = 1$ indicate a win at the $n$-th game, and
$Y_n = -1$ a loss. $(X_n)$ is a martingale. Put the stake 1 on the first game. If you win stop, else
double the stake. This leads to
\[
C_k = \begin{cases} 2^{k-1} & \text{if } Y_1 = Y_2 = \dots = Y_{k-1} = -1 \\ 0 & \text{otherwise.} \end{cases}
\]
Consider the stopping time $T = \min\{n : Y_n = 1\}$. Then
\[
(C \cdot X)_n = \begin{cases} -2^0 - 2^1 - \dots - 2^{T-2} + 2^{T-1} = +1 & \text{if } T \le n \\ -2^0 - 2^1 - \dots - 2^{n-1} = -2^n + 1 & \text{if } T > n. \end{cases}
\]
The theorem shows that $E\big((C \cdot X)_n\big) = E\big((C \cdot X)_0\big) = 0$. In this particular case this
result can also be understood directly: We have $P(T > n) = 2^{-n}$ and $P(T \le n) = 1 - 2^{-n}$.
Therefore
\[
E(\text{total gain}) = 1 \cdot (1 - 2^{-n}) - (2^n - 1)\, 2^{-n} = 0.
\]
Note that in this game $P(T < \infty) = 1$: If we are allowed to wait arbitrarily long until
the win really occurs, we are sure to win: we have $(C \cdot X)_T = 1$ a.s., in particular surprisingly
$E\big((C \cdot X)_T\big) = 1 \ne 0 = E\big((C \cdot X)_0\big)$ even though $C \cdot X$ is a martingale.
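The two expectations in the doubling example can be computed exactly by conditioning on the first winning time $T$ (a sketch of my own, not from the text):

```python
from fractions import Fraction

n = 10
expectation = Fraction(0)
# first win at game k (probability 2^{-k}): the stakes 1, 2, ..., 2^{k-2} are lost
# and 2^{k-1} is won, so the net gain is -(2^{k-1} - 1) + 2^{k-1} = +1
for k in range(1, n + 1):
    expectation += Fraction(1, 2**k) * 1
# no win in the first n games (probability 2^{-n}): net gain 1 - 2^n
expectation += Fraction(1, 2**n) * (1 - 2**n)

assert expectation == 0            # E((C.X)_n) = 0 for every fixed n
# but P(T <= N) = 1 - 2^{-N} -> 1, so waiting for T always yields gain +1
assert 1 - sum(Fraction(1, 2**k) for k in range(1, 60)) == Fraction(1, 2**59)
```

The rare huge loss $1 - 2^n$ exactly cancels the frequent gain $+1$ at every finite horizon; only at the random time $T$ does the cancellation disappear.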
The next theorem gives a condition under which the surprise of the last example can be
avoided.
Theorem 13.14 (Doob's optional sampling theorem) Let $X$ be a submartingale and
let $T$ be a stopping time such that $T < \infty$ almost surely and $(X_{T \wedge n})_{n=1,2,\dots}$ is uniformly integrable. Then $E(X_T) \ge E(X_0)$. The analogue holds for supermartingales and martingales.
Proof. $T \wedge n = T$ almost surely for eventually all $n$ and thus $X_{T \wedge n} \xrightarrow{a.s.} X_T$, and Theorem
12.22 implies $E(X_T) = \lim_n E(X_{T \wedge n}) \ge E(X_0)$.

Corollary 13.15 The conclusion holds under each of the following conditions on $T$ and $X$:
1. $T$ is bounded;
2. $X$ is bounded and $T < \infty$ almost surely;
3. $E(T) < \infty$ and there is some $k < \infty$ such that $|X_n - X_{n-1}| \le k$ for all $n$.
Proof. In each case we verify uniform integrability of $(X_{T \wedge n})$:
1. $|X_{T \wedge n}| \le \sum_{k=1}^M |X_k| =: Y$ where $T \le M$, $M \in \mathbb N$.
2. $|X_{T \wedge n}| \le C$ for all $n$ if $|X_k| \le C$ for all $k$.
3. $|X_{T \wedge n}| = \big|X_0 + \sum_{k=1}^{T \wedge n} (X_k - X_{k-1})\big| \le |X_0| + (T \wedge n) k \le |X_0| + kT \in L^1$.
Thus in all cases $(X_{T \wedge n})$ is pointwise bounded by some $Y \in L^1$ and we can apply Proposition
12.23.

This can be easily extended to inequalities between $E(X_S)$ and $E(X_T)$ for two stopping
times, e.g.
Remark 13.16 If $S, T$ are bounded stopping times with $S \le T$ almost surely and $(M_i)$ is a
submartingale, then $E(M_S) \le E(M_T)$.
Proof. $1_{(S,T]} = 1_{[0,T]} - 1_{[0,S]}$ is previsible, so $\big(1_{(S,T]} \cdot M\big)_n = M_{T \wedge n} - M_{S \wedge n}$ is a submartingale.
For $S \le T \le n$ it follows that
\[
E(M_T) - E(M_S) = E\big((1_{(S,T]} \cdot M)_n\big) \ge E\big((1_{(S,T]} \cdot M)_0\big) = 0.
\]


13.3 Doob's $L^p$-inequality

Theorem 13.17 (Doob's maximal inequality for submartingales) Let $M_1, \dots, M_n$ be
a submartingale. Then for $\lambda > 0$
\[
P\big( \max_{1 \le i \le n} M_i \ge \lambda \big) \le \frac{1}{\lambda} \int_{\{\max M_i \ge \lambda\}} M_n \, dP \quad \Big( \le \frac{1}{\lambda} E(M_n) \text{ if } M_n \ge 0 \Big).
\]
Proof. Let
\[
T = \begin{cases} \min\{i : M_i \ge \lambda\} & \\ n+1 & \text{if no such } i \text{ exists.} \end{cases}
\]
This is a stopping time and
\[
P(\max_i M_i \ge \lambda) = P(M_T \ge \lambda,\, T \le n) = \sum_{i=1}^n P(M_i \ge \lambda, T = i)
\le \frac{1}{\lambda} \sum_{i=1}^n E\big(1_{\{T=i\}} M_i\big) \le \frac{1}{\lambda} \sum_{i=1}^n E\big(1_{\{T=i\}} M_n\big)
= \frac{1}{\lambda} E\big(M_n 1_{\bigcup_{i=1}^n \{T=i\}}\big) = \frac{1}{\lambda} \int_{\{\max M_i \ge \lambda\}} M_n \, dP.
\]
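As a sanity check (my own, not from the text), the maximal inequality can be tested by Monte Carlo for the nonnegative submartingale $M_i = |S_i|$ built from a simple random walk:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 100_000, 50, 10.0
steps = rng.choice([-1, 1], size=(m, n))
S = np.cumsum(steps, axis=1)          # m independent simple random walks
M = np.abs(S)                         # |S_i| is a nonnegative submartingale

lhs = lam * np.mean(M.max(axis=1) >= lam)   # lambda * P(max_i M_i >= lambda)
rhs = np.mean(M[:, -1])                     # E(M_n)
assert lhs <= rhs + 0.05   # Doob: lambda P(max M_i >= lambda) <= E(M_n), up to MC error
```

For these parameters the left-hand side comes out near 3 and the right-hand side near $\sqrt{2n/\pi} \approx 5.6$, so the bound holds with a clear margin.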

Theorem 13.18 (Doob's $L^p$-inequality) If $(M_i)_{0 \le i \le n}$ is a nonnegative submartingale
or a martingale in $L^p$, $p > 1$, then
\[
E\Big( \max_{0 \le i \le n} |M_i|^p \Big) \le \Big( \frac{p}{p-1} \Big)^p E(|M_n|^p).
\]
Proof. Set $U = \max_i |M_i|$, $V = |M_n|$. Then
\[
\|U\|_p^p = E(U^p) = \int_0^\infty P(U^p > c)\, dc = \int_0^\infty P(U > c^{1/p})\, dc
\]
and with the maximal inequality
\[
\int_0^\infty P(U > c^{1/p})\, dc \le \int_0^\infty c^{-1/p} \int_{\{U \ge c^{1/p}\}} V \, dP \, dc
= \int \! V \int_0^{U^p} c^{-1/p}\, dc \, dP = \frac{p}{p-1} E(V U^{p-1}).
\]
Continue with Hölder's inequality:
\[
\frac{p}{p-1} E(V U^{p-1}) \le \frac{p}{p-1} \|V\|_p\, \|U^{p-1}\|_{p/(p-1)} = \frac{p}{p-1} \|V\|_p\, \|U\|_p^{p-1}.
\]
Therefore $\|U\|_p \le \frac{p}{p-1} \|V\|_p$ and we are done.

Chapter 14

Martingale convergence
Martingales (and sub- resp. super-martingales) are useful both for computing expectations
and because of their good asymptotic behaviour. This chapter deals with the second issue.

14.1 Doob's upcrossing inequality for submartingales

Definition 14.1 Let $x_1, x_2, \dots, x_n$ be a finite sequence of real numbers and let $a < b$. The
number of upcrossings of the interval $(a,b)$ by $(x_i)$ is defined as the maximal number $k$ for
which there are indices $i_1 < \dots < i_{2k}$ such that $x_{i_{2l-1}} \le a$ and $x_{i_{2l}} \ge b$ for all $l = 1,\dots,k$.
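The greedy scan that waits for the sequence to dip to $\le a$ and then to rise to $\ge b$ realizes this maximal number. A small sketch of my own (`upcrossings` is an illustrative helper name):

```python
def upcrossings(xs, a, b):
    """Number of upcrossings of (a, b) by the finite sequence xs (Definition 14.1)."""
    count, below = 0, False
    for x in xs:
        if not below and x <= a:
            below = True           # waiting to move from <= a up to >= b
        elif below and x >= b:
            count += 1             # one completed upcrossing
            below = False
    return count

assert upcrossings([0, 2, 0, 2, 0], 0.5, 1.5) == 2
assert upcrossings([3, 2, 1, 0], 0.5, 1.5) == 0       # only downward movement
assert upcrossings([0, 1, 0, 1, 2], 0.5, 1.5) == 1    # must dip to <= a before rising
```

Taking each crossing at the earliest opportunity can only leave more room for later crossings, which is why the greedy count is maximal.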
Theorem 14.2 Let $(M_i)_{1 \le i \le n}$ be a submartingale and let $a < b$. Then the number $N(a,b)$
of upcrossings of $(a,b)$ by $(M_i)_{1 \le i \le n}$ satisfies
\[
E\big(N(a,b)\big) \le \frac{1}{b-a} E\big((M_n - a)^+\big).
\]
Proof. Let $M_i' = \max(M_i, a)$. Then $(M_i')$ is a submartingale by Theorem 13.3, since the
function $x \mapsto \max(x, a)$ is convex and increasing. If the assertion holds for $(M_i')$ then also
for $M$. Therefore we may assume $M_i \ge a$. Now define $T_0 = 0$ and recursively
\[
T_{2k-1} = \begin{cases} \min\{i \ge T_{2k-2} : M_i = a\} & \\ n & \text{if no such } i \text{ exists,} \end{cases}
\qquad
T_{2k} = \begin{cases} \min\{i \ge T_{2k-1} : M_i \ge b\} & \\ n & \text{if no such } i \text{ exists.} \end{cases}
\]
Then $0 < T_1 \le \dots \le T_n$. This sequence of random times is strictly increasing until it
becomes stationary. Since the values are integers between 1 and $n$ we have necessarily
$T_{n+1} = T_n = n$. Therefore
\[
M_n = M_{T_1} + \sum_{k=1}^{[n/2]} \big( M_{T_{2k}} - M_{T_{2k-1}} \big) + \sum_{k=1}^{[n/2]} \big( M_{T_{2k+1}} - M_{T_{2k}} \big).
\]
Each upcrossing corresponds to a unique term in the first sum. Moreover every term in
the first sum is non-negative. Thus the first sum is $\ge (b-a) N(a,b)$. The second sum
has nonnegative expectation by Remark 13.16, since the $T_i$ are stopping times. Taking
expectations we get
\[
(b-a)\, E\big(N(a,b)\big) \le E\Big( \sum_{k=1}^{[n/2]} \big( M_{T_{2k}} - M_{T_{2k-1}} \big) \Big) \le E(M_n - M_{T_1})
= E\big((M_n - a) 1_{\{T_1 < n\}}\big) \le E\big((M_n - a)^+\big),
\]
since $M_n \ge a$. This concludes the proof.

Remark 14.3 There are also maximal inequalities and upcrossing inequalities for supermartingales. The proofs are similar but slightly more difficult.

14.2 Increasing martingale convergence

The following result is the key to the a.s. behaviour of martingales. It is formulated in such
a way that it can also be used in continuous time (which we do not cover in this course).
Theorem 14.4 Let $U \subseteq \mathbb R$ be a countable set. Let $(\mathcal F_t)_{t \in U}$ be a filtration and let $(M_t)_{t \in U}$
be a submartingale with $\sup_{t \in U} E(M_t^+) < \infty$. Then there is a set $\Omega_0 \in \mathcal A$ such that
$P(\Omega_0) = 1$ and for $\omega \in \Omega_0$ the following holds: For every monotone (decreasing or increasing)
sequence $(t_n)_{n \in \mathbb N}$ in $U$ the limit $\lim_{n} M_{t_n}(\omega)$ exists in $[-\infty, \infty]$.
Proof. Since $U$ is countable we can find a sequence $(F_k)_{k=1,2,\dots}$ of finite subsets of $U$ such
that $F_1 \subseteq F_2 \subseteq \dots$ and $U = \bigcup_{k=1}^\infty F_k$. Fix $a < b$. Let $N(a,b) = \sup_k N_{F_k}(a,b)$ where $N_{F_k}$
is the number of upcrossings of $(a,b)$ by $(M_i)_{i \in F_k}$. Then
\[
E\big(N(a,b)\big) = \sup_k E\big(N_{F_k}(a,b)\big) \le \frac{1}{b-a} \sup_{t \in U} E\big((M_t - a)^+\big) < \infty
\]
since $E\big((M_t - a)^+\big) \le E(M_t^+) + |a|$. In particular the random variables $N(a,b)$ are a.s.
finite. Now let
\[
\Omega_0 = \{\omega \in \Omega : N(a,b)(\omega) < \infty \text{ for all } a, b \in \mathbb Q\}.
\]
Then $P(\Omega_0) = 1$. Fix $\omega \in \Omega_0$ and let $(t_n)$ be a monotone sequence in $U$. Assume that
$\lim_n M_{t_n}(\omega)$ does not exist in $[-\infty, \infty]$. Then the sequence $(M_{t_n}(\omega))_n$ has at least two
accumulation points. Therefore there exist two rational numbers $a < b$ such that the
sequence $(M_{t_n}(\omega))$ has infinitely many upcrossings of $(a,b)$. Thus $N(a,b)(\omega) = +\infty$, so
$\omega \notin \Omega_0$, a contradiction. So $(M_{t_n}(\omega))_n$ converges to an extended real number.
Theorem 14.5 (Increasing (sub)martingale convergence) Let $(\mathcal F_n)_{n=1,2,\dots}$ be a filtration and let $(M_n)_{n=1,2,\dots}$ be a submartingale with $\sup_n E(M_n^+) < \infty$. Then the limit
$\lim_n M_n =: M_\infty$ exists almost surely and is integrable; in particular it is a.s. finite.
Proof. Let $U = \mathbb N$; then by the last theorem the limit $M_\infty$ exists almost surely in $[-\infty, +\infty]$.
In order to show that it is integrable, note first that
\[
E(M_\infty^+) = E(\liminf_n M_n^+) \le \liminf_n E(M_n^+) < \infty
\]
by the Lemma of Fatou. Thus $M_\infty^+$ is finite almost surely. On the other hand the sequence
$(E(M_n))_{n=1,2,\dots}$ is nondecreasing by the submartingale property and bounded from above
by $\sup_n E(M_n^+)$. Therefore $\sup_n E(M_n^-) = \sup_n \big(E(M_n^+) - E(M_n)\big) < \infty$ and hence, again using
Fatou's Lemma, we get $E(M_\infty^-) < \infty$.


Corollary 14.6 Every nonnegative supermartingale $(M_n)_{n \in \mathbb N}$ converges almost surely and
$M_n \ge E(M_\infty|\mathcal F_n)$ for each $n$.
Proof. $(-M_n)$ is a nonpositive submartingale, so $\sup_n E\big((-M_n)^+\big) = 0 < \infty$. Hence $M_n$
converges almost surely to $M_\infty$. Moreover for each $n$ and every $A \in \mathcal F_n$ we get by Fatou's
Lemma
\[
\int_A M_\infty \, dP = \int_A \lim_m M_m \, dP \le \liminf_m \int_A M_m \, dP \le \int_A M_n \, dP.
\]
Therefore $M_n \ge E(M_\infty|\mathcal F_n)$ almost surely.

14.3 Uniformly integrable martingales

Definition 14.7 An $(\mathcal F_i)$-martingale $(M_i)_{i \in I}$ is called closed by $X \in L^1$ if $M_i = E(X|\mathcal F_i)$
for all $i \in I$.
Theorem 14.8 Let $(M_n)$ be a martingale for the filtration $(\mathcal F_n)_{n=1,2,\dots}$. The following are
equivalent:
1. $(M_n)$ is bounded in $L^1$ and closed by $M_\infty$,
2. $(M_n)$ is closed,
3. $(M_n)$ is uniformly integrable,
4. $M_n \to M_\infty$ in $L^1$.
Proof.
1. $\Rightarrow$ 2. Obvious.
2. $\Rightarrow$ 3. If $M_n = E(X|\mathcal F_n)$ then the last theorem implies 3.
3. $\Rightarrow$ 4. If $(M_n)$ is uniformly integrable then it is bounded in $L^1$ and hence by Theorem 14.5
$M_n \to M_\infty$ almost surely and $M_\infty \in L^1$. The sequence $(|M_n - M_\infty|)$ is also uniformly
integrable and hence $E(|M_n - M_\infty|) \to E(0) = 0$ by Theorem 12.22.
4. $\Rightarrow$ 1. Let $A \in \mathcal F_n$. Then $E(1_A M_n) = \lim_{m \to \infty} E(1_A M_m) = E(1_A M_\infty)$. Thus $M_n =
E(M_\infty|\mathcal F_n)$, i.e. $(M_n)$ is closed by $M_\infty$.
Example 14.9 We give an example of a martingale which does not satisfy the conditions
of the theorem. As in Example 13.4, 5., consider the dyadic filtration $\mathcal F_n = \sigma(\{I_{k,n} : 0 \le
k \le 2^n\})$ where $I_{k,n}$ denotes the interval $[k 2^{-n}, (k+1) 2^{-n})$ on the standard probability
space $([0,1], \mathcal B([0,1]), \lambda^1)$. Define the random variable $M_n$ by $M_n(x) = 2^n 1_{I_{0,n}}(x)$. Then
$E(M_{n+1}|\mathcal F_n) = M_n$ since
\[
\int_{I_{k,n}} M_{n+1}(x)\, dx = \int_{I_{k,n}} M_n(x)\, dx = \begin{cases} 1 & k = 0 \\ 0 & k > 0. \end{cases}
\]
Therefore $(M_n)$ is a martingale and $M_n \to M_\infty = 0$ almost surely. But clearly $M_n$ does
not converge to 0 in $L^1$ since $E(M_n) = 1$ for all $n$. (Intuitively in this example $M_n$ is the
conditional expectation of the point mass in 0. But this mass does not correspond to a
random variable on our probability space since this mass is living on a null-set for Lebesgue
measure.)
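The escaping mass is easy to see in an exact computation (a sketch of my own, not from the text):

```python
from fractions import Fraction

def M(n, x):
    """M_n = 2^n * 1_{[0, 2^{-n})} on the unit interval (the martingale of Example 14.9)."""
    return 2**n if 0 <= x < Fraction(1, 2**n) else 0

# E(M_n) = 2^n * lambda([0, 2^{-n})) = 1 for every n ...
for n in range(1, 30):
    assert Fraction(2**n) * Fraction(1, 2**n) == 1

# ... yet for each fixed x > 0 the sequence M_n(x) is eventually 0
x = Fraction(3, 10)
assert [M(n, x) for n in range(2, 8)] == [0, 0, 0, 0, 0, 0]
```

The unit of expectation sits on an interval of length $2^{-n}$ that shrinks to the Lebesgue-null set $\{0\}$, which is exactly why the a.s. limit 0 fails to close the martingale.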
Corollary 14.10 Let X ∈ L¹ and let F1 ⊂ F2 ⊂ ... be a filtration. Let F∞ = σ(⋃_{n=1}^∞ Fn).
Then E(X|Fn) → E(X|F∞) almost surely and in L¹.
Proof. The martingale (Mn) defined by Mn := E(X|Fn) is closed. Therefore by Theorem
14.8 it converges to M := lim_{n→∞} Mn almost surely and in L¹. It remains to prove that
E(X|F∞) = M.
Choose A ∈ ⋃_n Fn, e.g. A ∈ F_{n₀}. Then E(1A X) = E(1A Mn) for all n ≥ n₀. The L¹-
convergence implies E(1A X) = E(1A M). The system of all events A with
E(1A X) = E(1A M)
is a monotone class; it contains the ∩-stable system ⋃_n Fn, hence it contains F∞. Therefore
M = E(X|F∞) since M is F∞-measurable as the limit of a sequence of F∞-measurable
random variables. □
Example 14.11 As in Example 13.4 5 and in Example 14.9 let again (Fn) be the dyadic
filtration on [0,1]. Then F∞ = B[0,1] because every open set is a union of countably many
dyadic intervals. Let X ∈ L¹([0,1], λ¹). Then X = E(X|F∞), and hence the dyadic step
functions E(X|Fn) (which we computed in Example 13.4 5) converge λ-a.s. and in L¹ to
X, as was announced in Remark 13.5.
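This L¹-convergence can be checked numerically. The following sketch (an illustration only; the interval averages defining the dyadic step functions are themselves approximated by a midpoint rule) does this for X(t) = √t:

```python
import math

def dyadic_condexp(f, n, t):
    # E(f|F_n)(t): the average of f over the dyadic interval of order n
    # containing t, approximated by a midpoint rule with 200 nodes
    k = min(int(t * 2 ** n), 2 ** n - 1)
    a, h = k / 2 ** n, 1.0 / 2 ** n
    m = 200
    return sum(f(a + (j + 0.5) * h / m) for j in range(m)) / m

def l1_distance(f, n, grid=2048):
    ts = [(j + 0.5) / grid for j in range(grid)]
    return sum(abs(f(t) - dyadic_condexp(f, n, t)) for t in ts) / grid

errs = [l1_distance(math.sqrt, n) for n in (1, 3, 5, 7)]
assert errs[0] > errs[1] > errs[2] > errs[3]   # the L1 distance decreases ...
assert errs[3] < 0.01                          # ... towards 0
```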
CHAPTER 14. MARTINGALE CONVERGENCE

14.4 Decreasing martingale convergence
Theorem 14.12 (Decreasing martingale convergence) Let F1 ⊃ F2 ⊃ ... be a
decreasing filtration. Let F∞ = ⋂_{n∈N} Fn and let (Mn)n=1,2,... be an (Fn)-submartingale.
Then
1. lim_{n→∞} Mn exists almost surely.
2. If (Mn) is a martingale, then Mn converges in L¹ to M∞ = E(M1|F∞).
Proof.
1. sup_n E(Mn⁺) ≤ E(M1⁺) < ∞ because (Mn⁺) is a submartingale. Therefore
lim_{n→∞} Mn(ω) exists almost surely by Theorem 14.4.
2. The martingale (Mn) is closed by M1 and hence it is uniformly integrable. As in the
proof of Theorem 14.5 the a.s. limit M∞ is a.s. finite. Therefore the a.s. convergence
implies convergence in L¹. The limit M∞ is F∞-measurable since for each n it is the
limit of the Fn-measurable sequence (Mm)m≥n. Moreover E(1A M∞) = lim_{n→∞} E(1A Mn) =
E(M1 1A) for all A ∈ F∞. Thus M∞ = E(M1|F∞). □
Corollary 14.13 Let X1, X2, ... be a sequence of independent random variables. Then
$$\lim_{n\to\infty} E(X \mid X_{n+1}, X_{n+2}, \dots) = E(X)$$
a.s. and in L¹ for every X ∈ L¹.
Proof. Let Fn = σ{Xn+1, Xn+2, ...}. Decreasing martingale convergence implies the
assertion with E(X|F∞) instead of E(X). It remains to prove
E(X|F∞) = E(X) a.s.
Kolmogorov's 0-1 law (Theorem 5.28) says that the asymptotic σ-algebra F∞ is trivial, and
Remark 5.30 implies the assertion. □
We announced in Remark 13.5 that some Riemann type sums converge to the Lebesgue
integral even for very irregular functions, at least for almost all shifted dyadic grids.
Corollary 14.14 Let f be a Lebesgue-integrable Borel function on [0,1]. Then for Lebesgue-
almost all s ∈ [0,1]
$$\lim_{n\to\infty} 2^{-n} \sum_{k=1}^{2^n} f\big(s + k2^{-n} \bmod 1\big) = \int_0^1 f(t)\,dt.$$
Proof. We continue with Example 13.4, 6, i.e. we consider the σ-algebra
Fn = {B ⊂ [0,1] : B ∈ B([0,1]), B = B + 2^{-n} mod 1}
of all Borel sets which are periodic with period 2^{-n}. As was noted there, for each n the
average on the left-hand side is equal to E(f|Fn)(s). In view of Corollary 14.13 it suffices to
verify that
Fn = σ{Xn+1, Xn+2, ...}
where Xn(s) is the n-th dyadic digit of s, because these Xn are independent random variables
on the Lebesgue probability space on [0,1]. In this equality of σ-algebras the inclusion "⊃"
is easy to see. Conversely, due to periodicity every event B ∈ Fn is uniquely determined by
the intersection B ∩ [0, 2^{-n}) and this set is in the σ-algebra generated by the restriction of
Xn+1, Xn+2, ... to [0, 2^{-n}) for the same reason why all dyadic intervals generate the Borel
sets on [0,1], cf. Example 14.11. This concludes the proof. □
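A numerical illustration of Corollary 14.14 (a sketch; the test function f(t) = t^{-1/4} is unbounded near 0 but integrable, and the shift s is a random, hence "typical", point):

```python
import random

random.seed(1)

def shifted_dyadic_average(f, s, n):
    # 2^-n * sum_{k=1}^{2^n} f(s + k 2^-n mod 1)
    N = 2 ** n
    return sum(f((s + k / N) % 1.0) for k in range(1, N + 1)) / N

f = lambda t: t ** (-0.25) if t > 0 else 0.0
integral = 4.0 / 3.0                       # = int_0^1 t^(-1/4) dt
s = random.random()                        # a typical shift
assert abs(shifted_dyadic_average(f, s, 18) - integral) < 0.02
```

Note that a fixed shift such as s = 0 would place a grid point arbitrarily close to the singularity; the "almost all s" in the corollary is not decorative.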

14.5 Laws of large numbers via martingale theory

In this last section we deduce two different versions of the strong law of large numbers for
independent random variables from martingale theory. For the first version we use decreasing
martingale convergence and for the second version increasing martingale convergence. Both
versions were established originally by Kolmogorov using different proofs. These results gave
a strong motivation for the creation of martingale theory by Doob.
Theorem 14.15 Let X1, X2, ... be independent random variables. Then they satisfy the
strong law of large numbers, i.e. almost surely
$$\frac1n \sum_{i=1}^{n} \big(X_i - E(X_i)\big) \to 0,$$
under each of the following conditions:
1. The Xi are in L¹ and identically distributed, or
2. The Xi are in L² with
$$\sum_{i=1}^{\infty} \frac{\operatorname{var}(X_i)}{i^2} < \infty.$$

Proof. In both cases we may and shall assume E(Xi) = 0. In the first case the idea is to
find F1 ⊃ F2 ⊃ ... such that
$$\frac1n \sum_{i=1}^{n} X_i = E(X_1 \mid \mathcal F_n)$$
for each n. Then (1/n) Σ_{i=1}^n Xi converges almost surely and in L¹ by Theorem 14.12. The limit
random variable is measurable for the asymptotic σ-algebra of X1, X2, ... and hence almost
surely equal to its expectation which is equal to 0 by L¹-convergence, cf. the proof of
Corollary 14.13.
A possible choice for Fn is
Fn = σ{X1 + ... + Xn, Xn+1, Xn+2, ...}.
It is easy to see as in our introductory Example 12.2 that
$$\frac1n \sum_{i=1}^{n} X_i = E(X_1 \mid X_1 + \dots + X_n).$$
Then one can apply Lemma 14.16 below to X = X1, Y = X1 + ... + Xn, and Z =
(Xn+1, Xn+2, ...) to see that even (1/n) Σ_{i=1}^n Xi = E(X1|Fn). This completes the first
proof.
For the second part let X1, X2, ... be independent such that Σ_{i=1}^∞ var(Xi)/i² < ∞. Then
Mn = Σ_{i=1}^n Xi/i defines an (Fn)-martingale where Fn = σ{X1, ..., Xn}. This martingale is
bounded in L² since
$$\sup_n E(M_n^2) = \sup_n \sum_{i=1}^{n} \frac{\operatorname{var}(X_i)}{i^2} < \infty.$$
In particular (Mn) is uniformly integrable (cf. Proposition 12.23) and lim_{n→∞} Σ_{i=1}^n Xi/i
exists almost surely. Therefore the assertion follows from Kronecker's Lemma below. □
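The strong law is easy to watch numerically. A minimal Monte Carlo sketch for case 1, with uniform(−1,1) summands so that E(Xi) = 0:

```python
import random

random.seed(2)

n_max = 200_000
s = 0.0
late_avgs = []
for i in range(1, n_max + 1):
    s += random.uniform(-1.0, 1.0)     # iid, centered, var = 1/3
    if i > n_max // 2:
        late_avgs.append(s / i)        # running averages along one trajectory

# along the second half of the trajectory the averages are already close to 0
assert max(abs(a) for a in late_avgs) < 0.02
```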
Lemma 14.16 Let X, Y, Z be three random variables (with possibly general state spaces)
on the same probability space. If Z is independent of (X, Y) then E(X|Y, Z) = E(X|Y).
Proof. Let B ∈ σ(Y) and C ∈ σ(Z) be given. Then X·1B and E(X|Y)·1B are independent
of Z and therefore
$$E(X\,1_{B\cap C}) = E(X\,1_B)\,P(C) = E\big(E(X|Y)\,1_B\big)\,P(C) = E\big(E(X|Y)\,1_{B\cap C}\big).$$
Since the sets of the form B ∩ C generate σ(Y, Z) the assertion follows. □

Lemma 14.17 (Kronecker's Lemma) Let (xn) and (βn) be two sequences of real numbers
such that βn ↑ ∞. If Σ_{i=1}^∞ xi/βi converges then (1/βn) Σ_{i=1}^n xi → 0 as n → ∞.
Proof. Let sn = Σ_{i=1}^n xi/βi. Then sn → s as n → ∞. Summation by parts gives
$$\sum_{i=1}^{n+1} x_i = \sum_{i=1}^{n+1} \beta_i (s_i - s_{i-1}) = -\sum_{i=1}^{n} s_i(\beta_{i+1} - \beta_i) + \beta_{n+1}\, s_{n+1}.$$
Thus
$$\frac{1}{\beta_{n+1}} \sum_{i=1}^{n+1} x_i = -\sum_{i=1}^{n} \gamma_{ni}\, s_i + s_{n+1}, \qquad \text{where } \gamma_{ni} = \frac{\beta_{i+1} - \beta_i}{\beta_{n+1}} \text{ for } i \le n.$$
Now Σ_{i=1}^n γni = 1 − β1/βn+1 → 1 and γni ≥ 0 for all i. Therefore the sum Σ_{i=1}^n γni si,
being an asymptotic average of the si, converges to s = lim_{n→∞} sn, which proves the
lemma. □
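A quick numeric check of Kronecker's Lemma with x_i = (−1)^i and β_i = i: the series Σ (−1)^i/i converges (to −ln 2), so the averages must vanish:

```python
import math

series = sum((-1) ** i / i for i in range(1, 100_000))   # converges to -ln 2
assert abs(series + math.log(2.0)) < 1e-4

def average(n):
    # (1/beta_n) * sum_{i<=n} x_i  with  beta_i = i  and  x_i = (-1)^i
    return sum((-1) ** i for i in range(1, n + 1)) / n

assert abs(average(10_001)) < 1e-3   # tends to 0, as Kronecker's Lemma predicts
```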

Chapter 15

Brownian motion

15.1 Introduction

Brownian motion is a stochastic process (Wt)t with independent stationary Gaussian
increments and continuous paths. More explicitly:
Definition 15.1 A family (Wt)0≤t≤T with T < ∞ or (Wt)0≤t<∞ of real random variables
on the same probability space is called a standard Brownian motion (Brownsche Bewegung),
if it has the following properties:
a) W0 = 0 a.s.: The process starts at 0.
b) For all indices t0 < t1 < ... < tk the increments
Wt1 − Wt0, Wt2 − Wt1, ..., Wtk − Wtk−1
are independent.
c) For s < t the increment Wt − Ws has the law N(0, t − s).
d) P-almost all paths t ↦ Wt(ω) are continuous.
Remark Some authors require that all and not only P-almost all Brownian paths are
continuous. This does not make a big difference.
Brownian motion plays a fundamental role in probability theory because of, both, its rich
structure and the fact that many other probabilistic objects and models can be studied via
their relation to particular features of Brownian motion. It is not completely obvious that
the properties a)-d) are compatible, in other words that a Brownian motion exists. In this
chapter we give a construction of Brownian motion and show that the law of Brownian
motion appears as a universal limit law of second order centered random walks (Donsker's
invariance principle or functional central limit theorem).¹

15.2 Dyadic time approximation of Brownian motion: Lévy's triangles

For the sake of simpler notation we assume T = 1. Let us first suppose a Brownian motion
(Wt)0≤t≤1 is already given. For n ∈ N we restrict the paths to the dyadic time points of
order n, i.e. the points of the form m2^{-n}, 0 ≤ m ≤ 2^n. Then
$$W_{m2^{-n}} = \sum_{i=1}^{m} X_i^{2^{-n}} \quad \text{a.s.} \tag{15.1}$$
where the increments W_{i2^{-n}} − W_{(i−1)2^{-n}} are denoted by X_i^{2^{-n}} for i = 1, ..., 2^n.

¹ To the experienced reader: The point of these notes is to arrive at Brownian motion and Donsker's
theorem essentially simultaneously without the help of either Kolmogorov's consistency theorem or
Prokhorov's theorem.


n
In particular each Brownian motion defines via (15.1) a system of random variables Xi2 ,
n N, 1 i 2n such that for each n
n

the X12 , . . . , X22n are iid N (0, 2n )

(15.2)

and for all i = 1, . . . , 2n1 we have


n1

Xi2

2
2
= X2i1
+ X2i
.

(15.3)

since both sides are equal to the increment Wi2n1 W(i1)2n1 .


The inductive structure of this scheme can be illustrated by two triangles, one in R² and one
in the space L². For (typographical) simplicity the three successive times (i−1)2^{-(n-1)},
(2i−1)2^{-n}, i2^{-(n-1)} are denoted by l, m, r (left, middle, right) respectively. We look at the graph
of the linear interpolation of W in the dyadic intervals: On level n − 1 we have the straight
line from (l, Wl) to (r, Wr) which is replaced on level n by the two lines from (l, Wl) to
(m, Wm) and from (m, Wm) to (r, Wr). These three lines form the first triangle.
The identity (15.3) can be rewritten as Wr − Wl = (Wm − Wl) + (Wr − Wm). In L², due to
independence, these three vectors build an equilateral right-angled triangle with hypotenuse
Wr − Wl. The (Euclidean) geometric idea of the following is that, given any vector x, one
can find an equilateral right-angled triangle with x as hypotenuse by choosing a vector z
perpendicular to x of the same length. Then the three points 0, x/2 + z/2, x form the desired
triangle.
Lemma 15.2 Let X, Z have a joint Gaussian distribution. Then the two random variables
X + Z, X − Z are independent if and only if var(X) = var(Z).
Proof. Obviously cov(X + Z, X − Z) = var(X) − var(Z). Both statements are equivalent
to saying that this quantity vanishes. □
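Spelled out, the bilinearity computation behind the proof reads:

```latex
\operatorname{cov}(X+Z,\;X-Z)
  = \operatorname{var}(X) - \operatorname{cov}(X,Z) + \operatorname{cov}(Z,X) - \operatorname{var}(Z)
  = \operatorname{var}(X) - \operatorname{var}(Z),
```

and for a jointly Gaussian pair, vanishing covariance of X + Z and X − Z is equivalent to their independence, which is why this single quantity settles both directions.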

Recall that for every probability law μ there is a probability space (Ω, A, P) on which there
is an iid sequence of r.v. with law μ.
Proposition 15.3 Let Z1, Z2, ... be iid N(0,1)-variables. Define random variables X_i^{2^{-n}},
n ∈ N0, 1 ≤ i ≤ 2^n recursively by
$$X_1^{1} = Z_1,$$
$$X_{2i-1}^{2^{-n}} = \tfrac12\, X_i^{2^{-(n-1)}} + \tfrac12\, 2^{-(n-1)/2}\, Z_{2^{n-1}+i}, \tag{15.4}$$
$$X_{2i}^{2^{-n}} = \tfrac12\, X_i^{2^{-(n-1)}} - \tfrac12\, 2^{-(n-1)/2}\, Z_{2^{n-1}+i}. \tag{15.5}$$
Then they satisfy the relations (15.2) and (15.3).
Proof. For n = 0 the random variable X_1^1 = Z1 has law N(0, 2^{-n}) since 2^{-0} = 1. For
the induction step n−1 → n assume that X_1^{2^{-(n-1)}}, ..., X_{2^{n-1}}^{2^{-(n-1)}} are iid
N(0, 2^{-(n-1)})-variables which are σ(Z1, ..., Z_{2^{n-1}})-measurable. Fix an index
i ∈ {1, ..., 2^{n-1}}. Then X_i^{2^{-(n-1)}} and 2^{-(n-1)/2} Z_{2^{n-1}+i} both have variance
2^{-(n-1)}. Each of the two random variables X_{2i-1}^{2^{-n}} and X_{2i}^{2^{-n}} in (15.4) and
(15.5) is a sum of two independent N(0, 2^{-(n+1)})-variables and so they both are
N(0, 2^{-n})-variables. By the preceding Lemma they are independent of each other.
Moreover, being linear combinations of X_i^{2^{-(n-1)}} and Z_{2^{n-1}+i}, they both are
orthogonal to all X_j^{2^{-(n-1)}} and Z_{2^{n-1}+j} with an index j ∈ {1, ..., 2^{n-1}}, j ≠ i.
Thus, using the definitions of X_{2j-1}^{2^{-n}} and X_{2j}^{2^{-n}} for such j, they are also
orthogonal to X_{2j-1}^{2^{-n}} and X_{2j}^{2^{-n}}. This proves that in the centered Gaussian
vector (X_1^{2^{-n}}, ..., X_{2^n}^{2^{-n}}) all components are mutually orthogonal and hence
independent. This proves (15.2). Finally adding (15.4) and (15.5) gives the relation
(15.3). □
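The recursion (15.4)-(15.5) is short enough to implement and check directly. A sketch (the helper names are mine; it verifies (15.3) exactly and the variances in (15.2) by Monte Carlo):

```python
import math
import random

random.seed(3)

def refine(X, Z):
    # one step of (15.4)-(15.5): split the 2^(n-1) increments of order n-1
    # (list X) into 2^n increments of order n, using fresh N(0,1) variables Z
    n = int(math.log2(len(X))) + 1
    out = []
    for i, x in enumerate(X):
        z = 0.5 * 2.0 ** (-(n - 1) / 2) * Z[i]
        out.extend((0.5 * x + z, 0.5 * x - z))
    return out

def sample_level(n):
    X = [random.gauss(0.0, 1.0)]                  # X_1^1 = Z_1
    for _ in range(n):
        Z = [random.gauss(0.0, 1.0) for _ in X]
        Y = refine(X, Z)
        # relation (15.3) holds exactly, by construction
        assert all(abs(X[i] - Y[2 * i] - Y[2 * i + 1]) < 1e-12 for i in range(len(X)))
        X = Y
    return X

samples = [sample_level(4) for _ in range(4000)]  # 16 increments of order 4
for i in range(16):
    v = sum(s[i] ** 2 for s in samples) / len(samples)
    assert abs(v - 2.0 ** -4) < 0.01              # variance close to 2^-4
```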



Corollary 15.4 Let D denote the set of all dyadic numbers in [0,1]. There is a family of
random variables (Wt)t∈D (on a suitable probability space) which has the properties a)-c)
of a standard Brownian motion.
Proof. Choose the probability space and the random variables X_i^{2^{-n}} as in the Proposition.
Let t ∈ D be given. Choose n ∈ N such that t is of order n, i.e. such that there is a number
m ∈ {0, ..., 2^n} with t = m2^{-n}. Then we can define Wt by (15.1). The important point is
that this definition does not depend on the choice of n because of the relations (15.3).
Let now t0 < t1 < ... < tk be dyadic times. Then there are some n ∈ N and numbers
m0 < m1 < ... < mk with tj = mj 2^{-n} for j = 0, ..., k. For each j by (15.1) we have
$$W_{t_j} - W_{t_{j-1}} = \sum_{i=m_{j-1}+1}^{m_j} X_i^{2^{-n}}.$$
Here the terms are independent centered Gaussian and hence the same is true for the sum.
The variance of the sum adds up to (mj − mj−1)2^{-n} = tj − tj−1, i.e. the increments
Wtj − Wtj−1 have laws N(0, tj − tj−1). Moreover they are mutually independent since they
are built from disjoint parts of the sequence X_1^{2^{-n}}, ..., X_{2^n}^{2^{-n}}.
This proves b) and c) at dyadic times. □
Remark The main point of the above construction is that the consistency of the different
levels can be achieved on a single probability space. Kolmogorov's consistency theorem
(mentioned in the footnote) gives a more general method to deal with this kind of problem.
It now remains to show that the family (Wt) in the Corollary can be extended to all t ∈ [0,1]
in such a way that the paths are continuous. So we need to show that the restricted paths
D ∋ t ↦ Wt(ω) are uniformly continuous. This follows as a special case from a more general
estimate which then yields also Donsker's theorem.

15.3 Linear interpolations of random walks and their modulus of continuity

From now on let X1, X2, ... be iid random variables with E(Xi) = 0 and σ² = var(Xi) < ∞.
For each n ∈ N let Sn = Σ_{i=1}^n Xi. Define a process Y^(n) = (Y_t^(n))_{t∈[0,1]} by letting
Y_0^(n) = 0,
$$Y_t^{(n)} = \frac{1}{\sigma\sqrt n}\, S_k \quad \text{if } t = \frac kn, \tag{15.6}$$
and by linear interpolation of these values for t ∈ ((k−1)/n, k/n) where k ∈ {1, ..., n}.
The key to the results in this chapter is to keep control of the fluctuations of the process
Y^(n). The following concept is useful.
Definition 15.5 Let f be a real function on a set I ⊂ R. For each δ > 0 define w(δ, f) by
$$w(\delta, f) = \sup\{|f(s) - f(t)| : |s - t| < \delta,\ s, t \in I\}.$$
The function δ ↦ w(δ, f) is called modulus of continuity of f. If (Yt)t∈I is a stochastic
process on a probability space (Ω, A, P) then w(δ, Y) denotes the map ω ↦ w(δ, Y(ω)).
A function f obviously is uniformly continuous iff lim_{δ→0} w(δ, f) = 0. If the process Y has
continuous paths then w(δ, Y) is a random variable since the supremum in the definition
needs to be taken only over a countable dense subset of I.
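On a grid, w(δ, f) can be computed directly. A small sketch (the grid version replaces the supremum over the continuum and uses |s − t| ≤ δ, so it can only underestimate the true modulus):

```python
def modulus(values, step, delta):
    # values[i] = f(i * step); largest |f(s) - f(t)| over grid points with |s - t| <= delta
    k = int(delta / step)
    best = 0.0
    for d in range(1, k + 1):
        for i in range(len(values) - d):
            best = max(best, abs(values[i + d] - values[i]))
    return best

# f(t) = t^2 on [0,1]: w(delta, f) = 2*delta - delta^2 (attained near t = 1)
vals = [(i / 1000) ** 2 for i in range(1001)]
assert abs(modulus(vals, 0.001, 0.1) - (2 * 0.1 - 0.1 ** 2)) < 1e-9
```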
Lemma 15.6 Let Z have the law N(0,1). Then for each c > 0
$$P\{Z > c\} \le \frac{1}{c\sqrt{2\pi}}\, e^{-c^2/2}.$$
Proof. This follows from
$$\int_c^\infty \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx \le \int_c^\infty \frac{1}{\sqrt{2\pi}}\, \frac xc\, e^{-x^2/2}\,dx = \frac{1}{c\sqrt{2\pi}}\Big(-e^{-x^2/2}\Big)\Big|_c^\infty = \frac{1}{c\sqrt{2\pi}}\, e^{-c^2/2}. \tag{15.7}$$
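The bound of Lemma 15.6 can be compared with the exact tail P(Z > c) = erfc(c/√2)/2; a short check that it is valid and asymptotically sharp:

```python
import math

def tail(c):
    return math.erfc(c / math.sqrt(2.0)) / 2.0   # exact P(Z > c)

def bound(c):
    return math.exp(-c * c / 2.0) / (c * math.sqrt(2.0 * math.pi))

for c in (0.5, 1.0, 2.0, 3.0, 5.0):
    assert tail(c) <= bound(c)                   # Lemma 15.6
assert tail(5.0) > 0.95 * bound(5.0)             # sharp for large c
```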


Here is the main result of this section. In the language of Prokhorov's Theorem (which we
do not want to use) this guarantees the uniform tightness of the sequence of laws of the
Y^(n).
Proposition 15.7 Let Y^(n) be defined as above (cf. (15.6)). For each ε > 0 one has
$$\lim_{\delta\to 0} \limsup_{n\to\infty} P\big(w(\delta, Y^{(n)}) > \varepsilon\big) = 0. \tag{15.8}$$
Proof. Fix two numbers ε > 0 and δ > 0. Choose n large enough such that 1/n < δ. Choose
m ∈ N such that δ < m/n < 2δ. Divide the interval [0,1] starting from the left into equal
intervals Ir = [rm/n, (r+1)m/n] of length m/n (> δ) plus one possibly shorter interval near 1.
For the total number R of these intervals we get the estimate
$$R \le \frac1\delta + 1. \tag{15.9}$$
Assume now w(δ, Y^(n)) > ε. There are points s < t such that t − s < δ with
|Y_t^(n) − Y_s^(n)| > ε. The maximal fluctuation of the path occurs at the interpolation
points. The two points s, t are either in the same Ir or in two adjacent Ir's. Thus there is
an index r such that the interval Ir ∪ Ir+1 contains s and t and therefore also two
interpolation points at which the values of Y^(n) differ by more than ε. Then by the
triangular inequality there is at least one index k in the range [rm, (r + 2)m] such that
$$\frac{1}{\sigma\sqrt n}\,|S_k - S_{rm}| = \big|Y_{k/n}^{(n)} - Y_{rm/n}^{(n)}\big| > \varepsilon/2. \tag{15.10}$$
Therefore the event {w(δ, Y^(n)) > ε} is contained in
$$\bigcup_{r=1}^{R} \Big\{\max_{0\le j\le 2m} \Big|\sum_{i=rm+1}^{rm+j} X_i\Big| > \varepsilon\sigma\sqrt n/2\Big\}.$$
Using (15.9), Lemma 15.8 below and the fact that the Xi are iid its probability can be
estimated from above by
$$3\Big(\frac1\delta + 1\Big) \max_{1\le j\le 2m} P\big(|S_j| > \varepsilon\sigma\sqrt n/6\big).$$
Let jn converge to ∞ such that nevertheless jn/n → 0. Then max_{j≤jn} var(Sj) = o(n) and
hence
$$\max_{j\le j_n} P\big(|S_j| > \varepsilon\sigma\sqrt n/6\big) \to 0$$
with the help of Chebyshev's inequality. For 2m ≥ j ≥ jn and n sufficiently large we get by
the central limit theorem and the estimate (15.7) above
$$\Big(\frac1\delta + 1\Big) P\big(|S_j| > \varepsilon\sigma\sqrt n/6\big) \le \frac2\delta\, P\Big(\frac{1}{\sigma\sqrt j}\,|S_j| > \frac\varepsilon6\sqrt{\frac nj}\Big) \le \frac2\delta\, P\Big(|Z| > \frac\varepsilon6\sqrt{\frac n{2m}}\Big) \le \frac2\delta\, P\Big(|Z| > \frac{\varepsilon}{12\sqrt\delta}\Big) \le \frac{4}{\sqrt\delta}\,\frac{12}{\varepsilon\sqrt{2\pi}}\, \exp\Big(-\frac{\varepsilon^2}{288\,\delta}\Big).$$
So finally, lim sup_{n→∞} P({w(δ, Y^(n)) > ε}) is at most equal to 3 times the last expression
for all δ > 0. Now observe that this expression converges to 0 as δ → 0 to conclude the
result. □


The following Lemma (cute, isn't it?) is a special case of a result in M. Loève's classical
textbook [5].
Lemma 15.8 Let X1, X2, ..., Xn be independent² random variables. Then for all c > 0
$$P\big(\max_{1\le k\le n} |S_k| > 3c\big) \le 3 \max_{1\le k\le n} P\big(|S_k| > c\big).$$
Proof. Let p = max_{1≤k≤n} P(|Sk| > c). Let T be the first time at which |Sk| > 3c, and
T = n + 1 if this never happens. Then
$$P\big(\max_{1\le k\le n} |S_k| > 3c\big) = \sum_{k=1}^{n} P(\{T = k\}) \le \sum_{k=1}^{n} P\big(\{T = k\} \cap \{|S_n| \le c\}\big) + P(\{|S_n| > c\})$$
$$\le \sum_{k=1}^{n} P\big(\{T = k\} \cap \{|S_n - S_k| > 2c\}\big) + p = \sum_{k=1}^{n} P(\{T = k\})\, P(\{|S_n - S_k| > 2c\}) + p$$
$$\le \sum_{k=1}^{n} P(\{T = k\})\,(p + p) + p \le 3p.$$
This is the assertion. □
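A Monte Carlo sanity check of the lemma for ±1 steps (the inequality typically has a lot of slack, so sampling noise is harmless here):

```python
import random

random.seed(4)

n, c, trials = 50, 4, 20_000
count_max, count_tail = 0, 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    count_max += (m > 3 * c)
    count_tail += (abs(s) > c)

p_max = count_max / trials    # estimates P(max_k |S_k| > 3c)
p_tail = count_tail / trials  # estimates P(|S_n| > c) <= max_k P(|S_k| > c)
assert p_max <= 3 * p_tail    # consistent with the bound of Lemma 15.8
```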
15.4 Brownian motion exists

Theorem 15.9 (N. Wiener) There is a Brownian motion.


Proof. Let (Wt)t∈D be the process in Corollary 15.4 where the index set D consists of all
dyadic numbers. Let X1, X2, ... be iid N(0,1)-variables. For each n ∈ N let W^(n) be the
linear interpolation of the restriction of W to the dyadic numbers of order n. Then for each
δ > 0 the random variables w(δ, W^(n)) increase as n → ∞ to w(δ, W). On the other hand
W^(n) has the same distribution as the process Y^(2^n) which is built from the Xi as in the
previous section. Therefore for each ε > 0 by Proposition 15.7
$$\lim_{\delta\to 0} P\big(w(\delta, W) > \varepsilon\big) = \lim_{\delta\to 0} \limsup_{n\to\infty} P\big(w(\delta, W^{(n)}) > \varepsilon\big) = \lim_{\delta\to 0} \limsup_{n\to\infty} P\big(w(\delta, Y^{(2^n)}) > \varepsilon\big) = 0.$$
The random variables w(δ, W) are increasing in δ and since they converge in probability to
0 they even converge a.s. to 0 as δ → 0. But this means that P-almost all paths of W are
uniformly continuous on the dyadic numbers and hence they have a unique extension to a
continuous function on [0,1]. We also denote this extension by W. (On the measurable (!)
set of all ω where w(δ, W(ω)) does not converge to 0 we simply let Wt(ω) = 0 for all t.)
Thus W = (Wt)t∈[0,1] is a stochastic process with continuous paths whose restriction to the
dyadic numbers has the properties b) and c) in the Definition of a Brownian motion. We
show that these properties stay valid for general times tj, 0 ≤ j ≤ k in [0,1]. Choose for
each j a sequence (t_j^r)r∈N of dyadic numbers such that tj = lim_{r→∞} t_j^r. Then by the
continuity of the paths we get
$$\big(W_{t_0^r},\, W_{t_1^r} - W_{t_0^r},\, \dots,\, W_{t_k^r} - W_{t_{k-1}^r}\big) \xrightarrow[r\to\infty]{} \big(W_{t_0},\, W_{t_1} - W_{t_0},\, \dots,\, W_{t_k} - W_{t_{k-1}}\big) \tag{15.11}$$

² Actually if the Xi are also identically distributed a similar argument allows to replace the 3 by 2 at both
places!


a.s. and hence in law. Checking characteristic functions we see that
$$\varphi(u) = \exp\Big(-\frac12\Big(t_0 u_0^2 + \sum_{j=1}^{k} (t_j - t_{j-1})\,u_j^2\Big)\Big) = \lim_{r\to\infty} \exp\Big(-\frac12\Big(t_0^r u_0^2 + \sum_{j=1}^{k} (t_j^r - t_{j-1}^r)\,u_j^2\Big)\Big)$$
is the characteristic function of the limit in (15.11) which proves that the increments at
non-dyadic times also have the properties b) and c). □
Note that the piece-wise linear paths of W^(n) in the above proof actually converge a.s.
uniformly, so (Wt)t∈[0,1] alternatively can be defined as their uniform limit. Norbert Wiener in
his original proof constructed the Brownian paths as a uniform limit of random trigonometric
polynomials.

15.5 Donsker's functional central limit theorem

In this section we consider processes with continuous paths as random variables with values
in the metric space C[0,1] with the sup-norm. Donsker's Theorem states that in this sense
the processes Y^(n) in section 15.3 converge in law to a Brownian motion.
Theorem 15.10 Let X1, X2, ... be iid random variables with E(Xi) = 0 and σ² = var(Xi) <
∞. For each n ∈ N define the process Y^(n) = (Y_t^(n))_{t∈[0,1]} as in (15.6) by linear
interpolation of the associated rescaled random walk. Then the laws L(Y^(n)) on the space
C[0,1] converge weakly to the law of a Brownian motion.
An important idea in the technical side of the proof is to switch from a function u ∈
C[0,1] to its linear interpolations and back. For a finite subset F of [0,1] and a vector
x = (xt)t∈F ∈ R^F let I^F(x) ∈ C[0,1] be the function which one gets from x by linear
interpolation. Conversely for u ∈ C[0,1] let u|F be the restriction of u to F.
Lemma 15.11 Let F be a finite subset of [0,1] which contains the two endpoints 0 and 1.
Let δF denote the maximal distance between two adjacent elements of F. Then for every
u ∈ C[0,1] one has
$$\|u - I^F(u_{|F})\| \le w(\delta_F, u). \tag{15.12}$$
In particular let (Fn) be an increasing sequence of finite sets in [0,1] containing {0,1} whose
union is dense in [0,1]. Then, for every u ∈ C[0,1], u = lim_n I^{F_n}(u_{|F_n}) uniformly, i.e. in
the topology of C[0,1].
Proof. In each interval between two adjacent points ti, ti+1 of F we have
$$|u(t) - I^F(u_{|F})(t)| \le \max\big(|u(t) - I^F(u_{|F})(t_i)|,\ |u(t) - I^F(u_{|F})(t_{i+1})|\big) = \max\big(|u(t) - u(t_i)|,\ |u(t) - u(t_{i+1})|\big) \le w(\delta_F, u).$$
Now take the supremum over all t to get (15.12). If the (Fn) are as indicated then δ_{Fn} → 0
and since every u ∈ C[0,1] is uniformly continuous we have w(δ_{Fn}, u) → 0 as n → ∞. Thus
(15.12) gives the second assertion. □
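Lemma 15.11 in code: a sketch of I^F and a numerical check of (15.12) for u(t) = sin(2πt) (both suprema are taken over a grid, so they only approximate the true ones):

```python
import math

def interp(F, vals, t):
    # the piecewise linear function I^F through the points (F[i], vals[i])
    for (a, fa), (b, fb) in zip(zip(F, vals), zip(F[1:], vals[1:])):
        if a <= t <= b:
            return fa + (fb - fa) * (t - a) / (b - a)
    raise ValueError("t outside [F[0], F[-1]]")

u = lambda t: math.sin(2.0 * math.pi * t)
F = [0.0, 0.1, 0.35, 0.6, 0.8, 1.0]
vals = [u(t) for t in F]
delta_F = max(b - a for a, b in zip(F, F[1:]))   # = 0.25

grid = [i / 500 for i in range(501)]
err = max(abs(u(t) - interp(F, vals, t)) for t in grid)
w = max(abs(u(s) - u(t)) for s in grid for t in grid if abs(s - t) <= delta_F)
assert err <= w                                  # inequality (15.12)
```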
The space C[0,1] with the sup-norm is a metric space. It is well known by the Weierstraß
Approximation Theorem that it is separable, i.e. there is a countable dense subset.³ The
symbol B denotes the Borel σ-algebra of C[0,1]. As a first application of Lemma 15.11 let us
clarify the obvious measurability question: Are the processes in the Theorem really random
variables in the measurable space (C[0,1], B)?

³ If you do not know Weierstraß' Theorem give a direct proof of the separability by the above
interpolation-approximation scheme.


First note that in every metric space (M, d) the Borel σ-algebra B(M) is generated by
the continuous real functions on M, since every closed set L can be written in the form
{f = 0} where f(x) = dist(x, L). If (Xn) is a sequence of M-valued random variables which
converges point-wise to the function X then f(X) = lim_n f(Xn) is a real random variable
for each continuous f : M → R and thus X is also an M-valued random variable.
Lemma 15.12 Let (Yt)t∈[0,1] be a real valued process with continuous paths on the
probability space (Ω, A, P). Then the map Y : Ω → C[0,1] which maps ω to the function
(path) [0,1] ∋ t ↦ Yt(ω) is A-B(C[0,1])-measurable.
Proof. Let Y be our process with continuous paths. Let (Fn) be a sequence of finite subsets
as in Lemma 15.11. Then for each n the restriction Y|Fn is a random vector in R^{Fn}. The
maps I^{Fn} : R^{Fn} → C[0,1] are continuous and hence the I^{Fn}(Y|Fn) are C[0,1]-valued
random variables. Thus by Lemma 15.11 Y as their point-wise limit has the same property. □

The statement of the next lemma is called convergence of the finite dimensional marginals.
Lemma 15.13 Let t0 < ... < tk be in [0,1]. Then (Y_{t_0}^(n), ..., Y_{t_k}^(n)) converges
for n → ∞ in law to (W_{t_0}, ..., W_{t_k}).
Proof. We consider the increments. In order to be able to consider the first component also
as increment we may assume t0 = 0. For j ∈ {0, ..., k} and n ∈ N let m_n^j be the smallest
integer m such that m/n ≥ tj. Then the k sequences of random variables
$$Y^{(n)}_{m_n^j/n} - Y^{(n)}_{m_n^{j-1}/n} = \frac{1}{\sigma\sqrt n}\big(S_{m_n^j} - S_{m_n^{j-1}}\big) = \sqrt{\frac{m_n^j - m_n^{j-1}}{n}}\ \frac{1}{\sigma\sqrt{m_n^j - m_n^{j-1}}} \sum_{i=m_n^{j-1}+1}^{m_n^j} X_i$$
are independent of each other and their laws converge by the central limit theorem to
the law of independent N(0, tj − tj−1)-variables which are the increments of a Brownian
motion. Then the mapping principle of weak convergence which says that this convergence
is preserved under continuous maps implies that the vector (Y^{(n)}_{m_n^1/n}, ..., Y^{(n)}_{m_n^k/n})
converges in law to (W_{t_0}, ..., W_{t_k}). The error which we introduce by switching from the
indices m_n^j/n to the tj is asymptotically small. □
Now we come to the proof of the Theorem. It suffices to show that
$$\lim_{n\to\infty} E\big(\Phi(Y^{(n)})\big) = E\big(\Phi(W)\big) \tag{15.13}$$
holds if Φ is a bounded real Lipschitz functional on the space C[0,1]. Let L be its Lipschitz
constant and choose K < ∞ such that |Φ(u)| ≤ K for all u ∈ C[0,1]. Let F = {t0, ..., tk} ⊂
[0,1] be a finite set which starts with 0 and ends with 1. Moreover let ΦF : C[0,1] → R be
defined by
$$\Phi_F(u) = (\Phi \circ I^F)(u_{|F}).$$
The function Φ ∘ I^F is bounded and continuous on R^F and so by Lemma 15.13
$$\lim_{n\to\infty} E\big(\Phi_F(Y^{(n)})\big) = \lim_{n\to\infty} E\big(\Phi \circ I^F(Y^{(n)}_{t_0}, \dots, Y^{(n)}_{t_k})\big) = E\big(\Phi \circ I^F(W_{t_0}, \dots, W_{t_k})\big) = E\big(\Phi_F(W)\big). \tag{15.14}$$
On the other hand by Lemma 15.11 and the Lipschitz property
$$|\Phi_F(u) - \Phi(u)| = |\Phi(u) - \Phi(I^F(u_{|F}))| \le L\,\|u - I^F(u_{|F})\| \le L\, w(\delta_F, u).$$
Let ε > 0 be given. Choose δ > 0 such that lim sup_{n→∞} P(w(δ, Y^(n)) > ε) < ε. Similarly
P(w(δ, W) > ε) < ε for sufficiently small δ. Choose F such that δF < δ. Then for
sufficiently large n
$$\big|E\big(\Phi_F(Y^{(n)}) - \Phi(Y^{(n)})\big)\big| \le E\big(L\,w(\delta, Y^{(n)}) \wedge 2K\big) \le L\varepsilon + 2K\,P\big(w(\delta, Y^{(n)}) > \varepsilon\big) < (L + 2K)\,\varepsilon.$$
By the same estimate
$$\big|E\big(\Phi_F(W) - \Phi(W)\big)\big| < (L + 2K)\,\varepsilon.$$
Since ε > 0 is arbitrary we can deduce the claim (15.13) from (15.14). This finishes the
proof. □

15.6 An application: The arcsin law

In the famous chapter III of the first volume of Feller's book the following asymptotic result
is derived by completely elementary combinatorial methods. It says that the probability that
in n fair games the fraction of time spent with a cumulative win is at most x (0 < x < 1)
approaches a number which is surprisingly large even for small x.
Theorem 15.14 (Arcsin law for the simple symmetric random walk) Let X1, X2, ... be an
iid sequence of random variables with P(Xi = 1) = P(Xi = −1) = 1/2. Let Nn be the
number of indices k ≤ n such that Sk > 0. Then for every x ∈ [0,1]
$$\lim_{n\to\infty} P\Big(\frac{N_n}{n} < x\Big) = \frac1\pi \int_0^x \frac{dt}{\sqrt{t(1-t)}} = \frac2\pi \arcsin\sqrt x. \tag{15.15}$$
We want to show that Donsker's Theorem implies that then the same must hold for every
centered second order random walk.
Corollary 15.15 The relation (15.15) holds for every iid sequence (Xi) with E(Xi) = 0
and var(Xi) < ∞. Moreover if (Wt)t∈[0,1] is a Brownian motion, then
$$P\big(\lambda\{t \in [0,1] : W_t > 0\} < x\big) = \frac2\pi \arcsin\sqrt x.$$

We consider the functional Ψ : C[0,1] → R defined by
$$\Psi(u) = \lambda\{t : u(t) > 0\}. \tag{15.16}$$
Lemma 15.16 Let (um) be a sequence of measurable functions which converges uniformly
(or just point-wise) to some u with λ{t : u(t) = 0} = 0. Then
$$\lim_{m\to\infty} \lambda\{t : u_m(t) > 0\} = \lambda\{t : u(t) > 0\}. \tag{15.17}$$
Proof. We look at the um and u as real random variables on the Lebesgue probability space
on [0,1]. Then um converges point-wise and hence in law to u. Let Fm and F denote the
distribution functions of um and u, respectively. By assumption F is continuous at 0, hence
$$\lambda\{t : u_m(t) > 0\} = 1 - F_m(0) \xrightarrow[m\to\infty]{} 1 - F(0) = \lambda\{t : u(t) > 0\}. \qquad \square$$

Lemma 15.17 Let W be a Brownian motion. Then almost all paths spend no time at
the origin: P(λ{t : W(t) = 0} = 0) = 1.


Proof. We consider the set G = {(t, u) ∈ [0,1] × C[0,1] : u(t) = 0}. The function (t, u) ↦
u(t) is continuous on the product space. Hence G is closed in the product topology. Thus it
is product measurable. Since for each t the r.v. Wt has a Gaussian law we get from Fubini's
Theorem
$$E\big(\lambda\{t : W(t) = 0\}\big) = E\Big(\int_0^1 1_G(t, W)\,dt\Big) = \int_0^1 E\big(1_G(t, W)\big)\,dt = \int_0^1 N(0, t)(\{0\})\,dt = 0.$$
Therefore the non-negative random variable λ{t : Wt = 0} must vanish a.s. □

These two Lemmas show that the functional Ψ is continuous at all points u ∈ A ⊂ C[0,1],
where A is the set of functions which spend no time at 0, and that P(W ∈ A) = 1. This
implies by the extended mapping principle and Donsker's theorem that also the random
variables Ψ(Y^(n)) converge in law to Ψ(W). Since for the simple random walk the random
variables Ψ(Y^(n)) have the arcsin-distribution as limit law, the latter must be the law of
Ψ(W). This proves the second statement of the Corollary. A second application of Donsker's
Theorem shows that it is also the limit law for other second order random walks. Note that
the size of σ does not enter.
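A closing Monte Carlo illustration (a sketch only; the walk length and the tolerance are chosen loosely to keep the run fast): the fraction of time a simple random walk is positive, against the arcsine CDF at x = 0.2:

```python
import math
import random

random.seed(5)

def frac_positive(n):
    s, pos = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        pos += (s > 0)
    return pos / n

n, trials, x = 1000, 3000, 0.2
emp = sum(frac_positive(n) < x for _ in range(trials)) / trials
target = 2.0 / math.pi * math.asin(math.sqrt(x))   # about 0.295
assert abs(emp - target) < 0.05
```

The surprisingly large value at x = 0.2 (almost 30%) is exactly the phenomenon described at the start of this section: one player is behind for at least 80% of the game far more often than intuition suggests.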

Bibliography
[1] W. Feller. An introduction to probability theory and its applications. Vol. II. John Wiley
& Sons, New York, 2nd edition, 1971.
[2] J. Jacod and P. Protter. Probability Essentials. Springer, Berlin - Heidelberg etc., 2nd
edition, 2002.
[3] A. Klenke. Wahrscheinlichkeitstheorie. Springer, Berlin - Heidelberg, 2006.
[4] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin,
1933.
[5] M. Loève. Probability Theory I. Springer Verlag, New York - Heidelberg - Berlin, 4th
edition, 1977.
[6] P. Mörters and H. von Weizsäcker. Stochastische Methoden, 3. Auflage. TU Kaiserslautern,
http://www.mathematik.uni-kl.de/w-theorie/lehre/skripte, 2010.
[7] A. N. Shiryaev. Probability. Springer, New York, 1996.
[8] H. v. Weizsäcker. Basic Measure Theory. TU Kaiserslautern,
http://www.mathematik.uni-kl.de/w-theorie/lehre/skripte, 2008.
[9] D. Williams. Weighing the Odds. A Course in Probability and Statistics. Cambridge
University Press, New York, 2001.
