Introductory remark
The purpose of this text is to provide, in 100 pages, a first introduction to the most important
mathematical tools of general probability theory, i.e. the probability essentials. Thus we
cover similar ground to the nice book by J. Jacod and P. Protter [?], which has that phrase
as its title. Even though nothing here is really new, there are of course endless possible variations
in the details of the presentation. This text contains a few of my favourite arguments. I still
feel the influence of my first teachers Klaus Krickeberg and Hans Richter.
It is assumed that the reader is already familiar with basic measure theory and has sharpened
their intuition by working through some elementary probability and statistics; more than enough
material can be found in the texts [?] and [?] in the same series. Some of the measure
theoretic proofs are provided since I believe they help to understand the essence of the
result.
I am very grateful for the opportunity to teach this beautiful subject for many years. In particular my thanks go to Martin Altmayer and Florian Schwahn who prepared a preliminary
version of these notes during the winter 2008/9.
Contents
1 Probability spaces
1.1 Events
1.2 Probability
1.3 Examples
10 Characteristic Functions
10.1 Complex valued random variables
10.2 The definition and first properties
10.3 The uniqueness theorem
10.4 Lévy's continuity theorem
15 Brownian motion
15.1 Introduction
15.2 Dyadic time approximation of Brownian motion: Lévy's triangles
15.3 Linear interpolations of random walks and their modulus of continuity
15.4 Brownian motion exists
15.5 Donsker's functional central limit theorem
15.6 An application: The arcsin law
Chapter 1
Probability spaces
In this chapter we briefly motivate the axioms of mathematical probability theory, as it is
developed in this course. It should be stressed that this does not really answer the question
of what probability is about. On the one hand the word "probability" can be used meaningfully
in contexts beyond the range of the mathematical theory in the sense of this course. On
the other hand several different interpretations of the functional meaning of probability are
compatible with this mathematical framework.
Probability theory is based on the experience that uncertainty about some circumstances
or events is not the same as complete lack of knowledge about them. On the contrary, in
many situations it is possible to assign to these events a quantitative degree of certainty
or likelihood. Some events appear to the observer to be more likely than others and some
types of events happen more frequently than others. Consequently probability, understood
as this quantitative degree of certainty, has a subjective side to it, being dependent on the
judgement and state of knowledge of the observer, and an objective side, since often by
experimental experience different observers are necessarily led to a similar opinion.
1.1 Events
Here is a small list of statements to which you may attach a degree of certainty.
A certain die shows 3.
Two dice give the sum 9.
The price of rice at some market place increases during the year 2010.
This year, on December 1 there will be snow in Kaiserslautern.
Last year, on December 1 there was snow in Kaiserslautern.
The extinction of dinosaurs is due to the impact of a meteor.
Paul loves Paula.
Not in all of these situations is it reasonable to attempt a mathematical theory.
Basic Rules for Events A probability model in our sense specifies a collection F of
events, for each of which a probability is assigned in a certain situation or in the mind of
a certain observer. An important assumption is that this collection F is closed under the
natural logical operations.
Under these logical operations the events form a Boolean algebra. The simplest example
of a Boolean algebra is given as follows. The operation "and" is replaced by intersection, "or"
is replaced by union and "not" by the formation of the complement (see footnote 2).
Definition 1.1 Let $\Omega$ be a set. A nonempty system $\mathcal F$ of subsets of $\Omega$ is called a (set-)algebra over $\Omega$, if it has the following properties
1. If $A \in \mathcal F$ then $\Omega \setminus A =: A^c \in \mathcal F$.
2. If $A, B \in \mathcal F$ then $A \cap B,\ A \cup B \in \mathcal F$.
Remark 1.2
1. Every algebra contains the empty set and the full set since they have the representations $\emptyset = A \cap A^c$ and $\Omega = A \cup A^c$.
2. The points $\omega \in \Omega$ are called elementary events.
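For a finite set the two properties of Definition 1.1 can be checked mechanically. The following is a minimal sketch; the function name `is_set_algebra` and the two example families are illustrative assumptions, not part of the text.

```python
from itertools import combinations

def is_set_algebra(omega, family):
    """Check Definition 1.1 for a family of subsets of a finite set omega."""
    omega = frozenset(omega)
    fam = {frozenset(a) for a in family}
    if not fam:
        return False  # the system must be nonempty
    # Property 1: closed under complements
    if any(omega - a not in fam for a in fam):
        return False
    # Property 2: closed under pairwise intersections and unions
    for a, b in combinations(fam, 2):
        if a & b not in fam or a | b not in fam:
            return False
    return True

omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, omega]   # an algebra over omega
bad = [set(), {1}, {2}, omega]          # complement of {1} is missing

print(is_set_algebra(omega, good))
print(is_set_algebra(omega, bad))
```

On a finite set every algebra is automatically a σ-algebra, so the sketch says nothing about the countable operations discussed next.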
For the development of probability the following rule is very convenient. It is less fundamental in a philosophical sense, but it is extremely useful because it allows us to simplify many
statements. Moreover it allows us to use tools from measure theory.
Technical Rule If $A_1, A_2, \ldots$ is a sequence of events, then "at least one of them happens"
is also an event.
Example 1.3 A coin is thrown infinitely often. Let $A_n$ be the event "the coin shows heads
at the n-th trial". Then we want to consider the event $B$ = "at least once we get heads".
Under our translation between logical operations and set operations the following definition
now is very natural.
Definition 1.4 A set algebra $\mathcal F$ over the set $\Omega$ is called a $\sigma$-algebra if in addition:
If $A_1, A_2, \ldots$ is a sequence of elements of $\mathcal F$ then $\bigcup_{n=1}^\infty A_n \in \mathcal F$.
Remark 1.5
Using complements we see that for a $\sigma$-algebra $\mathcal F$ and for every sequence $(A_n)$ of elements of $\mathcal F$
also the set
$$\bigcap_{n=1}^\infty A_n = \Big( \bigcup_{n=1}^\infty A_n^c \Big)^c$$
belongs to $\mathcal F$. (We shall often need similar identities between sets. They may not be immediately clear to the inexperienced reader: The trick is of course to consider an arbitrary
element of one side of the equation and to convince oneself that it must also be an element
of the other side.)
1 However in quantum mechanics this is too much to ask for: Suppose both A: "the electron has location
in dx" and B: "the electron has momentum in dp" are statements which can be checked by a measurement
and have a certain probability. Then typically the combined statement "A and B" cannot be checked and
therefore it is not meaningfully connected to a probability. For a brief look at quantum probability we
refer to the last chapter in Williams [?].
2 Actually according to the Stone representation theorem every Boolean algebra is isomorphic to a set
algebra in the sense of Definition 1.1.
1.2 Probability
For the events $A \in \mathcal F$ we want to quantify the degree of possibility that $A$ happens. As
mentioned above, the question of what probability really means has many different answers
which we cannot discuss here in great detail. Many of these views agree that probability
should satisfy the following
Basic Requirements
Monotonicity: If $A$ implies $B$ then $P(A) \le P(B)$, i.e. $A \subset B \Rightarrow P(A) \le P(B)$.
Normalization: $P(\emptyset) = 0$, $P(\Omega) = 1$.
Additivity: $P(A \cup B) = P(A) + P(B)$ for disjoint $A$ and $B$.
Here are three operational approaches to probability which make these rules appear reasonable:
1. Frequency: We make the (non-trivial) assumption that the underlying experiment can
be repeated many times under the same conditions. Then
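$$P(A) \approx \frac{\text{number of repetitions in which } A \text{ occurs}}{\text{number of repetitions}}.$$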
In natural sciences, empirically one often observes that this quotient stabilizes around
a certain number which we call probability of A. Then the additivity, for instance, is
In order to encourage Paul to give his true opinion about his chances he must also be
ready to switch roles with his partner (or bank). So he should also accept the function
$-f_A$. This is what we would call "fair".
Now the basic assumption in the betting approach to subjective probability is the
following additivity of the payoff function: If $A$ and $B$ are disjoint, i.e. mutually
exclusive, then
$$f_{A \cup B} = f_A + f_B.$$
We observe that this implies the additivity $P(A \cup B) = P(A) + P(B)$ of the probability:
the additivity of the payoff is equivalent to
3. Insufficient reason: We consider the following situation: There are finitely many possible outcomes $E_1, \ldots, E_n$ of the experiment which are mutually exclusive: Exactly
one of them occurs. Some of the outcomes $E_i$ imply $A$, they are called favourable for
$A$, the others imply $A^c$. Under the basic assumption:
There is no sufficient reason to consider one of the $E_i$ to be more plausible than any
of the others
we get a natural definition of
$$P(A) = \frac{\text{number of outcomes } E_i \text{ favourable for } A}{n}.$$
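The principle of insufficient reason turns probability into counting. As a small illustration, here is a sketch for the statement "Two dice give the sum 9" from Section 1.1, assuming two fair six-sided dice so that all 36 ordered outcomes are equally plausible (the variable names are illustrative).

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (first die, second die); assumed equally plausible.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes favourable for the event "the sum is 9".
favourable = [w for w in outcomes if sum(w) == 9]

p_sum_9 = Fraction(len(favourable), len(outcomes))
print(favourable)   # the four favourable outcomes
print(p_sum_9)      # 1/9
```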
From now on we accept the three basic requirements. Moreover for mathematical convenience we replace additivity by $\sigma$-additivity. This allows us to use the tools of measure theory.
Moreover in concrete situations the $\sigma$-additivity usually can be verified.
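Definition 1.11 If $\mathcal F$ is a $\sigma$-algebra then a measure is a mapping $\mu : \mathcal F \to [0, \infty]$ with
$\mu(\emptyset) = 0$ and which is $\sigma$-additive: If $A_1, A_2, \ldots$ are disjoint, then
$$\mu\Big( \bigcup_{n=1}^\infty A_n \Big) = \sum_{n=1}^\infty \mu(A_n).$$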
1.3 Examples
$$P(A) = \frac{|A|}{|\Omega|}.$$
In the following examples the set $\Omega$ is uncountable, and individual points get probability
0. In such situations the power set of $\Omega$ is too large to serve as domain of definition of the
probability measure. One needs the concept of $\sigma$-algebras in order to single out a sufficiently
flexible subsystem of the power set on which the probability can be defined.
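6. Uniform distribution: $\Omega := [0, 1]$, $\mathcal F := \mathcal B([0, 1])$ and $P(A) := \lambda(A)$ with the Lebesgue measure $\lambda$. Similarly the uniform distribution can be defined on other intervals, and
on more general sets even in higher dimensions: Let $\Omega$ be a Borel set in $\mathbb R^d$ with
$0 < \lambda^d(\Omega) < \infty$. The uniform distribution on $\Omega$ is defined by
$$P(A) = \frac{\lambda^d(A)}{\lambda^d(\Omega)}.$$
7. Let $\Omega = \mathbb R^d$ and $f : \Omega \to \mathbb R$ be Borel-measurable such that $\int_{\mathbb R^d} f(x)\,dx = 1$ and $f \ge 0$.
Then
$$P(A) = \int_A f(x)\,dx \quad \text{for all } A \in \mathcal B(\Omega)$$
defines a probability measure with density $f$.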
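For disjoint sets $A_1, A_2, \ldots$ we have
$$1_{\bigcup_{i=1}^\infty A_i}(x) = \sum_{i=1}^\infty 1_{A_i}(x) \quad \text{for all } x \in \Omega$$
and thus
$$P\Big( \bigcup_{i=1}^\infty A_i \Big) = \int \sum_{i=1}^\infty 1_{A_i}(x)\, f(x)\,dx = \sum_{i=1}^\infty \int_{A_i} f(x)\,dx = \sum_{i=1}^\infty P(A_i).$$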
Remark 1.14
If $g : \mathbb R^d \to [0, \infty)$ is Borel measurable and $0 < \int_{\mathbb R^d} g(x)\,dx < \infty$, then there exists a unique
constant $c \in (0, \infty)$ such that $f := c \cdot g$ is a probability density.
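8. $f(x) = 1_{(0,\infty)}(x)\, \lambda e^{-\lambda x}$ is the density of the $\mathrm{Exp}(\lambda)$ distribution.
9. $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$ is the density of the standard normal distribution $N(0, 1)$.
10. $f(x) = \frac{1}{\pi} \frac{1}{1+x^2}$ is the density of the Cauchy distribution with parameter 1.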
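The normalising constant of Remark 1.14 can be approximated numerically. The sketch below does this for $g(x) = e^{-x^2/2}$ with a plain trapezoid rule; the truncation of the real line to $[-10, 10]$, the step count and the helper name `trapezoid` are assumptions made for the illustration. The result should be close to $1/\sqrt{2\pi}$, the constant appearing in the normal density.

```python
import math

def trapezoid(g, a, b, steps):
    """Composite trapezoid rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

def g(x):
    # Unnormalised Gaussian; its integral over the real line is sqrt(2*pi).
    return math.exp(-x * x / 2.0)

# Assumption: the mass outside [-10, 10] is negligible for this g.
integral = trapezoid(g, -10.0, 10.0, 20000)
c = 1.0 / integral          # normalising constant of Remark 1.14

def f(x):
    return c * g(x)         # probability density f = c * g

print(c, 1.0 / math.sqrt(2.0 * math.pi))
```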
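Finally, if $\mathcal F$ is a $\sigma$-algebra and $x$ is a point in the underlying set, then we consider the point
mass or Dirac distribution in the point $x$:
$$\delta_x(A) := \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases} \;=\; 1_A(x).$$
Observe that for disjoint sets $A_n$
$$\delta_x\Big( \bigcup_{n=1}^\infty A_n \Big) = \sum_{n=1}^\infty \delta_x(A_n).$$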
Therefore $\delta_x$ is a probability measure on every $\sigma$-algebra over a set which contains the
element $x$. A discrete distribution is just a (finite or infinite) convex combination of point
masses. Consider for example the geometric distribution:
$$\mathrm{Geom}(p)(A) := \sum_{k \in A \cap \mathbb N_0} p(1-p)^k = \sum_{k=0}^\infty p(1-p)^k\, \delta_k(A).$$
This makes sense even if $A$ contains not only integers. For example we can consider the
geometric distribution not only as a distribution over the non-negative integers but also as
a distribution on the Borel $\sigma$-algebra of the real line.
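The representation of the geometric distribution as a mixture of point masses translates directly into code: a set $A$ of real numbers only enters through the values of its indicator at the non-negative integers. The following is a sketch under the assumption that truncating the series after 2000 terms is accurate enough; the function name `geom` and the way sets are passed as indicator functions are illustrative choices.

```python
def geom(p, indicator, terms=2000):
    """Approximate Geom(p)(A) = sum_k p (1-p)^k delta_k(A).

    `indicator(k)` returns True if k lies in A; the infinite series is
    truncated after `terms` summands (an assumption of this sketch).
    """
    return sum(p * (1 - p) ** k for k in range(terms) if indicator(k))

p = 0.5
# A = set of even integers
print(geom(p, lambda k: k % 2 == 0))
# A = the real interval [0, 2.5], which contains the integers 0, 1, 2
print(geom(p, lambda k: 0 <= k <= 2.5))
# A = the whole real line
print(geom(p, lambda k: True))
```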
Chapter 2
Construction of probability measures
The fact that we can define the probabilities on a full $\sigma$-algebra has many advantages.
However it is a nontrivial fact. Even in the case of geometric size on the real line this is so:
It is clear that the system
$$\mathcal A = \Big\{ \bigcup_{i=1}^n J_i : \text{each } J_i \text{ is of the type } (a, b] \text{ or } (-\infty, b] \text{ or } (a, \infty) \Big\}$$
is a set-algebra and letting $|I|$ be the length of an interval it is easy to define a finitely
additive set function on $\mathcal A$ e.g. by
$$\lambda\Big( \bigcup_{i=1}^n J_i \Big) = \sum_{i=1}^n |J_i|$$
if the $J_i$ are disjoint. But $\mathcal A$ is not a $\sigma$-algebra and it is by no means clear that this set function
has an extension to a measure on a $\sigma$-algebra which contains all intervals. Lebesgue's theory
of integration implies this. But in probability theory we need similar extensions in much
more general situations and since this is so important we reproduce a proof in the first
section. In the second section we prove a general uniqueness result for such extensions as a
special case of the technique of monotone classes, which allows one to carry over many statements
from small systems of sets to much larger ones.
2.1
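$$m\Big( \bigcup_{n=1}^\infty A_n \Big) = \sum_{n=1}^\infty m(A_n) \quad \text{whenever } A_n \in \mathcal A \text{ for all } n \in \mathbb N \text{ such that } \bigcup_{n=1}^\infty A_n \in \mathcal A.$$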
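Lemma 2.3 If $\mathcal A$ is an algebra which is closed under taking unions of countably many
disjoint sets, then $\mathcal A$ is a $\sigma$-algebra.
Proof. We verify that $\mathcal A$ is closed under countable unions even without the disjointness
assumption: Let $(A_n)$ be a sequence of (not necessarily disjoint) elements of $\mathcal A$. Consider
$$B_1 = A_1, \qquad B_n = A_n \cap \Big( \bigcup_{i=1}^{n-1} A_i \Big)^c.$$
The $B_n$ are disjoint elements of $\mathcal A$ and $\bigcup_{n=1}^\infty A_n = \bigcup_{n=1}^\infty B_n \in \mathcal A$ by assumption. Hence $\mathcal A$ is a $\sigma$-algebra.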
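The disjointification used in this proof, $B_n = A_n \cap (A_1 \cup \dots \cup A_{n-1})^c$, is easy to try out on finite sets. A minimal sketch (the function name `disjointify` and the sample sets are illustrative assumptions):

```python
def disjointify(sets):
    """Return B_1, B_2, ... with B_n = A_n minus (A_1 u ... u A_{n-1})."""
    seen = set()          # union of the sets processed so far
    result = []
    for a in sets:
        a = set(a)
        result.append(a - seen)
        seen |= a
    return result

A = [{1, 2}, {2, 3}, {1, 3, 4}]
B = disjointify(A)
print(B)   # pairwise disjoint sets with the same union as the A_n
```

The $B_n$ are pairwise disjoint, satisfy $B_n \subset A_n$, and have the same union as the $A_n$, which is exactly what the proof uses.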
In the remaining parts of the proof we will need the following definition (outer measure)
$$\mu^*(E) := \inf\Big\{ \sum_{n=1}^\infty m(A_n) : E \subset \bigcup_{n=1}^\infty A_n,\ A_n \in \mathcal A \Big\}, \quad E \subset \Omega.$$
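$$\sum_{n=1}^\infty m(A^i_n) < \mu^*(E_i) + \varepsilon 2^{-i}.$$
Then
$$\bigcup_{i=1}^\infty E_i \subset \bigcup_{i=1}^\infty \bigcup_{n=1}^\infty A^i_n$$
and so
$$\mu^*\Big( \bigcup_{i=1}^\infty E_i \Big) \le \sum_{i=1}^\infty \sum_{n=1}^\infty m(A^i_n) \le \sum_{i=1}^\infty \big( \mu^*(E_i) + \varepsilon 2^{-i} \big) = \sum_{i=1}^\infty \mu^*(E_i) + \varepsilon.$$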
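Proof. Let $A \in \mathcal A$ and define $A_1 := A$ and $A_n := \emptyset$ for $n \ge 2$. Then $A = \bigcup_{n=1}^\infty A_n$ and
$m(A) = \sum_{n=1}^\infty m(A_n) \ge \mu^*(A)$. For the converse inequality assume that $A \subset \bigcup_{n=1}^\infty A_n$ for
some sets $A_n \in \mathcal A$ and define $B_1 := A \cap A_1$ and
$$B_n := A \cap A_n \cap \Big( \bigcup_{i=1}^{n-1} A_i \Big)^c.$$
Then the $B_n$ are disjoint, $B_n \subset A_n$, $A = \bigcup_{n=1}^\infty B_n$ and
$$\sum_{n=1}^\infty m(A_n) \ge \sum_{n=1}^\infty m(B_n) = m\Big( \bigcup_{n=1}^\infty B_n \Big) = m(A).$$
Passing to the infimum in the definition of $\mu^*(A)$ this yields $m(A) \le \mu^*(A)$.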
$$\mu^*(E) := \inf\Big\{ \sum_{n=1}^\infty m(A_n) : E \subset \bigcup_{n=1}^\infty A_n,\ A_n \in \mathcal A \Big\}.$$
n=1
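$A := \bigcup_{i=1}^\infty A_i$. Then
$$\sum_{i=1}^\infty \mu^*(A_i) = \lim_{n\to\infty} \sum_{i=1}^n \mu^*(A_i) = \lim_{n\to\infty} \mu^*\Big( \bigcup_{i=1}^n A_i \Big) \le 1.$$
Therefore $\sum_{i=n+1}^\infty \mu^*(A_i)$ converges to 0 and thus
$$\mu^*\Big( A \,\triangle\, \bigcup_{i=1}^n A_i \Big) \le \mu^*\Big( \bigcup_{i=n+1}^\infty A_i \Big) \le \sum_{i=n+1}^\infty \mu^*(A_i) \to 0,$$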
Furthermore
where we have used the -subadditivity of . This means A A = A.
$$\mu^*(A) = \lim_{n\to\infty} \mu^*\Big( \bigcup_{i=1}^n A_i \Big) = \sum_{i=1}^\infty \mu^*(A_i).$$
X
X
[
m(An ) =
m(Ci )
m(Ci ) = m
Ci = m(A)
i=1
i=1
i=1
Bnc .
[
X
[
m
Ci
m(Ci ) = m
Ci 0
i=1
since
i=n+1
i=1
i=n+1
Ci .
In particular the third condition is useful because often it can be checked easily with the
help of a compactness argument.
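Example 2.8 Let $\mathcal A$ be the set-algebra at the beginning of this chapter. Let $F : \mathbb R \to [0, 1]$
be right-continuous and increasing. Define for disjoint intervals
$$m\Big( \bigcup_{i=1}^m (a_i, b_i] \Big) = \sum_{i=1}^m \big( F(b_i) - F(a_i) \big).$$
Then clearly $m$ is a content. It is even $\sigma$-additive. In fact let $A_n = \bigcup_{i=1}^{m_n} (a_i^n, b_i^n]$ decrease to
the empty set. Let $\varepsilon > 0$. By right-continuity we can find $\alpha_i^n > a_i^n$ such that
$$\sum_{n=1}^\infty \sum_{i=1}^{m_n} \big( F(\alpha_i^n) - F(a_i^n) \big) < \varepsilon.$$
The compact sets $A'_n = \bigcup_{i=1}^{m_n} [\alpha_i^n, b_i^n]$ have empty intersection and therefore there is a number
$N$ such that $\bigcap_{n=1}^N A'_n = \emptyset$. We claim that $m(A_r) < \varepsilon$ for all $r \ge N$. Let $x \in A_r$, then
$x \notin A'_n$ for at least one index $n \le N$. On the other hand $x \in A_n$ for this index since the $A_n$
decrease. Thus
$$A_r \subset \bigcup_{n=1}^N A_n \setminus A'_n \subset \bigcup_{n=1}^N \bigcup_{i=1}^{m_n} (a_i^n, \alpha_i^n]$$
and hence
$$m(A_r) \le \sum_{n=1}^N \sum_{i=1}^{m_n} \big( F(\alpha_i^n) - F(a_i^n) \big) < \varepsilon.$$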
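The content of Example 2.8 is simple to compute for a concrete $F$. The sketch below uses the logistic distribution function as an assumed example of a right-continuous increasing $F$ with values in $[0, 1]$; the names `F` and `content` are illustrative, and the code only demonstrates finite additivity when an interval is split.

```python
import math

def F(x):
    # Logistic distribution function: an assumed example of a
    # right-continuous, increasing function with values in [0, 1].
    return 1.0 / (1.0 + math.exp(-x))

def content(intervals):
    """m of a disjoint union of half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

whole = content([(0.0, 2.0)])
split = content([(0.0, 1.0), (1.0, 2.0)])   # same set, written as two pieces
print(whole, split)
```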
2.2
1. $P(\Omega) = 1 = Q(\Omega)$. Thus $\Omega \in \mathcal M$.
and M. If A1 , A2 , . . . M then
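$$A_1 \cup \cdots \cup A_n = (A_1^c \cap \cdots \cap A_n^c)^c \in \mathcal M \quad \text{for all } n.$$
Now $\bigcup_{i=1}^n A_i \uparrow \bigcup_{i=1}^\infty A_i$ implies $\bigcup_{i=1}^\infty A_i \in \mathcal M$.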
M :=
C = A2 :
.
class which contains E
C monotone
class, EC
Define
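$$\mathcal D_1 := \{ A \subset \Omega : A \cap E \in \mathcal M \text{ for all } E \in \mathcal E \}$$
Then $\mathcal E \subset \mathcal D_1$: If $A \in \mathcal E$ and $E \in \mathcal E$ then $A \cap E \in \mathcal E \subset \mathcal M$. $\mathcal D_1$ is a monotone class: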
Chapter 3
Random variables
3.1
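Definition 3.1 Let $(\Omega, \mathcal F)$ and $(S, \mathcal B)$ be two measurable spaces. A map $f : \Omega \to S$ is called
$\mathcal F$-$\mathcal B$-measurable if $f^{-1}(B) \in \mathcal F$ for all $B \in \mathcal B$. In the particular case where $S = \mathbb R^d$ and
$\mathcal B = \mathcal B(\mathbb R^d)$ one talks about $\mathcal F$-measurable functions.
Definition 3.2 Let $(\Omega, \mathcal F, P)$ be a probability space. Then a random variable $X$ with state
space $(S, \mathcal B)$ is an $\mathcal F$-$\mathcal B$-measurable map $X : \Omega \to S$.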
Example 3.3 In descriptive statistics examples of the following type are important. Let
$\Omega$ be a finite set with the Laplace probability (cf. Section 1.3). Since the $\sigma$-algebra is $2^\Omega$ all
functions from $\Omega$ are measurable irrespective of the $\sigma$-algebra in the image space. Therefore
this $\sigma$-algebra often is not specified. The image space can be a set of numbers, a set of
qualitative properties or any mixture. For example let $\Omega$ be the set of all students at the
TU Kaiserslautern. Let
$X_1(\omega)$ = major (subject) of $\omega$,
$X_2(\omega)$ = gender of $\omega$,
$X(\omega) = (X_1(\omega), X_2(\omega))$,
$X_3(\omega)$ = size of $\omega$ in cm.
Example 3.4 Here is a simple example of a geometric quantity considered as a random
variable. Let $\Omega \subset \mathbb R^2$ be the unit disk with $\mathcal F = \mathrm{Borel}(\Omega)$. Then $X(\omega) = 1 - \|\omega\|$ is
the distance to the boundary of the random point $\omega$. The measurability will be verified in
Corollary 3.17.
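Example 3.4 can be simulated once a probability measure on the disk is fixed. The sketch below assumes the uniform distribution on the unit disk (the example itself does not specify $P$) and draws points by rejection sampling from the square $[-1, 1]^2$; the helper names and the seed are illustrative. Under this assumption $P(X \le t) = 1 - (1 - t)^2$ for $0 \le t \le 1$, since $\{X \le t\}$ is the annulus between the radii $1 - t$ and $1$.

```python
import math
import random

def sample_unit_disk(rng):
    """Rejection sampling: uniform point in the unit disk (assumed law)."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return (x, y)

def X(omega):
    """Distance of the point omega to the boundary of the unit disk."""
    return 1.0 - math.hypot(omega[0], omega[1])

rng = random.Random(0)                      # fixed seed for reproducibility
values = [X(sample_unit_disk(rng)) for _ in range(20000)]

# Empirical estimate of P(X <= 0.5); the exact value is 1 - 0.5**2 = 0.75.
frac = sum(v <= 0.5 for v in values) / len(values)
print(frac)
```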
Definition 3.5 Let $X$ be a random variable with state space $(S, \mathcal B)$. Then $P_X : \mathcal B \to [0, 1]$
defined by
$$P_X(B) := P(X^{-1}(B)) =: P(\{X \in B\})$$
is called the distribution of $X$ or the law of $X$. Sometimes it is denoted by $\mathcal L(X)$ and
$P(\{X \in B\})$ is denoted by $P(X \in B)$ or $P\{X \in B\}$.
Observation 3.6 $P_X$ is a probability measure on the $\sigma$-algebra $\mathcal B$.
Proof. Clearly $P_X(S) = P(X^{-1}(S)) = P(\Omega) = 1$. For the proof of $\sigma$-additivity let
$B_1, B_2, \ldots$ be disjoint elements of $\mathcal B$ and let $B := \bigcup_{i=1}^\infty B_i$. Then also $X^{-1}(B_1), X^{-1}(B_2), \ldots$
are disjoint and $X^{-1}(B) = \bigcup_{i=1}^\infty X^{-1}(B_i)$. Therefore,
$$P_X(B) = P(X^{-1}(B)) = \sum_{i=1}^\infty P(X^{-1}(B_i)) = \sum_{i=1}^\infty P_X(B_i).$$
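On a finite probability space the distribution $P_X$ is obtained by adding up the probabilities of all points with the same image. The following sketch computes the law of the maximum of two dice; the Laplace probability on the 36 outcomes and the function name `law` are assumptions of the illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Finite probability space: two dice with the Laplace probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, len(omega)) for w in omega}

def law(X, P):
    """Distribution P_X of X: P_X({x}) = P(X^{-1}({x}))."""
    dist = defaultdict(Fraction)
    for w, pw in P.items():
        dist[X(w)] += pw
    return dict(dist)

P_X = law(max, P)          # X(w) = larger of the two numbers shown
print(P_X)
print(sum(P_X.values()))   # total mass 1, as in Observation 3.6
```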
Remark 3.11 If $X$ has a density $f_X$ in $\mathbb R$, then $F_X(z) = \int_{-\infty}^z f_X(x)\,dx$ for all $z$. If $f_X$ is
continuous then by the fundamental theorem of calculus $F_X'(z) = f_X(z)$ for all $z$. If $f_X$ is
only Borel measurable, then by the (more subtle) differentiation theorem of Lebesgue this
relation still holds Lebesgue-almost everywhere.
Example 3.12 If $X \sim N(0, 1)$ then
$$F_X(z) = \Phi(z) := \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx.$$
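The function $\Phi$ has no elementary closed form, but it can be expressed through the error function as $\Phi(z) = \tfrac12 \big(1 + \operatorname{erf}(z/\sqrt 2)\big)$, which is available in Python's standard library. A small sketch (the function name `Phi` is an illustrative choice):

```python
import math

def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-1.96, 0.0, 1.0, 1.96):
    print(z, Phi(z))
```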
This theorem is useful not only theoretically, but also numerically for the simulation of a
random variable with the given distribution $P$: We determine $F$ from $P$, and $G$ from $F$ as
above. Computers give good simulations of $U$ and then $X = G(U)$ is what we want. Of
course this procedure is only efficient if the values of the function $G$ are easy to compute.
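The simulation recipe can be sketched for the $\mathrm{Exp}(\lambda)$ distribution, where $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$ and the inverse is explicit. It is assumed here that $G$ is the (generalised) inverse of $F$ and that $U$ is uniform on $[0, 1)$; the names `G`, `lam` and the seed are illustrative.

```python
import math
import random

def G(u, lam):
    """Inverse of F(x) = 1 - exp(-lam * x) on [0, 1)."""
    return -math.log(1.0 - u) / lam

rng = random.Random(1)      # fixed seed for reproducibility
lam = 2.0
samples = [G(rng.random(), lam) for _ in range(50000)]

mean = sum(samples) / len(samples)
print(mean)                 # should be close to 1 / lam = 0.5
```

For distributions such as $N(0, 1)$ the inverse of $F$ is not explicit, which is the efficiency caveat mentioned above.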
3.2 Verifying measurability
The verification of the measurability in the definition of a random variable looks difficult
because it involves all elements of the $\sigma$-algebra in the state space. Here we collect some
useful tools which simplify this task considerably.
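Lemma 3.15 Let $(\Omega, \mathcal F)$ and $(S, \mathcal B)$ be two measurable spaces. Let $\mathcal E$ be a system of subsets
of $S$ such that $\sigma(\mathcal E) = \mathcal B$. If $X : \Omega \to S$ is a function such that $\{X \in E\} \in \mathcal F$ for all $E \in \mathcal E$,
then $X$ is a random variable.
Proof. Consider the system $\mathcal G$ of all subsets $G$ of $S$ such that $\{X \in G\} \in \mathcal F$. By assumption
every $E \in \mathcal E$ belongs to $\mathcal G$, i.e. $\mathcal E \subset \mathcal G$. In order to prove $\mathcal B \subset \mathcal G$, i.e. $\{X \in B\} \in \mathcal F$ for all
$B \in \mathcal B$, we verify that $\mathcal G$ is a $\sigma$-algebra:
1. $\emptyset \in \mathcal G$ because $X^{-1}(\emptyset) = \emptyset \in \mathcal F$.
2. If $G \in \mathcal G$ then $X^{-1}(G^c) = (X^{-1}(G))^c \in \mathcal F \Rightarrow G^c \in \mathcal G$.
3. If $G_n \in \mathcal G$ for all $n$:
$$X^{-1}\Big( \bigcup_{n=1}^\infty G_n \Big) = \bigcup_{n=1}^\infty X^{-1}(G_n) \in \mathcal F,$$
i.e. $\bigcup_{n=1}^\infty G_n \in \mathcal G$.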
Remark 3.16 An elegant, but somewhat cryptic, version of the last lemma is the identity
$$f^{-1}(\sigma(\mathcal E)) = \sigma(f^{-1}(\mathcal E))$$
which holds for all maps $f : \Omega \to S$ and systems $\mathcal E$ of subsets of the image space $S$.
Proof. "$\subset$" is proved like in the last lemma.
"$\supset$" is easy since $f^{-1}(\sigma(\mathcal E))$ is a $\sigma$-algebra which contains $f^{-1}(\mathcal E)$.
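$\bar{\mathbb R} := \mathbb R \cup \{-\infty, \infty\}$. The set $\bar{\mathbb R}$ is equipped with the $\sigma$-algebra
$$\mathcal B(\bar{\mathbb R}) = \{ B \subset \bar{\mathbb R} : B \cap \mathbb R \in \mathcal B(\mathbb R) \}.$$
An $\mathcal F$-$\mathcal B(\bar{\mathbb R})$ measurable map $X : \Omega \to \bar{\mathbb R}$ is called an extended real valued random variable.
Note that this is consistent: A map $X : \Omega \to \mathbb R$ is a real valued random variable in the old
sense if and only if it is an extended real valued random variable when $X$ is considered as
a map into $\bar{\mathbb R}$. We simply call $X$ a random variable.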
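Proposition 3.20 A map $X : \Omega \to \bar{\mathbb R}$ is a random variable if and only if the sets $\{X \le a\}$
are in $\mathcal F$ for all $a \in \mathbb R$.
The same holds for the sets $\{X < a\}$, $\{X \ge a\}$ or $\{X > a\}$.
Proof. "$\Rightarrow$" is trivial and "$\Leftarrow$" follows from Lemma 3.15 if we consider
$$\mathcal E = \{ [-\infty, a] : a \in \mathbb R \}$$
since $\sigma(\mathcal E) = \mathcal B(\bar{\mathbb R})$.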
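Proposition 3.25 If $X_1, X_2, \ldots$ are $\bar{\mathbb R}$-valued random variables then the point-wise defined
maps $\sup_n X_n$, $\inf_n X_n$, $\limsup_n X_n$, $\liminf_n X_n$ are random variables. Moreover the
set
$$\{ \omega : \lim_{n} X_n(\omega) \text{ exists} \} = \{ \omega : \limsup_n X_n(\omega) = \liminf_n X_n(\omega) \}$$
is in $\mathcal F$.
Proof. $\sup_n X_n$ is a random variable, since
$$\{ \sup_n X_n \le a \} = \{ \omega : X_n(\omega) \le a \ \forall n \in \mathbb N \} = \bigcap_{n \in \mathbb N} \{ X_n \le a \} \in \mathcal F \quad \text{for all } a \in \mathbb R.$$
The other three cases follow immediately from
$$\inf_n X_n = -\sup_n (-X_n), \qquad \limsup_n X_n = \inf_{n \in \mathbb N} \sup_{m \ge n} X_m, \qquad \liminf_n X_n = \sup_{n \in \mathbb N} \inf_{m \ge n} X_m.$$
For the last statement we need in addition the following fact: If $X, Y$ are random variables
then $\{X \ne Y\}$ and hence $\{X = Y\}$ are in $\mathcal F$ since
$$\{X \ne Y\} = \bigcup_{r \in \mathbb Q} \big( \{X < r\} \cap \{Y \ge r\} \big) \cup \big( \{X \ge r\} \cap \{Y < r\} \big) \in \mathcal F.$$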
We conclude this chapter with the approximation of non-negative random variables by linear
combinations of indicator functions of events. Note that $1_A$ is a random variable for all
$A \in \mathcal F$ because $\{1_A \le a\}$ is either $\Omega$ (if $a \ge 1$) or $A^c$ (if $0 \le a < 1$) or $\emptyset$ (if $a < 0$).
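Theorem 3.26 Let $X$ be a nonnegative real valued random variable. Then $X = \sup_n X_n$
where
$$X_n = \sum_{k=1}^{n 2^n} k\, 2^{-n}\, 1_{\{k 2^{-n} \le X < (k+1) 2^{-n}\}}, \quad n \in \mathbb N. \qquad (*)$$
Proof. If $X(\omega)$ is given then for $n \ge X(\omega)$ we have $X_n(\omega) \le X(\omega) < X_n(\omega) + 2^{-n}$ because
$X_n(\omega)$ is the left end point of the dyadic interval of length $2^{-n}$ in which $X(\omega)$ lies.
[Figure: the approximations $X_1, X_2, X_3, X_4$ of $X$ on a number line with the dyadic points 1/4 and 1/2 marked.]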
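The dyadic approximation of Theorem 3.26 can be evaluated pointwise: for a value $x = X(\omega)$ the sum has at most one non-zero term, namely the one with $k = \lfloor x 2^n \rfloor$, provided $1 \le k \le n 2^n$. A minimal sketch (the function name `dyadic_approx` is an illustrative choice):

```python
import math

def dyadic_approx(x, n):
    """Value of X_n at a point where X takes the value x >= 0."""
    k = math.floor(x * 2 ** n)
    if 1 <= k <= n * 2 ** n:
        return k / 2 ** n      # left end point of the dyadic interval
    return 0.0                 # x < 2**-n or x too large for level n

x = 0.7
for n in range(1, 6):
    print(n, dyadic_approx(x, n))
```

For $n \ge x$ the printed values increase to $x$ and differ from it by less than $2^{-n}$, as stated in the proof.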
Pn
i=1
Corollary 3.28 If $X$ is a $[0, \infty]$-valued random variable, then there is a sequence $(X_n)$ of
elementary random variables with $X_n \uparrow X$.
Proof. Replace $X$ by $\min(X, n)$ in $(*)$. If $X(\omega) < \infty$ then $\min(X(\omega), n) = X(\omega)$ for large
$n$ and by Theorem 3.26 $\sup_{n \in \mathbb N} X_n(\omega) = X(\omega)$. Otherwise, if $X(\omega) = \infty$ then $X_n(\omega) = n$
for all $n \in \mathbb N$.
Chapter 4
Expectation
We assume that the reader knows the definition of the integral in measure theory. We review
some basic properties which are important for the probabilistic interpretation.
Example 4.1 Let us suppose that in some game we get the amount $a$ if $A$ happens and
$b$ if $A^c$ happens. Then the gain is $X(\omega) = a \cdot 1_A(\omega) + b \cdot 1_{A^c}(\omega)$. The expected gain is:
$E(X) = a \cdot P(A) + b \cdot P(A^c)$. If we pay $m$ in advance for playing this game then $a - m$ is
the net gain and $m - b$ is the net loss. Then $m = E(X)$ solves the balance equation
$$P(A)(a - m) = P(A^c)(m - b).$$
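The balance equation is easy to verify with exact arithmetic. In the sketch below the numbers $P(A) = 1/6$, $a = 10$, $b = -1$ are arbitrary illustrative values, not taken from the text.

```python
from fractions import Fraction

p_A = Fraction(1, 6)      # assumed probability of A
a = Fraction(10)          # amount received if A happens
b = Fraction(-1)          # amount received if A^c happens

m = a * p_A + b * (1 - p_A)          # m = E(X)

lhs = p_A * (a - m)                  # P(A) * net gain
rhs = (1 - p_A) * (m - b)            # P(A^c) * net loss
print(m, lhs, rhs)
```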
Remark 4.2 If $(S, \mathcal G, \mu)$ is a measure space, then the integral $\int_S f\,d\mu$ (or $\int_S f(x)\,d\mu(x)$ or
$\int_S f(x)\,\mu(dx)$) is characterized by three rules
1. If $f = \sum_{i=1}^n a_i 1_{A_i}$ is an elementary function, then
$$\int \sum_{i=1}^n a_i 1_{A_i}\,d\mu = \sum_{i=1}^n a_i\, \mu(A_i)$$
Remark 4.4 The expectation depends only on $P_X$. If for example $X$ takes only the pairwise
distinct values $x_1, \ldots, x_m$, then
$$X = \sum_{k=1}^m x_k 1_{\{X = x_k\}}$$
and therefore $E(X) = \sum_{k=1}^m x_k p_k$, where $p_k := P\{X = x_k\} = P_X(\{x_k\})$. The general case
is treated in the following theorem.
Proof. We prove the theorem by the standard procedure: In the first step we restrict
ourselves to elementary functions, in the second step we allow non-negative measurable
functions and in the third step we generalize the result to measurable functions.
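1. Let $g$ be elementary: $g = \sum_{j=1}^n b_j 1_{B_j}$ where the $B_j$ are measurable in the $\sigma$-algebra of $S$.
Then
$$g(X)(\omega) = \sum_{j=1}^n b_j 1_{B_j}(X(\omega)) = \sum_{j=1}^n b_j 1_{\{X \in B_j\}}(\omega)$$
and therefore
$$E(g(X)) = \int_\Omega g(X)\,dP = \sum_{j=1}^n b_j P\{X \in B_j\} = \sum_{j=1}^n b_j P_X(B_j) = \int_S g\,dP_X.$$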
$$\sum_{k=0}^n k^2 \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=0}^n k^2 P_X\{k\} = \int_{\mathbb R} x^2\,dP_X.$$
Later we shall see several ways how to evaluate this expression explicitly.
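Remark 4.7 Let $\delta_x$ be the Dirac measure at $x$: $\delta_x(A) = 1$ if $x \in A$ and $\delta_x(A) = 0$ if $x \notin A$.
Note that for $\mu(A) = \sum_{k=1}^n \alpha_k \delta_{x_k}(A)$ the integral is the sum
$$\int f\,d\mu = \sum_{k=1}^n \alpha_k f(x_k).$$
In particular
$$\int f\,d\delta_x = f(x).$$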
The following gives the two most popular forms for the expectation rule.
Proposition 4.8
Let $X$ be a random variable with a discrete distribution, i.e. $P(X = x_k) = p_k$ for $x_1, x_2, \ldots$ and $\sum_k p_k = 1$. Then $E(g(X)) = \sum_{k=1}^\infty g(x_k) p_k$.
If $X$ is $\mathbb R^d$-valued and has a density $f_X$, then $E(g(X)) = \int_{\mathbb R^d} g(x) f_X(x)\,dx$.
Example 4.9
$$\int_{-\infty}^\infty x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = 0.$$
Let us come back to discrete distributions on $\mathbb N_0$. For these the expectation can sometimes
be computed conveniently with the help of probability generating functions. An extension
of this idea will come up in Section 4.2 on moment generating functions and in Chapter
10 on characteristic functions.
Definition 4.10 Let $(p_k)_{k \in \mathbb N_0}$ be a probability vector and let $X$ be a random variable with
k pk z k1 =
k=0
k pk z k1 d
0
Then
k pk d =
0
k pk = E(X)
k=0
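for arbitrary $z_n \uparrow 1$. This implies the first assertion. The second assertion follows similarly:
$$\lim_{n\to\infty} G''_X(z_n) = \lim_{n\to\infty} \sum_{k=0}^\infty k(k-1)\, p_k\, z_n^{k-2} = \sum_{k=0}^\infty k(k-1)\, p_k$$
for arbitrary $z_n \uparrow 1$ and additivity yields $E(X^2) = G'_X(1) + G''_X(1)$.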
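Example 4.12
Let $P_X = B(n, p)$. Then
$$G_X(z) = \sum_{k=0}^n \binom{n}{k} p^k (1-p)^{n-k} z^k = (pz + 1 - p)^n.$$
Now $G'_X(z) = n \cdot (pz + 1 - p)^{n-1} \cdot p$ and thus $G'_X(1) = n \cdot p = E(X)$. Furthermore
$$G''_X(z) = n \cdot (n - 1) \cdot (pz + 1 - p)^{n-2} \cdot p^2$$
and hence $E(X^2) = G''_X(1) + G'_X(1) = p^2 \cdot n \cdot (n - 1) + p \cdot n$.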
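Since the generating function of $B(n, p)$ is a polynomial, the derivatives at $z = 1$ can be computed exactly from its coefficients and compared with the closed formulas of Example 4.12. The sketch below uses exact fractions; the helper names and the parameters $n = 10$, $p = 3/10$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Coefficients p_k of the generating function G_X(z) = sum p_k z^k."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def derivative(coeffs):
    """Coefficients of the derivative of the polynomial sum c_k z^k."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def evaluate(coeffs, z):
    return sum(c * z ** k for k, c in enumerate(coeffs))

n, p = 10, Fraction(3, 10)
G = binomial_pmf(n, p)

G1 = evaluate(derivative(G), 1)                  # G'_X(1)
G2 = evaluate(derivative(derivative(G)), 1)      # G''_X(1)
second_moment = sum(k * k * pk for k, pk in enumerate(G))

print(G1, n * p)                                 # E(X) = n p
print(G1 + G2, second_moment)                    # E(X^2) in two ways
```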
Definition 4.13 Let $(\Omega, \mathcal F, P)$ be a probability space. An event $F \in \mathcal F$ is called a nullset
if $P(F) = 0$. An event $A$ happens almost surely (a.s.) if $P(A) = 1$, i.e. $A^c$ is a nullset.
$$P\Big( \Big( \bigcap_n A_n \Big)^c \Big) = P\Big( \bigcup_n A_n^c \Big) \le \sum_n P(A_n^c) = 0$$
Proposition 4.15
1. If $X$ and $Y$ are random variables such that $X = Y$ almost surely and $X$ has an
expectation, then $Y$ has the same expectation.
2. If $X \le Y$ almost surely, then $E(X) \le E(Y)$, provided both expectations exist.
3. $E(\alpha X + \beta Y) = \alpha E(X) + \beta E(Y)$ if either $X, Y \ge 0$ and $\alpha, \beta \ge 0$ or $E(X), E(Y)$ finite.
Theorem 4.16 (Markov's inequality) Let $X$ be a nonnegative random variable and let
$a > 0$. Then $P(X \ge a) \le a^{-1} E(X)$.
Proof. Define $Y = a \cdot 1_{\{X \ge a\}}$. Then $X \ge Y$ and thus $E(X) \ge E(Y) = a \cdot P(X \ge a)$.
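Markov's inequality also holds for the empirical distribution of any non-negative sample, because the proof only uses $X \ge a \cdot 1_{\{X \ge a\}}$. The following sketch checks this on simulated data; the choice of $\mathrm{Exp}(1)$ samples, the seed and the thresholds are assumptions of the illustration.

```python
import random

rng = random.Random(2)                               # fixed seed
samples = [rng.expovariate(1.0) for _ in range(10000)]
mean = sum(samples) / len(samples)                   # empirical E(X)

def tail(a):
    """Empirical P(X >= a)."""
    return sum(x >= a for x in samples) / len(samples)

for a in (0.5, 1.0, 2.0, 5.0):
    print(a, tail(a), mean / a)                      # tail(a) <= mean / a
```

The bound is far from sharp for this distribution, which is typical: Markov's inequality trades precision for generality.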
1
n }.
Then P (An )
1
1
n
n=1
An = {|X| > 0}
1
n
E(|X|) 0.
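Corollary 4.19 (First Lemma of Borel-Cantelli) If $(A_n)$ is a sequence of events such
that $\sum_{n=1}^\infty P(A_n) < \infty$, then almost surely only finitely many of the $A_n$ happen:
$$P(\text{infinitely many } A_n \text{ occur}) = P\Big( \bigcap_{n=1}^\infty \bigcup_{m=n}^\infty A_m \Big) = 0.$$
Proof. Let $X = \sum_{n=1}^\infty 1_{A_n}$. Then
$$E(X) = \sum_{n=1}^\infty E(1_{A_n}) = \sum_{n=1}^\infty P(A_n) < \infty,$$
so $X < \infty$ almost surely, i.e. almost surely only finitely many of the $A_n$ occur.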
4.1 Convergence theorems
The same holds true if $X_n \ge Y$ for all $n$ and a random variable $Y$ with finite expectation.
Proof. If $X_n \ge 0$ use $\liminf_n X_n = \sup_{n \in \mathbb N} ( \inf_{m \ge n} X_m )$ and monotone convergence. In the second
Combining this with the corresponding mirror estimate for the lim sup we get
Theorem 4.22 (Lebesgue's theorem of dominated convergence)
Let $X_n$ be a sequence of random variables such that there exists a random variable $Y$ with
$E(Y) < \infty$ and $|X_n| \le Y$ for all $n$. If $\lim_n X_n = X$ almost surely, then $E(X) = \lim_n E(X_n)$.
Xn X
tn t0 .
Then
E(Xtn ) E(Xt0 )
=E
tn t0
Xtn Xt0
tn t0
= E(n ).
d
Xt ()
dt
t=tn ()
4.2
Definition 4.24 Let $X$ be a real random variable. Assume there exists $\varepsilon > 0$ such that
$E(e^{\lambda X}) < \infty$ for all $\lambda \in (-\varepsilon, \varepsilon)$. Then $\lambda \mapsto M_X(\lambda) := E(e^{\lambda X})$ is called moment generating
function.
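$$\frac{d^r}{d\lambda^r} M_X(\lambda) \Big|_{\lambda = 0}$$
Proof. Let $r \in \mathbb N$. We start with the second assertion: Obviously $\frac{d^r}{d\lambda^r} e^{\lambda X} = X^r e^{\lambda X}$ and
thus the map $\lambda \mapsto e^{\lambda X}$ is continuously differentiable. Choose $0 < \varepsilon' < \varepsilon'' < \varepsilon$. Then there is a
constant $c_r$ such that $|x|^r e^{\varepsilon' |x|} \le c_r e^{\varepsilon'' |x|}$ for all $x \in \mathbb R$. By setting $Y := c_r e^{\varepsilon'' |X|}$
we get $|X^r| e^{\lambda X} \le Y$ for $|\lambda| \le \varepsilon'$ and $E(Y) < \infty$. Theorem 4.23 now gives
$$\frac{d^r}{d\lambda^r} M_X(\lambda) \Big|_{\lambda = 0} = E\Big( \frac{d^r}{d\lambda^r} e^{\lambda X} \Big|_{\lambda = 0} \Big) = E(X^r).$$
For the first assertion one needs to extend this argument to the case of complex $\lambda$. Since
$|e^{(\lambda + i\eta) X}| = e^{\lambda X}$ one sees that the moment generating function is even holomorphic inside
the circle $\{z \in \mathbb C : |z| < \varepsilon\}$. Then by the second assertion the formula for $M_X$ is just the
Taylor expansion of this function.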
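Example 4.26 Let $Z \sim N(0, 1)$ and $X := \mu + \sigma Z$. Then $F_X(x) = P(X \le x) = P\big(Z \le \frac{x - \mu}{\sigma}\big)
= \Phi\big(\frac{x - \mu}{\sigma}\big)$ and writing $\varphi$ for $\Phi'$ we get the density
$$f_X(x) = F_X'(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
$$M_Z(\lambda) = E\big(e^{\lambda Z}\big) = \int_{\mathbb R} e^{\lambda x}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = e^{\lambda^2/2} \int_{\mathbb R} \frac{1}{\sqrt{2\pi}}\, e^{-(x - \lambda)^2/2}\,dx = e^{\lambda^2/2}$$
because the last integrand is a probability density. Finally
$$M_X(\lambda) = E\big(e^{\lambda(\mu + \sigma Z)}\big) = e^{\lambda\mu}\, E\big(e^{\lambda\sigma Z}\big) = e^{\lambda\mu}\, e^{(\lambda\sigma)^2/2} = e^{\lambda\mu + \lambda^2\sigma^2/2}.$$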
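The closed formula for the moment generating function of a normal random variable can be compared with a direct numerical integration of $e^{\lambda x} f_X(x)$. The sketch below uses a trapezoid rule on a window of $\pm 12\sigma$ around $\mu$; the window, the step count, the parameter values and the function names are assumptions of this illustration.

```python
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def mgf_numeric(lam, mu, sigma, steps=40000, width=12.0):
    """Trapezoid approximation of the integral of exp(lam*x) * f_X(x)."""
    a, b = mu - width * sigma, mu + width * sigma   # assumed truncation
    h = (b - a) / steps
    def g(x):
        return math.exp(lam * x) * normal_density(x, mu, sigma)
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

def mgf_formula(lam, mu, sigma):
    return math.exp(lam * mu + lam ** 2 * sigma ** 2 / 2.0)

lam, mu, sigma = 0.7, 1.0, 2.0
print(mgf_numeric(lam, mu, sigma), mgf_formula(lam, mu, sigma))
```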
Chapter 5
Basic concepts
P (A B)
P (B)
i=1
P (Bi ) P (A|Bi )
i=1
P (A|Bi ) P (Bi )
i=1
1 In this formula $P(A|B_i)$ is only defined if $P(B_i) > 0$, but otherwise the product is 0 anyway.
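Definition 5.5
1. Let $X : (\Omega, \mathcal F) \to (S, \mathcal B)$ and $Y : (\Omega, \mathcal F) \to (T, \mathcal C)$ be two random variables. Then $X$
and $Y$ are called independent, if $P(X \in B, Y \in C) = P(X \in B) \cdot P(Y \in C)$ for all
$B \in \mathcal B$ and $C \in \mathcal C$.
2. Let $\mathcal G$ and $\mathcal H$ be two subsystems of $\mathcal F$. Then $\mathcal G$ and $\mathcal H$ are called independent, if
$P(G \cap H) = P(G) \cdot P(H)$ for all $G \in \mathcal G$ and $H \in \mathcal H$.
Definition 5.6 If $X : \Omega \to (S, \mathcal B)$ is a random variable, then the system of all sets $\{X \in B\}$
for $B \in \mathcal B$ is a $\sigma$-algebra, the $\sigma$-algebra generated by $X$. It is denoted by $\sigma(X)$.
Remark 5.7 Obviously $X$ and $Y$ are independent if and only if $\sigma(X)$ and $\sigma(Y)$ are independent.
Proposition 5.8 If $X$ and $Y$ are independent and $\varphi : (S, \mathcal B) \to (S', \mathcal B')$ and $\psi : (T, \mathcal C) \to
(T', \mathcal C')$ are measurable, then $\varphi(X)$ and $\psi(Y)$ are also independent.
Proof. Let $B' \in \mathcal B'$ and $C' \in \mathcal C'$. Then $\{\varphi(X) \in B'\} = \{X \in \varphi^{-1}(B')\}$ and similarly
$\{\psi(Y) \in C'\} = \{Y \in \psi^{-1}(C')\}$. Thus
$$P(\varphi(X) \in B', \psi(Y) \in C') = P(X \in \varphi^{-1}(B'), Y \in \psi^{-1}(C'))$$
$$= P(X \in \varphi^{-1}(B')) \cdot P(Y \in \psi^{-1}(C')) \quad \text{by independence}$$
$$= P(\varphi(X) \in B') \cdot P(\psi(Y) \in C').$$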
Theorem 5.9 If $\mathcal E_1$ and $\mathcal E_2$ are $\cap$-stable systems of sets which are independent, then $\sigma(\mathcal E_1)$
and $\sigma(\mathcal E_2)$ are also independent.
Proof. Let $\mathcal G_1 := \sigma(\mathcal E_1)$ and $\mathcal G_2 := \sigma(\mathcal E_2)$.
1. Fix $E_1 \in \mathcal E_1$. Claim: Every $G_2 \in \mathcal G_2$ is independent of $E_1$.
If $P(E_1) = 0$ the claim is trivial. If $P(E_1) > 0$ then $P^{E_1}(E_2) = P(E_2)$ for all $E_2 \in \mathcal E_2$
by our assumption, where $P^{E_1} := P(\,\cdot \mid E_1)$. Therefore the two probability measures $P^{E_1}$ and $P$ agree on $\mathcal E_2$.
The uniqueness theorem of probability measures (Theorem 2.13) implies that $P^{E_1} = P$
on $\sigma(\mathcal E_2) = \mathcal G_2$. Therefore $P(G_2 \cap E_1) = P^{E_1}(G_2) \cdot P(E_1) = P(G_2) \cdot P(E_1)$ for all
$G_2 \in \mathcal G_2$.
2. Fix now $G_2 \in \mathcal G_2$. Prove in the same way, using $P^{G_2}$ and $P$, that every $G_1 \in \mathcal G_1$ is
independent of $G_2$.
Corollary 5.10 Let $A$ and $B$ be two independent events. Claim: The indicators $1_A$ and
$1_B$ are independent.
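Corollary 5.10 can be checked by brute force on a finite space. The sketch below takes two dice with the Laplace probability (an assumed example) and tests the product rule for the indicator values; since $\sigma(1_A) = \{\emptyset, A, A^c, \Omega\}$, checking the four combinations of values is enough. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # two dice, Laplace probability

def P(event):
    """Probability of the event given as a predicate on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def indicators_independent(A, B):
    """Check P(1_A = a, 1_B = b) = P(1_A = a) P(1_B = b) for a, b in {0, 1}."""
    for a in (0, 1):
        for b in (0, 1):
            joint = P(lambda w: int(A(w)) == a and int(B(w)) == b)
            prod = P(lambda w: int(A(w)) == a) * P(lambda w: int(B(w)) == b)
            if joint != prod:
                return False
    return True

A = lambda w: w[0] % 2 == 0      # first die shows an even number
B = lambda w: w[1] >= 5          # second die shows 5 or 6

print(indicators_independent(A, B))   # independent events
print(indicators_independent(A, A))   # an event is not independent of itself here
```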
5.2
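Remark 5.12 If $X$ and $Y$ are random variables with values in $(S, \mathcal B)$ and $(T, \mathcal C)$ then $(X, Y)$
is a random variable with values in $(S \times T, \mathcal B \otimes \mathcal C)$.
Proof. By assumption $\{(X, Y) \in B \times C\} = \{X \in B\} \cap \{Y \in C\} \in \mathcal F$, hence $(X, Y)^{-1}(E) \in
\mathcal F$ for all $E \in \mathcal E := \{B \times C : B \in \mathcal B, C \in \mathcal C\}$. This yields $(X, Y)^{-1}(\sigma(\mathcal E)) = \sigma((X, Y)^{-1}(\mathcal E)) \subset
\mathcal F$. Therefore $(X, Y)$ is a random variable.
Therefore the following definition is meaningful.
Definition 5.13 The probability measure $P_{(X,Y)} : \mathcal B \otimes \mathcal C \ni E \mapsto P\{(X, Y) \in E\}$ is called
the joint distribution of $(X, Y)$.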
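Proposition 5.14 Let $\mu$ and $\nu$ be two probability measures on $\mathcal B$ and $\mathcal C$ respectively. Then
there is a unique probability measure $\mu \otimes \nu$ on $\mathcal B \otimes \mathcal C$ such that
$$\mu \otimes \nu(B \times C) = \mu(B) \cdot \nu(C) \qquad (*)$$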
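$$\int_S \nu\Big( \Big( \bigcup_{i=1}^\infty D_i \Big)_s \Big)\,d\mu = \int_S \nu\Big( \bigcup_{i=1}^\infty (D_i)_s \Big)\,d\mu = \int_S \sum_{i=1}^\infty \nu((D_i)_s)\,d\mu = \sum_{i=1}^\infty \int_S \nu((D_i)_s)\,d\mu.$$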
Theorem 5.16 Let $X$ and $Y$ be two random variables with values in $(S, \mathcal B)$ and $(T, \mathcal C)$
respectively. Then $X$ and $Y$ are independent if and only if $P_{(X,Y)} = P_X \otimes P_Y$.
Proof. $X$ and $Y$ are independent means that for all $B \in \mathcal B$ and $C \in \mathcal C$ we have $P(X \in
B, Y \in C) = P(X \in B) \cdot P(Y \in C)$, or equivalently $P_{(X,Y)}(B \times C) = P_X \otimes P_Y(B \times C)$.
The uniqueness part of the previous proposition implies the assertion.
Definition 5.17 Probability spaces are also called experiments. If $(\Omega_1, \mathcal F_1, P_1)$, $(\Omega_2, \mathcal F_2, P_2)$
are two experiments, then $(\Omega_1 \times \Omega_2, \mathcal F_1 \otimes \mathcal F_2, P_1 \otimes P_2)$ is called the independent coupling of
the two experiments.
Remark 5.18 The word "coupling" is motivated by the fact that everything which can be
described by one of the two experiments $(\Omega_i, \mathcal F_i, P_i)$ can also be described by the coupling:
1 , 2 ) = X(1 ). Then X and X
Proof. (Sketch) We assume that $\mu$ and $\nu$ are probability measures. The extension to finite
and then to $\sigma$-finite measures is straightforward. Let $D \in \mathcal B \otimes \mathcal C$ and $D^t := \{s : (s, t) \in D\}$.
$$\int 1_D\,d\mu \otimes \nu = \mu \otimes \nu(D) = \int_S \nu(D_s)\,d\mu(s) = \int_T \mu(D^t)\,d\nu(t),$$
where the last equality follows by uniqueness, since both sides define measures which agree on the rectangles $B \times C$.
Furthermore
$$\int_S \nu(D_s)\,d\mu(s) = \int_S \int_T 1_D(s, t)\,d\nu(t)\,d\mu(s)$$
and
$$\int_T \mu(D^t)\,d\nu(t) = \int_T \int_S 1_D(s, t)\,d\mu(s)\,d\nu(t).$$
The extension from indicators to non-negative measurable functions follows now by the standard procedure.
Here is a useful and intuitive consequence.
Theorem 5.20 Let $X$ and $Y$ be two independent random variables with values in $S$ and
$T$ respectively. Let $f$ be as in Fubini's theorem. Then $E(f(X, Y)) = E(F(X))$ where
$F(s) := E(f(s, Y))$.
Proof. Since $X$ and $Y$ are independent, $P_{(X,Y)} = P_X \otimes P_Y$. Therefore
$$E(f(X, Y)) = \int f\,dP_{(X,Y)} = \int f(s, t)\,dP_X \otimes P_Y = \int_S \int_T f(s, t)\,dP_Y\,dP_X = \int_S E(f(s, Y))\,dP_X = E(F(X)).$$
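For independent random variables with finitely many values the statement of Theorem 5.20 is a rearrangement of a finite double sum. The sketch below checks it with exact fractions; the two distributions (a fair die and a fair coin) and the function $f$ are illustrative assumptions.

```python
from fractions import Fraction

# Laws of two independent random variables (assumed examples).
P_X = {s: Fraction(1, 6) for s in range(1, 7)}        # fair die
P_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}          # fair coin

def f(s, t):
    return (s + 3 * t) ** 2

# E f(X, Y) as an integral with respect to the product measure P_X (x) P_Y.
lhs = sum(f(s, t) * ps * pt for s, ps in P_X.items() for t, pt in P_Y.items())

def F(s):
    """F(s) = E f(s, Y): integrate out Y first."""
    return sum(f(s, t) * pt for t, pt in P_Y.items())

rhs = sum(F(s) * ps for s, ps in P_X.items())         # E F(X)
print(lhs, rhs)
```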
Example 5.21 (Product rule) Let $X$ and $Y$ be independent extended real random variables and either $X$,
$Y \ge 0$ or $E(|X|) < \infty$, $E(|Y|) < \infty$. Then $E(X \cdot Y) = E(X) \cdot E(Y)$.
Proof.
2. Let now $E(|X|) < \infty$, $E(|Y|) < \infty$. Since $X$ and $Y$ are independent, $X^+$ and $X^-$ are
independent from $Y^+$ and $Y^-$ and therefore
$$E((X^+ - X^-) \cdot (Y^+ - Y^-)) = E(X^+ Y^+ - X^+ Y^- - X^- Y^+ + X^- Y^-)$$
$$= E(X^+) E(Y^+) - E(X^+) E(Y^-) - E(X^-) E(Y^+) + E(X^-) E(Y^-)$$
$$= E(X) \cdot E(Y).$$
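Another useful application of Fubini's Theorem is an alternative representation of the expectation of a non-negative random variable.
Corollary 5.22 For every random variable $X \ge 0$ on a probability space $(\Omega, \mathcal{F}, P)$ we have the representation of $E(X)$ as a Lebesgue integral:
$$E(X) = \int_0^\infty P(X > t)\, dt.$$
Proof. Since $X = \int_0^\infty 1_{\{X > t\}}\, dt$, Fubini's theorem gives
$$E(X) = \int_\Omega \int_0^\infty 1_{\{X > t\}}\, dt\, dP = \int_0^\infty \int_\Omega 1_{\{X > t\}}\, dP\, dt = \int_0^\infty P(X > t)\, dt.$$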
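For a random variable with finitely many non-negative values the tail $t \mapsto P(X > t)$ is a step function, so the integral in Corollary 5.22 can be evaluated exactly. The sketch below does this for a hypothetical law `law` (an assumption made for illustration) and compares the result with the usual weighted sum.

```python
from fractions import Fraction

def mean(law):
    """E(X) for a finite law given as {value: probability}."""
    return sum(x * p for x, p in law.items())

def tail_integral(law):
    """Integrate t -> P(X > t) over [0, inf) exactly; requires all values >= 0."""
    total = Fraction(0)
    prev = Fraction(0)
    for x in sorted(law):
        # On [prev, x) the tail P(X > t) equals the mass of all values >= x.
        tail = sum(p for v, p in law.items() if v >= x)
        total += (x - prev) * tail
        prev = x
    return total

# Hypothetical example law on {0, 1/2, 4}.
law = {Fraction(0): Fraction(1, 5), Fraction(1, 2): Fraction(3, 10), Fraction(4): Fraction(1, 2)}
print(mean(law), tail_integral(law))
```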
5.3
Remark 5.23 If $A$, $B$ and $C$ are pairwise independent, this does not mean
$$P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C).$$
Example: Let $A$ and $B$ be independent with probability $P(A) = P(B) = \frac12$. Let $C := A \,\triangle\, B$. Then $P(C) = \frac12$ and $A$, $B$ and $C$ are pairwise independent. But $A \cap B \cap C = \emptyset$ and
$$P(A \cap B \cap C) = 0 \ne P(A) \cdot P(B) \cdot P(C).$$
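The counterexample can be verified by enumeration. As a sketch we realise $A$ and $B$ as the outcomes of two fair coin tosses (this concrete sample space is our assumption; the text only requires $P(A) = P(B) = \frac12$ and independence).

```python
from fractions import Fraction
from itertools import product

omega = list(product([0, 1], repeat=2))  # two fair coin tosses, uniform measure

def prob(event):
    """Probability of an event given as a predicate on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 1
B = lambda w: w[1] == 1
C = lambda w: A(w) != B(w)  # symmetric difference of A and B

def both(E, F):
    return lambda w: E(w) and F(w)

pairwise = all(prob(both(E, F)) == prob(E) * prob(F)
               for E, F in [(A, B), (A, C), (B, C)])
triple = prob(lambda w: A(w) and B(w) and C(w))
print(pairwise, triple, prob(A) * prob(B) * prob(C))
```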
Definition 5.24 $A_1, A_2, \dots, A_n$ are independent if $P(A_1 \cap \dots \cap A_n) = \prod_{i=1}^n P(A_i)$ and the same is true for all subfamilies.
An infinite family of events is independent if each finite subfamily is independent.
Theorem 5.25 [Second Lemma of Borel-Cantelli] If $A_1, A_2, \dots$ are independent and $\sum_{n=1}^\infty P(A_n) = \infty$ then $P(\text{infinitely many } A_n \text{ occur}) = 1$.
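Proof. Let
$$A := \bigcap_{n=1}^\infty \bigcup_{m=n}^\infty A_m$$
be the event that infinitely many $A_n$ occur. We have
$$P\Big(\bigcap_{m=n}^\infty A_m^c\Big) = \lim_{N \to \infty} P\Big(\bigcap_{m=n}^N A_m^c\Big) = \lim_{N \to \infty} \prod_{m=n}^N (1 - P(A_m)) \le \lim_{N \to \infty} \prod_{m=n}^N e^{-P(A_m)} = \lim_{N \to \infty} \exp\Big(-\sum_{m=n}^N P(A_m)\Big) = 0$$
for each $n \in \mathbb{N}$. This yields $P\big(\bigcup_{n=1}^\infty \bigcap_{m=n}^\infty A_m^c\big) = 0$.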
Definition 5.26 A family $(\mathcal{A}_i)_{i \in I}$ of systems of events is independent if each finite subfamily is independent in the sense that
$$P\Big(\bigcap_{i \in J} A_i\Big) = \prod_{i \in J} P(A_i) \quad \text{for all } A_i \in \mathcal{A}_i \text{ and finite } J \subset I.$$
A family of random variables $(X_i)_{i \in I}$ is independent if the $\sigma$-algebras $\sigma(X_i)$ are independent.
Theorem 5.27 If $(\mathcal{E}_i)_{i \in I}$ are independent and $\cap$-stable, then the $\sigma$-algebras $\sigma(\mathcal{E}_i)$ are also independent.
Proof. We only need to consider finite index sets, say $I = \{1, \dots, n\}$, and use induction over $n$. For $n = 2$ the claim is theorem 5.9. The set
$$\mathcal{E}' = \Big\{ \bigcap_{i \in J} E_i : E_i \in \mathcal{E}_i,\ J \subset \{1, \dots, n-1\} \Big\}$$
S
n1
i=1
Ei .
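Now by theorem 5.9 $\sigma(\mathcal{E}')$ and $\sigma(\mathcal{E}_n)$ are independent and thus for $J \subset \{1, \dots, n-1\}$ and arbitrary $A_i \in \sigma(\mathcal{E}_i)$ we get by the induction hypothesis
$$P\Big(\bigcap_{i \in J} A_i \cap A_n\Big) = P\Big(\bigcap_{i \in J} A_i\Big) \cdot P(A_n) = \prod_{i \in J} P(A_i) \cdot P(A_n).$$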
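$$\mathcal{A}_\infty := \bigcap_{n=1}^\infty \sigma\{X_n, X_{n+1}, \dots\}$$
where $\sigma\{X_n, X_{n+1}, \dots\}$ is the smallest $\sigma$-algebra for which all $X_k$, $k \ge n$ are measurable. Then $P(A) \in \{0, 1\}$ for all $A \in \mathcal{A}_\infty$.
Proof. $\mathcal{A} := \bigcup_{n=1}^\infty \sigma\{X_1, \dots, X_n\}$ is a set-algebra and $\sigma(\mathcal{A}) = \sigma\{X_1, X_2, \dots\}$. By the measure extension theorem 2.2 $\mathcal{A}$ is dense in $\sigma(\mathcal{A})$: For each $A \in \sigma(\mathcal{A})$ and each $\varepsilon > 0$ there is some $A_0 \in \mathcal{A}$ such that $P(A \,\triangle\, A_0) < \varepsilon$.
Now let $A \in \mathcal{A}_\infty$ and $A_0 \in \mathcal{A}$ be arbitrary. Then there exists $n$ such that $A_0 \in \sigma\{X_1, \dots, X_n\}$ and $A \in \sigma\{X_{n+1}, X_{n+2}, \dots\}$. Therefore $A$ and $A_0$ are independent of each other. In particular, with $(A_k)_k \subset \mathcal{A}$ a sequence such that $P(A \,\triangle\, A_k) \to 0$, we get
$$P(A) = P(A \cap A) = \lim_{k \to \infty} P(A \cap A_k) = \lim_{k \to \infty} P(A) \cdot P(A_k) = P(A)^2.$$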
Xi = 0 =
lim
Xi = 0
for all n N.
A :=
lim
m m
m m
i=1
i=n
C
Note that if a random variable depends on the asymptotic behaviour of the sequence
X1 , X2 , . . . then it is almost surely constant. More generally,
Remark 5.30 Let $\mathcal{A}$ be a trivial $\sigma$-algebra, i.e. $\mathcal{A}$ contains only events with probability 0 or 1. Then each $\mathcal{A}$-measurable real random variable $Y$ is a.s. equal to its own expectation.
Proof. There is a $c < \infty$ such that $P\{|Y| > c\} < 1$. Then by assumption $P\{|Y| > c\} = 0$, i.e. $Y$ is a.s. bounded and in particular $E(Y)$ exists. If one of the two events $\{Y > E(Y)\}$ and $\{Y < E(Y)\}$ had positive probability, the same would hold for the other, and neither could have probability 1, in contradiction to the assumption. Thus $Y = E(Y)$ almost surely.
5.4
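For a finite collection of probability spaces $(\Omega_1, \mathcal{F}_1, P_1), \dots, (\Omega_n, \mathcal{F}_n, P_n)$ we get their independent coupling in natural extrapolation from $n = 2$ on the product space $\Omega := \Omega_1 \times \dots \times \Omega_n$ with the product $\sigma$-algebra $\mathcal{F} := \mathcal{F}_1 \otimes \dots \otimes \mathcal{F}_n = \sigma(A_1 \times \dots \times A_n : A_i \in \mathcal{F}_i)$. The product measure $P := P_1 \otimes \dots \otimes P_n$ is characterized by
$$\bigotimes_{i=1}^n P_i (A_1 \times \dots \times A_n) = \prod_{i=1}^n P_i(A_i) \quad \text{for } A_i \in \mathcal{F}_i.$$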
Z
Pi (C) =
n
O
1 i=2
36
On this space one has the projections i : i defined by i0 ((i )iI ) = i0 and for
subsets
J I similarly J : J with J ((i )iI = (j )jJ . The product -algebra
N
F
can be described in several ways
i
iI
O
Fi := {i : i I} =
J1 (A) : A
iI
Fj , J I finite
jJ
k=1
for
sets J I and Ajk Fjk for all k = 1, . . . , n. The probability space
Q all finite
N index N
( iI i , iI Fi , iI Pi ) is called the independent coupling of the probability spaces
(i , Fi , Pi ).
Proof. 1. We first note that in the representation of a cylinder set the choice of the finite index set $J$ is not unique: We can add an arbitrary element $i$ to $J$ by adding a trivial condition on the $i$-th coordinate:
$$C := \{\omega : (\omega_{j_1}, \dots, \omega_{j_n}) \in A\} = \{\omega : (\omega_{j_1}, \dots, \omega_{j_n}, \omega_i) \in A \times \Omega_i\}.$$
2. The cylinder sets form an algebra:
$$C = \pi_J^{-1}(A) \ \Rightarrow\ C^c = \pi_J^{-1}(A^c).$$
If $C = \pi_J^{-1}(A)$ and $C' = \pi_{J'}^{-1}(A')$ then we may replace $J$ and $J'$ by $J \cup J'$ by adding trivial constraints as above. Therefore we may assume $J = J'$. This means $\pi_J^{-1}(A) \cap \pi_J^{-1}(A') = \pi_J^{-1}(A \cap A')$ is also a cylinder set.
3. We define $P = \bigotimes_{i \in I} P_i$ on the algebra of cylinder sets by
$$P(\pi_J^{-1}(A)) := \bigotimes_{j \in J} P_j(A).$$
This definition is meaningful since changing $J$ in the sense of 1. does not change the value, e.g. $P_1 \otimes P_2 \otimes P_3(A \times \Omega_3) = P_1 \otimes P_2(A)$.
4. $P$ is finitely additive: Let $C = \pi_J^{-1}(A)$ and $C' = \pi_J^{-1}(A')$ such that $C \cap C' = \emptyset$. Then $A \cap A' = \emptyset$ and therefore
$$P(C \cup C') = P(\pi_J^{-1}(A \cup A')) = \bigotimes_{j \in J} P_j(A \cup A') = \bigotimes_{j \in J} P_j(A) + \bigotimes_{j \in J} P_j(A') = P(C) + P(C').$$
5. Now we have a finitely additive set function $P$ on the algebra of cylinder sets which satisfies the equation in the assertion. It remains to prove that $P$ can be extended uniquely to a measure on the $\sigma$-algebra $\mathcal{F}$ which is generated by the cylinder sets. By the measure extension theorem 2.2 it suffices to verify the $\sigma$-additivity of $P$ on the cylinder algebra. By proposition 2.7 we only need to show that $C^n \downarrow \emptyset$ implies $P(C^n) \to 0$.
=
1
Z
=:
This yields
Z
Z
fn (s) dP1 (s)
Hence there exists s1 1 such that f (s1 ) , i.e. P (1) (Csn1 ) for all n N.
Using the same arguments for the finitely N
additive set function P (1) we get a s2 2
(2)
n
(2)
such that P (C(s1 ,s2 ) ) , where P
:= iI\{1,2} Pi . Recursively we find a sequence
s1 , s2 , . . . such that for all k
O
Pi Csn1 ,...,sk
iI
i6=1,...,k
i=1
and thus the $X_n$ are independent. Finally $P_{X_i}(B_i) = P(X_i \in B_i) = P_i(B_i)$.
For example there exist a probability space $(\Omega, \mathcal{F}, P)$ and a sequence of independent random variables with $P_{X_n} = N(0, 1)$.
Chapter 6
k1
X
!
Xi = r 1
k1
pr1 (1 p)k1(r1) p
r1
k1
=
pr (1 p)kr .
r1
P (Xk = 1) =
i=1
6.1
Our next goal is to show the use of Fubini's theorem in the computation of an explicit distribution. First we note the following simple but important fact. The joint distribution of two independent r.v. with densities is given by the product density.
Proposition 6.2 Let $X$ and $Y$ be random vectors with densities $f_X$ and $f_Y$ on $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Then $X$ and $Y$ are independent if and only if $(X, Y)$ has the density $f_{(X,Y)}(x, y) = f_X(x) \cdot f_Y(y)$ on $\mathbb{R}^{n+m}$.
1
N recall that we now can rely on Corollary 5.34.
iN Bin(1, p)
Example 6.3 Let $T_1, T_2, \dots, T_n$ be uniformly distributed over $[0, 1]$. Then the $T_i$ are independent if and only if $T = (T_1, \dots, T_n)$ is uniformly distributed over $[0, 1]^n$, which means that $T$ has the density $1_{[0,1]^n}$.
Definition 6.4 If $x_1, \dots, x_n$ are real numbers, then $x_{1:n}, \dots, x_{n:n}$ are the order statistics of the vector $(x_1, \dots, x_n)$, i.e. $x_{1:n} \le \dots \le x_{n:n}$ and $(x_{1:n}, \dots, x_{n:n})$ is a permutation of $x_1, \dots, x_n$.
Definition 6.5 Let $a, b > 0$. Then $B(a, b) := \int_0^1 s^{a-1} (1 - s)^{b-1}\, ds$ is the Beta function and the probability density over $[0, 1]$ of the form
$$\beta_{a,b}(s) = \frac{s^{a-1} (1 - s)^{b-1}}{B(a, b)}$$
is called the $\beta_{a,b}$-density and the associated distribution is called the Beta distribution with parameters $a, b$.
Proposition 6.6 Let $T_1, \dots, T_n$ be independent and uniformly distributed over $[0, 1]$ and for $1 \le r \le n$ let $T_{r:n}$ be their $r$-th order statistic. Then $T_{r:n}$ has a $\beta_{r,n-r+1}$-density.
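Proof. Since $P\big(\bigcup_{i \ne j} \{T_i = T_j\}\big) = 0$ we can assume without loss of generality that the $T_i$ are pairwise different. Furthermore observe that if some random variables $X_1, \dots, X_n$ are independent identically distributed, then they are exchangeable, i.e.
$$P_{X_1, \dots, X_n} = P_{X_{\sigma(1)}, \dots, X_{\sigma(n)}}$$
for each permutation $\sigma \in S_n$. Therefore
$$P(T_{r:n} \le c) = P(\exists\ \text{permutation } \sigma \text{ such that } T_{\sigma(1)} < \dots < T_{\sigma(n)} \text{ and } T_{\sigma(r)} \le c)$$
$$= P\Big(\bigcup_{\sigma \in S_n} \{T_{\sigma(1)} < \dots < T_{\sigma(n)},\ T_{\sigma(r)} \le c\}\Big) = \sum_{\sigma \in S_n} P(T_{\sigma(1)} < \dots < T_{\sigma(n)},\ T_{\sigma(r)} \le c) = n! \cdot P(T_1 < \dots < T_n,\ T_r \le c).$$
To evaluate the last probability we fix the value $s$ of $T_r$ as in Theorem 5.20 and set
$$F(s) := 1_{[0,c]}(s) \cdot P(T_1 < \dots < T_{r-1} < s < T_{r+1} < \dots < T_n)$$
$$= 1_{[0,c]}(s) \cdot P(T_1 < \dots < T_{r-1} < s) \cdot P(s < T_{r+1} < \dots < T_n) = 1_{[0,c]}(s) \cdot \frac{s^{r-1}}{(r-1)!} \cdot \frac{(1-s)^{n-r}}{(n-r)!}$$
and get
$$P(T_{r:n} \le c) = n! \cdot E(F(T_r)) = n! \int_0^1 F(s)\, ds = \frac{n!}{(r-1)!\,(n-r)!} \int_0^c s^{r-1} (1-s)^{n-r}\, ds$$
$$= \frac{n!}{(r-1)!\,(n-r)!}\, B(r, n-r+1) \int_0^c \frac{s^{r-1} (1-s)^{n-r}}{B(r, n-r+1)}\, ds = \frac{n!}{(r-1)!\,(n-r)!}\, B(r, n-r+1) \int_0^c \beta_{r,n-r+1}(s)\, ds.$$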
If we set $c = 1$ we get
$$\frac{n!}{(r-1)!\,(n-r)!}\, B(r, n-r+1) = 1.$$
This proves both our assertion and the following fact about the Beta function.
Corollary 6.7
$$B(r, n-r+1) = \frac{(r-1)!\,(n-r)!}{n!}.$$
6.2 Transformation of densities
Next we study how densities change under a non-linear transformation of coordinates. For the motivation we start with the polar coordinates of a standard Gaussian vector.
Example 6.8 If $X_1$ and $X_2$ are independent random variables and $X_1, X_2 \sim N(0, 1)$, then according to Proposition 6.2 $(X_1, X_2)$ has the density
$$f_{(X_1,X_2)}(x_1, x_2) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x_1^2}{2}} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{x_2^2}{2}} = \frac{1}{2\pi} e^{-\frac{x_1^2 + x_2^2}{2}}.$$
Let $(R, U)$ be the polar coordinates of $(X_1, X_2)$: $(X_1, X_2) = R \cdot (\cos U, \sin U) =: g(R, U)$. Since the Jacobian of $g$ is
$$Dg(r, u) = \begin{pmatrix} \cos u & -r \sin u \\ \sin u & r \cos u \end{pmatrix}$$
with determinant $r$ we get by means of the transformation rule of multidimensional integrals for every Borel set $A$
$$P((R, U) \in g^{-1}(A)) = P((X_1, X_2) \in A) = \int_A f_{(X_1,X_2)}(x_1, x_2)\, dx_1\, dx_2 = \int_{g^{-1}(A)} f_{(X_1,X_2)}(g(r, u)) \cdot r\, dr\, du.$$
Thus the density of $(R, U)$ is $f_{(R,U)}(r, u) = r \cdot f_{(X_1,X_2)}(g(r, u)) = \frac{r}{2\pi} e^{-\frac{r^2}{2}}$. The densities of $R$ and $U$ are $f_R(r) = r \cdot e^{-\frac{r^2}{2}}$ and $f_U(u) = \frac{1}{2\pi} 1_{[0, 2\pi)}(u)$.
Figure 6.1: In the proof of Remark 6.10 we decompose an open set $U$ into a countable union of boxes $U_i$.
U =N
is a nullset,
N Ui
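$$f_Y(y) = \sum_{i:\ y \in g(U_i)} \frac{f_X(g_i^{-1}(y))}{J(g_i^{-1}(y))}.$$
Proof. If we define $f^i(y) := \frac{f_X(g_i^{-1}(y))}{J(g_i^{-1}(y))}$, then
$$P(Y \in B) = P(X \in g^{-1}(B)) = \sum_i P(X \in U_i \cap g^{-1}(B)) = \sum_i P(X \in g_i^{-1}(B))$$
$$= \sum_i \int_{g_i^{-1}(B)} f_X(x)\, dx = \sum_i \int_{g_i^{-1}(B)} J(x) \cdot f^i(g_i(x))\, dx = \sum_i \int_{B \cap g(U_i)} f^i(y)\, dy = \int_B f_Y(y)\, dy.$$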
Ui
Ui
is our decomposition.
6.3
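Definition 6.11 The Gamma distribution $\mathrm{Gamma}(\alpha, \lambda)$ is the distribution with the density
$$\gamma_{\alpha,\lambda}(x) = \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}\, 1_{\mathbb{R}_+}(x)}{\Gamma(\alpha)}, \qquad \Gamma(\alpha) = \int_0^\infty e^{-x} x^{\alpha-1}\, dx.$$
Here, $\alpha$ is called shape parameter and $\lambda$ is called scale parameter. Observe that $\Gamma$ satisfies $\Gamma(1) = 1$ and $\Gamma(x + 1) = x \cdot \Gamma(x)$ and thus $\Gamma(n + 1) = n!$.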
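Example 6.12 Let $X \sim N(0, 1)$ be a random variable and $g(x) := x^2$. Then $|g'(x)| = 2|x| \ne 0$ for $x \ne 0$. If we define $U_1 = (-\infty, 0)$ and $U_2 = (0, \infty)$, then $g^{-1}(y) = -\sqrt{y}$ on $U_1$ and $g^{-1}(y) = \sqrt{y}$ on $U_2$. For $y > 0$ the transformation formula gives
$$f_Y(y) = \frac{1}{2\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-\frac{(-\sqrt{y})^2}{2}} + \frac{1}{2\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-\frac{(\sqrt{y})^2}{2}} = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{y}{2}} = \frac{1}{\sqrt{\pi}} \Big(\frac12\Big)^{\frac12} y^{\frac12 - 1} e^{-\frac12 y}.$$
This is the density of the $\mathrm{Gamma}(\frac12, \frac12)$-distribution except that the normalizing factor has a different form. Since this constant is uniquely determined, we get the following corollary.
Corollary 6.13 $\Gamma(\frac12) = \sqrt{\pi}$.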
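A quick numerical sanity check of Example 6.12 and Corollary 6.13, as a sketch only. The helper names are ours, and `gamma_density` assumes the parametrisation $\lambda^\alpha x^{\alpha-1} e^{-\lambda x} / \Gamma(\alpha)$ (with `rate` playing the role of $\lambda$), which is how we read Definition 6.11.

```python
import math

def density_of_square(y):
    """Density of Y = X^2 for X ~ N(0, 1), via the two-branch transformation formula."""
    root = math.sqrt(y)
    phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    # Both branches g^{-1}(y) = -sqrt(y) and +sqrt(y) have Jacobian 2*sqrt(y).
    return (phi(-root) + phi(root)) / (2 * root)

def gamma_density(x, shape, rate):
    """Assumed form of the Gamma(shape, rate) density on (0, inf)."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

for y in (0.1, 1.0, 3.7):
    print(y, density_of_square(y), gamma_density(y, 0.5, 0.5))
print(math.gamma(0.5), math.sqrt(math.pi))
```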
As an application of the transformation formula it is easy to prove that if X has a Gamma(, )distribution, then 1 X has a Gamma(, )-distribution. The family of Gamma-distributions
also behaves nicely with respect to convolution which is another key operation in probability
theory.
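Definition 6.14 If $\mu$ and $\nu$ are two measures on $\mathcal{B}^n$, the convolution $\mu * \nu$ is defined as the image measure of $\mu \otimes \nu$ under the map $+ : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$:
$$\mu * \nu(B) = \mu \otimes \nu\{(x, y) \in \mathbb{R}^n \times \mathbb{R}^n : x + y \in B\}.$$
Observe that
$$\mu * \nu(B) = \int \nu(B - x)\, d\mu(x) = \int \mu(B - y)\, d\nu(y).$$
Remark 6.15
1. If $X$ and $Y$ are independent and $B \in \mathcal{B}^n$, then
$$P(X + Y \in B) = P_{(X,Y)}\{(x, y) : x + y \in B\} = P_X \otimes P_Y\{(x, y) : x + y \in B\} = P_X * P_Y(B).$$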
Figure 6.2: $\{(x, y) : x + y \in B\}$ is the set between the diagonal lines.
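2. If $X$ and $Y$ are independent and have the densities $f_X$ and $f_Y$, respectively, then $X + Y$ has the density $f_X * f_Y$ where
$$g * h(z) = \int g(y)\, h(z - y)\, dy = \int h(x)\, g(z - x)\, dx.$$
3. If $X$ and $Y$ are independent and $\mathbb{Z}$-valued and $P_X = \sum_{k \in \mathbb{Z}} p_k \delta_k$ and $P_Y = \sum_{k \in \mathbb{Z}} q_k \delta_k$, then
$$P_X * P_Y = \sum_m r_m \delta_m \quad \text{where } r_m := \sum_{k + l = m} p_k q_l.$$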
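Example 6.16 Suppose that we have two independent Bernoulli variables: $P_X = P_Y = \mathrm{Bin}(1, \frac12)$. Then
$$P_{X+Y}(\{k\}) = \begin{cases} \frac14 & k = 0 \\ \frac12 & k = 1 \\ \frac14 & k = 2 \end{cases}$$
and this is exactly $\mathrm{Bin}(2, \frac12)$. More generally, if $X_1, \dots, X_n \sim \mathrm{Bin}(1, p)$ are independent, then $P_{X_1 + \dots + X_n} = P_{X_1} * \dots * P_{X_n} = \mathrm{Bin}(n, p)$.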
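The discrete convolution formula of Remark 6.15 is easy to try out. The following sketch (function names are ours) convolves laws on the integers given as dictionaries and reproduces Example 6.16.

```python
from fractions import Fraction
from math import comb

def convolve(p, q):
    """Law of X + Y for independent integer-valued X ~ p, Y ~ q: r_m = sum_{k+l=m} p_k q_l."""
    r = {}
    for k, pk in p.items():
        for l, ql in q.items():
            r[k + l] = r.get(k + l, 0) + pk * ql
    return r

def binomial(n, p):
    """Weights of Bin(n, p) as a dictionary {k: probability}."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

bern = binomial(1, Fraction(1, 2))
two = convolve(bern, bern)
print(two)

# n-fold convolution of Bin(1, p) with itself.
p = Fraction(1, 3)
law = binomial(1, p)
for _ in range(4):
    law = convolve(law, binomial(1, p))
print(law == binomial(5, p))
```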
Proposition 6.17 Assume $X \sim \Gamma(r, \lambda)$ and $Y \sim \Gamma(t, \lambda)$ are independent. Then $X + Y \sim \Gamma(r + t, \lambda)$.
Proof. Using the remark after Corollary 6.13 we can assume without loss of generality that $\lambda = 1$. Then
$$f_X * f_Y(z) = \frac{1}{\Gamma(r)\,\Gamma(t)} \int_0^z (z - y)^{r-1} e^{-(z-y)} y^{t-1} e^{-y}\, dy$$
where the integration limits are due to the indicator functions. Substituting $y$ by $zs$ yields
Z 1
ez
=
Corollary 6.18
$$B(r, t) = \frac{\Gamma(r)\,\Gamma(t)}{\Gamma(r + t)}.$$
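Corollary 6.18 can be checked numerically. This is a sketch under our own choices: `beta_numeric` is a plain midpoint rule for the defining integral of $B(a, b)$ and is only meant for $a, b \ge 1$, where the integrand is bounded.

```python
import math

def beta_numeric(a, b, steps=100_000):
    """Midpoint rule for B(a, b) = int_0^1 s^(a-1) (1-s)^(b-1) ds, intended for a, b >= 1."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                   for i in range(steps))

def beta_gamma(a, b):
    """Right-hand side of Corollary 6.18."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

for a, b in [(2, 3), (1.5, 2.5), (4, 1)]:
    print(a, b, beta_numeric(a, b), beta_gamma(a, b))
```

For integer arguments this also agrees with Corollary 6.7, e.g. $B(2, 3) = \frac{1!\,2!}{4!} = \frac{1}{12}$.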
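Definition 6.19 The $\chi^2_n$-distribution with $n$ degrees of freedom is the law of $Z_1^2 + \dots + Z_n^2$, where the $Z_i$ are independent identically distributed $N(0, 1)$-variables.
Proposition 6.20 The $\chi^2_n$-distribution is equal to the $\Gamma(\frac{n}{2}, \frac12)$-distribution. It has the density
$$\gamma_{\frac{n}{2}, \frac12}(x) = \frac{x^{\frac{n}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\, \Gamma(\frac{n}{2})}.$$
Proof. According to Example 6.12, $Z_i^2$ has a $\Gamma(\frac12, \frac12)$-distribution for each $i$. This implies that
$$P_{\chi^2_n} = \Gamma\big(\tfrac12, \tfrac12\big) * \dots * \Gamma\big(\tfrac12, \tfrac12\big) = \Gamma\big(\tfrac{n}{2}, \tfrac12\big).$$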
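We conclude this chapter by noting a by-product of our study of the Gamma-distribution: We get the classical formula for the volume of the $n$-dimensional Euclidean unit ball for free. For this we have a look at the lower tail of the $\chi^2_n$-distribution as $x \to 0$:
$$P(\chi^2_n \le r^2) = \int_0^{r^2} \frac{x^{\frac{n}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\, \Gamma(\frac{n}{2})}\, dx.$$
Now $e^{-\frac{x}{2}} \to 1$ as $x \to 0$ and thus
$$P(\chi^2_n \le r^2) \approx \int_0^{r^2} \frac{x^{\frac{n}{2} - 1}}{2^{\frac{n}{2}}\, \Gamma(\frac{n}{2})}\, dx = \frac{r^n}{2^{\frac{n}{2}}\, \frac{n}{2}\, \Gamma(\frac{n}{2})}.$$
On the other hand $P(\chi^2_n \le r^2)$ is the probability that a standard Gaussian vector lies in the ball of radius $r$, and since the Gaussian density is approximately $(2\pi)^{-\frac{n}{2}}$ near the origin, this probability is approximately $(2\pi)^{-\frac{n}{2}}\, r^n\, \mathrm{vol}\, B_n$. Comparing the two expressions gives
$$\mathrm{vol}\, B_n = \frac{2 \pi^{\frac{n}{2}}}{n\, \Gamma(\frac{n}{2})}.$$
In particular $\mathrm{vol}\, B_n \to 0$ as $n \to \infty$.

n            2       3        4         5           6
vol B_n      π       4π/3     π²/2      8π²/15      π³/6
approx.      3.14    4.19     4.93      5.26        5.17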
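The closed formula $\mathrm{vol}\, B_n = 2\pi^{n/2} / (n\, \Gamma(n/2))$ is easy to evaluate. The short sketch below (the function name is ours) reproduces the table and illustrates that the volume peaks at $n = 5$ before decaying to 0.

```python
import math

def unit_ball_volume(n):
    """Volume of the n-dimensional Euclidean unit ball, 2 pi^(n/2) / (n Gamma(n/2))."""
    return 2 * math.pi ** (n / 2) / (n * math.gamma(n / 2))

for n in range(1, 11):
    print(n, round(unit_ball_volume(n), 2))
```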
Chapter 7
The p-norm
Definition 7.1 If $X$ is an (extended) real random variable, $E(|X|^r)$ is the $r$-th absolute moment. For $0 < r < \infty$ we write $\|X\|_r$ for $E(|X|^r)^{1/r}$. Further we define $\|X\|_\infty := \inf\{\alpha : P(|X| > \alpha) = 0\}$ and if this infimum does not exist, we set $\|X\|_\infty := \infty$. For $0 < p \le \infty$ the symbol $L^p$ or $L^p(P)$ or $L^p(\Omega, \mathcal{F}, P)$ denotes the set of all random variables with $\|X\|_p < \infty$.
Theorem 7.2 (Hölder's inequality) Suppose $p, q \in [1, \infty]$ are conjugate numbers, i.e. $\frac{1}{p} + \frac{1}{q} = 1$. Then for all random variables $X$ and $Y$ one has
$$E(|X \cdot Y|) \le \|X\|_p \cdot \|Y\|_q.$$
Observe that for $p = q = 2$ the Hölder inequality becomes the Cauchy-Schwarz inequality:
$$E(|X \cdot Y|) \le \|X\|_2 \cdot \|Y\|_2.$$
Theorem 7.3 (Minkowski inequality) If $X + Y$ is defined and $p \ge 1$ then
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p.$$
Remark 7.4 The inequalities of Hölder and Minkowski hold also in general measure spaces.
Corollary 7.5 On a probability space one has $\|X\|_r \le \|X\|_s$ whenever $0 < r \le s < \infty$.
Proof. Set p =
s
r
|x|r
R 1+x
dx < , but X 6 L1 .
7.2
Completeness
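$X_{n_{k+1}} - X_{n_k}$ we have $\sum_{k=1}^\infty \|Y_k\|_p < \infty$. Now define a random variable $Z := \sum_{k=1}^\infty |Y_k|$ and we get with monotone convergence and Minkowski's inequality
$$E(Z^p) = \lim_{N \to \infty} E\Big(\Big(\sum_{k=1}^N |Y_k|\Big)^p\Big) = \lim_{N \to \infty} \Big\|\sum_{k=1}^N |Y_k|\Big\|_p^p \le \lim_{N \to \infty} \Big(\sum_{k=1}^N \|Y_k\|_p\Big)^p < \infty,$$
and thus $Z < \infty$ almost surely. Therefore the series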
lim
N
X
k=1
Yk converges and
Yk = lim XnN =: X
k=1
XnN kpp
p
X
= lim
Yk
= 0
N
k=N +1
7.3
Second moments
Remark 7.8 In the case $p = 2$ the space $L^2$ is a Hilbert space with scalar product $\langle X, Y \rangle = E(XY)$. If $X \in L^2$ then $E(X^2) < \infty$ and thus by Cor. 7.5 $E(|X|) < \infty$. We get
$$E(X^2) - (E(X))^2 = E((X - E(X))^2) = \mathrm{var}(X) = \sigma^2(X) < \infty.$$
Moreover $|E(X)| \le \sqrt{E(X^2)}$.
Definition 7.9 If $X$ and $Y$ are random variables we define the covariance of $X$ and $Y$ as
$$\mathrm{cov}(X, Y) := E((X - E(X)) \cdot (Y - E(Y)))$$
if this expectation is defined.
Remark 7.10 If $X$ and $Y$ are in $L^2$, then we get as an analogue of the Cauchy-Schwarz inequality for the covariance:
$$|\mathrm{cov}(X, Y)| \le \sigma(X) \cdot \sigma(Y).$$
Definition 7.11 Let $X$ and $Y$ be random variables. Then $\sigma(X) = \sqrt{\sigma^2(X)}$ is called the standard deviation of $X$. The correlation or correlation coefficient between $X$ and $Y$ is defined by
$$\varrho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma(X) \cdot \sigma(Y)}.$$
Observe that $-1 \le \varrho(X, Y) \le 1$. $X$ and $Y$ are said to be uncorrelated if $\varrho(X, Y) = 0$.
Remark 7.12
XE(X)
(X)
X
X
X
cov
ai Xi ,
bj Yj =
ai bj cov(Xi , Yj ).
i
i,j
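Remark 7.14 [Chebyshev's inequality] For every $X \in L^2$ and every $\varepsilon > 0$ one has
$$P(|X - E(X)| \ge \varepsilon) \le \frac{\sigma^2(X)}{\varepsilon^2}.$$
$$\frac{1}{\varepsilon^2} E(|X - E(X)|^2) = \frac{1}{\varepsilon^2} \|X - E(X)\|_2^2 = \frac{\sigma^2(X)}{\varepsilon^2}.$$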
Xi =
2 (Xi )
i=1
i=1
i=1
i=1
i,j=1
i=1
i=1
Definition 7.16 Let $X_1, \dots, X_d$ be elements of $L^2$. Then the matrix $C = (c_{ij})_{1 \le i,j \le d}$ with entries
$$c_{ij} = \mathrm{cov}(X_i, X_j)$$
is called the covariance matrix of the vector $X = (X_1, \dots, X_d)$. It is denoted by $\mathrm{cov}(X)$.
Theorem 7.17 If $X$ is a $d$-dimensional random vector and $A : \mathbb{R}^d \to \mathbb{R}^k$ is a linear map then
$$\mathrm{cov}(AX) = A\, \mathrm{cov}(X)\, A^T.$$
d
X
i=1
ali Xi ,
d
X
j=1
amj Xj ) =
d
X
ij=1
Chapter 8
kXn Xkr
r
0.
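Theorem 8.3 (Weak Law of Large Numbers) Let $X_1, X_2, \dots$ be a sequence of uncorrelated real random variables with $\sigma^2(X_i) \le M < \infty$ for all $i$. Then we have for all $\varepsilon > 0$
$$P\Big(\Big|\frac{1}{n} \sum_{i=1}^n (X_i - E(X_i))\Big| > \varepsilon\Big) \to 0,$$
that is $\frac{1}{n} \sum_{i=1}^n (X_i - E(X_i)) \xrightarrow{P} 0$.
Proof. By Chebyshev's inequality
$$P\Big(\Big|\frac{1}{n} \sum_{i=1}^n (X_i - E(X_i))\Big| > \varepsilon\Big) \le \frac{1}{\varepsilon^2}\, \mathrm{var}\Big(\frac{1}{n} \sum_{i=1}^n X_i\Big) = \frac{1}{\varepsilon^2 n^2}\, \mathrm{var}\Big(\sum_{i=1}^n X_i\Big).$$
Since the $X_i$ are uncorrelated we have $\mathrm{var}(\sum X_i) = \sum \mathrm{var}(X_i)$ and thus
$$\frac{1}{n^2 \varepsilon^2}\, \mathrm{var}\Big(\sum_{i=1}^n X_i\Big) = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^n \mathrm{var}(X_i) \le \frac{M}{n \varepsilon^2} \to 0.$$
In particular, if $E(X_i) = \mu$ for all $i$ then $\frac{1}{n} \sum_{i=1}^n X_i \xrightarrow{P} \mu$. If the $X_i$ are independent $B(1, p)$-trials and $\hat{p} = \frac{1}{n} \cdot \#\,\text{successes up to } n$, then
$$P(|\hat{p} - p| > \varepsilon) \le \frac{p\,(1 - p)}{n \varepsilon^2}.$$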
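To get a feeling for how conservative the bound $p(1-p)/(n\varepsilon^2)$ is, one can compare it with the exact binomial probability. The sketch below does this with exact rational arithmetic; the parameter values $p = 0.3$, $\varepsilon = 0.1$ are our own illustrative choices.

```python
from fractions import Fraction
from math import comb

def exact_deviation_prob(n, p, eps):
    """P(|S_n / n - p| > eps) for S_n ~ Bin(n, p), summed exactly over the binomial weights."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(Fraction(k, n) - p) > eps)

def chebyshev_bound(n, p, eps):
    """The bound p (1 - p) / (n eps^2) from the weak law of large numbers."""
    return p * (1 - p) / (n * eps ** 2)

p, eps = Fraction(3, 10), Fraction(1, 10)
for n in (10, 100, 400):
    print(n, float(exact_deviation_prob(n, p, eps)), float(chebyshev_bound(n, p, eps)))
```

The exact probabilities decay much faster than the bound, which only uses second moments.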
8.2
First let us give some alternative descriptions of convergence in probability for real valued
random variables.
Proposition 8.4 If $(X_n)$ is a sequence of random variables which is uniformly bounded (i.e. $|X_n| \le C$ a.s., all $n$) then $X_n \to X$ in probability if and only if $\|X_n - X\|_r \to 0$ for all $r > 0$.
P
r + (2C)r P (|Xn X| )
2r
for sufficiently large n, since Xn converges in probability.
Notation 8.5 We frequently use the notations $a \wedge b := \min(a, b)$ and $a \vee b := \max(a, b)$.
Corollary 8.6 $X_n$ converges to $X$ in probability if and only if $E(1 \wedge |X_n - X|) \to 0$.
P
P (An ) < by
In particular convergence in probability does not imply almost sure convergence. However
for monotone sequences the two concepts coincide:
Lemma 8.8 Let $(Y_n)$ be a monotone sequence of random variables and suppose that the r.v. $Y$ is a.s. finite. Then $Y_n$ converges to $Y$ a.s. if and only if $Y_n \xrightarrow{P} Y$.
k=1
P
Then the series
k |Xnk X| converges a.s. and for each subsubsequence we get
|Xnkl X| 0 a.s..
Remark 8.10 We saw at the end of the previous section that convergence in probability is induced by a pseudometric. Thus we can speak of the topology of convergence in probability on the space of all real random variables on a given probability space.
However it is worthwhile to note that there is no topology $\tau$ of almost sure convergence in the sense that a sequence would converge for that topology if and only if it converges almost surely. Such a topology would be at least as fine as the topology of convergence in probability since there are fewer converging sequences. On the other hand a $\tau$-closed set $F$ would be closed under almost sure convergence and hence by the previous proposition the set $F$ would be closed under convergence in probability. Hence this topology would coincide with the topology of convergence in probability, in contradiction to our assumption that only the a.s. convergent sequences converge for $\tau$.
8.3
E(|Yn |p ) <
n=1
n=1
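Theorem 8.12 (Strong Law of Large Numbers) Let $X_1, X_2, \dots$ be real random variables which are uncorrelated and let $\sigma^2(X_i) \le M < \infty$ for all $i \in \mathbb{N}$. Then
$$\frac{1}{n} \sum_{i=1}^n (X_i - E(X_i)) \to 0 \quad \text{almost surely.}$$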
|Sm2 |
m2
l
mn m
mn m2 <l<(m+1)2 l
mn
1
sup Sl = sup
2
l
mn
ln
l>m2
1
|Sm2 | + sup
2
mn m
mn
2 sup
max
m2 <l<(m+1)2
l
1 X
Xi
2
m
i=m2 +1
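We want to apply the Lemma with $p = 2$: For the first part we get
$$\sum_{m=1}^\infty E\Big(\Big(\frac{1}{m^2} |S_{m^2}|\Big)^2\Big) = \sum_{m=1}^\infty \frac{1}{m^4}\, E\big(|S_{m^2}|^2\big) = \sum_{m=1}^\infty \frac{1}{m^4}\, \mathrm{var}(S_{m^2}) \le \sum_{m=1}^\infty \frac{1}{m^4}\, m^2 M = M\, \frac{\pi^2}{6} < \infty.$$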
Dm =
max
m2 <l<(m+1)2
l
X
1
X
i
m2
2
i=m +1
()
53
we get similarly
2
X
l
1
2
Xi
E 2 max 2
E(Dm ) =
4
m
m <l<(m+1)
m=1
m=1
i=m2 +1
(m+1)2 1
l
X
X
X
1
E
|Xi |2
4
m
2
2
m=1
l=m +1
4
m
m=1
i=m +1
(m+1)2 1 (m+1)2 1
l=m2 +1
i=m2 +1
E |Xi |2
Observe that the second and third sums have $2m$ terms each, and continue with
$$\le \sum_{m=1}^\infty \frac{1}{m^4} \cdot 2m \cdot 2m \cdot M < \infty.$$
By the lemma both parts of $(*)$ converge to 0 almost surely. This completes the proof.
Remark 8.13 1. In particular the Strong Law of Large Numbers holds for independent random variables with bounded variance. But observe that in general uncorrelatedness does not imply independence: Let $\Omega$ be the unit disk with the uniform distribution and $X$ and $Y$ be the first and second coordinate, respectively. Then $X$ and $Y$ are uncorrelated by symmetry ($E(X) = E(Y) = E(X \cdot Y) = 0$), but not independent.
2. The fact that independent random variables with bounded variance satisfy the Strong Law of Large Numbers has been generalized in many other directions, e.g. Kolmogorov's Laws of Large Numbers, Doob's martingale convergence, Birkhoff's individual ergodic theorem. In the chapter on martingales we shall establish Kolmogorov's and Doob's results.
Remark 8.14 Even if the random variables $X_1, X_2, \dots$ are independent and identically distributed with values in $\mathbb{R}$ one needs some minimal condition on the moments in order to have a.s. convergence of the averages: If $E(|X_i|) = \infty$, then $\frac{1}{n} S_n$ diverges a.s.
Proof. Recall that for a non-negative random variable $Y$ we have
$$E(Y) = \int_0^\infty P(Y > t)\, dt \le \sum_{n=0}^\infty P(Y > n).$$
Set $A_n := \{|X_n| > n\}$. Since all $X_n$ have the same law one has
$$\sum_{n=0}^\infty P(A_n) = \sum_{n=0}^\infty P(|X_1| > n) \ge E(|X_1|) = \infty.$$
Since the $A_n$ are independent, we can apply the second Borel-Cantelli-Lemma and get that almost surely infinitely many $A_n$ occur, that is almost surely $\frac{1}{n} |X_n| \ge 1$ for infinitely many $n$. Suppose $\frac{1}{n} S_n \to a \in \mathbb{R}$. Then
$$\lim_{n \to \infty} \frac{1}{n} X_n = \lim_{n \to \infty} \Big(\frac{1}{n} S_n - \frac{n-1}{n} \cdot \frac{1}{n-1} S_{n-1}\Big) = a \Big(1 - \lim_{n \to \infty} \frac{n-1}{n}\Big) = 0,$$
which yields a contradiction.
Chapter 9
Convergence of distributions
9.1
We briefly review some facts about the convergence of one-dimensional distribution functions.
Proposition 9.1 Let F and Fn with n
following statements are equivalent:
Definition 9.2 A sequence $(\mu_n)$ of probability measures on $\mathbb{R}$ converges weakly to the probability measure $\mu$, if the corresponding distribution functions converge in the sense of the previous proposition: $F_n(z) \to F(z)$ for all $z$ at which $F$ is continuous. In this case one also speaks of weak convergence of the distribution functions.
If for some random variables $X_n$, $n \in \mathbb{N}$ and $X$ the laws $\mu_n = \mathcal{L}(X_n)$ converge weakly to $\mu = \mathcal{L}(X)$ then one says: $X_n$ converges to $X$ in law.
Definition 9.3 Let $\alpha \in (0, 1)$. Then $z_0$ is an $\alpha$-quantile of $F$ (or $\mu$) if $F(z_0) \ge \alpha \ge \lim_{z \nearrow z_0} F(z)$. In the special case $\alpha = \frac12$, the $\alpha$-quantile is called median.
Observe that the $\alpha$-quantiles are not necessarily unique: If there are two numbers $z_1 < z_2$ such that $F(z) \equiv \alpha$ on $(z_1, z_2)$, then both $z_1$ and $z_2$ are $\alpha$-quantiles. But this happens only for countably many $\alpha$. $F^{-1}(\alpha) := \inf\{z : F(z) \ge \alpha\}$ is an $\alpha$-quantile, in fact it is the smallest $\alpha$-quantile in case of non-uniqueness.
Proposition 9.4 Let $F_n \to F$ weakly and $z_n$ be an $\alpha$-quantile of $F_n$.
1. If $z_n \to z$, then $z$ is an $\alpha$-quantile of $F$.
2. Conversely, if the $\alpha$-quantile $z$ is unique, then $z_n \to z$.
Proof.
1. Let $z'$ and $z''$ be points where $F$ is continuous and $z' < z < z''$. Then eventually
2. $\limsup z_n$ and $\liminf z_n$ are $\alpha$-quantiles. By uniqueness they agree with the $\alpha$-quantile of $F$ and thus $z_n \to z$.
Recall that on $([0, 1], \mathcal{B}[0, 1], \lambda^1)$ the random variable $X := F^{-1}$ has the distribution $F_X = F$. We shall see later in Proposition 9.10 that convergence in probability, and hence also a.s. convergence, implies convergence in law. The following is a partial converse.
Corollary 9.5 Suppose $F_n \to F$ weakly. Then there is a probability space $(\Omega, \mathcal{A}, P)$ with real random variables $X_n$ and $X$ such that $F_{X_n} = F_n$ and $F_X = F$ and $X_n \to X$ almost surely.
Proof. Choose $(\Omega, \mathcal{A}, P) := ([0, 1], \mathcal{B}[0, 1], \lambda^1)$ and set $X_n := F_n^{-1}$ and $X = F^{-1}$. Then $F_{X_n} = F_n$ and $F_X = F$. Furthermore for all $\omega \in [0, 1]$ with countably many exceptions (namely those for which the $\omega$-quantile of $F$ is not unique) we get $F_n^{-1}(\omega) \to F^{-1}(\omega)$. Hence $X_n(\omega) \to X(\omega)$ almost surely with respect to $\lambda^1$.
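Definition 9.6 Let $X_1, \dots, X_n$ be random variables on a probability space $(\Omega, \mathcal{A}, P)$ with state space $(S, \mathcal{B})$. Then the empirical distribution of $X_1, \dots, X_n$ is the random probability measure defined by
$$\hat{\mu}_n(\omega) = \frac{1}{n} \sum_{i=1}^n \delta_{X_i(\omega)}, \quad \text{i.e.} \quad \hat{\mu}_n(\omega)(B) = \frac{1}{n} \sum_{i=1}^n 1_B(X_i(\omega)).$$
Observe that
$$\int_S f(x)\, d\hat{\mu}_n(\omega) = \frac{1}{n} \sum_{i=1}^n f(X_i(\omega)).$$
In the real-valued case the empirical distribution function is
$$F_n(\omega)(z) := \hat{\mu}_n(\omega)((-\infty, z]) = \frac{1}{n} \sum_{i=1}^n 1_{\{X_i \le z\}}(\omega).$$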
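In computational terms the empirical distribution is just "count and divide by $n$". A minimal sketch for a fixed realisation (the sample values below are hypothetical and the function names are ours):

```python
def empirical_cdf(sample):
    """Return z -> F_n(z) = (1/n) * #{i : x_i <= z} for a fixed realisation x_1, ..., x_n."""
    n = len(sample)
    def F_n(z):
        return sum(1 for x in sample if x <= z) / n
    return F_n

def empirical_integral(f, sample):
    """Integral of f against the empirical measure, i.e. the sample average of f."""
    return sum(f(x) for x in sample) / len(sample)

sample = [3, 1, 2, 2]  # hypothetical observations X_1(w), ..., X_4(w)
F = empirical_cdf(sample)
print([F(z) for z in (0, 1.5, 2, 3)])
print(empirical_integral(lambda x: x * x, sample))
```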
The following is sometimes called the Fundamental Theorem of Statistics. It says that every one-dimensional distribution can a.s. be recovered from i.i.d. observations.
Theorem 9.7 (Glivenko-Cantelli)
Let $X_1, X_2, \dots$ be independent identically distributed real random variables with distribution function $F$ and law $\mu$. Then almost surely
$$\hat{\mu}_n(\omega) \to \mu \text{ weakly and } F_n(\omega) \to F \text{ weakly.}$$
Proof. Consider $z \in \mathbb{R}$ and define $Y_i := 1_{\{X_i \le z\}}$. Then $F_n(\omega)(z) = \frac{1}{n} \sum Y_i(\omega)$. The $Y_1, Y_2, \dots$ are independent and $B(1, p)$-distributed with $p = P(X_i \le z) = F(z)$ and $E(Y_i) = F(z)$. By the Strong Law of Large Numbers $F_n(\omega)(z) \to F(z)$ almost surely. Now let
$$\Omega_0 = \{\omega : F_n(\omega)(z) \to F(z)\ \forall z \in \mathbb{Q}\} = \bigcap_{z \in \mathbb{Q}} \{\omega : F_n(\omega)(z) \to F(z)\}.$$
9.2
The drawback of the definition of convergence in law in the one-dimensional case is that it
relies heavily on the linear structure of the real numbers. The following approach works in
general metric spaces. In Corollary 9.15 we shall see that in R it is equivalent to the concept
of the previous section.
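Definition 9.8 Let $(S, d)$ be a metric space.
1. Let $\mu$ and $\mu_1, \mu_2, \dots$ be probability measures on $\mathcal{B} = \mathcal{B}(S)$. We say that $(\mu_n)$ converges weakly to $\mu$ if
$$\int_S \varphi\, d\mu_n \to \int_S \varphi\, d\mu \qquad (*)$$
for all bounded continuous functions $\varphi$ on $S$.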
$X_n \xrightarrow{L} X$ is equivalent to
$$\int_S \varphi\, dP_{X_n} = E(\varphi(X_n)) \to E(\varphi(X)) = \int_S \varphi\, dP_X.$$
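Example 9.12 Let $S := [0, 1]$ and define $\mu_n := \frac{1}{n} \sum_{i=1}^n \delta_{i/n}$. A standard fact about the Riemann integral is that for all bounded continuous functions $\varphi$
$$\frac{1}{n} \sum_{i=1}^n \varphi\Big(\frac{i}{n}\Big) \to \int_0^1 \varphi(x)\, dx,$$
i.e. $\mu_n \to \lambda^1|_{\mathcal{B}([0,1])}$ weakly.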
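The convergence in Example 9.12 can be observed numerically. As a sketch we take the test function $\varphi = \cos$ (our choice, any bounded continuous function would do) and compare $\int \varphi\, d\mu_n$ with $\int_0^1 \cos x\, dx = \sin 1$.

```python
import math

def integral_against_mu_n(phi, n):
    """Integral of phi against mu_n = (1/n) * sum of point masses at i/n, i = 1..n."""
    return sum(phi(i / n) for i in range(1, n + 1)) / n

limit = math.sin(1.0)  # integral of cos over [0, 1]
errors = [abs(integral_against_mu_n(math.cos, n) - limit) for n in (10, 100, 1000, 10000)]
print(errors)
```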
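Lemma 9.13 Let $\mu_n$, $n \in \mathbb{N}$ and $\mu$ be probability measures. Let $\varphi_k$, $k \in \mathbb{N}$ be bounded measurable functions such that $\int \varphi_k\, d\mu_n \to \int \varphi_k\, d\mu$ as $n \to \infty$ for all $k \in \mathbb{N}$.
1. If $\varphi_k(x) \nearrow \varphi(x)$ for all $x$, then $\int \varphi\, d\mu \le \liminf_{n \to \infty} \int \varphi\, d\mu_n$.
2. If $\varphi_k \to \varphi$ uniformly, then $\int \varphi\, d\mu_n \to \int \varphi\, d\mu$.
Proof.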
Z
k dn lim inf
n
dn
2. Define Ik () =
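b) $(*)$ holds for all bounded Lipschitz functions $\varphi$ (i.e. $|\varphi(x) - \varphi(y)| \le L\, d(x, y)$).
c) $(*)$ holds for all $\varphi \in \mathcal{C}$ where $\mathcal{C}$ is a subset of $C_b(S)$ such that for each open set $U \subset S$ there is a sequence $(\varphi_k)$ in $\mathcal{C}$ with $\varphi_k(x) \nearrow 1_U(x)$ for all $x \in S$. If $S = \mathbb{R}^d$ then for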
n (G) (G).
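Proof.
a) $\Rightarrow$ b): trivial.
b) $\Rightarrow$ c): Let $U \subset S$ be open and define $\varphi_k(x) := 1 \wedge k \cdot \mathrm{dist}(x, U^c)$. Then $\varphi_k \nearrow 1_U$ and the $\varphi_k$ are Lipschitz since $|\varphi_k(x) - \varphi_k(y)| \le k \cdot d(x, y)$ by the triangle inequality.
c) $\Rightarrow$ e): Let an open set $U$ be given, choose $\varphi_k \in \mathcal{C}$ such that $\varphi_k \nearrow 1_U$. By assumption $\int \varphi_k\, d\mu_n \to \int \varphi_k\, d\mu$ and part 1 of the previous lemma implies
$$\liminf_{n \to \infty} \mu_n(U) = \liminf_{n \to \infty} \int 1_U\, d\mu_n \ge \int 1_U\, d\mu = \mu(U).$$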
Corollary 9.15 In the case $S = \mathbb{R}$ the two definitions 9.2 and 9.8 of weak convergence of probability measures and convergence in law of random variables are equivalent.
Proof. Assume that $\mu_n \to \mu$ in the sense of Definition 9.8. Let $F_n$ and $F$ be the corresponding distribution functions and let $z$ be a point of continuity of $F$. Then the set $G = (-\infty, z]$ satisfies $\mu(\partial G) = 0$, and hence by condition g) in the Portmanteau Theorem we get $F_n(z) = \mu_n(G) \to \mu(G) = F(z)$, i.e. the condition of Definition 9.2 holds.
Conversely suppose that $F_n \to F$ in the sense of Definition 9.2. According to Corollary 9.5 there are random variables $X_n$, $X$ with these distribution functions such that $X_n \to X$ a.s., and hence in probability. Proposition 9.10 implies that $X_n \to X$ in law in the sense of Definition 9.8.
The following simple observation is extremely useful.
Theorem 9.16 (Mapping theorem) Let $X_n$, $X$ be $S$-valued random variables such that $X_n \xrightarrow{L} X$. Let $(S', d')$ be another metric space and $f : S \to S'$ be Borel and $P_X$-a.s. continuous. Then $f(X_n) \xrightarrow{L} f(X)$.
Proof. Let $\varphi \in C_b(S')$. Then $\varphi \circ f$ is bounded, Borel and $P_X$-a.s. continuous. Let $\mu_n = P_{X_n}$, $\mu = P_X$. By condition d) in the Portmanteau theorem we get
$$E(\varphi(f(X_n))) = \int \varphi \circ f\, dP_{X_n} \to \int \varphi \circ f\, dP_X = E(\varphi(f(X))).$$
Example 9.17 Let $X_n$, $Y_n$ be real random variables such that $P_{(X_n, Y_n)} \to P_{(X,Y)}$ and suppose $P(Y = 0) = 0$. Choose $f$ such that $f(x, y) = \frac{x}{y}$ on $\mathbb{R} \times (\mathbb{R} \setminus \{0\})$ and the mapping theorem implies
$$\frac{X_n}{Y_n} \xrightarrow{L} \frac{X}{Y}.$$
1 Otherwise
1 = (S)
({ = a}) = .
9.3
We add a short section on a much stronger concept of convergence for probability distributions.
Definition 9.18 If $P$, $Q$ are two probability measures on the same $\sigma$-algebra $\mathcal{A}$ then
$$\|P - Q\| = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|$$
Z
f g dx
g f dx|
AA
AA+
Z
f g dx,
g f dx
+
ZAA
Z AA
max
f g dx,
g f dx
max
A+
= max P (A ) Q(A ), Q(A ) P (A ) .
Since $\int_{A^+} (f - g)\, dx - \int_{A^-} (g - f)\, dx = 1 - 1 = 0$ and $\int_{A^+} (f - g)\, dx + \int_{A^-} (g - f)\, dx = \int |f - g|\, dx$ the result follows.
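The same identity holds for laws on a finite set, with sums in place of integrals: the supremum over all events equals half the $\ell^1$-distance of the weights. The sketch below checks this discrete analogue by brute force; the two laws `p` and `q` are hypothetical examples, and the discrete setting itself is our substitution for the density case treated in the text.

```python
from fractions import Fraction
from itertools import combinations

def tv_sup(p, q):
    """sup over all events A of |P(A) - Q(A)|, by enumerating every subset of the support."""
    pts = sorted(set(p) | set(q))
    best = Fraction(0)
    for r in range(len(pts) + 1):
        for A in combinations(pts, r):
            diff = sum(p.get(x, 0) for x in A) - sum(q.get(x, 0) for x in A)
            best = max(best, abs(diff))
    return best

def tv_half_l1(p, q):
    """Half the l1-distance between the weight vectors."""
    pts = set(p) | set(q)
    return sum(abs(p.get(x, 0) - q.get(x, 0)) for x in pts) / 2

p = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
q = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
print(tv_sup(p, q), tv_half_l1(p, q))
```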
It is worthwhile to note that pointwise convergence of densities to a density implies convergence in total variation of the corresponding distributions.
Theorem 9.20 (Lemma of Scheffé) If $P_n$ and $P$ have densities $f_n$ and $f$ and $f_n(x) \to f(x)$ for each $x$, then $P_n \to P$ in total variation, i.e.
$$\int |f_n(x) - f(x)|\, dx \to 0.$$
Proof. Letting $A_n^+ = \{f > f_n\}$ we see by the above remark and dominated convergence (note $0 \le (f - f_n)^+ \le f$) that
$$\|P - P_n\| = P(A_n^+) - P_n(A_n^+) = \int (f(x) - f_n(x))^+\, dx \to 0.$$
Chapter 10
Characteristic Functions
10.1
In this chapter we will use complex valued random variables. For such variables the expectation is defined as $E(X + iY) := E(X) + i E(Y)$ where $X$ and $Y$ are real valued random variables. Most properties of the expectation carry over to this situation. In particular the formula $|E(Z)| \le E(|Z|)$ and the dominated convergence theorem hold also for complex random variables.
Recall the following theorem about parameter-dependent expectations (see Theorem 4.23): If the function $f(t, x)$ is continuously differentiable in $t$ and if for all $t \in (a, b)$ one has $|\frac{\partial}{\partial t} f(t, X)| \le Y$ for some random variable $Y$ with finite $E(Y)$, then $E(f(t, X))$ is continuously differentiable in $t \in (a, b)$ with
$$\frac{\partial}{\partial t} E(f(t, X)) = E\Big(\frac{\partial}{\partial t} f(t, X)\Big).$$
Theorem 10.1 The same is still true in the complex valued case and also if the real variable $t \in (a, b)$ is replaced by a complex variable $z \in U \subset \mathbb{C}$ with $U$ open and the differentiation is understood in the complex sense.
Proof. The first part is clear. For the complex variable situation it suffices to apply the real variable case to the partial derivatives and to note that the Cauchy-Riemann differential equations for $f(\cdot, x)$ carry over to the function $E(f(\cdot, X))$.
10.2
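$$\varphi_X : \mathbb{R}^d \to \mathbb{C} \quad \text{with} \quad t = (t_1, \dots, t_d) \mapsto E\big(e^{i \langle t, X \rangle}\big)$$
is called the characteristic function of $X$. Similarly, if $\mu$ is a Borel probability distribution on $\mathbb{R}^d$, then
$$\hat{\mu}(t) := \int e^{i \langle t, x \rangle}\, d\mu(x)$$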
Rd .
E
X
e
1
d
j1 t1 jd td
for j1 + + jd = k.
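Proof.
1. We have $|\varphi_X(t)| = |E(e^{i \langle t, X \rangle})| \le E(|e^{i \langle t, X \rangle}|) = E(1) = 1$ and $\varphi_X(0) = E(e^{i \langle 0, X \rangle}) = E(1) = 1$. Now let $(t_n)$ be a sequence in $\mathbb{R}^d$ which converges to $t$. Then the sequence $e^{i \langle t_n, X \rangle}$ converges pointwise to $e^{i \langle t, X \rangle}$ and is dominated by $|e^{i \langle t_n, X \rangle}| = 1$. Thus we can apply dominated convergence and get
$$\lim_{n \to \infty} \varphi_X(t_n) = \lim_{n \to \infty} E\big(e^{i \langle t_n, X \rangle}\big) = E\big(\lim_{n \to \infty} e^{i \langle t_n, X \rangle}\big) = \varphi_X(t).$$
2. First assume $d = k = 1$. Then we have $|\frac{\partial}{\partial t} e^{itX}| = |iX e^{itX}| = |X|$ and thus Theorem 10.1 implies
$$\frac{\partial}{\partial t} \varphi_X(t) = E\Big(\frac{\partial}{\partial t} e^{itX}\Big) = i\, E\big(X e^{itX}\big).$$
For example, for $k = 2$:
$$\varphi_X''(t) = i\, \frac{\partial}{\partial t} E\big(X e^{itX}\big) = i^2\, E\big(X^2 e^{itX}\big).$$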
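$$\sum_{k=0}^n e^{ikt} \binom{n}{k} p^k (1 - p)^{n-k} = \big(p e^{it} + (1 - p)\big)^n.$$
$$\sum_{k=0}^\infty e^{ikt}\, e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda e^{it}} = e^{\lambda (e^{it} - 1)}.$$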
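Both closed forms can be confirmed numerically by summing the defining series. The sketch below uses hypothetical parameter values ($n = 7$, $p = 0.3$, $\lambda = 2.5$) and truncates the Poisson series after 100 terms, which is our own cut-off.

```python
import cmath
import math

def cf_binomial_sum(n, p, t):
    """E exp(itX) for X ~ Bin(n, p), summed over the binomial weights."""
    return sum(cmath.exp(1j * k * t) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

def cf_binomial_closed(n, p, t):
    return (p * cmath.exp(1j * t) + (1 - p)) ** n

def cf_poisson_sum(lam, t, terms=100):
    """E exp(itX) for X ~ Poisson(lam), series truncated after `terms` summands."""
    return sum(cmath.exp(1j * k * t) * math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(terms))

def cf_poisson_closed(lam, t):
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

for t in (-2.0, 0.0, 0.7, 3.1):
    print(t, cf_binomial_sum(7, 0.3, t), cf_poisson_sum(2.5, t))
```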
t2 2
2
and thus
/2
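Proof. The product rule (5.21) for independent expectations holds also for complex random variables and thus
$$\varphi_{X+Y}(t) = E\big(e^{i \langle t, X + Y \rangle}\big) = E\big(e^{i \langle t, X \rangle} e^{i \langle t, Y \rangle}\big) = E\big(e^{i \langle t, X \rangle}\big) \cdot E\big(e^{i \langle t, Y \rangle}\big).$$
Example 10.11 Let $X = (X_1, \dots, X_d)$ be a standard Gaussian vector, i.e. the $X_i$ are independent identically $N(0, 1)$-distributed. Then Theorem 10.10 gives
$$\varphi_X(t) = E\big(e^{i \langle t, X \rangle}\big) = E\big(e^{i \sum_{k=1}^d t_k X_k}\big) = \prod_{k=1}^d \varphi_{X_k}(t_k) = \prod_{k=1}^d e^{-\frac{t_k^2}{2}} = e^{-\frac{\|t\|^2}{2}}.$$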
The name characteristic function is motivated by the fact that a probability distribution is uniquely determined by its characteristic function. In a first step we prove this for distributions which are convolutions with a Gaussian law. In this case there is even an explicit inversion formula for the density in terms of the characteristic function.
Proposition 10.12 Let $P$ be a probability distribution on $\mathbb{R}^d$. Let $P_\sigma = P * N(0, \sigma^2 I) = \mathcal{L}(X + Z)$ where $X$, $Z$ are independent with $\mathcal{L}(X) = P$ and $\mathcal{L}(Z) = N(0, \sigma^2 I)$, $\sigma > 0$. Then $\hat{P}_\sigma(t) = \hat{P}(t)\, e^{-\frac{\|t\|^2 \sigma^2}{2}}$ and $P_\sigma$ has the density
$$f_\sigma(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat{P}(t)\, e^{-\frac{\|t\|^2 \sigma^2}{2}}\, e^{-i \langle x, t \rangle}\, dt.$$
Proof. Let h be bounded measurable. Then
Z
Z Z
h(x)dP (x) =
h(y + t)P (dy)N (0, 2 I)(dt)
d
d
Rd
R
R
ZZ
ktk2
1
22
dt
=
h(y + t)P (dy)
e
d
2
2
Z
Z
kxyk2
1
e 22 P (dy) dx,
= h(x)
d
2 2
where we substitute t := x y. Thus the inner integral equals f (x):
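\[
\begin{aligned}
f_\sigma(x) &= \int_{\mathbb{R}^d} \frac{1}{(2\pi\sigma^2)^{d/2}}\,\widehat{N(0,\sigma^{-2}I)}(x-y)\,P(dy) \\
&= \int_{\mathbb{R}^d} \frac{1}{(2\pi\sigma^2)^{d/2}} \int_{\mathbb{R}^d} e^{-i\langle x-y,t\rangle}\,\Bigl(\frac{\sigma^2}{2\pi}\Bigr)^{d/2} e^{-\|t\|^2\sigma^2/2}\,dt\,P(dy) \\
&= \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} e^{-i\langle x,t\rangle - \|t\|^2\sigma^2/2} \int_{\mathbb{R}^d} e^{i\langle y,t\rangle}\,P(dy)\,dt \\
&= \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \widehat{P}(t)\,e^{-i\langle x,t\rangle - \|t\|^2\sigma^2/2}\,dt.
\end{aligned}
\]
In the second line the characteristic function of the symmetric law $N(0,\sigma^{-2}I)$ is written with $e^{-i\langle x-y,t\rangle}$, which is permitted by symmetry.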
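\[ f_\sigma(x) = \int g_\sigma(z)\,f(x-z)\,dz = E\bigl(f(x-\sigma Z)\bigr) \to f(x) \quad\text{as } \sigma\to 0. \]
\[ \lim_{\sigma\to 0}\frac{1}{(2\pi)^d}\int \widehat{P}(t)\,e^{-i\langle x,t\rangle}\,e^{-\|t\|^2\sigma^2/2}\,dt = \frac{1}{(2\pi)^d}\int \widehat{P}(t)\,e^{-i\langle x,t\rangle}\,dt \]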
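Example 10.14 Let $d = 1$ and let $P$ be the two-sided exponential distribution, i.e. $P$ has the density $f(x) = \frac12 e^{-|x|}$. We compute the characteristic function:
\[ \widehat{P}(t) = \int_{-\infty}^{\infty} \tfrac12 e^{-|x|} e^{ixt}\,dx = \int_0^\infty e^{-x}\,\frac{e^{ixt} + e^{-ixt}}{2}\,dx = \int_0^\infty e^{-x}\cos tx\,dx. \]
Integrating by parts twice gives $\int_0^\infty e^{-x}\cos tx\,dx = 1 - t^2\int_0^\infty e^{-x}\cos tx\,dx$ and hence
\[ \widehat{P}(t) = \frac{1}{1+t^2}. \]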
This is integrable in $t$ and we can apply the inversion formula of the Theorem to get
\[ \frac12 e^{-|x|} = \frac{1}{2\pi}\int \frac{1}{1+t^2}\,e^{-ixt}\,dt = \frac12\int \frac{1}{\pi(1+t^2)}\,e^{-ixt}\,dt = \frac12\,\widehat{\mathrm{Cauchy}(1)}(x). \]
Thus we also obtain the characteristic function of the Cauchy distribution: $\widehat{\mathrm{Cauchy}(1)}(x) = e^{-|x|}$.
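The value $1/(1+t^2)$ found in Example 10.14 can be confirmed by numerical quadrature. The sketch below approximates $\int_0^\infty e^{-x}\cos tx\,dx$ with the trapezoidal rule on a truncated interval; the cut-off `upper` and the number of steps are assumptions chosen for illustration.

```python
import math


def cf_two_sided_exponential(t, upper=40.0, steps=100_000):
    """Trapezoidal approximation of the integral of e^{-x} cos(tx) over [0, upper].

    The tail beyond `upper` is of size e^{-upper} and is neglected.
    """
    h = upper / steps
    # end points carry weight 1/2 in the trapezoidal rule
    total = 0.5 * (1.0 + math.exp(-upper) * math.cos(t * upper))
    for j in range(1, steps):
        x = j * h
        total += math.exp(-x) * math.cos(t * x)
    return total * h


for t in (0.0, 0.5, 2.0):
    print(t, cf_two_sided_exponential(t), 1.0 / (1.0 + t * t))
```

For each `t` the two printed numbers should agree to several decimal places.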
Remark 10.15 We see that $e^{-|t|}$ and $e^{-t^2} = \widehat{N(0,2)}(t)$ are both characteristic functions. One can show that $e^{-|x|^\alpha}$ is a characteristic function of a suitable distribution for each $\alpha \in (0,2]$. But for most $\alpha \ne 1, 2$ the density of the corresponding distribution (the $\alpha$-stable symmetric distribution) is not known explicitly.
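Proof. 1. By assumption, for each $t$ the real random variable $\langle t,X\rangle$ has a Gaussian law. Its expectation is $\langle t,\mu\rangle$ by linearity and by Theorem 7.17 its variance is equal to $t^T C t = \langle Ct,t\rangle$. Thus, using Example 10.8,
\[ \varphi_X(t) = E\bigl(e^{i\langle t,X\rangle}\bigr) = \varphi_{\langle t,X\rangle}(1) = e^{i\langle\mu,t\rangle}\,e^{-\langle Ct,t\rangle/2}. \]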
\[ \varphi_{\sum t_k X_k}(s) = e^{\,i s\sum t_k\mu_k}\,e^{-s^2\sum t_k^2\sigma_k^2/2}, \]
which is the characteristic function of $N\bigl(\sum t_k\mu_k, \sum t_k^2\sigma_k^2\bigr)$. Thus by uniqueness $\sum t_k X_k$ has a Gaussian law.
3. If $X = (X_1,\dots,X_d) \sim N(\mu, C)$ and if the $X_i$ are uncorrelated then $C$ is a diagonal matrix, i.e. by steps 1 and 2 the joint law of the $X_i$ is the same as that of independent Gaussian variables. So the $X_i$ are independent.
10.4
Rd .
Now the assertion follows with part e) of the Portmanteau theorem (9.14).
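$E(L\|\sigma Z\|) \le L\sigma\sqrt{E(\|Z\|^2)} = L\sigma\sqrt{d}$, since $E(\|Z\|^2) = E\bigl(\sum_{i=1}^d Z_i^2\bigr) = \sum_{i=1}^d E(Z_i^2) = \sum_{i=1}^d 1 = d$. Similarly $\bigl|\int h\,dP_{n,\sigma} - \int h\,dP_n\bigr| \le L\sigma\sqrt{d}$, and hence
\[ \Bigl|\int h\,dP_n - \int h\,dP\Bigr| \le 2L\sigma\sqrt{d} + \Bigl|\int h\,dP_{n,\sigma} - \int h\,dP_\sigma\Bigr|. \]
Since $P_{n,\sigma}$ converges weakly to $P_\sigma$ and $\sigma$ can be chosen arbitrarily small, this implies $P_n \to P$ weakly by the Portmanteau theorem (9.14), part b).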
Chapter 11
11.1
N (0, C).
n
Proof. Let Xi0 := Xi . Then E(Xi0 ) = 0 and Sn =
without loss of generality that = 0.
Sn /n (t) = Pni=1 Xi
=
n
Y
i=1
Xi
= X1
n
2 X1 (0)
= i2 E(X1k X1l ) = cov(X1k , X1l ) = ckl .
tk tl
and
tk
X tk tl
2
(0) +
(0) + o(ktk2 )
tk
2 tk tl
k,l
gives
t
Sn ( ) = X1
n
with rn (t) 0. Since (1 +
x
n
n
n
X
1
rn (t)
= 1 + 0
ckl tk tl +
2n
n
k,l
11.2
npk
k=1
is used as a measure of how much the observation differs from the expected value.
Theorem 11.2 The law of the random variable $D_n^2$ converges weakly to the $\chi^2_{d-1}$-distribution.
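Proof. For $i \in \mathbb{N}$ let $Y_i$ be the $k$-th unit vector where $k \in \{1,\dots,d\}$ is the unique index with $X_i \in E_k$, i.e. $Y_i^k = 1_{\{X_i \in E_k\}}$. The sequence $(Y_i)$ is an iid sequence of random vectors and $\sum_{i=1}^n Y_i = (\nu_1^n,\dots,\nu_d^n)$. Moreover $Y_i^k \sim \mathrm{Bin}(1, p_k)$ and $p := E(Y_i) = (p_1,\dots,p_d)$. Then the law of
\[ \frac{\sum_{i=1}^n Y_i - np}{\sqrt{n}} = \Bigl(\frac{\nu_k^n - np_k}{\sqrt{n}}\Bigr)_{k=1,\dots,d} \]
converges weakly to the Gaussian $N(0,C)$-distribution where $C$ is the covariance matrix of $Y_1$. Moreover
\[ D_n^2 = \Bigl\| W\,\frac{\sum_{i=1}^n Y_i - np}{\sqrt{n}} \Bigr\|^2 \quad\text{where}\quad W := \mathrm{diag}\Bigl(\frac{1}{\sqrt{p_1}},\dots,\frac{1}{\sqrt{p_d}}\Bigr). \]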
By the mapping principle of weak convergence $\mathcal{L}(D_n^2) \to \mathcal{L}(\|WZ\|^2)$ where $Z \sim N(0,C)$. To calculate $C$ note that $\mathrm{var}(Y_1^k) = p_k(1-p_k)$ and $\mathrm{cov}(Y_1^k, Y_1^l) = E(Y_1^k Y_1^l) - E(Y_1^k)E(Y_1^l) = -p_k p_l$ for $k \ne l$. This yields
\[ C = \begin{pmatrix} p_1(1-p_1) & & -p_k p_l \\ & \ddots & \\ -p_k p_l & & p_d(1-p_d) \end{pmatrix} = \mathrm{diag}(p_1,\dots,p_d) - p\,p^T. \]
The vector $Wp$ has the form $(\sqrt{p_1},\dots,\sqrt{p_d})$ and hence satisfies $\|Wp\| = 1$. The covariance matrix of $WZ$ is $WCW^T = I - (Wp)(Wp)^T$. Thus this covariance matrix is the matrix of the orthogonal projection to the hyperplane $(Wp)^\perp$. Thus there exists an orthogonal matrix $Q$ such that $QWCW^TQ^T = \mathrm{diag}(1,\dots,1,0)$. Since an orthogonal matrix does not change the norm we have $\|WZ\|^2 = \|QWZ\|^2$. Now $\mathrm{cov}(QWZ) = QWCW^TQ^T = \mathrm{diag}(1,\dots,1,0)$ implies that $QWZ$ is a $N(0,\mathrm{diag}(1,\dots,1,0))$-vector. Thus the components of $QWZ$ are independent and the definition of the $\chi^2$-distribution yields
\[ \mathcal{L}(D_n^2) \to \mathcal{L}(\|WZ\|^2) = \mathcal{L}(\|QWZ\|^2) = \mathcal{L}\bigl((QWZ)_1^2 + \dots + (QWZ)_{d-1}^2\bigr) = \chi^2_{d-1}. \]
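The linear algebra in this proof can be checked on a concrete probability vector. The sketch below builds $WCW^T$ for a sample `p` (an arbitrary illustrative choice) and tests that it is idempotent, has trace $d-1$, and annihilates the unit vector $Wp$; all helper names are hypothetical.

```python
import math


def standardized_covariance(p):
    """Return W C W^T for C = diag(p) - p p^T and W = diag(1/sqrt(p_k))."""
    d = len(p)
    C = [[(p[k] if k == l else 0.0) - p[k] * p[l] for l in range(d)] for k in range(d)]
    # (W C W^T)_{kl} = C_{kl} / sqrt(p_k p_l) because W is diagonal
    return [[C[k][l] / math.sqrt(p[k] * p[l]) for l in range(d)] for k in range(d)]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


p = [0.2, 0.3, 0.5]
d = len(p)
A = standardized_covariance(p)
u = [math.sqrt(x) for x in p]  # the unit vector W p
A2 = matmul(A, A)

idempotent = all(abs(A2[i][j] - A[i][j]) < 1e-12 for i in range(d) for j in range(d))
trace = sum(A[i][i] for i in range(d))
kills_u = all(abs(sum(A[i][j] * u[j] for j in range(d))) < 1e-12 for i in range(d))
print(idempotent, trace, kills_u)
```

An idempotent symmetric matrix with trace $d-1$ is an orthogonal projection of rank $d-1$, which is what the diagonalisation by $Q$ expresses.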
11.3
Let $X_1, X_2, \dots$ be independent random variables which either do not necessarily have the same distribution, or are iid but without finite variance. Is there a Central Limit Theorem in this situation?
In the first case set $s_n := \sqrt{\sum_{i=1}^n \mathrm{var}(X_i)} = \sigma(X_1 + \dots + X_n)$ and define
\[ X_{nj} := \frac{X_j - E(X_j)}{s_n} \quad\text{and}\quad S_n := \sum_{j=1}^n X_{nj}. \tag{11.1} \]
Then E(Sn ) = 0 and var(Sn ) = 1, i.e. Sn is standardized. The following example shows
that in general Sn does not converge to N (0, 1) in distribution.
Example 11.3 Let pi > 0 and define independent random variables
(
1pi
with probability 12 pi each
Xi :=
0
otherwise
n
1 X
Xi 0
n i=1
...
...
..
.
X1k1
...
..
.
X2k2
..
.
..
1jkn
as n .
In the case of finite variance we have the following basic result: The proof uses a careful analysis of the ingredients of the proof in the first section in this chapter based on characteristic
functions.
Theorem 11.6 Let $X_1, X_2, \dots$ be independent with finite variance. Using the notation in (11.1) the following conditions are equivalent:
1. $\mathcal{L}(S_n) \to N(0,1)$ and the $X_{nj}$ are asymptotically negligible.
2. Lindeberg condition: For each $\varepsilon > 0$
\[ L_n(\varepsilon) := \sum_{j=1}^n E\bigl(X_{nj}^2\,1_{\{|X_{nj}| \ge \varepsilon\}}\bigr) \to 0 \quad\text{as } n \to \infty. \]
Proof. 2. $\Rightarrow$ 1. was shown by Lindeberg, the other direction by W. Feller. Confer [?].
\[ \sum_{j=1}^n E\bigl(|X_{nj}|^{2+\delta}\bigr) \to 0 \]
as $n \to \infty$. The CLT under Lyapunov's condition with $\delta = 1$, i.e. under some assumption on the third moments, is known to follow from more elementary observations; for simplicity we refer to [?].
Another interesting question is the speed of the convergence. Here again the existence of
the third moment is useful.
Theorem 11.7 (Berry-Esseen) If the $X_i$ are independent identically distributed and $\rho := E(|X_1|^3) < \infty$, then for all $n \in \mathbb{N}$
sup FSn (x) (x)
0.8
Proof. Again clever use of characteristic functions. The best known constant in the theorem
has improved over time. The value 0.8 is quoted from [?], see also [?], 15.
Chapter 12
Conditional Expectation
12.1
This implies $\langle x - P_L x, y\rangle = 0$.
The linearity of PL is an easy consequence of property 3.
12.2
Suppose X and Z are random variables on the same probability space. E(X|Z), the conditional expectation of X given Z, should be the expected value of X under the condition
that we know the value of Z. The following list contains some properties which we expect
from a formal definition of conditional expectation:
If X and Z are independent, then E(X|Z) = E(X).
(Knowledge about Z does not change the knowledge about X)
Example 12.2 Here is a simple example in which we can guess the value of a conditional expectation: Let $X_1$ and $X_2$ be the outcomes of two dice rolls: $\Omega := \{1,\dots,6\}^2$, $X_1(i,j) := i$, $X_2(i,j) := j$. Define $Z := X_1 + X_2$. By symmetry we expect that $E(X_1|Z) = E(X_2|Z)$. By linearity $E(X_1|Z) + E(X_2|Z) = E(X_1 + X_2|Z) = E(Z|Z) = Z$. Thus $E(X_1|Z) = E(X_2|Z) = \frac12 Z$.
12.3
Thus a set $G$ is $\mathcal{G}$-measurable if and only if $G$ is the union of some of the $B_i$. Now define a random variable
\[ Y(\omega) := \frac{1}{P(B_i)}\,E(1_{B_i} X) \quad\text{if } \omega \in B_i. \]
Then $Y$ is constant on each of the $B_i$. Therefore $Y$ is $\mathcal{G}$-measurable. If $G = \bigcup_{i\in M} B_i$, then
\[ E(1_G Y) = \sum_{i\in M} E(1_{B_i} Y) = \sum_{i\in M} P(B_i)\,\frac{1}{P(B_i)}\,E(1_{B_i} X) = \sum_{i\in M} E(1_{B_i} X) = E(1_G X). \]
\[ \frac{1}{P(B_i)}\,E(1_{B_i} X) \quad\text{for } \omega \in B_i. \]
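Example 12.7 We take our old example: Let $\Omega := \{1,\dots,6\}^2$ and define random variables $X_1(i,j) := i$, $X_2(i,j) := j$ and $Z(i,j) = i + j$. We partition $\Omega$ according to the values of $Z$:
\[ \Omega = \bigcup_{k=2}^{12} \{Z = k\} \]
and define
\[ \mathcal{G} := \sigma(Z) = \Bigl\{ \bigcup_{k\in M} \{Z = k\} : M \subset \{2,\dots,12\} \Bigr\} = \bigl\{ Z^{-1}(M) : M \subset \{2,\dots,12\} \bigr\}. \]
Our old guess that $E(X_1|\mathcal{G}) = Z/2$, i.e. $E(X_1|\mathcal{G}) = k/2$ on $\{Z = k\}$, can now be checked for each $k$, e.g. $k = 10$: for $\omega \in \{Z = 10\}$
\[ E(X_1|\mathcal{G})(\omega) = \frac{1}{P(Z = 10)}\,E\bigl(1_{\{Z=10\}} X_1\bigr) = \frac{1}{1/12}\Bigl(\frac{4}{36} + \frac{5}{36} + \frac{6}{36}\Bigr) = 5. \]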
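The check carried out above for $k = 10$ can be repeated mechanically for every $k$. The following sketch enumerates the 36 outcomes with exact rational arithmetic; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product


def conditional_expectation_given_sum():
    """Return {k: E(X_1 | Z = k)} for two fair dice, with Z = X_1 + X_2."""
    omega = list(product(range(1, 7), repeat=2))
    prob = Fraction(1, 36)  # uniform weight of every outcome (i, j)
    result = {}
    for k in range(2, 13):
        block = [(i, j) for (i, j) in omega if i + j == k]  # the event {Z = k}
        p_block = prob * len(block)                  # P(Z = k)
        e_block = sum(prob * i for (i, j) in block)  # E(1_{Z=k} X_1)
        result[k] = e_block / p_block
    return result


values = conditional_expectation_given_sum()
print(all(values[k] == Fraction(k, 2) for k in values))
```

The formula used is the one for conditional expectations with respect to a partition, applied to the blocks $\{Z = k\}$.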
Definition 12.8 If $\mathcal{G} = \sigma(Z)$ one also writes $E(X|Z)$ instead of $E(X|\mathcal{G})$.
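Proposition 12.9 Let $X, Z$ take countably many values $x_k, z_l$ and define $p_{k|l} = P(X = x_k | Z = z_l)$ if $P(Z = z_l) > 0$. Then $E(X|Z)(\omega) = \sum_k x_k\,p_{k|l}$ if $Z(\omega) = z_l$.
Proof. Set $B_l = \{Z = z_l\}$ and use the previous proposition: We know that for $\omega \in B_l$
\[
\begin{aligned}
E(X|Z)(\omega) = E(X|\mathcal{G})(\omega) &= \frac{1}{P(B_l)}\,E(1_{B_l} X) = \frac{1}{P(Z = z_l)} \sum_k E\bigl(1_{B_l \cap \{X = x_k\}}\,x_k\bigr) \\
&= \sum_k x_k\,P(X = x_k | Z = z_l) = \sum_k x_k\,p_{k|l}.
\end{aligned}
\]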
\[ \sum_{i=1}^n a_i 1_{G_i} = \sum_{i=1}^n a_i 1_{B_i}. \]
12.4
We now can verify that the formal definition of the previous section leads to a concept which
has the properties of the first section, and some other nice features.
Theorem 12.11
1. If $X$ is independent of $\mathcal{G}$ then $E(X|\mathcal{G}) = E(X)$ almost surely.
In particular $E(X|\{\emptyset, \Omega\}) = E(X)$.
Proof. Clearly the constant random variable $E(X)$ is $\mathcal{G}$-measurable. If $G \in \mathcal{G}$ then $X$ and
The conditional expectation shares the following three basic properties of ordinary expectation.
Theorem 12.13
1. $X > X'$ a.s. implies $E(X|\mathcal{G}) > E(X'|\mathcal{G})$ a.s., and similarly with $\ge$.
2. $E(aX + bX'|\mathcal{G}) = aE(X|\mathcal{G}) + bE(X'|\mathcal{G})$ almost surely.
3. If $0 \le X_n \uparrow X$, then $E(X_n|\mathcal{G}) \uparrow E(X|\mathcal{G})$ almost surely (monotone convergence for conditional expectation).
Proof.
1. The first half of the proof of a.s. uniqueness (Proposition 12.4) proves the assertion for $\ge$. Now assume $X > X'$ a.s. We already know $E(X|\mathcal{G}) \ge E(X'|\mathcal{G})$ a.s. Let $G = \{E(X|\mathcal{G}) = E(X'|\mathcal{G})\}$. Then
\[ E(1_G X) = E\bigl(1_G E(X|\mathcal{G})\bigr) = E\bigl(1_G E(X'|\mathcal{G})\bigr) = E(1_G X'). \]
Since $X > X'$ a.s. we have necessarily $P(G) = 0$, which proves the assertion.
2. The right-hand side has the defining property of the left-hand side.
3. Let $Y := \sup_n E(X_n|\mathcal{G})$, then $Y$ is $\mathcal{G}$-measurable and has the defining property of $E(X|\mathcal{G})$: for $G \in \mathcal{G}$
\[ E(1_G Y) = \sup_n E\bigl(1_G E(X_n|\mathcal{G})\bigr) = \sup_n E(1_G X_n) = E(1_G X). \]
Remark 12.14 Similarly one has a Lemma of Fatou and a theorem of dominated convergence for conditional expectation. They follow from monotone convergence precisely as in
section 4.1.
For random variables in L2 conditional expectation has an interpretation in terms of Hilbert
space projections.
Theorem 12.15 Let $\mathcal{G} \subset \mathcal{A}$ be a sub-$\sigma$-algebra. Then
1. The space $L^2_{\mathcal{G}} = \{[Y] \in L^2 : Y \text{ is } \mathcal{G}\text{-measurable}\}$ forms a closed linear subspace of the Hilbert space $L^2$.
2. Let $X \in L^2$ and let $Y$ be a random variable such that $[Y] = P_{L^2_{\mathcal{G}}}([X])$ in the sense of Theorem 12.1. Then $Y = E(X|\mathcal{G})$ almost surely.
Proof.
1. Let $[Y_n] \in L^2_{\mathcal{G}}$ and assume $[Y_n] \to [Y] \in L^2$. Then $(Y_n)$ is Cauchy in $L^2(\Omega, \mathcal{G}, P|_{\mathcal{G}})$ and by the completeness of this space there exists a $\mathcal{G}$-measurable $Y'$ with $\|Y_n - Y'\|_2 \to 0$ as $n \to \infty$. Thus $[Y] = [Y'] \in L^2_{\mathcal{G}}$.
2. By the definition of $P_{L^2_{\mathcal{G}}}$ we get $[X] - [Y] \perp [Z]$ for all $[Z] \in L^2_{\mathcal{G}}$ if $[Y] = P_{L^2_{\mathcal{G}}}[X]$. This means $E((X - Y)Z) = 0$ for all $Z \in L^2(\Omega, \mathcal{G}, P)$, or $E(XZ) = E(YZ)$. We choose $Z = 1_G$ for $G \in \mathcal{G}$ and get $Y = E(X|\mathcal{G})$.
So in short, in $L^2$ the conditional expectation is the orthogonal projection to $L^2_{\mathcal{G}}$. In particular it exists.
Corollary 12.16 Let $X \ge 0$ or $X \in L^1$. Then there is a conditional expectation $E(X|\mathcal{G})$.
Proof. If $X$ is non-negative, we choose $X_n := \min(X, n) \in L^2$. Then $X_n \uparrow X$ and therefore $E(X_n|\mathcal{G})$, which exists by the previous theorem, converges to $E(X|\mathcal{G})$ (and in particular this expectation exists). In the general case decompose again $X = X^+ - X^-$.
For the next property of conditional expectation we need a small tool from real analysis.
Lemma 12.17 Let $I$ be an open real interval. For a convex function $\varphi : I \to \mathbb{R}$ there is a sequence of points $t_n \in I$ and $c_n \in \mathbb{R}$ such that $\varphi(x) = \sup_n \bigl(\varphi(t_n) + c_n(x - t_n)\bigr)$ for all $x \in I$.
Proof. Convex functions are continuous in the interior of their domain. Moreover, the convexity implies that for each $t \in I$ there exists $c(t)$ such that $\varphi(x) \ge \varphi(t) + c(t)(x - t)$ for all $x \in I$, in particular $c(t)$ is a sub-derivative at $t$.¹ Choose $(t_n)$ dense in $I$ and set $\psi(x) := \sup_n \bigl(\varphi(t_n) + c(t_n)(x - t_n)\bigr)$. Clearly $\psi(x) \le \varphi(x)$. For the converse choose a sequence $n_k$ such that $t_{n_k} \to x$. Then $(c(t_{n_k}))$ is bounded and thus by continuity we get
\[ \varphi(x) = \lim_{k\to\infty} \varphi(t_{n_k}) = \lim_{k\to\infty} \bigl(\varphi(t_{n_k}) + c(t_{n_k})(x - t_{n_k})\bigr) \le \psi(x). \]
¹ If these general properties of convex functions are not obvious to the reader he/she may add them to the assumptions on $\varphi$. In concrete applications $\varphi$ is even piece-wise smooth and these properties are immediate.
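Proof. Part 1 of Theorem 12.13 implies that $E(X|\mathcal{G}) \in I$ a.s. The previous lemma yields
\[
\begin{aligned}
E\bigl(\varphi(X)\,|\,\mathcal{G}\bigr) &\ge \sup_n E\bigl(\varphi(t_n) + c_n(X - t_n)\,|\,\mathcal{G}\bigr) \\
&= \sup_n \bigl(\varphi(t_n) + c_n(E(X|\mathcal{G}) - t_n)\bigr) \\
&= \varphi\bigl(E(X|\mathcal{G})\bigr).
\end{aligned}
\]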
Corollary 12.19
For every $X \in L^1$ one has $|E(X|\mathcal{G})| \le E(|X|\,|\,\mathcal{G})$ and, more generally, if $p \ge 1$ and $X \in L^p$, then $\|E(X|\mathcal{G})\|_p \le \|X\|_p$.
Proof. The map $x \mapsto |x|^p$ is convex for $p \ge 1$, so by Jensen's inequality we get $|E(X|\mathcal{G})|^p \le E(|X|^p\,|\,\mathcal{G})$ and the first assertion follows, while the second one follows from $\|E(X|\mathcal{G})\|_p^p = E(|E(X|\mathcal{G})|^p) \le E(E(|X|^p\,|\,\mathcal{G})) = E(|X|^p) = \|X\|_p^p$.
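One can take short cuts when taking conditional expectations with respect to different $\sigma$-algebras.
Theorem 12.20 [Tower property] If $X \in L^1$ and $\mathcal{G} \subset \mathcal{H}$ are sub-$\sigma$-algebras, then
\[ E\bigl(E(X|\mathcal{H})\,|\,\mathcal{G}\bigr) = E(X|\mathcal{G}). \]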
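On a finite probability space with $\sigma$-algebras generated by partitions, the tower property can be verified by direct computation. The sketch below uses the block-average formula for conditional expectations given a partition; the sample space, the weights and the two nested partitions are illustrative assumptions.

```python
from fractions import Fraction


def cond_exp(X, prob, partition):
    """Conditional expectation of X given the sigma-algebra generated by `partition`.

    X and prob are lists indexed by the outcomes 0, ..., len(X) - 1;
    partition is a list of blocks (lists of outcomes) with positive probability.
    """
    Y = [None] * len(X)
    for block in partition:
        p_block = sum(prob[w] for w in block)
        average = sum(prob[w] * X[w] for w in block) / p_block
        for w in block:
            Y[w] = average  # constant on each block
    return Y


prob = [Fraction(k + 1, 36) for k in range(8)]   # weights 1/36, ..., 8/36 (sum 1)
X = [3, -1, 4, 1, -5, 9, 2, -6]
H = [[0, 1], [2, 3], [4, 5], [6, 7]]             # finer partition
G = [[0, 1, 2, 3], [4, 5, 6, 7]]                 # coarser partition, so G is contained in H

left = cond_exp(cond_exp(X, prob, H), prob, G)   # E(E(X|H)|G)
right = cond_exp(X, prob, G)                     # E(X|G)
print(left == right)
```

Exact rational arithmetic is used so that the two sides can be compared with `==`.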
12.5
Uniform integrability
In the next chapter we even consider conditional expectations of a random variable for infinitely many $\sigma$-algebras simultaneously. In Theorem 12.25 below we show that the resulting family of random variables is uniformly integrable. This concept is useful also in other contexts.
Definition 12.21 A family $(X_i)_{i\in I}$ of real random variables is called uniformly integrable, if
1. $\sup_{i\in I} E(|X_i|) < \infty$,
2. $\sup_{i\in I} E\bigl(|X_i|\,1_{\{|X_i| \ge k\}}\bigr) \to 0$ as $k \to \infty$.
This concept allows an extension of the Theorem of dominated convergence.
Theorem 12.22 Let $(X_n)_{n\in\mathbb{N}}$ be a uniformly integrable sequence. If $X_n$ converges to $X$ in distribution, then $E(X_n) \to E(X)$.
Proof. Define $\varphi_k(x) := (-k \vee x) \wedge k$ and define $a_{kn} := E(\varphi_k(X_n))$. For each $k$, $X_n \xrightarrow{\mathcal{L}} X$ implies that the $k$-th row of the matrix $(a_{kn})$ converges as $n \to \infty$:
\[ a_{kn} = E(\varphi_k(X_n)) \to E(\varphi_k(X)). \]
Since $\varphi_k(|x|) \uparrow |x|$, we get with the first part of Lemma 9.13 that
\[ E(|X|) \le \liminf_{n\to\infty} E(|X_n|) < \infty. \]
Therefore $E(X) = \lim_{k\to\infty} E(\varphi_k(X))$, i.e. the row limits converge to $E(X)$. Similarly the columns converge even uniformly in $n$ since
\[ \sup_n |a_{kn} - E(X_n)| \le \sup_n E\bigl(|\varphi_k(X_n) - X_n|\bigr) \le \sup_n E\bigl(|X_n|\,1_{\{|X_n| \ge k\}}\bigr) \to 0 \quad\text{as } k \to \infty. \]
Figure 12.1: Condition 2 for uniform integrability: $E(|X_n|\,1_{\{|X_n| \ge k\}})$ converges to 0 for $k \to \infty$.
Uniform integrability often is easily verified. Here are two sufficient criteria.
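Proposition 12.23 A family of random variables $(X_i)_{i\in I}$ is uniformly integrable if either
$|X_i| \le Y$ for all $i$ with a random variable $Y$ such that $E(|Y|) < \infty$, or
$\sup_{i\in I} \|X_i\|_p < \infty$ for some $p > 1$.
Proof. Let us assume that $|X_i| \le Y$ for all $i$. Then $\sup_{i\in I} E(|X_i|) \le E(|Y|) < \infty$ and
\[ \sup_{i\in I} E\bigl(|X_i|\,1_{\{|X_i| \ge k\}}\bigr) \le \sup_{i\in I} E\bigl(|Y|\,1_{\{|X_i| \ge k\}}\bigr) \le E\bigl(|Y|\,1_{\{|Y| \ge k\}}\bigr) \to 0 \]
as $k \to \infty$ by dominated convergence.
Now assume $\sup_{i\in I} \|X_i\|_p < \infty$. This implies
\[ \sup_{i\in I} E(|X_i|) = \sup_{i\in I} \|X_i\|_1 \le \sup_{i\in I} \|X_i\|_p < \infty. \]
Moreover, with $q$ such that $\frac1p + \frac1q = 1$, Hölder's inequality gives for each $i$
\[ E\bigl(|X_i|\,1_{\{|X_i| \ge k\}}\bigr) \le \|X_i\|_p\,\|1_{\{|X_i| \ge k\}}\|_q = \|X_i\|_p\,P(|X_i| \ge k)^{1/q} \le \sup_{i\in I}\|X_i\|_p \Bigl(\frac1k \sup_{i\in I} E(|X_i|)\Bigr)^{1/q} \to 0 \]
by Markov's inequality.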
Corollary 12.24 If $X_n \xrightarrow{\mathcal{L}} X$ and $\sup_n \|X_n\|_p < \infty$ with a $p > 0$, then $E(|X_n|^{p'}) \to E(|X|^{p'})$ for all $p' < p$.
\[ \sup_{n\in\mathbb{N}} E\Bigl(\bigl(|X_n|^{p'}\bigr)^{p/p'}\Bigr)^{p'/p} = \Bigl(\sup_n \|X_n\|_p\Bigr)^{p'} < \infty. \]
Thus (Yn ) is uniformly integrable and E(Yn ) converges to E(Y ) by Theorem 12.22.
E(|Xi |)
< and
inequality P (|Xi | > k)
k
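\[ E\bigl(|X_i|\,1_{\{|X_i| > k\}}\bigr) \le E\bigl(E(|X|\,|\,\mathcal{F}_i)\,1_{\{|X_i| > k\}}\bigr) = E\bigl(|X|\,1_{\{|X_i| > k\}}\bigr) < \varepsilon. \]
Therefore $\sup_i E(|X_i|\,1_{\{|X_i| > k\}}) \to 0$ as $k \to \infty$.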
2 There
Chapter 13
Definition 13.1 Let $(I, \le)$ be an ordered set, let $(\Omega, \mathcal{A}, P)$ be a probability space and let $(X_i)_{i\in I}$ be a family of $\mathbb{R}$-valued integrable random variables on $(\Omega, \mathcal{A}, P)$.
1. A filtration $(\mathcal{F}_i)_{i\in I}$ of $(\Omega, \mathcal{A}, P)$ is an increasing family of sub-$\sigma$-algebras of $\mathcal{A}$, i.e. $\mathcal{F}_i \subset \mathcal{F}_j \subset \mathcal{A}$ for all $i \le j$.
2. $(X_i)$ is an $(\mathcal{F}_i)$-martingale if $X_i = E(X_j|\mathcal{F}_i)$ for all $i \le j$.
3. Suppose $X_i$ is $\mathcal{F}_i$-measurable for all $i \in I$. Then $(X_i)$ is
an $(\mathcal{F}_i)$-submartingale, if $X_i \le E(X_j|\mathcal{F}_i)$ for all $i \le j$,
an $(\mathcal{F}_i)$-supermartingale, if $X_i \ge E(X_j|\mathcal{F}_i)$ for all $i \le j$.
In the sequel most of the time we consider index sets $I \subset \mathbb{Z}$, in particular $I = \{1,\dots,n\}$ or $I = \mathbb{N}$. Very often we choose $\mathcal{F}_n = \sigma(Z_1,\dots,Z_n)$ for $n \in \mathbb{N}$ and some family $(Z_n)$, or even $\mathcal{F}_n = \sigma(X_1,\dots,X_n)$. In the latter case $(\mathcal{F}_n)$ is called the natural filtration of $(X_n)$. If the filtration is clear, it is omitted from the definition of (sub-/super-)martingales.
The simplest way to produce martingales is the following.
Theorem 13.2 For $X \in L^1$ and an arbitrary filtration $(\mathcal{F}_i)$ we define $X_i := E(X|\mathcal{F}_i)$ for $i \in I$. Then $(X_i)_{i\in I}$ is an $(\mathcal{F}_i)_{i\in I}$-martingale.
Proof. Use the tower property.
Example 13.4
1. If $(X_i)$ is a martingale, then $(|X_i|)$ is a submartingale and $(X_i^2)$ is a submartingale if $E(X_i^2) < \infty$ for $i \in I$.
2. Set $X_1 := -2$, $X_2 := -1$, $X_3 := 0$. Then $(X_i)$ is a submartingale (for any filtration) but $(X_i^2)$ is not. This illustrates the condition in Theorem 13.3 that $\varphi$ is increasing.
3. Let $X_1, X_2, \dots$ be independent with $E(X_i) = 0$ for $i \in \mathbb{N}$. Let $(\mathcal{F}_n)$ be the natural filtration: $\mathcal{F}_n := \sigma(X_1,\dots,X_n)$. Define $S_n = \sum_{i=1}^n X_i$ for $n \in \mathbb{N}$. Then $(S_n)$ is an $(\mathcal{F}_n)$-martingale. In fact by Theorem 12.11
\[ E(S_{n+1}|\mathcal{F}_n) = S_n + E(X_{n+1}|\mathcal{F}_n) = S_n + E(X_{n+1}) = S_n + 0 = S_n. \]
The rest follows from the tower property (Theorem 12.20).
4. Let $X_1, X_2, \dots$ be independent with $E(X_i) = 1$ for $i \in \mathbb{N}$. Let $\mathcal{F}_n := \sigma(X_1,\dots,X_n)$ and define $M_n := \prod_{i=1}^n X_i$ for $n \in \mathbb{N}$. Then $(M_n)$ is an $(\mathcal{F}_n)$-martingale. Like in the previous example we get
\[ E(M_{n+1}|\mathcal{F}_n) = M_n\,E(X_{n+1}|\mathcal{F}_n) = M_n\,E(X_{n+1}) = M_n \cdot 1 = M_n. \]
5. Let $(\Omega, \mathcal{A}, P) := ([0,1], \mathcal{B}[0,1], \lambda)$ be the standard probability space on the unit interval. Let $I_{k,n} := [k2^{-n}, (k+1)2^{-n})$ for $k, n \in \mathbb{N}_0$; $k < 2^n$ and $I_{2^n,n} := \{1\}$. Let
\[ \mathcal{F}_n := \sigma\{I_{k,n} : k = 0,\dots,2^n\} = \Bigl\{ \bigcup_{k\in K} I_{k,n} : K \subset \{0,\dots,2^n\} \Bigr\}. \]
Then it is easy to see that $(\mathcal{F}_n)$ is a filtration and for every $X \in L^1$ and all $s \in I_{k,n}$ we have according to Example 12.5
\[ E(X|\mathcal{F}_n)(s) = \frac{1}{\lambda(I_{k,n})}\,E(1_{I_{k,n}} X) = 2^n \int_{k2^{-n}}^{(k+1)2^{-n}} X(t)\,dt. \]
So this sequence of random variables is a martingale.
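6. On the same probability space define
\[ \mathcal{G}_n := \{G \in \mathcal{B}[0,1] : G = (G + 2^{-n}) \bmod 1\}. \]
The $\sigma$-algebra $\mathcal{G}_n$ consists of all Borel sets which are periodic with period $2^{-n}$. Then for each $n \in \mathbb{N}$ and all $s \in [0,1]$
\[ E(X|\mathcal{G}_n)(s) = \frac{1}{2^n} \sum_{k=1}^{2^n} X\bigl((s + k2^{-n}) \bmod 1\bigr). \]
Note however that here $\mathcal{G}_{n+1} \subset \mathcal{G}_n$ and hence $(\mathcal{G}_n)$ is not a filtration with the index set $\mathbb{N}$. But letting $\mathcal{F}_{-n}$ be $\mathcal{G}_n$ we get a filtration $(\mathcal{F}_n)_{n\in -\mathbb{N}}$ with the index set $-\mathbb{N}$: Indeed, if $n < m$ and $n, m \in -\mathbb{N}$ then $-m < -n$ and hence $\mathcal{F}_n \subset \mathcal{F}_m$. Therefore $\bigl(E(X|\mathcal{F}_n)\bigr)_{n\in -\mathbb{N}}$ is a martingale.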
Remark 13.5 Let us compare examples 5 and 6: In example 5, $E(X|\mathcal{F}_n)$ is a step function which is a close approximation of $X$. If $X$ is a continuous function then it is easily verified that for all $s \in [0,1]$
\[ E(X|\mathcal{F}_n)(s) \to X(s) \]
as $n \to \infty$. In example 6, $E(X|\mathcal{G}_m)$ is an average of the values of $X$ at equidistant points and hence an approximation of the Riemann integral, i.e. if $X$ is continuous then
\[ E(X|\mathcal{G}_m)(s) \to E(X) = \int_0^1 X(t)\,dt. \]
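The contrast between the two examples can be made concrete numerically. In the sketch below `dyadic_cond_exp` approximates the dyadic interval average of example 5 by a midpoint rule (the parameter `samples` is an assumption of the sketch), and `periodic_cond_exp` evaluates the equidistant average of example 6; the test function `X` is an arbitrary continuous choice.

```python
import math


def dyadic_cond_exp(X, n, s, samples=1000):
    """Approximate E(X|F_n)(s): the average of X over the dyadic interval of order n containing s."""
    k = min(int(s * 2**n), 2**n - 1)  # s = 1 is assigned to the last interval (a null set)
    left = k / 2**n
    width = 1 / 2**n
    # midpoint rule on the interval [left, left + width)
    return sum(X(left + (j + 0.5) * width / samples) for j in range(samples)) / samples


def periodic_cond_exp(X, n, s):
    """E(X|G_n)(s): the average of X over the 2^n points (s + k 2^{-n}) mod 1."""
    N = 2**n
    return sum(X((s + k / N) % 1.0) for k in range(1, N + 1)) / N


def X(t):
    return math.sin(2 * math.pi * t) + t * t  # continuous, with integral 1/3 over [0, 1]


s = 0.3
print(dyadic_cond_exp(X, 14, s), X(s))        # close to the value X(s)
print(periodic_cond_exp(X, 10, s), 1.0 / 3)   # close to the integral of X
```

As the order grows, the first quantity tracks the function value at `s`, while the second forgets `s` and tends to the mean.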
13.2
Let $(\mathcal{F}_n)_{n\in\mathbb{N}_0}$ be a filtration and $(X_n)_{n\in\mathbb{N}_0}$ be a family of random variables such that $X_n$ is $\mathcal{F}_n$-measurable. Think of $X_n - X_{n-1}$ as your net gain in the $n$-th game in a series of games played at times $n = 1, 2, \dots$.
$(X_n)$ is a martingale if and only if $E(X_n - X_{n-1}|\mathcal{F}_{n-1}) = 0$ (fair game).
$(X_n)$ is a submartingale if and only if $E(X_n - X_{n-1}|\mathcal{F}_{n-1}) \ge 0$ (favorable game).
Definition 13.6 A sequence $(C_n)_{n\in\mathbb{N}}$ is called $(\mathcal{F}_n)$-previsible, if $C_n$ is $\mathcal{F}_{n-1}$-measurable for $n \in \mathbb{N}$.
Think of $C_n$ as your stake in game $n$. You have to choose $C_n$ using the information available at time $n - 1$, i.e. just before the $n$-th game is played. Your gain on game $n$ is $C_n(X_n - X_{n-1})$ and your total gain up to time $n$ is
\[ (C \bullet X)_n := \sum_{k=1}^n C_k (X_k - X_{k-1}). \]
so we can apply the previous theorem with $C = 1_{[0,T]}$. In particular we get $E(X_{T\wedge n}) \le E(X_0)$ for supermartingales, $E(X_{T\wedge n}) = E(X_0)$ for martingales and $E(X_{T\wedge n}) \ge E(X_0)$ for submartingales.
Example 13.13 (Doubling Strategy) Let $Y_1, Y_2, \dots$ be iid with $P(Y_n = 1) = P(Y_n = -1) = \frac12$ and $X_n = \sum_{i=1}^n Y_i$, so that $X_n - X_{n-1} = Y_n$. Let $Y_n = 1$ indicate a win at the $n$-th game, and $Y_n = -1$ a loss. $(X_n)$ is a martingale. Put the stake 1 on the first game. If you win, stop; otherwise double the stake. This leads to
\[ C_k = \begin{cases} 2^{k-1} & \text{if } Y_1 = Y_2 = \dots = Y_{k-1} = -1 \\ 0 & \text{otherwise.} \end{cases} \]
Consider the stopping time $T = \min\{n : Y_n = 1\}$. Then
\[ (C \bullet X)_n = \begin{cases} -2^0 - 2^1 - \dots - 2^{T-2} + 2^{T-1} = +1 & \text{if } T \le n \\ -2^0 - 2^1 - \dots - 2^{n-1} = -2^n + 1 & \text{if } T > n. \end{cases} \]
The theorem shows that $E((C \bullet X)_n) = E(X_{T\wedge n}) = E(X_0) = 0$. In this particular case this result can also be understood directly: We have $P(T > n) = 2^{-n}$ and $P(T \le n) = 1 - 2^{-n}$. Therefore
\[ E(\text{total gain}) = 1\cdot(1 - 2^{-n}) - (2^n - 1)\,2^{-n} = 0. \]
Note that in this game $P(T < \infty) = 1$: If we are allowed to wait arbitrarily long until the win really occurs, we are sure to win: the total gain satisfies $(C \bullet X)_T = 1$ a.s., in particular surprisingly $E((C \bullet X)_T) = 1 \ne 0 = E((C \bullet X)_0)$ even though $C \bullet X$ is a martingale.
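The direct computation of the expected total gain can be reproduced by enumerating all $2^n$ outcome sequences. The sketch below does this with exact rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def doubling_gain(outcomes):
    """Total gain of the doubling strategy along one sequence of +1 / -1 outcomes."""
    gain, stake = 0, 1
    for y in outcomes:
        gain += stake * y
        if y == 1:      # first win: stop playing
            break
        stake *= 2      # loss: double the stake for the next game
    return gain


def expected_gain(n):
    """Exact expectation of the total gain after at most n fair games."""
    weight = Fraction(1, 2**n)  # every outcome sequence has probability 2^{-n}
    return sum(weight * doubling_gain(path) for path in product((1, -1), repeat=n))


print([expected_gain(n) for n in range(1, 9)])
print(doubling_gain((-1, -1, 1)), doubling_gain((-1, -1, -1)))
```

For every finite horizon the expectation vanishes, because the rare event of $n$ consecutive losses costs $2^n - 1$.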
The next theorem gives a condition under which the surprise of the last example can be
avoided.
Theorem 13.14 (Doob's optional sampling theorem) Let $X$ be a submartingale and let $T$ be a stopping time such that $T < \infty$ almost surely and $(X_{T\wedge n})_{n=1,2,\dots}$ is uniformly integrable. Then $E(X_T) \ge E(X_0)$. The analogue holds for supermartingales and martingales.
Proof. $T \wedge n = T$ almost surely for eventually all $n$ and thus $X_{T\wedge n} \to X_T$ a.s., and Theorem 12.22 implies $E(X_T) = \lim_n E(X_{T\wedge n}) \ge E(X_0)$.
Corollary 13.15 The conclusion holds under each of the following conditions on $T$ and $X$:
1. $T$ is bounded;
2. $X$ is bounded and $T < \infty$ almost surely;
3. $E(T) < \infty$ and there is some $k < \infty$ such that $|X_n - X_{n-1}| \le k$ for all $n$.
Proof. In each case we verify uniform integrability of $(X_{T\wedge n})$:
$\sum_{k=1}^{M} |X_k| = Y$ where $T \le M$, $M \in \mathbb{N}$.
PT n
k=1 (Xk
Thus in all cases $(X_{T\wedge n})$ is pointwise bounded by some $Y \in L^1$ and we can apply Proposition 12.23.
This can be easily extended to inequalities between E(XS ) and E(XT ) for two stopping
times, e.g.
Remark 13.16 If $S, T$ are bounded stopping times with $S < T$ almost surely and $(M_i)$ a submartingale, then $E(M_S) \le E(M_T)$.
Proof. $1_{(S,T]} = 1_{[0,T]} - 1_{[0,S]}$ is previsible. $(1_{(S,T]} \bullet M)_n = M_{T\wedge n} - M_{S\wedge n}$ is a submartingale. Since $S \le T \le n$ it follows that
\[ E(M_T) - E(M_S) = E\bigl((1_{(S,T]} \bullet M)_n\bigr) \ge E\bigl((1_{(S,T]} \bullet M)_0\bigr) = 0. \]
13.3
Doob's $L^p$-inequality
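Proof. Let
\[ T = \begin{cases} \min\{i : M_i \ge \lambda\} \\ n + 1 & \text{if no such } i \text{ exists.} \end{cases} \]
Then
\[
\begin{aligned}
P\Bigl(\max_{i\le n} M_i \ge \lambda\Bigr) &= \sum_{i=1}^n P(M_i \ge \lambda, T = i) \le \frac{1}{\lambda}\sum_{i=1}^n E(1_{\{T=i\}} M_i) \\
&\le \frac{1}{\lambda}\sum_{i=1}^n E(1_{\{T=i\}} M_n) = \frac{1}{\lambda}\,E\bigl(M_n 1_{\bigcup_{i=1}^n \{T=i\}}\bigr) = \frac{1}{\lambda}\int_{\{\max_i M_i \ge \lambda\}} M_n\,dP.
\end{aligned}
\]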
\[ E\Bigl(\max_{0\le i\le n} |M_i|^p\Bigr) \le \Bigl(\frac{p}{p-1}\Bigr)^p E(|M_n|^p). \]
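\[
\begin{aligned}
E(U^p) = \int_0^\infty P(U^p > c)\,dc &= \int_0^\infty P(U > c^{1/p})\,dc \le \int_0^\infty c^{-1/p} \int_{\{U \ge c^{1/p}\}} V\,dP\,dc \\
&= \int V \int_0^{U^p} c^{-1/p}\,dc\,dP = \frac{p}{p-1}\,E(VU^{p-1}).
\end{aligned}
\]
Therefore
\[ \|U\|_p^p \le \frac{p}{p-1}\,\|V\|_p\,\|U^{p-1}\|_{p/(p-1)} = \frac{p}{p-1}\,\|V\|_p\,\|U\|_p^{p-1}, \]
and hence $\|U\|_p \le \frac{p}{p-1}\,\|V\|_p$.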
Chapter 14
Martingale convergence
Martingales (and sub- resp. super-martingales) are useful both for computing expectations
and because of their good asymptotic behaviour. This chapter deals with the second issue.
14.1
Definition 14.1 Let $x_1, x_2, \dots, x_n$ be a finite sequence of real numbers and let $a < b$. The number of upcrossings of the interval $(a,b)$ by $(x_i)$ is defined as the maximal number $k$ for which there are indices $i_1 < \dots < i_{2k}$ such that $x_{i_{2l-1}} \le a$ and $x_{i_{2l}} \ge b$ for all $l = 1,\dots,k$.
Theorem 14.2 Let $(M_i)_{1\le i\le n}$ be a submartingale and let $a < b$. Then the number $N(a,b)$ of upcrossings of $(a,b)$ by $(M_i)_{1\le i\le n}$ satisfies
\[ E(N(a,b)) \le \frac{1}{b-a}\,E\bigl((M_n - a)^+\bigr). \]
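The number of upcrossings in Definition 14.1 can be computed by a single pass through the sequence: wait for a value at most $a$, then for a value at least $b$, and repeat. This greedy scan attains the maximal $k$ of the definition. The function below is a minimal sketch with an illustrative name.

```python
def count_upcrossings(xs, a, b):
    """Number of upcrossings of the interval (a, b) by the finite sequence xs, for a < b."""
    count = 0
    below = False  # True once a value <= a has been seen since the last completed upcrossing
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count


print(count_upcrossings([0, 2, 0, 2, 1, 0, 3], 0.5, 1.5))
print(count_upcrossings([2, 1, 0], 0.5, 1.5))
```

Values strictly between $a$ and $b$ never change the state, which is why oscillations inside the interval are not counted.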
Proof. Let $M_i' = \max(M_i, a)$. Then $(M_i')$ is a submartingale by Theorem 13.3, since the function $x \mapsto \max(x, a)$ is convex and increasing. If the assertion holds for $(M_i')$ then also for $M$. Therefore we may assume $M_i \ge a$. Now define $T_0 = 0$ and recursively
\[ T_{2k-1} = \begin{cases} \min\{i \ge T_{2k-2} : M_i = a\} \\ n & \text{if no such } i \text{ exists,} \end{cases} \qquad T_{2k} = \begin{cases} \min\{i \ge T_{2k-1} : M_i \ge b\} \\ n & \text{if no such } i \text{ exists.} \end{cases} \]
Then $0 < T_1 \le \dots \le T_n$. This sequence of random times is strictly increasing until it becomes stationary. Since the values are integers between 1 and $n$ we have necessarily $T_{n+1} = T_n = n$. Therefore
\[ M_n = M_{T_1} + \sum_{k=1}^{[n/2]} \bigl(M_{T_{2k}} - M_{T_{2k-1}}\bigr) + \sum_{k=1}^{[n/2]} \bigl(M_{T_{2k+1}} - M_{T_{2k}}\bigr). \]
Each upcrossing corresponds to a unique term in the first sum. Moreover every term in the first sum is non-negative. Thus the first sum is $\ge (b-a)N(a,b)$. The second sum has nonnegative expectation by Remark 13.16, since the $T_i$ are stopping times. Taking expectations we get
\[ (b-a)\,E\bigl(N(a,b)\bigr) \le E\Bigl(\sum_{k=1}^{[n/2]} \bigl(M_{T_{2k}} - M_{T_{2k-1}}\bigr)\Bigr) \le E(M_n - M_{T_1}) \]
Remark 14.3 There are also maximal inequalities and upcrossing inequalities for supermartingales. The proofs are similar but slightly more difficult.
14.2
The following result is the key to the a.s. behaviour of martingales. It is formulated in such
a way that it can also be used in continuous time (which we do not cover in this course.)
Theorem 14.4 Let $U \subset \mathbb{R}$ be a countable set. Let $(\mathcal{F}_t)_{t\in U}$ be a filtration and let $(M_t)_{t\in U}$ be a submartingale with $\sup_{t\in U} E(M_t^+) < \infty$. Then there is a set $\Omega_0 \in \mathcal{A}$ such that $P(\Omega_0) = 1$ and for $\omega \in \Omega_0$ the following holds: For every monotone (decreasing or increasing) sequence $(t_n)_{n\in\mathbb{N}}$ in $U$ the limit $\lim_n M_{t_n}(\omega)$ exists in $[-\infty, \infty]$.
Proof. Since $U$ is countable we can find a sequence $(F_k)_{k=1,2,\dots}$ of finite subsets of $U$ such that $F_1 \subset F_2 \subset \dots$ and $U = \bigcup_{k=1}^\infty F_k$. Fix $a < b$. Let $N(a,b) = \sup_k N_{F_k}(a,b)$ where $N_{F_k}$ is the number of upcrossings of $(a,b)$ by $(M_i)_{i\in F_k}$. Then
\[ E(N(a,b)) = \sup_k E(N_{F_k}(a,b)) \le \frac{1}{b-a}\,\sup_{t\in U} E\bigl((M_t - a)^+\bigr) < \infty \]
since $E((M_t - a)^+) \le E(M_t^+) + |a|$. In particular the random variables $N(a,b)$ are a.s. finite. Now let
\[ \Omega_0 = \{\omega : N(a,b)(\omega) < \infty \text{ for all } a, b \in \mathbb{Q}\}. \]
Then $P(\Omega_0) = 1$. Fix $\omega \in \Omega_0$ and let $(t_n)$ be a monotone sequence in $U$. Assume that $\lim_n M_{t_n}(\omega)$ does not exist in $[-\infty, \infty]$. Then the sequence $(M_{t_n}(\omega))_n$ has at least two accumulation points. Therefore there exist two rational numbers $a < b$ such that the sequence $(M_{t_n}(\omega))$ has infinitely many upcrossings of $(a,b)$. Thus $N(a,b)(\omega) = +\infty$, so
by the Lemma of Fatou. Thus $M_\infty^+$ is finite almost surely. On the other hand the sequence $(E(M_n))_{n=1,2,\dots}$ is nondecreasing by the submartingale property and bounded from above by $\sup_n E(M_n^+)$. Therefore $\sup_n E(M_n^-) = \sup_n \bigl(E(M_n^+) - E(M_n)\bigr) < \infty$ and hence again using
Corollary 14.6 Every nonnegative supermartingale $(M_n)_{n\in\mathbb{N}}$ converges almost surely and $M_n \ge E(M_\infty|\mathcal{F}_n)$ for each $n$.
Proof. $(-M_n)$ is a nonpositive submartingale, so $\sup_n E((-M_n)^+) = 0 < \infty$. Hence $M_n$ converges almost surely to $M_\infty$. Moreover for each $n$ and every $A \in \mathcal{F}_n$ we get by Fatou's Lemma
\[ \int_A M_\infty\,dP = \int_A \lim_{m\to\infty} M_m\,dP \le \liminf_{m\to\infty} \int_A M_m\,dP \le \int_A M_n\,dP. \]
14.3
14.4
\[ \lim_{n\to\infty} 2^{-n} \sum_{k=1}^{2^n} f\bigl(s + k2^{-n} \bmod 1\bigr) = \int_0^1 f(t)\,dt. \]
Proof. We continue with Example 13.4, 6, i.e. we consider the $\sigma$-algebra
\[ \mathcal{F}_n = \{B \subset [0,1] : B \in \mathcal{B}([0,1]),\ B = B + 2^{-n} \bmod 1\} \]
of all Borel sets which are periodic with period $2^{-n}$. As was noted there, for each $n$ the average on the left-hand side is equal to $E(f|\mathcal{F}_n)(s)$. In view of Corollary 14.13 it suffices to verify that
\[ \mathcal{F}_n = \sigma\{X_{n+1}, X_{n+2}, \dots\} \]
where $X_n(s)$ is the $n$-th dyadic digit of $s$, because these $X_n$ are independent random variables on the Lebesgue probability space on $[0,1]$. In this equality of $\sigma$-algebras it is easy to see the inclusion "$\supset$". Conversely, due to periodicity every event $B \in \mathcal{F}_n$ is uniquely determined by the intersection $B \cap [0, 2^{-n})$, and this set is in the $\sigma$-algebra generated by the restriction of $X_{n+1}, X_{n+2}, \dots$ to $[0, 2^{-n})$ for the same reason why all dyadic intervals generate the Borel sets on $[0,1]$, cf. Example 14.11. This concludes the proof.
14.5
In this last section we deduce two different versions of the strong law of large numbers for
independent random variables from martingale theory. For the first version we use decreasing
martingale convergence and for the second version increasing martingale convergence. Both
versions were established originally by Kolmogorov using different proofs. These results gave
a strong motivation for the creation of martingale theory by Doob.
Theorem 14.15 Let $X_1, X_2, \dots$ be independent random variables. Then they satisfy the strong law of large numbers, i.e. almost surely
\[ \frac{1}{n}\sum_{i=1}^n \bigl(X_i - E(X_i)\bigr) \to 0, \]
under each of the following conditions:
1. The $X_i$ are in $L^1$ and identically distributed, or
2. The $X_i$ are in $L^2$ with
\[ \sum_{i=1}^\infty \frac{\mathrm{var}(X_i)}{i^2} < \infty. \]
Proof. In both cases we may and shall assume $E(X_i) = 0$. In the first case the idea is to find $\mathcal{F}_1 \supset \mathcal{F}_2 \supset \dots$ such that
\[ \frac{1}{n}\sum_{i=1}^n X_i = E(X_1|\mathcal{F}_n) \]
for each $n$. Then $\frac{1}{n}\sum_{i=1}^n X_i$ converges almost surely and in $L^1$ by Theorem 14.12. The limit random variable is measurable for the asymptotic $\sigma$-algebra of $X_1, X_2, \dots$ and hence almost surely equal to its expectation, which is equal to 0 by $L^1$-convergence, cf. the proof of Corollary 14.13.
A possible choice for $\mathcal{F}_n$ is
\[ \mathcal{F}_n = \sigma\{X_1 + \dots + X_n, X_{n+1}, X_{n+2}, \dots\}. \]
It is easy to see, as in our introductory Example 12.2, that
\[ \frac{1}{n}\sum_{i=1}^n X_i = E(X_1 | X_1 + \dots + X_n). \]
Then one can apply Lemma 14.16 below to $X = X_1$, $Y = X_1 + \dots + X_n$, and $Z = \{X_{n+1}, X_{n+2}, \dots\}$ to see that even $\frac{1}{n}\sum_{i=1}^n X_i = E(X_1|\mathcal{F}_n)$. This completes the first proof.
For the second part let $X_1, X_2, \dots$ be independent such that $\sum_{i=1}^\infty \frac{\mathrm{var}(X_i)}{i^2} < \infty$. Then $M_n = \sum_{i=1}^n \frac{X_i}{i}$ defines an $(\mathcal{F}_n)$-martingale where $\mathcal{F}_n = \sigma\{X_1,\dots,X_n\}$. This martingale is bounded in $L^2$ since
\[ \sup_n E(M_n^2) = \sup_n \sum_{i=1}^n \frac{\mathrm{var}(X_i)}{i^2} < \infty. \]
In particular $(M_n)$ is uniformly integrable (cf. Proposition 12.23) and $\lim_{n\to\infty} \sum_{i=1}^n \frac{X_i}{i}$ exists almost surely. Therefore the assertion follows from Kronecker's Lemma below.
Lemma 14.16 Let $X, Y, Z$ be three random variables (with possibly general state spaces) on the same probability space. If $Z$ is independent of $(X, Y)$ then $E(X|Y,Z) = E(X|Y)$.
Proof. Let $B \in \sigma(Y)$ and $C \in \sigma(Z)$ be given. Then $X 1_B$ and $E(X|Y) 1_B$ are independent of $Z$ and therefore
\[ E(X 1_{B\cap C}) = E(X 1_B)\,P(C) = E\bigl(E(X|Y) 1_B\bigr)\,P(C) = E\bigl(E(X|Y) 1_{B\cap C}\bigr). \]
Since the sets of the form $B \cap C$ generate $\sigma(Y,Z)$ the assertion follows.
xi =
n+1
X
i (si si1 ) =
i=1
i=1
Thus
i
for i n.
where ni = i+1
Pn n+1
the sum i=1 ni si , being an
n
X
n+1
X
n+1
i=1
xi =
n
X
ni si + sn+1
i=1
1
Now i=1 ni = n+1
1 and ni 0 for all i. Therefore
n+1
asymptotic average of the si , converges to s = limn sn , which
Pn
Chapter 15
Brownian motion
15.1
Introduction
Brownian motion is a stochastic process $(W_t)_t$ with independent stationary Gaussian increments and continuous paths. More explicitly:
Definition 15.1 A family $(W_t)_{0\le t\le T}$ with $T < \infty$ or $(W_t)_{0\le t<\infty}$ of real random variables on the same probability space is called a standard Brownian motion (Brownsche Bewegung), if it has the following properties:
a) $W_0 = 0$ a.s.: The process starts at 0.
b) For all indices $t_0 < t_1 < \dots < t_k$ the increments
\[ W_{t_1} - W_{t_0},\ W_{t_2} - W_{t_1},\ \dots,\ W_{t_k} - W_{t_{k-1}} \]
are independent.
c) For $s < t$ the increment $W_t - W_s$ has the law $N(0, t - s)$.
d) $P$-almost all paths $t \mapsto W_t(\omega)$ are continuous.
Remark Some authors require that all and not only P -almost all Brownian paths are
continuous. This does not make a big difference.
Brownian motion plays a fundamental role in probability theory because of both its rich structure and the fact that many other probabilistic objects and models can be studied via their relation to particular features of Brownian motion. It is not completely obvious that the properties a)-d) are compatible, in other words that a Brownian motion exists. In this chapter we give a construction of Brownian motion and show that the law of Brownian motion appears as a universal limit law of second order centered random walks (Donsker's invariance principle or functional central limit theorem).¹
15.2
For the sake of simpler notation we assume $T = 1$. Let us first suppose a Brownian motion $(W_t)_{0\le t\le 1}$ is already given. For $n \in \mathbb{N}$ we restrict the paths to the dyadic time points of order $n$, i.e. the points of the form $m2^{-n}$, $0 \le m \le 2^n$. Then
\[ W_{m2^{-n}} = \sum_{i=1}^m X_i^{2^{-n}} \quad\text{a.s.} \tag{15.1} \]
¹ To the experienced reader: The point of these notes is to arrive at Brownian motion and Donsker's theorem essentially simultaneously without the help of either Kolmogorov's consistency theorem or Prokhorov's theorem.
(15.2)
\[ X_i^{2^{-(n-1)}} = X_{2i-1}^{2^{-n}} + X_{2i}^{2^{-n}}. \tag{15.3} \]
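Proposition 15.3 Let $Z_1, Z_2, \dots$ be iid $N(0,1)$-variables. Define random variables $X_i^{2^{-n}}$, $n \in \mathbb{N}_0$, $1 \le i \le 2^n$ recursively by
\[ X_1^{1} = Z_1, \]
\[ X_{2i-1}^{2^{-n}} = \tfrac12 X_i^{2^{-(n-1)}} + \tfrac12\,2^{-(n-1)/2}\,Z_{2^{n-1}+i}, \tag{15.4} \]
\[ X_{2i}^{2^{-n}} = \tfrac12 X_i^{2^{-(n-1)}} - \tfrac12\,2^{-(n-1)/2}\,Z_{2^{n-1}+i}. \tag{15.5} \]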
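The recursion (15.4), (15.5) is easy to simulate. The sketch below generates the increments level by level and makes the consistency between levels visible: the two children of an increment always sum to it. It assumes the recursion in the form stated above; the function name, the seed handling and the number of levels are illustrative.

```python
import random


def dyadic_increments(levels, seed=0):
    """Return [X^(0), ..., X^(levels)], where X^(n) lists the 2^n increments of order n."""
    rng = random.Random(seed)
    # Z[1], ..., Z[2^levels] play the role of the iid N(0,1) variables; Z[0] is unused
    Z = [None] + [rng.gauss(0.0, 1.0) for _ in range(2**levels)]
    X = [[Z[1]]]  # level 0: the single increment over [0, 1]
    for n in range(1, levels + 1):
        prev = X[-1]
        scale = 0.5 * 2 ** (-(n - 1) / 2)
        cur = []
        for i in range(1, 2 ** (n - 1) + 1):
            noise = scale * Z[2 ** (n - 1) + i]
            cur.append(0.5 * prev[i - 1] + noise)  # increment number 2i - 1, as in (15.4)
            cur.append(0.5 * prev[i - 1] - noise)  # increment number 2i, as in (15.5)
        X.append(cur)
    return X


X = dyadic_increments(levels=6)
# the values of W at the dyadic points of order 6 are the partial sums of X[6]
W = [0.0]
for x in X[6]:
    W.append(W[-1] + x)
print(len(W), W[-1], X[0][0])
```

The last two printed numbers agree up to rounding: the endpoint value is the same at every level, which is the consistency that makes (15.1) well defined.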
Proof. Choose the probability space and the random variables $X_i^{2^{-n}}$ as in the Proposition. Let $t \in D$ be given. Choose $n \in \mathbb{N}$ such that $t$ is of order $n$, i.e. such that there is a number $m \in \{0,\dots,2^n\}$ with $t = m2^{-n}$. Then we can define $W_t$ by (15.1). The important point is that this definition does not depend on the choice of $n$ because of the relations (15.3).
Let now $t_0 < t_1 < \dots < t_k$ be dyadic times. Then there are some $n \in \mathbb{N}$ and numbers $m_0 < m_1 < \dots < m_k$ with $t_j = m_j 2^{-n}$ for $j = 0,\dots,k$. For each $j$ by (15.1) we have $W_{t_j} - W_{t_{j-1}} = \sum_{i=m_{j-1}+1}^{m_j} X_i^{2^{-n}}$. Here the terms are independent centered Gaussian and hence the same is true for the sum. The variance of the sum adds up to $(m_j - m_{j-1})2^{-n} = t_j - t_{j-1}$, i.e. the increments $W_{t_j} - W_{t_{j-1}}$ have laws $N(0, t_j - t_{j-1})$. Moreover they are mutually independent since they are built from disjoint parts of the sequence $X_1^{2^{-n}},\dots,X_{2^n}^{2^{-n}}$. This proves b) and c) at dyadic times.
Remark The main point of the above construction is that the consistency of the different levels can be achieved on a single probability space. Kolmogorov's consistency theorem (mentioned in the footnote) gives a more general method to deal with this kind of problem.
It now remains to show that the family $(W_t)$ in the Corollary can be extended to all $t \in [0,1]$ in such a way that the paths are continuous. So we need to show that the restricted paths $D \ni t \mapsto W_t(\omega)$ are uniformly continuous. This follows as a special case from a more general estimate which then yields also Donsker's theorem.
15.3
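From now on let $X_1, X_2, \dots$ be iid random variables with $E(X_i) = 0$ and $\sigma^2 = \mathrm{var}(X_i) < \infty$.
For each $n \in \mathbb{N}$ let $S_n = \sum_{i=1}^{n} X_i$. Define a process $Y^{(n)} = (Y^{(n)}_t)_{t \in [0,1]}$ by letting $Y^{(n)}_0 = 0$,
\[ Y^{(n)}_t = \frac{1}{\sigma\sqrt{n}}\, S_k \quad \text{if } t = \frac{k}{n}, \tag{15.6} \]
and by linear interpolation of these values for $t \in (\frac{k-1}{n}, \frac{k}{n})$ where $k \in \{1, \dots, n\}$.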
The key to the results in this chapter is to keep control of the fluctuations of the process
$Y^{(n)}$. The following concept is useful.
Definition 15.5 Let $f$ be a real function on a set $I \subset \mathbb{R}$. For each $\delta > 0$ define $w(\delta, f)$ by
\[ w(\delta, f) = \sup\{|f(s) - f(t)| : |s - t| < \delta,\ s, t \in I\}. \]
The function $\delta \mapsto w(\delta, f)$ is called the modulus of continuity of $f$. If $(Y_t)_{t \in I}$ is a stochastic
process on a probability space $(\Omega, \mathcal{A}, P)$ then $w(\delta, Y)$ denotes the map $\omega \mapsto w(\delta, Y(\omega))$.
A function $f$ obviously is uniformly continuous iff $\lim_{\delta \to 0} w(\delta, f) = 0$. If the process $Y$ has
continuous paths then $w(\delta, Y)$ is a random variable since the supremum in the definition
needs to be taken only over a countable dense subset of $I$.
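To get a feeling for Definition 15.5, one can evaluate the modulus of continuity of a rescaled random walk on its interpolation grid. This is a rough sketch only: the helper names are hypothetical, the steps are taken to be $\pm 1$ (so $\sigma = 1$), and the supremum is restricted to the grid points $k/n$, which gives a lower bound for $w(\delta, Y^{(n)})$ of the interpolated path.

```python
import math
import random

def rescaled_walk(steps, sigma):
    """Values S_k / (sigma * sqrt(n)) at the grid points k/n, k = 0..n."""
    n = len(steps)
    s, values = 0.0, [0.0]
    for x in steps:
        s += x
        values.append(s / (sigma * math.sqrt(n)))
    return values

def grid_modulus(values, delta):
    """sup |f(s) - f(t)| over grid points s, t in {0, 1/n, ..., 1} with |s - t| < delta."""
    n = len(values) - 1
    best = 0.0
    for i in range(n + 1):
        j = i + 1
        while j <= n and (j - i) / n < delta:
            best = max(best, abs(values[j] - values[i]))
            j += 1
    return best

rng = random.Random(3)
steps = [rng.choice((-1.0, 1.0)) for _ in range(500)]   # assumed step law, sigma = 1
path = rescaled_walk(steps, sigma=1.0)
for delta in (0.01, 0.05, 0.2):
    print(delta, grid_modulus(path, delta))
```

The printed values are nondecreasing in $\delta$, as the definition requires.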
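Lemma 15.6 Let $Z$ have the law $N(0, 1)$. Then for each $c > 0$
\[ P\{Z > c\} \le \frac{1}{c\sqrt{2\pi}}\, e^{-c^2/2}. \]
Proof. Since $x/c \ge 1$ for $x \ge c$, we have
\[ P\{Z > c\} = \frac{1}{\sqrt{2\pi}} \int_c^\infty e^{-x^2/2}\, dx \le \frac{1}{c\sqrt{2\pi}} \int_c^\infty x\, e^{-x^2/2}\, dx = \frac{1}{c\sqrt{2\pi}} \bigl(-e^{-x^2/2}\bigr)\Big|_c^\infty = \frac{1}{c\sqrt{2\pi}}\, e^{-c^2/2}. \tag{15.7} \]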
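The estimate of Lemma 15.6 is easy to check numerically. The sketch below (function names are hypothetical) computes the exact Gaussian tail via the complementary error function from the standard library and compares it with the bound.

```python
import math

def gaussian_tail(c):
    """P(Z > c) for Z ~ N(0,1), expressed through the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def tail_bound(c):
    """Right-hand side of Lemma 15.6: exp(-c^2/2) / (c * sqrt(2*pi))."""
    return math.exp(-0.5 * c * c) / (c * math.sqrt(2.0 * math.pi))

for c in (0.5, 1.0, 2.0, 4.0):
    print(c, gaussian_tail(c), tail_bound(c))
```

For small $c$ the bound is crude, while for large $c$ the two columns are close to each other, so little is lost in the regime used later.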
Here is the main result of this section. In the language of Prokhorov's Theorem (which we
do not want to use) this guarantees the uniform tightness of the sequence of laws of the
$Y^{(n)}$.
Proposition 15.7 Let $Y^{(n)}$ be defined as above (cf. (15.6)). For each $\varepsilon > 0$ one has
\[ \lim_{\delta \to 0}\, \limsup_{n \to \infty}\, P\bigl(w(\delta, Y^{(n)}) > \varepsilon\bigr) = 0. \tag{15.8} \]
Proof. Fix two numbers $\varepsilon > 0$ and $\delta > 0$. Choose $n$ large enough such that $n^{-1} < \delta$. Choose
$m \in \mathbb{N}$ such that $\delta < \frac{m}{n} < 2\delta$. Divide the interval $[0, 1]$ starting from the left into equal
intervals $I_r = [\frac{rm}{n}, \frac{(r+1)m}{n}]$ of length $\frac{m}{n}$ ($> \delta$) plus one possibly shorter interval near 1. For
the total number $R$ of these intervals we get the estimate
\[ R \le \delta^{-1} + 1. \tag{15.9} \]
Assume now $w(\delta, Y^{(n)}) > \varepsilon$. There are points $s < t$ such that $t - s < \delta$ with $|Y^{(n)}_t - Y^{(n)}_s| > \varepsilon$.
The maximal fluctuation of the path occurs at the interpolation points. The two points $s, t$
are either in the same $I_r$ or in two adjacent $I_r$-s. Thus there is an index $r$ such that the
interval $I_r \cup I_{r+1}$ contains $s$ and $t$ and therefore also two interpolation points at which the
values of $Y^{(n)}$ differ by more than $\varepsilon$. Then by the triangular inequality there is at least one
index $k$ in the range $[rm, (r + 2)m]$ such that
\[ \frac{1}{\sigma\sqrt{n}}\, |S_k - S_{rm}| = \bigl|Y^{(n)}_{k/n} - Y^{(n)}_{rm/n}\bigr| > \varepsilon/2. \tag{15.10} \]
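Hence the event $\{w(\delta, Y^{(n)}) > \varepsilon\}$ is contained in the set
\[ \bigcup_{r=0}^{R-1} \Bigl\{ \max_{0 \le j \le 2m} \Bigl| \sum_{i=rm+1}^{rm+j} X_i \Bigr| > \varepsilon\sigma\sqrt{n}/2 \Bigr\}. \]
Using (15.9), Lemma 15.8 below and the fact that the $X_i$ are iid its probability can be
estimated from above by
\[ (\delta^{-1} + 1)\, P\Bigl( \max_{0 \le j \le 2m} |S_j| > \varepsilon\sigma\sqrt{n}/2 \Bigr) \le 3\,(\delta^{-1} + 1) \max_{0 \le j \le 2m} P\bigl( |S_j| > \varepsilon\sigma\sqrt{n}/6 \bigr). \]
Let $j_n$ converge to $\infty$ such that nevertheless $j_n/n \to 0$. Then $\max_{j \le j_n} \mathrm{var}(S_j) = o(n)$ and
hence
\[ \max_{j \le j_n} P\bigl( |S_j| > \varepsilon\sigma\sqrt{n}/6 \bigr) \to 0 \quad (n \to \infty) \]
with the help of Chebyshev's inequality. For $j_n \le j \le 2m$ and $n$ sufficiently large we get by
the central limit theorem and the estimate (15.7) above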
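\begin{align*}
(\delta^{-1} + 1)\, P\bigl(|S_j| > \varepsilon\sigma\sqrt{n}/6\bigr)
&\le 2\delta^{-1}\, P\Bigl( \frac{1}{\sigma\sqrt{j}}\, |S_j| > \frac{\varepsilon}{6} \sqrt{\frac{n}{j}} \Bigr) \\
&\lesssim 2\delta^{-1}\, P\Bigl( |Z| > \frac{\varepsilon}{6} \sqrt{\frac{n}{2m}} \Bigr) \\
&\le 2\delta^{-1}\, P\Bigl( |Z| > \frac{\varepsilon}{12} \sqrt{\frac{1}{\delta}} \Bigr) \\
&\le 4\delta^{-1}\, \frac{1}{\sqrt{2\pi}}\, \frac{12\sqrt{\delta}}{\varepsilon}\, \exp\Bigl( -\frac{\varepsilon^2}{288\,\delta} \Bigr).
\end{align*}
Here $\lesssim$ indicates an inequality up to an error which vanishes as $n \to \infty$, the first step uses $\delta \le 1$, and the third step uses $\frac{n}{2m} > \frac{1}{4\delta}$.
So finally, $\limsup_{n \to \infty} P(\{w(\delta, Y^{(n)}) > \varepsilon\})$ is at most equal to 3 times the last expression
for all $\delta > 0$. Now observe that this expression converges to 0 as $\delta \to 0$ to conclude the
result.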
The following Lemma (cute, isn't it?) is a special case of a result in M. Loève's classical
textbook [5].
Lemma 15.8 Let $X_1, X_2, \dots, X_n$ be independent² random variables. Then for all $c > 0$
\[ P\bigl( \max_{1 \le k \le n} |S_k| > 3c \bigr) \le 3 \max_{1 \le k \le n} P(|S_k| > c). \]
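Proof. Let $p = \max_{1 \le k \le n} P(|S_k| > c)$. Let $T$ be the first time $k$ at which $|S_k| > 3c$, and
$T = n + 1$ if this never happens. Then
\begin{align*}
P\bigl( \max_{1 \le k \le n} |S_k| > 3c \bigr)
&= \sum_{k=1}^{n} P(\{T = k\}) \\
&\le \sum_{k=1}^{n} P(\{T = k\} \cap \{|S_n| \le c\}) + P(\{|S_n| > c\}) \\
&\le \sum_{k=1}^{n} P(\{T = k\} \cap \{|S_n - S_k| > 2c\}) + p \\
&= \sum_{k=1}^{n} P(\{T = k\})\, P(\{|S_n - S_k| > 2c\}) + p \\
&\le \sum_{k=1}^{n} P(\{T = k\})\, 2p + p \le 3p.
\end{align*}
The equality in the fourth line holds because the event $\{T = k\}$ depends only on $X_1, \dots, X_k$ and is therefore independent of $S_n - S_k$. The last line uses $P(|S_n - S_k| > 2c) \le P(|S_n| > c) + P(|S_k| > c) \le 2p$.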
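For a small simple symmetric random walk both sides of Lemma 15.8 can be computed exactly by enumerating all paths. The following sketch does this with exact rational arithmetic; the function name and the choice $n = 14$, $c = 4$ are illustrative assumptions, not part of the text.

```python
from fractions import Fraction
from itertools import product

def maximal_inequality_sides(n, c):
    """Exact P(max_k |S_k| > 3c) and 3 * max_k P(|S_k| > c) for k = 1..n, steps +-1."""
    total = 2 ** n
    exceed_max = 0          # number of paths with max_k |S_k| > 3c
    exceed_at = [0] * n     # exceed_at[k-1] = number of paths with |S_k| > c
    for steps in product((-1, 1), repeat=n):
        s, running_max = 0, 0
        for k, x in enumerate(steps):
            s += x
            running_max = max(running_max, abs(s))
            if abs(s) > c:
                exceed_at[k] += 1
        if running_max > 3 * c:
            exceed_max += 1
    lhs = Fraction(exceed_max, total)
    rhs = 3 * Fraction(max(exceed_at), total)
    return lhs, rhs

lhs, rhs = maximal_inequality_sides(14, 4)
print(lhs, "<=", rhs)
```

The left-hand side is tiny compared with the right-hand side here, which illustrates that the Lemma trades sharpness for generality: it needs nothing beyond independence.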
15.4
The random variables $w(\delta, W)$ are increasing in $\delta$ and since they converge in probability to
0 they even converge a.s. to 0 as $\delta \to 0$. But this means that $P$-almost all paths of $W$ are
uniformly continuous on the dyadic numbers and hence they have a unique extension to a
continuous function on $[0, 1]$. We also denote this extension by $W$. (On the measurable (!)
set of all $\omega$ where $w(\delta, W(\omega))$ does not converge to 0 we simply let $W_t(\omega) = 0$ for all $t$.)
Thus $W = (W_t)_{t \in [0,1]}$ is a stochastic process with continuous paths whose restriction to the
dyadic numbers has the properties b) and c) in the Definition of a Brownian motion. We
show that these properties stay valid for general times $t_j$, $0 \le j \le k$, in $[0, 1]$. Choose a
sequence $(t_j^r)_{r \in \mathbb{N}}$ of dyadic numbers such that $t_j = \lim_{r \to \infty} t_j^r$. Then by the continuity of
the paths we get
\[ (W_{t_0^r}, W_{t_1^r} - W_{t_0^r}, \dots, W_{t_k^r} - W_{t_{k-1}^r}) \xrightarrow[r \to \infty]{} (W_{t_0}, W_{t_1} - W_{t_0}, \dots, W_{t_k} - W_{t_{k-1}}). \tag{15.11} \]
2 Actually, if the $X_i$ are also identically distributed, a similar argument allows one to replace the 3 by 2 at both
places!
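Hence
\[ \exp\Bigl( -\frac{1}{2} \Bigl( t_0 u_0^2 + \sum_{j=1}^{k} (t_j - t_{j-1})\, u_j^2 \Bigr) \Bigr) = \lim_{r \to \infty} \exp\Bigl( -\frac{1}{2} \Bigl( t_0^r u_0^2 + \sum_{j=1}^{k} (t_j^r - t_{j-1}^r)\, u_j^2 \Bigr) \Bigr) \]
is the characteristic function of the limit in (15.11) which proves that the increments at
non-dyadic times also have the properties b) and c).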
Note that the piece-wise linear paths of $W^{(n)}$ in the above proof actually converge a.s. uniformly, so $(W_t)_{t \in [0,1]}$ alternatively can be defined as their uniform limit. Norbert Wiener in
his original proof constructed the Brownian paths as a uniform limit of random trigonometric
polynomials.
15.5
In this section we consider processes with continuous paths as random variables with values
in the metric space $C[0, 1]$ with the sup-norm. Donsker's Theorem states that in this sense
the processes $Y^{(n)}$ in section 15.3 converge in law to a Brownian motion.
Theorem 15.10 Let $X_1, X_2, \dots$ be iid random variables with $E(X_i) = 0$ and $\sigma^2 = \mathrm{var}(X_i) < \infty$.
For each $n \in \mathbb{N}$ define the process $Y^{(n)} = (Y^{(n)}_t)_{t \in [0,1]}$ as in (15.6) by linear interpolation
of the associated rescaled random walk. Then the laws $\mathcal{L}(Y^{(n)})$ on the space $C[0, 1]$ converge
weakly to the law of a Brownian motion.
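Donsker's Theorem can be illustrated by simulation with a continuous functional of the path. The sketch below is an illustration under explicit assumptions: it uses steps uniform on $[-1, 1]$ (variance $1/3$), the functional $u \mapsto \max_t u(t)$, and the standard reflection principle formula $P(\max_{t \le 1} W_t \le x) = 2\Phi(x) - 1$ for Brownian motion, which is not derived in these notes. All function names are hypothetical.

```python
import math
import random

def max_of_rescaled_walk(n, rng):
    """max_t Y^(n)_t for steps uniform on [-1, 1]; the maximum of the
    piecewise linear path is attained at an interpolation point."""
    sigma = math.sqrt(1.0 / 3.0)        # standard deviation of a uniform[-1, 1] step
    s, running_max = 0.0, 0.0           # running_max starts at Y_0 = 0
    for _ in range(n):
        s += rng.uniform(-1.0, 1.0)
        running_max = max(running_max, s)
    return running_max / (sigma * math.sqrt(n))

def estimate(n, trials, x, seed=1):
    """Monte Carlo estimate of P(max_t Y^(n)_t <= x)."""
    rng = random.Random(seed)
    hits = sum(max_of_rescaled_walk(n, rng) <= x for _ in range(trials))
    return hits / trials

def brownian_max_cdf(x):
    """2*Phi(x) - 1 = erf(x / sqrt(2)), assumed from the reflection principle."""
    return math.erf(x / math.sqrt(2.0))

print(estimate(200, 500, 1.0), brownian_max_cdf(1.0))
```

For moderate $n$ the simulated probability is already close to the Brownian value, although the step distribution is far from Gaussian.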
An important idea in the technical side of the proof is to switch from a function $u \in C[0, 1]$
to its linear interpolations and back. For a finite subset $F$ of $[0, 1]$ and a vector
$x = (x_t)_{t \in F} \in \mathbb{R}^F$ let $I^F(x) \in C[0, 1]$ be the function which one gets from $x$ by linear
interpolation. Conversely for $u \in C[0, 1]$ let $u|_F$ be the restriction of $u$ to $F$.
Lemma 15.11 Let $F$ be a finite subset of $[0, 1]$ which contains the two endpoints 0 and 1.
Let $\delta_F$ denote the maximal distance between two adjacent elements of $F$. Then for every
$u \in C[0, 1]$ one has
\[ \|u - I^F(u|_F)\|_\infty \le w(\delta_F, u). \tag{15.12} \]
In particular let $(F_n)$ be an increasing sequence of finite sets in $[0, 1]$ containing $\{0, 1\}$ whose
union is dense in $[0, 1]$. Then, for every $u \in C[0, 1]$, $u = \lim_{n \to \infty} I^{F_n}(u|_{F_n})$ uniformly, i.e. in the
topology of $C[0, 1]$.
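Proof. In each interval between two adjacent points $t_i, t_{i+1}$ of $F$ the value $I^F(u|_F)(t)$ is a convex combination of $u(t_i)$ and $u(t_{i+1})$, so we have
\[ |u(t) - I^F(u|_F)(t)| \le w(\delta_F, u). \]
Now take the supremum over all $t$ to get (15.12). If the $(F_n)$ are as indicated then $\delta_{F_n} \to 0$
and since every $u \in C[0, 1]$ is uniformly continuous we have $w(\delta_{F_n}, u) \to 0$ as $n \to \infty$. Thus
(15.12) gives the second assertion.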
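The estimate (15.12) can be observed numerically. In the following sketch everything is restricted to an evaluation grid $\{k/N\}$, which is an assumption made to keep the computation finite; the test function $u(t) = \sin(2\pi t)$, the dyadic sets $F_n$ and the helper names are illustrative choices.

```python
import math

N = 256  # evaluation grid {k/N : k = 0..N}

def interpolation_error(u_vals, nodes):
    """max over the grid of |u - I^F(u|_F)|; nodes are sorted grid indices containing 0 and N."""
    err = 0.0
    for a, b in zip(nodes, nodes[1:]):
        for k in range(a, b + 1):
            lam = (k - a) / (b - a)
            interp = (1 - lam) * u_vals[a] + lam * u_vals[b]   # linear interpolation
            err = max(err, abs(u_vals[k] - interp))
    return err

def grid_modulus(u_vals, gap):
    """sup |u(s) - u(t)| over grid points with |s - t| < gap / N (gap counted in grid steps)."""
    size = len(u_vals)
    return max(abs(u_vals[j] - u_vals[i])
               for i in range(size)
               for j in range(i, min(i + gap, size)))

u_vals = [math.sin(2 * math.pi * k / N) for k in range(N + 1)]
for level in range(1, 6):
    step = N >> level                     # F_n = dyadic points of order `level`
    nodes = list(range(0, N + 1, step))
    print(level, interpolation_error(u_vals, nodes), grid_modulus(u_vals, step))
```

In each printed row the interpolation error stays below the modulus of continuity at mesh size $\delta_{F_n}$, and both tend to 0 as the sets $F_n$ become finer.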
The space $C[0, 1]$ with the sup-norm is a metric space. It is well known by the Weierstrass
Approximation Theorem that it is separable, i.e. there is a countable dense subset.³ The
symbol $\mathcal{B}$ denotes the Borel $\sigma$-algebra of $C[0, 1]$. As a first application of Lemma 15.11 let us
clarify the obvious measurability question: Are the processes in the Theorem really random
variables in the measurable space $(C[0, 1], \mathcal{B})$?
3 If you do not know Weierstrass's Theorem, give a direct proof of the separability by the above
interpolation-approximation scheme.
First note that in every metric space $(M, d)$ the Borel $\sigma$-algebra $\mathcal{B}(M)$ is generated by
the continuous real functions on $M$, since every closed set $L$ can be written in the form
$\{f = 0\}$ where $f(x) = \mathrm{dist}(x; L)$. If $(X_n)$ is a sequence of $M$-valued random variables which
converges point-wise to the function $X$ then $f(X) = \lim_{n \to \infty} f(X_n)$ is a real random variable
for each continuous $f : M \to \mathbb{R}$ and thus $X$ is also an $M$-valued random variable.
Lemma 15.12 Let $(Y_t)_{t \in [0,1]}$ be a real valued process with continuous paths on the probability space $(\Omega, \mathcal{A}, P)$. Then the map $Y : \Omega \to C[0, 1]$ which maps $\omega$ to the function (path)
$[0, 1] \ni t \mapsto Y_t(\omega)$ is $\mathcal{A}$-$\mathcal{B}(C[0, 1])$-measurable.
Proof. Let $Y$ be our process with continuous paths. Let $(F_n)$ be a sequence of finite subsets
as in Lemma 15.11. Then for each $n$ the restriction $Y|_{F_n}$ is a random vector in $\mathbb{R}^{F_n}$. The
maps $I^{F_n} : \mathbb{R}^{F_n} \to C[0, 1]$ are continuous and hence the $I^{F_n}(Y|_{F_n})$ are $C[0, 1]$-valued random
variables. Thus by Lemma 15.11 $Y$ as their point-wise limit has the same property.
The statement of the next lemma is called convergence of the finite dimensional marginals.
Lemma 15.13 Let $t_0 < \dots < t_k$ be in $[0, 1]$. Then $(Y^{(n)}_{t_0}, \dots, Y^{(n)}_{t_k})$ converges for $n \to \infty$
in law to $(W_{t_0}, \dots, W_{t_k})$.
Proof. We consider the increments. In order to be able to consider the first component also
as an increment we may assume $t_0 = 0$. For $j \in \{0, \dots, k\}$ and $n \in \mathbb{N}$ let $m_n^j$ be the smallest
integer $m$ such that $\frac{m}{n} \ge t_j$. Then the $k$ sequences of random variables
\[ Y^{(n)}_{m_n^j/n} - Y^{(n)}_{m_n^{j-1}/n} = \frac{1}{\sigma\sqrt{n}} \bigl( S_{m_n^j} - S_{m_n^{j-1}} \bigr) = \sqrt{\frac{m_n^j - m_n^{j-1}}{n}}\; \frac{1}{\sigma\sqrt{m_n^j - m_n^{j-1}}} \sum_{i=m_n^{j-1}+1}^{m_n^j} X_i, \qquad j = 1, \dots, k, \]
are independent of each other and their laws converge by the central limit theorem to
the law of independent $N(0, t_j - t_{j-1})$-variables which are the increments of a Brownian
motion. Then the mapping principle of weak convergence, which says that this convergence
is preserved under continuous maps, implies that the vector $(Y^{(n)}_{m_n^0/n}, \dots, Y^{(n)}_{m_n^k/n})$ converges in
law to $(W_{t_0}, \dots, W_{t_k})$. The error which we introduce by switching from the indices $\frac{m_n^j}{n}$ to
the $t_j$ is asymptotically small.
Now we come to the proof of the Theorem. It suffices to show that
\[ \lim_{n \to \infty} E\bigl( \Phi(Y^{(n)}) \bigr) = E\bigl( \Phi(W) \bigr) \tag{15.13} \]
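holds if $\Phi$ is a bounded real Lipschitz functional on the space $C[0, 1]$. Let $L$ be its Lipschitz
constant and choose $K < \infty$ such that $|\Phi(u)| \le K$ for all $u \in C[0, 1]$. Let $F = \{t_0, \dots, t_k\} \subset [0, 1]$
be a finite set which starts with 0 and ends with 1. Moreover let $\Phi_F : C[0, 1] \to \mathbb{R}$ be
defined by
\[ \Phi_F(u) = (\Phi \circ I^F)(u|_F). \]
The function $\Phi \circ I^F$ is bounded and continuous on $\mathbb{R}^F$ and so by Lemma 15.13
\begin{align*}
\lim_{n \to \infty} E\bigl( \Phi_F(Y^{(n)}) \bigr) &= \lim_{n \to \infty} E\bigl( \Phi \circ I^F(Y^{(n)}_{t_0}, \dots, Y^{(n)}_{t_k}) \bigr) \\
&= E\bigl( \Phi \circ I^F(W_{t_0}, \dots, W_{t_k}) \bigr) = E\bigl( \Phi_F(W) \bigr).
\end{align*}
On the other hand by Lemma 15.11 and the Lipschitz property
\[ |\Phi_F(u) - \Phi(u)| = |\Phi(u) - \Phi(I^F(u|_F))| \le L\, \|u - I^F(u|_F)\|_\infty \le L\, w(\delta_F, u). \tag{15.14} \]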
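Let $\varepsilon > 0$ be given.
Choose $\delta > 0$ such that $\limsup_{n \to \infty} P\bigl( w(\delta, Y^{(n)}) > \varepsilon \bigr) < \varepsilon$. Similarly
$P\bigl( w(\delta, W) > \varepsilon \bigr) < \varepsilon$ for sufficiently small $\delta$. Choose $F$ such that $\delta_F < \delta$. By (15.14) the
difference $|\Phi_F(u) - \Phi(u)|$ is at most $L\, w(\delta, u)$, and it never exceeds $2K$. Splitting according to
whether $w(\delta, Y^{(n)}) \le \varepsilon$ or not, we get for sufficiently large $n$
\[ \bigl| E\bigl( \Phi_F(Y^{(n)}) - \Phi(Y^{(n)}) \bigr) \bigr| \le L\varepsilon + 2K\, P\bigl( w(\delta, Y^{(n)}) > \varepsilon \bigr) < \varepsilon (L + 2K). \]
By the same estimate
\[ \bigl| E\bigl( \Phi_F(W) - \Phi(W) \bigr) \bigr| < \varepsilon (L + 2K). \]
Since $\varepsilon > 0$ is arbitrary we can deduce the claim (15.13) from (15.14). This finishes the
proof.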
15.6
In the famous chapter III of the first volume of Feller's book the following asymptotic result
is derived by completely elementary combinatorial methods. It says that the probability that
in $n$ fair games the fraction of time spent with a cumulative win is at most $x$ ($0 < x < 1$)
approaches a number which is surprisingly large even for small $x$.
Theorem 15.14 (Arcsin law for the simple symmetric random walk) Let $X_1, X_2, \dots$ be an
iid sequence of random variables with $P(X_i = \pm 1) = \frac{1}{2}$. Let $N_n$ be the number of indices
$k \le n$ such that $S_k > 0$. Then for every $x \in [0, 1]$
\[ \lim_{n \to \infty} P\Bigl( \frac{N_n}{n} < x \Bigr) = \frac{1}{\pi} \int_0^x \frac{dt}{\sqrt{t(1 - t)}} = \frac{2}{\pi} \arcsin \sqrt{x}. \tag{15.15} \]
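The arcsin law is easy to observe in a simulation. The following sketch estimates $P(N_n/n < x)$ for the simple symmetric random walk and compares it with the limit in (15.15); the function names, the seed and the sample sizes are arbitrary illustrative choices, and for finite $n$ a small deviation from the limit is to be expected.

```python
import math
import random

def positive_fraction(n, rng):
    """N_n / n for one simple symmetric random walk with n steps."""
    s, positive = 0, 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        if s > 0:
            positive += 1
    return positive / n

def empirical_cdf(n, trials, x, seed=2):
    """Monte Carlo estimate of P(N_n / n < x)."""
    rng = random.Random(seed)
    return sum(positive_fraction(n, rng) < x for _ in range(trials)) / trials

def arcsine_cdf(x):
    """Limit in (15.15): (2 / pi) * arcsin(sqrt(x))."""
    return 2.0 / math.pi * math.asin(math.sqrt(x))

for x in (0.1, 0.5, 0.9):
    print(x, empirical_cdf(200, 500, x), arcsine_cdf(x))
```

Note how much mass sits near the ends: according to the limit law the walk spends less than 10 percent of the time on the positive side with probability of about one fifth.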
We want to show that Donsker's Theorem implies that then the same must hold for every
centered second order random walk.
Corollary 15.15 The relation (15.15) holds for every iid sequence $(X_i)$ with $E(X_i) = 0$
and $\mathrm{var}(X_i) < \infty$. Moreover if $(W_t)_{t \in [0,1]}$ is a Brownian motion, then
\[ P\bigl( \lambda\{t \in [0, 1] : W_t > 0\} < x \bigr) = \frac{2}{\pi} \arcsin \sqrt{x}. \tag{15.16} \]
Lemma 15.16 Let $(u_m)$ be a sequence of measurable functions on $[0, 1]$ which converges uniformly
(or just point-wise) to some $u$ with $\lambda\{t : u(t) = 0\} = 0$, where $\lambda$ denotes Lebesgue measure. Then
\[ \lim_{m \to \infty} \lambda\{t : u_m(t) > 0\} = \lambda\{t : u(t) > 0\}. \tag{15.17} \]
Proof. We look at the $u_m$ and $u$ as real random variables on the Lebesgue probability space
on $[0, 1]$. Then $u_m$ converges point-wise and hence in law to $u$. Let $F_m$ and $F$ denote the
distribution functions of $u_m$ and $u$, respectively. By assumption $F$ is continuous at 0, hence
\[ \lambda\{t : u_m(t) > 0\} = 1 - F_m(0) \xrightarrow[m \to \infty]{} 1 - F(0) = \lambda\{t : u(t) > 0\}. \]
Proof. We consider the set $G = \{(t, u) \in [0, 1] \times C[0, 1] : u(t) = 0\}$. The function $(t, u) \mapsto u(t)$
is continuous on the product space. Hence $G$ is closed in the product topology. Thus it
is product measurable. Since for each $t > 0$ the r.v. $W_t$ has a nondegenerate Gaussian law we get from Fubini's
Theorem
\[ E\bigl( \lambda\{t : W(t) = 0\} \bigr) = E\Bigl( \int_0^1 1_G(t, W)\, dt \Bigr) = \int_0^1 E\bigl( 1_G(t, W) \bigr)\, dt = \int_0^1 N(0, t)(\{0\})\, dt = 0. \]
These two Lemmas show that the functional $u \mapsto \lambda\{t : u(t) > 0\}$ is continuous at all points $u \in A \subset C[0, 1]$,
where $A$ is the set of functions which spend no time at 0, and that $P(W \in A) = 1$. This
implies by the extended mapping principle and Donsker's theorem that also the random
variables $\lambda\{t : Y^{(n)}_t > 0\}$ converge in law to $\lambda\{t : W_t > 0\}$. Since for the simple random walk the random
variables $\lambda\{t : Y^{(n)}_t > 0\}$ have the arcsin-distribution as limit law, the latter must be the law of
$\lambda\{t : W_t > 0\}$. This proves the second statement of the Corollary. A second application of Donsker's
Theorem shows that it is also the limit law for other second order random walks. Note that
the size of $\sigma$ does not enter.
Bibliography
[1] W. Feller. An introduction to probability theory and its applications. Vol. II. John Wiley
& Sons, New York, 2nd edition, 1971.
[2] J. Jacod and P. Protter. Probability Essentials. Springer, Berlin - Heidelberg etc., 2.
edition, 2002.
[3] A. Klenke. Wahrscheinlichkeitstheorie. Springer, Berlin - Heidelberg, 2006.
[4] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin,
1933.
[5] M. Loève. Probability Theory I. Springer Verlag, New York - Heidelberg - Berlin, 4th
edition, 1977.
[6] P. Mörters and H. von Weizsäcker. Stochastische Methoden, 3. Auflage. TU Kaiserslautern, http://www.mathematik.uni-kl.de/w-theorie/lehre/skripte, 2010.
[7] A.N. Shiryaev. Probability. Springer, New York, 1996.
[8] H. v. Weizsäcker. Basic Measure Theory. TU Kaiserslautern, http://www.mathematik.uni-kl.de/w-theorie/lehre/skripte, 2008.
[9] D. Williams. Weighing the Odds. A Course in Probability and Statistics. Cambridge
University Press, New York, 2001.