, consisting of all allowed events, that is, those events to which we shall assign probabilities. We next define the structural conditions imposed on $\mathcal{F}$.
Definition 1.1.1. We say that $\mathcal{F} \subseteq 2^{\Omega}$ is a $\sigma$-field (or a $\sigma$-algebra) if (a) $\Omega \in \mathcal{F}$, (b) $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$, and (c) $A_i \in \mathcal{F}$ for $i = 1, 2, \ldots$ implies $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
Remark. Using De Morgan's law you can easily check that if $A_i \in \mathcal{F}$ for $i = 1, 2, \ldots$ and $\mathcal{F}$ is a $\sigma$-field, then also $\bigcap_i A_i \in \mathcal{F}$. Similarly, you can show that a $\sigma$-field is closed under countably many elementary set operations.
1. PROBABILITY, MEASURE AND INTEGRATION
Definition 1.1.2. A pair $(\Omega, \mathcal{F})$ with $\mathcal{F}$ a $\sigma$-field of subsets of $\Omega$ is called a measurable space. Given a measurable space, a probability measure $P$ is a function $P : \mathcal{F} \to [0, 1]$, having the following properties:
(a) $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$.
(b) $P(\Omega) = 1$.
(c) (Countable additivity) $P(A) = \sum_{n=1}^{\infty} P(A_n)$ whenever $A = \bigcup_{n=1}^{\infty} A_n$ is a countable union of disjoint sets $A_n \in \mathcal{F}$ (that is, $A_n \cap A_m = \emptyset$ for all $n \ne m$).
A probability space is a triplet $(\Omega, \mathcal{F}, P)$, with $P$ a probability measure on the measurable space $(\Omega, \mathcal{F})$.
The next exercise collects some of the fundamental properties shared by all probability measures.
Exercise 1.1.3. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $A, B, A_i$ events in $\mathcal{F}$. Prove the following properties of every probability measure.
(a) Monotonicity. If $A \subseteq B$ then $P(A) \le P(B)$.
(b) Sub-additivity. If $A \subseteq \bigcup_i A_i$ then $P(A) \le \sum_i P(A_i)$.
(c) Continuity from below: If $A_i \uparrow A$, that is, $A_1 \subseteq A_2 \subseteq \cdots$ and $\bigcup_i A_i = A$, then $P(A_i) \uparrow P(A)$.
(d) Continuity from above: If $A_i \downarrow A$, that is, $A_1 \supseteq A_2 \supseteq \cdots$ and $\bigcap_i A_i = A$, then $P(A_i) \downarrow P(A)$.
(e) Inclusion-exclusion rule:
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$
The $\sigma$-field $\mathcal{F}$ always contains at least the set $\Omega$ and its complement, the empty set $\emptyset$. Necessarily, $P(\Omega) = 1$ and $P(\emptyset) = 0$. So, if we take $\mathcal{F}_0 = \{\emptyset, \Omega\}$ as our $\sigma$-field, then we are left with no degrees of freedom in the choice of $P$. For this reason we call $\mathcal{F}_0$ the trivial $\sigma$-field.
Fixing $\Omega$, we may expect that the larger the $\sigma$-field we consider, the more freedom we have in choosing the probability measure. This indeed holds to some extent, that is, as long as we have no problem satisfying the requirements (a)-(c) in the definition of a probability measure. For example, a natural question is: when should we expect the maximal possible $\sigma$-field $\mathcal{F} = 2^{\Omega}$ to be useful?
Example 1.1.4. When the sample space $\Omega$ is countable we can and typically shall take $\mathcal{F} = 2^{\Omega}$. Indeed, in this case it suffices to assign a probability $p_\omega > 0$ to each $\omega \in \Omega$, making sure that $\sum_{\omega \in \Omega} p_\omega = 1$, and to set $P(A) = \sum_{\omega \in A} p_\omega$ for every $A \subseteq \Omega$. For instance, when $\Omega = \{0, 1, 2, \ldots\}$ we get the Poisson probability measure of parameter $\lambda > 0$ when starting from
$$p_k = \frac{\lambda^k}{k!} e^{-\lambda} \qquad \text{for } k = 0, 1, 2, \ldots.$$
1.1. PROBABILITY SPACES AND σ-FIELDS
When $\Omega$ is uncountable such a strategy as in Example 1.1.4 will no longer work. The problem is that if we assign $p_\omega = P(\{\omega\}) > 0$ to uncountably many values of $\omega$, the total probability is infinite, so at most countably many points can carry positive mass, in which case $P(A)$ is determined by $P(A \cap \widehat{\Omega})$ for some countable $\widehat{\Omega}$ and each $A \subseteq \Omega$. Excluding such trivial cases, to genuinely use an uncountable sample space we need to restrict our $\sigma$-field $\mathcal{F}$ to a strict subset of $2^{\Omega}$.
1.1.2. Generated and Borel σ-fields. Enumerating the sets in the $\sigma$-field $\mathcal{F}$ is not a realistic option for uncountable $\Omega$. Instead, as we see next, the most common construction of $\sigma$-fields is by implicit means. That is, we demand that certain sets (called the generators) be in our $\sigma$-field, and take the smallest possible collection for which this holds.
Definition 1.1.5. Given a collection of subsets $A_\alpha \subseteq \Omega$, where $\alpha \in \Gamma$ is a not necessarily countable index set, we denote the smallest $\sigma$-field $\mathcal{F}$ such that $A_\alpha \in \mathcal{F}$ for all $\alpha \in \Gamma$ by $\sigma(\{A_\alpha\})$, and call $\sigma(\{A_\alpha\})$ the $\sigma$-field generated by the collection $\{A_\alpha\}$. That is,
$$\sigma(\{A_\alpha\}) = \bigcap \big\{ \mathcal{G} : \mathcal{G} \subseteq 2^{\Omega} \text{ is a } \sigma\text{-field}, \ A_\alpha \in \mathcal{G} \ \forall \alpha \in \Gamma \big\}.$$
Definition 1.1.5 works because the intersection of (possibly uncountably many) $\sigma$-fields is also a $\sigma$-field, which you will verify in the following exercise.
Exercise 1.1.6. Let $\mathcal{F}_\alpha$, $\alpha \in \Gamma$, be a (possibly uncountable) collection of $\sigma$-fields of subsets of $\Omega$. Show that $\bigcap_{\alpha \in \Gamma} \mathcal{F}_\alpha$ is also a $\sigma$-field.

Example 1.1.7. An important example of a measurable space is $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B} = \sigma(\{(a, b) : a < b \in \mathbb{R}\})$ is the Borel $\sigma$-field, the smallest $\sigma$-field containing all open intervals of $\mathbb{R}$.

Lemma 1.1.8. Suppose two collections $\{A_\alpha\}$ and $\{B_\beta\}$ of subsets of $\Omega$ are such that $\{A_\alpha\} \subseteq \sigma(\{B_\beta\})$ and $\{B_\beta\} \subseteq \sigma(\{A_\alpha\})$. Then, $\sigma(\{A_\alpha\}) = \sigma(\{B_\beta\})$.
Proof. Recall that if a collection of sets $\mathcal{H}$ is a subset of a $\sigma$-field $\mathcal{G}$, then by Definition 1.1.5 also $\sigma(\mathcal{H}) \subseteq \mathcal{G}$. Applying this for $\mathcal{H} = \{A_\alpha\}$ and $\mathcal{G} = \sigma(\{B_\beta\})$, our assumption that $\{A_\alpha\} \subseteq \sigma(\{B_\beta\})$ implies that $\sigma(\{A_\alpha\}) \subseteq \sigma(\{B_\beta\})$. Similarly, our assumption that $\{B_\beta\} \subseteq \sigma(\{A_\alpha\})$ implies that $\sigma(\{B_\beta\}) \subseteq \sigma(\{A_\alpha\})$. Taken together, we see that $\sigma(\{A_\alpha\}) = \sigma(\{B_\beta\})$. $\square$
For instance, considering $\mathcal{B}_{\mathbb{Q}} = \sigma(\{(a, b) : a, b \in \mathbb{Q}\})$, we have by the preceding lemma that $\mathcal{B}_{\mathbb{Q}} = \mathcal{B}$ as soon as we show that any interval $(a, b)$ is in $\mathcal{B}_{\mathbb{Q}}$. To verify this fact, note that for any real $a < b$ there are rational numbers $q_n < r_n$ such that $q_n \downarrow a$ and $r_n \uparrow b$, hence $(a, b) = \bigcup_n (q_n, r_n) \in \mathcal{B}_{\mathbb{Q}}$. Following the same approach, you are to establish next a few alternative definitions of the Borel $\sigma$-field $\mathcal{B}$.
Exercise 1.1.9. Verify the alternative definitions of the Borel $\sigma$-field $\mathcal{B}$:
$$\sigma(\{(a,b) : a < b \in \mathbb{R}\}) = \sigma(\{[a,b] : a < b \in \mathbb{R}\}) = \sigma(\{(-\infty, b] : b \in \mathbb{R}\}) = \sigma(\{(-\infty, b] : b \in \mathbb{Q}\}) = \sigma(\{O \subseteq \mathbb{R} \text{ open}\})$$
Hint: Any open $O \subseteq \mathbb{R}$ is a countable union of sets $(a, b)$ with $a, b \in \mathbb{Q}$ (rational).
If $A \subseteq \mathbb{R}$ is in $\mathcal{B}$ of Example 1.1.7, we say that $A$ is a Borel set. In particular, all open or closed subsets of $\mathbb{R}$ are Borel sets, as are many other sets. However,

Proposition 1.1.10. There exists a subset of $\mathbb{R}$ that is not in $\mathcal{B}$. That is, not all subsets of $\mathbb{R}$ are Borel sets.
Despite the above proposition, all sets encountered in practice are Borel sets. Often there is no explicit enumerative description of the $\sigma$-field generated by an infinite collection of subsets. A notable exception is $\mathcal{G} = \sigma(\{[a, b] : a, b \in \mathbb{Z}\})$, where one may check that the sets in $\mathcal{G}$ are all possible unions of elements from the countable collection $\{\{b\}, (b, b + 1), b \in \mathbb{Z}\}$. In particular, $\mathcal{B} \ne \mathcal{G}$ since, for example, $(0, 1/2) \notin \mathcal{G}$.
Example 1.1.11. One example of a probability measure defined on $(\mathbb{R}, \mathcal{B})$ is the uniform probability measure on $(0, 1)$, denoted $U$ and defined as follows. For each interval $(a, b) \subseteq (0, 1)$, $a < b$, we set $U((a, b)) = b - a$ (the length of the interval), and for any other open interval $I$ we set $U(I) = U(I \cap (0, 1))$.
Note that we did not specify $U(A)$ for each Borel set $A$, but rather only for the generators of the Borel $\sigma$-field $\mathcal{B}$. This is a common strategy, as under mild conditions on the collection of generators, the values assigned to the generators uniquely determine the measure on the generated $\sigma$-field.

Example 1.1.13. Another classical measurable space is the infinite coin-toss space $\Omega_\infty = \Omega_1^{\mathbb{N}}$ for $\Omega_1 = \{H, T\}$ (recall that setting $H = 1$ and $T = 0$, each infinite sequence of coin tosses is a point of $\Omega_\infty$), equipped with the $\sigma$-field $\mathcal{F}_c$ generated by the cylinder sets, that is, the events determined by the outcomes of finitely many tosses (e.g. $A_{1,H}$ is the set of all sequences starting with $H$ and $A_{2,TT}$ are all sequences starting with a pair of $T$ symbols). This is also our first example of a stochastic process, to which we return in the next section.
Note that any countable union of sets of probability zero has probability zero, but this is not the case for an uncountable union. For example, $U(\{x\}) = 0$ for every $x \in \mathbb{R}$, but $U(\mathbb{R}) = 1$. When we later deal with continuous-time stochastic processes we should pay attention to such difficulties!
1.2. Random variables and their expectation
Random variables are numerical functions $X(\omega)$ of the outcome of our random experiment. However, in order to have a successful mathematical theory, we limit our interest to the subset of measurable functions, as defined in Subsection 1.2.1, and study some of their properties in Subsection 1.2.2. Taking advantage of these, we define the mathematical expectation in Subsection 1.2.3 as the corresponding Lebesgue integral, and relate it to the more elementary definitions that apply for simple functions and for random variables having a probability density function.
1.2.1. Indicators, simple functions and random variables. We start with the definition of a random variable and two important examples of such objects.

Definition 1.2.1. A Random Variable (R.V.) is a function $X : \Omega \to \mathbb{R}$ such that for all $\alpha \in \mathbb{R}$ the set $\{\omega : X(\omega) \le \alpha\}$ is in $\mathcal{F}$ (such a function is also called an $\mathcal{F}$-measurable or, simply, measurable function).
Example 1.2.2. For any $A \in \mathcal{F}$ the function
$$I_A(\omega) = \begin{cases} 1, & \omega \in A \\ 0, & \omega \notin A \end{cases}$$
is a R.V., since
$$\{\omega : I_A(\omega) \le \alpha\} = \begin{cases} \Omega, & \alpha \ge 1 \\ A^c, & 0 \le \alpha < 1 \\ \emptyset, & \alpha < 0 \end{cases}$$
all of which are in $\mathcal{F}$. We call such a R.V. an indicator function.
Example 1.2.3. By the same reasoning, check that $X(\omega) = \sum_{n=1}^{N} c_n I_{A_n}(\omega)$ is a R.V. for any finite $N$, non-random $c_n \in \mathbb{R}$ and sets $A_n \in \mathcal{F}$. We call any such $X$ a simple function, denoted by $X \in \mathrm{SF}$.
Exercise 1.2.4. Verify the following properties of indicator R.V.s.
(a) $I_\emptyset(\omega) = 0$ and $I_\Omega(\omega) = 1$
(b) $I_{A^c}(\omega) = 1 - I_A(\omega)$
(c) $I_A(\omega) \le I_B(\omega)$ if and only if $A \subseteq B$
(d) $I_{\bigcap_i A_i}(\omega) = \prod_i I_{A_i}(\omega)$
(e) If $A_i$ are disjoint then $I_{\bigcup_i A_i}(\omega) = \sum_i I_{A_i}(\omega)$
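Since these identities are pointwise, they can be checked exhaustively on any small finite $\Omega$; a sketch with arbitrary choices of $\Omega$, $A$, $B$ and the disjoint sets:

```python
# Check properties (b)-(e) of Exercise 1.2.4 by brute force on a small sample space.
Omega = set(range(8))
A = {0, 1, 2}
B = {0, 1, 2, 5}
disjoint = [{0, 1}, {2, 3}, {6}]

def ind(S, w):
    # indicator function I_S(w)
    return 1 if w in S else 0

for w in Omega:
    # (b) indicator of the complement
    assert ind(Omega - A, w) == 1 - ind(A, w)
    # (d) indicator of an intersection is the product of the indicators
    assert ind(A & B, w) == ind(A, w) * ind(B, w)
    # (e) for disjoint sets, the indicator of the union is the sum
    assert ind(set().union(*disjoint), w) == sum(ind(S, w) for S in disjoint)

# (c) pointwise domination of indicators is equivalent to containment
assert all(ind(A, w) <= ind(B, w) for w in Omega) == (A <= B)
print("ok")
```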
Though in our definition of a R.V. the $\sigma$-field $\mathcal{F}$ is implicit, the choice of $\mathcal{F}$ is very important (and we sometimes denote by $m\mathcal{F}$ the collection of all R.V.s for a given $\sigma$-field $\mathcal{F}$). For example, there are non-trivial $\sigma$-fields $\mathcal{G}$ and $\mathcal{F}$ on $\Omega = \mathbb{R}$ such that $X(\omega) = \omega$ is measurable for $(\Omega, \mathcal{F})$, but not measurable for $(\Omega, \mathcal{G})$. Indeed, one such example is when $\mathcal{F}$ is the Borel $\sigma$-field $\mathcal{B}$ and $\mathcal{G} = \sigma(\{[a, b] : a, b \in \mathbb{Z}\})$ (for example, the set $\{\omega : \omega \le \alpha\}$ is not in $\mathcal{G}$ whenever $\alpha \notin \mathbb{Z}$). To practice your understanding, solve the following exercise at this point.

Exercise 1.2.5. Let $\Omega = \{1, 2, 3\}$. Find a $\sigma$-field $\mathcal{F}$ such that $(\Omega, \mathcal{F})$ is a measurable space, and a mapping $X$ from $\Omega$ to $\mathbb{R}$, such that $X$ is not a random variable on $(\Omega, \mathcal{F})$.
Our next proposition explains why simple functions are quite useful in probability theory.

Proposition 1.2.6. For every R.V. $X(\omega)$ there exists a sequence of simple functions $X_n(\omega)$ such that $X_n(\omega) \to X(\omega)$ as $n \to \infty$, for each fixed $\omega$.
Proof. Let
$$f_n(x) = n \, 1_{x > n} + \sum_{k=0}^{n2^n - 1} k 2^{-n} \, 1_{(k2^{-n}, (k+1)2^{-n}]}(x),$$
Figure 1. Illustration of the approximation of a random variable $X(\omega)$ by the simple functions $X_n = f_n(X)$ for $n = 1, 2, 3, 4$.
noting that for a R.V. $X \ge 0$, the $X_n = f_n(X)$ are simple functions. Since $X \ge X_{n+1} \ge X_n$ and $X(\omega) - X_n(\omega) \le 2^{-n}$ whenever $X(\omega) \le n$, it follows that $X_n(\omega) \uparrow X(\omega)$ as $n \to \infty$, for each $\omega$.
We write a general R.V. as $X(\omega) = X_+(\omega) - X_-(\omega)$ where $X_+(\omega) = \max(X(\omega), 0)$ and $X_-(\omega) = -\min(X(\omega), 0)$ are non-negative R.V.s, and obtain simple functions approximating $X$ by combining those approximating $X_+$ and $X_-$. $\square$

Exercise. Consider the outcome $\omega = (\omega_1, \omega_2)$ of two coin tosses. Specify $\sigma(X_0)$, $\sigma(X_1)$, and $\sigma(X_2)$ for the R.V.s:
$$X_0(\omega) = 4,$$
$$X_1(\omega) = 2X_0(\omega) I_{\{\omega_1 = H\}}(\omega) + 0.5\, X_0(\omega) I_{\{\omega_1 = T\}}(\omega),$$
$$X_2(\omega) = 2X_1(\omega) I_{\{\omega_2 = H\}}(\omega) + 0.5\, X_1(\omega) I_{\{\omega_2 = T\}}(\omega).$$
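The staircase functions $f_n$ from the proof of Proposition 1.2.6 are easy to code; a minimal sketch checking the monotonicity in $n$ and the $2^{-n}$ error bound, for a sample R.V. on $\Omega = (0, 1)$ (the choice $X(\omega) = -\log\omega$ is mine):

```python
import math

def f_n(x, n):
    # f_n(x) = n*1{x > n} + sum_k k 2^{-n} 1{(k 2^{-n}, (k+1) 2^{-n}]}(x):
    # truncate at level n, then round down to the dyadic grid of mesh 2^{-n}.
    if x > n:
        return float(n)
    if x <= 0:
        return 0.0
    # largest grid point k 2^{-n} lying strictly below x
    return (math.ceil(x * 2 ** n) - 1) / 2 ** n

X = lambda w: -math.log(w)  # an example non-negative R.V. on Omega = (0, 1)

for w in (0.001 * i for i in range(1, 1000)):
    x = X(w)
    for n in range(1, 10):
        a, b = f_n(x, n), f_n(x, n + 1)
        assert a <= b <= x                    # X_n <= X_{n+1} <= X pointwise
        if x <= n:
            assert x - a <= 2 ** -n + 1e-9    # error at most 2^{-n} below level n
print("ok")
```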
The concept of $\sigma$-field is needed in order to produce a rigorous mathematical theory. It further has the crucial role of quantifying the amount of information we have. For example, $\sigma(X)$ contains exactly those events $A$ for which we can say whether $\omega \in A$ or not, based on the value of $X(\omega)$. Interpreting Example 1.1.13 as corresponding to sequentially tossing coins, the R.V. $X_n(\omega) = \omega_n$ gives the result of the $n$-th coin toss in our experiment.

Exercise. Suppose $X_n$ are R.V.s on the same measurable space $(\Omega, \mathcal{F})$ and that for each $\omega$ the limit $X_\infty(\omega) = \lim_{n \to \infty} X_n(\omega)$ exists and is finite. Prove that $X_\infty$ is also a R.V. on $(\Omega, \mathcal{F})$.
Hint: In this case $X_\infty(\omega) = \inf_m \sup_{n \ge m} X_n(\omega)$; or see [Bre92, Proposition A.18] for a detailed proof.
We turn to deal with numerical operations involving R.V.s, for which we first need the following definition.

Definition 1.2.11. A function $g : \mathbb{R} \to \mathbb{R}$ is called a Borel (measurable) function if $g$ is a R.V. on $(\mathbb{R}, \mathcal{B})$.
To practice your understanding, solve the next exercise, in which you show that every continuous function $g$ is Borel measurable. Further, every piecewise-constant function $g$ is also Borel measurable (where $g$ piecewise constant means that $g$ has at most countably many jump points, between which it is constant).

Exercise 1.2.12. Recall that a function $g : \mathbb{R} \to \mathbb{R}$ is continuous if $g(x_n) \to g(x)$ for every $x \in \mathbb{R}$ and any convergent sequence $x_n \to x$.
(a) Show that if $g$ is a continuous function then for each $a \in \mathbb{R}$ the set $\{x : g(x) \ge a\}$ is closed. Alternatively, you may show instead that $\{x : g(x) < a\}$ is an open set for each $a \in \mathbb{R}$.
(b) Use whatever you opted to prove in (a) to conclude that all continuous functions are Borel measurable.
We can and shall extend the notions of Borel sets and functions to $\mathbb{R}^n$, $n \ge 1$, by defining the Borel $\sigma$-field on $\mathbb{R}^n$ as $\mathcal{B}_n = \sigma(\{[a_1, b_1] \times \cdots \times [a_n, b_n] : a_i, b_i \in \mathbb{R}, \ i = 1, \ldots, n\})$ and calling $g : \mathbb{R}^n \to \mathbb{R}$ a Borel function if $g$ is a R.V. on $(\mathbb{R}^n, \mathcal{B}_n)$. Convince yourself that these notions coincide for $n = 1$ with those of Example 1.1.7 and Definition 1.2.11.
Proposition 1.2.13. If $g : \mathbb{R}^n \to \mathbb{R}$ is a Borel function and $X_1, \ldots, X_n$ are R.V.s on $(\Omega, \mathcal{F})$ then $g(X_1, \ldots, X_n)$ is also a R.V. on $(\Omega, \mathcal{F})$.

If interested in the proof, c.f. [Bre92, Proposition 2.31].
This yields the generic description of a R.V. measurable with respect to $\sigma(Y_1, \ldots, Y_n)$, namely:

Theorem 1.2.14. If $Z$ is a R.V. on $(\Omega, \sigma(Y_1, \ldots, Y_n))$, then $Z = g(Y_1, \ldots, Y_n)$ for some Borel function $g : \mathbb{R}^n \to \mathbb{R}$.

For the proof of this result see [Bre92, Proposition A.21].
Here are some concrete special cases.

Exercise 1.2.15. Choosing appropriate $g$ in Proposition 1.2.13, deduce that given R.V.s $X_n$, the following are also R.V.s: $\alpha X_n$ with $\alpha \in \mathbb{R}$, $X_1 + X_2$, and $X_1 X_2$.
We consider next the effect that a Borel measurable function has on the amount of information quantified by the corresponding generated $\sigma$-fields.

Proposition 1.2.16. For any $n < \infty$, any Borel function $g : \mathbb{R}^n \to \mathbb{R}$ and R.V.s $Y_1, \ldots, Y_n$ on the same measurable space, we have the inclusion $\sigma(g(Y_1, \ldots, Y_n)) \subseteq \sigma(Y_1, \ldots, Y_n)$.
In the following direct corollary of Proposition 1.2.16 we observe that the information content quantified by the respective generated $\sigma$-fields is invariant under invertible Borel transformations.

Corollary 1.2.17. Suppose the R.V.s $Y_1, \ldots, Y_n$ and $Z_1, \ldots, Z_m$, defined on the same measurable space, are such that $Z_k = g_k(Y_1, \ldots, Y_n)$, $k = 1, \ldots, m$, and $Y_i = h_i(Z_1, \ldots, Z_m)$, $i = 1, \ldots, n$, for some Borel functions $g_k : \mathbb{R}^n \to \mathbb{R}$ and $h_i : \mathbb{R}^m \to \mathbb{R}$. Then, $\sigma(Y_1, \ldots, Y_n) = \sigma(Z_1, \ldots, Z_m)$.

Proof. Left to the reader. Do it to practice your understanding of the concept of generated $\sigma$-fields.
Exercise 1.2.18. Provide an example of a measurable space, a R.V. $X$ on it, and:
(a) A function $g(x) \ne x$ such that $\sigma(g(X)) = \sigma(X)$.
(b) A function $f$ such that $\sigma(f(X))$ is strictly smaller than $\sigma(X)$ and is not the trivial $\sigma$-field $\{\emptyset, \Omega\}$.
1.2.3. The (mathematical) expectation. A key concept in probability theory is the expectation of a R.V., which we define here, starting with the expectation of non-negative R.V.s.
Definition 1.2.19. (see [Bre92, Appendix A.3]). The (mathematical) expectation of a R.V. $X(\omega)$ is denoted $\mathbf{E}X$. With $x_{k,n} = k2^{-n}$ and the intervals $I_{k,n} = (x_{k,n}, x_{k+1,n}]$ for $k = 0, 1, \ldots$, the expectation of $X(\omega) \ge 0$ is defined as:
$$\mathbf{E}X = \lim_{n \to \infty} \Big[ \sum_{k=0}^{\infty} x_{k,n}\, P(\{\omega : X(\omega) \in I_{k,n}\}) \Big]. \tag{1.2.1}$$
For each value of $n$ the non-negative series on the right-hand side of (1.2.1) is well defined, though possibly infinite. For any $k, n$ we have that $x_{k,n} = x_{2k,n+1} \le x_{2k+1,n+1}$. Thus, the interval $I_{k,n}$ is the union of the disjoint intervals $I_{2k,n+1}$ and $I_{2k+1,n+1}$, implying that
$$x_{k,n} P(X \in I_{k,n}) \le x_{2k,n+1} P(X \in I_{2k,n+1}) + x_{2k+1,n+1} P(X \in I_{2k+1,n+1}).$$
It follows that the series in Definition 1.2.19 are monotone non-decreasing in $n$, hence the limit as $n \to \infty$ exists as well.
We next discuss the two special cases for which the expectation can be computed explicitly: that of a R.V. with countable range, followed by that of a R.V. having a probability density function.

Example 1.2.20. Though not detailed in these notes, it is possible to show that for countable $\Omega$ and $\mathcal{F} = 2^{\Omega}$ we have $\mathbf{E}X = \sum_{\omega \in \Omega} X(\omega) p_\omega$, and that the formula $\mathbf{E}X = \sum_i x_i P(\{\omega : X(\omega) = x_i\})$ applies whenever the range of $X$ is a bounded-below countable set $\{x_1, x_2, \ldots\}$ of real numbers (for example, whenever $X \in \mathrm{SF}$).
Here are a few examples showing that we may have $\mathbf{E}X = \infty$ while $X(\omega) < \infty$ for all $\omega$.

Example 1.2.21. Take $\Omega = \{1, 2, \ldots\}$, $\mathcal{F} = 2^{\Omega}$ and $p_k = c\, k^{-2}$, with $c = [\sum_{k=1}^{\infty} k^{-2}]^{-1}$ a positive, finite normalization constant. Then the random variable $X(\omega) = \omega$ is finite but $\mathbf{E}X = c \sum_{k=1}^{\infty} k^{-1} = \infty$.
For a more interesting example, consider the infinite coin-toss space $(\Omega_\infty, \mathcal{F}_c)$ of Example 1.1.13, with probability of a head on each toss equal to $1/2$, independently of all other tosses. For $k = 1, 2, \ldots$ let $X(\omega) = 2^k$ if $\omega_1 = \cdots = \omega_{k-1} = T$, $\omega_k = H$. This defines $X(\omega)$ for every sequence of coin tosses except the infinite sequence of all tails, whose probability is zero, for which we set $X(TTT\cdots) = \infty$. Note that $X$ is finite a.s. However, $\mathbf{E}X = \sum_k 2^k 2^{-k} = \infty$ (the latter example is the basis for a gambling question that gives rise to what is known as the St. Petersburg paradox).
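The St. Petersburg R.V. is easy to simulate (a sketch; the seed and sample sizes are arbitrary choices). Since $\mathbf{E}X = \infty$, the sample means do not stabilize but tend to grow as the sample size increases:

```python
import random

random.seed(0)

def petersburg():
    # X = 2^k where k is the index of the first head in fair coin tosses.
    k = 1
    while random.random() < 0.5:  # a tail, with probability 1/2
        k += 1
    return 2 ** k

for m in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    mean = sum(petersburg() for _ in range(m)) / m
    print(m, mean)  # the means drift upward with m, reflecting E X = infinity
```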
Remark. Using the elementary formula $\mathbf{E}Y = \sum_{m=1}^{N} c_m P(A_m)$ for the simple function $Y(\omega) = \sum_{m=1}^{N} c_m I_{A_m}(\omega)$, as in Example 1.2.20, it can be shown that our definition of the expectation of $X \ge 0$ coincides with
$$\mathbf{E}X = \sup\{\mathbf{E}Y : Y \in \mathrm{SF},\ 0 \le Y \le X\}. \tag{1.2.2}$$
This alternative definition is useful for proving properties of the expectation, but less convenient for computing the value of $\mathbf{E}X$ for any specific $X$.
Exercise 1.2.22. Write $(\Omega, \mathcal{F}, P)$ for a random experiment whose outcome is a recording of the results of $n$ independent rolls of a balanced six-sided die (including their order). Compute the expectation of the random variable $D(\omega)$ which counts the number of different faces of the die recorded in these $n$ rolls.
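Writing $D = \sum_{j=1}^{6} I_{A_j}$, with $A_j$ the event that face $j$ shows at least once, linearity gives $\mathbf{E}D = 6(1 - (5/6)^n)$. This closed form is my derivation rather than part of the text; the sketch below checks it by brute-force enumeration for small $n$:

```python
from itertools import product

def expected_distinct_bruteforce(n):
    # Enumerate all 6^n equally likely outcomes and average the number of
    # distinct faces seen.
    total = sum(len(set(roll)) for roll in product(range(1, 7), repeat=n))
    return total / 6 ** n

def expected_distinct_formula(n):
    # E[D] = sum_j P(face j appears at least once) = 6 * (1 - (5/6)^n)
    return 6 * (1 - (5 / 6) ** n)

for n in range(1, 6):
    assert abs(expected_distinct_bruteforce(n) - expected_distinct_formula(n)) < 1e-9
print("ok")
```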
Definition 1.2.23. We say that a R.V. $X(\omega)$ has a probability density function $f_X$ if $P(a \le X \le b) = \int_a^b f_X(x)\,dx$ for every $a < b \in \mathbb{R}$. Such a density $f_X$ must be a non-negative function with $\int_{\mathbb{R}} f_X(x)\,dx = 1$.
Proposition 1.2.24. When a non-negative R.V. $X(\omega)$ has a probability density function $f_X$, our definition of the expectation coincides with the well-known elementary formula $\mathbf{E}X = \int_0^{\infty} x f_X(x)\,dx$.
Proof. Fixing $X(\omega) \ge 0$ with a density $f_X(x)$, a finite $m$ and a positive $\delta$, let
$$E_{\delta,m}(X) = \sum_{n=0}^{m} n\delta \int_{n\delta}^{(n+1)\delta} f_X(x)\,dx.$$
For any $x \in [n\delta, (n+1)\delta]$ we have that $x \ge n\delta \ge x - \delta$, implying that
$$\int_0^{(m+1)\delta} x f_X(x)\,dx = \sum_{n=0}^{m} \int_{n\delta}^{(n+1)\delta} x f_X(x)\,dx \ \ge\ E_{\delta,m}(X) \ \ge\ \sum_{n=0}^{m} \Big[ \int_{n\delta}^{(n+1)\delta} x f_X(x)\,dx - \delta \int_{n\delta}^{(n+1)\delta} f_X(x)\,dx \Big] = \int_0^{(m+1)\delta} x f_X(x)\,dx - \delta \int_0^{(m+1)\delta} f_X(x)\,dx.$$
Considering $m \to \infty$ we see that
$$\int_0^{\infty} x f_X(x)\,dx \ \ge\ \lim_{m \to \infty} E_{\delta,m}(X) \ \ge\ \int_0^{\infty} x f_X(x)\,dx - \delta.$$
Taking $\delta \downarrow 0$ the lower bounds converge to the upper bound, hence by (1.2.1),
$$\mathbf{E}X = \lim_{\delta \downarrow 0} \lim_{m \to \infty} E_{\delta,m}(X) = \int_0^{\infty} x f_X(x)\,dx,$$
as stated. $\square$
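The truncated sums $E_{\delta,m}$ of the proof are easy to evaluate numerically. A minimal sketch, with the Exponential(1) density as my choice (for which $\mathbf{E}X = 1$) and the interval probabilities expressed through the c.d.f.:

```python
import math

def E_delta_m(delta, m, F):
    # sum_{n=0}^{m} n*delta * P(X in (n*delta, (n+1)*delta]), with the
    # probabilities expressed via the c.d.f. F of the continuous R.V. X.
    return sum(n * delta * (F((n + 1) * delta) - F(n * delta))
               for n in range(m + 1))

F_exp = lambda x: 1 - math.exp(-x)  # Exponential(1) c.d.f.; here E X = 1

# Halve delta while keeping m*delta ~ 100: the sums increase towards E X.
approx = [E_delta_m(2 ** -j, 100 * 2 ** j, F_exp) for j in range(8)]
print(approx[-1])  # close to (and slightly below) 1
```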
Definition 1.2.19 is also called the Lebesgue integral of $X$ with respect to the probability measure $P$, and is consequently denoted $\mathbf{E}X = \int_\Omega X(\omega)\,dP(\omega)$ (or $\int_\Omega X(\omega)P(d\omega)$). It is based on splitting the range of $X(\omega)$ into finitely many (small) intervals and approximating $X(\omega)$ by a constant on the corresponding set of values $\omega$ for which $X(\omega)$ falls into one such interval. This allows us to deal with a rather general domain $\Omega$, in contrast to Riemann's integral, where the domain of integration is split into finitely many (small) intervals and which is hence limited to $\mathbb{R}^d$. Even when $\Omega = [0, 1]$ it allows us to deal with measures $P$ for which $\alpha \mapsto P([0, \alpha])$ is not smooth (and hence Riemann's integral fails to exist). Though not shown here, if the corresponding Riemann integral exists, then it necessarily coincides with the Lebesgue integral of Definition 1.2.19 (as proved in Proposition 1.2.24 for a R.V. $X$ having a density).
We next extend the definition of the expectation from non-negative R.V.s to general R.V.s.

Definition 1.2.25. For a general R.V. $X$ consider the non-negative R.V.s $X_+ = \max(X, 0)$ and $X_- = -\min(X, 0)$ (so $X = X_+ - X_-$), and let $\mathbf{E}X = \mathbf{E}X_+ - \mathbf{E}X_-$, provided either $\mathbf{E}X_+ < \infty$ or $\mathbf{E}X_- < \infty$.
Definition 1.2.26. We say that a random variable $X$ is integrable (or has finite expectation) if $\mathbf{E}|X| < \infty$, that is, both $\mathbf{E}X_+ < \infty$ and $\mathbf{E}X_- < \infty$.

Exercise 1.2.27. Show that a R.V. $X$ is integrable if and only if $\mathbf{E}[|X| I_{|X| > M}] \to 0$ as $M \to \infty$.
Remark. Suppose $X = Y - Z$ for some non-negative R.V.s $Y$ and $Z$. Then, necessarily $Y = D + X_+$ and $Z = D + X_-$ for the non-negative R.V. $D = Y - X_+ = Z - X_-$. Consequently,
$$\mathbf{E}Y - \mathbf{E}Z = \mathbf{E}X_+ - \mathbf{E}X_- = \mathbf{E}X$$
provided either $\mathbf{E}Y$ is finite or $\mathbf{E}Z$ is finite. We conclude that in this case $\mathbf{E}X = \mathbf{E}Y - \mathbf{E}Z$. However, it is possible to have $X$ integrable while $\mathbf{E}D = \infty$, resulting in $\mathbf{E}Y = \mathbf{E}Z = \infty$.
Example 1.2.28. Using Proposition 1.2.24 it is easy to check that a R.V. $X$ with a density $f_X$ is integrable if and only if $\int |x| f_X(x)\,dx < \infty$, in which case $\mathbf{E}X = \int x f_X(x)\,dx$.
Building on Example 1.2.28 we also have the following well-known change-of-variables formula for the expectation.

Proposition 1.2.29. If a R.V. $X$ has the probability density function $f_X$ and $h : \mathbb{R} \to \mathbb{R}$ is a Borel measurable function, then the R.V. $Y = h(X)$ is integrable if and only if $\int |h(x)| f_X(x)\,dx < \infty$, in which case $\mathbf{E}Y = \int h(x) f_X(x)\,dx$.
Definition 1.2.30. A R.V. $X$ with probability density function
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big), \qquad x \in \mathbb{R},$$
where $\mu \in \mathbb{R}$ and $\sigma > 0$, is called a non-degenerate Gaussian (or Normal) R.V. with mean $\mu$ and variance $\sigma^2$, denoted by $X \sim N(\mu, \sigma^2)$.
Exercise 1.2.31. Verify that the log-normal random variable $Y$, that is, $Y = e^X$ with $X \sim N(\mu, \sigma^2)$, has expected value $\mathbf{E}Y = \exp(\mu + \sigma^2/2)$.
Exercise 1.2.32. Find the values of $\alpha \in \mathbb{R}$ for which $f_X(x) = c_\alpha/(1 + |x|^\alpha)$ is a probability density function for some finite, positive normalization constant $c_\alpha$.

We next collect, in one proposition, several fundamental properties of the expectation.

Proposition 1.2.33.
(1). $\mathbf{E}X \ge 0$ for every R.V. $X \ge 0$.
(2). If $X = \sum_{n=1}^{N} c_n I_{A_n}$ is a simple function, then $\mathbf{E}X = \sum_{n=1}^{N} c_n P(A_n)$.
(3). If $X$ and $Y$ are integrable R.V.s then for any constants $\alpha, \beta$ the R.V. $\alpha X + \beta Y$ is integrable and $\mathbf{E}(\alpha X + \beta Y) = \alpha(\mathbf{E}X) + \beta(\mathbf{E}Y)$.
(4). $\mathbf{E}X = c$ if $X(\omega) = c$ with probability 1.
(5). Monotonicity: If $X \le Y$ a.s., then $\mathbf{E}X \le \mathbf{E}Y$. Further, if $X \le Y$ a.s. and $\mathbf{E}X = \mathbf{E}Y$, then $X = Y$ a.s.
In particular, property (3) of Proposition 1.2.33 tells us that the expectation is a linear functional on the vector space of all integrable R.V.s (denoted below by $L^1$).

Exercise 1.2.34. Using the definition of the expectation, prove the five properties detailed in Proposition 1.2.33.
Indicator R.V.s are often useful when computing expectations, as the following exercise illustrates.

Exercise 1.2.35. A coin is tossed $n$ times with the same probability $0 < p < 1$ of H showing on each toss, independently of all other tosses. A run is a maximal sequence of consecutive tosses which result in the same outcome. For example, the sequence HHHTHTTH contains five runs. Show that the expected number of runs is $1 + 2(n-1)p(1-p)$.
Hint: This is [GS01, Exercise 3.4.1].
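The formula can be checked by exhaustive enumeration (weighting each of the $2^n$ sequences by its probability); a sketch, with small $n$ and a few arbitrary values of $p$:

```python
from itertools import product

def count_runs(seq):
    # Number of maximal blocks of equal consecutive symbols.
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def expected_runs_bruteforce(n, p):
    # Sum over all 2^n outcome sequences, weighted by p^{#H} (1-p)^{#T}.
    total = 0.0
    for seq in product("HT", repeat=n):
        prob = p ** seq.count("H") * (1 - p) ** seq.count("T")
        total += prob * count_runs(seq)
    return total

for n in (2, 5, 10):
    for p in (0.3, 0.5, 0.9):
        assert abs(expected_runs_bruteforce(n, p)
                   - (1 + 2 * (n - 1) * p * (1 - p))) < 1e-9
print("HHHTHTTH has", count_runs("HHHTHTTH"), "runs")  # → 5 runs
```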
Since the explicit computation of the expectation is often not possible, we detail next a useful way to bound one expectation by another.

Proposition 1.2.36 (Jensen's inequality). Suppose $g(\cdot)$ is a convex function, that is,
$$g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y) \qquad \forall x, y \in \mathbb{R},\ 0 \le \lambda \le 1.$$
If $X$ is an integrable R.V. and $g(X)$ is also integrable, then $\mathbf{E}(g(X)) \ge g(\mathbf{E}X)$.
Example 1.2.37. Let $X$ be a R.V. such that $X(\omega) = x I_A + y I_{A^c}$ and $P(A) = \lambda$. Then, $\mathbf{E}X = xP(A) + yP(A^c) = \lambda x + (1 - \lambda)y$. Further, $g(X(\omega)) = g(x)I_A + g(y)I_{A^c}$, so for $g$ convex,
$$\mathbf{E}\,g(X) = \lambda g(x) + (1 - \lambda)g(y) \ \ge\ g(\lambda x + (1 - \lambda)y) = g(\mathbf{E}X).$$
The expectation is often also used as a way to bound tail probabilities, based on the following classical inequality.

Theorem 1.2.38 (Markov's inequality). Suppose $f$ is a non-decreasing, Borel measurable function with $f(x) > 0$ for any $x > 0$. Then, for any random variable $X$ and all $\varepsilon > 0$,
$$P(|X(\omega)| > \varepsilon) \le \frac{1}{f(\varepsilon)} \mathbf{E}(f(|X|)).$$
Proof. Let $A = \{\omega : |X(\omega)| > \varepsilon\}$. Since $f$ is non-negative,
$$\mathbf{E}[f(|X|)] = \int_\Omega f(|X(\omega)|)\,dP(\omega) \ \ge\ \int_\Omega I_A(\omega) f(|X(\omega)|)\,dP(\omega).$$
Since $f$ is non-decreasing, $f(|X(\omega)|) \ge f(\varepsilon)$ on the set $A$, implying that
$$\int_\Omega I_A(\omega) f(|X(\omega)|)\,dP(\omega) \ \ge\ \int_\Omega I_A(\omega) f(\varepsilon)\,dP(\omega) = f(\varepsilon)P(A).$$
Dividing by $f(\varepsilon) > 0$ we get the stated inequality. $\square$
We next specify the three most common instances of Markov's inequality.

Example 1.2.39. (a). Assuming $X \ge 0$ a.s. and taking $f(x) = x$, Markov's inequality is then
$$P(X(\omega) > \varepsilon) \le \frac{\mathbf{E}[X]}{\varepsilon}.$$
(b). Taking $f(x) = x^2$ and $X = Y - \mathbf{E}Y$, Markov's inequality is then
$$P(|Y - \mathbf{E}Y| > \varepsilon) = P(|X(\omega)| > \varepsilon) \le \frac{\mathbf{E}[X^2]}{\varepsilon^2} = \frac{\mathrm{Var}(Y)}{\varepsilon^2}.$$
(c). Taking $f(x) = e^{\theta x}$ for some $\theta > 0$, Markov's inequality is then
$$P(X > \varepsilon) \le e^{-\theta\varepsilon}\,\mathbf{E}[e^{\theta X}].$$
In this bound we have an exponential decay in $\varepsilon$, at the cost of requiring $X$ to have finite exponential moments.
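For a concrete comparison (my example, not from the text), take $X \sim$ Exponential(1), for which $P(X > \varepsilon) = e^{-\varepsilon}$, $\mathbf{E}X = \mathrm{Var}(X) = 1$ and $\mathbf{E}e^{\theta X} = 1/(1-\theta)$ for $\theta < 1$; the three bounds can then be evaluated against the exact tails:

```python
import math

theta = 0.5
for eps in (2.0, 5.0, 10.0):
    exact = math.exp(-eps)                           # P(X > eps), exact
    markov = 1.0 / eps                               # (a): E[X] / eps
    chernoff = math.exp(-theta * eps) / (1 - theta)  # (c): e^{-theta eps} E[e^{theta X}]
    assert exact <= markov and exact <= chernoff

    cheb_exact = math.exp(-(1 + eps))                # P(|X - 1| > eps) for eps >= 1
    cheb_bound = 1.0 / eps ** 2                      # (b): Var(X) / eps^2
    assert cheb_exact <= cheb_bound
    print(eps, exact, markov, cheb_bound, chernoff)
```

As expected, all three bounds hold, with the exponential bound (c) closest for large $\varepsilon$.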
Here is an application of Markov's inequality with $f(x) = x^2$.

Exercise 1.2.40. Show that if $\mathbf{E}[X^2] = 0$ then $X = 0$ almost surely.
We conclude this section with the Schwarz inequality (see also Proposition 2.2.4).

Proposition 1.2.41. Suppose $Y$ and $Z$ are random variables on the same probability space with both $\mathbf{E}[Y^2]$ and $\mathbf{E}[Z^2]$ finite. Then, $\mathbf{E}|YZ| \le \sqrt{\mathbf{E}Y^2\,\mathbf{E}Z^2}$.
1.3. Convergence of random variables
Asymptotic behavior is a key issue in probability theory and in the study of stochastic processes. We thus explore here the various notions of convergence of random variables and the relations among them. We start in Subsection 1.3.1 with convergence for almost all possible outcomes versus convergence of probabilities to zero. Then, in Subsection 1.3.2 we introduce and study the spaces of $q$-integrable random variables and the associated notions of convergence in $q$-mean (which play an important role in the theory of stochastic processes).
Unless explicitly stated otherwise, throughout this section we assume that all R.V.s are defined on the same probability space $(\Omega, \mathcal{F}, P)$.
1.3.1. Convergence almost surely and in probability. It is possible to find sequences of R.V.s that have pointwise limits, that is, $X_n(\omega) \to X(\omega)$ for all $\omega$. One such example is $X_n(\omega) = a_n X(\omega)$, with non-random $a_n \to a$ for some $a \in \mathbb{R}$. However, this notion of convergence is in general not very useful, since it is sensitive to ill behavior of the random variables on negligible sets of points $\omega$. We provide next the more appropriate (and slightly weaker) alternative notion of convergence, that of almost sure convergence.
Definition 1.3.1. We say that random variables $X_n$ converge to $X$ almost surely, denoted $X_n \xrightarrow{a.s.} X$, if there exists $A \in \mathcal{F}$ with $P(A) = 1$ such that $X_n(\omega) \to X(\omega)$ as $n \to \infty$ for each fixed $\omega \in A$.
Just like pointwise convergence, almost sure convergence is invariant under the application of a continuous function.

Exercise 1.3.2. Show that if $X_n \xrightarrow{a.s.} X$ and $f$ is a continuous function, then $f(X_n) \xrightarrow{a.s.} f(X)$ as well.
We next illustrate the difference between convergence pointwise and almost surely via a special instance of the Law of Large Numbers.

Example 1.3.3. Let $S_n = \frac{1}{n}\sum_{i=1}^{n} I_{\{\omega_i = H\}}$ be the fraction of head counts in the first $n$ independent fair coin tosses. That is, we use the measurable space $(\Omega_\infty, \mathcal{F}_c)$ of Example 1.1.13, for which the $S_n$ are R.V.s, endowed with the probability measure $P$ such that for each $n$ the restriction of $P$ to the events determined by the first $n$ tosses gives equal probability $2^{-n}$ to each of the $2^n$ possible outcomes. In this case, the Law of Large Numbers (L.L.N.) is the statement that $S_n \xrightarrow{a.s.} \frac{1}{2}$. That is, as $n \to \infty$, the observed fraction of head counts in the first $n$ independent fair coin tosses approaches the probability of the coin landing Head, apart from a negligible collection of infinite sequences of coin tosses. However, note that there are infinite sequences, for example $HHHH\cdots$, along which $S_n$ does not converge to $\frac{1}{2}$, so the convergence of $S_n$ is not pointwise.

A second, weaker, notion of convergence is that of convergence in probability.

Definition. We say that $X_n$ converge to $X$ in probability, denoted $X_n \xrightarrow{p} X$, if $P(\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}) \to 0$ as $n \to \infty$, for each fixed $\varepsilon > 0$.

Theorem 1.3.6. (a) If $X_n \xrightarrow{a.s.} X$, then $X_n \xrightarrow{p} X$. (b) If $X_n \xrightarrow{p} X$, then there exists a subsequence $n_l \to \infty$ such that $X_{n_l} \xrightarrow{a.s.} X$ as $l \to \infty$.

For part (a), fixing $\varepsilon > 0$, consider the events $A_k = \{\omega : |X_n(\omega) - X(\omega)| \le \varepsilon \text{ for all } n \ge k\}$ and $C_n = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$, noting that $C_n \subseteq [A_n]^c$.
To practice your understanding at this point, explain why $X_n \xrightarrow{a.s.} X$ implies that $P(A_k) \to 1$, why this in turn implies that $P(C_n) \to 0$, and why this verifies part (a) of Theorem 1.3.6.
We will prove part (b) of Theorem 1.3.6 after developing the Borel-Cantelli lemmas, and after studying a particular example showing that convergence in probability does not, in general, imply convergence a.s.

Proposition 1.3.8. In general, $X_n \xrightarrow{p} X$ does not imply that $X_n \xrightarrow{a.s.} X$.
Proof. Consider the probability space $\Omega = (0, 1)$, with the Borel $\sigma$-field and the uniform probability measure $U$ of Example 1.1.11. It suffices to construct an example of $X_n \xrightarrow{p} 0$ such that, fixing each $\omega \in (0, 1)$, we have $X_n(\omega) = 1$ for infinitely many values of $n$. For example, this is the case when $X_n(\omega) = 1_{[t_n, t_n + s_n]}(\omega)$ with $s_n \to 0$ as $n \to \infty$ slowly enough and $t_n \in [0, 1 - s_n]$ such that any $\omega \in [0, 1]$ is in infinitely many intervals $[t_n, t_n + s_n]$. The latter property applies if $t_n = (i - 1)/k$ and $s_n = 1/k$ when $n = k(k-1)/2 + i$, $i = 1, 2, \ldots, k$ and $k = 1, 2, \ldots$ (plot the intervals $[t_n, t_n + s_n]$ to convince yourself). $\square$
Definition 1.3.9. Let $A_n$ be a sequence of events, and $B_n = \bigcup_{k=n}^{\infty} A_k$. Define $A_\infty = \bigcap_{n=1}^{\infty} B_n$, so $\omega \in A_\infty$ if and only if $\omega \in A_k$ for infinitely many values of $k$.
Our next result, called the first Borel-Cantelli lemma (see for example [GS01, Theorem 7.3.10, part (a)]), states that almost surely $A_k$ occurs for only finitely many values of $k$ if the sequence $P(A_k)$ converges to zero fast enough.

Lemma 1.3.10 (Borel-Cantelli I). Suppose $A_k \in \mathcal{F}$ and $\sum_{k=1}^{\infty} P(A_k) < \infty$. Then, necessarily $P(A_\infty) = 0$.
Proof. Let $b_n = \sum_{k=n}^{\infty} P(A_k)$, noting that by monotonicity and countable sub-additivity of probability measures we have for all $n$ that
$$P(A_\infty) \le P(B_n) \le \sum_{k=n}^{\infty} P(A_k) = b_n$$
(see Exercise 1.1.3). Further, our assumption that $b_1$ is finite implies that $b_n \to 0$ as $n \to \infty$, so considering the preceding bound $P(A_\infty) \le b_n$ for $n \to \infty$, we conclude that $P(A_\infty) = 0$. $\square$
The following (strong) converse applies when in addition we know that the events $A_k$ are mutually independent (for a rigorous definition, see Definition 1.4.35).

Lemma 1.3.11 (Borel-Cantelli II). If $A_k$ are independent and $\sum_{k=1}^{\infty} P(A_k) = \infty$, then $P(A_\infty) = 1$.

Proof of part (b) of Theorem 1.3.6. Since $X_n \xrightarrow{p} X$, we can choose a subsequence $n_l \to \infty$ for which the events $A_l = \{\omega : |X_{n_l}(\omega) - X(\omega)| > 2^{-l}\}$ satisfy $\sum_l P(A_l) \le \sum_l 2^{-l} < \infty$. Therefore, by Borel-Cantelli I, $P(A_\infty) = 0$. Now observe that $\omega \notin A_\infty$ amounts to $\omega \notin A_l$ for all but finitely many values of $l$, hence if $\omega \notin A_\infty$, then necessarily $X_{n_l}(\omega) \to X(\omega)$ as $l \to \infty$. To summarize, the measurable set $B := \{\omega : X_{n_l}(\omega) \to X(\omega)\}$ contains the set $(A_\infty)^c$, whose probability is one. Clearly then, $P(B) = 1$, which is exactly what we set out to prove. $\square$
Here is another application of the Borel-Cantelli I lemma, which produces the a.s. convergence to zero of $k^{-1}X_k$ in case $\sup_n \mathbf{E}[X_n^2]$ is finite.

Proposition 1.3.12. Suppose $\mathbf{E}[X_n^2] \le 1$ for all $n$. Then $n^{-1}X_n(\omega) \to 0$ a.s. as $n \to \infty$.
Proof. Fixing $\varepsilon > 0$, let $A_k = \{\omega : |k^{-1}X_k(\omega)| > \varepsilon\}$ for $k = 1, 2, \ldots$. Then, by part (b) of Example 1.2.39 and our assumption, we have that
$$P(A_k) = P(\{\omega : |X_k(\omega)| > \varepsilon k\}) \le \frac{\mathbf{E}(X_k^2)}{(\varepsilon k)^2} \le \frac{1}{k^2}\,\frac{1}{\varepsilon^2}.$$
Since $\sum_k k^{-2} < \infty$, it then follows by Borel-Cantelli I that $P(A_\infty) = 0$, where $A_\infty = \{\omega : |k^{-1}X_k(\omega)| > \varepsilon \text{ for infinitely many values of } k\}$. Hence, for any fixed $\varepsilon > 0$, with probability one $k^{-1}|X_k(\omega)| \le \varepsilon$ for all large enough $k$, that is, $\limsup_{n} n^{-1}|X_n(\omega)| \le \varepsilon$ a.s. Considering a sequence $\varepsilon_m \downarrow 0$, we conclude that $n^{-1}X_n \to 0$ as $n \to \infty$ for a.e. $\omega$. $\square$
A common use of Borel-Cantelli I is to prove convergence almost surely by applying the conclusion of the next exercise to the case at hand.

Exercise 1.3.13. Suppose $X$ and $X_n$ are random variables on the same probability space.
(a) Fixing $\varepsilon > 0$, show that the set $A_\infty^{\varepsilon} = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon \text{ for infinitely many values of } n\}$ contains the set $C^{\varepsilon} = \{\omega : \limsup_n |X_n(\omega) - X(\omega)| > \varepsilon\}$.
(b) Explain why $X_n \xrightarrow{a.s.} X$ if $P(C^{\varepsilon}) = 0$ for each $\varepsilon > 0$, and deduce that this is the case whenever $\sum_{n=1}^{\infty} P(|X_n - X| > \varepsilon) < \infty$ for each $\varepsilon > 0$.
We conclude this subsection with a pair of exercises in which the Borel-Cantelli lemmas help in finding certain asymptotic behavior for sequences of independent random variables.

Exercise 1.3.14. Suppose $T_n$ are independent Exponential(1) random variables (that is, $P(T_n > t) = e^{-t} 1_{t \ge 0}$).
(a) Using both Borel-Cantelli lemmas, show that
$$P(T_k(\omega) > \alpha \log k \text{ for infinitely many values of } k) = \begin{cases} 1, & \alpha \le 1 \\ 0, & \alpha > 1 \end{cases}$$
(b) Deduce that $\limsup_{n} (T_n / \log n) = 1$ almost surely.
Hint: See [Wil91, Example 4.4] for more details.
Exercise 1.3.15. Consider a two-sided infinite sequence of independent random variables $X_k$, $k \in \mathbb{Z}$, such that $P(X_k = 1) = P(X_k = 0) = 1/2$ for all $k \in \mathbb{Z}$ (for example, think of independent fair coin tosses). Let $\ell_m = \max\{i \ge 1 : X_{m-i+1} = \cdots = X_m = 1\}$ denote the length of the run of 1s going backwards from time $m$ (with $\ell_m = 0$ in case $X_m = 0$). We are interested in the asymptotics of the longest such run during times $1, 2, \ldots, n$ for large $n$. That is,
$$L_n = \max\{\ell_m : m = 1, \ldots, n\} = \max\{m - k : X_{k+1} = \cdots = X_m = 1 \text{ for some } m = 1, \ldots, n\}.$$
(a) Explain why $P(\ell_m = k) = 2^{-(k+1)}$ for $k = 0, 1, 2, \ldots$ and any $m$.
(b) Applying the Borel-Cantelli I lemma to $A_n = \{\ell_n > (1 + \varepsilon)\log_2 n\}$, show that for each $\varepsilon > 0$, with probability one, $\ell_n \le (1 + \varepsilon)\log_2 n$ for all $n$ large enough. Considering a countable sequence $\varepsilon_k \downarrow 0$, deduce that
$$\limsup_{n \to \infty} \frac{L_n}{\log_2 n} \overset{a.s.}{\le} 1.$$
(c) Fixing $\varepsilon > 0$, let $A_n = \{L_n < k_n\}$ for $k_n = [(1 - \varepsilon)\log_2 n]$. Explain why
$$A_n \subseteq \bigcap_{i=1}^{m_n} B_i^c,$$
for $m_n = [n/k_n]$ and the independent events $B_i = \{X_{(i-1)k_n + 1} = \cdots = X_{i k_n} = 1\}$. Deduce from this that $P(A_n) \le P(B_1^c)^{m_n} \le \exp(-n^{\varepsilon}/(2\log_2 n))$ for all $n$ large enough.
(d) Applying the Borel-Cantelli I lemma to the events $A_n$ of part (c), followed by $\varepsilon \downarrow 0$, conclude that
$$\liminf_{n \to \infty} \frac{L_n}{\log_2 n} \overset{a.s.}{\ge} 1,$$
and consequently, that $L_n/(\log_2 n) \xrightarrow{a.s.} 1$.
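A simulation illustrates the conclusion $L_n/\log_2 n \to 1$ (a sketch; the seed and sample size are arbitrary choices of mine):

```python
import math
import random

random.seed(2023)

def longest_run(bits):
    # Longest block of consecutive 1s in a 0/1 sequence (this is L_n).
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b else 0
        best = max(best, cur)
    return best

n = 200000
bits = [random.random() < 0.5 for _ in range(n)]  # n fair coin tosses
ratio = longest_run(bits) / math.log2(n)
print(ratio)  # close to 1 for large n
```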
1.3.2. $L^q$ spaces and convergence in $q$-mean. Fixing $1 \le q < \infty$, we denote by $L^q(\Omega, \mathcal{F}, P)$ the collection of random variables $X$ on $(\Omega, \mathcal{F})$ for which $\mathbf{E}(|X|^q) < \infty$ (when the probability space $(\Omega, \mathcal{F}, P)$ is clear from the context, we often use the short notation $L^q$). For example, $L^1$ denotes the space of all integrable random variables, and the random variables in $L^2$ are also called square integrable.
We start by proving that these spaces are nested in terms of the parameter $q$.

Proposition 1.3.16. The quantity $\|X\|_q = [\mathbf{E}(|X|^q)]^{1/q}$ is non-decreasing in $q$.

Proof. Fix $q_1 > q_2 \ge 1$ and apply Jensen's inequality for the convex function $g(y) = |y|^{q_1/q_2}$ and the non-negative random variable $Y = |X|^{q_2}$, to get that $\mathbf{E}\big[|Y|^{q_1/q_2}\big] \ge (\mathbf{E}Y)^{q_1/q_2}$. Taking the $1/q_1$ power yields the stated result. $\square$
Associated with the space $L^q$ is the notion of convergence in $q$-mean, which we now define.

Definition 1.3.17. We say that $X_n$ converges in $q$-mean, or in $L^q$, to $X$, denoted $X_n \xrightarrow{q.m.} X$, if $X_n, X \in L^q$ and $\|X_n - X\|_q \to 0$ as $n \to \infty$ (i.e., $\mathbf{E}(|X_n - X|^q) \to 0$ as $n \to \infty$).
Example 1.3.18. For $q = 2$ we have the explicit formula
$$\|X_n - X\|_2^2 = \mathbf{E}(X_n^2) - 2\mathbf{E}(X_n X) + \mathbf{E}(X^2).$$
Thus, it is often easiest to check convergence in 2-mean.
Check that the following claim is an immediate corollary of Proposition 1.3.16.

Corollary 1.3.19. If $X_n \xrightarrow{q.m.} X$ and $1 \le r \le q$, then $X_n \xrightarrow{r.m.} X$.
Our next proposition details the most important general structural properties of the spaces L^q.
Proposition 1.3.20. L^q(Ω, F, P) is a complete, normed (topological) vector space with the norm || · ||_q. That is, αX + βY ∈ L^q whenever X, Y ∈ L^q and α, β ∈ R, the map X ↦ ||X||_q is a norm on L^q, and if X_n ∈ L^q are such that ||X_n − X_m||_q → 0 as n, m → ∞, then X_n →^{q.m.} X for some X ∈ L^q.
For example, check the following claim.
Exercise 1.3.21. Fixing q ≥ 1, use the triangle inequality for the norm || · ||_q on L^q to show that if X_n →^{q.m.} X, then E|X_n|^q → E|X|^q. Using Jensen's inequality for g(x) = |x|, deduce that also EX_n → EX. Finally, provide an example showing that EX_n → EX does not necessarily imply X_n → X in L^1.
As we prove below by an application of Markov's inequality, convergence in q-mean implies convergence in probability (for any value of q).
Proposition 1.3.22. If X_n →^{q.m.} X, then X_n →^p X.
Proof. For f(x) = |x|^q, Markov's inequality (i.e., Theorem 1.2.38) says that

    P(|Y| > ε) ≤ ε^{−q} E[|Y|^q].

Taking Y = X_n − X gives P(|X_n − X| > ε) ≤ ε^{−q} E[|X_n − X|^q]. Thus, if X_n →^{q.m.} X we necessarily also have that X_n →^p X, as claimed. □
Example 1.3.23. The converse of Proposition 1.3.22 does not hold in general. For example, take the probability space of Example 1.1.11 and the R.V. Y_n(ω) = n 1_{[0,n^{−1}]}(ω). Since Y_n(ω) = 0 for all n ≥ n₀ and some finite n₀ = n₀(ω), it follows that Y_n(ω) converges a.s. to Y(ω) = 0, and hence converges to zero in probability as well (see part (a) of Theorem 1.3.6). However, Y_n does not converge to zero in q-mean, even in the easiest case q = 1, since E[Y_n] = n U([0, n^{−1}]) = 1 for all n. Convergence in q-mean is in general not comparable to a.s. convergence. Indeed, the above example of Y_n(ω), with convergence a.s. but not in 1-mean, is complemented by the example considered when proving Proposition 1.3.8, which converges to zero in q-mean but not almost surely.
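The tension in Example 1.3.23 shows up clearly in simulation: the empirical mean of Y_n stays near 1 even though the fraction of non-zero samples vanishes. A Monte Carlo sketch (function name is ours):

```python
import random

def sample_Yn(n, trials=100_000):
    """Monte Carlo view of Y_n = n*1_{[0,1/n]} under the uniform law on [0,1]:
    returns (estimated E[Y_n], fraction of samples with Y_n > 0)."""
    hits = sum(1 for _ in range(trials) if random.random() <= 1.0 / n)
    return n * hits / trials, hits / trials

random.seed(1)
for n in (2, 20, 200):
    mean, frac = sample_Yn(n)
    print(n, round(mean, 3), frac)
```

As n grows, the second column stays near 1 while the third goes to 0: a rare event of probability 1/n carries all of the expectation.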
Your next exercise summarizes all that can go wrong here.
Exercise 1.3.24. Give a counterexample to each of the following claims:
(a) If X_n → X a.s. then X_n → X in L^q, q ≥ 1;
(b) If X_n → X in L^q then X_n → X a.s.;
(c) If X_n → X in probability then X_n → X a.s.
While neither convergence in 1-mean nor convergence a.s. is a consequence of the other, a sequence of R.V.s cannot have a 1-mean limit and a different a.s. limit.
Proposition 1.3.25. If X_n →^{q.m.} X and X_n →^{a.s.} Y, then X = Y a.s.
Proof. It follows from Proposition 1.3.22 that X_n →^p X, and from part (a) of Theorem 1.3.6 that X_n →^p Y. Note that for any ε > 0 and ω ∈ Ω, if |Y(ω) − X(ω)| > 2ε, then either |X_n(ω) − X(ω)| > ε or |X_n(ω) − Y(ω)| > ε. Hence,

    P({ω : |Y(ω) − X(ω)| > 2ε}) ≤ P({ω : |X_n(ω) − X(ω)| > ε})
                                  + P({ω : |X_n(ω) − Y(ω)| > ε}).

By Definition 1.3.5, both terms on the right hand side converge to zero as n → ∞, hence P({ω : |Y(ω) − X(ω)| > 2ε}) = 0 for each ε > 0, that is, X = Y a.s. □
Remark. Note that the Y_n(ω) of Example 1.3.23 are such that EY_n = 1 for every n, but certainly E[|Y_n − 1|] ↛ 0 as n → ∞. That is, Y_n(ω) does not converge to one in 1-mean. Indeed, Y_n →^{a.s.} 0 as n → ∞, so by Proposition 1.3.25 these Y_n simply do not have any 1-mean limit as n → ∞.
1.4. Independence, weak convergence and uniform integrability
This section is devoted to independence (Subsection 1.4.3) and distribution (Subsection 1.4.1), the two fundamental aspects that differentiate probability from (general) measure theory. In the process, we consider in Subsection 1.4.1 the useful notion of convergence in law (or, more generally, weak convergence), which is weaker than all notions of convergence of Section 1.3, and devote Subsection 1.4.2 to uniform integrability, a technical tool which is highly useful when attempting to exchange the order of limit and expectation operations.
1.4.1. Distribution, law and weak convergence. As defined next, every R.V. X induces a probability measure on its range, which is called the law of X.
Definition 1.4.1. The law of a R.V. X, denoted P_X, is the probability measure on (R, B) such that P_X(B) = P({ω : X(ω) ∈ B}) for any Borel set B.
Exercise 1.4.2. For a R.V. X defined on (Ω, F, P), verify that P_X is a probability measure on (R, B).
Hint: First show that for B_i ∈ B, {ω : X(ω) ∈ ⋃_i B_i} = ⋃_i {ω : X(ω) ∈ B_i}, and that if the B_i are disjoint, then so are the sets {ω : X(ω) ∈ B_i}.
Note that the law P_X of a R.V. X : Ω → R determines the values of the probability measure P on σ(X) = σ({ω : X(ω) ≤ α}, α ∈ R). Further, recalling the remark following Definition 1.2.8, convince yourself that P_X carries exactly the same information as the restriction of P to σ(X). We shall pick up this point of view again when discussing stochastic processes in Section 3.1.
The law of a R.V. motivates the following change of variables formula, which is useful in computing expectations (the special case of a R.V. having a probability density function is given already in Proposition 1.2.29).
Proposition 1.4.3. Let X be a R.V. on (Ω, F, P) and let g be a Borel function on R. Suppose either g is non-negative or E|g(X)| < ∞. Then

(1.4.1)    E[g(X)] = ∫_R g(x) dP_X(x),

where the integral on the right hand side merely denotes the expectation of the random variable g(x) on the (new) probability space (R, B, P_X).
A good way to practice your understanding of Definition 1.4.1 is by verifying that if X = Y almost surely, then also P_X = P_Y (that is, any two random variables we consider to be the same indeed have the same law).
Exercise 1.4.4. Convince yourself that Proposition 1.4.3 implies that if P_X = P_Y, then also Eh(X) = Eh(Y) for any bounded Borel function h : R → R, and prove the converse statement: if Eh(X) = Eh(Y) for every bounded Borel function h : R → R, then necessarily P_X = P_Y.
The next concept we define, the distribution function, is closely associated with the law P_X of the R.V.
Definition 1.4.5. The distribution function F_X of a real-valued R.V. X is

    F_X(α) = P({ω : X(ω) ≤ α}) = P_X((−∞, α]),    α ∈ R.

As we have that P({ω : X(ω) ≤ α}) = F_X(α) for the generators {ω : X(ω) ≤ α} of σ(X), we are not at all surprised by the following proposition.
Proposition 1.4.6. The distribution function F_X uniquely determines the law P_X of X.
Our next example highlights the possible shape of the distribution function.
Example 1.4.7. Consider Example 1.1.4 of n coin tosses, with σ-field F_n = 2^{Ω_n}, sample space Ω_n = {H, T}^n, and the probability measure P_n(A) = Σ_{ω∈A} p_ω, where p_ω = 2^{−n} for each ω ∈ Ω_n (that is, ω = (ω₁, ω₂, . . . , ω_n) for ω_i ∈ {H, T}), corresponding to independent, fair coin tosses. Let Y(ω) = I_{{ω₁=H}}(ω) measure the outcome of the first toss. The law of this random variable is

    P_Y(B) = (1/2) 1_{{0∈B}} + (1/2) 1_{{1∈B}},

and its distribution function is

(1.4.2)    F_Y(α) = P_Y((−∞, α]) = P_n(Y(ω) ≤ α) =
               { 1,    α ≥ 1
               { 1/2,  0 ≤ α < 1
               { 0,    α < 0.
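Because Ω_n is finite, the law and distribution function of Example 1.4.7 can be obtained by brute-force enumeration. A sketch (helper names ours):

```python
from itertools import product

def law_of_first_toss(n=4):
    """Enumerate Omega_n = {H,T}^n with the uniform measure P_n and
    compute the law of Y = 1_{omega_1 = H}."""
    law = {0: 0.0, 1: 0.0}
    for omega in product("HT", repeat=n):
        law[1 if omega[0] == "H" else 0] += 2.0 ** (-n)
    return law

law = law_of_first_toss()
F_Y = lambda a: sum(p for v, p in law.items() if v <= a)  # distribution function
print(law, F_Y(-0.5), F_Y(0.5), F_Y(1.5))
```

The printed values match (1.4.2): F_Y jumps by 1/2 at each of the two possible values 0 and 1 of Y.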
Note that in general σ(X) is a strict subset of the σ-field F (in Example 1.4.7 we have that σ(Y) determines the probability measure for the first coin toss, but tells us nothing about the probability measure assigned to the remaining n − 1 tosses). Consequently, though the law P_X determines the probability measure P on σ(X), it usually does not completely determine P.
Example 1.4.7 is somewhat generic, in the sense that if the R.V. X is a simple function, then its distribution function F_X is piecewise constant, with jumps at the possible values that X takes and jump sizes equal to the corresponding probabilities.
In contrast, the distribution function of a R.V. with a density (per Definition 1.2.23) is almost everywhere differentiable; that is,
Proposition 1.4.8. A R.V. X has a (probability) density (function) f_X if and only if its distribution function F_X can be expressed as

    F_X(α) = ∫_{−∞}^{α} f_X(x) dx,

for all α ∈ R (where f_X ≥ 0 and ∫_{−∞}^{∞} f_X(x) dx = 1). Such an F_X is continuous and almost everywhere differentiable, with dF_X/dx(x) = f_X(x) for almost every x.
Example 1.4.9. The distribution function for the R.V. U(ω) = ω corresponding to Example 1.1.11 is

(1.4.3)    F_U(α) = P(U ≤ α) = P(U ∈ [0, α]) =
               { 1,  α > 1
               { α,  0 ≤ α ≤ 1
               { 0,  α < 0,

and its density is f_U(u) = 1 for 0 ≤ u ≤ 1 and f_U(u) = 0 otherwise.
Every real-valued R.V. X has a distribution function, but not necessarily a density. For example, X = 0 w.p.1 has distribution function F_X(α) = 1_{[0,∞)}(α). Since F_X is discontinuous at 0, the R.V. X does not have a density.
The distribution function of any R.V. is necessarily non-decreasing. Somewhat surprisingly, there are continuous, non-decreasing functions that do not equal the integral of their almost everywhere derivative. Such are the distribution functions of the non-discrete random variables that do not have a density. For the characterization of all possible distribution functions, c.f. [GS01, Section 2.3].
Associated with the law of R.V.s is the important concept of weak convergence, which we define next.
Definition 1.4.10. We say that R.V.s X_n converge in law (or weakly) to a R.V. X, denoted X_n →^L X, if F_{X_n}(α) → F_X(α) as n → ∞ for each fixed α which is a continuity point of F_X (where α is a continuity point of F_X(·) if F_X(α_k) → F_X(α) whenever α_k → α). In some books this is also called convergence in distribution, denoted X_n →^D X.
The next proposition, whose proof we do not provide, gives an alternative definition of convergence in law. Though not as easy to check as Definition 1.4.10, this alternative definition applies to more general R.V.s whose range is not R, for example to random vectors X_n with values in R^d.
Proposition 1.4.11. X_n →^L X if and only if for each bounded h that is continuous on the range of X we have Eh(X_n) → Eh(X) as n → ∞.
Remark. Note that F_X(α) = P_X((−∞, α]) = E[I_{(−∞,α]}(X)] involves the discontinuous function h(x) = 1_{(−∞,α]}(x). Restricting the convergence in law to continuity points of F_X is what makes Proposition 1.4.11 possible.
Exercise 1.4.12. Show that if X_n →^L X and f is a continuous function, then f(X_n) →^L f(X).
We next illustrate the concept of convergence in law by a special instance of the Central Limit Theorem (one that is also called the Normal approximation of the Binomial distribution).
Example 1.4.13. Let

    S̄_n = (1/√n) Σ_{i=1}^{n} (I_{{ω_i=H}} − I_{{ω_i=T}})

be the normalized difference between head and tail counts in n independent fair coin tosses, that is, using the probability space (Ω_n, F_n, P_n) of Example 1.4.7 (convince yourself that S̄_n is a R.V. with respect to (Ω_n, F_n)). In this case, the Central Limit Theorem (C.L.T.) is the statement that S̄_n →^L G, where G is a Gaussian R.V. of zero mean and variance one, that is, having the distribution function

    F_G(g) = ∫_{−∞}^{g} (1/√(2π)) e^{−x²/2} dx

(it is not hard to check that indeed E(G) = 0 and E(G²) = 1; such a R.V. is sometimes also called standard Normal).
By Proposition 1.4.11, this C.L.T. tells us that E[h(S̄_n)] → E[h(G)] as n → ∞ for each continuous and bounded function h : R → R, where

(1.4.4)    Eh(S̄_n) = Σ_{ω∈Ω_n} 2^{−n} h(S̄_n(ω)),

(1.4.5)    Eh(G) = ∫_{−∞}^{∞} h(x) (e^{−x²/2}/√(2π)) dx,

and the expression (1.4.5) is an instance of Proposition 1.2.29 (as G has a density f_G(g) = (2π)^{−1/2} e^{−g²/2}). So, the C.L.T. allows us to approximate, for all large enough n, the non-computable sums (1.4.4) by the computable integrals (1.4.5).
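For moderate n the sum (1.4.4) is in fact computable, since S̄_n depends on ω only through the head count, which is Binomial(n, 1/2). The sketch below (helper names ours) compares (1.4.4), evaluated exactly via binomial weights, with a quadrature of (1.4.5) for h = cos, for which E[cos(G)] = e^{−1/2}:

```python
import math

def lhs_binomial(h, n):
    """Exact value of the sum (1.4.4): with k heads, S_n/sqrt(n) = (2k - n)/sqrt(n)."""
    return sum(math.comb(n, k) * 2.0 ** (-n) * h((2 * k - n) / math.sqrt(n))
               for k in range(n + 1))

def rhs_gaussian(h, grid=20000, lim=10.0):
    """The integral (1.4.5) by the trapezoid rule on [-lim, lim]."""
    dx = 2 * lim / grid
    ys = [h(-lim + i * dx) * math.exp(-(-lim + i * dx) ** 2 / 2) / math.sqrt(2 * math.pi)
          for i in range(grid + 1)]
    return dx * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

for n in (10, 100, 1000):
    print(n, round(lhs_binomial(math.cos, n), 5), round(rhs_gaussian(math.cos), 5))
```

The two columns agree to several digits already for n in the hundreds, illustrating the approximation the C.L.T. provides.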
Here is another example of convergence in law, this time in the context of extreme value theory.
Exercise 1.4.14. Let M_n = max_{1≤i≤n} T_i, where the T_i, i = 1, 2, . . ., are independent Exponential(λ) random variables (i.e., F_{T_i}(t) = 1 − e^{−λt} for some λ > 0, all t ≥ 0 and any i). Find non-random numbers a_n and a non-zero random variable M_∞ such that (M_n − a_n) converges in law to M_∞.
Hint: Explain why F_{M_n−a_n}(t) = (1 − e^{−λt} e^{−λa_n})^n and find a_n for which (1 − e^{−λt} e^{−λa_n})^n converges for each fixed t, with a limit that is strictly between 0 and 1.
If the limit X has a density, or more generally whenever F_X is a continuous function, the convergence in law of X_n to X is equivalent to the pointwise convergence of the corresponding distribution functions. Such is the case in Example 1.4.13, since G has a density.
As we demonstrate next, in general the convergence in law of X_n to X is strictly weaker than the pointwise convergence of the corresponding distribution functions, and the rate of convergence of Eh(X_n) to Eh(X) depends on the specific function h we consider.
Example 1.4.15. The random variables X_n = 1/n a.s. converge in law to X = 0, while F_{X_n}(α) does not converge to F_X(α) at the discontinuity point α = 0 of F_X. Indeed, F_{X_n}(α) = 1_{[1/n,∞)}(α) converges to F_X(α) = 1_{[0,∞)}(α) for each α ≠ 0, since F_X(α) = 0 = F_{X_n}(α) for all n and α < 0, whereas for α > 0 and all n large enough, α > 1/n, in which case F_{X_n}(α) = 1 = F_X(α) as well. However, F_X(0) = 1, while F_{X_n}(0) = 0 for all n.
Further, while Eh(X_n) = h(1/n) → h(0) = Eh(X) for each fixed continuous function h, the rate of convergence clearly varies with the choice of h.
In contrast to the preceding example, we next provide a very explicit necessary and sufficient condition for convergence in law of integer-valued random variables.
Exercise 1.4.16. Let X_n, 1 ≤ n ≤ ∞, be non-negative integer-valued R.V.s. Show that X_n →^L X_∞ if and only if P(X_n = m) → P(X_∞ = m) as n → ∞, for all m.
The next exercise provides a few additional examples, as well as the dual of Exercise 1.4.16 for the case in which each of the random variables X_n, 1 ≤ n ≤ ∞, has a density.
Exercise 1.4.17.
(a) Give an example of random variables X and Y on the same probability space, such that P_X = P_Y while P({ω : X(ω) ≠ Y(ω)}) = 1.
(b) Give an example of random variables X_n →^L X_∞ where each X_n has a probability density function, but X_∞ does not.
(c) Suppose X_n, 1 ≤ n ≤ ∞, have densities f_{X_n} such that f_{X_n}(s) → f_{X_∞}(s) as n → ∞ for almost every s. Setting g_n(s) = 2(1 − f_{X_n}(s)/f_{X_∞}(s))₊ where f_{X_∞}(s) > 0 (and g_n(s) = 0 elsewhere), explain why

    |F_{X_n}(α) − F_{X_∞}(α)| ≤ ∫_R |f_{X_n}(s) − f_{X_∞}(s)| ds = ∫_R g_n(s) f_{X_∞}(s) ds,

why it follows from Corollary 1.4.28 that ∫_R g_n(s) f_{X_∞}(s) ds → 0 as n → ∞, and how you deduce from this that X_n →^L X_∞.
Our next proposition states that convergence in probability implies convergence in law. This is perhaps one of the reasons we also call the latter weak convergence.
Proposition 1.4.18. If X_n →^p X, then X_n →^L X.
Convergence in law does not require the X_n to be defined on the same probability space (indeed, in Example 1.4.13 we have S̄_n on a different probability space for each n; though we can embed all of these R.V.s in the same larger measurable space (Ω_∞, F_c) of Example 1.1.13, we still do not see their limit in law G on Ω_∞). Consequently, the converse of Proposition 1.4.18 cannot hold. Nevertheless, as we state next, when the limiting R.V. is a non-random constant, convergence in law is equivalent to convergence in probability.
Proposition 1.4.19. If X_n →^L X and X is a non-random constant (almost surely), then X_n →^p X as well.
We next define a natural extension of the convergence in law, which is very convenient for dealing with the convergence of stochastic processes.
Definition 1.4.20. We say that a sequence of probability measures Q_n on a topological space S (i.e., a set S with a notion of open sets, or topology) and its Borel σ-field B_S (= the σ-field generated by the open subsets of S) converges weakly to a probability measure Q if, for each fixed h continuous and bounded on S,

    ∫_S h(ω) dQ_n(ω) → ∫_S h(ω) dQ(ω)

as n → ∞ (here ∫_S h dQ denotes the expectation of the R.V. h(ω) on the probability space (S, B_S, Q), while ∫_S h dQ_n corresponds to the space (S, B_S, Q_n)). We shall use Q_n ⇒ Q to denote weak convergence.
Example 1.4.21. In particular, X_n →^L X if and only if P_{X_n} ⇒ P_X, that is, when

    ∫_R h(α) dP_{X_n}(α) → ∫_R h(α) dP_X(α)

for each fixed h : R → R continuous and bounded.
1.4.2. Uniform integrability, limits and expectation. Recall that either convergence a.s. or convergence in q-mean implies convergence in probability, which in turn gives convergence in law. While we have seen that a.s. convergence and convergence in q-mean are in general non-comparable, we next give an integrability condition that, together with convergence in probability, implies convergence in q-mean.
Definition 1.4.22. A collection of R.V.s {X_α, α ∈ I} is called uniformly integrable (U.I.) if

    lim_{M→∞} sup_{α∈I} E[|X_α| I_{|X_α|>M}] = 0.

Theorem 1.4.23. (For a proof see [GS01, Theorem 7.10.3].) If X_n →^p X and {|X_n|^q} are U.I., then X_n →^{q.m.} X.
We next detail a few special cases of U.I. collections of R.V.s, showing among other things that any finite or uniformly bounded collection of integrable R.V.s is U.I.
Example 1.4.24. By Exercise 1.2.27, any single X ∈ L¹ is U.I. Applying the same reasoning, if |X_α| ≤ Y for some Y ∈ L¹ and all α, then {X_α} is U.I. (since then |X_α| I_{|X_α|>M} ≤ Y I_{Y>M}, hence E[|X_α| I_{|X_α|>M}] ≤ E[Y I_{Y>M}], which does not depend on α and, as you have shown in Exercise 1.2.27, converges to zero when M → ∞). In particular, a collection of R.V.s {X_α} is U.I. when for some non-random C < ∞ and all α we have |X_α| ≤ C (take Y = C in the preceding, or just try M > C in Definition 1.4.22). Another such example is the U.I. collection {c_α Y}, where Y ∈ L¹ and the non-random c_α satisfy |c_α| ≤ 1.
An example of a collection that is not U.I. is provided by the probability space (Ω_∞, F_c, P) of fair coin tosses, as in Example 1.3.3, with X_n(ω) = inf{i > n : ω_i = H}, that is, the index of the first toss after the n-th one for which the coin lands Head. Indeed, if n ≥ M, then X_n > n ≥ M, hence E[X_n I_{X_n>M}] = E[X_n] > n, implying that {X_n} is not U.I.
In the next exercise you provide a criterion for U.I. that is very handy and often used by us (in future chapters).
Exercise 1.4.25. Show that a collection of R.V.s {X_α : α ∈ I} is uniformly integrable if sup_α E[f(|X_α|)] < ∞ for some non-negative function f with f(r)/r → ∞ as r → ∞ (for example, f(r) = r^q with q > 1).
In contrast, sup_α E|X_α| < ∞ alone is not enough for the collection to be U.I. The following lemma shows what is required in addition; for a proof see [GS01, Lemma 7.10.6].
Lemma 1.4.26. A collection of R.V.s {X_α, α ∈ I} is U.I. if and only if:
(a) sup_α E|X_α| < ∞;
(b) for any ε > 0, there exists δ > 0 such that E(|X_α| I_A) < ε for all α ∈ I and events A for which P(A) < δ.
We now address the technical but important question of when one can exchange the order of limit and expectation operations. To see that this is not always possible, consider the R.V.s Y_n(ω) = n 1_{[0,n^{−1}]}(ω) of Example 1.3.23 (defined on the probability space (R, B, U)). Since E[Y_n I_{Y_n>M}] = 1_{n>M}, it follows that sup_n E[Y_n I_{Y_n>M}] = 1 for all M, hence the Y_n are not a U.I. collection. Indeed, Y_n → 0 a.s. while lim_n E(Y_n) = 1 > 0.
In view of Example 1.4.24, our next result is merely a corollary of Theorem 1.4.23.
Theorem 1.4.27 (Dominated Convergence). If there exists a random variable Y such that EY < ∞, |X_n| ≤ Y for all n, and X_n →^p X, then EX_n → EX.
Considering non-random Y, we get the following corollary.
Corollary 1.4.28 (Bounded Convergence). Suppose |X_n| ≤ C for some finite constant C and all n. If X_n →^p X, then also EX_n → EX.
In the case of monotone (upward) convergence of non-negative R.V.s X_n to X, we can exchange the order of limit and expectation even when X is not integrable; that is,
Theorem 1.4.29 (Monotone Convergence). If X_n ≥ 0 and X_n(ω) ↑ X(ω) for a.e. ω, then EX_n ↑ EX. This applies even if X(ω) = ∞ for some ω.
Suppose X_n ≥ 0 and X_n(ω) ↓ X(ω) for a.e. ω. If we assume in addition that EX₁ < ∞, then we get the convergence EX_n ↓ EX already from the dominated convergence theorem. However, considering a random variable Z ≥ 0 such that EZ = ∞ and X_n(ω) = n^{−1} Z(ω) ↓ 0 as n → ∞, we see that unlike monotonicity upwards, monotonicity downwards does not really help much in the convergence of expectations.
To practice your understanding, solve the following exercises.
Exercise 1.4.30. Use Monotone Convergence to show that

    E(Σ_{n=1}^{∞} Y_n) = Σ_{n=1}^{∞} E Y_n,

for any sequence of non-negative R.V.s Y_n. Deduce that if X ≥ 0 and A_n are disjoint sets with P(⋃_n A_n) = 1, then

    E(X) = Σ_{n=1}^{∞} E(X I_{A_n}).

Further, show that this applies also for any X ∈ L¹.
Exercise 1.4.31. Prove Proposition 1.4.3, using the following four steps.
(a) Verify that the identity (1.4.1) holds for indicator functions g(x) = I_B(x), B ∈ B.
(b) Using linearity of the expectation, check that this identity holds whenever g(x) is a (non-negative) simple function on (R, B).
(c) Combine the definition of the expectation via the identity (1.2.2) with Monotone Convergence to deduce that (1.4.1) is valid for any non-negative Borel function g(x).
(d) Recall that g(x) = g(x)₊ − g(x)₋ for g(x)₊ = max(g(x), 0) and g(x)₋ = −min(g(x), 0), both non-negative Borel functions. Thus, using Definition 1.2.25, conclude that (1.4.1) holds whenever E|g(X)| < ∞.
Remark. The four-step procedure you just employed is a routine way of verifying identities about expectations. Not surprisingly, it is called the standard machine.
As we next demonstrate, multiplication by an integrable positive random variable results in a change of measure that produces an equivalent probability measure.
Exercise 1.4.32 (Change of measure). Suppose Z ≥ 0 is a random variable on (Ω, F, P) such that EZ = 1.
(a) Show that P̃ : F → R given by P̃(A) = E[Z I_A] is a probability measure on (Ω, F).
(b) Denoting by Ẽ[X] the expectation of a non-negative random variable X on (Ω, F) under the probability measure P̃, show that Ẽ X = E[XZ].
Hint: Following the procedure outlined in Exercise 1.4.31, verify this identity first for X = I_A, then for non-negative X ∈ SF, combining (1.2.2) and Monotone Convergence to extend it to all X ≥ 0.
(c) Similarly, show that if Z > 0, then also EY = Ẽ[Y/Z] for any random variable Y ≥ 0 on (Ω, F). Deduce that in this case P and P̃ are equivalent probability measures on (Ω, F); that is, P(A) = 0 if and only if P̃(A) = 0.
Here is an application of the preceding exercise, where by a suitable explicit change of measure you change the law of a random variable W without changing the function W : Ω → R.
Exercise 1.4.33. Suppose a R.V. W on a probability space (Ω, F, P) has the N(μ, 1) law of Definition 1.2.30.
(a) Check that Z = exp(−μW + μ²/2) is a positive random variable with EZ = 1.
(b) Show that under the corresponding equivalent probability measure P̃ of Exercise 1.4.32, the R.V. W has the N(0, 1) law.
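Both claims of the exercise can be confirmed numerically: under P the moments Ẽ[W^k] = E[W^k Z] are integrals against the N(μ, 1) density, and multiplying that density by Z = exp(−μw + μ²/2) should collapse it to the N(0, 1) density. A quadrature sketch (helper name and parameters ours):

```python
import math

def tilted_moment(k, mu, grid=40000, lim=12.0):
    """E[W^k * Z] by the trapezoid rule, where W ~ N(mu, 1) and
    Z = exp(-mu*W + mu^2/2).  Expected answers: the N(0,1) moments."""
    dx = 2 * lim / grid
    total = 0.0
    for i in range(grid + 1):
        w = mu - lim + i * dx
        dens = math.exp(-(w - mu) ** 2 / 2) / math.sqrt(2 * math.pi)  # N(mu,1) density
        z = math.exp(-mu * w + mu * mu / 2)                            # Radon-Nikodym factor
        weight = 0.5 if i in (0, grid) else 1.0
        total += weight * (w ** k) * z * dens * dx
    return total

print([round(tilted_moment(k, 1.7), 6) for k in (0, 1, 2)])
```

Up to quadrature error the output is 1, 0, 1, i.e., Ẽ Z-moments of a standard normal, consistent with EZ = 1 and with part (b).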
1.4.3. Independence. We say that two events A, B ∈ F are P-mutually independent if P(A ∩ B) = P(A)P(B). For example, suppose two fair dice are thrown. The events E₁ = {sum of the two dice is 6} and E₂ = {first die shows 4} are then not independent, since

    P(E₁) = P({(1,5), (2,4), (3,3), (4,2), (5,1)}) = 5/36,   P(E₂) = P({ω : ω₁ = 4}) = 1/6,

and

    P(E₁ ∩ E₂) = P({(4,2)}) = 1/36 ≠ P(E₁)P(E₂).

However, one can check that E₂ and E₃ = {sum of the dice is 7} are independent.
In the context of independence we often do not explicitly mention the probability measure P. However, events A, B ∈ F may well be independent with respect to one probability measure on (Ω, F) and not independent with respect to another.
Exercise 1.4.34. Provide a measurable space (Ω, F), two probability measures P and Q on it, and events A and B that are P-mutually independent but not Q-mutually independent.
More generally, we define the independence of events as follows.
Definition 1.4.35. Events A_i ∈ F are P-mutually independent if for any L < ∞ and distinct indices i₁, i₂, . . . , i_L,

    P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{i_L}) = ∏_{k=1}^{L} P(A_{i_k}).
In analogy with the mutual independence of events, we define the independence of two R.V.s and, more generally, that of two σ-fields.
Definition 1.4.36. Two σ-fields H, G ⊆ F are P-independent if

    P(G ∩ H) = P(G)P(H),   ∀G ∈ G, H ∈ H

(see [Bre92, Definition 3.1] for the independence of more than two σ-fields). The random vectors (X₁, . . . , X_n) and (Y₁, . . . , Y_m) are independent if the corresponding σ-fields σ(X₁, . . . , X_n) and σ(Y₁, . . . , Y_m) are independent (see [Bre92, Definition 3.2] for the independence of k > 2 random vectors).
To practice your understanding of this definition, solve the following exercise.
Exercise 1.4.37.
(a) Verify that if the σ-fields G and H are P-independent, G̃ ⊆ G and H̃ ⊆ H, then G̃ and H̃ are also P-independent.
(b) Deduce that if the random vectors (X₁, . . . , X_n) and (Y₁, . . . , Y_m) are P-independent, then so are the R.V.s h(X₁, . . . , X_n) and g(Y₁, . . . , Y_m) for any Borel functions h : R^n → R and g : R^m → R.
Beware that pairwise independence (of each pair A_k, A_j for k ≠ j) does not imply mutual independence of all the events in question, and the same applies to three or more random variables. Here is an illustrating example.
Exercise 1.4.38. Consider the sample space Ω = {0, 1, 2}² with probability measure on (Ω, 2^Ω)

The random variable X(ω) = Σ_{i=1}^{N(ω)} ξ_i(ω) = Σ_{i=1}^{∞} ξ_i(ω) I_{N(ω)≥i} has the expectation

    E(X) = Σ_{i=1}^{∞} E(ξ_i) P(N ≥ i) = E(ξ₁) E(N).
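An identity of this form (a version of Wald's identity) can be checked by exact enumeration on a toy model, assuming, as the displayed formula suggests, that N is independent of the i.i.d. ξ_i; the specific laws below are ours:

```python
from fractions import Fraction
from itertools import product

xi_law = {1: Fraction(1, 2), 3: Fraction(1, 2)}                    # law of each xi_i (i.i.d.)
N_law = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}  # law of N, independent of the xi

E_xi = sum(v * p for v, p in xi_law.items())
E_N = sum(n * p for n, p in N_law.items())

# Exact E[xi_1 + ... + xi_N] by brute-force enumeration over (N, xi_1, ..., xi_N).
E_X = Fraction(0)
for n, pn in N_law.items():
    for xs in product(xi_law, repeat=n):
        p = pn
        for x in xs:
            p *= xi_law[x]
        E_X += p * sum(xs)

print(E_X, E_xi * E_N)
```

Both printed values agree exactly, as they must whenever the independence assumption holds.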
CHAPTER 2
Conditional expectation and Hilbert spaces
The most important concept in probability theory is the conditional expectation, to which this chapter is devoted. In contrast with the elementary definition often used for a finite or countable sample space, the conditional expectation, as defined in Section 2.1, is itself a random variable. A brief introduction to the theory of Hilbert spaces is provided in Section 2.2. This theory is used here to show the existence of the conditional expectation. It is also applied in Section 5.1 to construct the Brownian motion, one of the main stochastic processes of interest to us. Section 2.3 details the important properties of the conditional expectation. Finally, in Section 2.4 we represent the conditional expectation as the expectation with respect to the random regular conditional probability distribution.
2.1. Conditional expectation: existence and uniqueness
After reviewing in Subsection 2.1.1 the elementary definition of the conditional expectation for discrete random variables, we provide a definition that applies to any square-integrable R.V. We then show in Subsection 2.1.2 how to extend it to the general case of integrable R.V.s and to conditioning on any σ-field.
2.1.1. Conditional expectation in the discrete and L² cases. Suppose the random variables X and Y take on finitely many values. Then we know that E(X|Y = y) = Σ_x x P(X = x|Y = y), provided P(Y = y) > 0. Moreover, by the formula P(A|B) = P(A ∩ B)/P(B), this amounts to

    E(X|Y = y) = Σ_x x · P(X = x, Y = y)/P(Y = y).

With f(y) denoting the function E[X|Y = y], we may now consider the random variable E(X|Y) = f(Y), which we call the Conditional Expectation (in short, C.E.) of X given Y. For a general R.V. X we similarly define

    f(y) := E(X|Y = y) = E(X I_{{Y=y}})/P(Y = y)

for any y such that P(Y = y) > 0, and let E(X|Y) = f(Y) (see [GS01, Definition 3.7.3] for a similar treatment).
Example 2.1.1. For example, if the R.V.s X = ω₁ and Y = ω₂ for the probability space with F = 2^Ω, Ω = {1, 2}², and

    P({(1,1)}) = 0.5,  P({(1,2)}) = 0.1,  P({(2,1)}) = 0.1,  P({(2,2)}) = 0.3,

then

    P(X = 1|Y = 1) = P(X = 1, Y = 1)/P(Y = 1) = 5/6,

implying that P(X = 2|Y = 1) = 1/6, and

    E(X|Y = 1) = 1 · (5/6) + 2 · (1/6) = 7/6.

Likewise, check that E(X|Y = 2) = 7/4, hence E(X|Y) = (7/6) I_{{Y=1}} + (7/4) I_{{Y=2}}.
This approach also works for Y taking countably many values. However, it is limited to this case, as it requires us to have E(I_{{Y=y}}) = P(Y = y) > 0. To bypass this difficulty, observe that Z = E(X|Y) = f(Y) as defined before necessarily satisfies the identity

    0 = E[(X − f(y)) I_{{Y=y}}] = E[(X − Z) I_{{Y=y}}]

for any y ∈ R. Further, here also E[(X − Z) I_{{Y∈B}}] = 0 for any Borel set B, that is, E[(X − Z) I_A] = 0 for any A ∈ σ(Y). As we see in the sequel, demanding that the last identity hold for Z = f(Y) provides us with a definition of Z = E(X|Y) that applies for any two R.V.s X and Y on the same probability space (Ω, F, P), subject only to the mild condition that E(|X|) < ∞ (for more details and proofs, see [GS01, Section 7.9]).
To gain better insight into this more abstract definition, consider first X ∈ L²(Ω, F, P) and the optimization problem

(2.1.1)    d² = inf{E[(X − W)²] : W ∈ H_Y}

for H_Y = L²(Ω, σ(Y), P). From the next proposition we see that the conditional expectation is the solution of this optimization problem.
Proposition 2.1.2. There exists an (a.s.) unique optimal Z ∈ H_Y such that d² = E[(X − Z)²]. Further, the optimality of Z is equivalent to the orthogonality property

(2.1.2)    E[(X − Z)V] = 0,   ∀V ∈ H_Y.

Proof. We state without proof the existence of Z ∈ H_Y having the orthogonality property (2.1.2), as we shall later see that this is a consequence of the more general theory of Hilbert spaces.
Suppose that both Z₁, Z₂ ∈ H_Y are such that E[(X − Z₁)V] = 0 and E[(X − Z₂)V] = 0 for all V ∈ H_Y. Taking the difference of the two equations, we see that E[(Z₁ − Z₂)V] = 0 for any such V. Take V = Z₁ − Z₂ ∈ H_Y to get E[(Z₁ − Z₂)²] = 0, so Z₁ = Z₂ a.s.
Suppose now that W, Z ∈ H_Y and also that E[(X − Z)V] = 0 for any V ∈ H_Y. It is easy to check the identity

(2.1.3)    (1/2)[(X − W)² − (X − Z)²] = (X − Z)(Z − W) + (1/2)(Z − W)²,

which holds for any X, W, Z. Considering V = Z − W ∈ H_Y and taking expectations, we get by linearity of the expectation that

    E[(X − W)²] − E[(X − Z)²] = E(V²) ≥ 0,

with equality if and only if V = 0, that is, W = Z a.s. Thus, the Z ∈ H_Y that satisfies (2.1.2) is the a.s. unique minimizer of E[(X − W)²] among all W ∈ H_Y.
We finish the proof by demonstrating the converse. Namely, suppose Z ∈ H_Y is such that E[(X − W)²] ≥ E[(X − Z)²] for all W ∈ H_Y. Taking the expectation in
(2.1.3), we thus have that

    E[(X − Z)(Z − W)] + (1/2) E[(Z − W)²] ≥ 0.

Fixing a non-random y ∈ R and a R.V. V ∈ H_Y, we apply this inequality for W = Z − yV ∈ H_Y to get g(y) = y E[(X − Z)V] + (1/2) y² E[V²] ≥ 0. Since E[V²] < ∞ and g(0) = 0, the requirement that g(y) ≥ 0 for all y ∈ R implies that g′(0) = E[(X − Z)V] = 0, which is the orthogonality property (2.1.2). □
Definition 2.1.3. The conditional expectation E(X|Y) of X ∈ L²(Ω, F, P) given Y is the a.s. unique Z ∈ H_Y satisfying the orthogonality property (2.1.2), that is, the a.s. unique solution of the optimization problem (2.1.1).
Suppose now that Y is a simple random variable, that is, Y = Σ_{i=1}^{n} y_i I_{A_i} for some n < ∞, distinct y_i and disjoint sets A_i = {ω : Y(ω) = y_i} such that P(A_i) > 0 for all i.
Recall from Theorem 1.2.14 that any W ∈ H_Y is of the form W = g(Y) for some Borel function g such that E(g(Y)²) < ∞. As Y(ω) takes only finitely many possible values y₁, . . . , y_n, every function g(·) on this set is bounded (hence square-integrable) and measurable. Further, the sets A_i are disjoint, resulting in g(Y) = Σ_{i=1}^{n} g(y_i) I_{A_i}. Hence, in this case, H_Y = {Σ_{i=1}^{n} v_i I_{A_i} : v_i ∈ R}, and the optimization problem (2.1.1) is equivalent to

    d² = inf_{v_i} E[(X − Σ_{i=1}^{n} v_i I_{A_i})²]
       = E(X²) + inf_{v_i} { Σ_{i=1}^{n} P(A_i) v_i² − 2 Σ_{i=1}^{n} v_i E[X I_{A_i}] },

the solution of which is obtained for v_i = E[X I_{A_i}]/P(A_i). We conclude that

(2.1.4)    E(X|Y) = Σ_i (E[X I_{{Y=y_i}}]/P(Y = y_i)) I_{{Y=y_i}},

in agreement with our first definition of the conditional expectation in the case of discrete random variables.
Exercise 2.1.4. Let Ω = {a, b, c, d}, with event space F = 2^Ω and probability measure such that P({a}) = 1/2, P({b}) = 1/4, P({c}) = 1/6 and P({d}) = 1/12.
(a) Find σ(I_A), σ(I_B) and σ(I_A, I_B) for the subsets A = {a, d} and B = {b, c, d} of Ω.
(b) Let H = L²(Ω, σ(I_B), P). Find the conditional expectation E(I_A|I_B) and the value of d² = inf{E[(I_A − W)²] : W ∈ H}.
2.1.2. Conditional expectation: the general case. You should verify that the conditional expectation of Definition 2.1.3 is a R.V. on the measurable space (Ω, F_Y) whose dependence on Y is only via F_Y. This is consistent with our interpretation of F_Y as conveying the information content of the R.V. Y. In particular, E[X|Y] = E[X|Y′] whenever F_Y = F_{Y′} (for example, if Y′ = h(Y) for some invertible Borel function h). By this reasoning we may, and often shall, use the notation E(X|F_Y) for E(X|Y), and we next define E(X|G) for X ∈ L¹(Ω, F, P) and an arbitrary σ-field G ⊆ F. To this end, recall that Definition 2.1.3 is based on the orthogonality property (2.1.2). As we see next, in the general case where X is only in L¹, we have to consider the smaller class of almost surely bounded R.V.s V on (Ω, G), in order to guarantee that the expectation of XV is well defined.
Definition 2.1.5. The conditional expectation of X ∈ L¹(Ω, F, P) given a σ-field G ⊆ F is the R.V. Z on (Ω, G) such that

(2.1.5)    E[(X − Z) I_A] = 0,   ∀A ∈ G,

and E(X|Y) corresponds to the special case of G = F_Y in (2.1.5).
As we state next, in the special case of X ∈ L²(Ω, F, P) this definition coincides with the C.E. we get from Definition 2.1.3.
Theorem 2.1.6. The C.E. of an integrable R.V. X given any σ-field G exists and is a.s. unique. That is, there exists Z measurable on G that satisfies (2.1.5), and if Z₁ and Z₂ are both measurable on G and satisfy (2.1.5), then Z₁ = Z₂ a.s.
Further, if in addition E(X²) < ∞, then such Z also satisfies

    E[(X − Z)V] = 0,   ∀V ∈ L²(Ω, G, P),

hence for G = F_Y the R.V. Z coincides with that of Definition 2.1.3.
Proof. (Omit at first reading.) We only outline the key ideas of the proof (for
a detailed treatment see [GS01, Theorem 7.9.26]).
Step 1. Given $X \geq 0$, $EX < \infty$, we define $X_n(\omega) = \min\{X(\omega), n\}$. Then, $X_n \uparrow X$
and $X_n$ is bounded for each $n$. In particular, $X_n \in L^2$, so the C.E. $E[X_n|\mathcal{G}]$ exists
by orthogonal projection in Hilbert spaces (see Theorem 2.2.12, in analogy with
Proposition 2.1.2). Further, the R.V. $E[X_n|\mathcal{G}]$ are non-decreasing in $n$, so we define
the C.E. of $X$ given $\mathcal{G}$ as
$$\text{(2.1.6)} \qquad E[X|\mathcal{G}] := \lim_{n \to \infty} E[X_n|\mathcal{G}].$$
One then checks that:
(1) The limit in (2.1.6) exists and is almost surely finite;
(2) This limit does not depend on the specific sequence $X_n$ we have chosen, in the
sense that if $Y_n \uparrow X$ and $E(Y_n^2) < \infty$ then $\lim_n E[Y_n|\mathcal{G}] = E[X|\mathcal{G}]$;
(3) The R.V. obtained by (2.1.6) agrees with Definition 2.1.5, namely, it satisfies
the orthogonality property (2.1.5).
Step 2. If $E|X| < \infty$ then $E[X|\mathcal{G}] = E[X^+|\mathcal{G}] - E[X^-|\mathcal{G}]$.
We often use the notation $P(A|\mathcal{G})$ for $E(I_A|\mathcal{G})$ and any $A \in \mathcal{F}$. The reason for this
is made clear in Section 2.4, when we relate the C.E. with the notion of (regular)
conditional probability.
We detail in Section 2.3 a few of the many properties of the C.E. that are easy to
verify directly from Definition 2.1.5. However, some properties are much easier to
see when considering Definition 2.1.3 and the proof of Theorem 2.1.6. For example,
Proposition 2.1.7. If $X$ is a non-negative R.V., then a.s. $E(X|\mathcal{G}) \geq 0$.

Proof. Suppose first that $X \in L^2$ is non-negative and $Z \in L^2(\Omega, \mathcal{G}, P)$. Note
that $Z^+ = \max(Z, 0) \in L^2(\Omega, \mathcal{G}, P)$ as well. Moreover, in this case $(X - Z^+)^2 \leq (X - Z)^2$ a.s., hence also $E[(X - Z^+)^2] \leq E[(X - Z)^2]$. Consequently, if $Z = E(X|\mathcal{G})$,
then by Definition 2.1.3, $Z = Z^+$ a.s., that is, $Z \geq 0$ a.s. Following the proof of
Theorem 2.1.6 we see that if $X \in L^1$ is non-negative then so are the $X_n \in L^2$. We have
already seen that the $E[X_n|\mathcal{G}]$ are non-negative, and so the same applies for their limit
$E[X|\mathcal{G}]$.
As you show next, identities such as (2.1.4) apply for conditioning on any $\sigma$-field
generated by a countable partition.

Exercise 2.1.8. Let $\mathcal{G} = \sigma(B_i)$ for a countable, $\mathcal{F}$-measurable partition $B_1, B_2, \ldots$
of $\Omega$ into sets of positive probability. That is, $B_i \in \mathcal{F}$, $\bigcup_i B_i = \Omega$, $B_i \cap B_j = \emptyset$ for $i \neq j$,
and $P(B_i) > 0$.
(a) Show that $G \in \mathcal{G}$ if and only if $G$ is a union of sets from $\{B_i\}$.
(b) Suppose $Y$ is a random variable on $(\Omega, \mathcal{G})$. Show that $Y(\omega) = \sum_i c_i I_{B_i}(\omega)$
for some non-random $c_i \in \mathbb{R}$.
(c) Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$. Show that $E(X|\mathcal{G}) = \sum_i c_i I_{B_i}$ with $c_i = E(X I_{B_i})/P(B_i)$.
(d) Deduce that for an integrable random variable $X$ and $B \in \mathcal{F}$ such that
$0 < P(B) < 1$,
$$E(X|\sigma(B)) = \frac{E(I_B X)}{P(B)}\, I_B + \frac{E(I_{B^c} X)}{P(B^c)}\, I_{B^c},$$
and in particular $E(I_A|I_B) = P(A|B) I_B + P(A|B^c) I_{B^c}$ for any $A \in \mathcal{F}$
(where $P(A|B) = P(A \cap B)/P(B)$).
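The formula of part (c) is easy to verify numerically on a finite sample space. The following is a minimal sketch (the 12-point space, the weights and the partition labels are all made up for illustration): it computes $E(X|\mathcal{G})$ atom by atom and checks the orthogonality property (2.1.5) on each atom.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite sample space: 12 equally likely outcomes.
n = 12
p = np.full(n, 1.0 / n)           # P({omega}) for each outcome
X = rng.normal(size=n)            # an arbitrary R.V. X on this space

# A partition B_1, B_2, B_3 of the sample space (encoded as labels 0, 1, 2).
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# E(X | sigma(B_1, B_2, B_3)) = sum_i c_i I_{B_i} with c_i = E(X I_{B_i}) / P(B_i).
Z = np.empty(n)
for i in np.unique(labels):
    B = labels == i
    Z[B] = (X[B] * p[B]).sum() / p[B].sum()

# Orthogonality (2.1.5): E[(X - Z) I_A] = 0; checking it on each atom suffices.
for i in np.unique(labels):
    B = labels == i
    assert abs(((X - Z) * p)[B].sum()) < 1e-12

# Conditioning preserves the mean: E(Z) = E(X).
assert abs((Z * p).sum() - (X * p).sum()) < 1e-12
print("partition conditional expectation verified")
```

Since $Z$ is constant on each atom, any $A \in \mathcal{G}$ is a union of atoms and the orthogonality check above extends to it by additivity.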
2.2. Hilbert spaces

The existence of the C.E. when $X \in L^2$ is a consequence of the more general theory
of Hilbert spaces. We embark in this section on an exposition of this general theory,
with most definitions given in Subsection 2.2.1 and their consequences outlined in
Subsection 2.2.2. The choice of material is such that we state everything we need
either for the existence of the C.E. or for constructing the Brownian motion in
Section 5.1.
2.2.1. Definition of Hilbert spaces and subspaces. We start with the
definition of a linear vector space.

Definition 2.2.1. A linear vector space $H = \{h\}$ is a collection of vectors $h$, for
which addition and scalar multiplication are well-defined:
$h_1 + h_2 \in H$ -- the addition of $h_1 \in H$, $h_2 \in H$;
$\alpha h \in H$ -- scalar multiplication of $\alpha \in \mathbb{R}$, $h \in H$;
and having the properties:
(i) $\alpha(h_1 + h_2) = \alpha h_1 + \alpha h_2$;
(ii) $(\alpha + \beta)h = \alpha h + \beta h$;
(iii) $\alpha(\beta h) = (\alpha\beta)h$;
(iv) $1h = h$.
The following examples are used throughout for illustration.

Example 2.2.2. The space $\mathbb{R}^3$ is a linear vector space with addition and scalar
multiplication
$$\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} + \begin{pmatrix} x_2 \\ y_2 \\ z_2 \end{pmatrix} = \begin{pmatrix} x_1 + x_2 \\ y_1 + y_2 \\ z_1 + z_2 \end{pmatrix}, \qquad c \begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \begin{pmatrix} c x_1 \\ c y_1 \\ c z_1 \end{pmatrix}.$$
Other examples are the spaces $L^q(\Omega, \mathcal{F}, P)$ with the usual addition of R.V.s and
scalar multiplication of a R.V. $X(\omega)$ by a non-random constant.
The key concept in the definition of a Hilbert space is the inner product, which is
defined next.

Definition 2.2.3. An inner product in a linear vector space $H$ is a real-valued
function on pairs $h_1, h_2 \in H$, denoted $(h_1, h_2)$, having the following properties:
1) $(h_1, h_2) = (h_2, h_1)$;
2) $(h_1, \alpha h_2 + \beta h_3) = \alpha(h_1, h_2) + \beta(h_1, h_3)$;
3) $(h, h) = 0$ if and only if $h = 0$, otherwise $(h, h) > 0$.
The canonical inner product on $\mathbb{R}^3$ is the dot product $(v_1, v_2) = x_1 x_2 + y_1 y_2 + z_1 z_2$,
and the canonical inner product on $L^2(\Omega, \mathcal{F}, P)$ is $(X, Y) = E(XY)$.
The following inequality holds for any inner product (for the special case of most
interest to us, see Proposition 1.2.41).

Proposition 2.2.4 (Schwarz inequality). Let $\|h\| = (h, h)^{1/2}$. Then, for any
$h_1, h_2 \in H$,
$$(h_1, h_2)^2 \leq \|h_1\|^2 \|h_2\|^2.$$
Exercise 2.2.5. (a) Note that (h
1
+ h
2
, h
1
+ h
2
) 0 for all R. By
the bilinearity of the inner product this is a quadratic function of . To
prove Schwarzs inequality, nd its coecients and consider value of
which minimizes the quadratic expression.
(b) Use Schwarz inequality to derive the triangle inequality h
1
+ h
2

h
1
 +h
2
.
(c) Using the bilinearity of the inner product, check that the parallelogram
law u +v
2
+u v
2
= 2u
2
+ 2v
2
holds.
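For orientation, here is the quadratic computation that part (a) asks for, written out explicitly (a sketch of the standard argument, not the official solution). Expanding by bilinearity,

```latex
0 \le q(\lambda) := (\lambda h_1 + h_2,\; \lambda h_1 + h_2)
   = \lambda^2 \|h_1\|^2 + 2\lambda\, (h_1, h_2) + \|h_2\|^2 .
```

If $h_1 \neq 0$, the minimizer is $\lambda^* = -(h_1, h_2)/\|h_1\|^2$, and substituting it gives $q(\lambda^*) = \|h_2\|^2 - (h_1, h_2)^2/\|h_1\|^2 \geq 0$, which rearranges to Schwarz's inequality; when $h_1 = 0$ both sides vanish and the inequality is trivial.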
Thus, each inner product induces a norm on the underlying linear vector space $H$.

Definition 2.2.6. The norm of $h$ is $\|h\| = (h, h)^{1/2}$. In particular, the triangle
inequality $\|h_1 + h_2\| \leq \|h_1\| + \|h_2\|$ is an immediate consequence of Schwarz's
inequality.
In the space $\mathbb{R}^3$ we get the Euclidean norm $\|v\| = \sqrt{x^2 + y^2 + z^2}$, whereas for
$H = L^2(\Omega, \mathcal{F}, P)$ we have that $\|X\| = (EX^2)^{1/2} = \|X\|_2$, and $\|X\| = 0$ if and only if
$X \stackrel{a.s.}{=} 0$.
Equipping the space $H$ with the norm $\|\cdot\|$ we have the corresponding notions of
Cauchy and convergent sequences.

Definition 2.2.7. We say that $\{h_n\}$ is a Cauchy sequence in $H$ if for any $\epsilon > 0$
there exists $N(\epsilon) < \infty$ such that $\sup_{n,m \geq N} \|h_n - h_m\| \leq \epsilon$. We say that $h_n$
converges to $h \in H$ if $\|h_n - h\| \to 0$ as $n \to \infty$.

For example, in the case of $H = L^2(\Omega, \mathcal{F}, P)$, the concept of Cauchy sequences is the
same as what we have already seen in Section 1.3.2.
We are now ready to define the key concept of Hilbert space.

Definition 2.2.8. A Hilbert space is a linear vector space $H$ with inner product
$(\cdot, \cdot)$ which is complete with respect to the corresponding norm $\|h\|$ (that is, every
Cauchy sequence in $H$ converges to a point $h$ in $H$).

By Proposition 1.3.20 we know that $L^2$ is a Hilbert space, since every Cauchy
sequence in $L^2$ converges in quadratic mean to a point in $L^2$. To check your
understanding you may wish to solve the next exercise at this point.
Exercise 2.2.9. Check that the space $\mathbb{R}^3$ of three-dimensional real-valued vectors,
together with the inner product $(v, u) = v_1 u_1 + v_2 u_2 + v_3 u_3$, is a Hilbert space.
To define the conditional expectation we need the following additional structure.

Definition 2.2.10. A Hilbert subspace $K$ of $H$ is a subset $K \subseteq H$ which, when
equipped with the same inner product, is a Hilbert space. Specifically, $K$ is a linear
subspace (closed under addition and scalar multiplication) such that every Cauchy
sequence $h_n \in K$ has a limit $h \in K$.
Example 2.2.11. Note that in our setting $H_Y = L^2(\Omega, \mathcal{F}_Y, P) \subseteq L^2(\Omega, \mathcal{F}, P)$ is
a Hilbert subspace. Indeed, if $Z_n \in H_Y$ is Cauchy, then $Z_n = f_n(Y) \to Z \in H_Y$
as well. More generally, the Hilbert subspace of $L^2(\Omega, \mathcal{F}, P)$ spanned by the R.V.s
$X_1, X_2, \ldots, X_n$ is the set of $h = f(X_1, X_2, \ldots, X_n)$ with $E(h^2) < \infty$.
2.2.2. Orthogonal projection, separability and Fourier series. The
existence of the C.E. is a special instance of the following important geometric theorem
about Hilbert spaces.

Theorem 2.2.12 (Orthogonal Projection). Let $G \subseteq H$ be a given Hilbert subspace
and $h \in H$ given. Let $d = \inf\{\|h - g\| : g \in G\}$. Then,
(a) There exists $\widehat{h} \in G$ such that $d = \|h - \widehat{h}\|$.
(b) Any such $\widehat{h}$ satisfies $(h - \widehat{h}, g) = 0$ for all $g \in G$.
(c) The vector $\widehat{h}$ that satisfies (b) (or (a)) is unique.
We call $\widehat{h}$ as above the orthogonal projection of $h$ on $G$.
Our next exercise outlines the hardest part of the proof of the orthogonal projection
theorem.

Exercise 2.2.13. Prove part (a) of Theorem 2.2.12 as follows: Consider $g_n \in G$
for which $\|h - g_n\| \to d$. By the definition of $d$ and the parallelogram law for $u = h - g_m$
and $v = g_n - h$, deduce that $\|g_n - g_m\| \to 0$ when both $n, m \to \infty$, hence (why?) $g_n$
has a limit $\widehat{h} \in G$ such that $\|h - \widehat{h}\| = d$ (why?).
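Theorem 2.2.12 is easy to see in coordinates. The following is a minimal numerical sketch in $\mathbb{R}^3$ (the vectors are made up for illustration): the projection onto a two-dimensional subspace is computed by least squares, and parts (a) and (b) are checked directly.

```python
import numpy as np

# Projection of h onto G = span{g1, g2} in R^3 via least squares
# (the normal equations of the orthogonality property in part (b)).
G = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])         # columns g1, g2 span the subspace
h = np.array([3.0, -1.0, 2.0])

coef, *_ = np.linalg.lstsq(G, h, rcond=None)
h_hat = G @ coef                    # the orthogonal projection of h on G

# Part (b): the residual h - h_hat is orthogonal to every g in G,
# so it suffices to check it against the spanning vectors.
assert np.allclose(G.T @ (h - h_hat), 0.0)

# Part (a): h_hat attains the infimum -- any other element of G is farther.
other = G @ (coef + np.array([0.5, -0.25]))
assert np.linalg.norm(h - h_hat) < np.linalg.norm(h - other)
print("projection:", h_hat)
```

The same picture, with $G = L^2(\Omega, \mathcal{F}_Y, P)$ inside $H = L^2(\Omega, \mathcal{F}, P)$, is exactly what Definition 2.1.3 describes.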
In particular, Definition 2.1.3 amounts to saying that the conditional expectation
of $X$ given $Y$ is the orthogonal projection of $X$ in $H = L^2(\Omega, \mathcal{F}, P)$ on
$G = L^2(\Omega, \mathcal{F}_Y, P)$.
Another example of orthogonal projection is the projection $(x_1, \ldots, x_k, 0, \ldots, 0)$ of
$x \in \mathbb{R}^d$ on the subspace of lower dimension $G = \{(y_1, \ldots, y_k, 0, \ldots, 0) : y_i \in \mathbb{R}\}$,
for some $k < d$.
An important ingredient of Hilbert space theory is the concept of a linear functional,
defined next.

Definition 2.2.14. A functional $f : H \to \mathbb{R}$ is linear if $f(\alpha h_1 + \beta h_2) = \alpha f(h_1) + \beta f(h_2)$. A linear functional $f(\cdot)$ is called bounded if $|f(h)| \leq C\|h\|$ for some
$C < \infty$ and all $h \in H$.

For example, each real function $f : \mathbb{R} \to \mathbb{R}$ is also a functional on the Hilbert
space $\mathbb{R}$. The functional $g(x) = 2x$ is then linear and bounded, though $g(x)$ is not
a bounded function. More generally,
Example 2.2.15. Fixing $h_0 \in H$, a Hilbert space, let $f(h) = (h_0, h)$. This is a
bounded linear functional on $H$ since $\|h_0\| < \infty$ and $|f(h)| \leq \|h_0\|\,\|h\|$ by Schwarz's
inequality.
The next theorem (which we do not prove) states that any bounded linear functional
is of the form of Example 2.2.15. It is often very handy when studying linear
functionals that are given in a rather implicit form.

Theorem 2.2.16 (Riesz representation). If $f$ is a bounded linear functional on a
Hilbert space $H$, then there exists a unique $h_0 \in H$ such that $f(h) = (h_0, h)$ for all
$h \in H$.
We next define the subclass of separable Hilbert spaces, which have particularly
appealing properties. We use these later in Section 5.1.

Definition 2.2.17. A Hilbert space $H$ is separable if it has a complete, countable,
orthonormal basis $\{h_m\}$, $m = 1, 2, \ldots$, such that
(a) $(h_m, h_l) = \begin{cases} 1, & m = l \\ 0, & \text{otherwise} \end{cases}$ (orthonormal).
(b) If $(h, h_m) = 0$ for all $m$, then $h = 0$ (complete).
The classical concept of Fourier series has an analog in the theory of separable
Hilbert spaces.

Definition 2.2.18. We say that $h \in H$ has a Fourier series $\sum_i (h_i, h) h_i$ if
$$\Big\| h - \sum_{i=1}^N (h_i, h) h_i \Big\| \to 0, \quad \text{as } N \to \infty.$$

Proposition 2.2.19. The completeness of a basis $\{h_m\}$ is equivalent to the
existence of a Fourier series (in $\{h_m\}$) for every $h \in H$.
A key ingredient in the theory of separable Hilbert spaces is the following approximation
theorem.

Theorem 2.2.20 (Parseval). For any complete orthonormal basis $\{h_m\}$ of a separable
Hilbert space $H$ and any $f, g \in H$,
$$(f, g) = \lim_{N \to \infty} \sum_{i=1}^N (f, h_i)(g, h_i).$$
In particular, $\|h\|^2 = \sum_{i=1}^\infty (h, h_i)^2$ for all $h \in H$.
Proof. Since both $f$ and $g$ have a Fourier series, it follows by Schwarz's inequality
that $a_N = \big(f - \sum_{i=1}^N (f, h_i) h_i,\; g - \sum_{i=1}^N (g, h_i) h_i\big) \to 0$ as $N \to \infty$.
Since the inner product is bilinear and $(h_i, h_j) = 1_{i=j}$, it follows that $a_N = (f, g) - \sum_{i=1}^N (f, h_i)(g, h_i)$, yielding the first statement of the theorem. The second
statement corresponds to the special case of $f = g$.
We conclude with a concrete example that relates the abstract theory of separable
Hilbert spaces to the familiar classical Fourier series.

Example 2.2.21. Let $L^2((0,1), \mathcal{B}, U) = \{f : (0,1) \to \mathbb{R}$ such that $\int_0^1 f^2(t)\,dt < \infty\}$,
equipped with the inner product $(h, g) = \int_0^1 h(t) g(t)\,dt$. This is a separable
Hilbert space. Indeed, the classical Fourier series of any such $f$ is
$$f(t) = c_0 + \sum_{n=1}^\infty c_n \cos(2\pi n t) + \sum_{n=1}^\infty s_n \sin(2\pi n t),$$
where $c_0 = \int_0^1 f(t)\,dt$ and, for $n \geq 1$, $c_n = 2\int_0^1 f(t)\cos(2\pi n t)\,dt$ and
$s_n = 2\int_0^1 f(t)\sin(2\pi n t)\,dt$.
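Parseval's identity can be checked numerically in this space. The sketch below uses the orthonormal version of the trigonometric basis, $\{1, \sqrt{2}\cos(2\pi n t), \sqrt{2}\sin(2\pi n t)\}$, and the test function $f(t) = t$ (both choices are ours, for illustration), for which $\|f\|^2 = \int_0^1 t^2\,dt = 1/3$; the truncated coefficient sum should approach that value.

```python
import numpy as np

# Orthonormal trigonometric basis of L^2((0,1)):
#   e_0 = 1,  e_n^c = sqrt(2) cos(2 pi n t),  e_n^s = sqrt(2) sin(2 pi n t).
t = np.linspace(0.0, 1.0, 50001)
f = t                                   # test function, ||f||^2 = 1/3

def inner(g, h):
    return np.trapz(g * h, t)           # (g, h) = int_0^1 g(t) h(t) dt

# Partial Parseval sum over the first 500 frequencies.
total = inner(f, np.ones_like(t)) ** 2
for n in range(1, 500):
    total += inner(f, np.sqrt(2.0) * np.cos(2 * np.pi * n * t)) ** 2
    total += inner(f, np.sqrt(2.0) * np.sin(2 * np.pi * n * t)) ** 2

# Parseval: sum of squared Fourier coefficients -> ||f||^2 = 1/3.
assert abs(total - 1.0 / 3.0) < 1e-3
print("partial Parseval sum:", total)
```

Here only the constant and sine coefficients of $f(t)=t$ are non-zero, and the tail of the sum decays like $1/n$, which is why a few hundred terms already land within the stated tolerance.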
2.3. Properties of the conditional expectation

We start with two extreme examples for which we can easily compute the C.E.

Example 2.3.1. Suppose $\mathcal{G}$ and $\sigma(X)$ are independent (i.e. for all $A \in \mathcal{G}$ and
$B \in \sigma(X)$ we have that $P(A \cap B) = P(A) P(B)$). Then, $E(X I_A) = E(X) E(I_A)$
for all $A \in \mathcal{G}$. So, the constant $Z = E(X)$ satisfies (2.1.5), that is, in this case
$E(X|\mathcal{G}) = E(X)$. In particular, if $Y$ and $V$ are independent random variables and
$f$ a Borel function such that $X = f(V)$ is integrable, then $E(f(V)|Y) = E(f(V))$.
Example 2.3.2. Suppose $X \in L^1(\Omega, \mathcal{G}, P)$. Obviously, $Z = X$ satisfies (2.1.5),
and by our assumption $X$ is measurable on $\mathcal{G}$. Consequently, here $E(X|\mathcal{G}) = X$
a.s. In particular, if $X = c$ a.s. (a constant R.V.), then we have that $E(X|\mathcal{G}) = c$
a.s. (and conversely, if $E(X I_G) = c P(G)$ for all $G \in \mathcal{G}$ then $X = c$ a.s.).

Remark. Note that $X$ may have the same law as $Y$ while $E(X|\mathcal{G})$ does not have
the same law as $E(Y|\mathcal{G})$. For example, take $\mathcal{G} = \sigma(X)$ with $X$ and $Y$ square-integrable,
independent, of the same distribution and of positive variance. Then,
$E(X|\mathcal{G}) = X$ per Example 2.3.2 and the non-random constant $E(Y|\mathcal{G}) = E(Y)$
(from Example 2.3.1) have different laws.
You can solve the next exercise either directly, or by showing that $\mathcal{F}_0$ is independent
of $\sigma(X)$ for any random variable $X$ and then citing Example 2.3.1.

Exercise 2.3.3. Let $\mathcal{F}_0 = \{\emptyset, \Omega\}$. Show that if $Z \in L^1(\Omega, \mathcal{F}_0, P)$ then $Z$ is
necessarily a non-random constant, and deduce that $E(X|\mathcal{F}_0) = EX$ for any $X \in L^1(\Omega, \mathcal{F}, P)$.
Yet another example consists of $X = Y + V$ with $Y$ and $V$ two independent
R.V.s in $L^2(\Omega, \mathcal{F}, P)$. Let $\mu = E(V)$. Taking a generic $W \in H_Y$ we know that
$W = g(Y)$ and $E[g^2(Y)] < \infty$. By Definition 1.4.36 and Proposition 1.2.16 we
know that the square-integrable random variables $Y + \mu - g(Y)$ and $V - \mu$ are
independent, hence uncorrelated (see Proposition 1.4.42). Consequently,
$$E[(V - \mu)(Y + \mu - g(Y))] = E[Y + \mu - g(Y)]\, E[V - \mu] = 0$$
(recall that $\mu = EV$). Thus,
$$E(X - W)^2 = E[(V - \mu + Y + \mu - g(Y))^2] = E[(Y + \mu - g(Y))^2] + E[(V - \mu)^2],$$
which is minimal for $W = g(Y) = Y + \mu$. Comparing this result with Examples 2.3.1 and
2.3.2, we see that
$$E[X|Y] = E[Y + V|Y] = E[Y|Y] + E[V|Y] = Y + E(V).$$
With additional work (that we shall not detail) one can show that if $f(Y, V)$ is
integrable then $E(f(Y, V)|Y) = g(Y)$ where $g(y) = E(f(y, V))$ (so the preceding
corresponds to $f(y, v) = y + v$ and $g(y) = y + \mu$).
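The claim that $W = Y + \mu$ minimizes $E(X - W)^2$ is easy to test by simulation. In the sketch below the distributions of $Y$ and $V$ (standard normal and exponential with mean $2$) are our own choices for illustration; the minimizing predictor should beat a few competing functions of $Y$ and achieve mean squared error about $\mathrm{Var}(V)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# X = Y + V with Y, V independent; here mu = E(V) = 2.
Y = rng.normal(size=100_000)
V = rng.exponential(scale=2.0, size=100_000)
X = Y + V
mu = 2.0

def mse(W):
    return np.mean((X - W) ** 2)

# The predictor g(Y) = Y + mu should beat other square-integrable g(Y):
best = mse(Y + mu)
assert best < mse(Y)             # g(y) = y      (missing the constant)
assert best < mse(Y + 1.0)       # g(y) = y + 1  (wrong constant)
assert best < mse(2 * Y + mu)    # g(y) = 2y + mu (wrong slope)

# Its error should be about Var(V) = 4 for an exponential of mean 2.
assert abs(best - 4.0) < 0.3
print("E[(X - (Y + mu))^2] ~", best)
```

Note that $E(X - Y - \mu)^2 = E(V - \mu)^2$ exactly, so the last check is just a Monte Carlo estimate of $\mathrm{Var}(V)$.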
As we next show, for any probability space $(\Omega, \mathcal{F}, P)$ the conditional expectation
of $X$ given a $\sigma$-field $\mathcal{G} \subseteq \mathcal{F}$ is linear in the random variable $X$.

Proposition 2.3.4. Let $X, Y \in L^1(\Omega, \mathcal{F}, P)$. Then,
$$E(\alpha X + \beta Y|\mathcal{G}) = \alpha E(X|\mathcal{G}) + \beta E(Y|\mathcal{G})$$
for any $\alpha, \beta \in \mathbb{R}$.
Proof. Let $Z = E(X|\mathcal{G})$ and $V = E(Y|\mathcal{G})$. Fixing $A \in \mathcal{G}$, it follows by
linearity of the expectation operator and the definition of $Z$ and $V$ that
$$E(I_A(\alpha Z + \beta V)) = \alpha E(I_A Z) + \beta E(I_A V) = \alpha E(I_A X) + \beta E(I_A Y) = E(I_A(\alpha X + \beta Y)).$$
Since this applies for all $A \in \mathcal{G}$, it follows by definition that the R.V. $\alpha Z + \beta V$ on
$(\Omega, \mathcal{G})$ is the C.E. of $\alpha X + \beta Y$ given $\mathcal{G}$, as stated.
We next deal with the relation between the C.E. of $X$ given two different $\sigma$-fields,
one of which contains the other.

Proposition 2.3.5 (Tower property). Suppose $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$ and $X \in L^1(\Omega, \mathcal{F}, P)$.
Then, $E(X|\mathcal{H}) = E(E(X|\mathcal{G})|\mathcal{H})$.

Proof. Let $Y = E(X|\mathcal{G})$ and $Z = E(Y|\mathcal{H})$. Fix $A \in \mathcal{H}$. Note that $A \in \mathcal{G}$, and
consequently, by Definition 2.1.5 of $Y = E(X|\mathcal{G})$, we have that
$$E(I_A X) = E(I_A Y) = E(I_A Z),$$
where the second identity holds by Definition 2.1.5 of $Z = E(Y|\mathcal{H})$ and the fact
that $A \in \mathcal{H}$. Since $E(I_A X) = E(I_A Z)$ for any $A \in \mathcal{H}$, the proposition follows by
the uniqueness of $E(X|\mathcal{H})$ and its characterization in Definition 2.1.5.
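On a finite sample space with nested partitions, the tower property is a one-line check. The sketch below (the outcomes, weights and partitions are made up for illustration) conditions first on a fine partition generating $\mathcal{G}$, then on a coarser one generating $\mathcal{H} \subseteq \mathcal{G}$, and compares with conditioning on $\mathcal{H}$ directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, np.full(8, 1.0 / 8)
X = rng.normal(size=n)

def cond_exp(W, labels):
    """E(W | sigma-field generated by the partition encoded in labels)."""
    Z = np.empty_like(W)
    for i in np.unique(labels):
        B = labels == i
        Z[B] = (W[B] * p[B]).sum() / p[B].sum()
    return Z

fine   = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # generates G
coarse = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # generates H, with H subset of G

# Tower property: E(E(X|G)|H) = E(X|H).
lhs = cond_exp(cond_exp(X, fine), coarse)
rhs = cond_exp(X, coarse)
assert np.allclose(lhs, rhs)

# Special case (2.3.1): E(E(X|G)) = E(X).
assert np.isclose((cond_exp(X, fine) * p).sum(), (X * p).sum())
print("tower property verified")
```

The nesting matters: each coarse atom is a union of fine atoms, which is exactly the condition $\mathcal{H} \subseteq \mathcal{G}$ in Proposition 2.3.5 (and Exercise 2.3.6 shows the identity can fail without it).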
Remark. Note that any $\sigma$-field $\mathcal{G}$ contains the trivial $\sigma$-field $\mathcal{F}_0 = \{\emptyset, \Omega\}$.
Consequently, when $\mathcal{H} = \mathcal{F}_0$ the tower property applies for any $\sigma$-field $\mathcal{G}$. Further,
$E(Y|\mathcal{F}_0) = EY$ for any integrable random variable $Y$ (c.f. Exercise 2.3.3). Consequently,
we deduce from the tower property that
$$\text{(2.3.1)} \qquad E(X) = E(X|\mathcal{F}_0) = E[E(X|\mathcal{G})|\mathcal{F}_0] = E(E(X|\mathcal{G})),$$
for any $X \in L^1(\Omega, \mathcal{F}, P)$ and any $\sigma$-field $\mathcal{G}$.
Your next exercise shows that we cannot dispense with the relationship between the
two $\sigma$-fields in the tower property.

Exercise 2.3.6. Give an example of a R.V. $X$ and two $\sigma$-fields $\mathcal{F}_1$ and $\mathcal{F}_2$ on
$\Omega = \{a, b, c\}$ for which $E(E(X|\mathcal{F}_1)|\mathcal{F}_2) \neq E(E(X|\mathcal{F}_2)|\mathcal{F}_1)$.
Here are two important applications of the highly useful tower property (also called
the law of iterated expectations).

Exercise 2.3.7. Let $\mathrm{Var}(Y|\mathcal{G}) = E(Y^2|\mathcal{G}) - E(Y|\mathcal{G})^2$, where $Y$ is a square-integrable
random variable and $\mathcal{G}$ is a $\sigma$-field.
(a) Check that $\mathrm{Var}(Y|\mathcal{G}) = 0$ whenever $Y \in L^2(\Omega, \mathcal{G}, P)$.
(b) More generally, show that $\mathrm{Var}(Y) = E(\mathrm{Var}(Y|\mathcal{G})) + \mathrm{Var}(E(Y|\mathcal{G}))$.
(c) Show that if $E(Y|\mathcal{G}) = X$ and $EX^2 = EY^2 < \infty$ then $X = Y$ almost
surely.
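The variance decomposition of part (b) can likewise be verified on a finite sample space (again with made-up outcomes and partition, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, np.full(12, 1.0 / 12)
Y = rng.normal(size=n)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])  # partition generating G

def cond_exp(W):
    Z = np.empty_like(W)
    for i in np.unique(labels):
        B = labels == i
        Z[B] = (W[B] * p[B]).sum() / p[B].sum()
    return Z

E = lambda W: (W * p).sum()
var = lambda W: E(W ** 2) - E(W) ** 2

# Var(Y|G) = E(Y^2|G) - E(Y|G)^2, as in Exercise 2.3.7.
cond_var = cond_exp(Y ** 2) - cond_exp(Y) ** 2

# (b) the decomposition Var(Y) = E(Var(Y|G)) + Var(E(Y|G)).
assert np.isclose(var(Y), E(cond_var) + var(cond_exp(Y)))
print("Var(Y) =", var(Y))
```

The two summands are the "within-atom" and "between-atom" parts of the variance, which is the usual analysis-of-variance reading of part (b).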
Exercise 2.3.8. Suppose that $\mathcal{G}_1$ and $\mathcal{G}_2$ are $\sigma$-fields and $\mathcal{G}_1 \subseteq \mathcal{G}_2$. Show that for
any square-integrable R.V. $X$,
$$E[(X - E(X|\mathcal{G}_2))^2] \leq E[(X - E(X|\mathcal{G}_1))^2].$$
Keeping the $\sigma$-field $\mathcal{G}$ fixed, we may think of the C.E. as an expectation in a
different (conditional) probability space. Consequently, every property of the expectation
has a corresponding extension to the C.E. For example, the extension
of Proposition 1.2.36 is given next without proof (see [GS01, Exercise 7.9.4, page
349] for more details).

Proposition 2.3.9 (Jensen's inequality). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex function (that
is, $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$ for any $x, y$ and $\lambda \in [0, 1]$). Suppose
$X \in L^1(\Omega, \mathcal{F}, P)$ is such that $E|f(X)| < \infty$. Then, $E(f(X)|\mathcal{G}) \geq f(E(X|\mathcal{G}))$.
Example 2.3.10. Per $q \geq 1$, applying Jensen's inequality for the convex function
$f(x) = |x|^q$, we have for all $X \in L^q(\Omega, \mathcal{F}, P)$ that $E(|X|^q \,|\, \mathcal{G}) \geq |E(X|\mathcal{G})|^q$.

Combining this example with the tower property and the monotonicity of the expectation
operator, we find that

Corollary 2.3.11. For each $q \geq 1$, the norm of the conditional expectation of
$X \in L^q(\Omega, \mathcal{F}, P)$ given a $\sigma$-field $\mathcal{G}$ never exceeds the $L^q$-norm of $X$.
Proof. Recall from Example 2.3.10 that $E(|X|^q \,|\, \mathcal{G}) \geq |E(X|\mathcal{G})|^q$ almost surely.
Hence, by the tower property and the monotonicity of the expectation operator,
$$\|X\|_q^q = E(|X|^q) = E[E(|X|^q \,|\, \mathcal{G})] \geq E[|E(X|\mathcal{G})|^q] = \|E(X|\mathcal{G})\|_q^q.$$
That is, $\|X\|_q \geq \|E(X|\mathcal{G})\|_q$ for any $q \geq 1$.
We also have the corresponding extensions of the Monotone and Dominated Convergence
theorems of Section 1.4.2.

Theorem 2.3.12 (Monotone Convergence for C.E.). If $0 \leq X_m \uparrow X$ and $E(X) < \infty$,
then $E(X_m|\mathcal{G}) \uparrow E(X|\mathcal{G})$ a.s. (see [Bre92, Proposition 4.24] for a proof).

Theorem 2.3.13 (Dominated Convergence for C.E.). If $|X_m| \leq Y \in L^1(\Omega, \mathcal{F}, P)$
and $X_m \stackrel{a.s.}{\to} X$, then $E(X_m|\mathcal{G}) \stackrel{a.s.}{\to} E(X|\mathcal{G})$.

Remark. In contrast with Theorem 1.4.27, convergence in probability to $X$
of $X_m$ that are dominated by an integrable R.V. $Y$ does not imply the almost sure
convergence of $E(X_m|\mathcal{G})$ to $E(X|\mathcal{G})$. Indeed, taking $\mathcal{G} = \mathcal{F}$ we see that such a
statement would have contradicted Proposition 1.3.8.
Our next result shows that the C.E. operation is continuous on each of the $L^q$
spaces.

Theorem 2.3.14. Suppose $X_n \to X$ in $L^q$, that is, $X_n, X \in L^q(\Omega, \mathcal{F}, P)$ with
$E(|X_n - X|^q) \to 0$. Then, $E(X_n|\mathcal{G}) \to E(X|\mathcal{G})$ in $L^q$.

Proof. By the linearity of conditional expectations and Jensen's inequality,
$$E(|E(X_n|\mathcal{G}) - E(X|\mathcal{G})|^q) = E(|E(X_n - X|\mathcal{G})|^q) \leq E(E(|X_n - X|^q \,|\, \mathcal{G})) = E(|X_n - X|^q) \to 0,$$
by hypothesis.
We next show that one can "take out what is known" when computing the C.E.

Proposition 2.3.15. Suppose $Y$ is bounded and measurable on $\mathcal{G}$, and that $X \in L^1(\Omega, \mathcal{F}, P)$. Then, $E(XY|\mathcal{G}) = Y E(X|\mathcal{G})$.

Proof. Note that $XY \in L^1$ and $Z = Y E(X|\mathcal{G})$ is measurable on $\mathcal{G}$. Now, for
any $A \in \mathcal{G}$,
$$E[(XY - Z) I_A] = E[(X - E(X|\mathcal{G}))(Y I_A)] = 0,$$
by the orthogonality property (2.1.5) of $E(X|\mathcal{G})$ (since $V = Y I_A$ is bounded and
measurable on $\mathcal{G}$). Since $A \in \mathcal{G}$ is arbitrary, the proposition follows (see Definition
2.1.5 and Theorem 2.1.6).

You can check your understanding by proving that Proposition 2.3.15 holds also
whenever $Y \in L^2(\Omega, \mathcal{G}, P)$ and $X \in L^2(\Omega, \mathcal{F}, P)$ (actually, this proposition applies
as soon as $Y$ is measurable on $\mathcal{G}$ and both $X$ and $XY$ are integrable).
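Proposition 2.3.15 is again transparent on a partition-generated $\sigma$-field, where "$Y$ measurable on $\mathcal{G}$" just means "$Y$ constant on each atom". A minimal sketch (outcomes, partition, and the values of $Y$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, np.full(10, 1.0 / 10)
X = rng.normal(size=n)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])  # partition generating G

def cond_exp(W):
    Z = np.empty_like(W)
    for i in np.unique(labels):
        B = labels == i
        Z[B] = (W[B] * p[B]).sum() / p[B].sum()
    return Z

# Y must be measurable on G, i.e. constant on each atom of the partition.
Y = np.where(labels == 0, 2.0, np.where(labels == 1, -1.0, 0.5))

# Proposition 2.3.15: E(XY|G) = Y E(X|G).
assert np.allclose(cond_exp(X * Y), Y * cond_exp(X))
print("take-out-what-is-known verified")
```

On each atom $Y$ is a constant, so it factors out of the atom-wise average; that is the whole content of the identity in this finite setting.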
Exercise 2.3.16. Let $Z = (X, Y)$ be a uniformly chosen point in $(0,1)^2$. That
is, $X$ and $Y$ are independent random variables, each having the $U(0,1)$ measure of
Example 1.1.11. Set $T = I_A(Z) + 5 I_B(Z)$ where $A = \{(x, y) : 0 < x < 1/4,\ 3/4 < y < 1\}$ and $B = \{(x, y) : 3/4 < x < 1,\ 0 < y < 1/2\}$.
(a) Find an explicit formula for the conditional expectation $W = E(T|X)$
and use it to determine the conditional expectation $U = E(TX|X)$.
(b) Find the value of $E[(T - W)\sin(e^X)]$.
(c) Without any computation decide whether $E(W^2) - E(T^2)$ is negative,
zero, or positive. Explain your answer.
We conclude this section with a proposition extending Example 2.3.1 and a pair of
exercises dealing with the connection (or lack thereof) between independence and
C.E.

Proposition 2.3.17. If $X$ is integrable and the $\sigma$-fields $\mathcal{G}$ and $\sigma(\sigma(X), \mathcal{H})$ are independent,
then
$$E[X|\sigma(\mathcal{G}, \mathcal{H})] = E[X|\mathcal{H}].$$

Exercise 2.3.18. Building on Exercise 1.4.38, provide a $\sigma$-field $\mathcal{G} \subseteq \mathcal{F}$ and a
bounded random variable $X$ which is independent of an integrable $Y$, such that:
(a) $E(X|\mathcal{G})$ is not independent of $E(Y|\mathcal{G})$.
(b) $E(XY|\mathcal{G}) \neq E(X|\mathcal{G}) E(Y|\mathcal{G})$.

Exercise 2.3.19. Suppose that $X$ and $Y$ are square-integrable random variables.
(a) Show that if $E(X|Y) = E(X)$ then $X$ and $Y$ are uncorrelated.
(b) Provide an example of uncorrelated $X$ and $Y$ for which $E(X|Y) \neq E(X)$.
(c) Provide an example where $E(X|Y) = E(X)$ but $X$ and $Y$ are not independent
(this is also an example of uncorrelated but not independent
R.V.).
2.4. Regular conditional probability

Fixing two $\sigma$-fields $\mathcal{G}, \mathcal{H} \subseteq \mathcal{F}$, let $\widehat{P}(A|\mathcal{G}) = E(I_A|\mathcal{G})$ for any $A \in \mathcal{F}$. Obviously,
$\widehat{P}(A|\mathcal{G}) \in [0, 1]$ exists for any such $A$ and almost all $\omega$. Ideally, we would expect
that for almost every $\omega$ the collection $\{\widehat{P}(A|\mathcal{G})(\omega) : A \in \mathcal{H}\}$ forms a probability
measure, but countable additivity is delicate: for a given sequence of disjoint sets
$A_n \in \mathcal{H}$, outside the exceptional $P$-null set attached to that sequence one may have
$$\widehat{P}\Big(\bigcup_n A_n \,\Big|\, \mathcal{G}\Big) \neq \sum_n \widehat{P}(A_n|\mathcal{G}).$$
We may need to deal with an uncountable number of such collections to check the
R.C.P. property throughout $\mathcal{H}$, causing the corresponding exceptional sets of $\omega$ to
pile up to a non-negligible set.
The way we avoid this problem is by focusing on the R.C.P. for the special choice
of $\mathcal{H} = \sigma(X)$ with $X$ a specific real-valued R.V. of interest. That is, we restrict our
attention to $A = \{\omega : X(\omega) \in B\}$ for a Borel set $B$. The existence of the R.C.P.
is then equivalent to the question whether $\widehat{P}(X \in B|\mathcal{G}) = \widehat{P}(B|\mathcal{G})$ is a probability
measure on $(\mathbb{R}, \mathcal{B})$, as precisely defined next.

Definition 2.4.1. Let $X$ be a R.V. on $(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-field. The collection
$\widehat{P}(B|\mathcal{G})$ is called the regular conditional probability distribution (R.C.P.D.)
of $X$ given $\mathcal{G}$ if:
(a) $\widehat{P}(B|\mathcal{G}) = E[I_{X \in B}|\mathcal{G}]$ for any fixed Borel set $B \subseteq \mathbb{R}$. That is, $\widehat{P}(B|\mathcal{G})$ is
measurable on $(\Omega, \mathcal{G})$ such that $P(\{\omega : X(\omega) \in B\} \cap A) = E[I_A \widehat{P}(B|\mathcal{G})]$ for all $A \in \mathcal{G}$;
$\ldots = \int_{\mathbb{R}} x\, \widehat{P}_{X|\mathcal{G}}(dx, \omega)$,
where the right side is to be interpreted in the sense of Definition 1.2.19 of Lebesgue's
integral for the probability space $(\mathbb{R}, \mathcal{B}, \widehat{P}_{X|\mathcal{G}}(\cdot, \omega))$.
For a proof that this definition coincides with our earlier definition of the conditional
expectation, see for example [Bre92, Proposition 4.28].

Example 2.4.5. The R.C.P.D. is rather explicit when $\mathcal{G} = \sigma(Y)$ and the random
vector $(X, Y)$ has a probability density function, denoted $f_{X,Y}$. That is, when for
all $x, y \in \mathbb{R}$,
$$\text{(2.4.1)} \qquad P(X \leq x, Y \leq y) = \int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(u, v)\, du\, dv,$$
with $f_{X,Y}$ a non-negative Borel function on $\mathbb{R}^2$ such that $\int_{\mathbb{R}^2} f_{X,Y}(u, v)\, du\, dv = 1$. In
this case, the R.C.P.D. of $X$ given $\mathcal{G} = \sigma(Y)$ has the density function $f_{X|Y}(x|Y(\omega))$,
where
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},$$
and $f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(v, y)\, dv$. Consequently, by Definition 2.4.4 we see that here
$$E(X|Y) = \int x f_{X|Y}(x|Y)\, dx,$$
in agreement with the classical formulas provided in elementary (non-measure-theoretic)
probability courses.
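These density formulas are easy to check numerically on a concrete joint density. The sketch below uses the textbook-style example $f_{X,Y}(x, y) = x + y$ on $(0,1)^2$ (our own choice, not from the notes), for which $f_Y(y) = y + 1/2$ and $E(X|Y=y) = (1/3 + y/2)/(1/2 + y)$ in closed form.

```python
import numpy as np

# Illustrative joint density f_{X,Y}(x, y) = x + y on (0,1)^2 (integrates to 1).
x = np.linspace(0.0, 1.0, 10001)

def f_joint(xv, y):
    return xv + y

for y in (0.2, 0.5, 0.9):
    f_y = np.trapz(f_joint(x, y), x)                 # f_Y(y) = y + 1/2
    assert np.isclose(f_y, y + 0.5, atol=1e-6)
    # E(X | Y = y) = int x f_{X|Y}(x|y) dx, with f_{X|Y} = f_{X,Y} / f_Y.
    cond_mean = np.trapz(x * f_joint(x, y) / f_y, x)
    # Closed form for this density: (1/3 + y/2) / (1/2 + y).
    assert np.isclose(cond_mean, (1/3 + y/2) / (0.5 + y), atol=1e-6)
print("conditional density formulas verified")
```

Note how the normalization by $f_Y(y)$ is what turns the slice $x \mapsto f_{X,Y}(x, y)$ into a genuine probability density for each fixed $y$.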
To practice your understanding of the preceding example, solve the next exercise.

Exercise 2.4.6.
(a) Suppose that the joint law of $(X, Y, Z)$ has a density. Express the R.C.P.D.
of $Y$ given $\sigma(X, Z)$ in terms of this density.
(b) Using this expression, show that if $X$ is independent of the pair $(Y, Z)$,
then $E(Y|X, Z) = E(Y|Z)$.
(c) Give an example of random variables $X, Y, Z$ such that $X$ is independent
of $Y$ and $E(Y|X, Z) \neq E(Y|Z)$.
CHAPTER 3

Stochastic Processes: general theory

There are two ways to approach the definition of a stochastic process (S.P.).
The easier approach, which we follow in Section 3.1, is to view a S.P. as a collection
of R.V.s $X_t(\omega)$, $t \in \mathbb{I}$, focusing on the finite dimensional distributions and
ignoring sample path properties (such as continuity). Its advantages are:
The index set $\mathbb{I}$ can be arbitrary;
The measurability of $t \mapsto X_t(\omega)$ as a function from $\mathbb{I}$ to $\mathbb{R}$ (per fixed $\omega$) is not
part of the definition of the process;
There is no need to master functional analysis techniques and results.
The main disadvantages of this approach are:
It only works when $X_t(\omega)$ is well-defined for each fixed $t$;
Using it, we cannot directly determine important properties such as continuity or
monotonicity of the sample path, or the law of $\sup_t X_t$.
A more sophisticated approach to S.P., mostly when $\mathbb{I}$ is an interval on the real
line, views the S.P. as a function-valued R.V.: $\Omega \to$ [Space of functions] (examples
of such spaces might be $C[0,1]$ or $L^2[0,1]$). The main advantages of this approach
are the complements of the disadvantages of our approach, and vice versa.
Defining and using characteristic functions, we study in Section 3.2 the important
class of Gaussian stochastic processes. We conclude this chapter by detailing in
Section 3.3 sufficient conditions for the continuity of the sample path $t \mapsto X_t(\omega)$,
for almost all outcomes $\omega$.
3.1. Definition, distribution and versions

Our starting point is thus the following definition of what a stochastic process is.

Definition 3.1.1. Given $(\Omega, \mathcal{F}, P)$, a stochastic process (S.P.) $X_t$ is a collection
$\{X_t : t \in \mathbb{I}\}$ of R.V.s where the index $t$ belongs to the index set $\mathbb{I}$. Typically,
$\mathbb{I}$ is an interval in $\mathbb{R}$ (in which case we say that $X_t$ is a continuous time stochastic
process), or a subset of $\{1, 2, \ldots, n, \ldots\}$ (in which case we say that $X_t$ is a
discrete time stochastic process). We also call $t \mapsto X_t(\omega)$ the sample function (or
sample path) of the S.P.
Recall our notation $\sigma(X_t)$ for the $\sigma$-field generated by $X_t$. Discrete time stochastic
processes are merely countable collections of R.V.s $X_1, X_2, X_3, \ldots$ defined
on the same probability space. All relevant information about such a process during
a finite time interval $\{1, 2, \ldots, n\}$ is conveyed by the $\sigma$-field $\sigma(X_1, X_2, \ldots, X_n)$,
namely, the $\sigma$-field generated by the rectangle sets $\bigcap_{i=1}^n \{\omega : X_i(\omega) \leq \alpha_i\}$ for
$\alpha_i \in \mathbb{R}$ (compare with Definition 1.2.8 of $\sigma(X)$). To deal with the full infinite time
horizon we just take the $\sigma$-field $\sigma(X_1, X_2, \ldots)$ generated by the union of these sets
Figure 1. Two sample paths $t \mapsto X_t(\omega_1)$ and $t \mapsto X_t(\omega_2)$ of a stochastic process,
corresponding to two outcomes $\omega_1$ and $\omega_2$.
over $n = 1, 2, \ldots$ (this is exactly what we did to define $\mathcal{F}_c$ of the coin tosses of Example
1.1.13). Though we do not do so here, it is not hard to verify that in this setting
the $\sigma$-field $\sigma(X_1, X_2, \ldots)$ coincides with the smallest $\sigma$-field containing $\sigma(X_t)$ for
all $t \in \mathbb{I}$, which we denote hereafter by $\mathcal{F}^X$.
Perhaps the simplest example of a discrete time stochastic process is that of independent,
identically distributed R.V.s $X_n$ (see Example 1.4.7 for one such construction).
Though such a S.P. has few interesting properties worth studying, it is
the cornerstone of the following fundamental example, called the random walk.
Definition 3.1.2. A random walk is the sequence $S_n = \sum_{i=1}^n \xi_i$, where the $\xi_i$ are
independent and identically distributed real-valued R.V.s defined on the same probability
space $(\Omega, \mathcal{F}, P)$. When the $\xi_i$ are integer-valued we say that this is a random
walk on the integers, and call the special case of $\xi_i \in \{-1, 1\}$ a simple random walk.
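For the symmetric simple walk (assuming, for this sketch, $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$), the first two moments of $S_n$ can be computed exactly by enumerating all paths, anticipating the identity $E(S_n^2) = n$ used below in (3.1.1):

```python
from itertools import product

# Symmetric simple random walk: xi_i in {-1, +1}, each with probability 1/2.
# Enumerate all 2^n equally likely paths to get E(S_n) and E(S_n^2) exactly.
n = 6
total, total_sq = 0, 0
for steps in product((-1, 1), repeat=n):
    s = sum(steps)
    total += s
    total_sq += s * s

num_paths = 2 ** n
assert total / num_paths == 0          # E(S_n) = 0
assert total_sq / num_paths == n       # E(S_n^2) = n
print("E(S_n^2) =", total_sq / num_paths)
```

The second assertion is exact, not approximate: the cross terms $E(\xi_i \xi_j)$, $i \neq j$, cancel over the enumeration, leaving the $n$ diagonal terms.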
When considering a continuous-time S.P. we deal with uncountable collections of
R.V.s. As we soon see, this causes many difficulties. For example, when we talk
about the distribution (also called the law) of such a S.P. we usually think of the
restriction of $P$ to a certain $\sigma$-field. But which $\sigma$-field should we choose? That is, which one
carries all the information we have in mind? Clearly, we require at least $\mathcal{F}^X$, so we
can determine the law of $X_t(\omega)$ per fixed $t$. But is this enough? For example,
we might be interested in sets such as $H_f = \{\omega : X_t(\omega) \leq f(t),\ 0 \leq t \leq 1\}$
for some functions $f : [0,1] \to \mathbb{R}$. Indeed, the value of $\sup(X_t : 0 \leq t \leq 1)$ is
determined by such sets for $f(t) = \alpha$ independent of $t$. Unfortunately, such sets
typically involve an uncountable number of set operations and thus are usually not
in $\mathcal{F}^X$, and hence $\sup(X_t : 0 \leq t \leq 1)$ might not even be measurable on $\mathcal{F}^X$. So,
maybe we should take instead $\sigma(H_f,\ f : [0,1] \to \mathbb{R})$, which is quite different from
$\mathcal{F}^X$? We choose in these notes the smaller, i.e. simpler, $\sigma$-field $\mathcal{F}^X$, but what is then
the minimal information we need for specifying uniquely a probability measure on
this space? Before tackling this issue, we provide some motivation for our interest
in continuous-time S.P.
Figure 2. Illustration of scaled simple random walks for different values of $n$
(first 20, 100, 1000 and 5000 steps).
Assuming that $E(\xi_1) = 0$ and $E(\xi_1^2) = 1$, we have that $E(S_n) = 0$ and
$$\text{(3.1.1)} \qquad E(S_n^2) = E\Big[\Big(\sum_{i=1}^n \xi_i\Big)^2\Big] = \sum_{i,j=1}^n E(\xi_i \xi_j) = n,$$
that is, $E[(n^{-1/2} S_n)^2] = 1$. Further, in this case, by the central limit theorem we have
that
$$\text{(3.1.2)} \qquad n^{-1/2} S_n \stackrel{\mathcal{L}}{\longrightarrow} G = N(0, 1),$$
with $N(\mu, v)$ denoting a Gaussian R.V. of mean $\mu$ and variance $v$ (c.f. Example
1.4.13). Replacing $n$ by $[nt]$ (the integer part of $nt$), we get from (3.1.2) by rescaling
that also $n^{-1/2} S_{[nt]} \stackrel{\mathcal{L}}{\longrightarrow} N(0, t)$ for any fixed $0 \leq t \leq 1$. This leads us to state the
functional C.L.T., where all values of $0 \leq t \leq 1$ are considered at once.
Theorem 3.1.3. (See [Bre92, Section 12.2].) Consider the random walk $S_n$ with
$E(\xi_1) = 0$ and $E(\xi_1^2) = 1$. Take the linear interpolation of the sequence $S_n$, scale
space by $n^{-1/2}$ and time by $n^{-1}$ (see Figure 2). Taking $n \to \infty$ we arrive at a limiting
object which we call the Brownian motion on $0 \leq t \leq 1$. The convergence here is
weak convergence in the sense of Definition 1.4.20, with $\mathbb{S}$ the set of continuous
functions on $[0, 1]$, equipped with the topology induced by the supremum norm.

Even though it is harder to define the Brownian motion (being a continuous time
S.P.), computations for it typically involve relatively simple Partial Differential
Equations and are often more explicit than those for the random walk. In addition,
unlike the random walk, the Brownian motion does not depend on the specific law
of the $\xi_i$ (beyond having zero mean and unit variance). For a direct construction of the
Brownian motion, to which we return in Section 5.1, see [Bre92, Section 12.7].

Remark. The condition $E(\xi_1^2) < \infty$ is almost necessary for the $n^{-1/2}$ scaling of
space and the Brownian limiting process. Indeed, if $E(|\xi_1|^\alpha) = \infty$ for some $0 < \alpha < 2$,
then both the scaling of space and the limit S.P. are changed with respect to Theorem
3.1.3.
In the next example, which is mathematically equivalent to Theorem 3.1.3, we
replace the sum of independent and identically distributed R.V.s by the product
of such non-negative R.V.s.

Example 3.1.4. Let $M_n = \prod_{i=1}^n Y_i$ where the $Y_i$ are positive, independent, identically
distributed random variables (for instance, the random daily return rates of a certain
investment). Let $S_n = \log M_n$ and $\xi_i = \log Y_i$, noting that $S_n = \sum_{i=1}^n \xi_i$
is a random walk. Assuming $E(\log Y_1) = 0$ and $E[(\log Y_1)^2] = 1$, we know that
$n^{-1/2} S_{[nt]} = n^{-1/2} \log M_{[nt]}$ converges to a Brownian motion $W_t$ as $n \to \infty$. Hence,
the limit behavior of $M_n$ is related to that of the Geometric Brownian motion $e^{W_t}$.
For continuous time stochastic processes we next provide a few examples of events
that are in the $\sigma$-field $\mathcal{F}^X$, and a few examples that in general may not belong to this
$\sigma$-field (see [Bre92, Section 12.4] for more examples).
(1) $\{\omega : X_{t_1}(\omega) \leq \alpha\} \in \mathcal{F}^X$.
(2) $\{\omega : X_{1/k}(\omega) \leq \alpha,\ k = 1, 2, \ldots, N\} \in \mathcal{F}^X$.
(3) $\{\omega : X_{1/k}(\omega) \leq \alpha,\ k = 1, 2, 3, \ldots\} = \{\omega : \sup_k X_{1/k}(\omega) \leq \alpha\} \in \mathcal{F}^X$.
(4) $\{\omega : \sup(X_t(\omega) : 0 \leq t \leq 1) \leq \alpha\}$ is not necessarily in $\mathcal{F}^X$, since the supremum
here is over an uncountable collection of R.V.s. However, check that this event is
in $\mathcal{F}^X$ whenever all sample functions of the S.P. $X_t$ are right continuous.
(5) $\{\omega : X_t(\omega) : \mathbb{I} \to \mathbb{R}$ is a measurable function$\}$ may also be outside $\mathcal{F}^X$ (say for
$\mathbb{I} = [0, 1]$).
We define next the stochastic process analog of the distribution function.

Definition 3.1.5. Given $N < \infty$ and a collection $t_1, t_2, \ldots, t_N$ in $\mathbb{I}$, we denote
the (joint) distribution of $(X_{t_1}, \ldots, X_{t_N})$ by $F_{t_1, t_2, \ldots, t_N}(\cdot)$, that is,
$$F_{t_1, t_2, \ldots, t_N}(\alpha_1, \alpha_2, \ldots, \alpha_N) = P(X_{t_1} \leq \alpha_1, \ldots, X_{t_N} \leq \alpha_N),$$
for all $\alpha_1, \alpha_2, \ldots, \alpha_N \in \mathbb{R}$. We call the collection of functions $F_{t_1, t_2, \ldots, t_N}(\cdot)$ the
finite dimensional distributions (f.d.d.) of the S.P.
Having independent increments is one example of a property that is determined
by the f.d.d.
Definition 3.1.6. With G_t the smallest σ-field containing σ(X_s) for any 0 ≤ s ≤
t, we say that a S.P. X_t has independent increments if X_{t+h} − X_t is independent
of G_t for any h > 0 and all t ≥ 0. This property is determined by the f.d.d. That is,
if the random variables X_{t_1}, X_{t_2} − X_{t_1}, . . . , X_{t_n} − X_{t_{n−1}} are mutually independent,
for all n < ∞ and 0 ≤ t_1 < t_2 < · · · < t_n < ∞, then the S.P. X_t has independent
increments.
Remark. For example, both the random walk and the Brownian motion are
processes with independent increments.
The next example shows that the f.d.d. do not determine some other important
properties of the stochastic process, such as continuity of its sample path. That is,
knowing all f.d.d. is not enough for computing P(ω : t → X_t(ω) is continuous on
[0, 1]).
Example 3.1.7. Consider the probability space Ω = [0, 1] with its Borel σ-field and
the Uniform law on [0, 1] (that is, the probability of each interval equals its length,
also known as Lebesgue measure restricted to [0, 1]). Given ω ∈ Ω, we define two
Stochastic Processes:

Y_t(ω) = 0 for all t,    X_t(ω) = 1 if t = ω, and X_t(ω) = 0 otherwise.

Let A_t = {ω : X_t ≠ Y_t} = {t}. Since P(A_t) = 0, we have that P(X_t = Y_t) = 1
for each fixed t. Moreover, let A_N = ∪_{i=1}^N A_{t_i}; then P(A_N) = 0 (a finite union of
negligible sets is negligible). Since this applies for any choice of N and t_1, . . . , t_N, we
see that the f.d.d. of X_t are the same as those of Y_t. Moreover, considering the
set A_∞ = ∪_{i=1}^∞ A_{t_i}, involving a countable number of times, we see that P(A_∞) = 0,
that is, almost surely, X_t(ω) agrees with Y_t(ω) at any fixed, countable, collection of
times. But note that some global sample-path properties do not agree. For example,

P(ω : sup{X_t(ω) : 0 ≤ t ≤ 1} ≠ 0) = 1,
P(ω : sup{Y_t(ω) : 0 ≤ t ≤ 1} ≠ 0) = 0.

Also,

P(ω : t → X_t(ω) is continuous) = 0,
P(ω : t → Y_t(ω) is continuous) = 1.
While the maximal value and continuity of the sample path are different for the two
S.P. of Example 3.1.7, we should typically consider such a pair to be the same S.P.,
motivating our next two definitions.
Definition 3.1.8. Two S.P. X_t and Y_t are called versions of one another if
they have the same finite-dimensional distributions.
Definition 3.1.9. A S.P. Y_t is called a modification of another S.P. X_t if
P(Y_t = X_t) = 1 for all t ∈ I.
We consider next the relation between the concepts of modification and version,
starting with:
Exercise 3.1.10. Show that if Y_t is a modification of X_t, then Y_t is also a
version of X_t.
Note that a modification has to be defined on the same probability space as the
original S.P. while this is not required of versions.
The processes in Example 3.1.7 are modifications of one another. In contrast,
our next example is of two versions on the same probability space which are not
modifications of each other.
Example 3.1.11. Consider Ω_2 = {HH, TT, HT, TH} with the uniform probability
measure P, corresponding to two independent fair coin tosses, whose outcome is
ω = (ω_1, ω_2). Define on (Ω_2, 2^{Ω_2}, P) the S.P.

X_t(ω) = 1_{[0,1)}(t) I_H(ω_1) + 1_{[1,2)}(t) I_H(ω_2),   0 ≤ t < 2,

and Y_t(ω) = 1 − X_t(ω) for 0 ≤ t < 2.
Exercise 3.1.12. To practice your understanding you should at this point check
that the processes X_t and Y_t of Example 3.1.11 are versions of each other but are
not modifications of each other.
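Since Ω_2 is finite, the claim of Exercise 3.1.12 can be verified by brute-force enumeration. Here is a Python sketch; it checks the joint law only at the two representative times t = 0.5 and t = 1.5, which suffices here because X_t is constant on [0, 1) and on [1, 2).

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=2))  # the four outcomes, each of probability 1/4
p = Fraction(1, 4)

def X(t, w):
    if 0 <= t < 1:
        return 1 if w[0] == "H" else 0
    if 1 <= t < 2:
        return 1 if w[1] == "H" else 0
    return 0

def Y(t, w):
    return 1 - X(t, w)

times = (0.5, 1.5)

def law(proc):
    # joint law of (proc(0.5), proc(1.5)) as a dict: value-pair -> probability
    d = {}
    for w in outcomes:
        key = tuple(proc(t, w) for t in times)
        d[key] = d.get(key, Fraction(0)) + p
    return d

same_fdd = law(X) == law(Y)              # equal f.d.d.: versions of each other
never_equal = all(X(t, w) != Y(t, w) for w in outcomes for t in times)
print(same_fdd, never_equal)
```

Both laws assign probability 1/4 to each pair in {0, 1}^2, while X_t(ω) ≠ Y_t(ω) for every ω and t, so P(X_t = Y_t) = 0: versions, but not modifications.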
Definition 3.1.13. We say that a collection of finite dimensional distributions is
consistent if

lim_{α_k → ∞} F_{t_1,...,t_N}(α_1, . . . , α_N) = F_{t_1,...,t_{k−1},t_{k+1},...,t_N}(α_1, . . . , α_{k−1}, α_{k+1}, . . . , α_N),

for any 1 ≤ k ≤ N, t_1 < t_2 < · · · < t_N ∈ I and α_i ∈ R, i = 1, . . . , N.
Convince yourself that the f.d.d. of any S.P. must be consistent. Conversely,
Proposition 3.1.14. For any consistent collection of finite dimensional distribu-
tions, there exists a probability space (Ω, F, P) and a stochastic process {X_t(ω)}
on it, whose f.d.d. are in agreement with the given collection (c.f. [Bre92, Theo-
rem 12.14], or [GS01, Theorem 8.6.3]). Further, the restriction of the probability
measure P to the σ-field F^X is uniquely determined by the given collection of f.d.d.
We note in passing that the construction of Proposition 3.1.14 builds on the easier
case of discrete time stochastic processes (which is treated for example in [Bre92,
Section 2.4]).
What follows may be omitted at first reading.
We can construct a σ-field that is the image of F^X on the range of {X_t(ω) :
t ∈ I} ⊆ R^I.
Definition 3.1.15. For an interval I ⊆ R, let R^I denote the set of all functions
x : I → R. A finite dimensional rectangle in R^I is any set of the form {x : x(t_i) ∈
J_i, i = 1, . . . , n} for a nonnegative integer n, intervals J_i ⊆ R and times t_i ∈ I,
i = 1, . . . , n. The cylindrical σ-field B_I is the σ-field generated by the collection of
finite dimensional rectangles.
While f.d.d. do not determine important properties of the sample path t → X_t(ω)
of the S.P. (see Example 3.1.7), they uniquely determine the probabilities of events
in F^X (hence each property of the sample path that can be expressed as an element
of the cylindrical σ-field B_I).
Proposition 3.1.16. For an interval I ⊆ R and any S.P. X_t on t ∈ I, the
σ-field F^X consists of the events {ω : X_·(ω) ∈ Γ} for Γ ∈ B_I. Further, if a S.P.
Y_t is a version of X_t, then P(X_· ∈ Γ) = P(Y_· ∈ Γ) for all Γ ∈ B_I.

3.2. Characteristic functions, Gaussian variables and processes

Definition 3.2.1. The characteristic function of a random vector X = (X_1, . . . , X_n)
is

Φ_X(θ) = E[e^{i Σ_{k=1}^n θ_k X_k}],

where θ = (θ_1, θ_2, . . . , θ_n) ∈ R^n and i = √−1.
Remark. The characteristic function Φ_X : R^n → C exists for any X since

(3.2.1) e^{i Σ_{k=1}^n θ_k X_k} = cos(Σ_{k=1}^n θ_k X_k) + i sin(Σ_{k=1}^n θ_k X_k),

with both real and imaginary parts being bounded (hence integrable) random vari-
ables. Actually, we see from (3.2.1) that Φ_X(0) = 1 and |Φ_X(θ)| ≤ 1 for all θ ∈ R^n
(see [Bre92, Proposition 8.27] or [GS01, Section 5.7] for other properties of the
characteristic function).
Our next proposition justifies naming Φ_X the characteristic function of (the law
of) X.
Proposition 3.2.2. The characteristic function determines the law of a random
vector. That is, if Φ_X(θ) = Φ_Y(θ) for all θ then X has the same law (= probability
measure on R^n) as Y (for proof see [Bre92, Theorems 11.4 and 8.24] or [GS01,
Corollary 5.9.3]).
Remark. The law of a nonnegative random variable X is also determined by its
moment generating function M_X(s) = E[e^{sX}] at s < 0 (see [Bre92, Proposition
8.51] for a proof). While the real-valued function M_X(s) is a simpler object, it
is unfortunately useless for the many random variables X which are neither non-
negative nor nonpositive and for which M_X(s) = ∞ for all s ≠ 0.
The characteristic function is very useful in connection with convergence in law.
Indeed,
Exercise 3.2.3. Show that if X_n →^L X then Φ_{X_n}(θ) → Φ_X(θ) for any θ ∈ R.
Remark. Though much harder to prove, the converse of Exercise 3.2.3 is also
true, namely if Φ_{X_n}(θ) → Φ_X(θ) for each θ ∈ R then X_n →^L X.
We continue with a few explicit computations of the characteristic function.
Example 3.2.4. Consider X a Bernoulli(p) random variable, that is, P(X = 1) =
p and P(X = 0) = 1 − p. Its characteristic function is by definition

Φ_X(θ) = E[e^{iθX}] = pe^{iθ} + (1 − p)e^{iθ·0} = pe^{iθ} + 1 − p.

The same type of explicit formula applies to any R.V. X ∈ SF. Moreover, such
formulas apply for any discrete valued R.V. For example, if X ~ Poisson(λ) then

(3.2.2) Φ_X(θ) = E[e^{iθX}] = Σ_{k=0}^∞ ((λe^{iθ})^k / k!) e^{−λ} = e^{λ(e^{iθ}−1)}.
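Formula (3.2.2) can be checked numerically by truncating the Poisson series. A short Python sketch (the values λ = 1.5 and θ = 0.7 are arbitrary choices for illustration):

```python
import cmath
import math

lam, theta = 1.5, 0.7

# E[e^{i theta X}] computed by truncating the Poisson(lam) series at k = 59;
# the tail beyond that is negligible for lam = 1.5.
series = sum(
    cmath.exp(1j * theta * k) * lam**k / math.factorial(k)
    for k in range(60)
) * math.exp(-lam)

closed_form = cmath.exp(lam * (cmath.exp(1j * theta) - 1))
print(abs(series - closed_form))
```

The truncated series and the closed form e^{λ(e^{iθ}−1)} agree to machine precision.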
The characteristic function has an explicit form also when the R.V. X has a
probability density function f_X as in Definition 1.2.23. Indeed, then by Proposition
1.2.29 we have that

(3.2.3) Φ_X(θ) = ∫ e^{iθx} f_X(x) dx,

which is merely the Fourier transform of the density f_X. For example, applying
this formula we see that the Uniform random variable U of Example 1.1.11 has
characteristic function Φ_U(θ) = (e^{iθ} − 1)/(iθ). Assuming that the density f_X is
bounded and continuous, we also have the explicit inversion formula

(3.2.4) f_X(x) = (1/2π) ∫ e^{−iθx} Φ_X(θ) dθ,

as a way to explain Proposition 3.2.2 ([Bre92, Theorem 8.39] shows that this
inversion formula is valid whenever ∫ |Φ_X(θ)| dθ < ∞, see also [GS01, Theorem
5.9.1]).
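The inversion formula (3.2.4) lends itself to a direct numerical check. The sketch below recovers the standard normal density from its characteristic function e^{−θ²/2}; this example is chosen because that characteristic function is absolutely integrable, and the truncation level L and grid size n are ad-hoc choices.

```python
import cmath
import math

def phi(theta):
    # characteristic function of a standard normal random variable
    return math.exp(-0.5 * theta * theta)

def density_via_inversion(x, L=12.0, n=4000):
    # trapezoidal approximation of (1/2pi) * integral of e^{-i theta x} phi(theta)
    # over theta in [-L, L]; the tail beyond |theta| = 12 is negligible
    h = 2 * L / n
    total = 0.0
    for k in range(n + 1):
        theta = -L + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * (cmath.exp(-1j * theta * x) * phi(theta)).real
    return total * h / (2 * math.pi)

x = 0.3
approx = density_via_inversion(x)
exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
print(abs(approx - exact))
```

The numerical inversion reproduces f_X(x) = e^{−x²/2}/√(2π) to high accuracy.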
We next recall the extension of the notion of density as in Definition 1.2.23 to a
random vector (as done already in (2.4.1)).
Definition 3.2.5. We say that a random vector X = (X_1, . . . , X_n) has a proba-
bility density function f_X if

P(ω : a_i ≤ X_i(ω) ≤ b_i, i = 1, . . . , n) = ∫_{a_1}^{b_1} · · · ∫_{a_n}^{b_n} f_X(x_1, . . . , x_n) dx_n · · · dx_1,

for every a_i < b_i, i = 1, . . . , n. Such a density f_X must be a nonnegative Borel
measurable function with ∫_{R^n} f_X(x) dx = 1 (f_X is sometimes called the joint density
of X_1, . . . , X_n as in [GS01, Definition 4.5.2]).
Adopting the notation (θ, x) = Σ_{k=1}^n θ_k x_k, we have the following extension of the
Fourier transform formula (3.2.3) to random vectors X with density,

Φ_X(θ) = ∫_{R^n} e^{i(θ,x)} f_X(x) dx

(this is merely a special case of the extension of Proposition 1.2.29 to h : R^n → R).
Though we shall not do so, we can similarly extend the explicit inversion formula
of (3.2.4) to X having bounded continuous density, or alternatively, having an
absolutely integrable characteristic function.
The computation of the characteristic function is much simplified in the presence
of independence, as shown by the following alternative of Proposition 1.4.40.
Proposition 3.2.6. If X = (X_1, X_2, . . . , X_n) with X_i mutually independent R.V.,
then clearly,

(3.2.5) Φ_X(θ) = E[∏_{k=1}^n e^{iθ_k X_k}] = ∏_{k=1}^n Φ_{X_k}(θ_k)   for all θ ∈ R^n.

Conversely, if (3.2.5) holds then the random variables X_i, i = 1, . . . , n are mutually
independent of each other.
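The product rule (3.2.5) can be verified exactly for small discrete distributions. A Python sketch with two independent Bernoulli coordinates (the parameters p, q and the evaluation point θ are arbitrary choices):

```python
import cmath
from itertools import product

p, q = 0.3, 0.6
theta1, theta2 = 0.9, -1.4

# joint law of independent X1 ~ Bernoulli(p), X2 ~ Bernoulli(q)
pmf = {(x1, x2): (p if x1 else 1 - p) * (q if x2 else 1 - q)
       for x1, x2 in product((0, 1), repeat=2)}

# left side of (3.2.5): characteristic function of the vector (X1, X2)
joint = sum(w * cmath.exp(1j * (theta1 * x1 + theta2 * x2))
            for (x1, x2), w in pmf.items())

# right side of (3.2.5): product of the marginal characteristic functions
phi1 = p * cmath.exp(1j * theta1) + (1 - p)
phi2 = q * cmath.exp(1j * theta2) + (1 - q)
print(abs(joint - phi1 * phi2))
```

Both sides of (3.2.5) agree to machine precision, as they must for independent coordinates.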
3.2.2. Gaussian variables, vectors and processes. We start by recalling
some linear algebra concepts we soon need.
Definition 3.2.7. An n × n matrix A with entries A_{jk} is called nonnegative
definite (or positive semidefinite) if A_{jk} = A_{kj} for all j, k, and for any θ ∈ R^n,

(θ, Aθ) = Σ_{j=1}^n Σ_{k=1}^n θ_j A_{jk} θ_k ≥ 0.
We next define the Gaussian random vectors via their characteristic functions.
Definition 3.2.8. We say that a random vector X = (X_1, X_2, . . . , X_n) has a
Gaussian (or multivariate Normal) distribution if

Φ_X(θ) = e^{−(1/2)(θ,Σθ)} e^{i(θ,μ)},

for some nonnegative definite n × n matrix Σ, some μ = (μ_1, μ_2, . . . , μ_n) ∈ R^n
and all θ = (θ_1, θ_2, . . . , θ_n) ∈ R^n.
Remark. In the special case of n = 1, we say that a random variable X is
Gaussian if for some μ ∈ R, some σ^2 ≥ 0 and all θ ∈ R,

E[e^{iθX}] = e^{−(1/2)σ^2 θ^2 + iμθ}.

As we see next, the classical definition of the Gaussian distribution via its density
amounts to a strict subset of the distributions we consider in Definition 3.2.8.
Definition 3.2.9. We say that X has a nondegenerate Gaussian distribution if
the matrix Σ is invertible, or alternatively, when Σ is a (strictly) positive definite
matrix, that is, (θ, Σθ) > 0 whenever θ is a nonzero vector (for an equivalent
definition see [GS01, Section 4.9]).
Proposition 3.2.10. A random vector X with a nondegenerate Gaussian distri-
bution has the density

f_X(x) = (2π)^{−n/2} (det Σ)^{−1/2} e^{−(1/2)(x−μ, Σ^{−1}(x−μ))}

(see also [GS01, Definition 4.9.4]). In particular, if σ^2 > 0, then a Gaussian
random variable X has the density

f_X(x) = (1/√(2πσ^2)) e^{−(x−μ)^2/(2σ^2)}

(for example, see [GS01, Example 4.4.4]).
Our next proposition links the vector μ and the matrix Σ to the first two moments
of the Gaussian distribution.
Proposition 3.2.11. The parameters of the Gaussian distribution are μ_j =
E(X_j) and Σ_{jk} = E[(X_j − μ_j)(X_k − μ_k)], j, k = 1, . . . , n (c.f. [GS01, Theorem
4.9.5]). Thus μ is the mean vector and Σ is the covariance matrix of X.
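One practical consequence of Propositions 3.2.10 and 3.2.11 is that a Gaussian vector with covariance Σ = LLᵀ can be sampled as μ + LZ for a standard normal vector Z. A Python sketch for n = 2 checking the empirical mean and covariance against the parameters (the particular μ and Σ here are arbitrary choices):

```python
import random

random.seed(1)

mu = (1.0, -2.0)
Sigma = ((2.0, 0.8), (0.8, 1.0))   # a nonnegative definite covariance matrix

# Cholesky factor of the 2x2 covariance: Sigma = L L^T with L lower triangular
l11 = Sigma[0][0] ** 0.5
l21 = Sigma[1][0] / l11
l22 = (Sigma[1][1] - l21 * l21) ** 0.5

def sample():
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    return (mu[0] + l11 * u, mu[1] + l21 * u + l22 * v)

n = 200_000
xs = [sample() for _ in range(n)]
m0 = sum(x for x, _ in xs) / n
m1 = sum(y for _, y in xs) / n
cov01 = sum((x - m0) * (y - m1) for x, y in xs) / n
print(round(m0, 1), round(m1, 1), round(cov01, 1))
```

The empirical mean vector and off-diagonal covariance recover μ = (1, −2) and Σ_{12} = 0.8 up to Monte Carlo error.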
As we now demonstrate, there is more to a Gaussian random vector than just
having coordinates that are Gaussian random variables.
Exercise 3.2.12. Let X be a Gaussian R.V. independent of S, with E(X) = 0
and P(S = 1) = P(S = −1) = 1/2.
(a) Check that SX is Gaussian.
(b) Give an example of uncorrelated, zero-mean, Gaussian R.V. X_1 and X_2
such that the vector X = (X_1, X_2) is not Gaussian and where X_1 and X_2
are not independent.
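A Monte Carlo sketch in the spirit of Exercise 3.2.12: with X standard normal and S = ±1 independent of X, the pair (X, SX) is uncorrelated, yet X + SX has an atom at 0 of probability 1/2, which rules out a nondegenerate bivariate Gaussian law. (This is one possible answer to part (b); the exercise admits others.)

```python
import random

random.seed(2)
n = 100_000

pairs = []
for _ in range(n):
    x = random.gauss(0, 1)
    s = random.choice((-1, 1))
    pairs.append((x, s * x))

# empirical covariance of (X, SX); E[S X^2] = E[S] E[X^2] = 0
cov = sum(a * b for a, b in pairs) / n
# fraction of samples with X + SX exactly zero (happens whenever S = -1)
frac_sum_zero = sum(1 for a, b in pairs if a + b == 0) / n
print(round(cov, 2), round(frac_sum_zero, 2))
```

The covariance is near 0 while roughly half the samples satisfy X + SX = 0 exactly; a nondegenerate Gaussian vector with uncorrelated coordinates would give X + SX a continuous (atom-free) Gaussian law, and |X| = |SX| shows the two coordinates are far from independent.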
Exercise 3.2.13. Suppose (X, Y) has a bivariate Normal distribution (per Def-
inition 3.2.8), with mean vector μ = (μ_X, μ_Y) and the covariance matrix

Σ = [ σ_X^2      ρσ_X σ_Y ]
    [ ρσ_X σ_Y   σ_Y^2    ],

with σ_X, σ_Y > 0 and |ρ| ≤ 1.
(a) Show that (X, Y) has the same law as (μ_X + ρσ_X U + σ_X √(1 − ρ^2) V,
μ_Y + σ_Y U), where U and V are independent Normal R.V.s of mean zero and
variance one. Explain why this implies that Z = X − (ρσ_X/σ_Y)Y is
independent of Y.
(b) Explain why such X and Y are independent whenever they are uncorre-
lated (hence also whenever E(X|Y) = EX).
(c) Verify that E(X|Y) = μ_X + ρ(σ_X/σ_Y)(Y − μ_Y).
Part (b) of Exercise 3.2.13 extends to any Gaussian random vector. That is,
Proposition 3.2.14. If a Gaussian random vector X = (X_1, . . . , X_n) has uncor-
related coordinates, then its coordinates are also mutually independent.
Proof. Since the coordinates X_k are uncorrelated, the corresponding matrix
Σ has zero entries except at the main diagonal j = k (see Proposition 3.2.11).
Hence, by Definition 3.2.8, the characteristic function Φ_X(θ) is of the form
∏_{k=1}^n Φ_{X_k}(θ_k). This in turn implies that the coordinates X_k of the random vector
X are mutually independent (see Proposition 3.2.6).
Definition 3.2.8 allows for Σ that is noninvertible, so for example the random
variable X = μ a.s. is considered a Gaussian variable though it obviously does
not have a density (hence does not fit Definition 3.2.9). The reason we make this
choice is to have the collection of Gaussian distributions closed with respect to
convergence in 2-mean, as we prove below to be the case.
Proposition 3.2.15. Suppose a sequence of n-dimensional Gaussian random vec-
tors X^{(k)}, k = 1, 2, . . . converges in 2-mean to an n-dimensional random vector X,
that is, E[(X_i − X_i^{(k)})^2] → 0 as k → ∞, for i = 1, 2, . . . , n. Then, X is a Gauss-
ian random vector, whose parameters μ and Σ are the limits of the corresponding
parameters μ^{(k)} and Σ^{(k)} of X^{(k)}.
Proof. We start by verifying the convergence of the parameters of X^{(k)} to
those of X. To this end, fixing 1 ≤ i, j ≤ n and applying the inequality

|(a + x)(b + y) − ab| ≤ |ay| + |bx| + |xy|

for a = X_i, b = X_j, x = X_i^{(k)} − X_i and y = X_j^{(k)} − X_j, we get by monotonicity of
the expectation that

E|X_i^{(k)} X_j^{(k)} − X_i X_j| ≤ E|X_i (X_j^{(k)} − X_j)| + E|X_j (X_i^{(k)} − X_i)|
+ E|(X_i^{(k)} − X_i)(X_j^{(k)} − X_j)|.

Thus, by the Schwarz inequality, c.f. Proposition 1.2.41, we see that

E|X_i^{(k)} X_j^{(k)} − X_i X_j| ≤ ‖X_i‖_2 ‖X_j^{(k)} − X_j‖_2 + ‖X_j‖_2 ‖X_i^{(k)} − X_i‖_2
+ ‖X_i^{(k)} − X_i‖_2 ‖X_j^{(k)} − X_j‖_2.

So, the assumed convergence in 2-mean of X^{(k)} to X implies the convergence in
1-mean of X_i^{(k)} X_j^{(k)} to X_i X_j as k → ∞. This in turn implies that E X_i^{(k)} X_j^{(k)} →
E X_i X_j (c.f. Exercise 1.3.21). Further, the assumed convergence in 2-mean of X_l^{(k)}
to X_l (as k → ∞) implies the convergence of μ_l^{(k)} = E X_l^{(k)} to μ_l = E X_l, for
l = 1, 2, . . . , n, and hence also that

Σ_{ij}^{(k)} = E X_i^{(k)} X_j^{(k)} − μ_i^{(k)} μ_j^{(k)} → E X_i X_j − μ_i μ_j = Σ_{ij}.

In conclusion, we established the convergence of the mean vectors μ^{(k)} and the
covariance matrices Σ^{(k)} to the mean vector μ and the covariance matrix Σ of X,
respectively.
Fixing θ ∈ R^n, the assumed convergence in 2-mean of X^{(k)} to X also implies the
convergence in 2-mean, and hence in probability, of (θ, X^{(k)}) to (θ, X). Hence,
by bounded convergence, Φ_{X^{(k)}}(θ) → Φ_X(θ) for each fixed θ ∈ R^n (see Corollary
1.4.28). Since Φ_{X^{(k)}}(θ) = e^{−(1/2)(θ,Σ^{(k)}θ)} e^{i(θ,μ^{(k)})} for each k, the convergence of the
parameters (μ^{(k)}, Σ^{(k)}) implies that the function Φ_X(θ) must also be of such form.
That is, necessarily X has a Gaussian distribution, whose parameters are the limits
of the corresponding parameters of X^{(k)}, as claimed.
The next proposition provides an alternative to Definition 3.2.8.
Proposition 3.2.16. A random vector X has the Gaussian distribution if and
only if (Σ_{i=1}^n a_{ji} X_i, j = 1, . . . , m) is a Gaussian random vector for any nonrandom
coefficients a_{11}, a_{12}, . . . , a_{mn} ∈ R (c.f. [GS01, Definition 4.9.7]).
It is usually much easier to check Definition 3.2.8 than to check the conclusion of
Proposition 3.2.16. However, it is often very convenient to use the latter en route
to the derivation of some other property of X.
We are finally ready to define the class of Gaussian stochastic processes.
Definition 3.2.17. A stochastic process (S.P.) {X_t, t ∈ I} is Gaussian if for
all n < ∞ and all t_1, t_2, . . . , t_n ∈ I, the random vector (X_{t_1}, X_{t_2}, . . . , X_{t_n}) has a
Gaussian distribution, that is, all finite dimensional distributions of the process are
Gaussian.
To see that you understood well the definitions of Gaussian vectors and processes,
convince yourself that the following corollary holds.
Corollary 3.2.18. All distributional properties of Gaussian processes are deter-
mined by the mean μ(t) = E(X_t) of the process and its autocovariance function
ρ(t, s) = E[(X_t − μ(t))(X_s − μ(s))].
Applying Proposition 3.2.14 for the Gaussian vector X = (Y_{t_2} − Y_{t_1}, . . . , Y_{t_n} −
Y_{t_{n−1}}) of increments of a Gaussian stochastic process Y_t (with arbitrary finite n
and 0 ≤ t_1 < t_2 < · · · < t_n), we conclude from Definition 3.1.6 that
Proposition 3.2.19. If Cov(Y_{t+h} − Y_t, Y_s) = 0 for a Gaussian stochastic process
Y_t, all t ≥ s and h > 0, then the S.P. Y_t has uncorrelated hence independent
increments (which is thus also equivalent to E(Y_{t+h} − Y_t | σ(Y_s, s ≤ t)) = E(Y_{t+h} − Y_t)
for any t ≥ 0 and h > 0).
The special class of Gaussian processes plays a key role in our construction of
the Brownian motion. When doing so, we shall use the following extension of
Proposition 3.2.15.
Proposition 3.2.20. If the S.P. {X_t, t ∈ I} and the Gaussian S.P.s {X_t^{(k)}, t ∈ I}
are such that E[(X_t − X_t^{(k)})^2] → 0 as k → ∞, for each fixed t ∈ I, then X_t is a
Gaussian S.P. with mean and autocovariance functions that are the pointwise limits
of those for X_t^{(k)}.
Proof. Fixing n < ∞ and t_1, t_2, . . . , t_n ∈ I, apply Proposition 3.2.15 for
the sequence of Gaussian random vectors (X_{t_1}^{(k)}, X_{t_2}^{(k)}, . . . , X_{t_n}^{(k)}) to see that the
distribution of (X_{t_1}, X_{t_2}, . . . , X_{t_n}) is Gaussian. Since this applies for all finite
dimensional distributions of the S.P. {X_t, t ∈ I}, we are done (see Definition
3.2.17).
Here is the derivation of the C.L.T. statement of Example 1.4.13 and its extension
towards a plausible construction of the Brownian motion.
Exercise 3.2.21. Consider the random variables Ŝ_k of Example 1.4.13.
(a) Applying Proposition 3.2.6 verify that the corresponding characteristic
functions are Φ_{Ŝ_k}(θ) = [cos(θ/√k)]^k.
(b) Recalling that θ^{−2} log(cos θ) → −0.5 as θ → 0, find the limit of Φ_{Ŝ_k}(θ)
as k → ∞ while θ ∈ R is fixed.
(c) Suppose random vectors X^{(k)} and X in R^n are such that Φ_{X^{(k)}}(θ) →
Φ_X(θ) as k → ∞, for any fixed θ. It can be shown that then the laws of
X^{(k)}, as probability measures on R^n, must converge weakly in the sense
of Definition 1.4.20 to the law of X. Explain how this fact allows you to
verify the C.L.T. statement Ŝ_n →^L G of Example 1.4.13.
Exercise 3.2.22. Consider the random vectors X^{(k)} = (k^{−1/2} S_{k/2}, k^{−1/2} S_k) in R^2,
where k = 2, 4, 6, . . . is even, and S_k is the simple random walk of Definition 3.1.2,
with P(ξ_1 = 1) = P(ξ_1 = −1) = 0.5.
(a) Verify that

Φ_{X^{(k)}}(θ) = [cos((θ_1 + θ_2)/√k)]^{k/2} [cos(θ_2/√k)]^{k/2},

where θ = (θ_1, θ_2).
(b) Find the mean vector μ and the covariance matrix Σ of a Gaussian ran-
dom vector X for which Φ_{X^{(k)}}(θ) converges to Φ_X(θ) as k → ∞.
(c) Upon appropriately generalizing what you did in part (b), I claim that the
Brownian motion of Theorem 3.1.3 must be a Gaussian stochastic pro-
cess. Explain why, and guess what is the mean μ(t) and autocovariance
function ρ(t, s) of this process (if needed take a look at Chapter 5).
We conclude with a concrete example of a Gaussian stochastic process.
Exercise 3.2.23. Let Y_n = Σ_{k=1}^n ε_k V_k for i.i.d. random variables ε_k such that
p = P(ε_k = 1) = 1 − P(ε_k = −1), and i.i.d. Gaussian random variables V_k of
zero mean and variance one that are independent of the collection {ε_k}.
(a) Compute the mean μ(n) and autocovariance function ρ(ℓ, n) for the dis-
crete time stochastic process Y_n.
(b) Find the law of ε_1 V_1, explain why Y_n is a Gaussian process and provide
the joint density f_{Y_n,Y_{2n}}(x, y) of Y_n and Y_{2n}.
3.2.3. Stationary processes. We conclude this section with a brief discus-
sion of the important concept of stationarity, that is, invariance of the law of the
process to translation of time.
Definition 3.2.24. A stochastic process X_t indexed by t ∈ R is called (strong
sense) stationary if its f.d.d. satisfy

F_{t_1,t_2,...,t_N}(α_1, α_2, . . . , α_N) = P(X_{t_1} ≤ α_1, . . . , X_{t_N} ≤ α_N)
= P(X_{t_1+τ} ≤ α_1, . . . , X_{t_N+τ} ≤ α_N)
= F_{t_1+τ,t_2+τ,...,t_N+τ}(α_1, α_2, . . . , α_N),

for all τ ∈ R, N < ∞, α_i ∈ R, i = 1, . . . , N and any monotone t_1 < t_2 < · · · < t_N ∈ R.
A similar definition applies to discrete time S.P. indexed by t on the integers, just
then t_i and τ take only integer values.
It is particularly easy to verify the stationarity of Gaussian S.P. since
Proposition 3.2.25. A Gaussian S.P. is stationary if and only if μ(t) = μ (a
constant) and ρ(t, s) = r(|t − s|), where r : R → R is a function of the time
difference |t − s|. (A stochastic process whose mean and autocovariance function
satisfy these two properties is called weak sense (or covariance) stationary. In
general, a weak sense stationary process is not a strong sense stationary process
(for example, see [GS01, Example 8.2.5]). However, as the current Proposition
shows, the two notions of stationarity are equivalent in the Gaussian case.)
Exercise 3.2.26. Suppose X
t
is a zeromean, (weak sense) stationary process
with autocovariance function r(t).
(a) Show that [r(h)[ r(0) for all h > 0.
(b) Show that if r(h) = r(0) for some h > 0 then X
t+h
a.s.
= X
t
for each t.
(c) Explain why part (c) of Exercise 3.2.13 implies that if X
t
is a zero
mean, stationary, Gaussian process with autocovariance function r(t)
such that r(0) > 0, then E(X
t+h
[X
t
) =
r(h)
r(0)
X
t
for any t and h 0.
(d) Conclude that there is no zero-mean, stationary, Gaussian process of
independent increments other than the trivial process X_t ≡ X_0.
Definition 3.2.27. We say that a process {X_t, t ≥ 0} has stationary increments if
X_{t+h} − X_t and X_{s+h} − X_s have the same law for all s, t, h ≥ 0. The same definition
applies to discrete time S.P. indexed by t on the integers, just with t, s and h taking
only integer values (and if the S.P. is indexed by a nonnegative integer time, then
so are the values of s, t and h).
Example 3.2.28. Clearly, a sequence of independent and identically distributed
random variables . . . , X_{−2}, X_{−1}, X_0, X_1, . . . is a discrete time stationary process.
However, many processes are not stationary. For example, the random walk S_n =
Σ_{i=1}^n X_i of Definition 3.1.2 is a nonstationary S.P. when EX_1 = 0 and EX_1^2 = 1.
Indeed, if S_n was a stationary process then the law of S_n, and in particular its
second moment, would not depend on n, in contradiction with (3.1.1). Convince
yourself that every stationary process has stationary increments, but note that the
random walk S_n has stationary increments, thus demonstrating that stationary in-
crements are not enough for stationarity.
For more on stationary discrete time S.P. see [Bre92, Section 6.1], or see [GS01,
Chapter 9] for the general case.
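A quick numerical illustration of Example 3.2.28: E[S_n^2] = n grows with n, so the one-dimensional law of the simple random walk depends on n. A Python sketch (the sample sizes and checkpoints are arbitrary choices):

```python
import random

random.seed(3)

def second_moments(n_steps, reps, at):
    """Empirical E[S_n^2] of the simple random walk at the given times."""
    second = {t: 0.0 for t in at}
    for _ in range(reps):
        s = 0
        for n in range(1, n_steps + 1):
            s += random.choice((-1, 1))
            if n in second:
                second[n] += s * s
    return {t: v / reps for t, v in second.items()}

m = second_moments(100, 5000, at=(10, 50, 100))
# E[S_n^2] = n, so the law of S_n depends on n: the walk is not stationary.
ok = all(abs(m[t] - t) / t < 0.2 for t in m)
print(ok, m)
```

The empirical second moment tracks n at each checkpoint, confirming the dependence on n that rules out stationarity (even though the increments themselves are stationary).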
3.3. Sample path continuity
As we have seen in Section 3.1, the distribution of the S.P. does not uniquely specify
the probability of events outside the rather restricted σ-field F^X, and in particular
provides insufficient information about the behavior of its supremum, as well as
about the continuity of its sample path.
Our goal in this section is thus to find relatively easy to check sufficient conditions
for the existence of a modification of the S.P. that has somewhat nicer sample
paths. The following definition of sample path continuity is the first step in this
direction.
Definition 3.3.1. We say that X_t has continuous sample path w.p.1 if P(ω :
t → X_t(ω) is continuous) = 1. Similarly, we use the term continuous modification
to denote a modification X̃_t of a given S.P. X_t such that X̃_t has continuous
sample path w.p.1.
The next definition of Hölder continuity provides a quantitative refinement of this
notion of continuity, by specifying the maximal possible smoothness of the sample
path of X_t.
Definition 3.3.2. A S.P. Y_t is locally Hölder continuous with exponent γ if for
some c < ∞ and a R.V. h(ω) > 0,

P(ω : sup_{0≤s,t≤T, |t−s|≤h(ω)} |Y_t(ω) − Y_s(ω)| / |t − s|^γ ≤ c) = 1.

Remark. The word locally in the above definition refers to the R.V. h(ω).
When it holds for unrestricted t, s ∈ [0, T] we say that Y_t is globally (or uniformly)
Hölder continuous with exponent γ. A particularly important special case is that of
γ = 1, corresponding to Lipschitz continuous functions.
Equipped with Definition 3.3.2 our next theorem gives a very useful criterion for
the existence of a continuous modification (and even yields a further degree of
smoothness in terms of the Hölder continuity of the sample path).
Theorem 3.3.3 (Kolmogorov's continuity theorem). Given a S.P. X_t, t ∈ [0, T],
suppose there exist α, β, c, h_0 > 0 such that

(3.3.1) E(|X_{t+h} − X_t|^α) ≤ c h^{1+β}, for all 0 ≤ t, t + h ≤ T, 0 < h < h_0.

Then, there exists a continuous modification Y_t of X_t such that Y_t is also locally
Hölder continuous with exponent γ for any 0 < γ < β/α.
Remark. In case you have wondered why an exponent near (1 + β)/α does not
work in Theorem 3.3.3, read its proof, and in particular, the derivation of [KS97,
inequality (2.9), page 54]. Or, see [Oks03, Theorem 2.6, page 10], for a somewhat
weaker result.
It is important to note that condition (3.3.1) of Theorem 3.3.3 involves only the
joint distribution of (X_t, X_{t+h}) and as such is verifiable based on the f.d.d. of the
process. In particular, either all versions of the given S.P. satisfy (3.3.1) or none of
them does.
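To see the criterion (3.3.1) in action on the Brownian increments of Theorem 3.1.3: W_{t+h} − W_t ~ N(0, h), so E(|W_{t+h} − W_t|^4) = 3h^2, i.e. (3.3.1) holds with α = 4, β = 1, giving a locally Hölder continuous modification with any exponent γ < 1/4 (using higher moments pushes this toward the familiar γ < 1/2). A Python sketch estimating the fourth moment by simulating the increment directly (the sample size and h are arbitrary choices):

```python
import random

random.seed(4)

def fourth_moment(h, reps=200_000):
    # W_{t+h} - W_t ~ N(0, h), so simulate the increment directly
    return sum(random.gauss(0, h ** 0.5) ** 4 for _ in range(reps)) / reps

h = 0.01
est = fourth_moment(h)
rel_err = abs(est / (3 * h * h) - 1)
print(est, rel_err)
```

The Monte Carlo estimate matches the Gaussian fourth-moment identity E[N(0, h)^4] = 3h^2 up to sampling error.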
The following example demonstrates that we must have β > 0 in (3.3.1) to deduce
the existence of a continuous modification.
Example 3.3.4. Consider the stochastic process X_t(ω) = I_{U≤t}(ω), t ∈ [0, 1],
where U is a Uniform[0, 1] random variable (that is, U(ω) = ω on the probability
space of Example 1.1.11). Note that |X_{t+h} − X_t| = 1 if 0 ≤ t < U ≤ t + h,
and |X_{t+h} − X_t| = 0 otherwise. So, E(|X_{t+h} − X_t|^α) = P(t < U ≤ t + h) = h
for every α > 0, that is, (3.3.1) holds only with β = 0; and indeed every sample
path of this S.P. has a jump at t = U(ω), so it has no continuous modification.
s_k = Σ_{i=1}^k τ_i for k = 1, 2, 3, . . ., so
R_t ∈ {−1, 1} keeps the same value in each of the intervals (s_k, s_{k+1}). Since almost
surely s_1 < ∞, this S.P. does not have a continuous modification. However,

E(R_t R_{t+δ}) = P(R_t = R_{t+δ}) − P(R_t ≠ R_{t+δ}) = 1 − 2P(R_t ≠ R_{t+δ}),

and since for any t ≥ 0 and δ > 0,

δ^{−1} P(R_t ≠ R_{t+δ}) ≤ δ^{−1} P(∪_i {s_i ∈ (t, t + δ]}) = δ^{−1}(1 − e^{−δ}) ≤ 1,

we see that R_t indeed satisfies (3.3.2) with p = 1 and c = 2.
The stochastic process R_t of Example 3.3.6 is a special instance of the continuous
time Markov jump processes, which we study in Section 6.3. Though the sample
path of this process is almost never continuous, it has the right-continuity property
of the following definition, as is the case for all continuous-time Markov jump
processes of Section 6.3.
Definition 3.3.7. We say that a S.P. X_t has right-continuous with left limits
(in short, RCLL) sample path, if for a.e. ω, the sample path X_t(ω) is right-
continuous and has left limits at any t ≥ 0 (that is, for h ↓ 0 both X_{t+h}(ω) → X_t(ω)
and the limit of X_{t−h}(ω) exists). Similarly, a modification having RCLL sample
path with probability one is called an RCLL modification of the S.P.
Remark. To practice your understanding, check that any S.P. having continuous
sample path also has RCLL sample path (in particular, the Brownian motion of
Section 5.1 is such). The latter property plays a major role in continuous-time
martingale theory, as we shall see in Sections 4.2 and 4.3.2. For more on RCLL
sample path see [Bre92, Section 14.2].
Perhaps you expect any two S.P.s that are modifications of each other to have
(a.s.) indistinguishable sample paths, i.e. P(X_t = Y_t for all t ∈ I) = 1. This is
indeed what happens in discrete time, but in continuous time such a property may
in general fail, though it holds when both S.P.s have right-continuous sample
paths (a.s.).
Exercise 3.3.8.
(a) Let {X_n}, {Y_n} be discrete time S.P.s that are modifications of each
other. Show that P(X_n = Y_n for all n ≥ 0) = 1.
(b) Let {X_t}, {Y_t} be continuous time S.P.s that are modifications of each
other. Suppose that both processes have right-continuous sample paths
a.s. Show that P(X_t = Y_t for all t ≥ 0) = 1.
(c) Provide an example of two S.P.s which are modifications of one another
but which are not indistinguishable.
We conclude with a hierarchy of the sample path properties which have been
considered here.
Proposition 3.3.9. The following implications apply for the sample path of any
stochastic process:
Hölder continuity ⇒ Continuous w.p.1 ⇒ RCLL.
Proof. The stated relations between local Hölder continuity, continuity and
right-continuity with left limits hold for any function f : [0, ∞) → R. Considering
f(t) = X_t(ω) for a fixed ω leads to the corresponding relation for the sample
path of the S.P.
All the S.P. of interest to us shall have at least an RCLL modification, hence with
all properties implied by it, and unless explicitly stated otherwise, hereafter we
always assume that we are studying the RCLL modification of the S.P. in question.
The next theorem shows that objects like Y_t = ∫_0^t X_s ds are well defined under
mild conditions; it is crucial for the successful rigorous development of stochastic
calculus.
Theorem 3.3.10 (Fubini's theorem). If X_t has RCLL sample path and for some
interval I and σ-field H, almost surely ∫_I E[|X_t| | H] dt is finite, then almost surely
∫_I X_t dt is finite and

∫_I E[X_t | H] dt = E[∫_I X_t dt | H].

Remark. Taking H = {∅, Ω}, we get as a special case of Fubini's theorem that
∫_I EX_t dt = E(∫_I X_t dt) whenever ∫_I E|X_t| dt is finite.
Here is a converse of Fubini's theorem, where the differentiability of the sample
path t → X_t implies the differentiability of the mean t → EX_t of a S.P.
Exercise 3.3.11. Let {X_t, t ≥ 0} be a stochastic process such that for each ω
the sample path t → X_t(ω) is differentiable at any t ≥ 0.
(a) Verify that (∂/∂t) X_t is a random variable for each fixed t ≥ 0.
(b) Suppose further that there is an integrable random variable Y such that
|X_t − X_s| ≤ |t − s| Y for almost every ω and all t, s ≥ 0. Using the
dominated convergence theorem, show that t → EX_t is then differentiable
with a finite derivative such that for all t ≥ 0,

(d/dt) E(X_t) = E[(∂/∂t) X_t].
CHAPTER 4
Martingales and stopping times
In this chapter we study a collection of stochastic processes called martingales.
Among them are some of the S.P. we already met, namely the random walk (in
discrete time) and the Brownian motion (in continuous time). Many other S.P.
found in applications are also martingales. We start in Section 4.1 with the simpler
setting of discrete time martingales and filtrations (also called discrete parameter
martingales and filtrations). The analogous theory of continuous time (or param-
eter) filtrations and martingales is introduced in Section 4.2, taking care also of
sample path (right-)continuity. As we shall see in Section 4.3, martingales play a
key role in computations involving stopping times. Martingales share many other
nice properties, chief among which are tail bounds and convergence theorems.
Section 4.4 deals with martingale representations and tail inequalities (for both
discrete and continuous time). These lead to the various convergence theorems we
present in Section 4.5. To demonstrate the power of martingale theory we also
analyze in Section 4.6 the extinction probabilities of branching processes.
4.1. Discrete time martingales and filtrations
Subsection 4.1.1 introduces the concepts of filtration and martingale with some
illustrating examples. As shown in Subsection 4.1.2, square-integrable martingales
have zero-mean, orthogonal increments. The related supermartingales
and submartingales are the subject of Subsection 4.1.3. For additional examples see
[GS01, Sections 12.1, 12.2 and 12.8] or [KT75, Section 6.1] (for discrete time MGs),
[KT75, Section 6.7] (for filtrations), and [Ros95, Section 7.3] (for applications to
random walks).
4.1.1. Martingales and filtrations, definition and examples. A filtra-
tion represents any procedure of collecting more and more information as time
goes on. Our starting point is the following rigorous mathematical definition of a
(discrete time) filtration.
Definition 4.1.1. A ltration is a nondecreasing family of of subelds T
n
is such that (X
n
, T
n
) is a martingale, then you can show using the tower property
that (X
n
, H
n
) is also a martingale. To check your understanding of the preceding,
and more generally, of what a martingale is, solve the next exercise.
Exercise 4.1.5. Suppose (X
n
, T
n
) is a martingale. Show that then X
n
is also
a martingale with respect to its canonical ltration and that a.s. E[X
[T
n
] = X
n
for all > n.
In view of Exercise 4.1.5, unless explicitly stated otherwise, when we say that
X
n
is a MG we mean using the canonical ltration (X
k
, k n).
Exercise 4.1.6. Provide an example of a probability space (, T, P), a ltration
T
n
and a stochastic process X
n
adapted to T
n
such that:
(a) X
n
is a martingale with respect to its canonical ltration but (X
n
, T
n
)
is not a martingale.
(b) Provide a probability measure Q on (, T) under which X
n
is not a
martingale even with respect to its canonical ltration.
Hint: Go for a simple construction. For example, = a, b, T
0
= T = 2
,
X
0
= 0 and X
n
= X
1
for all n 1.
We next provide a convenient alternative characterization of the martingale
property in terms of the martingale differences.

Proposition 4.1.7. If $X_n = \sum_{i=0}^{n} D_i$ then the canonical filtration for $X_n$ is
the same as the canonical filtration for $D_n$. Further, $(X_n, \mathcal{F}_n)$ is a martingale if
and only if $D_n$ is an integrable S.P., adapted to $\mathcal{F}_n$, such that $E(D_{n+1} \mid \mathcal{F}_n) = 0$
a.s. for all $n$.

Proof. Since the transformation from $(X_0, \ldots, X_n)$ to $(D_0, \ldots, D_n)$ is continuous
and invertible, it follows from Corollary 1.2.17 that $\sigma(X_k, k \le n) = \sigma(D_k, k \le n)$
for each $n$. By Definition 4.1.3 we see that $X_n$ is adapted to a filtration
$\mathcal{F}_n$ if and only if $D_n$ is adapted to this filtration. It is very easy to show by
induction on $n$ that $E|X_k| < \infty$ for $k = 0, \ldots, n$ if and only if $E|D_k| < \infty$ for
$k = 0, \ldots, n$. Hence, $X_n$ is an integrable S.P. if and only if $D_n$ is. Finally, with
$X_n$ measurable on $\mathcal{F}_n$ it follows from the linearity of the C.E. that
$$E[X_{n+1} \mid \mathcal{F}_n] - X_n = E[X_{n+1} - X_n \mid \mathcal{F}_n] = E[D_{n+1} \mid \mathcal{F}_n]\,,$$
and the alternative expression for the martingale property follows from (4.1.1).

In view of Proposition 4.1.7 we call $D_n = X_n - X_{n-1}$ for $n \ge 1$ and $D_0 = X_0$
the martingale differences associated with a martingale $X_n$. We now detail a few
simple examples of martingales, starting with the random walk of Definition 3.1.2.
Example 4.1.8. The random walk $S_n = \sum_{k=1}^{n} \xi_k$, with $\xi_k$ independent, identically
distributed, such that $E|\xi_1| < \infty$ and $E\xi_1 = 0$, is a MG (for its canonical filtration).
More generally, $S_n$ is a MG even when the independent and integrable R.V. $\xi_k$
of zero mean have non-identical distributions. Further, the canonical filtration may
be replaced by the filtration $\sigma(\xi_1, \ldots, \xi_n)$. Indeed, this is just an application of
Proposition 4.1.7 for the case where the differences $D_k = S_k - S_{k-1} = \xi_k$, $k \ge 1$ (and
$D_0 = 0$), are independent, integrable and $E[D_{n+1} \mid D_0, D_1, \ldots, D_n] = E(D_{n+1}) = 0$
for all $n \ge 0$ by our assumption that $E\xi_k = 0$ for all $k$. Alternatively, for a
direct proof recall that $E|S_n| \le \sum_{k=1}^{n} E|\xi_k| < \infty$ for all $n$, that is, the S.P. $S_n$ is
integrable. Moreover, since $\xi_{n+1}$ is independent of $(S_1, \ldots, S_n)$ and $E(\xi_{n+1}) = 0$,
we have that
$$E[S_{n+1} \mid S_1, \ldots, S_n] = E[S_n \mid S_1, \ldots, S_n] + E[\xi_{n+1} \mid S_1, \ldots, S_n] = S_n + E(\xi_{n+1}) = S_n\,,$$
implying that $S_n$ is a MG for its canonical filtration.
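The constancy of $E S_n$ for the random walk martingale of Example 4.1.8 is easy to check numerically. The following sketch (not part of the text; Rademacher steps, path counts and the seed are illustrative choices) estimates $E[S_n]$ for several $n$ by Monte Carlo:

```python
import random

def random_walk_means(num_paths=20000, horizon=10, seed=0):
    """Estimate E[S_n] for a zero-mean random walk with Rademacher steps.

    Since S_n is a martingale with S_0 = 0, every E[S_n] should be 0
    (a MG has constant expectation, cf. Remark 4.1.21 below).
    """
    rng = random.Random(seed)
    means = [0.0] * (horizon + 1)
    for _ in range(num_paths):
        s = 0
        for n in range(1, horizon + 1):
            s += rng.choice((-1, 1))  # xi_k with E[xi_k] = 0
            means[n] += s
    return [m / num_paths for m in means]

means = random_walk_means()
print(means)  # every entry close to 0
```

This only tests the necessary condition $E S_n = E S_0$, not the full conditional-expectation identity, but it is a useful sanity check when experimenting with martingale examples.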
Exercise 4.1.9. Let $\mathcal{F}_n$ be a filtration and $X$ an integrable R.V. Define $Y_n =
E(X \mid \mathcal{F}_n)$ and show that $(Y_n, \mathcal{F}_n)$ is a martingale. How do you interpret $Y_n$?
Our next example takes an arbitrary S.P. $V_n$ and creates a MG by considering
$\sum_{k=1}^{n} V_k \xi_k$ for an appropriate auxiliary sequence $\xi_k$ of R.V.

Example 4.1.10. Let $Y_n = \sum_{k=1}^{n} V_k \xi_k$, where $V_n$ is an arbitrary bounded S.P.
and $\xi_n$ is a sequence of integrable R.V. such that for $n = 0, 1, \ldots$ both $E(\xi_{n+1}) = 0$
and $\xi_{n+1}$ is independent of $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n, V_1, \ldots, V_{n+1})$. Then, $Y_n$ is a MG
for its canonical filtration and even for the possibly larger filtration $\mathcal{F}_n$. This is yet
another application of Proposition 4.1.7, now with the differences $D_k = Y_k - Y_{k-1} =
V_k \xi_k$, $k \ge 1$ (and $D_0 = 0$). Indeed, we assumed $\xi_k$ are integrable and $|V_k| \le C_k$
for some non-random finite constants $C_k$, resulting with $E|D_k| \le C_k E|\xi_k| < \infty$,
whereas
$$E[D_{n+1} \mid \mathcal{F}_n] = E[V_{n+1} \xi_{n+1} \mid \mathcal{F}_n] = V_{n+1} E[\xi_{n+1} \mid \mathcal{F}_n] \quad \text{(take out what is known)}$$
$$= V_{n+1} E[\xi_{n+1}] = 0 \quad \text{(zero-mean } \xi_{n+1} \text{ is independent of } \mathcal{F}_n)\,,$$
giving us the martingale property.

A special case of Example 4.1.10 is when the auxiliary sequence $\xi_k$ is independent
of the given S.P. $V_n$ and consists of zero-mean, independent, identically distributed
R.V. For example, random i.i.d. signs $\xi_k \in \{-1, 1\}$ (with $P(\xi_k = 1) = \frac{1}{2}$)
are commonly used in discrete mathematics applications (for other martingale
applications c.f. [AS00, Chapter 7]).

Example 4.1.10 is a special case of the powerful martingale transform method. To
explore this further, we first extract what was relevant in this example about the
relation between the sequence $V_n$ and the filtration $\mathcal{F}_n$.
Definition 4.1.11. We call a sequence $V_n$ previsible (or predictable) for the
filtration $\mathcal{F}_n$ if $V_n$ is measurable on $\mathcal{F}_{n-1}$ for all $n \ge 1$.

The other relevant property of Example 4.1.10 is the fact that $X_n = \sum_{k=1}^{n} \xi_k$ is a
MG for the filtration $\mathcal{F}_n$ (which is merely a rerun of Example 4.1.8, now with the
excess but irrelevant information carried by the S.P. $V_n$). Having understood the
two key properties of Example 4.1.10, we are ready for the more general martingale
transform.

Theorem 4.1.12. Let $(X_n, \mathcal{F}_n)$ be a MG and $V_n$ be a previsible sequence for the
same filtration. The sequence of R.V.
$$Y_n = \sum_{k=1}^{n} V_k (X_k - X_{k-1})\,,$$
called the martingale transform of $V_n$ with respect to $X_n$, is then a MG provided
each $Y_n$ is integrable, as is the case when $V_k \in L^p$ and $X_k \in L^q$ for all $k$ and
some $p, q \ge 1$ with $\frac{1}{p} + \frac{1}{q} = 1$.

Example 4.1.13. For instance, $Y_n = \sum_{k=1}^{n} X_{k-1}(X_k - X_{k-1})$ is a MG whenever
$X_n \in L^2(\Omega, \mathcal{F}, P)$ is a MG (just note that the sequence $V_n = X_{n-1}$ is previsible for
any filtration $\mathcal{F}_n$ with respect to which $X_n$ is adapted and take $p = q = 2$ in
Theorem 4.1.12).
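The martingale transform can be probed numerically. The sketch below (an illustration, not from the text) takes $X_n$ to be a Rademacher random walk and the previsible bet $V_k = \mathrm{sign}(X_{k-1})$, then checks that the transform keeps mean zero; the strategy, path count and seed are assumptions made for the demonstration:

```python
import random

def martingale_transform_mean(num_paths=20000, horizon=10, seed=1):
    """Monte Carlo check that a martingale transform has constant zero mean.

    X_n is a Rademacher random walk (a MG) and V_k depends only on
    X_{k-1}, so V_k is previsible: it is known before step k is revealed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        x, y = 0, 0.0
        for _ in range(horizon):
            v = 1.0 if x >= 0 else -1.0   # previsible: a function of X_{k-1}
            step = rng.choice((-1, 1))    # increment X_k - X_{k-1}
            y += v * step
            x += step
        total += y
    return total / num_paths

print(martingale_transform_mean())  # close to 0
```

Interpreting $V_k$ as a betting strategy, this is the "you cannot beat a fair game with a previsible strategy" content of Theorem 4.1.12: no matter how $V_k$ is chosen from past information, $E Y_n = 0$.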
A classical martingale is derived from the random products of Example 3.1.4.

Example 4.1.14. Consider the integrable S.P. $M_n = \prod_{i=1}^{n} Y_i$ for strictly positive
R.V. $Y_i$. By Corollary 1.2.17 its canonical filtration coincides with $\mathcal{G}_n =
\sigma(Y_1, \ldots, Y_n)$ and since $E(M_{n+1} \mid \mathcal{G}_n) = E(Y_{n+1} M_n \mid \mathcal{G}_n) = M_n E(Y_{n+1} \mid \mathcal{G}_n)$, the MG
condition for $M_n$ is just $E(Y_{n+1} \mid Y_1, \ldots, Y_n) = 1$ a.s. for all $n$. In the context of
i.i.d. $Y_i$ as in Example 3.1.4 we see that $M_n$ is a MG if and only if $E(Y_1) = 1$,
corresponding to a neutral return rate. Note that this is not the same as the
condition $E(\log Y_1) = 0$ under which the associated random walk $S_n = \log M_n$ is a
MG.
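A quick numerical illustration of Example 4.1.14 (not from the text): take $Y_i$ equal to $0.5$ or $1.5$ with equal probability, an assumed toy distribution with $E(Y_1) = 1$ but $E(\log Y_1) < 0$, and check that $E M_n$ stays at $1$:

```python
import random

def product_mg_mean(num_paths=40000, horizon=8, seed=2):
    """Check E[M_n] = 1 for M_n = prod Y_i with i.i.d. Y_i of mean one.

    Y_i is 0.5 or 1.5 with probability 1/2 each, so E(Y_1) = 1 and M_n is
    a MG; note E(log Y_1) = (log 0.5 + log 1.5)/2 < 0, so log M_n is not.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        m = 1.0
        for _ in range(horizon):
            m *= rng.choice((0.5, 1.5))
        total += m
    return total / num_paths

print(product_mg_mean())  # close to 1
```

The same experiment also shows why the two conditions in the example differ: typical sample paths of $M_n$ decay (since $E \log Y_1 < 0$) even though the mean is constant.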
4.1.2. Orthogonal increments and square-integrable martingales. In
case $E(X_n^2) < \infty$ for all $n$, we have an alternative definition of being a MG, somewhat
reminiscent of our definition of the conditional expectation.

Proposition 4.1.15. A S.P. $X_n \in L^2(\Omega, \mathcal{F}, P)$ adapted to the filtration $\mathcal{F}_n$ is
a MG if and only if $E[(X_{n+1} - X_n)Z] = 0$ for any $Z \in L^2(\Omega, \mathcal{F}_n, P)$.

Proof. This follows from the definition of MG, such that $E[X_{n+1} \mid \mathcal{F}_n] = X_n$,
together with the definition of the C.E. via orthogonal projection on $L^2(\Omega, \mathcal{F}_n, P)$
as in Definition 2.1.3.
A martingale $X_n$ such that $E(X_n^2) < \infty$ for all $n$ is called an $L^2$-MG, or square-integrable
MG. We gain more insight into the nature of such MGs by reformulating
Proposition 4.1.15 in terms of the martingale differences $D_n = X_n - X_{n-1}$. To this
end, we have the following definition.

Definition 4.1.16. We say that $D_n \in L^2(\Omega, \mathcal{F}, P)$ is an orthogonal sequence
of R.V. if $E[D_n h(D_0, D_1, \ldots, D_{n-1})] = E[D_n] E[h(D_0, \ldots, D_{n-1})]$ for any $n \ge 1$
and every Borel function $h : \mathbb{R}^n \to \mathbb{R}$ such that $E[h(D_0, \ldots, D_{n-1})^2] < \infty$.
Alternatively, in view of Definition 2.1.3 and Proposition 1.2.14 this is merely the
statement that $E[D_n \mid D_0, D_1, \ldots, D_{n-1}] = E[D_n]$ for all $n$.
It is possible to extend Proposition 1.4.40 to the statement that the random variables
$D_k \in L^2(\Omega, \mathcal{F}, P)$ are mutually independent if and only if
$$E[g(D_n) h(D_0, D_1, \ldots, D_{n-1})] = E[g(D_n)] E[h(D_0, \ldots, D_{n-1})]$$
for every $n$ and any Borel functions $h : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ such that
$E[h(D_0, \ldots, D_{n-1})^2] < \infty$ and $E[g(D_n)^2] < \infty$. In particular, independence implies
orthogonality (where $g(\cdot)$ has to be linear), which in turn implies that the
R.V. $D_n$ are uncorrelated, that is, $E(D_n D_i) = E(D_n) E(D_i)$ for all $0 \le i < n < \infty$
(i.e. both $g(\cdot)$ and $h(\cdot)$ have to be linear functions).

As we show next, for square-integrable S.P. the martingale property amounts to
having zero-mean, orthogonal differences. It is thus just the right compromise
between the perhaps too restrictive requirement of having zero-mean, independent
differences, and the ineffective property of just having zero-mean, uncorrelated
differences.
Proposition 4.1.17. A S.P. $X_n \in L^2(\Omega, \mathcal{F}, P)$ is a MG for its canonical filtration
if and only if it has an orthogonal, zero-mean differences sequence $D_n = X_n - X_{n-1}$,
$n \ge 1$.

Proof. In view of Definition 4.1.16, this is just a simple reformulation of
Proposition 4.1.15.

The MG property is especially simple to understand for a Gaussian S.P. As we
have seen, a necessary condition for the MG property is to have $E D_n = 0$ and
$E(D_n D_i) = 0$ for all $0 \le i < n$. With the Gaussian vector $D = (D_0, \ldots, D_n)$
having uncorrelated coordinates, we know that the corresponding covariance matrix has
zero entries except at the main diagonal $j = k$ (see Proposition 3.2.11). Hence, by
Definition 3.2.8, the characteristic function $\Phi_D(\theta)$ is of the form $\prod_{k=0}^{n} \Phi_{D_k}(\theta_k)$.
This in turn implies that the coordinates $D_k$ of this random vector are mutually
independent (see Proposition 3.2.6). In conclusion, for a Gaussian S.P. having
independent, orthogonal or uncorrelated differences are equivalent properties, which
together with each of these differences having a zero mean is also equivalent to the
MG property.
4.1.3. Sub-martingales and super-martingales. Often when operating on
a MG, we naturally end up with a sub- or super-martingale, as defined below.
Moreover, these processes share many of the properties of martingales, so it is
useful to develop a unified theory for them.

Definition 4.1.18. A sub-martingale (denoted sub-MG) is an integrable S.P. $X_n$,
adapted to the filtration $\mathcal{F}_n$, such that
$$E[X_{n+1} \mid \mathcal{F}_n] \ge X_n \qquad \forall n, \text{ a.s.}$$
A super-martingale (denoted sup-MG) is an integrable S.P. $X_n$, adapted to the
filtration $\mathcal{F}_n$, such that
$$E[X_{n+1} \mid \mathcal{F}_n] \le X_n \qquad \forall n, \text{ a.s.}$$
(A typical S.P. $X_n$ is neither a sub-MG nor a sup-MG, as the sign of the R.V.
$E[X_{n+1} \mid \mathcal{F}_n] - X_n$ may well be random, or possibly dependent upon the time index
$n$.)

Remark 4.1.19. Note that $X_n$ is a sub-MG if and only if $-X_n$ is a sup-MG.
By this identity, all results about sub-MGs have dual statements for sup-MGs and
vice versa. We often state only one out of each such pair of statements. Further,
$X_n$ is a MG if and only if it is both a sub-MG and a sup-MG. As a result,
every statement holding for either sub-MGs or sup-MGs also holds for MGs.
Example 4.1.20. In the context of Example 4.1.14, if $E(Y_{n+1} \mid Y_1, \ldots, Y_n) \ge 1$
a.s. for all $n$ then $M_n$ is a sub-MG, and if $E(Y_{n+1} \mid Y_1, \ldots, Y_n) \le 1$ a.s. for all
$n$ then $M_n$ is a sup-MG. Such martingales appear for example in mathematical
finance, where $Y_i$ denotes the random proportional change in the value of a risky
asset at the $i$-th trading round. So, a positive conditional mean return rate yields a
sub-MG while a negative conditional mean return rate gives a sup-MG.
Remark 4.1.21. If $X_n$ is a sub-MG, then necessarily $n \mapsto E X_n$ is non-decreasing,
since by the tower property of (conditional) expectation
$$E[X_n] = E[E[X_n \mid \mathcal{F}_{n-1}]] \ge E[X_{n-1}]\,,$$
for all $n \ge 1$. Convince yourself that by Remark 4.1.19 this implies that if $X_n$
is a sup-MG then $n \mapsto E X_n$ is non-increasing, hence for a MG $X_n$ we have that
$E(X_n) = E(X_0)$ for all $n$.

We next detail a few examples in which sub-MGs or sup-MGs naturally appear,
starting with an immediate consequence of Jensen's inequality.
Exercise 4.1.22. Suppose $(X_n, \mathcal{F}_n)$ is a martingale and $\Phi : \mathbb{R} \to \mathbb{R}$ is a convex
function such that $E[|\Phi(X_n)|] < \infty$. Show that $(\Phi(X_n), \mathcal{F}_n)$ is a sub-martingale.
Hint: Use Proposition 2.3.9.

Remark. Some examples of convex functions for which the above exercise is
commonly applied are $\Phi(x) = |x|^p$, $p \ge 1$, $\Phi(x) = e^x$ and $\Phi(x) = x \log x$ (the latter
only for $x > 0$). Taking instead a concave function $\Phi(\cdot)$ leads to a sup-MG, as for
example when $\Phi(x) = x^p$, $p \in (0, 1)$ or $\Phi(x) = \log x$, both restricted to $x > 0$.

Here is a concrete application of Exercise 4.1.22.
Exercise 4.1.23. Let $\xi_1, \xi_2, \ldots$ be independent with $E\xi_i = 0$ and $E\xi_i^2 = \sigma_i^2$.
(a) Let $S_n = \sum_{i=1}^{n} \xi_i$ and $s_n^2 = \sum_{i=1}^{n} \sigma_i^2$. Show that $S_n^2$ is a submartingale
and $S_n^2 - s_n^2$ is a martingale.
(b) Suppose also that $m_n = \prod_{i=1}^{n} E(e^{\xi_i}) < \infty$. Show that $e^{S_n}$ is a submartingale
and $M_n = e^{S_n} / m_n$ is a martingale.
A special case of Exercise 4.1.23 is the random walk $S_n = \sum_{k=1}^{n} \xi_k$. Assuming in
addition to $E(\xi_1) = 0$ that also $E\xi_1^2 < \infty$, we see that $(S_n^2, \sigma(\xi_1, \ldots, \xi_n))$ is a sub-MG.
Further, $s_n^2 = E(S_n^2) = n E(\xi_1^2)$ (see (3.1.1)), hence $S_n^2 - n E(\xi_1^2)$ is a MG. Likewise,
$e^{S_n}$ is a sub-MG for the same filtration whenever $E(\xi_1) = 0$ and $E(e^{\xi_1}) < \infty$.
Though $e^{S_n}$ is not a MG (unless $\xi_i = 0$ a.s.), the normalized $M_n = e^{S_n} / [E(e^{\xi_1})]^n$ is
merely the MG of Example 4.1.14 for the product of i.i.d. $Y_i = e^{\xi_i} / E(e^{\xi_1})$.
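The normalization above is easy to see in simulation. In this sketch (illustrative, not from the text) the steps are Rademacher signs, so $E(e^{\xi_1}) = \cosh(1)$, and the normalized exponential $M_n = e^{S_n}/\cosh(1)^n$ should keep mean $1$ while $e^{S_n}$ alone has growing mean:

```python
import math
import random

def exp_mg_mean(num_paths=40000, horizon=10, seed=3):
    """Check that M_n = e^{S_n} / [E(e^xi)]^n has constant mean 1.

    The steps xi_k are i.i.d. Rademacher signs (an assumed toy choice),
    so E(e^{xi_1}) = cosh(1) > 1 and e^{S_n} by itself is only a sub-MG.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        s = 0
        for _ in range(horizon):
            s += rng.choice((-1, 1))
        total += math.exp(s) / math.cosh(1.0) ** horizon
    return total / num_paths

print(exp_mg_mean())  # close to 1
```

The heavier tail of $e^{S_n}$ makes this estimate noisier than the earlier ones, which is why more paths (and a looser tolerance) are appropriate here.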
4.2. Continuous time martingales and right continuous filtrations

In duality with Definitions 4.1.1 and 4.1.4 we define the continuous time filtrations
and martingales as follows.

Definition 4.2.1. The pair $(X_t, \mathcal{F}_t)$, $t \ge 0$ real-valued, is called a continuous
time martingale (in short MG), if:
(a) The $\sigma$-fields $\mathcal{F}_t \subseteq \mathcal{F}$, $t \ge 0$ form a continuous time filtration, that is, $\mathcal{F}_t \subseteq
\mathcal{F}_{t+h}$, for all $t \ge 0$ and $h > 0$.
(b) The continuous time S.P. $X_t$ is integrable and adapted to this filtration.
That is, $E|X_t| < \infty$ and $\sigma(X_t) \subseteq \mathcal{F}_t$ for all $t \ge 0$.
(c) For any fixed $t \ge 0$ and $h > 0$, the identity $E(X_{t+h} \mid \mathcal{F}_t) = X_t$ holds a.s.
Similar definitions apply to S.P. $X_t$ indexed by $t \in I$ for an interval $I \subseteq \mathbb{R}$, just
requiring also that in (a)-(c) both $t$ and $t + h$ be in $I$.

Replacing the equality in (c) of Definition 4.2.1 with $\ge$ or with $\le$ defines the
continuous time sub-MG, or sup-MG, respectively. These three groups of S.P. are
related in the same manner as in the discrete time setting (c.f. Remark 4.1.19).
Similar to Remark 4.1.21 we have that if $X_t$ is a sub-MG then $E X_t \ge E X_s$ for
all $t \ge s$, and if $X_t$ is a sup-MG then $E X_t \le E X_s$ for all $t \ge s$, so $E X_t = E X_0$
for all $t$ when $X_t$ is a MG.
Let $\sigma(X_s, 0 \le s \le t)$ denote the smallest $\sigma$-field containing $\sigma(X_s)$ for each $s \le t$
(compare with $\mathcal{F}^X$ of Section 3.1). In analogy with Definition 4.1.3 and Exercise
4.1.5, the canonical filtration for a continuous time S.P. $X_t$ is $\sigma(X_s, 0 \le s \le t)$.
Further, as you check below, if $(X_t, \mathcal{F}_t)$ is a MG, then $(X_t, \sigma(X_s, 0 \le s \le t))$ is
also a MG. Hence, as in the discrete time case, our default is to use the canonical
filtration when studying MGs (or sub/sup-MGs).

Exercise 4.2.2. Let $\mathcal{G}_t = \sigma(X_s, s \le t)$. Show that
(a) If $(X_t, \mathcal{F}_t)$, $t \ge 0$ is a continuous time martingale for some filtration
$\mathcal{F}_t$, then $(X_t, \mathcal{G}_t)$, $t \ge 0$ is also a martingale.
(b) If $(X_t, \mathcal{G}_t)$ is a continuous time submartingale and $E(X_t) = E(X_0)$ for
all $t \ge 0$, then $(X_t, \mathcal{G}_t)$ is also a martingale.
To practice your understanding, verify that the following identity holds for any
square-integrable MG $(X_t, \mathcal{F}_t)$:
$$\text{(4.2.1)} \qquad E[X_t^2 \mid \mathcal{F}_s] - X_s^2 = E[(X_t - X_s)^2 \mid \mathcal{F}_s] \quad \text{for any } t \ge s \ge 0\,,$$
and upon taking expectations deduce that for such a MG the function $E X_t^2$ is
non-decreasing in $t$.
We have the following analog of Example 4.1.8, showing that in particular, the
Brownian motion is a (continuous time) MG.

Proposition 4.2.3. Any integrable S.P. $M_t$ of independent increments (see
Definition 3.1.6) and constant mean (i.e. $E M_t = E M_0$) is a MG.

Proof. Recall that a process $M_t$ has independent increments if $M_{t+h} - M_t$ is
independent of $\mathcal{G}_t = \sigma(M_s, 0 \le s \le t)$, for all $h > 0$ and $t \ge 0$. We assume that
in addition to having independent increments, also $E|M_t| < \infty$ and $E M_t = E M_0$
for all $t$. Then, $E(M_{t+h} - M_t \mid \mathcal{G}_t) = E(M_{t+h} - M_t) = 0$. Since $M_t$ is measurable
on $\mathcal{G}_t$, we deduce that $E(M_{t+h} \mid \mathcal{G}_t) = M_t$, that is, $M_t$ is a MG (for its canonical
filtration $\mathcal{G}_t$).
Conversely, as you check below, any Gaussian martingale $M_t$ is square-integrable
and of independent increments, in which case $M_t^2 - E M_t^2$ is also a martingale.

Exercise 4.2.4.
(a) Deduce from the identity (4.2.1) that if the MG $M_t$, $t \ge 0$ of Proposition
4.2.3 is square-integrable, then $(M_t^2 - A_t, \mathcal{G}_t)$ is a MG for $\mathcal{G}_t = \sigma(M_s, s \le t)$
and the non-random, non-decreasing function $A_t = E M_t^2 - E M_0^2$.
(b) Show that if a Gaussian S.P. $M_t$ is a MG, then it is square-integrable
and of independent increments.
(c) Conclude that $(M_t^2 - A_t, \mathcal{G}_t)$ is a MG for $A_t = E M_t^2 - E M_0^2$ and any
Gaussian MG $M_t$.
Some of the many MGs associated with the Brownian motion are provided next
(see [KT75, Section 7.5] for more).

Exercise 4.2.5. Let $\mathcal{G}_t$ denote the canonical filtration of a Brownian motion $W_t$.
(a) Show that for any $\theta \in \mathbb{R}$, the S.P. $M_t(\theta) = \exp(\theta W_t - \theta^2 t / 2)$ is a
continuous time martingale with respect to $\mathcal{G}_t$.
(b) Explain why $\frac{d^k}{d\theta^k} M_t(\theta)$ are also martingales with respect to $\mathcal{G}_t$.
(c) Compute the first three derivatives in $\theta$ of $M_t(\theta)$ at $\theta = 0$ and deduce
that the S.P. $W_t^2 - t$ and $W_t^3 - 3t W_t$ are also MGs.
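The two processes from part (c) can be spot-checked numerically, using only the marginal law $W_t \sim N(0, t)$. This sketch (illustrative; the fixed $t$, sample size and seed are assumptions) verifies the necessary condition that both processes have mean zero:

```python
import random

def brownian_mg_means(t=2.0, num_samples=200000, seed=4):
    """Spot-check E[W_t^2 - t] = 0 and E[W_t^3 - 3 t W_t] = 0.

    Constant expectation is only a necessary condition for the MG
    property, but it is easy to test since W_t ~ N(0, t).
    """
    rng = random.Random(seed)
    m2 = m3 = 0.0
    for _ in range(num_samples):
        w = rng.gauss(0.0, t ** 0.5)
        m2 += w * w - t
        m3 += w ** 3 - 3.0 * t * w
    return m2 / num_samples, m3 / num_samples

print(brownian_mg_means())  # both entries close to 0
```

Of course this does not test the conditional identity $E[W_t^2 - t \mid \mathcal{G}_s] = W_s^2 - s$; that requires the independent-increments decomposition $W_t = W_s + (W_t - W_s)$ as in Proposition 4.2.3.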
Our next example is of continuous time sub-MGs and sup-MGs that are obtained
in the context of Example 3.1.4 and play a major role in modeling the prices of
risky assets.

Example 4.2.6. For each $n \ge 1$ let $S_k^{(n)}$ denote the random walk corresponding
to i.i.d. increments $\xi_i^{(n)}$ each having the $N(r/n, 1/n)$ law (i.e. $\xi_i^{(n)}$ is a Gaussian
random variable of mean $r/n$ and variance $1/n$). The risky asset value at time
$t = k/n$ is $\exp(S_k^{(n)})$, modeling discrete trading rounds held each $1/n$ units of time,
with $r$ denoting the mean return rate per unit time. So, consider the linear
interpolation between the points $(k/n, \exp(S_k^{(n)}))$ in the plane, for $k = 0, 1, \ldots, n$,
which in the limit $n \to \infty$ converges weakly to the continuous time stochastic process
$Y_t = \exp(W_t + rt)$, called the Geometric Brownian motion. The Geometric
Brownian motion $Y_t$ is a martingale for $r = -1/2$, a sub-MG for $r \ge -1/2$ and
a sup-MG for $r \le -1/2$ (compare with Example 3.1.4).
As we explain next, each result derived for continuous time MGs implies the
corresponding result for discrete time MGs.

Example 4.2.7. Any discrete time MG $(X_n, \mathcal{F}_n)$ is made into a continuous time
MG $(X_t, \mathcal{F}_t)$ by the interpolation $\mathcal{F}_t = \mathcal{F}_n$ and $X_t = X_n$ for $n = 0, 1, 2, \ldots$ and all
$t \in [n, n+1)$. Indeed, this interpolation keeps the integrable $X_t$ adapted to $\mathcal{F}_t$,
and the latter a filtration, so it remains only to check the MG condition (c), that
is, fixing $t \ge 0$ and $h > 0$,
$$E[X_{t+h} \mid \mathcal{F}_t] = E[X_{t+h} \mid \mathcal{F}_\ell] = E[X_m \mid \mathcal{F}_\ell] = X_\ell\,,$$
where $\ell$ and $m$ are the integer parts of $t$ and $t + h$, respectively, and the rightmost
equality follows since $(X_n, \mathcal{F}_n)$ is a (discrete time) MG.
While we do not pursue this subject further here, observe that the continuous time
analog of the martingale transform of Theorem 4.1.12 is the stochastic integral
$$Y_t = \int_0^t V_s \, dX_s\,,$$
which for $X_t$ a Brownian motion is the main object of study in stochastic calculus
or stochastic differential equations (to which the texts [Oks03, KS97, Mik98] are
devoted). For example, the analog of Example 4.1.13 for the Brownian motion is
$Y_t = \int_0^t W_s \, dW_s$, which for the appropriate definition of the stochastic integral (due
to Ito) is merely the martingale $Y_t = \frac{1}{2}(W_t^2 - t)$ for the filtration $\mathcal{F}_t = \sigma(W_s, s \le t)$.
Note that this Ito stochastic integral coincides with martingale theory at the price of
deviating from the standard integration by parts formula. Indeed, the latter would
apply if the Brownian sample path $t \mapsto W_t(\omega)$ were differentiable with probability
one. As we see in Section 5.2, this is definitely not true, so the breakdown of standard
integration by parts should not come as a surprise.
Similarly to what happens for discrete time (see Exercise 4.1.22), the sub-MG
property of $X_t$ is inherited by $\Phi(X_t)$ for convex, non-decreasing functions $\Phi$.

Exercise 4.2.8. Suppose $(X_t, \mathcal{F}_t)$ is a MG, $\Phi$ is convex and $E|\Phi(X_t)| < \infty$. Using
Jensen's inequality for the C.E. check that $(\Phi(X_t), \mathcal{F}_t)$ is a sub-MG. Moreover, the same
applies even when $(X_t, \mathcal{F}_t)$ is only a sub-MG, provided $\Phi$ is also non-decreasing.

Here are additional closure properties of sub-MGs, sup-MGs and MGs.

Exercise 4.2.9. Suppose $(X_t, \mathcal{F}_t)$ and $(Y_t, \mathcal{F}_t)$ are sub-MGs and $t \mapsto f(t)$ is a
non-decreasing, non-random function.
(a) Verify that $(X_t + Y_t, \mathcal{F}_t)$ is a sub-MG and hence so is $(X_t + f(t), \mathcal{F}_t)$.
(b) Rewrite this, first for sup-MGs $X_t$ and $Y_t$, then in case of MGs.
Building on Exercise 1.4.32 we next note that each positive martingale $(Z_t, \mathcal{F}_t)$
induces a collection of probability measures $\widetilde{P}_t$ that are equivalent to the restrictions
of $P$ to $\mathcal{F}_t$ and satisfy a certain martingale Bayes rule.

Exercise 4.2.10. Given a positive MG $(Z_t, \mathcal{F}_t)$ with $E Z_0 = 1$, consider for each
$t \ge 0$ the probability measure $\widetilde{P}_t : \mathcal{F}_t \to \mathbb{R}$ given by $\widetilde{P}_t(A) = E[Z_t I_A]$.
(a) Show that $\widetilde{P}_t(A) = \widetilde{P}_s(A)$ for any $A \in \mathcal{F}_s$ and $0 \le s \le t$.
(b) With $\widetilde{E}_t$ denoting the expectation under $\widetilde{P}_t$, fix $0 \le u \le s \le t$ and
$Y \in L^1(\Omega, \mathcal{F}_s, \widetilde{P}_t)$. Building on Exercise 1.4.32 deduce that $\widetilde{E}_t(Y \mid \mathcal{F}_u) =
E(Y Z_s \mid \mathcal{F}_u) / Z_u$ almost surely under $P$ (hence also under $\widetilde{P}_t$).
In the theory of continuous time S.P. it makes sense to require that each new piece
of information has a definite first time of arrival. From a mathematical point of
view, this is captured by the following concept of right-continuous filtration.

Definition 4.2.11. A filtration is called right-continuous if for any $t \ge 0$,
$$\bigcap_{h > 0} \mathcal{F}_{t+h} = \mathcal{F}_t\,.$$
(To avoid unnecessary technicalities we further assume the usual conditions of a
complete probability space per Definition 1.3.4 with $N \in \mathcal{F}_0$ whenever $P(N) = 0$.)

For example, check that the filtration $\mathcal{F}_t = \mathcal{F}_n$ for $t \in [n, n+1)$, as in the
interpolation of discrete time MGs (Example 4.2.7), is right-continuous. Like
many other right-continuous filtrations, this filtration has jumps, that is, sometimes
$\sigma(\mathcal{F}_s, 0 \le s < t)$ is a strict subset of $\mathcal{F}_t$ (i.e. for integer $t$), accounting for a
new piece of information arriving at time $t$.

As we show next, not all filtrations are right-continuous and the continuity of
the sample path of a S.P. does not guarantee that its canonical filtration is
right-continuous.

Example 4.2.12. Consider the uniform probability measure on $\Omega = \{-1, 1\}$ and
$\mathcal{F} = 2^{\Omega}$. The process $X_t(\omega) = \omega t$ clearly has continuous sample path. It is easy to
see that its canonical filtration has $\mathcal{G}_0 = \{\emptyset, \Omega\}$, while $\mathcal{G}_h = \mathcal{F}$ for all $h > 0$, and is
evidently not right-continuous at $t = 0$.
As we next see, right continuity of the filtration translates into a RCLL sample
path for any MG with respect to that filtration. We shall further see in Subsection
4.3.2 how right continuity of the filtration helps in the treatment of stopping times.

Theorem 4.2.13. If $(X_t, \mathcal{F}_t)$ is a MG with a right-continuous filtration $\mathcal{F}_t$, then
$X_t$ has a RCLL modification as in Definition 3.3.7 (see [Bre92, Theorem 14.7]
for a proof of a similar result).

Remark. You should check directly that the interpolated MGs of Example 4.2.7
have RCLL sample path with probability one. In view of Proposition 3.3.9, the
same applies for the Brownian motion and any other MG with continuous sample
path. Further, by Theorem 4.2.13, any MG with respect to the Brownian canonical
filtration, or to the interpolated filtration of Example 4.2.7, has a RCLL modification.
As we see in Sections 4.3-4.5, right continuity of the sample path allows us
to deduce many important martingale properties, such as tail bounds, convergence
results and optional stopping theorems.
4.3. Stopping times and the optional stopping theorem

This section is about stopping times and the relevance of martingales to their
study, mostly via Doob's optional stopping theorem. The simpler concept of stopping
time for a discrete filtration is considered first in Subsection 4.3.1, where we
also provide a few illustrating examples of how Doob's theorem is used. For
additional material and examples, see [GS01, Section 12.5], [Ros95, Section 6.2] or
[KT75, Section 6.4]. The more delicate issue of stopping time for a continuous
parameter filtration is dealt with in Subsection 4.3.2, with applications to hitting
times for the Brownian motion given in Section 5.2.

4.3.1. Stopping times for discrete parameter filtrations. One of the
advantages of MGs is in providing information about the law of stopping times,
which we now define in the context of a discrete parameter filtration.

Definition 4.3.1. A random variable $\tau$ taking values in $\{0, 1, \ldots, n, \ldots\} \cup \{\infty\}$ is a
stopping time for the filtration $\mathcal{F}_n$ if the event $\{\tau \le n\}$ is in $\mathcal{F}_n$ for each finite
$n \ge 0$ (c.f. [Bre92, Section 5.6]).

Intuitively speaking, a stopping time is such that the decision whether to stop or
not by a given time is based on the information available at that time, and not on
future, yet unavailable, information. Some examples to practice your understanding
are provided in the next exercises.
Exercise 4.3.2. Show that $\tau$ is a stopping time for the discrete time filtration
$\mathcal{F}_n$ if and only if $\{\tau < t\} \in \mathcal{F}_t$ for all $t \ge 0$, where $\mathcal{F}_t = \mathcal{F}_{[t]}$ is the interpolated
filtration of Example 4.2.7.

Exercise 4.3.3. Suppose $\tau$, $\theta$, $\tau_1, \tau_2, \ldots$ are stopping times for the same (discrete
time) filtration $\mathcal{F}_n$. Show that so are $\min(\tau, \theta)$, $\max(\tau, \theta)$, $\tau + \theta$, $\sup_n \tau_n$ and
$\inf_n \tau_n$.

Exercise 4.3.4. Show that the first hitting time $\tau(\omega) = \min\{k \ge 0 : X_k(\omega) \in B\}$
of a Borel set $B \subseteq \mathbb{R}$ by a sequence $X_k$ is a stopping time for the canonical
filtration $\mathcal{F}_n = \sigma(X_k, k \le n)$. Provide an example where the last hitting time
$\theta = \sup\{k \ge 0 : X_k \in B\}$ of a set $B$ by the sequence is not a stopping time (not
surprising, since we need to know the whole sequence $X_k$ in order to verify that
there are no visits to $B$ after a given time $n$).
As we see in Theorem 4.3.6, the MG, sub-MG and sup-MG properties are inherited
by the stopped S.P. defined next.

Definition 4.3.5. Using the notation $n \wedge \tau = \min(n, \tau(\omega))$, the process stopped at $\tau$,
$X_{n \wedge \tau}$, is given by
$$X_{n \wedge \tau}(\omega) = \begin{cases} X_n(\omega), & n \le \tau(\omega) \\ X_{\tau(\omega)}(\omega), & n > \tau(\omega) \end{cases}$$

Theorem 4.3.6. If $(X_n, \mathcal{F}_n)$ is a sub-MG (or sup-MG or a MG) and $\tau$ is a stopping
time for $\mathcal{F}_n$, then $(X_{n \wedge \tau}, \mathcal{F}_n)$ is also a sub-MG, or sup-MG or MG, respectively
(for a proof see [Bre92, Proposition 5.26]).

Corollary 4.3.7. If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\tau$ is a stopping time for $\mathcal{F}_n$,
then $E(X_{n \wedge \tau}) \ge E(X_0)$ for all $n$. If in addition $(X_n, \mathcal{F}_n)$ is a MG, then $E(X_{n \wedge \tau}) =
E(X_0)$.
The main result of this subsection is the following theorem, where the uniform
integrability of Subsection 1.4.2 comes in handy (see also [GS01, Theorem 12.5.1] for
a similar result).

Theorem 4.3.8 (Doob's optional stopping). If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\tau < \infty$
a.s. is a stopping time for the filtration $\mathcal{F}_n$ such that the sequence $\{X_{n \wedge \tau}\}$ is
uniformly integrable, then $E(X_\tau) \ge E(X_0)$. If in addition $(X_n, \mathcal{F}_n)$ is a MG, then
$E(X_\tau) = E(X_0)$.

Proof. Note that $X_{n \wedge \tau(\omega)}(\omega) \to X_{\tau(\omega)}(\omega)$ as $n \to \infty$, whenever $\tau(\omega) < \infty$.
Since $\tau < \infty$ a.s. we deduce that $X_{n \wedge \tau} \stackrel{a.s.}{\to} X_\tau$, hence also $X_{n \wedge \tau} \stackrel{p}{\to} X_\tau$ (see part
(b) of Theorem 1.3.6). With $\{X_{n \wedge \tau}\}$ uniformly integrable, combining Theorem
1.4.23 with the preceding corollary, we obtain that in case of a sub-MG $(X_n, \mathcal{F}_n)$,
$$E[X_\tau] = E[\lim_{n \to \infty} X_{n \wedge \tau}] = \lim_{n \to \infty} E[X_{n \wedge \tau}] \ge E(X_0)\,,$$
as stated in the theorem. The same argument shows that $E(X_\tau) \le E(X_0)$ in case
of a sup-MG $(X_n, \mathcal{F}_n)$, hence $E(X_\tau) = E(X_0)$ for a MG $(X_n, \mathcal{F}_n)$.
There are many examples in which MGs are applied to provide information about
specific stopping times. We detail below one such example, pertaining to the
symmetric simple random walk.

Example 4.3.9. Consider the simple random walk (SRW) $S_n = \sum_{k=1}^{n} \xi_k$, with
$\xi_k \in \{-1, 1\}$ independent and identically distributed R.V. such that $P(\xi_k = 1) =
1/2$, so $E\xi_k = 0$. Fixing positive integers $a$ and $b$, let $\tau_{a,b} = \inf\{n \ge 0 : S_n \ge b$ or
$S_n \le -a\}$, that is, $\tau_{a,b}$ is the first time the SRW exits the interval $(-a, b)$. Recall
that $S_n$ is a MG and the first hitting time $\tau_{a,b}$ is a stopping time. Moreover,
$|S_{n \wedge \tau_{a,b}}| \le \max(a, b)$ is a bounded sequence, hence uniformly integrable. Further,
$P(\tau_{a,b} > k) \le P(-a < S_k < b) \to 0$ as $k \to \infty$, hence $\tau_{a,b} < \infty$ almost surely.
We also have that $S_0 = 0$, so by Doob's optional stopping theorem we know that
$E(S_{\tau_{a,b}}) = 0$. Note that using only increments $\xi_k \in \{-1, 1\}$, necessarily $S_{\tau_{a,b}} \in
\{-a, b\}$. Consequently,
$$E(S_{\tau_{a,b}}) = -a\, P(S_{\tau_{a,b}} = -a) + b\, P(S_{\tau_{a,b}} = b) = 0\,.$$
Since $P(S_{\tau_{a,b}} = -a) = 1 - P(S_{\tau_{a,b}} = b)$ we have two equations with two unknowns,
which we easily solve to get that $P(S_{\tau_{a,b}} = -a) = \frac{b}{b+a}$ is the probability that the
SRW hits $-a$ before hitting $+b$. This probability is sometimes called the gambler's
ruin, for a gambler with initial capital of $+a$, betting on the outcome of independent
fair games, a unit amount per game, gaining (or losing) $\xi_k$ and stopping when
either all his capital is lost (ruin) or his accumulated gains reach the amount $+b$
(c.f. [GS01, Example 1.7.4]).
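The answer $b/(a+b)$ obtained via optional stopping is straightforward to confirm by simulation. The following sketch (illustrative; $a = 2$, $b = 3$ and the seed are assumed values) runs the fair SRW until it exits $(-a, b)$:

```python
import random

def ruin_probability(a=2, b=3, num_paths=20000, seed=5):
    """Monte Carlo estimate of P(symmetric SRW hits -a before +b).

    Example 4.3.9 shows via Doob's optional stopping theorem that this
    probability equals b / (a + b); for a = 2, b = 3 the target is 0.6.
    """
    rng = random.Random(seed)
    ruins = 0
    for _ in range(num_paths):
        s = 0
        while -a < s < b:          # run until the walk exits (-a, b)
            s += rng.choice((-1, 1))
        ruins += (s == -a)
    return ruins / num_paths

print(ruin_probability())  # close to 3 / (2 + 3) = 0.6
```

Note that the simulation relies on $\tau_{a,b} < \infty$ a.s., established in the example, since otherwise the inner loop could fail to terminate.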
Remark. You should always take the time to verify the conditions of Doob's
optional stopping theorem before using it. To see what might happen otherwise,
consider Example 4.3.9 without specifying a lower limit. That is, let $\tau_b = \inf\{n :
S_n \ge b\}$. Though we shall not show it, $P(\tau_b > n) = P(\max_{k \le n} S_k < b) \to 0$ as
$n \to \infty$. Consequently, with probability one, $\tau_b < \infty$, in which case clearly $S_{\tau_b} = b$.
Hence, $E(S_{\tau_b}) = b \ne E(S_0) = 0$. This shows that you cannot apply Doob's optional
stopping theorem here and that the sequence of random variables $S_{n \wedge \tau_b}$ is not
uniformly integrable.
There is no general recipe for finding the appropriate martingale
for each example, making this some sort of an art form. We illustrate this in the
context of the asymmetric simple random walk.

Example 4.3.10. Consider the stopping times $\tau_{a,b}$ for the SRW $S_n$ of Example
4.3.9, now with $p := P(\xi_k = 1) \ne \frac{1}{2}$ (excluding also the trivial cases of $p = 0$
and $p = 1$). We wish to compute the ruin probability $r = P(S_{\tau_{a,b}} = -a)$ in this
case. Note that $S_n$ is no longer a MG. However, $M_n = \prod_{i=1}^{n} Y_i$ is a MG for
$Y_i = e^{\lambda \xi_i}$, provided the constant $\lambda \ne 0$ is such that $E[e^{\lambda \xi_1}] = 1$. Applying Doob's
optional stopping theorem for this MG (with $M_{n \wedge \tau_{a,b}}$ bounded, hence uniformly
integrable) yields
$$1 = E(M_{\tau_{a,b}}) = E(e^{\lambda S_{\tau_{a,b}}}) = r e^{-\lambda a} + (1 - r) e^{\lambda b}\,,$$
giving an explicit formula for $r$ in terms of $a$, $b$ and $\lambda$. To complete the solution
check that $E[e^{\lambda \xi_1}] = p e^{\lambda} + (1 - p) e^{-\lambda} = 1$ for $e^{\lambda} = (1-p)/p$. Applying Doob's
optional stopping theorem instead for the MG $S_n - n(2p - 1)$ gives
$$0 = E[S_{\tau_{a,b}} - (2p - 1)\tau_{a,b}]\,.$$
Hence, by linearity of the expectation, we see that $(2p - 1) E(\tau_{a,b}) = E(S_{\tau_{a,b}}) =
-ra + (1 - r)b$. Having only one equation in two unknowns, $r$ and $E(\tau_{a,b})$, we
cannot find $r$ directly this way.
However, having found $r$ already using the martingale $M_n$, we can deduce now an
explicit formula for $E(\tau_{a,b})$.
To practice your understanding, find another martingale that allows you to
explicitly compute $E(\tau_{a,b})$ for the symmetric (i.e. $p = 1/2$) SRW of Example 4.3.9.
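Solving the displayed equation for $r$ with $e^{\lambda} = (1-p)/p$ gives $r = (\rho^b - 1)/(\rho^b - \rho^{-a})$ for $\rho = (1-p)/p$. The sketch below (an illustration; $p = 0.6$, $a = 2$, $b = 3$ and the seed are assumed values) compares this closed form against a direct simulation of the asymmetric walk:

```python
import random

def ruin_formula(p, a, b):
    """Ruin probability r from 1 = r*rho**(-a) + (1-r)*rho**b, rho = e^lambda."""
    rho = (1.0 - p) / p          # e^{lambda}, the root of p*x + (1-p)/x = 1
    return (rho ** b - 1.0) / (rho ** b - rho ** (-a))

def ruin_mc(p, a, b, num_paths=20000, seed=6):
    """Monte Carlo estimate of P(asymmetric SRW hits -a before +b)."""
    rng = random.Random(seed)
    ruins = 0
    for _ in range(num_paths):
        s = 0
        while -a < s < b:
            s += 1 if rng.random() < p else -1
        ruins += (s == -a)
    return ruins / num_paths

print(ruin_formula(0.6, 2, 3), ruin_mc(0.6, 2, 3))  # the two agree
```

For $p = 0.6$, $a = 2$, $b = 3$ the formula gives $r = 76/211 \approx 0.36$, noticeably below the symmetric value $3/5$, as expected for an upward-drifting walk.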
4.3.2. Stopping times for continuous parameter filtrations. We start
with the definition of stopping times for a continuous parameter filtration $\mathcal{F}_t$ in
a probability space $(\Omega, \mathcal{F}, P)$, analogous to Definition 4.3.1.

Definition 4.3.11. A non-negative random variable $\tau(\omega)$ is called a stopping time
with respect to the continuous time filtration $\mathcal{F}_t$ if $\{\omega : \tau(\omega) \le t\} \in \mathcal{F}_t$ for all
$t \ge 0$.

Exercise 4.3.12.
(a) Show that $\tau$ is a stopping time for the right-continuous filtration $\mathcal{F}_t$ if
and only if $\{\tau < t\} \in \mathcal{F}_t$ for all $t \ge 0$.
(b) Redo Exercise 4.3.3 in case $\tau$, $\theta$, $\tau_1, \tau_2, \ldots$ are stopping times for a
right-continuous filtration $\mathcal{F}_t$.

Deterministic times are clearly also stopping times (as then $\{\omega : \tau(\omega) \le t\}$ is
either $\emptyset$ or $\Omega$, both of which are in any $\sigma$-field). First hitting times of the type of
the next proposition are another common example of stopping times.
the next proposition are another common example of stopping times.
Proposition 4.3.13. If rightcontinuous S.P. X
t
is adapted to the ltration
T
t
then
B
() = inft 0 : X
t
() B is a stopping time for T
t
when either:
(a) B is an open set and T
t
is a right continuous ltration;
(b) B is a closed set and the sample path t X
t
() is continuous for all
.
Proof. (omit at first reading) If $\mathcal{F}_t$ is a right-continuous filtration then by Exercise 4.3.12 it suffices to show that $A_t := \{\omega : \tau_B < t\} \in \mathcal{F}_t$. By definition of $\tau_B$ we have that $A_t$ is the union of the sets $\Gamma_s(B) := \{\omega : X_s(\omega) \in B\}$ over all $s < t$. Further, $\Gamma_s(B) \in \mathcal{F}_t$ for $s \le t$ (since $X_t$ is adapted to $\mathcal{F}_t$). This is however an uncountable union, and indeed in general $A_t$ may not be in $\mathcal{F}_t$. Nevertheless, when $u \mapsto X_u$ is right-continuous and $B$ is an open set, it follows by elementary real analysis that the latter union equals the countable union of $\Gamma_s(B)$ over all rational values of $s < t$. Consequently, in this case $A_t$ is in $\mathcal{F}_t$, completing the proof of part (a).

Without right-continuity of $\mathcal{F}_t$ it is no longer enough to consider $A_t$. Instead, we should show that $A_t := \{\omega : \tau_B \le t\} \in \mathcal{F}_t$. Unfortunately, the identity $A_t = \bigcup_{s \le t} \Gamma_s(B)$ is not true in general (and having $B$ open is not of much help). However, by definition of $\tau_B$ we have that $\tau_B(\omega) \le t$ if and only if there exist non-increasing $t_n(\omega) \le t + 1/n$ such that $X_{t_n(\omega)}(\omega) \in B$ for all $n$. Clearly, $t_n \downarrow s$ for some $s = s(\omega) \le t$. With $u \mapsto X_u(\omega)$ right-continuous, it follows that $X_s(\omega)$ is the limit as $n \to \infty$ of $X_{t_n(\omega)}(\omega) \in B$, so our assumption that $B$ is closed leads to $X_s(\omega) \in B$. Thus, under these two assumptions we have the identity $A_t = \bigcup_{s \le t} \Gamma_s(B)$. With $B$ no longer an open set, the countable union of $\Gamma_s(B)$ over $Q_t = \{s : s = t \text{ or } s \in \mathbb{Q} \text{ and } s < t\}$ is in general a strict subset of the union of $\Gamma_s(B)$ over all $s \le t$. Fortunately, for a closed set $B$ and $u \mapsto X_u$ continuous we have, by a real analysis argument (which we do not provide here), that
\[
\bigcup_{s \le t} \Gamma_s(B) = \bigcap_{k=1}^{\infty} \bigcup_{s \in Q_t} \Gamma_s(B_k)
\]
(for the open sets $B_k = \bigcup_{y \in B} (y - k^{-1}, y + k^{-1})$), thus concluding that $A_t \in \mathcal{F}_t$ and completing the proof of part (b).
Beware that, unlike discrete time, the first hitting time $\tau_B$ might not be a stopping time for a Borel set $B$ which is not closed, even when considering the canonical filtration $\mathcal{G}_t$ of a process $X_t$ with continuous sample path.

Example 4.3.14. Indeed, consider the open set $B = (0, \infty)$ and the S.P. $X_t(\omega) = \omega t$ of Example 4.2.12, of continuous sample path and canonical filtration that is not right-continuous. It is easy to check that in this case $\tau_B(1) = 0$ while $\tau_B(-1) = \infty$. As the event $\{\omega : \tau_B(\omega) \le 0\} = \{1\}$ is not in $\mathcal{G}_0 = \{\emptyset, \Omega\}$, we see that in this case $\tau_B$ is not a stopping time for $\mathcal{G}_t$.
Stochastic processes are typically defined only up to a set of zero probability, and continuity with probability one of the sample path $t \mapsto X_t(\omega)$ is not enough to assure that $\tau_B$ of Proposition 4.3.13 is a stopping time for $\sigma(X_s, s \le t)$ whenever $B$ is closed. This problem is easily fixed by modifying the value of such a S.P. on a set of zero probability, so as to make its sample path continuous for all $\omega \in \Omega$.
Exercise 4.3.15.
(a) Deduce from part (a) of Proposition 4.3.13 that for any fixed $b > 0$ and $\epsilon > 0$ and any S.P. $X_t$ with right-continuous sample path, the random variable $\tau_b^{(\epsilon)} = \inf\{t \ge 0 : X_t > b\}$ is a stopping time for the canonical filtration $\mathcal{G}_t$ of $X_t$.
(b) With $Y_t = \int_0^t X_s^2 \, ds$, use part (b) of Proposition 4.3.13 to show that $\tau_1 = \inf\{t \ge 0 : Y_t = b\}$ is another stopping time for $\mathcal{G}_t$. Then explain why $\tau_2 = \inf\{t \ge 0 : Y_{2t} = b\}$ is in general not a stopping time for this filtration.
We next extend to the continuous parameter setting the notion of a stopped sub-martingale (as in Theorem 4.3.6), focusing on processes with right-continuous sample path.

Theorem 4.3.16. If $\tau$ is a stopping time for the filtration $\mathcal{F}_t$ and the S.P. $X_t$ of right-continuous sample path is a sub-MG (or sup-MG or a MG) for $\mathcal{F}_t$, then $\widehat{X}_t = X_{t \wedge \tau(\omega)}(\omega)$ is also a sub-MG (or sup-MG or MG, respectively) for this filtration.
Equipped with Theorem 4.3.16 we extend also Doob's optional stopping theorem to the continuous parameter setting (compare with Theorem 4.3.8).

Theorem 4.3.17 (Doob's optional stopping). If $(X_t, \mathcal{F}_t)$ is a sub-MG with right-continuous sample path and $\tau < \infty$ a.s. is a stopping time for the filtration $\mathcal{F}_t$ such that the stopped process $X_{t \wedge \tau}$ is uniformly integrable, then $E(X_\tau) \ge E(X_0)$. If in addition $(X_t, \mathcal{F}_t)$ is a MG, then $E(X_\tau) = E(X_0)$.
Here are three representative applications of Doob's optional stopping theorem in the context of first hitting times (for the Brownian motion, the Geometric Brownian motion and a Brownian motion with drift). When solving these and any other exercise with first hitting times, assume that the Brownian motion $W_t$ has been modified on a set of zero probability so that $t \mapsto W_t(\omega)$ is continuous for all $\omega \in \Omega$.

Exercise 4.3.18. Let $W_t$ be a Brownian motion. Fixing $a > 0$ and $b > 0$ let $\tau_{a,b} = \inf\{t \ge 0 : W_t \notin (-a, b)\}$. We will see in Section 5.2 that $\tau_{a,b}$ is finite with probability one.
(a) Check that $\tau_{a,b}$ is a stopping time and that $W_{t \wedge \tau_{a,b}}$ is uniformly integrable.
(b) Applying Doob's optional stopping theorem for this stopped martingale, compute the probability that $W_t$ reaches level $b$ before it reaches level $-a$.
(c) Consider the martingales $\exp(\lambda W_t - \lambda^2 t/2)$, $\lambda > 0$, of Exercise 4.2.5, stopped at $\tau_{b,b}$. Justify using the optional stopping theorem for it, and deduce the value of $E(e^{-\lambda^2 \tau_{b,b}/2})$ for $\lambda > 0$.
Hint: In part (c) you may use the fact that the S.P. $-W_t$ has the same law as $W_t$.
Exercise 4.3.19. Consider the Geometric Brownian motion $Y_t = e^{W_t}$. Fixing $b > 1$, compute the Laplace transform $E(e^{-\alpha \tau_b})$ of the stopping time $\tau_b = \inf\{t > 0 : Y_t = b\}$ at arbitrary $\alpha \ge 0$ (see [KT75, page 364] for a related option pricing application).
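Here $\tau_b$ coincides with the first time $W_t$ reaches $\log b$, and optional stopping with the martingale $\exp(\lambda W_t - \lambda^2 t/2)$ at $\lambda = \sqrt{2\alpha}$ suggests $E(e^{-\alpha \tau_b}) = b^{-\sqrt{2\alpha}}$. The following sketch (parameter values and helper name are our choice) checks this numerically, truncating the rare very long excursions whose contribution to the Laplace transform is negligible:

```python
import math, random

def laplace_hitting(alpha, b, trials=4000, dt=0.01, t_max=30.0, seed=2):
    # Estimates E[exp(-alpha * tau_b)] for tau_b = inf{t : exp(W_t) = b},
    # i.e. the first time Brownian motion reaches level log(b).
    c = math.log(b)
    rng = random.Random(seed)
    step = math.sqrt(dt)
    total = 0.0
    for _ in range(trials):
        w, t = 0.0, 0.0
        while w < c and t < t_max:
            w += rng.gauss(0.0, step)
            t += dt
        if w >= c:                    # paths that never hit contribute ~0
            total += math.exp(-alpha * t)
    return total / trials

alpha, b = 0.5, math.e
est = laplace_hitting(alpha, b)
exact = b ** (-math.sqrt(2 * alpha))
```

The discrete monitoring of the barrier makes the estimate slightly low, but it stays well within a few percent of $b^{-\sqrt{2\alpha}}$ for this step size.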
Exercise 4.3.20. Consider $M_t = \exp(\lambda Z_t)$ for non-random constants $\lambda$ and $r$, where $Z_t = W_t + rt$, $t \ge 0$, and $W_t$ is a Brownian motion.
(a) Compute the conditional expectation $E(M_{t+h} \mid \mathcal{G}_t)$ for $\mathcal{G}_t = \sigma(Z_u, u \le t)$ and $t, h \ge 0$.
(b) Find the value of $\lambda \neq 0$ for which $(M_t, \mathcal{G}_t)$ is a martingale.
(c) Fixing $a, b > 0$, apply Doob's optional stopping theorem to find the law of $Z_{\tau_{a,b}}$ for $\tau_{a,b} = \inf\{t \ge 0 : Z_t \notin (-a, b)\}$.
An important concept associated with each stopping time $\tau$ is the stopped $\sigma$-field $\mathcal{F}_\tau$, defined next (see also [Bre92, Definition 12.41] or [KS97, Definition 1.2.12] or [GS01, Problem 12.4.7]).

Definition 4.3.21. The stopped $\sigma$-field $\mathcal{F}_\tau$ associated with a stopping time $\tau$ for the filtration $\mathcal{F}_t$ consists of all events $A$ such that $A \cap \{\omega : \tau(\omega) \le t\} \in \mathcal{F}_t$ for all $t \ge 0$.

Exercise 4.3.22.
(a) Verify that $\mathcal{F}_\tau$ is a $\sigma$-field, on which $\tau$ is measurable.
(b) Show that if $A \in \mathcal{F}_\tau$ then $A \cap \{\omega : \tau(\omega) \le \theta(\omega)\} \in \mathcal{F}_\theta$ for any stopping time $\theta$.

Example 4.3.23. For the measurable space $(\Omega_2, \mathcal{F}_2)$ corresponding to two coin tosses, consider the stopping time $\tau$ such that $\tau = 1$ if the first coin shows H and $\tau = 2$ otherwise. Convince yourself that the corresponding stopped $\sigma$-field is $\mathcal{F}_\tau = \sigma(\{HH, HT\}, \{TH\}, \{TT\})$.

Example 4.3.24. Consider a S.P. $X_s$, $s \in [0, T]$, of continuous sample path, its canonical filtration $\mathcal{G}_t = \sigma(X_s, s \le t)$ and the first hitting time $\tau_b = \inf\{s \ge 0 : X_s \ge b\}$ for a non-random $b$. Fixing non-random $c \in \mathbb{R}$, consider the event
\[
A_c = \{\omega : \sup_{s \in [0,T]} X_s(\omega) > c\}.
\]
Since the sample path $s \mapsto X_s(\omega)$ is continuous, $\omega \in A_c$ if and only if $X_s(\omega) > c$ for some $s$ in the countable set $Q_T$ of all rational numbers in $[0, T]$. Consequently, the event $A_c = \bigcup_{s \in Q_T} \{\omega : X_s(\omega) > c\}$ is in $\mathcal{G}_T$. However, in general $A_c \notin \mathcal{G}_t$ for $t < T$, since if $X_s(\omega) \le c$ for all $s \le t$, then we are not sure whether $\omega \in A_c$ or not till we observe $X_s(\omega)$ also for $s \in (t, T]$. Nevertheless, when $c < b$, if $\tau_b(\omega) \le t$ for some non-random $t < T$ then clearly $\omega \in A_c$. Hence, for $c < b$ and any $t < T$,
\[
A_c \cap \{\omega : \tau_b(\omega) \le t\} = \{\omega : \tau_b(\omega) \le t\} \in \mathcal{G}_t
\]
(recall that $\tau_b$ is a stopping time for $\mathcal{G}_t$). Since this applies for all $t$, we deduce in view of Definition 4.3.21 that $A_c \in \mathcal{G}_{\tau_b}$ for each $c < b$. In contrast, when $c \ge b$, for stochastic processes $X_s$ of sufficiently generic sample path the event $\{\tau_b \le t\}$ does not tell us enough about $A_c$ to conclude that $A_c \cap \{\tau_b \le t\} \in \mathcal{G}_t$ for all $t$. Consequently, for such stochastic processes $A_c \notin \mathcal{G}_{\tau_b}$ when $c \ge b$.
4.4. Martingale representations and inequalities

In Subsection 4.4.1 we show that martingales are at the core of all adapted processes. We further explore there the structure of certain sub-martingales, giving rise to fundamental objects such as the innovation process and the increasing process associated with square-integrable martingales. This is augmented in Subsection 4.4.2 by the study of maximal inequalities for sub-martingales (and martingales). Such inequalities are a key technical tool in many applications of probability theory.
4.4.1. Martingale decompositions. To demonstrate the relevance of martingales to the study of many S.P., we start with a representation of any adapted, integrable, discrete-time S.P. as the sum of a martingale and a previsible process.

Theorem 4.4.1 (Doob's decomposition). Given an integrable S.P. $X_n$, adapted to a discrete parameter filtration $\mathcal{F}_n$, $n \ge 0$, there exists a decomposition $X_n = Y_n + A_n$ such that $(Y_n, \mathcal{F}_n)$ is a MG and $A_n$ is a previsible S.P. This decomposition is unique up to the value of $Y_0$, a R.V. measurable on $\mathcal{F}_0$.
Proof. Let $A_0 = 0$ and for all $n \ge 1$,
\[
A_n = A_{n-1} + E(X_n - X_{n-1} \mid \mathcal{F}_{n-1}).
\]
By definition of the conditional expectation (C.E.) we see that $A_k - A_{k-1}$ is measurable on $\mathcal{F}_{k-1}$ for any $k \ge 1$. Since $\mathcal{F}_{k-1} \subseteq \mathcal{F}_{n-1}$ for all $k \le n$ and $A_n = A_0 + \sum_{k=1}^n (A_k - A_{k-1})$, it follows that $A_n$ is previsible for the filtration $\mathcal{F}_n$. We next check that $Y_n = X_n - A_n$ is a MG. To this end, recall that $X_n$ integrable implies that so is $X_n - X_{n-1}$, whereas the C.E. only reduces the $L^1$ norm (see Corollary 2.3.11). Therefore, $E|A_n - A_{n-1}| \le E|X_n - X_{n-1}| < \infty$. So, $A_n$ is integrable, as is $X_n$, implying (by the triangle inequality for the $L^1$ norm) that $Y_n$ is integrable as well. With $X_n$ adapted and $A_n$ previsible (hence adapted), we see that $Y_n$ is also adapted. It remains to check that almost surely, $E(Y_n \mid \mathcal{F}_{n-1}) - Y_{n-1} = 0$ for all $n \ge 1$. Indeed, for any $n \ge 1$,
\begin{align*}
E(Y_n \mid \mathcal{F}_{n-1}) - Y_{n-1}
&= E(X_n - A_n \mid \mathcal{F}_{n-1}) - (X_{n-1} - A_{n-1}) \\
&= E[X_n - X_{n-1} - (A_n - A_{n-1}) \mid \mathcal{F}_{n-1}]
&& \text{(since $Y_{n-1}$ is measurable on $\mathcal{F}_{n-1}$)} \\
&= E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] - (A_n - A_{n-1})
&& \text{(since $A_n$ is previsible)} \\
&= 0 && \text{(by the definition of $A_n$)}.
\end{align*}
We finish the proof by checking the stated uniqueness of the decomposition. To this end, suppose that we have two such decompositions, $X_n = Y_n + A_n = \widetilde{Y}_n + \widetilde{A}_n$. Then $\widetilde{Y}_n - Y_n = A_n - \widetilde{A}_n$. Since $A_n$ and $\widetilde{A}_n$ are previsible for the filtration $\mathcal{F}_n$,
\begin{align*}
A_n - \widetilde{A}_n &= E(A_n - \widetilde{A}_n \mid \mathcal{F}_{n-1}) = E(\widetilde{Y}_n - Y_n \mid \mathcal{F}_{n-1}) \\
&= \widetilde{Y}_{n-1} - Y_{n-1} && \text{(since $\widetilde{Y}_n$ and $Y_n$ are MGs)} \\
&= A_{n-1} - \widetilde{A}_{n-1}.
\end{align*}
We thus conclude that $A_n - \widetilde{A}_n$ is independent of $n$. If $Y_0 = \widetilde{Y}_0$ we deduce further that $A_n - \widetilde{A}_n = A_0 - \widetilde{A}_0 = \widetilde{Y}_0 - Y_0 = 0$ for all $n$. In conclusion, as soon as we determine $Y_0$, a R.V. measurable on $\mathcal{F}_0$, both sequences $A_n$ and $Y_n$ are uniquely determined.
Definition 4.4.2. When using Doob's decomposition for the canonical filtration $\sigma(X_k, k \le n)$, the MG $Y_n$ is called the innovation process associated with $X_n$. The reason for this name is that $X_{n+1} = (A_{n+1} + Y_n) + (Y_{n+1} - Y_n)$, where $A_{n+1} + Y_n$ is measurable on $\sigma(X_k, k \le n)$ while $Y_{n+1} - Y_n$ describes the new part of $X_{n+1}$. Indeed, assuming $X_n \in L^2$, we have that $Y_{n+1} - Y_n$ is orthogonal to $\sigma(X_k, k \le n)$ in the sense of (2.1.2) (see Proposition 4.1.17). The innovation process is widely used in prediction, estimation and control of time series, where the fact that it is a MG is very handy.
As we see next, Doob's decomposition is very attractive when $(X_n, \mathcal{F}_n)$ is a sub-MG.

Exercise 4.4.3. Check that the previsible part of Doob's decomposition of a sub-martingale $(X_n, \mathcal{F}_n)$ is a non-decreasing sequence, that is, $A_n \le A_{n+1}$ for all $n$. What can you say about the previsible part of Doob's decomposition of a super-martingale?
We next illustrate Doob's decomposition for two classical sub-MGs.

Example 4.4.4. Consider the sub-MG $S_n^2$ for the random walk $S_n = \sum_{k=1}^n \xi_k$, where $\xi_k$ are independent and identically distributed R.V. such that $E\xi_1 = 0$ and $E\xi_1^2 = 1$. We already saw that $Y_n = S_n^2 - n$ is a MG. Since Doob's decomposition $S_n^2 = Y_n + A_n$ is unique, in this special case the non-decreasing previsible part of the decomposition is $A_n = n$.
In Example 4.4.4 we have a non-random $A_n$. However, for most sub-MGs the corresponding non-decreasing $A_n$ is a random sequence, as is the case in our next two examples.
Example 4.4.5. Consider the sub-MG $(M_n, \mathcal{G}_n)$ where $M_n = \prod_{i=1}^n Z_i$ for i.i.d. $Z_i > 0$ such that $EZ_1 > 1$ and $\mathcal{G}_n = \sigma(Z_i : i \le n)$ (see Example 4.1.14). The non-decreasing previsible part of its Doob's decomposition is such that for $n \ge 1$,
\[
A_{n+1} - A_n = E[M_{n+1} - M_n \mid \mathcal{G}_n] = E[Z_{n+1} M_n - M_n \mid \mathcal{G}_n]
= M_n E[Z_{n+1} - 1 \mid \mathcal{G}_n] = M_n (EZ_1 - 1)
\]
(since $Z_{n+1}$ is independent of $\mathcal{G}_n$). In this case $A_n = (EZ_1 - 1) \sum_{k=1}^{n-1} M_k + A_1$, where we are free to choose for $A_1$ any non-random constant. We see that $A_n$ is a random sequence (assuming the R.V. $Z_i$ are not a.s. constant).
Exercise 4.4.6. Consider the stochastic process $Z_t = W_t + rt$, $t \ge 0$, with $W_t$ a Brownian motion and $r$ a non-random constant. Compute the previsible part $A_n$ in Doob's decomposition of $X_n = Z_n^2$, $n = 0, 1, 2, \ldots$, with respect to the discrete time filtration $\mathcal{F}_n = \sigma(Z_k, k = 0, \ldots, n)$, starting at $A_0 = 0$.
Doob's decomposition is particularly useful in connection with square-integrable martingales $X_n$, where one can relate the limit of $X_n$ as $n \to \infty$ with that of the non-decreasing sequence $A_n$ in the decomposition of $X_n^2$. The continuous-parameter analog of Doob's decomposition is a fundamental ingredient of stochastic integration. Here we provide this decomposition for a special but very important class of sub-MGs associated with square-integrable martingales (see also [KS97, page 30]).
Theorem 4.4.7 (Doob–Meyer decomposition). Suppose $\mathcal{F}_t$ is a right-continuous filtration and the martingale $(M_t, \mathcal{F}_t)$ of continuous sample path is such that $EM_t^2 < \infty$ for each $t \ge 0$. Then, there exists a unique (integrable) S.P. $A_t$ such that:
(a) $A_0 = 0$;
(b) $A_t$ has continuous sample path;
(c) $A_t$ is adapted to $\mathcal{F}_t$;
(d) $t \mapsto A_t$ is non-decreasing;
(e) $(M_t^2 - A_t, \mathcal{F}_t)$ is a MG.
Remark. This is merely the decomposition of the sub-MG $X_t = M_t^2$, where property (a) resolves the issue of uniqueness of the R.V. $A_0$ measurable on $\mathcal{F}_0$, property (b) specifies the smoothness of the sample path of the continuous-time S.P. $A_t$, and property (d) is in analogy with the monotonicity you saw already in Exercise 4.4.3.
Definition 4.4.8. The S.P. $A_t$ in the Doob–Meyer decomposition (of $M_t^2$) is called the increasing part or the increasing process associated with the MG $(M_t, \mathcal{F}_t)$.
Example 4.4.9. Starting with the Brownian motion $W_t$ (which is a martingale), the Doob–Meyer decomposition of $W_t^2$ gives the increasing part $A_t = t$ and the martingale $W_t^2 - t$ (compare to Example 4.4.4 of the random walk). Indeed, recall from Exercise 4.2.4 that the increasing part is non-random for any Gaussian martingale. Also, in Section 5.3 we show that the increasing process associated with $W_t$ coincides with its quadratic variation, as is the case for all square-integrable MGs of continuous sample path.
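The identification of $A_t$ with the quadratic variation can be previewed numerically: for a fine partition of $[0, 1]$, the sum of squared Brownian increments concentrates near $t = 1$. A minimal sketch (our construction, not from the text):

```python
import random

def quadratic_variation(t=1.0, n=100_000, seed=3):
    # Sum of squared increments of a discretised Brownian path on [0, t];
    # for fine partitions this approximates the increasing process A_t = t.
    rng = random.Random(seed)
    dt = t / n
    step = dt ** 0.5
    return sum(rng.gauss(0.0, step) ** 2 for _ in range(n))

qv = quadratic_variation()
```

Each squared increment has mean $t/n$ and the fluctuations of the sum have variance $2t^2/n$, so the result is very close to $t$.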
Exercise 4.4.10. Find a non-random $f(t)$ such that $X_t = e^{W_t - f(t)}$ is a martingale, and for this value of $f(t)$ find the increasing process associated with the martingale $X_t$ via the Doob–Meyer decomposition.
Hint: Try an increasing process $A_t = \int_0^t e^{2W_s - h(s)}\, ds$ and use Fubini's theorem to find the non-random $h(s)$ for which $M_t = X_t^2 - A_t$ is a martingale with respect to the filtration $\mathcal{G}_t = \sigma(W_s, s \le t)$.
Exercise 4.4.11. Suppose $\tau$ is a stopping time for a right-continuous filtration $\mathcal{F}_t$ and the increasing process $A_t$ associated with a square-integrable martingale $M_t$ of continuous sample path is such that $A_{t \wedge \tau} = 0$ for all $t \ge 0$.
(a) Applying Theorem 4.3.16 for $X_t = M_t$ and $X_t = M_t^2 - A_t$, deduce that both $(M_{t \wedge \tau}, \mathcal{F}_t)$ and $(M^2_{t \wedge \tau}, \mathcal{F}_t)$ are martingales.
(b) Applying (4.2.1) verify that $E[(M_{t \wedge \tau} - M_0)^2 \mid \mathcal{F}_0] = 0$, and deduce from it that $P(M_{t \wedge \tau} = M_0) = 1$ for any $t \ge 0$.
(c) Explain why the assumed continuity of $t \mapsto M_t$ implies that $P(M_{t \wedge \tau} = M_0 \text{ for all } t \ge 0) = 1$.
Remark. Taking $\tau$ non-random we conclude from the preceding exercise that if a square-integrable martingale of continuous sample path has a zero increasing part, then it is almost surely constant (in time).
Exercise 4.4.12. Suppose $\mathcal{F}_t$ is a right-continuous filtration and $(X_t, \mathcal{F}_t)$ and $(Y_t, \mathcal{F}_t)$ are square-integrable martingales of continuous sample path.
(a) Verify that $(X_t + Y_t, \mathcal{F}_t)$ and $(X_t - Y_t, \mathcal{F}_t)$ are then square-integrable martingales of continuous sample path.
(b) Let $Z_t = (A_t^{X+Y} - A_t^{X-Y})/4$, where $A_t^{X \pm Y}$ denote the increasing parts of the MGs $(X_t \pm Y_t, \mathcal{F}_t)$. Show that $(X_t Y_t - Z_t, \mathcal{F}_t)$ is a martingale of continuous sample path and verify that for all $0 \le s < t$,
\[
E[(X_t - X_s)(Y_t - Y_s) \mid \mathcal{F}_s] = E[X_t Y_t - X_s Y_s \mid \mathcal{F}_s] = E[Z_t - Z_s \mid \mathcal{F}_s].
\]

Remark. The S.P. $Z_t$ of the preceding exercise is called the cross variation for the martingales $(X_t, \mathcal{F}_t)$ and $(Y_t, \mathcal{F}_t)$, with two such martingales considered orthogonal if and only if their cross variation is zero (c.f. [KS97, Definition 1.5.5]). Note that the cross variation for a martingale $(M_t, \mathcal{F}_t)$ and itself is merely its quadratic variation (i.e. the associated increasing process $A_t$).
4.4.2. Maximal inequalities for martingales. Sub-martingales (and super-martingales) are rather tame stochastic processes. In particular, as we see next, the tail of $\max X_n$ over $1 \le n \le N$ is bounded by moments of $X_N$. This is a major improvement over Markov's inequality, which relates the typically much smaller tail of the R.V. $X_N$ to its moments (see Example 1.2.39).
Theorem 4.4.13 (Doob's inequality).
(a) Suppose $X_n$ is a sub-MG. Then, for all $x > 0$ and $N < \infty$,
\[
(4.4.1)\qquad P\big(\max_{0 \le n \le N} X_n > x\big) \le x^{-1} E|X_N|.
\]
(b) Suppose $\{X_n, n \le \infty\}$ is a sub-MG (that is, $E(X_m \mid X_k, k \le n) \ge X_n$ for all $0 \le n < m \le \infty$). Then, for all $x > 0$,
\[
(4.4.2)\qquad P\big(\sup_{0 \le n < \infty} X_n > x\big) \le x^{-1} E|X_\infty|.
\]
(c) Suppose $X_t$, $t \in [0, T]$, is a continuous-parameter, right-continuous sub-MG (that is, each sample path $t \mapsto X_t(\omega)$ is right-continuous). Then, for all $x > 0$,
\[
(4.4.3)\qquad P\big(\sup_{0 \le t \le T} X_t > x\big) \le x^{-1} E|X_T|.
\]
Proof. (omit at first reading) Following [GS01, Theorem 12.6.1], we prove (4.4.1) by considering the stopping time $\tau_x = \min\{n \ge 0 : X_n > x\}$. Indeed,
\[
x\, P\big(\max_{0 \le n \le N} X_n > x\big) = x\, P(\tau_x \le N) \le E(X_{\tau_x} I_{\{\tau_x \le N\}}).
\]
Since $S = \min(N, \tau_x) \le N$ is a bounded stopping time, by another version of Doob's optional stopping (not detailed in these notes), we have that
\[
E(X_{\tau_x} I_{\{\tau_x \le N\}}) + E(X_N I_{\{\tau_x > N\}}) = E(X_{\min(N, \tau_x)}) \le E(X_N).
\]
Noticing that
\[
E(X_N) - E(X_N I_{\{\tau_x > N\}}) = E(X_N I_{\{\tau_x \le N\}}) \le E|X_N|,
\]
we thus get the inequality (4.4.1).
The assumptions of part (b) result in the sequence $(X_0, X_1, \ldots, X_{N-1}, X_\infty)$ being also a sub-MG, hence by part (a) we have that for any $N < \infty$,
\[
P\big(\max_{0 \le n \le N-1} X_n > x\big) \le x^{-1} E|X_\infty|.
\]
The events $A_N = \{\omega : \max_{n \le N} X_n(\omega) > x\}$ monotonically increase in $N$ to $A_\infty = \{\omega : \sup_{n < \infty} X_n(\omega) > x\}$. Therefore, we get the inequality (4.4.2) by the continuity of each probability measure under such an operation (see the remark following Definition 1.1.2).
Fixing an integer $N$, let $I_N$ denote the (finite) set of times in $[0, T]$ that are given as a ratio of two integers from $\{0, 1, \ldots, N\}$. For a given continuous-time sub-MG $X_t$, $t \in [0, T]$, consider the discrete-time sub-MG $X_{t_n}$, where $t_n$ is the $n$-th smallest point in $I_N$ for $n = 1, 2, \ldots$. By part (a) (applied to this finite sequence, together with the terminal value $X_T$), $P(\max_{s \in I_N \cup \{T\}} X_s > x) \le x^{-1} E|X_T|$. As $N \to \infty$ the sets $I_N$ increase to a countable dense subset of $[0, T]$; by the right-continuity of $t \mapsto X_t(\omega)$, the supremum of $X_t$ over this dense set coincides with the supremum over all of $[0, T]$, and (4.4.3) follows by the continuity from below of the probability measure $P$.

Example. Consider the random walk $S_n = \sum_{i=1}^n \xi_i$, where the i.i.d. $\xi_i$ are integrable and of zero mean. Applying Doob's inequality (4.4.1) to the sub-martingale $|S_n|$ we get that $P(U_n > x) \le x^{-1} E[|S_n|]$ for $U_n = \max\{|S_k| : 0 \le k \le n\}$ and any $x > 0$. If in addition $E\xi_i^2 = 1$, then applying Doob's inequality to the sub-MG $S_n^2$ yields the bound $P(U_n > x) \le x^{-2} E[S_n^2] = n x^{-2}$, so $U_n$ is typically of size at most $\sqrt{n}$. Consider next $Z_n = \max_{0 \le k \le n} S_k$ and $Z_\infty = \max_k S_k$, the global maximum of the sample path (possibly infinite). Replacing the assumption $E\xi_i = 0$ with $E(e^{\xi_i}) = 1$ results in the exponential martingale $M_n = e^{S_n}$ (see Example 4.1.14). Applying Doob's inequality (4.4.1) to the latter, we have for each $x > 0$ the bound
\[
P(Z_n > x) = P\big(\max_{0 \le k \le n} S_k > x\big) = P\big(\max_{0 \le k \le n} M_k > e^x\big) \le e^{-x} E(M_n) = e^{-x}.
\]
This is an example of large-deviations, or exponential tail, bounds. Since the events $\{Z_n > x\}$ increase monotonically in $n$ to $\{Z_\infty > x\}$, it follows that $Z_\infty$ is finite a.s. and has exponential upper tail, that is, $P(Z_\infty > x) \le e^{-x}$ for all $x \ge 0$.
As we have just seen, for sub-martingales with finite $p$-th moments, $p > 1$, we may improve the rate of decay in terms of $x > 0$ from the $x^{-1}$ provided by Doob's inequality to $x^{-p}$. Moreover, combining it with the formula
\[
(4.4.4)\qquad E(Z^p) = \int_0^\infty p x^{p-1} P(Z > x)\, dx,
\]
which holds for any $p > 1$ and any non-negative R.V. $Z$, allows us next to efficiently bound the moments of the maximum of a sub-MG.
Theorem 4.4.16. Suppose $X_t$ is a right-continuous sub-MG for $t \in [0, T]$ such that $E[(X_t)_+^p] < \infty$ for some $p > 1$ and all $t \ge 0$. Then, for $q = p/(p-1)$, any $x > 0$ and $t \le T$,
\[
(4.4.5)\qquad P\big(\sup_{0 \le u \le t} X_u > x\big) \le x^{-p} E\big[(X_t)_+^p\big],
\]
\[
(4.4.6)\qquad E\Big[\big(\sup_{0 \le u \le t} X_u\big)_+^p\Big] \le q^p E[(X_t)_+^p],
\]
where $(y)_+^p$ denotes the function $(\max(y, 0))^p$.
Example 4.4.17. For the Brownian motion $W_t$ we get from (4.4.5) that for any $x > 0$ and $p > 1$,
\[
P\big(\sup_{0 \le t \le T} W_t > x\big) \le x^{-p} E[(W_T)_+^p],
\]
and the right-hand side can be explicitly computed since $W_T$ is a Gaussian R.V. of zero mean and variance $T$. We do not pursue this further since in Section 5.2 we explicitly compute the probability density function of $\sup_{0 \le t \le T} W_t$.
In the next exercise you are to restate the preceding theorem for the discrete-parameter sub-MG $X_n = |Y_n|$.

Exercise 4.4.18. Show that if $Y_n$ is a MG and $E[|Y_n|^p] < \infty$ for some $p > 1$ and all $n \in \mathbb{N}$, then for $q = p/(p-1)$, any $y > 0$ and all $n \in \mathbb{N}$,
\[
P\big(\max_{k \le n} |Y_k| > y\big) \le y^{-p} E\big[|Y_n|^p\big],
\qquad
E\Big[\big(\max_{k \le n} |Y_k|\big)^p\Big] \le q^p E\big[|Y_n|^p\big].
\]
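For $p = 2$ the exercise asserts $E[(\max_{k \le n} |Y_k|)^2] \le 4\, E[Y_n^2]$. A quick empirical comparison for the $\pm 1$ random walk (a sketch with ad hoc names) shows the bound holds with plenty of room:

```python
import random

def doob_l2_check(n=200, trials=2000, seed=4):
    # Empirically compare E[(max_k |S_k|)^2] against q**p * E[|S_n|^p]
    # for the +/-1 random walk, with p = q = 2.
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(trials):
        s, m = 0, 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            m = max(m, abs(s))
        lhs += m * m
        rhs += s * s
    return lhs / trials, 4 * rhs / trials

lhs, rhs = doob_l2_check()
```

With these parameters the right-hand side is about $4n = 800$, while the empirical left-hand side is well below it.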
4.5. Martingale convergence theorems

The fact that the maximum of a sub-MG does not grow too rapidly is closely related to convergence properties of sub-MGs (also of sup-MGs and of martingales). The next theorem is stated in [KS97, Theorem 1.3.15 and Problem 1.3.19]. See [GS01, Theorem 12.3.1] for the corresponding discrete-time results and their proof.
Theorem 4.5.1 (Doob's convergence theorem). Suppose $(X_t, \mathcal{F}_t)$ is a right-continuous sub-MG.
(a) If $\sup_{t \ge 0} E[(X_t)_+] < \infty$, then $X_\infty = \lim_{t \to \infty} X_t$ exists w.p.1. Further, in this case $E|X_\infty| \le \lim_{t \to \infty} E|X_t| < \infty$.
(b) If $X_t$ is uniformly integrable then $X_t \to X_\infty$ also in $L^1$. Further, the $L^1$ convergence $X_t \to X_\infty$ implies that $X_t \le E(X_\infty \mid \mathcal{F}_t)$ for any fixed $t \ge 0$.
Remark. Doob's convergence theorem is stated here for a continuous-parameter sub-MG (of right-continuous sample path). The corresponding theorem is also easy to state for $(X_t, \mathcal{F}_t)$ which is a (right-continuous) sup-MG, since then $(-X_t, \mathcal{F}_t)$ is a sub-MG. Similarly, to any discrete-parameter sub-MG (or sup-MG) corresponds the right-continuous interpolated sub-MG (or sup-MG, respectively), as done in Example 4.2.7 (for the MG case). Consequently, Doob's convergence theorem applies also when replacing $(X_t, \mathcal{F}_t)$ by $(X_n, \mathcal{F}_n)$ throughout its statement (with right-continuity then irrelevant).
Further, Doob's convergence theorem provides the following characterization of right-continuous, Uniformly Integrable (U.I.) martingales.

Corollary 4.5.2. If $(X_t, \mathcal{F}_t)$ is a right-continuous MG and $\sup_t E|X_t| < \infty$, then $X_\infty = \lim_{t \to \infty} X_t$ exists w.p.1 and is integrable. If $X_t$ is also U.I. then $X_t = E(X_\infty \mid \mathcal{F}_t)$ for all $t$ (such a martingale, namely $X_t = E(X \mid \mathcal{F}_t)$ for an integrable R.V. $X$ and a filtration $\mathcal{F}_t$, is called Doob's martingale of $X$ with respect to $\mathcal{F}_t$).
Remark. To understand the difference between parts (a) and (b) of Doob's convergence theorem, recall that if $X_t$ is uniformly integrable then $E[(X_t)_+] \le C$ for some $C < \infty$ and all $t$ (see Definition 1.4.22). Further, by Theorem 1.4.23, uniform integrability together with convergence almost surely imply convergence in $L^1$. So, the content of part (b) is that the $L^1$ convergence $X_t \to X_\infty$ implies also that $X_t \le E(X_\infty \mid \mathcal{F}_t)$ for any $t \ge 0$.
Keep in mind that many important martingales do not converge. For example, as we see in Section 5.2, the path of the Brownian motion $W_t$ exceeds any level $b > 0$ within a finite time $\tau_b$, so $W_t$ has no finite limit as $t \to \infty$.

Proposition 4.5.3. Suppose $(Y_t, \mathcal{F}_t)$ is a right-continuous MG such that $EY_t^2 \le C$ for some $C < \infty$ and all $t \ge 0$. Then there exists a square-integrable R.V. $Y_\infty$ such that $Y_t \to Y_\infty$ almost surely and in $L^2$. Moreover, $EY_\infty^2 = \lim_{t \to \infty} EY_t^2$.

Proof. The sub-MG $X_t = Y_t^2$ is such that $EX_t \le C$ for all $t$, so applying part (a) of Doob's convergence theorem (to both $Y_t$ and $X_t$) we deduce that $Y_t \stackrel{a.s.}{\to} Y_\infty$ for a square-integrable $Y_\infty$ such that
\[
C \ge \lim_{t \to \infty} EY_t^2 = \lim_{t \to \infty} EX_t \ge EX_\infty = EY_\infty^2
\]
(the limit of $EX_t$ exists since $EX_t$ is non-decreasing for the sub-MG $X_t$, and the middle inequality is Fatou's lemma).
To complete the proof it suffices to show that $E(|Y_t - Y_\infty|^2) \to 0$ as $t \to \infty$. To this end, considering (4.4.6) for the non-negative right-continuous sub-MG $X_u = |Y_u|$ and $p = 2$, we have that $E(Z_t) \le 4C$ for $Z_t = \sup_{0 \le u \le t} Y_u^2$ and any $t < \infty$. Since $0 \le Z_t \uparrow Z = \sup_{0 \le u < \infty} Y_u^2$, we have by monotone convergence that $E(Z) \le 4C < \infty$. With $Y_\infty^2 \le Z$ as well, it follows that $V_t = |Y_t - Y_\infty|^2 \le 2Y_t^2 + 2Y_\infty^2 \le 4Z$. Since $V_t \to 0$ almost surely and $V_t$ is dominated by the integrable R.V. $4Z$, by dominated convergence $E(|Y_t - Y_\infty|^2) \to 0$ as $t \to \infty$, completing the proof of the proposition.
Remark. Beware that Proposition 4.5.3 does not have an $L^1$ analog. Namely, there exists a non-negative MG $Y_n$ such that $EY_n = 1$ for all $n$ and $Y_n \to Y_\infty = 0$ almost surely, so obviously $Y_n$ does not converge to $Y_\infty$ in $L^1$. One such example is given in Proposition 4.6.5.
Here are a few applications of Doob's convergence theorem.

Exercise 4.5.4. Consider an urn that at stage 0 contains one red ball and one blue ball. At each stage a ball is drawn at random from the urn, with all possible choices being equally likely, and it and one more ball of the same color are then returned to the urn. Let $R_n$ denote the number of red balls at stage $n$ and $M_n = R_n/(n + 2)$ the corresponding fraction of red balls.
(a) Find the law of $R_{n+1}$ conditional on $R_n = k$ and use it to compute $E(R_{n+1} \mid R_n)$.
(b) Check that $M_n$ is a martingale with respect to its canonical filtration.
(c) Applying Proposition 4.5.3 conclude that $M_n \to M_\infty$ in $L^2$ and that $E(M_\infty) = E(M_0) = 1/2$.
(d) Using Doob's (maximal) inequality show that $P(\sup_{k \ge 1} M_k > 3/4) \le 2/3$.
Example 4.5.5. Consider the martingale $S_n = \sum_{k=1}^n \xi_k$ for independent, square-integrable, zero-mean random variables $\xi_k$. Since $ES_n^2 = \sum_{k=1}^n E\xi_k^2$, it follows from Proposition 4.5.3 that the random series $S_n(\omega)$ converges almost surely (to a finite limit $S_\infty(\omega)$) whenever $\sum_k E\xi_k^2 < \infty$.
Exercise 4.5.6. Deduce from part (a) of Doob's convergence theorem that if $X_t$ is a non-negative, right-continuous sup-MG, then $X_t \stackrel{a.s.}{\to} X_\infty$ and $EX_\infty \le EX_0 < \infty$. Further, $X_n \stackrel{a.s.}{\to} X_\infty$ and $EX_\infty \le EX_0 < \infty$ for any non-negative, discrete-time martingale $X_n$.
4.6. Branching processes: extinction probabilities

We use martingales to study the extinction probabilities of branching processes (defined next). See [KT75, Chapter 8] or [GS01, Sections 5.4-5.5] for more on these processes.

Definition 4.6.1 (Branching process). The branching process is a discrete-time S.P. $Z_n$ taking non-negative integer values, such that $Z_0 = 1$ and for any $n \ge 1$,
\[
Z_n = \sum_{j=1}^{Z_{n-1}} N_j^{(n)},
\]
where $N$ and $N_j^{(n)}$ for $j = 1, 2, \ldots$ are independent, identically distributed, non-negative integer-valued R.V. with finite mean $m = E(N) < \infty$, and where we use the convention that if $Z_{n-1} = 0$ then also $Z_n = 0$.
The S.P. $Z_n$ is interpreted as counting the size of an evolving population, with $N_j^{(n)}$ being the number of offspring of the $j$-th individual of generation $(n-1)$ and $Z_n$ being the size of the $n$-th generation. Associated with the branching process is the family tree with the root denoting the 0-th generation and having $N_j^{(n)}$ edges from vertex $j$ at distance $(n-1)$ from the root to vertices at distance $n$ from the root. Random trees generated in such a fashion are called Galton-Watson trees and are the subject of much research. We focus on the simpler S.P. $Z_n$ and shall use throughout the filtration $\mathcal{F}_n = \sigma(N_j^{(k)}, k \le n, j = 1, 2, \ldots)$. We note in passing that in general $\mathcal{G}_n = \sigma(Z_k, k \le n)$ is a strict subset of $\mathcal{F}_n$ (since in general one cannot recover the number of offspring of each individual knowing only the total population sizes at the different generations).
Proposition 4.6.2. The S.P. $X_n = m^{-n} Z_n$ is a martingale for the filtration $\mathcal{F}_n$.

Proof. Note that the value of $Z_n$ is a non-random function of the values of $N_j^{(k)}$, $k \le n$, $j = 1, 2, \ldots$. Hence, it follows that $Z_n$ is adapted to the filtration $\mathcal{F}_n$. It suffices to show that
\[
(4.6.1)\qquad E[Z_{n+1} \mid \mathcal{F}_n] = m Z_n.
\]
Indeed, then by the tower property and induction $E[Z_{n+1}] = m E[Z_n] = m^{n+1}$, providing the integrability of $Z_n$. Moreover, the identity (4.6.1) reads also as $E[X_{n+1} \mid \mathcal{F}_n] = X_n$ for $X_n = m^{-n} Z_n$, which is precisely the stated martingale property of $(X_n, \mathcal{F}_n)$. To prove (4.6.1), note that the random variables $N_j^{(n+1)}$ are independent of $\mathcal{F}_n$, on which $Z_n$ is measurable. Hence, by the linearity of the expectation,
\[
E[Z_{n+1} \mid \mathcal{F}_n] = \sum_{j=1}^{Z_n} E[N_j^{(n+1)} \mid \mathcal{F}_n] = \sum_{j=1}^{Z_n} E(N_j^{(n+1)}) = m Z_n,
\]
as claimed.
While proving Proposition 4.6.2 we showed that $EZ_n = m^n$ for all $n \ge 0$. Thus, the mean total population size of a sub-critical branching process (i.e. $m < 1$) is
\[
E\Big(\sum_{n=0}^{\infty} Z_n\Big) = \sum_{n=0}^{\infty} m^n = \frac{1}{1 - m} < \infty.
\]
In particular, $\sum_n Z_n(\omega)$ is then finite w.p.1. Since $Z_n \ge 0$ are integer-valued, this in turn implies that the extinction probability is one, namely,
\[
p_{\mathrm{ex}} := P(\{\omega : Z_n(\omega) = 0 \text{ for all } n \text{ large enough}\}) = 1.
\]
We provide next another proof of this result, using the martingale $X_n = m^{-n} Z_n$ of Proposition 4.6.2.

Proposition 4.6.3 (sub-critical process dies off). If $m < 1$ then $p_{\mathrm{ex}} = 1$, that is, w.p.1 the population eventually dies off.

Proof. Recall that $X_n = m^{-n} Z_n$ is a non-negative MG. Thus, by Doob's martingale convergence theorem, $X_n \stackrel{a.s.}{\to} X_\infty$, which is almost surely finite (see Exercise 4.5.6). With $Z_n = m^n X_n$ and $m^n \to 0$ (for $m < 1$), this implies that $Z_n \stackrel{a.s.}{\to} 0$. Because $Z_n$ are integer-valued, $Z_n \to 0$ only if $Z_n = 0$ eventually, completing the proof of the proposition.
As we see next, more can be done in the special case where each individual has at most one offspring.

Exercise 4.6.4. Consider the sub-critical branching process $Z_n$ where independently each individual has either one offspring (with probability $0 < p < 1$) or no offspring at all (with probability $q = 1 - p$). Starting with $Z_0 = N \ge 1$ individuals, compute the law $P(T = n)$ of the extinction time $T = \min\{k \ge 1 : Z_k = 0\}$.
We now move to critical branching processes, namely, those $Z_n$ for which $m = EN = 1$. Excluding the trivial case in which each individual has exactly one offspring, we show that again the population eventually dies off w.p.1.

Proposition 4.6.5 (critical process dies off). If $m = 1$ and $P(N = 1) < 1$ then $p_{\mathrm{ex}} = 1$.
Proof. As we have already seen, when $m = 1$ the branching process $Z_n$ is itself a non-negative MG, hence $Z_n \stackrel{a.s.}{\to} Z_\infty$ with $Z_\infty$ finite almost surely (see Exercise 4.5.6). Since $Z_n$ are integer-valued this can happen only if for almost every $\omega$ there exist non-negative integers $k$ and $\ell$ (possibly depending on $\omega$), such that $Z_n = k$ for all $n \ge \ell$. Our assumption that $P(N = 1) < 1$ and $E(N) = 1$ for the non-negative integer-valued R.V. $N$ implies that $P(N = 0) > 0$. Note that given $Z_n = k > 0$, if $N_j^{(n+1)} = 0$ for $j = 1, \ldots, k$, then $Z_{n+1} = 0 \neq k$. By the independence of $N_j^{(n+1)}$, $j = 1, 2, \ldots$ and $\mathcal{F}_n$ we deduce that for each $n$ and $k > 0$,
\[
P(Z_{n+1} = k \mid Z_n = k, \mathcal{F}_n) \le 1 - P(Z_{n+1} = 0 \mid Z_n = k, \mathcal{F}_n) = 1 - P(N = 0)^k =: \eta_k < 1.
\]
For non-random integers $m > \ell \ge 0$ and $k \ge 0$, let $A_{\ell,m,k}$ denote the event $\{Z_n = k, \text{ for } n = \ell, \ldots, m\}$. Note that $I_{A_{\ell,m,k}} = I_{\{Z_m = k\}} I_{A_{\ell,m-1,k}}$, with $A_{\ell,m-1,k} \in \mathcal{F}_{m-1}$ implying that $Z_{m-1} = k$. So, first taking out what is known (per Proposition 2.3.15), then applying the preceding inequality for $n = m - 1$, we deduce that for any $m > \ell$ and $k > 0$,
\[
E(I_{A_{\ell,m,k}} \mid \mathcal{F}_{m-1}) \le \eta_k I_{A_{\ell,m-1,k}}.
\]
Hence, by the tower property we have that $P(A_{\ell,m,k}) \le \eta_k P(A_{\ell,m-1,k})$. With $P(A_{\ell,\ell,k}) \le 1$, it follows that $P(A_{\ell,m,k}) \le \eta_k^{m - \ell}$. We deduce that if $k > 0$, then $P(A_{\ell,k}) = 0$ for $A_{\ell,k} = A_{\ell,\infty,k} = \{Z_n = k, \text{ for all } n \ge \ell\}$ and any $\ell \ge 0$. We have seen that $P(\bigcup_{\ell, k \ge 0} A_{\ell,k}) = 1$, while $P(A_{\ell,k}) = 0$ whenever $k > 0$. So, necessarily $P(\bigcup_{\ell} A_{\ell,0}) = 1$, which amounts to $p_{\mathrm{ex}} = 1$.
Remark. Proposition 4.6.5 shows that in case $m = 1$, the martingale $Z_n$ converges to 0 w.p.1. If this sequence were U.I. then by part (b) of Doob's convergence theorem, necessarily $Z_n \to 0$ also in $L^1$, i.e. $EZ_n \to 0$. However, $EZ_n = 1$ for all $n$, so we conclude that the sequence $\{Z_n\}$ is not uniformly integrable. Further, note that either $Z_n = 0$ or $Z_n \ge 1$, so $1 = E(Z_n I_{\{Z_n \ge 1\}}) = E(Z_n \mid Z_n \ge 1) P(Z_n \ge 1)$. With $p_{\mathrm{ex}} = 1$, the probability $P(Z_n \ge 1)$ of survival for $n$ generations decays to zero as $n \to \infty$ and consequently, conditional upon survival, the mean population size $E(Z_n \mid Z_n \ge 1) = 1/P(Z_n \ge 1)$ grows to infinity as $n \to \infty$.
The martingale of Proposition 4.6.2 does not provide information on the value of $p_{\mathrm{ex}}$ for a super-critical branching process, that is, when $m > 1$. However, as we see next, when $N$ is also square-integrable, it implies that $m^{-n} Z_n$ converges in law to a non-zero random variable.
Exercise 4.6.6. Let $Z_n$ be the population size of the $n$-th generation of a (super-critical) branching process, with the number of offspring having mean $m = E(N) > 1$ and finite variance $\sigma^2 = \mathrm{Var}(N) < \infty$.
(a) Check that $E(Z_{n+1}^2) = m^2 E(Z_n^2) + \sigma^2 E(Z_n)$.
(b) Compute $E(X_n^2)$ for $X_n = m^{-n} Z_n$.
(c) Show that $X_n$ converges in law to some R.V. $X_\infty$ with $P(X_\infty > 0) > 0$.
Hint: Use Proposition 4.5.3.
For a supercritical branching process, if P(N = 0) = 0 extinction is of course impossible. Otherwise, p_ex ≥ P(Z_1 = 0) = P(N = 0) > 0. Nevertheless, even in this case p_ex < 1 for any supercritical branching process. That is, with positive probability the population survives forever. Turning to compute p_ex, consider the function

(4.6.2)  φ(p) = P(N = 0) + Σ_{k=1}^∞ P(N = k) p^k .

Note that the convex continuous function φ(p) is such that φ(0) = P(N = 0) > 0, φ(1) = 1 and φ′(1) = EN > 1. It is not hard to verify that for any such function there exists a unique solution ρ ∈ (0, 1) of the equation p = φ(p). Upon verifying that ρ^{Z_n} is then a martingale, you are to show next that p_ex = ρ (for a different derivation without martingales, see [GS01, Section 5.4] or [KT75, Section 8.3]).
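Since φ is convex and increasing with φ(0) > 0 and φ′(1) > 1, iterating p ← φ(p) from p = 0 increases monotonically to the smallest fixed point ρ. A minimal Python sketch of this computation (the Poisson offspring law and its truncation level K are illustrative assumptions, not from the text):

```python
import math

def extinction_prob(pmf, tol=1e-12, max_iter=10000):
    """Smallest fixed point of phi(p) = sum_k P(N = k) p**k,
    obtained by the monotone iteration p <- phi(p) from p = 0."""
    def phi(p):
        return sum(prob * p**k for k, prob in enumerate(pmf))
    p = 0.0
    for _ in range(max_iter):
        q = phi(p)
        if abs(q - p) < tol:
            return q
        p = q
    return p

# Illustrative offspring law: N ~ Poisson(m) with m = 2 > 1, truncated at K.
m, K = 2.0, 60
pmf = [math.exp(-m) * m**k / math.factorial(k) for k in range(K + 1)]
rho = extinction_prob(pmf)   # solves p = exp(m (p - 1)), about 0.2032
```

Starting the iteration anywhere in [0, ρ] gives the same limit; starting at p = 1 would merely return the trivial fixed point 1.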
Exercise 4.6.7. Consider a supercritical branching process Z_n with Z_0 = 1, P(N = 0) > 0 and let ρ denote the unique solution of p = φ(p) in (0, 1).
(a) Fixing p ∈ (0, 1) check that E(p^{Z_{n+1}} | F_n) = φ(p)^{Z_n} and verify that thus M_n = ρ^{Z_n} is a uniformly bounded martingale with respect to the filtration F_n.
(b) Use Doob's convergence theorem to show that M_n → M_∞ almost surely and in L^1 with E(M_∞) = ρ.
(c) Check that since Z_n are non-negative integers, the random variable M_∞(ω) ∈ {0, 1} with probability one.
(e) Noting the M_∞ [...]

[...] Φ_X(θ) being the product of the characteristic function of W_{t+h} − W_t and that of (W_{s_1}, . . . , W_{s_n}). Consequently, W_{t+h} − W_t is independent of (W_{s_1}, . . . , W_{s_n}) (see Proposition 3.2.6). Since this applies for any 0 ≤ s_1 < s_2 < . . . < s_n ≤ t, it can be shown that W_{t+h} − W_t is also independent of σ(W_s, s ≤ t).
In conclusion, the Brownian motion is an example of a zero-mean S.P. with independent increments. That is, (W_{t+h} − W_t) is independent of σ(W_s, s ∈ [0, t]), as stated.
We proceed to construct the Brownian motion as in [Bre92, Section 12.7]. To this end, consider

L²([0, T]) = { f(u) : ∫_0^T f²(u) du < ∞ } ,

equipped with the inner product (f, g) = ∫_0^T f(u) g(u) du, where we identify f, g such that f(t) = g(t) for almost every t ∈ [0, T] as being the same function. As we have seen in Example 2.2.21, this is a separable Hilbert space, and there exists a non-random sequence of functions {φ_i(t)}_{i=1}^∞ in L²([0, T]), such that for any
5.1. BROWNIAN MOTION: DEFINITION AND CONSTRUCTION 97
f, g ∈ L²([0, T]),

(5.1.1)  lim_{n→∞} Σ_{i=1}^n (f, φ_i)(g, φ_i) = (f, g)
(c.f. Definition 2.2.17 and Theorem 2.2.20). Let X_i be i.i.d. Gaussian R.V.-s with EX_i = 0 and EX²_i = 1, all of which are defined on the same probability space (Ω, F, P). For each positive integer N define the stochastic process

V^N_t = Σ_{i=1}^N X_i ∫_0^t φ_i(u) du .
Since φ_i(t) are non-random and any linear combination of the coordinates of a Gaussian random vector gives a Gaussian random vector (see Proposition 3.2.16), we see that V^N_t is a Gaussian process.
We shall show that the random variables V^N_t converge in L²(Ω, F, P) to some random variable V_t, for any fixed, non-random t ∈ [0, T]. Moreover, we show that the S.P. V_t has properties (a) and (b) of Definition 5.1.1. Then, applying Kolmogorov's continuity theorem, we deduce that the continuous modification of the S.P. V_t is the Brownian motion.

Our next result provides the first part of this program.
Proposition 5.1.3. Fixing t ∈ [0, T], the sequence N → V^N_t is a Cauchy sequence in the Hilbert space L²(Ω, F, P). Consequently, there exists a S.P. V_t(ω) such that E[(V_t − V^N_t)²] → 0 as N → ∞, for any t ∈ [0, T]. The S.P. V_t is Gaussian with E(V_t) = 0 and E(V_t V_s) = min(t, s).
Proof. Fix t ∈ [0, T], noting that for any i,

(5.1.2)  ∫_0^t φ_i(u) du = ∫_0^T 1_{[0,t]}(u) φ_i(u) du = (1_{[0,t]}, φ_i) .

Set V^0_t = 0 and let

δ_n(t) = Σ_{i=n+1}^∞ (1_{[0,t]}, φ_i)² .
Since E(X_i X_j) = 1_{i=j} we have for any N > M ≥ 0,

(5.1.3)  E[(V^N_t − V^M_t)²] = Σ_{i=M+1}^N Σ_{j=M+1}^N E[X_i X_j] (∫_0^t φ_i(u) du)(∫_0^t φ_j(u) du) = Σ_{i=M+1}^N (∫_0^t φ_i(u) du)² = δ_M(t) − δ_N(t)
(using (5.1.2) for the right-most equality). Applying (5.1.1) for f = g = 1_{[0,t]}(·) we have that for all M,

δ_M(t) ≤ δ_0(t) = Σ_{i=1}^∞ (1_{[0,t]}, φ_i)² = (1_{[0,t]}, 1_{[0,t]}) = t < ∞ .
In particular, taking M = 0 in (5.1.3) we see that E[(V^N_t)²] are finite for all N. It further follows from the finiteness of the infinite series δ_0(t) that δ_n(t) → 0 as n → ∞. In view of (5.1.3) we deduce that V^N_t is a Cauchy sequence in
L²(Ω, F, P), converging to some random variable V_t by the completeness of this space (see Proposition 1.3.20).
Being the pointwise (in t) limit in 2-mean of Gaussian processes, the S.P. V_t is also Gaussian, with the mean and auto-covariance functions for V_t being the (pointwise in t) limits of those for V^N_t (c.f. Proposition 3.2.20). Recall that E(V^N_t) = Σ_{i=1}^N EX_i ∫_0^t φ_i(u) du = 0 for all N, hence E(V_t) = 0 as well.
Repeating the argument used when deriving (5.1.3) we see that for any s, t ∈ [0, T],

E(V^N_t V^N_s) = Σ_{i=1}^N Σ_{j=1}^N E[X_i X_j] (∫_0^t φ_i(u) du)(∫_0^s φ_j(u) du) = Σ_{i=1}^N (1_{[0,t]}, φ_i)(1_{[0,s]}, φ_i) .
Applying (5.1.1) for f = 1_{[0,t]}(·) and g = 1_{[0,s]}(·), both in L²([0, T]), we now have that

E(V_t V_s) = lim_{N→∞} E(V^N_t V^N_s) = lim_{N→∞} Σ_{i=1}^N (1_{[0,t]}, φ_i)(1_{[0,s]}, φ_i) = (1_{[0,t]}, 1_{[0,s]}) = min(t, s) ,

as needed to conclude the proof of the proposition.
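The identity (5.1.1) with f = 1_{[0,t]} and g = 1_{[0,s]}, which drives the whole computation, can be checked numerically once a concrete orthonormal basis is chosen. A sketch in Python, assuming the cosine basis φ_i(u) = √(2/T) cos((i − 1/2)πu/T) of L²([0, T]) (any complete orthonormal system would do):

```python
import math

T = 1.0

def coeff(i, t):
    """(1_[0,t], phi_i) = integral of phi_i over [0, t], for the orthonormal
    basis phi_i(u) = sqrt(2/T) cos((i - 1/2) pi u / T) of L^2([0, T])."""
    a = (i - 0.5) * math.pi / T
    return math.sqrt(2.0 / T) * math.sin(a * t) / a

def cov_partial(s, t, N):
    """Partial sum of (5.1.1) with f = 1_[0,t], g = 1_[0,s];
    it should approach min(s, t) as N grows."""
    return sum(coeff(i, t) * coeff(i, s) for i in range(1, N + 1))

approx = cov_partial(0.3, 0.7, 5000)   # close to min(0.3, 0.7) = 0.3
```

The terms decay like i^{−2}, so a few thousand of them already reproduce min(s, t) to three decimal places.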
Having constructed a Gaussian stochastic process V_t with the same distribution as a Brownian motion, we next apply Kolmogorov's continuity theorem, so as to obtain its continuous modification. This modification is then a Brownian motion. To this end, recall that a Gaussian R.V. Y with EY = 0, EY² = σ² has moments E(Y^{2n}) = ((2n)!/(2^n n!)) σ^{2n}. In particular, E(Y⁴) = 3(E(Y²))². Since V_{t+h} − V_t is Gaussian with

E[(V_{t+h} − V_t)²] = E[(V_{t+h} − V_t)V_{t+h}] − E[(V_{t+h} − V_t)V_t] = h,

for all t and h > 0, we get that

E[(V_{t+h} − V_t)⁴] = 3[E(V_{t+h} − V_t)²]² = 3h² ,

as needed to apply Kolmogorov's theorem (with α = 4, β = 1 and c = 3 there).
Remark. There is an alternative direct construction of the Brownian motion as the limit of time-space rescaled random walks (see Theorem 3.1.3 for details). Further, though we constructed the Brownian motion W_t as a stochastic process on [0, T] for some finite T < ∞, it easily extends to a process on [0, ∞), which we thus take hereafter as the index set of the Brownian motion.

The Brownian motion has many interesting scaling properties, some of which are summarized in the next two exercises.
Exercise 5.1.4. Suppose W_t is a Brownian motion and α, s, T > 0 are non-random constants. Show the following.
(a) (Symmetry) −W_t, t ≥ 0 is a Brownian motion.
(b) (Time homogeneity) W_{s+t} − W_s, t ≥ 0 is a Brownian motion.
(c) (Time reversal) W_T − W_{T−t}, 0 ≤ t ≤ T is a Brownian motion.
(d) (Scaling, or self-similarity) √α W_{t/α}, t ≥ 0 is a Brownian motion.
(e) (Time inversion) If Ŵ_0 = 0 and Ŵ_t = tW_{1/t}, then Ŵ_t, t ≥ 0 is a Brownian motion.
(f) With W^i_t denoting independent Brownian motions, find the constants c_n such that c_n Σ_{i=1}^n W^i_t are also Brownian motions.
Exercise 5.1.5. Fix ρ ∈ [−1, 1]. Let Ŵ_t = ρ W^1_t + √(1 − ρ²) W^2_t, where W^1_t and W^2_t are two independent Brownian motions. Show that Ŵ_t is a Brownian motion and find the value of E(W^1_t Ŵ_t).
Exercise 5.1.6. Fixing s > 0 show that the S.P. W_s − W_{s−t}, 0 ≤ t ≤ s and W_{s+t} − W_s, t ≥ 0 are two independent Brownian motions and for 0 < t ≤ s evaluate q_t = P(W_s > W_{s−t} > W_{s+t}).
Applying Doob's inequality you are to prove next the law of large numbers for Brownian motion, namely, that almost surely t^{−1}W_t → 0 as t → ∞ (compare with the more familiar law of large numbers, n^{−1}[S_n − ES_n] → 0 for a random walk S_n).
Exercise 5.1.7. Let W_t be a Brownian motion.
(a) Use the inequality (4.4.6) to show that for any 0 < u < v,

E[ (sup_{u≤t≤v} |W_t|/t)² ] ≤ 4v/u² .

(b) Taking u = 2^n and v = 2^{n+1}, n ≥ 1 in part (a), apply Markov's inequality to deduce that for any ε > 0,

P( sup_{2^n≤t≤2^{n+1}} |W_t|/t > ε ) ≤ 8/(ε² 2^n) .

(c) Applying Borel–Cantelli lemma I conclude that almost surely t^{−1}W_t → 0 as t → ∞.
Many important S.P. are derived from the Brownian motion W_t. Our next two exercises introduce a few of these processes: the Brownian bridge B_t = W_t − min(t, 1)W_1, the Geometric Brownian motion Y_t = e^{W_t}, and the Ornstein–Uhlenbeck process U_t = e^{−t/2} W_{e^t}. We also define X_t = x + μt + σW_t, a Brownian motion with drift μ ∈ R and diffusion coefficient σ > 0 starting from x ∈ R. (See Figure 2 for illustrations of sample paths associated with these processes.)
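These definitions translate directly into a simulation on a uniform grid; a hedged Python sketch (the grid size, seed and the drift parameters are arbitrary choices; U_t is omitted since its time change e^t leaves the unit grid):

```python
import math, random

rng = random.Random(0)
n = 1000                       # grid t_k = k/n on [0, 1]; an arbitrary choice
dt = 1.0 / n
W = [0.0]
for _ in range(n):             # independent N(0, dt) increments of W
    W.append(W[-1] + rng.gauss(0.0, math.sqrt(dt)))

t = [k * dt for k in range(n + 1)]
B = [W[k] - min(t[k], 1.0) * W[n] for k in range(n + 1)]    # Brownian bridge
Y = [math.exp(w) for w in W]                                # geometric Brownian motion
mu, sigma, x0 = 1.0, 2.0, 1.0                               # illustrative parameters
X = [x0 + mu * t[k] + sigma * W[k] for k in range(n + 1)]   # BM with drift
```

Note that B_1 = 0 holds exactly by construction, one of the points of Exercise 5.1.9 below.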
Exercise 5.1.8. Compute the mean and the auto-covariance functions of the processes B_t, Y_t, U_t, and X_t. Justify your answers to:
(a) Which of the processes W_t, B_t, Y_t, U_t, X_t is Gaussian?
(b) Which of these processes is stationary?
(c) Which of these processes has continuous sample path?
(d) Which of these processes is adapted to the filtration σ(W_s, s ≤ t) and which is also a sub-martingale for this filtration?
Exercise 5.1.9. Show that for 0 ≤ t ≤ 1 each of the following S.P. has the same distribution as the Brownian bridge and explain why both have continuous modifications.
(a) B̂_t = (1 − t)W_{t/(1−t)} for t < 1, with B̂_1 = 0.
(b) Z_t = tW_{1/t−1} for t > 0, with Z_0 = 0.
Exercise 5.1.10. Let X_t = ∫_0^t W_s ds for a Brownian motion W_t.
(a) Verify that X_t is a well defined stochastic process. That is, check that X_t(ω) is a random variable for each fixed t ≥ 0.
[Figure 2 here: four panels showing a sample path of the Brownian bridge B_t, the Geometric Brownian motion Y_t, the Ornstein–Uhlenbeck process U_t, and a Brownian motion with drift (μ = 1, σ = 2, x = 1).]

Figure 2. Illustration of sample paths for processes in Exercise 5.1.8.
(b) Using Fubini's theorem 3.3.10 find E(X_t) and E(X²_t).
(c) Is X_t a Gaussian process? Does it have continuous sample paths a.s.? Does it have stationary increments? Independent increments?
Exercise 5.1.11. Suppose W_t is a Brownian motion.
(a) Compute the probability density function of the random vector (W_s, W_t). Then compute E(W_s | W_t) and Var(W_s | W_t), first for s > t, then for s < t.
Hint: Consider Example 2.4.5.
(b) Explain why the Brownian Bridge B_t, 0 ≤ t ≤ 1 has the same distribution as W_t, 0 ≤ t ≤ 1, conditioned upon W_1 = 0 (which is the reason for naming B_t a Brownian bridge).
Hint: Both Exercise 2.4.6 and parts of Exercise 5.1.8 may help here.
We conclude with the fractional Brownian motion, another Gaussian S.P. of considerable interest in financial mathematics and analysis of computer network traffic.

Exercise 5.1.12. Fix H ∈ (0, 1). A Gaussian stochastic process X_t, t ≥ 0, is called a fractional Brownian motion (or in short, fBM), of Hurst parameter H, if E(X_t) = 0 and

E(X_t X_s) = (1/2)[ |t|^{2H} + |s|^{2H} − |t − s|^{2H} ] ,  s, t ≥ 0.

(a) Show that an fBM of Hurst parameter H has a continuous modification that is also locally Hölder continuous with exponent γ for any 0 < γ < H.
(b) Verify that in case H = 1/2 such modification yields the (standard) Brownian motion.
(c) Show the self-similarity property, whereby for any non-random α > 0 the process α^H X_{t/α} is an fBM of the same Hurst parameter H.
(d) For which values of H is the fBM a process of stationary increments and
for which values of H is it a process of independent increments?
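Parts (b) and (c) amount to algebraic identities of the covariance function, which are easy to sanity-check numerically; a short sketch (the sample points, α and H are arbitrary choices):

```python
def fbm_cov(t, s, H):
    """E(X_t X_s) for an fBM of Hurst parameter H."""
    return 0.5 * (abs(t)**(2 * H) + abs(s)**(2 * H) - abs(t - s)**(2 * H))

pairs = [(0.2, 0.9), (1.5, 0.4), (3.0, 3.0)]
# (b): at H = 1/2 the covariance collapses to min(t, s), the Brownian one.
bm_diffs = [abs(fbm_cov(t, s, 0.5) - min(t, s)) for t, s in pairs]
# (c): alpha^H X_{t/alpha} has the same covariance, for any alpha > 0 and H.
alpha, H = 2.0, 0.7
ss_diffs = [abs(alpha**(2 * H) * fbm_cov(t / alpha, s / alpha, H) - fbm_cov(t, s, H))
            for t, s in pairs]
```

Both lists of differences vanish up to floating-point rounding, since the identities hold exactly.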
5.2. The reflection principle and Brownian hitting times

We start with Paul Lévy's martingale characterization of the Brownian motion, stated next.

Theorem 5.2.1 (Lévy's martingale characterization). Suppose a square-integrable MG (X_t, F_t) of right-continuous filtration and continuous sample path is such that (X²_t − t, F_t) is also a MG. Then, X_t is a Brownian motion.
Remark. The continuity of X_t is essential for Lévy's martingale characterization. For example, the square-integrable martingale X_t = N_t − t, with N_t the Poisson process of rate one (per Definition 6.2.1), is such that X²_t − t is also a MG (see Exercise 6.2.2). Of course, almost all sample path of the Poisson process are discontinuous.

A consequence of this characterization is that a square-integrable MG with continuous sample path and unbounded increasing part is merely a time changed Brownian motion (c.f. [KS97, Theorem 3.4.6]).
Proposition 5.2.2. Suppose (X_t, F_t) is a square-integrable martingale with X_0 = 0, right-continuous filtration and continuous sample path. If the increasing part A_t in the corresponding Doob–Meyer decomposition of Theorem 4.4.7 is almost surely unbounded then W_s = X_{τ_s} is a Brownian motion, where τ_s = inf{t ≥ 0 : A_t > s} are F_t-stopping times such that s → τ_s is a non-decreasing and right-continuous mapping of [0, ∞) to [0, ∞), with A_{τ_s} = s and X_t = W_{A_t}.
Our next proposition may be viewed as yet another application of Lévy's martingale characterization. In essence it states that each stopping time acts as a regeneration point for the Brownian motion. In particular, it implies that the Brownian motion is a strong Markov process (in the sense of Definition 6.1.21). As we soon see, this regeneration property is very handy for finding the distribution of certain Brownian hitting times and running maxima.

Proposition 5.2.3. If τ is a stopping time for the canonical filtration G_t of the Brownian motion W_t, then the S.P. X_t = W_{t+τ} − W_τ is also a Brownian motion, independent of G_τ.

Remark. This result is stated as [Bre92, Theorem 12.42], with a proof that starts with a stopping time taking a countable set of values and moves to the general case by approximation, using sample path continuity. Alternatively, with the help of some amount of stochastic calculus one may verify the conditions of Lévy's theorem for X_t and the filtration F_t = σ(W_{s+τ} − W_τ, 0 ≤ s ≤ t). We will detail neither approach here.
We next apply Proposition 5.2.3 for computing the probability density function of the first hitting time τ_α = inf{t > 0 : W_t = α} for any fixed α > 0. Since the Brownian motion has continuous sample path, we know that τ_α = min{t > 0 : W_t = α} and that the maximal value of W_t for t ∈ [0, T] is always achieved at some t ≤ T. Further, since W_0 = 0 < α, if W_s ≥ α for some s > 0, then W_u = α for some u ∈ [0, s], that is, τ_α ≤ s with W_{τ_α} = α. Consequently,

{ω : W_T(ω) ≥ α} ⊆ {ω : max_{0≤s≤T} W_s(ω) ≥ α} = {ω : τ_α(ω) ≤ T} .
By Proposition 5.2.3, the Brownian motion X_t = W_{t+τ_α} − W_{τ_α} is symmetric and independent of the event {τ_α ≤ T} (which is measurable on G_{τ_α}), so that

P(τ_α ≤ T, X_{T−τ_α} ≥ 0) = P(τ_α ≤ T, X_{T−τ_α} ≤ 0) = P(max_{0≤s≤T} W_s ≥ α, W_T ≤ α) .
Also, P(W_T = α) = 0 and we have that

(5.2.1)  P(max_{0≤s≤T} W_s ≥ α) = P(max_{0≤s≤T} W_s ≥ α, W_T ≤ α) + P(max_{0≤s≤T} W_s ≥ α, W_T ≥ α)
  = 2 P(max_{0≤s≤T} W_s ≥ α, W_T ≥ α) = 2 P(W_T ≥ α)
  = 2 ∫_{αT^{−1/2}}^∞ (2π)^{−1/2} e^{−x²/2} dx .
Among other things, this shows that P(τ_α > T) → 0 as T → ∞, hence τ_α < ∞ with probability one. Further, we have that the probability density function of τ_α at T is given by

(5.2.2)  p_{τ_α}(T) = ∂[P(τ_α ≤ T)]/∂T = 2 (∂/∂T) ∫_{αT^{−1/2}}^∞ (2π)^{−1/2} e^{−x²/2} dx = α (2π)^{−1/2} T^{−3/2} e^{−α²/(2T)} .
This computation demonstrates the power of the reflection principle and, more generally, that many computations for stochastic processes are the most explicit when they are done for the Brownian motion.
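The differentiation in (5.2.2) can be cross-checked numerically, comparing the closed-form density against a central difference of P(τ_α ≤ T) = 2P(W_T ≥ α); a Python sketch (the values of α and T are arbitrary):

```python
import math

def hit_cdf(alpha, T):
    """P(tau_alpha <= T) = 2 P(W_T >= alpha) = erfc(alpha / sqrt(2 T))."""
    return math.erfc(alpha / math.sqrt(2 * T))

def hit_density(alpha, T):
    """The closed form (5.2.2)."""
    return alpha / (math.sqrt(2 * math.pi) * T**1.5) * math.exp(-alpha**2 / (2 * T))

alpha, T, h = 1.0, 2.0, 1e-5          # arbitrary test point and step
numeric = (hit_cdf(alpha, T + h) - hit_cdf(alpha, T - h)) / (2 * h)
```

The central difference agrees with the closed form to within O(h²), confirming the calculus step in (5.2.2).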
Our next exercise provides yet another example of a similar nature.

Exercise 5.2.4. Let W_t be a Brownian motion.
(a) Show that −min_{0≤t≤T} W_t and max_{0≤t≤T} W_t have the same distribution, which is also the distribution of |W_T|.
(b) Show that the probability that the Brownian motion W_u attains the value zero at some u ∈ (s, s + t) is given by β = ∫ p_t(|x|) f_s(x) dx, where p_t(x) = P(|W_t| ≥ x) for x, t > 0 and f_s(x) denotes the probability density of the R.V. W_s.
5.3. SMOOTHNESS AND VARIATION OF THE BROWNIAN SAMPLE PATH 103
Remark: The explicit formula β = (2/π) arccos(√(s/(s + t))) is obtained in [KT75, page 348] by computing this integral.
Remark. Using a reflection principle type argument one gets the discrete time analog of (5.2.1), whereby the simple random walk S_n of Definition 3.1.2 satisfies for each integer r > 0 the identity

P(max_{0≤k≤n} S_k ≥ r) = 2 P(S_n > r) + P(S_n = r) .
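Since this identity involves only finitely many equally likely ±1 paths, it can be verified exactly by enumeration; a Python sketch (the choices of n and r are arbitrary):

```python
from itertools import product

def check_identity(n, r):
    """Count paths for both sides of P(max_k S_k >= r) = 2 P(S_n > r) + P(S_n = r)
    over all 2^n equally likely +/-1 step sequences."""
    lhs = gt = eq = 0
    for steps in product((-1, 1), repeat=n):
        s = m = 0
        for x in steps:
            s += x
            m = max(m, s)          # running maximum of the walk
        lhs += (m >= r)
        gt += (s > r)
        eq += (s == r)
    return lhs, 2 * gt + eq

results = [check_identity(n, r) for n in (5, 8) for r in (1, 2, 3)]
```

Dividing each count by 2^n turns the matching path counts into the matching probabilities of the identity.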
Fixing α > 0 and β > 0, consider the stopping time τ_{α,β} = inf{t : W_t ≥ α or W_t ≤ −β} (for the canonical filtration of the Brownian motion W_t). By continuity of the Brownian sample path we know that W_{τ_{α,β}} ∈ {α, −β}. Applying Doob's optional stopping theorem for the uniformly integrable stopped martingale W_{t∧τ_{α,β}} of continuous sample path, we get that P(W_{τ_{α,β}} = α) = β/(α + β) (for more details see Exercise 4.3.18).
Exercise 5.2.5. Show that E(τ_{α,β}) = αβ by applying Doob's optional stopping theorem for the uniformly integrable stopped martingale W²_{t∧τ_{α,β}} − t∧τ_{α,β}.
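A discrete analog of E(τ_{α,β}) = αβ holds for the simple ±1 random walk leaving (−β, α) with integer α, β: h(x) = E_x(τ) solves h(x) = 1 + (h(x + 1) + h(x − 1))/2 with h(−β) = h(α) = 0, whose solution is h(x) = (α − x)(x + β), so h(0) = αβ. A sketch solving this boundary-value recursion numerically (the shooting-style method is one choice among many):

```python
def mean_exit_time(alpha, beta):
    """E_0(tau) for the +/-1 walk exiting (-beta, alpha): solve
    h(x) = 1 + (h(x + 1) + h(x - 1))/2, h(-beta) = h(alpha) = 0.
    h depends affinely on the unknown value c = h(-beta + 1), so two
    sweeps of the recursion h(x + 1) = 2 h(x) - h(x - 1) - 2 pin c down."""
    def sweep(c):
        h = {-beta: 0.0, -beta + 1: c}
        for x in range(-beta + 1, alpha):
            h[x + 1] = 2 * h[x] - h[x - 1] - 2
        return h
    a0 = sweep(0.0)[alpha]           # boundary value when c = 0
    a1 = sweep(1.0)[alpha]           # boundary value when c = 1
    c = -a0 / (a1 - a0)              # choose c so that h(alpha) = 0
    return sweep(c)[0]

vals = [(a, b, mean_exit_time(a, b)) for a, b in [(1, 1), (2, 3), (4, 5)]]
```

The computed values match the product αβ exactly, mirroring the Brownian result of Exercise 5.2.5.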
We see that the expected time it takes the Brownian motion to exit the interval (−β, α) is finite for any finite α and β. As β → ∞, these exit times τ_{α,β} converge monotonically to the time of reaching level α, namely τ_α = inf{t > 0 : W_t = α}. Exercise 5.2.5 implies that [...]

[...] let

V^{(q)}_{(π)}(f) = Σ_i |f(t^{(π)}_{i+1}) − f(t^{(π)}_i)|^q

denote the q-th variation of f(·) on π. The q-th variation of f(·) on [a, b] is then
(5.3.1)  V^{(q)}(f) = lim_{‖π‖→0} V^{(q)}_{(π)}(f) ,

provided such limit exists.

We next extend this definition to continuous time stochastic processes.
Definition 5.3.2. The q-th variation of a S.P. X_t on the interval [a, b] is the random variable V^{(q)}(X) obtained when replacing f(t) by X_t(ω) in the above definition, provided the limit (5.3.1) exists (in some sense).
The quadratic variation is affected by the smoothness of the sample path. For example, suppose that a S.P. X(t) has Lipschitz sample path with probability one. Namely, there exists a random variable L(ω) which is finite almost surely, such that |X(t) − X(s)| ≤ L|t − s| for all t, s ∈ [a, b]. Then,

(5.3.2)  V^{(2)}_{(π)}(X) ≤ L² Σ_i (t^{(π)}_{i+1} − t^{(π)}_i)² ≤ L² ‖π‖ Σ_i (t^{(π)}_{i+1} − t^{(π)}_i) = L² ‖π‖ (b − a) ,

which converges to zero almost surely as ‖π‖ → 0. So, such a S.P. has zero quadratic variation on [a, b].
By considering different time intervals we view the quadratic variation as yet another stochastic process.

Definition 5.3.3. The quadratic variation of a stochastic process X, denoted V^{(2)}_t(X), is the non-decreasing, non-negative S.P. corresponding to the quadratic variation of X on the intervals [0, t].
Focusing hereafter on the Brownian motion, we have that,

Proposition 5.3.4. For a Brownian motion W(t), as ‖π‖ → 0 we have that V^{(2)}_{(π)}(W) → (b − a) in 2-mean.
Proof. Fixing a finite partition π, note that

E[V^{(2)}_{(π)}(W)] = Σ_i E[(W(t_{i+1}) − W(t_i))²] = Σ_i Var(W(t_{i+1}) − W(t_i)) = Σ_i (t_{i+1} − t_i) = b − a .

Similarly, by the independence of increments,

E[V^{(2)}_{(π)}(W)²] = Σ_{i,j} E[(W(t_{i+1}) − W(t_i))² (W(t_{j+1}) − W(t_j))²]
 = Σ_i E[(W(t_{i+1}) − W(t_i))⁴] + Σ_{i≠j} E[(W(t_{i+1}) − W(t_i))²] E[(W(t_{j+1}) − W(t_j))²]
Since W(t_{j+1}) − W(t_j) is Gaussian of mean zero and variance (t_{j+1} − t_j), it follows that

E[V^{(2)}_{(π)}(W)²] = 3 Σ_i (t_{i+1} − t_i)² + Σ_{i≠j} (t_{i+1} − t_i)(t_{j+1} − t_j) = 2 Σ_i (t_{i+1} − t_i)² + (b − a)² .
So, Var(V^{(2)}_{(π)}(W)) = E(V^{(2)}_{(π)}(W)²) − (b − a)² ≤ 2(b − a)‖π‖ → 0 as ‖π‖ → 0. With the mean of V^{(2)}_{(π)}(W) being (b − a) and its variance converging to zero, we have the stated convergence in 2-mean.
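Proposition 5.3.4 is easy to probe by simulation: V^{(2)}_{(π)}(W) depends only on the increments of W, which on a uniform partition are i.i.d. N(0, ‖π‖) draws. A sketch (the partition size and seed are arbitrary):

```python
import math, random

rng = random.Random(42)
a, b = 0.0, 1.0
n = 1 << 16                        # 65536 equal partition intervals
dt = (b - a) / n
# V^(2) over the uniform partition: sum of squared i.i.d. N(0, dt) increments.
qv = sum(rng.gauss(0.0, math.sqrt(dt)) ** 2 for _ in range(n))
```

By the proof above, qv has mean b − a = 1 and standard deviation √(2/n) ≈ 0.006, so a single sample already sits very close to 1.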
Here are two consequences of Proposition 5.3.4.

Corollary 5.3.5. The quadratic variation of the Brownian motion is the S.P. V^{(2)}_t(W) = t, which is the same as the increasing process in the Doob–Meyer decomposition of W²_t. More generally, the quadratic variation equals the increasing process for any square-integrable martingale of continuous sample path and right-continuous filtration (as shown for example in [KS97, Theorem 1.5.8, page 32]).
Remark. Since V^{(2)}_{(π)} are observable on the sample path, considering finer and finer partitions π_n, one may numerically estimate the quadratic variation for a given sample path of a S.P. The quadratic variation of the Brownian motion is non-random, so if this numerical estimate significantly deviates from t, we conclude that Brownian motion is not a good model for the given S.P.
Corollary 5.3.6. With probability one, the sample path of the Brownian motion W(t) is not Lipschitz continuous in any interval [a, b], a < b.

Proof. Fix a finite interval [a, b], a < b and let Γ_L denote the set of outcomes ω for which |W(t) − W(s)| ≤ L|t − s| for all t, s ∈ [a, b]. From (5.3.2) we see that if ‖π‖ ≤ 1/(2L²) then

Var(V^{(2)}_{(π)}(W)) ≥ E[(V^{(2)}_{(π)}(W) − (b − a))² I_{Γ_L}] ≥ ((b − a)²/4) P(Γ_L) .

By Proposition 5.3.4 we know that Var(V^{(2)}_{(π)}(W)) → 0 as ‖π‖ → 0, hence necessarily P(Γ_L) = 0. As the set Γ of outcomes for which the sample path of W(t) is Lipschitz continuous is just the (countable) union of Γ_L over positive integer values of L, it follows that P(Γ) = 0, as stated.
We can even improve upon this negative result, as follows.

Exercise 5.3.7. Fixing γ > 1/2, check that by the same type of argument as above, with probability one, the sample path of the Brownian motion is not globally Hölder continuous of exponent γ in any interval [a, b], a < b.

In contrast, applying Theorem 3.3.3 verify that with probability one the sample path of the Brownian motion is locally Hölder continuous for any exponent γ < 1/2 (see part (c) of Exercise 3.3.5 for a similar derivation).
The next exercise shows that one can strengthen the convergence of the quadratic variation for W(t) by imposing some restrictions on the allowed partitions.

Exercise 5.3.8. Let V^{(2)}_{(π)}(W) denote the approximation of the quadratic variation of the Brownian motion for a finite partition π of [a, a + t]. Combining Markov's inequality (for f(x) = x²) and Borel–Cantelli I, show that V^{(2)}_{(π_n)}(W) → t a.s. if the finite partitions π_n are such that Σ_n ‖π_n‖ < ∞.
In the next exercise, you are to follow a similar procedure, en route to finding the quadratic variation for a Brownian motion with drift.

Exercise 5.3.9. Let Z(t) = W(t) + rt, t ≥ 0, where W(t) is a Brownian motion and r a non-random constant.
(a) What is the law of Y = Z(t + h) − Z(t)?
(b) For which values of t′ < t and h, h′ > 0 are Y and Z(t′ + h′) − Z(t′) independent?
(c) Find the quadratic variation V^{(2)}_t(Z) of the stochastic process Z(t).
Hint: See Exercise 5.3.15.
Typically, the stochastic integral I_t = ∫_0^t X_s dW_s is first constructed in case X_t is a simple process (that is, having sample path that are piece-wise constant on non-random intervals), exactly as you do next.
Exercise 5.3.10. Suppose (W_t, F_t) satisfies Lévy's characterization of the Brownian motion. Namely, it is a square-integrable martingale of right-continuous filtration and continuous sample path such that (W²_t − t, F_t) is also a martingale. Suppose X_t is a bounded F_t-adapted simple process. That is,

X_t = η_0 1_{{0}}(t) + Σ_{i=0}^∞ η_i 1_{(t_i, t_{i+1}]}(t) ,

where the non-random sequence t_k, with t_0 = 0, is strictly increasing and unbounded (in k), while the (discrete time) S.P. η_n is uniformly (in n and ω) bounded and adapted to F_{t_n}. Provide an explicit formula for A_t = ∫_0^t X²_u du, then show that both

I_t = Σ_{j=0}^{k−1} η_j (W_{t_{j+1}} − W_{t_j}) + η_k (W_t − W_{t_k}), when t ∈ [t_k, t_{k+1}) ,

and I²_t − A_t are martingales with respect to F_t, and explain why this implies that EI²_t = EA_t and V^{(2)}_t(I) = A_t.
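One concrete instance of this construction can be simulated to check EI²_t = EA_t; a sketch (the integrand sign(W_{t_i}), the grid, and the Monte Carlo size are illustrative assumptions, with the Brownian increments drawn directly):

```python
import math, random

rng = random.Random(7)
grid = [0.0, 0.5, 1.0, 1.5, 2.0]     # non-random times; t = 2 is the endpoint

def one_sample():
    """One draw of (I_t, A_t) at t = 2 for the simple process that on
    (t_i, t_{i+1}] equals sign(W_{t_i}) -- bounded and adapted to F_{t_i}."""
    W = I = A = 0.0
    for t0, t1 in zip(grid, grid[1:]):
        eta = 1.0 if W >= 0 else -1.0
        dW = rng.gauss(0.0, math.sqrt(t1 - t0))
        I += eta * dW
        A += eta ** 2 * (t1 - t0)
        W += dW
    return I, A

m = 20000
samples = [one_sample() for _ in range(m)]
mean_I2 = sum(i * i for i, _ in samples) / m
mean_A = sum(a for _, a in samples) / m   # here A_t = t exactly, since eta^2 = 1
```

For this particular integrand A_t is deterministic, so the interesting output is that the empirical mean of I²_t matches it to Monte Carlo accuracy.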
We move from the quadratic variation V^{(2)} to the total variation V^{(1)}. Note that, when q = 1, the limit in (5.3.1) always exists and equals the supremum over all finite partitions π.
Example 5.3.11. The total variation is particularly simple for monotone functions. Indeed, it is easy to check that if f(t) is monotone then its total variation is V^{(1)}(f) = max_{t∈[a,b]} f(t) − min_{t∈[a,b]} f(t). In particular, the total variation of monotone functions is finite on finite intervals even though the functions may well be discontinuous.
In contrast we have that

Proposition 5.3.12. The total variation of the Brownian motion W(t) is infinite with probability one.
Proof. Let Δ(h) = sup_{a≤t≤b−h} |W(t + h) − W(t)|. With probability one, the sample path W(t) is continuous, hence uniformly continuous on the closed, bounded interval [a, b]. Therefore, Δ(h) → 0 a.s. as h → 0. Let π_n divide [a, b] to 2^n equal parts, so ‖π_n‖ = 2^{−n}(b − a). Then,

(5.3.3)  V^{(2)}_{(π_n)}(W) = Σ_{i=0}^{2^n−1} [W(a + (i + 1)‖π_n‖) − W(a + i‖π_n‖)]² ≤ Δ(‖π_n‖) Σ_{i=0}^{2^n−1} |W(a + (i + 1)‖π_n‖) − W(a + i‖π_n‖)| .

Recall Exercise 5.3.8, that almost surely V^{(2)}_{(π_n)}(W) → (b − a) < ∞. This, together with (5.3.3) and the fact that Δ(‖π_n‖) → 0 a.s., imply that

V^{(1)}_{(π_n)}(W) = Σ_{i=0}^{2^n−1} |W(a + (i + 1)‖π_n‖) − W(a + i‖π_n‖)| → ∞ a.s.,

implying that V^{(1)}(W) = ∞ with probability one, as stated.
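The divergence of E[V^{(1)}_{(π_n)}(W)] = 2^n E|W_{2^{−n}}| = √(2^{n+1}/π) can be seen numerically; a sketch that, for simplicity, draws a fresh set of increments at each level rather than refining one fixed path (an assumption that does not change the expectations involved):

```python
import math, random

rng = random.Random(1)
levels = range(8, 13)       # partition [0, 1] into 2^n pieces at level n
tv = []
for n in levels:
    k = 1 << n
    dt = 1.0 / k
    # E of this sum is 2^n * E|W_dt| = sqrt(2^(n+1) / pi), which diverges.
    tv.append(sum(abs(rng.gauss(0.0, math.sqrt(dt))) for _ in range(k)))
```

Each halving of the mesh multiplies the expected total variation by √2, while its standard deviation stays bounded, so the computed values increase steadily.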
Remark. Comparing Example 5.3.11 and Proposition 5.3.12 we have that the sample path of the Brownian motion is almost surely non-monotone on each non-empty open interval. Here is an alternative, direct proof of this result (c.f. [KS97, Theorem 2.9.9]).
Exercise 5.3.13. Let W_t be a Brownian motion on a probability space (Ω, F, P).
(a) Let A_n = ∩_{i=1}^n {ω : W_{i/n}(ω) − W_{(i−1)/n}(ω) ≥ 0} and A = {ω : t → W_t(ω) is non-decreasing on [0, 1]}. Explain why A = ∩_n A_n, why P(A_n) = 2^{−n}, and why this implies that A ∈ F and P(A) = 0.
(b) Use the symmetry of the Brownian motion's sample path (per Exercise 5.1.4) to deduce that the probability that it is monotone on [0, 1] is 0. Verify that the same applies for any interval [s, t] with 0 ≤ s < t non-random.
(c) Show that, for almost every ω, the sample path t → W_t(ω) is non-monotone on any non-empty open interval.
Hint: Let F denote the set of ω such that t → W_t(ω) is monotone on some non-empty open interval, observing that

F = ∪_{s,t∈Q, 0≤s<t} {ω : t → W_t(ω) is monotone on [s, t]}.
To practice your understanding, solve the following exercises.
Exercise 5.3.14. Consider the stochastic process Y(t) = W(t)², for 0 ≤ t ≤ 1, with W(t) a Brownian motion.
(a) Show that for any γ < 1/2 the sample path of Y(t) is locally Hölder continuous of exponent γ with probability one.
(b) Compute E[V^{(2)}_{(π)}(Y)] for a finite partition π of [0, t] to k intervals, and find its limit as ‖π‖ → 0.
(c) Show that the total variation of Y(t) on the interval [0, 1] is infinite.
Exercise 5.3.15.
(a) Show that if functions f(t) and g(t) on [a, b] have zero and finite quadratic variations, respectively (i.e. V^{(2)}(f) = 0 and V^{(2)}(g) < ∞ exists), then V^{(2)}(g + f) = V^{(2)}(g).
(b) Show that if a (uniformly) continuous function f(t) has finite total variation then V^{(q)}(f) = 0 for any q > 1.
(c) Suppose both X_t and Ã_t have continuous sample path, such that t → Ã_t has finite total variation on any bounded interval and X_t is a square-integrable martingale. Deduce that then V^{(2)}_t(X + Ã) = V^{(2)}_t(X).
What follows should be omitted at first reading.

We saw that the sample path of the Brownian motion is rather irregular, for it is neither monotone nor Lipschitz continuous at any open interval. [Bre92, Theorem 12.25] somewhat refines the latter conclusion, showing that with probability one the sample path is nowhere differentiable.

We saw that almost surely the sample path of the Brownian motion is Hölder continuous of any exponent γ < 1/2 (see Exercise 5.3.7), and of no exponent γ > 1/2. The exact modulus of continuity of the Brownian path is provided by P. Lévy's (1937) theorem (see [KS97, Theorem 2.9.25, page 114]):

P( limsup_{δ→0} (1/g(δ)) sup_{0≤s,t≤1, |t−s|≤δ} |W(t) − W(s)| = 1 ) = 1,

where g(δ) = √(2δ log(1/δ)) [...], suggesting that the Brownian path grows like [...]

[...] the random walk S_n = Σ_{i=1}^n ξ_i, where ξ_i are i.i.d. random variables.
Indeed, S_{n+1} = S_n + ξ_{n+1} with ξ_{n+1} independent of G_n = σ(S_0, . . . , S_n), hence P(S_{n+1} ∈ A | G_n) = P(S_n + ξ_{n+1} ∈ A | S_n). With ξ_{n+1} having the same law as ξ_1, we thus get that P(S_n + ξ_{n+1} ∈ A | S_n) = p(A|S_n) for the stationary transition probabilities p(A|x) = P(ξ_1 ∈ {y − x : y ∈ A}). By similar reasoning we see that another example of a homogeneous Markov chain is the branching process Z_n of Definition 4.6.1, with stationary transition probabilities p(A|x) = P(Σ_{j=1}^x N_j ∈ A) for integer x ≥ 1 and p(A|0) = 1_{0∈A}.
Remark. Many Markov chains are not martingales. For example, a sequence of independent variables X_n is a Markov chain, but unless P(X_n = c) = 1 for some c non-random and all n, it is not a martingale (for E[X_{n+1} | X_0, . . . , X_n] = EX_{n+1} ≠ X_n(ω)). Likewise, many martingales do not have the Markov property. For example, the sequence X_n = X_0(1 + S_n), with X_0 uniformly chosen in {1, 3} independently of the simple random walk S_n of zero-mean, is a martingale, since

E[X_{n+1} | X_0, . . . , X_n] = X_n + X_0 E[ξ_{n+1} | X_0, . . . , X_n] = X_n ,

but not a Markov chain, since X²_0 is not measurable on σ(X_n), hence

E[X²_{n+1} | X_0, . . . , X_n] = X²_n + X²_0 ≠ X²_n + E[X²_0 | X_n] = E[X²_{n+1} | X_n] .
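The failure of the Markov property here can be confirmed by exact enumeration of the finitely many equally likely outcomes of (X_0, ξ_1, ξ_2):

```python
from itertools import product

# All equally likely outcomes of (X_0, xi_1, xi_2), X_0 uniform on {1, 3}.
paths = [(x0, x0 * (1 + s1), x0 * (1 + s1 + s2))
         for x0, s1, s2 in product((1, 3), (-1, 1), (-1, 1))]

def mean_sq_last(sel):
    return sum(p[2] ** 2 for p in sel) / len(sel)

# Conditioning on X_1 = 0 alone mixes X_0 = 1 and X_0 = 3:
e_given_x1 = mean_sq_last([p for p in paths if p[1] == 0])           # 5.0
e_given_both = mean_sq_last([p for p in paths if p[:2] == (1, 0)])   # 1.0

# The martingale property E[X_2 | X_0, X_1] = X_1 holds for every history:
mart_ok = all(
    sum(p[2] for p in paths if p[:2] == key) ==
    key[1] * sum(1 for p in paths if p[:2] == key)
    for key in {p[:2] for p in paths}
)
```

The two conditional second moments differ (5 versus 1), exhibiting the non-Markov behavior at n = 1, while the martingale check passes for every history.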
Let P_x denote the law of the homogeneous Markov chain starting at X_0 = x. For example, in the context of the random walk we normally start at S_0 = 0, i.e. consider the law P_0, whereas for the branching process example we normally take Z_0 = 1, hence in this context consider the law P_1.

Whereas it suffices to know the stationary transition probability p to determine P_x for any given x ∈ S, we are often interested in settings in which X_0 is a R.V. To this end, we next define the initial distribution associated with the Markov chain.

Definition 6.1.4. The initial distribution of a Markov chain is the probability measure ν(A) = P(X_0 ∈ A) on (S, B).
Indeed, by the tower property (with respect to σ(X_0)), it follows from our definitions that for any integer k ≥ 0 and a (measurable) set B ∈ B^{k+1},

P((X_0, X_1, . . . , X_k) ∈ B) = ∫ P_x((x, X_1, . . . , X_k) ∈ B) ν(dx) ,

where ν(dx), or more generally p(dx|y), denote throughout the Lebesgue integral with respect to x (with y, if present, being an argument of the resulting function). For example, the preceding formula means the expectation (as in Definition 1.2.19) of the measurable function h(x) = P_x((x, X_1, . . . , X_k) ∈ B) when x is an S-valued R.V. of law ν(·).
While we do not pursue this direction further, computations of probabilities of events of interest for a Markov chain are easier when S is a countable set. In this case, B = 2^S and for any A ⊆ S

p(A|x) = Σ_{y∈A} p(y|x) ,

so the stationary transition probabilities are fully determined by p(y|x) ≥ 0 such that Σ_{y∈S} p(y|x) = 1 for each x ∈ S, and all our Lebesgue integrals are merely
6.1. MARKOV CHAINS AND PROCESSES 113
sums. For example, this is what we have for the simple random walk (with S the set of integers, and p(x + 1|x) = P(ξ_1 = 1) = 1 − p(x − 1|x) for any integer x), or the branching process (with S the set of non-negative integers). Things are even simpler when S is a finite set, in which case, identifying S with the set {1, . . . , m} for some m < ∞, we view p(y|x) as the (x, y)-th entry of an m × m dimensional transition probability matrix, and use matrix theory for evaluating all probabilities of interest.
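For instance (a hypothetical 3-state chain, not one from the text), the n-step probabilities P_x(X_n = y) are the entries of the n-th matrix power:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, n):
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = mat_mul(R, P)
    return R

# Hypothetical 3-state chain; row x holds the distribution p(. | x).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
P10 = mat_pow(P, 10)                 # P10[x][y] = P_x(X_10 = y)
row_sums = [sum(row) for row in P10]
```

Each row of P10 remains a probability vector, and for this chain the rows are already close to the stationary distribution (1/4, 1/2, 1/4).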
Exercise 6.1.5. Consider the probability space corresponding to a sequence of
independent rolls of a fair sixsided die. Determine which of the following S.P. is
then a homogeneous Markov chain and for each of these specify its state space and
stationary transition probabilities.
(a) X
n
: the largest number rolled in n rolls.
(b) N
n
: the number of 6s in n rolls.
(c) C
n
: at time n, the number of rolls since the most recent 6.
(d) B
n
: at time n, the number of rolls until the next 6.
Here is a glimpse of the rich theory of Markov chains of countable state space (which is beyond the scope of this course).

Exercise 6.1.6. Suppose X_n is a homogeneous Markov chain of countable state space S and for each x ∈ S let ρ_x = P_x(T_x < ∞) for the (first) return time T_x = inf{n ≥ 1 : X_n = x}. We call x ∈ S a recurrent state if ρ_x = 1. That is, starting at x the Markov chain returns to state x in a finite time w.p.1. (and we call x ∈ S a transient state if ρ_x < 1). Let N_x = Σ_{n=1}^∞ I_{{X_n = x}} count the number of visits to state x by the Markov chain (excluding time zero).
(a) Show that P_x(N_x = k) = ρ_x^k (1 − ρ_x) and hence E_x(N_x) = ρ_x/(1 − ρ_x).
(b) Deduce that x ∈ S is a recurrent state if and only if Σ_n P_x(X_n = x) = ∞.
Exercise 6.1.7. Let X_n be a symmetric simple random walk in Z^d. That is, X_n = (X^{(1)}_n, . . . , X^{(d)}_n) where X^{(i)}_n are, for i = 1, . . . , d, independent symmetric simple one-dimensional random walks. That is, from the current state x ∈ Z^d the walk moves to one of the 2^d possible neighboring states (x_1 ± 1, . . . , x_d ± 1) with probability 2^{−d}, independent of all previous moves. Show that for d = 1 or d = 2 the origin (0, . . . , 0) is a recurrent state of this Markov chain, while for any d ≥ 3 it is a transient state.
Hint: Observe that X_n can only return to the origin when n is even. Then combine Exercise 6.1.6 with the approximation C(2n, n) 2^{−2n} = (πn)^{−1/2}(1 + o(1)), where C(2n, n) denotes the binomial coefficient.
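The hint reduces recurrence to the divergence of Σ_n (C(2n, n)2^{−2n})^d, which is easy to examine numerically via the recursion r_n = r_{n−1}(2n − 1)/(2n) for r_n = C(2n, n)4^{−n}:

```python
def partial_sums(d, N):
    """Running values of sum_{n <= N} P_0(X_{2n} = 0) = sum (C(2n,n) 4^{-n})^d,
    using r_n = C(2n,n) 4^{-n} = r_{n-1} (2n - 1) / (2n) with r_0 = 1."""
    r, total, out = 1.0, 0.0, []
    for n in range(1, N + 1):
        r *= (2 * n - 1) / (2 * n)
        total += r ** d
        out.append(total)
    return out

s1 = partial_sums(1, 20000)   # d = 1: grows like 2 sqrt(N / pi)
s3 = partial_sums(3, 20000)   # d = 3: bounded, the series converges
```

The d = 1 sums keep growing without bound, while the d = 3 sums level off well below 1, in line with recurrence versus transience.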
A second setting in which computations are more explicit is when S = R (or a
closed interval thereof), and for each x the stationary transition probability p(·|x)
has a density p(y|x) ≥ 0 (such that ∫ p(y|x) dy = 1). Such a density p(y|x) is often
called the (stationary) transition probability kernel, as in this case for any bounded
Borel function h(·)
E_x(h(X_1)) = ∫ h(y) dp(y|x) = ∫ h(y) p(y|x) dy,
so probabilities are computed by iterated one-dimensional Riemann integrals involving
the integration kernel p(·|·).
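For instance, with the Gaussian transition density of the Brownian motion (see (6.1.6) below), the Chapman-Kolmogorov relationship amounts to a convolution of one-dimensional kernels, which can be verified by numerical integration. This is an illustrative sketch, not part of the notes; the grid limits and step count are arbitrary choices.

```python
import math

def heat_kernel(y: float, x: float, t: float) -> float:
    """Transition density of Brownian motion: the N(x, t) density evaluated at y."""
    return math.exp(-(y - x) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

def chapman_kolmogorov(y: float, x: float, t: float, s: float,
                       lo: float = -20.0, hi: float = 20.0, steps: int = 4000) -> float:
    """Trapezoid-rule evaluation of the density form of
    p_{t+s}(y|x) = integral of p_t(y|z) p_s(z|x) dz."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * heat_kernel(y, z, t) * heat_kernel(z, x, s)
    return total * h

print(chapman_kolmogorov(1.0, 0.0, 2.0, 3.0))  # should match heat_kernel(1.0, 0.0, 5.0)
```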
114 6. MARKOV, POISSON AND JUMP PROCESSES
Turning hereafter to the general setting, with G_n = σ(X_0, ..., X_n) denoting its
canonical filtration, it is not hard to check that any homogeneous Markov chain
X_n has the Markov property,
(6.1.1) P((X_ℓ, X_{ℓ+1}, ..., X_{ℓ+k}) ∈ B | G_ℓ) = P_{X_ℓ}((X_0, ..., X_k) ∈ B),
holding almost surely for any integer k ≥ 0, non-random ℓ ≥ 0 and B ∈ B^{k+1}.
We state next the strong Markov property, one of the most important features of
Markov chains.
Proposition 6.1.8 (Strong Markov Property). Let X_n be a homogeneous Markov
chain. Then, (6.1.1) holds for any almost surely finite stopping time τ with respect
to its canonical filtration G_n, with G_τ [...]

[...] p_{t+s}(A|x) = ∫ p_t(A|y) p_s(dy|x),  for all t, s ≥ 0.
Further, any such process satisfies the regular Markov property.
Proposition 6.1.15. Suppose X(t) is a homogeneous Markov process, with G_t
denoting its canonical filtration σ(X(s), s ≤ t) and P_x the law of the process starting
at X(0) = x. Then, any such process has the regular Markov property. That is,
(6.1.5) P_x(X(τ + ·) ∈ Γ | G_τ) = P_{X(τ)}(X(·) ∈ Γ),  a.s.
for any x ∈ R, non-random τ ≥ 0 and Γ in the cylindrical σ-field B_{[0,∞)} of
Definition 3.1.15.
As you show next, being a Markov process is a property that is invariant under
invertible non-random mappings of the state space as well as under invertible
monotone time mappings.
Exercise 6.1.16. Let X_t, t ≥ 0 be a Markov process of state space S. Suppose
h_t : S → S are measurable and invertible for any fixed t ≥ 0 and g : [0, ∞) → [0, ∞)
is invertible and strictly increasing.
(a) Verify that Y_t = h_t(X_{g(t)}) is a Markov process.
(b) Show that if X_t is a homogeneous Markov process then so is h(X_t).
A host of particularly simple Markov processes is provided by our next proposition.
Proposition 6.1.17. Every continuous time stochastic process of independent
increments is a Markov process. Further, every continuous time S.P. of stationary
independent increments is a homogeneous Markov process.
For example, Proposition 6.1.17 implies that the Brownian motion is a homogeneous
Markov process. It corresponds to initial distribution X(0) = 0 and having
the stationary regular transition probabilities
(6.1.6) p_t(A|x) = ∫_A (2πt)^{−1/2} e^{−(y−x)²/(2t)} dy
(note that we have here a stationary transition kernel (2πt)^{−1/2} e^{−(y−x)²/(2t)}).
Remark. To recap, we have seen three main ways of showing that a S.P. X_t, t ≥ 0
is a Markov process:
(a) Computing P(X_{t+h} ∈ A | G_t) directly and checking that it only depends
on X_t (and not on X_s for s < t), per Definition 6.1.10.
(b) Showing that the process has independent increments and applying Proposition
6.1.17.
(c) Showing that it is an invertible function of another Markov process, and
appealing to Exercise 6.1.16.
Our choice of name for p_t(·|·) is motivated in part by the following fact.
Proposition 6.1.18. If a Markov process or a Markov chain is also a stationary
process, then it is a homogeneous Markov process, or Markov chain, respectively.
Note however that many homogeneous Markov processes and Markov chains are
not stationary processes. Convince yourself that among such examples are the
Brownian motion (in continuous time) and the random walk (in discrete time).
Solve the next two exercises to practice your understanding of the definition of
Markov processes.
Exercise 6.1.19. Let B_t = W_t − min(t, 1) W_1, Y_t = e^{W_t} and U_t = e^{−t/2} W_{e^t}, where
W_t is a Brownian motion.
(a) Determine which of the S.P. W_t, B_t, U_t and Y_t is a Markov process for
t ≤ 1, and among those, which are also time homogeneous.
Hint: Consider part (a) of Exercise 5.1.9.
(b) Provide an example (among these S.P.) of a homogeneous Markov process
whose increments are neither independent nor stationary.
(c) Provide an example (among these S.P.) of a Markov process of stationary
increments, which is not a homogeneous Markov process.
Exercise 6.1.20. Explain why Z_t = W_t + rt, t ≥ 0, with W_t a Brownian motion
and r a non-random constant, is a homogeneous Markov process. Provide
the state space S, the initial distribution π(·) and the stationary regular transition
probabilities p_t(A|x) for this process.
The homogeneous Markov chains are fully characterized by the initial distributions
and the (one-step) transition probabilities p(·|·). In contrast, we need to specify the
transition probabilities p_t(·|·) for all t > 0 in order to determine all distributional
properties of the associated homogeneous Markov process. While we shall not do so,
in view of the Chapman-Kolmogorov relationship, using functional analysis one may
often express p_t(·|·) in terms of a single operator, which is called the generator of
the Markov process. For example, the generator of the Brownian motion is closely
related to the heat equation, hence the reason that many computations can then
be explicitly done via the theory of PDE.
Similar to the case of Markov chains, we seek to have the strong Markov property
for a homogeneous Markov process. Namely,
Definition 6.1.21. A homogeneous Markov process is called a strong Markov
process if (6.1.5) holds also for any almost surely finite stopping time τ with respect
to its canonical filtration G_t.
Recall that we have already seen in Proposition 5.2.3 that
Corollary 6.1.22. The Brownian motion is a strong Markov process.
Unfortunately, the theory of Markov processes is more involved than that of
Markov chains and in particular, not all homogeneous Markov processes have the
strong Markov property. Indeed, as we show next, even having also continuous
sample path does not imply the strong Markov property.
Example 6.1.23. With X_0 independent of the Brownian motion W_t, consider the
S.P. X_t = X_0 + W_t I_{X_0≠0} of continuous sample path. Noting that I_{X_0=0} = I_{X_t=0}
almost surely (as the difference occurs on the event {ω : W_t(ω) = −X_0(ω) ≠ 0}
which is of zero probability), by the independence of increments of W_t, hence of X_t
in case X_0 ≠ 0, we have that almost surely,
P(X_{t+u} ∈ A | σ(X_s, s ≤ t)) = I_{0∈A} I_{X_0=0} + P(W_{t+u} − W_t + X_t ∈ A | X_t) I_{X_0≠0}
= I_{0∈A} I_{X_t=0} + p_u(A|X_t) I_{X_t≠0},
for p_u(·|x) of (6.1.6). Consequently, X_t is a homogeneous Markov process (regardless
of the distribution of X_0), whose stationary regular transition probabilities are
given by p_t(·|x) of (6.1.6) for x ≠ 0 while p_t(A|0) = I_{0∈A}. By Proposition 6.1.15,
X_t satisfies the regular Markov property. However, this process does not satisfy the strong
Markov property. For example, (6.1.5) does not hold for the almost surely finite
stopping time τ = inf{t ≥ 0 : X_t = 0} and Γ = {x(·) : x(1) > 0} (in which case its
left side is (1/2) 1_{x≠0} whereas its right side is zero).
Our next proposition further helps in clarifying where the extra difficulty comes
from.
Proposition 6.1.24. The Markov property (6.1.5) holds for any stopping time τ
(with respect to the canonical filtration of the homogeneous Markov process X(t)),
provided τ assumes at most a countable number of non-random values (for a proof,
see [Bre92, Proposition 15.19]).
Indeed, any stopping time for a Markov chain assumes at most a countable number
of values 0, 1, 2, ..., hence the reason that every homogeneous Markov chain has
the strong Markov property.
In the following exercise you use the strong Markov property to compute the
probability that a Brownian motion that starts at x ∈ (c, d) reaches level d before
it reaches level c (i.e., the event {W_{τ_{a,b}} = b} for τ_{a,b} of Exercise 4.3.18 with b = d − x
and a = x − c).
Exercise 6.1.25. Consider the stopping time τ = inf{t ≥ 0 : X_t ≥ d or X_t ≤ c}
and the law P_x of the Markov process X_t = W_t + x, where W_t is a Brownian motion
and x ∈ (c, d) non-random.
(a) Using the strong Markov property of X_t show that u(x) = P_x(X_τ = d)
is a harmonic function, namely u(x) = (u(x + r) + u(x − r))/2 for any
c ≤ x − r < x < x + r ≤ d, with boundary conditions u(c) = 0 and
u(d) = 1.
(b) Check that v(x) = (x − c)/(d − c) is a harmonic function satisfying the
same boundary conditions as u(x). Since boundary conditions at x = c
and x = d uniquely determine the value of a harmonic function in (c, d)
(a fact you do not need to prove), you thus showed that P_x(X_τ = d) =
(x − c)/(d − c).
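The harmonic-function argument of Exercise 6.1.25 works verbatim for the symmetric simple random walk, for which P_x(hit d before c) = (x − c)/(d − c) holds exactly as well. A quick Monte Carlo sketch (illustrative only; the endpoints, start point and trial count are arbitrary choices):

```python
import random

def exits_at_top(x: int, c: int, d: int, rng: random.Random) -> bool:
    """Run a symmetric simple random walk from x until it first hits c or d."""
    while c < x < d:
        x += rng.choice((-1, 1))
    return x == d

def estimate(x: int, c: int, d: int, trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(exits_at_top(x, c, d, rng) for _ in range(trials)) / trials

print(estimate(3, 0, 10, 20000))  # compare with (3 - 0)/(10 - 0) = 0.3
```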
6.2. Poisson process, Exponential inter-arrivals and order statistics
Following [Bre92, Chapter 14.6], we are going to consider throughout continuous
time stochastic processes N_t, t ≥ 0 satisfying the following condition.
Condition C0. Each sample path N_t(ω) is piecewise constant, non-decreasing,
right continuous, with N_0(ω) = 0, all jump discontinuities are of size one, and there
are infinitely many of them.
Associated with each sample path N_t(ω) satisfying C0 are the jump times 0 =
T_0 < T_1 < ··· < T_n < ··· such that T_k = inf{t ≥ 0 : N_t ≥ k} for each k, or
equivalently
N_t = sup{k ≥ 0 : T_k ≤ t}.
In applications we find such N_t as counting the number of discrete events occurring
in the interval [0, t] for each t ≥ 0, with T_k denoting the arrival or occurrence time
of the k-th such event. For this reason processes like N_t are also called counting
processes.
Recall Example 1.1.4 that a random variable N has the Poisson(λ) law if
P(N = k) = (λ^k/k!) e^{−λ},  k ≥ 0, integer,
and that a stochastic process N_t has independent increments if the random variable
N_{t+h} − N_t is independent of σ(N_s : 0 ≤ s ≤ t) for any h > 0 and t ≥ 0.
We define next the Poisson process. To this end, we set the following condition:
Condition C1. For any k and any 0 < t_1 < ··· < t_k, the increments N_{t_1},
N_{t_2} − N_{t_1}, ..., N_{t_k} − N_{t_{k−1}} are independent random variables and for some λ > 0
and all t > s ≥ 0, the increment N_t − N_s has the Poisson(λ(t − s)) law.
Equipped with Conditions C0 and C1 we have
Definition 6.2.1. Among the processes satisfying C0 the Poisson process is the
unique S.P. also having the property C1.
Thus, the Poisson process has independent increments, each having a Poisson law,
where the parameter of the count N_t − N_s is proportional to the length of the
corresponding interval [s, t]. The constant of proportionality λ is called the rate or
intensity of the Poisson process.
In particular, it follows from our definition that M_t = N_t − λt is a MG (see
Proposition 4.2.3). As you show next, this provides an example of a Doob-Meyer
Figure 1. A sample path of a Poisson process, showing N(t) increasing from 0
through jumps of size one at the times T_1, T_2, T_3, T_4.
decomposition with a continuous increasing process for a martingale of discontinuous
sample path. It also demonstrates that continuity of sample path is essential
for the validity of both Lévy's martingale characterization and Proposition 5.2.2
about the time changed Brownian motion.
Exercise 6.2.2. Show that the martingale M_t = N_t − λt is square-integrable, and
such that M_t² − λt is a martingale for the (right-continuous) filtration σ(N_s, s ≤ t).
In a similar manner, the independence of increments of the Poisson process provides
us with the corresponding exponential martingales (as it did for the Brownian
motion). Here we get the martingales L_t = e^{θN_t − γt}, for some γ = γ(θ, λ) and all θ
(with γ(θ) computed by you in part (a) of Exercise 6.2.9).
Remark 6.2.3. It is possible to define in a similar manner the counting process for
discrete events on R^d, any integer d ≥ 2. This is done by assigning random integer
counts N_A to Borel subsets A of R^d in an additive manner (that is N_{A∪B} = N_A + N_B
whenever A and B are disjoint). Such processes are called point processes, so the
Poisson process is perhaps the simplest example of a point process.
Just as the Brownian motion has many nice properties, so does the Poisson process,
starting with the following characterization of this process (for its proof, see [Bre92,
Theorem 14.23]).
Proposition 6.2.4. The Poisson processes are the only stochastic processes with
stationary independent increments (see Definition 3.2.27) that satisfy C0.
Next, note that the Poisson process is a homogeneous Markov process, whose state
space S is {0, 1, 2, ...} (see Proposition 6.1.17), with the initial distribution N_0 = 0
and the stationary regular transition probabilities
p_t(x + k | x) = ((λt)^k/k!) e^{−λt},  k, x ≥ 0, integers.
Furthermore, we have, similarly to Corollary 6.1.22, that
Proposition 6.2.5. The Poisson process is a strong Markov process.
Remark. Note that the Poisson process is yet another example of a process with
stationary independent increments that is clearly not a stationary process.
Another way to characterize the Poisson process is by the joint distribution of the
jump (arrival) times T_k. To this end, recall that
Proposition 6.2.6 (Memoryless property of the Exponential law). We say that
a random variable T has Exponential(λ) law if P(T > t) = e^{−λt}, for all t ≥ 0 and
some λ > 0. Except for the trivial case of T = 0 w.p.1. these are the only laws for
which P(T > x + y | T > y) = P(T > x), for all x, y ≥ 0.
We check only the easy part of the proposition, that is, taking T of Exponential(λ)
law we have for any x, y ≥ 0,
P(T > x + y | T > y) = P(T > x + y)/P(T > y) = e^{−λ(x+y)}/e^{−λy} = e^{−λx} = P(T > x).
Exercise 6.2.7. For T of Exponential(λ) law use integration by parts to show that
the S.P. X_t = I_{[T,∞)}(t) − λ min(t, T) is of zero mean. Then use the memoryless
property of T to deduce that X_t is a martingale.
In view of Proposition 6.2.6, N_t having independent increments is related to the
following condition of having Exponentially distributed inter-arrival times.
Condition C2. The gaps between jump times T_k − T_{k−1} for k = 1, 2, ... are
i.i.d. random variables, each of Exponential(λ) law.
Indeed, an equivalent definition of the Poisson process is:
Proposition 6.2.8. A stochastic process N_t that satisfies C0 is a Poisson process
of rate λ if and only if it satisfies C2 (c.f. [Bre92, page 309]).
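Condition C2 is also the standard recipe for simulating a Poisson process: accumulate i.i.d. Exponential(λ) gaps. A minimal sketch (illustrative; the rate, horizon and seed are arbitrary choices):

```python
import random

def poisson_jump_times(rate: float, horizon: float, rng: random.Random) -> list:
    """Jump times T_1 < T_2 < ... in [0, horizon], built from Exponential(rate) gaps (C2)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)   # gap T_k - T_{k-1} ~ Exponential(rate)
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(1)
jumps = poisson_jump_times(2.0, 10.0, rng)
print(len(jumps))  # N_10, whose law is Poisson(2 * 10), so E(N_10) = 20
```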
Proposition 6.2.8 implies that for each k, the k-th arrival time T_k of the Poisson
process of rate λ has the Gamma(k, λ) law corresponding to the sum of k
i.i.d. Exponential(λ) random variables. In the next exercise we arrive at the same
conclusion by an application of Doob's optional stopping theorem.
Exercise 6.2.9. Let N_t, t ≥ 0 be a Poisson process of rate λ.
(a) Show that L_t = exp(θN_t − γt) is a martingale for the canonical filtration
G_t = σ(N_s, 0 ≤ s ≤ t), whenever γ = λ(e^θ − 1).
(b) Check that {T_k ≤ t} = {N_t ≥ k} ∈ G_t and deduce that T_r = min(t ≥ 0 :
N_t = r) is a stopping time with respect to G_t, for each positive integer r.
(c) Using Doob's optional stopping theorem for the martingale (L_t, G_t) (see
Theorem 4.3.17), compute the value of E(e^{−αT_r}) for α > 0.
(d) Check that the preceding evaluation of E(e^{−αT_r}) equals what you get by
assuming that T_r has the Gamma distribution of parameters r and λ.
That is, by taking T_r with the probability density function
f_{T_r}(t) = λe^{−λt} (λt)^{r−1}/(r − 1)!,  t ≥ 0.
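The optional-stopping computation of part (c) gives E(e^{−αT_r}) = (λ/(λ + α))^r, which is also the Laplace transform of the Gamma(r, λ) density of part (d). A Monte Carlo sketch of this identity (illustrative; the parameter values are arbitrary), sampling T_r as a sum of r exponential gaps:

```python
import math
import random

def laplace_Tr(lam: float, alpha: float, r: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E(exp(-alpha * T_r)) with T_r ~ Gamma(r, lam)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        t_r = sum(rng.expovariate(lam) for _ in range(r))  # r i.i.d. Exponential(lam) gaps
        total += math.exp(-alpha * t_r)
    return total / trials

lam, alpha, r = 1.5, 0.7, 3
print(laplace_Tr(lam, alpha, r, 100000), (lam / (lam + alpha)) ** r)
```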
Remark. Obviously the sample paths of the Poisson process are never continuous.
However, P(N_{t+h} − N_t ≥ 1) = 1 − e^{−λh} → 0 as h ↓ 0, implying that P(T_k = t) = 0
for all t ≥ 0 and k = 1, 2, ..., so this S.P. has no fixed discontinuities (i.e. occurring
at non-random times). The Poisson process is a special case of the family of Markov
jump processes of Section 6.3, all of whom have discontinuous sample path but no
fixed discontinuities.
Use the next exercise to check your understanding of the various characterizations
and properties of the Poisson process.
Exercise 6.2.10. Let N_t be a Poisson process with rate λ > 0 and τ a finite
stopping time for its canonical filtration. State which of the four stochastic processes
N_t^{(1)} = 2N_t, N_t^{(2)} = N_{2t}, N_t^{(3)} = N_{t²} and N_t^{(4)} = N_{τ+t} − N_τ is a Poisson process
and if so, identify its rate.
The Poisson process is related not only to the Exponential distribution but also
to the Uniform measure, as we state next (and for a proof c.f. [KT75, Theorem
4.2.3]).
Proposition 6.2.11. Fixing positive t and a positive integer n let U_i be i.i.d.
random variables, each uniformly distributed in [0, t], and consider their order statistics
U_{(i)} for i = 1, ..., n. That is, permute the order of U_i for i = 1, ..., n
such that U_{(1)} ≤ U_{(2)} ≤ ··· ≤ U_{(n)} while {U_{(1)}, ..., U_{(n)}} = {U_1, ..., U_n} (for example
U_{(1)} = min(U_i : i = 1, ..., n) and U_{(n)} = max(U_i : i = 1, ..., n)). The joint distribution
of (U_{(1)}, ..., U_{(n)}) is then precisely that of the first n arrival times (T_1, ..., T_n)
of a Poisson process, conditional on the event {N_t = n}. Alternatively, fixing n and
0 ≤ t_1 ≤ t_2 ≤ ··· ≤ t_n ≤ t, we have that
P(T_k ≤ t_k, k = 1, ..., n | N_t = n) = (n!/t^n) ∫_0^{t_1} ∫_{x_1}^{t_2} ··· ∫_{x_{n−1}}^{t_n} dx_1 dx_2 ··· dx_n.
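Proposition 6.2.11 predicts, for instance, that conditional on {N_t = n} the first arrival T_1 is distributed as the minimum of n uniforms on [0, t], whose mean is t/(n + 1). A rejection-sampling sketch (illustrative only; the rate, horizon, n and run count are arbitrary choices):

```python
import random

def mean_first_arrival(lam: float, t: float, n: int, runs: int, seed: int = 0) -> float:
    """Average T_1 over simulated Poisson paths accepted on the event {N_t = n}."""
    rng = random.Random(seed)
    total, kept = 0.0, 0
    for _ in range(runs):
        times, s = [], rng.expovariate(lam)
        while s <= t:                      # build jump times from exponential gaps
            times.append(s)
            s += rng.expovariate(lam)
        if len(times) == n:                # condition on exactly n jumps by time t
            total += times[0]
            kept += 1
    return total / kept

print(mean_first_arrival(1.0, 5.0, 5, 20000))  # compare with 5/(5 + 1) = 0.8333...
```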
Here are a few applications of Proposition 6.2.11.
Exercise 6.2.12. Let T_k = ∑_{i=1}^k τ_i for independent Exponential(λ) random
variables τ_i, i = 1, 2, ..., and N_t = sup{k ≥ 0 : T_k ≤ t} the corresponding Poisson
process.
(a) Express v = E(∑_{k=1}^{N_t} (t − T_k)) in terms of g(n) = E(∑_{k=1}^n T_k | N_t = n)
and the law of N_t.
(b) Compute the values of g(n) = E(∑_{k=1}^n T_k | N_t = n).
(c) Compute the value of v.
(d) Suppose that T_k is the arrival time to the train station of the k-th passenger
on a train that departs the station at time t. What is the meaning
of N_t and of v in this case?
Exercise 6.2.13. Let N_t be a Poisson process of rate λ > 0.
(a) Fixing 0 < s ≤ t show that conditional on {N_t = n}, the R.V. N_s has the
Binomial(n, p) law for p = s/t. That is,
P(N_s = k | N_t = n) = \binom{n}{k} (s/t)^k (1 − s/t)^{n−k},  k = 0, 1, ..., n.
(b) What is the probability that the first jump of this process occurred before
time s ∈ [0, t] given that precisely n jumps have occurred by time t?
Exercise 6.2.14. Let N_t be a Poisson process of rate λ > 0. Compute v(n) =
E[N_s | N_t = n] and E[N_s | N_t], first for s > t, then for 0 ≤ s ≤ t.
Exercise 6.2.15. Commuters arrive at a bus stop according to a Poisson process
with rate λ > 0. Suppose that the bus is large enough so that anyone who is waiting
for the bus when it arrives is able to board the bus.
(a) Suppose buses arrive every T units of time (for non-random T > 0).
Immediately after a bus arrives and all the waiting commuters board it,
you uniformly and independently select a commuter that just boarded the
bus. Find the expected amount of time that this commuter waited for the
bus to arrive.
(b) Now suppose buses arrive according to a Poisson process with rate 1/T.
Assume that arrivals of commuters and buses are independent. What is
the expected waiting time of a randomly selected commuter on the bus
in this case? Is it different than the expected waiting time of the first
commuter to arrive at the bus stop?
We have the following interesting approximation property of the Poisson distribution
(see also [GS01, Section 4.12]).
Theorem 6.2.16 (Poisson approximation). Suppose that for each n, the random
variables Z_l^{(n)} are independent, non-negative integers where P(Z_l^{(n)} = 1) = p_l^{(n)} and
P(Z_l^{(n)} ≥ 2) = ε_l^{(n)} are such that as n → ∞,
(i) ∑_{l=1}^n p_l^{(n)} → λ ∈ (0, ∞),
(ii) ∑_{l=1}^n ε_l^{(n)} → 0,
(iii) max_{l=1,...,n} p_l^{(n)} → 0.
Then, S_n = ∑_{l=1}^n Z_l^{(n)} converges in distribution to Poisson(λ) when n → ∞.
For example, consider Z_l^{(n)} ∈ {0, 1} with P(Z_l^{(n)} = 1) = λ/n, resulting with S_n
having the Binomial(n, λ/n) law. In this case, ε_l^{(n)} = 0 and p_l^{(n)} = λ/n, with
∑_{l=1}^n p_l^{(n)} = λ.
Hence, applying Theorem 6.2.16 we have that the Binomial(n, λ/n) probability
measures converge weakly as n → ∞ to the Poisson(λ) probability measure. This
is the classical Poisson approximation of the Binomial, often derived in elementary
probability courses.
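The quality of this classical approximation is easy to quantify numerically: the total variation distance between Binomial(n, λ/n) and Poisson(λ) shrinks as n grows (a known bound is λ²/n). A small sketch (illustrative; it sums the pointwise differences over k = 0, ..., n, which carries essentially all the mass):

```python
from math import exp

def tv_binom_poisson(n: int, lam: float) -> float:
    """Approximate total variation distance between Binomial(n, lam/n) and Poisson(lam),
    using pmf recursions to avoid huge factorials."""
    p = lam / n
    binom_pk = (1 - p) ** n          # k = 0 term of Binomial(n, p)
    poisson_pk = exp(-lam)           # k = 0 term of Poisson(lam)
    total = abs(binom_pk - poisson_pk)
    for k in range(1, n + 1):
        binom_pk *= (n - k + 1) / k * p / (1 - p)   # Binomial pmf recursion
        poisson_pk *= lam / k                        # Poisson pmf recursion
        total += abs(binom_pk - poisson_pk)
    return total / 2

for n in (10, 100, 1000):
    print(n, tv_binom_poisson(n, 2.0))
```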
The Poisson approximation theorem relates the Poisson process to the following
condition.
Condition C3. The S.P. N_t has no fixed discontinuities, that is P(T_k = t) = 0
for all k and t ≥ 0. Also, for any fixed k, 0 < t_1 < t_2 < ··· < t_k and non-negative
integers n_1, n_2, ..., n_k,
P(N_{t_k+h} − N_{t_k} = 1 | N_{t_j} = n_j, j ≤ k) = λh + o(h),
P(N_{t_k+h} − N_{t_k} ≥ 2 | N_{t_j} = n_j, j ≤ k) = o(h),
where o(h) denotes a function f(h) such that h^{−1} f(h) → 0 as h ↓ 0.
Indeed, as we show next, Theorem 6.2.16 plays for the Poisson process the same
role that the Central Limit Theorem plays for the Brownian motion, that is, providing
a characterization of the Poisson process that is very attractive for the purpose
of modeling real-world phenomena.
Proposition 6.2.17. A stochastic process N_t that satisfies C0 is a Poisson process
of rate λ if and only if it satisfies condition C3.
Proof. (omit at first reading) Fixing k, the t_j and the n_j, denote by A the
event {N_{t_j} = n_j, j ≤ k}. For a Poisson process of rate λ the random variable
N_{t_k+h} − N_{t_k} is independent of A, with P(N_{t_k+h} − N_{t_k} = 1) = e^{−λh}λh and
P(N_{t_k+h} − N_{t_k} ≥ 2) = 1 − e^{−λh}(1 + λh). Since e^{−λh} = 1 − λh + o(h) we see that the
Poisson process satisfies C3. To prove the converse, we start with a S.P. N_t
that satisfies both C0 and C3. Fixing A as above, let D_t = N_{t_k+t} − N_{t_k} and
p_n(t) = P(D_t = n | A). To show that N_t satisfies C1, hence is a Poisson process,
it suffices to show that p_n(t) = e^{−λt}(λt)^n/n! for any t ≥ 0 and non-negative integer
n (the independence of increments then follows by induction on k). It is trivial to
check the case of t = 0, that is, p_n(0) = 1_{n=0}. Fixing u > 0 and s ≥ 0 we have
that |p_n(u) − p_n(s)| ≤ P(D_u ≠ D_s | A) ≤ P(N_{t_k+u} ≠ N_{t_k+s})/P(A). By C3 we
know that N_t is continuous in probability, so taking u → s yields that s ↦ p_n(s)
is continuous on [0, ∞). If D_{t+h} = n then necessarily D_t = m for some m ≤ n.
Fixing n and t > 0, by the rules of conditional probability
p_n(t + h) = ∑_{m=0}^n p_m(t) P(D_{t+h} − D_t = n − m | A_m),
where A_m = A ∩ {D_t = m}. Applying C3 for each value of m separately (with
t_{k+1} = t_k + t and n_{k+1} = n_k + m there), we thus get that p_n(t + h) = p_n(t)(1 −
λh) + p_{n−1}(t)λh + o(h). Taking h ↓ 0 gives the system of differential equations
p_n′(t) = −λ p_n(t) + λ p_{n−1}(t),
with boundary conditions p_{−1}(t) = 0 for all t and p_n(0) = 1_{n=0} (a-priori, t > 0
and p_n′(t) stands for the right-hand derivative, but since t ↦ p_n(t) is continuous,
the differential equations apply also at t = 0 and p_n′(t) can be taken as a two-sided
derivative when t > 0). It is easy to check that p_n(t) = e^{−λt}(λt)^n/n! satisfies
these equations. It is well known that the solution of such a differential system of
equations is unique, so we are done.
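The differential system from the proof can also be integrated numerically. An explicit Euler sketch (illustrative only; the rate, time and step count are arbitrary choices) reproduces the Poisson probabilities p_n(t) = e^{−λt}(λt)^n/n! up to discretization error:

```python
import math

def solve_pn(lam: float, t: float, nmax: int, steps: int = 20000) -> list:
    """Euler scheme for p_n'(t) = -lam*p_n(t) + lam*p_{n-1}(t), with p_n(0) = 1_{n=0}."""
    dt = t / steps
    p = [1.0] + [0.0] * nmax
    for _ in range(steps):
        prev = p[:]
        for n in range(nmax + 1):
            inflow = prev[n - 1] if n > 0 else 0.0   # boundary condition p_{-1}(t) = 0
            p[n] = prev[n] + dt * lam * (inflow - prev[n])
    return p

lam, t = 1.5, 2.0
numeric = solve_pn(lam, t, 6)
exact = [math.exp(-lam * t) * (lam * t) ** n / math.factorial(n) for n in range(7)]
print(max(abs(a - b) for a, b in zip(numeric, exact)))  # small discretization error
```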
The collection of all Poisson processes is closed with respect to the merging and
thinning of their streams.
Proposition 6.2.18. If N_t^{(1)} and N_t^{(2)} are two independent Poisson processes of
rates λ_1 and λ_2 respectively, then N_t^{(1)} + N_t^{(2)} is a Poisson process of rate λ_1 + λ_2.
Conversely, the subsequence of jump times obtained by independently keeping with
probability p each of the jump times of a Poisson process of rate λ corresponds to a
Poisson process of rate pλ.
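Both halves of Proposition 6.2.18 are visible in simulation: superposing independent streams adds the rates, and keeping each jump with probability p multiplies the rate by p. A sketch (illustrative; the rates, horizon, p and seed are arbitrary choices):

```python
import random

def poisson_times(rate: float, horizon: float, rng: random.Random) -> list:
    """Jump times on [0, horizon] from i.i.d. Exponential(rate) gaps."""
    out, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return out
        out.append(t)

def merged_and_thinned_counts(seed: int) -> tuple:
    rng = random.Random(seed)
    merged = sorted(poisson_times(1.0, 50.0, rng) + poisson_times(2.0, 50.0, rng))
    kept = [s for s in merged if rng.random() < 0.25]   # thinning with p = 1/4
    return len(merged), len(kept)

print(merged_and_thinned_counts(7))  # around ((1+2)*50, 0.25*(1+2)*50) = (150, 37.5)
```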
We conclude with the law of large numbers for the Poisson process.
Exercise 6.2.19. Redo Exercise 5.1.7 with W_t replaced by the MG λ^{−1}(N_t − λt)
for a Poisson process N_t of rate λ > 0, to arrive at the law of large numbers for
the Poisson process. That is, t^{−1} N_t → λ a.s. as t → ∞.
Remark. An inhomogeneous Poisson process X_t with rate function λ(t) ≥ 0 for
t ≥ 0 is a counting process of independent increments, such that for all t > s ≥ 0,
the increment X_t − X_s has the Poisson(Λ(t) − Λ(s)) law with Λ(t) = ∫_0^t λ(s) ds. This
is merely a non-random time change X_t = N_{Λ(t)} of a Poisson process N_t of rate one
(see also Proposition 5.2.2 for the time change of a Brownian motion). Condition
C3 is then inherited by X_t (upon replacing λ by λ(t_k)), but condition C2
does not hold, as the gaps between jump times S_k of X_t are neither i.i.d. nor of
Exponential law. Nevertheless, we have much information about these jump times
as T_r = Λ(S_r) for the jump times T_r of the homogeneous Poisson process of rate
one (for example, P(S_1 ≤ s | X_t = 1) = Λ(s)/Λ(t) for all 0 ≤ s ≤ t).
6.3. Markov jump processes, compound Poisson processes
We begin with the definition of the family of Markov jump processes that are the
natural extension of the Poisson process. Whereas the Brownian motion and the
related diffusion processes are commonly used to model processes with continuous
sample path, Markov jump processes are the most common object in modeling
situations with inherent jumps.
Definition 6.3.1. Let T_i denote the jump times of a Poisson process Y(t) of
rate λ > 0. We say that a stochastic process X(t) is a Markov jump process
if its sample path are constant apart from jumps of size ΔX_i at times T_i, and the
sequence Z_n = ∑_{j=1}^n ΔX_j is a Markov chain which is independent of {T_i}. That
is, X(t) = Z_{Y(t)}. The Markov jump processes with i.i.d. jump sizes ΔX_i are called
compound Poisson processes.
In particular, taking constant ΔX_i = 1 in Definition 6.3.1 leads to the Poisson
process, while Z_{n+1} = −Z_n ∈ {−1, 1} gives the random telegraph signal of Example
3.3.6. The latter is also an example of a Markov jump process which is not a
compound Poisson process.
Clearly, the sample paths of the Markov jump process X(t) inherit the RCLL
property of the Poisson process Y(t). We further show next that compound Poisson
processes inherit many other properties of the underlying Poisson process.
Proposition 6.3.2. Any compound Poisson process X(t) (that is, a Markov jump
process of i.i.d. jump sizes ΔX_i), has stationary, independent increments and the
characteristic function E(e^{iuX(t)}) = exp{λt ∫ (e^{iux} − 1) dF_X(x)} (where F_X(x) denotes
the distribution function of the jump size ΔX_1).
Proof outline. To prove the proposition one has to check that for any h > 0
and t ≥ 0, the random variable X(t + h) − X(t) is independent of σ(X(s), 0 ≤ s ≤ t),
and its law does not depend on t, neither of which we do here (c.f. [Bre92, Section
14.7] where both steps are detailed).
Thus, we only prove here the stated formula for the characteristic function of X(t).
To this end, we first condition on the event {Y(t) = n}. Then, X(t) = ∑_{i=1}^n ΔX_i,
implying by the independence of the ΔX_j that
E[e^{iuX(t)} | Y(t) = n] = E[e^{iu ∑_{j=1}^n ΔX_j}] = h(u)^n,
where h(u) = ∫_R e^{iux} dF_X(x) is the characteristic function of the jump size ΔX_1. Since
Y(t) is a Poisson(λt) random variable, we thus have by the tower property of the
expectation that
E[e^{iuX(t)}] = E[E[e^{iuX(t)} | Y(t)]] = E[h(u)^{Y(t)}]
= ∑_{n=0}^∞ h(u)^n P(Y(t) = n) = ∑_{n=0}^∞ (h(u))^n (λt)^n/n! · e^{−λt} = e^{λt(h(u)−1)},
as claimed.
Remark. Similarly to Proposition 6.3.2 it is not hard to check that if X(t) is a
compound Poisson process for i.i.d. ΔX_i that are square integrable, then E(X(t)) =
λt E(ΔX_1) and Var(X(t)) = λt E(ΔX_1²).
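These moment formulas are easy to confirm by simulation. In the sketch below the jump law (uniform on {1, 2, 3}) is an arbitrary choice for illustration, giving E(ΔX_1) = 2 and E(ΔX_1²) = 14/3:

```python
import random

def compound_poisson(lam: float, t: float, jump, rng: random.Random) -> float:
    """One sample of X(t): accumulate i.i.d. jumps at Exponential(lam) gap times."""
    x, s = 0.0, rng.expovariate(lam)
    while s <= t:
        x += jump(rng)
        s += rng.expovariate(lam)
    return x

rng = random.Random(5)
lam, t = 2.0, 3.0
xs = [compound_poisson(lam, t, lambda r: r.choice((1, 2, 3)), rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # targets: lam*t*2 = 12 and lam*t*(14/3) = 28
```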
Consequences of Proposition 6.3.2:
One of the implications of Proposition 6.3.2 is that any compound Poisson process
is a homogeneous Markov process (see Proposition 6.1.17).
Another implication is that if in addition E(ΔX_1) = 0, then X(t) is a martingale (see
Proposition 4.2.3). Like in the case of a Poisson process, many other martingales
can also be derived out of X(t).
Fixing any disjoint finite partition of R \ {0} to Borel sets B_k, k = 1, ..., m we
have the decomposition of any Markov jump process X(t) = ∑_{k=1}^m X(B_k, t), in terms
of the contributions
X(B_k, t) = ∑_{τ≤t} (X(τ) − X(τ−)) 1_{X(τ)−X(τ−) ∈ B_k}
to X(t) by jumps whose size belong to B_k. What is so interesting about this
decomposition is the fact that for any compound Poisson process, these components
are in turn independent compound Poisson processes, as we state next.
Proposition 6.3.3. If the Markov jump process X(t) is a compound Poisson
process, then the S.P. X(B_k, t) for k = 1, ..., m are independent compound Poisson
processes, with ∫_{B_k} (e^{iux} − 1) dF_X(x) replacing ∫_R (e^{iux} − 1) dF_X(x) in the formula
for the characteristic function of X(B_k, t) (c.f. [Bre92, Proposition 14.25]).
Here is a concrete example that gives rise to a compound Poisson process.
Exercise 6.3.4. A basketball team scores baskets according to a Poisson process
with rate 2 baskets per minute. Each basket is worth either 1, 2, or 3 points; the
team attempts shots according to the following percentages: 20% for 1 point, 50%
for 2 points, and 30% for 3 points.
(a) What is the expected amount of time until the team scores its first basket?
(b) Given that at the five minute mark of the game the team has scored exactly
one basket, what is the probability that the team scored the basket in the
first minute?
(c) What is the probability that the team scores exactly three baskets in the
first five minutes of the game?
(d) What is the team's expected score at the five minute mark of the game?
(e) Let Z(t) count the number of 2-point baskets by the team up to time t.
What type of S.P. is Z(t)?
Bibliography
[AS00] Noga Alon and Joel H. Spencer, The probabilistic method, 2nd ed., Wiley-Interscience, 2000.
[Bil95] Patrick Billingsley, Probability and measure, 3rd ed., Wiley-Interscience, 1995.
[Bre92] Leo Breiman, Probability, Classics in Applied Mathematics, Society for Industrial and Applied Mathematics, 1992.
[GS01] Geoffrey Grimmett and David Stirzaker, Probability and random processes, 3rd ed., Oxford University Press, 2001.
[KS97] Ioannis Karatzas and Steven E. Shreve, Brownian motion and stochastic calculus, 2nd ed., Graduate Texts in Mathematics, vol. 113, Springer Verlag, 1997.
[KT75] Samuel Karlin and Howard M. Taylor, A first course in stochastic processes, 2nd ed., Academic Press, 1975.
[Mik98] Thomas Mikosch, Elementary stochastic calculus, World Scientific, 1998.
[Oks03] Bernt Oksendal, Stochastic differential equations: An introduction with applications, 6th ed., Universitext, Springer Verlag, 2003.
[Ros95] Sheldon M. Ross, Stochastic processes, 2nd ed., John Wiley and Sons, 1995.
[Wil91] David Williams, Probability with martingales, Cambridge University Press, 1991.
[Zak] Moshe Zakai, Stochastic processes in communication and control theory.
Index
L
q
spaces, 23
eld, 7, 67
eld, Borel, 9, 111
eld, cylindrical, 54, 116
eld, generated, 9, 12
eld, stopped, 81, 101, 114
eld, trivial, 8, 44
adapted, 67, 73, 99
almost everywhere, 12
almost surely, 12
autocovariance function, 59, 98, 99
Borel function, 13
Borel set, 10
BorelCantelli I, 21, 106
BorelCantelli II, 21
bounded convergence, 31
bounded linear functional, 41
branching process, 90, 112
Brownian bridge, 99
Brownian motion, 51, 84, 95, 101, 117
Brownian motion, fractional, 100
Brownian motion, geometric, 52, 74, 81, 99
Brownian motion, local maxima, 108
Brownian motion, modulus of continuity,
108
Brownian motion, planar, 103
Brownian motion, quadratic variation, 105
Brownian motion, time change, 101, 120
Brownian motion, total variation, 107
Brownian motion, zero set, 108
Cauchy sequence, 40, 97
central limit theorem, 27
change of measure, 31
ChapmanKolmogorov equations, 115
characteristic function, 55, 125
conditional expectation, 35, 37, 38, 47
continuous, H older, 62
continuous, Lipschitz, 62, 105
convergence almost surely, 19
convergence in qmean, 23
convergence in law, 27
convergence in probability, 20
countable additivity, 8
distribution, 49, 100
distribution function, 25
distribution, Bernoulli, 56
distribution, exponential, 64, 121
distribution, gamma, 121
distribution, Poisson, 56, 119
dominated convergence, 31, 45
Doobs convergence theorem, 88
Doobs decomposition, 82
Doobs inequality, 85
Doobs martingale, 88
Doobs optional stopping, 77, 80, 103, 121
DoobMeyer decomposition, 84
event space, 7
expectation, 15
exponential distribution, 22, 28
extinction probability, 91
ltration, 67, 99
ltration, canonical, 68, 73
ltration, continuous time, 73, 79
ltration, rightcontinuous, 75
nite dimensional distributions, 52, 54, 115
Fourier series, 42
Fubinis theorem, 65
GaltonWatson trees, 90
Gaussian distribution, 57, 98
Gaussian distribution, nondegenerate, 17,
57
Gaussian distribution, parameters, 58
harmonic function, 119
Hilbert space, 40, 96
Hilbert space, separable, 42
Hilbert subspace, 41
hitting time, rst, 7779, 101
hitting time, last, 77
increasing part, 84
increasing process, 84, 105
129
130 INDEX
independence, 32, 46, 57
independent increments, 52, 60, 73, 95, 119
independent increments, stationary, 116, 120, 125
indicator function, 11
inner product, 40
innovation process, 83
Jensen's inequality, 18, 45, 72
Kolmogorov's continuity theorem, 63, 98
Lévy's martingale characterization, 101, 120
law, 25, 55
law of large numbers, 20, 99, 109, 124
law of the iterated logarithm, 109
Lebesgue integral, 16, 47, 112
linear functional, 41
linear vector space, 39
lognormal distribution, 17
Markov chain, 111
Markov chain, homogeneous, 111
Markov jump process, 64, 125
Markov process, 114
Markov process, homogeneous, 116, 120, 126
Markov property, 114, 116
Markov property, strong, 114, 118
Markov's inequality, 18, 106
Markov, initial distribution, 112, 115
martingale, 68, 126
martingale difference, 68
martingale transform, 70
martingale, continuous time, 73
martingale, exponential, 87
martingale, Gaussian, 71
martingale, interpolated, 74, 76
martingale, square-integrable, 70, 84, 89, 105
martingale, sub, 71, 99
martingale, sub, last element, 85
martingale, sub, right continuous, 85
martingale, super, 72, 83
martingale, uniformly integrable, 88
maximal inequalities, 85
measurable function, 11
measurable space, 8
memoryless property, 121
modification, 53, 111
modification, continuous, 62
modification, RCLL, 64, 76
monotone convergence, 31, 45
monotone function, 107
nonnegative definite matrix, 57
order statistics, 122
Ornstein-Uhlenbeck process, 99
orthogonal projection, 41
orthogonal sequence, 71
orthogonal, difference, 71
orthonormal basis, complete, 42, 96
parallelogram law, 40
Parseval, 42
Poisson approximation, 123
Poisson process, 119
Poisson process, compound, 125
Poisson process, inhomogeneous, 124
Poisson process, time change, 125
predictable, 70
previsible, 70, 82
probability density function, 16, 26, 47, 56, 100, 102
probability density function, Gaussian, 57
probability measure, 8
probability measures, equivalent, 32
probability space, 8
probability space, complete, 20
random telegraph signal, 64
random variable, 11
random variable, integrable, 17
random variable, square-integrable, 23, 33
random vector, 55, 57
random walk, 50, 62, 69, 83, 103, 112
random walk, simple, asymmetric, 78
random walk, simple, symmetric, 78
recurrent state, 113
reflection principle, 102
regeneration point, 101
regular conditional probability distribution, 47
regular conditional probability, 46
Riesz representation, 42
sample path, 49, 79
sample path, continuous, 53, 62, 79, 95, 99
sample path, RCLL, 64, 125
sample space, 7
Schwarz's inequality, 19, 40
simple function, 11
state space, 111, 114
stationary, increments, 62
stochastic integral, 75, 106
stochastic process, 49
stochastic process, autoregressive, 114
stochastic process, continuous time, 49
stochastic process, counting process, 119
stochastic process, discrete time, 49
stochastic process, Gaussian, 59, 95
stochastic process, point process, 120
stochastic process, quadratic variation of, 104
stochastic process, right continuous, 85
stochastic process, stationary, 61
stochastic process, stopped, 77, 80, 103
stopping time, 76, 79, 101, 114
take out what is known, 45
tower property, 44
transient state, 113
transition probabilities, 115
transition probabilities, regular, 116
transition probabilities, stationary, 111, 116
triangle inequality, 24, 40
uncorrelated, 33, 46
uniform measure, 10, 122
uniformly integrable, 30, 77, 81, 88
variation, 104
variation, cross, 85
variation, quadratic, 104
variation, total, 104, 107
version, 53
weak convergence, 29
with probability one, 12