# AN INTRODUCTION TO STOCHASTIC

DIFFERENTIAL EQUATIONS
VERSION 1.2
Lawrence C. Evans
Department of Mathematics
UC Berkeley
Chapter 1: Introduction
Chapter 2: A crash course in basic probability theory
Chapter 3: Brownian motion and “white noise”
Chapter 4: Stochastic integrals, Itˆ o’s formula
Chapter 5: Stochastic diﬀerential equations
Chapter 6: Applications
Exercises
Appendices
References
1
PREFACE
These are an evolving set of notes for Mathematics 195 at UC Berkeley. This course
is for advanced undergraduate math majors and surveys without too many precise details
random diﬀerential equations and some applications.
Stochastic diﬀerential equations is usually, and justly, regarded as a graduate level
subject. A really careful treatment assumes the students’ familiarity with probability
theory, measure theory, ordinary diﬀerential equations, and perhaps partial diﬀerential
equations as well. This is all too much to expect of undergrads.
But white noise, Brownian motion and the random calculus are wonderful topics, too
good for undergraduates to miss out on.
Therefore as an experiment I tried to design these lectures so that strong students
could follow most of the theory, at the cost of some omission of detail and precision. I for
instance downplayed most measure theoretic issues, but did emphasize the intuitive idea of
σ–algebras as “containing information”. Similarly, I “prove” many formulas by conﬁrming
them in easy cases (for simple random variables or for step functions), and then just stating
that by approximation these rules hold in general. I also did not reproduce in class some
of the more complicated proofs provided in these notes, although I did try to explain the
guiding ideas.
My thanks especially to Lisa Goldberg, who several years ago presented the class with
several lectures on ﬁnancial applications, and to Fraydoun Rezakhanlou, who has taught
from these notes and added several improvements. I am also grateful to Jonathan Weare
for several computer simulations illustrating the text.
2
CHAPTER 1: INTRODUCTION
A. MOTIVATION
Fix a point x
0
∈ R
n
and consider then the ordinary diﬀerential equation:
(ODE)
_
˙ x(t) = b(x(t)) (t > 0)
x(0) = x
0
,
where b : R
n
→ R
n
is a given, smooth vector ﬁeld and the solution is the trajectory
x() : [0, ∞) →R
n
.
x(t)
x
0
Trajectory of the differential equation
Notation. x(t) is the state of the system at time t ≥ 0, ˙ x(t) :=
d
dt
x(t).
In many applications, however, the experimentally measured trajectories of systems
modeled by (ODE) do not in fact behave as predicted:
X(t)
x
0
Sample path of the stochastic differential equation
Hence it seems reasonable to modify (ODE), somehow to include the possibility of random
eﬀects disturbing the system. A formal way to do so is to write:
(1)
_
˙
X(t) = b(X(t)) +B(X(t))ξ(t) (t > 0)
X(0) = x
0
,
where B : R
n
→M
n×m
(= space of n m matrices) and
ξ() := m-dimensional “white noise”.
This approach presents us with these mathematical problems:
• Deﬁne the “white noise” ξ() in a rigorous way.
3
• Deﬁne what it means for X() to solve (1).
• Show (1) has a solution, discuss uniqueness, asymptotic behavior, dependence upon
x
0
, b, B, etc.
B. SOME HEURISTICS
Let us ﬁrst study (1) in the case m = n, x
0
= 0, b ≡ 0, and B ≡ I. The solution of
(1) in this setting turns out to be the n-dimensional Wiener process, or Brownian motion,
denoted W(). Thus we may symbolically write
˙
W() = ξ(),
thereby asserting that “white noise” is the time derivative of the Wiener process.
d
dt
dX(t)
dt
= b(X(t)) +B(X(t))
dW(t)
dt
,
and ﬁnally multiply by “dt”:
(SDE)
_
dX(t) = b(X(t))dt +B(X(t))dW(t)
X(0) = x
0
.
This expression, properly interpreted, is a stochastic diﬀerential equation. We say that
X() solves (SDE) provided
(2) X(t) = x
0
+
_
t
0
b(X(s)) ds +
_
t
0
B(X(s)) dW for all times t > 0 .
Now we must:
• Construct W(): See Chapter 3.
• Deﬁne the stochastic integral
_
t
0
dW : See Chapter 4.
• Show (2) has a solution, etc.: See Chapter 5.
And once all this is accomplished, there will still remain these modeling problems:
• Does (SDE) truly model the physical situation?
• Is the term ξ() in (1) “really” white noise, or is it rather some ensemble of smooth,
but highly oscillatory functions? See Chapter 6.
As we will see later these questions are subtle, and diﬀerent answers can yield completely
diﬀerent solutions of (SDE). Part of the trouble is the strange form of the chain rule in
the stochastic calculus:
C. IT
ˆ
O’S FORMULA
Assume n = 1 and X() solves the SDE
(3) dX = b(X)dt +dW.
4
Suppose next that u : R → R is a given smooth function. We ask: what stochastic
diﬀerential equation does
Y (t) := u(X(t)) (t ≥ 0)
solve? Oﬀhand, we would guess from (3) that
dY = u

dX = u

bdt +u

dW,
according to the usual chain rule, where

=
d
dx
. This is wrong, however ! In fact, as we
will see,
(4) dW ≈ (dt)
1/2
in some sense. Consequently if we compute dY and keep all terms of order dt or (dt)
1
2
, we
obtain
dY = u

dX +
1
2
u

(dX)
2
+. . .
= u

(bdt +dW
. ¸¸ .
from (3)
) +
1
2
u

(bdt +dW)
2
+. . .
=
_
u

b +
1
2
u

_
dt +u

dW +¦terms of order (dt)
3/2
and higher¦.
Here we used the “fact” that (dW)
2
= dt, which follows from (4). Hence
dY =
_
u

b +
1
2
u

_
dt +u

dW,
with the extra term “
1
2
u

dt” not present in ordinary calculus.
A major goal of these notes is to provide a rigorous interpretation for calculations like
these, involving stochastic diﬀerentials.
Example 1. According to Itˆ o’s formula, the solution of the stochastic diﬀerential equation
_
dY = Y dW,
Y (0) = 1
is
Y (t) := e
W(t)−
t
2
,
and not what might seem the obvious guess, namely
ˆ
Y (t) := e
W(t)
.
5
Example 2. Let P(t) denote the (random) price of a stock at time t ≥ 0. A standard
model assumes that
dP
P
, the relative change of price, evolves according to the SDE
dP
P
= µdt +σdW
for certain constants µ > 0 and σ, called respectively the drift and the volatility of the
stock. In other words,
_
dP = µPdt +σPdW
P(0) = p
0
,
where p
0
is the starting price. Using once again Itˆ o’s formula we can check that the solution
is
P(t) = p
0
e
σW(t)+
_
µ−
σ
2
2
_
t
.

A sample path for stock prices
6
CHAPTER 2: A CRASH COURSE IN BASIC PROBABILITY THEORY.
A. Basic deﬁnitions
B. Expected value, variance
C. Distribution functions
D. Independence
E. Borel–Cantelli Lemma
F. Characteristic functions
G. Strong Law of Large Numbers, Central Limit Theorem
H. Conditional expectation
I. Martingales
This chapter is a very rapid introduction to the measure theoretic foundations of prob-
ability theory. More details can be found in any good introductory text, for instance
Bremaud [Br], Chung [C] or Lamperti [L1].
A. BASIC DEFINITIONS.
Let us begin with a puzzle:
Bertrand’s paradox. Take a circle of radius 2 inches in the plane and choose a chord
of this circle at random. What is the probability this chord intersects the concentric circle
Solution #1 Any such chord (provided it does not hit the center) is uniquely deter-
mined by the location of its midpoint.
Thus
probability of hitting inner circle =
area of inner circle
area of larger circle
=
1
4
.
Solution #2 By symmetry under rotation we may assume the chord is vertical. The
diameter of the large circle is 4 inches and the chord will hit the small circle if it falls
within its 2-inch diameter.
7
Hence
probability of hitting inner circle =
2 inches
4 inches
=
1
2
.
Solution #3 By symmetry we may assume one end of the chord is at the far left point
of the larger circle. The angle θ the chord makes with the horizontal lies between ±
π
2
and
the chord hits the inner circle if θ lies between ±
π
6
.
θ
Therefore
probability of hitting inner circle =

6

2
=
1
3
.

PROBABILITY SPACES. This example shows that we must carefully deﬁne what
we mean by the term “random”. The correct way to do so is by introducing as follows the
precise mathematical structure of a probability space.
We start with a set, denoted Ω, certain subsets of which we will in a moment interpret
as being “events”.
DEFINTION. A σ-algebra is a collection | of subsets of Ω with these properties:
(i) ∅, Ω ∈ |.
(ii) If A ∈ |, then A
c
∈ |.
(iii) If A
1
, A
2
, ∈ |, then

_
k=1
A
k
,

k=1
A
k
∈ |.
Here A
c
:= Ω −A is the complement of A.
8
DEFINTION. Let | be a σ-algebra of subsets of Ω. We call P : | → [0, 1] a probability
measure provided:
(i) P(∅) = 0, P(Ω) = 1.
(ii) If A
1
, A
2
, ∈ |, then
P(

_
k=1
A
k
) ≤

k=1
P(A
k
).
(iii) If A
1
, A
2
, . . . are disjoint sets in |, then
P(

_
k=1
A
k
) =

k=1
P(A
k
).
It follows that if A, B ∈ |, then
A ⊆ B implies P(A) ≤ P(B).
DEFINITION. A triple (Ω, |, P) is called a probability space provided Ω is any set, |
is a σ-algebra of subsets of Ω, and P is a probability measure on |.
Terminology. (i) A set A ∈ | is called an event; points ω ∈ Ω are sample points.
(ii) P(A) is the probability of the event A.
(iii) A property which is true except for an event of probability zero is said to hold
almost surely (usually abbreviated “a.s.”).
Example 1. Let Ω = ¦ω
1
, ω
2
, . . . , ω
N
¦ be a ﬁnite set, and suppose we are given numbers
0 ≤ p
j
≤ 1 for j = 1, . . . , N, satisfying

p
j
= 1. We take | to comprise all subsets of
Ω. For each set A = ¦ω
j
1
, ω
j
2
, . . . , ω
j
m
¦ ∈ |, with 1 ≤ j
1
< j
2
< . . . j
m
≤ N, we deﬁne
P(A) := p
j
1
+p
j
2
+ +p
j
m
.
Example 2. The smallest σ-algebra containing all the open subsets of R
n
is called the
Borel σ-algebra, denoted B. Assume that f is a nonnegative, integrable function, such
that
_
R
n
f dx = 1. We deﬁne
P(B) :=
_
B
f(x) dx
for each B ∈ B. Then (R
n
, B, P) is a probability space. We call f the density of the
probability measure P.
Example 3. Suppose instead we ﬁx a point z ∈ R
n
, and now deﬁne
P(B) :=
_
1 if z ∈ B
0 if z / ∈ B
9
for sets B ∈ B. Then (R
n
, B, P) is a probability space. We call P the Dirac mass concen-
trated at the point z, and write P = δ
z
.
A probability space is the proper setting for mathematical probability theory. This
means that we must ﬁrst of all carefully identify an appropriate (Ω, |, P) when we try to
solve problems. The reader should convince himself or herself that the three “solutions” to
Bertrand’s paradox discussed above represent three distinct interpretations of the phrase
“at random”, that is, to three distinct models of (Ω, |, P).
Here is another example.
Example 4 (Buﬀon’s needle problem). The plane is ruled by parallel lines 2 inches
apart and a 1-inch long needle is dropped at random on the plane. What is the probability
that it hits one of the parallel lines?
The ﬁrst issue is to ﬁnd some appropriate probability space (Ω, |, P). For this, let
_
h = distance from the center of needle to nearest line,
θ = angle (≤
π
2
) that the needle makes with the horizontal.
θ
h
needle
These fully determine the position of the needle, up to translations and reﬂection. Let
us next take
_
¸
¸
_
¸
¸
_
Ω = [0,
π
2
)
. ¸¸ .
values of θ
[0, 1],
. ¸¸ .
values of h
| = Borel subsets of Ω,
P(B) =
2·area of B
π
for each B ∈ |.
We denote by A the event that the needle hits a horizontal line. We can now check
that this happens provided
h
sin θ

1
2
. Consequently A = ¦(θ, h) ∈ Ω[ h ≤
sin θ
2
¦, and so
P(A) =
2(area of A)
π
=
2
π
_ π
2
0
1
2
sinθ dθ =
1
π
.
RANDOM VARIABLES. We can think of the probability space as being an essential
mathematical construct, which is nevertheless not “directly observable”. We are therefore
interested in introducing mappings X from Ω to R
n
, the values of which we can observe.
10
Remember from Example 2 above that
B denotes the collection of Borel subsets of R
n
, which is the
smallest σ-algebra of subsets of R
n
containing all open sets.
We may henceforth informally just think of B as containing all the “nice, well-behaved”
subsets of R
n
.
DEFINTION. Let (Ω, |, P) be a probability space. A mapping
X : Ω →R
n
is called an n-dimensional random variable if for each B ∈ B, we have
X
−1
(B) ∈ |.
We equivalently say that X is |-measurable.
Notation, comments. We usually write “X” and not “X(ω)”. This follows the custom
within probability theory of mostly not displaying the dependence of random variables on
the sample point ω ∈ Ω. We also denote P(X
−1
(B)) as “P(X ∈ B)”, the probability that
X is in B.
In these notes we will usually use capital letters to denote random variables. Boldface
usually means a vector-valued mapping.
We will also use without further comment various standard facts from measure theory,
for instance that sums and products of random variables are random variables.
Example 1. Let A ∈ |. Then the indicator function of A,
χ
A
(ω) :=
_
1 if ω ∈ A
0 if ω / ∈ A,
is a random variable.
Example 2. More generally, if A
1
, A
2
, . . . , A
m
∈ |, with Ω = ∪
m
i=1
A
i
, and a
1
, a
2
, . . . , a
m
are real numbers, then
X =
m

i=1
a
i
χ
A
i
is a random variable, called a simple function.
11
LEMMA. Let X : Ω →R
n
be a random variable. Then
|(X) := ¦X
−1
(B) [ B ∈ B¦
is a σ-algebra, called the σ-algebra generated by X. This is the smallest sub-σ-algebra of
| with respect to which X is measurable.
Proof. Check that ¦X
−1
(B) [ B ∈ B¦ is a σ-algebra; clearly it is the smallest σ-algebra
with respect to which X is measurable.
IMPORTANT REMARK. It is essential to understand that, in probabilistic terms,
the σ-algebra |(X) can be interpreted as “containing all relevant information” about the
random variable X.
In particular, if a random variable Y is a function of X, that is, if
Y = Φ(X)
for some reasonable function Φ, then Y is |(X)-measurable.
Conversely, suppose Y : Ω → R is |(X)-measurable. Then there exists a function Φ
such that
Y = Φ(X).
Hence if Y is |(X)-measurable, Y is in fact a function of X. Consequently if we know
the value X(ω), we in principle know also Y (ω) = Φ(X(ω)), although we may have no
practical way to construct Φ.
STOCHASTIC PROCESSES. We introduce next random variables depending upon
time.
DEFINITIONS. (i) A collection ¦X(t) [ t ≥ 0¦ of random variables is called a stochastic
process.
(ii) For each point ω ∈ Ω, the mapping t → X(t, ω) is the corresponding sample path.
The idea is that if we run an experiment and observe the random values of X() as time
evolves, we are in fact looking at a sample path ¦X(t, ω) [ t ≥ 0¦ for some ﬁxed ω ∈ Ω. If
we rerun the experiment, we will in general observe a diﬀerent sample path.
12
X(t,ω
1
)
X(t,ω
2
)
time
Two sample paths of a stochastic process
B. EXPECTED VALUE, VARIANCE.
Integration with respect to a measure. If (Ω, |, P) is a probability space and X =

k
i=1
a
i
χ
A
i
is a real-valued simple random variable, we deﬁne the integral of X by
_

X dP :=
k

i=1
a
i
P(A
i
).
If next X is a nonnegative random variable, we deﬁne
_

X dP := sup
Y ≤X,Y simple
_

Y dP.
Finally if X : Ω →R is a random variable, we write
_

X dP :=
_

X
+
dP −
_

X

dP,
provided at least one of the integrals on the right is ﬁnite. Here X
+
= max(X, 0) and
X

= max(−X, 0); so that X = X
+
−X

.
Next, suppose X : Ω →R
n
is a vector-valued random variable, X = (X
1
, X
2
, . . . , X
n
).
Then we write
_

XdP =
__

X
1
dP,
_

X
2
dP, ,
_

X
n
dP
_
.
We will assume without further comment the usual rules for these integrals.
DEFINITION. We call
E(X) :=
_

XdP
the expected value (or mean value) of X.
13
DEFINITION. We call
V (X) :=
_

[X−E(X)[
2
dP
the variance of X.
Observe that
V (X) = E([X−E(X)[
2
) = E([X[
2
) −[E(X)[
2
.
LEMMA (Chebyshev’s inequality). If X is a random variable and 1 ≤ p < ∞, then
P([X[ ≥ λ) ≤
1
λ
p
E([X[
p
) for all λ > 0.
Proof. We have
E([X[
p
) =
_

[X[
p
dP ≥
_
{|X|≥λ}
[X[
p
dP ≥ λ
p
P([X[ ≥ λ).

C. DISTRIBUTION FUNCTIONS.
Let (Ω, |, P) be a probability space and suppose X : Ω →R
n
is a random variable.
Notation. Let x = (x
1
, . . . , x
n
) ∈ R
n
, y = (y
1
, . . . , y
n
) ∈ R
n
. Then
x ≤ y
means x
i
≤ y
i
for i = 1, . . . , n.
DEFINITIONS. (i) The distribution function of X is the function F
X
: R
n
→ [0, 1]
deﬁned by
F
X
(x) := P(X ≤ x) for all x ∈ R
n
(ii) If X
1
, . . . , X
m
: Ω → R
n
are random variables, their joint distribution function is
F
X
1
,...,X
m
: (R
n
)
m
→ [0, 1],
F
X
1
,...,X
m
(x
1
, . . . , x
m
) := P(X
1
≤ x
1
, . . . , X
m
≤ x
m
) for all x
i
∈ R
n
, i = 1, . . . , m.
DEFINITION. Suppose X : Ω →R
n
is a random variable and F = F
X
its distribution
function. If there exists a nonnegative, integrable function f : R
n
→R such that
F(x) = F(x
1
, . . . , x
n
) =
_
x
1
−∞

_
x
n
−∞
f(y
1
, . . . , y
n
) dy
n
. . . dy
1
,
then f is called the density function for X.
It follows then that
(1) P(X ∈ B) =
_
B
f(x) dx for all B ∈ B
This formula is important as the expression on the right hand side is an ordinary integral,
and can often be explicitly calculated.
14
x

R
n
X
Example 1. If X : Ω →R has density
f(x) =
1

2πσ
2
e

|x−m|
2

2
(x ∈ R),
we say X has a Gaussian (or normal) distribution, with mean m and variance σ
2
. In this
case let us write
X is an N(m, σ
2
) random variable.
Example 2. If X : Ω →R
n
has density
f(x) =
1
((2π)
n
det C)
1/2
e

1
2
(x−m)·C
−1
(x−m)
(x ∈ R
n
)
for some m ∈ R
n
and some positive deﬁnite, symmetric matrix C, we say X has a Gaussian
(or normal) distribution, with mean m and covariance matrix C. We then write
X is an N(m, C) random variable.

LEMMA. Let X : Ω → R
n
be a random variable, and assume that its distribution func-
tion F = F
X
has the density f. Suppose g : R
n
→R, and
Y = g(X)
is integrable. Then
E(Y ) =
_
R
n
g(x)f(x) dx.
15
In particular,
E(X) =
_
R
n
xf(x) dx and V (X) =
_
R
n
[x −E(X)[
2
f(x) dx.
Remark. Hence we can compute E(X), V (X), etc. in terms of integrals over R
n
. This
is an important observation, since as mentioned before the probability space (Ω, |, P) is
“unobservable”: All that we “see” are the values X takes on in R
n
. Indeed, all quantities
of interest in probability theory can be computed in R
n
in terms of the density f.
Proof. Suppose ﬁrst g is a simple function on R
n
:
g =
m

i=1
b
i
χ
B
i
(B
i
∈ B).
Then
E(g(X)) =
m

i=1
b
i
_

χ
B
i
(X) dP =
m

i=1
b
i
P(X ∈ B
i
).
But also
_
R
n
g(x)f(x) dx =
m

i=1
b
i
_
B
i
f(x) dx
=
m

i=1
b
i
P(X ∈ B
i
) by (1).
Consequently the formula holds for all simple functions g and, by approximation, it holds
therefore for general functions g.
Example. If X is N(m, σ
2
), then
E(X) =
1

2πσ
2
_

−∞
xe

(x−m)
2

2
dx = m
and
V (X) =
1

2πσ
2
_

−∞
(x −m)
2
e

(x−m)
2

2
dx = σ
2
.
Therefore m is indeed the mean, and σ
2
the variance.
16
B

A
ω
D. INDEPENDENCE.
MOTIVATION. Let (Ω, |, P) be a probability space, and let A, B ∈ | be two events,
with P(B) > 0. We want to ﬁnd a reasonable deﬁnition of
P(A[ B), the probability of A, given B.
Think this way. Suppose some point ω ∈ Ω is selected “at random” and we are told ω ∈ B.
What then is the probability that ω ∈ A also?
Since we know ω ∈ B, we can regard B as being a new probability space. Therefore we
can deﬁne
˜
Ω := B,
˜
| := ¦C ∩ B[ C ∈ |¦ and
˜
P :=
P
P(B)
; so that
˜
P(
˜
Ω) = 1. Then the
probability that ω lies in A is
˜
P(A∩ B) =
P(A∩B)
P(B)
.
This observation motivates the following
DEFINITION. We write
P(A[ B) :=
P(A∩ B)
P(B)
if P(B) > 0.
Now what should it mean to say “A and B are independent”? This should mean
P(A[ B) = P(A), since presumably any information that the event B has occurred is
irrelevant in determining the probability that A has occurred. Thus
P(A) = P(A[ B) =
P(A∩ B)
P(B)
and so
P(A∩ B) = P(A)P(B)
if P(B) > 0. We take this for the deﬁnition, even if P(B) = 0:
DEFINITION. Two events A and B are called independent if
P(A∩ B) = P(A)P(B).
This concept and its ramiﬁcations are the hallmarks of probability theory.
To gain some insight, the reader may wish to check that if A and B are independent
events, then so are A
c
and B. Likewise, A
c
and B
c
are independent.
17
DEFINITION. Let A
1
, . . . , A
n
, . . . be events. These events are independent if for all
choices 1 ≤ k
1
< k
2
< < k
m
, we have
P(A
k
1
∩ A
k
2
∩ ∩ A
k
m
) = P(A
k
1
)P(A
k
1
) P(A
k
m
).
It is important to extend this deﬁnition to σ-algebras:
DEFINITION. Let |
i
⊆ | be σ-algebras, for i = 1, . . . . We say that ¦|
i
¦

i=1
are
independent if for all choices of 1 ≤ k
1
< k
2
< < k
m
and of events A
k
i
∈ |
k
i
, we have
P(A
k
1
∩ A
k
2
∩ ∩ A
k
m
) = P(A
k
1
)P(A
k
2
) . . . P(A
k
m
).
Lastly, we transfer our deﬁnitions to random variables:
DEFINITION. Let X
i
: Ω → R
n
be random variables (i = 1, . . . ). We say the random
variables X
1
, . . . are independent if for all integers k ≥ 2 and all choices of Borel sets
B
1
, . . . B
k
⊆ R
n
:
P(X
1
∈ B
1
, X
2
∈ B
2
, . . . , X
k
∈ B
k
) = P(X
1
∈ B
1
)P(X
2
∈ B
2
) P(X
k
∈ B
k
).
This is equivalent to saying that the σ-algebras ¦|(X
i

i=1
are independent.
Example. Take Ω = [0, 1), | the Borel subsets of [0, 1), and P Lebesgue measure.
Deﬁne for n = 1, 2, . . .
X
n
(ω) :=
_
1 if
k
2
n
≤ ω <
k+1
2
n
, k even
−1 if
k
2
n
≤ ω <
k+1
2
n
, k odd
(0 ≤ ω < 1).
These are the Rademacher functions, which we assert are in fact independent random
variables. To prove this, it suﬃces to verify
P(X
1
= e
1
, X
2
= e
2
, . . . , X
k
= e
k
) = P(X
1
= e
1
)P(X
2
= e
2
) P(X
k
= e
k
),
for all choices of e
1
, . . . , e
k
∈ ¦−1, 1¦. This can be checked by showing that both sides are
equal to 2
−k
.
LEMMA. Let X
1
, . . . , X
m+n
be independent R
k
-valued random variables. Suppose f :
(R
k
)
n
→R and g : (R
k
)
m
→R. Then
Y := f(X
1
, . . . , X
n
) and Z := g(X
n+1
, . . . , X
n+m
)
are independent.
We omit the proof, which may be found in Breiman [B].
18
THEOREM. The random variables X
1
, , X
m
: Ω → R
n
are independent if and only
if
(2) F
X
1
,··· ,X
m
(x
1
, . . . , x
m
) = F
X
1
(x
1
) F
X
m
(x
m
) for all x
i
∈ R
n
, i = 1, . . . , m.
If the random variables have densities, (2) is equivalent to
(3) f
X
1
,··· ,X
m
(x
1
, . . . , x
m
) = f
X
1
(x
1
) f
X
m
(x
m
) for all x
i
∈ R
n
, i = 1, . . . , m,
where the functions f are the appropriate densities.
Proof. 1. Assume ﬁrst that ¦X
k
¦
m
k=1
are independent. Then
F
X
1
···X
m
(x
1
, . . . , x
m
) = P(X
1
≤ x
1
, . . . , X
m
≤ x
m
)
= P(X
1
≤ x
1
) P(X
m
≤ x
m
)
= F
X
1
(x
1
) F
X
m
(x
m
).
2. We prove the converse statement for the case that all the random variables have
densities. Select A
i
∈ |(X
i
), i = 1, . . . , m. Then A
i
= X
−1
i
(B
i
) for some B
i
∈ B. Hence
P(A
1
∩ ∩ A
m
) = P(X
1
∈ B
1
, . . . , X
m
∈ B
m
)
=
_
B
1
×...×B
m
f
X
1
···X
m
(x
1
, . . . , x
m
) dx
1
dx
m
=
__
B
1
f
X
1
(x
1
) dx
1
_
. . .
__
B
m
f
X
m
(x
m
) dx
m
_
by (3)
= P(X
1
∈ B
1
) P(X
m
∈ B
m
)
= P(A
1
) P(A
m
).
Therefore |(X
1
), , |(X
m
) are independent σ-algebras.
One of the most important properties of independent random variables is this:
THEOREM. If X
1
, . . . , X
m
are independent, real-valued random variables, with
E([X
i
[) < ∞ (i = 1, . . . , m),
then E([X
1
X
m
[) < ∞ and
E(X
1
X
m
) = E(X
1
) E(X
m
).
Proof. Suppose that each X
i
is bounded and has a density. Then
E(X
1
X
m
) =
_
R
m
x
1
x
m
f
X
1
···X
m
(x
1
, . . . , x
m
) dx
1
. . . x
m
=
__
R
x
1
f
X
1
(x
1
) dx
1
_

__
R
x
m
f
X
m
(x
m
) dx
m
_
by (3)
= E(X
1
) E(X
m
).

19
THEOREM. If X
1
, . . . , X
m
are independent, real-valued random variables, with
V (X
i
) < ∞ (i = 1, . . . , m),
then
V (X
1
+ +X
m
) = V (X
1
) + +V (X
m
).
Proof. Use induction, the case m = 2 holding as follows. Let m
1
:= EX
1
, m
2
:= E(X
2
).
Then E(X
1
+X
2
) = m
1
+m
2
and
V (X
1
+X
2
) =
_

(X
1
+X
2
−(m
1
+m
2
))
2
dP
=
_

(X
1
−m
1
)
2
dP +
_

(X
2
−m
2
)
2
dP
+ 2
_

(X
1
−m
1
)(X
2
−m
2
) dP
= V (X
1
) +V (X
2
) + 2E(X
1
−m
1
. ¸¸ .
=0
)E(X
2
−m
2
. ¸¸ .
=0
),
where we used independence in the next last step.
E. BOREL–CANTELLI LEMMA.
We introduce next a simple and very useful way to check if some sequence A
1
, . . . , A
n
, . . .
of events “occurs inﬁnitely often”.
DEFINITION. Let A
1
, . . . , A
n
, . . . be events in a probability space. Then the event

n=1

_
m=n
A
m
= ¦ω ∈ Ω[ ω belongs to inﬁnitely many of the A
n
¦,
is called “A
n
inﬁnitely often”, abbreviated “A
n
i.o.”.
BOREL–CANTELLI LEMMA. If

n=1
P(A
n
) < ∞, then P(A
n
i.o.) = 0.
Proof. By deﬁnition A
n
i.o. =

n=1

m=n
A
m
, and so for each n
P(A
n
i.o.) ≤ P
_

_
m=n
A
m
_

m=n
P(A
m
).
The limit of the left-hand side is zero as n → ∞ because

P(A
m
) < ∞.
APPLICATION. We illustrate a typical use of the Borel–Cantelli Lemma.
A sequence of random variables ¦X
k
¦

k=1
deﬁned on some probability space converges
in probability to a random variable X, provided
lim
k→∞
P([X
k
−X[ > ) = 0
for each > 0.
20
THEOREM. If X
k
→ X in probability, then there exists a subsequence ¦X
k
j
¦

j=1

¦X
k
¦

k=1
such that
X
k
j
(ω) → X(ω) for almost every ω.
Proof. For each positive integer j we select k
j
so large that
P([X
k
j
−X[ >
1
j
) ≤
1
j
2
,
and also . . . k
j−1
< k
j
< . . . , k
j
→ ∞. Let A
j
:= ¦[X
k
j
−X[ >
1
j
¦. Since

1
j
2
< ∞, the
Borel–Cantelli Lemma implies P(A
j
i.o.) = 0. Therefore for almost all sample points ω,
[X
k
j
(ω) −X(ω)[ ≤
1
j
provided j ≥ J, for some index J depending on ω.
F. CHARACTERISTIC FUNCTIONS.
It is convenient to introduce next a clever integral transform, which will later provide
us with a useful means to identify normal random variables.
DEFINITION. Let X be an R
n
-valued random variable. Then
φ
X
(λ) := E(e
iλ·X
) (λ ∈ R
n
)
is the characteristic function of X.
Example. If the real-valued random variable X is N(m, σ
2
), then
φ
X
(λ) = e
imλ−
λ
2
σ
2
2
(λ ∈ R).
To see this, let us suppose that m = 0, σ = 1 and calculate
φ
X
(λ) =
_

−∞
e
iλx
1

e

x
2
2
dx =
e
−λ
2
2

_

−∞
e

(x−iλ)
2
2
dx.
We move the path of integration in the complex plane from the line ¦Im(z) = −λ¦ to the
real axis, and recall that
_

−∞
e

x
2
2
dx =

2π. (Here Im(z) means the imaginary part of
the complex number z.) Hence φ
X
(λ) = e

λ
2
2
.
21
LEMMA. (i) If X
1
, . . . , X
m
are independent random variables, then for each λ ∈ R
n
φ
X
1
+···+X
m
(λ) = φ
X
1
(λ) . . . φ
X
m
(λ).
(ii) If X is a real-valued random variable,
φ
(k)
(0) = i
k
E(X
k
) (k = 0, 1, . . . ).
(iii) If X and Y are random variables and
φ
X
(λ) = φ
Y
(λ) for all λ,
then
F
X
(x) = F
Y
(x) for all x.
Assertion (iii) says the characteristic function of X determines the distribution of X.
Proof. 1. Let us calculate
φ
X
1
+···+X
m
(λ) = E(e
iλ·(X
1
+···+X
m
)
)
= E(e
iλ·X
1
e
iλ·X
2
e
iλ·X
m
)
= E(e
iλ·X
1
) E(e
iλ·X
m
) by independence
= φ
X
1
(λ) . . . φ
X
m
(λ).
2. We have φ

(λ) = iE(Xe
iλX
), and so φ

(0) = iE(X). The formulas in (ii) for k = 2, . . .
3. See Breiman [B] for the proof of (iii).
Example. If X and Y are independent, real-valued random variables, and if X is N(m
1
, σ
2
1
),
Y is N(m
2
, σ
2
2
), then
X +Y is N(m
1
+m
2
, σ
2
1

2
2
).
To see this, just calculate
φ
X+Y
(λ) = φ
X
(λ)φ
Y
(λ) = e
−im
1
λ−
λ
2
σ
2
1
2
e
−im
2
λ−
λ
2
σ
2
2
2
= e
−i(m
1
+m
2
)λ−
λ
2
2

2
1

2
2
)
.

22
G. STRONG LAW OF LARGE NUMBERS, CENTRAL LIMIT THEOREM.
This section discusses a mathematical model for “repeated, independent experiments”.
The idea is this. Suppose we are given a probability space and on it a real–valued
random variable X, which records the outcome of some sort of random experiment. We
can model repetitions of this experiment by introducing a sequence of random variables
X
1
, . . . , X
n
, . . . , each of which “has the same probabilistic information as X”:
DEFINITION. A sequence X
1
, . . . , X
n
, . . . of random variables is called identically dis-
tributed if
F
X
1
(x) = F
X
2
(x) = = F
X
n
(x) = . . . for all x.
If we additionally assume that the random variables X
1
, . . . , X
n
, . . . are independent, we
can regard this sequence as a model for repeated and independent runs of the experiment,
the outcomes of which we can measure. More precisely, imagine that a “random” sample
point ω ∈ Ω is given and we can observe the sequence of values X
1
(ω), X
2
(ω), . . . , X
n
(ω), . . . .
What can we infer from these observations?
STRONG LAW OF LARGE NUMBERS. First we show that with probability
one, we can deduce the common expected values of the random variables.
THEOREM (Strong Law of Large Numbers). Let X
1
, . . . , X
n
, . . . be a sequence
of independent, identically distributed, integrable random variables deﬁned on the same
probability space.
Write m := E(X
i
) for i = 1, . . . . Then
P
_
lim
n→∞
X
1
+ +X
n
n
= m
_
= 1.
Proof. 1. Supposing that the random variables are real–valued entails no loss of generality.
We will as well suppose for simplicity that
E(X
4
i
) < ∞ (i = 1, . . . ).
We may also assume m = 0, as we could otherwise consider X
i
−m in place of X
i
.
2. Then
E
_
_
_
n

i=1
X
i
_
4
_
_
=
n

i,j,k,l=1
E(X
i
X
j
X
k
X
l
).
If i ,= j, k, or l, independence implies
E(X
i
X
j
X
k
X
l
) = E(X
i
)
. ¸¸ .
=0
E(X
j
X
k
X
l
).
23
Consequently, since the X
i
are identically distributed, we have
E
_
_
_
n

i=1
X
i
_
4
_
_
=
n

i=1
E(X
4
i
) + 3
n

i,j=1
i=j
E(X
2
i
X
2
j
)
= nE(X
4
1
) + 3(n
2
−n)(E(X
2
1
))
2
≤ n
2
C
for some constant C.
Now ﬁx ε > 0. Then
P

¸
¸
¸
¸
1
n
n

i=1
X
i
¸
¸
¸
¸
¸
≥ ε
_
= P

¸
¸
¸
¸
n

i=1
X
i
¸
¸
¸
¸
¸
≥ εn
_

1
(εn)
4
E
_
_
_
n

i=1
X
i
_
4
_
_

C
ε
4
1
n
2
.
We used here the Chebyshev inequality. By the Borel–Cantelli Lemma, therefore,
P

¸
¸
¸
¸
1
n
n

i=1
X
i
¸
¸
¸
¸
¸
≥ ε i.o.
_
= 0.
3. Take ε =
1
k
. The foregoing says that
limsup
n→∞
¸
¸
¸
¸
¸
1
n
n

i=1
X
i
(ω)
¸
¸
¸
¸
¸

1
k
,
except possibly for ω lying in an event B
k
, with P(B
k
) = 0. Write B := ∪

k=1
B
k
. Then
P(B) = 0 and
lim
n→∞
1
n
n

i=1
X
i
(ω) = 0
for each sample point ω / ∈ B.
FLUCTUATIONS, LAPLACE–DEMOIVRE THEOREM. The Strong Law of
Large Numbers says that for almost every sample point ω ∈ Ω,
X
1
(ω) + +X
n
(ω)
n
→ m as n → ∞.
We turn next to the Laplace–DeMoivre Theorem, and its generalization the Central Limit
Theorem, which estimate the “ﬂuctuations” we can expect in this limit.
24
LEMMA. Suppose the real–valued random variables X
1
, . . . , X
n
, . . . are independent and
identically distributed, with
_
P(X
i
= 1) = p
P(X
i
= 0) = q
for p, q ≥ 0, p +q = 1. Then
E(X
1
+ +X
n
) = np
V (X
1
+ +X
n
) = npq.
Proof. E(X
1
) =
_

X
1
dP = p and therefore E(X
1
+ +X
n
) = np. Also,
V (X
1
) =
_

(X
1
−p)
2
dP = (1 −p)
2
P(X
1
= 1) +p
2
P(X
1
= 0)
= q
2
p +p
2
q = qp.
By independence, V (X
1
+ +X
n
) = V (X
1
) + +V (X
n
) = npq.
We can imagine these random variables as modeling for example repeated tosses of a
biased coin, which has probability p of coming up heads, and probability q = 1 − p of
coming up tails.
THEOREM (Laplace–DeMoivre). Let X
1
, . . . , X
n
be the independent, identically dis-
tributed, real–valued random variables in the preceding Lemma. Deﬁne the sums
S
n
:= X
1
+ +X
n
.
Then for all −∞ < a < b < +∞,
lim
n→∞
P
_
a ≤
S
n
−np

npq
≤ b
_
=
1

_
b
a
e

x
2
2
dx.
A proof is in Appendix A.
Interpretation of the Laplace–DeMoivre Theorem. In view of the Lemma,
S
n
−np

npq
=
S
n
−E(S
n
)
V (S
n
)
1/2
.
Hence the Laplace–DeMoivre Theorem says that the sums S
n
, properly renormalized, have
a distribution which tends to the Gaussian N(0, 1) as n → ∞.
Consider in particular the situation p = q =
1
2
. Suppose a > 0; then
lim
n→∞
P
_

a

n
2
≤ S
n

n
2

a

n
2
_
=
1

_
a
−a
e

x
2
2
dx.
25
If we ﬁx b > 0 and write a =
2b

n
, then for large n
P
_
−b ≤ S
n

n
2
≤ b
_

1

_ 2b

n

2b

n
e

x
2
2
dx
. ¸¸ .
→0 as n→∞.
Thus for almost every ω,
1
n
S
n
(ω) →
1
2
, in accord with the Strong Law of Large Numbers;
but
¸
¸
S
n
(ω) −
n
2
¸
¸
“ﬂuctuates” with probability 1 to exceed any ﬁnite bound b.
CENTRAL LIMIT THEOREM. We now generalize the Laplace–DeMoivre Theo-
rem:
THEOREM (Central Limit Theorem). Let X
1
, . . . , X
n
, . . . be independent, identi-
cally distributed, real-valued random variables with
E(X
i
) = m, V (X
i
) = σ
2
> 0.
for i = 1, . . . . Set
S
n
:= X
1
+ +X
n
.
Then for all −∞ < a < b < +∞
(1) lim
n→∞
P
_
a ≤
S
n
−nm

≤ b
_
=
1

_
b
a
e

x
2
2
dx.
Thus the conclusion of the Laplace–DeMoivre Theorem holds not only for the 0– or 1–
valued random variable considered before, but for any sequence of independent, identically
distributed random variables with ﬁnite variance. We will later invoke this assertion to
motivate our requirement that Brownian motion be normally distributed for each time
t ≥ 0.
Outline of Proof. For simplicity assume m = 0, σ = 1, since we can always rescale to this
case. Then
φS
n

n
(λ) = φX
1

n
(λ) . . . φX
n

n
(λ) = φ
X
1
_
λ

n
_
n
for λ ∈ R, because the random variables are independent and identically distributed.
Now φ = φ
X
1
satisﬁes
φ(µ) = φ(0) +φ

(0)µ +
1
2
φ

(0)µ
2
+o(µ
2
) as µ → 0,
with φ(0) = 1, φ

(0) = iE(X
1
) = 0, φ

(0) = −E(X
2
1
) = −1. Consequently our setting
µ =
λ

n
gives
φ
X
1
_
λ

n
_
= 1 −
λ
2
2n
+o
_
λ
2
n
_
,
26
and so
φS
n

n
(λ) =
_
1 −
λ
2
2n
+o
_
λ
2
n
__
n
→ e

λ
2
2
for all λ, as n → ∞. But e

λ
2
2
is the characteristic function of an N(0, 1) random variable.
It turns out that this convergence of the characteristic functions implies the limit (1): see
Breiman [B] for more.
H. CONDITIONAL EXPECTATION.
MOTIVATION. We earlier decided to deﬁne P(A[ B), the probability of A, given B,
to be
P(A∩B)
P(B)
, provided P(B) > 0. How then should we deﬁne
E(X[ B),
the expected value of the random variable X, given the event B? Remember that we can
think of B as the new probability space, with
˜
P =
P
P(B)
. Thus if P(B) > 0, we should set
E(X[ B) = mean value of X over B
=
1
P(B)
_
B
X dP.
Next we pose a more interesting question. What is a reasonable deﬁnition of
E(X[ Y ),
the expected value of the random variable X, given another random variable Y ? In other
words if “chance” selects a sample point ω ∈ Ω and all we know about ω is the value Y (ω),
what is our best guess as to the value X(ω)?
This turns out to be a subtle, but extremely important issue, for which we provide two
introductory discussions.
example.
Example. Assume we are given a probability space (Ω, |, P), on which is deﬁned a simple
random variable Y . That is, Y =

m
i=1
a
i
χ
A
i
, and so
Y =
_
¸
¸
¸
¸
_
¸
¸
¸
¸
_
a
1
on A
1
a
2
on A
2
.
.
.
a
m
on A
m
,
27
for distinct real numbers a
1
, a
2
, . . . , a
m
and disjoint events A
1
, A
2
, . . . , A
m
, each of positive
probability, whose union is Ω.
Next, let X be any other real–valued random variable on Ω. What is our best guess of
X, given Y ? Think about the problem this way: if we know the value of Y (ω), we can tell
which event A
1
, A
2
, . . . , A
m
contains ω. This, and only this, known, our best estimate for
X should then be the average value of X over each appropriate event. That is, we should
take
E(X[ Y ) :=
_
¸
¸
¸
¸
¸
_
¸
¸
¸
¸
¸
_
1
P(A
1
)
_
A
1
X dP on A
1
1
P(A
2
)
_
A
2
X dP on A
2
.
.
.
1
P(A
m
)
_
A
m
X dP on A
m
.

We note for this example that
• E(X[ Y ) is a random variable, and not a constant.
• E(X[ Y ) is |(Y )-measurable.

_
A
XdP =
_
A
E(X[ Y ) dP for all A ∈ |(Y ).
Let us take these properties as the deﬁnition in the general case:
DEFINITION. Let Y be a random variable. Then E(X[ Y ) is any |(Y )-measurable
random variable such that
_
A
X dP =
_
A
E(X[ Y ) dP for all A ∈ |(Y ).
Finally, notice that it is not really the values of Y that are important, but rather just
the σ-algebra it generates. This motivates the next
DEFINITION. Let (Ω, |, P) be a probability space and suppose 1 is a σ-algebra, 1 ⊆ |.
If X : Ω →R
n
is an integrable random variable, we deﬁne
E(X[ 1)
to be any random variable on Ω such that
(i) E(X[ 1) is 1-measurable, and
(ii)
_
A
X dP =
_
A
E(X[ 1) dP for all A ∈ 1.
Interpretation. We can understand E(X[ 1) as follows. We are given the “information”
available in a σ-algebra 1, from which we intend to build an estimate of the random
variable X. Condition (i) in the deﬁnition requires that E(X[ 1) be constructed from the
28
information in 1, and (ii) requires that our estimate be consistent with X, at least as
regards integration over events in 1. We will later see that the conditional expectation
E(X[ 1), so deﬁned, has various additional nice properties.
Remark. We can check without diﬃculty that
(i) E(X[ Y ) = E(X[ |(Y )).
(ii) E(E(X[ 1)) = E(X).
(iii) E(X) = E(X[ J), where J = ¦∅, Ω¦ is the trivial σ-algebra.
THEOREM. Let X be an integrable random variable. Then for each σ-algebra 1 ⊂
|, the conditional expectation E(X[ 1) exists and is unique up to 1-measurable sets of
probability zero.
We omit the proof, which uses a few advanced concepts from measure theory.
SECOND APPROACH TO CONDITIONAL EXPECTATION. An elegant al-
ternative approach to conditional expectations is based upon projections onto closed sub-
spaces, and is motivated by this example:
Least squares method. Consider for the moment R
n
and suppose that V is a proper
subspace.
Suppose we are given a vector x ∈ R
n
. The least squares problem asks us to ﬁnd a
vector z ∈ V so that
[z −x[ = min
y∈V
[y −x[.
It is not particularly diﬃcult to show that, given x, there exists a unique vector z ∈ V
solving this minimization problem. We call v the projection of x onto V ,
(7) z = proj
V
(x).
V
0
z=proj
V
(x)
x
29
Now we want to ﬁnd formula characterizing z. For this take any other vector w ∈ V .
Deﬁne then
i(τ) := [z +τw −x[
2
.
Since z + τw ∈ V for all τ, we see that the function i() has a minimum at τ = 0. Hence
0 = i

(0) = 2(z −x) w; that is,
(8) x w = z w for all w ∈ V.
The geometric interpretation is that the “error” x − z is perpendicular to the subspace
V .
Projection of random variables. Motivated by the example above, we return now
to conditional expectation. Let us take the linear space L
2
(Ω) = L
2
(Ω, |), which consists
of all real-valued, |–measurable random variables Y , such that
[[Y [[ :=
__

Y
2
dP
_1
2
< ∞.
We call [[Y [[ the norm of Y ; and if X, Y ∈ L
2
(Ω), we deﬁne their inner product to be
(X, Y ) :=
_

XY dP = E(XY ).
Next, take as before 1 to be a σ-algebra contained in |. Consider then
V := L
2
(Ω, 1),
the space of square–integrable random variables that are 1–measurable. This is a closed
subspace of L
2
(Ω). Consequently if X ∈ L
2
(Ω), we can deﬁne its projection
(9) Z = proj
V
(X),
by analogy with (7) in the ﬁnite dimensional case. Almost exactly as we established (8)
above, we can likewise show
(X, W) = (Z, W) for all W ∈ V.
Take in particular W = χ
A
for any set A ∈ 1. In view of the deﬁnition of the inner
product, it follows that
_
A
X dP =
_
A
Z dP for all A ∈ 1.
30
Since Z ∈ V is 1-measurable, we see that Z is in fact E(X[ 1), as deﬁned in the earlier
discussion. That is,
E(X[ 1) = proj
V
(X).
We could therefore alternatively take the last identity as a deﬁnition of conditional
expectation. This point of view also makes it clear that Z = E(X[ 1) solves the least
squares problem:
[[Z −X[[ = min
Y ∈V
[[Y −X[[;
and so E(X[ 1) can be interpreted as that 1-measurable random variable which is the best
least squares approximation of the random variable X.
The two introductory discussions now completed, we turn next to examining conditional
expectation more closely.
THEOREM (Properties of conditional expectation).
(i) If X is 1-measurable, then E(X[ 1) = X a.s.
(ii) If a, b are constants, E(aX +bY [ 1) = aE(X[ 1) +bE(Y [ 1) a.s.
(iii) If X is 1-measurable and XY is integrable, then E(XY [ 1) = XE(Y [ 1) a.s.
(iv) If X is independent of 1, then E(X[ 1) = E(X) a.s.
(v) If J ⊆ 1, we have
E(X[ J) = E(E(X[ 1) [ J) = E(E(X[ J) [ 1) a.s.
(vi) The inequality X ≤ Y a.s. implies E(X[ 1) ≤ E(Y [ 1) a.s.
Proof.
1. Statement (i) is obvious, and (ii) is easy to check
2. By uniqueness a.s. of E(XY [ 1), it is enough in proving (iii) to show
(10)
_
A
XE(Y [ 1) dP =
_
A
XY dP for all A ∈ 1.
First suppose X =

m
i=1
b
i
χ
B
i
, where B
i
∈ 1 for i = 1, . . . , m. Then
_
A
XE(Y [ 1) dP =
m

i=1
b
i
_
A∩B
i
. ¸¸ .
∈V
E(Y [ 1) dP
=
m

i=1
b
i
_
A∩B
i
Y dP =
_
A
XY dP.
31
This proves (10) if X is a simple function. The general case follows by approximation.
3. To show (iv), it suﬃces to prove
_
A
E(X) dP =
_
A
X dP for all A ∈ 1. Let us
compute:
_
A
X dP =
_

χ
A
X dP = E(χ
A
X) = E(X)P(A) =
_
A
E(X) dP,
the third equality owing to independence.
4. Assume J ⊆ 1 and let A ∈ J. Then
_
A
E(E(X[ 1) [ J) dP =
_
A
E(X[ 1) dP =
_
A
X dP,
since A ∈ J ⊆ 1. Thus E(X[ J) = E(E(X[ 1) [ J) a.s.
Furthermore, assertion (i) implies that E(E(X[ J) [ 1) = E(X[ J), since E(X[ J) is
J-measurable and so also 1-measurable. This establishes assertion (v).
5. Finally, suppose X ≤ Y , and note that
_
A
E(Y [ 1) −E(X[ 1) dP =
_
A
E(Y −X[ 1) dP
=
_
A
Y −X dP ≥ 0
for all A ∈ 1. Take A := ¦E(Y [ 1) −E(X[ 1) ≤ 0¦. This event lies in 1, and we deduce
from the previous inequality that P(A) = 0.
LEMMA (Conditional Jensen’s Inequality). Suppose Φ : R → R is convex, with
E([Φ(X)[) < ∞. Then
Φ(E(X[ 1)) ≤ E(Φ(X) [ 1).
We leave the proof as an exercise.
I. MARTINGALES.
MOTIVATION. Suppose Y
1
, Y
2
, . . . are independent real-valued random variables, with
E(Y
i
) = 0 (i = 1, 2, . . . ).
Deﬁne the sum S
n
:= Y
1
+ +Y
n
.
What is our best guess of S
n+k
, given the values of S
1
, . . . , S
n
(11)
E(S
n+k
[ S
1
, . . . , S
n
) = E(Y
1
+ +Y
n
[ S
1
, . . . , S
n
)
+E(Y
n+1
+ +Y
n+k
[ S
1
, . . . , S
n
)
= Y
1
+ +Y
n
+E(Y
n+1
+ +Y
n+k
)
. ¸¸ .
=0
= S
n
.
32
Thus the best estimate of the “future value” of S
n+k
, given the history up to time n, is
just S
n
.
If we interpret Y
i
as the payoﬀ of a “fair” gambling game at time i, and therefore S
n
as the total winnings at time n, the calculation above says that at any time one’s future
expected winnings, given the winnings to date, is just the current amount of money. So the
formula (11) characterizes a “fair” game.
We incorporate these ideas into a formal deﬁnition:
DEFINITION. Let X
1
, . . . , X
n
, . . . be a sequence of real-valued random variables, with
E([X
i
[) < ∞ (i = 1, 2, . . . ). If
X
k
= E(X
j
[ X
1
, . . . , X
k
) a.s. for all j ≥ k,
we call ¦X
i
¦

i=1
a (discrete) martingale.
DEFINITION. Let X() be a real–valued stochastic process. Then
|(t) := |(X(s) [ 0 ≤ s ≤ t),
the σ-algebra generated by the random variables X(s) for 0 ≤ s ≤ t, is called the history
of the process until (and including) time t ≥ 0.
DEFINITIONS. Let X() be a stochastic process, such that E([X(t)[) < ∞for all t ≥ 0.
(i) If
X(s) = E(X(t) [ |(s)) a.s. for all t ≥ s ≥ 0,
then X() is called a martingale.
(ii) If
X(s) ≤ E(X(t) [ |(s)) a.s. for all t ≥ s ≥ 0,
X() is a submartingale.
Example. Let W() be a 1-dimensional Wiener process, as deﬁned later in Chapter 3.
Then
W() is a martingale.
To see this, write J(t) := |(W(s)[ 0 ≤ s ≤ t), and let t ≥ s. Then
E(W(t) [ J(s)) = E(W(t) −W(s) [ J(s)) +E(W(s) [ J(s))
= E(W(t) −W(s)) +W(s) = W(s) a.s.
(The reader should refer back to this calculation after reading Chapter 3.)
33
LEMMA. Suppose X() is a real-valued martingale and Φ : R → R is convex. Then if
E([Φ(X(t))[) < ∞ for all t ≥ 0,
Φ(X()) is a submartingale.
We omit the proof, which uses Jensen’s inequality.
Martingales are important in probability theory mainly because they admit the following
powerful estimates:
THEOREM (Discrete martingale inequalities).
(i) If ¦X
n
¦

n=1
is a submartingale, then
P
_
max
1≤k≤n
X
k
≥ λ
_

1
λ
E(X
+
n
)
for all n = 1, . . . and λ > 0.
(ii) If ¦X
n
¦

n=1
is a martingale and 1 < p < ∞, then
E
_
max
1≤k≤n
[X
k
[
p
_

_
p
p −1
_
p
E([X
n
[
p
)
for all n = 1, . . . .
A proof is provided in Appendix B. Notice that (i) is a generalization of the Chebyshev
inequality. We can also extend these estimates to continuous–time martingales.
THEOREM (Martingale inequalities). Let X() be a stochastic process with contin-
uous sample paths a.s.
(i) If X() is a submartingale, then
P
_
max
0≤s≤t
X(s) ≥ λ
_

1
λ
E(X(t)
+
) for all λ > 0, t ≥ 0.
(ii) If X() is a martingale and 1 < p < ∞, then
E
_
max
0≤s≤t
[X(s)[
p
_

_
p
p −1
_
p
E([X(t)[
p
).
Outline of Proof. Choose λ > 0, t > 0 and select 0 = t
0
< t
1
< < t
n
= t. We check
that ¦X(t
i

n
i=1
is a martingale and apply the discrete martingale inequality. Next choose
a ﬁner and ﬁner partition of [0, t] and pass to limits.
The proof of assertion (ii) is similar.

34
CHAPTER 3: BROWNIAN MOTION AND “WHITE NOISE”.
A. Motivation and deﬁnitions
B. Construction of Brownian motion
C. Sample paths
D. Markov property
A. MOTIVATION AND DEFINITIONS.
SOME HISTORY. R. Brown in 1826–27 observed the irregular motion of pollen particles
suspended in water. He and others noted that
• the path of a given particle is very irregular, having a tangent at no point, and
• the motions of two distinct particles appear to be independent.
In 1900 L. Bachelier attempted to describe ﬂuctuations in stock prices mathematically
and essentially discovered ﬁrst certain results later rederived and extended by A. Einstein
in 1905. Einstein studied the Brownian phenomena this way. Let us consider a long, thin
tube ﬁlled with clear water, into which we inject at time t = 0 a unit amount of ink, at
the location x = 0. Now let f(x, t) denote the density of ink particles at position x ∈ R
and time t ≥ 0. Initially we have
f(x, 0) = δ
0
, the unit mass at 0.
Next, suppose that the probability density of the event that an ink particle moves from x
to x +y in (small) time τ is ρ(τ, y). Then
(1)
f(x, t +τ) =
_

−∞
f(x −y, t)ρ(τ, y) dy
=
_

−∞
_
f −f
x
y +
1
2
f
xx
y
2
+. . .
_
ρ(τ, y) dy.
But since ρ is a probability density,
_

−∞
ρ dy = 1; whereas ρ(τ, −y) = ρ(τ, y) by symmetry.
Consequently
_

−∞
yρ dy = 0. We further assume that
_

−∞
y
2
ρ dy, the variance of ρ, is
linear in τ:
_

−∞
y
2
ρ dy = Dτ, D > 0.
We insert these identities into (1), thereby to obtain
f(x, t +τ) −f(x, t)
τ
=
Df
xx
(x, t)
2
¦+ lower order terms¦.
35
Sending now τ → 0, we discover
f
t
=
D
2
f
xx
This is the diﬀusion equation, also known as the heat equation. This partial diﬀerential
equation, with the initial condition f(x, 0) = δ
0
, has the solution
f(x, t) =
1
(2πDt)
1/2
e

x
2
2Dt
.
This says the probability density at time t is N(0, Dt), for some constant D.
In fact, Einstein computed:
D =
RT
N
A
f
, where
_
¸
¸
¸
_
¸
¸
¸
_
R = gas constant
T = absolute temperature
f = friction coeﬃcient
N
A
= Avagodro’s number.
This equation and the observed properties of Brownian motion allowed J. Perrin to com-
pute N
A
(≈ 610
23
= the number of molecules in a mole) and help to conﬁrm the atomic
theory of matter.
N. Wiener in the 1920’s (and later) put the theory on a ﬁrm mathematical basis. His
ideas are at the heart of the mathematics in ¸B–D below.
RANDOM WALKS. A variant of Einstein’s argument follows. We introduce a 2-
dimensional rectangular lattice, comprising the sites ¦(m∆x, n∆t) [ m = 0, ±1, ±2, . . . ; n =
0, 1, 2, . . . ¦. Consider a particle starting at x = 0 and time t = 0, and at each time n∆t
moves to the left an amount ∆x with probability 1/2, to the right an amount ∆x with
probability 1/2. Let p(m, n) denote the probability that the particle is at position m∆x
at time n∆t. Then
p(m, 0) =
_
0 m ,= 0
1 m = 0.
Also
p(m, n + 1) =
1
2
p(m−1, n) +
1
2
p(m+ 1, n),
and hence
p(m, n + 1) −p(m, n) =
1
2
(p(m+ 1, n) −2p(m, n) +p(m−1, n)).
Now assume
(∆x)
2
∆t
= D for some positive constant D.
This implies
p(m, n + 1) −p(m, n)
∆t
=
D
2
_
p(m+ 1, n) −2p(m, n) +p(m−1, n)
(∆x)
2
_
.
36
Let ∆t → 0, ∆x → 0, m∆x → x, n∆t → t, with
(∆x)
2
∆t
≡ D. Then presumably
p(m, n) → f(x, t), which we now interpret as the probability density that particle is at x
at time t. The above diﬀerence equation becomes formally in the limit
f
t
=
D
2
f
xx
,
and so we arrive at the diﬀusion equation again.
MATHEMATICAL JUSTIFICATION. A more careful study of this technique of
passing to limits with random walks on a lattice depends upon the Laplace–De Moivre
Theorem.
As above we assume the particle moves to the left or right a distance ∆x with probability
1/2. Let X(t) denote the position of particle at time t = n∆t (n = 0, . . . ). Deﬁne
S
n
:=
n

i=1
X
i
,
where the X
i
are independent random variables such that
_
P(X
i
= 0) = 1/2
P(X
i
= 1) = 1/2
for i = 1, . . . . Then V (X
i
) =
1
4
.
Now S
n
is the number of moves to the right by time t = n∆t. Consequently
X(t) = S
n
∆x + (n −S
n
)(−∆x) = (2S
n
−n)∆x.
Note also
V (X(t)) = (∆x)
2
V (2S
n
−n)
= (∆x)
2
4V (S
n
) = (∆x)
2
4nV (X
1
)
= (∆x)
2
n =
(∆x)
2
∆t
t.
Again assume
(∆x)
2
∆t
= D. Then
X(t) = (2S
n
−n)∆x =
_
S
n

n
2
_
n
4
_

n∆x =
_
S
n

n
2
_
n
4
_

tD.
Then Laplace–De Moivre Theorem thus implies
lim
n→∞
t=n∆t,
(∆x)
2
∆t
=D
P(a ≤ X(t) ≤ b) = lim
n→∞
_
a

tD

S
n

n
2
_
n
4

b

tD
_
=
1

_ b

tD
a

tD
e

x
2
2
dx
=
1

2πDt
_
b
a
e

x
2
2Dt
dx.
37
Once again, and rigorously this time, we obtain the N(0, Dt) distribution.
Inspired by all these considerations, we now introduce Brownian motion, for which we
take D = 1:
DEFINITION. A real-valued stochastic process W() is called a Brownian motion or
Wiener process if
(i) W(0) = 0 a.s.,
(ii) W(t) −W(s) is N(0, t −s) for all t ≥ s ≥ 0,
(iii) for all times 0 < t
1
< t
2
< < t
n
, the random variables W(t
1
), W(t
2
) −
W(t
1
), . . . , W(t
n
) −W(t
n−1
) are independent (“independent increments”).
Notice in particular that
E(W(t)) = 0, E(W
2
(t)) = t for each time t ≥ 0.
The Central Limit Theorem Further provides some further motivation for our deﬁnition
of Brownian motion, since we can expect that any suitably scaled sum of independent,
random disturbances aﬀecting the postion of a moving particle will result in a Gaussian
distribution.
B. CONSTRUCTION OF BROWNIAN MOTION.
COMPUTATION OF JOINT PROBABILITIES. From the deﬁnition we know
that if W() is a Brownian motion, then for all t > 0 and a ≤ b,
P(a ≤ W(t) ≤ b) =
1

2πt
_
b
a
e

x
2
2t
dx,
since W(t) is N(0, t).
Suppose we now choose times 0 < t
1
< < t
n
and real numbers a
i
≤ b
i
, for i =
1, . . . , n. What is the joint probability
P(a
1
≤ W(t
1
) ≤ b
1
, , a
n
≤ W(t
n
) ≤ b
n
)?
In other words, what is the probability that a sample path of Brownian motion takes values
between a
i
and b
i
at time t
i
for each i = 1, . . . n?
38
a
2
b
2
a
1
b
1
a
3
b
3
a
4
b
4
a
5
b
5
t
1
t
2
t
3
t
5
t
4
We can guess the answer as follows. We know
P(a
1
≤ W(t
1
) ≤ b
1
) =
_
b
1
a
1
e

x
2
1
2t
1

2πt
1
dx
1
;
and given that W(t
1
) = x
1
, a
1
≤ x
1
≤ b
1
, then presumably the process is N(x
1
, t
2
− t
1
)
on the interval [t
1
, t
2
]. Thus the probability that a
2
≤ W(t
2
) ≤ b
1
, given that W(t
1
) = x
1
,
should equal
_
b
2
a
2
1
_
2π(t
2
−t
1
)
e

|x
2
−x
1
|
2
2(t
2
−t
1
)
dx
2
.
Hence it should be that
P(a
1
≤ W(t
1
) ≤ b
1
, a
2
≤ W(t
2
) ≤ b
2
) =
_
b
1
a
1
_
b
2
a
2
g(x
1
, t
1
[ 0)g(x
2
, t
2
−t
1
[ x
1
) dx
2
dx
1
for
g(x, t [ y) :=
1

2πt
e

(x−y)
2
2t
.
In general, we would therefore guess that
(2)
P(a
1
≤ W(t
1
) ≤ b
1
, . . . , a
n
≤ W(t
n
) ≤ b
n
) =
_
b
1
a
1

_
b
n
a
n
g(x
1
, t
1
[ 0)g(x
2
, t
2
−t
1
[ x
1
) . . . g(x
n
, t
n
−t
n−1
[ x
n−1
) dx
n
. . . dx
1
.
The next assertion conﬁrms and extends this formula.
THEOREM. Let W() be a one-dimensional Wiener process. Then for all positive in-
tegers n, all choices of times 0 = t
0
< t
1
< < t
n
and each function f : R
n
→ R, we
have
Ef(W(t
1
), . . . , W(t
n
)) =
_

−∞

_

−∞
f(x
1
, . . . , x
n
)g(x
1
, t
1
[ 0)g(x
2
, t
2
−t
1
[ x
1
)
. . . g(x
n
, t
n
−t
n−1
[ x
n−1
) dx
n
. . . dx
1
.
39
Our taking
f(x
1
, . . . , x
n
) = χ
[a
1
,b
1
]
(x
1
) χ
[a
n
,b
n
]
(x
n
)
gives (2).
Proof. Let us write X
i
:= W(t
i
), Y
i
:= X
i
−X
i−1
for i = 1, . . . , n. We also deﬁne
h(y
1
, y
2
, . . . , y
n
) := f(y
1
, y
1
+y
2
, . . . , y
1
+ +y
n
).
Then
Ef(W(t
1
), . . . , W(t
n
)) = Eh(Y
1
, . . . , Y
n
)
=
_

−∞

_

−∞
h(y
1
, . . . , y
n
)g(y
1
, t
1
[ 0)g(y
2
, t
2
−t
1
[ 0)
. . . g(y
n
, t
n
−t
n−1
[ 0)dy
n
. . . dy
1
=
_

−∞

_

−∞
f(x
1
, . . . , x
n
)g(x
1
, t
1
[ 0)g(x
2
, t
2
−t
1
[ x
1
)
. . . g(x
n
, t
n
−t
n−1
[ x
n−1
) dx
n
. . . dx
1
.
For the second equality we recalled that the random variables Y
i
= W(t
i
) − W(t
i−1
) are
independent for i = 1, . . . , n, and that each Y
i
is N(0, t
i
−t
i−1
). We also changed variables
using the identities y
i
= x
i
− x
i−1
for i = 1, . . . , n and x
0
= 0. The Jacobian for this
change of variables equals 1.
BUILDING A ONE-DIMENSIONAL WIENER PROCESS. The main issue
now is to demonstrate that a Brownian motion actually exists.
Our method will be to develop a formal expansion of white noise ξ() in terms of a clev-
erly selected orthonormal basis of L
2
(0, 1), the space of all real-valued, square–integrable
funtions deﬁned on (0, 1) . We will then integrate the resulting expression in time, show
that this series converges, and prove then that we have built a Wiener process. This
procedure is a form of “wavelet analysis”: see Pinsky [P].
LEMMA. Suppose W() is a one-dimensional Brownian motion. Then
E(W(t)W(s)) = t ∧ s = min¦s, t¦ for t ≥ 0, s ≥ 0.
Proof. Assume t ≥ s ≥ 0. Then
E(W(t)W(s)) = E((W(s) +W(t) −W(s))W(s))
= E(W
2
(s)) +E((W(t) −W(s))W(s))
= s +E(W(t) −W(s))
. ¸¸ .
=0
E(W(s))
. ¸¸ .
=0
= s = t ∧ s,
40
since W(s) is N(0, s) and W(t) −W(s) is independent of W(s).
HEURISTICS. Remember from Chapter 1 that the formal time-derivative
˙
W(t) =
dW(t)
dt
= ξ(t)
is “1-dimensional white noise”. As we will see later however, for a.e. ω the sample path
t → W(t, ω) is in fact diﬀerentiable for no time t ≥ 0. Thus
˙
W(t) = ξ(t) does not really
exist.
However, we do have the heuristic formula
(3) “E(ξ(t)ξ(s)) = δ
0
(s −t)”,
where δ
0
is the unit mass at 0. A formal “proof” is this. Suppose h > 0, ﬁx t > 0, and set
φ
h
(s) := E
__
W(t +h) −W(t)
h
__
W(s +h) −W(s)
h
__
=
1
h
2
[E(W(t +h)W(s +h)) −E(W(t +h)W(s)) −E(W(t)W(s +h)) +E(W(t)W(s))]
=
1
h
2
[((t +h) ∧ (s +h)) −((t +h) ∧ s) −(t ∧ (s +h)) + (t ∧ s)].
t-h
t+h
t
graph of φ
h
height = 1/h
Then φ
h
(s) → 0 as h → 0, t ,= s. But
_
φ
h
(s) ds = 1, and so presumably φ
h
(s) →
δ
0
(s −t) in some sense, as h → 0. In addition, we expect that φ
h
(s) → E(ξ(t)ξ(s)). This
gives the formula (3) above.
Remark: Why
˙
W() = ξ() is called white noise. If X() is any real-valued stochastic
process with E(X
2
(t)) < ∞ for all t ≥ 0, we deﬁne
r(t, s) := E(X(t)X(s)) (t, s ≥ 0),
the autocorrelation function of X(). If r(t, s) = c(t −s) for some function c : R → R and
if E(X(t)) = E(X(s)) for all t, s ≥ 0, X() is called stationary in the wide sense. A white
noise process ξ() is by deﬁnition Gaussian, wide sense stationary, with c() = δ
0
.
41
In general we deﬁne
f(λ) :=
1

_

−∞
e
−iλt
c(t) dt (λ ∈ R)
to be the spectral density of a process X(). For white noise, we have
f(λ) =
1

_

−∞
e
−iλt
δ
0
dt =
1

for all λ.
Thus the spectral density of ξ() is ﬂat; that is, all “frequencies” contribute equally in
the correlation function, just as—by analogy—all colors contribute equally to make white
light.
RANDOM FOURIER SERIES. Suppose now ¦ψ
n
¦

n=0
is a complete, orthonormal
basis of L
2
(0, 1), where ψ
n
= ψ
n
(t) are functions of 0 ≤ t ≤ 1 only and so are not random
variables. The orthonormality means that
_
1
0
ψ
n
(s)ψ
m
(s) ds = δ
mn
for all m, n.
We write formally
(4) ξ(t) =

n=0
A
n
ψ
n
(t) (0 ≤ t ≤ 1).
It is easy to see that then
A
n
=
_
1
0
ξ(t)ψ
n
(t) dt.
We expect that the A
n
are independent and Gaussian, with E(A
n
) = 0. Therefore to be
consistent we must have for m ,= n
0 = E(A
n
)E(A
m
) = E(A
n
A
m
) =
_
1
0
_
1
0
E(ξ(t)ξ(s))ψ
n
(t)ψ
m
(s) dtds
=
_
1
0
_
1
0
δ
0
(s −t)ψ
n
(t)ψ
m
(s) dtds by (3)
=
_
1
0
ψ
n
(s)ψ
m
(s) ds.
But this is already automatically true as the ψ
n
are orthogonal. Similarly,
E(A
2
n
) =
_
1
0
ψ
2
n
(s) ds = 1.
42
Consequently if the A
n
are independent and N(0, 1), it is reasonable to believe that formula
(4) makes sense. But then the Brownian motion W() should be given by
(5) W(t) :=
_
t
0
ξ(s) ds =

n=0
A
n
_
t
0
ψ
n
(s) ds.
This seems to be true for any orthonormal basis, and we will next make this rigorous by
choosing a particularly nice basis.
L
´
EVY–CIESIELSKI CONSTRUCTION OF BROWNIAN MOTION
DEFINITION. The family ¦h
k
()¦

k=0
of Haar functions are deﬁned for 0 ≤ t ≤ 1 as
follows:
h
0
(t) := 1 for 0 ≤ t ≤ 1.
h
1
(t) :=
_
1 for 0 ≤ t ≤
1
2
−1 for
1
2
< t ≤ 1.
If 2
n
≤ k < 2
n+1
, n = 1, 2, . . . , we set
h
k
(t) :=
_
¸
_
¸
_
2
n/2
for
k−2
n
2
n
≤ t ≤
k−2
n
+1/2
2
n
−2
n/2
for
k−2
n
+1/2
2
n
< t ≤
k−2
n
+1
2
n
0 otherwise.
graph of h
k
height = 2
n/2
width = 2
-(n+1)
Graph of a Haar function
43
LEMMA 1. The functions ¦h
k
()¦

k=0
form a complete, orthonormal basis of L
2
(0, 1).
Proof. 1. We have
_
1
0
h
2
k
dt = 2
n
_
1
2
n+1
+
1
2
n+1
_
= 1.
Note also that for all l > k, either h
k
h
l
= 0 for all t or else h
k
is constant on the support
of h
l
. In this second case
_
1
0
h
l
h
k
dt = ±2
n/2
_
1
0
h
l
dt = 0.
2. Suppose f ∈ L
2
(0, 1),
_
1
0
fh
k
dt = 0 for all k = 0, 1, . . . . We will prove f = 0 almost
everywhere.
If n = 0, we have
_
1
0
f dt = 0. Let n = 1. Then
_
1/2
0
f dt =
_
1
1/2
f dt; and both
are equal to zero, since 0 =
_
1/2
0
f dt +
_
1
1/2
f dt =
_
1
0
f dt. Continuing in this way, we
deduce
_
k+1
2
n+1
k
2
n+1
f dt = 0 for all 0 ≤ k < 2
n+1
. Thus
_
r
s
f dt = 0 for all dyadic rationals
0 ≤ s ≤ r ≤ 1, and so for all 0 ≤ s ≤ r ≤ 1. But
f(r) =
d
dr
_
r
0
f(t) dt = 0 a.e. r.

DEFINITION. For k = 1, 2, . . . ,
s
k
(t) :=
_
t
0
h
k
(s) ds (0 ≤ t ≤ 1)
is the k
th
–Schauder function.
graph of s
k
height = 2
-(n+2)/2
width = 2
-n
Graph of a Schauder function
The graph of s
k
is a “tent” of height 2
−n/2−1
, lying above the interval [
k−2
n
2
n
,
k−2
n
+1
2
n
].
Consequently if 2
n
≤ k < 2
n+1
, then
max
0≤t≤1
[s
k
(t)[ = 2
−n/2−1
.
44
Our goal is to deﬁne
W(t) :=

k=0
A
k
s
k
(t)
for times 0 ≤ t ≤ 1, where the coeﬃcients ¦A
k
¦

k=0
are independent, N(0, 1) random
variables deﬁned on some probability space.
We must ﬁrst of all check whether this series converges.
LEMMA 2. Let ¦a
k
¦

k=0
be a sequence of real numbers such that
[a
k
[ = O(k
δ
) as k → ∞
for some 0 ≤ δ < 1/2. Then the series

k=0
a
k
s
k
(t)
converges uniformly for 0 ≤ t ≤ 1.
Proof. Fix ε > 0. Notice that for 2
n
≤ k < 2
n+1
, the functions s
k
() have disjoint supports.
Set
b
n
:= max
2
n
≤k<2
n+1
[a
k
[ ≤ C(2
n+1
)
δ
.
Then for 0 ≤ t ≤ 1,

k=2
m
[a
k
[[s
k
(t)[ ≤

n=m
b
n
max
2
n
≤k<2
n+1
0≤t≤1
[s
k
(t)[
≤ C

n=m
(2
n+1
)
δ
2
−n/2−1
< ε
for m large enough, since 0 ≤ δ < 1/2.
LEMMA 3. Suppose ¦A
k
¦

k=1
are independent, N(0, 1) random variables. Then for al-
most every ω,
[A
k
(ω)[ = O(
_
log k) as k → ∞.
In particular, the numbers ¦A
k
(ω)¦

k=1
almost surely satisfy the hypothesis of Lemma
2 above.
Proof. For all x > 0, k = 2, . . . , we have
P([A
k
[ > x) =
2

_

x
e

s
2
2
ds

2

e

x
2
4
_

x
e

s
2
4
ds
≤ Ce

x
2
4
,
45
for some constant C. Set x := 4

log k; then
P([A
k
[ ≥ 4
_
log k) ≤ Ce
−4 log k
= C
1
k
4
.
Since

1
k
4
< ∞, the Borel–Cantelli Lemma implies
P([A
k
[ ≥ 4
_
log k i.o.) = 0.
Therefore for almost every sample point ω, we have
[A
k
(ω)[ ≤ 4
_
log k provided k ≥ K,
where K depends on ω.
LEMMA 4.

k=0
s
k
(s)s
k
(t) = t ∧ s for each 0 ≤ s, t ≤ 1.
Proof. Deﬁne for 0 ≤ s ≤ 1,
φ
s
(τ) :=
_
1 0 ≤ τ ≤ s
0 s < τ ≤ 1.
Then if s ≤ t, Lemma 1 implies
s =
_
1
0
φ
t
φ
s
dτ =

k=0
a
k
b
k
,
where
a
k
=
_
1
0
φ
t
h
k
dτ =
_
t
0
h
k
dτ = s
k
(t), b
k
=
_
1
0
φ
s
h
k
dτ = s
k
(s).

THEOREM. Let ¦A
k
¦

k=0
be a sequence of independent, N(0, 1) random variables de-
ﬁned on the same probability space. Then the sum
W(t, ω) :=

k=0
A
k
(ω)s
k
(t) ( 0 ≤ t ≤ 1)
converges uniformly in t, for a.e. ω. Furthermore
(i) W() is a Brownian motion for 0 ≤ t ≤ 1, and
(ii) for a.e. ω, the sample path t → W(t, ω) is continuous.
46
Proof. 1. The uniform convergence is a consequence of Lemmas 2 and 3; this implies (ii).
2. To prove W() is a Brownian motion, we ﬁrst note that clearly W(0) = 0 a.s.
We assert as well that W(t) −W(s) is N(0, t −s) for all 0 ≤ s ≤ t ≤ 1. To prove this,
let us compute
E(e
iλ(W(t)−W(s))
) = E(e

k=0
A
k
(s
k
(t)−s
k
(s))
)
=

k=0
E(e
iλA
k
(s
k
(t)−s
k
(s))
) by independence
=

k=0
e

λ
2
2
(s
k
(t)−s
k
(s))
2
since A
k
is N(0, 1)
= e

λ
2
2

k=0
(s
k
(t)−s
k
(s))
2
= e

λ
2
2

k=0
s
2
k
(t)−2s
k
(t)s
k
(s)+s
2
k
(s)
= e

λ
2
2
(t−2s+s)
by Lemma 4
= e

λ
2
2
(t−s)
.
By uniqueness of characteristic functions, the increment W(t) − W(s) is N(0, t − s), as
asserted.
3. Next we claim for all m = 1, 2, . . . and for all 0 = t
0
< t
1
< < t
m
≤ 1, that
(6) E(e
i

m
j=1
λ
j
(W(t
j
)−W(t
j−1
))
) =
m

j=1
e

λ
2
j
2
(t
j
−t
j−1
)
.
Once this is proved, we will know from uniqueness of characteristic functions that
F
W(t
1
),...,W(t
m
)−W(t
m−1
)
(x
1
, . . . , x
m
) = F
W(t
1
)
(x
1
) F
W(t
m
)−W(t
m−1
)
(x
m
)
for all x
1
, . . . x
m
∈ R. This proves that
W(t
1
), . . . , W(t
m
) −W(t
m−1
) are independent.
Thus (6) will establish the Theorem.
47
Now in the case m = 2, we have
E(e
i[λ
1
W(t
1
)+λ
2
(W(t
2
)−W(t
1
))]
) = E(e
i[(λ
1
−λ
2
)W(t
1
)+λ
2
W(t
2
)]
)
= E(e
i(λ
1
−λ
2
)

k=0
A
k
s
k
(t
1
)+iλ
2

k=0
A
k
s
k
(t
2
)
)
=

k=0
E(e
iA
k
[(λ
1
−λ
2
)s
k
(t
1
)+λ
2
s
k
(t
2
)]
)
=

k=0
e

1
2
((λ
1
−λ
2
)s
k
(t
1
)+λ
2
s
k
(t
2
))
2
= e

1
2

k=0

1
−λ
2
)
2
s
2
k
(t
1
)+2(λ
1
−λ
2

2
s
k
(t
1
)s
k
(t
2
)+λ
2
2
s
2
k
(t
2
)
= e

1
2
[(λ
1
−λ
2
)
2
t
1
+2(λ
1
−λ
2

2
t
1

2
2
t
2
]
by Lemma 4
= e

1
2

2
1
t
1

2
2
(t
2
−t
1
)]
.
This is (6) for m = 2, and the general case follows similarly.
THEOREM (Existence of one-dimensional Brownian motion). Let (Ω, |, P) be a
probability space on which countably many N(0, 1), independent random variables ¦A
n
¦

n=1
are deﬁned. Then there exists a 1-dimensional Brownian motion W() deﬁned for ω ∈ Ω,
t ≥ 0.
Outline of proof. The theorem above demonstrated how to build a Brownian motion on
0 ≤ t ≤ 1. As we can reindex the N(0, 1) random variables to obtain countably many
families of countably many random variables, we can therefore build countably many
independent Brownian motions W
n
(t) for 0 ≤ t ≤ 1.
We assemble these inductively by setting
W(t) := W(n −1) +W
n
(t −(n −1)) for n −1 ≤ t ≤ n.
Then W() is a one-dimensional Brownian motion, deﬁned for all times t ≥ 0.
This theorem shows we can construct a Brownian motion deﬁned on any probability
space on which there exist countably many independent N(0, 1) random variables.
We mostly followed Lamperti [L1] for the foregoing theory.
3. BROWNIAN MOTION IN R
n
.
It is straightforward to extend our deﬁnitions to Brownian motions taking values in R
n
.
48
DEFINITION. An R
n
-valued stochastic process W() = (W
1
(), . . . , W
n
()) is an n-
dimensional Wiener process (or Brownian motion) provided
(i) for each k = 1, . . . , n, W
k
() is a 1-dimensional Wiener process,
and
(ii) the σ-algebras J
k
:= |(W
k
(t) [ t ≥ 0) are independent, k = 1, . . . , n.
By the arguments above we can build a probability space and on it n independent 1-
dimensional Wiener processes W
k
() (k = 1, . . . , n). Then W() := (W
1
(), . . . , W
n
()) is
an n-dimensional Brownian motion.
LEMMA. If W() is an n-dimensional Wiener process, then
(i) E(W
k
(t)W
l
(s)) = (t ∧ s)δ
kl
(k, l = 1, . . . , n),
(ii) E((W
k
(t) −W
k
(s))(W
l
(t) −W
l
(s))) = (t −s)δ
kl
(k, l = 1, . . . , n; t ≥ s ≥ 0.)
Proof. If k ,= l, E(W
k
(t)W
l
(s)) = E(W
k
(t))E(W
l
(s)) = 0, by independence. The proof
of (ii) is similar.
THEOREM. (i) If W() is an n-dimensional Brownian motion, then W(t) is N(0, tI)
for each time t > 0. Therefore
P(W(t) ∈ A) =
1
(2πt)
n/2
_
A
e

|x|
2
2t
dx
for each Borel subset A ⊆ R
n
.
(ii) More generally, for each m = 1, 2, . . . and each function f : R
n
R
n
R
n
→R,
we have
(7)
Ef(W(t
1
), . . . , W(t
m
)) =
_
R
n

_
R
n
f(x
1
, . . . , x
m
)g(x
1
, t
1
[ 0)g(x
2
, t
2
−t
1
[ x
1
)
. . . g(x
m
, t
m
−t
m−1
[ x
m−1
) dx
m
. . . dx
1
.
where
g(x, t [ y) :=
1
(2πt)
n/2
e

|x−y|
2
2t
.
Proof. For each time t > 0, the random variables W
1
(t), . . . , W
n
(t) are independent.
Consequently for each point x = (x
1
, . . . , x
n
) ∈ R
n
, we have
f
W(t)
(x
1
, . . . , x
n
) = f
W
1
(t)
(x
1
) f
W
n
(t)
(x
n
)
=
1
(2πt)
1/2
e

x
2
1
2t

1
(2πt)
1/2
e

x
2
n
2t
=
1
(2πt)
n/2
e

|x|
2
2t
= g(x, t [ 0).
We prove formula (7) as in the one-dimensional case.
49
C. SAMPLE PATH PROPERTIES.
In this section we will demonstrate that for almost every ω, the sample path t → W(t, ω)
is uniformly H¨ older continuous for each exponent γ <
1
2
, but is nowhere H¨ older continuous
with any exponent γ >
1
2
. In particular t → W(t, ω) almost surely is nowhere diﬀerentiable
and is of inﬁnite variation for each time interval.
DEFINITIONS. (i) Let 0 < γ ≤ 1. A function f : [0, T] →R is called uniformly H¨ older
continuous with exponent γ > 0 if there exists a constant K such that
[f(t) −f(s)[ ≤ K[t −s[
γ
for all s, t ∈ [0, T].
(ii) We say f is H¨older continuous with exponent γ > 0 at the point s if there exists a
constant K such that
[f(t) −f(s)[ ≤ K[t −s[
γ
for all t ∈ [0, T].
1. CONTINUITY OF SAMPLE PATHS.
A good general theorem to prove H¨ older continuity is this important theorem of Kol-
mogorov:
THEOREM. Let X() be a stochastic process with continuous sample paths a.s., such
that
E([X(t) −X(s)[
β
) ≤ C[t −s[
1+α
for constants β, α > 0, C ≥ 0 and for all 0 ≤ t, s.
Then for each 0 < γ <
α
β
, T > 0, and almost every ω, there exists a constant K =
K(ω, γ, T) such that
[X(t, ω) −X(s, ω)[ ≤ K[t −s[
γ
for all 0 ≤ s, t ≤ T.
Hence the sample path t → X(t, ω) is uniformly H¨ older continuous with exponent γ on
[0, T].
APPLICATION TO BROWNIAN MOTION. Consider W(), an n-dimensional
Brownian motion. We have for all integers m = 1, 2, . . .
E([W(t) −W(s)[
2m
) =
1
(2πr)
n/2
_
R
n
[x[
2m
e

|x|
2
2r
dx for r = t −s > 0
=
1
(2π)
n/2
r
m
_
R
n
[y[
2m
e

|y|
2
2
dy
_
y =
x

r
_
= Cr
m
= C[t −s[
m
.
50
Thus the hypotheses of Kolmogorov’s theorem hold for β = 2m, α = m−1. The process
W() is thus H¨ older continuous a.s. for exponents
0 < γ <
α
β
=
1
2

1
2m
for all m.
Thus for almost all ω and any T > 0, the sample path t → W(t, ω) is uniformly H¨ older
continuous on [0, T] for each exponent 0 < γ < 1/2.
Proof. 1. For simplicity, take T = 1. Pick any
(8) 0 < γ <
α
β
.
Now deﬁne for n = 1, . . . ,
A
n
:=

¸
¸
¸
X(
i + 1
2
n
) −X(
i
2
n
)
¸
¸
¸
¸
>
1
2

for some integer 0 ≤ i < 2
n
_
.
Then
P(A
n
) ≤
2
n
−1

i=0
P

¸
¸
¸
X(
i + 1
2
n
) −X(
i
2
n
)
¸
¸
¸
¸
>
1
2

_

2
n
−1

i=0
E
_
¸
¸
¸
¸
X(
i + 1
2
n
) −X(
i
2
n
)
¸
¸
¸
¸
β
_
_
1
2

_
−β
by Chebyshev’s inequality
≤ C
2
n
−1

i=0
_
1
2
n
_
1+α
_
1
2

_
−β
= C2
n(−α+γβ)
.
Since (8) forces −α + γβ < 0, we deduce

n=1
P(A
n
) < ∞; whence the Borel–Cantelli
Lemma implies
P(A
n
i.o.) = 0.
So for a.e. ω there exists m = m(ω) such that
¸
¸
¸
¸
X(
i + 1
2
n
, ω) −X(
i
2
n
, ω)
¸
¸
¸
¸

1
2

for 0 ≤ i ≤ 2
n
−1
provided n ≥ m. But then we have
(9)
_
¸
¸
X(
i+1
2
n
, ω) −X(
i
2
n
, ω)
¸
¸
≤ K
1
2

for 0 ≤ i ≤ 2
n
−1
for all n ≥ 0,
if we select K = K(ω) large enough.
51
2.* We now claim (9) implies the stated H¨older continuity. To see this, ﬁx ω ∈ Ω for
which (9) holds. Let t
1
, t
2
∈ [0, 1] be dyadic rationals, 0 < t
2
−t
1
< 1. Select n ≥ 1 so that
(10) 2
−n
≤ t < 2
−(n−1)
for t := t
2
−t
1
.
We can write
_
t
1
=
i
2
n

1
2
p
1
− −
1
2
p
k
(n < p
1
< < p
k
)
t
2
=
j
2
n
+
1
2
q
1
+ +
1
2
q
l
(n < q
1
< < q
l
)
for
t
1

i
2
n

j
2
n
≤ t
2
.
Then
j −i
2
n
≤ t <
1
2
n−1
and so j = i or i + 1. In view of (9),
[X(i/2
n
, ω) −X(j/2
n
, ω)[ ≤ K
¸
¸
¸
¸
i −j
2
n
¸
¸
¸
¸
γ
≤ Kt
γ
.
Furthermore
[X(i/2
n
−1/2
p
1
− −1/2
p
r
, ω) −X(i/2
n
−1/2
p
1
− −1/2
p
r−1
, ω)[ ≤ K
¸
¸
¸
¸
1
2
p
r
¸
¸
¸
¸
γ
for r = 1, . . . , k; and consequently
[X(t
1
, ω) −X(i/2
n
, ω)[ ≤ K
k

r=1
¸
¸
¸
¸
1
2
p
r
¸
¸
¸
¸
γ

K
2

r=1
1
2

since p
r
> n
=
C
2

≤ Ct
γ
by (10).
In the same way we deduce
[X(t
2
, ω) −X(j/2
n
, ω)[ ≤ Ct
γ
.
Add up the estimates above, to discover
[X(t
1
, ω) −X(t
2
, ω)[ ≤ C[t
1
−t
2
[
γ
for all dynamic rationals t
1
, t
2
∈ [0, 1] and some constant C = C(ω). Since t → X(t, ω) is
continuous for a.e. ω, the estimate above holds for all t
1
, t
2
∈ [0, 1].
*Omit the second step in this proof on ﬁrst reading.
52
Remark. The proof above can in fact be modiﬁed to show that if X() is a stochastic
process such that
E([X(t) −X(s)[
β
) ≤ C[t −s[
1+α
(α, β > 0, C ≥ 0),
then X() has a version
˜
X() such that a.e. sample path is H¨ older continuous for each
exponent 0 < γ < α/β. (We call
˜
X() a version of X() if P(X(t) =
˜
X(t)) = 1 for all
t ≥ 0.)
So any Wiener process has a version with continuous sample paths a.s.
2. NOWHERE DIFFERENTIABILITY
Next we prove that sample paths of Brownian motion are with probability one nowhere
H¨older continuous with exponent greater than
1
2
, and thus are nowhere diﬀerentiable.
THEOREM. (i) For each
1
2
< γ ≤ 1 and almost every ω, t → W(t, ω) is nowhere H¨older
continuous with exponent γ.
(ii) In particular, for almost every ω, the sample path t → W(t, ω) is nowhere diﬀeren-
tiable and is of inﬁnite variation on each subinterval.
Proof. (Dvoretzky, Erd¨ os, Kakutani) 1. It suﬃces to consider a one-dimensional Brownian
motion, and we may for simplicity consider only times 0 ≤ t ≤ 1.
Fix an integer N so large that
N
_
γ −
1
2
_
> 1.
Now if the function t → W(t, ω) is H¨older continuous with exponent γ at some point
0 ≤ s < 1, then
[W(t, ω) −W(s, ω)[ ≤ K[t −s[
γ
for all t ∈ [0, 1] and some constant K.
For n ¸ 1, set i = [ns] + 1 and note that for j = i, i + 1, . . . , i +N −1
¸
¸
¸
¸
W(
j
n
, ω) −W(
j + 1
n
, ω)
¸
¸
¸
¸

¸
¸
¸
¸
W(s, ω) −W(
j
n
, ω)
¸
¸
¸
¸
+
¸
¸
¸
¸
W(s, ω) −W(
j + 1
n
, ω)
¸
¸
¸
¸
≤ K

¸
¸
¸
s −
j
n
¸
¸
¸
¸
γ
+
¸
¸
¸
¸
s −
j + 1
n
¸
¸
¸
¸
γ
_

M
n
γ
53
for some constant M. Thus
ω ∈ A
i
M,n
:=

¸
¸
¸
W(
j
n
) −W(
j + 1
n
)
¸
¸
¸
¸

M
n
γ
for j = i, . . . , i +N −1
_
for some 1 ≤ i ≤ n, some M ≥ 1, and all large n.
Therefore the set of ω ∈ Ω such that W(ω, ) is H¨older continuous with exponent γ at
some time 0 ≤ s < 1 is contained in

_
M=1

_
k=1

n=k
n
_
i=1
A
i
M,n
.
We will show this event has probability 0.
2. For all k and M,
P
_

n=k
n
_
i=1
A
i
M,n
_
≤ liminf
n→∞
P
_
n
_
i=1
A
i
M,n
_
≤ liminf
n→∞
n

i=1
P(A
i
M,n
)
≤ liminf
n→∞
n
_
P
_
[W(
1
n
)[ ≤
M
n
γ
__
N
,
since the random variables W(
j+1
n
) −W(
j
n
) are N
_
0,
1
n
_
and independent. Now
P
_
[W(
1
n
)[ ≤
M
n
γ
_
=

n

_
Mn
−γ
−Mn
−γ
e

nx
2
2
dx
=
1

_
Mn
1/2−γ
−Mn
1/2−γ
e

y
2
2
dy
≤ Cn
1/2−γ
.
We use this calculation to deduce:
P
_

n=k
n
_
i=1
A
i
M,n
_
≤ liminf
n→∞
nC[n
1/2−γ
]
N
= 0,
since N(γ −1/2) > 1. This holds for all k, M. Thus
P
_

_
M=1

_
k=1

n=k
n
_
i=1
A
i
M,n
_
= 0,
and assertion (i) of the Theorem follows.
54
3. If W(t, ω) is diﬀerentiable at s, then W(t, ω) would be H¨ older continuous (with
exponent 1) at s. But this is almost surely not so. If W(t, ω) were of ﬁnite variation on
some subinterval, it would then be diﬀerentiable almost everywhere there.
Interpretation. The idea underlying the proof is that if
[W(t, ω) −W(s, ω)[ ≤ K[t −s[
γ
for all t,
then
[W(
j
n
, ω) −W(
j + 1
n
, ω)[ ≤
M
n
γ
for all n ¸ 1 and at least N values of j. But these are independent events of small
probability. The probability that the above inequality holds for all these j’s is a small
number to the large power N, and is therefore extremely small.
A sample path of Brownian motion
D. MARKOV PROPERTY.
DEFINITION. If 1 is a σ-algebra, 1 ⊆ |, then
P(A[ 1) := E(χ
A
[ 1) for A ∈ |.
Therefore P(A[ 1) is a random variable, the conditional probability of A, given 1.
55
DEFINITION. If X() is a stochastic process, the σ-algebra
|(s) := |(X(r) [ 0 ≤ r ≤ s)
is called the history of the process up to and including time s.
We can informally interpret |(s) as recording the information available from our ob-
serving X(r) for all times 0 ≤ r ≤ s.
DEFINITION. An R
n
-valued stochastic process X() is called a Markov process if
P(X(t) ∈ B[ |(s)) = P(X(t) ∈ B[ X(s)) a.s.
for all 0 ≤ s ≤ t and all Borel subset B of R
n
.
The idea of this deﬁnition is that, given the current value X(s), you can predict the
probabilities of future values of X(t) just as well as if you knew the entire history of the
process before time s. Loosely speaking, the process only “knows” its value at time s and
does not “remember” how it got there.
THEOREM. Let W() be an n-dimensional Wiener process. Then W() is a Markov
process, and
(13) P(W(t) ∈ B[ W(s)) =
1
(2π(t −s))
n/2
_
B
e

|x−W(s)|
2
2(t−s)
dx a.s.
for all 0 ≤ s < t, and Borel sets B .
Note carefully that each side of this identity is a random variable.
Proof. We will only prove (13). Set
Φ(y) :=
1
(2π(t −s))
n/2
_
A
e

|x−y|
2
2(t−s)
dx.
As Φ(W(s)) is |(W(s)) measurable, we must show
(14)
_
C
χ
{W(t)∈A}
dP =
_
C
Φ(W(s)) dP for all C ∈ |(W(s)).
Now if C ∈ |(W(s)), then C = ¦W(s) ∈ B¦ for some Borel set B ⊆ R
n
. Hence
_
C
χ
{W(t)∈A}
dP = P(W(s) ∈ B, W(t) ∈ A)
=
_
B
_
A
g(y, s [ 0)g(x, t −s [ y) dxdy
=
_
B
g(y, s [ 0)Φ(y) dy.
56
On the other hand,
_
C
Φ(W(s))dP =
_

χ
B
(W(s))Φ(W(s)) dP
=
_
R
n
χ
B
(y)Φ(y)
e

|y|
2
2s
(2πs)
n/2
dy
=
_
B
g(y, s [ 0)Φ(y) dy,
and this last expression agrees with that above. This veriﬁes (14), and so establishes (13).

Interpretation. The Markov property partially explains the nondiﬀerentiability of sam-
ple paths for Brownian motion, as discussed before in ¸C.
If W(s, ω) = b, say, then the future behavior of W(t, ω) depends only upon this fact and
not on how W(t, ω) approached the point b as t → s

. Thus the path “cannot remember”
how to leave b in such a way that W(, ω) will have a tangent there.
57
CHAPTER 4: STOCHASTIC INTEGRALS, IT
ˆ
O’S FORMULA.
A. Motivation
B. Deﬁnition and properties of Itˆ o integral
C. Indeﬁnite Itˆ o integrals
D. Itˆ o’s formula
E. Itˆ o integral in n-dimensions
A. MOTIVATION.
Remember from Chapter 1 that we want to develop a theory of stochastic diﬀerential
equations of the form
(SDE)
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
,
which we will in Chapter 5 interpret to mean
(1) X(t) = X
0
+
_
t
0
b(X, s) ds +
_
t
0
B(X, s) dW
for all times t ≥ 0. But before we can study and solve such an integral equation, we must
ﬁrst deﬁne
_
T
0
GdW
for some wide class of stochastic processes G, so that the right-hand side of (1) is at least
makes sense. Observe also that this is not at all obvious. For instance, since t → W(t, ω)
is of inﬁnite variation for almost every ω, then
_
T
0
GdW simply cannot be understood as
an ordinary integral.
A FIRST DEFINITION. Suppose now n = m = 1. One possible deﬁnition is due to
Paley, Wiener and Zygmund [P-W-Z]. Suppose g : [0, 1] →R is continuously diﬀerentiable,
with g(0) = g(1) = 0. Note carefully: g is an ordinary, deterministic function and not a
stochastic process. Then let us deﬁne
_
1
0
g dW := −
_
1
0
g

W dt.
Note that
_
1
0
g dW is therefore a random variable. Let us check out the properties following
from this deﬁnition:
58
LEMMA (Properties of the Paley–Wiener–Zygmund integral).
(i) E
_
_
1
0
g dW
_
= 0.
(ii) E
_
_
_
1
0
g dW
_
2
_
=
_
1
0
g
2
dt.
Proof. 1. E
_
_
1
0
g dW
_
= −
_
1
0
g

E(W(t))
. ¸¸ .
=0
dt.
2. To conﬁrm (ii), we calculate
E
_
__
1
0
g dW
_
2
_
= E
__
1
0
g

(t)W(t) dt
_
1
0
g

(s)W(s) ds
_
=
_
1
0
_
1
0
g

(t)g

(s)E(W(t)W(s))
. ¸¸ .
=t∧s
dsdt
=
_
1
0
g

(t)
__
t
0
sg

(s) ds +
_
1
t
tg

(s) ds
_
dt
=
_
1
0
g

(t)
_
tg(t) −
_
t
0
g ds −tg(t)
_
dt
=
_
1
0
g

(t)
_

_
t
0
g ds
_
dt =
_
1
0
g
2
dt.

Discussion. Suppose now g ∈ L
2
(0, 1). We can take a sequence of C
1
functions g
n
, as
above, such that
_
1
0
(g
n
−g)
2
dt → 0. In view of property (ii),
E
_
__
1
0
g
m
dW −
_
1
0
g
n
dW
_
2
_
=
_
1
0
(g
m
−g
n
)
2
dt,
and therefore ¦
_
1
0
g
n
dW¦

n=1
is a Cauchy sequence in L
2
(Ω). Consequently we can deﬁne
_
1
0
g dW := lim
n→∞
_
1
0
g
n
dW.
The extended deﬁnition still satisﬁes properties (i) and (ii).
This is a reasonable deﬁnition of
_
1
0
g dW, except that this only makes sense for functions
g ∈ L
2
(0, 1), and not for stochastic processes. If we wish to deﬁne the integral in (1),
_
t
0
B(X, s) dW,
then the integrand B(X, t) is a stochastic process and the deﬁnition above will not suﬃce.
59
We must devise a deﬁnition for a wider class of integrands (although the deﬁnition we
ﬁnally decide on will agree with that of Paley, Wiener, Zygmund if g happens to be a
deterministic C
1
function, with g(0) = g(1) = 0).
RIEMANN SUMS. To continue our study of stochastic integrals with random inte-
grands, let us think about what might be an appropriate deﬁnition for
_
T
0
W dW = ?,
where W() is a 1-dimensional Brownian motion. A reasonable procedure is to construct
a Riemann sum approximation, and then–if possible–to pass to limits.
DEFINITIONS. (i) If [0, T] is an interval, a partition P of [0, T] is a ﬁnite collection of
points in [0, T]:
P := ¦0 = t
0
< t
1
< < t
m
= T¦.
(ii) Let the mesh size of P be [P[ := max
0≤k≤m−1
[t
k+1
−t
k
[.
(iii) For ﬁxed 0 ≤ λ ≤ 1 and P a given partition of [0, T], set
τ
k
:= (1 −λ)t
k
+λt
k+1
(k = 0, . . . , m−1).
For such a partition P and 0 ≤ λ ≤ 1, we deﬁne
R = R(P, λ) :=
m−1

k=0
W(τ
k
)(W(t
k+1
) −W(t
k
)).
This is the corresponding Riemann sum approximation of
_
T
0
W dW. The key question is
this: what happens if [P[ → 0, with λ ﬁxed?
LEMMA (Quadratic variation). Let [a, b] be an interval in [0, ∞), and suppose
P
n
:= ¦a = t
n
0
< t
n
1
< < t
n
m
n
= b¦
are partitions of [a, b], with [P
n
[ → 0 as n → ∞. Then
m
n
−1

k=0
(W(t
n
k+1
) −W(t
n
k
))
2
→ b −a
in L
2
(Ω) as n → ∞.
This assertion partly justiﬁes the heuristic idea, introduced in Chapter 1, that
dW ≈ (dt)
1/2
.
60
Proof. Set Q
n
:=

m
n
−1
k=0
(W(t
n
k+1
) −W(t
n
k
))
2
. Then
Q
n
−(b −a) =
m
n
−1

k=0
((W(t
n
k+1
) −W(t
n
k
))
2
−(t
n
k+1
−t
n
k
)).
Hence
E((Q
n
−(b −a))
2
) =
m
n
−1

k=0
m
n
−1

j=0
E([(W(t
n
k+1
) −W(t
n
k
))
2
−(t
n
k+1
−t
n
k
)]
[(W(t
n
j+1
) −W(t
n
j
))
2
−(t
n
j+1
−t
n
j
)]).
For k ,= j, the term in the double sum is
E((W(t
n
k+1
) −W(t
n
k
))
2
−(t
n
k+1
−t
n
k
))E( ),
according to the independent increments, and thus equals 0, as W(t) −W(s) is N(0, t −s)
for all t ≥ s ≥ 0. Hence
E((Q
n
−(b −a))
2
) =
m
n
−1

k=0
E((Y
2
k
−1)
2
(t
n
k+1
−t
n
k
)
2
),
where
Y
k
= Y
n
k
:=
W(t
n
k+1
) −W(t
n
k
)
_
t
n
k+1
−t
n
k
is N(0, 1).
Therefore for some constant C
E((Q
n
−(a −b))
2
) ≤ C
m
n
−1

k=0
(t
n
k+1
−t
n
k
)
2
≤ C [ P
n
[ (b −a) → 0 as n → ∞.

Remark. Passing if necessary to a subsequence,
m
n
−1

k=0
(W(t
n
k+1
) −W(t
n
k
))
2
→ b −a a.s.
Pick an ω for which this holds and also for which the sample path is H¨ older continuous
with some exponent 0 < γ <
1
2
. Then
b −a ≤ K limsup
n→∞
[P
n
[
γ
m
n
−1

k=0
[W(t
n
k+1
) −W(t
n
k
)[
61
for a constant K. Since [P
n
[ → 0, we see again that sample paths have inﬁnite variation
with probability one:
sup
P
_
m−1

k=0
[W(t
k+1
) −W(t
k
)[
_
= ∞.

Let us now return to the question posed above, as to the limit of the Riemann sum
approximations.
LEMMA. If P
n
denotes a partition of [0, T] and 0 ≤ λ ≤ 1 is ﬁxed, deﬁne
R
n
:=
m
n
−1

k=0
W(τ
n
k
)(W(t
n
k+1
) −W(t
n
k
)).
Then
lim
n→∞
R
n
=
W(T)
2
2
+
_
λ −
1
2
_
T,
the limit taken in L
2
(Ω). That is,
E
_
_
R
n

W(T)
2
2

_
λ −
1
2
_
T
_
2
_
→ 0.
In particular the limit of the Riemann sum approximations depends upon the choice of
intermediate points t
n
k
≤ τ
n
k
≤ t
n
k+1
, where τ
n
k
= (1 −λ)t
n
k
+λt
n
k+1
.
Proof. We have
R
n
:=
m
n
−1

k=0
W(τ
n
k
)(W(t
n
k+1
) −W(t
n
k
))
=
W
2
(T)
2

1
2
m
n
−1

k=0
(W(t
n
k+1
) −W(t
n
k
))
2
. ¸¸ .
=:A
+
m
n
−1

k=0
(W(τ
n
k
) −W(t
n
k
))
2
. ¸¸ .
=:B
+
m
n
−1

k=0
(W(t
n
k+1
) −W(τ
n
k
))(W(τ
n
k
) −W(t
n
k
))
. ¸¸ .
=:C
.
According to the foregoing Lemma, A →
T
2
in L
2
(Ω) as n → ∞, and similarly B → λT as
62
n → ∞. Next we study the term C:
E([
m
n
−1

k=0
(W(t
n
k+1
) −W(τ
n
k
))(W(τ
n
k
) −W(t
n
k
))]
2
)
=
m
n
−1

k=0
E([W(t
n
k+1
) −W(τ
n
k
)]
2
)E([W(τ
n
k
) −W(t
n
k
)]
2
)
(independent increments)
=
m
n
−1

k=0
(1 −λ)(t
n
k+1
−t
n
k
)λ(t
n
k+1
−t
n
k
)
≤ λ(1 −λ)T[P
n
[ → 0.
Hence C → 0 in L
2
(Ω) as n → ∞.
We combine the limiting expressions for the terms A, B, C, and thereby establish the
Lemma.
It turns out that Itˆ o’s deﬁnition (later, in ¸B) of
_
T
0
W dW corresponds to the choice
λ = 0. That is,
_
T
0
W dW =
W
2
(T)
2

T
2
and, more generally,
_
r
s
W dW =
W
2
(r) −W
2
(s)
2

(r −s)
2
for all r ≥ s ≥ 0.
This is not what one would guess oﬀhand. An alternative deﬁnition, due to Stratonovich,
takes λ =
1
2
; so that
_
T
0
W ◦ dW =
W
2
(T)
2
(Stratonovich integral).
See Chapter 6 for more.
More discussion. What are the advantages of taking λ = 0 and getting
_
T
0
W dW =
W
2
(T)
2

T
2
?
First and most importantly, building the Riemann sum approximation by evaluating the
integrand at the left-hand endpoint τ
n
k
= t
n
k
on each subinterval [t
n
k
, t
n
k=1
] will ultimately
permit the deﬁnition of
_
T
0
GdW
63
for a wide class of so-called “nonanticipating” stochastic processes G(). Exact deﬁnitions
are later, but the idea is that t represents time, and since we do not know what W() will
do on [t
n
k
, t
n
k+1
], it is best to use the known value of G(t
n
k
) in the approximation. Indeed,
G() will in general depend on Brownian motion W(), and we do not know at time t
n
k
its
future value at the future time τ
n
k
= (1 −λ)t
n
k
+λt
n
k+1
, if λ > 0.
B. DEFINITION AND PROPERTIES OF IT
ˆ
O’S INTEGRAL.
Let W() be a 1-dimensional Brownian motion deﬁned on some probability space (Ω, |, P).
DEFINITIONS. (i) The σ-algebra J(t) := |(W(s) [ 0 ≤ s ≤ t) is called the history of
the Brownian motion up to (and including) time t.
(ii) The σ-algebra J
+
(t) := |(W(s)−W(t) [ s ≥ t) is the future of the Brownian motion
beyond time t.
DEFINITION. A family T() of σ-algebras ⊆ | is called nonanticipating (with respect
to W()) if
(a) T(t) ⊇ T(s) for all t ≥ s ≥ 0
(b) T(t) ⊇ J(t) for all t ≥ 0
(c) T(t) is independent of J
+
(t) for all t ≥ 0.
We also refer to T() as a ﬁltration.
We should informally think of T(t) as “containing all information available to us at time
t”. Our primary example will be T(t) := |(W(s) (0 ≤ s ≤ t), X
0
), where X
0
is a random
variable independent of J
+
(0). This will be employed in Chapter 5, where X
0
will be the
(possibly random) initial condition for a stochastic diﬀerential equation.
DEFINITION. A real-valued stochastic process G() is called nonanticipating (with re-
spect to T()) if for each time t ≥ 0, G(t) is T(t)–measurable.
The idea is that for each time t ≥ 0, the random variable G(t) “depends upon only the
information available in the σ-algebra T(t)”.
Discussion. We will actually need a slightly stronger notion, namely that G() be
progessively measurable. This is however a bit subtle to deﬁne, and we will not do so
here. The idea is that G() is nonanticipating and, in addition, is appropriately jointly
measurable in the variables t and ω together.
These measure theoretic issues can be confusing to students, and so we pause here to
emphasize the basic point, to be developed below. For progessively measurable integrands
G(), we will be able to deﬁne, and understand, the stochastic integral
_
T
0
GdW in terms
of some simple, useful and elegant formulas. In other words, we will see that since at each
64
moment of time “G depends only upon the past history of the Brownian motion”, some
nice identities hold, which would be false if G “depends upon the future behavior of the
Brownian motion”.
DEFINITIONS. (i) We denote by L
2
(0, T) the space of all real–valued, progressively
measurable stochastic processes G() such that
E
_
_
T
0
G
2
dt
_
< ∞.
(ii) Likewise, L
1
(0, T) is the space of all real–valued, progressively measurable processes
F() such that
E
_
_
T
0
[F[ dt
_
< ∞.
DEFINITION. A process G ∈ L
2
(0, T) is called a step process if there exists a partition
P = ¦0 = t
0
< t
1
< < t
m
= T¦ such that
G(t) ≡ G
k
for t
k
≤ t < t
k+1
(k = 0, . . . , m−1).
Then each G
k
is an T(t
k
)-measurable random variable, since G is nonanticipating.
DEFINITION. Let G ∈ L
2
(0, T) be a step process, as above. Then
_
T
0
GdW :=
m−1

k=0
G
k
(W(t
k+1
) −W(t
k
))
is the Itˆ o stochastic integral of G on the interval (0, T).
Note carefully that this is a random variable.
LEMMA (Properties of stochastic integral for step processes). We have for all
constants a, b ∈ R and for all step processes G, H ∈ L
2
(0, T):
(i)
_
T
0
aG+bH dW = a
_
T
0
GdW +b
_
T
0
H dW,
(ii) E
_
_
T
0
GdW
_
= 0,
65
(iii) E
_
_
_
_
T
0
GdW
_
2
_
_
= E
_
_
T
0
G
2
dt
_
.
Proof. 1. The ﬁrst assertion is easy to check.
Suppose next G(t) ≡ G
k
for t
k
≤ t < t
k+1
. Then
E
_
_
T
0
GdW
_
=
m−1

k=0
E(G
k
(W(t
k+1
) −W(t
k
))).
Now G
k
is T(t
k
)-measurable and T(t
k
) is independent of J
+
(t
k
). On the other hand,
W(t
k+1
) − W(t
k
) is J
+
(t
k
)-measurable, and so G
k
is independent of W(t
k+1
) − W(t
k
).
Hence
E(G
k
(W(t
k+1
) −W(t
k
))) = E(G
k
)E(W(t
k+1
) −W(t
k
))
. ¸¸ .
=0
.
2. Furthermore,
E
_
_
_
_
T
0
GdW
_
2
_
_
=
m−1

k,j=1
E (G
k
G
j
(W(t
k+1
) −W(t
k
))(W(t
j+1
) −W(t
j
)) .
Now if j < k, then W(t
k+1
) −W(t
k
) is independent of G
k
G
j
(W(t
j+1
) −W(t
j
)). Thus
E(G
k
G
j
(W(t
k+1
) −W(t
k
))(W(t
j+1
) −W(t
j
)))
= E(G
k
G
j
(W(t
j+1
) −W(t
j
)))
. ¸¸ .
<∞
E(W(t
k+1
) −W(t
k
))
. ¸¸ .
=0
.
Consequently
E
_
_
_
_
T
0
GdW
_
2
_
_
=
m−1

k=0
E(G
2
k
(W(t
k+1
) −W(t
k
))
2
)
=
m−1

k=0
E(G
2
k
)E((W(t
k+1
) −W(t
k
))
2
)
. ¸¸ .
=t
k+1
−t
k
= E
_
_
T
0
G
2
dt
_
.

APPROXIMATION BY STEP FUNCTIONS. The plan now is to approximate
an arbitrary process G ∈ L
2
(0, T) by step processes in L
2
(0, T), and then pass to limits to
deﬁne the Itˆ o integral of G.
66
LEMMA (Approximation by step processes). If G ∈ L
2
(0, T), there exists a se-
quence of bounded step processes G
n
∈ L
2
(0, T) such that
E
_
_
T
0
[G−G
n
[
2
dt
_
→ 0.
Outline of proof. We omit the proof, but the idea is this: if t → G(t, ω) is continuous for
almost every ω, we can set
G
n
(t) := G(
k
n
) for
k
n
≤ t <
k + 1
n
, k = 0, . . . , [nT].
For a general G ∈ L
2
(0, T), deﬁne
G
m
(t) :=
_
t
0
me
m(s−t)
G(s) ds.
Then G
m
∈ L
2
(0, T), t → G
m
(t, ω) is continuous for a.e. ω, and
_
T
0
[G
m
−G[
2
dt → 0 a.s.
Now approximate G
m
by step processes, as above.
DEFINITION. If G ∈ L
2
(0, T), take step processes G
n
as above. Then
E
_
_
_
_
T
0
G
n
−G
m
dW
_
2
_
_
= E
_
_
T
0
(G
n
−G
m
)
2
dt
_
→ 0 as n, m → ∞
and so the limit
_
T
0
GdW := lim
n→∞
_
T
0
G
n
dW
exists in L
2
(Ω).
It is not hard to check this deﬁnition does not depend upon the particular sequence of
step process approximations in L
2
(0, T).
THEOREM (Properties of Itˆ o Integral). For all constants a, b ∈ R and for all
G, H ∈ L
2
(0, T), we have
(i)
_
T
0
aG+bH dW = a
_
T
0
GdW +b
_
T
0
H dW,
67
(ii) E
_
_
T
0
GdW
_
= 0,
(iii) E
_
_
_
_
T
0
GdW
_
2
_
_
= E
_
_
T
0
G
2
dt
_
,
(iv) E
_
_
T
0
GdW
_
T
0
H dW
_
= E
_
_
T
0
GH dt
_
.
Proof. 1. Assertion (i) follows at once from the corresponding linearity property for step
processes.
Statements (i) and (iii) are also easy consequences of the similar rules for step processes.
2. Finally, assertion (iv) results from (iii) and the identity 2ab = (a +b)
2
−a
2
−b
2
, and
is left as an exercise.
EXTENDING THE DEFINITION. For many applications, it is important to con-
sider a wider class of integrands, instead of just L
2
(0, T). To this end we deﬁne M
2
(0, T)
to be the space of all real–valued, progressively measurable processes G() such that
_
T
0
G
2
dt < ∞ a.s.
It is possible to extend the deﬁnition of the Itˆ o integral to cover G ∈ M
2
(0, T), although we
will not do so in these notes. The idea is to ﬁnd a sequence of step processes G
n
∈ M
2
(0, T)
such that
_
T
0
(G−G
n
)
2
dt → 0 a.s. as n → ∞.
It turns out that we can then deﬁne
_
T
0
GdW := lim
n→∞
_
T
0
G
n
dW,
the expressions on the right converging in probability. See for instance Friedman [F] or
Gihman–Skorohod [G-S] for details.
More on Riemann sums. In particular, if G ∈ M
2
(0, T) and t → G(t, ω) is continuous
for a.e. ω, then
m
n
−1

k=0
G(t
n
k
)(W(t
n
k+1
) −W(t
n
k
)) →
_
T
0
GdW
in probability, where P
n
= ¦0 = t
n
< < t
n
m
n
= T¦ is any sequence of partitions,
with [P
n
[ → 0. This conﬁrms the consistency of Itˆ o’s integral with the earlier calculations
involving Riemann sums, evaluated at τ
n
k
= t
n
k
.
68
C. INDEFINITE IT
ˆ
O INTEGRALS.
DEFINITION. For G ∈ L
2
(0, T), set
I(t) :=
_
t
0
GdW (0 ≤ t ≤ T),
the indeﬁnite integral of G(). Note I(0) = 0.
In this section we note some properties of the process I(), namely that it is a martingale
and has continuous sample paths a.s. These facts will be quite useful for proving Itˆ o’s
formula later in ¸D and in solving the stochastic diﬀerential equations in Chapter 5.
THEOREM. (i) If G ∈ L
2
(0, T), then the indeﬁnite integral I() is a martingale.
(iI) Furthermore, I() has a version with continuous sample paths a.s.
Henceforth when we refer to I(), we will always mean this version. We will not prove
assertion (i); a proof of (ii) is in Appendix C.
D. IT
ˆ
O’S FORMULA.
DEFINITION. Suppose that X() is a real–valued stochastic process satisfying
X(r) = X(s) +
_
r
s
F dt +
_
r
s
GdW
for some F ∈ L
1
(0, T), G ∈ L
2
(0, T) and all times 0 ≤ s ≤ r ≤ T. We say that X() has
the stochastic diﬀerential
dX = Fdt +GdW
for 0 ≤ t ≤ T.
Note carefully that the diﬀerential symbols are simply an abbreviation for the integral
expressions above: strictly speaking “dX”, “dt”, and “dW” have no meaning alone.
THEOREM (Itˆ o’s Formula). Suppose that X() has a stochastic diﬀerential
dX = Fdt +GdW,
for F ∈ L
1
(0, T), G ∈ L
2
(0, T). Assume u : R[0, T] →R is continuous and that
∂u
∂t
,
∂u
∂x
,

2
u
∂x
2
exist and are continuous.
Set
Y (t) := u(X(t), t).
69
Then Y has the stochastic diﬀerential
(2)
dY =
∂u
∂t
dt +
∂u
∂x
dX +
1
2

2
u
∂x
2
G
2
dt
=
_
∂u
∂t
+
∂u
∂x
F +
1
2

2
u
∂x
2
G
2
_
dt +
∂u
∂x
GdW.
We call (2) Itˆo’s formula or Itˆ o’s chain rule.
Remarks. (i) The argument of u,
∂u
∂t
, etc. above is (X(t), t).
(ii) In view of our deﬁnitions, the expression (2) means for all 0 ≤ s ≤ r ≤ T,
(3)
Y (r) −Y (s) = u(X(r), r) −u(X(s), s)
=
_
r
s
∂u
∂t
(X, t) +
∂u
∂x
(X, t)F +
1
2

2
u
∂x
2
(X, t)G
2
dt
+
_
r
s
∂u
∂x
(X, t)GdW almost surely.
(iii) Since X(t) = X(0) +
_
t
0
F ds +
_
t
0
GdW, X() has continuous sample paths almost
surely. Thus for amost every ω, the functions t →
∂u
∂t
(X(t), t),
∂u
∂x
(X(t), t),

2
u
∂x
2
(X(t), t)
are continuous and so the integrals in (3) are deﬁned.
ILLUSTRATIONS OF IT
ˆ
O’S FORMULA. We will prove Itˆ o’s formula below, but
ﬁrst here are some applications:
Example 1. Let X() = W(), u(x) = x
m
. Then dX = dW and thus F ≡ 0, G ≡ 1.
Hence Itˆo’s formula gives
d(W
m
) = mW
m−1
dW +
1
2
m(m−1)W
m−2
dt.
In particular the case m = 2 reads
d(W
2
) = 2WdW +dt.
This integrated is the identity
_
r
s
W dW =
W
2
(r) −W
2
(s)
2

(r −s)
2
,
a formula we have established from ﬁrst principles before.
70
Example 2. Again take X() = W(), u(x, t) = e
λx−
λ
2
t
2
, F ≡ 0, G ≡ 1. Then
d
_
e
λW(t)−
λ
2
t
2
_
=
_

λ
2
2
e
λW(t)−
λ
2
t
2
+
λ
2
2
e
λW(t)−
λ
2
t
2
_
dt +λe
λW(t)−
λ
2
t
2
dW
by Itˆ o’s formula. Thus
_
dY = λY dW
Y (0) = 1.
This is a stochastic diﬀerential equation, about which more in Chapters 5 and 6.
In the Itˆ o stochastic calculus the expression e
λW(t)−
λ
2
t
2
plays the role that e
λt
plays in
ordinary calculus. We build upon this observation:
Example 3. For n = 0, 1, . . . , deﬁne
h
n
(x, t) :=
(−t)
n
n!
e
x
2
/2t
d
n
dx
n
_
e
−x
2
/2t
_
,
the n-th Hermite polynomial. Then
h
0
(x, t) = 1, h
1
(x, t) = x
h
2
(x, t) =
x
2
2

t
2
, h
3
(x, t) =
x
3
6

tx
2
h
4
(x, t) =
x
4
24

tx
2
4
+
t
2
8
, etc.
THEOREM (Stochastic calculus with Hermite polynomials). We have
_
t
0
h
n
(W, s) dW = h
n+1
(W(t), t) for n = 0, . . . , t ≥ 0;
that is,
dh
n+1
(W, t) = h
n
(W, t) dW.
Consequently in the Itˆ o stochastic calculus the expression h
n
(W(t), t) plays the role
that
t
n
n!
plays in ordinary calculus.
Proof. (from McKean [McK]) Since
d
n

n
(e

(x−λt)
2
2t
)[
λ=0
= (−t)
n
d
n
dx
n
(e
−x
2
/2t
),
we have
d
n

n
(e
λx−
λ
2
t
2
)[
λ=0
= (−t)
n
e
x
2
/2t
d
n
dx
n
(e
−x
2
/2t
)
= n!h
n
(x, t).
71
Hence
e
λx−
λ
2
t
2
=

n=0
λ
n
h
n
(x, t),
and so
Y (t) = e
λW(t)−
λ
2
t
2
=

n=0
λ
n
h
n
(W(t), t).
But Y () solves
_
dY = λY dW
Y (0) = 1;
that is,
Y (t) = 1 +λ
_
t
0
Y dW for all t ≥ 0.
Plug in the expansion above for Y (t):

n=0
λ
n
h
n
(W(t), t) = 1 +λ
_
t
0

n=0
λ
n
h
n
(W(s), s) dW
= 1 +

n=1
λ
n
_
t
0
h
n−1
(W(s), s) dW.
This identity holds for all λ and so the coeﬃcients of λ
n
on both sides are equal.
PROOF OF IT
ˆ
O’S FORMULA. We now begin the proof of Itˆ o’s formula, by veri-
fying directly two important special cases:
LEMMA (Two simple stochastic diﬀerentials). We have
(i) d(W
2
) = 2WdW +dt,
and
(ii ) d(tW) = Wdt +tdW.
Proof. We have already established formula (i). To verify (ii), note that
_
r
0
t dW = lim
n→∞
m
n
−1

k=0
t
n
k
(W(t
n
k+1
) −W(t
n
k
)),
where P
n
= ¦0 = t
n
0
< t
n
1
< < t
n
m
n
= r¦ is a sequence of partitions of [0, r], with
[P
n
[ → 0. The limit above is taken in L
2
(Ω).
Similarly, since t → W(t) is continuous a.s.,
_
r
0
W dt = lim
n→∞
m
n
−1

k=0
W(t
n
k+1
)(t
n
k+1
−t
n
k
),
72
since for amost every ω the sum is an ordinary Riemann sum approximation and for this
we can take the right-hand endpoint t
n
k+1
at which to evaluate the continuous integrand.
We add these formulas to obtain
_
r
0
t dW +
_
r
0
W dt = rW(r).
These integral identities for all r ≥ 0 are abbreviated d(tW) = tdW +Wdt.
These special cases in hand, we now prove:
THEOREM (Itˆ o product rule). Suppose
_
dX
1
= F
1
dt +G
1
dW
dX
2
= F
2
dt +G
2
dW
(0 ≤ t ≤ T),
for F
i
∈ L
1
(0, T), G
i
∈ L
2
(0, T) (i = 1, 2). Then
(4) d(X
1
X
2
) = X
2
dX
1
+X
1
dX
2
+G
1
G
2
dt.
Remarks. (i) The expression G
1
G
2
dt here is the Itˆ o correction term. The integrated
version of the product rule is the Itˆ o integration-by-parts formula:
(5)
_
r
s
X
2
dX
1
= X
1
(r)X
2
(r) −X
1
(s)X
2
(s) −
_
r
s
X
1
dX
2

_
r
s
G
1
G
2
dt.
(ii) If either G
1
or G
2
is identically equal to 0, we get the ordinary calculus integration-
by-parts formula. This conﬁrms that the Paley–Wiener–Zygmund deﬁnition
_
1
0
g dW = −
_
1
0
g

W dt,
for deterministic C
1
functions g, with g(0) = g(1) = 0, agrees with the Itˆ o deﬁnition.
Proof. 1. Choose 0 ≤ r ≤ T.
First of all, assume for simplicity that X
1
(0) = X
2
(0) = 0, F
i
(t) ≡ F
i
, G
i
(t) ≡ G
i
,
where F
i
, G
i
are time–independent, T(0)-measurable random variables (i = 1, 2). Then
X
i
(t) = F
i
t +G
i
W(t) (t ≥ 0, i = 1, 2).
73
Thus
_
r
0
X
2
dX
1
+X
1
dX
2
+G
1
G
2
dt
=
_
r
0
X
1
F
2
+X
2
F
1
dt +
_
r
0
X
1
G
2
+X
2
G
1
dW
+
_
r
0
G
1
G
2
dt
=
_
r
0
(F
1
t +G
1
W)F
2
+ (F
2
t +G
2
W)F
1
dt
+
_
r
0
(F
1
t +G
1
W)G
2
+ (F
2
t +G
2
W)G
1
dW +G
1
G
2
r
= F
1
F
2
r
2
+ (G
1
F
2
+G
2
F
1
)
__
r
0
W dt +
_
r
0
t dW
_
+ 2G
1
G
2
_
r
0
W dW +G
1
G
2
r.
We now use the Lemma above to compute 2
_
r
0
W dW = W
2
(r)−r and
_
r
0
Wdt+
_
r
0
t dW =
rW(r). Employing these identities, we deduce:
_
r
0
X
2
dX
1
+X
1
dX
2
+G
1
G
2
dt
= F
1
F
2
r
2
+ (G
1
F
2
+G
2
F
1
)rW(r) +G
1
G
2
W
2
(r)
= X
1
(r)X
2
(r).
This is formula (5) for the special circumstance that s = 0, X
i
(0) = 0, and F
i
, G
i
time–
independent random variables.
The case that s ≥ 0, X
1
(s), X
2
(s) are arbitrary, and F
i
, G
i
are constant T(s)-measurable
random variables has a similar proof.
2. If F
i
, G
i
are step processes, we apply Step 1 on each subinterval [t
k
, t
k+1
) on which
F
i
and G
i
are constant random variables, and add the resulting integral expressions.
3. In the general situation, we select step processes F
n
i
∈ L
1
(0, T), G
n
i
∈ L
2
(0, T), with
E(
_
T
0
[F
n
i
−F
i
[ dt) → 0
E(
_
T
0
(G
n
i
−G
i
)
2
dt) → 0
as n → ∞, i = 1, 2.
Deﬁne
X
n
i
(t) := X
i
(0) +
_
t
0
F
n
i
ds +
_
t
0
G
n
i
dW (i = 1, 2).
We apply Step 2 to X
n
i
() on (s, r) and pass to limits, to obtain the formula
X
1
(r)X
2
(r) = X
1
(s)X
2
(s) +
_
r
s
X
1
dX
2
+X
2
dX
1
+G
1
G
2
dt.
74

CONCLUSION OF THE PROOF OF IT
ˆ
O’S FORMULA. Suppose dX = Fdt +
GdW, with F ∈ L
1
(0, T), G ∈ L
2
(0, T).
m
, m = 0, 1, . . . , and ﬁrst of all claim that
(6) d(X
m
) = mX
m−1
dX +
1
2
m(m−1)X
m−2
G
2
dt.
This is clear for m = 0, 1, and the case m = 2 follows from the Itˆ o product formula. Now
assume the stated formula for m−1:
d(X
m−1
) = (m−1)X
m−2
dX +
1
2
(m−1)(m−2)X
m−3
G
2
dt
= (m−1)X
m−2
(Fdt +GdW) +
1
2
(m−1)(m−2)X
m−3
G
2
dt,
and we prove it for m:
d(X
m
) = d(XX
m−1
)
= Xd(X
m−1
) +X
m−1
dX + (m−1)X
m−2
G
2
dt
(by the product rule)
= X
_
(m−1)X
m−2
dX +
1
2
(m−1)(m−2)X
m−3
G
2
dt
_
+ (m−1)X
m−2
G
2
dt +X
m−1
dX
= mX
m−2
dX +
1
2
m(m−1)X
m−2
G
2
dt,
because m−1 +
1
2
(m−1)(m−2) =
1
2
m(m−1). This proves (6).
Since Itˆ o’s formula thus holds for the functions u(x) = x
m
, m = 0, 1, . . . and since the
operator “d” is linear, Itˆ o’s formula is valid for all polynomials u in the variable x.
2. Suppose now u(x, t) = f(x)g(t), where f and g are polynomials. Then
d(u(X, t)) = d(f(X)g)
= f(X)dg +gdf(X)
= f(X)g

dt +g[f

(X)dX +
1
2
f

(X)G
2
dt]
=
∂u
∂t
dt +
∂u
∂x
dX +
1
2

2
u
∂x
2
G
2
dt.
This calculation conﬁrms Itˆ o’s formula for u(x, t) = f(x)g(t), where f and g are polyno-
mials. Thus it is true as well for any function u having the form
u(x, t) =
m

i=1
f
i
(x)g
i
(t),
75
where f
i
and g
i
polynomials. That is, Itˆ o’s formula is valid for all polynomial functions u
of the variables x, t.
3. Given u as in Itˆ o’s formula, there exists a sequence of polynomials u
n
such that
u
n
→ u,
∂u
n
∂t

∂u
∂t
∂u
n
∂x

∂u
∂x
,

2
u
n
∂x
2

2
u
∂x
2
,
uniformly on compact subsets of R[0, T]. Invoking Step 2, we know that for all 0 ≤ r ≤ T,
u
n
(X(r), r) −u
n
(X(0), 0) =
_
r
0
∂u
n
∂t
+
∂u
n
∂x
F +
1
2

2
u
n
∂x
2
G
2
dt
+
_
r
0
∂u
n
∂x
GdW almost surely;
the argument of the partial derivatives of u
n
is (X(t), t).
We may pass to limits as n → ∞ in this expression, thereby proving Itˆ o’s formula in
general.
A similar proof gives this:
GENERALIZED IT
ˆ
O FORMULA. Suppose dX
i
= F
i
dt + G
i
dW, with for F
i

L
1
(0, T), G
i
∈ L
2
(0, T), for i = 1, . . . , n.
If u : R
n
[0, T] →R is continuous, with continuous partial derivatives
∂u
∂t
,
∂u
∂x
i
,

2
u
∂x
i
∂x
j
,
(i, j = 1, . . . , n), then
d(u(X
1
, . . . , X
n
, t)) =
∂u
∂t
dt +
n

i=1
∂u
∂x
i
dX
i
+
1
2
n

i,j=1

2
u
∂x
i
∂x
j
G
i
G
j
dt.

E. IT
ˆ
O’S INTEGRAL IN N-DIMENSIONS.
Notation. (i) Let W() = (W
1
(), . . . , W
m
()) be an m-dimensional Brownian motion.
(ii) We assume T() is a family of nonanticipating σ-algebras, meaning that
(a) T(t) ⊇ T(s) for all t ≥ s ≥ 0
(b) T(t) ⊇ J(t) = |(W(s) [ 0 ≤ s ≤ t)
(c) T(t) is independent of J
+
(t) := |(W(s) −W(t) [ t ≤ s < ∞).
76
DEFINITIONS. (i) An M
n×m
-valued stochastic process G = ((G
ij
)) belongs to L
2
n×m
(0, T)
if
G
ij
∈ L
2
(0, T) (i = 1, . . . n; j = 1, . . . m).
(ii) An R
n
-valued stochastic process F = (F
1
, F
2
, . . . , F
n
) belongs to L
1
n
(0, T) if
F
i
∈ L
1
(0, T) (i = 1, . . . n).
DEFINITION. If G ∈ L
2
n×m
(0, T), then
_
T
0
GdW
is an R
n
–valued random variable, whose i-th component is
m

j=1
_
T
0
G
ij
dW
j
(i = 1, . . . , n).

Approximating by step processes as before, we can establish this
LEMMA. If G ∈ L
2
n×m
(0, T), then
E
_
_
T
0
GdW
_
= 0,
and
E
_
_
¸
¸
¸
¸
¸
_
T
0
GdW
¸
¸
¸
¸
¸
2
_
_
= E
_
_
T
0
[G[
2
dt
_
,
where [G[
2
:=

1≤i≤n
1≤j≤m
[G
ij
[
2
.
DEFINITION. If X() = (X
1
(), . . . , X
n
()) is an R
n
-valued stochastic process such that
X(r) = X(s) +
_
r
s
Fdt +
_
r
s
GdW
for some F ∈ L
1
n
(0, T), G ∈ L
2
n×m
(0, T) and all 0 ≤ s ≤ r ≤ T, we say X() has the
stochastic diﬀerential
dX = Fdt +GdW.
This means that
dX
i
= F
i
dt +
m

j=1
G
ij
dW
j
for i = 1, . . . , n.
77
THEOREM (Itˆ o’s formula in n-dimensions). Suppose that dX = Fdt + GdW, as
above. Let u : R
n
[0, T] be continuous, with continuous partial derivations
∂u
∂t
,
∂u
∂x
i
,

2
u
∂x
i
∂x
j
, (i, j = 1, . . . , n). Then
(5)
d(u(X(t), t))=
∂u
∂t
dt +

n
i=1
∂u
∂x
i
dX
i
+
1
2

n
i,j=1

2
u
∂x
i
∂x
j

m
l=1
G
il
G
jl
dt,
where the argument of the partial derivatives of u is (X(t), t).
An outline of the proof follows some preliminary results:
LEMMA (Another simple stochastic diﬀerential). Let W() and
¯
W() be indepen-
dent 1-dimensional Brownian motions. Then
d(W
¯
W) = Wd
¯
W +
¯
WdW.
Compare this to the case W =
¯
W. There is no correction term “dt” here, since W,
¯
W
are independent.
Proof. 1. To begin, set X(t) :=
W(t)+
¯
W(t)

2
.
We claim that X() is a 1-dimensional Brownian motion. To see this, note ﬁrstly that
X(0) = 0 a.s. and X() has independent increments. Next observe that since X is the
sum of two independent, N(0,
t
2
) random variables, X(t) is N(0, t). A similar observation
shows that X(t) −X(s) in N(0, t −s). This establishes the claim.
2. From the 1-dimensional Itˆ o calculus, we know
_
¸
_
¸
_
d(X
2
) = 2XdX +dt,
d(W
2
) = 2WdW +dt,
d(
¯
W
2
) = 2
¯
Wd
¯
W +dt.
Thus
d(W
¯
W) = d
_
X
2

1
2
W
2

1
2
¯
W
2
_
= 2XdX +dt −
1
2
(2WdW +dt)

1
2
(2
¯
Wd
¯
W +dt)
= (W +
¯
W)(dW +d
¯
W) −WdW −
¯
Wd
¯
W
= Wd
¯
W +
¯
WdW.

We will also need the following modiﬁcation of the product rule:
78
LEMMA (Itˆ o product rule with several Brownian motions). Suppose
dX
1
= F
1
dt +
m

k=1
G
k
1
dW
k
and
dX
2
= F
2
dt +
m

l=1
G
l
2
dW
l
,
where F
i
∈ L
1
(0, T) and G
k
i
∈ L
2
(0, T) for i = 1, 2; k = 1, . . . , m. Then
d(X
1
X
2
) = X
1
dX
2
+X
2
dX
1
+
m

k=1
G
k
1
G
k
2
dt.
The proof is a modiﬁcation of that for the one–dimensional Itˆ o product rule, as before,
with the new feature that
d(W
i
W
j
) = W
i
dW
j
+W
j
dW
i

ij
dt,
according to the Lemma above.
The Itˆo formula in n-dimensions can now be proved by a suitable modiﬁcation of the
one-dimensional proof. We ﬁrst establish the formula for a multinomials u = u(x) =
x
k
1
1
. . . x
k
m
m
, proving this by an induction on k
1
, . . . , k
m
, using the Lemma above. This done,
the formula follows easily for polynomials u = u(x, t) in the variables x = (x
1
, . . . , x
n
) and
t, and then, after an approximation, for all functions u as stated.
CLOSING REMARKS.
1. ALTERNATIVE NOTATION. When
dX = Fdt +GdW,
we sometimes write
H
ij
:=
m

k=1
G
ik
G
jk
.
du(X, t) =
_
∂u
∂t
+F Du +
1
2
H : D
2
u
_
dt +Du GdW,
79
where Du =
_
∂u
∂x
1
, . . . ,
∂u
∂x
n
_
is the gradient of u in the x-variables, D
2
u =
__

2
u
∂x
i
∂x
j
__
is
the Hessian matrix, and
F Du =
n

i=1
F
i
∂u
∂x
i
,
H : D
2
u =
n

i,j=1
H
ij

2
u
∂x
i
∂x
j
,
Du GdW =
n

i=1
m

k=1
∂u
∂x
i
G
ik
dW
k
.
2. HOW TO REMEMBER IT
ˆ
O’S FORMULA.
We may symbolically compute
d(u(X, t)) =
∂u
∂t
dt +
n

i=1
∂u
∂x
i
dX
i
+
1
2
n

i,j=1

2
u
∂x
i
∂x
j
dX
i
dX
j
,
and then simplify the term “dX
i
dX
j
” by expanding it out and using the formal multipli-
cation rules
(dt)
2
= 0, dtdW
k
= 0, dW
k
dW
l
= δ
kl
dt (k, l = 1, . . . , m).
The foregoing theory provides a rigorous meaning for all this.
80
CHAPTER 5: STOCHASTIC DIFFERENTIAL EQUATIONS.
A. Deﬁnitions and examples
B. Existence and uniqueness of solutions
C. Properties of solutions
D. Linear stochastic diﬀerential equations
A. DEFINITIONS AND EXAMPLES.
We are ﬁnally ready to study stochastic diﬀerential equations:
Notation. (i) Let W() be an m-dimensional Brownian motion and X
0
an n-dimensional
random variable which is independent of W(). We will henceforth take
T(t) := |(X
0
, W(s) (0 ≤ s ≤ t)) (t ≥ 0),
the σ-algebra generated by X
0
and the history of the Wiener process up to (and including)
time t.
(ii) Assume T > 0 is given, and
b : R
n
[0, T] →R
n
,
B : R
n
[0, T] →M
n×m
are given functions. (Note carefully: these are not random variables.) We display the
components of these functions by writing
b = (b
1
, b
2
, . . . , b
n
), B =
_
_
b
1 1
. . . b
1 m
.
.
.
.
.
.
.
.
.
b
n1
. . . b
nm
_
_
.
DEFINITION. We say that an R
n
-valued stochastic process X() is a solution of the
Itˆ o stochastic diﬀerential equation
(SDE)
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
for 0 ≤ t ≤ T, provided
(i) X() is progressively measurable with respect to T(),
(ii) F := b(X, t) ∈ L
1
n
(0, T),
81
(iii) G := B(X, t) ∈ L
2
n×m
(0, T),
and
(iv) X(t) = X
0
+
_
t
0
b(X(s), s) ds +
_
t
0
B(X(s), s) dW a.s. for all 0 ≤ t ≤ T.
Remarks. (i) A higher order SDE of the form
Y
(n)
= f(t, Y, . . . , Y
(n−1)
) +g(t, Y, . . . , Y
(n−1)
)ξ,
where as usual ξ denotes “white noise”, can be rewritten into the form above by the device
of setting
X(t) =
_
_
_
_
_
Y (t)
˙
Y (t)
.
.
.
Y
(n−1)
(t)
_
_
_
_
_
=
_
_
_
_
X
1
(t)
X
2
(t)
.
.
.
X
n
(t)
_
_
_
_
.
Then
dX =
_
_
X
2
.
.
.
f( )
_
_
dt +
_
_
0
.
.
.
g( )
_
_
dW.
(ii) In view of (iii), we can always assume X() has continuous sample paths almost
surely.
EXAMPLES OF LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS.
Example 1. Let m = n = 1 and suppose g is a continuous function (not a random
variable). Then the unique solution of
(1)
_
dX = gXdW
X(0) = 1
is
X(t) = e

1
2
_
t
0
g
2
ds+
_
t
0
g dW
for 0 ≤ t ≤ T. To verify this, note that
Y (t) := −
1
2
_
t
0
g
2
ds +
_
t
0
g dW
satisﬁes
dY = −
1
2
g
2
dt +gdW.
Thus Itˆ o’s lemma for u(x) = e
x
gives
dX =
∂u
∂x
dY +
1
2

2
u
∂x
2
g
2
dt
= e
Y
_

1
2
g
2
dt +gdW +
1
2
g
2
dt
_
= gXdW, as claimed.
We will prove uniqueness later, in ¸B.
82
Example 2. Similarly, the unique solution of
(2)
_
dX = fXdt +gXdW
X(0) = 1
is
X(t) = e
_
t
0
f−
1
2
g
2
ds+
_
t
0
g dW
for 0 ≤ t ≤ T.
Example 3 (Stock prices). Let P(t) denote the price of a stock at time t. We can
model the evolution of P(t) in time by supposing that
dP
P
, the relative change of price,
evolves according to the SDE
dP
P
= µdt +σdW
for certain constants µ > 0 and σ, called the drift and the volatility of the stock. Hence
(3) dP = µPdt +σPdW;
and so
d(log(P)) =
dP
P

1
2
σ
2
P
2
dt
P
2
by Itˆ o’s formula
=
_
µ −
σ
2
2
_
dt +σdW.
Consequently
P(t) = p
0
e
σW(t)+
_
µ−
σ
2
2
_
t
,
similarly to Example 2. Observe that the price is always positive, assuming the initial
price p
0
is positive.
Since (3) implies
P(t) = p
0
+
_
t
0
µP ds +
_
t
0
σP dW
and E
_
_
t
0
σP dW
_
= 0, we see that
E(P(t)) = p
0
+
_
t
0
µE(P(s)) ds.
Hence
E(P(t)) = p
0
e
µt
for t ≥ 0.
The expected value of the stock price consequently agrees with the deterministic solution
of (3) corresponding to σ = 0.
83
Example 4 (Brownian bridge). The solution of the SDE
(4)
_
dB = −
B
1−t
dt +dW (0 ≤ t < 1)
B(0) = 0
is
B(t) = (1 −t)
_
t
0
1
1 −s
dW (0 ≤ t < 1),
as we conﬁrm by a direct calculation. It turns out also that lim
t→1
− B(t) = 0 almost
surely. We call B() a Brownian bridge, between the origin at time 0 and at time 1.
A sample path of the Brownian bridge
Example 5 (Langevin’s equation). A possible improvement of our mathematical model
of the motion of a Brownian particle models frictional forces as follows for the one-
dimensional case:
˙
X = −bX +σξ,
where ξ() is “white noise”, b > 0 is a coeﬃcient of friction, and σ is a diﬀusion coeﬃcient.
In this interpretation X() is the velocity of the Brownian particle: see Example 6 for the
position process Y (). We interpret this to mean
(5)
_
dX = −bXdt +σdW
X(0) = X
0
,
84
for some initial distribution X
0
, independent of the Brownian motion. This is the Langevin
equation.
The solution is
X(t) = e
−bt
X
0

_
t
0
e
−b(t−s)
dW (t ≥ 0),
as is straightforward to verify. Observe that
E(X(t)) = e
−bt
E(X
0
)
and
E(X
2
(t)) = E
_
e
−2bt
X
2
0
+ 2σe
−bt
X
0
_
t
0
e
−b(t−s)
dW

2
__
t
0
e
−b(t−s)
dW
_
2
_
= e
−2bt
E(X
2
0
) + 2σe
−bt
E(X
0
)E
__
t
0
e
−b(t−s)
dW
_

2
_
t
0
e
−2b(t−s)
ds
= e
−2bt
E(X
2
0
) +
σ
2
2b
(1 −e
−2bt
).
Thus the variance
V (X(t)) = E(X
2
(t)) −E(X(t))
2
is given by
V (X(t)) = e
−2bt
V (X
0
) +
σ
2
2b
(1 −e
−2bt
),
assuming, of course, V (X
0
) < ∞. For any such initial condition X
0
we therefore have
_
E(X(t)) → 0
V (X(t)) →
σ
2
2b
as t → ∞.
From the explicit form of the solution we see that the distribution of X(t) approaches
N
_
0,
σ
2
2b
_
as t → ∞. We interpret this to mean that irrespective of the initial distribution,
the solution of the SDE for large time “settles down” into a Gaussian distribution whose
variance
σ
2
2b
represents a balance between the random disturbing force σξ() and the fric-
tional damping force −bX().
85
A simulation of Langevin’s equation
Example 6 (Ornstein–Uhlenbeck process). A better model of Brownian movement
is provided by the Ornstein–Uhlenbeck equation
_
¨
Y = −b
˙
Y +σξ
Y (0) = Y
0
,
˙
Y (0) = Y
1
,
where Y (t) is the position of Brownian particle at time t, Y
0
and Y
1
are given Gaussian
random variables. As before b > 0 is the friction coeﬃcient, σ is the diﬀusion coeﬃcient,
and ξ() as usual is “white noise”.
Then X :=
˙
Y , the velocity process, satisﬁes the Langevin equation
(6)
_
dX = −bXdt +σdW
X(0) = Y
1
,
studied in Example 5. We assume Y
1
to be normal, whence explicit formula for the solution,
X(t) = e
−bt
Y
1

_
t
0
e
−b(t−s)
dW,
shows X(t) to be Gaussian for all times t ≥ 0. Now the position process is
Y (t) = Y
0
+
_
t
0
X ds.
86
Therefore
E(Y (t)) = E(Y
0
) +
_
t
0
E(X(s)) ds
= E(Y
0
) +
_
t
0
e
−bs
E(Y
1
) ds
= E(Y
0
) +
_
1 −e
−bt
b
_
E(Y
1
);
and a somewhat lengthly calculation shows
V (Y (t)) = V (Y
0
) +
σ
2
b
2
t +
σ
2
2b
3
(−3 + 4e
−bt
−e
−2bt
).
Nelson [N, p. 57] discusses this model as compared with Einstein’s .
Example 7 (Random harmonic oscillator). This is the SDE
_
¨
X = −λ
2
X −b
˙
X +σξ
X(0) = X
0
,
˙
X(0) = X
1
,
where −λ
2
X represents a linear, restoring force and −b
˙
X is a frictional damping term.
An explicit solution can be worked out using the general formulas presented below in
¸D. For the special case X
1
= 0, b = 0, σ = 1, we have
X(t) = X
0
cos(λt) +
1
λ
_
t
0
sin(λ(t −s)) dW.

B. EXISTENCE AND UNIQUENESS OF SOLUTIONS.
In this section we address the problem of building solutions to stochastic diﬀerential
1. AN EXAMPLE IN ONE DIMENSION. Let us ﬁrst suppose b : R → R is
C
1
, with [b

[ ≤ L for some constant L, and try to solve the one–dimensional stochastic
diﬀerential equation
(7)
_
dX = b(X)dt +dW
X(0) = x
where x ∈ R.
87
Now the SDE means
X(t) = x +
_
t
0
b(X) ds +W(t),
for all times t ≥ 0, and this formulation suggests that we try a successive approximation
method to construct a solution. So deﬁne X
0
(t) ≡ x, and then
X
n+1
(t) := x +
_
t
0
b(X
n
) ds +W(t) (t ≥ 0)
for n = 0, 1, . . . . Next write
D
n
(t) := max
0≤s≤t
[ X
n+1
(s) −X
n
(s)[ (n = 0, . . . ),
and notice that for a given continuous sample path of the Brownian motion, we have
D
0
(t) = max
0≤s≤t
¸
¸
¸
¸
_
s
0
b(x) dr +W(s)
¸
¸
¸
¸
≤ C
for all times 0 ≤ t ≤ T, where C depends on ω.
We now claim that
D
n
(t) ≤ C
L
n
n!
t
n
for n = 0, 1, . . . , 0 ≤ t ≤ T. To see this note that
D
n
(t) = max
0≤s≤t
¸
¸
¸
¸
_
s
0
b(X
n
(r)) −b(X
n−1
(r)) dr
¸
¸
¸
¸
≤ L
_
t
0
D
n−1
(s) ds
≤ L
_
t
0
C
L
n−1
s
n−1
(n −1)!
ds by the induction assumption
= C
L
n
t
n
n!
.
In view of the claim, for m ≥ n we have
max
0≤t≤T
[X
m
(t) −X
n
(t)[ ≤ C

k=n
L
k
T
k
k!
→ 0 as n → ∞.
Thus for almost every ω, X
n
() converges uniformly for 0 ≤ t ≤ T to a limit process X()
which, as is easy to check, solves (7).
2. SOLVING SDE BY CHANGING VARIABLES. Next is a procedure for solving
SDE by means of a clever change of variables (McKean [McK, p. 60]).
88
Given a general one–dimensional SDE of the form
(8)
_
dX = b(X)dt +σ(X)dW
X(0) = x,
let us ﬁrst solve
(9)
_
dY = f(Y )dt +dW
Y (0) = y,
where f will be selected later, and try to ﬁnd a function u such that
X := u(Y )
solves our SDE (8). Note that we can in principle at least solve (9), according to the
previous example. Assuming for the moment u and f are known, we compute using Itˆ o’s
formula that
dX = u

(Y )dY +
1
2
u

(Y )dt
=
_
u

f +
1
2
u

_
dt +u

dW.
Thus X() solves (8) provided
_
u

(Y ) = σ(X) = σ(u(Y )),
u

(Y )f(Y ) +
1
2
u

(Y ) = b(X) = b(u(Y )),
and
u(y) = x.
So let us ﬁrst solve the ODE
_
u

(z) = σ(u(z))
u(y) = x
(z ∈ R),
where

=
d
dz
, and then, once u is known, solve for
f(z) =
1
σ(u(z))
_
b(u(z)) −
1
2
u

(z)
_
.
We will not discuss here conditions under which all of this is possible: see Lamperti [L2].

Notice that both of the methods described above avoid all use of martingale estimates.
3. A GENERAL EXISTENCE AND UNIQUENESS THEOREM
89
GRONWALL’S LEMMA. Let φ and f be nonnegative, continuous functions deﬁned
for 0 ≤ t ≤ T, and let C
0
≥ 0 denote a constant. If
φ(t) ≤ C
0
+
_
t
0
fφds for all 0 ≤ t ≤ T,
then
φ(t) ≤ C
0
e
_
t
0
f ds
for all 0 ≤ t ≤ T.
Proof. Set Φ(t) := C
0
+
_
t
0
fφds. Then Φ

= fφ ≤ fΦ, and so
_
e

_
t
0
f ds
Φ
_

= (Φ

−fΦ)e

_
t
0
f ds
≤ (fφ −fφ)e

_
t
0
f ds
= 0.
Therefore
Φ(t)e

_
t
0
f ds
≤ Φ(0)e

_
0
0
f ds
= C
0
,
and thus
φ(t) ≤ Φ(t) ≤ C
0
e
_
t
0
f ds
.

EXISTENCE AND UNIQUENESS THEOREM. Suppose that b : R
n
[0, T] →R
n
and B : R
n
[0, T] →M
m×n
are continuous and satisfy the following conditions:
(a)
[b(x, t) −b(ˆ x, t)[ ≤ L[x − ˆ x[
[B(x, t) −B(ˆ x, t)[ ≤ L[x − ˆ x[
for all 0 ≤ t ≤ T, x, ˆ x ∈ R
n
(b)
[b(x, t)[ ≤ L(1 +[x[)
[B(x, t)[ ≤ L(1 +[x[)
for all 0 ≤ t ≤ T, x ∈ R
n
,
for some constant L.
Let X
0
be any R
n
-valued random variable such that
(c) E([X
0
[
2
) < ∞
and
(d) X
0
is independent of J
+
(0),
where W() is a given m-dimensional Brownian motion.
Then there exists a unique solution X ∈ L
2
n
(0, T) of the stochastic diﬀerential equation:
(SDE)
_
dX = b(X, t)dt +B(X, t)dW (0 ≤ t ≤ T)
X(0) = X
0
.
90
Remarks. (i) “Unique” means that if X,
ˆ
X ∈ L
2
n
(0, T), with continuous sample paths
almost surely, and both solve (SDE), then
P(X(t) =
ˆ
X(t) for all 0 ≤ t ≤ T) = 1.
(ii) Hypotheses (a) says that b and B are uniformly Lipschitz continuous in the variable
x. Notice also that hypothesis (b) actually follows from (a).

Proof. 1. Uniqueness. Suppose X and
ˆ
X are solutions, as above. Then for all 0 ≤ t ≤ T,
X(t) −
ˆ
X(t) =
_
t
0
b(X, s) −b(
ˆ
X, s) ds +
_
t
0
B(X, s) −B(
ˆ
X, s) dW.
Since (a +b)
2
≤ 2a
2
+ 2b
2
, we can estimate
E([X(t) −
ˆ
X(t)[
2
) ≤ 2E
_
¸
¸
¸
¸
_
t
0
b(X, s) −b(
ˆ
X, s) ds
¸
¸
¸
¸
2
_
+ 2E
_
¸
¸
¸
¸
_
t
0
B(X, s) −B(
ˆ
X, s) dW
¸
¸
¸
¸
2
_
.
The Cauchy–Schwarz inequality implies that
¸
¸
¸
¸
_
t
0
f ds
¸
¸
¸
¸
2
≤ t
_
t
0
[f [
2
ds
for any t > 0 and f : [0, t] →R
n
. We use this to estimate
E
_
¸
¸
¸
¸
_
t
0
b(X, s) −b(
ˆ
X, s) ds
¸
¸
¸
¸
2
_
≤ TE
__
t
0
¸
¸
¸b(X, s) −b(
ˆ
X, s)
¸
¸
¸
2
ds
_
≤ L
2
T
_
t
0
E([X−
ˆ
X[
2
) ds.
Furthermore
E
_
¸
¸
¸
¸
_
t
0
B(X, s) −B(
ˆ
X, s) dW
¸
¸
¸
¸
2
_
= E
__
t
0
¸
¸
¸B(X, s) −B(
ˆ
X, s)
¸
¸
¸
2
ds
_
≤ L
2
_
t
0
E([X−
ˆ
X[
2
) ds.
Therefore for some appropriate constant C we have
E([X(t) −
ˆ
X(t)[
2
) ≤ C
_
t
0
E([X−
ˆ
X[
2
) ds,
91
provided 0 ≤ t ≤ T. If we now set φ(t) := E([X(t) −
ˆ
X(t)[
2
φ(t) ≤ C
_
t
0
φ(s) ds for all 0 ≤ t ≤ T.
Therefore Gronwall’s Lemma, with C
0
= 0, implies φ ≡ 0. Thus X(t) =
ˆ
X(t) a.s. for
all 0 ≤ t ≤ T, and so X(r) =
ˆ
X(r) for all rational 0 ≤ r ≤ T, except for some set of
probability zero. As X and
ˆ
X have continuous sample paths almost surely,
P
_
max
0≤t≤T
[X(t) −
ˆ
X(t)[ > 0
_
= 0.
2. Existence. We will utilize the iterative scheme introduced earlier. Deﬁne
_
X
0
(t) := X
0
X
n+1
(t) := X
0
+
_
t
0
b(X
n
(s), s) ds +
_
t
0
B(X
n
(s), s) dW,
for n = 0, 1, . . . and 0 ≤ t ≤ T. Deﬁne also
d
n
(t) := E([X
n+1
(t) −X
n
(t)[
2
).
We claim that
d
n
(t) ≤
(Mt)
n+1
(n + 1)!
for all n = 0, . . . , 0 ≤ t ≤ T
for some constant M, depending on L, T and X
0
. Indeed for n = 0, we have
d
0
(t) = E([X
1
(t) −X
0
(t)[
2
)
= E
_
¸
¸
¸
¸
_
t
0
b(X
0
, s) ds +
_
t
0
B(X
0
, s) dW
¸
¸
¸
¸
2
_
≤ 2E
_
¸
¸
¸
¸
_
t
0
L(1 +[X
0
[) ds
¸
¸
¸
¸
2
_
+ 2E
__
t
0
L
2
(1 +[X
0
[
2
) ds
_
≤ tM
for some large enough constant M. This conﬁrms the claim for n = 0.
92
Next assume the claim is valid for some n −1. Then
d
n
(t) = E([X
n+1
(t) −X
n
(t)[
2
)
= E

¸
¸
¸
_
t
0
b(X
n
, s) −b(X
n−1
, s) ds
+
_
t
0
B(X
n
, s) −B(X
n−1
, s) dW
¸
¸
¸
¸
2
_
≤ 2TL
2
E
__
t
0
[X
n
−X
n−1
[
2
ds
_
+ 2L
2
E
__
t
0
[X
n
−X
n−1
[
2
ds
_
≤ 2L
2
(1 +T)
_
t
0
M
n
s
n
n!
ds by the induction hypothesis

M
n+1
t
n+1
(n + 1)!
,
provided we choose M ≥ 2L
2
(1 +T). This proves the claim.
3. Now note
max
0≤t≤T
[X
n+1
(t) −X
n
(t)[
2
≤ 2TL
2
_
T
0
[X
n
−X
n−1
[
2
ds
+ 2 max
0≤t≤T
¸
¸
¸
¸
_
t
0
B(X
n
, s) −B(X
n−1
, s) dW
¸
¸
¸
¸
2
.
Consequently the martingale inequality from Chapter 2 implies
E
_
max
0≤t≤T
[X
n+1
(t) −X
n
(t)[
2
_
≤ 2TL
2
_
T
0
E([X
n
−X
n−1
[
2
) ds
+ 8L
2
_
T
0
E([X
n
−X
n−1
[
2
) ds
≤ C
(MT)
n
n!
by the claim above.
4. The Borel–Cantelli Lemma thus applies, since
P
_
max
0≤t≤T
[X
n+1
(t) −X
n
(t)[ >
1
2
n
_
≤ 2
2n
E
_
max
0≤t≤T
[X
n+1
(t) −X
n
(t)[
2
_
≤ 2
2n
C(MT)
n
n!
and

n=1
2
2n
(MT)
n
n!
< ∞.
93
Thus
P
_
max
0≤t≤T
[X
n+1
(t) −X
n
(t)[ >
1
2
n
i.o.
_
= 0.
In light of this, for almost every ω
X
n
= X
0
+
n−1

j=0
(X
j+1
−X
j
)
converges uniformly on [0, T] to a process X(). We pass to limits in the deﬁnition of
X
n+1
(), to prove
X(t) = X
0
+
_
t
0
b(X, s) ds +
_
t
0
B(X, s) dW for 0 ≤ t ≤ T.
That is,
(SDE)
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
,
for times 0 ≤ t ≤ T.
5. We must still show X() ∈ L
2
n
(0, T). We have
E([X
n+1
(t)[
2
) ≤ CE([X
0
[
2
) +CE
_
¸
¸
¸
¸
_
t
0
b(X
n
, s) ds
¸
¸
¸
¸
2
_
+CE
_
¸
¸
¸
¸
_
t
0
B(X
n
, s) dW
¸
¸
¸
¸
2
_
≤ C(1 +E([X
0
[
2
)) +C
_
t
0
E([X
n
[
2
) ds,
where, as usual, “C” denotes various constants. By induction, therefore,
E([X
n+1
(t)[
2
) ≤
_
C +C
2
+ +C
n+2
t
n+1
(n + 1)!
_
(1 +E([X
0
[
2
)).
Consequently
E([X
n+1
(t)[
2
) ≤ C(1 +E([X
0
[
2
))e
Ct
.
Let n → ∞:
E([X(t)[
2
) ≤ C(1 +E([X
0
[
2
))e
Ct
for all 0 ≤ t ≤ T;
and so X ∈ L
2
n
(0, T).
94
C. PROPERTIES OF SOLUTIONS.
In this section we mention, without proofs, a few properties of the solution to various
SDE.
THEOREM (Estimate on higher moments of solutions). Suppose that b, B and
X
0
satisfy the hypotheses of the Existence and Uniqueness Theorem. If, in addition,
E([X
0
[
2p
) < ∞ for some integer p > 1,
then the solution X() of
(SDE)
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
satisﬁes the estimates
(i) E([X(t)[
2p
) ≤ C
2
(1 +E([X
0
[
2p
))e
C
1
t
and
(ii) E([X(t) −X
0
[
2p
) ≤ C
2
(1 +E([X
0
[
2p
))t
p
e
C
2
t
for certain constants C
1
and C
2
, depending only on T, L, m, n.
The estimates above on the moments of X() are fairly crude, but are nevertheless
sometimes useful:
APPLICATION: SAMPLE PATH PROPERTIES. The possibility that B ≡ 0 is
not excluded, and consequently it could happen that the solution of our SDE is really a
solution of the ODE
˙
X = b(X, t),
with possibly random initial data. In this case the mapping t → X(t) will be smooth if b
is. On the other hand, if for some 1 ≤ i ≤ n

1≤l≤m
[b
il
(x, t)[
2
> 0 for all x ∈ R
n
, 0 ≤ t ≤ T,
then almost every sample path t → X
i
(t) is nowhere diﬀerentiable for a.e. ω. We can
however use estimates (i) and (ii) above to check the hypotheses of Kolmogorov’s Theorem
from ¸C in Chapter 3. It follows that for almost all sample paths,
the mapping t → X(t) is H¨ older continuous with each exponent less than
1
2
,
provided E([X
0
[
2p
) < ∞ for each 1 ≤ p < ∞.
95
THEOREM (Dependence on parameters). Suppose for k = 1, 2, . . . that b
k
, B
k
and X
k
0
satisfy the hypotheses of the Existence and Uniqueness Theorem, with the same
constant L. Assume further that
(a) lim
k→∞
E([X
k
0
−X
0
[
2
) = 0,
and for each M > 0,
(b) lim
k→∞
max
0≤t≤T
|x|≤M
([b
k
(x, t) −b(x, t)[ +[B
k
(x, t) −B(x, t)[) = 0.
Finally suppose that X
k
() solves
_
dX
k
= b
k
(X
k
, t)dt +B
k
(X
k
, t)dW
X
k
(0) = X
k
0
.
Then
lim
k→∞
E
_
max
0≤t≤T
[X
k
(t) −X(t)[
2
_
= 0,
where X is the unique solution of
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
.
Example (Small noise limits). In particular, for almost every ω the random trajec-
tories of the SDE
_
dX
ε
= b(X
ε
)dt +εdW
X
ε
(0) = x
0
converge uniformly on [0, T] as ε → 0 to the deterministic trajectory of
_
˙ x = b(x),
x(0) = x
0
.

96
D. LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS.
This section presents some fairly explicit formulas for solutions of linear SDE.
DEFINITION. The stochastic diﬀerential equation
dX = b(X, t)dt +B(X, t)dW
is linear provided the coeﬃcients b and B have this form:
b(x, t) := c(t) +D(t)x,
for c : [0, T] →R
n
, D : [0, T] →M
n×n
, and
B(x, t) := E(t) +F(t)x
for E : [0, T] → M
n×m
, F : [0, T] → L(R
n
, M
n×m
), the space of bounded linear mappings
from R
n
to M
n×m
.
DEFINITION. A linear SDE is called homogeneous if c ≡ E ≡ 0 for 0 ≤ t ≤ T. It is
called linear in the narrow sense if F ≡ 0.
Remark. If
sup
0≤t≤T
[[c(t)[ +[D(t)[ +[E(t)[ +[F(t)[] < ∞,
then b and B satisfy the hypotheses of the Existence and Uniqueness Theorem. Thus the
linear SDE
_
dX = (c(t) +D(t)X)dt + (E(t) +F(t)X)dW
X(0) = X
0
has a unique solution, provided E([X
0
[
2
) < ∞, and X
0
is independent of J
+
(0).
FORMULAS FOR SOLUTIONS: linear equations in narrow sense
Suppose ﬁrst D ≡ D is constant. Then the solution of
(10)
_
dX = (c(t) +DX)dt +E(t)dW
X(0) = X
0
is
(11) X(t) = e
Dt
X
0
+
_
t
0
e
D(t−s)
(c(s)ds +E(s) dW),
where
e
Dt
:=

k=0
D
k
t
k
k!
.
97
More generally, the solution of
(12)
_
dX = (c(t) +D(t)X)dt +E(t)dW
X(0) = X
0
is
(13) X(t) = Φ(t)
_
X
0
+
_
t
0
Φ(s)
−1
(c(s)ds +E(s) dW)
_
,
where Φ() is the fundamental matrix of the nonautonomous system of ODE

dt
= D(t)Φ, Φ(0) = I.

These assertions follow formally from standard formulas in ODE theory if we write
EdW = Eξdt, ξ as usual denoting white noise, and regard Eξ as an inhomogeneous term
driving the ODE
˙
X = c(t) +D(t)X+E(t)ξ.
This will not be so if F() ,≡ 0, owing to the extra term in Itˆ o’s formula.
Observe also that formula (13) shows X(t) to be Gaussian if X
0
is.
FORMULAS FOR SOLUTIONS: general scalar linear equations
Suppose now n = 1, but m ≥ 1 is arbitrary. Then the solution of
(14)
_
dX = (c(t) +d(t)X)dt +

m
l=1
(e
l
(t) +f
l
(t)X)dW
l
X(0) = X
0
is
(15)
X(t) = Φ(t)
_
X
0
+
_
t
0
Φ(s)
−1
_
c(s) −
m

l=1
e
l
(s)f
l
(s)
_
ds
_
+
_
t
0
m

l=1
Φ(s)
−1
e
l
(s) dW
l
,
where
Φ(t) := exp
_
_
t
0
d −
m

l=1
(f
l
)
2
2
ds +
_
t
0
m

l=1
f
l
dW
l
_
.
See Arnold [A, Chapter 8] for more formulas for solutions of general linear equations.

3. SOME METHODS FOR SOLVING LINEAR SDE
98
For practice with Itˆ o’s formula, let us derive some of the formulas stated above.
Example 1. Consider ﬁrst the linear stochastic diﬀerential equation
(16)
_
dX = d(t)Xdt +f(t)XdW
X(0) = X
0
for m = n = 1. We will try to ﬁnd a solution having the product form
X(t) = X
1
(t)X
2
(t),
where
(17)
_
dX
1
= f(t)X
1
dW
X
1
(0) = X
0
and
(18)
_
dX
2
= A(t)dt +B(t)dW
X
2
(0) = 1,
where the functions A and B are to be selected. Then
dX = d(X
1
X
2
)
= X
1
dX
2
+X
2
dX
1
+f(t)X
1
B(t)dt
= f(t)XdW + (X
1
dX
2
+f(t)X
1
B(t)dt),
according to (17). Now we try to choose A, B so that
dX
2
+f(t)B(t)dt = d(t)X
2
dt.
For this, B ≡ 0 and A(t) = d(t)X
2
(t) will work. Thus (18) reads
_
dX
2
= d(t)X
2
dt
X
2
(0) = 1.
This is non-random: X
2
(t) = e
_
t
0
d(s) ds
. Since the solution of (17) is
X
1
(t) = X
0
e
_
t
0
f(s) dW−
1
2
_
t
0
f
2
(s) ds
,
we conclude that
X(t) = X
1
(t)X
2
(t) = X
0
e
_
t
0
f(s) dW+
_
t
0
d(s)−
1
2
f
2
(s) ds
,
a formula noted earlier.
99
Example 2. Consider next the general equation
(19)
_
dX = (c(t) +d(t)X)dt + (e(t) +f(t)X)dW
X(0) = X
0
,
again for m = n = 1. As above, we try for a solution of the form
X(t) = X
1
(t)X
2
(t),
where now
(20)
_
dX
1
= d(t)X
1
dt +f(t)X
1
dW
X
1
(0) = 1
and
(21)
_
dX
2
= A(t)dt +B(t)dW
X
2
(0) = X
0
,
the functions A, B to be chosen. Then
dX = X
2
dX
1
+X
1
dX
2
+f(t)X
1
B(t)dt
= d(t)Xdt +f(t)XdW
+X
1
(A(t)dt +B(t)dW) +f(t)X
1
B(t)dt.
We now require
X
1
(A(t)dt +B(t)dW) +f(t)X
1
B(t)dt = c(t)dt +e(t)dW;
and this identity will hold if we take
_
A(t) := [c(t) −f(t)e(t)](X
1
(t))
−1
B(t) := e(t)(X
1
(t))
−1
.
Observe that since X
1
(t) = e
_
t
0
f dW+
_
t
0
d−
1
2
f
2
ds
, we have X
1
(t) > 0 almost surely. Conse-
quently
X
2
(t) = X
0
+
_
t
0
[c(s) −f(s)e(s)](X
1
(s))
−1
ds
+
_
t
0
e(s)(X
1
(s))
−1
dW.
100
Employing this and the expression above for X
1
, we arrive at the formula, a special case
of (15):
X(t) = X
1
(t)X
2
(t)
= exp
__
t
0
d(s) −
1
2
f
2
(s) ds +
_
t
0
f(s) dW
_

_
X
0
+
_
t
0
exp
_

_
r
0
d(r) −
1
2
f
2
(r) dr −
_
s
0
f(r) dW
_
(c(s) −e(s)f(s)) ds
+
_
t
0
exp
_

_
s
0
d(r) −
1
2
f
2
(r) dr −
_
s
0
f(r) dW
_
e(s) dW.
_

Remark. There is great theoretical and practical interest in numerical methods for
simulation of solutions to random diﬀerential equations. The paper of Higham [H] is a
good introduction.
101
CHAPTER 6: APPLICATIONS.
A. Stopping times
B. Applications to PDE, Feynman-Kac formula
C. Optimal stopping
D. Options pricing
E. The Stratonovich integral
This chapter is devoted to some applications and extensions of the theory developed
earlier.
A. STOPPING TIMES.
DEFINITIONS, BASIC PROPERTIES. Let (Ω, |, P) be a probability space and
T() a ﬁltration of σ–algebras, as in Chapters 4 and 5. We introduce now some random
times that are well–behaved with respect to T():
DEFINITION. A random variable τ : Ω → [0, ∞] is called a stopping time with respect
to T() provided
¦τ ≤ t¦ ∈ T(t) for all t ≥ 0.
This says that the set of all ω ∈ Ω such that τ(ω) ≤ t is an T(t)-measurable set. Note
that τ is allowed to take on the value +∞, and also that any constant τ ≡ t
0
is a stopping
time.
THEOREM (Properties of stopping times). Let τ
1
and τ
2
be stopping times with
respect to T(). Then
(i) ¦τ < t¦ ∈ T(t), and so ¦τ = t¦ ∈ T(t), for all times t ≥ 0.
(ii) τ
1
∧ τ
2
:= min(τ
1
, τ
2
), τ
1
∨ τ
2
:= max(τ
1
, τ
2
) are stopping times.
Proof. Observe that
¦τ < t¦ =

_
k=1
¦τ ≤ t −1/k¦
. ¸¸ .
∈F(t−1/k)⊆F(t)
.
Also, we have ¦τ
1
∧τ
2
≤ t¦ = ¦τ
1
≤ t¦ ∪¦τ
2
≤ t¦ ∈ T(t), and furthermore ¦τ
1
∨τ
2
≤ t¦ =
¦τ
1
≤ t¦ ∩ ¦τ
2
≤ t¦ ∈ T(t).
The notion of stopping times comes up naturally in the study of stochastic diﬀerential
equations, as it allows us to investigate phenomena occuring over “random time intervals”.
An example will make this clearer:
102
Example (Hitting a set). Consider the solution X() of the SDE
_
dX(t) = b(t, X)dt +B(t, X)dW
X(0) = X
0
,
where b, B and X
0
satisfy the hypotheses of the Existence and Uniqueness Theorem.
THEOREM. Let E be either a nonempty closed subset or a nonempty open subset of
R
n
. Then
τ := inf¦t ≥ 0 [ X(t) ∈ E¦
is a stopping time. (We put τ = +∞ for those sample paths of X() that never hit E.)
E
X(τ)
Proof. Fix t ≥ 0; we must show ¦τ ≤ t¦ ∈ T(t). Take ¦t
i
¦

i=1
to be a countable dense
subset of [0, ∞). First we assume that E = U is an open set. Then the event
¦τ ≤ t¦ =
_
t
i
≤t
¦X(t
i
) ∈ U¦
. ¸¸ .
∈F(t
i
)⊆F(t)
belongs to T(t).
Next we assume that E = C is a closed set. Set d(x, C) := dist(x, C) and deﬁne the
open sets
U
n
= ¦x : d(x, C) <
1
n
¦.
The event
¦τ ≤ t¦ =

n=1
_
t
i
≤t
¦X(t
i
) ∈ U
n

. ¸¸ .
∈F(t
i
)⊆F(t)
also belongs to T(t).
Discussion. The random variable
σ := sup¦t ≥ 0 [ X(t) ∈ E¦,
103
the last time that X(t) hits E, is in general not a stopping time. The heuristic reason is
that the event ¦σ ≤ t¦ would depend upon the entire future history of process and thus
would not in general be T(t)-measurable. (In applications T(t) “contains the history of
X() up to and including time t, but does not contain information about the future”.)
The name “stopping time” comes from the example, where we sometimes think of
halting the sample path X() at the ﬁrst time τ that it hits a set E. But there are many
examples where we do not really stop the process at time τ. Thus “stopping time” is not
a particularly good name and “Markov time” would be better.
STOCHASTIC INTEGRALS AND STOPPING TIMES. Our next task is to
consider stochastic integrals with random limits of integration and to work out an Itˆ o
formula for these.
DEFINITION. If G ∈ L
2
(0, T) and τ is a stopping time with 0 ≤ τ ≤ T, we deﬁne
_
τ
0
GdW :=
_
T
0
χ
{t≤τ}
GdW.
LEMMA (Itˆ o integrals with stopping times). If G ∈ L
2
(0, T)and 0 ≤ τ ≤ T is a
stopping time, then
(i) E
__
τ
0
GdW
_
= 0
(ii) E
_
(
_
τ
0
GdW)
2
_
= E
__
τ
0
G
2
dt
_
.
Proof. We have
E
__
τ
0
GdW
_
= E
_
_
_
_
T
0
χ
{t≤τ}
G
. ¸¸ .
∈L
2
(0,T)
dW
_
_
_
= 0,
and
E((
_
τ
0
GdW)
2
) = E((
_
T
0
χ
{t≤τ}
GdW)
2
)
= E(
_
T
0

{t≤τ}
G)
2
dt)
= E(
_
τ
0
G
2
dt).

104
Similar formulas hold for vector–valued processes.
IT
ˆ
O’S FORMULA WITH STOPPING TIMES. As usual, let W() denote m-
dimensional Brownian motion. Recall next from Chapter 4 that if dX = b(X, t)dt +
B(X, t)dW, then for each C
2
function u,
(1) du(X, t) =
∂u
∂t
dt +
n

i=1
∂u
∂x
i
dX
i
+
1
2
n

i,j=1

2
u
∂x
i
∂x
j
m

k=1
b
ik
b
jk
dt.
Written in integral form, this means:
(2) u(X(t), t) −u(X(0), 0) =
_
t
0
_
∂u
∂t
+Lu
_
ds +
_
t
0
Du BdW,
for the diﬀerential operator
Lu :=
1
2
n

i,j=1
a
ij
u
x
i
x
j
+
n

i=1
b
i
u
x
i
, a
ij
=
m

k=1
b
ik
b
jk
,
and
Du BdW =
m

k=1
n

i=1
u
x
i
b
ik
dW
k
.
The argument of u in these integrals is (X(s), s). We call L the generator.
For a ﬁxed ω ∈ Ω, formula (2) holds for all 0 ≤ t ≤ T. Thus we may set t = τ, where τ
is a stopping time, 0 ≤ τ ≤ T:
u(X(τ), τ) −u(X(0), 0) =
_
τ
0
_
∂u
∂t
+Lu
_
ds +
_
τ
0
Du BdW.
Take expected value:
(3) E(u(X(τ), τ)) −E(u(X(0), 0)) = E
__
τ
0
_
∂u
∂t
+Lu
_
ds
_
.
We will see in the next section that this formula provides a very important link between
stochastic diﬀerential equations and (nonrandom) partial diﬀerential equations.
BROWNIAN MOTION AND THE LAPLACIAN. The most important case is
X() = W(), n-dimensional Brownian motion, the generator of which is
Lu =
1
2
n

i=1
u
x
i
x
i
=:
1
2
∆u.
The expression ∆u is called the Laplacian of u and occurs throughout mathematics and
physics. We will demonstrate in the next section some important links with Brownian
motion.
105
B. APPLICATIONS TO PDE, FEYNMAN–KAC FORMULA.
PROBABILISTIC FORMULAS FOR SOLUTIONS OF PDE.
Example 1 (Expected hitting time to a boundary). Let U ⊂ R
n
be a bounded
open set, with smooth boundary ∂U. According to standard PDE theory, there exists a
smooth solution u of the equation
(4)
_

1
2
∆u = 1 in U
u = 0 on ∂U.
Our goal is to ﬁnd a probabilistic representation formula for u. For this, ﬁx any point
x ∈ U and consider then an n-dimensional Brownian motion W(). Then X() := W()+x
represents a “Brownian motion starting at x”. Deﬁne
τ
x
:= ﬁrst time X() hits ∂U.
THEOREM. We have
(5) u(x) = E(τ
x
) for all x ∈ U.
In particular, u > 0 in U.
Proof. We employ formula (3), with Lu =
1
2
∆u. We have for each n = 1, 2, . . .
E(u(X(τ
x
∧ n))) −E(u(X(0))) = E
__
τ
x
∧n
0
1
2
∆u(X) ds
_
.
Since
1
2
∆u = −1 and u is bounded,
lim
n→∞
E(τ
x
∧ n) < ∞.
Thus τ
x
is integrable. Thus if we let n → ∞ above, we get
u(x) −E(u(X(τ
x
))) = E
__
τ
x
0
1 ds
_
= E(τ
x
).
But u = 0 on ∂U, and so u(X(τ
x
)) ≡ 0. Formula (5) follows.
Again recall that u is bounded on U. Hence
E(τ
x
) < ∞, and so τ
x
< ∞ a.s., for all x ∈ U.
This says that Brownian sample paths starting at any point x ∈ U will with probability 1
eventually hit ∂U.
Example 2 (Probabilistic representation of harmonic functions). Let U ⊂ R
n
be
a smooth, bounded domain and g : ∂U → R a given continuous function. It is known
from classical PDE theory that there exists a function u ∈ C
2
(U) ∩ C(
¯
U) satisfying the
boundary value problem:
(6)
_
∆u = 0 in U
u = g on ∂U.
We call u a harmonic function.
106
THEOREM. We have for each point x ∈ U
(7) u(x) = E(g(X(τ
x
))),
for X() := W() +x, Brownian motion starting at x.
Proof. As shown above,
E(u(X(τ
x
))) = E(u(X(0))) +E
__
τ
x
0
1
2
∆u(X) ds
_
= E(u(X(0))) = u(x),
the second equality valid since ∆u = 0 in U. Since u = g on ∂U, formula (7) follows.
APPLICATION: In particular, if ∆u = 0 in some open set containing the ball B(x, r),
then
u(x) = E(u(X(τ
x
))),
where τ
x
now denotes the hitting time of Brownian motion starting at x to ∂B(x, r). Since
Brownian motion is isotropic in space, we may reasonably guess that the term on the right
hand side is just the average of u over the sphere ∂B(x, r), with respect to surface measure.
That is, we have the identity
(8) u(x) =
1
area of ∂B(x, r)
_
∂B(x,r)
udS.
This is the mean value formula for harmonic functions.
Example 3 (Hitting one part of a boundary ﬁrst). Assume next that we can write
∂U as the union of two disjoint parts Γ
1
, Γ
2
. Let u solve the PDE
_
¸
_
¸
_
∆u = 0 in U
u = 1 on Γ
1
u = 0 on Γ
2
.
THEOREM. For each point x ∈ U, u(x) is the probability that a Brownian motion
starting at x hits Γ
1
before hitting Γ
2
.
Γ
1
Γ
2 x
107
Proof. Apply (7) for
g =
_
1 on Γ
1
0 on Γ
2
.
Then
u(x) = E(g(X(τ
x
))) = probability of hitting Γ
1
before Γ
2
.

FEYNMAN–KAC FORMULA. Now we extend Example #2 above to obtain a prob-
abilistic representation for the unique solution of the PDE
_

1
2
∆u +cu = f in U
u = 0 on ∂U.
We assume c, f are smooth functions, with c ≥ 0 in U.
THEOREM (Feynman–Kac formula). For each x ∈ U,
u(x) = E
__
τ
x
0
f(X(t))e

_
t
0
c(X) ds
dt
_
where, as before, X() := W() +x is a Brownian motion starting at x, and τ
x
denotes the
ﬁrst hitting time of ∂U.
Proof. We know E(τ
x
) < ∞. Since c ≥ 0, the integrals above all converge.
First look at the process
Y (t) := e
Z(t)
,
for Z(t) := −
_
t
0
c(X) ds. Then
dZ = −c(X)dt,
and so Itˆ o’s formula yields
dY = −c(X)Y dt.
Hence the Itˆo product rule implies
d
_
u(X)e

_
t
0
c(X) ds
_
= (du(X))e

_
t
0
c(X) ds
+u(X)d
_
e

_
t
0
c(X)ds
_
=
_
1
2
∆u(X)dt +
n

i=1
∂u(X)
∂x
i
dW
i
_
e

_
t
0
c(X)ds
+u(X)(−c(X)dt)e

_
t
0
c(X) ds
.
108
We use formula (3) for τ = τ
x
, and take the expected value, obtaining
E
_
u(X(τ
x
))e

_
τ
x
0
c(X) ds
_
−E(u(X(0)))
= E
__
τ
x
0
_
1
2
∆u(X) −c(X)u(X)
_
e

_
t
0
c(X) ds
dt
_
.
Since u solves (8), this simpliﬁes to give
u(x) = E(u(X(0))) = E
__
τ
x
0
f(X)e

_
t
0
c(X) ds
dt
_
,
as claimed.
An interpretation. We can explain this formula as describing a Brownian motion with
“killing”, as follows.
Suppose that the Brownian particles may disappear at a random killing time σ, for
example by being absorbed into the medium within which it is moving. Assume further
that the probability of its being killed in a short time interval [t, t +h] is
c(X(t))h +o(h).
Then the probability of the particle surviving until time t is approximately equal to
(1 −c(X(t
1
))h)(1 −c(X(t
2
))h) . . . (1 −c(X(t
n
))h),
where 0 = t
0
< t
1
< < t
n
= t, h = t
k+1
−t
k
. As h → 0, this converges to e

_
t
0
c(X) ds
.
Hence it should be that
u(x) = average of f(X()) over all sample paths which survive to hit ∂U
= E
__
τ
x
0
f(X)e

_
t
0
c(X) ds
dt
_
.

Remark. If we consider in these examples the solution of the SDE
_
dX = b(X)dt +B(X)dW
X(0) = x,
we can obtain similar formulas, where now
τ
x
= hitting time of ∂U for X()
and
1
2
∆u is replaced by the generator
Lu :=
1
2
n

i,j=1
a
ij
u
x
i
x
j
+
n

i=1
b
i
u
x
i
.
Note, however, we need to know that the various PDE have smooth solutions. This need
not always be the case for degenerate elliptic operators L.
109
C. OPTIMAL STOPPING.
The general mathematical setting for many control theory problems is this. We are given
some “system” whose state evolves in time according to a diﬀerential equation (determin-
istic or stochastic). Given also are certain controls which aﬀect somehow the behavior of
the system: these controls typically either modify some parameters in the dynamics or else
stop the process, or both. Finally we are given a cost criterion, depending upon our choice
of control and the corresponding state of the system.
The goal is to discover an optimal choice of controls, to minimize the cost criterion.
The easiest stochastic control problem of the general type outlined above occurs when
we cannot directly aﬀect the SDE controlling the evolution of X() and can only decide at
each instance whether or not to stop. A typical such problem follows.
STOPPING A STOCHASTIC DIFFERENTIAL EQUATION. Let U ⊂ R
m
be a bounded, smooth domain. Suppose b : R
n
→R
n
, B : R
n
→ M
n×m
satisfy the usual
assumptions.
Then for each x ∈ U the stochastic diﬀerential equation
_
dX = b(X)dt +B(X)dW
X
0
= x
has a unique solution. Let τ = τ
x
denote the hitting time of ∂U. Let θ be any stopping
time with respect to T(), and for each such θ deﬁne the expected cost of stopping X() at
time θ ∧ τ to be
(9) J
x
(θ) := E(
_
θ∧τ
0
f(X(s)) ds +g(X(θ ∧ τ))).
The idea is that if we stop at the possibly random time θ < τ, then the cost is a given
function g of the current state of X(θ). If instead we do not stop the process before it hits
∂U, that is, if θ ≥ τ, the cost is g(X(τ)). In addition there is a running cost per unit time
f of keeping the system in operation until time θ ∧ τ.
OPTIMAL STOPPING. The main question is this: does there exist an optimal
stopping time θ

= θ

x
, for which
J
x

) = min
θ stopping
time
J
x
(θ)?
And if so, how can we ﬁnd θ

? It turns out to be very diﬃcult to try to design θ

directly.
A much better idea is to turn attention to the value function
(10) u(x) := inf
θ
J
x
(θ),
110
and to try to ﬁgure out what u is as a function of x ∈ U. Note that u(x) is the minimum
expected cost, given we start the process at x. It turns out that once we know u, we will
be then be able to construct an optimal θ

. This approach is called dynamic programming.
OPTIMALITY CONDITIONS. So assume u is deﬁned above and suppose u is
smooth enough to justify the following calculations. We wish to determine the properties
of this function.
First of all, notice that we could just take θ ≡ 0 in the deﬁnition (10). That is, we could
just stop immediately and incur the cost g(X(0)) = g(x). Hence
(11) u(x) ≤ g(x) for each point x ∈ U.
Furthermore, τ ≡ 0 if x ∈ ∂U, and so
(12) u(x) = g(x) for each point x ∈ ∂U.
Next take any point x ∈ U and ﬁx some small number δ > 0. Now if we do not stop the
system for time δ, then according to (SDE) the new state of the system at time δ will be
X(δ). Then, given that we are at the point X(δ), the best we can achieve in minimizing
the cost thereafter must be
u(X(δ)).
So if we choose not to stop the system for time δ, and assuming we do not hit ∂U, our
cost is at least
E(
_
δ
0
f(X) ds +u(X(δ))).
Since u(x) is the inﬁmum of costs over all stopping times, we therefore have
u(x) ≤ E(
_
δ
0
f(X) ds +u(X(δ))).
Now by Itˆ o’s formula
E(u(X(δ))) = u(x) +E(
_
δ
0
Lu(X) ds),
for
Lu =
1
2
n

i,j=1
a
ij

2
u
∂x
i
∂x
j
+
n

i=1
b
i
∂u
∂x
i
, a
ij
=
m

k=1
b
ik
b
jk
.
Hence
0 ≤ E(
_
δ
0
f(X) +Lu(X) ds).
111
Divide by δ > 0, and then let δ → 0:
0 ≤ f(x) +Lu(x).
Equivalently, we have
(13) Mu ≤ f in U,
where Mu := −Lu.
Finally we observe that if in (11) a strict inequality held, that is, if
u(x) < g(x) at some point x ∈ U,
then it is optimal not to stop the process at once. Thus it is plausible to think that we
should leave the system going, for at least some very small time δ. In this circumstance
we then would have an equality in the formula above; and so
(14) Mu = f at those points where u < g.
In summary, we combine (11)–(14) to ﬁnd that if the formal reasoning above is valid,
then the value function u satisﬁes:
(15)
_
max(Mu −f, u −g) = 0 in U
u = g on ∂U
These are the optimality conditions.
SOLVING FOR THE VALUE FUNCTION. Our rigorous study of the stopping
time problem now begins by showing ﬁrst that there exists a unique solution u of (15)
and second that this u is in fact min
θ
J
x
(θ). Then we will use u to design θ

, an optimal
stopping time.
THEOREM. Suppose f, g are given smooth functions. There exists a unique funtion u,
with bounded second derivatives, such that:
(i) u ≤ g in U,
(ii) Mu ≤ f almost everywhere in U,
(iii) max(Mu −f, u −g) = 0 almost everywhere in U,
(iv) u = g on ∂U.
In general u / ∈ C
2
(U).
The idea of the proof is to approximate (15) by a penalized problem of this form:
_
Mu
ε

ε
(u
ε
−g) = f in U
u
ε
= g on ∂U,
112
where β
ε
: R → R is a smooth, convex function, β

ε
≥ 0, and β
ε
≡ 0 for x ≤ 0,
lim
→0
β
ε
(x) = ∞ for x > 0. Then u
ε
→ u. It will in practice be diﬃcult to ﬁnd a
precise formula for u, but computers can provide accurate numerical approximations.
DESIGNING AN OPTIMAL STOPPING POLICY. Now we show that our
solution of (15) is in fact the value function, and along the way we will learn how to design
an optimal stopping strategy θ

.
First note that the stopping set
S := ¦x ∈ U [ u(x) = g(x)¦
is closed. Deﬁne for each x ∈
¯
U,
θ

= ﬁrst hitting time of S.
THEOREM. Let u be the solution of (15). Then
u(x) = J
x

) = inf
θ
J
x
(θ)
for all x ∈
¯
U.
This says that we should ﬁrst compute the solution to (15) to ﬁnd S, deﬁne θ

as above,
and then we should run X() until it hits S (or else exits from U).
Proof. 1. Deﬁne the continuation set
C := U −S = ¦x ∈ U [ u(x) < g(x)¦.
On this set Lu = f, and furthermore u = g on ∂C. Since τ ∧ θ

is the exit time from C,
we have for x ∈ C
u(x) = E(
_
τ∧θ

0
f(X(s)) ds +g(X(θ

∧ τ))) = J
x

).
On the other hand, if x ∈ S, τ ∧ θ

= 0; and so
u(x) = g(x) = J
x

).
Thus for all x ∈
¯
U, we have u(x) = J
x

).
2. Now let θ be any other stopping time. We need to show
u(x) = J
x

) ≤ J
x
(θ).
113
Now by Itˆ o’s formula
u(x) = E(
_
τ∧θ
0
Mu(X) ds +u(X(τ ∧ θ)))
But Mu ≤ f and u ≤ g in
¯
U. Hence
u(x) ≤ E(
_
τ∧θ
0
f(X) ds +g(X(τ ∧ θ))) = J
x
(θ).
But since u(x) = J
x

), we consequently have
u(x) = J
x

) = min
θ
J
x
(θ),
as asserted.
D. OPTIONS PRICING.
In this section we outline an application to mathematical ﬁnance, mostly following
Baxter–Rennie [B-R] and the class lectures of L. Goldberg. Another basic reference is Hull
[Hu].
THE BASIC PROBLEM. Let us consider a given security, say a stock, whose price
at time t is S(t). We suppose that S evolves according to the SDE introduced in Chapter
5:
(16)
_
dS = µSdt +σSdW
S(0) = s
0
,
where µ > 0 is the drift and σ ,= 0 the volatility. The initial price s
0
is known.
A derivative is a ﬁnancial instrument whose payoﬀ depends upon (i.e., is derived from)
the behavior of S(). We will investigate a European call option, which is the right to buy
one share of the stock S, at the price p at time T. The number p is called the strike price
and T > 0 the strike (or expiration) time. The basic question is this:
What is the “proper price” at time t = 0 of this option?
In other words, if you run a ﬁnancial ﬁrm and wish to sell your customers this call option,
how much should you charge? (We are looking for the “break–even” price, for which the
ﬁrm neither makes nor loses money.)
114
ARBITRAGE AND HEDGING. To simplify, we assume hereafter that the pre-
vailing, no-risk interest rate is the constant r > 0. This means that \$1 put in a bank at
time t = 0 becomes \$e
rT
at time t = T. Equivalently, \$1 at time t = T is worth only
\$e
−rT
at time t = 0.
As for the problem of pricing our call option, a ﬁrst guess might be that the proper
price should be
(17) e
−rT
E((S(T) −p)
+
),
for x
+
:= max(x, 0). The reasoning behind this guess is that if S(T) < p, then the option
is worthless. If S(T) > p, we can buy a share for the price p, immediately sell at price
S(T), and thereby make a proﬁt of (S(T) − p)
+
. We average this over all sample paths
and multiply by the discount factor e
−rT
, to arrive at (17).
As reasonable as this may all seem, (17) is in fact not the proper price. Other forces
are at work in ﬁnancial markets. Indeed the fundamental factor in options pricings is
arbitrage, meaning the possibility of risk-free proﬁts.
We must price our option so as to create no arbitrage opportunities for others.
To convert this principle into mathematics, we introduce also the notion of hedging.
This means somehow eliminating our risk as the seller of the call option. The exact details
appear below, but the basic idea is that we can in eﬀect “duplicate” our option by a
portfolio consisting of (continually changing) holdings of a risk–free bond and of the stock
on which the call is written.
A PARTIAL DIFFERENTIAL EQUATION. We demonstrate next how use these
principles to convert our pricing problem into a PDE. We introduce for s ≥ 0 and 0 ≤ t ≤ T,
the unknown price function
(18) u(s, t), denoting the proper price of the option at time t, given that S(t) = s.
Then u(s
0
, 0) is the price we are seeking.
Boundary conditions. We need to calculate u. For this, notice ﬁrst that at the expiration
time T, we have
(19) u(s, T) = (s −p)
+
(s ≥ 0).
Furthermore, if s = 0, then S(t) = 0 for all 0 ≤ t ≤ T and so
(20) u(0, t) = 0 (0 ≤ t ≤ T).
115
We seek how a PDE u solves for s > 0, 0 ≤ t ≤ T.
Duplicating an option, self-ﬁnancing. To go further, deﬁne the process
(21) C(t) := u(S(t), t) (0 ≤ t ≤ T).
Thus C(t) is the current price of the option at time t, and is random since the stock price
S(t) is random. According to Itˆ o’s formula and (16)
(22)
dC = u
t
dt +u
s
dS +
1
2
u
ss
(dS)
2
= (u
t
+µSu
s
+
σ
2
2
S
2
u
ss
)dt +σSu
s
dW.
Now comes the key idea: we propose to “duplicate” C by a portfolio consisting of shares of
S and of a bond B. More precisely, assume that B is a risk-free investment, which therefore
grows at the prevailing interest rate r:
(23)
_
dB = rBdt
B(0) = 1.
This just means B(t) = e
rt
, of course. We will try to ﬁnd processes φ and ψ so that
(24) C = φS +ψB (0 ≤ t ≤ T).
Discussion. The point is that if we can construct φ, ψ so that (24) holds, we can eliminate
all risk. To see this more clearly, imagine that your ﬁnancial ﬁrm sells a call option,
as above. The ﬁrm thereby incurs the risk that at time T, the stock price S(T) will
exceed p, and so the buyer will exercise the option. But if in the meantime the ﬁrm has
constructed the portfolio (24), the proﬁts from it will exactly equal the funds needed to
pay the customer. Conversely, if the option is worthless at time T, the portfolio will have
no proﬁt.
But to make this work, the ﬁnancial ﬁrm should not have to inject any new money into
the hedging scheme, beyond the initial investment to set it up. We ensure this by requiring
that the portfolio represented on the right-hand side of (24) be self-ﬁnancing. This means
that the changes in the value of the portfolio should depend only upon the changes in S, B.
We therefore require that
(25) dC = φdS +ψdB (0 ≤ t ≤ T).
Remark (discrete version of self-ﬁnancing). Roughly speaking, a portfolio is self-
ﬁnancing if it is ﬁnancially self contained. To understand this better, let us consider a
116
diﬀerent model in which time is discrete, and the values of the stock and bond at a time
t
i
are given by S
i
and B
i
respectively. Here ¦t
i
¦
N
i=0
is an increasing sequence of times and
we suppose that each time step t
i+1
− t
i
is small. A portfolio can now be thought of as
a sequence ¦(φ
i
, ψ
i

N
i=0
, corresponding to our changing holdings of the stock S and the
bond B over each time interval.
Now for a given time interval (t
i
, t
i+1
), C
i
= φ
i
S
i
+ ψ
i
B
i
is the opening value of the
portfolio and C
i+1
= φ
i
S
i+1
+ ψ
i
B
i+1
represents the closing value. The self-ﬁnancing
condition means that the ﬁnancing gap C
i+1
− C
i
of cash (that would otherwise have to
be injected to pay for our construction strategy) must be zero.This is equivalent to saying
that
C
i+1
−C
i
= φ
i
(S
i+1
−S
i
) +ψ
i
(B
i+1
−B
i
),
the continuous version of which is condition (25).
Combining formulas (22), (23) and (25) provides the identity
(26)
(u
t
+µSu
s
+
σ
2
2
S
2
u
ss
)dt +σSu
s
dW
= φ(µSdt +σSdW) +ψrBdt.
So if (24) holds, (26) must be valid, and we are trying to select φ, ψ to make all this so.
We observe in particular that the terms multiplying dW on each side of (26) will match
provided we take
(27) φ(t) := u
s
(S(t), t) (0 ≤ t ≤ T).
(u
t
+
σ
2
2
S
2
u
ss
)dt = rψBdt.
But ψB = C −φS = u −u
s
S, according to (24), (27). Consequently,
(28) (u
t
+rSu
s
+
σ
2
2
S
2
u
ss
−ru)dt = 0.
The argument of u and its partial derivatives is (S(t), t).
Consequently, to make sure that (21) is valid, we ask that the function u = u(s, t) solve
the Black–Scholes–Merton PDE
(29) u
t
+rsu
s
+
σ
2
2
s
2
u
ss
−ru = 0 (0 ≤ t ≤ T).
The main outcome of all our ﬁnancial reasoning is the derivation of this partial diﬀerential
equation. Observe that the parameter µ does not appear.
117
More on self-ﬁnancing. Before going on, we return to the self-ﬁnancing condition (25).
The Itˆo product rule and (24) imply
dC = φdS +ψdB +Sdφ +Bdψ +dφdS.
To ensure (25), we consequently must make sure that
(30) Sdφ +Bdψ +dφdS = 0,
where we recall φ = u
s
(S(t), t). Now dφdS = σ
2
S
2
u
ss
dt. Thus (30) is valid provided
(31) dψ = −B
−1
(Sdφ +σ
2
S
2
u
ss
dt).
We can conﬁrm this by noting that (24), (27) imply
ψ = B
−1
(C −φS) = e
−rt
(u(S, t) −u
s
(S, t)S).
A direct calculation using (28) veriﬁes (31).
SUMMARY. To price our call option, we solve the boundary-value problem
(32)
_
_
_
u
t
+rsu
s
+
σ
2
2
s
2
u
ss
−ru = 0 (s > 0, 0 ≤ t ≤ T)
u = (s −p)
+
(s > 0, t = T)
u = 0 (s = 0, 0 ≤ t ≤ T).
Remember that u(s
0
, 0) is the price we are trying to ﬁnd. It turns out that this problem
can be solved explicitly, although we omit the details here: see for instance Baxter–Rennie
[B-R].
E. THE STRATONOVICH INTEGRAL.
We next discuss the Stratonovich stochastic calculus, which is an alternative to Itˆ o’s
approach. Most of the following material is from Arnold [A, Chapter 10].
1. Motivation. Let us consider ﬁrst of all the formal random diﬀerential equation
(33)
_
˙
X = d(t)X +f(t)Xξ
X(0) = X
0
,
where m = n = 1 and ξ() is 1-dimensional “white noise”. If we interpret this rigorously
as the stochastic diﬀerential equation:
(34)
_
dX = d(t)Xdt +f(t)XdW
X(0) = X
0
,
118
we then recall from Chapter 5 that the unique solution is
(35) X(t) = X
0
e
_
t
0
d(s)−
1
2
f
2
(s) ds+
_
t
0
f(s) dW
.
On the other hand perhaps (33) is a proposed mathematical model of some physical
process and we are not really sure whether ξ() is “really” white noise. It could perhaps
be instead some process with smooth (but highly complicated) sample paths. How would
this possibility change the solution?
APPROXIMATING WHITE NOISE. More precisely, suppose that ¦ξ
k
()¦

k=1
is
a sequence of stochastic processes satisfying:
(a) E(ξ
k
(t)) = 0,
(b) E(ξ
k
(t)ξ
k
(s)) := d
k
(t −s),
(c) ξ
k
(t) is Gaussian for all t ≥ 0,
(d) t → ξ
k
(t) is smooth for all ω,
where we suppose that the functions d
k
() converge as k → ∞ to δ
0
, the Dirac measure at
0.
In light of the formal deﬁnition of the white noise ξ() as a Gaussian process with
Eξ(t) = 0, E(ξ(t)ξ(s)) = δ
0
(t −s), the ξ
k
() are thus presumably smooth approximations
of ξ().
LIMITS OF SOLUTIONS. Now consider the problem
(36)
_
˙
X
k
= d(t)X
k
+f(t)X
k
ξ
k
X
k
(0) = X
0
.
For each ω this is just a regular ODE, whose solution is
X
k
(t) := X
0
e
_
t
0
d(s) ds+
_
t
0
f(s)ξ
k
(s) ds
.
Next look at
Z
k
(t) :=
_
t
0
f(s)ξ
k
(s) ds.
For each time t ≥ 0, this is a Gaussian random variable, with
E(Z
k
(t)) = 0.
Furthermore,
E(Z
k
(t)Z
k
(s)) =
_
t
0
_
s
0
f(τ)f(σ)d
k
(τ −σ) dσdτ

_
t
0
_
s
0
f(τ)f(σ)δ
0
(τ −σ) dσdτ
=
_
t∧s
0
f
2
(τ) dτ.
119
Hence as k → ∞, Z
k
(t) converges in L
2
to a process whose distributions agree with those
_
t
0
f(s) dW. And therefore X
k
(t) converges to a process whose distributions agree with
(37)
ˆ
X(t) := X
0
e
_
t
0
d(s) ds+
_
t
0
f(s) dW
.
This does not agree with the solution (35)!
Discussion. Thus if we regard (33) as an Itˆ o SDE with ξ() a “true” white noise, (35) is
our solution. But if we approximate ξ() by smooth processes ξ
k
(), solve the approximate
problems (36) and pass to limits with the approximate solutions X
k
(), we get a diﬀerent
solution. This means that (33) is unstable with respect to changes in the random term
ξ(). This conclusion has important consequences in questions of modeling, since it may
be unclear experimentally whether we really have ξ() or instead ξ
k
() in (33) and similar
problems.
In view of all this, it is appropriate to ask if there is some way to redeﬁne the stochastic
integral so these diﬃculties do not come up. One answer is the Stratonovich integral.
2. Deﬁnition of Stratonovich integral.
A one-dimensional example. Recall that in Chapter 4 we deﬁned for 1-dimensional
Brownian motion
_
T
0
W dW := lim
|P
n
|→0
m
n
−1

k=0
W(t
n
k
)(W(t
n
k+1
) −W(t
n
k
)) =
W
2
(T) −T
2
,
where P
n
:= ¦0 = t
n
0
< t
n
1
< < t
n
m
n
= T¦ is a partition of [0, T]. This corresponds
to a sequence of Riemann sum approximations, where the integrand is evaluated at the
left-hand endpoint of each subinterval [t
n
k
, t
n
k+1
].
The corresponding Stratonovich integral is instead deﬁned this way:
_
T
0
W ◦ dW := lim
|P
n
|→0
m
n
−1

k=0
_
W(t
n
k+1
) +W(t
n
k
)
2
_
(W(t
n
k+1
) −W(t
n
k
)) =
W
2
(T)
2
.
(Observe the notational change: we hereafter write a small circle before the dW to signify
the Stratonovich integral.) According to calculations in Chapter 4, we also have
_
T
0
W ◦ dW = lim
|P
n
|→0
m
n
−1

k=0
W
_
t
n
k+1
+t
n
k
2
_
(W(t
n
k+1
) −W(t
n
k
)).
Therefore for this case the Stratonovich integral corresponds to a Riemann sum approxi-
mation where we evaluate the integrand at the midpoint of each subinterval [t
n
k
, t
n
k+1
].

We generalize this example and so introduce the
120
DEFINITION. Let W() be an n-dimensional Brownian motion and let B : R
n
[0, T] →
M
n×n
be a C
1
function such that
E
_
_
T
0
[B(W, t)[
2
dt
_
< ∞.
Then we deﬁne
_
T
0
B(W, t) ◦ dW := lim
|P
n
|→0
m
n
−1

k=0
B
_
W(t
n
k+1
) +W(t
n
k
)
2
, t
n
k
_
(W(t
n
k+1
) −W(t
n
k
)).
It can be shown that this limit exists in L
2
(Ω).
A CONVERSION FORMULA. Remember that Itˆo’s integral can be computed this
way:
_
T
0
B(W, t) dW = lim
|P
n
|→0
m
n
−1

k=0
B(W(t
n
k
), t
n
k
)(W(t
n
k+1
) −W(t
n
k
)).
This is in general not equal to the Stratonovich integral, but there is a conversion formula
(38)
_
_
T
0
B(W, t) ◦ dW
_
i
=
_
_
T
0
B(W, t) dW
_
i
+
1
2
_
T
0
n

j=1
∂b
ij
∂x
j
(W, t) dt,
for i = 1, . . . , n. Here v
i
means the i
th
–component of the vector function v. This formula
is proved by noting
_
T
0
B(W, t) ◦ dW−
_
T
0
B(W, t) dW
= lim
|P
n
|→0
m
n
−1

k=0
_
B
_
W(t
n
k+1
) +W(t
n
k
)
2
, t
n
k
_
−B(W(t
n
k
), t
n
k
)
_
(W(t
n
k+1
) −W(t
n
k
))
and using the Mean Value Theorem plus some usual methods for evaluating the limit. We
omit details.
Special case. If n = 1, then
_
T
0
b(W, t) ◦ dW =
_
T
0
b(W, t) dW +
1
2
_
T
0
∂b
∂x
(W, t) dt.

Assume now B : R
n
[0, T] →M
n×m
and W() is an m-dimensional Brownian motion.
We make this informal
121
DEFINITION. If X() is a stochastic process with values in R
n
, we deﬁne
_
T
0
B(X, t) ◦ dW := lim
|P
n
|→0
m
n
−1

k=0
B
_
X(t
n
k+1
) +X(t
n
k
)
2
, t
n
k
_
(W(t
n
k+1
) −W(t
n
k
)),
provided this limit exists in L
2
(Ω) for all sequences of partitions P
n
, with [P
n
[ → 0.
3. Stratonovich chain rule.
DEFINITION. Suppose that the process X() solves the Stratonovich integral equation
X(t) = X(0) +
_
t
0
b(X, s) ds +
_
t
0
B(X, s) ◦ dW (0 ≤ t ≤ T)
for b : R
n
[0, T] →R
n
and B : R
n
[0, T] →M
n×m
. We then write
dX = b(X, t)dt +B(X, t) ◦ dW,
the second term on the right being the Stratonovich stochastic diﬀerential.
THEOREM (Stratonovich chain rule). Assume
dX = b(X, t)dt +B(X, t) ◦ dW
and suppose u : R
n
[0, T] →R is smooth. Deﬁne
Y (t) := u(X(t), t).
Then
dY =
∂u
∂t
dt +
n

i=1
∂u
∂x
i
◦ dX
i
=
_
∂u
∂t
+
n

i=1
∂u
∂x
i
b
i
_
dt +
n

i=1
m

k=1
∂u
∂x
i
b
ik
◦ dW
k
.
Thus the ordinary chain rule holds for Stratonovich stochastic diﬀerentials, and there is

2
u
∂x
i
∂x
j
as there is for Itˆ o’s formula. We omit the proof, which
is similar to that for the Itˆ o rule. The main diﬀerence is that we make use of the formula
_
T
0
W ◦ dW =
1
2
W
2
(T) in the approximations.
122
More discussion. Next let us return to the motivational example we began with. We
have seen that if the diﬀerential equation (33) is interpreted to mean
_
dX = d(t)Xdt +f(t)XdW (Itˆ o’s sense),
X(0) = X
0
,
then
X(t) = X
0
e
_
t
0
d(s)−
1
2
f
2
(s) ds+
_
t
0
f(s) dW
.
However, if we interpret (33) to mean
_
dX = d(t)Xdt +f(t)X ◦ dW (Stratonovich’s sense)
X(0) = X
0
,
the solution is
˜
X(t) = X
0
e
_
t
0
d(s) ds+
_
t
0
f(s) dW
,
as is easily checked using the Stratonovich calculus described above.
This solution
˜
X() is also the solution obtained by approximating the “white noise” ξ()
by smooth processes ξ
k
() and passing to limits. This suggests that interpreting (16) and
similar formal random diﬀerential equations in the Stratonovich sense will provide solutions
which are stable with respect to perturbations in the random terms. This is indeed the
case: See the articles [S1-2] by Sussmann.
Note also that these considerations clarify a bit the problems of interpreting mathemat-
ically the formal random diﬀerential equation (33), but do not say which interpretation is
physically correct. This is a question of modeling and is not, strictly speaking, a mathe-
matical issue.
CONVERSION RULES FOR SDE.
Let W() be an m-dimensional Wiener process and suppose b : R
n
[0, T] → R
n
,
B : R
n
[0, T] → M
n×m
satisfy the hypotheses of the basic existence and uniqueness
theorem. Then X() solves the Itˆ o stochastic diﬀerential equation
_
dX = b(X, t)dt +B(X, t)dW
X(0) = X
0
if and only if X() solves the Stratonovich stochastic diﬀerential equation
_
dX =
_
b(X, t) −
1
2
c(X, t)
¸
dt +B(X, t) ◦ dW
X(0) = X
0
,
for
c
i
(x, t) =
m

k=1
n

j=1
∂b
ik
∂x
j
(x, t)b
jk
(x, t) (1 ≤ i ≤ n).
123
A special case. For m = n = 1, this says
dX = b(X)dt +σ(X)dW
if and only if
dX = (b(X) −
1
2
σ

(X)σ(X))dt +σ(X) ◦ dW.
4. Summary We conclude these lectures by summarizing the advantages of each
deﬁnition of the stochastic integral:
1. Simple formulas: E
_
_
t
0
GdW
_
= 0, E
_
_
_
t
0
GdW
_
2
_
= E
_
_
t
0
G
2
dt
_
.
2. I(t) =
_
t
0
GdW is a martingale.
1. Ordinary chain rule holds.
2. Solutions of stochastic diﬀerential equations interpreted in Stratonovich sense are
stable with respect to changes in random terms.
124
APPENDICES
Appendix A: Proof of Laplace–DeMoivre Theorem (from ¸G in Chapter 2)
Proof. 1. Set S

n
:=
S
n
−np

npq
, this being a random variable taking on the value x
k
=
k−np

npq
(k = 0, . . . , n) with probability p
n
(k) =
_
n
k
_
p
k
q
n−k
.
Look at the interval
_
−np

npq
,
nq

npq
_
. The points x
k
divide this interval into n subintervals
of length
h :=
1

npq
.
Now if n goes to ∞, and at the same time k changes so that [x
k
[ is bounded, then
k = np +x
k

npq → ∞
and
n −k = nq −x
k

npq → ∞.
2. We recall next Stirling’s formula, which says says
n! = e
−n
n
n

2πn(1 +o(1)) as n → ∞,
where “o(1)” denotes a term which goes to 0 as n → ∞. (See Mermin [M] for a nice
discussion.) Hence as n → ∞
(1)
p
n
(k) =
_
n
k
_
p
k
q
n−k
=
n!
k!(n −k)!
p
k
q
n−k
=
e
−n
n
n

2πnp
k
q
n−k
e
−k
k
k

2πke
−(n−k)
(n −k)
(n−k)
_
2π(n −k)
(1 +o(1))
=
1

_
n
k(n −k)
_
np
k
_
k
_
nq
n −k
_
n−k
(1 +o(1)).
3. Observe next that if x = x
k
=
k−np

npq
, then
1 +
_
q
np
x = 1 +
_
q
np
_
k −np

npq
_
=
k
np
and
1 −
_
p
nq
x =
n −k
nq
.
125
Note also log(1 ±y) = ±y −
y
2
2
+O(y
3
) as y → 0. Hence
log
_
np
k
_
k
= −k log
_
k
np
_
= −k log
_
1 +
_
q
np
x
_
= −(np +x

npq)
__
q
np
x −
q
2np
x
2
_
+O
_
n

1
2
_
.
Similarly,
log
_
nq
n −k
_
n−k
= −(nq −x

npq)
_

_
p
nq
x −
p
2nq
x
2
_
+O
_
n

1
2
_
.
Add these expressions and simplify, to discover
lim
n→∞
k−np

npq
→x
log
_
_
np
k
_
k
_
nq
n −k
_
n−k
_
= −
x
2
2
.
Consequently
(2) lim
n→∞
k−np

npq
→x
_
np
k
_
k
_
nq
n −k
_
n−k
= e

x
2
2
.
4. Finally, observe
(3)
_
n
k(n −k)
=
1

npq
(1 +o(1)) = h(1 +o(1)),
since k = np +x

npq, n −k = nq −x

npq.
Now
P(a ≤ S

n
≤ b) =

a≤x
k
≤b
x
k
=
k−np

npq
p
n
(k)
for a < b. In view of (1) − (3), the latter expression is a Riemann sum approximation as
n → ∞ of the integral
1

_
b
a
e

x
2
2
dx.

Appendix B: Proof of discrete martingale inequalities (from ¸I in Chapter 2)
126
Proof. 1. Deﬁne
A
k
:=
k−1

j=1
¦X
j
≤ λ¦ ∩ ¦X
k
> λ¦ (k = 1, . . . , n).
Then
A :=
_
max
1≤k≤n
X
k
> λ
_
=
n
_
k=1
A
k
. ¸¸ .
disjoint union
.
Since λP(A
k
) ≤
_
A
k
X
k
dP, we have
(4) λP(A) = λ
n

k=1
P(A
k
) ≤
n

k=1
E(χ
A
k
X
k
).
Therefore
E(X
+
n
) ≥
n

k=1
E(X
+
n
χ
A
k
)
=
n

k=1
E(E(X
+
n
χ
A
k
[ X
1
, . . . , X
k
))
=
n

k=1
E(χ
A
k
E(X
+
n
[ X
1
, . . . , X
k
))

n

k=1
E(χ
A
k
E(X
n
[ X
1
, . . . , X
k
))

n

k=1
E(χ
A
k
X
k
) by the submartingale property
≥ λP(A) by (4).
2. Notice next that the proof above in fact demonstrates
λP
_
max
1≤k≤n
X
k
> λ
_

_
¦max
1≤k≤n
X
k
>λ¦
X
+
n
dP.
Apply this to the submartingale [X
k
[:
(5) λP(X > λ) ≤
_
{X>λ}
Y dP,
127
for X := max
1≤k≤n
[X
k
[, Y := [X
n
[. Now take some 1 < p < ∞. Then
E([X[
p
) = −
_

0
λ
p
dP(λ) for P(λ) := P(X > λ)
= p
_

0
λ
p−1
P(λ) dλ
≤ p
_

0
λ
p−1
_
1
λ
_
{X>λ}
Y dP
_
dλ by (5)
= p
_

Y
_
_
X
0
λ
p−2

_
dP
=
p
p −1
_

Y X
p−1
dP

p
p −1
__

Y
p
dP
_
1/p
__

X
p
dP
_
1−1/p
.

Appendix C: Proof of continuity of indeﬁnite Itˆo integral (from ¸C in Chapter
4)
Proof. We will assume assertion (i) of the Theorem in ¸C of Chapter 4, which states that
the indeﬁnite integral I() is a martingale.
There exist step processes G
n
∈ L
2
(0, T), such that
E
_
_
T
0
(G
n
−G)
2
dt
_
→ 0.
Write I
n
(t) :=
_
t
0
G
n
dW, for 0 ≤ t ≤ T. If G
n
(s) ≡ G
n
k
for t
n
k
≤ s < t
n
k+1
, then
I
n
(t) =
k−1

i=0
G
n
i
(W(t
n
i+1
) −W(t
n
i
)) +G
n
k
(W(t) −W(t
n
k
))
for t
n
k
≤ t < t
n
k+1
. Therefore I
n
() has continuous sample paths a.s., since Brownian
motion does. Since I
n
() is a martingale, it follows that [I
n
− I
m
[
2
is a submartingale.
The martingale inequality now implies
P
_
sup
0≤t≤T
[I
n
(t) −I
m
(t)[ > ε
_
= P
_
sup
0≤t≤T
[I
n
(t) −I
m
(t)[
2
> ε
2
_

1
ε
2
E([I
n
(T) −I
m
(T)[
2
)
=
1
ε
2
E
_
_
T
0
[G
n
−G
m
[
2
dt
_
.
128
Choose ε =
1
2
k
. Then there exists n
k
such that
P
_
sup
0≤t≤T
[I
n
(t) −I
m
(t)[ >
1
2
k
_
≤ 2
2k
E
_
_
T
0
[G
n
(t) −G
m
(t)[
2
dt
_

1
k
2
for m, n ≥ n
k
.
We may assume n
k+1
≥ n
k
≥ n
k−1
≥ . . . , and n
k
→ ∞. Let
A
k
:=
_
sup
0≤t≤T
[I
n
k
(t) −I
n
k+1
(t)[ >
1
2
k
_
.
Then
P(A
k
) ≤
1
k
2
.
Thus by the Borel–Cantelli Lemma, P(A
k
i.o.) = 0; which is to say, for almost all ω
sup
0≤t≤T
[I
n
k
(t, ω) −I
n
k+1
(t, ω)[ ≤
1
2
k
provided k ≥ k
0
(ω).
Hence I
n
k
(, ω) converges uniformly on [0, T] for almost every ω, and therefore J(t, ω) :=
lim
k→∞
I
n
k
(t, ω) is continuous for amost every ω. As I
n
(t) → I(t) in L
2
(Ω) for all 0 ≤
t ≤ T, we deduce as well that J(t) = I(t) amost every for all 0 ≤ t ≤ T. In other words,
J() is a version of I(). Since for almost every ω, J(, ω) is the uniform limit of continuous
functions, J() has continuous sample paths a.s.
129
EXERCISES
(1) Show, using the formal manipulations for Itˆ o’s formula discussed in Chapter 1, that
Y (t) := e
W(t)−
t
2
solves the stochastic diﬀerential equation
_
dY = Y dW,
Y (0) = 1.
(Hint: If X(t) := W(t) −
t
2
, then dX = −
dt
2
+dW.)
(2) Show that
P(t) = p
0
e
σW(t)+
_
µ−
σ
2
2
_
t
,
solves
_
dP = µPdt +σPdW,
P(0) = p
0
.
(3) Let Ω be any set and / any collection of subsets of Ω. Show that there exists a
unique smallest σ-algebra | of subsets of Ω containing /. We call | the σ-algebra
generated by /.
(Hint: Take the intersection of all the σ-algebras containing /.)
(4) Let X =

k
i=1
a
i
χ
A
i
be a simple random variable, where the real numbers a
i
are
distinct, the events A
i
are pairwise disjoint, and Ω = ∪
k
i=1
A
i
. Let |(X) be the
σ-algebra generated by X.
(i) Describe precisely which sets are in |(X).
(ii) Suppose the random variable Y is |(X)-measurable. Show that Y is constant
on each set A
i
.
(iii) Show that therefore Y can be written as a function of X.
(5) Verify:
_

−∞
e
−x
2
dx =

π,
1

2πσ
2
_

−∞
xe

(x−m)
2

2
dx = m,
1

2πσ
2
_

−∞
(x −m)
2
e

(x−m)
2

2
dx = σ
2
.
(6) (i) Suppose A and B are independent events in some probability space. Show that
A
c
and B are independent. Likewise, show that A
c
and B
c
are independent.
130
(ii) Suppose that A
1
, A
2
, . . . , A
m
are disjoint events, each of positive probability,
such that Ω = ∪
m
j=1
A
j
. Prove Bayes’ formula:
P(A
k
[ B) =
P(B[ A
k
)P(A
k
)

m
j=1
P(B[ A
j
)P(A
j
)
(k = 1, . . . , m),
provided P(B) > 0.
(7) During the Fall, 1999 semester 105 women applied to UC Sunnydale, of whom 76
were accepted, and 400 men applied, of whom 230 were accepted.
During the Spring, 2000 semester, 300 women applied, of whom 100 were ac-
cepted, and 112 men applied, of whom 21 were accepted.
Calculate numerically
a. the probability of a female applicant being accepted during the fall,
b. the probability of a male applicant being accepted during the fall,
c. the probability of a female applicant being accepted during the spring,
d. the probability of a male applicant being accepted during the spring.
Consider now the total applicant pool for both semesters together, and calculate
e. the probability of a female applicant being accepted,
f. the probability of a male applicant being accepted.
Are the University’s admission policies biased towards females? or males?
(8) Let X be a real–valued, N(0, 1) random variable, and set Y := X
2
. Calculate the
density g of the distribution function for Y .
(Hint: You must ﬁnd g so that P(−∞ < Y ≤ a) =
_
a
−∞
g dy for all a.)
(9) Take Ω = [0, 1] [0, 1], with | the Borel sets and P Lebesgue measure. Let
g : [0, 1] →R be a continuous function.
Deﬁne the random variables
X
1
(ω) := g(x
1
), X
2
(ω) := g(x
2
) for ω = (x
1
, x
2
) ∈ Ω.
Show that X
1
and X
2
are independent and identically distributed.
(10) (i) Let (Ω, |, P) be a probability space and A
1
⊆ A
2
⊆ ⊆ A
n
⊆ . . . be events.
Show that
P
_

_
n=1
A
n
_
= lim
m→∞
P(A
m
).
(Hint: Look at the disjoint events B
n
:= A
n+1
−A
n
.)
(ii) Likewise, show that if A
1
⊇ A
2
⊇ ⊇ A
n
⊇ . . . , then
P
_

n=1
A
n
_
= lim
m→∞
P(A
m
).
131
(11) Let f : [0, 1] →R be continuous and deﬁne the Bernstein polynomial
b
n
(x) :=
n

k=0
f
_
k
n
__
n
k
_
x
k
(1 −x)
n−k
.
Prove that b
n
→ f uniformly on [0, 1] as n → ∞, by providing the details for the
following steps.
(i) Since f is uniformly continuous, for each > 0 there exists δ() > 0 such
that [f(x) −f(y)[ ≤ if [x −y[ ≤ δ().
(ii) Given x ∈ [0, 1], take a sequence of independent random variables X
k
such
that P(X
k
= 1) = x, P(X
k
= 0) = 1 − x. Write S
n
= X
1
+ + X
n
. Then
b
n
(x) = E(f(
S
n
n
)).
(iii) Therefore
[b
n
(x) −f(x)[ ≤ E([f(
S
n
n
) −f(x)[)
=
_
A
[f(
S
n
n
) −f(x)[ dP +
_
A
c
[f(
S
n
n
) −f(x)[ dP,
for A = ¦ω ∈ Ω [ [
S
n
n
−x[ ≤ δ()¦.
(iv) Then show
[b
n
(x) −f(x)[ ≤ +
2M
δ()
2
V (
S
n
n
) = +
2M
nδ()
2
V (X
1
),
for M = max [f[. Conclude that b
n
→ f uniformly.
(12) Let X and Y be independent random variables, and suppose that f
X
and f
Y
are
the density functions for X, Y . Show that the density function for X +Y is
f
X+Y
(z) =
_

−∞
f
X
(z −y)f
Y
(y) dy.
(Hint: If g : R →R, we have
E(g(X +Y )) =
_

−∞
_

−∞
f
X,Y
(x, y)g(x +y) dxdy,
where f
X,Y
is the joint density function of X, Y .)
(13) Let X and Y be two independent positive random variables, each with density
f(x) =
_
e
−x
if x ≥ 0
0 if x < 0.
132
Find the density of X +Y .
(14) Show that
lim
n→∞
_
1
0
_
1
0

_
1
0
f(
x
1
+. . . x
n
n
) dx
1
dx
2
. . . dx
n
= f(
1
2
)
for each continuous function f.
(Hint: P([
x
1
+...x
n
n

1
2
[ > ) ≤
1

2
V (
x
1
+...x
n
n
) =
1
12
2
n
.)
(15) Prove that
(i) E(E(X[ 1)) = E(X).
(ii) E(X) = E(X[ J), where J = ¦∅, Ω¦ is the trivial σ-algebra.
(16) Let X, Y be two real–valued random variables and suppose their joint distribution
function has the density f(x, y) . Show that
E(X[Y ) = Φ(Y ) a.s.
for
Φ(y) =
_

−∞
xf(x, y) dx
_

−∞
f(x, y) dx
.
(Hints: Φ(Y ) is a function of Y and so is |(Y )–measurable. Therefore we must
show that
(∗)
_
A
X dP =
_
A
Φ(Y ) dP for all A ∈ |(Y ).
Now A = Y
−1
(B) for some Borel subset of R. So the left hand side of (∗) is
(∗∗)
_
A
X dP =
_

χ
B
(Y )X dP =
_

−∞
_
B
xf(x, y) dydx.
The right hand side of (∗) is
_
A
Φ(Y ) dP =
_

−∞
_
B
Φ(y)f(x, y) dydx,
which equals the right hand side of (∗∗). Fill in the details.)
(17) A smooth function Φ : R →R is called convex if Φ

(x) ≥ 0 for all x ∈ R.
(i) Show that if Φ is convex, then
Φ(y) ≥ Φ(x) + Φ

(x)(y −x) for all x, y ∈ R.
133
(ii) Show that
Φ(
x +y
2
) ≤
1
2
Φ(x) +
1
2
Φ(y) for all x, y ∈ R.
(iii) A smooth function Φ : R
n
→ R is called convex if the matrix ((Φ
x
i
x
j
)) is
nonnegative deﬁnite for all x ∈ R
n
. (This means that

n
i,j=1
Φ
x
i
x
j
ξ
i
ξ
j
≥ 0 for all
ξ ∈ R
n
.) Prove
Φ(y) ≥ Φ(x) +DΦ(x) (y −x) and Φ(
x +y
2
) ≤
1
2
Φ(x) +
1
2
Φ(y)
for all x, y ∈ R
n
. (Here “D” denotes the gradient.)
(18) (i) Prove Jensen’s inequality:
Φ(E(X)) ≤ E(Φ(X))
for a random variable X : Ω → R, where Φ is convex. (Hint: Use assertion (iii)
from the previous problem.)
(ii) Prove the conditional Jensen’s inequality:
Φ(E(X[1)) ≤ E(Φ(X)[1).
(19) Let W() be a one-dimensional Brownian motion. Show
E(W
2k
(t)) =
(2k)!t
k
2
k
k!
.
(20) Show that if W() is an n-dimensional Brownian motion, then so are
(i) W(t +s) −W(s) for all s ≥ 0,
(ii) cW(t/c
2
) for all c > 0 (“Brownian scaling”).
(21) Let W() be a one-dimensional Brownian motion, and deﬁne
¯
W(t) :=
_
tW(
1
t
) for t > 0
0 for t = 0.
Show that
¯
W(t)−
¯
W(s) is N(0, t−s) for times 0 ≤ s ≤ t. (
¯
W() also has independent
increments and so is a one-dimensional Brownian motion. You do not need to show
this.)
(22) Deﬁne X(t) :=
_
t
0
W(s) ds, where W() is a one-dimensional Brownian motion.
Show that
E(X
2
(t)) =
t
3
3
for each t > 0.
134
(23) Deﬁne X(t) as in the previous problem. Show that
E(e
λX(t)
) = e
λ
2
t
3
6
for each t > 0.
(Hint: X(t) is a Gaussian random variable, the variance of which we know from
the previous homework problem.)
(24) Deﬁne U(t) := e
−t
W(e
2t
), where W() is a one-dimensional Brownian motion.
Show that
E(U(t)U(s)) = e
−|t−s|
for all −∞ < s, t < ∞.
(25) Let W() be a one-dimensional Brownian motion. Show that
lim
m→∞
W(m)
m
= 0 almost surely.
(Hint: Fix > 0 and deﬁne the event A
m
:= ¦[
W(m)
m
[ ≥ ¦. Then A
m
= ¦[X[ ≥

m¦ for the N(0, 1) random variable X =
W(m)

m
. Apply the Borel–Cantelli
Lemma.)
(26) (i) Let 0 < γ ≤ 1. Show that if f : [0, T] →R
n
is uniformly H¨ older continuous with
exponent γ, it is also is uniformly H¨ older continuous with each exponent 0 < δ < γ.
(ii) Show that f(t) = t
γ
is uniformly H¨ older continuous with exponent γ on the
interval [0, 1].
(27) Let 0 < γ <
1
2
. These notes show that if W() is a one–dimensional Brownian
motion, then for almost every ω there exists a constant K, depending on ω, such
that
(∗) [W(t, ω) −W(s, ω)[ ≤ K[t −s[
γ
for all 0 ≤ s, t ≤ 1.
Show that there does not exist a constant K such that (∗) holds for almost all ω.
(28) Prove that if G, H ∈ L
2
(0, T), then
E
_
_
T
0
GdW
_
T
0
H dW
_
= E
_
_
T
0
GH dt
_
.
(Hint: 2ab = (a +b)
2
−a
2
−b
2
.)
(29) Let (Ω, |, P) be a probability space, and take T() to be a ﬁltration of σ–algebras.
Assume X be an integrable random variable, and deﬁne X(t) := E(X[T(t)) for
times t ≥ 0.
135
Show that X() is a martingale.
(30) Show directly that I(t) := W
2
(t) −t is a martingale.
(Hint: W
2
(t) = (W(t) − W(s))
2
− W
2
(s) + 2W(t)W(s). Take the conditional
expectation with respect to J(s), the history of W(), and then condition with
respect to the history of I().)
(31) Suppose X() is a real-valued martingale and Φ : R → R is convex. Assume also
E([Φ(X(t))[) < ∞ for all t ≥ 0. Show that
Φ(X()) is a submartingale.
(Hint: Use the conditional Jensen’s inequality.)
(32) Use the Itˆo chain rule to show that Y (t) := e
t
2
cos(W(t)) is a martingale.
(33) Let W() = (W
1
, . . . , W
n
) be an n-dimensional Brownian motion, and write
Y (t) := [W(t)[
2
−nt for times t ≥ 0. Show that Y () is a martingale.
(Hint: Compute dY .)
(34) Show that
_
T
0
W
2
dW =
1
3
W
3
(T) −
_
T
0
W dt
and
_
T
0
W
3
dW =
1
4
W
4
(T) −
3
2
_
T
0
W
2
dt.
(35) Recall from the notes that
Y := e
_
t
0
g dW−
1
2
_
t
0
g
2
ds
satisﬁes
dY = gY dW.
Use this to prove
E(e
_
T
0
g dW
) = e
1
2
_
T
0
g
2
ds
.
(36) Let u = u(x, t) be a smooth solution of the backwards diﬀusion equation
∂u
∂t
+
1
2

2
u
∂x
2
= 0,
and suppose W() is a one-dimensional Brownian motion.
Show that for each time t > 0:
E(u(W(t), t)) = u(0, 0).
136
(37) Calculate E(B
2
(t)) for the Brownian bridge B(), and show in particular that
E(B
2
(t)) → 0 as t → 1

.
(38) Let X solve the Langevin equation, and suppose that X
0
is an N(0,
σ
2
2b
) random
variable. Show that
E(X(s)X(t)) =
σ
2
2b
e
−b|t−s|
.
(39) (i) Consider the ODE
_
˙ x = x
2
(t > 0)
x(0) = x
0
.
Show that if x
0
> 0, the solution “blows up to inﬁnity” in ﬁnite time.
(ii) Next, look at the ODE
_
˙ x = x
1
2
(t > 0)
x(0) = 0.
Show that this problem has inﬁnitely many solutions.
(Hint: x ≡ 0 is a solution. Find also a solution which is positive for times t > 0,
and then combine these solutions to ﬁnd ones which are zero for some time and
then become positive.)
(40) (i) Use the substituion X = u(W) to solve the SDE
_
dX = −
1
2
e
−2X
dt +e
−X
dW
X(0) = x
0
.
(ii) Show that the solution blows up at a ﬁnite, random time.
(41) Solve the SDE dX = −Xdt +e
−t
dW.
(42) Let W = (W
1
, W
2
, . . . , W
n
) be an n-dimensional Brownian motion and write
R := [W[ =
_
n

i=1
(W
i
)
2
_1
2
.
Show that R solves the stochastic Bessel equation
dR =
n

i=1
W
i
R
dW
i
+
n −1
2R
dt.
(43) (i) Show that X = (cos(W), sin(W)) solves the SDE system
_
dX
1
= −
1
2
X
1
dt −X
2
dW
dX
2
= −
1
2
X
2
dt +X
1
dW
137
(ii) Show also that if X = (X
1
, X
2
) is any other solution, then [X[ is constant in
time.
(44) Solve the system
_
dX
1
= dt +dW
1
dX
2
= X
1
dW
2
,
where W = (W
1
, W
2
) is a Brownian motion.
(45) Solve
_
dX
1
= X
2
dt +dW
1
dX
2
= X
1
dt +dW
2
.
(46) Solve
_
dX =
1
2
σ

(X)σ(X)dt +σ(X)dW
X(0) = 0
where W is a one–dimensional Brownian motion and σ is a smooth, positive func-
tion.
(Hint: Let f(x) :=
_
x
0
dy
σ(y)
and set g := f
−1
, the inverse function of f. Show
X := g(W).)
(47) Let f be a positive, smooth function. Use the Feynman-Kac formula to show that
M(t) := f(W(t))e

1
2
_
t
0
∆f(W(s)) ds
is a martingale.
(48) Let τ be the ﬁrst time a one–dimensional Brownian motion hits the half-open
interval (a, b]. Show τ is a stopping time.
(49) Let W denote an n–dimensional Brownian motion, for n ≥ 3. Write X = W+x
0
,
where the point x
0
lies in the region U = ¦0 < R
1
< [x[ < R
2
¦ Calculate explicitly
the probability that X will hit the outer sphere ¦[x[ = R
2
¦ before hitting the inner
sphere ¦[x[ = R
1
¦.
(Hint: Check that
Φ(x) =
1
[x[
n−2
satisﬁes ∆Φ = 0 for x ,= 0. Modify Φ to build a function u which equals 0 on the
inner sphere and 1 on the outer sphere.)
138
References
[A] L. Arnold, Stochastic Diﬀerential Equations: Theory and Applications, Wiley, 1974.
[B-R] M. Baxter and A. Rennie, Financial Calculus: An Introduction to Derivative Pricing, Cambridge
U. Press, 1996.
[B] L. Breiman, Probability, Addison–Wesley, 1968.
[Br] P. Bremaud, An Introduction to Probabilistic Modeling, Springer, 1988.
[C] K. L. Chung, Elementary Probability Theory with Stochastic Processes, Springer, 1975.
[D] M. H. A. Davis, Linear Estimation and Stochastic Control, Chapman and Hall.
[F] A. Friedman, Stochastic Diﬀerential Equations and Applications, Vol. 1 and 2, Academic Press.
[Fr] M. Freidlin, Functional Integration and Partial Diﬀerential Equations, Princeton U. Press, 1985.
[G] C. W. Gardiner, Handbook of Stochastic Methods for Physics, Chemistry, and the Natural Sci-
ences, Springer, 1983.
[G-S] I. I. Gihman and A. V. Skorohod, Stochastic Diﬀerential Equations, Springer, 1972.
[G] D. Gillespie, The mathematics of Brownian motion and Johnson noise, American J. Physics 64
(1996), 225–240.
[H] D. J. Higham, An algorithmic introduction to numerical simulation of stochastic diﬀerential
equations, SIAM Review 43 (2001), 525–546.
[Hu] J. C. Hull, Options, Futures and Other Derivatives (4th ed), Prentice Hall, 1999.
[K] N. V. Krylov, Introduction to the Theory of Diﬀusion Processes, American Math Society, 1995.
[L1] J. Lamperti, Probability, Benjamin.
[L2] J. Lamperti, A simple construction of certain diﬀusion processes, J. Math. Kyˆoto (1964), 161–
170.
[Ml] A. G. Malliaris, Itˆ o’s calculus in ﬁnancial decision making, SIAM Review 25 (1983), 481–496.
[M] D. Mermin, Stirling’s formula!, American J. Physics 52 (1984), 362–365.
[McK] H. McKean, Stochastic Integrals, Academic Press, 1969.
[N] E. Nelson, Dynamical Theories of Brownian Motion, Princeton University Press, 1967.
[O] B. K. Oksendal, Stochastic Diﬀerential Equations: An Introduction with Applications, 4th ed.,
Springer, 1995.
[P-W-Z] R. Paley, N. Wiener, and A. Zygmund, Notes on random functions, Math. Z. 37 (1959), 647–668.
[P] M. Pinsky, Introduction to Fourier Analysis and Wavelets, Brooks/Cole, 2002.
[S] D. Stroock, Probability Theory: An Analytic View, Cambridge U. Press, 1993.
[S1] H. Sussmann, An interpretation of stochastic diﬀerential equations as ordinary diﬀerential equa-
tions which depend on the sample point., Bulletin AMS 83 (1977), 296–298.
[S2] H. Sussmann, On the gap between deterministic and stochastic ordinary diﬀerential equations,
Ann. Probability 6 (1978), 19–41.
139

PREFACE These are an evolving set of notes for Mathematics 195 at UC Berkeley. This course is for advanced undergraduate math majors and surveys without too many precise details random diﬀerential equations and some applications. Stochastic diﬀerential equations is usually, and justly, regarded as a graduate level subject. A really careful treatment assumes the students’ familiarity with probability theory, measure theory, ordinary diﬀerential equations, and perhaps partial diﬀerential equations as well. This is all too much to expect of undergrads. But white noise, Brownian motion and the random calculus are wonderful topics, too good for undergraduates to miss out on. Therefore as an experiment I tried to design these lectures so that strong students could follow most of the theory, at the cost of some omission of detail and precision. I for instance downplayed most measure theoretic issues, but did emphasize the intuitive idea of σ–algebras as “containing information”. Similarly, I “prove” many formulas by conﬁrming them in easy cases (for simple random variables or for step functions), and then just stating that by approximation these rules hold in general. I also did not reproduce in class some of the more complicated proofs provided in these notes, although I did try to explain the guiding ideas. My thanks especially to Lisa Goldberg, who several years ago presented the class with several lectures on ﬁnancial applications, and to Fraydoun Rezakhanlou, who has taught from these notes and added several improvements. I am also grateful to Jonathan Weare for several computer simulations illustrating the text.

2

CHAPTER 1: INTRODUCTION A. MOTIVATION Fix a point x0 ∈ Rn and consider then the ordinary diﬀerential equation: (ODE) ˙ x(t) = b(x(t)) x(0) = x0 , (t > 0)

where b : Rn → Rn is a given, smooth vector ﬁeld and the solution is the trajectory x(·) : [0, ∞) → Rn .

x(t) x0

Trajectory of the differential equation ˙ Notation. x(t) is the state of the system at time t ≥ 0, x(t) :=
d dt x(t).

In many applications, however, the experimentally measured trajectories of systems modeled by (ODE) do not in fact behave as predicted:

X(t) x0

Sample path of the stochastic differential equation Hence it seems reasonable to modify (ODE), somehow to include the possibility of random eﬀects disturbing the system. A formal way to do so is to write: (1) ˙ X(t) = b(X(t)) + B(X(t))ξ(t) X(0) = x0 , (t > 0)

where B : Rn → Mn×m (= space of n × m matrices) and ξ(·) := m-dimensional “white noise”. This approach presents us with these mathematical problems: • Deﬁne the “white noise” ξ(·) in a rigorous way. 3

4 . b. Part of the trouble is the strange form of the chain rule in the stochastic calculus: ˆ C. or is it rather some ensemble of smooth. And once all this is accomplished. We say that X(·) solves (SDE) provided t t (2) X(t) = x0 + 0 b(X(s)) ds + 0 B(X(s)) dW for all times t > 0 . write dt instead of the dot: dW(t) dX(t) = b(X(t)) + B(X(t)) . dt dt and ﬁnally multiply by “dt”: (SDE) dX(t) = b(X(t))dt + B(X(t))dW(t) X(0) = x0 . there will still remain these modeling problems: • Does (SDE) truly model the physical situation? • Is the term ξ(·) in (1) “really” white noise. x0 = 0. or Brownian motion. but highly oscillatory functions? See Chapter 6. discuss uniqueness. B. denoted W(·). asymptotic behavior. • Show (1) has a solution. B. Now we must: • Construct W(·): See Chapter 3. and B ≡ I. • Show (2) has a solution. etc. thereby asserting that “white noise” is the time derivative of the Wiener process. b ≡ 0. etc.: See Chapter 5. t • Deﬁne the stochastic integral 0 · · · dW : See Chapter 4. properly interpreted. This expression. ITO’S FORMULA Assume n = 1 and X(·) solves the SDE (3) dX = b(X)dt + dW. Thus we may symbolically write ˙ W(·) = ξ(·). SOME HEURISTICS Let us ﬁrst study (1) in the case m = n. The solution of (1) in this setting turns out to be the n-dimensional Wiener process. dependence upon x0 . and diﬀerent answers can yield completely diﬀerent solutions of (SDE).• Deﬁne what it means for X(·) to solve (1). is a stochastic diﬀerential equation. As we will see later these questions are subtle. d Now return to the general case of the equation (1).

(4) = d dx . 2 1 = u (bdt + dW ) + u (bdt + dW )2 + . where will see. We ask: what stochastic diﬀerential equation does Y (t) := u(X(t)) (t ≥ 0) solve? Oﬀhand. 2 A major goal of these notes is to provide a rigorous interpretation for calculations like these. 5 t . which follows from (4). namely Y (t) := eW (t) . as we dW ≈ (dt)1/2 1 in some sense. . This is wrong. . however ! In fact. according to the usual chain rule. Consequently if we compute dY and keep all terms of order dt or (dt) 2 . . . 2 from (3) = 1 ub+ u 2 dt + u dW + {terms of order (dt)3/2 and higher}. Y (0) = 1 is Y (t) := eW (t)− 2 . involving stochastic diﬀerentials. Here we used the “fact” that (dW )2 = dt. According to Itˆ’s formula. Example 1. we obtain 1 dY = u dX + u (dX)2 + .Suppose next that u : R → R is a given smooth function. the solution of the stochastic diﬀerential equation o dY = Y dW. we would guess from (3) that dY = u dX = u bdt + u dW. with the extra term “ 1 u dt” not present in ordinary calculus. Hence dY = 1 ub+ u 2 dt + u dW. ˆ and not what might seem the obvious guess.

dP = µP dt + σP dW P (0) = p0 . P (t) = p0 e A sample path for stock prices 6 . In other words. the relative change of price.Example 2. where p0 is the starting price. Using once again Itˆ’s formula we can check that the solution o is 2 σW (t)+ µ− σ t 2 . evolves according to the SDE P dP = µdt + σdW P for certain constants µ > 0 and σ. A standard model assumes that dP . Let P (t) denote the (random) price of a stock at time t ≥ 0. called respectively the drift and the volatility of the stock.

BASIC DEFINITIONS. H. 7 . B. Basic deﬁnitions Expected value. F. I.CHAPTER 2: A CRASH COURSE IN BASIC PROBABILITY THEORY. G. D. A. More details can be found in any good introductory text. E. C. for instance Bremaud [Br]. Thus probability of hitting inner circle = area of inner circle 1 = . Central Limit Theorem Conditional expectation Martingales This chapter is a very rapid introduction to the measure theoretic foundations of probability theory. Chung [C] or Lamperti [L1]. Take a circle of radius 2 inches in the plane and choose a chord of this circle at random. What is the probability this chord intersects the concentric circle of radius 1 inch? Solution #1 Any such chord (provided it does not hit the center) is uniquely determined by the location of its midpoint. area of larger circle 4 Solution #2 By symmetry under rotation we may assume the chord is vertical. The diameter of the large circle is 4 inches and the chord will hit the small circle if it falls within its 2-inch diameter. Let us begin with a puzzle: Bertrand’s paradox. variance Distribution functions Independence Borel–Cantelli Lemma Characteristic functions Strong Law of Large Numbers. A.

2 inches 1 = . 4 inches 2 Solution #3 By symmetry we may assume one end of the chord is at the far left point of the larger circle. The angle θ the chord makes with the horizontal lies between ± π and 2 the chord hits the inner circle if θ lies between ± π . 6 probability of hitting inner circle =

Hence

θ

Therefore probability of hitting inner circle =

2π 6 2π 2

=

1 . 3

PROBABILITY SPACES. This example shows that we must carefully deﬁne what we mean by the term “random”. The correct way to do so is by introducing as follows the precise mathematical structure of a probability space. We start with a set, denoted Ω, certain subsets of which we will in a moment interpret as being “events”. DEFINTION. A σ-algebra is a collection U of subsets of Ω with these properties: (i) ∅, Ω ∈ U. (ii) If A ∈ U, then Ac ∈ U. (iii) If A1 , A2 , · · · ∈ U, then
∞ ∞

Ak ,
k=1 k=1

Ak ∈ U.

Here A := Ω − A is the complement of A.
c

8

DEFINTION. Let U be a σ-algebra of subsets of Ω. We call P : U → [0, 1] a probability measure provided: (i) P (∅) = 0, P (Ω) = 1. (ii) If A1 , A2 , · · · ∈ U, then
∞ ∞

P(
k=1

Ak ) ≤
k=1

P (Ak ).

(iii) If A1 , A2 , . . . are disjoint sets in U, then
∞ ∞

P(
k=1

Ak ) =
k=1

P (Ak ).

It follows that if A, B ∈ U, then A⊆B implies P (A) ≤ P (B).

DEFINITION. A triple (Ω, U, P ) is called a probability space provided Ω is any set, U is a σ-algebra of subsets of Ω, and P is a probability measure on U. Terminology. (i) A set A ∈ U is called an event; points ω ∈ Ω are sample points. (ii) P (A) is the probability of the event A. (iii) A property which is true except for an event of probability zero is said to hold almost surely (usually abbreviated “a.s.”). Example 1. Let Ω = {ω1 , ω2 , . . . , ωN } be a ﬁnite set, and suppose we are given numbers pj = 1. We take U to comprise all subsets of 0 ≤ pj ≤ 1 for j = 1, . . . , N , satisfying Ω. For each set A = {ωj1 , ωj2 , . . . , ωjm } ∈ U, with 1 ≤ j1 < j2 < . . . jm ≤ N , we deﬁne P (A) := pj1 + pj2 + · · · + pjm . Example 2. The smallest σ-algebra containing all the open subsets of Rn is called the Borel σ-algebra, denoted B. Assume that f is a nonnegative, integrable function, such that Rn f dx = 1. We deﬁne P (B) :=
B

f (x) dx

for each B ∈ B. Then (Rn , B, P ) is a probability space. We call f the density of the probability measure P . Example 3. Suppose instead we ﬁx a point z ∈ Rn , and now deﬁne P (B) := 1 0 9 if z ∈ B if z ∈ B /

for sets B ∈ B. Then (Rn , B, P ) is a probability space. We call P the Dirac mass concentrated at the point z, and write P = δz . A probability space is the proper setting for mathematical probability theory. This means that we must ﬁrst of all carefully identify an appropriate (Ω, U, P ) when we try to solve problems. The reader should convince himself or herself that the three “solutions” to Bertrand’s paradox discussed above represent three distinct interpretations of the phrase “at random”, that is, to three distinct models of (Ω, U, P ). Here is another example. Example 4 (Buﬀon’s needle problem). The plane is ruled by parallel lines 2 inches apart and a 1-inch long needle is dropped at random on the plane. What is the probability that it hits one of the parallel lines? The ﬁrst issue is to ﬁnd some appropriate probability space (Ω, U, P ). For this, let h = distance from the center of needle to nearest line, θ = angle (≤
π 2)

that the needle makes with the horizontal.

θ h
needle

These fully determine the position of the needle, up to translations and reﬂection. Let us next take  π  Ω = [0, ) × [0, 1], U = Borel subsets of Ω,   2   
values of h values of θ P (B) = 2·area of B for each π

B ∈ U.

We denote by A the event that the needle hits a horizontal line. We can now check h that this happens provided sin θ ≤ 1 . Consequently A = {(θ, h) ∈ Ω | h ≤ sin θ }, and so 2 2 π 2(area of A) 2 1 2 1 = π 0 2 sin θ dθ = π . P (A) = π RANDOM VARIABLES. We can think of the probability space as being an essential mathematical construct, which is nevertheless not “directly observable”. We are therefore interested in introducing mappings X from Ω to Rn , the values of which we can observe. 10

then m X= i=1 ai χAi is a random variable. a2 . Am ∈ U. More generally. A mapping X : Ω → Rn is called an n-dimensional random variable if for each B ∈ B. . the probability that X is in B. comments. In these notes we will usually use capital letters to denote random variables. We will also use without further comment various standard facts from measure theory. We equivalently say that X is U-measurable. 11 . called a simple function. This follows the custom within probability theory of mostly not displaying the dependence of random variables on the sample point ω ∈ Ω. Example 2. if A1 . with Ω = ∪m Ai . which is the smallest σ-algebra of subsets of Rn containing all open sets. Example 1. Notation. A2 . U. Then the indicator function of A. P ) be a probability space. We usually write “X” and not “X(ω)”. for instance that sums and products of random variables are random variables. Boldface usually means a vector-valued mapping. and a1 . DEFINTION. . well-behaved” subsets of Rn .Remember from Example 2 above that B denotes the collection of Borel subsets of Rn . am i=1 are real numbers. . we have X−1 (B) ∈ U. We may henceforth informally just think of B as containing all the “nice. χA (ω) := 1 0 if ω ∈ A if ω ∈ A. . . We also denote P (X−1 (B)) as “P (X ∈ B)”. Let A ∈ U. Let (Ω. . . . / is a random variable.

then Y is U(X)-measurable. DEFINITIONS. Y is in fact a function of X. if Y = Φ(X) for some reasonable function Φ.LEMMA. called the σ-algebra generated by X. Then there exists a function Φ such that Y = Φ(X). Proof. (i) A collection {X(t) | t ≥ 0} of random variables is called a stochastic process. suppose Y : Ω → R is U(X)-measurable. that is. Check that {X−1 (B) | B ∈ B} is a σ-algebra. although we may have no practical way to construct Φ. We introduce next random variables depending upon time. Let X : Ω → Rn be a random variable. Hence if Y is U(X)-measurable. the σ-algebra U(X) can be interpreted as “containing all relevant information” about the random variable X. It is essential to understand that. Conversely. clearly it is the smallest σ-algebra with respect to which X is measurable. the mapping t → X(t. The idea is that if we run an experiment and observe the random values of X(·) as time evolves. 12 . if a random variable Y is a function of X. Then U(X) := {X−1 (B) | B ∈ B} is a σ-algebra. we in principle know also Y (ω) = Φ(X(ω)). we will in general observe a diﬀerent sample path. (ii) For each point ω ∈ Ω. IMPORTANT REMARK. Consequently if we know the value X(ω). ω) is the corresponding sample path. In particular. ω) | t ≥ 0} for some ﬁxed ω ∈ Ω. This is the smallest sub-σ-algebra of U with respect to which X is measurable. If we rerun the experiment. STOCHASTIC PROCESSES. in probabilistic terms. we are in fact looking at a sample path {X(t.

X(t. X = (X 1 . so that X = X + − X − . · · · . . Then we write X dP = Ω Ω X 1 dP. Ω X 2 dP. DEFINITION.Y simple sup Y dP. provided at least one of the integrals on the right is ﬁnite. .ω1) time X(t. P ) is a probability space and X = k i=1 ai χAi is a real-valued simple random variable. If (Ω. We call E(X) := Ω X dP the expected value (or mean value) of X. Here X + = max(X. we deﬁne X dP := Ω Y ≤X. 0). . VARIANCE. EXPECTED VALUE. Ω X n dP . We will assume without further comment the usual rules for these integrals. we deﬁne the integral of X by k X dP := Ω i=1 ai P (Ai ). X 2 . 0) and X − = max(−X. .ω2) Two sample paths of a stochastic process B. Integration with respect to a measure. U. 13 . If next X is a nonnegative random variable. Ω Finally if X : Ω → R is a random variable. Next. X n ). we write X dP := Ω Ω X + dP − Ω X − dP. suppose X : Ω → Rn is a vector-valued random variable.

. LEMMA (Chebyshev’s inequality).DEFINITION. λ Proof. . yn ) ∈ Rn . then 1 P (|X| ≥ λ) ≤ p E(|X|p ) for all λ > 0. . integrable function f : Rn → R such that x1 F (x) = F (x1 . U. . Xm ≤ xm ) for all xi ∈ Rn . . . n. We call V (X) := Ω |X − E(X)|2 dP the variance of X. P ) be a probability space and suppose X : Ω → Rn is a random variable. If X is a random variable and 1 ≤ p < ∞. xm ) := P (X1 ≤ x1 . . . . FX1 . xn ) = −∞ ··· xn −∞ f (y1 . .Xm (x1 .. Suppose X : Ω → Rn is a random variable and F = FX its distribution function. dy1 . .Xm : (Rn )m → [0. .. DISTRIBUTION FUNCTIONS. It follows then that (1) P (X ∈ B) = B f (x) dx for all B ∈ B This formula is important as the expression on the right hand side is an ordinary integral. yn ) dyn . . . i = 1. We have E(|X|p ) = Ω |X|p dP ≥ {|X|≥λ} |X|p dP ≥ λp P (|X| ≥ λ). . DEFINITIONS. 1]. DEFINITION. If there exists a nonnegative. y = (y1 . .. . . Observe that V (X) = E(|X − E(X)|2 ) = E(|X|2 ) − |E(X)|2 . . . . . xn ) ∈ Rn . . 14 . (i) The distribution function of X is the function FX : Rn → [0. .. . Let (Ω. . . Then x≤y means xi ≤ yi for i = 1.. ... . . . Xm : Ω → Rn are random variables. . . . . m. C. 1] deﬁned by FX (x) := P (X ≤ x) for all x ∈ Rn (ii) If X1 . their joint distribution function is FX1 .. Let x = (x1 . and can often be explicitly calculated. . . then f is called the density function for X. Notation. .

Example 2. 15 . C) random variable. σ 2 ) random variable. LEMMA. If X : Ω → Rn has density f (x) = −1 1 1 e− 2 (x−m)·C (x−m) ((2π)n det C)1/2 (x ∈ Rn ) for some m ∈ Rn and some positive deﬁnite. we say X has a Gaussian (or normal) distribution. We then write X is an N (m. If X : Ω → R has density f (x) = √ 1 2πσ 2 e− |x−m|2 2σ 2 (x ∈ R). Let X : Ω → Rn be a random variable. we say X has a Gaussian (or normal) distribution. Suppose g : Rn → R. Then E(Y ) = Rn g(x)f (x) dx.x X Rn Ω Example 1. with mean m and variance σ 2 . and Y = g(X) is integrable. symmetric matrix C. and assume that its distribution function F = FX has the density f . In this case let us write X is an N (m. with mean m and covariance matrix C.

E(X) = Rn xf (x) dx and V (X) = Rn |x − E(X)|2 f (x) dx. Suppose ﬁrst g is a simple function on Rn : m g= i=1 bi χBi (Bi ∈ B). Then m m E(g(X)) = i=1 bi Ω χBi (X) dP = i=1 bi P (X ∈ Bi ). then E(X) = √ and V (X) = √ 1 2πσ 2 1 2πσ 2 ∞ −∞ xe− (x−m)2 2σ 2 dx = m ∞ −∞ (x − m)2 e− (x−m)2 2σ 2 dx = σ 2 . in terms of integrals over Rn . V (X). it holds therefore for general functions g. and σ 2 the variance.In particular. σ 2 ). since as mentioned before the probability space (Ω. If X is N (m. all quantities of interest in probability theory can be computed in Rn in terms of the density f . Consequently the formula holds for all simple functions g and. 16 . etc. Indeed. Proof. Therefore m is indeed the mean. by approximation. Hence we can compute E(X). Example. P ) is “unobservable”: All that we “see” are the values X takes on in Rn . But also m g(x)f (x) dx = Rn i=1 m bi Bi f (x) dx = i=1 bi P (X ∈ Bi ) by (1). U. This is an important observation. Remark.

even if P (B) = 0: DEFINITION. We write P (A | B) := P (A ∩ B) P (B) if P (B) > 0. This concept and its ramiﬁcations are the hallmarks of probability theory. the probability of A. U := {C ∩ B | C ∈ U} and P := P (B) . P (B) This observation motivates the following DEFINITION. We want to ﬁnd a reasonable deﬁnition of P (A | B). with P (B) > 0. To gain some insight. and let A. What then is the probability that ω ∈ A also? Since we know ω ∈ B. INDEPENDENCE. Let (Ω. Likewise. Ac and B c are independent. since presumably any information that the event B has occurred is irrelevant in determining the probability that A has occurred.A ω B Ω D. 17 P (A ∩ B) P (B) . given B. We take this for the deﬁnition. Therefore we P ˜ ˜ ˜ ˜ ˜ can deﬁne Ω := B. Two events A and B are called independent if P (A ∩ B) = P (A)P (B). then so are Ac and B. Then the ˜ probability that ω lies in A is P (A ∩ B) = P (A∩B) . B ∈ U be two events. so that P (Ω) = 1. MOTIVATION. we can regard B as being a new probability space. the reader may wish to check that if A and B are independent events. Now what should it mean to say “A and B are independent”? This should mean P (A | B) = P (A). U. P ) be a probability space. Think this way. Thus P (A) = P (A | B) = and so P (A ∩ B) = P (A)P (B) if P (B) > 0. Suppose some point ω ∈ Ω is selected “at random” and we are told ω ∈ B.

which we assert are in fact independent random variables. . and P Lebesgue measure. To prove this. . P (Akm ). . Then Y := f (X1 . Xn ) and Z := g(Xn+1 . k+1 2n . . 1). This can be checked by showing that both sides are equal to 2−k .DEFINITION. . . Bk ⊆ Rn : P (X1 ∈ B1 . . . . be events. . . . . . . Xn+m ) are independent. We say the random variables X1 . . . . Let X1 . . 2. it suﬃces to verify P (X1 = e1 . An . . . Xk = ek ) = P (X1 = e1 )P (X2 = e2 ) · · · P (Xk = ek ). Xk ∈ Bk ) = P (X1 ∈ B1 )P (X2 ∈ B2 ) · · · P (Xk ∈ Bk ). . . . Let Xi : Ω → Rn be random variables (i = 1. It is important to extend this deﬁnition to σ-algebras: DEFINITION. Let Ui ⊆ U be σ-algebras. . we have P (Ak1 ∩ Ak2 ∩ · · · ∩ Akm ) = P (Ak1 )P (Ak2 ) . . we transfer our deﬁnitions to random variables: DEFINITION. Lastly. . . . . 18 . . . . we have P (Ak1 ∩ Ak2 ∩ · · · ∩ Akm ) = P (Ak1 )P (Ak1 ) · · · P (Akm ). k even k odd (0 ≤ ω < 1). We omit the proof. . are independent if for all integers k ≥ 2 and all choices of Borel sets B1 . . Xm+n be independent Rk -valued random variables. . Take Ω = [0. These events are independent if for all choices 1 ≤ k1 < k2 < · · · < km . ). . i=1 Example. X2 ∈ B2 . for all choices of e1 . . . We say that {Ui }∞ are i=1 independent if for all choices of 1 ≤ k1 < k2 < · · · < km and of events Aki ∈ Uki . . Suppose f : (Rk )n → R and g : (Rk )m → R. . . . for i = 1. . U the Borel subsets of [0. which may be found in Breiman [B]. 1). These are the Rademacher functions. . . LEMMA. X2 = e2 . . Xn (ω) := 1 −1 if if k 2n k 2n ≤ω< ≤ω< k+1 2n . . Deﬁne for n = 1. This is equivalent to saying that the σ-algebras {U(Xi )}∞ are independent. . 1}. Let A1 . ek ∈ {−1.

. . m. 1. . . m). . . . . . . . Xm ≤ xm ) = P (X1 ≤ x1 ) · · · P (Xm ≤ xm ) = FX1 (x1 ) · · · FXm (xm ). . . . . If the random variables have densities. One of the most important properties of independent random variables is this: THEOREM. . x1 · · · xm fX1 ···Xm (x1 .Xm (x1 . . . Proof. Proof. .. m. xm ) = P (X1 ≤ x1 . Then k=1 FX1 ···Xm (x1 . Therefore U(X1 ). . · · · . 19 R . with E(|Xi |) < ∞ then E(|X1 · · · Xm |) < ∞ and E(X1 · · · Xm ) = E(X1 ) · · · E(Xm ). . xm ) dx1 · · · dxm fXm (xm ) dxm Bm = B1 fX1 (x1 ) dx1 . Select Ai ∈ U(Xi ). . for all xi ∈ Rn .. . Xm : Ω → Rn are independent if and only if (2) FX1 . . . .THEOREM. by (3) = P (X1 ∈ B1 ) · · · P (Xm ∈ Bm ) = P (A1 ) · · · P (Am ). . xm ) = fX1 (x1 ) · · · fXm (xm ) for all xi ∈ Rn . i = 1. (2) is equivalent to (3) where the functions f are the appropriate densities. . . Then Ai = X−1 (Bi ) for some Bi ∈ B. Hence i P (A1 ∩ · · · ∩ Am ) = P (X1 ∈ B1 . . . Assume ﬁrst that {Xk }m are independent. . . . Xm are independent. We prove the converse statement for the case that all the random variables have densities. Then E(X1 · · · Xm ) = = R Rm (i = 1.··· . .··· . . Suppose that each Xi is bounded and has a density. i = 1. . . .Xm (x1 . . . Xm ∈ Bm ) = B1 ×.×Bm fX1 ···Xm (x1 . xm ) dx1 . . . . . . real-valued random variables. . . U(Xm ) are independent σ-algebras. xm x1 fX1 (x1 ) dx1 · · · xm fXm (xm ) dxm by (3) = E(X1 ) · · · E(Xm ). 2. . i = 1. . If X1 . · · · . The random variables X1 . . m. . xm ) = FX1 (x1 ) · · · FXm (xm ) fX1 .

. of events “occurs inﬁnitely often”.”. We introduce next a simple and very useful way to check if some sequence A1 . . . Use induction. . . n=1 m=n is called “An inﬁnitely often”. . . We illustrate a typical use of the Borel–Cantelli Lemma. = ∞ n=1 ∞ n=1 P (An ) < ∞. then P (An i. . . Proof. abbreviated “An i. The limit of the left-hand side is zero as n → ∞ because APPLICATION. .o.o. . A sequence of random variables {Xk }∞ deﬁned on some probability space converges k=1 in probability to a random variable X. Xm are independent. . BOREL–CANTELLI LEMMA. .o.o. . Then the event ∞ ∞ Am = {ω ∈ Ω | ω belongs to inﬁnitely many of the An }. .) = 0. By deﬁnition An i. . E. DEFINITION. . . Then E(X1 + X2 ) = m1 + m2 and V (X1 + X2 ) = Ω (i = 1. An . If X1 . m2 := E(X2 ). . Let A1 . provided k→∞ lim P (|Xk − X| > ) = 0 20 for each > 0. (X1 + X2 − (m1 + m2 ))2 dP (X1 − m1 )2 dP + Ω Ω = (X2 − m2 )2 dP +2 Ω (X1 − m1 )(X2 − m2 ) dP = V (X1 ) + V (X2 ) + 2E(X1 − m1 )E(X2 − m2 ). be events in a probability space. the case m = 2 holding as follows. . m). Let m1 := EX1 .) ≤ P m=n P (Am ). An . If Proof. P (Am ) < ∞. real-valued random variables. ∞ ∞ m=n Am . =0 =0 where we used independence in the next last step. . and so for each n Am ≤ m=n ∞ P (An i. with V (Xi ) < ∞ then V (X1 + · · · + Xm ) = V (X1 ) + · · · + V (Xm ). . . BOREL–CANTELLI LEMMA.THEOREM.

j F.THEOREM. 21 .o. It is convenient to introduce next a clever integral transform. CHARACTERISTIC FUNCTIONS. then φX (λ) = eimλ− λ2 σ 2 2 (λ ∈ Rn ) (λ ∈ R). the Borel–Cantelli Lemma implies P (Aj i. . which will later provide us with a useful means to identify normal random variables. let us suppose that m = 0. Let X be an Rn -valued random variable. If Xk → X in probability. . Therefore for almost all sample points ω. . kj−1 < kj < . for some index J depending on ω. σ 2 ). For each positive integer j we select kj so large that P (|Xkj − X| > 1 1 ) ≤ 2.) Hence φX (λ) = e− λ2 2 . Then φX (λ) := E(eiλ·X ) is the characteristic function of X. Example. .) = 0. and recall that −∞ e− 2 dx = 2π. . Let Aj := {|Xkj − X| > 1 }. |Xkj (ω) − X(ω)| ≤ 1 provided j ≥ J. We move the path of integration in the complex plane from the line {Im(z) = −λ} to the √ x2 ∞ real axis. If the real-valued random variable X is N (m. (Here Im(z) means the imaginary part of the complex number z. kj → ∞. To see this. then there exists a subsequence {Xkj }∞ ⊂ j=1 ∞ {Xk }k=1 such that Xkj (ω) → X(ω) for almost every ω. DEFINITION. σ = 1 and calculate ∞ x2 e 2 1 √ e− 2 dx = √ 2π 2π −λ2 φX (λ) = e −∞ iλx ∞ −∞ e− (x−iλ)2 2 dx. j j 1 and also . Proof. Since j j 2 < ∞.

. . 22 . and so φ (0) = iE(X). (ii) If X is a real-valued random variable. then 2 2 X + Y is N (m1 + m2 . and if X is N (m1 . φXm (λ). just calculate φX+Y (λ) = φX (λ)φY (λ) = e−im1 λ− λ2 σ 2 1 2 e−im2 λ− λ2 2 λ2 σ 2 2 2 = e−i(m1 +m2 )λ− 2 2 (σ1 +σ2 ) . 1. 2 Y is N (m2 . . σ1 ). φXm (λ). The formulas in (ii) for k = 2. for all λ. Let us calculate φX1 +···+Xm (λ) = E(eiλ·(X1 +···+Xm ) ) = E(eiλ·X1 eiλ·X2 · · · eiλ·Xm ) = E(eiλ·X1 ) · · · E(eiλ·Xm ) = φX1 (λ) . (k = 0. σ1 + σ2 ). . (i) If X1 . real-valued random variables. See Breiman [B] for the proof of (iii). follow similarly. . . ). . . . If X and Y are independent. by independence To see this. 1. . σ2 ). . then for each λ ∈ Rn φX1 +···+Xm (λ) = φX1 (λ) . . Xm are independent random variables. 2. φ(k) (0) = ik E(X k ) (iii) If X and Y are random variables and φX (λ) = φY (λ) then FX (x) = FY (x) for all x. 3. .LEMMA. Proof. 2 Example. Assertion (iii) says the characteristic function of X determines the distribution of X. . We have φ (λ) = iE(XeiλX ).

. . . . Let X1 . . 1. We will as well suppose for simplicity that 4 E(Xi ) < ∞ (i = 1. . . More precisely. What can we infer from these observations? STRONG LAW OF LARGE NUMBERS. . . 2. . . we can regard this sequence as a model for repeated and independent runs of the experiment.l=1 If i = j. .j.G. . or l. identically distributed. Xn (ω). . Supposing that the random variables are real–valued entails no loss of generality. . be a sequence of independent. . Then   E n 4 Xi i=1 = n E(Xi Xj Xk Xl ). . Proof. We can model repetitions of this experiment by introducing a sequence of random variables X1 . the outcomes of which we can measure. . . . CENTRAL LIMIT THEOREM. are independent. Xn . This section discusses a mathematical model for “repeated. . . for all x. . . . . imagine that a “random” sample point ω ∈ Ω is given and we can observe the sequence of values X1 (ω). . . each of which “has the same probabilistic information as X”: DEFINITION. THEOREM (Strong Law of Large Numbers). which records the outcome of some sort of random experiment.k. independence implies E(Xi Xj Xk Xl ) = E(Xi )E(Xj Xk Xl ). ). . . we can deduce the common expected values of the random variables. . First we show that with probability one. integrable random variables deﬁned on the same probability space. . Xn . Xn . . Suppose we are given a probability space and on it a real–valued random variable X. . Then P X1 + · · · + Xn =m n→∞ n lim = 1. Write m := E(Xi ) for i = 1. of random variables is called identically distributed if FX1 (x) = FX2 (x) = · · · = FXn (x) = . . . independent experiments”. The idea is this. X2 (ω). We may also assume m = 0. Xn . as we could otherwise consider Xi − m in place of Xi . . . A sequence X1 . . . . k. . =0 23 . If we additionally assume that the random variables X1 . i. . . STRONG LAW OF LARGE NUMBERS. . .

P 1 n n Xi ≥ ε i. Now ﬁx ε > 0. Then k=1 P (B) = 0 and n 1 lim Xi (ω) = 0 n→∞ n i=1 for each sample point ω ∈ B. Then P 1 n n n Xi ≥ ε i=1 =P i=1 Xi ≥ εn  n 4   ≤ ≤ 1 E (εn)4 C 1 . We turn next to the Laplace–DeMoivre Theorem.Consequently. Take ε = k . The Strong Law of Large Numbers says that for almost every sample point ω ∈ Ω.j=1 i=j 2 2 E(Xi Xj ) 4 2 = nE(X1 ) + 3(n2 − n)(E(X1 ))2 ≤ n2 C for some constant C. therefore. k except possibly for ω lying in an event Bk . X1 (ω) + · · · + Xn (ω) →m n as n → ∞. 24 . Write B := ∪∞ Bk . 1 3. i=1 = 0. By the Borel–Cantelli Lemma. we have   E n 4 Xi i=1 = n n 4 E(Xi ) i=1 +3 i. and its generalization the Central Limit Theorem. which estimate the “ﬂuctuations” we can expect in this limit. / FLUCTUATIONS. ε 4 n2 Xi i=1 We used here the Chebyshev inequality. LAPLACE–DEMOIVRE THEOREM. Let us start with a simple calculation.o. with P (Bk ) = 0. The foregoing says that 1 lim sup n n→∞ n Xi (ω) ≤ i=1 1 . since the Xi are identically distributed.

q ≥ 0. have a distribution which tends to the Gaussian N (0. . . We can imagine these random variables as modeling for example repeated tosses of a biased coin. Interpretation of the Laplace–DeMoivre Theorem. p + q = 1. Proof. identically distributed. . Sn − E(Sn ) Sn − np . THEOREM (Laplace–DeMoivre). . A proof is in Appendix A. . By independence. then 2 lim P √ √ a n n a n − ≤ Sn − ≤ 2 2 2 25 1 =√ 2π a −a n→∞ e− x2 2 dx. V (X1 + · · · + Xn ) = V (X1 ) + · · · + V (Xn ) = npq. . are independent and identically distributed. E(X1 ) = Ω X1 dP = p and therefore E(X1 + · · · + Xn ) = np. 1) as n → ∞. Deﬁne the sums Sn := X1 + · · · + Xn . Suppose the real–valued random variables X1 . . = √ npq V (Sn )1/2 Hence the Laplace–DeMoivre Theorem says that the sums Sn . Also. . properly renormalized. Consider in particular the situation p = q = 1 . lim P Sn − np ≤b a≤ √ npq 1 =√ 2π b a n→∞ e− x2 2 dx. with P (Xi = 1) = p P (Xi = 0) = q for p. which has probability p of coming up heads. . . Then for all −∞ < a < b < +∞. In view of the Lemma. and probability q = 1 − p of coming up tails. Xn . real–valued random variables in the preceding Lemma. Let X1 . Suppose a > 0. Xn be the independent.LEMMA. . . Then E(X1 + · · · + Xn ) = np V (X1 + · · · + Xn ) = npq. (X1 − p)2 dP = (1 − p)2 P (X1 = 1) + p2 P (X1 = 0) Ω V (X1 ) = = q 2 p + p2 q = qp.

be independent. . Xn . =1− 2n n n 26 . . 2 CENTRAL LIMIT THEOREM. φ (0) = iE(X1 ) = 0. Then for all −∞ < a < b < +∞ (1) n→∞ lim P Sn − nm a≤ √ ≤b nσ 1 =√ 2π b a e− x2 2 dx.If we ﬁx b > 0 and write a = 2b √ . . n then for large n 2b √ n 2b − √n n 1 P −b ≤ Sn − ≤ b ≈ √ 2 2π e− x2 2 dx →0 as n→∞. 2 but Sn (ω) − n “ﬂuctuates” with probability 1 to exceed any ﬁnite bound b. . . . Let X1 . 1 Thus for almost every ω. Now φ = φX1 satisﬁes 1 φ(µ) = φ(0) + φ (0)µ + φ (0)µ2 + o(µ2 ) as µ → 0. Set Sn := X1 + · · · + Xn . 2 2 with φ(0) = 1. . We now generalize the Laplace–DeMoivre Theorem: THEOREM (Central Limit Theorem). Thus the conclusion of the Laplace–DeMoivre Theorem holds not only for the 0– or 1– valued random variable considered before. We will later invoke this assertion to motivate our requirement that Brownian motion be normally distributed for each time t ≥ 0. . Outline of Proof. identically distributed random variables with ﬁnite variance. σ = 1. Consequently our setting λ µ = √n gives λ2 λ λ2 φX1 √ +o . . . real-valued random variables with E(Xi ) = m. For simplicity assume m = 0. . V (Xi ) = σ 2 > 0. identically distributed. because the random variables are independent and identically distributed. in accord with the Strong Law of Large Numbers. φ (0) = −E(X1 ) = −1. φ Xn (λ) = φX1 √ √ n n n n for λ ∈ R. n Sn (ω) → 1 . . Then n λ S X φ √n (λ) = φ √1 (λ) . . for i = 1. since we can always rescale to this case. but for any sequence of independent.

provided P (B) > 0. Thus if P (B) > 0. But e− 2 is the characteristic function of an N (0. FIRST APPROACH TO CONDITIONAL EXPECTATION. given the event B? Remember that we can P ˜ think of B as the new probability space. That is. Example. and so  on A1  a1   a  2 on A2 Y = . What is a reasonable deﬁnition of E(X | Y ). U. for which we provide two introductory discussions. Assume we are given a probability space (Ω. the probability of A. on which is deﬁned a simple m random variable Y . CONDITIONAL EXPECTATION. we should set E(X | B) = mean value of X over B 1 X dP. P ). 27 . = P (B) B Next we pose a more interesting question.    am on Am . MOTIVATION. the expected value of the random variable X. as n → ∞. H. Y = i=1 ai χAi . the expected value of the random variable X. with P = P (B) . . given another random variable Y ? In other words if “chance” selects a sample point ω ∈ Ω and all we know about ω is the value Y (ω). what is our best guess as to the value X(ω)? This turns out to be a subtle.and so S φ √n (λ) = n λ2 1− λ2 +o 2n λ2 n n → e− λ2 2 for all λ. but extremely important issue. 1) random variable. We earlier decided to deﬁne P (A | B). How then should we deﬁne E(X | B). given B. (A∩B) to be P P (B) .   . We start with an example. It turns out that this convergence of the characteristic functions implies the limit (1): see Breiman [B] for more.

let X be any other real–valued random variable on Ω. What is our best guess of X. • E(X | Y ) is U(Y )-measurable. . our best estimate for X should then be the average value of X over each appropriate event. Finally.for distinct real numbers a1 . each of positive probability. That is. • A XdP = A E(X | Y ) dP for all A ∈ U(Y ). Next. Let us take these properties as the deﬁnition in the general case: DEFINITION. P ) be a probability space and suppose V is a σ-algebra. known. A2 . This. .    1 on Am . .  .   . We are given the “information” available in a σ-algebra V. . Then E(X | Y ) is any U(Y )-measurable random variable such that X dP = A A E(X | Y ) dP for all A ∈ U(Y ). Let (Ω. . we deﬁne E(X | V) to be any random variable on Ω such that (i) E(X | V) is V-measurable. Condition (i) in the deﬁnition requires that E(X | V) be constructed from the 28 . . we should take  1 on A1  P (A1 ) A1 X dP   1   on A2  P (A2 ) A2 X dP E(X | Y ) := . V ⊆ U. we can tell which event A1 . Interpretation. but rather just the σ-algebra it generates. . We can understand E(X | V) as follows. P (Am ) Am X dP We note for this example that • E(X | Y ) is a random variable. given Y ? Think about the problem this way: if we know the value of Y (ω). notice that it is not really the values of Y that are important. Am contains ω. from which we intend to build an estimate of the random variable X. am and disjoint events A1 . Am . If X : Ω → Rn is an integrable random variable. U. . This motivates the next DEFINITION. . and (ii) A X dP = A E(X | V) dP for all A ∈ V. Let Y be a random variable. A2 . . a2 . . whose union is Ω. . and not a constant. and only this.

Let X be an integrable random variable. Remark. which uses a few advanced concepts from measure theory. (ii) E(E(X | V)) = E(X). Consider for the moment Rn and suppose that V is a proper subspace. so deﬁned. (iii) E(X) = E(X | W). x z=projV(x) 0 V 29 . and is motivated by this example: Least squares method. the conditional expectation E(X | V) exists and is unique up to V-measurable sets of probability zero. Ω} is the trivial σ-algebra. THEOREM. Then for each σ-algebra V ⊂ U. and (ii) requires that our estimate be consistent with X. We will later see that the conditional expectation E(X | V). has various additional nice properties. Suppose we are given a vector x ∈ Rn . We omit the proof. given x. y∈V It is not particularly diﬃcult to show that. We can check without diﬃculty that (i) E(X | Y ) = E(X | U(Y )). We call v the projection of x onto V . at least as regards integration over events in V. An elegant alternative approach to conditional expectations is based upon projections onto closed subspaces. there exists a unique vector z ∈ V solving this minimization problem. SECOND APPROACH TO CONDITIONAL EXPECTATION. (7) z = projV (x). where W = {∅. The least squares problem asks us to ﬁnd a vector z ∈ V so that |z − x| = min |y − x|.information in V.

such that ||Y || := Ω Y dP 2 1 2 < ∞. Take in particular W = χA for any set A ∈ V. that is. we return now to conditional expectation. U–measurable random variables Y . U). For this take any other vector w ∈ V . it follows that X dP = A A Z dP 30 for all A ∈ V. Y ) := Ω XY dP = E(XY ). by analogy with (7) in the ﬁnite dimensional case. Motivated by the example above. we can likewise show (X. Let us take the linear space L2 (Ω) = L2 (Ω. Projection of random variables. Hence 0 = i (0) = 2(z − x) · w. We call ||Y || the norm of Y . Consider then V := L2 (Ω. Since z + τ w ∈ V for all τ . .Now we want to ﬁnd formula characterizing z. Y ∈ L2 (Ω). (8) x·w =z·w for all w ∈ V. we see that the function i(·) has a minimum at τ = 0. V). and if X. which consists of all real-valued. Next. Deﬁne then i(τ ) := |z + τ w − x|2 . we deﬁne their inner product to be (X. Almost exactly as we established (8) above. W ) for all W ∈ V. This is a closed subspace of L2 (Ω). we can deﬁne its projection (9) Z = projV (X). take as before V to be a σ-algebra contained in U. W ) = (Z. the space of square–integrable random variables that are V–measurable. In view of the deﬁnition of the inner product. Consequently if X ∈ L2 (Ω). The geometric interpretation is that the “error” x − z is perpendicular to the subspace V.

The two introductory discussions now completed. we see that Z is in fact E(X | V).Since Z ∈ V is V-measurable.s. then E(X | V) = X a. b are constants. THEOREM (Properties of conditional expectation). we turn next to examining conditional expectation more closely.s. and (ii) is easy to check 2. XY dP for all A ∈ V. as deﬁned in the earlier discussion. W ⊆ V. . We could therefore alternatively take the last identity as a deﬁnition of conditional expectation. X is independent of V. X is V-measurable and XY is integrable.s. a.s. 31 . we have E(X | W) = E(E(X | V) | W) = E(E(X | W) | V) a. . 1. . . E(X | V) = projV (X). of E(XY | V). This point of view also makes it clear that Z = E(X | V) solves the least squares problem: ||Z − X|| = min ||Y − X||. m. it is enough in proving (iii) to show (10) A XE(Y | V) dP = A m i=1 bi χBi . Statement (i) is obvious. Then m XE(Y | V) dP = A i=1 m bi A∩Bi ∈V E(Y | V) dP = i=1 bi A∩Bi Y dP = A XY dP. Y ∈V and so E(X | V) can be interpreted as that V-measurable random variable which is the best least squares approximation of the random variable X.s.s. That is. (vi) The inequality X ≤ Y a. By uniqueness a. E(aX + bY | V) = aE(X | V) + bE(Y | V) a. Proof.s.s. (i) (ii) (iii) (iv) (v) If If If If If X is V-measurable. then E(X | V) = E(X) a. First suppose X = where Bi ∈ V for i = 1. implies E(X | V) ≤ E(Y | V) a. then E(XY | V) = XE(Y | V) a.

. . Sn ) (11) + E(Yn+1 + · · · + Yn+k | S1 . the third equality owing to independence. Suppose Φ : R → R is convex. given the values of S1 . . We leave the proof as an exercise. . . Sn ) = E(Y1 + · · · + Yn | S1 . with E(Yi ) = 0 (i = 1. . . and we deduce from the previous inequality that P (A) = 0. . . Sn ?? The answer is E(Sn+k | S1 . assertion (i) implies that E(E(X | W) | V) = E(X | W). Suppose Y1 . 5. . with E(|Φ(X)|) < ∞. . it suﬃces to prove A E(X) dP = A X dP for all A ∈ V. Thus E(X | W) = E(E(X | V) | W) a. .This proves (10) if X is a simple function. suppose X ≤ Y . Sn ) = Y1 + · · · + Yn + E(Yn+1 + · · · + Yn+k ) = Sn . I. Y2 . . are independent real-valued random variables. Let us compute: X dP = A Ω χA X dP = E(χA X) = E(X)P (A) = A E(X) dP. . MARTINGALES. 4. 2. . 3. since A ∈ W ⊆ V. ). . Then E(E(X | V) | W) dP = A A E(X | V) dP = A X dP. Then Φ(E(X | V)) ≤ E(Φ(X) | V). since E(X | W) is W-measurable and so also V-measurable. =0 32 . To show (iv). Furthermore.s. . LEMMA (Conditional Jensen’s Inequality). Finally. MOTIVATION. . Deﬁne the sum Sn := Y1 + · · · + Yn . Assume W ⊆ V and let A ∈ W. This event lies in V. . and note that E(Y | V) − E(X | V) dP = A A E(Y − X | V) dP Y − X dP ≥ 0 A = for all A ∈ V. What is our best guess of Sn+k . . The general case follows by approximation. This establishes assertion (v). . . Take A := {E(Y | V) − E(X | V) ≤ 0}.

the calculation above says that at any time one’s future expected winnings.Thus the best estimate of the “future value” of Sn+k . for all t ≥ s ≥ 0. . So the formula (11) characterizes a “fair” game. X(·) is a submartingale. for all t ≥ s ≥ 0. Xn . 2. . write W(t) := U(W (s)| 0 ≤ s ≤ t).s. Then U(t) := U(X(s) | 0 ≤ s ≤ t). (The reader should refer back to this calculation after reading Chapter 3. . . i=1 DEFINITION. be a sequence of real-valued random variables. . . We incorporate these ideas into a formal deﬁnition: DEFINITION. (ii) If X(s) ≤ E(X(t) | U(s)) a.) 33 . . Let X(·) be a real–valued stochastic process. If Xk = E(Xj | X1 .s. and let t ≥ s. is called the history of the process until (and including) time t ≥ 0. Let X(·) be a stochastic process.s. is just Sn . ). To see this. as deﬁned later in Chapter 3. Let W (·) be a 1-dimensional Wiener process. . and therefore Sn as the total winnings at time n. . . DEFINITIONS. . Example. then X(·) is called a martingale. such that E(|X(t)|) < ∞ for all t ≥ 0. Then E(W (t) | W(s)) = E(W (t) − W (s) | W(s)) + E(W (s) | W(s)) = E(W (t) − W (s)) + W (s) = W (s) a. (i) If X(s) = E(X(t) | U(s)) a. given the history up to time n. the σ-algebra generated by the random variables X(s) for 0 ≤ s ≤ t. Xk ) we call {Xi }∞ a (discrete) martingale. . Then W (·) is a martingale. If we interpret Yi as the payoﬀ of a “fair” gambling game at time i. . with E(|Xi |) < ∞ (i = 1. a. for all j ≥ k. . Let X1 . is just the current amount of money.s. given the winnings to date.

t ≥ 0. Suppose X(·) is a real-valued martingale and Φ : R → R is convex. t] and pass to limits. Next choose i=1 a ﬁner and ﬁner partition of [0. Notice that (i) is a generalization of the Chebyshev inequality. The proof of assertion (ii) is similar. . . max |Xk | p 1≤k≤n ≤ p p−1 p E(|Xn |p ) 0≤s≤t (ii) If X(·) is a martingale and 1 < p < ∞. (i) If X(·) is a submartingale. and λ > 0. . then P max X(s) ≥ λ ≤ 1 E(X(t)+ ) λ for all λ > 0. THEOREM (Martingale inequalities). Martingales are important in probability theory mainly because they admit the following powerful estimates: THEOREM (Discrete martingale inequalities). which uses Jensen’s inequality. . then E max |X(s)|p ≤ p p−1 p 0≤s≤t E(|X(t)|p ). t > 0 and select 0 = t0 < t1 < · · · < tn = t.s. then n=1 P max Xk ≥ λ ≤ 1 + E(Xn ) λ 1≤k≤n for all n = 1. . We can also extend these estimates to continuous–time martingales. We check that {X(ti )}n is a martingale and apply the discrete martingale inequality. . Φ(X(·)) is a submartingale. 34 . Choose λ > 0. We omit the proof. then n=1 E for all n = 1. Outline of Proof. Then if E(|Φ(X(t))|) < ∞ for all t ≥ 0. . (ii) If {Xn }∞ is a martingale and 1 < p < ∞.LEMMA. (i) If {Xn }∞ is a submartingale. Let X(·) be a stochastic process with continuous sample paths a. A proof is provided in Appendix B.

0) = δ0 . He and others noted that • the path of a given particle is very irregular. Let us consider a long. Next. suppose that the probability density of the event that an ink particle moves from x to x + y in (small) time τ is ρ(τ. D. Initially we have f (x. R. whereas ρ(τ. −∞ We insert these identities into (1). y). thereby to obtain f (x. is linear in τ : ∞ y 2 ρ dy = Dτ. Brown in 1826–27 observed the irregular motion of pollen particles suspended in water. Bachelier attempted to describe ﬂuctuations in stock prices mathematically and essentially discovered ﬁrst certain results later rederived and extended by A. In 1900 L. t)ρ(τ. 2 ∞ ρ(τ. . the variance of ρ. MOTIVATION AND DEFINITIONS. y) dy 1 f − fx y + fxx y 2 + . Then ∞ f (x. and • the motions of two distinct particles appear to be independent. But since ρ is a probability density. SOME HISTORY. Einstein in 1905. Now let f (x. . into which we inject at time t = 0 a unit amount of ink. thin tube ﬁlled with clear water. We further assume that −∞ y 2 ρ dy. at the location x = 0. Motivation and deﬁnitions Construction of Brownian motion Sample paths Markov property A. Einstein studied the Brownian phenomena this way. t + τ ) = (1) = −∞ −∞ ∞ f (x − y. the unit mass at 0. t) = {+ lower order terms}. D > 0. t) denote the density of ink particles at position x ∈ R and time t ≥ 0. y) dy. A. y) by symmetry. B.CHAPTER 3: BROWNIAN MOTION AND “WHITE NOISE”. C. −∞ ρ dy = 1. ∞ ∞ Consequently −∞ yρ dy = 0. τ 2 35 . −y) = ρ(τ. having a tangent at no point. t) Dfxx (x. t + τ ) − f (x.

. ±2. Let p(m. In fact. n) − 2p(m. RANDOM WALKS. . for some constant D. n + 1) = and hence p(m. }. n) = Now assume (∆x)2 =D ∆t 1 (p(m + 1. n) + p(m + 1. . n + 1) − p(m. n) − 2p(m. 1 1 p(m − 1. Einstein computed:   R = gas constant    T = absolute temperature RT D= . n) (∆x)2 36 . n∆t) | m = 0. 0) = δ0 . has the solution ft = f (x. N. . This equation and the observed properties of Brownian motion allowed J. . Then 0 m=0 p(m. Dt). We introduce a 2dimensional rectangular lattice. n)). 1/2 (2πDt) This says the probability density at time t is N (0. n) D = ∆t 2 p(m + 1. Perrin to compute NA (≈ 6 × 1023 = the number of molecules in a mole) and help to conﬁrm the atomic theory of matter. n) denote the probability that the particle is at position m∆x at time n∆t. to the right an amount ∆x with probability 1/2. n = 0. . and at each time n∆t moves to the left an amount ∆x with probability 1/2. n) + p(m − 1. n). A variant of Einstein’s argument follows. Also p(m. 0) = 1 m = 0. n + 1) − p(m. also known as the heat equation. with the initial condition f (x. comprising the sites {(m∆x. . t) = x2 1 e− 2Dt . Consider a particle starting at x = 0 and time t = 0. His ideas are at the heart of the mathematics in §B–D below. Wiener in the 1920’s (and later) put the theory on a ﬁrm mathematical basis. 1. 2. This partial diﬀerential equation.Sending now τ → 0. 2 2 This implies p(m. we discover D fxx 2 This is the diﬀusion equation. 2 for some positive constant D. where  f = friction coeﬃcient NA f    NA = Avagodro’s number. . ±1. n) + p(m − 1.

which we now interpret as the probability density that particle is at x at time t. Then Laplace–De Moivre Theorem thus implies lim n→∞ t=n∆t. . . . n∆t → t. Then Sn − n 4 n 2 X(t) = (2Sn − n)∆x = n∆x = n 2 √ tD. ). m∆x → x.Let ∆t → 0. . 4 Now Sn is the number of moves to the right by time t = n∆t. x2 . The above diﬀerence equation becomes formally in the limit D ft = fxx . Then V (Xi ) = 1 . MATHEMATICAL JUSTIFICATION. t). . where the Xi are independent random variables such that P (Xi = 0) = 1/2 P (Xi = 1) = 1/2 for i = 1. . Deﬁne n 2 Sn := i=1 Xi . Let X(t) denote the position of particle at time t = n∆t (n = 0. ∆t √ Sn − n 4 = D. ∆x → 0. . Then presumably ∆t p(m. Note also V (X(t)) = (∆x)2 V (2Sn − n) = (∆x)2 4V (Sn ) = (∆x)2 4nV (X1 ) = (∆x)2 n = Again assume (∆x)2 ∆t (∆x)2 t. n) → f (x. As above we assume the particle moves to the left or right a distance ∆x with probability 1/2. A more careful study of this technique of passing to limits with random walks on a lattice depends upon the Laplace–De Moivre Theorem. P (a ≤ X(t) ≤ b) = lim =D (∆x)2 ∆t n→∞ Sn − n b a 2 √ ≤ ≤√ n tD tD 4 √b tD √a tD 1 =√ 2π =√ 37 e− x2 2 dx 1 2πDt b a e− 2Dt dx. Consequently X(t) = Sn ∆x + (n − Sn )(−∆x) = (2Sn − n)∆x. 2 and so we arrive at the diﬀusion equation again. with (∆x) ≡ D.

and rigorously this time. . (ii) W (t) − W (s) is N (0. . B. what is the probability that a sample path of Brownian motion takes values between ai and bi at time ti for each i = 1. Suppose we now choose times 0 < t1 < · · · < tn and real numbers ai ≤ bi . for which we take D = 1: DEFINITION. . . since we can expect that any suitably scaled sum of independent. the random variables W (t1 ). COMPUTATION OF JOINT PROBABILITIES.. an ≤ W (tn ) ≤ bn )? In other words. t − s) for all t ≥ s ≥ 0. n? 38 . · · · . we obtain the N (0. P (a ≤ W (t) ≤ b) = √ 1 2πt b a e− 2t dx. for i = 1.Once again. (iii) for all times 0 < t1 < t2 < · · · < tn . . n. . E(W 2 (t)) = t for each time t ≥ 0. . Dt) distribution. From the deﬁnition we know that if W (·) is a Brownian motion. . W (tn ) − W (tn−1 ) are independent (“independent increments”). . A real-valued stochastic process W (·) is called a Brownian motion or Wiener process if (i) W (0) = 0 a. What is the joint probability P (a1 ≤ W (t1 ) ≤ b1 . Inspired by all these considerations. CONSTRUCTION OF BROWNIAN MOTION. W (t2 ) − W (t1 ). random disturbances aﬀecting the postion of a moving particle will result in a Gaussian distribution. we now introduce Brownian motion. The Central Limit Theorem Further provides some further motivation for our deﬁnition of Brownian motion. . x2 since W (t) is N (0. . Notice in particular that E(W (t)) = 0. then for all t > 0 and a ≤ b.s. t).

all choices of times 0 = t0 < t1 < · · · < tn and each function f : Rn → R. and given that W (t1 ) = x1 . The next assertion conﬁrms and extends this formula. . Let W (·) be a one-dimensional Wiener process. THEOREM. . . . then presumably the process is N (x1 . . t1 | 0)g(x2 . we have ∞ Ef (W (t1 ). . tn − tn−1 | xn−1 ) dxn . We know P (a1 ≤ W (t1 ) ≤ b1 ) = b1 a1 e √ − 2t1 x2 1 2πt1 dx1 . . . Then for all positive integers n. . should equal b2 |x −x |2 1 − 2 1 e 2(t2 −t1 ) dx2 . . . a2 ≤ W (t2 ) ≤ b2 ) = for g(x. t2 − t1 ) on the interval [t1 . . . t2 − t1 | x1 ) . xn )g(x1 . . an ≤ W (tn ) ≤ bn ) = (2) b1 a1 b1 a1 b2 a2 g(x1 . t1 | 0)g(x2 . 2πt ··· bn an g(x1 . given that W (t1 ) = x1 . g(xn . . . t2 − t1 | x1 ) dx2 dx1 1 − (x−y)2 2t e . . we would therefore guess that P (a1 ≤ W (t1 ) ≤ b1 . t | y) := √ In general. g(xn . . dx1 . 2π(t2 − t1 ) a2 Hence it should be that P (a1 ≤ W (t1 ) ≤ b1 . . t2 − t1 | x1 ) . 39 . . t1 | 0)g(x2 . a1 ≤ x1 ≤ b1 . t2 ]. dx1 . tn − tn−1 | xn−1 ) dxn . W (tn )) = −∞ ··· ∞ −∞ f (x1 .a3 a5 a1 a2 b3 a4 b5 t5 b1 t1 b2 t2 t3 b4 t4 We can guess the answer as follows. Thus the probability that a2 ≤ W (t2 ) ≤ b1 .

square–integrable funtions deﬁned on (0. . Then E(W (t)W (s)) = t ∧ s = min{s. We will then integrate the resulting expression in time. . dy1 = −∞ ··· f (x1 . y1 + · · · + yn ). We start with an easy lemma. 1). s ≥ 0. tn − tn−1 | xn−1 ) dxn .Our taking f (x1 . . tn − tn−1 | 0)dyn . . t2 − t1 | x1 ) . n. . The Jacobian for this change of variables equals 1. y1 + y2 . . The main issue now is to demonstrate that a Brownian motion actually exists. . . .bn ] (xn ) gives (2). xn ) = χ[a1 . . Then E(W (t)W (s)) = E((W (s) + W (t) − W (s))W (s)) = E(W 2 (s)) + E((W (t) − W (s))W (s)) = s + E(W (t) − W (s))E(W (s)) =0 =0 for t ≥ 0. . . g(xn . t2 − t1 | 0) . . the space of all real-valued. . g(yn . . W (tn )) = Eh(Y1 . Yn ) ∞ = −∞ ∞ ··· ∞ −∞ ∞ −∞ h(y1 . t} Proof. and prove then that we have built a Wiener process. .b1 ] (x1 ) · · · χ[an . 1) . . = s = t ∧ s. yn )g(y1 . . . . Assume t ≥ s ≥ 0. n and x0 = 0. . We also deﬁne h(y1 . t1 | 0)g(y2 . . This procedure is a form of “wavelet analysis”: see Pinsky [P]. . Proof. Then Ef (W (t1 ). Yi := Xi − Xi−1 for i = 1. yn ) := f (y1 . . . . . We also changed variables using the identities yi = xi − xi−1 for i = 1. . . xn )g(x1 . . . . dx1 . For the second equality we recalled that the random variables Yi = W (ti ) − W (ti−1 ) are independent for i = 1. . . y2 . . . . LEMMA. show that this series converges. . and that each Yi is N (0. . t1 | 0)g(x2 . . Let us write Xi := W (ti ). Suppose W (·) is a one-dimensional Brownian motion. . ti − ti−1 ). Our method will be to develop a formal expansion of white noise ξ(·) in terms of a cleverly selected orthonormal basis of L2 (0. . . 40 . BUILDING A ONE-DIMENSIONAL WIENER PROCESS. n. . . . . .

where δ0 is the unit mass at 0. If X(·) is any real-valued stochastic process with E(X 2 (t)) < ∞ for all t ≥ 0. This gives the formula (3) above. t = s. ˙ Remark: Why W(·) = ξ(·) is called white noise. ω) is in fact diﬀerentiable for no time t ≥ 0. the autocorrelation function of X(·). and set φh (s) := E = W (t + h) − W (t) h W (s + h) − W (s) h 1 [E(W (t + h)W (s + h)) − E(W (t + h)W (s)) − E(W (t)W (s + h)) + E(W (t)W (s))] h2 1 = 2 [((t + h) ∧ (s + h)) − ((t + h) ∧ s) − (t ∧ (s + h)) + (t ∧ s)]. As we will see later however. A formal “proof” is this. h height = 1/h graph of φh t-h t t+h Then φh (s) → 0 as h → 0. s) = c(t − s) for some function c : R → R and if E(X(t)) = E(X(s)) for all t. 41 . In addition. s ≥ 0). s) and W (t) − W (s) is independent of W (s). ﬁx t > 0. HEURISTICS. we deﬁne r(t. X(·) is called stationary in the wide sense. But φh (s) ds = 1. ω the sample path ˙ t → W (t. s) := E(X(t)X(s)) (t.e. wide sense stationary. s ≥ 0. Suppose h > 0. Thus W (t) = ξ(t) does not really exist. and so presumably φh (s) → δ0 (s − t) in some sense. as h → 0. we expect that φh (s) → E(ξ(t)ξ(s)). Remember from Chapter 1 that the formal time-derivative dW (t) ˙ W (t) = = ξ(t) dt is “1-dimensional white noise”. If r(t. for a. A white noise process ξ(·) is by deﬁnition Gaussian.since W (s) is N (0. with c(·) = δ0 . we do have the heuristic formula (3) “E(ξ(t)ξ(s)) = δ0 (s − t)”. However.

with E(An ) = 0. For white noise. RANDOM FOURIER SERIES. We expect that the An are independent and Gaussian. Thus the spectral density of ξ(·) is ﬂat. n. all “frequencies” contribute equally in the correlation function. The orthonormality means that 1 ψn (s)ψm (s) ds = δmn 0 for all m. we have f (λ) = 1 2π ∞ −∞ e−iλt δ0 dt = 1 2π for all λ. where ψn = ψn (t) are functions of 0 ≤ t ≤ 1 only and so are not random variables. just as—by analogy—all colors contribute equally to make white light. Similarly. We write formally ∞ (4) It is easy to see that then ξ(t) = n=0 An ψn (t) (0 ≤ t ≤ 1). Therefore to be consistent we must have for m = n 1 1 0 = E(An )E(Am ) = E(An Am ) = 0 1 0 1 E(ξ(t)ξ(s))ψn (t)ψm (s) dtds δ0 (s − t)ψn (t)ψm (s) dtds 0 1 0 = = 0 by (3) ψn (s)ψm (s) ds. 1 An = 0 ξ(t)ψn (t) dt. Suppose now {ψn }∞ is a complete. 42 . orthonormal n=0 basis of L2 (0.In general we deﬁne f (λ) := 1 2π ∞ −∞ e−iλt c(t) dt (λ ∈ R) to be the spectral density of a process X(·). 1 E(A2 ) = n 0 2 ψn (s) ds = 1. 1). that is. But this is already automatically true as the ψn are orthogonal.

1 −1 for 0 ≤ t ≤ 1. But then the Brownian motion W (·) should be given by t ∞ t (5) W (t) := 0 ξ(s) ds = n=0 An 0 ψn (s) ds. ´ LEVY–CIESIELSKI CONSTRUCTION OF BROWNIAN MOTION DEFINITION. . we set  n/2 n n +1/2 for k−2 ≤ t ≤ k−22n 2 2n  n n +1/2 hk (t) := −2n/2 for k−2 n < t ≤ k−2n +1 2 2   0 otherwise. it is reasonable to believe that formula (4) makes sense.Consequently if the An are independent and N (0. height = 2 n/2 graph of hk width = 2 -(n+1) Graph of a Haar function 43 . 1). . . 2. This seems to be true for any orthonormal basis. . and we will next make this rigorous by choosing a particularly nice basis. for 0 ≤ t ≤ for 1 2 1 2 < t ≤ 1. n = 1. The family {hk (·)}∞ of Haar functions are deﬁned for 0 ≤ t ≤ 1 as k=0 follows: h0 (t) := 1 h1 (t) := If 2n ≤ k < 2n+1 .

. since 0 = deduce k+1 2n+1 k 2n+1 1/2 0 f dt + 1 1/2 f dt = 1 0 f dt. Then 0 f dt = 1/2 f dt. We have 0 h2 dt = 2n 2n+1 + 2n+1 = 1. Thus r r s 0 ≤ s ≤ r ≤ 1. The functions {hk (·)}∞ form a complete. k−2n +1 ]. we f dt = 0 for all dyadic rationals f dt = 0 for all 0 ≤ k < 2n+1 . 1). . 0 f hk dt = 0 for all k = 0. . 2. r. In this second case 1 1 1 hl hk dt = ±2n/2 0 1 0 hl dt = 0.LEMMA 1. . 1). . k Note also that for all l > k. . We will prove f = 0 almost everywhere. 2. and so for all 0 ≤ s ≤ r ≤ 1. Let n = 1. and both are equal to zero. k=0 1 1 Proof. DEFINITION. height = 2 -(n+2)/2 graph of sk width = 2 -n Graph of a Schauder function The graph of sk is a “tent” of height 2−n/2−1 .e. t sk (t) := 0 hk (s) ds (0 ≤ t ≤ 1) is the k th –Schauder function. then 0≤t≤1 n n max |sk (t)| = 2−n/2−1 . For k = 1. 1. 1 1/2 1 If n = 0. orthonormal basis of L2 (0. 44 . we have 0 f dt = 0. 1. 2n 2 n n+1 Consequently if 2 ≤ k < 2 . Continuing in this way. . Suppose f ∈ L2 (0. . But f (r) = d dr f (t) dt = 0 0 a. either hk hl = 0 for all t or else hk is constant on the support of hl . lying above the interval [ k−2 .

Fix ε > 0. Suppose {Ak }∞ are independent. 1) random variables. k = 2. In particular. since 0 ≤ δ < 1/2. N (0. . . For all x > 0. 2 ≤k<2 Then for 0 ≤ t ≤ 1. Notice that for 2n ≤ k < 2n+1 . 1) random k=0 variables deﬁned on some probability space. Let {ak }∞ be a sequence of real numbers such that k=0 |ak | = O(k δ ) for some 0 ≤ δ < 1/2. 45 x2 . Proof. we have ∞ s2 2 e− 2 ds P (|Ak | > x) = √ 2π x ∞ x2 s2 2 e− 4 ds ≤ √ e− 4 2π x ≤ Ce− 4 . . .Our goal is to deﬁne W (t) := ∞ Ak sk (t) k=0 for times 0 ≤ t ≤ 1. |Ak (ω)| = O( log k) as k → ∞. the functions sk (·) have disjoint supports. (2n+1 )δ 2−n/2−1 < ε n=m LEMMA 3. Then the series ∞ as k → ∞ ak sk (t) k=0 converges uniformly for 0 ≤ t ≤ 1. ∞ ∞ |ak ||sk (t)| ≤ k=2m n=m bn ∞ 2n ≤k<2n+1 0≤t≤1 max |sk (t)| ≤C for m large enough. LEMMA 2. where the coeﬃcients {Ak }∞ are independent. Then for alk=1 most every ω. Proof. We must ﬁrst of all check whether this series converges. N (0. the numbers {Ak (ω)}∞ almost surely satisfy the hypothesis of Lemma k=1 2 above. Set bn := n maxn+1 |ak | ≤ C(2n+1 )δ .

o.√ for some constant C. ω) := k=0 Ak (ω)sk (t) ( 0 ≤ t ≤ 1) converges uniformly in t. t ≤ 1. bk = 0 φs hk dτ = sk (s). φs (τ ) := Then if s ≤ t. N (0. ω.e. Let {Ak }∞ be a sequence of independent. 46 . and (ii) for a. Deﬁne for 0 ≤ s ≤ 1. k4 < ∞. where ak = 0 1 t 1 φt hk dτ = 0 hk dτ = sk (t). ω) is continuous. then P (|Ak | ≥ 4 log k) ≤ Ce−4 log k = C Since 1 k4 1 . Proof. Furthermore (i) W (·) is a Brownian motion for 0 ≤ t ≤ 1. 1) random variables dek=0 ﬁned on the same probability space.) = 0. Therefore for almost every sample point ω. s= 0 φt φs dτ = k=0 ak bk . LEMMA 4. THEOREM. ω. Set x := 4 log k. we have |Ak (ω)| ≤ 4 where K depends on ω. Lemma 1 implies 1 ∞ 1 0 0≤τ ≤s s < τ ≤ 1. for a. ∞ k=0 sk (s)sk (t) log k provided k ≥ K. = t ∧ s for each 0 ≤ s. the sample path t → W (t. the Borel–Cantelli Lemma implies P (|Ak | ≥ 4 log k i.e. Then the sum ∞ W (t.

. xm ∈ R. the increment W (t) − W (s) is N (0. 2.. . . .W (tm )−W (tm−1 ) (x1 . To prove this. . . Once this is proved. that m λj (W (tj )−W (tj−1 )) λ2 j 2 (6) E(e i m j=1 )= j=1 e − (tj −tj−1 ) . . 1. . The uniform convergence is a consequence of Lemmas 2 and 3.. let us compute E(eiλ(W (t)−W (s)) ) = E(eiλ ∞ ∞ k=0 Ak (sk (t)−sk (s)) ) by independence = k=0 ∞ E(eiλAk (sk (t)−sk (s)) ) e− λ2 2 = k=0 (sk (t)−sk (s))2 since Ak is N (0. this implies (ii). 3. as asserted. To prove W (·) is a Brownian motion. t − s) for all 0 ≤ s ≤ t ≤ 1. xm ) = FW (t1 ) (x1 ) · · · FW (tm )−W (tm−1 ) (xm ) for all x1 . and for all 0 = t0 < t1 < · · · < tm ≤ 1. Thus (6) will establish the Theorem. .s. we will know from uniqueness of characteristic functions that FW (t1 ). . 47 . . 1) = e− = e− = e− =e λ2 2 λ2 2 λ2 2 ∞ 2 k=0 (sk (t)−sk (s)) ∞ k=0 s2 (t)−2sk (t)sk (s)+s2 (s) k k (t−2s+s) (t−s) by Lemma 4 2 −λ 2 . W (tm ) − W (tm−1 ) are independent. By uniqueness of characteristic functions.Proof. . This proves that W (t1 ).. Next we claim for all m = 1. . We assert as well that W (t) − W (s) is N (0. 2. . . we ﬁrst note that clearly W (0) = 0 a. t − s).

Let (Ω. This theorem shows we can construct a Brownian motion deﬁned on any probability space on which there exist countably many independent N (0. Then W (·) is a one-dimensional Brownian motion. We mostly followed Lamperti [L1] for the foregoing theory. We assemble these inductively by setting W (t) := W (n − 1) + W n (t − (n − 1)) for n − 1 ≤ t ≤ n. 1) random variables. As we can reindex the N (0. 1). It is straightforward to extend our deﬁnitions to Brownian motions taking values in Rn . BROWNIAN MOTION IN Rn . and the general case follows similarly.Now in the case m = 2. 1) random variables to obtain countably many families of countably many random variables. 3. This is (6) for m = 2. 48 . t ≥ 0. deﬁned for all times t ≥ 0. The theorem above demonstrated how to build a Brownian motion on 0 ≤ t ≤ 1. Outline of proof. P ) be a probability space on which countably many N (0. U. THEOREM (Existence of one-dimensional Brownian motion). we have E(ei[λ1 W (t1 )+λ2 (W (t2 )−W (t1 ))] ) = E(ei[(λ1 −λ2 )W (t1 )+λ2 W (t2 )] ) = E(ei(λ1 −λ2 ) ∞ ∞ k=0 Ak sk (t1 )+iλ2 ∞ k=0 Ak sk (t2 ) ) = k=0 ∞ E(eiAk [(λ1 −λ2 )sk (t1 )+λ2 sk (t2 )] ) e− 2 ((λ1 −λ2 )sk (t1 )+λ2 sk (t2 )) 1 2 = k=0 = e− 2 1 1 ∞ 2 2 2 2 k=0 (λ1 −λ2 ) sk (t1 )+2(λ1 −λ2 )λ2 sk (t1 )sk (t2 )+λ2 sk (t2 ) 2 = e− 2 [(λ1 −λ2 ) =e t1 +2(λ1 −λ2 )λ2 t1 +λ2 t2 ] 2 by Lemma 4 − 1 [λ2 t1 +λ2 (t2 −t1 )] 1 2 2 . independent random variables {An }∞ n=1 are deﬁned. Then there exists a 1-dimensional Brownian motion W (·) deﬁned for ω ∈ Ω. we can therefore build countably many independent Brownian motions W n (t) for 0 ≤ t ≤ 1.

) E((W k (t) − W k (s))(W l (t) − W l (s))) = (t − s)δkl Proof. n). . n. l = 1. W n (·)) is an ndimensional Wiener process (or Brownian motion) provided (i) for each k = 1. . . . . . xm )g(x1 . . . . . . t ≥ s ≥ 0. tI) for each time t > 0. . the random variables W 1 (t). . . and (ii) the σ-algebras W k := U(W k (t) | t ≥ 0) are independent. . By the arguments above we can build a probability space and on it n independent 1dimensional Wiener processes W k (·) (k = 1. . t | 0). . we have fW(t) (x1 . (k. xn ) = fW 1 (t) (x1 ) · · · fW n (t) (xn ) x2 x2 1 1 n 1 − 2t − 2t = e ··· e 1/2 1/2 (2πt) (2πt) |x|2 1 = e− 2t = g(x. If k = l. . . W n (·)) is an n-dimensional Brownian motion. for each m = 1. An Rn -valued stochastic process W(·) = (W 1 (·). . . n. . g(xm . . . .DEFINITION. . t | y) := Ef (W(t1 ). . . . and each function f : Rn × Rn × · · · Rn → R. (2πt)n/2 Proof. For each time t > 0. xn ) ∈ Rn . LEMMA. then (i) (ii) E(W k (t)W l (s)) = (t ∧ s)δkl (k. . E(W k (t)W l (s)) = E(W k (t))E(W l (s)) = 0. |x−y|2 1 e− 2t . . . . Consequently for each point x = (x1 . Therefore P (W(t) ∈ A) = 1 (2πt)n/2 e− A |x|2 2t dx for each Borel subset A ⊆ Rn . t1 | 0)g(x2 . then W(t) is N (0. . W(tm )) = Rn ··· Rn f (x1 . n/2 (2πt) We prove formula (7) as in the one-dimensional case. . . If W(·) is an n-dimensional Wiener process. . . . l = 1. dx1 . t2 − t1 | x1 ) . . by independence. 49 . . . . . The proof of (ii) is similar. Then W(·) := (W 1 (·). . . . n). n. k = 1. . tm − tm−1 | xm−1 ) dxm . W n (t) are independent. 2. THEOREM. . . . W k (·) is a 1-dimensional Wiener process. . we have (7) where g(x. (ii) More generally. (i) If W(·) is an n-dimensional Brownian motion. . . .

50 . ω) − X(s. but is nowhere H¨lder continuous o 2 1 with any exponent γ > 2 . APPLICATION TO BROWNIAN MOTION. t ∈ [0. A good general theorem to prove H¨lder continuity is this important theorem of Kolo mogorov: THEOREM. ω) is uniformly H¨lder continuous with exponent γ on o [0. A function f : [0. Hence the sample path t → X(t. ω) almost surely is nowhere diﬀerentiable and is of inﬁnite variation for each time interval. . SAMPLE PATH PROPERTIES. T ]. T ]. (ii) We say f is H¨lder continuous with exponent γ > 0 at the point s if there exists a o constant K such that |f (t) − f (s)| ≤ K|t − s|γ for all t ∈ [0. . 2. . CONTINUITY OF SAMPLE PATHS. T ) such that |X(t. DEFINITIONS.s. γ. In this section we will demonstrate that for almost every ω. T ] → R is called uniformly H¨lder o continuous with exponent γ > 0 if there exists a constant K such that |f (t) − f (s)| ≤ K|t − s|γ for all s. such that E(|X(t) − X(s)|β ) ≤ C|t − s|1+α for constants β. We have for all integers m = 1. Let X(·) be a stochastic process with continuous sample paths a. ω)| ≤ K|t − s|γ for all 0 ≤ s. T ]. α > 0. 1.C. In particular t → W(t. (i) Let 0 < γ ≤ 1. T > 0. Consider W(·). t ≤ T. C ≥ 0 and for all 0 ≤ t. ω) o is uniformly H¨lder continuous for each exponent γ < 1 . there exists a constant K = β K(ω. and almost every ω. E(|W(t) − W(s)|2m ) = |x|2 1 |x|2m e− 2r dx for r = t − s > 0 (2πr)n/2 Rn |y|2 1 x = rm |y|2m e− 2 dy y=√ n/2 r (2π) Rn m m = Cr = C|t − s| . the sample path t → W(t. s.. an n-dimensional Brownian motion. Then for each 0 < γ < α .

ω) ≤ nγ 2n 2 2 for 0 ≤ i ≤ 2n − 1 provided n ≥ m. ω) is uniformly H¨lder o continuous on [0. ω) − X( 2in . n 2 2 2 P (An ) ≤ i=0 2n −1 P X( i+1 i 1 ) − X( n ) > nγ n 2 2 2 i+1 i ) − X( n ) n 2 2 1+α β ≤ i=0 E 2n −1 X( 1 2n 1 2nγ −β by Chebyshev’s inequality ≤C = C2 1 2nγ −β i=0 n(−α+γβ) .) = 0. Proof. ω) ≤ K 2nγ 2n for 0 ≤ i ≤ 2n − 1 for all n ≥ 0.s. ω there exists m = m(ω) such that X( i 1 i+1 . take T = 1. for exponents o 0<γ< α 1 1 = − β 2 2m for all m. So for a. whence the Borel–Cantelli Lemma implies P (An i. . . 1. ω) − X( n . T ] for each exponent 0 < γ < 1/2. β X( i+1 i 1 ) − X( n ) > nγ for some integer 0 ≤ i < 2n . . The process W(·) is thus H¨lder continuous a. the sample path t → W(t. An := Then 2n −1 0<γ< α . For simplicity. Pick any (8) Now deﬁne for n = 1.o. ∞ Since (8) forces −α + γβ < 0. .e. α = m − 1. Thus for almost all ω and any T > 0. we deduce n=1 P (An ) < ∞. . But then we have (9) 1 X( i+1 .Thus the hypotheses of Kolmogorov’s theorem hold for β = 2m. 51 if we select K = K(ω) large enough.

k. 2 In the same way we deduce |X(t2 . ω) is continuous for a. ω) − X(i/2n . 1] be dyadic rationals. .2. and consequently k |X(t1 . ω) − X(i/2 − 1/2 pr n p1 − · · · − 1/2 pr−1 1 . ω)| ≤ K ≤ K 2nγ r=1 ∞ 1 2pr γ r=1 1 since pr > n 2rγ C = nγ ≤ Ctγ by (10). t2 ∈ [0. Select n ≥ 1 so that (10) We can write t1 = t2 = for t1 ≤ Then 2−n ≤ t < 2−(n−1) for t := t2 − t1 . 52 .* We now claim (9) implies the stated H¨lder continuity. Let t1 . .e. Since t → X(t. . ω. ﬁx ω ∈ Ω for o which (9) holds. ω) − X(t2 . the estimate above holds for all t1 . In view of (9). Furthermore |X(i/2 − 1/2 n p1 − · · · − 1/2 . ω) − X(j/2n . . t2 ∈ [0. ≤ Ktγ . 1] and some constant C = C(ω). ω)| ≤ K pr 2 γ for r = 1. i 2n j 2n − + 1 2p1 1 2q1 − ··· − + ··· + 1 2pk 1 2ql (n < p1 < · · · < pk ) (n < q1 < · · · < ql ) i j ≤ n ≤ t2 . to discover |X(t1 . ω)| ≤ Ctγ . Add up the estimates above. ω)| ≤ K 2n n n γ and so j = i or i + 1. ω) − X(j/2 . To see this. ω)| ≤ C|t1 − t2 |γ for all dynamic rationals t1 . n 2 2 j−i 1 ≤ t < n−1 n 2 2 i−j |X(i/2 . 1]. *Omit the second step in this proof on ﬁrst reading. 0 < t2 − t1 < 1. t2 ∈ [0.

e. It suﬃces to consider a one-dimensional Brownian o motion. β > 0. Now if the function t → W (t. ω) W ( . ω) − W ( n γ γ j+1 j + s− ≤K s− n n M ≤ γ n 53 . ω) − W ( n n n j+1 . C ≥ 0). 1] and some constant K. (Dvoretzky. ω) ≤ W (s.Remark. Fix an integer N so large that N γ− 1 2 > 1. ω) − W ( . Erd¨s. 1. o 2 o THEOREM. Proof. ω) is nowhere H¨lder 2 continuous with exponent γ. . t → W(t. ˜ then X(·) has a version X(·) such that a. ω) is H¨lder continuous with exponent γ at some point o 0 ≤ s < 1. and we may for simplicity consider only times 0 ≤ t ≤ 1.) So any Wiener process has a version with continuous sample paths a. i + 1. . ω) − W (s. (ii) In particular. and thus are nowhere diﬀerentiable. i + N − 1 j+1 j j . . NOWHERE DIFFERENTIABILITY Next we prove that sample paths of Brownian motion are with probability one nowhere H¨lder continuous with exponent greater than 1 . set i = [ns] + 1 and note that for j = i. (We call X(·) a version of X(·) if P (X(t) = X(t)) = 1 for all t ≥ 0. ω)| ≤ K|t − s|γ For n for all t ∈ [0. 2. The proof above can in fact be modiﬁed to show that if X(·) is a stochastic process such that E(|X(t) − X(s)|β ) ≤ C|t − s|1+α (α. for almost every ω. then |W (t. Kakutani) 1. ω) is nowhere diﬀerentiable and is of inﬁnite variation on each subinterval. the sample path t → W(t. . (i) For each 1 < γ ≤ 1 and almost every ω. sample path is H¨lder continuous for each o ˜ ˜ exponent 0 < γ < α/β.s. ω) + W (s.

. This holds for all k. For all k and M . Thus ω ∈ Ai M.n P (Ai ) M.n = 0. 54 . Thus ∞ ∞ ∞ n P M =1 k=1 n=k i=1 Ai M. ·) is H¨lder continuous with exponent γ at o some time 0 ≤ s < 1 is contained in ∞ ∞ ∞ n Ai . n→∞ since N (γ − 1/2) > 1. and all large n. . ∞ n n P n=k i=1 Ai M.n ≤ lim inf n→∞ i=1 ≤ lim inf n P n→∞ 1 M |W ( )| ≤ γ n n N . n and independent.for some constant M .n ≤ lim inf P n→∞ i=1 n Ai M. M.n M =1 k=1 n=k i=1 We will show this event has probability 0. 2.n := j j+1 M W( ) − W( ) ≤ γ for j = i. Therefore the set of ω ∈ Ω such that W (ω. j 1 since the random variables W ( j+1 ) − W ( n ) are N 0. Now n P 1 M |W ( )| ≤ γ n n √ n =√ 2π 1 =√ 2π ≤ Cn M n−γ −M n−γ e− nx2 2 dx dy M n1/2−γ e− y2 2 −M n1/2−γ 1/2−γ . . and assertion (i) of the Theorem follows.n ≤ lim inf nC[n1/2−γ ]N = 0. We use this calculation to deduce: ∞ n P n=k i=1 Ai M. . i + N − 1 n n n for some 1 ≤ i ≤ n. M . some M ≥ 1.

ω) were of ﬁnite variation on some subinterval. If W (t. ω) − W (s. MARKOV PROPERTY. then W (t. ω)| ≤ K|t − s|γ for all t. ω)| ≤ γ n n n for all n 1 and at least N values of j. the conditional probability of A. If V is a σ-algebra. Interpretation. ω) − W ( . Therefore P (A | V) is a random variable. 55 . DEFINITION. and is therefore extremely small. V ⊆ U. ω) would be H¨lder continuous (with o exponent 1) at s. it would then be diﬀerentiable almost everywhere there. ω) is diﬀerentiable at s. then P (A | V) := E(χA | V) for A ∈ U. A sample path of Brownian motion D.3. But these are independent events of small probability. then j j+1 M |W ( . If W (t. But this is almost surely not so. given V. The probability that the above inequality holds for all these j’s is a small number to the large power N . The idea underlying the proof is that if |W (t.

Set Φ(y) := 1 (2π(t − s))n/2 e− 2(t−s) dx. Now if C ∈ U(W(s)). Proof. and Borel sets B .s. for all 0 ≤ s < t. We will only prove (13). W (t) ∈ A) = B A C g(y. for all 0 ≤ s ≤ t and all Borel subset B of Rn . B = 56 . A |x−y|2 As Φ(W(s)) is U(W(s)) measurable.DEFINITION. If X(·) is a stochastic process. Let W(·) be an n-dimensional Wiener process. An Rn -valued stochastic process X(·) is called a Markov process if P (X(t) ∈ B | U(s)) = P (X(t) ∈ B | X(s)) a. Hence χ{W(t)∈A} dP = P (W(s) ∈ B. the process only “knows” its value at time s and does not “remember” how it got there. and (13) P (W(t) ∈ B | W(s)) = 1 (2π(t − s))n/2 e− B |x−W(s)|2 2(t−s) dx a. The idea of this deﬁnition is that.s. the σ-algebra U(s) := U(X(r) | 0 ≤ r ≤ s) is called the history of the process up to and including time s. you can predict the probabilities of future values of X(t) just as well as if you knew the entire history of the process before time s. Note carefully that each side of this identity is a random variable. s | 0)g(x. then C = {W(s) ∈ B} for some Borel set B ⊆ Rn . we must show (14) C χ{W(t)∈A} dP = Φ(W(s)) dP C for all C ∈ U(W(s)). Then W(·) is a Markov process. THEOREM. t − s | y) dxdy g(y. DEFINITION. given the current value X(s). s | 0)Φ(y) dy. We can informally interpret U(s) as recording the information available from our observing X(r) for all times 0 ≤ r ≤ s. Loosely speaking.

57 . as discussed before in §C. Thus the path “cannot remember” how to leave b in such a way that W(·. and so establishes (13). If W(s. ω) approached the point b as t → s− . ω) will have a tangent there. Interpretation. then the future behavior of W(t.On the other hand. and this last expression agrees with that above. The Markov property partially explains the nondiﬀerentiability of sample paths for Brownian motion. say. ω) depends only upon this fact and not on how W(t. Φ(W(s))dP = C Ω χB (W(s))Φ(W(s)) dP |y|2 e− 2s = χB (y)Φ(y) dy (2πs)n/2 Rn = B g(y. s | 0)Φ(y) dy. ω) = b. This veriﬁes (14).

One possible deﬁnition is due to Paley. t)dW X(0) = X0 . E. For instance. Motivation Deﬁnition and properties of Itˆ integral o Indeﬁnite Itˆ integrals o Itˆ’s formula o Itˆ integral in n-dimensions o A. But before we can study and solve such an integral equation. C. D. then 0 G dW simply cannot be understood as an ordinary integral. A FIRST DEFINITION. Wiener and Zygmund [P-W-Z]. s) ds + 0 B(X. so that the right-hand side of (1) is at least makes sense. 1] → R is continuously diﬀerentiable. which we will in Chapter 5 interpret to mean t t (1) X(t) = X0 + 0 b(X. s) dW for all times t ≥ 0. MOTIVATION. Note carefully: g is an ordinary. with g(0) = g(1) = 0. ω) T is of inﬁnite variation for almost every ω. B. since t → W(t. Then let us deﬁne 1 1 g dW := − 0 1 0 g W dt. we must ﬁrst deﬁne T G dW 0 for some wide class of stochastic processes G. A. Remember from Chapter 1 that we want to develop a theory of stochastic diﬀerential equations of the form (SDE) dX = b(X. ITO’S FORMULA. Let us check out the properties following from this deﬁnition: 58 .ˆ CHAPTER 4: STOCHASTIC INTEGRALS. Suppose now n = m = 1. Note that 0 g dW is therefore a random variable. t)dt + B(X. Suppose g : [0. deterministic function and not a stochastic process. Observe also that this is not at all obvious.

(ii) E Proof.LEMMA (Properties of the Paley–Wiener–Zygmund integral). To conﬁrm (ii). 59 . and therefore { 1 g 0 n dW }∞ is a Cauchy sequence in L2 (Ω). 1). We can take a sequence of C 1 functions gn . If we wish to deﬁne the integral in (1). and not for stochastic processes. t B(X. 0 then the integrand B(X. 1 (i) E 0 g dW = 0. except that this only makes sense for functions g ∈ L2 (0. 1). Consequently we can deﬁne n=1 1 1 g dW := lim 0 n→∞ gn dW. In view of property (ii). 0 The extended deﬁnition still satisﬁes properties (i) and (ii). g E(W (t)) dt. Discussion. as 1 above. s) dW. such that 0 (gn − g)2 dt → 0. t) is a stochastic process and the deﬁnition above will not suﬃce. =0 g dW 1 0 2. 1 1 2 1 E 0 gm dW − 0 gn dW = 0 (gm − gn )2 dt. we calculate 1 2 1 1 E 0 g dW =E 0 1 1 g (t)W (t) dt 0 g (s)W (s) ds = 0 1 0 g (t)g (s)E(W (t)W (s))dsdt =t∧s t 1 = 0 1 g (t) 0 sg (s) ds + t t tg (s) ds dt = 0 1 g (t) tg(t) − 0 t g ds − tg(t) dt 1 = 0 g (t) − 0 g ds dt = 0 g 2 dt. 1 This is a reasonable deﬁnition of 0 g dW . E 1 0 2 g dW 1 0 = =− 1 0 g 2 dt. Suppose now g ∈ L2 (0. 1.

T ] is a ﬁnite collection of points in [0. Let [a. we deﬁne m−1 R = R(P. T ]: P := {0 = t0 < t1 < · · · < tm = T }. and then–if possible–to pass to limits. λ) := k=0 W (τk )(W (tk+1 ) − W (tk )). m − 1). . T ]. RIEMANN SUMS. The key question is LEMMA (Quadratic variation). DEFINITIONS. (iii) For ﬁxed 0 ≤ λ ≤ 1 and P a given partition of [0. b] be an interval in [0. and suppose P n := {a = tn < tn < · · · < tn n = b} 0 1 m are partitions of [a. 60 . b].We must devise a deﬁnition for a wider class of integrands (although the deﬁnition we ﬁnally decide on will agree with that of Paley. Then mn −1 (W (tn ) − W (tn ))2 → b − a k+1 k k=0 in L2 (Ω) as n → ∞. Zygmund if g happens to be a deterministic C 1 function. with g(0) = g(1) = 0). with |P n | → 0 as n → ∞. T 0 This is the corresponding Riemann sum approximation of this: what happens if |P | → 0. . ∞). with λ ﬁxed? W dW . a partition P of [0. A reasonable procedure is to construct a Riemann sum approximation. that dW ≈ (dt)1/2 . For such a partition P and 0 ≤ λ ≤ 1. set τk := (1 − λ)tk + λtk+1 (k = 0. Wiener. let us think about what might be an appropriate deﬁnition for T W dW = ?. T ] is an interval. To continue our study of stochastic integrals with random integrands. (i) If [0. introduced in Chapter 1. This assertion partly justiﬁes the heuristic idea. 0 where W (·) is a 1-dimensional Brownian motion. . . (ii) Let the mesh size of P be |P | := max0≤k≤m−1 |tk+1 − tk |.

k+1 k k+1 k according to the independent increments. 1).Proof. Then k ((W (tn ) − W (tn ))2 − (tn − tn )). Set Qn := mn −1 n k=0 (W (tk+1 ) − W (tn ))2 .s. ≤ C | P | (b − a) → 0 Remark. the term in the double sum is E((W (tn ) − W (tn ))2 − (tn − tn ))E(· · · ). Therefore for some constant C E((Qn − (a − b)) ) ≤ C 2 k=0 n (tn − tn )2 k+1 k as n → ∞. t − s) for all t ≥ s ≥ 0. and thus equals 0. Hence mn −1 E((Qn − (b − a))2 ) = k=0 2 E((Yk − 1)2 (tn − tn )2 ). k+1 k k+1 k mn −1 Qn − (b − a) = k=0 Hence mn −1 mn −1 E((Qn − (b − a)) ) = 2 k=0 j=0 E([(W (tn ) − W (tn ))2 − (tn − tn )] k+1 k k+1 k [(W (tn ) − W (tn ))2 − (tn − tn )]). k+1 k where Yk = n Yk W (tn ) − W (tn ) k+1 k := n n tk+1 − tk mn −1 is N (0. k+1 k k=0 Pick an ω for which this holds and also for which the sample path is H¨lder continuous o 1 with some exponent 0 < γ < 2 . Passing if necessary to a subsequence. j+1 j j+1 j For k = j. as W (t) − W (s) is N (0. mn −1 (W (tn ) − W (tn ))2 → b − a a. Then mn −1 b − a ≤ K lim sup |Pn | n→∞ γ k=0 |W (tn ) − W (tn )| k+1 k 61 .

Let us now return to the question posed above. In particular the limit of the Riemann sum approximations depends upon the choice of n n intermediate points tn ≤ τk ≤ tn . as to the limit of the Riemann sum approximations. T ] and 0 ≤ λ ≤ 1 is ﬁxed. the limit taken in L2 (Ω). Since |Pn | → 0. and similarly B → λT as 62 .for a constant K. W (T )2 1 Rn − − λ− 2 2 2 E T → 0. k k+1 k k+1 Proof. If P n denotes a partition of [0. we see again that sample paths have inﬁnite variation with probability one: m−1 sup P k=0 |W (tk+1 ) − W (tk )| = ∞. k+1 k k=0 Rn := Then n→∞ lim Rn = W (T )2 1 + λ− 2 2 T. LEMMA. where τk = (1 − λ)tn + λtn . k+1 k =:B =:C T 2 According to the foregoing Lemma. deﬁne mn −1 n W (τk )(W (tn ) − W (tn )). We have mn −1 n W (τk )(W (tn ) − W (tn )) k+1 k k=0 mn −1 Rn := W 2 (T ) 1 = − 2 2 mn −1 (W (tn ) − W (tn ))2 k+1 k k=0 =:A mn −1 + k=0 n (W (τk ) − W (tn ))2 k + k=0 n n (W (tn ) − W (τk ))(W (τk ) − W (tn )). That is. A → in L2 (Ω) as n → ∞.

building the Riemann sum approximation by evaluating the n integrand at the left-hand endpoint τk = tn on each subinterval [tn . B. That is. T W 2 (T ) T W dW = − 2 2 0 and. This is not what one would guess oﬀhand. C. Next we study the term C: mn −1 E([ k=0 n n (W (tn ) − W (τk ))(W (τk ) − W (tn ))]2 ) k+1 k mn −1 = k=0 n n E([W (tn ) − W (τk )]2 )E([W (τk ) − W (tn )]2 ) k+1 k (independent increments) mn −1 = k=0 (1 − λ)(tn − tn )λ(tn − tn ) k+1 k k+1 k ≤ λ(1 − λ)T |P n | → 0. more generally. More discussion. tn ] will ultimately k k k=1 permit the deﬁnition of T G dW 0 63 . in §B) of 0 W dW corresponds to the choice o λ = 0. We combine the limiting expressions for the terms A. Hence C → 0 in L2 (Ω) as n → ∞. An alternative deﬁnition. r T W dW = s W 2 (r) − W 2 (s) (r − s) − 2 2 for all r ≥ s ≥ 0. See Chapter 6 for more. It turns out that Itˆ’s deﬁnition (later.n → ∞. and thereby establish the Lemma. so that 2 T W ◦ dW = 0 W 2 (T ) 2 (Stratonovich integral). What are the advantages of taking λ = 0 and getting T W dW = 0 W 2 (T ) T − ? 2 2 First and most importantly. due to Stratonovich. takes λ = 1 .

where X0 is a random variable independent of W + (0). we will see that since at each 64 . and understand. DEFINITION. The idea is that G(·) is nonanticipating and. it is best to use the known value of G(tn ) in the approximation. We should informally think of F(t) as “containing all information available to us at time t”. and we do not know at time tn its k n future value at the future time τk = (1 − λ)tn + λtn . and so we pause here to emphasize the basic point. namely that G(·) be progessively measurable. in addition.for a wide class of so-called “nonanticipating” stochastic processes G(·). We will actually need a slightly stronger notion. if λ > 0. is appropriately jointly measurable in the variables t and ω together. Indeed. DEFINITION AND PROPERTIES OF ITO’S INTEGRAL. tn ]. useful and elegant formulas. A family F(·) of σ-algebras ⊆ U is called nonanticipating (with respect to W (·)) if (a) F(t) ⊇ F(s) for all t ≥ s ≥ 0 (b) F(t) ⊇ W(t) for all t ≥ 0 (c) F(t) is independent of W + (t) for all t ≥ 0. This will be employed in Chapter 5. Discussion. This is however a bit subtle to deﬁne. DEFINITION. to be developed below. U. These measure theoretic issues can be confusing to students. (i) The σ-algebra W(t) := U(W (s) | 0 ≤ s ≤ t) is called the history of the Brownian motion up to (and including) time t. X0 ). the stochastic integral 0 G dW in terms of some simple. In other words. The idea is that for each time t ≥ 0. we will be able to deﬁne. but the idea is that t represents time. Let W (·) be a 1-dimensional Brownian motion deﬁned on some probability space (Ω. the random variable G(t) “depends upon only the information available in the σ-algebra F(t)”. Our primary example will be F(t) := U(W (s) (0 ≤ s ≤ t). k k+1 ˆ B. P ). (ii) The σ-algebra W + (t) := U(W (s)−W (t) | s ≥ t) is the future of the Brownian motion beyond time t. We also refer to F(·) as a ﬁltration. where X0 will be the (possibly random) initial condition for a stochastic diﬀerential equation. G(t) is F(t)–measurable. and we will not do so here. For progessively measurable integrands T G(·). DEFINITIONS. k k+1 k G(·) will in general depend on Brownian motion W (·). Exact deﬁnitions are later. A real-valued stochastic process G(·) is called nonanticipating (with respect to F(·)) if for each time t ≥ 0. and since we do not know what W (·) will do on [tn .

Then T m−1 G dW := 0 k=0 Gk (W (tk+1 ) − W (tk )) is the Itˆ stochastic integral of G on the interval (0. o Note carefully that this is a random variable. m − 1). . T ). T ) is called a step process if there exists a partition P = {0 = t0 < t1 < · · · < tm = T } such that G(t) ≡ Gk f or tk ≤ t < tk+1 (k = 0. . DEFINITION. L1 (0. . H ∈ L2 (0. as above. progressively measurable stochastic processes G(·) such that T E 0 G2 dt < ∞. T ) is the space of all real–valued. T ): T T T (i) 0 aG + bH dW = a 0 G dW + b 0 H dW.moment of time “G depends only upon the past history of the Brownian motion”. Let G ∈ L2 (0. T (ii) E 0 G dW 65 = 0. which would be false if G “depends upon the future behavior of the Brownian motion”. since G is nonanticipating. some nice identities hold. DEFINITIONS. LEMMA (Properties of stochastic integral for step processes). . DEFINITION. Then each Gk is an F(tk )-measurable random variable. We have for all constants a. progressively measurable processes F (·) such that T E 0 |F | dt < ∞. . T ) the space of all real–valued. A process G ∈ L2 (0. (ii) Likewise. T ) be a step process. (i) We denote by L2 (0. b ∈ R and for all step processes G.

The plan now is to approximate an arbitrary process G ∈ L2 (0. Hence E(Gk (W (tk+1 ) − W (tk ))) = E(Gk )E(W (tk+1 ) − W (tk )). Proof. T ). APPROXIMATION BY STEP FUNCTIONS. T ) by step processes in L2 (0. k. o 66 . On the other hand. =0 2. <∞ =0 Consequently  E 0 T 2  = m−1 G dW E(G2 (W (tk+1 ) − W (tk ))2 ) k k=0 m−1 = k=0 E(G2 )E((W (tk+1 ) − W (tk ))2 ) k =tk+1 −tk T =E 0 G2 dt . Suppose next G(t) ≡ Gk for tk ≤ t < tk+1 . Then T m−1 E 0 G dW = k=0 E(Gk (W (tk+1 ) − W (tk ))). and then pass to limits to deﬁne the Itˆ integral of G.j=1 Now if j < k.  E 0 T 2  = m−1 G dW E (Gk Gj (W (tk+1 ) − W (tk ))(W (tj+1 ) − W (tj )) . W (tk+1 ) − W (tk ) is W + (tk )-measurable. (iii) E 0 T 2  =E 0 T G dW G2 dt . Now Gk is F(tk )-measurable and F(tk ) is independent of W + (tk ). Furthermore. and so Gk is independent of W (tk+1 ) − W (tk ). then W (tk+1 ) − W (tk ) is independent of Gk Gj (W (tj+1 ) − W (tj )). The ﬁrst assertion is easy to check. Thus E(Gk Gj (W (tk+1 ) − W (tk ))(W (tj+1 ) − W (tj ))) = E(Gk Gj (W (tj+1 ) − W (tj )))E(W (tk+1 ) − W (tk )). 1.

[nT ]. m → ∞ and so the limit 0 T T G dW := lim exists in L (Ω). If G ∈ L2 (0. 2 n→∞ Gn dW 0 It is not hard to check this deﬁnition does not depend upon the particular sequence of step process approximations in L2 (0. THEOREM (Properties of Itˆ Integral). we have T T For all constants a. DEFINITION. . . Then  E 0 T 2  =E 0 T Gn − Gm dW (Gn − Gm )2 dt →0 as n. T ). there exists a sequence of bounded step processes Gn ∈ L2 (0. t → Gm (t.s. If G ∈ L2 (0. Then Gm ∈ L2 (0. H ∈ L (0. T ) such that T E 0 |G − Gn |2 dt → 0. . . . as above. and T |Gm − G|2 dt → 0 a. ω. Outline of proof.LEMMA (Approximation by step processes). we can set k k k+1 . o 2 G. T ). T ). take step processes Gn as above. deﬁne t G (t) := 0 m mem(s−t) G(s) ds. b ∈ R and for all T (i) 0 aG + bH dW = a 67 0 G dW + b 0 H dW. k = 0. T ). T ).e. Gn (t) := G( ) for ≤ t < n n n For a general G ∈ L2 (0. but the idea is this: if t → G(t. ω) is continuous for a. ω) is continuous for almost every ω. 0 Now approximate Gm by step processes. T ). We omit the proof.

T ) and t → G(t. progressively measurable processes G(·) such that T G2 dt < ∞ 0 a. then mn −1 k=0 n T G(tn )(W (tn ) k k+1 − W (tn )) k → 0 G dW in probability. T T (iv) E 0 G dW 0 H dW =E 0 GH dt .s. 0 the expressions on the right converging in probability. See for instance Friedman [F] or Gihman–Skorohod [G-S] for details. and is left as an exercise. 2. it is important to consider a wider class of integrands. Finally.s. In particular. m with |P n | → 0. Statements (i) and (iii) are also easy consequences of the similar rules for step processes. assertion (iv) results from (iii) and the identity 2ab = (a + b)2 − a2 − b2 . if G ∈ M2 (0. It is possible to extend the deﬁnition of the Itˆ integral to cover G ∈ M2 (0. ω. 1. T ). The idea is to ﬁnd a sequence of step processes Gn ∈ M2 (0. T ) to be the space of all real–valued. ω) is continuous for a. Assertion (i) follows at once from the corresponding linearity property for step processes. where P = {0 = tn < · · · < tn n = T } is any sequence of partitions.e. 0 It turns out that we can then deﬁne T T G dW := lim 0 n→∞ Gn dW. k 68 . instead of just L2 (0. For many applications. evaluated at τk = tn .T (ii)  (iii) E 0 T T E 0 G dW 2 = 0. More on Riemann sums. Proof. This conﬁrms the consistency of Itˆ’s integral with the earlier calculations o n involving Riemann sums. T ). as n → ∞. EXTENDING THE DEFINITION. T ) such that T (G − Gn )2 dt → 0 a. To this end we deﬁne M2 (0.  =E 0 T G dW G2 dt . although we o will not do so in these notes.

(iI) Furthermore. In this section we note some properties of the process I(·). ITO’S FORMULA. I(·) has a version with continuous sample paths a. We say that X(·) has the stochastic diﬀerential dX = F dt + GdW for 0 ≤ t ≤ T . set t I(t) := 0 G dW (0 ≤ t ≤ T ). we will always mean this version.ˆ C. and “dW ” have no meaning alone. a proof of (ii) is in Appendix C. “dt”. Suppose that X(·) is a real–valued stochastic process satisfying r r X(r) = X(s) + s F dt + s G dW for some F ∈ L1 (0. T ). INDEFINITE ITO INTEGRALS. ˆ D. then the indeﬁnite integral I(·) is a martingale. T ) and all times 0 ≤ s ≤ r ≤ T . DEFINITION. t). the indeﬁnite integral of G(·). T ). T ). We will not prove assertion (i). T ). G ∈ L2 (0. THEOREM.s. These facts will be quite useful for proving Itˆ’s o formula later in §D and in solving the stochastic diﬀerential equations in Chapter 5. Note carefully that the diﬀerential symbols are simply an abbreviation for the integral expressions above: strictly speaking “dX”. DEFINITION. THEOREM (Itˆ’s Formula). G ∈ L2 (0. .s. T ). Suppose that X(·) has a stochastic diﬀerential o dX = F dt + GdW. ∂x . Assume u : R × [0. Note I(0) = 0. T ] → R is continuous and that ∂2u ∂x2 exist and are continuous. Set Y (t) := u(X(t). (i) If G ∈ L2 (0. namely that it is a martingale and has continuous sample paths a. for F ∈ L1 (0. Henceforth when we refer to I(·). 69 ∂u ∂u ∂t . For G ∈ L2 (0.

G ≡ 1. t). Thus for amost every ω. ˆ ILLUSTRATIONS OF ITO’S FORMULA. s ∂x t t r (iii) Since X(t) = X(0) + 0 F ds + 0 G dW . the expression (2) means for all 0 ≤ s ≤ r ≤ T . 70 . t)G dW almost surely. ∂t (ii) In view of our deﬁnitions. Then dX = dW and thus F ≡ 0. r) − u(X(s). t) + (X. X(·) has continuous sample paths almost 2 surely. This integrated is the identity r W dW = s W 2 (r) − W 2 (s) (r − s) − . 2 In particular the case m = 2 reads d(W 2 ) = 2W dW + dt. t)G2 dt ∂t ∂x 2 ∂x2 s r ∂u + (X. We will prove Itˆ’s formula below. the functions t → ∂u (X(t). t)F + (X.Then Y has the stochastic diﬀerential ∂u ∂t dt ∂u ∂t ∂u ∂x dX ∂u ∂x F 1 ∂2u 2 2 ∂x2 G dt dY = (2) = + + + + 1 ∂2u 2 2 ∂x2 G dt + ∂u ∂x GdW. ∂ u (X(t). Hence Itˆ’s formula gives o 1 d(W m ) = mW m−1 dW + m(m − 1)W m−2 dt. u(x) = xm . t). 2 2 a formula we have established from ﬁrst principles before. above is (X(t). Y (r) − Y (s) = u(X(r). Let X(·) = W (·). ∂u (X(t). but o ﬁrst here are some applications: Example 1. ∂u . t). t) ∂t ∂x ∂x2 are continuous and so the integrals in (3) are deﬁned. o o Remarks. (i) The argument of u. s) (3) = ∂u 1 ∂2u ∂u (X. etc. We call (2) Itˆ’s formula or Itˆ’s chain rule.

1. dhn+1 (W. u(x. t) = − 2 2 6 2 tx2 t2 x4 h4 (x. G ≡ 1. This is a stochastic diﬀerential equation.Example 2. t) dW. 71 . We have t 2 (−t)n x2 /2t dn e−x /2t . Then h0 (x. 24 4 8 h2 (x. t) plays the role o tn that n! plays in ordinary calculus. . n dλ dx we have n 2 dn λx− λ2 t n x2 /2t d 2 )| (e (e−x /2t ) λ=0 = (−t) e n n dλ dx = n!hn (x. Thus o dY = λY dW Y (0) = 1. . In the Itˆ stochastic calculus the expression eλW (t)− o ordinary calculus. . t) = THEOREM (Stochastic calculus with Hermite polynomials). t) 0 for n = 0. t) = eλx− d eλW (t)− λ2 t 2 λ2 t 2 . about which more in Chapters 5 and 6. F ≡ 0. t ≥ 0. h1 (x. e n n! dx λ2 t 2 plays the role that eλt plays in hn (W. t) = x x2 x3 t tx − . etc. . deﬁne hn (x. t) = − + . t) = 1. t) = hn (W. Consequently in the Itˆ stochastic calculus the expression hn (W (t). t) := the n-th Hermite polynomial. Proof. . . We build upon this observation: Example 3. . s) dW = hn+1 (W (t). Again take X(·) = W (·). For n = 0. (from McKean [McK]) Since 2 dn − (x−λt)2 dn 2t (e )|λ=0 = (−t)n n (e−x /2t ). t). Then dt + λeλW (t)− λ2 t 2 = − λ2 t λ2 λW (t)− λ2 t λ2 2 e + eλW (t)− 2 2 2 dW by Itˆ’s formula. that is. h3 (x. .

Proof. ∞ = n=0 λn hn (W (t). ˆ PROOF OF ITO’S FORMULA. since t → W (t) is continuous a. that is. Y (t) = 1 + λ 0 t Y dW for all t ≥ 0. Similarly. To verify (ii). This identity holds for all λ and so the coeﬃcients of λn on both sides are equal. Plug in the expansion above for Y (t): ∞ t ∞ λ hn (W (t). note that r mn −1 t dW = lim 0 n→∞ tn (W (tn ) − W (tn )).Hence e and so Y (t) = e But Y (·) solves λW (t)− λ2 t 2 λx− λ2 t 2 ∞ = n=0 λn hn (x.s. r].. k k+1 k k=0 where P n = {0 = tn < tn < · · · < tn n = r} is a sequence of partitions of [0. k+1 k+1 k k=0 72 . dY = λY dW Y (0) = 1. s) dW t =1+ n=1 λn 0 hn−1 (W (s). r mn −1 W dt = lim 0 n→∞ W (tn )(tn − tn ). t). t). We now begin the proof of Itˆ’s formula. with m 0 1 n |P | → 0. We have already established formula (i). The limit above is taken in L2 (Ω). and (ii ) d(tW ) = W dt + tdW . s) dW. by verio fying directly two important special cases: LEMMA (Two simple stochastic diﬀerentials). We have (i) d(W 2 ) = 2W dW + dt. t) = 1 + λ n=0 ∞ 0 n=0 n λn hn (W (s).

(i) The expression G1 G2 dt here is the Itˆ correction term. assume for simplicity that X1 (0) = X2 (0) = 0. agrees with the Itˆ deﬁnition. The integrated version of the product rule is the Itˆ integration-by-parts formula: o r r r (5) s X2 dX1 = X1 (r)X2 (r) − X1 (s)X2 (s) − s X1 dX2 − s G1 G2 dt. F(0)-measurable random variables (i = 1. T ). . for deterministic C 1 functions g. 2). Then (4) d(X1 X2 ) = X2 dX1 + X1 dX2 + G1 G2 dt. These integral identities for all r ≥ 0 are abbreviated d(tW ) = tdW + W dt. i = 1. 2). Gi (t) ≡ Gi . These special cases in hand. o Proof. Suppose o dX1 = F1 dt + G1 dW dX2 = F2 dt + G2 dW for Fi ∈ L1 (0. First of all. with g(0) = g(1) = 0. Choose 0 ≤ r ≤ T . (ii) If either G1 or G2 is identically equal to 0.since for amost every ω the sum is an ordinary Riemann sum approximation and for this we can take the right-hand endpoint tn at which to evaluate the continuous integrand. This conﬁrms that the Paley–Wiener–Zygmund deﬁnition 1 1 g dW = − 0 0 g W dt. o Remarks. Then Xi (t) = Fi t + Gi W (t) 73 (t ≥ 0. Gi are time–independent. k+1 We add these formulas to obtain r r t dW + 0 0 W dt = rW (r). we now prove: THEOREM (Itˆ product rule). where Fi . Gi ∈ L2 (0. we get the ordinary calculus integrationby-parts formula. 1. 2). Fi (t) ≡ Fi . T ) (i = 1. (0 ≤ t ≤ T ).

r) and pass to limits. i = 1. . with i E( E( Deﬁne n Xi (t) := Xi (0) + 0 T 0 |Fin − Fi | dt) → 0 − Gi ) dt) → 0 2 t T (Gn i 0 as n → ∞. Xi (0) = 0. This is formula (5) for the special circumstance that s = 0. t Fin ds + 0 Gn dW (i = 1. 3. Employing these identities. X1 (s). and Fi . we apply Step 1 on each subinterval [tk . and Fi . If Fi . Gi are constant F(s)-measurable random variables has a similar proof. i We apply Step 2 to n Xi (·) on (s. r r 0 We now use the Lemma above to compute 2 0 W dW = W 2 (r)−r and rW (r). 2. T ). to obtain the formula r X1 (r)X2 (r) = X1 (s)X2 (s) + 74 s X1 dX2 + X2 dX1 + G1 G2 dt. 2.Thus r X2 dX1 + X1 dX2 + G1 G2 dt 0 r r = 0 X1 F2 + X2 F1 dt + 0 r X1 G2 + X2 G1 dW + 0 r G1 G2 dt = 0 (F1 t + G1 W )F2 + (F2 t + G2 W )F1 dt r + 0 (F1 t + G1 W )G2 + (F2 t + G2 W )G1 dW + G1 G2 r r r = F1 F2 r2 + (G1 F2 + G2 F1 ) 0 r W dt + 0 t dW + 2G1 G2 0 W dW + G1 G2 r. we deduce: r W dt+ r 0 t dW = X2 dX1 + X1 dX2 + G1 G2 dt 0 = F1 F2 r2 + (G1 F2 + G2 F1 )rW (r) + G1 G2 W 2 (r) = X1 (r)X2 (r). Gn ∈ L2 (0. The case that s ≥ 0. X2 (s) are arbitrary. tk+1 ) on which Fi and Gi are constant random variables. we select step processes Fin ∈ L1 (0. Gi are step processes. and add the resulting integral expressions. T ). In the general situation. 2). Gi time– independent random variables.

. T ). t) = i=1 f i (x)g i (t). . G ∈ L2 (0. 1. m = 0. This proves (6). o 2. Itˆ’s formula is valid for all polynomials u in the variable x. and the case m = 2 follows from the Itˆ product formula. T ). . . and since the o operator “d” is linear. Then d(u(X. 2 This is clear for m = 0. 1. 1. t) = f (x)g(t). . dt + dX + ∂t ∂x 2 ∂x2 This calculation conﬁrms Itˆ’s formula for u(x. where f and g are polynoo mials. . t) = f (x)g(t). where f and g are polynomials. Thus it is true as well for any function u having the form m u(x. t)) = d(f (X)g) = f (X)dg + gdf (X) 1 = f (X)g dt + g[f (X)dX + f (X)G2 dt] 2 ∂u ∂u 1 ∂2u 2 = G dt. 1. and ﬁrst of all claim that 1 d(X m ) = mX m−1 dX + m(m − 1)X m−2 G2 dt. Now o assume the stated formula for m − 1: 1 d(X m−1 ) = (m − 1)X m−2 dX + (m − 1)(m − 2)X m−3 G2 dt 2 1 = (m − 1)X m−2 (F dt + GdW ) + (m − 1)(m − 2)X m−3 G2 dt. We start with the case u(x) = xm . 2 because m − 1 + 1 (m − 1)(m − 2) = 1 m(m − 1). Suppose dX = F dt + GdW . Suppose now u(x. 75 . m = 0. with F ∈ L1 (0. 2 and we prove it for m: (6) d(X m ) = d(XX m−1 ) = Xd(X m−1 ) + X m−1 dX + (m − 1)X m−2 G2 dt (by the product rule) 1 = X (m − 1)X m−2 dX + (m − 1)(m − 2)X m−3 G2 dt 2 + (m − 1)X m−2 G2 dt + X m−1 dX 1 = mX m−2 dX + m(m − 1)X m−2 G2 dt. 2 2 Since Itˆ’s formula thus holds for the functions u(x) = xm .Now we are ready for ˆ CONCLUSION OF THE PROOF OF ITO’S FORMULA. .

(i) Let W(·) = (W 1 (·). Itˆ’s formula is valid for all polynomial functions u o of the variables x. ITO’S INTEGRAL IN N-DIMENSIONS. T ). . 3.j=1 ∂xi ∂xj i=1 ˆ E. . uniformly on compact subsets of R×[0. . ∂t ∂xi 2 i. Suppose dX i = F i dt + Gi dW . We may pass to limits as n → ∞ in this expression. for i = 1. n). t. ∂xi . A similar proof gives this: ˆ GENERALIZED ITO FORMULA. 76 . n. . . If u : Rn × [0. . there exists a sequence of polynomials un such that o un → u. W m (·)) be an m-dimensional Brownian motion. T ]. we know that for all 0 ≤ r ≤ T . ∂u ∂un ∂x → ∂x . j = 1. . Gi ∈ L2 (0. then n n 1 n ∂u ∂u ∂2u ∂t . T ] → R is continuous.where f i and g i polynomials. . meaning that (a) F(t) ⊇ F(s) for all t ≥ s ≥ 0 (b) F(t) ⊇ W(t) = U(W(s) | 0 ≤ s ≤ t) (c) F(t) is independent of W + (t) := U(W(s) − W(t) | t ≤ s < ∞). . ∂un ∂u ∂t → ∂t ∂ 2 un ∂2u ∂x2 → ∂x2 . t). . . T ). ∂u ∂2u ∂u 1 d(u(X . . Invoking Step 2. (ii) We assume F(·) is a family of nonanticipating σ-algebras. ∂xi ∂xj . with continuous partial derivatives (i. . . . X . . That is. thereby proving Itˆ’s formula in o general. r) − un (X(0). with for F i ∈ L1 (0. t)) = dt + dX i + Gi Gj dt. Given u as in Itˆ’s formula. 0 ∂x the argument of the partial derivatives of un is (X(t). 0) = 0 ∂un ∂un 1 ∂ 2 un 2 G dt + F+ ∂t ∂x 2 ∂x2 r ∂un + G dW almost surely. Notation. r un (X(r).

m). . T ) and all 0 ≤ s ≤ r ≤ T . . . T ) n×m if (i = 1. . Gij ∈ L2 (0. DEFINITION. . n. j=1 Approximating by step processes as before. If X(·) = (X 1 (·). . T ). . and  E 0 T 2 G dW  = E 0 T |G|2 dt . then n×m T E 0 G dW  = 0. T ) DEFINITION. .DEFINITIONS. . n). j = 1. . If G ∈ L2 (0. then n×m T (i = 1. . . . This means that m dX = F dt + j=1 i i Gij dW j 77 for i = 1. . . n. G dW 0 is an Rn –valued random variable. . T ) (ii) An Rn -valued stochastic process F = (F 1 . T ). where |G|2 := 1≤i≤n 1≤j≤m |Gij |2 . . we can establish this LEMMA. T ). If G ∈ L2 (0. . . F 2 . . n). . X n (·)) is an Rn -valued stochastic process such that r r X(r) = X(s) + s F dt + s G dW for some F ∈ L1 (0. . (i) An Mn×m -valued stochastic process G = ((Gij )) belongs to L2 (0. . G ∈ L2 (0. . F n ) belongs to L1 (0. T ) if n F i ∈ L1 (0. whose i-th component is m 0 T Gij dW j (i = 1. . we say X(·) has the n n×m stochastic diﬀerential dX = Fdt + GdW. .

1. n).THEOREM (Itˆ’s formula in n-dimensions). t − s). This establishes the claim. 2. t). with continuous partial derivations ∂u . we know o   d(X 2 ) = 2XdX + dt. 2 ) random variables. Let u : Rn × [0. . . set X(t) := W (t)+2 (t) .s. j = 1. as o ∂u above. W are independent. and X(·) has independent increments. From the 1-dimensional Itˆ calculus. t). N (0. ∂t ∂2u ∂xi ∂xj . Next observe that since X is the t sum of two independent. Then d(u(X(t). An outline of the proof follows some preliminary results: ¯ LEMMA (Another simple stochastic diﬀerential). since W. where the argument of the partial derivatives of u is (X(t).j=1 ∂xi ∂xj Gil Gjl dt. T ] be continuous. Let W (·) and W (·) be independent 1-dimensional Brownian motions. note ﬁrstly that X(0) = 0 a. Suppose that dX = Fdt + GdW. Then ¯ ¯ ¯ d(W W ) = W dW + W dW. t))= ∂u ∂t dt + (5) n ∂u i i=1 ∂xi dX m l=1 +1 2 n ∂2u i. W √ Proof. We claim that X(·) is a 1-dimensional Brownian motion. ∂xi . A similar observation shows that X(t) − X(s) in N (0. . (i. We will also need the following modiﬁcation of the product rule: 78 . .  d(W 2 ) = 2W dW + dt. To begin. There is no correction term “dt” here.   ¯ ¯ ¯ d(W 2 ) = 2W dW + dt. X(t) is N (0. ¯ Thus 1 1 ¯ ¯ d(W W ) = d X 2 − W 2 − W 2 2 2 1 = 2XdX + dt − (2W dW + dt) 2 1 ¯ ¯ − (2W dW + dt) 2 ¯ ¯ ¯ ¯ = (W + W )(dW + dW ) − W dW − W dW ¯ ¯ = W dW + W dW. To see this. ¯ ¯ Compare this to the case W = W .

. m 1 the formula follows easily for polynomials u = u(x. . 1 2 The proof is a modiﬁcation of that for the one–dimensional Itˆ product rule. and then. . 2. 1. ALTERNATIVE NOTATION. . k = 1. proving this by an induction on k1 . according to the Lemma above. CLOSING REMARKS. . . we sometimes write H Then Itˆ’s formula reads o du(X. T ) and Gk ∈ L2 (0. m. t) = 1 ∂u + F · Du + H : D2 u dt + Du · GdW. ∂t 2 79 ij m := k=1 Gik Gjk . When dX = Fdt + GdW. using the Lemma above. xkm .LEMMA (Itˆ product rule with several Brownian motions). We ﬁrst establish the formula for a multinomials u = u(x) = xk1 . . Suppose o m dX1 = F1 dt + k=1 Gk dW k 1 and dX2 = F2 dt + m Gl dW l . km . o with the new feature that d(W i W j ) = W i dW j + W j dW i + δij dt. The Itˆ formula in n-dimensions can now be proved by a suitable modiﬁcation of the o one-dimensional proof. T ) for i = 1. . Then i m d(X1 X2 ) = X1 dX2 + X2 dX1 + k=1 Gk Gk dt. 2 l=1 where Fi ∈ L1 (0. . . . . as before. . . for all functions u as stated. after an approximation. t) in the variables x = (x1 . This done. xn ) and t. .

dt + ∂t ∂xi 2 i. . We may symbolically compute ∂u ∂u ∂2u 1 d(u(X. dW k dW l = δkl dt (k.j=1 ∂xi ∂xj i=1 and then simplify the term “dX i dX j ” by expanding it out and using the formal multiplication rules (dt)2 = 0. dtdW k = 0. ∂xi ∂2u H . HOW TO REMEMBER ITO’S FORMULA. . . ∂xn the Hessian matrix. m). n n The foregoing theory provides a rigorous meaning for all this. . and is the gradient of u in the x-variables. . . l = 1. H:D u= ∂xi ∂xj i. . . 80 .j=1 n m Du · GdW = i=1 k=1 ∂u ik G dW k . t)) = dX i + dX i dX j .∂u ∂u where Du = ∂x1 . ∂xi ˆ 2. D2 u = ∂2u ∂xi ∂xj is n F · Du = i=1 n 2 Fi ij ∂u .

t) ∈ L1 (0. the σ-algebra generated by X0 and the history of the Wiener process up to (and including) time t.  b1 m . n1 b  . We are ﬁnally ready to study stochastic diﬀerential equations: Notation. B. T ).. We say that an Rn -valued stochastic process X(·) is a solution of the Itˆ stochastic diﬀerential equation o (SDE) dX = b(X.. t)dt + B(X. provided (i) X(·) is progressively measurable with respect to F(·). b2 . D. . and b : Rn × [0. . T ] → Mn×m are given functions.. b = (b1 . . B : Rn × [0. . T ] → Rn . . . C. (ii) Assume T > 0 is given.CHAPTER 5: STOCHASTIC DIFFERENTIAL EQUATIONS. DEFINITIONS AND EXAMPLES. . n 81 . Deﬁnitions and examples Existence and uniqueness of solutions Properties of solutions Linear stochastic diﬀerential equations A. A. (i) Let W(·) be an m-dimensional Brownian motion and X0 an n-dimensional random variable which is independent of W(·). bn m DEFINITION.. W(s) (0 ≤ s ≤ t)) (t ≥ 0).) We display the components of these functions by writing b1 1 . . t)dW X(0) = X0 for 0 ≤ t ≤ T .. . (ii) F := b(X. bn ). . We will henceforth take F(t) := U(X0 .  . (Note carefully: these are not random variables. B =  .

Y.  dt +  . as claimed. n×m and t (iv) X(t) = X0 + 0 b(X(s).(iii) G := B(X. we can always assume X(·) has continuous sample paths almost surely. . . note that Y (t) := − satisﬁes 1 2 t t g 2 ds + 0 0 g dW 1 dY = − g 2 dt + gdW. t) ∈ L2 (0. . .   dW. . Remarks. . for all 0 ≤ t ≤ T . (i) A higher order SDE of the form Y (n) = f (t. Example 1. in §B. 82 . Y. . EXAMPLES OF LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS. where as usual ξ denotes “white noise”.  X n (t) 0 . . . dX =  . . s) dW a. can be rewritten into the form above by the device of setting   1   Y (t) X (t)  Y (t)   X 2 (t)  ˙    X(t) =   =  . .s. Then the unique solution of dX = gXdW (1) X(0) = 1 is t 2 t 1 X(t) = e− 2 0 g ds+ 0 g dW for 0 ≤ t ≤ T .   . . f (· · · ) g(· · · ) (ii) In view of (iii). . T ). Y (n−1) (t) Then   X2 . We will prove uniqueness later. Y (n−1) ) + g(t.  . Let m = n = 1 and suppose g is a continuous function (not a random variable). . To verify this. 2 Thus Itˆ’s lemma for u(x) = ex gives o 1 ∂2u 2 ∂u g dt dY + dX = ∂x 2 ∂x2 1 1 = eY − g 2 dt + gdW + g 2 dt 2 2 = gXdW. s) ds + t 0 B(X(s). Y (n−1) )ξ.

Let P (t) denote the price of a stock at time t. assuming the initial price p0 is positive. Consequently P (t) = p0 e t . similarly to Example 2. the relative change of price. called the drift and the volatility of the stock. Similarly. P evolves according to the SDE dP = µdt + σdW P for certain constants µ > 0 and σ. The expected value of the stock price consequently agrees with the deterministic solution of (3) corresponding to σ = 0. Hence (3) and so d(log(P )) = dP 1 σ 2 P 2 dt by Itˆ’s formula o − P 2 P2 σ2 = µ− dt + σdW. Example 3 (Stock prices). 2 σW (t)+ µ− σ 2 2 t 0 dX = f Xdt + gXdW X(0) = 1 f − 1 g 2 ds+ 2 t 0 g dW dP = µP dt + σP dW . 83 . we see that t E(P (t)) = p0 + 0 µE(P (s)) ds. We can model the evolution of P (t) in time by supposing that dP . Observe that the price is always positive. the unique solution of (2) is X(t) = e for 0 ≤ t ≤ T .Example 2. Since (3) implies t t P (t) = p0 + 0 µP ds + 0 σP dW and E t 0 σP dW = 0. Hence E(P (t)) = p0 eµt for t ≥ 0.

Example 4 (Brownian bridge). The solution of the SDE (4) is B(t) = (1 − t)
0 B dB = − 1−t dt + dW

(0 ≤ t < 1)

B(0) = 0
t

1 dW 1−s

(0 ≤ t < 1),

as we conﬁrm by a direct calculation. It turns out also that limt→1− B(t) = 0 almost surely. We call B(·) a Brownian bridge, between the origin at time 0 and at time 1.

A sample path of the Brownian bridge Example 5 (Langevin’s equation). A possible improvement of our mathematical model of the motion of a Brownian particle models frictional forces as follows for the onedimensional case: ˙ X = −bX + σξ, where ξ(·) is “white noise”, b > 0 is a coeﬃcient of friction, and σ is a diﬀusion coeﬃcient. In this interpretation X(·) is the velocity of the Brownian particle: see Example 6 for the position process Y (·). We interpret this to mean (5) dX = −bXdt + σdW X(0) = X0 , 84

for some initial distribution X0 , independent of the Brownian motion. This is the Langevin equation. The solution is X(t) = e−bt X0 + σ
0 t

e−b(t−s) dW

(t ≥ 0),

as is straightforward to verify. Observe that E(X(t)) = e−bt E(X0 ) and E(X (t)) = E e
2 −2bt 2 X0 t

+ 2σe

−bt

t

X0
0 2

e−b(t−s) dW

+ σ2
0

e−b(t−s) dW + 2σe
−bt t

=e

−2bt

2 E(X0 ) t

E(X0 )E
0

e−b(t−s) dW

+ σ2
0

e−2b(t−s) ds σ2 + (1 − e−2bt ). 2b

=e Thus the variance

−2bt

2 E(X0 )

V (X(t)) = E(X 2 (t)) − E(X(t))2 is given by V (X(t)) = e−2bt V (X0 ) + σ2 (1 − e−2bt ), 2b

assuming, of course, V (X0 ) < ∞. For any such initial condition X0 we therefore have E(X(t)) → 0 V (X(t)) →
σ2 2b

as t → ∞.

From the explicit form of the solution we see that the distribution of X(t) approaches N 0, σ as t → ∞. We interpret this to mean that irrespective of the initial distribution, 2b the solution of the SDE for large time “settles down” into a Gaussian distribution whose 2 variance σ represents a balance between the random disturbing force σξ(·) and the fric2b tional damping force −bX(·). 85
2

A simulation of Langevin’s equation Example 6 (Ornstein–Uhlenbeck process). A better model of Brownian movement is provided by the Ornstein–Uhlenbeck equation ¨ ˙ Y = −bY + σξ ˙ Y (0) = Y0 , Y (0) = Y1 , where Y (t) is the position of Brownian particle at time t, Y0 and Y1 are given Gaussian random variables. As before b > 0 is the friction coeﬃcient, σ is the diﬀusion coeﬃcient, and ξ(·) as usual is “white noise”. ˙ Then X := Y , the velocity process, satisﬁes the Langevin equation (6) dX = −bXdt + σdW X(0) = Y1 ,

studied in Example 5. We assume Y1 to be normal, whence explicit formula for the solution, X(t) = e−bt Y1 + σ
0 t

e−b(t−s) dW,

shows X(t) to be Gaussian for all times t ≥ 0. Now the position process is
t

Y (t) = Y0 + 86
0

X ds.

For the special case X1 = 0. with |b | ≤ L for some constant L. and try to solve the one–dimensional stochastic diﬀerential equation 1 (7) where x ∈ R. This is the SDE ¨ ˙ X = −λ2 X − bX + σξ ˙ X(0) = X0 . b 2b Nelson [N. X(0) = X1 . AN EXAMPLE IN ONE DIMENSION. p. We start with a simple case: 1. An explicit solution can be worked out using the general formulas presented below in §D.Therefore t E(Y (t)) = E(Y0 ) + 0 t E(X(s)) ds e−bs E(Y1 ) ds E(Y1 ). ˙ where −λ2 X represents a linear. b = 0. Example 7 (Random harmonic oscillator). In this section we address the problem of building solutions to stochastic diﬀerential equations. we have 1 X(t) = X0 cos(λt) + λ t sin(λ(t − s)) dW. Let us ﬁrst suppose b : R → R is C . EXISTENCE AND UNIQUENESS OF SOLUTIONS. restoring force and −bX is a frictional damping term. dX = b(X)dt + dW X(0) = x 87 . = E(Y0 ) + 0 = E(Y0 ) + and a somewhat lengthly calculation shows 1 − e−bt b σ2 σ2 V (Y (t)) = V (Y0 ) + 2 t + 3 (−3 + 4e−bt − e−2bt ). σ = 1. 57] discusses this model as compared with Einstein’s . 0 B.

for m ≥ n we have ∞ 0≤t≤T max |X (t) − X (t)| ≤ C m n k=n Lk T k →0 k! as n → ∞. . 60]). . Next is a procedure for solving SDE by means of a clever change of variables (McKean [McK. . . and notice that for a given continuous sample path of the Brownian motion. where C depends on ω. we have s D0 (t) = max 0≤s≤t b(x) dr + W (s) ≤ C 0 for all times 0 ≤ t ≤ T .Now the SDE means X(t) = x + 0 t b(X) ds + W (t). . 2. To see this note that s Dn (t) = max ≤L 0 0≤s≤t t b(X n (r)) − b(X n−1 (r)) dr 0 Dn−1 (s) ds t ≤L =C C 0 n n Ln−1 sn−1 ds (n − 1)! by the induction assumption L t . 0 ≤ t ≤ T . and then t X n+1 (t) := x + 0 b(X n ) ds + W (t) (t ≥ 0) for n = 0. . n! In view of the claim. Next write Dn (t) := max | X n+1 (s) − X n (s)| 0≤s≤t (n = 0. Thus for almost every ω. as is easy to check. ). . p. So deﬁne X 0 (t) ≡ x. solves (7). . X n (·) converges uniformly for 0 ≤ t ≤ T to a limit process X(·) which. . SOLVING SDE BY CHANGING VARIABLES. . 1. 1. for all times t ≥ 0. We now claim that Ln n Dn (t) ≤ C t n! for n = 0. . 88 . and this formulation suggests that we try a successive approximation method to construct a solution.

2 Thus X(·) solves (8) provided u (Y ) = σ(X) = σ(u(Y )). Note that we can in principle at least solve (9). (z ∈ R). Assuming for the moment u and f are known. u (Y )f (Y ) + 1 u (Y ) = b(X) = b(u(Y )). So let us ﬁrst solve the ODE u (z) = σ(u(z)) u(y) = x where = d dz . and try to ﬁnd a function u such that X := u(Y ) solves our SDE (8). 3.Given a general one–dimensional SDE of the form (8) let us ﬁrst solve (9) dY = f (Y )dt + dW Y (0) = y. where f will be selected later. σ(u(z)) 2 We will not discuss here conditions under which all of this is possible: see Lamperti [L2]. according to the previous example. once u is known. dX = b(X)dt + σ(X)dW X(0) = x. and then. Notice that both of the methods described above avoid all use of martingale estimates. A GENERAL EXISTENCE AND UNIQUENESS THEOREM We start with a useful calculus lemma: 89 . solve for f (z) = 1 1 b(u(z)) − u (z) . we compute using Itˆ’s o formula that 1 dX = u (Y )dY + u (Y )dt 2 1 = u f + u dt + u dW. 2 and u(y) = x.

t)| ≤ L|x − x| x ˆ |B(x. t)| ≤ L(1 + |x|) for all 0 ≤ t ≤ T. and let C0 ≥ 0 denote a constant. Then there exists a unique solution X ∈ L2 (0. . Set Φ(t) := C0 + e− Therefore Φ(t)e− and thus φ(t) ≤ Φ(t) ≤ C0 e t 0 t 0 t 0 t 0 f ds for all 0 ≤ t ≤ T. Then Φ = f φ ≤ f Φ. t)dt + B(X. and so = (Φ − f Φ)e− t 0 f ds Φ f ds ≤ (f φ − f φ)e− 0 0 t 0 f ds = 0. t)dW X(0) = X0 . continuous functions deﬁned for 0 ≤ t ≤ T . t)| ≤ L(1 + |x|) |B(x. T ] → Rn and B : Rn × [0. T ) of the stochastic diﬀerential equation: n dX = b(X. f ds EXISTENCE AND UNIQUENESS THEOREM. Let φ and f be nonnegative.GRONWALL’S LEMMA. x. f ds ≤ Φ(0)e− f ds = C0 . then φ(t) ≤ C0 e Proof. t 0 f φ ds. Let X0 be any Rn -valued random variable such that (c) and (d) X0 is independent of W + (0). x ∈ Rn ˆ (b) for all 0 ≤ t ≤ T. E(|X0 |2 ) < ∞ where W(·) is a given m-dimensional Brownian motion. for some constant L. t) − B(ˆ. t)| ≤ L|x − x| x ˆ |b(x. If t φ(t) ≤ C0 + 0 f φ ds for all 0 ≤ t ≤ T. x ∈ Rn . t) − b(ˆ. 90 (0 ≤ t ≤ T ) (SDE) . T ] → Mm×n are continuous and satisfy the following conditions: (a) |b(x. Suppose that b : Rn ×[0.

Notice also that hypothesis (b) actually follows from (a). X ∈ L2 (0. as above. s) − B(X. 1. T ). s) − b(X. with continuous sample paths n almost surely. ˆ Proof. s) ds ˆ B(X. then ˆ P (X(t) = X(t) for all 0 ≤ t ≤ T ) = 1. Therefore for some appropriate constant C we have ˆ E(|X(t) − X(t)|2 ) ≤ C 91 0 t ˆ E(|X − X|2 ) ds. We use this to estimate t 2 E 0 ˆ b(X. s) 2 ds ≤ L2 0 ˆ E(|X − X|2 ) ds. s) 2 ds ≤ L2 T 0 ˆ E(|X − X|2 ) ds. s) − b(X. (ii) Hypotheses (a) says that b and B are uniformly Lipschitz continuous in the variable x. . Since (a + b)2 ≤ 2a2 + 2b2 . s) dW. s) ds + 0 t ˆ B(X. s) dW . ˆ X(t) − X(t) = 0 t ˆ b(X. (i) “Unique” means that if X. we can estimate ˆ E(|X(t) − X(t)|2 ) ≤ 2E 0 t 2 t 2 ˆ b(X. s) dW t =E 0 t ˆ B(X. s) − B(X. s) − B(X. t] → Rn . s) − b(X. + 2E 0 The Cauchy–Schwarz inequality implies that t 2 t f ds 0 ≤t 0 |f |2 ds for any t > 0 and f : [0. s) − B(X.ˆ Remarks. s) ds t ≤ TE 0 t ˆ b(X. Suppose X and X are solutions. s) − b(X. Uniqueness. and both solve (SDE). Furthermore t 2 E 0 ˆ B(X. Then for all 0 ≤ t ≤ T .

Thus X(t) = X(t) a. . We claim that dn (t) ≤ (M t)n+1 (n + 1)! for all n = 0. and so X(r) = X(r) for all rational 0 ≤ r ≤ T . Deﬁne X0 (t) := X0 Xn+1 (t) := X0 + t 0 b(Xn (s). As X and X have continuous sample paths almost surely. . Indeed for n = 0. s) ds + 0 t 2 B(X0 . . . except for some set of ˆ probability zero. Existence. 92 . 0 ≤ t ≤ T for some constant M . ˆ max |X(t) − X(t)| > 0 P 0≤t≤T = 0. s) ds + t 0 B(Xn (s). ˆ Therefore Gronwall’s Lemma. and 0 ≤ t ≤ T . for n = 0. with C0 = 0. 1. Deﬁne also dn (t) := E(|Xn+1 (t) − Xn (t)|2 ). for ˆ all 0 ≤ t ≤ T . s) dW t ≤ 2E 0 L(1 + |X0 |) ds + 2E 0 L2 (1 + |X0 |2 ) ds ≤ tM for some large enough constant M . we have d0 (t) = E(|X1 (t) − X0 (t)|2 ) t t 2 =E 0 b(X0 . then the foregoing reads t φ(t) ≤ C 0 φ(s) ds for all 0 ≤ t ≤ T.s. T and X0 .ˆ provided 0 ≤ t ≤ T . We will utilize the iterative scheme introduced earlier. . . implies φ ≡ 0. s) dW. If we now set φ(t) := E(|X(t) − X(t)|2 ). This conﬁrms the claim for n = 0. 2. . depending on L.

3. s) − B(X n t n−1 . Now note T 0≤t≤T max |Xn+1 (t) − Xn (t)|2 ≤ 2T L2 0 |Xn − Xn−1 |2 ds t 2 + 2 max 0≤t≤T B(X . (M T )n ≤C n! 4. since P max |Xn+1 (t) − Xn (t)| > 1 2n ≤ 22n E ≤2 and ∞ 0≤t≤T 0≤t≤T max |Xn+1 (t) − Xn (t)|2 n 2n C(M T ) n! 22n n=1 (M T )n < ∞. The Borel–Cantelli Lemma thus applies. s) − b(Xn−1 . s) dW . Then dn (t) = E(|Xn+1 (t) − Xn (t)|2 ) t =E 0 t b(Xn . s) − B(X n 0 n−1 .Next assume the claim is valid for some n − 1. This proves the claim. n! 93 . s) ds 2 + 0 B(X . s) dW ≤ 2T L2 E 0 |Xn − Xn−1 |2 ds t + 2L E 0 t 2 |Xn − Xn−1 |2 ds M n sn ds n! by the induction hypothesis ≤ 2L2 (1 + T ) 0 ≤ M n+1 tn+1 . (n + 1)! provided we choose M ≥ 2L2 (1 + T ). Consequently the martingale inequality from Chapter 2 implies T E 0≤t≤T max |Xn+1 (t) − Xn (t)|2 ≤ 2T L2 0 E(|Xn − Xn−1 |2 ) ds T + 8L2 0 E(|Xn − Xn−1 |2 ) ds by the claim above.

By induction. T ). We pass to limits in the deﬁnition of Xn+1 (·). s) dW for 0 ≤ t ≤ T. E(|Xn+1 (t)|2 ) ≤ C + C 2 + · · · + C n+2 Consequently E(|Xn+1 (t)|2 ) ≤ C(1 + E(|X0 |2 ))eCt . s) ds 2 n + CE 0 B(X . T ). t)dt + B(X. 2n = 0. s) dW t n ≤ C(1 + E(|X0 |2 )) + C 0 E(|Xn |2 ) ds. That is.Thus P 0≤t≤T max |Xn+1 (t) − Xn (t)| > 1 i. (SDE) dX = b(X.o. t)dW X(0) = X0 . In light of this. n 94 for all 0 ≤ t ≤ T . therefore. as usual. T ] to a process X(·). Let n → ∞: E(|X(t)|2 ) ≤ C(1 + E(|X0 |2 ))eCt and so X ∈ L2 (0. “C” denotes various constants. s) ds + 0 B(X. for almost every ω n−1 X =X + j=0 n 0 (Xj+1 − Xj ) converges uniformly on [0. We have n t 2 E(|X n+1 (t)| ) ≤ CE(|X0 | ) + CE 2 2 0 t b(X . where. 5. (n + 1)! . tn+1 (1 + E(|X0 |2 )). We must still show X(·) ∈ L2 (0. to prove t t X(t) = X0 + 0 b(X. for times 0 ≤ t ≤ T .

THEOREM (Estimate on higher moments of solutions). if for some 1 ≤ i ≤ n |bil (x.C. It follows that for almost all sample paths. t). Suppose that b. We can however use estimates (i) and (ii) above to check the hypotheses of Kolmogorov’s Theorem from §C in Chapter 3. L. for certain constants C1 and C2 . depending only on T. a few properties of the solution to various SDE. 95 . B and X0 satisfy the hypotheses of the Existence and Uniqueness Theorem. PROPERTIES OF SOLUTIONS. On the other hand.e. without proofs. t)|2 > 0 1≤l≤m for all x ∈ Rn . m. then almost every sample path t → X i (t) is nowhere diﬀerentiable for a. o 2 provided E(|X0 |2p ) < ∞ for each 1 ≤ p < ∞. ω. In this section we mention. n. In this case the mapping t → X(t) will be smooth if b is. The estimates above on the moments of X(·) are fairly crude. but are nevertheless sometimes useful: APPLICATION: SAMPLE PATH PROPERTIES. E(|X0 |2p ) < ∞ then the solution X(·) of (SDE) satisﬁes the estimates (i) and (ii) E(|X(t) − X0 |2p ) ≤ C2 (1 + E(|X0 |2p ))tp eC2 t E(|X(t)|2p ) ≤ C2 (1 + E(|X0 |2p ))eC1 t dX = b(X. and consequently it could happen that the solution of our SDE is really a solution of the ODE ˙ X = b(X. the mapping t → X(t) is H¨lder continuous with each exponent less than 1 . t)dW X(0) = X0 for some integer p > 1. t)dt + B(X. 0 ≤ t ≤ T. If. with possibly random initial data. The possibility that B ≡ 0 is not excluded. in addition.

. t)dW X(0) = X0 . 2. t) − B(x. with the same 0 constant L. that bk . In particular. x(0) = x0 .THEOREM (Dependence on parameters). t)| + |Bk (x. t)dt + B(X. . 0 k→∞ k→∞ 0≤t≤T |x|≤M Finally suppose that Xk (·) solves dXk = bk (Xk . Assume further that (a) and for each M > 0. T ] as ε → 0 to the deterministic trajectory of ˙ x = b(x). 96 . t)dt + Bk (Xk . lim E(|Xk − X0 |2 ) = 0. where X is the unique solution of dX = b(X. 0 Then k→∞ lim E 0≤t≤T max |Xk (t) − X(t)|2 = 0. t)dW Xk (0) = Xk . t)|) = 0. Suppose for k = 1. . t) − b(x. Example (Small noise limits). (b) lim max (|bk (x. for almost every ω the random trajectories of the SDE dXε = b(Xε )dt + εdW Xε (0) = x0 converge uniformly on [0. Bk and Xk satisfy the hypotheses of the Existence and Uniqueness Theorem.

Mn×m ). Remark. A linear SDE is called homogeneous if c ≡ E ≡ 0 for 0 ≤ t ≤ T . DEFINITION. k! 97 . T ] → Mn×m .D. F : [0. The stochastic diﬀerential equation dX = b(X. and B(x. and X0 is independent of W + (0). then b and B satisfy the hypotheses of the Existence and Uniqueness Theorem. t)dt + B(X. t) := c(t) + D(t)x. T ] → Mn×n . If 0≤t≤T sup [|c(t)| + |D(t)| + |E(t)| + |F(t)|] < ∞. D : [0. the space of bounded linear mappings from Rn to Mn×m . FORMULAS FOR SOLUTIONS: linear equations in narrow sense Suppose ﬁrst D ≡ D is constant. t)dW is linear provided the coeﬃcients b and B have this form: b(x. Then the solution of (10) is t dX = (c(t) + DX)dt + E(t)dW X(0) = X0 (11) where X(t) = eDt X0 + 0 eD(t−s) (c(s)ds + E(s) dW). T ] → Rn . DEFINITION. t) := E(t) + F(t)x for E : [0. LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS. It is called linear in the narrow sense if F ≡ 0. This section presents some fairly explicit formulas for solutions of linear SDE. provided E(|X0 |2 ) < ∞. T ] → L(Rn . Thus the linear SDE dX = (c(t) + D(t)X)dt + (E(t) + F(t)X)dW X(0) = X0 has a unique solution. ∞ e Dt := k=0 D k tk . for c : [0.

Then the solution of (14) is t m dX = (c(t) + d(t)X)dt + X(0) = X0 m l l=1 (e (t) + f l (t)X)dW l X(t) = Φ(t) X0 + (15) + 0 l=1 0 t m Φ(s)−1 c(s) − l=1 el (s)f l (s) ds Φ(s)−1 el (s) dW l . FORMULAS FOR SOLUTIONS: general scalar linear equations Suppose now n = 1. 3. This will not be so if F(·) ≡ 0. where Φ(·) is the fundamental matrix of the nonautonomous system of ODE dΦ = D(t)Φ. where t m Φ(t) := exp 0 d− l=1 (f l )2 ds + 2 t m f l dW l 0 l=1 . Chapter 8] for more formulas for solutions of general linear equations. dt These assertions follow formally from standard formulas in ODE theory if we write EdW = Eξdt. ξ as usual denoting white noise. o Observe also that formula (13) shows X(t) to be Gaussian if X0 is. Φ(0) = I. and regard Eξ as an inhomogeneous term driving the ODE ˙ X = c(t) + D(t)X + E(t)ξ. the solution of (12) is t dX = (c(t) + D(t)X)dt + E(t)dW X(0) = X0 (13) X(t) = Φ(t) X0 + 0 Φ(s)−1 (c(s)ds + E(s) dW) .More generally. See Arnold [A. owing to the extra term in Itˆ’s formula. SOME METHODS FOR SOLVING LINEAR SDE 98 . but m ≥ 1 is arbitrary.

according to (17). 99 . Since the solution of (17) is t 0 X1 (t) = X0 e we conclude that f (s) dW − 1 2 t 0 f 2 (s) ds . dX1 = f (t)X1 dW X1 (0) = X0 where the functions A and B are to be selected. X(t) = X1 (t)X2 (t) = X0 e a formula noted earlier. where (17) and (18) dX2 = A(t)dt + B(t)dW X2 (0) = 1. B ≡ 0 and A(t) = d(t)X2 (t) will work. let us derive some of the formulas stated above. For this. We will try to ﬁnd a solution having the product form X(t) = X1 (t)X2 (t). Thus (18) reads dX2 = d(t)X2 dt X2 (0) = 1. Now we try to choose A. t 0 f (s) dW + t 0 d(s)− 1 f 2 (s) ds 2 . Consider ﬁrst the linear stochastic diﬀerential equation (16) dX = d(t)Xdt + f (t)XdW X(0) = X0 for m = n = 1. This is non-random: X2 (t) = e t 0 d(s) ds . B so that dX2 + f (t)B(t)dt = d(t)X2 dt.For practice with Itˆ’s formula. Then dX = d(X1 X2 ) = X1 dX2 + X2 dX1 + f (t)X1 B(t)dt = f (t)XdW + (X1 dX2 + f (t)X1 B(t)dt). o Example 1.

Observe that since X1 (t) = e quently t 0 f dW + t 0 d− 1 f 2 ds 2 . and this identity will hold if we take A(t) := [c(t) − f (t)e(t)](X1 (t))−1 B(t) := e(t)(X1 (t))−1 . 100 . We now require X1 (A(t)dt + B(t)dW ) + f (t)X1 B(t)dt = c(t)dt + e(t)dW . Consider next the general equation (19) dX = (c(t) + d(t)X)dt + (e(t) + f (t)X)dW X(0) = X0 . where now (20) and (21) dX2 = A(t)dt + B(t)dW X2 (0) = X0 . As above.Example 2. B to be chosen. Conse- t X2 (t) = X0 + 0 t [c(s) − f (s)e(s)](X1 (s))−1 ds + 0 e(s)(X1 (s))−1 dW. Then dX = X2 dX1 + X1 dX2 + f (t)X1 B(t)dt = d(t)Xdt + f (t)XdW + X1 (A(t)dt + B(t)dW ) + f (t)X1 B(t)dt. we try for a solution of the form X(t) = X1 (t)X2 (t). we have X1 (t) > 0 almost surely. again for m = n = 1. dX1 = d(t)X1 dt + f (t)X1 dW X1 (0) = 1 the functions A.

a special case of (15): X(t) = X1 (t)X2 (t) t = exp 0 1 d(s) − f 2 (s) ds + 2 t r t f (s) dW 0 s × X0 + 0 t exp − 0 s 1 d(r) − f 2 (r) dr − 2 s 0 f (r) dW 0 (c(s) − e(s)f (s)) ds + 0 exp − 0 1 d(r) − f 2 (r) dr − 2 f (r) dW e(s) dW.Employing this and the expression above for X1 . Remark. The paper of Higham [H] is a good introduction. we arrive at the formula. 101 . There is great theoretical and practical interest in numerical methods for simulation of solutions to random diﬀerential equations.

Then (i) {τ < t} ∈ F(t). and furthermore {τ1 ∨ τ2 ≤ t} = {τ1 ≤ t} ∩ {τ2 ≤ t} ∈ F(t). Note that τ is allowed to take on the value +∞. ∈F (t−1/k)⊆F (t) Also. THEOREM (Properties of stopping times). τ2 ) are stopping times. (ii) τ1 ∧ τ2 := min(τ1 . as it allows us to investigate phenomena occuring over “random time intervals”. E. D. Observe that {τ < t} = k=1 ∞ {τ ≤ t − 1/k} .CHAPTER 6: APPLICATIONS. and also that any constant τ ≡ t0 is a stopping time. as in Chapters 4 and 5. τ2 ). The notion of stopping times comes up naturally in the study of stochastic diﬀerential equations. A. STOPPING TIMES. C. An example will make this clearer: 102 . Proof. B. A. ∞] is called a stopping time with respect to F(·) provided {τ ≤ t} ∈ F(t) for all t ≥ 0. for all times t ≥ 0. We introduce now some random times that are well–behaved with respect to F(·): DEFINITION. Let τ1 and τ2 be stopping times with respect to F(·). DEFINITIONS. τ1 ∨ τ2 := max(τ1 . U. P ) be a probability space and F(·) a ﬁltration of σ–algebras. and so {τ = t} ∈ F(t). we have {τ1 ∧ τ2 ≤ t} = {τ1 ≤ t} ∪ {τ2 ≤ t} ∈ F(t). BASIC PROPERTIES. Feynman-Kac formula Optimal stopping Options pricing The Stratonovich integral This chapter is devoted to some applications and extensions of the theory developed earlier. Let (Ω. Stopping times Applications to PDE. This says that the set of all ω ∈ Ω such that τ (ω) ≤ t is an F(t)-measurable set. A random variable τ : Ω → [0.

Example (Hitting a set). X)dW X(0) = X0 . Discussion.) X(τ) E Proof. C) := dist(x. Let E be either a nonempty closed subset or a nonempty open subset of Rn . First we assume that E = U is an open set. (We put τ = +∞ for those sample paths of X(·) that never hit E. where b. Set d(x. THEOREM. 103 . Next we assume that E = C is a closed set. Take {ti }∞ to be a countable dense i=1 subset of [0. C) < }. The random variable σ := sup{t ≥ 0 | X(t) ∈ E}. Consider the solution X(·) of the SDE dX(t) = b(t. B and X0 satisfy the hypotheses of the Existence and Uniqueness Theorem. X)dt + B(t. Fix t ≥ 0. n The event ∞ {τ ≤ t} = n=1 ti ≤t {X(ti ) ∈ Un )} ∈F (ti )⊆F (t) also belongs to F(t). C) and deﬁne the open sets 1 Un = {x : d(x. Then the event {τ ≤ t} = ti ≤t {X(ti ) ∈ U } ∈F (ti )⊆F (t) belongs to F(t). ∞). we must show {τ ≤ t} ∈ F(t). Then τ := inf{t ≥ 0 | X(t) ∈ E} is a stopping time.

But there are many examples where we do not really stop the process at time τ . Thus “stopping time” is not a particularly good name and “Markov time” would be better.) The name “stopping time” comes from the example. LEMMA (Itˆ integrals with stopping times). Proof. If G ∈ L2 (0. but does not contain information about the future”. STOCHASTIC INTEGRALS AND STOPPING TIMES.T ) and E(( 0 τ T G dW )2 ) = E(( 0 T χ{t≤τ } G dW )2 ) (χ{t≤τ } G)2 dt) = E( 0 τ = E( 0 G2 dt). We have τ  E 0  T 0 G dW  =E  χ{t≤τ } G dW  = 0. T )and 0 ≤ τ ≤ T is a o stopping time. The heuristic reason is that the event {σ ≤ t} would depend upon the entire future history of process and thus would not in general be F(t)-measurable.the last time that X(t) hits E. 104 . ∈L2 (0. we deﬁne τ T G dW := 0 0 χ{t≤τ } G dW. T ) and τ is a stopping time with 0 ≤ τ ≤ T . DEFINITION. (In applications F(t) “contains the history of X(·) up to and including time t. If G ∈ L2 (0. is in general not a stopping time. then τ (i) E 0 G dW =0 τ τ (ii) E ( 0 G dW ) 2 =E 0 G2 dt . where we sometimes think of halting the sample path X(·) at the ﬁrst time τ that it hits a set E. Our next task is to consider stochastic integrals with random limits of integration and to work out an Itˆ o formula for these.

τ ) − u(X(0). The most important case is X(·) = W(·). k=1 uxi bik dW k . 105 . t) = dt + dXi + ∂t ∂xi 2 i. t)dW.j=1 i=1 and Du · B dW = k=1 i=1 m n n n m bik bjk . For a ﬁxed ω ∈ Ω. where τ is a stopping time. t) − u(X(0). 0) = 0 ∂u + Lu ∂t τ ds + 0 Du · B dW. 0 ≤ τ ≤ T : τ u(X(τ ). τ )) − E(u(X(0). ˆ ITO’S FORMULA WITH STOPPING TIMES. Recall next from Chapter 4 that if dX = b(X. Thus we may set t = τ . let W(·) denote mdimensional Brownian motion. 0) = 0 ∂u + Lu ∂t t ds + 0 Du · B dW. As usual. The argument of u in these integrals is (X(s).j=1 ∂xi ∂xj i=1 n n m bik bjk dt. 2 The expression ∆u is called the Laplacian of u and occurs throughout mathematics and physics. this means: t (2) u(X(t). s). then for each C 2 function u. formula (2) holds for all 0 ≤ t ≤ T .Similar formulas hold for vector–valued processes. (1) ∂u ∂u ∂2u 1 du(X. 0)) = E 0 ∂u + Lu ∂t ds . t)dt + B(X. n-dimensional Brownian motion. We will demonstrate in the next section some important links with Brownian motion. for the diﬀerential operator 1 Lu := aij uxi xj + bi uxi . the generator of which is 1 Lu = 2 n uxi xi =: i=1 1 ∆u. We will see in the next section that this formula provides a very important link between stochastic diﬀerential equations and (nonrandom) partial diﬀerential equations. k=1 Written in integral form. Take expected value: τ (3) E(u(X(τ ). aij = 2 i. BROWNIAN MOTION AND THE LAPLACIAN. We call L the generator.

we get u(x) − E(u(X(τx ))) = E 0 1 ds = E(τx ). We call u a harmonic function. n→∞ lim E(τx ∧ n) < ∞. with smooth boundary ∂U . Hence E(τx ) < ∞. 2 E(u(X(τx ∧ n))) − E(u(X(0))) = E 0 τx ∧n u(x) = E(τx ) for all x ∈ U.s. Example 2 (Probabilistic representation of harmonic functions). 2 Since 1 2 ∆u = −1 and u is bounded. Proof. Thus if we let n → ∞ above.. FEYNMAN–KAC FORMULA. bounded domain and g : ∂U → R a given continuous function. Then X(·) := W(·) + x represents a “Brownian motion starting at x”. for all x ∈ U. . PROBABILISTIC FORMULAS FOR SOLUTIONS OF PDE. Our goal is to ﬁnd a probabilistic representation formula for u. 106 . Formula (5) follows. Again recall that u is bounded on U . According to standard PDE theory.B. τx Thus τx is integrable. We have for each n = 1. with Lu = 1 ∆u. . ﬁx any point x ∈ U and consider then an n-dimensional Brownian motion W(·). For this. 1 ∆u(X) ds . Example 1 (Expected hitting time to a boundary). This says that Brownian sample paths starting at any point x ∈ U will with probability 1 eventually hit ∂U . u > 0 in U . there exists a smooth solution u of the equation − 1 ∆u = 1 in U 2 (4) u = 0 on ∂U . 2. But u = 0 on ∂U . We have (5) In particular. . THEOREM. Let U ⊂ Rn be a bounded open set. It is known ¯ from classical PDE theory that there exists a function u ∈ C 2 (U ) ∩ C(U ) satisfying the boundary value problem: ∆u = 0 in U (6) u = g on ∂U . Let U ⊂ Rn be a smooth. APPLICATIONS TO PDE. and so τx < ∞ a. and so u(X(τx )) ≡ 0. We employ formula (3). Deﬁne τx := ﬁrst time X(·) hits ∂U.

THEOREM. APPLICATION: In particular. As shown above. where τx now denotes the hitting time of Brownian motion starting at x to ∂B(x. Assume next that we can write ∂U as the union of two disjoint parts Γ1 . the second equality valid since ∆u = 0 in U . we have the identity (8) u(x) = 1 area of ∂B(x. Since u = g on ∂U . r). THEOREM. Proof. r). r) u dS. Γ2 . Example 3 (Hitting one part of a boundary ﬁrst). For each point x ∈ U . r). u(x) is the probability that a Brownian motion starting at x hits Γ1 before hitting Γ2 . with respect to surface measure. That is. We have for each point x ∈ U (7) u(x) = E(g(X(τx ))). we may reasonably guess that the term on the right hand side is just the average of u over the sphere ∂B(x. formula (7) follows. ∂B(x.r) This is the mean value formula for harmonic functions. Let u solve the PDE   ∆u = 0 in U  u = 1 on Γ1   u = 0 on Γ2 . Γ1 Γ2 x 107 . Since Brownian motion is isotropic in space. if ∆u = 0 in some open set containing the ball B(x. for X(·) := W(·) + x. then u(x) = E(u(X(τx ))). Brownian motion starting at x. τx E(u(X(τx ))) = E(u(X(0))) + E 0 1 ∆u(X) ds 2 = E(u(X(0))) = u(x).

and so Itˆ’s formula yields o dY = −c(X)Y dt. Since c ≥ 0. with c ≥ 0 in U . t 0 c(X)ds c(X) ds . Apply (7) for g= Then u(x) = E(g(X(τx ))) = probability of hitting Γ1 before Γ2 . the integrals above all converge.Proof. THEOREM (Feynman–Kac formula). as before. τx u(x) = E 0 f (X(t))e− t 0 c(X) ds dt where. f are smooth functions. Now we extend Example #2 above to obtain a probabilistic representation for the unique solution of the PDE − 1 ∆u + cu = f 2 u=0 in U on ∂U. We assume c. We know E(τx ) < ∞. and τx denotes the ﬁrst hitting time of ∂U . Proof. FEYNMAN–KAC FORMULA. For each x ∈ U . Hence the Itˆ product rule implies o d u(X)e− t 0 c(X) ds = (du(X))e− t 0 c(X) ds t 0 + u(X)d e− = c(X)ds ∂u(X) 1 dW i ∆u(X)dt + 2 ∂xi i=1 + u(X)(−c(X)dt)e− 108 t 0 n e− . for Z(t) := − t 0 c(X) ds. Then dZ = −c(X)dt. 1 0 on Γ1 on Γ2 . First look at the process Y (t) := eZ(t) . X(·) := W(·) + x is a Brownian motion starting at x.

. 109 n n . Since u solves (8). An interpretation. as follows. however. obtaining E u(X(τx ))e− τx τx 0 c(X) ds − E(u(X(0))) t 0 =E 0 1 ∆u(X) − c(X)u(X) e− 2 τx c(X) ds dt . . We can explain this formula as describing a Brownian motion with “killing”. for example by being absorbed into the medium within which it is moving. u(x) = average of f (X(·)) over all sample paths which survive to hit ∂U =E 0 f (X)e− t 0 c(X) ds dt . 2 i. we can obtain similar formulas. this converges to e− Hence it should be that τx t 0 c(X) ds . (1 − c(X(tn ))h). Then the probability of the particle surviving until time t is approximately equal to (1 − c(X(t1 ))h)(1 − c(X(t2 ))h) . If we consider in these examples the solution of the SDE dX = b(X)dt + B(X)dW X(0) = x. Suppose that the Brownian particles may disappear at a random killing time σ. where 0 = t0 < t1 < · · · < tn = t. Assume further that the probability of its being killed in a short time interval [t. As h → 0. where now τx = hitting time of ∂U for X(·) and 1 ∆u is replaced by the generator 2 1 Lu := aij uxi xj + bi uxi .j=1 i=1 Note. t + h] is c(X(t))h + o(h). This need not always be the case for degenerate elliptic operators L. we need to know that the various PDE have smooth solutions. h = tk+1 − tk . this simpliﬁes to give u(x) = E(u(X(0))) = E 0 f (X)e− t 0 c(X) ds dt . Remark. as claimed. and take the expected value.We use formula (3) for τ = τx .

OPTIMAL STOPPING. A much better idea is to turn attention to the value function (10) u(x) := inf Jx (θ). Let τ = τx denote the hitting time of ∂U . Let U ⊂ Rm be a bounded. the cost is g(X(τ )). if θ ≥ τ . A typical such problem follows. that is. OPTIMAL STOPPING. The goal is to discover an optimal choice of controls. or both. B : Rn → M n×m satisfy the usual assumptions. If instead we do not stop the process before it hits ∂U . to minimize the cost criterion. and for each such θ deﬁne the expected cost of stopping X(·) at time θ ∧ τ to be θ∧τ (9) Jx (θ) := E( 0 f (X(s)) ds + g(X(θ ∧ τ ))). The easiest stochastic control problem of the general type outlined above occurs when we cannot directly aﬀect the SDE controlling the evolution of X(·) and can only decide at each instance whether or not to stop. Then for each x ∈ U the stochastic diﬀerential equation dX = b(X)dt + B(X)dW X0 = x has a unique solution. The general mathematical setting for many control theory problems is this. The main question is this: does there exist an optimal ∗ stopping time θ∗ = θx . STOPPING A STOCHASTIC DIFFERENTIAL EQUATION. then the cost is a given function g of the current state of X(θ). Finally we are given a cost criterion. θ 110 . The idea is that if we stop at the possibly random time θ < τ . for which Jx (θ∗ ) = θ stopping time min Jx (θ)? And if so. Given also are certain controls which aﬀect somehow the behavior of the system: these controls typically either modify some parameters in the dynamics or else stop the process. how can we ﬁnd θ∗ ? It turns out to be very diﬃcult to try to design θ∗ directly. In addition there is a running cost per unit time f of keeping the system in operation until time θ ∧ τ . We are given some “system” whose state evolves in time according to a diﬀerential equation (deterministic or stochastic). Suppose b : Rn → Rn .C. Let θ be any stopping time with respect to F(·). depending upon our choice of control and the corresponding state of the system. smooth domain.

So if we choose not to stop the system for time δ. 0 ≤ E( 0 f (X) + Lu(X) ds). our cost is at least δ E( 0 f (X) ds + u(X(δ))). then according to (SDE) the new state of the system at time δ will be X(δ). and so (12) u(x) = g(x) for each point x ∈ ∂U. Next take any point x ∈ U and ﬁx some small number δ > 0. We wish to determine the properties of this function. It turns out that once we know u. OPTIMALITY CONDITIONS. That is. τ ≡ 0 if x ∈ ∂U . 2 i. Furthermore. Hence (11) u(x) ≤ g(x) for each point x ∈ U.j=1 ∂xi ∂xj ∂xi i=1 δ n n m aij = k=1 bik bjk . This approach is called dynamic programming. we will be then be able to construct an optimal θ∗ . notice that we could just take θ ≡ 0 in the deﬁnition (10). Note that u(x) is the minimum expected cost.and to try to ﬁgure out what u is as a function of x ∈ U . we therefore have δ u(x) ≤ E( 0 f (X) ds + u(X(δ))). for Lu = Hence ∂2u ∂u 1 aij + bi . First of all. 111 . Then. Now if we do not stop the system for time δ. Now by Itˆ’s formula o δ E(u(X(δ))) = u(x) + E( 0 Lu(X) ds). the best we can achieve in minimizing the cost thereafter must be u(X(δ)). Since u(x) is the inﬁmum of costs over all stopping times. we could just stop immediately and incur the cost g(X(0)) = g(x). given we start the process at x. and assuming we do not hit ∂U . given that we are at the point X(δ). So assume u is deﬁned above and suppose u is smooth enough to justify the following calculations.

In summary. SOLVING FOR THE VALUE FUNCTION. M u ≤ f almost everywhere in U . if u(x) < g(x) at some point x ∈ U. we combine (11)–(14) to ﬁnd that if the formal reasoning above is valid. / The idea of the proof is to approximate (15) by a penalized problem of this form: M uε + βε (uε − g) = f uε = g 112 in U on ∂U . There exists a unique funtion u. In this circumstance we then would have an equality in the formula above. THEOREM. Our rigorous study of the stopping time problem now begins by showing ﬁrst that there exists a unique solution u of (15) and second that this u is in fact minθ Jx (θ). with bounded second derivatives. then it is optimal not to stop the process at once. . u − g) = 0 almost everywhere in U . and so (14) Mu = f at those points where u < g. g are given smooth functions. for at least some very small time δ. Thus it is plausible to think that we should leave the system going. Then we will use u to design θ∗ . u = g on ∂U . In general u ∈ C 2 (U ). and then let δ → 0: 0 ≤ f (x) + Lu(x).Divide by δ > 0. Equivalently. u − g) = 0 in U u = g on ∂U These are the optimality conditions. we have (13) Mu ≤ f in U. an optimal stopping time. that is. then the value function u satisﬁes: (15) max(M u − f. max(M u − f. Suppose f. where M u := −Lu. Finally we observe that if in (11) a strict inequality held. such that: (i) (ii) (iii) (iv) u ≤ g in U .

Since τ ∧ θ∗ is the exit time from C. THEOREM. but computers can provide accurate numerical approximations. βε ≥ 0. and then we should run X(·) until it hits S (or else exits from U ).where βε : R → R is a smooth. DESIGNING AN OPTIMAL STOPPING POLICY. and furthermore u = g on ∂C. lim →0 βε (x) = ∞ for x > 0. Let u be the solution of (15). 113 . we have for x ∈ C τ ∧θ ∗ u(x) = E( 0 f (X(s)) ds + g(X(θ∗ ∧ τ ))) = Jx (θ∗ ). τ ∧ θ∗ = 0. and βε ≡ 0 for x ≤ 0. Deﬁne the continuation set C := U − S = {x ∈ U | u(x) < g(x)}. and so u(x) = g(x) = Jx (θ∗ ). if x ∈ S. Now we show that our solution of (15) is in fact the value function. 2. 1. This says that we should ﬁrst compute the solution to (15) to ﬁnd S. Deﬁne for each x ∈ U . We need to show u(x) = Jx (θ∗ ) ≤ Jx (θ). and along the way we will learn how to design an optimal stopping strategy θ∗ . Then u(x) = Jx (θ∗ ) = inf Jx (θ) θ ¯ for all x ∈ U . On the other hand. we have u(x) = Jx (θ∗ ). It will in practice be diﬃcult to ﬁnd a precise formula for u. Then uε → u. convex function. First note that the stopping set S := {x ∈ U | u(x) = g(x)} ¯ is closed. θ∗ = ﬁrst hitting time of S. ¯ Thus for all x ∈ U . Proof. deﬁne θ∗ as above. Now let θ be any other stopping time. On this set Lu = f .

which is the right to buy one share of the stock S. if you run a ﬁnancial ﬁrm and wish to sell your customers this call option. where µ > 0 is the drift and σ = 0 the volatility. is derived from) the behavior of S(·). we consequently have u(x) = Jx (θ∗ ) = min Jx (θ). mostly following Baxter–Rennie [B-R] and the class lectures of L.e. But since u(x) = Jx (θ∗ ). Another basic reference is Hull [Hu]. for which the ﬁrm neither makes nor loses money. We will investigate a European call option. θ as asserted.Now by Itˆ’s formula o τ ∧θ u(x) = E( 0 M u(X) ds + u(X(τ ∧ θ))) ¯ But M u ≤ f and u ≤ g in U . say a stock.) 114 . Let us consider a given security. In this section we outline an application to mathematical ﬁnance. The basic question is this: What is the “proper price” at time t = 0 of this option? In other words. THE BASIC PROBLEM. OPTIONS PRICING. how much should you charge? (We are looking for the “break–even” price. Goldberg. whose price at time t is S(t). The number p is called the strike price and T > 0 the strike (or expiration) time. A derivative is a ﬁnancial instrument whose payoﬀ depends upon (i. Hence u(x) ≤ E( 0 τ ∧θ f (X) ds + g(X(τ ∧ θ))) = Jx (θ). at the price p at time T .. The initial price s0 is known. We suppose that S evolves according to the SDE introduced in Chapter 5: (16) dS = µSdt + σSdW S(0) = s0 . D.

This means that \$1 put in a bank at time t = 0 becomes \$erT at time t = T . to arrive at (17). t). Other forces are at work in ﬁnancial markets. We average this over all sample paths and multiply by the discount factor e−rT . A PARTIAL DIFFERENTIAL EQUATION. a ﬁrst guess might be that the proper price should be (17) e−rT E((S(T ) − p)+ ). As for the problem of pricing our call option. 0). if s = 0. (17) is in fact not the proper price. no-risk interest rate is the constant r > 0. Boundary conditions. given that S(t) = s. This means somehow eliminating our risk as the seller of the call option. we assume hereafter that the prevailing. As reasonable as this may all seem. We need to calculate u. t) = 0 (0 ≤ t ≤ T ). 0) is the price we are seeking. meaning the possibility of risk-free proﬁts. 115 . \$1 at time t = T is worth only \$e−rT at time t = 0. T ) = (s − p)+ (s ≥ 0). and thereby make a proﬁt of (S(T ) − p)+ . Equivalently. we can buy a share for the price p. To simplify. If S(T ) > p. The exact details appear below. then S(t) = 0 for all 0 ≤ t ≤ T and so (20) u(0. for x+ := max(x. we have (19) u(s. We introduce for s ≥ 0 and 0 ≤ t ≤ T . we introduce also the notion of hedging. We must price our option so as to create no arbitrage opportunities for others. then the option is worthless. For this. Indeed the fundamental factor in options pricings is arbitrage. To convert this principle into mathematics. Then u(s0 . immediately sell at price S(T ). notice ﬁrst that at the expiration time T .ARBITRAGE AND HEDGING. The reasoning behind this guess is that if S(T ) < p. denoting the proper price of the option at time t. We demonstrate next how use these principles to convert our pricing problem into a PDE. Furthermore. but the basic idea is that we can in eﬀect “duplicate” our option by a portfolio consisting of (continually changing) holdings of a risk–free bond and of the stock on which the call is written. the unknown price function (18) u(s.

Roughly speaking. Remark (discrete version of self-ﬁnancing). We ensure this by requiring that the portfolio represented on the right-hand side of (24) be self-ﬁnancing. self-ﬁnancing. assume that B is a risk-free investment. of course. the proﬁts from it will exactly equal the funds needed to pay the customer. deﬁne the process (21) C(t) := u(S(t). This just means B(t) = ert . The ﬁrm thereby incurs the risk that at time T . as above. To see this more clearly. Thus C(t) is the current price of the option at time t. Conversely. let us consider a 116 . B. This means that the changes in the value of the portfolio should depend only upon the changes in S. and so the buyer will exercise the option. ψ so that (24) holds. and is random since the stock price S(t) is random. which therefore grows at the prevailing interest rate r: (23) dB = rBdt B(0) = 1. But to make this work. But if in the meantime the ﬁrm has constructed the portfolio (24). imagine that your ﬁnancial ﬁrm sells a call option. According to Itˆ’s formula and (16) o 1 dC = ut dt + us dS + uss (dS)2 2 σ2 2 S uss )dt + σSus dW. We therefore require that (25) dC = φdS + ψdB (0 ≤ t ≤ T ). the ﬁnancial ﬁrm should not have to inject any new money into the hedging scheme. More precisely. Duplicating an option. To understand this better. the portfolio will have no proﬁt. Discussion. if the option is worthless at time T . beyond the initial investment to set it up.We seek how a PDE u solves for s > 0. t) (0 ≤ t ≤ T ). the stock price S(T ) will exceed p. we can eliminate all risk. The point is that if we can construct φ. a portfolio is selfﬁnancing if it is ﬁnancially self contained. 0 ≤ t ≤ T . We will try to ﬁnd processes φ and ψ so that (24) C = φS + ψB (0 ≤ t ≤ T ). To go further. = (ut + µSus + 2 (22) Now comes the key idea: we propose to “duplicate” C by a portfolio consisting of shares of S and of a bond B.

Observe that the parameter µ does not appear. (26) must be valid. and the values of the stock and bond at a time ti are given by Si and Bi respectively. and we are trying to select φ. Then (26) simpliﬁes. We observe in particular that the terms multiplying dW on each side of (26) will match provided we take (27) φ(t) := us (S(t). The main outcome of all our ﬁnancial reasoning is the derivation of this partial diﬀerential equation. ti+1 ). corresponding to our changing holdings of the stock S and the i=0 bond B over each time interval. Combining formulas (22). t) solve the Black–Scholes–Merton PDE (29) ut + rsus + σ2 2 s uss − ru = 0 2 (0 ≤ t ≤ T ). (28) (ut + rSus + σ2 2 S uss − ru)dt = 0. 2 But ψB = C − φS = u − us S. Now for a given time interval (ti . ψ to make all this so. Consequently. 117 .diﬀerent model in which time is discrete. (27). Consequently. So if (24) holds. A portfolio can now be thought of as a sequence {(φi . to make sure that (21) is valid. according to (24). to read (ut + σ2 2 S uss )dt = rψBdt. the continuous version of which is condition (25). 2 The argument of u and its partial derivatives is (S(t). t). (23) and (25) provides the identity (26) (ut + µSus + σ2 2 S uss )dt + σSus dW 2 = φ(µSdt + σSdW ) + ψrBdt. Ci = φi Si + ψi Bi is the opening value of the portfolio and Ci+1 = φi Si+1 + ψi Bi+1 represents the closing value. Here {ti }N is an increasing sequence of times and i=0 we suppose that each time step ti+1 − ti is small. ψi )}N .This is equivalent to saying that Ci+1 − Ci = φi (Si+1 − Si ) + ψi (Bi+1 − Bi ). The self-ﬁnancing condition means that the ﬁnancing gap Ci+1 − Ci of cash (that would otherwise have to be injected to pay for our construction strategy) must be zero. we ask that the function u = u(s. t) (0 ≤ t ≤ T ).

which is an alternative to Itˆ’s o approach. The Itˆ product rule and (24) imply o dC = φdS + ψdB + Sdφ + Bdψ + dφ dS. We can conﬁrm this by noting that (24). (27) imply ψ = B −1 (C − φS) = e−rt (u(S. t). To ensure (25). Most of the following material is from Arnold [A. 118 . t)S). t) − us (S. If we interpret this rigorously as the stochastic diﬀerential equation: (34) dX = d(t)Xdt + f (t)XdW X(0) = X0 . 0 ≤ t ≤ T ). Remember that u(s0 . 0 ≤ t ≤ T ) 2 (32) (s > 0. we consequently must make sure that (30) Sdφ + Bdψ + dφ dS = 0. Motivation. Chapter 10]. E. To price our call option. Let us consider ﬁrst of all the formal random diﬀerential equation (33) ˙ X = d(t)X + f (t)Xξ X(0) = X0 . where we recall φ = us (S(t).More on self-ﬁnancing. Thus (30) is valid provided (31) dψ = −B −1 (Sdφ + σ 2 S 2 uss dt). SUMMARY. We next discuss the Stratonovich stochastic calculus. It turns out that this problem can be solved explicitly. we return to the self-ﬁnancing condition (25). although we omit the details here: see for instance Baxter–Rennie [B-R]. t = T ) u = (s − p)+  u=0 (s = 0. where m = n = 1 and ξ(·) is 1-dimensional “white noise”. A direct calculation using (28) veriﬁes (31). 0) is the price we are trying to ﬁnd. 1. Before going on. THE STRATONOVICH INTEGRAL. we solve the boundary-value problem  2  ut + rsus + σ s2 uss − ru = 0 (s > 0. Now dφ dS = σ 2 S 2 uss dt.

t f (s)ξ k (s) ds. the ξ k (·) are thus presumably smooth approximations of ξ(·). where we suppose that the functions dk (·) converge as k → ∞ to δ0 . How would this possibility change the solution? APPROXIMATING WHITE NOISE. with E(Z k (t)) = 0. t → ξ k (t) is smooth for all ω. the Dirac measure at 0. ξ k (t) is Gaussian for all t ≥ 0. It could perhaps be instead some process with smooth (but highly complicated) sample paths. whose solution is X k (t) := X0 e Next look at Z (t) := 0 k d(s) ds+ f (s)ξ k (s) ds . LIMITS OF SOLUTIONS.we then recall from Chapter 5 that the unique solution is (35) X(t) = X0 e t 0 d(s)− 1 f 2 (s) ds+ 2 t 0 f (s) dW . suppose that {ξ k (·)}∞ is k=1 a sequence of stochastic processes satisfying: (a) (b) (c) (d) E(ξ k (t)) = 0. Now consider the problem (36) ˙ X k = d(t)X k + f (t)X k ξ k X k (0) = X0 . 0 = 119 . On the other hand perhaps (33) is a proposed mathematical model of some physical process and we are not really sure whether ξ(·) is “really” white noise. Furthermore. t s E(Z k (t)Z k (s)) = 0 t 0 f (τ )f (σ)dk (τ − σ) dσdτ s → 0 0 t∧s f (τ )f (σ)δ0 (τ − σ) dσdτ f 2 (τ ) dτ. this is a Gaussian random variable. In light of the formal deﬁnition of the white noise ξ(·) as a Gaussian process with Eξ(t) = 0. E(ξ(t)ξ(s)) = δ0 (t − s). E(ξ k (t)ξ k (s)) := dk (t − s). t 0 t 0 For each ω this is just a regular ODE. More precisely. For each time t ≥ 0.

Hence as k → ∞, Z k (t) converges in L2 to a process whose distributions agree with those t f (s) dW . And therefore X k (t) converges to a process whose distributions agree with 0 (37) ˆ X(t) := X0 e
t 0

d(s) ds+

t 0

f (s) dW

.

This does not agree with the solution (35)! Discussion. Thus if we regard (33) as an Itˆ SDE with ξ(·) a “true” white noise, (35) is o our solution. But if we approximate ξ(·) by smooth processes ξ k (·), solve the approximate problems (36) and pass to limits with the approximate solutions X k (·), we get a diﬀerent solution. This means that (33) is unstable with respect to changes in the random term ξ(·). This conclusion has important consequences in questions of modeling, since it may be unclear experimentally whether we really have ξ(·) or instead ξ k (·) in (33) and similar problems. In view of all this, it is appropriate to ask if there is some way to redeﬁne the stochastic integral so these diﬃculties do not come up. One answer is the Stratonovich integral. 2. Deﬁnition of Stratonovich integral. A one-dimensional example. Recall that in Chapter 4 we deﬁned for 1-dimensional Brownian motion
T 0 mn −1

W dW := lim n

|P |→0

W (tn )(W (tn ) − W (tn )) = k k+1 k
k=0

W 2 (T ) − T , 2

where P n := {0 = tn < tn < · · · < tn n = T } is a partition of [0, T ]. This corresponds m 0 1 to a sequence of Riemann sum approximations, where the integrand is evaluated at the left-hand endpoint of each subinterval [tn , tn ]. k k+1 The corresponding Stratonovich integral is instead deﬁned this way:
T 0 mn −1

W ◦ dW := lim n

|P |→0

k=0

W (tn ) + W (tn ) k+1 k 2

(W (tn ) − W (tn )) = k+1 k

W 2 (T ) . 2

(Observe the notational change: we hereafter write a small circle before the dW to signify the Stratonovich integral.) According to calculations in Chapter 4, we also have
T 0 mn −1

W ◦ dW = lim n

|P |→0

W
k=0

tn + t n k+1 k 2

(W (tn ) − W (tn )). k+1 k

Therefore for this case the Stratonovich integral corresponds to a Riemann sum approximation where we evaluate the integrand at the midpoint of each subinterval [tn , tn ]. k k+1 We generalize this example and so introduce the 120

DEFINITION. Let W(·) be an n-dimensional Brownian motion and let B : Rn ×[0, T ] → Mn×n be a C 1 function such that
T

E
0

|B(W, t)|2 dt

< ∞.

Then we deﬁne
T 0 mn −1

B(W, t) ◦ dW := lim n

|P |→0

B
k=0

W(tn ) + W(tn ) n k+1 k , tk (W(tn ) − W(tn )). k+1 k 2

It can be shown that this limit exists in L2 (Ω). A CONVERSION FORMULA. Remember that Itˆ’s integral can be computed this o way:
T 0 mn −1 k=0

B(W, t) dW = lim n

|P |→0

B(W(tn ), tn )(W(tn ) − W(tn )). k k k+1 k

This is in general not equal to the Stratonovich integral, but there is a conversion formula
T i T i

(38)
0

B(W, t) ◦ dW

=
0

B(W, t) dW

+

1 2

T 0

n

j=1

∂bij (W, t) dt, ∂xj

for i = 1, . . . , n. Here v i means the ith –component of the vector function v. This formula is proved by noting
T T

B(W, t) ◦ dW −
0 0

B(W, t) dW
mn −1

= lim n

|P |→0

B
k=0

W(tn ) + W(tn ) n k+1 k , tk 2

− B(W(tn ), tn ) k k

· (W(tn ) − W(tn )) k+1 k and using the Mean Value Theorem plus some usual methods for evaluating the limit. We omit details. Special case. If n = 1, then
T T

b(W, t) ◦ dW =
0 0

b(W, t) dW +

1 2

T 0

∂b (W, t) dt. ∂x

Assume now B : Rn × [0, T ] → Mn×m and W(·) is an m-dimensional Brownian motion. We make this informal 121

DEFINITION. If X(·) is a stochastic process with values in Rn , we deﬁne
T 0 mn −1

B(X, t) ◦ dW := lim n

|P |→0

B
k=0

X(tn ) + X(tn ) n k+1 k , tk (W(tn ) − W(tn )), k+1 k 2

provided this limit exists in L2 (Ω) for all sequences of partitions P n , with |P n | → 0. 3. Stratonovich chain rule. DEFINITION. Suppose that the process X(·) solves the Stratonovich integral equation
t t

X(t) = X(0) +
0

b(X, s) ds +
0

B(X, s) ◦ dW

(0 ≤ t ≤ T )

for b : Rn × [0, T ] → Rn and B : Rn × [0, T ] → Mn×m . We then write dX = b(X, t)dt + B(X, t) ◦ dW, the second term on the right being the Stratonovich stochastic diﬀerential. THEOREM (Stratonovich chain rule). Assume dX = b(X, t)dt + B(X, t) ◦ dW and suppose u : Rn × [0, T ] → R is smooth. Deﬁne Y (t) := u(X(t), t). Then dY = = ∂u ∂u dt + ◦ dXi ∂t ∂xi i=1 ∂u i ∂u b + ∂t ∂xi i=1
n n m n

dt +
i=1 k=1

∂u ik b ◦ dWk . ∂xi

Thus the ordinary chain rule holds for Stratonovich stochastic diﬀerentials, and there is ∂2u o no additional term involving ∂xi ∂xj as there is for Itˆ’s formula. We omit the proof, which is similar to that for the Itˆ rule. The main diﬀerence is that we make use of the formula o T 1 2 W ◦ dW = 2 W (T ) in the approximations. 0 122

as is easily checked using the Stratonovich calculus described above. Next let us return to the motivational example we began with. t) − 1 c(X. t)dt + B(X. a mathematical issue. We have seen that if the diﬀerential equation (33) is interpreted to mean dX = d(t)Xdt + f (t)XdW X(0) = X0 . t)bjk (x. Note also that these considerations clarify a bit the problems of interpreting mathematically the formal random diﬀerential equation (33). for ci (x. ˜ This solution X(·) is also the solution obtained by approximating the “white noise” ξ(·) by smooth processes ξ k (·) and passing to limits. strictly speaking. CONVERSION RULES FOR SDE. B : Rn × [0.More discussion. (Stratonovich’s sense) d(s) ds+ t 0 f (s) dW . then X(t) = X0 e However. This suggests that interpreting (16) and similar formal random diﬀerential equations in the Stratonovich sense will provide solutions which are stable with respect to perturbations in the random terms. t)dW X(0) = X0 if and only if X(·) solves the Stratonovich stochastic diﬀerential equation dX = b(X. but do not say which interpretation is physically correct. T ] → Mn×m satisfy the hypotheses of the basic existence and uniqueness theorem. t) dt + B(X. T ] → Rn . t) = k=1 j=1 m n ∂bik (x. the solution is ˜ X(t) = X0 e t 0 t 0 (Itˆ’s sense). t) ◦ dW 2 X(0) = X0 . . Then X(·) solves the Itˆ stochastic diﬀerential equation o dX = b(X. Let W(·) be an m-dimensional Wiener process and suppose b : Rn × [0. o d(s)− 1 f 2 (s) ds+ 2 t 0 f (s) dW . This is a question of modeling and is not. t) ∂xj 123 (1 ≤ i ≤ n). if we interpret (33) to mean dX = d(t)Xdt + f (t)X ◦ dW X(0) = X0 . This is indeed the case: See the articles [S1-2] by Sussmann.

2. E t 0 G dW =E t 0 G2 dt . Summary We conclude these lectures by summarizing the advantages of each deﬁnition of the stochastic integral: Advantages of Itˆ integral o 1. 2 4. I(t) = t 0 t 0 2 G dW = 0. this says dX = b(X)dt + σ(X)dW if and only if 1 dX = (b(X) − σ (X)σ(X))dt + σ(X) ◦ dW. Simple formulas: E 2. Solutions of stochastic diﬀerential equations interpreted in Stratonovich sense are stable with respect to changes in random terms.A special case. Ordinary chain rule holds. For m = n = 1. G dW is a martingale. 124 . Advantages of Stratonovich integral 1.

then √ k = np + xk npq → ∞ and √ n − k = nq − xk npq → ∞. The points xk divide this interval into n subintervals h := √ 1 . 3. . . then q np k − np √ npq k np 1+ and q x=1+ np = 1− p n−k x= .APPENDICES Appendix A: Proof of Laplace–DeMoivre Theorem (from §G in Chapter 2) ∗ Proof. Observe next that if x = xk = k−np √ npq . which says says √ n! = e−n nn 2πn (1 + o(1)) as n → ∞. and at the same time k changes so that |xk | is bounded. 2. n) with probability pn (k) = Look at the interval of length −np nq √ √ npq . We recall next Stirling’s formula. npq pk q n−k .) Hence as n → ∞ pn (k) = (1) = n k pk q n−k = n! pk q n−k k!(n − k)! √ e−n nn 2πnpk q n−k 2π(n − k) n−k √ e−k k k 2πke−(n−k) (n − k)(n−k) n k(n − k) np k k (1 + o(1)) 1 =√ 2π nq n−k (1 + o(1)). . 1. . where “o(1)” denotes a term which goes to 0 as n → ∞. this being a random variable taking on the value xk = n k k−np √ npq (k = 0. (See Mermin [M] for a nice discussion. nq nq 125 . npq Now if n goes to ∞. Set Sn := Sn −np √ npq . .

Note also log(1 ± y) = ±y − log np k k y2 2 + O(y 3 ) as y → 0. Hence k np q x np q q 2 x− x np 2np + O n− 2 . In view of (1) − (3). the latter expression is a Riemann sum approximation as n → ∞ of the integral b x2 1 √ e− 2 dx. n − k = nq − x npq. k(n − k) npq √ √ since k = np + x npq. 1 = −k log = −k log 1 + √ = −(np + x npq) Similarly. 2π a Appendix B: Proof of discrete martingale inequalities (from §I in Chapter 2) 126 . 2 Consequently (2) lim n→∞ np k k k−np √ npq →x nq n−k n−k = e− x2 2 . 1 Add these expressions and simplify. to discover lim log n→∞ np k k k−np √ npq →x nq n−k n−k =− x2 . observe (3) 1 n =√ (1 + o(1)) = h(1 + o(1)). log nq n−k n−k √ = −(nq − x npq) − p 2 p x− x nq 2nq + O n− 2 . Now ∗ P (a ≤ Sn ≤ b) = a≤xk ≤b √ xk = k−np npq pn (k) for a < b. Finally. 4.

Notice next that the proof above in fact demonstrates λP max Xk > λ ≤ + Xn dP. . . Xk )) = k=1 n + E(χAk E(Xn | X1 . . . . . 2. . . 1. we have n n (4) λP (A) = λ k=1 P (Ak ) ≤ k=1 E(χAk Xk ). . Xk )) E(χAk Xk ) k=1 ≥ by the submartingale property ≥ λP (A) by (4). 1≤k≤n {max1≤k≤n Xk >λ} Apply this to the submartingale |Xk |: (5) λP (X > λ) ≤ 127 Y dP. Since λP (Ak ) ≤ Ak Xk dP . . . n).Proof. Then A := 1≤k≤n n max Xk > λ = k=1 Ak disjoint union . {X>λ} . Therefore n + E(Xn ) ≥ k=1 n + E(Xn χAk ) = k=1 n + E(E(Xn χAk | X1 . . . . Deﬁne k−1 Ak := j=1 {Xj ≤ λ} ∩ {Xk > λ} (k = 1. . Xk )) ≥ k=1 n E(χAk E(Xn | X1 . .

Then E(|X| ) = − p ∞ 0 ∞ λp dP (λ) for P (λ) := P (X > λ) =p 0 λp−1 P (λ) dλ λp−1 X ≤p 0 ∞ 1 λ Y dP {X>λ} dλ by (5) =p Ω Y 0 λp−2 dλ Y X p−1 dP Ω 1/p dP = ≤ p p−1 p p−1 1−1/p Y p dP Ω Ω X p dP . Y := |Xn |. The martingale inequality now implies P sup |I n (t) − I m (t)| > ε =P sup |I n (t) − I m (t)|2 > ε2 0≤t≤T 0≤t≤T 1 ≤ 2 E(|I n (T ) − I m (T )|2 ) ε T 1 = 2E |Gn − Gm |2 dt .for X := max1≤k≤n |Xk |. Appendix C: Proof of continuity of indeﬁnite Itˆ integral (from §C in Chapter o 4) Proof. If Gn (s) ≡ Gn for tn ≤ s < tn . ε 0 128 . then k k k+1 k−1 n I (t) = i=0 Gn (W (tn ) − W (tn )) + Gn (W (t) − W (tn )) i i+1 i k k for tn ≤ t < tn . for 0 ≤ t ≤ T . it follows that |I n − I m |2 is a submartingale. which states that the indeﬁnite integral I(·) is a martingale. Since I n (·) is a martingale. Write I n (t) := t 0 Gn dW .s. Therefore I n (·) has continuous sample paths a. There exist step processes Gn ∈ L2 (0. such that T E 0 (Gn − G)2 dt → 0. since Brownian k k+1 motion does.. T ). Now take some 1 < p < ∞. We will assume assertion (i) of the Theorem in §C of Chapter 4.

ω) is the uniform limit of continuous functions. n ≥ nk . . . J(·) is a version of I(·). 2k Then there exists nk such that T n m P 1 sup |I (t) − I (t)| > k 2 0≤t≤T ≤2 E 2k 0 |Gn (t) − Gm (t)|2 dt for m. ω) converges uniformly on [0. 129 . ω) is continuous for amost every ω. which is to say.s. for almost all ω sup |I nk (t.o. ω) − I nk+1 (t. Let Ak := Then P (Ak ) ≤ sup |I nk (t) − I nk+1 (t)| > 1 2k . ω)| ≤ 1 2k provided k ≥ k0 (ω). J(·) has continuous sample paths a.Choose ε = 1 . 0≤t≤T 1 . and nk → ∞. P (Ak i. 0≤t≤T Hence I nk (·. As I n (t) → I(t) in L2 (Ω) for all 0 ≤ t ≤ T . ω) := limk→∞ I nk (t. and therefore J(t.) = 0. . we deduce as well that J(t) = I(t) amost every for all 0 ≤ t ≤ T . In other words. Since for almost every ω. k2 Thus by the Borel–Cantelli Lemma. ≤ 1 k2 We may assume nk+1 ≥ nk ≥ nk−1 ≥ . J(·. T ] for almost every ω.

and Ω = ∪k Ai . ∞ √ 1 2πσ 2 ∞ −∞ xe− (x−m)2 2σ 2 dx = m. using the formal manipulations for Itˆ’s formula discussed in Chapter 1. √ 1 2πσ 2 −∞ (x − m)2 e− (x−m)2 2σ 2 dx = σ 2 .) (4) Let X = i=1 ai χAi be a simple random variable. 130 . Likewise. t (Hint: If X(t) := W (t) − 2 . (i) Describe precisely which sets are in U(X). P (0) = p0 . where the real numbers ai are distinct. We call U the σ-algebra generated by A. the events Ai are pairwise disjoint.) 2 t (2) Show that P (t) = p0 e solves σW (t)+ µ− σ 2 2 t . Show that Ac and B are independent. Let U(X) be the i=1 σ-algebra generated by X. (Hint: Take the intersection of all the σ-algebras containing A. Show that there exists a unique smallest σ-algebra U of subsets of Ω containing A. then dX = − dt + dW . (iii) Show that therefore Y can be written as a function of X. that o Y (t) := eW (t)− 2 solves the stochastic diﬀerential equation dY = Y dW. (6) (i) Suppose A and B are independent events in some probability space. dP = µP dt + σP dW. (3) Let Ω be any set and A any collection of subsets of Ω.EXERCISES (1) Show. show that Ac and B c are independent. Y (0) = 1. (ii) Suppose the random variable Y is U(X)-measurable. (5) Verify: ∞ −∞ k e−x dx = 2 √ π. Show that Y is constant on each set Ai .

and calculate e. m→∞ (Hint: Look at the disjoint events Bn := An+1 − An . U. f. of whom 21 were accepted. of whom 230 were accepted. 1]. Calculate numerically a. c. 2000 semester. then ∞ P n=1 An = lim P (Am ). Am are disjoint events. m→∞ 131 . . each of positive probability. of whom 100 were accepted.(ii) Suppose that A1 . X2 (ω) := g(x2 ) for ω = (x1 . . the probability of a male applicant being accepted during the fall. b. d. a (Hint: You must ﬁnd g so that P (−∞ < Y ≤ a) = −∞ g dy for all a.) (ii) Likewise. P ) be a probability space and A1 ⊆ A2 ⊆ · · · ⊆ An ⊆ . and set Y := X 2 . m). x2 ) ∈ Ω. and 112 men applied. . such that Ω = ∪m Aj . 1999 semester 105 women applied to UC Sunnydale. the probability of a female applicant being accepted during the fall. . the probability of a female applicant being accepted.) (9) Take Ω = [0. Let g : [0. (10) (i) Let (Ω. . A2 . N (0. . Calculate the density g of the distribution function for Y . . (7) During the Fall. During the Spring. with U the Borel sets and P Lebesgue measure. the probability of a female applicant being accepted during the spring. Consider now the total applicant pool for both semesters together. . Show that X1 and X2 are independent and identically distributed. the probability of a male applicant being accepted during the spring. Prove Bayes’ formula: j=1 P (Ak | B) = provided P (B) > 0. and 400 men applied. of whom 76 were accepted. be events. the probability of a male applicant being accepted. 1) random variable. 300 women applied. 1] × [0. . show that if A1 ⊇ A2 ⊇ · · · ⊇ An ⊇ . Deﬁne the random variables X1 (ω) := g(x1 ). Show that ∞ P n=1 An = lim P (Am ). . . Are the University’s admission policies biased towards females? or males? (8) Let X be a real–valued. . . 1] → R be a continuous function. m j=1 P (B | Ak )P (Ak ) P (B | Aj )P (Aj ) (k = 1.

n (iii) Therefore |bn (x) − f (x)| ≤ E(|f ( Sn ) − f (x)|) n Sn |f ( ) − f (x)| dP + = n A |f ( Ac Sn ) − f (x)| dP.Y (x. 2 δ( ) n nδ( )2 for M = max |f |. (ii) Given x ∈ [0. by providing the details for the following steps. n for A = {ω ∈ Ω | | Sn − x| ≤ δ( )}.Y is the joint density function of X. 1] as n → ∞. Conclude that bn → f uniformly. for each > 0 there exists δ( ) > 0 such that |f (x) − f (y)| ≤ if |x − y| ≤ δ( ). k Prove that bn → f uniformly on [0. 132 . where fX. (i) Since f is uniformly continuous. we have ∞ −∞ fX (z − y)fY (y) dy. Y . each with density f (x) = e−x if x ≥ 0 0 if x < 0. Write Sn = X1 + · · · + Xn . 1]. Then bn (x) = E(f ( Sn )). Show that the density function for X + Y is ∞ fX+Y (z) = (Hint: If g : R → R. ∞ −∞ E(g(X + Y )) = −∞ fX. Y .(11) Let f : [0. and suppose that fX and fY are the density functions for X. n (iv) Then show |bn (x) − f (x)| ≤ + 2M Sn 2M V( )= + V (X1 ).) (13) Let X and Y be two independent positive random variables. y)g(x + y) dxdy. P (Xk = 0) = 1 − x. (12) Let X and Y be independent random variables. 1] → R be continuous and deﬁne the Bernstein polynomial n bn (x) := k=0 f k n n k x (1 − x)n−k . take a sequence of independent random variables Xk such that P (Xk = 1) = x.

..s. So the left hand side of (∗) is ∞ (∗∗) A X dP = Ω χB (Y )X dP = xf (x.xn − 1 | > ) ≤ 1 V ( x1 +. . y) dydx. (Hint: P (| x1 +. . Y be two real–valued random variables and suppose their joint distribution function has the density f (x. (i) Show that if Φ is convex. Ω} is the trivial σ-algebra. y) dx −∞ (Hints: Φ(Y ) is a function of Y and so is U(Y )–measurable.. dxn = f ( ) n 2 for each continuous function f . Now A = Y −1 (B) for some Borel subset of R. 133 . xn 1 ) dx1 dx2 .) (17) A smooth function Φ : R → R is called convex if Φ (x) ≥ 0 for all x ∈ R. where W = {∅. Therefore we must show that (∗) A X dP = A Φ(Y ) dP for all A ∈ U(Y ).Find the density of X + Y . y ∈ R.. . then Φ(y) ≥ Φ(x) + Φ (x)(y − x) for all x. y) . (14) Show that 1 n→∞ 1 1 lim ··· 0 0 0 f( x1 + .) (15) Prove that (i) E(E(X | V)) = E(X).. for Φ(y) = ∞ xf (x. −∞ B The right hand side of (∗) is ∞ Φ(Y ) dP = A −∞ B Φ(y)f (x. ∞ f (x. Show that E(X|Y ) = Φ(Y ) a. (16) Let X. y) dx −∞ . which equals the right hand side of (∗∗). (ii) E(X) = E(X | W). Fill in the details.xn ) = 2 n 2 n 1 12 2n . y) dydx.

where Φ is convex. (This means that i. y ∈ Rn .) (18) (i) Prove Jensen’s inequality: Φ(E(X)) ≤ E(Φ(X)) for a random variable X : Ω → R. (Here “D” denotes the gradient.) (ii) Prove the conditional Jensen’s inequality: Φ(E(X|V)) ≤ E(Φ(X)|V). (Hint: Use assertion (iii) from the previous problem.j=1 Φxi xj ξi ξj ≥ 0 for all ξ ∈ Rn . ¯ ¯ ¯ Show that W (t)−W (s) is N (0. 2k k! (20) Show that if W(·) is an n-dimensional Brownian motion. and deﬁne ¯ W (t) := tW ( 1 ) for t > 0 t 0 for t = 0. (ii) cW(t/c2 ) for all c > 0 (“Brownian scaling”). You do not need to show this. where W (·) is a one-dimensional Brownian motion. y ∈ R. (iii) A smooth function Φ : Rn → R is called convex if the matrix ((Φxi xj )) is n nonnegative deﬁnite for all x ∈ Rn . t−s) for times 0 ≤ s ≤ t. then so are (i) W(t + s) − W(s) for all s ≥ 0. E(X 2 (t)) = t3 for each t > 0. Show E(W 2k (t)) = (2k)!tk .(ii) Show that Φ( 1 1 x+y ) ≤ Φ(x) + Φ(y) 2 2 2 for all x. 3 134 . (21) Let W (·) be a one-dimensional Brownian motion.) Prove Φ(y) ≥ Φ(x) + DΦ(x) · (y − x) and Φ( 1 1 x+y ) ≤ Φ(x) + Φ(y) 2 2 2 for all x. (19) Let W (·) be a one-dimensional Brownian motion. (W (·) also has independent increments and so is a one-dimensional Brownian motion.) (22) Deﬁne X(t) := Show that t 0 W (s) ds.

P ) be a probability space. t ≤ 1. Show that W (m) =0 m→∞ m lim almost surely. U.) (29) Let (Ω. T ). o γ o (ii) Show that f (t) = t is uniformly H¨lder continuous with exponent γ on the interval [0. Apply the Borel–Cantelli m Lemma. H ∈ L2 (0.) (24) Deﬁne U (t) := e−t W (e2t ). then for almost every ω there exists a constant K. where W (·) is a one-dimensional Brownian motion. Assume X be an integrable random variable. (25) Let W (·) be a one-dimensional Brownian motion. (Hint: X(t) is a Gaussian random variable. depending on ω. 1) random variable X = W (m) . (27) Let 0 < γ < 1 . such that (∗) |W (t. and take F(·) to be a ﬁltration of σ–algebras. 135 . and deﬁne X(t) := E(X|F(t)) for times t ≥ 0. t < ∞. 1]. (28) Prove that if G. Show that if f : [0. These notes show that if W (·) is a one–dimensional Brownian 2 motion. (m) (Hint: Fix > 0 and deﬁne the event Am := {| Wm | ≥ }. Show that E(eλX(t) ) = e λ2 t3 6 for each t > 0. ω)| ≤ K|t − s|γ for all 0 ≤ s. the variance of which we know from the previous homework problem.(23) Deﬁne X(t) as in the previous problem. T ] → Rn is uniformly H¨lder continuous with exponent γ. Show that E(U (t)U (s)) = e−|t−s| for all − ∞ < s. Then Am = {|X| ≥ √ √ m } for the N (0. then T T T E 0 G dW 0 H dW =E 0 GH dt . it is also is uniformly H¨lder continuous with each exponent 0 < δ < γ.) o (26) (i) Let 0 < γ ≤ 1. ω) − W (s. (Hint: 2ab = (a + b)2 − a2 − b2 . Show that there does not exist a constant K such that (∗) holds for almost all ω.

0). and then condition with respect to the history of I(·). (Hint: Use the conditional Jensen’s inequality.Show that X(·) is a martingale. Assume also E(|Φ(X(t))|) < ∞ for all t ≥ 0. 136 . t) be a smooth solution of the backwards diﬀusion equation ∂u 1 ∂ 2 u = 0. Take the conditional expectation with respect to W(s). t)) = u(0. . Use this to prove E(e T 0 g dW ) = e2 1 T 0 g 2 ds .) (32) Use the Itˆ chain rule to show that Y (t) := e 2 cos(W (t)) is a martingale. . + ∂t 2 ∂x2 and suppose W (·) is a one-dimensional Brownian motion. Show that for each time t > 0: E(u(W (t). . 0 t 0 g dW − 1 2 t 0 g 2 ds dY = gY dW. (Hint: W 2 (t) = (W (t) − W (s))2 − W 2 (s) + 2W (t)W (s). (30) Show directly that I(t) := W 2 (t) − t is a martingale. (36) Let u = u(x. .) (31) Suppose X(·) is a real-valued martingale and Φ : R → R is convex. and write Y (t) := |W(t)|2 − nt for times t ≥ 0. the history of W (·).) (34) Show that T t W 2 dW = 0 1 3 W (T ) − 3 T W dt 0 T and 0 T W 3 dW = (35) Recall from the notes that Y := e satisﬁes 1 4 3 W (T ) − 4 2 W 2 dt. W n ) be an n-dimensional Brownian motion. Show that Φ(X(·)) is a submartingale. Show that Y (·) is a martingale. o (33) Let W(·) = (W 1 . (Hint: Compute dY .

(42) Let W = (W 1 . W n ) be an n-dimensional Brownian motion and write n 1 2 1 2 σ 2 −b|t−s| . and show in particular that E(B 2 (t)) → 0 as t → 1− . . (38) Let X solve the Langevin equation. (41) Solve the SDE dX = −Xdt + e−t dW . σ ) random 2b variable. Show that if x0 > 0. (ii) Next. and then combine these solutions to ﬁnd ones which are zero for some time and then become positive. . the solution “blows up to inﬁnity” in ﬁnite time.) (40) (i) Use the substituion X = u(W ) to solve the SDE dX = − 1 e−2X dt + e−X dW 2 X(0) = x0 . . Show that R solves the stochastic Bessel equation n dR = i=1 Wi n−1 dW i + dt. e 2b (t > 0) R := |W| = i=1 (W i )2 .(37) Calculate E(B 2 (t)) for the Brownian bridge B(·). R 2R (43) (i) Show that X = (cos(W ). (Hint: x ≡ 0 is a solution. Show that E(X(s)X(t)) = (39) (i) Consider the ODE x = x2 ˙ (t > 0) x(0) = x0 . (ii) Show that the solution blows up at a ﬁnite. W 2 . Find also a solution which is positive for times t > 0. Show that this problem has inﬁnitely many solutions. random time. look at the ODE x = x2 ˙ x(0) = 0. sin(W )) solves the SDE system dX 1 = − 1 X 1 dt − X 2 dW 2 dX 2 = − 1 X 2 dt + X 1 dW 2 137 . . and suppose that X0 is an N (0.

) 138 . (49) Let W denote an n–dimensional Brownian motion. Use the Feynman-Kac formula to show that M (t) := f (W(t))e− 2 1 t 0 ∆f (W(s)) ds is a martingale. the inverse function of f . (48) Let τ be the ﬁrst time a one–dimensional Brownian motion hits the half-open interval (a. Modify Φ to build a function u which equals 0 on the inner sphere and 1 on the outer sphere. x dy (Hint: Let f (x) := 0 σ(y) and set g := f −1 . (46) Solve dX = 1 σ (X)σ(X)dt + σ(X)dW 2 X(0) = 0 where W is a one–dimensional Brownian motion and σ is a smooth.(ii) Show also that if X = (X 1 . where W = (W 1 . Show τ is a stopping time. Write X = W + x0 . then |X| is constant in time. (44) Solve the system dX 1 = dt + dW 1 dX 2 = X 1 dW 2 .) (47) Let f be a positive. (Hint: Check that 1 Φ(x) = |x|n−2 satisﬁes ∆Φ = 0 for x = 0. positive function. where the point x0 lies in the region U = {0 < R1 < |x| < R2 } Calculate explicitly the probability that X will hit the outer sphere {|x| = R2 } before hitting the inner sphere {|x| = R1 }. b]. smooth function. W 2 ) is a Brownian motion. for n ≥ 3. (45) Solve dX 1 = X 2 dt + dW 1 dX 2 = X 1 dt + dW 2 . Show X := g(W ). X 2 ) is any other solution.

1985. 296–298. Introduction to Fourier Analysis and Wavelets. [F] A. Lamperti. Introduction to the Theory of Diﬀusion Processes. Gillespie. 4th ed. L. Physics 64 (1996). V. [H] D. Probability 6 (1978). Princeton University Press. Rennie. Wiley. Sussmann. 1968. Handbook of Stochastic Methods for Physics. Addison–Wesley. An interpretation of stochastic diﬀerential equations as ordinary diﬀerential equations which depend on the sample point. 1995. 1996. Chung. V. [C] K. [D] M. Futures and Other Derivatives (4th ed). Springer. N. Kyˆto (1964). Z. [G] C. Baxter and A. 1974. [L2] J. [B] L. [P-W-Z] R. Brooks/Cole. 225–240. Stochastic Integrals. Lamperti. Bremaud. A simple construction of certain diﬀusion processes. Press. Probability. Gardiner. 1995. Financial Calculus: An Introduction to Derivative Pricing. Breiman. Wiener. 19–41. Cambridge U. J. McKean. [McK] H. The mathematics of Brownian motion and Johnson noise. [S] D. Princeton U. Math. I. [K] N. 525–546. Probability. Stochastic Diﬀerential Equations. 1972. SIAM Review 43 (2001).. Malliaris. Academic Press. Springer. [N] E. Chapman and Hall. Stochastic Diﬀerential Equations: Theory and Applications. Springer. Elementary Probability Theory with Stochastic Processes. Pinsky. Prentice Hall. Stirling’s formula!. Nelson. J. C. SIAM Review 25 (1983). 1993. [G-S] I. G. [S1] H. Gihman and A. Functional Integration and Partial Diﬀerential Equations. Stochastic Diﬀerential Equations and Applications. 1975.References L. Krylov. American J. Freidlin. 1969. Notes on random functions. Vol. 1999. An Introduction to Probabilistic Modeling. [A] [B-R] 139 . 2002. Hull. Physics 52 (1984). 1 and 2. W. [Hu] J. American Math Society. Zygmund. Probability Theory: An Analytic View. Bulletin AMS 83 (1977). Itˆ’s calculus in ﬁnancial decision making. Springer. Paley. [P] M. H. Stochastic Diﬀerential Equations: An Introduction with Applications. A. Math. 1988. Linear Estimation and Stochastic Control. On the gap between deterministic and stochastic ordinary diﬀerential equations. American J. Press. Oksendal. Sussmann. Press. Cambridge U. 1983. Skorohod. K. Options. [S2] H. o [M] D. Stroock. Springer. [Ml] A. [G] D. An algorithmic introduction to numerical simulation of stochastic diﬀerential equations. 362–365. 37 (1959). Higham. and the Natural Sciences. M. [Br] P. Benjamin. 1967. [L1] J. Academic Press. Friedman. 481–496. 647–668. Arnold. [Fr] M. 161– o 170. Ann. Mermin. Davis. Dynamical Theories of Brownian Motion.. [O] B. and A. Chemistry.