Week 1

Discrete time Gaussian Markov processes

Jonathan Goodman
September 10, 2012

1 Introduction to Stochastic Calculus

These are lecture notes for the class Stochastic Calculus offered at the Courant
Institute in the Fall Semester of 2012. It is a graduate level class. Students
should have a solid background in probability and linear algebra. The topic
selection is guided in part by the needs of our MS program in Mathematics in
Finance. But it is not focused entirely on the Black-Scholes theory of derivative
pricing. I hope that the main ideas are easier to understand in general with
a variety of applications and examples. I also hope that the class is useful to
engineers, scientists, economists, and applied mathematicians outside the world
of finance.
The term stochastic calculus refers to a family of mathematical methods for
studying dynamics with randomness. Stochastic by itself means random, and
it implies dynamics, as in stochastic process. The term calculus by itself has
two related meanings. One is a system of methods for calculating things, as in
the calculus of pseudo-differential operators or the umbral calculus.[1] The tools
of stochastic calculus include the backward equations and forward equations,
which allow us to calculate the time evolution of expected values and probability
distributions for stochastic processes. In simple cases these are matrix equations.
In more sophisticated cases they are partial differential equations of diffusion
type.
The other sense of calculus is the study of what happens when $\Delta t \to 0$. In
this limit, finite differences go to derivatives and sums go to integrals. Calculus
in this sense is short for differential calculus and integral calculus,[2] which refers
to the simple rules for calculating derivatives and integrals: the product rule,
the fundamental theorem of calculus, and so on. The operations of calculus,
integration and differentiation, are harder to justify than the operations of
algebra. But the formulas often are simpler and more useful: integrals can be easier
than sums.
[1] Just name-dropping. These are not part of our Stochastic Calculus class.
[2] Richard Courant, the founder of the Courant Institute, wrote a two volume textbook,
Differential and Integral Calculus (originally Vorlesungen über Differential- und
Integralrechnung).

Differential and integral calculus is good for modeling as well as for
calculation. We understand the dynamics of a system by asking how the system
changes in a small interval of time. The mathematical model can be a system
of differential equations. We predict the behavior of the system by solving
the differential equations, either analytically or computationally. Examples of
this include rate equations of physical chemistry and the laws of Newtonian
mechanics.
The stochastic calculus in this sense is the Ito calculus. The extra Ito term
makes the Ito calculus more complicated than ordinary calculus. There is no
Ito term in ordinary calculus because the quadratic variation is zero. The Ito
calculus also is a framework for modeling. A stochastic process may be described
by giving an Ito stochastic differential equation, an SDE. There are relatively
simple rules for deriving this SDE from basic information about the short time
behavior of the process. This is analogous to writing an ordinary differential
equation (ODE) to describe the evolution of a system that is not random. If
you can describe the behavior of the system over very short time intervals, you
can write the ODE. If you can write the ODE, there is an array of analytical
and computational methods that help you figure out how the system behaves.
This course starts with two simple kinds of stochastic processes that may be
described by basic methods of probability. This week we cover linear Gaussian
recurrence relations. These are used throughout science and economics as the
simplest class of models of stochastic dynamics. Almost everything about linear
Gaussian processes is determined by matrices and linear algebra. Next week we
discuss another class of random processes described by matrices, finite state
space Markov chains. Week 3 begins the transition to continuous time with
continuous time versions of the Gaussian processes discussed this week. The
simplest of these processes is Brownian motion, which is the central construct
that drives most of the Ito calculus.
After that comes the technical core of the course, the Ito integral, Ito's
lemma, and general diffusion processes. We will see how to associate partial
differential equations to diffusion processes and how to find approximate numerical
solutions.
It is impractical to do all this in a mathematically rigorous way in just one
semester. This class will indicate some of the main ideas of the mathematically
rigorous theory, but we will not discuss them thoroughly. Experience shows
that careful people can use Ito calculus more or less correctly without being able to
recite the formal proofs. Indeed, the ordinary calculus of Newton and Leibniz
is used daily by scientists and engineers around the world, most of whom would
be unable to give a mathematically correct definition of the derivative.
Computing is an integral part of this class as it is an integral part of applied
mathematics. The theorems and formulas of stochastic calculus are easier to
understand when you see them in action in computations of specific examples.
More importantly, in the practice of stochastic calculus, it is very rare that a
problem gets solved without some computation. A training class like this one
should include all aspects of the subject, not just those in use before computers
were invented.

2 Introduction to the material for the week
The topic this week is linear recurrence relations with Gaussian noise. A linear
recurrence relation is an equation of the form

\[ X_{n+1} = A X_n + V_n . \tag{1} \]

The state vector at time n is a column vector with d components:

\[ X_n = \begin{pmatrix} X_{n,1} \\ \vdots \\ X_{n,d} \end{pmatrix} \in \mathbb{R}^d . \]

The innovation, or noise vector, or random forcing, is the d component vector $V_n$.

The forcing vectors are i.i.d., which stands for independent and identically dis-
tributed. The model is defined by the matrix A and the probability distribution
of the forcing. The model does not change with time because A and the distri-
bution of the Vn are the same for each n. The recurrence relation is Gaussian
if the noise vectors Vn are Gaussian.
This is a simple model of the evolution of a system that is somewhat pre-
dictable, but not entirely. The d components Xn,1 , . . . , Xn,d represent the state
of the system at time n. The deterministic part of the dynamics is Xn+1 = AXn .
This says that the components at the next time period are linear functions of
the components in the current period. The term Vn represents the random in-
fluences at time n. In this model, everything about time n relevant to predicting
time n + 1 is contained in Xn . Therefore, the noise at time n, which is Vn , is
completely independent of anything we have seen before.
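The recurrence (1) is easy to simulate. The following is a minimal sketch, not code from the course: the function name `simulate_path`, the parameter `noise_std`, and the example matrix are all illustrative assumptions, and the noise components are assumed independent with a common standard deviation, which is only a special case of the Gaussian forcing the notes allow.

```python
import random

def simulate_path(A, x0, n_steps, noise_std=1.0, seed=0):
    """Simulate X_{n+1} = A X_n + V_n with i.i.d. Gaussian V_n.

    Assumption for this sketch: the components of V_n are independent
    N(0, noise_std^2). The notes allow a general Gaussian V_n.
    """
    rng = random.Random(seed)
    d = len(x0)
    path = [list(x0)]
    for _ in range(n_steps):
        x = path[-1]
        # Deterministic part: (A x)_i = sum_j A[i][j] * x[j]
        ax = [sum(A[i][j] * x[j] for j in range(d)) for i in range(d)]
        # Random forcing V_n, drawn fresh at every step
        v = [rng.gauss(0.0, noise_std) for _ in range(d)]
        path.append([ax[i] + v[i] for i in range(d)])
    return path

# Illustrative 2 x 2 example
A = [[0.9, 0.1],
     [0.0, 0.5]]
path = simulate_path(A, x0=[1.0, -1.0], n_steps=50)
print(path[0], path[-1])
```

Setting `noise_std=0.0` recovers the deterministic dynamics, which is a convenient way to check the matrix-vector product.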
In statistics, a model of the form Y = AX + V is a linear regression model
(though usually with E[V] = 0 and V called $\epsilon$). In it, a family of variables
Yi are predicted using linear functions of a different family of variables Xj.
The noise components Vk model the extent to which the Y variables cannot be
predicted by the X variables. The model (1) is called autoregressive, or AR,
because values of variables Xj are predicted by values of the same variables at
the previous time.
It is possible to understand the behavior of a linear Gaussian AR model in
great detail (a technical detail: you must assume it starts with an X0 that is
Gaussian). All the subsequent Xn are multivariate normal. There are simple matrix
recurrence relations that determine their means and covariance matrices. These
recurrence relations are important in themselves, and they are our first example of
an ongoing theme for the course, backward and forward equations.
This material makes a good start for the class for several reasons. For one
thing, it gives us an excuse to review multivariate Gaussian random variables.
Also, we have a simple context in which to talk about path space. In this case,
a path is a sequence of consecutive states, which we write as

\[ X_{[n_1:n_2]} = (X_{n_1}, X_{n_1+1}, \ldots, X_{n_2}) . \tag{2} \]

The notation $[n_1 : n_2]$ comes from two sources. Several programming languages
use similar notation to denote a sequence of consecutive integers: $[n_1 : n_2] =
\{n_1, n_1 + 1, \ldots, n_2\}$. In mathematics, $[n_1, n_2]$ refers to the closed interval
containing all real numbers between $n_1$ and $n_2$, including $n_1$ and $n_2$. We write
$[n_1 : n_2]$ to denote just the integers in that interval.
The path is an object in a big vector space. Each of the $X_n$ has d components.
The number of integers n in the set $[n_1 : n_2]$ is $n_2 - n_1 + 1$. Altogether, the path
$X_{[n_1:n_2]}$ has $d(n_2 - n_1 + 1)$ components. Therefore $X_{[n_1:n_2]}$ can be viewed as a
point in the path space $\mathbb{R}^{d(n_2 - n_1 + 1)}$. As such it is Gaussian. Its distribution is
completely determined by its mean and covariance matrix. The mean of $X_{[n_1:n_2]}$
is determined by the means of the individual $X_n$. The covariance matrix of
$X_{[n_1:n_2]}$ has dimension $d(n_2 - n_1 + 1) \times d(n_2 - n_1 + 1)$. Some of its entries
give the variances and covariances of the components of $X_n$. Others are the
covariances of components $X_{n,j}$ with $X_{n',k}$ at unequal times $n \neq n'$.
In future weeks we will consider spaces of paths that depend on continu-
ous time t rather than discrete time n. The corresponding path spaces, and
probability distributions on them, are one of the main subjects of this course.

3 Multivariate normals
Most of the material in this section should be review for most of you. The
multivariate Gaussian, or normal, probability distribution is important for so
many reasons that it would be dull to list them all here. That activity might
help you later as you review for the final exam. The important takeaway is
linear algebra as a way to deal with multivariate normals.

3.1 Linear transformations of random variables

Let $X \in \mathbb{R}^d$ be a multivariate random variable. We write $X \sim u(x)$ to indicate
that u is the probability density of X. Let A be a square $d \times d$ matrix that
describes an invertible linear transformation of random variables $X = AY$,
$Y = A^{-1} X$. Let v(y) be the probability density of Y. The relation between u
and v is
\[ v(y) = |\det(A)| \, u(Ay) . \tag{3} \]
This is equivalent to
\[ u(x) = \frac{1}{|\det(A)|} \, v(A^{-1} x) . \tag{4} \]
We will prove it in the form (3) and use it in the form (4).
The determinants may be the most complicated things in the formulas (3)
and (4). They may be the least important. It is common that probability
densities are known only up to a constant. That is, we know u(x) = c f(x) with
a formula for f, but we do not know c. Even if there is a formula for c, the
formula may be more helpful without it.

(For example, the Student t-density is
\[ u(x) = c \left( 1 + \frac{x^2}{n} \right)^{-\frac{n+1}{2}} , \qquad
   c = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\sqrt{n\pi} \, \Gamma\left( \frac{n}{2} \right)} , \]
in terms of the Euler Gamma function $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t} \, dt$. The important
features of the t-distribution are easier to see without the formula for c: the fact
that it is approximately normal for large n, that it is symmetric and smooth,
and that $u(x) \sim x^{-p}$ for large x with exponent $p = n + 1$ (power law tails).)
Here is an informal way to understand the transformation rule (3) and get
the determinant prefactor in the right place. Consider a very small region, $B_y$,
in $\mathbb{R}^d$ about the point y. This region could be a small ball or box, say. Call
the volume of $B_y$, informally, |dy|. Under the transformation $y \to x = Ay$, say
that $B_y$ is transformed to a small region $B_x$ about x. Let |dx| be the volume
of $B_x$. Since $B_y \to B_x$ (this means that the transformation A takes $B_y$ to $B_x$),
the ratio of the volumes is given by the determinant:

\[ |dx| = |\det(A)| \, |dy| . \]

This formula is exactly true even if |dy| is not small.
But when $B_y$ is small, we have the approximate formula
\[ \Pr(B_y) = \int_{B_y} v(y') \, dy' \approx v(y) \, |dy| . \tag{5} \]

The exact formula $\Pr(B_x) = \Pr(B_y)$ then gives the approximate formula

\[ v(y) \, |dy| \approx u(x) \, |dx| = u(Ay) \, |\det(A)| \, |dy| . \]

In the limit $|dy| \to 0$, the approximations become exact. Cancel the common
factor |dy| from both sides and you get the transformation formula (3).
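A one dimensional numerical check can make the role of the absolute value concrete. The sketch below is an illustration, not part of the notes. It assumes Y is standard normal and that the "matrix" A is a single negative number `a`; the helper names `v`, `u`, and `integrate` are made up for this example. It checks that the density of X = aY built from the inverse form of the rule integrates to one, and that the forward form v(y) = |det(A)| u(Ay) holds at a test point.

```python
import math

def v(y):
    """Density of Y: standard normal (chosen only for illustration)."""
    return math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi)

def u(x, a):
    """Density of X = a*Y from the inverse form: u(x) = v(x / a) / |a|."""
    return abs(1.0 / a) * v(x / a)

def integrate(f, lo, hi, n=20000):
    """Midpoint rule, accurate enough for a sanity check."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

a = -3.0  # a negative value shows why the absolute value is needed
total = integrate(lambda x: u(x, a), -40.0, 40.0)
print(total)                                  # close to 1
print(v(0.7), abs(a) * u(a * 0.7, a))         # the two sides of the forward form
```

Without the absolute value, a negative `a` would give a negative "density", which is the quickest way to see that the prefactor must be |det(A)|.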

3.2 Linear algebra, matrix multiplication

Very simple facts about matrix multiplication make the mathematician's work
much simpler than it would be otherwise. This applies to the associativity
property of matrix multiplication and the distributive property of matrix mul-
tiplication and addition. This is part of what makes linear algebra so useful in
practical probability.
Suppose A, B, and C are three matrices that are compatible for multiplication.
Associativity is the formula (AB)C = A(BC). We can write the product
simply as ABC because the way the factors are grouped does not matter.
Associativity holds for products of more factors. For example (A(BC))D = (AB)(CD)
gives two of the many ways to calculate the matrix product ABCD: you can
compute BC, then multiply from the left by A and lastly multiply from the
right by D, or you can first calculate AB and CD and then multiply those.
Distributivity is the fact that matrix product is a linear function of each
factor. Suppose AB is compatible for matrix multiplication, that B1 and B2
have the same shape (number of rows and columns) as B, and that u1 and u2
are numbers. Then A(u1 B1 + u2 B2) = u1 (AB1) + u2 (AB2). This works with
more than two B matrices, and with matrices on the right and left, such as
\[ A \left( \sum_{k=1}^{n} u_k B_k \right) C = \sum_{k=1}^{n} u_k A B_k C . \]

It even works with integrals. If B(x) is a matrix function of $x \in \mathbb{R}^d$ and u(x) is
a probability density function, then
\[ \int \left( A B(x) C \right) u(x) \, dx = A \left( \int B(x) \, u(x) \, dx \right) C . \]

This may be said in a more abstract way. If B is a random matrix and A and
C are fixed, not random, then

\[ E[ABC] = A \, E[B] \, C . \tag{6} \]

Matrix multiplication is associative and linear even when some of the matrices
are row vectors or column vectors. These can be treated as $1 \times d$ and $d \times 1$
matrices respectively.
Of course, matrix multiplication is not commutative: $AB \neq BA$ in general.
The matrix transpose reverses the order of matrix multiplication:
$(AB)^t = B^t A^t$. Matrix inverse does the same if A and B are square matrices:
$(AB)^{-1} = B^{-1} A^{-1}$. If A and B are not square, it is possible that AB is
invertible even though A and B are not.
We illustrate matrix algebra in probability by finding transformation rules
for the mean and covariance of a multivariate random variable. Suppose $Y \in \mathbb{R}^d$
is a d component random variable, and X = AY. It is not necessary here for
A to be invertible or square, as it was in subsection 3.1. The mean of Y is
the d component vector given either in matrix/vector form as $\mu_Y = E[Y]$, or in
component form as $\mu_{Y,j} = E[Y_j]$. The expected value of X is

\[ \mu_X = E[X] = E[AY] = A \, E[Y] = A \mu_Y . \]

We may take A out of the expectation because of the linearity of matrix
multiplication, and the fact that Y may be treated as a $d \times 1$ matrix.
Slightly less trivial is the transformation formula for the covariance matrix.
The covariance matrix $C_Y$ is the $d \times d$ symmetric matrix whose entries are

\[ C_{Y,jk} = E\left[ (Y_j - \mu_{Y,j}) (Y_k - \mu_{Y,k}) \right] . \]

The diagonal entries of $C_Y$ are the variances of the components of Y:
\[ C_{Y,jj} = E\left[ (Y_j - \mu_{Y,j})^2 \right] = \sigma_{Y_j}^2 . \]

Now consider the $d \times d$ matrix $B(Y) = (Y - \mu_Y)(Y - \mu_Y)^t$. The (j, k) entry
of B is just $(Y_j - \mu_{Y,j})(Y_k - \mu_{Y,k})$. Therefore the covariance matrix may be
expressed as
\[ C_Y = E\left[ (Y - \mu_Y)(Y - \mu_Y)^t \right] . \tag{7} \]
The linearity formula (6), and associativity, give the transformation law for
the covariance matrix:
\begin{align}
C_X &= E\left[ (X - \mu_X)(X - \mu_X)^t \right] \notag \\
    &= E\left[ (AY - A\mu_Y)(AY - A\mu_Y)^t \right] \notag \\
    &= E\left[ \{A(Y - \mu_Y)\} \{A(Y - \mu_Y)\}^t \right] \notag \\
    &= E\left[ \{A(Y - \mu_Y)\} \{(Y - \mu_Y)^t A^t\} \right] \notag \\
    &= E\left[ A \{(Y - \mu_Y)(Y - \mu_Y)^t\} A^t \right] \notag \\
    &= A \, E\left[ (Y - \mu_Y)(Y - \mu_Y)^t \right] A^t \notag \\
C_X &= A C_Y A^t . \tag{8}
\end{align}

The second to the third line uses distributivity. The third to the fourth uses
the property of matrix transpose. The fourth to the fifth is associativity.
The fifth to the sixth is linearity.
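As a small worked instance of (8), offered as an illustration rather than as part of the original notes, take d = 2 and the $1 \times 2$ matrix $A = (1 \;\; 1)$, so that $X = Y_1 + Y_2$ is a scalar. The symbols $\sigma_1^2$, $\sigma_2^2$, and $c_{12}$ for the entries of $C_Y$ are names assumed for this example.

```latex
\[
C_X = A C_Y A^t
    = \begin{pmatrix} 1 & 1 \end{pmatrix}
      \begin{pmatrix} \sigma_1^2 & c_{12} \\ c_{12} & \sigma_2^2 \end{pmatrix}
      \begin{pmatrix} 1 \\ 1 \end{pmatrix}
    = \sigma_1^2 + 2 c_{12} + \sigma_2^2 .
\]
```

This recovers the familiar formula var(Y1 + Y2) = var(Y1) + 2 cov(Y1, Y2) + var(Y2), and it shows the formula at work with an A that is not square.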

3.3 Gaussian probability density

This subsection and the next one use the multivariate normal probability den-
sity. The aim is not to use the formula but to find ways to avoid using it. We
use the general probability density formula to prove the important facts about
Gaussians. Working with these general properties is simpler than working with
probability density formulas. These properties are

• Linear functions of Gaussians are Gaussian. If X is Gaussian and Y = AX,
  then Y is Gaussian.

• Uncorrelated Gaussians are independent. If X1 and X2 are two components
  of a multivariate normal and if cov(X1, X2) = 0, then X1 and X2
  are independent.

• Conditioned Gaussians are Gaussian. If X1 and X2 are two components of
  a multivariate normal, then the distribution of X1, conditioned on knowing
  the value X2 = x2, is Gaussian.

In this subsection and the next, we use the formula for the Gaussian probability
density to prove these three properties.
Let H be a $d \times d$ matrix that is SPD (symmetric and positive definite). Let
$\mu = (\mu_1, \ldots, \mu_d)^t \in \mathbb{R}^d$ be a vector. If X has the probability density
\[ u(x) = \frac{\sqrt{\det(H)}}{(2\pi)^{d/2}} \, e^{-(x-\mu)^t H (x-\mu)/2} , \tag{9} \]
then we say that X is a multivariate Gaussian, or normal, with parameters $\mu$
and H. The probability density on the right is denoted by $N(\mu, H^{-1})$. It will
be clear soon why it is convenient to use $H^{-1}$ instead of H. We say that X is
centered if $\mu = 0$. In that case the density is symmetric, $u(x) = u(-x)$.
It is usually more convenient to write the density formula as
\[ u(x) = c \, e^{-(x-\mu)^t H (x-\mu)/2} . \]
The value of the prefactor,
\[ c = \frac{\sqrt{\det(H)}}{(2\pi)^{d/2}} , \]
often is not important. A probability density is Gaussian if it is the exponential
of a quadratic function of x.
We give some examples before explaining (9) in general. A univariate normal
has d = 1. In that case we drop the vector and matrix notation because $\mu$ and
H are just numbers. The simplest univariate normal is the univariate standard
normal, with $\mu = 0$ and H = 1. We often use Z to denote a standard normal,
\[ Z \sim \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} = N(0, 1) . \tag{10} \]
The cumulative distribution function, or CDF, of the univariate standard normal
is
\[ N(x) = \Pr(Z < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2} \, dz . \]

There is no explicit formula for N (x) but it is easy to calculate numerically. Most
numerical software packages include procedures that compute N (x) accurately.
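For instance, Python's standard library exposes the error function, and the standard normal CDF can be written in terms of it through the identity N(x) = (1 + erf(x / sqrt(2))) / 2. The function name `std_normal_cdf` below is an illustrative choice, not something defined in the notes.

```python
import math

def std_normal_cdf(x):
    """N(x) = Pr(Z < x) for Z ~ N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(std_normal_cdf(0.0))     # one half, by symmetry
print(std_normal_cdf(1.96))    # about 0.975
```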
The general univariate density may be written without matrix/vector notation:
\[ u(x) = \frac{\sqrt{h}}{\sqrt{2\pi}} \, e^{-(x-\mu)^2 h/2} . \tag{11} \]
Simple calculations (explained below) show that the mean is $E[X] = \mu$, and the
variance is
\[ \sigma_X^2 = \mathrm{var}(X) = E\left[ (X - \mu)^2 \right] = \frac{1}{h} . \]
In view of this, the probability density (11) is also written
\[ u(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-(x-\mu)^2/(2\sigma^2)} . \tag{12} \]

If X has this density, we write $X \sim N(\mu, \sigma^2)$. It would have been more accurate
to write $\sigma_X^2$ and $\mu_X$, but we make a habit of dropping subscripts that are easy
to guess from context.
It is useful to express a general univariate normal in terms of a standard
univariate normal. If we want $X \sim N(\mu, \sigma^2)$, we can take $Z \sim N(0, 1)$ and take

\[ X = \mu + \sigma Z . \tag{13} \]

It is clear that $E[X] = \mu$. Also,

\[ \mathrm{var}(X) = E\left[ (X - \mu)^2 \right] = \sigma^2 E\left[ Z^2 \right] = \sigma^2 . \]
A calculation with probability densities (see below) shows that if (10) is the
density of Z and X is given by (13), then X has the density (12). This is handy
in calculations, such as

\[ \Pr[X < a] = \Pr[\mu + \sigma Z < a] = \Pr\left[ Z < \frac{a - \mu}{\sigma} \right] = N\left( \frac{a - \mu}{\sigma} \right) . \]

This says that the probability of X < a depends on how many standard deviations
a is away from the mean, which is the argument of N on the right.
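The standardization step can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library. The particular numbers for the mean, the standard deviation, and the threshold are arbitrary illustrative values.

```python
from statistics import NormalDist

mu, sigma, a = 2.0, 3.0, 5.0

# Direct: CDF of X ~ N(mu, sigma^2) evaluated at a
# (NormalDist takes the standard deviation, not the variance)
p_direct = NormalDist(mu, sigma).cdf(a)

# Standardized: standard normal CDF evaluated at (a - mu) / sigma
p_standardized = NormalDist(0.0, 1.0).cdf((a - mu) / sigma)

print(p_direct, p_standardized)
```

Here a is exactly one standard deviation above the mean, so both numbers equal N(1), roughly 0.841.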
We move to a multivariate normal and return to matrix and vector notation.
The standard multivariate normal has $\mu = 0$ and $H = I_{d \times d}$. In this case,
the exponent in (9) involves just $z^t H z = z^t z = z_1^2 + \cdots + z_d^2$, and $\det(H) = 1$.
Therefore the N(0, I) probability density (9) is
\begin{align}
Z &\sim \frac{1}{(2\pi)^{d/2}} \, e^{-z^t z/2} \tag{14} \\
  &= \frac{1}{(2\pi)^{d/2}} \, e^{-(z_1^2 + \cdots + z_d^2)/2} \notag \\
  &= \frac{1}{\sqrt{2\pi}} e^{-z_1^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z_2^2/2} \cdots \frac{1}{\sqrt{2\pi}} e^{-z_d^2/2} . \tag{15}
\end{align}
The last line writes the probability density of Z as a product of one dimensional
N(0, 1) densities for the components $Z_1, \ldots, Z_d$. This implies that the
components $Z_j$ are independent univariate standard normals. The elements of
the covariance matrix are
\begin{align}
C_{Z,jj} &= \mathrm{var}(Z_j) = E\left[ Z_j^2 \right] = 1 , \tag{16} \\
C_{Z,jk} &= \mathrm{cov}(Z_j, Z_k) = E\left[ Z_j Z_k \right] = 0 \quad \text{if } j \neq k . \notag
\end{align}

In matrix terms, this is just $C_Z = I$. The covariance matrix of the multivariate
standard normal is the identity matrix. In this case at least, uncorrelated
Gaussian components $Z_j$ are also independent.
The themes of this section so far are the general transformation law (4), the
covariance transformation formula (8), and the Gaussian density formula (9).
We are ready to combine them to see how multivariate normals transform under
linear transformations. Suppose Y has probability density (assuming for now
that $\mu = 0$)
\[ v(y) = c \, e^{-y^t H y/2} , \]
and X = AY, with an invertible A so $Y = A^{-1} X$. We use (4), and calculate
the exponent
\[ y^t H y = \left( A^{-1} x \right)^t H \left( A^{-1} x \right)
           = x^t \left[ \left( A^{-1} \right)^t H A^{-1} \right] x = x^t \tilde{H} x , \]
with $\tilde{H} = (A^{-1})^t H A^{-1}$. (Note that $(A^{-1})^t = (A^t)^{-1}$. We denote these by
$A^{-t}$, and write $\tilde{H} = A^{-t} H A^{-1}$.) The formula for $\tilde{H}$ is not important here.
What is important is that $u(x) = c \, e^{-x^t \tilde{H} x/2}$, which is Gaussian. This proves
that a linear transformation of a multivariate normal is a multivariate normal,
at least if the linear transformation is invertible.
We come to the relationship between H, the SPD matrix in (9), and C, the
covariance matrix of X. In one dimension the relation is $\sigma^2 = C = h^{-1}$. We
now show that for $d \geq 1$ the relationship is

\[ C = H^{-1} . \tag{17} \]

The multivariate normal, therefore, is $N(\mu, H^{-1}) = N(\mu, C)$. This is consistent
with the one variable notation $N(\mu, \sigma^2)$. The relation (17) allows us to rewrite
the probability density (9) in its more familiar form
\[ u(x) = c \, e^{-(x-\mu)^t C^{-1} (x-\mu)/2} . \tag{18} \]

The prefactor is (presuming $C = H^{-1}$)
\[ c = \frac{1}{(2\pi)^{d/2} \sqrt{\det(C)}} = \frac{\sqrt{\det(H)}}{(2\pi)^{d/2}} . \tag{19} \]
The proof of (17) uses an idea that is important for computation. A natural
multivariate version of (13) is

\[ X = \mu + AZ , \tag{20} \]

where $Z \sim N(0, I)$. We choose A so that $X \sim N(\mu, C)$. Then we use the
transformation formula to find the density formula for X. The desired (17) will
fall out. The whole thing is an exercise in linear algebra.
The mean property is clear, so we continue to take $\mu = 0$. The covariance
transformation formula (8), with $C_Z = I$ in place of $C_Y$, implies that $C_X = AA^t$.
We can create a multivariate normal with covariance C if we can find an A with

\[ AA^t = C . \tag{21} \]

You can think of A as the square root of C, just as $\sigma$ is the square root of $\sigma^2$
in the one dimensional version (13).

There are different ways to find a suitable A. One is the Cholesky factorization
$C = LL^t$, where L is a lower triangular matrix. This is described in
any good linear algebra book (Strang, Lax, not Halmos). This is convenient
for computation because numerical software packages usually include a routine
that computes the Cholesky factorization.
This is the algorithm for creating X. We now find the probability density
of X using the transformation formula (4). We write c for the prefactor in any
probability density formula. The value of c can be different in different formulas.
We saw that $v(z) = c \, e^{-z^t z/2}$. With $z = A^{-1} x$, we get
\[ u(x) = c \, v(A^{-1} x) = c \, e^{-(A^{-1} x)^t A^{-1} x/2} . \]
The exponent is, as we just saw, $-x^t H x/2$ with
\[ H = A^{-t} A^{-1} . \]
But if we take the inverse of both sides of (21) we find $C^{-1} = A^{-t} A^{-1}$. This
proves (17), as the expressions for $C^{-1}$ and H are the same.
The prefactor works out too. The covariance equation (21) implies that
\[ \det(C) = \det(A) \det(A^t) = \left[ \det(A) \right]^2 . \]
Using (14) and (3) together gives
\[ c = \frac{1}{(2\pi)^{d/2}} \, \frac{1}{|\det(A)|} , \]
which is the prefactor formula (19).
Two important properties of the multivariate normal, for your review list, are
that they exist for any C and are easy to generate. The covariance square root
equation (21) has a solution for any SPD matrix C. If C is a desired covariance
matrix, the mapping (20) produces a multivariate normal X with covariance C.
Standard software packages include random number generators that produce
independent univariate standard normals $Z_k$. If you have C and you want a
million independent random vectors X, you first compute the Cholesky
factor, L. Then a million times you use the standard normal random number
generator to produce a d component standard normal Z and do the matrix
calculation X = LZ.
This makes the multivariate normal family different from other multivariate
families. Sampling a general multivariate random variable can be challenging for
d larger than about 5. Practitioners resort to heavy handed and slow methods
such as Markov chain Monte Carlo. Moreover, there is a modeling question
that may be hard to answer for general random variables. Suppose you want
univariate random variables X1 , . . ., Xd each to have density f (x) and you want
them to be correlated. If f (x) is a univariate normal, you can make the Xk
components of a multivariate normal with desired variances and correlations.
If the Xk are not normal, the copula transformation maps the situation to a
multivariate normal. Warning: the copula transformation has been blamed for
the 2007 financial meltdown, seriously.

3.4 Conditional and marginal distributions
When you talk about conditional and marginal distributions, you have to say
which variables are fixed or known, and which are variable or unknown. We
write the multivariate random variable as (X, Y) with $X \in \mathbb{R}^{d_X}$ and $Y \in \mathbb{R}^{d_Y}$.
The number of random variables in total is still $d = d_X + d_Y$. We study the
conditional distribution of X, conditioned on knowing Y = y. We also study
the marginal distribution of Y.
The math here is linear algebra with block vectors and matrices. The total
random variable is partitioned into its X and Y parts as
\[ \begin{pmatrix} X \\ Y \end{pmatrix}
   = \begin{pmatrix} X_1 \\ \vdots \\ X_{d_X} \\ Y_1 \\ \vdots \\ Y_{d_Y} \end{pmatrix}
   \in \mathbb{R}^d . \]

The Gaussian joint distribution of X and Y has in its exponent

\[ \begin{pmatrix} x^t & y^t \end{pmatrix}
   \begin{pmatrix} H_{XX} & H_{XY} \\ H_{YX} & H_{YY} \end{pmatrix}
   \begin{pmatrix} x \\ y \end{pmatrix}
   = x^t H_{XX} x + 2 x^t H_{XY} y + y^t H_{YY} y . \]

(This relies on $x^t H_{XY} y = y^t H_{YX} x$, which is true because $H_{YX} = H_{XY}^t$, which
is true because H is symmetric.) The joint distribution of X and Y is

\[ u(x, y) = c \, e^{-(x^t H_{XX} x + 2 x^t H_{XY} y + y^t H_{YY} y)/2} . \]


If u(x, y) is any joint distribution, then the conditional density of X, conditioned
on Y = y, is $u(x \mid Y = y) = u(x \mid y) = c(y) \, u(x, y)$. Here we think of x
as the variable and y as a parameter. The normalization constant here depends
on the parameter. For the Gaussian, we have

\[ u(x \mid Y = y) = c(y) \, e^{-(x^t H_{XX} x + 2 x^t H_{XY} y)/2} . \tag{22} \]

The factor $e^{-y^t H_{YY} y/2}$ has been absorbed into c(y).
We see from (22) that the conditional distribution of X is the exponential
of a quadratic function of x, which is to say, Gaussian. The algebraic trick
of completing the square identifies the conditional mean. We seek to write the
exponent of (22) in the form of the exponent of (9):
\[ x^t H_{XX} x + 2 x^t H_{XY} y = (x - \mu_X(y))^t H_{XX} (x - \mu_X(y)) + m(y) . \]

The m(y) will eventually be absorbed into the y dependent prefactor. Some
algebra shows that this works provided

\[ 2 x^t H_{XY} y = -2 x^t H_{XX} \mu_X(y) . \]

This will hold for all x if $H_{XY} y = -H_{XX} \mu_X(y)$, which gives the formula for
the conditional mean:
\[ \mu_X(y) = -H_{XX}^{-1} H_{XY} y . \tag{23} \]

The conditional mean $\mu_X(y)$ is in some sense the best prediction of the unknown
X given the known Y = y.
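A two variable example helps to see what the conditional mean formula $\mu_X(y) = -H_{XX}^{-1} H_{XY} y$ says. The sketch below is an illustration under stated assumptions: X and Y are both scalars, the joint distribution is centered, and the covariance matrix entries are made-up numbers. The function name `conditional_mean_2d` is hypothetical. It builds the needed blocks of $H = C^{-1}$ from the 2 x 2 inverse formula and applies the formula.

```python
def conditional_mean_2d(C, y):
    """Conditional mean of X given Y = y for a centered bivariate normal
    with covariance C = [[var(X), cov], [cov, var(Y)]]."""
    (cxx, cxy), (_, cyy) = C
    det = cxx * cyy - cxy * cxy
    # Blocks of the precision matrix H = C^{-1} (2 x 2 inverse formula)
    h_xx = cyy / det
    h_xy = -cxy / det
    # mu_X(y) = - H_XX^{-1} H_XY y
    return -(1.0 / h_xx) * h_xy * y

C = [[4.0, 1.2],
     [1.2, 1.0]]       # illustrative covariance matrix
y = 0.5
m = conditional_mean_2d(C, y)
print(m)
```

In this scalar case the algebra collapses to cov(X, Y) / var(Y) times y, the usual regression slope, which is 1.2 * 0.5 = 0.6 for these numbers.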

3.5 Generating a multivariate normal, interpreting covariance

If we have M with $M M^t = C$, we can think of M as a kind of square root of
C. It is possible to find a real $d \times d$ matrix M as long as C is symmetric and
positive definite. We will see two distinct ways to do this that give two different
M matrices.
The Cholesky factorization is one of these ways. The Cholesky factorization
of C is a lower triangular matrix L with $LL^t = C$. Lower triangular means that
all non-zero entries of L are on or below the diagonal:
\[ L = \begin{pmatrix}
   l_{11} & 0      & \cdots & 0 \\
   l_{21} & l_{22} & \ddots & \vdots \\
   \vdots & \vdots & \ddots & 0 \\
   l_{d1} & l_{d2} & \cdots & l_{dd}
   \end{pmatrix} . \]

Any good linear algebra book explains the basic facts of Cholesky factorization.
These are: such an L exists as long as C is SPD. There is a unique lower triangular
L with positive diagonal entries: $l_{jj} > 0$. There is a straightforward algorithm
that calculates L from C using approximately $d^3/6$ multiplications (and the
same number of additions).
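The textbook column-by-column algorithm is short enough to write out. The sketch below is one standard variant, written in plain Python for readability rather than speed. The function name `cholesky` and the example matrix are illustrative assumptions. In practice a library routine would be the better choice.

```python
import math

def cholesky(C):
    """Return lower triangular L with L L^t = C, for an SPD matrix C
    given as a list of rows."""
    d = len(C)
    L = [[0.0] * d for _ in range(d)]
    for j in range(d):
        # Diagonal entry: what is left of C[j][j] after earlier columns
        s = C[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if not s > 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j
        for i in range(j + 1, d):
            t = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = t / L[j][j]
    return L

C = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 3.0],
     [0.0, 3.0, 10.0]]
L = cholesky(C)
for row in L:
    print(row)
```

Taking the positive square root on the diagonal is what selects the unique factor with positive diagonal entries.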
If you want to generate $X \sim N(\mu, C)$, you compute the Cholesky factorization
of C. Any good package of linear algebra software can do this, including
downloadable software LAPACK for C or C++ or FORTRAN programming,
and the built-in linear algebra facilities in Python, R, and Matlab. To make an
X, you need d independent standard normals $Z_1, \ldots, Z_d$. Most packages that
generate pseudo-random numbers have a procedure to generate such standard
normals. This includes Python, R, and Matlab. To do it in C, C++, or FORTRAN,
you can use a uniform pseudo-random number generator and then use
the Box-Muller formula to get Gaussians. You assemble the $Z_j$ into a vector
$Z = (Z_1, \ldots, Z_d)^t$ and take $X = LZ + \mu$.
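The recipe can be sketched end to end in a few lines. This is an illustration, not the course's code: the names `box_muller` and `sample_mvn` are hypothetical, the Cholesky factor L is typed in by hand for a made-up 2 x 2 covariance, and Python's own `random.gauss` would normally replace the hand-written Box-Muller step.

```python
import math
import random

def box_muller(rng):
    """Turn two independent uniforms into two independent N(0, 1) values."""
    u1 = 1.0 - rng.random()          # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def sample_mvn(mu, L, rng):
    """One sample X = L Z + mu, where L is a Cholesky factor of C."""
    d = len(mu)
    z = []
    for _ in range((d + 1) // 2):    # Box-Muller produces normals in pairs
        z.extend(box_muller(rng))
    # L is lower triangular, so row i only uses z[0], ..., z[i]
    return [mu[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]

rng = random.Random(12345)
mu = [1.0, -2.0]
L = [[2.0, 0.0],
     [0.6, 0.8]]                     # so C = L L^t = [[4.0, 1.2], [1.2, 1.0]]
samples = [sample_mvn(mu, L, rng) for _ in range(20000)]
mean0 = sum(x[0] for x in samples) / len(samples)
mean1 = sum(x[1] for x in samples) / len(samples)
cov01 = sum((x[0] - mean0) * (x[1] - mean1) for x in samples) / len(samples)
print(mean0, mean1, cov01)
```

The sample means and the sample covariance should land near 1, -2, and 1.2, up to Monte Carlo error of order one over the square root of the sample size.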
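Consider as an example the two dimensional case with $\mu = 0$. Here, we
want $X_1$ and $X_2$ that are jointly normal. It is common to specify $\mathrm{var}(X_1) = \sigma_1^2$,
$\mathrm{var}(X_2) = \sigma_2^2$, and the correlation coefficient

\[ \rho_{12} = \mathrm{corr}(X_1, X_2) = \frac{\mathrm{cov}(X_1, X_2)}{\sigma_1 \sigma_2}
             = \frac{E(X_1 X_2)}{\sigma_1 \sigma_2} . \]

In this case, the Cholesky factor is

\[ L = \begin{pmatrix} \sigma_1 & 0 \\
       \rho_{12} \sigma_2 & \sqrt{1 - \rho_{12}^2} \, \sigma_2 \end{pmatrix} . \tag{24} \]
The general formula X = LZ becomes
\begin{align}
X_1 &= \sigma_1 Z_1 \tag{25} \\
X_2 &= \rho_{12} \sigma_2 Z_1 + \sqrt{1 - \rho_{12}^2} \, \sigma_2 Z_2 . \tag{26}
\end{align}
It is easy to calculate $E[X_1^2] = \sigma_1^2$, which is the desired value. Similarly, because
$Z_1$ and $Z_2$ are independent, we have

\[ \mathrm{var}(X_2) = E\left[ X_2^2 \right] = \rho_{12}^2 \sigma_2^2 + \left( 1 - \rho_{12}^2 \right) \sigma_2^2 = \sigma_2^2 , \]
which is the desired answer, too. The correlation coefficient is also correct:

\[ \mathrm{corr}(X_1, X_2) = \frac{E[X_1 X_2]}{\sigma_1 \sigma_2}
   = \frac{E[\sigma_1 Z_1 \, \rho_{12} \sigma_2 Z_1]}{\sigma_1 \sigma_2}
   = \rho_{12} E\left[ Z_1^2 \right] = \rho_{12} . \]
You can, and should, verify by matrix multiplication that

\[ LL^t = \begin{pmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho_{12} \\
          \sigma_1 \sigma_2 \rho_{12} & \sigma_2^2 \end{pmatrix} , \]
which is the desired covariance matrix of $(X_1, X_2)$.
We could have turned the formulas (25) and (26) around as
\begin{align}
X_1 &= \sqrt{1 - \rho_{12}^2} \, \sigma_1 Z_1 + \rho_{12} \sigma_1 Z_2 \notag \\
X_2 &= \sigma_2 Z_2 . \notag
\end{align}
In this version, it looks like $X_2$ is primary and $X_1$ gets some of its value from
$X_2$. In (25) and (26), it looks like $X_1$ is primary and $X_2$ gets some of its value
from $X_1$. These two models are equally valid in the sense that they produce
the same observed $(X_1, X_2)$ distribution. It is a good idea to keep this in mind
when interpreting regression studies involving $X_1$ and $X_2$.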
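The claim that both versions give the same covariance matrix can be checked by direct multiplication. In this sketch the names `two_by_two_factors` and `times_transpose`, and the numerical values of the two standard deviations and the correlation, are illustrative assumptions. `L` is the lower triangular factor in which X1 is primary and `U` is the turned-around factor in which X2 is primary.

```python
import math

def two_by_two_factors(sigma1, sigma2, rho):
    """Two square roots M of the same 2 x 2 covariance matrix, M M^t = C."""
    s = math.sqrt(1.0 - rho * rho)
    L = [[sigma1, 0.0],
         [rho * sigma2, s * sigma2]]      # X1 primary (lower triangular)
    U = [[s * sigma1, rho * sigma1],
         [0.0, sigma2]]                   # X2 primary (turned around)
    return L, U

def times_transpose(M):
    """Return M M^t for a 2 x 2 matrix M."""
    return [[sum(M[i][k] * M[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

sigma1, sigma2, rho = 2.0, 0.5, -0.3
L, U = two_by_two_factors(sigma1, sigma2, rho)
C_from_L = times_transpose(L)
C_from_U = times_transpose(U)
print(C_from_L)
print(C_from_U)
```

Both products reproduce the matrix with diagonal entries 4 and 0.25 and off-diagonal entry -0.3, so data from the two models cannot be told apart.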

4 Linear Gaussian recurrences

Linear Gaussian recurrence relations (1) illustrate the ideas in the previous
section. We now know that if $V_n$ is a multivariate normal with mean zero, there
is a matrix B so that $V_n = B Z_n$, where $Z_n \sim N(0, I)$ is a standard multivariate
normal. Therefore, we rewrite (1) as
\[ X_{n+1} = A X_n + B Z_n . \tag{27} \]
Since the $X_n$ are Gaussian, we need only describe their means and covariances.
This section shows that the means and covariances satisfy recurrence relations
derived from (1). The next section explores the distributions of paths. This
determines, for example, the joint distribution of $X_n$ and $X_m$ for $n \neq m$. These
and more general path spaces and path distributions are important throughout
the course.

4.1 Probability distribution dynamics
As long as $Z_n$ is independent of $X_n$, we can calculate recurrence relations for
$\mu_n = E[X_n]$ and $C_n = \mathrm{cov}[X_n]$. For the mean, we have (you may want to glance
back to subsection 3.2)
\begin{align}
\mu_{n+1} &= E[A X_n + B Z_n] \notag \\
          &= A \, E[X_n] + B \, E[Z_n] \notag \\
\mu_{n+1} &= A \mu_n . \tag{28}
\end{align}

This says that the recurrence relation for the means is the same as the recurrence
relation (27) for the random states if you turn off the noise (set $Z_n$ to zero).
For the covariance, it is convenient to combine (27) and (28) into

\[ X_{n+1} - \mu_{n+1} = A (X_n - \mu_n) + B Z_n . \]

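The covariance calculation starts with
\begin{align}
C_{n+1} &= E\left[ (X_{n+1} - \mu_{n+1}) (X_{n+1} - \mu_{n+1})^t \right] \notag \\
        &= E\left[ \left( A (X_n - \mu_n) + B Z_n \right) \left( A (X_n - \mu_n) + B Z_n \right)^t \right] . \notag
\end{align}

We expand the last into a sum of four terms. Two of these are zero, one being
\[ E\left[ A (X_n - \mu_n) (B Z_n)^t \right] = 0 , \]

because $Z_n$ has mean zero and is independent of $X_n$. We keep the non-zero
terms:
\begin{align}
C_{n+1} &= E\left[ (A (X_n - \mu_n)) (A (X_n - \mu_n))^t \right] + E\left[ (B Z_n) (B Z_n)^t \right] \notag \\
        &= E\left[ A \left\{ (X_n - \mu_n) (X_n - \mu_n)^t \right\} A^t \right] + E\left[ B \left( Z_n Z_n^t \right) B^t \right] \notag \\
        &= A \, E\left[ (X_n - \mu_n) (X_n - \mu_n)^t \right] A^t + B \, E\left[ Z_n Z_n^t \right] B^t \notag \\
C_{n+1} &= A C_n A^t + B B^t . \tag{29}
\end{align}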
The recurrence relations (28) and (29) determine the distribution of $X_{n+1}$ in terms of the distribution of $X_n$. As such, they are the first example in this class of a forward equation.
We will see in subsection 4.2 that there are natural examples where the dimension of the noise vector $Z_n$ is less than $d$, and the noise matrix $B$ is not square. When that happens, we let $m$ denote the number of components of $Z_n$, which is the number of sources of noise. The noise matrix $B$ is $d \times m$; it has $d$ rows and $m$ columns. The case $m > d$ is not important for applications. The matrices in (29) all are $d \times d$, including $BB^t$. If you wonder whether it might be $B^tB$ instead, note that $B^tB$ is $m \times m$, which might be the wrong size.
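The forward equations (28) and (29) are easy to iterate on a computer. The following is a minimal sketch, not part of the original notes: the matrices `A` and `B`, the initial mean and covariance, and the helper name `forward_step` are all hypothetical choices made for illustration. It iterates the recurrences and compares the result with sample statistics from direct simulation of (27).

```python
import numpy as np

def forward_step(mu, C, A, B):
    """One step of the forward equations: (28) for the mean, (29) for the covariance."""
    return A @ mu, A @ C @ A.T + B @ B.T

rng = np.random.default_rng(0)

# Hypothetical example: d = 2 state components, m = 1 source of noise.
A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
B = np.array([[1.0],
              [0.5]])            # B is d x m, so B B^t is d x d
mu0 = np.array([1.0, -1.0])
C0 = np.eye(2)
n_steps = 5

# Deterministic recurrences for the mean and covariance.
mu_n, C_n = mu0, C0
for _ in range(n_steps):
    mu_n, C_n = forward_step(mu_n, C_n, A, B)

# Monte Carlo check: simulate X_{n+1} = A X_n + B Z_n for many independent paths.
n_paths = 200_000
X = rng.multivariate_normal(mu0, C0, size=n_paths)      # rows are samples of X_0
for _ in range(n_steps):
    Z = rng.standard_normal((n_paths, B.shape[1]))      # fresh noise, independent of X
    X = X @ A.T + Z @ B.T

mu_mc = X.mean(axis=0)
C_mc = np.cov(X, rowvar=False)
print("mean from (28):", mu_n, " sample mean:", mu_mc)
print("covariance from (29):\n", C_n, "\nsample covariance:\n", C_mc)
```

The two answers should agree up to Monte Carlo sampling error, which shrinks like the inverse square root of `n_paths`.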

4.2 Higher order recurrence relations, the Markov property
It is common to consider recurrence relations with more than one lag. For example, a $k$ lag relation might take the form
\[ X_{n+1} = A_0X_n + A_1X_{n-1} + \cdots + A_{k-1}X_{n-k+1} + BZ_n . \tag{30} \]
From the point of view of $X_{n+1}$, the $k$ lagged states are $X_n$ (one lag), up to $X_{n-k+1}$ ($k$ lags). It is natural to consider models with multiple lags if the $X_n$ represent observable aspects of a large and largely unobservable system. For example, the components of $X_n$ could be public financial data at time $n$. There is much unavailable private financial data. The lagged values $X_{n-j}$ might give more insight into the complete state at time $n$ than just $X_n$.
We do not need a new theory of lag $k$ systems. State space expansion puts a system of the form (30) into the form of a two term recurrence relation (27). This formulation uses expanded vectors
\[ \widetilde{X}_n = \begin{pmatrix} X_n \\ X_{n-1} \\ \vdots \\ X_{n-k+1} \end{pmatrix} . \]
If the original states $X_n$ have $d$ components, then the expanded states $\widetilde{X}_n$ have $kd$ components. The noise vector $Z_n$ does not need expanding because noise vectors have no memory. All the memory in the system is contained in $\widetilde{X}_n$. The recurrence relations in the expanded state formulation are
\[ \widetilde{X}_{n+1} = \widetilde{A}\widetilde{X}_n + \widetilde{B}Z_n . \]
In more detail, this is
\[
\begin{pmatrix} X_{n+1} \\ X_n \\ \vdots \\ X_{n-k+2} \end{pmatrix}
=
\begin{pmatrix}
A_0 & A_1 & \cdots & A_{k-1} \\
I & 0 & \cdots & 0 \\
0 & I & \ddots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & I & 0
\end{pmatrix}
\begin{pmatrix} X_n \\ X_{n-1} \\ \vdots \\ X_{n-k+1} \end{pmatrix}
+
\begin{pmatrix} B \\ 0 \\ \vdots \\ 0 \end{pmatrix} Z_n . \tag{31}
\]
The matrix $\widetilde{A}$ is the companion matrix of the recurrence relation (30).
We will see in subsection 4.3 that the stability of a recurrence relation (27) is determined by the eigenvalues of $A$. For the case $d = 1$, you might know that the stability of the recurrence relation (30) is determined by the roots of the characteristic polynomial $p(z) = z^k - A_0z^{k-1} - \cdots - A_{k-1}$. These statements are consistent because the roots of the characteristic polynomial are the eigenvalues of the companion matrix.
If $X_n$ satisfies a $k$ lag recurrence (30), then the covariance matrix, $\widetilde{C}_n = \mathrm{cov}(\widetilde{X}_n)$, satisfies $\widetilde{C}_{n+1} = \widetilde{A}\widetilde{C}_n\widetilde{A}^t + \widetilde{B}\widetilde{B}^t$. The simplest way to find the $d \times d$ covariance matrix $C_n$ is to find the $kd \times kd$ covariance matrix $\widetilde{C}_n$ and look at the top left $d \times d$ block.
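State space expansion is mechanical enough to code directly. The sketch below is an illustration under assumed data: the helper names `companion` and `expand_noise` and the lag coefficients are hypothetical, not taken from the notes. It builds the companion matrix of (31), checks for $d = 1$ that its characteristic polynomial is $z^k - A_0z^{k-1} - \cdots - A_{k-1}$, and reads $C_n$ off the top left block of the expanded covariance.

```python
import numpy as np

def companion(A_list):
    """Block companion matrix for X_{n+1} = A_0 X_n + ... + A_{k-1} X_{n-k+1} + B Z_n."""
    k = len(A_list)
    d = A_list[0].shape[0]
    A_tilde = np.zeros((k * d, k * d))
    A_tilde[:d, :] = np.hstack(A_list)          # top block row: A_0, ..., A_{k-1}
    A_tilde[d:, :-d] = np.eye((k - 1) * d)      # identity blocks shift old states down
    return A_tilde

def expand_noise(B, k):
    """Expanded noise matrix: B on top, zero blocks below."""
    d, m = B.shape
    return np.vstack([B, np.zeros(((k - 1) * d, m))])

# Hypothetical scalar (d = 1) example with k = 3 lags.
a = [0.5, 0.3, -0.1]
A_list = [np.array([[c]]) for c in a]
B = np.array([[1.0]])
k, d = len(A_list), 1

A_tilde = companion(A_list)
B_tilde = expand_noise(B, k)

# np.poly of a square matrix returns its characteristic polynomial coefficients.
char_coeffs = np.poly(A_tilde)
print("characteristic polynomial:", char_coeffs)   # compare with [1, -a0, -a1, -a2]

# Iterate the expanded covariance recurrence and read off the d x d block.
C_tilde = np.eye(k * d)
for _ in range(10):
    C_tilde = A_tilde @ C_tilde @ A_tilde.T + B_tilde @ B_tilde.T
C_n = C_tilde[:d, :d]
print("var(X_n) after 10 steps:", C_n)
```

Note that `B_tilde` has $kd$ rows but only $m$ columns, which is the situation with fewer sources of noise than state variables.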
The Markov property will be important throughout the course. If the $X_n$ satisfy the one lag recurrence relation (27), then they have the Markov property. In this case the $X_n$ form a Markov chain. If they satisfy the $k$ lag recurrence relation with $k > 1$ (in a non-trivial way) then the stochastic process $X_n$ does not have the Markov property. The informal definition is as follows. The process has the Markov property if $X_n$ is all the information about the past that is relevant for predicting the future. Said more formally, the distribution of $X_{n+1}$ conditional on $X_n, \ldots, X_0$ is the same as the distribution of $X_{n+1}$ conditional on $X_n$ alone.
If a random process does not have the Markov property, you can blame that on the state space being too small, so that $X_n$ does not have as much information about the state of the system as it should. In many such cases, a version of state space expansion can create a more complete collection of information at time $n$.
Genuine state space expansion, with $k > 1$, always gives a noise matrix $\widetilde{B}$ with fewer sources of noise than state variables. The number of state variables is $kd$ and the number of noise variables is $m \le d$.

4.3 Large time behavior and stability

Large time behavior is the behavior of $X_n$ as $n \to \infty$. The stochastic process (27) is stable if it settles into a stochastic steady state for large $n$. The states $X_n$ cannot have a limit, because of the constant influence of random noise. But the probability distributions, $u_n(x)$, with $X_n \sim u_n(x)$, can have limits. The limit $u(x) = \lim_{n\to\infty} u_n(x)$ is a statistical steady state. The finite time distributions $u_n$ are Gaussian: $u_n = N(\mu_n, C_n)$, with $\mu_n$ and $C_n$ satisfying the recurrences (28) and (29). The limiting distribution depends on the following limits:
\begin{align*}
\mu &= \lim_{n\to\infty} \mu_n \tag{32} \\
C &= \lim_{n\to\infty} C_n \tag{33}
\end{align*}
If these limits exist, then$^3$ $u = N(\mu, C)$.

3 Some readers will worry that this statement is not proven with mathematical rigor. It can be, but we are avoiding that kind of technical discussion.

In the following discussion we first ignore several subtleties in linear algebra for the sake of simplicity. Conclusions are correct as initially stated if $m = d$, $B$ is non-singular, and there are no Jordan blocks in the eigenvalue decomposition of $A$. We will then re-examine the reasoning to figure out what can happen in exceptional degenerate cases.
The limit (32) depends on the eigenvalues of $A$. Denote the eigenvalues by $\lambda_j$ and the corresponding right eigenvectors by $r_j$, so that $Ar_j = \lambda_j r_j$ for $j = 1, \ldots, d$. The eigenvalues and eigenvectors do not have to be real even when $A$ is real. The eigenvectors form a basis of $\mathbb{C}^d$, so the means $\mu_n$ have unique representations $\mu_n = \sum_{j=1}^d m_{n,j} r_j$. The dynamics (28) implies that $m_{n+1,j} = \lambda_j m_{n,j}$. This implies that
\[ m_{n,j} = \lambda_j^n\, m_{0,j} . \tag{34} \]

The matrix $A$ is strongly stable if $|\lambda_j| < 1$ for $j = 1, \ldots, d$. In this case $m_{n,j} \to 0$ as $n \to \infty$ for each $j$. In fact, the convergence is exponential. We see that if $A$ is strongly stable, then $\mu_n \to 0$ as $n \to \infty$ independent of the initial mean $\mu_0$. The opposite case is that $|\lambda_j| > 1$ for some $j$. Such an $A$ is strongly unstable. It usually happens that $|\mu_n| \to \infty$ as $n \to \infty$ for a strongly unstable $A$. The limiting distribution $u$ does not exist for strongly unstable $A$. The borderline case is $|\lambda_j| \le 1$ for all $j$ and there is at least one $j$ with $|\lambda_j| = 1$. This may be called either weakly stable or weakly unstable.
If $A$ is strongly stable, then the limit (33) exists. We do not expect $C_n \to 0$ because the uncertainty in $X_n$ is continually replenished by noise. We start with a direct but possibly unsatisfying proof. A second and more complicated proof follows. The first proof just uses the fact that if $A$ is strongly stable, then
\[ \|A^n\| \le c\,a^n , \tag{35} \]
for some constant $c$ and positive $a < 1$. The value of $c$ depends on the matrix norm and is not important for the proof.
We prove that the limit (33) exists by writing $C$ as a convergent infinite sum. To simplify notation, write $R$ for $BB^t$. Suppose $C_0$ is given, then (29) gives $C_1 = AC_0A^t + R$. Using (29) again gives
\begin{align*}
C_2 &= AC_1A^t + R \\
    &= A\left( AC_0A^t + R \right)A^t + R \\
    &= A^2C_0\left(A^t\right)^2 + ARA^t + R \\
    &= A^2C_0\left(A^2\right)^t + ARA^t + R .
\end{align*}
We can continue in this way to see (by induction) that
\[ C_n = A^nC_0\left(A^n\right)^t + A^{n-1}R\left(A^{n-1}\right)^t + \cdots + R . \]
This is written more succinctly as
\[ C_n = A^nC_0\left(A^n\right)^t + \sum_{k=0}^{n-1} A^kR\left(A^k\right)^t . \tag{36} \]
The limit of the $C_n$ exists because the first term on the right goes to zero as $n \to \infty$ and the second term converges to the infinite sum
\[ C = \sum_{k=0}^{\infty} A^kR\left(A^k\right)^t . \tag{37} \]

For the first term, note that (35) and properties of matrix norms imply that$^4$
\[ \left\| A^nC_0\left(A^n\right)^t \right\| \le (ca^n)\,\|C_0\|\,(ca^n) = c\,a^{2n}\|C_0\| . \]
We write $c$ instead of $c^2$ at the end because $c$ is a generic constant whose value does not matter. The right side goes to zero as $n \to \infty$ because $a < 1$. For the second term, recall that an infinite sum is the limit of its partial sums if the infinite sum converges absolutely. Absolute convergence is the convergence of the sum of the absolute values, or the norms in case of vectors and matrices. Here the sum of norms is:
\[ \sum_{k=0}^{\infty} \left\| A^kR\left(A^k\right)^t \right\| . \]
Properties of norms bound this by a geometric series:
\[ \left\| A^kR\left(A^k\right)^t \right\| \le c\,a^{2k}\|R\| . \]
You can find $C$ without summing the infinite series (37). Since the limit (33) exists, you can take the limit on both sides of (29), which gives
\[ C - ACA^t = BB^t . \tag{38} \]
Subsection 4.4 explains that this is a system of linear equations for the entries of $C$. The system is solvable and the solution is positive definite if $A$ is strongly stable. As a warning, (38) is solvable in most cases even when $A$ is strongly unstable. But in those cases the $C$ you get is not positive definite and is not the covariance matrix of anything. The dynamical equation (29) and the steady state equation (38) are examples of Liapounov equations.
Here are the conclusions: if $A$ is strongly stable then $u_n$, the distribution of $X_n$, has $u_n \to u$ as $n \to \infty$, with a Gaussian limit $u = N(0, C)$, and $C$ is given by (37), or by solving (38). If $A$ is not strongly stable, then it is unlikely that the $u_n$ have a limit as $n \to \infty$. It is not altogether impossible in degenerate situations described below. If $A$ is strongly unstable, then it is most likely that $\|\mu_n\| \to \infty$ as $n \to \infty$. If $A$ is weakly unstable, then probably $\|C_n\| \to \infty$ as $n \to \infty$ because the sum (37) diverges.
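The two routes to the limiting covariance, summing (37) and solving (38), can be compared numerically. The sketch below is illustrative only: the matrices are hypothetical, and writing (38) as a linear system with a Kronecker product is a standard device that is an assumption here, not the formulation used in these notes. With entries of $C$ stacked row by row, the map $C \mapsto ACA^t$ is multiplication by $A \otimes A$.

```python
import numpy as np

def limiting_cov_by_sum(A, B, tol=1e-14, max_terms=10_000):
    """Sum the series (37): C = sum_k A^k R (A^k)^t with R = B B^t."""
    R = B @ B.T
    C = np.zeros_like(R)
    term = R.copy()
    for _ in range(max_terms):
        C += term
        term = A @ term @ A.T          # next term A^{k+1} R (A^{k+1})^t
        if np.linalg.norm(term) < tol:
            break
    return C

def limiting_cov_by_solve(A, B):
    """Solve (38), C - A C A^t = B B^t, as a d^2 x d^2 linear system."""
    d = A.shape[0]
    R = B @ B.T
    L_full = np.kron(A, A)             # row-major vec(A C A^t) = (A kron A) vec(C)
    c = np.linalg.solve(np.eye(d * d) - L_full, R.reshape(-1))
    return c.reshape(d, d)

# Hypothetical strongly stable example (eigenvalues 0.9 and 0.5).
A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
B = np.array([[1.0],
              [0.5]])

C_sum = limiting_cov_by_sum(A, B)
C_solve = limiting_cov_by_solve(A, B)
print("series (37):\n", C_sum)
print("linear solve of (38):\n", C_solve)
print("eigenvalues of C:", np.linalg.eigvalsh(C_solve))
```

For a strongly unstable `A` the linear solve usually still returns a symmetric matrix, but checking its eigenvalues shows whether it can be a covariance matrix.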

4.4 Linear algebra and the limiting covariance

This subsection is a little esoteric. It is (to the author) interesting mathematics that is not strictly necessary to understand the material for this week. Here we find eigenvalues and eigen-matrices for the recurrence relation (29). These are related to the eigenvalues and eigenvectors of $A$.
4 Part of this expression is similar to the design on Courant Institute tee shirts.

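The covariance recurrence relation (29) has the same stability/instability dichotomy. We explain this by reformulating it as more standard linear algebra. Consider first the part that does not involve $B$, which is
\[ C_{n+1} = AC_nA^t . \tag{39} \]
Here, the entries of $C_{n+1}$ are linear functions of the entries of $C_n$. We describe this more explicitly by collecting all the distinct entries of $C_n$ into a vector $\vec{c}_n$. There are $D = (d+1)d/2$ entries in $\vec{c}_n$ because the elements of $C_n$ below the diagonal are equal to the entries above. For example, for $d = 3$ there are $D = 6$ distinct entries in $C_n$, which are $C_{n,11}$, $C_{n,12}$, $C_{n,13}$, $C_{n,22}$, $C_{n,23}$, and $C_{n,33}$, which makes $\vec{c}_n = (C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, C_{n,33})^t \in \mathbb{R}^D$ ($= \mathbb{R}^6$). There is a $D \times D$ matrix, $L$, so that $\vec{c}_{n+1} = L\vec{c}_n$. In the case $d = 2$ and $A = \begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}$, the $C_n$ recurrence relation, or dynamical Liapounov equation (29) without $BB^t$, is
\[
\begin{pmatrix} C_{n+1,11} & C_{n+1,12} \\ C_{n+1,12} & C_{n+1,22} \end{pmatrix}
=
\begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}
\begin{pmatrix} C_{n,11} & C_{n,12} \\ C_{n,12} & C_{n,22} \end{pmatrix}
\begin{pmatrix} \alpha & \gamma \\ \beta & \delta \end{pmatrix} .
\]
This is equivalent to $D = 3$ and
\[
\begin{pmatrix} C_{n+1,11} \\ C_{n+1,12} \\ C_{n+1,22} \end{pmatrix}
=
\begin{pmatrix}
\alpha^2 & 2\alpha\beta & \beta^2 \\
\alpha\gamma & \alpha\delta + \beta\gamma & \beta\delta \\
\gamma^2 & 2\gamma\delta & \delta^2
\end{pmatrix}
\begin{pmatrix} C_{n,11} \\ C_{n,12} \\ C_{n,22} \end{pmatrix} .
\]
And that identifies $L$ as
\[
L = \begin{pmatrix}
\alpha^2 & 2\alpha\beta & \beta^2 \\
\alpha\gamma & \alpha\delta + \beta\gamma & \beta\delta \\
\gamma^2 & 2\gamma\delta & \delta^2
\end{pmatrix} .
\]
This formulation is not so useful for practical calculations. Its only purpose is to show that (39) is related to a $D \times D$ matrix $L$.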
The limiting behavior of $C_n$ depends on the eigenvalues of $L$. It turns out that these are determined by the eigenvalues of $A$ in a simple way. For each pair $(j, k)$ there is an eigenvalue of $L$, which we call $\lambda_{jk}$, that is equal to $\lambda_j\lambda_k$. To understand this, note that an eigenvector, $\vec{s}$, of $L$, with $L\vec{s} = \lambda\vec{s}$, corresponds to a symmetric $d \times d$ eigen-matrix, $S$, with
\[ ASA^t = \lambda S . \]
It happens that $S_{jk} = r_jr_k^t + r_kr_j^t$ is the eigen-matrix corresponding to eigenvalue $\lambda_{jk} = \lambda_j\lambda_k$. (To be clear, $S_{jk}$ is a $d \times d$ matrix, not the $(j, k)$ entry of a matrix $S$.) For one thing, it is symmetric ($S_{jk}^t = S_{jk}$). For another thing:
\begin{align*}
AS_{jk}A^t &= A\left( r_jr_k^t + r_kr_j^t \right)A^t \\
           &= A\,r_jr_k^t\,A^t + A\,r_kr_j^t\,A^t \\
           &= (Ar_j)(Ar_k)^t + (Ar_k)(Ar_j)^t \\
           &= (\lambda_jr_j)(\lambda_kr_k)^t + (\lambda_kr_k)(\lambda_jr_j)^t \\
           &= \lambda_j\lambda_k\left( r_jr_k^t + r_kr_j^t \right) \\
           &= \lambda_{jk}S_{jk} .
\end{align*}
A counting argument shows that all the eigenvalues and eigen-matrices of $L$ take the form of $S_{jk}$ for some $j \ge k$. The number of such pairs is the same $D$, which is the number of independent entries in a general symmetric matrix. We do not count $S_{jk}$ with $j < k$ because $S_{jk} = S_{kj}$ with $k > j$.
Now suppose $A$ is strongly stable. Then the Liapounov dynamical equation (29) is equivalent to
\[ \vec{c}_{n+1} = L\vec{c}_n + \vec{r} . \]
Since all the eigenvalues of $L$ are less than one in magnitude, a little reasoning with linear algebra shows that $\vec{c}_n \to \vec{c}$ as $n \to \infty$, and that $\vec{c} - L\vec{c} = (I - L)\,\vec{c} = \vec{r}$. The matrix $I - L$ is invertible because $L$ has no eigenvalues equal to 1. This is a different proof that the steady state Liapounov equation (38) has a unique solution. It is likely that $L$ has no eigenvalue equal to 1 even if $A$ is not strongly stable. In this case (38) has a solution, which is a symmetric matrix $C$. But there is no guarantee that this $C$ is positive definite, so it might not represent a covariance matrix.
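The eigen-matrix claim of this subsection can be checked numerically in a few lines. The sketch below uses a hypothetical $2 \times 2$ matrix `A` with complex eigenvalues (chosen only for illustration) and verifies that $S_{jk} = r_jr_k^t + r_kr_j^t$ satisfies $AS_{jk}A^t = \lambda_j\lambda_kS_{jk}$ for each pair $j \ge k$. The transpose here is a plain transpose without complex conjugation, as in the text.

```python
import numpy as np

# Hypothetical matrix; its eigenvalues are a complex conjugate pair.
A = np.array([[0.5, 0.4],
              [-0.3, 0.6]])
d = A.shape[0]

lam, r = np.linalg.eig(A)        # columns of r are right eigenvectors of A

max_residual = 0.0
n_pairs = 0
for j in range(d):
    for k in range(j + 1):       # pairs with j >= k, D = d(d+1)/2 of them
        # np.outer does not conjugate, so this is r_j r_k^t + r_k r_j^t.
        S = np.outer(r[:, j], r[:, k]) + np.outer(r[:, k], r[:, j])
        residual = np.linalg.norm(A @ S @ A.T - lam[j] * lam[k] * S)
        max_residual = max(max_residual, residual)
        n_pairs += 1

print("number of pairs D:", n_pairs)
print("largest residual |A S A^t - lam_j lam_k S|:", max_residual)
```

The residuals should be at the level of floating point roundoff, and all products $\lambda_j\lambda_k$ have magnitude less than one when $A$ is strongly stable.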

4.5 Degenerate cases

The simple conclusions of subsections 4.3 and 4.4 do not hold in every case. The reasoning there assumed things about the matrices $A$ and $B$ that you might think are true in almost every interesting case. But it is important to understand how things might be more complicated in borderline and degenerate cases. For one thing, many important special cases are such borderline cases. Many more systems have behavior that is strongly influenced by near degeneracy. A process that is weakly but not strongly unstable is the simple Gaussian random walk, which is a model of Brownian motion. A covariance that is nearly singular is the covariance matrix of asset returns of the S&P 500 stocks. This is a matrix of rank 500 that is pretty well approximated for many purposes by a matrix of rank 10.

4.5.1 Rank of B
The matrix B need not be square or have rank d.

5 Paths and path space
There are questions about the process (1) that depend on $X_n$ for more than one $n$. For example, what is $\Pr\left( \|X_n\| \le 1 \text{ for } 1 \le n \le 10 \right)$? The probability distribution on path space answers such questions. For linear Gaussian processes, the distribution in path space is Gaussian. This is not surprising. This section goes through the elementary mechanics of Gaussian path space. We also describe more general path space terminology that carries over to other kinds of Markov processes.
Two relevant probability spaces are the state space and the path space. We let $S$ denote the state space. This is the set of all possible values of the state at time $n$. This week, the state is a $d$ component vector and $S = \mathbb{R}^d$. The path space is called $\Omega$. This week, $\Omega$ is sequences of states with a given starting and ending time. That is, $X_{[n_1:n_2]} \in \Omega$ is a sequence $(X_{n_1}, X_{n_1+1}, \ldots, X_{n_2})$. There are $n_2 - n_1 + 1$ states in the sequence, so $\Omega = \mathbb{R}^{(n_2 - n_1 + 1)d}$. Even if the state space is not $\mathbb{R}^d$, still a path is a sequence of states. We express this by writing $\Omega = S^{n_2 - n_1 + 1}$. The path space depends on $n_1$ and $n_2$ (only the difference, really), but we leave that out of the notation because it is usually clear from the discussion.

6 Exercises
1. This exercise works through conditional distributions of multivariate normals in a sequence of steps. The themes (for the list of facts about Gaussians) are the role of linear algebra and the relation to linear regression. Suppose $X$ and $Y$ have $d_X$ and $d_Y$ components respectively. Let $u(x, y)$ be the joint density. Then the conditional distribution of $Y$ conditional on $X = x$ is $u(y \mid X = x) = c(x)u(x, y)$. This says that the conditional distribution is the same (up to a normalization constant) as the joint distribution once you fix the variable whose value is known ($x$ in this case). The normalization constant is determined by the requirement that the conditional distribution has total probability equal to 1:
\[ c(x) = \frac{1}{\int u(x, y)\,dy} . \]
For Gaussian random variables, finding $c(x)$ is usually both easy and unnecessary.

(a) This part works out the simplest case. Take $d = 2$, and $X = (X_1, X_2)^t$. Suppose $X \sim N(0, H^{-1})$. Fix the value of $X_1 = x_1$ and calculate the distribution of the one dimensional random variable $X_2$. If $H$ is
\[ H = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix} , \]
then the joint density is
\[ u(x_1, x_2) = c\,\exp\left( -\left( h_{11}x_1^2 + 2h_{12}x_1x_2 + h_{22}x_2^2 \right)/2 \right) . \]
The conditional density looks almost the same:
\[ u(x_2 \mid x_1) = c(x_1)\exp\left( -\left( 2h_{12}x_1x_2 + h_{22}x_2^2 \right)/2 \right) . \]
Why is it allowed to leave the term $h_{11}x_1^2$ out of the exponent? Complete the square to write this in the form
\[ u(x_2 \mid x_1) = c(x_1)\exp\left[ -\left( x_2 - \mu(x_1) \right)^2/(2\sigma_2^2) \right] . \]
Find formulas for the conditional mean, $\mu(x_1)$, and the conditional variance, $\sigma_2^2$.

Week 10
Change of measure, Girsanov formula
Jonathan Goodman
November 26, 2012

1 Introduction to the material for the week

In Week 9 we made a distinction between simulation and Monte Carlo. The difference is that in Monte Carlo you are computing a number, $A$, that is not random. It is likely that there is more than one formula for $A$. There may be more than one way to express $A$ as the expected value of a random variable. Suppose
\[ A = E[F(X)] , \]
where $X$ has probability density $u(x)$. Suppose $v(x)$ is another probability density so that
\[ L(x) = \frac{u(x)}{v(x)} \tag{1} \]
is well defined. Then
\[ A = \int F(x)u(x)\,dx = \int F(x)\frac{u(x)}{v(x)}v(x)\,dx . \]
This may be written as
\[ A = E_u[F(X)] = E_v[F(X)L(X)] . \tag{2} \]
This means that there are two distinct ways to evaluate $A$: (i) take samples $X \sim u$ and evaluate $F$, or (ii) take samples $X \sim v$ and evaluate $FL$.
Importance sampling means using the change of measure formula (2) for Monte Carlo. The expected value $E_u[F(X)]$ means integrate with respect to the probability measure $u(x)dx$. Using the measure $v(x)dx$ instead represents a change of measure. The answer $A$ does not change as long as you put the likelihood ratio into the second integral, as in the identity (2).
There are many uses of importance sampling in Monte Carlo and applied probability. One use is variance reduction. The variance of the $u$-estimator is
\[ \mathrm{var}_u(F(X)) = E_u\left[ F(X)^2 \right] - A^2 . \]
The variance of the $v$-estimator is
\[ \mathrm{var}_v(F(X)L(X)) = E_v\left[ \left( F(X)L(X) \right)^2 \right] - A^2 . \]
It may be possible to find a change of measure and corresponding likelihood ratio so that the variance of the $v$-estimator is smaller. That would mean that the variation of $F(x)L(x)$ is smaller than the variation of $F(x)$, at least in regions that count. A good probability density $v$ is one that puts more of the probability in regions that are important for the integral, hence the term importance sampling.
Rare event simulation offers especially dramatic variance reductions. This is when $A = P_u(X \in B)$ (which is the same as $P_u(B)$). The event $B$ is rare when the probability is small. Applications call for evaluating probabilities ranging from 1% to $10^{-6}$ or smaller. A good change of measure is one that puts its weight on the most likely parts of $B$. Consider the one dimensional example where $u = N(0, 1)$ and $B = \{x > b\}$. If $b$ is large, $P_{0,1}(X > b)$ is very small. But most of the samples with $X > b$ are only a little larger than $b$. The measure $v = N(b, 1)$ is a simple way to put most of the weight near $b$. The likelihood ratio is
\[ L(x) = \frac{u(x)}{v(x)} = \frac{e^{-x^2/2}}{e^{-(x-b)^2/2}} = e^{b^2/2}\,e^{-bx} . \]
When $x = b$, the likelihood ratio is $L(b) = e^{-b^2/2}$, which is very small when $b$ is large. This is the largest value of $L(x)$ when $x \ge b$.
The probability measures $u(x)dx$ and $v(x)dx$ give two ways to estimate $P_{0,1}(X > b)$. The $u$-method is to draw $L$ independent samples $X_k \sim N(0, 1)$ and count the number of those with $X_k > b$. Most of the samples are wasted in the sense that they are not counted. The $v$-method is to draw $L$ independent samples $X_k \sim N(b, 1)$. The estimator is
\[ P_{0,1}(X > b) = \int_b^\infty L(x)v(x)\,dx \approx \frac{1}{L}\sum_{X_k > b} L(X_k) . \]
(Here $L$ without an argument is the number of samples and $L(\cdot)$ is the likelihood ratio.) Now about half of the samples are counted. But they are counted with a small weight $L(X_k) < e^{-b^2/2}$. A hit is a sample $X_k > b$. A lot of small weight hits give a lower variance estimator than a few large weight hits.
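The two estimators can be compared in a short experiment. The following is a minimal sketch under assumed parameters: the threshold `b = 4`, the sample size, and the variable names are hypothetical choices. It uses the likelihood ratio $L(x) = e^{b^2/2}e^{-bx}$ from above and compares both estimators with the exact Gaussian tail probability.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
b = 4.0                    # hypothetical threshold; P(X > 4) is about 3e-5
n_samples = 100_000

# Exact tail probability of N(0, 1), for reference.
exact = 0.5 * math.erfc(b / math.sqrt(2.0))

# u-method: sample from N(0, 1) and count hits.
X_u = rng.standard_normal(n_samples)
est_u = np.mean(X_u > b)

# v-method: sample from N(b, 1) and weight each hit by L(X_k).
X_v = b + rng.standard_normal(n_samples)
weights = np.exp(0.5 * b**2 - b * X_v)            # L(x) = e^{b^2/2} e^{-b x}
est_v = np.mean(np.where(X_v > b, weights, 0.0))

print("exact   :", exact)
print("u-method:", est_u, " hits:", int(np.sum(X_u > b)))
print("v-method:", est_v, " hits:", int(np.sum(X_v > b)))
```

With these settings the $u$-method typically sees only a handful of hits, while about half of the $v$-samples are hits, each with a weight below $e^{-b^2/2}$.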
The Girsanov theorem describes change of measure for diffusion processes. Probability distributions, or probability measures, on path space do not have probability densities. In some cases the likelihood ratio $L(x)$ can be well defined even when the probability densities $u(x)$ and $v(x)$ are not. If $x$ is a path, the likelihood ratio $L(x)$ is a path function that makes (2) true for well behaved functions $F$. Roughly speaking, a change of measure can change the drift of a diffusion process but not the noise. The Girsanov formula is the formula for the $L$ that does the change.
Girsanov's theorem has two parts. One part says when two diffusion processes may be related by a change of measure. If they can be, the two probability measures are absolutely continuous with respect to each other, or equivalent. If two probability measures are not equivalent in this sense, then at least one of them has a component that is singular with respect to the other. The other part of Girsanov's theorem is a formula for $L(x)$ in cases in which it exists. This makes the theorem useful in practice. We may compute hitting probabilities or expected payouts using any diffusion that is equivalent to the one we are interested in.

2 Probability measures
A probability measure is a function that gives the probability of any event in an appropriate class of events. If $B$ is such an event, then $P(B)$ is this probability. By class of appropriate events, we mean a $\sigma$-algebra. A probability function must be countably additive, which means that if $B_n$ is a sequence of events with $B_n \subseteq B_{n+1}$ (an expanding family of events), then
\[ \lim_{n\to\infty} P(B_n) = P\left( \bigcup_{n=1}^{\infty} B_n \right) . \tag{3} \]
This formula says that the probability of a set is in some sense a continuous function of the set. The infinite union on the right really is the limit of the sets $B_n$. Another way to write this is to suppose $C_k$ is any sequence of appropriate events, and define
\[ B_n = \bigcup_{k=1}^{n} C_k . \]
Then the $B_n$ are an expanding family. The countable additivity formula is
\[ \lim_{n\to\infty} P\left( \bigcup_{k=1}^{n} C_k \right) = P\left( \bigcup_{k=1}^{\infty} C_k \right) . \]
Every proof of every theorem in probability theory makes use of countable additivity of probability measures. We do not mention this property very often in this course, which is a signal that we are not giving full proofs.

2.1 Integration with respect to a probability measure

A probability density defines a probability measure. If the probability space is $\Omega = \mathbb{R}^n$ and $u(x)$ is a probability density for an $n$ component random variable $(x_1, \ldots, x_n)$, then
\[ P_u(B) = \int_B u(x)\,dx \]
is the corresponding probability measure. If $B$ is a small neighborhood of a specific outcome $x$, then we write its probability as $P_u(B) = dP = u(x)dx$. More generally, if $F(x)$ is a function of the random variable $x$, then
\[ E_u[F(X)] = \int F(x)\,dP(x) . \tag{4} \]
This is the same as $\int F(x)u(x)\,dx$ when there is a probability density. But the expression (4) makes sense even when $P$ is a more general probability measure. A simple definition involves a $\Delta F = 2^{-m}$ rather than a $\Delta x$. Define the events $B_k^{(\Delta F)}$ as
\[ B_k^{(\Delta F)} = \left\{ x \mid k\Delta F \le F(x) < (k+1)\Delta F \right\} . \tag{5} \]

To picture these sets, suppose $x$ is a one dimensional random variable and consider the graph of a function $F(x)$. Divide the vertical axis into equal intervals of size $\Delta F$ and draw a horizontal line for each breakpoint $k\Delta F$. The set $B_k^{(\Delta F)}$ is the part of the $x$-axis where the graph of $F$ lies in the horizontal stripe between $k\Delta F$ and $(k+1)\Delta F$. This set could consist of several intervals (for example, two intervals if $F$ is quadratic) or something more complicated if $F$ is a complicated function. If $\Omega$ is an abstract probability space, then the sets $B_k^{(\Delta F)}$ are abstract events in that space. By definition, the function $F$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}$ if each of the sets $B_k^{(\Delta F)} \in \mathcal{F}$ for each $k$ and $\Delta F$.
The probability integral (4) is defined as a limit of approximations, just as the Riemann integral and Ito integral are. The approximation in this case is motivated by the observation that if $x \in B_k^{(\Delta F)}$, then $|F(x) - k\Delta F| \le \Delta F$. Therefore, if the $dP$ integral were to make sense, we would have
\begin{align*}
\left| \int_{B_k^{(\Delta F)}} F(x)\,dP(x) - \int_{B_k^{(\Delta F)}} k\Delta F\,dP(x) \right|
&= \left| \int_{B_k^{(\Delta F)}} F(x)\,dP(x) - k\Delta F\,P\left( B_k^{(\Delta F)} \right) \right| \\
&\le \Delta F\,P\left( B_k^{(\Delta F)} \right) .
\end{align*}

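The approximation to the integral is defined by using the above approximation on each horizontal slice.
\[ \int F(x)\,dP(x) = \sum_k \int_{B_k^{(\Delta F)}} F(x)\,dP(x) \approx \sum_k k\Delta F\,P\left( B_k^{(\Delta F)} \right) . \]
This was motivation. The formal definition of the approximations is
\[ I_m = \sum_k k\Delta F\,P\left( B_k^{(\Delta F)} \right) , \quad \text{with } \Delta F = 2^{-m} . \tag{6} \]
The probability integral is defined as
\[ \int F(x)\,dP(x) = \lim_{m\to\infty} I_m . \tag{7} \]
The numbers on the right are a Cauchy sequence because if $n > m$ then
\begin{align*}
|I_m - I_n| &\le \Delta F \sum_k P\left( B_k^{(\Delta F)} \right) \\
            &= 2^{-m} \sum_k P\left( B_k^{(\Delta F)} \right) \\
            &= 2^{-m} .
\end{align*}
The expected value is the same thing as the probability integral:
\[ E_P[F(X)] = \int F(x)\,dP(x) . \]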
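The slicing construction (6) can be carried out by hand in a case where the slice probabilities are known exactly. The sketch below is an illustration under assumed data: it takes $P$ to be the uniform distribution on $[0, 1]$ and $F(x) = x^2$, both hypothetical choices, so that $B_k^{(\Delta F)}$ is the interval $[\sqrt{k\Delta F}, \sqrt{(k+1)\Delta F})$ and the exact integral is $1/3$. The function name `slice_integral` is also hypothetical.

```python
import math

def slice_integral(m):
    """Approximation I_m of (6) for F(x) = x^2 with P uniform on [0, 1]."""
    dF = 2.0 ** (-m)
    total = 0.0
    # F takes values in [0, 1], so the slices are k = 0, ..., 2^m - 1.
    for k in range(2 ** m):
        # B_k = {x : k dF <= x^2 < (k+1) dF} = [sqrt(k dF), sqrt((k+1) dF)),
        # and its probability under the uniform measure is the interval length.
        prob = math.sqrt((k + 1) * dF) - math.sqrt(k * dF)
        total += k * dF * prob
    return total

exact = 1.0 / 3.0
for m in range(1, 11):
    I_m = slice_integral(m)
    print(f"m = {m:2d}  I_m = {I_m:.6f}  error = {exact - I_m:.2e}  bound 2^-m = {2.0**-m:.2e}")
```

Because $k\Delta F \le F(x)$ on each slice, the approximations increase toward the integral from below, and the error never exceeds $\Delta F = 2^{-m}$, consistent with the Cauchy sequence estimate.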

A different view of our definition of the probability integral will be useful in the next subsection. The indicator function of an event $B$ is $1_B(x) = 1$ if $x \in B$ and $1_B(x) = 0$ if $x \notin B$. A simple function is a finite linear combination of indicator functions. We say $F(x)$ is a simple function if there are events $B_1, \ldots, B_n$ and weights $F_1, \ldots, F_n$ so that
\[ F(x) = \sum_{k=1}^{n} F_k 1_{B_k}(x) . \]
We could define the probability integral of a simple function as
\[ \int F(x)\,dP(x) = \sum_{k=1}^{n} F_k P(B_k) . \tag{8} \]
This has to be the definition of the integral of a simple function if the integral is linear and if
\[ \int 1_B(x)\,dP(x) = \int_B dP(x) = P(B) . \]
Once you know what the integral should be for simple functions, you know what it should be for any function that can be approximated by simple functions. If $F$ is a bounded function, then
\[ F^{(\Delta F)}(x) = \sum_{k\Delta F < F_{\max}} k\Delta F\,1_{B_k^{(\Delta F)}}(x) \]
satisfies $\left| F^{(\Delta F)}(x) - F(x) \right| \le \Delta F$ for all $x$. Therefore, if the concept of integration makes sense at all, the following should be true:
\begin{align*}
\int F(x)\,dP(x) &= \lim_{\Delta F \to 0} \int F^{(\Delta F)}(x)\,dP(x) \\
                 &= \lim_{\Delta F \to 0} \int \sum_{k\Delta F < F_{\max}} k\Delta F\,1_{B_k^{(\Delta F)}}(x)\,dP(x) \\
\int F(x)\,dP(x) &= \lim_{\Delta F \to 0} \sum_{k\Delta F < F_{\max}} k\Delta F\,P\left( B_k^{(\Delta F)} \right) . \tag{9}
\end{align*}
The point of this is that if the integral of indicator functions is defined, then all other integrals are automatically defined.
A fully rigorous treatment would stop here to discuss a large number of technical details. The sum that defines the approximation (6) converges if $F$ is bounded. If $F$ is not bounded, we can approximate $F$ by a bounded function and try to take the limit. The most important integration theorem is the dominated convergence theorem, which gives a condition under which pointwise convergence
\[ F_n(x) \to F(x) \quad \text{as } n \to \infty \text{ almost surely} \]
implies convergence of the probability integrals
\[ \int F_n(x)\,dP(x) \to \int F(x)\,dP(x) . \tag{10} \]
The condition concerns the maximal function
\[ M_F(x) = \sup_n |F_n(x)| . \]
If
\[ \int M_F(x)\,dP(x) < \infty , \tag{11} \]
then (10) is true. We do not give the proof in this course. A simple way to show that (11) is satisfied is to come up with a function $G(x)$ so that $|F_n(x)| \le G(x)$ for all $x$ and all $n$, and $\int G(x)\,dP(x) < \infty$. A function $G$ like this is a dominating function.

2.2 Absolute continuity of measures

If $P$ is a probability measure then a function $L(x)$ can define a new measure through the informal relation
\[ dQ(x) = L(x)\,dP(x) . \tag{12} \]
This means that for any measurable set $B$,
\[ Q(B) = \int_B L(x)\,dP(x) . \tag{13} \]
In order for $Q$ to be a probability measure, $L$ must have two properties. First, $L(x) \ge 0$ almost surely (with respect to $P$). Second,
\[ \int L(x)\,dP(x) = 1 . \tag{14} \]
If the probability measure $P$ is defined and if $L$ has these two properties, then (13) defines another probability measure $Q$.
The informal relation (12) leads to a relationship between expectation values in the $P$ and $Q$ measures:
\[ E_Q[F(X)] = E_P[L(X)F(X)] . \tag{15} \]
This becomes clearer when you use (12) to derive the equivalent probability integral equation
\[ \int F(x)\,dQ(x) = \int F(x)L(x)\,dP(x) . \tag{16} \]
You can prove this formula as suggested at the end of the previous subsection. You check that it is true for simple functions. Then it is true for any other function, because any function can be well approximated by simple functions. Moreover, it is true for simple functions if it is true for indicator functions. But if $F$ is an indicator function $F(x) = 1_B(x)$, then (16) is exactly the same as (13).
For probability measures defined by densities, the definition (12) is the same as the original definition (1), with the roles of the two densities exchanged. If $dP(x) = u(x)dx$ and $dQ(x) = v(x)dx$, then $dQ(x) = \frac{v(x)}{u(x)}\,dP(x)$.
You can ask the reverse question: given probability measures $dP$ and $dQ$, is there a function $L(x)$ so that $dQ(x) = L(x)dP(x)$? There is an obvious necessary condition, which is that any event that is impossible under $P$ is also impossible under $Q$. If $B$ is an event with $P(B) = 0$, then the definition (13) gives $Q(B) = 0$. You can see this by writing
\[ \int_B L(x)\,dP(x) = \int 1_B(x)L(x)\,dP(x) . \]
If $F(x) = 1_B(x)L(x)$ and $B_k^{(\Delta F)}$ are the sets (5) for this function, then $B_k^{(\Delta F)} \subseteq B$ for $k \ge 1$, which implies that $P(B_k^{(\Delta F)}) \le P(B) = 0$. The $k = 0$ term contributes nothing to the sum (6). Therefore, all the approximations to $\int 1_B(x)L(x)\,dP(x)$ are zero. The Radon Nikodym theorem states that this necessary condition is sufficient. If $P$ and $Q$ are any two probability measures with the same $\sigma$-algebra $\mathcal{F}$, and if $P(B) = 0$ implies that $Q(B) = 0$, then there is a function $L(x)$ that gives $Q$ from $P$ via (13). This function is called the Radon Nikodym derivative of $Q$ with respect to $P$, and is written
\[ L(x) = \frac{dQ(x)}{dP(x)} . \]
If the condition $P(B) = 0 \Longrightarrow Q(B) = 0$ is satisfied, we say that $Q$ is absolutely continuous with respect to $P$. This term (absolutely continuous) is equivalent in a special case to something that really could be called absolute continuity. But now the term is applied in this more general context. If $P$ is absolutely continuous with respect to $Q$ and $Q$ is absolutely continuous with respect to $P$, then the two measures are equivalent to each other.
If $Q$ is not absolutely continuous with respect to $P$, then there is an event $B$ that has positive probability in the $Q$ sense but probability zero in the $P$ sense. When this happens, it is usual that all of $Q$ has zero probability in the $P$ sense. We say that $Q$ is completely singular with respect to $P$ if there is an event $B$ with $Q(B) = 1$ and $P(B) = 0$. If $Q$ is completely singular with respect to $P$, then $P$ is completely singular with respect to $Q$, because the event $C = B^c$ has $P(C) = 1$ but $Q(C) = 0$. We write $P \perp Q$ when $P$ and $Q$ are completely singular with respect to each other.
Here is a statistical interpretation of absolute continuity and singularity. Suppose $X$ is a sample from $P$ or $Q$, and you want to guess whether $X \sim P$ or $X \sim Q$. Of course you could guess. But if your answer is a function of $X$ alone (and not another coin toss), then there is some set $B$ so that if $X \in B$ you say $Q$, and otherwise you say $P$. A type I error would be saying $Q$ when the answer is $P$, and a type II error is saying $P$ when the answer is $Q$.$^1$ The confidence of your procedure is $1 - P(B)$, which is the probability of accepting the null hypothesis $P$ if $P$ is true. The power of your test is $Q(B)$, which is the probability of rejecting the null hypothesis when the null hypothesis is false. If $P \perp Q$, then there is a test with 100% confidence and 100% power. You say $X \sim Q$ if $X \in B$ and you say $X \sim P$ otherwise. Conversely, if there is a test with 100% confidence and 100% power, then $P \perp Q$.
A statistical test, or, equivalently, a set $B$, is efficient if there is no way to increase the confidence in the test without decreasing its power. Equivalently, $B$ is efficient if you cannot increase its power without decreasing its confidence. The Neyman Pearson lemma says that if $B$ is efficient, then there is an $L_0$ so that $B = \{x \mid L(x) > L_0\}$. Since $L = \frac{dQ}{dP}$, the $Q$ probability is larger than the $P$ probability when $L$ is large.

2.3 Examples
Suppose $X \in \mathbb{R}^n$ is an $n$ component random variable. Suppose $P$ makes $X$ a multivariate normal with mean zero and covariance matrix $C$. If $H = C^{-1}$, then the probability density for $P$ is
\[ u(x) = c\,e^{-x^tHx/2} . \]
(The normalization constant $c$ is not the covariance matrix $C$.) We want $L(x) = c\,e^{y^tx}$ to be a likelihood ratio. What should $c$ be and what is the mean and covariance of the resulting distribution? We find the answer by computing $e^{-(x-\mu)^tH(x-\mu)/2} = e^{-x^tHx/2}\,e^{\mu^tHx}\,e^{-\mu^tH\mu/2}$. If $\mu^tH = y^t$, then $H\mu = y$ because $H$ is symmetric, so $\mu = Cy$ and $e^{-\mu^tH\mu/2} = e^{-y^tCy/2}$. Therefore
\[ L(x) = e^{y^tx}\,e^{-y^tCy/2} \tag{17} \]
is a likelihood ratio, and the $Q$ distribution has the same covariance matrix but mean $\mu = Cy$.
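The two claims about the likelihood ratio (17), that it integrates to one under $P$ and that it shifts the mean to $Cy$, can be checked by Monte Carlo. This is a minimal sketch with hypothetical values of `C` and `y`; it uses the relation $E_Q[F(X)] = E_P[L(X)F(X)]$ from (15) with $F(x) = x$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical covariance matrix and tilt vector.
C = np.array([[1.0, 0.3],
              [0.3, 0.5]])
y = np.array([0.4, -0.2])
n_samples = 400_000

X = rng.multivariate_normal(np.zeros(2), C, size=n_samples)   # samples from P = N(0, C)
L = np.exp(X @ y - 0.5 * y @ C @ y)                           # likelihood ratio (17)

mean_L = L.mean()                           # estimates E_P[L], should be near 1
mean_Q = (L[:, None] * X).mean(axis=0)      # estimates E_Q[X] = E_P[L X]

print("E_P[L]  :", mean_L)
print("E_Q[X]  :", mean_Q)
print("C y     :", C @ y)
```

The estimate of $E_Q[X]$ should agree with $Cy$ up to sampling error, even though every sample was drawn from the mean zero distribution $P$.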
1 In statistics, P would be the null hypothesis and Q the alternate. Rejecting the null

hypothesis when it is true is a type I error. Accepting the null hypothesis when it is false is
a type II error. Conservative statisticians regard type I errors as worse than type II.

Suppose X is a one dimensional random variable, the P distribution is uni-
form [0, 1] and the Q distribution is N (0, 1). Then P is absolutely continuous
with respect to Q but Q is not absolutely continuous with respect to P .
Any two Gaussian distributions in the same dimension are equivalent.
Suppose $X \in \mathbb{R}^n$, and $P = N(0, I)$. Suppose Q is the probability distribution
formed by taking $X = Y/|Y|$, where $Y \sim N(0, I)$ and $|Y| = (Y_1^2 + \cdots + Y_n^2)^{1/2}$
is the Euclidean length. Then X is a unit vector that is uniformly distributed
on the unit sphere in n dimensions. That sphere is called $S^{n-1}$, because it is an
$n-1$ dimensional surface. For example, $S^2$ is the two dimensional unit sphere
in three dimensions. The set $B = S^{n-1}$ has $Q(B) = 1$ and $P(B) = 0$.

3 Changing the drift, Girsanov's theorem

The simplest version of Girsanov's formula is a formula for the $L(x)$ that changes
standard Brownian motion to one with drift $a_t$. This L relates P, which is the
distribution for standard Brownian motion on $[0, T]$, to Q, the distribution for
Brownian motion with drift. That is

$$P: \quad dX_t = dW_t \qquad (18)$$

$$Q: \quad dX_t = a_t\,dt + dW_t . \qquad (19)$$

The formula is $dQ(x) = L(x)\,dP(x)$, where

$$L(x) = e^{\int_0^T a_t\,dX_t}\; e^{-\int_0^T a_t^2\,dt/2} . \qquad (20)$$

The integrals that enter into L are well defined. In fact, we showed that they
are defined almost surely. If P and Q are absolutely continuous with respect to
each other (are equivalent), then almost surely with respect to P is the same
as almost surely with respect to Q. On the other hand, likely with respect to
P does not mean likely with respect to Q, as our importance sampling example
shows.
There are several ways to derive Girsanov's formula (20). Here is one way
that is less slick but more straightforward. Choose a $\Delta t = T\,2^{-m}$ and let the
observations of X at the times $t_k = k\Delta t$ be assembled into a vector $\vec{X} =
(X_1, X_2, \ldots, X_{2^m})$. We are writing $X_k$ for $X_{t_k}$ as we have sometimes done
before. We write an exact formula for the joint PDF of $\vec{X}$ under P, and an
approximate formula for the joint density under Q. The ratio of these has a
well defined limit, which turns out to be (20), as $\Delta t \to 0$.
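Let $u(\vec{x})$ be the density of $\vec{X}$ under P. We find a formula for u by thinking
of a single time step that goes from $x_k$ to $x_{k+1}$. Conditional on $X_k$, the
increment $X_{k+1} - X_k$ is normal with mean zero and variance $\Delta t$. This makes
(in a hopefully clear notation)
$$u_{k+1}(x_1, \ldots, x_{k+1}) = u_k(x_1, \ldots, x_k)\, \frac{e^{-(x_{k+1} - x_k)^2/(2\Delta t)}}{\sqrt{2\pi \Delta t}} .$$
Multiplying these factors, with $n = 2^m$ and $x_0 = 0$, gives
$$u(\vec{x}) = u(x_1, \ldots, x_n) = c \exp\left( - \sum_{k=0}^{n-1} \frac{(x_{k+1} - x_k)^2}{2\Delta t} \right) . \qquad (21)$$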

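The distribution of $\vec{X}$ under Q can be written (approximately) in a similar way.
We call it $v(\vec{x})$. In the Q process, conditional on $X_k$, $X_{k+1}$ is approximately
normal with variance $\Delta t$ and mean $X_k + a_{t_k}\Delta t$. Therefore,
$$v_{k+1}(x_1, \ldots, x_{k+1}) \approx v_k(x_1, \ldots, x_k)\, \frac{e^{-(x_{k+1} - x_k - a_{t_k}\Delta t)^2/(2\Delta t)}}{\sqrt{2\pi \Delta t}} .$$
This leads to
$$v(\vec{x}) = v(x_1, \ldots, x_n) = c \exp\left( - \sum_{k=0}^{n-1} \frac{(x_{k+1} - x_k - a_{t_k}\Delta t)^2}{2\Delta t} \right)$$
$$= c \exp\left( - \sum_{k=0}^{n-1} \frac{(x_{k+1} - x_k)^2}{2\Delta t} \right) \exp\left( \sum_{k=0}^{n-1} (x_{k+1} - x_k)\, a_{t_k} \right) \exp\left( - \frac{\Delta t}{2} \sum_{k=0}^{n-1} a_{t_k}^2 \right) . \qquad (22)$$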
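Now take the quotient $L = v/u$. The first exponential on the right of (22)
cancels. The second is an approximation of an Ito integral
$$\sum_{k=0}^{n-1} (x_{k+1} - x_k)\, a_{t_k} \to \int_0^T a_t\,dX_t \quad \text{as } \Delta t \to 0 .$$
The third is an approximation to an ordinary integral:
$$\Delta t \sum_{k=0}^{n-1} a_{t_k}^2 \to \int_0^T a_t^2\,dt .$$
Therefore
$$\lim_{\Delta t \to 0} \frac{v(\vec{x})}{u(\vec{x})} = \exp\left( \int_0^T a_t\,dX_t \right) \exp\left( - \frac{1}{2} \int_0^T a_t^2\,dt \right) .$$
This is the formula (20).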
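For the time discretized densities, the ratio v/u is exactly the product of the second and third exponentials in (22), before any limit is taken. The sketch below checks this on one arbitrary path. The path, the drift values, and the function names are made up for the illustration.

```python
import math

def log_density(path, drift, dt):
    # Log of the joint density of (x_1, ..., x_n) with x_0 = 0, where each
    # increment is normal with mean drift[k] * dt and variance dt.
    total, prev = 0.0, 0.0
    for k, x in enumerate(path):
        dx = x - prev
        total += -(dx - drift[k] * dt) ** 2 / (2.0 * dt) - 0.5 * math.log(2.0 * math.pi * dt)
        prev = x
    return total

def log_girsanov(path, drift, dt):
    # Discrete version of the exponent in (20):
    # sum_k a_k (x_{k+1} - x_k) - (dt / 2) sum_k a_k^2.
    total, prev = 0.0, 0.0
    for k, x in enumerate(path):
        total += drift[k] * (x - prev) - 0.5 * drift[k] ** 2 * dt
        prev = x
    return total

dt = 0.25
path = [0.3, -0.1, 0.4, 1.0]       # an arbitrary illustrative path
drift = [1.0, 0.5, -0.2, 2.0]      # a_{t_k} for k = 0, ..., 3
zero = [0.0] * len(path)

log_ratio = log_density(path, drift, dt) - log_density(path, zero, dt)
gap = abs(log_ratio - log_girsanov(path, drift, dt))
print(gap)
```

The normalization constants cancel in the ratio, which is why the same c appears in (21) and (22).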
Note the similarity of the Gaussian path space change of measure formula
(20) to the simple Gaussian change of measure formula (17). In both cases the
first exponential has an exponent that is linear in x. The second exponential has
a quadratic exponent that normalizes L. In (17), the first exponent makes Q
larger in the direction of y, which is the direction in which $y^t x$ grows the fastest.
In (20), the pulling term $\int_0^T a_t\,dX_t$ is large when $X_t$ moves in the direction of $a_t$.
For example, if $a_t > 0$, then (19) has a drift to the right, which is the direction in
which L, given by (20), is large.

The more general Girsanov theorem concerns two stochastic processes
$$P: \quad dX_t = b_1(X_t)\,dW_t \qquad (23)$$
$$Q: \quad dX_t = a(X_t)\,dt + b_2(X_t)\,dW_t . \qquad (24)$$

One part of Girsanov's theorem is that P and Q are singular with respect to
each other unless $b_1 = b_2$. The proof of this is the quadratic variation formula
from an earlier week. If $dX_t = a(X_t)dt + b(X_t)dW_t$, then
$$[X]_t = \int_0^t b(X_s)^2\,ds . \qquad (25)$$

The quadratic variation is

$$[X]_t = \lim_{\Delta t \to 0} \sum_{t_k < t} (X_{k+1} - X_k)^2 . \qquad (26)$$

We showed that the limit exists for a given path $X_t$ almost surely. If you have
a path, you can evaluate the quadratic variation by taking the limit (26). If
this agrees with (25) for $b = b_1$, then $X_t$ came from the P process (23). If it
agrees with (25) for $b = b_2$, then $X_t$ came from the Q process (24). Since these
are the only choices, you can tell
with 100% confidence whether a given path is from P or Q.
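Here is a small simulation sketch of this test, assuming constant coefficients so that (25) gives $[X]_T = b^2 T$. The Euler time stepping, the parameter values, and the seed are assumptions for the illustration, not part of the theorem.

```python
import math
import random

def quadratic_variation(path):
    # Discrete sum in (26): sum of squared increments along the path.
    return sum((path[k + 1] - path[k]) ** 2 for k in range(len(path) - 1))

def simulate(a, b, T, n, rng):
    # Euler steps for dX = a dt + b dW with constant coefficients a and b.
    dt = T / n
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += a * dt + b * rng.gauss(0.0, math.sqrt(dt))
        path.append(x)
    return path

rng = random.Random(0)
T, n = 1.0, 100_000
qv_b1 = quadratic_variation(simulate(0.0, 1.0, T, n, rng))   # no drift, b = 1
qv_b2 = quadratic_variation(simulate(3.0, 2.0, T, n, rng))   # drift 3, b = 2
print(qv_b1, qv_b2)
```

The two sums should come out close to 1 and 4 respectively. The drift in the second path has almost no effect on the sum, which is the point of the next paragraph.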
This fact has important implications for finance, particularly medium frequency
trading. If you have, say, daily return data or ten minute return data,
you have a pretty good idea what the volatility is. The higher frequency the
measurements, the better your estimate of instantaneous volatility. But taking
more frequent measurements does not estimate the drift (the expected return)
more accurately. With high frequency data, the variance of asset prices, or
the covariances of pairs of prices, is more accurately known than the expected
return.
The final part of Girsanov's theorem is a formula for the likelihood ratio
between the measures
$$P: \quad dX_t = b(X_t)\,dW_t \qquad (27)$$
$$Q: \quad dX_t = a(X_t)\,dt + b(X_t)\,dW_t . \qquad (28)$$

This formula can be derived from approximate formulas for the probability
density in a way similar to how we got (20). But people prefer the following
argument. The change of measure
$$L(w) = e^{\int_0^T \mu_t\,dW_t}\; e^{-\int_0^T \mu_t^2\,dt/2}$$
turns the process $dW_t$ into the process $dW'_t = \mu_t\,dt + dW_t$. Therefore, if we use
this weight function on the process (27) we get
$$dX_t = b(X_t) \left( \mu_t\,dt + dW_t \right) .$$
This becomes the process (28) if
$$\mu_t = \frac{a(X_t)}{b(X_t)} .$$
Therefore, the likelihood ratio between (27) and (28) is
$$L = e^{\int_0^T \frac{a(X_t)}{b(X_t)}\,dW_t}\; e^{-\int_0^T \frac{a(X_t)^2}{b(X_t)^2}\,dt/2} .$$

Week 11
Backwards again, Feynman Kac, etc.
Jonathan Goodman
November 26, 2012

1 Introduction to the material for the week

This week has more about the relationship between SDE and PDE. We discuss
ways to formulate the solution of a PDE in terms of an SDE and how to calculate
things about an SDE using a PDE. Informally, activity of this kind is called
Feynman Kac in certain circles, and Fokker Planck in other circles. Neither
name is accurate historically, but this is not a history class.
One topic is the full forward equation. We have done pieces of it, but we
now do it in general for general diffusions. We derive the forward equation from
the backward equation using a duality argument.
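Next we discuss backward equations for multiplicative functions of a stochastic
process. If
$$f(x, t) = E_{x,t}\left[ e^{\int_t^T V(X_s)\,ds} \right] , \qquad (1)$$
and
$$dX_t = a(X_t)\,dt + b(X_t)\,dW_t , \qquad (2)$$
then
$$0 = \partial_t f + \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \left( b(x) b^t(x) \right)_{ij} \partial_{x_i} \partial_{x_j} f + \sum_{i=1}^n a_i(x)\, \partial_{x_i} f + V(x) f . \qquad (3)$$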

One of the differences between this and Girsanov's formula from last week is that
here the exponent does not have an Ito integral. The relationship between (1)
and (3) goes both ways. You can learn about the expectation (1) by solving a
PDE. Numerical PDE methods are generally more accurate than direct Monte
Carlo evaluation of (1), that is, if X does not have more than a few components.
In the other direction, you can use Monte Carlo on (1) to estimate the solution
of the PDE (3). This can be useful if the dimension of the PDE is larger than,
say, 4 or 5.
We discuss a general principle often called splitting. This says that if there
are two or more parts of the dynamics, then you find the differential equation
describing the dynamics by adding terms corresponding to each part of the
dynamics. The PDE (3) illustrates this admittedly vague principle. The quantity
f is determined by three factors (vague term not related to, say, factor analysis
in statistics): diffusion, advection, and the multiplicative functional. The
dynamical equation (3) has one term for each factor. The second term, $\frac{1}{2} \sum \cdots$,
corresponds to diffusion, $dX = b(X)dW$. The third term corresponds to advection,
$dX = a(X)dt$. The last corresponds to multiplication over a dt time
interval by $e^{V(X_t)dt}$. Splitting applies already to the SDE (2). The right hand
side has one term corresponding to advection, $a\,dt$, and another corresponding
to diffusion, $b\,dW$.

2 Backward and forward

This section has much review of things we covered earlier in the course, much
earlier in some cases. It serves as review and it puts the new material into
context.
Let $X_t$ be a Markov process of some kind. It could be a discrete time Markov
chain, or a diffusion process, or a jump process, whatever. Let S be the state
space. For each t, $X_t \in S$. Take times $t < T$ and define the value function
$f(x, t)$ by
$$f(X_t, t) = E[\, V(X_T) \mid \mathcal{F}_t ] . \qquad (4)$$

This definition applies to all kinds of Markov processes. The t variable is continuous
or discrete depending on whether the process takes place in continuous or
discrete time. The x variable is an element of the state space S. For a diffusion,
$S = \mathbb{R}^n$, so f is a function of n real variables $(x_1, \ldots, x_n)$, as in $f(x_1, \ldots, x_n, t)$.
Here n is the number of components of $X_t$.
The generator of a stochastic process is called L. The generator is a matrix
or a linear operator. Either way, the generator has the action
$$g \to Lg ,$$
that is linear: $ag \to aLg$, and $(g_1 + g_2) \to Lg_1 + Lg_2$. Here is the definition of
the generator, as it acts on a function.1 The discrete time process starts with
$X_0 = x$ and takes one discrete time step to $X_1$:

$$Lg(x) = E[\, g(X_1) ] . \qquad (5)$$

To say this more explicitly, if $h = Lg$, then $h(x) = E[\, g(X_1)]$. For example, for
a simple random walk on the integers, $S = \mathbb{Z}$. Suppose $P(X \to X + 1) = .5$,
$P(X \to X - 1) = .3$ and $P(X \to X) = .2$. Then

$$E[\, g(X_1)] = g(x + 1)\, P(X_1 = x + 1) + g(x)\, P(X_1 = x) + g(x - 1)\, P(X_1 = x - 1)$$
$$= .5\, g(x + 1) + .2\, g(x) + .3\, g(x - 1) .$$
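As a concrete instance, the generator of this random walk can be written as a short function that acts on another function. This is a minimal sketch. The function names are made up, and the probabilities are the ones stated above (.5 up, .3 down, .2 stay).

```python
def generator(g, x, p_up=0.5, p_down=0.3, p_stay=0.2):
    # (Lg)(x) = E[g(X_1) | X_0 = x] for the random walk in the text.
    return p_up * g(x + 1) + p_stay * g(x) + p_down * g(x - 1)

def square(x):
    return x * x

# For g(x) = x, Lg(x) = x + (p_up - p_down) = x + 0.2.
drift_at_3 = generator(lambda x: x, 3)
# For g(x) = x^2, Lg(x) = x^2 + 0.4 x + 0.8.
second_moment_at_3 = generator(square, 3)
print(drift_at_3, second_moment_at_3)
```

Applying L to g(x) = x shows the mean moves to the right by .2 per step, as it should when up steps are more likely than down steps.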
1 It is common to define abstract objects by their actions. There is a children's book with
the line: "Don't ask me what Voom is. I never will know. But boy let me tell you it does
clean up snow."

You can see that the matrix L is the same as the transition matrix

$$P_{ij} = P(i \to j) = P(X_{t+1} = j \mid X_t = i) .$$

This is because
$$E[\, g(X_1)] = \sum_{j \in S} P(X_1 = j \mid X_0 = x)\, g(j) .$$

If g is the column vector whose entries are the values g(i), then this says that
Lg(j) is component j of the vector Pg. The transition matrix is P. The
generator is L. They are the same but have different names.
There is a dynamics of value functions (conditional expectations) that involves
the generator. The derivation uses the tower property. For example,

$$E[\, g(X_2)] = E[\, E[\, g(X_2) \mid \mathcal{F}_1 ]] = E[\, Lg(X_1)] = (L(Lg))(x) = L^2 g(x) .$$

The expected value after s steps is

$$E[\, g(X_s)] = L^s g(x) .$$

This looks slightly different if we let $s = T - t$ be the time remaining between
t and a final time T. Then

$$E_{x,t}[\, g(X_T)] = L^{T-t} g(x) .$$

The time variable in the backward equation is the time between the start and
the stop. When you increase this time variable, you can imagine starting further
from the stopping time, or starting at the same time and running the process
longer. This is the value function for a payout of $g(X_T)$ at time T.
Now suppose you have a continuous time Markov process in a discrete state
space. At each time t, the state $X_t$ is one of the elements of S. The process is
described by a transition rate matrix, R, with $R_{ij}$ being the transition rate for
$i \to j$ transitions. This means that if $j \neq i$ are two elements of S, then

$$P(X_{t+dt} = j \mid X_t = i) = R_{ij}\,dt .$$

Another way to express this is

$$P(i \to j \text{ in time } dt \mid X_t = i) = R_{ij}\,dt .$$

The generator of a continuous time process is defined a little differently from the
generator of a discrete time process. In continuous time you need to focus on
the rate of change of quantities, not the quantities themselves. For this reason,
we define Lg(x) as (assume $X_0 = x$ as before)
$$Lg(x) = \lim_{\Delta t \to 0} \frac{E[\, g(X_{\Delta t})] - g(x)}{\Delta t} . \qquad (6)$$
In time $\Delta t$, we expect $E[\, g(X_{\Delta t})]$ not to be very different from g(x). The
definition (6) describes the rate of change. Two things about the discrete time
problem are similar to this continuous time problem. The generator is the same
as the transition rate matrix. The evolution of expectation values over a longer
time is given by the backward equation using the generator.
To see why L = R, we write approximate expressions for the probabilities
$P_{xy}(\Delta t) = P(X_{\Delta t} = y \mid X_0 = x)$. (The notation keeps changing, sometimes i, j,
sometimes x, y. This week it's not an accident.) For $y \neq x$, the probability is
approximately $R_{xy}\Delta t$. For small $\Delta t$, the same state probability $P_{xx}(\Delta t)$ is
approximately equal to 1. We define the diagonal elements of R to make the
formula
$$P_{xy}(\Delta t) = \delta_{xy} + \Delta t\, R_{xy} \qquad (7)$$
true for all x, y. It is already true for $x \neq y$. To make it true for x = y, we need
$$R_{xx} = - \sum_{y \neq x} R_{xy} .$$

The off diagonal entries of the rate matrix are non-negative. The diagonal
entries are negative. The sum over all landing states is zero:
$$\sum_{y \in S} R_{xy} = 0 .$$
Assuming this, the formula (7) for P gives
$$\sum_{y \in S} P_{xy} = \sum_{y \in S} P(X_{\Delta t} = y \mid X_0 = x) = 1 .$$

This is supposed to be true. With the definitions above, it is.
With all this, we can evaluate the limit (6). Start with
$$E[\, g(X_{\Delta t})] = \sum_{y \in S} P_{xy}(\Delta t)\, g(y) .$$
Now that (7) applies for all x, y we just get
$$E[\, g(X_{\Delta t})] \approx g(x) + \Delta t \sum_{y \in S} R_{xy}\, g(y) .$$
We substitute this into (6), cancel the g(x), and then the $\Delta t$. The result is
$$Lg(x) = \sum_{y \in S} R_{xy}\, g(y) .$$

This is Lg = Rg, if we think of R as a matrix and g as the column vector made
of the numbers g(y).
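The bookkeeping for the diagonal of R is easy to try on a small example. The sketch below uses a three state chain with made up rates. It fills in the diagonal, checks the row sums of R and of the approximate transition matrix (7), and applies the generator to a vector g.

```python
def rate_matrix(off_diagonal):
    # Fill in the diagonal so each row sums to zero: R_xx = -sum_{y != x} R_xy.
    n = len(off_diagonal)
    R = [row[:] for row in off_diagonal]
    for x in range(n):
        R[x][x] = -sum(R[x][y] for y in range(n) if y != x)
    return R

def apply_generator(R, g):
    # (Lg)(x) = sum_y R_xy g(y).
    return [sum(R[x][y] * g[y] for y in range(len(g))) for x in range(len(g))]

# Illustrative three state chain; the rates are made up for the example.
R = rate_matrix([[0.0, 2.0, 1.0],
                 [0.5, 0.0, 0.5],
                 [3.0, 0.0, 0.0]])
dt = 1e-3
# Approximate transition matrix from (7): P = I + dt R.
P = [[(1.0 if x == y else 0.0) + dt * R[x][y] for y in range(3)] for x in range(3)]
row_sums_R = [sum(row) for row in R]
row_sums_P = [sum(row) for row in P]
g = [1.0, 4.0, 9.0]
Lg = apply_generator(R, g)
print(row_sums_R, row_sums_P, Lg)
```

The rows of R sum to zero and the rows of P sum to one, up to roundoff.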
It is convenient to consider functions that depend explicitly on time when
discussing the dynamics of expectation values. Let f(x, t) be such a function.
The generator discussion above implies that

$$E[\, f(X_{\Delta t}, 0)] = f(x, 0) + \Delta t\, Lf(x, 0) + O(\Delta t^2) .$$

When you consider the explicit dependence of f on t, this becomes

$$E[\, f(X_{\Delta t}, \Delta t)] = f(x, 0) + \Delta t \left( Lf(x, 0) + \partial_t f(x, 0) \right) + O(\Delta t^2) .$$

Now consider a sequence of time steps of size $\Delta t$, with $t_k = k\Delta t$ and $t = t_n$.

$$E[\, f(X_{t_n}, t_n)] - f(x, 0) = \sum_{k=0}^{n-1} E\left[ f(X_{t_{k+1}}, t_{k+1}) - f(X_{t_k}, t_k) \right]$$
$$\approx \sum_{k=0}^{n-1} E\left[ Lf(X_{t_k}, t_k) + \partial_t f(X_{t_k}, t_k) \right] \Delta t$$

In the limit $\Delta t \to 0$, this becomes

$$E[\, f(X_t, t)] - f(x, 0) = \int_0^t E[\, Lf(X_s, s) + \partial_t f(X_s, s)]\, ds . \qquad (8)$$

This equation is true for any function f(x, t). If f satisfies the backward equation

$$Lf + \partial_t f = 0 \qquad (9)$$

then the right side is zero, and

$$E[\, f(X_t, t)] = f(x, 0) .$$

As for the discrete backward equation, we can replace the time interval [0, t]
with the interval [t, T], which gives the familiar restatement

$$E_{x,t}[\, f(X_T, T)] = f(x, t) .$$

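The definition of the generator (6) is easy to work with, particularly if the
process is subtle. Suppose $X_t$ satisfies the SDE (2) and g(x) is a twice
differentiable function. Define $\Delta X = X_{\Delta t} - x$ and make the usual approximations
$$g(X_{\Delta t}) - g(x) \approx \sum_i \Delta X_i\, \partial_{x_i} g(x) + \frac{1}{2} \sum_{ij} \Delta X_i \Delta X_j\, \partial_{x_i} \partial_{x_j} g(x) .$$

The SDE gives

$$E[\, \Delta X_i ] \approx a_i(x) \Delta t , \qquad E[\, \Delta X_i \Delta X_j ] \approx \left( b(x) b^t(x) \right)_{ij} \Delta t .$$

Therefore
$$E[\, g(X_{\Delta t}) - g(x)] \approx \left( \sum_i a_i(x)\, \partial_{x_i} g(x) + \frac{1}{2} \sum_{ij} \left( b(x) b^t(x) \right)_{ij} \partial_{x_i} \partial_{x_j} g(x) \right) \Delta t ,$$
and the definition (6) gives
$$Lg(x) = \sum_i a_i(x)\, \partial_{x_i} g(x) + \frac{1}{2} \sum_{ij} \left( b(x) b^t(x) \right)_{ij} \partial_{x_i} \partial_{x_j} g(x) . \qquad (10)$$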

It is common to specify L as an operator without putting the function g into
the expression. That would be
$$L = \sum_i a_i(x)\, \partial_{x_i} + \frac{1}{2} \sum_{ij} \left( b(x) b^t(x) \right)_{ij} \partial_{x_i} \partial_{x_j} . \qquad (11)$$

This is a differential operator. It is defined by how it acts on a function g, which
is given by (10).
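In one dimension, (10) reads $Lg = a g' + \frac{1}{2} b^2 g''$, and it is easy to evaluate numerically. The sketch below uses central differences for the derivatives. The drift $a(x) = -x$, the noise $b = 2$, the test function $g(x) = x^2$, and the step size are assumptions chosen so the exact answer is simple.

```python
def generator_1d(g, a, b, x, h=1e-3):
    # One dimensional case of (10): Lg = a(x) g'(x) + (1/2) b(x)^2 g''(x),
    # with the derivatives approximated by central differences of step h.
    g1 = (g(x + h) - g(x - h)) / (2.0 * h)
    g2 = (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)
    return a(x) * g1 + 0.5 * b(x) ** 2 * g2

# Illustrative coefficients (assumptions): mean reverting drift, constant noise.
a = lambda x: -x
b = lambda x: 2.0
g = lambda x: x * x

x0 = 1.5
numeric = generator_1d(g, a, b, x0)
exact = -2.0 * x0 * x0 + 4.0     # a g' + b^2 g''/2 = -2 x^2 + b^2
print(numeric, exact)
```

Central differences are exact for quadratics apart from roundoff, so the two numbers should agree closely.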
The general relationship between backward and forward equations may be
understood using duality. This term has several related meanings in
mathematics.2 One of them is an abstract version of the relationship between a matrix
and its transpose. In the abstract setting, a matrix becomes an operator, and
the transpose of the matrix becomes the adjoint of the operator. The transpose
of the matrix L is $L^t$. The adjoint of the operator L is $L^*$. This is important
here because if L is the generator of a Markov process, then L is the operator
that appears in the backward equation. But $L^*$, the adjoint of L, appears in
the forward equation. This can be the easiest way to figure out the forward
equation in practical examples. It will be easy to identify the forward equation
for general diffusions with generator L as in (11). But look back at Assignment
6 to see how hard it can be to derive the forward equation directly. When all the
bla bla is over, this fancy duality boils down to simple integration by parts.
We introduce abstract adjoints for operators by describing how things work
for finite dimensional vectors and matrices. In that setting, we distinguish
between the $n \times 1$ matrix, f, and the $1 \times n$ matrix, u. As a matrix, f has
one column and n rows. This is also called a column vector. As a matrix, u has
one row and n columns, which makes u a row vector. A row vector and a column
vector both have n components, but they are written in different places. Suppose A
is an $m \times n$ matrix and B is an $n \times k$ matrix. Then the matrix product AB
is defined but the product BA is not, unless k = m. If A is the n component
row vector u and B is the $n \times n$ matrix L, we have m = 1 and k = n above.
The matrix product uL = v is another $1 \times n$ matrix, or row vector. The more
traditional matrix vector multiplication involves A = L as $n \times n$ and f as $n \times 1$,
so Lf = g is $n \times 1$, which makes g another column vector.
Suppose S is a finite state space with states $x_i$ for $i = 1, \ldots, n$. We can write
the probability of state $x_i$ as $u(x_i)$ or $u_i$. If f(x) is a function of the state, we
write $f_i$ for $f(x_i)$. The expected value may be written in several ways
$$E[\, f ] = \sum_{i=1}^n u_i f_i = \sum_{x_i \in S} u(x_i) f(x_i) = \sum_{x \in S} u(x) f(x) .$$

2 For example, the dual of the icosahedron is the dodecahedron, and vice versa.

Now let u refer to the row vector with components $u_i$ and f the column vector
with components $f_i$. Then the expected value expression above may be written
as the matrix product
$$E[\, f ] = uf .$$
This is the product of a $1 \times n$ matrix, u, with the $n \times 1$ matrix f. The result is
a $1 \times 1$ matrix, which is just a single number, the expected value.
Now suppose $f = f_t = E_t[\, V(X_T)]$ and $u = u_t$ with $u_t(x) = P(X_t = x)$. The
tower property implies that the overall expectation is given by

$$E[\, V(X_T)] = E[\, E_t[\, V(X_T)]] = E[\, f_t(X_t)] = u_t f_t . \qquad (12)$$

The last form on the right is the product of the row vector $u_t$ and the column
vector $f_t$. Note, and this is the main point, that the left side is independent of
t in the range $0 \le t \le T$. This implies, in particular, that

$$u_{t+1} f_{t+1} = u_t f_t .$$
But the relationship between $f_t$ and $f_{t+1}$ is given by (5), in the form $f_t = Lf_{t+1}$.
Therefore
$$u_{t+1} f_{t+1} = u_t L f_{t+1} ,$$
for any vector $f_{t+1}$. This may be re-written using the fact that matrix
multiplication is associative as

$$(u_{t+1} - u_t L)\, f_{t+1} = 0 ,$$

for every $f_{t+1}$. If the set of all possible $f_{t+1}$ spans the whole space $\mathbb{R}^n$, this
implies that
$$u_{t+1} = u_t L . \qquad (13)$$
To summarize: if the value function satisfies the backward equation involving
the generator L, then the probability distribution satisfies a forward equation
with that same L, but used in a different way: multiplying from the left rather
than from the right. Recall that what we call L here was called P before. It
is the matrix of transition probabilities, the transition matrix for the Markov
chain.
You can give the relationship between the backward and forward equations
in a different way if you treat all vectors as column vectors. If $u_t$ is the column
vector of occupation probabilities at time t, then the expected value formula is
$E[\, V(X_T)] = u_t^t f_t$. The backward equation formula, written using the column
vector convention, is $u_{t+1}^t f_{t+1} = u_t^t L f_{t+1} = (L^t u_t)^t f_{t+1}$. The reasoning we used
to get (13) now gives the column vector formula

$$u_{t+1} = L^t u_t .$$

To summarize: $u_t f_t$, or $u_t^t f_t$, is independent of t. This implies a relationship
between the dynamics of f and the dynamics of u. The relationship is that the
matrix L that does the backward dynamics of f has an adjoint (transpose) that
does the forward dynamics of u.
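The statement that $u_t f_t$ is independent of t can be checked directly on a small chain. In this sketch the transition matrix, the payout V, and the horizon T are made up for the example. The row vector $u_t$ is run forward with (13), the column vector $f_t$ is run backward with $f_t = L f_{t+1}$, and the pairing is computed at every time.

```python
def mat_vec(L, f):
    # Backward step f_t = L f_{t+1}: L acts on a column vector.
    return [sum(L[i][j] * f[j] for j in range(len(f))) for i in range(len(L))]

def vec_mat(u, L):
    # Forward step u_{t+1} = u_t L: L acts on a row vector.
    return [sum(u[i] * L[i][j] for i in range(len(u))) for j in range(len(L[0]))]

def dot(u, f):
    return sum(ui * fi for ui, fi in zip(u, f))

# Illustrative transition matrix and payout (assumptions for the example).
L = [[0.9, 0.1, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]
V = [1.0, -2.0, 5.0]
T = 4

u = [[1.0, 0.0, 0.0]]                 # u_0: start in state 0
for t in range(T):
    u.append(vec_mat(u[t], L))        # forward equation (13)

f = [None] * (T + 1)
f[T] = V                              # final condition f_T = V
for t in range(T - 1, -1, -1):
    f[t] = mat_vec(L, f[t + 1])       # backward equation

pairings = [dot(u[t], f[t]) for t in range(T + 1)]
print(pairings)
```

All the printed numbers should be the same, and equal to $E[\, V(X_T)]$ for a chain started in state 0.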

We move to a continuous time version of this that applies to continuous time
Markov chains on a finite state space S. The dynamics for $f_t$ are
$$\dot{f}_t = -L f_t ,$$
where L, the generator, is also the matrix of transition rates. If $u_t$ is the row
vector of occupation probabilities, $u_t(x) = P(X_t = x)$, then $u_t f_t$ is independent
of t for the same reason as above. Therefore

$$0 = \frac{d}{dt} (u_t f_t) = \left( \frac{d}{dt} u_t \right) f_t + u_t \left( \frac{d}{dt} f_t \right) = \left( \frac{d}{dt} u_t \right) f_t - u_t (L f_t) .$$
This implies, as in the discrete time case above, that

$$\left( \dot{u}_t - u_t L \right) f_t = 0 ,$$
for every value function vector $f_t$. If these vectors span $\mathbb{R}^n$, then the vector in
brackets must be equal to zero:
$$\dot{u}_t = u_t L .$$
If we use the convention of treating the occupation probabilities as a column
vector, then this is
$$\dot{u}_t = L^t u_t .$$
It is easy to verify all these dynamical equations directly for finite state space
Markov chains in discrete or continuous time. For example, ...
With all this practice, the PDE argument for diffusion processes is quick.
Start with the one dimensional case with drift a(x) and noise b(x). The backward
equation is
$$\partial_t f(x, t) + Lf = \partial_t f + \frac{1}{2} b^2(x)\, \partial_x^2 f(x, t) + a(x)\, \partial_x f(x, t) = 0 .$$
Let u(x, t) be the probability density for $X_t$. Then as before the time t formula
for the expected payout is true for all t between 0 and T
$$E[\, V(X_T)] = \int_{-\infty}^{\infty} u(x, t) f(x, t)\, dx .$$

We differentiate this with respect to t and use the backward equation for f:
$$0 = \partial_t \int_{-\infty}^{\infty} u(x, t) f(x, t)\, dx$$
$$= \int_{-\infty}^{\infty} \left( \partial_t u(x, t) \right) f(x, t)\, dx + \int_{-\infty}^{\infty} u(x, t) \left( \partial_t f(x, t) \right) dx$$
$$= \int_{-\infty}^{\infty} \left( \partial_t u(x, t) \right) f(x, t)\, dx - \int_{-\infty}^{\infty} u(x, t) \left( Lf(x, t) \right) dx$$

The new trick here is to move L onto u by integration by parts. This is the
continuous state space analogue of writing $u^t (Lf)$ as $(L^t u)^t f$ in the discrete
state space case. In the integrations by parts we assume that there are no
boundary terms at $\pm\infty$. The reason for this is that the probability density
u(x, t) goes to zero very rapidly as $x \to \pm\infty$. In typical examples, such as
the Gaussian case of Brownian motion, u(x, t) goes to zero exponentially as
$x \to \pm\infty$. Therefore, even if f(x, t) does not go to zero, or even goes to infinity,
the boundary terms vanish in the limit $x \to \pm\infty$. Here is the algebra:
$$\int_{-\infty}^{\infty} u(x, t) \left( Lf(x, t) \right) dx$$
$$= \int_{-\infty}^{\infty} u(x, t) \left( \frac{1}{2} b^2(x)\, \partial_x^2 f(x, t) + a(x)\, \partial_x f(x, t) \right) dx$$
$$= \int_{-\infty}^{\infty} \left[ \frac{1}{2} \partial_x^2 \left( b^2(x) u(x, t) \right) - \partial_x \left( a(x) u(x, t) \right) \right] f(x, t)\, dx .$$

The quantity in square brackets is

$$L^* u(x, t) = \frac{1}{2} \partial_x^2 \left( b^2(x) u(x, t) \right) - \partial_x \left( a(x) u(x, t) \right) . \qquad (14)$$
This defines the operator $L^*$, which is the adjoint of the generator L. The
integration by parts above shows that
$$\int_{-\infty}^{\infty} \left( \partial_t u(x, t) - L^* u(x, t) \right) f(x, t)\, dx = 0 ,$$

for every value function f(x, t). If there are enough value functions, the only
way for all these integrals to vanish is for the u part to vanish, which is
$$\partial_t u(x, t) = L^* u(x, t) = \frac{1}{2} \partial_x^2 \left( b^2(x) u(x, t) \right) - \partial_x \left( a(x) u(x, t) \right) . \qquad (15)$$
This is the forward Kolmogorov equation for the evolution of the probability
density u(x, t).
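The adjoint relation behind (14) can be checked numerically on a grid. This is a rough sketch: the drift $a(x) = -x$, the noise $b(x) = 1 + \frac{1}{2}\sin x$, the Gaussian density u, the test function f, and the grid are all assumptions for the illustration. The derivatives are central differences and the integrals are Riemann sums.

```python
import math

def d1(values, h):
    # Central first difference on interior points, zero at the two ends.
    out = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i + 1] - values[i - 1]) / (2.0 * h)
    return out

def d2(values, h):
    # Central second difference on interior points, zero at the two ends.
    out = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i + 1] - 2.0 * values[i] + values[i - 1]) / (h * h)
    return out

h = 0.01
xs = [-8.0 + h * i for i in range(1601)]
a = [-x for x in xs]                                   # drift (assumption)
b2 = [(1.0 + 0.5 * math.sin(x)) ** 2 for x in xs]      # b(x)^2 (assumption)
u = [math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) for x in xs]
f = [math.cos(x) + x * x for x in xs]                  # does not decay

# Generator acting on f: Lf = a f' + (1/2) b^2 f''.
f1, f2 = d1(f, h), d2(f, h)
Lf = [a[i] * f1[i] + 0.5 * b2[i] * f2[i] for i in range(len(xs))]

# Adjoint acting on u, as in (14): L*u = (1/2) (b^2 u)'' - (a u)'.
b2u = [b2[i] * u[i] for i in range(len(xs))]
au = [a[i] * u[i] for i in range(len(xs))]
b2u_xx, au_x = d2(b2u, h), d1(au, h)
Lstar_u = [0.5 * b2u_xx[i] - au_x[i] for i in range(len(xs))]

lhs = h * sum(u[i] * Lf[i] for i in range(len(xs)))        # integral of u (Lf)
rhs = h * sum(Lstar_u[i] * f[i] for i in range(len(xs)))   # integral of (L*u) f
print(lhs, rhs)
```

The two sums should agree to many digits even though f grows at infinity, because u is tiny at the ends of the grid. This is the discrete version of the statement that the boundary terms vanish.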
There are many features of the forward equation to keep in mind. There are
important differences between the forward and backward equations. One difference
is that in the backward equation the noise and drift coefficients are outside
the differentiation, but they are inside in the forward equation. In both cases
it has to be like this. For the backward equation, constants are solutions,
obviously, because if the payout is $V(X_T) = c$ independent of $X_T$, then the value
function is $f(X_t, t) = c$ independent of $X_t$. You can see f(x, t) = c satisfies the
backward equation because Lc = 0. If L were to have, say $\partial_x \left( a(x) f(x, t) \right)$, then
we would have $Lc = (\partial_x a(x))\, c \neq 0$ if a is not a constant. That would be bad.
The forward equation, on the other hand, is required to preserve the integral
of u, not constants. If we had $a(x) \partial_x u(x, t)$ instead of $\partial_x \left( a(x) u(x, t) \right)$, then we
would have (if b is constant)
$$\frac{d}{dt} \int_{-\infty}^{\infty} u(x, t)\, dx = \int_{-\infty}^{\infty} \partial_t u(x, t)\, dx$$
$$= - \int_{-\infty}^{\infty} a(x)\, \partial_x u(x, t)\, dx$$
$$= \int_{-\infty}^{\infty} \left( \partial_x a(x) \right) u(x, t)\, dx ,$$

which is not equal to zero in general if $\partial_x a \neq 0$.

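A feature of the forward equation that takes scientists by surprise is that
the second derivative term is not
$$\frac{1}{2} \partial_x \left( b^2(x)\, \partial_x u(x, t) \right) .$$
This is because of the martingale property. In the no drift case, a = 0, the
process is a martingale, so the dynamics are
supposed not to change the expected value of $X_t$. Therefore
$$0 = \frac{d}{dt} E[\, X_t ] = \frac{d}{dt} \int x\, u(x, t)\, dx = \int x\, \partial_t u(x, t)\, dx .$$
If we use the correct form, this works out:
$$\int x\, \partial_t u(x, t)\, dx = \frac{1}{2} \int x\, \partial_x^2 \left( b^2(x) u(x, t) \right) dx$$
$$= - \frac{1}{2} \int (\partial_x x)\, \partial_x \left( b^2(x) u(x, t) \right) dx$$
$$= \frac{1}{2} \int \left( \partial_x^2 x \right) b^2(x) u(x, t)\, dx = 0 .$$

But the incorrect form above would give

$$\frac{1}{2} \int x\, \partial_x \left( b^2(x)\, \partial_x u(x, t) \right) dx = - \frac{1}{2} \int b^2(x)\, \partial_x u(x, t)\, dx = \frac{1}{2} \int \left( \partial_x b^2(x) \right) u(x, t)\, dx .$$

This is not equal to zero in general.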

Another crucial distinction between forward and backward is the relation
between the signs of the $\partial_t$ and $\partial_x^2$ terms. This is clearest in the simplest
example, which is Brownian motion: dX = dW, which has a = 0 and b = 1.
Then the backward equation is $\partial_t f + \frac{1}{2} \partial_x^2 f = 0$, and the forward equation is
$\partial_t u = \frac{1}{2} \partial_x^2 u$. You can appreciate the difference by writing the backward equation
in the form $\partial_t f = -\frac{1}{2} \partial_x^2 f$. The forward equation has $+\partial_x^2$ and the backward
equation has $-\partial_x^2$. This is related to the fact that the forward equation is
intended for evolving u(x, t) forward in time. If you specify u(x, 0), the
forward equation determines u(x, t) for t > 0. The backward equation is for evolving the
value function backward in time. If you specify f(x, T), the backward equation
determines f(x, t) for t < T.
One way to remember the signs is to think what should happen to a local
maximum. The probability density at a local maximum should go down
as you move forward in time. A local maximum represents a region of high
concentration of particles of probability. Moving forward in time these particles
will disperse. This causes the density of particles, the probability, to go
down. Mathematically, a local maximum of u at $x_0$ at time $t_0$ is represented
by $\partial_x u(x_0, t_0) = 0$ and $\partial_x^2 u(x_0, t_0) < 0$. At such a point, we want u to be
decreasing, which is to say $\partial_t u(x_0, t_0) < 0$. The forward equation with $+\partial_x^2$ does
this, but with $-\partial_x^2$ would get it wrong. The backward equation should lower
a local maximum in the value function f moving backward in time. Suppose
$\partial_x f(x_0, t_0) = 0$ and $\partial_x^2 f(x_0, t_0) < 0$ (a local maximum). Then suppose you start
at a time $t < t_0$ with $X_t = x_0$. There is a chance that $X_{t_0}$ will be close to $x_0$,
but it probably will miss at least a little. (Actually, it misses almost surely.)
Therefore $f(X_{t_0}, t_0) < f(x_0, t_0)$, so $f(x_0, t) = E_{x_0,t}[\, f(X_{t_0}, t_0)] < f(x_0, t_0)$. This
suggests that $\partial_t f(x_0, t_0) > 0$, which makes f decrease as you move backward in
time. The backward equation with $-\partial_x^2$ does this.
The multi-dimensional version of these calculations is similar. It is easier
if you express the calculations above in a somewhat more abstract way using
the generator L and an inner product appropriate for the situation. In finite
dimensions and for vectors with real components, the standard inner product is
$$\langle u, f \rangle = \sum_{i=1}^n u_i f_i .$$

Clearly, this is just a different notation for $u^t f$ or $uf$ depending on whether you
think of u as a column vector or a row vector. If L is an $n \times n$ matrix, the
definition of the adjoint of L is that matrix $L^*$ so that

$$\langle u, Lf \rangle = \langle L^* u, f \rangle , \qquad (16)$$

for all vectors u and f. We already did the matrix calculations to show that
$L^* = L^t$ for matrices, and with the standard inner product. We derive the
forward equation from the backward equation, in this notation, as follows. Start
with
$$\langle u_t, f_t \rangle = \langle u_{t+1}, f_{t+1} \rangle .$$
Then use the backward equation and the adjoint relation:

$$\langle u_t, f_t \rangle = \langle u_t, L f_{t+1} \rangle = \langle L^* u_t, f_{t+1} \rangle .$$

The additivity property of inner products allows us to write this in the form

$$\langle (L^* u_t - u_{t+1}), f_{t+1} \rangle = 0 .$$

This is supposed to hold for all $f_{t+1}$, which forces the vector in parentheses to
be zero (another property of inner products).

The continuous time version of this is about the same. It is a property of
inner products that
$$\frac{d}{dt} \langle u_t, f_t \rangle = \langle \frac{d}{dt} u_t, f_t \rangle + \langle u_t, \frac{d}{dt} f_t \rangle .$$
If the inner product is independent of t and f satisfies the backward equation,
this gives
$$0 = \langle \dot{u}_t, f_t \rangle - \langle u_t, L f_t \rangle$$
$$= \langle \dot{u}_t, f_t \rangle - \langle L^* u_t, f_t \rangle$$
$$= \langle (\dot{u}_t - L^* u_t), f_t \rangle .$$

As we argued before, if this holds for enough vectors $f_t$, it implies that the
quantity in parentheses must vanish, which leads to:
$$\dot{u}_t = L^* u_t . \qquad (17)$$
This says that to find the forward equation from the backward equation, you
have to find the adjoint of the generator.
The generator of the multi-dimensional diffusion process is given by (10)
or (11). We simplify the notation by defining the diffusion coefficient matrix
$\mu(x) = b(x) b^t(x)$. The components of $\mu$ are the diffusion coefficients $\mu_{jk} =
\sum_l b_{jl} b_{kl}$. The quantity that is constant in time is
$$\int_{\mathbb{R}^n} u(x_1, \ldots, x_n, t)\, f(x_1, \ldots, x_n, t)\, dx_1 \cdots dx_n . \qquad (18)$$

The adjoint of L is the operator $L^*$ so that

$$\int_{\mathbb{R}^n} \left( L^* u(x, t) \right) f(x, t)\, dx = \int_{\mathbb{R}^n} u(x, t) \left( Lf(x, t) \right) dx .$$

The adjoint is found by integration by parts. The generator is the sum of many
terms. We do the integration by parts separately for each term and add the
results. A typical first derivative part of L is $a_j(x) \partial_{x_j}$. The adjoint of this
term is found by integrating by parts in the $x_j$ variable, which is just one of the
variables in the n dimensional integration (18). That is

$$\int_{x_j = -\infty}^{\infty} u(x, t)\, a_j(x)\, \partial_{x_j} f(x, t)\, dx_j = - \int_{x_j = -\infty}^{\infty} \partial_{x_j} \left( a_j(x) u(x, t) \right) f(x, t)\, dx_j .$$

The integrations over the other variables do nothing to this. The result is

$$\int_{\mathbb{R}^n} u(x, t)\, a_j(x)\, \partial_{x_j} f(x, t)\, dx = - \int_{\mathbb{R}^n} \partial_{x_j} \left( a_j(x) u(x, t) \right) f(x, t)\, dx .$$

A typical second derivative term in L is $\mu_{jk}(x) \partial_{x_j} \partial_{x_k}$. If $j \neq k$, you move
these derivatives from f onto u by integration by parts in $x_j$ and $x_k$. The overall
sign comes out + because you do two integrations by parts. The result is

$$\int_{x_j, x_k} u(x, t)\, \mu_{jk}(x)\, \partial_{x_j} \partial_{x_k} f(x, t)\, dx_j dx_k = \int_{x_j, x_k} \partial_{x_j} \partial_{x_k} \left( \mu_{jk}(x) u(x, t) \right) f(x, t)\, dx_j dx_k .$$

You can check that this result is still true if j = k. Altogether, the adjoint of L
is
$$L^* u(x, t) = \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^n \partial_{x_j} \partial_{x_k} \left( \mu_{jk}(x) u(x, t) \right) - \sum_{j=1}^n \partial_{x_j} \left( a_j(x) u(x, t) \right) . \qquad (19)$$

The forward equation for the probability density is

$$\partial_t u(x, t) = \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^n \partial_{x_j} \partial_{x_k} \left( \mu_{jk}(x) u(x, t) \right) - \sum_{j=1}^n \partial_{x_j} \left( a_j(x) u(x, t) \right) . \qquad (20)$$

A common rookie mistake is to get the factor of $\frac{1}{2}$ wrong in the second
derivative terms. Remember that the off diagonal terms, the ones with $j \neq k$, are
given twice in (19), once with (j, k) and again with (k, j). For example, in two
dimensions, the second derivative expression is
$$\frac{1}{2} \partial_{x_1}^2 \left( \mu_{11}(x) u(x, t) \right) + \frac{1}{2} \partial_{x_2}^2 \left( \mu_{22}(x) u(x, t) \right) + \partial_{x_1} \partial_{x_2} \left( \mu_{12}(x) u(x, t) \right) .$$
The matrix $\mu = b b^t$ is symmetric, even when b is not symmetric. Therefore, it
does not matter whether you write $\mu_{12}$ or $\mu_{21}$.
This stuff is closely related to Ito's lemma. Suppose f(x, t) is some function.
The Ito differential is

$$df(X_t, t) = \partial_t f\, dt + Lf(X_t, t)\, dt + \nabla f(X_t, t)\, b(X_t)\, dW_t . \qquad (21)$$

This is not a form we have used before, but it is easy to check. The expectation
from this is
$$E[\, df(X_t, t) \mid \mathcal{F}_t ] = \left( \partial_t f(X_t, t) + Lf(X_t, t) \right) dt . \qquad (22)$$

This is more or less our definition of generator above.

3 Feynman Kac and other equations

There are backward equations for lots of other functions of a process. One
example is the running cost or running payout problem. (Engineers talk about
costs. Finance people talk about payouts.)
$$A = \int_0^T V(X_s)\, ds . \qquad (23)$$

The corresponding value function is
$$f(x, t) = E_{x,t}\left[ \int_t^T V(X_s)\, ds \right] . \qquad (24)$$
The Ito differential (21) makes it easy to find the backward equation. On one hand
you have
$$E\left[ d \int_t^T V(X_s)\, ds \;\Big|\; \mathcal{F}_t \right] = -V(X_t)\, dt .$$
On the other hand we have (21) or (22). Together, these give
$$-V(x)\, dt = \left( \partial_t f(x, t) + Lf(x, t) \right) dt .$$
This gives the backward equation
$$0 = \partial_t f + Lf + V(x) . \qquad (25)$$
If you want to know the expected value of A in (23), one way to find it is to solve the
backward equation with final condition f(x, T) = 0. Then $E[A] = f(x_0, 0)$. In
this approach you have to solve the whole PDE and compute the whole value
function just to find the single number E[A].
The correspondence between the additive function (23) and the backward
equation (25) can go both ways. You can use the PDE to find the expected value of A. You
can use the definition (24) to find the solution of the PDE (25). This is useful in
situations where the backward equation was not derived as a probability model.
It is most important when the dimension of the problem is more than a few,
maybe more than 4 or 5, where PDE solution methods are impractical.
Another connection of this kind concerns the multiplicative function
$$A = \exp\left( \int_0^T V(X_s)\, ds \right) . \qquad (26)$$

We study this using the corresponding value function

$$f(x, t) = E_{x,t}\left[ \exp\left( \int_t^T V(X_s)\, ds \right) \right] . \qquad (27)$$

We find the backward equation for this value function following the reasoning
we used for the previous one. On one hand, we have
$$d \exp\left( \int_t^T V(X_s)\, ds \right) = -V(X_t) \exp\left( \int_t^T V(X_s)\, ds \right) dt .$$

(If $Y_t$ is the additive functional $Y_t = \int_t^T V(X_s)\, ds$, then dY has no Ito part, so
$d\, e^{Y_t} = e^{Y_t}\, dY_t$. We just did $dY_t$.) So we equate this with the Ito version of df
given in (21) and (22) to get
$$E[\, df(X_t, t) \mid \mathcal{F}_t ] = \partial_t f(X_t, t)\, dt + Lf(X_t, t)\, dt = -V(X_t) f(X_t, t)\, dt .$$

Rearranging this leads to

$$\partial_t f(x, t) + Lf(x, t) + V(x) f(x, t) = 0 . \qquad (28)$$

The final condition for this one is f(x, T) = 1.

Again, the relationship between the multiplicative value function (27) and
the backward equation (28) can be useful either way. Solving the PDE allows
you to calculate the expectation of the multiplicative function. Using the
multiplicative function (27) allows you to use Monte Carlo to compute the
solution of the PDE (28). It might happen that we want the PDE solution with
final condition $f(x,T) = g(x)$ that is not identically equal to 1. In this
case, the solution formula is clearly (think this through)

$$ f(x,t) = E_{x,t}\left[ g(X_T) \exp\left( \int_t^T V(X_s)\,ds \right) \Big|\; \mathcal{F}_t \right] . $$

This solution formula for the PDE is called the Feynman Kac formula. A version
of this was proposed by the physicist Feynman in the 1940s for the related PDE
that has $iL$ instead of $L$. Feynman's formula was criticized by mathematicians
for not being rigorous, in my opinion somewhat unfairly. The mathematician
Kac³ discovered the present completely rigorous version of Feynman's formula.
Modern probabilists, particularly applied probabilists working in finance or
operations research, call the PDE (28) the Feynman Kac formula instead. This
reverses the original intention. Some go even further, using the term Feynman
Kac for any backward equation of any kind.
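
The Monte Carlo use of the multiplicative value function can be sketched in a few lines. This is only an illustration under assumptions that are not in the notes: the process is taken to be standard Brownian motion ($dX = dW$, so $L = \frac{1}{2}\partial_x^2$), the potential is $V(x) = x$, and $g = 1$. In that special case the PDE has the explicit solution $f(x,t) = \exp\left(x(T-t) + (T-t)^3/6\right)$, which gives something to compare against. The function name `feynman_kac_mc` and its parameters are hypothetical.

```python
import math
import random

def feynman_kac_mc(x, t, T, V, g, n_paths=4000, n_steps=50, seed=0):
    """Monte Carlo estimate of E_{x,t}[ g(X_T) exp(int_t^T V(X_s) ds) ].

    Assumption for this sketch: X is standard Brownian motion started at x,
    simulated on a uniform time grid. The time integral is approximated by a
    left-endpoint Riemann sum, so there is a small discretization bias.
    """
    rng = random.Random(seed)
    dt = (T - t) / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        X = x
        integral = 0.0
        for _ in range(n_steps):
            integral += V(X) * dt               # accumulate int V(X_s) ds
            X += sqrt_dt * rng.gauss(0.0, 1.0)  # Brownian increment
        total += g(X) * math.exp(integral)
    return total / n_paths

# Illustrative case with a known answer: V(x) = x, g = 1, x = 0, t = 0, T = 1.
estimate = feynman_kac_mc(0.0, 0.0, 1.0, V=lambda x: x, g=lambda x: 1.0)
exact = math.exp(1.0 / 6.0)
```

The estimate carries both statistical error (of order one over the square root of the number of paths) and time-stepping error, so it should be compared with `exact` only up to a tolerance.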

³ Pronounced "cats". Another spelling of the same Polish name is Katz.

Week 2
Discrete Markov chains
Jonathan Goodman
September 17, 2012

1 Introduction to the material for the week

This week we discuss Markov random processes in which there is a list of
possible states. We introduce three mathematical ideas: a $\sigma$-algebra to
represent a state of partial information, measurability of a function with
respect to a discrete $\sigma$-algebra, and a filtration that represents gaining
information over time. Filtrations are a convenient way to describe the Markov
property and to give the general definition of a martingale, the latter a few
weeks from now. Associated with Markov chains are the backward and forward
equations that describe the evolution of probabilities and expectation values
over time. Forward and backward equations are one of the main "calculate
things" methods of stochastic calculus.

A stochastic process in discrete time is a sequence $(X_1, X_2, \ldots)$, where
$X_n$ is the state of the system at time $n$. The path up to time $T$ is
$X_{[1:T]} = (X_1, X_2, \ldots, X_T)$. The state $X_n$ must be in the state
space, $S$. Last week $S$ was $\mathbb{R}^d$, for a linear Gaussian process.
This week, $S$ is either a finite set $S = \{x_1, x_2, \ldots, x_m\}$, or an
infinite countable set of the form $S = \{x_1, x_2, \ldots\}$. A set such as $S$
is discrete if it is finite or countable. The set of all real numbers, like the
Gaussian state space $\mathbb{R}^d$, is not discrete because the real numbers
are not countable (a famous theorem of Georg Cantor). Spaces that are not
discrete may be called continuous. If we define $X_t$ for all times $t$, then
$X_t$ is a continuous time process. If $X_n$ is defined only for integers $n$
(or another discrete set of times), then we have a discrete time process. This
week is about discrete time discrete state space stochastic processes.
We will be interested in discrete Markov chains partly for their own sake
and partly because they are a setting where the general definitions are easy to
give without much mathematical subtlety. The concepts of measurability and
filtration for continuous time or continuous state space are more technical and
subtle than we have time for in this class. The same is true of backward and
forward equations. They are rigorous this week, but heuristic when we go to
continuous time and state space. That is the $\Delta X \to 0$ and
$\Delta t \to 0$ aspect of stochastic calculus.

The main examples of Markov chains will be random walk and mean reverting
random walk. These are discrete versions of Brownian motion and the Ornstein
Uhlenbeck process respectively.

2 Basic probability
This section gives some basic general definitions in probability theory in a
setting where they are not technical. Look to later sections for more examples.

Philosophically, a probability space, $\Omega$, is the set of all possible
outcomes of a probability experiment. Mathematically, $\Omega$ is just a set.
In abstract discussions, we usually use $\omega$ to denote an element of
$\Omega$. In concrete settings, the elements of $\Omega$ have more concrete
descriptions. This week, $\Omega$ will usually be the path space consisting of
all paths $x_{[1:T]} = (x_1, \ldots, x_T)$, with $x_n \in S$ for each $n$.
If $S$ is discrete and $T$ is finite then the path space is discrete. If $T$ is
infinite, $\Omega$ is not discrete. The discussion below needs to be given more
carefully in that case, which we do not do in this class.

An event is the answer to a yes/no question about the outcome $\omega$.
Equivalently, an event is a subset of the probability space:
$A \subseteq \Omega$. You can interpret $A$ as the set of outcomes where the
answer is "yes", and $A^c = \{\omega \in \Omega \mid \omega \notin A\}$ is the
complementary set where the answer is "no". We often describe an event using a
version of set notation where the informal definition of the event goes inside
curly braces, such as $\{X_3 \neq X_5\}$ to describe
$\{x_{[1:T]} \mid x_3 \neq x_5\}$.
A $\sigma$-algebra is a mathematical model of a state of partial knowledge about
the outcome. Informally, if $\mathcal{F}$ is a $\sigma$-algebra and
$A \subseteq \Omega$, we say that $A \in \mathcal{F}$ if we know whether
$\omega \in A$ or not. The most used $\sigma$-algebra in stochastic processes is
the one that represents knowing the first $n$ states in a path. This is called
$\mathcal{F}_n$. To illustrate this, if $n \geq 2$, then the event
$A = \{X_1 = X_2\} \in \mathcal{F}_n$. If we know both $X_1$ and $X_2$, then we
know whether $X_1 = X_2$. On the other hand, we do not know whether
$X_n = X_{n+1}$, so $\{X_n \neq X_{n+1}\} \notin \mathcal{F}_n$.

Here is the precise definition of a $\sigma$-algebra. The empty set is written
$\emptyset$. It is the event with no elements. We say $\mathcal{F}$ is a
$\sigma$-algebra if (explanations of the axioms in parentheses):
(i) $\Omega \in \mathcal{F}$ and $\emptyset \in \mathcal{F}$.
(Regardless of how much you know, you know whether $\omega \in \Omega$, it is,
and you know whether $\omega \in \emptyset$, it isn't.)
(ii) If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$.
(If you know whether $\omega \in A$ then you know whether $\omega \notin A$. It
is the same information.)
(iii) If $A \in \mathcal{F}$ and $B \in \mathcal{F}$, then
$A \cup B \in \mathcal{F}$ and $A \cap B \in \mathcal{F}$.
(If you can answer the questions "$\omega \in A$?" and "$\omega \in B$?", then
you can answer the question "$\omega \in (A \text{ or } B)$?" If $\omega \in A$
or $\omega \in B$, then $\omega \in (A \cup B)$.)
(iv) If $A_1, A_2, \ldots$ is a sequence of events in $\mathcal{F}$, then
$\bigcup_k A_k \in \mathcal{F}$.
(If any of the $A_k$ is "yes", then the whole thing is "yes". The only way to
get "no" for "$\omega \in \bigcup_k A_k$?" is for every one of the $A_k$ to be
"no".)

This is not a minimal list. There are some redundancies. For example, if you
have axiom (ii), and $\Omega \in \mathcal{F}$, then it follows that
$\emptyset = \Omega^c \in \mathcal{F}$. The last axiom is called countable
additivity. You need countable additivity to do $\Delta t \to 0$ stochastic
calculus.
A function of a random variable, sometimes called a random variable, is a
real valued (later, vector valued) function of $\omega \in \Omega$. For example,
if $\Omega$ is path space and the outcome is a path, and $a \in S$ is a specific
state, then the function could be

$$ f(x_{[1:T]}) = \begin{cases} \min\{n \mid x_n = a\} & \text{if there is such an } n \\ T & \text{otherwise.} \end{cases} $$

This is called a hitting time and is often written $\tau_a$. It would be more
complete to write $\tau_a(x_{[1:T]})$, but it is common not to write the
function argument. Such a function has a discrete set of values when $\Omega$ is
discrete. If there is a finite or infinite list of all elements of $\Omega$,
then there is a finite or infinite list of possible values of $f$. Some of the
definitions below are simpler and less technical when $\Omega$ is discrete.
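
The hitting time is a concrete path function, so it can be written directly as code. The sketch below assumes a path is stored as a Python tuple and that times are numbered from 1, as in the notes; the name `hitting_time` is only illustrative.

```python
def hitting_time(path, a):
    """First time n (counting from 1) with x_n == a, or T = len(path) if a is never hit."""
    for n, x in enumerate(path, start=1):
        if x == a:
            return n
    return len(path)

# The path (0, 1, 2, 1) first visits state 1 at time n = 2.
tau = hitting_time((0, 1, 2, 1), 1)
```

With this definition the value $T$ is returned both when the state is first hit at time $T$ and when it is never hit, exactly as in the formula above.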
A function of a random variable, $f(\omega)$, is measurable with respect to
$\mathcal{F}$ if the value of $f$ can be determined from the information in
$\mathcal{F}$. If $\Omega$ is discrete, this means that for any number $F$, the
question "$f(\omega) = F$?" can be answered using the information in
$\mathcal{F}$. More precisely, it means that for any $F$, the event
$A_F = \{\omega \in \Omega \mid f(\omega) = F\}$ is an element of
$\mathcal{F}$. To be clear, $A_F = \emptyset$ for most values of $F$ because
there is a finite or countable list of $F$ values for which
$A_F \neq \emptyset$.

Let $\mathcal{F}_n$ be a family of $\sigma$-algebras, defined for
$n = 0, 1, 2, \ldots$. They form a filtration if
$\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for each $n$. This is a very general
model of acquiring new information at time $n$. The only restriction is that in
a filtration you do not forget anything. If you know the answer to a question at
time $n$, then you still know at time $n+1$. The set of questions you can answer
at time $n$ is a subset of the set of questions you can answer at time $n+1$.
It is common that $\mathcal{F}_0$ is the trivial $\sigma$-algebra,
$\mathcal{F}_0 = \{\emptyset, \Omega\}$, in which you can answer only trivial
questions. The most important filtration for us has $\Omega$ being the path
space and $\mathcal{F}_n$ knowing the path up to time $n$. This is called the
natural filtration, or the filtration generated by the process $X_n$, in which
$\mathcal{F}_n$ knows the values $X_1, \ldots, X_n$.

Suppose $\mathcal{F}_n$ is a filtration and $f_n(\omega)$ is a family of
functions. We say that the functions are progressively measurable, or
non-anticipating, or adapted to the filtration, or predictable, if $f_n$ is
measurable with respect to $\mathcal{F}_n$ for each $n$. There are subtle
differences between these concepts for continuous time processes, differences we
will ignore in this class. Adapted functions are important in several ways. In
the Ito calculus that is the core of this course, the integrand in the Ito
integral must be adapted. In stochastic control problems, you try to decide on a
control at time $n$ using only the information available at time $n$. A
realistic stochastic control must be non-anticipating.

Partitions are a simple way to describe $\sigma$-algebras in discrete
probability. A partition, $\mathcal{P}$, is a collection of events that is
mutually exclusive and collectively exhaustive. That means that if
$\mathcal{P} = \{A_1, \ldots\}$, then:
(i) $A_i \cap A_j = \emptyset$ whenever $i \neq j$.
(mutually exclusive)
(ii) $\bigcup_i A_i = \Omega$.
(collectively exhaustive)

The events $A_i$ are the elements of the partition $\mathcal{P}$. They form a
partition if each $\omega \in \Omega$ is a member of exactly one partition
element. A partition may have finitely or countably many elements. One example
of a partition, if $\Omega$ is the path space, has one partition element for
every state $y \in S$. We say $x_{[1:T]} \in A_y$ if and only if $x_1 = y$. Of
course, we could partition using the value $x_2$, etc.

In discrete probability, there is a one to one correspondence between
partitions and $\sigma$-algebras. If $\mathcal{F}$ is a $\sigma$-algebra, the
corresponding partition, informally, is the finest grained information contained
in $\mathcal{F}$. To say this more completely, we say that $\mathcal{F}$
distinguishes $\omega$ from $\omega'$ if there is an $A \in \mathcal{F}$ so that
$\omega \in A$ and $\omega' \notin A$. For example, let $\mathcal{F}_2$ be part
of the natural filtration of path space. Suppose
$\omega = (y_1, y_2, y_3, \ldots)$, and $\omega' = (y_1, y_2, z_3, \ldots)$.
Then $\mathcal{F}_2$ does not distinguish $\omega$ from $\omega'$. The
information in $\mathcal{F}_2$ cannot answer the question "$y_3 = z_3$?". For
any $\omega \in \Omega$, the set of $\omega'$ that cannot be distinguished from
$\omega$ is an event, called the equivalence class of $\omega$. The set of all
equivalence classes forms a partition of $\Omega$ (check properties (i) and (ii)
above: if two equivalence classes overlap, then they are the same). If
$B_\omega$ is the equivalence class of $\omega$, then

$$ B_\omega = \bigcap A , \quad \text{with } \omega \in A \text{ and } A \in \mathcal{F} . $$

Therefore, $B_\omega \in \mathcal{F}$ (countable additivity of $\mathcal{F}$).
Clearly, if $A \in \mathcal{F}$, then $A$ is the union of all the equivalence
classes contained in $A$. Therefore, if you have $\mathcal{P}$, you can create
$\mathcal{F}$ by taking all countable unions of elements of $\mathcal{P}$.

The previous paragraph is full of lengthy but easy verifications that you
may not go through completely or remember long. What you should remember
is what a partition is and how it carries the information in a
$\sigma$-algebra. The information in $\mathcal{F}$ does not tell you which
outcome happened, but it does tell you which partition element it was in. A
function $f$ is measurable with respect to $\mathcal{F}$ if and only if it is
constant on each partition element of the corresponding $\mathcal{P}$. The
information in $\mathcal{F}$ determines the partition element, $B_j$, which
determines the value of $f$.
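
The "constant on each partition element" criterion is easy to check by brute force on a small path space. The following is a minimal sketch, assuming a two-state space $S = \{0, 1\}$ and $T = 3$; the function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def path_space(states, T):
    """All paths (x_1, ..., x_T) with each x_n in the given state list."""
    return list(product(states, repeat=T))

def natural_partition(omega, n):
    """Partition for F_n: two paths share a block iff their first n states agree."""
    blocks = {}
    for path in omega:
        blocks.setdefault(path[:n], []).append(path)
    return list(blocks.values())

def is_measurable(f, partition):
    """f is measurable iff it takes a single value on every partition element."""
    return all(len({f(path) for path in block}) == 1 for block in partition)

omega = path_space((0, 1), 3)        # 8 paths
P2 = natural_partition(omega, 2)     # 4 blocks, one per value of (x_1, x_2)

depends_on_first_two = is_measurable(lambda x: x[0] + x[1], P2)   # True
depends_on_third = is_measurable(lambda x: x[2], P2)              # False
```

Taking `n = 0` gives a single block containing every path, which corresponds to the trivial $\sigma$-algebra.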
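A probability distribution on $\Omega$ is an assignment of a probability to each
outcome in $\Omega$. If $\omega \in \Omega$, then $P(\omega)$ is the probability
of $\omega$. Naturally, $P(\omega) \geq 0$ for all $\omega$ and
$\sum_{\omega \in \Omega} P(\omega) = 1$. If $A$ is an event, then
$P(A) = \sum_{\omega \in A} P(\omega)$. If $f(\omega)$ is a function of a random
variable, then

$$ E[f] = \sum_{\omega \in \Omega} f(\omega) P(\omega) . $$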

3 Conditioning
Conditioning is about how probabilities change as you get more information.
The simplest conditional expectation tells you how the probability of event $A$
changes if you know the event $B$ happened. The formula is often called Bayes'
rule:

$$ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A \cap B)}{P(B)} . \tag{1} $$

As a simple check, note that if $A \cap B = \emptyset$, then the event $B$ rules
$A$ out completely. Bayes' rule (1) gets this right, as $P(\emptyset) = 0$ in
the numerator. Bayes' rule does not say how to define conditional probability if
$P(B) = 0$. This is a serious drawback in continuous probability. For example,
if $(X_1, X_2)$ is a bivariate normal, then $P(X_1 = 4) = 0$, but we saw last
week how to calculate $P(X_2 > 0 \mid X_1 = 4)$ (say). The conditional
probability of a particular outcome is given by Bayes' rule too. Just take $A$
to be the event that $\omega$ happened, which is written $\{\omega\}$:

$$ P(\omega \mid B) = \begin{cases} P(\omega)/P(B) & \text{if } \omega \in B \\ 0 & \text{if } \omega \notin B . \end{cases} $$

The conditional expected value is

$$ E[f \mid B] = \sum_{\omega \in B} f(\omega) P(\omega \mid B) = \frac{\sum_{\omega \in B} f(\omega) P(\omega)}{P(B)} . \tag{2} $$

You can check that this formula gives the right answer when $f(\omega) = 1$ for
all $\omega$. We indeed get $E[1 \mid B] = 1$.
You can think of expectation as something you say about a random function
when you know nothing but the probabilities of various outcomes. There is a
version of conditional expectation that describes how your understanding of $f$
will change when you get the information in $\mathcal{F}$. Suppose
$\mathcal{P} = (B_1, B_2, \ldots)$ is the partition that is determined by
$\mathcal{F}$. When you learn the information in $\mathcal{F}$, you will learn
which of the $B_j$ happened. The conditional expectation of $f$, conditional on
$\mathcal{F}$, is a function of $\omega$ determined by this information.

$$ E[f \mid \mathcal{F}](\omega) = E[f \mid B_j] \quad \text{if } \omega \in B_j . \tag{3} $$

To say this differently, if $g = E[f \mid \mathcal{F}]$, then $g$ is a function
of $\omega$. If $\omega \in B_j$, then $g(\omega) = E[f \mid B_j]$.

You can see that the conditional expectation, $g = E[f \mid \mathcal{F}]$, is
constant on partition elements $B_j \in \mathcal{P}$. This implies that $g$ is
measurable with respect to $\mathcal{F}$, which is another way of saying that
$E[f \mid \mathcal{F}]$ is determined by the information in $\mathcal{F}$. The
ordinary expectation is conditional expectation with respect to the trivial
$\sigma$-algebra $\mathcal{F}_0 = \{\emptyset, \Omega\}$. The corresponding
partition has only one element, $\Omega$. The conditional expectation has the
same value for every element of $\Omega$, and that value is $E[f]$.
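
Formulas (2) and (3) translate directly into code when the probability space is a finite list. The sketch below is an illustration only: the fair die, the odd/even partition, and the function names are assumptions made for the example, and exact rational arithmetic is used so the results can be compared without rounding.

```python
from fractions import Fraction

def cond_exp_given_event(f, prob, B):
    """E[f | B] = sum_{w in B} f(w) P(w) / P(B), as in formula (2)."""
    p_B = sum(prob[w] for w in B)
    return sum(f(w) * prob[w] for w in B) / p_B

def cond_exp(f, prob, partition):
    """E[f | F] as a dict w -> E[f | B_j], where B_j is the block containing w (formula (3))."""
    g = {}
    for B in partition:
        value = cond_exp_given_event(f, prob, B)
        for w in B:
            g[w] = value
    return g

# Illustrative probability space: one roll of a fair die.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
odd_even = [[1, 3, 5], [2, 4, 6]]      # partition: "was the roll odd or even?"

g = cond_exp(lambda w: w, prob, odd_even)
# g is constant on each block: 3 on the odd rolls, 4 on the even rolls.

trivial = cond_exp(lambda w: w, prob, [list(prob)])
# With the trivial partition every outcome gets the ordinary expectation 7/2.
```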
The tower property is a fact about conditional expectations. It leads to
backward equations, which are a powerful way to calculate conditional
expectations. Suppose $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$
is a filtration, and $f_n = E[f \mid \mathcal{F}_n]$. The tower property is

$$ f_n = E[f_{n+1} \mid \mathcal{F}_n] . \tag{4} $$

This is a consequence of a simpler and more general statement. Suppose
$\mathcal{G}$ is a $\sigma$-algebra with more information than $\mathcal{F}$,
which means $\mathcal{F} \subseteq \mathcal{G}$. Suppose $f$ is some function,
$g = E[f \mid \mathcal{G}]$, and $h = E[f \mid \mathcal{F}]$. Then
$h = E[g \mid \mathcal{F}]$. To say this another way, you can condition from $f$
down to $h$ directly, which is $E[f \mid \mathcal{F}]$, or you can do it in two
stages, which is
$f \to g = E[f \mid \mathcal{G}] \to E[g \mid \mathcal{F}]$. The result is the
same.

One proof of the tower property makes use of the partitions associated with
$\mathcal{F}$ and $\mathcal{G}$. The partition for $\mathcal{G}$ is a refinement
of the partition for $\mathcal{F}$. This means that you make the partition
elements for $\mathcal{G}$ by cutting up partition elements of $\mathcal{F}$.
Every $C_i$ that is a partition element of $\mathcal{G}$ is completely contained
in one of the partition elements of $\mathcal{F}$. Said another way, if $\omega$
and $\omega'$ are two elements of $C_i$, they are indistinguishable using the
information in $\mathcal{G}$, which surely makes them indistinguishable using
$\mathcal{F}$, which is less information. This is why $C_i$ cannot contain
outcomes from different $B_j$.
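Now it is just a calculation. Let $h'(\omega) = E[g \mid \mathcal{F}]$. For
$\omega \in B_j$, it is intuitively clear (and we will verify) that

$$
\begin{aligned}
h'(\omega) &= \sum_{C_i \subseteq B_j} E[g \mid C_i]\, P(C_i \mid B_j) && (5) \\
&= \sum_{C_i \subseteq B_j} E[f \mid C_i]\, P(C_i \mid B_j) \\
&= \sum_{C_i \subseteq B_j} \sum_{\omega \in C_i} f(\omega) P(\omega \mid C_i)\, P(C_i \mid B_j) \\
&= \sum_{C_i \subseteq B_j} \sum_{\omega \in C_i} \frac{f(\omega) P(\omega)}{P(C_i)} \, \frac{P(C_i)}{P(B_j)} \\
&= \sum_{C_i \subseteq B_j} \sum_{\omega \in C_i} f(\omega) P(\omega) \frac{1}{P(B_j)} \\
&= \sum_{\omega \in B_j} f(\omega) \frac{P(\omega)}{P(B_j)} \\
&= h(\omega) .
\end{aligned}
$$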

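The first line (5) is a convenient way to think about the partition produced
by the $\sigma$-algebra $\mathcal{G}$. The partition elements $C_i$ play the
role of elementary outcomes $\omega \in \Omega$. The partition $\mathcal{P}$
plays the role of the probability space $\Omega$. Instead of $P(\omega)$, you
have $P(C_i)$. If the function $g$ is measurable with respect to $\mathcal{G}$,
then $g$ has the same value for each $\omega \in C_i$, so you might as well call
this $g(C_i)$. And of course, since $g$ is constant on $C_i$, if
$\omega \in C_i$, then $g(\omega) = E[g \mid C_i]$.

We justify (5), for $\omega \in B_j$, using

$$
\begin{aligned}
h'(\omega) &= E[g \mid B_j] \\
&= \sum_{\omega \in B_j} g(\omega) P(\omega \mid B_j) \\
&= \sum_{C_i \subseteq B_j} \sum_{\omega \in C_i} g(\omega) P(\omega) \frac{1}{P(B_j)} \\
&= \sum_{C_i \subseteq B_j} g(C_i) \sum_{\omega \in C_i} P(\omega) \frac{1}{P(B_j)} \\
&= \sum_{C_i \subseteq B_j} g(C_i) \frac{P(C_i)}{P(B_j)} .
\end{aligned}
$$

The last line is the same as (5), because $g(C_i) = E[g \mid C_i]$ and
$P(C_i)/P(B_j) = P(C_i \mid B_j)$ when $C_i \subseteq B_j$.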
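
The tower property can also be checked numerically on a small example. The sketch below assumes a made-up probability space (three independent biased coin flips), a made-up function $f$, and partitions that record the first one or two flips. None of these choices come from the notes; they only give a concrete case where $E[f \mid \mathcal{F}]$ and $E[\,E[f \mid \mathcal{G}] \mid \mathcal{F}\,]$ can be compared exactly.

```python
from fractions import Fraction
from itertools import product

def cond_exp(f, prob, partition):
    """Return E[f | partition] as a dict: outcome -> E[f | block containing it]."""
    result = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        value = sum(f[w] * prob[w] for w in block) / p_block
        for w in block:
            result[w] = value
    return result

def prefix_partition(omega, n):
    """Blocks of outcomes that agree in their first n coordinates."""
    blocks = {}
    for w in omega:
        blocks.setdefault(w[:n], []).append(w)
    return list(blocks.values())

# Hypothetical example: three independent flips with P(1) = 1/3 on each flip.
omega = list(product((0, 1), repeat=3))
prob = {}
for w in omega:
    p = Fraction(1)
    for x in w:
        p *= Fraction(1, 3) if x == 1 else Fraction(2, 3)
    prob[w] = p

f = {w: w[0] + 2 * w[1] * w[2] + w[2] for w in omega}   # an arbitrary function
partition_F = prefix_partition(omega, 1)   # coarser: knows the first flip
partition_G = prefix_partition(omega, 2)   # finer: knows the first two flips

g = cond_exp(f, prob, partition_G)         # g  = E[f | G]
h = cond_exp(f, prob, partition_F)         # h  = E[f | F]
h_prime = cond_exp(g, prob, partition_F)   # h' = E[g | F]
```

Because the arithmetic is exact, `h` and `h_prime` agree entry by entry, not just approximately.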

4 Markov chains
This section, like the previous two, lacks examples. You might want to read the
next section together with this one for examples.

A Markov chain is a stochastic process where the present is all the
information about the past that is relevant for predicting the future. The
$\sigma$-algebra definitions in the previous sections express these ideas
easily. Here is a definition of the natural filtration $\mathcal{F}_n$. Let
$x_{[1:T]}$ and $x'_{[1:T]}$ be two paths in the path space, $\Omega$. Suppose
$n \leq T$. We say the paths are indistinguishable at time $n$ if $x_k = x'_k$
for $k = 1, 2, \ldots, n$. This definition of indistinguishability gives rise to
a partition of $\Omega$, with two paths being in the same partition element if
they are indistinguishable. The $\sigma$-algebra corresponding to this partition
is $\mathcal{F}_n$. A function $f(x_{[1:T]})$ is measurable with respect to
$\mathcal{F}_n$ if it is determined by the first $n$ states
$(x_1, \ldots, x_n)$. More precisely, if $x_k = x'_k$ for
$k = 1, 2, \ldots, n$, then $f(x_{[1:T]}) = f(x'_{[1:T]})$. The
$\sigma$-algebra that knows only the value of $x_n$ is $\mathcal{G}_n$. A path
function is measurable with respect to $\mathcal{G}_n$ if and only if it is
determined by the value $x_n$ alone.
Let $\Omega$ be the path space and $P(\cdot)$ a probability distribution on
$\Omega$. Then $P(\cdot)$ has the Markov property if, for all $x \in S$ and
$n = 1, \ldots, T-1$,

$$ P(X_{n+1} = x \mid \mathcal{F}_n) = P(X_{n+1} = x \mid \mathcal{G}_n) . \tag{6} $$

Unwinding all the definitions, this is the same as saying that for any path up to
time $n$, $(x_1, \ldots, x_n)$,

$$ P(X_{n+1} = x \mid X_1 = x_1, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n) . $$

You might complain that we have defined conditional expectation but not
conditional probability in (6). The answer is a trick for defining probability
from expectation. The indicator function of an event $A$ is
$\mathbf{1}_A(\omega)$, which is equal to 1 if $\omega \in A$ and 0 otherwise.
Then $P(A) = E[\mathbf{1}_A]$. In particular, if $A = \{\omega\}$, then, using
the notation slightly incorrectly, $P(\omega) = E[\mathbf{1}_\omega]$. This
applies to conditional expectation too:
$P(\omega \mid \mathcal{F}) = E[\mathbf{1}_\omega \mid \mathcal{F}]$. But you
should be alert to the fact that the latter statement is more complicated, in
that both sides are functions on $\Omega$ (measurable with respect to
$\mathcal{F}$) rather than just numbers. The simple notation hides the
complexity.
The probabilities in a Markov chain are determined by transition
probabilities, which are the numbers defined by the right side of (6). The
probability $P(X_{n+1} = x \mid \mathcal{G}_n)$ is measurable with respect to
$\mathcal{G}_n$, which means that it is a function of $x_n$, which we call $y$
to simplify notation. The transition probabilities are

$$ p_{n,yx} = P(X_{n+1} = x \mid X_n = y) . \tag{7} $$

You can remember that it is $p_{n,yx}$ instead of $p_{n,xy}$ by saying that
$p_{n,yx} = P(y \to x)$ is the probability of a $y$ to $x$ transition in one
step.
You can use the definition (7) even if the process does not have the Markov
property. What is special about Markov chains is that the numbers (7) determine
all other probabilities. For example, we will show that

$$ P(X_{n+2} = x \text{ and } X_{n+1} = y \mid X_n = z) = p_{n+1,yx}\, p_{n,zy} . \tag{8} $$

What is behind this, besides the Markov property, is a general fact about
conditioning. If $A$, $B$, and $C$ are any three events, then

$$ P(A \text{ and } B \mid C) = P(A \mid B \text{ and } C)\, P(B \mid C) . $$

Without the Markov property, this leads to

$$ P(X_{n+2} = x \text{ and } X_{n+1} = y \mid X_n = z) = P(X_{n+2} = x \mid X_{n+1} = y \text{ and } X_n = z) \cdot P(X_{n+1} = y \mid X_n = z) . $$

According to the Markov property, the first probability on the right is
$P(X_{n+2} = x \mid X_{n+1} = y)$, which gives (8).

There is something in the spirit of (8) that is crucial for the backward
equation below. This is that the Markov property applies to the whole future
path, not just one step into the future. For example, (8) implies that
$P(X_{n+2} = x \mid \mathcal{F}_n) = P(X_{n+2} = x \mid \mathcal{G}_n)$. The
same line of reasoning justifies the stronger statement that for any
$x_{n+1}, \ldots, x_T$,

$$
\begin{aligned}
P(X_{n+1} = x_{n+1}, \ldots, X_T = x_T \mid \mathcal{F}_n) &= P(X_{n+1} = x_{n+1}, \ldots, X_T = x_T \mid \mathcal{G}_n) \\
&= \prod_{k=n}^{T-1} p_{k, x_k x_{k+1}} .
\end{aligned}
$$
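
The product formula says that the probability of a whole future path is a product of one-step transition probabilities. Here is a minimal sketch, assuming for simplicity a homogeneous chain (the same matrix at every step) and a hypothetical two-state matrix stored as a list of rows with `P[y][x]` playing the role of $p_{yx}$.

```python
from itertools import product

def path_probability(P, path):
    """P(X_{n+1} = path[1], ..., X_{n+k} = path[k] | X_n = path[0]) for a homogeneous chain."""
    prob = 1.0
    for y, x in zip(path, path[1:]):
        prob *= P[y][x]          # one factor p_{yx} per transition y -> x
    return prob

# Hypothetical two-state transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

p_example = path_probability(P, (0, 0, 1, 1))    # 0.9 * 0.1 * 0.5

# Summing over every possible 3-step future from state 0 gives total probability 1.
total = sum(path_probability(P, (0,) + future)
            for future in product((0, 1), repeat=3))
```

For a chain that is not homogeneous, the same loop would use a different matrix at each step.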

Going one more step, consider a function $f$ that depends only on the future:
$f = f(x_{n+1}, x_{n+2}, \ldots, x_T)$. Then

$$ E[f \mid \mathcal{F}_n] = E[f \mid \mathcal{G}_n] . \tag{9} $$

If you like thinking about $\sigma$-algebras, you could say this by defining the
future $\sigma$-algebra, $\mathcal{H}_n$, that is determined only by information
in the future. Then (9) holds for any $f$ that is measurable with respect to
$\mathcal{H}_n$.

A Markov chain is homogeneous if the transition probabilities do not depend
on $n$. Most of the Markov chains that arise in modeling are homogeneous.
Much of the theory is for homogeneous Markov chains. From now on, unless we
explicitly say otherwise, we will assume that a Markov chain is homogeneous.
Transition probabilities will be $p_{yx}$ for every $n$.
The forward equation is the equation that describes how probabilities evolve
over time in a Markov chain. Last week we saw that we could evolve the mean
and variance of a linear Gaussian discrete time process
($X_{n+1} = AX_n + BZ_n$) using $\mu_{n+1} = A\mu_n$ and
$C_{n+1} = AC_nA^t + BB^t$. This determines
$X_{n+1} \sim \mathcal{N}(\mu_{n+1}, C_{n+1})$ from the information
$X_n \sim \mathcal{N}(\mu_n, C_n)$. It is possible to formulate a more general
forward equation that gives the distribution of $X_{n+1}$ even if $X_n$ is not
Gaussian. But we do not need that here.

Let $p_{yx}$ be the transition probabilities of a discrete state space Markov
chain. Let $u_n(y) = P(X_n = y)$. The forward equation is a formula for the
numbers $u_{n+1}$ in terms of the numbers $u_n$. The derivation is simple:

$$
\begin{aligned}
u_{n+1}(x) &= P(X_{n+1} = x) \\
&= \sum_{y \in S} P(X_{n+1} = x \mid X_n = y)\, P(X_n = y) \\
u_{n+1}(x) &= \sum_{y \in S} u_n(y)\, p_{yx} && (10)
\end{aligned}
$$


The step from the first to second line uses what is sometimes called the law
of total probability. The terms are rearranged in the last line for the following
reason ...
We reformulate the Markov chain forward equation in matrix/vector terms.
Suppose the state space is finite and $S = \{x_1, \ldots, x_m\}$. We will say
"state $j$" instead of "state $x_j$", etc. We write $u_{n,j} = P(X_n = j)$
instead of $u_n(x_j) = P(X_n = x_j)$. We collect the probabilities into a row
vector $u_n = (u_{n,1}, \ldots, u_{n,m})$. The transition matrix, $P$, is the
$m \times m$ matrix of transition probabilities. The $(i,j)$ entry of $P$ is
$p_{ij} = P(i \to j) = P(X_{n+1} = j \mid X_n = i)$. The forward equation is

$$ u_{n+1,j} = \sum_{i=1}^m u_{n,i}\, p_{ij} . $$

In matrix terms, this is just

$$ u_{n+1} = u_n P . \tag{11} $$

The row vector $u_{n+1}$ is the product of the row vector $u_n$ and the
transition matrix $P$. It is a tradition to make $u_n$ a row vector and put it
on the left. You will come to appreciate the wisdom of this unusual choice over
the coming weeks.

Once we have a linear algebra formulation, many tools of linear algebra
become available. For example, powers of the transition matrix trace the
evolution of $u$ over several steps:

$$ u_{n+2} = u_{n+1} P = (u_n P) P = u_n P^2 . $$

Clearly $u_{n+k} = u_n P^k$ for any $k$. This means we can study the evolution
of Markov chain probabilities using the eigenvalues and eigenvectors of the
transition matrix $P$.
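
The forward equation (11) is a row vector times a matrix, which takes only a few lines to compute. The sketch below uses plain Python lists and a hypothetical two-state transition matrix; the function names are illustrative, not part of the notes.

```python
def forward_step(u, P):
    """One step of u_{n+1} = u_n P, with u a row vector and P[i][j] = P(i -> j)."""
    m = len(P)
    return [sum(u[i] * P[i][j] for i in range(m)) for j in range(m)]

def forward(u, P, k):
    """Apply the forward equation k times, which gives u_n P^k."""
    for _ in range(k):
        u = forward_step(u, P)
    return u

# Hypothetical two-state chain, started in the first state with probability 1.
P = [[0.5, 0.5],
     [0.25, 0.75]]
u0 = [1.0, 0.0]
u1 = forward(u0, P, 1)    # [0.5, 0.5]
u2 = forward(u0, P, 2)    # [0.375, 0.625]
```

Each $u_n$ stays a probability vector: the entries are non-negative and sum to one at every step.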
The other major equation is the backward equation, which propagates
conditional expectations backward in time. Take $\mathcal{F}_n$ to be the
natural filtration generated by the path up to time $n$. Take
$f(x_{[1:T]}) = V(x_T)$. This is a final time payout. The function is completely
determined by the state of the system at time $T$. We want to characterize
$f_n = E[f \mid \mathcal{F}_n]$ and see how to calculate it using $V$ and $P$.
The characterization comes from the Markov property (9), which implies that
$f_n$ is a function of $x_n$. The backward equation determines this function.

The backward equation for Markov chains follows from the tower property
(4). Use the transition probabilities (7). Since
$E[f_{n+1} \mid \mathcal{F}_n]$ is measurable with respect to $\mathcal{G}_n$,
the expression $E[f_{n+1} \mid \mathcal{F}_n](x_n)$ makes sense:

$$
\begin{aligned}
f_n(x_n) &= E[f_{n+1} \mid \mathcal{F}_n](x_n) \\
&= \sum_{x_{n+1} \in S} P(X_{n+1} = x_{n+1} \mid X_n = x_n)\, f_{n+1}(x_{n+1}) \\
&= \sum_{x_{n+1} \in S} p_{x_n x_{n+1}}\, f_{n+1}(x_{n+1}) .
\end{aligned}
$$

It is simpler to write the last formula for generic $x_n = x$ and
$x_{n+1} = y$.

$$ f_n(x) = \sum_{y \in S} p_{xy}\, f_{n+1}(y) . \tag{12} $$

This is one form of the backward equation.

The backward equation gets its name from the fact that it determines $f_n$
from $f_{n+1}$. If you think of $n$ as a time variable, then time runs backwards
for the equation. To find the solution, you start with the final condition
$f_T(x) = V(x)$, then compute $f_{T-1}$ using (12), and continue.

The backward equation may be expressed in matrix/vector terms. As we
did when doing this for the forward equation, we suppose the state space is
$S = \{1, 2, \ldots, m\}$. Then we define the column vector
$f_n \in \mathbb{R}^m$ whose components are $f_{n,j} = f_n(j)$. The elements of
the transition matrix are $p_{ij} = P(i \to j)$. The right side of (12) is the
matrix/vector product $P f_{n+1}$, so the equation is

$$ f_n = P f_{n+1} . \tag{13} $$
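
Marching the backward equation (13) from the final condition $f_T = V$ down to an earlier time is a short loop. The following sketch assumes a hypothetical two-state chain and payout vector; the function names are made up for the illustration.

```python
def backward_step(P, f_next):
    """One step of f_n = P f_{n+1}, with f a column vector and P[i][j] = P(i -> j)."""
    m = len(P)
    return [sum(P[i][j] * f_next[j] for j in range(m)) for i in range(m)]

def value_function(P, V, n, T):
    """f_n(x) = E[V(X_T) | X_n = x]: start from f_T = V and step back T - n times."""
    f = list(V)
    for _ in range(T - n):
        f = backward_step(P, f)
    return f

# Hypothetical two-state chain; the payout V is 1 in the second state, 0 in the first.
P = [[0.5, 0.5],
     [0.25, 0.75]]
V = [0.0, 1.0]
f1 = value_function(P, V, 1, 2)    # [0.5, 0.75]
f0 = value_function(P, V, 0, 2)    # [0.625, 0.6875]
```

As a consistency check, `f0[0]` is the probability of being in the second state after two steps when starting from the first state, which is also what two steps of the forward equation give for that entry.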

The forward and backward equations use the same matrix $P$, but the forward
equation multiplies from the left by a row vector of probabilities, while the
backward equation multiplies from the right by a column vector of conditional
expectations.

The transition matrix of a homogeneous Markov chain is an $m \times m$ matrix.
You can ask which matrices arise in this way. A matrix that can be the
transition matrix for a Markov chain is called a stochastic matrix. There are
two obvious properties that characterize stochastic matrices. The first is that
$p_{ij} \geq 0$ for all $i$ and $j$. The transition probabilities are
probabilities, and probabilities cannot be negative. The second is that

$$ \sum_{j=1}^m p_{ij} = 1 \quad \text{for all } i = 1, \ldots, m. \tag{14} $$

This is because $S$ is a complete list of the possible states at time $n+1$. If
$X_n = i$, then $X_{n+1}$ is one of the states $1, 2, \ldots, m$. Therefore, the
probabilities for the landing states (the state at time $n+1$) add up to one.
If you know $P$ is a stochastic matrix, you know two things about its
eigenvalues. One of those things is that $\lambda = 1$ is an eigenvalue. The
proof of this is to give the corresponding eigenvector, which is the column
vector of all ones: $\mathbf{1} = (1, 1, \ldots, 1)^t$. If $g = P\mathbf{1}$,
then the components of $g$ are, using (14),

$$ g_i = \sum_{j=1}^m p_{ij} \mathbf{1}_j = \sum_{j=1}^m p_{ij} = 1 , $$

for all $i$. This shows that $g = P\mathbf{1} = \mathbf{1}$. This result is
natural in the Markov chain setting. The statement
$f_{n+1} = E[V(X_T) \mid \mathcal{F}_{n+1}] = 1$ for all $x_j \in S$ means that
given the information in $\mathcal{F}_{n+1}$, the expected value equals 1 no
matter what. But $\mathcal{F}_n$ has less information. All you know at time $n$
is that in the next step you will go to a state where the expected value is 1.
But this makes the expected value at time $n$ equal to 1 already.
The other thing you know is that if $\lambda$ is an eigenvalue of $P$ then
$|\lambda| \leq 1$. This is a consequence of the maximum principle for $P$,
which we now explain. Suppose $f \in \mathbb{R}^m$ is any vector and $g = Pf$.
The maximum principle for $P$ is

$$ \max_i g_i \leq \max_j f_j . \tag{15} $$

Some simple reasoning about $P$ proves this. First observe that if
$h \in \mathbb{R}^m$ and $h_j \geq 0$ for all $j$, then

$$ (Ph)_i = \sum_j p_{ij} h_j \geq 0 \quad \text{for all } i. $$

This is because all the terms on the right, $p_{ij}$ and $h_j$, are
non-negative. Now let $M = \max_j f_j$. Then define $h$ by $h_j = M - f_j$, so
that $M\mathbf{1} = f + h$. Since $P(M\mathbf{1}) = M\mathbf{1}$, we know
$g_i + (Ph)_i = M$. Since $(Ph)_i \geq 0$, this implies that $g_i \leq M$, which
is the statement (15). Using similar arguments,

$$ |g_i| \leq \sum_j p_{ij} |f_j| \leq \left( \sum_j p_{ij} \right) \max_j |f_j| , $$

you can show that even if $f$ is complex,

$$ \max_i |g_i| \leq \max_j |f_j| . \tag{16} $$

This implies that if $\lambda$ is an eigenvalue of $P$, then
$|\lambda| \leq 1$. That is because the equation $Pf = \lambda f$ with
$|\lambda| > 1$ violates (16) by a factor of $|\lambda|$.
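
The two defining properties of a stochastic matrix, the eigenvector of all ones, and the maximum principle can all be checked numerically on a small example. This is a sketch with a hypothetical $3 \times 3$ matrix and test vector; the helper names are assumptions made for the illustration.

```python
def is_stochastic(P, tol=1e-12):
    """Check p_ij >= 0 and that every row sums to 1 (property (14)) up to a tolerance."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )

def apply(P, f):
    """Matrix/vector product g = P f, with f treated as a column vector."""
    return [sum(p * fj for p, fj in zip(row, f)) for row in P]

# Hypothetical 3-state stochastic matrix and an arbitrary vector f.
P = [[0.75, 0.25, 0.0],
     [0.5, 0.25, 0.25],
     [0.0, 0.5, 0.5]]
f = [3.0, -1.0, 2.0]

g = apply(P, f)                     # [2.0, 1.75, 0.5]
ones = apply(P, [1.0, 1.0, 1.0])    # P 1 = 1, so lambda = 1 is an eigenvalue
```

Each $g_i$ is a weighted average of the $f_j$ with weights $p_{ij}$, which is why no entry of $g$ can exceed the largest entry of $f$.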
A stochastic matrix is called ergodic if:

(i) The eigenvalue $\lambda = 1$ is simple.

(ii) If $\lambda \neq 1$ is an eigenvalue of $P$, then $|\lambda| < 1$.

A Markov chain is ergodic if its transition matrix is ergodic. (Warning: the true
definition of ergodicity applies to Markov chains. There is a theorem stating
that if $S$ is finite, then the Markov chain is ergodic if and only if the
eigenvalues of $P$ satisfy the conditions above. Our definitions are not
completely wrong, but they might be misleading.) Most of our examples are
ergodic.
Last week we studied the issue of whether the Markov process (a linear
Gaussian process last week, a discrete Markov chain this week) has a statistical
steady state that it approaches as $n \to \infty$. You can ask the same question
about discrete Markov chains. A probability distribution, $\pi$, is stationary,
or steady state, or statistical steady state, if
$u_n = \pi \implies u_{n+1} = \pi$. That is the same as saying that
$X_n \sim \pi \implies X_{n+1} \sim \pi$. The forward equation (11) implies that
a stationary probability distribution must satisfy the equation $\pi = \pi P$.
This says that $\pi$ is a left eigenvector of $P$ with eigenvalue
$\lambda = 1$. We know there is at least one left eigenvector with eigenvalue
$\lambda = 1$ because $\lambda = 1$ is a right eigenvalue with eigenvector
$\mathbf{1}$.

If $S$ is finite, it is a theorem that the chain is ergodic if and only if it
satisfies both of the following conditions:
(i) There is a unique row vector $\pi$ with $\pi = \pi P$ and
$\sum_i \pi_i = 1$.
(ii) Let $u_1$ be any distribution on $m$ states and $X_1 \sim u_1$. If
$X_n \sim u_n$, then $u_n \to \pi$ as $n \to \infty$.

Without discussing these issues thoroughly, you can see the relation between
these theorems about probabilities and the statements about eigenvalues of $P$
above. If $P$ has a unique eigenvalue equal to one and the rest less than one in
absolute value, then $u_{n+1} = u_n P$ has $u_n \to \pi$ (the eigenvector), as
$n \to \infty$. But $\pi$ is the eigenvector corresponding to eigenvalue
$\lambda = 1$. It might be that $u_n \to c\pi$, but we know $c = 1$ because
$\sum_i u_{n,i} = 1$, and similarly for $\pi$.
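
The convergence $u_n \to \pi$ suggests a simple way to approximate the stationary distribution: iterate the forward equation from any starting distribution. The sketch below assumes a hypothetical ergodic two-state chain whose stationary distribution can be found by hand ($\pi = (1/3, 2/3)$ solves $\pi = \pi P$); the function name and the fixed iteration count are illustrative choices, not a recommended production method.

```python
def stationary_by_iteration(P, n_steps=200):
    """Approximate pi = pi P by iterating u_{n+1} = u_n P from the uniform distribution.

    This only converges for ergodic chains; n_steps is an arbitrary illustrative choice.
    """
    m = len(P)
    u = [1.0 / m] * m
    for _ in range(n_steps):
        u = [sum(u[i] * P[i][j] for i in range(m)) for j in range(m)]
    return u

# Hypothetical ergodic two-state chain; its eigenvalues are 1 and 0.25.
P = [[0.5, 0.5],
     [0.25, 0.75]]
pi = stationary_by_iteration(P)

# Residual of the stationarity equation pi = pi P, entry by entry.
residual = [sum(pi[i] * P[i][j] for i in range(2)) - pi[j] for j in range(2)]
```

The error shrinks like the second largest eigenvalue modulus raised to the power $n$, here $0.25^n$, which is the eigenvalue picture described above.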

5 Discrete random walk

This section discusses Markov chains where the states are integers in some range
and the only transitions are $i \to i$ or $i \to i \pm 1$. The non-zero
transition probabilities are $a_i = P(i \to i-1) = p_{i,i-1}$, and
$b_i = P(i \to i) = p_{i,i}$, and $c_i = P(i \to i+1) = p_{i,i+1}$. These are
called random walks, particularly if $S = \mathbb{Z}$ and the transition
probabilities are independent of $i$. A reflecting random walk is one where
moves to $i < 0$ are blocked (reflected). In this case the state space is the
non-negative integers $S = \mathbb{Z}_+$, and $a_0 = 0$, and $b_0 = a + b$. For
$i > 0$, $a_i = a$, $b_i = b$, and $c_i = c$. You can interpret the transition
probabilities at $i = 0$ as saying that a proposed transition $0 \to -1$ is
rejected, leading the state to stay at $i = 0$. You cannot tell the difference
between a pure random walk and a reflecting walk until the walker hits the
reflecting boundary at $i = 0$. You could get a random walk on a finite state
space by putting a reflecting boundary also at $i = m-1$ (chosen so that
$S = \{0, 1, \ldots, m-1\}$ and $|S| = m$). The transition probabilities at the
right boundary would be $a_{m-1} = a$, $b_{m-1} = b + c$, and $c_{m-1} = 0$.
These probabilities reject proposed $m-1 \to m$ transitions.
The phrase birth death process is sometimes used for random walks on the
state space Z+ with transition probabilities that depend in a more general way
on i. The term birth/death comes from the idea that Xn is the number of
animals at time n. Then ai is the probability that one dies, ci is the probability
that one is born, and bi = 1 ai ci is the probability that the probability
does not change. You might think that i = 0 would be an absorbing state in
that P(0 ! 1) = 0 because no animals can be born if there are no animals.
Birth/death processes do not necessarily assume this. They permit storks.
Urn processes are a family of probability models that give rise to Markov
chains on the state space $\{0, \ldots, m-1\}$. An urn is a large clay jar. In urn
processes, we talk about one or more urns each with balls of one or more colors.
Here is an example that has one urn, two colors (red and blue) and a probability,
$q$. At each stage there are $m$ balls in the urn (we can't stay with $m-1$ any longer).
Some are red and the rest blue. You choose one ball from the urn, each with the
same probability to be chosen (a well mixed urn). You replace that ball with a
new one whose color is red with probability $q$ and blue with probability $1-q$.
Let $X_n$ be the number of red balls at time $n$. In this process $X_n \to X_n - 1$ if
you choose a red ball and replace it with a blue ball. If you choose a blue
ball and replace it with a red ball, then $X_n \to X_n + 1$. If you replace red with
red, or blue with blue, then $X_{n+1} = X_n$.
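The urn rules above can be turned into a transition matrix. The following Python sketch is one way to do it, assuming states $0, \ldots, m$ for the number of red balls; the function name `urn_transition_matrix` and the sample values $m = 5$, $q = 0.3$ are illustrative choices, not part of the notes.

```python
import numpy as np

def urn_transition_matrix(m, q):
    """Transition matrix for the number of red balls (states 0..m).

    Hypothetical helper: m is the number of balls, q the probability
    that the replacement ball is red.
    """
    P = np.zeros((m + 1, m + 1))
    for i in range(m + 1):
        down = (i / m) * (1 - q)      # choose red, replace with blue
        up = ((m - i) / m) * q        # choose blue, replace with red
        if i > 0:
            P[i, i - 1] = down
        if i < m:
            P[i, i + 1] = up
        P[i, i] = 1 - down - up       # red -> red or blue -> blue
    return P

P_urn = urn_transition_matrix(m=5, q=0.3)
print(P_urn.sum(axis=1))  # each row should sum to one
```

The matrix is tridiagonal because the number of red balls changes by at most one per step.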
A matrix is tridiagonal if all its entries are zero when $|i - j| > 1$. Its non-zeros
are on the main diagonal, the one super-diagonal, and one sub-diagonal.
This section is about Markov chains whose transition matrix is tridiagonal.
Consider, for a simple example, the random walk with a reflecting boundary at
$i = 0$ and $i = 5$. Suppose the transition probabilities are $a = \frac12$, and
$b = c = \frac14$. The transition matrix is the $6 \times 6$ matrix
\[
P = \begin{pmatrix}
\frac34 & \frac14 & 0 & 0 & 0 & 0 \\
\frac12 & \frac14 & \frac14 & 0 & 0 & 0 \\
0 & \frac12 & \frac14 & \frac14 & 0 & 0 \\
0 & 0 & \frac12 & \frac14 & \frac14 & 0 \\
0 & 0 & 0 & \frac12 & \frac14 & \frac14 \\
0 & 0 & 0 & 0 & \frac12 & \frac12
\end{pmatrix} . \tag{17}
\]
There is always confusion about how to number the first row and column.
Markov chain people like to start the numbering with $i = 0$, as we did above.
The tradition in linear algebra is to start numbering with $i = 1$. These notes
try to keep the reader on her or his toes by doing both, starting with $i = 1$ when
describing the entries in $P$, and starting with $i = 0$ when describing the Markov
chain transition probabilities. The transition matrix $P$ above has $p_{12} = \frac14$,
which indicates that if $X_n = 1$, there is a 25% chance that $X_{n+1} = 2$. Since a
walker at state 1 is not allowed to go lower than 1 or higher than 2 in one step, the
rest of the probability must be to stay at $i = 1$, which is why $p_{11} = \frac34$. In the
first column we have $p_{21} = \frac12$, which is the probability of a $2 \to 1$ transition.
From state $i = 2$ there are three possible transitions, $2 \to 1$ with probability
$\frac12$ as we just said, $2 \to 2$ with probability $\frac14$, and $2 \to 3$, with
probability $\frac14$. The row sums of this matrix are all equal to one, and so are
most of the column sums. But the first column sum is $p_{11} + p_{21} = \frac54 > 1$
and the last is $p_{56} + p_{66} = \frac34 < 1$.
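A quick numerical check of the row and column sums may help here. This is a minimal sketch, assuming NumPy is available; the helper name `reflecting_walk_matrix` is made up for this example, and the array indices start at 0 (the Markov chain convention) rather than 1.

```python
import numpy as np

def reflecting_walk_matrix(m, a, b, c):
    """Tridiagonal transition matrix on {0, ..., m-1} with reflecting ends.

    Illustrative helper (not from the notes); assumes a + b + c = 1.
    """
    P = np.zeros((m, m))
    for i in range(m):
        if i > 0:
            P[i, i - 1] = a           # step down
        if i < m - 1:
            P[i, i + 1] = c           # step up
        P[i, i] = 1 - P[i].sum()      # stay, including rejected moves
    return P

P = reflecting_walk_matrix(6, a=0.5, b=0.25, c=0.25)
row_sums = P.sum(axis=1)
col_sums = P.sum(axis=0)
print(row_sums)  # all ones
print(col_sums)  # first is 1.25, last is 0.75, the rest are 1
```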
My computer says that
\[
P^2 = \begin{pmatrix}
0.6875 & 0.2500 & 0.0625 & 0 & 0 & 0 \\
0.5000 & 0.3125 & 0.1250 & 0.0625 & 0 & 0 \\
0.2500 & 0.2500 & 0.3125 & 0.1250 & 0.0625 & 0 \\
0 & 0.2500 & 0.2500 & 0.3125 & 0.1250 & 0.0625 \\
0 & 0 & 0.2500 & 0.2500 & 0.3125 & 0.1875 \\
0 & 0 & 0 & 0.2500 & 0.3750 & 0.3750
\end{pmatrix} .
\]
For example, $p^{(2)}_{4,2} = \frac14$, which says that the probability of going from 4
to 2 in two hops is $\frac14$. The only way to do that is
$X_n = 4 \to X_{n+1} = 3 \to X_{n+2} = 2$. The probability of this is
$p_{43}\, p_{32} = \frac12 \cdot \frac12 = \frac14$. The probability of $3 \to 3$ in two
steps is $p^{(2)}_{33} = \frac{5}{16}$. The three paths that do this, with their
probabilities, are $P(3 \to 2 \to 3) = \frac12 \cdot \frac14 = \frac{2}{16}$, and
$P(3 \to 3 \to 3) = \frac14 \cdot \frac14 = \frac{1}{16}$, and
$P(3 \to 4 \to 3) = \frac14 \cdot \frac12 = \frac{2}{16}$. These add up to $\frac{5}{16}$.
The matrix $P^2$ is the transition matrix for the Markov chain that says "take two
hops with the $P$ chain". Therefore its row sums should equal 1, as
$p^{(2)}_{11} + p^{(2)}_{12} + p^{(2)}_{13} = \frac{11}{16} + \frac{4}{16} + \frac{1}{16} = 1$.

My computer says that, for $n = 100$ (think $n = \infty$),
\[
P^n = \begin{pmatrix}
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159 \\
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159 \\
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159 \\
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159 \\
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159 \\
0.5079 & 0.2540 & 0.1270 & 0.0635 & 0.0317 & 0.0159
\end{pmatrix} . \tag{18}
\]
These numbers have the form $p^{(\infty)}_{i,j} = 2^{-j}/s$, where
$s = \sum_{j=1}^{6} 2^{-j}$. The formula for $s$ comes from the requirement that the
row sums of $p^{(\infty)}$ are 1. The fact that
$p^{(\infty)}_{i,j+1} / p^{(\infty)}_{ij} = \frac12$ comes from the following theory. Think
of starting with $X_0 = i$. Then $u_{1,j} = P(X_1 = j \mid X_0 = i) = p_{ij}$. Similarly,
$u_{n,j} = P(X_n = j \mid X_0 = i) = p^{(n)}_{ij}$. If this $P$ is ergodic (it is), then
$u_{n,j} \to \pi_j$ as $n \to \infty$. This implies that $p^{(n)}_{ij} \to \pi_j$ as
$n \to \infty$ for any $i$. The $P^n$ in (18) seems to fit that, at least in the
fact that all the rows are the same because the limit of $p^{(n)}_{ij}$ is independent
of $i$.

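The equation $\pi = \pi P$ determines $\pi$. For our $P$ in (17), these equations are
\begin{align*}
\pi_1 &= \pi_1 \tfrac34 + \pi_2 \tfrac12 \\
\pi_2 &= \pi_1 \tfrac14 + \pi_2 \tfrac14 + \pi_3 \tfrac12 \\
&\;\;\vdots \\
\pi_5 &= \pi_4 \tfrac14 + \pi_5 \tfrac14 + \pi_6 \tfrac12 \\
\pi_6 &= \pi_5 \tfrac14 + \pi_6 \tfrac12 .
\end{align*}
You can check that a solution is $\pi_j = c\,2^{-j}$. The value of $c$ comes from the
requirement that $\sum_j \pi_j = 1$ ($\pi$ is a probability distribution).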
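The claims about $P^{100}$ and the stationary distribution can be checked numerically. The sketch below assumes NumPy and rebuilds the matrix (17) directly from its three diagonals; the variable names are illustrative.

```python
import numpy as np

# The matrix (17): a = 1/2 below the diagonal, c = 1/4 above it,
# b = 1/4 on the diagonal except at the two reflecting ends.
P = (np.diag([0.75, 0.25, 0.25, 0.25, 0.25, 0.5])
     + np.diag([0.25] * 5, k=1)
     + np.diag([0.5] * 5, k=-1))

P100 = np.linalg.matrix_power(P, 100)

# Candidate stationary distribution pi_j = c * 2^{-j}, j = 1, ..., 6.
j = np.arange(1, 7)
pi = 2.0 ** (-j)
pi /= pi.sum()

print(np.round(P100[0], 4))  # first row of P^100
print(np.round(pi, 4))       # should agree to the digits shown
print(np.allclose(pi @ P, pi))
```

Every row of `P100` should agree with `pi` to about four digits, which is the row-independence seen in (18).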

Week 3
Continuous time Gaussian processes
Jonathan Goodman
September 24, 2012

1 Introduction to the material for the week

This week we take the limit $\Delta t \to 0$. The limit is a process $X_t$ that is
defined for all $t$ in some range, such as $t \in [0, T]$. The process takes place in
continuous time. This week, $X_t$ is a continuous function of $t$. The process has
continuous sample paths. It is natural to suppose that the limit of a Markov process
is a continuous time Markov process. The limits we obtain this week will be
either Brownian motion or the Ornstein Uhlenbeck process. Both of these are
Gaussian. We will see how such processes arise as the limits of discrete time
Gaussian processes (week 1) or discrete time random walks and urn processes
(week 2).
The scalings of random processes are different from the scalings of differentiable
paths you see in ordinary $\Delta t \to 0$ calculus. Consider a small but non-zero
$\Delta t$. The net change in $X$ over that interval is $\Delta X = X_{t+\Delta t} - X_t$.
If a path has well defined velocity, $V_t = dX/dt$, then $\Delta X \approx V_t \Delta t$.
Mathematicians say that $\Delta X$ is on the order of $\Delta t$, because $\Delta X$ is
approximately proportional to $\Delta t$ for small$^1$ $\Delta t$. In this linear scaling,
reducing $\Delta t$ by a factor of 2 (say) reduces $\Delta X$ approximately by the same
factor of 2.
Brownian motion, and the Ornstein Uhlenbeck process, have more complicated
scalings. There is one scaling for $\Delta X$, and a different one for $E[\Delta X]$. The
change itself, $\Delta X$, is on the order of $\sqrt{\Delta t}$. If $\Delta t$ is small, this is
larger than the $\Delta t$ scaling differentiable processes have. Brownian motion moves
much further in a small amount of time than differentiable processes do. The change
in expected value is smaller, on the order of $\Delta t$. It is impossible for the expected
value to change by order $\sqrt{\Delta t}$, because the total change in the expected value
over a finite time interval would be infinite. The Brownian motion manages to
have $\Delta X$ on the order of $\sqrt{\Delta t}$ through cancellation. The sign of $\Delta X$
goes back and forth, so that the net change is far smaller than the sum of $|\Delta X|$
over many small intervals of time. That is
$|\Delta X_1 + \Delta X_2 + \cdots| \ll |\Delta X_1| + |\Delta X_2| + \cdots$.
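The cancellation described above is easy to see in a small experiment. The sketch below is only an illustration, assuming increments $\Delta X_k = \sqrt{\Delta t}\, Z_k$ with independent standard normal $Z_k$; the function name `net_and_total` and the seed are arbitrary choices.

```python
import numpy as np

def net_and_total(n, T=1.0, seed=0):
    """Return |sum of increments| and sum of |increments| over [0, T].

    Illustrative helper: n steps of size dt = T/n, each increment
    being sqrt(dt) times a standard normal.
    """
    rng = np.random.default_rng(seed)
    dt = T / n
    dX = np.sqrt(dt) * rng.standard_normal(n)
    return abs(dX.sum()), np.abs(dX).sum()

for n in (100, 10_000, 1_000_000):
    net, total = net_and_total(n)
    print(n, net, total)
```

The net change stays of order one as $n$ grows, while the sum of absolute increments grows like $\sqrt{n}$, which is the statement that the total of the $|\Delta X|$ blows up as $\Delta t \to 0$.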
1 This terminology is different from scientists' order of magnitude, which means roughly a
power of ten. It does not make sense to compare $\Delta X$ to $\Delta t$ in the order of
magnitude sense because they have different units.

Brownian motion and the Ornstein Uhlenbeck process are Markov processes.
The standard filtration consists of the family of $\sigma$-algebras, $\mathcal{F}_t$, which
are generated by $X_{[0,t]}$ (the path up to time $t$). The Markov property for $X_t$ is
that the conditional probability of $X_{[t,T]}$, conditioning on all the information in
$\mathcal{F}_t$, is determined by $X_t$ alone. The infinitesimal mean, or infinitesimal drift,
is $E[\Delta X \mid \mathcal{F}_t]$, in the limit $\Delta t \to 0$. The infinitesimal variance
is $\mathrm{var}(\Delta X \mid \mathcal{F}_t)$. We will see that both of these scale linearly
with $\Delta t$ as $\Delta t \to 0$. This allows us to define the infinitesimal drift coefficient,
\[
\mu(X_t)\,\Delta t \approx E[X_{t+\Delta t} - X_t \mid \mathcal{F}_t] , \tag{1}
\]
and the infinitesimal variance, or noise coefficient,
\[
\sigma^2(X_t)\,\Delta t \approx \mathrm{var}(X_{t+\Delta t} - X_t \mid \mathcal{F}_t) . \tag{2}
\]
The conditional expectation with respect to a $\sigma$-algebra requires the left side
to be a function that is measurable with respect to $\mathcal{F}_t$. The $X_t$ that appears
on the left sides is consistent with this. The Markov property says that only $X_t$
can appear on the left sides, because the right sides are statements about the
future of $\mathcal{F}_t$, which depend on $X_t$ alone.
The properties (1) and (2) are central to this course. This week, they tell us
how to take the continuous time limit $\Delta t \to 0$ of discrete time Gaussian Markov
processes or random walks. More precisely, they tell us how a family of processes
must be scaled with $\Delta t$ to get a limit as $\Delta t \to 0$. You choose the scalings so
that (1) and (2) work out. The rest follows, as it does in the central limit theorem.
The fact that continuous time limits exist may be thought of as an extension of
the CLT. The infinitesimal mean and variance of the approximating processes
determine the limiting process completely.
Brownian motion and Ornstein Uhlenbeck processes are characterized by
their $\mu$ and $\sigma^2$. Brownian motion has constant $\mu$ and $\sigma^2$, independent
of $X_t$. The standard Brownian motion has $\mu = 0$, and $\sigma^2 = 1$, and $X_0 = 0$.
If $\mu \neq 0$, you have Brownian motion with drift. If $\sigma^2 \neq 1$, you have a
general Brownian motion. Brownian motion is also called the Wiener process, after
Norbert Wiener. We often use $W_t$ to denote a standard Brownian motion. The
Ornstein Uhlenbeck process has constant $\sigma$, but a linear drift
$\mu(X_t) = -\gamma X_t$. Both Brownian motion and Ornstein Uhlenbeck are Gaussian
processes. If $\sigma^2$ is a function of $X_t$ or if $\mu$ is a nonlinear function of $X_t$,
then $X_t$ is unlikely to be Gaussian.

2 Kinds of convergence
Suppose we have a family of processes $X^{\Delta t}_t$, and we want to take
$\Delta t \to 0$ and find a limit process $X_t$. There are two kinds of convergence,
distributional convergence, and pathwise convergence. Distributional convergence
refers to the probability distribution of $X^{\Delta t}_t$ rather than the numbers. It is
written with a half arrow, $X^{\Delta t}_t \rightharpoonup X_t$ as $\Delta t \to 0$, or possibly
$X^{\Delta t}_t \overset{\mathcal{D}}{\rightharpoonup} X_t$. The CLT is an example of
distributional convergence. If $Z \sim N(0, 1)$ and $Y_k$ are i.i.d., mean zero,
variance 1, then $X_n = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} Y_k$ converges to $Z$ in
distribution, which means that the distribution of $X_n$ converges to $N(0, 1)$. But
the numbers $X_n$ have nothing to do with the numbers $Z$, so we do not expect that
$X_n \to Z$ as $n \to \infty$. We write $X_n \rightharpoonup N(0, 1)$ or
$X_n \rightharpoonup Z$ as $n \to \infty$, which is convergence in distribution.
Later in the course, starting in week 5, there will be examples of sequences
that converge pathwise.

3 Discrete time Gaussian process

Consider a linear Gaussian recurrence relation of the form (27) from week 1,
but in the one dimensional case. We write this as
\[
X_{n+1} = aX_n + bZ_n . \tag{3}
\]
We want $X_n$ to represent, perhaps approximately, the value of a continuous
time process at time $t_n = n\Delta t$. We guess that we can substitute (3) into (1)
and (2) to find the scalings of $a$ and $b$ with $\Delta t$. We naturally use
$\mathcal{F}_n$ instead of $\mathcal{F}_t$ on the right. Since
$E[X_{n+1} - X_n \mid \mathcal{F}_n] = (a - 1)X_n$, the result for (1) is
\[
\mu(X_n)\,\Delta t = (a - 1)X_n .
\]
To get $\mu = 0$ for Brownian motion, we would take $a = 1$. To get
$\mu(x) = -\gamma x$ for the Ornstein Uhlenbeck process, we should take
$a = 1 - \gamma \Delta t$. The difference $a - 1$ in the recurrence relation should scale
linearly with $\Delta t$ to get a finite, non-zero, drift coefficient in the continuous time
limiting process.
Calibrating the noise coefficient gives scalings characteristic of continuous
time stochastic processes. Inserting (3) into (2), with $\mathcal{F}_n$ for
$\mathcal{F}_t$, we find
\[
\sigma^2(X_n)\,\Delta t = b^2 .
\]
Brownian motion and the Ornstein Uhlenbeck process have constant $\sigma^2$, which
suggests the scaling $b = \sigma\sqrt{\Delta t}$. These results, put together, suggest that
the way to approximate the Ornstein Uhlenbeck process by a discrete Gaussian
recurrence relation is
\[
X^{\Delta t}_{n+1} = X^{\Delta t}_n - \gamma \Delta t\, X^{\Delta t}_n + \sigma\sqrt{\Delta t}\, Z_n . \tag{4}
\]

Let $X_t$ be the continuous time Ornstein Uhlenbeck process. We define a discrete
time approximation to it using (4) and
\[
X^{\Delta t}_{t_n} = X^{\Delta t}_n .
\]
Note the inconsistent notation. The subscript on the left side of the equation
refers to time, but the subscript on the right refers to the number of time steps.
These are related by $t_n = n\Delta t$.
Let us assume for now that the approximation converges as $\Delta t \to 0$ in
the sense of distributions. There is much discussion of convergence later in
the course. But assuming it converges, (4) gives a Monte Carlo method for
estimating things about the Ornstein Uhlenbeck process. You can approximate
the values of $X_t$ for $t$ between time steps using linear interpolation if necessary.
If $t_n < t < t_{n+1}$, you can use the definition
\[
X^{\Delta t}_t = X^{\Delta t}_{t_n} + \frac{t - t_n}{t_{n+1} - t_n}
\left( X^{\Delta t}_{t_{n+1}} - X^{\Delta t}_{t_n} \right) .
\]
The values $X^{\Delta t}_{t_n}$ are defined by, say, (4). Now you can take the limit
$\Delta t \to 0$ and ask about the limiting distribution of $X^{\Delta t}_{[0,T]}$ in path
space. The limiting probability distribution is the distribution of Brownian motion
or the Ornstein Uhlenbeck process.
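Here is a minimal Python sketch of the Monte Carlo recipe just described: one step of the recurrence per time step, then linear interpolation between grid times. It assumes the update $X_{n+1} = X_n - \gamma\Delta t\, X_n + \sigma\sqrt{\Delta t}\, Z_n$ with i.i.d. standard normal $Z_n$; the function names and parameter values are illustrative, not from the notes.

```python
import numpy as np

def simulate_ou(gamma, sigma, x0, T, n_steps, rng):
    """One approximate path on the grid t_n = n * dt, dt = T / n_steps."""
    dt = T / n_steps
    t_grid = np.linspace(0.0, T, n_steps + 1)
    X = np.empty(n_steps + 1)
    X[0] = x0
    Z = rng.standard_normal(n_steps)
    for n in range(n_steps):
        # drift term -gamma * X * dt plus noise term sigma * sqrt(dt) * Z
        X[n + 1] = X[n] - gamma * dt * X[n] + sigma * np.sqrt(dt) * Z[n]
    return t_grid, X

def interpolate_path(t_grid, X, t):
    """Linear interpolation of the discrete path at arbitrary times t."""
    return np.interp(t, t_grid, X)

rng = np.random.default_rng(0)
t_grid, X = simulate_ou(gamma=1.0, sigma=1.0, x0=2.0, T=1.0, n_steps=100, rng=rng)
print(interpolate_path(t_grid, X, 0.505))  # value between two grid times
```

Setting `gamma=0.0` and `sigma=1.0` gives approximate standard Brownian motion paths.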

4 Brownian motion
Many of the most important properties of Brownian motion follow from the
limiting process described in Section 3.

4.1 Independent increments property

The increment of Brownian motion is $W_t - W_s$. We often suppose $t > s$, but
in many formulas it is not strictly necessary. We could consider the increment
of a more general process, which would be $X_t - X_s$. The increment is the net
change in $W$ over the time interval $[s, t]$.
Consider two time intervals that do not overlap: $[s_1, t_1]$, and $[s_2, t_2]$, with
$s_1 \le t_1 \le s_2 \le t_2$. The independent increments property of Brownian motion
is that increments over non-overlapping intervals are independent. The random
variables $X_1 = W_{t_1} - W_{s_1}$ and $X_2 = W_{t_2} - W_{s_2}$ are independent. The
intervals are allowed to touch endpoints, which would be $t_1 = s_2$, but they are not
allowed to have any interior in common. The cases $s_1 = t_1$ and $s_2 = t_2$ are
allowed but trivial, because the corresponding increment is zero.
The independent increments property applies to any number of non-overlapping
intervals. If $s_1 \le t_1 \le s_2 \le t_2 \le \cdots$, then the corresponding increments,
$X_1$, $X_2$, $X_3$, $\ldots$, are an independent family of random variables. Their joint
PDF is a product.
The approximate sample paths for standard Brownian motion are given by
(4) with $\gamma = 0$ and $\sigma = 1$. The exact Brownian motion distribution in path
space is the limit of the distributions of the approximating paths, as discussed
in Section 3. Suppose that the interval endpoints are time step times, such
as $s_1 = t_{k_1}$, $t_1 = t_{l_1}$, and so on. The increment of $W$ in the interval
$[s_1, t_1]$ is determined by the random variables $Z_j$ for $s_1 \le t_j < t_1$. These $Z_j$
are independent for non-overlapping intervals of time. In the discrete approximation
there may be small dependences because one $\Delta t$ time step variable $Z_j$ is a
member of, say, $[s_1, t_1]$ and $[s_2, t_2]$. This overlap disappears in the limit
$\Delta t \to 0$. The possible dependence between the random variables disappears too.

4.2 Mean and variance
The mean of a Brownian motion increment is zero:
\[
E[W_t - W_s] = 0 . \tag{5}
\]
The variance of a Brownian motion increment is equal to the size of the time
interval:
\[
\mathrm{var}(W_t - W_s) = E\left[ (W_t - W_s)^2 \right] = t - s . \tag{6}
\]
This is an easy consequence of (4) with $\gamma = 0$ and $\sigma = 1$. If $s = t_k$ and
$t = t_n$, then the increment is
\[
W^{\Delta t}_{t_n} - W^{\Delta t}_{t_k} = \sqrt{\Delta t}\, (Z_k + \cdots + Z_{n-1}) .
\]
Since the $Z_j$ are independent, the variance is
\[
\Delta t\, (1 + 1 + \cdots + 1) = \Delta t\, (n - k) = t_n - t_k = t - s .
\]

4.3 The martingale property

The independent increments property upgrades the simple statements to
conditional expectations. Suppose $\mathcal{F}_s$ is the $\sigma$-algebra that knows about
$W_{[0,s]}$. This is the $\sigma$-algebra generated by $W_{[0,s]}$. This part of the
Brownian motion path is determined by the increments of Brownian motion in the
interval $[0, s]$. But all of these are independent of $W_t - W_s$. The increment
$W_t - W_s$ is independent of any information in $\mathcal{F}_s$. In particular, we have
the conditional expectations
\[
E[W_t - W_s \mid \mathcal{F}_s] = 0 . \tag{7}
\]
The conditional variance of a Brownian motion increment is equal to the size of the
time interval:
\[
\mathrm{var}(W_t - W_s \mid \mathcal{F}_s)
= E\left[ (W_t - W_s)^2 \mid \mathcal{F}_s \right] = t - s . \tag{8}
\]
The formula (7) is called the martingale property. It can be expressed as
\[
E[W_t \mid \mathcal{F}_s] = W_s . \tag{9}
\]
To understand this, recall that the left side is a function of the path that is known
in $\mathcal{F}_s$. The value $W_s$ qualifies; it is determined (trivially) by the path
$W_{[0,s]}$. The variance formula (8) may be re-expressed in a similar way:
\begin{align*}
E\left[ W_t^2 \mid \mathcal{F}_s \right]
&= E\left[ (W_s + [W_t - W_s])^2 \mid \mathcal{F}_s \right] \\
&= E\left[ W_s^2 + 2W_s [W_t - W_s] + [W_t - W_s]^2 \mid \mathcal{F}_s \right] \\
&= W_s^2 + 2W_s\, E[W_t - W_s \mid \mathcal{F}_s]
   + E\left[ (W_t - W_s)^2 \mid \mathcal{F}_s \right] \\
E\left[ W_t^2 \mid \mathcal{F}_s \right] &= W_s^2 + (t - s) .
\end{align*}
We used the martingale property, and the fact that a number can be pulled out
of the expectation if it is known in $\mathcal{F}_s$.

5 Ornstein Uhlenbeck process
The Ornstein Uhlenbeck process is the continuous time analogue of a scalar
Gaussian discrete time recurrence relation. Let $X_t$ be a process that satisfies (1)
and (2) with $\mu(X_t) = -\gamma X_t$ and constant $\sigma^2$. Suppose $u(x, t)$ is the
probability density for $X_t$. Since $u$ is the limit of Gaussians, as we saw in Section 3,
$u$ itself should be Gaussian. Therefore, $u(x, t)$ is completely determined by its mean
and variance. We give arguments that are derived from those in Week 1 to find
the mean and variance.
The mean is simple. Write $\mu_t = E[X_t]$. Then
\begin{align*}
\mu_{t+\Delta t} &= E[X_{t+\Delta t}] \\
&= E[X_t + \Delta X] \\
&= E[X_t] + E[\, E[\Delta X \mid \mathcal{F}_t]\, ] \\
&= \mu_t + E[-\gamma X_t \Delta t + (\text{smaller})] \\
&= \mu_t - \gamma E[X_t]\, \Delta t + (\text{smaller}) \\
\mu_{t+\Delta t} &= \mu_t - \gamma \mu_t \Delta t + (\text{smaller}) .
\end{align*}
The last line shows that
\[
\partial_t \mu_t = -\gamma \mu_t . \tag{10}
\]
This is the analogue of (28) from Week 1.
An orthogonality property of conditional expectation makes the variance
calculation easy. Suppose $\mathcal{F}$ is a $\sigma$-algebra, $X$ is a random variable, and
$Y = E[X \mid \mathcal{F}]$. Then
\[
\mathrm{var}(X) = E\left[ (X - Y)^2 \right] + \mathrm{var}(Y) . \tag{11}
\]
The main step in the proof is to establish the simpler formula
\[
E\left[ X^2 \right] = E\left[ (X - Y)^2 \right] + E\left[ Y^2 \right] . \tag{12}
\]
The formula (12) implies (11). If $\mu = E[X]$, then
$\mathrm{var}(X) = E\left[ (X - \mu)^2 \right]$, and
$E[X - \mu \mid \mathcal{F}] = Y - \mu$. So we get (11) by applying (12) with $X - \mu$
instead of $X$. Note that
\begin{align*}
E\left[ X^2 \right] &= E\left[ ([X - Y] + Y)^2 \right] \\
&= E\left[ (X - Y)^2 \right] + 2E[(X - Y)\, Y] + E\left[ Y^2 \right] .
\end{align*}
The formula (12) follows from the orthogonality relation that, in turn, depends
on the tower property and the fact that $E[Y \mid \mathcal{F}] = Y$:
\begin{align*}
E[(X - Y)\, Y] &= E[\, E[(X - Y)\, Y \mid \mathcal{F}]\, ] \\
&= E[\, E[X - Y \mid \mathcal{F}]\, Y\, ] \\
&= E[\, (Y - Y)\, Y\, ] \\
&= 0 .
\end{align*}
To summarize, we just showed that $X - E[X \mid \mathcal{F}]$ is orthogonal to
$E[X \mid \mathcal{F}]$ in the sense that $E[(X - Y)\, Y] = 0$. The formula (12) is the
Pythagorean relation that follows from this orthogonality. The variance formula (11)
is just the mean zero case of this Pythagorean relation.
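The Pythagorean relation (11) can be checked on sample data. The sketch below is an illustration under simple assumptions: the information in $\mathcal{F}$ is a discrete label with four values, and $E[X \mid \mathcal{F}]$ is computed as the within-group sample mean, so the identity holds exactly for the empirical distribution (up to rounding). All names and the model for `X` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# The sigma-algebra F is generated by a discrete label in {0, 1, 2, 3}.
group = rng.integers(0, 4, size=n)
X = group + rng.standard_normal(n)          # label plus independent noise

# Y = E[X | F] under the empirical distribution: the group averages.
group_means = np.array([X[group == g].mean() for g in range(4)])
Y = group_means[group]

lhs = X.var()                               # var(X), population convention
rhs = np.mean((X - Y) ** 2) + Y.var()       # E[(X - Y)^2] + var(Y)
cross = np.mean((X - Y) * Y)                # orthogonality term E[(X - Y) Y]

print(lhs, rhs, cross)
```

The first two printed numbers should agree to rounding error, and the cross term should be zero to rounding error.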
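The variance calculation for the Ornstein Uhlenbeck process uses the Pythagorean
relation (11) in $\mathcal{F}_t$. The basic mean value relation (1) may be re-written for
Ornstein Uhlenbeck as
\[
E[X_{t+\Delta t} \mid \mathcal{F}_t] = X_t + \mu(X_t)\Delta t + (\text{smaller})
= X_t - \gamma X_t \Delta t + (\text{smaller}) .
\]
Let $\sigma_t^2 = \mathrm{var}(X_t)$. Then (third line justified below)
\begin{align*}
\sigma^2_{t+\Delta t} &= \mathrm{var}(X_{t+\Delta t}) \\
&= E\left[ \left( X_{t+\Delta t} - X_t + \gamma X_t \Delta t + (\text{smaller}) \right)^2 \right]
 + \mathrm{var}\left( X_t - \gamma X_t \Delta t + (\text{smaller}) \right) \\
&= \sigma^2 \Delta t + (1 - \gamma \Delta t)^2\, \mathrm{var}(X_t) + (\text{smaller}) \\
&= \sigma_t^2 + \left( \sigma^2 - 2\gamma \sigma_t^2 \right) \Delta t + (\text{smaller}) .
\end{align*}
In the third line, the first term is the expected value of the conditional variance
in (2), which is $\sigma^2 \Delta t$, and the second term uses
$\mathrm{var}(cX_t) = c^2\, \mathrm{var}(X_t)$ with $c = 1 - \gamma \Delta t$.
This implies that $\sigma_t^2$ satisfies the scalar continuous time version of (29) from
Week 1, which is
\[
\partial_t \sigma_t^2 = \sigma^2 - 2\gamma \sigma_t^2 . \tag{13}
\]
Both (10) and (13) have simple exponential solutions. We see that $\mu_t$ goes
to zero at the exponential rate $\gamma$, while $\sigma_t^2$ goes to $\sigma^2/(2\gamma)$ at the
exponential rate $2\gamma$. The probability density of $X_t$ is
\[
u(x, t) = \frac{1}{\sqrt{2\pi \sigma_t^2}}\, e^{-(x - \mu_t)^2 / (2\sigma_t^2)} . \tag{14}
\]
This distribution has a limit as $t \to \infty$ that is the statistical steady state for
the Ornstein Uhlenbeck process.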
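As a sanity check on (10) and (13), one can compare their exponential solutions with the mean and variance recursions implied by the discrete recurrence. This is a sketch under stated assumptions: the discrete update is $X_{n+1} = (1 - \gamma\Delta t)X_n + \sigma\sqrt{\Delta t}\,Z_n$, and the function names and parameter values are illustrative.

```python
import numpy as np

def ou_moments_exact(t, gamma, sigma, mu0, var0):
    """Solutions of d(mu)/dt = -gamma*mu and d(var)/dt = sigma^2 - 2*gamma*var."""
    mu_t = mu0 * np.exp(-gamma * t)
    var_inf = sigma ** 2 / (2 * gamma)          # steady state variance
    var_t = var_inf + (var0 - var_inf) * np.exp(-2 * gamma * t)
    return mu_t, var_t

def ou_moments_recurrence(t, gamma, sigma, mu0, var0, n_steps):
    """Mean and variance of the discrete approximation after n_steps steps."""
    dt = t / n_steps
    mu, var = mu0, var0
    for _ in range(n_steps):
        mu = (1 - gamma * dt) * mu
        var = (1 - gamma * dt) ** 2 * var + sigma ** 2 * dt
    return mu, var

print(ou_moments_exact(1.0, 2.0, 1.0, 1.0, 0.0))
print(ou_moments_recurrence(1.0, 2.0, 1.0, 1.0, 0.0, n_steps=1000))
```

The two outputs should approach each other as `n_steps` grows, with a difference on the order of $\Delta t$.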

Week 4
Brownian motion and the heat equation
Jonathan Goodman
October 1, 2012

1 Introduction to the material for the week

A diffusion process is a Markov process in continuous time with a continuous
state space and continuous sample paths. This course is largely about diffusion
processes. Partial differential equations (PDEs) of diffusion type are important
tools for studying diffusion processes. Conversely, diffusion processes give insight
into solutions of diffusion type partial differential equations. We have seen two
diffusion processes so far, Brownian motion and the Ornstein Uhlenbeck process.
This week, we discuss the partial differential equations associated with these two
processes.
We start with the forward equation associated with Brownian motion. Let
$X_t$ be a standard Brownian motion with probability density $u(x, t)$. This
probability density satisfies the heat equation, or diffusion equation, which is
\[
\partial_t u = \tfrac12 \partial_x^2 u . \tag{1}
\]
This PDE allows us to solve the initial value problem. Suppose $s$ is a time and
the probability density $X_s \sim u(x, s)$ is known, then (1) determines $u(x, t)$ for
$t \ge s$. The initial value problem has a solution for more or less any initial
condition $u(x, s)$. If $u(x, s)$ is a probability density, you can find $u(x, t)$ for
$t > s$ by: first choosing $X_s \sim u(\cdot, s)$, then letting $X_t$ for $t > s$ be a
Brownian motion. The probability density of $X_t$, $u(x, t)$, satisfies the heat equation.
By contrast, the heat equation generally cannot be run backwards. If you give a
probability density $u(x, s)$, there probably is no function $u(x, t)$ defined for $t < s$
that satisfies the heat equation for $t < s$ and the specified values $u(x, s)$. Running
the heat equation backwards is ill posed.$^1$
The Brownian motion interpretation provides a solution formula for the heat
equation
\[
u(x, t) = \frac{1}{\sqrt{2\pi (t - s)}} \int_{-\infty}^{\infty}
e^{-(x - y)^2 / 2(t - s)}\, u(y, s)\, dy . \tag{2}
\]
1 Stating a problem or task is posing the problem. If the task or mathematical problem has
no solution that makes sense, the problem is poorly stated, or ill posed.

This formula may be expressed more abstractly as
\[
u(x, t) = \int_{-\infty}^{\infty} G(x - y, t - s)\, u(y, s)\, dy , \tag{3}
\]
where the function
\[
G(x, t) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^2 / 2t}
\]
is called the fundamental solution, or the heat kernel, or the transition density.
You recognize it as the probability density of a Gaussian with mean zero and variance
$t$. This is the probability density of $X_t$ if $X_0 = 0$ and $X$ is standard Brownian
motion.
We can think of the function $u(x, t)$ as an abstract vector and write it $u(t)$.
We did this already in Week 2, where the occupation probabilities
$u_{n,j} = P(X_n = j)$ were thought of as components of the row vector $u_n$. The
solution formula (3) produces the function $u(t)$ from the data $u(s)$. We write this
abstractly as
\[
u(t) = G(t - s)\, u(s) . \tag{4}
\]
The operator, $G$, is something like an infinite dimensional matrix. The abstract
expression (4) is shorthand for the more concrete formula (3), just as matrix
multiplication is shorthand for the actual sums involved. The particular operators
$G(t)$ have the semigroup property
\[
G(t) = G(t - s)\, G(s) , \tag{5}
\]
as long as $t$, $s$, and $t - s$ are all positive.$^2$ This is because $u(t) = G(t)u(0)$, and
$u(s) = G(s)u(0)$, so $u(t) = G(t - s)\, [G(s)u(0)] = [G(t - s)G(s)]\, u(0)$.
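In terms of kernels, the semigroup property says that $\int G(x - y, t - s)\, G(y, s)\, dy = G(x, t)$. The sketch below checks this numerically with a simple Riemann sum; the grid, the sample values of $x$, $t$, $s$, and the function names are illustrative assumptions.

```python
import numpy as np

def heat_kernel(x, t):
    """Gaussian density with mean zero and variance t."""
    return np.exp(-x ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

def compose_kernels(x, t, s, y):
    """Riemann sum for the integral of G(x - y, t - s) G(y, s) over the grid y."""
    dy = y[1] - y[0]
    return np.sum(heat_kernel(x - y, t - s) * heat_kernel(y, s)) * dy

y = np.linspace(-20.0, 20.0, 4001)   # wide enough that the tails are negligible
x, t, s = 0.7, 2.0, 0.5
print(compose_kernels(x, t, s, y))   # composition of G(t - s) and G(s)
print(heat_kernel(x, t))             # G(t) evaluated directly
```

In probability terms this is the statement that the sum of independent $N(0, s)$ and $N(0, t - s)$ increments is $N(0, t)$.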
You can make a long list of ways the heat equation helps understand the
behavior of Brownian motion. We can write formulas for hitting probabilities by
writing solutions of (1) that satisfy the correct boundary conditions. This will
allow us to explain the simulation results in question (4c) of assignment 3. You
do not have to understand probability to check that a function u(x, t) satisfies
the heat equation, only calculus.
The backward equation for Brownian motion is
\[
\partial_t f + \tfrac12 \partial_x^2 f = 0 . \tag{6}
\]
This is the equation satisfied by expected values
\[
f(x, t) = E[\, V(X_T) \mid X_t = x\, ] , \tag{7}
\]
if $T \ge t$. The final condition is the obvious statement that $f(x, T) = V(x)$.
The PDE (6) allows you to move backwards in time to determine values of $f$
for $t < T$ from $f(T)$. The backward equation differs from the forward equation
only by a sign, but this is a big difference. Moving forward with the backward
equation is just as ill posed as moving backward with the forward. In particular,
suppose you have a desired function $f(x, 0)$ and you want to know what function
$V(x)$ gives rise to it using (7). Unless your function $f(0)$ is very special (details
below), there is no $V$ at all.

2 A mathematical group is a collection of objects that you can multiply and invert, like
the group of invertible matrices of a given size. A semigroup allows multiplication but not
necessarily inversion. If the operators $G(t)$ were defined for $t < 0$ and the formula (5) still
applied, then the operators would form a group. Our operators are only half a group because
they are defined only for $t \ge 0$.

2 The heat equation

This section describes the heat equation and some of its solutions. This will
help us understand Brownian motion, both qualitatively (general properties)
and quantitatively (specific formulas).
The heat equation is used to model things other than probability. For example,
it can model the flow of heat in a metal rod. Here, $u(x, t)$ is the temperature
at location $x$ at time $t$. The temperature is modeled by
$\partial_t u = D \partial_x^2 u$, where the diffusion coefficient, $D$, depends on the
material (metal, stone, ..), and the units (seconds, days, centimeters, meters,
degrees C, ..). The heat equation (1) has the value $D = \frac12$. Changing units, or
rescaling, or non-dimensionalizing can replace $D$ with $\frac12$. For example, you can
use $t' = 2Dt$, or $x' = x/\sqrt{2D}$.
The heat flow picture suggests that heat will flow from high temperature
to low temperature regions. The fluctuations in $u(x, t)$ will smooth out and
relax over time as the heat redistributes itself. The total amount of heat in
an interval $[a, b]$ at time $t$ is$^3$
\[
\int_a^b u(x, t)\, dx .
\]

You understand the flow of heat by differentiating with respect to time and
using the heat equation
\[
\frac{d}{dt} \int_a^b u(x, t)\, dx = \int_a^b \partial_t u(x, t)\, dx
= \int_a^b \tfrac12 \partial_x^2 u(x, t)\, dx
= \tfrac12 \left( \partial_x u(b, t) - \partial_x u(a, t) \right) .
\]
The heat flux,
\[
F(x, t) = -\tfrac12 \partial_x u(x, t) , \tag{8}
\]
puts this into the conservation form
\[
\frac{d}{dt} \int_a^b u(x, t)\, dx = \int_a^b \partial_t u(x, t)\, dx = F(a, t) - F(b, t) . \tag{9}
\]
The heat flux (8) is the rate at which heat is flowing across $x$ at time $t$. If
$F$ is positive, heat flows from left to right. The specific formula (8) is Fick's
law, which says that heat flows downhill toward lower temperature at a rate
proportional to the temperature gradient. If $\partial_x u > 0$, then heat flows from right
to left in the direction opposite the temperature gradient. The conservation
equation (9) gives the rate of change of the amount of heat in $[a, b]$ as the rate
of flow in, $F(a, t)$, minus the rate of flow out, $F(b, t)$. Of course, either of these
numbers could be negative.

3 If you are one of those people who knows the technical distinction between heat and
temperature, I say choose units of temperature in which the specific heat is one.
The heat equation has a family of solutions that are exponential in space
(the $x$ variable). These are
\[
u(x, t) = A(t)\, e^{ikx} . \tag{10}
\]
This is an ansatz, which is a hypothesized functional form for the solution.
Calculating the time and space derivatives, this ansatz satisfies the heat equation
if (here $\dot{A} = dA/dt$)
\[
\dot{A}\, e^{ikx} = \tfrac12 (-k^2) A\, e^{ikx} .
\]
We cancel the common exponential factor and see that (10) is a solution if
\[
\dot{A} = -\tfrac12 k^2 A .
\]
This leads to $A(t) = A(0)\, e^{-k^2 t/2}$, and
\[
u(x, t) = e^{ikx}\, e^{-k^2 t/2} .
\]
The formula $e^{i\theta} = \cos(\theta) + i \sin(\theta)$ tells us that the real part of $u$ is
\[
v(x, t) = \cos(kx)\, e^{-k^2 t/2} .
\]
You cannot interpret $v$ as a probability density because it has negative values. But it
gives insight into what the heat equation does. A large $k$, which is a high wave
number, or (less accurately) frequency, leads to rapid decay, $e^{-k^2 t/2}$. This is
because positive and negative "heat" is close together and does not have to
diffuse far to cancel out.
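A finite difference check gives some confidence that $v(x, t) = \cos(kx)\, e^{-k^2 t/2}$ really satisfies $\partial_t v = \frac12 \partial_x^2 v$. The sketch below uses centered differences with a hypothetical step size `h`; the function names and sample points are illustrative.

```python
import numpy as np

def v(x, t, k):
    """Real part of the plane wave solution with wave number k."""
    return np.cos(k * x) * np.exp(-k ** 2 * t / 2)

def heat_residual(u, x, t, h=1e-3, **params):
    """Centered-difference estimate of u_t - (1/2) u_xx at the point (x, t)."""
    du_dt = (u(x, t + h, **params) - u(x, t - h, **params)) / (2 * h)
    d2u_dx2 = (u(x + h, t, **params) - 2 * u(x, t, **params)
               + u(x - h, t, **params)) / h ** 2
    return du_dt - 0.5 * d2u_dx2

print(heat_residual(v, x=0.3, t=1.0, k=2.0))   # should be near zero
print(v(0.0, 1.0, k=1.0), v(0.0, 1.0, k=4.0))  # larger k decays faster
```

The same `heat_residual` helper can be applied to any candidate solution written as a function of `x` and `t`.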
Another function that satisfies the heat equation is
\[
u(x, t) = t^{-1/2}\, e^{-x^2/(2t)} . \tag{11}
\]
The relevant calculations are
\begin{align*}
\partial_t u &= \left( \partial_t t^{-1/2} \right) e^{-x^2/(2t)}
 + t^{-1/2}\, \partial_t e^{-x^2/(2t)}
 = -\tfrac12 t^{-1} u + \frac{x^2}{2t^2}\, u , \\
\partial_x u &= -\frac{x}{t}\, t^{-1/2} e^{-x^2/(2t)} = -\frac{x}{t}\, u , \\
\partial_x^2 u &= -\frac{1}{t}\, u + \frac{x^2}{t^2}\, u .
\end{align*}
This shows that $\partial_t u$ does equal $\frac12 \partial_x^2 u$. This solution illustrates the
spreading of heat. The maximum of $u$ is $1/\sqrt{t}$, which is at $x = 0$. This is large for
small $t$ and goes to zero as $t \to \infty$. We see the characteristic width by writing $u$ as
\[
u(x, t) = \frac{1}{\sqrt{t}}\, e^{-\frac12 \left( x/\sqrt{t} \right)^2} .
\]
This gives $u(x, t)$ as a function of the similarity variable $x/\sqrt{t}$, except for the
outside overall scale factor $1/\sqrt{t}$. Therefore, the characteristic width is on the
order of $\sqrt{t}$. This is the order of the distance in $x$ you have to go to get from
the maximum value ($x = 0$) to, say, half the maximum value.
The heat equation preserves total heat in the sense that
\[
\frac{d}{dt} \int_{-\infty}^{\infty} u(x, t)\, dx = 0 . \tag{12}
\]
This follows from the conservation law (8) and (9) if $\partial_x u \to 0$ as
$x \to \pm\infty$. You can check by direct integration that the Gaussian solution (11)
satisfies this global conservation. But there is a dumber way. The total mass of the
bump shaped Gaussian heat distribution (11) is roughly equal to the height
multiplied by the width of the bump. The height is $t^{-1/2}$ and the width is $t^{1/2}$.
The product is a constant.
There are methods for building general solutions of the heat equation from
particular solutions such as the plane wave (10) or the Gaussian (11). The heat
equation PDE is linear, which means that if $u_1(x, t)$ and $u_2(x, t)$ are solutions,
then $u(x, t) = c_1 u_1(x, t) + c_2 u_2(x, t)$ is also a solution. This is the superposition
principle. The graph of $c_1 u_1 + c_2 u_2$ is the superposition (one on top of the
other) of the graphs of $u_1$ and $u_2$. The equation is translation invariant, or
homogeneous in space and time, which means that if $u(x, t)$ is a solution, then
$v(x, t) = u(x - x_0, t - t_0)$ is a solution. The equation has the scaling property
that if $u(x, t)$ is a solution, then $u_\lambda(x, t) = u(\lambda x, \lambda^2 t)$ is a solution.
This scaling relation is "one power of $t$ is two powers of $x$", or "$x^2$ scales like $t$".
Here are some simple illustrations. You can put a Gaussian bump of any
height and with any center:
\[
\frac{c}{\sqrt{t}}\, e^{-(x - x_0)^2 / 2t} .
\]
You can combine (superpose) them to make multibump solutions
\[
u(x, t) = \frac{c_1}{\sqrt{t}}\, e^{-(x - x_1)^2 / 2t} + \frac{c_2}{\sqrt{t}}\, e^{-(x - x_2)^2 / 2t} .
\]
The $k = 1$ plane wave $u(x, t) = \sin(x)\, e^{-t/2}$ may be rescaled to give the general
plane wave: $u_\lambda(x, t) = \sin(\lambda x)\, e^{-\lambda^2 t/2}$, which is the imaginary part
of (10) with $k = \lambda$. Changing the length scale by a factor of $\lambda$ changes the time
scale, which is the decay rate in this case, by a factor of $\lambda^2$. The Gaussian
solutions are self similar in the sense that $u_\lambda(x, t) = C_\lambda u(x, t)$. The
exponent calculation is $x^2/t \to (\lambda^2 x^2)/(\lambda^2 t)$.

The solution formula (2) is an application of the superposition principle with
integrals instead of sums. We explain it here, and take $s = 0$ for simplicity. The
total heat (or total probability, or total mass, depending on the interpretation)
of the Gaussian bump (11) is $\sqrt{2\pi}$. You can see that simply by taking $t = 1$. It is
simpler to work with a Gaussian solution with total mass equal to one. When
you center the normalized bump at a point $y$, you get
\[
\frac{1}{\sqrt{2\pi t}}\, e^{-(x - y)^2 / 2t} . \tag{13}
\]
As $t \to 0$, this solution concentrates all its heat in a collapsing neighborhood
of $y$. Therefore, it is the solution that results from an initial condition that
concentrates a unit amount of heat at the point $y$. This is expressed using the
Dirac delta function as $u(x, t) \to \delta(x - y)$ as $t \to 0$. It shows that the normalized,
centered Gaussian (13) is the solution to the initial value problem for the heat
equation with initial condition $u(x, 0) = \delta(x - y)$. More generally, the formula
$\left( c/\sqrt{2\pi t} \right) e^{-(x - y)^2 / 2t}$ says what happens at later time to an amount
$c$ of heat at $y$ at time zero. For general initial heat distribution $u(y, 0)$, the amount
of heat in a $dy$ neighborhood of $y$ is $u(y, 0)\, dy$. This contributes
$\left( u(y, 0)\, dy/\sqrt{2\pi t} \right) e^{-(x - y)^2 / 2t}$ to the solution $u(x, t)$. We get the
total solution by adding all these contributions. The result is
\[
u(x, t) = \int_{y=-\infty}^{\infty} u(y, 0)\, \frac{1}{\sqrt{2\pi t}}\, e^{-(x - y)^2 / 2t}\, dy .
\]
This is the formula (2).
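The superposition integral can be evaluated numerically for a concrete initial condition. The sketch below assumes box-shaped initial data, $u(y, 0) = \frac12$ for $|y| \le 1$ and zero otherwise (a choice made only for this example), for which the integral can also be written with the error function. The function names and the grid are illustrative.

```python
import math
import numpy as np

def heat_solution(x, t, u0, y):
    """Riemann-sum approximation of the integral of u0(y) G(x - y, t) dy."""
    dy = y[1] - y[0]
    kernel = np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return np.sum(u0(y) * kernel) * dy

def box(y):
    """Initial density: uniform on [-1, 1]."""
    return np.where(np.abs(y) <= 1.0, 0.5, 0.0)

def box_solution_exact(x, t):
    """Closed form of the same integral, using the error function."""
    s = math.sqrt(2 * t)
    return 0.25 * (math.erf((x + 1) / s) - math.erf((x - 1) / s))

y = np.linspace(-10.0, 10.0, 20001)
print(heat_solution(0.3, 1.0, box, y))
print(box_solution_exact(0.3, 1.0))
```

The two values agree up to the discretization error of the Riemann sum, which is on the order of the grid spacing because the box has jumps.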

3 The forward equation and Brownian motion

We argue that the probability density of Brownian motion satisfies the heat
equation (1). Suppose $u_0(x)$ is a probability density and we choose $X_0 \sim u_0$.
Suppose we then start a Brownian motion path from $X_0$. Then
$X_t - X_0 \sim N(0, t)$ and the joint density of $X_0$ and $X_t$ is
\[
u(x_0, x_t, t) = u_0(x_0)\, \frac{1}{\sqrt{2\pi t}}\, e^{-(x_t - x_0)^2 / 2t} .
\]
The probability density of $X_t$ is the integral of the joint density
\[
u(x_t, t) = \int_{-\infty}^{\infty} u(x_0, x_t, t)\, dx_0
= \int_{-\infty}^{\infty} u_0(x_0)\, \frac{1}{\sqrt{2\pi t}}\, e^{-(x_t - x_0)^2 / 2t}\, dx_0 .
\]
If you substitute $y$ for $x_0$ and $x$ for $x_t$, you get (2). This shows that the
probability density of $X_t$ is equal to the solution of the heat equation evaluated
at time $t$.

4 Hitting probabilities and hitting times

If $X_t$ is a stochastic process with continuous sample paths, the hitting time for
a closed set $A$ is $\tau_A = \min \{ t \mid X_t \in A \}$. This is a random variable because
the hitting time depends on the path. For one dimensional Brownian motion
starting at $X_0 = 0$, we define $\tau_a = \min \{ t \mid X_t = a \}$. Let $f_a(t)$ be the
probability density of $\tau_a$. We will find formulas for $f_a(t)$ and the survival
probability $S_a(t) = P(\tau_a \ge t)$. Clearly $f_a(t) = -\partial_t S_a(t)$; the survival
probability is (up to a constant) the negative of the CDF of $\tau_a$.
There are two related approaches to hitting times and survival probabilities
for Brownian motion in one dimension. One uses the heat equation with a
boundary condition at a. The other uses the Kolmogorov reflection principle.
The reflection principle seems simpler, but it has two drawbacks. One is that I
have no idea how Kolmogorov could have discovered it without first doing it the
hard way, with the PDE. The other is that the PDE method is more general.
The PDE approach makes use of the PDF of surviving particles. This is
defined by
\[
P(X_t \in [x, x + dx] \mid \tau_a > t) = u_a(x, t)\, dx . \tag{14}
\]
Stopped Brownian motion gives a different description of the same thing. This is
\[
Y_t = \begin{cases} X_t & \text{if } t < \tau_a \\ a & \text{if } t \ge \tau_a . \end{cases}
\]
The process moves with $X_t$ until $X$ touches $a$, then it stops. The density (14)
is the density of $Y_t$ except at $x = a$, where the $Y$ density has a $\delta$ component.
Another notation for stopped Brownian motion uses the wedge notation
$t \wedge s = \min(t, s)$. The formula is $Y_t = X_{t \wedge \tau_a}$. If $t < \tau_a$, this
gives $Y_t = X_t$. If $t \ge \tau_a$, this gives $Y_t = X_{\tau_a}$, which is $a$ because
$\tau_a$ is the hitting time of $a$.
The conditional probability density ua (x, t) satisfies the heat equation except
at a. We do not give a proof of this, only some plausibility arguments. First,
(see this week's homework assignment), it is true for a stopped random walk
approximation to stopped Brownian motion. Second, if $X_t$ is not at $a$ and has
not been stopped, then it acts like ordinary Brownian motion, at least for a
short time. In particular,
\[ u_a(x, t+\Delta t) \approx \int G(x-y, \Delta t)\,u_a(y,t)\,dy , \]
if $\Delta t$ is small. The right side satisfies the heat equation, so the left should as
well if $\Delta t$ is small.
The conditional probability satisfies the boundary condition $u_a(x,t) \to 0$ as
$x \to a$. This would be the same as $u(a,t) = 0$ if we knew that $u$ was continuous
(it is, but we didn't show it). The boundary condition $u(a,t) = 0$ is called
an absorbing boundary condition because it represents the physical fact that
particles that touch $a$ get stuck and do not re-enter the region $x \ne a$. We
will not give a proof that the density for stopped Brownian motion satisfies the
absorbing boundary condition, but we give two plausibility arguments. The
first is that it is true in the approximating stopped random walk. The second
involves the picture of Brownian motion as constantly moving back and forth.
It (almost) never moves in the same direction for a positive amount of time. If
$X_t = a$, then (almost surely) there are times $t_1 < t$ and $t_2 < t$ so that $X_{t_1} > a$
and $X_{t_2} < a$. The closer $y$ is to $a$, the less likely it is that $X_s \ne a$ for all
$s < t$.
Accepting the two above claims, we can find hitting probabilities by finding
solutions of the heat equation with absorbing boundary conditions. Let us
assume that X0 = 0 and the absorbing boundary is at a > 0. We want a
function $u_a(x,t)$ that is defined for $x \le a$ that satisfies the initial condition
$u_a(x,t) \to \delta(x)$ as $t \to 0$ (for $x < a$) and the absorbing boundary condition
$u_a(a,t) = 0$. The trick that does this is the method of images from physics.
A point $x < a$ has an image point, $x' > a$, that is the same distance from $a$.
The image point is $x' = a + (a - x) = 2a - x$. If $x < a$, then $x' > a$ and
$|x - a| = |x' - a|$, and $x' \to a$ as $x \to a$. The density function $u_a(x,t)$ starts
out defined only for $x \le a$. The trick is to extend the definition of $u_a$ beyond $a$
by odd reflection. That is,
\[ u_a(x',t) = -u_a(x,t) . \tag{15} \]
The oddness of the extended function implies that $u_a(x,t) \to 0$ as $x \to a$ from
either direction. The only direction we originally cared about was from $x < a$,
but the other is true also.
We create an odd solution of the heat equation by taking odd initial data.
We know ua (x, 0) needs a point mass at x = 0. To make the initial data odd,
we add a negative point mass also at the image of 0, which is x = 2a. The
resulting initial data is
\[ u_a(x,0) = \delta(x) - \delta(x - 2a) . \]
The initial data has changed, but the part for $x \le a$ is the same. The solution
is the superposition of the pieces from the two delta functions:
\[ u_a(x,t) = \frac{1}{\sqrt{2\pi t}}\,e^{-x^2/2t} - \frac{1}{\sqrt{2\pi t}}\,e^{-(x-2a)^2/2t} . \tag{16} \]
This function satisfies all three of our requirements. It has the right initial data,
at least for $x \le a$. It satisfies the heat equation for all $x \le a$. It satisfies the
heat equation also for $x > a$, which is interesting but irrelevant. It satisfies the
absorbing boundary condition. It is a continuous function of $x$ for $t > 0$ and
has $u_a(a,t) = 0$.
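
The three requirements can be spot-checked numerically. The following is a minimal sketch, not part of the original notes: it evaluates the images formula (16) and uses centered finite differences (an assumption of this sketch, with step `h` chosen by hand) to estimate the residual of the heat equation $\partial_t u = \frac{1}{2}\partial_x^2 u$. The function names are hypothetical.

```python
import math

def heat_kernel(x, t):
    """Density of N(0, t) evaluated at x."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def u_images(x, t, a):
    """Method of images density (16): source at 0 minus image source at 2a."""
    return heat_kernel(x, t) - heat_kernel(x - 2.0 * a, t)

def heat_residual(x, t, a, h=1e-3):
    """Centered-difference estimate of u_t - (1/2) u_xx, which should be near zero."""
    u_t = (u_images(x, t + h, a) - u_images(x, t - h, a)) / (2.0 * h)
    u_xx = (u_images(x + h, t, a) - 2.0 * u_images(x, t, a)
            + u_images(x - h, t, a)) / (h * h)
    return u_t - 0.5 * u_xx

a, t = 1.0, 0.7
boundary_value = u_images(a, t, a)                                 # absorbing condition
odd_check = u_images(2.0 * a - 0.3, t, a) + u_images(0.3, t, a)    # oddness (15)
residual = heat_residual(0.3, t, a)                                # heat equation
print(boundary_value, odd_check, residual)
```

All three printed numbers should be close to zero, up to rounding and the finite difference truncation error.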
The formula (16) answers many questions about stopped Brownian motion
and absorbing boundaries. The survival probability at time t is
\[ S_a(t) = \int_{-\infty}^{a} u_a(x,t)\,dx . \tag{17} \]
This is because $u_a$ was the probability density of surviving Brownian motion
paths. You can check that the method of images formula (16) has $u_a(x,t) > 0$
if $x < a$. The probability density of $\tau_a$ is
\begin{align*}
f_a(t) &= -\partial_t S_a(t) \\
&= -\int_{-\infty}^{a} \partial_t u_a(x,t)\,dx \\
&= -\frac{1}{2}\int_{-\infty}^{a} \partial_x^2 u_a(x,t)\,dx \\
f_a(t) &= -\frac{1}{2}\,\partial_x u_a(a,t) . \tag{18}
\end{align*}
Without using the specific formula (16) we know the right side of (18) is positive.
That is because $u_a(x,t)$ is going from positive values for $x < a$ to zero when
$x = a$. That makes $\partial_x u_a(a,t)$ negative (at least not positive) and $f_a(t)$ positive
(at least not negative). The formula (18) reinforces the interpretation (see (8))
of $-\frac{1}{2}\partial_x u$ as a probability flux. It is the rate at which probability leaves the
continuation region, $x < a$.
The formula for the hitting time probability density is found by differentiating
(16) with respect to $x$ and setting $x = a$. The two terms from the right
turn out to be equal. The result is
\[ f_a(t) = \frac{a}{\sqrt{2\pi}\,t^{3/2}}\,e^{-a^2/2t} . \tag{19} \]
The reader is invited to verify by explicit integration that
\[ \frac{a}{\sqrt{2\pi}} \int_0^{\infty} \frac{1}{t^{3/2}}\,e^{-a^2/2t}\,dt = 1 . \]
This illustrates some features of Brownian motion.
Look at the formula as $t \to 0$. The exponent has $t$ in the denominator, so
$f_a(t) \to 0$ as $t \to 0$ exponentially. It is extremely, exponentially, unlikely for $X_t$
to hit $a$ in a short time. The probability starts being significantly different from
zero when the exponent is not a large negative number, which is when $t$ is on
the order of $a^2$. This is the (time) = (length)$^2$ aspect of Brownian motion.
Now look at the formula as $t \to \infty$. The exponent converges to zero, so
$f_a(t) \approx C t^{-3/2}$. (We know the constant, but it just gets in the way.) This
integrates to a statement about the survival probability
\[ S_a(t) = \int_t^{\infty} f_a(t')\,dt' \approx C \int_t^{\infty} t'^{-3/2}\,dt' = 2C\,t^{-1/2} . \]
The probability to survive a long time goes to zero as $t \to \infty$, but slowly, as
$1/\sqrt{t}$.
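
Here is a small numerical sketch of these statements, under assumptions that are ours rather than the notes': the flux formula (18) is evaluated with a centered difference, and the normalization integral is computed with the midpoint rule after the substitution $t = a^2/s^2$, which turns the integrand into a half Gaussian. All function names are hypothetical.

```python
import math

def gaussian(x, t):
    """Density of N(0, t) evaluated at x."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def u_images(x, t, a):
    """Method of images density (16)."""
    return gaussian(x, t) - gaussian(x - 2.0 * a, t)

def hitting_density(t, a):
    """f_a(t) = a / (sqrt(2 pi) t^(3/2)) * exp(-a^2 / (2 t))."""
    return a / math.sqrt(2.0 * math.pi * t ** 3) * math.exp(-a * a / (2.0 * t))

def flux_at_boundary(t, a, h=1e-4):
    """-(1/2) d/dx u_a at x = a, formula (18), by a centered difference."""
    slope = (u_images(a + h, t, a) - u_images(a - h, t, a)) / (2.0 * h)
    return -0.5 * slope

def total_mass(a, n=20000, s_max=10.0):
    """Integral of f_a over t in (0, infinity), using t = a^2 / s^2."""
    ds = s_max / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * ds
        t = a * a / (s * s)
        dt_ds = 2.0 * a * a / s ** 3       # |dt/ds| for the change of variables
        total += hitting_density(t, a) * dt_ds * ds
    return total

a = 1.0
flux = flux_at_boundary(0.7, a)
density = hitting_density(0.7, a)
mass = total_mass(a)
tail_constant = hitting_density(1.0e6, a) * (1.0e6) ** 1.5   # approaches a / sqrt(2 pi)
print(flux, density, mass, tail_constant)
```

The first two numbers should agree, the third should be close to 1, and the last illustrates the $t^{-3/2}$ tail.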
The maximum of a Brownian motion up to time t is

\[ M_t = \max_{0 \le s \le t} X_s . \]
The hitting time formulas above also give formulas for the distribution of $M_t$.
Let $G_t(a) = P(M_t \le a)$ be the CDF of $M_t$. This is nearly the same as a survival
probability. Suppose $X_0 = 0$ and $a > 0$ as above. Then the event $M_t < a$ is
the same as $X_s < a$ for all $s \in [0,t]$, which is the same as $\tau_a > t$. Let $g_t(a)$ be
the PDF of $M_t$. Then $g_t(a) = \frac{d}{da} G_t(a)$. Since $G_t(a) = S_a(t)$, we can find $g_t$ by
differentiating the formula (17) with respect to $a$ and using (16). This is not
hard, but it is slightly involved because $S_a(t)$ depends on $a$ in two ways: the
limit of integration and the integrand $u_a(x,t)$.
There is another approach through the Kolmogorov reflection principle. This
is a re-interpretation of the survival probability integral (17). Start with the
observation that
\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi t}}\,e^{-x^2/2t}\,dx = 1 . \]
The integral (17) is less than 1 for two reasons. One reason is that the survival
probability integral omits the part of the above integral from $x > a$. The other
is the negative contribution from the image charge. It is obvious (draw a
picture) that
\[ \int_a^{\infty} \frac{1}{\sqrt{2\pi t}}\,e^{-x^2/2t}\,dx = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi t}}\,e^{-(x-2a)^2/2t}\,dx . \]
Also, the left side is $P(X_t > a)$. Therefore
\[ S_a(t) = 1 - 2P(X_t > a) . \tag{20} \]

We derived this formula by calculating integrals. But once we see it we can look
for a simple explanation.
The simple explanation given by Kolmogorov depends on two properties
of Brownian motion: it is symmetric (as likely to go up by $\Delta X$ as down by
$\Delta X$), and it is Markov (after it hits level $a$, it continues as a Brownian motion
starting at $a$). Let $\mathcal{P}_a$ be the set of paths that reach the level $a$ before time $t$.
The reflection principle is the symmetry condition that
\[ P\left( X_t > a \mid X_{[0,t]} \in \mathcal{P}_a \right) = P\left( X_t < a \mid X_{[0,t]} \in \mathcal{P}_a \right) . \]
This says that a path that touches level $a$ at some time $\tau < t$ is equally likely to
be outside at time $t$ ($X_t > a$) as inside ($X_t < a$). If $\tau_a$ is the hitting time, then
$X_{\tau_a} = a$. If $\tau_a < t$ then the probabilities for the path from time $\tau_a$ to time $t$ are
symmetric about $a$. In particular, the probabilities to be above $a$ and below $a$
are the same. A more precise version of this argument would say that if $s < t$,
\[ P( X_t > a \mid \tau_a = s) = P( X_t < a \mid \tau_a = s) , \]
then integrate over $s$ in the range $0 \le s \le t$. But it takes some mathematical
work to define the conditional probabilities, since $P(\tau_a = s) = 0$ so you cannot
use Bayes' rule directly. Anyway, the reflection principle says that exactly half
of the paths (half in the sense of probability) that ever touch the level $a$ are
above level $a$ at time $t$. That is exactly (20).
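
The agreement between the survival integral (17) and the reflection formula (20) can be checked numerically. This is a sketch under our own assumptions: the integral is truncated 12 standard deviations below $a$ and computed with the midpoint rule, and $P(X_t > a)$ is evaluated with the complementary error function from the Python standard library. The function names are hypothetical.

```python
import math

def survival_by_quadrature(t, a, n=20000, width=12.0):
    """Midpoint rule for (17): the images density (16) integrated over x < a."""
    lo = a - width * math.sqrt(t)          # the integrand is negligible below this point
    dx = (a - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        source = math.exp(-x * x / (2.0 * t))
        image = math.exp(-(x - 2.0 * a) ** 2 / (2.0 * t))
        total += (source - image) * dx
    return total / math.sqrt(2.0 * math.pi * t)

def survival_by_reflection(t, a):
    """Formula (20): S_a(t) = 1 - 2 P(X_t > a) with X_t ~ N(0, t)."""
    prob_above = 0.5 * math.erfc(a / math.sqrt(2.0 * t))
    return 1.0 - 2.0 * prob_above

s_quadrature = survival_by_quadrature(0.7, 1.0)
s_reflection = survival_by_reflection(0.7, 1.0)
print(s_quadrature, s_reflection)
```

The two numbers should agree to within the quadrature error.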

5 Backward equation for Brownian motion
The backward equation is a PDE satisfied by conditional probabilities. Suppose
there is a reward function $V(x)$ and you receive $V(X_T)$ depending on the value
of a Brownian motion path at time $T$. The value function is the conditional expectation
of the reward given a location at time $t < T$:
\[ f(x,t) = E[\, V(X_T) \mid X_t = x\,] . \tag{21} \]
There are other common notations for this. The expression $E_{\cdot}[\,\cdot\,]$ means that the
expectation is taken with respect to the probability distribution in the subscript.
For example, if $Y \sim N(\mu, \sigma^2)$, we might write
\[ E_{\mu,\sigma^2}\left[ e^{Y} \right] = e^{\mu + \sigma^2/2} . \]
We write $E_{x,t}[\,\cdot\,]$ for expectation with respect to paths $X$ with $X_t = x$. The
value function in this notation is
\[ f(x,t) = E_{x,t}[\, V(X_T)\,] . \]
A third equivalent way uses the filtration associated with $X$, which is $\mathcal{F}_t$. The
random variable $E[\,\cdot \mid \mathcal{F}_t]$ is a function of $X_{[0,t]}$. The Markov property simplifies
$X_{[0,t]}$ to $X_t$ if the random variable depends only on the future of $t$. Therefore,
$E[\, V(X_T) \mid \mathcal{F}_t]$ is a function of $X_t$, which we call $f(X_t,t)$. Therefore the following
definition is equivalent to (21):
\[ f(X_t,t) = E[\, V(X_T) \mid \mathcal{F}_t\,] . \tag{22} \]
The backward equation satisfied by $f$ may be derived using the tower property.
This can be used to compare $f(\cdot,t)$ to $f(\cdot,t+\Delta t)$ for small $\Delta t$. The
physics behind this is that $\Delta X$ will be small too, so $f(x,t)$ can be determined
from $f(x+\Delta x, t+\Delta t)$, at least approximately, using Taylor series. These
relations become exact in the limit $\Delta t \to 0$.
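The $\sigma$-algebra $\mathcal{F}_{t+\Delta t}$ has a little more information than $\mathcal{F}_t$. Therefore, if
$Y$ is any random variable
\[ E[\, E[\, Y \mid \mathcal{F}_{t+\Delta t}] \mid \mathcal{F}_t\,] = E[\, Y \mid \mathcal{F}_t\,] . \]
We apply this general principle with $Y = V(X_T)$ and make use of (22), which
leads to
\[ E[\, f(X_{t+\Delta t}, t+\Delta t) \mid \mathcal{F}_t\,] = f(X_t, t) . \]
We write $X_{t+\Delta t} = X_t + \Delta X$ and expand $f(X_{t+\Delta t}, t+\Delta t)$ in a Taylor series.
\begin{align*}
f(X_{t+\Delta t}, t+\Delta t) = {} & f(X_t,t) \\
& + \partial_x f(X_t,t)\,\Delta X \\
& + \partial_t f(X_t,t)\,\Delta t \\
& + \tfrac{1}{2}\,\partial_x^2 f(X_t,t)\,\Delta X^2 \\
& + O(|\Delta X|^3) + O(|\Delta X|\,\Delta t) + O(\Delta t^2) .
\end{align*}
The three remainder terms on the last line are the sizes of the three lowest order
Taylor series terms left out. Now take the expectation of both sides conditioning
on $\mathcal{F}_t$ and pull out of the expectation anything that is known in $\mathcal{F}_t$:
\begin{align*}
E[\, f(X_{t+\Delta t}, t+\Delta t) \mid \mathcal{F}_t\,] = {} & f(X_t,t) \\
& + \partial_x f(X_t,t)\,E[\, \Delta X \mid \mathcal{F}_t\,] \\
& + \partial_t f(X_t,t)\,\Delta t \\
& + \tfrac{1}{2}\,\partial_x^2 f(X_t,t)\,E\left[ \Delta X^2 \mid \mathcal{F}_t \right] \\
& + O\left( E\left[ |\Delta X|^3 \mid \mathcal{F}_t \right] \right) + O\left( E[\, |\Delta X| \mid \mathcal{F}_t\,]\,\Delta t \right) + O(\Delta t^2) .
\end{align*}
The two terms on the top line are equal because of the tower property. The
next line is zero because Brownian motion is symmetric and $E[\, \Delta X \mid \mathcal{F}_t\,] = 0$.
For the fourth line, use the independent increments property and the variance
of Brownian motion increments to get $E\left[ \Delta X^2 \mid \mathcal{F}_t \right] = \Delta t$. We also know the
scaling relations $E[\, |\Delta X|\,] = C\,\Delta t^{1/2}$ and $E\left[ |\Delta X|^3 \right] = C\,\Delta t^{3/2}$. Put all of these
in and cancel the leading power of $\Delta t$:
\[ 0 = \partial_t f(X_t,t) + \tfrac{1}{2}\,\partial_x^2 f(X_t,t) + O\left( \Delta t^{1/2} \right) . \]
Taking $\Delta t \to 0$ shows that $f$ satisfies the backward equation (6).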

We can find several explicit solutions to the backward equation that illustrate
the properties of Brownian motion. One is $f(x,t) = x^2 + T - t$. This corresponds
to final conditions $V(X_T) = X_T^2$. It tells us that if $X_0 = 0$, then
$E[\, V(X_T)\,] = E\left[ X_T^2 \right] = f(0,0) = T$. This is the variance of standard Brownian motion.
Another well known calculation is the expected value of $e^{aX_T}$ starting from
$X_0 = 0$. For this, we want $f(x,t)$ that satisfies (6) and final condition $f(x,T) = e^{ax}$.
We try the ansatz $f(x,t) = Ae^{ax-bt}$. Putting this into the equation gives
\[ -bAe^{ax-bt} + \tfrac{1}{2}a^2 Ae^{ax-bt} = 0 . \]
Therefore $b = a^2/2$ and $f(x,t) = Ae^{ax - a^2t/2}$. Matching the final condition gives
\[ e^{ax} = Ae^{ax - a^2T/2} \implies A = e^{a^2T/2} . \]
The final solution is
\[ f(x,t) = e^{ax + a^2(T-t)/2} . \]
If $X_0 = 0$, we find $E\left[ e^{aX_T} \right] = e^{a^2T/2}$. We verify that this is the right answer
by noting that $Y = aX_T \sim N(0, a^2T)$.
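
Both explicit solutions can be checked numerically. The sketch below is ours, not part of the notes: it estimates the residual of the backward equation $\partial_t f + \frac{1}{2}\partial_x^2 f = 0$ with centered differences, and compares $f(0,0)$ for the exponential solution with a Monte Carlo estimate of $E[e^{aX_T}]$ that samples $X_T \sim N(0,T)$ directly. The parameter values, seed, and function names are arbitrary assumptions.

```python
import math
import random

def backward_residual(f, x, t, h=1e-3):
    """Centered-difference estimate of f_t + (1/2) f_xx, which (6) says is zero."""
    f_t = (f(x, t + h) - f(x, t - h)) / (2.0 * h)
    f_xx = (f(x + h, t) - 2.0 * f(x, t) + f(x - h, t)) / (h * h)
    return f_t + 0.5 * f_xx

T, a = 1.0, 0.5

def quadratic(x, t):
    """Value function for the reward V(x) = x^2."""
    return x * x + T - t

def exponential(x, t):
    """Value function for the reward V(x) = exp(a x)."""
    return math.exp(a * x + a * a * (T - t) / 2.0)

res_quadratic = backward_residual(quadratic, 0.4, 0.3)
res_exponential = backward_residual(exponential, 0.4, 0.3)

# Monte Carlo estimate of E[exp(a X_T)] with X_0 = 0, so X_T ~ N(0, T).
rng = random.Random(0)
n_samples = 200_000
mc_mean = sum(math.exp(a * rng.gauss(0.0, math.sqrt(T)))
              for _ in range(n_samples)) / n_samples
exact_mean = exponential(0.0, 0.0)
print(res_quadratic, res_exponential, mc_mean, exact_mean)
```

The residuals should be near zero, and the Monte Carlo mean should match $e^{a^2T/2}$ to within sampling error.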
You can add boundary conditions to the backward equation to take into
account absorbing boundaries. Suppose you get a reward $W(t)$ if you first
touch a barrier at time $t$, which is $\tau_a = t$. Consider the problem: run a Brownian
motion starting at $X_0 = 0$ to time $\tau_a \wedge T$. For $a > 0$, the value function is defined
for $t \le T$ and $x \le a$. The final condition at $t = T$ is $f(x,T) = V(x)$ as before.
The boundary condition at $x = a$ is $f(a,t) = W(t)$. A more precise statement
of the boundary condition is $f(x,t) \to W(t)$ as $x \to a$. This is similar to the
boundary condition $u(x,t) \to 0$ as $x \to a$. As you approach $a$, your probability
of not hitting $a$ in a short amount of time goes to zero. This implies that as
$X_t \to a$, the conditional probability that $\tau_a > t + \epsilon$ goes to zero. You might
think that the survival probability calculations above would prove this. But
those were based on the boundary condition $u(a,t) = 0$, which we did
not prove. It would be a circular argument.

Week 5
Integrals with respect to Brownian motion
Jonathan Goodman
October 7, 2012

1 Introduction to the material for the week

This week starts the other calculus aspect of stochastic calculus, the limit $\Delta t \to 0$
and the Ito integral. This is one of the most technical classes of the course.
Look for applications in coming weeks. Brownian motion plays a new role
this week, as a source of white noise that drives other continuous time random
processes. Starting this week, $W_t$ usually denotes standard Brownian motion,
so that $X_t$ can denote a different random process driven by $W$ in some way. The
driving white noise is written informally as $dW_t$.
White noise is a continuous time analogue of a sequence of i.i.d. random
variables. Let $Z_n$ be such a sequence, with $E[\, Z_n\,] = 0$ and $E\left[ Z_n^2 \right] = 1$. These
generate a random walk,
\[ V_n = \sum_{k=0}^{n-1} Z_k . \tag{1} \]
The $V_n$ can be expressed in a more dynamical way by saying $V_0 = 0$ and
$V_{n+1} = V_n + Z_n$. If the sequence $V_n$ is given, then
\[ Z_n = V_{n+1} - V_n . \tag{2} \]
In the continuous time limit, a properly scaled $V_n$ converges to Brownian motion.

The discrete time independent increments property is the statement that Zn
defined by (2) are independent. The discrete time analogue of the fact that
Brownian motion is homogeneous in time is the statement that the Zn are
identically distributed.
I.i.d. noise processes cannot have general distributions in continuous time. A
continuous time i.i.d. noise process, white noise, is Gaussian. The continuous
time scaling limit for Brownian motion is
\[ \sqrt{\Delta t}\; V_n \overset{D}{\rightharpoonup} W_t , \quad \text{as } \Delta t \to 0 \text{ with } t_n = n\Delta t \text{ and } t_n \to t . \tag{3} \]
The CLT implies that $W_t$ is Gaussian regardless of the distribution of $Z_n$. White
noise $dW_t$ is Gaussian as well, in whatever way it makes sense.

In continuous time, it is simpler to define white noise from Brownian motion
rather than the other way around. The continuous time analogue of (2) is to
write dWt as the source of noise. The continuous time analogue of (1) would be
to define a white noise process $Z_t$ somehow, then get Brownian motion as
\[ W_t = \int_0^t Z_s\,ds . \tag{4} \]

The numbers Wt make sense as random variables and the path Wt is a contin-
uous function of t. The numbers Zt do not make sense in the same way.
The Ito integral with respect to Brownian motion is written
\[ X_t = \int_0^t f_s\,dW_s . \tag{5} \]
The relation between $X$ and $W$ may be expressed informally in the Ito differential
form
\[ dX_t = f_t\,dW_t . \tag{6} \]
The integrand, $f$, must be adapted to the filtration generated by $W$. If $\mathcal{F}_t$
is generated by the path $W_{[0,t]}$, then $f_t$ must be measurable in $\mathcal{F}_t$. The Ito
integral is different from other stochastic integrals (e.g. Stratonovich) in that
the increment $dW_t$ is taken to be in the future of $t$ and therefore independent
of $f_{[0,t]}$. This implies that
\[ E[\, dX_t \mid \mathcal{F}_t\,] = f_t\,E[\, dW_t \mid \mathcal{F}_t\,] = 0 , \tag{7} \]
\[ E\left[ dX_t^2 \mid \mathcal{F}_t \right] = f_t^2\,E\left[ dW_t^2 \mid \mathcal{F}_t \right] = f_t^2\,dt . \tag{8} \]
The Ito integral is important because more or less any continuous time, continuous
path stochastic process $X_t$ can be expressed in terms of it. A martingale
is a process with the mean zero property (7). More or less any such martingale
can be represented as an Ito integral (5). This is in the spirit of the central
limit theorem. In the continuous time limit, a process is determined by its mean
and variance. If the mean is zero, it is only the variance, which is $f_t^2$.
The mathematics this week is reasonably precise yet not fully rigorous. You
should be able to understand it even if you have not studied mathematical analysis.
This material is not for culture. You are expected to master it along with
the rest of the course. If this were not possible, or not important, the material
would not be here.
The approach taken here is not the standard approach using approximation
by simple functions and the Ito isometry formula. You can find the standard
approach in the book by Oksendal, for example. The standard approach is
simpler but relies more on results from measure theory. The approach here will look
almost the same as the standard approach if you do it completely rigorously,
which we do not.

2 Pathwise convergence and the Borel Cantelli lemma

Section 3 constructs a sequence of approximations to the Ito integral, $X_t^m$. This
section is a description of some technical tools that can show that the $X_t^m$
converge to a limit as $m \to \infty$. What we describe is related to the standard
Borel Cantelli lemma but it is not the same. This section is written without the
usual motivations. You may need to read it twice to see how things fit together.
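Suppose $a_m > 0$ is a sequence of numbers with a finite sum
\[ s = \sum_{m=1}^{\infty} a_m < \infty . \tag{9} \]
Let $r_n$ be the tail sum
\[ r_n = \sum_{m>n} a_m . \]
Then $r_n \to 0$ as $n \to \infty$. The proof of this is that the partial sums
\[ s_n = \sum_{m \le n} a_m \]
converge to $s$, and $s_n + r_n = s$ for any $n$, so $s - s_n = r_n \to 0$ as $n \to \infty$.

Now suppose $b_m$ is a sequence of numbers with $|b_m| \le a_m$. Consider the sum
\[ x = \sum_{m=1}^{\infty} b_m . \tag{10} \]
The sum converges absolutely if the $a_m$ have a finite sum. Therefore (9) implies
that $x$ is well defined. The partial sums for (10) are
\[ x_n = \sum_{m \le n} b_m . \]
These satisfy
\[ |x - x_n| = \Big| \sum_{m>n} b_m \Big| \le \sum_{m>n} a_m = r_n \to 0 , \]
as $n \to \infty$. If $x_m$ is a sequence of numbers with $b_m = x_{m+1} - x_m$, then the limit
\[ x = \lim_{n \to \infty} x_n = x_1 + \sum_{m=1}^{\infty} b_m \]
is well defined. Moreover,
\[ |x - x_n| \le r_{n-1} \to 0 , \quad \text{as } n \to \infty . \tag{11} \]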

Suppose $A_m$ is a sequence of non-negative random numbers. Typically, the
$A_m$ can be arbitrarily large and so it might happen that $S = \sum_m A_m = \infty$. We
hope to show that the probability it will happen is zero. The event $S = \infty$ is
a measurable set, which in some sense means it is a possible outcome. But if
$P(S = \infty) = 0$, you will never see that outcome. We say that an event $D$
happens almost surely if $P(D) = 1$. This is abbreviated as a.s., as in $S < \infty$
almost surely, or $S < \infty$ a.s. Other expressions are a.e., for almost everywhere,
and p.p., for presque partout (almost everywhere, in French).
Many people refuse to distinguish between outcomes that are impossible,
which would be $\omega \notin \Omega$, and events that have probability zero. We will be sloppy
with the distinction in this class, and ignore it much of the time.
Our strategy will be to show that $S < \infty$ a.s. by showing that $E[\, S\,] < \infty$.
That is
\[ \sum_{m=1}^{\infty} E[\, A_m\,] < \infty \implies \sum_{m=1}^{\infty} A_m < \infty \text{ a.s.} \]
In particular, let $X_t^m$ be a sequence of random paths. Suppose you can show
\[ E\left[ \left| X_t^{m+1} - X_t^m \right| \right] \le a_m , \quad \text{with } \sum_m a_m < \infty , \tag{12} \]
for all $t \le T$. Taking $A_m = \left| X_t^{m+1} - X_t^m \right|$, you know that the following limit
exists almost surely
\[ X_t = \lim_{m \to \infty} X_t^m . \tag{13} \]
This is our version of the Borel Cantelli lemma. We calculate expected values
to verify the hypothesis (12), then we conclude that the limit exists pathwise
almost surely.
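
The deterministic part of this argument can be illustrated with a few lines of code. This is a sketch with choices of our own: the bounds are the geometric sequence $a_m = 2^{-m/2}$, the terms $b_m = a_m \cos(3m)$ are an arbitrary sequence with $|b_m| \le a_m$, and the infinite sums are truncated at 200 terms. The code checks that the gap between the limit and each partial sum is bounded by the tail sum $r_n$.

```python
import math

rho = 2.0 ** -0.5
n_terms = 200                          # truncation standing in for the infinite sum

# Summable bounds a_m and an arbitrary sequence b_m with |b_m| <= a_m.
a = [rho ** m for m in range(1, n_terms + 1)]
b = [a_m * math.cos(3.0 * m) for m, a_m in enumerate(a, start=1)]

x_limit = sum(b)

def partial_sum(n):
    """x_n, the sum of b_m for m <= n."""
    return sum(b[:n])

def tail_bound(n):
    """r_n, the sum of a_m for m > n."""
    return sum(a[n:])

gaps = [(abs(x_limit - partial_sum(n)), tail_bound(n)) for n in range(1, 40)]
for gap, bound in gaps[:5]:
    print(gap, bound)
```

Each printed gap should be no larger than the bound next to it, and both shrink geometrically.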

3 Riemann sums for the Ito integral

We use the following Riemann sum approximation for the Ito integral (5):
\[ X_t^m = \sum_{t_j < t} f_{t_j}\,\Delta W_j . \tag{14} \]
The notation is
\[ \Delta t = 2^{-m} , \tag{15} \]
\[ t_j = j\,\Delta t , \tag{16} \]
$W_t$ is a standard Brownian motion, and
\[ \Delta W_j = W_{t_{j+1}} - W_{t_j} . \tag{17} \]
The pathwise convergence will be that for almost every Brownian motion path,
the approximations (14) converge to a limit. This limit will be measurable in
$\mathcal{F}_t$ because $X_t$ is a function of $W_{[0,t]}$.

The Riemann sum approximation (14) needs lots of explanation. The Brownian
motion increment used at time $t_j$ (17) is in the future of $t_j$. We assume that
$f_{t_j}$ is measurable in $\mathcal{F}_{t_j}$, so this makes $\Delta W_j$ independent of $f_{t_j}$. In particular,
\[ E\left[ f_{t_j}\,\Delta W_j \mid \mathcal{F}_{t_j} \right] = 0 , \tag{18} \]
and
\[ E\left[ \left( f_{t_j}\,\Delta W_j \right)^2 \mid \mathcal{F}_{t_j} \right] = f_{t_j}^2\,\Delta t . \tag{19} \]
The Riemann sum definition (14) defines $X_t^m$ for all $t$. It gives a path that
is discontinuous at the times $t_j$. Sometimes it is convenient to re-define $X_t^m$
by linear interpolation between $t_j$ and $t_{j+1}$ so that it is continuous. Those
subtleties do not matter this week.
We use the limit $m \to \infty$ rather than $\Delta t \to 0$. It is easy to compare the
$\Delta t_m = 2^{-m}$ approximation to the one with $\Delta t_{m+1} = \frac{1}{2}\Delta t_m$, as we will see.
Moreover, taking $\Delta t \to 0$ rapidly makes it easier for the sum (12) to converge.
We assume that the integrand $f_t$ is continuous in some way. Specifically, we
assume that if $s > 0$, then
\[ E\left[ (f_{t+s} - f_t)^2 \mid \mathcal{F}_t \right] \le Cs . \tag{20} \]
This allows integrands like $f_t = W_t$, or $f_t = tW_t$. Some of the integrands we
use later in the course do not satisfy this hypothesis, but most are close. We
will re-examine the conditions on $f_t$ below to see what is really necessary.
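The main step in the proof is the estimation of the terms in (12). The move
from $m$ to $m+1$ replaces $\Delta t_m$ by $\Delta t_{m+1} = \frac{1}{2}\Delta t_m$. We can write $X_t^{m+1}$ in
terms of the extended $m$ definition $t_{j+\frac{1}{2}} = (j + \frac{1}{2})\Delta t$. For simplicity, we
skip the $t$'s and write $f_{j+\frac{1}{2}}$ for $f_{t_{j+\frac{1}{2}}}$, and $W_{j+\frac{1}{2}}$ for $W_{t_{j+\frac{1}{2}}}$, etc.
\[ X_t^{m+1} = \sum_{t_j < t} \left[ f_{j+\frac{1}{2}} \left( W_{j+1} - W_{j+\frac{1}{2}} \right) + f_j \left( W_{j+\frac{1}{2}} - W_j \right) \right] + Q . \]
The $Q$ on the end is the term that may result from $X_t^{m+1}$ having an odd number
of terms in its sum. In that case, $Q$ is the last term. It makes a negligible
contribution to the sum. We subtract from $X_t^{m+1}$ the $X_t^m$ sum
\[ X_t^m = \sum_{t_j < t} f_j \left( W_{j+1} - W_j \right) . \]
The result is
\[ X_t^{m+1} - X_t^m = \sum_{t_j < t} \left( f_{j+\frac{1}{2}} - f_j \right) \left( W_{j+1} - W_{j+\frac{1}{2}} \right) + Q . \tag{21} \]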

The terms on the right side of (21) have mean zero. This implies that the
sum has cancellations that may be hard to see if we take absolute values too
soon. We find the cancellations by calculating the square and using the Cauchy
Schwarz inequality. In probability, a form of the Cauchy Schwarz inequality is
that if $U$ and $V$ are two random variables, then (proof in the next paragraph)
\[ E[\, UV\,] \le \sqrt{E\left[ U^2 \right] E\left[ V^2 \right]} . \]
For $V = 1$, this is just
\[ E[\, U\,] \le \sqrt{E\left[ U^2 \right]} . \]
Applying this to $U = \left| X_t^{m+1} - X_t^m \right|$ gives
\[ E\left[ \left| X_t^{m+1} - X_t^m \right| \right] \le a_m , \]
where
\[ a_m^2 = E\left[ \left( X_t^{m+1} - X_t^m \right)^2 \right] . \]
This is something we can calculate.
(Here is a proof of the Cauchy Schwarz inequality in the form we need. The
following quantity is non-negative for any $\lambda$
\[ 0 \le E\left[ (U - \lambda V)^2 \right] = E\left[ U^2 \right] - 2\lambda E[\, UV\,] + \lambda^2 E\left[ V^2 \right] . \]
We minimize the right side by taking $\lambda = E[\, UV\,] / E\left[ V^2 \right]$. Putting this in the
first expression gives
\[ 0 \le E\left[ U^2 \right] - \frac{E[\, UV\,]^2}{E\left[ V^2 \right]} . \]
Multiply through by $E\left[ V^2 \right]$ and you get Cauchy Schwarz.)
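Denote a typical term in the sum on the right of (21) as
\[ Y_j = \left( f_{j+\frac{1}{2}} - f_j \right) \left( W_{j+1} - W_{j+\frac{1}{2}} \right) . \]
It is clear from the definition that
\[ E\left[ Y_j \mid \mathcal{F}_{j+\frac{1}{2}} \right] = \left( f_{j+\frac{1}{2}} - f_j \right) E\left[ W_{j+1} - W_{j+\frac{1}{2}} \mid \mathcal{F}_{j+\frac{1}{2}} \right] = 0 . \]
It follows from the tower property that $E[\, Y_j \mid \mathcal{F}_j\,] = 0$. If $k < j$, then $Y_k$ is
known in $\mathcal{F}_j$, so
\[ E[\, Y_k Y_j \mid \mathcal{F}_j\,] = Y_k\,E[\, Y_j \mid \mathcal{F}_j\,] = 0 . \]
In the expected value of $\left( \sum_j Y_j \right)^2 = \sum_j \sum_k Y_j Y_k$ there are two kinds of terms. We
just saw that off diagonal terms, those with $j \ne k$, have expected value equal to
zero. A typical diagonal term has
\begin{align*}
E\left[ Y_j^2 \mid \mathcal{F}_{j+\frac{1}{2}} \right] &= E\left[ \left( f_{j+\frac{1}{2}} - f_j \right)^2 \left( W_{j+1} - W_{j+\frac{1}{2}} \right)^2 \mid \mathcal{F}_{j+\frac{1}{2}} \right] \\
&= \left( f_{j+\frac{1}{2}} - f_j \right)^2 E\left[ \left( W_{j+1} - W_{j+\frac{1}{2}} \right)^2 \mid \mathcal{F}_{j+\frac{1}{2}} \right] \\
&= \left( f_{j+\frac{1}{2}} - f_j \right)^2 \frac{\Delta t}{2} .
\end{align*}
Taking the next expectation and using (20) gives the desired inequality
\[ E\left[ Y_j^2 \mid \mathcal{F}_j \right] = \frac{\Delta t}{2}\,E\left[ \left( f_{j+\frac{1}{2}} - f_j \right)^2 \mid \mathcal{F}_j \right] \le C\,\Delta t^2 . \]
Finally,
\[ a_m^2 \le C \sum_{t_j < t} \Delta t^2 = C\,\Delta t \sum_{t_j < t} \Delta t \le C\,t\,\Delta t_m . \]
You can check that adding $Q$ to this calculation does not change the conclusion.
The last inequality may be written
\[ a_m \le \sqrt{C}\,\sqrt{t}\,\sqrt{\Delta t_m} = C_t\,\rho^m , \]
where $C_t = \sqrt{Ct}$ and $\rho = 2^{-1/2} < 1$. The sum in (12) becomes a convergent geometric series.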
This completes the proof that the approximations (14) converge to something.
We used the powers of two in two ways. First, it made it easy to compare $X_t^m$
to $X_t^{m+1}$. Second, it made the sum on the right of (12) a convergent geometric
series. In another week (which we will not do in this course), we could show that
the restriction to powers of 2 for $\Delta t$ is unnecessary. You can see how to relax
our assumption (20). For example, it suffices to take $E\left[ (f_{t+s} - f_t)^2 \right] \le Cs$,
rather than the conditional expectation. This allows discontinuous integrands
that depend on hitting times. It is possible to substitute a power of $s$ less than
1, such as $\sqrt{s}$. This would just lead to a different ratio $\rho < 1$ in the final geometric
series.
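
The geometric decay of $a_m$ can be seen in a simulation. The following is a rough sketch under assumptions of our own: the integrand is $f_t = W_t$, the end time is $t = 1$ (so there is no $Q$ term), and every level uses the same Brownian path, sampled on the finest grid and coarsened by summing increments. For this integrand each term of (21) is a product of two independent increments with variance $\Delta t/2$, so $a_m^2 = t\,\Delta t_m/4$ exactly, and the Monte Carlo estimates should halve from one level to the next. Function names, the seed, and the number of paths are arbitrary.

```python
import math
import random

def brownian_increments(n_steps, dt, rng):
    """Independent N(0, dt) increments of one Brownian path."""
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

def riemann_sum_w(increments, stride):
    """Left endpoint sum (14) with f = W, on every `stride`-th point of the fine grid."""
    total, w = 0.0, 0.0
    for j in range(0, len(increments), stride):
        d_w = sum(increments[j:j + stride])   # increment over one coarse step
        total += w * d_w                      # integrand taken before the increment
        w += d_w
    return total

def mean_square_differences(levels, n_paths, seed=0):
    """Monte Carlo estimate of a_m^2 = E[(X^{m+1} - X^m)^2] at t = 1."""
    rng = random.Random(seed)
    finest = max(levels) + 1
    n_fine = 2 ** finest
    sums = {m: 0.0 for m in levels}
    for _ in range(n_paths):
        inc = brownian_increments(n_fine, 1.0 / n_fine, rng)
        x = {m: riemann_sum_w(inc, 2 ** (finest - m))
             for m in range(min(levels), finest + 1)}
        for m in levels:
            sums[m] += (x[m + 1] - x[m]) ** 2
    return {m: sums[m] / n_paths for m in levels}

a_squared = mean_square_differences(levels=[3, 4, 5, 6], n_paths=2000)
for m, value in sorted(a_squared.items()):
    print(m, value, 2.0 ** -m / 4.0)
```

Each row prints the level $m$, the estimated $a_m^2$, and the predicted value $2^{-m}/4$.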

4 Example
There are a few Ito integrals that can be computed directly from the definition.
Ito's lemma, which we will see next week, is a better way to approach actual
calculations. This is as in ordinary calculus. Riemann sums are a good way
to define the Riemann integral, but the fundamental theorem of calculus is an
easier way to compute specific examples.
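The first example is
\[ X_t = \int_0^t W_s\,dW_s . \tag{22} \]
The Riemann sum approximation is
\[ X_t^m = \sum_{t_j < t} W_{t_j} \left( W_{t_{j+1}} - W_{t_j} \right) . \]
The trick for doing this is
\[ W_{t_j} = \frac{1}{2} \left( W_{t_{j+1}} + W_{t_j} \right) - \frac{1}{2} \left( W_{t_{j+1}} - W_{t_j} \right) . \]
This leads to
\[ X_t^m = \frac{1}{2} \sum_{t_j < t} \left( W_{t_{j+1}} + W_{t_j} \right) \left( W_{t_{j+1}} - W_{t_j} \right) - \frac{1}{2} \sum_{t_j < t} \left( W_{t_{j+1}} - W_{t_j} \right)^2 . \]
A general term in the first sum is
\[ \left( W_{t_{j+1}} + W_{t_j} \right) \left( W_{t_{j+1}} - W_{t_j} \right) = W_{t_{j+1}}^2 - W_{t_j}^2 . \]
Therefore, the first sum is a telescoping sum (see footnote 1), which is a sum of the form
\[ (a - b) + (b - c) + \cdots + (x - y) + (y - z) = a - z . \]
Let $t_n = \max\{t_j \mid t_j < t\}$, then the first sum is $\frac{1}{2}\left( W_{t_{n+1}}^2 - W_0^2 \right)$. This simplifies
more because $W_0 = 0$ to $\frac{1}{2} W_{t_{n+1}}^2$. Clearly, $W_{t_{n+1}} \to W_t$ as $\Delta t \to 0$.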
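The second sum involves
\[ S = \sum_{t_j < t} \Delta W_j^2 . \tag{23} \]
The mean and variance describe the answer as precisely as we need. For the
mean, we have $E\left[ \Delta W_j^2 \right] = \Delta t$, so
\[ E[\, S\,] = \sum_{t_j < t} \Delta t = t_{n+1} \to t \quad \text{as } \Delta t \to 0 . \]
For the variance, the terms $\Delta W_j^2$ are independent, and $\mathrm{var}\left( \Delta W_j^2 \right) = 2\Delta t^2$
(recall: $\Delta W_j$ is Gaussian and we know the fourth moments of a Gaussian).
Therefore
\[ \mathrm{var}(S) = 2\Delta t \Big( \sum_{t_j < t} \Delta t \Big) = 2\Delta t\; t_{n+1} \le 2(t + \Delta t)\,2^{-m} . \]
These two calculations show that $S \to t$ as $m \to \infty$. Therefore
\[ X_t^m \to \frac{1}{2} \left( W_t^2 - t \right) \quad \text{as } m \to \infty . \]
This gives the famous result
\[ \int_0^t W_s\,dW_s = \frac{1}{2} \left( W_t^2 - t \right) . \tag{24} \]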
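
A single simulated path shows the same thing. This is a minimal sketch with our own choices of seed, end time $t = 1$, and level $m = 14$; the function name is hypothetical. It accumulates the left endpoint Riemann sum and the quadratic sum (23) along one path, then compares the Riemann sum with the Ito answer $\frac{1}{2}(W_t^2 - t)$ and with the ordinary calculus guess $\frac{1}{2}W_t^2$.

```python
import math
import random

def ito_riemann_sum(m, t=1.0, seed=1):
    """Return (X_t^m, W_t, S) for f = W on the grid dt = 2**-m (t a multiple of dt)."""
    rng = random.Random(seed)
    dt = 2.0 ** -m
    n = int(round(t / dt))
    w, x, s = 0.0, 0.0, 0.0
    for _ in range(n):
        d_w = rng.gauss(0.0, math.sqrt(dt))
        x += w * d_w          # left endpoint: integrand evaluated before the increment
        s += d_w * d_w        # the quadratic sum (23)
        w += d_w
    return x, w, s

x_m, w_t, s = ito_riemann_sum(m=14)
ito_answer = 0.5 * (w_t ** 2 - 1.0)
calculus_answer = 0.5 * w_t ** 2
print(x_m, ito_answer, calculus_answer, s)
```

The Riemann sum should be close to the Ito answer and about $t/2$ away from the ordinary calculus guess, while $S$ should be close to $t = 1$.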
We have much to say about this result, starting with what it is not. The
answer would be different if $W_t$ were a differentiable function of $t$. If $W_t$ were
differentiable, then $dW_s = \frac{dW}{ds}\,ds$, and
\[ \int_0^t W_s\,dW_s = \int_0^t W_s \frac{dW}{ds}\,ds = \frac{1}{2} \int_0^t \frac{d}{ds} W_s^2\,ds = \frac{1}{2} W_t^2 . \]
1 The term comes from a collapsing telescope. You can find pictures of these on the web.

The Ito result (24) is different. The Ito calculus for rough functions like Brownian
motion gives results that are not what you would get using the ordinary
calculus. In ordinary calculus, the sum (23) converges to zero as $\Delta t \to 0$. That
is because $\Delta W_j^2$ scales like $\Delta t^2$ if $W_t$ is a differentiable function of $t$, so $S$ is
like $\Delta t \sum_{t_j < t} \Delta t = t\,\Delta t$. But $\Delta W^2$ scales like $\Delta t$ for Brownian motion. That is
why $S$ makes a positive contribution to the Ito integral.
The differentiable calculus answer $\frac{1}{2} W_t^2$ is wrong because it is not a
martingale. A martingale is a stochastic process so that if $t > s$, then
\[ E[\, X_t \mid \mathcal{F}_s\,] = X_s . \tag{25} \]
The Ito integral is a martingale. But
\[ E\left[ W_t^2 \mid \mathcal{F}_s \right] = W_s^2 + (t - s) , \]
so $W_t^2$ is not a martingale (see Section 5). The correct formula (24) is a martingale.
The correction $W_t^2 \to W_t^2 - t$ accomplishes this.

5 Properties of the Ito integral

This section discusses two properties of the Ito integral: (1) the martingale
property, (2) the Ito isometry formula.
Two easy steps verify the martingale property. Step one is to say that we
can define the Ito integral with a different start time as
\[ \int_a^t f_s\,dW_s = \lim_{m \to \infty} \sum_{a \le t_j < t} f_{t_j} \left( W_{t_{j+1}} - W_{t_j} \right) . \tag{26} \]
This has the additivity property
\[ \int_0^a f_s\,dW_s + \int_a^t f_s\,dW_s = \int_0^t f_s\,dW_s . \]
Step two is that
\[ E\left[ \int_a^t f_s\,dW_s \,\Big|\, \mathcal{F}_a \right] = 0 . \]
This is because the right side of (26) has expected value zero. That is because
all the terms on the right are in the future of $\mathcal{F}_a$. That zero expectation is
preserved in the limit $\Delta t \to 0$. A general theorem in probability says that if
$Y_m$ is a family of random variables and $Y_m \to Y$ as $m \to \infty$, and if another
technical condition is satisfied (discussed in Week 8), then $E[\, Y_m\,] \to E[\, Y\,]$ as
$m \to \infty$.
When we use these facts together, we conclude that
\[ E\left[ \int_0^t f_s\,dW_s \,\Big|\, \mathcal{F}_a \right] = E\left[ \int_0^a f_s\,dW_s \,\Big|\, \mathcal{F}_a \right] + E\left[ \int_a^t f_s\,dW_s \,\Big|\, \mathcal{F}_a \right] = E\left[ \int_0^a f_s\,dW_s \,\Big|\, \mathcal{F}_a \right] = X_a . \]
This is the martingale property for $X_t$.
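The Ito isometry formula is
\[ E\left[ \left( \int_0^t f_s\,dW_s \right)^2 \right] = \int_0^t E\left[ f_s^2 \right] ds . \tag{27} \]
The variance of the Ito integral is equal to the ordinary integral of the expected
square of the integrand. The ideas we have been using make the proof of this
formula routine. Informally, we write
\[ E[\, f_s\,dW_s\, f_{s'}\,dW_{s'}\,] = \begin{cases} 0 & \text{if } s \ne s' \\ E\left[ f_s^2 \right] ds & \text{if } s = s' . \end{cases} \]
The unequal time formula on the top line reflects that either $dW_s$ or $dW_{s'}$ is
in the future of everything else in the formula. The equal time formula on the
bottom line reflects the informal $E\left[ (dW_s)^2 \mid \mathcal{F}_s \right] = ds$. Then
\[ \left( \int_0^t f_s\,dW_s \right)^2 = \left( \int_0^t f_s\,dW_s \right) \left( \int_0^t f_{s'}\,dW_{s'} \right) = \int_0^t \int_0^t f_s f_{s'}\,dW_s\,dW_{s'} . \]
Taking expectations,
\begin{align*}
E\left[ \left( \int_0^t f_s\,dW_s \right)^2 \right] &= \int_0^t \int_0^t E[\, f_s f_{s'}\,dW_s\,dW_{s'}\,] \\
&= \int_0^t E\left[ f_s^2 \right] ds .
\end{align*}
A more formal, but not completely rigorous, version of this argument is little
different from this. We merely switch to the Riemann sum approximation and
take the limit at the end:
\begin{align*}
E\left[ \Big( \sum_{t_j < t} f_{t_j}\,\Delta W_{t_j} \Big)^2 \right] &= E\left[ \sum_{t_j < t} \sum_{t_k < t} f_{t_j} f_{t_k}\,\Delta W_{t_j}\,\Delta W_{t_k} \right] \\
&= \sum_{t_j < t} \sum_{t_k < t} E\left[ f_{t_j} f_{t_k}\,\Delta W_{t_j}\,\Delta W_{t_k} \right] \\
&= \sum_{t_j < t} E\left[ f_{t_j}^2\, E\left[ \Delta W_{t_j}^2 \mid \mathcal{F}_{t_j} \right] \right] \\
&= \sum_{t_j < t} E\left[ f_{t_j}^2 \right] \Delta t .
\end{align*}
The last line is the Riemann sum approximation to the right side of (27).
Let us check the Ito isometry formula on the example (24). For the Ito
integral part we have (recall that $X \sim N(0, \sigma^2)$ implies $\mathrm{var}\left( X^2 \right) = 2\sigma^4$)
\[ \mathrm{var}\left( \int_0^t W_s\,dW_s \right) = \frac{1}{4}\,\mathrm{var}\left( W_t^2 - t \right) = \frac{1}{4}\,\mathrm{var}\left( W_t^2 \right) = \frac{1}{4}\,2t^2 = \frac{t^2}{2} . \]
For the Riemann integral part, we have
\[ \int_0^t E\left[ W_s^2 \right] ds = \int_0^t s\,ds = \frac{t^2}{2} . \]
As the Ito isometry formula (27) says, these are equal.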

A simpler example is $f_s = s^2$, and
\[ X_t = \int_0^t s^2\,dW_s . \]
This is more typical of general Ito integrals in that $X_t$ is not a function of $W_t$
alone. Since $X$ is a linear function of $W$, $X$ is Gaussian. Since $X$ is an Ito integral,
$E[\, X_t\,] = 0$. Therefore, we characterize the distribution of $X_t$ completely
by finding its variance. The Ito isometry formula gives ($f_s^2 = E\left[ f_s^2 \right] = s^4$)
\[ \mathrm{var}(X_t) = \int_0^t s^4\,ds = \frac{t^5}{5} . \]

This may be easier than the method used in question (3) of Assignment 3.
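
The variance $t^5/5$ for $f_s = s^2$ can be checked on the Riemann sums. In this sketch (our own construction, with hypothetical function names and arbitrary seed and sample size) the variance of the Riemann sum is computed two ways: exactly, as $\sum t_j^4 \Delta t$ using the independence of the increments, and by Monte Carlo sampling of the sum itself.

```python
import math
import random

def riemann_variance(t, m):
    """Exact variance of the Riemann sum of s^2 dW: the sum of t_j^4 * dt."""
    dt = 2.0 ** -m
    n = int(round(t / dt))
    return sum((j * dt) ** 4 * dt for j in range(n))

def sample_integral(t, m, rng):
    """One sample of the Riemann sum (14) with f_s = s^2."""
    dt = 2.0 ** -m
    n = int(round(t / dt))
    return sum((j * dt) ** 2 * rng.gauss(0.0, math.sqrt(dt)) for j in range(n))

t = 1.0
exact = t ** 5 / 5.0
grid_variance = riemann_variance(t, m=10)

rng = random.Random(2)
samples = [sample_integral(t, 6, rng) for _ in range(5000)]
mc_variance = sum(x * x for x in samples) / len(samples)   # the mean is zero
coarse_grid_variance = riemann_variance(t, m=6)
print(exact, grid_variance, mc_variance, coarse_grid_variance)
```

The fine grid variance should be close to $1/5$, and the Monte Carlo estimate should match the coarse grid variance to within sampling error.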

Week 6
Ito's lemma for Brownian motion
Jonathan Goodman
October 22, 2012

1 Introduction to the material for the week

Ito's lemma is the big thing this week. It plays the role in stochastic calculus
that the fundamental theorem of calculus plays in ordinary calculus. Most actual
calculations in stochastic calculus use some form of Ito's lemma. Ito's lemma
is one of a family of facts that make up the Ito calculus. It is an analogue for
stochastic processes of the ordinary calculus of Leibniz and Newton. We use
it both as a language for expressing models, and as a set of tools for reasoning
about models.
For example, suppose $N_t$ is the number of bacteria in a dish (a standard
example in beginning calculus). We model $N_t$ in terms of a growth rate, $r$. In a
small increment of time $dt$, the model is that $N$ increases by an amount
$dN_t = rN_t\,dt$. Calculus allows us to express $N_t$ as $N_t = N_0 e^{rt}$. The Ito's lemma of
ordinary calculus gives $df(t) = f'(t)\,dt$. For us, this is $d(N_0 e^{rt}) = rN_0 e^{rt}\,dt = rN_t\,dt$.
Here is a similar example for a stochastic process $X_t$ that could model a stock
price. We suppose that in the time interval $dt$, $X_t$ changes by a random
amount whose size is proportional to $X_t$. In stock terms, the probability to go
from 100 to 102 is the same as the probability to go from 10 to 10.2. A simple
way to do this is to make $dX$ proportional to $X_t$ and $dW_t$, as in $dX_t = \sigma X_t\,dW_t$.
The differentials are all forward looking, so $dX_t = X_{t+dt} - X_t$ and
$dW_t = W_{t+dt} - W_t$ with $dt > 0$. The Ito lemma for the Ito calculus is (using subscripts
for partial derivatives) $d(f(W_t,t)) = f_w(W_t,t)\,dW_t + \frac{1}{2} f_{ww}(W_t,t)\,dt + f_t(W_t,t)\,dt$.
The solution is $f(w,t) = x_0 e^{\sigma w - \sigma^2 t/2}$. We check this using $f_w = \sigma f$, $f_{ww} = \sigma^2 f$,
and $f_t = -\frac{\sigma^2}{2} f$. Therefore, if $X_t = f(W_t,t)$, then $dX_t = \sigma X_t\,dW_t$ as desired.
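
The claim that $X_t = x_0 e^{\sigma W_t - \sigma^2 t/2}$ solves $dX_t = \sigma X_t\,dW_t$ can be tested along a simulated path. The sketch below is an illustration under our own assumptions: it steps the forward looking increment $X_{t+dt} - X_t = \sigma X_t (W_{t+dt} - W_t)$ with a small but finite $dt$ (the Euler-Maruyama scheme, which is not discussed in these notes) and compares the result with the formula evaluated on the same path. The function name and all parameter values are hypothetical.

```python
import math
import random

def simulate_gbm(x0=1.0, sigma=0.5, t_end=1.0, n_steps=2 ** 14, seed=3):
    """Return (stepped X, formula with Ito correction, formula without it)."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    x_stepped, w = x0, 0.0
    for _ in range(n_steps):
        d_w = rng.gauss(0.0, math.sqrt(dt))
        x_stepped += sigma * x_stepped * d_w   # dX = sigma X dW, forward looking
        w += d_w
    x_formula = x0 * math.exp(sigma * w - 0.5 * sigma ** 2 * t_end)
    x_naive = x0 * math.exp(sigma * w)         # ordinary calculus guess, no correction
    return x_stepped, x_formula, x_naive

x_stepped, x_formula, x_naive = simulate_gbm()
print(x_stepped, x_formula, x_naive)
```

The stepped value should track the formula with the $-\sigma^2 t/2$ correction, while the ordinary calculus guess is off by the factor $e^{\sigma^2 t/2}$.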
Ito's lemma for this week is about the time derivative of stochastic processes
$f(W_t,t)$, where $f(w,t)$ is a differentiable function of its arguments. The Ito
differential is
$$ df = f(W_{t+dt}, t+dt) - f(W_t,t) . $$
This is the change in $f$ over a small increment of time $dt$. If you integrate the
Ito differential of $f$, you get the change in $f$. If $X_t$ is any process, then
$$ X_b - X_a = \int_a^b dX_s . \tag{1} $$

This is the way to show that something is equal to $dX_t$: you put your differential
on the right, integrate, and see whether you get the left side. In particular, the
differential formula $dX_t = \mu_t\,dt + \sigma_t\,dW_t$ means that
$$ X_b - X_a = \int_a^b \mu_s\,ds + \int_a^b \sigma_s\,dW_s . \tag{2} $$

The first integral on the right is an ordinary integral. The second is the Ito
integral from last week. The Ito integral is well defined provided $\sigma_t$ is an adapted
process.
Ito's lemma for Brownian motion is
$$ df(W_t,t) = \partial_w f(W_t,t)\,dW_t + \frac12 \partial_w^2 f(W_t,t)\,dt + \partial_t f(W_t,t)\,dt . \tag{3} $$
An informal derivation starts by expanding $df$ in Taylor series in $dW$ and $dt$ up
to second order in $dW$ and first order in $dt$,
$$ df = \partial_w f\,dW + \frac12 \partial_w^2 f\,(dW)^2 + \partial_t f\,dt . $$
We get (3) from this using $(dW_t)^2 = dt$. The formula $(dW_t)^2 = dt$ cannot be
literally true, because $(dW_t)^2$ is random and $dt$ is not random. It is true that
$E[(dW_t)^2 \mid \mathcal{F}_t] = dt$, but Ito's lemma is about more than expectations.
The real theorem of Ito's lemma, in the spirit of (2), is
$$ f(W_b,b) - f(W_a,a) = \int_a^b \partial_w f(W_t,t)\,dW_t + \int_a^b \left( \frac12 \partial_w^2 f(W_t,t) + \partial_t f(W_t,t) \right) dt . \tag{4} $$
Everything here has been defined. The second integral on the right is an
ordinary Riemann integral. The first integral on the right is the Ito integral
defined last week. We give an informal proof of this in Section 2.
You see the convenience of Ito's lemma by re-doing the example from last
week,
$$ X_t = \int_0^t W_s\,dW_s . $$
A first guess from ordinary calculus might be $X_t = \frac12 W_t^2$. Let us take the Ito
differential of $\frac12 W_t^2$. This is $df(W_t,t)$, where $f(w,t) = \frac12 w^2$, and $\partial_w f(w,t) = w$,
and $\frac12 \partial_w^2 f(w,t) = \frac12$. Therefore, (3) gives
$$ d\left( \tfrac12 W_t^2 \right) = W_t\,dW_t + \tfrac12\,dt . $$
Integrating from $0$ to $t$ gives
$$ \tfrac12 W_t^2 - \tfrac12 W_0^2 = \int_0^t W_s\,dW_s + \tfrac12 \int_0^t ds = \int_0^t W_s\,dW_s + \tfrac{t}{2} . $$

You just rearrange this and recall that $W_0 = 0$, and you get the formula from
Week 5:
$$ X_t = \int_0^t W_s\,dW_s = \tfrac12 W_t^2 - \tfrac12 t . $$
This is quicker than the telescoping sum stuff from Week 5.
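The formula for $\int_0^t W_s\,dW_s$ can be checked on a simulated path. This is a rough sketch rather than anything taken from the notes: it forms the left endpoint Riemann sum $\sum_j W_j (W_{j+1} - W_j)$ and compares it with $\frac12 W_T^2 - \frac12 T$. The function name `ito_sum`, the step count, and the seed are illustrative assumptions.

```python
import math
import random

def ito_sum(T=1.0, n=100_000, seed=2):
    """Return (left endpoint Riemann sum for int W dW, W_T^2 / 2 - T / 2) on one path."""
    rng = random.Random(seed)
    dt = T / n
    w = 0.0
    total = 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, math.sqrt(dt))
        total += w * dw      # integrand evaluated at the left endpoint (adapted)
        w += dw
    return total, 0.5 * w * w - 0.5 * T

riemann, formula = ito_sum()
print(riemann, formula)
```

The gap between the two values is half the difference between $\sum_j (\Delta W_j)^2$ and $T$, so it should shrink like $\Delta t^{1/2}$ as `n` grows.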
Ito's lemma gives a convenient way to figure out the backward equation for
many problems. Ito's lemma and the martingale (mean zero) property of Ito
integrals work together to tell you how to evaluate conditional expectations.
Consider the Ito integral
$$ X_T = \int_0^T g_s\,dW_s . $$
Then
$$ E[X_T \mid \mathcal{F}_t] = E\left[ \int_0^t g_s\,dW_s \,\Big|\, \mathcal{F}_t \right] + E\left[ \int_t^T g_s\,dW_s \,\Big|\, \mathcal{F}_t \right] . $$

The first term is completely known at time $t$, so the expectation is irrelevant.
The second term is zero, because $dW_s$ is in the future of $g_s$ and $\mathcal{F}_t$. Therefore
$$ E\left[ \int_0^T g_s\,dW_s \,\Big|\, \mathcal{F}_t \right] = \int_0^t g_s\,dW_s . $$
Now suppose $f(w,t)$ is the value function
$$ f(w,t) = E[V(W_T) \mid W_t = w] . $$
The integral form of Ito's lemma (4) gives
$$ V(W_T) - f(W_t,t) = \int_t^T df(W_s,s) = \int_t^T \partial_w f(W_s,s)\,dW_s + \int_t^T \left( \partial_t f(W_s,s) + \frac12 \partial_w^2 f(W_s,s) \right) ds . $$
Take the conditional expectation in $\mathcal{F}_t$. Looking on the left side, we have
$$ E[V(W_T) \mid \mathcal{F}_t] = f(W_t,t) , $$
which is an equivalent definition of the value function. Clearly, $E[f(W_t,t) \mid \mathcal{F}_t] = f(W_t,t)$.
Therefore you get zero on the left. The conditional expectation of the
Ito integral on the right also vanishes, as we said just above. Therefore
$$ E\left[ \int_t^T \left( \partial_t f(W_s,s) + \frac12 \partial_w^2 f(W_s,s) \right) ds \,\Big|\, \mathcal{F}_t \right] = 0 . $$

The simplest way for this to happen is for the integrand to vanish identically.
The equation you get by setting the integrand to zero is
$$ \partial_t f + \frac12 \partial_w^2 f = 0 . $$

This is the backward equation we derived in Week 4. The difference here is that
you don't have to think about what you're doing. All the hard thinking
(the mathematical analysis) goes into Ito's lemma. Once you are liberated
from thinking hard, you can easily derive backward equations for many other
problems.
2 Informal proof of Ito's lemma

The theorem of Ito's lemma is the integral formula (4). We will prove it under
the assumption that $f(w,t)$ is a differentiable function of its arguments up to
third derivatives. We assume all mixed partial derivatives up to that order
exist and are bounded. That means $|\partial_w^3 f(w,t)| \le C$, and $|\partial_t^2 f(w,t)| \le C$, and
$|\partial_w \partial_t f(w,t)| \le C$, and so on.
We use the notation of Week 5, with $\Delta t = 2^{-m}$ and $t_j = j\Delta t$. The change
in any quantity from $t_j$ to $t_{j+1}$ is $\Delta(\cdot)_j$. We use the subscript $j$ for $t_j$, as in
$W_j$ instead of $W_{t_j}$. For example, $\Delta f_j = f(W_j + \Delta W_j, t_j + \Delta t) - f(W_j,t_j)$. In
this notation, the left side of (4) is
$$ f(W_b,b) - f(W_a,a) \approx \sum_{a \le t_j < b} \Delta f_j . \tag{5} $$

The right side is a telescoping sum, which is equal to the left side if $b = n\Delta t$
and $a = m\Delta t$ for some integers $m < n$. When $\Delta t$ and $\Delta W$ are small, there is
a Taylor series approximation of $\Delta f_j$. The leading order terms in the Taylor
series combine to form the integrals on the right of (4). The remainder terms
add up to something that goes to zero as $\Delta t \to 0$.
Suppose $w$ and $t$ are some numbers and $\Delta w$ and $\Delta t$ are some small changes.
Define $\Delta f = f(w + \Delta w, t + \Delta t) - f(w,t)$. The Taylor series, up to the order
we need, is
$$ \Delta f = \partial_w f(w,t)\,\Delta w + \frac12 \partial_w^2 f(w,t)\,\Delta w^2 + \partial_t f(w,t)\,\Delta t \tag{6} $$
$$ \qquad + O\left( |\Delta w|^3 \right) + O\left( |\Delta w|\,\Delta t \right) + O\left( \Delta t^2 \right) . \tag{7} $$

The big $O$ quantities on the second line refer to things bounded by a multiple
of what's in the big $O$, so $O(|\Delta w|^3)$ means: some quantity $Q$ so that there
is a $C$ with $|Q| \le C |\Delta w|^3$. The error terms on the second line correspond
to the highest order neglected terms in the Taylor series. These are (constants
omitted) $\partial_w^3 f(w,t)\,\Delta w^3$, and $\partial_w \partial_t f(w,t)\,\Delta w\,\Delta t$, and $\partial_t^2 f(w,t)\,\Delta t^2$. The Taylor
remainder theorem tells us that if the derivatives of the appropriate order are
bounded (third derivatives in this case), then the errors are on the order of the
neglected terms.
The sum on the right of (5) now breaks up into six sums, one for each term
on the right of (6) and (7):
$$ \sum_{a \le t_j < b} \Delta f_j = S_1 + S_2 + S_3 + S_4 + S_5 + S_6 . $$

We consider them one by one. It does not take long.
The first is
$$ S_1 = \sum_{a \le t_j < b} \partial_w f(W_j,t_j)\,\Delta W_j . $$

In the limit $\Delta t \to 0$ (more precisely, $m \to \infty$ with $\Delta t = 2^{-m}$), this converges to
$$ \int_a^b \partial_w f(W_s,s)\,dW_s . $$

The second is
$$ S_2 = \sum_{a \le t_j < b} \frac12 \partial_w^2 f(W_j,t_j)\,\Delta W_j^2 . \tag{8} $$

This is the term in the Ito calculus that has no analogue in ordinary calculus.
We come back to it after the others. The third is
$$ S_3 = \sum_{a \le t_j < b} \partial_t f(W_j,t_j)\,\Delta t . $$

As $\Delta t \to 0$ this one converges to
$$ \int_a^b \partial_t f(W_s,s)\,ds . $$

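The first error sum is
$$ |S_4| \le C \sum_{a \le t_j < b} |\Delta W_j|^3 . $$

This is random, so we evaluate its expected value. We know from experience
that $E[|\Delta W_j|^3]$ scales like $\Delta t^{3/2}$, which is one half power of $\Delta t$ for each power
of $\Delta W$. Therefore
$$ E[|S_4|] \le C \sum_{a \le t_j < b} \Delta t^{3/2} = C\,\Delta t^{1/2} \sum_{a \le t_j < b} \Delta t = C(b-a)\,\Delta t^{1/2} . $$

The second error term goes the same way, as $E[|\Delta W_j|\,\Delta t]$ also scales as $\Delta t^{3/2}$.
The last error term has
$$ |S_6| \le C \sum_{a \le t_j < b} \Delta t^2 = C(b-a)\,\Delta t . $$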

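We come now to the sum (8). The $\Delta W_j^2 \leftrightarrow \Delta t$ connection suggests we write
$$ (\Delta W_j)^2 = \Delta t + R_j . $$
Then
$$ E[R_j \mid \mathcal{F}_j] = 0 , \quad \text{and} \quad E[R_j^2 \mid \mathcal{F}_j] = \operatorname{var}(R_j \mid \mathcal{F}_j) = 2\Delta t^2 . $$
With this,
$$ S_2 = \sum_{a \le t_j < b} \frac12 \partial_w^2 f(W_j,t_j)\,\Delta t + \sum_{a \le t_j < b} \frac12 \partial_w^2 f(W_j,t_j)\,R_j = S_{2,1} + S_{2,2} . $$

The first term converges to the Riemann integral
$$ \int_a^b \frac12 \partial_w^2 f(W_s,s)\,ds . $$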

The second term converges to zero almost surely. We see this using the now
familiar trick of calculating $E[S_{2,2}^2]$. This becomes a double sum over $t_j$ and
$t_k$. The off diagonal terms, the ones with $j \ne k$, vanish. If $j > k$, we see this as
$$ E\left[ \tfrac12 \partial_w^2 f(W_j,t_j) R_j \cdot \tfrac12 \partial_w^2 f(W_k,t_k) R_k \mid \mathcal{F}_j \right] = \tfrac14 E[R_j \mid \mathcal{F}_j]\, \partial_w^2 f(W_j,t_j)\, \partial_w^2 f(W_k,t_k) R_k , $$
and the right side vanishes. The conditional expectation of a diagonal term is
$$ \tfrac14 E\left[ \left( \partial_w^2 f(W_j,t_j) R_j \right)^2 \mid \mathcal{F}_j \right] = \tfrac14 \left( \partial_w^2 f(W_j,t_j) \right)^2 E[R_j^2 \mid \mathcal{F}_j] = \tfrac12 \left( \partial_w^2 f(W_j,t_j) \right)^2 \Delta t^2 . $$
These calculations show that in $E[S_{2,2}^2]$, the diagonal terms, which are the only
non-zero ones, sum to at most $C(b-a)\Delta t$.
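The variance of the remainder $R_j = (\Delta W_j)^2 - \Delta t$ used in this argument follows from the Gaussian fourth moment. The short derivation below is added for completeness and relies only on the standard fact that a mean zero Gaussian with variance $\Delta t$ has fourth moment $3\Delta t^2$.

```latex
% Moments of a Brownian increment, independent of F_j:
E[\Delta W_j^2] = \Delta t , \qquad E[\Delta W_j^4] = 3\,\Delta t^2 .
% Mean and variance of R_j = \Delta W_j^2 - \Delta t:
E[R_j] = \Delta t - \Delta t = 0 , \qquad
E[R_j^2] = E[\Delta W_j^4] - 2\,\Delta t\,E[\Delta W_j^2] + \Delta t^2
         = 3\,\Delta t^2 - 2\,\Delta t^2 + \Delta t^2 = 2\,\Delta t^2 .
```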
The almost surely statement follows from the Borel Cantelli lemma, as
last week. The abstract theorem is that if $S_n$ is a family of random variables
with
$$ \sum_{n=1}^{\infty} E[S_n^2] < \infty , \tag{9} $$
then $S_n \to 0$ as $n \to \infty$ almost surely. This is because (9) implies that $S_n^2 \to 0$
as $n \to \infty$ almost surely, and if $S_n^2 \to 0$ then $S_n \to 0$ also. We know $S_n^2 \to 0$ almost surely
because $S_n^2 \ge 0$ and if an infinite sum of non-negative numbers is convergent, then
the terms go to zero. Our sum $\sum_n S_n^2$ has finite expected value, so the sum is finite
almost surely.

3 Backward equations
Suppose $V(w)$ is a running reward function and consider
$$ f(w,t) = E_{w,t}\left[ \int_t^T V(W_s)\,ds \right] . \tag{10} $$

As in the Introduction, this may be written in the equivalent form
$$ f(W_t,t) = E\left[ \int_t^T V(W_s)\,ds \,\Big|\, \mathcal{F}_t \right] . \tag{11} $$

Ito's lemma gives
$$ f(W_T,T) - f(W_t,t) = \int_t^T f_w(W_s,s)\,dW_s + \int_t^T \left( \frac12 f_{ww}(W_s,s) + f_t(W_s,s) \right) ds . $$
The definition (10) gives $f(W_T,T) = 0$. Therefore, as in the Introduction,
$$ f(W_t,t) = -E\left[ \int_t^T \left( \frac12 f_{ww}(W_s,s) + f_t(W_s,s) \right) ds \,\Big|\, \mathcal{F}_t \right] . $$

We set the two expressions for $f$ equal:
$$ E\left[ \int_t^T V(W_s)\,ds \,\Big|\, \mathcal{F}_t \right] = -E\left[ \int_t^T \left( \frac12 f_{ww}(W_s,s) + f_t(W_s,s) \right) ds \,\Big|\, \mathcal{F}_t \right] . $$

The natural way to achieve this is to set the integrands equal to each other,
which gives
$$ \frac12 f_{ww}(w,s) + f_t(w,s) + V(w) = 0 . \tag{12} $$
The final condition for this PDE is $f(w,T) = 0$. The PDE then determines the
values $f(w,s)$ for $s < T$. Now that we have guessed the backward equation, we
can show that it is right by Ito differentiation once more. If $f(w,s)$ satisfies the
backward equation (12), then $f(W_t,t)$ satisfies (11).
Here is a slightly better way to say this. From ordinary calculus, we get
$$ d\left( \int_t^T V(W_s)\,ds \,\Big|\, \mathcal{F}_t \right) = -V(W_t)\,dt . $$

We pause to consider this. The stochastic process
$$ X_t = \int_t^T V(W_s)\,ds $$
is a differentiable function of $t$. Its derivative with respect to $t$ follows from the
ordinary rules of calculus, the fundamental theorem in this case:
$$ \frac{dX}{dt} = \frac{d}{dt} \int_t^T V(W_s)\,ds = -V(W_t) . $$
This is true for any continuous function $W_t$ whether or not it is random. Conditioning
on $\mathcal{F}_t$ just ties down the value of $W_t$. From Ito's lemma, any function
$f(w,s)$ satisfies
$$ E[df(W_t,t) \mid \mathcal{F}_t] = \left( \frac12 f_{ww}(W_t,t) + f_t(W_t,t) \right) dt . $$

Taking the differential of both sides of (11), and then the conditional expectation, gives
$$ \left( \frac12 f_{ww}(W_t,t) + f_t(W_t,t) \right) dt = -V(W_t)\,dt , $$
which is the backward equation (12).
Consider the specific example
$$ f(w,t) = E_{w,t}\left[ \int_t^T W_s^2\,ds \right] . $$

We could find the solution by direct calculation, since there is a simple formula
$E_{w,t}[W_s^2] = E_{w,t}[(W_t + (W_s - W_t))^2] = w^2 + (s-t)$. Instead we use the ansatz
method. Suppose the solution has the form $f(w,t) = A(t)w^2 + B(t)$. It is easy
to plug into the backward equation
$$ \frac12 f_{ww} + f_t + w^2 = 0 $$
and get
$$ A + \dot{A} w^2 + \dot{B} + w^2 = 0 . $$
This gives $\dot{A} = -1$. Since $f(w,T) = 0$, we have $A(T) = 0$ and therefore
$A(t) = T - t$. Next we have $\dot{B} = -A = t - T$, so $B = \frac12 t^2 - Tt + C$. The final condition
$B(T) = 0$ gives $C = \frac12 T^2$. The simplified form is $B(t) = \frac12 t^2 - Tt + \frac12 T^2 =
\frac12 (T-t)^2$. The solution is $f(w,t) = (T-t)w^2 + \frac12 (T-t)^2$, which agrees with
integrating the direct formula $w^2 + (s-t)$ over $t \le s \le T$.
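The running reward example can be double checked in two independent ways. The sketch below is an illustration, not part of the notes: it evaluates the residual of the backward equation $\frac12 f_{ww} + f_t + w^2 = 0$ by central differences for the candidate $f(w,t) = (T-t)w^2 + \frac12 (T-t)^2$, and it integrates the direct formula $E_{w,t}[W_s^2] = w^2 + (s-t)$ over $[t,T]$ with the midpoint rule. The function names and the sample point are illustrative choices.

```python
def f(w, t, T):
    """Candidate value function (T - t) w^2 + (T - t)^2 / 2."""
    return (T - t) * w * w + 0.5 * (T - t) ** 2

def pde_residual(w, t, T, h=1e-3):
    """Central difference estimate of (1/2) f_ww + f_t + w^2."""
    f_ww = (f(w + h, t, T) - 2.0 * f(w, t, T) + f(w - h, t, T)) / (h * h)
    f_t = (f(w, t + h, T) - f(w, t - h, T)) / (2.0 * h)
    return 0.5 * f_ww + f_t + w * w

def direct_value(w, t, T, n=10_000):
    """Midpoint rule for the integral of w^2 + (s - t) over s in [t, T]."""
    ds = (T - t) / n
    return sum((w * w + (i + 0.5) * ds) * ds for i in range(n))

w, t, T = 0.7, 0.25, 2.0
print(pde_residual(w, t, T), f(w, t, T), direct_value(w, t, T))
```

The residual should be zero up to rounding, and the last two printed values should agree. Changing the sign of the $\frac12 (T-t)^2$ term in `f` breaks both checks, which is a quick way to catch sign slips in the ansatz calculation.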

Week 7
Diffusion processes
Jonathan Goodman
October 29, 2012

1 Introduction to the material for the week

This week we discuss a random process $X_t$ that is a diffusion process. A diffusion
process has an infinitesimal mean, or drift, which is $a(x,t)$. The process is
supposed to satisfy
$$ E[\Delta X_t \mid \mathcal{F}_t] = a(X_t)\,\Delta t + O(\Delta t^2) . \tag{1} $$

Here, $\Delta t > 0$ and $\Delta X = X_{t+\Delta t} - X_t$ is the forward looking change. We also
write the differential version $dX_t = X_{t+dt} - X_t$ and
$$ E[dX_t \mid \mathcal{F}_t] = a(X_t)\,dt . $$

A diffusion also has an infinitesimal variance, $\mu(x,t)$. If $X_t$ is a one dimensional
process, it should satisfy
$$ E[\Delta X_t^2 \mid \mathcal{F}_t] = \mu(X_t,t)\,\Delta t + O(\Delta t^2) . \tag{2} $$

The differential version of this is
$$ E[dX_t^2 \mid \mathcal{F}_t] = \mu(X_t,t)\,dt . $$

For a multidimensional process, the infinitesimal mean is a vector and the
infinitesimal variance is a matrix. For an $n$ dimensional process, $a(x,t) \in \mathbb{R}^n$,
and $\mu(x,t)$ is a symmetric positive semi-definite $n \times n$ matrix. The infinitesimal
mean formula (1) does not change. The infinitesimal variance formula becomes
$$ E\left[ (dX_t)(dX_t)^T \mid \mathcal{F}_t \right] = \mu(X_t,t)\,dt . $$

This just says that $\mu(X_t,t)\,dt$ is the variance-covariance matrix of $dX_t$.
The last part of the definition is that a diffusion process must have continuous
sample paths. This means that $X_t$ must be a continuous function of $t$. For
example, the simple rate one Poisson process $N_t$ satisfies (1) with $a = 1$ and
(2) with $\mu = 1$, as we saw in Assignment 5, but it is not a diffusion because its
sample paths have jumps. In practice, you show that sample
paths are continuous by finding a moment of $\Delta X$ that scales like a higher power
of $\Delta t$. Usually it is
$$ E[\Delta X_t^4 \mid \mathcal{F}_t] = O(\Delta t^2) , \tag{3} $$
the fourth moment scales like $\Delta t^2$. The Poisson process has
$$ E[\Delta N_t^4 \mid \mathcal{F}_t] = \Delta t + O(\Delta t^2) . $$

For the Poisson process, if $dN \ne 0$, then $dN = 1$. Therefore, (for $N_t$ conditioning
is irrelevant) $E[dN_t^p \mid \mathcal{F}_t] = P(dN \ne 0) = dt$ for any moment power $p$.
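The contrast between Poisson and Brownian increments can be made concrete with exact moment formulas. The sketch below is an illustration under simple assumptions: it sums the Poisson series directly for an increment with mean $\Delta t$ and uses the Gaussian identity $E[\Delta W^4] = 3\Delta t^2$. The helper names and the truncation level `kmax` are illustrative choices.

```python
import math

def poisson_moment(lam, p, kmax=60):
    """E[N^p] for N ~ Poisson(lam), by summing the series directly."""
    total = 0.0
    prob = math.exp(-lam)          # P(N = 0)
    for k in range(1, kmax + 1):
        prob *= lam / k            # P(N = k) from P(N = k - 1)
        total += (k ** p) * prob
    return total

def brownian_fourth_moment(dt):
    """E[dW^4] = 3 dt^2 for a Gaussian increment with variance dt."""
    return 3.0 * dt * dt

for dt in (0.1, 0.01, 0.001):
    # Ratios to dt: the Poisson ratios tend to 1, the Brownian ratio tends to 0.
    print(dt,
          poisson_moment(dt, 2) / dt,
          poisson_moment(dt, 4) / dt,
          brownian_fourth_moment(dt) / dt)
```

For small `dt` both Poisson ratios stay near one, so the fourth moment is of order $\Delta t$ and condition (3) fails, while the Brownian fourth moment is of order $\Delta t^2$.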
Diffusion processes come up as models of stochastic processes. If you want
to build a diffusion model of a process, you need to figure out the infinitesimal
mean and variance. You also must find a higher moment that scales like a higher
power of $\Delta t$, or find some other reason for $X_t$ to be a continuous function of $t$.
We will see examples of this kind of reasoning.
The quadratic variation of $X_t$ measures how much noise the path $X_t$ experienced
up to time $t$. It is written in many ways, and we write it as $[X]_t$. The
definition is, in our usual notation,
$$ [X]_t = \lim_{\Delta t \to 0} \sum_{t_j < t} (\Delta X_j)^2 . \tag{4} $$

We give a sort-of proof using $\Delta t = 2^{-m}$ and $m \to \infty$. The sort-of proof is
supposed to prove that this limit exists almost surely. The limit is given by
$$ [X]_t = \int_0^t \mu(X_s,s)\,ds . \tag{5} $$

This looks a little like the Ito isometry formula, but there are differences. The
Ito isometry formula is an equality of expected values. But this is a pathwise
identity, one that holds for almost every path $X$. Both sides of (5) are functions
of the path $X_t$. Almost surely, for any path $X_{[0,T]}$, the limit on the right of (4)
is equal to the right side of (5).
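A simulation gives a feel for the pathwise nature of the quadratic variation formula. The sketch below rests on assumptions that are not in the notes: the drift `a(x) = -x` and infinitesimal variance `mu(x) = 1 + x*x` are made-up examples, and increments are generated as $\Delta X = a\,\Delta t + \sqrt{\mu\,\Delta t}\,Z$ with $Z$ standard normal, which is one simple way to match the moment conditions (1), (2) and (3). It then compares $\sum_j (\Delta X_j)^2$ with the Riemann sum for $\int_0^T \mu(X_s)\,ds$ along the same path.

```python
import math
import random

def a(x):
    """Hypothetical drift, chosen only for illustration."""
    return -x

def mu(x):
    """Hypothetical infinitesimal variance, chosen only for illustration."""
    return 1.0 + x * x

def quadratic_variation_check(T=1.0, n=100_000, x0=0.0, seed=0):
    """Return (sum of squared increments, Riemann sum of mu) along one path."""
    rng = random.Random(seed)
    dt = T / n
    x = x0
    sum_sq = 0.0
    int_mu = 0.0
    for _ in range(n):
        dx = a(x) * dt + math.sqrt(mu(x) * dt) * rng.gauss(0.0, 1.0)
        sum_sq += dx * dx
        int_mu += mu(x) * dt
        x += dx
    return sum_sq, int_mu

qv, integral = quadratic_variation_check()
print(qv, integral)
```

Both numbers are random, since they depend on the path, but they should be close to each other for every seed. That is the difference from the Ito isometry, which only relates expected values.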
The quadratic variation formula is related to Ito's lemma for general
diffusions. This becomes clear if you write the sum in (4) as an integral to
get the informal expression
$$ [X]_t = \int_0^t (dX_s)^2 . $$

The identity (5) would be
$$ \int_0^t (dX_s)^2 = \int_0^t \mu(X_s)\,ds . $$

Taking the differential with respect to $t$ gives $(dX_t)^2 = \mu(X_t)\,dt$. The truth of
this informal formula is the same as the truth of the Brownian motion version:
it is not true in the differential form, but gives a true formula when you integrate
both sides.

We learn about diffusions by finding things about them we can calculate or
compute. An important tool in this is the general version of Ito's lemma. We
can guess that Ito's lemma should be
$$ \begin{aligned}
df(X_t,t) &= f_x(X_t,t)\,dX_t + \tfrac12 f_{xx}(X_t,t)\,(dX_t)^2 + f_t(X_t,t)\,dt \\
&= f_x(X_t,t)\,dX_t + \tfrac12 f_{xx}(X_t,t)\,\mu(X_t,t)\,dt + f_t(X_t,t)\,dt .
\end{aligned} \tag{6} $$
To show this is true, we need to prove that
$$ f(X_T,T) - f(X_0,0) = \int_0^T \left( \frac12 f_{xx}(X_t,t)\,\mu(X_t,t) + f_t(X_t,t) \right) dt + \int_0^T f_x(X_t,t)\,dX_t . \tag{7} $$

The last term on the right is the Ito integral with respect to a general diffusion,
which also needs a definition. It looks like we have lots to do this week.
There are backward equations associated to general diffusions. One of them
is for the final time payout value function
$$ f(x,t) = E_{x,t}[V(X_T)] . \tag{8} $$

The backward equation for this $f$ is
$$ \partial_t f + \frac12 \mu(x,t)\,\partial_x^2 f + a(x,t)\,\partial_x f = 0 . \tag{9} $$
There are other backward equations for other quantities defined in terms of $X$.
We can derive this directly using the tower property, or we can do it using Ito's
lemma (6). There are natural versions of quadratic variation and backward
equations for multi-variate diffusion processes. These involve the infinitesimal
covariance matrix $\mu$.

2 Some diffusion processes

This section explains how to make a diffusion model for a random process. In
future weeks we will see how to do this as an Ito differential equation. That is
very appealing notationally, but the present method is more fundamental.

2.1 Geometric Brownian motion and geometric random

2.2 Ornstein Uhlenbeck and the urn process

3 Ito calculus for general diffusions

This section has a full agenda, but the items should start to seem routine as
we go through them. Most of the arguments are just more general versions of
arguments from last week.

3.1 Backward equation
We start with the backward equation for general diffusions. The argument
here is more direct than the argument we gave for the backward equation for
Brownian motion. The earlier argument is more efficient, in that it involves
less writing. But this one is straightforward, and makes it clear what is behind
the equation. It also shows how the technical condition (3) plays a crucial role.
A simple backward equation governs the value function for a state dependent
payout at a specific time. The payout function is $V(x)$. The payout time is
$T$. At that time, you get payout $V(X_T)$. For $t < T$, there is the conditional
expected value of $V(X_T)$, conditional on the information in $\mathcal{F}_t$. Since $X_t$ is a
Markov process, this expected value is the same as the conditional expectation
given the value of $X_t$. This conditional expected value is $f(x,t)$ given by (8).
An equivalent definition is
$$ f(X_t,t) = E[V(X_T) \mid \mathcal{F}_t] . $$

Suppose $s$ is a time intermediate between $t$ and $T$. Then $\mathcal{F}_t \subseteq \mathcal{F}_s$, and the
tower property gives
$$ f(X_t,t) = E[\,E[V(X_T) \mid \mathcal{F}_s] \mid \mathcal{F}_t\,] = E[f(X_s,s) \mid \mathcal{F}_t] . $$

This may be restated as
$$ f(x,t) = E_{x,t}[f(X_s,s)] , \tag{10} $$
which should hold whenever $t \le s \le T$.

The backward equation (9) is an expression of the tower property. We derive
it from (10) taking $s = t + \Delta t$. The calculations require that $f$ be sufficiently
differentiable, which we assume but do not prove. The ingredients are: (i) the
formulas (1) and (2) that characterize $X_t$, (ii) Taylor expansion of $f$ with the
usual remainder bounds, and (iii) the technical condition (3) that makes $X_t$
a continuous function of $t$. We write $X_{t+\Delta t} = x + \Delta X$ and make the usual
Taylor expansions. To simplify the writing, we make two conventions. Partial
derivatives are written as subscripts. We put in the arguments only if they are
not $(x,t)$. For example, $f_x$ means $\partial_x f(x,t)$.
$$ \begin{aligned}
f(X_{t+\Delta t}, t+\Delta t) &= f(x + \Delta X, t + \Delta t) \\
&= f + f_x \Delta X + \tfrac12 f_{xx} \Delta X^2 + f_t \Delta t \\
&\quad + O(|\Delta X|^3) + O(|\Delta X|\,\Delta t) + O(\Delta t^2) .
\end{aligned} $$

We briefly postpone the argument that
$$ E\left[ |\Delta X|^3 \mid \mathcal{F}_t \right] = O(\Delta t^{3/2}) , \tag{11} $$
but it is consistent with the scaling $\Delta X \sim \Delta t^{1/2}$. From (10) we find
$$ \begin{aligned}
f(x,t) &= E_{x,t}[f(X_{t+\Delta t}, t+\Delta t)] \\
&= E_{x,t}[f] + E_{x,t}[f_x \Delta X] + \tfrac12 E_{x,t}[f_{xx} \Delta X^2] + E_{x,t}[f_t \Delta t] \\
&\quad + E_{x,t}[O(|\Delta X|^3)] + E_{x,t}[O(|\Delta X|\,\Delta t)] + E_{x,t}[O(\Delta t^2)] \\
&= f + f_x E_{x,t}[\Delta X] + \tfrac12 f_{xx} E_{x,t}[\Delta X^2] + f_t \Delta t + O(\Delta t^{3/2}) , \\
0 &= f_x\,a(x)\,\Delta t + \tfrac12 f_{xx}\,\mu(x)\,\Delta t + f_t \Delta t + O(\Delta t^{3/2}) , \\
0 &= a f_x + \tfrac12 \mu f_{xx} + f_t + O(\Delta t^{1/2}) .
\end{aligned} $$

If you take $\Delta t \to 0$, you get the backward equation (9).

The bound (11) is a consequence of (3). There is a trick to show this
using the Cauchy Schwarz inequality $E[YU] \le E[Y^2]^{1/2} E[U^2]^{1/2}$. If you
take $Y = \Delta X^2$ and $U = 1$, the Cauchy Schwarz inequality gives $E[\Delta X^2] \le
E[\Delta X^4]^{1/2} E[1^2]^{1/2} \le (C \Delta t^2)^{1/2} = C\Delta t$. Use $|\Delta X|^3 = \Delta X^2 \cdot |\Delta X|$ in Cauchy
Schwarz, and you get $E[|\Delta X|^3] \le E[\Delta X^4]^{1/2} E[\Delta X^2]^{1/2} \le C\Delta t^{3/2}$. (Those
of you who know Holder's inequality or Jensen's inequality may find a shorter
derivation of this $\Delta t^{3/2}$ bound.)
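For readers who want the shorter route mentioned in the parenthesis, here is one possible version using Jensen's inequality. It is a sketch added for illustration, with $C$ the constant from condition (3).

```latex
% Jensen's inequality for the convex function \varphi(y) = y^{4/3}, y \ge 0,
% applied to the random variable Y = |\Delta X|^3:
\left( E\left[ |\Delta X|^3 \right] \right)^{4/3} \le E\left[ |\Delta X|^4 \right] \le C\,\Delta t^2 .
% Raising both sides to the power 3/4:
E\left[ |\Delta X|^3 \right] \le \left( C\,\Delta t^2 \right)^{3/4} = C^{3/4}\,\Delta t^{3/2} .
```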
This may seem mysterious, but there is a reason it should work. Suppose
we think $\Delta X$ scales as $\Delta X \sim \Delta t^{1/2}$. Then we would be inclined to believe
that $E[\Delta X^4] \sim (\Delta t^{1/2})^4 = \Delta t^2$. Moreover, we might come to believe that
$\Delta X \sim \Delta t^{1/2}$ from the expected square $E[\Delta X^2] \sim \Delta t$. But this is not a
mathematical theorem. We already saw that the Poisson process is a counterexample. It has
$E[\Delta N^2] \sim \Delta t$ but $E[\Delta N^4] \sim \Delta t$ also, not $\Delta t^2$. This says that
$E[\Delta N^4]$ is much larger than it would be if $\Delta N$ scaled with $\Delta t$ in a simple
way you could discover from the mean square. What goes wrong is that $\Delta N$
has fat tails. The expected value of $\Delta N^2$ does not come from typical values of
$\Delta N$. Indeed, the typical value is $\Delta N = 0$. Instead $E[\Delta N^2]$ is determined by
rare events in which $\Delta N$ is much larger than $\Delta t^{1/2}$. The probability of such
a rare event is approximately $\Delta t$, when $\Delta t$ is small. The tails of a probability
distribution give the probability that the random variable is much larger (or
smaller) than typical values. A large (or fat) tail indicates a serious probability
of a large value. If a random variable has thin tails, then the expected values
of higher moments scale as you would expect from lower moments. For a diffusion
process, $E[\Delta X^4]$ scales as you would expect from $\Delta X \sim \Delta t^{1/2}$, but not for a
Poisson process.
The Cauchy Schwarz inequality allowed us to bound lower moments of $\Delta X$ in
terms of higher moments. If $E[\Delta X^4] = O(\Delta t^2)$, then $E[|\Delta X|^3] = O(\Delta t^{3/2})$.
But $E[|\Delta X|^3] = O(\Delta t^{3/2})$ does not imply that $E[\Delta X^4] = O(\Delta t^2)$.

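3.2 Integration and Ito's lemma with respect to dX_t
The stochastic integral with respect to $dX_t$ is defined as last week. Suppose $g_t$
is a progressively measurable process that satisfies
$$ E\left[ (g_{t+\Delta t} - g_t)^2 \mid \mathcal{F}_t \right] \le C\Delta t . \tag{12} $$
Define the Riemann sum approximations to the stochastic integral as
$$ Y_t^{(m)} = \sum_{t_j < t} g_{t_j} \left( X_{t_{j+1}} - X_{t_j} \right) . \tag{13} $$

As usual, $\Delta t = 2^{-m}$ and $t_j = j\Delta t$. Precisely as before, we show that the limit
$$ \int_0^t g_s\,dX_s = Y_t = \lim_{m \to \infty} Y_t^{(m)} \tag{14} $$
exists almost surely. The reason is the same (write $\approx$ instead of $=$ only
because the final time $t$ might split an interval):
$$ Y_t^{(m+1)} - Y_t^{(m)} \approx \sum_{t_j < t} \left( X_{t_{j+1}} - X_{t_{j+\frac12}} \right) \left( g_{t_{j+\frac12}} - g_{t_j} \right) . $$

Therefore
$$ E\left[ \left( Y_t^{(m+1)} - Y_t^{(m)} \right)^2 \right] \le C\Delta t = C 2^{-m} , $$
so (using Cauchy Schwarz again)
$$ E\left[ \left| Y_t^{(m+1)} - Y_t^{(m)} \right| \right] \le C\Delta t^{1/2} = C 2^{-m/2} . $$
From here, the Borel Cantelli lemma implies that
$$ \sum_{m} \left| Y_t^{(m+1)} - Y_t^{(m)} \right| < \infty \quad \text{almost surely} , $$
which then implies that the limit (14) exists almost surely.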
Ito's lemma is a similar story. We want to prove the formula (6) for a
sufficiently smooth function $f$. Use our standard notation: $f_j = f(X_{t_j}, t_j)$, and
$X_j = X_{t_j}$, and $\Delta X_j = X_{j+1} - X_j$. The math is a telescoping representation
followed by Taylor expansion:
$$ \begin{aligned}
f(X_t,t) - f(x_0,0) &\approx \sum_{t_j<t} [f_{j+1} - f_j] \\
&= \sum_{t_j<t} [f(X_j + \Delta X_j, t_j + \Delta t) - f(X_j,t_j)] \\
&= \sum_{t_j<t} \left[ f_x(X_j,t_j)\,\Delta X_j + \tfrac12 f_{xx}(X_j,t_j)\,\Delta X_j^2 + f_t(X_j,t_j)\,\Delta t \right] \\
&\quad + \sum_{t_j<t} \left[ O(|\Delta X_j|^3) + O(|\Delta X_j|\,\Delta t) + O(\Delta t^2) \right] \\
&= S_1 + S_2 + S_3 + S_4 + S_5 + S_6 .
\end{aligned} $$

The numbering of the terms is the same as last week. We go through them one
by one, leaving the hardest one, $S_2$, for last.
The first one is
$$ S_1 = \sum_{t_j<t} f_x(X_j,t_j)\,\Delta X_j \to \int_0^t f_x(X_s,s)\,dX_s \quad \text{as } m \to \infty, \text{ almost surely} . $$

The third one is
$$ S_3 = \sum_{t_j<t} f_t(X_j,t_j)\,\Delta t \to \int_0^t f_t(X_s,s)\,ds \quad \text{as } m \to \infty . $$

For some reason, people do not feel the need to say almost surely when it's
an ordinary Riemann sum converging to an ordinary integral. The first error
term is $S_4$. Our Borel Cantelli argument shows that the error terms go to zero
almost surely as $m \to \infty$. For example, using familiar arguments,
$$ E[|S_4|] \le C \sum_{t_j<t} E\left[ |\Delta X_j|^3 \right] \le C \sum_{t_j<t} \Delta t^{3/2} = Ct\,\Delta t^{1/2} = Ct\,2^{-m/2} . $$

The sum over $m$ is finite.

Finally, the Ito term:
$$ S_2 = \sum_{t_j<t} \tfrac12 f_{xx}(X_j,t_j)\,\mu(X_j)\,\Delta t + \sum_{t_j<t} \tfrac12 f_{xx}(X_j,t_j) \left( \Delta X_j^2 - \mu(X_j)\,\Delta t \right) = S_{2,1} + S_{2,2} . $$
The first sum, $S_{2,1}$, converges to an integral that is the last remaining part of
(6). The second sum goes to zero almost surely as $m \to \infty$, but the argument
is more complicated than it was for Brownian motion. Denote a generic term
in $S_{2,2}$ as
$$ R_j = \tfrac12 f_{xx}(X_j,t_j) \left( \Delta X_j^2 - \mu(X_j)\,\Delta t \right) . $$
With this, $S_{2,2} = \sum_{t_j<t} R_j$, and
$$ E\left[ S_{2,2}^2 \right] = \sum_{t_j<t} \sum_{t_k<t} E[R_j R_k] . $$

The diagonal part of this sum is
$$ \sum_{t_j<t} E\left[ R_j^2 \right] . $$

But $R_j^2 \le C(\Delta X_j^4 + \Delta t^2)$, so the diagonal sum is OK. The off diagonal sum
was exactly zero in the Brownian motion case because there was no $O(\Delta t^2)$ on
the right of (2). The off diagonal sum is
$$ 2 \sum_{t_k<t} \left[ \sum_{t_k<t_j<t} E[R_j R_k] \right] . $$

The inner sum is on the order of $\Delta t$, because
$$ \left| E[R_j R_k] \right| = \left| E\left[ E[R_j \mid \mathcal{F}_j]\,R_k \right] \right| \le O(\Delta t^2)\,E[|R_k|] , $$
so
$$ \left| \sum_{t_k<t_j<t} E[R_j R_k] \right| \le \left[ \sum_{t_j>t_k} O(\Delta t^2) \right] E[|R_k|] \le Ct\,\Delta t\,E[|R_k|] . $$

You can see from the definition that $E[|R_k|] = O(\Delta t)$. Therefore, the outer
sum is bounded by
$$ 2 \sum_{t_k<t} Ct\,O(\Delta t^2) = Ct\,O(\Delta t) \le C_t\,2^{-m} . $$

This is what Borel and Cantelli need to show $S_{2,2} \to 0$ almost surely.

3.3 Quadratic variation

We can apply the results of subsection 3.2 to get the quadratic variation. Look
at
$$ Y_t = \int_0^t X_s\,dX_s . $$
The Ito calculus of subsection 3.2 allows us to find a formula for $Y_t$. On the
other hand, the telescoping sum trick from last week allows us to express $Y_t$ in
terms of the quadratic variation.
A naive guess would make $Y_t$ equal to $\frac12 X_t^2$. But Ito's lemma (6) applied to
$f(x) = \frac12 x^2$, with $f_x = x$ and $f_{xx} = 1$, gives
$$ d\left( \tfrac12 X_t^2 \right) = X_t\,dX_t + \tfrac12 \mu(X_t)\,dt . $$

Integrating this gives
$$ \tfrac12 X_t^2 - \tfrac12 x_0^2 = \int_0^t X_s\,dX_s + \tfrac12 \int_0^t \mu(X_s)\,ds . $$

Rearranging puts this in the form
$$ \int_0^t X_s\,dX_s = \tfrac12 X_t^2 - \tfrac12 x_0^2 - \tfrac12 \int_0^t \mu(X_s)\,ds . \tag{15} $$

This is consistent with the formula we had earlier for Brownian motion.
The direct approach to $Y_t$ starts from the trick
$$ X_j = \tfrac12 (X_{j+1} + X_j) - \tfrac12 (X_{j+1} - X_j) . $$

The Riemann sum approximation to $Y_t$ is
$$ \sum_{t_j<t} X_j (X_{j+1} - X_j) = \tfrac12 \sum_{t_j<t} (X_{j+1} + X_j)(X_{j+1} - X_j) - \tfrac12 \sum_{t_j<t} (X_{j+1} - X_j)(X_{j+1} - X_j) . $$

The first sum on the right is
$$ \tfrac12 \sum_{t_j<t} \left( X_{j+1}^2 - X_j^2 \right) \approx \tfrac12 X_t^2 - \tfrac12 x_0^2 . $$

The second sum is
$$ \tfrac12 \sum_{t_j<t} (X_{j+1} - X_j)^2 . $$

In the limit $\Delta t \to 0$, this converges to half the quadratic variation, $\frac12 [X]_t$. Comparing
this to (15) gives the formula (5).
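The telescoping trick is an exact algebraic identity for any finite sequence, before any limit is taken. The sketch below illustrates this with a short made-up list of values standing in for $X_0, \ldots, X_n$. The helper names and the sample values are illustrative assumptions.

```python
def left_riemann_sum(x):
    """Sum of X_j (X_{j+1} - X_j) over the sequence x[0], ..., x[n]."""
    return sum(x[j] * (x[j + 1] - x[j]) for j in range(len(x) - 1))

def sum_of_squares(x):
    """Sum of (X_{j+1} - X_j)^2, the discrete quadratic variation."""
    return sum((x[j + 1] - x[j]) ** 2 for j in range(len(x) - 1))

path = [1.0, 1.5, 0.75, 2.0, 1.25]   # arbitrary made-up values
lhs = left_riemann_sum(path)
rhs = 0.5 * (path[-1] ** 2 - path[0] ** 2) - 0.5 * sum_of_squares(path)
print(lhs, rhs)
```

Because the identity holds for every sequence, all of the probability in the argument sits in the statement that the sum of squares converges to $[X]_t$.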

Week 8
Stopping times, martingales, strategies
Jonathan Goodman
November 12, 2012

1 Introduction to the material for the week

Suppose $X_t$ is a stochastic process and $S$ is some set. The hitting time $\tau$ is the
first time $X_t$ hits $S$:
$$ \tau = \min \{ t \mid X_t \in S \} . \tag{1} $$
This definition makes sense without extra mathematical technicalities if $X_t$ is
a continuous function of $t$ and $S$ is a closed set (see footnote 1). In that case, $X_\tau \in S$ and
$X_t \notin S$ if $t < \tau$. Many practical problems may be formulated using hitting
times. When does something break? How long does it take to travel a given
distance?
A hitting time is an important example of the more general idea of a stopping
time. A stopping time is a time that depends on the path $X_{[0,T]}$, which makes
it a random variable. What distinguishes a stopping time is that you know at
time $t$ whether $\tau \le t$. If $\mathcal{F}_t$ is the filtration corresponding to $X_t$, then
$$ \{ \tau \le t \} \in \mathcal{F}_t . \tag{2} $$

A hitting time is a stopping time because at time $t$ you know all the values $X_s$
for $s \le t$, so you know whether $X_s \in S$ for some $s \le t$. There are stopping
times that are not hitting times. For example, you could stop the first time $X_t$
has been inside $S$ for a given amount of time.
Some random times are not stopping times. For example, take the maximum
time rather than the minimum time in (1). This would be the last exit time
for $S$. At time $t$, you may not know whether $X_s$ will enter $S$ for some $s > t$, so
you do not know whether $\tau \le t$.
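The difference between a hitting time and a last exit time shows up clearly on a discrete sample path. The sketch below is an illustration only: the path values, the set $S = [2, \infty)$, and the function names are made up. It computes both times from the full path and from a prefix that represents the information available at an intermediate time.

```python
def hitting_index(path, in_S):
    """First index j with path[j] in S, or None if the path never hits S."""
    for j, x in enumerate(path):
        if in_S(x):
            return j
    return None

def last_exit_index(path, in_S):
    """Last index j with path[j] in S, or None if the path never visits S."""
    last = None
    for j, x in enumerate(path):
        if in_S(x):
            last = j
    return last

def in_S(x):
    """Hypothetical closed set S = [2, infinity)."""
    return x >= 2.0

path = [0.0, 1.0, 2.5, 1.0, 3.0, 0.5]   # made-up sample path
prefix = path[:4]                        # what is known at time index 3

# The hitting index computed from the prefix agrees with the full path answer,
# which is the stopping time property.
tau_full = hitting_index(path, in_S)
tau_prefix = hitting_index(prefix, in_S)

# The last exit index changes once the future of the path is revealed.
exit_full = last_exit_index(path, in_S)
exit_prefix = last_exit_index(prefix, in_S)
print(tau_full, tau_prefix, exit_full, exit_prefix)
```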
Stopping times give a way to model optimal decision problems related to the
stochastic process $X_t$. An optimal decision problem is the problem of finding the
function $\tau(X_{[0,T]})$ that maximizes or minimizes some performance criterion. The
early exercise problem for American style stock options is an optimal stopping
problem. Many clinical drug trials have stopping criteria that end the trial if
the drug quickly shows itself to be helpful, or dangerous.

1 The set $S$ is closed if $S$ includes all its limit points. If $x_n \in S$ and $x_n \to x$ as $n \to \infty$,
then $x \in S$. For example, in one dimension, $S = \{0 < x < 1\}$ is not closed because $x_n = 1/n$
converges to $x = 0$, but $0 \notin S$.

Consider the problem of stopping a Brownian motion at the largest value
$$ \max_{\tau(W_{[0,T]})} E[W_\tau] . \tag{3} $$

One possible solution would be to take the actual maximum value:
$$ W_\tau = \max_{0 \le t \le T} W_t . $$

But this is not a stopping time. Even if $W_t$ is the largest value you have seen so
far, you have no way of knowing whether $W_s > W_t$ for some $s > t$. (Correction:
You do know. Almost surely there is an $s \in (t,T)$ with $W_s > W_t$.) The optimal
decision problem would be to restrict the class of random times to those that
are stopping times. You have to say at time $t$ whether to stop at $t$ or keep going.
Many hitting time problems and optimal decision problems may be solved
using partial differential equations. For hitting time problems, you solve the
forward or backward equation in the complement of $S$ and specify a boundary
condition at the boundary of $S$. Many optimal decision problems have the structure
that the optimal stopping time is given by an optimal decision boundary.
This is a set $S_t$ so that $\tau$ is the first time $X_t \in S_t$.
A stochastic process is a martingale if, for any $s \ge t$,
$$ E[X_s \mid \mathcal{F}_t] = X_t . \tag{4} $$

If $X_t$ is a diffusion process, then it is a martingale if the drift coefficient is equal
to zero. That is
$$ E[dX_t \mid \mathcal{F}_t] = 0 . $$
A general theorem of Doob states that if $f_t$ is an adapted process and if $f$ and
$X$ are not too wild, then
$$ Y_t = \int_0^t f_s\,dX_s $$
is also a martingale. In some sense this is obvious, because the drift coefficient
of $Y$ is
$$ E[dY_t \mid \mathcal{F}_t] = f_t\,E[dX_t \mid \mathcal{F}_t] = 0 $$
if $X$ is a martingale. The value $f_t$ is known at time $t$ if $f_t$ is adapted. How wild
is too wild? That's not a question for this course. But we give some examples
where it is true and false.
The Doob stopping time theorem is a special case of the general martingale
theorem. If $X_t$ is a martingale and $\tau$ is a stopping time that satisfies $\tau \le T$ (almost surely), then
$$ E[X_\tau] = x_0 . \tag{5} $$

To prove this, let $Y_t$ be the stopped process: $Y_t = X_t$ if $t \le \tau$ and $Y_t = X_\tau$ for
$t \ge \tau$ (the definitions agree for $t = \tau$). If $\tau \le T$, then $Y_T = X_\tau$. The trick is
to write $Y$ as a stochastic integral involving $X$. The integrand is the switching
function $f_t = 1$ for $t \le \tau$ and $f_t = 0$ for $t > \tau$. This is an adapted function:
you can determine the value of $f_t$ knowing only $X_{[0,t]}$. If
$$ Y_t = x_0 + \int_0^t f_s\,dX_s , $$
then $Y_t = X_t$ if $t \le \tau$ and $Y_t$ is constant for $t > \tau$. The Doob martingale
theorem implies that $Y_t$ is a martingale. Therefore
$$ E[Y_T] = E[X_\tau] = x_0 . $$
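A discrete analogue makes the stopping time theorem easy to verify exactly. The sketch below swaps the diffusion for a symmetric $\pm 1$ random walk, which is a martingale, and takes $\tau$ to be the first time the walk hits a two point set, capped at a fixed number of steps so that $\tau$ is bounded. It propagates the probability distribution of the stopped walk rather than sampling paths. The function name and the barrier values are illustrative assumptions.

```python
def expected_stopped_value(x0, lower, upper, n_steps):
    """E[X at min(tau, n_steps)] for a +/-1 symmetric walk started at x0.

    tau is the first hit of {lower, upper}; requires lower < x0 < upper.
    """
    alive = {x0: 1.0}      # probability mass of paths that have not stopped yet
    stopped_sum = 0.0      # accumulated value * probability over stopped paths
    for _ in range(n_steps):
        nxt = {}
        for x, p in alive.items():
            for y in (x - 1, x + 1):
                if y == lower or y == upper:
                    stopped_sum += y * 0.5 * p      # path stops on hitting S
                else:
                    nxt[y] = nxt.get(y, 0.0) + 0.5 * p
        alive = nxt
    # Paths still alive after n_steps are stopped there (tau capped at n_steps).
    return stopped_sum + sum(x * p for x, p in alive.items())

print(expected_stopped_value(2, -3, 5, 50))
```

The result should equal the starting value `x0` up to rounding, whatever barriers and step cap are used. Replacing the first hitting rule by a rule that peeks at future steps would break the equality, because such a time is not a stopping time.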

2 Backward equation boundary conditions

There are many questions involving hitting times for a set $S$ that can be answered
using a value function $f(x,t)$. The PDE for $f$ depends on the process
$X$, but not on $S$. The set $S$ determines boundary conditions that $f$ must satisfy.
If you do it right, $f$ will be completely determined by the final condition, the
boundary conditions, and the PDE.
The PDE involves the generator of the process $X_t$, which is a differential operator
$L$. For a given $t$, think of $f(\cdot,t)$ as an abstract vector. For a one dimensional
diffusion, $L$ acts on the vector $f$ as
$$ Lf(x,t) = \frac12 \mu(x,t)\,\partial_x^2 f(x,t) + a(x,t)\,\partial_x f(x,t) . \tag{6} $$
The generator $L$ does not act on the $t$ variable, so $t$ is just a parameter that
says which function $f(\cdot,t)$ the generator is acting on (see footnote 2).
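For example, if $X$ is an Ornstein Uhlenbeck process with parameters $\sigma^2$ and
$\gamma$, then $\mu = \sigma^2$ and $a = -\gamma x$, so
$$ Lf = \frac12 \sigma^2\,\partial_x^2 f - \gamma x\,\partial_x f . $$
If $f(x,t) = 3x^2 - tx + t^2$, then $Lf = 3\sigma^2 - 6\gamma x^2 + \gamma t x$. The operator $L$ may be
expressed as
$$ L = \frac12 \mu(x,t)\,\partial_x^2 + a(x,t)\,\partial_x . \tag{7} $$
Then applying $L$ to a function $f$ is given by the expression (6). Mathematicians
say that the operator $L$ sends $f$ to $Lf$, or that $f$ goes to $Lf$:
$$ f \xrightarrow{\;L\;} Lf = \frac12 \sigma^2\,\partial_x^2 f - \gamma x\,\partial_x f . $$
For example, one might write
$$ e^{-x^2/2} \xrightarrow{\;\partial_x^2\;} (x^2 - 1)\,e^{-x^2/2} . $$
2 A famous joke defines a parameter as a variable constant.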
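The Ornstein Uhlenbeck generator example can be checked numerically. The sketch below is an illustration with assumed parameter values: it applies $L = \frac12 \sigma^2 \partial_x^2 - \gamma x\,\partial_x$ to $f(x,t) = 3x^2 - tx + t^2$ using central differences in $x$ and compares the result with the closed form $3\sigma^2 - 6\gamma x^2 + \gamma t x$. The function names, the step size `h`, and the sample point are illustrative choices.

```python
def f(x, t):
    """Test function from the text: 3 x^2 - t x + t^2."""
    return 3.0 * x * x - t * x + t * t

def generator_ou(func, x, t, sigma2, gamma, h=1e-3):
    """Apply L = (sigma^2 / 2) d^2/dx^2 - gamma x d/dx with central differences in x."""
    f_x = (func(x + h, t) - func(x - h, t)) / (2.0 * h)
    f_xx = (func(x + h, t) - 2.0 * func(x, t) + func(x - h, t)) / (h * h)
    return 0.5 * sigma2 * f_xx - gamma * x * f_x

def closed_form(x, t, sigma2, gamma):
    """The formula quoted in the text: 3 sigma^2 - 6 gamma x^2 + gamma t x."""
    return 3.0 * sigma2 - 6.0 * gamma * x * x + gamma * t * x

x, t, sigma2, gamma = 0.7, 1.3, 2.0, 0.5
print(generator_ou(f, x, t, sigma2, gamma), closed_form(x, t, sigma2, gamma))
```

Note that `t` is passed through untouched, which mirrors the remark that the generator treats $t$ as a parameter.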

Suppose the diffusion is multi-dimensional. Let $n$ be the number of components
of $X_t$. The generator in this case is
$$ Lf = \frac12 \sum_{i=1}^n \sum_{j=1}^n \mu_{ij}(x,t)\,\partial_{x_i} \partial_{x_j} f + \sum_{i=1}^n a_i(x,t)\,\partial_{x_i} f . \tag{8} $$

We will do some multi-dimensional examples at some point.

Here is a simple example. Let $X_t$ be a one dimensional diffusion with $x_0 = 1$.
Let $V(x)$ be a payout function, and suppose you get payout $V(X_T)$ only if
$X_t > 0$ for $0 \le t \le T$. We need some notation for the mathematical formulation.
The hitting time is $\tau = \min \{ t \mid X_t = 0 \}$. The event we need to describe is the
event that $\tau \ge T$. The indicator function of this event is $1_{\tau \ge T}(X_{[0,T]})$, which
has the values
$$ 1_{\tau \ge T}(X_{[0,T]}) = \begin{cases} 1 & \text{if } \tau \ge T \\ 0 & \text{if } \tau < T . \end{cases} $$
(This is also called characteristic function and written $\chi_{\tau \ge T}$. We use the term
indicator function because in probability, characteristic function can refer to
Fourier transform.) In this notation, the payout is $V(X_T)\,1_{\tau \ge T}(X_{[0,T]})$. This is a
function that is equal to $V(X_T)$ if $\tau \ge T$ and is equal to zero otherwise. The
expected payout is
$$ E\left[ V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \right] . \tag{9} $$
The PDE approach to calculating the expected payout (9) is to define a
value function that satisfies a backward equation with boundary conditions.
The value function that works is
$$ f(x,t) = E_{x,t}\left[ V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \mid \tau \ge t \right] . \tag{10} $$

If we can evaluate the value function $f$, we can plug in $x = 1$ and $t = 0$ to get
a formula for (9):
$$ f(1,0) = E\left[ V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \right] . $$

We compute the entire value function (10) for the purpose of getting the single
number (9).
The value function defined in (10) may seem more complicated than necessary.
A function that is simpler to write down is
\[
g(x,t) = E_{x,t}\left[ V(X_T)\, 1_{\tau \ge T}(X_{[0,T]}) \right] . \tag{11}
\]
The difference between these is that $g$ counts paths that have touched zero at
some time $s < t$. The definition (10) excludes such paths. More precisely, it
conditions on not having them. The two definitions are related by Bayes' rule.
In the case here, the integrand is zero if $\tau < T$, so
\[
f(x,t) = \frac{g(x,t)}{P_{x,t}(\tau \ge t)} .
\]
The definition of $g$ is suitable for expressing as a conditional expectation,
conditional on $\mathcal{F}_t$:
\[
g(X_t,t) = E\left[ V(X_T)\, 1_{\tau \ge T}(X_{[0,T]}) \mid \mathcal{F}_t \right] .
\]
The denominator has a similar expression. In fact
$P(\tau \ge t) = E\left[ 1_{\tau \ge t}(X_{[0,T]}) \right]$, so
\[
f(X_t,t) = \frac{E\left[ V(X_T)\, 1_{\tau \ge T}(X_{[0,T]}) \mid \mathcal{F}_t \right]}
                {E\left[ 1_{\tau \ge t}(X_{[0,T]}) \mid \mathcal{F}_t \right]} .
\]
This definition might be very hard to work with for the following reason. The
denominator $E\left[ 1_{\tau \ge t}(X_{[0,T]}) \mid \mathcal{F}_t \right]$ is not really an expectation because the
random variable $1_{\tau \ge t}(X_{[0,T]})$ is known at time $t$. Therefore
\[
E\left[ 1_{\tau \ge t}(X_{[0,T]}) \mid \mathcal{F}_t \right] = 1_{\tau \ge t}(X_{[0,T]}) .
\]
But the formula
\[
f(X_t,t) = \frac{E\left[ V(X_T)\, 1_{\tau \ge T}(X_{[0,T]}) \mid \mathcal{F}_t \right]}{1_{\tau \ge t}(X_{[0,T]})}
\]
looks bad, because the denominator is not a function of $X_t$ and $t$ alone. It
depends on the path before $t$. Somehow, when you do the division, this
dependence cancels out. The bottom line, for me, is that the more complicated
definition (10) will be easier to work with.
The value function is found by solving a PDE problem. A PDE problem consists
of a PDE and other conditions as appropriate: initial conditions, boundary
conditions, final conditions, etc. The PDE in this case is the backward equation
\[
-\partial_t f = Lf = \frac{1}{2}\sigma^2(x,t)\,\partial_x^2 f(x,t) + a(x,t)\,\partial_x f(x,t) . \tag{12}
\]
This PDE is satisfied in the region $x > 0$. The value function may not be
defined for $x < 0$. If it is defined, the most natural definition would be $f = 0$.
Either way, the PDE (12) is used only in the continuation region $x > 0$. The
final condition is clear from the definition of $f$. If $t = T$ and $x \ge 0$, then
$f(x,T) = V(x)$. There is an extra boundary condition at $x = 0$, which is
the boundary of the continuation region in this example. We can guess this
boundary condition by continuity. If $x$ is actually on the boundary of the
continuation region, which is to say $x = 0$, then the definition (10) gives the
value zero. If $f(x,t)$ is a continuous function of $x$, then 0 is the limiting value
of $f$ as $x \to 0$. This suggests that the boundary condition should be
\[
f(0,t) = 0 . \tag{13}
\]

Here is one of the few examples where $f$ may be calculated explicitly. Let
the process $X_t$ be Brownian motion starting at $x_0 = 1$ but having $\mathrm{var}(X_t) = t$.
This makes $\sigma^2 = 1$, and $a = 0$. Take $V(x) = 1$, so $f$ is just the conditional
probability of not touching the boundary before time $T$. The PDE problem is:
Find $f(x,t)$, defined for $x \ge 0$ and $t \le T$, that satisfies the PDE
\[
\partial_t f + \frac{1}{2}\partial_x^2 f = 0
\]
where it is defined. In addition, $f$ should satisfy the final condition $f(x,T) = 1$
for $x \ge 0$, and the boundary condition $f(0,t) = 0$ for $t \le T$.
This problem may be solved using something like the method of images. We
extend the definition of $f$ so that $f$ is defined for all $x$ with the anti-symmetry
condition $f(-x,t) = -f(x,t)$. If $f$ is continuous, this implies that $f(0,t) = 0$.
In order to achieve the skew-symmetry condition, we take the final condition to
be skew symmetric. We do this without changing the already known values of
$f(x,T)$ for $x > 0$. Clearly, the extended final condition should be $f(x,T) = 1$
for $x > 0$ and $f(x,T) = -1$ for $x < 0$. The value of $f$ when $x = 0$ is irrelevant.
There is no boundary. We are talking about the simple heat equation (OK,
with the direction of time reversed). The solution may be given as a Green's
function integral using the known final values:
\begin{align*}
f(x,t) &= \int_{-\infty}^{\infty} G(x-y, T-t)\, f(y,T)\, dy \\
       &= \int_{0}^{\infty} G(x-y, T-t)\, dy - \int_{-\infty}^{0} G(x-y, T-t)\, dy \\
       &= \int_{0}^{\infty} \frac{1}{\sqrt{2\pi (T-t)}}\, e^{-(x-y)^2/2(T-t)}\, dy
        - \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi (T-t)}}\, e^{-(x-y)^2/2(T-t)}\, dy .
\end{align*}

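The two Gaussian integrals on the last line represent probabilities. We can
express them in terms of the cumulative normal distribution function $N(z) = P(Z \le z)$,
where $Z \sim \mathcal{N}(0,1)$. The second integral on the last line is the
probability that the random variable $Y \sim \mathcal{N}(x, T-t)$ has $Y < 0$. In general, if
$Y \sim \mathcal{N}(\mu, \sigma^2)$, then $Y \sim \mu + \sigma Z$, where $Z \sim \mathcal{N}(0,1)$. The expression $Y \sim \mu + \sigma Z$
means that $Y$ and $\mu + \sigma Z$ have the same distribution. This implies that
\[
P(Y < 0) = P(\mu + \sigma Z < 0) = P\left( Z < -\frac{\mu}{\sigma} \right) = N\left( -\frac{\mu}{\sigma} \right) .
\]
In this example, $\mu = x$ and $\sigma = \sqrt{T-t}$, so
\[
P(Y < 0) = N\left( \frac{-x}{\sqrt{T-t}} \right) .
\]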
There are two properties of Gaussians, each of which would give a way to
write the first integral in terms of $N$. The first is
$P(Z > a) = P(-Z < -a) = P(Z < -a)$; the last equality holds because $Z \sim -Z$ (the Gaussian distribution is symmetric).
This gives $P(Z > a) = N(-a)$. In the present example, the first integral is
\[
P(Y > 0) = P(\mu + \sigma Z > 0) = P\left( Z > -\frac{\mu}{\sigma} \right) = N\left( \frac{\mu}{\sigma} \right) = N\left( \frac{x}{\sqrt{T-t}} \right) .
\]
The resulting formula for the survival probability is
\[
f(x,t) = P_{x,t}(\tau > T) = N\left( \frac{x}{\sqrt{T-t}} \right) - N\left( \frac{-x}{\sqrt{T-t}} \right) . \tag{14}
\]
Here is a quick check that this function satisfies all the conditions we set for it.
It satisfies the PDE (a calculation using $N'(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$). It satisfies the
boundary condition. If you put $x = 0$, the two terms on the right cancel exactly.
It satisfies the final condition. If $x > 0$ and you send $t$ to $T$, then
$N\left( \frac{x}{\sqrt{T-t}} \right) \to 1$ and $N\left( \frac{-x}{\sqrt{T-t}} \right) \to 0$.
That is because $\frac{x}{\sqrt{T-t}} \to \infty$ and $\frac{-x}{\sqrt{T-t}} \to -\infty$ (this is
where you use $x > 0$).

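The other fact about $N$ is $P(Z > a) = 1 - P(Z < a) = 1 - N(a)$. Therefore,
\[
P\left( Z > -\frac{\mu}{\sigma} \right) = 1 - P\left( Z < -\frac{\mu}{\sigma} \right) = 1 - N\left( \frac{-x}{\sqrt{T-t}} \right) .
\]
This gives a formula equivalent to (14), which is
\[
f(x,t) = P_{x,t}(\tau > T) = 1 - 2N\left( \frac{-x}{\sqrt{T-t}} \right) . \tag{15}
\]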
This formula is what you would get from the Kolmogorov reflection principle:
\begin{align*}
P_{x,t}(X_s < 0 \text{ for some } s \in [t,T]) &= 2\, P_{x,t}(X_T < 0) , \\
P_{x,t}(X_s > 0 \text{ for all } s \in [t,T]) &= 1 - P_{x,t}(X_s < 0 \text{ for some } s \in [t,T]) \\
  &= 1 - 2\, P_{x,t}(X_T < 0) \\
  &= 1 - 2N\left( \frac{-x}{\sqrt{T-t}} \right) .
\end{align*}
The formula (15) satisfies the PDE for the same reason as (14). It satisfies the
final condition because $P(X_T < 0 \mid X_T = x > 0) = 0$. It satisfies the boundary
condition because $N(0) = \frac{1}{2}$.
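The checks described above can also be done numerically. The following is a minimal sketch, not part of the original notes: it evaluates (14) and (15) using the error function, and tests the PDE with centered finite differences. The function names and the step size `h` are illustrative choices.

```python
import math


def normal_cdf(z):
    """Cumulative standard normal N(z) = P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def survival_14(x, t, T):
    """Formula (14): N(x/sqrt(T-t)) - N(-x/sqrt(T-t))."""
    s = math.sqrt(T - t)
    return normal_cdf(x / s) - normal_cdf(-x / s)


def survival_15(x, t, T):
    """Formula (15): 1 - 2 N(-x/sqrt(T-t))."""
    s = math.sqrt(T - t)
    return 1.0 - 2.0 * normal_cdf(-x / s)


def pde_residual(f, x, t, T, h=1e-3):
    """Centered finite-difference estimate of d_t f + (1/2) d_x^2 f."""
    f_t = (f(x, t + h, T) - f(x, t - h, T)) / (2.0 * h)
    f_xx = (f(x + h, t, T) - 2.0 * f(x, t, T) + f(x - h, t, T)) / (h * h)
    return f_t + 0.5 * f_xx


if __name__ == "__main__":
    T = 1.0
    print("(14) and (15) at x=1, t=0:", survival_14(1.0, 0.0, T), survival_15(1.0, 0.0, T))
    print("boundary value f(0, 0.5):", survival_14(0.0, 0.5, T))
    print("near the final time, x=1:", survival_14(1.0, T - 1e-8, T))
    print("PDE residual at (1, 0.3):", pde_residual(survival_14, 1.0, 0.3, T))
```

The PDE residual is not exactly zero because of finite difference truncation and roundoff, so it should be read as small rather than compared to zero.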
We can use (14) or (15) to estimate the survival probability starting from
a fixed $x$ at time $t = 0$ as $T \to \infty$. This is the probability of not hitting
the boundary for a long time. The argument to $N$ goes to zero as $T \to \infty$.
Therefore, we use $N(\epsilon) \approx N(0) + \epsilon N'(0)$. We already saw that $N(0) = \frac{1}{2}$ and
$N'(0) = \frac{1}{\sqrt{2\pi}}$. Therefore, for large $T$,
\[
P_{x,0}(\tau > T) \approx \frac{2x}{\sqrt{2\pi T}} . \tag{16}
\]

We see that this goes to zero as $T \to \infty$. Therefore, we know that from any
starting point, Brownian motion hits $x = 0$ at some positive time almost surely.
A similar statement holds in two dimensions: a two dimensional Brownian motion will
almost surely come within any given positive distance of the origin, although it
almost surely never touches the origin itself. In three or more dimensions even the
weaker statement is not true. In fact, if $|X_0| > 1$, there is a positive probability
that $|X_t| > 1$ for all $t > 0$.
Brownian motion in one or two dimensions is recurrent, while it is transient in
dimensions 3 or more.
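As a rough numerical illustration of the large $T$ behavior (a sketch added for this edit, with illustrative function names), one can compare the exact survival probability from (15) at $t = 0$ with the approximation obtained by substituting $N(\epsilon) \approx N(0) + \epsilon N'(0)$ into (15), which gives $2x/\sqrt{2\pi T} = x\sqrt{2/(\pi T)}$. The code uses the identity $1 - 2N(-z) = \mathrm{erf}(z/\sqrt{2})$.

```python
import math


def survival_exact(x, T):
    """Formula (15) at t = 0, written as erf(x / sqrt(2 T))."""
    return math.erf(x / math.sqrt(2.0 * T))


def survival_large_T(x, T):
    """Leading-order approximation for large T: x * sqrt(2 / (pi T))."""
    return x * math.sqrt(2.0 / (math.pi * T))


if __name__ == "__main__":
    x = 1.0
    for T in (1.0, 1e2, 1e4, 1e6):
        exact = survival_exact(x, T)
        approx = survival_large_T(x, T)
        print(T, exact, approx, approx / exact)
```

The ratio in the last column approaches 1 as $T$ grows, while both columns decay like $T^{-1/2}$.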
While we are talking about these solutions to the backward equation, let us
notice some other properties. One is the smoothing property. The formula (14)
defines a function that is discontinuous when $t = T$. Nevertheless, $f(x,t)$ is a
smooth function of $x$ for $t < T$. This is a general property of PDEs of diffusion type.
Some other properties are illustrated by a different solution of the backward equation,
\[
h(x,t) = N\left( \frac{x+1}{\sqrt{T-t}} \right) - N\left( \frac{x-1}{\sqrt{T-t}} \right) .
\]
This function has final values $h(x,T) = 1$ if $-1 < x < 1$ and $h(x,T) = 0$
otherwise. That makes $h(x,T) = 1_{|x|<1}$, which is a step function that is different
from zero when $x$ is not too far from zero. Whenever $t < T$, $h(x,t) > 0$ for any
$x$. This means that the fact that $h > 0$ propagates infinitely fast through the
whole domain where $h$ is defined. This is also a property of general diffusion
PDEs. However, the solution is not large for $x > 1$ and $t$ close to $T$. In fact,
it is exponentially small. The influence is very small in short times and large distances.
Finally, look at h(x, t) for x near 1 and t close to T . The second term is
exponentially small, as we just said. But the first term looks like the solution
with final data that have a jump at x = 1. The behavior near 1 and T is
almost completely determined by the final condition there. This is approximate

Week 9
Stochastic differential equations
Jonathan Goodman
November 19, 2012

1 Introduction to the material for the week

The material this week is all about the expression
\[
dX_t = a_t\, dt + b_t\, dW_t . \tag{1}
\]
There are two distinct ways to interpret this, which we will call strong and weak.
The strong interpretation is more literally true, that $X$ is a function of $W$, that
$X$, $a$ and $b$ are adapted, and (1) is true in the sense of the Ito calculus. For
example, we saw that if $X_t = e^{\sigma W_t}$, then
\begin{align*}
dX_t &= \sigma e^{\sigma W_t}\, dW_t + \frac{1}{2}\sigma^2 e^{\sigma W_t}\, dt \\
     &= \frac{1}{2}\sigma^2 X_t\, dt + \sigma X_t\, dW_t .
\end{align*}
In this case, (1) is satisfied with $a_t = \frac{1}{2}\sigma^2 X_t$ and $b_t = \sigma X_t$. Roughly speaking, this
is the sense we usually assume when doing analysis.
The weak interpretation is not literal. It does not require that we have a
Brownian motion path $W$ in mind. In the weak sense, (1) means that at time
$t$ you know $X_t$, $a_t$ and $b_t$, and
\[
E[\, dX_t \mid \mathcal{F}_t \,] = a_t\, dt , \tag{2}
\]
and
\[
E\left[ (dX_t)^2 \mid \mathcal{F}_t \right] = b_t^2\, dt . \tag{3}
\]
In this view, (1) just says that for $dt > 0$, the corresponding $dX$ is the sum
of a deterministic and a random piece, $a_t\, dt$ and $b_t\, dW_t$. Deterministic means
known at time $t$. The example with $a_t = \frac{1}{2}\sigma^2 X_t$ shows that $a_t$ need not be
known at times earlier than $t$. The random piece models the part of $dX_t$ that
cannot be predicted at time $t$. We assume that this noise component has mean
zero, because if the mean were not zero, we would put the mean into $a_t$ instead.
The modeling assumption is that the noise innovation, $dW_t$, not only has mean
zero, but is independent of anything known in $\mathcal{F}_t$. The strength of the noise at
time $t$ is $b_t$. This is assumed known at time $t$. An unknown part of $b_t$ would be
part of $W_t$ instead.
The point of all this philosophy is that we can create models of stochastic
processes by writing expressions for $a_t$ and $b_t$ in (1). If we think we know the
(conditional at time $t$) mean and variance, we use (1) to create a model process
$X_t$. The Black and Scholes model of the evolution of a stock price is a great
example of this kind of reasoning. Suppose $S_t$ is the price of a stock at time
$t$. Then $dS_t$ should be proportional to $S_t$ so that $dS$ is measured as a percentage
of $S_t$ (OK, a little tautology there). If $S_t$ were replaced by, say, $2S_t$, then $dS_t$
would be replaced by $2\,dS_t$. If there were a "double stock" that consisted of two
shares of stock, the price change of a "double share" would be twice the change of
a single share. Therefore, we think both the deterministic and random parts of
$dS_t$ should be proportional to $S_t$. The constants are traditionally called $\mu$ and
$\sigma$, the expected rate of return and the volatility respectively:
\[
dS_t = \mu S_t\, dt + \sigma S_t\, dW_t . \tag{4}
\]

An equation of the form
\[
dX_t = a(X_t,t)\, dt + b(X_t,t)\, dW_t \tag{5}
\]
is a stochastic differential equation, or SDE. The difference between an SDE and
a general process that satisfies (1) is that here $a_t$ and $b_t$ are required to be known
deterministic functions of $X_t$ and $t$. For example, the function $a(x,t)$ is a function of two
variables that is completely known at time 0. For this reason, the solution to an SDE
is a Markov process. The probability distribution of the path starting at time
$t$, which is $X_{[t,T]}$, is completely determined by $X_t$. If $a$ and $b$ are independent
of $t$, then the SDE (5) and the corresponding Markov process are homogeneous.
Otherwise they are heterogeneous. If $b$ is independent of $X$ we have additive
noise. Otherwise, the noise is multiplicative. There are many problems with
additive noise. These are simpler from both theoretical and practical points of view.
Most SDEs do not have closed form solutions. But solutions may be computed
numerically. The Euler method, also called Euler-Maruyama, is a way to
create approximate sample paths. It is possible to get information about solutions
by solving the forward or backward Kolmogorov equation. These, also,
would generally be solved numerically. But that is impractical for SDE systems
with more than a few components.

2 Geometric Brownian motion

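Solutions to the SDE (4) are called geometric Brownian motion, or GBM. There
are several ways to find the solution. One is to try an ansatz of the form
$S_t = A_t e^{\sigma W_t}$. Here, $A_t$ is a deterministic function of $t$. We do an Ito calculation
with $f(w,t) = A_t e^{\sigma w}$, so that $\partial_t f = \dot{A}_t e^{\sigma w}$, $\partial_w f = \sigma f$, and $\partial_w^2 f = \sigma^2 f$. The
result is
\begin{align*}
d\left( A_t e^{\sigma W_t} \right) &= \dot{A}_t e^{\sigma W_t}\, dt + \sigma A_t e^{\sigma W_t}\, dW_t + \frac{1}{2}\sigma^2 A_t e^{\sigma W_t}\, dt \\
  &= \left( \frac{\dot{A}_t}{A_t} + \frac{1}{2}\sigma^2 \right) S_t\, dt + \sigma S_t\, dW_t .
\end{align*}
This satisfies (4) if
\[
\frac{\dot{A}_t}{A_t} + \frac{1}{2}\sigma^2 = \mu .
\]
That implies that
\[
\dot{A}_t = \left( \mu - \frac{1}{2}\sigma^2 \right) A_t .
\]
The solution is a simple exponential:
\[
A_t = A_0\, e^{\left( \mu - \frac{1}{2}\sigma^2 \right) t} .
\]
If we set $t = 0$ and use a standard Brownian motion with $W_0 = 0$, we find
$A_0 = s_0$. The full solution is
\[
S_t = s_0\, e^{\sigma W_t + \left( \mu - \frac{1}{2}\sigma^2 \right) t} . \tag{6}
\]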
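The closed form solution makes it possible to sample $S_t$ directly, without time stepping. The sketch below is an illustration added here, with hypothetical function names and parameter values: it draws $W_t \sim \mathcal{N}(0,t)$, evaluates the GBM formula, and compares sample means with two facts that follow from the formula, $E[S_t] = s_0 e^{\mu t}$ and $E[\log S_t] = \log s_0 + (\mu - \frac{1}{2}\sigma^2)t$.

```python
import math
import random


def sample_gbm(s0, mu, sigma, t, rng):
    """Draw S_t = s0 * exp(sigma * W_t + (mu - sigma^2 / 2) * t), W_t ~ N(0, t)."""
    w_t = math.sqrt(t) * rng.gauss(0.0, 1.0)
    return s0 * math.exp(sigma * w_t + (mu - 0.5 * sigma ** 2) * t)


def gbm_sample_means(s0, mu, sigma, t, n_samples, seed=0):
    """Return the sample means of S_t and of log(S_t) over n_samples draws."""
    rng = random.Random(seed)
    samples = [sample_gbm(s0, mu, sigma, t, rng) for _ in range(n_samples)]
    mean_s = sum(samples) / n_samples
    mean_log_s = sum(math.log(s) for s in samples) / n_samples
    return mean_s, mean_log_s


if __name__ == "__main__":
    s0, mu, sigma, t = 1.0, 0.05, 0.2, 1.0
    mean_s, mean_log_s = gbm_sample_means(s0, mu, sigma, t, 50000)
    print("sample mean of S_t:    ", mean_s, " theory:", s0 * math.exp(mu * t))
    print("sample mean of log S_t:", mean_log_s,
          " theory:", math.log(s0) + (mu - 0.5 * sigma ** 2) * t)
```

The sample means agree with the theoretical values only up to statistical error, which here is of order $10^{-3}$.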

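Here is a related way to find the solution formula (6). Define
\[
X_t = \log(S_t) , \quad S_t = e^{X_t} . \tag{7}
\]
We use Ito's lemma for the process $S_t$. If $f(s) = \log(s)$, then $\partial_s f = \frac{1}{s}$ and
$\partial_s^2 f = \partial_s \frac{1}{s} = -\frac{1}{s^2}$. Then
\begin{align*}
dX_t &= df(S_t) \\
     &= \partial_s f(S_t)\, dS_t + \frac{1}{2}\partial_s^2 f(S_t)\, (dS_t)^2 \\
     &= \frac{1}{S_t}\, dS_t - \frac{1}{2}\frac{1}{S_t^2}\, (dS_t)^2 \\
     &= \frac{1}{S_t}\left( \mu S_t\, dt + \sigma S_t\, dW_t \right) - \frac{1}{2}\frac{1}{S_t^2}\, \sigma^2 S_t^2\, dt \\
dX_t &= \left( \mu - \frac{\sigma^2}{2} \right) dt + \sigma\, dW_t .
\end{align*}
The solution of this is ordinary arithmetic Brownian motion (there are geometric
series and arithmetic series):
\[
X_t = x_0 + \left( \mu - \frac{\sigma^2}{2} \right) t + \sigma W_t .
\]
An arithmetic Brownian motion has constant drift and Brownian motion parts.
This one has drift $\mu - \frac{1}{2}\sigma^2$ and noise coefficient $\sigma$. A standard Brownian motion
has zero drift and unit noise coefficient. The solution from this is
\[
S_t = e^{X_t} = e^{x_0}\, e^{\left( \mu - \frac{1}{2}\sigma^2 \right) t + \sigma W_t} .
\]
This is the same as before, with $s_0 = e^{x_0}$.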

There is another way to arrive at the log variable transformation. Suppose
$V(S_T)$ is a payout and we consider the value function
\[
f(s,t) = E_{s,t}[\, V(S_T) \,] .
\]
This satisfies the backward equation
\[
\partial_t f + \frac{\sigma^2 s^2}{2}\, \partial_s^2 f + \mu s\, \partial_s f = 0 . \tag{8}
\]
This is because $a(s,t) = \mu s$, and $b(s,t) = \sigma s$. The Black Scholes PDE is similar
to this. This PDE has coefficients $\frac{\sigma^2 s^2}{2}$ and $\mu s$, which are functions of the
independent variable $s$. It is a linear PDE with variable coefficients. We can
simplify the PDE by a change of variable to make it into a constant coefficient
PDE. This change of variable is
\[
x = \log(s) , \quad s = e^x .
\]
There is an obvious sense in which this is the same as (7). But there is a sense
in which it is different. Here, the substitution is about simple variables $s$ and
$x$, not stochastic processes $S_t$ and $X_t$.
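We rewrite (8) in the $x$ variable. The chain rule from calculus gives
\[
\partial_s f = \frac{\partial f}{\partial s} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial s} = \frac{1}{s}\, \partial_x f .
\]
The next derivative is
\begin{align*}
\partial_s^2 f &= \partial_s\left( \partial_s f \right) \\
  &= \partial_s\left( \frac{1}{s}\, \partial_x f \right) \\
  &= \left( \partial_s \frac{1}{s} \right) \partial_x f + \frac{1}{s}\, \partial_s\left( \partial_x f \right) \\
  &= -\frac{1}{s^2}\, \partial_x f + \frac{1}{s^2}\, \partial_x^2 f .
\end{align*}
These derivatives go back into (8) to give
\[
\partial_t f + \frac{\sigma^2}{2}\, \partial_x^2 f + \left( \mu - \frac{\sigma^2}{2} \right) \partial_x f = 0 . \tag{9}
\]
This PDE is the backward equation for the SDE $dX_t = \left( \mu - \frac{\sigma^2}{2} \right) dt + \sigma\, dW_t$.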
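We can transform (9) into the standard heat equation with some simple
normalizations. The first is to get rid of the drift term, the term involving $\partial_x f$,
using a coordinate that moves with the drift:
\[
x = y + \left( \mu - \frac{\sigma^2}{2} \right) t , \quad y = x - \left( \mu - \frac{\sigma^2}{2} \right) t .
\]
It can be tricky to calculate what happens to the equation (9) in the new
coordinates. Since $\partial_y x = \partial_x y = 1$, calculating space derivatives ($x$ or $y$) does
not change the equation. The change has to do with the time derivative, which
may not seem to have changed. One way to do it is to recall the definition of
partial derivative. If the variables are $x$ and $t$, the partial derivative of $f$ with
respect to $t$ with $x$ fixed is
\[
\left. \partial_t f \right|_x = \lim_{\Delta t \to 0} \frac{f(x, t + \Delta t) - f(x, t)}{\Delta t} .
\]
This is the rate of change of the quantity $f$ when $t$ is changed and $x$ is held
fixed. The definition of $\partial_t$ changes if we use the $y$ variable in place of $x$. It
becomes the rate of change of $f$ when $t$ changes and $y$ is held fixed. Suppose
we have changes $\Delta t$, $\Delta x$ and $\Delta y$. We fix $\Delta t$ and calculate $\Delta y$:
\[
\Delta y = \Delta x - \left( \mu - \frac{\sigma^2}{2} \right) \Delta t .
\]
If $\Delta y = 0$, then $\Delta x = \left( \mu - \frac{\sigma^2}{2} \right) \Delta t$. Therefore
\begin{align*}
\left. \partial_t f \right|_y &= \lim_{\Delta t \to 0} \frac{f(x + \Delta x, t + \Delta t) - f(x, t)}{\Delta t} \\
  &= \lim_{\Delta t \to 0} \frac{1}{\Delta t}\left[ \partial_x f\, \Delta x + \left. \partial_t f \right|_x \Delta t + O\left( \Delta x^2 \right) + O\left( \Delta t^2 \right) \right] \\
  &= \lim_{\Delta t \to 0} \frac{1}{\Delta t}\left[ \partial_x f \left( \mu - \frac{\sigma^2}{2} \right) \Delta t + \left. \partial_t f \right|_x \Delta t \right] \\
  &= \left( \mu - \frac{\sigma^2}{2} \right) \partial_x f + \left. \partial_t f \right|_x .
\end{align*}
We express the backward equation (9) in the $y$ and $t$ variables as
\[
\partial_t f + \frac{\sigma^2}{2}\, \partial_y^2 f = 0 .
\]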
The difference is that $\partial_t$ refers to the derivative with $y$ fixed rather than with
$x$ fixed as in (9). One more step will transform this to the standard heat
equation. We rescale the space variable to get rid of the coefficient $\sigma^2$. The
change of variables that does that is
\[
\frac{\partial}{\partial y} = \frac{1}{\sigma}\frac{\partial}{\partial z} , \quad z = \frac{y}{\sigma} , \quad y = \sigma z .
\]
In these variables,
\[
\partial_t f + \frac{1}{2}\, \partial_z^2 f = 0 .
\]
The payoff for these manipulations is that we know a lot about solutions of
the heat equation. We know the Green's function and the Fourier transform.
All of these tools now apply to the equation (8) too.

3 Simulating an SDE
There are three primary ways to get information about the solution of an SDE.
The first is to solve it exactly in closed form. That option is limited to very few
SDEs. The second is to solve the backward equation numerically. We discuss
in a future class how to do that. For now it suffices to say that this approach
is impractical for SDE systems with more than a few components. The third is
direct numerical simulation of the SDE. We discuss that option here.
Consider an SDE (5). We want to approximate the path $X_{[0,T]}$. Choose
a small $\Delta t$, define $t_n = n\Delta t$, and define the approximate sample path to be
$X_n^{(\Delta t)} \approx X_{t_n}$. There are two forms of the approximation algorithm. One is
\[
X_{n+1}^{(\Delta t)} = X_n^{(\Delta t)} + a(X_n^{(\Delta t)}, t_n)\, \Delta t + b(X_n^{(\Delta t)}, t_n)\, \Delta W_n . \tag{10}
\]
Here $W_t$ is a Brownian motion path, and $\Delta W_n = W_{t_{n+1}} - W_{t_n}$ is the increment
for time $\Delta t$. In practice, we have to generate the Brownian motion increments
using a random number generator. The properties of the $\Delta W_n$ are that they are
independent, and that they are Gaussian with mean zero and variance $\Delta t$. In
case it is a multi-variate Brownian motion, the covariance is $\Delta t\, I$. We generate
such random variables starting with independent standard normals $Z_n$ and
multiplying by $\Delta t^{1/2}$:
\[
\Delta W_n = \Delta t^{1/2} Z_n , \quad Z_n \sim \mathcal{N}(0, I) . \tag{11}
\]

This algorithm is motivated by the strong form of the SDE. When we integrate
(5) over the time increment $[t_n, t_n + \Delta t]$, we find
\[
X_{t_{n+1}} = X_{t_n} + \int_{t_n}^{t_n + \Delta t} a(X_t, t)\, dt + \int_{t_n}^{t_n + \Delta t} b(X_t, t)\, dW_t . \tag{12}
\]
If $\Delta t$ is small, then $X_t \approx X_{t_n}$ for $t \in [t_n, t_{n+1}]$. If we replace $X_t$ by the
approximate value $X_{t_n}$ in (12), the integrals simplify to
\[
\int_{t_n}^{t_n + \Delta t} a(X_t, t)\, dt \approx a(X_{t_n}, t_n)\, \Delta t ,
\]
and
\[
\int_{t_n}^{t_n + \Delta t} b(X_t, t)\, dW_t \approx b(X_{t_n}, t_n) \int_{t_n}^{t_n + \Delta t} dW_t = b(X_{t_n}, t_n)\, \Delta W_n .
\]
This gives the approximate formula
\[
X_{t_{n+1}} \approx X_{t_n} + a(X_{t_n}, t_n)\, \Delta t + b(X_{t_n}, t_n)\, \Delta W_n .
\]
The approximate formula for the exact path motivates the exact formula (10)
for the approximate path.
The other form of the approximation algorithm just looks for approximate
sample paths that have the right mean and variance over a step of size $\Delta t$. For
that purpose, let $Z_n$ be a family of independent random variables with mean
zero and variance 1, or covariance $I$. You want
\[
E\left[ X_{n+1}^{(\Delta t)} \mid \mathcal{F}_n \right] = X_n^{(\Delta t)} + a(X_n^{(\Delta t)}, t_n)\, \Delta t ,
\]
\[
\mathrm{var}\left( X_{n+1}^{(\Delta t)} \mid \mathcal{F}_n \right) = b^2(X_n^{(\Delta t)}, t_n)\, \Delta t .
\]
These formulas are not exact for SDE paths, but they will hold exactly for
approximate sample paths $X^{(\Delta t)}$. The formula
\[
X_{n+1}^{(\Delta t)} = X_n^{(\Delta t)} + a(X_n^{(\Delta t)}, t_n)\, \Delta t + b(X_n^{(\Delta t)}, t_n)\, \Delta t^{1/2} Z_n \tag{13}
\]
makes these true. The actual algorithms (10) and (13) are identical, if we use a
Gaussian $Z_n$ in (13). The difference is only in interpretation. In the strong form
(10) we are generating an approximate path that is a function of the driving
Brownian motion. In the weak form (13), we are just making a path with
approximately the right statistics. In either case, generating an approximate
sample path might be called simulating the SDE.
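The update (13) with Gaussian $Z_n$ can be sketched in a few lines. This is a minimal scalar illustration added here, not code from the notes. The function name `euler_maruyama` and the choice to pass the coefficients $a$ and $b$ as Python callables are assumptions made for the example.

```python
import math
import random


def euler_maruyama(a, b, x0, T, n_steps, rng):
    """Approximate sample path of dX = a(X,t) dt + b(X,t) dW on [0, T].

    Returns the list [X_0, X_1, ..., X_{n_steps}] with time step T / n_steps.
    """
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    x = x0
    path = [x]
    for n in range(n_steps):
        t_n = n * dt
        z_n = rng.gauss(0.0, 1.0)                            # standard normal Z_n
        x = x + a(x, t_n) * dt + b(x, t_n) * sqrt_dt * z_n   # update (13)
        path.append(x)
    return path


if __name__ == "__main__":
    # Illustrative parameters for geometric Brownian motion, equation (4).
    mu, sigma = 0.05, 0.2
    rng = random.Random(0)
    path = euler_maruyama(lambda s, t: mu * s, lambda s, t: sigma * s,
                          1.0, 1.0, 100, rng)
    print("approximate S_T:", path[-1])
```

Passing the random number generator `rng` explicitly makes it possible to reuse the same stream of normals for different $\Delta t$ or different coefficients, which is one way to compare paths in the strong sense.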
We are usually interested in more than simulations. Instead we want to know
expected values of functions of the path. These could be expected payouts or
hitting probabilities or something more complicated. A single path or a small
number of paths may not represent the entire population of paths. The only
way to learn about population properties is to do a large number of simulations.
The main way to digest a large collection of sample paths is to compute statistics
of them. Most such statistics correspond to the expected value of some function
of a path.
This brings up an important distinction, between simulation and Monte
Carlo. Simulation is described above. Monte Carlo$^1$ is the process of using
random numbers to compute something that itself is not random. A hitting
probability is not a random number, for example, though it is defined in terms
of a random process. The distinction is important because it suggests that there
may be more than one way to estimate the same number. We will discuss this
a little later when we talk about applications of Girsanov's theorem.

$^1$ This thoughtful definition may be found in the book Monte Carlo Methods by Malvin Kalos and Paula Whitlock.

For now, let $V(x_{[0,T]})$ be some function of a path on the interval $[0,T]$. Let
its expected value for the SDE be
\[
A = E\left[ V(X_{[0,T]}) \right] .
\]
For example, to estimate a hitting probability you might take $V = 1_{\tau \ge T}$. To
estimate $A$, we generate a large number of independent approximate sample
paths $X_{[0,T],k}^{(\Delta t)}$, $k = 1, \ldots, L$. Here $L$ is the sample size, which is the number of
paths. The estimate of $A$ is
\[
\widehat{A} = \frac{1}{L} \sum_{k=1}^{L} V\left( X_{[0,T],k}^{(\Delta t)} \right) . \tag{14}
\]
The error is $\widehat{A} - A$. This error is composed of bias and statistical error. Bias
comes from the fact that sample paths are not exact. You reduce bias by letting
$\Delta t \to 0$.
\[
\text{bias} = E\left[ \widehat{A} \right] - A = E\left[ V\left( X_{[0,T],k}^{(\Delta t)} \right) \right] - E\left[ V(X_{[0,T]}) \right] .
\]
In statistics, we say a statistic is unbiased if the expected value of the statistic is
the true parameter value. There are unbiased statistics, but it is very rare that
the estimate of a quantity like $A$ is unbiased. Statistical error comes from the
fact that we use a finite number of sample paths. You reduce statistical error
by taking $L \to \infty$. The definition is
\[
\text{statistical error} = \widehat{A} - E\left[ \widehat{A} \right] .
\]
Neither the bias nor the statistical error goes to zero very fast as $\Delta t \to 0$ or
$L \to \infty$. The bias typically is proportional to $\Delta t$ or $\Delta t^{1/2}$, depending on the
problem. The statistical error typically is proportional to $L^{-1/2}$, which comes
from the central limit theorem. For this reason Monte Carlo estimation either
is very expensive, or not very accurate, or both.
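Here is a small Monte Carlo sketch of the estimator (14), added as an illustration. It estimates the survival probability of standard Brownian motion started at $x_0 = 1$, for which the exact value $1 - 2N(-x_0/\sqrt{T})$ is known from earlier in these notes. The function name and parameters are hypothetical. One assumption deserves emphasis: the simulated path is checked against the boundary only at the grid times $t_n$, so crossings between grid times are missed and the estimate is biased upward.

```python
import math
import random


def estimate_survival(x0, T, n_steps, n_paths, seed=0):
    """Estimate P(path stays above 0 on [0, T]) for Brownian motion from x0.

    Returns (a_hat, std_err): the sample mean (14) of the survival indicator
    and an estimate of its standard deviation, sqrt(a_hat (1 - a_hat) / L).
    """
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(T / n_steps)
    survived = 0
    for _ in range(n_paths):
        x = x0
        alive = True
        for _ in range(n_steps):
            x += sqrt_dt * rng.gauss(0.0, 1.0)   # a = 0, b = 1 in (13)
            if x <= 0.0:                         # boundary checked on the grid only
                alive = False
                break
        if alive:
            survived += 1
    a_hat = survived / n_paths
    std_err = math.sqrt(a_hat * (1.0 - a_hat) / n_paths)
    return a_hat, std_err


if __name__ == "__main__":
    exact = math.erf(1.0 / math.sqrt(2.0))       # 1 - 2 N(-1), about 0.6827
    for n_steps in (10, 100, 1000):
        a_hat, std_err = estimate_survival(1.0, 1.0, n_steps, 2000)
        print(n_steps, a_hat, "+/-", std_err, " exact:", exact)
```

The reported `std_err` measures only the statistical error. The gap between the estimate and the exact value that remains after accounting for `std_err` is the bias, which shrinks slowly as `n_steps` grows.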
You could ask about the scientific justification for using SDE models if all
we do with them is discretize and simulate. Couldn't we just have simulated the
original process? There are several justifications for the SDE approach. One
has to do with time scales. We saw on an old assignment that the diffusion
process may be an approximation to another process that operates on a very
fast time scale, $T_m$. It is possible that the time step $\Delta t$ needed to simulate the
SDE is much larger than $T_m$ (the microscopic time scale). The SDE model
also can be simpler than the microscopic model it approximates. For example,
the Ornstein-Uhlenbeck/Einstein model of Brownian motion replaces a process
in $(X, V)$ space with a simpler process in $X$ space alone.

4 Existence of solutions
The first question asked in a graduate course on differential equations is: Do
differential equations have solutions? You can ask the same question about
stochastic differential equations, and the answer is similar. If the coefficients
$a(x,t)$ and $b(x,t)$ are Lipschitz continuous, then a simple iteration argument
shows that solutions exist. If the coefficients are not Lipschitz continuous, all
questions get harder and more problem specific.
A function $f(x)$ is Lipschitz continuous with Lipschitz constant $C$ if
\[
|f(x) - f(y)| \le C\, |x - y| . \tag{15}
\]
For example, $f(x) = \sin(2x)$ is Lipschitz with Lipschitz constant $C = 2$. The
functions $f(x) = e^x$, $f(x) = x^2$, and $f(x) = \sin(x^2)$ are not Lipschitz continuous.
If $f$ is differentiable and $|f'(x)| \le C$ for all $x$, then $f$ is Lipschitz continuous
with constant $C$. We say that $f$ is locally Lipschitz near the point $x_0$ if there
is an $R$ so that (15) holds whenever $|x - x_0| \le R$ and $|y - x_0| \le R$. Any
continuously differentiable function is locally Lipschitz. The examples show that many nice
seeming functions are not globally Lipschitz.
In the easy part of the existence theory for differential equations, locally
Lipschitz equations have local solutions (solutions defined for a finite but possibly
not infinite range of $t$). Globally Lipschitz equations have solutions defined
globally in time, which is to say, for all $t \in [0, \infty)$. In particular, if $a$ and $b$ are
(globally) Lipschitz then the SDE (5) has a solution defined globally in time.
More precisely, there is a function $W_{[0,t]} \to X_t$ so that the process $X_t$ satisfies
(5). This function may be constructed as a limit of the Picard iteration process.
This defines a sequence of approximate solutions $X_{t,k}$ and recovers the exact
solution in the limit $k \to \infty$. The iteration is
\[
dX_{t,k+1} = a(X_{t,k}, t)\, dt + b(X_{t,k}, t)\, dW_t .
\]
In integral form, this is
\[
X_{t,k+1} = x_0 + \int_0^t a(X_{s,k}, s)\, ds + \int_0^t b(X_{s,k}, s)\, dW_s .
\]
We must prove that these approximations converge to something.
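A crude numerical experiment makes the distinction between globally and locally Lipschitz functions concrete. The sketch below is an illustration added here, with a hypothetical helper name: it computes the largest difference quotient $|f(x) - f(y)|/|x - y|$ over neighboring points of a uniform grid. For $\sin(2x)$ the result stays below 2 on any interval, while for $x^2$ it grows in proportion to the size of the interval.

```python
import math


def max_difference_quotient(f, lo, hi, n=2000):
    """Largest |f(x) - f(y)| / |x - y| over neighboring grid points in [lo, hi]."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(abs(f(xs[i + 1]) - f(xs[i])) / (xs[i + 1] - xs[i])
               for i in range(n))


if __name__ == "__main__":
    for R in (1.0, 10.0, 100.0):
        q_sin = max_difference_quotient(lambda x: math.sin(2.0 * x), -R, R)
        q_sq = max_difference_quotient(lambda x: x * x, -R, R)
        print("R =", R, " sin(2x):", q_sin, " x^2:", q_sq)
```

A grid search of this kind gives only a lower bound for the best Lipschitz constant on the interval, so it can suggest that a function is not globally Lipschitz but cannot prove that it is.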

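Here is a quick version of the argument for the simpler case $a = 0$. We
subtract the $k$ and $k-1$ equations to get
\[
X_{t,k+1} - X_{t,k} = \int_0^t \left( b(X_{s,k}, s) - b(X_{s,k-1}, s) \right) dW_s .
\]
The object is to prove that $X_{t,k+1} - X_{t,k}$ is smaller than $X_{t,k} - X_{t,k-1}$ eventually
as $k \to \infty$. You can use the Ito isometry formula to get
\[
E\left[ \left( X_{t,k+1} - X_{t,k} \right)^2 \right] = \int_0^t E\left[ \left( b(X_{s,k}, s) - b(X_{s,k-1}, s) \right)^2 \right] ds .
\]
Since $b$ is Lipschitz continuous,
\[
\left( b(X_{s,k}, s) - b(X_{s,k-1}, s) \right)^2 \le C^2 \left( X_{s,k} - X_{s,k-1} \right)^2 .
\]
Therefore
\[
E\left[ \left( X_{t,k+1} - X_{t,k} \right)^2 \right] \le C^2 \int_0^t E\left[ \left( X_{s,k} - X_{s,k-1} \right)^2 \right] ds .
\]
Now define
\[
M_{t,k} = E\left[ \left( X_{t,k+1} - X_{t,k} \right)^2 \right] .
\]
Our integral inequality is
\[
M_{t,k} \le C^2 \int_0^t M_{s,k-1}\, ds .
\]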

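It is easy to derive from this, by applying the inequality $k$ times, that
\[
M_{t,k} \le \frac{(C^2 t)^k}{k!}\, \max_{0 \le s \le t} M_{s,0} .
\]
For any fixed $t$, this implies that the $X_{t,k}$ are a Cauchy sequence almost surely
(our Borel Cantelli lemma again).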
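The Picard iteration can be watched in action on a time grid. The sketch below is a discrete analogue added for illustration, not the continuous time construction itself: the stochastic integral is replaced by a sum over fixed Brownian increments `dW`, the drift is $a = 0$, and $b(x) = \sin(x)$ is an arbitrary Lipschitz choice. The function names are hypothetical. On a grid with $n$ steps the iterates agree with the Euler path after at most $n$ iterations, because iterate $k$ is already exact at the first $k$ grid points.

```python
import math
import random


def picard_iterates(b, x0, dW, n_iter):
    """Discrete Picard iteration X_{k+1}[m] = x0 + sum_{j<m} b(X_k[j]) dW[j].

    Returns the last iterate and the list of max-norm differences between
    successive iterates.
    """
    n = len(dW)
    current = [x0] * (n + 1)            # X_{t,0} = x0 for all t
    diffs = []
    for _ in range(n_iter):
        nxt = [x0]
        acc = x0
        for j in range(n):
            acc += b(current[j]) * dW[j]
            nxt.append(acc)
        diffs.append(max(abs(u - v) for u, v in zip(nxt, current)))
        current = nxt
    return current, diffs


def euler_path(b, x0, dW):
    """Euler path X[j+1] = X[j] + b(X[j]) dW[j], the fixed point of the iteration."""
    path = [x0]
    for dw in dW:
        path.append(path[-1] + b(path[-1]) * dw)
    return path


if __name__ == "__main__":
    rng = random.Random(1)
    n, T = 50, 1.0
    dW = [math.sqrt(T / n) * rng.gauss(0.0, 1.0) for _ in range(n)]
    last, diffs = picard_iterates(math.sin, 1.0, dW, 12)
    for k, d in enumerate(diffs):
        print("iteration", k + 1, "max change", d)
```

The printed differences typically fall off very quickly, which is the pathwise counterpart of the factorial decay of $M_{t,k}$ in the mean square estimate.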