Stochastic Calculus

Jonathan Goodman

September 10, 2012

These are lecture notes for the class Stochastic Calculus offered at the Courant

Institute in the Fall Semester of 2012. It is a graduate level class. Students

should have a solid background in probability and linear algebra. The topic

selection is guided in part by the needs of our MS program in Mathematics in

Finance. But it is not focused entirely on the Black Scholes theory of derivative

pricing. I hope that the main ideas are easier to understand in general with

a variety of applications and examples. I also hope that the class is useful to

engineers, scientists, economists, and applied mathematicians outside the world

of finance.

The term stochastic calculus refers to a family of mathematical methods for

studying dynamics with randomness. Stochastic by itself means random, and

it implies dynamics, as in stochastic process. The term calculus by itself has

two related meanings. One is a system of methods for calculating things, as in

the calculus of pseudo-differential operators or the umbral calculus.[1] The tools

of stochastic calculus include the backward equations and forward equations,

which allow us to calculate the time evolution of expected values and probability

distributions for stochastic processes. In simple cases these are matrix equations.

In more sophisticated cases they are partial differential equations of diffusion type.

The other sense of calculus is the study of what happens when Δt → 0. In this limit, finite differences go to derivatives and sums go to integrals. Calculus in this sense is short for differential calculus and integral calculus,[2] which refers to the simple rules for calculating derivatives and integrals: the product rule, the fundamental theorem of calculus, and so on. The operations of calculus, integration and differentiation, are harder to justify than the operations of algebra. But the formulas often are simpler and more useful: integrals can be easier than sums.

[1] Just name-dropping. These are not part of our Stochastic Calculus class.

[2] Richard Courant, the founder of the Courant Institute, wrote a two volume textbook Differential and Integral Calculus (originally Vorlesungen über Differential- und Integralrechnung).


Differential and integral calculus is good for modeling as well as for calculation. We understand the dynamics of a system by asking how the system changes in a small interval of time. The mathematical model can be a system of differential equations. We predict the behavior of the system by solving the differential equations, either analytically or computationally. Examples of this include the rate equations of physical chemistry and the laws of Newtonian dynamics.

The stochastic calculus in this sense is the Ito calculus. The extra Ito term makes the Ito calculus more complicated than ordinary calculus. There is no Ito term in ordinary calculus because the quadratic variation is zero. The Ito calculus also is a framework for modeling. A stochastic process may be described by giving an Ito stochastic differential equation, an SDE. There are relatively simple rules for deriving this SDE from basic information about the short time behavior of the process. This is analogous to writing an ordinary differential equation (ODE) to describe the evolution of a system that is not random. If you can describe the behavior of the system over very short time intervals, you can write the ODE. If you can write the ODE, there is an array of analytical and computational methods that help you figure out how the system behaves.

This course starts with two simple kinds of stochastic processes that may be

described by basic methods of probability. This week we cover linear Gaussian

recurrence relations. These are used throughout science and economics as the

simplest class of models of stochastic dynamics. Almost everything about linear

Gaussian processes is determined by matrices and linear algebra. Next week we

discuss another class of random processes described by matrices, finite state

space Markov chains. Week 3 begins the transition to continuous time with

continuous time versions of the Gaussian processes discussed this week. The

simplest of these processes is Brownian motion, which is the central construct

that drives most of the Ito calculus.

After that comes the technical core of the course: the Ito integral, Ito's lemma, and general diffusion processes. We will see how to associate partial differential equations to diffusion processes and how to find approximate numerical solutions.

It is impractical to do all this in a mathematically rigorous way in just one

semester. This class will indicate some of the main ideas of the mathematically

rigorous theory, but we will not discuss them thoroughly. Experience shows that careful people can use the Ito calculus more or less correctly without being able to recite the formal proofs. Indeed, the ordinary calculus of Newton and Leibniz is used daily by scientists and engineers around the world, most of whom would be unable to give a mathematically correct definition of the derivative.

Computing is an integral part of this class as it is an integral part of applied

mathematics. The theorems and formulas of stochastic calculus are easier to

understand when you see them in action in computations of specific examples.

More importantly, in the practice of stochastic calculus, it is very rare that a

problem gets solved without some computation. A training class like this one

should include all aspects of the subject, not just those in use before computers

were invented.


2 Introduction to the material for the week

The topic this week is linear recurrence relations with Gaussian noise. A linear recurrence relation is an equation of the form

X_{n+1} = A X_n + V_n ,   (1)

where A is a d × d matrix, V_n is a forcing vector, and the state is the column vector

X_n = ( X_{n,1}, . . . , X_{n,d} )^t ∈ R^d .

The forcing vectors are i.i.d., which stands for independent and identically dis-

tributed. The model is defined by the matrix A and the probability distribution

of the forcing. The model does not change with time because A and the distri-

bution of the Vn are the same for each n. The recurrence relation is Gaussian

if the noise vectors Vn are Gaussian.

This is a simple model of the evolution of a system that is somewhat pre-

dictable, but not entirely. The d components Xn,1 , . . . , Xn,d represent the state

of the system at time n. The deterministic part of the dynamics is Xn+1 = AXn .

This says that the components at the next time period are linear functions of

the components in the current period. The term V_n represents the random influences at time n. In this model, everything about time n relevant to predicting time n + 1 is contained in X_n. Therefore, the noise at time n, which is V_n, is completely independent of anything we have seen before.

In statistics, a model of the form Y = AX + V is a linear regression model (though usually with E[V] = 0 and V called ε). In it, a family of variables Y_i are predicted using linear functions of a different family of variables X_j. The noise components V_k model the extent to which the Y variables cannot be predicted by the X variables. The model (1) is called autoregressive, or AR, because values of variables X_j are predicted by values of the same variables at the previous time.

It is possible to understand the behavior of a linear Gaussian AR model in great detail (technical point: you must assume it starts with an X_0 that is Gaussian). All the subsequent X_n are multivariate normal. There are simple matrix recurrence relations that determine their means and covariance matrices. These recurrence relations are important in themselves, and they are our first example of an ongoing theme for the course, backward and forward equations.

This material makes a good start for the class for several reasons. For one

thing, it gives us an excuse to review multivariate Gaussian random variables.

Also, we have a simple context in which to talk about path space. In this case,

a path is a sequence of consecutive states, which we write as

X_{[n1:n2]} = ( X_{n1}, X_{n1+1}, . . . , X_{n2} ) .   (2)


The notation [n1 : n2] comes from two sources. Several programming languages use similar notation to denote a sequence of consecutive integers: [n1 : n2] = {n1, n1 + 1, . . . , n2}. In mathematics, [n1, n2] refers to the closed interval containing all real numbers between n1 and n2, including n1 and n2. We write [n1 : n2] to denote just the integers in that interval.

The path is an object in a big vector space. Each of the X_n has d components. The number of integers n in the set [n1 : n2] is n2 − n1 + 1. Altogether, the path X_{[n1:n2]} has d(n2 − n1 + 1) components. Therefore X_{[n1:n2]} can be viewed as a point in the path space R^{d(n2−n1+1)}. As such it is Gaussian. Its distribution is completely determined by its mean and covariance matrix. The mean of X_{[n1:n2]} is determined by the means of the individual X_n. The covariance matrix of X_{[n1:n2]} has dimension d(n2 − n1 + 1) × d(n2 − n1 + 1). Some of its entries give the variances and covariances of the components of X_n. Others are the covariances of components X_{n,j} with X_{n′,k} at unequal times n ≠ n′.

In future weeks we will consider spaces of paths that depend on continu-

ous time t rather than discrete time n. The corresponding path spaces, and

probability distributions on them, are one of the main subjects of this course.
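The path-space view is easy to try numerically. The Python sketch below (the matrix A, the dimension d, and the index range are made-up illustration values, not from the notes) simulates the recurrence X_{n+1} = A X_n + V_n and stacks a discrete-time path into a single point of the path space R^{d(n2−n1+1)}:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2                        # state dimension (illustrative choice)
A = np.array([[0.9, 0.1],    # made-up stable dynamics matrix
              [0.0, 0.8]])

n1, n2 = 0, 4                # path indices [n1 : n2]
X = np.zeros(d)              # start from X_0 = 0

path = [X]
for n in range(n1, n2):
    V = rng.standard_normal(d)   # i.i.d. Gaussian forcing V_n
    X = A @ X + V                # the recurrence X_{n+1} = A X_n + V_n
    path.append(X)

# The path X_[n1:n2] viewed as one point in R^{d(n2-n1+1)}
path_vector = np.concatenate(path)
print(path_vector.shape)     # d*(n2-n1+1) = 2*5 = 10 components
```

Stacking the states this way is exactly the identification of a path with a point in R^{d(n2−n1+1)} described above.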

3 Multivariate normals

Most of the material in this section should be review for most of you. The

multivariate Gaussian, or normal, probability distribution is important for so

many reasons that it would be dull to list them all here. That activity might

help you later as you review for the final exam. The important takeaway is

linear algebra as a way to deal with multivariate normals.

3.1 Linear transformations

Let X ∈ R^d be a multivariate random variable. We write X ∼ u(x) to indicate that u is the probability density of X. Let A be a square d × d matrix that describes an invertible linear transformation of random variables X = AY, Y = A^{-1}X. Let v(y) be the probability density of Y. The relation between u and v is

v(y) = |det(A)| u(Ay) .   (3)

This is equivalent to

u(x) = ( 1 / |det(A)| ) v(A^{-1} x) .   (4)

We will prove it in the form (3) and use it in the form (4).

The determinants may be the most complicated things in the formulas (3) and (4). They may be the least important. It is common that probability densities are known only up to a constant. That is, we know u(x) = c f(x) with a formula for f, but we do not know c. Even if there is a formula for c, the formula may be more helpful without it.


(For example, the Student t-density is

u(x) = c ( 1 + x²/n )^{-(n+1)/2} ,

with

c = Γ((n+1)/2) / ( √(nπ) Γ(n/2) ) ,

in terms of the Euler Gamma function Γ(a) = ∫_0^∞ t^{a-1} e^{-t} dt. The important features of the t-distribution are easier to see without the formula for c: the fact that it is approximately normal for large n, that it is symmetric and smooth, and that u(x) ≈ x^{-p} for large x with exponent p = n + 1 (power law tails).)

Here is an informal way to understand the transformation rule (3) and get the determinant prefactor in the right place. Consider a very small region, B_y, in R^d about the point y. This region could be a small ball or box, say. Call the volume of B_y, informally, |dy|. Under the transformation y → x = Ay, say that B_y is transformed to a small region B_x about x. Let |dx| be the volume of B_x. Since B_y → B_x (this means that the transformation A takes B_y to B_x), the ratio of the volumes is given by the determinant:

|dx| = |det(A)| |dy| .

But when B_y is small, we have the approximate formula

Pr(B_y) = ∫_{B_y} v(y′) dy′ ≈ v(y) |dy| .   (5)

Similarly, Pr(B_x) ≈ u(x) |dx| = u(x) |det(A)| |dy|. The exact formula Pr(B_x) = Pr(B_y) then gives the approximate formula

u(x) |det(A)| |dy| ≈ v(y) |dy| .

In the limit |dy| → 0, the approximations become exact. Cancel the common factor |dy| from both sides and you get the transformation formula (3).
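The transformation rule can be checked numerically. In the sketch below (the 2×2 matrix A and the test point y are made-up values), u is the standard normal density; if X = AY with X ∼ N(0, I), then Y ∼ N(0, A^{-1}A^{-t}), and the rule v(y) = |det(A)| u(Ay) should agree with that density evaluated directly:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.5, 3.0]])           # made-up invertible matrix
d = 2

def u(x):
    # standard normal density N(0, I) in R^d
    return np.exp(-x @ x / 2) / (2 * np.pi) ** (d / 2)

def v_from_rule(y):
    # transformation rule (3): v(y) = |det(A)| u(Ay)
    return abs(np.linalg.det(A)) * u(A @ y)

def v_direct(y):
    # density of Y = A^{-1} X, i.e. N(0, C_Y) with C_Y = A^{-1} A^{-t}
    Ainv = np.linalg.inv(A)
    C_Y = Ainv @ Ainv.T
    q = y @ np.linalg.inv(C_Y) @ y
    return np.exp(-q / 2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(C_Y)))

y = np.array([0.3, -1.2])
print(v_from_rule(y), v_direct(y))   # the two values agree
```

The agreement is exact (up to floating point) because y^t C_Y^{-1} y = (Ay)^t(Ay) and 1/√(det(C_Y)) = |det(A)|.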

3.2 Matrix algebra

Very simple facts about matrix multiplication make the mathematician's work much simpler than it would be otherwise. This applies to the associativity property of matrix multiplication and the distributive property of matrix multiplication and addition. This is part of what makes linear algebra so useful in practical probability.

Suppose A, B, and C are three matrices that are compatible for multiplication. Associativity is the formula (AB)C = A(BC). We can write the product simply as ABC because the order of multiplication does not matter. Associativity holds for products of more factors. For example, (A(BC))D = (AB)(CD) gives two of the many ways to calculate the matrix product ABCD: you can compute BC, then multiply from the left by A and lastly multiply from the right by D, or you can first calculate AB and CD and then multiply those.

Distributivity is the fact that the matrix product is a linear function of each factor. Suppose AB is compatible for matrix multiplication, that B1 and B2 have the same shape (number of rows and columns) as B, and that u1 and u2 are numbers. Then A(u1 B1 + u2 B2) = u1(AB1) + u2(AB2). This works with more than two B matrices, and with matrices on the right and left, such as

A ( Σ_{k=1}^n u_k B_k ) C = Σ_{k=1}^n u_k A B_k C .

If B(x) is a matrix that depends on x, and u(x) is a probability density function, then

∫ ( A B(x) C ) u(x) dx = A ( ∫ B(x) u(x) dx ) C .

This may be said in a more abstract way. If B is a random matrix and A and C are fixed, not random, then

E[ A B C ] = A E[B] C .   (6)

Matrix multiplication is associative and linear even when some of the matrices are row vectors or column vectors. These can be treated as 1 × d and d × 1 matrices respectively.

Of course, matrix multiplication is not commutative: AB ≠ BA in general. The matrix transpose reverses the order of matrix multiplication: (AB)^t = B^t A^t. Matrix inverse does the same if A and B are square matrices: (AB)^{-1} = B^{-1} A^{-1}. If A and B are not square, it is possible that AB is invertible even though A and B are not.
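These order-reversal rules, and the non-square curiosity, take a few lines to check numerically (the matrices below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Square case: transpose and inverse both reverse the order of a product.
P = rng.standard_normal((3, 3))
Q = rng.standard_normal((3, 3))
assert np.allclose((P @ Q).T, Q.T @ P.T)               # (PQ)^t = Q^t P^t
assert np.allclose(np.linalg.inv(P @ Q),
                   np.linalg.inv(Q) @ np.linalg.inv(P))  # (PQ)^{-1} = Q^{-1} P^{-1}

# Non-square case: AB can be invertible even though neither factor is
# (a non-square matrix has no inverse at all).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])      # 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])           # 3 x 2
AB = A @ B                           # 2 x 2, the identity here: invertible
print(np.linalg.det(AB))
```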

We illustrate matrix algebra in probability by finding transformation rules for the mean and covariance of a multivariate random variable. Suppose Y ∈ R^d is a d component random variable, and X = AY. It is not necessary here for A to be invertible or square, as it was in subsection 3.1. The mean of Y is the d component vector given either in matrix/vector form as μ_Y = E[Y], or in component form as μ_{Y,j} = E[Y_j]. The expected value of X is

μ_X = E[X] = E[AY] = A E[Y] = A μ_Y .

We may take A out of the expectation because of the linearity of matrix multiplication, and the fact that Y may be treated as a d × 1 matrix.

Slightly less trivial is the transformation formula for the covariance matrix. The covariance matrix C_Y is the d × d symmetric matrix whose entries are

C_{Y,jk} = cov(Y_j, Y_k) = E[ (Y_j − μ_{Y,j}) (Y_k − μ_{Y,k}) ] .

The diagonal entries of C_Y are the variances of the components of Y:

C_{Y,jj} = E[ (Y_j − μ_{Y,j})² ] = σ²_{Y_j} .

Now consider the d × d matrix B(Y) = (Y − μ_Y)(Y − μ_Y)^t. The (j, k) entry of B is just (Y_j − μ_{Y,j})(Y_k − μ_{Y,k}). Therefore the covariance matrix may be expressed as

C_Y = E[ (Y − μ_Y)(Y − μ_Y)^t ] .   (7)

The linearity formula (6), and associativity, give the transformation law for covariances:

C_X = E[ (X − μ_X)(X − μ_X)^t ]
    = E[ (AY − Aμ_Y)(AY − Aμ_Y)^t ]
    = E[ {A(Y − μ_Y)} {A(Y − μ_Y)}^t ]
    = E[ {A(Y − μ_Y)} {(Y − μ_Y)^t A^t} ]
    = E[ A (Y − μ_Y)(Y − μ_Y)^t A^t ]
    = A E[ (Y − μ_Y)(Y − μ_Y)^t ] A^t

C_X = A C_Y A^t .   (8)

The second to the third line uses distributivity. The third to the fourth uses the property of matrix transpose. The fourth to the fifth is distributivity again. The fifth to the sixth is linearity.
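The covariance transformation law (8) shows up in data too: if every sample of X is A times the corresponding sample of Y, the empirical covariance matrices satisfy Ĉ_X = A Ĉ_Y A^t exactly, not just in expectation. A short check, with a made-up A and sample data:

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[1.0, 2.0],
              [0.0, 1.5]])             # made-up transformation

Y = rng.standard_normal((1000, 2))     # 1000 samples of Y (one per row)
X = Y @ A.T                            # X_i = A Y_i for each sample i

C_Y = np.cov(Y, rowvar=False)          # empirical covariance of Y
C_X = np.cov(X, rowvar=False)          # empirical covariance of X

# The transformation law holds exactly for the empirical covariances.
print(np.allclose(C_X, A @ C_Y @ A.T))
```

The exactness is because the sample covariance is itself of the form (7) with the expectation replaced by a sample average, and the A factors pull out of the average the same way.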

This subsection and the next one use the multivariate normal probability density. The aim is not to use the formula but to find ways to avoid using it. We use the general probability density formula to prove the important facts about Gaussians. Working with these general properties is simpler than working with probability density formulas. These properties are:

- If X is a multivariate normal and Y = AX, then Y is Gaussian.

- If X1 and X2 are components of a multivariate normal and if cov(X1, X2) = 0, then X1 and X2 are independent.

- If (X1, X2) is a multivariate normal, then the distribution of X1, conditioned on knowing the value X2 = x2, is Gaussian.

In this subsection and the next, we use the formula for the Gaussian probability density to prove these three properties.

Let H be a d × d matrix that is SPD (symmetric and positive definite). Let μ = (μ_1, . . . , μ_d)^t ∈ R^d be a vector. If X has the probability density

u(x) = ( √(det(H)) / (2π)^{d/2} ) e^{ −(x−μ)^t H (x−μ)/2 } ,   (9)

then we say that X is a multivariate Gaussian, or normal, with parameters μ and H. The probability density on the right is denoted by N(μ, H^{-1}). It will be clear soon why it is convenient to use H^{-1} instead of H. We say that X is centered if μ = 0. In that case the density is symmetric, u(x) = u(−x).

It is usually more convenient to write the density formula as

u(x) = c e^{ −(x−μ)^t H (x−μ)/2 } .

The value of the prefactor,

c = √(det(H)) / (2π)^{d/2} ,

often is not important. A probability density is Gaussian if it is the exponential of a quadratic function of x.

We give some examples before explaining (9) in general. A univariate normal has d = 1. In that case we drop the vector and matrix notation because μ and H are just numbers. The simplest univariate normal is the univariate standard normal, with μ = 0 and H = 1. We often use Z to denote a standard normal, so

Z ∼ ( 1/√(2π) ) e^{ −z²/2 } = N(0, 1) .   (10)

The cumulative distribution function, or CDF, of the univariate standard normal is

N(x) = Pr( Z < x ) = ∫_{−∞}^{x} ( 1/√(2π) ) e^{ −z²/2 } dz .

There is no explicit formula for N(x), but it is easy to calculate numerically. Most numerical software packages include procedures that compute N(x) accurately.
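For instance, the Python standard library already suffices, via the error function identity N(x) = (1 + erf(x/√2))/2 (the function name below is a made-up illustration, not a standard API):

```python
import math

def std_normal_cdf(x):
    # N(x) = Pr(Z < x) for Z ~ N(0,1), computed from the error function:
    # N(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(std_normal_cdf(0.0))    # 0.5, by symmetry of the density
print(std_normal_cdf(1.96))   # about 0.975, the familiar two-sided 95% point
```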

The general univariate density may be written without matrix/vector notation:

u(x) = ( √h / √(2π) ) e^{ −(x−μ)² h/2 } .   (11)

Simple calculations (explained below) show that the mean is E[X] = μ, and the variance is

σ²_X = var(X) = E[ (X − μ)² ] = 1/h .

In view of this, the probability density (11) is also written

u(x) = ( 1 / √(2πσ²) ) e^{ −(x−μ)²/(2σ²) } .   (12)

If X has this density, we write X ∼ N(μ, σ²). It would have been more accurate to write σ²_X and μ_X, but we make a habit of dropping subscripts that are easy to guess from context.

It is useful to express a general univariate normal in terms of a standard univariate normal. If we want X ∼ N(μ, σ²), we can take Z ∼ N(0, 1) and take

X = μ + σZ .   (13)

Then E[X] = μ + σ E[Z] = μ, and

var(X) = E[ (X − μ)² ] = σ² E[ Z² ] = σ² .

A calculation with probability densities (see below) shows that if (10) is the density of Z and X is given by (13), then X has the density (12). This is handy in calculations, such as

Pr[X < a] = Pr[ μ + σZ < a ] = Pr[ Z < (a − μ)/σ ] = N( (a − μ)/σ ) .

This says that the probability of X < a depends on how many standard deviations a is away from the mean, which is the argument of N on the right.

We move to a multivariate normal and return to matrix and vector notation. The standard multivariate normal has μ = 0 and H = I_{d×d}. In this case, the exponent in (9) involves just z^t H z = z^t z = z_1² + ··· + z_d², and det(H) = 1. Therefore the N(0, I) probability density (9) is

Z ∼ ( 1 / (2π)^{d/2} ) e^{ −z^t z/2 }   (14)

  = ( 1/√(2π) ) e^{ −z_1²/2 } · ( 1/√(2π) ) e^{ −z_2²/2 } ··· ( 1/√(2π) ) e^{ −z_d²/2 } .   (15)

The last line writes the probability density of Z as a product of one dimensional N(0, 1) densities for the components Z_1, . . . , Z_d. This implies that the components Z_j are independent univariate standard normals. The elements of the covariance matrix are

C_{Z,jj} = var(Z_j) = E[ Z_j² ] = 1 ,
C_{Z,jk} = cov(Z_j, Z_k) = E[ Z_j Z_k ] = 0  if j ≠ k .   (16)

This shows that the covariance matrix of a multivariate standard normal is the identity matrix. In this case at least, uncorrelated Gaussian components Z_j are also independent.

The themes of this section so far are the general transformation law (4), the covariance transformation formula (8), and the Gaussian density formula (9).

We are ready to combine them to see how multivariate normals transform under linear transformations. Suppose Y has probability density (assuming for now that μ = 0)

v(y) = c e^{ −y^t H y/2 } ,

and X = AY, with an invertible A so Y = A^{-1}X. We use (4), and calculate the exponent

y^t H y = (A^{-1} x)^t H (A^{-1} x) = x^t (A^{-1})^t H A^{-1} x = x^t H̃ x ,

with H̃ = (A^{-1})^t H A^{-1}. (Note that (A^{-1})^t = (A^t)^{-1}. We denote these by A^{-t}, and write H̃ = A^{-t} H A^{-1}.) The formula for H̃ is not important here. What is important is that u(x) = c e^{ −x^t H̃ x/2 }, which is Gaussian. This proves that a linear transformation of a multivariate normal is a multivariate normal, at least if the linear transformation is invertible.

We come to the relationship between H, the SPD matrix in (9), and C, the covariance matrix of X. In one dimension the relation is σ² = C = h^{-1}. We now show that for d ≥ 1 the relationship is

C = H^{-1} .   (17)

If X is a multivariate normal with mean μ and covariance C, we write X ∼ N(μ, C). This is consistent with the one variable notation N(μ, σ²). The relation (17) allows us to rewrite the probability density (9) in its more familiar form

u(x) = c e^{ −(x−μ)^t C^{-1} (x−μ)/2 } ,   (18)

with prefactor

c = 1 / ( (2π)^{d/2} √(det(C)) ) = √(det(H)) / (2π)^{d/2} .   (19)

The proof of (17) usesXmsZ

an idea that is important for computation. A natural

multivariate version of (13) is

X = + AZ , (20) XmAZ

the

transformation formula to find the density formula for X. The desired (17) will

fall out. The whole thing is an exercise in linear algebra.

The mean property is clear, so we continue to take μ = 0. The covariance transformation formula (8), with C_Z = I in place of C_Y, implies that C_X = AA^t. We can create a multivariate normal with covariance C if we can find an A with

A A^t = C .   (21)

You can think of A as the square root of C, just as σ is the square root of σ² in the one dimensional version (13).

There are dierent ways to find a suitable A. One is the Cholesky factor-

ization C = LLt , where L is a lower triangular matrix. This is described in

any good linear algebra book (Strang, Lax, not Halmos). This is convenient

for computation because numerical software packages usually include a routine

that computes the Cholesky factorization.

This is the algorithm for creating X. We now find the probability density of X using the transformation formula (4). We write c for the prefactor in any probability density formula. The value of c can be different in different formulas. We saw that v(z) = c e^{ −z^t z/2 }. With z = A^{-1}x, we get

u(x) = c v(A^{-1} x) = c e^{ −(A^{-1} x)^t (A^{-1} x)/2 } .

The exponent is, as we just saw, −x^t H x/2 with

H = A^{-t} A^{-1} .

But if we take the inverse of both sides of (21) we find C^{-1} = A^{-t} A^{-1}. This proves (17), as the expressions for C^{-1} and H are the same.

The prefactor works out too. The covariance equation (21) implies that

det(C) = det(A) det(A^t) = [ det(A) ]² .

Using (14) and (3) together gives

c = ( 1 / (2π)^{d/2} ) ( 1 / |det(A)| ) ,

which is the prefactor formula (19).

Two important properties of the multivariate normal, for your review list, are that multivariate normals exist for any C and are easy to generate. The covariance square root equation (21) has a solution for any SPD matrix C. If C is a desired covariance matrix, the mapping (20) produces a multivariate normal X with covariance C. Standard software packages include random number generators that produce independent univariate standard normals Z_k. If you have C and you want a million independent random vectors X, you first compute the Cholesky factor, L. Then a million times you use the standard normal random number generator to produce a d component standard normal Z and do the matrix calculation X = LZ.
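The recipe just described — factor C once, then reuse L for every sample — looks like this in Python with numpy (C here is a made-up target covariance for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up target covariance matrix (must be SPD).
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

L = np.linalg.cholesky(C)          # compute the Cholesky factor once: C = L L^t
assert np.allclose(L @ L.T, C)

# Reuse L for as many samples as you like: X = L Z with Z ~ N(0, I).
Z = rng.standard_normal((100_000, 2))
X = Z @ L.T

print(np.cov(X, rowvar=False))     # the sample covariance is close to C
```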

This makes the multivariate normal family different from other multivariate families. Sampling a general multivariate random variable can be challenging for

d larger than about 5. Practitioners resort to heavy handed and slow methods

such as Markov chain Monte Carlo. Moreover, there is a modeling question

that may be hard to answer for general random variables. Suppose you want

univariate random variables X1 , . . ., Xd each to have density f (x) and you want

them to be correlated. If f (x) is a univariate normal, you can make the Xk

components of a multivariate normal with desired variances and correlations.

If the Xk are not normal, the copula transformation maps the situation to a

multivariate normal. Warning: the copula transformation has been blamed for

the 2007 financial meltdown, seriously.


3.4 Conditional and marginal distributions


When you talk about conditional and marginal distributions, you have to say which variables are fixed or known, and which are variable or unknown. We write the multivariate random variable as (X, Y) with X ∈ R^{d_X} and Y ∈ R^{d_Y}. The number of random variables in total is still d = d_X + d_Y. We study the conditional distribution of X, conditioned on knowing Y = y. We also study the marginal distribution of Y.

The math here is linear algebra with block vectors and matrices. The total random variable is partitioned into its X and Y parts as

(X, Y)^t = ( X_1, . . . , X_{d_X}, Y_1, . . . , Y_{d_Y} )^t ∈ R^d ,

and H is partitioned into the corresponding blocks

H = ( H_XX | H_XY )
    ( H_YX | H_YY ) ,

so that the quadratic form in the exponent is

(x^t, y^t) H (x, y)^t = x^t H_XX x + 2 x^t H_XY y + y^t H_YY y .

(This uses H_YX = H_XY^t, which is true because H is symmetric.) The joint density of X and Y is

u(x, y) = c e^{ −( x^t H_XX x + 2 x^t H_XY y + y^t H_YY y )/2 } .

The conditional density of X, conditioned on Y = y, is u(x | Y = y) = u(x | y) = c(y) u(x, y). Here we think of x as the variable and y as a parameter. The normalization constant here depends on the parameter. For the Gaussian, we have

u(x | y) = c(y) e^{ −( x^t H_XX x + 2 x^t H_XY y )/2 } .   (22)

The factor e^{ −y^t H_YY y/2 } has been absorbed into c(y).

We see from (22) that the conditional distribution of X is the exponential of a quadratic function of x, which is to say, Gaussian. The algebraic trick of completing the square identifies the conditional mean. We seek to write the exponent of (22) in the form of the exponent of (9):

x^t H_XX x + 2 x^t H_XY y = (x − μ_X(y))^t H_XX (x − μ_X(y)) + m(y) .

The m(y) will eventually be absorbed into the y dependent prefactor. Some algebra shows that this works provided

−2 x^t H_XX μ_X(y) = 2 x^t H_XY y .

This will hold for all x if −H_XY y = H_XX μ_X(y), which gives the formula for the conditional mean:

μ_X(y) = −H_XX^{-1} H_XY y .   (23)

The conditional mean μ_X(y) is in some sense the best prediction of the unknown X given the known Y = y.
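The completing-the-square identity behind (23) can be checked numerically: with μ_X(y) = −H_XX^{-1} H_XY y and the x-independent remainder m(y) = −μ_X(y)^t H_XX μ_X(y), the two sides agree for every x. A sketch with made-up SPD blocks:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up symmetric positive definite H, partitioned into blocks.
M = rng.standard_normal((4, 4))
H = M @ M.T + 4 * np.eye(4)       # SPD by construction
H_XX, H_XY = H[:2, :2], H[:2, 2:]

y = np.array([0.7, -0.4])
mu_X = -np.linalg.solve(H_XX, H_XY @ y)    # conditional mean (23)
m_y = -mu_X @ H_XX @ mu_X                  # the x-independent remainder m(y)

for _ in range(5):
    x = rng.standard_normal(2)
    lhs = x @ H_XX @ x + 2 * x @ H_XY @ y
    rhs = (x - mu_X) @ H_XX @ (x - mu_X) + m_y
    assert np.isclose(lhs, rhs)

print("completing the square checks out")
```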

3.5 Square roots of the covariance

If we have M with M M^t = C, we can think of M as a kind of square root of C. It is possible to find a real d × d matrix M as long as C is symmetric and positive definite. We will see two distinct ways to do this that give two different M matrices.

The Cholesky factorization is one of these ways. The Cholesky factorization of C is a lower triangular matrix L with LL^t = C. Lower triangular means that all non-zero entries of L are on or below the diagonal:

L = ( l_11   0    ···   0   )
    ( l_21  l_22   0    ··· )
    (  ⋮          ⋱     0   )
    ( l_d1        ···  l_dd ) .

Any good linear algebra book explains the basic facts of Cholesky factorization. These are: such an L exists as long as C is SPD; there is a unique lower triangular L with positive diagonal entries, l_jj > 0; and there is a straightforward algorithm that calculates L from C using approximately d³/6 multiplications (and the same number of additions).
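That straightforward algorithm fits in a few lines. This is a plain, unoptimized Python sketch of the standard Cholesky recurrence, checked against numpy's library routine; the SPD test matrix is made up:

```python
import numpy as np

def cholesky(C):
    """Lower triangular L with L L^t = C, for SPD C."""
    d = C.shape[0]
    L = np.zeros((d, d))
    for j in range(d):
        # diagonal entry: l_jj = sqrt(c_jj - sum_{k<j} l_jk^2)
        L[j, j] = np.sqrt(C[j, j] - np.sum(L[j, :j] ** 2))
        for i in range(j + 1, d):
            # below-diagonal entries of column j
            L[i, j] = (C[i, j] - np.sum(L[i, :j] * L[j, :j])) / L[j, j]
    return L

C = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.0],
              [0.6, 1.0, 3.0]])     # made-up SPD matrix

L = cholesky(C)
print(np.allclose(L @ L.T, C))                 # True
print(np.allclose(L, np.linalg.cholesky(C)))   # True: same positive-diagonal L
```

The inner sums are where the roughly d³/6 multiplications come from.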

If you want to generate X ∼ N(μ, C), you compute the Cholesky factorization of C. Any good package of linear algebra software can do this, including the downloadable software LAPACK for C or C++ or FORTRAN programming, and the built-in linear algebra facilities in Python, R, and Matlab. To make an X, you need d independent standard normals Z_1, . . . , Z_d. Most packages that generate pseudo-random numbers have a procedure to generate such standard normals. This includes Python, R, and Matlab. To do it in C, C++, or FORTRAN, you can use a uniform pseudo-random number generator and then use the Box Muller formula to get Gaussians. You assemble the Z_j into a vector Z = (Z_1, . . . , Z_d)^t and take X = LZ + μ.
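In a language without a built-in normal generator, the Box Muller formula turns two independent Uniform(0, 1) draws into two independent N(0, 1) draws. A minimal sketch (the function name is illustrative):

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent Uniform(0,1) draws to two independent N(0,1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

random.seed(0)
zs = []
for _ in range(50_000):
    z1, z2 = box_muller(random.random(), random.random())
    zs.extend([z1, z2])

mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
print(round(mean, 2), round(var, 2))   # near 0 and 1, as for standard normals
```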

Consider as an example the two dimensional case with μ = 0. Here, we want X_1 and X_2 that are jointly normal. It is common to specify var(X_1) = σ_1², var(X_2) = σ_2², and the correlation coefficient

ρ_12 = corr(X_1, X_2) = cov(X_1, X_2) / (σ_1 σ_2) = E[X_1 X_2] / (σ_1 σ_2) .


In this case, the Cholesky factor is

L = ( σ_1                 0           )
    ( ρ_12 σ_2   √(1 − ρ_12²) σ_2 ) .   (24)

The general formula X = LZ becomes

X_1 = σ_1 Z_1 ,   (25)
X_2 = ρ_12 σ_2 Z_1 + √(1 − ρ_12²) σ_2 Z_2 .   (26)

It is easy to calculate E[X_1²] = σ_1², which is the desired value. Similarly, because Z_1 and Z_2 are independent, we have

var(X_2) = E[X_2²] = ρ_12² σ_2² + (1 − ρ_12²) σ_2² = σ_2² ,

which is the desired answer, too. The correlation coefficient is also correct:

corr(X_1, X_2) = E[X_1 X_2] / (σ_1 σ_2) = E[ σ_1 Z_1 · ρ_12 σ_2 Z_1 ] / (σ_1 σ_2) = ρ_12 E[Z_1²] = ρ_12 .

You can, and should, verify by matrix multiplication that

L L^t = ( σ_1²            σ_1 σ_2 ρ_12 )
        ( σ_1 σ_2 ρ_12    σ_2²          ) ,

which is the desired covariance matrix of (X_1, X_2)^t.
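Formula (24) is easy to check against a library Cholesky routine; σ_1, σ_2, and ρ_12 below are made-up values:

```python
import numpy as np

s1, s2, rho = 1.5, 0.7, 0.3            # made-up sigma_1, sigma_2, rho_12

# The 2x2 Cholesky factor from formula (24).
L = np.array([[s1,        0.0],
              [rho * s2,  np.sqrt(1 - rho**2) * s2]])

# The target covariance matrix.
C = np.array([[s1**2,          s1 * s2 * rho],
              [s1 * s2 * rho,  s2**2]])

print(np.allclose(L @ L.T, C))                 # True
print(np.allclose(L, np.linalg.cholesky(C)))   # True: the diagonals are positive
```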

We could have turned the formulas (25) and (26) around as

X_1 = √(1 − ρ_12²) σ_1 Z_1 + ρ_12 σ_1 Z_2 ,
X_2 = σ_2 Z_2 .

In this version, it looks like X_2 is primary and X_1 gets some of its value from X_2. In (25) and (26), it looks like X_1 is primary and X_2 gets some of its value from X_1. These two models are equally valid in the sense that they produce the same observed (X_1, X_2) distribution. It is a good idea to keep this in mind when interpreting regression studies involving X_1 and X_2.

4 Linear Gaussian recurrence relations

The linear Gaussian recurrence relation (1) illustrates the ideas in the previous section. We now know that if V_n is a multivariate normal with mean zero, there is a matrix B so that V_n = BZ_n, where Z_n ∼ N(0, I) is a standard multivariate normal. Therefore, we rewrite (1) as

X_{n+1} = A X_n + B Z_n .   (27)

Since the X_n are Gaussian, we need only describe their means and covariances. This section shows that the means and covariances satisfy recurrence relations derived from (1). The next section explores the distributions of paths. This determines, for example, the joint distribution of X_n and X_m for n ≠ m. These and more general path spaces and path distributions are important throughout the course.


4.1 Probability distribution dynamics


As long as Z_n is independent of X_n, we can calculate recurrence relations for μ_n = E[X_n] and C_n = cov[X_n]. For the mean, we have (you may want to glance back to subsection 3.2)

μ_{n+1} = E[X_{n+1}] = A E[X_n] + B E[Z_n] ,

so, since E[Z_n] = 0,

μ_{n+1} = A μ_n .   (28)

Note that the recurrence relation for the means is the same as the recurrence relation (27) for the random states if you turn off the noise (set Z_n to zero).

For the covariance, it is convenient to combine (27) and (28) into

X_{n+1} − μ_{n+1} = A (X_n − μ_n) + B Z_n .

Then

C_{n+1} = E[ (X_{n+1} − μ_{n+1}) (X_{n+1} − μ_{n+1})^t ]
        = E[ ( A(X_n − μ_n) + BZ_n ) ( A(X_n − μ_n) + BZ_n )^t ] .

We expand the last expectation into a sum of four terms. Two of these are zero, one being

E[ A (X_n − μ_n) (BZ_n)^t ] = 0 ,

because Z_n is independent of X_n and E[Z_n] = 0. This leaves two terms:

C_{n+1} = E[ (A(X_n − μ_n)) (A(X_n − μ_n))^t ] + E[ (BZ_n) (BZ_n)^t ]
        = E[ A (X_n − μ_n)(X_n − μ_n)^t A^t ] + E[ B Z_n Z_n^t B^t ]
        = A E[ (X_n − μ_n)(X_n − μ_n)^t ] A^t + B E[ Z_n Z_n^t ] B^t

C_{n+1} = A C_n A^t + B B^t .   (29)

The last step uses E[Z_n Z_n^t] = C_Z = I.
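The forward equations (28) and (29) are easy to iterate. For a stable A (all eigenvalues inside the unit circle), C_n converges to a steady state covariance satisfying C = A C A^t + B B^t. A sketch with made-up A and B:

```python
import numpy as np

A = np.array([[0.6, 0.2],
              [0.0, 0.5]])    # made-up stable dynamics (eigenvalues 0.6, 0.5)
B = np.array([[1.0],
              [0.3]])         # one noise source driving two components (m = 1)

mu = np.array([1.0, -2.0])    # arbitrary initial mean
C = np.zeros((2, 2))          # start from a deterministic X_0

for n in range(200):
    mu = A @ mu                       # forward equation (28)
    C = A @ C @ A.T + B @ B.T         # forward equation (29)

print(mu)   # decays toward 0 for stable A
print(C)    # converges to the steady state C = A C A^t + B B^t
```

Note that B here is 2 × 1, so BB^t is the 2 × 2 matrix the recurrence needs, as discussed below.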

The recurrence relations (28) and (29) determine the distribution of X_{n+1} in terms of the distribution of X_n. As such, they are the first example in this class of a forward equation.
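As a concrete illustration, here is a small numerical sketch (Python with NumPy; the matrices A and B and the initial distribution are made-up examples, not from the text) that iterates the forward equations (28) and (29) and checks them against a direct Monte Carlo simulation of (27):

```python
# Check the forward equations (28)-(29) against a Monte Carlo simulation of
# the recurrence (27). The matrices A, B and the initial distribution are
# made-up examples, not from the text.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[1.0, 0.0], [0.3, 0.8]])
mu0 = np.array([1.0, -1.0])
n_steps, M = 3, 200_000

# forward equations: mu_{n+1} = A mu_n, C_{n+1} = A C_n A^t + B B^t
mu, C = mu0.copy(), np.eye(2)
for _ in range(n_steps):
    mu = A @ mu
    C = A @ C @ A.T + B @ B.T

# Monte Carlo: M independent realizations of X_{n+1} = A X_n + B Z_n
X = rng.multivariate_normal(mu0, np.eye(2), size=M)   # X_0 ~ N(mu0, I)
for _ in range(n_steps):
    X = X @ A.T + rng.standard_normal((M, 2)) @ B.T

mean_err = np.abs(X.mean(axis=0) - mu).max()
cov_err = np.abs(np.cov(X.T) - C).max()
```

With 200,000 samples the empirical mean and covariance at step n match the recurrence predictions to Monte Carlo accuracy.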

We will see in subsection 4.2 that there are natural examples where the dimension of the noise vector Z_n is less than d, and the noise matrix B is not square. When that happens, we let m denote the number of components of Z_n, which is the number of sources of noise. The noise matrix B is d × m; it has d rows and m columns. The case m > d is not important for applications. The matrices in (29) all are d × d, including B B^t. If you wonder whether it might be B^t B instead, note that B^t B is m × m, which might be the wrong size.


4.2 Higher order recurrence relations, the Markov property

It is common to consider recurrence relations with more than one lag. For example, a k lag relation might take the form

X_{n+1} = A_0 X_n + A_1 X_{n−1} + ⋯ + A_{k−1} X_{n−k+1} + B Z_n .    (30)

From the point of view of X_{n+1}, the k lagged states are X_n (one lag), up to X_{n−k+1} (k lags). It is natural to consider models with multiple lags if X_n represents observable aspects of a large and largely unobservable system. For example, the components of X_n could be public financial data at time n. There is much unavailable private financial data. The lagged values X_{n−j} might give more insight into the complete state at time n than just X_n.

We do not need a new theory of lag k systems. State space expansion puts a multi-lag system of the form (30) into the form of a two term recurrence relation (27). This formulation uses the expanded vectors

X̃_n = ( X_n , X_{n−1} , … , X_{n−k+1} )^t ,

in which the blocks X_n through X_{n−k+1} are stacked into a single column.

If the original states X_n have d components, then the expanded states X̃_n have kd components. The noise vector Z_n does not need expanding because noise vectors have no memory. All the memory in the system is contained in X̃_n. The recurrence relation in the expanded state formulation is

X̃_{n+1} = Ã X̃_n + B̃ Z_n .

In block form, this is

( X_{n+1}   )   ( A_0  A_1  ⋯  A_{k−1} ) ( X_n       )   ( B )
( X_n       )   ( I    0    ⋯  0       ) ( X_{n−1}   )   ( 0 )
( ⋮         ) = ( 0    I       0       ) ( ⋮         ) + ( ⋮ ) Z_n .    (31)
( X_{n−k+2} )   ( 0    ⋯    I  0       ) ( X_{n−k+1} )   ( 0 )

The matrix Ã is the companion matrix of the recurrence relation (30).

We will see in subsection 4.3 that the stability of a recurrence relation (27) is determined by the eigenvalues of A. For the case d = 1, you might know that the stability of the recurrence relation (30) is determined by the roots of the characteristic polynomial p(z) = z^k − A_0 z^{k−1} − ⋯ − A_{k−1}. These statements are consistent because the roots of the characteristic polynomial are the eigenvalues of the companion matrix.
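This consistency is easy to check numerically. The sketch below (made-up scalar coefficients A_0, A_1, A_2, assuming NumPy) builds the d = 1, k = 3 companion matrix and compares its eigenvalues with the roots of the characteristic polynomial:

```python
# For d = 1 and k = 3 lags, the eigenvalues of the companion matrix in (31)
# are the roots of p(z) = z^3 - A0 z^2 - A1 z - A2. The coefficients below
# are made-up numbers.
import numpy as np

A0, A1, A2 = 0.5, 0.2, -0.1
A_comp = np.array([[A0, A1, A2],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
eigs = np.sort_complex(np.linalg.eigvals(A_comp))
roots = np.sort_complex(np.roots([1.0, -A0, -A1, -A2]))
max_diff = np.abs(eigs - roots).max()
```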

If X_n satisfies a k lag recurrence (30), then the covariance matrix C̃_n = cov(X̃_n) satisfies C̃_{n+1} = Ã C̃_n Ã^t + B̃ B̃^t. The simplest way to find the d × d covariance matrix C_n is to find the kd × kd covariance matrix C̃_n and look at the top left d × d block.

The Markov property will be important throughout the course. If the X_n

satisfy the one lag recurrence relation (27), then they have the Markov property.

In this case the Xn form a Markov chain. If they satisfy the k lag recurrence

relation with k > 1 (in a non-trivial way) then the stochastic process Xn does not

have the Markov property. The informal definition is as follows. The process

has the Markov property if Xn is all the information about the past that is

relevant for predicting the future. Said more formally, the distribution of Xn+1

conditional on Xn , . . . , X0 is the same as the distribution of Xn+1 conditional

on Xn alone.

If a random process does not have the Markov property, you can blame that on the state space being too small, so that X_n does not have as much information about the state of the system as it should. In many such cases, a version of state space expansion can create a more complete collection of information at time n. Genuine state space expansion, with k > 1, always gives a noise matrix B̃ with fewer sources of noise than state variables. The number of state variables is kd and the number of noise variables is m ≤ d.

4.3 Large time behavior

Large time behavior is the behavior of X_n as n → ∞. The stochastic process (27) is stable if it settles into a stochastic steady state for large n. The states X_n

can not have a limit, because of the constant influence of random noise. But the probability distributions, u_n(x), with X_n ∼ u_n(x), can have limits. The limit u_∞(x) = lim_{n→∞} u_n(x) is a statistical steady state. The finite time distributions are Gaussian: u_n = N(μ_n, C_n), with μ_n and C_n satisfying the recurrences (28) and (29). The limiting distribution depends on the following limits:

μ_∞ = lim_{n→∞} μ_n    (32)

C_∞ = lim_{n→∞} C_n    (33)

In the following discussion we first ignore several subtleties in linear algebra for the sake of simplicity. Conclusions are correct as initially stated if m = d, B is non-singular, and there are no Jordan blocks in the eigenvalue decomposition of A. We will then re-examine the reasoning to figure out what can happen in exceptional degenerate cases.

The limit (32) depends on the eigenvalues of A. Denote the eigenvalues by λ_j and the corresponding right eigenvectors by r_j, so that A r_j = λ_j r_j for j = 1, …, d. The eigenvalues and eigenvectors do not have to be real even when A is real. The eigenvectors form a basis of C^d, so the means μ_n have unique representations μ_n = Σ_{j=1}^d m_{n,j} r_j.³ The dynamics (28) implies that m_{n+1,j} = λ_j m_{n,j}. If |λ_j| < 1 for all j, we say that A is strongly stable. This implies that m_{n,j} → 0 as n → ∞ for each j. In fact, the convergence is exponential. We see that if A is strongly stable, then μ_n → 0 as n → ∞ independent of the initial mean μ_0. The opposite case is that |λ_j| > 1 for some j. Such an A is strongly unstable. It usually happens that |μ_n| → ∞ as n → ∞ for a strongly unstable A. The limiting distribution u_∞ does not exist for strongly unstable A. The borderline case is |λ_j| ≤ 1 for all j, with at least one j having |λ_j| = 1. This may be called either weakly stable or weakly unstable.

³ Some readers will worry that this statement is not proven with mathematical rigor.

If A is strongly stable, then the limit (33) exists. We do not expect C_n → 0 because the uncertainty in X_n is continually replenished by noise. We start with a direct but possibly unsatisfying proof. A second and more complicated proof follows. The first proof just uses the fact that if A is strongly stable, then

‖A^n‖ ≤ c a^n ,    (35)

for some constant c and positive a < 1. The value of c depends on the matrix norm and is not important for the proof.

We prove that the limit (33) exists by writing C_∞ as a convergent infinite sum. To simplify notation, write R for B B^t. Suppose C_0 is given, then (29) gives C_1 = A C_0 A^t + R. Using (29) again gives

C_2 = A C_1 A^t + R
   = A ( A C_0 A^t + R ) A^t + R
   = A^2 C_0 (A^t)^2 + A R A^t + R
   = A^2 C_0 (A^2)^t + A R A^t + R .

Continuing in this way gives

C_n = A^n C_0 (A^n)^t + A^{n−1} R (A^{n−1})^t + ⋯ + R ,

which is

C_n = A^n C_0 (A^n)^t + Σ_{k=0}^{n−1} A^k R (A^k)^t .    (36)

The limit of the C_n exists because the first term on the right goes to zero as n → ∞ and the second term converges to the infinite sum

C_∞ = Σ_{k=0}^∞ A^k R (A^k)^t .    (37)


ab

For the first term, note that (35) and properties of matrix norms imply that4

n t

A C0 (An ) (can ) kC0 k (can ) = ca2n kC0 k .

does not matter. The right side goes to zero as n ! 1 because a < 1. For the

second term, recall that an infinite sum is the limit of its partial sums if the

infinite sum converges absolutely. Absolute convergence is the convergence of

the sum of the absolute values, or the norms in case of vectors and matrices.

Here the sum of norms is:

1

X

k k t

A R A .

k=0

k k t

A R A c a2k kRk .

You can find C_∞ without summing the infinite series (37). Since the limit (33) exists, you can take the limit on both sides of (29), which gives

C_∞ − A C_∞ A^t = B B^t .    (38)

Subsection 4.4 explains that this is a system of linear equations for the entries of C_∞. The system is solvable, and the solution is positive definite, if A is strongly stable. As a warning, (38) is solvable in most cases even when A is strongly unstable. But in those cases the C you get is not positive definite and therefore is not the covariance matrix of anything. The dynamical equation (29) and the steady state equation (38) are examples of Liapounov equations.
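Here is a hedged numerical sketch (made-up stable A and a 2 × 1 noise matrix B, assuming NumPy) that computes C_∞ both ways: by truncating the series (37), and by solving the linear system (38) after rewriting it with a Kronecker product as (I − A ⊗ A) vec(C) = vec(BB^t):

```python
# Two ways to compute the steady state covariance of (27) for a strongly
# stable A: truncate the series (37), and solve the Liapounov equation (38)
# as a linear system for vec(C). A and B below are made-up examples.
import numpy as np

A = np.array([[0.5, 0.3], [-0.2, 0.4]])   # spectral radius about 0.51
B = np.array([[1.0], [0.5]])              # d = 2 states, m = 1 noise source
R = B @ B.T
d = A.shape[0]

# Series (37): C = sum_k A^k R (A^k)^t, truncated after many terms
C_series = np.zeros((d, d))
Ak = np.eye(d)
for _ in range(200):
    C_series += Ak @ R @ Ak.T
    Ak = Ak @ A

# System (38): with row-major vec, vec(A C A^t) = np.kron(A, A) @ vec(C),
# so (I - kron(A, A)) vec(C) = vec(R)
vecC = np.linalg.solve(np.eye(d * d) - np.kron(A, A), R.reshape(-1))
C_solve = vecC.reshape(d, d)

resid = np.abs(C_series - (A @ C_series @ A.T + R)).max()  # check (38)
diff = np.abs(C_series - C_solve).max()
```

Both computations agree to machine precision, and the result satisfies (38).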

Here are the conclusions: if A is strongly stable, then u_n, the distribution of X_n, has u_n → u_∞ as n → ∞, with a Gaussian limit u_∞ = N(0, C_∞), and C_∞ is given by (37), or by solving (38). If A is not strongly stable, then it is unlikely that the u_n have a limit as n → ∞. It is not altogether impossible in degenerate situations described below. If A is strongly unstable, then it is most likely that ‖μ_n‖ → ∞ as n → ∞. If A is weakly unstable, then probably ‖C_n‖ → ∞ as n → ∞ because the sum (37) diverges.

4.4 Eigenvalues and eigen-matrices

This subsection is a little esoteric. It is (to the author) interesting mathematics that is not strictly necessary to understand the material for this week. Here we find eigenvalues and eigen-matrices for the recurrence relation (29). These are related to the eigenvalues and eigenvectors of A.

4 Part of this expression is similar to the design on Courant Institute tee shirts.


The covariance recurrence relation (29) has the same stability/instability dichotomy. We explain this by reformulating it as more standard linear algebra. Consider first the part that does not involve B, which is

C_{n+1} = A C_n A^t .    (39)

Here, the entries of C_{n+1} are linear functions of the entries of C_n. We describe this more explicitly by collecting all the distinct entries of C_n into a vector ~c_n. There are D = (d+1)d/2 entries in ~c_n because the elements of C_n below the diagonal are equal to the entries above. For example, for d = 3 there are D = 6 distinct entries in C_n, which are C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, and C_{n,33}, which makes ~c_n = (C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, C_{n,33})^t ∈ R^D (= R^6). There is a D × D matrix L so that ~c_{n+1} = L ~c_n. In the case d = 2 with

A = ( α  β )
    ( γ  δ ) ,

the C_n recurrence relation, or dynamical Liapounov equation (29) without the B B^t term, is

( C_{n+1,11}  C_{n+1,12} )     ( C_{n,11}  C_{n,12} )
( C_{n+1,12}  C_{n+1,22} ) = A ( C_{n,12}  C_{n,22} ) A^t .

In terms of the entry vectors, this is

( C_{n+1,11} )   ( α²   2αβ     β²  ) ( C_{n,11} )
( C_{n+1,12} ) = ( αγ   αδ+βγ   βδ  ) ( C_{n,12} ) ,
( C_{n+1,22} )   ( γ²   2γδ     δ²  ) ( C_{n,22} )

so that

L = ( α²   2αβ     β²  )
    ( αγ   αδ+βγ   βδ  ) .
    ( γ²   2γδ     δ²  )

This formulation is not so useful for practical calculations. Its only purpose is to show that (39) is related to a D × D matrix L.

The limiting behavior of C_n depends on the eigenvalues of L. It turns out that these are determined by the eigenvalues of A in a simple way. For each pair (j, k) there is an eigenvalue of L, which we call λ_{jk}, that is equal to λ_j λ_k. To understand this, note that an eigenvector, ~s, of L, with L ~s = λ ~s, corresponds to a symmetric d × d eigen-matrix, S, with

A S A^t = λ S .

It happens that S_{jk} = r_j r_k^t + r_k r_j^t is the eigen-matrix corresponding to eigenvalue λ_{jk} = λ_j λ_k. (To be clear, S_{jk} is a d × d matrix, not the (j, k) entry of a matrix S.) For one thing, it is symmetric (S_{jk}^t = S_{jk}). For another thing:

A S_{jk} A^t = A ( r_j r_k^t + r_k r_j^t ) A^t
            = A r_j r_k^t A^t + A r_k r_j^t A^t
            = (A r_j)(A r_k)^t + (A r_k)(A r_j)^t
            = (λ_j r_j)(λ_k r_k)^t + (λ_k r_k)(λ_j r_j)^t
            = λ_j λ_k ( r_j r_k^t + r_k r_j^t )
            = λ_{jk} S_{jk} .

One can check that every symmetric eigen-matrix of (39) takes the form of S_{jk} for some j ≥ k. The number of such pairs is the same D, which is the number of independent entries in a general symmetric matrix. We do not count S_{jk} with j < k because S_{jk} = S_{kj}.
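The eigen-matrix statement can be checked numerically. The sketch below (a made-up 2 × 2 matrix A, assuming NumPy) verifies that S_{jk} = r_j r_k^t + r_k r_j^t satisfies A S A^t = λ_j λ_k S, and that the products λ_j λ_k are exactly the eigenvalues of the Kronecker matrix A ⊗ A, which represents the map C ↦ A C A^t acting on vec(C):

```python
# Check that S_{jk} = r_j r_k^t + r_k r_j^t satisfies A S A^t = l_j l_k S,
# and that the products l_j l_k are the eigenvalues of np.kron(A, A), which
# represents C -> A C A^t on vec(C). A is a made-up 2 x 2 example.
import numpy as np

A = np.array([[0.9, 0.4], [0.1, 0.6]])   # eigenvalues 1.0 and 0.5
lam, Rvec = np.linalg.eig(A)             # columns of Rvec are r_j

j, k = 0, 1
S = np.outer(Rvec[:, j], Rvec[:, k]) + np.outer(Rvec[:, k], Rvec[:, j])
eig_err = np.abs(A @ S @ A.T - lam[j] * lam[k] * S).max()

prods = np.sort(np.real([li * lj for li in lam for lj in lam]))
kron_eigs = np.sort(np.real(np.linalg.eigvals(np.kron(A, A))))
kron_err = np.abs(prods - kron_eigs).max()
```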

Now suppose A is strongly stable. Then the Liapounov dynamical equation (29) is equivalent to

~c_{n+1} = L ~c_n + ~r ,

where ~r holds the entries of B B^t. Since all the eigenvalues of L are less than one in magnitude, a little reasoning with linear algebra shows that ~c_n → ~c_∞ as n → ∞, and that ~c_∞ − L ~c_∞ = (I − L) ~c_∞ = ~r. The matrix I − L is invertible because L has no eigenvalue equal to 1. This is a different proof that the steady state Liapounov equation (38) has a unique solution. It is likely that L has no eigenvalue equal to 1 even if A is not strongly stable. In this case (38) has a solution, which is a symmetric matrix C. But there is no guarantee that this C is positive definite, so it does not represent a covariance matrix.

4.5 Degenerate and borderline cases

The simple conclusions of subsections 4.3 and 4.4 do not hold in every case. The reasoning there assumed things about the matrices A and B that you might think are true in almost every interesting case. But it is important to understand how things might be more complicated in borderline and degenerate cases. For one thing, many important special cases are such borderline cases. Many more systems have behavior that is strongly influenced by near degeneracy. A process that is weakly but not strongly unstable is the simple Gaussian random walk, which is a model of Brownian motion. A covariance that is nearly singular is the covariance matrix of asset returns of the S&P 500 stocks. This is a matrix of rank 500 that is pretty well approximated for many purposes by a matrix of rank 10.

4.5.1 Rank of B

The matrix B need not be square or have rank d.


5 Paths and path space

There are questions about the process (1) that depend on X_n for more than one n. For example, what is Pr( ‖X_n‖ ≤ 1 for 1 ≤ n ≤ 10 )? The probability distribution on path space answers such questions. For linear Gaussian processes, the distribution in path space is Gaussian. This is not surprising. This subsection goes through the elementary mechanics of Gaussian path space. We also describe more general path space terminology that carries over to other kinds of Markov processes.

Two relevant probability spaces are the state space and the path space. We let S denote the state space. This is the set of all possible values of the state at time n. This week, the state is a d component vector and S = R^d. The path space is called Ω. This week, Ω is the set of sequences of states with a given starting and ending time. That is, X_{[n_1:n_2]} ∈ Ω is a sequence (X_{n_1}, X_{n_1+1}, …, X_{n_2}). There are n_2 − n_1 + 1 states in the sequence, so Ω = R^{(n_2−n_1+1)d}. Even if the state space is not R^d, still a path is a sequence of states. We express this by writing Ω = S^{n_2−n_1+1}. The path space depends on n_1 and n_2 (only the difference, really), but we leave that out of the notation because it is usually clear from the discussion.

6 Exercises

1. This exercise works through conditional distributions of multivariate normals in a sequence of steps. The themes (for the list of facts about Gaussians) are the role of linear algebra and the relation to linear regression. Suppose X and Y have d_X and d_Y components respectively. Let u(x, y) be the joint density. Then the conditional density of Y conditional on X = x is u( y | X = x) = c(x) u(x, y). This says that the conditional distribution is the same, up to a normalization constant, as the joint distribution once you fix the variable whose value is known (x in this case). The normalization constant is determined by the requirement that the conditional distribution has total probability equal to 1:

c(x) = 1 / ∫ u(x, y) dy .

For Gaussian random variables, finding c(x) is usually both easy and unnecessary.

(a) This part works out the simplest case. Take d = 2, and X = (X_1, X_2)^t. Suppose X ∼ N(0, H^{−1}). Fix the value of X_1 = x_1 and calculate the distribution of the one dimensional random variable X_2. If H is

H = ( h_{11}  h_{12} )
    ( h_{12}  h_{22} ) ,

then the joint density is

u(x_1, x_2) = c exp( −( h_{11} x_1² + 2 h_{12} x_1 x_2 + h_{22} x_2² ) / 2 ) ,

and the conditional density is

u( x_2 | x_1 ) = c(x_1) exp( −( 2 h_{12} x_1 x_2 + h_{22} x_2² ) / 2 ) .

Why is it allowed to leave the term h_{11} x_1² out of the exponent? Complete the square to write this in the form

u( x_2 | x_1 ) = c(x_1) exp( −( x_2 − μ(x_1) )² / (2 σ_2²) ) .

Find formulas for the conditional mean, μ(x_1), and the conditional variance, σ_2².


Week 10

Change of measure, Girsanov formula

Jonathan Goodman

November 26, 2012

1 Introduction

In Week 9 we made a distinction between simulation and Monte Carlo. The

difference is that in Monte Carlo you are computing a number, A, that is not random. It is likely that there is more than one formula for A. There may be more than one way to express A as the expected value of a random variable. Suppose

A = E[ F(X) ] ,

where X has probability density u(x). Suppose v(x) is another probability density so that

L(x) = u(x) / v(x)    (1)

is well defined. Then

A = ∫ F(x) u(x) dx = ∫ F(x) ( u(x) / v(x) ) v(x) dx = E_v[ F(X) L(X) ] .    (2)

This means that there are two distinct ways to evaluate A: (i) take samples X ∼ u and evaluate F, or (ii) take samples X ∼ v and evaluate F L.

Importance sampling means using the change of measure formula (2) for Monte Carlo. The expected value E_u[ F(X) ] means integrate with respect to the probability measure u(x)dx. Using the measure v(x)dx instead represents a change of measure. The answer A does not change as long as you put the likelihood ratio into the second integral, as in the identity (2).

There are many uses of importance sampling in Monte Carlo and applied probability. One use is variance reduction. The variance of the u-estimator is

var_u( F(X) ) = E_u[ F(X)² ] − A² .

The variance of the v-estimator is

var_v( F(X) L(X) ) = E_v[ ( F(X) L(X) )² ] − A² .

The hope is to choose v, and with it the likelihood ratio, so that the variance of the v-estimator is smaller. That would mean that the variation of F(x)L(x) is smaller than the variation of F(x), at least in regions that count. A good probability density v is one that puts more of the probability in regions that are important for the integral, hence the term importance sampling.

Rare event simulation offers especially dramatic variance reductions. This is when A = P_u(X ∈ B) (which is the same as P_u(B)). The event B is rare when the probability is small. Applications call for evaluating probabilities ranging from 1% to 10⁻⁶ or smaller. A good change of measure is one that puts its weight on the most likely parts of B. Consider the one dimensional example where u = N(0, 1) and B = {x > b}. If b is large, P_{0,1}(X > b) is very small. But most of the samples with X > b are only a little larger than b. The measure v = N(b, 1) is a simple way to put most of the weight near b. The likelihood ratio is

L(x) = u(x) / v(x) = e^{−x²/2} / e^{−(x−b)²/2} = e^{b²/2} e^{−bx} .

When x = b, the likelihood ratio is L(b) = e^{−b²/2}, which is very small when b is large. This is the largest value of L(x) when x ≥ b.

The probability measures u(x)dx and v(x)dx give two ways to estimate P_{0,1}(X > b). The u-method is to draw N independent samples X_k ∼ N(0, 1) and count the number of those with X_k > b. Most of the samples are wasted in the sense that they are not counted. The v-method is to draw N independent samples X_k ∼ N(b, 1). The estimator is

P_{0,1}(X > b) = ∫_b^∞ L(x) v(x) dx ≈ (1/N) Σ_{X_k > b} L(X_k) .

Now about half of the samples are counted. But they are counted with a small weight L(X_k) ≤ e^{−b²/2}. A hit is a sample X_k > b. A lot of small weight hits give a lower variance estimator than a few large weight hits.
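A quick numerical sketch of this comparison (the values of b and the sample size are made-up choices, assuming NumPy):

```python
# The u-method and the v-method for P(X > b), X ~ N(0,1), with v = N(b,1)
# and likelihood ratio L(x) = exp(b^2/2 - b x). The values of b and the
# sample size N are made-up choices for illustration.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
b, N = 3.0, 400_000
exact = 0.5 * (1.0 - erf(b / sqrt(2.0)))         # P(X > b) = 1 - Phi(b)

Xu = rng.standard_normal(N)                      # u-method: most samples miss
est_u = np.mean(Xu > b)

Xv = b + rng.standard_normal(N)                  # v-method: X ~ N(b, 1)
est_v = np.mean(np.exp(b * b / 2.0 - b * Xv) * (Xv > b))

rel_err_v = abs(est_v - exact) / exact
```

With the same number of samples, the v-estimator has far smaller relative error than the u-estimator, which relies on the few samples that happen to exceed b.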

The Girsanov theorem describes change of measure for diffusion processes. Probability distributions, or probability measures, on path space do not have probability densities. In some cases the likelihood ratio L(x) can be well defined even when the probability densities u(x) and v(x) are not. If x is a path, the likelihood ratio L(x) is a path function that makes (2) true for well behaved functions F. Roughly speaking, a change of measure can change the drift of a diffusion process but not the noise. The Girsanov formula is the formula for the L that does the change.

Girsanov's theorem has two parts. One part says when two diffusion processes may be related by a change of measure. If they can be, the two probability measures are absolutely continuous with respect to each other, or equivalent. If two probability measures are not equivalent in this sense, then at least one of them has a component that is singular with respect to the other. The other part of Girsanov's theorem is a formula for L(x) in cases in which it exists. This makes the theorem useful in practice. We may compute hitting probabilities or expected payouts using any diffusion that is equivalent to the one we are interested in.

2 Probability measures

A probability measure is a function that gives the probability of any event in an appropriate class of events. If B is such an event, then P(B) is this probability. By class of appropriate events, we mean a σ-algebra. A probability function must be countably additive, which means that if B_n is a sequence of events with B_n ⊆ B_{n+1} (an expanding family of events), then

lim_{n→∞} P(B_n) = P( ∪_{n=1}^∞ B_n ) .    (3)

This formula says that the probability of a set is in some sense a continuous function of the set. The infinite union on the right really is the limit of the sets B_n. Another way to write this is to suppose C_k is any sequence of appropriate events, and define

B_n = ∪_{k=1}^n C_k .

Then

lim_{n→∞} P( ∪_{k=1}^n C_k ) = P( ∪_{k=1}^∞ C_k ) .

Every proof of every theorem in probability theory makes use of countable additivity of probability measures. We do not mention this property very often in this course, which is a signal that we are not giving full proofs.

A probability density defines a probability measure. If the probability space is Ω = R^n and u(x) is a probability density for an n component random variable (x_1, …, x_n), then

P_u(B) = ∫_B u(x) dx

is the corresponding probability measure. If B is a small neighborhood of a specific outcome x, then we write its probability as P_u(B) = dP = u(x)dx.


More generally, if F(x) is a function of the random variable x, then

E_u[ F(X) ] = ∫ F(x) dP(x) .    (4)

This is the same as ∫ F(x)u(x)dx when there is a probability density. But the expression (4) makes sense even when P is a more general probability measure. A simple definition involves a ΔF = 2^{−m} rather than a Δx. Define the events B_k^{(ΔF)} as

B_k^{(ΔF)} = { x | kΔF ≤ F(x) < (k+1)ΔF } .    (5)

To picture these sets, suppose x is a one dimensional random variable and consider the graph of a function F(x). Divide the vertical axis into equal intervals of size ΔF, with a horizontal line for each breakpoint kΔF. The set B_k^{(ΔF)} is the part of the x-axis where the graph of F lies in the horizontal stripe between kΔF and (k+1)ΔF. This set could consist of several intervals (for example, two intervals if F is quadratic) or something more complicated if F is a complicated function. If Ω is an abstract probability space, then the sets B_k^{(ΔF)} are abstract events in that space. By definition, the function F is measurable with respect to the σ-algebra F if B_k^{(ΔF)} ∈ F for each k and ΔF.

The probability integral (4) is defined as a limit of approximations, just as the Riemann integral and Ito integral are. The approximation in this case is motivated by the observation that if x ∈ B_k^{(ΔF)}, then |F(x) − kΔF| ≤ ΔF. Therefore, if the dP integral were to make sense, we would have

| ∫_{B_k^{(ΔF)}} F(x) dP(x) − ∫_{B_k^{(ΔF)}} kΔF dP(x) | = | ∫_{B_k^{(ΔF)}} F(x) dP(x) − kΔF P( B_k^{(ΔF)} ) |
                                                         ≤ ΔF P( B_k^{(ΔF)} ) .

The whole integral is the sum of the contributions on each horizontal slice:

∫ F(x) dP(x) = Σ_k ∫_{B_k^{(ΔF)}} F(x) dP(x) ≈ Σ_k kΔF P( B_k^{(ΔF)} ) .

This motivates the definitions

I_m = Σ_k kΔF P( B_k^{(ΔF)} ) , with ΔF = 2^{−m} ,    (6)

and

∫ F(x) dP(x) = lim_{m→∞} I_m .    (7)


The numbers I_m are a Cauchy sequence because if n > m then

|I_m − I_n| ≤ Σ_k ΔF P( B_k^{(ΔF)} ) = 2^{−m} Σ_k P( B_k^{(ΔF)} ) = 2^{−m} .

The expected value is the same thing as the probability integral:

E_P[ F(X) ] = ∫ F(x) dP(x) .

There is another way to express this definition, which will be used in the next subsection. The indicator function of an event B is 1_B(x) = 1 if x ∈ B and 1_B(x) = 0 if x ∉ B. A simple function is a finite linear combination of indicator functions. We say F(x) is a simple function if there are events B_1, …, B_n and weights F_1, …, F_n so that

F(x) = Σ_{k=1}^n F_k 1_{B_k}(x) .

The integral of a simple function is

∫ F(x) dP(x) = Σ_{k=1}^n F_k P(B_k) .    (8)

This has to be the definition of the integral of a simple function if the integral is linear and if

∫ 1_B(x) dP(x) = ∫_B dP(x) = P(B) .

Once you know what the integral should be for simple functions, you know what it should be for any function that can be approximated by simple functions. If F is a bounded function, then

F^{(ΔF)}(x) = Σ_{kΔF < F_max} kΔF 1_{B_k^{(ΔF)}}(x)

satisfies | F^{(ΔF)}(x) − F(x) | ≤ ΔF for all x. Therefore, if the concept of integration makes sense at all, the following should be true:

∫ F(x) dP(x) = lim_{ΔF→0} ∫ F^{(ΔF)}(x) dP(x)
             = lim_{ΔF→0} Σ_{kΔF < F_max} kΔF ∫ 1_{B_k^{(ΔF)}}(x) dP(x)

∫ F(x) dP(x) = lim_{ΔF→0} Σ_{kΔF < F_max} kΔF P( B_k^{(ΔF)} ) .    (9)

The point of this is that if the integral of indicator functions is defined, then all other integrals are automatically defined.
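The level-set construction (6)-(9) can be seen in a few lines. The sketch below (a made-up discrete measure P on five points, plain Python) checks that the slice sum I_m is within ΔF = 2^{−m} of the exact integral:

```python
# The slice sums (6): for a discrete probability measure P on a few points,
# I_m = sum_k k*dF*P(B_k) with dF = 2^(-m) is within dF of the exact
# integral of F. The points, weights, and F are made-up examples.
import math

points = [0.1, 0.4, 0.7, 1.3, 2.0]
probs = [0.1, 0.3, 0.2, 0.25, 0.15]      # probabilities, sum to 1
F = lambda x: x * x                      # a bounded nonnegative function

exact = sum(F(x) * p for x, p in zip(points, probs))

def level_set_sum(m):
    dF = 2.0 ** (-m)
    # each point x lies in B_k with k = floor(F(x)/dF); the slice sum
    # therefore weights each point by k*dF instead of F(x)
    return sum(math.floor(F(x) / dF) * dF * p for x, p in zip(points, probs))

err = abs(level_set_sum(10) - exact)
```

Increasing m halves ΔF and tightens the approximation, which is the content of the Cauchy sequence argument above.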

A fully rigorous treatment would stop here to discuss a large number of technical details. The sum that defines the approximation (6) converges if F is bounded. If F is not bounded, we can approximate F by a bounded function and try to take the limit. The most important integration theorem is the dominated convergence theorem, which gives a condition under which pointwise convergence F_n(x) → F(x) implies convergence of the integrals

∫ F_n(x) dP(x) → ∫ F(x) dP(x) .    (10)

If M_F(x) = sup_n |F_n(x)| satisfies

∫ M_F(x) dP(x) < ∞ ,    (11)

then (10) is true. We do not give the proof in this course. A simple way to show that (11) is satisfied is to come up with a function G(x) so that |F_n(x)| ≤ G(x) for all x and all n, and ∫ G(x) dP(x) < ∞. A function G like this is a dominating function.

If P is a probability measure then a function L(x) can define a new measure through the informal relation

dQ(x) = L(x) dP(x) ,    (12)

which means

Q(B) = ∫_B L(x) dP(x) .    (13)

For Q to be a probability measure, L must have two properties. First, L(x) ≥ 0 almost surely (with respect to P). Second,

∫ L(x) dP(x) = 1 .    (14)

If the probability measure P is defined and if L has these two properties, then (13) defines another probability measure Q.

The informal relation (12) leads to a relationship between expectation values in the P and Q measures:

E_Q[ F(X) ] = E_P[ F(X) L(X) ] .    (15)

This becomes clearer when you use (12) to derive the equivalent probability integral equation

∫ F(x) dQ(x) = ∫ F(x) L(x) dP(x) .    (16)

You can prove this formula as suggested at the end of the previous subsection. You check that it is true for simple functions. Then it is true for any other function, because any function can be well approximated by simple functions. Moreover, it is true for simple functions if it is true for indicator functions. But if F is an indicator function F(x) = 1_B(x), then (16) is exactly the same as (13).

For probability measures defined by densities, the definition (12) is the same as the original definition (1). If dP(x) = u(x)dx and dQ(x) = v(x)dx, then dQ(x) = ( v(x) / u(x) ) dP(x).

You can ask the reverse question: given probability measures dP and dQ, is there a function L(x) so that dQ(x) = L(x)dP(x)? There is an obvious necessary condition, which is that any event that is impossible under P is also impossible under Q. If B is an event with P(B) = 0, then the definition (13) gives Q(B) = 0. You can see this by writing

Q(B) = ∫_B L(x) dP(x) = ∫ 1_B(x) L(x) dP(x) .

If F(x) = 1_B(x) L(x) and the B_k^{(ΔF)} are the level sets (5) for this function, then B_k^{(ΔF)} ⊆ B for k ≠ 0, which implies that P( B_k^{(ΔF)} ) ≤ P(B) = 0. Therefore, all the approximations to ∫ 1_B(x) L(x) dP(x) are zero. The Radon Nikodym theorem states that this necessary condition is sufficient. If P and Q are any two probability measures with the same σ-algebra F, and if P(B) = 0 implies that Q(B) = 0, then there is a function L(x) that gives Q from P via (13). This function is called the Radon Nikodym derivative of Q with respect to P, and is written

L(x) = dQ(x) / dP(x) .

If the condition P(B) = 0 ⟹ Q(B) = 0 is satisfied, we say that Q is absolutely

continuous with respect to P . This term (absolutely continuous) is equivalent

in a special case to something that really could be called absolute continuity.

But now the term is applied in this more general context. If P is absolutely

continuous with respect to Q and Q is absolutely continuous with respect to P ,

then the two measures are equivalent to each other.

If Q is not absolutely continuous with respect to P , then there is an event B

that has positive probability in the Q sense but probability zero in the P sense.


When this happens, it is usual that all of Q has zero probability in the P sense.

We say that Q is completely singular with respect to P if there is an event B

with Q(B) = 1 and P (B) = 0. If Q is completely singular with respect to P ,

then P is completely singular with respect to Q, because the event C = B^c has P(C) = 1 but Q(C) = 0. We write P ⊥ Q when P and Q are completely singular with respect to each other.

Here is a statistical interpretation of absolute continuity and singularity. Suppose X is a sample from P or Q, and you want to guess whether X ∼ P or X ∼ Q. Of course you could guess. But if your answer is a function of X alone (and not another coin toss), then there is some set B so that if X ∈ B you say Q, and otherwise you say P. A type I error would be saying Q when the answer is P, and a type II error is saying P when the answer is Q.¹ The confidence of your procedure is 1 − P(B), which is the probability of accepting the null hypothesis P if P is true. The power of your test is Q(B), which is the probability of rejecting the null hypothesis when the null hypothesis is false. If P ⊥ Q, then there is a test with 100% confidence and 100% power. You say X ∼ Q if X ∈ B and you say X ∼ P otherwise. Conversely, if there is a test with 100% confidence and 100% power, then P ⊥ Q.

A statistical test, or, equivalently, a set B, is efficient if there is no way to increase the confidence of the test without decreasing its power. Equivalently, B is efficient if you cannot increase its power without decreasing its confidence. The Neyman Pearson lemma says that if B is efficient, then there is an L_0 so that B = {x | L(x) > L_0}. Since L = dQ/dP, the Q probability is larger than the P probability when L is large.

2.3 Examples

Suppose X ∈ R^n is an n component random variable. Suppose P makes X a multivariate normal with mean zero and covariance matrix C. If H = C^{−1}, then the probability density for P is

u(x) = c e^{ −x^t H x / 2 } .

(The normalization constant c is not the covariance matrix C.) We want L(x) = c e^{y^t x} to be a likelihood ratio. What should c be, and what is the mean and covariance of the resulting distribution? We find the answer by completing the square:

e^{ −(x−μ)^t H (x−μ) / 2 } = e^{ −x^t H x / 2 } e^{ μ^t H x } e^{ −μ^t H μ / 2 } .

If μ^t H = y^t, then H μ = y because H is symmetric, so μ = C y and e^{ −μ^t H μ / 2 } = e^{ −y^t C y / 2 }. Therefore

L(x) = e^{ y^t x } e^{ −y^t C y / 2 }    (17)

is a likelihood ratio, and the Q distribution has the same covariance matrix but mean μ = C y.
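A Monte Carlo sanity check of (17) (the matrix C and vector y below are made-up examples, assuming NumPy): the weights L(X) should average to 1, and the L-weighted sample mean of X should be close to μ = Cy:

```python
# Monte Carlo check of (17): for X ~ N(0, C) under P, the weights
# L(X) = exp(y^t X - y^t C y / 2) average to 1, and the weighted mean of X
# is close to mu = C y. The matrix C and vector y are made-up examples.
import numpy as np

rng = np.random.default_rng(2)
C = np.array([[1.0, 0.4], [0.4, 0.8]])
y = np.array([0.5, -0.3])
M = 400_000

X = rng.multivariate_normal(np.zeros(2), C, size=M)
Lw = np.exp(X @ y - 0.5 * (y @ C @ y))

norm_err = abs(Lw.mean() - 1.0)                      # E_P[L] should be 1
mean_err = np.abs((Lw[:, None] * X).mean(axis=0) - C @ y).max()
```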

1 In statistics, P would be the null hypothesis and Q the alternate. Rejecting the null

hypothesis when it is true is a type I error. Accepting the null hypothesis when it is false is

a type II error. Conservative statisticians regard type I errors as worse than type II.


Suppose X is a one dimensional random variable, the P distribution is uniform on [0, 1] and the Q distribution is N(0, 1). Then P is absolutely continuous with respect to Q but Q is not absolutely continuous with respect to P.

Any two Gaussian distributions in the same dimension are equivalent.

Suppose X ∈ R^n, and P = N(0, I). Suppose Q is the probability distribution formed by taking X = Y / |Y|, where Y ∼ N(0, I) and |Y| = (Y_1² + ⋯ + Y_n²)^{1/2} is the Euclidean length. Then X is a unit vector that is uniformly distributed on the unit sphere in n dimensions. That sphere is called S_{n−1}, because it is an n − 1 dimensional surface. For example, S_2 is the two dimensional unit sphere in three dimensions. The set B = S_{n−1} has Q(B) = 1 and P(B) = 0.

3 The Girsanov formula

The simplest version of Girsanov's formula is a formula for the L(x) that changes standard Brownian motion into one with drift a_t. This L relates P, which is the distribution of standard Brownian motion on [0, T], to Q, the distribution of Brownian motion with drift. That is

P : dX_t = dW_t ,    (18)

Q : dX_t = a_t dt + dW_t .    (19)

The formula is

L(x) = e^{ −∫_0^T a_t² dt / 2 } e^{ ∫_0^T a_t dX_t } .    (20)

The integrals that enter into L are well defined. In fact, we showed that they

are defined almost surely. If P and Q are absolutely continuous with respect to

each other (are equivalent), then almost surely with respect to P is the same

as almost surely with respect to Q. On the other hand, likely with respect to

P does not mean likely with respect to Q, as our importance sampling example

shows. eq:G1

There are several ways to derive Girsanov's formula (20). Here is one way
that is less slick but more straightforward. Choose a Δt = T 2^{-m} and let
the observations of X at the times t_k = kΔt be assembled into a vector
\vec{X} = (X_1, X_2, ..., X_{2^m}). We are writing X_k for X_{t_k}, as we
have sometimes done before. We write an exact formula for the joint PDF of
\vec{X} under P, and an approximate formula for the joint density under Q.
The ratio of these has a well defined limit as Δt → 0, which turns out to
be (20).

Let u(\vec{x}) be the density of \vec{X} under P. We find a formula for u by
thinking of a single time step that goes from x_k to x_{k+1}. Conditional on
X_k, the increment X_{k+1} - X_k is normal with mean zero and variance Δt.
This makes (in a hopefully clear notation)

    u_{k+1}(x_1, ..., x_{k+1}) = u_k(x_1, ..., x_k) e^{-(x_{k+1}-x_k)^2/(2Δt)} / √(2πΔt) .

Therefore, with n = 2^m,

    u(\vec{x}) = u(x_1, ..., x_n) = c exp( - Σ_{k=0}^{n-1} (x_{k+1}-x_k)^2 / (2Δt) ) .   (21)

Let v(\vec{x}) be the density of \vec{X} under Q. In the Q process,
conditional on X_k, X_{k+1} is approximately normal with variance Δt and
mean X_k + a_{t_k} Δt. Therefore,

    v_{k+1}(x_1, ..., x_{k+1}) ≈ v_k(x_1, ..., x_k) e^{-(x_{k+1}-x_k-a_{t_k}Δt)^2/(2Δt)} / √(2πΔt) .

This leads to

    v(\vec{x}) = v(x_1, ..., x_n) = c exp( - Σ_{k=0}^{n-1} (x_{k+1}-x_k-a_{t_k}Δt)^2 / (2Δt) )
               = c exp( - Σ_{k=0}^{n-1} (x_{k+1}-x_k)^2 / (2Δt) )
                   exp( Σ_{k=0}^{n-1} (x_{k+1}-x_k) a_{t_k} )
                   exp( - (Δt/2) Σ_{k=0}^{n-1} a_{t_k}^2 ) .            (22)

Now take the quotient L = v/u. The first exponential on the right of (22)
cancels. The second is an approximation of an Ito integral:

    Σ_{k=0}^{n-1} (x_{k+1}-x_k) a_{t_k} → ∫_0^T a_t dX_t   as Δt → 0 ,

and the third exponent is a Riemann sum:

    Δt Σ_{k=0}^{n-1} a_{t_k}^2 → ∫_0^T a_t^2 dt .

Therefore

    lim_{Δt→0} v(\vec{x}) / u(\vec{x}) = exp( ∫_0^T a_t dX_t ) exp( - (1/2) ∫_0^T a_t^2 dt ) .

This is the formula (20).

Note the similarity of the path space change of measure formula (20) to the
simple Gaussian change of measure formula (17). In both cases the first
exponential has an exponent that is linear in x. The second exponential has a
quadratic exponent that normalizes L. In (17), the first exponent makes Q
larger in the direction of y, which is the direction in which y^t x grows the
fastest. For example, if a_t > 0, then (19) has a drift to the right, which
is the direction in which L, given by (20), is large.
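Here is a hedged numerical sketch of (20) for the simplest case of a constant drift a_t = a, where L reduces to exp(a X_T - a^2 T / 2). Weighting standard Brownian motion paths by L should reproduce expectations under Q, for example E_Q[X_T] = aT. The parameter values below are arbitrary:

```python
import numpy as np

# Girsanov weighting, sketched for constant drift a: under P the
# increments are those of standard Brownian motion; the weight
# L = exp(a X_T - a^2 T / 2) is the constant-a case of (20).
# Weighted averages approximate expectations under Q, e.g. E_Q[X_T] = a T.
rng = np.random.default_rng(2)
a, T, m, npaths = 0.7, 1.0, 50, 50_000
dt = T / m

dX = np.sqrt(dt) * rng.standard_normal((npaths, m))   # P increments
XT = dX.sum(axis=1)
L = np.exp(a * XT - 0.5 * a**2 * T)                   # since a_t = a

print("E_P[X_T]    ~", XT.mean())          # ~ 0
print("E_P[L X_T]  ~", (L * XT).mean())    # ~ a T
print("E_P[L]      ~", L.mean())           # ~ 1
```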


The more general Girsanov theorem concerns two stochastic processes

    P :  dX_t = b_1(X_t) dW_t                                          (23)
    Q :  dX_t = a(X_t) dt + b_2(X_t) dW_t .                            (24)

One part of Girsanov's theorem is that P and Q are singular with respect to
each other unless b_1 = b_2. The proof of this uses the quadratic variation
formula from an earlier week. If dX_t = a(X_t) dt + b(X_t) dW_t, then

    [X]_t = ∫_0^t b(X_s)^2 ds ,                                        (25)

where the quadratic variation is defined by

    [X]_t = lim_{Δt→0} Σ_{t_k < t} (X_{k+1} - X_k)^2 .                 (26)

We showed that the limit exists for a given path X_t almost surely. If you
have a path, you can evaluate the quadratic variation by taking the limit
(26). If this gives b_1, then X_t came from the P process (23). If it gives
b_2, then X_t came from the Q process (24). Since these are the only choices,
you can tell with 100% confidence whether a given path is from P or Q.

This fact has important implications for finance, particularly medium
frequency trading. If you have, say, daily return data or ten minute return
data, you have a pretty good idea what the volatility is. The higher the
frequency of the measurements, the better your estimate of instantaneous
volatility. But taking more frequent measurements does not estimate the
drift, the expected return, more accurately. With high frequency data, the
variance of asset prices, or the covariances of pairs of prices, is more
accurately known than the expected returns.
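A small experiment makes the point. For a path of dX = a dt + b dW, the sum of squared increments in (26) recovers b^2 T, and refining Δt sharpens that estimate, while the drift contributes only O(Δt) and stays invisible. The numbers below are arbitrary choices:

```python
import numpy as np

# Quadratic variation (25)-(26) along one simulated path of
# dX = a dt + b dW: the sum of squared increments ~ b^2 T, whatever a is.
rng = np.random.default_rng(3)
a, b, T = 1.5, 0.4, 1.0

for m in (100, 10_000):
    dt = T / m
    dX = a * dt + b * np.sqrt(dt) * rng.standard_normal(m)
    qv = np.sum(dX**2)                 # [X]_T, approximately b^2 T
    print(m, qv, "target:", b**2 * T)
```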

The final part of Girsanov's theorem is a formula for the likelihood ratio
between the measures

    P :  dX_t = b(X_t) dW_t                                            (27)
    Q :  dX_t = a(X_t) dt + b(X_t) dW_t .                              (28)

One could derive it from approximate formulas for the probability density in
a way similar to how we got (20). But people prefer the following argument.
The change of measure

    L(w) = e^{∫_0^T λ_t dW_t} e^{-∫_0^T λ_t^2 dt / 2}

turns the process dW_t into the process dW'_t = λ_t dt + dW_t. Therefore, if
we use this weight function on the process (27), we get

    dX_t = b(X_t) ( λ_t dt + dW_t ) .

This becomes the process (28) if

    λ_t = a_t / b_t ,   that is,   λ_t = a(X_t) / b(X_t) .

Therefore, the likelihood ratio between (27) and (28) is

    L = e^{∫_0^T (a_t / b_t) dW_t} e^{-∫_0^T (a_t^2 / b_t^2) dt / 2} .


Week 11

Backwards again, Feynman Kac, etc.

Jonathan Goodman

November 26, 2012


This week has more about the relationship between SDE and PDE. We discuss

ways to formulate the solution of a PDE in terms of an SDE and how to calculate

things about an SDE using a PDE. Informally, activity of this kind is called

Feynman Kac in certain circles, and Fokker Planck in other circles. Neither

name is accurate historically, but this is not a history class.

One topic is the full forward equation. We have done pieces of it, but now we
do it for general diffusions. We derive the forward equation from the
backward equation using a duality argument.

Next we discuss backward equations for multiplicative functionals of a
stochastic process. If

    f(x, t) = E_{x,t}[ e^{∫_t^T V(X_s) ds} ] ,                          (1)

and

    dX_t = a(X_t) dt + b(X_t) dW_t ,                                    (2)

then

    0 = ∂_t f + (1/2) Σ_{i=1}^n Σ_{j=1}^n (b(x) b^t(x))_{ij} ∂_{x_i} ∂_{x_j} f
          + Σ_{i=1}^n a_i(x) ∂_{x_i} f + V(x) f .                       (3)

One of the differences between this and Girsanov's formula from last week is
that here the exponent does not have an Ito integral. The relationship
between (1) and (3) goes both ways. You can learn about the expectation (1)
by solving a PDE. Numerical PDE methods are generally more accurate than
direct Monte Carlo evaluation of (1), that is, if X does not have more than a
few components. In the other direction, you can use Monte Carlo on (1) to
estimate the solution of the PDE (3). This can be useful if the dimension of
the PDE is larger than, say, 4 or 5.

We discuss a general principle often called splitting. This says that if
there are two or more parts of the dynamics, then you find the differential
equation describing the dynamics by adding terms corresponding to each part
of the dynamics. The PDE (3) illustrates this admittedly vague principle.
The quantity f is determined by three factors (a vague term, not related to,
say, factor analysis in statistics): diffusion, advection, and the
multiplicative functional. The dynamical equation (3) has one term for each
factor. The second term, (1/2) Σ Σ (b b^t)_{ij} ∂_{x_i} ∂_{x_j} f,
corresponds to diffusion, dX = b(X) dW. The third term corresponds to
advection, dX = a(X) dt. The last corresponds to multiplication over a dt
time interval by e^{V(X_t) dt}. Splitting applies already to the SDE (2).
The right hand side has one term corresponding to advection, a dt, and
another corresponding to diffusion, b dW.

This section reviews much that we covered earlier in the course, much earlier
in some cases. It serves as review and puts the new material into context.

Let X_t be a Markov process of some kind. It could be a discrete time Markov
chain, or a diffusion process, or a jump process, whatever. Let S be the
state space. For each t, X_t ∈ S. Take times t < T and define the value
function f(x, t) by

    f(X_t, t) = E[ V(X_T) | F_t ] .                                     (4)

This definition applies to all kinds of Markov processes. The t variable is
discrete or continuous depending on whether the process takes place in
discrete or continuous time. The x variable is an element of the state space
S. For a diffusion, S = R^n, so f is a function of n real variables
(x_1, ..., x_n), as in f(x_1, ..., x_n, t). Here n is the number of
components of X_t.

The generator of a stochastic process is called L. The generator is a matrix
or a linear operator. Either way, the generator has the action g → Lg, and
this action is linear: ag → aLg, and g_1 + g_2 → Lg_1 + Lg_2. Here is the
definition of the generator, as it acts on a function.1 The discrete time
process starts with X_0 = x and takes one discrete time step to X_1:

    Lg(x) = E[ g(X_1) | X_0 = x ] .

To say this more explicitly, if h = Lg, then h(x) = E[ g(X_1) ]. For example,
for a simple random walk on the integers, S = Z. Suppose P(X → X+1) = .5,
P(X → X-1) = .3 and P(X → X) = .2. Then

    Lg(x) = .5 g(x+1) + .2 g(x) + .3 g(x-1) .

1 It is common to define abstract objects by their actions. There is a
children's book with the line: "Don't ask me what Voom is. I never will know.
But boy let me tell you it does clean up snow."


You can see that the matrix L is the same as the transition matrix. This is
because

    E[ g(X_1) | X_0 = x ] = Σ_j P(X_1 = j | X_0 = x) g(j) .

If g is the column vector whose entries are the values g(j), then this says
that Lg(x) is component x of the vector P g. The transition matrix is P. The
generator is L. They are the same but have a different name.
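To make the identification concrete, here is the random walk example as a matrix. Putting the walk on a cycle of N states keeps the matrix finite (the wrap-around is a simplification of mine, not in the notes); Lg is then the matrix-vector product P g:

```python
import numpy as np

# The random walk generator as a (finite) transition matrix on a cycle.
N = 8
P = np.zeros((N, N))
for x in range(N):
    P[x, (x + 1) % N] = 0.5    # P(X -> X+1)
    P[x, x] = 0.2              # P(X -> X)
    P[x, (x - 1) % N] = 0.3    # P(X -> X-1)

g = np.arange(N, dtype=float) ** 2      # an arbitrary payout, g(x) = x^2
Lg = P @ g                              # (Lg)(x) = E[g(X_1) | X_0 = x]

print(Lg[3])
print(0.5 * g[4] + 0.2 * g[3] + 0.3 * g[2])   # the same number
```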

There is a dynamics of value functions (conditional expectations) that
involves the generator. The derivation uses the tower property. For example,
after s discrete time steps starting from X_0 = x,

    E[ g(X_s) ] = L^s g(x) .

Take a start time t and a final time T. Then

    f_t = L f_{t+1} ,   that is,   f(x, t) = L^{T-t} g(x) .             (5)

The time variable in the backward equation is the time between the start and
the stop. When you increase this time variable, you can imagine starting
further from the stopping time, or starting at the same time and running the
process longer. This is the value function for a payout of g(X_T) at time T.

Now suppose you have a continuous time Markov process in a discrete state
space. At each time t, the state X_t is one of the elements of S. The process
is described by a transition rate matrix, R, with R_{ij} being the transition
rate for i → j transitions. This means that if j ≠ i are two elements of S,
then

    P(X_{t+dt} = j | X_t = i) = R_{ij} dt .

The generator of a continuous time process is defined a little differently
from the generator of a discrete time process. In continuous time you need to
focus on the rate of change of quantities, not the quantities themselves. For
this reason, we define Lg(x) as (assume X_0 = x as before)

    Lg(x) = lim_{Δt→0} ( E[ g(X_Δt) ] - g(x) ) / Δt .                   (6)

In time Δt, we expect E[ g(X_Δt) ] not to be very different from g(x). The
definition (6) describes the rate of change. Two things about the discrete
time problem are similar to this continuous time problem. The generator is
the same as the transition rate matrix. The evolution of expectation values
over a longer time is given by the backward equation using the generator.

To see why L = R, we write approximate expressions for the probabilities
P_{xy}(Δt) = P(X_Δt = y | X_0 = x). (The notation keeps changing, sometimes
i, j, sometimes x, y. This week it's not an accident.) For y ≠ x, the
probability is approximately R_{xy} Δt. For small Δt, the same state
probability P_{xx}(Δt) is approximately equal to 1. We define the diagonal
elements of R to make the formula

    P_{xy}(Δt) = δ_{xy} + Δt R_{xy}                                     (7)

true for all x, y. It is already true for x ≠ y. To make it true for x = y,
we need

    R_{xx} = - Σ_{y≠x} R_{xy} .

The off diagonal entries of the rate matrix are non-negative. The diagonal
entries are negative. The sum over all landing states is zero:

    Σ_{y∈S} R_{xy} = 0 .

Assuming this, the formula (7) for P gives

    Σ_{y∈S} P_{xy}(Δt) = Σ_{y∈S} P(X_Δt = y | X_0 = x) = 1 .

With the definitions above, we can evaluate the limit (6). Start with

    E[ g(X_Δt) ] = Σ_{y∈S} P_{xy}(Δt) g(y) .

Now that (7) applies for all x, y, we just get

    E[ g(X_Δt) ] ≈ g(x) + Δt Σ_{y∈S} R_{xy} g(y) .

We substitute this into (6), cancel the g(x), and then the Δt. The result is

    Lg(x) = Σ_{y∈S} R_{xy} g(y) .

The generator applied to g at x is a linear combination of the numbers g(y).
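A small rate matrix makes this concrete. In the sketch below (the rates are arbitrary choices), the diagonal of R is filled so each row sums to zero, and the difference quotient in (6), computed with the small-Δt approximation P(Δt) = I + Δt R from (7), reproduces Lg = Rg:

```python
import numpy as np

# A 3 state transition rate matrix: off-diagonal entries are the rates,
# and the diagonal is set so each row sums to zero.
R = np.array([[0.0, 2.0, 1.0],
              [0.5, 0.0, 0.5],
              [1.0, 3.0, 0.0]])
np.fill_diagonal(R, -R.sum(axis=1))

g = np.array([1.0, -2.0, 4.0])
Lg = R @ g                              # Lg(x) = sum_y R_xy g(y)

dt = 1e-6
P_dt = np.eye(3) + dt * R               # formula (7)
quot = (P_dt @ g - g) / dt              # difference quotient in (6)
print(Lg)
print(quot)
```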

It is convenient to consider functions that depend explicitly on time when
discussing the dynamics of expectation values. Let f(x, t) be such a
function. The generator discussion above implies that

    E[ f(X_Δt, 0) ] = f(x, 0) + Δt Lf(x, 0) + O(Δt^2) .

When you consider the explicit dependence of f on t, this becomes

    E[ f(X_Δt, Δt) ] = f(x, 0) + Δt ( Lf(x, 0) + ∂_t f(x, 0) ) + O(Δt^2) .

Let t_k = kΔt and t_n = t. Then

    E[ f(X_{t_n}, t_n) ] - f(x, 0) = Σ_{k=0}^{n-1} E[ f(X_{t_{k+1}}, t_{k+1}) - f(X_{t_k}, t_k) ]
                                   ≈ Σ_{k=0}^{n-1} E[ Lf(X_{t_k}, t_k) + ∂_t f(X_{t_k}, t_k) ] Δt .

In the limit Δt → 0, this becomes

    E[ f(X_t, t) ] - f(x, 0) = ∫_0^t E[ Lf(X_s, s) + ∂_t f(X_s, s) ] ds .   (8)

This equation is true for any function f(x, t). If f satisfies the backward
equation

    Lf + ∂_t f = 0 ,                                                    (9)

then E[ f(X_t, t) ] = f(x, 0) for all t.

As for the discrete backward equation, we can replace the time interval
[0, t] with the interval [t, T], which gives the familiar restatement: if
∂_t f + Lf = 0 and f(x, T) = g(x), then f(x, t) = E_{x,t}[ g(X_T) ].

The definition of the generator (6) is easy to work with, particularly if the
process is subtle. Suppose X_t satisfies the SDE (2) and g(x) is a twice
differentiable function. Define ΔX = X_Δt - x and make the usual
approximations

    g(X_Δt) - g(x) ≈ Σ_i ΔX_i ∂_{x_i} g(x) + (1/2) Σ_{ij} ΔX_i ΔX_j ∂_{x_i} ∂_{x_j} g(x) ,

    E[ ΔX_i ] ≈ a_i(x) Δt ,   E[ ΔX_i ΔX_j ] ≈ (b(x) b^t(x))_{ij} Δt .

Therefore,

    E[ g(X_Δt) - g(x) ] ≈ ( Σ_i a_i(x) ∂_{x_i} g(x) + (1/2) Σ_{ij} (b(x) b^t(x))_{ij} ∂_{x_i} ∂_{x_j} g(x) ) Δt .

Therefore,

    Lg(x) = Σ_i a_i(x) ∂_{x_i} g(x) + (1/2) Σ_{ij} (b(x) b^t(x))_{ij} ∂_{x_i} ∂_{x_j} g(x) .   (10)

It is common to write the operator itself, without the function g in the
expression. That would be

    L = Σ_i a_i(x) ∂_{x_i} + (1/2) Σ_{ij} (b(x) b^t(x))_{ij} ∂_{x_i} ∂_{x_j} .                 (11)

This is a differential operator. It is defined by how it acts on a function
g, which is given by (10).
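The formula (10) can be checked against the definition (6) with one Euler-Maruyama step. This is a sketch with arbitrary one dimensional coefficients a(x), b(x) and an arbitrary test function g:

```python
import numpy as np

# Compare Lg(x) = a g' + (1/2) b^2 g'' from (10) with the difference
# quotient (E[g(X_dt)] - g(x)) / dt from one Euler-Maruyama step.
rng = np.random.default_rng(4)
a = lambda x: -x                 # arbitrary drift
b = lambda x: 1.0 + 0.1 * x**2   # arbitrary noise coefficient
g = lambda x: np.sin(x)
gp = lambda x: np.cos(x)         # g'
gpp = lambda x: -np.sin(x)       # g''

x0, dt, nsamples = 0.8, 1e-3, 2_000_000
Lg = a(x0) * gp(x0) + 0.5 * b(x0)**2 * gpp(x0)       # formula (10)

X = x0 + a(x0) * dt + b(x0) * np.sqrt(dt) * rng.standard_normal(nsamples)
quot = (g(X).mean() - g(x0)) / dt                    # definition (6)
print(Lg, quot)
```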

The general relationship between backward and forward equations may be
understood using duality. This term has several related meanings in
mathematics.2 One of them is an abstract version of the relationship between
a matrix and its transpose. In the abstract setting, a matrix becomes an
operator, and the transpose of the matrix becomes the adjoint of the
operator. The transpose of the matrix L is L^t. The adjoint of the operator
L is L*. This is important here because if L is the generator of a Markov
process, then L is the operator that appears in the backward equation. But
L*, the adjoint of L, appears in the forward equation. This can be the
easiest way to figure out the forward equation in practical examples. It
will be easy to identify the forward equation for general diffusions with
generator L as in (11). But look back at Assignment 6 to see how hard it can
be to derive the forward equation directly. When all the bla bla is over,
this fancy duality boils down to simple integration by parts.

We introduce abstract adjoints for operators by describing how things work
for finite dimensional vectors and matrices. In that setting, we distinguish
between the n × 1 matrix, f, and the 1 × n matrix, u. As a matrix, f has one
column and n rows. This is also called a column vector. As a matrix, u has
one row and n columns, which makes u a row vector. A row vector or a column
vector has n components, but they are written in different places. Suppose A
is an m × n matrix and B is an n × k matrix. Then the matrix product AB is
defined but the product BA is not, unless k = m. If A is the n component row
vector u and B is the n × n matrix L, we have m = 1 and k = n above. The
matrix product uL = v is another 1 × n matrix, or row vector. The more
traditional matrix vector multiplication involves A = L as n × n and f as
n × 1, so Lf = g is n × 1, which makes g another column vector.

Suppose S is a finite state space with states x_i for i = 1, ..., n. We can
write the probability of state x_i as u(x_i) or u_i. If f(x) is a function of
the state, we write f_i for f(x_i). The expected value may be written in
several ways:

    E[ f ] = Σ_{i=1}^n u_i f_i = Σ_{x_i∈S} u(x_i) f(x_i) = Σ_{x∈S} u(x) f(x) .

2 For example, the dual of the icosahedron is the dodecahedron, and vice versa.


Now let u refer to the row vector with components u_i and f the column vector
with components f_i. Then the expected value expression above may be written
as the matrix product

    E[ f ] = u f .

This is the product of a 1 × n matrix, u, with the n × 1 matrix f. The result
is a 1 × 1 matrix, which is just a single number, the expected value.

Now suppose f = f_t = E_t[ V(X_T) ] and u = u_t with u_t(x) = P(X_t = x).
The tower property implies that the overall expectation is given by

    E[ V(X_T) ] = Σ_{x∈S} u_t(x) f_t(x) = u_t f_t .                     (12)

The last form on the right is the product of the row vector u_t and the
column vector f_t. Note, and this is the main point, that the left side is
independent of t in the range 0 ≤ t ≤ T. This implies, in particular, that

    u_{t+1} f_{t+1} = u_t f_t .

But the relationship between f_t and f_{t+1} is given by (5), in the form
f_t = L f_{t+1}. Therefore

    u_{t+1} f_{t+1} = u_t L f_{t+1} ,

for any vector f_{t+1}. This may be re-written, using the fact that matrix
multiplication is associative, as

    ( u_{t+1} - u_t L ) f_{t+1} = 0 ,

for every f_{t+1}. If the set of all possible f_{t+1} spans the whole space
R^n, this implies that

    u_{t+1} = u_t L .                                                   (13)

To summarize: if the value function satisfies the backward equation involving
the generator L, then the probability distribution satisfies a forward
equation with that same L, but used in a different way, multiplying from the
left rather than from the right. Recall that what we call L here was called P
before. It is the matrix of transition probabilities, the transition matrix,
for the Markov chain.

You can give the relationship between the backward and forward equations in a
different way if you treat all vectors as column vectors. If u_t is the
column vector of occupation probabilities at time t, then the expected value
formula is E[ V(X_T) ] = u_t^t f_t. The backward equation formula, written
using the column vector convention, is

    u_{t+1}^t f_{t+1} = u_t^t L f_{t+1} = (L^t u_t)^t f_{t+1} .

The reasoning we used to get (13) now gives the column vector formula

    u_{t+1} = L^t u_t .

This is another way to express the relationship between the dynamics of f and
the dynamics of u. The relationship is that the matrix L that does the
backward dynamics of f has an adjoint (transpose) that does the forward
dynamics of u.


We move to a continuous time version of this that applies to continuous time
Markov chains on a finite state space S. The dynamics for f_t are

    (d/dt) f_t = - L f_t ,

where L, the generator, is also the matrix of transition rates. If u_t is the
row vector of occupation probabilities, u_t(x) = P(X_t = x), then u_t f_t is
independent of t for the same reason as above. Therefore

    0 = (d/dt) ( u_t f_t ) = ( (d/dt) u_t ) f_t + u_t ( (d/dt) f_t )
      = ( (d/dt) u_t ) f_t - u_t ( L f_t ) .

This implies, as in the discrete time case above, that

    ( (d/dt) u_t - u_t L ) f_t = 0 ,

for every value function vector f_t. If these vectors span R^n, then the
vector in brackets must be equal to zero:

    (d/dt) u_t = u_t L .

If we use the convention of treating the occupation probabilities as a column
vector, then this is

    (d/dt) u_t = L^t u_t .

It is easy to verify all these dynamical equations directly for finite state
space Markov chains in discrete or continuous time.
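For example, here is a direct numerical check for a discrete time chain with an arbitrary 4 state transition matrix: running f backward by f_t = L f_{t+1} and u forward by u_{t+1} = u_t L keeps the number u_t f_t = E[V(X_T)] the same at every t.

```python
import numpy as np

# Duality check: u_t f_t is constant in t for a discrete time chain.
rng = np.random.default_rng(5)
n, T = 4, 6
L = rng.random((n, n))
L /= L.sum(axis=1, keepdims=True)        # transition matrix, row sums 1

V = np.array([1.0, -1.0, 2.0, 0.5])      # payout at time T
f = [None] * (T + 1)
u = [None] * (T + 1)
f[T] = V
u[0] = np.full(n, 1.0 / n)               # initial distribution (row vector)
for t in range(T - 1, -1, -1):
    f[t] = L @ f[t + 1]                  # backward equation
for t in range(T):
    u[t + 1] = u[t] @ L                  # forward equation

vals = [u[t] @ f[t] for t in range(T + 1)]
print(vals)                              # all the same number
```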

With all this practice, the PDE argument for diffusion processes is quick.
Start with the one dimensional case with drift a(x) and noise b(x). The
backward equation is

    ∂_t f(x, t) + Lf = ∂_t f + (1/2) b(x)^2 ∂_x^2 f(x, t) + a(x) ∂_x f(x, t) = 0 .

Let u(x, t) be the probability density for X_t. Then, as before, the time t
formula for the expected payout is true for all t between 0 and T:

    E[ V(X_T) ] = ∫_{-∞}^{∞} u(x, t) f(x, t) dx .

We differentiate this with respect to t and use the backward equation for f:

    0 = ∂_t ∫_{-∞}^{∞} u(x, t) f(x, t) dx
      = ∫_{-∞}^{∞} ( ∂_t u(x, t) ) f(x, t) dx + ∫_{-∞}^{∞} u(x, t) ( ∂_t f(x, t) ) dx
      = ∫_{-∞}^{∞} ( ∂_t u(x, t) ) f(x, t) dx - ∫_{-∞}^{∞} u(x, t) ( Lf(x, t) ) dx .


The new trick here is to move L onto u by integration by parts. This is the
continuous state space analogue of writing u^t (Lf) as (L^t u)^t f in the
discrete state space case. In the integrations by parts we assume that there
are no boundary terms at ±∞. The reason for this is that the probability
density u(x, t) goes to zero very rapidly as x → ±∞. In typical examples,
such as the Gaussian case of Brownian motion, u(x, t) goes to zero
exponentially as x → ±∞. Therefore, even if f(x, t) does not go to zero, or
even goes to infinity, the boundary terms vanish in the limit x → ±∞. Here is
the algebra:

    ∫_{-∞}^{∞} u(x, t) ( Lf(x, t) ) dx
        = ∫_{-∞}^{∞} u(x, t) ( (1/2) b^2(x) ∂_x^2 f(x, t) + a(x) ∂_x f(x, t) ) dx
        = ∫_{-∞}^{∞} ( (1/2) ∂_x^2 ( b^2(x) u(x, t) ) - ∂_x ( a(x) u(x, t) ) ) f(x, t) dx .

Define

    L* u(x, t) = (1/2) ∂_x^2 ( b^2(x) u(x, t) ) - ∂_x ( a(x) u(x, t) ) .   (14)

This defines the operator L*, which is the adjoint of the generator L. The
integration by parts above shows that

    ∫_{-∞}^{∞} ( ∂_t u(x, t) - L* u(x, t) ) f(x, t) dx = 0 ,

for every value function f(x, t). If there are enough value functions, the
only way for all these integrals to vanish is for the u part to vanish, which
is

    ∂_t u(x, t) = L* u(x, t) = (1/2) ∂_x^2 ( b^2(x) u(x, t) ) - ∂_x ( a(x) u(x, t) ) .   (15)

This is the forward Kolmogorov equation for the evolution of the probability
density u(x, t).
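A finite difference sketch shows one consequence of the conservative form in (15): with the coefficients inside the derivatives, the discrete analogue of ∫ u dx is conserved exactly. The periodic grid and the particular a(x), b(x) below are simplifications of mine, not from the notes:

```python
import numpy as np

# Explicit finite difference sketch of the forward equation (15),
# d_t u = (1/2) d_x^2 (b^2 u) - d_x (a u), on a periodic grid.  Centered
# differences of periodic arrays sum to zero, so sum(u)*dx is conserved.
nx, dx, dt, nsteps = 200, 0.05, 1e-4, 500
x = dx * np.arange(nx)
length = nx * dx
a = np.sin(2 * np.pi * x / length)                 # drift a(x)
b2 = 1.0 + 0.5 * np.cos(2 * np.pi * x / length)    # b(x)^2

u = np.exp(-((x - 5.0) ** 2))
u /= u.sum() * dx                                  # normalize: integral = 1

def ddx(v):
    # centered first difference on the periodic grid
    return (np.roll(v, -1) - np.roll(v, 1)) / (2 * dx)

for _ in range(nsteps):
    u = u + dt * (0.5 * ddx(ddx(b2 * u)) - ddx(a * u))

print(u.sum() * dx)                                # still ~ 1
```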

There are many features of the forward equation to keep in mind, and
important differences between the forward and backward equations. One
difference is that in the backward equation the noise and drift coefficients
are outside the differentiation, but they are inside in the forward equation.
In both cases it has to be like this. For the backward equation, constants
are solutions, obviously, because if the payout is V(X_T) = c independent of
X_T, then the value function is f(X_t, t) = c independent of X_t. You can see
that f(x, t) = c satisfies the backward equation because Lc = 0. If L were to
have, say, ∂_x ( a(x) f(x, t) ), then we would have Lc = a'(x) c ≠ 0 if a is
not constant. That would be bad. The forward equation, on the other hand, is
required to preserve the integral of u, not constants. If we had
a(x) ∂_x u(x, t) instead of ∂_x ( a(x) u(x, t) ), then we would have (if b is
constant)

    (d/dt) ∫_{-∞}^{∞} u(x, t) dx = ∫_{-∞}^{∞} ∂_t u(x, t) dx
                                 = ∫_{-∞}^{∞} a(x) ∂_x u(x, t) dx
                                 = - ∫_{-∞}^{∞} ( ∂_x a(x) ) u(x, t) dx ,

which is generally not zero, so the total probability would not stay equal to
one.

A feature of the forward equation that takes scientists by surprise is that
the second derivative term is not

    (1/2) ∂_x ( b^2(x) ∂_x u(x, t) ) .

This is because of the martingale property. In the no drift case, a = 0, the
noise is not supposed to change the expected value of X_t. Therefore

    0 = (d/dt) E[ X_t ] = (d/dt) ∫ x u(x, t) dx = ∫ x ∂_t u(x, t) dx .

If we use the correct form, this works out:

    ∫ x ∂_t u(x, t) dx = (1/2) ∫ x ∂_x^2 ( b^2(x) u(x, t) ) dx
                       = - (1/2) ∫ ( ∂_x x ) ∂_x ( b^2(x) u(x, t) ) dx
                       = - (1/2) ∫ ∂_x ( b^2(x) u(x, t) ) dx
                       = 0 .

With the wrong form it does not:

    ∫ x ∂_x ( b^2(x) ∂_x u(x, t) ) dx = - ∫ b^2(x) ∂_x u(x, t) dx
                                      = ∫ ( ∂_x b^2(x) ) u(x, t) dx ,

which is not zero unless b is constant.

Another crucial distinction between forward and backward is the relation
between the signs of the ∂_t and ∂_x^2 terms. This is clearest in the
simplest example, which is Brownian motion: dX = dW, which has a = 0 and
b = 1. Then the backward equation is ∂_t f + (1/2) ∂_x^2 f = 0, and the
forward equation is ∂_t u = (1/2) ∂_x^2 u. You can appreciate the difference
by writing the backward equation in the form ∂_t f = - (1/2) ∂_x^2 f. The
forward equation has +∂_x^2 and the backward equation has -∂_x^2. This is
related to the fact that the forward equation is intended for evolving
u(x, t) forward in time. If you specify u(x, 0), the forward equation
determines u(x, t) for t > 0. The backward equation is for evolving the value
function backward in time. If you specify f(x, T), the backward equation
determines f(x, t) for t < T.

One way to remember the signs is to think what should happen to a local
maximum. The probability density at a local maximum should go down as you
move forward in time. A local maximum represents a region of high
concentration of particles of probability. Moving forward in time, these
particles will disperse. This causes the density of particles, the
probability, to go down. Mathematically, a local maximum of u at x_0 at time
t_0 is represented by ∂_x u(x_0, t_0) = 0 and ∂_x^2 u(x_0, t_0) < 0. At such
a point, we want u to be decreasing, which is to say ∂_t u(x_0, t_0) < 0. The
forward equation with +∂_x^2 does this, but with -∂_x^2 it would get it
wrong. The backward equation should lower a local maximum in the value
function f moving backward in time. Suppose ∂_x f(x_0, t_0) = 0 and
∂_x^2 f(x_0, t_0) < 0 (a local maximum). Then suppose you start at a time
t < t_0 with X_t = x_0. There is a chance that X_{t_0} will be close to x_0,
but it probably will miss at least a little. (Actually, it misses almost
surely.) Therefore f(X_{t_0}, t_0) < f(x_0, t_0) on average, so
f(x_0, t) = E_{x_0,t}[ f(X_{t_0}, t_0) ] < f(x_0, t_0). This suggests that
∂_t f(x_0, t_0) > 0, which makes f decrease as you move backward in time. The
backward equation with -∂_x^2 does this.

The multi-dimensional version of these calculations is similar. It is easier
if you express the calculations above in a somewhat more abstract way, using
the generator L and an inner product appropriate for the situation. In finite
dimensions, and for vectors with real components, the standard inner product
is

    ⟨u, f⟩ = Σ_{i=1}^n u_i f_i .

With an inner product there is no need to think of u as a column vector or a
row vector. If L is an n × n matrix, the definition of the adjoint of L is
the matrix L* so that

    ⟨u, Lf⟩ = ⟨L* u, f⟩

for all vectors u and f. We already did the matrix calculations to show that
L* = L^t for matrices, with the standard inner product. We derive the forward
equation from the backward equation, in this notation, as follows. Start with

    ⟨u_t, f_t⟩ = ⟨u_{t+1}, f_{t+1}⟩ .

Then use the backward equation and the adjoint relation:

    ⟨u_{t+1}, f_{t+1}⟩ = ⟨u_t, L f_{t+1}⟩ = ⟨L* u_t, f_{t+1}⟩ .

The additivity property of inner products allows us to write this in the form

    ⟨u_{t+1} - L* u_t, f_{t+1}⟩ = 0 .

This is supposed to hold for all f_{t+1}, which forces the vector in
parentheses to be zero (another property of inner products):
u_{t+1} = L* u_t.


The continuous time version of this is about the same. It is a property of
inner products that

    (d/dt) ⟨u_t, f_t⟩ = ⟨(d/dt) u_t, f_t⟩ + ⟨u_t, (d/dt) f_t⟩ .

If the inner product is independent of t and f satisfies the backward
equation (d/dt) f_t = - L f_t, this gives

    0 = ⟨(d/dt) u_t, f_t⟩ - ⟨u_t, L f_t⟩
      = ⟨(d/dt) u_t, f_t⟩ - ⟨L* u_t, f_t⟩
      = ⟨(d/dt) u_t - L* u_t, f_t⟩ .

As we argued before, if this holds for enough vectors f_t, it implies that
the quantity in parentheses must vanish, which leads to

    (d/dt) u_t = L* u_t .                                               (17)

This says that to find the forward equation from the backward equation, you
have to find the adjoint of the generator.

The generator of the multi-dimensional diffusion process is given by (10) or
(11). We simplify the notation by defining the diffusion coefficient matrix
σ(x) = b(x) b^t(x). The components of σ are the diffusion coefficients
σ_{jk}(x) = Σ_l b_{jl}(x) b_{kl}(x). The quantity that is constant in time is

    ∫_{R^n} u(x_1, ..., x_n, t) f(x_1, ..., x_n, t) dx_1 ... dx_n ,     (18)

and the adjoint L* is defined by

    ∫ ( L* u(x, t) ) f(x, t) dx = ∫ u(x, t) ( Lf(x, t) ) dx .

The adjoint is found by integration by parts. The generator is the sum of
many terms. We do the integration by parts separately for each term and add
the results. A typical first derivative part of L is a_j(x) ∂_{x_j}. The
adjoint of this term is found by integrating by parts in the x_j variable,
which is just one of the variables in the n dimensional integration (18).
That is,

    ∫_{x_j=-∞}^{∞} u(x, t) a_j(x) ∂_{x_j} f(x, t) dx_j = - ∫_{x_j=-∞}^{∞} ∂_{x_j} ( a_j(x) u(x, t) ) f(x, t) dx_j .

The integration over the other variables does nothing to this. The result is

    ∫_{R^n} u(x, t) a_j(x) ∂_{x_j} f(x, t) dx = - ∫_{R^n} ∂_{x_j} ( a_j(x) u(x, t) ) f(x, t) dx .


A typical second derivative term in L is σ_{jk}(x) ∂_{x_j} ∂_{x_k}. If j ≠ k,
you move these derivatives from f onto u by integration by parts in x_j and
x_k. The overall sign comes out + because you do two integrations by parts.
The result is

    ∫ u(x, t) σ_{jk}(x) ∂_{x_j} ∂_{x_k} f(x, t) dx_j dx_k = ∫ ∂_{x_j} ∂_{x_k} ( σ_{jk}(x) u(x, t) ) f(x, t) dx_j dx_k .

You can check that this result is still true if j = k. Altogether, the
adjoint of L is

    L* u(x, t) = (1/2) Σ_{j=1}^n Σ_{k=1}^n ∂_{x_j} ∂_{x_k} ( σ_{jk}(x) u(x, t) ) - Σ_{j=1}^n ∂_{x_j} ( a_j(x) u(x, t) ) ,   (19)

and the forward equation is

    ∂_t u(x, t) = (1/2) Σ_{j=1}^n Σ_{k=1}^n ∂_{x_j} ∂_{x_k} ( σ_{jk}(x) u(x, t) ) - Σ_{j=1}^n ∂_{x_j} ( a_j(x) u(x, t) ) .   (20)

A common rookie mistake is to get the factor of 1/2 wrong in the second
derivative terms. Remember that the off diagonal terms, the ones with j ≠ k,
appear twice in (19), once with (j, k) and again with (k, j). For example, in
two dimensions, the second derivative expression is

    (1/2) ∂_{x_1}^2 ( σ_{11}(x) u(x, t) ) + (1/2) ∂_{x_2}^2 ( σ_{22}(x) u(x, t) ) + ∂_{x_1} ∂_{x_2} ( σ_{12}(x) u(x, t) ) .

The matrix σ = b b^t is symmetric, even when b is not symmetric. Therefore,
it does not matter whether you write σ_{12} or σ_{21}.

This stuff is closely related to Ito's lemma. Suppose f(x, t) is some
function. The Ito differential is

    df(X_t, t) = ( ∂_t f(X_t, t) + Lf(X_t, t) ) dt + ∇f(X_t, t)^t b(X_t) dW_t .   (21)

This is not a form we have used before, but it is easy to check. The
expectation from this is

    E[ df(X_t, t) | F_t ] = ( ∂_t f(X_t, t) + Lf(X_t, t) ) dt .         (22)

There are backward equations for lots of other functions of a process. One
example is the running cost or running payout problem. (Engineers talk about
costs. Finance people talk about payouts.) Define

    A = ∫_0^T V(X_s) ds .                                               (23)

The corresponding value function is

    f(x, t) = E_{x,t}[ ∫_t^T V(X_s) ds ] .                              (24)

The Ito differential (21) is an easy way to find the backward equation. On
one hand you have

    d ( ∫_t^T V(X_s) ds ) = - V(X_t) dt .

On the other hand we have (21), or in expectation (22). Together, these give

    - V(x) dt = ( ∂_t f(x, t) + Lf(x, t) ) dt .

This gives the backward equation

    0 = ∂_t f + Lf + V(x) .                                             (25)

If you want to know the expected value E[A] of A in (23), one way to find it
is to solve the backward equation with final condition f(x, T) = 0. Then
E[A] = f(x_0, 0). In this approach you have to solve the whole PDE and
compute the whole value function just to find the single number E[A].

The correspondence between the additive functional (23) and the backward
equation (25) can go both ways. You can use the PDE to find the expected
value of A. You can use the definition (24) to find the solution of the PDE
(25). This is useful in situations where the backward equation was not
derived as a probability model. It is most important when the dimension of
the problem is more than a few, maybe more than 4 or 5, where PDE solution
methods are impractical.
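As a sanity check of the Monte Carlo direction, take X to be Brownian motion and V(x) = x^2. Then E[∫_0^T W_s^2 ds] = ∫_0^T s ds = T^2/2, and a Riemann sum over simulated paths should land there. The parameters below are arbitrary:

```python
import numpy as np

# Monte Carlo for the additive functional (23) with X = Brownian motion
# and V(x) = x^2.  Exactly, E[int_0^T W_s^2 ds] = T^2 / 2.
rng = np.random.default_rng(7)
T, m, npaths = 1.0, 100, 50_000
dt = T / m

W = np.cumsum(np.sqrt(dt) * rng.standard_normal((npaths, m)), axis=1)
A = dt * (W**2).sum(axis=1)           # Riemann sum of int_0^T V(W_s) ds
print(A.mean(), "target:", T**2 / 2)
```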

Another connection of this kind concerns the multiplicative functional

    A = exp ( ∫_0^T V(X_s) ds ) ,                                       (26)

with the value function

    f(x, t) = E_{x,t}[ exp ( ∫_t^T V(X_s) ds ) ] .                      (27)

We find the backward equation for this value function following the
reasoning we used for the previous one. On one hand, we have

    d exp ( ∫_t^T V(X_s) ds ) = - V(X_t) exp ( ∫_t^T V(X_s) ds ) dt .

(If Y_t is the additive functional Y_t = ∫_t^T V(X_s) ds, then dY has no Ito
part, so d e^{Y_t} = e^{Y_t} dY_t. We just computed dY_t = - V(X_t) dt.) So
we equate this with the Ito version of df given in (21) and (22) to get

    E[ df(X_t, t) | F_t ] = ( ∂_t f(X_t, t) + Lf(X_t, t) ) dt = - V(X_t) f(X_t, t) dt .


Rearranging this leads to

Again, the relationship between the multiplicative value function (27) and the backward equation (28) can be useful either way. Solving the PDE (28) allows you to calculate the expectation of the multiplicative functional. Using the multiplicative function (27) allows you to use Monte Carlo to compute the solution of the PDE (28). It might happen that we want the PDE solution with a final condition f(x, T) = g(x) that is not identically equal to 1. In this case, the solution formula is clearly (think this through)

    f(x, t) = E_{x,t}[ g(X_T) exp( ∫_t^T V(X_s) ds ) | F_t ] .

This solution formula for the PDE is called the Feynman Kac formula. A version of this was proposed by the physicist Feynman in the 1940s for the related PDE that has iL instead of L. Feynman's formula was criticized by mathematicians for not being rigorous, in my opinion somewhat unfairly. The mathematician Kac discovered the present completely rigorous version of Feynman's formula. Modern probabilists, particularly applied probabilists working in finance or operations research, call the PDE (28) the Feynman Kac formula instead. This reverses the original intention. Some go even further, using the term Feynman Kac for any backward equation of any kind.
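As a concrete illustration of the Monte Carlo direction, here is a minimal sketch, not from the notes, that estimates the Feynman Kac expectation by simulating paths. It assumes (purely for illustration) that X is standard Brownian motion, V(x) = -x^2, and g ≡ 1; the time integral is approximated by a left-endpoint Riemann sum.

```python
import numpy as np

def feynman_kac_mc(x, t, T, V, g, n_paths=20000, n_steps=200, seed=0):
    """Monte Carlo estimate of f(x,t) = E_{x,t}[ g(X_T) exp(int_t^T V(X_s) ds) ],
    assuming X is standard Brownian motion (dX = dW)."""
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps
    X = np.full(n_paths, float(x))
    integral = np.zeros(n_paths)
    for _ in range(n_steps):
        integral += V(X) * dt                             # left-endpoint rule
        X += np.sqrt(dt) * rng.standard_normal(n_paths)   # one Brownian step
    return np.mean(g(X) * np.exp(integral))

est = feynman_kac_mc(0.0, 0.0, 1.0, V=lambda x: -x**2,
                     g=lambda x: np.ones_like(x))
print(est)   # a number in (0, 1), since V <= 0 here
```

Because V ≤ 0 in this toy choice, the multiplicative functional is at most 1, which gives a quick sanity check on the estimate.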


Week 2

Discrete Markov chains

Jonathan Goodman

September 17, 2012

This week we discuss Markov random processes in which there is a list of possible states. We introduce three mathematical ideas: a σ-algebra to represent a state of partial information, measurability of a function with respect to a discrete σ-algebra, and a filtration that represents gaining information over time. Filtrations are a convenient way to describe the Markov property and to give the general definition of a martingale, the latter a few weeks from now. Associated with Markov chains are the backward and forward equations that describe the evolution of probabilities and expectation values over time. Forward and backward equations are one of the main "calculate things" methods of stochastic calculus.

A stochastic process in discrete time is a sequence (X1 , X2 , . . .), where Xn

is the state of the system at time n. The path up to time T is X[1:T ] =

(X1 , X2 , . . . , XT ). The state Xn must be in the state space, S. Last week

S was Rd , for a linear Gaussian process. This week, S is either a finite set

S = {x1 , x2 , . . . xm }, or an infinite countable set of the form S = {x1 , x2 , . . .}. A

set such as S is discrete if it is finite or countable. The set of all real numbers,

like the Gaussian state space Rd , is not discrete because the real numbers are

not countable (a famous theorem of Georg Cantor). Spaces that are not discrete

may be called continuous. If we define X_t for all times t, then X_t is a continuous time process. If X_n is defined only for integers n (or another discrete set of

times), then we have a discrete time process. This week is about discrete time

discrete state space stochastic processes.

We will be interested in discrete Markov chains partly for their own sake

and partly because they are a setting where the general definitions are easy to

give without much mathematical subtlety. The concepts of measurability and

filtration for continuous time or continuous state space are more technical and

subtle than we have time for in this class. The same is true of backward and

forward equations. They are rigorous this week, but heuristic when we go to

continuous time and state space. That is the Δx → 0 and Δt → 0 aspect of stochastic calculus.


The main examples of Markov chain will be random walk and mean reverting random walk. These are discrete versions of Brownian motion and the Ornstein Uhlenbeck process respectively.

2 Basic probability

This section gives some basic general definitions in probability theory in a set-

ting where they are not technical. Look to later sections for more examples.

Philosophically, a probability space, Ω, is the set of all possible outcomes of a probability experiment. Mathematically, Ω is just a set. In abstract discussions, we usually use ω to denote an element of Ω. In concrete settings, the elements of Ω have more concrete descriptions. This week, Ω will usually be the path space consisting of all paths x_[1:T] = (x_1, ..., x_T), with x_n ∈ S for each n. If S is discrete and T is finite then the path space is discrete. If T is infinite, Ω is not discrete. The discussion below needs to be given more carefully in that case, which we do not do in this class.

An event is the answer to a yes/no question about the outcome ω. Equivalently, an event is a subset of the probability space: A ⊆ Ω. You can interpret A as the set of outcomes where the answer is yes, and A^c = {ω ∈ Ω | ω ∉ A} is the complementary set where the answer is no. We often describe an event using a version of set notation where the informal definition of the event goes inside curly braces, such as {X_3 ≠ X_5} to describe {x_[1:T] | x_3 ≠ x_5}.

A σ-algebra is a mathematical model of a state of partial knowledge about the outcome. Informally, if F is a σ-algebra and A ⊆ Ω, we say that A ∈ F if we know whether ω ∈ A or not. The most used σ-algebra in stochastic processes is the one that represents knowing the first n states in a path. This is called F_n. To illustrate this, if n ≥ 2, then the event A = {X_1 = X_2} is in F_n. If we know both X_1 and X_2, then we know whether X_1 = X_2. On the other hand, we do not know whether X_n = X_{n+1}, so {X_n ≠ X_{n+1}} ∉ F_n.

Here is the precise definition of a σ-algebra. The empty set is written ∅. It is the event with no elements. We say F is a σ-algebra if (explanations of the axioms in parentheses):

(i) Ω ∈ F and ∅ ∈ F.
    (Regardless of how much you know, you know whether ω ∈ Ω, it is, and you know whether ω ∈ ∅, it isn't.)

(ii) If A ∈ F then A^c ∈ F.
    (If you know whether ω ∈ A then you know whether ω ∉ A. It is the same information.)

(iii) If A ∈ F and B ∈ F, then A ∪ B ∈ F and A ∩ B ∈ F.
    (If you can answer the questions ω ∈ A? and ω ∈ B?, then you can answer the question ω ∈ (A or B)? If ω ∈ A or ω ∈ B, then ω ∈ (A ∪ B).)

(iv) If A_1, A_2, ... is a sequence of events, then ∪_k A_k ∈ F.
    (If any of the A_k is yes, then the whole thing is yes. The only way to get no for ω ∈ ∪_k A_k?, is for every one of the A_k to be no.)

This is not a minimal list. There are some redundancies. For example, if you have axiom (ii), and Ω ∈ F, then it follows that ∅ = Ω^c ∈ F. The last axiom is called countable additivity. You need countable additivity to do Δt → 0 stochastic calculus.

A function of a random variable, sometimes called a random variable, is a real valued (later, vector valued) function of ω ∈ Ω. For example, if Ω is path space and the outcome is a path, and a ∈ S is a specific state, then the function could be

    f(x_[1:T]) = { min {n | x_n = a}   if there is such an n
                 { T                   otherwise.

This is called a hitting time and is often written τ_a. It would be more complete to write τ_a(x_[1:T]), but it is common not to write the function argument. Such a function has a discrete set of values when Ω is discrete. If there is a finite or infinite list of all elements of Ω, then there is a finite or infinite list of possible values of f. Some of the definitions below are simpler and less technical when Ω is discrete.
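The hitting time definition above translates directly into code. Here is a small sketch in which a path is just a Python list indexed 1, ..., T, as in the text; the function name is of course ours, not the notes'.

```python
# The hitting time tau_a as a function on path space: the first index n
# with x_n = a, or T (the path length) if the path never hits a.

def hitting_time(path, a):
    for n, x in enumerate(path, start=1):   # paths are indexed 1, 2, ..., T
        if x == a:
            return n
    return len(path)

print(hitting_time([0, 1, 2, 1, 2, 3], 2))  # first hit of a = 2 is at n = 3
print(hitting_time([0, 1, 0, 1], 5))        # never hits 5, so returns T = 4
```

Returning T when the path never hits a matches the convention in the displayed formula, and it keeps τ_a finite on every path.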

A function of a random variable, f(ω), is measurable with respect to F if the value of f can be determined from the information in F. If Ω is discrete, this means that for any number F, the question f(ω) = F? can be answered using the information in F. More precisely, it means that for any F, the event A_F = {ω ∈ Ω | f(ω) = F} is an element of F. To be clear, A_F = ∅ for most values of F because there is a finite or countable list of F values for which A_F ≠ ∅.

Let F_n be a family of σ-algebras, defined for n = 0, 1, 2, .... They form a filtration if F_n ⊆ F_{n+1} for each n. This is a very general model of acquiring new information at time n. The only restriction is that in a filtration you do not forget anything. If you know the answer to a question at time n, then you still know at time n + 1. The set of questions you can answer at time n is a subset of the set of questions you can answer at time n + 1. It is common that F_0 is the trivial σ-algebra, F_0 = {∅, Ω}, in which you can answer only trivial questions. The most important filtration for us has Ω being the path space and F_n knowing the path up to time n. This is called the natural filtration, or the filtration generated by the process X_n, in which F_n knows the values X_1, ..., X_n.

Suppose F_n is a filtration and f_n(ω) is a family of functions. We say that the functions are progressively measurable, or non-anticipating, or adapted to the filtration, or predictable, if f_n is measurable with respect to F_n for each n. There are subtle differences between these concepts for continuous time processes, differences we will ignore in this class. Adapted functions are important in several ways. In the Ito calculus that is the core of this course, the integrand in the Ito integral must be adapted. In stochastic control problems, you try to decide on a control at time n using only the information available at time n. A realistic stochastic control must be non-anticipating.


Partitions are a simple way to describe σ-algebras in discrete probability. A partition, P, is a collection of events that is mutually exclusive and collectively exhaustive. That means that if P = {A_1, ...}, then:

(i) A_i ∩ A_j = ∅ whenever i ≠ j. (mutually exclusive)

(ii) ∪_i A_i = Ω. (collectively exhaustive)

The events A_i are the elements of the partition P. They form a partition if each ω ∈ Ω is a member of exactly one partition element. A partition may have finitely or countably many elements. One example of a partition, if Ω is the path space, has one partition element for every state y ∈ S. We say x_[1:T] ∈ A_y if and only if x_1 = y. Of course, we could partition using the value x_2, etc.

In discrete probability, there is a one to one correspondence between partitions and σ-algebras. If F is a σ-algebra, the corresponding partition, informally, is the finest grained information contained in F. To say this more completely, we say that F distinguishes ω from ω' if there is an A ∈ F so that ω ∈ A and ω' ∉ A. For example, let F_2 be part of the natural filtration of path space. Suppose ω = (y_1, y_2, y_3, ...), and ω' = (y_1, y_2, z_3, ...). Then F_2 does not distinguish ω from ω'. The information in F_2 cannot answer the question y_3 = z_3? For any ω ∈ Ω, the set of ω' that cannot be distinguished from ω is an event, called the equivalence class of ω. The set of all equivalence classes forms a partition of Ω (check properties (i) and (ii) above; if two equivalence classes overlap, then they are the same). If B_ω is the equivalence class of ω, then

    B_ω = ∩ A ,   the intersection over all A ∈ F with ω ∈ A .

Each A ∈ F is the union of all the equivalence classes contained in A. Therefore, if you have P, you can create F by taking all countable unions of elements of P.

The previous paragraph is full of lengthy but easy verifications that you may not go through completely or remember long. What you should remember is what a partition is and how it carries the information in a σ-algebra. The information in F does not tell you which outcome happened, but it does tell you which partition element it was in. A function f is measurable with respect to F if and only if it is constant on each partition element of the corresponding P. The information in F determines the partition element, B_j, which determines the value of f.

A probability distribution on Ω is an assignment of a probability to each outcome in Ω. If ω ∈ Ω, then P(ω) is the probability of ω. Naturally, P(ω) ≥ 0 for all ω and Σ_{ω∈Ω} P(ω) = 1. If A is an event, then P(A) = Σ_{ω∈A} P(ω). If f(ω) is a function of a random variable, then

    E[f] = Σ_{ω∈Ω} f(ω) P(ω) .


3 Conditioning

Conditioning is about how probabilities change as you get more information. The simplest conditional probability tells you how the probability of event A changes if you know the event B happened. The formula is often called Bayes' rule:

    P(A|B) = P(A and B)/P(B) = P(A ∩ B)/P(B) .                              (1)

As a simple check, note that if A ∩ B = ∅, then the event B rules A out completely. Bayes' rule (1) gets this right, as P(∅) = 0 in the numerator. Bayes' rule does not say how to define conditional probability if P(B) = 0. This is a serious drawback in continuous probability. For example, if (X_1, X_2) is bivariate normal, then P(X_1 = 4) = 0, but we saw last week how to calculate P(X_2 > 0 | X_1 = 4) (say). The conditional probability of a particular outcome is given by Bayes' rule too. Just take A to be the event that ω happened, which is written {ω}:

    P(ω|B) = { P(ω)/P(B)   if ω ∈ B
             { 0           if ω ∉ B .

The conditional expected value is

    E[f|B] = Σ_{ω∈B} f(ω) P(ω|B) = ( Σ_{ω∈B} f(ω) P(ω) ) / P(B) .           (2)

You can check that this formula gives the right answer when f(ω) = 1 for all ω. We indeed get E[1|B] = 1.
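Formula (2) is easy to check numerically. Here is a toy example, with probabilities and a function f that are made up for the check, on a four-point probability space.

```python
# Numeric check of E[f|B] = (sum_{w in B} f(w) P(w)) / P(B), formula (2).

P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # made-up probabilities, sum to 1
f = {0: 5.0, 1: 1.0, 2: 2.0, 3: 10.0}  # made-up function values
B = {2, 3}                              # the conditioning event

PB = sum(P[w] for w in B)               # P(B) = 0.7
Ef_B = sum(f[w] * P[w] for w in B) / PB # (2*0.3 + 10*0.4) / 0.7
print(Ef_B)

# sanity check from the text: the constant function 1 has E[1|B] = 1
E1_B = sum(1.0 * P[w] for w in B) / PB
print(E1_B)
```

The second print reproduces the sanity check in the text: conditioning a constant changes nothing.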

You can think of expectation as something you say about a random function when you know nothing but the probabilities of various outcomes. There is a version of conditional expectation that describes how your understanding of f will change when you get the information in F. Suppose P = (B_1, B_2, ...) is the partition that is determined by F. When you learn the information in F, you will learn which of the B_j happened. The conditional expectation of f, conditional on F, is a function of ω determined by this information:

    g(ω) = E[f | B_j]   if ω ∈ B_j .

You can see that the conditional expectation, g = E[f|F], is constant on partition elements B_j ∈ P. This implies that g is measurable with respect to F, which is another way of saying that E[f|F] is determined by the information in F. The ordinary expectation is conditional expectation with respect to the trivial σ-algebra F_0 = {∅, Ω}. The corresponding partition has only one element, Ω. The conditional expectation has the same value for every element of Ω, and that value is E[f].

The tower property is a fact about conditional expectations. It leads to backward equations, which are a powerful way to calculate conditional expectations. Suppose F_0 ⊆ F_1 ⊆ ··· is a filtration, and f_n = E[f | F_n]. The tower property is

    f_n = E[ f_{n+1} | F_n ] .                                              (4)

More generally, suppose G is a σ-algebra with more information than F, which means F ⊆ G. Suppose f is some function, g = E[f|G], and h = E[f|F]. Then h = E[g|F]. To say this another way, you can condition from f down to h directly, which is E[f|F], or you can do it in two stages, which is f → g = E[f|G] → E[g|F] = h. The result is the same.

One proof of the tower property makes use of the partitions associated with F and G. The partition for G is a refinement of the partition for F. This means that you make the partition elements for G by cutting up partition elements of F. Every C_i that is a partition element of G is completely contained in one of the partition elements of F. Said another way, if ω and ω' are two elements of C_i, they are indistinguishable using the information in G, which surely makes them indistinguishable using F, which is less information. This is why C_i cannot contain outcomes from different B_j.

Now it is just a calculation. Let h'(ω) = E[g|F](ω). For ω ∈ B_j, it is intuitively clear (and we will verify) that

    h'(ω) = Σ_{C_i ⊆ B_j} E[g|C_i] P(C_i|B_j)                               (5)
          = Σ_{C_i ⊆ B_j} E[f|C_i] P(C_i|B_j)
          = Σ_{C_i ⊆ B_j} ( Σ_{ω∈C_i} f(ω) P(ω|C_i) ) P(C_i|B_j)
          = Σ_{C_i ⊆ B_j} ( Σ_{ω∈C_i} f(ω) P(ω) / P(C_i) ) ( P(C_i) / P(B_j) )
          = Σ_{C_i ⊆ B_j} ( Σ_{ω∈C_i} f(ω) P(ω) ) ( 1 / P(B_j) )
          = Σ_{ω∈B_j} f(ω) P(ω) / P(B_j)
          = h(ω) .

The first line (5) is a convenient way to think about the partition produced by the σ-algebra G. The partition elements C_i play the role of elementary outcomes ω ∈ Ω. The partition P plays the role of the probability space Ω. Instead of P(ω), you have P(C_i). If the function g is measurable with respect to G, then g has the same value for each ω ∈ C_i, so you might as well call this g(C_i). And of course, since g is constant on C_i, if ω ∈ C_i, then g(ω) = E[g|C_i].


We justify (5), for ω ∈ B_j, using

    h'(ω) = E[g|B_j]
          = Σ_{ω'∈B_j} g(ω') P(ω'|B_j)
          = Σ_{C_i ⊆ B_j} ( Σ_{ω'∈C_i} g(ω') P(ω') ) ( 1 / P(B_j) )
          = Σ_{C_i ⊆ B_j} g(C_i) ( Σ_{ω'∈C_i} P(ω') ) ( 1 / P(B_j) )
          = Σ_{C_i ⊆ B_j} g(C_i) P(C_i) / P(B_j) ,

which is the first line of (5) because g(C_i) = E[g|C_i] and P(C_i)/P(B_j) = P(C_i|B_j).
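The tower property can also be verified numerically. Here is a sketch on an eight-point toy space, with made-up probabilities and function values, where the partition for G refines the partition for F exactly as in the proof above.

```python
# Numeric verification of the tower property E[ E[f|G] | F ] = E[f|F] on a
# toy discrete space. Probabilities and f are made up for the check.

P = [0.05, 0.10, 0.15, 0.20, 0.05, 0.15, 0.10, 0.20]   # P(w), w = 0..7
f = [3.0, -1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
part_F = [[0, 1, 2, 3], [4, 5, 6, 7]]        # coarse partition (for F)
part_G = [[0, 1], [2, 3], [4, 5], [6, 7]]    # refinement (for G)

def cond_exp(values, partition):
    """E[values | partition]: constant on each partition element B."""
    g = [0.0] * len(values)
    for B in partition:
        pB = sum(P[w] for w in B)
        eB = sum(values[w] * P[w] for w in B) / pB
        for w in B:
            g[w] = eB
    return g

h_direct = cond_exp(f, part_F)                        # E[f | F]
h_two_stage = cond_exp(cond_exp(f, part_G), part_F)   # E[ E[f|G] | F ]
print(max(abs(x - y) for x, y in zip(h_direct, h_two_stage)))
```

The printed difference is zero up to rounding, matching the calculation (5).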

4 Markov chains

This section, like the previous two, lacks examples. You might want to read the next section together with this one for examples.

A Markov chain is a stochastic process where the present is all the information about the past that is relevant for predicting the future. The σ-algebra definitions in the previous sections express these ideas easily. Here is a definition of the natural filtration F_n. Let x_[1:T] and x'_[1:T] be two paths in the path space, Ω. Suppose n ≤ T. We say the paths are indistinguishable at time n if x_k = x'_k for k = 1, 2, ..., n. This definition of indistinguishability gives rise to a partition of Ω, with two paths being in the same partition element if they are indistinguishable. The σ-algebra corresponding to this partition is F_n. A function f(x_[1:T]) is measurable with respect to F_n if it is determined by the first n states (x_1, ..., x_n). More precisely, if x_k = x'_k for k = 1, 2, ..., n, then f(x_[1:T]) = f(x'_[1:T]). The σ-algebra that knows only the value of x_n is G_n. A path function is measurable with respect to G_n if and only if it is determined by the value x_n alone.

Let Ω be the path space and P(·) a probability distribution on Ω. Then P(·) has the Markov property if, for all x ∈ S and n = 1, ..., T − 1,

    P( X_{n+1} = x | F_n ) = P( X_{n+1} = x | G_n ) .                       (6)

Unwinding all the definitions, this is the same as saying that for any path up to time n, (x_1, ..., x_n),

    P(X_{n+1} = x | X_1 = x_1, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n) .

You might complain that we have defined conditional expectation but not conditional probability in (6). The answer is a trick for defining probability from expectation. The indicator function of an event A is 1_A(ω), which is equal to 1 if ω ∈ A and 0 otherwise. Then P(A) = E[1_A]. In particular, if A = {ω}, then, using the notation slightly incorrectly, P(ω) = E[1_ω]. This applies to conditional expectation too: P(ω|F) = E[1_ω|F]. But you should be alert to the fact that the latter statement is more complicated, in that both sides are functions on Ω (measurable with respect to F) rather than just numbers. The simple notation hides the complexity.

The probabilities in a Markov chain are determined by transition probabilities, which are the numbers defined by the right side of (6). The probability P(X_{n+1} = x | G_n) is measurable with respect to G_n, which means that it is a function of x_n, which we call y to simplify notation. The transition probabilities are

    p_{n,yx} = P( X_{n+1} = x | X_n = y ) .                                 (7)

You can remember that it is p_{n,yx} instead of p_{n,xy} by saying that p_{n,yx} = P(y → x) is the probability of a y to x transition in one step.

You can use the definition (7) even if the process does not have the Markov property. What is special about Markov chains is that the numbers (7) determine all other probabilities. For example, we will show that

    P( X_{n+2} = x | X_n = z ) = Σ_{y∈S} P( X_{n+2} = x | X_{n+1} = y ) P( X_{n+1} = y | X_n = z ) .   (8)

What is behind this, besides the Markov property, is a general fact about conditioning. If A, B, and C are any three events, then

    P( A ∩ B | C ) = P( A | B ∩ C ) P( B | C ) .

Take C = {X_n = z} and B = {X_{n+1} = y}, and sum over y. The factor P(B|C) is P(X_{n+1} = y | X_n = z). The Markov property makes P(A | B ∩ C) = P(X_{n+2} = x | X_{n+1} = y), which gives (8).

There is something in the spirit of (8) that is crucial for the backward equation below. This is that the Markov property applies to the whole future path, not just one step into the future. For example, (8) implies that P(X_{n+2} = x | F_n) = P(X_{n+2} = x | G_n). The same line of reasoning justifies the stronger statement that for any x_{n+1}, ..., x_T,

    P( X_{n+1} = x_{n+1}, ..., X_T = x_T | F_n ) = Π_{k=n}^{T−1} p_{k, x_k, x_{k+1}} .

Going one more step, consider a function f that depends only on the future: f = f(x_{n+1}, x_{n+2}, ..., x_T). Then

    E[ f | F_n ] = E[ f | G_n ] .                                           (9)

If you like thinking about σ-algebras, you could say this by defining the future σ-algebra, H_n, that is determined only by information in the future. Then (9) holds for any f that is measurable with respect to H_n.

A Markov chain is homogeneous if the transition probabilities do not depend

on n. Most of the Markov chains that arise in modeling are homogeneous.

Much of the theory is for homogeneous Markov chains. From now on, unless we

explicitly say otherwise, we will assume that a Markov chain is homogeneous.

Transition probabilities will be pyx for every n.

The forward equation is the equation that describes how probabilities evolve over time in a Markov chain. Last week we saw that we could evolve the mean and variance of a linear Gaussian discrete time process (X_{n+1} = A X_n + B Z_n) using μ_{n+1} = A μ_n and C_{n+1} = A C_n A^t + B B^t. This determines X_{n+1} ~ N(μ_{n+1}, C_{n+1}) from the information X_n ~ N(μ_n, C_n). It is possible to formulate a more general forward equation that gives the distribution of X_{n+1} even if X_n is not Gaussian. But we do not need that here.

Let p_{yx} be the transition probabilities of a discrete state space Markov chain. Let u_n(y) = P(X_n = y). The forward equation is a formula for the numbers u_{n+1} in terms of the numbers u_n. The derivation is simple:

    u_{n+1}(x) = P( X_{n+1} = x )
               = Σ_{y∈S} P( X_{n+1} = x | X_n = y ) P( X_n = y )
               = Σ_{y∈S} u_n(y) p_{yx} .                                    (10)

The step from the first to second line uses what is sometimes called the law of total probability. The terms are rearranged in the last line for the following reason ...

We reformulate the Markov chain forward equation in matrix/vector terms. Suppose the state space is finite and S = {x_1, ..., x_m}. We will say "state j" instead of "state x_j", etc. We write u_{n,j} = P(X_n = j) instead of u_n(x_j) = P(X_n = x_j). We collect the probabilities into a row vector u_n = (u_{n,1}, ..., u_{n,m}). The transition matrix, P, is the m × m matrix of transition probabilities. The (i, j) entry of P is p_{ij} = P(i → j) = P(X_{n+1} = j | X_n = i). The forward equation is

    u_{n+1,j} = Σ_{i=1}^m u_{n,i} p_{ij} ,

or, in matrix form,

    u_{n+1} = u_n P .                                                       (11)

The row vector u_{n+1} is the product of the row vector u_n and the transition matrix P. It is a tradition to make u_n a row vector and put it on the left. You will come to appreciate the wisdom of this unusual choice over the coming weeks.
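The row-vector convention of (11) is exactly how the iteration looks in code. Here is a sketch on a made-up 3-state chain (the transition probabilities are arbitrary, chosen only to be stochastic).

```python
import numpy as np

# The forward equation u_{n+1} = u_n P (11): a row vector times the matrix.

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])   # made-up stochastic matrix

u = np.array([1.0, 0.0, 0.0])    # X_1 = state 0 with probability 1
for n in range(10):
    u = u @ P                    # u_{n+1} = u_n P
print(u, u.sum())                # u stays a probability vector
```

Note that `u @ P` multiplies from the left, matching (11); `P @ u` would be the backward-equation direction instead.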


Once we have a linear algebra formulation, many tools of linear algebra become available. For example, powers of the transition matrix trace the evolution of u over several steps:

    u_{n+2} = u_{n+1} P = (u_n P) P = u_n P^2 .

Clearly u_{n+k} = u_n P^k for any k. This means we can study the evolution of Markov chain probabilities using the eigenvalues and eigenvectors of the transition matrix P.

The other major equation is the backward equation, which propagates con-

ditional expectations backward in time. Take Fn to be the natural filtration

generated by the path up to time n. Take f (x[1:T ] ) = V (xT ). This is a final

time payout. The function is completely determined by the state of the system

at time T . We want to characterize fn = E[f |Fn ] and see how to calculate

it using V and P . The characterization comes from the Markov property (9),

which implies that fn is a function of xn . The backward equation determines

this function.

The backward equation for Markov chains follows from the tower property (4). Use the transition probabilities (7). Since E[f_{n+1} | F_n] is measurable with respect to G_n, the expression E[f_{n+1} | F_n](x_n) makes sense:

    f_n(x_n) = E[ f_{n+1}(X_{n+1}) | F_n ]
             = Σ_{x_{n+1}∈S} P( X_{n+1} = x_{n+1} | X_n = x_n ) f_{n+1}(x_{n+1})
             = Σ_{x_{n+1}∈S} p_{x_n x_{n+1}} f_{n+1}(x_{n+1}) .

Writing x for x_n and y for x_{n+1}, this is

    f_n(x) = Σ_{y∈S} p_{xy} f_{n+1}(y) .                                    (12)

The backward equation gets its name from the fact that it determines f_n from f_{n+1}. If you think of n as a time variable, then time runs backwards for the equation. To find the solution, you start with the final condition f_T(x) = V(x), then compute f_{T−1} using (12), and continue.

The backward equation may be expressed in matrix/vector terms. As we did when doing this for the forward equation, we suppose the state space is S = {1, 2, ..., m}. Then we define the column vector f_n ∈ R^m whose components are f_{n,j} = f_n(j). The elements of the transition matrix are p_{ij} = P(i → j). The right side of (12) is the matrix/vector product P f_{n+1}, so the equation is

    f_n = P f_{n+1} .                                                       (13)

The forward and backward equations use the same matrix P, but the forward equation multiplies from the left by a row vector of probabilities, while the backward equation multiplies from the right by a column vector of conditional expectations.
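Equation (13) is the mirror image of the forward iteration in code: march backward from the final condition f_T = V. The chain and the payout below are made up for illustration.

```python
import numpy as np

# The backward equation f_n = P f_{n+1} (13): start from f_T = V and march
# backward in time.

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])   # made-up stochastic matrix
V = np.array([0.0, 1.0, 4.0])    # made-up final payout V(x)

f = V.copy()
for n in range(5):               # computes f_{T-1}, f_{T-2}, ..., f_{T-5}
    f = P @ f                    # f_n = P f_{n+1}
print(f)

# With payout identically 1, every f_n is 1, since P 1 = 1.
print(P @ np.ones(3))
```

After k backward steps, f equals P^k V, the k-step conditional expectation of the payout.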

The transition matrix of a homogeneous Markov chain is an m × m matrix. You can ask which matrices arise in this way. A matrix that can be the transition matrix for a Markov chain is called a stochastic matrix. There are two obvious properties that characterize stochastic matrices. The first is that p_{ij} ≥ 0 for all i and j. The transition probabilities are probabilities, and probabilities cannot be negative. The second is that

    Σ_{j=1}^m p_{ij} = 1   for all i = 1, ..., m.                           (14)

If X_n = i, then X_{n+1} is one of the states 1, 2, ..., m. Therefore, the probabilities for the landing states (the state at time n + 1) add up to one.

If you know P is a stochastic matrix, you know two things about its eigenvalues. One of those things is that λ = 1 is an eigenvalue. The proof of this is to give the corresponding eigenvector, which is the column vector of all ones: 1 = (1, 1, ..., 1)^t. If g = P1, then the components of g are, using (14),

    g_i = Σ_{j=1}^m p_{ij} 1_j = Σ_{j=1}^m p_{ij} = 1 ,

for all i. This shows that g = P1 = 1. This result is natural in the Markov chain setting. The statement f_{n+1}(x_j) = E[V(X_T) | F_{n+1}] = 1 for all x_j ∈ S means that given the information in F_{n+1}, the expected value equals 1 no matter what. But F_n has less information. All you know at time n is that in the next step you will go to a state where the expected value is 1. But this makes the expected value at time n equal to 1 already.

The other thing you know is that if λ is an eigenvalue of P then |λ| ≤ 1. This is a consequence of the maximum principle for P, which we now explain. Suppose f ∈ R^m is any vector and g = P f. The maximum principle for P is

    max_i g_i ≤ max_j f_j .                                                 (15)

Some simple reasoning about P proves this. First observe that if h ∈ R^m and h_j ≥ 0 for all j, then

    (P h)_i = Σ_{j=1}^m p_{ij} h_j ≥ 0   for all i.

This is because all the terms on the right, p_{ij} and h_j, are non-negative. Now let M = max_j f_j. Then define h by h_j = M − f_j, so that M 1 = f + h. Since P(M 1) = M 1, we know g_i + (P h)_i = M. Since (P h)_i ≥ 0, this implies that g_i ≤ M, which is the statement (15). Using similar arguments,

    |g_i| ≤ Σ_j p_{ij} |f_j| ≤ ( Σ_j p_{ij} ) max_j |f_j| ,

you can show that even if f is complex,

    max_i |g_i| ≤ max_j |f_j| .                                             (16)

An eigenvector equation P f = λ f with |λ| > 1 would violate (16) by a factor of |λ|.
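Both eigenvalue facts are easy to confirm numerically on a small example. The matrix below is made up, subject only to being stochastic.

```python
import numpy as np

# Check numerically: lambda = 1 is an eigenvalue of a stochastic matrix
# (with eigenvector 1), and every eigenvalue satisfies |lambda| <= 1.

P = np.array([[0.5,  0.5,  0.0],
              [0.25, 0.5,  0.25],
              [0.0,  0.5,  0.5]])   # made-up stochastic matrix
lam = np.linalg.eigvals(P)
print(np.sort(np.abs(lam)))         # all magnitudes at most 1
print(P @ np.ones(3))               # P 1 = 1, so lambda = 1 is an eigenvalue
```

The row sums being 1 is exactly the statement P1 = 1, so λ = 1 appears no matter which stochastic matrix you try.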

A stochastic matrix is called ergodic if λ = 1 is a simple eigenvalue (it occurs just once) and every other eigenvalue satisfies |λ| < 1. A Markov chain is ergodic if its transition matrix is ergodic. (Warning: the true definition of ergodicity applies to Markov chains. There is a theorem stating that if S is finite, then the Markov chain is ergodic if and only if the eigenvalues of P satisfy the conditions above. Our definitions are not completely wrong, but they might be misleading.) Most of our examples are ergodic.

Last week we studied the issue of whether the Markov process (a linear Gaussian process last week, a discrete Markov chain this week) has a statistical steady state that it approaches as n → ∞. You can ask the same question about discrete Markov chains. A probability distribution, π, is stationary, or steady state, or statistical steady state, if u_n = π implies u_{n+1} = π. That is the same as saying that X_n ~ π implies X_{n+1} ~ π. The forward equation (11) implies that a stationary probability distribution must satisfy the equation π = π P. This says that π is a left eigenvector of P with eigenvalue λ = 1. We know there is at least one left eigenvector with eigenvalue λ = 1 because λ = 1 is a right eigenvalue with eigenvector 1.

If S is finite, it is a theorem that the chain is ergodic if and only if it satisfies both of the following conditions:

(i) There is a unique row vector π with π = π P and Σ_i π_i = 1.

(ii) For any starting distribution u_1, we have u_n → π as n → ∞.

Without discussing these issues thoroughly, you can see the relation between these theorems about probabilities and the statements about eigenvalues of P above. If P has a unique eigenvalue equal to one and the rest less than one, then u_{n+1} = u_n P has u_n → π (the eigenvector), as n → ∞. It might be that u_n → cπ, but we know c = 1 because Σ_i u_{n,i} = 1, and similarly for π.

This section discusses Markov chains where the states are integers in some range and the only transitions are i → i − 1, i → i, or i → i + 1. The non-zero transition probabilities are a_i = P(i → i−1) = p_{i,i−1}, and b_i = P(i → i) = p_{i,i}, and c_i = P(i → i+1) = p_{i,i+1}. These are called random walks, particularly if S = Z and the transition probabilities are independent of i. A reflecting random walk is one where moves to i < 0 are blocked (reflected). In this case the state space is the non-negative integers S = Z_+, and a_0 = 0, and b_0 = a + b. For i > 0, a_i = a, b_i = b, and c_i = c. You can interpret the transition probabilities at i = 0 as saying that a proposed transition 0 → −1 is rejected, leading the state to stay at i = 0. You cannot tell the difference between a pure random walk and a reflecting walk until the walker hits the reflecting boundary at i = 0. You could get a random walk on a finite state space by putting a reflecting boundary also at i = m − 1 (chosen so that S = {0, 1, ..., m−1} and |S| = m). The transition probabilities at the right boundary would be c_{m−1} = 0, b_{m−1} = b + c, and a_{m−1} = a. These probabilities reject proposed m−1 → m transitions.

The phrase birth/death process is sometimes used for random walks on the state space Z_+ with transition probabilities that depend in a more general way on i. The term birth/death comes from the idea that X_n is the number of animals at time n. Then a_i is the probability that one dies, c_i is the probability that one is born, and b_i = 1 − a_i − c_i is the probability that the population does not change. You might think that i = 0 would be an absorbing state, in that P(0 → 1) = 0 because no animals can be born if there are no animals. Birth/death processes do not necessarily assume this. They permit storks.

Urn processes are a family of probability models that give rise to Markov chains on the state space {0, ..., m}. An urn is a large clay jar. In urn processes, we talk about one or more urns, each with balls of one or more colors. Here is an example that has one urn, two colors (red and blue), and a probability, q. At each stage there are m balls in the urn. Some are red and the rest blue. You choose one ball from the urn, each with the same probability of being chosen (a well mixed urn). You replace that ball with a new one whose color is red with probability q and blue with probability 1 − q. Let X_n be the number of red balls at time n. In this process X_{n+1} = X_n − 1 if you choose a red ball and replace it with a blue ball. If you choose a blue ball and replace it with a red ball, then X_{n+1} = X_n + 1. If you replace red with red, or blue with blue, then X_{n+1} = X_n.
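The description above pins down the one-step probabilities: the chain moves down exactly when a red ball is drawn (probability i/m) and replaced by blue (probability 1 − q), and up when a blue ball is drawn and replaced by red. Here is a sketch that builds the resulting tridiagonal transition matrix; the function name and the values of m and q are ours.

```python
import numpy as np

# Transition matrix for the urn chain: m balls, X_n = number of red balls.
# Down: draw red (i/m) and put back blue (1-q). Up: draw blue and put back red.

def urn_transition_matrix(m, q):
    P = np.zeros((m + 1, m + 1))      # states 0, 1, ..., m red balls
    for i in range(m + 1):
        down = (i / m) * (1.0 - q)    # probability of i -> i - 1
        up = (1.0 - i / m) * q        # probability of i -> i + 1
        if i > 0:
            P[i, i - 1] = down
        if i < m:
            P[i, i + 1] = up
        P[i, i] = 1.0 - down - up     # probability of i -> i
    return P

P = urn_transition_matrix(4, 0.5)
print(P.sum(axis=1))                  # rows sum to 1: a stochastic matrix
```

From state m (all red) the draw is surely red, so the chain moves down with probability 1 − q; from state 0 it moves up with probability q, so neither endpoint is absorbing when 0 < q < 1.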

A matrix is tridiagonal if all its entries are zero when |i − j| > 1. Its non-zeros are on the main diagonal, the one super-diagonal, and one sub-diagonal. This section is about Markov chains whose transition matrix is tridiagonal. Consider, for a simple example, the random walk with a reflecting boundary at i = 0 and i = 5. Suppose the transition probabilities are a = 1/2, and b = c = 1/4. The transition matrix is the 6 × 6 matrix

        ( 3/4  1/4   0    0    0    0  )
        ( 1/2  1/4  1/4   0    0    0  )
    P = (  0   1/2  1/4  1/4   0    0  )                                    (17)
        (  0    0   1/2  1/4  1/4   0  )
        (  0    0    0   1/2  1/4  1/4 )
        (  0    0    0    0   1/2  1/2 )

There is always confusion about how to number the first row and column.

Markov chain people like to start the numbering with i = 0, as we did above. The tradition in linear algebra is to start numbering with i = 1. These notes try to keep the reader on her or his toes by doing both, starting with i = 1 when describing the entries in P, and starting with i = 0 when describing the Markov chain transition probabilities. The transition matrix P above has p_12 = 1/4, which indicates that if X_n = 1, there is a 25% chance that X_{n+1} = 2. Since X_n is not allowed to go lower than 1 or higher than 2, the rest of the probability must be to stay at i = 1, which is why p_11 = 3/4. In the first column we have p_21 = 1/2, which is the probability of a 2 → 1 transition. From state i = 2 there are three possible transitions: 2 → 1 with probability 1/2 as we just said, 2 → 2 with probability 1/4, and 2 → 3 with probability 1/4. The row sums of this matrix are all equal to one, and so are most of the column sums. But the first column sum is p_11 + p_21 = 5/4 > 1 and the last is p_56 + p_66 = 3/4 < 1.

My computer says that

P² = [ 0.6875  0.2500  0.0625  0       0       0
       0.5000  0.3125  0.1250  0.0625  0       0
       0.2500  0.2500  0.3125  0.1250  0.0625  0
       0       0.2500  0.2500  0.3125  0.1250  0.0625
       0       0       0.2500  0.2500  0.3125  0.1875
       0       0       0       0.2500  0.3750  0.3750 ] .

For example, p^{(2)}_{4,2} = 1/4, which says that the probability of going from 4 to 2 in two hops is 1/4. The only way to do that is Xn = 4 → Xn+1 = 3 → Xn+2 = 2. The probability of this is p43 p32 = (1/2)(1/2) = 1/4. The probability of 3 → 3 in two steps is p^{(2)}_{33} = 5/16. The three paths that do this, with their probabilities, are P(3 → 2 → 3) = (1/2)(1/4) = 2/16, and P(3 → 3 → 3) = (1/4)(1/4) = 1/16, and P(3 → 4 → 3) = (1/4)(1/2) = 2/16. These add up to 5/16. The matrix P² is the transition matrix for the Markov chain that says "take two hops with the P chain". Therefore its row sums should equal 1, as p^{(2)}_{11} + p^{(2)}_{12} + p^{(2)}_{13} = 11/16 + 4/16 + 1/16 = 1.
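These matrix computations are easy to reproduce. The following sketch (my addition, not part of the notes; it uses numpy) builds the transition matrix (17), squares it, and checks the entries discussed above.

```python
import numpy as np

# The transition matrix (17), with linear algebra indexing i = 1, ..., 6.
P = np.array([
    [3/4, 1/4, 0,   0,   0,   0  ],
    [1/2, 1/4, 1/4, 0,   0,   0  ],
    [0,   1/2, 1/4, 1/4, 0,   0  ],
    [0,   0,   1/2, 1/4, 1/4, 0  ],
    [0,   0,   0,   1/2, 1/4, 1/4],
    [0,   0,   0,   0,   1/2, 1/2],
])

P2 = P @ P                   # two-step transition probabilities
row_sums = P2.sum(axis=1)    # every row of P^2 should sum to 1
p2_42 = P2[3, 1]             # p^(2)_{4,2}, with indices shifted down by 1
p2_33 = P2[2, 2]             # p^(2)_{3,3}
```

Printing P2 reproduces the matrix above; p2_42 = 1/4 and p2_33 = 5/16 match the path counting argument.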

For large n, my computer says that

P^n = [ 0.5079  0.2540  0.1270  0.0635  0.0317  0.0159
        0.5079  0.2540  0.1270  0.0635  0.0317  0.0159
        0.5079  0.2540  0.1270  0.0635  0.0317  0.0159          (18)
        0.5079  0.2540  0.1270  0.0635  0.0317  0.0159
        0.5079  0.2540  0.1270  0.0635  0.0317  0.0159
        0.5079  0.2540  0.1270  0.0635  0.0317  0.0159 ] .

These numbers have the form p^{(∞)}_{i,j} = 2^{−j}/s, where s = Σ_{j=1}^{6} 2^{−j}. The formula for s comes from the requirement that the row sums of p^{(∞)} are 1. The fact that p^{(∞)}_{i,j+1}/p^{(∞)}_{i,j} = 1/2 comes from the following theory. Think of starting with X0 = i. Then u_{1,j} = P(X1 = j | X0 = i) = p_{ij}. Similarly, u_{n,j} = P(Xn = j | X0 = i) = p^{(n)}_{ij}. If this P is ergodic (it is), then u_{n,j} → π_j as n → ∞. This implies that p^{(n)}_{ij} → π_j as n → ∞ for any i. The P^n in (18) seems to fit that, at least in the fact that all the rows are the same because the limit of p^{(n)}_{ij} is independent of i.


The equation π = πP determines π. For our P in (17), these equations are

π1 = π1 (3/4) + π2 (1/2)
π2 = π1 (1/4) + π2 (1/4) + π3 (1/2)
    ...
π5 = π4 (1/4) + π5 (1/4) + π6 (1/2)
π6 = π5 (1/4) + π6 (1/2) .

You can check that a solution is π_j = c 2^{−j}. The value of c comes from the normalization Σ_j π_j = 1.
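A quick numerical check (my addition, using numpy) confirms that π_j = c·2^{−j} solves π = πP, and that it reproduces the entries 0.5079, 0.2540, ... of (18).

```python
import numpy as np

P = np.array([
    [3/4, 1/4, 0,   0,   0,   0  ],
    [1/2, 1/4, 1/4, 0,   0,   0  ],
    [0,   1/2, 1/4, 1/4, 0,   0  ],
    [0,   0,   1/2, 1/4, 1/4, 0  ],
    [0,   0,   0,   1/2, 1/4, 1/4],
    [0,   0,   0,   0,   1/2, 1/2],
])

# Candidate stationary distribution pi_j = c 2^{-j}; normalizing fixes
# c = 1/s with s = sum_{j=1}^{6} 2^{-j} = 63/64.
pi = 2.0 ** -np.arange(1, 7)
pi /= pi.sum()

residual = np.abs(pi @ P - pi).max()          # pi P = pi should hold
limit_rows = np.linalg.matrix_power(P, 500)   # every row approaches pi
```

The first entry is π_1 = (1/2)/(63/64) = 32/63 ≈ 0.5079, matching (18).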

Week 3

Continuous time Gaussian processes

Jonathan Goodman

September 24, 2012

This week we take the limit Δt → 0. The limit is a process Xt that is defined for all t in some range, such as t ∈ [0, T]. The process takes place in continuous time. This week, Xt is a continuous function of t. The process has continuous sample paths. It is natural to suppose that the limit of a Markov process is a continuous time Markov process. The limits we obtain this week will be either Brownian motion or the Ornstein Uhlenbeck process. Both of these are Gaussian. We will see how such processes arise as the limits of discrete time Gaussian processes (week 1) or discrete time random walks and urn processes (week 2).

The scalings of random processes are different from the scalings of differentiable paths you see in ordinary Δt → 0 calculus. Consider a small but non-zero Δt. The net change in X over that interval is ΔX = X_{t+Δt} − X_t. If a path has a well defined velocity, Vt = dX/dt, then ΔX ≈ Vt Δt. Mathematicians say that ΔX is on the order of Δt, because ΔX is approximately proportional to Δt for small¹ Δt. In this linear scaling, reducing Δt by a factor of 2 (say) reduces ΔX approximately by the same factor of 2.

Brownian motion, and the Ornstein Uhlenbeck process, have more complicated scalings. There is one scaling for ΔX, and a different one for E[ΔX]. The change itself, ΔX, is on the order of √Δt. If Δt is small, this is larger than the Δt scaling differentiable processes have. Brownian motion moves much further in a small amount of time than differentiable processes do. The change in expected value is smaller, on the order of Δt. It is impossible for the expected value to change by order √Δt, because then the total change in the expected value over a finite time interval would be infinite. The Brownian motion manages to have ΔX on the order of √Δt through cancellation. The sign of ΔX goes back and forth, so that the net change is far smaller than the sum of |ΔX| over many small intervals of time. That is |ΔX₁ + ΔX₂ + ···| ≪ |ΔX₁| + |ΔX₂| + ···.

¹ This terminology is different from scientists' "order of magnitude", which means roughly a power of ten. It does not make sense to compare ΔX to Δt in the order of magnitude sense because they have different units.

Brownian motion and the Ornstein Uhlenbeck process are Markov processes. The standard filtration consists of the family of σ-algebras, Ft, which are generated by X_{[0,t]} (the path up to time t). The Markov property for Xt is that the conditional probability of X_{[t,T]}, conditioning on all the information in Ft, is determined by Xt alone. The infinitesimal mean, or infinitesimal drift, is E[ΔX | Ft], in the limit Δt → 0. The infinitesimal variance is var(ΔX | Ft). We will see that both of these scale linearly with Δt as Δt → 0. This allows us to define the infinitesimal drift coefficient, μ(Xt), and the infinitesimal variance coefficient, σ²(Xt), by

μ(Xt) Δt ≈ E[ΔX | Ft] ,        (1)
σ²(Xt) Δt ≈ var(ΔX | Ft) ,     (2)

in the limit of small Δt. The conditional expectation with respect to a σ-algebra requires the left side to be a function that is measurable with respect to Ft. The Xt that appears on the left sides is consistent with this. The Markov property says that only Xt can appear on the left sides, because the right sides are statements about the future of Ft, which depend on Xt alone.

The properties (1) and (2) are central to this course. This week, they tell us how to take the continuous time limit Δt → 0 of discrete time Gaussian Markov processes or random walks. More precisely, they tell us how a family of processes must be scaled with Δt to get a limit as Δt → 0. You choose the scalings so that (1) and (2) work out. The rest follows, as it does in the central limit theorem. The fact that continuous time limits exist may be thought of as an extension of the CLT. The infinitesimal mean and variance of the approximating processes determine the limiting process completely.

Brownian motion and Ornstein Uhlenbeck processes are characterized by their μ and σ². Brownian motion has constant μ and σ², independent of Xt. The standard Brownian motion has μ = 0, and σ² = 1, and X0 = 0. If μ ≠ 0, you have Brownian motion with drift. If σ² ≠ 1, you have a general Brownian motion. Brownian motion is also called the Wiener process, after Norbert Wiener. We often use Wt to denote a standard Brownian motion. The Ornstein Uhlenbeck process has constant σ, but a linear drift μ(Xt) = −γXt. Both Brownian motion and Ornstein Uhlenbeck are Gaussian processes. If σ² is a function of Xt or if μ is a nonlinear function of Xt, then Xt is unlikely to be Gaussian.

2 Kinds of convergence

Suppose we have a family of processes Xt t , and we want to take t ! 0

and find a limit process Xt . There are two kinds of convergence, distributional

convergence, and pathwise convergence. Distributional convergence refers to the

probability distribution of Xt t rather than the numbers. It is written with a

D

half arrow, Xt t * Xt as t ! 0, or possibly Xt t * Xt . The CLT is an

example of distributional convergence. If Z N (0, 1) and Yk are i.i.d., mean

2

zero, variance 1, then Xn = p1n converges to Z is distribution, which means that

the distribution of Xn converges to N (0, 1). But the numbers Xn have nothing

to do with the numbers Z, so we do not expect that Xn ! Z as n ! 1. We

D D

write Xn * N (0, 1) of Xn * Z as n ! 1, which is convergence in distribution.

Later in the course, starting in week 5, there will be examples of sequences

that converge pathwise.

Consider a linear Gaussian recurrence relation of the form (27) from week 1, but in the one dimensional case. We write this as

X_{n+1} = X_n + aX_n + bZ_n ,        (3)

where the Z_n are i.i.d. standard normals. We think of X_n as a discrete time process at time t_n = nΔt. We guess that we can substitute (3) into (1) and (2) to find the scalings of a and b with Δt. We naturally use F_n instead of F_t on the right. The result for (1) is

μ(X_n) Δt = aX_n .

Since μ(x) = −γx for the Ornstein Uhlenbeck process, we should take a = −γΔt. The a coefficient in the recurrence relation should scale linearly with Δt to get a finite, non-zero, drift coefficient in the continuous time limiting process.

Calibrating the noise coefficient gives scalings characteristic of continuous time stochastic processes. Inserting (3) into (2), with F_n for F_t, we find

σ²(X_n) Δt = b² .

Brownian motion and the Ornstein Uhlenbeck process have constant σ², which suggests the scaling b = σ√Δt. These results, put together, suggest that the way to approximate the Ornstein Uhlenbeck process by a discrete Gaussian recurrence relation is

X_{n+1} = X_n − γX_n Δt + σ√Δt Z_n .        (4)

Suppose Xt is an Ornstein Uhlenbeck process. We define a discrete time approximation to it using (4) and

X^{Δt}_{t_n} = X_n .

Note the inconsistent notation. The subscript on the left side of the equation refers to time, but the subscript on the right refers to the number of time steps. These are related by t_n = nΔt.

Let us assume for now that the approximation converges as Δt → 0 in the sense of distributions. There is much discussion of convergence later in the course. But assuming it converges, (4) gives a Monte Carlo method for estimating things about the Ornstein Uhlenbeck process. You can approximate the values of Xt for t between time steps using linear interpolation if necessary. If t_n < t < t_{n+1}, you can use the definition

X^{Δt}_t = X^{Δt}_{t_n} + ((t − t_n)/(t_{n+1} − t_n)) (X^{Δt}_{t_{n+1}} − X^{Δt}_{t_n}) .

The values X^{Δt}_{t_n} are defined by, say, (4). Now you can take the limit Δt → 0 and ask about the limiting distribution of X^{Δt}_{[0,T]} in path space. The limiting probability distribution is the distribution of Brownian motion or the Ornstein Uhlenbeck process.
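As an illustration of the Monte Carlo method, here is a sketch in numpy (my addition; the parameter values are illustrative choices, not from the notes). It simulates many paths of the recurrence (4) and compares the sample mean and variance at time T with the continuous time theory developed later: the mean decays like e^{−γt} and the variance relaxes toward σ²/(2γ).

```python
import numpy as np

# Monte Carlo with the recurrence (4):
#   X_{n+1} = X_n - gamma*X_n*dt + sigma*sqrt(dt)*Z_n,  Z_n ~ N(0, 1).
rng = np.random.default_rng(0)
gamma, sigma = 1.0, 1.0
X0, T, dt = 1.0, 2.0, 0.01
n_steps, n_paths = round(T / dt), 40_000

X = np.full(n_paths, X0)
for _ in range(n_steps):
    X += -gamma * X * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

mean_T = X.mean()   # theory: X0 * exp(-gamma*T)
var_T = X.var()     # theory: (sigma^2 / (2*gamma)) * (1 - exp(-2*gamma*T))
```

Both sample statistics agree with the exponential formulas up to statistical and O(Δt) discretization error.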

4 Brownian motion

Many of the most important properties of Brownian motion follow from the

limiting process described in Section 3.

4.1 Independent increments

The increment of Brownian motion is Wt − Ws. We often suppose t > s, but in many formulas it is not strictly necessary. We could consider the increment of a more general process, which would be Xt − Xs. The increment is the net change in W over the time interval [s, t].

Consider two time intervals that do not overlap: [s1, t1] and [s2, t2], with s1 ≤ t1 ≤ s2 ≤ t2. The independent increments property of Brownian motion is that increments over non-overlapping intervals are independent. The random variables X1 = W_{t1} − W_{s1} and X2 = W_{t2} − W_{s2} are independent. The intervals are allowed to touch endpoints, which would be t1 = s2, but they are not allowed to have any interior in common. The cases s1 = t1 and s2 = t2 are allowed but trivial.

The independent increments property applies to any number of non-overlapping intervals. If s1 ≤ t1 ≤ s2 ≤ t2 ≤ ···, then the corresponding increments, X1, X2, X3, . . ., are an independent family of random variables. Their joint PDF is a product.

The approximate sample paths for standard Brownian motion are given by (4) with γ = 0 and σ = 1. The exact Brownian motion distribution in path space is the limit of the distributions of the approximating paths, as discussed in Section 3. Suppose that the interval endpoints are time step times, such as s1 = t_{k1}, t1 = t_{l1}, and so on. The increment of W in the interval [s1, t1] is determined by the random variables Zj for s1 ≤ tj < t1. These Zj are independent for non-overlapping intervals of time. In the discrete approximation there may be small dependences because one Δt time step variable Zj is a member of, say, [s1, t1] and [s2, t2]. This overlap disappears in the limit Δt → 0. The possible dependence between the random variables disappears too.


4.2 Mean and variance

The mean of a Brownian motion increment is zero:

E[Wt − Ws] = 0 .        (5)

The variance of a Brownian motion increment is equal to the size of the time interval:

var(Wt − Ws) = E[(Wt − Ws)²] = t − s .        (6)

This is an easy consequence of (4) with γ = 0 and σ = 1. If s = t_k and t = t_n, then the increment is

W^{Δt}_{t_n} − W^{Δt}_{t_k} = √Δt (Z_k + ··· + Z_{n−1}) .

Since the Z_j are independent, the variance is

Δt (1 + 1 + ··· + 1) = Δt (n − k) = t_n − t_k = t − s .

The independent increments property upgrades these simple statements to conditional expectations. Suppose Fs is the σ-algebra that knows about W_{[0,s]}. This is the σ-algebra generated by W_{[0,s]}. This part of the Brownian motion path is determined by the increments of Brownian motion in the interval [0, s]. But all of these are independent of Wt − Ws. The increment Wt − Ws is independent of any information in Fs. In particular, we have the conditional expectations

E[Wt − Ws | Fs] = 0 .        (7)

The conditional variance of a Brownian motion increment is equal to the size of the time interval:

var(Wt − Ws | Fs) = E[(Wt − Ws)² | Fs] = t − s .        (8)

The formula (7) is called the martingale property. It can be expressed as

E[Wt | Fs] = Ws .        (9)

To understand this, recall that the left side is a function of the path that is known in Fs. The value Ws qualifies; it is determined (trivially) by the path W_{[0,s]}. The variance formula (8) may be re-expressed in a similar way:

E[Wt² | Fs] = E[(Ws + [Wt − Ws])² | Fs]
            = E[Ws² + 2Ws[Wt − Ws] + [Wt − Ws]² | Fs]
            = Ws² + 2Ws E[Wt − Ws | Fs] + E[(Wt − Ws)² | Fs]
E[Wt² | Fs] = Ws² + (t − s) .

We used the martingale property, and the fact that a number can be pulled out of the expectation if it is known in Fs.
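These increment statistics are easy to test against the random walk approximation. The following check (my addition) simulates √Δt partial sums of independent normals and verifies (5), (6), and the independence of non-overlapping increments through the sample correlation.

```python
import numpy as np

# Random walk approximation to standard Brownian motion: W at the grid
# times is sqrt(dt) times a cumulative sum of independent N(0,1) variables.
rng = np.random.default_rng(1)
dt, n_steps, n_paths = 0.01, 200, 30_000
W = np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, n_steps)), axis=1)

inc1 = W[:, 99]                  # increment over [0, 1]
inc2 = W[:, 199] - W[:, 99]      # increment over [1, 2], non-overlapping

mean2 = inc2.mean()              # near 0, as in (5)
var2 = inc2.var()                # near t - s = 1, as in (6)
corr = np.corrcoef(inc1, inc2)[0, 1]   # independence: correlation near 0
```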


5 Ornstein Uhlenbeck process

The Ornstein Uhlenbeck process is the continuous time analogue of a scalar Gaussian discrete time recurrence relation. Let Xt be a process that satisfies (1) and (2) with μ(Xt) = −γXt and constant σ². Suppose u(x, t) is the probability density for Xt. Since u is the limit of Gaussians, as we saw in Section 3, u itself should be Gaussian. Therefore, u(x, t) is completely determined by its mean and variance. We give arguments that are derived from those in Week 1 to find the mean and variance.

The mean is simple:

μ_{t+Δt} = E[X_{t+Δt}]
         = E[Xt + ΔX]
         = E[Xt] + E[E[ΔX | Ft]]
         = μt + E[−γXt Δt + (smaller)]
         = μt − γE[Xt] Δt + (smaller)
μ_{t+Δt} = μt − γμt Δt + (smaller) .

The last line shows that

∂t μt = −γμt .        (10)

This is the analogue of (28) from Week 1.

An orthogonality property of conditional expectation makes the variance calculation easy. Suppose F is a σ-algebra, X is a random variable, and Y = E[X|F]. Then

var(X) = E[(X − Y)²] + var(Y) .        (11)

The main step in the proof is to establish the simpler formula

E[X²] = E[(X − Y)²] + E[Y²] .        (12)

The formula (12) implies (11). If μ = E[X], then var(X) = E[(X − μ)²], and E[X − μ | F] = Y − μ. So we get (11) by applying (12) with X − μ instead of X.

Note that

E[X²] = E[([X − Y] + Y)²]
      = E[(X − Y)²] + 2E[(X − Y)Y] + E[Y²] .

The formula (12) follows from the orthogonality relation that, in turn, depends on the tower property and the fact that E[Y|F] = Y:

E[(X − Y)Y] = E[ E[(X − Y)Y | F] ]
            = E[ E[X − Y | F] Y ]
            = E[ (Y − Y)Y ]
            = 0 .

To summarize, we just showed that X − E[X|F] is orthogonal to E[X|F] in the sense that E[(X − Y)Y] = 0. The formula (12) is the Pythagorean relation that follows from this orthogonality. The variance formula (11) is just the mean zero case of this Pythagorean relation.
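A simulation makes the Pythagorean relation concrete. In this sketch (my example, not from the notes) F is the σ-algebra generated by a four-valued random variable G, and Y = E[X|F] is computed as the within-group sample mean; the identity (11) then holds to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
G = rng.integers(0, 4, size=n)       # the information in F
X = G + rng.standard_normal(n)       # X depends on G plus independent noise

# Y = E[X | F]: within each group G = g, replace X by the group sample mean.
group_means = np.array([X[G == g].mean() for g in range(4)])
Y = group_means[G]

lhs = X.var()                            # var(X)
rhs = np.mean((X - Y) ** 2) + Y.var()    # E[(X - Y)^2] + var(Y)
```

With the empirical conditional mean, the orthogonality E[(X − Y)Y] = 0 holds exactly within each group, so lhs and rhs agree to floating point accuracy, not just statistically.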

The variance calculation for the Ornstein Uhlenbeck process uses the Pythagorean relation (11) in Ft. The basic mean value relation (1) may be re-written for Ornstein Uhlenbeck as E[X_{t+Δt} | Ft] = Xt − γXt Δt + (smaller). Then

σ²_{t+Δt} = var(X_{t+Δt})
          = E[(X_{t+Δt} − Xt + γXt Δt + (smaller))²] + var(Xt − γXt Δt + (smaller))
          = σ²Δt + (1 − γΔt)² var(Xt) + (smaller)
          = σ²_t + σ²Δt − 2γσ²_t Δt + (smaller) .

This implies that σ²_t satisfies the scalar continuous time version of (29) from Week 1, which is

∂t σ²_t = σ² − 2γσ²_t .        (13)

Both (10) and (13) have simple exponential solutions. We see that μt goes to zero at the exponential rate γ, while σ²_t goes to σ²/(2γ) at the exponential rate 2γ. The probability density of Xt is

u(x, t) = (1/√(2πσ²_t)) e^{−(x−μt)²/(2σ²_t)} .        (14)

This distribution has a limit as t → ∞ that is the statistical steady state for the Ornstein Uhlenbeck process.


Week 4

Brownian motion and the heat equation

Jonathan Goodman

October 1, 2012

A diffusion process is a Markov process in continuous time with a continuous state space and continuous sample paths. This course is largely about diffusion processes. Partial differential equations (PDEs) of diffusion type are important tools for studying diffusion processes. Conversely, diffusion processes give insight into solutions of diffusion type partial differential equations. We have seen two diffusion processes so far, Brownian motion and the Ornstein Uhlenbeck process. This week, we discuss the partial differential equations associated with these two processes.

We start with the forward equation associated with Brownian motion. Let Xt be a standard Brownian motion with probability density u(x, t). This probability density satisfies the heat equation, or diffusion equation, which is

∂t u = ½ ∂²x u .        (1)

This PDE allows us to solve the initial value problem. Suppose s is a time and the probability density of Xs, u(x, s), is known. Then (1) determines u(x, t) for t ≥ s. The initial value problem has a solution for more or less any initial condition u(x, s). If u(x, s) is a probability density, you can find u(x, t) for t > s by: first choosing Xs ∼ u(·, s), then letting Xt for t > s be a Brownian motion. The probability density of Xt, u(x, t), satisfies the heat equation. By contrast, the heat equation generally cannot be run backwards. If you give a probability density u(x, s), there probably is no function u(x, t) defined for t < s that satisfies the heat equation for t < s and the specified values u(x, s). Running the heat equation backwards is ill posed.¹

The Brownian motion interpretation provides a solution formula for the heat equation:

u(x, t) = (1/√(2π(t − s))) ∫_{−∞}^{∞} e^{−(x−y)²/(2(t−s))} u(y, s) dy .        (2)

¹ Stating a problem or task is posing the problem. If the task or mathematical problem has no solution that makes sense, the problem is poorly stated, or ill posed.


This formula may be expressed more abstractly as

u(x, t) = ∫_{−∞}^{∞} G(x − y, t − s) u(y, s) dy ,        (3)

where

G(x, t) = (1/√(2πt)) e^{−x²/(2t)}

is called the fundamental solution, or the heat kernel, or the transition density. You recognize it as the probability density of a Gaussian with mean zero and variance t. This is the probability density of Xt if X0 = 0 and X is standard Brownian motion.

We can think of the function u(x, t) as an abstract vector and write it u(t). We did this already in Week 2, where the occupation probabilities u_{n,j} = P(Xn = j) were thought of as components of the row vector u_n. The solution formula (3) produces the function u(t) from the data u(s). We write this abstractly as

u(t) = G(t − s)u(s) .        (4)

The operator, G, is something like an infinite dimensional matrix. The abstract expression (4) is shorthand for the more concrete formula (3), just as matrix multiplication is shorthand for the actual sums involved. The particular operators G(t) have the semigroup property

G(t) = G(t − s)G(s) ,        (5)

as long as t, s, and t − s are all positive.² This is because u(t) = G(t)u(0), and u(s) = G(s)u(0), so u(t) = G(t − s)[G(s)u(0)] = [G(t − s)G(s)]u(0).
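In terms of kernels, the semigroup property is the statement that heat kernels convolve: ∫ G(x − y, t − s)G(y, s) dy = G(x, t). Here is a numerical check of that identity (my addition), using a simple Riemann sum.

```python
import numpy as np

def G(x, t):
    """Heat kernel: Gaussian density with mean zero and variance t."""
    return np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)

s, t, x = 0.3, 1.0, 0.7
y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]

# Convolution of the kernel at time s with the kernel at time t - s.
conv = np.sum(G(x - y, t - s) * G(y, s)) * dy
exact = G(x, t)
err = abs(conv - exact)
```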

You can make a long list of ways the heat equation helps understand the behavior of Brownian motion. We can write formulas for hitting probabilities by writing solutions of (1) that satisfy the correct boundary conditions. This will allow us to explain the simulation results in question (4c) of assignment 3. You do not have to understand probability to check that a function u(x, t) satisfies the heat equation, only calculus.

The backward equation for Brownian motion is

∂t f + ½ ∂²x f = 0 .        (6)

The PDE (6) allows you to move backwards in time to determine values of f

² A mathematical group is a collection of objects that you can multiply and invert, like the group of invertible matrices of a given size. A semigroup allows multiplication but not necessarily inversion. If the operators G(t) were defined for t < 0 and the formula (5) still applied, then the operators would form a group. Our operators are only half a group because they are defined only for t ≥ 0.


for t < T from f(T). The backward equation differs from the forward equation only by a sign, but this is a big difference. Moving forward with the backward equation is just as ill posed as moving backward with the forward. In particular, suppose you have a desired function f(x, 0) and you want to know what function V(x) gives rise to it using (7). Unless your function f(0) is very special (details below), there is no V at all.

This section describes the heat equation and some of its solutions. This will help us understand Brownian motion, both qualitatively (general properties) and quantitatively (specific formulas).

The heat equation is used to model things other than probability. For example, it can model the flow of heat in a metal rod. Here, u(x, t) is the temperature at location x at time t. The temperature is modeled by ∂t u = D ∂²x u, where the diffusion coefficient, D, depends on the material (metal, stone, ..) and the units (seconds, days, centimeters, meters, degrees C, ..). The heat equation (1) has the value D = ½. Changing units, or rescaling, or non-dimensionalizing, can replace D with ½. For example, you can use t′ = 2Dt, or x′ = x/√(2D).

The heat flow picture suggests that heat will flow from high temperature to low temperature regions. The fluctuations in u(x, t) will smooth out and relax over time as the heat redistributes itself. The total amount of heat in an interval [a, b] at time t is³

∫_a^b u(x, t) dx .

You understand the flow of heat by differentiating with respect to time and using the heat equation:

(d/dt) ∫_a^b u(x, t) dx = ∫_a^b ∂t u(x, t) dx = ½ ∫_a^b ∂²x u(x, t) dx = ½ (∂x u(b, t) − ∂x u(a, t)) .

Defining the heat flux

F(x, t) = −½ ∂x u(x, t) ,        (8)

puts this into the conservation form

(d/dt) ∫_a^b u(x, t) dx = ∫_a^b ∂t u(x, t) dx = F(a, t) − F(b, t) .        (9)

The heat flux (8) is the rate at which heat is flowing across x at time t. If F is positive, heat flows from left to right. The specific formula (8) is Fick's law, which says that heat flows downhill, toward lower temperature, at a rate proportional to the temperature gradient. If ∂x u > 0, then heat flows from right to left, in the direction opposite the temperature gradient. The conservation equation (9) gives the rate of change of the amount of heat in [a, b] as the rate of flow in, F(a, t), minus the rate of flow out, F(b, t). Of course, either of these numbers could be negative.

³ If you are one of those people who knows the technical distinction between heat and temperature, I say choose units of temperature in which the specific heat is one.

The heat equation has a family of solutions that are exponential in space (the x variable). These are

u(x, t) = A(t) e^{ikx} .        (10)

Calculating the time and space derivatives, this ansatz satisfies the heat equation if (here Ȧ = dA/dt)

Ȧ e^{ikx} = ½ (−k²) A e^{ikx} .

We cancel the common exponential factor and see that (10) is a solution if

Ȧ = −½ k² A .

This leads to A(t) = A(0) e^{−k²t/2}, and

u(x, t) = e^{ikx} e^{−k²t/2} .

The real part,

v(x, t) = cos(kx) e^{−k²t/2} ,

gives insight into what the heat equation does. A large k, which is a high wave number, or (less accurately) frequency, leads to rapid decay, e^{−k²t/2}. This is because positive and negative heat is close together and does not have to diffuse far to cancel out.

Another function that satisfies the heat equation is

u(x, t) = t^{−1/2} e^{−x²/(2t)} .        (11)

To verify this, compute the derivatives:

∂t u = −½ t^{−3/2} e^{−x²/(2t)} + t^{−1/2} (x²/(2t²)) e^{−x²/(2t)} = −½ t^{−1} u + (x²/(2t²)) u ,

and

∂x u = −(x/t) u ,        ∂²x u = −(1/t) u + (x²/t²) u .

This shows that ∂t u does equal ½ ∂²x u. This solution illustrates the spreading of heat. The maximum of u is 1/√t, which is at x = 0. This is large for small t and goes to zero as t → ∞. We see the characteristic width by writing u as

u(x, t) = (1/√t) e^{−(1/2)(x/√t)²} .

This gives u(x, t) as a function of the similarity variable x/√t, except for the overall scale factor 1/√t outside. Therefore, the characteristic width is on the order of √t. This is the order of the distance in x you have to go to get from the maximum value (at x = 0) to, say, half the maximum value.

The heat equation preserves total heat in the sense that

(d/dt) ∫_{−∞}^{∞} u(x, t) dx = 0 .        (12)

You can check by direct integration that the Gaussian solution (11) satisfies this global conservation. But there is a dumber way. The total mass of the bump shaped Gaussian heat distribution (11) is roughly equal to the height multiplied by the width of the bump. The height is t^{−1/2} and the width is t^{1/2}. The product is a constant.
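The height-times-width argument can be backed up numerically. This check (my addition) integrates the Gaussian solution (11) on a large grid at two different times; both integrals equal √(2π).

```python
import numpy as np

x = np.linspace(-60, 60, 120_001)
dx = x[1] - x[0]

def total_heat(t):
    """Integral over x of the solution (11), u = t^{-1/2} exp(-x^2/(2t))."""
    return np.sum(t ** -0.5 * np.exp(-x**2 / (2 * t))) * dx

h1, h4 = total_heat(1.0), total_heat(4.0)
target = np.sqrt(2 * np.pi)
```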

There are methods for building general solutions of the heat equation from particular solutions such as the plane wave (10) or the Gaussian (11). The heat equation PDE is linear, which means that if u1(x, t) and u2(x, t) are solutions, then u(x, t) = c1 u1(x, t) + c2 u2(x, t) is also a solution. This is the superposition principle. The graph of c1 u1 + c2 u2 is the superposition (one on top of the other) of the graphs of u1 and u2. The equation is translation invariant, or homogeneous in space and time, which means that if u(x, t) is a solution, then v(x, t) = u(x − x0, t − t0) is a solution. The equation has the scaling property that if u(x, t) is a solution, then u_λ(x, t) = u(λx, λ²t) is a solution. This scaling relation is "one power of t is two powers of x", or x² scales like t.

Here are some simple illustrations. You can put a Gaussian bump of any height and with any center:

(c/√t) e^{−(x−x0)²/(2t)} .

You can combine (superpose) them to make multibump solutions:

u(x, t) = (c1/√t) e^{−(x−x1)²/(2t)} + (c2/√t) e^{−(x−x2)²/(2t)} .

The k = 1 plane wave u(x, t) = sin(x)e^{−t/2} may be rescaled to give the general plane wave: u_λ(x, t) = sin(λx)e^{−λ²t/2}, which is the same as (10). Changing the length scale by a factor of λ changes the time scale, which is the decay rate in this case, by a factor of λ². The Gaussian solutions are self similar in the sense that u_λ(x, t) = C_λ u(x, t). The exponent calculation is x²/t → (λ²x²)/(λ²t).


The solution formula (2) is an application of the superposition principle, with integrals instead of sums. We explain it here, and take s = 0 for simplicity. The total heat (or total probability, or total mass, depending on the interpretation) of the Gaussian bump (11) is √(2π). You can see that simply by taking t = 1. It is simpler to work with a Gaussian solution with total mass equal to one. When you center the normalized bump at a point y, you get

(1/√(2πt)) e^{−(x−y)²/(2t)} .        (13)

As t → 0, this solution concentrates all its heat in a collapsing neighborhood of y. Therefore, it is the solution that results from an initial condition that concentrates a unit amount of heat at the point y. This is expressed using the Dirac delta function as u(x, t) → δ(x − y) as t → 0. It shows that the normalized, centered Gaussian (13) is the solution to the initial value problem for the heat equation with initial condition u(x, 0) = δ(x − y). More generally, the formula (c/√(2πt)) e^{−(x−y)²/(2t)} says what happens at later times to an amount c of heat at y at time zero. For a general initial heat distribution u(y, 0), the amount of heat in a dy neighborhood of y is u(y, 0)dy. This contributes (u(y, 0)/√(2πt)) e^{−(x−y)²/(2t)} dy to the solution u(x, t). We get the total solution by adding all these contributions. The result is

u(x, t) = ∫_{y=−∞}^{∞} u(y, 0) (1/√(2πt)) e^{−(x−y)²/(2t)} dy .

This is the formula (2).

We argue that the probability density of Brownian motion satisfies the heat equation (1). Suppose u0(x) is a probability density and we choose X0 ∼ u0. Suppose we then start a Brownian motion path from X0. Then Xt − X0 ∼ N(0, t) and the joint density of X0 and Xt is

u(x0, xt, t) = u0(x0) (1/√(2πt)) e^{−(xt−x0)²/(2t)} .

The probability density of Xt is the integral of the joint density:

u(xt, t) = ∫ u(x0, xt, t) dx0 = ∫ u0(x0) (1/√(2πt)) e^{−(xt−x0)²/(2t)} dx0 .

If you substitute y for x0 and x for xt, you get (2). This shows that the probability density of Xt is equal to the solution of the heat equation evaluated at time t.

If Xt is a stochastic process with continuous sample paths, the hitting time for a closed set A is τ_A = min{t | Xt ∈ A}. This is a random variable because the hitting time depends on the path. For one dimensional Brownian motion starting at X0 = 0, we define τ_a = min{t | Xt = a}. Let f_a(t) be the probability density of τ_a. We will find formulas for f_a(t) and the survival probability S_a(t) = P(τ_a > t). Clearly f_a(t) = −∂t S_a(t); the survival probability is (up to a constant) the negative of the CDF of τ_a.

There are two related approaches to hitting times and survival probabilities for Brownian motion in one dimension. One uses the heat equation with a boundary condition at a. The other uses the Kolmogorov reflection principle. The reflection principle seems simpler, but it has two drawbacks. One is that I have no idea how Kolmogorov could have discovered it without first doing it the hard way, with the PDE. The other is that the PDE method is more general.

P(Xt 2 [x, x + dx]|a > t) = ua (x, t)dx . (14)

Stopped Brownian motion gives a dierent description of the same thing. This

is

Xt if t < a

Yt =

a if t a

The process moves with Xt until X touches a, then it stops. The density (14)

is the density of Yt except at x = a, where the Y density has a component.

Another notation for stopped Brownian motion uses the wedge notation t ^ s =

min(t, s). The formula is Yt = Xt^a . If t < a , this gives Yt = Xt . If t a ,

this gives Yt = Xa , which is a because a is the hitting time of a.

The probability density u_a(x, t) satisfies the heat equation except at a. We do not give a proof of this, only some plausibility arguments. First (see this week's homework assignment), it is true for a stopped random walk approximation to stopped Brownian motion. Second, if Xt is not at a and has not been stopped, then it acts like ordinary Brownian motion, at least for a short time. In particular,

u_a(x, t + Δt) ≈ ∫ G(x − y, Δt) u_a(y, t) dy ,

if Δt is small. The right side satisfies the heat equation, so the left side should as well if Δt is small.

The density satisfies the boundary condition u_a(x, t) → 0 as x → a. This would be the same as u_a(a, t) = 0 if we knew that u_a was continuous (it is, but we didn't show it). The boundary condition u_a(a, t) = 0 is called an absorbing boundary condition because it represents the physical fact that particles that touch a get stuck and do not re-enter the region x ≠ a. We will not give a proof that the density for stopped Brownian motion satisfies the absorbing boundary condition, but we give two plausibility arguments. The first is that it is true in the approximating stopped random walk. The second involves the picture of Brownian motion as constantly moving back and forth. It (almost) never moves in the same direction for a positive amount of time. If Xt = a, then (almost surely) there are times t1 < t and t2 < t so that X_{t1} > a and X_{t2} < a. The closer y is to a, the less likely it is that Xs ≠ a for all s < t.

Accepting the two above claims, we can find hitting probabilities by finding solutions of the heat equation with absorbing boundary conditions. Let us assume that X0 = 0 and the absorbing boundary is at a > 0. We want a function u_a(x, t) that is defined for x ≤ a, that satisfies the initial condition u_a(x, t) → δ(x) as t → 0 (for x < a), and the absorbing boundary condition u_a(a, t) = 0. The trick that does this is the method of images from physics. A point x < a has an image point, x′ > a, that is the same distance from a. The image point is x′ = a + (a − x) = 2a − x. If x < a, then x′ > a and |x − a| = |x′ − a|, and x′ → a as x → a. The density function u_a(x, t) starts out defined only for x ≤ a. The trick is to extend the definition of u_a beyond a by odd reflection. That is,

u_a(2a − x, t) = −u_a(x, t) .        (15)

A function that is odd in this sense and continuous must vanish at x = a, as you see by letting x → a from either direction. The only direction we originally cared about was from x < a, but the other is true also.

We create an odd solution of the heat equation by taking odd initial data. We know u_a(x, 0) needs a point mass at x = 0. To make the initial data odd, we add a negative point mass at the image of 0, which is x = 2a. The resulting initial data is

u_a(x, 0) = δ(x) − δ(x − 2a) .

The initial data has changed, but the part for x ≤ a is the same. The solution is the superposition of the pieces from the two delta functions:

u_a(x, t) = (1/√(2πt)) e^{−x²/(2t)} − (1/√(2πt)) e^{−(x−2a)²/(2t)} .        (16)

This function satisfies all three of our requirements. It has the right initial data, at least for x ≤ a. It satisfies the heat equation for all x ≤ a. It satisfies the heat equation also for x > a, which is interesting but irrelevant. It satisfies the absorbing boundary condition: it is a continuous function of x for t > 0 and has u_a(a, t) = 0.
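The formula (16) is easy to check numerically. The following Python sketch (not part of the notes; the grid size and the cutoff at x = −8 are arbitrary choices) evaluates the image formula, confirms the absorbing boundary condition u_a(a, t) = 0, and integrates the density over the continuation region to get a survival probability strictly between 0 and 1.

```python
import numpy as np

def u_a(x, t, a):
    """Method-of-images density (16) for Brownian motion absorbed at a > 0."""
    g = lambda y: np.exp(-y**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return g(x) - g(x - 2 * a)   # point mass at 0 minus image charge at 2a

a, t = 1.0, 0.5
x = np.linspace(-8.0, a, 20001)          # grid on the continuation region x <= a
dx = x[1] - x[0]
S = np.sum(u_a(x, t, a)) * dx            # survival probability (17), numerically

print(u_a(a, t, a))                       # 0: the absorbing boundary condition
print(S)                                  # between 0 and 1: some paths were absorbed
```

The survival probability computed this way agrees with 1 − 2P(W_t > a) from the reflection argument later in the section.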

The formula (16) answers many questions about stopped Brownian motion and absorbing boundaries. The survival probability at time t is

S_a(t) = ∫_{−∞}^{a} u_a(x, t) dx . (17)

The survival probability is positive as long as there is probability left on surviving paths. You can check that the method of images formula (16) has u_a(x, t) > 0 if x < a. The probability density of the hitting time τ_a is

f_a(t) = −(d/dt) S_a(t)
       = −∫_{−∞}^{a} ∂_t u_a(x, t) dx
       = −∫_{−∞}^{a} ½ ∂_x² u_a(x, t) dx

f_a(t) = −½ ∂_x u_a(a, t) . (18)

Without using the specific formula (16) we know the right side of (18) is positive. That is because u_a(x, t) goes from positive values for x < a to zero when x = a. That makes ∂_x u_a(a, t) negative (at least not positive) and f_a(t) positive (at least not negative). The formula (18) reinforces the interpretation (see (8)) of −½ ∂_x u as a probability flux. It is the rate at which probability leaves the continuation region, x < a.

The formula for the hitting time probability density is found by differentiating (16) with respect to x and setting x = a. The two terms from the right turn out to be equal. The result is

f_a(t) = (1/√(2π)) (a / t^{3/2}) e^{−a²/2t} . (19)

The reader is invited to verify by explicit integration that

(1/√(2π)) ∫_0^∞ (a / t^{3/2}) e^{−a²/2t} dt = 1 .
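As a numerical counterpart to that integration exercise, here is a rough check (with an arbitrary truncation of the time axis; the heavy t^{−3/2} tail means the truncated mass falls noticeably short of 1 until the analytic tail is added back):

```python
import numpy as np

def f_a(t, a):
    """First-passage density (19): a / (sqrt(2 pi) t^{3/2}) * exp(-a^2 / 2t)."""
    return a / (np.sqrt(2 * np.pi) * t**1.5) * np.exp(-a**2 / (2 * t))

a = 1.0
t = np.linspace(1e-4, 2000.0, 4_000_000)
dt = t[1] - t[0]
mass = np.sum(f_a(t, a)) * dt                 # integral of (19) up to t = 2000
tail = 2 * a / np.sqrt(2 * np.pi * 2000.0)    # int_T^inf a/(sqrt(2 pi) t^{3/2}) dt
print(mass, mass + tail)                      # truncated mass, then roughly 1
```

The slow convergence of the truncated integral is exactly the t^{−3/2} tail behavior discussed next.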

This illustrates some features of Brownian motion. Look at the formula as t → 0. The exponent has t in the denominator, so f_a(t) → 0 exponentially as t → 0. It is extremely, exponentially, unlikely for X_t to hit a in a short time. The probability starts being significantly different from zero when the exponent is not a large negative number, which is when t is on the order of a². This is the (time) = (length)² aspect of Brownian motion.

Now look at the formula as t → ∞. The exponent converges to zero, so f_a(t) ≈ C t^{−3/2}. (We know the constant, but it just gets in the way.) This integrates to a statement about the survival probability

S_a(t) = ∫_t^∞ f_a(t') dt' ≈ C ∫_t^∞ t'^{−3/2} dt' = C t^{−1/2} .

The probability to survive a long time goes to zero as t → ∞, but slowly, as 1/√t.

The maximum of a Brownian motion up to time t is

M_t = max_{0≤s≤t} X_s .

The hitting time formulas above also give formulas for the distribution of M_t. Let G_t(a) = P(M_t ≤ a) be the CDF of M_t. This is nearly the same as a survival probability. Suppose X_0 = 0 and a > 0 as above. Then the event M_t < a is the same as X_s < a for all s ∈ [0, t], which is the same as τ_a > t. Let g_t(a) be the PDF of M_t. Then g_t(a) = (d/da) G_t(a). Since G_t(a) = S_a(t), we can find g by differentiating the formula (17) with respect to a and using (16). This is not hard, but it is slightly involved because S_a(t) depends on a in two ways: the limit of integration and the integrand u_a(x, t).

There is another approach through the Kolmogorov reflection principle. This is a re-interpretation of the survival probability integral (17). Start with the observation that

∫_{−∞}^{∞} (1/√(2πt)) e^{−x²/2t} dx = 1 .

The integral (17) is less than 1 for two reasons. One reason is that the survival probability integral omits the part of the above integral from x > a. The other is the negative contribution from the image charge. It is obvious (draw a picture) that

∫_a^∞ (1/√(2πt)) e^{−x²/2t} dx = ∫_{−∞}^{a} (1/√(2πt)) e^{−(x−2a)²/2t} dx . (20)

We derived this formula by calculating integrals. But once we see it we can look for a simple explanation.

The simple explanation given by Kolmogorov depends on two properties of Brownian motion: it is symmetric (as likely to go up by ΔX as down by ΔX), and it is Markov (after it hits level a, it continues as a Brownian motion starting at a). Let P_a be the set of paths that reach the level a before time t. The reflection principle is the symmetry condition that

P( X_t > a | X_{[0,t]} ∈ P_a ) = P( X_t < a | X_{[0,t]} ∈ P_a ) .

This says that a path that touches level a at some time < t is equally likely to be outside at time t (X_t > a) as inside (X_t < a). If τ_a is the hitting time, then X_{τ_a} = a. If τ_a < t, then the probabilities for the path from time τ_a to time t are symmetric about a. In particular, the probabilities to be above a and below a are the same. A more precise version of this argument would say that if s < t, then

P( X_t > a | τ_a = s ) = P( X_t < a | τ_a = s ) ,

then integrate over s in the range 0 ≤ s ≤ t. But it takes some mathematical work to define the conditional probabilities, since P(τ_a = s) = 0, so you cannot use Bayes' rule directly. Anyway, the reflection principle says that exactly half of the paths (half in the sense of probability) that ever touch the level a are above level a at time t. That is exactly (20).
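The reflection principle is easy to test by simulation. This sketch (all parameters arbitrary; the discrete-time maximum slightly undercounts paths that cross a between grid points, so the left side comes out a bit low) compares P(τ_a ≤ t) with 2P(X_t > a):

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, t, a = 200_000, 400, 1.0, 1.0
dt = t / n_steps

W = np.zeros(n_paths)
M = np.zeros(n_paths)
for _ in range(n_steps):
    W += rng.normal(0.0, np.sqrt(dt), size=n_paths)
    M = np.maximum(M, W)                 # running maximum of each path

lhs = np.mean(M >= a)                    # P(path touches a before t), approximately
rhs = 2 * np.mean(W > a)                 # reflection principle: 2 P(X_t > a)
print(lhs, rhs)                          # both near 2(1 - Phi(1)) = 0.3173
```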


5 Backward equation for Brownian motion

The backward equation is a PDE satisfied by conditional probabilities. Suppose there is a reward function V(x) and you receive V(X_T) depending on the value of a Brownian motion path. The value function is the conditional expectation of the reward given a location at time t < T:

f(x, t) = E[ V(X_T) | X_t = x ] . (21)

There are other common notations for this. The expression E_P[·] means that the expectation is taken with respect to the probability distribution P in the subscript. For example, if Y ∼ N(μ, σ²), we might write

E_{μ,σ²}[ e^Y ] = e^{μ + σ²/2} .

We write E_{x,t}[·] for expectation with respect to paths X with X_t = x. The value function in this notation is

f(x, t) = E_{x,t}[ V(X_T) ] .

A third equivalent way uses the filtration associated with X, which is F_t. The random variable E[·|F_t] is a function of X_{[0,t]}. The Markov property simplifies X_{[0,t]} to X_t if the random variable depends only on the future of t. Therefore E[ V(X_T) | F_t ] is a function of X_t, which we call f(x, t). Therefore the following definition is equivalent to (21):

f(X_t, t) = E[ V(X_T) | F_t ] . (22)

The backward equation satisfied by f may be derived using the tower property. This can be used to compare f(·, t) to f(·, t + Δt) for small Δt. The physics behind this is that ΔX will be small too, so f(x, t) can be determined from f(x + Δx, t + Δt), at least approximately, using Taylor series. These relations become exact in the limit Δt → 0.

The σ-algebra F_{t+Δt} has a little more information than F_t. Therefore, if Y is any random variable,

E[ E[ Y | F_{t+Δt} ] | F_t ] = E[ Y | F_t ] .

We apply this general principle with Y = V(X_T) and make use of (22), which leads to

E[ f(X_{t+Δt}, t + Δt) | F_t ] = f(X_t, t) .

We write X_{t+Δt} = X_t + ΔX and expand f(X_{t+Δt}, t + Δt) in a Taylor series:

f(X_{t+Δt}, t + Δt) = f(X_t, t)
  + ∂_x f(X_t, t) ΔX
  + ∂_t f(X_t, t) Δt
  + ½ ∂_x² f(X_t, t) ΔX²
  + O(|ΔX|³) + O(|ΔX| Δt) + O(Δt²) .

The three remainder terms on the last line are the sizes of the three lowest order Taylor series terms left out. Now take the expectation of both sides conditioning on F_t and pull out of the expectation anything that is known in F_t:

E[ f(X_{t+Δt}, t + Δt) | F_t ] = f(X_t, t)
  + ∂_x f(X_t, t) E[ ΔX | F_t ]
  + ∂_t f(X_t, t) Δt
  + ½ ∂_x² f(X_t, t) E[ ΔX² | F_t ]
  + O( E[ |ΔX|³ | F_t ] ) + O( E[ |ΔX| | F_t ] Δt ) + O(Δt²) .

The two terms on the top line are equal because of the tower property. The next line is zero because Brownian motion is symmetric and E[ ΔX | F_t ] = 0. For the fourth line, use the independent increments property and the variance of Brownian motion increments to get E[ ΔX² | F_t ] = Δt. We also know the scaling relations E[ |ΔX| ] = C Δt^{1/2} and E[ |ΔX|³ ] = C Δt^{3/2}. Put all of these in and cancel the leading power of Δt:

0 = ∂_t f(X_t, t) + ½ ∂_x² f(X_t, t) + O(Δt^{1/2}) .

We can find several explicit solutions to the backward equation that illustrate the properties of Brownian motion. One is f(x, t) = x² + T − t. This corresponds to the final condition V(X_T) = X_T². It tells us that if X_0 = 0, then E[ V(X_T) ] = E[ X_T² ] = f(0, 0) = T. This is the variance of standard Brownian motion.

Another well known calculation is the expected value of e^{aX_T} starting from X_0 = 0. For this, we want f(x, t) that satisfies (6) and the final condition f(x, T) = e^{ax}. We try the ansatz f(x, t) = A e^{ax − bt}. Putting this into the equation gives

−b A e^{ax−bt} + ½ a² A e^{ax−bt} = 0 ,

so b = a²/2. The final condition determines A:

e^{ax} = A e^{ax − a²T/2}  ⟹  A = e^{a²T/2} ,

which gives

f(x, t) = e^{ax + a²(T−t)/2} .

If X_0 = 0, we find E[ e^{aX_T} ] = e^{a²T/2}. We verify that this is the right answer by noting that Y = aX_T ∼ N(0, a²T).
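Both explicit solutions are easy to confirm by sampling W_T directly. A quick Monte Carlo sketch (the sample size and the value a = 0.7 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, a = 1.0, 0.7
WT = rng.normal(0.0, np.sqrt(T), size=1_000_000)       # samples of X_T, X_0 = 0

print(np.mean(WT**2), T)                               # f(0,0) = T for V(x) = x^2
print(np.mean(np.exp(a * WT)), np.exp(a**2 * T / 2))   # E[e^{a X_T}] = e^{a^2 T / 2}
```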

You can add boundary conditions to the backward equation to take into account absorbing boundaries. Suppose you get a reward W(t) if you first touch the barrier at time t, which is τ_a = t. Consider the problem: run a Brownian motion starting at X_0 = 0 to time τ_a ∧ T. For a > 0, the value function is defined for t ≤ T and x ≤ a. The final condition at t = T is f(x, T) = V(x) as before. The boundary condition at x = a is f(a, t) = W(t). A more precise statement of the boundary condition is f(x, t) → W(t) as x → a. This is similar to the boundary condition u(x, t) → 0 as x → a. As you approach a, your probability of not hitting a in a short amount of time goes to zero. This implies that as X_t → a, the conditional probability that τ_a > t + ε goes to zero. You might think that the survival probability calculations above would prove this. But those were based on the u(a, t) = 0 boundary condition, which we did not prove. It would be a circular argument.

Week 5
Integrals with respect to Brownian motion
Jonathan Goodman
October 7, 2012

This week starts the other calculus aspect of stochastic calculus: the limit Δt → 0 and the Ito integral. This is one of the most technical classes of the course. Look for applications in coming weeks. Brownian motion plays a new role this week, as a source of white noise that drives other continuous time random processes. Starting this week, W_t usually denotes standard Brownian motion, so that X_t can denote a different random process driven by W in some way. The driving white noise is written informally as dW_t.

White noise is a continuous time analogue of a sequence of i.i.d. random variables. Let Z_n be such a sequence, with E[ Z_n ] = 0 and E[ Z_n² ] = 1. These generate a random walk,

V_n = Σ_{k=0}^{n−1} Z_k , (1)

so V_{n+1} = V_n + Z_n. If the sequence V_n is given, then

Z_n = V_{n+1} − V_n . (2)

The discrete time independent increments property is the statement that the Z_n defined by (2) are independent. The discrete time analogue of the fact that Brownian motion is homogeneous in time is the statement that the Z_n are identically distributed.

I.i.d. noise processes cannot have general distributions in continuous time. A continuous time i.i.d. noise process, white noise, is Gaussian. The continuous time scaling limit for Brownian motion is

(1/√Δt) V_n → W_t in distribution, as Δt → 0 with t_n = nΔt and t_n → t. (3)

The CLT implies that W_t is Gaussian regardless of the distribution of the Z_n. White noise dW_t is Gaussian as well, in whatever way it makes sense.


In continuous time, it is simpler to define white noise from Brownian motion rather than the other way around. The continuous time analogue of (2) is to write dW_t as the source of noise. The continuous time analogue of (1) would be to define a white noise process Z_t somehow, then get Brownian motion as

W_t = ∫_0^t Z_s ds . (4)

The numbers W_t make sense as random variables and the path W_t is a continuous function of t. The numbers Z_t do not make sense in the same way.

The Ito integral with respect to Brownian motion is written

X_t = ∫_0^t f_s dW_s . (5)

The relation between X and W may be expressed informally in the Ito differential form

dX_t = f_t dW_t . (6)

The integrand, f, must be adapted to the filtration generated by W. If F_t is generated by the path W_{[0,t]}, then f_t must be measurable in F_t. The Ito integral is different from other stochastic integrals (e.g. Stratonovich) in that the increment dW_t is taken to be in the future of t and therefore independent of f_{[0,t]}. This implies that

E[ dX_t | F_t ] = f_t E[ dW_t | F_t ] = 0 , (7)

and

E[ dX_t² | F_t ] = f_t² E[ dW_t² | F_t ] = f_t² dt . (8)

The Ito integral is important because more or less any continuous time, continuous path stochastic process X_t can be expressed in terms of it. A martingale is a process with the mean zero property (7). More or less any such martingale can be represented as an Ito integral (5). This is in the spirit of the central limit theorem. In the continuous time limit, a process is determined by its mean and variance. If the mean is zero, it is only the variance, which is f_t².

The mathematics this week is reasonably precise yet not fully rigorous. You should be able to understand it even if you have not studied mathematical analysis. This material is not just for culture. You are expected to master it along with the rest of the course. If this were not possible, or not important, the material would not be here.

The approach taken here is not the standard approach using approximation by simple functions and the Ito isometry formula. You can find the standard approach in the book by Oksendal, for example. The standard approach is simpler but relies more on results from measure theory. The approach here will look almost the same as the standard approach if you do it completely rigorously, which we do not.


2 Pathwise convergence and the Borel-Cantelli lemma

Section 3 constructs a sequence of approximations to the Ito integral, X_t^m. This section is a description of some technical tools that can show that the X_t^m converge to a limit as m → ∞. What we describe is related to the standard Borel-Cantelli lemma but it is not the same. This section is written without the usual motivations. You may need to read it twice to see how things fit together.

Suppose a_m > 0 is a sequence of numbers with a finite sum

s = Σ_{m=1}^∞ a_m < ∞ . (9)

Define the partial sums and remainders

s_n = Σ_{m=1}^n a_m ,  r_n = Σ_{m>n} a_m .

Now suppose b_m is a sequence of numbers with |b_m| ≤ a_m. Consider the sum

x = Σ_{m=1}^∞ b_m . (10)

The sum converges absolutely if the a_m have a finite sum. Therefore (9) implies that x is well defined. The partial sums for (10) are

x_n = Σ_{m=1}^n b_m .

These satisfy

|x − x_n| = | Σ_{m>n} b_m | ≤ Σ_{m>n} a_m = r_n ,

so

x = lim_{n→∞} x_n = Σ_{m=1}^∞ b_m , with |x − x_n| ≤ r_n → 0 as n → ∞ . (11)


Suppose A_m is a sequence of non-negative random numbers. Typically, the A_m can be arbitrarily large, and so it might happen that S = Σ A_m = ∞. We hope to show that the probability that this happens is zero. The event S = ∞ is a measurable set, which in some sense means it is a possible outcome. But if P(S = ∞) = 0, you will never see that outcome. We say that an event D happens almost surely if P(D) = 1. This is abbreviated as a.s., as in "S < ∞ almost surely", or "S < ∞ a.s." Other expressions are a.e., for almost everywhere, and p.p., for presque partout (almost everywhere, in French).

Many people refuse to distinguish between outcomes that are impossible, which would be ω ∉ Ω, and events that have probability zero. We will be sloppy with the distinction in this class, and ignore it much of the time.

Our strategy will be to show that S < ∞ a.s. by showing that E[ S ] < ∞. That is,

Σ_{m=1}^∞ E[ A_m ] < ∞  ⟹  Σ_{m=1}^∞ A_m < ∞ a.s.

Suppose you can show that

E[ |X_t^{m+1} − X_t^m| ] ≤ a_m , with Σ_{m=1}^∞ a_m < ∞ , (12)

for all t ≤ T. Then you know that the following limit exists almost surely:

X_t = lim_{m→∞} X_t^m . (13)

This is our version of the Borel-Cantelli lemma. We calculate expected values to verify the hypothesis (12), then we conclude that the limit exists pathwise almost surely.

We use the following Riemann sum approximation for the Ito integral (5):

X_t^m = Σ_{t_j < t} f_{t_j} ΔW_j . (14)

The notation is

Δt = 2^{−m} , (15)
t_j = j Δt , (16)

W_t is a standard Brownian motion, and

ΔW_j = W_{t_{j+1}} − W_{t_j} . (17)

The pathwise convergence will be that for almost every Brownian motion path, the approximations (14) converge to a limit. This limit will be measurable in F_t because X_t is a function of W_{[0,t]}.


The Riemann sum approximation (14) needs lots of explanation. The Brownian motion increment (17) used at time t_j is in the future of t_j. We assume that f_{t_j} is measurable in F_{t_j}, so this makes ΔW_j independent of f_{t_j}. In particular,

E[ f_{t_j} ΔW_j | F_{t_j} ] = 0 , (18)

and

E[ (f_{t_j} ΔW_j)² | F_{t_j} ] = f_{t_j}² Δt . (19)

The Riemann sum definition (14) defines X_t^m for all t. It gives a path that is discontinuous at the times t_j. Sometimes it is convenient to re-define X_t^m by linear interpolation between t_j and t_{j+1} so that it is continuous. Those subtleties do not matter this week.
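The Riemann sum (14) and the adaptedness requirement can be written out concretely. In this sketch (the integrand f_s = W_s is chosen for comparison with the example in Section 4; all sizes are arbitrary) the integrand is evaluated at the left endpoint t_j, so it never sees the increment that multiplies it:

```python
import numpy as np

rng = np.random.default_rng(2)
t, m, n_samples = 1.0, 8, 20_000
dt = 2.0 ** -m
n = int(t / dt)

dW = rng.normal(0.0, np.sqrt(dt), size=(n_samples, n))     # increments Delta W_j
W_left = np.hstack([np.zeros((n_samples, 1)),
                    np.cumsum(dW, axis=1)[:, :-1]])        # W_{t_j}: left endpoints

X_m = np.sum(W_left * dW, axis=1)   # (14) with f_{t_j} = W_{t_j}, adapted by construction

print(np.mean(X_m))   # near 0, consistent with (18)
print(np.var(X_m))    # near int_0^1 E[W_s^2] ds = 1/2, consistent with (19)
```

Evaluating the integrand at the right endpoint instead would break adaptedness and change both numbers.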

We use the limit m → ∞ rather than Δt → 0. It is easy to compare the Δt_m = 2^{−m} approximation to the one with Δt_{m+1} = ½ Δt_m, as we will see. Moreover, taking Δt → 0 rapidly makes it easier for the sum (12) to converge.

We assume that the integrand f_t is continuous in some way. Specifically, we assume that if s > 0, then

E[ (f_{t+s} − f_t)² | F_t ] ≤ C s . (20)

Some of the integrands we use later in the course do not satisfy this hypothesis, but most are close. We will re-examine the conditions on f_t below to see what is really necessary.

The main step in the proof is the estimation of the terms in (12). The move from m to m + 1 replaces Δt_m by Δt_{m+1} = ½ Δt_m. We can write X_t^{m+1} in terms of the extended level-m notation t_{j+1/2} = (j + ½)Δt. For simplicity, we skip the t's and write f_{j+1/2} for f_{t_{j+1/2}}, and W_{j+1/2} for W_{t_{j+1/2}}, etc. Then

X_t^{m+1} = Σ_{t_j < t} [ f_{j+1/2} ( W_{j+1} − W_{j+1/2} ) + f_j ( W_{j+1/2} − W_j ) ] + Q .

The Q on the end is the term that may result from X_t^{m+1} having an odd number of terms in its sum. In that case, Q is the last term. It makes a negligible contribution to the sum. We subtract from X_t^{m+1} the X_t^m sum

X_t^m = Σ_{t_j < t} f_j ( W_{j+1} − W_j ) .

The result is

X_t^{m+1} − X_t^m = Σ_{t_j < t} ( f_{j+1/2} − f_j )( W_{j+1} − W_{j+1/2} ) + Q . (21)

The terms on the right side of (21) have mean zero. This implies that the sum has cancellations that may be hard to see if we take absolute values too soon. We find the cancellations by calculating the square and using the Cauchy-Schwarz inequality. In probability, a form of the Cauchy-Schwarz inequality is that if U and V are two random variables, then (proof in the next paragraph)

E[ U V ] ≤ √( E[ U² ] E[ V² ] ) .

Taking V = 1 gives

E[ |U| ] ≤ √( E[ U² ] ) .

Applying this to the square of (21) gives

E[ |X_t^{m+1} − X_t^m| ] ≤ a_m , where a_m² = E[ (X_t^{m+1} − X_t^m)² ] .

This is something we can calculate.

(Here is a proof of the Cauchy-Schwarz inequality in the form we need. The following quantity is non-negative for any λ:

0 ≤ E[ (U − λV)² ] = E[ U² ] − 2λ E[ U V ] + λ² E[ V² ] .

We minimize the right side by taking λ = E[ U V ] / E[ V² ]. Putting this in the first expression gives

0 ≤ E[ U² ] − E[ U V ]² / E[ V² ] .

Multiply through by E[ V² ] and you get Cauchy-Schwarz.)

Denote a typical term in the sum on the right of (21) as

Y_j = ( f_{j+1/2} − f_j )( W_{j+1} − W_{j+1/2} ) .

Then

E[ Y_j | F_{j+1/2} ] = ( f_{j+1/2} − f_j ) E[ W_{j+1} − W_{j+1/2} | F_{j+1/2} ] = 0 .

If k < j, then Y_k is known in F_j, so

E[ Y_k Y_j | F_j ] = Y_k E[ Y_j | F_j ] = 0 .

In the expected value of (Σ_j Y_j)² = Σ_{j,k} Y_j Y_k there are two kinds of terms. We just saw that off-diagonal terms, those with j ≠ k, have expected value equal to zero. A typical diagonal term has

E[ Y_j² | F_{j+1/2} ] = E[ ( f_{j+1/2} − f_j )² ( W_{j+1} − W_{j+1/2} )² | F_{j+1/2} ]
  = ( f_{j+1/2} − f_j )² E[ ( W_{j+1} − W_{j+1/2} )² | F_{j+1/2} ]
  = ( f_{j+1/2} − f_j )² Δt/2 .


Taking one more expectation and using (20) gives the desired inequality

E[ Y_j² | F_j ] = E[ ( f_{j+1/2} − f_j )² | F_j ] Δt/2 ≤ C Δt² .

Finally,

a_m² ≤ Σ_{t_j < t} C Δt² = C Δt Σ_{t_j < t} Δt ≤ C t Δt_m .

You can check that adding Q to this calculation does not change the conclusion. The last inequality may be written

a_m ≤ C √t √(Δt_m) ≤ C ρ^m ,

where ρ = 2^{−1/2} < 1. The sum in (12) becomes a convergent geometric series.

This completes the proof that the approximations (14) converge to something. We used the powers of two in two ways. First, it made it easy to compare X_t^m to X_t^{m+1}. Second, it made the sum on the right of (12) a convergent geometric series. In another week (which we will not do in this course), we could show that the restriction to powers of 2 for Δt is unnecessary. You can see how to relax our assumption (20). For example, it suffices to take E[ (f_{t+s} − f_t)² ] ≤ C s, rather than the conditional expectation. This allows discontinuous integrands that depend on hitting times. It is possible to substitute a power of s less than 1, such as √s. This would just lead to a different ρ < 1 in the final geometric series.
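The rate a_m ≤ C ρ^m can be seen numerically by coupling consecutive levels on the same Brownian path: generate increments on the fine grid, add pairs to get the coarse grid, and estimate E|X^{m+1} − X^m|. A sketch with f_s = W_s (increasing m by 2 should cut the difference roughly in half, matching ρ² = 1/2):

```python
import numpy as np

rng = np.random.default_rng(3)
t, n_samples = 1.0, 10_000

def coupled_diff(m):
    """Monte Carlo estimate of E|X^{m+1} - X^m| for f_s = W_s, same W path."""
    dt = 2.0 ** -(m + 1)                     # fine spacing, level m + 1
    n = int(t / dt)
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_samples, n))
    W = np.hstack([np.zeros((n_samples, 1)), np.cumsum(dW, axis=1)])
    X_fine = np.sum(W[:, :-1] * dW, axis=1)
    dW_coarse = dW[:, 0::2] + dW[:, 1::2]    # pairs of fine increments
    X_coarse = np.sum(W[:, 0:-1:2] * dW_coarse, axis=1)
    return np.mean(np.abs(X_fine - X_coarse))

d = [coupled_diff(m) for m in (4, 6, 8)]
print(d)    # successive ratios near 1/2, i.e. a_m ~ C 2^{-m/2}
```

The difference X^{m+1} − X^m computed here is exactly the sum in (21).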

4 Example

There are a few Ito integrals that can be computed directly from the definition. Ito's lemma, which we will see next week, is a better way to approach actual calculations. This is as in ordinary calculus. Riemann sums are a good way to define the Riemann integral, but the fundamental theorem of calculus is an easier way to compute specific examples.

The first example is

X_t = ∫_0^t W_s dW_s . (22)

The Riemann sum approximation is

X_t^m = Σ_{t_j < t} W_{t_j} ( W_{t_{j+1}} − W_{t_j} ) .

The algebraic identity behind the calculation is

W_{t_j} = ½ ( W_{t_{j+1}} + W_{t_j} ) − ½ ( W_{t_{j+1}} − W_{t_j} ) .


This leads to

X_t^m = ½ Σ_{t_j < t} ( W_{t_{j+1}} + W_{t_j} )( W_{t_{j+1}} − W_{t_j} ) − ½ Σ_{t_j < t} ( W_{t_{j+1}} − W_{t_j} )( W_{t_{j+1}} − W_{t_j} ) .

In the first sum,

( W_{t_{j+1}} + W_{t_j} )( W_{t_{j+1}} − W_{t_j} ) = W_{t_{j+1}}² − W_{t_j}² .

Therefore, the first sum is a telescoping sum,¹ which is a sum of the form

(a − b) + (b − c) + ··· + (x − y) + (y − z) = a − z .

Let t_n = max { t_j | t_j < t }; then the first sum is ½ ( W_{t_{n+1}}² − W_0² ). This simplifies, because W_0 = 0, to ½ W_{t_{n+1}}². Clearly, W_{t_{n+1}} → W_t as Δt → 0.

The second sum involves

S = Σ_{t_j < t} ΔW_j² . (23)

We do not need the exact distribution of S, only to describe the answer as precisely as we need. For the mean, we have E[ ΔW_j² ] = Δt, so

E[ S ] = Σ_{t_j < t} Δt = t_n → t as Δt → 0 .

For the variance, the terms ΔW_j are independent, and var(ΔW_j²) = 2Δt² (recall: ΔW_j is Gaussian and we know the fourth moments of a Gaussian). Therefore

var(S) = 2Δt ( Σ_{t_j < t} Δt ) = 2Δt t_n ≤ 2t Δt_m .

Therefore S → t as m → ∞, and

X_t^m → ½ W_t² − ½ t as m → ∞ .

This gives the famous result

∫_0^t W_s dW_s = ½ W_t² − ½ t . (24)
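This limit can be watched on a single simulated path (a sketch; the Riemann sums at several coarseness levels all reuse the same fine path, and the error at each level is ½(t − S) with S from (23)):

```python
import numpy as np

rng = np.random.default_rng(4)
t = 1.0
n_fine = 2 ** 16
dW = rng.normal(0.0, np.sqrt(t / n_fine), size=n_fine)
W = np.concatenate([[0.0], np.cumsum(dW)])     # one fine Brownian path

def riemann(stride):
    """The sum X_t^m on this path, with spacing stride * (t / n_fine)."""
    Wg = W[::stride]
    return np.sum(Wg[:-1] * np.diff(Wg))

target = 0.5 * W[-1] ** 2 - 0.5 * t            # the Ito answer (24)
errors = [abs(riemann(2 ** k) - target) for k in (10, 6, 2)]
print(errors)                                   # shrinks as the grid is refined
```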

We have much to say about this result, starting with what it is not. The answer would be different if W_t were a differentiable function of t. If W_t were differentiable, then dW_s = (dW/ds) ds, and

∫_0^t W_s dW_s = ∫_0^t W_s (dW/ds) ds = ½ ∫_0^t (d/ds) W_s² ds = ½ W_t² .

¹ The term comes from a collapsing telescope. You can find pictures of these on the web.

The Ito result (24) is different. The Ito calculus for rough functions like Brownian motion gives results that are not what you would get using ordinary calculus. In ordinary calculus, the sum (23) converges to zero as Δt → 0. That is because ΔW_j would scale like C Δt if W_t were a differentiable function of t, so S would be on the order of Δt. For Brownian motion, ΔW_j² scales like Δt instead, which is why S makes a positive contribution to the Ito integral.

The ordinary calculus answer ½ W_t² is wrong because it is not a martingale. A martingale is a stochastic process such that if t > s, then

E[ X_t | F_s ] = X_s . (25)

For Brownian motion,

E[ W_t² | F_s ] = W_s² + (t − s) ,

so W_t² is not a martingale (see Section 5). The correct formula (24) is a martingale. The correction W_t² → W_t² − t accomplishes this.

This section discusses two properties of the Ito integral: (1) the martingale property, (2) the Ito isometry formula.

Two easy steps verify the martingale property. Step one is to say that we can define the Ito integral with a different start time as

∫_a^t f_s dW_s = lim_{m→∞} Σ_{a ≤ t_j < t} f_{t_j} ( W_{t_{j+1}} − W_{t_j} ) . (26)

These integrals add in the natural way:

∫_0^a f_s dW_s + ∫_a^t f_s dW_s = ∫_0^t f_s dW_s .

Step two is the statement that

E[ ∫_a^t f_s dW_s | F_a ] = 0 .

This is because the right side of (26) has expected value zero. That is because all the terms on the right are in the future of F_a. That zero expectation is preserved in the limit Δt → 0. A general theorem in probability says that if Y_m is a family of random variables and Y_m → Y as m → ∞, and if another technical condition is satisfied (discussed in Week 8), then E[ Y_m ] → E[ Y ] as m → ∞.

When we use these facts together, we conclude that

E[ ∫_0^t f_s dW_s | F_a ] = E[ ∫_0^a f_s dW_s | F_a ] + E[ ∫_a^t f_s dW_s | F_a ] = E[ ∫_0^a f_s dW_s | F_a ] = X_a .


This is the martingale property for X_t.

The Ito isometry formula is

E[ ( ∫_0^t f_s dW_s )² ] = ∫_0^t E[ f_s² ] ds . (27)

The variance of the Ito integral is equal to the ordinary integral of the expected square of the integrand. The ideas we have been using make the proof of this formula routine. Informally, we write

E[ f_s dW_s f_{s'} dW_{s'} ] = 0 if s ≠ s' ,
E[ f_s dW_s f_{s'} dW_{s'} ] = E[ f_s² ] ds if s = s' .

The unequal time formula on the top line reflects that either dW_s or dW_{s'} is in the future of everything else in the formula. The equal time formula on the bottom line reflects the informal E[ (dW_s)² | F_s ] = ds. Then

( ∫_0^t f_s dW_s )² = ∫_0^t f_s dW_s ∫_0^t f_{s'} dW_{s'} = ∫_0^t ∫_0^t f_s f_{s'} dW_s dW_{s'} .

Taking expectations,

E[ ( ∫_0^t f_s dW_s )² ] = ∫_0^t ∫_0^t E[ f_s f_{s'} dW_s dW_{s'} ] = ∫_0^t E[ f_s² ] ds .

A more formal, but not completely rigorous, version of this argument is little different from this. We merely switch to the Riemann sum approximation and take the limit at the end:

E[ ( Σ_{t_j < t} f_{t_j} ΔW_{t_j} )² ] = E[ Σ_{t_j < t} Σ_{t_k < t} f_{t_j} f_{t_k} ΔW_{t_j} ΔW_{t_k} ]
  = Σ_{t_j < t} Σ_{t_k < t} E[ f_{t_j} f_{t_k} ΔW_{t_j} ΔW_{t_k} ]
  = Σ_{t_j < t} E[ f_{t_j}² E[ ΔW_{t_j}² | F_{t_j} ] ]
  = Σ_{t_j < t} E[ f_{t_j}² ] Δt .

The last line is the Riemann sum approximation to the right side of (27).

Let us check the Ito isometry formula on the example (24). For the Ito integral part we have (recall that X ∼ N(0, σ²) implies var(X²) = 2σ⁴)

var( ∫_0^t W_s dW_s ) = var( ½ W_t² − ½ t ) = ¼ var( W_t² ) = ¼ · 2t² = t²/2 .

For the Riemann integral part, we have

∫_0^t E[ W_s² ] ds = ∫_0^t s ds = t²/2 .

A simpler example is f_s = s², with

X_t = ∫_0^t s² dW_s .

This example can be understood from the definition alone. Since X is a linear function of W, X is Gaussian. Since X is an Ito integral, E[ X_t ] = 0. Therefore, we characterize the distribution of X_t completely by finding its variance. The Ito isometry formula gives (f_s² = E[ f_s² ] = s⁴, since f is deterministic)

var(X_t) = ∫_0^t s⁴ ds = t⁵/5 .

This may be easier than the method used in question (3) of Assignment 3.
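For a deterministic integrand the whole distribution is within reach of a quick simulation (a sketch; the grid and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
t, n, n_samples = 1.0, 512, 20_000
dt = t / n
s = np.arange(n) * dt                     # left endpoints t_j

dW = rng.normal(0.0, np.sqrt(dt), size=(n_samples, n))
X = dW @ (s ** 2)                         # sum_j t_j^2 (W_{t_{j+1}} - W_{t_j})

print(np.mean(X))                         # near 0: Ito integrals are mean zero
print(np.var(X), t ** 5 / 5)              # Ito isometry: var equals t^5 / 5
```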


Week 6
Ito's lemma for Brownian motion
Jonathan Goodman
October 22, 2012

Ito's lemma is the big thing this week. It plays the role in stochastic calculus that the fundamental theorem of calculus plays in ordinary calculus. Most actual calculations in stochastic calculus use some form of Ito's lemma. Ito's lemma is one of a family of facts that make up the Ito calculus. It is an analogue for stochastic processes of the ordinary calculus of Leibniz and Newton. We use it both as a language for expressing models, and as a set of tools for reasoning about models.

For example, suppose N_t is the number of bacteria in a dish (a standard example in beginning calculus). We model N_t in terms of a growth rate, r. In a small increment of time dt, the model is that N increases by an amount dN_t = r N_t dt. Calculus allows us to express N_t as N_t = N_0 e^{rt}. The "Ito's lemma" of ordinary calculus gives df(t) = f'(t) dt. For us, this is d(N_0 e^{rt}) = r N_0 e^{rt} dt = r N_t dt.

Here is a similar example for a stochastic process X_t that could model a stock price. We suppose that in the time interval dt, X_t changes by a random amount whose size is proportional to X_t. In stock terms, the probability to go from 100 to 102 is the same as the probability to go from 10 to 10.2. A simple way to do this is to make dX proportional to X_t and dW_t, as in dX_t = σ X_t dW_t. The differentials are all forward looking, so dX_t = X_{t+dt} − X_t and dW_t = W_{t+dt} − W_t with dt > 0. The Ito lemma for the Ito calculus is (using subscripts for partial derivatives) d(f(W_t, t)) = f_w(W_t, t) dW_t + ½ f_{ww}(W_t, t) dt + f_t(W_t, t) dt. The solution is f(w, t) = x_0 e^{σw − σ²t/2}. We check this using f_w = σf, f_{ww} = σ²f, and f_t = −½σ²f. Therefore, if X_t = f(W_t, t), then dX_t = σ X_t dW_t as desired.
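The claim that X_t = x_0 e^{σW_t − σ²t/2} solves dX_t = σX_t dW_t can be checked against an Euler scheme that uses only the increments dW. In this sketch, x_0 = 1, σ = 0.4 and the step count are arbitrary choices, and the Euler path matches the closed form only up to a discretization error that shrinks with dt:

```python
import numpy as np

rng = np.random.default_rng(6)
x0, sigma, t, n = 1.0, 0.4, 1.0, 2 ** 14
dt = t / n

dW = rng.normal(0.0, np.sqrt(dt), size=n)
X = x0
for dw in dW:                     # Euler steps of dX = sigma * X * dW
    X += sigma * X * dw

W_t = dW.sum()
exact = x0 * np.exp(sigma * W_t - 0.5 * sigma**2 * t)   # Ito's-lemma solution
print(X, exact)                   # agree to within the Euler discretization error
```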

Ito's lemma for this week is about the time derivative of stochastic processes f(W_t, t), where f(w, t) is a differentiable function of its arguments. The Ito differential is

df = f(W_{t+dt}, t + dt) − f(W_t, t) .

This is the change in f over a small increment of time dt. If you integrate the Ito differential of f, you get the change in f. If X_t is any process, then

X_b − X_a = ∫_a^b dX_s . (1)

This is the way to show that something is equal to dX_t: you put your differential on the right, integrate, and see whether you get the left side. In particular, the differential formula dX_t = μ_t dt + σ_t dW_t means that

X_b − X_a = ∫_a^b μ_s ds + ∫_a^b σ_s dW_s . (2)

The first integral on the right is an ordinary integral. The second is the Ito integral from last week. The Ito integral is well defined provided σ_t is an adapted process.

Ito's lemma for Brownian motion is

df(W_t, t) = ∂_w f(W_t, t) dW_t + ½ ∂_w² f(W_t, t) dt + ∂_t f(W_t, t) dt . (3)

An informal derivation starts by expanding df in Taylor series in dW and dt, up to second order in dW and first order in dt:

df = ∂_w f dW + ½ ∂_w² f (dW)² + ∂_t f dt .

We get (3) from this using (dW_t)² = dt. The formula (dW_t)² = dt cannot be exactly true, because (dW_t)² is random and dt is not random. It is true that E[ (dW_t)² | F_t ] = dt, but Ito's lemma is about more than expectations.

eq:dii

The real theorem of Itos lemma, in the spirit of (2), is

f (Wb , b) f (Wa , a)

Z b Z

b

1 2

= @w f (Wt , t)dWt + @ f (Wt , t) + @t f (Wt , t) dt (4) eq:ili

a a 2 w

Everything here is has been defined. The second integral on the right is an

ordinary Riemann integral. The first integral on the right is the Ito integral

sec:p

defined last week. We give an informal proof of this in Section 2.

You see the convenience of Ito's lemma by re-doing the example from last week,

X_t = ∫_0^t W_s dW_s .

A first guess from ordinary calculus might be X_t = ½ W_t². Let us take the Ito differential of ½ W_t². This is df(W_t, t), where f(w, t) = ½ w², with ∂_w f(w, t) = w and ∂_w² f(w, t) = 1. Therefore, (3) gives

d( ½ W_t² ) = W_t dW_t + ½ dt .

Therefore,

½ W_t² − ½ W_0² = ∫_0^t W_s dW_s + ½ ∫_0^t ds = ∫_0^t W_s dW_s + ½ t .

You just rearrange this and recall that W_0 = 0, and you get the formula from Week 5:

X_t = ∫_0^t W_s dW_s = ½ W_t² − ½ t .

This is quicker than the telescoping sum stuff from Week 5.

Ito's lemma gives a convenient way to figure out the backward equation for many problems. Ito's lemma and the martingale (mean zero) property of Ito integrals work together to tell you how to evaluate conditional expectations. Consider the Ito integral

X_T = ∫_0^T g_s dW_s .

Then

E[ X_T | F_t ] = E[ ∫_0^t g_s dW_s | F_t ] + E[ ∫_t^T g_s dW_s | F_t ] .

The second term is zero, because dW_s is in the future of g_s and F_t. Therefore

E[ ∫_0^T g_s dW_s | F_t ] = ∫_0^t g_s dW_s .

Now consider the value function

f(w, t) = E[ V(W_T) | W_t = w ] .

The integral form of Ito's lemma (4) gives

V(W_T) − f(W_t, t) = ∫_t^T df(W_s, s)
  = ∫_t^T ∂_w f(W_s, s) dW_s + ∫_t^T [ ∂_t f(W_s, s) + ½ ∂_w² f(W_s, s) ] ds .

Take the conditional expectation in F_t. Looking on the left side, we have

E[ V(W_T) | F_t ] = f(W_t, t) ,

and the conditional expectation of f(W_t, t) is just f(W_t, t). Therefore you get zero on the left. The conditional expectation of the Ito integral on the right also vanishes, as we said just above. Therefore

E[ ∫_t^T ( ∂_t f(W_s, s) + ½ ∂_w² f(W_s, s) ) ds | F_t ] = 0 .

The simplest way for this to happen is for the integrand to vanish identically. The equation you get by setting the integrand to zero is

∂_t f + ½ ∂_w² f = 0 .


This is the backward equation we derived in Week 4. The difference here is that you don't have to think about what you're doing. All the hard thinking (the mathematical analysis) goes into Ito's lemma. Once you are liberated from thinking hard, you can easily derive backward equations for many other situations.
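Once you have the backward equation, it can also be solved numerically by marching backward from the final condition. This sketch (final condition V(w) = cos(w); the grid sizes and the domain cutoff at ±8 are arbitrary choices) recovers the exact value f(0, 0) = E[cos(W_T)] = e^{−T/2}:

```python
import numpy as np

# Solve d_t f + (1/2) d_w^2 f = 0 backward from f(w, T) = V(w) = cos(w),
# then compare f(0, 0) with the exact value E[cos(W_T)] = exp(-T/2).
T, L, nw, nt = 1.0, 8.0, 401, 20_000
w = np.linspace(-L, L, nw)
dw = w[1] - w[0]
dt = T / nt
assert dt <= dw**2                  # stability limit for this explicit scheme

f = np.cos(w)                       # final condition at t = T
for _ in range(nt):                 # march from t = T down to t = 0
    lap = np.zeros_like(f)
    lap[1:-1] = (f[2:] - 2 * f[1:-1] + f[:-2]) / dw**2
    f = f + dt * 0.5 * lap          # f(., t - dt) ~ f(., t) + dt * (1/2) d_w^2 f

print(f[nw // 2], np.exp(-T / 2))   # value at w = 0, t = 0 vs exact
```

The boundary values are simply frozen at cos(±L); since the domain is many standard deviations wide, that error never reaches the center.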

The theorem of Ito's lemma is the integral formula (4). We will prove it under the assumption that f(w, t) is a differentiable function of its arguments up to third derivatives. We assume all mixed partial derivatives up to that order exist and are bounded. That means |∂_w³ f(w, t)| ≤ C, and |∂_t² f(w, t)| ≤ C, and |∂_w² ∂_t f(w, t)| ≤ C, and so on.

We use the notation of Week 5, with Δt = 2^{−m} and t_j = jΔt. The change in any quantity from t_j to t_{j+1} is (Δ·)_j. We use the subscript j for t_j, as in W_j instead of W_{t_j}. For example, Δf_j = f(W_j + ΔW_j, t_j + Δt) − f(W_j, t_j). In this notation, the left side of (4) is

f(W_b, b) − f(W_a, a) = Σ_{a ≤ t_j < b} Δf_j . (5)

The right side is a telescoping sum, which is equal to the left side if b = nΔt and a = mΔt for some integers m < n. When Δt and ΔW are small, there is a Taylor series approximation of Δf_j. The leading order terms in the Taylor series combine to form the integrals on the right of (4). The remainder terms add up to something that goes to zero as Δt → 0.

Suppose $w$ and $t$ are some numbers and $\Delta w$ and $\Delta t$ are some small changes. Define $\Delta f = f(w+\Delta w,\,t+\Delta t) - f(w,t)$. The Taylor series, up to the order we need, is
\[ \Delta f = \partial_w f(w,t)\,\Delta w + \tfrac12\,\partial_w^2 f(w,t)\,\Delta w^2 + \partial_t f(w,t)\,\Delta t \tag{6} \]
\[ \qquad\;+\; O\big(|\Delta w|^3\big) + O\big(|\Delta w|\,\Delta t\big) + O\big(\Delta t^2\big) . \tag{7} \]
The error terms on the second line refer to things bounded by a multiple of what's in the big $O$, so $O(|\Delta w|^3)$ means: some quantity $Q$ so that there is a $C$ with $|Q| \le C\,|\Delta w|^3$. The error terms on the second line correspond to the highest order neglected terms in the Taylor series. These are (constants omitted) $\partial_w^3 f(w,t)\,\Delta w^3$, and $\partial_w\partial_t f(w,t)\,\Delta w\,\Delta t$, and $\partial_t^2 f(w,t)\,\Delta t^2$. The Taylor remainder theorem tells us that if the derivatives of the appropriate order are bounded (third derivatives in this case), then the errors are on the order of the neglected terms.

The sum on the right of (5) now breaks up into six sums, one for each term on the right of (6) and (7):
\[ \sum_{a \le t_j < b} \Delta f_j = S_1 + S_2 + S_3 + S_4 + S_5 + S_6 . \]

We consider them one by one. It does not take long. The first is
\[ S_1 = \sum_{a \le t_j < b} \partial_w f(W_j,t_j)\,\Delta W_j , \]
which converges (as last week) to the Ito integral
\[ \int_a^b \partial_w f(W_s,s)\,dW_s . \]
The second is
\[ S_2 = \sum_{a \le t_j < b} \tfrac12\,\partial_w^2 f(W_j,t_j)\,\Delta W_j^2 . \tag{8} \]
This is the term in the Ito calculus that has no analogue in ordinary calculus. We come back to it after the others. The third is
\[ S_3 = \sum_{a \le t_j < b} \partial_t f(W_j,t_j)\,\Delta t , \]
which converges to the Riemann integral
\[ \int_a^b \partial_t f(W_s,s)\,ds . \]
The first error term satisfies
\[ |S_4| \le C \sum_{a \le t_j < b} |\Delta W_j|^3 . \]
This is random, so we evaluate its expected value. We know from experience that $E\big[|\Delta W_j|^3\big]$ scales like $\Delta t^{3/2}$, which is one half power of $\Delta t$ for each power of $\Delta W$. Therefore
\[ E[\,|S_4|\,] \le C \sum_{a \le t_j < b} \Delta t^{3/2} = C\,\Delta t^{1/2} \sum_{a \le t_j < b} \Delta t = C\,(b-a)\,\Delta t^{1/2} . \]
The second error term, $S_5$, goes the same way, as $E[\,|\Delta W_j|\,\Delta t\,]$ also scales as $\Delta t^{3/2}$. The last error term has
\[ |S_6| \le C \sum_{a \le t_j < b} \Delta t^2 = C\,(b-a)\,\Delta t . \]

It comes now to the sum (8). The $\Delta W_j^2 \leftrightarrow \Delta t$ connection suggests we write
\[ (\Delta W_j)^2 = \Delta t + R_j . \]
Clearly
\[ E[\, R_j \mid \mathcal{F}_j\,] = 0 , \qquad E[\, R_j^2 \mid \mathcal{F}_j\,] = \mathrm{var}(R_j \mid \mathcal{F}_j) = 2\,\Delta t^2 . \]
Now,
\[ S_2 = \sum_{a \le t_j < b} \tfrac12\,\partial_w^2 f(W_j,t_j)\,\Delta t \;+\; \sum_{a \le t_j < b} \tfrac12\,\partial_w^2 f(W_j,t_j)\,R_j \;=\; S_{2,1} + S_{2,2} . \]
The first sum converges to the Riemann integral
\[ \int_a^b \tfrac12\,\partial_w^2 f(W_s,s)\,ds . \]

The second term converges to zero almost surely. We see this using the now familiar trick of calculating $E\big[S_{2,2}^2\big]$. This becomes a double sum over $t_j$ and $t_k$. The off diagonal terms, the ones with $j \ne k$, vanish. If $j > k$, we see this as usual:
\[ E\big[\, \tfrac12\partial_w^2 f(W_j,t_j)\,R_j \cdot \tfrac12\partial_w^2 f(W_k,t_k)\,R_k \,\big|\, \mathcal{F}_j \,\big] = \tfrac14\, E[\,R_j \mid \mathcal{F}_j\,]\; \partial_w^2 f(W_j,t_j)\,\partial_w^2 f(W_k,t_k)\,R_k , \]
and the right side vanishes because $E[\,R_j \mid \mathcal{F}_j\,] = 0$. The conditional expectation of a diagonal term is
\[ \tfrac14\, E\big[\, \big(\partial_w^2 f(W_j,t_j)\big)^2 R_j^2 \,\big|\, \mathcal{F}_j \,\big] = \tfrac14\,\big(\partial_w^2 f(W_j,t_j)\big)^2\, E[\,R_j^2 \mid \mathcal{F}_j\,] = \tfrac12\,\big(\partial_w^2 f(W_j,t_j)\big)^2\, \Delta t^2 . \]
These calculations show that in $E\big[S_{2,2}^2\big]$, the diagonal terms, which are the only non-zero ones, sum to at most $C\,(b-a)\,\Delta t$.

The almost sure statement follows from the Borel–Cantelli lemma, as last week. The abstract theorem is that if $S_n$ is a family of random variables with
\[ \sum_{n=1}^\infty E\big[S_n^2\big] < \infty , \tag{9} \]
then $S_n \to 0$ as $n \to \infty$ almost surely. This is because (9) says the infinite sum $\sum_n S_n^2$ has finite expected value, so the sum is finite almost surely. If an infinite sum of positive numbers is convergent, then the terms go to zero, so $S_n^2 \to 0$ almost surely. If $S_n^2 \to 0$ then $S_n \to 0$ also.
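The calculation above rests on the fact that $\sum (\Delta W_j)^2$ concentrates around $b - a$ as $\Delta t = 2^{-m} \to 0$. This is easy to watch numerically; the sketch below (mine, not from the notes) sums the squared increments of one simulated Brownian path.

```python
import numpy as np

# Numerical illustration (a sketch, not from the notes): for a Brownian
# path on [0, T], the sum of squared increments approaches T as the
# step size Delta t = T * 2**(-m) shrinks.
rng = np.random.default_rng(1)
T, m = 1.0, 20                      # 2**20 ~ one million steps
n = 2**m
dW = rng.standard_normal(n) * np.sqrt(T / n)   # increments, var = dt
qv = np.sum(dW**2)                  # discrete quadratic variation
print(qv)                           # should be close to T = 1
```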

3 Backward equations

Suppose $V(w)$ is a running reward function and consider
\[ f(w,t) = E_{w,t}\Big[ \int_t^T V(W_s)\,ds \Big] . \tag{10} \]
As in the Introduction, this may be written in the equivalent form
\[ f(W_t,t) = E\Big[ \int_t^T V(W_s)\,ds \,\Big|\, \mathcal{F}_t \Big] . \tag{11} \]
The integral form of Ito's lemma gives
\[ f(W_T,T) - f(W_t,t) = \int_t^T f_w(W_s,s)\,dW_s + \int_t^T \Big( \tfrac12 f_{ww}(W_s,s) + f_t(W_s,s) \Big)\,ds . \]
The definition (10) gives $f(W_T,T) = 0$. Therefore, as in the Introduction,
\[ f(W_t,t) = -\,E\Big[ \int_t^T \Big( \tfrac12 f_{ww}(W_s,s) + f_t(W_s,s) \Big)\,ds \,\Big|\, \mathcal{F}_t \Big] . \]
Comparing with (11),
\[ E\Big[ \int_t^T V(W_s)\,ds \,\Big|\, \mathcal{F}_t \Big] = -\,E\Big[ \int_t^T \Big( \tfrac12 f_{ww}(W_s,s) + f_t(W_s,s) \Big)\,ds \,\Big|\, \mathcal{F}_t \Big] . \]

The natural way to achieve this is to set the integrands equal to each other (up to the sign), which gives
\[ \tfrac12\, f_{ww}(w,s) + f_t(w,s) + V(w) = 0 . \tag{12} \]
The final condition for this PDE is $f(w,T) = 0$. The PDE then determines the values $f(w,s)$ for $s < T$. Now that we have guessed the backward equation, we can show that it is right by Ito differentiation once more. If $f(w,s)$ satisfies the backward equation (12), then $f(W_t,t)$ satisfies (11).

Here is a slightly better way to say this. Define
\[ X_t = \int_t^T V(W_s)\,ds . \]
From ordinary calculus, the fundamental theorem in this case gives
\[ \frac{dX_t}{dt} = \frac{d}{dt}\int_t^T V(W_s)\,ds = -\,V(W_t) . \]
This is true for any continuous function $W_t$, whether or not it is random. Conditioning on $\mathcal{F}_t$ just ties down the value of $W_t$, so
\[ E[\, dX_t \mid \mathcal{F}_t\,] = -\,V(W_t)\,dt . \]
From Ito's lemma, any sufficiently smooth function $f(w,s)$ satisfies
\[ E[\, df(W_t,t) \mid \mathcal{F}_t\,] = \Big( \tfrac12 f_{ww}(W_t,t) + f_t(W_t,t) \Big)\,dt . \]
Taking the conditional expectation of the differential of both sides of (11) gives
\[ \Big( \tfrac12 f_{ww}(W_t,t) + f_t(W_t,t) \Big)\,dt = -\,V(W_t)\,dt , \]
which is the backward equation (12).

Consider the specific example
\[ f(w,t) = E_{w,t}\Big[ \int_t^T W_s^2\,ds \Big] . \]
We do not really need the PDE here, since there is a simple formula:
\[ E_{w,t}\big[ W_s^2 \big] = E_{w,t}\big[ \big( W_t + (W_s - W_t) \big)^2 \big] = w^2 + (s-t) . \]
Instead we use the ansatz method. Suppose the solution has the form $f(w,t) = A(t)\,w^2 + B(t)$. It is easy to plug into the backward equation
\[ \tfrac12 f_{ww} + f_t + w^2 = 0 \]
and get
\[ A + \dot A\, w^2 + \dot B + w^2 = 0 . \]
This gives $\dot A = -1$. Since $f(w,T) = 0$, we have $A(T) = 0$ and therefore $A(t) = T - t$. Next we have $\dot B = -A = t - T$, so $B = \tfrac12 t^2 - Tt + C$. The final condition $B(T) = 0$ gives $C = \tfrac12 T^2$. The simplified form is $B(t) = \tfrac12 (T-t)^2$. The solution is
\[ f(w,t) = (T-t)\,w^2 + \tfrac12\,(T-t)^2 , \]
which agrees with integrating the simple formula over $s \in [t,T]$.
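Here is a hedged numerical check of this running-reward example (mine, not part of the notes). Since $E_{w,t}[W_s^2] = w^2 + (s-t)$, integrating over $s \in [t,T]$ gives the closed form $(T-t)w^2 + (T-t)^2/2$, which a Monte Carlo estimate of $E_{w,t}\big[\int_t^T W_s^2\,ds\big]$ should reproduce.

```python
import numpy as np

# Monte Carlo check (a sketch, not from the notes) of the running reward
# value function f(w,t) = E_{w,t}[ int_t^T W_s^2 ds ].  Integrating
# E_{w,t}[W_s^2] = w^2 + (s - t) over s in [t, T] gives the closed form
# f(w,t) = (T - t) w^2 + (T - t)^2 / 2, which we compare against.
rng = np.random.default_rng(2)
w, t, T = 1.0, 0.0, 1.0
paths, n = 20_000, 200
dt = (T - t) / n
dW = rng.standard_normal((paths, n)) * np.sqrt(dt)
W = w + np.cumsum(dW, axis=1)            # W at the right endpoints
mc = np.mean(np.sum(W**2, axis=1) * dt)  # Riemann sum of int W_s^2 ds
exact = (T - t) * w**2 + 0.5 * (T - t)**2
print(mc, exact)
```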

Week 7

Diffusion processes

Jonathan Goodman

October 29, 2012

This week we discuss diffusion processes. A diffusion process $X_t$ has an infinitesimal mean, or drift, which is $a(x,t)$. The process is supposed to satisfy
\[ E[\, \Delta X_t \mid \mathcal{F}_t\,] = a(X_t,t)\,\Delta t + O(\Delta t^2) . \tag{1} \]
We often write the differential version $dX_t = X_{t+dt} - X_t$ and write (1) as $E[\,dX_t \mid \mathcal{F}_t\,] = a(X_t,t)\,dt$. A diffusion process also has an infinitesimal variance, $\mu(x,t)$. For a one dimensional process, it should satisfy
\[ E[\, \Delta X_t^2 \mid \mathcal{F}_t\,] = \mu(X_t,t)\,\Delta t + O(\Delta t^2) , \tag{2} \]
or, in differential form,
\[ E[\, dX_t^2 \mid \mathcal{F}_t\,] = \mu(X_t,t)\,dt . \]
For a multivariate process, the infinitesimal variance is a matrix. For an $n$ dimensional process, $a(x,t) \in \mathbb{R}^n$, and $\mu(x,t)$ is a symmetric positive semi-definite $n \times n$ matrix. The infinitesimal mean formula (1) does not change. The infinitesimal variance formula becomes
\[ E\big[\, (dX_t)\,(dX_t)^t \,\big|\, \mathcal{F}_t \,\big] = \mu(X_t,t)\,dt . \]
This just says that $\mu(X_t,t)\,dt$ is the variance-covariance matrix of $dX_t$.

The last part of the definition is that a diffusion process must have continuous sample paths. This means that $X_t$ must be a continuous function of $t$, and it does not follow from (1) and (2). For example, the simple rate one Poisson process $N_t$ satisfies (1) with $a = 1$ and (2) with $\mu = 1$, as we saw in Assignment 5, but its sample paths jump. In practice, you show that sample paths are continuous by finding a moment of $\Delta X$ that scales like a higher power of $\Delta t$. Usually it is the fourth moment, which should scale like $\Delta t^2$:
\[ E[\, \Delta X_t^4 \mid \mathcal{F}_t\,] = O(\Delta t^2) . \tag{3} \]
The Poisson process fails this test:
\[ E[\, \Delta N_t^4 \mid \mathcal{F}_t\,] = \Delta t + O(\Delta t^2) . \]
Indeed, to leading order (the size of the jump is irrelevant), $E[\, \Delta N_t^p \mid \mathcal{F}_t\,] \approx P(\Delta N \ne 0) = \Delta t$ for any moment power $p$.

Diffusion processes come up as models of stochastic processes. If you want to build a diffusion model of a process, you need to figure out the infinitesimal mean and variance. You also must find a higher moment that scales like a higher power of $\Delta t$, or find some other reason for $X_t$ to be a continuous function of $t$. We will see examples of this kind of reasoning.

The quadratic variation of $X_t$ measures how much noise the path $X$ experienced up to time $t$. It is written in many ways; we write it as $[X]_t$. The definition is, in our usual notation,
\[ [X]_t = \lim_{\Delta t \to 0} \sum_{t_j < t} (\Delta X_j)^2 . \tag{4} \]
It is possible to prove that this limit exists almost surely. The limit is given by
\[ [X]_t = \int_0^t \mu(X_s,s)\,ds . \tag{5} \]
This looks a little like the Ito isometry formula, but there are differences. The Ito isometry formula is an equality of expected values. This is a pathwise identity, one that holds for almost every path $X$. Both sides of (5) are functions of the path $X$. Almost surely, for any path $X_{[0,T]}$, the limit on the right of (4) is equal to the right side of (5).

The quadratic variation formula is related to Ito's lemma for general diffusions. This becomes clear if you write the sum in (4) as an integral to get the informal expression
\[ [X]_t = \int_0^t (dX_s)^2 . \]
Then (5) reads
\[ \int_0^t (dX_s)^2 = \int_0^t \mu(X_s,s)\,ds . \]
Taking the differential with respect to $t$ gives $(dX_t)^2 = \mu(X_t,t)\,dt$. The truth of this informal formula is the same as the truth of the Brownian motion version: it is not true in the differential form, but it gives a true formula when you integrate both sides.
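The pathwise quadratic variation formula can be illustrated for a process whose infinitesimal variance is constant. The sketch below (mine, not from the notes) simulates an Ornstein–Uhlenbeck process $dX = -\gamma X\,dt + \sigma\,dW$ with the Euler method; its infinitesimal variance is $\sigma^2$, so the discrete quadratic variation should be close to $\sigma^2 T$. The parameter values are arbitrary.

```python
import numpy as np

# Pathwise illustration (a sketch, not from the notes) of the quadratic
# variation formula [X]_t = int_0^t mu(X_s, s) ds.  We simulate an
# Ornstein-Uhlenbeck process dX = -gamma X dt + sigma dW by the Euler
# method; here mu = sigma^2 is constant, so [X]_T should be sigma^2 T.
rng = np.random.default_rng(3)
gamma, sigma, T, n = 1.0, 0.5, 1.0, 2**16
dt = T / n
X = np.empty(n + 1)
X[0] = 1.0
dW = rng.standard_normal(n) * np.sqrt(dt)
for j in range(n):
    X[j + 1] = X[j] - gamma * X[j] * dt + sigma * dW[j]
qv = np.sum(np.diff(X)**2)        # discrete quadratic variation
print(qv, sigma**2 * T)           # the two should be close
```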

We learn about diffusions by finding things about them we can calculate or compute. An important tool for this is the general version of Ito's lemma. We can guess that Ito's lemma should be
\[ df(X_t,t) = f_x(X_t,t)\,dX_t + \tfrac12\,f_{xx}(X_t,t)\,(dX_t)^2 + f_t(X_t,t)\,dt \]
\[ \qquad\quad = f_x(X_t,t)\,dX_t + \tfrac12\,f_{xx}(X_t,t)\,\mu(X_t,t)\,dt + f_t(X_t,t)\,dt . \tag{6} \]
To show this is true, we need to prove that
\[ f(X_T,T) - f(X_0,0) = \int_0^T \Big( \tfrac12\,f_{xx}(X_t,t)\,\mu(X_t,t) + f_t(X_t,t) \Big)\,dt + \int_0^T f_x(X_t,t)\,dX_t . \tag{7} \]
The last term on the right is the Ito integral with respect to a general diffusion, which also needs a definition. It looks like we have lots to do this week.

There are backward equations associated to general diffusions. One of them is for the value function of a payout at the final time,
\[ f(x,t) = E_{x,t}[\, V(X_T)\,] . \tag{8} \]
This satisfies
\[ \partial_t f + \tfrac12\,\mu(x,t)\,\partial_x^2 f + a(x,t)\,\partial_x f = 0 . \tag{9} \]
There are other backward equations for other quantities defined in terms of $X$. We can derive (9) directly using the tower property, or we can do it using Ito's lemma (6). There are natural versions of quadratic variation and backward equations for multivariate diffusion processes. These involve the infinitesimal covariance matrix $\mu$.

This section explains how to make a diffusion model of a random process. In future weeks we will see how to do this as an Ito differential equation. That is very appealing notationally, but the present method is more fundamental.

2.2 Ornstein Uhlenbeck and the urn process

This section has a full agenda, but the items should start to seem routine as we go through them. Most of the arguments are just more general versions of arguments from last week.

3.1 Backward equation

We start with the backward equation for general diffusions. The argument here is more direct than the argument we gave for the backward equation for Brownian motion. The earlier argument is more efficient, in that it involves less writing. But this one is straightforward, and it makes clear what is behind the equation. It also shows how the technical condition (3) plays a crucial role.

A simple backward equation governs the value function for a state dependent payout at a specific time. The payout function is $V(x)$. The payout time is $T$. At that time, you get the payout $V(X_T)$. For $t < T$, there is the conditional expected value of $V(X_T)$, conditional on the information in $\mathcal{F}_t$. Since $X_t$ is a Markov process, this expected value is the same as the conditional expectation given the value of $X_t$. This conditional expected value is $f(x,t)$, given by (8). An equivalent definition is
\[ f(X_t,t) = E[\, V(X_T) \mid \mathcal{F}_t\,] . \]
For small $\Delta t > 0$, the tower property gives
\[ f(x,t) = E_{x,t}\big[\, f(X_{t+\Delta t},\, t+\Delta t) \,\big] . \tag{10} \]
The backward equation (9) is an expression of this tower property. The calculations require that $f$ be sufficiently differentiable, which we assume but do not prove. The ingredients are: (i) the formulas (1) and (2) that characterize $X_t$, (ii) Taylor expansion of $f$ with the usual remainder bounds, and (iii) the technical condition (3) that makes $X_t$ a continuous function of $t$. We write $X_{t+\Delta t} = x + \Delta X$ and make the usual Taylor expansions. To simplify the writing, we make two conventions. Partial derivatives are written as subscripts. We put in the arguments only if they are not $(x,t)$. For example, $f_x$ means $\partial_x f(x,t)$. Then
\[ f(X_{t+\Delta t},\, t+\Delta t) = f(x+\Delta X,\, t+\Delta t) = f + f_x\,\Delta X + \tfrac12 f_{xx}\,\Delta X^2 + f_t\,\Delta t + O\big(|\Delta X|^3\big) + O\big(|\Delta X|\,\Delta t\big) + O\big(\Delta t^2\big) . \]
We will show below that
\[ E\big[\, |\Delta X|^3 \,\big|\, \mathcal{F}_t \,\big] = O(\Delta t^{3/2}) , \tag{11} \]
which is more than (3) states directly, but it is consistent with the scaling $\Delta X \sim \Delta t^{1/2}$. From (10) we find
\[ f = f + f_x\,E_{x,t}[\,\Delta X\,] + \tfrac12\,f_{xx}\,E_{x,t}\big[\,\Delta X^2\,\big] + f_t\,\Delta t + O(\Delta t^{3/2}) . \]
Using (1) and (2), and subtracting $f$ from both sides,
\[ 0 = f_x\,a(x,t)\,\Delta t + \tfrac12\,f_{xx}\,\mu(x,t)\,\Delta t + f_t\,\Delta t + O(\Delta t^{3/2}) . \]
Divide by $\Delta t$:
\[ 0 = a\,f_x + \tfrac12\,\mu\,f_{xx} + f_t + O(\Delta t^{1/2}) . \]
Letting $\Delta t \to 0$ gives the backward equation (9).

The bound (11) is a consequence of (3). There is a trick to show this using the Cauchy–Schwarz inequality $E[\,YU\,] \le E[\,Y^2\,]^{1/2}\,E[\,U^2\,]^{1/2}$. If you take $Y = \Delta X^2$ and $U = 1$, the Cauchy–Schwarz inequality gives $E[\,\Delta X^2\,] \le E[\,\Delta X^4\,]^{1/2}\,E[\,1\,]^{1/2} \le (C\,\Delta t^2)^{1/2} = C\,\Delta t$. Use $|\Delta X|^3 = \Delta X^2\,|\Delta X|$ in Cauchy–Schwarz, and you get $E[\,|\Delta X|^3\,] \le E[\,\Delta X^4\,]^{1/2}\,E[\,\Delta X^2\,]^{1/2} \le C\,\Delta t^{3/2}$. (Those of you who know Hölder's inequality or Jensen's inequality may find a shorter derivation of this $\Delta t^{3/2}$ bound.)

This may seem mysterious, but there is a reason it should work. Suppose we think $\Delta X$ scales as $\Delta X \sim \Delta t^{1/2}$. Then we would be inclined to believe that $E[\,\Delta X^4\,] \sim (\Delta t^{1/2})^4 = \Delta t^2$. Moreover, we might come to believe that $\Delta X \sim \Delta t^{1/2}$ from the expected square $E[\,\Delta X^2\,] \sim \Delta t$. But this is not a mathematical theorem. We already saw that the Poisson process is a counterexample: $E[\,\Delta N^2\,] \sim \Delta t$, but $E[\,\Delta N^4\,] \sim \Delta t$ also, not $\Delta t^2$. This says that $E[\,\Delta N^4\,]$ is much larger than it would be if $\Delta N$ scaled with $\Delta t$ in a simple way you could discover from the mean square. What goes wrong is that $\Delta N$ has fat tails. The expected value of $\Delta N^2$ does not come from typical values of $\Delta N$. Indeed, the typical value is $\Delta N = 0$. Instead, $E[\,\Delta N^2\,]$ is determined by rare events in which $\Delta N$ is much larger than $\Delta t^{1/2}$. The probability of such a rare event is approximately $\Delta t$, when $\Delta t$ is small. The tails of a probability distribution give the probability that the random variable is much larger (or smaller) than typical values. A large (or fat) tail indicates a serious probability of a large value. If a random variable has thin tails, then the expected values of higher moments scale as you would expect from lower moments. For a diffusion process, $E[\,\Delta X^4\,]$ scales as you would expect from $\Delta X \sim \Delta t^{1/2}$; for a Poisson process it does not.
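For the Poisson process, the fat-tail moment scaling can be checked exactly, since all Poisson moments have closed forms. The sketch below (mine, not from the notes) sums the Poisson pmf directly: for $\Delta N \sim \mathrm{Poisson}(\Delta t)$, the fourth moment is close to $\Delta t$, far larger than the thin-tail guess $\Delta t^2$.

```python
import math

# Exact moment computation (a sketch, not from the notes): for
# Delta N ~ Poisson(mu) with mu = lambda * Delta t and rate lambda = 1,
# compute  E[Delta N^4] = sum_k k^4 * exp(-mu) * mu^k / k!
# For small mu, the leading term is mu itself (one jump dominates),
# not the thin-tail prediction mu^2.
def poisson_fourth_moment(mu, kmax=60):
    return sum(k**4 * math.exp(-mu) * mu**k / math.factorial(k)
               for k in range(kmax + 1))

dt = 1e-4                       # a small time step
m4 = poisson_fourth_moment(dt)
print(m4 / dt)                  # close to 1: E[dN^4] ~ dt, a fat tail
print(dt**2)                    # the thin-tail guess, far smaller
```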

The Cauchy–Schwarz inequality allowed us to bound lower moments of $\Delta X$ in terms of higher moments. If $E[\,\Delta X^4\,] = O(\Delta t^2)$, then $E[\,|\Delta X|^3\,] = O(\Delta t^{3/2})$. But $E[\,|\Delta X|^3\,] = O(\Delta t^{3/2})$ does not imply that $E[\,\Delta X^4\,] = O(\Delta t^2)$.
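The backward equation itself can be sanity checked on the Ornstein–Uhlenbeck process $dX = -\gamma X\,dt + \sigma\,dW$. For the payout $V(x) = x$, the function $f(x,t) = x\,e^{-\gamma(T-t)}$ solves $f_t + \tfrac12\sigma^2 f_{xx} - \gamma x f_x = 0$ (here $f_{xx} = 0$), so a Monte Carlo average of $X_T$ should reproduce $x_0 e^{-\gamma T}$. This is my illustration, not part of the notes; the Euler scheme introduces a small $O(\Delta t)$ bias.

```python
import numpy as np

# Monte Carlo check (a sketch, not from the notes) of a payout value
# function for the Ornstein-Uhlenbeck process dX = -gamma X dt + sigma dW.
# With payout V(x) = x, the solution of the backward equation is
# f(x,t) = x * exp(-gamma (T - t)); we check it at t = 0.
rng = np.random.default_rng(4)
gamma, sigma, x0, T = 1.0, 0.4, 1.0, 1.0
paths, n = 50_000, 400
dt = T / n
X = np.full(paths, x0)
for _ in range(n):
    X += -gamma * X * dt + sigma * np.sqrt(dt) * rng.standard_normal(paths)
mc = X.mean()
exact = x0 * np.exp(-gamma * T)   # f(x0, 0)
print(mc, exact)
```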

3.2 Integration and Ito's lemma with respect to dX_t

The stochastic integral with respect to $dX_t$ is defined as last week. Suppose $g_t$ is a progressively measurable process that satisfies
\[ E\big[\, (g_{t+\Delta t} - g_t)^2 \,\big|\, \mathcal{F}_t \,\big] \le C\,\Delta t . \tag{12} \]
Define the Riemann sum approximations to the stochastic integral as
\[ Y_t^{(m)} = \sum_{t_j < t} g_{t_j}\,\big( X_{t_{j+1}} - X_{t_j} \big) . \tag{13} \]
The claim is that the limit
\[ \int_0^t g_s\,dX_s = Y_t = \lim_{m\to\infty} Y_t^{(m)} \tag{14} \]
exists almost surely. The reason is the same as before (write $\approx$ instead of $=$ only because the final time $t$ might split an interval):
\[ Y_t^{(m+1)} - Y_t^{(m)} \approx \sum_{t_j < t} \big( X_{t_{j+1}} - X_{t_{j+1/2}} \big)\,\big( g_{t_{j+1/2}} - g_{t_j} \big) . \]
Therefore
\[ E\Big[ \big( Y_t^{(m+1)} - Y_t^{(m)} \big)^2 \Big] \le C\,\Delta t = C\,2^{-m} , \]
so
\[ E\Big[ \big| Y_t^{(m+1)} - Y_t^{(m)} \big| \Big] \le C\,\Delta t^{1/2} = C\,2^{-m/2} . \]
From here, the Borel–Cantelli lemma implies that
\[ \sum_{m=1}^\infty \big| Y_t^{(m+1)} - Y_t^{(m)} \big| < \infty \quad \text{almost surely} , \]
which then implies that the limit (14) exists almost surely.
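The dyadic convergence of these Riemann sums can be watched directly. For $X = W$ a Brownian motion and $g = W$, the limit is the Ito integral $\int_0^T W\,dW = \tfrac12 W_T^2 - \tfrac12 T$. The sketch below (mine, not from the notes) builds one fine path and forms the non-anticipating Riemann sums on coarser dyadic grids.

```python
import numpy as np

# Convergence of the dyadic Riemann sums (a sketch, not from the notes):
# build one fine Brownian path, then form the non-anticipating sums
# sum_j W_{t_j} (W_{t_{j+1}} - W_{t_j}) on coarser dyadic grids.  They
# approach the Ito value (1/2) W_T^2 - (1/2) T as m grows.
rng = np.random.default_rng(5)
T, m_fine = 1.0, 20
n = 2**m_fine
W = np.concatenate([[0.0],
                    np.cumsum(rng.standard_normal(n) * np.sqrt(T / n))])
ito_value = 0.5 * W[-1]**2 - 0.5 * T
for m in (4, 8, 12, 16, 20):
    step = 2**(m_fine - m)              # subsample the fine path
    Wm = W[::step]
    riemann = np.sum(Wm[:-1] * np.diff(Wm))
    print(m, riemann - ito_value)       # the error shrinks with m
```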

Ito's lemma is a similar story. We want to prove the formula (6) for a sufficiently smooth function $f$. Use our standard notation: $f_j = f(X_{t_j}, t_j)$, $X_j = X_{t_j}$, and $\Delta X_j = X_{j+1} - X_j$. The math is a telescoping representation followed by Taylor expansion:
\[ f(X_t,t) - f(x_0,0) \approx \sum_{t_j<t} [\, f_{j+1} - f_j \,] = \sum_{t_j<t} \big[\, f(X_j+\Delta X_j,\, t_j+\Delta t) - f(X_j,t_j) \,\big] \]
\[ = \sum_{t_j<t} \Big[ f_x(X_j,t_j)\,\Delta X_j + \tfrac12 f_{xx}(X_j,t_j)\,\Delta X_j^2 + f_t(X_j,t_j)\,\Delta t \Big] + \sum_{t_j<t} \Big[ O\big(|\Delta X_j|^3\big) + O\big(|\Delta X_j|\,\Delta t\big) + O\big(\Delta t^2\big) \Big] \]
\[ = S_1 + S_2 + S_3 + S_4 + S_5 + S_6 . \]

The numbering of the terms is the same as last week. We go through them one by one, leaving the hardest one, $S_2$, for last. The first one is
\[ S_1 = \sum_{t_j<t} f_x(X_j,t_j)\,\Delta X_j \;\to\; \int_0^t f_x(X_s,s)\,dX_s \quad \text{as } m\to\infty,\ \text{almost surely} . \]
The third is
\[ S_3 = \sum_{t_j<t} f_t(X_j,t_j)\,\Delta t \;\to\; \int_0^t f_t(X_s,s)\,ds \quad \text{as } m\to\infty . \]
For some reason, people do not feel the need to say "almost surely" when it is an ordinary Riemann sum converging to an ordinary integral. The first error term is $S_4$. Our Borel–Cantelli argument shows that the error terms go to zero almost surely as $m\to\infty$. For example, using familiar arguments,
\[ E[\,|S_4|\,] \le C \sum_{t_j<t} E\big[\, |\Delta X_j|^3 \,\big] \le C \sum_{t_j<t} \Delta t^{3/2} = C\,t\,\Delta t^{1/2} = C\,t\,2^{-m/2} . \]

Finally, the Ito term:
\[ S_2 = \sum_{t_j<t} \tfrac12 f_{xx}(X_j,t_j)\,\mu(X_j)\,\Delta t \;+\; \sum_{t_j<t} \tfrac12 f_{xx}(X_j,t_j)\,\big( \Delta X_j^2 - \mu(X_j)\,\Delta t \big) \;=\; S_{2,1} + S_{2,2} . \]
The first sum, $S_{2,1}$, converges to an integral that is the last remaining part of (6). The second sum goes to zero almost surely as $m\to\infty$, but the argument is more complicated than it was for Brownian motion. Denote a generic term in $S_{2,2}$ by
\[ R_j = \tfrac12 f_{xx}(X_j,t_j)\,\big( \Delta X_j^2 - \mu(X_j)\,\Delta t \big) . \]
With this, $S_{2,2} = \sum R_j$, and
\[ E\big[\, S_{2,2}^2 \,\big] = \sum_{t_j<t} \sum_{t_k<t} E[\, R_j R_k \,] . \]
The diagonal part of the double sum is
\[ \sum_{t_j<t} E\big[\, R_j^2 \,\big] . \]
But $R_j^2 \le C\,\big( \Delta X_j^4 + \Delta t^2 \big)$, so the diagonal sum is OK. The off diagonal sum was exactly zero in the Brownian motion case because there was no $O(\Delta t^2)$ on the right of (2). Here, the off diagonal sum is
\[ 2 \sum_{t_k<t} \Big[ \sum_{t_k < t_j < t} E[\, R_j R_k \,] \Big] . \]
The inner sum is on the order of $\Delta t$, because (2) gives $E[\,R_j \mid \mathcal{F}_j\,] = O(\Delta t^2)$, and since $R_k$ is known in $\mathcal{F}_j$ when $k<j$,
\[ \big| E[\, R_j R_k \,] \big| = \big| E\big[\, E[\,R_j \mid \mathcal{F}_j\,]\, R_k \,\big] \big| \le O(\Delta t^2)\, E[\,|R_k|\,] , \]
so
\[ \Big| \sum_{t_k<t_j<t} E[\, R_j R_k \,] \Big| \le \Big[ \sum_{t_j>t_k} O(\Delta t^2) \Big]\, E[\,|R_k|\,] \le C\,t\,\Delta t\; E[\,|R_k|\,] . \]
You can see from the definition that $E[\,|R_k|\,] = O(\Delta t)$. Therefore, the outer sum is bounded by
\[ 2 \sum_{t_k<t} C\,t\,\Delta t\; O(\Delta t) = C\,t^2\, O(\Delta t) \le C\,2^{-m} . \]
This is what Borel and Cantelli need to show $S_{2,2} \to 0$ almost surely.

We can apply the results of subsection 3.2 to get the quadratic variation formula. Look at
\[ Y_t = \int_0^t X_s\,dX_s . \]
The Ito calculus of subsection 3.2 allows us to find a formula for $Y_t$. On the other hand, the telescoping sum trick from last week allows us to express $Y_t$ in terms of the quadratic variation.

A naive guess would make $Y_t$ equal to $\tfrac12 X_t^2 - \tfrac12 x_0^2$. But Ito's lemma (6) applied to $f(x) = \tfrac12 x^2$, with $f_x = x$ and $f_{xx} = 1$, gives
\[ d\big( \tfrac12 X_t^2 \big) = X_t\,dX_t + \tfrac12\,\mu(X_t)\,dt . \]
Integrating,
\[ \tfrac12 X_t^2 - \tfrac12 x_0^2 = \int_0^t X_s\,dX_s + \tfrac12 \int_0^t \mu(X_s)\,ds , \]
so
\[ \int_0^t X_s\,dX_s = \tfrac12 X_t^2 - \tfrac12 x_0^2 - \tfrac12 \int_0^t \mu(X_s)\,ds . \tag{15} \]
This is consistent with the formula we had earlier for Brownian motion.

The direct approach to $Y_t$ starts from the trick
\[ X_j = \tfrac12\,(X_{j+1}+X_j) - \tfrac12\,(X_{j+1}-X_j) , \]
so
\[ \sum_{t_j<t} X_j\,(X_{j+1}-X_j) = \sum_{t_j<t} \tfrac12\,(X_{j+1}+X_j)(X_{j+1}-X_j) \;-\; \sum_{t_j<t} \tfrac12\,(X_{j+1}-X_j)^2 . \]
The first sum on the right is
\[ \sum_{t_j<t} \tfrac12\,\big( X_{j+1}^2 - X_j^2 \big) \approx \tfrac12 X_t^2 - \tfrac12 x_0^2 , \]
and the second sum converges to $\tfrac12 [X]_t$. Comparing this to (15) gives the formula (5).

Week 8

Stopping times, martingales, strategies

Jonathan Goodman

November 12, 2012

Suppose $X_t$ is a stochastic process and $S$ is some set. The hitting time $\tau$ is the first time $X_t$ hits $S$:
\[ \tau = \min\,\{\, t \mid X_t \in S \,\} . \tag{1} \]
This definition makes sense without extra mathematical technicalities if $X_t$ is a continuous function of $t$ and $S$ is a closed set.$^1$ In that case, $X_\tau \in S$ and $X_t \notin S$ if $t < \tau$. Many practical problems may be formulated using hitting times. When does something break? How long does it take to travel a given distance?

A hitting time is an important example of the more general idea of a stopping time. A stopping time is a time that depends on the path $X_{[0,T]}$, which makes it a random variable. What distinguishes a stopping time is that you know at time $t$ whether $\tau \le t$. If $\mathcal{F}_t$ is the filtration corresponding to $X_t$, then
\[ \{\, \tau \le t \,\} \in \mathcal{F}_t . \tag{2} \]
A hitting time is a stopping time because at time $t$ you know all the values $X_s$ for $s \le t$, so you know whether $X_s \in S$ for some $s \le t$. There are stopping times that are not hitting times. For example, you could stop the first time $X_t$ has been inside $S$ for a given amount of time.

Some random times are not stopping times. For example, take the maximum time rather than the minimum time in (1). This would be the last exit time from $S$. At time $t$, you may not know whether $X_s$ will enter $S$ for some $s > t$, so you do not know whether $\tau \le t$.

Stopping times give a way to model optimal decision problems related to the stochastic process $X_t$. An optimal decision problem is the problem of finding the function $\tau(X_{[0,T]})$ that maximizes or minimizes some performance criterion. The early exercise problem for American style stock options is an optimal stopping problem. Many clinical drug trials have stopping criteria that end the trial if the drug quickly shows itself to be helpful, or dangerous.

$^1$ The set $S$ is closed if $S$ includes all its limit points: if $x_n \in S$ and $x_n \to x$ as $n \to \infty$, then $x \in S$. For example, in one dimension, $S = \{0 < x < 1\}$ is not closed, because $x_n = 1/n$ converges to $x = 0$, but $0 \notin S$.

Consider the problem of stopping a Brownian motion at the largest value possible:
\[ \max_{\tau(W_{[0,T]})} E[\, W_\tau \,] . \tag{3} \]
If $\tau$ could be any function of the whole path, the best choice would be
\[ W_\tau = \max_{0 \le t \le T} W_t . \]
But this $\tau$ is not a stopping time. Even if $W_t$ is the largest value you have seen so far, you have no way of knowing whether $W_s > W_t$ for some $s > t$. (Correction: you do know. Almost surely there is an $s \in (t,T)$ with $W_s > W_t$.) The optimal decision problem is to maximize (3) over the restricted class of random times that are stopping times. You have to say at time $t$ whether to stop at $t$ or keep going.

Many hitting time problems and optimal decision problems may be solved using partial differential equations. For hitting time problems, you solve the forward or backward equation in the complement of $S$ and specify a boundary condition at the boundary of $S$. Many optimal decision problems have the structure that the optimal stopping time is given by an optimal decision boundary. This is a set $S_t$ so that $\tau$ is the first time $X_t \in S_t$.

A stochastic process $X_t$ is a martingale if, for any $s \ge t$,
\[ E[\, X_s \mid \mathcal{F}_t \,] = X_t . \tag{4} \]
The differential version of this is that the drift is equal to zero. That is,
\[ E[\, dX_t \mid \mathcal{F}_t \,] = 0 . \]
A general theorem of Doob states that if $f_t$ is an adapted process and if $f$ and $X$ are not too wild, then
\[ Y_t = \int_0^t f_s\,dX_s \]
is also a martingale. In some sense this is obvious, because the drift coefficient of $Y$ is
\[ E[\, dY_t \mid \mathcal{F}_t \,] = f_t\, E[\, dX_t \mid \mathcal{F}_t \,] = 0 \]
if $X$ is a martingale. The value $f_t$ is known at time $t$ if $f_t$ is adapted. How wild is too wild? That's not a question for this course. But we give some examples where it is true and false.

The Doob stopping time theorem is a special case of the general martingale theorem. If $\tau$ is a stopping time that satisfies $\tau \le T$ (almost surely), then
\[ E[\, X_\tau \,] = x_0 . \tag{5} \]
To see this, define the stopped process $Y_t = X_{t\wedge\tau}$; that is, $Y_t = X_t$ for $t \le \tau$ and $Y_t = X_\tau$ for $t \ge \tau$ (the definitions agree for $t = \tau$). If $\tau \le T$, then $Y_T = X_\tau$. The trick is to write $Y$ as a stochastic integral involving $X$. The integrand is the switching function $f_t = 1$ for $t \le \tau$ and $f_t = 0$ for $t > \tau$. This is an adapted function: you can determine the value of $f_t$ knowing only $X_{[0,t]}$. If
\[ Y_t = x_0 + \int_0^t f_s\,dX_s , \]
then Doob's martingale theorem implies that $Y_t$ is a martingale. Therefore
\[ E[\, Y_T \,] = E[\, X_\tau \,] = x_0 . \]
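Doob's stopping time theorem is easy to test by simulation. The sketch below (mine, not from the notes) stops a discretized Brownian motion started at $x_0 = 0$ at the first exit from $(-1,1)$, capped at a final time $T$ so that $\tau \le T$ holds, and checks that $E[\,W_\tau\,] \approx 0$.

```python
import numpy as np

# Monte Carlo check (a sketch, not from the notes) of the Doob stopping
# time theorem E[W_tau] = w_0 = 0.  Each discretized Brownian path is
# stopped at the first exit from (-1, 1), capped at time T so tau <= T.
rng = np.random.default_rng(6)
paths, n, T = 40_000, 2_000, 8.0
dt = T / n
W = np.zeros(paths)
stopped_value = np.zeros(paths)
alive = np.ones(paths, dtype=bool)
for _ in range(n):
    W[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
    hit = alive & (np.abs(W) >= 1.0)
    stopped_value[hit] = W[hit]       # record the value at exit
    alive &= ~hit
stopped_value[alive] = W[alive]       # paths that reach T without exiting
print(stopped_value.mean())           # should be near 0
```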

There are many questions involving hitting times for a set $S$ that can be answered using a value function $f(x,t)$. The PDE for $f$ depends on the process $X$, but not on $S$. The set $S$ determines boundary conditions that $f$ must satisfy. If you do it right, $f$ will be completely determined by the final condition, the boundary conditions, and the PDE.

The PDE involves the generator of the process $X_t$, which is a differential operator. For a given $t$, think of $f(\cdot,t)$ as an abstract vector. For a one dimensional diffusion, $L$ acts on the vector $f$ as
\[ Lf(x,t) = \tfrac12\,\mu(x,t)\,\partial_x^2 f(x,t) + a(x,t)\,\partial_x f(x,t) . \tag{6} \]
The generator $L$ does not act on the $t$ variable, so $t$ is just a parameter that says which function $f(\cdot,t)$ the generator is acting on.$^2$

For example, if $X$ is an Ornstein–Uhlenbeck process with parameters $\sigma^2$ and $\gamma$, then $\mu = \sigma^2$ and $a = -\gamma x$, so
\[ Lf = \tfrac12\,\sigma^2\,\partial_x^2 f - \gamma x\,\partial_x f . \]

If $f(x,t) = 3x^2 - tx + t^2$, then $Lf = 3\sigma^2 - 6\gamma x^2 + \gamma t x$. The operator $L$ may be expressed as
\[ L = \tfrac12\,\mu(x,t)\,\partial_x^2 + a(x,t)\,\partial_x . \tag{7} \]
Then applying $L$ to a function $f$ gives the expression (6). Mathematicians say that the operator $L$ sends $f$ to $Lf$, or that $f$ goes to $Lf$:
\[ f \;\xrightarrow{\;L\;}\; Lf = \tfrac12\,\sigma^2\,\partial_x^2 f - \gamma x\,\partial_x f . \]
For example, one might write
\[ e^{-x^2/2} \;\xrightarrow{\;(\sigma^2/2)\,\partial_x^2\;}\; \frac{\sigma^2}{2}\,\big( x^2 - 1 \big)\,e^{-x^2/2} . \]
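Generator computations like these are easy to double check with finite differences. The sketch below (mine, not from the notes) compares a centered-difference evaluation of $\tfrac12\sigma^2\partial_x^2 f - \gamma x\,\partial_x f$ against the hand computation $3\sigma^2 - 6\gamma x^2 + \gamma t x$ at an arbitrarily chosen point.

```python
# Finite-difference check (a sketch, not from the notes) of the
# Ornstein-Uhlenbeck generator L = (sigma^2/2) d^2/dx^2 - gamma x d/dx
# applied to f(x,t) = 3x^2 - tx + t^2, which by hand gives
# Lf = 3 sigma^2 - 6 gamma x^2 + gamma t x.
sigma, gamma = 0.8, 1.3          # arbitrary OU parameters
x, t, h = 0.6, 0.4, 1e-3         # evaluation point and step
f = lambda x: 3*x**2 - t*x + t**2
fx  = (f(x + h) - f(x - h)) / (2*h)            # centered 1st derivative
fxx = (f(x + h) - 2*f(x) + f(x - h)) / h**2    # centered 2nd derivative
Lf_numeric = 0.5 * sigma**2 * fxx - gamma * x * fx
Lf_exact = 3*sigma**2 - 6*gamma*x**2 + gamma*t*x
print(Lf_numeric, Lf_exact)      # both about -0.576
```

Since $f$ is quadratic in $x$, the centered differences are exact up to roundoff, so the two values agree to many digits.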

$^2$ A famous joke defines a parameter as a variable constant.

Suppose the diffusion is multi-dimensional. Let $n$ be the number of components of $X_t$. The generator in this case is
\[ Lf = \frac12 \sum_{i=1}^{n} \sum_{j=1}^{n} \mu_{ij}(x,t)\,\partial_{x_i}\partial_{x_j} f \;+\; \sum_{i=1}^{n} a_i(x,t)\,\partial_{x_i} f . \tag{8} \]

Here is a simple example. Let $X_t$ be a one dimensional diffusion with $x_0 = 1$. Let $V(x)$ be a payout function, and suppose you get the payout $V(X_T)$ only if $X_t > 0$ for $0 \le t \le T$. We need some notation for the mathematical formulation. The hitting time is $\tau = \min\{\,t \mid X_t = 0\,\}$. The event we need to describe is the event that $\tau \ge T$. The indicator function of this event is $1_{\tau \ge T}(X_{[0,T]})$, which has the values
\[ 1_{\tau \ge T}(X_{[0,T]}) = \begin{cases} 1 & \text{if } \tau \ge T \\ 0 & \text{if } \tau < T . \end{cases} \]
(This is also called a characteristic function and written $\chi_{\tau \ge T}$. We use the term indicator function because, in probability, "characteristic function" can refer to a Fourier transform.) In this notation, the payout is $V(X_T)\,1_{\tau \ge T}(X_{[0,T]})$. This is a function that is equal to $V(X_T)$ if $\tau \ge T$ and is equal to zero otherwise. The expected payout is
\[ E\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big] . \tag{9} \]

The PDE approach to calculating the expected payout (9) is to define a value function that satisfies a backward equation with boundary conditions. The value function that works is
\[ f(x,t) = E_{x,t}\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big|\, \tau \ge t \,\big] . \tag{10} \]
Taking $x = 1$ and $t = 0$ gives a formula for (9):
\[ f(1,0) = E\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big] . \]
We compute the entire value function (10) for the purpose of getting the single number (9).

The value function defined in (10) may seem more complicated than necessary. A function that is simpler to write down is
\[ g(x,t) = E_{x,t}\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big] . \tag{11} \]
The difference between these is that $g$ counts paths that have touched zero at some time $s < t$. The definition (10) excludes such paths. More precisely, it conditions on not having them. The two definitions are related by Bayes' rule. In the case here, the integrand is zero if $\tau < T$, so
\[ f(x,t) = \frac{g(x,t)}{P_{x,t}(\tau \ge t)} . \]

The definition of $g$ is suitable for expressing as a conditional expectation, conditional on $\mathcal{F}_t$:
\[ g(X_t,t) = E\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big|\, \mathcal{F}_t \,\big] . \]
The denominator has a similar expression. In fact
\[ P(\tau \ge t) = E\big[\, 1_{\tau \ge t}(X_{[0,T]}) \,\big] , \]
so
\[ f(X_t,t) = \frac{ E\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big|\, \mathcal{F}_t \,\big] }{ E\big[\, 1_{\tau \ge t}(X_{[0,T]}) \,\big|\, \mathcal{F}_t \,\big] } . \]
This definition might be very hard to work with for the following reason. The denominator $E\big[\,1_{\tau \ge t}(X_{[0,T]}) \,\big|\, \mathcal{F}_t\,\big]$ is not really an expectation, because the random variable $1_{\tau \ge t}(X_{[0,T]})$ is known at time $t$. Therefore
\[ E\big[\, 1_{\tau \ge t}(X_{[0,T]}) \,\big|\, \mathcal{F}_t \,\big] = 1_{\tau \ge t}(X_{[0,T]}) . \]
It appears that
\[ f(X_t,t) = \frac{ E\big[\, V(X_T)\,1_{\tau \ge T}(X_{[0,T]}) \,\big|\, \mathcal{F}_t \,\big] }{ 1_{\tau \ge t}(X_{[0,T]}) } \]
depends on the path before $t$. Somehow, when you do the division, this dependence cancels out. The bottom line, for me, is that the more complicated definition (10) will be easier to work with.

The value function is found by solving a PDE problem. A PDE problem consists of a PDE and other conditions as appropriate: initial conditions, boundary conditions, final conditions, etc. The PDE in this case is the backward equation
\[ \partial_t f + Lf = \partial_t f + \tfrac12\,\mu(x,t)\,\partial_x^2 f(x,t) + a(x,t)\,\partial_x f(x,t) = 0 . \tag{12} \]
This PDE is satisfied in the region $x > 0$. The value function may not be defined for $x < 0$. If it is defined there, the most natural definition would be $f = 0$. Either way, the PDE (12) is used only in the continuation region $x > 0$. The final condition is clear from the definition of $f$: if $t = T$ and $x > 0$, then $f(x,T) = V(x)$. There is an extra boundary condition at $x = 0$, which is the boundary of the continuation region in this example. We can guess this boundary condition by continuity. If $x$ is actually on the boundary of the continuation region, which is to say $x = 0$, then the definition (10) gives the value zero. If $f(x,t)$ is a continuous function of $x$, then 0 is the limiting value of $f$ as $x \to 0$. This suggests that the boundary condition should be
\[ f(0,t) = 0 . \tag{13} \]

Here is one of the few examples where $f$ may be calculated explicitly. Let the process $X_t$ be Brownian motion starting at $x_0 = 1$, with $\mathrm{var}(X_t) = t$. This makes $\mu = 1$ and $a = 0$. Take $V(x) = 1$, so $f$ is just the conditional probability of not touching the boundary before time $T$. The PDE problem is: find $f(x,t)$, defined for $x \ge 0$ and $t \le T$, that satisfies the PDE
\[ \partial_t f + \tfrac12\,\partial_x^2 f = 0 \]
where it is defined. In addition, $f$ should satisfy the final condition $f(x,T) = 1$ for $x > 0$, and the boundary condition $f(0,t) = 0$ for $t \le T$.

This problem may be solved using something like the method of images. We extend the definition of $f$ so that $f$ is defined for all $x$, with the anti-symmetry condition $f(-x,t) = -f(x,t)$. If $f$ is continuous, this implies that $f(0,t) = 0$. In order to achieve the anti-symmetry condition, we take the final condition to be anti-symmetric. We do this without changing the already known values of $f(x,T)$ for $x > 0$. Clearly, the extended final condition should be $f(x,T) = 1$ for $x > 0$ and $f(x,T) = -1$ for $x < 0$. The value of $f$ when $x = 0$ is irrelevant. Now there is no boundary; we are talking about the simple heat equation (OK, with the direction of time reversed) on the whole line. The solution may be given as a Green's function integral using the known final values:
\[ f(x,t) = \int_{-\infty}^{\infty} G(x-y,\,T-t)\,f(y,T)\,dy \]
\[ \phantom{f(x,t)} = \int_0^{\infty} G(x-y,\,T-t)\,dy \;-\; \int_{-\infty}^{0} G(x-y,\,T-t)\,dy \]
\[ \phantom{f(x,t)} = \int_0^{\infty} \frac{1}{\sqrt{2\pi(T-t)}}\; e^{-(x-y)^2/2(T-t)}\,dy \;-\; \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi(T-t)}}\; e^{-(x-y)^2/2(T-t)}\,dy . \]

The two Gaussian integrals on the last line represent probabilities. We can express them in terms of the cumulative normal distribution function $N(z) = P(Z \le z)$, where $Z \sim \mathcal{N}(0,1)$. The second integral on the last line is the probability that the random variable $Y \sim \mathcal{N}(x,\,T-t)$ has $Y < 0$. In general, if $Y \sim \mathcal{N}(m,\sigma^2)$, then $Y \sim m + \sigma Z$, where $Z \sim \mathcal{N}(0,1)$. The expression $Y \sim m + \sigma Z$ means that $Y$ and $m + \sigma Z$ have the same distribution. This implies that
\[ P(Y<0) = P(m+\sigma Z<0) = P\Big( Z < -\frac{m}{\sigma} \Big) = N\Big( -\frac{m}{\sigma} \Big) . \]
In this example, $m = x$ and $\sigma = \sqrt{T-t}$, so
\[ P(Y<0) = N\Big( \frac{-x}{\sqrt{T-t}} \Big) . \]

There are two properties of Gaussians, each of which would give a way to write the first integral in terms of $N$. The first is $P(Z > a) = P(-Z > a) = P(Z < -a)$; the first equality holds because $Z \sim -Z$, as the Gaussian distribution is symmetric. This gives $P(Z > a) = N(-a)$. In the present example, the first integral is
\[ P(Y>0) = P(m+\sigma Z>0) = P\Big( Z > -\frac{m}{\sigma} \Big) = N\Big( \frac{m}{\sigma} \Big) = N\Big( \frac{x}{\sqrt{T-t}} \Big) . \]
The resulting formula for the survival probability is
\[ f(x,t) = P_{x,t}(\tau > T) = N\Big( \frac{x}{\sqrt{T-t}} \Big) - N\Big( \frac{-x}{\sqrt{T-t}} \Big) . \tag{14} \]

Here is a quick check that this function satisfies all the conditions we set for it. It satisfies the PDE (a calculation using $N'(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$). It satisfies the boundary condition: if you put $x = 0$, the two terms on the right cancel exactly. It satisfies the final condition: if $x > 0$ and you send $t$ to $T$, then $N\big( x/\sqrt{T-t}\,\big) \to 1$ and $N\big( -x/\sqrt{T-t}\,\big) \to 0$. That is because $x/\sqrt{T-t} \to \infty$ and $-x/\sqrt{T-t} \to -\infty$ (this is where $x > 0$ matters).

The other fact about $N$ is $P(Z > a) = 1 - P(Z < a) = 1 - N(a)$. Therefore,
\[ P\Big( Z > -\frac{m}{\sigma} \Big) = 1 - P\Big( Z < -\frac{m}{\sigma} \Big) = 1 - N\Big( \frac{-x}{\sqrt{T-t}} \Big) . \]
This gives a formula equivalent to (14), which is
\[ f(x,t) = P_{x,t}(\tau > T) = 1 - 2\,N\Big( \frac{-x}{\sqrt{T-t}} \Big) . \tag{15} \]
This formula is what you would get from the Kolmogorov reflection principle:
\[ P_{x,t}\big( X_s < 0 \ \text{for some}\ s \in [t,T] \big) = 2\,P_{x,t}\big( X_T < 0 \big) , \]
so
\[ P_{x,t}\big( X_s > 0 \ \text{for all}\ s \in [t,T] \big) = 1 - P_{x,t}\big( X_s < 0 \ \text{for some}\ s \in [t,T] \big) = 1 - 2\,P_{x,t}(X_T<0) = 1 - 2\,N\Big( \frac{-x}{\sqrt{T-t}} \Big) . \]
The formula (15) satisfies the PDE for the same reason as (14). It satisfies the final condition because $P_{x,t}(X_T < 0) \to 0$ as $t \to T$ when $x > 0$. It satisfies the boundary condition because $N(0) = \tfrac12$.
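The survival probability formula can be confirmed by simulation. The sketch below (mine, not from the notes) counts discretized Brownian paths from $x = 1$ that stay positive up to time $T$ and compares with $1 - 2N(-x/\sqrt{T})$. Discrete paths can step over the barrier between grid points, so the estimate is biased slightly high, and the agreement is only approximate.

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check (a sketch, not from the notes) of the survival
# probability P_{x,0}(tau > T) = 1 - 2 N(-x / sqrt(T)) for Brownian
# motion started at x = 1.  Discrete paths can jump over the barrier
# between grid points, so the estimate is biased slightly high.
def N(z):                         # standard normal CDF via erf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(7)
x, T, paths, n = 1.0, 1.0, 20_000, 2_000
dt = T / n
W = np.full(paths, x)
survived = np.ones(paths, dtype=bool)
for _ in range(n):
    W += np.sqrt(dt) * rng.standard_normal(paths)
    survived &= (W > 0.0)
frac = survived.mean()
exact = 1.0 - 2.0 * N(-x / sqrt(T))
print(frac, exact)                # approximately equal
```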

We can use (14) or (15) to estimate the survival probability starting from

a fixed x at time t = 0 as T ! 1. This is the probability of not hitting

the boundary for a long time. The argument to N goes to zero as T ! 1.

Therefore, we use N () N (0) + N 0 (0). We already saw that N (0) = 12 and

N 0 (0) = p12 . Therefore, for large T ,

x

Px,0 ( > T ) p . (16)

2T

7

We see that this goes to zero as T → ∞. Therefore, we know that from any
starting point, one dimensional Brownian motion hits x = 0 at some positive
time almost surely. Something similar is true in two dimensions: a two dimensional
Brownian motion comes arbitrarily close to the origin almost surely (though it
almost surely never hits the exact point). In three or more dimensions it is not
true. In fact, if |X_0| > 1, there is a positive probability that |X_t| > 1 for all
t > 0. Brownian motion in one or two dimensions is recurrent, while it is
transient in dimensions 3 or more.

While we are talking about these solutions to the backward equation, let us
notice some other properties. One is the smoothing property. The formula (14)
defines a function that is discontinuous when t = T. Nevertheless, f(x, t) is a
smooth function of x for t < T. This is a general property of PDEs of diffusion
type.

Some other properties are illustrated by a different solution of the backward
equation

    h(x, t) = N((x + 1)/√(T − t)) − N((x − 1)/√(T − t)) .

This function has final values h(x, T) = 1 if −1 < x < 1 and h(x, T) = 0
otherwise. That makes h(x, T) = 1_{|x|<1}, which is a step function that is
different from zero when x is not too far from zero. Whenever t < T, h(x, t) > 0
for any x. This means that the fact that h > 0 propagates infinitely fast through
the whole domain where h is defined. This is also a property of general diffusion
PDEs. However, the solution is not large for x > 1 and t close to T. In fact,
it is exponentially small. The influence is very small over short times and large
distances.
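To see how small "exponentially small" is, one can evaluate h numerically. Writing N in terms of erfc avoids the tiny tail values rounding to zero. A sketch (T = 1 and the sample points are arbitrary choices):

```python
import math

def h(x, t, T):
    # h(x,t) = N((x+1)/sqrt(T-t)) - N((x-1)/sqrt(T-t)),
    # written with erfc so that tiny tail values do not round to zero
    s = math.sqrt(2.0 * (T - t))
    return 0.5 * (math.erfc((x - 1.0) / s) - math.erfc((x + 1.0) / s))

inside = h(0.0, 0.5, 1.0)     # well inside the final data: order one
outside = h(3.0, 0.98, 1.0)   # outside, shortly before T: positive but tiny
```

The value at x = 3 just before the final time is strictly positive (infinite propagation speed) but smaller than 10^{-40}.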

Finally, look at h(x, t) for x near −1 and t close to T. The second term is
exponentially small, as we just said. But the first term looks like the solution
with final data that have a jump at x = −1. The behavior near −1 and T is
almost completely determined by the final condition there. This is approximate
locality.

Week 9

Stochastic differential equations

Jonathan Goodman

November 19, 2012

The material this week is all about the expression

    dX_t = a_t dt + b_t dW_t .        (1)
There are two distinct ways to interpret this, which we will call strong and weak.

The strong interpretation is more literally true: that X is a function of W, that
X, a and b are adapted, and (1) is true in the sense of the Ito calculus. For
example, we saw that if X_t = e^{σW_t}, then

    dX_t = σ e^{σW_t} dW_t + (1/2) σ² e^{σW_t} dt
         = (1/2) σ² X_t dt + σ X_t dW_t .

In this case, (1) is satisfied with a_t = (1/2) σ² X_t and b_t = σ X_t. Roughly
speaking, this is the sense we usually assume when doing analysis.

The weak interpretation is not literal. It does not require that we have a
Brownian motion path W in mind. In the weak sense, (1) means that at time
t you know X_t, a_t and b_t, and

    E[ dX_t | F_t ] = a_t dt ,        (2)

and

    E[ (dX_t)² | F_t ] = b_t² dt .        (3)

In this view, (1) just says that for dt > 0, the corresponding dX is the sum
of a deterministic and a random piece, a_t dt and b_t dW_t. Deterministic means
known at time t. The example with a_t = (1/2) σ² X_t shows that a_t need not be
known at times earlier than t. The random piece models the part of dX_t that
cannot be predicted at time t. We assume that this noise component has mean
zero, because if the mean were not zero, we would put the mean into a_t instead.
The modeling assumption is that the noise innovation, dW_t, not only has mean
zero, but is independent of anything known in F_t. The strength of the noise at
time t is b_t. This is assumed known at time t. An unknown part of b_t would be
part of W_t instead.
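The weak statements (2) and (3) can be seen in a simulation of the example X_t = e^{σW_t}. Conditioning on W_t = 0 (so X_t = 1), sample averages of dX and (dX)² over many independent increments should match a_t dt = ½σ² dt and b_t² dt = σ² dt. This is a sketch; σ, dt and the sample size are arbitrary choices.

```python
import math, random

random.seed(0)
sigma, dt, trials = 0.3, 0.01, 200_000

# X_t = exp(sigma W_t); condition on W_t = 0, so X_t = 1, and sample
# dX = exp(sigma dW) - 1 over many independent increments dW ~ N(0, dt)
samples = [math.exp(sigma * random.gauss(0.0, math.sqrt(dt))) - 1.0
           for _ in range(trials)]

mean_dX = sum(s for s in samples) / trials        # compare with (2): a_t dt
mean_dX2 = sum(s * s for s in samples) / trials   # compare with (3): b_t^2 dt

a_dt = 0.5 * sigma**2 * dt    # a_t dt with X_t = 1
b2_dt = sigma**2 * dt         # b_t^2 dt with X_t = 1
```

The agreement is up to statistical error of order trials^{-1/2} and higher-order terms in dt.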

The point of all this philosophy is that we can create models of stochastic
processes by writing expressions for a_t and b_t in (1). If we think we know the
(conditional at time t) mean and variance, we use (1) to create a model process
X_t. The Black and Scholes model of the evolution of a stock price is a great
example of this kind of reasoning. Suppose S_t is the price of a stock at time
t. Then dS_t should be proportional to S_t so that dS is measured as a percentage
of S_t (OK, a little tautology there). If S_t were replaced by, say, 2S_t, then dS_t
would be replaced by 2dS_t. If there were a doublestock that consisted of two
shares of stock, the price change of a doubleshare would be twice the change of
a single share. Therefore, we think both the deterministic and random parts of
dS_t should be proportional to S_t. The constants are traditionally called µ and
σ, the expected rate of return and the volatility respectively:

    dS_t = µ S_t dt + σ S_t dW_t .        (4)

The difference between a general process satisfying (1) and an SDE in the form

    dX_t = a(X_t, t) dt + b(X_t, t) dW_t        (5)

is that here a_t and b_t are required to be given by known deterministic functions.
For example, the function a(x, t) is a function of two variables that is completely
known at time 0. For this reason, the solution to an SDE is a Markov process.
The probability distribution of the path starting at time t, which is X_{[t,T]},
is completely determined by X_t. If a and b are independent of t, then the SDE
(5) and the corresponding Markov process are homogeneous. Otherwise they are
heterogeneous. If b is independent of X we have additive noise. Otherwise, the
noise is multiplicative. There are many problems with additive noise. These are
simpler from both theoretical and practical points of view.

Most SDEs do not have closed form solutions. But solutions may be computed
numerically. The Euler method, also called Euler–Maruyama, is a way to create
approximate sample paths. It is possible to get information about solutions
by solving the forward or backward Kolmogorov equation. These, also, would
generally be solved numerically. But that is impractical for SDE systems with
more than a few components.

2 Geometric Brownian motion

Solutions to the SDE (4) are called geometric Brownian motion, or GBM. There
are several ways to find the solution. One is to try an ansatz of the form
S_t = A_t e^{σW_t}. Here, A_t is a deterministic function of t. We do an Ito
calculation with f(w, t) = A_t e^{σw}, so that ∂_t f = Ȧ_t e^{σw}, ∂_w f = σf, and
∂_w² f = σ² f. The result is

    d(A_t e^{σW_t}) = Ȧ_t e^{σW_t} dt + σ A_t e^{σW_t} dW_t + (1/2) σ² A_t e^{σW_t} dt
                    = ( Ȧ_t/A_t + (1/2) σ² ) S_t dt + σ S_t dW_t .

Matching the dt term with (4) requires

    Ȧ_t/A_t + (1/2) σ² = µ .

That implies that

    Ȧ_t = ( µ − (1/2) σ² ) A_t .

The solution is a simple exponential:

    A_t = A_0 e^{(µ − σ²/2) t} .

The initial condition S_0 = s_0 gives A_0 = s_0. The full solution is

    S_t = s_0 e^{σW_t + (µ − σ²/2) t} .        (6)
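One easy consequence of (6) to test: E[S_t] = s_0 e^{µt}, since E[e^{σW_t}] = e^{σ²t/2}. The sketch below integrates (6) against the standard normal density with a midpoint rule; the grid and the cutoff at |z| = 8 are arbitrary choices.

```python
import math

s0, mu, sigma, t = 1.0, 0.05, 0.2, 2.0

# E[S_t] from (6): integrate s0*exp(sigma*sqrt(t)*z + (mu - sigma^2/2)*t)
# against the standard normal density, by a midpoint rule on [-8, 8]
dz, total = 1e-3, 0.0
z = -8.0 + 0.5e-3
while z < 8.0:
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    total += phi * s0 * math.exp(sigma * math.sqrt(t) * z
                                 + (mu - 0.5 * sigma**2) * t) * dz
    z += dz

exact_mean = s0 * math.exp(mu * t)   # the lognormal mean E[S_t]
```

The quadrature reproduces s_0 e^{µt} to high accuracy, confirming that the −σ²/2 correction in the exponent of (6) is exactly what keeps the expected growth rate equal to µ.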

Another way to find the solution is the change of variables

    X_t = log(S_t) .        (7)

Use Ito's lemma with f(s) = log(s), so ∂_s f = 1/s and ∂_s² f = ∂_s (1/s) = −1/s².
Then

    dX_t = df(S_t)
         = ∂_s f(S_t) dS_t + (1/2) ∂_s² f(S_t) (dS_t)²
         = (1/S_t) dS_t − (1/2)(1/S_t²)(dS_t)²
         = (1/S_t)( µ S_t dt + σ S_t dW_t ) − (1/2)(1/S_t²) σ² S_t² dt

    dX_t = ( µ − (1/2) σ² ) dt + σ dW_t .

The solution of this is ordinary arithmetic Brownian motion (there are geometric
series and arithmetic series):

    X_t = x_0 + ( µ − (1/2) σ² ) t + σ W_t .

An arithmetic Brownian motion has constant drift and Brownian motion parts.
This one has drift µ − (1/2) σ² and noise coefficient σ. A standard Brownian
motion has zero drift and unit noise coefficient. The solution from this is

    S_t = e^{X_t} = s_0 e^{σW_t + (µ − σ²/2) t} ,

which agrees with (6).

There is another way to arrive at the log variable transformation. Suppose
V(S_T) is a payout and we consider the value function f(s, t) = E_{s,t}[V(S_T)].
This satisfies the backward equation

    ∂_t f + (σ² s²/2) ∂_s² f + µ s ∂_s f = 0 .        (8)

This is because a(s, t) = µs, and b(s, t) = σs. The Black Scholes PDE is similar
to this. This PDE has coefficients σ² s²/2 and µs, which are functions of the
independent variable s. It is a linear PDE with variable coefficients. We can
simplify the PDE by a change of variable to make it into a constant coefficient
PDE. This change of variable is

    x = log(s) ,   s = e^x .

There is an obvious sense in which this is the same as (7). But there is a sense
in which it is different. Here, the substitution is about simple variables s and
x, not stochastic processes S_t and X_t.
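One can check (8) numerically on a payout with a known value function. For V(s) = s, the value function is f(s, t) = E_{s,t}[S_T] = s e^{µ(T−t)}, so a finite-difference residual of (8) should vanish up to discretization error. This is a sketch; the payout is chosen only because it has a closed form.

```python
import math

mu, sigma, T = 0.1, 0.3, 1.0

def f(s, t):
    # value function for the payout V(s) = s: f(s,t) = E_{s,t}[S_T] = s e^{mu (T-t)}
    return s * math.exp(mu * (T - t))

def residual(s, t, h=1e-4):
    # finite-difference residual of (8): f_t + (sigma^2 s^2 / 2) f_ss + mu s f_s
    ft = (f(s, t + h) - f(s, t - h)) / (2 * h)
    fs = (f(s + h, t) - f(s - h, t)) / (2 * h)
    fss = (f(s + h, t) - 2 * f(s, t) + f(s - h, t)) / (h * h)
    return ft + 0.5 * sigma**2 * s * s * fss + mu * s * fs

r = residual(2.0, 0.5)   # should be near zero
```

Here ∂_s² f = 0 and the time-derivative term −µ s e^{µ(T−t)} cancels the drift term µ s ∂_s f exactly.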

We rewrite (8) in the x variable. The chain rule from calculus gives

    ∂_s f = ∂f/∂s = (∂f/∂x)(∂x/∂s) = (1/s) ∂_x f .

The next derivative is

    ∂_s² f = ∂_s (∂_s f)
           = ∂_s ( (1/s) ∂_x f )
           = ( ∂_s (1/s) ) ∂_x f + (1/s) ∂_s (∂_x f)
           = −(1/s²) ∂_x f + (1/s²) ∂_x² f .

These derivatives go back into (8) to give

    ∂_t f + (σ²/2) ∂_x² f + ( µ − σ²/2 ) ∂_x f = 0 .        (9)


This PDE is the backward equation for the SDE dX_t = (µ − σ²/2) dt + σ dW_t.
We can transform (9) into the standard heat equation with some simple
normalizations. The first is to get rid of the drift term, the term involving
∂_x f, using a coordinate that moves with the drift:

    x = y + ( µ − σ²/2 ) t ,   y = x − ( µ − σ²/2 ) t .

2 2

It can be tricky to calculate what happens to the equation (9) in the new
coordinates. Since ∂_y x = ∂_x y = 1, calculating space derivatives (x or y) does
not change the equation. The change has to do with the time derivative, which
may not seem to have changed. One way to do it is to recall the definition of
partial derivative. If the variables are x and t, the partial derivative of f with
respect to t with x fixed is

    ∂_t f |_x = lim_{Δt→0} ( f(t + Δt) − f(t) ) / Δt ,  with x fixed.

This is the rate of change of the quantity f when t is changed and x is held
fixed. The definition of ∂_t changes if we use the y variable in place of x. It
becomes the rate of change of f when t changes and y is held fixed. Suppose
we have changes Δt, Δx and Δy. We fix Δt and calculate Δy:

    Δy = Δx − ( µ − σ²/2 ) Δt .

If Δy = 0, then Δx = ( µ − σ²/2 ) Δt. Therefore

    ∂_t f |_y = lim_{Δt→0} ( f(x + Δx, t + Δt) − f(x, t) ) / Δt ,  with y fixed,
             = lim_{Δt→0} (1/Δt) [ ∂_x f Δx + ∂_t f |_x Δt + O(Δx²) + O(Δt²) ]
             = lim_{Δt→0} ( ∂_x f ( µ − σ²/2 ) Δt + ∂_t f |_x Δt ) / Δt
             = ( µ − σ²/2 ) ∂_x f + ∂_t f |_x .

Substituting this into (9), the drift term cancels and the equation becomes

    ∂_t f + (σ²/2) ∂_y² f = 0 .

The difference is that ∂_t now refers to the derivative with y fixed rather than
with x fixed as in (9). One more step will transform this to the standard heat

equation. We rescale the space variable to get rid of the coefficient σ². The
change of variables that does that is

    z = y/σ ,   y = σz ,   ∂z/∂y = 1/σ .

In these variables,

    ∂_t f + (1/2) ∂_z² f = 0 .

The payoff for these manipulations is that we know a lot about solutions of
the heat equation. We know the Green's function and the Fourier transform.
All of these tools now apply to the equation (8) too.

3 Simulating an SDE

There are three primary ways to get information about the solution of an SDE.
The first is to solve it exactly in closed form. That option is limited to very few
SDEs. The second is to solve the backward equation numerically. We discuss
in a future class how to do that. For now it suffices to say that this approach
is impractical for SDE systems with more than a few components. The third is
direct numerical simulation of the SDE. We discuss that option here.

Consider an SDE (5). We want to approximate the path X_{[0,T]}. Choose
a small Δt, define t_n = nΔt, and define the approximate sample path to be
X_n^{(Δt)} ≈ X_{t_n}. There are two forms of the approximation algorithm.
One is

    X_{n+1}^{(Δt)} = X_n^{(Δt)} + a(X_n^{(Δt)}, t_n) Δt + b(X_n^{(Δt)}, t_n) ΔW_n .        (10)

Here, ΔW_n = W_{t_{n+1}} − W_{t_n} is the increment of Brownian motion
for time Δt. In practice, we have to generate the Brownian motion increments
using a random number generator. The properties of the ΔW_n are that they are
independent, and that they are Gaussian with mean zero and variance Δt. In
case it is a multi-variate Brownian motion, the covariance is Δt I. We generate
such random variables starting with independent standard normals Z_n and
multiplying by Δt^{1/2}:

    ΔW_n = Δt^{1/2} Z_n .        (11)

This algorithm is motivated by the strong form of the SDE. When we integrate
(5) over the time increment [t_n, t_n + Δt], we find

    X_{t_{n+1}} = X_{t_n} + ∫_{t_n}^{t_n+Δt} a(X_t, t) dt + ∫_{t_n}^{t_n+Δt} b(X_t, t) dW_t .        (12)

If we freeze the coefficients at the approximate value X_{t_n} in (12), the
integrals simplify to

    ∫_{t_n}^{t_n+Δt} a(X_t, t) dt ≈ a(X_{t_n}, t_n) Δt ,

and

    ∫_{t_n}^{t_n+Δt} b(X_t, t) dW_t ≈ b(X_{t_n}, t_n) ∫_{t_n}^{t_n+Δt} dW_t = b(X_{t_n}, t_n) ΔW_n .

The approximate formula for the exact path motivates the exact formula (10)
for the approximate path.

The other form of the approximation algorithm just looks for approximate
sample paths that have the right mean and variance over a step of size Δt. For
that purpose, let Z_n be a family of independent random variables with mean
zero and variance 1, or covariance I. You want

    E[ X_{n+1}^{(Δt)} | F_n ] = X_n^{(Δt)} + a(X_n^{(Δt)}, t_n) Δt ,

and

    var( X_{n+1}^{(Δt)} | F_n ) = b²(X_n^{(Δt)}, t_n) Δt .

These formulas are not exact for SDE paths, but they will hold exactly for the
approximate sample paths X^{(Δt)}. The formula

    X_{n+1}^{(Δt)} = X_n^{(Δt)} + a(X_n^{(Δt)}, t_n) Δt + b(X_n^{(Δt)}, t_n) Δt^{1/2} Z_n        (13)

makes these true. The actual algorithms (10) and (13) are identical, if we use a
Gaussian Z_n in (13). The difference is only in interpretation. In the strong form
(10) we are generating an approximate path that is a function of the driving
Brownian motion. In the weak form (13), we are just making a path with
approximately the right statistics. In either case, generating an approximate
sample path might be called simulating the SDE.
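The step (10)/(13) takes only a few lines to implement. The sketch below applies it to the GBM equation (4), so the endpoint can be compared with the exact solution (6) built from the same Brownian increments; the function and parameter names are illustrative, not fixed conventions.

```python
import math, random

def euler_maruyama(a, b, x0, T, n_steps, rng):
    # one approximate sample path by formula (10)/(13)
    dt = T / n_steps
    x, w, t = x0, 0.0, 0.0
    for n in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Delta W_n = dt^{1/2} Z_n, as in (11)
        x += a(x, t) * dt + b(x, t) * dw
        w += dw
        t += dt
    return x, w   # endpoints of the path and of the driving Brownian motion

mu, sigma, s0, T = 0.05, 0.2, 1.0, 1.0
rng = random.Random(1)
x_T, w_T = euler_maruyama(lambda x, t: mu * x,
                          lambda x, t: sigma * x, s0, T, 10_000, rng)
exact = s0 * math.exp(sigma * w_T + (mu - 0.5 * sigma**2) * T)   # formula (6)
```

With Δt = 10^{-4}, the pathwise (strong) error between x_T and the exact solution driven by the same noise is small, of order Δt^{1/2}.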

We are usually interested in more than simulations. Instead we want to know
expected values of functions of the path. These could be expected payouts or
hitting probabilities or something more complicated. A single path or a small
number of paths may not represent the entire population of paths. The only
way to learn about population properties is to do a large number of simulations.
The main way to digest a large collection of sample paths is to compute statistics
of them. Most such statistics correspond to the expected value of some function
of a path.

This brings up an important distinction, between simulation and Monte
Carlo. Simulation is described above. Monte Carlo¹ is the process of using
random numbers to compute something that itself is not random. A hitting
probability is not a random number, for example, though it is defined in terms
of a random process. The distinction is important because it suggests that there
may be more than one way to estimate the same number. We will discuss this
a little later when we talk about applications of Girsanov's theorem.

¹ This thoughtful definition may be found in the book Monte Carlo Methods by Malvin
Kalos and Paula Whitlock.

For now, let V(x_{[0,T]}) be some function of a path on the interval [0, T]. Let
its expected value for the SDE be

    A = E[ V(X_{[0,T]}) ] .

For example, to estimate a hitting probability you might take V = 1_{τ ≤ T}. To
estimate A, we generate a large number of independent approximate sample
paths X_{[0,T],k}^{(Δt)}, k = 1, . . . , L. Here L is the sample size, which is the
number of paths. The estimate of A is

    Â = (1/L) Σ_{k=1}^{L} V( X_{[0,T],k}^{(Δt)} ) .        (14)

The error is Â − A. It has two parts, bias and statistical error. The bias
comes from the fact that sample paths are not exact. You reduce bias by letting
Δt → 0. The definition is

    bias = E[ Â ] − A = E[ V( X_{[0,T],k}^{(Δt)} ) ] − E[ V(X_{[0,T]}) ] .

In statistics, the bias of an estimator is the difference between its expected value
and the true parameter value. There are unbiased statistics, but it is very rare
that the estimate of a quantity like A is unbiased. Statistical error comes from
the fact that we use a finite number of sample paths. You reduce statistical
error by taking L → ∞. The definition is

    statistical error = Â − E[ Â ] .

Neither the bias nor the statistical error goes to zero very fast as Δt → 0 or
L → ∞. The bias typically is proportional to Δt or Δt^{1/2}, depending on the
problem. The statistical error typically is proportional to L^{−1/2}, which comes
from the central limit theorem. For this reason Monte Carlo estimation either
is very expensive, or not very accurate, or both.
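Here is a sketch of the estimator (14) in a case with no bias: we sample S_T exactly using (6) rather than an approximate path, so all the error is statistical, and the usual sample standard error makes the L^{−1/2} scaling concrete. Parameters are arbitrary; the true answer A = s_0 e^{µT} is known for comparison.

```python
import math, random

random.seed(2)
mu, sigma, s0, T, L = 0.05, 0.2, 1.0, 1.0, 10_000

# L independent exact samples of S_T using formula (6)
vals = [s0 * math.exp(sigma * random.gauss(0.0, math.sqrt(T))
                      + (mu - 0.5 * sigma**2) * T) for _ in range(L)]

A_hat = sum(vals) / L                                   # the estimator (14)
var_hat = sum((v - A_hat)**2 for v in vals) / (L - 1)   # sample variance
std_err = math.sqrt(var_hat / L)                        # statistical error ~ L^{-1/2}

A = s0 * math.exp(mu * T)    # the exact answer E[S_T], for comparison
```

With L = 10,000 the standard error is roughly 0.002, so Â agrees with A to about two digits; quadrupling L only halves the error bar.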

You could ask about the scientific justification for using SDE models if all
we do with them is discretize and simulate. Couldn't we just have simulated the
original process? There are several justifications for the SDE approach. One
has to do with time scales. We saw on an old assignment that the diffusion
process may be an approximation to another process that operates on a very
fast time scale, T_m. It is possible that the time step Δt needed to simulate the
SDE is much larger than T_m (the microscopic time scale). The SDE model
also can be simpler than the microscopic model it approximates. For example,
the Ornstein-Uhlenbeck/Einstein model of Brownian motion replaces a process
in (X, V) space with a simpler process in X space alone.

4 Existence of solutions

The first question asked in a graduate course on differential equations is: Do
differential equations have solutions? You can ask the same question about
stochastic differential equations, and the answer is similar. If the coefficients
a(x, t) and b(x, t) are Lipschitz continuous, then a simple iteration argument
shows that solutions exist. If the coefficients are not Lipschitz continuous, all
questions get harder and more problem specific.

A function f(x) is Lipschitz continuous with Lipschitz constant C if

    |f(x) − f(y)| ≤ C |x − y| for all x and y.        (15)

The functions f(x) = e^x, f(x) = x², and f(x) = sin(x²) are not Lipschitz
continuous. If f is differentiable and |f′(x)| ≤ C for all x, then f is Lipschitz
continuous with constant C. We say that f is locally Lipschitz near the point
x_0 if there is an R so that (15) holds whenever |x − x_0| ≤ R and |y − x_0| ≤ R.
Any continuously differentiable function is locally Lipschitz. The examples show
that many nice seeming functions are not globally Lipschitz.
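A quick numerical illustration of the definition: sampling difference quotients shows that sin is globally Lipschitz with C = 1, while f(x) = x² has bounded quotients only near 0 and unbounded ones far away. The sample points below are arbitrary choices.

```python
import math

def max_quotient(f, points):
    # largest |f(x) - f(y)| / |x - y| over the sampled pairs
    q = 0.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            q = max(q, abs(f(x) - f(y)) / abs(x - y))
    return q

pts_small = [i / 10.0 for i in range(-20, 21)]      # points near 0
pts_large = pts_small + [100.0, 1000.0, 1001.0]     # points far from 0

q_sin = max_quotient(math.sin, pts_large)               # stays <= 1 everywhere
q_sq_local = max_quotient(lambda x: x * x, pts_small)   # bounded near 0
q_sq_global = max_quotient(lambda x: x * x, pts_large)  # blows up
```

For x², the quotient is |x + y|, which is at most 4 on [−2, 2] (locally Lipschitz) but exceeds 2000 for the pair (1000, 1001).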

In the easy part of the existence theory for differential equations, locally
Lipschitz equations have local solutions (solutions defined for a finite but
possibly not infinite range of t). Globally Lipschitz equations have solutions
defined globally in time, which is to say, for all t ∈ [0, ∞). In particular, if a
and b are (globally) Lipschitz then the SDE (5) has a solution defined globally
in time. More precisely, there is a function W_{[0,t]} → X_t so that the process
X_t satisfies (5). This function may be constructed as a limit of the Picard
iteration process. This defines a sequence of approximate solutions X_{t,k} and
recovers the exact solution in the limit k → ∞. The iteration is

    X_{t,k+1} = x_0 + ∫_0^t a(X_{s,k}, s) ds + ∫_0^t b(X_{s,k}, s) dW_s .

Here is a quick version of the argument for the simpler case a = 0. We
subtract the k and k − 1 equations to get

    X_{t,k+1} − X_{t,k} = ∫_0^t ( b(X_{s,k}, s) − b(X_{s,k−1}, s) ) dW_s .

The object is to prove that X_{t,k+1} − X_{t,k} is smaller than X_{t,k} − X_{t,k−1}
eventually as k → ∞. You can use the Ito isometry formula to get

    E[ (X_{t,k+1} − X_{t,k})² ] = ∫_0^t E[ ( b(X_{s,k}, s) − b(X_{s,k−1}, s) )² ] ds .

The Lipschitz condition (15) gives

    ( b(X_{s,k}, s) − b(X_{s,k−1}, s) )² ≤ C² ( X_{s,k} − X_{s,k−1} )² .

Therefore,

    E[ (X_{t,k+1} − X_{t,k})² ] ≤ C² ∫_0^t E[ (X_{s,k} − X_{s,k−1})² ] ds .

Now define

    M_{t,k} = E[ (X_{t,k+1} − X_{t,k})² ] .

The inequality above says that

    M_{t,k} ≤ C² ∫_0^t M_{s,k−1} ds .

Iterating this inequality k times gives

    M_{t,k} ≤ ( (C² t)^k / k! ) max_{0 ≤ s ≤ t} M_{s,0} .

Since Σ_k (C² t)^k / k! = e^{C² t} < ∞, these bounds are summable in k. For
any fixed t, this implies that the X_{t,k} are a Cauchy sequence almost surely
(our Borel Cantelli lemma again).

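The contraction can be watched numerically with a discretized Picard iteration. This is a sketch of the case a = 0 with b(x) = sin(x), which is Lipschitz with C = 1; the grid, seed, and choice of b are only for illustration. Successive differences shrink roughly like the (C²t)^k/k! bound suggests.

```python
import math, random

rng = random.Random(3)
T, n, x0, K = 1.0, 1000, 1.0, 8
dt = T / n
dW = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]   # one fixed Brownian path

def picard_step(X):
    # X_{t,k+1} = x0 + int_0^t sin(X_{s,k}) dW_s, as left-point Ito sums
    Y, acc = [x0], x0
    for i in range(n):
        acc += math.sin(X[i]) * dW[i]
        Y.append(acc)
    return Y

X = [x0] * (n + 1)              # the starting guess X_{t,0} = x0
diffs = []                      # sup_t |X_{t,k+1} - X_{t,k}| at each stage
for k in range(K):
    X_new = picard_step(X)
    diffs.append(max(abs(u - v) for u, v in zip(X_new, X)))
    X = X_new
```

After a handful of iterations the pathwise differences have dropped by more than an order of magnitude, the discrete analog of the factorial decay of M_{t,k}.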